Commit graph

217 commits

Author SHA1 Message Date
Takuya UESHIN 3249528920 [SPARK-2196] [SQL] Fix nullability of CaseWhen.
`CaseWhen` should use `branches.length` to check if `elseValue` is provided or not.

Author: Takuya UESHIN <ueshin@happy-camper.st>

Closes #1133 from ueshin/issues/SPARK-2196 and squashes the following commits:

510f12d [Takuya UESHIN] Add some tests.
dc25e8d [Takuya UESHIN] Fix nullable of CaseWhen to be nullable if the elseValue is nullable.
4f049cc [Takuya UESHIN] Fix nullability of CaseWhen.
2014-06-20 00:12:52 -07:00
Reynold Xin c55bbb49f7 [SPARK-2209][SQL] Cast shouldn't do null check twice.
Also took the chance to clean up cast a little bit. Too many arrows on each line before!

Author: Reynold Xin <rxin@apache.org>

Closes #1143 from rxin/cast and squashes the following commits:

dd006cb [Reynold Xin] Code review feedback.
c2b88ae [Reynold Xin] [SPARK-2209][SQL] Cast shouldn't do null check twice.
2014-06-20 00:01:19 -07:00
Reynold Xin 6175640973 [SPARK-2210] cast to boolean on boolean value gets turned into NOT((boolean_condition) = 0)
```
explain select cast(cast(key=0 as boolean) as boolean) aaa from src
```
should be
```
[Physical execution plan:]
[Project [(key#10:0 = 0) AS aaa#7]]
[ HiveTableScan [key#10], (MetastoreRelation default, src, None), None]
```

However, it is currently
```
[Physical execution plan:]
[Project [NOT((key#10=0) = 0) AS aaa#7]]
[ HiveTableScan [key#10], (MetastoreRelation default, src, None), None]
```

Author: Reynold Xin <rxin@apache.org>

Closes #1144 from rxin/booleancast and squashes the following commits:

c4e543d [Reynold Xin] [SPARK-2210] boolean cast on boolean value should be removed.
2014-06-19 23:58:23 -07:00
Andre Schumacher f479cf3743 SPARK-1293 [SQL] Parquet support for nested types
It should be possible to import and export data stored in Parquet's columnar format that contains nested types. For example:
```java
message AddressBook {
   required binary owner;
   optional group ownerPhoneNumbers {
      repeated binary array;
   }
   optional group contacts {
      repeated group array {
         required binary name;
         optional binary phoneNumber;
      }
   }
   optional group nameToApartmentNumber {
      repeated group map {
         required binary key;
         required int32 value;
      }
   }
}
```
The example could model a type (AddressBook) that contains records made of strings (owner), lists (ownerPhoneNumbers) and a table of contacts (e.g., a list of pairs or a map that can contain null values but keys must not be null). The list of tasks are as follows:

<h6>Implement support for converting nested Parquet types to Spark/Catalyst types:</h6>
- [x] Structs
- [x] Lists
- [x] Maps (note: currently keys need to be Strings)

<h6>Implement import (via ``parquetFile``) of nested Parquet types (first version in this PR)</h6>
- [x] Initial version

<h6>Implement export (via ``saveAsParquetFile``)</h6>
- [x] Initial version

<h6>Test support for AvroParquet, etc.</h6>
- [x] Initial testing of import of avro-generated Parquet data (simple + nested)

Example:
```scala
val data = TestSQLContext
  .parquetFile("input.dir")
  .toSchemaRDD
data.registerAsTable("data")
sql("SELECT owner, contacts[1].name, nameToApartmentNumber['John'] FROM data").collect()
```

Author: Andre Schumacher <andre.schumacher@iki.fi>
Author: Michael Armbrust <michael@databricks.com>

Closes #360 from AndreSchumacher/nested_parquet and squashes the following commits:

30708c8 [Andre Schumacher] Taking out AvroParquet test for now to remove Avro dependency
95c1367 [Andre Schumacher] Changes to ParquetRelation and its metadata
7eceb67 [Andre Schumacher] Review feedback
94eea3a [Andre Schumacher] Scalastyle
403061f [Andre Schumacher] Fixing some issues with tests and schema metadata
b8a8b9a [Andre Schumacher] More fixes to short and byte conversion
63d1b57 [Andre Schumacher] Cleaning up and Scalastyle
88e6bdb [Andre Schumacher] Attempting to fix loss of schema
37e0a0a [Andre Schumacher] Cleaning up
14c3fd8 [Andre Schumacher] Attempting to fix Spark-Parquet schema conversion
3e1456c [Michael Armbrust] WIP: Directly serialize catalyst attributes.
f7aeba3 [Michael Armbrust] [SPARK-1982] Support for ByteType and ShortType.
3104886 [Michael Armbrust] Nested Rows should be Rows, not Seqs.
3c6b25f [Andre Schumacher] Trying to reduce no-op changes wrt master
31465d6 [Andre Schumacher] Scalastyle: fixing commented out bottom
de02538 [Andre Schumacher] Cleaning up ParquetTestData
2f5a805 [Andre Schumacher] Removing stripMargin from test schemas
191bc0d [Andre Schumacher] Changing to Seq for ArrayType, refactoring SQLParser for nested field extension
cbb5793 [Andre Schumacher] Code review feedback
32229c7 [Andre Schumacher] Removing Row nested values and placing by generic types
0ae9376 [Andre Schumacher] Doc strings and simplifying ParquetConverter.scala
a6b4f05 [Andre Schumacher] Cleaning up ArrayConverter, moving classTag to NativeType, adding NativeRow
431f00f [Andre Schumacher] Fixing problems introduced during rebase
c52ff2c [Andre Schumacher] Adding native-array converter
619c397 [Andre Schumacher] Completing Map testcase
79d81d5 [Andre Schumacher] Replacing field names for array and map in WriteSupport
f466ff0 [Andre Schumacher] Added ParquetAvro tests and revised Array conversion
adc1258 [Andre Schumacher] Optimizing imports
e99cc51 [Andre Schumacher] Fixing nested WriteSupport and adding tests
1dc5ac9 [Andre Schumacher] First version of WriteSupport for nested types
d1911dc [Andre Schumacher] Simplifying ArrayType conversion
f777b4b [Andre Schumacher] Scalastyle
824500c [Andre Schumacher] Adding attribute resolution for MapType
b539fde [Andre Schumacher] First commit for MapType
a594aed [Andre Schumacher] Scalastyle
4e25fcb [Andre Schumacher] Adding resolution of complex ArrayTypes
f8f8911 [Andre Schumacher] For primitive rows fall back to more efficient converter, code reorg
6dbc9b7 [Andre Schumacher] Fixing some problems intruduced during rebase
b7fcc35 [Andre Schumacher] Documenting conversions, bugfix, wrappers of Rows
ee70125 [Andre Schumacher] fixing one problem with arrayconverter
98219cf [Andre Schumacher] added struct converter
5d80461 [Andre Schumacher] fixing one problem with nested structs and breaking up files
1b1b3d6 [Andre Schumacher] Fixing one problem with nested arrays
ddb40d2 [Andre Schumacher] Extending tests for nested Parquet data
745a42b [Andre Schumacher] Completing testcase for nested data (Addressbook(
6125c75 [Andre Schumacher] First working nested Parquet record input
4d4892a [Andre Schumacher] First commit nested Parquet read converters
aa688fe [Andre Schumacher] Adding conversion of nested Parquet schemas
2014-06-19 23:47:45 -07:00
Yin Huai f397e92eb2 [SPARK-2177][SQL] describe table result contains only one column
```
scala> hql("describe src").collect().foreach(println)

[key                 	string              	None                ]
[value               	string              	None                ]
```

The result should contain 3 columns instead of one. This screws up JDBC or even the downstream consumer of the Scala/Java/Python APIs.

I am providing a workaround. We handle a subset of describe commands in Spark SQL, which are defined by ...
```
DESCRIBE [EXTENDED] [db_name.]table_name
```
All other cases are treated as Hive native commands.

Also, if we upgrade Hive to 0.13, we need to check the results of context.sessionState.isHiveServerQuery() to determine how to split the result. This method is introduced by https://issues.apache.org/jira/browse/HIVE-4545. We may want to set Hive to use JsonMetaDataFormatter for the output of a DDL statement (`set hive.ddl.output.format=json` introduced by https://issues.apache.org/jira/browse/HIVE-2822).

The link to JIRA: https://issues.apache.org/jira/browse/SPARK-2177

Author: Yin Huai <huai@cse.ohio-state.edu>

Closes #1118 from yhuai/SPARK-2177 and squashes the following commits:

fd2534c [Yin Huai] Merge remote-tracking branch 'upstream/master' into SPARK-2177
b9b9aa5 [Yin Huai] rxin's comments.
e7c4e72 [Yin Huai] Fix unit test.
656b068 [Yin Huai] 100 characters.
6387217 [Yin Huai] Merge remote-tracking branch 'upstream/master' into SPARK-2177
8003cf3 [Yin Huai] Generate strings with the format like Hive for unit tests.
9787fff [Yin Huai] Merge remote-tracking branch 'upstream/master' into SPARK-2177
440c5af [Yin Huai] rxin's comments.
f1a417e [Yin Huai] Update doc.
83adb2f [Yin Huai] Merge remote-tracking branch 'upstream/master' into SPARK-2177
366f891 [Yin Huai] Add describe command.
74bd1d4 [Yin Huai] Merge remote-tracking branch 'upstream/master' into SPARK-2177
342fdf7 [Yin Huai] Split to up to 3 parts.
725e88c [Yin Huai] Merge remote-tracking branch 'upstream/master' into SPARK-2177
bb8bbef [Yin Huai] Split every string in the result of a describe command.
2014-06-19 23:41:38 -07:00
Michael Armbrust d3b7671c1f [SQL] Improve Speed of InsertIntoHiveTable
Author: Michael Armbrust <michael@databricks.com>

Closes #1130 from marmbrus/noFunctional and squashes the following commits:

ccdb68c [Michael Armbrust] Remove functional programming and Array allocations from fast path in InsertIntoHiveTable.
2014-06-19 23:39:03 -07:00
Reynold Xin 278ec8a203 More minor scaladoc cleanup for Spark SQL.
Author: Reynold Xin <rxin@apache.org>

Closes #1142 from rxin/sqlclean and squashes the following commits:

67a789e [Reynold Xin] More minor scaladoc cleanup for Spark SQL.
2014-06-19 22:34:21 -07:00
Reynold Xin 5464e79175 A few minor Spark SQL Scaladoc fixes.
Author: Reynold Xin <rxin@apache.org>

Closes #1139 from rxin/sparksqldoc and squashes the following commits:

c3049d8 [Reynold Xin] Fixed line length.
66dc72c [Reynold Xin] A few minor Spark SQL Scaladoc fixes.
2014-06-19 18:24:05 -07:00
Michael Armbrust 777c5958c4 [SPARK-2191][SQL] Make sure InsertIntoHiveTable doesn't execute more than once.
Author: Michael Armbrust <michael@databricks.com>

Closes #1129 from marmbrus/doubleCreateAs and squashes the following commits:

9c6d9e4 [Michael Armbrust] Fix typo.
5128fe2 [Michael Armbrust] Make sure InsertIntoHiveTable doesn't execute each time you ask for its result.
2014-06-19 14:14:03 -07:00
Reynold Xin 640c294369 [SPARK-2187] Explain should not run the optimizer twice.
@yhuai @marmbrus @concretevitamin

Author: Reynold Xin <rxin@apache.org>

Closes #1123 from rxin/explain and squashes the following commits:

def83b0 [Reynold Xin] Update unit tests for explain.
a9d3ba8 [Reynold Xin] [SPARK-2187] Explain should not run the optimizer twice.
2014-06-18 22:44:12 -07:00
Michael Armbrust 5ff75c748a [SPARK-2184][SQL] AddExchange isn't idempotent
...redPartitioning.

Author: Michael Armbrust <michael@databricks.com>

Closes #1122 from marmbrus/fixAddExchange and squashes the following commits:

3417537 [Michael Armbrust] Don't bind partitioning expressions as that breaks comparison with requiredPartitioning.
2014-06-18 17:52:42 -07:00
Yin Huai 587d32012c [SPARK-2176][SQL] Extra unnecessary exchange operator in the result of an explain command
```
hql("explain select * from src group by key").collect().foreach(println)

[ExplainCommand [plan#27:0]]
[ Aggregate false, [key#25], [key#25,value#26]]
[  Exchange (HashPartitioning [key#25:0], 200)]
[   Exchange (HashPartitioning [key#25:0], 200)]
[    Aggregate true, [key#25], [key#25]]
[     HiveTableScan [key#25,value#26], (MetastoreRelation default, src, None), None]
```

There are two exchange operators.

However, if we do not use explain...
```
hql("select * from src group by key")

res4: org.apache.spark.sql.SchemaRDD =
SchemaRDD[8] at RDD at SchemaRDD.scala:100
== Query Plan ==
Aggregate false, [key#8], [key#8,value#9]
 Exchange (HashPartitioning [key#8:0], 200)
  Aggregate true, [key#8], [key#8]
   HiveTableScan [key#8,value#9], (MetastoreRelation default, src, None), None
```
The plan is fine.

The cause of this bug is explained below.

When we create an `execution.ExplainCommand`, we use the `executedPlan` as the child of this `ExplainCommand`. But, this `executedPlan` is prepared for execution again when we generate the `executedPlan` for the `ExplainCommand`. Basically, `prepareForExecution` is called twice on a physical plan. Because after `prepareForExecution` we have already bounded those references (in `BoundReference`s), `AddExchange` cannot figure out we are using the same partitioning (we use `AttributeReference`s to create an `ExchangeOperator` and then those references will be changed to `BoundReference`s after `prepareForExecution` is called). So, an extra `ExchangeOperator` is inserted.

I think in `CommandStrategy`, we should just use the `sparkPlan` (`sparkPlan` is the input of `prepareForExecution`) to initialize the `ExplainCommand` instead of using `executedPlan`.

The link to JIRA: https://issues.apache.org/jira/browse/SPARK-2176

Author: Yin Huai <huai@cse.ohio-state.edu>

Closes #1116 from yhuai/SPARK-2176 and squashes the following commits:

197c19c [Yin Huai] Use sparkPlan to initialize a Physical Explain Command instead of using executedPlan.
2014-06-18 10:51:32 -07:00
Yin Huai d2f4f30b12 [SPARK-2060][SQL] Querying JSON Datasets with SQL and DSL in Spark SQL
JIRA: https://issues.apache.org/jira/browse/SPARK-2060

Programming guide: http://yhuai.github.io/site/sql-programming-guide.html

Scala doc of SQLContext: http://yhuai.github.io/site/api/scala/index.html#org.apache.spark.sql.SQLContext

Author: Yin Huai <huai@cse.ohio-state.edu>

Closes #999 from yhuai/newJson and squashes the following commits:

227e89e [Yin Huai] Merge remote-tracking branch 'upstream/master' into newJson
ce8eedd [Yin Huai] rxin's comments.
bc9ac51 [Yin Huai] Merge remote-tracking branch 'upstream/master' into newJson
94ffdaa [Yin Huai] Remove "get" from method names.
ce31c81 [Yin Huai] Merge remote-tracking branch 'upstream/master' into newJson
e2773a6 [Yin Huai] Merge remote-tracking branch 'upstream/master' into newJson
79ea9ba [Yin Huai] Fix typos.
5428451 [Yin Huai] Newline
1f908ce [Yin Huai] Remove extra line.
d7a005c [Yin Huai] Merge remote-tracking branch 'upstream/master' into newJson
7ea750e [Yin Huai] marmbrus's comments.
6a5f5ef [Yin Huai] Merge remote-tracking branch 'upstream/master' into newJson
83013fb [Yin Huai] Update Java Example.
e7a6c19 [Yin Huai] SchemaRDD.javaToPython should convert a field with the StructType to a Map.
6d20b85 [Yin Huai] Merge remote-tracking branch 'upstream/master' into newJson
4fbddf0 [Yin Huai] Programming guide.
9df8c5a [Yin Huai] Python API.
7027634 [Yin Huai] Java API.
cff84cc [Yin Huai] Use a SchemaRDD for a JSON dataset.
d0bd412 [Yin Huai] Merge remote-tracking branch 'upstream/master' into newJson
ab810b0 [Yin Huai] Make JsonRDD private.
6df0891 [Yin Huai] Apache header.
8347f2e [Yin Huai] Merge remote-tracking branch 'upstream/master' into newJson
66f9e76 [Yin Huai] Update docs and use the entire dataset to infer the schema.
8ffed79 [Yin Huai] Update the example.
a5a4b52 [Yin Huai] Merge remote-tracking branch 'upstream/master' into newJson
4325475 [Yin Huai] If a sampled dataset is used for schema inferring, update the schema of the JsonTable after first execution.
65b87f0 [Yin Huai] Fix sampling...
8846af5 [Yin Huai] API doc.
52a2275 [Yin Huai] Merge remote-tracking branch 'upstream/master' into newJson
0387523 [Yin Huai] Address PR comments.
666b957 [Yin Huai] Merge remote-tracking branch 'upstream/master' into newJson
a2313a6 [Yin Huai] Address PR comments.
f3ce176 [Yin Huai] After type conflict resolution, if a NullType is found, StringType is used.
0576406 [Yin Huai] Add Apache license header.
af91b23 [Yin Huai] Merge remote-tracking branch 'upstream/master' into newJson
f45583b [Yin Huai] Infer the schema of a JSON dataset (a text file with one JSON object per line or a RDD[String] with one JSON object per string) and returns a SchemaRDD.
f31065f [Yin Huai] A query plan or a SchemaRDD can print out its schema.
2014-06-17 19:14:59 -07:00
Zongheng Yang e243c5ffac [SPARK-2053][SQL] Add Catalyst expressions for CASE WHEN.
JIRA ticket: https://issues.apache.org/jira/browse/SPARK-2053

This PR adds support for two types of CASE statements present in Hive. The first type is of the form `CASE WHEN a THEN b [WHEN c THEN d]* [ELSE e] END`, with the semantics like a chain of if statements. The second type is of the form `CASE a WHEN b THEN c [WHEN d THEN e]* [ELSE f] END`, with the semantics like a switch statement on key `a`. Both forms are implemented in `CaseWhen`.

[This link](https://cwiki.apache.org/confluence/display/Hive/LanguageManual+UDF#LanguageManualUDF-ConditionalFunctions) contains more detailed descriptions on their semantics.

Notes / Open issues:

* Please check if any implicit contracts / invariants are broken in the implementations (especially for the operators). I am not very familiar with them and I currently find them tricky to spot.
* We should decide whether or not a non-boolean condition is allowed in a branch of `CaseWhen`. Hive throws a `SemanticException` for this situation and I think it'd be good to mimic it -- the question is where in the whole Spark SQL pipeline should we signal an exception for such a query.

Author: Zongheng Yang <zongheng.y@gmail.com>

Closes #1055 from concretevitamin/caseWhen and squashes the following commits:

4226eb9 [Zongheng Yang] Comment.
79d26fc [Zongheng Yang] Merge branch 'master' into caseWhen
caf9383 [Zongheng Yang] Update a FIXME.
9d26ab8 [Zongheng Yang] Add @transient marker.
788a0d9 [Zongheng Yang] Implement CastNulls, which fixes udf_case and udf_when.
7ef284f [Zongheng Yang] Refactors: remove redundant passes, improve toString, mark transient.
f47ae7b [Zongheng Yang] Modify queries in tests to have shorter golden files.
1c1fbfc [Zongheng Yang] Cleanups per review comments.
7d2b7e2 [Zongheng Yang] Translate CaseKeyWhen to CaseWhen at parsing time.
47d406a [Zongheng Yang] Do toArray once and lazily outside of eval().
bb3d109 [Zongheng Yang] Update scaladoc of a method.
aea3195 [Zongheng Yang] Fix bug that branchesArr is not used; remove unused import.
96870a8 [Zongheng Yang] Turn off scalastyle for some comments.
7392f3a [Zongheng Yang] Minor cleanup.
2cf08bb [Zongheng Yang] Merge branch 'master' into caseWhen
9f84b40 [Zongheng Yang] Add golden outputs from Hive.
db51a85 [Zongheng Yang] Add allCondBooleans check; uncomment tests.
3f9ef0a [Zongheng Yang] Cleanups and bug fixes (mainly in eval() and resolved).
be54bc8 [Zongheng Yang] Rewrite eval() to a low-level implementation. Separate two CASE stmts.
f2bcb9d [Zongheng Yang] WIP
5906f75 [Zongheng Yang] WIP
efd019b [Zongheng Yang] eval() and toString() bug fixes.
7d81e95 [Zongheng Yang] Clean up resolved.
a31d782 [Zongheng Yang] Finish up Case.
2014-06-17 13:30:17 +02:00
Xi Liu f5a4049e53 [SPARK-2164][SQL] Allow Hive UDF on columns of type struct
Author: Xi Liu <xil@conviva.com>

Closes #796 from xiliu82/sqlbug and squashes the following commits:

328dfc4 [Xi Liu] [Spark SQL] remove a temporary function after test
354386a [Xi Liu] [Spark SQL] add test suite for UDF on struct
8fc6f51 [Xi Liu] [SparkSQL] allow UDF on struct
2014-06-17 13:16:02 +02:00
Cheng Lian 237b96bc59 Minor fix: made "EXPLAIN" output to play well with JDBC output format
Fixed the broken JDBC output. Test from Shark `beeline`:

```
beeline> !connect jdbc:hive2://localhost:10000/
scan complete in 2ms
Connecting to jdbc:hive2://localhost:10000/
Enter username for jdbc:hive2://localhost:10000/: lian
Enter password for jdbc:hive2://localhost:10000/:
Connected to: Hive (version 0.12.0)
Driver: Hive (version 0.12.0)
Transaction isolation: TRANSACTION_REPEATABLE_READ
0: jdbc:hive2://localhost:10000/>
0: jdbc:hive2://localhost:10000/> explain select * from src;
+-------------------------------------------------------------------------------+
|                                     plan                                      |
+-------------------------------------------------------------------------------+
| ExplainCommand [plan#2:0]                                                     |
|  HiveTableScan [key#0,value#1], (MetastoreRelation default, src, None), None  |
+-------------------------------------------------------------------------------+
2 rows selected (1.386 seconds)
```

Before this change, the output looked something like this:

```
+-------------------------------------------------------------------------------+
|                                     plan                                      |
+-------------------------------------------------------------------------------+
| ExplainCommand [plan#2:0]
 HiveTableScan [key#0,value#1], (MetastoreRelation default, src, None), None  |
+-------------------------------------------------------------------------------+
```

Author: Cheng Lian <lian.cs.zju@gmail.com>

Closes #1097 from liancheng/multiLineExplain and squashes the following commits:

eb37967 [Cheng Lian] Made output of "EXPLAIN" play well with JDBC output format
2014-06-16 16:42:17 -07:00
Cheng Lian 273afcb254 [SQL][SPARK-2094] Follow up of PR #1071 for Java API
Updated `JavaSQLContext` and `JavaHiveContext` similar to what we've done to `SQLContext` and `HiveContext` in PR #1071. Added corresponding test case for Spark SQL Java API.

Author: Cheng Lian <lian.cs.zju@gmail.com>

Closes #1085 from liancheng/spark-2094-java and squashes the following commits:

29b8a51 [Cheng Lian] Avoided instantiating JavaSparkContext & JavaHiveContext to workaround test failure
92bb4fb [Cheng Lian] Marked test cases in JavaHiveQLSuite with "ignore"
22aec97 [Cheng Lian] Follow up of PR #1071 for Java API
2014-06-16 21:32:51 +02:00
Kan Zhang 4fdb491775 [SPARK-2010] Support for nested data in PySpark SQL
JIRA issue https://issues.apache.org/jira/browse/SPARK-2010

This PR adds support for nested collection types in PySpark SQL, including
array, dict, list, set, and tuple. Example,

```
>>> from array import array
>>> from pyspark.sql import SQLContext
>>> sqlCtx = SQLContext(sc)
>>> rdd = sc.parallelize([
...         {"f1" : array('i', [1, 2]), "f2" : {"row1" : 1.0}},
...         {"f1" : array('i', [2, 3]), "f2" : {"row2" : 2.0}}])
>>> srdd = sqlCtx.inferSchema(rdd)
>>> srdd.collect() == [{"f1" : array('i', [1, 2]), "f2" : {"row1" : 1.0}},
...                    {"f1" : array('i', [2, 3]), "f2" : {"row2" : 2.0}}]
True
>>> rdd = sc.parallelize([
...         {"f1" : [[1, 2], [2, 3]], "f2" : set([1, 2]), "f3" : (1, 2)},
...         {"f1" : [[2, 3], [3, 4]], "f2" : set([2, 3]), "f3" : (2, 3)}])
>>> srdd = sqlCtx.inferSchema(rdd)
>>> srdd.collect() == \
... [{"f1" : [[1, 2], [2, 3]], "f2" : set([1, 2]), "f3" : (1, 2)},
...  {"f1" : [[2, 3], [3, 4]], "f2" : set([2, 3]), "f3" : (2, 3)}]
True
```

Author: Kan Zhang <kzhang@apache.org>

Closes #1041 from kanzhang/SPARK-2010 and squashes the following commits:

1b2891d [Kan Zhang] [SPARK-2010] minor doc change and adding a TODO
504f27e [Kan Zhang] [SPARK-2010] Support for nested data in PySpark SQL
2014-06-16 11:11:29 -07:00
Michael Armbrust 269fc62b20 [SQL] Support transforming TreeNodes with Option children.
Thanks goes to @marmbrus for his implementation.

Author: Michael Armbrust <michael@databricks.com>
Author: Zongheng Yang <zongheng.y@gmail.com>

Closes #1074 from concretevitamin/option-treenode and squashes the following commits:

ef27b85 [Zongheng Yang] Merge pull request #1 from marmbrus/pr/1074
73133c2 [Michael Armbrust] TreeNodes can't be inner classes.
ab78420 [Zongheng Yang] Add a test.
2ccb721 [Michael Armbrust] Add support for transformation of optional children.
2014-06-15 11:28:34 +02:00
Kan Zhang 2550533a28 [SPARK-2079] Support batching when serializing SchemaRDD to Python
Added batching with default batch size 10 in SchemaRDD.javaToPython

Author: Kan Zhang <kzhang@apache.org>

Closes #1023 from kanzhang/SPARK-2079 and squashes the following commits:

2d1915e [Kan Zhang] [SPARK-2079] Add batching in SchemaRDD.javaToPython
19b0c09 [Kan Zhang] [SPARK-2079] Removing unnecessary wrapping in SchemaRDD.javaToPython
2014-06-14 13:17:22 -07:00
Yin Huai 8919685091 [Spark-2137][SQL] Timestamp UDFs broken
https://issues.apache.org/jira/browse/SPARK-2137

Author: Yin Huai <huai@cse.ohio-state.edu>

Closes #1081 from yhuai/SPARK-2137 and squashes the following commits:

c04f910 [Yin Huai] Merge remote-tracking branch 'upstream/master' into SPARK-2137
205f17b [Yin Huai] Make Hive UDF wrapper support Timestamp.
2014-06-13 23:28:57 -07:00
Cheng Lian ac96d9657c [SPARK-2094][SQL] "Exactly once" semantics for DDL and command statements
## Related JIRA issues

- Main issue:

  - [SPARK-2094](https://issues.apache.org/jira/browse/SPARK-2094): Ensure exactly once semantics for DDL/Commands

- Issues resolved as dependencies:

  - [SPARK-2081](https://issues.apache.org/jira/browse/SPARK-2081): Undefine output() from the abstract class Command and implement it in concrete subclasses
  - [SPARK-2128](https://issues.apache.org/jira/browse/SPARK-2128): No plan for DESCRIBE
  - [SPARK-1852](https://issues.apache.org/jira/browse/SPARK-1852): SparkSQL Queries with Sorts run before the user asks them to

- Other related issue:

  - [SPARK-2129](https://issues.apache.org/jira/browse/SPARK-2129): NPE thrown while lookup a view

    Two test cases, `join_view` and `mergejoin_mixed`, within the `HiveCompatibilitySuite` are removed from the whitelist to workaround this issue.

## PR Overview

This PR defines physical plans for DDL statements and commands and wraps their side effects in a lazy field `PhysicalCommand.sideEffectResult`, so that they are executed eagerly and exactly once.  Also, as a positive side effect, now DDL statements and commands can be turned into proper `SchemaRDD`s and let user query the execution results.

This PR defines schemas for the following DDL/commands:

- EXPLAIN command

  - `plan`: String, the plan explanation

- SET command

  - `key`: String, the key(s) of the propert(y/ies) being set or queried
  - `value`: String, the value(s) of the propert(y/ies) being queried

- Other Hive native command

  - `result`: String, execution result returned by Hive

  **NOTE**: We should refine schemas for different native commands by defining physical plans for them in the future.

## Examples

### EXPLAIN command

Take the "EXPLAIN" command as an example, we first execute the command and obtain a `SchemaRDD` at the same time, then query the `plan` field with the schema DSL:

```
scala> loadTestTable("src")
...

scala> val q0 = hql("EXPLAIN SELECT key, COUNT(*) FROM src GROUP BY key")
...
q0: org.apache.spark.sql.SchemaRDD =
SchemaRDD[0] at RDD at SchemaRDD.scala:98
== Query Plan ==
ExplainCommandPhysical [plan#11:0]
 Aggregate false, [key#4], [key#4,SUM(PartialCount#6L) AS c_1#2L]
  Exchange (HashPartitioning [key#4:0], 200)
   Exchange (HashPartitioning [key#4:0], 200)
    Aggregate true, [key#4], [key#4,COUNT(1) AS PartialCount#6L]
     HiveTableScan [key#4], (MetastoreRelation default, src, None), None

scala> q0.select('plan).collect()
...
[ExplainCommandPhysical [plan#24:0]
 Aggregate false, [key#17], [key#17,SUM(PartialCount#19L) AS c_1#2L]
  Exchange (HashPartitioning [key#17:0], 200)
   Exchange (HashPartitioning [key#17:0], 200)
    Aggregate true, [key#17], [key#17,COUNT(1) AS PartialCount#19L]
     HiveTableScan [key#17], (MetastoreRelation default, src, None), None]

scala>
```

### SET command

In this example we query all the properties set in `SQLConf`, register the result as a table, and then query the table with HiveQL:

```
scala> val q1 = hql("SET")
...
q1: org.apache.spark.sql.SchemaRDD =
SchemaRDD[7] at RDD at SchemaRDD.scala:98
== Query Plan ==
<SET command: executed by Hive, and noted by SQLContext>

scala> q1.registerAsTable("properties")

scala> hql("SELECT key, value FROM properties ORDER BY key LIMIT 10").foreach(println)
...
== Query Plan ==
TakeOrdered 10, [key#51:0 ASC]
 Project [key#51:0,value#52:1]
  SetCommandPhysical None, None, [key#55:0,value#56:1]), which has no missing parents
14/06/12 12:19:27 INFO scheduler.DAGScheduler: Submitting 1 missing tasks from Stage 5 (SchemaRDD[21] at RDD at SchemaRDD.scala:98
== Query Plan ==
TakeOrdered 10, [key#51:0 ASC]
 Project [key#51:0,value#52:1]
  SetCommandPhysical None, None, [key#55:0,value#56:1])
...
[datanucleus.autoCreateSchema,true]
[datanucleus.autoStartMechanismMode,checked]
[datanucleus.cache.level2,false]
[datanucleus.cache.level2.type,none]
[datanucleus.connectionPoolingType,BONECP]
[datanucleus.fixedDatastore,false]
[datanucleus.identifierFactory,datanucleus1]
[datanucleus.plugin.pluginRegistryBundleCheck,LOG]
[datanucleus.rdbms.useLegacyNativeValueStrategy,true]
[datanucleus.storeManagerType,rdbms]

scala>
```

### "Exactly once" semantics

At last, an example of the "exactly once" semantics:

```
scala> val q2 = hql("CREATE TABLE t1(key INT, value STRING)")
...
q2: org.apache.spark.sql.SchemaRDD =
SchemaRDD[28] at RDD at SchemaRDD.scala:98
== Query Plan ==
<Native command: executed by Hive>

scala> table("t1")
...
res9: org.apache.spark.sql.SchemaRDD =
SchemaRDD[32] at RDD at SchemaRDD.scala:98
== Query Plan ==
HiveTableScan [key#58,value#59], (MetastoreRelation default, t1, None), None

scala> q2.collect()
...
res10: Array[org.apache.spark.sql.Row] = Array([])

scala>
```

As we can see, the "CREATE TABLE" command is executed eagerly right after the `SchemaRDD` is created, and referencing the `SchemaRDD` again won't trigger a duplicated execution.

Author: Cheng Lian <lian.cs.zju@gmail.com>

Closes #1071 from liancheng/exactlyOnceCommand and squashes the following commits:

d005b03 [Cheng Lian] Made "SET key=value" returns the newly set key value pair
f6c7715 [Cheng Lian] Added test cases for DDL/command statement RDDs
1d00937 [Cheng Lian] Makes SchemaRDD DSLs work for DDL/command statement RDDs
5c7e680 [Cheng Lian] Bug fix: wrong type used in pattern matching
48aa2e5 [Cheng Lian] Refined SQLContext.emptyResult as an empty RDD[Row]
cc64f32 [Cheng Lian] Renamed physical plan classes for DDL/commands
74789c1 [Cheng Lian] Fixed failing test cases
0ad343a [Cheng Lian] Added physical plan for DDL and commands to ensure the "exactly once" semantics
2014-06-13 12:59:48 -07:00
Michael Armbrust 1c2fd015b0 [SPARK-1964][SQL] Add timestamp to HiveMetastoreTypes.toMetastoreType
Author: Michael Armbrust <michael@databricks.com>

Closes #1061 from marmbrus/timestamp and squashes the following commits:

79c3903 [Michael Armbrust] Add timestamp to HiveMetastoreTypes.toMetastoreType()
2014-06-13 12:55:15 -07:00
Michael Armbrust 13f8cfdc04 [SPARK-2135][SQL] Use planner for in-memory scans
Author: Michael Armbrust <michael@databricks.com>

Closes #1072 from marmbrus/cachedStars and squashes the following commits:

8757c8e [Michael Armbrust] Use planner for in-memory scans.
2014-06-12 23:09:41 -07:00
Takuya UESHIN 9a2448daf9 [SPARK-2052] [SQL] Add optimization for CaseConversionExpression's.
Add optimization for `CaseConversionExpression`'s.

Author: Takuya UESHIN <ueshin@happy-camper.st>

Closes #990 from ueshin/issues/SPARK-2052 and squashes the following commits:

2568666 [Takuya UESHIN] Move some rules back.
dde7ede [Takuya UESHIN] Add tests to check if ConstantFolding can handle null literals and remove the unneeded rules from NullPropagation.
c4eea67 [Takuya UESHIN] Fix toString methods.
23e2363 [Takuya UESHIN] Make CaseConversionExpressions foldable if the child is foldable.
0ff7568 [Takuya UESHIN] Add tests for collapsing case statements.
3977d80 [Takuya UESHIN] Add optimization for CaseConversionExpression's.
2014-06-11 17:58:35 -07:00
Patrick Wendell d45e0c6b98 HOTFIX: Forgot to remove false change in previous commit 2014-06-11 15:55:41 -07:00
Patrick Wendell 14e6dc94f6 HOTFIX: PySpark tests should be order insensitive.
This has been messing up the SQL PySpark tests on Jenkins.

Author: Patrick Wendell <pwendell@gmail.com>

Closes #1054 from pwendell/pyspark and squashes the following commits:

1eb5487 [Patrick Wendell] False change
06f062d [Patrick Wendell] HOTFIX: PySpark tests should be order insensitive
2014-06-11 15:54:41 -07:00
Daoyuan ce6deb1e5b [SQL] Code Cleanup: Left Semi Hash Join
Some improvement for PR #837, add another case to white list and use `filter` to build result iterator.

Author: Daoyuan <daoyuan.wang@intel.com>

Closes #1049 from adrian-wang/clean-LeftSemiJoinHash and squashes the following commits:

b314d5a [Daoyuan] change hashSet name
27579a9 [Daoyuan] add semijoin to white list and use filter to create new iterator in LeftSemiJoinBNL

Signed-off-by: Michael Armbrust <michael@databricks.com>
2014-06-11 12:09:42 -07:00
Sameer Agarwal 4107cce58c [SPARK-2042] Prevent unnecessary shuffle triggered by take()
This PR implements `take()` on a `SchemaRDD` by inserting a logical limit that is followed by a `collect()`. This is also accompanied by adding a catalyst optimizer rule for collapsing adjacent limits. Doing so prevents an unnecessary shuffle that is sometimes triggered by `take()`.

Author: Sameer Agarwal <sameer@databricks.com>

Closes #1048 from sameeragarwal/master and squashes the following commits:

3eeb848 [Sameer Agarwal] Fixing Tests
1b76ff1 [Sameer Agarwal] Deprecating limit(limitExpr: Expression) in v1.1.0
b723ac4 [Sameer Agarwal] Added limit folding tests
a0ff7c4 [Sameer Agarwal] Adding catalyst rule to fold two consecutive limits
8d42d03 [Sameer Agarwal] Implement trigger() as limit() followed by collect()
2014-06-11 12:01:04 -07:00
Qiuzhuang.Lian 6e11930310 SPARK-2107: FilterPushdownSuite doesn't need Junit jar.
Author: Qiuzhuang.Lian <Qiuzhuang.Lian@gmail.com>

Closes #1046 from Qiuzhuang/master and squashes the following commits:

0a9921a [Qiuzhuang.Lian] SPARK-2107: FilterPushdownSuite doesn't need Junit jar.
2014-06-11 00:36:06 -07:00
Cheng Lian 0266a0c8a7 [SPARK-1968][SQL] SQL/HiveQL command for caching/uncaching tables
JIRA issue: [SPARK-1968](https://issues.apache.org/jira/browse/SPARK-1968)

This PR added support for SQL/HiveQL command for caching/uncaching tables:

```
scala> sql("CACHE TABLE src")
...
res0: org.apache.spark.sql.SchemaRDD =
SchemaRDD[0] at RDD at SchemaRDD.scala:98
== Query Plan ==
CacheCommandPhysical src, true

scala> table("src")
...
res1: org.apache.spark.sql.SchemaRDD =
SchemaRDD[3] at RDD at SchemaRDD.scala:98
== Query Plan ==
InMemoryColumnarTableScan [key#0,value#1], (HiveTableScan [key#0,value#1], (MetastoreRelation default, src, None), None), false

scala> isCached("src")
res2: Boolean = true

scala> sql("CACHE TABLE src")
...
res3: org.apache.spark.sql.SchemaRDD =
SchemaRDD[4] at RDD at SchemaRDD.scala:98
== Query Plan ==
CacheCommandPhysical src, false

scala> table("src")
...
res4: org.apache.spark.sql.SchemaRDD =
SchemaRDD[11] at RDD at SchemaRDD.scala:98
== Query Plan ==
HiveTableScan [key#2,value#3], (MetastoreRelation default, src, None), None

scala> isCached("src")
res5: Boolean = false
```

Things also work for `hql`.

Author: Cheng Lian <lian.cs.zju@gmail.com>

Closes #1038 from liancheng/sqlCacheTable and squashes the following commits:

ecb7194 [Cheng Lian] Trimmed the SQL string before parsing special commands
6f4ce42 [Cheng Lian] Moved logical command classes to a separate file
3458a24 [Cheng Lian] Added comment for public API
f0ffacc [Cheng Lian] Added isCached() predicate
15ec6d2 [Cheng Lian] Added "(UN)CACHE TABLE" SQL/HiveQL statements
2014-06-11 00:06:50 -07:00
Takuya UESHIN 0402bd77ec [SPARK-2093] [SQL] NullPropagation should use exact type value.
`NullPropagation` should use exact type value when transform `Count` or `Sum`.

Author: Takuya UESHIN <ueshin@happy-camper.st>

Closes #1034 from ueshin/issues/SPARK-2093 and squashes the following commits:

65b6ff1 [Takuya UESHIN] Modify the literal value of the result of transformation from Sum to long value.
830c20b [Takuya UESHIN] Add Cast to the result of transformation from Count.
9314806 [Takuya UESHIN] Fix NullPropagation to use exact type value.
2014-06-10 23:13:48 -07:00
Zongheng Yang 601032f5bf HOTFIX: clear() configs in SQLConf-related unit tests.
Thanks goes to @liancheng, who pointed out that `sql/test-only *.SQLConfSuite *.SQLQuerySuite` passed but `sql/test-only *.SQLQuerySuite *.SQLConfSuite` failed. The reason is that some tests use the same test keys and without clear()'ing, they get carried over to other tests. This hotfix simply adds some `clear()` calls.

This problem was not evident on Jenkins before probably because `parallelExecution` is not set to `false` for `sqlCoreSettings`.

Author: Zongheng Yang <zongheng.y@gmail.com>

Closes #1040 from concretevitamin/sqlconf-tests and squashes the following commits:

6d14ceb [Zongheng Yang] HOTFIX: clear() confs in SQLConf related unit tests.
2014-06-10 21:59:01 -07:00
egraldlo 1abbde0e89 [SQL] Add average overflow test case from #978
By @egraldlo.

Author: egraldlo <egraldlo@gmail.com>
Author: Michael Armbrust <michael@databricks.com>

Closes #1033 from marmbrus/pr/978 and squashes the following commits:

e228c5e [Michael Armbrust] Remove "test".
762aeaf [Michael Armbrust] Remove unneeded rule. More descriptive name for test table.
d414cd7 [egraldlo] fommatting issues
1153f75 [egraldlo] do best to avoid overflowing in function avg().
2014-06-10 14:07:55 -07:00
Cheng Hao db0c038a66 [SPARK-2076][SQL] Pushdown the join filter & predication for outer join
As the rule described in https://cwiki.apache.org/confluence/display/Hive/OuterJoinBehavior, we can optimize the SQL Join by pushing down the Join predicate and Where predicate.

Author: Cheng Hao <hao.cheng@intel.com>

Closes #1015 from chenghao-intel/join_predicate_push_down and squashes the following commits:

10feff9 [Cheng Hao] fix bug of changing the join type in PredicatePushDownThroughJoin
44c6700 [Cheng Hao] Add logical to support pushdown the join filter
0bce426 [Cheng Hao] Pushdown the join filter & predicate for outer join
2014-06-10 12:59:52 -07:00
Cheng Lian a9a461c594 Moved hiveOperators.scala to the right package folder
The package is `org.apache.spark.sql.hive.execution`, while the file was placed under `sql/hive/src/main/scala/org/apache/spark/sql/hive/`.

Author: Cheng Lian <lian.cs.zju@gmail.com>

Closes #1029 from liancheng/moveHiveOperators and squashes the following commits:

d632eb8 [Cheng Lian] Moved hiveOperators.scala to the right package folder
2014-06-10 01:14:44 -07:00
Zongheng Yang 08ed9ad813 [SPARK-1508][SQL] Add SQLConf to SQLContext.
This PR (1) introduces a new class SQLConf that stores key-value properties for a SQLContext (2) clean up the semantics of various forms of SET commands.

The SQLConf class unlocks user-controllable optimization opportunities; for example, user can now override the number of partitions used during an Exchange. A SQLConf can be accessed and modified programmatically through its getters and setters. It can also be modified through SET commands executed by `sql()` or `hql()`. Note that users now have the ability to change a particular property for different queries inside the same Spark job, unlike settings configured in SparkConf.

For SET commands: "SET" will return all properties currently set in a SQLConf, "SET key" will return the key-value pair (if set) or an undefined message, and "SET key=value" will call the setter on SQLConf, and if a HiveContext is used, it will be executed in Hive as well.

Author: Zongheng Yang <zongheng.y@gmail.com>

Closes #956 from concretevitamin/sqlconf and squashes the following commits:

4968c11 [Zongheng Yang] Very minor cleanup.
d74dde5 [Zongheng Yang] Remove the redundant mkQueryExecution() method.
c129b86 [Zongheng Yang] Merge remote-tracking branch 'upstream/master' into sqlconf
26c40eb [Zongheng Yang] Make SQLConf a trait and have SQLContext mix it in.
dd19666 [Zongheng Yang] Update a comment.
baa5d29 [Zongheng Yang] Remove default param for shuffle partitions accessor.
5f7e6d8 [Zongheng Yang] Add default num partitions.
22d9ed7 [Zongheng Yang] Fix output() of Set physical. Add SQLConf param accessor method.
e9856c4 [Zongheng Yang] Use java.util.Collections.synchronizedMap on a Java HashMap.
88dd0c8 [Zongheng Yang] Remove redundant SET Keyword.
271f0b1 [Zongheng Yang] Minor change.
f8983d1 [Zongheng Yang] Minor changes per review comments.
1ce8a5e [Zongheng Yang] Invoke runSqlHive() in SQLConf#get for the HiveContext case.
b766af9 [Zongheng Yang] Remove a test.
d52e1bd [Zongheng Yang] De-hardcode number of shuffle partitions for BasicOperators (read from SQLConf).
555599c [Zongheng Yang] Bullet-proof (relatively) parsing SET per review comment.
c2067e8 [Zongheng Yang] Mark SQLContext transient and put it in a second param list.
2ea8cdc [Zongheng Yang] Wrap long line.
41d7f09 [Zongheng Yang] Fix imports.
13279e6 [Zongheng Yang] Refactor the logic of eagerly processing SET commands.
b14b83e [Zongheng Yang] In a HiveContext, make SQLConf a subset of HiveConf.
6983180 [Zongheng Yang] Move a SET test to SQLQuerySuite and make it complete.
5b67985 [Zongheng Yang] New line at EOF.
c651797 [Zongheng Yang] Add commands.scala.
efd82db [Zongheng Yang] Clean up semantics of several cases of SET.
c1017c2 [Zongheng Yang] WIP in changing SetCommand to take two Options (for different semantics of SETs).
0f00d86 [Zongheng Yang] Add a test for singleton set command in SQL.
41acd75 [Zongheng Yang] Add a test for hql() in HiveQuerySuite.
2276929 [Zongheng Yang] Fix default hive result for set commands in HiveComparisonTest.
3b0c71b [Zongheng Yang] Remove Parser for set commands. A few other fixes.
d0c4578 [Zongheng Yang] Tmux typo.
0ecea46 [Zongheng Yang] Changes for HiveQl and HiveContext.
ce22d80 [Zongheng Yang] Fix parsing issues.
cb722c1 [Zongheng Yang] Finish up SQLConf patch.
4ebf362 [Zongheng Yang] First cut at SQLConf inside SQLContext.
2014-06-10 00:49:09 -07:00
Zongheng Yang a9ec033c8c [SPARK-1704][SQL] Fully support EXPLAIN commands as SchemaRDD.
This PR attempts to resolve [SPARK-1704](https://issues.apache.org/jira/browse/SPARK-1704) by introducing a physical plan for EXPLAIN commands, which just prints out the debug string (containing various SparkSQL's plans) of the corresponding QueryExecution for the actual query.

Author: Zongheng Yang <zongheng.y@gmail.com>

Closes #1003 from concretevitamin/explain-cmd and squashes the following commits:

5b7911f [Zongheng Yang] Add a regression test.
1bfa379 [Zongheng Yang] Modify output().
719ada9 [Zongheng Yang] Override otherCopyArgs for ExplainCommandPhysical.
4318fd7 [Zongheng Yang] Make all output one Row.
439c6ab [Zongheng Yang] Minor cleanups.
408f574 [Zongheng Yang] SPARK-1704: Add CommandStrategy and ExplainCommandPhysical.
2014-06-09 16:47:44 -07:00
Michael Armbrust c6e041d171 [SQL] Simple framework for debugging query execution
Only records number of tuples and unique dataTypes output right now...

Example:
```scala
scala> import org.apache.spark.sql.execution.debug._
scala> hql("SELECT value FROM src WHERE key > 10").debug(sparkContext)

Results returned: 489
== Project [value#1:0] ==
Tuples output: 489
 value StringType: {java.lang.String}
== Filter (key#0:1 > 10) ==
Tuples output: 489
 value StringType: {java.lang.String}
 key IntegerType: {java.lang.Integer}
== HiveTableScan [value#1,key#0], (MetastoreRelation default, src, None), None ==
Tuples output: 500
 value StringType: {java.lang.String}
 key IntegerType: {java.lang.Integer}
```

Author: Michael Armbrust <michael@databricks.com>

Closes #1005 from marmbrus/debug and squashes the following commits:

dcc3ca6 [Michael Armbrust] Add comments.
c9dded2 [Michael Armbrust] Simple framework for debugging query execution
2014-06-09 14:24:19 -07:00
Daoyuan 0cf6002801 [SPARK-1495][SQL]add support for left semi join
Just submit another solution for #395

Author: Daoyuan <daoyuan.wang@intel.com>
Author: Michael Armbrust <michael@databricks.com>
Author: Daoyuan Wang <daoyuan.wang@intel.com>

Closes #837 from adrian-wang/left-semi-join-support and squashes the following commits:

d39cd12 [Daoyuan Wang] Merge pull request #1 from marmbrus/pr/837
6713c09 [Michael Armbrust] Better debugging for failed query tests.
035b73e [Michael Armbrust] Add test for left semi that can't be done with a hash join.
5ec6fa4 [Michael Armbrust] Add left semi to SQL Parser.
4c726e5 [Daoyuan] improvement according to Michael
8d4a121 [Daoyuan] add golden files for leftsemijoin
83a3c8a [Daoyuan] scala style fix
14cff80 [Daoyuan] add support for left semi join
2014-06-09 11:31:36 -07:00
Michael Armbrust a6c72ab16e [SPARK-1994][SQL] Weird data corruption bug when running Spark SQL on data in HDFS
Basically there is a race condition (possibly a scala bug?) when these values are recomputed on all of the slaves that results in an incorrect projection being generated (possibly because the GUID uniqueness contract is broken?).

In general we should probably enforce that all expression planing occurs on the driver, as is now occurring here.

Author: Michael Armbrust <michael@databricks.com>

Closes #1004 from marmbrus/fixAggBug and squashes the following commits:

e0c116c [Michael Armbrust] Compute aggregate expression during planning instead of lazily on workers.
2014-06-07 14:20:33 -07:00
witgo 41c4a33105 [SPARK-1841]: update scalatest to version 2.1.5
Author: witgo <witgo@qq.com>

Closes #713 from witgo/scalatest and squashes the following commits:

b627a6a [witgo] merge master
51fb3d6 [witgo] merge master
3771474 [witgo] fix RDDSuite
996d6f9 [witgo] fix TimeStampedWeakValueHashMap test
9dfa4e7 [witgo] merge bug
1479b22 [witgo] merge master
29b9194 [witgo] fix code style
022a7a2 [witgo] fix test dependency
a52c0fa [witgo] fix test dependency
cd8f59d [witgo] Merge branch 'master' of https://github.com/apache/spark into scalatest
046540d [witgo] fix RDDSuite.scala
2c543b9 [witgo] fix ReplSuite.scala
c458928 [witgo] update scalatest to version 2.1.5
2014-06-06 11:45:21 -07:00
Michael Armbrust 8d210560be [SPARK-2050 - 2][SQL] DIV and BETWEEN should not be case sensitive.
Followup: #989

Author: Michael Armbrust <michael@databricks.com>

Closes #994 from marmbrus/caseSensitiveFunctions2 and squashes the following commits:

9d9c8ed [Michael Armbrust] Fix DIV and BETWEEN.
2014-06-06 11:31:37 -07:00
Michael Armbrust 41db44c428 [SPARK-2050][SQL] LIKE, RLIKE and IN in HQL should not be case sensitive.
Author: Michael Armbrust <michael@databricks.com>

Closes #989 from marmbrus/caseSensitiveFuncitons and squashes the following commits:

681de54 [Michael Armbrust] LIKE, RLIKE and IN in HQL should not be case sensitive.
2014-06-05 23:20:59 -07:00
Michael Armbrust c7a183b2c2 [SPARK-2041][SQL] Correctly analyze queries where columnName == tableName.
Author: Michael Armbrust <michael@databricks.com>

Closes #985 from marmbrus/tableName and squashes the following commits:

3caaa27 [Michael Armbrust] Correctly analyze queries where columnName == tableName.
2014-06-05 17:42:08 -07:00
Takuya UESHIN e4c11eef2f [SPARK-2036] [SQL] CaseConversionExpression should check if the evaluated value is null.
`CaseConversionExpression` should check if the evaluated value is `null`.

Author: Takuya UESHIN <ueshin@happy-camper.st>

Closes #982 from ueshin/issues/SPARK-2036 and squashes the following commits:

61e1c54 [Takuya UESHIN] Add check if the evaluated value is null.
2014-06-05 12:00:31 -07:00
Takuya UESHIN 7c160293d6 [SPARK-2029] Bump pom.xml version number of master branch to 1.1.0-SNAPSHOT.
Author: Takuya UESHIN <ueshin@happy-camper.st>

Closes #974 from ueshin/issues/SPARK-2029 and squashes the following commits:

e19e8f4 [Takuya UESHIN] Bump version number to 1.1.0-SNAPSHOT.
2014-06-05 11:27:33 -07:00
Syed Hashmi 7782a304ad [SPARK-1942] Stop clearing spark.driver.port in unit tests
stop resetting spark.driver.port in unit tests (scala, java and python).

Author: Syed Hashmi <shashmi@cloudera.com>
Author: CodingCat <zhunansjtu@gmail.com>

Closes #943 from syedhashmi/master and squashes the following commits:

885f210 [Syed Hashmi] Removing unnecessary file (created by mergetool)
b8bd4b5 [Syed Hashmi] Merge remote-tracking branch 'upstream/master'
b895e59 [Syed Hashmi] Revert "[SPARK-1784] Add a new partitioner"
57b6587 [Syed Hashmi] Revert "[SPARK-1784] Add a balanced partitioner"
1574769 [Syed Hashmi] [SPARK-1942] Stop clearing spark.driver.port in unit tests
4354836 [Syed Hashmi] Revert "SPARK-1686: keep schedule() calling in the main thread"
fd36542 [Syed Hashmi] [SPARK-1784] Add a balanced partitioner
6668015 [CodingCat] SPARK-1686: keep schedule() calling in the main thread
4ca94cc [Syed Hashmi] [SPARK-1784] Add a new partitioner
2014-06-03 12:04:47 -07:00
Cheng Lian 862283e9cc Avoid dynamic dispatching when unwrapping Hive data.
This is a follow up of PR #758.

The `unwrapHiveData` function is now composed statically before actual rows are scanned according to the field object inspector to avoid dynamic dispatching cost.

According to the same micro benchmark used in PR #758, this simple change brings slight performance boost: 2.5% for CSV table and 1% for RCFile table.

```
Optimized version:

CSV: 6870 ms, RCFile: 5687 ms
CSV: 6832 ms, RCFile: 5800 ms
CSV: 6822 ms, RCFile: 5679 ms
CSV: 6704 ms, RCFile: 5758 ms
CSV: 6819 ms, RCFile: 5725 ms

Original version:

CSV: 7042 ms, RCFile: 5667 ms
CSV: 6883 ms, RCFile: 5703 ms
CSV: 7115 ms, RCFile: 5665 ms
CSV: 7020 ms, RCFile: 5981 ms
CSV: 6871 ms, RCFile: 5906 ms
```

Author: Cheng Lian <lian.cs.zju@gmail.com>

Closes #935 from liancheng/staticUnwrapping and squashes the following commits:

c49c70c [Cheng Lian] Avoid dynamic dispatching when unwrapping Hive data.
2014-06-02 19:20:23 -07:00
egraldlo ec8be274a7 [SPARK-1995][SQL] system function upper and lower can be supported
I don't know whether it's time to implement system function about string operation in spark sql now.

Author: egraldlo <egraldlo@gmail.com>

Closes #936 from egraldlo/stringoperator and squashes the following commits:

3c6c60a [egraldlo] Add UPPER, LOWER, MAX and MIN into hive parser
ea76d0a [egraldlo] modify the formatting issues
b49f25e [egraldlo] modify the formatting issues
1f0bbb5 [egraldlo] system function upper and lower supported
13d3267 [egraldlo] system function upper and lower supported
2014-06-02 18:02:57 -07:00
Cheng Lian d000ca98a8 [SPARK-1958] Calling .collect() on a SchemaRDD should call executeCollect() on the underlying query plan.
In cases like `Limit` and `TakeOrdered`, `executeCollect()` makes optimizations that `execute().collect()` will not.

Author: Cheng Lian <lian.cs.zju@gmail.com>

Closes #939 from liancheng/spark-1958 and squashes the following commits:

bdc4a14 [Cheng Lian] Copy rows to present immutable data to users
8250976 [Cheng Lian] Added return type explicitly for public API
192a25c [Cheng Lian] [SPARK-1958] Calling .collect() on a SchemaRDD should call executeCollect() on the underlying query plan.
2014-06-02 12:09:43 -07:00
Michael Armbrust 1a0da0ec57 [SQL] SPARK-1964 Add timestamp to hive metastore type parser.
Author: Michael Armbrust <michael@databricks.com>

Closes #913 from marmbrus/timestampMetastore and squashes the following commits:

8e0154f [Michael Armbrust] Add timestamp to hive metastore type parser.
2014-05-31 12:34:22 -07:00
Takuya UESHIN 3ce81494c5 [SPARK-1947] [SQL] Child of SumDistinct or Average should be widened to prevent overflows the same as Sum.
Child of `SumDistinct` or `Average` should be widened to prevent overflows the same as `Sum`.

Author: Takuya UESHIN <ueshin@happy-camper.st>

Closes #902 from ueshin/issues/SPARK-1947 and squashes the following commits:

99c3dcb [Takuya UESHIN] Insert Cast for SumDistinct and Average.
2014-05-31 11:30:03 -07:00
Cheng Lian cf989601d0 [SPARK-1959] String "NULL" shouldn't be interpreted as null value
JIRA issue: [SPARK-1959](https://issues.apache.org/jira/browse/SPARK-1959)

Author: Cheng Lian <lian.cs.zju@gmail.com>

Closes #909 from liancheng/spark-1959 and squashes the following commits:

306659c [Cheng Lian] [SPARK-1959] String "NULL" shouldn't be interpreted as null value
2014-05-30 22:13:11 -07:00
Cheng Lian 8f7141fbc0 [SPARK-1368][SQL] Optimized HiveTableScan
JIRA issue: [SPARK-1368](https://issues.apache.org/jira/browse/SPARK-1368)

This PR introduces two major updates:

- Replaced FP style code with `while` loop and reusable `GenericMutableRow` object in critical path of `HiveTableScan`.
- Using `ColumnProjectionUtils` to help optimizing RCFile and ORC column pruning.

My quick micro benchmark suggests these two optimizations made the optimized version 2x and 2.5x faster when scanning CSV table and RCFile table respectively:

```
Original:

[info] CSV: 27676 ms, RCFile: 26415 ms
[info] CSV: 27703 ms, RCFile: 26029 ms
[info] CSV: 27511 ms, RCFile: 25962 ms

Optimized:

[info] CSV: 13820 ms, RCFile: 10402 ms
[info] CSV: 14158 ms, RCFile: 10691 ms
[info] CSV: 13606 ms, RCFile: 10346 ms
```

The micro benchmark loads a 609MB CVS file (structurally similar to the `src` test table) into a normal Hive table with `LazySimpleSerDe` and a RCFile table, then scans these tables respectively.

Preparation code:

```scala
package org.apache.spark.examples.sql.hive

import org.apache.spark.sql.hive.LocalHiveContext
import org.apache.spark.{SparkConf, SparkContext}

object HiveTableScanPrepare extends App {
  val sparkContext = new SparkContext(
    new SparkConf()
      .setMaster("local")
      .setAppName(getClass.getSimpleName.stripSuffix("$")))

  val hiveContext = new LocalHiveContext(sparkContext)

  import hiveContext._

  hql("drop table scan_csv")
  hql("drop table scan_rcfile")

  hql("""create table scan_csv (key int, value string)
        |  row format serde 'org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe'
        |  with serdeproperties ('field.delim'=',')
      """.stripMargin)

  hql(s"""load data local inpath "${args(0)}" into table scan_csv""")

  hql("""create table scan_rcfile (key int, value string)
        |  row format serde 'org.apache.hadoop.hive.serde2.columnar.ColumnarSerDe'
        |stored as
        |  inputformat 'org.apache.hadoop.hive.ql.io.RCFileInputFormat'
        |  outputformat 'org.apache.hadoop.hive.ql.io.RCFileOutputFormat'
      """.stripMargin)

  hql(
    """
      |from scan_csv
      |insert overwrite table scan_rcfile
      |select scan_csv.key, scan_csv.value
    """.stripMargin)
}
```

Benchmark code:

```scala
package org.apache.spark.examples.sql.hive

import org.apache.spark.sql.hive.LocalHiveContext
import org.apache.spark.{SparkConf, SparkContext}

object HiveTableScanBenchmark extends App {
  val sparkContext = new SparkContext(
    new SparkConf()
      .setMaster("local")
      .setAppName(getClass.getSimpleName.stripSuffix("$")))

  val hiveContext = new LocalHiveContext(sparkContext)

  import hiveContext._

  val scanCsv = hql("select key from scan_csv")
  val scanRcfile = hql("select key from scan_rcfile")

  val csvDuration = benchmark(scanCsv.count())
  val rcfileDuration = benchmark(scanRcfile.count())

  println(s"CSV: $csvDuration ms, RCFile: $rcfileDuration ms")

  def benchmark(f: => Unit) = {
    val begin = System.currentTimeMillis()
    f
    val end = System.currentTimeMillis()
    end - begin
  }
}
```

@marmbrus Please help review, thanks!

Author: Cheng Lian <lian.cs.zju@gmail.com>

Closes #758 from liancheng/fastHiveTableScan and squashes the following commits:

4241a19 [Cheng Lian] Distinguishes sorted and possibly not sorted operations more accurately in HiveComparisonTest
cf640d8 [Cheng Lian] More HiveTableScan optimisations:
bf0e7dc [Cheng Lian] Added SortedOperation pattern to match *some* definitely sorted operations and avoid some sorting cost in HiveComparisonTest.
6d1c642 [Cheng Lian] Using ColumnProjectionUtils to optimise RCFile and ORC column pruning
eb62fd3 [Cheng Lian] [SPARK-1368] Optimized HiveTableScan
2014-05-29 15:24:03 -07:00
Takuya UESHIN 9df86835b6 [SPARK-1938] [SQL] ApproxCountDistinctMergeFunction should return Int value.
`ApproxCountDistinctMergeFunction` should return `Int` value because the `dataType` of `ApproxCountDistinct` is `IntegerType`.

Author: Takuya UESHIN <ueshin@happy-camper.st>

Closes #893 from ueshin/issues/SPARK-1938 and squashes the following commits:

3970e88 [Takuya UESHIN] Remove a superfluous line.
5ad7ec1 [Takuya UESHIN] Make dataType for each of CountDistinct, ApproxCountDistinctMerge and ApproxCountDistinct LongType.
cbe7c71 [Takuya UESHIN] Revert a change.
fc3ac0f [Takuya UESHIN] Fix evaluated value type of ApproxCountDistinctMergeFunction to Int.
2014-05-27 22:17:50 -07:00
LY Lai 0682567450 [SQL] SPARK-1922
Allow underscore in column name of a struct field https://issues.apache.org/jira/browse/SPARK-1922 .

Author: LY Lai <ly.lai@vpon.com>

Closes #873 from lyuanlai/master and squashes the following commits:

2253263 [LY Lai] Allow underscore in struct field column name
2014-05-27 16:08:38 -07:00
Takuya UESHIN 3b0babad1f [SPARK-1915] [SQL] AverageFunction should not count if the evaluated value is null.
Average values are difference between the calculation is done partially or not partially.
Because `AverageFunction` (in not-partially calculation) counts even if the evaluated value is null.

Author: Takuya UESHIN <ueshin@happy-camper.st>

Closes #862 from ueshin/issues/SPARK-1915 and squashes the following commits:

b1ff3c0 [Takuya UESHIN] Modify AverageFunction not to count if the evaluated value is null.
2014-05-27 14:55:23 -07:00
Takuya UESHIN d1375a2bff [SPARK-1926] [SQL] Nullability of Max/Min/First should be true.
Nullability of `Max`/`Min`/`First` should be `true` because they return `null` if there are no rows.

Author: Takuya UESHIN <ueshin@happy-camper.st>

Closes #881 from ueshin/issues/SPARK-1926 and squashes the following commits:

322610f [Takuya UESHIN] Fix nullability of Min/Max/First.
2014-05-27 14:53:57 -07:00
Takuya UESHIN d6395d86f9 [SPARK-1914] [SQL] Simplify CountFunction not to traverse to evaluate all child expressions.
`CountFunction` should count up only if the child's evaluated value is not null.

Because it traverses to evaluate all child expressions, even if the child is null, it counts up if one of the all children is not null.

Author: Takuya UESHIN <ueshin@happy-camper.st>

Closes #861 from ueshin/issues/SPARK-1914 and squashes the following commits:

3b37315 [Takuya UESHIN] Merge branch 'master' into issues/SPARK-1914
2afa238 [Takuya UESHIN] Simplify CountFunction not to traverse to evaluate all child expressions.
2014-05-26 00:17:20 -07:00
Aaron Davidson c3576ffcd7 [SQL] Minor: Introduce SchemaRDD#aggregate() for simple aggregations
```scala
rdd.aggregate(Sum('val))
```
is just shorthand for

```scala
rdd.groupBy()(Sum('val))
```

but seems be more natural than doing a groupBy with no grouping expressions when you really just want an aggregation over all rows.

Did not add a JavaSchemaRDD or Python API, as these seem to be lacking several other methods like groupBy() already -- leaving that cleanup for future patches.

Author: Aaron Davidson <aaron@databricks.com>

Closes #874 from aarondav/schemardd and squashes the following commits:

e9e68ee [Aaron Davidson] Add comment
db6afe2 [Aaron Davidson] Introduce SchemaRDD#aggregate() for simple aggregations
2014-05-25 18:37:44 -07:00
Reynold Xin d66642e397 SPARK-1822: Some minor cleanup work on SchemaRDD.count()
Minor cleanup following #841.

Author: Reynold Xin <rxin@apache.org>

Closes #868 from rxin/schema-count and squashes the following commits:

5442651 [Reynold Xin] SPARK-1822: Some minor cleanup work on SchemaRDD.count()
2014-05-25 01:44:49 -07:00
Kan Zhang 6052db9dc1 [SPARK-1822] SchemaRDD.count() should use query optimizer
Author: Kan Zhang <kzhang@apache.org>

Closes #841 from kanzhang/SPARK-1822 and squashes the following commits:

2f8072a [Kan Zhang] [SPARK-1822] Minor style update
cf4baa4 [Kan Zhang] [SPARK-1822] Adding Scaladoc
e67c910 [Kan Zhang] [SPARK-1822] SchemaRDD.count() should use optimizer
2014-05-25 00:06:42 -07:00
Cheng Lian 5afe6af0b1 [SPARK-1913][SQL] Bug fix: column pruning error in Parquet support
JIRA issue: [SPARK-1913](https://issues.apache.org/jira/browse/SPARK-1913)

When scanning Parquet tables, attributes referenced only in predicates that are pushed down are not passed to the `ParquetTableScan` operator and causes exception.

Author: Cheng Lian <lian.cs.zju@gmail.com>

Closes #863 from liancheng/spark-1913 and squashes the following commits:

f976b73 [Cheng Lian] Addessed the readability issue commented by @rxin
f5b257d [Cheng Lian] Added back comments deleted by mistake
ae60ab3 [Cheng Lian] [SPARK-1913] Attributes referenced only in predicates pushed down should remain in ParquetTableScan operator
2014-05-24 20:42:01 -07:00
Takuya UESHIN bb88875ad5 [SPARK-1889] [SQL] Apply splitConjunctivePredicates to join condition while finding join ke...
...ys.

When tables are equi-joined by multiple-keys `HashJoin` should be used, but `CartesianProduct` and then `Filter` are used.
The join keys are paired by `And` expression so we need to apply `splitConjunctivePredicates` to join condition while finding join keys.

Author: Takuya UESHIN <ueshin@happy-camper.st>

Closes #836 from ueshin/issues/SPARK-1889 and squashes the following commits:

fe1c387 [Takuya UESHIN] Apply splitConjunctivePredicates to join condition while finding join keys.
2014-05-21 15:37:47 -07:00
Tathagata Das 7f0cfe47f4 [Hotfix] Blacklisted flaky HiveCompatibility test
`lateral_view_outer` query sometimes returns a different set of 10 rows.

Author: Tathagata Das <tathagata.das1565@gmail.com>

Closes #838 from tdas/hive-test-fix2 and squashes the following commits:

9128a0d [Tathagata Das] Blacklisted flaky HiveCompatibility test.
2014-05-20 10:27:12 -07:00
witgo 6a2c5c610c [SPARK-1875]NoClassDefFoundError: StringUtils when building with hadoop 1.x and hive
Author: witgo <witgo@qq.com>

Closes #824 from witgo/SPARK-1875_commons-lang-2.6 and squashes the following commits:

ef7231d [witgo] review commit
ead3c3b [witgo] SPARK-1875:NoClassDefFoundError: StringUtils when building against Hadoop 1
2014-05-19 19:40:29 -07:00
Andre Schumacher 40d6acd6ba SPARK-1487 [SQL] Support record filtering via predicate pushdown in Parquet
Simple filter predicates such as LessThan, GreaterThan, etc., where one side is a literal and the other one a NamedExpression are now pushed down to the underlying ParquetTableScan. Here are some results for a microbenchmark with a simple schema of six fields of different types where most records failed the test:

             | Uncompressed  | Compressed
-------------| ------------- | -------------
File size  |     10 GB  | 2 GB
Speedup |      2         | 1.8

Since mileage may vary I added a new option to SparkConf:

`org.apache.spark.sql.parquet.filter.pushdown`

Default value would be `true` and setting it to `false` disables the pushdown. When most rows are expected to pass the filter or when there are few fields performance can be better when pushdown is disabled. The default should fit situations with a reasonable number of (possibly nested) fields where not too many records on average pass the filter.

Because of an issue with Parquet ([see here](https://github.com/Parquet/parquet-mr/issues/371])) currently only predicates on non-nullable attributes are pushed down. If one would know that for a given table no optional fields have missing values one could also allow overriding this.

Author: Andre Schumacher <andre.schumacher@iki.fi>

Closes #511 from AndreSchumacher/parquet_filter and squashes the following commits:

16bfe83 [Andre Schumacher] Removing leftovers from merge during rebase
7b304ca [Andre Schumacher] Fixing formatting
c36d5cb [Andre Schumacher] Scalastyle
3da98db [Andre Schumacher] Second round of review feedback
7a78265 [Andre Schumacher] Fixing broken formatting in ParquetFilter
a86553b [Andre Schumacher] First round of code review feedback
b0f7806 [Andre Schumacher] Optimizing imports in ParquetTestData
85fea2d [Andre Schumacher] Adding SparkConf setting to disable filter predicate pushdown
f0ad3cf [Andre Schumacher] Undoing changes not needed for this PR
210e9cb [Andre Schumacher] Adding disjunctive filter predicates
a93a588 [Andre Schumacher] Adding unit test for filtering
6d22666 [Andre Schumacher] Extending ParquetFilters
93e8192 [Andre Schumacher] First commit Parquet record filtering
2014-05-16 13:41:41 -07:00
Michael Armbrust 032d6632ad [SQL] Implement between in hql
Author: Michael Armbrust <michael@databricks.com>

Closes #804 from marmbrus/between and squashes the following commits:

ae24672 [Michael Armbrust] add golden answer.
d9997ef [Michael Armbrust] Implement between in hql.
9bd4433 [Michael Armbrust] Better error on parse failures.
2014-05-16 11:47:00 -07:00
Cheng Hao a20fea9881 [Spark-1461] Deferred Expression Evaluation (short-circuit evaluation)
This patch unify the foldable & nullable interface for Expression.
1) Deterministic-less UDF (like Rand()) can not be folded.
2) Short-circut will significantly improves the performance in Expression Evaluation, however, the stateful UDF should not be ignored in a short-circuit evaluation(e.g. in expression: col1 > 0 and row_sequence() < 1000, row_sequence() can not be ignored even if col1 > 0 is false)

I brought an concept of DeferredObject from Hive, which has 2 kinds of children classes (EagerResult / DeferredResult), the former requires triggering the evaluation before it's created, while the later trigger the evaluation when first called its get() method.

Author: Cheng Hao <hao.cheng@intel.com>

Closes #446 from chenghao-intel/expression_deferred_evaluation and squashes the following commits:

d2729de [Cheng Hao] Fix the codestyle issues
a08f09c [Cheng Hao] fix bug in or/and short-circuit evaluation
af2236b [Cheng Hao] revert the short-circuit expression evaluation for IF
b7861d2 [Cheng Hao] Add Support for Deferred Expression Evaluation
2014-05-15 22:12:34 -07:00
Michael Armbrust a4aafe5f9f [SQL] Fix tiny/small ints from HiveMetastore.
Author: Michael Armbrust <michael@databricks.com>

Closes #797 from marmbrus/smallInt and squashes the following commits:

2db9dae [Michael Armbrust] Fix tiny/small ints from HiveMetastore.
2014-05-15 16:50:42 -07:00
Stevo Slavić e66e31be51 SPARK-1803 Replaced colon in filenames with a dash
This patch replaces colon in several filenames with dash to make these filenames Windows compatible.

Author: Stevo Slavić <sslavic@gmail.com>
Author: Stevo Slavic <sslavic@gmail.com>

Closes #739 from sslavic/SPARK-1803 and squashes the following commits:

3ec66eb [Stevo Slavic] Removed extra empty line which was causing test to fail
b967cc3 [Stevo Slavić] Aligned tests and names of test resources
2b12776 [Stevo Slavić] Fixed a typo in file name
1c5dfff [Stevo Slavić] Replaced colon in file name with dash
8f5bf7f [Stevo Slavić] Replaced colon in file name with dash
c5b5083 [Stevo Slavić] Replaced colon in file name with dash
a49801f [Stevo Slavić] Replaced colon in file name with dash
401d99e [Stevo Slavić] Replaced colon in file name with dash
40a9621 [Stevo Slavić] Replaced colon in file name with dash
4774580 [Stevo Slavić] Replaced colon in file name with dash
004f8bb [Stevo Slavić] Replaced colon in file name with dash
d6a3e2c [Stevo Slavić] Replaced colon in file name with dash
b585126 [Stevo Slavić] Replaced colon in file name with dash
028e48a [Stevo Slavić] Replaced colon in file name with dash
ece0507 [Stevo Slavić] Replaced colon in file name with dash
84f5d2f [Stevo Slavić] Replaced colon in file name with dash
2fc7854 [Stevo Slavić] Replaced colon in file name with dash
9e1467d [Stevo Slavić] Replaced colon in file name with dash
2014-05-15 16:44:14 -07:00
Takuya UESHIN 94c9d6f598 [SPARK-1819] [SQL] Fix GetField.nullable.
`GetField.nullable` should be `true` not only when `field.nullable` is `true` but also when `child.nullable` is `true`.

Author: Takuya UESHIN <ueshin@happy-camper.st>

Closes #757 from ueshin/issues/SPARK-1819 and squashes the following commits:

8781a11 [Takuya UESHIN] Modify a test to use named parameters.
5bfc77d [Takuya UESHIN] Fix GetField.nullable.
2014-05-15 11:21:33 -07:00
Takuya UESHIN db8cc6f28a [SPARK-1845] [SQL] Use AllScalaRegistrar for SparkSqlSerializer to register serializers of ...
...Scala collections.

When I execute `orderBy` or `limit` for `SchemaRDD` including `ArrayType` or `MapType`, `SparkSqlSerializer` throws the following exception:

```
com.esotericsoftware.kryo.KryoException: Class cannot be created (missing no-arg constructor): scala.collection.immutable.$colon$colon
```

or

```
com.esotericsoftware.kryo.KryoException: Class cannot be created (missing no-arg constructor): scala.collection.immutable.Vector
```

or

```
com.esotericsoftware.kryo.KryoException: Class cannot be created (missing no-arg constructor): scala.collection.immutable.HashMap$HashTrieMap
```

and so on.

This is because registrations of serializers for each concrete collections are missing in `SparkSqlSerializer`.
I believe it should use `AllScalaRegistrar`.
`AllScalaRegistrar` covers a lot of serializers for concrete classes of `Seq`, `Map` for `ArrayType`, `MapType`.

Author: Takuya UESHIN <ueshin@happy-camper.st>

Closes #790 from ueshin/issues/SPARK-1845 and squashes the following commits:

d1ed992 [Takuya UESHIN] Use AllScalaRegistrar for SparkSqlSerializer to register serializers of Scala collections.
2014-05-15 11:20:21 -07:00
witgo bae07e36a6 fix different versions of commons-lang dependency and apache/spark#746 addendum
Author: witgo <witgo@qq.com>

Closes #754 from witgo/commons-lang and squashes the following commits:

3ebab31 [witgo] merge master
f3b8fa2 [witgo] merge master
2083fae [witgo] repeat definition
5599cdb [witgo] multiple version of sbt  dependency
c1b66a1 [witgo] fix different versions of commons-lang dependency
2014-05-14 22:26:26 -07:00
Prashant Sharma 46324279da Package docs
This is a few changes based on the original patch by @scrapcodes.

Author: Prashant Sharma <prashant.s@imaginea.com>
Author: Patrick Wendell <pwendell@gmail.com>

Closes #785 from pwendell/package-docs and squashes the following commits:

c32b731 [Patrick Wendell] Changes based on Prashant's patch
c0463d3 [Prashant Sharma] added eof new line
ce8bf73 [Prashant Sharma] Added eof new line to all files.
4c35f2e [Prashant Sharma] SPARK-1563 Add package-info.java and package.scala files for all packages that appear in docs
2014-05-14 22:24:41 -07:00
wangfei 44165fc91a [SPARK-1826] fix the head notation of package object dsl
Author: wangfei <scnbwf@yeah.net>

Closes #765 from scwf/dslfix and squashes the following commits:

d2d1a9d [wangfei] Update package.scala
66ff53b [wangfei] fix the head notation of package object dsl
2014-05-14 17:59:11 -07:00
Patrick Wendell d58cb33ffa SPARK-1828: Created forked version of hive-exec that doesn't bundle other dependencies
See https://issues.apache.org/jira/browse/SPARK-1828 for more information.

This is being submitted to Jenkin's for testing. The dependency won't fully
propagate in Maven central for a few more hours.

Author: Patrick Wendell <pwendell@gmail.com>

Closes #767 from pwendell/hive-shaded and squashes the following commits:

ea10ac5 [Patrick Wendell] SPARK-1828: Created forked version of hive-exec that doesn't bundle other dependencies
2014-05-14 09:51:01 -07:00
Michael Armbrust 6ce0884446 [SQL] Improve column pruning.
Fixed a bug that was preventing us from ever pruning beneath Joins.

## TPC-DS Q3
### Before:
```
Aggregate false, [d_year#12,i_brand#65,i_brand_id#64], [d_year#12,i_brand_id#64 AS brand_id#0,i_brand#65 AS brand#1,SUM(PartialSum#79) AS sum_agg#2]
 Exchange (HashPartitioning [d_year#12:0,i_brand#65:1,i_brand_id#64:2], 150)
  Aggregate true, [d_year#12,i_brand#65,i_brand_id#64], [d_year#12,i_brand#65,i_brand_id#64,SUM(CAST(ss_ext_sales_price#49, DoubleType)) AS PartialSum#79]
   Project [d_year#12:6,i_brand#65:59,i_brand_id#64:58,ss_ext_sales_price#49:43]
    HashJoin [ss_item_sk#36], [i_item_sk#57], BuildRight
     Exchange (HashPartitioning [ss_item_sk#36:30], 150)
      HashJoin [d_date_sk#6], [ss_sold_date_sk#34], BuildRight
       Exchange (HashPartitioning [d_date_sk#6:0], 150)
        Filter (d_moy#14:8 = 12)
         HiveTableScan [d_date_sk#6,d_date_id#7,d_date#8,d_month_seq#9,d_week_seq#10,d_quarter_seq#11,d_year#12,d_dow#13,d_moy#14,d_dom#15,d_qoy#16,d_fy_year#17,d_fy_quarter_seq#18,d_fy_week_seq#19,d_day_name#20,d_quarter_name#21,d_holiday#22,d_weekend#23,d_following_holiday#24,d_first_dom#25,d_last_dom#26,d_same_day_ly#27,d_same_day_lq#28,d_current_day#29,d_current_week#30,d_current_month#31,d_current_quarter#32,d_current_year#33], (MetastoreRelation default, date_dim, Some(dt)), None
       Exchange (HashPartitioning [ss_sold_date_sk#34:0], 150)
        HiveTableScan [ss_sold_date_sk#34,ss_sold_time_sk#35,ss_item_sk#36,ss_customer_sk#37,ss_cdemo_sk#38,ss_hdemo_sk#39,ss_addr_sk#40,ss_store_sk#41,ss_promo_sk#42,ss_ticket_number#43,ss_quantity#44,ss_wholesale_cost#45,ss_list_price#46,ss_sales_price#47,ss_ext_discount_amt#48,ss_ext_sales_price#49,ss_ext_wholesale_cost#50,ss_ext_list_price#51,ss_ext_tax#52,ss_coupon_amt#53,ss_net_paid#54,ss_net_paid_inc_tax#55,ss_net_profit#56], (MetastoreRelation default, store_sales, None), None
     Exchange (HashPartitioning [i_item_sk#57:0], 150)
      Filter (i_manufact_id#70:13 = 436)
       HiveTableScan [i_item_sk#57,i_item_id#58,i_rec_start_date#59,i_rec_end_date#60,i_item_desc#61,i_current_price#62,i_wholesale_cost#63,i_brand_id#64,i_brand#65,i_class_id#66,i_class#67,i_category_id#68,i_category#69,i_manufact_id#70,i_manufact#71,i_size#72,i_formulation#73,i_color#74,i_units#75,i_container#76,i_manager_id#77,i_product_name#78], (MetastoreRelation default, item, None), None
```
### After
```
Aggregate false, [d_year#172,i_brand#225,i_brand_id#224], [d_year#172,i_brand_id#224 AS brand_id#160,i_brand#225 AS brand#161,SUM(PartialSum#239) AS sum_agg#162]
 Exchange (HashPartitioning [d_year#172:0,i_brand#225:1,i_brand_id#224:2], 150)
  Aggregate true, [d_year#172,i_brand#225,i_brand_id#224], [d_year#172,i_brand#225,i_brand_id#224,SUM(CAST(ss_ext_sales_price#209, DoubleType)) AS PartialSum#239]
   Project [d_year#172:1,i_brand#225:5,i_brand_id#224:3,ss_ext_sales_price#209:0]
    HashJoin [ss_item_sk#196], [i_item_sk#217], BuildRight
     Exchange (HashPartitioning [ss_item_sk#196:2], 150)
      Project [ss_ext_sales_price#209:2,d_year#172:1,ss_item_sk#196:3]
       HashJoin [d_date_sk#166], [ss_sold_date_sk#194], BuildRight
        Exchange (HashPartitioning [d_date_sk#166:0], 150)
         Project [d_date_sk#166:0,d_year#172:1]
          Filter (d_moy#174:2 = 12)
           HiveTableScan [d_date_sk#166,d_year#172,d_moy#174], (MetastoreRelation default, date_dim, Some(dt)), None
        Exchange (HashPartitioning [ss_sold_date_sk#194:2], 150)
         HiveTableScan [ss_ext_sales_price#209,ss_item_sk#196,ss_sold_date_sk#194], (MetastoreRelation default, store_sales, None), None
     Exchange (HashPartitioning [i_item_sk#217:1], 150)
      Project [i_brand_id#224:0,i_item_sk#217:1,i_brand#225:2]
       Filter (i_manufact_id#230:3 = 436)
        HiveTableScan [i_brand_id#224,i_item_sk#217,i_brand#225,i_manufact_id#230], (MetastoreRelation default, item, None), None
```

Author: Michael Armbrust <michael@databricks.com>

Closes #729 from marmbrus/fixPruning and squashes the following commits:

5feeff0 [Michael Armbrust] Improve column pruning.
2014-05-13 23:27:22 -07:00
larvaboy c33b8dcbf6 Implement ApproximateCountDistinct for SparkSql
Add the implementation for ApproximateCountDistinct to SparkSql. We use the HyperLogLog algorithm implemented in stream-lib, and do the count in two phases: 1) counting the number of distinct elements in each partitions, and 2) merge the HyperLogLog results from different partitions.

A simple serializer and test cases are added as well.

Author: larvaboy <larvaboy@gmail.com>

Closes #737 from larvaboy/master and squashes the following commits:

bd8ef3f [larvaboy] Add support of user-provided standard deviation to ApproxCountDistinct.
9ba8360 [larvaboy] Fix alignment and null handling issues.
95b4067 [larvaboy] Add a test case for count distinct and approximate count distinct.
f57917d [larvaboy] Add the parser for the approximate count.
a2d5d10 [larvaboy] Add ApproximateCountDistinct aggregates and functions.
7ad273a [larvaboy] Add SparkSql serializer for HyperLogLog.
1d9aacf [larvaboy] Fix a minor typo in the toString method of the Count case class.
653542b [larvaboy] Fix a couple of minor typos.
2014-05-13 21:26:08 -07:00
Michael Armbrust 44233865cf [SQL] Make it possible to create Java/Python SQLContexts from an existing Scala SQLContext.
Author: Michael Armbrust <michael@databricks.com>

Closes #761 from marmbrus/existingContext and squashes the following commits:

4651051 [Michael Armbrust] Make it possible to create Java/Python SQLContexts from an existing Scala SQLContext.
2014-05-13 21:23:51 -07:00
Andrew Ash 156df87e7c SPARK-1757 Failing test for saving null primitives with .saveAsParquetFile()
https://issues.apache.org/jira/browse/SPARK-1757

The first test succeeds, but the second test fails with exception:

```
[info] - save and load case class RDD with Nones as parquet *** FAILED *** (14 milliseconds)
[info]   java.lang.RuntimeException: Unsupported datatype StructType(List())
[info]   at scala.sys.package$.error(package.scala:27)
[info]   at org.apache.spark.sql.parquet.ParquetTypesConverter$.fromDataType(ParquetRelation.scala:201)
[info]   at org.apache.spark.sql.parquet.ParquetTypesConverter$$anonfun$1.apply(ParquetRelation.scala:235)
[info]   at org.apache.spark.sql.parquet.ParquetTypesConverter$$anonfun$1.apply(ParquetRelation.scala:235)
[info]   at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
[info]   at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
[info]   at scala.collection.immutable.List.foreach(List.scala:318)
[info]   at scala.collection.TraversableLike$class.map(TraversableLike.scala:244)
[info]   at scala.collection.AbstractTraversable.map(Traversable.scala:105)
[info]   at org.apache.spark.sql.parquet.ParquetTypesConverter$.convertFromAttributes(ParquetRelation.scala:234)
[info]   at org.apache.spark.sql.parquet.ParquetTypesConverter$.writeMetaData(ParquetRelation.scala:267)
[info]   at org.apache.spark.sql.parquet.ParquetRelation$.createEmpty(ParquetRelation.scala:143)
[info]   at org.apache.spark.sql.parquet.ParquetRelation$.create(ParquetRelation.scala:122)
[info]   at org.apache.spark.sql.execution.SparkStrategies$ParquetOperations$.apply(SparkStrategies.scala:139)
[info]   at org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun$1.apply(QueryPlanner.scala:58)
[info]   at org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun$1.apply(QueryPlanner.scala:58)
[info]   at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:371)
[info]   at org.apache.spark.sql.catalyst.planning.QueryPlanner.apply(QueryPlanner.scala:59)
[info]   at org.apache.spark.sql.SQLContext$QueryExecution.sparkPlan$lzycompute(SQLContext.scala:264)
[info]   at org.apache.spark.sql.SQLContext$QueryExecution.sparkPlan(SQLContext.scala:264)
[info]   at org.apache.spark.sql.SQLContext$QueryExecution.executedPlan$lzycompute(SQLContext.scala:265)
[info]   at org.apache.spark.sql.SQLContext$QueryExecution.executedPlan(SQLContext.scala:265)
[info]   at org.apache.spark.sql.SQLContext$QueryExecution.toRdd$lzycompute(SQLContext.scala:268)
[info]   at org.apache.spark.sql.SQLContext$QueryExecution.toRdd(SQLContext.scala:268)
[info]   at org.apache.spark.sql.SchemaRDDLike$class.saveAsParquetFile(SchemaRDDLike.scala:66)
[info]   at org.apache.spark.sql.SchemaRDD.saveAsParquetFile(SchemaRDD.scala:98)
```

Author: Andrew Ash <andrew@andrewash.com>
Author: Michael Armbrust <michael@databricks.com>

Closes #690 from ash211/rdd-parquet-save and squashes the following commits:

747a0b9 [Andrew Ash] Merge pull request #1 from marmbrus/pr/690
54bd00e [Michael Armbrust] Need to put Option first since Option <: Seq.
8f3f281 [Andrew Ash] SPARK-1757 Add failing test for saving SparkSQL Schemas with Option[?] fields as parquet
2014-05-12 19:23:39 -07:00
Michael Armbrust 2f1a337358 [SQL] Make Hive Metastore conversion functions publicly visible.
I need this to be public for the implementation of SharkServer2.  However, I think this functionality is generally useful and should be pretty stable.

Author: Michael Armbrust <michael@databricks.com>

Closes #750 from marmbrus/metastoreTypes and squashes the following commits:

f51b62e [Michael Armbrust] Make Hive Metastore conversion functions publicly visible.
2014-05-12 18:40:30 -07:00
Sean Owen 8586bf564f SPARK-1802. Audit dependency graph when Spark is built with -Phive
This initial commit resolves the conflicts in the Hive profiles as noted in https://issues.apache.org/jira/browse/SPARK-1802 .

Most of the fix was to note that Hive drags in Avro, and so if the hive module depends on Spark's version of the `avro-*` dependencies, it will pull in our exclusions as needed too. But I found we need to copy some exclusions between the two Avro dependencies to get this right. And then had to squash some commons-logging intrusions.

This turned up another annoying find, that `hive-exec` is basically an "assembly" artifact that _also_ packages all of its transitive dependencies. This means the final assembly shows lots of collisions between itself and its dependencies, and even other project dependencies. I have a TODO to examine whether that is going to be a deal-breaker or not.

In the meantime I'm going to tack on a second commit to this PR that will also fix some similar, last collisions in the YARN profile.

Author: Sean Owen <sowen@cloudera.com>

Closes #744 from srowen/SPARK-1802 and squashes the following commits:

a856604 [Sean Owen] Resolve JAR version conflicts specific to Hive profile
2014-05-12 14:17:25 -07:00
Sean Owen 7120a2979d SPARK-1798. Tests should clean up temp files
Three issues related to temp files that tests generate – these should be touched up for hygiene but are not urgent.

Modules have a log4j.properties which directs the unit-test.log output file to a directory like `[module]/target/unit-test.log`. But this ends up creating `[module]/[module]/target/unit-test.log` instead of former.

The `work/` directory is not deleted by "mvn clean", in the parent and in modules. Neither is the `checkpoint/` directory created under the various external modules.

Many tests create a temp directory, which is not usually deleted. This can be largely resolved by calling `deleteOnExit()` at creation and trying to call `Utils.deleteRecursively` consistently to clean up, sometimes in an `@After` method.

_If anyone seconds the motion, I can create a more significant change that introduces a new test trait along the lines of `LocalSparkContext`, which provides management of temp directories for subclasses to take advantage of._

Author: Sean Owen <sowen@cloudera.com>

Closes #732 from srowen/SPARK-1798 and squashes the following commits:

5af578e [Sean Owen] Try to consistently delete test temp dirs and files, and set deleteOnExit() for each
b21b356 [Sean Owen] Remove work/ and checkpoint/ dirs with mvn clean
bdd0f41 [Sean Owen] Remove duplicate module dir in log4j.properties output path for tests
2014-05-12 14:16:19 -07:00
Matei Zaharia 7eefc9d2b3 SPARK-1708. Add a ClassTag on Serializer and things that depend on it
This pull request contains a rebased patch from @heathermiller (https://github.com/heathermiller/spark/pull/1) to add ClassTags on Serializer and types that depend on it (Broadcast and AccumulableCollection). Putting these in the public API signatures now will allow us to use Scala Pickling for serialization down the line without breaking binary compatibility.

One question remaining is whether we also want them on Accumulator -- Accumulator is passed as part of a bigger Task or TaskResult object via the closure serializer so it doesn't seem super useful to add the ClassTag there. Broadcast and AccumulableCollection in contrast were being serialized directly.

CC @rxin, @pwendell, @heathermiller

Author: Matei Zaharia <matei@databricks.com>

Closes #700 from mateiz/spark-1708 and squashes the following commits:

1a3d8b0 [Matei Zaharia] Use fake ClassTag in Java
3b449ed [Matei Zaharia] test fix
2209a27 [Matei Zaharia] Code style fixes
9d48830 [Matei Zaharia] Add a ClassTag on Serializer and things that depend on it
2014-05-10 12:10:24 -07:00
Takuya UESHIN 8e94d2721a [SPARK-1778] [SQL] Add 'limit' transformation to SchemaRDD.
Add `limit` transformation to `SchemaRDD`.

Author: Takuya UESHIN <ueshin@happy-camper.st>

Closes #711 from ueshin/issues/SPARK-1778 and squashes the following commits:

33169df [Takuya UESHIN] Add 'limit' transformation to SchemaRDD.
2014-05-10 12:03:27 -07:00
Takuya UESHIN 322b1808d2 [SPARK-1754] [SQL] Add missing arithmetic DSL operations.
Add missing arithmetic DSL operations: `unary_-`, `%`.

Author: Takuya UESHIN <ueshin@happy-camper.st>

Closes #689 from ueshin/issues/SPARK-1754 and squashes the following commits:

a09ef69 [Takuya UESHIN] Add also missing ! (not) operation.
f73ae2c [Takuya UESHIN] Remove redundant tests.
5b3f087 [Takuya UESHIN] Add tests relating DSL operations.
e09c5b8 [Takuya UESHIN] Add missing arithmetic DSL operations.
2014-05-08 15:31:47 -07:00
Michael Armbrust 19c8fb02bc [SQL] Improve SparkSQL Aggregates
* Add native min/max (was using hive before).
* Handle nulls correctly in Avg and Sum.

Author: Michael Armbrust <michael@databricks.com>

Closes #683 from marmbrus/aggFixes and squashes the following commits:

64fe30b [Michael Armbrust] Improve SparkSQL Aggregates * Add native min/max (was using hive before). * Handle nulls correctly in Avg and Sum.
2014-05-08 01:08:43 -04:00
Cheng Hao ca43186867 [SQL] Fix Performance Issue in data type casting
Using lazy val object instead of function in the class Cast, which improved the performance nearly by 2X in my local micro-benchmark.

Author: Cheng Hao <hao.cheng@intel.com>

Closes #679 from chenghao-intel/fix_type_casting and squashes the following commits:

71b0902 [Cheng Hao] using lazy val object instead of function for data type casting
2014-05-07 16:54:58 -04:00
Kan Zhang 967635a242 [SPARK-1460] Returning SchemaRDD instead of normal RDD on Set operations...
... that do not change schema

Author: Kan Zhang <kzhang@apache.org>

Closes #448 from kanzhang/SPARK-1460 and squashes the following commits:

111e388 [Kan Zhang] silence MiMa errors in EdgeRDD and VertexRDD
91dc787 [Kan Zhang] Taking into account newly added Ordering param
79ed52a [Kan Zhang] [SPARK-1460] Returning SchemaRDD on Set operations that do not change schema
2014-05-07 09:41:31 -07:00
Cheng Hao 3eb53bd59e [WIP][Spark-SQL] Optimize the Constant Folding for Expression
Currently, expression does not support the "constant null" well in constant folding.
e.g. Sum(a, 0) actually always produces Literal(0, NumericType) in runtime.

For example:
```
explain select isnull(key+null)  from src;
== Logical Plan ==
Project [HiveGenericUdf#isnull((key#30 + CAST(null, IntegerType))) AS c_0#28]
 MetastoreRelation default, src, None

== Optimized Logical Plan ==
Project [true AS c_0#28]
 MetastoreRelation default, src, None

== Physical Plan ==
Project [true AS c_0#28]
 HiveTableScan [], (MetastoreRelation default, src, None), None
```

I've create a new Optimization rule called NullPropagation for such kind of constant folding.

Author: Cheng Hao <hao.cheng@intel.com>
Author: Michael Armbrust <michael@databricks.com>

Closes #482 from chenghao-intel/optimize_constant_folding and squashes the following commits:

2f14b50 [Cheng Hao] Fix code style issues
68b9fad [Cheng Hao] Remove the Literal pattern matching for NullPropagation
29c8166 [Cheng Hao] Update the code for feedback of code review
50444cc [Cheng Hao] Remove the unnecessary null checking
80f9f18 [Cheng Hao] Update the UnitTest for aggregation constant folding
27ea3d7 [Cheng Hao] Fix Constant Folding Bugs & Add More Unittests
b28e03a [Cheng Hao] Merge pull request #1 from marmbrus/pr/482
9ccefdb [Michael Armbrust] Add tests for optimized expression evaluation.
543ef9d [Cheng Hao] fix code style issues
9cf0396 [Cheng Hao] update code according to the code review comment
536c005 [Cheng Hao] Add Exceptional case for constant folding
3c045c7 [Cheng Hao] Optimize the Constant Folding by adding more rules
2645d4f [Cheng Hao] Constant Folding(null propagation)
2014-05-07 03:37:12 -04:00
Michael Armbrust 3c64750bdd [SQL] SPARK-1732 - Support for null primitive values.
I also removed a println that I bumped into.

Author: Michael Armbrust <michael@databricks.com>

Closes #658 from marmbrus/nullPrimitives and squashes the following commits:

a3ec4f3 [Michael Armbrust] Remove println.
695606b [Michael Armbrust] Support for null primatives from using scala and java reflection.
2014-05-05 22:59:42 -07:00
Cheng Lian 6d721c5f71 [SPARK-1678][SPARK-1679] In-memory compression bug fix and made compression configurable, disabled by default
In-memory compression is now configurable in `SparkConf` by the `spark.sql.inMemoryCompression.enabled` property, and is disabled by default.

To help code review, the bug fix is in [the first commit](d537a367ed), compression configuration is in [the second one](4ce09aa8aa).

Author: Cheng Lian <lian.cs.zju@gmail.com>

Closes #608 from liancheng/spark-1678 and squashes the following commits:

66c3a8d [Cheng Lian] Renamed in-memory compression configuration key
f8fb3a0 [Cheng Lian] Added assertion for testing .hasNext of various decoder
4ce09aa [Cheng Lian] Made in-memory compression configurable via SparkConf
d537a36 [Cheng Lian] Fixed SPARK-1678
2014-05-05 19:38:59 -07:00
Michael Armbrust 92b2902ca0 Whitelist Hive Tests
This is ready when Jenkins is.

Author: Michael Armbrust <michael@databricks.com>

Closes #596 from marmbrus/moreTests and squashes the following commits:

85be703 [Michael Armbrust] Blacklist MR required tests.
35bc311 [Michael Armbrust] Add hive golden answers.
ede98fd [Michael Armbrust] More hive gitignore
da096ea [Michael Armbrust] update whitelist
2014-05-03 23:13:53 -07:00
Michael Armbrust b295714708 [SQL] Better logging when applying rules.
Author: Michael Armbrust <michael@databricks.com>

Closes #616 from marmbrus/ruleLogging and squashes the following commits:

39c09fe [Michael Armbrust] Fix off by one error.
5af3537 [Michael Armbrust] Better logging when applying rules.
2014-05-03 18:38:44 -07:00
ArcherShao 9347565f41 Update SchemaRDD.scala
Modify spelling errors

Author: ArcherShao <ArcherShao@users.noreply.github.com>

Closes #619 from ArcherShao/patch-1 and squashes the following commits:

2957195 [ArcherShao] Update SchemaRDD.scala
2014-05-03 00:17:36 -07:00
Michael Armbrust a43d9c14f2 [SQL] SPARK-1661 - Fix regex_serde test
The JIRA in question is actually reporting a bug with Shark, but I wanted to make sure Spark SQL did not have similar problems.  This fixes a bug in our parsing code that was preventing the test from executing, but it looks like the RegexSerDe is working in Spark SQL.

Author: Michael Armbrust <michael@databricks.com>

Closes #595 from marmbrus/fixRegexSerdeTest and squashes the following commits:

a4dc612 [Michael Armbrust] Add files created by hive to gitignore.
efa6402 [Michael Armbrust] Fix Hive serde_regex test.
2014-05-01 21:32:43 -07:00
witgo 030f2c2126 Improved build configuration
1, Fix SPARK-1441: compile spark core error with hadoop 0.23.x
2, Fix SPARK-1491: maven hadoop-provided profile fails to build
3, Fix org.scala-lang: * ,org.apache.avro:* inconsistent versions dependency
4, A modified on the sql/catalyst/pom.xml,sql/hive/pom.xml,sql/core/pom.xml (Four spaces formatted into two spaces)

Author: witgo <witgo@qq.com>

Closes #480 from witgo/format_pom and squashes the following commits:

03f652f [witgo] review commit
b452680 [witgo] Merge branch 'master' of https://github.com/apache/spark into format_pom
bee920d [witgo] revert fix SPARK-1629: Spark Core missing commons-lang dependence
7382a07 [witgo] Merge branch 'master' of https://github.com/apache/spark into format_pom
6902c91 [witgo] fix SPARK-1629: Spark Core missing commons-lang dependence
0da4bc3 [witgo] merge master
d1718ed [witgo] Merge branch 'master' of https://github.com/apache/spark into format_pom
e345919 [witgo] add avro dependency to yarn-alpha
77fad08 [witgo] Merge branch 'master' of https://github.com/apache/spark into format_pom
62d0862 [witgo] Fix org.scala-lang: * inconsistent versions dependency
1a162d7 [witgo] Merge branch 'master' of https://github.com/apache/spark into format_pom
934f24d [witgo] review commit
cf46edc [witgo] exclude jruby
06e7328 [witgo] Merge branch 'SparkBuild' into format_pom
99464d2 [witgo] fix maven hadoop-provided profile fails to build
0c6c1fc [witgo] Fix compile spark core error with hadoop 0.23.x
6851bec [witgo] Maintain consistent SparkBuild.scala, pom.xml
2014-04-28 22:51:46 -07:00
Cheng Hao f735884414 [SQL]Append some missing types for HiveUDF
Add the missing types

Author: Cheng Hao <hao.cheng@intel.com>

Closes #459 from chenghao-intel/missing_types and squashes the following commits:

21cba2e [Cheng Hao] Append some missing types for HiveUDF
2014-04-27 23:59:42 -07:00