This module provides support for executing relational queries expressed in either SQL or a LINQ-like Scala DSL.
Spark SQL is broken up into three subprojects:
- Catalyst (sql/catalyst) - An implementation-agnostic framework for manipulating trees of relational operators and expressions.
- Execution (sql/core) - A query planner / execution engine for translating Catalyst’s logical query plans into Spark RDDs. This component also includes a new public interface, SQLContext, that allows users to execute SQL or LINQ statements against existing RDDs and Parquet files.
- Hive Support (sql/hive) - Includes an extension of SQLContext called HiveContext that allows users to write queries using a subset of HiveQL and access data from a Hive Metastore using Hive SerDes. There are also wrappers that allow users to run queries that include Hive UDFs, UDAFs, and UDTFs.
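
To make the idea of "manipulating trees of relational operators" concrete, here is a minimal, self-contained sketch. It is a toy model only: `Plan`, `Relation`, `Project`, and this `transform` are hypothetical stand-ins written for illustration and are not Catalyst's actual classes. The traversal order (rule applied to a node before its children) is likewise an assumption of the sketch.

```scala
// Toy model only: these types are NOT Catalyst's classes, just an illustration.
sealed trait Plan {
  def output: Seq[String]
  def children: Seq[Plan]
  def withChildren(newChildren: Seq[Plan]): Plan

  // Apply `rule` to this node if it matches, then recurse into the children
  // of whatever node results (a pre-order traversal).
  def transform(rule: PartialFunction[Plan, Plan]): Plan = {
    val afterRule = rule.applyOrElse(this, (p: Plan) => p)
    afterRule.withChildren(afterRule.children.map(_.transform(rule)))
  }
}

// Leaf node: a named table with a fixed list of output columns.
case class Relation(name: String, output: Seq[String]) extends Plan {
  def children: Seq[Plan] = Nil
  def withChildren(newChildren: Seq[Plan]): Plan = this
}

// Unary node: keeps only the columns in `projectList`.
case class Project(projectList: Seq[String], child: Plan) extends Plan {
  def output: Seq[String] = projectList
  def children: Seq[Plan] = Seq(child)
  def withChildren(newChildren: Seq[Plan]): Plan = copy(child = newChildren.head)
}

object RedundantProjectDemo {
  val src: Plan = Relation("src", Seq("key", "value", "ts"))

  // Two stacked projections over the same columns: the outer one is redundant.
  val plan: Plan = Project(Seq("key", "value"), Project(Seq("key", "value"), src))

  // Drop any Project whose column list equals its child's output.
  val optimized: Plan = plan transform {
    case Project(projectList, child) if projectList == child.output => child
  }

  def main(args: Array[String]): Unit = println(optimized)
}
```

In this sketch the outer `Project` is removed because it selects exactly the columns its child already produces, while the inner `Project` stays because it genuinely narrows the relation's output.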
Other dependencies for developers
---------------------------------
To create new Hive test cases, you will need to set several environment variables.
```
export HIVE_HOME="<pathto>/hive/build/dist"
export HIVE_DEV_HOME="<pathto>/hive/"
export HADOOP_HOME="<pathto>/hadoop-1.0.4"
```
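
Before generating test cases, it can help to confirm that the variables are actually visible to the JVM. The following is a small sketch of such a check; `CheckHiveTestEnv` is a hypothetical helper written for this README, not something shipped with Spark.

```scala
// Hypothetical helper, not part of Spark: reports which required variables are unset.
object CheckHiveTestEnv {
  // The three variables named in the export lines above.
  val required: Seq[String] = Seq("HIVE_HOME", "HIVE_DEV_HOME", "HADOOP_HOME")

  // Returns the names that are absent or empty in the given environment.
  def missing(env: Map[String, String]): Seq[String] =
    required.filterNot(name => env.get(name).exists(_.nonEmpty))

  def main(args: Array[String]): Unit = {
    val unset = missing(sys.env)
    if (unset.isEmpty) println("All Hive test variables are set.")
    else println("Missing: " + unset.mkString(", "))
  }
}
```

The check only verifies that each variable is set and non-empty; it does not validate that the paths point at working Hive or Hadoop builds.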
Using the console
=================
An interactive Scala console can be invoked by running `sbt/sbt hive/console`. From here you can execute queries and inspect the various stages of query optimization.
From the console you can even write rules that transform query plans. For example, the logical plan of the query below contains redundant project operators that do no work. This redundancy can be eliminated using the `transform` function that is available on all [`TreeNode`](http://databricks.github.io/catalyst/latest/api/#catalyst.trees.TreeNode) objects.
```scala
scala> query.logicalPlan
res1: catalyst.plans.logical.LogicalPlan =
Project {key#0,value#1}
 Project {key#0,value#1}
  MetastoreRelation default, src, None
scala> query.logicalPlan transform {
| case Project(projectList, child) if projectList == child.output => child