spark-instrumented-optimizer/sql
wangfei a83936e109 [SPARK-5617][SQL] fix test failure of SQLQuerySuite
SQLQuerySuite test failure:
[info] - simple select (22 milliseconds)
[info] - sorting (722 milliseconds)
[info] - external sorting (728 milliseconds)
[info] - limit (95 milliseconds)
[info] - date row *** FAILED *** (35 milliseconds)
[info]   Results do not match for query:
[info]   'Limit 1
[info]    'Project [CAST(2015-01-28, DateType) AS c0#3630]
[info]     'UnresolvedRelation [testData], None
[info]
[info]   == Analyzed Plan ==
[info]   Limit 1
[info]    Project [CAST(2015-01-28, DateType) AS c0#3630]
[info]     LogicalRDD [key#0,value#1], MapPartitionsRDD[1] at mapPartitions at ExistingRDD.scala:35
[info]
[info]   == Physical Plan ==
[info]   Limit 1
[info]    Project [16463 AS c0#3630]
[info]     PhysicalRDD [key#0,value#1], MapPartitionsRDD[1] at mapPartitions at ExistingRDD.scala:35
[info]
[info]   == Results ==
[info]   !== Correct Answer - 1 ==   == Spark Answer - 1 ==
[info]   ![2015-01-28]               [2015-01-27] (QueryTest.scala:77)
[info]   org.scalatest.exceptions.TestFailedException:
[info]   at org.scalatest.Assertions$class.newAssertionFailedException(Assertions.scala:495)
[info]   at org.scalatest.FunSuite.newAssertionFailedException(FunSuite.scala:1555)
[info]   at org.scalatest.Assertions$class.fail(Assertions.scala:1328)
[info]   at org.scalatest.FunSuite.fail(FunSuite.scala:1555)
[info]   at org.apache.spark.sql.QueryTest.checkAnswer(QueryTest.scala:77)
[info]   at org.apache.spark.sql.QueryTest.checkAnswer(QueryTest.scala:95)
[info]   at org.apache.spark.sql.SQLQuerySuite$$anonfun$23.apply$mcV$sp(SQLQuerySuite.scala:300)
[info]   at org.apache.spark.sql.SQLQuerySuite$$anonfun$23.apply(SQLQuerySuite.scala:300)
[info]   at org.apache.spark.sql.SQLQuerySuite$$anonfun$23.apply(SQLQuerySuite.scala:300)
[info]   at org.scalatest.Transformer$$anonfun$apply$1.apply$mcV$sp(Transformer.scala:22)
[info]   at org.scalatest.OutcomeOf$class.outcomeOf(OutcomeOf.scala:85)
[info]   at org.scalatest.OutcomeOf$.outcomeOf(OutcomeOf.scala:104)
[info]   at org.scalatest.Transformer.apply(Transformer.scala:22)
[info]   at org.scalatest.Transformer.apply(Transformer.scala:20)
[info]   at org.scalatest.FunSuiteLike$$anon$1.apply(FunSuiteLike.scala:166)
[info]   at org.scalatest.Suite$class.withFixture(Suite.scala:1122)
[info]   at org.scalatest.FunSuite.withFixture(FunSuite.scala:1555)
[info]   at org.scalatest.FunSuiteLike$class.invokeWithFixture$1(FunSuiteLike.scala:163)
[info]   at org.scalatest.FunSuiteLike$$anonfun$runTest$1.apply(FunSuiteLike.scala:175)
[info]   at org.scalatest.FunSuiteLike$$anonfun$runTest$1.apply(FunSuiteLike.scala:175)
[info]   at org.scalatest.SuperEngine.runTestImpl(Engine.scala:306)
[info]   at org.scalatest.FunSuiteLike$class.runTest(FunSuiteLike.scala:175)
[info]   at org.scalatest.FunSuite.runTest(FunSuite.scala:1555)
[info]   at org.scalatest.FunSuiteLike$$anonfun$runTests$1.apply(FunSuiteLike.scala:208)
[info]   at org.scalatest.FunSuiteLike$$anonfun$runTests$1.apply(FunSuiteLike.scala:208)
[info]   at org.scalatest.SuperEngine$$anonfun$traverseSubNode

Author: wangfei <wangfei1@huawei.com>

Closes #4395 from scwf/SQLQuerySuite and squashes the following commits:

1431a2d [wangfei] fix conflicts
c35fe5e [wangfei] minor fix
01dab3a [wangfei] fix test failure of SQLQuerySuite
2015-02-05 12:44:12 -08:00
..
catalyst [SPARK-5602][SQL] Better support for creating DataFrame from local data collection 2015-02-04 19:53:57 -08:00
core [SPARK-5617][SQL] fix test failure of SQLQuerySuite 2015-02-05 12:44:12 -08:00
hive [SPARK-5612][SQL] Move DataFrame implicit functions into SQLContext.implicits. 2015-02-04 23:44:34 -08:00
hive-thriftserver [SPARK-5097][SQL] DataFrame 2015-01-27 16:08:24 -08:00
README.md [SQL][Hiveconsole] Bring hive console code up to date and update README.md 2015-02-04 15:13:54 -08:00

Spark SQL

This module provides support for executing relational queries expressed in either SQL or a LINQ-like Scala DSL.

Spark SQL is broken up into four subprojects:

  • Catalyst (sql/catalyst) - An implementation-agnostic framework for manipulating trees of relational operators and expressions.
  • Execution (sql/core) - A query planner / execution engine for translating Catalysts logical query plans into Spark RDDs. This component also includes a new public interface, SQLContext, that allows users to execute SQL or LINQ statements against existing RDDs and Parquet files.
  • Hive Support (sql/hive) - Includes an extension of SQLContext called HiveContext that allows users to write queries using a subset of HiveQL and access data from a Hive Metastore using Hive SerDes. There are also wrappers that allows users to run queries that include Hive UDFs, UDAFs, and UDTFs.
  • HiveServer and CLI support (sql/hive-thriftserver) - Includes support for the SQL CLI (bin/spark-sql) and a HiveServer2 (for JDBC/ODBC) compatible server.

Other dependencies for developers

In order to create new hive test cases , you will need to set several environmental variables.

export HIVE_HOME="<path to>/hive/build/dist"
export HIVE_DEV_HOME="<path to>/hive/"
export HADOOP_HOME="<path to>/hadoop-1.0.4"

Using the console

An interactive scala console can be invoked by running build/sbt hive/console. From here you can execute queries with HiveQl and manipulate DataFrame by using DSL.

catalyst$ build/sbt hive/console

[info] Starting scala interpreter...
import org.apache.spark.sql.Dsl._
import org.apache.spark.sql.hive._
import org.apache.spark.sql.hive.test.TestHive._
import org.apache.spark.sql.types._
import org.apache.spark.sql.parquet.ParquetTestData
Welcome to Scala version 2.10.4 (Java HotSpot(TM) 64-Bit Server VM, Java 1.7.0_45).
Type in expressions to have them evaluated.
Type :help for more information.

scala> val query = sql("SELECT * FROM (SELECT * FROM src) a")
query: org.apache.spark.sql.DataFrame = org.apache.spark.sql.DataFrame@74448eed

Query results are DataFrames and can be operated as such.

scala> query.collect()
res2: Array[org.apache.spark.sql.Row] = Array([238,val_238], [86,val_86], [311,val_311], [27,val_27]...

You can also build further queries on top of these DataFrames using the query DSL.

scala> query.where('key > 30).select(avg('key)).collect()
res3: Array[org.apache.spark.sql.Row] = Array([274.79025423728814])