History

Josh Rosen 6bcbf9b743 [SPARK-17351] Refactor JDBCRDD to expose ResultSet -> Seq[Row] utility methods This patch refactors the internals of the JDBC data source in order to allow some of its code to be re-used in an automated comparison testing harness. Here are the key changes: - Move the JDBC `ResultSetMetadata` to `StructType` conversion logic from `JDBCRDD.resolveTable()` to the `JdbcUtils` object (as a new `getSchema(ResultSet, JdbcDialect)` method), allowing it to be applied on `ResultSet`s that are created elsewhere. - Move the `ResultSet` to `InternalRow` conversion methods from `JDBCRDD` to `JdbcUtils`: - It makes sense to move the `JDBCValueGetter` type and `makeGetter` functions here given that their write-path counterparts (`JDBCValueSetter`) are already in `JdbcUtils`. - Add an internal `resultSetToSparkInternalRows` method which takes a `ResultSet` and schema and returns an `Iterator[InternalRow]`. This effectively extracts the main loop of `JDBCRDD` into its own method. - Add a public `resultSetToRows` method to `JdbcUtils`, which wraps the minimal machinery around `resultSetToSparkInternalRows` in order to allow it to be called from outside of a Spark job. - Make `JdbcDialect.get` into a `DeveloperApi` (`JdbcDialect` itself is already a `DeveloperApi`). Put together, these changes enable the following testing pattern: ```scala val jdbResultSet: ResultSet = conn.prepareStatement(query).executeQuery() val resultSchema: StructType = JdbcUtils.getSchema(jdbResultSet, JdbcDialects.get("jdbc:postgresql")) val jdbcRows: Seq[Row] = JdbcUtils.resultSetToRows(jdbResultSet, schema).toSeq checkAnswer(sparkResult, jdbcRows) // in a test case ``` Author: Josh Rosen <joshrosen@databricks.com> Closes #14907 from JoshRosen/modularize-jdbc-internals.		2016-09-02 18:53:12 +02:00
..
catalyst	[SPARK-16935][SQL] Verification of Function-related ExternalCatalog APIs	2016-09-02 22:31:01 +08:00
core	[SPARK-17351] Refactor JDBCRDD to expose ResultSet -> Seq[Row] utility methods	2016-09-02 18:53:12 +02:00
hive	[SPARK-16935][SQL] Verification of Function-related ExternalCatalog APIs	2016-09-02 22:31:01 +08:00
hive-thriftserver	[SPARK-17190][SQL] Removal of HiveSharedState	2016-08-25 12:50:03 +08:00
README.md	[SPARK-16557][SQL] Remove stale doc in sql/README.md	2016-07-14 19:24:42 -07:00

README.md

Spark SQL

This module provides support for executing relational queries expressed in either SQL or the DataFrame/Dataset API.

Spark SQL is broken up into four subprojects:

Catalyst (sql/catalyst) - An implementation-agnostic framework for manipulating trees of relational operators and expressions.
Execution (sql/core) - A query planner / execution engine for translating Catalyst's logical query plans into Spark RDDs. This component also includes a new public interface, SQLContext, that allows users to execute SQL or LINQ statements against existing RDDs and Parquet files.
Hive Support (sql/hive) - Includes an extension of SQLContext called HiveContext that allows users to write queries using a subset of HiveQL and access data from a Hive Metastore using Hive SerDes. There are also wrappers that allows users to run queries that include Hive UDFs, UDAFs, and UDTFs.
HiveServer and CLI support (sql/hive-thriftserver) - Includes support for the SQL CLI (bin/spark-sql) and a HiveServer2 (for JDBC/ODBC) compatible server.