[SPARK-7849] [SQL] [Docs] Updates SQL programming guide for 1.4
Author: Cheng Lian <lian@databricks.com>
Closes #6520 from liancheng/spark-7849 and squashes the following commits:
705264b [Cheng Lian] Updates SQL programming guide for 1.4
(cherry picked from commit 6e3f0c7810)
Signed-off-by: Reynold Xin <rxin@databricks.com>
Parent: e7ba3ea86b
Commit: b2b7601471
|
@ -11,6 +11,7 @@ title: Spark SQL and DataFrames
|
||||||
|
|
||||||
Spark SQL is a Spark module for structured data processing. It provides a programming abstraction called DataFrames and can also act as distributed SQL query engine.
|
Spark SQL is a Spark module for structured data processing. It provides a programming abstraction called DataFrames and can also act as distributed SQL query engine.
|
||||||
|
|
||||||
|
For how to enable Hive support, please refer to the [Hive Tables](#hive-tables) section.
|
||||||
|
|
||||||
# DataFrames
|
# DataFrames
|
||||||
|
|
||||||
|
@ -906,7 +907,7 @@ new data.
|
||||||
<td>
|
<td>
|
||||||
Ignore mode means that when saving a DataFrame to a data source, if data already exists,
|
Ignore mode means that when saving a DataFrame to a data source, if data already exists,
|
||||||
the save operation is expected to not save the contents of the DataFrame and to not
|
the save operation is expected to not save the contents of the DataFrame and to not
|
||||||
change the existing data. This is similar to a `CREATE TABLE IF NOT EXISTS` in SQL.
|
change the existing data. This is similar to a <code>CREATE TABLE IF NOT EXISTS</code> in SQL.
|
||||||
</td>
|
</td>
|
||||||
</tr>
|
</tr>
|
||||||
</table>
|
</table>
|
||||||
|
@ -1030,7 +1031,7 @@ teenagers <- sql(sqlContext, "SELECT name FROM parquetFile WHERE age >= 13 AND a
|
||||||
teenNames <- map(teenagers, function(p) { paste("Name:", p$name)})
|
teenNames <- map(teenagers, function(p) { paste("Name:", p$name)})
|
||||||
for (teenName in collect(teenNames)) {
|
for (teenName in collect(teenNames)) {
|
||||||
cat(teenName, "\n")
|
cat(teenName, "\n")
|
||||||
}
|
}
|
||||||
{% endhighlight %}
|
{% endhighlight %}
|
||||||
|
|
||||||
</div>
|
</div>
|
||||||
|
@ -1502,7 +1503,7 @@ Row[] results = sqlContext.sql("FROM src SELECT key, value").collect();
|
||||||
<div data-lang="python" markdown="1">
|
<div data-lang="python" markdown="1">
|
||||||
|
|
||||||
When working with Hive one must construct a `HiveContext`, which inherits from `SQLContext`, and
|
When working with Hive one must construct a `HiveContext`, which inherits from `SQLContext`, and
|
||||||
adds support for finding tables in the MetaStore and writing queries using HiveQL.
|
adds support for finding tables in the MetaStore and writing queries using HiveQL.
|
||||||
{% highlight python %}
|
{% highlight python %}
|
||||||
# sc is an existing SparkContext.
|
# sc is an existing SparkContext.
|
||||||
from pyspark.sql import HiveContext
|
from pyspark.sql import HiveContext
|
||||||
|
@ -1537,6 +1538,82 @@ results = sqlContext.sql("FROM src SELECT key, value").collect()
|
||||||
</div>
|
</div>
|
||||||
</div>
|
</div>
|
||||||
|
|
||||||
|
### Interacting with Different Versions of Hive Metastore
|
||||||
|
|
||||||
|
One of the most important pieces of Spark SQL's Hive support is interaction with Hive metastore,
|
||||||
|
which enables Spark SQL to access metadata of Hive tables. Starting from Spark 1.2.0, Spark SQL can
|
||||||
|
talk to two versions of Hive metastore, either 0.12.0 or 0.13.1, defaulting to the latter. However, to
|
||||||
|
switch to the desired Hive metastore version, users have to rebuild the assembly jar with the proper profile
|
||||||
|
flags (either `-Phive-0.12.0` or `-Phive-0.13.1`), which is quite inconvenient.
|
||||||
|
|
||||||
|
Starting from 1.4.0, users no longer need to rebuild the assembly jar to switch Hive metastore
|
||||||
|
version. Instead, configuration properties described in the table below can be used to specify
|
||||||
|
the desired Hive metastore version. Currently, supported versions are still limited to 0.13.1 and
|
||||||
|
0.12.0, but we are working on a more generalized mechanism to support a wider range of versions.
|
||||||
|
|
||||||
|
Internally, Spark SQL 1.4.0 uses two Hive clients, one for executing native Hive commands like `SET`
|
||||||
|
and `DESCRIBE`, the other dedicated to communicating with the Hive metastore. The former uses Hive
|
||||||
|
jars of version 0.13.1, which are bundled with Spark 1.4.0. The latter uses Hive jars of the
|
||||||
|
version specified by users. An isolated classloader is used here to avoid dependency conflicts.
|
||||||
|
|
||||||
|
<table class="table">
|
||||||
|
<tr><th>Property Name</th><th>Meaning</th></tr>
|
||||||
|
<tr>
|
||||||
|
<td><code>spark.sql.hive.metastore.version</code></td>
|
||||||
|
<td>
|
||||||
|
The version of the Hive client that will be used to communicate with the metastore. Available
|
||||||
|
options are <code>0.12.0</code> and <code>0.13.1</code>. Defaults to <code>0.13.1</code>.
|
||||||
|
</td>
|
||||||
|
</tr>
|
||||||
|
|
||||||
|
<tr>
|
||||||
|
<td><code>spark.sql.hive.metastore.jars</code></td>
|
||||||
|
<td>
|
||||||
|
The location of the jars that should be used to instantiate the HiveMetastoreClient. This
|
||||||
|
property can be one of three options:
|
||||||
|
<ol>
|
||||||
|
<li><code>builtin</code></li>
|
||||||
|
Use Hive 0.13.1, which is bundled with the Spark assembly jar when <code>-Phive</code> is
|
||||||
|
enabled. When this option is chosen, <code>spark.sql.hive.metastore.version</code> must be
|
||||||
|
either <code>0.13.1</code> or not defined.
|
||||||
|
<li><code>maven</code></li>
|
||||||
|
Use Hive jars of specified version downloaded from Maven repositories.
|
||||||
|
<li>A classpath in the standard format for both Hive and Hadoop.</li>
|
||||||
|
</ol>
|
||||||
|
Defaults to <code>builtin</code>.
|
||||||
|
</td>
|
||||||
|
</tr>
|
||||||
|
|
||||||
|
<tr>
|
||||||
|
<td><code>spark.sql.hive.metastore.sharedPrefixes</code></td>
|
||||||
|
|
||||||
|
<td>
|
||||||
|
<p>
|
||||||
|
A comma-separated list of class prefixes that should be loaded using the classloader that is
|
||||||
|
shared between Spark SQL and a specific version of Hive. An example of classes that should
|
||||||
|
be shared is JDBC drivers that are needed to talk to the metastore. Other classes that need
|
||||||
|
to be shared are those that interact with classes that are already shared. For example,
|
||||||
|
custom appenders that are used by log4j.
|
||||||
|
</p>
|
||||||
|
<p>
|
||||||
|
Defaults to <code>com.mysql.jdbc,org.postgresql,com.microsoft.sqlserver,oracle.jdbc</code>.
|
||||||
|
</p>
|
||||||
|
</td>
|
||||||
|
</tr>
|
||||||
|
|
||||||
|
<tr>
|
||||||
|
<td><code>spark.sql.hive.metastore.barrierPrefixes</code></td>
|
||||||
|
<td>
|
||||||
|
<p>
|
||||||
|
A comma-separated list of class prefixes that should explicitly be reloaded for each version
|
||||||
|
of Hive that Spark SQL is communicating with. For example, Hive UDFs that are declared in a
|
||||||
|
prefix that typically would be shared (i.e. <code>org.apache.spark.*</code>).
|
||||||
|
</p>
|
||||||
|
<p>Defaults to empty.</p>
|
||||||
|
</td>
|
||||||
|
</tr>
|
||||||
|
</table>
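
As a rough illustration of how the properties in the table above fit together, the following plain-Python sketch checks the constraints stated there (supported versions, and `builtin` requiring 0.13.1 or no version) and prints the properties as `--conf` arguments. The helper name `hive_metastore_conf` is hypothetical and is not part of Spark; only the property names and allowed values come from the table.

```python
# Sketch only: encodes the rules from the table above.
# hive_metastore_conf is a hypothetical helper, not a Spark API.
SUPPORTED_VERSIONS = ("0.12.0", "0.13.1")
DEFAULT_VERSION = "0.13.1"


def hive_metastore_conf(version=None, jars="builtin"):
    """Return a dict of Spark properties for the Hive metastore client."""
    effective = DEFAULT_VERSION if version is None else version
    if effective not in SUPPORTED_VERSIONS:
        raise ValueError("unsupported Hive metastore version: %s" % effective)
    # "builtin" only works with 0.13.1 or with the version left undefined.
    if jars == "builtin" and effective != "0.13.1":
        raise ValueError("'builtin' jars require version 0.13.1 or no version")
    conf = {"spark.sql.hive.metastore.jars": jars}
    if version is not None:
        conf["spark.sql.hive.metastore.version"] = version
    return conf


# Talk to a 0.12.0 metastore using Hive jars downloaded from Maven.
conf = hive_metastore_conf(version="0.12.0", jars="maven")
for key in sorted(conf):
    print("--conf %s=%s" % (key, conf[key]))
```

The printed arguments could then be passed to `spark-submit`, or the same key/value pairs could be set on a `SparkConf` before the `HiveContext` is created.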
|
||||||
|
|
||||||
## JDBC To Other Databases
|
## JDBC To Other Databases
|
||||||
|
|
||||||
Spark SQL also includes a data source that can read data from other databases using JDBC. This
|
Spark SQL also includes a data source that can read data from other databases using JDBC. This
|
||||||
|
@ -1570,7 +1647,7 @@ the Data Sources API. The following options are supported:
|
||||||
<tr>
|
<tr>
|
||||||
<td><code>dbtable</code></td>
|
<td><code>dbtable</code></td>
|
||||||
<td>
|
<td>
|
||||||
The JDBC table that should be read. Note that anything that is valid in a `FROM` clause of
|
The JDBC table that should be read. Note that anything that is valid in a <code>FROM</code> clause of
|
||||||
a SQL query can be used. For example, instead of a full table you could also use a
|
a SQL query can be used. For example, instead of a full table you could also use a
|
||||||
subquery in parentheses.
|
subquery in parentheses.
|
||||||
</td>
|
</td>
|
||||||
|
@ -1714,7 +1791,7 @@ that these options will be deprecated in future release as more optimizations ar
|
||||||
Configures the maximum size in bytes for a table that will be broadcast to all worker nodes when
|
Configures the maximum size in bytes for a table that will be broadcast to all worker nodes when
|
||||||
performing a join. By setting this value to -1 broadcasting can be disabled. Note that currently
|
performing a join. By setting this value to -1 broadcasting can be disabled. Note that currently
|
||||||
statistics are only supported for Hive Metastore tables where the command
|
statistics are only supported for Hive Metastore tables where the command
|
||||||
`ANALYZE TABLE <tableName> COMPUTE STATISTICS noscan` has been run.
|
<code>ANALYZE TABLE <tableName> COMPUTE STATISTICS noscan</code> has been run.
|
||||||
</td>
|
</td>
|
||||||
</tr>
|
</tr>
|
||||||
<tr>
|
<tr>
|
||||||
|
@ -1737,7 +1814,9 @@ that these options will be deprecated in future release as more optimizations ar
|
||||||
|
|
||||||
# Distributed SQL Engine
|
# Distributed SQL Engine
|
||||||
|
|
||||||
Spark SQL can also act as a distributed query engine using its JDBC/ODBC or command-line interface. In this mode, end-users or applications can interact with Spark SQL directly to run SQL queries, without the need to write any code.
|
Spark SQL can also act as a distributed query engine using its JDBC/ODBC or command-line interface.
|
||||||
|
In this mode, end-users or applications can interact with Spark SQL directly to run SQL queries,
|
||||||
|
without the need to write any code.
|
||||||
|
|
||||||
## Running the Thrift JDBC/ODBC server
|
## Running the Thrift JDBC/ODBC server
|
||||||
|
|
||||||
|
|