[SPARK-9148] [SPARK-10252] [SQL] Update SQL Programming Guide
Author: Michael Armbrust <michael@databricks.com>

Closes #8441 from marmbrus/documentation.
parent fdd466bed7
commit dc86a227e4
@@ -11,7 +11,7 @@ title: Spark SQL and DataFrames

Spark SQL is a Spark module for structured data processing. It provides a programming abstraction called DataFrames and can also act as a distributed SQL query engine.

For how to enable Hive support, please refer to the [Hive Tables](#hive-tables) section.
Spark SQL can also be used to read data from an existing Hive installation. For more on how to configure this feature, please refer to the [Hive Tables](#hive-tables) section.

# DataFrames
@@ -213,6 +213,11 @@ df.groupBy("age").count().show()
// 30   1
{% endhighlight %}

For a complete list of the types of operations that can be performed on a DataFrame refer to the [API Documentation](api/scala/index.html#org.apache.spark.sql.DataFrame).

In addition to simple column references and expressions, DataFrames also have a rich library of functions including string manipulation, date arithmetic, common math operations and more. The complete list is available in the [DataFrame Function Reference](api/scala/index.html#org.apache.spark.sql.functions$).
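As a brief sketch (it is not one of the guide's own examples and simply reuses the `df` with `name` and `age` columns from above), a plain column expression can be mixed freely with these built-in functions:

{% highlight scala %}
import org.apache.spark.sql.functions._

// Upper-case every name and compute age next year, combining built-in
// functions with an ordinary column expression.
df.select(upper(col("name")), col("age") + 1).show()
{% endhighlight %}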

</div>

<div data-lang="java" markdown="1">
@@ -263,6 +268,10 @@ df.groupBy("age").count().show();
// 30   1
{% endhighlight %}

For a complete list of the types of operations that can be performed on a DataFrame refer to the [API Documentation](api/java/org/apache/spark/sql/DataFrame.html).

In addition to simple column references and expressions, DataFrames also have a rich library of functions including string manipulation, date arithmetic, common math operations and more. The complete list is available in the [DataFrame Function Reference](api/java/org/apache/spark/sql/functions.html).

</div>

<div data-lang="python" markdown="1">
@@ -320,6 +329,10 @@ df.groupBy("age").count().show()

{% endhighlight %}

For a complete list of the types of operations that can be performed on a DataFrame refer to the [API Documentation](api/python/pyspark.sql.html#pyspark.sql.DataFrame).

In addition to simple column references and expressions, DataFrames also have a rich library of functions including string manipulation, date arithmetic, common math operations and more. The complete list is available in the [DataFrame Function Reference](api/python/pyspark.sql.html#module-pyspark.sql.functions).

</div>

<div data-lang="r" markdown="1">
@@ -370,10 +383,13 @@ showDF(count(groupBy(df, "age")))

{% endhighlight %}

</div>
For a complete list of the types of operations that can be performed on a DataFrame refer to the [API Documentation](api/R/index.html).

In addition to simple column references and expressions, DataFrames also have a rich library of functions including string manipulation, date arithmetic, common math operations and more. The complete list is available in the [DataFrame Function Reference](api/R/index.html).

</div>

</div>

## Running SQL Queries Programmatically
@@ -870,12 +886,11 @@ saveDF(select(df, "name", "age"), "namesAndAges.parquet", "parquet")

Save operations can optionally take a `SaveMode` that specifies how to handle existing data if
present. It is important to realize that these save modes do not utilize any locking and are not
atomic. Thus, it is not safe to have multiple writers attempting to write to the same location.
Additionally, when performing an `Overwrite`, the data will be deleted before writing out the
atomic. Additionally, when performing an `Overwrite`, the data will be deleted before writing out the
new data.
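As a rough sketch of the behavior described above (reusing a DataFrame `df` and the writer API introduced earlier in the guide), the mode can be given either as the enum or as its lower-case string form:

{% highlight scala %}
import org.apache.spark.sql.SaveMode

// Replaces any data already at the path; the delete-then-write is not atomic,
// so avoid concurrent writers against the same location.
df.write.mode(SaveMode.Overwrite).parquet("namesAndAges.parquet")

// Equivalent string form, usable from any language binding.
df.write.format("parquet").mode("append").save("namesAndAges.parquet")
{% endhighlight %}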

<table class="table">
<tr><th>Scala/Java</th><th>Python</th><th>Meaning</th></tr>
<tr><th>Scala/Java</th><th>Any Language</th><th>Meaning</th></tr>
<tr>
  <td><code>SaveMode.ErrorIfExists</code> (default)</td>
  <td><code>"error"</code> (default)</td>
@@ -1671,12 +1686,12 @@ results <- collect(sql(sqlContext, "FROM src SELECT key, value"))
### Interacting with Different Versions of Hive Metastore

One of the most important pieces of Spark SQL's Hive support is interaction with Hive metastore,
which enables Spark SQL to access metadata of Hive tables. Starting from Spark 1.4.0, a single binary build of Spark SQL can be used to query different versions of Hive metastores, using the configuration described below.
which enables Spark SQL to access metadata of Hive tables. Starting from Spark 1.4.0, a single binary
build of Spark SQL can be used to query different versions of Hive metastores, using the configuration described below.
Note that independent of the version of Hive that is being used to talk to the metastore, internally Spark SQL
will compile against Hive 1.2.1 and use those classes for internal execution (serdes, UDFs, UDAFs, etc).

Internally, Spark SQL uses two Hive clients, one for executing native Hive commands like `SET`
and `DESCRIBE`, the other dedicated for communicating with Hive metastore. The former uses Hive
jars of version 0.13.1, which are bundled with Spark 1.4.0. The latter uses Hive jars of the
version specified by users. An isolated classloader is used here to avoid dependency conflicts.
The following options can be used to configure the version of Hive that is used to retrieve metadata:

<table class="table">
<tr><th>Property Name</th><th>Default</th><th>Meaning</th></tr>
@@ -1685,7 +1700,7 @@ version specified by users. An isolated classloader is used here to avoid depend
  <td><code>0.13.1</code></td>
  <td>
    Version of the Hive metastore. Available
    options are <code>0.12.0</code> and <code>0.13.1</code>. Support for more versions is coming in the future.
    options are <code>0.12.0</code> through <code>1.2.1</code>.
  </td>
</tr>
<tr>
@@ -1696,12 +1711,16 @@ version specified by users. An isolated classloader is used here to avoid depend
  property can be one of three options:
  <ol>
    <li><code>builtin</code></li>
    Use Hive 0.13.1, which is bundled with the Spark assembly jar when <code>-Phive</code> is
    Use Hive 1.2.1, which is bundled with the Spark assembly jar when <code>-Phive</code> is
    enabled. When this option is chosen, <code>spark.sql.hive.metastore.version</code> must be
    either <code>0.13.1</code> or not defined.
    either <code>1.2.1</code> or not defined.
    <li><code>maven</code></li>
    Use Hive jars of specified version downloaded from Maven repositories.
    <li>A classpath in the standard format for both Hive and Hadoop.</li>
    Use Hive jars of specified version downloaded from Maven repositories. This configuration
    is not generally recommended for production deployments.
    <li>A classpath in the standard format for the JVM. This classpath must include all of Hive
    and its dependencies, including the correct version of Hadoop. These jars only need to be
    present on the driver, but if you are running in yarn cluster mode then you must ensure
    they are packaged with your application.</li>
  </ol>
  </td>
</tr>
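As an illustrative sketch of how the two metastore-related settings work together (the values below are only an example, not a recommendation), they would typically be set on the `SparkConf` or in `spark-defaults.conf`:

{% highlight scala %}
import org.apache.spark.SparkConf

// Talk to an existing Hive 0.13.1 metastore while still executing with the
// built-in Hive 1.2.1 classes; the metastore client jars are resolved from Maven.
val conf = new SparkConf()
  .set("spark.sql.hive.metastore.version", "0.13.1")
  .set("spark.sql.hive.metastore.jars", "maven")
{% endhighlight %}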
@@ -2017,6 +2036,28 @@ options.

# Migration Guide

## Upgrading From Spark SQL 1.4 to 1.5

- Optimized execution using manually managed memory (Tungsten) is now enabled by default, along with
  code generation for expression evaluation. These features can both be disabled by setting
  `spark.sql.tungsten.enabled` to `false` (see the configuration sketch after this list).
- Parquet schema merging is no longer enabled by default. It can be re-enabled by setting
  `spark.sql.parquet.mergeSchema` to `true`.
- Resolution of strings to columns in python now supports using dots (`.`) to qualify the column or
  access nested values. For example `df['table.column.nestedField']`. However, this means that if
  your column name contains any dots you must now escape them using backticks (e.g., ``table.`column.with.dots`.nested``).
- In-memory columnar storage partition pruning is on by default. It can be disabled by setting
  `spark.sql.inMemoryColumnarStorage.partitionPruning` to `false`.
- Unlimited precision decimal columns are no longer supported, instead Spark SQL enforces a maximum
  precision of 38. When inferring schema from `BigDecimal` objects, a precision of (38, 18) is now
  used. When no precision is specified in DDL then the default remains `Decimal(10, 0)`.
- Timestamps are now stored at a precision of 1us, rather than 1ns.
- In the `sql` dialect, floating point numbers are now parsed as decimal. HiveQL parsing remains
  unchanged.
- The canonical names of SQL/DataFrame functions are now lower case (e.g. sum vs SUM).
- It has been determined that using the DirectOutputCommitter when speculation is enabled is unsafe
  and thus this output committer will not be used when speculation is on, independent of configuration.
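A minimal sketch of restoring the 1.4 defaults mentioned in the first two items above (assuming a `sqlContext` is already in scope):

{% highlight scala %}
// Illustrative only: flip the new 1.5 defaults back to their 1.4 behavior.
sqlContext.setConf("spark.sql.tungsten.enabled", "false")    // disable Tungsten and codegen
sqlContext.setConf("spark.sql.parquet.mergeSchema", "true")  // re-enable Parquet schema merging
{% endhighlight %}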

## Upgrading from Spark SQL 1.3 to 1.4

#### DataFrame data reader/writer interface

@@ -2038,7 +2079,8 @@ See the API docs for `SQLContext.read` (

#### DataFrame.groupBy retains grouping columns

Based on user feedback, we changed the default behavior of `DataFrame.groupBy().agg()` to retain the grouping columns in the resulting `DataFrame`. To keep the behavior in 1.3, set `spark.sql.retainGroupColumns` to `false`.
Based on user feedback, we changed the default behavior of `DataFrame.groupBy().agg()` to retain the
grouping columns in the resulting `DataFrame`. To keep the behavior in 1.3, set `spark.sql.retainGroupColumns` to `false`.

<div class="codetabs">
<div data-lang="scala" markdown="1">

@@ -2175,7 +2217,7 @@ Python UDF registration is unchanged.
When using DataTypes in Python you will need to construct them (i.e. `StringType()`) instead of
referencing a singleton.

## Migration Guide for Shark User
## Migration Guide for Shark Users

### Scheduling
To set a [Fair Scheduler](job-scheduling.html#fair-scheduler-pools) pool for a JDBC client session,
@@ -2251,6 +2293,7 @@ Spark SQL supports the vast majority of Hive features, such as:
* User defined functions (UDF)
* User defined aggregation functions (UDAF)
* User defined serialization formats (SerDes)
* Window functions
* Joins
  * `JOIN`
  * `{LEFT|RIGHT|FULL} OUTER JOIN`

@@ -2261,7 +2304,7 @@ Spark SQL supports the vast majority of Hive features, such as:
  * `SELECT col FROM ( SELECT a + b AS col from t1) t2`
* Sampling
* Explain
* Partitioned tables
* Partitioned tables including dynamic partition insertion
* View
* All Hive DDL Functions, including:
  * `CREATE TABLE`
@@ -2323,8 +2366,9 @@ releases of Spark SQL.
Hive can optionally merge the small files into fewer large files to avoid overflowing the HDFS
metadata. Spark SQL does not support that.

# Reference

# Data Types
## Data Types

Spark SQL and DataFrames support the following data types:
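As a small, illustrative sketch (the column names are made up), several of these types can be combined into an explicit schema:

{% highlight scala %}
import org.apache.spark.sql.types._

// An explicit schema assembled from Spark SQL data types.
val schema = StructType(Seq(
  StructField("name", StringType, nullable = true),
  StructField("age", IntegerType, nullable = true)))
{% endhighlight %}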
@@ -2937,3 +2981,13 @@ from pyspark.sql.types import *

</div>

## NaN Semantics

There is special handling for not-a-number (NaN) when dealing with `float` or `double` types that
does not exactly match standard floating point semantics.
Specifically:

- NaN = NaN returns true.
- In aggregations all NaN values are grouped together.
- NaN is treated as a normal value in join keys.
- NaN values go last when in ascending order, larger than any other numeric value.
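A short sketch of the aggregation rule above (the data and column names are made up for illustration):

{% highlight scala %}
// Both NaN keys end up in the same group, so this shows one NaN row with count 2.
val nanDF = sqlContext.createDataFrame(Seq(
  (Double.NaN, 1), (Double.NaN, 2), (1.0, 3))).toDF("key", "value")
nanDF.groupBy("key").count().show()
{% endhighlight %}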