[SPARK-15863][SQL][DOC][FOLLOW-UP] Update SQL programming guide.
## What changes were proposed in this pull request?

This PR makes several updates to the SQL programming guide.

Author: Yin Huai <yhuai@databricks.com>

Closes #13938 from yhuai/doc.
parent a0da854fb3
commit dd6b7dbe70
@@ -25,29 +25,35 @@ the `spark-shell`, `pyspark` shell, or `sparkR` shell.
One use of Spark SQL is to execute SQL queries.
Spark SQL can also be used to read data from an existing Hive installation. For more on how to
configure this feature, please refer to the [Hive Tables](#hive-tables) section. When running
-SQL from within another programming language the results will be returned as a [DataFrame](#datasets-and-dataframes).
+SQL from within another programming language the results will be returned as a [Dataset/DataFrame](#datasets-and-dataframes).
You can also interact with the SQL interface using the [command-line](#running-the-spark-sql-cli)
or over [JDBC/ODBC](#running-the-thrift-jdbcodbc-server).

## Datasets and DataFrames

-A Dataset is a new interface added in Spark 1.6 that tries to provide the benefits of RDDs (strong
+A Dataset is a distributed collection of data.
+Dataset is a new interface added in Spark 1.6 that provides the benefits of RDDs (strong
typing, ability to use powerful lambda functions) with the benefits of Spark SQL's optimized
execution engine. A Dataset can be [constructed](#creating-datasets) from JVM objects and then
manipulated using functional transformations (`map`, `flatMap`, `filter`, etc.).
+The Dataset API is available in [Scala][scala-datasets] and
+[Java][java-datasets]. Python does not have the support for the Dataset API. But due to Python's dynamic nature,
+many of the benefits of the Dataset API are already available (i.e. you can access the field of a row by name naturally
+`row.columnName`). The case for R is similar.

-The Dataset API is the successor of the DataFrame API, which was introduced in Spark 1.3. In Spark
-2.0, Datasets and DataFrames are unified, and DataFrames are now equivalent to Datasets of `Row`s.
-In fact, `DataFrame` is simply a type alias of `Dataset[Row]` in [the Scala API][scala-datasets].
-However, [Java API][java-datasets] users must use `Dataset<Row>` instead.
+A DataFrame is a *Dataset* organized into named columns. It is conceptually
+equivalent to a table in a relational database or a data frame in R/Python, but with richer
+optimizations under the hood. DataFrames can be constructed from a wide array of [sources](#data-sources) such
+as: structured data files, tables in Hive, external databases, or existing RDDs.
+The DataFrame API is available in Scala,
+Java, [Python](api/python/pyspark.sql.html#pyspark.sql.DataFrame), and [R](api/R/index.html).
+In Scala and Java, a DataFrame is represented by a Dataset of `Row`s.
+In [the Scala API][scala-datasets], `DataFrame` is simply a type alias of `Dataset[Row]`.
+While, in [Java API][java-datasets], users need to use `Dataset<Row>` to represent a `DataFrame`.

[scala-datasets]: api/scala/index.html#org.apache.spark.sql.Dataset
[java-datasets]: api/java/index.html?org/apache/spark/sql/Dataset.html

-Python does not have support for the Dataset API, but due to its dynamic nature many of the
-benefits are already available (i.e. you can access the field of a row by name naturally
-`row.columnName`). The case for R is similar.
-
Throughout this document, we will often refer to Scala/Java Datasets of `Row`s as DataFrames.

# Getting Started
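
To make the line about results being returned as a Dataset/DataFrame concrete, here is a minimal Scala sketch (not part of the commit); the `people` temporary view is a hypothetical name and is assumed to have been registered beforehand:

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}

// A SparkSession is the entry point for Spark SQL in Spark 2.0.
val spark = SparkSession.builder().appName("sql-query-example").getOrCreate()

// Running a SQL statement from Scala returns the result as a DataFrame,
// i.e. a Dataset[Row]; the `people` view is assumed to already exist.
val teenagers: DataFrame = spark.sql("SELECT name FROM people WHERE age BETWEEN 13 AND 19")
teenagers.show()
```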
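
The Dataset paragraph in the hunk above mentions construction from JVM objects and functional transformations such as `map` and `filter`. A short illustrative sketch, written for the spark-shell with a made-up `Person` case class:

```scala
import org.apache.spark.sql.SparkSession

case class Person(name: String, age: Long)

val spark = SparkSession.builder().appName("dataset-example").getOrCreate()
// The implicits bring in the encoders needed to build typed Datasets.
import spark.implicits._

// A strongly typed Dataset constructed from JVM objects.
val people = Seq(Person("Ann", 34), Person("Bob", 17)).toDS()

// Functional transformations (filter, map, ...) operate on the typed objects,
// while still running on Spark SQL's optimized execution engine.
val adultNames = people.filter(_.age >= 18).map(_.name)
adultNames.show()
```

Because the transformations are expressed against a typed `Person`, a mistyped field name fails at compile time rather than at run time, which is the "strong typing" benefit the paragraph refers to.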
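
Likewise, the DataFrame paragraph can be illustrated with a brief sketch; the JSON path is hypothetical, and any of the listed sources (structured files, Hive tables, external databases, existing RDDs) is reached through the same `spark.read` entry point:

```scala
import org.apache.spark.sql.{DataFrame, Dataset, Row, SparkSession}

val spark = SparkSession.builder().appName("dataframe-example").getOrCreate()

// DataFrames are built from external sources; the path here is only an example.
val df: DataFrame = spark.read.json("path/to/people.json")

// In the Scala API, DataFrame is merely a type alias for Dataset[Row],
// so the same value can be held as a Dataset of Rows.
val rows: Dataset[Row] = df
rows.select("name").show()
```

That alias is also why the guide can refer to Scala/Java Datasets of `Row`s simply as DataFrames.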
@@ -2042,14 +2048,6 @@ that these options will be deprecated in future release as more optimizations ar
<code>ANALYZE TABLE <tableName> COMPUTE STATISTICS noscan</code> has been run.
</td>
</tr>
-<tr>
-<td><code>spark.sql.tungsten.enabled</code></td>
-<td>true</td>
-<td>
-When true, use the optimized Tungsten physical execution backend which explicitly manages memory
-and dynamically generates bytecode for expression evaluation.
-</td>
-</tr>
<tr>
<td><code>spark.sql.shuffle.partitions</code></td>
<td>200</td>
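
For readers who want to see how options like those in the table above are applied, a rough sketch follows (not part of the commit; the table name is invented and Hive support is assumed for the `ANALYZE TABLE` statement):

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("tuning-example")
  .enableHiveSupport()
  .getOrCreate()

// Override the default of 200 shuffle partitions shown above.
spark.conf.set("spark.sql.shuffle.partitions", "100")

// Gather table statistics so that small Hive tables can be considered for
// broadcast joins; `my_table` is a hypothetical Hive table.
spark.sql("ANALYZE TABLE my_table COMPUTE STATISTICS noscan")
```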