[SPARK-12069][SQL] Update documentation with Datasets

Author: Michael Armbrust <michael@databricks.com>

Closes #10060 from marmbrus/docs.

commit 3959489423 (parent 9494521695)
@@ -71,7 +71,7 @@
 <li><a href="programming-guide.html">Spark Programming Guide</a></li>
 <li class="divider"></li>
 <li><a href="streaming-programming-guide.html">Spark Streaming</a></li>
-<li><a href="sql-programming-guide.html">DataFrames and SQL</a></li>
+<li><a href="sql-programming-guide.html">DataFrames, Datasets and SQL</a></li>
 <li><a href="mllib-guide.html">MLlib (Machine Learning)</a></li>
 <li><a href="graphx-programming-guide.html">GraphX (Graph Processing)</a></li>
 <li><a href="bagel-programming-guide.html">Bagel (Pregel on Spark)</a></li>
@@ -87,7 +87,7 @@ options for deployment:
 in all supported languages (Scala, Java, Python, R)
 * Modules built on Spark:
 * [Spark Streaming](streaming-programming-guide.html): processing real-time data streams
-* [Spark SQL and DataFrames](sql-programming-guide.html): support for structured data and relational queries
+* [Spark SQL, Datasets, and DataFrames](sql-programming-guide.html): support for structured data and relational queries
 * [MLlib](mllib-guide.html): built-in machine learning library
 * [GraphX](graphx-programming-guide.html): Spark's new API for graph processing
@@ -1,6 +1,6 @@
 ---
 layout: global
-displayTitle: Spark SQL and DataFrame Guide
+displayTitle: Spark SQL, DataFrames and Datasets Guide
 title: Spark SQL and DataFrames
 ---
@@ -9,18 +9,51 @@ title: Spark SQL and DataFrames

 # Overview

-Spark SQL is a Spark module for structured data processing. It provides a programming abstraction called DataFrames and can also act as distributed SQL query engine.
-
-Spark SQL can also be used to read data from an existing Hive installation. For more on how to configure this feature, please refer to the [Hive Tables](#hive-tables) section.
-
-# DataFrames
-
-A DataFrame is a distributed collection of data organized into named columns. It is conceptually equivalent to a table in a relational database or a data frame in R/Python, but with richer optimizations under the hood. DataFrames can be constructed from a wide array of sources such as: structured data files, tables in Hive, external databases, or existing RDDs.
-
-The DataFrame API is available in [Scala](api/scala/index.html#org.apache.spark.sql.DataFrame), [Java](api/java/index.html?org/apache/spark/sql/DataFrame.html), [Python](api/python/pyspark.sql.html#pyspark.sql.DataFrame), and [R](api/R/index.html).
-
-All of the examples on this page use sample data included in the Spark distribution and can be run in the `spark-shell`, `pyspark` shell, or `sparkR` shell.
+Spark SQL is a Spark module for structured data processing. Unlike the basic Spark RDD API, the interfaces provided
+by Spark SQL give Spark more information about the structure of both the data and the computation being performed. Internally,
+Spark SQL uses this extra information to perform additional optimizations. There are several ways to
+interact with Spark SQL, including SQL, the DataFrames API, and the Datasets API. When computing a result,
+the same execution engine is used, independent of which API/language you use to express the
+computation. This unification means that developers can easily switch back and forth between the
+various APIs based on which provides the most natural way to express a given transformation.
+
+All of the examples on this page use sample data included in the Spark distribution and can be run in
+the `spark-shell`, `pyspark` shell, or `sparkR` shell.
+
+## SQL
+
+One use of Spark SQL is to execute SQL queries written using either a basic SQL syntax or HiveQL.
+Spark SQL can also be used to read data from an existing Hive installation. For more on how to
+configure this feature, please refer to the [Hive Tables](#hive-tables) section. When running
+SQL from within another programming language, the results are returned as a [DataFrame](#DataFrames).
+You can also interact with the SQL interface using the [command-line](#running-the-spark-sql-cli)
+or over [JDBC/ODBC](#running-the-thrift-jdbcodbc-server).
+
+## DataFrames
+
+A DataFrame is a distributed collection of data organized into named columns. It is conceptually
+equivalent to a table in a relational database or a data frame in R/Python, but with richer
+optimizations under the hood. DataFrames can be constructed from a wide array of [sources](#data-sources) such
+as: structured data files, tables in Hive, external databases, or existing RDDs.
+
+The DataFrame API is available in [Scala](api/scala/index.html#org.apache.spark.sql.DataFrame),
+[Java](api/java/index.html?org/apache/spark/sql/DataFrame.html),
+[Python](api/python/pyspark.sql.html#pyspark.sql.DataFrame), and [R](api/R/index.html).
+
+## Datasets
+
+A Dataset is a new experimental interface added in Spark 1.6 that tries to provide the benefits of
+RDDs (strong typing, the ability to use powerful lambda functions) together with the benefits of Spark SQL's
+optimized execution engine. A Dataset can be [constructed](#creating-datasets) from JVM objects and then manipulated
+using functional transformations (`map`, `flatMap`, `filter`, etc.).
+
+The unified Dataset API can be used both in [Scala](api/scala/index.html#org.apache.spark.sql.Dataset) and
+[Java](api/java/index.html?org/apache/spark/sql/Dataset.html). Python does not yet have support for
+the Dataset API, but due to its dynamic nature many of the benefits are already available (i.e., you can
+naturally access the fields of a row by name: `row.columnName`). Full Python support will be added
+in a future release.
+
+# Getting Started

 ## Starting Point: SQLContext
@@ -29,7 +62,7 @@ All of the examples on this page use sample data included in the Spark distribut

 The entry point into all functionality in Spark SQL is the
 [`SQLContext`](api/scala/index.html#org.apache.spark.sql.SQLContext) class, or one of its
 descendants. To create a basic `SQLContext`, all you need is a SparkContext.

 {% highlight scala %}
 val sc: SparkContext // An existing SparkContext.
@@ -45,7 +78,7 @@ import sqlContext.implicits._

 The entry point into all functionality in Spark SQL is the
 [`SQLContext`](api/java/index.html#org.apache.spark.sql.SQLContext) class, or one of its
 descendants. To create a basic `SQLContext`, all you need is a SparkContext.

 {% highlight java %}
 JavaSparkContext sc = ...; // An existing JavaSparkContext.
@@ -58,7 +91,7 @@ SQLContext sqlContext = new org.apache.spark.sql.SQLContext(sc);

 The entry point into all relational functionality in Spark is the
 [`SQLContext`](api/python/pyspark.sql.html#pyspark.sql.SQLContext) class, or one
 of its descendants. To create a basic `SQLContext`, all you need is a SparkContext.

 {% highlight python %}
 from pyspark.sql import SQLContext
@@ -70,7 +103,7 @@ sqlContext = SQLContext(sc)

 <div data-lang="r" markdown="1">

 The entry point into all relational functionality in Spark is the
 `SQLContext` class, or one of its descendants. To create a basic `SQLContext`, all you need is a SparkContext.

 {% highlight r %}
 sqlContext <- sparkRSQL.init(sc)
@@ -82,18 +115,18 @@ sqlContext <- sparkRSQL.init(sc)

 In addition to the basic `SQLContext`, you can also create a `HiveContext`, which provides a
 superset of the functionality provided by the basic `SQLContext`. Additional features include
 the ability to write queries using the more complete HiveQL parser, access to Hive UDFs, and the
 ability to read data from Hive tables. To use a `HiveContext`, you do not need to have an
 existing Hive setup, and all of the data sources available to a `SQLContext` are still available.
 `HiveContext` is only packaged separately to avoid including all of Hive's dependencies in the default
 Spark build. If these dependencies are not a problem for your application, then using `HiveContext`
 is recommended for the 1.3 release of Spark. Future releases will focus on bringing `SQLContext` up
 to feature parity with a `HiveContext`.

 The specific variant of SQL that is used to parse queries can also be selected using the
 `spark.sql.dialect` option. This parameter can be changed using either the `setConf` method on
 a `SQLContext` or by using a `SET key=value` command in SQL. For a `SQLContext`, the only dialect
 available is "sql", which uses a simple SQL parser provided by Spark SQL. In a `HiveContext`, the
 default is "hiveql", though "sql" is also available. Since the HiveQL parser is much more complete,
 this is recommended for most use cases.
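The two steps above can be sketched as follows (a minimal illustration, assuming an existing `sc: SparkContext` and a Spark build with Hive support; the `hiveContext` name is chosen here for clarity):

{% highlight scala %}
// Create a HiveContext; no existing Hive installation is required.
val hiveContext = new org.apache.spark.sql.hive.HiveContext(sc)

// Select the parser dialect; "hiveql" is already the default for a HiveContext.
hiveContext.setConf("spark.sql.dialect", "hiveql")
{% endhighlight %}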
@@ -215,7 +248,7 @@ df.groupBy("age").count().show()

 For a complete list of the types of operations that can be performed on a DataFrame, refer to the [API Documentation](api/scala/index.html#org.apache.spark.sql.DataFrame).

 In addition to simple column references and expressions, DataFrames also have a rich library of functions including string manipulation, date arithmetic, common math operations, and more. The complete list is available in the [DataFrame Function Reference](api/scala/index.html#org.apache.spark.sql.functions$).

 </div>
@@ -270,7 +303,7 @@ df.groupBy("age").count().show();

 For a complete list of the types of operations that can be performed on a DataFrame, refer to the [API Documentation](api/java/org/apache/spark/sql/DataFrame.html).

 In addition to simple column references and expressions, DataFrames also have a rich library of functions including string manipulation, date arithmetic, common math operations, and more. The complete list is available in the [DataFrame Function Reference](api/java/org/apache/spark/sql/functions.html).

 </div>
@@ -331,7 +364,7 @@ df.groupBy("age").count().show()

 For a complete list of the types of operations that can be performed on a DataFrame, refer to the [API Documentation](api/python/pyspark.sql.html#pyspark.sql.DataFrame).

 In addition to simple column references and expressions, DataFrames also have a rich library of functions including string manipulation, date arithmetic, common math operations, and more. The complete list is available in the [DataFrame Function Reference](api/python/pyspark.sql.html#module-pyspark.sql.functions).

 </div>
@@ -385,7 +418,7 @@ showDF(count(groupBy(df, "age")))

 For a complete list of the types of operations that can be performed on a DataFrame, refer to the [API Documentation](api/R/index.html).

 In addition to simple column references and expressions, DataFrames also have a rich library of functions including string manipulation, date arithmetic, common math operations, and more. The complete list is available in the [DataFrame Function Reference](api/R/index.html).

 </div>
@@ -398,14 +431,14 @@ The `sql` function on a `SQLContext` enables applications to run SQL queries pro

 <div class="codetabs">
 <div data-lang="scala" markdown="1">
 {% highlight scala %}
 val sqlContext = ... // An existing SQLContext
 val df = sqlContext.sql("SELECT * FROM table")
 {% endhighlight %}
 </div>

 <div data-lang="java" markdown="1">
 {% highlight java %}
 SQLContext sqlContext = ...; // An existing SQLContext
 DataFrame df = sqlContext.sql("SELECT * FROM table");
 {% endhighlight %}
 </div>
@@ -428,15 +461,54 @@ df <- sql(sqlContext, "SELECT * FROM table")
 </div>

+## Creating Datasets
+
+Datasets are similar to RDDs; however, instead of using Java serialization or Kryo, they use
+a specialized [Encoder](api/scala/index.html#org.apache.spark.sql.Encoder) to serialize the objects
+for processing or transmission over the network. While both encoders and standard serialization are
+responsible for turning an object into bytes, encoder code is generated dynamically and uses a format
+that allows Spark to perform many operations like filtering, sorting, and hashing without deserializing
+the bytes back into an object.
+
+<div class="codetabs">
+<div data-lang="scala" markdown="1">
+
+{% highlight scala %}
+// Encoders for most common types are automatically provided by importing sqlContext.implicits._
+val ds = Seq(1, 2, 3).toDS()
+ds.map(_ + 1).collect() // Returns: Array(2, 3, 4)
+
+// Encoders are also created for case classes.
+case class Person(name: String, age: Long)
+val caseClassDS = Seq(Person("Andy", 32)).toDS()
+
+// DataFrames can be converted to a Dataset by providing a class. Mapping will be done by name.
+val path = "examples/src/main/resources/people.json"
+val people = sqlContext.read.json(path).as[Person]
+{% endhighlight %}
+
+</div>
+
+<div data-lang="java" markdown="1">
+
+{% highlight java %}
+JavaSparkContext sc = ...; // An existing JavaSparkContext.
+SQLContext sqlContext = new org.apache.spark.sql.SQLContext(sc);
+
+// Encoders for common types are provided by the Encoders factory class.
+Dataset<Integer> ds = sqlContext.createDataset(Arrays.asList(1, 2, 3), Encoders.INT());
+{% endhighlight %}
+
+</div>
+</div>
 ## Interoperating with RDDs

 Spark SQL supports two different methods for converting existing RDDs into DataFrames. The first
 method uses reflection to infer the schema of an RDD that contains specific types of objects. This
 reflection-based approach leads to more concise code and works well when you already know the schema
 while writing your Spark application.

 The second method for creating DataFrames is through a programmatic interface that allows you to
 construct a schema and then apply it to an existing RDD. While this method is more verbose, it allows
 you to construct DataFrames when the columns and their types are not known until runtime.

 ### Inferring the Schema Using Reflection

@@ -445,11 +517,11 @@ you to construct DataFrames when the columns and their types are not known until

 <div data-lang="scala" markdown="1">

 The Scala interface for Spark SQL supports automatically converting an RDD containing case classes
 to a DataFrame. The case class
 defines the schema of the table. The names of the arguments to the case class are read using
 reflection and become the names of the columns. Case classes can also be nested or contain complex
 types such as Sequences or Arrays. This RDD can be implicitly converted to a DataFrame and then be
 registered as a table. Tables can be used in subsequent SQL statements.

 {% highlight scala %}
 // sc is an existing SparkContext.
@@ -486,9 +558,9 @@ teenagers.map(_.getValuesMap[Any](List("name", "age"))).collect().foreach(printl

 <div data-lang="java" markdown="1">

 Spark SQL supports automatically converting an RDD of [JavaBeans](http://stackoverflow.com/questions/3295496/what-is-a-javabean-exactly)
 into a DataFrame. The BeanInfo, obtained using reflection, defines the schema of the table.
 Currently, Spark SQL does not support JavaBeans that contain
 nested or complex types such as Lists or Arrays. You can create a JavaBean by creating a
 class that implements Serializable and has getters and setters for all of its fields.

 {% highlight java %}
@@ -559,9 +631,9 @@ List<String> teenagerNames = teenagers.javaRDD().map(new Function<Row, String>()

 <div data-lang="python" markdown="1">

 Spark SQL can convert an RDD of Row objects to a DataFrame, inferring the datatypes. Rows are constructed by passing a list of
 key/value pairs as kwargs to the Row class. The keys of this list define the column names of the table,
 and the types are inferred by looking at the first row. Since we currently only look at the first
 row, it is important that there is no missing data in the first row of the RDD. In future versions we
 plan to more completely infer the schema by looking at more data, similar to the inference that is
 performed on JSON files.
@@ -780,7 +852,7 @@ for name in names.collect():

 Spark SQL supports operating on a variety of data sources through the `DataFrame` interface.
 A DataFrame can be operated on as a normal RDD and can also be registered as a temporary table.
 Registering a DataFrame as a table allows you to run SQL queries over its data. This section
 describes the general methods for loading and saving data using the Spark Data Sources and then
 goes into specific options that are available for the built-in data sources.
@@ -834,9 +906,9 @@ saveDF(select(df, "name", "age"), "namesAndAges.parquet")

 ### Manually Specifying Options

 You can also manually specify the data source that will be used, along with any extra options
 that you would like to pass to the data source. Data sources are specified by their fully qualified
 name (i.e., `org.apache.spark.sql.parquet`), but for built-in sources you can also use their short
 names (`json`, `parquet`, `jdbc`). DataFrames of any type can be converted into other types
 using this syntax.

 <div class="codetabs">
@@ -923,8 +995,8 @@ df <- sql(sqlContext, "SELECT * FROM parquet.`examples/src/main/resources/users.

 ### Save Modes

 Save operations can optionally take a `SaveMode`, which specifies how to handle existing data if
 present. It is important to realize that these save modes do not utilize any locking and are not
 atomic. Additionally, when performing an `Overwrite`, the data will be deleted before writing out the
 new data.

 <table class="table">
@@ -960,7 +1032,7 @@ new data.

 <td>
 Ignore mode means that when saving a DataFrame to a data source, if data already exists,
 the save operation is expected not to save the contents of the DataFrame and not to
 change the existing data. This is similar to a <code>CREATE TABLE IF NOT EXISTS</code> in SQL.
 </td>
 </tr>
 </table>
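As an illustrative sketch of the save modes described above (`df` stands for any existing DataFrame; the output path is arbitrary):

{% highlight scala %}
import org.apache.spark.sql.SaveMode

// Overwrite: any existing data at the destination is deleted before the new data is written.
df.write.mode(SaveMode.Overwrite).parquet("namesAndAges.parquet")

// The equivalent string form also works: "append", "overwrite", "ignore", "error".
df.write.mode("ignore").parquet("namesAndAges.parquet")
{% endhighlight %}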
@@ -968,14 +1040,14 @@ new data.

 ### Saving to Persistent Tables

 When working with a `HiveContext`, `DataFrames` can also be saved as persistent tables using the
 `saveAsTable` command. Unlike the `registerTempTable` command, `saveAsTable` will materialize the
 contents of the DataFrame and create a pointer to the data in the Hive metastore. Persistent tables
 will still exist even after your Spark program has restarted, as long as you maintain your connection
 to the same metastore. A DataFrame for a persistent table can be created by calling the `table`
 method on a `SQLContext` with the name of the table.

 By default `saveAsTable` will create a "managed table", meaning that the location of the data will
 be controlled by the metastore. Managed tables will also have their data deleted automatically
 when a table is dropped.
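A minimal sketch of the round trip described above, assuming an existing `hiveContext` and a DataFrame `df` (the table name `people` is chosen here for illustration):

{% highlight scala %}
// Materialize df as a managed table; the data's location is controlled by the metastore.
df.write.saveAsTable("people")

// Later, in any session connected to the same metastore, recreate a DataFrame from the table.
val people = hiveContext.table("people")
{% endhighlight %}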

 ## Parquet Files
@ -1003,7 +1075,7 @@ val people: RDD[Person] = ... // An RDD of case class objects, from the previous
|
||||||
// The RDD is implicitly converted to a DataFrame by implicits, allowing it to be stored using Parquet.
|
// The RDD is implicitly converted to a DataFrame by implicits, allowing it to be stored using Parquet.
|
||||||
people.write.parquet("people.parquet")
|
people.write.parquet("people.parquet")
|
||||||
|
|
||||||
// Read in the parquet file created above. Parquet files are self-describing so the schema is preserved.
|
// Read in the parquet file created above. Parquet files are self-describing so the schema is preserved.
|
||||||
// The result of loading a Parquet file is also a DataFrame.
|
// The result of loading a Parquet file is also a DataFrame.
|
||||||
val parquetFile = sqlContext.read.parquet("people.parquet")
|
val parquetFile = sqlContext.read.parquet("people.parquet")
|
||||||
|
|
||||||

{% highlight java %}
DataFrame schemaPeople = ... // The DataFrame from the previous example.

// DataFrames can be saved as Parquet files, maintaining the schema information.
schemaPeople.write().parquet("people.parquet");

// Read in the Parquet file created above. Parquet files are self-describing so the schema is preserved.
// The result of loading a Parquet file is also a DataFrame.
DataFrame parquetFile = sqlContext.read().parquet("people.parquet");
{% endhighlight %}

{% highlight python %}
schemaPeople # The DataFrame from the previous example.

# DataFrames can be saved as Parquet files, maintaining the schema information.
schemaPeople.write.parquet("people.parquet")

# Read in the Parquet file created above. Parquet files are self-describing so the schema is preserved.
# The result of loading a Parquet file is also a DataFrame.
parquetFile = sqlContext.read.parquet("people.parquet")
{% endhighlight %}

{% highlight r %}
schemaPeople # The DataFrame from the previous example.

# DataFrames can be saved as Parquet files, maintaining the schema information.
saveAsParquetFile(schemaPeople, "people.parquet")

# Read in the Parquet file created above. Parquet files are self-describing so the schema is preserved.
# The result of loading a Parquet file is also a DataFrame.
parquetFile <- parquetFile(sqlContext, "people.parquet")
{% endhighlight %}

### Partition Discovery

Table partitioning is a common optimization approach used in systems like Hive. In a partitioned
table, data are usually stored in different directories, with partitioning column values encoded in
the path of each partition directory. The Parquet data source is now able to discover and infer
partitioning information automatically. For example, we can store all our previously used
population data in a partitioned table using the following directory structure, with two extra
columns, `gender` and `country`, as partitioning columns:
{% endhighlight %}

Notice that the data types of the partitioning columns are automatically inferred. Currently,
numeric data types and string type are supported. Sometimes users may not want to automatically
infer the data types of the partitioning columns. For these use cases, the automatic type inference
can be configured by `spark.sql.sources.partitionColumnTypeInference.enabled`, which defaults to
`true`.

### Schema Merging

Like ProtocolBuffer, Avro, and Thrift, Parquet also supports schema evolution. Users can start with
a simple schema, and gradually add more columns to the schema as needed. In this way, users may end
up with multiple Parquet files with different but mutually compatible schemas. The Parquet data
source is now able to automatically detect this case and merge schemas of all these files.

Since schema merging is a relatively expensive operation, and is not a necessity in most cases, we
turned it off by default starting from 1.5.0. You may enable it by

1. setting data source option `mergeSchema` to `true` when reading Parquet files (as shown in the
   examples below), or
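For instance, a single read with merging enabled might look like this (a sketch; the path is a placeholder):

{% highlight scala %}
// Merge the schemas of all Parquet part-files found under the given path.
val df = sqlContext.read.option("mergeSchema", "true").parquet("data/test_table")
{% endhighlight %}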

1. Hive considers all columns nullable, while nullability in Parquet is significant

For this reason, we must reconcile the Hive metastore schema with the Parquet schema when converting a
Hive metastore Parquet table to a Spark SQL Parquet table. The reconciliation rules are:

1. Fields that have the same name in both schemas must have the same data type regardless of
   nullability. The reconciled field should have the data type of the Parquet side, so that
   nullability is respected.

1. The reconciled schema contains exactly those fields defined in the Hive metastore schema.

#### Metadata Refreshing

Spark SQL caches Parquet metadata for better performance. When Hive metastore Parquet table
conversion is enabled, the metadata of those converted tables is also cached. If these tables are
updated by Hive or other external tools, you need to refresh them manually to ensure consistent
metadata.
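For example, to refresh a table that was changed outside of Spark SQL (assuming a `HiveContext` named `sqlContext`):

{% highlight scala %}
// Invalidate the cached metadata for "my_table"; the next query re-reads it.
sqlContext.refreshTable("my_table")
{% endhighlight %}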
  <td><code>spark.sql.parquet.int96AsTimestamp</code></td>
  <td>true</td>
  <td>
    Some Parquet-producing systems, in particular Impala and Hive, store Timestamp into INT96. This
    flag tells Spark SQL to interpret INT96 data as a timestamp to provide compatibility with these systems.
  </td>
</tr>
  <td>
    <p>
      The output committer class used by Parquet. The specified class needs to be a subclass of
      <code>org.apache.hadoop.<br />mapreduce.OutputCommitter</code>. Typically, it's also a
      subclass of <code>org.apache.parquet.hadoop.ParquetOutputCommitter</code>.
    </p>
    <p>

When working with Hive one must construct a `HiveContext`, which inherits from `SQLContext`, and
adds support for finding tables in the MetaStore and writing queries using HiveQL. Users who do
not have an existing Hive deployment can still create a `HiveContext`. When not configured by
hive-site.xml, the context automatically creates `metastore_db` in the current directory and
creates a `warehouse` directory as indicated by HiveConf, which defaults to `/user/hive/warehouse`.
Note that you may need to grant write privileges on `/user/hive/warehouse` to the user who starts
the Spark application.
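A minimal setup (assuming an existing `SparkContext` named `sc`):

{% highlight scala %}
// sc is an existing SparkContext; HiveContext adds HiveQL and metastore support.
val sqlContext = new org.apache.spark.sql.hive.HiveContext(sc)

sqlContext.sql("CREATE TABLE IF NOT EXISTS src (key INT, value STRING)")
{% endhighlight %}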
        enabled. When this option is chosen, <code>spark.sql.hive.metastore.version</code> must be
        either <code>1.2.1</code> or not defined.
        <li><code>maven</code></li>
        Use Hive jars of the specified version downloaded from Maven repositories. This configuration
        is not generally recommended for production deployments.
        <li>A classpath in the standard format for the JVM. This classpath must include all of Hive
        and its dependencies, including the correct version of Hadoop. These jars only need to be
        present on the driver, but if you are running in yarn cluster mode then you must ensure
        they are packaged with your application.</li>
      </ol>

## JDBC To Other Databases

Spark SQL also includes a data source that can read data from other databases using JDBC. This
functionality should be preferred over using [JdbcRDD](api/scala/index.html#org.apache.spark.rdd.JdbcRDD).
This is because the results are returned as a DataFrame and they can easily be processed in Spark SQL
or joined with other data sources.
(Note that this is different than the Spark SQL JDBC server, which allows other applications to
run queries using Spark SQL).

To get started you will need to include the JDBC driver for your particular database on the
Spark classpath. For example, to connect to postgres from the Spark Shell you would run the
following command:

{% highlight bash %}
SPARK_CLASSPATH=postgresql-9.3-1102-jdbc41.jar bin/spark-shell
{% endhighlight %}
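With the driver on the classpath, a remote table can then be loaded via the Data Sources API. A sketch (the URL and table name are placeholders):

{% highlight scala %}
val jdbcDF = sqlContext.read.format("jdbc")
  .option("url", "jdbc:postgresql:dbserver")
  .option("dbtable", "schema.tablename")
  .load()
{% endhighlight %}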

Tables from the remote database can be loaded as a DataFrame or Spark SQL temporary table using
the Data Sources API. The following options are supported:

<table class="table">
  <tr><th>Property Name</th><th>Meaning</th></tr>
  <tr>
    <td><code>dbtable</code></td>
    <td>
      The JDBC table that should be read. Note that anything that is valid in a <code>FROM</code> clause of
      a SQL query can be used. For example, instead of a full table you could also use a
      subquery in parentheses.
    </td>
  </tr>
  <tr>
    <td><code>driver</code></td>
    <td>
      The class name of the JDBC driver needed to connect to this URL. This class will be loaded
      on the master and workers before running any JDBC commands, to allow the driver to
      register itself with the JDBC subsystem.
    </td>
  <tr>
    <td><code>partitionColumn, lowerBound, upperBound, numPartitions</code></td>
    <td>
      These options must all be specified if any of them is specified. They describe how to
      partition the table when reading in parallel from multiple workers.
      <code>partitionColumn</code> must be a numeric column from the table in question. Notice
      that <code>lowerBound</code> and <code>upperBound</code> are just used to decide the
  <td><code>spark.sql.inMemoryColumnarStorage.batchSize</code></td>
  <td>10000</td>
  <td>
    Controls the size of batches for columnar caching. Larger batch sizes can improve memory utilization
    and compression, but risk OOMs when caching data.
  </td>
</tr>
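As a sketch of how these settings combine with table caching (assuming a registered temporary table named `people`):

{% highlight scala %}
sqlContext.setConf("spark.sql.inMemoryColumnarStorage.batchSize", "20000") // larger batches
sqlContext.cacheTable("people")   // cache using the in-memory columnar format
// ... run queries against the cached table ...
sqlContext.uncacheTable("people") // release the cached data
{% endhighlight %}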

## Other Configuration Options

The following options can also be used to tune the performance of query execution. It is possible
that these options will be deprecated in a future release as more optimizations are performed automatically.

<table class="table">
  <td>10485760 (10 MB)</td>
  <td>
    Configures the maximum size in bytes for a table that will be broadcast to all worker nodes when
    performing a join. By setting this value to -1 broadcasting can be disabled. Note that currently
    statistics are only supported for Hive Metastore tables where the command
    <code>ANALYZE TABLE &lt;tableName&gt; COMPUTE STATISTICS noscan</code> has been run.
  </td>
To start the JDBC/ODBC server, run the following in the Spark directory:

{% highlight bash %}
./sbin/start-thriftserver.sh
{% endhighlight %}

This script accepts all `bin/spark-submit` command line options, plus a `--hiveconf` option to
specify Hive properties. You may run `./sbin/start-thriftserver.sh --help` for a complete list of
all available options. By default, the server listens on localhost:10000. You may override this
behaviour via either environment variables, i.e.:

{% highlight bash %}

## Upgrading From Spark SQL 1.5 to 1.6

- From Spark 1.6, by default the Thrift server runs in multi-session mode, which means each JDBC/ODBC
  connection owns its own copy of the SQL configuration and temporary function registry. Cached
  tables are still shared though. If you prefer to run the Thrift server in the old single-session
  mode, please set option `spark.sql.hive.thriftServer.singleSession` to `true`. You may either add
  this option to `spark-defaults.conf`, or pass it to `start-thriftserver.sh` via `--conf`:

{% highlight bash %}

## Upgrading From Spark SQL 1.4 to 1.5

- Optimized execution using manually managed memory (Tungsten) is now enabled by default, along with
  code generation for expression evaluation. These features can both be disabled by setting
  `spark.sql.tungsten.enabled` to `false`.
- Parquet schema merging is no longer enabled by default. It can be re-enabled by setting
  `spark.sql.parquet.mergeSchema` to `true`.
- Resolution of strings to columns in Python now supports using dots (`.`) to qualify the column or
  access nested values, for example `df['table.column.nestedField']`. However, this means that if
  your column name contains any dots you must now escape them using backticks (e.g., ``table.`column.with.dots`.nested``).
- In-memory columnar storage partition pruning is on by default. It can be disabled by setting
  `spark.sql.inMemoryColumnarStorage.partitionPruning` to `false`.
- Unlimited precision decimal columns are no longer supported; instead Spark SQL enforces a maximum
  precision of 38. When inferring schema from `BigDecimal` objects, a precision of (38, 18) is now
  used. When no precision is specified in DDL then the default remains `Decimal(10, 0)`.
- Timestamps are now stored at a precision of 1us, rather than 1ns.
- In the `sql` dialect, floating point numbers are now parsed as decimal. HiveQL parsing remains
  unchanged.
- The canonical names of SQL/DataFrame functions are now lower case (e.g., sum vs SUM).
- It has been determined that using the DirectOutputCommitter when speculation is enabled is unsafe

## Upgrading from Spark SQL 1.0-1.2 to 1.3

In Spark 1.3 we removed the "Alpha" label from Spark SQL and as part of this did a cleanup of the
available APIs. From Spark 1.3 onwards, Spark SQL will provide binary compatibility with other
releases in the 1.X series. This compatibility guarantee excludes APIs that are explicitly marked
as unstable (i.e., DeveloperAPI or Experimental).

#### Rename of SchemaRDD to DataFrame

The largest change that users will notice when upgrading to Spark SQL 1.3 is that `SchemaRDD` has
been renamed to `DataFrame`. This is primarily because DataFrames no longer inherit from RDD
directly, but instead provide most of the functionality that RDDs provide through their own
implementation. DataFrames can still be converted to RDDs by calling the `.rdd` method.

In Scala there is a type alias from `SchemaRDD` to `DataFrame` to provide source compatibility for
some use cases. It is still recommended that users update their code to use `DataFrame` instead.
Java and Python users will need to update their code.

#### Unification of the Java and Scala APIs

Prior to Spark 1.3 there were separate Java compatible classes (`JavaSQLContext` and `JavaSchemaRDD`)
that mirrored the Scala API. In Spark 1.3 the Java API and Scala API have been unified. Users
of either language should use `SQLContext` and `DataFrame`. In general these classes try to
use types that are usable from both languages (i.e. `Array` instead of language-specific collections).
In some cases where no common type exists (e.g., for passing in closures or Maps) function overloading
is used instead.

Additionally the Java-specific types API has been removed. Users of both Scala and Java should
use the classes present in `org.apache.spark.sql.types` to describe schema programmatically.

#### Isolation of Implicit Conversions and Removal of dsl Package (Scala-only)

Many of the code examples prior to Spark 1.3 started with `import sqlContext._`, which brought
all of the functions from sqlContext into scope. In Spark 1.3 we have isolated the implicit
conversions for converting `RDD`s into `DataFrame`s into an object inside of the `SQLContext`.
Users should now write `import sqlContext.implicits._`.
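A minimal sketch of the new import in use (assuming a `SQLContext` named `sqlContext` and a case class defined at top level):

{% highlight scala %}
import sqlContext.implicits._  // replaces the old `import sqlContext._`

case class Person(name: String, age: Int)
val df = sc.parallelize(Seq(Person("Alice", 30))).toDF()  // toDF is now explicit
{% endhighlight %}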

Additionally, the implicit conversions now only augment RDDs that are composed of `Product`s (i.e.,
case classes or tuples) with a method `toDF`, instead of applying automatically.

When using functions inside of the DSL (now replaced with the `DataFrame` API) users used to import
`org.apache.spark.sql.catalyst.dsl`. Instead the public DataFrame functions API should be used:
`import org.apache.spark.sql.functions._`.

#### Removal of the type aliases in org.apache.spark.sql for DataType (Scala-only)
|
@@ -19,20 +19,60 @@ package org.apache.spark.sql

 import java.lang.reflect.Modifier

+import scala.annotation.implicitNotFound
 import scala.reflect.{ClassTag, classTag}

+import org.apache.spark.annotation.Experimental
 import org.apache.spark.sql.catalyst.encoders.{ExpressionEncoder, encoderFor}
 import org.apache.spark.sql.catalyst.expressions.{DecodeUsingSerializer, BoundReference, EncodeUsingSerializer}
 import org.apache.spark.sql.types._

 /**
+ * :: Experimental ::
  * Used to convert a JVM object of type `T` to and from the internal Spark SQL representation.
  *
- * Encoders are not intended to be thread-safe and thus they are allow to avoid internal locking
- * and reuse internal buffers to improve performance.
+ * == Scala ==
+ * Encoders are generally created automatically through implicits from a `SQLContext`.
+ *
+ * {{{
+ *   import sqlContext.implicits._
+ *
+ *   val ds = Seq(1, 2, 3).toDS() // implicitly provided (sqlContext.implicits.newIntEncoder)
+ * }}}
+ *
+ * == Java ==
+ * Encoders are specified by calling static methods on [[Encoders]].
+ *
+ * {{{
+ *   List<String> data = Arrays.asList("abc", "abc", "xyz");
+ *   Dataset<String> ds = context.createDataset(data, Encoders.STRING());
+ * }}}
+ *
+ * Encoders can be composed into tuples:
+ *
+ * {{{
+ *   Encoder<Tuple2<Integer, String>> encoder2 = Encoders.tuple(Encoders.INT(), Encoders.STRING());
+ *   List<Tuple2<Integer, String>> data2 = Arrays.asList(new scala.Tuple2(1, "a"));
+ *   Dataset<Tuple2<Integer, String>> ds2 = context.createDataset(data2, encoder2);
+ * }}}
+ *
+ * Or constructed from Java Beans:
+ *
+ * {{{
+ *   Encoders.bean(MyClass.class);
+ * }}}
+ *
+ * == Implementation ==
+ *  - Encoders are not required to be thread-safe and thus they do not need to use locks to guard
+ *    against concurrent access if they reuse internal buffers to improve performance.
  *
  * @since 1.6.0
  */
+@Experimental
+@implicitNotFound("Unable to find encoder for type stored in a Dataset.  Primitive types " +
+  "(Int, String, etc) and Product types (case classes) are supported by importing " +
+  "sqlContext.implicits._  Support for serializing other types will be added in future " +
+  "releases.")
 trait Encoder[T] extends Serializable {

   /** Returns the schema of encoding this type of object as a Row. */
|
@ -43,10 +83,12 @@ trait Encoder[T] extends Serializable {
|
||||||
}
|
}
|
||||||
|
|
||||||
/**
|
/**
|
||||||
* Methods for creating encoders.
|
* :: Experimental ::
|
||||||
|
* Methods for creating an [[Encoder]].
|
||||||
*
|
*
|
||||||
* @since 1.6.0
|
* @since 1.6.0
|
||||||
*/
|
*/
|
||||||
|
@Experimental
|
||||||
object Encoders {
|
object Encoders {
|
||||||
|
|
||||||
/**
|
/**
|
||||||
|
|
|
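The new Scaladoc above describes encoders as converters between JVM objects and Spark SQL's internal representation, composable into tuple encoders via `Encoders.tuple`. As a conceptual analogy only — this is not Spark's `Encoder` machinery, and `SimpleEncoder`, `tupleEnc`, and `width` are invented names — the idea can be sketched in plain Scala:

```scala
// Conceptual pure-Scala sketch (NOT Spark's Encoder) of the idea above:
// an encoder maps a JVM object of type T to and from a flat row, and two
// encoders compose into a pair encoder, analogous to Encoders.tuple.
trait SimpleEncoder[T] {
  def toRow(value: T): Seq[Any]
  def fromRow(row: Seq[Any]): T
  def width: Int // number of row slots this encoder occupies
}

val intEnc = new SimpleEncoder[Int] {
  def toRow(v: Int) = Seq(v)
  def fromRow(r: Seq[Any]) = r.head.asInstanceOf[Int]
  val width = 1
}

val stringEnc = new SimpleEncoder[String] {
  def toRow(v: String) = Seq(v)
  def fromRow(r: Seq[Any]) = r.head.asInstanceOf[String]
  val width = 1
}

// Compose two encoders into one for pairs, splitting the row by width.
def tupleEnc[A, B](ea: SimpleEncoder[A], eb: SimpleEncoder[B]): SimpleEncoder[(A, B)] =
  new SimpleEncoder[(A, B)] {
    def toRow(v: (A, B)) = ea.toRow(v._1) ++ eb.toRow(v._2)
    def fromRow(r: Seq[Any]) = (ea.fromRow(r.take(ea.width)), eb.fromRow(r.drop(ea.width)))
    val width = ea.width + eb.width
  }

val pairEnc = tupleEnc(intEnc, stringEnc)
val row  = pairEnc.toRow((1, "a"))
val back = pairEnc.fromRow(row)
```

The composition step is the part worth noting: it mirrors how `Encoders.tuple(Encoders.INT(), Encoders.STRING())` builds an `Encoder<Tuple2<Integer, String>>` out of the two primitive encoders.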
@@ -73,7 +73,26 @@ class TypedColumn[-T, U](

 /**
  * :: Experimental ::
- * A column in a [[DataFrame]].
+ * A column that will be computed based on the data in a [[DataFrame]].
+ *
+ * A new column is constructed based on the input columns present in a dataframe:
+ *
+ * {{{
+ *   df("columnName")            // On a specific DataFrame.
+ *   col("columnName")           // A generic column not yet associated with a DataFrame.
+ *   col("columnName.field")     // Extracting a struct field
+ *   col("`a.column.with.dots`") // Escape `.` in column names.
+ *   $"columnName"               // Scala shorthand for a named column.
+ *   expr("a + 1")               // A column that is constructed from a parsed SQL Expression.
+ *   lit("1")                    // A column that produces a literal (constant) value.
+ * }}}
+ *
+ * [[Column]] objects can be composed to form complex expressions:
+ *
+ * {{{
+ *   $"a" + 1
+ *   $"a" === $"b"
+ * }}}
  *
  * @groupname java_expr_ops Java-specific expression operators
  * @groupname expr_ops Expression operators
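The revised `Column` doc stresses that expressions like `$"a" + 1` build a description of a computation rather than evaluating it. A minimal pure-Scala sketch of that design — not Spark's implementation; `Expr`, `Col`, and the case-class names are invented for illustration — shows operators returning new objects that wrap a growing expression tree:

```scala
// Minimal sketch (NOT Spark's implementation) of how composing
// Column-like objects builds an expression tree instead of a value.
sealed trait Expr
case class NamedColumn(name: String) extends Expr
case class Literal(value: Any) extends Expr
case class Add(left: Expr, right: Expr) extends Expr
case class EqualTo(left: Expr, right: Expr) extends Expr

class Col(val expr: Expr) {
  // Each operator returns a new Col wrapping a larger tree;
  // nothing is evaluated until an engine walks the tree.
  def +(other: Any): Col = new Col(Add(expr, Literal(other)))
  def ===(other: Col): Col = new Col(EqualTo(expr, other.expr))
}

def col(name: String): Col = new Col(NamedColumn(name))

val sum  = col("a") + 1          // tree: Add(NamedColumn("a"), Literal(1))
val cond = col("a") === col("b") // tree: EqualTo(NamedColumn("a"), NamedColumn("b"))
```

Deferring evaluation this way is what lets Spark's Catalyst optimizer inspect and rewrite column expressions before any data is touched.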