---
layout: global
title: Generic Load/Save Functions
displayTitle: Generic Load/Save Functions
---

* Table of contents
{:toc}

In the simplest form, the default data source (`parquet` unless otherwise configured by
`spark.sql.sources.default`) will be used for all operations.

<div class="codetabs">

<div data-lang="scala" markdown="1">
{% include_example generic_load_save_functions scala/org/apache/spark/examples/sql/SQLDataSourceExample.scala %}
</div>

<div data-lang="java" markdown="1">
{% include_example generic_load_save_functions java/org/apache/spark/examples/sql/JavaSQLDataSourceExample.java %}
</div>

<div data-lang="python" markdown="1">
{% include_example generic_load_save_functions python/sql/datasource.py %}
</div>

<div data-lang="r" markdown="1">
{% include_example generic_load_save_functions r/RSparkSQLExample.R %}
</div>

</div>
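
For orientation, a minimal Scala sketch of the same flow (the `users.parquet` file is the sample data shipped with the Spark examples; substitute your own path):

{% highlight scala %}
// Load with the default data source (parquet) and save a projection back out.
val usersDF = spark.read.load("examples/src/main/resources/users.parquet")
usersDF.select("name", "favorite_color").write.save("namesAndFavColors.parquet")
{% endhighlight %}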

### Manually Specifying Options

You can also manually specify the data source that will be used along with any extra options
that you would like to pass to the data source. Data sources are specified by their fully qualified
name (i.e., `org.apache.spark.sql.parquet`), but for built-in sources you can also use their short
names (`json`, `parquet`, `jdbc`, `orc`, `libsvm`, `csv`, `text`). DataFrames loaded from any data
source type can be converted into other types using this syntax.

To load a JSON file you can use:

<div class="codetabs">

<div data-lang="scala" markdown="1">
{% include_example manual_load_options scala/org/apache/spark/examples/sql/SQLDataSourceExample.scala %}
</div>

<div data-lang="java" markdown="1">
{% include_example manual_load_options java/org/apache/spark/examples/sql/JavaSQLDataSourceExample.java %}
</div>

<div data-lang="python" markdown="1">
{% include_example manual_load_options python/sql/datasource.py %}
</div>

<div data-lang="r" markdown="1">
{% include_example manual_load_options r/RSparkSQLExample.R %}
</div>

</div>
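
As a hedged Scala sketch of the same idea (using the sample `people.json` from the Spark examples):

{% highlight scala %}
// Explicitly pick the built-in JSON source by its short name.
val peopleDF = spark.read.format("json").load("examples/src/main/resources/people.json")
peopleDF.select("name", "age").write.format("parquet").save("namesAndAges.parquet")
{% endhighlight %}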

To load a CSV file you can use:

<div class="codetabs">

<div data-lang="scala" markdown="1">
{% include_example manual_load_options_csv scala/org/apache/spark/examples/sql/SQLDataSourceExample.scala %}
</div>

<div data-lang="java" markdown="1">
{% include_example manual_load_options_csv java/org/apache/spark/examples/sql/JavaSQLDataSourceExample.java %}
</div>

<div data-lang="python" markdown="1">
{% include_example manual_load_options_csv python/sql/datasource.py %}
</div>

<div data-lang="r" markdown="1">
{% include_example manual_load_options_csv r/RSparkSQLExample.R %}
</div>

</div>
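
A minimal Scala sketch with a few common CSV options (`sep`, `inferSchema`, `header`); the `people.csv` path assumes the sample file from the Spark examples:

{% highlight scala %}
val peopleDFCsv = spark.read.format("csv")
  .option("sep", ";")          // field delimiter
  .option("inferSchema", "true")
  .option("header", "true")
  .load("examples/src/main/resources/people.csv")
{% endhighlight %}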

### Run SQL on files directly

Instead of using the read API to load a file into a DataFrame and query it, you can also query that
file directly with SQL.

<div class="codetabs">

<div data-lang="scala" markdown="1">
{% include_example direct_sql scala/org/apache/spark/examples/sql/SQLDataSourceExample.scala %}
</div>

<div data-lang="java" markdown="1">
{% include_example direct_sql java/org/apache/spark/examples/sql/JavaSQLDataSourceExample.java %}
</div>

<div data-lang="python" markdown="1">
{% include_example direct_sql python/sql/datasource.py %}
</div>

<div data-lang="r" markdown="1">
{% include_example direct_sql r/RSparkSQLExample.R %}
</div>

</div>
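
For example, a Scala sketch (the parquet path is the sample file used throughout these docs):

{% highlight scala %}
// Query the file directly; the "parquet" prefix selects the data source.
val sqlDF = spark.sql("SELECT * FROM parquet.`examples/src/main/resources/users.parquet`")
{% endhighlight %}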

### Save Modes

Save operations can optionally take a `SaveMode`, which specifies how to handle existing data if
present. It is important to realize that these save modes do not utilize any locking and are not
atomic. Additionally, when performing an `Overwrite`, the data will be deleted before writing out the
new data.

<table class="table">
<tr><th>Scala/Java</th><th>Any Language</th><th>Meaning</th></tr>
<tr>
  <td><code>SaveMode.ErrorIfExists</code> (default)</td>
  <td><code>"error" or "errorifexists"</code> (default)</td>
  <td>
    When saving a DataFrame to a data source, if data already exists,
    an exception is expected to be thrown.
  </td>
</tr>
<tr>
  <td><code>SaveMode.Append</code></td>
  <td><code>"append"</code></td>
  <td>
    When saving a DataFrame to a data source, if data/table already exists,
    contents of the DataFrame are expected to be appended to existing data.
  </td>
</tr>
<tr>
  <td><code>SaveMode.Overwrite</code></td>
  <td><code>"overwrite"</code></td>
  <td>
    Overwrite mode means that when saving a DataFrame to a data source,
    if data/table already exists, existing data is expected to be overwritten by the contents of
    the DataFrame.
  </td>
</tr>
<tr>
  <td><code>SaveMode.Ignore</code></td>
  <td><code>"ignore"</code></td>
  <td>
    Ignore mode means that when saving a DataFrame to a data source, if data already exists,
    the save operation is expected not to save the contents of the DataFrame and not to
    change the existing data. This is similar to a <code>CREATE TABLE IF NOT EXISTS</code> in SQL.
  </td>
</tr>
</table>
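
In code, the mode is set on the `DataFrameWriter`; a short Scala sketch, reusing `usersDF` from the earlier sketch (the output path is illustrative):

{% highlight scala %}
import org.apache.spark.sql.SaveMode

// The string form ("append") and the enum form (SaveMode.Overwrite) are equivalent.
usersDF.write.mode(SaveMode.Overwrite).parquet("/tmp/users_out.parquet")
usersDF.write.mode("append").parquet("/tmp/users_out.parquet")
{% endhighlight %}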

### Saving to Persistent Tables

`DataFrames` can also be saved as persistent tables into the Hive metastore using the `saveAsTable`
command. Notice that an existing Hive deployment is not necessary to use this feature. Spark will create a
default local Hive metastore (using Derby) for you. Unlike the `createOrReplaceTempView` command,
`saveAsTable` will materialize the contents of the DataFrame and create a pointer to the data in the
Hive metastore. Persistent tables will still exist even after your Spark program has restarted, as
long as you maintain your connection to the same metastore. A DataFrame for a persistent table can
be created by calling the `table` method on a `SparkSession` with the name of the table.

For file-based data sources, e.g. text, parquet, json, etc. you can specify a custom table path via the
`path` option, e.g. `df.write.option("path", "/some/path").saveAsTable("t")`. When the table is dropped,
the custom table path will not be removed and the table data is still there. If no custom table path is
specified, Spark will write data to a default table path under the warehouse directory. When the table is
dropped, the default table path will be removed too.
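
A short Scala sketch of both variants, reusing `usersDF` from the earlier sketch (the table names and `/some/path` are the placeholders from this section):

{% highlight scala %}
// Managed table: data is written under the warehouse directory and removed on DROP.
usersDF.write.saveAsTable("users_managed")

// External table with a custom path: dropping the table keeps the files at /some/path.
usersDF.write.option("path", "/some/path").saveAsTable("t")

// Read a persistent table back as a DataFrame.
val t = spark.table("t")
{% endhighlight %}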

Starting from Spark 2.1, persistent datasource tables have per-partition metadata stored in the Hive metastore. This brings several benefits:

- Since the metastore can return only necessary partitions for a query, discovering all the partitions on the first query to the table is no longer needed.
- Hive DDLs such as `ALTER TABLE PARTITION ... SET LOCATION` are now available for tables created with the Datasource API.

Note that partition information is not gathered by default when creating external datasource tables (those with a `path` option). To sync the partition information in the metastore, you can invoke `MSCK REPAIR TABLE`.
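
For example, a hedged Scala sketch (the table name `t` continues the external-table example above):

{% highlight scala %}
// Scan the table's location and add any partitions missing from the metastore.
spark.sql("MSCK REPAIR TABLE t")
{% endhighlight %}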

### Bucketing, Sorting and Partitioning

For file-based data sources, it is also possible to bucket and sort or partition the output.
Bucketing and sorting are applicable only to persistent tables:

<div class="codetabs">

<div data-lang="scala" markdown="1">
{% include_example write_sorting_and_bucketing scala/org/apache/spark/examples/sql/SQLDataSourceExample.scala %}
</div>

<div data-lang="java" markdown="1">
{% include_example write_sorting_and_bucketing java/org/apache/spark/examples/sql/JavaSQLDataSourceExample.java %}
</div>

<div data-lang="python" markdown="1">
{% include_example write_sorting_and_bucketing python/sql/datasource.py %}
</div>

<div data-lang="sql" markdown="1">

{% highlight sql %}

CREATE TABLE users_bucketed_by_name(
  name STRING,
  favorite_color STRING,
  favorite_numbers array<integer>
) USING parquet
CLUSTERED BY(name) INTO 42 BUCKETS;

{% endhighlight %}

</div>

</div>
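
The equivalent DataFrame write, as a Scala sketch (the `people.json` input and `people_bucketed` table name mirror the bundled example):

{% highlight scala %}
val peopleDF = spark.read.format("json").load("examples/src/main/resources/people.json")

// bucketBy/sortBy only take effect together with saveAsTable.
peopleDF.write
  .bucketBy(42, "name")
  .sortBy("age")
  .saveAsTable("people_bucketed")
{% endhighlight %}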

Partitioning, in contrast, can be used with both `save` and `saveAsTable` when using the Dataset APIs.

<div class="codetabs">

<div data-lang="scala" markdown="1">
{% include_example write_partitioning scala/org/apache/spark/examples/sql/SQLDataSourceExample.scala %}
</div>

<div data-lang="java" markdown="1">
{% include_example write_partitioning java/org/apache/spark/examples/sql/JavaSQLDataSourceExample.java %}
</div>

<div data-lang="python" markdown="1">
{% include_example write_partitioning python/sql/datasource.py %}
</div>

<div data-lang="sql" markdown="1">

{% highlight sql %}

CREATE TABLE users_by_favorite_color(
  name STRING,
  favorite_color STRING,
  favorite_numbers array<integer>
) USING csv PARTITIONED BY(favorite_color);

{% endhighlight %}

</div>

</div>
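
In the Dataset API this looks roughly like the following Scala sketch, reusing `usersDF` from earlier (the output path is illustrative):

{% highlight scala %}
usersDF.write
  .partitionBy("favorite_color")
  .format("parquet")
  .save("namesPartByColor.parquet")
{% endhighlight %}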

It is possible to use both partitioning and bucketing for a single table:

<div class="codetabs">

<div data-lang="scala" markdown="1">
{% include_example write_partition_and_bucket scala/org/apache/spark/examples/sql/SQLDataSourceExample.scala %}
</div>

<div data-lang="java" markdown="1">
{% include_example write_partition_and_bucket java/org/apache/spark/examples/sql/JavaSQLDataSourceExample.java %}
</div>

<div data-lang="python" markdown="1">
{% include_example write_partition_and_bucket python/sql/datasource.py %}
</div>

<div data-lang="sql" markdown="1">

{% highlight sql %}

CREATE TABLE users_bucketed_and_partitioned(
  name STRING,
  favorite_color STRING,
  favorite_numbers array<integer>
) USING parquet
PARTITIONED BY (favorite_color)
CLUSTERED BY(name) SORTED BY (favorite_numbers) INTO 42 BUCKETS;

{% endhighlight %}

</div>

</div>
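
A combined Scala sketch, again reusing `usersDF` (the table name is illustrative):

{% highlight scala %}
usersDF.write
  .partitionBy("favorite_color")
  .bucketBy(42, "name")
  .saveAsTable("users_partitioned_bucketed")
{% endhighlight %}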

`partitionBy` creates a directory structure as described in the [Partition Discovery](sql-data-sources-parquet.html#partition-discovery) section.
Thus, it has limited applicability to columns with high cardinality. In contrast, `bucketBy` distributes
data across a fixed number of buckets and can be used when the number of unique values is unbounded.