[SPARK-27472] add user guide for binary file data source
## What changes were proposed in this pull request?

Add a user guide for the binary file data source.

<img width="826" alt="Screen Shot 2019-04-28 at 10 21 26 PM" src="https://user-images.githubusercontent.com/829644/56877594-0488d300-6a04-11e9-9064-5047dfedd913.png">

Closes #24484 from mengxr/SPARK-27472.

Authored-by: Xiangrui Meng <meng@databricks.com>
Signed-off-by: Xiangrui Meng <meng@databricks.com>
This commit is contained in:

parent 76785cd6f0
commit fbc7942683

docs/sql-data-sources-binaryFile.md (new file, 80 lines)

@@ -0,0 +1,80 @@
---
layout: global
title: Binary File Data Source
displayTitle: Binary File Data Source
license: |
  Licensed to the Apache Software Foundation (ASF) under one or more
  contributor license agreements.  See the NOTICE file distributed with
  this work for additional information regarding copyright ownership.
  The ASF licenses this file to You under the Apache License, Version 2.0
  (the "License"); you may not use this file except in compliance with
  the License.  You may obtain a copy of the License at

     http://www.apache.org/licenses/LICENSE-2.0

  Unless required by applicable law or agreed to in writing, software
  distributed under the License is distributed on an "AS IS" BASIS,
  WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
  See the License for the specific language governing permissions and
  limitations under the License.
---

Since Spark 3.0, Spark supports a binary file data source,
which reads binary files and converts each file into a single record that contains the raw content
and metadata of the file.
It produces a DataFrame with the following columns and possibly partition columns:

* `path`: StringType
* `modificationTime`: TimestampType
* `length`: LongType
* `content`: BinaryType
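The four columns correspond to ordinary file metadata plus the raw bytes. As a rough illustration in plain Python (no Spark; the file and helper below are hypothetical, for demonstration only), the record built for each file looks like:

```python
import datetime
import os
import tempfile

def binary_file_record(path):
    """Sketch of the per-file record the binaryFile source produces:
    path, modification time, length in bytes, and raw content."""
    stat = os.stat(path)
    with open(path, "rb") as f:
        content = f.read()
    return {
        "path": path,                                                        # StringType
        "modificationTime": datetime.datetime.fromtimestamp(stat.st_mtime),  # TimestampType
        "length": stat.st_size,                                              # LongType
        "content": content,                                                  # BinaryType
    }

# Hypothetical input file, created here just so the sketch is runnable:
with tempfile.NamedTemporaryFile(suffix=".png", delete=False) as f:
    f.write(b"\x89PNG\r\n")  # 6 bytes
    tmp = f.name

record = binary_file_record(tmp)
print(record["length"])  # 6
```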

It supports the following read option:
<table class="table">
  <tr><th><b>Property Name</b></th><th><b>Default</b></th><th><b>Meaning</b></th></tr>
  <tr>
    <td><code>pathGlobFilter</code></td>
    <td>none (accepts all)</td>
    <td>
      An optional glob pattern to only include files with paths matching the pattern.
      The syntax follows <code>org.apache.hadoop.fs.GlobFilter</code>.
      It does not change the behavior of partition discovery.
    </td>
  </tr>
</table>
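GlobFilter matches against the file name part of each path, not the full path, which is why partition directories are unaffected. As a rough approximation using only Python's standard library (`fnmatch` is not a drop-in replacement for Hadoop's GlobFilter; for example, it lacks `{a,b}` alternation), the filtering behaves like:

```python
import fnmatch
import os

def path_glob_filter(paths, pattern):
    """Approximate pathGlobFilter: keep paths whose file *name* matches
    the glob pattern. Covers *, ?, and [...] but not GlobFilter's {a,b}."""
    return [p for p in paths if fnmatch.fnmatch(os.path.basename(p), pattern)]

# Hypothetical paths, for demonstration:
paths = [
    "/path/to/data/cat.png",
    "/path/to/data/dog.jpg",
    "/path/to/data/year=2019/fish.png",  # partition directory still scanned
]
print(path_glob_filter(paths, "*.png"))
# ['/path/to/data/cat.png', '/path/to/data/year=2019/fish.png']
```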

To read whole binary files, you need to specify the data source `format` as `binaryFile`.
For example, the following code reads all PNG files from the input directory:

<div class="codetabs">
<div data-lang="scala" markdown="1">
{% highlight scala %}

spark.read.format("binaryFile").option("pathGlobFilter", "*.png").load("/path/to/data")

{% endhighlight %}
</div>

<div data-lang="java" markdown="1">
{% highlight java %}

spark.read().format("binaryFile").option("pathGlobFilter", "*.png").load("/path/to/data");

{% endhighlight %}
</div>

<div data-lang="python" markdown="1">
{% highlight python %}

spark.read.format("binaryFile").option("pathGlobFilter", "*.png").load("/path/to/data")

{% endhighlight %}
</div>

<div data-lang="r" markdown="1">
{% highlight r %}

read.df("/path/to/data", source = "binaryFile", pathGlobFilter = "*.png")

{% endhighlight %}
</div>
</div>

Binary file data source does not support writing a DataFrame back to the original files.

@@ -54,4 +54,5 @@ goes into specific options that are available for the built-in data sources.

* [Compatibility with Databricks spark-avro](sql-data-sources-avro.html#compatibility-with-databricks-spark-avro)
* [Supported types for Avro -> Spark SQL conversion](sql-data-sources-avro.html#supported-types-for-avro---spark-sql-conversion)
* [Supported types for Spark SQL -> Avro conversion](sql-data-sources-avro.html#supported-types-for-spark-sql---avro-conversion)
* [Whole Binary Files](sql-data-sources-binaryFile.html)
* [Troubleshooting](sql-data-sources-troubleshooting.html)