78a403fab9
## What changes were proposed in this pull request?

### Background:
The data source option `pathGlobFilter` was introduced for the Binary file format in https://github.com/apache/spark/pull/24354. It can be used to filter file names, e.g. to read only `.png` files when there are `.json` files in the same directory.

### Proposal:
Make `pathGlobFilter` a general option for all file sources. The path filtering should happen during path globbing on the Driver.

### Motivation:
Filtering the file path names in file scan tasks on executors is kind of ugly.

### Impact:
1. The splitting of file partitions will be more balanced.
2. The metrics of file scan will be more accurate.
3. Users can use the option for reading other file sources.

## How was this patch tested?
Unit tests

Closes #24518 from gengliangwang/globFilter.

Authored-by: Gengliang Wang <gengliang.wang@databricks.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
69 lines
2.3 KiB
Markdown
---
layout: global
title: Binary File Data Source
displayTitle: Binary File Data Source
license: |
  Licensed to the Apache Software Foundation (ASF) under one or more
  contributor license agreements. See the NOTICE file distributed with
  this work for additional information regarding copyright ownership.
  The ASF licenses this file to You under the Apache License, Version 2.0
  (the "License"); you may not use this file except in compliance with
  the License. You may obtain a copy of the License at

     http://www.apache.org/licenses/LICENSE-2.0

  Unless required by applicable law or agreed to in writing, software
  distributed under the License is distributed on an "AS IS" BASIS,
  WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
  See the License for the specific language governing permissions and
  limitations under the License.
---

Since Spark 3.0, Spark supports a binary file data source,
which reads binary files and converts each file into a single record that contains the raw content
and metadata of the file.
It produces a DataFrame with the following columns and possibly partition columns:
* `path`: StringType
* `modificationTime`: TimestampType
* `length`: LongType
* `content`: BinaryType
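
To make the shape of one record concrete, the following pure-Python sketch mimics the four columns for the files in a local directory. It is only an illustration and not how Spark implements the data source: the `BinaryFileRecord` class and the `read_binary_files` helper are hypothetical names, and partition columns are left out.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from pathlib import Path
from typing import List


@dataclass
class BinaryFileRecord:
    """Hypothetical stand-in for one row produced by the binary file source."""
    path: str                    # corresponds to StringType
    modificationTime: datetime   # corresponds to TimestampType
    length: int                  # corresponds to LongType
    content: bytes               # corresponds to BinaryType


def read_binary_files(directory: str) -> List[BinaryFileRecord]:
    """Turn every regular file directly under `directory` into one record."""
    records = []
    for entry in sorted(Path(directory).iterdir()):
        if not entry.is_file():
            continue  # skip sub-directories; this sketch does not recurse
        stat = entry.stat()
        records.append(
            BinaryFileRecord(
                path=str(entry),
                modificationTime=datetime.fromtimestamp(stat.st_mtime, tz=timezone.utc),
                length=stat.st_size,
                content=entry.read_bytes(),  # the whole file as raw bytes
            )
        )
    return records
```

Each file becomes exactly one record, so `length` always equals `len(content)` in this sketch.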

To read whole binary files, you need to specify the data source `format` as `binaryFile`.
To load files with paths matching a given glob pattern while keeping the behavior of partition discovery,
you can use the general data source option `pathGlobFilter`.
For example, the following code reads all PNG files from the input directory:

<div class="codetabs">
<div data-lang="scala" markdown="1">
{% highlight scala %}

spark.read.format("binaryFile").option("pathGlobFilter", "*.png").load("/path/to/data")

{% endhighlight %}
</div>

<div data-lang="java" markdown="1">
{% highlight java %}

spark.read().format("binaryFile").option("pathGlobFilter", "*.png").load("/path/to/data");

{% endhighlight %}
</div>
<div data-lang="python" markdown="1">
{% highlight python %}

spark.read.format("binaryFile").option("pathGlobFilter", "*.png").load("/path/to/data")

{% endhighlight %}
</div>
<div data-lang="r" markdown="1">
{% highlight r %}

read.df("/path/to/data", source = "binaryFile", pathGlobFilter = "*.png")

{% endhighlight %}
</div>
</div>
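
The effect of a filter such as `*.png` can be approximated outside Spark with a short pure-Python sketch. It assumes the pattern is tested against the file name only, not the full path, which is why files inside partition directories can still match. The `filter_by_glob` helper and the sample paths (including the `year=2019` partition directory) are hypothetical. Python's `fnmatch` syntax is also not identical to the glob syntax Spark uses, so treat this as an illustration of the idea, not a reference for the pattern language.

```python
from fnmatch import fnmatchcase
from posixpath import basename


def filter_by_glob(paths, pattern):
    """Keep the paths whose file name (last path component) matches `pattern`."""
    return [p for p in paths if fnmatchcase(basename(p), pattern)]


# Hypothetical directory listing, including one partition directory.
paths = [
    "/path/to/data/cat.png",
    "/path/to/data/meta.json",
    "/path/to/data/year=2019/dog.png",
]

matched = filter_by_glob(paths, "*.png")
print(matched)
```

In this sketch the two `.png` paths are kept and `meta.json` is dropped; the partition directory in the third path plays no part in the match.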

The binary file data source does not support writing a DataFrame back to the original files.