## What changes were proposed in this pull request? Fix Typos. This PR is the complete version of https://github.com/apache/spark/pull/23145. ## How was this patch tested? NA Closes #23185 from kjmrknsn/docUpdate. Authored-by: Keiji Yoshida <kjmrknsn@gmail.com> Signed-off-by: Sean Owen <sean.owen@databricks.com>
---
layout: global
title: Data sources
displayTitle: Data sources
---

This section describes how to use data sources in ML to load data.
In addition to general data sources such as Parquet, CSV, JSON, and JDBC, Spark also provides some data sources specific to ML.

**Table of Contents**

* This will become a table of contents (this text will be scraped).
{:toc}

## Image data source

The image data source loads image files from a directory. It can decode compressed images (JPEG, PNG, etc.) into a raw image representation via `ImageIO` in the Java library.
The loaded DataFrame has one `StructType` column, `image`, which contains image data stored according to the image schema.
The schema of the `image` column is:
 - origin: `StringType` (represents the file path of the image)
 - height: `IntegerType` (height of the image)
 - width: `IntegerType` (width of the image)
 - nChannels: `IntegerType` (number of image channels)
 - mode: `IntegerType` (OpenCV-compatible type)
 - data: `BinaryType` (image bytes in OpenCV-compatible order: row-wise BGR in most cases)

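
To make the layout of the `data` field concrete, the following is a minimal sketch in plain Python of how one pixel could be located in the byte array. It assumes 8-bit channels stored row by row with the channels of each pixel interleaved, which is the usual OpenCV layout. The helper `pixel_at`, the `OCV_MODES` table, and the 2x2 sample image are hypothetical illustrations and not part of the Spark API; the mode codes listed are OpenCV's standard codes for 8-bit unsigned images.

```python
from typing import Tuple

# OpenCV type codes for 8-bit unsigned images, as they may appear in `mode`.
OCV_MODES = {0: "CV_8UC1", 16: "CV_8UC3", 24: "CV_8UC4"}


def pixel_at(data: bytes, width: int, n_channels: int,
             row: int, col: int) -> Tuple[int, ...]:
    """Return the channel values of one pixel.

    Assumes row-major storage with interleaved 8-bit channels,
    so a 3-channel pixel comes back as (B, G, R).
    """
    start = (row * width + col) * n_channels
    if row < 0 or col < 0 or col >= width or start + n_channels > len(data):
        raise IndexError("pixel coordinates are outside the image")
    return tuple(data[start:start + n_channels])


# Synthetic 2x2, 3-channel image (not real Spark output).
height, width, n_channels = 2, 2, 3
data = bytes([
    255, 0, 0,    0, 255, 0,    # row 0: pure blue, pure green
    0, 0, 255,    10, 20, 30,   # row 1: pure red, arbitrary pixel
])
assert len(data) == height * width * n_channels

print(pixel_at(data, width, n_channels, 0, 0))  # (255, 0, 0)
print(pixel_at(data, width, n_channels, 1, 1))  # (10, 20, 30)
```

With real data, `height`, `width`, `nChannels`, and `data` would come from the fields of the `image` struct rather than from literals.
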
<div class="codetabs">
<div data-lang="scala" markdown="1">
[`ImageDataSource`](api/scala/index.html#org.apache.spark.ml.source.image.ImageDataSource)
implements a Spark SQL data source API for loading image data as a DataFrame.

{% highlight scala %}
scala> val df = spark.read.format("image").option("dropInvalid", true).load("data/mllib/images/origin/kittens")
df: org.apache.spark.sql.DataFrame = [image: struct<origin: string, height: int ... 4 more fields>]

scala> df.select("image.origin", "image.width", "image.height").show(truncate=false)
+-----------------------------------------------------------------------+-----+------+
|origin                                                                 |width|height|
+-----------------------------------------------------------------------+-----+------+
|file:///spark/data/mllib/images/origin/kittens/54893.jpg               |300  |311   |
|file:///spark/data/mllib/images/origin/kittens/DP802813.jpg            |199  |313   |
|file:///spark/data/mllib/images/origin/kittens/29.5.a_b_EGDP022204.jpg |300  |200   |
|file:///spark/data/mllib/images/origin/kittens/DP153539.jpg            |300  |296   |
+-----------------------------------------------------------------------+-----+------+
{% endhighlight %}
</div>

<div data-lang="java" markdown="1">
[`ImageDataSource`](api/java/org/apache/spark/ml/source/image/ImageDataSource.html)
implements a Spark SQL data source API for loading image data as a DataFrame.

{% highlight java %}
Dataset<Row> imagesDF = spark.read().format("image").option("dropInvalid", true).load("data/mllib/images/origin/kittens");
imagesDF.select("image.origin", "image.width", "image.height").show(false);
/*
Will output:
+-----------------------------------------------------------------------+-----+------+
|origin                                                                 |width|height|
+-----------------------------------------------------------------------+-----+------+
|file:///spark/data/mllib/images/origin/kittens/54893.jpg               |300  |311   |
|file:///spark/data/mllib/images/origin/kittens/DP802813.jpg            |199  |313   |
|file:///spark/data/mllib/images/origin/kittens/29.5.a_b_EGDP022204.jpg |300  |200   |
|file:///spark/data/mllib/images/origin/kittens/DP153539.jpg            |300  |296   |
+-----------------------------------------------------------------------+-----+------+
*/
{% endhighlight %}
</div>

<div data-lang="python" markdown="1">
PySpark provides a Spark SQL data source API for loading image data as a DataFrame.

{% highlight python %}
>>> df = spark.read.format("image").option("dropInvalid", True).load("data/mllib/images/origin/kittens")
>>> df.select("image.origin", "image.width", "image.height").show(truncate=False)
+-----------------------------------------------------------------------+-----+------+
|origin                                                                 |width|height|
+-----------------------------------------------------------------------+-----+------+
|file:///spark/data/mllib/images/origin/kittens/54893.jpg               |300  |311   |
|file:///spark/data/mllib/images/origin/kittens/DP802813.jpg            |199  |313   |
|file:///spark/data/mllib/images/origin/kittens/29.5.a_b_EGDP022204.jpg |300  |200   |
|file:///spark/data/mllib/images/origin/kittens/DP153539.jpg            |300  |296   |
+-----------------------------------------------------------------------+-----+------+
{% endhighlight %}
</div>

<div data-lang="r" markdown="1">
SparkR provides a Spark SQL data source API for loading image data as a DataFrame.

{% highlight r %}
> df = read.df("data/mllib/images/origin/kittens", "image")
> head(select(df, df$image.origin, df$image.width, df$image.height))

1               file:///spark/data/mllib/images/origin/kittens/54893.jpg
2            file:///spark/data/mllib/images/origin/kittens/DP802813.jpg
3 file:///spark/data/mllib/images/origin/kittens/29.5.a_b_EGDP022204.jpg
4            file:///spark/data/mllib/images/origin/kittens/DP153539.jpg
  width height
1   300    311
2   199    313
3   300    200
4   300    296

{% endhighlight %}
</div>

</div>
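
Because the `data` field is stored in BGR order in most cases, tools that expect RGB input need the first and third channel of every pixel swapped. The following is a minimal sketch in plain Python, assuming an 8-bit, 3-channel image (`nChannels` of 3) whose bytes have already been collected from a row. The function name `bgr_to_rgb` is hypothetical and not part of Spark.

```python
def bgr_to_rgb(data: bytes) -> bytes:
    """Swap the first and third channel of every 3-byte pixel.

    Assumes interleaved 8-bit channels and exactly 3 channels per pixel.
    """
    if len(data) % 3 != 0:
        raise ValueError("expected a whole number of 3-byte pixels")
    out = bytearray(data)
    out[0::3] = data[2::3]  # R values move to the first slot of each pixel
    out[2::3] = data[0::3]  # B values move to the third slot of each pixel
    return bytes(out)


# Two synthetic pixels in BGR order (not real Spark output).
bgr = bytes([1, 2, 3, 4, 5, 6])
print(list(bgr_to_rgb(bgr)))  # [3, 2, 1, 6, 5, 4]
```

Images with one or four channels would need different handling, so checking `nChannels` before calling such a helper is advisable.
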