## What changes were proposed in this pull request?
Implement an image schema datasource.
This image datasource support:
- partition discovery (loading partitioned images)
- dropImageFailures (the same behavior with `ImageSchema.readImage`)
- path wildcard matching (the same behavior with `ImageSchema.readImage`)
- loading recursively from directory (different from `ImageSchema.readImage`, but use such path: `/path/to/dir/**`)
This datasource **NOT** support:
- specify `numPartitions` (it will be determined by datasource automatically)
- sampling (you can use `df.sample` later but the sampling operator won't be pushdown to datasource)
## How was this patch tested?
Unit tests.
## Benchmark
I benchmark and compare the cost time between old `ImageSchema.read` API and my image datasource.
**cluster**: 4 nodes, each with 64GB memory, 8 cores CPU
**test dataset**: Flickr8k_Dataset (about 8091 images)
**time cost**:
- My image datasource time (automatically generate 258 partitions): 38.04s
- `ImageSchema.read` time (set 16 partitions): 68.4s
- `ImageSchema.read` time (set 258 partitions): 90.6s
**time cost when increase image number by double (clone Flickr8k_Dataset and loads double number images)**:
- My image datasource time (automatically generate 515 partitions): 95.4s
- `ImageSchema.read` (set 32 partitions): 109s
- `ImageSchema.read` (set 515 partitions): 105s
So we can see that my image datasource implementation (this PR) bring some performance improvement compared against old`ImageSchema.read` API.
Closes#22328 from WeichenXu123/image_datasource.
Authored-by: WeichenXu <weichen.xu@databricks.com>
Signed-off-by: Xiangrui Meng <meng@databricks.com>