spark-instrumented-optimizer/data/mllib
WeichenXu 925449283d [SPARK-22666][ML][SQL] Spark datasource for image format
## What changes were proposed in this pull request?

Implement an image schema datasource.

This image datasource supports:
  - partition discovery (loading partitioned images)
  - dropImageFailures (same behavior as `ImageSchema.readImage`)
  - path wildcard matching (same behavior as `ImageSchema.readImage`)
  - loading recursively from a directory (unlike `ImageSchema.readImage`; use a path such as `/path/to/dir/**`)

This datasource does **NOT** support:
  - specifying `numPartitions` (the datasource determines it automatically)
  - sampling (you can call `df.sample` afterwards, but the sampling operator won't be pushed down to the datasource)

## How was this patch tested?
Unit tests.

## Benchmark
I benchmarked the old `ImageSchema.read` API against my image datasource and compared the elapsed times.

**cluster**: 4 nodes, each with 64GB memory and 8 CPU cores
**test dataset**: Flickr8k_Dataset (about 8091 images)

**time cost**:
- My image datasource (258 partitions, generated automatically): 38.04s
- `ImageSchema.read` (set to 16 partitions): 68.4s
- `ImageSchema.read` (set to 258 partitions): 90.6s

**time cost when the number of images is doubled (Flickr8k_Dataset cloned so that twice as many images are loaded)**:
- My image datasource (515 partitions, generated automatically): 95.4s
- `ImageSchema.read` (set to 32 partitions): 109s
- `ImageSchema.read` (set to 515 partitions): 105s

So my image datasource implementation (this PR) brings some performance improvement over the old `ImageSchema.read` API in both runs.
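
To put the timings above in relative terms, here is a minimal sketch that divides each `ImageSchema.read` time by the matching datasource time. The numbers are copied from the two benchmark lists. The dictionary keys and the `speedups` helper are illustrative names only and are not part of this PR or of Spark.

```python
# Wall-clock times in seconds, copied from the benchmark lists above.
# The labels are illustrative; they are not Spark identifiers.
BENCHMARKS = {
    "Flickr8k": {
        "datasource": 38.04,
        "ImageSchema.read (16 partitions)": 68.4,
        "ImageSchema.read (258 partitions)": 90.6,
    },
    "Flickr8k doubled": {
        "datasource": 95.4,
        "ImageSchema.read (32 partitions)": 109.0,
        "ImageSchema.read (515 partitions)": 105.0,
    },
}


def speedups(timings, baseline_key="datasource"):
    """Return how many times slower each run is than the baseline run."""
    base = timings[baseline_key]
    return {
        name: elapsed / base
        for name, elapsed in timings.items()
        if name != baseline_key
    }


if __name__ == "__main__":
    for dataset, timings in BENCHMARKS.items():
        for name, ratio in speedups(timings).items():
            print(f"{dataset}: {name} took {ratio:.2f}x the datasource time")
```

By this arithmetic the old API takes roughly 1.8x to 2.4x as long on the original dataset and roughly 1.1x as long on the doubled one, so the gap narrows as the dataset grows. These are single measurements, so the ratios are indicative rather than precise.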

Closes #22328 from WeichenXu123/image_datasource.

Authored-by: WeichenXu <weichen.xu@databricks.com>
Signed-off-by: Xiangrui Meng <meng@databricks.com>
2018-09-05 11:59:00 -07:00
als [SPARK-12247][ML][DOC] Documentation for spark.ml's ALS and collaborative filtering in general 2016-02-16 13:03:28 +00:00
images [SPARK-22666][ML][SQL] Spark datasource for image format 2018-09-05 11:59:00 -07:00
ridge-data
gmm_data.txt
iris_libsvm.txt [SPARK-14516][ML][FOLLOW-UP] Move ClusteringEvaluatorSuite test data to data/mllib. 2017-11-07 20:07:30 -08:00
kmeans_data.txt
pagerank_data.txt
pic_data.txt [SPARK-8758] [MLLIB] Add Python user guide for PowerIterationClustering 2015-07-02 09:59:54 -07:00
sample_binary_classification_data.txt
sample_fpgrowth.txt [SPARK-5939][MLLib] make FPGrowth example app take parameters 2015-02-23 08:47:28 -08:00
sample_isotonic_regression_libsvm_data.txt [SPARK-15608][ML][EXAMPLES][DOC] add examples and documents of ml.isotonic regression 2016-06-16 17:35:40 -07:00
sample_kmeans_data.txt [SPARK-14340][EXAMPLE][DOC] Update Examples and User Guide for ml.BisectingKMeans 2016-05-11 09:56:36 +02:00
sample_lda_data.txt [SPARK-5539][MLLIB] LDA guide 2015-02-08 23:40:36 -08:00
sample_lda_libsvm_data.txt [SPARK-15150][EXAMPLE][DOC] Update LDA examples 2016-05-11 12:49:41 +02:00
sample_libsvm_data.txt
sample_linear_regression_data.txt
sample_movielens_data.txt
sample_multiclass_classification_data.txt [SPARK-7574] [ML] [DOC] User guide for OneVsRest 2015-05-22 13:18:08 -07:00
sample_svm_data.txt
streaming_kmeans_data_test.txt [SPARK-13013][DOCS] Replace example code in mllib-clustering.md using include_example 2016-03-03 09:32:47 -08:00