spark-instrumented-optimizer

History

WeichenXu 925449283d [SPARK-22666][ML][SQL] Spark datasource for image format	2018-09-05 11:59:00 -07:00

## What changes were proposed in this pull request?

Implement an image schema datasource.

This image datasource supports:
- partition discovery (loading partitioned images)
- dropImageFailures (the same behavior as `ImageSchema.readImage`)
- path wildcard matching (the same behavior as `ImageSchema.readImage`)
- loading recursively from a directory (different from `ImageSchema.readImage`; use a path such as `/path/to/dir/`)

This datasource does NOT support:
- specifying `numPartitions` (it is determined by the datasource automatically)
- sampling (you can use `df.sample` afterwards, but the sampling operator won't be pushed down to the datasource)

## How was this patch tested?

Unit tests.

## Benchmark

I benchmarked and compared the time cost of the old `ImageSchema.read` API and my image datasource.

cluster: 4 nodes, each with 64GB memory and an 8-core CPU
test dataset: Flickr8k_Dataset (about 8091 images)

time cost:
- My image datasource (automatically generates 258 partitions): 38.04s
- `ImageSchema.read` (set 16 partitions): 68.4s
- `ImageSchema.read` (set 258 partitions): 90.6s

time cost when the number of images is doubled (clone Flickr8k_Dataset and load twice as many images):
- My image datasource (automatically generates 515 partitions): 95.4s
- `ImageSchema.read` (set 32 partitions): 109s
- `ImageSchema.read` (set 515 partitions): 105s

So my image datasource implementation (this PR) brings some performance improvement over the old `ImageSchema.read` API.

Closes #22328 from WeichenXu123/image_datasource.

Authored-by: WeichenXu <weichen.xu@databricks.com>
Signed-off-by: Xiangrui Meng <meng@databricks.com>
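
The benchmark in the commit message above reports raw timings only. As a rough illustration, the sketch below turns those timings into speedup ratios (old API time divided by datasource time). The dictionary layout and the names `timings`, `speedup`, and `speedups` are assumptions made for this example and are not part of Spark.

```python
# Timings in seconds, copied from the benchmark in the commit message.
# The structure and key names are illustrative assumptions.
timings = {
    "Flickr8k_Dataset": {
        "datasource": 38.04,
        "ImageSchema.read (16 partitions)": 68.4,
        "ImageSchema.read (258 partitions)": 90.6,
    },
    "Flickr8k_Dataset doubled": {
        "datasource": 95.4,
        "ImageSchema.read (32 partitions)": 109.0,
        "ImageSchema.read (515 partitions)": 105.0,
    },
}


def speedup(baseline_s, new_s):
    """Return how many times faster the new run is than the baseline."""
    return baseline_s / new_s


def speedups(results):
    """Compute the speedup of the datasource over every other run, per dataset."""
    out = {}
    for dataset, runs in results.items():
        new_s = runs["datasource"]
        out[dataset] = {
            name: speedup(seconds, new_s)
            for name, seconds in runs.items()
            if name != "datasource"
        }
    return out


if __name__ == "__main__":
    for dataset, ratios in speedups(timings).items():
        for name, ratio in ratios.items():
            print(f"{dataset}: datasource vs {name}: {ratio:.2f}x")
```

On these figures the datasource is roughly 1.8x to 2.4x faster on the original dataset and about 1.10x to 1.14x faster on the doubled one, so the advantage narrows as the dataset grows. These are single reported runs, so the ratios are indicative only.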
..
als	[SPARK-12247][ML][DOC] Documentation for spark.ml's ALS and collaborative filtering in general	2016-02-16 13:03:28 +00:00
images	[SPARK-22666][ML][SQL] Spark datasource for image format	2018-09-05 11:59:00 -07:00
ridge-data
gmm_data.txt
iris_libsvm.txt	[SPARK-14516][ML][FOLLOW-UP] Move ClusteringEvaluatorSuite test data to data/mllib.	2017-11-07 20:07:30 -08:00
kmeans_data.txt
pagerank_data.txt
pic_data.txt	[SPARK-8758] [MLLIB] Add Python user guide for PowerIterationClustering	2015-07-02 09:59:54 -07:00
sample_binary_classification_data.txt
sample_fpgrowth.txt	[SPARK-5939][MLLib] make FPGrowth example app take parameters	2015-02-23 08:47:28 -08:00
sample_isotonic_regression_libsvm_data.txt	[SPARK-15608][ML][EXAMPLES][DOC] add examples and documents of ml.isotonic regression	2016-06-16 17:35:40 -07:00
sample_kmeans_data.txt	[SPARK-14340][EXAMPLE][DOC] Update Examples and User Guide for ml.BisectingKMeans	2016-05-11 09:56:36 +02:00
sample_lda_data.txt	[SPARK-5539][MLLIB] LDA guide	2015-02-08 23:40:36 -08:00
sample_lda_libsvm_data.txt	[SPARK-15150][EXAMPLE][DOC] Update LDA examples	2016-05-11 12:49:41 +02:00
sample_libsvm_data.txt
sample_linear_regression_data.txt
sample_movielens_data.txt
sample_multiclass_classification_data.txt	[SPARK-7574] [ML] [DOC] User guide for OneVsRest	2015-05-22 13:18:08 -07:00
sample_svm_data.txt
streaming_kmeans_data_test.txt	[SPARK-13013][DOCS] Replace example code in mllib-clustering.md using include_example	2016-03-03 09:32:47 -08:00