spark-instrumented-optimizer/python/pyspark/ml
Ilya Matiach 1edb3175d8 [SPARK-21866][ML][PYSPARK] Adding spark image reader
## What changes were proposed in this pull request?
Adding a Spark image reader, an implementation of a schema for representing images in Spark DataFrames.

The code is taken from the Spark package located here:
(https://github.com/Microsoft/spark-images)

Please see the JIRA for more information (https://issues.apache.org/jira/browse/SPARK-21866)

Please see the mailing list for SPIP vote and approval information:
(http://apache-spark-developers-list.1001551.n3.nabble.com/VOTE-SPIP-SPARK-21866-Image-support-in-Apache-Spark-td22510.html)

# Background and motivation
As Apache Spark is being used more and more in the industry, some new use cases are emerging for different data formats beyond the traditional SQL types or the numerical types (vectors and matrices). Deep Learning applications commonly deal with image processing. A number of projects add some Deep Learning capabilities to Spark (see list below), but they struggle to communicate with each other or with MLlib pipelines because there is no standard way to represent an image in Spark DataFrames. We propose to federate efforts for representing images in Spark by defining a representation that caters to the most common needs of users and library developers.
This SPIP proposes a specification to represent images in Spark DataFrames and Datasets (based on existing industrial standards), and an interface for loading sources of images. It is not meant to be a full-fledged image processing library, but rather the core description that other libraries and users can rely on. Several packages already offer various processing facilities for transforming images or doing more complex operations, and each has various design tradeoffs that make them better as standalone solutions.
This project is a joint collaboration between Microsoft and Databricks, which have been testing this design in two open source packages: MMLSpark and Deep Learning Pipelines.
The proposed image format is an in-memory, decompressed representation that targets low-level applications. It uses significantly more memory than compressed image representations such as JPEG or PNG, but it allows easy communication with popular image processing libraries and has no decoding overhead.
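
To make the memory tradeoff concrete, here is a minimal sketch of a decompressed, row-major image record in plain Python. The `ImageRecord` layout, its field names, and the helper functions are assumptions made for illustration only; they are not taken from this patch and may differ from the schema it actually defines.

```python
from collections import namedtuple

# Hypothetical record layout (an assumption for illustration, not the patch's schema):
# raw pixel bytes plus the dimensions needed to interpret them.
ImageRecord = namedtuple(
    "ImageRecord", ["origin", "height", "width", "nChannels", "data"]
)


def pack_image(origin, pixels):
    """Flatten a height x width x channels nested list of 0-255 ints into raw bytes."""
    height = len(pixels)
    width = len(pixels[0])
    n_channels = len(pixels[0][0])
    flat = bytearray()
    for row in pixels:
        for pixel in row:
            flat.extend(pixel)  # row-major order, channels interleaved
    return ImageRecord(origin, height, width, n_channels, bytes(flat))


def get_value(image, row, col, channel):
    """Read one channel value directly from the raw buffer; no decoding step."""
    offset = (row * image.width + col) * image.nChannels + channel
    return image.data[offset]


def raw_size_bytes(height, width, n_channels):
    """Size of the decompressed buffer, assuming one byte per channel value."""
    return height * width * n_channels
```

Under these assumptions, a 1920 x 1080 image with three one-byte channels occupies `raw_size_bytes(1080, 1920, 3)` bytes, about 6.2 MB, regardless of how small the compressed file was. In exchange, any pixel can be read with simple offset arithmetic.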

## How was this patch tested?

Unit tests in Scala (ImageSchemaSuite) and unit tests in Python.

Author: Ilya Matiach <ilmat@microsoft.com>
Author: hyukjinkwon <gurwls223@gmail.com>

Closes #19439 from imatiach-msft/ilmat/spark-images.
2017-11-22 15:45:45 -08:00
..
linalg [SPARK-20214][ML] Make sure converted csc matrix has sorted indices 2017-04-05 17:46:44 -07:00
param [SPARK-21027][ML][PYTHON] Added tunable parallelism to one vs. rest in both Scala mllib and Pyspark 2017-09-12 10:02:27 -07:00
__init__.py [SPARK-21633][ML][PYTHON] UnaryTransformer in Python 2017-08-04 01:01:32 -07:00
base.py [SPARK-21633][ML][PYTHON] UnaryTransformer in Python 2017-08-04 01:01:32 -07:00
classification.py [SPARK-18608][ML][FOLLOWUP] Fix double caching for PySpark OneVsRest. 2017-09-14 14:09:44 +08:00
clustering.py [SPARK-10931][ML][PYSPARK] PySpark Models Copy Param Values from Estimator 2017-08-22 17:40:50 -07:00
common.py [SPARK-17679] [PYSPARK] remove unnecessary Py4J ListConverter patch 2016-10-03 14:12:03 -07:00
evaluation.py [SPARK-21981][PYTHON][ML] Added Python interface for ClusteringEvaluator 2017-09-22 13:12:33 +08:00
feature.py [SPARK-22521][ML] VectorIndexerModel support handle unseen categories via handleInvalid: Python API 2017-11-21 10:53:53 -08:00
fpm.py [SPARK-20768][PYSPARK][ML] Expose numPartitions (expert) param of PySpark FPGrowth. 2017-05-25 21:40:39 +08:00
image.py [SPARK-21866][ML][PYSPARK] Adding spark image reader 2017-11-22 15:45:45 -08:00
pipeline.py [SPARK-17025][ML][PYTHON] Persistence for Pipelines with Python-only Stages 2017-08-11 23:57:08 -07:00
recommendation.py [SPARK-20679][ML] Support recommending for a subset of users/items in ALSModel 2017-10-09 10:42:33 +02:00
regression.py [SPARK-21523][ML] update breeze to 0.13.2 for an emergency bugfix in strong wolfe line search 2017-08-09 14:44:10 +08:00
stat.py [SPARK-20076][ML][PYSPARK] Add Python interface for ml.stats.Correlation 2017-04-07 11:00:10 +02:00
tests.py [SPARK-21866][ML][PYSPARK] Adding spark image reader 2017-11-22 15:45:45 -08:00
tuning.py [SPARK-21911][ML][PYSPARK] Parallel Model Evaluation for ML Tuning in PySpark 2017-10-27 15:19:27 -07:00
util.py [SPARK-22313][PYTHON] Mark/print deprecation warnings as DeprecationWarning for deprecated APIs 2017-10-24 12:44:47 +09:00
wrapper.py [SPARK-10931][ML][PYSPARK] PySpark Models Copy Param Values from Estimator 2017-08-22 17:40:50 -07:00