
Ilya Matiach 1edb3175d8 [SPARK-21866][ML][PYSPARK] Adding spark image reader

## What changes were proposed in this pull request?

Adding spark image reader, an implementation of schema for representing images in spark DataFrames

The code is taken from the spark package located here:
(https://github.com/Microsoft/spark-images)

Please see the JIRA for more information (https://issues.apache.org/jira/browse/SPARK-21866)

Please see mailing list for SPIP vote and approval information:
(http://apache-spark-developers-list.1001551.n3.nabble.com/VOTE-SPIP-SPARK-21866-Image-support-in-Apache-Spark-td22510.html)

# Background and motivation

As Apache Spark is being used more and more in the industry, some new use cases are emerging for different data formats beyond the traditional SQL types or the numerical types (vectors and matrices). Deep Learning applications commonly deal with image processing. A number of projects add some Deep Learning capabilities to Spark (see list below), but they struggle to communicate with each other or with MLlib pipelines because there is no standard way to represent an image in Spark DataFrames. We propose to federate efforts for representing images in Spark by defining a representation that caters to the most common needs of users and library developers.

This SPIP proposes a specification to represent images in Spark DataFrames and Datasets (based on existing industrial standards), and an interface for loading sources of images. It is not meant to be a full-fledged image processing library, but rather the core description that other libraries and users can rely on. Several packages already offer various processing facilities for transforming images or doing more complex operations, and each has various design tradeoffs that make them better as standalone solutions.

This project is a joint collaboration between Microsoft and Databricks, which have been testing this design in two open source packages: MMLSpark and Deep Learning Pipelines.

The proposed image format is an in-memory, decompressed representation that targets low-level applications. It is significantly more liberal in memory usage than compressed image representations such as JPEG, PNG, etc., but it allows easy communication with popular image processing libraries and has no decoding overhead.

## How was this patch tested?

Unit tests in scala ImageSchemaSuite, unit tests in python

Author: Ilya Matiach <ilmat@microsoft.com>
Author: hyukjinkwon <gurwls223@gmail.com>

Closes #19439 from imatiach-msft/ilmat/spark-images.

2017-11-22 15:45:45 -08:00
create-release	[SPARK-22377][BUILD] Use /usr/sbin/lsof if lsof does not exists in release-build.sh	2017-11-14 08:28:13 +09:00
deps	[SPARK-19112][CORE] Support for ZStandard codec	2017-11-01 14:54:08 +01:00
sparktestsupport	[SPARK-21866][ML][PYSPARK] Adding spark image reader	2017-11-22 15:45:45 -08:00
tests	[SPARK-10359] Enumerate dependencies in a file and diff against it for new pull requests	2015-12-30 12:47:42 -08:00
.gitignore	[SPARK-6219] Reuse pep8.py	2015-04-18 16:46:28 -07:00
.rat-excludes	[SPARK-20434][YARN][CORE] Move Hadoop delegation token code from yarn to core	2017-06-15 11:46:00 -07:00
appveyor-guide.md	[SPARK-17200][PROJECT INFRA][BUILD][SPARKR] Automate building and testing on Windows (currently SparkR only)	2016-09-08 08:26:59 -07:00
appveyor-install-dependencies.ps1	[MINOR][BUILD] Download RAT and R version info over HTTPS; use RAT 0.12	2017-08-12 14:31:05 +09:00
change-scala-version.sh	[SPARK-19810][BUILD][CORE] Remove support for Scala 2.10	2017-07-13 17:06:24 +08:00
check-license	[SPARK-22511][BUILD] Update maven central repo address	2017-11-14 17:58:07 -06:00
checkstyle-suppressions.xml	[HOTFIX][BUILD] Fix finalizer checkstyle error and re-disable checkstyle	2017-09-27 13:40:21 -07:00
checkstyle.xml	[HOTFIX][BUILD] Fix finalizer checkstyle error and re-disable checkstyle	2017-09-27 13:40:21 -07:00
github_jira_sync.py	[SPARK-19002][BUILD][PYTHON] Check pep8 against all Python scripts	2017-01-02 15:23:19 +00:00
lint-java	[SPARK-16967] move mesos to module	2016-08-26 12:25:22 -07:00
lint-python	[MINOR][PYTHON] Ignore pep8 on test scripts generated in tests in work directory	2017-06-02 14:25:38 +01:00
lint-r	[SPARK-10328] [SPARKR] Fix generic for na.omit	2015-08-28 00:37:50 -07:00
lint-r.R	[SPARK-22063][R] Fixes lint check failures in R by latest commit sha1 ID of lint-r	2017-10-01 18:42:45 +09:00
lint-scala	[SPARK-2627] [PySpark] have the build enforce PEP 8 automatically	2014-08-06 12:58:24 -07:00
make-distribution.sh	[SPARK-20123][BUILD] SPARK_HOME variable might have spaces in it(e.g. $SPARK…	2017-04-02 15:31:13 +01:00
merge_spark_pr.py	[MINOR] Minor comment fixes in merge_spark_pr.py script	2017-07-31 10:07:33 +09:00
mima	[SPARK-22142][BUILD][STREAMING] Move Flume support behind a profile, take 2	2017-10-06 15:08:28 +01:00
pip-sanity-check.py	[SPARK-19064][PYSPARK] Fix pip installing of sub components	2017-01-25 14:43:39 -08:00
README.md	Merge pull request #565 from pwendell/dev-scripts. Closes #565 .	2014-02-08 23:13:34 -08:00
requirements.txt	[SPARK-19064][PYSPARK] Fix pip installing of sub components	2017-01-25 14:43:39 -08:00
run-pip-tests	Revert "[SPARK-13534][PYSPARK] Using Apache Arrow to increase performance of DataFrame.toPandas"	2017-06-28 14:28:40 +08:00
run-tests	[SPARK-22302][INFRA] Remove manual backports for subprocess and print explicit message for < Python 2.7	2017-10-22 02:22:35 +09:00
run-tests-jenkins	[SPARK-22302][INFRA] Remove manual backports for subprocess and print explicit message for < Python 2.7	2017-10-22 02:22:35 +09:00
run-tests-jenkins.py	[SPARK-21189][INFRA] Handle unknown error codes in Jenkins rather then leaving incomplete comment in PRs	2017-06-24 10:14:31 +01:00
run-tests.py	[SPARK-22376][TESTS] Makes dev/run-tests.py script compatible with Python 3	2017-11-07 19:45:34 +09:00
scalastyle	[SPARK-22142][BUILD][STREAMING] Move Flume support behind a profile, take 2	2017-10-06 15:08:28 +01:00
test-dependencies.sh	[SPARK-22142][BUILD][STREAMING] Move Flume support behind a profile, take 2	2017-10-06 15:08:28 +01:00
tox.ini	[SPARK-22375][TEST] Test script can fail if eggs are installed by set…	2017-10-29 15:29:23 +09:00

README.md

Spark Developer Scripts

This directory contains scripts useful to developers when packaging, testing, or committing to Spark.

Many of these scripts require Apache credentials to work correctly.