[SPARK-21866][ML][PYSPARK] Adding spark image reader
## What changes were proposed in this pull request?
Adds a Spark image reader: an implementation of a schema for representing images in Spark DataFrames.
The code is taken from the spark package located here:
(https://github.com/Microsoft/spark-images)
Please see the JIRA for more information (https://issues.apache.org/jira/browse/SPARK-21866)
Please see mailing list for SPIP vote and approval information:
(http://apache-spark-developers-list.1001551.n3.nabble.com/VOTE-SPIP-SPARK-21866-Image-support-in-Apache-Spark-td22510.html)
# Background and motivation
As Apache Spark is being used more and more in the industry, some new use cases are emerging for different data formats beyond the traditional SQL types or the numerical types (vectors and matrices). Deep Learning applications commonly deal with image processing. A number of projects add some Deep Learning capabilities to Spark (see list below), but they struggle to communicate with each other or with MLlib pipelines because there is no standard way to represent an image in Spark DataFrames. We propose to federate efforts for representing images in Spark by defining a representation that caters to the most common needs of users and library developers.
This SPIP proposes a specification to represent images in Spark DataFrames and Datasets (based on existing industrial standards), and an interface for loading sources of images. It is not meant to be a full-fledged image processing library, but rather the core description that other libraries and users can rely on. Several packages already offer various processing facilities for transforming images or doing more complex operations, and each has various design tradeoffs that make them better as standalone solutions.
This project is a joint collaboration between Microsoft and Databricks, which have been testing this design in two open source packages: MMLSpark and Deep Learning Pipelines.
The proposed image format is an in-memory, decompressed representation that targets low-level applications. It is significantly more liberal in memory usage than compressed image representations such as JPEG, PNG, etc., but it allows easy communication with popular image processing libraries and has no decoding overhead.
## How was this patch tested?
Unit tests in Scala (`ImageSchemaSuite`) and unit tests in Python.
Author: Ilya Matiach <ilmat@microsoft.com>
Author: hyukjinkwon <gurwls223@gmail.com>
Closes #19439 from imatiach-msft/ilmat/spark-images.
2017-11-22 18:45:45 -05:00
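The decompressed, OpenCV-style row layout described above (origin, height, width, nChannels, mode, and raw bytes in `data`) can be sketched with plain NumPy. This is a minimal illustration of the layout, not the reader itself; it assumes 8-bit, row-major, interleaved-channel data, and the sample values are made up:

```python
import numpy as np

# A 2x2 image with 3 interleaved channels (OpenCV-style BGR), stored as raw,
# decompressed bytes in row-major (height, width, channel) order -- the layout
# used by the binary "data" field of an image row.
height, width, n_channels = 2, 2, 3
pixels = np.arange(height * width * n_channels, dtype=np.uint8)
data = bytearray(pixels.tobytes())  # stand-in for the "data" field

# A low-level consumer can view the bytes without copying by supplying the
# strides of a C-contiguous (height, width, channel) array:
img = np.ndarray(
    shape=(height, width, n_channels),
    dtype=np.uint8,
    buffer=data,
    strides=(width * n_channels, n_channels, 1))

assert img.shape == (2, 2, 3)
assert img[1, 0, 2] == 8  # offset = 1*(width*n_channels) + 0*n_channels + 2
```

Because the representation is already decoded, such views are cheap to construct; the trade-off is that raw pixels occupy more memory than JPEG or PNG payloads.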
#
# Licensed to the Apache Software Foundation (ASF) under one or more
# contributor license agreements.  See the NOTICE file distributed with
# this work for additional information regarding copyright ownership.
# The ASF licenses this file to You under the Apache License, Version 2.0
# (the "License"); you may not use this file except in compliance with
# the License.  You may obtain a copy of the License at
#
#    http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
#
|
"""
.. attribute:: ImageSchema

    An attribute of this module that contains the instance of :class:`_ImageSchema`.

.. autoclass:: _ImageSchema
   :members:
"""
import sys
import numpy as np
from pyspark import SparkContext
from pyspark.sql.types import Row, _create_row, _parse_datatype_json_string
from pyspark.sql import DataFrame, SparkSession


class _ImageSchema(object):
    """
    Internal class for the `pyspark.ml.image.ImageSchema` attribute. Meant to be private
    and not to be instantiated. Use the `pyspark.ml.image.ImageSchema` attribute to
    access the APIs of this class.
    """

    def __init__(self):
        self._imageSchema = None
        self._ocvTypes = None
        self._columnSchema = None
        self._imageFields = None
        self._undefinedImageType = None
    @property
    def imageSchema(self):
        """
        Returns the image schema.

        :return: a :class:`StructType` with a single column of images
            named "image" (nullable) and having the same type returned by
            :meth:`columnSchema`.

        .. versionadded:: 2.3.0
        """
        if self._imageSchema is None:
            ctx = SparkContext._active_spark_context
            jschema = ctx._jvm.org.apache.spark.ml.image.ImageSchema.imageSchema()
            self._imageSchema = _parse_datatype_json_string(jschema.json())
        return self._imageSchema

    @property
    def ocvTypes(self):
        """
        Returns the supported OpenCV type mapping.

        :return: a dictionary containing the supported OpenCV type mapping.

        .. versionadded:: 2.3.0
        """
        if self._ocvTypes is None:
            ctx = SparkContext._active_spark_context
            self._ocvTypes = dict(ctx._jvm.org.apache.spark.ml.image.ImageSchema.javaOcvTypes())
        return self._ocvTypes

    @property
    def columnSchema(self):
        """
        Returns the schema for the image column.

        :return: a :class:`StructType` for the image column,
            ``struct<origin:string, height:int, width:int, nChannels:int, mode:int, data:binary>``.

        .. versionadded:: 2.4.0
        """
        if self._columnSchema is None:
            ctx = SparkContext._active_spark_context
            jschema = ctx._jvm.org.apache.spark.ml.image.ImageSchema.columnSchema()
            self._columnSchema = _parse_datatype_json_string(jschema.json())
        return self._columnSchema

    @property
    def imageFields(self):
        """
        Returns the field names of the image columns.

        :return: a list of field names.

        .. versionadded:: 2.3.0
        """
        if self._imageFields is None:
            ctx = SparkContext._active_spark_context
            self._imageFields = list(ctx._jvm.org.apache.spark.ml.image.ImageSchema.imageFields())
        return self._imageFields

    @property
    def undefinedImageType(self):
        """
        Returns the name of the undefined image type, used for invalid images.

        .. versionadded:: 2.3.0
        """
        if self._undefinedImageType is None:
            ctx = SparkContext._active_spark_context
            self._undefinedImageType = \
                ctx._jvm.org.apache.spark.ml.image.ImageSchema.undefinedImageType()
        return self._undefinedImageType

    def toNDArray(self, image):
        """
        Converts an image to an array with metadata.
[SPARK-21866][ML][PYTHON][FOLLOWUP] Few cleanups and fix image test failure in Python 3.6.0 / NumPy 1.13.3
## What changes were proposed in this pull request?
The image test seems to fail in Python 3.6.0 / NumPy 1.13.3. I manually tested as below:
```
======================================================================
ERROR: test_read_images (pyspark.ml.tests.ImageReaderTest)
----------------------------------------------------------------------
Traceback (most recent call last):
File "/.../spark/python/pyspark/ml/tests.py", line 1831, in test_read_images
self.assertEqual(ImageSchema.toImage(array, origin=first_row[0]), first_row)
File "/.../spark/python/pyspark/ml/image.py", line 149, in toImage
data = bytearray(array.astype(dtype=np.uint8).ravel())
TypeError: only integer scalar arrays can be converted to a scalar index
----------------------------------------------------------------------
Ran 1 test in 7.606s
```
To be clear, I think the error comes from NumPy - https://github.com/numpy/numpy/blob/75b2d5d427afdb1392f2a0b2092e0767e4bab53d/numpy/core/src/multiarray/number.c#L947
For a smaller scope:
```python
>>> import numpy as np
>>> bytearray(np.array([1]).astype(dtype=np.uint8))
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
TypeError: only integer scalar arrays can be converted to a scalar index
```
In Python 2.7 / NumPy 1.13.1, it prints:
```
bytearray(b'\x01')
```
So, here, I simply worked around it by converting it to bytes as below:
```python
>>> bytearray(np.array([1]).astype(dtype=np.uint8).tobytes())
bytearray(b'\x01')
```
Also, while looking into it again, I realised a few arguments could be quite confusing, for example, `Row`, which needs some specific attributes, and `numpy.ndarray`. I added some type checking and tests accordingly, so it now shows an error message as below:
```
TypeError: array argument should be numpy.ndarray; however, it got [<class 'str'>].
```
## How was this patch tested?
Manually tested with `./python/run-tests`.
And also:
```
PYSPARK_PYTHON=python3 SPARK_TESTING=1 bin/pyspark pyspark.ml.tests ImageReaderTest
```
Author: hyukjinkwon <gurwls223@gmail.com>
Closes #19835 from HyukjinKwon/SPARK-21866-followup.
2017-11-29 20:26:55 -05:00
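Putting the workaround above together, a minimal sketch with plain NumPy (whether the direct `bytearray(ndarray)` call fails depends on the NumPy version, so only the `tobytes()` form is exercised here):

```python
import numpy as np

arr = np.array([[1, 2], [3, 4]])

# On some NumPy versions, bytearray(ndarray) raises
# "TypeError: only integer scalar arrays can be converted to a scalar index",
# because a non-bytes argument is interpreted as a length.  Going through
# tobytes() first is unambiguous across versions:
data = bytearray(arr.astype(dtype=np.uint8).ravel().tobytes())

assert data == bytearray(b"\x01\x02\x03\x04")
```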
        :param `Row` image: A row that contains the image to be converted. It should
            have the attributes specified in `ImageSchema.imageSchema`.
        :return: a `numpy.ndarray` that is an image.

        .. versionadded:: 2.3.0
        """
        if not isinstance(image, Row):
            raise TypeError(
                "image argument should be pyspark.sql.types.Row; however, "
                "it got [%s]." % type(image))

        if any(not hasattr(image, f) for f in self.imageFields):
            raise ValueError(
                "image argument should have attributes specified in "
                "ImageSchema.imageSchema [%s]." % ", ".join(self.imageFields))

        height = image.height
        width = image.width
        nChannels = image.nChannels
        return np.ndarray(
            shape=(height, width, nChannels),
            dtype=np.uint8,
            buffer=image.data,
            strides=(width * nChannels, nChannels, 1))
|
|
|
|
|
|
|
|
def toImage(self, array, origin=""):
|
|
|
|
"""
|
|
|
|
Converts an array with metadata to a two-dimensional image.
|
|
|
|
|
[SPARK-21866][ML][PYTHON][FOLLOWUP] Few cleanups and fix image test failure in Python 3.6.0 / NumPy 1.13.3
## What changes were proposed in this pull request?
Image test seems failed in Python 3.6.0 / NumPy 1.13.3. I manually tested as below:
```
======================================================================
ERROR: test_read_images (pyspark.ml.tests.ImageReaderTest)
----------------------------------------------------------------------
Traceback (most recent call last):
File "/.../spark/python/pyspark/ml/tests.py", line 1831, in test_read_images
self.assertEqual(ImageSchema.toImage(array, origin=first_row[0]), first_row)
File "/.../spark/python/pyspark/ml/image.py", line 149, in toImage
data = bytearray(array.astype(dtype=np.uint8).ravel())
TypeError: only integer scalar arrays can be converted to a scalar index
----------------------------------------------------------------------
Ran 1 test in 7.606s
```
To be clear, I think the error seems from NumPy - https://github.com/numpy/numpy/blob/75b2d5d427afdb1392f2a0b2092e0767e4bab53d/numpy/core/src/multiarray/number.c#L947
For a smaller scope:
```python
>>> import numpy as np
>>> bytearray(np.array([1]).astype(dtype=np.uint8))
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
TypeError: only integer scalar arrays can be converted to a scalar index
```
In Python 2.7 / NumPy 1.13.1, it prints:
```
bytearray(b'\x01')
```
So, here, I simply worked around it by converting it to bytes as below:
```python
>>> bytearray(np.array([1]).astype(dtype=np.uint8).tobytes())
bytearray(b'\x01')
```
Also, while looking into it again, I realised few arguments could be quite confusing, for example, `Row` that needs some specific attributes and `numpy.ndarray`. I added few type checking and added some tests accordingly. So, it shows an error message as below:
```
TypeError: array argument should be numpy.ndarray; however, it got [<class 'str'>].
```
## How was this patch tested?
Manually tested with `./python/run-tests`.
And also:
```
PYSPARK_PYTHON=python3 SPARK_TESTING=1 bin/pyspark pyspark.ml.tests ImageReaderTest
```
Author: hyukjinkwon <gurwls223@gmail.com>
Closes #19835 from HyukjinKwon/SPARK-21866-followup.
2017-11-29 20:26:55 -05:00
|
|
|
:param `numpy.ndarray` array: The array to convert to image.
## How was this patch tested?
Unit tests in Scala (`ImageSchemaSuite`) and unit tests in Python.
Author: Ilya Matiach <ilmat@microsoft.com>
Author: hyukjinkwon <gurwls223@gmail.com>
Closes #19439 from imatiach-msft/ilmat/spark-images.
2017-11-22 18:45:45 -05:00
:param str origin: Path to the image, optional.
:return: a :class:`Row` that is a two-dimensional image.

.. versionadded:: 2.3.0
"""
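To make the documented fields concrete, here is a hedged sketch (plain NumPy, no Spark) of the record that `toImage` assembles for a small RGB array. The origin path and the mode value (16, OpenCV's `CV_8UC3`) are illustrative assumptions, not values computed by Spark here.

```python
import numpy as np

# Build the six image fields in schema order for a 2x4 RGB array.
array = np.arange(24, dtype=np.uint8).reshape(2, 4, 3)
height, width, n_channels = array.shape
image_row = {
    "origin": "file:/tmp/example.png",  # hypothetical path
    "height": height,
    "width": width,
    "nChannels": n_channels,
    "mode": 16,  # assumed CV_8UC3 constant
    "data": bytearray(array.ravel().tobytes()),
}
print(len(image_row["data"]))  # 24 bytes: height * width * nChannels
```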
[SPARK-21866][ML][PYTHON][FOLLOWUP] Few cleanups and fix image test failure in Python 3.6.0 / NumPy 1.13.3
## What changes were proposed in this pull request?
The image test seems to fail in Python 3.6.0 / NumPy 1.13.3. I manually tested as below:
```
======================================================================
ERROR: test_read_images (pyspark.ml.tests.ImageReaderTest)
----------------------------------------------------------------------
Traceback (most recent call last):
File "/.../spark/python/pyspark/ml/tests.py", line 1831, in test_read_images
self.assertEqual(ImageSchema.toImage(array, origin=first_row[0]), first_row)
File "/.../spark/python/pyspark/ml/image.py", line 149, in toImage
data = bytearray(array.astype(dtype=np.uint8).ravel())
TypeError: only integer scalar arrays can be converted to a scalar index
----------------------------------------------------------------------
Ran 1 test in 7.606s
```
To be clear, the error seems to come from NumPy - https://github.com/numpy/numpy/blob/75b2d5d427afdb1392f2a0b2092e0767e4bab53d/numpy/core/src/multiarray/number.c#L947
For a smaller scope:
```python
>>> import numpy as np
>>> bytearray(np.array([1]).astype(dtype=np.uint8))
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
TypeError: only integer scalar arrays can be converted to a scalar index
```
In Python 2.7 / NumPy 1.13.1, it prints:
```
bytearray(b'\x01')
```
So, here, I simply worked around it by converting it to bytes as below:
```python
>>> bytearray(np.array([1]).astype(dtype=np.uint8).tobytes())
bytearray(b'\x01')
```
Also, while looking into it again, I realised a few arguments could be quite confusing, for example, `Row`, which needs some specific attributes, and `numpy.ndarray`. I added some type checks and tests accordingly, so it now shows an error message as below:
```
TypeError: array argument should be numpy.ndarray; however, it got [<class 'str'>].
```
## How was this patch tested?
Manually tested with `./python/run-tests`.
And also:
```
PYSPARK_PYTHON=python3 SPARK_TESTING=1 bin/pyspark pyspark.ml.tests ImageReaderTest
```
Author: hyukjinkwon <gurwls223@gmail.com>
Closes #19835 from HyukjinKwon/SPARK-21866-followup.
2017-11-29 20:26:55 -05:00
if not isinstance(array, np.ndarray):
    raise TypeError(
        "array argument should be numpy.ndarray; however, it got [%s]." % type(array))

if array.ndim != 3:
    raise ValueError("Invalid array shape")
height, width, nChannels = array.shape
ocvTypes = ImageSchema.ocvTypes
if nChannels == 1:
    mode = ocvTypes["CV_8UC1"]
elif nChannels == 3:
    mode = ocvTypes["CV_8UC3"]
elif nChannels == 4:
    mode = ocvTypes["CV_8UC4"]
else:
    raise ValueError("Invalid number of channels")
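The channel-to-mode mapping can be exercised outside Spark. The `CV_8UC{1,3,4}` values in this sketch follow OpenCV's `CV_MAKETYPE` convention (type = depth + (channels - 1) * 8, with depth `CV_8U` = 0) and are reproduced here only to make the logic runnable standalone; `mode_for` is a hypothetical helper, not a Spark API.

```python
import numpy as np

# Assumed mirror of the relevant ImageSchema.ocvTypes entries.
OCV_TYPES = {"CV_8UC1": 0, "CV_8UC3": 16, "CV_8UC4": 24}

def mode_for(array):
    """Pick the OpenCV mode for an uint8 HxWxC array, as toImage does."""
    if array.ndim != 3:
        raise ValueError("Invalid array shape")
    names = {1: "CV_8UC1", 3: "CV_8UC3", 4: "CV_8UC4"}
    n_channels = array.shape[2]
    if n_channels not in names:
        raise ValueError("Invalid number of channels")
    return OCV_TYPES[names[n_channels]]

rgb = np.zeros((2, 4, 3), dtype=np.uint8)
print(mode_for(rgb))  # 16, i.e. CV_8UC3
```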
# Running `bytearray(numpy.array([1]))` fails in specific Python versions
# with a specific NumPy version, for example in Python 3.6.0 and NumPy 1.13.3.
# Here, that is avoided by converting the array to bytes first.
data = bytearray(array.astype(dtype=np.uint8).ravel().tobytes())
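A minimal standalone reproduction of the fixed conversion path: on some interpreter/NumPy combinations (the commit cites Python 3.6.0 with NumPy 1.13.3) `bytearray(ndarray)` raises `TypeError`, so the conversion is routed through `tobytes()` first.

```python
import numpy as np

arr = np.array([[1, 2], [3, 4]], dtype=np.int64)
# Cast to uint8, flatten in C (row-major) order, then serialise to bytes
# before wrapping in a bytearray, sidestepping the bytearray(ndarray) bug.
data = bytearray(arr.astype(dtype=np.uint8).ravel().tobytes())
print(list(data))  # [1, 2, 3, 4]
```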
# Creating new Row with _create_row(), because Row(name = value, ... )
# orders fields by name, which conflicts with expected schema order
# when the new DataFrame is created by UDF
return _create_row(self.imageFields,
                   [origin, height, width, nChannels, mode, data])
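A sketch of why the explicit field list matters: at the time of this change, `pyspark.sql.Row` built from keyword arguments sorted its fields alphabetically (`data, height, mode, ...`), which would not match the image schema's positional order. Building from an ordered field list, as `_create_row` does, keeps the intended order. The values below are hypothetical.

```python
from collections import OrderedDict

image_fields = ["origin", "height", "width", "nChannels", "mode", "data"]
values = ["file:/tmp/example.png", 2, 4, 3, 16, bytearray(24)]

# Zipping the explicit field list with the values preserves schema order;
# sorting the names (what Row(**kwargs) effectively did) would not.
row = OrderedDict(zip(image_fields, values))
print(list(row))  # field order preserved, unlike sorted(image_fields)
```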
def readImages(self, path, recursive=False, numPartitions=-1,
               dropImageFailures=False, sampleRatio=1.0, seed=0):
"""
Reads the directory of images from the local or remote source.

.. note:: If multiple jobs are run in parallel with different sampleRatio or recursive flag,
    there may be a race condition where one job overwrites the hadoop configs of another.

.. note:: If sample ratio is less than 1, sampling uses a PathFilter that is efficient but
    potentially non-deterministic.

:param str path: Path to the image directory.
:param bool recursive: Recursive search flag.
:param int numPartitions: Number of DataFrame partitions.
:param bool dropImageFailures: Drop the files that are not valid images.
:param float sampleRatio: Fraction of the images loaded.
:param int seed: Random number seed.
:return: a :class:`DataFrame` with a single column of "images",
    see ImageSchema for details.

>>> df = ImageSchema.readImages('data/mllib/images/kittens', recursive=True)
>>> df.count()
5

.. versionadded:: 2.3.0
"""
[SPARK-22651][PYTHON][ML] Prevent initiating multiple Hive clients for ImageSchema.readImages
## What changes were proposed in this pull request?
Calling `ImageSchema.readImages` multiple times as below in PySpark shell:
```python
from pyspark.ml.image import ImageSchema
data_path = 'data/mllib/images/kittens'
_ = ImageSchema.readImages(data_path, recursive=True, dropImageFailures=True).collect()
_ = ImageSchema.readImages(data_path, recursive=True, dropImageFailures=True).collect()
```
throws an error as below:
```
...
org.datanucleus.exceptions.NucleusDataStoreException: Unable to open a test connection to the given database. JDBC url = jdbc:derby:;databaseName=metastore_db;create=true, username = APP. Terminating connection pool (set lazyInit to true if you expect to start your database after your app). Original Exception: ------
java.sql.SQLException: Failed to start database 'metastore_db' with class loader org.apache.spark.sql.hive.client.IsolatedClientLoader$$anon$1742f639f, see the next exception for details.
...
at org.apache.derby.jdbc.AutoloadedDriver.connect(Unknown Source)
...
at org.apache.hadoop.hive.metastore.HiveMetaStore.newRetryingHMSHandler(HiveMetaStore.java:5762)
...
at org.apache.spark.sql.hive.client.HiveClientImpl.newState(HiveClientImpl.scala:180)
...
at org.apache.spark.sql.hive.HiveExternalCatalog$$anonfun$databaseExists$1.apply$mcZ$sp(HiveExternalCatalog.scala:195)
at org.apache.spark.sql.hive.HiveExternalCatalog$$anonfun$databaseExists$1.apply(HiveExternalCatalog.scala:195)
at org.apache.spark.sql.hive.HiveExternalCatalog$$anonfun$databaseExists$1.apply(HiveExternalCatalog.scala:195)
at org.apache.spark.sql.hive.HiveExternalCatalog.withClient(HiveExternalCatalog.scala:97)
at org.apache.spark.sql.hive.HiveExternalCatalog.databaseExists(HiveExternalCatalog.scala:194)
at org.apache.spark.sql.internal.SharedState.externalCatalog$lzycompute(SharedState.scala:100)
at org.apache.spark.sql.internal.SharedState.externalCatalog(SharedState.scala:88)
at org.apache.spark.sql.hive.HiveSessionStateBuilder.externalCatalog(HiveSessionStateBuilder.scala:39)
at org.apache.spark.sql.hive.HiveSessionStateBuilder.catalog$lzycompute(HiveSessionStateBuilder.scala:54)
at org.apache.spark.sql.hive.HiveSessionStateBuilder.catalog(HiveSessionStateBuilder.scala:52)
at org.apache.spark.sql.hive.HiveSessionStateBuilder$$anon$1.<init>(HiveSessionStateBuilder.scala:69)
at org.apache.spark.sql.hive.HiveSessionStateBuilder.analyzer(HiveSessionStateBuilder.scala:69)
at org.apache.spark.sql.internal.BaseSessionStateBuilder$$anonfun$build$2.apply(BaseSessionStateBuilder.scala:293)
at org.apache.spark.sql.internal.BaseSessionStateBuilder$$anonfun$build$2.apply(BaseSessionStateBuilder.scala:293)
at org.apache.spark.sql.internal.SessionState.analyzer$lzycompute(SessionState.scala:79)
at org.apache.spark.sql.internal.SessionState.analyzer(SessionState.scala:79)
at org.apache.spark.sql.execution.QueryExecution.analyzed$lzycompute(QueryExecution.scala:70)
at org.apache.spark.sql.execution.QueryExecution.analyzed(QueryExecution.scala:68)
at org.apache.spark.sql.execution.QueryExecution.assertAnalyzed(QueryExecution.scala:51)
at org.apache.spark.sql.Dataset$.ofRows(Dataset.scala:70)
at org.apache.spark.sql.SparkSession.internalCreateDataFrame(SparkSession.scala:574)
at org.apache.spark.sql.SparkSession.createDataFrame(SparkSession.scala:593)
at org.apache.spark.sql.SparkSession.createDataFrame(SparkSession.scala:348)
at org.apache.spark.sql.SparkSession.createDataFrame(SparkSession.scala:348)
at org.apache.spark.ml.image.ImageSchema$$anonfun$readImages$2$$anonfun$apply$1.apply(ImageSchema.scala:253)
...
Caused by: ERROR XJ040: Failed to start database 'metastore_db' with class loader org.apache.spark.sql.hive.client.IsolatedClientLoader$$anon$1742f639f, see the next exception for details.
at org.apache.derby.iapi.error.StandardException.newException(Unknown Source)
at org.apache.derby.impl.jdbc.SQLExceptionFactory.wrapArgsForTransportAcrossDRDA(Unknown Source)
... 121 more
Caused by: ERROR XSDB6: Another instance of Derby may have already booted the database /.../spark/metastore_db.
...
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/.../spark/python/pyspark/ml/image.py", line 190, in readImages
dropImageFailures, float(sampleRatio), seed)
File "/.../spark/python/lib/py4j-0.10.6-src.zip/py4j/java_gateway.py", line 1160, in __call__
File "/.../spark/python/pyspark/sql/utils.py", line 69, in deco
raise AnalysisException(s.split(': ', 1)[1], stackTrace)
pyspark.sql.utils.AnalysisException: u'java.lang.RuntimeException: java.lang.RuntimeException: Unable to instantiate org.apache.hadoop.hive.ql.metadata.SessionHiveMetaStoreClient;'
```
It seems better to stick to `SparkSession.builder.getOrCreate()`, as in:
https://github.com/apache/spark/blob/51620e288b5e0a7fffc3899c9deadabace28e6d7/python/pyspark/sql/streaming.py#L329
https://github.com/apache/spark/blob/dc5d34d8dcd6526d1dfdac8606661561c7576a62/python/pyspark/sql/column.py#L541
https://github.com/apache/spark/blob/33d43bf1b6f55594187066f0e38ba3985fa2542b/python/pyspark/sql/readwriter.py#L105
## How was this patch tested?
This was tested as below in PySpark shell:
```python
from pyspark.ml.image import ImageSchema
data_path = 'data/mllib/images/kittens'
_ = ImageSchema.readImages(data_path, recursive=True, dropImageFailures=True).collect()
_ = ImageSchema.readImages(data_path, recursive=True, dropImageFailures=True).collect()
```
Author: hyukjinkwon <gurwls223@gmail.com>
Closes #19845 from HyukjinKwon/SPARK-22651.
2017-12-01 21:55:43 -05:00
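The fix adopts the singleton behaviour of `getOrCreate()`: repeated calls hand back the one active session instead of constructing a new one, so only a single Hive client / Derby metastore connection is ever opened. A minimal sketch of that pattern in plain Python (no pyspark; `_Builder` is a stand-in, not the real `SparkSession.builder`):

```python
class _Builder:
    _active_session = None  # shared across all builder instances

    def getOrCreate(self):
        # Reuse the active session if one exists; create it only once.
        if _Builder._active_session is None:
            _Builder._active_session = object()  # stands in for a session
        return _Builder._active_session

first = _Builder().getOrCreate()
second = _Builder().getOrCreate()
print(first is second)  # True: both calls share one session
```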
spark = SparkSession.builder.getOrCreate()
image_schema = spark._jvm.org.apache.spark.ml.image.ImageSchema
jsession = spark._jsparkSession
jresult = image_schema.readImages(path, jsession, recursive, numPartitions,
                                  dropImageFailures, float(sampleRatio), seed)
return DataFrame(jresult, spark._wrapped)


ImageSchema = _ImageSchema()


# Monkey patch to disallow instantiation of this class.
def _disallow_instance(_):
    raise RuntimeError("Creating instance of _ImageSchema class is disallowed.")


_ImageSchema.__init__ = _disallow_instance
def _test():
    import doctest
    import pyspark.ml.image
    globs = pyspark.ml.image.__dict__.copy()
    spark = SparkSession.builder\
        .master("local[2]")\
        .appName("ml.image tests")\
        .getOrCreate()
    globs['spark'] = spark

    (failure_count, test_count) = doctest.testmod(
        pyspark.ml.image, globs=globs,
        optionflags=doctest.ELLIPSIS | doctest.NORMALIZE_WHITESPACE)
    spark.stop()
    if failure_count:
        sys.exit(-1)


if __name__ == "__main__":
    _test()
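The `_test()` harness above runs the module's doctests against a live `SparkSession`. The same doctest machinery can be exercised without Spark; the sketch below drives `DocTestFinder`/`DocTestRunner` (the machinery `doctest.testmod` uses internally) over a single illustrative function:

```python
import doctest


def add(a, b):
    """Return the sum of a and b.

    >>> add(2, 3)
    5
    >>> add(0.1, 0.2)  # doctest: +ELLIPSIS
    0.30000...
    """
    return a + b


# Find and run the doctests attached to `add`; doctest.testmod (as in _test()
# above) applies the same Finder/Runner pair to every docstring in a module.
finder = doctest.DocTestFinder()
runner = doctest.DocTestRunner(
    optionflags=doctest.ELLIPSIS | doctest.NORMALIZE_WHITESPACE)
for test in finder.find(add, name="add", module=False, globs={"add": add}):
    runner.run(test)
(failure_count, test_count) = (runner.failures, runner.tries)
```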