---
layout: global
title: ML Pipelines
displayTitle: ML Pipelines
---

`\[
\newcommand{\R}{\mathbb{R}}
\newcommand{\E}{\mathbb{E}}
\newcommand{\x}{\mathbf{x}}
\newcommand{\y}{\mathbf{y}}
\newcommand{\wv}{\mathbf{w}}
\newcommand{\av}{\mathbf{\alpha}}
\newcommand{\bv}{\mathbf{b}}
\newcommand{\N}{\mathbb{N}}
\newcommand{\id}{\mathbf{I}}
\newcommand{\ind}{\mathbf{1}}
\newcommand{\0}{\mathbf{0}}
\newcommand{\unit}{\mathbf{e}}
\newcommand{\one}{\mathbf{1}}
\newcommand{\zero}{\mathbf{0}}
\]`

In this section, we introduce the concept of ***ML Pipelines***.
ML Pipelines provide a uniform set of high-level APIs built on top of
[DataFrames](sql-programming-guide.html) that help users create and tune practical
machine learning pipelines.

**Table of Contents**

* This will become a table of contents (this text will be scraped).
{:toc}

# Main concepts in Pipelines

MLlib standardizes APIs for machine learning algorithms to make it easier to combine multiple
algorithms into a single pipeline, or workflow.
This section covers the key concepts introduced by the Pipelines API, where the pipeline concept is
mostly inspired by the [scikit-learn](http://scikit-learn.org/) project.

* **[`DataFrame`](ml-pipeline.html#dataframe)**: This ML API uses `DataFrame` from Spark SQL as an ML
  dataset, which can hold a variety of data types.
  E.g., a `DataFrame` could have different columns storing text, feature vectors, true labels, and predictions.

* **[`Transformer`](ml-pipeline.html#transformers)**: A `Transformer` is an algorithm which can transform one `DataFrame` into another `DataFrame`.
  E.g., an ML model is a `Transformer` which transforms a `DataFrame` with features into a `DataFrame` with predictions.

* **[`Estimator`](ml-pipeline.html#estimators)**: An `Estimator` is an algorithm which can be fit on a `DataFrame` to produce a `Transformer`.
  E.g., a learning algorithm is an `Estimator` which trains on a `DataFrame` and produces a model.

* **[`Pipeline`](ml-pipeline.html#pipeline)**: A `Pipeline` chains multiple `Transformer`s and `Estimator`s together to specify an ML workflow.

* **[`Parameter`](ml-pipeline.html#parameters)**: All `Transformer`s and `Estimator`s now share a common API for specifying parameters.

## DataFrame

Machine learning can be applied to a wide variety of data types, such as vectors, text, images, and structured data.
This API adopts the `DataFrame` from Spark SQL in order to support a variety of data types.

`DataFrame` supports many basic and structured types; see the [Spark SQL datatype reference](sql-reference.html#data-types) for a list of supported types.
In addition to the types listed in the Spark SQL guide, `DataFrame` can use ML [`Vector`](mllib-data-types.html#local-vector) types.

A `DataFrame` can be created either implicitly or explicitly from a regular `RDD`. See the code examples below and the [Spark SQL programming guide](sql-programming-guide.html) for examples.

Columns in a `DataFrame` are named. The code examples below use names such as "text", "features", and "label".
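
As a quick illustration, here is a minimal sketch of creating a `DataFrame` explicitly from in-memory data, assuming a local `SparkSession`; the column names and values are illustrative only:

{% highlight scala %}
import org.apache.spark.sql.SparkSession

// A minimal sketch, assuming a local SparkSession.
val spark = SparkSession.builder.appName("DataFrameSketch").getOrCreate()

// Create a DataFrame explicitly with named columns.
val training = spark.createDataFrame(Seq(
  (0L, "a b c d e spark", 1.0),
  (1L, "b d", 0.0),
  (2L, "spark f g h", 1.0)
)).toDF("id", "text", "label")

// The schema describes the name and data type of each column.
training.printSchema()
{% endhighlight %}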

## Pipeline components

### Transformers

A `Transformer` is an abstraction that includes feature transformers and learned models.
Technically, a `Transformer` implements a method `transform()`, which converts one `DataFrame` into
another, generally by appending one or more columns.
For example:

* A feature transformer might take a `DataFrame`, read a column (e.g., text), map it into a new
  column (e.g., feature vectors), and output a new `DataFrame` with the mapped column appended.
* A learning model might take a `DataFrame`, read the column containing feature vectors, predict the
  label for each feature vector, and output a new `DataFrame` with predicted labels appended as a
  column.
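
A minimal sketch of the first case, assuming the `training` DataFrame from the sketch above:

{% highlight scala %}
import org.apache.spark.ml.feature.Tokenizer

// Tokenizer is a feature transformer: it reads the "text" column
// and appends a new "words" column.
val tokenizer = new Tokenizer()
  .setInputCol("text")
  .setOutputCol("words")

// transform() returns a new DataFrame with the extra column appended.
val tokenized = tokenizer.transform(training)
{% endhighlight %}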

### Estimators

An `Estimator` abstracts the concept of a learning algorithm or any algorithm that fits or trains on
data.
Technically, an `Estimator` implements a method `fit()`, which accepts a `DataFrame` and produces a
`Model`, which is a `Transformer`.
For example, a learning algorithm such as `LogisticRegression` is an `Estimator`, and calling
`fit()` trains a `LogisticRegressionModel`, which is a `Model` and hence a `Transformer`.
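
As a sketch, assuming a `training` DataFrame that already has "features" and "label" columns:

{% highlight scala %}
import org.apache.spark.ml.classification.LogisticRegression

// LogisticRegression is an Estimator.
val lr = new LogisticRegression()
  .setMaxIter(10)

// fit() trains a LogisticRegressionModel, which is a Model and hence a Transformer.
val lrModel = lr.fit(training)

// The fitted model can now transform DataFrames, appending prediction columns.
val predictions = lrModel.transform(training)
{% endhighlight %}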

### Properties of pipeline components

`Transformer.transform()`s and `Estimator.fit()`s are both stateless. In the future, stateful algorithms may be supported via alternative concepts.

Each instance of a `Transformer` or `Estimator` has a unique ID, which is useful in specifying parameters (discussed below).
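
For example (a sketch reusing the `tokenizer` and `lr` instances from above; the printed suffixes are generated per instance):

{% highlight scala %}
// Every Transformer and Estimator instance exposes a unique `uid`.
println(tokenizer.uid)  // e.g. "tok_...", unique to this Tokenizer instance
println(lr.uid)         // e.g. "logreg_...", unique to this LogisticRegression instance
{% endhighlight %}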

## Pipeline

In machine learning, it is common to run a sequence of algorithms to process and learn from data.
E.g., a simple text document processing workflow might include several stages:

* Split each document's text into words.
* Convert each document's words into a numerical feature vector.
* Learn a prediction model using the feature vectors and labels.

MLlib represents such a workflow as a `Pipeline`, which consists of a sequence of
`PipelineStage`s (`Transformer`s and `Estimator`s) to be run in a specific order.
We will use this simple workflow as a running example in this section.

### How it works

A `Pipeline` is specified as a sequence of stages, and each stage is either a `Transformer` or an `Estimator`.
These stages are run in order, and the input `DataFrame` is transformed as it passes through each stage.
For `Transformer` stages, the `transform()` method is called on the `DataFrame`.
For `Estimator` stages, the `fit()` method is called to produce a `Transformer` (which becomes part of the `PipelineModel`, or fitted `Pipeline`), and that `Transformer`'s `transform()` method is called on the `DataFrame`.

We illustrate this for the simple text document workflow. The figure below is for the *training time* usage of a `Pipeline`.

<p style="text-align: center;">
  <img
    src="img/ml-Pipeline.png"
    title="ML Pipeline Example"
    alt="ML Pipeline Example"
    width="80%"
  />
</p>

Above, the top row represents a `Pipeline` with three stages.
The first two (`Tokenizer` and `HashingTF`) are `Transformer`s (blue), and the third (`LogisticRegression`) is an `Estimator` (red).
The bottom row represents data flowing through the pipeline, where cylinders indicate `DataFrame`s.
The `Pipeline.fit()` method is called on the original `DataFrame`, which has raw text documents and labels.
The `Tokenizer.transform()` method splits the raw text documents into words, adding a new column with words to the `DataFrame`.
The `HashingTF.transform()` method converts the words column into feature vectors, adding a new column with those vectors to the `DataFrame`.
Now, since `LogisticRegression` is an `Estimator`, the `Pipeline` first calls `LogisticRegression.fit()` to produce a `LogisticRegressionModel`.
If the `Pipeline` had more `Estimator`s, it would call the `LogisticRegressionModel`'s `transform()`
method on the `DataFrame` before passing the `DataFrame` to the next stage.

A `Pipeline` is an `Estimator`.
Thus, after a `Pipeline`'s `fit()` method runs, it produces a `PipelineModel`, which is a
`Transformer`.
This `PipelineModel` is used at *test time*; the figure below illustrates this usage.

<p style="text-align: center;">
  <img
    src="img/ml-PipelineModel.png"
    title="ML PipelineModel Example"
    alt="ML PipelineModel Example"
    width="80%"
  />
</p>

In the figure above, the `PipelineModel` has the same number of stages as the original `Pipeline`, but all `Estimator`s in the original `Pipeline` have become `Transformer`s.
When the `PipelineModel`'s `transform()` method is called on a test dataset, the data are passed
through the fitted pipeline in order.
Each stage's `transform()` method updates the dataset and passes it to the next stage.

`Pipeline`s and `PipelineModel`s help to ensure that training and test data go through identical feature processing steps.
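
A minimal sketch of the workflow in the figures, reusing the `tokenizer` and `lr` instances from the earlier sketches and assuming `training` (with "text" and "label" columns) and `test` (with a "text" column) DataFrames:

{% highlight scala %}
import org.apache.spark.ml.{Pipeline, PipelineModel}
import org.apache.spark.ml.feature.HashingTF

// HashingTF maps the "words" column to a "features" column of term-frequency vectors.
val hashingTF = new HashingTF()
  .setInputCol(tokenizer.getOutputCol)
  .setOutputCol("features")

// Chain the three stages; fit() runs them in order on the training data
// and returns a PipelineModel, which is a Transformer.
val pipeline = new Pipeline()
  .setStages(Array(tokenizer, hashingTF, lr))
val model: PipelineModel = pipeline.fit(training)

// At test time, the PipelineModel applies the identical feature processing.
val testPredictions = model.transform(test)
{% endhighlight %}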

### Details

*DAG `Pipeline`s*: A `Pipeline`'s stages are specified as an ordered array. The examples given here are all for linear `Pipeline`s, i.e., `Pipeline`s in which each stage uses data produced by the previous stage. It is possible to create non-linear `Pipeline`s as long as the data flow graph forms a Directed Acyclic Graph (DAG). This graph is currently specified implicitly based on the input and output column names of each stage (generally specified as parameters). If the `Pipeline` forms a DAG, then the stages must be specified in topological order.

*Runtime checking*: Since `Pipeline`s can operate on `DataFrame`s with varied types, they cannot use
compile-time type checking.
`Pipeline`s and `PipelineModel`s instead do runtime checking before actually running the `Pipeline`.
This type checking is done using the `DataFrame` *schema*, a description of the data types of columns in the `DataFrame`.

*Unique Pipeline stages*: A `Pipeline`'s stages should be unique instances. E.g., the same instance
`myHashingTF` should not be inserted into the `Pipeline` twice since `Pipeline` stages must have
unique IDs. However, different instances `myHashingTF1` and `myHashingTF2` (both of type `HashingTF`)
can be put into the same `Pipeline` since different instances will be created with different IDs.
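
A sketch of the unique-instances rule (the output column names are illustrative):

{% highlight scala %}
import org.apache.spark.ml.feature.HashingTF

// Two separate HashingTF instances receive distinct unique IDs, so both may
// appear in the same Pipeline; reusing one instance twice would not work.
val myHashingTF1 = new HashingTF().setInputCol("words").setOutputCol("tf1")
val myHashingTF2 = new HashingTF().setInputCol("words").setOutputCol("tf2")
{% endhighlight %}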

## Parameters

MLlib `Estimator`s and `Transformer`s use a uniform API for specifying parameters.

A `Param` is a named parameter with self-contained documentation.
A `ParamMap` is a set of (parameter, value) pairs.

There are two main ways to pass parameters to an algorithm:

1. Set parameters for an instance. E.g., if `lr` is an instance of `LogisticRegression`, one could
   call `lr.setMaxIter(10)` to make `lr.fit()` use at most 10 iterations.
   This API resembles the API used in the `spark.mllib` package.
2. Pass a `ParamMap` to `fit()` or `transform()`. Any parameters in the `ParamMap` will override parameters previously specified via setter methods.

Parameters belong to specific instances of `Estimator`s and `Transformer`s.
For example, if we have two `LogisticRegression` instances `lr1` and `lr2`, then we can build a `ParamMap` with both `maxIter` parameters specified: `ParamMap(lr1.maxIter -> 10, lr2.maxIter -> 20)`.
This is useful if there are two algorithms with the `maxIter` parameter in a `Pipeline`.
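
A short sketch of both styles, reusing the `lr` Estimator and `training` DataFrame assumed above:

{% highlight scala %}
import org.apache.spark.ml.param.ParamMap

// 1. Setter methods on a specific instance:
lr.setMaxIter(10)

// 2. A ParamMap passed to fit(); these values override the setter above.
//    Parameters belong to instances, so the keys are lr.maxIter and lr.regParam.
val paramMap = ParamMap(lr.maxIter -> 20, lr.regParam -> 0.01)
val modelWithOverrides = lr.fit(training, paramMap)
{% endhighlight %}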

## ML persistence: Saving and Loading Pipelines

It is often worthwhile to save a model or a pipeline to disk for later use. In Spark 1.6, model import/export functionality was added to the Pipeline API.
As of Spark 2.3, the DataFrame-based API in `spark.ml` and `pyspark.ml` has complete coverage.

ML persistence works across Scala, Java and Python. However, R currently uses a modified format,
so models saved in R can only be loaded back in R; this should be fixed in the future and is
tracked in [SPARK-15572](https://issues.apache.org/jira/browse/SPARK-15572).
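
A minimal sketch, assuming the fitted `model` (a `PipelineModel`) from the pipeline sketch above; the path is illustrative:

{% highlight scala %}
import org.apache.spark.ml.PipelineModel

// Save the fitted pipeline to disk, overwriting any previous save at that path.
model.write.overwrite().save("/tmp/spark-logistic-regression-model")

// Load it back later, possibly in a different process.
val sameModel = PipelineModel.load("/tmp/spark-logistic-regression-model")
{% endhighlight %}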

### Backwards compatibility for ML persistence

In general, MLlib maintains backwards compatibility for ML persistence. I.e., if you save an ML
model or Pipeline in one version of Spark, then you should be able to load it back and use it in a
future version of Spark. However, there are rare exceptions, described below.

Model persistence: Is a model or Pipeline saved using Apache Spark ML persistence in Spark
version X loadable by Spark version Y?

* Major versions: No guarantees, but best-effort.
* Minor and patch versions: Yes; these are backwards compatible.
* Note about the format: There are no guarantees for a stable persistence format, but model loading itself is designed to be backwards compatible.

Model behavior: Does a model or Pipeline in Spark version X behave identically in Spark version Y?

* Major versions: No guarantees, but best-effort.
* Minor and patch versions: Identical behavior, except for bug fixes.

For both model persistence and model behavior, any breaking changes across a minor version or patch
version are reported in the Spark version release notes. If a breakage is not reported in release
notes, then it should be treated as a bug to be fixed.

# Code examples

This section gives code examples illustrating the functionality discussed above.
For more info, please refer to the API documentation
([Scala](api/scala/index.html#org.apache.spark.ml.package),
[Java](api/java/org/apache/spark/ml/package-summary.html),
and [Python](api/python/pyspark.ml.html)).

## Example: Estimator, Transformer, and Param

This example covers the concepts of `Estimator`, `Transformer`, and `Param`.

<div class="codetabs">

<div data-lang="scala" markdown="1">

Refer to the [`Estimator` Scala docs](api/scala/index.html#org.apache.spark.ml.Estimator),
the [`Transformer` Scala docs](api/scala/index.html#org.apache.spark.ml.Transformer) and
the [`Params` Scala docs](api/scala/index.html#org.apache.spark.ml.param.Params) for details on the API.

{% include_example scala/org/apache/spark/examples/ml/EstimatorTransformerParamExample.scala %}
</div>

<div data-lang="java" markdown="1">

Refer to the [`Estimator` Java docs](api/java/org/apache/spark/ml/Estimator.html),
the [`Transformer` Java docs](api/java/org/apache/spark/ml/Transformer.html) and
the [`Params` Java docs](api/java/org/apache/spark/ml/param/Params.html) for details on the API.

{% include_example java/org/apache/spark/examples/ml/JavaEstimatorTransformerParamExample.java %}
</div>

<div data-lang="python" markdown="1">

Refer to the [`Estimator` Python docs](api/python/pyspark.ml.html#pyspark.ml.Estimator),
the [`Transformer` Python docs](api/python/pyspark.ml.html#pyspark.ml.Transformer) and
the [`Params` Python docs](api/python/pyspark.ml.html#pyspark.ml.param.Params) for more details on the API.

{% include_example python/ml/estimator_transformer_param_example.py %}
</div>

</div>

## Example: Pipeline

This example follows the simple text document `Pipeline` illustrated in the figures above.

<div class="codetabs">

<div data-lang="scala" markdown="1">

Refer to the [`Pipeline` Scala docs](api/scala/index.html#org.apache.spark.ml.Pipeline) for details on the API.

{% include_example scala/org/apache/spark/examples/ml/PipelineExample.scala %}
</div>

<div data-lang="java" markdown="1">

Refer to the [`Pipeline` Java docs](api/java/org/apache/spark/ml/Pipeline.html) for details on the API.

{% include_example java/org/apache/spark/examples/ml/JavaPipelineExample.java %}
</div>

<div data-lang="python" markdown="1">

Refer to the [`Pipeline` Python docs](api/python/pyspark.ml.html#pyspark.ml.Pipeline) for more details on the API.

{% include_example python/ml/pipeline_example.py %}
</div>

</div>

## Model selection (hyperparameter tuning)

A big benefit of using ML Pipelines is hyperparameter optimization. See the [ML Tuning Guide](ml-tuning.html) for more information on automatic model selection.