---
layout: global
title: "MLlib: Main Guide"
displayTitle: "Machine Learning Library (MLlib) Guide"
---
MLlib is Spark's machine learning (ML) library.
Its goal is to make practical machine learning scalable and easy.
At a high level, it provides tools such as:
* ML Algorithms: common learning algorithms such as classification, regression, clustering, and collaborative filtering
* Featurization: feature extraction, transformation, dimensionality reduction, and selection
* Pipelines: tools for constructing, evaluating, and tuning ML Pipelines
* Persistence: saving and loading algorithms, models, and Pipelines
* Utilities: linear algebra, statistics, data handling, etc.
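As a quick illustration of how these pieces fit together, here is a minimal
sketch of a `Pipeline` in Scala that chains featurization, an algorithm, and
persistence (the `training` DataFrame and its `text`/`label` columns are
assumed for illustration):

```scala
import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.classification.LogisticRegression
import org.apache.spark.ml.feature.{HashingTF, Tokenizer}

// Featurization: split raw text into words, then hash the words into
// fixed-length feature vectors.
val tokenizer = new Tokenizer().setInputCol("text").setOutputCol("words")
val hashingTF = new HashingTF().setInputCol("words").setOutputCol("features")

// ML algorithm: a logistic regression classifier (reads "features"/"label").
val lr = new LogisticRegression().setMaxIter(10)

// Pipeline: chain the stages; fit() returns a reusable PipelineModel.
val pipeline = new Pipeline().setStages(Array(tokenizer, hashingTF, lr))
val model = pipeline.fit(training) // `training`: DataFrame with "text", "label"

// Persistence: save the fitted pipeline for later scoring.
model.write.overwrite().save("/tmp/spark-lr-pipeline-model")
```

The [Pipelines guide](ml-pipeline.html) covers this workflow in depth.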
# Announcement: DataFrame-based API is primary API
**The MLlib RDD-based API is now in maintenance mode.**

As of Spark 2.0, the [RDD](rdd-programming-guide.html#resilient-distributed-datasets-rdds)-based APIs in the `spark.mllib` package have entered maintenance mode.
The primary Machine Learning API for Spark is now the [DataFrame](sql-programming-guide.html)-based API in the `spark.ml` package.

*What are the implications?*
* MLlib will still support the RDD-based API in `spark.mllib` with bug fixes.
* MLlib will not add new features to the RDD-based API.
* In the Spark 2.x releases, MLlib will add features to the DataFrame-based API to reach feature parity with the RDD-based API.
* After reaching feature parity (roughly estimated for Spark 2.3), the RDD-based API will be deprecated.
* The RDD-based API is expected to be removed in Spark 3.0.

*Why is MLlib switching to the DataFrame-based API?*
* DataFrames provide a more user-friendly API than RDDs. The many benefits of DataFrames include Spark Datasources, SQL/DataFrame queries, Tungsten and Catalyst optimizations, and uniform APIs across languages.
* The DataFrame-based API for MLlib provides a uniform API across ML algorithms and across multiple languages.
* DataFrames facilitate practical ML Pipelines, particularly feature transformations. See the [Pipelines guide](ml-pipeline.html) for details.

*What is "Spark ML"?*

* "Spark ML" is not an official name, but it is occasionally used to refer to the MLlib DataFrame-based API.
This is mostly due to the `org.apache.spark.ml` Scala package name used by the DataFrame-based API,
and the "Spark ML Pipelines" term we used initially to emphasize the pipeline concept.

*Is MLlib deprecated?*

* No. MLlib includes both the RDD-based API and the DataFrame-based API.
The RDD-based API is now in maintenance mode, but neither the RDD-based API nor MLlib as a whole is deprecated.
# Dependencies
MLlib uses the linear algebra package [Breeze](http://www.scalanlp.org/), which depends on
[netlib-java](https://github.com/fommil/netlib-java) for optimised numerical processing.
If native libraries[^1] are not available at runtime, you will see a warning message and a pure JVM
implementation will be used instead.
Due to licensing issues with runtime proprietary binaries, we do not include `netlib-java`'s native
proxies by default.
To configure `netlib-java` / Breeze to use system optimised binaries, include
`com.github.fommil.netlib:all:1.1.2` (or build Spark with `-Pnetlib-lgpl`) as a dependency of your
project and read the [netlib-java](https://github.com/fommil/netlib-java) documentation for your
platform's additional installation instructions.
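For example, with sbt the artifact above can be declared as follows (a sketch; the `all` artifact is packaged as a POM, so other build tools may need slightly different syntax):

```scala
// build.sbt: pull in netlib-java's native binary proxies (LGPL-licensed).
libraryDependencies += "com.github.fommil.netlib" % "all" % "1.1.2" pomOnly()
```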
Popular native BLAS libraries, such as [Intel MKL](https://software.intel.com/en-us/mkl) and [OpenBLAS](http://www.openblas.net), can use multiple threads in a single operation, which can conflict with Spark's execution model.
Configuring these BLAS implementations to use a single thread for operations may actually improve performance (see [SPARK-21305](https://issues.apache.org/jira/browse/SPARK-21305)). It is usually optimal to match this to the number of cores each Spark task is configured to use, which is 1 by default and typically left at 1.
Please refer to resources like the following to understand how to configure the number of threads these BLAS implementations use: [Intel MKL](https://software.intel.com/en-us/articles/recommended-settings-for-calling-intel-mkl-routines-from-multi-threaded-applications) and [OpenBLAS](https://github.com/xianyi/OpenBLAS/wiki/faq#multi-threaded).
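One way to pin these libraries to a single thread (a sketch, using the standard `MKL_NUM_THREADS` and `OPENBLAS_NUM_THREADS` environment variables) is to set them on the executors when building the session:

```scala
import org.apache.spark.sql.SparkSession

// Keep native BLAS single-threaded so it does not oversubscribe cores that
// Spark is already parallelising across with tasks.
val spark = SparkSession.builder()
  .appName("SingleThreadedBLAS")
  .config("spark.executorEnv.MKL_NUM_THREADS", "1")      // Intel MKL
  .config("spark.executorEnv.OPENBLAS_NUM_THREADS", "1") // OpenBLAS
  .getOrCreate()
```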
To use MLlib in Python, you will need [NumPy](http://www.numpy.org) version 1.4 or newer.
[^1]: To learn more about the benefits and background of system optimised natives, you may wish to
watch Sam Halliday's ScalaX talk on [High Performance Linear Algebra in Scala](http://fommil.github.io/scalax14/#/).
# Highlights in 2.2
The list below highlights some of the new features and enhancements added to MLlib in the `2.2`
release of Spark:
* [`ALS`](ml-collaborative-filtering.html) methods for _top-k_ recommendations for all
users or items, matching the functionality in `mllib`
([SPARK-19535](https://issues.apache.org/jira/browse/SPARK-19535)).
Performance was also improved for both `ml` and `mllib`
([SPARK-11968](https://issues.apache.org/jira/browse/SPARK-11968) and
[SPARK-20587](https://issues.apache.org/jira/browse/SPARK-20587))
* [`Correlation`](ml-statistics.html#correlation) and
[`ChiSquareTest`](ml-statistics.html#hypothesis-testing) stats functions for `DataFrames`
([SPARK-19636](https://issues.apache.org/jira/browse/SPARK-19636) and
[SPARK-19635](https://issues.apache.org/jira/browse/SPARK-19635)); see the sketch after this list
* [`FPGrowth`](ml-frequent-pattern-mining.html#fp-growth) algorithm for frequent pattern mining
([SPARK-14503](https://issues.apache.org/jira/browse/SPARK-14503))
* `GLM` now supports the full `Tweedie` family
([SPARK-18929](https://issues.apache.org/jira/browse/SPARK-18929))
* [`Imputer`](ml-features.html#imputer) feature transformer to impute missing values in a dataset
([SPARK-13568](https://issues.apache.org/jira/browse/SPARK-13568))
* [`LinearSVC`](ml-classification-regression.html#linear-support-vector-machine)
for linear Support Vector Machine classification
([SPARK-14709](https://issues.apache.org/jira/browse/SPARK-14709))
* Logistic regression now supports constraints on the coefficients during training
([SPARK-20047](https://issues.apache.org/jira/browse/SPARK-20047))
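As a quick taste of the new DataFrame-based stats functions, here is a minimal sketch of `Correlation` (the input vectors are made up for illustration, and an active `spark` session is assumed):

```scala
import org.apache.spark.ml.linalg.{Matrix, Vectors}
import org.apache.spark.ml.stat.Correlation
import org.apache.spark.sql.Row

val data = Seq(
  Vectors.dense(1.0, 0.0, -2.0),
  Vectors.dense(4.0, 5.0, 0.0),
  Vectors.dense(6.0, 7.0, 0.5),
  Vectors.dense(9.0, 0.0, 1.0)
).map(Tuple1.apply)

val df = spark.createDataFrame(data).toDF("features")

// Pearson correlation matrix of the "features" column.
val Row(coeff: Matrix) = Correlation.corr(df, "features").head
println(s"Pearson correlation matrix:\n$coeff")
```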
# Migration guide
MLlib is under active development.
The APIs marked `Experimental`/`DeveloperApi` may change in future releases,
and the migration guide below will explain all changes between releases.
## From 2.2 to 2.3
### Breaking changes
There are no breaking changes.
### Deprecations and changes of behavior
**Deprecations**

There are no deprecations.

**Changes of behavior**

* [SPARK-21027](https://issues.apache.org/jira/browse/SPARK-21027):
The default parallelism used in `OneVsRest` is now set to 1 (i.e. serial). In 2.2 and
earlier versions, the level of parallelism was set to the default threadpool size in Scala.
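Where concurrent training is still wanted, the new `parallelism` param can be set explicitly; a minimal sketch:

```scala
import org.apache.spark.ml.classification.{LogisticRegression, OneVsRest}

// Parallelism now defaults to 1 (serial); raise it to train the per-class
// binary classifiers concurrently.
val ovr = new OneVsRest()
  .setClassifier(new LogisticRegression().setMaxIter(10))
  .setParallelism(4) // fit up to 4 binary models at a time
```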
## From 2.1 to 2.2
### Breaking changes
There are no breaking changes.
### Deprecations and changes of behavior
**Deprecations**

There are no deprecations.

**Changes of behavior**
* [SPARK-19787](https://issues.apache.org/jira/browse/SPARK-19787):
Default value of `regParam` changed from `1.0` to `0.1` for `ALS.train` method (marked `DeveloperApi`).
**Note** this does _not affect_ the `ALS` Estimator or Model, nor MLlib's `ALS` class.
* [SPARK-14772](https://issues.apache.org/jira/browse/SPARK-14772):
Fixed inconsistency between Python and Scala APIs for `Param.copy` method.
* [SPARK-11569](https://issues.apache.org/jira/browse/SPARK-11569):
`StringIndexer` now handles `NULL` values in the same way as unseen values. Previously an exception
was always thrown regardless of the setting of the `handleInvalid` parameter; see the sketch below.
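A minimal sketch of the related setting (column names assumed for illustration):

```scala
import org.apache.spark.ml.feature.StringIndexer

// With handleInvalid = "skip", rows with NULL or unseen labels are filtered
// out at transform time instead of triggering an exception.
val indexer = new StringIndexer()
  .setInputCol("category")
  .setOutputCol("categoryIndex")
  .setHandleInvalid("skip")
```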
## Previous Spark versions
Earlier migration guides are archived [on this page](ml-migration-guides.html).