spark-instrumented-optimizer

History

Liang-Chi Hsieh 77f74539ec [SPARK-20542][ML][SQL] Add an API to Bucketizer that can bin multiple columns

## What changes were proposed in this pull request?

ML's current `Bucketizer` can only bin a single column of continuous features. If a dataset has thousands of continuous columns to bin, the pipeline ends up with thousands of ML stages, which is inefficient for query planning and execution.

We should have a type of bucketizer that can bin many columns at once. It would need to accept a list of arrays of split points, one array per column to bin, but it could make things more efficient by replacing thousands of stages with just one.

The current approach in this patch is to add a new `MultipleBucketizerInterface` for this purpose. `Bucketizer` now extends this new interface.

### Performance

Benchmarked using the test dataset provided in JIRA SPARK-20392 (blockbuster.csv).

The ML pipeline includes 2 `StringIndexer`s and either 1 `MultipleBucketizer` or 137 `Bucketizer`s to bin 137 input columns with the same splits. The measurement is the time taken to transform the dataset.

MultipleBucketizer: 3352 ms
Bucketizer: 51512 ms

## How was this patch tested?

Jenkins tests.

Please review http://spark.apache.org/contributing.html before opening a pull request.

Author: Liang-Chi Hsieh <viirya@gmail.com>

Closes #17819 from viirya/SPARK-20542.

2017-11-09 16:35:06 +02:00
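As an illustration of the idea in the commit message above (one array of split points per column, all columns binned in a single pass), here is a minimal plain-Python sketch. It is not Spark code and not the patch's implementation: the function names `bucket_index` and `bucketize_rows`, the sample data, and the bucket convention (half-open buckets, with the last bucket also including its upper bound) are assumptions made for this example.

```python
from bisect import bisect_right


def bucket_index(splits, value):
    """Return the bucket of `value` for sorted split points `splits`.

    Assumed convention: bucket i covers [splits[i], splits[i + 1]),
    and the last bucket also includes its upper bound.
    """
    last = len(splits) - 2  # index of the last bucket
    if value == splits[-1]:
        return last
    idx = bisect_right(splits, value) - 1
    if idx < 0 or idx > last:
        raise ValueError(f"{value} is outside the range covered by {splits}")
    return idx


def bucketize_rows(rows, splits_array):
    """Bin every column of every row in one pass.

    `splits_array[j]` holds the split points for column j, so a single
    call replaces one single-column binning step per column.
    """
    return [
        [bucket_index(splits, value) for splits, value in zip(splits_array, row)]
        for row in rows
    ]


# Hypothetical data: two continuous columns with different splits.
rows = [[0.5, 12.0], [3.0, 25.0], [9.9, 40.0]]
splits_array = [[0.0, 1.0, 5.0, 10.0], [0.0, 20.0, 40.0]]
binned = bucketize_rows(rows, splits_array)
print(binned)
```

The sketch only shows the shape of the input (a list of split arrays aligned with the columns). The performance gain reported in the commit comes from collapsing many pipeline stages into one inside Spark, which this example does not model.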
src/main	[SPARK-20542][ML][SQL] Add an API to Bucketizer that can bin multiple columns	2017-11-09 16:35:06 +02:00
pom.xml	[SPARK-22142][BUILD][STREAMING] Move Flume support behind a profile, take 2	2017-10-06 15:08:28 +01:00
