spark-instrumented-optimizer

History


Liang-Chi Hsieh 4f17585098 [SPARK-19355][SQL] Use map output statistics to improve global limit's parallelism

## What changes were proposed in this pull request?

A logical `Limit` is performed physically by two operations, `LocalLimit` and `GlobalLimit`.

Most of the time, we gather all data into a single partition in order to run `GlobalLimit`. With a very large limit, shuffling that data causes performance issues and also reduces parallelism.

We can avoid shuffling into a single partition if we don't care about data ordering. This patch implements this idea by running a map stage during the global limit, which collects the number of rows in each partition. Each partition then retrieves its share of the limited rows locally, without any shuffling, to finish the global limit.

For example, suppose we have three partitions with (100, 100, 50) rows respectively. For a global limit of 100 rows, we may take (34, 33, 33) rows from the partitions locally. After the global limit we still have three partitions.
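
The even split in this example can be sketched in a few lines of plain Python. This is only an illustration of the arithmetic, not the code in this patch: the function name `split_limit_evenly` and the rule for redistributing rows when a partition is smaller than its share are assumptions.

```python
def split_limit_evenly(row_counts, limit):
    """Return how many rows to take from each partition (hypothetical helper).

    row_counts: number of rows in each partition, as gathered by the map stage.
    limit: the global limit.
    """
    takes = [0] * len(row_counts)
    # Never try to take more rows than exist in total.
    remaining = min(limit, sum(row_counts))
    # Partitions that still have rows we have not claimed.
    active = [i for i, count in enumerate(row_counts) if count > 0]
    while remaining > 0 and active:
        share, extra = divmod(remaining, len(active))
        for pos, i in enumerate(active):
            # The first `extra` partitions absorb the remainder, one row each.
            want = share + (1 if pos < extra else 0)
            got = min(want, row_counts[i] - takes[i])
            takes[i] += got
            remaining -= got
        # Drop exhausted partitions; the rest share whatever is still missing.
        active = [i for i in active if takes[i] < row_counts[i]]
    return takes


print(split_limit_evenly([100, 100, 50], 100))  # [34, 33, 33]
```

The number of partitions is unchanged; only the per-partition row count is capped.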

If the data partitions have a certain ordering, we can't distribute the required rows evenly across the partitions because that could change the data ordering. But we can still avoid shuffling.
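
One way to keep the ordering without a shuffle is to fill the limit from the partitions in order. The plain-Python sketch below assumes that strategy; the description above does not spell out the exact rule, and `split_limit_in_order` is a hypothetical name used only for illustration.

```python
def split_limit_in_order(row_counts, limit):
    """Return how many rows to take from each partition, front to back.

    Assumes partition i holds rows that come before those of partition i + 1,
    so the limit is filled from the first partitions (hypothetical helper).
    """
    takes = []
    remaining = max(limit, 0)
    for count in row_counts:
        # Take as much as this partition can supply, up to what is still needed.
        got = min(count, remaining)
        takes.append(got)
        remaining -= got
    return takes


print(split_limit_in_order([100, 100, 50], 100))  # [100, 0, 0]
print(split_limit_in_order([100, 100, 50], 150))  # [100, 50, 0]
```

Later partitions may end up empty, but each partition still trims its rows locally.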

## How was this patch tested?

Jenkins tests.

Author: Liang-Chi Hsieh <viirya@gmail.com>

Closes #16677 from viirya/improve-global-limit-parallelism.

2018-08-10 11:32:15 +02:00

src

[SPARK-19355][SQL] Use map output statistics to improve global limit's parallelism

2018-08-10 11:32:15 +02:00

pom.xml

[SPARK-19550][BUILD][FOLLOW-UP] Remove MaxPermSize for sql module

2018-01-15 07:49:34 -06:00