spark-instrumented-optimizer

Josh Rosen 8ca01a6feb [SPARK-15680][SQL] Disable comments in generated code in order to avoid perf. issues

## What changes were proposed in this pull request?

In benchmarks involving tables with very wide and complex schemas (thousands of columns, deep nesting), I noticed that significant amounts of time (order of tens of seconds per task) were being spent generating comments during the code generation phase.

The root cause of the performance problem stems from the fact that calling toString() on a complex expression can involve thousands of string concatenations, resulting in huge amounts (tens of gigabytes) of character array allocation and copying.

In the long term, we can avoid this problem by passing StringBuilders down the tree and using them to accumulate output. As a short-term workaround, this patch guards comment generation behind a flag and disables comments by default (for wide tables / complex queries, these comments were being truncated prior to display and thus were not very useful).

## How was this patch tested?

This was tested manually by running a Spark SQL query over an empty table with a very wide schema obtained from a real workload. Disabling comments brought the per-task time down from about 16 seconds to 600 milliseconds.

Author: Josh Rosen <joshrosen@databricks.com>

Closes #13421 from JoshRosen/disable-line-comments-in-codegen.

2016-05-31 17:30:03 -07:00

src	[SPARK-15680][SQL] Disable comments in generated code in order to avoid perf. issues	2016-05-31 17:30:03 -07:00
pom.xml	[SPARK-15290][BUILD] Move annotations, like @Since / @DeveloperApi, into spark-tags	2016-05-17 09:55:53 +01:00
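
The two ideas in the SPARK-15680 commit message (accumulating output in a shared StringBuilder instead of concatenating strings at every tree level, and guarding comment generation behind a flag) can be illustrated with a minimal Java sketch. Every name here (`Expr`, `Leaf`, `Call`, `CodegenCommentSketch`, the `commentsEnabled` parameter) is hypothetical and chosen for illustration; this is not Spark's actual expression or codegen code, and the real flag name is not given in the message above.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical expression tree, for illustration only (not Spark's classes).
interface Expr {
    // Naive rendering: each node builds a fresh String from its children.
    String render();

    // Builder-passing rendering: every node appends to one shared buffer.
    void renderTo(StringBuilder out);
}

final class Leaf implements Expr {
    private final String name;

    Leaf(String name) { this.name = name; }

    public String render() { return name; }

    public void renderTo(StringBuilder out) { out.append(name); }
}

final class Call implements Expr {
    private final String fn;
    private final List<Expr> args;

    Call(String fn, List<Expr> args) {
        this.fn = fn;
        this.args = args;
    }

    public String render() {
        String s = fn + "(";
        for (int i = 0; i < args.size(); i++) {
            if (i > 0) s += ", ";
            // Each += copies everything accumulated so far into a new String,
            // and the same happens again at every enclosing level of the tree.
            s += args.get(i).render();
        }
        return s + ")";
    }

    public void renderTo(StringBuilder out) {
        out.append(fn).append('(');
        for (int i = 0; i < args.size(); i++) {
            if (i > 0) out.append(", ");
            args.get(i).renderTo(out); // children write into the same buffer
        }
        out.append(')');
    }
}

final class CodegenCommentSketch {
    // Hypothetical guard: when comments are disabled, the expression is
    // never rendered at all, so no rendering cost is paid.
    static String comment(Expr e, boolean commentsEnabled) {
        if (!commentsEnabled) return "";
        StringBuilder sb = new StringBuilder("/* ");
        e.renderTo(sb);
        return sb.append(" */").toString();
    }

    public static void main(String[] args) {
        Expr e = new Call("add", Arrays.<Expr>asList(
                new Leaf("a"),
                new Call("abs", Arrays.<Expr>asList(new Leaf("b")))));
        System.out.println(comment(e, true));            // prints "/* add(a, abs(b)) */"
        System.out.println(comment(e, false).isEmpty()); // prints "true"
    }
}
```

Both rendering methods produce the same text; the difference is only in how much intermediate copying happens on deep or wide trees. The guard corresponds to the short-term workaround described in the commit message, and the builder-passing method to the long-term direction it mentions.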
