spark-instrumented-optimizer/dev
Sital Kedia 444bce1c98 [SPARK-19112][CORE] Support for ZStandard codec
## What changes were proposed in this pull request?

Using zstd compression for Spark jobs spilling 100s of TBs of data, we could reduce the amount of data written to disk by as much as 50%. This translates to significant latency gain because of reduced disk io operations. There is a degradation CPU time by 2 - 5% because of zstd compression overhead, but for jobs which are bottlenecked by disk IO, this hit can be taken.

## Benchmark
Please note that this benchmark is using real world compute heavy production workload spilling TBs of data to disk

|         | zstd performance as compred to LZ4   |
| ------------- | -----:|
| spill/shuffle bytes    | -48% |
| cpu time    |    + 3% |
| cpu reservation time       |    -40%|
| latency     |     -40% |

## How was this patch tested?

Tested by running few jobs spilling large amount of data on the cluster and amount of intermediate data written to disk reduced by as much as 50%.

Author: Sital Kedia <skedia@fb.com>

Closes #18805 from sitalkedia/skedia/upstream_zstd.
2017-11-01 14:54:08 +01:00
..
create-release [SPARK-22142][BUILD][STREAMING] Move Flume support behind a profile, take 2 2017-10-06 15:08:28 +01:00
deps [SPARK-19112][CORE] Support for ZStandard codec 2017-11-01 14:54:08 +01:00
sparktestsupport [SPARK-22302][INFRA] Remove manual backports for subprocess and print explicit message for < Python 2.7 2017-10-22 02:22:35 +09:00
tests [SPARK-10359] Enumerate dependencies in a file and diff against it for new pull requests 2015-12-30 12:47:42 -08:00
.gitignore [SPARK-6219] Reuse pep8.py 2015-04-18 16:46:28 -07:00
.rat-excludes [SPARK-20434][YARN][CORE] Move Hadoop delegation token code from yarn to core 2017-06-15 11:46:00 -07:00
appveyor-guide.md [SPARK-17200][PROJECT INFRA][BUILD][SPARKR] Automate building and testing on Windows (currently SparkR only) 2016-09-08 08:26:59 -07:00
appveyor-install-dependencies.ps1 [MINOR][BUILD] Download RAT and R version info over HTTPS; use RAT 0.12 2017-08-12 14:31:05 +09:00
change-scala-version.sh [SPARK-19810][BUILD][CORE] Remove support for Scala 2.10 2017-07-13 17:06:24 +08:00
check-license [MINOR][BUILD] Download RAT and R version info over HTTPS; use RAT 0.12 2017-08-12 14:31:05 +09:00
checkstyle-suppressions.xml [HOTFIX][BUILD] Fix finalizer checkstyle error and re-disable checkstyle 2017-09-27 13:40:21 -07:00
checkstyle.xml [HOTFIX][BUILD] Fix finalizer checkstyle error and re-disable checkstyle 2017-09-27 13:40:21 -07:00
github_jira_sync.py [SPARK-19002][BUILD][PYTHON] Check pep8 against all Python scripts 2017-01-02 15:23:19 +00:00
lint-java [SPARK-16967] move mesos to module 2016-08-26 12:25:22 -07:00
lint-python [MINOR][PYTHON] Ignore pep8 on test scripts generated in tests in work directory 2017-06-02 14:25:38 +01:00
lint-r [SPARK-10328] [SPARKR] Fix generic for na.omit 2015-08-28 00:37:50 -07:00
lint-r.R [SPARK-22063][R] Fixes lint check failures in R by latest commit sha1 ID of lint-r 2017-10-01 18:42:45 +09:00
lint-scala [SPARK-2627] [PySpark] have the build enforce PEP 8 automatically 2014-08-06 12:58:24 -07:00
make-distribution.sh [SPARK-20123][BUILD] SPARK_HOME variable might have spaces in it(e.g. $SPARK… 2017-04-02 15:31:13 +01:00
merge_spark_pr.py [MINOR] Minor comment fixes in merge_spark_pr.py script 2017-07-31 10:07:33 +09:00
mima [SPARK-22142][BUILD][STREAMING] Move Flume support behind a profile, take 2 2017-10-06 15:08:28 +01:00
pip-sanity-check.py [SPARK-19064][PYSPARK] Fix pip installing of sub components 2017-01-25 14:43:39 -08:00
README.md Merge pull request #565 from pwendell/dev-scripts. Closes #565. 2014-02-08 23:13:34 -08:00
requirements.txt [SPARK-19064][PYSPARK] Fix pip installing of sub components 2017-01-25 14:43:39 -08:00
run-pip-tests Revert "[SPARK-13534][PYSPARK] Using Apache Arrow to increase performance of DataFrame.toPandas" 2017-06-28 14:28:40 +08:00
run-tests [SPARK-22302][INFRA] Remove manual backports for subprocess and print explicit message for < Python 2.7 2017-10-22 02:22:35 +09:00
run-tests-jenkins [SPARK-22302][INFRA] Remove manual backports for subprocess and print explicit message for < Python 2.7 2017-10-22 02:22:35 +09:00
run-tests-jenkins.py [SPARK-21189][INFRA] Handle unknown error codes in Jenkins rather then leaving incomplete comment in PRs 2017-06-24 10:14:31 +01:00
run-tests.py [SPARK-20974][BUILD] we should run REPL tests if SQL module has code changes 2017-06-02 21:59:52 -07:00
scalastyle [SPARK-22142][BUILD][STREAMING] Move Flume support behind a profile, take 2 2017-10-06 15:08:28 +01:00
test-dependencies.sh [SPARK-22142][BUILD][STREAMING] Move Flume support behind a profile, take 2 2017-10-06 15:08:28 +01:00
tox.ini [SPARK-22375][TEST] Test script can fail if eggs are installed by set… 2017-10-29 15:29:23 +09:00

Spark Developer Scripts

This directory contains scripts useful to developers when packaging, testing, or committing to Spark.

Many of these scripts require Apache credentials to work correctly.