spark-instrumented-optimizer/.github/workflows/benchmark.yml

[SPARK-34821][INFRA] Set up a workflow for developers to run benchmark in their fork

### What changes were proposed in this pull request?

This PR proposes to add a workflow that allows developers to run benchmarks and download the result files. After this PR, developers can run benchmarks in GitHub Actions in their fork.

### Why are the changes needed?

1. Very easy to use.
2. We can use the (almost) same environment to run the benchmarks. In my few experiments and observations, the CPU, cores, and memory were the same.
3. Does not burden ASF's GitHub Actions resources.

### Does this PR introduce _any_ user-facing change?

No, dev-only.

### How was this patch tested?

Manually tested in https://github.com/HyukjinKwon/spark/pull/31. The entire set of benchmarks was run as below:

- [Run benchmarks: * (JDK 11)](https://github.com/HyukjinKwon/spark/actions/runs/713575465)
- [Run benchmarks: * (JDK 8)](https://github.com/HyukjinKwon/spark/actions/runs/713154337)

### How do developers use it in their fork?

1. **Go to Actions in your fork, and click "Run benchmarks".**

   ![Screen Shot 2021-03-31 at 10 15 13 PM](https://user-images.githubusercontent.com/6477701/113150018-99d71680-926e-11eb-8647-4ecf062c55f2.png)

2. **Run the benchmarks with JDK 8 or 11 and the benchmark classes to run. Glob patterns are supported, just like `testOnly` in SBT.**

   ![Screen Shot 2021-04-02 at 8 35 02 PM](https://user-images.githubusercontent.com/6477701/113412599-ab95f680-93f3-11eb-9a15-c6ed54587b9d.png)

3. **After the jobs finish, the benchmark results are available at the top of the underlying workflow run:**

   ![Screen Shot 2021-03-31 at 10 17 21 PM](https://user-images.githubusercontent.com/6477701/113150332-ede1fb00-926e-11eb-9c0e-97d195070508.png)

4. **After downloading the artifact, unzip and untar it at the Spark git root directory:**

   ```bash
   cd .../spark
   mv ~/Downloads/benchmark-results-8.zip .
   unzip benchmark-results-8.zip
   tar -xvf benchmark-results-8.tar
   ```

5. **Check the results:**

   ```bash
   git status
   ```

   ```
   ...
   modified: core/benchmarks/MapStatusesSerDeserBenchmark-results.txt
   ```

Closes #32015 from HyukjinKwon/SPARK-34821-pr.

Authored-by: HyukjinKwon <gurwls223@apache.org>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
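The same workflow can also be triggered and its results fetched from the command line. The sketch below is illustrative only: it assumes the GitHub CLI (`gh`) is installed and authenticated against your fork, and `<your-user>`, `<run-id>`, and the benchmark class are placeholder values to substitute.

```bash
# Trigger the workflow in your fork with explicit inputs
# (these mirror the workflow_dispatch inputs defined below).
gh workflow run benchmark.yml --repo <your-user>/spark \
  -f class='org.apache.spark.serializer.KryoSerializerBenchmark' \
  -f jdk=8 \
  -f failfast=true \
  -f num-splits=1

# Find the run ID of the run that was just started.
gh run list --repo <your-user>/spark --workflow=benchmark.yml --limit 5

# Once the run finishes, download the artifact uploaded by the last step of the
# workflow (named benchmark-results-<jdk>-<split>) and unpack it at the Spark
# git root. Depending on the gh version, the tarball may land in a subdirectory
# named after the artifact.
gh run download <run-id> --repo <your-user>/spark -n benchmark-results-8-1
tar -xvf benchmark-results-8.tar
```

The workflow definition itself follows.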
name: Run benchmarks
on:
workflow_dispatch:
inputs:
class:
description: 'Benchmark class'
required: true
default: '*'
jdk:
description: 'JDK version: 8 or 11'
required: true
default: '8'
failfast:
description: 'Failfast: true or false'
required: true
default: 'true'
num-splits:
description: 'Number of job splits'
required: true
default: '1'
jobs:
matrix-gen:
name: Generate matrix for job splits
runs-on: ubuntu-20.04
outputs:
matrix: ${{ steps.set-matrix.outputs.matrix }}
env:
SPARK_BENCHMARK_NUM_SPLITS: ${{ github.event.inputs.num-splits }}
steps:
- name: Generate matrix
id: set-matrix
run: echo "::set-output name=matrix::["`seq -s, 1 $SPARK_BENCHMARK_NUM_SPLITS`"]"
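      # For illustration (assuming a hypothetical num-splits of 3), the step above prints
      #   ::set-output name=matrix::[1,2,3]
      # so the "matrix" job output becomes the JSON array [1,2,3], which fromJSON() in the
      # benchmark job below expands into one job per split.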
benchmark:
name: "Run benchmarks: ${{ github.event.inputs.class }} (JDK ${{ github.event.inputs.jdk }}, ${{ matrix.split }} out of ${{ github.event.inputs.num-splits }} splits)"
needs: matrix-gen
# Ubuntu 20.04 is the latest LTS. The next LTS is 22.04.
runs-on: ubuntu-20.04
strategy:
fail-fast: false
matrix:
        split: ${{ fromJSON(needs.matrix-gen.outputs.matrix) }}
env:
SPARK_BENCHMARK_FAILFAST: ${{ github.event.inputs.failfast }}
SPARK_BENCHMARK_NUM_SPLITS: ${{ github.event.inputs.num-splits }}
SPARK_BENCHMARK_CUR_SPLIT: ${{ matrix.split }}
SPARK_GENERATE_BENCHMARK_FILES: 1
SPARK_LOCAL_IP: localhost
      # SPARK-36007: benchmarks that start a local cluster (e.g. KryoSerializerBenchmark)
      # failed in GitHub Actions with "java.lang.AssertionError: assertion failed:
      # spark.test.home is not set!". Setting SPARK_HOME explicitly below keeps
      # spark.test.home from being unset. See SPARK-36007 for details.
SPARK_HOME: ${{ github.workspace }}
steps:
- name: Checkout Spark repository
uses: actions/checkout@v2
      # Fetch the full history in order to diff the generated result files
with:
fetch-depth: 0
- name: Cache Scala, SBT and Maven
uses: actions/cache@v2
with:
path: |
build/apache-maven-*
build/scala-*
build/*.jar
~/.sbt
key: build-${{ hashFiles('**/pom.xml', 'project/build.properties', 'build/mvn', 'build/sbt', 'build/sbt-launch-lib.bash', 'build/spark-build-info') }}
restore-keys: |
build-
- name: Cache Coursier local repository
uses: actions/cache@v2
with:
path: ~/.cache/coursier
key: benchmark-coursier-${{ github.event.inputs.jdk }}-${{ hashFiles('**/pom.xml', '**/plugins.sbt') }}
restore-keys: |
benchmark-coursier-${{ github.event.inputs.jdk }}
- name: Install Java ${{ github.event.inputs.jdk }}
uses: actions/setup-java@v1
with:
java-version: ${{ github.event.inputs.jdk }}
- name: Run benchmarks
run: |
./build/sbt -Pyarn -Pmesos -Pkubernetes -Phive -Phive-thriftserver -Phadoop-cloud -Pkinesis-asl -Pspark-ganglia-lgpl test:package
        # Make the logs less noisy
cp conf/log4j.properties.template conf/log4j.properties
sed -i 's/log4j.rootCategory=INFO, console/log4j.rootCategory=WARN, console/g' conf/log4j.properties
        # Benchmarks run with a local master, so only the driver memory needs to be set. Note that GitHub Actions runners have a 7 GB memory limit.
bin/spark-submit \
--driver-memory 6g --class org.apache.spark.benchmark.Benchmarks \
--jars "`find . -name '*-SNAPSHOT-tests.jar' -o -name '*avro*-SNAPSHOT.jar' | paste -sd ',' -`" \
"`find . -name 'spark-core*-SNAPSHOT-tests.jar'`" \
"${{ github.event.inputs.class }}"
        # Tar the result files to keep the directory structure and file permissions.
# See also https://github.com/actions/upload-artifact#maintaining-file-permissions-and-case-sensitive-files
echo "Preparing the benchmark results:"
tar -cvf benchmark-results-${{ github.event.inputs.jdk }}.tar `git diff --name-only` `git ls-files --others --exclude-standard`
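        # "git diff --name-only" lists result files that were updated, while
        # "git ls-files --others --exclude-standard" lists result files generated for the
        # first time; both are bundled into the tarball uploaded as an artifact below.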
- name: Upload benchmark results
uses: actions/upload-artifact@v2
with:
name: benchmark-results-${{ github.event.inputs.jdk }}-${{ matrix.split }}
path: benchmark-results-${{ github.event.inputs.jdk }}.tar