spark-instrumented-optimizer/.github/workflows/benchmark.yml

[SPARK-34821][INFRA] Set up a workflow for developers to run benchmark in their fork

### What changes were proposed in this pull request?

This PR proposes to add a workflow that allows developers to run benchmarks and download the result files. After this PR, developers can run benchmarks in GitHub Actions in their fork.

### Why are the changes needed?

1. Very easy to use.
2. We can use the (almost) same environment to run the benchmarks. In my few experiments and observations, the CPU, cores, and memory were the same.
3. Does not burden ASF's GitHub Actions resources.

### Does this PR introduce _any_ user-facing change?

No, dev-only.

### How was this patch tested?

Manually tested in https://github.com/HyukjinKwon/spark/pull/31. The entire set of benchmarks was run as below:

- [Run benchmarks: * (JDK 11)](https://github.com/HyukjinKwon/spark/actions/runs/713575465)
- [Run benchmarks: * (JDK 8)](https://github.com/HyukjinKwon/spark/actions/runs/713154337)

### How do developers use it in their fork?

1. **Go to Actions in your fork, and click "Run benchmarks".**

   ![Screen Shot 2021-03-31 at 10 15 13 PM](https://user-images.githubusercontent.com/6477701/113150018-99d71680-926e-11eb-8647-4ecf062c55f2.png)

2. **Run the benchmarks with JDK 8 or 11 and the benchmark classes to run. Glob patterns are supported, just like `testOnly` in SBT.**

   ![Screen Shot 2021-04-02 at 8 35 02 PM](https://user-images.githubusercontent.com/6477701/113412599-ab95f680-93f3-11eb-9a15-c6ed54587b9d.png)

3. **After the jobs finish, the benchmark results are available at the top of the underlying workflow run:**

   ![Screen Shot 2021-03-31 at 10 17 21 PM](https://user-images.githubusercontent.com/6477701/113150332-ede1fb00-926e-11eb-9c0e-97d195070508.png)

4. **After downloading the artifact, unzip and untar it at the Spark git root directory:**

   ```bash
   cd .../spark
   mv ~/Downloads/benchmark-results-8.zip .
   unzip benchmark-results-8.zip
   tar -xvf benchmark-results-8.tar
   ```

5. **Check the results:**

   ```bash
   git status
   ```

   ```
   ...
   modified: core/benchmarks/MapStatusesSerDeserBenchmark-results.txt
   ```

Closes #32015 from HyukjinKwon/SPARK-34821-pr.

Authored-by: HyukjinKwon <gurwls223@apache.org>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
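The same workflow can also be triggered and its results fetched from the command line. The sketch below is illustrative only: it assumes the GitHub CLI (`gh`) is installed and authenticated against your fork, and `<your-user>`, `<run-id>`, and the benchmark class are placeholder values to substitute.

```bash
# Trigger the workflow in your fork with explicit inputs
# (these mirror the workflow_dispatch inputs defined below).
gh workflow run benchmark.yml --repo <your-user>/spark \
  -f class='org.apache.spark.serializer.KryoSerializerBenchmark' \
  -f jdk=8 \
  -f failfast=true \
  -f num-splits=1

# Find the run ID of the run that was just started.
gh run list --repo <your-user>/spark --workflow=benchmark.yml --limit 5

# Once the run finishes, download the artifact uploaded by the last step of the
# workflow (named benchmark-results-<jdk>-<split>) and unpack it at the Spark
# git root. Depending on the gh version, the tarball may land in a subdirectory
# named after the artifact.
gh run download <run-id> --repo <your-user>/spark -n benchmark-results-8-1
tar -xvf benchmark-results-8.tar
```

The workflow definition itself follows.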
name: Run benchmarks
on:
workflow_dispatch:
inputs:
class:
description: 'Benchmark class'
required: true
default: '*'
jdk:
description: 'JDK version: 8 or 11'
required: true
default: '8'
failfast:
description: 'Failfast: true or false'
required: true
default: 'true'
num-splits:
description: 'Number of job splits'
required: true
default: '1'
jobs:
matrix-gen:
name: Generate matrix for job splits
runs-on: ubuntu-20.04
outputs:
matrix: ${{ steps.set-matrix.outputs.matrix }}
env:
SPARK_BENCHMARK_NUM_SPLITS: ${{ github.event.inputs.num-splits }}
steps:
- name: Generate matrix
id: set-matrix
run: echo "::set-output name=matrix::["`seq -s, 1 $SPARK_BENCHMARK_NUM_SPLITS`"]"
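      # For illustration (assuming a hypothetical num-splits of 3), the step above prints
      #   ::set-output name=matrix::[1,2,3]
      # so the "matrix" job output becomes the JSON array [1,2,3], which fromJSON() in the
      # benchmark job below expands into one job per split.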
benchmark:
name: "Run benchmarks: ${{ github.event.inputs.class }} (JDK ${{ github.event.inputs.jdk }}, ${{ matrix.split }} out of ${{ github.event.inputs.num-splits }} splits)"
needs: matrix-gen
# Ubuntu 20.04 is the latest LTS. The next LTS is 22.04.
runs-on: ubuntu-20.04
strategy:
fail-fast: false
matrix:
        split: ${{ fromJSON(needs.matrix-gen.outputs.matrix) }}
env:
SPARK_BENCHMARK_FAILFAST: ${{ github.event.inputs.failfast }}
SPARK_BENCHMARK_NUM_SPLITS: ${{ github.event.inputs.num-splits }}
SPARK_BENCHMARK_CUR_SPLIT: ${{ matrix.split }}
SPARK_GENERATE_BENCHMARK_FILES: 1
SPARK_LOCAL_IP: localhost
      # SPARK-36007: benchmarks that start a local cluster (e.g. KryoSerializerBenchmark)
      # failed in GitHub Actions with "java.lang.AssertionError: assertion failed:
      # spark.test.home is not set!". Setting SPARK_HOME explicitly below keeps
      # spark.test.home from being unset. See SPARK-36007 for details.
SPARK_HOME: ${{ github.workspace }}
steps:
- name: Checkout Spark repository
uses: actions/checkout@v2
      # Fetch the full history in order to diff the generated result files
with:
fetch-depth: 0
- name: Cache Scala, SBT and Maven
uses: actions/cache@v2
with:
path: |
build/apache-maven-*
build/scala-*
build/*.jar
~/.sbt
key: build-${{ hashFiles('**/pom.xml', 'project/build.properties', 'build/mvn', 'build/sbt', 'build/sbt-launch-lib.bash', 'build/spark-build-info') }}
restore-keys: |
build-
- name: Cache Coursier local repository
uses: actions/cache@v2
with:
path: ~/.cache/coursier
key: benchmark-coursier-${{ github.event.inputs.jdk }}-${{ hashFiles('**/pom.xml', '**/plugins.sbt') }}
restore-keys: |
benchmark-coursier-${{ github.event.inputs.jdk }}
- name: Install Java ${{ github.event.inputs.jdk }}
uses: actions/setup-java@v1
with:
java-version: ${{ github.event.inputs.jdk }}
- name: Run benchmarks
run: |
./build/sbt -Pyarn -Pmesos -Pkubernetes -Phive -Phive-thriftserver -Phadoop-cloud -Pkinesis-asl -Pspark-ganglia-lgpl test:package
        # Make the logs less noisy
cp conf/log4j.properties.template conf/log4j.properties
sed -i 's/log4j.rootCategory=INFO, console/log4j.rootCategory=WARN, console/g' conf/log4j.properties
        # Benchmarks run with a local master, so only the driver memory needs to be set. Note that GitHub Actions runners have a 7 GB memory limit.
bin/spark-submit \
--driver-memory 6g --class org.apache.spark.benchmark.Benchmarks \
--jars "`find . -name '*-SNAPSHOT-tests.jar' -o -name '*avro*-SNAPSHOT.jar' | paste -sd ',' -`" \
"`find . -name 'spark-core*-SNAPSHOT-tests.jar'`" \
"${{ github.event.inputs.class }}"
        # Tar the result files to keep the directory structure and file permissions.
# See also https://github.com/actions/upload-artifact#maintaining-file-permissions-and-case-sensitive-files
echo "Preparing the benchmark results:"
tar -cvf benchmark-results-${{ github.event.inputs.jdk }}.tar `git diff --name-only` `git ls-files --others --exclude-standard`
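        # "git diff --name-only" lists result files that were updated, while
        # "git ls-files --others --exclude-standard" lists result files generated for the
        # first time; both are bundled into the tarball uploaded as an artifact below.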
- name: Upload benchmark results
uses: actions/upload-artifact@v2
with:
name: benchmark-results-${{ github.event.inputs.jdk }}-${{ matrix.split }}
path: benchmark-results-${{ github.event.inputs.jdk }}.tar