spark-instrumented-optimizer/core
Gengliang Wang 7ac0a2c37b [SPARK-36827][CORE] Improve the perf and memory usage of cleaning up stage UI data
### What changes were proposed in this pull request?

Improve the performance and memory usage of cleaning up stage UI data. The new code copies the essential fields (stage ID, attempt ID, completion time) into an array and uses that array to determine which stage data and `RDDOperationGraphWrapper` entries need to be cleaned up.
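
As an illustration only, the idea of copying the essential fields before deciding what to evict could be sketched as below. This is not the code from the patch: `StageRecord`, `StageKey`, `keysToEvict`, `cleanup`, and the map-based stores are hypothetical stand-ins for Spark's stage data, graph wrappers, and KVStore.

```scala
import scala.collection.mutable

// Hypothetical, simplified stand-ins; these are NOT Spark's actual classes.
object StageCleanupSketch {
  // Full per-stage record; `uiData` stands in for the heavy UI payload.
  final case class StageRecord(stageId: Int, attemptId: Int, completionTime: Long, uiData: String)

  // Lightweight copy of only the fields needed to choose what to evict.
  final case class StageKey(stageId: Int, attemptId: Int, completionTime: Long)

  // Copy the essential fields into an array, then pick the oldest-completed
  // entries beyond `maxRetained`. The heavy records themselves are not copied.
  def keysToEvict(stages: Iterable[StageRecord], maxRetained: Int): Array[StageKey] = {
    val keys = stages.iterator
      .map(s => StageKey(s.stageId, s.attemptId, s.completionTime))
      .toArray
    val excess = keys.length - maxRetained
    if (excess <= 0) Array.empty[StageKey]
    else keys.sortBy(_.completionTime).take(excess)
  }

  // Remove evicted stage records, then remove a stage's graph only when no
  // attempt of that stage is still retained.
  def cleanup(
      store: mutable.Map[(Int, Int), StageRecord],
      graphs: mutable.Map[Int, String],
      maxRetained: Int): Unit = {
    val evict = keysToEvict(store.values, maxRetained)
    evict.foreach(k => store.remove((k.stageId, k.attemptId)))
    val remainingStageIds = store.keysIterator.map(_._1).toSet
    evict.iterator
      .map(_.stageId)
      .filterNot(remainingStageIds)
      .foreach(id => graphs.remove(id))
  }
}
```

The eviction decision in this sketch works on the small `StageKey` array only, so the full records are touched just once, when they are removed.
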
### Why are the changes needed?

Fix the memory usage issue described in https://issues.apache.org/jira/browse/SPARK-36827

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

Added a new unit test for `InMemoryStore`.
Also ran a simple benchmark:
```
    // Snippet from a test suite: `conf`, `store`, and `time` are assumed to be
    // provided by the enclosing suite.
    val testConf = conf.clone()
      .set(MAX_RETAINED_STAGES, 1000)

    val listener = new AppStatusListener(store, testConf, true)
    // Create 5000 stages, 5x the retention limit, so that stage cleanup is exercised.
    val stages = (1 to 5000).map { i =>
      val s = new StageInfo(i, 0, s"stage$i", 4, Nil, Nil, "details1",
        resourceProfileId = ResourceProfile.DEFAULT_RESOURCE_PROFILE_ID)
      s.submissionTime = Some(i.toLong)
      s
    }
    listener.onJobStart(SparkListenerJobStart(4, time, Nil, null))
    val start = System.nanoTime()
    // Submit and complete each stage in turn.
    stages.foreach { s =>
      time += 1
      s.submissionTime = Some(time)
      listener.onStageSubmitted(SparkListenerStageSubmitted(s, new Properties()))
      s.completionTime = Some(time)
      listener.onStageCompleted(SparkListenerStageCompleted(s))
    }
    // Elapsed time in nanoseconds.
    println(System.nanoTime() - start)
```

Before changes:
InMemoryStore: 1.2s

After changes:
InMemoryStore: 0.23s

Closes #34092 from gengliangwang/cleanStage.

Authored-by: Gengliang Wang <gengliang@apache.org>
Signed-off-by: Gengliang Wang <gengliang@apache.org>
2021-09-24 17:24:18 +08:00
| Name | Last commit | Date |
| --- | --- | --- |
| benchmarks | Revert "[SPARK-34309][BUILD][CORE][SQL][K8S] Use Caffeine instead of Guava Cache" | 2021-08-22 09:36:15 +09:00 |
| src | [SPARK-36827][CORE] Improve the perf and memory usage of cleaning up stage UI data | 2021-09-24 17:24:18 +08:00 |
| pom.xml | [SPARK-36712][BUILD] Make scala-parallel-collections in 2.13 POM a direct dependency (not in maven profile) | 2021-09-13 11:06:50 -05:00 |