f4d5aa4213
### What changes were proposed in this pull request?

This PR compresses the serialized `MapStatuses` with ZStd instead of GZIP, because ZStd provides a better compression ratio and faster compression time.

The original approach serializes and writes data directly into `GZIPOutputStream` in one step; however, compression is faster if a bigger chunk of the data is processed by the codec at once. As a result, in this PR, the serialized data is first written into an uncompressed byte array, and then the data is compressed. For smaller `MapStatuses`, we find it's 2x faster.

Here is the benchmark result.

#### 20k map outputs, and each has 500 blocks
1. ZStd two steps in this PR: 0.402 ops/ms, 89,066 bytes
2. ZStd one step as the original approach: 0.370 ops/ms, 89,069 bytes
3. GZip: 0.092 ops/ms, 217,345 bytes

#### 20k map outputs, and each has 5 blocks
1. ZStd two steps in this PR: 0.9 ops/ms, 75,449 bytes
2. ZStd one step as the original approach: 0.38 ops/ms, 75,452 bytes
3. GZip: 0.21 ops/ms, 160,094 bytes

### Why are the changes needed?

To decrease the time spent serializing the `MapStatuses` in large-scale jobs.

### Does this PR introduce any user-facing change?

No.

### How was this patch tested?

Existing tests.

Closes #26085 from dbtsai/mapStatus.

Lead-authored-by: DB Tsai <d_tsai@apple.com>
Co-authored-by: Dongjoon Hyun <dhyun@apple.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
- CoalescedRDDBenchmark-jdk11-results.txt
- CoalescedRDDBenchmark-results.txt
- KryoBenchmark-jdk11-results.txt
- KryoBenchmark-results.txt
- KryoSerializerBenchmark-jdk11-results.txt
- KryoSerializerBenchmark-results.txt
- MapStatusesSerDeserBenchmark-jdk11-results.txt
- MapStatusesSerDeserBenchmark-results.txt
- PropertiesCloneBenchmark-jdk11-results.txt
- PropertiesCloneBenchmark-results.txt
- XORShiftRandomBenchmark-jdk11-results.txt
- XORShiftRandomBenchmark-results.txt
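
The one-step versus two-step distinction described in the commit message can be illustrated with a minimal sketch. This is not the code from the PR: it uses the JDK's `GZIPOutputStream` as a stand-in codec so that it runs without extra dependencies (the PR itself switches to ZStd), and the class name `TwoStepCompression`, the `long[][]` payload, and the `sampleStatuses` helper are hypothetical names chosen only for this example.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

class TwoStepCompression {

    // One step: the object stream writes straight into the compressing
    // stream, so the codec receives the data in many small writes.
    static byte[] oneStep(long[][] statuses) throws IOException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        try (ObjectOutputStream oos =
                 new ObjectOutputStream(new GZIPOutputStream(out))) {
            oos.writeObject(statuses);
        }
        return out.toByteArray();
    }

    // Two steps: serialize into an uncompressed byte array first, then hand
    // the whole array to the codec in a single write.
    static byte[] twoStep(long[][] statuses) throws IOException {
        ByteArrayOutputStream raw = new ByteArrayOutputStream();
        try (ObjectOutputStream oos = new ObjectOutputStream(raw)) {
            oos.writeObject(statuses);
        }
        byte[] uncompressed = raw.toByteArray();

        ByteArrayOutputStream out = new ByteArrayOutputStream();
        try (GZIPOutputStream gzip = new GZIPOutputStream(out)) {
            gzip.write(uncompressed);
        }
        return out.toByteArray();
    }

    // Decompress and deserialize; works for the output of either method.
    static long[][] decode(byte[] compressed)
            throws IOException, ClassNotFoundException {
        try (ObjectInputStream ois = new ObjectInputStream(
                 new GZIPInputStream(new ByteArrayInputStream(compressed)))) {
            return (long[][]) ois.readObject();
        }
    }

    // Hypothetical stand-in for map statuses: one row of block sizes per map.
    static long[][] sampleStatuses(int maps, int blocks) {
        long[][] statuses = new long[maps][blocks];
        for (int i = 0; i < maps; i++) {
            for (int j = 0; j < blocks; j++) {
                statuses[i][j] = (i * 31L + j) % 1024;
            }
        }
        return statuses;
    }
}
```

Both methods yield a payload that `decode` can read back. The sketch only shows the structure of the two approaches; it makes no claim about their relative speed with GZIP, and the figures in the commit message come from the author's ZStd benchmark.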