[SPARK-25438][SQL][TEST] Fix FilterPushdownBenchmark to use the same memory assumption

## What changes were proposed in this pull request?

This PR aims to fix three things in `FilterPushdownBenchmark`.

**1. Use the same memory assumption.**
The following configurations are used in ORC and Parquet.

- Memory buffer for writing
  - `parquet.block.size` (default: 128MB)
  - `orc.stripe.size` (default: 64MB)

- Compression chunk size
  - `parquet.page.size` (default: 1MB)
  - `orc.compress.size` (default: 256KB)

SPARK-24692 used 1MB, the default value of `parquet.page.size`, for both `parquet.block.size` and `orc.stripe.size`, but it did not set `orc.compress.size` to match. As a result, the current benchmark compares ORC using a 256KB compression chunk against Parquet using a 1MB one. For a fair comparison, these sizes need to be consistent.
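
For illustration only, here is a minimal sketch of what "the same memory assumption" means in practice: the same 1MB value is passed for both the memory-buffer and compression-chunk options of each format. The option names and the 1MB value come from this PR; the `SparkSession` setup, the DataFrame, and the output paths are placeholders rather than the benchmark's actual code.

```scala
import org.apache.spark.sql.SparkSession

object ConsistentSizesSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("consistent-sizes-sketch").getOrCreate()

    // 1MB, the default parquet.page.size, is used for every size knob so that
    // ORC and Parquet are benchmarked under the same memory assumption.
    val blockSize = 1024 * 1024

    val df = spark.range(1000000).selectExpr("id AS value")

    df.write.mode("overwrite")
      .option("orc.stripe.size", blockSize)      // memory buffer for writing
      .option("orc.compress.size", blockSize)    // compression chunk size
      .orc("/tmp/bench/orc")                     // placeholder path

    df.write.mode("overwrite")
      .option("parquet.block.size", blockSize)   // memory buffer for writing
      .option("parquet.page.size", blockSize)    // compression chunk size
      .parquet("/tmp/bench/parquet")             // placeholder path

    spark.stop()
  }
}
```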

**2. Dictionary encoding should not be enforced for all cases.**
SPARK-24206 enforced dictionary encoding for all test cases. This PR restores the default behavior in general and enforces dictionary encoding only for `prepareStringDictTable`.
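
As a sketch of the intended behavior (meant for `spark-shell`; the helper name and its parameters are illustrative, not the benchmark's API), the ORC dictionary threshold is raised to 1.0 only when the caller opts in, and otherwise stays at the 0.8 default, mirroring the `saveAsTable` change in the diff below:

```scala
import org.apache.spark.sql.DataFrame

// Force dictionary encoding only on demand; by default keep ORC's own heuristic.
def writeOrc(df: DataFrame, path: String, useDictionary: Boolean = false): Unit = {
  df.write.mode("overwrite")
    // 1.0 keeps dictionary encoding for every column; with the 0.8 default the writer
    // falls back to direct encoding for columns whose distinct-value ratio exceeds 0.8.
    .option("orc.dictionary.key.threshold", if (useDictionary) 1.0 else 0.8)
    .orc(path)
}
```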

**3. Generate test results on AWS r3.xlarge.**
SPARK-24206 generated the results on AWS so that they are easy to reproduce and compare. For the same reason, this PR updates the results on the same kind of machine, specifically an AWS r3.xlarge instance with Instance Store.

## How was this patch tested?

Manual. Enable the test cases and run `FilterPushdownBenchmark` on AWS r3.xlarge. It takes about 4 hours and 15 minutes.

Closes #22427 from dongjoon-hyun/SPARK-25438.

Authored-by: Dongjoon Hyun <dongjoon@apache.org>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>

```diff
@@ -53,7 +53,8 @@ class FilterPushdownBenchmark extends SparkFunSuite with BenchmarkBeforeAndAfter
   private val numRows = 1024 * 1024 * 15
   private val width = 5
   private val mid = numRows / 2
-  private val blockSize = 1048576
+  // For Parquet/ORC, we will use the same value for block size and compression size
+  private val blockSize = org.apache.parquet.hadoop.ParquetWriter.DEFAULT_PAGE_SIZE
   private val spark = SparkSession.builder().config(conf).getOrCreate()
@@ -130,16 +131,16 @@ class FilterPushdownBenchmark extends SparkFunSuite with BenchmarkBeforeAndAfter
     }
     val df = spark.range(numRows).selectExpr(selectExpr: _*).sort("value")
-    saveAsTable(df, dir)
+    saveAsTable(df, dir, true)
   }
 
-  private def saveAsTable(df: DataFrame, dir: File): Unit = {
+  private def saveAsTable(df: DataFrame, dir: File, useDictionary: Boolean = false): Unit = {
     val orcPath = dir.getCanonicalPath + "/orc"
     val parquetPath = dir.getCanonicalPath + "/parquet"
 
-    // To always turn on dictionary encoding, we set 1.0 at the threshold (the default is 0.8)
     df.write.mode("overwrite")
-      .option("orc.dictionary.key.threshold", 1.0)
+      .option("orc.dictionary.key.threshold", if (useDictionary) 1.0 else 0.8)
+      .option("orc.compress.size", blockSize)
       .option("orc.stripe.size", blockSize).orc(orcPath)
     spark.read.orc(orcPath).createOrReplaceTempView("orcTable")
```