[SPARK-31388][SQL][TESTS] org.apache.spark.sql.hive.thriftserver.CliSuite doesn't match results correctly

### What changes were proposed in this pull request?

`CliSuite.runCliWithin` did not match expected results correctly: it looked for each expected line anywhere in stdout or stderr, so unrelated output could satisfy an expectation.
Take the `Single command with --database` test as an example. In
```
    runCliWithin(2.minute)(
      "CREATE DATABASE hive_db_test;"
        -> "",
      "USE hive_test;"
        -> "",
      "CREATE TABLE hive_test(key INT, val STRING);"
        -> "",
      "SHOW TABLES;"
        -> "hive_test"
    )
```
It was looking for lines containing `""`, `""`, `""` and then `"hive_test"`. However, `""` is a substring of every line, and `"hive_test"` also occurs in the CLI's echo of the `USE hive_test;` query, and hence:
```
2020-04-08 17:53:12,752 INFO  CliSuite - 2020-04-08 17:53:12.752 - stderr> Spark master: local, Application Id: local-1586368384172
2020-04-08 17:53:12,765 INFO  CliSuite - stderr> found expected output line 0: ""
2020-04-08 17:53:12,765 INFO  CliSuite - 2020-04-08 17:53:12.765 - stdout> spark-sql> CREATE DATABASE hive_db_test;
2020-04-08 17:53:12,765 INFO  CliSuite - stdout> found expected output line 1: ""
2020-04-08 17:53:17,688 INFO  CliSuite - 2020-04-08 17:53:17.688 - stderr> chgrp: changing ownership of 'file:///tmp/spark-8811f069-4cba-4c71-a5d6-62dd925fb5ff': chown: changing group of '/tmp/spark-8811f069-4cba-4c71-a5d6-62dd925fb5ff': Operation not permitted
2020-04-08 17:53:12,765 INFO  CliSuite - stderr> found expected output line 2: ""
2020-04-08 17:53:18,069 INFO  CliSuite - 2020-04-08 17:53:18.069 - stderr> Time taken: 5.265 seconds
2020-04-08 17:53:18,087 INFO  CliSuite - 2020-04-08 17:53:18.087 - stdout> spark-sql> USE hive_test;
2020-04-08 17:53:12,765 INFO  CliSuite - stdout> found expected output line 3: "hive_test"
2020-04-08 17:53:21,742 INFO  CliSuite - Found all expected output.
```
Because of that, the suite could find all expected output and kill the CLI process without the table ever being created. This was not expected: the follow-up invocation in the same test depends on that table. The test could be flaky depending on whether `process.destroy()` in the `finally` block managed to kill the process before it actually created the table.
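The false match is easy to reproduce outside the suite: `String.contains("")` is true for every line, and an expected answer can be a substring of the CLI's echo of a query rather than of an actual result row. A minimal standalone sketch (the log line below is illustrative, not taken from a real run):

```scala
object MatcherFlaw {
  def main(args: Array[String]): Unit = {
    // A captured line that merely echoes the submitted query.
    val logLine = "stdout> spark-sql> USE hive_test;"
    // An empty expected answer matches any line at all.
    assert(logLine.contains(""))
    // The expected answer "hive_test" matches the query echo,
    // before SHOW TABLES has produced any result row.
    assert(logLine.contains("hive_test"))
  }
}
```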

This PR makes the output checking more robust so that it does not match on unexpected output: the matcher now also expects the CLI's echo of each query, in order, before that query's answer. It also waits for the CLI process to finish gracefully (triggered by closing its stdin) instead of killing it forcibly.
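The new expectation list can be sketched as follows. This is a simplified version of the change to `runCliWithin` (it omits the 60-character truncation of long query lines): for each `(query, answer)` pair the matcher expects the CLI's echo of the query first, then the answer, so an answer can no longer match before its query has even been submitted.

```scala
object ExpectedLines {
  // Build the ordered list of lines the matcher must see.
  def expected(queriesAndAnswers: Seq[(String, String)]): Seq[String] =
    queriesAndAnswers.flatMap { case (query, answer) =>
      if (query.isEmpty) {
        // A command launched with -e produces no interactive echo.
        Seq(answer)
      } else {
        // spark-sql echoes the first query line after its prompt and
        // continuation lines after an aligned "> " prompt.
        val echo = query.split("\n").toList match {
          case first :: rest =>
            s"spark-sql> $first" :: rest.map(l => s"         > $l")
          case Nil => Nil
        }
        echo :+ answer
      }
    }
}
```

For example, `expected(Seq("SHOW TABLES;" -> "hive_table_test"))` yields the echo line `spark-sql> SHOW TABLES;` followed by `hive_table_test`.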

### Why are the changes needed?

`org.apache.spark.sql.hive.thriftserver.CliSuite` was flaky and did not verify query outputs as intended.

### Does this PR introduce any user-facing change?

No.

### How was this patch tested?

Existing tests in `CliSuite`. Ran the suite several times with no flakiness; before this change it produced flaky results on almost every run.
```
[info] CliSuite:
[info] - load warehouse dir from hive-site.xml (12 seconds, 568 milliseconds)
[info] - load warehouse dir from --hiveconf (10 seconds, 648 milliseconds)
[info] - load warehouse dir from --conf spark(.hadoop).hive.* (20 seconds, 653 milliseconds)
[info] - load warehouse dir from spark.sql.warehouse.dir (9 seconds, 763 milliseconds)
[info] - Simple commands (16 seconds, 238 milliseconds)
[info] - Single command with -e (9 seconds, 967 milliseconds)
[info] - Single command with --database (21 seconds, 205 milliseconds)
[info] - Commands using SerDe provided in --jars (15 seconds, 51 milliseconds)
[info] - SPARK-29022: Commands using SerDe provided in --hive.aux.jars.path (14 seconds, 625 milliseconds)
[info] - SPARK-11188 Analysis error reporting (7 seconds, 960 milliseconds)
[info] - SPARK-11624 Spark SQL CLI should set sessionState only once (7 seconds, 424 milliseconds)
[info] - list jars (9 seconds, 520 milliseconds)
[info] - list jar <jarfile> (9 seconds, 277 milliseconds)
[info] - list files (9 seconds, 828 milliseconds)
[info] - list file <filepath> (9 seconds, 646 milliseconds)
[info] - apply hiveconf from cli command (9 seconds, 469 milliseconds)
[info] - Support hive.aux.jars.path (10 seconds, 676 milliseconds)
[info] - SPARK-28840 test --jars command (10 seconds, 921 milliseconds)
[info] - SPARK-28840 test --jars and hive.aux.jars.path command (11 seconds, 49 milliseconds)
[info] - SPARK-29022 Commands using SerDe provided in ADD JAR sql (14 seconds, 210 milliseconds)
[info] - SPARK-26321 Should not split semicolon within quoted string literals (12 seconds, 729 milliseconds)
[info] - Pad Decimal numbers with trailing zeros to the scale of the column (10 seconds, 381 milliseconds)
[info] - SPARK-30049 Should not complain for quotes in commented lines (10 seconds, 935 milliseconds)
[info] - SPARK-30049 Should not complain for quotes in commented with multi-lines (20 seconds, 731 milliseconds)
```

Closes #28156 from juliuszsompolski/SPARK-31388.

Authored-by: Juliusz Sompolski <julek@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
Committed 2020-04-27 04:27:55 +00:00 by Wenchen Fan
parent 4a21c4cc92
commit 560bd5401f

```diff
@@ -87,9 +87,25 @@ class CliSuite extends SparkFunSuite with BeforeAndAfterAll with BeforeAndAfterE
       useExternalHiveFile: Boolean = false)(
       queriesAndExpectedAnswers: (String, String)*): Unit = {
-    val (queries, expectedAnswers) = queriesAndExpectedAnswers.unzip
     // Explicitly adds ENTER for each statement to make sure they are actually entered into the CLI.
-    val queriesString = queries.map(_ + "\n").mkString
+    val queriesString = queriesAndExpectedAnswers.map(_._1 + "\n").mkString
+    // spark-sql echoes the queries on STDOUT, expect first an echo of the query, then the answer.
+    val expectedAnswers = queriesAndExpectedAnswers.flatMap {
+      case (query, answer) =>
+        if (query == "") {
+          // empty query means a command launched with -e
+          Seq(answer)
+        } else {
+          // spark-sql echoes the submitted queries
+          val queryEcho = query.split("\n").toList match {
+            case firstLine :: tail =>
+              s"spark-sql> $firstLine" :: tail.map(l => s"         > $l")
+          }
+          // longer lines sometimes get split in the output,
+          // match the first 60 characters of each query line
+          queryEcho.map(_.take(60)) :+ answer
+        }
+    }
     val extraHive = if (useExternalHiveFile) {
       s"--driver-class-path ${System.getProperty("user.dir")}/src/test/noclasspath"
@@ -122,10 +138,13 @@ class CliSuite extends SparkFunSuite with BeforeAndAfterAll with BeforeAndAfterE
     def captureOutput(source: String)(line: String): Unit = lock.synchronized {
       // This test suite sometimes gets extremely slow out of unknown reason on Jenkins. Here we
       // add a timestamp to provide more diagnosis information.
-      buffer += s"${new Timestamp(new Date().getTime)} - $source> $line"
+      val newLine = s"${new Timestamp(new Date().getTime)} - $source> $line"
+      log.info(newLine)
+      buffer += newLine
       // If we haven't found all expected answers and another expected answer comes up...
       if (next < expectedAnswers.size && line.contains(expectedAnswers(next))) {
+        log.info(s"$source> found expected output line $next: '${expectedAnswers(next)}'")
         next += 1
         // If all expected answers have been found...
         if (next == expectedAnswers.size) {
@@ -153,6 +172,7 @@ class CliSuite extends SparkFunSuite with BeforeAndAfterAll with BeforeAndAfterE
     try {
       ThreadUtils.awaitResult(foundAllExpectedAnswers.future, timeout)
+      log.info("Found all expected output.")
     } catch { case cause: Throwable =>
       val message =
         s"""
@@ -161,8 +181,7 @@ class CliSuite extends SparkFunSuite with BeforeAndAfterAll with BeforeAndAfterE
            |=======================
            |Spark SQL CLI command line: ${command.mkString(" ")}
            |Exception: $cause
-           |Executed query $next "${queries(next)}",
-           |But failed to capture expected output "${expectedAnswers(next)}" within $timeout.
+           |Failed to capture next expected output "${expectedAnswers(next)}" within $timeout.
            |
            |${buffer.mkString("\n")}
            |===========================
@@ -172,7 +191,13 @@ class CliSuite extends SparkFunSuite with BeforeAndAfterAll with BeforeAndAfterE
       logError(message, cause)
       fail(message, cause)
     } finally {
-      process.destroy()
+      if (!process.waitFor(1, MINUTES)) {
+        try {
+          fail("spark-sql did not exit gracefully.")
+        } finally {
+          process.destroy()
+        }
+      }
     }
   }
@@ -233,7 +258,8 @@ class CliSuite extends SparkFunSuite with BeforeAndAfterAll with BeforeAndAfterE
         -> "",
       "SHOW TABLES;"
         -> "hive_test",
-      s"LOAD DATA LOCAL INPATH '$dataFilePath' OVERWRITE INTO TABLE hive_test;"
+      s"""LOAD DATA LOCAL INPATH '$dataFilePath'
+         |OVERWRITE INTO TABLE hive_test;""".stripMargin
         -> "",
       "CACHE TABLE hive_test;"
         -> "",
@@ -250,18 +276,18 @@ class CliSuite extends SparkFunSuite with BeforeAndAfterAll with BeforeAndAfterE
   test("Single command with --database") {
     runCliWithin(2.minute)(
-      "CREATE DATABASE hive_test_db;"
+      "CREATE DATABASE hive_db_test;"
         -> "",
-      "USE hive_test_db;"
+      "USE hive_db_test;"
         -> "",
-      "CREATE TABLE hive_test(key INT, val STRING);"
+      "CREATE TABLE hive_table_test(key INT, val STRING);"
         -> "",
       "SHOW TABLES;"
-        -> "hive_test"
+        -> "hive_table_test"
     )
-    runCliWithin(2.minute, Seq("--database", "hive_test_db", "-e", "SHOW TABLES;"))(
-      "" -> "hive_test"
+    runCliWithin(2.minute, Seq("--database", "hive_db_test", "-e", "SHOW TABLES;"))(
+      "" -> "hive_table_test"
     )
   }
@@ -273,12 +299,12 @@ class CliSuite extends SparkFunSuite with BeforeAndAfterAll with BeforeAndAfterE
     runCliWithin(3.minute, Seq("--jars", s"$jarFile"))(
       """CREATE TABLE t1(key string, val string)
-        |ROW FORMAT SERDE 'org.apache.hive.hcatalog.data.JsonSerDe';
-      """.stripMargin
+        |ROW FORMAT SERDE 'org.apache.hive.hcatalog.data.JsonSerDe';""".stripMargin
         -> "",
       "CREATE TABLE sourceTable (key INT, val STRING) USING hive;"
         -> "",
-      s"LOAD DATA LOCAL INPATH '$dataFilePath' OVERWRITE INTO TABLE sourceTable;"
+      s"""LOAD DATA LOCAL INPATH '$dataFilePath'
+         |OVERWRITE INTO TABLE sourceTable;""".stripMargin
         -> "",
       "INSERT INTO TABLE t1 SELECT key, val FROM sourceTable;"
         -> "",
@@ -299,12 +325,12 @@ class CliSuite extends SparkFunSuite with BeforeAndAfterAll with BeforeAndAfterE
       3.minute,
       Seq("--conf", s"spark.hadoop.${ConfVars.HIVEAUXJARS}=$hiveContribJar"))(
       """CREATE TABLE addJarWithHiveAux(key string, val string)
-        |ROW FORMAT SERDE 'org.apache.hive.hcatalog.data.JsonSerDe';
-      """.stripMargin
+        |ROW FORMAT SERDE 'org.apache.hive.hcatalog.data.JsonSerDe';""".stripMargin
         -> "",
       "CREATE TABLE sourceTableForWithHiveAux (key INT, val STRING) USING hive;"
         -> "",
-      s"LOAD DATA LOCAL INPATH '$dataFilePath' OVERWRITE INTO TABLE sourceTableForWithHiveAux;"
+      s"""LOAD DATA LOCAL INPATH '$dataFilePath'
+         |OVERWRITE INTO TABLE sourceTableForWithHiveAux;""".stripMargin
         -> "",
       "INSERT INTO TABLE addJarWithHiveAux SELECT key, val FROM sourceTableForWithHiveAux;"
         -> "",
@@ -419,12 +445,12 @@ class CliSuite extends SparkFunSuite with BeforeAndAfterAll with BeforeAndAfterE
       3.minute)(
       s"ADD JAR ${hiveContribJar};" -> "",
       """CREATE TABLE addJarWithSQL(key string, val string)
-        |ROW FORMAT SERDE 'org.apache.hive.hcatalog.data.JsonSerDe';
-      """.stripMargin
+        |ROW FORMAT SERDE 'org.apache.hive.hcatalog.data.JsonSerDe';""".stripMargin
         -> "",
       "CREATE TABLE sourceTableForWithSQL(key INT, val STRING) USING hive;"
         -> "",
-      s"LOAD DATA LOCAL INPATH '$dataFilePath' OVERWRITE INTO TABLE sourceTableForWithSQL;"
+      s"""LOAD DATA LOCAL INPATH '$dataFilePath'
+         |OVERWRITE INTO TABLE sourceTableForWithSQL;""".stripMargin
         -> "",
       "INSERT INTO TABLE addJarWithSQL SELECT key, val FROM sourceTableForWithSQL;"
         -> "",
```