spark-instrumented-optimizer/sql
hyukjinkwon ac84fb64dd [SPARK-16434][SQL] Avoid per-record type dispatch in JSON when reading
## What changes were proposed in this pull request?

Currently, `JacksonParser.parse` is doing type-based dispatch for each row to convert the tokens to appropriate values for Spark.
It might not have to be done like this because the schema is already kept.

So, appropriate converters can be created first according to the schema once, and then apply them to each row.

This PR corrects `JacksonParser` so that it creates all converters for the schema once and then applies them to each row rather than type dispatching for every row.

Benchmark was proceeded with the codes below:

#### Parser tests

**Before**

```scala
test("Benchmark for JSON converter") {
  val N = 500 << 8
  val row =
    """{"struct":{"field1": true, "field2": 92233720368547758070},
    "structWithArrayFields":{"field1":[4, 5, 6], "field2":["str1", "str2"]},
    "arrayOfString":["str1", "str2"],
    "arrayOfInteger":[1, 2147483647, -2147483648],
    "arrayOfLong":[21474836470, 9223372036854775807, -9223372036854775808],
    "arrayOfBigInteger":[922337203685477580700, -922337203685477580800],
    "arrayOfDouble":[1.2, 1.7976931348623157E308, 4.9E-324, 2.2250738585072014E-308],
    "arrayOfBoolean":[true, false, true],
    "arrayOfNull":[null, null, null, null],
    "arrayOfStruct":[{"field1": true, "field2": "str1"}, {"field1": false}, {"field3": null}],
    "arrayOfArray1":[[1, 2, 3], ["str1", "str2"]],
    "arrayOfArray2":[[1, 2, 3], [1.1, 2.1, 3.1]]
   }"""
  val data = List.fill(N)(row)
  val dummyOption = new JSONOptions(Map.empty[String, String])
  val schema =
    InferSchema.infer(spark.sparkContext.parallelize(Seq(row)), "", dummyOption)
  val factory = new JsonFactory()

  val benchmark = new Benchmark("JSON converter", N)
  benchmark.addCase("convert JSON file", 10) { _ =>
    data.foreach { input =>
      val parser = factory.createParser(input)
      parser.nextToken()
      JacksonParser.convertRootField(factory, parser, schema)
    }
  }
  benchmark.run()
}
```

```
JSON converter:                          Best/Avg Time(ms)    Rate(M/s)   Per Row(ns)   Relative
------------------------------------------------------------------------------------------------
convert JSON file                             1697 / 1807          0.1       13256.9       1.0X
```

**After**

```scala
test("Benchmark for JSON converter") {
  val N = 500 << 8
  val row =
    """{"struct":{"field1": true, "field2": 92233720368547758070},
    "structWithArrayFields":{"field1":[4, 5, 6], "field2":["str1", "str2"]},
    "arrayOfString":["str1", "str2"],
    "arrayOfInteger":[1, 2147483647, -2147483648],
    "arrayOfLong":[21474836470, 9223372036854775807, -9223372036854775808],
    "arrayOfBigInteger":[922337203685477580700, -922337203685477580800],
    "arrayOfDouble":[1.2, 1.7976931348623157E308, 4.9E-324, 2.2250738585072014E-308],
    "arrayOfBoolean":[true, false, true],
    "arrayOfNull":[null, null, null, null],
    "arrayOfStruct":[{"field1": true, "field2": "str1"}, {"field1": false}, {"field3": null}],
    "arrayOfArray1":[[1, 2, 3], ["str1", "str2"]],
    "arrayOfArray2":[[1, 2, 3], [1.1, 2.1, 3.1]]
   }"""
  val data = List.fill(N)(row)
  val dummyOption = new JSONOptions(Map.empty[String, String], new SQLConf())
  val schema =
    InferSchema.infer(spark.sparkContext.parallelize(Seq(row)), dummyOption)

  val benchmark = new Benchmark("JSON converter", N)
  benchmark.addCase("convert JSON file", 10) { _ =>
    val parser = new JacksonParser(schema, dummyOption)
    data.foreach { input =>
      parser.parse(input)
    }
  }
  benchmark.run()
}
```

```
JSON converter:                          Best/Avg Time(ms)    Rate(M/s)   Per Row(ns)   Relative
------------------------------------------------------------------------------------------------
convert JSON file                             1401 / 1461          0.1       10947.4       1.0X
```

It seems parsing time is improved by roughly ~20%

#### End-to-End test

```scala
test("Benchmark for JSON reader") {
  val N = 500 << 8
  val row =
    """{"struct":{"field1": true, "field2": 92233720368547758070},
    "structWithArrayFields":{"field1":[4, 5, 6], "field2":["str1", "str2"]},
    "arrayOfString":["str1", "str2"],
    "arrayOfInteger":[1, 2147483647, -2147483648],
    "arrayOfLong":[21474836470, 9223372036854775807, -9223372036854775808],
    "arrayOfBigInteger":[922337203685477580700, -922337203685477580800],
    "arrayOfDouble":[1.2, 1.7976931348623157E308, 4.9E-324, 2.2250738585072014E-308],
    "arrayOfBoolean":[true, false, true],
    "arrayOfNull":[null, null, null, null],
    "arrayOfStruct":[{"field1": true, "field2": "str1"}, {"field1": false}, {"field3": null}],
    "arrayOfArray1":[[1, 2, 3], ["str1", "str2"]],
    "arrayOfArray2":[[1, 2, 3], [1.1, 2.1, 3.1]]
   }"""
  val df = spark.sqlContext.read.json(spark.sparkContext.parallelize(List.fill(N)(row)))
  withTempPath { path =>
    df.write.format("json").save(path.getCanonicalPath)

    val benchmark = new Benchmark("JSON reader", N)
    benchmark.addCase("reading JSON file", 10) { _ =>
      spark.read.format("json").load(path.getCanonicalPath).collect()
    }
    benchmark.run()
  }
}
```

**Before**

```
JSON reader:                             Best/Avg Time(ms)    Rate(M/s)   Per Row(ns)   Relative
------------------------------------------------------------------------------------------------
reading JSON file                             6485 / 6924          0.0       50665.0       1.0X
```

**After**

```
JSON reader:                             Best/Avg Time(ms)    Rate(M/s)   Per Row(ns)   Relative
------------------------------------------------------------------------------------------------
reading JSON file                             6350 / 6529          0.0       49609.3       1.0X
```

## How was this patch tested?

Existing test cases should cover this.

Author: hyukjinkwon <gurwls223@gmail.com>

Closes #14102 from HyukjinKwon/SPARK-16434.
2016-08-12 11:09:42 +08:00
..
catalyst [SPARK-16958] [SQL] Reuse subqueries within the same query 2016-08-11 09:47:19 -07:00
core [SPARK-16434][SQL] Avoid per-record type dispatch in JSON when reading 2016-08-12 11:09:42 +08:00
hive [SPARK-16866][SQL] Infrastructure for file-based SQL end-to-end tests 2016-08-10 17:17:21 +08:00
hive-thriftserver [SPARK-16941] Use concurrentHashMap instead of scala Map in SparkSQLOperationManager. 2016-08-11 11:28:28 +01:00
README.md [SPARK-16557][SQL] Remove stale doc in sql/README.md 2016-07-14 19:24:42 -07:00

Spark SQL

This module provides support for executing relational queries expressed in either SQL or the DataFrame/Dataset API.

Spark SQL is broken up into four subprojects:

  • Catalyst (sql/catalyst) - An implementation-agnostic framework for manipulating trees of relational operators and expressions.
  • Execution (sql/core) - A query planner / execution engine for translating Catalyst's logical query plans into Spark RDDs. This component also includes a new public interface, SQLContext, that allows users to execute SQL or LINQ statements against existing RDDs and Parquet files.
  • Hive Support (sql/hive) - Includes an extension of SQLContext called HiveContext that allows users to write queries using a subset of HiveQL and access data from a Hive Metastore using Hive SerDes. There are also wrappers that allows users to run queries that include Hive UDFs, UDAFs, and UDTFs.
  • HiveServer and CLI support (sql/hive-thriftserver) - Includes support for the SQL CLI (bin/spark-sql) and a HiveServer2 (for JDBC/ODBC) compatible server.