2337ccc15d
This patch comprises a few related pieces of work:
* Schema inference is performed directly on the JSON token stream
* `String => Row` conversion populates Spark SQL structures without intermediate types
* Projection pushdown is implemented via CatalystScan for DataFrame queries
* The legacy parser remains available by setting `spark.sql.json.useJacksonStreamingAPI` to `false`
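
To illustrate the first point, here is a minimal sketch of inferring a schema directly from a Jackson token stream, without first building an intermediate tree or `Map`. It is not the code from this patch: the `InferredType` hierarchy, `TokenStreamInference`, `merge`, `inferValue`, and `inferRecord` are hypothetical names, the type-widening rules are simplified assumptions, and it only requires `jackson-core` on the classpath.

```scala
import com.fasterxml.jackson.core.{JsonFactory, JsonParser, JsonToken}

// Hypothetical, simplified type lattice; Spark SQL's DataType hierarchy is richer.
sealed trait InferredType
case object NullT extends InferredType
case object BooleanT extends InferredType
case object LongT extends InferredType
case object DoubleT extends InferredType
case object StringT extends InferredType
case class ArrayT(element: InferredType) extends InferredType
case class StructT(fields: Map[String, InferredType]) extends InferredType

object TokenStreamInference {
  private val factory = new JsonFactory()

  // Merge two inferred types into one type that can hold both.
  def merge(a: InferredType, b: InferredType): InferredType = (a, b) match {
    case (x, y) if x == y => x
    case (NullT, x) => x
    case (x, NullT) => x
    case (LongT, DoubleT) | (DoubleT, LongT) => DoubleT
    case (ArrayT(x), ArrayT(y)) => ArrayT(merge(x, y))
    case (StructT(f1), StructT(f2)) =>
      StructT((f1.keySet ++ f2.keySet).map { k =>
        k -> merge(f1.getOrElse(k, NullT), f2.getOrElse(k, NullT))
      }.toMap)
    case _ => StringT // assumption: conflicting types fall back to string
  }

  // Infer the type of the value at the parser's current token,
  // consuming nested arrays and objects as it goes.
  def inferValue(p: JsonParser): InferredType = p.getCurrentToken match {
    case JsonToken.VALUE_NULL => NullT
    case JsonToken.VALUE_TRUE | JsonToken.VALUE_FALSE => BooleanT
    case JsonToken.VALUE_NUMBER_INT => LongT
    case JsonToken.VALUE_NUMBER_FLOAT => DoubleT
    case JsonToken.VALUE_STRING => StringT
    case JsonToken.START_ARRAY =>
      var elem: InferredType = NullT
      while (p.nextToken() != JsonToken.END_ARRAY) elem = merge(elem, inferValue(p))
      ArrayT(elem)
    case JsonToken.START_OBJECT =>
      var fields = Map.empty[String, InferredType]
      while (p.nextToken() != JsonToken.END_OBJECT) {
        val name = p.getCurrentName // current token is FIELD_NAME
        p.nextToken()               // advance to the field's value
        fields += (name -> inferValue(p))
      }
      StructT(fields)
    case other => sys.error(s"unexpected token: $other")
  }

  // Infer the type of a single JSON record held in a string.
  def inferRecord(json: String): InferredType = {
    val p = factory.createParser(json)
    try { p.nextToken(); inferValue(p) } finally p.close()
  }
}
```

Under these assumptions, a schema for a whole dataset would be obtained by calling `inferRecord` on each line and folding the results together with `merge`.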
The size of the improvement depends on the schema and the queries being executed, but the new parser should be faster across the board. The benchmarks below use the last.fm Million Song dataset:
```
Command                                            | Baseline | Patched
---------------------------------------------------|----------|--------
import sqlContext.implicits._                      |          |
val df = sqlContext.jsonFile("/tmp/lastfm.json")   | 70.0s    | 14.6s
df.count()                                         | 28.8s    | 6.2s
df.rdd.count()                                     | 35.3s    | 21.5s
df.where($"artist" === "Robert Hood").collect()    | 28.3s    | 16.9s
```
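
For reference, the speedups implied by the table work out to roughly 4.8x, 4.6x, 1.6x, and 1.7x. The short sketch below just redoes that arithmetic; `BenchmarkSpeedups` is a hypothetical name, and the timings are copied from the table rather than measured again.

```scala
object BenchmarkSpeedups {
  // (command, baseline seconds, patched seconds), copied from the table above.
  val timings: Seq[(String, Double, Double)] = Seq(
    ("sqlContext.jsonFile(\"/tmp/lastfm.json\")", 70.0, 14.6),
    ("df.count()", 28.8, 6.2),
    ("df.rdd.count()", 35.3, 21.5),
    ("df.where($\"artist\" === \"Robert Hood\").collect()", 28.3, 16.9)
  )

  // Speedup is baseline time divided by patched time.
  val speedups: Seq[(String, Double)] =
    timings.map { case (cmd, baseline, patched) => cmd -> baseline / patched }

  def main(args: Array[String]): Unit =
    speedups.foreach { case (cmd, s) => println(f"$cmd%-50s $s%.1fx") }
}
```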
To prepare this dataset for benchmarking, follow these steps:
```
# Fetch the datasets from http://labrosa.ee.columbia.edu/millionsong/lastfm
wget http://labrosa.ee.columbia.edu/millionsong/sites/default/files/lastfm/lastfm_test.zip \
http://labrosa.ee.columbia.edu/millionsong/sites/default/files/lastfm/lastfm_train.zip
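# Decompress and combine, pipe through `jq -c` to ensure there is one record per line
# (unzip reads one archive per invocation; further arguments name members to extract)
for f in lastfm_test.zip lastfm_train.zip; do unzip -p "$f"; done | jq -c . > lastfm.json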
```
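
With `lastfm.json` prepared, the legacy and streaming parsers could be compared in a Spark shell roughly as follows. This is a sketch under stated assumptions: it relies on the `sqlContext` that `spark-shell` provides, it assumes the file was copied to `/tmp/lastfm.json` (the path used in the benchmark table), and it assumes the flag is read when the relation is created. Whether the legacy setting reproduces the Baseline column exactly is not stated here.

```scala
import sqlContext.implicits._

// Switch back to the legacy parser (the streaming parser is the default).
sqlContext.setConf("spark.sql.json.useJacksonStreamingAPI", "false")
val legacy = sqlContext.jsonFile("/tmp/lastfm.json")
legacy.where($"artist" === "Robert Hood").collect()

// Re-enable the Jackson streaming parser and repeat the same query.
sqlContext.setConf("spark.sql.json.useJacksonStreamingAPI", "true")
val streaming = sqlContext.jsonFile("/tmp/lastfm.json")
streaming.where($"artist" === "Robert Hood").collect()
```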
Author: Nathan Howell <nhowell@godaddy.com>
Closes #5801 from NathanHowell/json-performance and squashes the following commits:
26fea31 [Nathan Howell] Recreate the baseRDD for each scan operation
a7ebeb2 [Nathan Howell] Increase coverage of inserts into a JSONRelation
e06a1dd [Nathan Howell] Add comments to the `useJacksonStreamingAPI` config flag
6822712 [Nathan Howell] Split up JsonRDD2 into multiple objects
fa8234f [Nathan Howell] Wrap long lines
b31917b [Nathan Howell] Rename `useJsonRDD2` to `useJacksonStreamingAPI`
15c5d1b [Nathan Howell] JSONRelation's baseRDD need not be lazy
f8add6e [Nathan Howell] Add comments on lack of support for precision and scale DecimalTypes
fa0be47 [Nathan Howell] Remove unused default case in the field parser
80dba17 [Nathan Howell] Add comments regarding null handling and empty strings
842846d [Nathan Howell] Point the empty schema inference test at JsonRDD2
ab6ee87 [Nathan Howell] Add projection pushdown support to JsonRDD/JsonRDD2
f636c14 [Nathan Howell] Enable JsonRDD2 by default, add a flag to switch back to JsonRDD
0bbc445 [Nathan Howell] Improve JSON parsing and type inference performance
7ca70c1 [Nathan Howell] Eliminate arrow pattern, replace with pattern matches
(cherry picked from commit