spark-instrumented-optimizer/common
Nathan Howell 21fde57f15 [SPARK-18352][SQL] Support parsing multiline json files
## What changes were proposed in this pull request?

If the new option `wholeFile` is set to `true`, the JSON reader parses each file (instead of each line) as a single value. This is done with Jackson streaming, so it should be capable of parsing very large documents, assuming each row fits in memory.

Because the file is not buffered in memory, corrupt record handling is also slightly different when `wholeFile` is enabled: if parsing fails, the corrupt column contains the filename instead of the literal JSON. It would be easy to extend this to add the parser location (line, column, and byte offsets) to the output if desired.
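
The difference between the two modes can be sketched outside Spark. The snippet below is only an illustration in plain Python with the standard `json` module, not the code in this PR: the function names are hypothetical, `_corrupt_record` is borrowed from Spark's default corrupt-column name, and `json.load` reads the whole file into memory, whereas the PR streams it with Jackson.

```python
import json
import os
import tempfile


def read_line_delimited(path):
    """Default mode: every non-empty line is parsed as its own JSON value."""
    records = []
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if not line:
                continue
            try:
                records.append(json.loads(line))
            except json.JSONDecodeError:
                # In line mode the corrupt column holds the literal text.
                records.append({"_corrupt_record": line})
    return records


def read_whole_file(path):
    """Sketch of wholeFile=true: the entire file is parsed as one value."""
    try:
        with open(path, encoding="utf-8") as f:
            # Assumption: json.load buffers the file; the PR does not.
            return [json.load(f)]
    except json.JSONDecodeError:
        # The literal JSON is not kept around, so record the filename.
        return [{"_corrupt_record": path}]


def _write(directory, name, text):
    """Helper for the demo: write `text` to directory/name, return the path."""
    path = os.path.join(directory, name)
    with open(path, "w", encoding="utf-8") as f:
        f.write(text)
    return path


with tempfile.TemporaryDirectory() as tmp:
    pretty = _write(tmp, "pretty.json", '{\n  "id": 1,\n  "tags": ["a", "b"]\n}\n')
    broken = _write(tmp, "broken.json", '{\n  "id": 2,\n')

    whole_ok = read_whole_file(pretty)        # one parsed record
    line_mode = read_line_delimited(pretty)   # every line fails on its own
    whole_bad = read_whole_file(broken)       # corrupt column is the path
    broken_name = os.path.basename(whole_bad[0]["_corrupt_record"])
```

A pretty-printed document parses as one record in the whole-file sketch, while the line-delimited reader treats each of its lines as a separate corrupt record.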

These changes have allowed types other than `String` to be parsed. Support for `UTF8String` and `Text` has been added (alongside `String` and `InputFormat`), and these types no longer require a conversion to `String` just for parsing.

I've also included a few other changes that generate slightly better bytecode and (imo) make it more obvious when and where boxing is occurring in the parser. These are included as separate commits; let me know if they should be flattened into this PR or moved to a new one.

## How was this patch tested?

New and existing unit tests. No performance or load tests have been run.

Author: Nathan Howell <nhowell@godaddy.com>

Closes #16386 from NathanHowell/SPARK-18352.
2017-02-16 20:51:19 -08:00
network-common [SPARK-19550][BUILD][CORE][WIP] Remove Java 7 support 2017-02-16 12:32:45 +00:00
network-shuffle [SPARK-19550][BUILD][CORE][WIP] Remove Java 7 support 2017-02-16 12:32:45 +00:00
network-yarn [SPARK-19139][CORE] New auth mechanism for transport library. 2017-01-24 10:44:04 -08:00
sketch [SPARK-19550][BUILD][CORE][WIP] Remove Java 7 support 2017-02-16 12:32:45 +00:00
tags [SPARK-18993][BUILD] Unable to build/compile Spark in IntelliJ due to missing Scala deps in spark-tags 2016-12-28 12:17:33 +00:00
unsafe [SPARK-18352][SQL] Support parsing multiline json files 2017-02-16 20:51:19 -08:00