spark-instrumented-optimizer/python/test_support/sql
Maxim Gekk bd14da6fd5 [SPARK-23094][SPARK-23723][SPARK-23724][SQL] Support custom encoding for json files
## What changes were proposed in this pull request?

I propose a new option for the JSON datasource which allows specifying the encoding (charset) of input and output files. Here is an example of using the option:

```
spark.read.schema(schema)
  .option("multiline", "true")
  .option("encoding", "UTF-16LE")
  .json(fileName)
```

If the option is not specified, the charset auto-detection mechanism is used by default.

The option can also be used when saving datasets to JSON. Currently Spark can save datasets to JSON files in the `UTF-8` charset only; the changes allow saving data in any supported charset. Here is the approximate list of charsets supported by Oracle Java SE: https://docs.oracle.com/javase/8/docs/technotes/guides/intl/encoding.doc.html . A user can specify the charset of output JSON via the charset option, e.g. `.option("charset", "UTF-16BE")`. By default the output charset is still `UTF-8` to keep backward compatibility.
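For example, a minimal sketch of writing in a non-default charset (the `df` DataFrame and the output path are placeholders for illustration):

```
# Sketch: save a DataFrame as JSON in UTF-16BE instead of the default UTF-8.
# `df` and the output path are placeholders.
df.write \
    .option("charset", "UTF-16BE") \
    .json("/tmp/people_utf16be_json")
```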

The solution has the following restrictions for per-line mode (`multiline = false`):

- If the charset is different from UTF-8, the `lineSep` option must be specified (see the sketch after this list). The option is required because Hadoop's LineReader cannot detect the line separator correctly. Here is the ticket for solving the issue: https://issues.apache.org/jira/browse/SPARK-23725

- Encodings with a [BOM](https://en.wikipedia.org/wiki/Byte_order_mark) are not supported. For example, the `UTF-16` and `UTF-32` encodings are blacklisted. The problem can be solved by https://github.com/MaxGekk/spark-1/pull/2
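A minimal sketch of such a per-line read with an explicit `lineSep` (the `schema` variable and the file name are placeholders):

```
# Sketch: per-line (multiline = false) read in a non-UTF-8 charset;
# the explicit lineSep is mandatory here. File name is a placeholder.
spark.read.schema(schema) \
    .option("encoding", "UTF-16LE") \
    .option("lineSep", "\n") \
    .json("people_utf16le.json")
```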

## How was this patch tested?

I added the following tests:
- reading a JSON file in `UTF-16LE` encoding with a BOM in `multiline` mode
- reading a JSON file using charset auto-detection (`UTF-32BE` with a BOM)
- reading a JSON file using a user-specified charset (`UTF-16LE`)
- saving in `UTF-32BE` and reading the result with the standard library (not Spark), as sketched below
- checking that the default charset is `UTF-8`
- handling a wrong (unsupported) charset
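A rough sketch of the save-and-verify test above (the `spark` session, path, and data are placeholders):

```
import glob, json

# Sketch: save in UTF-32BE via Spark, then verify the result with Python's
# standard library instead of Spark. Path and data are placeholders.
path = "/tmp/utf32be_json"
spark.createDataFrame([("Alice", 30)], ["name", "age"]) \
    .write.option("charset", "UTF-32BE").json(path)

for part in glob.glob(path + "/part-*"):
    with open(part, encoding="UTF-32BE") as f:
        for line in f:
            print(json.loads(line))
```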

Author: Maxim Gekk <maxim.gekk@databricks.com>
Author: Maxim Gekk <max.gekk@gmail.com>

Closes #20937 from MaxGekk/json-encoding-line-sep.
2018-04-29 11:25:31 +08:00
| Name | Last commit | Date |
| --- | --- | --- |
| orc_partitioned | [SPARK-10716] [BUILD] spark-1.5.0-bin-hadoop2.6.tgz file doesn't uncompress on OS X due to hidden file | 2015-09-21 23:29:59 -07:00 |
| parquet_partitioned | [SPARK-8060] Improve DataFrame Python test coverage and documentation. | 2015-06-03 00:23:34 -07:00 |
| streaming | [SPARK-14555] First cut of Python API for Structured Streaming | 2016-04-20 10:32:01 -07:00 |
| ages.csv | [SPARK-13509][SPARK-13507][SQL] Support for writing CSV with a single function call | 2016-02-29 09:44:29 -08:00 |
| ages_newlines.csv | [SPARK-19610][SQL] Support parsing multiline CSV files | 2017-02-28 13:34:33 -08:00 |
| people.json | [SPARK-8060] Improve DataFrame Python test coverage and documentation. | 2015-06-03 00:23:34 -07:00 |
| people1.json | [SPARK-10185] [SQL] Feat sql comma separated paths | 2015-10-17 14:56:24 -07:00 |
| people_array.json | [SPARK-18352][SQL] Support parsing multiline json files | 2017-02-16 20:51:19 -08:00 |
| people_array_utf16le.json | [SPARK-23094][SPARK-23723][SPARK-23724][SQL] Support custom encoding for json files | 2018-04-29 11:25:31 +08:00 |
| text-test.txt | [SPARK-11292] [SQL] Python API for text data source | 2015-10-28 14:28:38 -07:00 |