spark-instrumented-optimizer/python/test_support/sql
Maxim Gekk bd14da6fd5 [SPARK-23094][SPARK-23723][SPARK-23724][SQL] Support custom encoding for json files
## What changes were proposed in this pull request?

I propose a new option for the JSON datasource which allows specifying the encoding (charset) of input and output files. Here is an example of using the option:

```
spark.read.schema(schema)
  .option("multiline", "true")
  .option("encoding", "UTF-16LE")
  .json(fileName)
```

If the option is not specified, the charset auto-detection mechanism is used by default.

The option can also be used when saving datasets to JSON. Currently Spark can save datasets to JSON files in the `UTF-8` charset only; the changes allow saving data in any supported charset. Here is the approximate list of charsets supported by Oracle Java SE: https://docs.oracle.com/javase/8/docs/technotes/guides/intl/encoding.doc.html . A user can specify the charset of output JSON via the charset option, e.g. `.option("charset", "UTF-16BE")`. By default the output charset is still `UTF-8` to keep backward compatibility.
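For example, a minimal sketch of writing in a non-default charset (the `df` DataFrame and the output path are placeholders for illustration):

```
# Sketch: save a DataFrame as JSON in UTF-16BE instead of the default UTF-8.
# `df` and the output path are placeholders.
df.write \
    .option("charset", "UTF-16BE") \
    .json("/tmp/people_utf16be_json")
```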

The solution has the following restrictions for per-line mode (`multiline = false`):

- If the charset is different from UTF-8, the `lineSep` option must be specified (see the sketch after this list). The option is required because Hadoop's LineReader cannot detect the line separator correctly. Here is the ticket for solving the issue: https://issues.apache.org/jira/browse/SPARK-23725

- Encodings with a [BOM](https://en.wikipedia.org/wiki/Byte_order_mark) are not supported. For example, the `UTF-16` and `UTF-32` encodings are blacklisted. The problem can be solved by https://github.com/MaxGekk/spark-1/pull/2
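A minimal sketch of such a per-line read with an explicit `lineSep` (the `schema` variable and the file name are placeholders):

```
# Sketch: per-line (multiline = false) read in a non-UTF-8 charset;
# the explicit lineSep is mandatory here. File name is a placeholder.
spark.read.schema(schema) \
    .option("encoding", "UTF-16LE") \
    .option("lineSep", "\n") \
    .json("people_utf16le.json")
```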

## How was this patch tested?

I added the following tests:
- reading a JSON file in `UTF-16LE` encoding with a BOM in `multiline` mode
- reading a JSON file using charset auto-detection (`UTF-32BE` with a BOM)
- reading a JSON file using a user-specified charset (`UTF-16LE`)
- saving in `UTF-32BE` and reading the result with the standard library (not Spark), as sketched below
- checking that the default charset is `UTF-8`
- handling a wrong (unsupported) charset
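A rough sketch of the save-and-verify test above (the `spark` session, path, and data are placeholders):

```
import glob, json

# Sketch: save in UTF-32BE via Spark, then verify the result with Python's
# standard library instead of Spark. Path and data are placeholders.
path = "/tmp/utf32be_json"
spark.createDataFrame([("Alice", 30)], ["name", "age"]) \
    .write.option("charset", "UTF-32BE").json(path)

for part in glob.glob(path + "/part-*"):
    with open(part, encoding="UTF-32BE") as f:
        for line in f:
            print(json.loads(line))
```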

Author: Maxim Gekk <maxim.gekk@databricks.com>
Author: Maxim Gekk <max.gekk@gmail.com>

Closes #20937 from MaxGekk/json-encoding-line-sep.
2018-04-29 11:25:31 +08:00
| Name | Last commit | Date |
| --- | --- | --- |
| orc_partitioned | [SPARK-10716] [BUILD] spark-1.5.0-bin-hadoop2.6.tgz file doesn't uncompress on OS X due to hidden file | 2015-09-21 23:29:59 -07:00 |
| parquet_partitioned | [SPARK-8060] Improve DataFrame Python test coverage and documentation. | 2015-06-03 00:23:34 -07:00 |
| streaming | [SPARK-14555] First cut of Python API for Structured Streaming | 2016-04-20 10:32:01 -07:00 |
| ages.csv | [SPARK-13509][SPARK-13507][SQL] Support for writing CSV with a single function call | 2016-02-29 09:44:29 -08:00 |
| ages_newlines.csv | [SPARK-19610][SQL] Support parsing multiline CSV files | 2017-02-28 13:34:33 -08:00 |
| people.json | [SPARK-8060] Improve DataFrame Python test coverage and documentation. | 2015-06-03 00:23:34 -07:00 |
| people1.json | [SPARK-10185] [SQL] Feat sql comma separated paths | 2015-10-17 14:56:24 -07:00 |
| people_array.json | [SPARK-18352][SQL] Support parsing multiline json files | 2017-02-16 20:51:19 -08:00 |
| people_array_utf16le.json | [SPARK-23094][SPARK-23723][SPARK-23724][SQL] Support custom encoding for json files | 2018-04-29 11:25:31 +08:00 |
| text-test.txt | [SPARK-11292] [SQL] Python API for text data source | 2015-10-28 14:28:38 -07:00 |