spark-instrumented-optimizer/python/pyspark/sql
Maxim Gekk bd14da6fd5 [SPARK-23094][SPARK-23723][SPARK-23724][SQL] Support custom encoding for json files
## What changes were proposed in this pull request?

I propose a new option for the JSON datasource that allows specifying the encoding (charset) of input and output files. Here is an example of using the option:

```
spark.read.schema(schema)
  .option("multiline", "true")
  .option("encoding", "UTF-16LE")
  .json(fileName)
```

If the option is not specified, the charset auto-detection mechanism is used by default.

The option can also be used when saving datasets to JSON. Currently Spark can save datasets to JSON files only in the `UTF-8` charset; the changes allow saving data in any supported charset. Here is the approximate list of charsets supported by Oracle Java SE: https://docs.oracle.com/javase/8/docs/technotes/guides/intl/encoding.doc.html . A user can specify the charset of the output JSON via the charset option, e.g. `.option("charset", "UTF-16BE")`. By default the output charset is still `UTF-8`, to keep backward compatibility.
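For illustration, a minimal PySpark sketch of the write path (the output path is hypothetical, and an active SparkSession `spark` is assumed):

```
# Minimal sketch: write JSON in a non-default charset. Assumes an active
# SparkSession `spark`; the output path is hypothetical.
df = spark.createDataFrame([("Alice", 1), ("Bob", 2)], ["name", "id"])

# Save in UTF-16BE instead of the default UTF-8.
df.write.option("encoding", "UTF-16BE").json("/tmp/people_utf16be")
```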

The solution has the following restrictions in per-line mode (`multiline = false`):

- If the charset differs from UTF-8, the `lineSep` option must be specified (see the sketch after this list). The option is required because Hadoop LineReader cannot detect the line separator correctly. Here is the ticket for solving the issue: https://issues.apache.org/jira/browse/SPARK-23725

- Encodings with a [BOM](https://en.wikipedia.org/wiki/Byte_order_mark) are not supported. For example, the `UTF-16` and `UTF-32` encodings are blacklisted. The problem can be solved by https://github.com/MaxGekk/spark-1/pull/2
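As a sketch of the first restriction (hypothetical input path and schema, active SparkSession `spark` assumed), reading a non-UTF-8 file in per-line mode requires an explicit `lineSep`:

```
# Per-line mode (multiline defaults to false): with a charset other than
# UTF-8, lineSep has to be given explicitly because Hadoop LineReader
# cannot detect it. Path and schema are hypothetical.
df = spark.read.schema("name STRING, id INT") \
    .option("encoding", "UTF-16LE") \
    .option("lineSep", "\n") \
    .json("/tmp/logs_utf16le")
```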

## How was this patch tested?

I added the following tests:
- reading a JSON file in the `UTF-16LE` encoding with a BOM in `multiline` mode
- reading a JSON file using charset auto-detection (`UTF-32BE` with a BOM)
- reading a JSON file using a user-specified charset (`UTF-16LE`)
- saving in `UTF-32BE` and reading the result back with the standard library, not with Spark (see the round-trip sketch after this list)
- checking that the default charset is `UTF-8`
- handling a wrong (unsupported) charset
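A sketch of the save/read-back round trip from the list above (paths are hypothetical; an active SparkSession `spark` is assumed):

```
import glob
import json

# Write a single part file in UTF-32BE via Spark.
df = spark.createDataFrame([("a", 1)], ["k", "v"])
df.coalesce(1).write.mode("overwrite") \
    .option("encoding", "UTF-32BE").json("/tmp/roundtrip_utf32be")

# Read it back with the standard library (not with Spark) and check it.
part = glob.glob("/tmp/roundtrip_utf32be/part-*")[0]
with open(part, "rb") as f:
    lines = f.read().decode("UTF-32BE").splitlines()
assert json.loads(lines[0]) == {"k": "a", "v": 1}
```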

Author: Maxim Gekk <maxim.gekk@databricks.com>
Author: Maxim Gekk <max.gekk@gmail.com>

Closes #20937 from MaxGekk/json-encoding-line-sep.
2018-04-29 11:25:31 +08:00
__init__.py [SPARK-22369][PYTHON][DOCS] Exposes catalog API documentation in PySpark 2017-11-02 15:22:52 +01:00
catalog.py [SPARK-23522][PYTHON] always use sys.exit over builtin exit 2018-03-08 20:38:34 +09:00
column.py [SPARK-23847][PYTHON][SQL] Add asc_nulls_first, asc_nulls_last to PySpark 2018-04-08 12:09:06 +08:00
conf.py [SPARK-23700][PYTHON] Cleanup imports in pyspark.sql 2018-03-26 12:42:32 +09:00
context.py [SPARK-23706][PYTHON] spark.conf.get(value, default=None) should produce None in PySpark 2018-03-18 20:24:14 +09:00
dataframe.py [SPARK-23699][PYTHON][SQL] Raise same type of error caught with Arrow enabled 2018-03-27 20:06:12 -07:00
functions.py [SPARK-23916][SQL] Add array_join function 2018-04-26 13:37:13 +09:00
group.py [SPARK-23700][PYTHON] Cleanup imports in pyspark.sql 2018-03-26 12:42:32 +09:00
readwriter.py [SPARK-23094][SPARK-23723][SPARK-23724][SQL] Support custom encoding for json files 2018-04-29 11:25:31 +08:00
session.py [SPARK-23699][PYTHON][SQL] Raise same type of error caught with Arrow enabled 2018-03-27 20:06:12 -07:00
streaming.py [SPARK-23765][SQL] Supports custom line separator for json datasource 2018-03-28 19:49:27 +08:00
tests.py [SPARK-23094][SPARK-23723][SPARK-23724][SQL] Support custom encoding for json files 2018-04-29 11:25:31 +08:00
types.py [SPARK-24057][PYTHON] put the real data type in the AssertionError message 2018-04-26 14:21:22 -07:00
udf.py [SPARK-23700][PYTHON] Cleanup imports in pyspark.sql 2018-03-26 12:42:32 +09:00
utils.py [SPARK-23699][PYTHON][SQL] Raise same type of error caught with Arrow enabled 2018-03-27 20:06:12 -07:00
window.py [SPARK-23861][SQL][DOC] Clarify default window frame with and without orderBy clause 2018-04-07 00:15:54 +08:00