spark-instrumented-optimizer/common
Maxim Gekk 5e7bc2acef [SPARK-23649][SQL] Skipping chars disallowed in UTF-8
## What changes were proposed in this pull request?

The mapping of UTF-8 char's first byte to char's size doesn't cover whole range 0-255. It is defined only for 0-253:
https://github.com/apache/spark/blob/master/common/unsafe/src/main/java/org/apache/spark/unsafe/types/UTF8String.java#L60-L65
https://github.com/apache/spark/blob/master/common/unsafe/src/main/java/org/apache/spark/unsafe/types/UTF8String.java#L190

If the first byte of a char is 253-255, IndexOutOfBoundsException is thrown. Besides of that values for 244-252 are not correct according to recent unicode standard for UTF-8: http://www.unicode.org/versions/Unicode10.0.0/UnicodeStandard-10.0.pdf

As a consequence of the exception above, the length of input string in UTF-8 encoding cannot be calculated if the string contains chars started from 253 code. It is visible on user's side as for example crashing of schema inferring of csv file which contains such chars but the file can be read if the schema is specified explicitly or if the mode set to multiline.

The proposed changes build correct mapping of first byte of UTF-8 char to its size (now it covers all cases) and skip disallowed chars (counts it as one octet).

## How was this patch tested?

Added a test and a file with a char which is disallowed in UTF-8 - 0xFF.

Author: Maxim Gekk <maxim.gekk@databricks.com>

Closes #20796 from MaxGekk/skip-wrong-utf8-chars.
2018-03-20 10:34:56 -07:00
..
kvstore [SPARK-23103][CORE] Ensure correct sort order for negative values in LevelDB. 2018-01-19 13:32:20 -06:00
network-common [SPARK-23028] Bump master branch version to 2.4.0-SNAPSHOT 2018-01-13 00:37:59 +08:00
network-shuffle [SPARK-23289][CORE] OneForOneBlockFetcher.DownloadCallback.onData should write the buffer fully 2018-02-01 21:00:47 +08:00
network-yarn [SPARK-23028] Bump master branch version to 2.4.0-SNAPSHOT 2018-01-13 00:37:59 +08:00
sketch [SPARK-23381][CORE] Murmur3 hash generates a different value from other implementations 2018-02-16 17:17:55 -08:00
tags [SPARK-23028] Bump master branch version to 2.4.0-SNAPSHOT 2018-01-13 00:37:59 +08:00
unsafe [SPARK-23649][SQL] Skipping chars disallowed in UTF-8 2018-03-20 10:34:56 -07:00