8c6748f691
### What changes were proposed in this pull request?

Unsigned types may be used to produce smaller in-memory representations of the data. These types are used by frameworks (e.g. Hive, Pig) that use Parquet, and Parquet maps them to its base types.

See:
https://github.com/apache/parquet-format/blob/master/LogicalTypes.md
https://github.com/apache/parquet-format/blob/master/src/main/thrift/parquet.thrift

```thrift
/**
 * An unsigned integer value.
 *
 * The number describes the maximum number of meaningful data bits in
 * the stored value. 8, 16 and 32 bit values are stored using the
 * INT32 physical type. 64 bit values are stored using the INT64
 * physical type.
 *
 */
UINT_8 = 11;
UINT_16 = 12;
UINT_32 = 13;
UINT_64 = 14;
```

```
UInt8-[0:255]
UInt16-[0:65535]
UInt32-[0:4294967295]
UInt64-[0:18446744073709551615]
```

In this PR, we support reading UINT_8 as ShortType, UINT_16 as IntegerType, and UINT_32 as LongType so that each target type covers the full unsigned range. Support for UINT_64 will be in another PR.

### Why are the changes needed?

Better Parquet support.

### Does this PR introduce _any_ user-facing change?

Yes, we can read uint[8/16/32] from Parquet files.

### How was this patch tested?

New tests.

Closes #31921 from yaooqinn/SPARK-34817.

Authored-by: Kent Yao <yao@apache.org>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
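The type choices above follow from the ranges: each unsigned value needs the next wider signed type to be represented without overflow. The following is a minimal standalone sketch of that widening, not the code from this patch. The class and method names (`UnsignedWidening`, `uint8ToShort`, `uint16ToInt`, `uint32ToLong`) are hypothetical, and it assumes a UINT_32 value above `Integer.MAX_VALUE` arrives as the same 32 bits reinterpreted as a negative signed `int`.

```java
public class UnsignedWidening {
    // UINT_8 is stored in a Parquet INT32. Its range [0, 255] exceeds a signed
    // byte, so the next wider type (short) is needed. The mask keeps only the
    // 8 meaningful bits and is a no-op for well-formed values.
    static short uint8ToShort(int stored) {
        return (short) (stored & 0xFF);
    }

    // UINT_16 covers [0, 65535], which exceeds a signed short, so use int.
    static int uint16ToInt(int stored) {
        return stored & 0xFFFF;
    }

    // UINT_32 covers [0, 4294967295]. Assumption: values above
    // Integer.MAX_VALUE show up as negative ints, so reinterpret the bits
    // as unsigned while widening to long.
    static long uint32ToLong(int stored) {
        return Integer.toUnsignedLong(stored);
    }

    public static void main(String[] args) {
        System.out.println(uint8ToShort(255));
        System.out.println(uint16ToInt(65535));
        System.out.println(uint32ToLong(-1)); // bit pattern 0xFFFFFFFF
    }
}
```

UINT_64 has no wider signed primitive to widen into, which is consistent with its support being deferred to a separate PR.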
||
---|---|---|
benchmarks | ||
src | ||
pom.xml |