---
layout: global
title: ORC Files
displayTitle: ORC Files
license: |
  Licensed to the Apache Software Foundation (ASF) under one or more
  contributor license agreements.  See the NOTICE file distributed with
  this work for additional information regarding copyright ownership.
  The ASF licenses this file to You under the Apache License, Version 2.0
  (the "License"); you may not use this file except in compliance with
  the License.  You may obtain a copy of the License at

     http://www.apache.org/licenses/LICENSE-2.0

  Unless required by applicable law or agreed to in writing, software
  distributed under the License is distributed on an "AS IS" BASIS,
  WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
  See the License for the specific language governing permissions and
  limitations under the License.
---

* Table of contents
{:toc}

[Apache ORC](https://orc.apache.org) is a columnar format with advanced features such as native Zstandard compression, Bloom filters, and columnar encryption.

### ORC Implementation

Spark supports two ORC implementations (`native` and `hive`), controlled by `spark.sql.orc.impl`. The two implementations share most functionality but have different design goals.

- The `native` implementation is designed to follow Spark's data source behavior like `Parquet`.
- The `hive` implementation is designed to follow Hive's behavior and uses Hive SerDe.

For example, historically, the `native` implementation handled `CHAR`/`VARCHAR` with Spark's native `String` type while the `hive` implementation handled it via Hive `CHAR`/`VARCHAR`, so query results could differ. Since Spark 3.1.0, [SPARK-33480](https://issues.apache.org/jira/browse/SPARK-33480) removes this difference by supporting `CHAR`/`VARCHAR` on the Spark side.

### Vectorized Reader

The `native` implementation supports a vectorized ORC reader and has been the default ORC implementation since Spark 2.3.
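As a minimal illustration, the implementation can be switched per session through SQL configuration (both property names are documented in the configuration table later on this page):

```sql
-- Select the Spark-native ORC implementation (this is already the default)
SET spark.sql.orc.impl=native;

-- Or fall back to the Hive-based implementation
SET spark.sql.orc.impl=hive;
```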
The vectorized reader is used for native ORC tables (e.g., those created using the clause `USING ORC`) when `spark.sql.orc.impl` is set to `native` and `spark.sql.orc.enableVectorizedReader` is set to `true`.

For nested data types (array, map, and struct), the vectorized reader is disabled by default. Set `spark.sql.orc.enableNestedColumnVectorizedReader` to `true` to enable the vectorized reader for these types.

For Hive ORC serde tables (e.g., those created using the clause `USING HIVE OPTIONS (fileFormat 'ORC')`), the vectorized reader is used when `spark.sql.hive.convertMetastoreOrc` is also set to `true`; it is turned on by default.

### Schema Merging

Like Protocol Buffers, Avro, and Thrift, ORC also supports schema evolution. Users can start with a simple schema and gradually add more columns to it as needed. In this way, users may end up with multiple ORC files with different but mutually compatible schemas. The ORC data source is able to automatically detect this case and merge the schemas of all these files.

Since schema merging is a relatively expensive operation, and is not a necessity in most cases, it is turned off by default. You may enable it by

1. setting the data source option `mergeSchema` to `true` when reading ORC files, or
2. setting the global SQL option `spark.sql.orc.mergeSchema` to `true`.

### Zstandard

Spark supports both Hadoop 2 and 3. Since Spark 3.2, you can take advantage of Zstandard compression in ORC files on both Hadoop versions. See [Zstandard](https://facebook.github.io/zstd/) for its benefits.
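As a sketch, the options described above can be combined in a SQL session (the path `/tmp/orc_data` is hypothetical):

```sql
-- Turn on the vectorized reader for nested types (off by default)
SET spark.sql.orc.enableNestedColumnVectorizedReader=true;

-- Enable schema merging globally for ORC reads
SET spark.sql.orc.mergeSchema=true;

-- Write ORC files with Zstandard compression (Spark 3.2+)
SET spark.sql.orc.compression.codec=zstd;

-- Query ORC files directly; the schemas of all part-files are merged
SELECT * FROM orc.`/tmp/orc_data`;
```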
### Configuration

Property Name | Default | Meaning | Since Version
---|---|---|---
`spark.sql.orc.impl` | `native` | The name of the ORC implementation. It can be one of `native` and `hive`. `native` means the native ORC support. `hive` means the ORC library in Hive. | 2.3.0
`spark.sql.orc.enableVectorizedReader` | `true` | Enables vectorized ORC decoding in the `native` implementation. If `false`, a new non-vectorized ORC reader is used in the `native` implementation. For the `hive` implementation, this is ignored. | 2.3.0
`spark.sql.orc.enableNestedColumnVectorizedReader` | `false` | Enables vectorized ORC decoding in the `native` implementation for nested data types (array, map and struct). If `spark.sql.orc.enableVectorizedReader` is set to `false`, this is ignored. | 3.2.0
`spark.sql.orc.mergeSchema` | `false` | When `true`, the ORC data source merges schemas collected from all data files; otherwise the schema is picked from a random data file. | 3.0.0
`spark.sql.hive.convertMetastoreOrc` | `true` | When set to `false`, Spark SQL will use the Hive SerDe for ORC tables instead of the built-in support. | 2.0.0
### Data Source Option

Property Name | Default | Meaning | Scope
---|---|---|---
`mergeSchema` | `false` | Sets whether we should merge schemas collected from all ORC part-files. This will override `spark.sql.orc.mergeSchema`. The default value is specified in `spark.sql.orc.mergeSchema`. | read
`compression` | `snappy` | Compression codec to use when saving to file. This can be one of the known case-insensitive shortened names (`none`, `snappy`, `zlib`, `lzo`, `zstd` and `lz4`). This will override `orc.compress` and `spark.sql.orc.compression.codec`. | write
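For example, these per-operation options can be supplied through an `OPTIONS` clause in SQL (the table names and the path `/tmp/orc_data` here are hypothetical):

```sql
-- Write an ORC table compressed with zstd; the 'compression' option
-- overrides both orc.compress and spark.sql.orc.compression.codec
CREATE TABLE zstd_table USING ORC OPTIONS (compression 'zstd')
AS SELECT 1 AS id;

-- Read ORC part-files with schema merging enabled for this table only,
-- overriding spark.sql.orc.mergeSchema
CREATE TABLE merged_table USING ORC
OPTIONS (path '/tmp/orc_data', mergeSchema 'true');
```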