[SPARK-35965][DOCS] Add doc for ORC nested column vectorized reader
### What changes were proposed in this pull request?

In https://issues.apache.org/jira/browse/SPARK-34862, we added support for the ORC nested column vectorized reader, which is disabled by default for now. This PR adds user-facing documentation for it, so that users can opt in if they want.

### Why are the changes needed?

To make users aware of the feature and to tell them how to enable it.

### Does this PR introduce _any_ user-facing change?

Yes, the documentation itself.

### How was this patch tested?

Manually checked the generated documentation, as shown below.

<img width="1153" alt="Screen Shot 2021-07-01 at 12 19 40 AM" src="https://user-images.githubusercontent.com/4629931/124083422-b0724280-da02-11eb-93aa-a25d118ba56e.png">
<img width="1147" alt="Screen Shot 2021-07-01 at 12 19 52 AM" src="https://user-images.githubusercontent.com/4629931/124083442-b5cf8d00-da02-11eb-899f-827d55b8558d.png">

Closes #33168 from c21/orc-doc.

Authored-by: Cheng Su <chengsu@fb.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
parent 0c34b96541
commit 3c3193c0fc
@@ -37,6 +37,8 @@ For example, historically, `native` implementation handles `CHAR/VARCHAR` with S
 
 `native` implementation supports a vectorized ORC reader and has been the default ORC implementaion since Spark 2.3.
 The vectorized reader is used for the native ORC tables (e.g., the ones created using the clause `USING ORC`) when `spark.sql.orc.impl` is set to `native` and `spark.sql.orc.enableVectorizedReader` is set to `true`.
+For nested data types (array, map and struct), vectorized reader is disabled by default. Set `spark.sql.orc.enableNestedColumnVectorizedReader` to `true` to enable vectorized reader for these types.
+
 For the Hive ORC serde tables (e.g., the ones created using the clause `USING HIVE OPTIONS (fileFormat 'ORC')`),
 the vectorized reader is used when `spark.sql.hive.convertMetastoreOrc` is also set to `true`, and is turned on by default.
 
@@ -151,6 +153,16 @@ When reading from Hive metastore ORC tables and inserting to Hive metastore ORC
 </td>
 <td>2.3.0</td>
 </tr>
+<tr>
+<td><code>spark.sql.orc.enableNestedColumnVectorizedReader</code></td>
+<td><code>false</code></td>
+<td>
+Enables vectorized orc decoding in <code>native</code> implementation for nested data types
+(array, map and struct). If <code>spark.sql.orc.enableVectorizedReader</code> is set to
+<code>false</code>, this is ignored.
+</td>
+<td>3.2.0</td>
+</tr>
 <tr>
 <td><code>spark.sql.orc.mergeSchema</code></td>
 <td>false</td>
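Note on how the documented settings interact: the rules in the added text can be read as a small decision function. The sketch below is only a model of that wording in plain Python, not Spark's implementation. The function name `uses_vectorized_orc_reader` and its two boolean parameters are hypothetical, and the default of `true` for `spark.sql.orc.enableVectorizedReader` is an assumption, since that default does not appear in this diff.

```python
from typing import Mapping

NESTED_KEY = "spark.sql.orc.enableNestedColumnVectorizedReader"

# Defaults as described in the documentation text of this commit.
# The enableVectorizedReader default is NOT shown in the diff; "true" is assumed.
DEFAULTS = {
    "spark.sql.orc.impl": "native",
    "spark.sql.orc.enableVectorizedReader": "true",  # assumption
    NESTED_KEY: "false",
    "spark.sql.hive.convertMetastoreOrc": "true",
}


def uses_vectorized_orc_reader(
    conf: Mapping[str, str],
    hive_serde_table: bool = False,   # hypothetical parameter
    has_nested_columns: bool = False,  # hypothetical parameter
) -> bool:
    """Model of the documented rules; not Spark's actual code path."""
    merged = {**DEFAULTS, **conf}

    def enabled(key: str) -> bool:
        return merged[key].lower() == "true"

    # The vectorized reader belongs to the native implementation only.
    if merged["spark.sql.orc.impl"] != "native":
        return False
    # If the main switch is off, the nested-column switch is ignored.
    if not enabled("spark.sql.orc.enableVectorizedReader"):
        return False
    # Hive ORC serde tables additionally need metastore conversion.
    if hive_serde_table and not enabled("spark.sql.hive.convertMetastoreOrc"):
        return False
    # Nested types (array, map, struct) are opt-in.
    if has_nested_columns:
        return enabled(NESTED_KEY)
    return True
```

Under this reading, a table with struct columns is read by the vectorized reader only after the nested switch is turned on, for example `uses_vectorized_orc_reader({NESTED_KEY: "true"}, has_nested_columns=True)`. In a real session the same key would be set through the usual configuration mechanisms, such as `spark.conf.set(...)` or a SQL `SET` statement.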