[SPARK-35965][DOCS] Add doc for ORC nested column vectorized reader

### What changes were proposed in this pull request?

In https://issues.apache.org/jira/browse/SPARK-34862, we added support for the ORC nested column vectorized reader, which is disabled by default for now. This PR adds user-facing documentation for it so that users can opt in if they want.
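
As a rough illustration of the opt-in, a session could set the documented configurations as sketched below. This is a minimal sketch rather than part of the patch. The first two settings are shown only to make the preconditions described in the documentation explicit, and are assumed to already hold in a default setup.

```sql
-- Preconditions from the documentation, set explicitly here for clarity
-- (assumed to already be the defaults).
SET spark.sql.orc.impl=native;
SET spark.sql.orc.enableVectorizedReader=true;

-- Opt in to the vectorized reader for nested types (array, map and struct).
-- This setting is ignored if spark.sql.orc.enableVectorizedReader is false.
SET spark.sql.orc.enableNestedColumnVectorizedReader=true;
```

The same keys could presumably be supplied through any other mechanism Spark offers for SQL configurations; the `SET` form is used here only because it is the shortest to show.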

### Why are the changes needed?

To make users aware of the feature and to explain how to enable it.

### Does this PR introduce _any_ user-facing change?

Yes, the documentation itself.

### How was this patch tested?

Manually checked the generated documentation, as shown below.

<img width="1153" alt="Screen Shot 2021-07-01 at 12 19 40 AM" src="https://user-images.githubusercontent.com/4629931/124083422-b0724280-da02-11eb-93aa-a25d118ba56e.png">

<img width="1147" alt="Screen Shot 2021-07-01 at 12 19 52 AM" src="https://user-images.githubusercontent.com/4629931/124083442-b5cf8d00-da02-11eb-899f-827d55b8558d.png">

Closes #33168 from c21/orc-doc.

Authored-by: Cheng Su <chengsu@fb.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
Cheng Su 2021-07-01 19:01:35 +09:00 committed by Hyukjin Kwon
parent 0c34b96541
commit 3c3193c0fc

@@ -37,6 +37,8 @@ For example, historically, `native` implementation handles `CHAR/VARCHAR` with S
`native` implementation supports a vectorized ORC reader and has been the default ORC implementaion since Spark 2.3.
The vectorized reader is used for the native ORC tables (e.g., the ones created using the clause `USING ORC`) when `spark.sql.orc.impl` is set to `native` and `spark.sql.orc.enableVectorizedReader` is set to `true`.
For nested data types (array, map and struct), vectorized reader is disabled by default. Set `spark.sql.orc.enableNestedColumnVectorizedReader` to `true` to enable vectorized reader for these types.
For the Hive ORC serde tables (e.g., the ones created using the clause `USING HIVE OPTIONS (fileFormat 'ORC')`),
the vectorized reader is used when `spark.sql.hive.convertMetastoreOrc` is also set to `true`, and is turned on by default.
@@ -151,6 +153,16 @@ When reading from Hive metastore ORC tables and inserting to Hive metastore ORC
</td>
<td>2.3.0</td>
</tr>
<tr>
<td><code>spark.sql.orc.enableNestedColumnVectorizedReader</code></td>
<td><code>false</code></td>
<td>
Enables vectorized orc decoding in <code>native</code> implementation for nested data types
(array, map and struct). If <code>spark.sql.orc.enableVectorizedReader</code> is set to
<code>false</code>, this is ignored.
</td>
<td>3.2.0</td>
</tr>
<tr>
<td><code>spark.sql.orc.mergeSchema</code></td>
<td>false</td>