02a0cdea13
### What changes were proposed in this pull request? This PR upgrade the built-in Hive to 2.3.6 for `hadoop-3.2`. Hive 2.3.6 release notes: - [HIVE-22096](https://issues.apache.org/jira/browse/HIVE-22096): Backport [HIVE-21584](https://issues.apache.org/jira/browse/HIVE-21584) (Java 11 preparation: system class loader is not URLClassLoader) - [HIVE-21859](https://issues.apache.org/jira/browse/HIVE-21859): Backport [HIVE-17466](https://issues.apache.org/jira/browse/HIVE-17466) (Metastore API to list unique partition-key-value combinations) - [HIVE-21786](https://issues.apache.org/jira/browse/HIVE-21786): Update repo URLs in poms branch 2.3 version ### Why are the changes needed? Make Spark support JDK 11. ### Does this PR introduce any user-facing change? Yes. Please see [SPARK-28684](https://issues.apache.org/jira/browse/SPARK-28684) and [SPARK-24417](https://issues.apache.org/jira/browse/SPARK-24417) for more details. ### How was this patch tested? Existing unit test and manual test. Closes #25443 from wangyum/test-on-jenkins. Lead-authored-by: Yuming Wang <yumwang@ebay.com> Co-authored-by: HyukjinKwon <gurwls223@apache.org> Co-authored-by: Hyukjin Kwon <gurwls223@gmail.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
182 lines
8.5 KiB
Markdown
182 lines
8.5 KiB
Markdown
---
|
|
layout: global
|
|
title: Hive Tables
|
|
displayTitle: Hive Tables
|
|
license: |
|
|
Licensed to the Apache Software Foundation (ASF) under one or more
|
|
contributor license agreements. See the NOTICE file distributed with
|
|
this work for additional information regarding copyright ownership.
|
|
The ASF licenses this file to You under the Apache License, Version 2.0
|
|
(the "License"); you may not use this file except in compliance with
|
|
the License. You may obtain a copy of the License at
|
|
|
|
http://www.apache.org/licenses/LICENSE-2.0
|
|
|
|
Unless required by applicable law or agreed to in writing, software
|
|
distributed under the License is distributed on an "AS IS" BASIS,
|
|
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
|
|
See the License for the specific language governing permissions and
|
|
limitations under the License.
|
|
---
|
|
|
|
* Table of contents
|
|
{:toc}
|
|
|
|
Spark SQL also supports reading and writing data stored in [Apache Hive](http://hive.apache.org/).
|
|
However, since Hive has a large number of dependencies, these dependencies are not included in the
|
|
default Spark distribution. If Hive dependencies can be found on the classpath, Spark will load them
|
|
automatically. Note that these Hive dependencies must also be present on all of the worker nodes, as
|
|
they will need access to the Hive serialization and deserialization libraries (SerDes) in order to
|
|
access data stored in Hive.
|
|
|
|
Configuration of Hive is done by placing your `hive-site.xml`, `core-site.xml` (for security configuration),
|
|
and `hdfs-site.xml` (for HDFS configuration) file in `conf/`.
|
|
|
|
When working with Hive, one must instantiate `SparkSession` with Hive support, including
|
|
connectivity to a persistent Hive metastore, support for Hive serdes, and Hive user-defined functions.
|
|
Users who do not have an existing Hive deployment can still enable Hive support. When not configured
|
|
by the `hive-site.xml`, the context automatically creates `metastore_db` in the current directory and
|
|
creates a directory configured by `spark.sql.warehouse.dir`, which defaults to the directory
|
|
`spark-warehouse` in the current directory that the Spark application is started. Note that
|
|
the `hive.metastore.warehouse.dir` property in `hive-site.xml` is deprecated since Spark 2.0.0.
|
|
Instead, use `spark.sql.warehouse.dir` to specify the default location of database in warehouse.
|
|
You may need to grant write privilege to the user who starts the Spark application.
|
|
|
|
<div class="codetabs">
|
|
|
|
<div data-lang="scala" markdown="1">
|
|
{% include_example spark_hive scala/org/apache/spark/examples/sql/hive/SparkHiveExample.scala %}
|
|
</div>
|
|
|
|
<div data-lang="java" markdown="1">
|
|
{% include_example spark_hive java/org/apache/spark/examples/sql/hive/JavaSparkHiveExample.java %}
|
|
</div>
|
|
|
|
<div data-lang="python" markdown="1">
|
|
{% include_example spark_hive python/sql/hive.py %}
|
|
</div>
|
|
|
|
<div data-lang="r" markdown="1">
|
|
|
|
When working with Hive one must instantiate `SparkSession` with Hive support. This
|
|
adds support for finding tables in the MetaStore and writing queries using HiveQL.
|
|
|
|
{% include_example spark_hive r/RSparkSQLExample.R %}
|
|
|
|
</div>
|
|
</div>
|
|
|
|
### Specifying storage format for Hive tables
|
|
|
|
When you create a Hive table, you need to define how this table should read/write data from/to file system,
|
|
i.e. the "input format" and "output format". You also need to define how this table should deserialize the data
|
|
to rows, or serialize rows to data, i.e. the "serde". The following options can be used to specify the storage
|
|
format("serde", "input format", "output format"), e.g. `CREATE TABLE src(id int) USING hive OPTIONS(fileFormat 'parquet')`.
|
|
By default, we will read the table files as plain text. Note that, Hive storage handler is not supported yet when
|
|
creating table, you can create a table using storage handler at Hive side, and use Spark SQL to read it.
|
|
|
|
<table class="table">
|
|
<tr><th>Property Name</th><th>Meaning</th></tr>
|
|
<tr>
|
|
<td><code>fileFormat</code></td>
|
|
<td>
|
|
A fileFormat is kind of a package of storage format specifications, including "serde", "input format" and
|
|
"output format". Currently we support 6 fileFormats: 'sequencefile', 'rcfile', 'orc', 'parquet', 'textfile' and 'avro'.
|
|
</td>
|
|
</tr>
|
|
|
|
<tr>
|
|
<td><code>inputFormat, outputFormat</code></td>
|
|
<td>
|
|
These 2 options specify the name of a corresponding `InputFormat` and `OutputFormat` class as a string literal,
|
|
e.g. `org.apache.hadoop.hive.ql.io.orc.OrcInputFormat`. These 2 options must be appeared in a pair, and you can not
|
|
specify them if you already specified the `fileFormat` option.
|
|
</td>
|
|
</tr>
|
|
|
|
<tr>
|
|
<td><code>serde</code></td>
|
|
<td>
|
|
This option specifies the name of a serde class. When the `fileFormat` option is specified, do not specify this option
|
|
if the given `fileFormat` already include the information of serde. Currently "sequencefile", "textfile" and "rcfile"
|
|
don't include the serde information and you can use this option with these 3 fileFormats.
|
|
</td>
|
|
</tr>
|
|
|
|
<tr>
|
|
<td><code>fieldDelim, escapeDelim, collectionDelim, mapkeyDelim, lineDelim</code></td>
|
|
<td>
|
|
These options can only be used with "textfile" fileFormat. They define how to read delimited files into rows.
|
|
</td>
|
|
</tr>
|
|
</table>
|
|
|
|
All other properties defined with `OPTIONS` will be regarded as Hive serde properties.
|
|
|
|
### Interacting with Different Versions of Hive Metastore
|
|
|
|
One of the most important pieces of Spark SQL's Hive support is interaction with Hive metastore,
|
|
which enables Spark SQL to access metadata of Hive tables. Starting from Spark 1.4.0, a single binary
|
|
build of Spark SQL can be used to query different versions of Hive metastores, using the configuration described below.
|
|
Note that independent of the version of Hive that is being used to talk to the metastore, internally Spark SQL
|
|
will compile against Hive 1.2.1 and use those classes for internal execution (serdes, UDFs, UDAFs, etc).
|
|
|
|
The following options can be used to configure the version of Hive that is used to retrieve metadata:
|
|
|
|
<table class="table">
|
|
<tr><th>Property Name</th><th>Default</th><th>Meaning</th></tr>
|
|
<tr>
|
|
<td><code>spark.sql.hive.metastore.version</code></td>
|
|
<td><code>1.2.1</code></td>
|
|
<td>
|
|
Version of the Hive metastore. Available
|
|
options are <code>0.12.0</code> through <code>2.3.6</code> and <code>3.0.0</code> through <code>3.1.1</code>.
|
|
</td>
|
|
</tr>
|
|
<tr>
|
|
<td><code>spark.sql.hive.metastore.jars</code></td>
|
|
<td><code>builtin</code></td>
|
|
<td>
|
|
Location of the jars that should be used to instantiate the HiveMetastoreClient. This
|
|
property can be one of three options:
|
|
<ol>
|
|
<li><code>builtin</code></li>
|
|
Use Hive 1.2.1, which is bundled with the Spark assembly when <code>-Phive</code> is
|
|
enabled. When this option is chosen, <code>spark.sql.hive.metastore.version</code> must be
|
|
either <code>1.2.1</code> or not defined.
|
|
<li><code>maven</code></li>
|
|
Use Hive jars of specified version downloaded from Maven repositories. This configuration
|
|
is not generally recommended for production deployments.
|
|
<li>A classpath in the standard format for the JVM. This classpath must include all of Hive
|
|
and its dependencies, including the correct version of Hadoop. These jars only need to be
|
|
present on the driver, but if you are running in yarn cluster mode then you must ensure
|
|
they are packaged with your application.</li>
|
|
</ol>
|
|
</td>
|
|
</tr>
|
|
<tr>
|
|
<td><code>spark.sql.hive.metastore.sharedPrefixes</code></td>
|
|
<td><code>com.mysql.jdbc,<br/>org.postgresql,<br/>com.microsoft.sqlserver,<br/>oracle.jdbc</code></td>
|
|
<td>
|
|
<p>
|
|
A comma-separated list of class prefixes that should be loaded using the classloader that is
|
|
shared between Spark SQL and a specific version of Hive. An example of classes that should
|
|
be shared is JDBC drivers that are needed to talk to the metastore. Other classes that need
|
|
to be shared are those that interact with classes that are already shared. For example,
|
|
custom appenders that are used by log4j.
|
|
</p>
|
|
</td>
|
|
</tr>
|
|
<tr>
|
|
<td><code>spark.sql.hive.metastore.barrierPrefixes</code></td>
|
|
<td><code>(empty)</code></td>
|
|
<td>
|
|
<p>
|
|
A comma separated list of class prefixes that should explicitly be reloaded for each version
|
|
of Hive that Spark SQL is communicating with. For example, Hive UDFs that are declared in a
|
|
prefix that typically would be shared (i.e. <code>org.apache.spark.*</code>).
|
|
</p>
|
|
</td>
|
|
</tr>
|
|
</table>
|