---
layout: global
title: Hive Tables
displayTitle: Hive Tables
license: |
  Licensed to the Apache Software Foundation (ASF) under one or more
  contributor license agreements.  See the NOTICE file distributed with
  this work for additional information regarding copyright ownership.
  The ASF licenses this file to You under the Apache License, Version 2.0
  (the "License"); you may not use this file except in compliance with
  the License.  You may obtain a copy of the License at

     http://www.apache.org/licenses/LICENSE-2.0

  Unless required by applicable law or agreed to in writing, software
  distributed under the License is distributed on an "AS IS" BASIS,
  WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
  See the License for the specific language governing permissions and
  limitations under the License.
---

* Table of contents
{:toc}

Spark SQL also supports reading and writing data stored in [Apache Hive](http://hive.apache.org/).
However, since Hive has a large number of dependencies, these dependencies are not included in the
default Spark distribution. If Hive dependencies can be found on the classpath, Spark will load them
automatically. Note that these Hive dependencies must also be present on all of the worker nodes, as
they will need access to the Hive serialization and deserialization libraries (SerDes) in order to
access data stored in Hive.

Configuration of Hive is done by placing your `hive-site.xml`, `core-site.xml` (for security
configuration), and `hdfs-site.xml` (for HDFS configuration) files in `conf/`.

When working with Hive, one must instantiate `SparkSession` with Hive support, including
connectivity to a persistent Hive metastore, support for Hive serdes, and Hive user-defined
functions. Users who do not have an existing Hive deployment can still enable Hive support. When
not configured by `hive-site.xml`, the context automatically creates `metastore_db` in the current
directory and creates a directory configured by `spark.sql.warehouse.dir`, which defaults to the
directory `spark-warehouse` in the current directory in which the Spark application is started.
Note that the `hive.metastore.warehouse.dir` property in `hive-site.xml` has been deprecated since
Spark 2.0.0. Instead, use `spark.sql.warehouse.dir` to specify the default location of databases in
the warehouse. You may need to grant write privilege to the user who starts the Spark application.
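As a minimal sketch, the following Scala snippet creates a Hive-enabled session (the warehouse
location shown is an illustrative assumption; any path writable by the user who starts the
application works):

```scala
import org.apache.spark.sql.SparkSession

// Illustrative warehouse location; defaults to ./spark-warehouse if unset.
val warehouseLocation = "spark-warehouse"

val spark = SparkSession
  .builder()
  .appName("Spark Hive Example")
  .config("spark.sql.warehouse.dir", warehouseLocation)
  .enableHiveSupport() // persistent Hive metastore, Hive serdes, Hive UDFs
  .getOrCreate()

// The session can now run HiveQL and read/write Hive tables.
spark.sql("SHOW TABLES").show()
```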
When creating a Hive table, you need to define how the table should read/write data from/to the
file system, i.e. the "input format" and "output format", and how it should deserialize the data
to rows and serialize rows to data, i.e. the "serde". The following options can be used to specify
the storage format, e.g. `CREATE TABLE src(id int) USING hive OPTIONS(fileFormat 'parquet')`. By
default, the table files are read as plain text.

Property Name | Meaning
---|---
`fileFormat` | A fileFormat is a kind of package of storage format specifications, including "serde", "input format" and "output format". Currently we support 6 fileFormats: 'sequencefile', 'rcfile', 'orc', 'parquet', 'textfile' and 'avro'.
`inputFormat`, `outputFormat` | These 2 options specify the name of a corresponding `InputFormat` and `OutputFormat` class as a string literal, e.g. `org.apache.hadoop.hive.ql.io.orc.OrcInputFormat`. These 2 options must appear as a pair, and you cannot specify them if you have already specified the `fileFormat` option.
`serde` | This option specifies the name of a serde class. When the `fileFormat` option is specified, do not specify this option if the given `fileFormat` already includes the serde information. Currently "sequencefile", "textfile" and "rcfile" don't include the serde information, and you can use this option with these 3 fileFormats.
`fieldDelim`, `escapeDelim`, `collectionDelim`, `mapkeyDelim`, `lineDelim` | These options can only be used with the "textfile" fileFormat. They define how to read delimited files into rows.
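As a sketch (assuming the Hive-enabled `spark` session from the example above; the table name
`hive_records` is hypothetical), these options are passed through the `OPTIONS` clause when
creating a Hive-format table:

```scala
// Create a Hive table whose underlying storage format is Parquet.
// `hive_records` is an illustrative name.
spark.sql("""
  CREATE TABLE hive_records(key INT, value STRING)
  USING hive
  OPTIONS(fileFormat 'parquet')
""")

// The table can then be queried like any other.
spark.sql("SELECT COUNT(*) FROM hive_records").show()
```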
### Interacting with Different Versions of the Hive Metastore

One of the most important pieces of Spark SQL's Hive support is interaction with the Hive
metastore, which enables Spark SQL to access metadata of Hive tables. The following properties
configure the version of Hive that is used to talk to the metastore:

Property Name | Default | Meaning | Since Version
---|---|---|---
`spark.sql.hive.metastore.version` | `2.3.8` | Version of the Hive metastore. Available options are `0.12.0` through `2.3.8` and `3.0.0` through `3.1.2`. | 1.4.0
`spark.sql.hive.metastore.jars` | `builtin` | Location of the jars that should be used to instantiate the HiveMetastoreClient. This property can be one of four options: `builtin` (use Hive 2.3.8, which is bundled with the Spark assembly when `-Phive` is enabled; when this option is chosen, `spark.sql.hive.metastore.version` must be either `2.3.8` or not defined), `maven` (use Hive jars of the specified version downloaded from Maven repositories), `path` (use Hive jars configured by `spark.sql.hive.metastore.jars.path` in comma-separated format; both local and remote paths are supported), or a classpath in the standard format for both Hive and Hadoop. For the last two options, the provided jars should be the same version as `spark.sql.hive.metastore.version`. | 1.4.0
`spark.sql.hive.metastore.jars.path` | (empty) | Comma-separated paths of the jars used to instantiate the HiveMetastoreClient. This configuration is useful only when `spark.sql.hive.metastore.jars` is set to `path`. The paths can be any of the following formats: `file://path/to/jar/foo.jar`, `hdfs://nameservice/path/to/jar/foo.jar`, `/path/to/jar/foo.jar` (a path without a URI scheme follows the `fs.defaultFS` URI schema), or `[http/https/ftp]://path/to/jar/foo.jar`. | 3.1.0
`spark.sql.hive.metastore.sharedPrefixes` | `com.mysql.jdbc,org.postgresql,com.microsoft.sqlserver,oracle.jdbc` | A comma-separated list of class prefixes that should be loaded using the classloader that is shared between Spark SQL and a specific version of Hive. An example of classes that should be shared is JDBC drivers that are needed to talk to the metastore. Other classes that need to be shared are those that interact with classes that are already shared, for example, custom appenders that are used by log4j. | 1.4.0
`spark.sql.hive.metastore.barrierPrefixes` | (empty) | A comma-separated list of class prefixes that should explicitly be reloaded for each version of Hive that Spark SQL is communicating with, for example, Hive UDFs that are declared in a prefix that typically would be shared (i.e. `org.apache.spark.*`). | 1.4.0
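As an illustrative sketch, these properties are set before the session is created. The metastore
version and jar path below are assumptions; substitute the values matching your deployment:

```scala
import org.apache.spark.sql.SparkSession

// Connect to an external Hive 3.1.2 metastore using jars from a local path.
// Both the version and the path below are hypothetical examples.
val spark = SparkSession
  .builder()
  .appName("External Hive metastore")
  .config("spark.sql.hive.metastore.version", "3.1.2")
  .config("spark.sql.hive.metastore.jars", "path")
  .config("spark.sql.hive.metastore.jars.path", "file:///opt/hive/lib/*.jar")
  .enableHiveSupport()
  .getOrCreate()
```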