---
layout: global
title: Distributed SQL Engine
displayTitle: Distributed SQL Engine
license: |
  Licensed to the Apache Software Foundation (ASF) under one or more
  contributor license agreements.  See the NOTICE file distributed with
  this work for additional information regarding copyright ownership.
  The ASF licenses this file to You under the Apache License, Version 2.0
  (the "License"); you may not use this file except in compliance with
  the License.  You may obtain a copy of the License at
 
     http://www.apache.org/licenses/LICENSE-2.0
 
  Unless required by applicable law or agreed to in writing, software
  distributed under the License is distributed on an "AS IS" BASIS,
  WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
  See the License for the specific language governing permissions and
  limitations under the License.
---

* Table of contents
{:toc}

Spark SQL can also act as a distributed query engine using its JDBC/ODBC or command-line interface.
In this mode, end-users or applications can interact with Spark SQL directly to run SQL queries,
without the need to write any code.

## Running the Thrift JDBC/ODBC server

The Thrift JDBC/ODBC server implemented here corresponds to the [`HiveServer2`](https://cwiki.apache.org/confluence/display/Hive/Setting+Up+HiveServer2)
in the built-in Hive. You can test the JDBC server with the beeline script that comes with either Spark or a compatible Hive.

To start the JDBC/ODBC server, run the following in the Spark directory:

    ./sbin/start-thriftserver.sh

This script accepts all `bin/spark-submit` command-line options, plus a `--hiveconf` option to
specify Hive properties. Run `./sbin/start-thriftserver.sh --help` for a complete list of
available options. By default, the server listens on localhost:10000. You can override this
behaviour via either environment variables, for example:

{% highlight bash %}
export HIVE_SERVER2_THRIFT_PORT=<listening-port>
export HIVE_SERVER2_THRIFT_BIND_HOST=<listening-host>
./sbin/start-thriftserver.sh \
  --master <master-uri> \
  ...
{% endhighlight %}

or system properties:

{% highlight bash %}
./sbin/start-thriftserver.sh \
  --hiveconf hive.server2.thrift.port=<listening-port> \
  --hiveconf hive.server2.thrift.bind.host=<listening-host> \
  --master <master-uri> \
  ...
{% endhighlight %}
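If you script the steps above, it can help to wait until the server accepts connections before connecting. The following is a minimal sketch and not part of Spark: `wait_for_port` is a hypothetical helper, and it assumes Bash, because it relies on Bash's `/dev/tcp` redirection.

```bash
# Wait until a TCP port accepts connections, or give up after N attempts.
# Relies on Bash's /dev/tcp pseudo-device, so run it with bash, not sh.
wait_for_port() {
  local host="$1" port="$2" attempts="${3:-30}"
  local i=0
  while [ "$i" -lt "$attempts" ]; do
    # The subshell opens a connection on fd 3 and closes it when it exits.
    if (exec 3<>"/dev/tcp/${host}/${port}") 2>/dev/null; then
      return 0
    fi
    i=$((i + 1))
    sleep 1
  done
  return 1
}

# Usage, with the default host and port mentioned above:
# wait_for_port localhost 10000 60 && echo "Thrift server is accepting connections"
```

The function returns a non-zero status if the port never opens, so it can gate later steps in a script.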

Now you can use beeline to test the Thrift JDBC/ODBC server:

    ./bin/beeline

Connect to the JDBC/ODBC server in beeline with:

    beeline> !connect jdbc:hive2://localhost:10000

Beeline will ask you for a username and password. In non-secure mode, simply enter the username on
your machine and a blank password. For secure mode, please follow the instructions given in the
[beeline documentation](https://cwiki.apache.org/confluence/display/Hive/HiveServer2+Clients).
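For scripted use, beeline can also connect and run a statement without the interactive prompt. The sketch below assumes non-secure mode and the default host and port; `thrift_jdbc_url` and `run_statement` are hypothetical helper names, while `-u`, `-n` and `-e` are beeline's options for the JDBC URL, the username and a statement to execute.

```bash
# Build the JDBC URL for the Thrift server; the defaults match the ones above.
thrift_jdbc_url() {
  local host="${1:-localhost}" port="${2:-10000}"
  printf 'jdbc:hive2://%s:%s\n' "$host" "$port"
}

# Run one statement and exit instead of opening the interactive prompt.
# Call this from the Spark directory; the statement is only an illustration.
run_statement() {
  ./bin/beeline -u "$(thrift_jdbc_url)" -n "$(whoami)" -e "$1"
}

# Usage: run_statement 'SHOW DATABASES;'
```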

Configuration of Hive is done by placing your `hive-site.xml`, `core-site.xml` and `hdfs-site.xml` files in `conf/`.

You may also use the beeline script that comes with Hive.

The Thrift JDBC server also supports sending Thrift RPC messages over HTTP transport.
Use the following settings to enable HTTP mode, either as system properties or in the `hive-site.xml` file in `conf/`:

    hive.server2.transport.mode - Set this to value: http
    hive.server2.thrift.http.port - HTTP port number to listen on; default is 10001
    hive.server2.http.endpoint - HTTP endpoint; default is cliservice

To test, use beeline to connect to the JDBC/ODBC server in HTTP mode with:

    beeline> !connect jdbc:hive2://<host>:<port>/<database>?hive.server2.transport.mode=http;hive.server2.thrift.http.path=<http_endpoint>
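When the same URL is passed to beeline from a shell rather than typed at the `beeline>` prompt, it has to be quoted. The sketch below builds the URL in the form shown above; `thrift_http_jdbc_url` is a hypothetical helper, the port and endpoint defaults are the ones listed above, and the `default` database name is an assumption.

```bash
# Build the HTTP-mode JDBC URL in the form shown above.
# Arguments: host [port] [database] [http endpoint]
thrift_http_jdbc_url() {
  local host="$1" port="${2:-10001}" database="${3:-default}" endpoint="${4:-cliservice}"
  printf 'jdbc:hive2://%s:%s/%s?hive.server2.transport.mode=http;hive.server2.thrift.http.path=%s\n' \
    "$host" "$port" "$database" "$endpoint"
}

# Quote the URL: unquoted, the shell would treat ';' as a command separator
# and '?' as a glob character.
# ./bin/beeline -u "$(thrift_http_jdbc_url localhost)" -n "$(whoami)"
```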

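If you run a CTAS (`CREATE TABLE ... AS SELECT`) statement after a session has been closed, you must set `fs.%s.impl.disable.cache` to `true` in `hive-site.xml`, where `%s` is the scheme of the file system in use (for example, `hdfs`).
See [[SPARK-21067]](https://issues.apache.org/jira/browse/SPARK-21067) for more details.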
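
As an illustration, assuming HDFS is the file system in use (so that `%s` is replaced by the `hdfs` scheme), the entry inside the `<configuration>` element of `hive-site.xml` might look like this:

```xml
<!-- Assumes HDFS; for another file system, substitute its scheme for "hdfs". -->
<property>
  <name>fs.hdfs.impl.disable.cache</name>
  <value>true</value>
</property>
```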

## Running the Spark SQL CLI

The Spark SQL CLI is a convenient tool to run the Hive metastore service in local mode and execute
queries input from the command line. Note that the Spark SQL CLI cannot talk to the Thrift JDBC server.

To start the Spark SQL CLI, run the following in the Spark directory:

    ./bin/spark-sql

Configuration of Hive is done by placing your `hive-site.xml`, `core-site.xml` and `hdfs-site.xml` files in `conf/`.
You may run `./bin/spark-sql --help` for a complete list of all available options.
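
Besides the interactive prompt, the CLI can run a query given on the command line (`-e`) or the statements in a file (`-f`) and then exit. The following is a minimal sketch: the queries and the temporary file are illustrations only, and the guard simply skips the calls when the script is not run from the Spark directory.

```bash
# Write a small SQL script to a temporary file; the statement is only an illustration.
sql_file="$(mktemp)"
printf 'SHOW DATABASES;\n' > "$sql_file"

# Run the CLI only when invoked from the Spark directory.
if [ -x ./bin/spark-sql ]; then
  ./bin/spark-sql -e 'SELECT 1 AS one'   # run an inline query and exit
  ./bin/spark-sql -f "$sql_file"         # run the statements in the file and exit
fi

# Clean up the temporary file.
rm -f "$sql_file"
```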