6ade5cbb49
## What changes were proposed in this pull request? Easy fix in the documentation. ## How was this patch tested? N/A Closes #20948 Author: Daniel Sakuma <dsakuma@gmail.com> Closes #20928 from dsakuma/fix_typo_configuration_docs.
139 lines
4.5 KiB
Markdown
139 lines
4.5 KiB
Markdown
---
|
|
layout: global
|
|
title: Accessing OpenStack Swift from Spark
|
|
---
|
|
|
|
Spark's support for Hadoop InputFormat allows it to process data in OpenStack Swift using the
|
|
same URI formats as in Hadoop. You can specify a path in Swift as input through a
|
|
URI of the form <code>swift://container.PROVIDER/path</code>. You will also need to set your
|
|
Swift security credentials, through <code>core-site.xml</code> or via
|
|
<code>SparkContext.hadoopConfiguration</code>.
|
|
The current Swift driver requires Swift to use the Keystone authentication method, or
|
|
its Rackspace-specific predecessor.
|
|
|
|
# Configuring Swift for Better Data Locality
|
|
|
|
Although not mandatory, it is recommended to configure the proxy server of Swift with
|
|
<code>list_endpoints</code> to have better data locality. More information is
|
|
[available here](https://github.com/openstack/swift/blob/master/swift/common/middleware/list_endpoints.py).
|
|
|
|
|
|
# Dependencies
|
|
|
|
The Spark application should include <code>hadoop-openstack</code> dependency, which can
|
|
be done by including the `hadoop-cloud` module for the specific version of spark used.
|
|
For example, for Maven support, add the following to the <code>pom.xml</code> file:
|
|
|
|
{% highlight xml %}
|
|
<dependencyManagement>
|
|
...
|
|
<dependency>
|
|
<groupId>org.apache.spark</groupId>
|
|
<artifactId>hadoop-cloud_2.11</artifactId>
|
|
<version>${spark.version}</version>
|
|
</dependency>
|
|
...
|
|
</dependencyManagement>
|
|
{% endhighlight %}
|
|
|
|
# Configuration Parameters
|
|
|
|
Create <code>core-site.xml</code> and place it inside Spark's <code>conf</code> directory.
|
|
The main category of parameters that should be configured is the authentication parameters
|
|
required by Keystone.
|
|
|
|
The following table contains a list of Keystone mandatory parameters. <code>PROVIDER</code> can be
|
|
any (alphanumeric) name.
|
|
|
|
<table class="table">
|
|
<tr><th>Property Name</th><th>Meaning</th><th>Required</th></tr>
|
|
<tr>
|
|
<td><code>fs.swift.service.PROVIDER.auth.url</code></td>
|
|
<td>Keystone Authentication URL</td>
|
|
<td>Mandatory</td>
|
|
</tr>
|
|
<tr>
|
|
<td><code>fs.swift.service.PROVIDER.auth.endpoint.prefix</code></td>
|
|
<td>Keystone endpoints prefix</td>
|
|
<td>Optional</td>
|
|
</tr>
|
|
<tr>
|
|
<td><code>fs.swift.service.PROVIDER.tenant</code></td>
|
|
<td>Tenant</td>
|
|
<td>Mandatory</td>
|
|
</tr>
|
|
<tr>
|
|
<td><code>fs.swift.service.PROVIDER.username</code></td>
|
|
<td>Username</td>
|
|
<td>Mandatory</td>
|
|
</tr>
|
|
<tr>
|
|
<td><code>fs.swift.service.PROVIDER.password</code></td>
|
|
<td>Password</td>
|
|
<td>Mandatory</td>
|
|
</tr>
|
|
<tr>
|
|
<td><code>fs.swift.service.PROVIDER.http.port</code></td>
|
|
<td>HTTP port</td>
|
|
<td>Mandatory</td>
|
|
</tr>
|
|
<tr>
|
|
<td><code>fs.swift.service.PROVIDER.region</code></td>
|
|
<td>Keystone region</td>
|
|
<td>Mandatory</td>
|
|
</tr>
|
|
<tr>
|
|
<td><code>fs.swift.service.PROVIDER.public</code></td>
|
|
<td>Indicates whether to use the public (off cloud) or private (in cloud; no transfer fees) endpoints</td>
|
|
<td>Mandatory</td>
|
|
</tr>
|
|
</table>
|
|
|
|
For example, assume <code>PROVIDER=SparkTest</code> and Keystone contains user <code>tester</code> with password <code>testing</code>
|
|
defined for tenant <code>test</code>. Then <code>core-site.xml</code> should include:
|
|
|
|
{% highlight xml %}
|
|
<configuration>
|
|
<property>
|
|
<name>fs.swift.service.SparkTest.auth.url</name>
|
|
<value>http://127.0.0.1:5000/v2.0/tokens</value>
|
|
</property>
|
|
<property>
|
|
<name>fs.swift.service.SparkTest.auth.endpoint.prefix</name>
|
|
<value>endpoints</value>
|
|
</property>
|
|
<name>fs.swift.service.SparkTest.http.port</name>
|
|
<value>8080</value>
|
|
</property>
|
|
<property>
|
|
<name>fs.swift.service.SparkTest.region</name>
|
|
<value>RegionOne</value>
|
|
</property>
|
|
<property>
|
|
<name>fs.swift.service.SparkTest.public</name>
|
|
<value>true</value>
|
|
</property>
|
|
<property>
|
|
<name>fs.swift.service.SparkTest.tenant</name>
|
|
<value>test</value>
|
|
</property>
|
|
<property>
|
|
<name>fs.swift.service.SparkTest.username</name>
|
|
<value>tester</value>
|
|
</property>
|
|
<property>
|
|
<name>fs.swift.service.SparkTest.password</name>
|
|
<value>testing</value>
|
|
</property>
|
|
</configuration>
|
|
{% endhighlight %}
|
|
|
|
Notice that
|
|
<code>fs.swift.service.PROVIDER.tenant</code>,
|
|
<code>fs.swift.service.PROVIDER.username</code>,
|
|
<code>fs.swift.service.PROVIDER.password</code> contains sensitive information and keeping them in
|
|
<code>core-site.xml</code> is not always a good approach.
|
|
We suggest to keep those parameters in <code>core-site.xml</code> for testing purposes when running Spark
|
|
via <code>spark-shell</code>.
|
|
For job submissions they should be provided via <code>sparkContext.hadoopConfiguration</code>.
|