---
layout: global
displayTitle: Integration with Cloud Infrastructures
title: Integration with Cloud Infrastructures
description: Introduction to cloud storage support in Apache Spark SPARK_VERSION_SHORT
license: |
  Licensed to the Apache Software Foundation (ASF) under one or more
  contributor license agreements. See the NOTICE file distributed with
  this work for additional information regarding copyright ownership.
  The ASF licenses this file to You under the Apache License, Version 2.0
  (the "License"); you may not use this file except in compliance with
  the License. You may obtain a copy of the License at

     http://www.apache.org/licenses/LICENSE-2.0

  Unless required by applicable law or agreed to in writing, software
  distributed under the License is distributed on an "AS IS" BASIS,
  WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
  See the License for the specific language governing permissions and
  limitations under the License.
---
* This will become a table of contents (this text will be scraped).
{:toc}

## Introduction

All major cloud providers offer persistent data storage in *object stores*.
These are not classic "POSIX" file systems.
In order to store hundreds of petabytes of data without any single points of failure,
object stores replace the classic file system directory tree
with a simpler model of `object-name => data`. To enable remote access, operations
on objects are usually offered as (slow) HTTP REST operations.
Spark can read and write data in object stores through filesystem connectors implemented
in Hadoop or provided by the infrastructure suppliers themselves.
These connectors make the object stores look *almost* like file systems, with directories and files
and the classic operations on them such as list, delete and rename.

### Important: Cloud Object Stores are Not Real Filesystems

While the stores appear to be filesystems, underneath
they are still object stores, [and the difference is significant](https://hadoop.apache.org/docs/current/hadoop-project-dist/hadoop-common/filesystem/introduction.html).
They cannot be used as a direct replacement for a cluster filesystem such as HDFS
*except where this is explicitly stated*.
Key differences are:
* The means by which directories are emulated may make working with them slow.
* Rename operations may be very slow and, on failure, leave the store in an unknown state.
* Seeking within a file may require new HTTP calls, hurting performance.
How does this affect Spark?
1. Reading and writing data can be significantly slower than working with a normal filesystem.
1. Some directory structures may be very inefficient to scan during query split calculation.
1. The rename-based algorithm by which Spark normally commits work when saving an RDD, DataFrame or Dataset
is potentially both slow and unreliable.
For these reasons, it is not always safe to use an object store as a direct destination of queries, or as
an intermediate store in a chain of queries. Consult the documentation of the object store and its
connector to determine which uses are considered safe.

### Consistency

As of 2021, the object stores of Amazon (S3), Google Cloud (GCS) and Microsoft (Azure Storage, ADLS Gen1, ADLS Gen2) are all *consistent*.
This means that as soon as a file is written/updated it can be listed, viewed and opened by other processes,
and the latest version will be retrieved. Inconsistency was historically a problem with AWS S3,
especially its caching of 404 responses from HEAD requests made before an object was created.
Even so: none of the store connectors provide any guarantees as to how their clients cope with objects
which are overwritten while a stream is reading them. Do not assume that the old file can be safely
read, nor that there is any bounded time period for changes to become visible, or indeed that
the clients will not simply fail if a file being read is overwritten.
For this reason: avoid overwriting files where it is known/likely that other clients
will be actively reading them.
Other object stores are *inconsistent*; this includes [OpenStack Swift](https://docs.openstack.org/swift/latest/).
Such stores are not always safe to use as a destination of work; consult
each store's specific documentation.

### Installation

With the relevant libraries on the classpath and Spark configured with valid credentials,
objects can be read or written by using their URLs as the path to data.
For example `sparkContext.textFile("s3a://landsat-pds/scene_list.gz")` will create
an RDD of the file `scene_list.gz` stored in S3, using the s3a connector.
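
The same URLs work with the DataFrame API. Below is a minimal PySpark sketch; the `example-bucket` paths are illustrative, and the relevant connector libraries must be on the classpath, as described next:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("object-store-io").getOrCreate()

# RDD access, as in the example above: the URL is simply used as a path.
scenes = spark.sparkContext.textFile("s3a://landsat-pds/scene_list.gz")

# DataFrames work the same way (bucket and object names here are illustrative).
df = spark.read.csv("s3a://example-bucket/input/data.csv", header=True)
df.write.parquet("s3a://example-bucket/output/")
```
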
To add the relevant libraries to an application's classpath, include the `hadoop-cloud`
module and its dependencies.
In Maven, add the following to the `pom.xml` file, assuming `spark.version`
is set to the chosen version of Spark:
{% highlight xml %}
<dependencyManagement>
  ...
  <dependency>
    <groupId>org.apache.spark</groupId>
    <artifactId>hadoop-cloud_{{site.SCALA_BINARY_VERSION}}</artifactId>
    <version>${spark.version}</version>
    <scope>provided</scope>
  </dependency>
  ...
</dependencyManagement>
{% endhighlight %}
Commercial products based on Apache Spark generally directly set up the classpath
for talking to cloud infrastructures, in which case this module may not be needed.

### Authenticating

Spark jobs must authenticate with the object stores to access data within them.
1. When Spark is running in a cloud infrastructure, the credentials are usually automatically set up.
1. `spark-submit` reads the `AWS_ACCESS_KEY_ID`, `AWS_SECRET_ACCESS_KEY`
and `AWS_SESSION_TOKEN` environment variables and sets the associated authentication options
for the `s3n` and `s3a` connectors to Amazon S3.
1. In a Hadoop cluster, settings may be set in the `core-site.xml` file.
1. Authentication details may be manually added to the Spark configuration in `spark-defaults.conf`
1. Alternatively, they can be programmatically set in the `SparkConf` instance used to configure
the application's `SparkContext`.
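
As a sketch of the last option, s3a credentials can be set on the `SparkConf` through Hadoop options carrying the `spark.hadoop.` prefix; the values below are placeholders, and environment variables or instance roles are usually preferable to hard-coded secrets:

```python
from pyspark import SparkConf, SparkContext

conf = SparkConf()
# Options prefixed with "spark.hadoop." are passed down to the Hadoop
# connectors; these two configure the s3a connector's credentials.
conf.set("spark.hadoop.fs.s3a.access.key", "<access-key-id>")
conf.set("spark.hadoop.fs.s3a.secret.key", "<secret-access-key>")
sc = SparkContext(conf=conf)
```
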
*Important: never check authentication secrets into source code repositories,
especially public ones*
Consult [the Hadoop documentation](https://hadoop.apache.org/docs/current/) for the relevant
configuration and security options.

## Configuring

Each cloud connector has its own set of configuration parameters; again,
consult the relevant documentation.

### Recommended settings for writing to object stores

For object stores whose consistency model means that rename-based commits are safe,
use the `FileOutputCommitter` v2 algorithm for performance; v1 for safety.

```
spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version 2
```
This does less renaming at the end of a job than the "version 1" algorithm.
As it still uses `rename()` to commit files, it is unsafe to use
when the object store does not have consistent metadata/listings.
The committer can also be set to ignore failures when cleaning up temporary
files; this reduces the risk that a transient network problem is escalated into a
job failure:
```
spark.hadoop.mapreduce.fileoutputcommitter.cleanup-failures.ignored true
```
The original v1 commit algorithm renames the output of successful tasks
to a job attempt directory, and then renames all the files in that directory
into the final destination during the job commit phase:
```
spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version 1
```
The slow performance of mimicked renames on Amazon S3 makes this algorithm
very, very slow. The recommended solution to this is to switch to an S3 "Zero Rename"
committer (see below).
For reference, here are the performance and safety characteristics of
different stores and connectors when renaming directories:

| Store | Connector | Directory Rename Safety | Rename Performance |
|---------------|-----------|-------------------------|--------------------|
| Amazon S3 | s3a | Unsafe | O(data) |
| Azure Storage | wasb | Safe | O(files) |
| Azure Datalake Gen 2 | abfs | Safe | O(1) |
| Google Cloud Storage | gs | Mixed | O(files) |

1. As storing temporary files can run up charges, delete
directories called `"_temporary"` on a regular basis.
1. For AWS S3, set a limit on how long multipart uploads can remain outstanding.
This avoids incurring bills from incomplete uploads.
1. For Google Cloud, directory rename is file-by-file. Consider using the v2 committer
and only write code which generates idempotent output, including filenames,
as it is *no more unsafe* than the v1 committer, and faster.

### Parquet I/O Settings

For optimal performance when working with Parquet data, use the following settings:

```
spark.hadoop.parquet.enable.summary-metadata false
spark.sql.parquet.mergeSchema false
spark.sql.parquet.filterPushdown true
spark.sql.hive.metastorePartitionPruning true
```
These minimise the amount of data read during queries.
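
These options can also be applied programmatically when building the session, rather than in `spark-defaults.conf`; a minimal sketch using the same option names:

```python
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("parquet-on-object-store")
         # The same Parquet-related settings as listed above.
         .config("spark.hadoop.parquet.enable.summary-metadata", "false")
         .config("spark.sql.parquet.mergeSchema", "false")
         .config("spark.sql.parquet.filterPushdown", "true")
         .config("spark.sql.hive.metastorePartitionPruning", "true")
         .getOrCreate())
```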

### ORC I/O Settings

For best performance when working with ORC data, use these settings:
```
spark.sql.orc.filterPushdown true
spark.sql.orc.splits.include.file.footer true
spark.sql.orc.cache.stripe.details.size 10000
spark.sql.hive.metastorePartitionPruning true
```
Again, these minimise the amount of data read during queries.

## Spark Streaming and Object Storage

Spark Streaming can monitor files added to object stores, by
creating a `FileInputDStream` to monitor a path in the store through a call to
`StreamingContext.textFileStream()`.
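
A minimal sketch of such monitoring with the DStream API (the bucket and path are illustrative); note the caveats listed below:

```python
from pyspark import SparkContext
from pyspark.streaming import StreamingContext

sc = SparkContext(appName="object-store-stream")
# Use a generous batch interval: listing a large object store path is slow.
ssc = StreamingContext(sc, batchDuration=60)

# Watch a (hypothetical) path for newly added text files.
lines = ssc.textFileStream("s3a://example-bucket/incoming/")
lines.count().pprint()

ssc.start()
ssc.awaitTermination()
```
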
1. The time to scan for new files is proportional to the number of files
under the path, not the number of *new* files, so it can become a slow operation.
The size of the window needs to be set to handle this.
1. Files only appear in an object store once they are completely written; there
is no need for a workflow of write-then-rename to ensure that files aren't picked up
while they are still being written. Applications can write straight to the monitored directory.
1. Streams should only be checkpointed to a store implementing a fast and
atomic `rename()` operation.
Otherwise the checkpointing may be slow and potentially unreliable.

## Committing work into cloud storage safely and fast.

As covered earlier, commit-by-rename is dangerous on any object store which
exhibits eventual consistency (example: S3), and often slower than classic
filesystem renames.
Some object store connectors provide custom committers to commit tasks and
jobs without using rename. In versions of Spark built with Hadoop 3.1 or later,
the S3A connector for AWS S3 includes such committers.
Instead of writing data to a temporary directory on the store and renaming it,
these committers write each file to its final destination, but do not issue the
final POST request that makes a "multipart" upload visible. That operation is
postponed until the job commit itself. As a result, task and job commit are
much faster, and task failures do not leave partially written files visible at
the destination.
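To illustrate the mechanism only, the sketch below drives the S3 multipart-upload protocol directly with the AWS `boto3` SDK; it is not the S3A committer's own code, and the bucket and key names are placeholders. Nothing appears at the destination until the completion call, which is exactly the step the committers defer until job commit.

```python
import boto3

s3 = boto3.client("s3")
bucket, key = "bucket", "destination/part-00000.parquet"  # placeholder names

# Start a multipart upload and send the data; the object is not yet visible.
upload = s3.create_multipart_upload(Bucket=bucket, Key=key)
part = s3.upload_part(Bucket=bucket, Key=key, UploadId=upload["UploadId"],
                      PartNumber=1, Body=b"...task output bytes...")

# Only this final call makes the object appear at `key`; deferring it until
# job commit (or aborting the upload on failure) is what keeps partial task
# output from ever becoming visible.
s3.complete_multipart_upload(
    Bucket=bucket, Key=key, UploadId=upload["UploadId"],
    MultipartUpload={"Parts": [{"ETag": part["ETag"], "PartNumber": 1}]})
```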
To switch to the S3A committers, use a version of Spark that was built with
Hadoop 3.1 or later, and enable them through the following options.
```
spark.hadoop.fs.s3a.committer.name directory
spark.sql.sources.commitProtocolClass org.apache.spark.internal.io.cloud.PathOutputCommitProtocol
spark.sql.parquet.output.committer.class org.apache.spark.internal.io.cloud.BindingParquetOutputCommitter
```
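The same properties can also be set programmatically when the session is created, provided the `spark-hadoop-cloud` module and the S3A connector are on the classpath. The PySpark sketch below is one way to do this, not the only supported one:

```python
from pyspark.sql import SparkSession

# Apply the committer settings at session creation time; the property values
# mirror the configuration block above.
spark = (SparkSession.builder
         .appName("s3a-committer-example")
         .config("spark.hadoop.fs.s3a.committer.name", "directory")
         .config("spark.sql.sources.commitProtocolClass",
                 "org.apache.spark.internal.io.cloud.PathOutputCommitProtocol")
         .config("spark.sql.parquet.output.committer.class",
                 "org.apache.spark.internal.io.cloud.BindingParquetOutputCommitter")
         .getOrCreate())
```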
These committers have been tested with the most common output formats supported
by Spark. For example, to write a DataFrame as Parquet to an S3A destination:
```python
mydataframe.write.format("parquet").save("s3a://bucket/destination")
```
More details on these committers can be found in the
[latest Hadoop documentation](https://hadoop.apache.org/docs/current/hadoop-aws/tools/hadoop-aws/committers.html).
Note: depending upon the committer used, in-progress statistics may be
under-reported with Hadoop versions before 3.3.1.
## Further Reading
Here is the documentation on the standard connectors, from both Apache and the cloud providers.
* [OpenStack Swift](https://hadoop.apache.org/docs/current/hadoop-openstack/index.html).
* [Azure Blob Storage](https://hadoop.apache.org/docs/current/hadoop-azure/index.html).
* [Azure Blob Filesystem (ABFS) and Azure Datalake Gen 2](https://hadoop.apache.org/docs/current/hadoop-azure/abfs.html).
* [Azure Data Lake Gen 1](https://hadoop.apache.org/docs/current/hadoop-azure-datalake/index.html).
* [Amazon S3 Strong Consistency](https://aws.amazon.com/s3/consistency/)
* [Hadoop-AWS module (Hadoop 3.x)](https://hadoop.apache.org/docs/current3/hadoop-aws/tools/hadoop-aws/index.html).
* [Amazon S3 via S3A and S3N (Hadoop 2.x)](https://hadoop.apache.org/docs/current2/hadoop-aws/tools/hadoop-aws/index.html).
* [Amazon EMR File System (EMRFS)](https://docs.aws.amazon.com/emr/latest/ManagementGuide/emr-fs.html). From Amazon.
* [Using the EMRFS S3-optimized Committer](https://docs.amazonaws.cn/en_us/emr/latest/ReleaseGuide/emr-spark-s3-optimized-committer.html)
* [Google Cloud Storage Connector for Spark and Hadoop](https://cloud.google.com/dataproc/docs/concepts/connectors/cloud-storage). From Google.
* [The Azure Blob Filesystem driver (ABFS)](https://docs.microsoft.com/en-us/azure/storage/blobs/data-lake-storage-abfs-driver)
* IBM Cloud Object Storage connector for Apache Spark: [Stocator](https://github.com/CODAIT/stocator),
[IBM Object Storage](https://www.ibm.com/cloud/object-storage). From IBM.