Commit graph

16 commits

Author SHA1 Message Date
Daoyuan Wang 26a8d2f908 [SPARK-35238][DOC] Add JindoFS SDK in cloud integration documents
### What changes were proposed in this pull request?
Add a link to the JindoFS SDK documentation in the cloud integration section of Spark's official documentation.

### Why are the changes needed?
If Spark users need to interact with Alibaba Cloud OSS, JindoFS SDK is the official solution provided by Alibaba Cloud.

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
Tested the URL manually.

Closes #32360 from adrian-wang/jindodoc.

Authored-by: Daoyuan Wang <daoyuan.wdy@alibaba-inc.com>
Signed-off-by: Sean Owen <srowen@gmail.com>
2021-04-27 09:32:47 -05:00
Lena d32bb4e5ee [MINOR][DOCS] Updating the link for Azure Data Lake Gen 2 in docs
### What changes were proposed in this pull request?

The current link for `Azure Blob Storage and Azure Datalake Gen 2` leads to AWS information. This replaces the link to point to the right page.

### Why are the changes needed?

So that users can access the correct link.

### Does this PR introduce _any_ user-facing change?

Yes, the link now points to the correct page.

### How was this patch tested?

N/A

Closes #31938 from lenadroid/patch-1.

Authored-by: Lena <alehall@microsoft.com>
Signed-off-by: Max Gekk <max.gekk@gmail.com>
2021-03-23 10:13:32 +03:00
Steve Loughran ff5115c3ac [SPARK-33739][SQL] Jobs committed through the S3A Magic committer don't track bytes
BasicWriteStatsTracker now probes for a custom XAttr when the size of
the generated file is 0 bytes; if the attribute is found and parseable,
it is used as the declared length of the output.

The matching Hadoop patch in HADOOP-17414:

* Returns all S3 object headers as XAttr attributes prefixed "header."
* Sets the custom header x-hadoop-s3a-magic-data-length to the length of
  the data in the marker file.

As a result, Spark job tracking will correctly report the amount of data uploaded
but not yet materialized.

### Why are the changes needed?

Now that S3 is consistent, it's a lot easier to use the S3A "magic" committer
which redirects a file written to `dest/__magic/job_0011/task_1245/__base/year=2020/output.avro`
to its final destination `dest/year=2020/output.avro`, adding a zero-byte marker file at
the end and a JSON file `dest/__magic/job_0011/task_1245/__base/year=2020/output.avro.pending`
containing all the information for the job committer to complete the upload.

But the write tracker statistics don't show progress: they measure the length of the
created file, find the marker file, and report 0 bytes.
By probing for a specific HTTP header in the marker file and parsing that if
retrieved, the real progress can be reported.

There's a matching change in Hadoop [https://github.com/apache/hadoop/pull/2530](https://github.com/apache/hadoop/pull/2530)
which adds getXAttr API support to the S3A connector and returns the headers; the magic
committer adds the relevant attributes.

If the FS being probed doesn't support the XAttr API, the header is missing,
or the value is not a positive long, then a size of 0 is returned.
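
As an illustration (not the actual BasicWriteStatsTracker code), a minimal Scala sketch of that probe might look like the following; the attribute name combines the `header.` prefix with the custom header named above, and the fallback contract matches the one just described:

```scala
import java.nio.charset.StandardCharsets
import scala.util.Try
import org.apache.hadoop.fs.{FileSystem, Path}

// Hypothetical probe: when a file's reported size is 0 bytes, ask the
// filesystem for the magic-committer length header via the XAttr API.
object MagicLengthProbe {
  private val LengthXAttr = "header.x-hadoop-s3a-magic-data-length"

  def declaredLength(fs: FileSystem, path: Path, statusLen: Long): Long = {
    if (statusLen > 0) {
      statusLen
    } else {
      Try {
        // throws if the FS has no XAttr support or the attribute is absent
        val bytes = fs.getXAttr(path, LengthXAttr)
        new String(bytes, StandardCharsets.UTF_8).trim.toLong
      }.toOption
        .filter(_ > 0L)   // ignore non-positive or unparseable values
        .getOrElse(0L)    // fall back to a size of 0, as described above
    }
  }
}
```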

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

New tests in BasicWriteTaskStatsTrackerSuite which use a filter FS to
implement getXAttr on top of LocalFS; this is used to explore the set of
options:
* no XAttr API implementation (existing tests; what callers would see with
  most filesystems)
* no attribute found (HDFS, ABFS without the attribute)
* invalid data of different forms

All of these return Some(0) as file length.

The Hadoop PR verifies XAttr implementation in S3A and that
the commit protocol attaches the header to the files.

External downstream testing has exercised the full Hadoop+Spark end-to-end
operation, with manual review of logs to verify that the data was successfully
collected from the attribute.

Closes #30714 from steveloughran/cdpd/SPARK-33739-magic-commit-tracking-master.

Authored-by: Steve Loughran <stevel@cloudera.com>
Signed-off-by: Thomas Graves <tgraves@apache.org>
2021-02-18 08:43:18 -06:00
Igor Dvorzhak 32a0451376 [MINOR][DOCS] Fix links to Cloud Storage connectors docs
Closes #29155 from medb/patch-1.

Authored-by: Igor Dvorzhak <idv@google.com>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
2020-07-19 12:19:36 -07:00
moovlin 9331a5c44b [SPARK-32035][DOCS][EXAMPLES] Fixed typos involving AWS Access, Secret, & Sessions tokens
### What changes were proposed in this pull request?
I resolved some inconsistencies in the AWS environment variable names; they're fixed in the documentation as well as in the examples. I grepped through the repo to find any more instances, but nothing else popped up.

### Why are the changes needed?

As previously mentioned, there is a JIRA request, SPARK-32035, which encapsulates all the issues. But, in summary, the naming of items was inconsistent.

### Does this PR introduce _any_ user-facing change?

Correct names:
AWS_ACCESS_KEY_ID
AWS_SECRET_ACCESS_KEY
AWS_SESSION_TOKEN
These are the same names that AWS uses in its libraries.

However, looking through the Spark documentation and comments, I see that these are not denoted correctly across the board:

docs/cloud-integration.md
106:1. `spark-submit` reads the `AWS_ACCESS_KEY`, `AWS_SECRET_KEY` <-- both different
107:and `AWS_SESSION_TOKEN` environment variables and sets the associated authentication options

docs/streaming-kinesis-integration.md
232:- Set up the environment variables `AWS_ACCESS_KEY_ID` and `AWS_SECRET_KEY` with your AWS credentials. <-- secret key different

external/kinesis-asl/src/main/python/examples/streaming/kinesis_wordcount_asl.py
34: $ export AWS_ACCESS_KEY_ID=<your-access-key>
35: $ export AWS_SECRET_KEY=<your-secret-key> <-- different
48: Environment Variables - AWS_ACCESS_KEY_ID and AWS_SECRET_KEY <-- secret key different

core/src/main/scala/org/apache/spark/deploy/SparkHadoopUtil.scala
438: val keyId = System.getenv("AWS_ACCESS_KEY_ID")
439: val accessKey = System.getenv("AWS_SECRET_ACCESS_KEY")
448: val sessionToken = System.getenv("AWS_SESSION_TOKEN")

external/kinesis-asl/src/main/scala/org/apache/spark/examples/streaming/KinesisWordCountASL.scala
53: * $ export AWS_ACCESS_KEY_ID=<your-access-key>
54: * $ export AWS_SECRET_KEY=<your-secret-key> <-- different
65: * Environment Variables - AWS_ACCESS_KEY_ID and AWS_SECRET_KEY <-- secret key different

external/kinesis-asl/src/main/java/org/apache/spark/examples/streaming/JavaKinesisWordCountASL.java
59: * $ export AWS_ACCESS_KEY_ID=[your-access-key]
60: * $ export AWS_SECRET_KEY=<your-secret-key> <-- different
71: * Environment Variables - AWS_ACCESS_KEY_ID and AWS_SECRET_KEY <-- secret key different

These were all fixed to match names listed under the "correct names" heading.
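
For reference, a minimal Scala sketch (hypothetical; not the actual `SparkHadoopUtil` code) of resolving credentials from the canonical variable names, with the session token as an optional third credential:

```scala
// Hypothetical helper: read AWS credentials from the canonical
// environment variable names; the session token is optional.
object AwsEnvCredentials {
  final case class Credentials(keyId: String, secretKey: String, sessionToken: Option[String])

  def fromEnv(): Option[Credentials] =
    for {
      keyId  <- Option(System.getenv("AWS_ACCESS_KEY_ID"))
      secret <- Option(System.getenv("AWS_SECRET_ACCESS_KEY"))
    } yield Credentials(keyId, secret, Option(System.getenv("AWS_SESSION_TOKEN")))
}
```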

### How was this patch tested?

I built the documentation using jekyll and verified that the changes were present & accurate.

Closes #29058 from Moovlin/SPARK-32035.

Authored-by: moovlin <richjoerger@gmail.com>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
2020-07-09 10:35:21 -07:00
Guy Khazma 44aecaa912 [SPARK-32099][DOCS] Remove broken link in cloud integration documentation
### What changes were proposed in this pull request?

The 3rd link in `IBM Cloud Object Storage connector for Apache Spark` is broken. The PR removes this link.

### Why are the changes needed?

broken link

### Does this PR introduce _any_ user-facing change?

yes, the broken link is removed from the doc.

### How was this patch tested?

doc generation passes successfully as before

Closes #28927 from guykhazma/spark32099.

Authored-by: Guy Khazma <guykhag@gmail.com>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
2020-06-26 19:12:42 -07:00
Dilip Biswal 7309e021ec [SPARK-29028][DOCS] Add links to IBM Cloud Object Storage connector in cloud-integration.md
### What changes were proposed in this pull request?
Add links to the IBM Cloud Object Storage connector in cloud-integration.md.

### Why are the changes needed?
This page lists the connectors to cloud providers. Currently the connector to
IBM Cloud Object Storage is not mentioned. This PR adds the necessary links for
completeness.

### Does this PR introduce any user-facing change?
Yes.

**Before:**
<img width="1234" alt="Screen Shot 2019-09-09 at 3 52 44 PM" src="https://user-images.githubusercontent.com/14225158/64571863-11a2c080-d31a-11e9-82e3-78c02675adb9.png">

**After.**

<img width="1234" alt="Screen Shot 2019-09-10 at 8 16 49 AM" src="https://user-images.githubusercontent.com/14225158/64626857-663e4e00-d3a3-11e9-8fa3-15ebf52ea832.png">

### How was this patch tested?
Tested using `jekyll build --serve`.

Closes #25737 from dilipbiswal/ibm-cloud-storage.

Authored-by: Dilip Biswal <dbiswal@us.ibm.com>
Signed-off-by: Sean Owen <sean.owen@databricks.com>
2019-09-10 11:19:55 -05:00
Steve Loughran 2ac6163a5d [SPARK-23977][SQL] Support High Performance S3A committers [test-hadoop3.2]
This patch adds the binding classes to enable Spark to switch dataframe output to using the S3A zero-rename committers shipping in Hadoop 3.1+. It adds a source tree into the hadoop-cloud-storage module which only compiles with the hadoop-3.2 profile, and contains a binding for normal output and a specific bridge class for Parquet (as the Parquet output format requires a subclass of `ParquetOutputCommitter`).
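
For context, a hedged sketch of how an application might wire up these bindings; the class and option names below follow Spark's cloud-integration documentation for the hadoop-cloud module, but verify them against your Spark/Hadoop versions:

```scala
import org.apache.spark.sql.SparkSession

// Sketch only: enable the S3A zero-rename committers through the
// binding classes shipped in the hadoop-cloud module.
val spark = SparkSession.builder()
  .appName("s3a-committer-example")
  // route Spark SQL's file commit protocol through the cloud binding
  .config("spark.sql.sources.commitProtocolClass",
    "org.apache.spark.internal.io.cloud.PathOutputCommitProtocol")
  // Parquet requires the bridge subclass of ParquetOutputCommitter
  .config("spark.sql.parquet.output.committer.class",
    "org.apache.spark.internal.io.cloud.BindingParquetOutputCommitter")
  // pick an S3A committer, e.g. "directory", "partitioned", or "magic"
  .config("spark.hadoop.fs.s3a.committer.name", "directory")
  .getOrCreate()
```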

Commit algorithms are a critical topic. There's no formal proof of correctness, but the algorithms are documented and analysed in [A Zero Rename Committer](https://github.com/steveloughran/zero-rename-committer/releases). This also reviews the classic v1 and v2 algorithms, IBM's Swift committer, and the one from EMRFS, which they admit was based on the concepts implemented here.

Test-wise:

* There's a public set of scala test suites [on github](https://github.com/hortonworks-spark/cloud-integration)
* We have run integration tests against Spark on Yarn clusters.
* This code has been shipping for ~12 months in HDP-3.x.

Closes #24970 from steveloughran/cloud/SPARK-23977-s3a-committer.

Authored-by: Steve Loughran <stevel@cloudera.com>
Signed-off-by: Marcelo Vanzin <vanzin@cloudera.com>
2019-08-15 09:39:26 -07:00
Sean Owen 754f820035 [SPARK-26918][DOCS] All .md should have ASF license header
## What changes were proposed in this pull request?

Add AL2 license to metadata of all .md files.
This seemed to be the tidiest way as it will get ignored by .md renderers and other tools. Attempts to write them as markdown comments revealed that there is no such standard thing.

## How was this patch tested?

Doc build

Closes #24243 from srowen/SPARK-26918.

Authored-by: Sean Owen <sean.owen@databricks.com>
Signed-off-by: Sean Owen <sean.owen@databricks.com>
2019-03-30 19:49:45 -05:00
Sean Owen 3909223681 [MINOR][DOCS] Clarify that Spark apps should mark Spark as a 'provided' dependency, not package it
## What changes were proposed in this pull request?

Spark apps do not need to package Spark; in fact, doing so can cause problems in some cases. Our examples should show depending on Spark as a 'provided' dependency.

Packaging Spark makes the app much bigger by tens of megabytes. It can also bring in conflicting dependencies that wouldn't otherwise be a problem. https://issues.apache.org/jira/browse/SPARK-26146 was what reminded me of this.
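
As an illustration, a build.sbt fragment (version numbers are placeholders) marking Spark as 'provided':

```scala
// Spark is compiled against but not bundled into the application jar;
// the cluster supplies it at runtime. Versions here are illustrative.
libraryDependencies ++= Seq(
  "org.apache.spark" %% "spark-core" % "2.4.0" % "provided",
  "org.apache.spark" %% "spark-sql"  % "2.4.0" % "provided"
)
```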

## How was this patch tested?

Doc build

Closes #23938 from srowen/Provided.

Authored-by: Sean Owen <sean.owen@databricks.com>
Signed-off-by: Sean Owen <sean.owen@databricks.com>
2019-03-05 08:26:30 -06:00
DB Tsai ad853c5678 [SPARK-25956] Make Scala 2.12 as default Scala version in Spark 3.0
## What changes were proposed in this pull request?

This PR makes Scala 2.12 Spark's default Scala version; Scala 2.11 will be the alternative version. This implies that Scala 2.12 will be used by our CI builds including pull request builds.

We'll update Jenkins to include a new compile-only job for Scala 2.11 to ensure the code can still be compiled with Scala 2.11.

## How was this patch tested?

existing tests

Closes #22967 from dbtsai/scala2.12.

Authored-by: DB Tsai <d_tsai@apple.com>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
2018-11-14 16:22:23 -08:00
Joey Krabacher 30be71e912 [DOCS] Fix cloud-integration.md Typo
Corrected a typo: changed `spark-default.conf` to `spark-defaults.conf`.

Closes #22125 from KraFusion/patch-2.

Authored-by: Joey Krabacher <jkrabacher@gmail.com>
Signed-off-by: Sean Owen <sean.owen@databricks.com>
2018-08-16 16:48:51 -07:00
Jim Kleckner 8ab8ef7733 Fix minor typo in docs/cloud-integration.md
## What changes were proposed in this pull request?

Minor typo in docs/cloud-integration.md

## How was this patch tested?

This is trivial enough that it should not affect tests.


Author: Jim Kleckner <jim@cloudphysics.com>

Closes #21629 from jkleckner/fix-doc-typo.
2018-06-25 16:23:23 +08:00
Daniel Sakuma 6ade5cbb49 [MINOR][DOC] Fix some typos and grammar issues
## What changes were proposed in this pull request?

Easy fix in the documentation.

## How was this patch tested?

N/A

Closes #20948

Author: Daniel Sakuma <dsakuma@gmail.com>

Closes #20928 from dsakuma/fix_typo_configuration_docs.
2018-04-06 13:37:08 +08:00
Shashwat Anand 84a076e0e9 [SPARK-23165][DOC] Spelling mistake fix in quick-start doc.
## What changes were proposed in this pull request?

Fix spelling in quick-start doc.

## How was this patch tested?

Doc only.

Author: Shashwat Anand <me@shashwat.me>

Closes #20336 from ashashwat/SPARK-23165.
2018-01-20 14:34:37 -08:00
Steve Loughran 2cf83c4783 [SPARK-7481][BUILD] Add spark-hadoop-cloud module to pull in object store access.
## What changes were proposed in this pull request?

Add a new `spark-hadoop-cloud` module and maven profile to pull in object store support from `hadoop-openstack`, `hadoop-aws` and `hadoop-azure` (Hadoop 2.7+) JARs, along with their dependencies, fixing up the dependencies so that everything works, in particular Jackson.

It restores `s3n://` access to S3, adds its `s3a://` replacement, OpenStack's `swift://`, and Azure's `wasb://`.
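
As a hedged usage sketch (bucket, container, and account names are placeholders, and credentials must already be configured for each scheme), an application with these connectors on the classpath can address the stores by URI:

```scala
import org.apache.spark.sql.SparkSession

// Illustrative only: read from S3 via s3a:// and write to Azure via wasb://.
object ObjectStoreIO {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("object-store-io").getOrCreate()
    val logs = spark.read.text("s3a://my-bucket/logs/2017/")
    logs.write.parquet("wasb://my-container@my-account.blob.core.windows.net/output/")
    spark.stop()
  }
}
```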

There's a documentation page, `cloud_integration.md`, which covers the basic details of using Spark with object stores, referring the reader to the supplier's own documentation, with specific warnings on security and the possible mismatch between a store's behavior and that of a filesystem. In particular, users are advised to be very cautious when trying to use an object store as the destination of data, and to consult the documentation of the storage supplier and the connector.

(this is the successor to #12004; I can't re-open it)

## How was this patch tested?

Downstream tests exist in [https://github.com/steveloughran/spark-cloud-examples/tree/master/cloud-examples](https://github.com/steveloughran/spark-cloud-examples/tree/master/cloud-examples)

Those verify that the dependencies are sufficient to allow downstream applications to work with s3a, azure wasb and swift storage connectors, and perform basic IO & dataframe operations thereon. All seems well.

A manual clean build verified that the assembly contains the relevant aws-* and hadoop-* artifacts on Hadoop 2.6, and the azure artifacts on a hadoop-2.7 profile.

SBT build: `build/sbt -Phadoop-cloud -Phadoop-2.7 package`
Maven build: `mvn install -Phadoop-cloud -Phadoop-2.7`

This PR *does not* update `dev/deps/spark-deps-hadoop-2.7` or `dev/deps/spark-deps-hadoop-2.6`, because unless the hadoop-cloud profile is enabled, no extra JARs show up in the dependency list. The dependency check in Jenkins isn't setting the property, so the new JARs aren't visible.

Author: Steve Loughran <stevel@apache.org>
Author: Steve Loughran <stevel@hortonworks.com>

Closes #17834 from steveloughran/cloud/SPARK-7481-current.
2017-05-07 10:15:31 +01:00