[SPARK-21267][DOCS][MINOR] Follow up to avoid referencing programming-guide redirector

## What changes were proposed in this pull request?

Update internal references from programming-guide to rdd-programming-guide

See 5ddf243fd8 and https://github.com/apache/spark/pull/18485#issuecomment-314789751

Let's keep the redirector even if it's problematic to build, but not rely on it internally.
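
The rename described above could be scripted roughly as follows. This is only a hypothetical sketch, not the tooling used for this commit: the helper name `rewrite_refs` and the regular expression are assumptions introduced for illustration. The point it shows is that a plain search-and-replace on `programming-guide.html` would also hit `sql-programming-guide.html` and the other guides, so the match has to be restricted to the bare page name.

```python
import re

# Hypothetical helper, not part of the Spark repository.
# Match "programming-guide.html" only when it is a whole page name, i.e. not
# preceded by a word character or a hyphen. This leaves
# "sql-programming-guide.html", "streaming-programming-guide.html" and
# already-updated "rdd-programming-guide.html" references untouched.
OLD_REF = re.compile(r"(?<![\w-])programming-guide\.html")


def rewrite_refs(text: str) -> str:
    """Point bare programming-guide.html links at rdd-programming-guide.html."""
    return OLD_REF.sub("rdd-programming-guide.html", text)


if __name__ == "__main__":
    sample = (
        "[RDD](programming-guide.html#rdd-persistence) and "
        "[DataFrame](sql-programming-guide.html)"
    )
    print(rewrite_refs(sample))
```

Because the pattern never matches an already-rewritten link, running the helper twice over the same text gives the same result as running it once.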

## How was this patch tested?

(Doc build)

Author: Sean Owen <sowen@cloudera.com>

Closes #18625 from srowen/SPARK-21267.2.
Sean Owen 2017-07-15 09:21:29 +01:00
parent ac5d5d7959
commit 74ac1fb081
9 changed files with 20 additions and 14 deletions

@@ -593,7 +593,7 @@ setMethod("cache",
 #'
 #' Persist this SparkDataFrame with the specified storage level. For details of the
 #' supported storage levels, refer to
-#' \url{http://spark.apache.org/docs/latest/programming-guide.html#rdd-persistence}.
+#' \url{http://spark.apache.org/docs/latest/rdd-programming-guide.html#rdd-persistence}.
 #'
 #' @param x the SparkDataFrame to persist.
 #' @param newLevel storage level chosen for the persistance. See available options in

@@ -227,7 +227,7 @@ setMethod("cacheRDD",
 #'
 #' Persist this RDD with the specified storage level. For details of the
 #' supported storage levels, refer to
-#'\url{http://spark.apache.org/docs/latest/programming-guide.html#rdd-persistence}.
+#'\url{http://spark.apache.org/docs/latest/rdd-programming-guide.html#rdd-persistence}.
 #'
 #' @param x The RDD to persist
 #' @param newLevel The new storage level to be assigned

@@ -27,7 +27,7 @@ description: GraphX graph processing library guide for Spark SPARK_VERSION_SHORT
 [EdgeContext]: api/scala/index.html#org.apache.spark.graphx.EdgeContext
 [GraphOps.collectNeighborIds]: api/scala/index.html#org.apache.spark.graphx.GraphOps@collectNeighborIds(EdgeDirection):VertexRDD[Array[VertexId]]
 [GraphOps.collectNeighbors]: api/scala/index.html#org.apache.spark.graphx.GraphOps@collectNeighbors(EdgeDirection):VertexRDD[Array[(VertexId,VD)]]
-[RDD Persistence]: programming-guide.html#rdd-persistence
+[RDD Persistence]: rdd-programming-guide.html#rdd-persistence
 [Graph.cache]: api/scala/index.html#org.apache.spark.graphx.Graph@cache():Graph[VD,ED]
 [GraphOps.pregel]: api/scala/index.html#org.apache.spark.graphx.GraphOps@pregel[A](A,Int,EdgeDirection)((VertexId,VD,A)⇒VD,(EdgeTriplet[VD,ED])⇒Iterator[(VertexId,A)],(A,A)⇒A)(ClassTag[A]):Graph[VD,ED]
 [PartitionStrategy]: api/scala/index.html#org.apache.spark.graphx.PartitionStrategy$

@@ -87,7 +87,7 @@ options for deployment:
 **Programming Guides:**
 * [Quick Start](quick-start.html): a quick introduction to the Spark API; start here!
-* [RDD Programming Guide](programming-guide.html): overview of Spark basics - RDDs (core but old API), accumulators, and broadcast variables
+* [RDD Programming Guide](rdd-programming-guide.html): overview of Spark basics - RDDs (core but old API), accumulators, and broadcast variables
 * [Spark SQL, Datasets, and DataFrames](sql-programming-guide.html): processing structured data with relational queries (newer API than RDDs)
 * [Structured Streaming](structured-streaming-programming-guide.html): processing structured data streams with relation queries (using Datasets and DataFrames, newer API than DStreams)
 * [Spark Streaming](streaming-programming-guide.html): processing data streams using DStreams (old API)

@@ -18,7 +18,7 @@ At a high level, it provides tools such as:
 **The MLlib RDD-based API is now in maintenance mode.**
-As of Spark 2.0, the [RDD](programming-guide.html#resilient-distributed-datasets-rdds)-based APIs in the `spark.mllib` package have entered maintenance mode.
+As of Spark 2.0, the [RDD](rdd-programming-guide.html#resilient-distributed-datasets-rdds)-based APIs in the `spark.mllib` package have entered maintenance mode.
 The primary Machine Learning API for Spark is now the [DataFrame](sql-programming-guide.html)-based API in the `spark.ml` package.
 *What are the implications?*

@@ -116,7 +116,7 @@ is a stochastic gradient. Here `$S$` is the sampled subset of size `$|S|=$ miniB
 $\cdot n$`.
 In each iteration, the sampling over the distributed dataset
-([RDD](programming-guide.html#resilient-distributed-datasets-rdds)), as well as the
+([RDD](rdd-programming-guide.html#resilient-distributed-datasets-rdds)), as well as the
 computation of the sum of the partial results from each worker machine is performed by the
 standard spark routines.

@@ -264,7 +264,7 @@ SPARK_WORKER_OPTS supports the following system properties:
 # Connecting an Application to the Cluster
 To run an application on the Spark cluster, simply pass the `spark://IP:PORT` URL of the master as to the [`SparkContext`
-constructor](programming-guide.html#initializing-spark).
+constructor](rdd-programming-guide.html#initializing-spark).
 To run an interactive Spark shell against the cluster, run the following command:

@@ -535,7 +535,7 @@ After a context is defined, you have to do the following.
 It represents a continuous stream of data, either the input data stream received from source,
 or the processed data stream generated by transforming the input stream. Internally,
 a DStream is represented by a continuous series of RDDs, which is Spark's abstraction of an immutable,
-distributed dataset (see [Spark Programming Guide](programming-guide.html#resilient-distributed-datasets-rdds) for more details). Each RDD in a DStream contains data from a certain interval,
+distributed dataset (see [Spark Programming Guide](rdd-programming-guide.html#resilient-distributed-datasets-rdds) for more details). Each RDD in a DStream contains data from a certain interval,
 as shown in the following figure.
 <p style="text-align: center;">
@@ -1531,7 +1531,7 @@ default persistence level is set to replicate the data to two nodes for fault-to
 Note that, unlike RDDs, the default persistence level of DStreams keeps the data serialized in
 memory. This is further discussed in the [Performance Tuning](#memory-tuning) section. More
-information on different persistence levels can be found in the [Spark Programming Guide](programming-guide.html#rdd-persistence).
+information on different persistence levels can be found in the [Spark Programming Guide](rdd-programming-guide.html#rdd-persistence).
 ***
@@ -1720,7 +1720,13 @@ batch interval that is at least 10 seconds. It can be set by using
 ## Accumulators, Broadcast Variables, and Checkpoints
-[Accumulators](programming-guide.html#accumulators) and [Broadcast variables](programming-guide.html#broadcast-variables) cannot be recovered from checkpoint in Spark Streaming. If you enable checkpointing and use [Accumulators](programming-guide.html#accumulators) or [Broadcast variables](programming-guide.html#broadcast-variables) as well, you'll have to create lazily instantiated singleton instances for [Accumulators](programming-guide.html#accumulators) and [Broadcast variables](programming-guide.html#broadcast-variables) so that they can be re-instantiated after the driver restarts on failure. This is shown in the following example.
+[Accumulators](rdd-programming-guide.html#accumulators) and [Broadcast variables](rdd-programming-guide.html#broadcast-variables)
+cannot be recovered from checkpoint in Spark Streaming. If you enable checkpointing and use
+[Accumulators](rdd-programming-guide.html#accumulators) or [Broadcast variables](rdd-programming-guide.html#broadcast-variables)
+as well, you'll have to create lazily instantiated singleton instances for
+[Accumulators](rdd-programming-guide.html#accumulators) and [Broadcast variables](rdd-programming-guide.html#broadcast-variables)
+so that they can be re-instantiated after the driver restarts on failure.
+This is shown in the following example.
 <div class="codetabs">
 <div data-lang="scala" markdown="1">
@@ -2182,7 +2188,7 @@ overall processing throughput of the system, its use is still recommended to ach
 consistent batch processing times. Make sure you set the CMS GC on both the driver (using `--driver-java-options` in `spark-submit`) and the executors (using [Spark configuration](configuration.html#runtime-environment) `spark.executor.extraJavaOptions`).
 * **Other tips**: To further reduce GC overheads, here are some more tips to try.
-  - Persist RDDs using the `OFF_HEAP` storage level. See more detail in the [Spark Programming Guide](programming-guide.html#rdd-persistence).
+  - Persist RDDs using the `OFF_HEAP` storage level. See more detail in the [Spark Programming Guide](rdd-programming-guide.html#rdd-persistence).
   - Use more executors with smaller heap sizes. This will reduce the GC pressure within each JVM heap.
 ***

@@ -12,7 +12,7 @@ Because of the in-memory nature of most Spark computations, Spark programs can b
 by any resource in the cluster: CPU, network bandwidth, or memory.
 Most often, if the data fits in memory, the bottleneck is network bandwidth, but sometimes, you
 also need to do some tuning, such as
-[storing RDDs in serialized form](programming-guide.html#rdd-persistence), to
+[storing RDDs in serialized form](rdd-programming-guide.html#rdd-persistence), to
 decrease memory usage.
 This guide will cover two main topics: data serialization, which is crucial for good network
 performance and can also reduce memory use, and memory tuning. We also sketch several smaller topics.
@@ -155,7 +155,7 @@ pointer-based data structures and wrapper objects. There are several ways to do
 When your objects are still too large to efficiently store despite this tuning, a much simpler way
 to reduce memory usage is to store them in *serialized* form, using the serialized StorageLevels in
-the [RDD persistence API](programming-guide.html#rdd-persistence), such as `MEMORY_ONLY_SER`.
+the [RDD persistence API](rdd-programming-guide.html#rdd-persistence), such as `MEMORY_ONLY_SER`.
 Spark will then store each RDD partition as one large byte array.
 The only downside of storing data in serialized form is slower access times, due to having to
 deserialize each object on the fly.
@@ -262,7 +262,7 @@ number of cores in your clusters.
 ## Broadcasting Large Variables
-Using the [broadcast functionality](programming-guide.html#broadcast-variables)
+Using the [broadcast functionality](rdd-programming-guide.html#broadcast-variables)
 available in `SparkContext` can greatly reduce the size of each serialized task, and the cost
 of launching a job over a cluster. If your tasks use any large object from the driver program
 inside of them (e.g. a static lookup table), consider turning it into a broadcast variable.