[SPARK-21267][DOCS][MINOR] Follow up to avoid referencing programming-guide redirector
## What changes were proposed in this pull request?
Update internal references from programming-guide to rdd-programming-guide
See 5ddf243fd8
and https://github.com/apache/spark/pull/18485#issuecomment-314789751
Let's keep the redirector even if it's problematic to build, but not rely on it internally.
## How was this patch tested?
(Doc build)
Author: Sean Owen <sowen@cloudera.com>
Closes #18625 from srowen/SPARK-21267.2.
parent ac5d5d7959
commit 74ac1fb081
@@ -593,7 +593,7 @@ setMethod("cache",
 #'
 #' Persist this SparkDataFrame with the specified storage level. For details of the
 #' supported storage levels, refer to
-#' \url{http://spark.apache.org/docs/latest/programming-guide.html#rdd-persistence}.
+#' \url{http://spark.apache.org/docs/latest/rdd-programming-guide.html#rdd-persistence}.
 #'
 #' @param x the SparkDataFrame to persist.
 #' @param newLevel storage level chosen for the persistance. See available options in
@@ -227,7 +227,7 @@ setMethod("cacheRDD",
 #'
 #' Persist this RDD with the specified storage level. For details of the
 #' supported storage levels, refer to
-#'\url{http://spark.apache.org/docs/latest/programming-guide.html#rdd-persistence}.
+#'\url{http://spark.apache.org/docs/latest/rdd-programming-guide.html#rdd-persistence}.
 #'
 #' @param x The RDD to persist
 #' @param newLevel The new storage level to be assigned
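Both R hunks above point readers at the RDD-persistence section for the list of storage levels. As a rough illustration of the persistence API those docs describe, here is a minimal Scala sketch for a DataFrame and an RDD; the data and the storage-level choices are illustrative, not taken from this patch:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.storage.StorageLevel

object PersistSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("persist-sketch").master("local[*]").getOrCreate()

    // Persist a DataFrame with an explicit storage level, as the R docs above describe.
    val df = spark.range(0, 1000).toDF("id")
    df.persist(StorageLevel.MEMORY_AND_DISK)
    df.count() // materializes the persisted data

    // The same idea for a plain RDD.
    val rdd = spark.sparkContext.parallelize(1 to 1000)
    rdd.persist(StorageLevel.MEMORY_ONLY)
    rdd.count()

    spark.stop()
  }
}
```

The storage level decides whether partitions are kept deserialized in memory, kept serialized, spilled to disk, or replicated, which is exactly what the linked rdd-persistence section enumerates.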
@@ -27,7 +27,7 @@ description: GraphX graph processing library guide for Spark SPARK_VERSION_SHORT
 [EdgeContext]: api/scala/index.html#org.apache.spark.graphx.EdgeContext
 [GraphOps.collectNeighborIds]: api/scala/index.html#org.apache.spark.graphx.GraphOps@collectNeighborIds(EdgeDirection):VertexRDD[Array[VertexId]]
 [GraphOps.collectNeighbors]: api/scala/index.html#org.apache.spark.graphx.GraphOps@collectNeighbors(EdgeDirection):VertexRDD[Array[(VertexId,VD)]]
-[RDD Persistence]: programming-guide.html#rdd-persistence
+[RDD Persistence]: rdd-programming-guide.html#rdd-persistence
 [Graph.cache]: api/scala/index.html#org.apache.spark.graphx.Graph@cache():Graph[VD,ED]
 [GraphOps.pregel]: api/scala/index.html#org.apache.spark.graphx.GraphOps@pregel[A](A,Int,EdgeDirection)((VertexId,VD,A)⇒VD,(EdgeTriplet[VD,ED])⇒Iterator[(VertexId,A)],(A,A)⇒A)(ClassTag[A]):Graph[VD,ED]
 [PartitionStrategy]: api/scala/index.html#org.apache.spark.graphx.PartitionStrategy$
@@ -87,7 +87,7 @@ options for deployment:
 **Programming Guides:**
 
 * [Quick Start](quick-start.html): a quick introduction to the Spark API; start here!
-* [RDD Programming Guide](programming-guide.html): overview of Spark basics - RDDs (core but old API), accumulators, and broadcast variables
+* [RDD Programming Guide](rdd-programming-guide.html): overview of Spark basics - RDDs (core but old API), accumulators, and broadcast variables
 * [Spark SQL, Datasets, and DataFrames](sql-programming-guide.html): processing structured data with relational queries (newer API than RDDs)
 * [Structured Streaming](structured-streaming-programming-guide.html): processing structured data streams with relation queries (using Datasets and DataFrames, newer API than DStreams)
 * [Spark Streaming](streaming-programming-guide.html): processing data streams using DStreams (old API)
@@ -18,7 +18,7 @@ At a high level, it provides tools such as:
 
 **The MLlib RDD-based API is now in maintenance mode.**
 
-As of Spark 2.0, the [RDD](programming-guide.html#resilient-distributed-datasets-rdds)-based APIs in the `spark.mllib` package have entered maintenance mode.
+As of Spark 2.0, the [RDD](rdd-programming-guide.html#resilient-distributed-datasets-rdds)-based APIs in the `spark.mllib` package have entered maintenance mode.
 The primary Machine Learning API for Spark is now the [DataFrame](sql-programming-guide.html)-based API in the `spark.ml` package.
 
 *What are the implications?*
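Since this excerpt points users from the RDD-based `spark.mllib` API to the DataFrame-based `spark.ml` package, here is a hedged sketch of the DataFrame-based workflow it has in mind; the column names and parameter values are illustrative, not part of this patch:

```scala
import org.apache.spark.ml.classification.LogisticRegression
import org.apache.spark.ml.linalg.Vectors
import org.apache.spark.sql.SparkSession

object MlSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("ml-sketch").master("local[*]").getOrCreate()

    // A tiny labeled DataFrame; "label" and "features" are the default column names spark.ml expects.
    val training = spark.createDataFrame(Seq(
      (1.0, Vectors.dense(0.0, 1.1, 0.1)),
      (0.0, Vectors.dense(2.0, 1.0, -1.0)),
      (0.0, Vectors.dense(2.0, 1.3, 1.0)),
      (1.0, Vectors.dense(0.0, 1.2, -0.5))
    )).toDF("label", "features")

    // DataFrame-based estimator from the spark.ml package.
    val lr = new LogisticRegression().setMaxIter(10).setRegParam(0.01)
    val model = lr.fit(training)
    model.transform(training).show()

    spark.stop()
  }
}
```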
@@ -116,7 +116,7 @@ is a stochastic gradient. Here `$S$` is the sampled subset of size `$|S|=$ miniBatchFraction
 $\cdot n$`.
 
 In each iteration, the sampling over the distributed dataset
-([RDD](programming-guide.html#resilient-distributed-datasets-rdds)), as well as the
+([RDD](rdd-programming-guide.html#resilient-distributed-datasets-rdds)), as well as the
 computation of the sum of the partial results from each worker machine is performed by the
 standard spark routines.
 
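For context on the excerpt above (the math itself is untouched by this patch), the mini-batch stochastic gradient it refers to can be written as follows; the notation is an assumption chosen to match the `$S$` and `miniBatchFraction` symbols in the surrounding guide:

```latex
% Objective: average loss over n training points
f(w) = \frac{1}{n} \sum_{i=1}^{n} L(w; x_i, y_i)

% Mini-batch stochastic gradient over the sampled subset S,
% with |S| = miniBatchFraction * n, and the resulting update:
\hat{g}(w_t) = \frac{1}{|S|} \sum_{i \in S} \nabla_w L(w_t; x_i, y_i),
\qquad
w_{t+1} = w_t - \gamma_t \, \hat{g}(w_t)
```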
@@ -264,7 +264,7 @@ SPARK_WORKER_OPTS supports the following system properties:
 # Connecting an Application to the Cluster
 
 To run an application on the Spark cluster, simply pass the `spark://IP:PORT` URL of the master as to the [`SparkContext`
-constructor](programming-guide.html#initializing-spark).
+constructor](rdd-programming-guide.html#initializing-spark).
 
 To run an interactive Spark shell against the cluster, run the following command:
 
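The standalone-mode excerpt says to pass the `spark://IP:PORT` master URL to the `SparkContext` constructor; a minimal sketch of that, with the host, port, and application name as placeholders:

```scala
import org.apache.spark.{SparkConf, SparkContext}

object ConnectSketch {
  def main(args: Array[String]): Unit = {
    // "spark://master-host:7077" stands in for the standalone master URL.
    val conf = new SparkConf()
      .setAppName("connect-sketch")
      .setMaster("spark://master-host:7077")
    val sc = new SparkContext(conf)

    // Trivial job to confirm the application is connected to the cluster.
    println(sc.parallelize(1 to 100).sum())

    sc.stop()
  }
}
```

For an interactive session, the same URL goes to `spark-shell` via its `--master` option.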
@@ -535,7 +535,7 @@ After a context is defined, you have to do the following.
 It represents a continuous stream of data, either the input data stream received from source,
 or the processed data stream generated by transforming the input stream. Internally,
 a DStream is represented by a continuous series of RDDs, which is Spark's abstraction of an immutable,
-distributed dataset (see [Spark Programming Guide](programming-guide.html#resilient-distributed-datasets-rdds) for more details). Each RDD in a DStream contains data from a certain interval,
+distributed dataset (see [Spark Programming Guide](rdd-programming-guide.html#resilient-distributed-datasets-rdds) for more details). Each RDD in a DStream contains data from a certain interval,
 as shown in the following figure.
 
 <p style="text-align: center;">
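To make the DStream description above concrete, a small sketch of a streaming job whose DStreams are each backed by one RDD per batch interval; the source host, port, and interval are illustrative:

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object DStreamSketch {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("dstream-sketch").setMaster("local[2]")
    // One RDD is produced per 10-second batch interval.
    val ssc = new StreamingContext(conf, Seconds(10))

    // Each DStream below is a continuous series of RDDs holding one interval's data.
    val lines = ssc.socketTextStream("localhost", 9999)
    val counts = lines.flatMap(_.split(" ")).map(word => (word, 1)).reduceByKey(_ + _)
    counts.print()

    ssc.start()
    ssc.awaitTermination()
  }
}
```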
@@ -1531,7 +1531,7 @@ default persistence level is set to replicate the data to two nodes for fault-tolerance.
 
 Note that, unlike RDDs, the default persistence level of DStreams keeps the data serialized in
 memory. This is further discussed in the [Performance Tuning](#memory-tuning) section. More
-information on different persistence levels can be found in the [Spark Programming Guide](programming-guide.html#rdd-persistence).
+information on different persistence levels can be found in the [Spark Programming Guide](rdd-programming-guide.html#rdd-persistence).
 
 ***
 
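As a one-method sketch of the explicit DStream persistence that note contrasts with the serialized default, with the window length and level chosen arbitrarily for illustration:

```scala
import org.apache.spark.storage.StorageLevel
import org.apache.spark.streaming.Seconds
import org.apache.spark.streaming.dstream.DStream

object DStreamPersistSketch {
  // A DStream reused across several operations (here, a window) can be persisted explicitly;
  // the level is only an example, since the DStream default already keeps data serialized.
  def cacheWindowed(events: DStream[String]): DStream[String] =
    events.window(Seconds(60)).persist(StorageLevel.MEMORY_AND_DISK_SER)
}
```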
@@ -1720,7 +1720,13 @@ batch interval that is at least 10 seconds. It can be set by using
 
 ## Accumulators, Broadcast Variables, and Checkpoints
 
-[Accumulators](programming-guide.html#accumulators) and [Broadcast variables](programming-guide.html#broadcast-variables) cannot be recovered from checkpoint in Spark Streaming. If you enable checkpointing and use [Accumulators](programming-guide.html#accumulators) or [Broadcast variables](programming-guide.html#broadcast-variables) as well, you'll have to create lazily instantiated singleton instances for [Accumulators](programming-guide.html#accumulators) and [Broadcast variables](programming-guide.html#broadcast-variables) so that they can be re-instantiated after the driver restarts on failure. This is shown in the following example.
+[Accumulators](rdd-programming-guide.html#accumulators) and [Broadcast variables](rdd-programming-guide.html#broadcast-variables)
+cannot be recovered from checkpoint in Spark Streaming. If you enable checkpointing and use
+[Accumulators](rdd-programming-guide.html#accumulators) or [Broadcast variables](rdd-programming-guide.html#broadcast-variables)
+as well, you'll have to create lazily instantiated singleton instances for
+[Accumulators](rdd-programming-guide.html#accumulators) and [Broadcast variables](rdd-programming-guide.html#broadcast-variables)
+so that they can be re-instantiated after the driver restarts on failure.
+This is shown in the following example.
 
 <div class="codetabs">
 <div data-lang="scala" markdown="1">
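The re-wrapped paragraph ends with "This is shown in the following example"; the guide's own example is untouched by this patch, but the lazily instantiated singleton it refers to looks roughly like this, with object and variable names that are illustrative rather than the guide's:

```scala
import org.apache.spark.SparkContext
import org.apache.spark.broadcast.Broadcast

// Lazily instantiated singleton so the broadcast variable can be re-created
// after the driver restarts from a checkpoint.
object WordFilter {
  @volatile private var instance: Broadcast[Seq[String]] = _

  def getInstance(sc: SparkContext): Broadcast[Seq[String]] = {
    if (instance == null) {
      synchronized {
        if (instance == null) {
          instance = sc.broadcast(Seq("a", "b", "c"))
        }
      }
    }
    instance
  }
}
```

Streaming code would call `WordFilter.getInstance(rdd.sparkContext)` inside `foreachRDD` on each batch, rather than capturing a broadcast created before the checkpointed restart.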
@@ -2182,7 +2188,7 @@ overall processing throughput of the system, its use is still recommended to achieve more
 consistent batch processing times. Make sure you set the CMS GC on both the driver (using `--driver-java-options` in `spark-submit`) and the executors (using [Spark configuration](configuration.html#runtime-environment) `spark.executor.extraJavaOptions`).
 
 * **Other tips**: To further reduce GC overheads, here are some more tips to try.
-- Persist RDDs using the `OFF_HEAP` storage level. See more detail in the [Spark Programming Guide](programming-guide.html#rdd-persistence).
+- Persist RDDs using the `OFF_HEAP` storage level. See more detail in the [Spark Programming Guide](rdd-programming-guide.html#rdd-persistence).
 - Use more executors with smaller heap sizes. This will reduce the GC pressure within each JVM heap.
 
 ***
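As a sketch of the executor-side GC setting the excerpt mentions, using the `spark.executor.extraJavaOptions` key it names; the specific JVM flag is an assumption for illustration and not part of this patch:

```scala
import org.apache.spark.SparkConf

object GcConfSketch {
  // Request the concurrent collector on the executors; the driver side would
  // use --driver-java-options on spark-submit instead of a SparkConf entry.
  val conf = new SparkConf()
    .setAppName("gc-conf-sketch")
    .set("spark.executor.extraJavaOptions", "-XX:+UseConcMarkSweepGC")
}
```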
@@ -12,7 +12,7 @@ Because of the in-memory nature of most Spark computations, Spark programs can be bottlenecked
 by any resource in the cluster: CPU, network bandwidth, or memory.
 Most often, if the data fits in memory, the bottleneck is network bandwidth, but sometimes, you
 also need to do some tuning, such as
-[storing RDDs in serialized form](programming-guide.html#rdd-persistence), to
+[storing RDDs in serialized form](rdd-programming-guide.html#rdd-persistence), to
 decrease memory usage.
 This guide will cover two main topics: data serialization, which is crucial for good network
 performance and can also reduce memory use, and memory tuning. We also sketch several smaller topics.
@@ -155,7 +155,7 @@ pointer-based data structures and wrapper objects. There are several ways to do
 
 When your objects are still too large to efficiently store despite this tuning, a much simpler way
 to reduce memory usage is to store them in *serialized* form, using the serialized StorageLevels in
-the [RDD persistence API](programming-guide.html#rdd-persistence), such as `MEMORY_ONLY_SER`.
+the [RDD persistence API](rdd-programming-guide.html#rdd-persistence), such as `MEMORY_ONLY_SER`.
 Spark will then store each RDD partition as one large byte array.
 The only downside of storing data in serialized form is slower access times, due to having to
 deserialize each object on the fly.
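A short sketch of the serialized storage level named in that excerpt, under the same caveat that the data and sizes are made up:

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.storage.StorageLevel

object SerializedPersistSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("ser-persist").setMaster("local[*]"))

    // MEMORY_ONLY_SER keeps each cached partition as one large byte array,
    // trading extra deserialization CPU for a smaller memory footprint.
    val pairs = sc.parallelize(1 to 100000).map(i => (i % 100, i.toLong))
    pairs.persist(StorageLevel.MEMORY_ONLY_SER)
    println(pairs.reduceByKey(_ + _).count())

    sc.stop()
  }
}
```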
@@ -262,7 +262,7 @@ number of cores in your clusters.
 
 ## Broadcasting Large Variables
 
-Using the [broadcast functionality](programming-guide.html#broadcast-variables)
+Using the [broadcast functionality](rdd-programming-guide.html#broadcast-variables)
 available in `SparkContext` can greatly reduce the size of each serialized task, and the cost
 of launching a job over a cluster. If your tasks use any large object from the driver program
 inside of them (e.g. a static lookup table), consider turning it into a broadcast variable.
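Finally, a sketch of the broadcast pattern from that last excerpt: a driver-side lookup table turned into a broadcast variable so each task reads it locally instead of shipping it in every closure; the table contents are made up:

```scala
import org.apache.spark.{SparkConf, SparkContext}

object BroadcastSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("broadcast-sketch").setMaster("local[*]"))

    // A static lookup table living on the driver; broadcast it once instead of
    // capturing it in every task closure.
    val countryNames = Map("DE" -> "Germany", "FR" -> "France", "US" -> "United States")
    val bcNames = sc.broadcast(countryNames)

    val codes = sc.parallelize(Seq("DE", "US", "FR", "DE"))
    val resolved = codes.map(code => bcNames.value.getOrElse(code, "unknown"))
    resolved.collect().foreach(println)

    sc.stop()
  }
}
```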