[SPARK-23165][DOC] Spelling mistake fix in quick-start doc.

## What changes were proposed in this pull request?

Fix spelling in quick-start doc.

## How was this patch tested?

Doc only.

Author: Shashwat Anand <me@shashwat.me>

Closes #20336 from ashashwat/SPARK-23165.
Shashwat Anand 2018-01-20 14:34:37 -08:00 committed by gatorsmile
parent 396cdfbea4
commit 84a076e0e9
14 changed files with 37 additions and 37 deletions

View file

@ -180,10 +180,10 @@ under the path, not the number of *new* files, so it can become a slow operation
The size of the window needs to be set to handle this.
1. Files only appear in an object store once they are completely written; there
-is no need for a worklow of write-then-rename to ensure that files aren't picked up
+is no need for a workflow of write-then-rename to ensure that files aren't picked up
while they are still being written. Applications can write straight to the monitored directory.
-1. Streams should only be checkpointed to an store implementing a fast and
+1. Streams should only be checkpointed to a store implementing a fast and
atomic `rename()` operation Otherwise the checkpointing may be slow and potentially unreliable.
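
To illustrate the checkpointing guidance above, here is a minimal, hypothetical PySpark sketch (the bucket names and paths are invented and not part of this patch); it reads files landing in an object store but checkpoints to a filesystem with a fast, atomic `rename()`:

{% highlight python %}
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("object-store-stream").getOrCreate()

# Applications can write straight to the monitored directory in the object store.
lines = spark.readStream.text("s3a://example-bucket/incoming/")

# Checkpoint to a store with a fast, atomic rename() (HDFS here), not the object store.
query = (lines.writeStream
         .format("parquet")
         .option("path", "s3a://example-bucket/output/")
         .option("checkpointLocation", "hdfs:///checkpoints/object-store-stream")
         .start())
query.awaitTermination()
{% endhighlight %}
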
## Further Reading

View file

@ -79,7 +79,7 @@ Then, you can supply configuration values at runtime:
{% endhighlight %}
The Spark shell and [`spark-submit`](submitting-applications.html)
-tool support two ways to load configurations dynamically. The first are command line options,
+tool support two ways to load configurations dynamically. The first is command line options,
such as `--master`, as shown above. `spark-submit` can accept any Spark property using the `--conf`
flag, but uses special flags for properties that play a part in launching the Spark application.
Running `./bin/spark-submit --help` will show the entire list of these options.
@ -413,7 +413,7 @@ Apart from these, the following properties are also available, and may be useful
<td>false</td>
<td>
Enable profiling in Python worker, the profile result will show up by <code>sc.show_profiles()</code>,
-or it will be displayed before the driver exiting. It also can be dumped into disk by
+or it will be displayed before the driver exits. It also can be dumped into disk by
<code>sc.dump_profiles(path)</code>. If some of the profile results had been displayed manually,
they will not be displayed automatically before driver exiting.
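
As a hedged usage sketch of the profiler described in this property (the application name and dump path are hypothetical, not from this patch):

{% highlight python %}
from pyspark import SparkConf, SparkContext

# Enable the Python worker profiler before the SparkContext is created.
conf = SparkConf().setAppName("profile-demo").set("spark.python.profile", "true")
sc = SparkContext(conf=conf)

sc.parallelize(range(1000000)).map(lambda x: x * x).count()

sc.show_profiles()                       # print the accumulated profile results
sc.dump_profiles("/tmp/spark-profiles")  # or dump them to disk instead
{% endhighlight %}
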
@ -446,7 +446,7 @@ Apart from these, the following properties are also available, and may be useful
<td>true</td>
<td>
Reuse Python worker or not. If yes, it will use a fixed number of Python workers,
-does not need to fork() a Python process for every tasks. It will be very useful
+does not need to fork() a Python process for every task. It will be very useful
if there is large broadcast, then the broadcast will not be needed to transferred
from JVM to Python worker for every task.
</td>
@ -1294,7 +1294,7 @@ Apart from these, the following properties are also available, and may be useful
<td><code>spark.files.openCostInBytes</code></td>
<td>4194304 (4 MB)</td>
<td>
-The estimated cost to open a file, measured by the number of bytes could be scanned in the same
+The estimated cost to open a file, measured by the number of bytes could be scanned at the same
time. This is used when putting multiple files into a partition. It is better to over estimate,
then the partitions with small files will be faster than partitions with bigger files.
</td>
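
A small, hypothetical sketch of overriding this estimate when building a session (the 8 MB value is only an example, not a recommendation from this patch):

{% highlight python %}
from pyspark.sql import SparkSession

# Over-estimating the per-file open cost is the safer direction: partitions
# holding many small files end up no slower than those with fewer large files.
spark = (SparkSession.builder
         .appName("open-cost-demo")
         .config("spark.files.openCostInBytes", str(8 * 1024 * 1024))
         .getOrCreate())
{% endhighlight %}
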
@ -1855,8 +1855,8 @@ Apart from these, the following properties are also available, and may be useful
<td><code>spark.user.groups.mapping</code></td>
<td><code>org.apache.spark.security.ShellBasedGroupsMappingProvider</code></td>
<td>
-The list of groups for a user are determined by a group mapping service defined by the trait
-org.apache.spark.security.GroupMappingServiceProvider which can configured by this property.
+The list of groups for a user is determined by a group mapping service defined by the trait
+org.apache.spark.security.GroupMappingServiceProvider which can be configured by this property.
A default unix shell based implementation is provided <code>org.apache.spark.security.ShellBasedGroupsMappingProvider</code>
which can be specified to resolve a list of groups for a user.
<em>Note:</em> This implementation supports only a Unix/Linux based environment. Windows environment is
@ -2465,7 +2465,7 @@ should be included on Spark's classpath:
The location of these configuration files varies across Hadoop versions, but
a common location is inside of `/etc/hadoop/conf`. Some tools create
-configurations on-the-fly, but offer a mechanisms to download copies of them.
+configurations on-the-fly, but offer a mechanism to download copies of them.
To make these files visible to Spark, set `HADOOP_CONF_DIR` in `$SPARK_HOME/conf/spark-env.sh`
to a location containing the configuration files.

View file

@ -708,7 +708,7 @@ messages remaining.
> messaging function. These constraints allow additional optimization within GraphX.
The following is the type signature of the [Pregel operator][GraphOps.pregel] as well as a *sketch*
-of its implementation (note: to avoid stackOverflowError due to long lineage chains, pregel support periodcally
+of its implementation (note: to avoid stackOverflowError due to long lineage chains, pregel support periodically
checkpoint graph and messages by setting "spark.graphx.pregel.checkpointInterval" to a positive number,
say 10. And set checkpoint directory as well using SparkContext.setCheckpointDir(directory: String)):
@ -928,7 +928,7 @@ switch to 2D-partitioning or other heuristics included in GraphX.
<!-- Images are downsized intentionally to improve quality on retina displays -->
</p>
-Once the edges have be partitioned the key challenge to efficient graph-parallel computation is
+Once the edges have been partitioned the key challenge to efficient graph-parallel computation is
efficiently joining vertex attributes with the edges. Because real-world graphs typically have more
edges than vertices, we move vertex attributes to the edges. Because not all partitions will
contain edges adjacent to all vertices we internally maintain a routing table which identifies where

View file

@ -118,7 +118,7 @@ The history server can be configured as follows:
<td>
The number of applications to retain UI data for in the cache. If this cap is exceeded, then
the oldest applications will be removed from the cache. If an application is not in the cache,
-it will have to be loaded from disk if its accessed from the UI.
+it will have to be loaded from disk if it is accessed from the UI.
</td>
</tr>
<tr>
@ -407,7 +407,7 @@ can be identified by their `[attempt-id]`. In the API listed below, when running
</tr>
</table>
-The number of jobs and stages which can retrieved is constrained by the same retention
+The number of jobs and stages which can be retrieved is constrained by the same retention
mechanism of the standalone Spark UI; `"spark.ui.retainedJobs"` defines the threshold
value triggering garbage collection on jobs, and `spark.ui.retainedStages` that for stages.
Note that the garbage collection takes place on playback: it is possible to retrieve
@ -422,10 +422,10 @@ These endpoints have been strongly versioned to make it easier to develop applic
* Individual fields will never be removed for any given endpoint
* New endpoints may be added
* New fields may be added to existing endpoints
-* New versions of the api may be added in the future at a separate endpoint (eg., `api/v2`). New versions are *not* required to be backwards compatible.
+* New versions of the api may be added in the future as a separate endpoint (eg., `api/v2`). New versions are *not* required to be backwards compatible.
* Api versions may be dropped, but only after at least one minor release of co-existing with a new api version.
-Note that even when examining the UI of a running applications, the `applications/[app-id]` portion is
+Note that even when examining the UI of running applications, the `applications/[app-id]` portion is
still required, though there is only one application available. Eg. to see the list of jobs for the
running app, you would go to `http://localhost:4040/api/v1/applications/[app-id]/jobs`. This is to
keep the paths consistent in both modes.
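
For illustration only (the application id below is made up), the jobs endpoint of a running application can be queried with any HTTP client:

{% highlight python %}
import requests

app_id = "app-20180120123456-0000"   # hypothetical application id
base = "http://localhost:4040/api/v1"

# The applications/[app-id] segment is required even though only one
# application is available when querying a running app's UI.
for job in requests.get("{}/applications/{}/jobs".format(base, app_id)).json():
    print(job["jobId"], job["status"])
{% endhighlight %}
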

View file

@ -67,7 +67,7 @@ res3: Long = 15
./bin/pyspark
-Or if PySpark is installed with pip in your current enviroment:
+Or if PySpark is installed with pip in your current environment:
pyspark
@ -156,7 +156,7 @@ One common data flow pattern is MapReduce, as popularized by Hadoop. Spark can i
>>> wordCounts = textFile.select(explode(split(textFile.value, "\s+")).alias("word")).groupBy("word").count()
{% endhighlight %}
-Here, we use the `explode` function in `select`, to transfrom a Dataset of lines to a Dataset of words, and then combine `groupBy` and `count` to compute the per-word counts in the file as a DataFrame of 2 columns: "word" and "count". To collect the word counts in our shell, we can call `collect`:
+Here, we use the `explode` function in `select`, to transform a Dataset of lines to a Dataset of words, and then combine `groupBy` and `count` to compute the per-word counts in the file as a DataFrame of 2 columns: "word" and "count". To collect the word counts in our shell, we can call `collect`:
{% highlight python %}
>>> wordCounts.collect()
@ -422,7 +422,7 @@ $ YOUR_SPARK_HOME/bin/spark-submit \
Lines with a: 46, Lines with b: 23
{% endhighlight %}
-If you have PySpark pip installed into your enviroment (e.g., `pip install pyspark`), you can run your application with the regular Python interpreter or use the provided 'spark-submit' as you prefer.
+If you have PySpark pip installed into your environment (e.g., `pip install pyspark`), you can run your application with the regular Python interpreter or use the provided 'spark-submit' as you prefer.
{% highlight bash %}
# Use the Python interpreter to run your application

View file

@ -154,7 +154,7 @@ can find the results of the driver from the Mesos Web UI.
To use cluster mode, you must start the `MesosClusterDispatcher` in your cluster via the `sbin/start-mesos-dispatcher.sh` script,
passing in the Mesos master URL (e.g: mesos://host:5050). This starts the `MesosClusterDispatcher` as a daemon running on the host.
-By setting the Mesos proxy config property (requires mesos version >= 1.4), `--conf spark.mesos.proxy.baseURL=http://localhost:5050` when launching the dispacther, the mesos sandbox URI for each driver is added to the mesos dispatcher UI.
+By setting the Mesos proxy config property (requires mesos version >= 1.4), `--conf spark.mesos.proxy.baseURL=http://localhost:5050` when launching the dispatcher, the mesos sandbox URI for each driver is added to the mesos dispatcher UI.
If you like to run the `MesosClusterDispatcher` with Marathon, you need to run the `MesosClusterDispatcher` in the foreground (i.e: `bin/spark-class org.apache.spark.deploy.mesos.MesosClusterDispatcher`). Note that the `MesosClusterDispatcher` not yet supports multiple instances for HA.

View file

@ -445,7 +445,7 @@ To use a custom metrics.properties for the application master and executors, upd
<code>yarn.nodemanager.log-aggregation.roll-monitoring-interval-seconds</code> should be
configured in yarn-site.xml.
This feature can only be used with Hadoop 2.6.4+. The Spark log4j appender needs be changed to use
-FileAppender or another appender that can handle the files being removed while its running. Based
+FileAppender or another appender that can handle the files being removed while it is running. Based
on the file name configured in the log4j configuration (like spark.log), the user should set the
regex (spark*) to include all the log files that need to be aggregated.
</td>

View file

@ -62,7 +62,7 @@ component-specific configuration namespaces used to override the default setting
</tr>
</table>
-The full breakdown of available SSL options can be found on the [configuration page](configuration.html).
+The full breakdown of available SSL options can be found on the [configuration page](configuration.html).
SSL must be configured on each node and configured for each component involved in communication using the particular protocol.
### YARN mode

View file

@ -1253,7 +1253,7 @@ provide a ClassTag.
(Note that this is different than the Spark SQL JDBC server, which allows other applications to
run queries using Spark SQL).
-To get started you will need to include the JDBC driver for you particular database on the
+To get started you will need to include the JDBC driver for your particular database on the
spark classpath. For example, to connect to postgres from the Spark Shell you would run the
following command:
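
The shell command itself lies outside this hunk; as a separate hedged sketch (the connection details are hypothetical and not part of this patch), once the PostgreSQL driver jar is on the classpath a table can be loaded through the JDBC data source:

{% highlight python %}
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("jdbc-demo").getOrCreate()

# Hypothetical connection details; the postgres JDBC driver jar must already
# be on the Spark classpath (e.g. passed via --jars / --driver-class-path).
accounts = (spark.read
            .format("jdbc")
            .option("url", "jdbc:postgresql://dbhost:5432/reporting")
            .option("dbtable", "public.accounts")
            .option("user", "report_user")
            .option("password", "secret")
            .load())
accounts.show()
{% endhighlight %}
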
@ -1793,7 +1793,7 @@ options.
- Since Spark 2.3, when all inputs are binary, `functions.concat()` returns an output as binary. Otherwise, it returns as a string. Until Spark 2.3, it always returns as a string despite of input types. To keep the old behavior, set `spark.sql.function.concatBinaryAsString` to `true`.
- Since Spark 2.3, when all inputs are binary, SQL `elt()` returns an output as binary. Otherwise, it returns as a string. Until Spark 2.3, it always returns as a string despite of input types. To keep the old behavior, set `spark.sql.function.eltOutputAsString` to `true`.
-- Since Spark 2.3, by default arithmetic operations between decimals return a rounded value if an exact representation is not possible (instead of returning NULL). This is compliant to SQL ANSI 2011 specification and Hive's new behavior introduced in Hive 2.2 (HIVE-15331). This involves the following changes
+- Since Spark 2.3, by default arithmetic operations between decimals return a rounded value if an exact representation is not possible (instead of returning NULL). This is compliant with SQL ANSI 2011 specification and Hive's new behavior introduced in Hive 2.2 (HIVE-15331). This involves the following changes
- The rules to determine the result type of an arithmetic operation have been updated. In particular, if the precision / scale needed are out of the range of available values, the scale is reduced up to 6, in order to prevent the truncation of the integer part of the decimals. All the arithmetic operations are affected by the change, ie. addition (`+`), subtraction (`-`), multiplication (`*`), division (`/`), remainder (`%`) and positive module (`pmod`).
- Literal values used in SQL operations are converted to DECIMAL with the exact precision and scale needed by them.
- The configuration `spark.sql.decimalOperations.allowPrecisionLoss` has been introduced. It defaults to `true`, which means the new behavior described here; if set to `false`, Spark uses previous rules, ie. it doesn't adjust the needed scale to represent the values and it returns NULL if an exact representation of the value is not possible.
@ -1821,7 +1821,7 @@ options.
transformations (e.g., `map`, `filter`, and `groupByKey`) and untyped transformations (e.g.,
`select` and `groupBy`) are available on the Dataset class. Since compile-time type-safety in
Python and R is not a language feature, the concept of Dataset does not apply to these languages
-APIs. Instead, `DataFrame` remains the primary programing abstraction, which is analogous to the
+APIs. Instead, `DataFrame` remains the primary programming abstraction, which is analogous to the
single-node data frame notion in these languages.
- Dataset and DataFrame API `unionAll` has been deprecated and replaced by `union`
@ -1997,7 +1997,7 @@ Java and Python users will need to update their code.
Prior to Spark 1.3 there were separate Java compatible classes (`JavaSQLContext` and `JavaSchemaRDD`)
that mirrored the Scala API. In Spark 1.3 the Java API and Scala API have been unified. Users
-of either language should use `SQLContext` and `DataFrame`. In general theses classes try to
+of either language should use `SQLContext` and `DataFrame`. In general these classes try to
use types that are usable from both languages (i.e. `Array` instead of language specific collections).
In some cases where no common type exists (e.g., for passing in closures or Maps) function overloading
is used instead.

View file

@ -42,7 +42,7 @@ Create <code>core-site.xml</code> and place it inside Spark's <code>conf</code>
The main category of parameters that should be configured are the authentication parameters
required by Keystone.
-The following table contains a list of Keystone mandatory parameters. <code>PROVIDER</code> can be
+The following table contains a list of Keystone mandatory parameters. <code>PROVIDER</code> can be
any (alphanumeric) name.
<table class="table">

View file

@ -74,7 +74,7 @@ import org.apache.spark.streaming._
import org.apache.spark.streaming.StreamingContext._ // not necessary since Spark 1.3
// Create a local StreamingContext with two working thread and batch interval of 1 second.
-// The master requires 2 cores to prevent from a starvation scenario.
+// The master requires 2 cores to prevent a starvation scenario.
val conf = new SparkConf().setMaster("local[2]").setAppName("NetworkWordCount")
val ssc = new StreamingContext(conf, Seconds(1))
@ -172,7 +172,7 @@ each line will be split into multiple words and the stream of words is represent
`words` DStream. Note that we defined the transformation using a
[FlatMapFunction](api/scala/index.html#org.apache.spark.api.java.function.FlatMapFunction) object.
As we will discover along the way, there are a number of such convenience classes in the Java API
-that help define DStream transformations.
+that help defines DStream transformations.
Next, we want to count these words.

View file

@ -125,7 +125,7 @@ df.selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)")
### Creating a Kafka Source for Batch Queries
If you have a use case that is better suited to batch processing,
-you can create an Dataset/DataFrame for a defined range of offsets.
+you can create a Dataset/DataFrame for a defined range of offsets.
<div class="codetabs">
<div data-lang="scala" markdown="1">
@ -597,7 +597,7 @@ Note that the following Kafka params cannot be set and the Kafka source or sink
- **key.serializer**: Keys are always serialized with ByteArraySerializer or StringSerializer. Use
DataFrame operations to explicitly serialize the keys into either strings or byte arrays.
- **value.serializer**: values are always serialized with ByteArraySerializer or StringSerializer. Use
-DataFrame oeprations to explicitly serialize the values into either strings or byte arrays.
+DataFrame operations to explicitly serialize the values into either strings or byte arrays.
- **enable.auto.commit**: Kafka source doesn't commit any offset.
- **interceptor.classes**: Kafka source always read keys and values as byte arrays. It's not safe to
use ConsumerInterceptor as it may break the query.
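
A hedged sketch of the explicit-serialization approach described above (the broker, topic, and data are hypothetical, not from this patch): keys and values are cast with DataFrame operations rather than by configuring Kafka serializers:

{% highlight python %}
from pyspark.sql import SparkSession

# Assumes the spark-sql-kafka-0-10 connector is on the classpath.
spark = SparkSession.builder.appName("kafka-sink-demo").getOrCreate()

# Any DataFrame with 'key' and 'value' columns works; this one is made up.
df = spark.createDataFrame([("k1", "v1"), ("k2", "v2")], ["key", "value"])

# Serialize keys and values by casting to STRING instead of setting
# key.serializer / value.serializer on the sink.
(df.selectExpr("CAST(key AS STRING) AS key", "CAST(value AS STRING) AS value")
   .write
   .format("kafka")
   .option("kafka.bootstrap.servers", "broker1:9092")
   .option("topic", "demo-topic")
   .save())
{% endhighlight %}
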

View file

@ -10,7 +10,7 @@ title: Structured Streaming Programming Guide
# Overview
Structured Streaming is a scalable and fault-tolerant stream processing engine built on the Spark SQL engine. You can express your streaming computation the same way you would express a batch computation on static data. The Spark SQL engine will take care of running it incrementally and continuously and updating the final result as streaming data continues to arrive. You can use the [Dataset/DataFrame API](sql-programming-guide.html) in Scala, Java, Python or R to express streaming aggregations, event-time windows, stream-to-batch joins, etc. The computation is executed on the same optimized Spark SQL engine. Finally, the system ensures end-to-end exactly-once fault-tolerance guarantees through checkpointing and Write Ahead Logs. In short, *Structured Streaming provides fast, scalable, fault-tolerant, end-to-end exactly-once stream processing without the user having to reason about streaming.*
-Internally, by default, Structured Streaming queries are processed using a *micro-batch processing* engine, which processes data streams as a series of small batch jobs thereby achieving end-to-end latencies as low as 100 milliseconds and exactly-once fault-tolerance guarantees. However, since Spark 2.3, we have introduced a new low-latency processing mode called **Continuous Processing**, which can achieve end-to-end latencies as low as 1 millisecond with at-least-once guarantees. Without changing the Dataset/DataFrame operations in your queries, you will be able choose the mode based on your application requirements.
+Internally, by default, Structured Streaming queries are processed using a *micro-batch processing* engine, which processes data streams as a series of small batch jobs thereby achieving end-to-end latencies as low as 100 milliseconds and exactly-once fault-tolerance guarantees. However, since Spark 2.3, we have introduced a new low-latency processing mode called **Continuous Processing**, which can achieve end-to-end latencies as low as 1 millisecond with at-least-once guarantees. Without changing the Dataset/DataFrame operations in your queries, you will be able to choose the mode based on your application requirements.
In this guide, we are going to walk you through the programming model and the APIs. We are going to explain the concepts mostly using the default micro-batch processing model, and then [later](#continuous-processing-experimental) discuss Continuous Processing model. First, let's start with a simple example of a Structured Streaming query - a streaming word count.
@ -1121,7 +1121,7 @@ Lets discuss the different types of supported stream-stream joins and how to
##### Inner Joins with optional Watermarking
Inner joins on any kind of columns along with any kind of join conditions are supported.
However, as the stream runs, the size of streaming state will keep growing indefinitely as
-*all* past input must be saved as the any new input can match with any input from the past.
+*all* past input must be saved as any new input can match with any input from the past.
To avoid unbounded state, you have to define additional join conditions such that indefinitely
old inputs cannot match with future inputs and therefore can be cleared from the state.
In other words, you will have to do the following additional steps in the join.
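
The enumerated steps themselves are not shown in this hunk; as a hedged sketch of the resulting constrained join (the stream sources, column names, and thresholds below are all hypothetical):

{% highlight python %}
from pyspark.sql import SparkSession
from pyspark.sql.functions import expr

spark = SparkSession.builder.appName("stream-join-demo").getOrCreate()

# Hypothetical streams; the built-in "rate" source just supplies timestamps.
impressions = (spark.readStream.format("rate").load()
               .selectExpr("value AS impressionAdId", "timestamp AS impressionTime"))
clicks = (spark.readStream.format("rate").load()
          .selectExpr("value AS clickAdId", "timestamp AS clickTime"))

# Watermarks plus a time-range join condition bound how much past input
# must be kept in state before it can be dropped.
joined = (impressions.withWatermark("impressionTime", "2 hours")
          .join(clicks.withWatermark("clickTime", "3 hours"),
                expr("""
                    clickAdId = impressionAdId AND
                    clickTime >= impressionTime AND
                    clickTime <= impressionTime + interval 1 hour
                """)))
{% endhighlight %}
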
@ -1839,7 +1839,7 @@ aggDF \
.format("console") \
.start()
-# Have all the aggregates in an in memory table. The query name will be the table name
+# Have all the aggregates in an in-memory table. The query name will be the table name
aggDF \
.writeStream \
.queryName("aggregates") \

View file

@ -5,7 +5,7 @@ title: Submitting Applications
The `spark-submit` script in Spark's `bin` directory is used to launch applications on a cluster.
It can use all of Spark's supported [cluster managers](cluster-overview.html#cluster-manager-types)
-through a uniform interface so you don't have to configure your application specially for each one.
+through a uniform interface so you don't have to configure your application especially for each one.
# Bundling Your Application's Dependencies
If your code depends on other projects, you will need to package them alongside
@ -58,7 +58,7 @@ for applications that involve the REPL (e.g. Spark shell).
Alternatively, if your application is submitted from a machine far from the worker machines (e.g.
locally on your laptop), it is common to use `cluster` mode to minimize network latency between
-the drivers and the executors. Currently, standalone mode does not support cluster mode for Python
+the drivers and the executors. Currently, the standalone mode does not support cluster mode for Python
applications.
For Python applications, simply pass a `.py` file in the place of `<application-jar>` instead of a JAR,
@ -68,7 +68,7 @@ There are a few options available that are specific to the
[cluster manager](cluster-overview.html#cluster-manager-types) that is being used.
For example, with a [Spark standalone cluster](spark-standalone.html) with `cluster` deploy mode,
you can also specify `--supervise` to make sure that the driver is automatically restarted if it
-fails with non-zero exit code. To enumerate all such options available to `spark-submit`,
+fails with a non-zero exit code. To enumerate all such options available to `spark-submit`,
run it with `--help`. Here are a few examples of common options:
{% highlight bash %}
@ -192,7 +192,7 @@ debugging information by running `spark-submit` with the `--verbose` option.
# Advanced Dependency Management
When using `spark-submit`, the application jar along with any jars included with the `--jars` option
-will be automatically transferred to the cluster. URLs supplied after `--jars` must be separated by commas. That list is included on the driver and executor classpaths. Directory expansion does not work with `--jars`.
+will be automatically transferred to the cluster. URLs supplied after `--jars` must be separated by commas. That list is included in the driver and executor classpaths. Directory expansion does not work with `--jars`.
Spark uses the following URL scheme to allow different strategies for disseminating jars: