---
layout: global
title: Web UI
description: Web UI guide for Spark SPARK_VERSION_SHORT
license: |
  Licensed to the Apache Software Foundation (ASF) under one or more
  contributor license agreements. See the NOTICE file distributed with
  this work for additional information regarding copyright ownership.
  The ASF licenses this file to You under the Apache License, Version 2.0
  (the "License"); you may not use this file except in compliance with
  the License. You may obtain a copy of the License at

    http://www.apache.org/licenses/LICENSE-2.0

  Unless required by applicable law or agreed to in writing, software
  distributed under the License is distributed on an "AS IS" BASIS,
  WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
  See the License for the specific language governing permissions and
  limitations under the License.
---

Apache Spark provides a suite of web user interfaces (UIs) that you can use to monitor the status and resource consumption of your Spark cluster.

Table of Contents

  • This will become a table of contents (this text will be scraped).
{:toc}

Jobs Tab

The Jobs tab displays a summary page of all jobs in the Spark application and a details page for each job. The summary page shows high-level information, such as the status, duration, and progress of all jobs and the overall event timeline. When you click on a job on the summary page, you see the details page for that job. The details page further shows the event timeline, DAG visualization, and all stages of the job.

The information that is displayed in this section is

  • User: Current Spark user
  • Total uptime: Time since Spark application started
  • Scheduling mode: See [job scheduling](job-scheduling.html#scheduling-within-an-application)
  • Number of jobs per status: Active, Completed, Failed

Basic info

  • Event timeline: Displays in chronological order the events related to the executors (added, removed) and the jobs

Event timeline

  • Details of jobs grouped by status: Displays detailed information of the jobs including Job ID, description (with a link to detailed job page), submitted time, duration, stages summary and tasks progress bar

Details of jobs grouped by status

When you click on a specific job, you can see the detailed information of this job.

Jobs detail

This page displays the details of a specific job identified by its job ID.

  • Job Status: (running, succeeded, failed)
  • Number of stages per status (active, pending, completed, skipped, failed)
  • Associated SQL Query: Link to the SQL tab for this job
  • Event timeline: Displays in chronological order the events related to the executors (added, removed) and the stages of the job

Event timeline

  • DAG visualization: Visual representation of the directed acyclic graph of this job where vertices represent the RDDs or DataFrames and the edges represent an operation to be applied on the RDD (a textual sketch of this lineage is shown after this list).

DAG

  • List of stages (grouped by state active, pending, completed, skipped, and failed)
    • Stage ID
    • Description of the stage
    • Submitted timestamp
    • Duration of the stage
    • Tasks progress bar
    • Input: Bytes read from storage in this stage
    • Output: Bytes written in storage in this stage
    • Shuffle read: Total shuffle bytes and records read, includes both data read locally and data read from remote executors
    • Shuffle write: Bytes and records written to disk in order to be read by a shuffle in a future stage

DAG
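
As a minimal sketch of the lineage behind such a visualization (the input path and the operations below are only illustrative), `toDebugString` in a `spark-shell` session prints the same RDD graph in text form:

{% highlight scala %}
// Two stages separated by a shuffle: the job's DAG visualization shows the
// same structure that this textual lineage describes.
val counts = sc.textFile("README.md")      // illustrative input path
  .flatMap(_.split(" "))
  .map(word => (word, 1))
  .reduceByKey(_ + _)

println(counts.toDebugString)
{% endhighlight %}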

Stages Tab

The Stages tab displays a summary page that shows the current state of all stages of all jobs in the Spark application.

At the beginning of the page is the summary with the count of all stages by status (active, pending, completed, skipped, and failed).

Stages header

In Fair scheduling mode, there is a table that displays pool properties.

Pool properties
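
As a minimal sketch of how jobs end up in a pool (the pool name `pool1` and the allocation file path are only illustrative), the fair scheduler can be enabled with `spark.scheduler.mode=FAIR` and jobs assigned to a pool from a `spark-shell` session:

{% highlight scala %}
// Submit the application with, e.g.:
//   --conf spark.scheduler.mode=FAIR
//   --conf spark.scheduler.allocation.file=/path/to/fairscheduler.xml   (optional pool definitions)

// Jobs started from this thread run in the illustrative pool "pool1";
// the pool and its properties then show up in the table above.
sc.setLocalProperty("spark.scheduler.pool", "pool1")
sc.range(0, 100).count()

// Reset to the default pool.
sc.setLocalProperty("spark.scheduler.pool", null)
{% endhighlight %}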

After that are the details of stages per status (active, pending, completed, skipped, failed). In active stages, it's possible to kill the stage with the kill link. Only in failed stages is the failure reason shown. Task detail can be accessed by clicking on the description.

Stages detail

Stage detail

The stage detail page begins with information like total time across all tasks, Locality level summary, Shuffle Read Size / Records and Associated Job IDs.

Stage header

There is also a visual representation of the directed acyclic graph (DAG) of this stage, where vertices represent the RDDs or DataFrames and the edges represent an operation to be applied.

Stage DAG

Summary metrics for all tasks are represented in a table and in a timeline.

  • Tasks deserialization time
  • Duration of tasks.
  • GC time is the total JVM garbage collection time.
  • Result serialization time is the time spent serializing the task result on an executor before sending it back to the driver.
  • Getting result time is the time that the driver spends fetching task results from workers.
  • Scheduler delay is the time the task waits to be scheduled for execution.
  • Peak execution memory is the maximum memory used by the internal data structures created during shuffles, aggregations and joins.
  • Shuffle Read Size / Records. Total shuffle bytes and records read, including both data read locally and data read from remote executors.
  • Shuffle Read Blocked Time is the time that tasks spent blocked waiting for shuffle data to be read from remote machines.
  • Shuffle Remote Reads is the total shuffle bytes read from remote executors.
  • Shuffle spill (memory) is the size of the deserialized form of the shuffled data in memory.
  • Shuffle spill (disk) is the size of the serialized form of the data on disk.

Stages metrics

Aggregated metrics by executor show the same information aggregated by executor.

Stages metrics per executors

Accumulators are a type of shared variable. They provide a mutable variable that can be updated inside of a variety of transformations. It is possible to create accumulators with and without a name, but only named accumulators are displayed.

Stage accumulator
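
As a minimal sketch (the accumulator name `counter` is only illustrative), a named accumulator created in a `spark-shell` session appears on this page, while an unnamed one does not:

{% highlight scala %}
// A named accumulator: the name is what is shown in the stage's accumulator table.
val counter = sc.longAccumulator("counter")

// Updated inside a transformation; its value is reported per task and as a total.
sc.range(0, 100, 1, 5).map { x => counter.add(1); x }.count()

// An accumulator created without a name is still usable but is not displayed.
val anonymous = sc.longAccumulator
{% endhighlight %}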

The tasks detail section basically includes the same information as the summary section, but detailed by task. It also includes links to review the logs and the task attempt number if the task failed for any reason. If there are named accumulators, here it is possible to see the accumulator value at the end of each task.

Tasks

Storage Tab

The Storage tab displays the persisted RDDs and DataFrames, if any, in the application. The summary page shows the storage levels, sizes and partitions of all RDDs, and the details page shows the sizes and the executors used for all partitions in an RDD or DataFrame.

{% highlight scala %}
scala> import org.apache.spark.storage.StorageLevel._
import org.apache.spark.storage.StorageLevel._

scala> val rdd = sc.range(0, 100, 1, 5).setName("rdd")
rdd: org.apache.spark.rdd.RDD[Long] = rdd MapPartitionsRDD[1] at range at <console>:27

scala> rdd.persist(MEMORY_ONLY_SER)
res0: rdd.type = rdd MapPartitionsRDD[1] at range at <console>:27

scala> rdd.count
res1: Long = 100

scala> val df = Seq((1, "andy"), (2, "bob"), (2, "andy")).toDF("count", "name")
df: org.apache.spark.sql.DataFrame = [count: int, name: string]

scala> df.persist(DISK_ONLY)
res2: df.type = [count: int, name: string]

scala> df.count
res3: Long = 3
{% endhighlight %}

Storage tab

After running the above example, we can find two RDDs listed in the Storage tab. Basic information like storage level, number of partitions and memory overhead are provided. Note that the newly persisted RDDs or DataFrames are not shown in the tab before they are materialized. To monitor a specific RDD or DataFrame, make sure an action operation has been triggered.

Storage detail

You can click the RDD name 'rdd' to obtain the details of data persistence, such as the data distribution across the cluster.

Environment Tab

The Environment tab displays the values for the different environment and configuration variables, including JVM, Spark, and system properties.

Env tab

This environment page has five parts. It is a useful place to check whether your properties have been set correctly. The first part 'Runtime Information' simply contains the runtime properties like versions of Java and Scala. The second part 'Spark Properties' lists the application properties like 'spark.app.name' and 'spark.driver.memory'.

Hadoop Properties

Clicking the 'Hadoop Properties' link displays properties relative to Hadoop and YARN. Note that properties like ['spark.hadoop.*'](configuration.html#execution-behavior) are not shown in this part but in 'Spark Properties'.
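
As a minimal sketch of where configured properties end up (the application name and property values below are only illustrative), properties set when the session is created appear under 'Spark Properties', including any 'spark.hadoop.*' entries:

{% highlight scala %}
import org.apache.spark.sql.SparkSession

// All three properties below are listed under 'Spark Properties' on the Environment tab.
val spark = SparkSession.builder()
  .appName("web-ui-environment-demo")                 // shown as spark.app.name
  .config("spark.sql.shuffle.partitions", "8")        // an application property
  .config("spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version", "2") // a 'spark.hadoop.*' property
  .getOrCreate()
{% endhighlight %}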

System Properties

'System Properties' shows more details about the JVM.

Classpath Entries

The last part 'Classpath Entries' lists the classes loaded from different sources, which is very useful for resolving class conflicts.

Executors Tab

The Executors tab displays summary information about the executors that were created for the application, including memory and disk usage and task and shuffle information. The Storage Memory column shows the amount of memory used and reserved for caching data.

Executors Tab

The Executors tab provides not only resource information (amount of memory, disk, and cores used by each executor) but also performance information (GC time and shuffle information).

Stderr Log

Clicking the 'stderr' link of executor 0 displays its detailed standard error log in the console.

Thread Dump

Clicking the 'Thread Dump' link of executor 0 displays the thread dump of the JVM on executor 0, which can be useful for performance analysis.

SQL Tab

If the application executes Spark SQL queries, the SQL tab displays information, such as the duration, jobs, and physical and logical plans for the queries. Here we include a basic example to illustrate this tab:

{% highlight scala %}
scala> val df = Seq((1, "andy"), (2, "bob"), (2, "andy")).toDF("count", "name")
df: org.apache.spark.sql.DataFrame = [count: int, name: string]

scala> df.count
res0: Long = 3

scala> df.createGlobalTempView("df")

scala> spark.sql("select name,sum(count) from global_temp.df group by name").show
+----+----------+
|name|sum(count)|
+----+----------+
|andy|         3|
| bob|         2|
+----+----------+
{% endhighlight %}

SQL tab

Now the above three dataframe/SQL operators are shown in the list. If we click the 'show at <console>: 24' link of the last query, we will see the DAG of the job.

SQL DAG

We can see the detailed information of each stage. The first block 'WholeStageCodegen'
compiles multiple operators ('LocalTableScan' and 'HashAggregate') together into a single Java function to improve performance, and metrics like number of rows and spill size are listed in the block. The second block 'Exchange' shows the metrics on the shuffle exchange, including the number of written shuffle records, total data size, etc.

logical plans and the physical plan

Clicking the 'Details' link at the bottom displays the logical plans and the physical plan, which illustrate how Spark parses, analyzes, optimizes and performs the query.
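
The same plans can also be printed programmatically; as a minimal sketch, assuming the `df` and the view from the example above:

{% highlight scala %}
// Prints the parsed, analyzed and optimized logical plans and the physical plan,
// mirroring what the 'Details' link displays.
spark.sql("select name,sum(count) from global_temp.df group by name").explain(true)
{% endhighlight %}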

Streaming Tab

The web UI includes a Streaming tab if the application uses Spark Streaming. This tab displays scheduling delay and processing time for each micro-batch in the data stream, which can be useful for troubleshooting the streaming application.
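
As a minimal sketch of an application that makes this tab appear (the host `localhost` and port `9999` are only illustrative), assuming a `spark-shell` session:

{% highlight scala %}
import org.apache.spark.streaming.{Seconds, StreamingContext}

// A tiny DStream application: once ssc.start() is called, the Streaming tab
// reports scheduling delay and processing time for each 1-second micro-batch.
val ssc = new StreamingContext(sc, Seconds(1))
val lines = ssc.socketTextStream("localhost", 9999)   // illustrative socket source
lines.map(_.length).print()

ssc.start()
// ssc.awaitTermination()
{% endhighlight %}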