## What changes were proposed in this pull request?
This proposes to improve Spark instrumentation by adding a hook for user-defined metrics, extending Spark’s Dropwizard/Codahale metrics system.
The original motivation of this work was to add instrumentation for S3 filesystem access metrics by Spark job. Currently, [[ExecutorSource]] instruments HDFS and local filesystem metrics. Rather than extending the code there, we proposes with this JIRA to add a metrics plugin system which is of more flexible and general use.
Context: The Spark metrics system provides a large variety of metrics, see also , useful to monitor and troubleshoot Spark workloads. A typical workflow is to sink the metrics to a storage system and build dashboards on top of that.
Highlights:
- The metric plugin system makes it easy to implement instrumentation for S3 access by Spark jobs.
- The metrics plugin system allows for easy extensions of how Spark collects HDFS-related workload metrics. This is currently done using the Hadoop Filesystem GetAllStatistics method, which is deprecated in recent versions of Hadoop. Recent versions of Hadoop Filesystem recommend using method GetGlobalStorageStatistics, which also provides several additional metrics. GetGlobalStorageStatistics is not available in Hadoop 2.7 (had been introduced in Hadoop 2.8). Using a metric plugin for Spark would allow an easy way to “opt in” using such new API calls for those deploying suitable Hadoop versions.
- We also have the use case of adding Hadoop filesystem monitoring for a custom Hadoop compliant filesystem in use in our organization (EOS using the XRootD protocol). The metrics plugin infrastructure makes this easy to do. Others may have similar use cases.
- More generally, this method makes it straightforward to plug in Filesystem and other metrics to the Spark monitoring system. Future work on plugin implementation can address extending monitoring to measure usage of external resources (OS, filesystem, network, accelerator cards, etc), that maybe would not normally be considered general enough for inclusion in Apache Spark code, but that can be nevertheless useful for specialized use cases, tests or troubleshooting.
Implementation:
The proposed implementation extends and modifies the work on Executor Plugin of SPARK-24918. Additionally, this is related to recent work on extending Spark executor metrics, such as SPARK-25228.
As discussed during the review, the implementaiton of this feature modifies the Developer API for Executor Plugins, such that the new version is incompatible with the original version in Spark 2.4.
## How was this patch tested?
This modifies existing tests for ExecutorPluginSuite to adapt them to the API changes. In addition, the new funtionality for registering pluginMetrics has been manually tested running Spark on YARN and K8S clusters, in particular for monitoring S3 and for extending HDFS instrumentation with the Hadoop Filesystem “GetGlobalStorageStatistics” metrics. Executor metric plugin example and code used for testing are available, for example at: https://github.com/cerndb/SparkExecutorPluginsCloses#24901 from LucaCanali/executorMetricsPlugin.
Authored-by: Luca Canali <luca.canali@cern.ch>
Signed-off-by: Marcelo Vanzin <vanzin@cloudera.com>
## What changes were proposed in this pull request?
Today all registered metric sources are reported to GraphiteSink with no filtering mechanism, although the codahale project does support it.
GraphiteReporter (ScheduledReporter) from the codahale project requires you implement and supply the MetricFilter interface (there is only a single implementation by default in the codahale project, MetricFilter.ALL).
Propose to add an additional regex config to match and filter metrics to the GraphiteSink
## How was this patch tested?
Included a GraphiteSinkSuite that tests:
1. Absence of regex filter (existing default behavior maintained)
2. Presence of `regex=<regexexpr>` correctly filters metric keys
Closes#25232 from nkarpov/graphite_regex.
Authored-by: Nick Karpov <nick@nickkarpov.com>
Signed-off-by: jerryshao <jerryshao@tencent.com>
## What changes were proposed in this pull request?
The motivation for these additional metrics is to help in troubleshooting and monitoring task execution workload when running on a cluster. Currently available metrics include executor threadpool metrics for task completed and for active tasks. The addition of threadpool taskStarted metric will allow for example to collect info on the (approximate) number of failed tasks by computing the difference thread started – (active threads + completed tasks and/or successfully finished tasks).
The proposed metric finishedTasks is also intended for this type of troubleshooting. The difference between finshedTasks and threadpool.completeTasks, is that the latter is a (dropwizard library) gauge taken from the threadpool, while the former is a (dropwizard) counter computed in the [[Executor]] class, when a task successfully finishes, together with several other task metrics counters.
Note, there are similarities with some of the metrics introduced in SPARK-24398, however there are key differences, coming from the fact that this PR concerns the executor source, therefore providing metric values per executor + metric values do not require to pass through the listerner bus in this case.
## How was this patch tested?
Manually tested on a YARN cluster
Closes#22290 from LucaCanali/AddMetricExecutorStartedTasks.
Lead-authored-by: Luca Canali <luca.canali@cern.ch>
Co-authored-by: LucaCanali <luca.canali@cern.ch>
Signed-off-by: Marcelo Vanzin <vanzin@cloudera.com>
## What changes were proposed in this pull request?
Up to now, Apache Spark maintains the given event log directory by **time** policy, `spark.history.fs.cleaner.maxAge`. However, there are two issues.
1. Some file system has a limitation on the maximum number of files in a single directory. For example, HDFS `dfs.namenode.fs-limits.max-directory-items` is 1024 * 1024 by default.
https://hadoop.apache.org/docs/r3.2.0/hadoop-project-dist/hadoop-hdfs/hdfs-default.xml
2. Spark is sometimes unable to to clean up some old log files due to permission issues (mainly, security policy).
To handle both (1) and (2), this PR aims to support an additional policy configuration for the maximum number of files in the event log directory, `spark.history.fs.cleaner.maxNum`. Spark will try to keep the number of files in the event log directory according to this policy.
## How was this patch tested?
Pass the Jenkins with a newly added test case.
Closes#25072 from dongjoon-hyun/SPARK-28294.
Authored-by: Dongjoon Hyun <dhyun@apple.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
## What changes were proposed in this pull request?
SPARK-27773 has introduced a new metric (counter) numCaughtExceptions to the Spark Dropwizard monitoring system. This PR adds an entry in the monitoring documentation to document this.
Closes#24790 from LucaCanali/addDocFollowingSPARK27773.
Authored-by: Luca Canali <luca.canali@cern.ch>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
## What changes were proposed in this pull request?
Add AL2 license to metadata of all .md files.
This seemed to be the tidiest way as it will get ignored by .md renderers and other tools. Attempts to write them as markdown comments revealed that there is no such standard thing.
## How was this patch tested?
Doc build
Closes#24243 from srowen/SPARK-26918.
Authored-by: Sean Owen <sean.owen@databricks.com>
Signed-off-by: Sean Owen <sean.owen@databricks.com>
## What changes were proposed in this pull request?
This applies some minor updates/cleaning following up SPARK-26928, notably renaming JVMCPU.scala to JVMCPUSource.scala.
## How was this patch tested?
Manually tested
Closes#24201 from LucaCanali/fixupSPARK-26928.
Authored-by: Luca Canali <luca.canali@cern.ch>
Signed-off-by: Sean Owen <sean.owen@databricks.com>
## What changes were proposed in this pull request?
Add completedStages metircs for namespace=appStatus for monitoring.md.
Closes#24109 from hehuiyuan/hehuiyuan-patch-5.
Authored-by: hehuiyuan <hehuiyuan@ZBMAC-C02WD3K5H.local>
Signed-off-by: Sean Owen <sean.owen@databricks.com>
## What changes were proposed in this pull request?
A sub-task of [SPARK-23206](https://issues.apache.org/jira/browse/SPARK-23206)
Add Executor level metrics to monitoring docs
## How was this patch tested?
jekyll
Closes#24090 from LantaoJin/SPARK-27157.
Authored-by: Lantao Jin <jinlantao@gmail.com>
Signed-off-by: Sean Owen <sean.owen@databricks.com>
## What changes were proposed in this pull request?
This proposes to add instrumentation for the driver's JVM CPU time via the Spark Dropwizard/Codahale metrics system. It follows directly from previous work SPARK-25228 and shares similar motivations: it is intended as an improvement to be used for Spark performance dashboards and monitoring tools/instrumentation.
Implementation details: this PR takes the code introduced in SPARK-25228 and moves it to a new separate Source JVMCPUSource, which is then used to register the jvmCpuTime gauge metric for both executor and driver.
The registration of the jvmCpuTime metric for the driver is conditional, a new configuration parameter `spark.metrics.cpu.time.driver.enabled` (proposed default: false) is introduced for this purpose.
## How was this patch tested?
Manually tested, using local mode and using YARN.
Closes#23838 from LucaCanali/addCPUTimeMetricDriver.
Authored-by: Luca Canali <luca.canali@cern.ch>
Signed-off-by: Marcelo Vanzin <vanzin@cloudera.com>
## What changes were proposed in this pull request?
This PR proposes to extend the documentation of the Spark metrics system in the monitoring guide. In particular by:
- adding a list of the available metrics grouped per component instance
- adding information on configuration parameters that can be used to configure the metrics system in alternative to the metrics.properties file
- adding information on the configuration parameters needed to enable certain metrics
- it also propose to add an example of Graphite sink configuration in metrics.properties.template
Closes#23798 from LucaCanali/metricsDocUpdate.
Authored-by: Luca Canali <luca.canali@cern.ch>
Signed-off-by: Sean Owen <sean.owen@databricks.com>
## What changes were proposed in this pull request?
Metrics instance "mesos_cluster" exists in spark, but not mentioned in monitoring.md. This PR add it.
## How was this patch tested?
Manually test.
Closes#23691 from SongYadong/doc_mesos_metrics_inst.
Authored-by: SongYadong <song.yadong1@zte.com.cn>
Signed-off-by: Sean Owen <sean.owen@databricks.com>
## What changes were proposed in this pull request?
This patch proposes adding a new configuration on SHS: custom executor log URL pattern. This will enable end users to replace executor logs to other than RM provide, like external log service, which enables to serve executor logs when NodeManager becomes unavailable in case of YARN.
End users can build their own of custom executor log URLs with pre-defined patterns which would be vary on each resource manager. This patch adds some patterns to YARN resource manager. (For others, there's even no executor log url available so cannot define patterns as well.)
Please refer the doc change as well as added UTs in this patch to see how to set up the feature.
## How was this patch tested?
Added UT, as well as manual test with YARN cluster
Closes#23260 from HeartSaVioR/SPARK-26311.
Authored-by: Jungtaek Lim (HeartSaVioR) <kabhwan@gmail.com>
Signed-off-by: Marcelo Vanzin <vanzin@cloudera.com>
Currently, we do not have a mechanism to collect driver logs if a user chooses
to run their application in client mode. This is a big issue as admin teams need
to create their own mechanisms to capture driver logs.
This commit adds a logger which, if enabled, adds a local log appender to the
root logger and asynchronously syncs it an application specific log file on hdfs
(Spark Driver Log Dir).
Additionally, this collects spark-shell driver logs at INFO level by default.
The change is that instead of setting root logger level to WARN, we will set the
consoleAppender threshold to WARN, in case of spark-shell. This ensures that
only WARN logs are printed on CONSOLE but other log appenders still capture INFO
(or the default log level logs).
1. Verified that logs are written to local and remote dir
2. Added a unit test case
3. Verified this for spark-shell, client mode and pyspark.
4. Verified in both non-kerberos and kerberos environment
5. Verified with following unexpected termination conditions: Ctrl + C, Driver
OOM, Large Log Files
6. Ran an application in spark-shell and ensured that driver logs were
captured at INFO level
7. Started the application at WARN level, programmatically changed the level to
INFO and ensured that logs on console were printed at INFO level
Closes#22504 from ankuriitg/ankurgupta/SPARK-25118.
Authored-by: ankurgupta <ankur.gupta@cloudera.com>
Signed-off-by: Marcelo Vanzin <vanzin@cloudera.com>
## What changes were proposed in this pull request?
Add description of Executor Task Metrics to the documentation.
Closes#22397 from LucaCanali/docMonitoringTaskMetrics.
Authored-by: LucaCanali <luca.canali@cern.ch>
Signed-off-by: Sean Owen <sean.owen@databricks.com>
## What changes were proposed in this pull request?
In this PR metrics are introduced for YARN. As up to now there was no metrics in the YARN module a new metric system is created with the name "applicationMaster".
To support both client and cluster mode the metric system lifecycle is bound to the AM.
## How was this patch tested?
Both client and cluster mode was tested manually.
Before the test on one of the YARN node spark-core was removed to cause the allocation failure.
Spark was started as (in case of client mode):
```
spark2-submit \
--class org.apache.spark.examples.SparkPi \
--conf "spark.yarn.blacklist.executor.launch.blacklisting.enabled=true" --conf "spark.blacklist.application.maxFailedExecutorsPerNode=2" --conf "spark.dynamicAllocation.enabled=true" --conf "spark.metrics.conf.*.sink.console.class=org.apache.spark.metrics.sink.ConsoleSink" \
--master yarn \
--deploy-mode client \
original-spark-examples_2.11-2.4.0-SNAPSHOT.jar \
1000
```
In both cases the YARN logs contained the new metrics as:
```
$ yarn logs --applicationId application_1529926424933_0015
...
-- Gauges ----------------------------------------------------------------------
application_1531751594108_0046.applicationMaster.numContainersPendingAllocate
value = 0
application_1531751594108_0046.applicationMaster.numExecutorsFailed
value = 3
application_1531751594108_0046.applicationMaster.numExecutorsRunning
value = 9
application_1531751594108_0046.applicationMaster.numLocalityAwareTasks
value = 0
application_1531751594108_0046.applicationMaster.numReleasedContainers
value = 0
...
```
Author: “attilapiros” <piros.attila.zsolt@gmail.com>
Author: Attila Zsolt Piros <2017933+attilapiros@users.noreply.github.com>
Closes#21635 from attilapiros/SPARK-24594.
## What changes were proposed in this pull request?
Easy fix in the documentation.
## How was this patch tested?
N/A
Closes#20948
Author: Daniel Sakuma <dsakuma@gmail.com>
Closes#20928 from dsakuma/fix_typo_configuration_docs.
This change basically rewrites the security documentation so that it's
up to date with new features, more correct, and more complete.
Because security is such an important feature, I chose to move all the
relevant configuration documentation to the security page, instead of
having them peppered all over the place in the configuration page. This
allows an almost one-stop shop for security configuration in Spark. The
only exceptions are some YARN-specific minor features which I left in
the YARN page.
I also re-organized the page's topics, since they didn't make a lot of
sense. You had kerberos features described inside paragraphs talking
about UI access control, and other oddities. It should be easier now
to find information about specific Spark security features. I also
enabled TOCs for both the Security and YARN pages, since that makes it
easier to see what is covered.
I removed most of the comments from the SecurityManager javadoc since
they just replicated information in the security doc, with different
levels of out-of-dateness.
Author: Marcelo Vanzin <vanzin@cloudera.com>
Closes#20742 from vanzin/SPARK-23572.
## What changes were proposed in this pull request?
Fix spelling in quick-start doc.
## How was this patch tested?
Doc only.
Author: Shashwat Anand <me@shashwat.me>
Closes#20336 from ashashwat/SPARK-23165.
The application listing is still generated from event logs, but is now stored
in a KVStore instance. By default an in-memory store is used, but a new config
allows setting a local disk path to store the data, in which case a LevelDB
store will be created.
The provider stores things internally using the public REST API types; I believe
this is better going forward since it will make it easier to get rid of the
internal history server API which is mostly redundant at this point.
I also added a finalizer to LevelDBIterator, to make sure that resources are
eventually released. This helps when code iterates but does not exhaust the
iterator, thus not triggering the auto-close code.
HistoryServerSuite was modified to not re-start the history server unnecessarily;
this makes the json validation tests run more quickly.
Author: Marcelo Vanzin <vanzin@cloudera.com>
Closes#18887 from vanzin/SPARK-20642.
This patch adds statsd sink to the current metrics system in spark core.
Author: Xiaofeng Lin <xlin@twilio.com>
Closes#9518 from xflin/statsd.
Change-Id: Ib8720e86223d4a650df53f51ceb963cd95b49a44
History Server Launch uses SparkClassCommandBuilder for launching the server. It is observed that SPARK_CLASSPATH has been removed and deprecated. For spark-submit this takes a different route and spark.driver.extraClasspath takes care of specifying additional jars in the classpath that were previously specified in the SPARK_CLASSPATH. Right now the only way specify the additional jars for launching daemons such as history server is using SPARK_DIST_CLASSPATH (https://spark.apache.org/docs/latest/hadoop-provided.html) but this I presume is a distribution classpath. It would be nice to have a similar config like spark.driver.extraClasspath for launching daemons similar to history server.
Added new environment variable SPARK_DAEMON_CLASSPATH to set classpath for launching daemons. Tested and verified for History Server and Standalone Mode.
## How was this patch tested?
Initially, history server start script would fail for the reason being that it could not find the required jars for launching the server in the java classpath. Same was true for running Master and Worker in standalone mode. By adding the environment variable SPARK_DAEMON_CLASSPATH to the java classpath, both the daemons(History Server, Standalone daemons) are starting up and running.
Author: pgandhi <pgandhi@yahoo-inc.com>
Author: pgandhi999 <parthkgandhi9@gmail.com>
Closes#19047 from pgandhi999/master.
Hello
PR #10991 removed the built-in history view from Spark Standalone, so the history server is no longer useful to Yarn or Mesos only.
Author: Hervé <dud225@users.noreply.github.com>
Closes#17709 from dud225/patch-1.
## What changes were proposed in this pull request?
1. '/applications/[app-id]/stages' in rest api.status should add description '?status=[active|complete|pending|failed] list only stages in the state.'
Now the lack of this description, resulting in the use of this api do not know the use of the status through the brush stage list.
2.'/applications/[app-id]/stages/[stage-id]' in REST API,remove redundant description ‘?status=[active|complete|pending|failed] list only stages in the state.’.
Because only one stage is determined based on stage-id.
code:
GET
def stageList(QueryParam("status") statuses: JList[StageStatus]): Seq[StageData] = {
val listener = ui.jobProgressListener
val stageAndStatus = AllStagesResource.stagesAndStatus(ui)
val adjStatuses = {
if (statuses.isEmpty()) {
Arrays.asList(StageStatus.values(): _*)
} else {
statuses
}
};
## How was this patch tested?
manual tests
Please review http://spark.apache.org/contributing.html before opening a pull request.
Author: 郭小龙 10207633 <guo.xiaolong1@zte.com.cn>
Closes#17534 from guoxiaolongzte/SPARK-20218.
…ucceeded|failed|unknown]
## What changes were proposed in this pull request?
'/applications/[app-id]/jobs' in rest api.status should be'[running|succeeded|failed|unknown]'.
now status is '[complete|succeeded|failed]'.
but '/applications/[app-id]/jobs?status=complete' the server return 'HTTP ERROR 404'.
Added '?status=running' and '?status=unknown'.
code :
public enum JobExecutionStatus {
RUNNING,
SUCCEEDED,
FAILED,
UNKNOWN;
## How was this patch tested?
manual tests
Please review http://spark.apache.org/contributing.html before opening a pull request.
Author: guoxiaolongzte <guo.xiaolong1@zte.com.cn>
Closes#17507 from guoxiaolongzte/SPARK-20190.
# What changes were proposed in this pull request?
It seems there are several non-breaking spaces were inserted into several `.md`s and they look breaking rendering markdown files.
These are different. For example, this can be checked via `python` as below:
```python
>>> " "
'\xc2\xa0'
>>> " "
' '
```
_Note that it seems this PR description automatically replaces non-breaking spaces into normal spaces. Please open a `vi` and copy and paste it into `python` to verify this (do not copy the characters here)._
I checked the output below in Sapari and Chrome on Mac OS and, Internal Explorer on Windows 10.
**Before**
![2017-04-03 12 37 17](https://cloud.githubusercontent.com/assets/6477701/24594655/50aaba02-186a-11e7-80bb-d34b17a3398a.png)
![2017-04-03 12 36 57](https://cloud.githubusercontent.com/assets/6477701/24594654/50a855e6-186a-11e7-94e2-661e56544b0f.png)
**After**
![2017-04-03 12 36 46](https://cloud.githubusercontent.com/assets/6477701/24594657/53c2545c-186a-11e7-9a73-00529afbfd75.png)
![2017-04-03 12 36 31](https://cloud.githubusercontent.com/assets/6477701/24594658/53c286c0-186a-11e7-99c9-e66b1f510fe7.png)
## How was this patch tested?
Manually checking.
These instances were found via
```
grep --include=*.scala --include=*.python --include=*.java --include=*.r --include=*.R --include=*.md --include=*.r -r -I " " .
```
in Mac OS.
It seems there are several instances more as below:
```
./docs/sql-programming-guide.md: │ ├── ...
./docs/sql-programming-guide.md: │ │
./docs/sql-programming-guide.md: │ ├── country=US
./docs/sql-programming-guide.md: │ │ └── data.parquet
./docs/sql-programming-guide.md: │ ├── country=CN
./docs/sql-programming-guide.md: │ │ └── data.parquet
./docs/sql-programming-guide.md: │ └── ...
./docs/sql-programming-guide.md: ├── ...
./docs/sql-programming-guide.md: │
./docs/sql-programming-guide.md: ├── country=US
./docs/sql-programming-guide.md: │ └── data.parquet
./docs/sql-programming-guide.md: ├── country=CN
./docs/sql-programming-guide.md: │ └── data.parquet
./docs/sql-programming-guide.md: └── ...
./sql/core/src/test/README.md:│ ├── *.avdl # Testing Avro IDL(s)
./sql/core/src/test/README.md:│ └── *.avpr # !! NO TOUCH !! Protocol files generated from Avro IDL(s)
./sql/core/src/test/README.md:│ ├── gen-avro.sh # Script used to generate Java code for Avro
./sql/core/src/test/README.md:│ └── gen-thrift.sh # Script used to generate Java code for Thrift
```
These seems generated via `tree` command which inserts non-breaking spaces. They do not look causing any problem for rendering within code blocks and I did not fix it to reduce the overhead to manually replace it when it is overwritten via `tree` command in the future.
Author: hyukjinkwon <gurwls223@gmail.com>
Closes#17517 from HyukjinKwon/non-breaking-space.
## What changes were proposed in this pull request?
follow up pr of #16949.
## How was this patch tested?
jenkins
Author: uncleGen <hustyugm@gmail.com>
Closes#17033 from uncleGen/doc-restapi-environment.
## What changes were proposed in this pull request?
Currently, spark history server REST API provides functionality to query applications by application start time range based on minDate and maxDate query parameters, but it lacks support to query applications by their end time. In this pull request we are proposing optional minEndDate and maxEndDate query parameters and filtering capability based on these parameters to spark history server REST API. This functionality can be used for following queries,
1. Applications finished in last 'x' minutes
2. Applications finished before 'y' time
3. Applications finished between 'x' time to 'y' time
4. Applications started from 'x' time and finished before 'y' time.
For backward compatibility, we can keep existing minDate and maxDate query parameters as they are and they can continue support filtering based on start time range.
## How was this patch tested?
Existing unit tests and 4 new unit tests.
Author: Parag Chaudhari <paragpc@amazon.com>
Closes#11867 from paragpc/master-SHS-query-by-endtime_2.
## What changes were proposed in this pull request?
Current HistoryServer's ACLs is derived from application event-log, which means the newly changed ACLs cannot be applied to the old data, this will become a problem where newly added admin cannot access the old application history UI, only the new application can be affected.
So here propose to add admin ACLs for history server, any configured user/group could have the view access to all the applications, while the view ACLs derived from application run-time still take effect.
## How was this patch tested?
Unit test added.
Author: jerryshao <sshao@hortonworks.com>
Closes#16470 from jerryshao/SPARK-19033.
## What changes were proposed in this pull request?
add streaming rest api doc
related to pr #16253
cc saturday-shi srowen
## How was this patch tested?
Author: uncleGen <hustyugm@gmail.com>
Closes#16414 from uncleGen/SPARK-19009.
## What changes were proposed in this pull request?
Typo fixes
## How was this patch tested?
Local build. Awaiting the official build.
Author: Jacek Laskowski <jacek@japila.pl>
Closes#16144 from jaceklaskowski/typo-fixes.
## What changes were proposed in this pull request?
Fix typos in the 'configuration', 'monitoring' and 'sql-programming-guide' documentation.
## How was this patch tested?
Manually.
Author: Weiqing Yang <yangweiqing001@gmail.com>
Closes#15886 from weiqingy/fixTypo.
After change [SPARK-16405](https://github.com/apache/spark/pull/14080), we need to update docs by adding shuffle service metrics entry in currently supporting metrics list.
Author: Yangyang Liu <yangyangliu@fb.com>
Closes#14254 from lovexi/yangyang-monitoring-doc.
## What changes were proposed in this pull request?
With the new History Server the summary page loads the application list via the the REST API, this makes it very slow to impossible to load with large (10K+) application history. This pr fixes this by adding the `spark.history.ui.maxApplications` conf to limit the number of applications the History Server displays. This is accomplished using a new optional `limit` param for the `applications` api. (Note this only applies to what the summary page displays, all the Application UI's are still accessible if the user knows the App ID and goes to the Application UI directly.)
I've also added a new test for the `limit` param in `HistoryServerSuite.scala`
## How was this patch tested?
Manual testing and dev/run-tests
Author: Alex Bozarth <ajbozart@us.ibm.com>
Closes#14835 from ajbozarth/spark17243.
## What changes were proposed in this pull request?
Adding a new property to SparkConf called spark.metrics.namespace that allows users to
set a custom namespace for executor and driver metrics in the metrics systems.
By default, the root namespace used for driver or executor metrics is
the value of `spark.app.id`. However, often times, users want to be able to track the metrics
across apps for driver and executor metrics, which is hard to do with application ID
(i.e. `spark.app.id`) since it changes with every invocation of the app. For such use cases,
users can set the `spark.metrics.namespace` property to another spark configuration key like
`spark.app.name` which is then used to populate the root namespace of the metrics system
(with the app name in our example). `spark.metrics.namespace` property can be set to any
arbitrary spark property key, whose value would be used to set the root namespace of the
metrics system. Non driver and executor metrics are never prefixed with `spark.app.id`, nor
does the `spark.metrics.namespace` property have any such affect on such metrics.
## How was this patch tested?
Added new unit tests, modified existing unit tests.
Author: Mark Grover <mark@apache.org>
Closes#14270 from markgrover/spark-5847.
1. Create the executorspage-template.html for displaying application information in datables.
2. Added REST API endpoint "allexecutors" to be able to see all executors created for particular job.
3. The executorspage.js uses jQuery to access the data from /api/v1/applications/appid/allexecutors REST API, and use DataTable to display executors for the application. It also, generates summary of dead/live and total executors created during life of the application.
4. Similar changes applicable to Executors Page on history server for a given application.
Snapshots for how it looks like now:
<img width="938" alt="screen shot 2016-06-14 at 2 45 44 pm" src="https://cloud.githubusercontent.com/assets/6090397/16060092/ad1de03a-324b-11e6-8469-9eaa3f2548b5.png">
New Executors Page screenshot looks like this:
<img width="1436" alt="screen shot 2016-06-15 at 10 12 01 am" src="https://cloud.githubusercontent.com/assets/6090397/16085514/ee7004f0-32e1-11e6-9340-33d91e407f2b.png">
Author: Kishor Patil <kpatil@yahoo-inc.com>
Closes#13670 from kishorvpatil/execTemplates.
## What changes were proposed in this pull request?
Update monitoring.md.
…<appId>'
Author: Weiqing Yang <yangweiqing001@gmail.com>
Closes#14163 from Sherry302/master.
Remove history server functionality from standalone Master. Previously, the Master process rebuilt a SparkUI once the application was completed which sometimes caused problems, such as OOM, when the application event log is large (see SPARK-6270). Keeping this functionality out of the Master will help to simplify the process and increase stability.
Testing for this change included running core unit tests and manually running an application on a standalone cluster to verify that it completed successfully and that the Master UI functioned correctly. Also added 2 unit tests to verify killing an application and driver from MasterWebUI makes the correct request to the Master.
Author: Bryan Cutler <cutlerb@gmail.com>
Closes#10991 from BryanCutler/remove-history-master-SPARK-12299.
## What changes were proposed in this pull request?
Currently only a list of users can be specified for view and modify acls. This change enables a group of admins/devs/users to be provisioned for viewing and modifying Spark jobs.
**Changes Proposed in the fix**
Three new corresponding config entries have been added where the user can specify the groups to be given access.
```
spark.admin.acls.groups
spark.modify.acls.groups
spark.ui.view.acls.groups
```
New config entries were added because specifying the users and groups explicitly is a better and cleaner way compared to specifying them in the existing config entry using a delimiter.
A generic trait has been introduced to provide the user to group mapping which makes it pluggable to support a variety of mapping protocols - similar to the one used in hadoop. A default unix shell based implementation has been provided.
Custom user to group mapping protocol can be specified and configured by the entry ```spark.user.groups.mapping```
**How the patch was Tested**
We ran different spark jobs setting the config entries in combinations of admin, modify and ui acls. For modify acls we tried killing the job stages from the ui and using yarn commands. For view acls we tried accessing the UI tabs and the logs. Headless accounts were used to launch these jobs and different users tried to modify and view the jobs to ensure that the groups mapping applied correctly.
Additional Unit tests have been added without modifying the existing ones. These test for different ways of setting the acls through configuration and/or API and validate the expected behavior.
Author: Dhruve Ashar <dhruveashar@gmail.com>
Closes#12760 from dhruve/impr/SPARK-4224.
Add to the REST API details on the ? args and examples from the test suite.
I've used the existing table, adding all the fields to the second table.
see [in the pr](https://github.com/steveloughran/spark/blob/history/SPARK-13267-doc-params/docs/monitoring.md).
There's a slightly more sophisticated option: make the table 3 columns wide, and for all existing entries, have the initial `td` span 2 columns. The new entries would then have an empty 1st column, param in 2nd and text in 3rd, with any examples after a `br` entry.
Author: Steve Loughran <stevel@hortonworks.com>
Closes#11152 from steveloughran/history/SPARK-13267-doc-params.
## What changes were proposed in this pull request?
The patch makes event log processing multi threaded.
## How was this patch tested?
Existing tests pass, there is no new tests needed to test the functionality as this is a perf improvement. I tested the patch locally by generating one big event log (big1), one small event log(small1) and again a big event log(big2). Without this patch UI does not render any app for almost 30 seconds and then big2 and small1 appears. another 30 second delay and finally big1 also shows up in UI. With this change small1 shows up immediately and big1 and big2 comes up in 30 seconds. Locally it also displays them in the correct order in the UI.
Author: Parth Brahmbhatt <pbrahmbhatt@netflix.com>
Closes#11800 from Parth-Brahmbhatt/SPARK-13988.
It looks like the docs were recently updated to reflect the History Server's support for incomplete applications, but they still had wording that suggested only completed applications were viewable. This fixes that.
My editor also introduced several whitespace removal changes, that I hope are OK, as text files shouldn't have trailing whitespace. To verify they're purely whitespace changes, add `&w=1` to your browser address. If this isn't acceptable, let me know and I'll update the PR.
I also didn't think this required a JIRA. Let me know if I should create one.
Not tested
Author: Michael Gummelt <mgummelt@mesosphere.io>
Closes#12045 from mgummelt/update-history-docs.
## What changes were proposed in this pull request?
This PR tries to fix all typos in all markdown files under `docs` module,
and fixes similar typos in other comments, too.
## How was the this patch tested?
manual tests.
Author: Dongjoon Hyun <dongjoon@apache.org>
Closes#11300 from dongjoon-hyun/minor_fix_typos.
When the HistoryServer is showing an incomplete app, it needs to check if there is a newer version of the app available. It does this by checking if a version of the app has been loaded with a larger *filesize*. If so, it detaches the current UI, attaches the new one, and redirects back to the same URL to show the new UI.
https://issues.apache.org/jira/browse/SPARK-7889
Author: Steve Loughran <stevel@hortonworks.com>
Author: Imran Rashid <irashid@cloudera.com>
Closes#11118 from squito/SPARK-7889-alternate.