## What changes were proposed in this pull request?
The configuration setting `spark.executor.logs.rolling.size.maxBytes` was changed to `spark.executor.logs.rolling.maxSize` in 1.4 or so.
This commit fixes a remaining reference to the old name in the documentation.
Also the description for `spark.executor.logs.rolling.maxSize` was edited to clearly state that the unit for the size is bytes.
## How was this patch tested?
no tests
Author: Philipp Hoffmann <mail@philipphoffmann.de>
Closes#13001 from philipphoffmann/patch-3.
## What changes were proposed in this pull request?
Currently only a list of users can be specified for view and modify acls. This change enables a group of admins/devs/users to be provisioned for viewing and modifying Spark jobs.
**Changes Proposed in the fix**
Three new corresponding config entries have been added where the user can specify the groups to be given access.
```
spark.admin.acls.groups
spark.modify.acls.groups
spark.ui.view.acls.groups
```
New config entries were added because specifying the users and groups explicitly is a better and cleaner way compared to specifying them in the existing config entry using a delimiter.
A generic trait has been introduced to provide the user to group mapping which makes it pluggable to support a variety of mapping protocols - similar to the one used in hadoop. A default unix shell based implementation has been provided.
Custom user to group mapping protocol can be specified and configured by the entry ```spark.user.groups.mapping```
**How the patch was Tested**
We ran different spark jobs setting the config entries in combinations of admin, modify and ui acls. For modify acls we tried killing the job stages from the ui and using yarn commands. For view acls we tried accessing the UI tabs and the logs. Headless accounts were used to launch these jobs and different users tried to modify and view the jobs to ensure that the groups mapping applied correctly.
Additional Unit tests have been added without modifying the existing ones. These test for different ways of setting the acls through configuration and/or API and validate the expected behavior.
Author: Dhruve Ashar <dhruveashar@gmail.com>
Closes#12760 from dhruve/impr/SPARK-4224.
## What changes were proposed in this pull request?
The sort shuffle manager has been the default since Spark 1.2. It is time to remove the old hash shuffle manager.
## How was this patch tested?
Removed some tests related to the old manager.
Author: Reynold Xin <rxin@databricks.com>
Closes#12423 from rxin/SPARK-14667.
## What changes were proposed in this pull request?
The configuration docs are updated to reflect the changes introduced with [SPARK-12384](https://issues.apache.org/jira/browse/SPARK-12384). This allows the user to specify initial heap memory settings through the extraJavaOptions for executor, driver and am.
## How was this patch tested?
The changes are tested in [SPARK-12384](https://issues.apache.org/jira/browse/SPARK-12384). This is just documenting the changes made.
Author: Dhruve Ashar <dhruveashar@gmail.com>
Closes#12333 from dhruve/doc/SPARK-14572.
The description of "spark.memory.offHeap.size" in the current document does not clearly state that memory is counted with bytes....
This PR contains a small fix for this tiny issue
document fix
Author: CodingCat <zhunansjtu@gmail.com>
Closes#11561 from CodingCat/master.
## What changes were proposed in this pull request?
We provide a very limited set of cluster management script in Spark for Tachyon, although Tachyon itself provides a much better version of it. Given now Spark users can simply use Tachyon as a normal file system and does not require extensive configurations, we can remove this management capabilities to simplify Spark bash scripts.
Note that this also reduces coupling between a 3rd party external system and Spark's release scripts, and would eliminate possibility for failures such as Tachyon being renamed or the tar balls being relocated.
## How was this patch tested?
N/A
Author: Reynold Xin <rxin@databricks.com>
Closes#11400 from rxin/release-script.
## What changes were proposed in this pull request?
This PR fixes some typos in the following documentation files.
* `NOTICE`, `configuration.md`, and `hardware-provisioning.md`.
## How was the this patch tested?
manual tests
Author: Dongjoon Hyun <dongjoonapache.org>
Author: Dongjoon Hyun <dongjoon@apache.org>
Closes#11289 from dongjoon-hyun/minor_fix_typos_notice_and_confdoc.
https://issues.apache.org/jira/browse/SPARK-11627
Spark Streaming backpressure mechanism has no initial input rate limit, it might cause OOM exception.
In the firest batch task ,receivers receive data at the maximum speed they can reach,it might exhaust executors memory resources. Add a initial input rate limit value can make sure the Streaming job execute success in the first batch,then the backpressure mechanism can adjust receiving rate adaptively.
Author: junhao <junhao@mogujie.com>
Closes#9593 from junhaoMg/junhao-dev.
This JIRA is related to
https://github.com/apache/spark/pull/5852
Had to do some minor rework and test to make sure it
works with current version of spark.
Author: Sanket <schintap@untilservice-lm>
Closes#10838 from redsanket/limit-outbound-connections.
Remove spark.closure.serializer option and use JavaSerializer always
CC andrewor14 rxin I see there's a discussion in the JIRA but just thought I'd offer this for a look at what the change would be.
Author: Sean Owen <sowen@cloudera.com>
Closes#11150 from srowen/SPARK-12414.
This is the next iteration of tnachen's previous PR: https://github.com/apache/spark/pull/4027
In that PR, we resolved with andrewor14 and pwendell to implement the Mesos scheduler's support of `spark.executor.cores` to be consistent with YARN and Standalone. This PR implements that resolution.
This PR implements two high-level features. These two features are co-dependent, so they're implemented both here:
- Mesos support for spark.executor.cores
- Multiple executors per slave
We at Mesosphere have been working with Typesafe on a Spark/Mesos integration test suite: https://github.com/typesafehub/mesos-spark-integration-tests, which passes for this PR.
The contribution is my original work and I license the work to the project under the project's open source license.
Author: Michael Gummelt <mgummelt@mesosphere.io>
Closes#10993 from mgummelt/executor_sizing.
Fix zookeeper dir configuration used in cluster mode, and also add documentation around these settings.
Author: Timothy Chen <tnachen@gmail.com>
Closes#10057 from tnachen/fix_mesos_dir.
JIRA 1680 added a property called spark.yarn.appMasterEnv. This PR draws users' attention to this special case by adding an explanation in configuration.html#environment-variables
Author: Andrew <weiner.andrew.j@gmail.com>
Closes#10869 from weineran/branch-yarn-docs.
- Remove Akka dependency from core. Note: the streaming-akka project still uses Akka.
- Remove HttpFileServer
- Remove Akka configs from SparkConf and SSLOptions
- Rename `spark.akka.frameSize` to `spark.rpc.message.maxSize`. I think it's still worth to keep this config because using `DirectTaskResult` or `IndirectTaskResult` depends on it.
- Update comments and docs
Author: Shixiong Zhu <shixiong@databricks.com>
Closes#10854 from zsxwing/remove-akka.
Several Spark properties equivalent to Spark submit command line options are missing.
Author: felixcheung <felixcheung_m@hotmail.com>
Closes#10491 from felixcheung/sparksubmitdoc.
modify 'spark.memory.offHeap.enabled' default value to false
Author: zzcclp <xm_zzc@sina.com>
Closes#10633 from zzcclp/fix_spark.memory.offHeap.enabled_default_value.
This PR removes `spark.cleaner.ttl` and the associated TTL-based metadata cleaning code.
Now that we have the `ContextCleaner` and a timer to trigger periodic GCs, I don't think that `spark.cleaner.ttl` is necessary anymore. The TTL-based cleaning isn't enabled by default, isn't included in our end-to-end tests, and has been a source of user confusion when it is misconfigured. If the TTL is set too low, data which is still being used may be evicted / deleted, leading to hard to diagnose bugs.
For all of these reasons, I think that we should remove this functionality in Spark 2.0. Additional benefits of doing this include marginally reduced memory usage, since we no longer need to store timetsamps in hashmaps, and a handful fewer threads.
Author: Josh Rosen <joshrosen@databricks.com>
Closes#10534 from JoshRosen/remove-ttl-based-cleaning.
We switched to TorrentBroadcast in Spark 1.1, and HttpBroadcast has been undocumented since then. It's time to remove it in Spark 2.0.
Author: Reynold Xin <rxin@databricks.com>
Closes#10531 from rxin/SPARK-12588.
According the benchmark [1], LZ4-java could be 80% (or 30%) faster than Snappy.
After changing the compressor to LZ4, I saw 20% improvement on end-to-end time for a TPCDS query (Q4).
[1] https://github.com/ning/jvm-compressor-benchmark/wiki
cc rxin
Author: Davies Liu <davies@databricks.com>
Closes#10342 from davies/lz4.
The current default storage level of Python persist API is MEMORY_ONLY_SER. This is different from the default level MEMORY_ONLY in the official document and RDD APIs.
davies Is this inconsistency intentional? Thanks!
Updates: Since the data is always serialized on the Python side, the storage levels of JAVA-specific deserialization are not removed, such as MEMORY_ONLY.
Updates: Based on the reviewers' feedback. In Python, stored objects will always be serialized with the [Pickle](https://docs.python.org/2/library/pickle.html) library, so it does not matter whether you choose a serialized level. The available storage levels in Python include `MEMORY_ONLY`, `MEMORY_ONLY_2`, `MEMORY_AND_DISK`, `MEMORY_AND_DISK_2`, `DISK_ONLY`, `DISK_ONLY_2` and `OFF_HEAP`.
Author: gatorsmile <gatorsmile@gmail.com>
Closes#10092 from gatorsmile/persistStorageLevel.
This patch adds documentation for Spark configurations that affect off-heap memory and makes some naming and validation improvements for those configs.
- Change `spark.memory.offHeapSize` to `spark.memory.offHeap.size`. This is fine because this configuration has not shipped in any Spark release yet (it's new in Spark 1.6).
- Deprecated `spark.unsafe.offHeap` in favor of a new `spark.memory.offHeap.enabled` configuration. The motivation behind this change is to gather all memory-related configurations under the same prefix.
- Add a check which prevents users from setting `spark.memory.offHeap.enabled=true` when `spark.memory.offHeap.size == 0`. After SPARK-11389 (#9344), which was committed in Spark 1.6, Spark enforces a hard limit on the amount of off-heap memory that it will allocate to tasks. As a result, enabling off-heap execution memory without setting `spark.memory.offHeap.size` will lead to immediate OOMs. The new configuration validation makes this scenario easier to diagnose, helping to avoid user confusion.
- Document these configurations on the configuration page.
Author: Josh Rosen <joshrosen@databricks.com>
Closes#10237 from JoshRosen/SPARK-12251.
This avoids bringing up yet another HTTP server on the driver, and
instead reuses the file server already managed by the driver's
RpcEnv. As a bonus, the repl now inherits the security features of
the network library.
There's also a small change to create the directory for storing classes
under the root temp dir for the application (instead of directly
under java.io.tmpdir).
Author: Marcelo Vanzin <vanzin@cloudera.com>
Closes#9923 from vanzin/SPARK-11563.
The existing `spark.memory.fraction` (default 0.75) gives the system 25% of the space to work with. For small heaps, this is not enough: e.g. default 1GB leaves only 250MB system memory. This is especially a problem in local mode, where the driver and executor are crammed in the same JVM. Members of the community have reported driver OOM's in such cases.
**New proposal.** We now reserve 300MB before taking the 75%. For 1GB JVMs, this leaves `(1024 - 300) * 0.75 = 543MB` for execution and storage. This is proposal (1) listed in the [JIRA](https://issues.apache.org/jira/browse/SPARK-12081).
Author: Andrew Or <andrew@databricks.com>
Closes#10081 from andrewor14/unified-memory-small-heaps.
This change abstracts the code that serves jars / files to executors so that
each RpcEnv can have its own implementation; the akka version uses the existing
HTTP-based file serving mechanism, while the netty versions uses the new
stream support added to the network lib, which makes file transfers benefit
from the easier security configuration of the network library, and should also
reduce overhead overall.
The change includes a small fix to TransportChannelHandler so that it propagates
user events to downstream handlers.
Author: Marcelo Vanzin <vanzin@cloudera.com>
Closes#9530 from vanzin/SPARK-11140.
`<\code>` end tag missing backslash in
docs/configuration.md{L308-L339}
ref #8795
Author: Kai Jiang <jiangkai@gmail.com>
Closes#9715 from vectorijk/minor-typo-docs.
Remove Hadoop third party distro page, and move Hadoop cluster config info to configuration page
CC pwendell
Author: Sean Owen <sowen@cloudera.com>
Closes#9298 from srowen/SPARK-11305.
There's a lot of duplication between SortShuffleManager and UnsafeShuffleManager. Given that these now provide the same set of functionality, now that UnsafeShuffleManager supports large records, I think that we should replace SortShuffleManager's serialized shuffle implementation with UnsafeShuffleManager's and should merge the two managers together.
Author: Josh Rosen <joshrosen@databricks.com>
Closes#8829 from JoshRosen/consolidate-sort-shuffle-implementations.
Add documentation for configuration:
- spark.sql.ui.retainedExecutions
- spark.streaming.ui.retainedBatches
Author: Nick Pritchard <nicholas.pritchard@falkonry.com>
Closes#9052 from pnpritchard/SPARK-11039.
This patch unifies the memory management of the storage and execution regions such that either side can borrow memory from each other. When memory pressure arises, storage will be evicted in favor of execution. To avoid regressions in cases where storage is crucial, we dynamically allocate a fraction of space for storage that execution cannot evict. Several configurations are introduced:
- **spark.memory.fraction (default 0.75)**: fraction of the heap space used for execution and storage. The lower this is, the more frequently spills and cached data eviction occur. The purpose of this config is to set aside memory for internal metadata, user data structures, and imprecise size estimation in the case of sparse, unusually large records.
- **spark.memory.storageFraction (default 0.5)**: size of the storage region within the space set aside by `spark.memory.fraction`. Cached data may only be evicted if total storage exceeds this region.
- **spark.memory.useLegacyMode (default false)**: whether to use the memory management that existed in Spark 1.5 and before. This is mainly for backward compatibility.
For a detailed description of the design, see [SPARK-10000](https://issues.apache.org/jira/browse/SPARK-10000). This patch builds on top of the `MemoryManager` interface introduced in #9000.
Author: Andrew Or <andrew@databricks.com>
Closes#9084 from andrewor14/unified-memory-manager.
1.4 docs noted that the units were MB - i have assumed this is still the case
Author: admackin <admackin@users.noreply.github.com>
Closes#9025 from admackin/master.
It does not make much sense to set `spark.shuffle.spill` or `spark.sql.planner.externalSort` to false: I believe that these configurations were initially added as "escape hatches" to guard against bugs in the external operators, but these operators are now mature and well-tested. In addition, these configurations are not handled in a consistent way anymore: SQL's Tungsten codepath ignores these configurations and will continue to use spilling operators. Similarly, Spark Core's `tungsten-sort` shuffle manager does not respect `spark.shuffle.spill=false`.
This pull request removes these configurations, adds warnings at the appropriate places, and deletes a large amount of code which was only used in code paths that did not support spilling.
Author: Josh Rosen <joshrosen@databricks.com>
Closes#8831 from JoshRosen/remove-ability-to-disable-spilling.
spark.scheduler.minRegisteredResourcesRatio configuration parameter works for YARN mode but not for Mesos Coarse grained mode.
If the parameter specified default value of 0 will be set for spark.scheduler.minRegisteredResourcesRatio in base class and this method will always return true.
There are no existing test for YARN mode too. Hence not added test for the same.
Author: Akash Mishra <akash.mishra20@gmail.com>
Closes#8672 from SleepyThread/master.
Author: Tathagata Das <tathagata.das1565@gmail.com>
Closes#8656 from tdas/SPARK-10492 and squashes the following commits:
986cdd6 [Tathagata Das] Added information on backpressure
We introduced the Netty network module for shuffle in Spark 1.2, and has turned it on by default for 3 releases. The old ConnectionManager is difficult to maintain. If we merge the patch now, by the time it is released, it would be 1 yr for which ConnectionManager is off by default. It's time to remove it.
Author: Reynold Xin <rxin@databricks.com>
Closes#8161 from rxin/SPARK-9767.