When a cached block is spilled to disk and read back in serialized form (i.e. as bytes), the current BlockManager implementation will attempt to re-insert the serialized block into the MemoryStore even if the block's storage level requests deserialized caching.
This behavior adds some complexity to the MemoryStore but I don't think it offers many performance benefits and I'd like to remove it in order to simplify a larger refactoring patch. Therefore, this patch changes the behavior so that disk store reads will only cache bytes in the memory store for blocks with serialized storage levels.
There are two places where we request serialized bytes from the BlockStore:
1. getLocalBytes(), which is only called when reading local copies of TorrentBroadcast pieces. Broadcast pieces are always cached using a serialized storage level, so this won't lead to a mismatch in serialization forms if spilled bytes read from disk are cached as bytes in the memory store.
2. the non-shuffle-block branch in getBlockData(), which is only called by the NettyBlockRpcServer when responding to requests to read remote blocks. Caching the serialized bytes in memory will only benefit us if those cached bytes are read before they're evicted and the likelihood of that happening seems low since the frequency of remote reads of non-broadcast cached blocks seems very low. Caching these bytes when they have a low probability of being read is bad if it risks the eviction of blocks which are cached in their expected serialized/deserialized forms, since those blocks seem more likely to be read in local computation.
Given the argument above, I think this change is unlikely to cause performance regressions.
Author: Josh Rosen <joshrosen@databricks.com>
Closes#11533 from JoshRosen/remove-memorystore-level-mismatch.
## What changes were proposed in this pull request?
In WebUI, now Jetty Server starts with SPARK_LOCAL_IP config value if it
is configured otherwise it starts with default value as '0.0.0.0'.
It is continuation as per the closed PR https://github.com/apache/spark/pull/11133 for the JIRA SPARK-13117 and discussion in SPARK-13117.
## How was this patch tested?
This has been verified using the command 'netstat -tnlp | grep <PID>' to check on which IP/hostname is binding with the below steps.
In the below results, mentioned PID in the command is the corresponding process id.
#### Without the patch changes,
Web UI(Jetty Server) is not taking the value configured for SPARK_LOCAL_IP and it is listening to all the interfaces.
###### Master
```
[devarajstobdtserver2 sbin]$ netstat -tnlp | grep 3930
tcp6 0 0 :::8080 :::* LISTEN 3930/java
```
###### Worker
```
[devarajstobdtserver2 sbin]$ netstat -tnlp | grep 4090
tcp6 0 0 :::8081 :::* LISTEN 4090/java
```
###### History Server Process,
```
[devarajstobdtserver2 sbin]$ netstat -tnlp | grep 2471
tcp6 0 0 :::18080 :::* LISTEN 2471/java
```
###### Driver
```
[devarajstobdtserver2 spark-master]$ netstat -tnlp | grep 6556
tcp6 0 0 :::4040 :::* LISTEN 6556/java
```
#### With the patch changes
##### i. With SPARK_LOCAL_IP configured
If the SPARK_LOCAL_IP is configured then all the processes Web UI(Jetty Server) is getting bind to the configured value.
###### Master
```
[devarajstobdtserver2 sbin]$ netstat -tnlp | grep 1561
tcp6 0 0 x.x.x.x:8080 :::* LISTEN 1561/java
```
###### Worker
```
[devarajstobdtserver2 sbin]$ netstat -tnlp | grep 2229
tcp6 0 0 x.x.x.x:8081 :::* LISTEN 2229/java
```
###### History Server
```
[devarajstobdtserver2 sbin]$ netstat -tnlp | grep 3747
tcp6 0 0 x.x.x.x:18080 :::* LISTEN 3747/java
```
###### Driver
```
[devarajstobdtserver2 spark-master]$ netstat -tnlp | grep 6013
tcp6 0 0 x.x.x.x:4040 :::* LISTEN 6013/java
```
##### ii. Without SPARK_LOCAL_IP configured
If the SPARK_LOCAL_IP is not configured then all the processes Web UI(Jetty Server) will start with the '0.0.0.0' as default value.
###### Master
```
[devarajstobdtserver2 sbin]$ netstat -tnlp | grep 4573
tcp6 0 0 :::8080 :::* LISTEN 4573/java
```
###### Worker
```
[devarajstobdtserver2 sbin]$ netstat -tnlp | grep 4703
tcp6 0 0 :::8081 :::* LISTEN 4703/java
```
###### History Server
```
[devarajstobdtserver2 sbin]$ netstat -tnlp | grep 4846
tcp6 0 0 :::18080 :::* LISTEN 4846/java
```
###### Driver
```
[devarajstobdtserver2 sbin]$ netstat -tnlp | grep 5437
tcp6 0 0 :::4040 :::* LISTEN 5437/java
```
Author: Devaraj K <devaraj@apache.org>
Closes#11490 from devaraj-kavali/SPARK-13117-v1.
In preparation for larger refactoring, this patch removes the confusing `returnValues` option from the BlockStore put() APIs: returning the value is only useful in one place (caching) and in other situations, such as block replication, it's simpler to put() and then get().
As part of this change, I needed to refactor `BlockManager.doPut()`'s block replication code. I also changed `doPut()` to access the memory and disk stores directly rather than calling them through the BlockStore interface; this is in anticipation of a followup patch to remove the BlockStore interface so that the disk store can expose a binary-data-oriented API which is not concerned with Java objects or serialization.
These changes should be covered by the existing storage unit tests. The best way to review this patch is probably to look at the individual commits, all of which are small and have useful descriptions to guide the review.
/cc davies for review.
Author: Josh Rosen <joshrosen@databricks.com>
Closes#11502 from JoshRosen/remove-returnvalues.
## What changes were proposed in this pull request?
AppClient runs in the driver side. It should not call `Utils.tryOrExit` as it will send exception to SparkUncaughtExceptionHandler and call `System.exit`. This PR just removed `Utils.tryOrExit`.
## How was this patch tested?
manual tests.
Author: Shixiong Zhu <shixiong@databricks.com>
Closes#11566 from zsxwing/SPARK-13711.
`HadoopFsRelation` is used for reading most files into Spark SQL. However today this class mixes the concerns of file management, schema reconciliation, scan building, bucketing, partitioning, and writing data. As a result, many data sources are forced to reimplement the same functionality and the various layers have accumulated a fair bit of inefficiency. This PR is a first cut at separating this into several components / interfaces that are each described below. Additionally, all implementations inside of Spark (parquet, csv, json, text, orc, svmlib) have been ported to the new API `FileFormat`. External libraries, such as spark-avro will also need to be ported to work with Spark 2.0.
### HadoopFsRelation
A simple `case class` that acts as a container for all of the metadata required to read from a datasource. All discovery, resolution and merging logic for schemas and partitions has been removed. This an internal representation that no longer needs to be exposed to developers.
```scala
case class HadoopFsRelation(
sqlContext: SQLContext,
location: FileCatalog,
partitionSchema: StructType,
dataSchema: StructType,
bucketSpec: Option[BucketSpec],
fileFormat: FileFormat,
options: Map[String, String]) extends BaseRelation
```
### FileFormat
The primary interface that will be implemented by each different format including external libraries. Implementors are responsible for reading a given format and converting it into `InternalRow` as well as writing out an `InternalRow`. A format can optionally return a schema that is inferred from a set of files.
```scala
trait FileFormat {
def inferSchema(
sqlContext: SQLContext,
options: Map[String, String],
files: Seq[FileStatus]): Option[StructType]
def prepareWrite(
sqlContext: SQLContext,
job: Job,
options: Map[String, String],
dataSchema: StructType): OutputWriterFactory
def buildInternalScan(
sqlContext: SQLContext,
dataSchema: StructType,
requiredColumns: Array[String],
filters: Array[Filter],
bucketSet: Option[BitSet],
inputFiles: Array[FileStatus],
broadcastedConf: Broadcast[SerializableConfiguration],
options: Map[String, String]): RDD[InternalRow]
}
```
The current interface is based on what was required to get all the tests passing again, but still mixes a couple of concerns (i.e. `bucketSet` is passed down to the scan instead of being resolved by the planner). Additionally, scans are still returning `RDD`s instead of iterators for single files. In a future PR, bucketing should be removed from this interface and the scan should be isolated to a single file.
### FileCatalog
This interface is used to list the files that make up a given relation, as well as handle directory based partitioning.
```scala
trait FileCatalog {
def paths: Seq[Path]
def partitionSpec(schema: Option[StructType]): PartitionSpec
def allFiles(): Seq[FileStatus]
def getStatus(path: Path): Array[FileStatus]
def refresh(): Unit
}
```
Currently there are two implementations:
- `HDFSFileCatalog` - based on code from the old `HadoopFsRelation`. Infers partitioning by recursive listing and caches this data for performance
- `HiveFileCatalog` - based on the above, but it uses the partition spec from the Hive Metastore.
### ResolvedDataSource
Produces a logical plan given the following description of a Data Source (which can come from DataFrameReader or a metastore):
- `paths: Seq[String] = Nil`
- `userSpecifiedSchema: Option[StructType] = None`
- `partitionColumns: Array[String] = Array.empty`
- `bucketSpec: Option[BucketSpec] = None`
- `provider: String`
- `options: Map[String, String]`
This class is responsible for deciding which of the Data Source APIs a given provider is using (including the non-file based ones). All reconciliation of partitions, buckets, schema from metastores or inference is done here.
### DataSourceAnalysis / DataSourceStrategy
Responsible for analyzing and planning reading/writing of data using any of the Data Source APIs, including:
- pruning the files from partitions that will be read based on filters.
- appending partition columns*
- applying additional filters when a data source can not evaluate them internally.
- constructing an RDD that is bucketed correctly when required*
- sanity checking schema match-up and other analysis when writing.
*In the future we should do that following:
- Break out file handling into its own Strategy as its sufficiently complex / isolated.
- Push the appending of partition columns down in to `FileFormat` to avoid an extra copy / unvectorization.
- Use a custom RDD for scans instead of `SQLNewNewHadoopRDD2`
Author: Michael Armbrust <michael@databricks.com>
Author: Wenchen Fan <wenchen@databricks.com>
Closes#11509 from marmbrus/fileDataSource.
This is, in a way, the basics to enable SPARK-529 (which was closed as
won't fix but I think is still valuable). In fact, Spark SQL created
something for that, and this change basically factors out that code
and inserts it into SparkConf, with some extra bells and whistles.
To showcase the usage of this pattern, I modified the YARN backend
to use the new config keys (defined in the new `config` package object
under `o.a.s.deploy.yarn`). Most of the changes are mechanic, although
logic had to be slightly modified in a handful of places.
Author: Marcelo Vanzin <vanzin@cloudera.com>
Closes#10205 from vanzin/conf-opts.
## What changes were proposed in this pull request?
Now that dead executors are shown in the executors table (#10058) the totals table is updated to include the separate totals for alive and dead executors as well as the current total, as originally discussed in #10668
## How was this patch tested?
Manually verified by running the Standalone Web UI in the latest Safari and Firefox ESR
Author: Alex Bozarth <ajbozart@us.ibm.com>
Closes#11381 from ajbozarth/spark13459.
## What changes were proposed in this pull request?
Remove old deprecated ThreadPoolExecutor and replace with ExecutionContext using a ForkJoinPool. The downside of this is that scala's ForkJoinPool doesn't give us a way to specify the thread pool name (and is a wrapper of Java's in 2.12) except by providing a custom factory. Note that we can't use Java's ForkJoinPool directly in Scala 2.11 since it uses a ExecutionContext which reports system parallelism. One other implicit change that happens is the old ExecutionContext would have reported a different default parallelism since it used system parallelism rather than threadpool parallelism (this was likely not intended but also likely not a huge difference).
The previous version of this PR attempted to use an execution context constructed on the ThreadPool (but not the deprecated ThreadPoolExecutor class) so as to keep the ability to have human readable named threads but this reported system parallelism.
## How was this patch tested?
unit tests: streaming/testOnly org.apache.spark.streaming.util.*
Author: Holden Karau <holden@us.ibm.com>
Closes#11423 from holdenk/SPARK-13398-move-away-from-ThreadPoolTaskSupport-java-forkjoin.
## What changes were proposed in this pull request?
This PR fixes typos in comments and testcase name of code.
## How was this patch tested?
manual.
Author: Dongjoon Hyun <dongjoon@apache.org>
Closes#11481 from dongjoon-hyun/minor_fix_typos_in_code.
## What changes were proposed in this pull request?
Fixes (another) compile problem due to inadvertent use of Option.contains, only in Scala 2.11
## How was this patch tested?
Jenkins tests
Author: Sean Owen <sowen@cloudera.com>
Closes#11496 from srowen/SPARK-13423.3.
## What changes were proposed in this pull request?
Fixes compile problem due to inadvertent use of `Option.contains`, only in Scala 2.11. The change should have been to replace `Option.exists(_ == x)` with `== Some(x)`. Replacing exists with contains only makes sense for collections. Replacing use of `Option.exists` still makes sense though as it's misleading.
## How was this patch tested?
Jenkins tests / compilation
(If this patch involves UI changes, please attach a screenshot; otherwise, remove this)
Author: Sean Owen <sowen@cloudera.com>
Closes#11493 from srowen/SPARK-13423.2.
## What changes were proposed in this pull request?
After SPARK-6990, `dev/lint-java` keeps Java code healthy and helps PR review by saving much time.
This issue aims remove unused imports from Java/Scala code and add `UnusedImports` checkstyle rule to help developers.
## How was this patch tested?
```
./dev/lint-java
./build/sbt compile
```
Author: Dongjoon Hyun <dongjoon@apache.org>
Closes#11438 from dongjoon-hyun/SPARK-13583.
## What changes were proposed in this pull request?
Make some cross-cutting code improvements according to static analysis. These are individually up for discussion since they exist in separate commits that can be reverted. The changes are broadly:
- Inner class should be static
- Mismatched hashCode/equals
- Overflow in compareTo
- Unchecked warnings
- Misuse of assert, vs junit.assert
- get(a) + getOrElse(b) -> getOrElse(a,b)
- Array/String .size -> .length (occasionally, -> .isEmpty / .nonEmpty) to avoid implicit conversions
- Dead code
- tailrec
- exists(_ == ) -> contains find + nonEmpty -> exists filter + size -> count
- reduce(_+_) -> sum map + flatten -> map
The most controversial may be .size -> .length simply because of its size. It is intended to avoid implicits that might be expensive in some places.
## How was the this patch tested?
Existing Jenkins unit tests.
Author: Sean Owen <sowen@cloudera.com>
Closes#11292 from srowen/SPARK-13423.
Moved TestExecutor.scala from src to test package and removed the unused file TestClient.scala.
Author: Devaraj K <devaraj@apache.org>
Closes#11474 from devaraj-kavali/SPARK-13621.
## What changes were proposed in this pull request?
In order to tell OutputStream that the task has failed or not, we should call the failure callbacks BEFORE calling writer.close().
## How was this patch tested?
Added new unit tests.
Author: Davies Liu <davies@databricks.com>
Closes#11450 from davies/callback.
CacheManager directly calls MemoryStore.unrollSafely() and has its own logic for handling graceful fallback to disk when cached data does not fit in memory. However, this logic also exists inside of the MemoryStore itself, so this appears to be unnecessary duplication.
Thanks to the addition of block-level read/write locks in #10705, we can refactor the code to remove the CacheManager and replace it with an atomic `BlockManager.getOrElseUpdate()` method.
This pull request replaces / subsumes #10748.
/cc andrewor14 and nongli for review. Note that this changes the locking semantics of a couple of internal BlockManager methods (`doPut()` and `lockNewBlockForWriting`), so please pay attention to the Scaladoc changes and new test cases for those methods.
Author: Josh Rosen <joshrosen@databricks.com>
Closes#11436 from JoshRosen/remove-cachemanager.
The Hive client library is not smart enough to notice that the current
user is a proxy user; so when using a proxy user, it fails to fetch
delegation tokens from the metastore because of a missing kerberos
TGT for the current user.
To fix it, just run the code that fetches the delegation token as the
real logged in user.
Tested on a kerberos cluster both submitting normally and with a proxy
user; Hive and HBase tokens are retrieved correctly in both cases.
Author: Marcelo Vanzin <vanzin@cloudera.com>
Closes#11358 from vanzin/SPARK-13478.
## What changes were proposed in this pull request?
Just fixed the log place introduced by #11401
## How was this patch tested?
unit tests.
Author: Shixiong Zhu <shixiong@databricks.com>
Closes#11432 from zsxwing/SPARK-13522-follow-up.
## What changes were proposed in this pull request?
Sometimes, network disconnection event won't be triggered for other potential race conditions that we may not have thought of, then the executor will keep sending heartbeats to driver and won't exit.
This PR adds a new configuration `spark.executor.heartbeat.maxFailures` to kill Executor when it's unable to heartbeat to the driver more than `spark.executor.heartbeat.maxFailures` times.
## How was this patch tested?
unit tests
Author: Shixiong Zhu <shixiong@databricks.com>
Closes#11401 from zsxwing/SPARK-13522.
## What changes were proposed in this pull request?
Now by default, it shows as ascending order of appId. We might prefer to display as descending order by default, which will show the latest application at the top.
## How was this patch tested?
Manual tested. See screenshot below:
![desc-sort](https://cloud.githubusercontent.com/assets/11683054/13307473/102f4cf8-db31-11e5-8dd5-391edbf32f0d.png)
Author: zhuol <zhuol@yahoo-inc.com>
Closes#11357 from zhuoliu/13481.
## What changes were proposed in this pull request?
When the driver removes an executor's state, the connection between the driver and the executor may be still alive so that the executor cannot exit automatically (E.g., Master will send RemoveExecutor when a work is lost but the executor is still alive), so the driver should try to tell the executor to stop itself. Otherwise, we will leak an executor.
This PR modified the driver to send `StopExecutor` to the executor when it's removed.
## How was this patch tested?
manual test: increase the worker heartbeat interval to force it's always timeout and the leak executors are gone.
Author: Shixiong Zhu <shixiong@databricks.com>
Closes#11399 from zsxwing/SPARK-13519.
## What changes were proposed in this pull request?
TaskContext supports task completion callback, which gets called regardless of task failures. However, there is no way for the listener to know if there is an error. This patch adds a new listener that gets called when a task fails.
## How was the this patch tested?
New unit test case and integration test case covering the code path
Author: Reynold Xin <rxin@databricks.com>
Closes#11340 from rxin/SPARK-13465.
## Motivation
As a pre-requisite to off-heap caching of blocks, we need a mechanism to prevent pages / blocks from being evicted while they are being read. With on-heap objects, evicting a block while it is being read merely leads to memory-accounting problems (because we assume that an evicted block is a candidate for garbage-collection, which will not be true during a read), but with off-heap memory this will lead to either data corruption or segmentation faults.
## Changes
### BlockInfoManager and reader/writer locks
This patch adds block-level read/write locks to the BlockManager. It introduces a new `BlockInfoManager` component, which is contained within the `BlockManager`, holds the `BlockInfo` objects that the `BlockManager` uses for tracking block metadata, and exposes APIs for locking blocks in either shared read or exclusive write modes.
`BlockManager`'s `get*()` and `put*()` methods now implicitly acquire the necessary locks. After a `get()` call successfully retrieves a block, that block is locked in a shared read mode. A `put()` call will block until it acquires an exclusive write lock. If the write succeeds, the write lock will be downgraded to a shared read lock before returning to the caller. This `put()` locking behavior allows us store a block and then immediately turn around and read it without having to worry about it having been evicted between the write and the read, which will allow us to significantly simplify `CacheManager` in the future (see #10748).
See `BlockInfoManagerSuite`'s test cases for a more detailed specification of the locking semantics.
### Auto-release of locks at the end of tasks
Our locking APIs support explicit release of locks (by calling `unlock()`), but it's not always possible to guarantee that locks will be released prior to the end of the task. One reason for this is our iterator interface: since our iterators don't support an explicit `close()` operator to signal that no more records will be consumed, operations like `take()` or `limit()` don't have a good means to release locks on their input iterators' blocks. Another example is broadcast variables, whose block locks can only be released at the end of the task.
To address this, `BlockInfoManager` uses a pair of maps to track the set of locks acquired by each task. Lock acquisitions automatically record the current task attempt id by obtaining it from `TaskContext`. When a task finishes, code in `Executor` calls `BlockInfoManager.unlockAllLocksForTask(taskAttemptId)` to free locks.
### Locking and the MemoryStore
In order to prevent in-memory blocks from being evicted while they are being read, the `MemoryStore`'s `evictBlocksToFreeSpace()` method acquires write locks on blocks which it is considering as candidates for eviction. These lock acquisitions are non-blocking, so a block which is being read will not be evicted. By holding write locks until the eviction is performed or skipped (in case evicting the blocks would not free enough memory), we avoid a race where a new reader starts to read a block after the block has been marked as an eviction candidate but before it has been removed.
### Locking and remote block transfer
This patch makes small changes to to block transfer and network layer code so that locks acquired by the BlockTransferService are released as soon as block transfer messages are consumed and released by Netty. This builds on top of #11193, a bug fix related to freeing of network layer ManagedBuffers.
## FAQ
- **Why not use Java's built-in [`ReadWriteLock`](https://docs.oracle.com/javase/7/docs/api/java/util/concurrent/locks/ReadWriteLock.html)?**
Our locks operate on a per-task rather than per-thread level. Under certain circumstances a task may consist of multiple threads, so using `ReadWriteLock` would mean that we might call `unlock()` from a thread which didn't hold the lock in question, an operation which has undefined semantics. If we could rely on Java 8 classes, we might be able to use [`StampedLock`](https://docs.oracle.com/javase/8/docs/api/java/util/concurrent/locks/StampedLock.html) to work around this issue.
- **Why not detect "leaked" locks in tests?**:
See above notes about `take()` and `limit`.
Author: Josh Rosen <joshrosen@databricks.com>
Closes#10705 from JoshRosen/pin-pages.
Our nightly doc snapshot builds are failing due to some issue involving the Guava Stopwatch constructor:
```
[error] /home/jenkins/workspace/spark-master-docs/spark/core/src/main/scala/org/apache/spark/scheduler/cluster/mesos/CoarseMesosSchedulerBackend.scala:496: constructor Stopwatch in class Stopwatch cannot be accessed in class CoarseMesosSchedulerBackend
[error] val stopwatch = new Stopwatch()
[error] ^
```
This Stopwatch constructor was deprecated in newer versions of Guava (fd0cbc2c5c) and it's possible that some classpath issues affecting Unidoc could be causing this to trigger compilation failures.
In order to work around this issue, this patch removes this use of Stopwatch since we don't use it anywhere else in the Spark codebase.
Author: Josh Rosen <joshrosen@databricks.com>
Closes#11376 from JoshRosen/remove-stopwatch.
When uses clicks more than one time on any stage in the DAG graph on the *Job* web UI page, many new *Stage* web UI pages are opened, but only half of their DAG graphs are expanded.
After this PR's fix, every newly opened *Stage* page's DAG graph is expanded.
Before:
![](https://cloud.githubusercontent.com/assets/15843379/13279144/74808e86-db10-11e5-8514-cecf31af8908.png)
After:
![](https://cloud.githubusercontent.com/assets/15843379/13279145/77ca5dec-db10-11e5-9457-8e1985461328.png)
## What changes were proposed in this pull request?
- Removed the `expandDagViz` parameter for _Stage_ page and related codes
- Added a `onclick` function setting `expandDagVizArrowKey(false)` as `true`
## How was this patch tested?
Manual tests (with this fix) to verified this fix work:
- clicked many times on _Job_ Page's DAG Graph → each newly opened Stage page's DAG graph is expanded
Manual tests (with this fix) to verified this fix do not break features we already had:
- refreshed many times for a same _Stage_ page (whose DAG already expanded) → DAG remained expanded upon every refresh
- refreshed many times for a same _Stage_ page (whose DAG unexpanded) → DAG remained unexpanded upon every refresh
- refreshed many times for a same _Job_ page (whose DAG already expanded) → DAG remained expanded upon every refresh
- refreshed many times for a same _Job_ page (whose DAG unexpanded) → DAG remained unexpanded upon every refresh
Author: Liwei Lin <proflin.me@gmail.com>
Closes#11368 from proflin/SPARK-13468.
Fixed the HTTP Server Host Name/IP issue i.e. HTTP Server to take the
configured host name/IP and not '0.0.0.0' always.
Author: Devaraj K <devaraj@apache.org>
Closes#11133 from devaraj-kavali/SPARK-13117.
## What changes were proposed in this pull request?
When we pass a Python function to JVM side, we also need to send its context, e.g. `envVars`, `pythonIncludes`, `pythonExec`, etc. However, it's annoying to pass around so many parameters at many places. This PR abstract python function along with its context, to simplify some pyspark code and make the logic more clear.
## How was the this patch tested?
by existing unit tests.
Author: Wenchen Fan <wenchen@databricks.com>
Closes#11342 from cloud-fan/python-clean.
Added an exception to be thrown in UnifiedMemoryManager.scala if the configuration given for executor memory is too low. Also modified the exception message thrown when driver memory is too low.
This patch was tested manually by passing in config options to Spark shell. I also added a test in UnifiedMemoryManagerSuite.scala
Author: Daniel Jalova <djalova@us.ibm.com>
Closes#11255 from djalova/SPARK-12759.
## What changes were proposed in this pull request?
Generates code for SortMergeJoin.
## How was the this patch tested?
Unit tests and manually tested with TPCDS Q72, which showed 70% performance improvements (from 42s to 25s), but micro benchmark only show minor improvements, it may depends the distribution of data and number of columns.
Author: Davies Liu <davies@databricks.com>
Closes#11248 from davies/gen_smj.
## What changes were proposed in this pull request?
History page now sorts the appID as a string, which can lead to unexpected order for the case "application_11111_9" and "application_11111_20".
Add a new sort type called appId-numeric can fix it.
## How was the this patch tested?
This patch was manually tested with UI. See the screenshot below:
![sortappidbetter](https://cloud.githubusercontent.com/assets/11683054/13185564/7f941a16-d707-11e5-8fb7-0316368d3030.png)
Author: zhuol <zhuol@yahoo-inc.com>
Closes#11259 from zhuoliu/13364.
JIRA: https://issues.apache.org/jira/browse/SPARK-13358
When trying to run a benchmark, I found that on my Ubuntu linux grep is not in /usr/bin/ but /bin/. So wondering if it is better to use which to retrieve grep path.
cc davies
Author: Liang-Chi Hsieh <viirya@gmail.com>
Closes#11231 from viirya/benchmark-grep-path.
## What changes were proposed in this pull request?
When there are some special characters (e.g., `"`, `\`) in `label`, DAG will be broken. This patch just escapes `label` to avoid DAG being broken by some special characters
## How was the this patch tested?
Jenkins tests
Author: Shixiong Zhu <shixiong@databricks.com>
Closes#11309 from zsxwing/SPARK-13298.
## What changes were proposed in this pull request?
This patch removes SparkContext.metricsSystem. SparkContext.metricsSystem returns MetricsSystem, which is a private class. I think it was added by accident.
In addition, I also removed an unused private[spark] method schedulerBackend setter.
## How was the this patch tested?
N/A.
Author: Reynold Xin <rxin@databricks.com>
This patch had conflicts when merged, resolved by
Committer: Josh Rosen <joshrosen@databricks.com>
Closes#11282 from rxin/SPARK-13413.
Currently the Mesos cluster dispatcher is not using offers from multiple roles correctly, as it simply aggregates all the offers resource values into one, but doesn't apply them correctly before calling the driver as Mesos needs the resources from the offers to be specified which role it originally belongs to. Multiple roles is already supported with fine/coarse grain scheduler, so porting that logic here to the cluster scheduler.
https://issues.apache.org/jira/browse/SPARK-10749
Author: Timothy Chen <tnachen@gmail.com>
Closes#8872 from tnachen/cluster_multi_roles.
## What changes were proposed in this pull request?
This PR tries to fix all typos in all markdown files under `docs` module,
and fixes similar typos in other comments, too.
## How was the this patch tested?
manual tests.
Author: Dongjoon Hyun <dongjoon@apache.org>
Closes#11300 from dongjoon-hyun/minor_fix_typos.
## What changes were proposed in this pull request?
This PR removes the support of SIMR, since SIMR is not actively used and maintained for a long time, also is not supported from `SparkSubmit`, so here propose to remove it.
## How was the this patch tested?
This patch is tested locally by running unit tests.
Author: jerryshao <sshao@hortonworks.com>
Closes#11296 from jerryshao/SPARK-13426.
## What changes were proposed in this pull request?
`JobWaiter.taskSucceeded` will be called for each task. When `resultHandler` throws an exception, `taskSucceeded` will also throw it for each task. DAGScheduler just catches it and reports it like this:
```Scala
try {
job.listener.taskSucceeded(rt.outputId, event.result)
} catch {
case e: Exception =>
// TODO: Perhaps we want to mark the resultStage as failed?
job.listener.jobFailed(new SparkDriverExecutionException(e))
}
```
Therefore `JobWaiter.jobFailed` may be called multiple times.
So `JobWaiter.jobFailed` should use `Promise.tryFailure` instead of `Promise.failure` because the latter one doesn't support calling multiple times.
## How was the this patch tested?
Jenkins tests.
Author: Shixiong Zhu <shixiong@databricks.com>
Closes#11280 from zsxwing/SPARK-13408.
`TaskMetrics.fromAccumulatorUpdates()` can fail if accumulators have been garbage-collected on the driver. To guard against this, this patch introduces `ListenerTaskMetrics`, a subclass of `TaskMetrics` which is used only in `TaskMetrics.fromAccumulatorUpdates()` and which eliminates the need to access the original accumulators on the driver.
Author: Josh Rosen <joshrosen@databricks.com>
Closes#11276 from JoshRosen/accum-updates-fix.
Clarify that reduce functions need to be commutative, and fold functions do not
See https://github.com/apache/spark/pull/11091
Author: Sean Owen <sowen@cloudera.com>
Closes#11217 from srowen/SPARK-13339.
## What changes were proposed in this pull request?
Fix some comparisons between unequal types that cause IJ warnings and in at least one case a likely bug (TaskSetManager)
## How was the this patch tested?
Running Jenkins tests
Author: Sean Owen <sowen@cloudera.com>
Closes#11253 from srowen/SPARK-13371.
This commit removes an unnecessary duplicate check in addPendingTask that meant
that scheduling a task set took time proportional to (# tasks)^2.
Author: Sital Kedia <skedia@fb.com>
Closes#11175 from sitalkedia/fix_stuck_driver.
See http://openjdk.java.net/jeps/223 for more information about the JDK 9 version string scheme.
Author: Claes Redestad <claes.redestad@gmail.com>
Closes#11160 from cl4es/master.
Due to being on a Windows platform I have been unable to run the tests as described in the "Contributing to Spark" instructions. As the change is only to two lines of code in the Web UI, which I have manually built and tested, I am submitting this pull request anyway. I hope this is OK.
Is it worth considering also including this fix in any future 1.5.x releases (if any)?
I confirm this is my own original work and license it to the Spark project under its open source license.
Author: markpavey <mark.pavey@thefilter.com>
Closes#11135 from markpavey/JIRA_SPARK-13142_WindowsWebUILogFix.