## What changes were proposed in this pull request?
Make spill tests wait until the job has completed before returning the number of stages that spilled.
## How was this patch tested?
Existing Jenkins tests.
Author: Sean Owen <sowen@cloudera.com>
Closes #13896 from srowen/SPARK-16193.
## What changes were proposed in this pull request?
Currently the initial buffer size in the sorter is hard-coded and is too small for large workloads. As a result, the sorter spends significant time expanding the buffer and copying the data. It would be useful to have it configurable.
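A hedged example of the intended usage; the configuration key below is an assumption for illustration, not confirmed by this description:
```scala
import org.apache.spark.SparkConf

// Sketch only: raise the sorter's initial buffer size (in bytes) for a large
// workload, avoiding repeated buffer expansion and copying. Key name assumed.
val conf = new SparkConf()
  .set("spark.shuffle.sort.initialBufferSize", "67108864") // 64 MB
```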
## How was this patch tested?
Tested by running a job on the cluster.
Author: Sital Kedia <skedia@fb.com>
Closes #13699 from sitalkedia/config_sort_buffer_upstream.
## The problem
Before this change, if either of the following cases happened to a task, the task would be marked as `FAILED` instead of `KILLED`:
- the task was killed before it was deserialized
- `executor.kill()` marked `taskRunner.killed`, but the worker thread threw the `TaskKilledException` before `task.killed()` was called
The reason is, in the `catch` block of the current [Executor.TaskRunner](https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/executor/Executor.scala#L362)'s implementation, we are mistakenly catching:
```scala
case _: TaskKilledException | _: InterruptedException if task.killed => ...
```
the semantics of which is:
- **(**`TaskKilledException` **OR** `InterruptedException`**)** **AND** `task.killed`
Then when `TaskKilledException` is thrown but `task.killed` is not marked, we would mark the task as `FAILED` (which should really be `KILLED`).
## What changes were proposed in this pull request?
This patch alters the catch condition's semantics from:
- **(**`TaskKilledException` **OR** `InterruptedException`**)** **AND** `task.killed`
to
- `TaskKilledException` **OR** **(**`InterruptedException` **AND** `task.killed`**)**
so that we can catch `TaskKilledException` correctly and mark the task as `KILLED` correctly.
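A minimal standalone sketch (not Spark's actual code) of why the guard placement matters: in Scala, an `if` guard on an alternative pattern applies to all of the alternatives.
```scala
object GuardSemantics {
  class TaskKilledException extends RuntimeException

  // Before: the guard covers BOTH alternatives, so a TaskKilledException
  // thrown while taskKilled == false falls through to the FAILED case.
  def before(e: Throwable, taskKilled: Boolean): String = e match {
    case _: TaskKilledException | _: InterruptedException if taskKilled => "KILLED"
    case _ => "FAILED"
  }

  // After: TaskKilledException is matched unconditionally.
  def after(e: Throwable, taskKilled: Boolean): String = e match {
    case _: TaskKilledException => "KILLED"
    case _: InterruptedException if taskKilled => "KILLED"
    case _ => "FAILED"
  }

  def main(args: Array[String]): Unit = {
    println(before(new TaskKilledException, taskKilled = false)) // FAILED (the bug)
    println(after(new TaskKilledException, taskKilled = false))  // KILLED (fixed)
  }
}
```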
## How was this patch tested?
Added a unit test that failed before the change; ran the new test 1000 times manually.
Author: Liwei Lin <lwlin7@gmail.com>
Closes #13685 from lw-lin/fix-task-killed.
## What changes were proposed in this pull request?
Replace use of `commons-lang` with `commons-lang3` and forbid the former via scalastyle; remove `NotImplementedException` from `commons-lang` in favor of JDK `UnsupportedOperationException`
## How was this patch tested?
Jenkins tests
Author: Sean Owen <sowen@cloudera.com>
Closes #13843 from srowen/SPARK-16129.
## What changes were proposed in this pull request?
Since SPARK-13220 (Deprecate "yarn-client" and "yarn-cluster"), YarnClusterSuite doesn't test "yarn cluster" mode correctly.
This pull request fixes it.
## How was this patch tested?
Unit test
Author: peng.zhang <peng.zhang@xiaomi.com>
Closes #13836 from renozhang/SPARK-16125-test-yarn-cluster-mode.
## What changes were proposed in this pull request?
This changes the behavior of --num-executors and spark.executor.instances when using dynamic allocation. Instead of turning dynamic allocation off, it uses the value for the initial number of executors.
This change was discussed on [SPARK-13723](https://issues.apache.org/jira/browse/SPARK-13723). I highly recommend using it while we can change the behavior for 2.0.0. In practice, the 1.x behavior causes unexpected behavior for users (it is not clear that it disables dynamic allocation) and wastes cluster resources because users rarely notice the log message.
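A sketch of the derivation, assuming (as the name `Utils.getDynamicAllocationInitialExecutors` below suggests) that the initial count is the maximum of the relevant settings; the exact precedence is an assumption here:
```scala
// Assumed logic, for illustration only -- not the exact Spark implementation.
def initialExecutors(conf: Map[String, String]): Int = {
  def intConf(key: String): Int = conf.get(key).map(_.toInt).getOrElse(0)
  Seq(
    intConf("spark.dynamicAllocation.minExecutors"),
    intConf("spark.dynamicAllocation.initialExecutors"),
    intConf("spark.executor.instances") // i.e. --num-executors
  ).max
}
```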
## How was this patch tested?
This patch updates tests and adds a test for Utils.getDynamicAllocationInitialExecutors.
Author: Ryan Blue <blue@apache.org>
Closes #13338 from rdblue/SPARK-13723-num-executors-with-dynamic-allocation.
## What changes were proposed in this pull request?
Initialize the logger instance lazily, in the Scala-preferred way.
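A minimal sketch of the pattern, assuming an SLF4J-backed trait along the lines of Spark's internal `Logging`:
```scala
import org.slf4j.{Logger, LoggerFactory}

trait Logging {
  // `lazy val` defers the logger lookup until first use; `@transient`
  // keeps the logger out of serialized closures.
  @transient private lazy val log: Logger =
    LoggerFactory.getLogger(getClass.getName.stripSuffix("$"))

  protected def logInfo(msg: => String): Unit =
    if (log.isInfoEnabled) log.info(msg)
}
```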
## How was this patch tested?
By running `./build/mvn clean test` locally
Author: Prajwal Tuladhar <praj@infynyxx.com>
Closes #13842 from infynyxx/spark_internal_logger.
## What changes were proposed in this pull request?
This fixes SerializationDebugger to not recurse forever when `writeReplace` returns an object of the same class, which is the case for at least the `SQLMetrics` class.
See also the OpenJDK unit tests on the behavior of recursive `writeReplace()`:
f4d80957e8/test/java/io/Serializable/nestedReplace/NestedReplace.java
cc davies cloud-fan
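A hypothetical class illustrating the problematic shape: `writeReplace()` returns an instance of the same class, which Java serialization handles (it applies the replacement only once) but a naively recursive debugger would follow forever.
```scala
// Hypothetical example, not SQLMetrics itself.
class SelfReplacing(val id: Int) extends Serializable {
  // Returns an object of the same class; ObjectOutputStream stops after one
  // replacement, but a debugger recursing on writeReplace() would loop.
  private def writeReplace(): Object = new SelfReplacing(id)
}
```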
## How was this patch tested?
Unit tests for SerializationDebugger.
Author: Eric Liang <ekl@databricks.com>
Closes #13814 from ericl/spark-16003.
## What changes were proposed in this pull request?
Five changes here -- the first two were causing failures with BlacklistIntegrationSuite:
1. The testing framework didn't include the reviveOffers thread, so the test which involved delay scheduling might never submit offers late enough for the delay scheduling to kick in. So added in the periodic revive offers, just like the real scheduler.
2. `assertEmptyDataStructures` would occasionally fail, because it appeared there was still an active job. This is because in DAGScheduler, the jobWaiter is notified of the job completion before the data structures are cleaned up. Most of the time the test code that is waiting on the jobWaiter won't become active until after the data structures are cleared, but occasionally the race goes the other way, and the assertions fail.
3. `DAGSchedulerSuite` was not stopping all the inner parts it was setting up, so each test was leaking a number of threads. So we stop those parts too.
4. Turns out that `assertMapOutputAvailable` is not terribly useful in this framework -- most of the places I was trying to use it suffer from some race.
5. When there is an exception in the backend, try to improve the error message a little. Before, the exception was printed to the console, but the test would fail with a timeout, and the logs wouldn't show anything.
## How was this patch tested?
I ran all the tests in `BlacklistIntegrationSuite` 5k times and everything in `DAGSchedulerSuite` 1k times on my laptop. Also I ran a full Jenkins build with `BlacklistIntegrationSuite` 500 times and `DAGSchedulerSuite` 50 times, see https://github.com/apache/spark/pull/13548. (I tried more times but Jenkins timed out.)
To check for more leaked threads, I added some code to dump the list of all threads at the end of each test in DAGSchedulerSuite, which is how I discovered the mapOutputTracker and eventLoop were leaking threads. (I removed that code from the final pr, just part of the testing.)
And I'll run Jenkins on this a couple of times to do one more check.
Author: Imran Rashid <irashid@cloudera.com>
Closes #13565 from squito/blacklist_extra_tests.
## What changes were proposed in this pull request?
Add a configuration to allow people to set a minimum polling delay when no new data arrives (default is 10ms). This PR also cleans up some INFO logs.
## How was this patch tested?
Existing unit tests.
Author: Shixiong Zhu <shixiong@databricks.com>
Closes #13718 from zsxwing/SPARK-16002.
## What changes were proposed in this pull request?
This PR makes `input_file_name()` function return the file paths not empty strings for external data sources based on `NewHadoopRDD`, such as [spark-redshift](cba5eee1ab/src/main/scala/com/databricks/spark/redshift/RedshiftRelation.scala (L149)) and [spark-xml](https://github.com/databricks/spark-xml/blob/master/src/main/scala/com/databricks/spark/xml/util/XmlFile.scala#L39-L47).
With these external data sources, the code below:
```scala
df.select(input_file_name).show()
```
will produce
- **Before**
```
+-----------------+
|input_file_name()|
+-----------------+
| |
+-----------------+
```
- **After**
```
+--------------------+
| input_file_name()|
+--------------------+
|file:/private/var...|
+--------------------+
```
## How was this patch tested?
Unit tests in `ColumnExpressionSuite`.
Author: hyukjinkwon <gurwls223@gmail.com>
Closes #13759 from HyukjinKwon/SPARK-16044.
## What changes were proposed in this pull request?
[SPARK-15395](https://issues.apache.org/jira/browse/SPARK-15395) changes the behavior that how the driver gets the executor host and the driver will get the executor IP address instead of the host name. This PR just sends the hostname from executors to driver so that driver can pass it to TaskScheduler.
## How was this patch tested?
Existing unit tests.
Author: Shixiong Zhu <shixiong@databricks.com>
Closes #13741 from zsxwing/SPARK-16017.
## What changes were proposed in this pull request?
This pull request refactors parts of the DAGScheduler to improve readability, focusing on the code around stage creation. One goal of this change is to make it clearer which functions may create new stages (as opposed to looking up stages that already exist). There are no functionality changes in this pull request. In more detail:
* shuffleToMapStage was renamed to shuffleIdToMapStage (when reading the existing code I have sometimes struggled to remember what the key is -- is it a stage? A stage id? This change is intended to avoid that confusion)
* Cleaned up the code to create shuffle map stages. Previously, creating a shuffle map stage involved 3 different functions (newOrUsedShuffleStage, newShuffleMapStage, and getShuffleMapStage), and it wasn't clear what the purpose of each function was. With the new code, a single function (getOrCreateShuffleMapStage) is responsible for getting a stage (if it already exists) or creating new shuffle map stages and any missing ancestor stages, and it delegates to createShuffleMapStage when new stages need to be created. There's some remaining confusion here because the getOrCreateParentStages call in createShuffleMapStage may recursively create ancestor stages; this is an issue I plan to fix in a future pull request, because it's trickier to fix and involves a slight functionality change.
* newResultStage was renamed to createResultStage, for consistency with naming around shuffle map stages.
* getParentStages has been renamed to getOrCreateParentStages, to make it clear that this function will sometimes create missing ancestor stages.
* The only *slight* functionality change is that on line 478, updateJobIdStageIdMaps now uses a stage's parents instance variable rather than re-calculating them (I couldn't see any reason why they'd need to be re-calculated, and suspect this is just leftover from older code).
* getAncestorShuffleDependencies was renamed to getMissingAncestorShuffleDependencies, to make it clear that this only returns dependencies that have not yet been run.
cc squito markhamstra JoshRosen (who requested more DAG scheduler commenting long ago -- an issue this pull request tries, in part, to address)
FYI rxin
Author: Kay Ousterhout <kayousterhout@gmail.com>
Closes #13677 from kayousterhout/SPARK-15926.
When `--packages` is specified with spark-shell the classes from those packages cannot be found, which I think is due to some of the changes in SPARK-12343.
Tested manually with both scala 2.10 and 2.11 repls.
vanzin davies can you guys please review?
Author: Marcelo Vanzin <vanzin@cloudera.com>
Author: Nezih Yigitbasi <nyigitbasi@netflix.com>
Closes #13709 from nezihyigitbasi/SPARK-15782.
## What changes were proposed in this pull request?
Reduce `spark.memory.fraction` default to 0.6 in order to make it fit within default JVM old generation size (2/3 heap). See JIRA discussion. This means a full cache doesn't spill into the new gen. CC andrewor14
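A back-of-envelope check of why 0.6 fits; the heap size and the JVM default `NewRatio=2` (old generation = 2/3 of heap) are assumptions for illustration:
```scala
// Illustrative arithmetic only.
object MemoryFractionCheck extends App {
  val heapGb   = 4.0
  val oldGenGb = heapGb * 2.0 / 3.0 // ~2.67 GB old generation with NewRatio=2
  val at075    = heapGb * 0.75      // ~3.0 GB: a larger fraction exceeds old gen, spills to new gen
  val at060    = heapGb * 0.6       // ~2.4 GB: fits within old gen
  println(f"oldGen=$oldGenGb%.2f GB, fraction 0.75 -> $at075%.2f GB, 0.6 -> $at060%.2f GB")
}
```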
## How was this patch tested?
Jenkins tests.
Author: Sean Owen <sowen@cloudera.com>
Closes #13618 from srowen/SPARK-15796.
## What changes were proposed in this pull request?
gapply() applies an R function to each group of a DataFrame, grouped by one or more columns, and returns a DataFrame. It is like GroupedDataSet.flatMapGroups() in the Dataset API.
Please let me know what you think and whether you have any ideas to improve it.
Thank you!
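For comparison, a small Scala sketch of the `flatMapGroups()` analogue mentioned above; the data and the average-per-group computation are illustrative:
```scala
import org.apache.spark.sql.SparkSession

object FlatMapGroupsExample extends App {
  val spark = SparkSession.builder().master("local[*]").appName("demo").getOrCreate()
  import spark.implicits._

  val ds = Seq(("a", 1), ("a", 2), ("b", 3)).toDS()
  // Apply a function per group and flatten the results, as gapply() does in R.
  val avgByKey = ds.groupByKey(_._1).flatMapGroups { (key, rows) =>
    val values = rows.map(_._2).toSeq
    Iterator((key, values.sum.toDouble / values.size))
  }
  avgByKey.show()
}
```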
## How was this patch tested?
Unit tests.
1. Primitive test with different column types
2. Add a boolean column
3. Compute average by a group
Author: Narine Kokhlikyan <narine.kokhlikyan@gmail.com>
Author: NarineK <narine.kokhlikyan@us.ibm.com>
Closes #12836 from NarineK/gapply2.
## What changes were proposed in this pull request?
SPARK-15927 exacerbated a race in BasicSchedulerIntegrationTest, so it went from very unlikely to fairly frequent. The issue is that stage numbering is not completely deterministic, but these tests treated it like it was. So turn off the tests.
## How was this patch tested?
On my laptop, the test failed about 10% of the time before this change, and didn't fail in 500 runs after the change.
Author: Imran Rashid <irashid@cloudera.com>
Closes #13688 from squito/hotfix_basic_scheduler.
## What changes were proposed in this pull request?
When `--packages` is specified with `spark-shell` the classes from those packages cannot be found, which I think is due to some of the changes in `SPARK-12343`. In particular `SPARK-12343` removes a line that sets the `spark.jars` system property in client mode, which is used by the repl main class to set the classpath.
## How was this patch tested?
Tested manually.
This system property is used by the repl to populate its classpath. If
this is not set properly the classes for external packages cannot be
found.
tgravescs vanzin as you may be familiar with this part of the code.
Author: Nezih Yigitbasi <nyigitbasi@netflix.com>
Closes #13527 from nezihyigitbasi/repl-fix.
## What changes were proposed in this pull request?
Link to jira which describes the problem: https://issues.apache.org/jira/browse/SPARK-15826
The fix in this PR is to allow users to specify the encoding in the pipe() operation. For backward compatibility, the default value is kept as the system default.
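A hedged usage sketch, assuming the new `pipe()` overload exposes the charset as a named `encoding` parameter (the parameter name is an assumption here):
```scala
import org.apache.spark.{SparkConf, SparkContext}

object PipeEncodingExample extends App {
  val sc = new SparkContext(new SparkConf().setMaster("local[*]").setAppName("pipe"))
  // Parameter name `encoding` assumed; omitting it keeps the system default.
  val piped = sc.parallelize(Seq("héllo", "wörld"))
    .pipe(command = Seq("cat"), encoding = "UTF-8")
  piped.collect().foreach(println)
}
```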
## How was this patch tested?
Ran existing unit tests
Author: Tejas Patil <tejasp@fb.com>
Closes #13563 from tejasapatil/pipedrdd_utf8.
## What changes were proposed in this pull request?
This patch is a follow-up to https://github.com/apache/spark/pull/13288 completing the renaming:
- LocalScheduler -> LocalSchedulerBackend~~Endpoint~~
## How was this patch tested?
Updated test cases to reflect the name change.
Author: Liwei Lin <lwlin7@gmail.com>
Closes #13683 from lw-lin/rename-backend.
Use the config variable definition both to set and parse the value,
avoiding issues with code expecting the value in a different format.
Tested by running spark-submit with --principal / --keytab.
Author: Marcelo Vanzin <vanzin@cloudera.com>
Closes #13669 from vanzin/SPARK-15046.
To try to eliminate redundant code to traverse the RDD dependency graph,
this PR creates a new function getShuffleDependencies that returns
shuffle dependencies that are immediate parents of a given RDD. This
new function is used by getParentStages and
getAncestorShuffleDependencies.
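A simplified sketch of the traversal (not the exact Spark implementation): walk the dependency graph from a given RDD and stop at the first `ShuffleDependency` on each path.
```scala
import scala.collection.mutable
import org.apache.spark.ShuffleDependency
import org.apache.spark.rdd.RDD

// Sketch only; the real version lives in DAGScheduler.
def getShuffleDependencies(rdd: RDD[_]): Set[ShuffleDependency[_, _, _]] = {
  val parents = mutable.HashSet.empty[ShuffleDependency[_, _, _]]
  val visited = mutable.HashSet.empty[RDD[_]]
  val waiting = mutable.Stack[RDD[_]](rdd)
  while (waiting.nonEmpty) {
    val current = waiting.pop()
    if (visited.add(current)) {
      current.dependencies.foreach {
        case shuffle: ShuffleDependency[_, _, _] => parents += shuffle // immediate parent
        case narrow => waiting.push(narrow.rdd) // keep walking narrow dependencies
      }
    }
  }
  parents.toSet
}
```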
Author: Kay Ousterhout <kayousterhout@gmail.com>
Closes #13646 from kayousterhout/SPARK-15927.
## What changes were proposed in this pull request?
Another PR to clean up recent build warnings. This particularly cleans up several instances of the old accumulator API usage in tests that are straightforward to update. I think this qualifies as "minor".
## How was this patch tested?
Jenkins
Author: Sean Owen <sowen@cloudera.com>
Closes #13642 from srowen/BuildWarnings.
## What changes were proposed in this pull request?
`Dispatcher.stopped` is guarded by `this`, but it is used without synchronization in the `postMessage` function. This PR fixes this and also makes the exception message more accurate.
## How was this patch tested?
Pass the existing Jenkins tests.
Author: Dongjoon Hyun <dongjoon@apache.org>
Closes #13634 from dongjoon-hyun/SPARK-15913.
## What changes were proposed in this pull request?
Remove deprecated support for `zk://` master (`mesos://zk//` remains supported)
## How was this patch tested?
Jenkins
Author: Sean Owen <sowen@cloudera.com>
Closes #13625 from srowen/SPARK-15876.
## What changes were proposed in this pull request?
- Deprecate old Java accumulator API; should use Scala now (see the sketch after this list)
- Update Java tests and examples
- Don't bother testing old accumulator API in Java 8 (too)
- (fix a misspelling too)
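As referenced above, a short sketch of the non-deprecated Scala accumulator usage:
```scala
import org.apache.spark.{SparkConf, SparkContext}

object AccumulatorExample extends App {
  val sc = new SparkContext(new SparkConf().setMaster("local[*]").setAppName("acc"))
  val recordCount = sc.longAccumulator("recordCount") // replaces the old Java API
  sc.parallelize(1 to 100).foreach(_ => recordCount.add(1))
  println(recordCount.value) // 100
}
```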
## How was this patch tested?
Jenkins tests
Author: Sean Owen <sowen@cloudera.com>
Closes #13606 from srowen/SPARK-15086.
## What changes were proposed in this pull request?
SPARK_MASTER_IP is a deprecated environment variable. It is replaced by SPARK_MASTER_HOST according to MasterArguments.scala.
## How was this patch tested?
Manually verified.
Author: bomeng <bmeng@us.ibm.com>
Closes #13543 from bomeng/SPARK-15806.
## What changes were proposed in this pull request?
These tests weren't properly using `LocalSparkContext` so weren't cleaning up correctly when tests failed.
## How was this patch tested?
Jenkins.
Author: Imran Rashid <irashid@cloudera.com>
Closes #13602 from squito/SPARK-15878_cleanup_replaylistener.
## What changes were proposed in this pull request?
Adds codahale metrics for the codegen source text size and how long it takes to compile. The size is particularly interesting, since the JVM does have hard limits on how large methods can get.
To simplify, I added the metrics under a statically-initialized source that is always registered with SparkEnv.
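A hedged sketch of the approach with Dropwizard (Codahale) metrics; the registry layout and metric names below are assumptions, not Spark's actual ones:
```scala
import com.codahale.metrics.MetricRegistry

// Statically initialized holder, so the metrics exist as soon as the class loads.
object CodegenMetricsSketch {
  val registry    = new MetricRegistry()
  val sourceSize  = registry.histogram("codegen.sourceCodeSize")  // size of generated source
  val compileTime = registry.histogram("codegen.compilationTime") // milliseconds

  def compileWithMetrics(source: String)(compile: String => Unit): Unit = {
    sourceSize.update(source.length)
    val start = System.nanoTime()
    compile(source)
    compileTime.update((System.nanoTime() - start) / 1000000)
  }
}
```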
## How was this patch tested?
Unit tests
Author: Eric Liang <ekl@databricks.com>
Closes #13586 from ericl/spark-15860.
## What changes were proposed in this pull request?
This adds support for radix sort of nullable long fields. When a sort field is null and radix sort is enabled, we keep nulls in a separate region of the sort buffer so that radix sort does not need to deal with them. This also has performance benefits when sorting smaller integer types, since the current representation of nulls in two's complement (Long.MIN_VALUE) otherwise forces a full-width radix sort.
This strategy for nulls does mean the sort is no longer stable. cc davies
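A simplified sketch of the null-handling strategy, using a library sort as a stand-in for the radix pass:
```scala
// Segregate nulls into their own region, then sort only the non-null part.
def sortNullsFirst(values: Array[java.lang.Long]): Array[java.lang.Long] = {
  val (nulls, nonNulls) = values.partition(_ == null)
  // The sorter never sees nulls, so no full-width sentinel like Long.MIN_VALUE
  // is needed, and narrower key widths stay narrow for the radix pass.
  nulls ++ nonNulls.sortBy(_.longValue)
}
```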
## How was this patch tested?
Existing randomized sort tests for correctness. I also tested some TPCDS queries and there does not seem to be any significant regression for non-null sorts.
Some test queries (best of 5 runs each).
Before change:
```
scala> val start = System.nanoTime; spark.range(5000000).selectExpr("if(id > 5, cast(hash(id) as long), NULL) as h").coalesce(1).orderBy("h").collect(); (System.nanoTime - start) / 1e6
start: Long = 3190437233227987
res3: Double = 4716.471091
```
After change:
```
scala> val start = System.nanoTime; spark.range(5000000).selectExpr("if(id > 5, cast(hash(id) as long), NULL) as h").coalesce(1).orderBy("h").collect(); (System.nanoTime - start) / 1e6
start: Long = 3190367870952791
res4: Double = 2981.143045
```
Author: Eric Liang <ekl@databricks.com>
Closes #13161 from ericl/sc-2998.
## What changes were proposed in this pull request?
Use new Spark logo including "Apache" (now, with crushed PNGs). Remove old unreferenced logo files.
## How was this patch tested?
Manual check of generated HTML site and Spark UI. I searched for references to the deleted files to make sure they were not used.
Author: Sean Owen <sowen@cloudera.com>
Closes #13609 from srowen/SPARK-15879.
## What changes were proposed in this pull request?
In Scala, `immutable.List.length` is an expensive operation, so we should avoid using `Seq.length == 0` or `Seq.length > 0` and use `Seq.isEmpty` and `Seq.nonEmpty` instead.
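A quick illustration (the list contents are arbitrary):
```scala
val xs: Seq[Int] = List(1, 2, 3)
val slow  = xs.length > 0 // O(n) for List: traverses every element
val fast  = xs.nonEmpty   // O(1): only checks the first cons cell
val empty = xs.isEmpty    // O(1), instead of xs.length == 0
```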
## How was this patch tested?
existing tests
Author: wangyang <wangyang@haizhi.com>
Closes #13601 from yangw1234/isEmpty.
This reverts commit 695dbc816a.
This change is being reverted because it hurts performance of some jobs, and
only helps in a narrow set of cases. For more discussion, refer to the JIRA.
Author: Kay Ousterhout <kayousterhout@gmail.com>
Closes #13580 from kayousterhout/revert-SPARK-14485.
## What changes were proposed in this pull request?
SparkContext.listAccumulator, by Spark's convention, makes it sound like "list" is a verb and the method should return a list of accumulators. This patch renames the method and the class to "collection accumulator".
## How was this patch tested?
Updated test case to reflect the names.
Author: Reynold Xin <rxin@databricks.com>
Closes #13594 from rxin/SPARK-15866.
## What changes were proposed in this pull request?
With very wide tables, e.g. thousands of fields, the plan output is unreadable and often causes OOMs due to inefficient string processing. This truncates all struct and operator field lists to a user-configurable threshold to limit the performance impact.
It would also be nice to optimize string generation to avoid this sort of O(n^2) slowdown entirely (i.e. use StringBuilder everywhere, including expressions), but this is probably too large a change for 2.0 at this point, and truncation has other benefits for usability.
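A minimal sketch of the truncation idea; the helper name and summary format are assumptions:
```scala
// Render at most `maxFields` entries and summarize the remainder, bounding
// the cost of building plan strings for very wide schemas.
def truncatedString(items: Seq[String], maxFields: Int): String =
  if (items.length <= maxFields) items.mkString(", ")
  else (items.take(maxFields) :+ s"... ${items.length - maxFields} more fields").mkString(", ")
```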
## How was this patch tested?
Added a microbenchmark that covers this case particularly well. I also ran the microbenchmark while varying the truncation threshold.
```
numFields = 5
wide shallowly nested struct field r/w: Best/Avg Time(ms) Rate(M/s) Per Row(ns) Relative
------------------------------------------------------------------------------------------------
2000 wide x 50 rows (write in-mem) 2336 / 2558 0.0 23364.4 0.1X
numFields = 25
wide shallowly nested struct field r/w: Best/Avg Time(ms) Rate(M/s) Per Row(ns) Relative
------------------------------------------------------------------------------------------------
2000 wide x 50 rows (write in-mem) 4237 / 4465 0.0 42367.9 0.1X
numFields = 100
wide shallowly nested struct field r/w: Best/Avg Time(ms) Rate(M/s) Per Row(ns) Relative
------------------------------------------------------------------------------------------------
2000 wide x 50 rows (write in-mem) 10458 / 11223 0.0 104582.0 0.0X
numFields = Infinity
wide shallowly nested struct field r/w: Best/Avg Time(ms) Rate(M/s) Per Row(ns) Relative
------------------------------------------------------------------------------------------------
[info] java.lang.OutOfMemoryError: Java heap space
```
Author: Eric Liang <ekl@databricks.com>
Author: Eric Liang <ekhliang@gmail.com>
Closes #13537 from ericl/truncated-string.
## What changes were proposed in this pull request?
This makes microbenchmarks run for at least 2 seconds by default, to allow some time for JIT compilation to kick in.
## How was this patch tested?
Tested manually with existing microbenchmarks. This change is backwards compatible in that existing microbenchmarks which specified numIters per-case will still run exactly that number of iterations. Microbenchmarks which previously overrode defaultNumIters now override minNumIters.
cc hvanhovell
Author: Eric Liang <ekl@databricks.com>
Author: Eric Liang <ekhliang@gmail.com>
Closes #13472 from ericl/spark-15735.
Currently, when an executor is removed by the driver due to a heartbeat timeout, the driver re-queues the tasks on that executor and sends a kill command to the cluster to kill the executor.
However, a running task on that executor may finish and return its result to the driver before the executor is killed by that command. In that case, the driver accepts the task-finished event and ignores the speculative and re-queued tasks.
But since the executor has already been removed by the driver, the result of the finished task cannot be saved on the driver, because its BlockManagerId has also been removed from BlockManagerMaster. So the result data of the stage is incomplete, which then causes a fetch failure. For more details, see [SPARK-14485](https://issues.apache.org/jira/browse/SPARK-14485).
This PR introduces a mechanism to ignore this kind of task-finished event.
N/A
Author: zhonghaihua <793507405@qq.com>
Closes #12258 from zhonghaihua/ignoreTaskFinishForExecutorLostAndRemovedByDriver.
## What changes were proposed in this pull request?
There is still some flakiness in BlacklistIntegrationSuite, so turning it off for the moment to avoid breaking more builds -- will turn it back on with more fixes.
## How was this patch tested?
jenkins.
Author: Imran Rashid <irashid@cloudera.com>
Closes #13528 from squito/ignore_blacklist.
## What changes were proposed in this pull request?
Change the way Spark picks up version information. Also embed the build information to better identify the Spark version running.
More context can be found here : https://github.com/apache/spark/pull/12152
## How was this patch tested?
Ran the mvn and sbt builds to verify the version information was being displayed correctly on executing `spark-submit --version`
![image](https://cloud.githubusercontent.com/assets/7732317/15197251/f7c673a2-1795-11e6-8b2f-88f2a70cf1c1.png)
Author: Dhruve Ashar <dhruveashar@gmail.com>
Closes #13061 from dhruve/impr/SPARK-14279.
## What changes were proposed in this pull request?
`an -> a`
Use cmds like `find . -name '*.R' | xargs -i sh -c "grep -in ' an [^aeiou]' {} && echo {}"` to generate candidates, and review them one by one.
## How was this patch tested?
manual tests
Author: Zheng RuiFeng <ruifengz@foxmail.com>
Closes #13515 from zhengruifeng/an_a.
## What changes were proposed in this pull request?
Stop using the abbreviated and ambiguous timezone "EST" in a test, since it is machine-local default timezone dependent, and fails in different timezones. Fixed [SPARK-15723](https://issues.apache.org/jira/browse/SPARK-15723).
## How was this patch tested?
Note that to reproduce this problem in any locale/timezone, you can modify the scalatest-maven-plugin `argLine` to add a timezone:
```
<argLine>-ea -Xmx3g -XX:MaxPermSize=${MaxPermGen} -XX:ReservedCodeCacheSize=${CodeCacheSize} -Duser.timezone="Australia/Sydney"</argLine>
```
and run
```
$ mvn test -DwildcardSuites=org.apache.spark.status.api.v1.SimpleDateParamSuite -Dtest=none
```
Equally, this will fix it in an affected timezone:
```
<argLine>-ea -Xmx3g -XX:MaxPermSize=${MaxPermGen} -XX:ReservedCodeCacheSize=${CodeCacheSize} -Duser.timezone="America/New_York"</argLine>
```
To test the fix, apply the above change to `pom.xml` to set the test TZ to `Australia/Sydney`, and confirm the test now passes.
Author: Brett Randall <javabrett@gmail.com>
Closes #13462 from javabrett/SPARK-15723-SimpleDateParamSuite.
## What changes were proposed in this pull request?
Currently, the memory for the temporary buffer used by TimSort is always allocated on-heap without bookkeeping; it could cause OOM in both on-heap and off-heap modes.
This PR tries to manage that by preallocating it together with the pointer array, the same as RadixSort. It works for both on-heap and off-heap modes.
This PR also changes the loadFactor of BytesToBytesMap to 0.5 (it was 0.70); this enables us to use radix sort and also makes sure that we have enough memory for TimSort.
## How was this patch tested?
Existing tests.
Author: Davies Liu <davies@databricks.com>
Closes #13318 from davies/fix_timsort.
## What changes were proposed in this pull request?
Currently the `SparkContext` API `setLogLevel(level: String)` cannot handle lowercase or mixed-case input strings, although `org.apache.log4j.Level.toLevel` can.
This PR allows case-insensitive user input for the log level.
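A short usage sketch of the change (the surrounding setup is illustrative):
```scala
import org.apache.spark.{SparkConf, SparkContext}

object LogLevelExample extends App {
  val sc = new SparkContext(new SparkConf().setMaster("local[*]").setAppName("log-level"))
  sc.setLogLevel("warn") // previously only the exact uppercase "WARN" was accepted
  sc.setLogLevel("Info") // mixed case now works too
}
```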
## How was this patch tested?
A unit test case is added.
Author: Xin Wu <xinwu@us.ibm.com>
Closes #13422 from xwu0226/reset_loglevel.
## What changes were proposed in this pull request?
After upgrading Jetty to 9.2, we always see "WARN org.eclipse.jetty.server.handler.AbstractHandler: No Server set for org.eclipse.jetty.server.handler.ErrorHandler" while running any test cases.
This PR will fix it.
## How was this patch tested?
The existing test cases will cover it.
Author: bomeng <bmeng@us.ibm.com>
Closes #13475 from bomeng/SPARK-15737.
## What changes were proposed in this pull request?
BlacklistIntegrationSuite (introduced by SPARK-10372) is a bit flaky because of some race conditions:
1. Failed jobs might have non-empty results, because the resultHandler will be invoked for successful tasks (if there are task successes before failures)
2. taskScheduler.taskIdToTaskSetManager must be protected by a lock on taskScheduler
(1) has failed a handful of Jenkins builds recently. I don't think I've seen (2) in Jenkins, but I've run into it with some uncommitted tests I'm working on where there are lots more tasks.
While I was in there, I also made an unrelated fix to `runningTasks` in the test framework -- there was a pointless `O(n)` operation to remove completed tasks that could be `O(1)`.
## How was this patch tested?
I modified the o.a.s.scheduler.BlacklistIntegrationSuite to have it run the tests 1k times on my laptop. It failed 11 times before this change, and none with it. (Pretty sure all the failures were problem (1), though I didn't check all of them).
Also the full suite of tests via jenkins.
Author: Imran Rashid <irashid@cloudera.com>
Closes #13454 from squito/SPARK-15714.
If an RDD partition is cached on disk and the DiskStore file is lost, then reads of that cached partition will fail and the missing partition is supposed to be recomputed by a new task attempt. In the current BlockManager implementation, however, the missing file does not trigger any metadata updates / does not invalidate the cache, so subsequent task attempts will be scheduled on the same executor and the doomed read will be repeatedly retried, leading to repeated task failures and eventually a total job failure.
In order to fix this problem, the executor with the missing file needs to properly mark the corresponding block as missing so that it stops advertising itself as a cache location for that block.
This patch fixes this bug and adds an end-to-end regression test (in `FailureSuite`) and a set of unit tests (in `BlockManagerSuite`).
Author: Josh Rosen <joshrosen@databricks.com>
Closes #13473 from JoshRosen/handle-missing-cache-files.