## What changes were proposed in this pull request?
Add `ThreadUtils.awaitReady` similar to `ThreadUtils.awaitResult` and disallow `Await.ready`.
## How was this patch tested?
Jenkins
Author: Shixiong Zhu <shixiong@databricks.com>
Closes#17763 from zsxwing/awaitready.
## What changes were proposed in this pull request?
Executor task reaper may fail to detect if a task is finished or not when a task is finishing but being killed at the same time.
The fix is pretty easy, just flip the "finished" flag when a task is successful.
## How was this patch tested?
Jenkins
Author: Shixiong Zhu <shixiong@databricks.com>
Closes#18021 from zsxwing/SPARK-20788.
## What changes were proposed in this pull request?
In
```
./bin/spark-shell --master=local[64]
```
I ran
```
sc.parallelize(1 to 100000, 100000).count()
```
and profiled the time spend in the LiveListenerBus event processing thread. I discovered that the majority of the time was being spent in `TaskMetrics.empty` calls in `JobProgressListener.onTaskStart`. It turns out that we can slightly refactor to remove the need to construct one empty instance per call, greatly improving the performance of this code.
The performance gains here help to avoid an issue where listener events would be dropped because the JobProgressListener couldn't keep up with the throughput.
**Before:**
![image](https://cloud.githubusercontent.com/assets/50748/26133095/95bcd42a-3a59-11e7-8051-a50550e447b8.png)
**After:**
![image](https://cloud.githubusercontent.com/assets/50748/26133070/7935e148-3a59-11e7-8c2d-73d5aa5a2397.png)
## How was this patch tested?
Benchmarks described above.
Author: Josh Rosen <joshrosen@databricks.com>
Closes#18008 from JoshRosen/nametoaccums-improvements.
## What changes were proposed in this pull request?
In the current codes, when worker connects to master, master will send its address to the worker. Then worker will save this address and use it to reconnect in case of failure. However, sometimes, this address is not correct. If there is a proxy between master and worker, the address master sent is not the address of proxy.
In this PR, the master address used by the worker will be sent to the master, then master just replies this address back, worker will use this address to reconnect in case of failure. In other words, the worker will use the config master address set in the worker side if possible rather than the master address set in the master side.
There is still one potential issue though. When a master is restarted or takes over leadership, the work will use the address sent from the master to connect. If there is still a proxy between master and worker, the address may be wrong. However, there is no way to figure it out just in the worker.
## How was this patch tested?
The new added unit test.
Author: Shixiong Zhu <shixiong@databricks.com>
Closes#17821 from zsxwing/SPARK-20529.
## What changes were proposed in this pull request?
After https://github.com/apache/spark/pull/17596 , we do not send internal accumulator name to executor side anymore, and always look up the accumulator name in `AccumulatorContext`.
This cause a regression if the accumulator is already garbage collected, this PR fixes this by still sending accumulator name for `SQLMetrics`.
## How was this patch tested?
N/A
Author: Wenchen Fan <wenchen@databricks.com>
Closes#17931 from cloud-fan/bug.
## What changes were proposed in this pull request?
When the number of spark worker executors is large, if the specific number is displayed, will better help us to analyze and observe by spark ui.
Although this is a small improvement, but it is indeed very valuable.
After fix:
![executor1](https://cloud.githubusercontent.com/assets/26266482/25986597/2d8e4386-3723-11e7-9c24-e5bff17c26e2.png)
## How was this patch tested?
manual tests
Please review http://spark.apache.org/contributing.html before opening a pull request.
Author: guoxiaolong <guo.xiaolong1@zte.com.cn>
Author: 郭小龙 10207633 <guo.xiaolong1@zte.com.cn>
Author: guoxiaolongzte <guo.xiaolong1@zte.com.cn>
Closes#17961 from guoxiaolongzte/SPARK-20720.
## What changes were proposed in this pull request?
When you open the master page, when you use Firefox or Google Chrom, the console of Firefox or Google Chrome is wrong. But The IE is no problem.
e.g.
![error](https://cloud.githubusercontent.com/assets/26266482/25946143/74467a5c-367c-11e7-8f9f-d3585b1aea88.png)
My Firefox version is 48.0.2.
My Google Chrome version is 49.0.2623.75 m.
## How was this patch tested?
manual tests
Please review http://spark.apache.org/contributing.html before opening a pull request.
Author: guoxiaolong <guo.xiaolong1@zte.com.cn>
Author: 郭小龙 10207633 <guo.xiaolong1@zte.com.cn>
Author: guoxiaolongzte <guo.xiaolong1@zte.com.cn>
Closes#17952 from guoxiaolongzte/SPARK-20705.
## What changes were proposed in this pull request?
This PR adds an `error` parameter to `TaskContextImpl.markTaskCompleted` to propagate the original error.
It also fixes an issue that `TaskCompletionListenerException.getMessage` doesn't include `previousError`.
## How was this patch tested?
New unit tests.
Author: Shixiong Zhu <shixiong@databricks.com>
Closes#17942 from zsxwing/SPARK-20702.
## What changes were proposed in this pull request?
Remove uses of scala.language.reflectiveCalls that are either unnecessary or probably resulting in more complex code. This turned out to be less significant than I thought, but, still worth a touch-up.
## How was this patch tested?
Existing tests.
Author: Sean Owen <sowen@cloudera.com>
Closes#17949 from srowen/SPARK-20554.
## What changes were proposed in this pull request?
This PR proposes to fix the lint-breaks as below:
```
[ERROR] src/main/java/org/apache/spark/unsafe/Platform.java:[51] (regexp) RegexpSingleline: No trailing whitespace allowed.
[ERROR] src/main/scala/org/apache/spark/sql/streaming/Trigger.java:[45,25] (naming) MethodName: Method name 'ProcessingTime' must match pattern '^[a-z][a-z0-9][a-zA-Z0-9_]*$'.
[ERROR] src/main/scala/org/apache/spark/sql/streaming/Trigger.java:[62,25] (naming) MethodName: Method name 'ProcessingTime' must match pattern '^[a-z][a-z0-9][a-zA-Z0-9_]*$'.
[ERROR] src/main/scala/org/apache/spark/sql/streaming/Trigger.java:[78,25] (naming) MethodName: Method name 'ProcessingTime' must match pattern '^[a-z][a-z0-9][a-zA-Z0-9_]*$'.
[ERROR] src/main/scala/org/apache/spark/sql/streaming/Trigger.java:[92,25] (naming) MethodName: Method name 'ProcessingTime' must match pattern '^[a-z][a-z0-9][a-zA-Z0-9_]*$'.
[ERROR] src/main/scala/org/apache/spark/sql/streaming/Trigger.java:[102,25] (naming) MethodName: Method name 'Once' must match pattern '^[a-z][a-z0-9][a-zA-Z0-9_]*$'.
[ERROR] src/test/java/org/apache/spark/streaming/kinesis/JavaKinesisInputDStreamBuilderSuite.java:[28,8] (imports) UnusedImports: Unused import - org.apache.spark.streaming.api.java.JavaDStream.
```
after:
```
dev/lint-java
Checkstyle checks passed.
```
[Test Result](https://travis-ci.org/ConeyLiu/spark/jobs/229666169)
## How was this patch tested?
Travis CI
Author: Xianyang Liu <xianyang.liu@intel.com>
Closes#17890 from ConeyLiu/codestyle.
## What changes were proposed in this pull request?
Add stripXSS and stripXSSMap to Spark Core's UIUtils. Calling these functions at any point that getParameter is called against a HttpServletRequest.
## How was this patch tested?
Unit tests, IBM Security AppScan Standard no longer showing vulnerabilities, manual verification of WebUI pages.
Author: NICHOLAS T. MARION <nmarion@us.ibm.com>
Closes#17686 from n-marion/xss-fix.
## What changes were proposed in this pull request?
A few comments around the code mention RDD classes that do not exist anymore. I'm not sure of the best way to replace these, so I've just removed them here.
## How was this patch tested?
Only changes code comments, no testing required
Author: Michael Mior <mmior@uwaterloo.ca>
Closes#17900 from michaelmior/remove-old-rdds.
## What changes were proposed in this pull request?
#14617 added new columns to the executor table causing the visibility checks for the logs and threadDump columns to toggle the wrong columns since they used hard-coded column numbers.
I've updated the checks to use column names instead of numbers so future updates don't accidentally break this again.
Note: This will also need to be back ported into 2.2 since #14617 was merged there
## How was this patch tested?
Manually tested
Author: Alex Bozarth <ajbozart@us.ibm.com>
Closes#17904 from ajbozarth/spark20630.
## What changes were proposed in this pull request?
Spark Version for a specific application is not displayed on the history page now. It should be nice to switch the spark version on the UI when we click on the specific application.
Currently there seems to be way as SparkListenerLogStart records the application version. So, it should be trivial to listen to this event and provision this change on the UI.
For Example
<img width="1439" alt="screen shot 2017-04-06 at 3 23 41 pm" src="https://cloud.githubusercontent.com/assets/8295799/25092650/41f3970a-2354-11e7-9b0d-4646d0adeb61.png">
<img width="1399" alt="screen shot 2017-04-17 at 9 59 33 am" src="https://cloud.githubusercontent.com/assets/8295799/25092743/9f9e2f28-2354-11e7-9605-f2f1c63f21fe.png">
{"Event":"SparkListenerLogStart","Spark Version":"2.0.0"}
(Please fill in changes proposed in this fix)
Modified the SparkUI for History server to listen to SparkLogListenerStart event and extract the version and print it.
## How was this patch tested?
Manual testing of UI page. Attaching the UI screenshot changes here
(Please explain how this patch was tested. E.g. unit tests, integration tests, manual tests)
(If this patch involves UI changes, please attach a screenshot; otherwise, remove this)
Please review http://spark.apache.org/contributing.html before opening a pull request.
Author: Sanket <schintap@untilservice-lm>
Closes#17658 from redsanket/SPARK-20355.
## What changes were proposed in this pull request?
After SPARK-10997, client mode Netty RpcEnv doesn't require to start server, so port configurations are not used any more, here propose to remove these two configurations: "spark.executor.port" and "spark.am.port".
## How was this patch tested?
Existing UTs.
Author: jerryshao <sshao@hortonworks.com>
Closes#17866 from jerryshao/SPARK-20605.
## What changes were proposed in this pull request?
When call the method getLocations of BlockManager, we only compare the data block host. Random selection for non-local data blocks, this may cause the selected data block to be in a different rack. So in this patch to increase the sort of the rack.
## How was this patch tested?
New test case.
Please review http://spark.apache.org/contributing.html before opening a pull request.
Author: Xianyang Liu <xianyang.liu@intel.com>
Closes#17300 from ConeyLiu/blockmanager.
Signed-off-by: liuxian <liu.xian3zte.com.cn>
## What changes were proposed in this pull request?
When the input parameter is null, may be a runtime exception occurs
## How was this patch tested?
Existing unit tests
Author: liuxian <liu.xian3@zte.com.cn>
Closes#17796 from 10110346/wip_lx_0428.
## What changes were proposed in this pull request?
This PR adds the new unit tests to support ShuffleDataBlockId , ShuffleIndexBlockId , TempShuffleBlockId , TempLocalBlockId
## How was this patch tested?
The new unit test.
Author: caoxuewen <cao.xuewen@zte.com.cn>
Closes#17794 from heary-cao/blockidsuite.
## What changes were proposed in this pull request?
Fix build warnings primarily related to Breeze 0.13 operator changes, Java style problems
## How was this patch tested?
Existing tests
Author: Sean Owen <sowen@cloudera.com>
Closes#17803 from srowen/SPARK-20523.
## What changes were proposed in this pull request?
To better understand this problem, let's take a look at an example first:
```
object Main {
def main(args: Array[String]): Unit = {
var t = new Test
new Thread(new Runnable {
override def run() = {}
}).start()
println("first thread finished")
t.a = null
t = new Test
new Thread(new Runnable {
override def run() = {}
}).start()
}
}
class Test {
var a = new InheritableThreadLocal[String] {
override protected def childValue(parent: String): String = {
println("parent value is: " + parent)
parent
}
}
a.set("hello")
}
```
The result is:
```
parent value is: hello
first thread finished
parent value is: hello
parent value is: hello
```
Once an `InheritableThreadLocal` has been set value, child threads will inherit its value as long as it has not been GCed, so setting the variable which holds the `InheritableThreadLocal` to `null` doesn't work as we expected.
In `SparkContext`, we have an `InheritableThreadLocal` for local properties, we should clear it when stopping `SparkContext`, or all the future child threads will still inherit it and copy the properties and waste memory.
This is the root cause of https://issues.apache.org/jira/browse/SPARK-20548 , which creates/stops `SparkContext` many times and finally have a lot of `InheritableThreadLocal` alive, and cause OOM when starting new threads in the internal thread pools.
## How was this patch tested?
N/A
Author: Wenchen Fan <wenchen@databricks.com>
Closes#17833 from cloud-fan/core.
In the previous patch I deprecated StorageStatus, but not the
method in SparkContext that exposes that class publicly. So deprecate
the method too.
Author: Marcelo Vanzin <vanzin@cloudera.com>
Closes#17824 from vanzin/SPARK-20421.
There are two problems fixed in this commit. First, the
ExecutorAllocationManager sets a timeout to avoid requesting executors
too often. However, the timeout is always updated based on its value and
a timeout, not the current time. If the call is delayed by locking for
more than the ongoing scheduler timeout, the manager will request more
executors on every run. This seems to be the main cause of SPARK-20540.
The second problem is that the total number of requested executors is
not tracked by the CoarseGrainedSchedulerBackend. Instead, it calculates
the value based on the current status of 3 variables: the number of
known executors, the number of executors that have been killed, and the
number of pending executors. But, the number of pending executors is
never less than 0, even though there may be more known than requested.
When executors are killed and not replaced, this can cause the request
sent to YARN to be incorrect because there were too many executors due
to the scheduler's state being slightly out of date. This is fixed by tracking
the currently requested size explicitly.
## How was this patch tested?
Existing tests.
Author: Ryan Blue <blue@apache.org>
Closes#17813 from rdblue/SPARK-20540-fix-dynamic-allocation.
The download link in history server UI is concatenated with:
```
<td><a href="{{uiroot}}/api/v1/applications/{{id}}/{{num}}/logs" class="btn btn-info btn-mini">Download</a></td>
```
Here `num` field represents number of attempts, this is not equal to REST APIs. In the REST API, if attempt id is not existed the URL should be `api/v1/applications/<id>/logs`, otherwise the URL should be `api/v1/applications/<id>/<attemptId>/logs`. Using `<num>` to represent `<attemptId>` will lead to the issue of "no such app".
Manual verification.
CC ajbozarth can you please review this change, since you add this feature before? Thanks!
Author: jerryshao <sshao@hortonworks.com>
Closes#17795 from jerryshao/SPARK-20517.
## What changes were proposed in this pull request?
This PR adds RDD checkpoint compression support and add a new config `spark.checkpoint.compress` to enable/disable it. Credit goes to aramesh117
Closes#17024
## How was this patch tested?
The new unit test.
Author: Shixiong Zhu <shixiong@databricks.com>
Author: Aaditya Ramesh <aramesh@conviva.com>
Closes#17789 from zsxwing/pr17024.
Upgrade Jetty so it can work with Hadoop 3 (alpha 2 release, in particular).
Without this change, because of incompatibily between Jetty versions,
Spark fails to compile when built against Hadoop 3
## How was this patch tested?
Unit tests being run.
Author: Mark Grover <mark@apache.org>
Closes#17790 from markgrover/spark-20514.
## What changes were proposed in this pull request?
This PR proposes to throw an exception with better message rather than `ArrayIndexOutOfBoundsException` when temp directories could not be created.
Running the commands below:
```bash
./bin/spark-shell --conf spark.local.dir=/NONEXISTENT_DIR_ONE,/NONEXISTENT_DIR_TWO
```
produces ...
**Before**
```
Exception in thread "main" java.lang.ExceptionInInitializerError
...
Caused by: java.lang.ArrayIndexOutOfBoundsException: 0
...
```
**After**
```
Exception in thread "main" java.lang.ExceptionInInitializerError
...
Caused by: java.io.IOException: Failed to get a temp directory under [/NONEXISTENT_DIR_ONE,/NONEXISTENT_DIR_TWO].
...
```
## How was this patch tested?
Unit tests in `LocalDirsSuite.scala`.
Author: hyukjinkwon <gurwls223@gmail.com>
Closes#17768 from HyukjinKwon/throws-temp-dir-exception.
## What changes were proposed in this pull request?
When sending accumulator updates back to driver, the network overhead is pretty big as there are a lot of accumulators, e.g. `TaskMetrics` will send about 20 accumulators everytime, there may be a lot of `SQLMetric` if the query plan is complicated.
Therefore, it's critical to reduce the size of serialized accumulator. A simple way is to not send the name of internal accumulators to executor side, as it's unnecessary. When executor sends accumulator updates back to driver, we can look up the accumulator name in `AccumulatorContext` easily. Note that, we still need to send names of normal accumulators, as the user code run at executor side may rely on accumulator names.
In the future, we should reimplement `TaskMetrics` to not rely on accumulators and use custom serialization.
Tried on the example in https://issues.apache.org/jira/browse/SPARK-12837, the size of serialized accumulator has been cut down by about 40%.
## How was this patch tested?
existing tests.
Author: Wenchen Fan <wenchen@databricks.com>
Closes#17596 from cloud-fan/oom.
## What changes were proposed in this pull request?
This PR changes Executor's threads to `UninterruptibleThread` so that we can use `runUninterruptibly` in `CachedKafkaConsumer`. However, this is just best effort to avoid hanging forever. If the user uses`CachedKafkaConsumer` in another thread (e.g., create a new thread or Future), the potential hang may still happen.
## How was this patch tested?
The new added test.
Author: Shixiong Zhu <shixiong@databricks.com>
Closes#17761 from zsxwing/int.
## What changes were proposed in this pull request?
When application contains large amount of shuffle blocks. NodeManager requires lots of memory to keep metadata(`FileSegmentManagedBuffer`) in `StreamManager`. When the number of shuffle blocks is big enough. NodeManager can run OOM. This pr proposes to do lazy initialization of `FileSegmentManagedBuffer` in shuffle service.
## How was this patch tested?
Manually test.
Author: jinxing <jinxing6042@126.com>
Closes#17744 from jinxing64/SPARK-20426.
These listeners weren't really meant for external consumption, but they're
public and marked with DeveloperApi. Adding the deprecated tag warns people
that they may soon go away (as they will as part of the work for SPARK-18085).
Note that not all types made public by https://github.com/apache/spark/pull/648
are being deprecated. Some remaining types are still exposed through the
SparkListener API.
Also note the text for StorageStatus is a tiny bit different, since I'm not
so sure I'll be able to remove it. But the effect for the users should be the
same (they should stop trying to use it).
Author: Marcelo Vanzin <vanzin@cloudera.com>
Closes#17766 from vanzin/SPARK-20421.
This change does a more thorough redaction of sensitive information from logs and UI
Add unit tests that ensure that no regressions happen that leak sensitive information to the logs.
The motivation for this change was appearance of password like so in `SparkListenerEnvironmentUpdate` in event logs under some JVM configurations:
`"sun.java.command":"org.apache.spark.deploy.SparkSubmit ... --conf spark.executorEnv.HADOOP_CREDSTORE_PASSWORD=secret_password ..."
`
Previously redaction logic was only checking if the key matched the secret regex pattern, it'd redact it's value. That worked for most cases. However, in the above case, the key (sun.java.command) doesn't tell much, so the value needs to be searched. This PR expands the check to check for values as well.
## How was this patch tested?
New unit tests added that ensure that no sensitive information is present in the event logs or the yarn logs. Old unit test in UtilsSuite was modified because the test was asserting that a non-sensitive property's value won't be redacted. However, the non-sensitive value had the literal "secret" in it which was causing it to redact. Simply updating the non-sensitive property's value to another arbitrary value (that didn't have "secret" in it) fixed it.
Author: Mark Grover <mark@apache.org>
Closes#17725 from markgrover/spark-20435.
## What changes were proposed in this pull request?
This is a follow-up of #14617 to make the name of memory related fields more meaningful.
Here for the backward compatibility, I didn't change `maxMemory` and `memoryUsed` fields.
## How was this patch tested?
Existing UT and local verification.
CC squito and tgravescs .
Author: jerryshao <sshao@hortonworks.com>
Closes#17700 from jerryshao/SPARK-20391.
## What changes were proposed in this pull request?
Pregel-based iterative algorithms with more than ~50 iterations begin to slow down and eventually fail with a StackOverflowError due to Spark's lack of support for long lineage chains.
This PR causes Pregel to checkpoint the graph periodically if the checkpoint directory is set.
This PR moves PeriodicGraphCheckpointer.scala from mllib to graphx, moves PeriodicRDDCheckpointer.scala, PeriodicCheckpointer.scala from mllib to core
## How was this patch tested?
unit tests, manual tests
(Please explain how this patch was tested. E.g. unit tests, integration tests, manual tests)
(If this patch involves UI changes, please attach a screenshot; otherwise, remove this)
Author: ding <ding@localhost.localdomain>
Author: dding3 <ding.ding@intel.com>
Author: Michael Allman <michael@videoamp.com>
Closes#15125 from dding3/cp2_pregel.
Using Option(name) instead of Some(name) to prevent runtime failures when using accumulators created like the following
```
sparkContext.accumulator(0, null)
```
Author: Sergey Zhemzhitsky <szhemzhitski@gmail.com>
Closes#17740 from szhem/SPARK-20404-null-acc-names.
## What changes were proposed in this pull request?
Current SHS (Spark History Server) two different ACLs:
* ACL of base URL, it is controlled by "spark.acls.enabled" or "spark.ui.acls.enabled", and with this enabled, only user configured with "spark.admin.acls" (or group) or "spark.ui.view.acls" (or group), or the user who started SHS could list all the applications, otherwise none of them can be listed. This will also affect REST APIs which listing the summary of all apps and one app.
* Per application ACL. This is controlled by "spark.history.ui.acls.enabled". With this enabled only history admin user and user/group who ran this app can access the details of this app.
With this two ACLs, we may encounter several unexpected behaviors:
1. if base URL's ACL (`spark.acls.enable`) is enabled but user A has no view permission. User "A" cannot see the app list but could still access details of it's own app.
2. if ACLs of base URL (`spark.acls.enable`) is disabled, then user "A" could download any application's event log, even it is not run by user "A".
3. The changes of Live UI's ACL will affect History UI's ACL which share the same conf file.
The unexpected behaviors is mainly because we have two different ACLs, ideally we should have only one to manage all.
So to improve SHS's ACL mechanism, here in this PR proposed to:
1. Disable "spark.acls.enable" and only use "spark.history.ui.acls.enable" for history server.
2. Check permission for event-log download REST API.
With this PR:
1. Admin user could see/download the list of all applications, as well as application details.
2. Normal user could see the list of all applications, but can only download and check the details of applications accessible to him.
## How was this patch tested?
New UTs are added, also verified in real cluster.
CC tgravescs vanzin please help to review, this PR changes the semantics you did previously. Thanks a lot.
Author: jerryshao <sshao@hortonworks.com>
Closes#17582 from jerryshao/SPARK-20239.
## What changes were proposed in this pull request?
Submitted Time' field, the date format **needs to be formatted**, in running Drivers table or Completed Drivers table in master web ui.
Before fix this problem e.g.
Completed Drivers
Submission ID **Submitted Time** Worker State Cores Memory Main Class
driver-20170419145755-0005 **Wed Apr 19 14:57:55 CST 2017** worker-20170419145250-zdh120-40412 FAILED 1 1024.0 MB cn.zte.HdfsTest
please see the attachment:https://issues.apache.org/jira/secure/attachment/12863977/before_fix.png
After fix this problem e.g.
Completed Drivers
Submission ID **Submitted Time** Worker State Cores Memory Main Class
driver-20170419145755-0006 **2017/04/19 16:01:25** worker-20170419145250-zdh120-40412 FAILED 1 1024.0 MB cn.zte.HdfsTest
please see the attachment:https://issues.apache.org/jira/secure/attachment/12863976/after_fix.png
'Submitted Time' field, the date format **has been formatted**, in running Applications table or Completed Applicationstable in master web ui, **it is correct.**
e.g.
Running Applications
Application ID Name Cores Memory per Executor **Submitted Time** User State Duration
app-20170419160910-0000 (kill) SparkSQL::10.43.183.120 1 5.0 GB **2017/04/19 16:09:10** root RUNNING 53 s
**Format after the time easier to observe, and consistent with the applications table,so I think it's worth fixing.**
## How was this patch tested?
(Please explain how this patch was tested. E.g. unit tests, integration tests, manual tests)
(If this patch involves UI changes, please attach a screenshot; otherwise, remove this)
Please review http://spark.apache.org/contributing.html before opening a pull request.
Author: 郭小龙 10207633 <guo.xiaolong1@zte.com.cn>
Author: guoxiaolong <guo.xiaolong1@zte.com.cn>
Author: guoxiaolongzte <guo.xiaolong1@zte.com.cn>
Closes#17682 from guoxiaolongzte/SPARK-20385.
## What changes were proposed in this pull request?
Modify the added memory size to memSize-originalMemSize if the block exists on the slave already
since if the block exists, the added memory size should be memSize-originalMemSize; if originalMemSize is bigger than memSize ,then the log info should be Removed memory, removed size should be originalMemSize-memSize
## How was this patch tested?
Multiple runs on existing unit tests
(Please explain how this patch was tested. E.g. unit tests, integration tests, manual tests)
(If this patch involves UI changes, please attach a screenshot; otherwise, remove this)
Please review http://spark.apache.org/contributing.html before opening a pull request.
Author: eatoncys <chen.yanshan@zte.com.cn>
Closes#17683 from eatoncys/SPARK-20386.
## What changes were proposed in this pull request?
In the current Spark's HistoryServer we expected to get `AccessControlException` during listing all the files, but unfortunately it was not worked because we actually doesn't check the access permission and no other calls will throw such exception. What was worse is that this check will be deferred until reading files, which is not necessary and quite verbose, since it will be printed out the exception in every 10 seconds when checking the files.
So here with this fix, we actually check the read permission during listing the files, which could avoid unnecessary file read later on and suppress the verbose log.
## How was this patch tested?
Add unit test to verify.
Author: jerryshao <sshao@hortonworks.com>
Closes#17495 from jerryshao/SPARK-20172.
## What changes were proposed in this pull request?
This was a regression introduced by my earlier PR here: https://github.com/apache/spark/pull/17531
It turns out NonFatal() does not in fact catch InterruptedException.
## How was this patch tested?
Extended cancellation unit test coverage. The first test fails before this patch.
cc JoshRosen mridulm
Author: Eric Liang <ekl@databricks.com>
Closes#17659 from ericl/spark-20358.
## What changes were proposed in this pull request?
When I request access to the 'http: //ip:port/api/v1/applications' link, get the json. I need the 'sparkUser' field specific value, because my Spark big data management platform needs to filter through this field which user submits the application to facilitate my administration and query, but the current return of the json string is empty, causing me this Function can not be achieved, that is, I do not know who the specific application is submitted by this REST Api.
**current return json:**
[ {
"id" : "app-20170417152053-0000",
"name" : "KafkaWordCount",
"attempts" : [ {
"startTime" : "2017-04-17T07:20:51.395GMT",
"endTime" : "1969-12-31T23:59:59.999GMT",
"lastUpdated" : "2017-04-17T07:20:51.395GMT",
"duration" : 0,
**"sparkUser" : "",**
"completed" : false,
"endTimeEpoch" : -1,
"startTimeEpoch" : 1492413651395,
"lastUpdatedEpoch" : 1492413651395
} ]
} ]
**When I fix this question, return json:**
[ {
"id" : "app-20170417154201-0000",
"name" : "KafkaWordCount",
"attempts" : [ {
"startTime" : "2017-04-17T07:41:57.335GMT",
"endTime" : "1969-12-31T23:59:59.999GMT",
"lastUpdated" : "2017-04-17T07:41:57.335GMT",
"duration" : 0,
**"sparkUser" : "mr",**
"completed" : false,
"startTimeEpoch" : 1492414917335,
"endTimeEpoch" : -1,
"lastUpdatedEpoch" : 1492414917335
} ]
} ]
## How was this patch tested?
manual tests
Please review http://spark.apache.org/contributing.html before opening a pull request.
Author: 郭小龙 10207633 <guo.xiaolong1@zte.com.cn>
Author: guoxiaolong <guo.xiaolong1@zte.com.cn>
Author: guoxiaolongzte <guo.xiaolong1@zte.com.cn>
Closes#17656 from guoxiaolongzte/SPARK-20354.
## What changes were proposed in this pull request?
Eliminate the duplicate call to `Pool.getSchedulableByName()` in `FairSchedulableBuilder.addTaskSetManager`
## How was this patch tested?
./dev/run-tests
Author: Robert Stupp <snazy@snazy.de>
Closes#17647 from snazy/20344-dup-call-master.
## What changes were proposed in this pull request?
This PR allows to use `SerializationStream` and `DeserializationStream` in try-with-resources.
## How was this patch tested?
`core` unit tests.
Author: Sergei Lebedev <s.lebedev@criteo.com>
Closes#17598 from superbobry/compression-stream-closeable.
## What changes were proposed in this pull request?
`o.a.s.streaming.StreamingContextSuite.SPARK-18560 Receiver data should be deserialized properly` is flaky is because there is a potential dead-lock in StandaloneSchedulerBackend which causes `await` timeout. Here is the related stack trace:
```
"Thread-31" #211 daemon prio=5 os_prio=31 tid=0x00007fedd4808000 nid=0x16403 waiting on condition [0x00007000239b7000]
java.lang.Thread.State: TIMED_WAITING (parking)
at sun.misc.Unsafe.park(Native Method)
- parking to wait for <0x000000079b49ca10> (a scala.concurrent.impl.Promise$CompletionLatch)
at java.util.concurrent.locks.LockSupport.parkNanos(LockSupport.java:215)
at java.util.concurrent.locks.AbstractQueuedSynchronizer.doAcquireSharedNanos(AbstractQueuedSynchronizer.java:1037)
at java.util.concurrent.locks.AbstractQueuedSynchronizer.tryAcquireSharedNanos(AbstractQueuedSynchronizer.java:1328)
at scala.concurrent.impl.Promise$DefaultPromise.tryAwait(Promise.scala:208)
at scala.concurrent.impl.Promise$DefaultPromise.ready(Promise.scala:218)
at scala.concurrent.impl.Promise$DefaultPromise.result(Promise.scala:223)
at org.apache.spark.util.ThreadUtils$.awaitResult(ThreadUtils.scala:201)
at org.apache.spark.rpc.RpcTimeout.awaitResult(RpcTimeout.scala:75)
at org.apache.spark.rpc.RpcEndpointRef.askSync(RpcEndpointRef.scala:92)
at org.apache.spark.rpc.RpcEndpointRef.askSync(RpcEndpointRef.scala:76)
at org.apache.spark.scheduler.cluster.CoarseGrainedSchedulerBackend.stop(CoarseGrainedSchedulerBackend.scala:402)
at org.apache.spark.scheduler.cluster.StandaloneSchedulerBackend.org$apache$spark$scheduler$cluster$StandaloneSchedulerBackend$$stop(StandaloneSchedulerBackend.scala:213)
- locked <0x00000007066fca38> (a org.apache.spark.scheduler.cluster.StandaloneSchedulerBackend)
at org.apache.spark.scheduler.cluster.StandaloneSchedulerBackend.stop(StandaloneSchedulerBackend.scala:116)
- locked <0x00000007066fca38> (a org.apache.spark.scheduler.cluster.StandaloneSchedulerBackend)
at org.apache.spark.scheduler.TaskSchedulerImpl.stop(TaskSchedulerImpl.scala:517)
at org.apache.spark.scheduler.DAGScheduler.stop(DAGScheduler.scala:1657)
at org.apache.spark.SparkContext$$anonfun$stop$8.apply$mcV$sp(SparkContext.scala:1921)
at org.apache.spark.util.Utils$.tryLogNonFatalError(Utils.scala:1302)
at org.apache.spark.SparkContext.stop(SparkContext.scala:1920)
at org.apache.spark.streaming.StreamingContext.stop(StreamingContext.scala:708)
at org.apache.spark.streaming.StreamingContextSuite$$anonfun$43$$anonfun$apply$mcV$sp$66$$anon$3.run(StreamingContextSuite.scala:827)
"dispatcher-event-loop-3" #18 daemon prio=5 os_prio=31 tid=0x00007fedd603a000 nid=0x6203 waiting for monitor entry [0x0000700003be4000]
java.lang.Thread.State: BLOCKED (on object monitor)
at org.apache.spark.scheduler.cluster.CoarseGrainedSchedulerBackend$DriverEndpoint.org$apache$spark$scheduler$cluster$CoarseGrainedSchedulerBackend$DriverEndpoint$$makeOffers(CoarseGrainedSchedulerBackend.scala:253)
- waiting to lock <0x00000007066fca38> (a org.apache.spark.scheduler.cluster.StandaloneSchedulerBackend)
at org.apache.spark.scheduler.cluster.CoarseGrainedSchedulerBackend$DriverEndpoint$$anonfun$receive$1.applyOrElse(CoarseGrainedSchedulerBackend.scala:124)
at org.apache.spark.rpc.netty.Inbox$$anonfun$process$1.apply$mcV$sp(Inbox.scala:117)
at org.apache.spark.rpc.netty.Inbox.safelyCall(Inbox.scala:205)
at org.apache.spark.rpc.netty.Inbox.process(Inbox.scala:101)
at org.apache.spark.rpc.netty.Dispatcher$MessageLoop.run(Dispatcher.scala:213)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
```
This PR removes `synchronized` and changes `stopping` to AtomicBoolean to ensure idempotent to fix the dead-lock.
## How was this patch tested?
Jenkins
Author: Shixiong Zhu <shixiong@databricks.com>
Closes#17610 from zsxwing/SPARK-20131.
## What changes were proposed in this pull request?
This PR proposes to run Spark unidoc to test Javadoc 8 build as Javadoc 8 is easily re-breakable.
There are several problems with it:
- It introduces little extra bit of time to run the tests. In my case, it took 1.5 mins more (`Elapsed :[94.8746569157]`). How it was tested is described in "How was this patch tested?".
- > One problem that I noticed was that Unidoc appeared to be processing test sources: if we can find a way to exclude those from being processed in the first place then that might significantly speed things up.
(see joshrosen's [comment](https://issues.apache.org/jira/browse/SPARK-18692?focusedCommentId=15947627&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-15947627))
To complete this automated build, It also suggests to fix existing Javadoc breaks / ones introduced by test codes as described above.
There fixes are similar instances that previously fixed. Please refer https://github.com/apache/spark/pull/15999 and https://github.com/apache/spark/pull/16013
Note that this only fixes **errors** not **warnings**. Please see my observation https://github.com/apache/spark/pull/17389#issuecomment-288438704 for spurious errors by warnings.
## How was this patch tested?
Manually via `jekyll build` for building tests. Also, tested via running `./dev/run-tests`.
This was tested via manually adding `time.time()` as below:
```diff
profiles_and_goals = build_profiles + sbt_goals
print("[info] Building Spark unidoc (w/Hive 1.2.1) using SBT with these arguments: ",
" ".join(profiles_and_goals))
+ import time
+ st = time.time()
exec_sbt(profiles_and_goals)
+ print("Elapsed :[%s]" % str(time.time() - st))
```
produces
```
...
========================================================================
Building Unidoc API Documentation
========================================================================
...
[info] Main Java API documentation successful.
...
Elapsed :[94.8746569157]
...
Author: hyukjinkwon <gurwls223@gmail.com>
Closes#17477 from HyukjinKwon/SPARK-18692.
## What changes were proposed in this pull request?
Add Locale.ROOT to internal calls to String `toLowerCase`, `toUpperCase`, to avoid inadvertent locale-sensitive variation in behavior (aka the "Turkish locale problem").
The change looks large but it is just adding `Locale.ROOT` (the locale with no country or language specified) to every call to these methods.
## How was this patch tested?
Existing tests.
Author: Sean Owen <sowen@cloudera.com>
Closes#17527 from srowen/SPARK-20156.
## What changes were proposed in this pull request?
Synchronize access to openStreams map.
## How was this patch tested?
Existing tests.
Author: Bogdan Raducanu <bogdan@databricks.com>
Closes#17592 from bogdanrdc/SPARK-20243.
## What changes were proposed in this pull request?
This error message doesn't get properly formatted because of a missing `s`. Currently the error looks like:
```
Caused by: java.lang.IllegalArgumentException: requirement failed: indices should be one-based and in ascending order; found current=$current, previous=$previous; line="$line"
```
(note the literal `$current` instead of the interpolated value)
Please review http://spark.apache.org/contributing.html before opening a pull request.
Author: Vijay Ramesh <vramesh@demandbase.com>
Closes#17572 from vijaykramesh/master.
## What changes were proposed in this pull request?
Avoid `NoSuchElementException` every time `ConfigProvider.get(val, default)` falls back to default. This apparently causes non-trivial overhead in at least one path, and can easily be avoided.
See https://github.com/apache/spark/pull/17329
## How was this patch tested?
Existing tests
Author: Sean Owen <sowen@cloudera.com>
Closes#17567 from srowen/SPARK-19991.
## What changes were proposed in this pull request?
With [SPARK-13992](https://issues.apache.org/jira/browse/SPARK-13992), Spark supports persisting data into off-heap memory, but the usage of on-heap and off-heap memory is not exposed currently, it is not so convenient for user to monitor and profile, so here propose to expose off-heap memory as well as on-heap memory usage in various places:
1. Spark UI's executor page will display both on-heap and off-heap memory usage.
2. REST request returns both on-heap and off-heap memory.
3. Also this can be gotten from MetricsSystem.
4. Last this usage can be obtained programmatically from SparkListener.
Attach the UI changes:
![screen shot 2016-08-12 at 11 20 44 am](https://cloud.githubusercontent.com/assets/850797/17612032/6c2f4480-607f-11e6-82e8-a27fb8cbb4ae.png)
Backward compatibility is also considered for event-log and REST API. Old event log can still be replayed with off-heap usage displayed as 0. For REST API, only adds the new fields, so JSON backward compatibility can still be kept.
## How was this patch tested?
Unit test added and manual verification.
Author: jerryshao <sshao@hortonworks.com>
Closes#14617 from jerryshao/SPARK-17019.
## What changes were proposed in this pull request?
If tasks throw non-interrupted exceptions on kill (e.g. java.nio.channels.ClosedByInterruptException), their death is reported back as TaskFailed instead of TaskKilled. This causes stage failure in some cases.
This is reproducible as follows. Run the following, and then use SparkContext.killTaskAttempt to kill one of the tasks. The entire stage will fail since we threw a RuntimeException instead of InterruptedException.
```
spark.range(100).repartition(100).foreach { i =>
try {
Thread.sleep(10000000)
} catch {
case t: InterruptedException =>
throw new RuntimeException(t)
}
}
```
Based on the code in TaskSetManager, I think this also affects kills of speculative tasks. However, since the number of speculated tasks is few, and usually you need to fail a task a few times before the stage is cancelled, it unlikely this would be noticed in production unless both speculation was enabled and the num allowed task failures was = 1.
We should probably unconditionally return TaskKilled instead of TaskFailed if the task was killed by the driver, regardless of the actual exception thrown.
## How was this patch tested?
Unit test. The test fails before the change in Executor.scala
cc JoshRosen
Author: Eric Liang <ekl@databricks.com>
Closes#17531 from ericl/fix-task-interrupt.
## What changes were proposed in this pull request?
Make sure SESSION_LOCAL_TIMEZONE reflects the change in JVM's default timezone setting. Currently several timezone related tests fail as the change to default timezone is not picked up by SQLConf.
## How was this patch tested?
Added an unit test in ConfigEntrySuite
Author: Dilip Biswal <dbiswal@us.ibm.com>
Closes#17537 from dilipbiswal/timezone_debug.
## What changes were proposed in this pull request?
When a user kills a stage using web UI (in Stages page), StagesTab.handleKillRequest requests SparkContext to cancel the stage without giving a reason. SparkContext has cancelStage(stageId: Int, reason: String) that Spark could use to pass the information for monitoring/debugging purposes.
## How was this patch tested?
manual tests
Please review http://spark.apache.org/contributing.html before opening a pull request.
Author: shaolinliu <liu.shaolin1@zte.com.cn>
Author: lvdongr <lv.dongdong@zte.com.cn>
Closes#17258 from shaolinliu/SPARK-19807.
with spark.ui.reverseProxy=true, full path URLs like /log will point to
the master web endpoint which is serving the worker UI as reverse proxy.
To access a REST endpoint in the worker in reverse proxy mode , the
leading /proxy/"target"/ part of the base URI must be retained.
Added logic to log-view.js to handle this, similar to executorspage.js
Patch was tested manually
Author: Oliver Köth <okoeth@de.ibm.com>
Closes#17370 from okoethibm/master.
[SPARK-9002][CORE] KryoSerializer initialization does not include 'Array[Int]'
## What changes were proposed in this pull request?
Array[Int] has been registered in KryoSerializer.
The following file has been changed
core/src/main/scala/org/apache/spark/serializer/KryoSerializer.scala
## How was this patch tested?
First, the issue was reproduced by new unit test.
Then, the issue was fixed to pass the failed test.
Author: Denis Bolshakov <denis.bolshakov@onefactor.com>
Closes#17482 from dbolshak/SPARK-9002.
## What changes were proposed in this pull request?
Remove accumulator updates for internal.metrics.updatedBlockStatuses from SparkListenerTaskEnd entries in the history file. These can cause history files to grow to hundreds of GB because the value of the accumulator contains all tracked blocks.
## How was this patch tested?
Current History UI tests cover use of the history file.
Author: Ryan Blue <blue@apache.org>
Closes#17412 from rdblue/SPARK-20084-remove-block-accumulator-info.
## What changes were proposed in this pull request?
Few changes related to Intellij IDEA inspection.
## How was this patch tested?
Changes were tested by existing unit tests
Author: Denis Bolshakov <denis.bolshakov@onefactor.com>
Closes#17458 from dbolshak/SPARK-20127.
## What changes were proposed in this pull request?
while submit apps with -v or --verbose, we can print the right queue name, but if we set a queue name with `spark.yarn.queue` by --conf or in the spark-default.conf, we just got `null` for the queue in Parsed arguments.
```
bin/spark-shell -v --conf spark.yarn.queue=thequeue
Using properties file: /home/hadoop/spark-2.1.0-bin-apache-hdp2.7.3/conf/spark-defaults.conf
....
Adding default property: spark.yarn.queue=default
Parsed arguments:
master yarn
deployMode client
...
queue null
....
verbose true
Spark properties used, including those specified through
--conf and those from the properties file /home/hadoop/spark-2.1.0-bin-apache-hdp2.7.3/conf/spark-defaults.conf:
spark.yarn.queue -> thequeue
....
```
## How was this patch tested?
ut and local verify
Author: Kent Yao <yaooqinn@hotmail.com>
Closes#17430 from yaooqinn/SPARK-20096.
…adoc
## What changes were proposed in this pull request?
Use recommended values for row boundaries in Window's scaladoc, i.e. `Window.unboundedPreceding`, `Window.unboundedFollowing`, and `Window.currentRow` (that were introduced in 2.1.0).
## How was this patch tested?
Local build
Author: Jacek Laskowski <jacek@japila.pl>
Closes#17417 from jaceklaskowski/window-expression-scaladoc.
## What changes were proposed in this pull request?
Implementations of strategies for resilient block replication for different resource managers that replicate the 3-replica strategy used by HDFS, where the first replica is on an executor, the second replica within the same rack as the executor and a third replica on a different rack.
The implementation involves providing two pluggable classes, one running in the driver that provides topology information for every host at cluster start and the second prioritizing a list of peer BlockManagerIds.
The prioritization itself can be thought of an optimization problem to find a minimal set of peers that satisfy certain objectives and replicating to these peers first. The objectives can be used to express richer constraints over and above HDFS like 3-replica strategy.
## How was this patch tested?
This patch was tested with unit tests for storage, along with new unit tests to verify prioritization behaviour.
Author: Shubham Chopra <schopra31@bloomberg.net>
Closes#13932 from shubhamchopra/PrioritizerStrategy.
## What changes were proposed in this pull request?
The internal FileCommitProtocol interface returns all task commit messages in bulk to the implementation when a job finishes. However, it is sometimes useful to access those messages before the job completes, so that the driver gets incremental progress updates before the job finishes.
This adds an `onTaskCommit` listener to the internal api.
## How was this patch tested?
Unit tests.
cc rxin
Author: Eric Liang <ekl@databricks.com>
Closes#17475 from ericl/file-commit-api-ext.
## What changes were proposed in this pull request?
Currently we use system classloader to find HBase jars, if it is specified by `--jars`, then it will be failed with ClassNotFound issue. So here changing to use child classloader.
Also putting added jars and main jar into classpath of submitted application in yarn cluster mode, otherwise HBase jars specified with `--jars` will never be honored in cluster mode, and fetching tokens in client side will always be failed.
## How was this patch tested?
Unit test and local verification.
Author: jerryshao <sshao@hortonworks.com>
Closes#17388 from jerryshao/SPARK-20059.
This change modifies the way block data is encrypted to make the more
common cases faster, while penalizing an edge case. As a side effect
of the change, all data that goes through the block manager is now
encrypted only when needed, including the previous path (broadcast
variables) where that did not happen.
The way the change works is by not encrypting data that is stored in
memory; so if a serialized block is in memory, it will only be encrypted
once it is evicted to disk.
The penalty comes when transferring that encrypted data from disk. If the
data ends up in memory again, it is as efficient as before; but if the
evicted block needs to be transferred directly to a remote executor, then
there's now a performance penalty, since the code now uses a custom
FileRegion implementation to decrypt the data before transferring.
This also means that block data transferred between executors now is
not encrypted (and thus relies on the network library encryption support
for secrecy). Shuffle blocks are still transferred in encrypted form,
since they're handled in a slightly different way by the code. This also
keeps compatibility with existing external shuffle services, which transfer
encrypted shuffle blocks, and avoids having to make the external service
aware of encryption at all.
The serialization and deserialization APIs in the SerializerManager now
do not do encryption automatically; callers need to explicitly wrap their
streams with an appropriate crypto stream before using those.
As a result of these changes, some of the workarounds added in SPARK-19520
are removed here.
Testing: a new trait ("EncryptionFunSuite") was added that provides an easy
way to run a test twice, with encryption on and off; broadcast, block manager
and caching tests were modified to use this new trait so that the existing
tests exercise both encrypted and non-encrypted paths. I also ran some
applications with encryption turned on to verify that they still work,
including streaming tests that failed without the fix for SPARK-19520.
Author: Marcelo Vanzin <vanzin@cloudera.com>
Closes#17295 from vanzin/SPARK-19556.
## What changes were proposed in this pull request?
We must set the taskset to zombie before the DAGScheduler handles the taskEnded event. It's possible the taskEnded event will cause the DAGScheduler to launch a new stage attempt (this happens when map output data was lost), and if this happens before the taskSet has been set to zombie, it will appear that we have conflicting task sets.
Author: liujianhui <liujianhui@didichuxing>
Closes#17208 from liujianhuiouc/spark-19868.
## What changes were proposed in this pull request?
Executors cache a list of their peers that is refreshed by default every minute. The cached stale references were randomly being used for replication. Since those executors were removed from the master, they did not occur in the block locations as reported by the master. This was fixed by
1. Refreshing peer cache in the block manager before trying to pro-actively replicate. This way the probability of replicating to a failed executor is eliminated.
2. Explicitly stopping the block manager in the tests. This shuts down the RPC endpoint use by the block manager. This way, even if a block manager tries to replicate using a stale reference, the replication logic should take care of refreshing the list of peers after failure.
## How was this patch tested?
Tested manually
Author: Shubham Chopra <schopra31@bloomberg.net>
Author: Kay Ousterhout <kayousterhout@gmail.com>
Author: Shubham Chopra <shubhamchopra@users.noreply.github.com>
Closes#17325 from shubhamchopra/SPARK-19803.
## What changes were proposed in this pull request?
Instead of creating new `JavaSparkContext` we use `SparkContext.getOrCreate`.
## How was this patch tested?
Existing tests
Author: Hossein <hossein@databricks.com>
Closes#17423 from falaki/SPARK-20088.
## What changes were proposed in this pull request?
Adding additional information to existing logging messages:
- YarnAllocator: log the executor ID together with the container id when a container for an executor is launched.
- NettyRpcEnv: log the receiver address when there is a timeout waiting for an answer to a remote call.
- ExecutorAllocationManager: fix a typo in the logging message for the list of executors to be removed.
## How was this patch tested?
Build spark and submit the word count example to a YARN cluster using cluster mode
Author: Juan Rodriguez Hortala <hortala@amazon.com>
Closes#17411 from juanrh/logging-improvements.
## What changes were proposed in this pull request?
Commit 91fa80fe8a broke the build for scala 2.10. The commit uses `Regex.regex` field which is not available in Scala 2.10. This PR fixes this.
## How was this patch tested?
Existing tests.
Author: Herman van Hovell <hvanhovell@databricks.com>
Closes#17420 from hvanhovell/SPARK-20070-2.0.
## What changes were proposed in this pull request?
The explain output of `DataSourceScanExec` can contain sensitive information (like Amazon keys). Such information should not end up in logs, or be exposed to non privileged users.
This PR addresses this by adding a redaction facility for the `DataSourceScanExec.treeString`. A user can enable this by setting a regex in the `spark.redaction.string.regex` configuration.
## How was this patch tested?
Added a unit test to check the output of DataSourceScanExec.
Author: Herman van Hovell <hvanhovell@databricks.com>
Closes#17397 from hvanhovell/SPARK-20070.
This commit adds a killTaskAttempt method to SparkContext, to allow users to
kill tasks so that they can be re-scheduled elsewhere.
This also refactors the task kill path to allow specifying a reason for the task kill. The reason is propagated opaquely through events, and will show up in the UI automatically as `(N killed: $reason)` and `TaskKilled: $reason`. Without this change, there is no way to provide the user feedback through the UI.
Currently used reasons are "stage cancelled", "another attempt succeeded", and "killed via SparkContext.killTask". The user can also specify a custom reason through `SparkContext.killTask`.
cc rxin
In the stage overview UI the reasons are summarized:
![1](https://cloud.githubusercontent.com/assets/14922/23929209/a83b2862-08e1-11e7-8b3e-ae1967bbe2e5.png)
Within the stage UI you can see individual task kill reasons:
![2](https://cloud.githubusercontent.com/assets/14922/23929200/9a798692-08e1-11e7-8697-72b27ad8a287.png)
Existing tests, tried killing some stages in the UI and verified the messages are as expected.
Author: Eric Liang <ekl@databricks.com>
Author: Eric Liang <ekl@google.com>
Closes#17166 from ericl/kill-reason.
## What changes were proposed in this pull request?
1. Use a MedianHeap to record durations of successful tasks. When check speculatable tasks, we can get the median duration with O(1) time complexity.
2. `checkSpeculatableTasks` will synchronize `TaskSchedulerImpl`. If `checkSpeculatableTasks` doesn't finish with 100ms, then the possibility exists for that thread to release and then immediately re-acquire the lock. Change `scheduleAtFixedRate` to be `scheduleWithFixedDelay` when call method of `checkSpeculatableTasks`.
## How was this patch tested?
Added MedianHeapSuite.
Author: jinxing <jinxing6042@126.com>
Closes#16867 from jinxing64/SPARK-16929.
## What changes were proposed in this pull request?
Some `Schedulable` Entities(`Pool` and `TaskSetManager`) variables need refactoring for _immutability_ and _access modifiers_ levels as follows:
- From `var` to `val` (if there is no requirement): This is important to support immutability as much as possible.
- Sample => `Pool`: `weight`, `minShare`, `priority`, `name` and `taskSetSchedulingAlgorithm`.
- Access modifiers: Specially, `var`s access needs to be restricted from other parts of codebase to prevent potential side effects.
- `TaskSetManager`: `tasksSuccessful`, `totalResultSize`, `calculatedTasks` etc...
This PR is related with #15604 and has been created seperatedly to keep patch content as isolated and to help the reviewers.
## How was this patch tested?
Added new UTs and existing UT coverage.
Author: erenavsarogullari <erenavsarogullari@gmail.com>
Closes#16905 from erenavsarogullari/SPARK-19567.
## What changes were proposed in this pull request?
Several javadoc8 breaks have been introduced. This PR proposes fix those instances so that we can build Scala/Java API docs.
```
[error] .../spark/sql/core/target/java/org/apache/spark/sql/streaming/GroupState.java:6: error: reference not found
[error] * <code>flatMapGroupsWithState</code> operations on {link KeyValueGroupedDataset}.
[error] ^
[error] .../spark/sql/core/target/java/org/apache/spark/sql/streaming/GroupState.java:10: error: reference not found
[error] * Both, <code>mapGroupsWithState</code> and <code>flatMapGroupsWithState</code> in {link KeyValueGroupedDataset}
[error] ^
[error] .../spark/sql/core/target/java/org/apache/spark/sql/streaming/GroupState.java:51: error: reference not found
[error] * {link GroupStateTimeout.ProcessingTimeTimeout}) or event time (i.e.
[error] ^
[error] .../spark/sql/core/target/java/org/apache/spark/sql/streaming/GroupState.java:52: error: reference not found
[error] * {link GroupStateTimeout.EventTimeTimeout}).
[error] ^
[error] .../spark/sql/core/target/java/org/apache/spark/sql/streaming/GroupState.java:158: error: reference not found
[error] * Spark SQL types (see {link Encoder} for more details).
[error] ^
[error] .../spark/mllib/target/java/org/apache/spark/ml/fpm/FPGrowthParams.java:26: error: bad use of '>'
[error] * Number of partitions (>=1) used by parallel FP-growth. By default the param is not set, and
[error] ^
[error] .../spark/sql/core/src/main/java/org/apache/spark/api/java/function/FlatMapGroupsWithStateFunction.java:30: error: reference not found
[error] * {link org.apache.spark.sql.KeyValueGroupedDataset#flatMapGroupsWithState(
[error] ^
[error] .../spark/sql/core/target/java/org/apache/spark/sql/KeyValueGroupedDataset.java:211: error: reference not found
[error] * See {link GroupState} for more details.
[error] ^
[error] .../spark/sql/core/target/java/org/apache/spark/sql/KeyValueGroupedDataset.java:232: error: reference not found
[error] * See {link GroupState} for more details.
[error] ^
[error] .../spark/sql/core/target/java/org/apache/spark/sql/KeyValueGroupedDataset.java:254: error: reference not found
[error] * See {link GroupState} for more details.
[error] ^
[error] .../spark/sql/core/target/java/org/apache/spark/sql/KeyValueGroupedDataset.java:277: error: reference not found
[error] * See {link GroupState} for more details.
[error] ^
[error] .../spark/core/target/java/org/apache/spark/TaskContextImpl.java:10: error: reference not found
[error] * {link TaskMetrics} & {link MetricsSystem} objects are not thread safe.
[error] ^
[error] .../spark/core/target/java/org/apache/spark/TaskContextImpl.java:10: error: reference not found
[error] * {link TaskMetrics} & {link MetricsSystem} objects are not thread safe.
[error] ^
[info] 13 errors
```
```
jekyll 3.3.1 | Error: Unidoc generation failed
```
## How was this patch tested?
Manually via `jekyll build`
Author: hyukjinkwon <gurwls223@gmail.com>
Closes#17389 from HyukjinKwon/minor-javadoc8-fix.
## What changes were proposed in this pull request?
SparkR ```spark.getSparkFiles``` fails when it was called on executors, see details at [SPARK-19925](https://issues.apache.org/jira/browse/SPARK-19925).
## How was this patch tested?
Add unit tests, and verify this fix at standalone and yarn cluster.
Author: Yanbo Liang <ybliang8@gmail.com>
Closes#17274 from yanboliang/spark-19925.
## What changes were proposed in this pull request?
"java.lang.Exception: Could not compute split, block $blockId not found" doesn't have the rdd id info, the "BlockManager: Removing RDD $id" has only the RDD id, so it couldn't find that the Exception's reason is the Removing; so it's better block not found Exception add RDD id info
## How was this patch tested?
Existing tests
Author: jianran.tfh <jianran.tfh@taobao.com>
Author: jianran <tanfanhua1984@163.com>
Closes#17334 from jianran/SPARK-19998.
(Jira: https://issues.apache.org/jira/browse/SPARK-17204)
## What changes were proposed in this pull request?
There are a couple of bugs in the `BlockManager` with respect to support for replicated off-heap storage. First, the locally-stored off-heap byte buffer is disposed of when it is replicated. It should not be. Second, the replica byte buffers are stored as heap byte buffers instead of direct byte buffers even when the storage level memory mode is off-heap. This PR addresses both of these problems.
## How was this patch tested?
`BlockManagerReplicationSuite` was enhanced to fill in the coverage gaps. It now fails if either of the bugs in this PR exist.
Author: Michael Allman <michael@videoamp.com>
Closes#16499 from mallman/spark-17204-replicated_off_heap_storage.
## What changes were proposed in this pull request?
Passes R `tempdir()` (this is the R session temp dir, shared with other temp files/dirs) to JVM, set System.Property for derby home dir to move derby.log
## How was this patch tested?
Manually, unit tests
With this, these are relocated to under /tmp
```
# ls /tmp/RtmpG2M0cB/
derby.log
```
And they are removed automatically when the R session is ended.
Author: Felix Cheung <felixcheung_m@hotmail.com>
Closes#16330 from felixcheung/rderby.
## What changes were proposed in this pull request?
Avoid None.get exception in (rare?) case that no readLocks exist
Note that while this would resolve the immediate cause of the exception, it's not clear it is the root problem.
## How was this patch tested?
Existing tests
Author: Sean Owen <sowen@cloudera.com>
Closes#17290 from srowen/SPARK-16599.
The previously hardcoded max 4 retries per stage is not suitable for all cluster configurations. Since spark retries a stage at the sign of the first fetch failure, you can easily end up with many stage retries to discover all the failures. In particular, two scenarios this value should change are (1) if there are more than 4 executors per node; in that case, it may take 4 retries to discover the problem with each executor on the node and (2) during cluster maintenance on large clusters, where multiple machines are serviced at once, but you also cannot afford total cluster downtime. By making this value configurable, cluster managers can tune this value to something more appropriate to their cluster configuration.
Unit tests
Author: Sital Kedia <skedia@fb.com>
Closes#17307 from sitalkedia/SPARK-13369.
## What changes were proposed in this pull request?
DebugFilesystem.assertNoOpenStreams throws an exception with a cause exception that actually shows the code line which leaked the stream.
## How was this patch tested?
New test in SparkContextSuite to check there is a cause exception.
Author: Bogdan Raducanu <bogdan@databricks.com>
Closes#17292 from bogdanrdc/SPARK-19946.
## What changes were proposed in this pull request?
The following FIFO & FAIR Schedulers Pool usage cases need to have unit test coverage :
- FIFO Scheduler just uses **root pool** so even if `spark.scheduler.pool` property is set, related pool is not created and `TaskSetManagers` are added to **root pool**.
- FAIR Scheduler uses `default pool` when `spark.scheduler.pool` property is not set. This can be happened when
- `Properties` object is **null**,
- `Properties` object is **empty**(`new Properties()`),
- **default pool** is set(`spark.scheduler.pool=default`).
- FAIR Scheduler creates a **new pool** with **default values** when `spark.scheduler.pool` property points a **non-existent** pool. This can be happened when **scheduler allocation file** is not set or it does not contain related pool.
## How was this patch tested?
New Unit tests are added.
Author: erenavsarogullari <erenavsarogullari@gmail.com>
Closes#15604 from erenavsarogullari/SPARK-18066.
## What changes were proposed in this pull request?
This PR introduces the following changes:
1. Move `SparkHadoopWriter` to `core/internal/io/`, so that it's in the same directory with `SparkHadoopMapReduceWriter`;
2. Move `SparkHadoopWriterUtils` to a separated file.
After this PR is merged, we may consolidate `SparkHadoopWriter` and `SparkHadoopMapReduceWriter`, and make the new commit protocol support the old `mapred` package's committer;
## How was this patch tested?
Tested by existing test cases.
Author: jiangxingbo <jiangxb1987@gmail.com>
Closes#17304 from jiangxb1987/writer.
## What changes were proposed in this pull request?
It is sometimes useful to use multiple threads in a task to parallelize tasks. These threads might register some completion/failure listeners to clean up when the task completes or fails. We currently cannot register such a callback and be sure that it will get called, because the context might be in the process of invoking its callbacks, when the the callback gets registered.
This PR improves this by making sure that you cannot add a completion/failure listener from a different thread when the context is being marked as completed/failed in another thread. This is done by synchronizing these methods on the task context itself.
Failure listeners were called only once. Completion listeners now follow the same pattern; this lifts the idempotency requirement for completion listeners and makes it easier to implement them. In some cases we can (accidentally) add a completion/failure listener after the fact, these listeners will be called immediately in order make sure we can safely clean-up after a task.
As a result of this change we could make the `failure` and `completed` flags non-volatile. The `isCompleted()` method now uses synchronization to ensure that updates are visible across threads.
## How was this patch tested?
Adding tests to `TaskContestSuite` to test adding listeners to a completed/failed context.
Author: Herman van Hovell <hvanhovell@databricks.com>
Closes#17244 from hvanhovell/SPARK-19889.
Cleaning the application may cost much time at worker, then it will block that the worker send heartbeats master because the worker is extend ThreadSafeRpcEndpoint. If the heartbeat from a worker is blocked by the message ApplicationFinished, master will think the worker is dead. If the worker has a driver, the driver will be scheduled by master again.
It had better reuse the existing cleanupThreadExecutor to clean up the directories of finished applications to avoid the block.
Author: xiaojian.fxj <xiaojian.fxj@alibaba-inc.com>
Closes#17189 from hustfxj/worker-hearbeat.
## What changes were proposed in this pull request?
TaskSetManager is now using `System.getCurrentTimeMillis` when mark task as finished in `handleSuccessfulTask` and `handleFailedTask`. Thus developer cannot set the tasks finishing time in unit test. When `handleSuccessfulTask`, task's duration = `System.getCurrentTimeMillis` - launchTime(which can be set by `clock`), the result is not correct.
## How was this patch tested?
Existing tests.
Author: jinxing <jinxing6042@126.com>
Closes#17133 from jinxing64/SPARK-19793.
## What changes were proposed in this pull request?
While some executors are being killed due to idleness, if some new tasks come in, driver could assign them to some executors are being killed. These tasks will fail later when the executors are lost. This patch is to make sure CoarseGrainedSchedulerBackend#killExecutors and DriverEndpoint#makeOffers are properly synchronized.
## How was this patch tested?
manual tests
Author: Jimmy Xiang <jxiang@apache.org>
Closes#17091 from jxiang/spark-19757.
## What changes were proposed in this pull request?
200ms may be too short. Give more time for replication to happen and new block be reported to master
## How was this patch tested?
test manully
Author: uncleGen <hustyugm@gmail.com>
Author: dylon <hustyugm@gmail.com>
Closes#17144 from uncleGen/SPARK-19803.
## What changes were proposed in this pull request?
The properties that are serialized with a TaskDescription can have very long values (eg. "spark.job.description" which is set to the full sql statement with the thrift-server). DataOutputStream.writeUTF() does not work well for long strings, so this changes the way those values are serialized to handle longer strings.
## How was this patch tested?
Updated existing unit test to reproduce the issue. All unit tests via jenkins.
Author: Imran Rashid <irashid@cloudera.com>
Closes#17140 from squito/SPARK-19796.
Signed-off-by: liuxian <liu.xian3zte.com.cn>
## What changes were proposed in this pull request?
Open the spark web page,in the Master Page ,have two tables:Running Applications table and Completed Applications table, to the column named “Memory per Node” ,I think it is not all right ,because a node may be not have only one executor.So I think that should be named as “Memory per Executor”.Otherwise easy to let the user misunderstanding
## How was this patch tested?
N/A
Author: liuxian <liu.xian3@zte.com.cn>
Closes#17132 from 10110346/wid-lx-0302.
## What changes were proposed in this pull request?
propagate S3 session token to cluser
## How was this patch tested?
existing ut
Author: uncleGen <hustyugm@gmail.com>
Closes#17080 from uncleGen/SPARK-19739.
## What changes were proposed in this pull request?
Fault-tolerance in spark requires special handling of shuffle fetch
failures. The Executor would catch FetchFailedException and send a
special msg back to the driver.
However, intervening user code could intercept that exception, and wrap
it with something else. This even happens in SparkSQL. So rather than
checking the thrown exception only, we'll store the fetch failure directly
in the TaskContext, where users can't touch it.
## How was this patch tested?
Added a test case which failed before the fix. Full test suite via jenkins.
Author: Imran Rashid <irashid@cloudera.com>
Closes#16639 from squito/SPARK-19276.
## What changes were proposed in this pull request?
Previously it was possible for there to be a race between a task failure and committing the output of a task. For example, the driver may mark a task attempt as failed due to an executor heartbeat timeout (possibly due to GC), but the task attempt actually ends up coordinating with the OutputCommitCoordinator once the executor recovers and committing its result. This will lead to any retry attempt failing because the task result has already been committed despite the original attempt failing.
This ensures that any previously failed task attempts cannot enter the commit protocol.
## How was this patch tested?
Added a unit test
Author: Patrick Woody <pwoody@palantir.com>
Closes#16959 from pwoody/pw/recordFailuresForCommitter.
## What changes were proposed in this pull request?
This change redacts senstive information (based on `spark.redaction.regex` property)
from the Spark Submit console logs. Such sensitive information is already being
redacted from event logs and yarn logs, etc.
## How was this patch tested?
Testing was done manually to make sure that the console logs were not printing any
sensitive information.
Here's some output from the console:
```
Spark properties used, including those specified through
--conf and those from the properties file /etc/spark2/conf/spark-defaults.conf:
(spark.yarn.appMasterEnv.HADOOP_CREDSTORE_PASSWORD,*********(redacted))
(spark.authenticate,false)
(spark.executorEnv.HADOOP_CREDSTORE_PASSWORD,*********(redacted))
```
```
System properties:
(spark.yarn.appMasterEnv.HADOOP_CREDSTORE_PASSWORD,*********(redacted))
(spark.authenticate,false)
(spark.executorEnv.HADOOP_CREDSTORE_PASSWORD,*********(redacted))
```
There is a risk if new print statements were added to the console down the road, sensitive information may still get leaked, since there is no test that asserts on the console log output. I considered it out of the scope of this JIRA to write an integration test to make sure new leaks don't happen in the future.
Running unit tests to make sure nothing else is broken by this change.
Author: Mark Grover <mark@apache.org>
Closes#17047 from markgrover/master_redaction.
## What changes were proposed in this pull request?
When function 'executorLost' is invoked in class 'TaskSetManager', it's significant to judge whether variable 'isZombie' is set to true.
This pull request fixes the following hang:
1.Open speculation switch in the application.
2.Run this app and suppose last task of shuffleMapStage 1 finishes. Let's get the record straight, from the eyes of DAG, this stage really finishes, and from the eyes of TaskSetManager, variable 'isZombie' is set to true, but variable runningTasksSet isn't empty because of speculation.
3.Suddenly, executor 3 is lost. TaskScheduler receiving this signal, invokes all executorLost functions of rootPool's taskSetManagers. DAG receiving this signal, removes all this executor's outputLocs.
4.TaskSetManager adds all this executor's tasks to pendingTasks and tells DAG they will be resubmitted (Attention: possibly not on time).
5.DAG starts to submit a new waitingStage, let's say shuffleMapStage 2, and going to find that shuffleMapStage 1 is its missing parent because some outputLocs are removed due to executor lost. Then DAG submits shuffleMapStage 1 again.
6.DAG still receives Task 'Resubmitted' signal from old taskSetManager, and increases the number of pendingTasks of shuffleMapStage 1 each time. However, old taskSetManager won't resolve new task to submit because its variable 'isZombie' is set to true.
7.Finally shuffleMapStage 1 never finishes in DAG together with all stages depending on it.
## How was this patch tested?
It's quite difficult to construct test cases.
Author: GavinGavinNo1 <gavingavinno1@gmail.com>
Author: 16092929 <16092929@cnsuning.com>
Closes#16855 from GavinGavinNo1/resolve-stage-blocked2.
## What changes were proposed in this pull request?
When check speculatable tasks in `TaskSetManager`, only scan `runningTasksSet` instead of scanning all `taskInfos`.
## How was this patch tested?
Existing tests.
Author: jinxing <jinxing6042@126.com>
Closes#17111 from jinxing64/SPARK-19777.
## What changes were proposed in this pull request?
This PR proposes to fix the lint-breaks as below:
```
[ERROR] src/test/java/org/apache/spark/network/TransportResponseHandlerSuite.java:[29,8] (imports) UnusedImports: Unused import - org.apache.spark.network.buffer.ManagedBuffer.
[ERROR] src/main/java/org/apache/spark/unsafe/types/UTF8String.java:[156,10] (modifier) ModifierOrder: 'Nonnull' annotation modifier does not precede non-annotation modifiers.
[ERROR] src/main/java/org/apache/spark/SparkFirehoseListener.java:[122] (sizes) LineLength: Line is longer than 100 characters (found 105).
[ERROR] src/main/java/org/apache/spark/util/collection/unsafe/sort/UnsafeExternalSorter.java:[164,78] (coding) OneStatementPerLine: Only one statement per line allowed.
[ERROR] src/test/java/test/org/apache/spark/JavaAPISuite.java:[1157] (sizes) LineLength: Line is longer than 100 characters (found 121).
[ERROR] src/test/java/org/apache/spark/streaming/JavaMapWithStateSuite.java:[149] (sizes) LineLength: Line is longer than 100 characters (found 113).
[ERROR] src/test/java/test/org/apache/spark/streaming/Java8APISuite.java:[146] (sizes) LineLength: Line is longer than 100 characters (found 122).
[ERROR] src/test/java/test/org/apache/spark/streaming/JavaAPISuite.java:[32,8] (imports) UnusedImports: Unused import - org.apache.spark.streaming.Time.
[ERROR] src/test/java/test/org/apache/spark/streaming/JavaAPISuite.java:[611] (sizes) LineLength: Line is longer than 100 characters (found 101).
[ERROR] src/test/java/test/org/apache/spark/streaming/JavaAPISuite.java:[1317] (sizes) LineLength: Line is longer than 100 characters (found 102).
[ERROR] src/test/java/test/org/apache/spark/sql/JavaDatasetAggregatorSuite.java:[91] (sizes) LineLength: Line is longer than 100 characters (found 102).
[ERROR] src/test/java/test/org/apache/spark/sql/JavaDatasetSuite.java:[113] (sizes) LineLength: Line is longer than 100 characters (found 101).
[ERROR] src/test/java/test/org/apache/spark/sql/JavaDatasetSuite.java:[164] (sizes) LineLength: Line is longer than 100 characters (found 110).
[ERROR] src/test/java/test/org/apache/spark/sql/JavaDatasetSuite.java:[212] (sizes) LineLength: Line is longer than 100 characters (found 114).
[ERROR] src/test/java/org/apache/spark/mllib/tree/JavaDecisionTreeSuite.java:[36] (sizes) LineLength: Line is longer than 100 characters (found 101).
[ERROR] src/main/java/org/apache/spark/examples/streaming/JavaKinesisWordCountASL.java:[26,8] (imports) UnusedImports: Unused import - com.amazonaws.regions.RegionUtils.
[ERROR] src/test/java/org/apache/spark/streaming/kinesis/JavaKinesisStreamSuite.java:[20,8] (imports) UnusedImports: Unused import - com.amazonaws.regions.RegionUtils.
[ERROR] src/test/java/org/apache/spark/streaming/kinesis/JavaKinesisStreamSuite.java:[94] (sizes) LineLength: Line is longer than 100 characters (found 103).
[ERROR] src/main/java/org/apache/spark/examples/ml/JavaTokenizerExample.java:[30,8] (imports) UnusedImports: Unused import - org.apache.spark.sql.api.java.UDF1.
[ERROR] src/main/java/org/apache/spark/examples/ml/JavaTokenizerExample.java:[72] (sizes) LineLength: Line is longer than 100 characters (found 104).
[ERROR] src/main/java/org/apache/spark/examples/mllib/JavaRankingMetricsExample.java:[121] (sizes) LineLength: Line is longer than 100 characters (found 101).
[ERROR] src/main/java/org/apache/spark/examples/sql/JavaSQLDataSourceExample.java:[28,8] (imports) UnusedImports: Unused import - org.apache.spark.api.java.JavaRDD.
[ERROR] src/main/java/org/apache/spark/examples/sql/JavaSQLDataSourceExample.java:[29,8] (imports) UnusedImports: Unused import - org.apache.spark.api.java.JavaSparkContext.
```
## How was this patch tested?
Manually via
```bash
./dev/lint-java
```
Author: hyukjinkwon <gurwls223@gmail.com>
Closes#17072 from HyukjinKwon/java-lint.