When I type spark-shell --help, I find that the default value description for the master parameter is missing. The user does not know what the default value is when the master parameter is not included, so we need to add the master parameter default description to the help information.
[https://issues.apache.org/jira/browse/SPARK-21553](https://issues.apache.org/jira/browse/SPARK-21553)
Author: davidxdh <xu.donghui@zte.com.cn>
Author: Donghui Xu <xu.donghui@zte.com.cn>
Closes#18755 from davidxdh/dev_0728.
## What changes were proposed in this pull request?
Fixes current failures in dev/lint-java
## How was this patch tested?
Existing linter, tests.
Author: Sean Owen <sowen@cloudera.com>
Closes#18757 from srowen/LintJava.
## What changes were proposed in this pull request?
`UnsafeExternalSorter.recordComparator` can be either `KVComparator` or `RowComparator`, and both of them will keep the reference to the input rows they compared last time.
After sorting, we return the sorted iterator to upstream operators. However, the upstream operators may take a while to consume up the sorted iterator, and `UnsafeExternalSorter` is registered to `TaskContext` at [here](https://github.com/apache/spark/blob/v2.2.0/core/src/main/java/org/apache/spark/util/collection/unsafe/sort/UnsafeExternalSorter.java#L159-L161), which means we will keep the `UnsafeExternalSorter` instance and keep the last compared input rows in memory until the sorted iterator is consumed up.
Things get worse if we sort within partitions of a dataset and coalesce all partitions into one, as we will keep a lot of input rows in memory and the time to consume up all the sorted iterators is long.
This PR takes over https://github.com/apache/spark/pull/18543 , the idea is that, we do not keep the record comparator instance in `UnsafeExternalSorter`, but a generator of record comparator.
close#18543
## How was this patch tested?
N/A
Author: Wenchen Fan <wenchen@databricks.com>
Closes#18679 from cloud-fan/memory-leak.
## What changes were proposed in this pull request?
In our production cluster,oom happens when NettyBlockRpcServer receive OpenBlocks message.The reason we observed is below:
When BlockManagerManagedBuffer call ChunkedByteBuffer#toNetty, it will use Unpooled.wrappedBuffer(ByteBuffer... buffers) which use default maxNumComponents=16 in low-level CompositeByteBuf.When our component's number is bigger than 16, it will execute consolidateIfNeeded
int numComponents = this.components.size();
if(numComponents > this.maxNumComponents) {
int capacity = ((CompositeByteBuf.Component)this.components.get(numComponents - 1)).endOffset;
ByteBuf consolidated = this.allocBuffer(capacity);
for(int c = 0; c < numComponents; ++c) {
CompositeByteBuf.Component c1 = (CompositeByteBuf.Component)this.components.get(c);
ByteBuf b = c1.buf;
consolidated.writeBytes(b);
c1.freeIfNecessary();
}
CompositeByteBuf.Component var7 = new CompositeByteBuf.Component(consolidated);
var7.endOffset = var7.length;
this.components.clear();
this.components.add(var7);
}
in CompositeByteBuf which will consume some memory during buffer copy.
We can use another api Unpooled. wrappedBuffer(int maxNumComponents, ByteBuffer... buffers) to avoid this comsuming.
## How was this patch tested?
Test in production cluster.
Author: zhoukang <zhoukang@xiaomi.com>
Closes#18723 from caneGuy/zhoukang/fix-chunkbuffer.
inprogress history file in some cases.
Add failure handling for EOFException that can be thrown during
decompression of an inprogress spark history file, treat same as case
where can't parse the last line.
## What changes were proposed in this pull request?
Failure handling for case of EOFException thrown within the ReplayListenerBus.replay method to handle the case analogous to json parse fail case. This path can arise in compressed inprogress history files since an incomplete compression block could be read (not flushed by writer on a block boundary). See the stack trace of this occurrence in the jira ticket (https://issues.apache.org/jira/browse/SPARK-21447)
## How was this patch tested?
Added a unit test that specifically targets validating the failure handling path appropriately when maybeTruncated is true and false.
Author: Eric Vandenberg <ericvandenberg@fb.com>
Closes#18673 from ericvandenbergfb/fix_inprogress_compr_history_file.
Executors run a thread pool with daemon threads to run tasks. This means
that those threads remain active when the JVM is shutting down, meaning
those tasks are affected by code that runs in shutdown hooks.
So if a shutdown hook messes with something that the task is using (e.g.
an HDFS connection), the task will fail and will report that failure to
the driver. That will make the driver mark the task as failed regardless
of what caused the executor to shut down. So, for example, if YARN pre-empted
that executor, the driver would consider that task failed when it should
instead ignore the failure.
This change avoids reporting failures to the driver when shutdown hooks
are executing; this fixes the YARN preemption accounting, and doesn't really
change things much for other scenarios, other than reporting a more generic
error ("Executor lost") when the executor shuts down unexpectedly - which
is arguably more correct.
Tested with a hacky app running on spark-shell that tried to cause failures
only when shutdown hooks were running, verified that preemption didn't cause
the app to fail because of task failures exceeding the threshold.
Author: Marcelo Vanzin <vanzin@cloudera.com>
Closes#18594 from vanzin/SPARK-20904.
## What changes were proposed in this pull request?
DirectParquetOutputCommitter was removed from Spark as it was deemed unsafe to use. We however still have some code to generate warning. This patch removes those code as well.
This is kind of a follow-up of https://github.com/apache/spark/pull/16796
## How was this patch tested?
existing tests
Author: Wenchen Fan <wenchen@databricks.com>
Closes#18689 from cloud-fan/minor.
## What changes were proposed in this pull request?
During writing to the .inprogress file (stored on the HDFS) Hadoop doesn't update file length until close and therefor Spark's history server can't detect any changes. We have to send UPDATE_LENGTH manually.
Author: Oleg Danilov <oleg.danilov@wandisco.com>
Closes#16924 from dosoft/SPARK-19531.
## What changes were proposed in this pull request?
This is the reopen of https://github.com/apache/spark/pull/14198, with merge conflicts resolved.
ueshin Could you please take a look at my code?
Fix bugs about types that result an array of null when creating DataFrame using python.
Python's array.array have richer type than python itself, e.g. we can have `array('f',[1,2,3])` and `array('d',[1,2,3])`. Codes in spark-sql and pyspark didn't take this into consideration which might cause a problem that you get an array of null values when you have `array('f')` in your rows.
A simple code to reproduce this bug is:
```
from pyspark import SparkContext
from pyspark.sql import SQLContext,Row,DataFrame
from array import array
sc = SparkContext()
sqlContext = SQLContext(sc)
row1 = Row(floatarray=array('f',[1,2,3]), doublearray=array('d',[1,2,3]))
rows = sc.parallelize([ row1 ])
df = sqlContext.createDataFrame(rows)
df.show()
```
which have output
```
+---------------+------------------+
| doublearray| floatarray|
+---------------+------------------+
|[1.0, 2.0, 3.0]|[null, null, null]|
+---------------+------------------+
```
## How was this patch tested?
New test case added
Author: Xiang Gao <qasdfgtyuiop@gmail.com>
Author: Gao, Xiang <qasdfgtyuiop@gmail.com>
Author: Takuya UESHIN <ueshin@databricks.com>
Closes#18444 from zasdfgbnm/fix_array_infer.
## What changes were proposed in this pull request?
For configurations with external shuffle enabled, we have observed that if a very large no. of blocks are being fetched from a remote host, it puts the NM under extra pressure and can crash it. This change introduces a configuration `spark.reducer.maxBlocksInFlightPerAddress` , to limit the no. of map outputs being fetched from a given remote address. The changes applied here are applicable for both the scenarios - when external shuffle is enabled as well as disabled.
## How was this patch tested?
Ran the job with the default configuration which does not change the existing behavior and ran it with few configurations of lower values -10,20,50,100. The job ran fine and there is no change in the output. (I will update the metrics related to NM in some time.)
Author: Dhruve Ashar <dhruveashar@gmail.com>
Closes#18487 from dhruve/impr/SPARK-21243.
## What changes were proposed in this pull request?
The most of BoundedPriorityQueue usages in ML/MLLIB are:
Get the value of BoundedPriorityQueue, then sort it.
For example, in Word2Vec: pq.toSeq.sortBy(-_._2)
in ALS, pq.toArray.sorted()
The test results show using pq.poll is much faster than sort the value.
It is good to add the poll function for BoundedPriorityQueue.
## How was this patch tested?
The existing UT
Author: Peng <peng.meng@intel.com>
Author: Peng Meng <peng.meng@intel.com>
Closes#18620 from mpjlu/add-poll.
Instead of using the host's cpu count, use the number of cores allocated
for the Spark process when sizing the RPC dispatch thread pool. This avoids
creating large thread pools on large machines when the number of allocated
cores is small.
Tested by verifying number of threads with spark.executor.cores set
to 1 and 4; same thing for YARN AM.
Author: Marcelo Vanzin <vanzin@cloudera.com>
Closes#18639 from vanzin/SPARK-21408.
## What changes were proposed in this pull request?
In the current `YARNHadoopDelegationTokenManager`, `FileSystem` to which to get tokens are created out of KDC logged UGI, using these `FileSystem` to get new tokens will lead to exception. The main thing is that Spark code trying to get new tokens from the FS created with token auth-ed UGI, but Hadoop can only grant new tokens in kerberized UGI. To fix this issue, we should lazily create these FileSystem within KDC logged UGI.
## How was this patch tested?
Manual verification in secure cluster.
CC vanzin mgummelt please help to review, thanks!
Author: jerryshao <sshao@hortonworks.com>
Closes#18633 from jerryshao/SPARK-21411.
## What changes were proposed in this pull request?
Address scapegoat warnings for:
- BigDecimal double constructor
- Catching NPE
- Finalizer without super
- List.size is O(n)
- Prefer Seq.empty
- Prefer Set.empty
- reverse.map instead of reverseMap
- Type shadowing
- Unnecessary if condition.
- Use .log1p
- Var could be val
In some instances like Seq.empty, I avoided making the change even where valid in test code to keep the scope of the change smaller. Those issues are concerned with performance and it won't matter for tests.
## How was this patch tested?
Existing tests
Author: Sean Owen <sowen@cloudera.com>
Closes#18635 from srowen/Scapegoat1.
## What changes were proposed in this pull request?
In SPARK-21444, sitalkedia reported an issue where the `Broadcast.destroy()` call in `MapOutputTracker`'s `ShuffleStatus.invalidateSerializedMapOutputStatusCache()` was failing with an `IOException`, causing the DAGScheduler to crash and bring down the entire driver.
This is a bug introduced by #17955. In the old code, we removed a broadcast variable by calling `BroadcastManager.unbroadcast` with `blocking=false`, but the new code simply calls `Broadcast.destroy()` which is capable of failing with an IOException in case certain blocking RPCs time out.
The fix implemented here is to replace this with a call to `destroy(blocking = false)` and to wrap the entire operation in `Utils.tryLogNonFatalError`.
## How was this patch tested?
I haven't written regression tests for this because it's really hard to inject mocks to simulate RPC failures here. Instead, this class of issue is probably best uncovered with more generalized error injection / network unreliability / fuzz testing tools.
Author: Josh Rosen <joshrosen@databricks.com>
Closes#18662 from JoshRosen/SPARK-21444.
## What changes were proposed in this pull request?
Fix a bug in RangePartitioner:
In RangePartitioner(partitions: Int, rdd: RDD[]), RangePartitioner.numPartitions is wrong if the number of elements in RDD (rdd.count()) is less than number of partitions (partitions in constructor).
## How was this patch tested?
test as described in [SPARK-SPARK-21410](https://issues.apache.org/jira/browse/SPARK-21410
)
Please review http://spark.apache.org/contributing.html before opening a pull request.
Author: Zhang A Peng <zhangap@cn.ibm.com>
Closes#18631 from apapi/fixRangePartitioner.numPartitions.
## What changes were proposed in this pull request?
The current code is very verbose on shutdown.
The changes I propose is to change the log level when the driver is shutting down and the RPC connections are closed (RpcEnvStoppedException).
## How was this patch tested?
Tested with word count(deploy-mode = cluster, master = yarn, num-executors = 4) with 300GB of data.
Author: John Lee <jlee2@yahoo-inc.com>
Closes#18547 from yoonlee95/SPARK-21321.
## What changes were proposed in this pull request?
This PR fixes a wrong comparison for `BinaryType`. This PR enables unsigned comparison and unsigned prefix generation for an array for `BinaryType`. Previous implementations uses signed operations.
## How was this patch tested?
Added a test suite in `OrderingSuite`.
Author: Kazuaki Ishizaki <ishizaki@jp.ibm.com>
Closes#18571 from kiszk/SPARK-21344.
## What changes were proposed in this pull request?
Fixes --packages flag for mesos in cluster mode. Probably I will handle standalone and Yarn in another commit, I need to investigate those cases as they are different.
## How was this patch tested?
Tested with a community 1.9 dc/os cluster. packages were successfully resolved in cluster mode within a container.
andrewor14 susanxhuynh ArtRand srowen pls review.
Author: Stavros Kontopoulos <st.kontopoulos@gmail.com>
Closes#18587 from skonto/fix_packages_mesos_cluster.
## What changes were proposed in this pull request?
- Remove Scala 2.10 build profiles and support
- Replace some 2.10 support in scripts with commented placeholders for 2.12 later
- Remove deprecated API calls from 2.10 support
- Remove usages of deprecated context bounds where possible
- Remove Scala 2.10 workarounds like ScalaReflectionLock
- Other minor Scala warning fixes
## How was this patch tested?
Existing tests
Author: Sean Owen <sowen@cloudera.com>
Closes#17150 from srowen/SPARK-19810.
## What changes were proposed in this pull request?
Adding the default UncaughtExceptionHandler to the Worker.
## How was this patch tested?
I verified it manually, when any of the worker thread gets uncaught exceptions then the default UncaughtExceptionHandler will handle those exceptions.
Author: Devaraj K <devaraj@apache.org>
Closes#18357 from devaraj-kavali/SPARK-21146.
## What changes were proposed in this pull request?
In current code, it is expensive to use `UnboundedFollowingWindowFunctionFrame`, because it is iterating from the start to lower bound every time calling `write` method. When traverse the iterator, it's possible to skip some spilled files thus to save some time.
## How was this patch tested?
Added unit test
Did a small test for benchmark:
Put 2000200 rows into `UnsafeExternalSorter`-- 2 spill files(each contains 1000000 rows) and inMemSorter contains 200 rows.
Move the iterator forward to index=2000001.
*With this change*:
`getIterator(2000001)`, it will cost almost 0ms~1ms;
*Without this change*:
`for(int i=0; i<2000001; i++)geIterator().loadNext()`, it will cost 300ms.
Author: jinxing <jinxing6042@126.com>
Closes#18541 from jinxing64/SPARK-21315.
## What changes were proposed in this pull request?
When `RetryingBlockFetcher` retries fetching blocks. There could be two `DownloadCallback`s download the same content to the same target file. It could cause `ShuffleBlockFetcherIterator` reading a partial result.
This pr proposes to create and delete the tmp files in `OneForOneBlockFetcher`
Author: jinxing <jinxing6042@126.com>
Author: Shixiong Zhu <zsxwing@gmail.com>
Closes#18565 from jinxing64/SPARK-21342.
## What changes were proposed in this pull request?
There's a race condition in the current TaskSetManager where a failed task is added for retry (addPendingTask), and can asynchronously be assigned to an executor *prior* to the blacklist state (updateBlacklistForFailedTask), the result is the task might re-execute on the same executor. This is particularly problematic if the executor is shutting down since the retry task immediately becomes a lost task (ExecutorLostFailure). Another side effect is that the actual failure reason gets obscured by the retry task which never actually executed. There are sample logs showing the issue in the https://issues.apache.org/jira/browse/SPARK-21219
The fix is to change the ordering of the addPendingTask and updatingBlackListForFailedTask calls in TaskSetManager.handleFailedTask
## How was this patch tested?
Implemented a unit test that verifies the task is black listed before it is added to the pending task. Ran the unit test without the fix and it fails. Ran the unit test with the fix and it passes.
Please review http://spark.apache.org/contributing.html before opening a pull request.
Author: Eric Vandenberg <ericvandenberg@fb.com>
Closes#18427 from ericvandenbergfb/blacklistFix.
## What changes were proposed in this pull request?
In current code, reducer can break the old shuffle service when `spark.reducer.maxReqSizeShuffleToMem` is enabled. Let's refine document.
Author: jinxing <jinxing6042@126.com>
Closes#18566 from jinxing64/SPARK-21343.
This makes sures that listeners get updated task information; otherwise it's
possible to write incomplete task information into event logs, for example,
making the information in a replayed UI inconsistent with the original
application.
Added a new unit test to try to detect the problem, but it's not guaranteed
to fail since it's a race; but it fails pretty reliably for me without the
scheduler changes.
Author: Marcelo Vanzin <vanzin@cloudera.com>
Closes#18393 from vanzin/SPARK-20342.try2.
This change exposes the internal code path in SparkConf that allows
configs to be read with variable substitution applied, and uses that
new method in SSLOptions so that SSL configs can reference other
variables, and more importantly, environment variables, providing
a secure way to provide passwords to Spark when using SSL.
The approach is a little bit hacky, but is the smallest change possible.
Otherwise, the concept of "namespaced configs" would have to be added
to the config system, which would create a lot of noise for not much
gain at this point.
Tested with added unit tests, and on a real cluster with SSL enabled.
Author: Marcelo Vanzin <vanzin@cloudera.com>
Closes#18394 from vanzin/SPARK-20379.try2.
## What changes were proposed in this pull request?
Some dropped event will make the whole application behaves unexpectedly, e.g. some UI problem...we shall log the dropped event name to facilitate the debugging
## How was this patch tested?
Existing tests
Author: CodingCat <zhunansjtu@gmail.com>
Closes#16697 from CodingCat/SPARK-19358.
## What changes were proposed in this pull request?
Currently `ArrayConstructor` handles an array of typecode `'l'` as `int` when converting Python object in Python 2 into Java object, so if the value is larger than `Integer.MAX_VALUE` or smaller than `Integer.MIN_VALUE` then the overflow occurs.
```python
import array
data = [Row(longarray=array.array('l', [-9223372036854775808, 0, 9223372036854775807]))]
df = spark.createDataFrame(data)
df.show(truncate=False)
```
```
+----------+
|longarray |
+----------+
|[0, 0, -1]|
+----------+
```
This should be:
```
+----------------------------------------------+
|longarray |
+----------------------------------------------+
|[-9223372036854775808, 0, 9223372036854775807]|
+----------------------------------------------+
```
## How was this patch tested?
Added a test and existing tests.
Author: Takuya UESHIN <ueshin@databricks.com>
Closes#18553 from ueshin/issues/SPARK-21327.
## What changes were proposed in this pull request?
This PR Improvement in two:
1.With spark.shuffle.spill.diskWriteBufferSize configure diskWriteBufferSize of ShuffleExternalSorter.
when change the size of the diskWriteBufferSize to test `forceSorterToSpill`
The average performance of running 10 times is as follows:(their unit is MS).
```
diskWriteBufferSize: 1M 512K 256K 128K 64K 32K 16K 8K 4K
---------------------------------------------------------------------------------------
RecordSize = 2.5M 742 722 694 686 667 668 671 669 683
RecordSize = 1M 294 293 292 287 283 285 281 279 285
```
2.Remove outputBufferSizeInBytes and inputBufferSizeInBytes to initialize in mergeSpillsWithFileStream function.
## How was this patch tested?
The unit test.
Author: caoxuewen <cao.xuewen@zte.com.cn>
Closes#18174 from heary-cao/buffersize.
## What changes were proposed in this pull request?
Right now in the UI, after SPARK-20213, we can show the operations to write data out. However, there is no way to associate metrics with data writes. We should show relative metrics on the operations.
#### Supported commands
This change supports updating metrics for file-based data writing operations, including `InsertIntoHadoopFsRelationCommand`, `InsertIntoHiveTable`.
Supported metrics:
* number of written files
* number of dynamic partitions
* total bytes of written data
* total number of output rows
* average writing data out time (ms)
* (TODO) min/med/max number of output rows per file/partition
* (TODO) min/med/max bytes of written data per file/partition
#### Commands not supported
`InsertIntoDataSourceCommand`, `SaveIntoDataSourceCommand`:
The two commands uses DataSource APIs to write data out, i.e., the logic of writing data out is delegated to the DataSource implementations, such as `InsertableRelation.insert` and `CreatableRelationProvider.createRelation`. So we can't obtain metrics from delegated methods for now.
`CreateHiveTableAsSelectCommand`, `CreateDataSourceTableAsSelectCommand` :
The two commands invokes other commands to write data out. The invoked commands can even write to non file-based data source. We leave them as future TODO.
#### How to update metrics of writing files out
A `RunnableCommand` which wants to update metrics, needs to override its `metrics` and provide the metrics data structure to `ExecutedCommandExec`.
The metrics are prepared during the execution of `FileFormatWriter`. The callback function passed to `FileFormatWriter` will accept the metrics and update accordingly.
There is a metrics updating function in `RunnableCommand`. In runtime, the function will be bound to the spark context and `metrics` of `ExecutedCommandExec` and pass to `FileFormatWriter`.
## How was this patch tested?
Updated unit tests.
Author: Liang-Chi Hsieh <viirya@gmail.com>
Closes#18159 from viirya/SPARK-20703-2.
Current "--jars (spark.jars)", "--files (spark.files)", "--py-files (spark.submit.pyFiles)" and "--archives (spark.yarn.dist.archives)" only support non-glob path. This is OK for most of the cases, but when user requires to add more jars, files into Spark, it is too verbose to list one by one. So here propose to add glob path support for resources.
Also improving the code of downloading resources.
## How was this patch tested?
UT added, also verified manually in local cluster.
Author: jerryshao <sshao@hortonworks.com>
Closes#18235 from jerryshao/SPARK-21012.
## What changes were proposed in this pull request?
This PR uses `runUninterruptibly` to avoid that the clean up codes in StreamExecution is interrupted. It also removes an optimization in `runUninterruptibly` to make sure this method never throw `InterruptedException`.
## How was this patch tested?
Jenkins
Author: Shixiong Zhu <shixiong@databricks.com>
Closes#18461 from zsxwing/SPARK-21248.
## What changes were proposed in this pull request?
This PR aims to bump Py4J in order to fix the following float/double bug.
Py4J 0.10.5 fixes this (https://github.com/bartdag/py4j/issues/272) and the latest Py4J is 0.10.6.
**BEFORE**
```
>>> df = spark.range(1)
>>> df.select(df['id'] + 17.133574204226083).show()
+--------------------+
|(id + 17.1335742042)|
+--------------------+
| 17.1335742042|
+--------------------+
```
**AFTER**
```
>>> df = spark.range(1)
>>> df.select(df['id'] + 17.133574204226083).show()
+-------------------------+
|(id + 17.133574204226083)|
+-------------------------+
| 17.133574204226083|
+-------------------------+
```
## How was this patch tested?
Manual.
Author: Dongjoon Hyun <dongjoon@apache.org>
Closes#18546 from dongjoon-hyun/SPARK-21278.
## What changes were proposed in this pull request?
The old unit test not effect
## How was this patch tested?
unit test
Author: he.qiao <he.qiao17@zte.com.cn>
Closes#18511 from Geek-He/dev_0703.
## What changes were proposed in this pull request?
Currently, running the shell below:
```bash
$ ./bin/spark-submit tmp.R a b c
```
with R file, `tmp.R` as below:
```r
#!/usr/bin/env Rscript
library(SparkR)
sparkRSQL.init(sparkR.init(master = "local"))
collect(createDataFrame(list(list(1))))
print(commandArgs(trailingOnly = TRUE))
```
working fine as below:
```bash
_1
1 1
[1] "a" "b" "c"
```
However, it looks not printed in usage documentation as below:
```bash
$ ./bin/spark-submit
```
```
Usage: spark-submit [options] <app jar | python file> [app arguments]
...
```
For `./bin/sparkR`, it looks fine as below:
```bash
$ ./bin/sparkR tmp.R
```
```
Running R applications through 'sparkR' is not supported as of Spark 2.0.
Use ./bin/spark-submit <R file>
```
Running the script below:
```bash
$ ./bin/spark-submit
```
**Before**
```
Usage: spark-submit [options] <app jar | python file> [app arguments]
...
```
**After**
```
Usage: spark-submit [options] <app jar | python file | R file> [app arguments]
...
```
## How was this patch tested?
Manually tested.
Author: hyukjinkwon <gurwls223@gmail.com>
Closes#18505 from HyukjinKwon/minor-doc-summit.
## What changes were proposed in this pull request?
`FileAppender` is used to write `stderr` and `stdout` files in `ExecutorRunner`, But before writing `ErrorStream` into the the `stderr` file, the header information has been written into ,if FileOutputStream is not created as append mode, the header information will be lost
## How was this patch tested?
unit test case
Author: liuxian <liu.xian3@zte.com.cn>
Closes#18507 from 10110346/wip-lx-0703.
## What changes were proposed in this pull request?
Parallelize FileInputFormat.listStatus in Hadoop API via LIST_STATUS_NUM_THREADS to speed up examination of file sizes for wholeTextFiles et al
## How was this patch tested?
Existing tests, which will exercise the key path here: using a local file system.
Author: Sean Owen <sowen@cloudera.com>
Closes#18441 from srowen/SPARK-21137.
## What changes were proposed in this pull request?
Not adding the exception to the suppressed if it is the same instance as originalThrowable.
## How was this patch tested?
Added new tests to verify this, these tests fail without source code changes and passes with the change.
Author: Devaraj K <devaraj@apache.org>
Closes#18384 from devaraj-kavali/SPARK-21170.
## What changes were proposed in this pull request?
Remove `numHashCollisions` in `BytesToBytesMap`. And change `getAverageProbesPerLookup()` to `getAverageProbesPerLookup` as suggested.
## How was this patch tested?
Existing tests.
Author: Liang-Chi Hsieh <viirya@gmail.com>
Closes#18480 from viirya/SPARK-21052-followup.
# What issue does this PR address ?
Jira:https://issues.apache.org/jira/browse/SPARK-21223
fix the Thread-safety issue in FsHistoryProvider
Currently, Spark HistoryServer use a HashMap named fileToAppInfo in class FsHistoryProvider to store the map of eventlog path and attemptInfo.
When use ThreadPool to Replay the log files in the list and merge the list of old applications with new ones, multi thread may update fileToAppInfo at the same time, which may cause Thread-safety issues, such as falling into an infinite loop because of calling resize func of the hashtable.
Author: 曾林西 <zenglinxi@meituan.com>
Closes#18430 from zenglinxi0615/master.
## What changes were proposed in this pull request?
This PR makes the following changes:
- Implement a new commit protocol `HadoopMapRedCommitProtocol` which support the old `mapred` package's committer;
- Refactor SparkHadoopWriter and SparkHadoopMapReduceWriter, now they are combined together, thus we can support write through both mapred and mapreduce API by the new SparkHadoopWriter, a lot of duplicated codes are removed.
After this change, it should be pretty easy for us to support the committer from both the new and the old hadoop API at high level.
## How was this patch tested?
No major behavior change, passed the existing test cases.
Author: Xingbo Jiang <xingbo.jiang@databricks.com>
Closes#18438 from jiangxb1987/SparkHadoopWriter.
## What changes were proposed in this pull request?
Please see also https://issues.apache.org/jira/browse/SPARK-21176
This change limits the number of selector threads that jetty creates to maximum 8 per proxy servlet (Jetty default is number of processors / 2).
The newHttpClient for Jettys ProxyServlet class is overwritten to avoid the Jetty defaults (which are designed for high-performance http servers).
Once https://github.com/eclipse/jetty.project/issues/1643 is available, the code could be cleaned up to avoid the method override.
I really need this on v2.1.1 - what is the best way for a backport automatic merge works fine)? Shall I create another PR?
## How was this patch tested?
(Please explain how this patch was tested. E.g. unit tests, integration tests, manual tests)
The patch was tested manually on a Spark cluster with a head node that has 88 processors using JMX to verify that the number of selector threads is now limited to 8 per proxy.
gurvindersingh zsxwing can you please review the change?
Author: IngoSchuster <ingo.schuster@de.ibm.com>
Author: Ingo Schuster <ingo.schuster@de.ibm.com>
Closes#18437 from IngoSchuster/master.
## What changes were proposed in this pull request?
Disable spark.reducer.maxReqSizeShuffleToMem because it breaks the old shuffle service.
Credits to wangyum
Closes#18466
## How was this patch tested?
Jenkins
Author: Shixiong Zhu <shixiong@databricks.com>
Author: Yuming Wang <wgyumg@gmail.com>
Closes#18467 from zsxwing/SPARK-21253.
## What changes were proposed in this pull request?
Since the objects `readLocksByTask`, `writeLocksByTask` and `info`s are coupled and supposed to be modified by other threads concurrently, all the read and writes of them in the method `releaseAllLocksForTask` should be protected by a single synchronized block like other similar methods.
## How was this patch tested?
existing tests
Author: Feng Liu <fengliu@databricks.com>
Closes#18400 from liufengdb/synchronize.
JIRA Issue:https://issues.apache.org/jira/browse/SPARK-21225
In the function "resourceOffers", It declare a variable "tasks" for storage the tasks which have allocated a executor. It declared like this:
`val tasks = shuffledOffers.map(o => new ArrayBuffer[TaskDescription](o.cores))`
But, I think this code only conside a situation for that one task per core. If the user set "spark.task.cpus" as 2 or 3, It really don't need so much Mem. I think It can motify as follow:
val tasks = shuffledOffers.map(o => new ArrayBuffer[TaskDescription](o.cores / CPUS_PER_TASK))
to instead.
Motify like this the other earning is that it's more easy to understand the way how the tasks allocate offers.
Author: 杨治国10192065 <yang.zhiguo@zte.com.cn>
Closes#18435 from JackYangzg/motifyTaskCoreDisp.
## What changes were proposed in this pull request?
Hide duration of incompleted applications.
## How was this patch tested?
manual tests
Author: fjh100456 <fu.jinhua6@zte.com.cn>
Closes#18351 from fjh100456/master.
## What changes were proposed in this pull request?
Same with SPARK-20985.
Fix code style for constructing and stopping a `SparkContext`. Assure the context is stopped to avoid other tests complain that there's only one `SparkContext` can exist.
Author: jinxing <jinxing6042@126.com>
Closes#18454 from jinxing64/SPARK-21240.