https://issues.apache.org/jira/browse/SPARK-13227
It might confuse the future developers when they use OpenHashMap.apply() with a numeric value type.
null.asInstance[Int], null.asInstance[Long], null.asInstace[Float] and null.asInstance[Double] will return 0/0.0/0L, which might confuse the developer if the value set contains 0/0.0/0L with an existing key
The current patch only adds the comments describing the issue, with the respect to apply the minimum changes to the code base
The more direct, yet more aggressive, approach is use Option as the return type
andrewor14 JoshRosen any thoughts about how to avoid the potential issue?
Author: CodingCat <zhunansjtu@gmail.com>
Closes#11107 from CodingCat/SPARK-13227.
## What changes were proposed in this pull request?
This PR is a follow up for https://github.com/apache/spark/pull/12417, now we always track input/output/shuffle metrics in spark JSON protocol and status API.
Most of the line changes are because of re-generating the gold answer for `HistoryServerSuite`, and we add a lot of 0 values for read/write metrics.
## How was this patch tested?
existing tests.
Author: Wenchen Fan <wenchen@databricks.com>
Closes#12462 from cloud-fan/follow.
## What changes were proposed in this pull request?
When there are multiple tests running, "NettyBlockTransferServiceSuite.can bind to a specific port twice and the second increments" may fail.
E.g., assume there are 2 tests running. Here are the execution order to reproduce the test failure.
| Execution Order | Test 1 | Test 2 |
| ------------- | ------------- | ------------- |
| 1 | service0 binds to 17634 | |
| 2 | | service0 binds to 17635 (17634 is occupied) |
| 3 | service1 binds to 17636 | |
| 4 | pass test | |
| 5 | service0.close (release 17634) | |
| 6 | | service1 binds to 17634 |
| 7 | | `service1.port should be (service0.port + 1)` fails (17634 != 17635 + 1) |
Here is an example in Jenkins: https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test/job/spark-master-test-maven-hadoop-2.2/786/testReport/junit/org.apache.spark.network.netty/NettyBlockTransferServiceSuite/can_bind_to_a_specific_port_twice_and_the_second_increments/
This PR makes two changes:
- Use a random port between 17634 and 27634 to reduce the possibility of port conflicts.
- Make `service1` use `service0.port` to bind to avoid the above race condition.
## How was this patch tested?
Jenkins unit tests.
Author: Shixiong Zhu <shixiong@databricks.com>
Closes#12477 from zsxwing/SPARK-14713.
## What changes were proposed in this pull request?
This commit adds support for pluggable cluster manager. And also allows a cluster manager to clean up tasks without taking the parent process down.
To plug a new external cluster manager, ExternalClusterManager trait should be implemented. It returns task scheduler and backend scheduler that will be used by SparkContext to schedule tasks. An external cluster manager is registered using the java.util.ServiceLoader mechanism (This mechanism is also being used to register data sources like parquet, json, jdbc etc.). This allows auto-loading implementations of ExternalClusterManager interface.
Currently, when a driver fails, executors exit using system.exit. This does not bode well for cluster managers that would like to reuse the parent process of an executor. Hence,
1. Moving system.exit to a function that can be overriden in subclasses of CoarseGrainedExecutorBackend.
2. Added functionality of killing all the running tasks in an executor.
## How was this patch tested?
ExternalClusterManagerSuite.scala was added to test this patch.
Author: Hemant Bhanawat <hemant@snappydata.io>
Closes#11723 from hbhanawat/pluggableScheduler.
## What changes were proposed in this pull request?
This PR removes
- Inappropriate type notations
For example, from
```scala
words.foreachRDD { (rdd: RDD[String], time: Time) =>
...
```
to
```scala
words.foreachRDD { (rdd, time) =>
...
```
- Extra anonymous closure within functional transformations.
For example,
```scala
.map(item => {
...
})
```
which can be just simply as below:
```scala
.map { item =>
...
}
```
and corrects some obvious style nits.
## How was this patch tested?
This was tested after adding rules in `scalastyle-config.xml`, which ended up with not finding all perfectly.
The rules applied were below:
- For the first correction,
```xml
<check customId="NoExtraClosure" level="error" class="org.scalastyle.file.RegexChecker" enabled="true">
<parameters><parameter name="regex">(?m)\.[a-zA-Z_][a-zA-Z0-9]*\(\s*[^,]+s*=>\s*\{[^\}]+\}\s*\)</parameter></parameters>
</check>
```
```xml
<check customId="NoExtraClosure" level="error" class="org.scalastyle.file.RegexChecker" enabled="true">
<parameters><parameter name="regex">\.[a-zA-Z_][a-zA-Z0-9]*\s*[\{|\(]([^\n>,]+=>)?\s*\{([^()]|(?R))*\}^[,]</parameter></parameters>
</check>
```
- For the second correction
```xml
<check customId="TypeNotation" level="error" class="org.scalastyle.file.RegexChecker" enabled="true">
<parameters><parameter name="regex">\.[a-zA-Z_][a-zA-Z0-9]*\s*[\{|\(]\s*\([^):]*:R))*\}^[,]</parameter></parameters>
</check>
```
**Those rules were not added**
Author: hyukjinkwon <gurwls223@gmail.com>
Closes#12413 from HyukjinKwon/SPARK-style.
## What changes were proposed in this pull request?
Part of the reason why TaskMetrics and its callers are complicated are due to the optional metrics we collect, including input, output, shuffle read, and shuffle write. I think we can always track them and just assign 0 as the initial values. It is usually very obvious whether a task is supposed to read any data or not. By always tracking them, we can remove a lot of map, foreach, flatMap, getOrElse(0L) calls throughout Spark.
This patch also changes a few behaviors.
1. Removed the distinction of data read/write methods (e.g. Hadoop, Memory, Network, etc).
2. Accumulate all data reads and writes, rather than only the first method. (Fixes SPARK-5225)
## How was this patch tested?
existing tests.
This is bases on https://github.com/apache/spark/pull/12388, with more test fixes.
Author: Reynold Xin <rxin@databricks.com>
Author: Wenchen Fan <wenchen@databricks.com>
Closes#12417 from cloud-fan/metrics-refactor.
## What changes were proposed in this pull request?
Round memory bytes and convert it to Long to it’s original type. This change fixes the formatting issue in the Exception message.
## How was this patch tested?
Manual tests were done in CDH cluster.
Author: Peter Ableda <peter.ableda@cloudera.com>
Closes#12392 from peterableda/SPARK-14633.
## What changes were proposed in this pull request?
Removing references to assembly jar in documentation.
Adding an additional (previously undocumented) usage of spark-submit to run examples.
## How was this patch tested?
Ran spark-submit usage to ensure formatting was fine. Ran examples using SparkSubmit.
Author: Mark Grover <mark@apache.org>
Closes#12365 from markgrover/spark-14601.
## What changes were proposed in this pull request?
When we clean a closure, if its outermost parent is not a closure, we won't clone and clean it as cloning user's objects is dangerous. However, if it's a REPL line object, which may carry a lot of unnecessary references(like hadoop conf, spark conf, etc.), we should clean it as it's not a user object.
This PR improves the check for user's objects to exclude REPL line object.
## How was this patch tested?
existing tests.
Author: Wenchen Fan <wenchen@databricks.com>
Closes#12327 from cloud-fan/closure.
## What changes were proposed in this pull request?
This patch removes some of the deprecated APIs in TaskMetrics. This is part of my bigger effort to simplify accumulators and task metrics.
## How was this patch tested?
N/A - only removals
Author: Reynold Xin <rxin@databricks.com>
Closes#12375 from rxin/SPARK-14617.
## What changes were proposed in this pull request?
When there are multiple attempts for a stage, we currently only reset internal accumulator values if all the tasks are resubmitted. It would make more sense to reset the accumulator values for each stage attempt. This will allow us to eventually get rid of the internal flag in the Accumulator class. This is part of my bigger effort to simplify accumulators and task metrics.
## How was this patch tested?
Covered by existing tests.
Author: Reynold Xin <rxin@databricks.com>
Closes#12378 from rxin/SPARK-14619.
## What changes were proposed in this pull request?
Move json4s, breeze dependency declaration into parent
## How was this patch tested?
Should be no functional change, but Jenkins tests will test that.
Author: Sean Owen <sowen@cloudera.com>
Closes#12390 from srowen/SPARK-14612.
## What changes were proposed in this pull request?
Currently many public abstract methods (in abstract classes as well as traits) don't declare return types explicitly, such as in [o.a.s.streaming.dstream.InputDStream](https://github.com/apache/spark/blob/master/streaming/src/main/scala/org/apache/spark/streaming/dstream/InputDStream.scala#L110):
```scala
def start() // should be: def start(): Unit
def stop() // should be: def stop(): Unit
```
These methods exist in core, sql, streaming; this PR fixes them.
## How was this patch tested?
N/A
## Which piece of scala style rule led to the changes?
the rule was added separately in https://github.com/apache/spark/pull/12396
Author: Liwei Lin <lwlin7@gmail.com>
Closes#12389 from lw-lin/public-abstract-methods.
## What changes were proposed in this pull request?
I was trying to understand the accumulator and metrics update source code and these two classes don't really need to be case classes. It would also be more consistent with other UI classes if they are not case classes. This is part of my bigger effort to simplify accumulators and task metrics.
## How was this patch tested?
This is a straightforward refactoring without behavior change.
Author: Reynold Xin <rxin@databricks.com>
Closes#12386 from rxin/SPARK-14625.
## What changes were proposed in this pull request?
This PR removes extra anonymous closure within functional transformations.
For example,
```scala
.map(item => {
...
})
```
which can be just simply as below:
```scala
.map { item =>
...
}
```
## How was this patch tested?
Related unit tests and `sbt scalastyle`.
Author: hyukjinkwon <gurwls223@gmail.com>
Closes#12382 from HyukjinKwon/minor-extra-closers.
## What changes were proposed in this pull request?
Old `HadoopFsRelation` API includes `buildInternalScan()` which uses `SqlNewHadoopRDD` in `ParquetRelation`.
Because now the old API is removed, `SqlNewHadoopRDD` is not used anymore.
So, this PR removes `SqlNewHadoopRDD` and several unused imports.
This was discussed in https://github.com/apache/spark/pull/12326.
## How was this patch tested?
Several related existing unit tests and `sbt scalastyle`.
Author: hyukjinkwon <gurwls223@gmail.com>
Closes#12354 from HyukjinKwon/SPARK-14596.
This patch makes the postStartHook throw an IllegalStateException if the SparkContext is shutdown while it is waiting for the backend to be ready
Author: Charles Allen <charles@allen-net.com>
Closes#12301 from drcrallen/SPARK-14537.
## What changes were proposed in this pull request?
- updated `OFF_HEAP` semantics for `StorageLevels.java`
- updated `OFF_HEAP` semantics for `storagelevel.py`
## How was this patch tested?
no need to test
Author: Liwei Lin <lwlin7@gmail.com>
Closes#12126 from lw-lin/storagelevel.py.
## What changes were proposed in this pull request?
Fix memory leak in the Sorter. When the UnsafeExternalSorter spills the data to disk, it does not free up the underlying pointer array. As a result, we see a lot of executor OOM and also memory under utilization.
This is a regression partially introduced in PR https://github.com/apache/spark/pull/9241
## How was this patch tested?
Tested by running a job and observed around 30% speedup after this change.
Author: Sital Kedia <skedia@fb.com>
Closes#12285 from sitalkedia/executor_oom.
## What changes were proposed in this pull request?
This PR improve the performance of SQL UI by:
1) remove the details column in all executions page (the first page in SQL tab). We can check the details by enter the execution page.
2) break-all is super slow in Chrome recently, so switch to break-word.
3) Using "display: none" to hide a block.
4) using one js closure for for all the executions, not one for each.
5) remove the height limitation of details, don't need to scroll it in the tiny window.
## How was this patch tested?
Exists tests.
![ui](https://cloud.githubusercontent.com/assets/40902/14445712/68d7b258-0004-11e6-9b48-5d329b05d165.png)
Author: Davies Liu <davies@databricks.com>
Closes#12311 from davies/ui_perf.
## What changes were proposed in this pull request?
Shutting down `QueuedThreadPool` used by Jetty `Server` to avoid threads leakage after SparkContext is stopped.
Note: If this fix is going to apply to the `branch-1.6`, one more patch on the `NettyRpcEnv` class is needed so that the `NettyRpcEnv._fileServer.shutdown` is called in the `NettyRpcEnv.cleanup` method. This is due to the removal of `_fileServer` field in the `NettyRpcEnv` class in the master branch. Please advice if a second PR is necessary for bring this fix back to `branch-1.6`
## How was this patch tested?
Ran the ./dev/run-tests locally
Author: Terence Yim <terence@cask.co>
Closes#12318 from chtyim/fixes/SPARK-14513-thread-leak.
## What changes were proposed in this pull request?
According to the [Spark Code Style Guide](https://cwiki.apache.org/confluence/display/SPARK/Spark+Code+Style+Guide) and [Scala Style Guide](http://docs.scala-lang.org/style/control-structures.html#curlybraces), we had better enforce the following rule.
```
case: Always omit braces in case clauses.
```
This PR makes a new ScalaStyle rule, 'OmitBracesInCase', and enforces it to the code.
## How was this patch tested?
Pass the Jenkins tests (including Scala style checking)
Author: Dongjoon Hyun <dongjoon@apache.org>
Closes#12280 from dongjoon-hyun/SPARK-14508.
## What changes were proposed in this pull request?
This adds a new API call `TaskContext.getLocalProperty` for getting properties set in the driver from executors. These local properties are automatically propagated from the driver to executors. For streaming, the context for streaming tasks will be the initial driver context when ssc.start() is called.
## How was this patch tested?
Unit tests.
cc JoshRosen
Author: Eric Liang <ekl@databricks.com>
Closes#12248 from ericl/sc-2813.
## What changes were proposed in this pull request?
When deciding whether a CommitDeniedException caused a task to fail, consider the root cause of the Exception.
## How was this patch tested?
Added a test suite for the component that extracts the root cause of the error.
Made a distribution after cherry-picking this commit to branch-1.6 and used to run our Spark application that would quite often fail due to the CommitDeniedException.
Author: Jason Moore <jasonmoore2k@outlook.com>
Closes#12228 from jasonmoore2k/SPARK-14357.
## What changes were proposed in this pull request?
Currently, `checkstyle` is configured to check the files under `src/main/java`. However, Spark has Java files in `src/main/scala`, too. This PR fixes the following configuration in `pom.xml` and the unchecked-so-far violations on those files.
```xml
-<sourceDirectory>${basedir}/src/main/java</sourceDirectory>
+<sourceDirectories>${basedir}/src/main/java,${basedir}/src/main/scala</sourceDirectories>
```
## How was this patch tested?
After passing the Jenkins build and manually `dev/lint-java`. (Note that Jenkins does not run `lint-java`)
Author: Dongjoon Hyun <dongjoon@apache.org>
Closes#12242 from dongjoon-hyun/SPARK-14465.
## What changes were proposed in this pull request?
Currently, we use java HashMap for HashedRelation if the key could fit within a Long. The java HashMap and CompactBuffer are not memory efficient, the memory used by them is also accounted accurately.
This PR introduce a LongToUnsafeRowMap (similar to BytesToBytesMap) for better memory efficiency and performance.
This PR reopen#12190 to fix bugs.
## How was this patch tested?
Existing tests.
Author: Davies Liu <davies@databricks.com>
Closes#12278 from davies/long_map3.
## What changes were proposed in this pull request?
This patch adds support for better handling of exceptions inside catch blocks if the code within the block throws an exception. For instance here is the code in a catch block before this change in `WriterContainer.scala`:
```scala
logError("Aborting task.", cause)
// call failure callbacks first, so we could have a chance to cleanup the writer.
TaskContext.get().asInstanceOf[TaskContextImpl].markTaskFailed(cause)
if (currentWriter != null) {
currentWriter.close()
}
abortTask()
throw new SparkException("Task failed while writing rows.", cause)
```
If `markTaskFailed` or `currentWriter.close` throws an exception, we currently lose the original cause. This PR fixes this problem by implementing a utility function `Utils.tryWithSafeCatch` that suppresses (`Throwable.addSuppressed`) the exception that are thrown within the catch block and rethrowing the original exception.
## How was this patch tested?
No new functionality added
Author: Sameer Agarwal <sameer@databricks.com>
Closes#12234 from sameeragarwal/fix-exception.
## What changes were proposed in this pull request?
Here is why SPARK-14437 happens:
BlockManagerId is created using NettyBlockTransferService.hostName which comes from `customHostname`. And `Executor` will set `customHostname` to the hostname which is detected by the driver. However, the driver may not be able to detect the correct address in some complicated network (Netty's Channel.remoteAddress doesn't always return a connectable address). In such case, `BlockManagerId` will be created using a wrong hostname.
To fix this issue, this PR uses `hostname` provided by `SparkEnv.create` to create `NettyBlockTransferService` and set `NettyBlockTransferService.hostname` to this one directly. A bonus of this approach is NettyBlockTransferService won't bound to `0.0.0.0` which is much safer.
## How was this patch tested?
Manually checked the bound address using local-cluster.
Author: Shixiong Zhu <shixiong@databricks.com>
Closes#12240 from zsxwing/SPARK-14437.
Currently all `SparkFirehoseListener` implementations are broken since we expect listeners to extend `SparkListener`, while the fire hose only extends `SparkListenerInterface`. This changes the addListener function and the config based injection to use the interface instead.
The existing tests in SparkListenerSuite are improved such that they would have caught this.
Follow-up to #12142
Author: Michael Armbrust <michael@databricks.com>
Closes#12227 from marmbrus/fixListener.
## What changes were proposed in this pull request?
`OutputCommitCoordinator` was introduced to deal with concurrent task attempts racing to write output, leading to data loss or corruption. For more detail, read the [JIRA description](https://issues.apache.org/jira/browse/SPARK-14468).
Before: `OutputCommitCoordinator` is enabled only if speculation is enabled.
After: `OutputCommitCoordinator` is always enabled.
Users may still disable this through `spark.hadoop.outputCommitCoordination.enabled`, but they really shouldn't...
## How was this patch tested?
`OutputCommitCoordinator*Suite`
Author: Andrew Or <andrew@databricks.com>
Closes#12244 from andrewor14/always-occ.
## What changes were proposed in this pull request?
Currently Spark clients are started with the same memory setting for Xms and Xms leading to reserving unnecessary higher amounts of memory.
This behavior is changed and the clients can now specify an initial heap size using the extraJavaOptions in the config for driver,executor and am individually.
Note, that only -Xms can be provided through this config option, if the client wants to set the max size(-Xmx), this has to be done via the *.memory configuration knobs which are currently supported.
## How was this patch tested?
Monitored executor and yarn logs in debug mode to verify the commands through which they are being launched in client and cluster mode. The driver memory was verified locally using jps -v. Setting up -Xmx parameter in the javaExtraOptions raises exception with the info provided.
Author: Dhruve Ashar <dhruveashar@gmail.com>
Closes#12115 from dhruve/impr/SPARK-12384.
## What changes were proposed in this pull request?
The Spark UI (both active and history) should show the user who ran the application somewhere when you are in the application view. This was added under the Jobs view by total uptime and scheduler mode.
## How was this patch tested?
Manual testing
<img width="191" alt="username" src="https://cloud.githubusercontent.com/assets/13952758/14222830/6d1fe542-f82a-11e5-885f-c05ee2cdf857.png">
Author: Alex Bozarth <ajbozart@us.ibm.com>
Closes#12123 from ajbozarth/spark14245.
The current package name uses a dash, which is a little weird but seemed
to work. That is, until a new test tried to mock a class that references
one of those shaded types, and then things started failing.
Most changes are just noise to fix the logging configs.
For reference, SPARK-8815 also raised this issue, although at the time it
did not cause any issues in Spark, so it was not addressed.
Author: Marcelo Vanzin <vanzin@cloudera.com>
Closes#11941 from vanzin/SPARK-14134.
## What changes were proposed in this pull request?
Send `RegisterExecutorResponse` using `executorRef` in order to make sure RegisterExecutorResponse and LaunchTask are both sent using the same channel. Then RegisterExecutorResponse will always arrive before LaunchTask
## How was this patch tested?
Existing unit tests
Closes#12078
Author: Shixiong Zhu <shixiong@databricks.com>
Closes#12211 from zsxwing/SPARK-13112.
## What changes were proposed in this pull request?
According to the [Spark Code Style Guide](https://cwiki.apache.org/confluence/display/SPARK/Spark+Code+Style+Guide#SparkCodeStyleGuide-Indentation), this PR adds a new scalastyle rule to prevent the followings.
```
/** In Spark, we don't use the ScalaDoc style so this
* is not correct.
*/
```
## How was this patch tested?
Pass the Jenkins tests (including `lint-scala`).
Author: Dongjoon Hyun <dongjoon@apache.org>
Closes#12221 from dongjoon-hyun/SPARK-14444.
## What changes were proposed in this pull request?
Added a new Executor Allocation Manager for the Streaming scheduler for doing Streaming Dynamic Allocation.
## How was this patch tested
Unit tests, and cluster tests.
Author: Tathagata Das <tathagata.das1565@gmail.com>
Closes#12154 from tdas/streaming-dynamic-allocation.
## What changes were proposed in this pull request?
As mentioned in the ticket this was because one get path in the refactored `BlockManager` did not check for remote storage.
## How was this patch tested?
Unit test, also verified manually with reproduction in the ticket.
cc JoshRosen
Author: Eric Liang <ekl@databricks.com>
Closes#12193 from ericl/spark-14252.
## What changes were proposed in this pull request?
While I was reviewing #12078, I found most of CoarseGrainedSchedulerBackend's mutable fields doesn't have any comments about the thread-safe assumptions and it's hard for people to figure out which part of codes should be protected by the lock. This PR just added comments/annotations for them and also added strict access modifiers for some fields.
## How was this patch tested?
Existing unit tests.
Author: Shixiong Zhu <shixiong@databricks.com>
Closes#12188 from zsxwing/comments.
Because SQL keeps track of all known configs, some customization was
needed in SQLConf to allow that, since the core API does not have that
feature.
Tested via existing (and slightly updated) unit tests.
Author: Marcelo Vanzin <vanzin@cloudera.com>
Closes#11570 from vanzin/SPARK-529-sql.
## What changes were proposed in this pull request?
In `LogPage`, the content to be rendered is defined as follows.
```
val content =
<html>
<body>
{linkToMaster}
<div>
<div style="float:left; margin-right:10px">{backButton}</div>
<div style="float:left;">{range}</div>
<div style="float:right; margin-left:10px">{nextButton}</div>
</div>
<br />
<div style="height:500px; overflow:auto; padding:5px;">
<pre>{logText}</pre>
</div>
</body>
</html>
UIUtils.basicSparkPage(content, logType + " log page for " + pageName)
```
As you can see, <html> and <body> tags will be rendered.
On the other hand, `UIUtils.basicSparkPage` will render those tags so those tags will be nested.
```
def basicSparkPage(
content: => Seq[Node],
title: String,
useDataTables: Boolean = false): Seq[Node] = {
<html>
<head>
{commonHeaderNodes}
{if (useDataTables) dataTablesHeaderNodes else Seq.empty}
<title>{title}</title>
</head>
<body>
<div class="container-fluid">
<div class="row-fluid">
<div class="span12">
<h3 style="vertical-align: middle; display: inline-block;">
<a style="text-decoration: none" href={prependBaseUri("/")}>
<img src={prependBaseUri("/static/spark-logo-77x50px-hd.png")} />
<span class="version"
style="margin-right: 15px;">{org.apache.spark.SPARK_VERSION}</span>
</a>
{title}
</h3>
</div>
</div>
{content}
</div>
</body>
</html>
}
```
These are the screen shots before this patch is applied.
![before1](https://cloud.githubusercontent.com/assets/4736016/14273236/03cbed8a-fb44-11e5-8786-bc1bfa4d3f8c.png)
![before2](https://cloud.githubusercontent.com/assets/4736016/14273237/03d1741c-fb44-11e5-9dee-ea93022033a6.png)
And these are the ones after this patch is applied.
![after1](https://cloud.githubusercontent.com/assets/4736016/14273248/1b6a7d8a-fb44-11e5-8a3b-69964f3434f6.png)
![after2](https://cloud.githubusercontent.com/assets/4736016/14273249/1b6b9c38-fb44-11e5-9d6f-281d64c842e4.png)
The appearance is not changed but the html source code is changed.
## How was this patch tested?
Manually run some jobs on my standalone-cluster and check the WebUI.
Author: Kousuke Saruta <sarutak@oss.nttdata.co.jp>
Closes#12170 from sarutak/SPARK-14397.
Use PartitionerAwareUnionRDD when possbile for optimizing shuffling and
preserving the partitioner.
Author: Guillaume Poulin <poulin.guillaume@gmail.com>
Closes#10382 from gpoulin/dstream_union_optimisation.
This change modifies the "assembly/" module to just copy needed
dependencies to its build directory, and modifies the packaging
script to pick those up (and remove duplicate jars packages in the
examples module).
I also made some minor adjustments to dependencies to remove some
test jars from the final packaging, and remove jars that conflict with each
other when packaged separately (e.g. servlet api).
Also note that this change restores guava in applications' classpaths, even
though it's still shaded inside Spark. This is now needed for the Hadoop
libraries that are packaged with Spark, which now are not processed by
the shade plugin.
Author: Marcelo Vanzin <vanzin@cloudera.com>
Closes#11796 from vanzin/SPARK-13579.
## What changes were proposed in this pull request?
RDD.toLocalIterator() could be used to fetch one partition at a time to reduce the memory usage. Right now, for Dataset/Dataframe we have to use df.rdd.toLocalIterator, which is super slow also requires lots of memory (because of the Java serializer or even Kyro serializer).
This PR introduce an optimized toLocalIterator for Dataset/DataFrame, which is much faster and requires much less memory. For a partition with 5 millions rows, `df.rdd.toIterator` took about 100 seconds, but df.toIterator took less than 7 seconds. For 10 millions row, rdd.toIterator will crash (not enough memory) with 4G heap, but df.toLocalIterator could finished in 12 seconds.
The JDBC server has been updated to use DataFrame.toIterator.
## How was this patch tested?
Existing tests.
Author: Davies Liu <davies@databricks.com>
Closes#12114 from davies/local_iterator.
## What changes were proposed in this pull request?
Scala traits are difficult to maintain binary compatibility on, and as a result we had to introduce JavaSparkListener. In Spark 2.0 we can change SparkListener from a trait to an abstract class and then remove JavaSparkListener.
## How was this patch tested?
Updated related unit tests.
Author: Reynold Xin <rxin@databricks.com>
Closes#12142 from rxin/SPARK-14358.
## What changes were proposed in this pull request?
It's a mistake that HeartbeatReceiver object was made public in Spark 1.x.
## How was this patch tested?
N/A
Author: Reynold Xin <rxin@databricks.com>
Closes#12148 from rxin/SPARK-14364.
## What changes were proposed in this pull request?
This PR contains the following 5 types of maintenance fix over 59 files (+94 lines, -93 lines).
- Fix typos(exception/log strings, testcase name, comments) in 44 lines.
- Fix lint-java errors (MaxLineLength) in 6 lines. (New codes after SPARK-14011)
- Use diamond operators in 40 lines. (New codes after SPARK-13702)
- Fix redundant semicolon in 5 lines.
- Rename class `InferSchemaSuite` to `CSVInferSchemaSuite` in CSVInferSchemaSuite.scala.
## How was this patch tested?
Manual and pass the Jenkins tests.
Author: Dongjoon Hyun <dongjoon@apache.org>
Closes#12139 from dongjoon-hyun/SPARK-14355.
## What changes were proposed in this pull request?
This special cases 0 and 1 counts to avoid passing 0 degrees of freedom.
## How was this patch tested?
Tests run successfully. New test added.
## Note:
This recreates #11982 which was closed to due to non-updated diff. rxin srowen Commented there.
This also adds tests, reworks the code to perform the special casing (based on srowen's comments), and adds equality machinery for BoundedDouble, as well as changing how it is transformed to string.
Author: Marcin Tustin <mtustin@handybook.com>
Author: Marcin Tustin <mtustin@handy.com>
Closes#12016 from mtustin-handy/SPARK-14163.
## What changes were proposed in this pull request?
Appends s3 specific configurations and spark.hadoop configurations to hive configuration.
## How was this patch tested?
Tested by running a job on cluster.
…figurations to hive configuration.
Author: Sital Kedia <skedia@fb.com>
Closes#11876 from sitalkedia/hiveConf.
## What changes were proposed in this pull request?
Straggler references to Tachyon were removed:
- for docs, `tachyon` has been generalized as `off-heap memory`;
- for Mesos test suits, the key-value `tachyon:true`/`tachyon:false` has been changed to `os:centos`/`os:ubuntu`, since `os` is an example constrain used by the [Mesos official docs](http://mesos.apache.org/documentation/attributes-resources/).
## How was this patch tested?
Existing test suites.
Author: Liwei Lin <lwlin7@gmail.com>
Closes#12129 from lw-lin/tachyon-cleanup.