Commit graph

5404 commits

Author SHA1 Message Date
Shixiong Zhu 6ff0435858 [SPARK-14713][TESTS] Fix the flaky test NettyBlockTransferServiceSuite
## What changes were proposed in this pull request?

When there are multiple tests running, "NettyBlockTransferServiceSuite.can bind to a specific port twice and the second increments" may fail.

E.g., assume there are 2 tests running. Here are the execution order to reproduce the test failure.

| Execution Order | Test 1 | Test 2 |
| ------------- | ------------- | ------------- |
| 1 | service0 binds to 17634 |  |
| 2 |  | service0 binds to 17635 (17634 is occupied) |
| 3 | service1 binds to 17636 |  |
| 4 | pass test |  |
| 5 | service0.close (release 17634) |  |
| 6 |  | service1 binds to 17634 |
| 7 |  | `service1.port should be (service0.port + 1)` fails (17634 != 17635 + 1) |

Here is an example in Jenkins: https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test/job/spark-master-test-maven-hadoop-2.2/786/testReport/junit/org.apache.spark.network.netty/NettyBlockTransferServiceSuite/can_bind_to_a_specific_port_twice_and_the_second_increments/

This PR makes two changes:

- Use a random port between 17634 and 27634 to reduce the possibility of port conflicts.
- Make `service1` use `service0.port` to bind to avoid the above race condition.

## How was this patch tested?

Jenkins unit tests.

Author: Shixiong Zhu <shixiong@databricks.com>

Closes #12477 from zsxwing/SPARK-14713.
2016-04-18 14:41:45 -07:00
Reynold Xin 8a87f7d5c8 Mark ExternalClusterManager as private[spark]. 2016-04-16 23:49:26 -07:00
Hemant Bhanawat af1f4da762 [SPARK-13904][SCHEDULER] Add support for pluggable cluster manager
## What changes were proposed in this pull request?

This commit adds support for pluggable cluster manager. And also allows a cluster manager to clean up tasks without taking the parent process down.

To plug a new external cluster manager, ExternalClusterManager trait should be implemented. It returns task scheduler and backend scheduler that will be used by SparkContext to schedule tasks. An external cluster manager is registered using the java.util.ServiceLoader mechanism (This mechanism is also being used to register data sources like parquet, json, jdbc etc.). This allows auto-loading implementations of ExternalClusterManager interface.

Currently, when a driver fails, executors exit using system.exit. This does not bode well for cluster managers that would like to reuse the parent process of an executor. Hence,

  1. Moving system.exit to a function that can be overriden in subclasses of CoarseGrainedExecutorBackend.
  2. Added functionality of killing all the running tasks in an executor.

## How was this patch tested?
ExternalClusterManagerSuite.scala was added to test this patch.

Author: Hemant Bhanawat <hemant@snappydata.io>

Closes #11723 from hbhanawat/pluggableScheduler.
2016-04-16 23:43:32 -07:00
hyukjinkwon 9f678e9754 [MINOR] Remove inappropriate type notation and extra anonymous closure within functional transformations
## What changes were proposed in this pull request?

This PR removes

- Inappropriate type notations
    For example, from
    ```scala
    words.foreachRDD { (rdd: RDD[String], time: Time) =>
    ...
    ```
    to
    ```scala
    words.foreachRDD { (rdd, time) =>
    ...
    ```

- Extra anonymous closure within functional transformations.
    For example,
    ```scala
    .map(item => {
      ...
    })
    ```

    which can be just simply as below:

    ```scala
    .map { item =>
      ...
    }
    ```

and corrects some obvious style nits.

## How was this patch tested?

This was tested after adding rules in `scalastyle-config.xml`, which ended up with not finding all perfectly.

The rules applied were below:

- For the first correction,

```xml
<check customId="NoExtraClosure" level="error" class="org.scalastyle.file.RegexChecker" enabled="true">
    <parameters><parameter name="regex">(?m)\.[a-zA-Z_][a-zA-Z0-9]*\(\s*[^,]+s*=>\s*\{[^\}]+\}\s*\)</parameter></parameters>
</check>
```

```xml
<check customId="NoExtraClosure" level="error" class="org.scalastyle.file.RegexChecker" enabled="true">
    <parameters><parameter name="regex">\.[a-zA-Z_][a-zA-Z0-9]*\s*[\{|\(]([^\n>,]+=>)?\s*\{([^()]|(?R))*\}^[,]</parameter></parameters>
</check>
```

- For the second correction
```xml
<check customId="TypeNotation" level="error" class="org.scalastyle.file.RegexChecker" enabled="true">
    <parameters><parameter name="regex">\.[a-zA-Z_][a-zA-Z0-9]*\s*[\{|\(]\s*\([^):]*:R))*\}^[,]</parameter></parameters>
</check>
```

**Those rules were not added**

Author: hyukjinkwon <gurwls223@gmail.com>

Closes #12413 from HyukjinKwon/SPARK-style.
2016-04-16 14:56:23 +01:00
Reynold Xin 8028a28885 [SPARK-14628][CORE] Simplify task metrics by always tracking read/write metrics
## What changes were proposed in this pull request?

Part of the reason why TaskMetrics and its callers are complicated are due to the optional metrics we collect, including input, output, shuffle read, and shuffle write. I think we can always track them and just assign 0 as the initial values. It is usually very obvious whether a task is supposed to read any data or not. By always tracking them, we can remove a lot of map, foreach, flatMap, getOrElse(0L) calls throughout Spark.

This patch also changes a few behaviors.

1. Removed the distinction of data read/write methods (e.g. Hadoop, Memory, Network, etc).
2. Accumulate all data reads and writes, rather than only the first method. (Fixes SPARK-5225)

## How was this patch tested?

existing tests.

This is bases on https://github.com/apache/spark/pull/12388, with more test fixes.

Author: Reynold Xin <rxin@databricks.com>
Author: Wenchen Fan <wenchen@databricks.com>

Closes #12417 from cloud-fan/metrics-refactor.
2016-04-15 15:39:39 -07:00
Peter Ableda 06b9d623e8 [SPARK-14633] Use more readable format to show memory bytes in Error Message
## What changes were proposed in this pull request?

Round memory bytes and convert it to Long to it’s original type. This change fixes the formatting issue in the Exception message.

## How was this patch tested?

Manual tests were done in CDH cluster.

Author: Peter Ableda <peter.ableda@cloudera.com>

Closes #12392 from peterableda/SPARK-14633.
2016-04-15 13:18:48 +01:00
Mark Grover ff9ae61a3b [SPARK-14601][DOC] Minor doc/usage changes related to removal of Spark assembly
## What changes were proposed in this pull request?

Removing references to assembly jar in documentation.
Adding an additional (previously undocumented) usage of spark-submit to run examples.

## How was this patch tested?

Ran spark-submit usage to ensure formatting was fine. Ran examples using SparkSubmit.

Author: Mark Grover <mark@apache.org>

Closes #12365 from markgrover/spark-14601.
2016-04-14 18:51:43 -07:00
Wenchen Fan 1d04c86fc5 [SPARK-14558][CORE] In ClosureCleaner, clean the outer pointer if it's a REPL line object
## What changes were proposed in this pull request?

When we clean a closure, if its outermost parent is not a closure, we won't clone and clean it as cloning user's objects is dangerous. However, if it's a REPL line object, which may carry a lot of unnecessary references(like hadoop conf, spark conf, etc.), we should clean it as it's not a user object.

This PR improves the check for user's objects to exclude REPL line object.

## How was this patch tested?

existing tests.

Author: Wenchen Fan <wenchen@databricks.com>

Closes #12327 from cloud-fan/closure.
2016-04-14 10:58:06 -07:00
Reynold Xin a46f98d3f4 [SPARK-14617] Remove deprecated APIs in TaskMetrics
## What changes were proposed in this pull request?
This patch removes some of the deprecated APIs in TaskMetrics. This is part of my bigger effort to simplify accumulators and task metrics.

## How was this patch tested?
N/A - only removals

Author: Reynold Xin <rxin@databricks.com>

Closes #12375 from rxin/SPARK-14617.
2016-04-14 10:56:13 -07:00
Reynold Xin dac40b68dc [SPARK-14619] Track internal accumulators (metrics) by stage attempt
## What changes were proposed in this pull request?
When there are multiple attempts for a stage, we currently only reset internal accumulator values if all the tasks are resubmitted. It would make more sense to reset the accumulator values for each stage attempt. This will allow us to eventually get rid of the internal flag in the Accumulator class. This is part of my bigger effort to simplify accumulators and task metrics.

## How was this patch tested?
Covered by existing tests.

Author: Reynold Xin <rxin@databricks.com>

Closes #12378 from rxin/SPARK-14619.
2016-04-14 10:54:57 -07:00
Liwei Lin 3e27940a19 [SPARK-14630][BUILD][CORE][SQL][STREAMING] Code style: public abstract methods should have explicit return types
## What changes were proposed in this pull request?

Currently many public abstract methods (in abstract classes as well as traits) don't declare return types explicitly, such as in [o.a.s.streaming.dstream.InputDStream](https://github.com/apache/spark/blob/master/streaming/src/main/scala/org/apache/spark/streaming/dstream/InputDStream.scala#L110):
```scala
def start() // should be: def start(): Unit
def stop()  // should be: def stop(): Unit
```

These methods exist in core, sql, streaming; this PR fixes them.

## How was this patch tested?

N/A

## Which piece of scala style rule led to the changes?

the rule was added separately in https://github.com/apache/spark/pull/12396

Author: Liwei Lin <lwlin7@gmail.com>

Closes #12389 from lw-lin/public-abstract-methods.
2016-04-14 10:14:38 -07:00
Reynold Xin de2ad52855 [SPARK-14625] TaskUIData and ExecutorUIData shouldn't be case classes
## What changes were proposed in this pull request?
I was trying to understand the accumulator and metrics update source code and these two classes don't really need to be case classes. It would also be more consistent with other UI classes if they are not case classes. This is part of my bigger effort to simplify accumulators and task metrics.

## How was this patch tested?
This is a straightforward refactoring without behavior change.

Author: Reynold Xin <rxin@databricks.com>

Closes #12386 from rxin/SPARK-14625.
2016-04-14 10:12:29 -07:00
hyukjinkwon 6fc3dc8839 [MINOR][SQL] Remove extra anonymous closure within functional transformations
## What changes were proposed in this pull request?

This PR removes extra anonymous closure within functional transformations.

For example,

```scala
.map(item => {
  ...
})
```

which can be just simply as below:

```scala
.map { item =>
  ...
}
```

## How was this patch tested?

Related unit tests and `sbt scalastyle`.

Author: hyukjinkwon <gurwls223@gmail.com>

Closes #12382 from HyukjinKwon/minor-extra-closers.
2016-04-14 09:43:41 +01:00
hyukjinkwon b4819404a6 [SPARK-14596][SQL] Remove not used SqlNewHadoopRDD and some more unused imports
## What changes were proposed in this pull request?

Old `HadoopFsRelation` API includes `buildInternalScan()` which uses `SqlNewHadoopRDD` in `ParquetRelation`.
Because now the old API is removed, `SqlNewHadoopRDD` is not used anymore.

So, this PR removes `SqlNewHadoopRDD` and several unused imports.

This was discussed in https://github.com/apache/spark/pull/12326.

## How was this patch tested?

Several related existing unit tests and `sbt scalastyle`.

Author: hyukjinkwon <gurwls223@gmail.com>

Closes #12354 from HyukjinKwon/SPARK-14596.
2016-04-14 15:43:44 +08:00
Charles Allen dd11e401e4 [SPARK-14537][CORE] Make TaskSchedulerImpl waiting fail if context is shut down
This patch makes the postStartHook throw an IllegalStateException if the SparkContext is shutdown while it is waiting for the backend to be ready

Author: Charles Allen <charles@allen-net.com>

Closes #12301 from drcrallen/SPARK-14537.
2016-04-13 16:02:49 +01:00
Liwei Lin 23f93f559c [SPARK-13992][CORE][PYSPARK][FOLLOWUP] Update OFF_HEAP semantics for Java api and Python api
## What changes were proposed in this pull request?

- updated `OFF_HEAP` semantics for `StorageLevels.java`
- updated `OFF_HEAP` semantics for `storagelevel.py`

## How was this patch tested?

no need to test

Author: Liwei Lin <lwlin7@gmail.com>

Closes #12126 from lw-lin/storagelevel.py.
2016-04-12 23:06:55 -07:00
Sital Kedia d187e7dea9 [SPARK-14363] Fix executor OOM due to memory leak in the Sorter
## What changes were proposed in this pull request?

Fix memory leak in the Sorter. When the UnsafeExternalSorter spills the data to disk, it does not free up the underlying pointer array. As a result, we see a lot of executor OOM and also memory under utilization.
This is a regression partially introduced in PR https://github.com/apache/spark/pull/9241

## How was this patch tested?

Tested by running a job and observed around 30% speedup after this change.

Author: Sital Kedia <skedia@fb.com>

Closes #12285 from sitalkedia/executor_oom.
2016-04-12 16:10:07 -07:00
Davies Liu 1ef5f8cfa6 [SPARK-14544] [SQL] improve performance of SQL UI tab
## What changes were proposed in this pull request?

This PR improve the performance of SQL UI by:

1) remove the details column in all executions page (the first page in SQL tab). We can check the details by enter the execution page.
2) break-all is super slow in Chrome recently, so switch to break-word.
3) Using "display: none" to hide a block.
4) using one js closure for  for all the executions, not one for each.
5) remove the height limitation of details, don't need to scroll it in the tiny window.

## How was this patch tested?

Exists tests.

![ui](https://cloud.githubusercontent.com/assets/40902/14445712/68d7b258-0004-11e6-9b48-5d329b05d165.png)

Author: Davies Liu <davies@databricks.com>

Closes #12311 from davies/ui_perf.
2016-04-12 15:03:00 -07:00
Terence Yim 3e53de4bdd [SPARK-14513][CORE] Fix threads left behind after stopping SparkContext
## What changes were proposed in this pull request?

Shutting down `QueuedThreadPool` used by Jetty `Server` to avoid threads leakage after SparkContext is stopped.

Note: If this fix is going to apply to the `branch-1.6`, one more patch on the `NettyRpcEnv` class is needed so that the `NettyRpcEnv._fileServer.shutdown` is called in the `NettyRpcEnv.cleanup` method. This is due to the removal of `_fileServer` field in the `NettyRpcEnv` class in the master branch. Please advice if a second PR is necessary for bring this fix back to `branch-1.6`

## How was this patch tested?

Ran the ./dev/run-tests locally

Author: Terence Yim <terence@cask.co>

Closes #12318 from chtyim/fixes/SPARK-14513-thread-leak.
2016-04-12 13:46:39 -07:00
Dongjoon Hyun b0f5497e95 [SPARK-14508][BUILD] Add a new ScalaStyle Rule OmitBracesInCase
## What changes were proposed in this pull request?

According to the [Spark Code Style Guide](https://cwiki.apache.org/confluence/display/SPARK/Spark+Code+Style+Guide) and [Scala Style Guide](http://docs.scala-lang.org/style/control-structures.html#curlybraces), we had better enforce the following rule.
  ```
  case: Always omit braces in case clauses.
  ```
This PR makes a new ScalaStyle rule, 'OmitBracesInCase', and enforces it to the code.

## How was this patch tested?

Pass the Jenkins tests (including Scala style checking)

Author: Dongjoon Hyun <dongjoon@apache.org>

Closes #12280 from dongjoon-hyun/SPARK-14508.
2016-04-12 00:43:28 -07:00
Eric Liang 6f27027d96 [SPARK-14475] Propagate user-defined context from driver to executors
## What changes were proposed in this pull request?

This adds a new API call `TaskContext.getLocalProperty` for getting properties set in the driver from executors. These local properties are automatically propagated from the driver to executors. For streaming, the context for streaming tasks will be the initial driver context when ssc.start() is called.

## How was this patch tested?

Unit tests.

cc JoshRosen

Author: Eric Liang <ekl@databricks.com>

Closes #12248 from ericl/sc-2813.
2016-04-11 18:33:54 -07:00
Jason Moore 22014e6fb9 [SPARK-14357][CORE] Properly handle the root cause being a commit denied exception
## What changes were proposed in this pull request?

When deciding whether a CommitDeniedException caused a task to fail, consider the root cause of the Exception.

## How was this patch tested?

Added a test suite for the component that extracts the root cause of the error.
Made a distribution after cherry-picking this commit to branch-1.6 and used to run our Spark application that would quite often fail due to the CommitDeniedException.

Author: Jason Moore <jasonmoore2k@outlook.com>

Closes #12228 from jasonmoore2k/SPARK-14357.
2016-04-09 23:34:57 -07:00
Dongjoon Hyun aea30a1a9b [SPARK-14465][BUILD] Checkstyle should check all Java files
## What changes were proposed in this pull request?

Currently, `checkstyle` is configured to check the files under `src/main/java`. However, Spark has Java files in `src/main/scala`, too. This PR fixes the following configuration in `pom.xml` and the unchecked-so-far violations on those files.
```xml
-<sourceDirectory>${basedir}/src/main/java</sourceDirectory>
+<sourceDirectories>${basedir}/src/main/java,${basedir}/src/main/scala</sourceDirectories>
```

## How was this patch tested?

After passing the Jenkins build and manually `dev/lint-java`. (Note that Jenkins does not run `lint-java`)

Author: Dongjoon Hyun <dongjoon@apache.org>

Closes #12242 from dongjoon-hyun/SPARK-14465.
2016-04-09 21:31:20 -07:00
Davies Liu 5cb5edaf9c [SPARK-14419] [SQL] Improve HashedRelation for key fit within Long
## What changes were proposed in this pull request?

Currently, we use java HashMap for HashedRelation if the key could fit within a Long. The java HashMap and CompactBuffer are not memory efficient, the memory used by them is also accounted accurately.

This PR introduce a LongToUnsafeRowMap (similar to BytesToBytesMap) for better memory efficiency and performance.

This PR reopen #12190 to fix bugs.

## How was this patch tested?

Existing tests.

Author: Davies Liu <davies@databricks.com>

Closes #12278 from davies/long_map3.
2016-04-09 17:44:38 -07:00
Sameer Agarwal 813e96e6fa [SPARK-14454] Better exception handling while marking tasks as failed
## What changes were proposed in this pull request?

This patch adds support for better handling of exceptions inside catch blocks if the code within the block throws an exception. For instance here is the code in a catch block before this change in `WriterContainer.scala`:

```scala
logError("Aborting task.", cause)
// call failure callbacks first, so we could have a chance to cleanup the writer.
TaskContext.get().asInstanceOf[TaskContextImpl].markTaskFailed(cause)
if (currentWriter != null) {
  currentWriter.close()
}
abortTask()
throw new SparkException("Task failed while writing rows.", cause)
```

If `markTaskFailed` or `currentWriter.close` throws an exception, we currently lose the original cause. This PR fixes this problem by implementing a utility function `Utils.tryWithSafeCatch` that suppresses (`Throwable.addSuppressed`) the exception that are thrown within the catch block and rethrowing the original exception.

## How was this patch tested?

No new functionality added

Author: Sameer Agarwal <sameer@databricks.com>

Closes #12234 from sameeragarwal/fix-exception.
2016-04-08 17:23:32 -07:00
Shixiong Zhu 4d7c359263 [SPARK-14437][CORE] Use the address that NettyBlockTransferService listens to create BlockManagerId
## What changes were proposed in this pull request?

Here is why SPARK-14437 happens:
BlockManagerId is created using NettyBlockTransferService.hostName which comes from `customHostname`. And `Executor` will set `customHostname` to the hostname which is detected by the driver. However, the driver may not be able to detect the correct address in some complicated network (Netty's Channel.remoteAddress doesn't always return a connectable address). In such case, `BlockManagerId` will be created using a wrong hostname.

To fix this issue, this PR uses `hostname` provided by `SparkEnv.create` to create `NettyBlockTransferService` and set `NettyBlockTransferService.hostname` to this one directly. A bonus of this approach is NettyBlockTransferService won't bound to `0.0.0.0` which is much safer.

## How was this patch tested?

Manually checked the bound address using local-cluster.

Author: Shixiong Zhu <shixiong@databricks.com>

Closes #12240 from zsxwing/SPARK-14437.
2016-04-08 17:18:19 -07:00
Michael Armbrust 692c74840b [SPARK-14449][SQL] SparkContext should use SparkListenerInterface
Currently all `SparkFirehoseListener` implementations are broken since we expect listeners to extend `SparkListener`, while the fire hose only extends `SparkListenerInterface`.  This changes the addListener function and the config based injection to use the interface instead.

The existing tests in SparkListenerSuite are improved such that they would have caught this.

Follow-up to #12142

Author: Michael Armbrust <michael@databricks.com>

Closes #12227 from marmbrus/fixListener.
2016-04-07 18:05:54 -07:00
Andrew Or 3e29e372ff [SPARK-14468] Always enable OutputCommitCoordinator
## What changes were proposed in this pull request?

`OutputCommitCoordinator` was introduced to deal with concurrent task attempts racing to write output, leading to data loss or corruption. For more detail, read the [JIRA description](https://issues.apache.org/jira/browse/SPARK-14468).

Before: `OutputCommitCoordinator` is enabled only if speculation is enabled.
After: `OutputCommitCoordinator` is always enabled.

Users may still disable this through `spark.hadoop.outputCommitCoordination.enabled`, but they really shouldn't...

## How was this patch tested?

`OutputCommitCoordinator*Suite`

Author: Andrew Or <andrew@databricks.com>

Closes #12244 from andrewor14/always-occ.
2016-04-07 17:49:39 -07:00
Dhruve Ashar 033d808152 [SPARK-12384] Enables spark-clients to set the min(-Xms) and max(*.memory config) j…
## What changes were proposed in this pull request?

Currently Spark clients are started with the same memory setting for Xms and Xms leading to reserving unnecessary higher amounts of memory.
This behavior is changed and the clients can now specify an initial heap size using the extraJavaOptions in the config for driver,executor and am individually.
 Note, that only -Xms can be provided through this config option, if the client wants to set the max size(-Xmx), this has to be done via the *.memory configuration knobs which are currently supported.

## How was this patch tested?

Monitored executor and yarn logs in debug mode to verify the commands through which they are being launched in client and cluster mode. The driver memory was verified locally using jps -v. Setting up -Xmx parameter in the javaExtraOptions raises exception with the info provided.

Author: Dhruve Ashar <dhruveashar@gmail.com>

Closes #12115 from dhruve/impr/SPARK-12384.
2016-04-07 10:39:21 -05:00
Alex Bozarth 35e0db2d45 [SPARK-14245][WEB UI] Display the user in the application view
## What changes were proposed in this pull request?

The Spark UI (both active and history) should show the user who ran the application somewhere when you are in the application view. This was added under the Jobs view by total uptime and scheduler mode.

## How was this patch tested?

Manual testing

<img width="191" alt="username" src="https://cloud.githubusercontent.com/assets/13952758/14222830/6d1fe542-f82a-11e5-885f-c05ee2cdf857.png">

Author: Alex Bozarth <ajbozart@us.ibm.com>

Closes #12123 from ajbozarth/spark14245.
2016-04-07 09:15:00 -05:00
Marcelo Vanzin 21d5ca128b [SPARK-14134][CORE] Change the package name used for shading classes.
The current package name uses a dash, which is a little weird but seemed
to work. That is, until a new test tried to mock a class that references
one of those shaded types, and then things started failing.

Most changes are just noise to fix the logging configs.

For reference, SPARK-8815 also raised this issue, although at the time it
did not cause any issues in Spark, so it was not addressed.

Author: Marcelo Vanzin <vanzin@cloudera.com>

Closes #11941 from vanzin/SPARK-14134.
2016-04-06 19:33:51 -07:00
Shixiong Zhu f1def573f4 [SPARK-13112][CORE] Make sure RegisterExecutorResponse arrive before LaunchTask
## What changes were proposed in this pull request?

Send `RegisterExecutorResponse` using `executorRef` in order to make sure RegisterExecutorResponse and LaunchTask are both sent using the same channel. Then RegisterExecutorResponse will always arrive before LaunchTask

## How was this patch tested?

Existing unit tests

Closes #12078

Author: Shixiong Zhu <shixiong@databricks.com>

Closes #12211 from zsxwing/SPARK-13112.
2016-04-06 16:18:04 -07:00
Dongjoon Hyun d717ae1fd7 [SPARK-14444][BUILD] Add a new scalastyle NoScalaDoc to prevent ScalaDoc-style multiline comments
## What changes were proposed in this pull request?

According to the [Spark Code Style Guide](https://cwiki.apache.org/confluence/display/SPARK/Spark+Code+Style+Guide#SparkCodeStyleGuide-Indentation), this PR adds a new scalastyle rule to prevent the followings.
```
/** In Spark, we don't use the ScalaDoc style so this
  * is not correct.
  */
```

## How was this patch tested?

Pass the Jenkins tests (including `lint-scala`).

Author: Dongjoon Hyun <dongjoon@apache.org>

Closes #12221 from dongjoon-hyun/SPARK-14444.
2016-04-06 16:02:55 -07:00
Tathagata Das 9af5423ec2 [SPARK-12133][STREAMING] Streaming dynamic allocation
## What changes were proposed in this pull request?

Added a new Executor Allocation Manager for the Streaming scheduler for doing Streaming Dynamic Allocation.

## How was this patch tested
Unit tests, and cluster tests.

Author: Tathagata Das <tathagata.das1565@gmail.com>

Closes #12154 from tdas/streaming-dynamic-allocation.
2016-04-06 15:46:20 -07:00
Eric Liang 78c1076d04 [SPARK-14252] Executors do not try to download remote cached blocks
## What changes were proposed in this pull request?

As mentioned in the ticket this was because one get path in the refactored `BlockManager` did not check for remote storage.

## How was this patch tested?

Unit test, also verified manually with reproduction in the ticket.

cc JoshRosen

Author: Eric Liang <ekl@databricks.com>

Closes #12193 from ericl/spark-14252.
2016-04-05 22:37:51 -07:00
Shixiong Zhu 48467f4eb0 [SPARK-14416][CORE] Add thread-safe comments for CoarseGrainedSchedulerBackend's fields
## What changes were proposed in this pull request?

While I was reviewing #12078, I found most of CoarseGrainedSchedulerBackend's mutable fields doesn't have any comments about the thread-safe assumptions and it's hard for people to figure out which part of codes should be protected by the lock. This PR just added comments/annotations for them and also added strict access modifiers for some fields.

## How was this patch tested?

Existing unit tests.

Author: Shixiong Zhu <shixiong@databricks.com>

Closes #12188 from zsxwing/comments.
2016-04-05 22:32:37 -07:00
Marcelo Vanzin d5ee9d5c24 [SPARK-529][SQL] Modify SQLConf to use new config API from core.
Because SQL keeps track of all known configs, some customization was
needed in SQLConf to allow that, since the core API does not have that
feature.

Tested via existing (and slightly updated) unit tests.

Author: Marcelo Vanzin <vanzin@cloudera.com>

Closes #11570 from vanzin/SPARK-529-sql.
2016-04-05 15:19:51 -07:00
Kousuke Saruta e4bd504120 [SPARK-14397][WEBUI] <html> and <body> tags are nested in LogPage
## What changes were proposed in this pull request?

In `LogPage`, the content to be rendered is defined as follows.

```
    val content =
      <html>
        <body>
          {linkToMaster}
          <div>
            <div style="float:left; margin-right:10px">{backButton}</div>
            <div style="float:left;">{range}</div>
            <div style="float:right; margin-left:10px">{nextButton}</div>
          </div>
          <br />
          <div style="height:500px; overflow:auto; padding:5px;">
            <pre>{logText}</pre>
          </div>
        </body>
      </html>
    UIUtils.basicSparkPage(content, logType + " log page for " + pageName)
```

As you can see, <html> and <body> tags will be rendered.

On the other hand, `UIUtils.basicSparkPage` will render those tags so those tags will be nested.

```
  def basicSparkPage(
      content: => Seq[Node],
      title: String,
      useDataTables: Boolean = false): Seq[Node] = {
    <html>
      <head>
        {commonHeaderNodes}
        {if (useDataTables) dataTablesHeaderNodes else Seq.empty}
        <title>{title}</title>
      </head>
      <body>
        <div class="container-fluid">
          <div class="row-fluid">
            <div class="span12">
              <h3 style="vertical-align: middle; display: inline-block;">
                <a style="text-decoration: none" href={prependBaseUri("/")}>
                  <img src={prependBaseUri("/static/spark-logo-77x50px-hd.png")} />
                  <span class="version"
                        style="margin-right: 15px;">{org.apache.spark.SPARK_VERSION}</span>
                </a>
                {title}
              </h3>
            </div>
          </div>
          {content}
        </div>
      </body>
    </html>
  }
```

These are the screen shots before this patch is applied.

![before1](https://cloud.githubusercontent.com/assets/4736016/14273236/03cbed8a-fb44-11e5-8786-bc1bfa4d3f8c.png)
![before2](https://cloud.githubusercontent.com/assets/4736016/14273237/03d1741c-fb44-11e5-9dee-ea93022033a6.png)

And these are the ones after this patch is applied.

![after1](https://cloud.githubusercontent.com/assets/4736016/14273248/1b6a7d8a-fb44-11e5-8a3b-69964f3434f6.png)
![after2](https://cloud.githubusercontent.com/assets/4736016/14273249/1b6b9c38-fb44-11e5-9d6f-281d64c842e4.png)

The appearance is not changed but the html source code is changed.

## How was this patch tested?

Manually run some jobs on my standalone-cluster and check the WebUI.

Author: Kousuke Saruta <sarutak@oss.nttdata.co.jp>

Closes #12170 from sarutak/SPARK-14397.
2016-04-05 10:51:23 -07:00
Guillaume Poulin 7201f033ce [SPARK-12425][STREAMING] DStream union optimisation
Use PartitionerAwareUnionRDD when possbile for optimizing shuffling and
preserving the partitioner.

Author: Guillaume Poulin <poulin.guillaume@gmail.com>

Closes #10382 from gpoulin/dstream_union_optimisation.
2016-04-05 02:54:38 +01:00
Marcelo Vanzin 24d7d2e453 [SPARK-13579][BUILD] Stop building the main Spark assembly.
This change modifies the "assembly/" module to just copy needed
dependencies to its build directory, and modifies the packaging
script to pick those up (and remove duplicate jars packages in the
examples module).

I also made some minor adjustments to dependencies to remove some
test jars from the final packaging, and remove jars that conflict with each
other when packaged separately (e.g. servlet api).

Also note that this change restores guava in applications' classpaths, even
though it's still shaded inside Spark. This is now needed for the Hadoop
libraries that are packaged with Spark, which now are not processed by
the shade plugin.

Author: Marcelo Vanzin <vanzin@cloudera.com>

Closes #11796 from vanzin/SPARK-13579.
2016-04-04 16:52:22 -07:00
Davies Liu cc70f17416 [SPARK-14334] [SQL] add toLocalIterator for Dataset/DataFrame
## What changes were proposed in this pull request?

RDD.toLocalIterator() could be used to fetch one partition at a time to reduce the memory usage. Right now, for Dataset/Dataframe we have to use df.rdd.toLocalIterator, which is super slow also requires lots of memory (because of the Java serializer or even Kyro serializer).

This PR introduce an optimized toLocalIterator for Dataset/DataFrame, which is much faster and requires much less memory. For a partition with 5 millions rows, `df.rdd.toIterator` took about 100 seconds, but df.toIterator took less than 7 seconds. For 10 millions row, rdd.toIterator will crash (not enough memory) with 4G heap, but df.toLocalIterator could finished in 12 seconds.

The JDBC server has been updated to use DataFrame.toIterator.

## How was this patch tested?

Existing tests.

Author: Davies Liu <davies@databricks.com>

Closes #12114 from davies/local_iterator.
2016-04-04 13:31:44 -07:00
Reynold Xin 7143904700 [SPARK-14358] Change SparkListener from a trait to an abstract class
## What changes were proposed in this pull request?
Scala traits are difficult to maintain binary compatibility on, and as a result we had to introduce JavaSparkListener. In Spark 2.0 we can change SparkListener from a trait to an abstract class and then remove JavaSparkListener.

## How was this patch tested?
Updated related unit tests.

Author: Reynold Xin <rxin@databricks.com>

Closes #12142 from rxin/SPARK-14358.
2016-04-04 13:26:18 -07:00
Reynold Xin 27dad6f658 [SPARK-14364][SPARK] HeartbeatReceiver object should be private
## What changes were proposed in this pull request?
It's a mistake that HeartbeatReceiver object was made public in Spark 1.x.

## How was this patch tested?
N/A

Author: Reynold Xin <rxin@databricks.com>

Closes #12148 from rxin/SPARK-14364.
2016-04-04 13:19:34 -07:00
Dongjoon Hyun 3f749f7ed4 [SPARK-14355][BUILD] Fix typos in Exception/Testcase/Comments and static analysis results
## What changes were proposed in this pull request?

This PR contains the following 5 types of maintenance fix over 59 files (+94 lines, -93 lines).
- Fix typos(exception/log strings, testcase name, comments) in 44 lines.
- Fix lint-java errors (MaxLineLength) in 6 lines. (New codes after SPARK-14011)
- Use diamond operators in 40 lines. (New codes after SPARK-13702)
- Fix redundant semicolon in 5 lines.
- Rename class `InferSchemaSuite` to `CSVInferSchemaSuite` in CSVInferSchemaSuite.scala.

## How was this patch tested?

Manual and pass the Jenkins tests.

Author: Dongjoon Hyun <dongjoon@apache.org>

Closes #12139 from dongjoon-hyun/SPARK-14355.
2016-04-03 18:14:16 -07:00
Marcin Tustin 9023015f05 [SPARK-14163][CORE] SumEvaluator and countApprox cannot reliably handle RDDs of size 1
## What changes were proposed in this pull request?

This special cases 0 and 1 counts to avoid passing 0 degrees of freedom.

## How was this patch tested?

Tests run successfully. New test added.

## Note:
This recreates #11982 which was closed to due to non-updated diff. rxin srowen Commented there.
This also adds tests, reworks the code to perform the special casing (based on srowen's comments), and adds equality machinery for BoundedDouble, as well as changing how it is transformed to string.

Author: Marcin Tustin <mtustin@handybook.com>
Author: Marcin Tustin <mtustin@handy.com>

Closes #12016 from mtustin-handy/SPARK-14163.
2016-04-03 17:42:33 -07:00
Sital Kedia 1cf7018342 [SPARK-14056] Appends s3 specific configurations and spark.hadoop con…
## What changes were proposed in this pull request?

Appends s3 specific configurations and spark.hadoop configurations to hive configuration.

## How was this patch tested?

Tested by running a job on cluster.

…figurations to hive configuration.

Author: Sital Kedia <skedia@fb.com>

Closes #11876 from sitalkedia/hiveConf.
2016-04-02 19:17:25 -07:00
Liwei Lin 03d130f973 [SPARK-14342][CORE][DOCS][TESTS] Remove straggler references to Tachyon
## What changes were proposed in this pull request?

Straggler references to Tachyon were removed:
- for docs, `tachyon` has been generalized as `off-heap memory`;
- for Mesos test suits, the key-value `tachyon:true`/`tachyon:false` has been changed to `os:centos`/`os:ubuntu`, since `os` is an example constrain used by the [Mesos official docs](http://mesos.apache.org/documentation/attributes-resources/).

## How was this patch tested?

Existing test suites.

Author: Liwei Lin <lwlin7@gmail.com>

Closes #12129 from lw-lin/tachyon-cleanup.
2016-04-02 17:55:46 -07:00
Dongjoon Hyun 4a6e78abd9 [MINOR][DOCS] Use multi-line JavaDoc comments in Scala code.
## What changes were proposed in this pull request?

This PR aims to fix all Scala-Style multiline comments into Java-Style multiline comments in Scala codes.
(All comment-only changes over 77 files: +786 lines, −747 lines)

## How was this patch tested?

Manual.

Author: Dongjoon Hyun <dongjoon@apache.org>

Closes #12130 from dongjoon-hyun/use_multiine_javadoc_comments.
2016-04-02 17:50:40 -07:00
Alex Bozarth abc6c42c2d [SPARK-13241][WEB UI] Added long values for dates in ApplicationAttemptInfo API
## What changes were proposed in this pull request?

Adding long values for each Date in the ApplicationAttemptInfo API for easier use in code

## How was the this patch tested?

Tested with dev/run-tests

Author: Alex Bozarth <ajbozart@us.ibm.com>

Closes #11326 from ajbozarth/spark13241.
2016-04-01 16:18:09 -07:00
Josh Rosen e41acb7573 [SPARK-13992] Add support for off-heap caching
This patch adds support for caching blocks in the executor processes using direct / off-heap memory.

## User-facing changes

**Updated semantics of `OFF_HEAP` storage level**: In Spark 1.x, the `OFF_HEAP` storage level indicated that an RDD should be cached in Tachyon. Spark 2.x removed the external block store API that Tachyon caching was based on (see #10752 / SPARK-12667), so `OFF_HEAP` became an alias for `MEMORY_ONLY_SER`. As of this patch, `OFF_HEAP` means "serialized and cached in off-heap memory or on disk". Via the `StorageLevel` constructor, `useOffHeap` can be set if `serialized == true` and can be used to construct custom storage levels which support replication.

**Storage UI reporting**: the storage UI will now report whether in-memory blocks are stored on- or off-heap.

**Only supported by UnifiedMemoryManager**: for simplicity, this feature is only supported when the default UnifiedMemoryManager is used; applications which use the legacy memory manager (`spark.memory.useLegacyMode=true`) are not currently able to allocate off-heap storage memory, so using off-heap caching will fail with an error when legacy memory management is enabled. Given that we plan to eventually remove the legacy memory manager, this is not a significant restriction.

**Memory management policies:** the policies for dividing available memory between execution and storage are the same for both on- and off-heap memory. For off-heap memory, the total amount of memory available for use by Spark is controlled by `spark.memory.offHeap.size`, which is an absolute size. Off-heap storage memory obeys `spark.memory.storageFraction` in order to control the amount of unevictable storage memory. For example, if `spark.memory.offHeap.size` is 1 gigabyte and Spark uses the default `storageFraction` of 0.5, then up to 500 megabytes of off-heap cached blocks will be protected from eviction due to execution memory pressure. If necessary, we can split `spark.memory.storageFraction` into separate on- and off-heap configurations, but this doesn't seem necessary now and can be done later without any breaking changes.

**Use of off-heap memory does not imply use of off-heap execution (or vice-versa)**: for now, the settings controlling the use of off-heap execution memory (`spark.memory.offHeap.enabled`) and off-heap caching are completely independent, so Spark SQL can be configured to use off-heap memory for execution while continuing to cache blocks on-heap. If desired, we can change this in a followup patch so that `spark.memory.offHeap.enabled` affect the default storage level for cached SQL tables.

## Internal changes

- Rename `ByteArrayChunkOutputStream` to `ChunkedByteBufferOutputStream`
  - It now returns a `ChunkedByteBuffer` instead of an array of byte arrays.
  - Its constructor now accept an `allocator` function which is called to allocate `ByteBuffer`s. This allows us to control whether it allocates regular ByteBuffers or off-heap DirectByteBuffers.
  - Because block serialization is now performed during the unroll process, a `ChunkedByteBufferOutputStream` which is configured with a `DirectByteBuffer` allocator will use off-heap memory for both unroll and storage memory.
- The `MemoryStore`'s MemoryEntries now tracks whether blocks are stored on- or off-heap.
  - `evictBlocksToFreeSpace()` now accepts a `MemoryMode` parameter so that we don't try to evict off-heap blocks in response to on-heap memory pressure (or vice-versa).
- Make sure that off-heap buffers are properly de-allocated during MemoryStore eviction.
- The JVM limits the total size of allocated direct byte buffers using the `-XX:MaxDirectMemorySize` flag and the default tends to be fairly low (< 512 megabytes in some JVMs). To work around this limitation, this patch adds a custom DirectByteBuffer allocator which ignores this memory limit.

Author: Josh Rosen <joshrosen@databricks.com>

Closes #11805 from JoshRosen/off-heap-caching.
2016-04-01 14:34:59 -07:00
zhonghaihua bd7b91cefb [SPARK-12864][YARN] initialize executorIdCounter after ApplicationMaster killed for max n…
Currently, when max number of executor failures reached the `maxNumExecutorFailures`, `ApplicationMaster` will be killed and re-register another one.This time, `YarnAllocator` will be created a new instance.
But, the value of property `executorIdCounter` in `YarnAllocator` will reset to `0`. Then the Id of new executor will starting from `1`. This will confuse with the executor has already created before, which will cause FetchFailedException.
This situation is just in yarn client mode, so this is an issue in yarn client mode. For more details, [link to jira issues SPARK-12864](https://issues.apache.org/jira/browse/SPARK-12864)
This PR introduce a mechanism to initialize `executorIdCounter` after `ApplicationMaster` killed.

Author: zhonghaihua <793507405@qq.com>

Closes #10794 from zhonghaihua/initExecutorIdCounterAfterAMKilled.
2016-04-01 16:23:14 -05:00
Liang-Chi Hsieh 3e991dbc31 [SPARK-13674] [SQL] Add wholestage codegen support to Sample
JIRA: https://issues.apache.org/jira/browse/SPARK-13674

## What changes were proposed in this pull request?

Sample operator doesn't support wholestage codegen now. This pr is to add support to it.

## How was this patch tested?

A test is added into `BenchmarkWholeStageCodegen`. Besides, all tests should be passed.

Author: Liang-Chi Hsieh <simonh@tw.ibm.com>
Author: Liang-Chi Hsieh <viirya@gmail.com>

Closes #11517 from viirya/add-wholestage-sample.
2016-04-01 14:02:32 -07:00
jerryshao 8ba2b7f28f [SPARK-12343][YARN] Simplify Yarn client and client argument
## What changes were proposed in this pull request?

Currently in Spark on YARN, configurations can be passed through SparkConf, env and command arguments, some parts are duplicated, like client argument and SparkConf. So here propose to simplify the command arguments.

## How was this patch tested?

This patch is tested manually with unit test.

CC vanzin tgravescs , please help to suggest this proposal. The original purpose of this JIRA is to remove `ClientArguments`, through refactoring some arguments like `--class`, `--arg` are not so easy to replace, so here I remove the most part of command line arguments, only keep the minimal set.

Author: jerryshao <sshao@hortonworks.com>

Closes #11603 from jerryshao/SPARK-12343.
2016-04-01 10:52:13 -07:00
Davies Liu f0afafdc5d [SPARK-14267] [SQL] [PYSPARK] execute multiple Python UDFs within single batch
## What changes were proposed in this pull request?

This PR support multiple Python UDFs within single batch, also improve the performance.

```python
>>> from pyspark.sql.types import IntegerType
>>> sqlContext.registerFunction("double", lambda x: x * 2, IntegerType())
>>> sqlContext.registerFunction("add", lambda x, y: x + y, IntegerType())
>>> sqlContext.sql("SELECT double(add(1, 2)), add(double(2), 1)").explain(True)
== Parsed Logical Plan ==
'Project [unresolvedalias('double('add(1, 2)), None),unresolvedalias('add('double(2), 1), None)]
+- OneRowRelation$

== Analyzed Logical Plan ==
double(add(1, 2)): int, add(double(2), 1): int
Project [double(add(1, 2))#14,add(double(2), 1)#15]
+- Project [double(add(1, 2))#14,add(double(2), 1)#15]
   +- Project [pythonUDF0#16 AS double(add(1, 2))#14,pythonUDF0#18 AS add(double(2), 1)#15]
      +- EvaluatePython [add(pythonUDF1#17, 1)], [pythonUDF0#18]
         +- EvaluatePython [double(add(1, 2)),double(2)], [pythonUDF0#16,pythonUDF1#17]
            +- OneRowRelation$

== Optimized Logical Plan ==
Project [pythonUDF0#16 AS double(add(1, 2))#14,pythonUDF0#18 AS add(double(2), 1)#15]
+- EvaluatePython [add(pythonUDF1#17, 1)], [pythonUDF0#18]
   +- EvaluatePython [double(add(1, 2)),double(2)], [pythonUDF0#16,pythonUDF1#17]
      +- OneRowRelation$

== Physical Plan ==
WholeStageCodegen
:  +- Project [pythonUDF0#16 AS double(add(1, 2))#14,pythonUDF0#18 AS add(double(2), 1)#15]
:     +- INPUT
+- !BatchPythonEvaluation [add(pythonUDF1#17, 1)], [pythonUDF0#16,pythonUDF1#17,pythonUDF0#18]
   +- !BatchPythonEvaluation [double(add(1, 2)),double(2)], [pythonUDF0#16,pythonUDF1#17]
      +- Scan OneRowRelation[]
```

## How was this patch tested?

Added new tests.

Using the following script to benchmark 1, 2 and 3 udfs,
```
df = sqlContext.range(1, 1 << 23, 1, 4)
double = F.udf(lambda x: x * 2, LongType())
print df.select(double(df.id)).count()
print df.select(double(df.id), double(df.id + 1)).count()
print df.select(double(df.id), double(df.id + 1), double(df.id + 2)).count()
```
Here is the results:

N | Before | After  | speed up
---- |------------ | -------------|------
1 | 22 s | 7 s |  3.1X
2 | 38 s | 13 s | 2.9X
3 | 58 s | 16 s | 3.6X

This benchmark ran locally with 4 CPUs. For 3 UDFs, it launched 12 Python before before this patch, 4 process after this patch. After this patch, it will use less memory for multiple UDFs than before (less buffering).

Author: Davies Liu <davies@databricks.com>

Closes #12057 from davies/multi_udfs.
2016-03-31 16:40:20 -07:00
Jo Voordeckers 10508f36ad [SPARK-11327][MESOS] Dispatcher does not respect all args from the Submit request
Supersedes https://github.com/apache/spark/pull/9752

Author: Jo Voordeckers <jo.voordeckers@gmail.com>
Author: Iulian Dragos <jaguarul@gmail.com>

Closes #10370 from jayv/mesos_cluster_params.
2016-03-31 12:08:10 -07:00
Wenchen Fan 0abee534f0 [SPARK-14069][SQL] Improve SparkStatusTracker to also track executor information
## What changes were proposed in this pull request?

Track executor information like host and port, cache size, running tasks.

TODO: tests

## How was this patch tested?

N/A

Author: Wenchen Fan <wenchen@databricks.com>

Closes #11888 from cloud-fan/status-tracker.
2016-03-31 12:07:19 -07:00
jeanlyn 8a333d2da8 [SPARK-14243][CORE] update task metrics when removing blocks
## What changes were proposed in this pull request?

This PR try to use `incUpdatedBlockStatuses ` to update the `updatedBlockStatuses ` when removing blocks, making sure `BlockManager` correctly updates `updatedBlockStatuses`

## How was this patch tested?

test("updated block statuses") in BlockManagerSuite.scala

Author: jeanlyn <jeanlyn92@gmail.com>

Closes #12091 from jeanlyn/updateBlock.
2016-03-31 12:04:42 -07:00
Nishkam Ravi ac1b8b302a [SPARK-13796] Redirect error message to logWarning
## What changes were proposed in this pull request?

Redirect error message to logWarning

## How was this patch tested?

Unit tests, manual tests

JoshRosen

Author: Nishkam Ravi <nishkamravi@gmail.com>

Closes #12052 from nishkamravi2/master_warning.
2016-03-31 12:03:05 -07:00
tedyu e1f6845391 [SPARK-12181] Check Cached unaligned-access capability before using Unsafe
## What changes were proposed in this pull request?

For MemoryMode.OFF_HEAP, Unsafe.getInt etc. are used with no restriction.

However, the Oracle implementation uses these methods only if the class variable unaligned (commented as "Cached unaligned-access capability") is true, which seems to be calculated whether the architecture is i386, x86, amd64, or x86_64.

I think we should perform similar check for the use of Unsafe.

Reference: https://github.com/netty/netty/blob/4.1/common/src/main/java/io/netty/util/internal/PlatformDependent0.java#L112

## How was this patch tested?

Unit test suite

Author: tedyu <yuzhihong@gmail.com>

Closes #11943 from tedyu/master.
2016-03-29 17:16:53 -07:00
Davies Liu a7a93a116d [SPARK-14215] [SQL] [PYSPARK] Support chained Python UDFs
## What changes were proposed in this pull request?

This PR brings the support for chained Python UDFs, for example

```sql
select udf1(udf2(a))
select udf1(udf2(a) + 3)
select udf1(udf2(a) + udf3(b))
```

Also directly chained unary Python UDFs are put in single batch of Python UDFs, others may require multiple batches.

For example,
```python
>>> sqlContext.sql("select double(double(1))").explain()
== Physical Plan ==
WholeStageCodegen
:  +- Project [pythonUDF#10 AS double(double(1))#9]
:     +- INPUT
+- !BatchPythonEvaluation double(double(1)), [pythonUDF#10]
   +- Scan OneRowRelation[]
>>> sqlContext.sql("select double(double(1) + double(2))").explain()
== Physical Plan ==
WholeStageCodegen
:  +- Project [pythonUDF#19 AS double((double(1) + double(2)))#16]
:     +- INPUT
+- !BatchPythonEvaluation double((pythonUDF#17 + pythonUDF#18)), [pythonUDF#17,pythonUDF#18,pythonUDF#19]
   +- !BatchPythonEvaluation double(2), [pythonUDF#17,pythonUDF#18]
      +- !BatchPythonEvaluation double(1), [pythonUDF#17]
         +- Scan OneRowRelation[]
```

TODO: will support multiple unrelated Python UDFs in one batch (another PR).

## How was this patch tested?

Added new unit tests for chained UDFs.

Author: Davies Liu <davies@databricks.com>

Closes #12014 from davies/py_udfs.
2016-03-29 15:06:29 -07:00
Jakob Odersky d26c42982c [SPARK-10570][CORE] Add version info to json api
Add a new api endpoint `/api/v1/version` to retrieve various version info. This PR only adds support for finding the current spark version, however other version info such as jvm or scala versions can easily be added.

Author: Jakob Odersky <jodersky@gmail.com>

Closes #10760 from jodersky/SPARK-10570.
2016-03-29 11:10:15 -07:00
Carson Wang 15c0b0006b [SPARK-14232][WEBUI] Fix event timeline display issue when an executor is removed with a multiple line reason.
## What changes were proposed in this pull request?
The event timeline doesn't show on job page if an executor is removed with a multiple line reason. This PR replaces all new line characters in the reason string with spaces.

![timelineerror](https://cloud.githubusercontent.com/assets/9278199/14100211/5fd4cd30-f5be-11e5-9cea-f32651a4cd62.jpg)

## How was this patch tested?
Verified on the Web UI.

Author: Carson Wang <carson.wang@intel.com>

Closes #12029 from carsonwang/eventTimeline.
2016-03-29 11:07:58 -07:00
Sun Rui d3638d7bff [SPARK-12792] [SPARKR] Refactor RRDD to support R UDF.
## What changes were proposed in this pull request?

Refactor RRDD by separating the common logic interacting with the R worker to a new class RRunner, which can be used to evaluate R UDFs.

Now RRDD relies on RRuner for RDD computation and RRDD could be reomved if we want to remove RDD API in SparkR later.

## How was this patch tested?
dev/lint-r
SparkR unit tests

Author: Sun Rui <rui.sun@intel.com>

Closes #12024 from sun-rui/SPARK-12792_new.
2016-03-28 21:51:02 -07:00
jerryshao 2bc7c96d61 [SPARK-13447][YARN][CORE] Clean the stale states for AM failure and restart situation
## What changes were proposed in this pull request?

This is a follow-up fix of #9963, in #9963 we handle this stale states clean-up work only for dynamic allocation enabled scenario. Here we should also clean the states in `CoarseGrainedSchedulerBackend` for dynamic allocation disabled scenario.

Please review, CC andrewor14 lianhuiwang , thanks a lot.

## How was this patch tested?

Run the unit test locally, also with integration test manually.

Author: jerryshao <sshao@hortonworks.com>

Closes #11366 from jerryshao/SPARK-13447.
2016-03-28 17:03:21 -07:00
jeanlyn ad9e3d50f7 [SPARK-13845][CORE] Using onBlockUpdated to replace onTaskEnd avioding driver OOM
## What changes were proposed in this pull request?

We have a streaming job using `FlumePollInputStream` always driver OOM after few days, here is some driver heap dump before OOM
```
 num     #instances         #bytes  class name
----------------------------------------------
   1:      13845916      553836640  org.apache.spark.storage.BlockStatus
   2:      14020324      336487776  org.apache.spark.storage.StreamBlockId
   3:      13883881      333213144  scala.collection.mutable.DefaultEntry
   4:          8907       89043952  [Lscala.collection.mutable.HashEntry;
   5:         62360       65107352  [B
   6:        163368       24453904  [Ljava.lang.Object;
   7:        293651       20342664  [C
...
```
`BlockStatus` and `StreamBlockId` keep on growing, and the driver OOM in the end.
After investigated, i found the `executorIdToStorageStatus` in `StorageStatusListener` seems never remove the blocks from `StorageStatus`.
In order to fix the issue, i try to use `onBlockUpdated` replace `onTaskEnd ` , so we can update the block informations(add blocks, drop the block from memory to disk and delete the blocks) in time.

## How was this patch tested?

Existing unit tests and manual tests

Author: jeanlyn <jeanlyn92@gmail.com>

Closes #11779 from jeanlyn/fix_driver_oom.
2016-03-28 16:56:25 -07:00
Shixiong Zhu 2f98ee67df [SPARK-14169][CORE] Add UninterruptibleThread
## What changes were proposed in this pull request?

Extract the workaround for HADOOP-10622 introduced by #11940 into UninterruptibleThread so that we can test and reuse it.

## How was this patch tested?

Unit tests

Author: Shixiong Zhu <shixiong@databricks.com>

Closes #11971 from zsxwing/uninterrupt.
2016-03-28 16:29:11 -07:00
Shixiong Zhu 34c0638ee6 [SPARK-14180][CORE] Fix a deadlock in CoarseGrainedExecutorBackend Shutdown
## What changes were proposed in this pull request?

Call `executor.stop` in a new thread to eliminate deadlock.

## How was this patch tested?

Existing unit tests

Author: Shixiong Zhu <shixiong@databricks.com>

Closes #12012 from zsxwing/SPARK-14180.
2016-03-28 16:23:29 -07:00
Davies Liu d7b58f1461 [SPARK-14052] [SQL] build a BytesToBytesMap directly in HashedRelation
## What changes were proposed in this pull request?

Currently, for the key that can not fit within a long,  we build a hash map for UnsafeHashedRelation, it's converted to BytesToBytesMap after serialization and deserialization. We should build a BytesToBytesMap directly to have better memory efficiency.

In order to do that, BytesToBytesMap should support multiple (K,V) pair with the same K,  Location.putNewKey() is renamed to Location.append(), which could append multiple values for the same key (same Location). `Location.newValue()` is added to find the next value for the same key.

## How was this patch tested?

Existing tests. Added benchmark for broadcast hash join with duplicated keys.

Author: Davies Liu <davies@databricks.com>

Closes #11870 from davies/map2.
2016-03-28 13:07:32 -07:00
Davies Liu e5a1b301fb Revert "[SPARK-12792] [SPARKR] Refactor RRDD to support R UDF."
This reverts commit 40984f6706.
2016-03-28 10:21:02 -07:00
Sun Rui 40984f6706 [SPARK-12792] [SPARKR] Refactor RRDD to support R UDF.
Refactor RRDD by separating the common logic interacting with the R worker to a new class RRunner, which can be used to evaluate R UDFs.

Now RRDD relies on RRuner for RDD computation and RRDD could be reomved if we want to remove RDD API in SparkR later.

Author: Sun Rui <rui.sun@intel.com>

Closes #10947 from sun-rui/SPARK-12792.
2016-03-28 10:14:28 -07:00
Liang-Chi Hsieh 68c0c460bf [SPARK-13742] [CORE] Add non-iterator interface to RandomSampler
JIRA: https://issues.apache.org/jira/browse/SPARK-13742

## What changes were proposed in this pull request?

`RandomSampler.sample` currently accepts iterator as input and output another iterator. This makes it inappropriate to use in wholestage codegen of `Sampler` operator #11517. This change is to add non-iterator interface to `RandomSampler`.

This change adds a new method `def sample(): Int` to the trait `RandomSampler`. As we don't need to know the actual values of the sampling items, so this new method takes no arguments.

This method will decide whether to sample the next item or not. It returns how many times the next item will be sampled.

For `BernoulliSampler` and `BernoulliCellSampler`, the returned sampling times can only be 0 or 1. It simply means whether to sample the next item or not.

For `PoissonSampler`, the returned value can be more than 1, meaning the next item will be sampled multiple times.

## How was this patch tested?

Tests are added into `RandomSamplerSuite`.

Author: Liang-Chi Hsieh <simonh@tw.ibm.com>
Author: Liang-Chi Hsieh <viirya@appier.com>
Author: Liang-Chi Hsieh <viirya@gmail.com>

Closes #11578 from viirya/random-sampler-no-iterator.
2016-03-28 09:58:47 -07:00
Josh Rosen 20c0bcd972 [SPARK-14135] Add off-heap storage memory bookkeeping support to MemoryManager
This patch extends Spark's `UnifiedMemoryManager` to add bookkeeping support for off-heap storage memory, an requirement for enabling off-heap caching (which will be done by #11805). The `MemoryManager`'s `storageMemoryPool` has been split into separate on- and off-heap pools and the storage and unroll memory allocation methods have been updated to accept a `memoryMode` parameter to specify whether allocations should be performed on- or off-heap.

In order to reduce the testing surface, the `StaticMemoryManager` does not support off-heap caching (we plan to eventually remove the `StaticMemoryManager`, so this isn't a significant limitation).

Author: Josh Rosen <joshrosen@databricks.com>

Closes #11942 from JoshRosen/off-heap-storage-memory-bookkeeping.
2016-03-26 11:03:25 -07:00
Liwei Lin 62a85eb09f [SPARK-14089][CORE][MLLIB] Remove methods that has been deprecated since 1.1, 1.2, 1.3, 1.4, and 1.5
## What changes were proposed in this pull request?

Removed methods that has been deprecated since 1.1, 1.2, 1.3, 1.4, and 1.5.

## How was this patch tested?

- manully checked that no codes in Spark call these methods any more
- existing test suits

Author: Liwei Lin <lwlin7@gmail.com>
Author: proflin <proflin.me@gmail.com>

Closes #11910 from lw-lin/remove-deprecates.
2016-03-26 12:41:34 +00:00
Dongjoon Hyun 1808465855 [MINOR] Fix newly added java-lint errors
## What changes were proposed in this pull request?

This PR fixes some newly added java-lint errors(unused-imports, line-lengsth).

## How was this patch tested?

Pass the Jenkins tests.

Author: Dongjoon Hyun <dongjoon@apache.org>

Closes #11968 from dongjoon-hyun/SPARK-14167.
2016-03-26 11:55:49 +00:00
Rajesh Balamohan ff7cc45f52 [SPARK-14091][CORE] Improve performance of SparkContext.getCallSite()
Currently SparkContext.getCallSite() makes a call to Utils.getCallSite().

```
 private[spark] def getCallSite(): CallSite = {
    val callSite = Utils.getCallSite()
    CallSite(
      Option(getLocalProperty(CallSite.SHORT_FORM)).getOrElse(callSite.shortForm),
      Option(getLocalProperty(CallSite.LONG_FORM)).getOrElse(callSite.longForm)
    )
  }
```
However, in some places utils.withDummyCallSite(sc) is invoked to avoid expensive threaddumps within getCallSite(). But Utils.getCallSite() is evaluated earlier causing threaddumps to be computed.

This can have severe impact on smaller queries (that finish in 10-20 seconds) having large number of RDDs.

Creating this patch for lazy evaluation of  getCallSite.

No new test cases are added. Following standalone test was tried out manually. Also, built entire spark binary and tried with few SQL queries in TPC-DS  and TPC-H in multi node cluster
```
def run(): Unit = {
    val conf = new SparkConf()
    val sc = new SparkContext("local[1]", "test-context", conf)
    val start: Long = System.currentTimeMillis();
    val confBroadcast = sc.broadcast(new SerializableConfiguration(new Configuration()))
    Utils.withDummyCallSite(sc) {
      //Large tables end up creating 5500 RDDs
      for(i <- 1 to 5000) {
       //ignore nulls in RDD as its mainly for testing callSite
        val testRDD = new HadoopRDD(sc, confBroadcast, None, null,
          classOf[NullWritable], classOf[Writable], 10)
      }
    }
    val end: Long = System.currentTimeMillis();
    println("Time taken : " + (end - start))
  }

def main(args: Array[String]): Unit = {
    run
  }
```

Author: Rajesh Balamohan <rbalamohan@apache.org>

Closes #11911 from rajeshbalamohan/SPARK-14091.
2016-03-25 15:09:52 -07:00
Reynold Xin 70a6f0bb57 [SPARK-14149] Log exceptions in tryOrIOException
## What changes were proposed in this pull request?
We ran into a problem today debugging some class loading problem during deserialization, and JVM was masking the underlying exception which made it very difficult to debug. We can however log the exceptions using try/catch ourselves in serialization/deserialization. The good thing is that all these methods are already using Utils.tryOrIOException, so we can just put the try catch and logging in a single place.

## How was this patch tested?
A logging change with a manual test.

Author: Reynold Xin <rxin@databricks.com>

Closes #11951 from rxin/SPARK-14149.
2016-03-25 01:17:23 -07:00
Josh Rosen fdd460f5f4 [SPARK-13980] Incrementally serialize blocks while unrolling them in MemoryStore
When a block is persisted in the MemoryStore at a serialized storage level, the current MemoryStore.putIterator() code will unroll the entire iterator as Java objects in memory, then will turn around and serialize an iterator obtained from the unrolled array. This is inefficient and doubles our peak memory requirements.

Instead, I think that we should incrementally serialize blocks while unrolling them.

A downside to incremental serialization is the fact that we will need to deserialize the partially-unrolled data in case there is not enough space to unroll the block and the block cannot be dropped to disk. However, I'm hoping that the memory efficiency improvements will outweigh any performance losses as a result of extra serialization in that hopefully-rare case.

Author: Josh Rosen <joshrosen@databricks.com>

Closes #11791 from JoshRosen/serialize-incrementally.
2016-03-24 17:33:21 -07:00
Sean Owen 342079dc45 Revert "[SPARK-2208] Fix for local metrics tests can fail on fast machines". The test appears to still be flaky after this change, or more flaky.
This reverts commit 5519760e0f.
2016-03-24 17:27:20 +00:00
Joan 5519760e0f [SPARK-2208] Fix for local metrics tests can fail on fast machines
## What changes were proposed in this pull request?

A fix for local metrics tests that can fail on fast machines.
This is probably what is suggested here #3380 by aarondav?

## How was this patch tested?

CI Tests

Cheers

Author: Joan <joan@goyeau.com>

Closes #11747 from joan38/SPARK-2208-Local-metrics-tests.
2016-03-24 09:47:44 +00:00
Tejas Patil 01849da080 [SPARK-14110][CORE] PipedRDD to print the command ran on non zero exit
## What changes were proposed in this pull request?

In case of failure in subprocess launched in PipedRDD, the failure exception reads “Subprocess exited with status XXX”. Debugging this is not easy for users especially if there are multiple pipe() operations in the Spark application.

Changes done:
- Changed the exception message when non-zero exit code is seen
- If the reader and writer threads see exception, simply logging the command ran. The current model is to propagate the exception "as is" so that upstream Spark logic will take the right action based on what the exception was (eg. for fetch failure, it needs to retry; but for some fatal exception, it will decide to fail the stage / job). So wrapping the exception with a generic exception will not work. Altering the exception message will keep that guarantee but that is ugly (plus not all exceptions might have a constructor for a string message)

## How was this patch tested?

- Added a new test case
- Ran all existing tests for PipedRDD

Author: Tejas Patil <tejasp@fb.com>

Closes #11927 from tejasapatil/SPARK-14110-piperdd-failure.
2016-03-24 00:31:13 -07:00
Liwei Lin de4e48b62b [SPARK-14025][STREAMING][WEBUI] Fix streaming job descriptions on the event timeline
## What changes were proposed in this pull request?

Removed the extra `<a href=...>...</a>` for each streaming job's description on the event timeline.

### [Before]
![before](https://cloud.githubusercontent.com/assets/15843379/13898653/0a6c1838-ee13-11e5-9761-14bb7b114c13.png)

### [After]
![after](https://cloud.githubusercontent.com/assets/15843379/13898650/012b8808-ee13-11e5-92a6-64aff0799c83.png)

## How was this patch tested?

test suits, manual checks (see screenshots above)

Author: Liwei Lin <proflin.me@gmail.com>
Author: proflin <proflin.me@gmail.com>

Closes #11845 from lw-lin/description-event-line.
2016-03-23 15:15:55 -07:00
Ernest 48ee16d801 [SPARK-14055] writeLocksByTask need to be update when removeBlock
## What changes were proposed in this pull request?

https://issues.apache.org/jira/browse/SPARK-14055

## How was this patch tested?

manual tests by running LiveJournalPageRank on a large dataset ( the dataset must larger enough to incure RDD partition eviction).

Author: Ernest <earneyzxl@gmail.com>

Closes #11875 from Earne/issue-14055.
2016-03-23 10:29:36 -07:00
Josh Rosen 3de24ae2ed [SPARK-14075] Refactor MemoryStore to be testable independent of BlockManager
This patch refactors the `MemoryStore` so that it can be tested without needing to construct / mock an entire `BlockManager`.

- The block manager's serialization- and compression-related methods have been moved from `BlockManager` to `SerializerManager`.
- `BlockInfoManager `is now passed directly to classes that need it, rather than being passed via the `BlockManager`.
- The `MemoryStore` now calls `dropFromMemory` via a new `BlockEvictionHandler` interface rather than directly calling the `BlockManager`. This change helps to enforce a narrow interface between the `MemoryStore` and `BlockManager` functionality and makes this interface easier to mock in tests.
- Several of the block unrolling tests have been moved from `BlockManagerSuite` into a new `MemoryStoreSuite`.

Author: Josh Rosen <joshrosen@databricks.com>

Closes #11899 from JoshRosen/reduce-memorystore-blockmanager-coupling.
2016-03-23 10:15:23 -07:00
Kazuaki Ishizaki 0d51b60443 [SPARK-14072][CORE] Show JVM/OS version information when we run a benchmark program
## What changes were proposed in this pull request?

This PR allows us to identify what JVM is used when someone ran a benchmark program. In some cases, a JVM version may affect performance result. Thus, it would be good to show processor information and JVM version information.

```
model name	: Intel(R) Xeon(R) CPU E5-2697 v2  2.70GHz
JVM information : OpenJDK 64-Bit Server VM, 1.7.0_65-mockbuild_2014_07_14_06_19-b00
Int and String Scan:                Best/Avg Time(ms)    Rate(M/s)   Per Row(ns)   Relative
-------------------------------------------------------------------------------------------
SQL Parquet Vectorized                    981 /  994         10.7          93.5       1.0X
SQL Parquet MR                           2518 / 2542          4.2         240.1       0.4X
```

```
model name	: Intel(R) Xeon(R) CPU E5-2697 v2  2.70GHz
JVM information : IBM J9 VM, pxa6480sr2-20151023_01 (SR2)
String Dictionary:                  Best/Avg Time(ms)    Rate(M/s)   Per Row(ns)   Relative
-------------------------------------------------------------------------------------------
SQL Parquet Vectorized                    693 /  740         15.1          66.1       1.0X
SQL Parquet MR                           2501 / 2562          4.2         238.5       0.3X
```

## How was this patch tested?

Tested by using existing benchmark programs

(If this patch involves UI changes, please attach a screenshot; otherwise, remove this)

Author: Kazuaki Ishizaki <ishizaki@jp.ibm.com>

Closes #11893 from kiszk/SPARK-14072.
2016-03-22 21:01:52 -07:00
Josh Rosen b5f1ab701a [SPARK-13990] Automatically pick serializer when caching RDDs
Building on the `SerializerManager` introduced in SPARK-13926/ #11755, this patch Spark modifies Spark's BlockManager to use RDD's ClassTags in order to select the best serializer to use when caching RDD blocks.

When storing a local block, the BlockManager `put()` methods use implicits to record ClassTags and stores those tags in the blocks' BlockInfo records. When reading a local block, the stored ClassTag is used to pick the appropriate serializer. When a block is stored with replication, the class tag is written into the block transfer metadata and will also be stored in the remote BlockManager.

There are two or three places where we don't properly pass ClassTags, including TorrentBroadcast and BlockRDD. I think this happens to work because the missing ClassTag always happens to be `ClassTag.Any`, but it might be worth looking more carefully at those places to see whether we should be more explicit.

Author: Josh Rosen <joshrosen@databricks.com>

Closes #11801 from JoshRosen/pick-best-serializer-for-caching.
2016-03-21 17:19:39 -07:00
Davies Liu 9b4e15ba13 [SPARK-14007] [SQL] Manage the memory used by hash map in shuffled hash join
## What changes were proposed in this pull request?

This PR try acquire the memory for hash map in shuffled hash join, fail the task if there is no enough memory (otherwise it could OOM the executor).

It also removed unused HashedRelation.

## How was this patch tested?

Existing unit tests. Manual tests with TPCDS Q78.

Author: Davies Liu <davies@databricks.com>

Closes #11826 from davies/cleanup_hash2.
2016-03-21 11:21:39 -07:00
Dongjoon Hyun df61fbd978 [SPARK-13986][CORE][MLLIB] Remove DeveloperApi-annotations for non-publics
## What changes were proposed in this pull request?

Spark uses `DeveloperApi` annotation, but sometimes it seems to conflict with visibility. This PR tries to fix those conflict by removing annotations for non-publics. The following is the example.

**JobResult.scala**
```scala
DeveloperApi
sealed trait JobResult

DeveloperApi
case object JobSucceeded extends JobResult

-DeveloperApi
private[spark] case class JobFailed(exception: Exception) extends JobResult
```

## How was this patch tested?

Pass the existing Jenkins test.

Author: Dongjoon Hyun <dongjoon@apache.org>

Closes #11797 from dongjoon-hyun/SPARK-13986.
2016-03-21 14:57:52 +00:00
Dongjoon Hyun 761c2d1b6e [MINOR][DOCS] Add proper periods and spaces for CLI help messages and config doc.
## What changes were proposed in this pull request?

This PR adds some proper periods and spaces to Spark CLI help messages and SQL/YARN conf docs for consistency.

## How was this patch tested?

Manual.

Author: Dongjoon Hyun <dongjoon@apache.org>

Closes #11848 from dongjoon-hyun/add_proper_period_and_space.
2016-03-21 08:00:09 +00:00
Dongjoon Hyun 20fd254101 [SPARK-14011][CORE][SQL] Enable LineLength Java checkstyle rule
## What changes were proposed in this pull request?

[Spark Coding Style Guide](https://cwiki.apache.org/confluence/display/SPARK/Spark+Code+Style+Guide) has 100-character limit on lines, but it's disabled for Java since 11/09/15. This PR enables **LineLength** checkstyle again. To help that, this also introduces **RedundantImport** and **RedundantModifier**, too. The following is the diff on `checkstyle.xml`.

```xml
-        <!-- TODO: 11/09/15 disabled - the lengths are currently > 100 in many places -->
-        <!--
         <module name="LineLength">
             <property name="max" value="100"/>
             <property name="ignorePattern" value="^package.*|^import.*|a href|href|http://|https://|ftp://"/>
         </module>
-        -->
         <module name="NoLineWrap"/>
         <module name="EmptyBlock">
             <property name="option" value="TEXT"/>
 -167,5 +164,7
         </module>
         <module name="CommentsIndentation"/>
         <module name="UnusedImports"/>
+        <module name="RedundantImport"/>
+        <module name="RedundantModifier"/>
```

## How was this patch tested?

Currently, `lint-java` is disabled in Jenkins. It needs a manual test.
After passing the Jenkins tests, `dev/lint-java` should passes locally.

Author: Dongjoon Hyun <dongjoon@apache.org>

Closes #11831 from dongjoon-hyun/SPARK-14011.
2016-03-21 07:58:57 +00:00
Sital Kedia 2e0c5284fd [SPARK-13958] Executor OOM due to unbounded growth of pointer array in…
## What changes were proposed in this pull request?

This change fixes the executor OOM which was recently introduced in PR apache/spark#11095
(Please fill in changes proposed in this fix)

## How was this patch tested?
Tested by running a spark job on the cluster.
(Please explain how this patch was tested. E.g. unit tests, integration tests, manual tests)

(If this patch involves UI changes, please attach a screenshot; otherwise, remove this)

… Sorter

Author: Sital Kedia <skedia@fb.com>

Closes #11794 from sitalkedia/SPARK-13958.
2016-03-18 12:56:06 -07:00
jerryshao 3537782168 [SPARK-13885][YARN] Fix attempt id regression for Spark running on Yarn
## What changes were proposed in this pull request?

This regression is introduced in #9182, previously attempt id is simply as counter "1" or "2". With the change of #9182, it is changed to full name as "appattemtp-xxx-00001", this will affect all the parts which uses this attempt id, like event log file name, history server app url link. So here change it back to the counter to keep consistent with previous code.

Also revert back this patch #11518, this patch fix the url link of history log according to the new way of attempt id, since here we change back to the previous way, so this patch is not necessary, here to revert it.

Also clean "spark.yarn.app.id" and "spark.yarn.app.attemptId", since it is useless now.

## How was this patch tested?

Test it with unit test and manually test different scenario:

1. application running in yarn-client mode.
2. application running in yarn-cluster mode.
3. application running in yarn-cluster mode with multiple attempts.

Checked both the event log file name and url link.

CC vanzin tgravescs , please help to review, thanks a lot.

Author: jerryshao <sshao@hortonworks.com>

Closes #11721 from jerryshao/SPARK-13885.
2016-03-18 12:39:49 -07:00
Josh Rosen 6c2d894a2f [SPARK-13921] Store serialized blocks as multiple chunks in MemoryStore
This patch modifies the BlockManager, MemoryStore, and several other storage components so that serialized cached blocks are stored as multiple small chunks rather than as a single contiguous ByteBuffer.

This change will help to improve the efficiency of memory allocation and the accuracy of memory accounting when serializing blocks. Our current serialization code uses a ByteBufferOutputStream, which doubles and re-allocates its backing byte array; this increases the peak memory requirements during serialization (since we need to hold extra memory while expanding the array). In addition, we currently don't account for the extra wasted space at the end of the ByteBuffer's backing array, so a 129 megabyte serialized block may actually consume 256 megabytes of memory. After switching to storing blocks in multiple chunks, we'll be able to efficiently trim the backing buffers so that no space is wasted.

This change is also a prerequisite to being able to cache blocks which are larger than 2GB (although full support for that depends on several other changes which have not bee implemented yet).

Author: Josh Rosen <joshrosen@databricks.com>

Closes #11748 from JoshRosen/chunked-block-serialization.
2016-03-17 20:00:56 -07:00
Shixiong Zhu 65b75e66e8 [SPARK-13776][WEBUI] Limit the max number of acceptors and selectors for Jetty
## What changes were proposed in this pull request?

As each acceptor/selector in Jetty will use one thread, the number of threads should at least be the number of acceptors and selectors plus 1. Otherwise, the thread pool of Jetty server may be exhausted by acceptors/selectors and not be able to response any request.

To avoid wasting threads, the PR limits the max number of acceptors and selectors and also updates the max thread number if necessary.

## How was this patch tested?

Just make sure we don't break any existing tests

Author: Shixiong Zhu <shixiong@databricks.com>

Closes #11615 from zsxwing/SPARK-13776.
2016-03-17 13:05:29 +00:00
Wenchen Fan 8ef3399aff [SPARK-13928] Move org.apache.spark.Logging into org.apache.spark.internal.Logging
## What changes were proposed in this pull request?

Logging was made private in Spark 2.0. If we move it, then users would be able to create a Logging trait themselves to avoid changing their own code.

## How was this patch tested?

existing tests.

Author: Wenchen Fan <wenchen@databricks.com>

Closes #11764 from cloud-fan/logger.
2016-03-17 19:23:38 +08:00
trueyao ea9ca6f04c [SPARK-13901][CORE] correct the logDebug information when jump to the next locality level
JIRA Issue:https://issues.apache.org/jira/browse/SPARK-13901
In getAllowedLocalityLevel method of TaskSetManager,we get wrong logDebug information when jump to the next locality level.So we should fix it.

Author: trueyao <501663994@qq.com>

Closes #11719 from trueyao/logDebug-localityWait.
2016-03-17 09:45:06 +00:00
Josh Rosen de1a84e56e [SPARK-13926] Automatically use Kryo serializer when shuffling RDDs with simple types
Because ClassTags are available when constructing ShuffledRDD we can use them to automatically use Kryo for shuffle serialization when the RDD's types are known to be compatible with Kryo.

This patch introduces `SerializerManager`, a component which picks the "best" serializer for a shuffle given the elements' ClassTags. It will automatically pick a Kryo serializer for ShuffledRDDs whose key, value, and/or combiner types are primitives, arrays of primitives, or strings. In the future we can use this class as a narrow extension point to integrate specialized serializers for other types, such as ByteBuffers.

In a planned followup patch, I will extend the BlockManager APIs so that we're able to use similar automatic serializer selection when caching RDDs (this is a little trickier because the ClassTags need to be threaded through many more places).

Author: Josh Rosen <joshrosen@databricks.com>

Closes #11755 from JoshRosen/automatically-pick-best-serializer.
2016-03-16 22:52:55 -07:00
Wesley Tang 5f6bdf97c5 [SPARK-13281][CORE] Switch broadcast of RDD to exception from warning
## What changes were proposed in this pull request?

In SparkContext, throw Illegalargumentexception when trying to broadcast rdd directly, instead of logging the warning.

## How was this patch tested?

mvn clean install
Add UT in BroadcastSuite

Author: Wesley Tang <tangmingjun@mininglamp.com>

Closes #11735 from breakdawn/master.
2016-03-16 16:12:17 +00:00
Tejas Patil 1d95fb6785 [SPARK-13793][CORE] PipedRDD doesn't propagate exceptions while reading parent RDD
## What changes were proposed in this pull request?

PipedRDD creates a child thread to read output of the parent stage and feed it to the pipe process. Used a variable to save the exception thrown in the child thread and then propagating the exception in the main thread if the variable was set.

## How was this patch tested?

- Added a unit test
- Ran all the existing tests in PipedRDDSuite and they all pass with the change
- Tested the patch with a real pipe() job, bounced the executor node which ran the parent stage to simulate a fetch failure and observed that the parent stage was re-ran.

Author: Tejas Patil <tejasp@fb.com>

Closes #11628 from tejasapatil/pipe_rdd.
2016-03-16 09:58:53 +00:00
GayathriMurali 56d88247f1 [SPARK-13396] Stop using our internal deprecated .metrics on Exceptio…
JIRA: https://issues.apache.org/jira/browse/SPARK-13396

Stop using our internal deprecated .metrics on ExceptionFailure instead use accumUpdates

Author: GayathriMurali <gayathri.m.softie@gmail.com>

Closes #11544 from GayathriMurali/SPARK-13396.
2016-03-16 09:39:41 +00:00
Sean Owen 3b461d9ecd [SPARK-13823][SPARK-13397][SPARK-13395][CORE] More warnings, StandardCharset follow up
## What changes were proposed in this pull request?

Follow up to https://github.com/apache/spark/pull/11657

- Also update `String.getBytes("UTF-8")` to use `StandardCharsets.UTF_8`
- And fix one last new Coverity warning that turned up (use of unguarded `wait()` replaced by simpler/more robust `java.util.concurrent` classes in tests)
- And while we're here cleaning up Coverity warnings, just fix about 15 more build warnings

## How was this patch tested?

Jenkins tests

Author: Sean Owen <sowen@cloudera.com>

Closes #11725 from srowen/SPARK-13823.2.
2016-03-16 09:36:34 +00:00