Commit graph

6294 commits

Author SHA1 Message Date
Shixiong Zhu 303f00a4bf [SPARK-19481] [REPL] [MAVEN] Avoid to leak SparkContext in Signaling.cancelOnInterrupt
## What changes were proposed in this pull request?

`Signaling.cancelOnInterrupt` leaks a SparkContext per call and it makes ReplSuite unstable.

This PR adds `SparkContext.getActive` to allow `Signaling.cancelOnInterrupt` to get the active `SparkContext` to avoid the leak.

## How was this patch tested?

Jenkins

Author: Shixiong Zhu <shixiong@databricks.com>

Closes #16825 from zsxwing/SPARK-19481.
2017-02-09 11:16:51 -08:00
José Hiram Soltren 6287c94f08 [SPARK-16554][CORE] Automatically Kill Executors and Nodes when they are Blacklisted
## What changes were proposed in this pull request?

In SPARK-8425, we introduced a mechanism for blacklisting executors and nodes (hosts). After a certain number of failures, these resources would be "blacklisted" and no further work would be assigned to them for some period of time.

In some scenarios, it is better to fail fast, and to simply kill these unreliable resources. This changes proposes to do so by having the BlacklistTracker kill unreliable resources when they would otherwise be "blacklisted".

In order to be thread safe, this code depends on the CoarseGrainedSchedulerBackend sending a message to the driver backend in order to do the actual killing. This also helps to prevent a race which would permit work to begin on a resource (executor or node), between the time the resource is marked for killing and the time at which it is finally killed.

## How was this patch tested?

./dev/run-tests
Ran https://github.com/jsoltren/jose-utils/blob/master/blacklist/test-blacklist.sh, and checked logs to see executors and nodes being killed.

Testing can likely be improved here; suggestions welcome.

Author: José Hiram Soltren <jose@cloudera.com>

Closes #16650 from jsoltren/SPARK-16554-submit.
2017-02-09 12:49:31 -06:00
Marcelo Vanzin 3fc8e8caf8 [SPARK-17874][CORE] Add SSL port configuration.
Make the SSL port configuration explicit, instead of deriving it
from the non-SSL port, but retain the existing functionality in
case anyone depends on it.

The change starts the HTTPS and HTTP connectors separately, so
that it's possible to use independent ports for each. For that to
work, the initialization of the server needs to be shuffled around
a bit. The change also makes it so the initialization of both
connectors is similar, and end up using the same Scheduler - previously
only the HTTP connector would use the correct one.

Also fixed some outdated documentation about a couple of services
that were removed long ago.

Tested with unit tests and by running spark-shell with SSL configs.

Author: Marcelo Vanzin <vanzin@cloudera.com>

Closes #16625 from vanzin/SPARK-17874.
2017-02-09 22:06:46 +09:00
Lee Dongjin 1a09cd6346
[MINOR][CORE] Fix incorrect documentation of WritableConverter
## What changes were proposed in this pull request?

`WritableConverter` and `WritableFactory` work in opposite directions. But both of them are documented with same description:

> A class encapsulating how to convert some type T to Writable. It stores both the Writable class corresponding to T (e.g. IntWritable for Int) and a function for doing the conversion.

This error is a result of commit 2604939. As a note, `WritableFactory` was added from commit d37978d, which resolves [SPARK-4795](https://issues.apache.org/jira/browse/SPARK-4795) with the correct description.

This PR fix the documentation of `WritableConverter`, along with some improvements on type description.

## How was this patch tested?

`build/mvn clean checkstyle:checkstyle`

Author: Lee Dongjin <dongjin@apache.org>

Closes #16830 from dongjinleekr/feature/fix-writableconverter-doc.
2017-02-09 11:18:02 +00:00
Liwei Lin 9d9d67c795 [SPARK-19265][SQL][FOLLOW-UP] Configurable tableRelationCache maximum size
## What changes were proposed in this pull request?

SPARK-19265 had made table relation cache general; this follow-up aims to make `tableRelationCache`'s maximum size configurable.

In order to do sanity-check, this patch also adds a `checkValue()` method to `TypedConfigBuilder`.

## How was this patch tested?

new test case: `test("conf entry: checkValue()")`

Author: Liwei Lin <lwlin7@gmail.com>

Closes #16736 from lw-lin/conf.
2017-02-09 00:48:47 -05:00
Sean Owen e8d3fca450
[SPARK-19464][CORE][YARN][TEST-HADOOP2.6] Remove support for Hadoop 2.5 and earlier
## What changes were proposed in this pull request?

- Remove support for Hadoop 2.5 and earlier
- Remove reflection and code constructs only needed to support multiple versions at once
- Update docs to reflect newer versions
- Remove older versions' builds and profiles.

## How was this patch tested?

Existing tests

Author: Sean Owen <sowen@cloudera.com>

Closes #16810 from srowen/SPARK-19464.
2017-02-08 12:20:07 +00:00
zuotingbing 8fd178d215
[SPARK-19260] Spaces or "%20" in path parameter are not correctly handled with…
JIRA Issue: https://issues.apache.org/jira/browse/SPARK-19260

## What changes were proposed in this pull request?

1. “spark.history.fs.logDirectory” supports with space character and “%20” characters.
2. As usually, if the run classpath includes hdfs-site.xml and core-site.xml files, the supplied path eg."/test" which does not contain a scheme should be taken as a HDFS path rather than a local path since the path parameter is a Hadoop dir.

## How was this patch tested?
Update Unit Test and take some manual tests

local:
.sbin/start-history-server.sh "file:/a b"
.sbin/start-history-server.sh "/abc%20c" (without hdfs-site.xml,core-site.xml)
.sbin/start-history-server.sh "/a b" (without hdfs-site.xml,core-site.xml)
.sbin/start-history-server.sh "/a b/a bc%20c" (without hdfs-site.xml,core-site.xml)

hdfs:
.sbin/start-history-server.sh "hdfs:/namenode:9000/a b"
.sbin/start-history-server.sh "/a b" (with hdfs-site.xml,core-site.xml)
.sbin/start-history-server.sh "/a b/a bc%20c" (with hdfs-site.xml,core-site.xml)

Author: zuotingbing <zuo.tingbing9@zte.com.cn>

Closes #16614 from zuotingbing/SPARK-19260.
2017-02-07 12:21:36 +00:00
Imran Rashid d9043092ca [SPARK-18967][SCHEDULER] compute locality levels even if delay = 0
## What changes were proposed in this pull request?

Before this change, with delay scheduling off, spark would effectively
ignore locality preferences for bulk scheduling.  With this change,
locality preferences are used when multiple offers are made
simultaneously.

## How was this patch tested?

Test case added which fails without this change.  All unit tests run via jenkins.

Author: Imran Rashid <irashid@cloudera.com>

Closes #16376 from squito/locality_without_delay.
2017-02-06 22:37:37 -08:00
Jin Xing d33021b312 [SPARK-19398] Change one misleading log in TaskSetManager.
## What changes were proposed in this pull request?

Log below is misleading:

```
if (successful(index)) {
  logInfo(
    s"Task ${info.id} in stage ${taskSet.id} (TID $tid) failed, " +
    "but another instance of the task has already succeeded, " +
    "so not re-queuing the task to be re-executed.")
}
```

If fetch failed, the task is marked as successful in `TaskSetManager:: handleFailedTask`. Then log above will be printed. The `successful` just means task will not be scheduled any longer, not a real success.

## How was this patch tested?
Existing unit tests can cover this.

Author: jinxing <jinxing@meituan.com>

Closes #16738 from jinxing64/SPARK-19398.
2017-02-06 10:51:26 -08:00
erenavsarogullari 7beb227cc8 [SPARK-17663][CORE] SchedulableBuilder should handle invalid data access via scheduler.allocation.file
## What changes were proposed in this pull request?

If `spark.scheduler.allocation.file` has invalid `minShare` or/and `weight` values, these cause :
- `NumberFormatException` due to `toInt` function
- `SparkContext` can not be initialized.
- It does not show meaningful error message to user.

In a nutshell, this functionality can be more robust by selecting one of the following flows :

**1-** Currently, if `schedulingMode` has an invalid value, a warning message is logged and default value is set as `FIFO`. Same pattern can be used for `minShare`(default: 0) and `weight`(default: 1) as well
**2-** Meaningful error message can be shown to the user for all invalid cases.

PR offers :
- `schedulingMode` handles just empty values. It also needs to be supported for **whitespace**, **non-uppercase**(fair, FaIr etc...) or `SchedulingMode.NONE` cases by setting default value(`FIFO`)
- `minShare` and `weight` handle just empty values. They also need to be supported for **non-integer** cases by setting default values.
- Some refactoring of `PoolSuite`.

**Code to Reproduce :**

```
val conf = new SparkConf().setAppName("spark-fairscheduler").setMaster("local")
conf.set("spark.scheduler.mode", "FAIR")
conf.set("spark.scheduler.allocation.file", "src/main/resources/fairscheduler-invalid-data.xml")
val sc = new SparkContext(conf)
```

**fairscheduler-invalid-data.xml :**

```
<allocations>
    <pool name="production">
        <schedulingMode>FIFO</schedulingMode>
        <weight>invalid_weight</weight>
        <minShare>2</minShare>
    </pool>
</allocations>
```

**Stacktrace :**

```
Exception in thread "main" java.lang.NumberFormatException: For input string: "invalid_weight"
    at java.lang.NumberFormatException.forInputString(NumberFormatException.java:65)
    at java.lang.Integer.parseInt(Integer.java:580)
    at java.lang.Integer.parseInt(Integer.java:615)
    at scala.collection.immutable.StringLike$class.toInt(StringLike.scala:272)
    at scala.collection.immutable.StringOps.toInt(StringOps.scala:29)
    at org.apache.spark.scheduler.FairSchedulableBuilder$$anonfun$org$apache$spark$scheduler$FairSchedulableBuilder$$buildFairSchedulerPool$1.apply(SchedulableBuilder.scala:127)
    at org.apache.spark.scheduler.FairSchedulableBuilder$$anonfun$org$apache$spark$scheduler$FairSchedulableBuilder$$buildFairSchedulerPool$1.apply(SchedulableBuilder.scala:102)
```
## How was this patch tested?

Added Unit Test Case.

Author: erenavsarogullari <erenavsarogullari@gmail.com>

Closes #15237 from erenavsarogullari/SPARK-17663.
2017-02-06 08:24:17 -06:00
Reynold Xin 22d4aae8be [SPARK-10063] Follow-up: remove dead code related to an old output committer.
## What changes were proposed in this pull request?
DirectParquetOutputCommitter was removed from Spark as it was deemed unsafe to use. We however still have some code to generate warning. This patch removes those code as well.

## How was this patch tested?
N/A

Author: Reynold Xin <rxin@databricks.com>

Closes #16796 from rxin/remove-direct.
2017-02-03 21:12:20 -08:00
Liang-Chi Hsieh 2f523fa0c9 [SPARK-19244][CORE] Sort MemoryConsumers according to their memory usage when spilling
## What changes were proposed in this pull request?

In `TaskMemoryManager `, when we acquire memory by calling `acquireExecutionMemory` and we can't acquire required memory, we will try to spill other memory consumers.

Currently, we simply iterates the memory consumers in a hash set. Normally each time the consumer will be iterated in the same order.

The first issue is that we might spill additional consumers. For example, if consumer 1 uses 10MB, consumer 2 uses 50MB, then consumer 3 acquires 100MB but we can only get 60MB and spilling is needed. We might spill both consumer 1 and consumer 2. But we actually just need to spill consumer 2 and get the required 100MB.

The second issue is that if we spill consumer 1 in first time spilling. After a while, consumer 1 now uses 5MB. Then consumer 4 may acquire some memory and spilling is needed again. Because we iterate the memory consumers in the same order, we will spill consumer 1 again. So for consumer 1, we will produce many small spilling files.

This patch modifies the way iterating the memory consumers. It sorts the memory consumers by their memory usage. So the consumer using more memory will spill first. Once it is spilled, even it acquires few memory again, in next time spilling happens it will not be the consumers to spill again if there are other consumers using more memory than it.

## How was this patch tested?

Jenkins tests.

Please review http://spark.apache.org/contributing.html before opening a pull request.

Author: Liang-Chi Hsieh <viirya@gmail.com>

Closes #16603 from viirya/sort-memoryconsumer-when-spill.
2017-02-03 06:14:10 -08:00
jinxing c86a57f4d1 [SPARK-19437] Rectify spark executor id in HeartbeatReceiverSuite.
## What changes were proposed in this pull request?

The current code in `HeartbeatReceiverSuite`, executorId is set as below:
```
  private val executorId1 = "executor-1"
  private val executorId2 = "executor-2"
```

The executorId is sent to driver when register as below:

```
test("expire dead hosts should kill executors with replacement (SPARK-8119)")  {
  ...
  fakeSchedulerBackend.driverEndpoint.askSync[Boolean](
      RegisterExecutor(executorId1, dummyExecutorEndpointRef1, "1.2.3.4", 0, Map.empty))
  ...
}
```

Receiving `RegisterExecutor` in `CoarseGrainedSchedulerBackend`, the executorId will be compared with `currentExecutorIdCounter` as below:
```
case RegisterExecutor(executorId, executorRef, hostname, cores, logUrls)  =>
  if (executorDataMap.contains(executorId)) {
    executorRef.send(RegisterExecutorFailed("Duplicate executor ID: " + executorId))
    context.reply(true)
  } else {
  ...
  executorDataMap.put(executorId, data)
  if (currentExecutorIdCounter < executorId.toInt) {
    currentExecutorIdCounter = executorId.toInt
  }
  ...
```

`executorId.toInt` will cause NumberformatException.

This unit test can pass currently because of `askWithRetry`, when catching exception, RPC will call again, thus it will go `if` branch and return true.

**To fix**
Rectify executorId and replace `askWithRetry` with `askSync`, refer to https://github.com/apache/spark/pull/16690
## How was this patch tested?
This fix is for unit test and no need to add another one.(If this patch involves UI changes, please attach a screenshot; otherwise, remove this)

Author: jinxing <jinxing@meituan.com>

Closes #16779 from jinxing64/SPARK-19437.
2017-02-02 23:18:16 -08:00
Shixiong Zhu 8303e20c45 [SPARK-19432][CORE] Fix an unexpected failure when connecting timeout
## What changes were proposed in this pull request?

When connecting timeout, `ask` may fail with a confusing message:

```
17/02/01 23:15:19 INFO Worker: Connecting to master ...
java.lang.IllegalArgumentException: requirement failed: TransportClient has not yet been set.
        at scala.Predef$.require(Predef.scala:224)
        at org.apache.spark.rpc.netty.RpcOutboxMessage.onTimeout(Outbox.scala:70)
        at org.apache.spark.rpc.netty.NettyRpcEnv$$anonfun$ask$1.applyOrElse(NettyRpcEnv.scala:232)
        at org.apache.spark.rpc.netty.NettyRpcEnv$$anonfun$ask$1.applyOrElse(NettyRpcEnv.scala:231)
        at scala.concurrent.Future$$anonfun$onFailure$1.apply(Future.scala:138)
        at scala.concurrent.Future$$anonfun$onFailure$1.apply(Future.scala:136)
        at scala.concurrent.impl.CallbackRunnable.run(Promise.scala:32)
```

It's better to provide a meaningful message.

## How was this patch tested?

Jenkins

Author: Shixiong Zhu <shixiong@databricks.com>

Closes #16773 from zsxwing/connect-timeout.
2017-02-01 21:39:21 -08:00
jinxing c5fcb7f68b [SPARK-19347] ReceiverSupervisorImpl can add block to ReceiverTracker multiple times because of askWithRetry.
## What changes were proposed in this pull request?

`ReceiverSupervisorImpl` on executor side reports block's meta back to `ReceiverTracker` on driver side. In current code, `askWithRetry` is used. However, for `AddBlock`, `ReceiverTracker` is not idempotent, which may result in messages are processed multiple times.

**To reproduce**:

1. Check if it is the first time receiving `AddBlock` in `ReceiverTracker`, if so sleep long enough(say 200 seconds), thus the first RPC call will be timeout in `askWithRetry`, then `AddBlock` will be resent.
2. Rebuild Spark and run following job:
```
  def streamProcessing(): Unit = {
    val conf = new SparkConf()
      .setAppName("StreamingTest")
      .setMaster(masterUrl)
    val ssc = new StreamingContext(conf, Seconds(200))
    val stream = ssc.socketTextStream("localhost", 1234)
    stream.print()
    ssc.start()
    ssc.awaitTermination()
  }
```
**To fix**:

It makes sense to provide a blocking version `ask` in RpcEndpointRef, as mentioned in SPARK-18113 (https://github.com/apache/spark/pull/16503#event-927953218). Because Netty RPC layer will not drop messages. `askWithRetry` is a leftover from akka days. It imposes restrictions on the caller(e.g. idempotency) and other things that people generally don't pay that much attention to when using it.

## How was this patch tested?
Test manually. The scenario described above doesn't happen with this patch.

Author: jinxing <jinxing@meituan.com>

Closes #16690 from jinxing64/SPARK-19347.
2017-02-01 13:54:37 -08:00
Devaraj K df4a27cc5c [SPARK-19377][WEBUI][CORE] Killed tasks should have the status as KILLED
## What changes were proposed in this pull request?

Copying of the killed status was missing while getting the newTaskInfo object by dropping the unnecessary details to reduce the memory usage. This patch adds the copying of the killed status to newTaskInfo object, this will correct the display of the status from wrong status to KILLED status in Web UI.

## How was this patch tested?

Current behaviour of displaying tasks in stage UI page,

| Index | ID | Attempt | Status | Locality Level | Executor ID / Host | Launch Time | Duration | GC Time | Input Size / Records | Write Time | Shuffle Write Size / Records | Errors |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
|143	|10	|0	|SUCCESS	|NODE_LOCAL	|6 / x.xx.x.x stdout stderr|2017/01/25 07:49:27	|0 ms |		|0.0 B / 0		| |0.0 B / 0	|TaskKilled (killed intentionally)|
|156	|11	|0	|SUCCESS	|NODE_LOCAL	|5 / x.xx.x.x stdout stderr|2017/01/25 07:49:27	|0 ms |		|0.0 B / 0		| |0.0 B / 0	|TaskKilled (killed intentionally)|

Web UI display after applying the patch,

| Index | ID | Attempt | Status | Locality Level | Executor ID / Host | Launch Time | Duration | GC Time | Input Size / Records | Write Time | Shuffle Write Size / Records | Errors |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
|143	|10	|0	|KILLED	|NODE_LOCAL	|6 / x.xx.x.x stdout stderr|2017/01/25 07:49:27	|0 ms |		|0.0 B / 0		|  | 0.0 B / 0	| TaskKilled (killed intentionally)|
|156	|11	|0	|KILLED	|NODE_LOCAL	|5 / x.xx.x.x stdout stderr|2017/01/25 07:49:27	|0 ms |		|0.0 B / 0		|  |0.0 B / 0	| TaskKilled (killed intentionally)|

Author: Devaraj K <devaraj@apache.org>

Closes #16725 from devaraj-kavali/SPARK-19377.
2017-02-01 12:55:11 -08:00
hyukjinkwon f1a1f2607d
[SPARK-19402][DOCS] Support LaTex inline formula correctly and fix warnings in Scala/Java APIs generation
## What changes were proposed in this pull request?

This PR proposes three things as below:

- Support LaTex inline-formula, `\( ... \)` in Scala API documentation
  It seems currently,

  ```
  \( ... \)
  ```

  are rendered as they are, for example,

  <img width="345" alt="2017-01-30 10 01 13" src="https://cloud.githubusercontent.com/assets/6477701/22423960/ab37d54a-e737-11e6-9196-4f6229c0189c.png">

  It seems mistakenly more backslashes were added.

- Fix warnings Scaladoc/Javadoc generation
  This PR fixes t two types of warnings as below:

  ```
  [warn] .../spark/sql/catalyst/src/main/scala/org/apache/spark/sql/Row.scala:335: Could not find any member to link for "UnsupportedOperationException".
  [warn]   /**
  [warn]   ^
  ```

  ```
  [warn] .../spark/sql/core/src/main/scala/org/apache/spark/sql/internal/VariableSubstitution.scala:24: Variable var undefined in comment for class VariableSubstitution in class VariableSubstitution
  [warn]  * `${var}`, `${system:var}` and `${env:var}`.
  [warn]      ^
  ```

- Fix Javadoc8 break
  ```
  [error] .../spark/mllib/target/java/org/apache/spark/ml/PredictionModel.java:7: error: reference not found
  [error]  *                       E.g., {link VectorUDT} for vector features.
  [error]                                       ^
  [error] .../spark/mllib/target/java/org/apache/spark/ml/PredictorParams.java:12: error: reference not found
  [error]    *                          E.g., {link VectorUDT} for vector features.
  [error]                                            ^
  [error] .../spark/mllib/target/java/org/apache/spark/ml/Predictor.java:10: error: reference not found
  [error]  *                       E.g., {link VectorUDT} for vector features.
  [error]                                       ^
  [error] .../spark/sql/hive/target/java/org/apache/spark/sql/hive/HiveAnalysis.java:5: error: reference not found
  [error]  * Note that, this rule must be run after {link PreprocessTableInsertion}.
  [error]                                                  ^
  ```

## How was this patch tested?

Manually via `sbt unidoc` and `jeykil build`.

Author: hyukjinkwon <gurwls223@gmail.com>

Closes #16741 from HyukjinKwon/warn-and-break.
2017-02-01 13:26:16 +00:00
Shixiong Zhu 21aa8c32ba [SPARK-19365][CORE] Optimize RequestMessage serialization
## What changes were proposed in this pull request?

Right now Netty PRC serializes `RequestMessage` using Java serialization, and the size of a single message (e.g., RequestMessage(..., "hello")`) is almost 1KB.

This PR optimizes it by serializing `RequestMessage` manually (eliminate unnecessary information from most messages, e.g., class names of `RequestMessage`, `NettyRpcEndpointRef`, ...), and reduces the above message size to 100+ bytes.

## How was this patch tested?

Jenkins

I did a simple test to measure the improvement:

Before
```
$ bin/spark-shell --master local-cluster[1,4,1024]
...
scala> for (i <- 1 to 10) {
     |   val start = System.nanoTime
     |   val s = sc.parallelize(1 to 1000000, 10 * 1000).count()
     |   val end = System.nanoTime
     |   println(s"$i\t" + ((end - start)/1000/1000))
     | }
1       6830
2       4353
3       3322
4       3107
5       3235
6       3139
7       3156
8       3166
9       3091
10      3029
```
After:
```
$ bin/spark-shell --master local-cluster[1,4,1024]
...
scala> for (i <- 1 to 10) {
     |   val start = System.nanoTime
     |   val s = sc.parallelize(1 to 1000000, 10 * 1000).count()
     |   val end = System.nanoTime
     |   println(s"$i\t" + ((end - start)/1000/1000))
     | }
1       6431
2       3643
3       2913
4       2679
5       2760
6       2710
7       2747
8       2793
9       2679
10      2651
```

I also captured the TCP packets for this test. Before this patch, the total size of TCP packets is ~1.5GB. After it, it reduces to ~1.2GB.

Author: Shixiong Zhu <shixiong@databricks.com>

Closes #16706 from zsxwing/rpc-opt.
2017-01-27 15:07:57 -08:00
Takeshi YAMAMURO 7045b8b355
[CORE][DOCS] Update a help message for --files in spark-submit
## What changes were proposed in this pull request?
This pr is to update a help message for `--files` in spark-submit because it seems users get confused about how to get full paths of the files that one adds via the option.

Author: Takeshi YAMAMURO <linguin.m.s@gmail.com>

Closes #16698 from maropu/SparkFilesDoc.
2017-01-26 10:15:58 +00:00
Marcelo Vanzin d3dcb63b97 [SPARK-19220][UI] Make redirection to HTTPS apply to all URIs.
The redirect handler was installed only for the root of the server;
any other context ended up being served directly through the HTTP
port. Since every sub page (e.g. application UIs in the history
server) is a separate servlet context, this meant that everything
but the root was accessible via HTTP still.

The change adds separate names to each connector, and binds contexts
to specific connectors so that content is only served through the
HTTPS connector when it's enabled. In that case, the only thing that
binds to the HTTP connector is the redirect handler.

Tested with new unit tests and by checking a live history server.

Author: Marcelo Vanzin <vanzin@cloudera.com>

Closes #16582 from vanzin/SPARK-19220.
2017-01-26 17:24:44 +09:00
uncleGen 2338451266 [SPARK-18495][UI] Document meaning of green dot in DAG visualization
## What changes were proposed in this pull request?

A green dot in the DAG visualization apparently means that the referenced RDD is cached. This is not documented anywhere except in this [blog post](https://databricks.com/blog/2015/06/22/understanding-your-spark-application-through-visualization.html).
It would be good if the Web UI itself documented this somehow (perhaps in the tooltip?) so that the user can naturally learn what it means while using the Web UI.

before pr:

![dingtalk20170125180158](https://cloud.githubusercontent.com/assets/7402327/22286167/37910ea2-e329-11e6-9aae-03dce6fceee2.png)
![dingtalk20170125180218](https://cloud.githubusercontent.com/assets/7402327/22286178/421f3132-e329-11e6-8283-ba6bbd15cfb0.png)

after pr:
![dingtalk20170125175704](https://cloud.githubusercontent.com/assets/7402327/22286216/56a0050a-e329-11e6-813d-2be96b5ab7f1.png)
![dingtalk20170125175749](https://cloud.githubusercontent.com/assets/7402327/22286233/618cd646-e329-11e6-9ce5-10322b169dcb.png)

## How was this patch tested?

Author: uncleGen <hustyugm@gmail.com>

Closes #16702 from uncleGen/SPARK-18495.
2017-01-25 17:18:13 -08:00
Tathagata Das 47d5d0ddb0 [SPARK-14804][SPARK][GRAPHX] Fix checkpointing of VertexRDD/EdgeRDD
## What changes were proposed in this pull request?

EdgeRDD/VertexRDD overrides checkpoint() and isCheckpointed() to forward these to the internal partitionRDD. So when checkpoint() is called on them, its the partitionRDD that actually gets checkpointed. However since isCheckpointed() also overridden to call partitionRDD.isCheckpointed, EdgeRDD/VertexRDD.isCheckpointed returns true even though this RDD is actually not checkpointed.

This would have been fine except the RDD's internal logic for computing the RDD depends on isCheckpointed(). So for VertexRDD/EdgeRDD, since isCheckpointed is true, when computing Spark tries to read checkpoint data of VertexRDD/EdgeRDD even though they are not actually checkpointed. Through a crazy sequence of call forwarding, it reads checkpoint data of partitionsRDD and tries to cast it to types in Vertex/EdgeRDD. This leads to ClassCastException.

The minimal fix that does not change any public behavior is to modify RDD internal to not use public override-able API for internal logic.
## How was this patch tested?

New unit tests.

Author: Tathagata Das <tathagata.das1565@gmail.com>

Closes #15396 from tdas/SPARK-14804.
2017-01-25 17:17:34 -08:00
Marcelo Vanzin 8f3f73abc1 [SPARK-19139][CORE] New auth mechanism for transport library.
This change introduces a new auth mechanism to the transport library,
to be used when users enable strong encryption. This auth mechanism
has better security than the currently used DIGEST-MD5.

The new protocol uses symmetric key encryption to mutually authenticate
the endpoints, and is very loosely based on ISO/IEC 9798.

The new protocol falls back to SASL when it thinks the remote end is old.
Because SASL does not support asking the server for multiple auth protocols,
which would mean we could re-use the existing SASL code by just adding a
new SASL provider, the protocol is implemented outside of the SASL API
to avoid the boilerplate of adding a new provider.

Details of the auth protocol are discussed in the included README.md
file.

This change partly undos the changes added in SPARK-13331; AES encryption
is now decoupled from SASL authentication. The encryption code itself,
though, has been re-used as part of this change.

## How was this patch tested?

- Unit tests
- Tested Spark 2.2 against Spark 1.6 shuffle service with SASL enabled
- Tested Spark 2.2 against Spark 2.2 shuffle service with SASL fallback disabled

Author: Marcelo Vanzin <vanzin@cloudera.com>

Closes #16521 from vanzin/SPARK-19139.
2017-01-24 10:44:04 -08:00
Parag Chaudhari 0ff67a1cf9 [SPARK-14049][CORE] Add functionality in spark history sever API to query applications by end time
## What changes were proposed in this pull request?

Currently, spark history server REST API provides functionality to query applications by application start time range based on minDate and maxDate query parameters, but it  lacks support to query applications by their end time. In this pull request we are proposing optional minEndDate and maxEndDate query parameters and filtering capability based on these parameters to spark history server REST API. This functionality can be used for following queries,
1. Applications finished in last 'x' minutes
2. Applications finished before 'y' time
3. Applications finished between 'x' time to 'y' time
4. Applications started from 'x' time and finished before 'y' time.

For backward compatibility, we can keep existing minDate and maxDate query parameters as they are and they can continue support filtering based on start time range.
## How was this patch tested?

Existing unit tests and 4 new unit tests.

Author: Parag Chaudhari <paragpc@amazon.com>

Closes #11867 from paragpc/master-SHS-query-by-endtime_2.
2017-01-24 08:41:46 -06:00
jerryshao e4974721f3 [SPARK-19306][CORE] Fix inconsistent state in DiskBlockObject when expection occurred
## What changes were proposed in this pull request?

In `DiskBlockObjectWriter`, when some errors happened during writing, it will call `revertPartialWritesAndClose`, if this method again failed due to some issues like out of disk, it will throw exception without resetting the state of this writer, also skipping the revert. So here propose to fix this issue to offer user a chance to recover from such issue.

## How was this patch tested?

Existing test.

Author: jerryshao <sshao@hortonworks.com>

Closes #16657 from jerryshao/SPARK-19306.
2017-01-23 13:36:41 -08:00
Yuming Wang c99492141b
[SPARK-19146][CORE] Drop more elements when stageData.taskData.size > retainedTasks
## What changes were proposed in this pull request?

Drop more elements when `stageData.taskData.size > retainedTasks` to reduce the number of times on call drop function.

## How was this patch tested?

Jenkins

Author: Yuming Wang <wgyumg@gmail.com>

Closes #16527 from wangyum/SPARK-19146.
2017-01-23 11:02:22 +00:00
hyukjinkwon 6113fe78a5
[SPARK-19117][SPARK-18922][TESTS] Fix the rest of flaky, newly introduced and missed test failures on Windows
## What changes were proposed in this pull request?

**Failed tests**

```
org.apache.spark.sql.hive.execution.HiveQuerySuite:
 - transform with SerDe3 *** FAILED ***
 - transform with SerDe4 *** FAILED ***
```

```
org.apache.spark.sql.hive.execution.HiveDDLSuite:
 - create hive serde table with new syntax *** FAILED ***
 - add/drop partition with location - managed table *** FAILED ***
```

```
org.apache.spark.sql.hive.ParquetMetastoreSuite:
 - Explicitly added partitions should be readable after load *** FAILED ***
 - Non-partitioned table readable after load *** FAILED ***
```

**Aborted tests**

```
Exception encountered when attempting to run a suite with class name: org.apache.spark.sql.hive.execution.HiveSerDeSuite *** ABORTED *** (157 milliseconds)
   org.apache.spark.sql.AnalysisException: LOAD DATA input path does not exist: C:projectssparksqlhive   argetscala-2.11   est-classesdatafilessales.txt;
```

**Flaky tests(failed 9ish out of 10)**

```
org.apache.spark.scheduler.SparkListenerSuite:
 - local metrics *** FAILED ***
```

## How was this patch tested?

Manually tested via AppVeyor.

**Failed tests**

```
org.apache.spark.sql.hive.execution.HiveQuerySuite:
 - transform with SerDe3 !!! CANCELED !!! (0 milliseconds)
 - transform with SerDe4 !!! CANCELED !!! (0 milliseconds)
```

```
org.apache.spark.sql.hive.execution.HiveDDLSuite:
 - create hive serde table with new syntax (1 second, 672 milliseconds)
 - add/drop partition with location - managed table (2 seconds, 391 milliseconds)
```

```
org.apache.spark.sql.hive.ParquetMetastoreSuite:
 - Explicitly added partitions should be readable after load (609 milliseconds)
 - Non-partitioned table readable after load (344 milliseconds)
```

**Aborted tests**

```
spark.sql.hive.execution.HiveSerDeSuite:
 - Read with RegexSerDe (2 seconds, 142 milliseconds)
 - Read and write with LazySimpleSerDe (tab separated) (2 seconds)
 - Read with AvroSerDe (1 second, 47 milliseconds)
 - Read Partitioned with AvroSerDe (1 second, 422 milliseconds)
```

**Flaky tests (failed 9ish out of 10)**

```
org.apache.spark.scheduler.SparkListenerSuite:
 - local metrics (4 seconds, 562 milliseconds)
```

Author: hyukjinkwon <gurwls223@gmail.com>

Closes #16586 from HyukjinKwon/set-path-appveyor.
2017-01-21 14:08:01 +00:00
Xin Ren bcdabaac93
[SPARK-17724][STREAMING][WEBUI] Unevaluated new lines in tooltip in DAG Visualization of a job
https://issues.apache.org/jira/browse/SPARK-17724

## What changes were proposed in this pull request?
For unevaluated `\n`, evaluate it and enable line break, for Streaming WebUI `stages` page and `job` page.
(I didn't change Scala source file, since Jetty server has to somehow indicate line break and js to code display it.)
(This PR is a continue from previous PR https://github.com/apache/spark/pull/15353 for the same issue, sorry being so long time)

Two changes:

1. RDD Node tooltipText is actually showing the `<circle>`  `title` property, so I set extra attribute in `spark-dag-viz.js`: `.attr("data-html", "true")`

`<circle x="-5" y="-5" r="5" data-toggle="tooltip" data-placement="bottom" title="" data-original-title="ParallelCollectionRDD [9]\nmakeRDD at QueueStream.scala:49"></circle>`

2. Static `<tspan>` text of each stage, split by `/n`, and append an extra `<tspan>` element to its parentNode

`<text><tspan xml:space="preserve" dy="1em" x="1">reduceByKey</tspan><tspan xml:space="preserve" dy="1em" x="1">reduceByKey/n 23:34:49</tspan></text>
`

## UI changes
Screenshot **before fix**, `\n` is not evaluated in both circle tooltipText and static text:
![screen shot 2017-01-19 at 12 21 54 am](https://cloud.githubusercontent.com/assets/3925641/22098829/53c7f49c-dddd-11e6-9daa-b3ddb6044114.png)

Screenshot **after fix**:
![screen shot 2017-01-19 at 12 20 30 am](https://cloud.githubusercontent.com/assets/3925641/22098806/294910d4-dddd-11e6-9948-d942e09f545e.png)

## How was this patch tested?
Tested locally. For Streaming WebUI `stages` page and `job` page, on multiple browsers:
- Chrome
- Firefox
- Safari

Author: Xin Ren <renxin.ubc@gmail.com>

Closes #16643 from keypointt/SPARK-17724-2nd.
2017-01-21 13:55:35 +00:00
Parag Chaudhari e20d9b1565 [SPARK-19069][CORE] Expose task 'status' and 'duration' in spark history server REST API.
## What changes were proposed in this pull request?

Although Spark history server UI shows task ‘status’ and ‘duration’ fields, it does not expose these fields in the REST API response. For the Spark history server API users, it is not possible to determine task status and duration. Spark history server has access to task status and duration from event log, but it is not exposing these in API. This patch is proposed to expose task ‘status’ and ‘duration’ fields in Spark history server REST API.

## How was this patch tested?

Modified existing test cases in org.apache.spark.deploy.history.HistoryServerSuite.

Author: Parag Chaudhari <paragpc@amazon.com>

Closes #16473 from paragpc/expose_task_status.
2017-01-20 10:49:05 -06:00
José Hiram Soltren 640f942337 [SPARK-16654][CORE] Add UI coverage for Application Level Blacklisting
Builds on top of work in SPARK-8425 to update Application Level Blacklisting in the scheduler.

## What changes were proposed in this pull request?

Adds a UI to these patches by:
- defining new listener events for blacklisting and unblacklisting, nodes and executors;
- sending said events at the relevant points in BlacklistTracker;
- adding JSON (de)serialization code for these events;
- augmenting the Executors UI page to show which, and how many, executors are blacklisted;
- adding a unit test to make sure events are being fired;
- adding HistoryServerSuite coverage to verify that the SHS reads these events correctly.
- updates the Executor UI to show Blacklisted/Active/Dead as a tri-state in Executors Status

Updates .rat-excludes to pass tests.

username squito

## How was this patch tested?

./dev/run-tests
testOnly org.apache.spark.util.JsonProtocolSuite
testOnly org.apache.spark.scheduler.BlacklistTrackerSuite
testOnly org.apache.spark.deploy.history.HistoryServerSuite
https://github.com/jsoltren/jose-utils/blob/master/blacklist/test-blacklist.sh
![blacklist-20161219](https://cloud.githubusercontent.com/assets/1208477/21335321/9eda320a-c623-11e6-8b8c-9c912a73c276.jpg)

Author: José Hiram Soltren <jose@cloudera.com>

Closes #16346 from jsoltren/SPARK-16654-submit.
2017-01-19 09:08:18 -06:00
jinxing 33791a8ced [SPARK-18113] Use ask to replace askWithRetry in canCommit and make receiver idempotent.
## What changes were proposed in this pull request?

Method canCommit sends AskPermissionToCommitOutput using askWithRetry. If timeout, it will send again. Thus AskPermissionToCommitOutput can be received multi times. Method canCommit should return the same value when called by the same attempt multi times.

In implementation before this fix, method handleAskPermissionToCommit just check if there is committer already registered, which is not enough. When worker retries AskPermissionToCommitOutput it will get CommitDeniedException, then the task will fail with reason TaskCommitDenied, which is not regarded as a task failure(SPARK-11178), so TaskScheduler will schedule this task infinitely.

In this fix, use `ask` to replace `askWithRetry` in `canCommit` and make receiver idempotent.

## How was this patch tested?

Added a new unit test to OutputCommitCoordinatorSuite.

Author: jinxing <jinxing@meituan.com>

Closes #16503 from jinxing64/SPARK-18113.
2017-01-18 10:47:22 -08:00
Liang-Chi Hsieh d06172b88e [SPARK-19223][SQL][PYSPARK] Fix InputFileBlockHolder for datasources which are based on HadoopRDD or NewHadoopRDD
## What changes were proposed in this pull request?

For some datasources which are based on HadoopRDD or NewHadoopRDD, such as spark-xml, InputFileBlockHolder doesn't work with Python UDF.

The method to reproduce it is, running the following codes with `bin/pyspark --packages com.databricks:spark-xml_2.11:0.4.1`:

    from pyspark.sql.functions import udf,input_file_name
    from pyspark.sql.types import StringType
    from pyspark.sql import SparkSession

    def filename(path):
        return path

    session = SparkSession.builder.appName('APP').getOrCreate()

    session.udf.register('sameText', filename)
    sameText = udf(filename, StringType())

    df = session.read.format('xml').load('a.xml', rowTag='root').select('*', input_file_name().alias('file'))
    df.select('file').show() # works
    df.select(sameText(df['file'])).show()   # returns empty content

The issue is because in `HadoopRDD` and `NewHadoopRDD` we set the file block's info in `InputFileBlockHolder` before the returned iterator begins consuming. `InputFileBlockHolder` will record this info into thread local variable. When running Python UDF in batch, we set up another thread to consume the iterator from child plan's output rdd, so we can't read the info back in another thread.

To fix this, we have to set the info in `InputFileBlockHolder` after the iterator begins consuming. So the info can be read in correct thread.

## How was this patch tested?

Manual test with above example codes for spark-xml package on pyspark: `bin/pyspark --packages com.databricks:spark-xml_2.11:0.4.1`.

Added pyspark test.

Please review http://spark.apache.org/contributing.html before opening a pull request.

Author: Liang-Chi Hsieh <viirya@gmail.com>

Closes #16585 from viirya/fix-inputfileblock-hadooprdd.
2017-01-18 23:06:44 +08:00
uncleGen eefdf9f9dd
[SPARK-19227][SPARK-19251] remove unused imports and outdated comments
## What changes were proposed in this pull request?
remove ununsed imports and outdated comments, and fix some minor code style issue.

## How was this patch tested?
existing ut

Author: uncleGen <hustyugm@gmail.com>

Closes #16591 from uncleGen/SPARK-19227.
2017-01-18 09:44:32 +00:00
Wenchen Fan 4494cd9716 [SPARK-18243][SQL] Port Hive writing to use FileFormat interface
## What changes were proposed in this pull request?

Inserting data into Hive tables has its own implementation that is distinct from data sources: `InsertIntoHiveTable`, `SparkHiveWriterContainer` and `SparkHiveDynamicPartitionWriterContainer`.

Note that one other major difference is that data source tables write directly to the final destination without using some staging directory, and then Spark itself adds the partitions/tables to the catalog. Hive tables actually write to some staging directory, and then call Hive metastore's loadPartition/loadTable function to load those data in. So we still need to keep `InsertIntoHiveTable` to put this special logic. In the future, we should think of writing to the hive table location directly, so that we don't need to call `loadTable`/`loadPartition` at the end and remove `InsertIntoHiveTable`.

This PR removes `SparkHiveWriterContainer` and `SparkHiveDynamicPartitionWriterContainer`, and create a `HiveFileFormat` to implement the write logic. In the future, we should also implement the read logic in `HiveFileFormat`.

## How was this patch tested?

existing tests

Author: Wenchen Fan <wenchen@databricks.com>

Closes #16517 from cloud-fan/insert-hive.
2017-01-17 23:37:59 -08:00
jerryshao b79cc7ceb4 [SPARK-19179][YARN] Change spark.yarn.access.namenodes config and update docs
## What changes were proposed in this pull request?

`spark.yarn.access.namenodes` configuration cannot actually reflects the usage of it, inside the code it is the Hadoop filesystems we get tokens, not NNs. So here propose to update the name of this configuration, also change the related code and doc.

## How was this patch tested?

Local verification.

Author: jerryshao <sshao@hortonworks.com>

Closes #16560 from jerryshao/SPARK-19179.
2017-01-17 09:30:56 -06:00
hyukjinkwon 6c00c069e3
[SPARK-3249][DOC] Fix links in ScalaDoc that cause warning messages in sbt/sbt unidoc
## What changes were proposed in this pull request?

This PR proposes to fix ambiguous link warnings by simply making them as code blocks for both javadoc and scaladoc.

```
[warn] .../spark/core/src/main/scala/org/apache/spark/Accumulator.scala:20: The link target "SparkContext#accumulator" is ambiguous. Several members fit the target:
[warn] .../spark/mllib/src/main/scala/org/apache/spark/mllib/optimization/GradientDescent.scala:281: The link target "runMiniBatchSGD" is ambiguous. Several members fit the target:
[warn] .../spark/mllib/src/main/scala/org/apache/spark/mllib/fpm/AssociationRules.scala:83: The link target "run" is ambiguous. Several members fit the target:
...
```

This PR also fixes javadoc8 break as below:

```
[error] .../spark/sql/core/target/java/org/apache/spark/sql/LowPrioritySQLImplicits.java:7: error: reference not found
[error]  * newProductEncoder - to disambiguate for {link List}s which are both {link Seq} and {link Product}
[error]                                                   ^
[error] .../spark/sql/core/target/java/org/apache/spark/sql/LowPrioritySQLImplicits.java:7: error: reference not found
[error]  * newProductEncoder - to disambiguate for {link List}s which are both {link Seq} and {link Product}
[error]                                                                                ^
[error] .../spark/sql/core/target/java/org/apache/spark/sql/LowPrioritySQLImplicits.java:7: error: reference not found
[error]  * newProductEncoder - to disambiguate for {link List}s which are both {link Seq} and {link Product}
[error]                                                                                                ^
[info] 3 errors
```

## How was this patch tested?

Manually via `sbt unidoc > output.txt` and the checked it via `cat output.txt | grep ambiguous`

and `sbt unidoc | grep error`.

Author: hyukjinkwon <gurwls223@gmail.com>

Closes #16604 from HyukjinKwon/SPARK-3249.
2017-01-17 12:28:15 +00:00
Nick Lavers 0019005a2d
[SPARK-19219][SQL] Fix Parquet log output defaults
## What changes were proposed in this pull request?

Changing the default parquet logging levels to reflect the changes made in PR [#15538](https://github.com/apache/spark/pull/15538), in order to prevent the flood of log messages by default.

## How was this patch tested?

Default log output when reading from parquet 1.6 files was compared with and without this change. The change eliminates the extraneous logging and makes the output readable.

Author: Nick Lavers <nick.lavers@videoamp.com>

Closes #16580 from nicklavers/spark-19219-set_default_parquet_log_level.
2017-01-17 12:14:38 +00:00
Liang-Chi Hsieh 61e48f52d1 [SPARK-19082][SQL] Make ignoreCorruptFiles work for Parquet
## What changes were proposed in this pull request?

We have a config `spark.sql.files.ignoreCorruptFiles` which can be used to ignore corrupt files when reading files in SQL. Currently the `ignoreCorruptFiles` config has two issues and can't work for Parquet:

1. We only ignore corrupt files in `FileScanRDD` . Actually, we begin to read those files as early as inferring data schema from the files. For corrupt files, we can't read the schema and fail the program. A related issue reported at http://apache-spark-developers-list.1001551.n3.nabble.com/Skip-Corrupted-Parquet-blocks-footer-tc20418.html
2. In `FileScanRDD`, we assume that we only begin to read the files when starting to consume the iterator. However, it is possibly the files are read before that. In this case, `ignoreCorruptFiles` config doesn't work too.

This patch targets Parquet datasource. If this direction is ok, we can address the same issue for other datasources like Orc.

Two main changes in this patch:

1. Replace `ParquetFileReader.readAllFootersInParallel` by implementing the logic to read footers in multi-threaded manner

    We can't ignore corrupt files if we use `ParquetFileReader.readAllFootersInParallel`. So this patch implements the logic to do the similar thing in `readParquetFootersInParallel`.

2. In `FileScanRDD`, we need to ignore corrupt file too when we call `readFunction` to return iterator.

One thing to notice is:

We read schema from Parquet file's footer. The method to read footer `ParquetFileReader.readFooter` throws `RuntimeException`, instead of `IOException`, if it can't successfully read the footer. Please check out df9d8e4154/parquet-hadoop/src/main/java/org/apache/parquet/hadoop/ParquetFileReader.java (L470). So this patch catches `RuntimeException`.  One concern is that it might also shadow other runtime exceptions other than reading corrupt files.

## How was this patch tested?

Jenkins tests.

Please review http://spark.apache.org/contributing.html before opening a pull request.

Author: Liang-Chi Hsieh <viirya@gmail.com>

Closes #16474 from viirya/fix-ignorecorrupted-parquet-files.
2017-01-16 15:26:41 +08:00
xiaojian.fxj c9d612f82c
[SPARK-19042] spark executor can't download the jars when uber jar's http url contains any query strings
If the uber jars' https contains any query strings, the Executor.updateDependencies method can't can't download the jars correctly. This is because  the "localName = name.split("/").last" won't get the expected jar's url. The bug fix is the same as [SPARK-17855]

Author: xiaojian.fxj <xiaojian.fxj@alibaba-inc.com>

Closes #16509 from hustfxj/bug.
2017-01-15 11:12:59 +00:00
Felix Cheung 2bc4d4e286 [SPARK-12757][CORE] lower "block locks were not released" log to info level
## What changes were proposed in this pull request?

lower "block locks were not released" log to info level, as it is generating a lot of warnings in running ML, graph calls, as pointed out in the JIRA.

Author: Felix Cheung <felixcheung_m@hotmail.com>

Closes #16513 from felixcheung/blocklockswarn.
2017-01-12 09:45:16 -08:00
Eric Liang c71b25481a [SPARK-19183][SQL] Add deleteWithJob hook to internal commit protocol API
## What changes were proposed in this pull request?

Currently in SQL we implement overwrites by calling fs.delete() directly on the original data. This is not ideal since we the original files end up deleted even if the job aborts. We should extend the commit protocol to allow file overwrites to be managed as well.

## How was this patch tested?

Existing tests. I also fixed a bunch of tests that were depending on the commit protocol implementation being set to the legacy mapreduce one.

cc rxin cloud-fan

Author: Eric Liang <ekl@databricks.com>
Author: Eric Liang <ekhliang@gmail.com>

Closes #16554 from ericl/add-delete-protocol.
2017-01-12 17:45:55 +08:00
Bryan Cutler 3bc2eff888 [SPARK-17568][CORE][DEPLOY] Add spark-submit option to override ivy settings used to resolve packages/artifacts
## What changes were proposed in this pull request?

Adding option in spark-submit to allow overriding the default IvySettings used to resolve artifacts as part of the Spark Packages functionality.  This will allow all artifact resolution to go through a central managed repository, such as Nexus or Artifactory, where site admins can better approve and control what is used with Spark apps.

This change restructures the creation of the IvySettings object in two distinct ways.  First, if the `spark.ivy.settings` option is not defined then `buildIvySettings` will create a default settings instance, as before, with defined repositories (Maven Central) included.  Second, if the option is defined, the ivy settings file will be loaded from the given path and only repositories defined within will be used for artifact resolution.
## How was this patch tested?

Existing tests for default behaviour, Manual tests that load a ivysettings.xml file with local and Nexus repositories defined.  Added new test to load a simple Ivy settings file with a local filesystem resolver.

Author: Bryan Cutler <cutlerb@gmail.com>
Author: Ian Hummel <ian@themodernlife.net>

Closes #15119 from BryanCutler/spark-custom-IvySettings.
2017-01-11 11:57:38 -08:00
hyukjinkwon 2cfd41ac02
[SPARK-19117][TESTS] Skip the tests using script transformation on Windows
## What changes were proposed in this pull request?

This PR proposes to skip the tests for script transformation failed on Windows due to fixed bash location.

```
SQLQuerySuite:
 - script *** FAILED *** (553 milliseconds)
   org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 56.0 failed 1 times, most recent failure: Lost task 0.0 in stage 56.0 (TID 54, localhost, executor driver): java.io.IOException: Cannot run program "/bin/bash": CreateProcess error=2, The system cannot find the file specified

 - Star Expansion - script transform *** FAILED *** (2 seconds, 375 milliseconds)
   org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 389.0 failed 1 times, most recent failure: Lost task 0.0 in stage 389.0 (TID 725, localhost, executor driver): java.io.IOException: Cannot run program "/bin/bash": CreateProcess error=2, The system cannot find the file specified

 - test script transform for stdout *** FAILED *** (2 seconds, 813 milliseconds)
   org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 391.0 failed 1 times, most recent failure: Lost task 0.0 in stage 391.0 (TID 726, localhost, executor driver): java.io.IOException: Cannot run program "/bin/bash": CreateProcess error=2, The system cannot find the file specified

 - test script transform for stderr *** FAILED *** (2 seconds, 407 milliseconds)
   org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 393.0 failed 1 times, most recent failure: Lost task 0.0 in stage 393.0 (TID 727, localhost, executor driver): java.io.IOException: Cannot run program "/bin/bash": CreateProcess error=2, The system cannot find the file specified

 - test script transform data type *** FAILED *** (171 milliseconds)
   org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 395.0 failed 1 times, most recent failure: Lost task 0.0 in stage 395.0 (TID 728, localhost, executor driver): java.io.IOException: Cannot run program "/bin/bash": CreateProcess error=2, The system cannot find the file specified
```

```
HiveQuerySuite:
 - transform *** FAILED *** (359 milliseconds)
   Failed to execute query using catalyst:
   Error: Job aborted due to stage failure: Task 0 in stage 1347.0 failed 1 times, most recent failure: Lost task 0.0 in stage 1347.0 (TID 2395, localhost, executor driver): java.io.IOException: Cannot run program "/bin/bash": CreateProcess error=2, The system cannot find the file specified

 - schema-less transform *** FAILED *** (344 milliseconds)
   Failed to execute query using catalyst:
   Error: Job aborted due to stage failure: Task 0 in stage 1348.0 failed 1 times, most recent failure: Lost task 0.0 in stage 1348.0 (TID 2396, localhost, executor driver): java.io.IOException: Cannot run program "/bin/bash": CreateProcess error=2, The system cannot find the file specified

 - transform with custom field delimiter *** FAILED *** (296 milliseconds)
   Failed to execute query using catalyst:
   Error: Job aborted due to stage failure: Task 0 in stage 1349.0 failed 1 times, most recent failure: Lost task 0.0 in stage 1349.0 (TID 2397, localhost, executor driver): java.io.IOException: Cannot run program "/bin/bash": CreateProcess error=2, The system cannot find the file specified

 - transform with custom field delimiter2 *** FAILED *** (297 milliseconds)
   Failed to execute query using catalyst:
   Error: Job aborted due to stage failure: Task 0 in stage 1350.0 failed 1 times, most recent failure: Lost task 0.0 in stage 1350.0 (TID 2398, localhost, executor driver): java.io.IOException: Cannot run program "/bin/bash": CreateProcess error=2, The system cannot find the file specified

 - transform with custom field delimiter3 *** FAILED *** (312 milliseconds)
   Failed to execute query using catalyst:
   Error: Job aborted due to stage failure: Task 0 in stage 1351.0 failed 1 times, most recent failure: Lost task 0.0 in stage 1351.0 (TID 2399, localhost, executor driver): java.io.IOException: Cannot run program "/bin/bash": CreateProcess error=2, The system cannot find the file specified

 - transform with SerDe2 *** FAILED *** (437 milliseconds)
   org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 1355.0 failed 1 times, most recent failure: Lost task 0.0 in stage 1355.0 (TID 2403, localhost, executor driver): java.io.IOException: Cannot run program "/bin/bash": CreateProcess error=2, The system cannot find the file specified
```

```
LogicalPlanToSQLSuite:
 - script transformation - schemaless *** FAILED *** (78 milliseconds)
   ...
   Cause: org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 1968.0 failed 1 times, most recent failure: Lost task 0.0 in stage 1968.0 (TID 3932, localhost, executor driver): java.io.IOException: Cannot run program "/bin/bash": CreateProcess error=2, The system cannot find the file specified
  - script transformation - alias list *** FAILED *** (94 milliseconds)
   ...
   Cause: org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 1969.0 failed 1 times, most recent failure: Lost task 0.0 in stage 1969.0 (TID 3933, localhost, executor driver): java.io.IOException: Cannot run program "/bin/bash": CreateProcess error=2, The system cannot find the file specified

 - script transformation - alias list with type *** FAILED *** (93 milliseconds)
   ...
   Cause: org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 1970.0 failed 1 times, most recent failure: Lost task 0.0 in stage 1970.0 (TID 3934, localhost, executor driver): java.io.IOException: Cannot run program "/bin/bash": CreateProcess error=2, The system cannot find the file specified

 - script transformation - row format delimited clause with only one format property *** FAILED *** (78 milliseconds)
   ...
   Cause: org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 1971.0 failed 1 times, most recent failure: Lost task 0.0 in stage 1971.0 (TID 3935, localhost, executor driver): java.io.IOException: Cannot run program "/bin/bash": CreateProcess error=2, The system cannot find the file specified

 - script transformation - row format delimited clause with multiple format properties *** FAILED *** (94 milliseconds)
   ...
   Cause: org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 1972.0 failed 1 times, most recent failure: Lost task 0.0 in stage 1972.0 (TID 3936, localhost, executor driver): java.io.IOException: Cannot run program "/bin/bash": CreateProcess error=2, The system cannot find the file specified

 - script transformation - row format serde clauses with SERDEPROPERTIES *** FAILED *** (78 milliseconds)
   ...
   Cause: org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 1973.0 failed 1 times, most recent failure: Lost task 0.0 in stage 1973.0 (TID 3937, localhost, executor driver): java.io.IOException: Cannot run program "/bin/bash": CreateProcess error=2, The system cannot find the file specified

 - script transformation - row format serde clauses without SERDEPROPERTIES *** FAILED *** (78 milliseconds)
   ...
   Cause: org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 1974.0 failed 1 times, most recent failure: Lost task 0.0 in stage 1974.0 (TID 3938, localhost, executor driver): java.io.IOException: Cannot run program "/bin/bash": CreateProcess error=2, The system cannot find the file specified
```

```
ScriptTransformationSuite:
 - cat without SerDe *** FAILED *** (156 milliseconds)
   ...
   Caused by: java.io.IOException: Cannot run program "/bin/bash": CreateProcess error=2, The system cannot find the file specified

 - cat with LazySimpleSerDe *** FAILED *** (63 milliseconds)
    ...
    org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 2383.0 failed 1 times, most recent failure: Lost task 0.0 in stage 2383.0 (TID 4819, localhost, executor driver): java.io.IOException: Cannot run program "/bin/bash": CreateProcess error=2, The system cannot find the file specified

 - script transformation should not swallow errors from upstream operators (no serde) *** FAILED *** (78 milliseconds)
    ...
    org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 2384.0 failed 1 times, most recent failure: Lost task 0.0 in stage 2384.0 (TID 4820, localhost, executor driver): java.io.IOException: Cannot run program "/bin/bash": CreateProcess error=2, The system cannot find the file specified

 - script transformation should not swallow errors from upstream operators (with serde) *** FAILED *** (47 milliseconds)
    ...
    org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 2385.0 failed 1 times, most recent failure: Lost task 0.0 in stage 2385.0 (TID 4821, localhost, executor driver): java.io.IOException: Cannot run program "/bin/bash": CreateProcess error=2, The system cannot find the file specified

 - SPARK-14400 script transformation should fail for bad script command *** FAILED *** (47 milliseconds)
   "Job aborted due to stage failure: Task 0 in stage 2386.0 failed 1 times, most recent failure: Lost task 0.0 in stage 2386.0 (TID 4822, localhost, executor driver): java.io.IOException: Cannot run program "/bin/bash": CreateProcess error=2, The system cannot find the file specified
```

## How was this patch tested?

AppVeyor as below:

```
SQLQuerySuite:
  - script !!! CANCELED !!! (63 milliseconds)
  - Star Expansion - script transform !!! CANCELED !!! (0 milliseconds)
  - test script transform for stdout !!! CANCELED !!! (0 milliseconds)
  - test script transform for stderr !!! CANCELED !!! (0 milliseconds)
  - test script transform data type !!! CANCELED !!! (0 milliseconds)
```

```
HiveQuerySuite:
  - transform !!! CANCELED !!! (31 milliseconds)
  - schema-less transform !!! CANCELED !!! (0 milliseconds)
  - transform with custom field delimiter !!! CANCELED !!! (0 milliseconds)
  - transform with custom field delimiter2 !!! CANCELED !!! (0 milliseconds)
  - transform with custom field delimiter3 !!! CANCELED !!! (0 milliseconds)
  - transform with SerDe2 !!! CANCELED !!! (0 milliseconds)
```

```
LogicalPlanToSQLSuite:
  - script transformation - schemaless !!! CANCELED !!! (78 milliseconds)
  - script transformation - alias list !!! CANCELED !!! (0 milliseconds)
  - script transformation - alias list with type !!! CANCELED !!! (0 milliseconds)
  - script transformation - row format delimited clause with only one format property !!! CANCELED !!! (15 milliseconds)
  - script transformation - row format delimited clause with multiple format properties !!! CANCELED !!! (0 milliseconds)
  - script transformation - row format serde clauses with SERDEPROPERTIES !!! CANCELED !!! (0 milliseconds)
  - script transformation - row format serde clauses without SERDEPROPERTIES !!! CANCELED !!! (0 milliseconds)
```

```
ScriptTransformationSuite:
  - cat without SerDe !!! CANCELED !!! (62 milliseconds)
  - cat with LazySimpleSerDe !!! CANCELED !!! (0 milliseconds)
  - script transformation should not swallow errors from upstream operators (no serde) !!! CANCELED !!! (0 milliseconds)
  - script transformation should not swallow errors from upstream operators (with serde) !!! CANCELED !!! (0 milliseconds)
  - SPARK-14400 script transformation should fail for bad script command !!! CANCELED !!! (0 milliseconds)
```

Jenkins tests

Author: hyukjinkwon <gurwls223@gmail.com>

Closes #16501 from HyukjinKwon/windows-bash.
2017-01-10 13:22:35 +00:00
hyukjinkwon 4e27578faa
[SPARK-18922][SQL][CORE][STREAMING][TESTS] Fix all identified tests failed due to path and resource-not-closed problems on Windows
## What changes were proposed in this pull request?

This PR proposes to fix all the test failures identified by testing with AppVeyor.

**Scala - aborted tests**

```
WindowQuerySuite:
  Exception encountered when attempting to run a suite with class name: org.apache.spark.sql.hive.execution.WindowQuerySuite *** ABORTED *** (156 milliseconds)
   org.apache.spark.sql.AnalysisException: LOAD DATA input path does not exist: C:projectssparksqlhive   argetscala-2.11   est-classesdatafilespart_tiny.txt;

OrcSourceSuite:
 Exception encountered when attempting to run a suite with class name: org.apache.spark.sql.hive.orc.OrcSourceSuite *** ABORTED *** (62 milliseconds)
   org.apache.spark.sql.AnalysisException: org.apache.hadoop.hive.ql.metadata.HiveException: MetaException(message:java.lang.IllegalArgumentException: Can not create a Path from an empty string);

ParquetMetastoreSuite:
 Exception encountered when attempting to run a suite with class name: org.apache.spark.sql.hive.ParquetMetastoreSuite *** ABORTED *** (4 seconds, 703 milliseconds)
   org.apache.spark.sql.AnalysisException: org.apache.hadoop.hive.ql.metadata.HiveException: MetaException(message:java.lang.IllegalArgumentException: Can not create a Path from an empty string);

ParquetSourceSuite:
 Exception encountered when attempting to run a suite with class name: org.apache.spark.sql.hive.ParquetSourceSuite *** ABORTED *** (3 seconds, 907 milliseconds)
   org.apache.spark.sql.AnalysisException: Path does not exist: file:/C:projectsspark  arget mpspark-581a6575-454f-4f21-a516-a07f95266143;

KafkaRDDSuite:
 Exception encountered when attempting to run a suite with class name: org.apache.spark.streaming.kafka.KafkaRDDSuite *** ABORTED *** (5 seconds, 212 milliseconds)
   java.io.IOException: Failed to delete: C:\projects\spark\target\tmp\spark-4722304d-213e-4296-b556-951df1a46807

DirectKafkaStreamSuite:
 Exception encountered when attempting to run a suite with class name: org.apache.spark.streaming.kafka.DirectKafkaStreamSuite *** ABORTED *** (7 seconds, 127 milliseconds)
   java.io.IOException: Failed to delete: C:\projects\spark\target\tmp\spark-d0d3eba7-4215-4e10-b40e-bb797e89338e
   at org.apache.spark.util.Utils$.deleteRecursively(Utils.scala:1010)

ReliableKafkaStreamSuite
 Exception encountered when attempting to run a suite with class name: org.apache.spark.streaming.kafka.ReliableKafkaStreamSuite *** ABORTED *** (5 seconds, 498 milliseconds)
   java.io.IOException: Failed to delete: C:\projects\spark\target\tmp\spark-d33e45a0-287e-4bed-acae-ca809a89d888

KafkaStreamSuite:
 Exception encountered when attempting to run a suite with class name: org.apache.spark.streaming.kafka.KafkaStreamSuite *** ABORTED *** (2 seconds, 892 milliseconds)
   java.io.IOException: Failed to delete: C:\projects\spark\target\tmp\spark-59c9d169-5a56-4519-9ef0-cefdbd3f2e6c

KafkaClusterSuite:
 Exception encountered when attempting to run a suite with class name: org.apache.spark.streaming.kafka.KafkaClusterSuite *** ABORTED *** (1 second, 690 milliseconds)
   java.io.IOException: Failed to delete: C:\projects\spark\target\tmp\spark-3ef402b0-8689-4a60-85ae-e41e274f179d

DirectKafkaStreamSuite:
 Exception encountered when attempting to run a suite with class name: org.apache.spark.streaming.kafka010.DirectKafkaStreamSuite *** ABORTED *** (59 seconds, 626 milliseconds)
   java.io.IOException: Failed to delete: C:\projects\spark\target\tmp\spark-426107da-68cf-4d94-b0d6-1f428f1c53f6

KafkaRDDSuite:
Exception encountered when attempting to run a suite with class name: org.apache.spark.streaming.kafka010.KafkaRDDSuite *** ABORTED *** (2 minutes, 6 seconds)
   java.io.IOException: Failed to delete: C:\projects\spark\target\tmp\spark-b9ce7929-5dae-46ab-a0c4-9ef6f58fbc2
```

**Java - failed tests**

```
Test org.apache.spark.streaming.kafka.JavaKafkaRDDSuite.testKafkaRDD failed: java.io.IOException: Failed to delete: C:\projects\spark\target\tmp\spark-1cee32f4-4390-4321-82c9-e8616b3f0fb0, took 9.61 sec

Test org.apache.spark.streaming.kafka.JavaKafkaStreamSuite.testKafkaStream failed: java.io.IOException: Failed to delete: C:\projects\spark\target\tmp\spark-f42695dd-242e-4b07-847c-f299b8e4676e, took 11.797 sec

Test org.apache.spark.streaming.kafka.JavaDirectKafkaStreamSuite.testKafkaStream failed: java.io.IOException: Failed to delete: C:\projects\spark\target\tmp\spark-85c0d062-78cf-459c-a2dd-7973572101ce, took 1.581 sec

Test org.apache.spark.streaming.kafka010.JavaKafkaRDDSuite.testKafkaRDD failed: java.io.IOException: Failed to delete: C:\projects\spark\target\tmp\spark-49eb6b5c-8366-47a6-83f2-80c443c48280, took 17.895 sec

org.apache.spark.streaming.kafka010.JavaDirectKafkaStreamSuite.testKafkaStream failed: java.io.IOException: Failed to delete: C:\projects\spark\target\tmp\spark-898cf826-d636-4b1c-a61a-c12a364c02e7, took 8.858 sec
```

**Scala - failed tests**

```
PartitionProviderCompatibilitySuite:
 - insert overwrite partition of new datasource table overwrites just partition *** FAILED *** (828 milliseconds)
   java.io.IOException: Failed to delete: C:\projects\spark\target\tmp\spark-bb6337b9-4f99-45ab-ad2c-a787ab965c09

 - SPARK-18635 special chars in partition values - partition management true *** FAILED *** (5 seconds, 360 milliseconds)
   org.apache.spark.sql.AnalysisException: org.apache.hadoop.hive.ql.metadata.HiveException: MetaException(message:java.lang.IllegalArgumentException: Can not create a Path from an empty string);

 - SPARK-18635 special chars in partition values - partition management false *** FAILED *** (141 milliseconds)
   org.apache.spark.sql.AnalysisException: org.apache.hadoop.hive.ql.metadata.HiveException: MetaException(message:java.lang.IllegalArgumentException: Can not create a Path from an empty string);
```

```
UtilsSuite:
 - reading offset bytes of a file (compressed) *** FAILED *** (0 milliseconds)
   java.io.IOException: Failed to delete: C:\projects\spark\target\tmp\spark-ecb2b7d5-db8b-43a7-b268-1bf242b5a491

 - reading offset bytes across multiple files (compressed) *** FAILED *** (0 milliseconds)
   java.io.IOException: Failed to delete: C:\projects\spark\target\tmp\spark-25cc47a8-1faa-4da5-8862-cf174df63ce0
```

```
StatisticsSuite:
 - MetastoreRelations fallback to HDFS for size estimation *** FAILED *** (110 milliseconds)
   org.apache.spark.sql.catalyst.analysis.NoSuchTableException: Table or view 'csv_table' not found in database 'default';
```

```
SQLQuerySuite:
 - permanent UDTF *** FAILED *** (125 milliseconds)
   org.apache.spark.sql.AnalysisException: Undefined function: 'udtf_count_temp'. This function is neither a registered temporary function nor a permanent function registered in the database 'default'.; line 1 pos 24

 - describe functions - user defined functions *** FAILED *** (125 milliseconds)
   org.apache.spark.sql.AnalysisException: Undefined function: 'udtf_count'. This function is neither a registered temporary function nor a permanent function registered in the database 'default'.; line 1 pos 7

 - CTAS without serde with location *** FAILED *** (16 milliseconds)
   java.lang.IllegalArgumentException: java.net.URISyntaxException: Relative path in absolute URI: file:C:projectsspark%09arget%09mpspark-ed673d73-edfc-404e-829e-2e2b9725d94e/c1

 - derived from Hive query file: drop_database_removes_partition_dirs.q *** FAILED *** (47 milliseconds)
   java.lang.IllegalArgumentException: java.net.URISyntaxException: Relative path in absolute URI: file:C:projectsspark%09arget%09mpspark-d2ddf08e-699e-45be-9ebd-3dfe619680fe/drop_database_removes_partition_dirs_table

 - derived from Hive query file: drop_table_removes_partition_dirs.q *** FAILED *** (0 milliseconds)
   java.lang.IllegalArgumentException: java.net.URISyntaxException: Relative path in absolute URI: file:C:projectsspark%09arget%09mpspark-d2ddf08e-699e-45be-9ebd-3dfe619680fe/drop_table_removes_partition_dirs_table2

 - SPARK-17796 Support wildcard character in filename for LOAD DATA LOCAL INPATH *** FAILED *** (109 milliseconds)
   java.nio.file.InvalidPathException: Illegal char <:> at index 2: /C:/projects/spark/sql/hive/projectsspark	arget	mpspark-1a122f8c-dfb3-46c4-bab1-f30764baee0e/*part-r*
```

```
HiveDDLSuite:
 - drop external tables in default database *** FAILED *** (16 milliseconds)
   org.apache.spark.sql.AnalysisException: org.apache.hadoop.hive.ql.metadata.HiveException: MetaException(message:java.lang.IllegalArgumentException: Can not create a Path from an empty string);

 - add/drop partitions - external table *** FAILED *** (16 milliseconds)
   org.apache.spark.sql.AnalysisException: org.apache.hadoop.hive.ql.metadata.HiveException: MetaException(message:java.lang.IllegalArgumentException: Can not create a Path from an empty string);

 - create/drop database - location without pre-created directory *** FAILED *** (16 milliseconds)
   org.apache.spark.sql.AnalysisException: org.apache.hadoop.hive.ql.metadata.HiveException: MetaException(message:java.lang.IllegalArgumentException: Can not create a Path from an empty string);

 - create/drop database - location with pre-created directory *** FAILED *** (32 milliseconds)
   org.apache.spark.sql.AnalysisException: org.apache.hadoop.hive.ql.metadata.HiveException: MetaException(message:java.lang.IllegalArgumentException: Can not create a Path from an empty string);

 - drop database containing tables - CASCADE *** FAILED *** (94 milliseconds)
   CatalogDatabase(db1,,file:/C:/projects/spark/target/tmp/warehouse-d0665ee0-1e39-4805-b471-0b764f7838be/db1.db,Map()) did not equal CatalogDatabase(db1,,file:C:/projects/spark/target/tmp/warehouse-d0665ee0-1e39-4805-b471-0b764f7838be\db1.db,Map()) (HiveDDLSuite.scala:675)

 - drop an empty database - CASCADE *** FAILED *** (63 milliseconds)
   CatalogDatabase(db1,,file:/C:/projects/spark/target/tmp/warehouse-d0665ee0-1e39-4805-b471-0b764f7838be/db1.db,Map()) did not equal CatalogDatabase(db1,,file:C:/projects/spark/target/tmp/warehouse-d0665ee0-1e39-4805-b471-0b764f7838be\db1.db,Map()) (HiveDDLSuite.scala:675)

 - drop database containing tables - RESTRICT *** FAILED *** (47 milliseconds)
   CatalogDatabase(db1,,file:/C:/projects/spark/target/tmp/warehouse-d0665ee0-1e39-4805-b471-0b764f7838be/db1.db,Map()) did not equal CatalogDatabase(db1,,file:C:/projects/spark/target/tmp/warehouse-d0665ee0-1e39-4805-b471-0b764f7838be\db1.db,Map()) (HiveDDLSuite.scala:675)

 - drop an empty database - RESTRICT *** FAILED *** (47 milliseconds)
   CatalogDatabase(db1,,file:/C:/projects/spark/target/tmp/warehouse-d0665ee0-1e39-4805-b471-0b764f7838be/db1.db,Map()) did not equal CatalogDatabase(db1,,file:C:/projects/spark/target/tmp/warehouse-d0665ee0-1e39-4805-b471-0b764f7838be\db1.db,Map()) (HiveDDLSuite.scala:675)

 - CREATE TABLE LIKE an external data source table *** FAILED *** (140 milliseconds)
   org.apache.spark.sql.AnalysisException: Path does not exist: file:/C:projectsspark	arget	mpspark-c5eba16d-07ae-4186-95bb-21c5811cf888;

 - CREATE TABLE LIKE an external Hive serde table *** FAILED *** (16 milliseconds)
   org.apache.spark.sql.AnalysisException: org.apache.hadoop.hive.ql.metadata.HiveException: MetaException(message:java.lang.IllegalArgumentException: Can not create a Path from an empty string);

 - desc table for data source table - no user-defined schema *** FAILED *** (125 milliseconds)
   org.apache.spark.sql.AnalysisException: Path does not exist: file:/C:projectsspark	arget	mpspark-e8bf5bf5-721a-4cbe-9d6	at scala.collection.immutable.List.foreach(List.scala:381)d-5543a8301c1d;
```

```
MetastoreDataSourcesSuite
 - CTAS: persisted bucketed data source table *** FAILED *** (16 milliseconds)
   java.lang.IllegalArgumentException: Can not create a Path from an empty string
```

```
ShowCreateTableSuite:
 - simple external hive table *** FAILED *** (0 milliseconds)
   org.apache.spark.sql.AnalysisException: org.apache.hadoop.hive.ql.metadata.HiveException: MetaException(message:java.lang.IllegalArgumentException: Can not create a Path from an empty string);
```

```
PartitionedTablePerfStatsSuite:
 - hive table: partitioned pruned table reports only selected files *** FAILED *** (313 milliseconds)
   org.apache.spark.sql.AnalysisException: org.apache.hadoop.hive.ql.metadata.HiveException: MetaException(message:java.lang.IllegalArgumentException: Can not create a Path from an empty string);

 - datasource table: partitioned pruned table reports only selected files *** FAILED *** (219 milliseconds)
   org.apache.spark.sql.AnalysisException: Path does not exist: file:/C:projectsspark	arget	mpspark-311f45f8-d064-4023-a4bb-e28235bff64d;

 - hive table: lazy partition pruning reads only necessary partition data *** FAILED *** (203 milliseconds)
   org.apache.spark.sql.AnalysisException: org.apache.hadoop.hive.ql.metadata.HiveException: MetaException(message:java.lang.IllegalArgumentException: Can not create a Path from an empty string);

 - datasource table: lazy partition pruning reads only necessary partition data *** FAILED *** (187 milliseconds)
   org.apache.spark.sql.AnalysisException: Path does not exist: file:/C:projectsspark	arget	mpspark-fde874ca-66bd-4d0b-a40f-a043b65bf957;

 - hive table: lazy partition pruning with file status caching enabled *** FAILED *** (188 milliseconds)
   org.apache.spark.sql.AnalysisException: org.apache.hadoop.hive.ql.metadata.HiveException: MetaException(message:java.lang.IllegalArgumentException: Can not create a Path from an empty string);

 - datasource table: lazy partition pruning with file status caching enabled *** FAILED *** (187 milliseconds)
   org.apache.spark.sql.AnalysisException: Path does not exist: file:/C:projectsspark	arget	mpspark-e6d20183-dd68-4145-acbe-4a509849accd;

 - hive table: file status caching respects refresh table and refreshByPath *** FAILED *** (172 milliseconds)
   org.apache.spark.sql.AnalysisException: org.apache.hadoop.hive.ql.metadata.HiveException: MetaException(message:java.lang.IllegalArgumentException: Can not create a Path from an empty string);

 - datasource table: file status caching respects refresh table and refreshByPath *** FAILED *** (203 milliseconds)
   org.apache.spark.sql.AnalysisException: Path does not exist: file:/C:projectsspark	arget	mpspark-8b2c9651-2adf-4d58-874f-659007e21463;

 - hive table: file status cache respects size limit *** FAILED *** (219 milliseconds)
   org.apache.spark.sql.AnalysisException: org.apache.hadoop.hive.ql.metadata.HiveException: MetaException(message:java.lang.IllegalArgumentException: Can not create a Path from an empty string);

 - datasource table: file status cache respects size limit *** FAILED *** (171 milliseconds)
   org.apache.spark.sql.AnalysisException: Path does not exist: file:/C:projectsspark	arget	mpspark-7835ab57-cb48-4d2c-bb1d-b46d5a4c47e4;

 - datasource table: table setup does not scan filesystem *** FAILED *** (266 milliseconds)
   org.apache.spark.sql.AnalysisException: Path does not exist: file:/C:projectsspark	arget	mpspark-20598d76-c004-42a7-8061-6c56f0eda5e2;

 - hive table: table setup does not scan filesystem *** FAILED *** (266 milliseconds)
   org.apache.spark.sql.AnalysisException: org.apache.hadoop.hive.ql.metadata.HiveException: MetaException(message:java.lang.IllegalArgumentException: Can not create a Path from an empty string);

 - hive table: num hive client calls does not scale with partition count *** FAILED *** (2 seconds, 281 milliseconds)
   org.apache.spark.sql.AnalysisException: org.apache.hadoop.hive.ql.metadata.HiveException: MetaException(message:java.lang.IllegalArgumentException: Can not create a Path from an empty string);

 - datasource table: num hive client calls does not scale with partition count *** FAILED *** (2 seconds, 422 milliseconds)
   org.apache.spark.sql.AnalysisException: Path does not exist: file:/C:projectsspark	arget	mpspark-4cfed321-4d1d-4b48-8d34-5c169afff383;

 - hive table: files read and cached when filesource partition management is off *** FAILED *** (234 milliseconds)
   org.apache.spark.sql.AnalysisException: org.apache.hadoop.hive.ql.metadata.HiveException: MetaException(message:java.lang.IllegalArgumentException: Can not create a Path from an empty string);

 - datasource table: all partition data cached in memory when partition management is off *** FAILED *** (203 milliseconds)
   org.apache.spark.sql.AnalysisException: Path does not exist: file:/C:projectsspark	arget	mpspark-4bcc0398-15c9-4f6a-811e-12d40f3eec12;

 - SPARK-18700: table loaded only once even when resolved concurrently *** FAILED *** (1 second, 266 milliseconds)
   org.apache.spark.sql.AnalysisException: org.apache.hadoop.hive.ql.metadata.HiveException: MetaException(message:java.lang.IllegalArgumentException: Can not create a Path from an empty string);
```

```
HiveSparkSubmitSuite:
 - temporary Hive UDF: define a UDF and use it *** FAILED *** (2 seconds, 94 milliseconds)
   java.io.IOException: Cannot run program "./bin/spark-submit" (in directory "C:\projects\spark"): CreateProcess error=2, The system cannot find the file specified

 - permanent Hive UDF: define a UDF and use it *** FAILED *** (281 milliseconds)
   java.io.IOException: Cannot run program "./bin/spark-submit" (in directory "C:\projects\spark"): CreateProcess error=2, The system cannot find the file specified

 - permanent Hive UDF: use a already defined permanent function *** FAILED *** (718 milliseconds)
   java.io.IOException: Cannot run program "./bin/spark-submit" (in directory "C:\projects\spark"): CreateProcess error=2, The system cannot find the file specified

 - SPARK-8368: includes jars passed in through --jars *** FAILED *** (3 seconds, 521 milliseconds)
   java.io.IOException: Cannot run program "./bin/spark-submit" (in directory "C:\projects\spark"): CreateProcess error=2, The system cannot find the file specified

 - SPARK-8020: set sql conf in spark conf *** FAILED *** (0 milliseconds)
   java.io.IOException: Cannot run program "./bin/spark-submit" (in directory "C:\projects\spark"): CreateProcess error=2, The system cannot find the file specified

 - SPARK-8489: MissingRequirementError during reflection *** FAILED *** (94 milliseconds)
   java.io.IOException: Cannot run program "./bin/spark-submit" (in directory "C:\projects\spark"): CreateProcess error=2, The system cannot find the file specified

 - SPARK-9757 Persist Parquet relation with decimal column *** FAILED *** (16 milliseconds)
   java.io.IOException: Cannot run program "./bin/spark-submit" (in directory "C:\projects\spark"): CreateProcess error=2, The system cannot find the file specified

 - SPARK-11009 fix wrong result of Window function in cluster mode *** FAILED *** (16 milliseconds)
   java.io.IOException: Cannot run program "./bin/spark-submit" (in directory "C:\projects\spark"): CreateProcess error=2, The system cannot find the file specified

 - SPARK-14244 fix window partition size attribute binding failure *** FAILED *** (78 milliseconds)
   java.io.IOException: Cannot run program "./bin/spark-submit" (in directory "C:\projects\spark"): CreateProcess error=2, The system cannot find the file specified

 - set spark.sql.warehouse.dir *** FAILED *** (16 milliseconds)
   java.io.IOException: Cannot run program "./bin/spark-submit" (in directory "C:\projects\spark"): CreateProcess error=2, The system cannot find the file specified

 - set hive.metastore.warehouse.dir *** FAILED *** (15 milliseconds)
   java.io.IOException: Cannot run program "./bin/spark-submit" (in directory "C:\projects\spark"): CreateProcess error=2, The system cannot find the file specified

 - SPARK-16901: set javax.jdo.option.ConnectionURL *** FAILED *** (16 milliseconds)
   java.io.IOException: Cannot run program "./bin/spark-submit" (in directory "C:\projects\spark"): CreateProcess error=2, The system cannot find the file specified

 - SPARK-18360: default table path of tables in default database should depend on the location of default database *** FAILED *** (15 milliseconds)
   java.io.IOException: Cannot run program "./bin/spark-submit" (in directory "C:\projects\spark"): CreateProcess error=2, The system cannot find the file specified
```

```
UtilsSuite:
 - resolveURIs with multiple paths *** FAILED *** (0 milliseconds)
   ".../jar3,file:/C:/pi.py[%23]py.pi,file:/C:/path%..." did not equal ".../jar3,file:/C:/pi.py[#]py.pi,file:/C:/path%..." (UtilsSuite.scala:468)
```

```
CheckpointSuite:
 - recovery with file input stream *** FAILED *** (10 seconds, 205 milliseconds)
   The code passed to eventually never returned normally. Attempted 660 times over 10.014272499999999 seconds. Last failure message: Unexpected internal error near index 1
   \
    ^. (CheckpointSuite.scala:680)
```

## How was this patch tested?

Manually via AppVeyor as below:

**Scala - aborted tests**

```
WindowQuerySuite - all passed
OrcSourceSuite:
- SPARK-18220: read Hive orc table with varchar column *** FAILED *** (4 seconds, 417 milliseconds)
  org.apache.spark.sql.execution.QueryExecutionException: FAILED: Execution Error, return code -101 from org.apache.hadoop.hive.ql.exec.mr.MapRedTask. org.apache.hadoop.io.nativeio.NativeIO$Windows.access0(Ljava/lang/String;I)Z
  at org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$runHive$1.apply(HiveClientImpl.scala:625)
ParquetMetastoreSuite - all passed
ParquetSourceSuite - all passed
KafkaRDDSuite - all passed
DirectKafkaStreamSuite - all passed
ReliableKafkaStreamSuite - all passed
KafkaStreamSuite - all passed
KafkaClusterSuite - all passed
DirectKafkaStreamSuite - all passed
KafkaRDDSuite - all passed
```

**Java - failed tests**

```
org.apache.spark.streaming.kafka.JavaKafkaRDDSuite - all passed
org.apache.spark.streaming.kafka.JavaDirectKafkaStreamSuite - all passed
org.apache.spark.streaming.kafka.JavaKafkaStreamSuite - all passed
org.apache.spark.streaming.kafka010.JavaDirectKafkaStreamSuite - all passed
org.apache.spark.streaming.kafka010.JavaKafkaRDDSuite - all passed
```

**Scala - failed tests**

```
PartitionProviderCompatibilitySuite:
- insert overwrite partition of new datasource table overwrites just partition (1 second, 953 milliseconds)
- SPARK-18635 special chars in partition values - partition management true (6 seconds, 31 milliseconds)
- SPARK-18635 special chars in partition values - partition management false (4 seconds, 578 milliseconds)
```

```
UtilsSuite:
- reading offset bytes of a file (compressed) (203 milliseconds)
- reading offset bytes across multiple files (compressed) (0 milliseconds)
```

```
StatisticsSuite:
- MetastoreRelations fallback to HDFS for size estimation (94 milliseconds)
```

```
SQLQuerySuite:
 - permanent UDTF (407 milliseconds)
 - describe functions - user defined functions (441 milliseconds)
 - CTAS without serde with location (2 seconds, 831 milliseconds)
 - derived from Hive query file: drop_database_removes_partition_dirs.q (734 milliseconds)
 - derived from Hive query file: drop_table_removes_partition_dirs.q (563 milliseconds)
 - SPARK-17796 Support wildcard character in filename for LOAD DATA LOCAL INPATH (453 milliseconds)
```

```
HiveDDLSuite:
 - drop external tables in default database (3 seconds, 5 milliseconds)
 - add/drop partitions - external table (2 seconds, 750 milliseconds)
 - create/drop database - location without pre-created directory (500 milliseconds)
 - create/drop database - location with pre-created directory (407 milliseconds)
 - drop database containing tables - CASCADE (453 milliseconds)
 - drop an empty database - CASCADE (375 milliseconds)
 - drop database containing tables - RESTRICT (328 milliseconds)
 - drop an empty database - RESTRICT (391 milliseconds)
 - CREATE TABLE LIKE an external data source table (953 milliseconds)
 - CREATE TABLE LIKE an external Hive serde table (3 seconds, 782 milliseconds)
 - desc table for data source table - no user-defined schema (1 second, 150 milliseconds)
```

```
MetastoreDataSourcesSuite
 - CTAS: persisted bucketed data source table (875 milliseconds)
```

```
ShowCreateTableSuite:
 - simple external hive table (78 milliseconds)
```

```
PartitionedTablePerfStatsSuite:
 - hive table: partitioned pruned table reports only selected files (1 second, 109 milliseconds)
- datasource table: partitioned pruned table reports only selected files (860 milliseconds)
 - hive table: lazy partition pruning reads only necessary partition data (859 milliseconds)
 - datasource table: lazy partition pruning reads only necessary partition data (1 second, 219 milliseconds)
 - hive table: lazy partition pruning with file status caching enabled (875 milliseconds)
 - datasource table: lazy partition pruning with file status caching enabled (890 milliseconds)
 - hive table: file status caching respects refresh table and refreshByPath (922 milliseconds)
 - datasource table: file status caching respects refresh table and refreshByPath (640 milliseconds)
 - hive table: file status cache respects size limit (469 milliseconds)
 - datasource table: file status cache respects size limit (453 milliseconds)
 - datasource table: table setup does not scan filesystem (328 milliseconds)
 - hive table: table setup does not scan filesystem (313 milliseconds)
 - hive table: num hive client calls does not scale with partition count (5 seconds, 431 milliseconds)
 - datasource table: num hive client calls does not scale with partition count (4 seconds, 79 milliseconds)
 - hive table: files read and cached when filesource partition management is off (656 milliseconds)
 - datasource table: all partition data cached in memory when partition management is off (484 milliseconds)
 - SPARK-18700: table loaded only once even when resolved concurrently (2 seconds, 578 milliseconds)
```

```
HiveSparkSubmitSuite:
 - temporary Hive UDF: define a UDF and use it (1 second, 745 milliseconds)
 - permanent Hive UDF: define a UDF and use it (406 milliseconds)
 - permanent Hive UDF: use a already defined permanent function (375 milliseconds)
 - SPARK-8368: includes jars passed in through --jars (391 milliseconds)
 - SPARK-8020: set sql conf in spark conf (156 milliseconds)
 - SPARK-8489: MissingRequirementError during reflection (187 milliseconds)
 - SPARK-9757 Persist Parquet relation with decimal column (157 milliseconds)
 - SPARK-11009 fix wrong result of Window function in cluster mode (156 milliseconds)
 - SPARK-14244 fix window partition size attribute binding failure (156 milliseconds)
 - set spark.sql.warehouse.dir (172 milliseconds)
 - set hive.metastore.warehouse.dir (156 milliseconds)
 - SPARK-16901: set javax.jdo.option.ConnectionURL (157 milliseconds)
 - SPARK-18360: default table path of tables in default database should depend on the location of default database (172 milliseconds)
```

```
UtilsSuite:
 - resolveURIs with multiple paths (0 milliseconds)
```

```
CheckpointSuite:
 - recovery with file input stream (4 seconds, 452 milliseconds)
```

Note: after resolving the aborted tests, there is a test failure identified as below:

```
OrcSourceSuite:
- SPARK-18220: read Hive orc table with varchar column *** FAILED *** (4 seconds, 417 milliseconds)
  org.apache.spark.sql.execution.QueryExecutionException: FAILED: Execution Error, return code -101 from org.apache.hadoop.hive.ql.exec.mr.MapRedTask. org.apache.hadoop.io.nativeio.NativeIO$Windows.access0(Ljava/lang/String;I)Z
  at org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$runHive$1.apply(HiveClientImpl.scala:625)
```

This does not look due to this problem so this PR does not fix it here.

Author: hyukjinkwon <gurwls223@gmail.com>

Closes #16451 from HyukjinKwon/all-path-resource-fixes.
2017-01-10 13:19:21 +00:00
zuotingbing cd1d00adaf
[SPARK-19026] SPARK_LOCAL_DIRS(multiple directories on different disks) cannot be deleted
JIRA Issue: https://issues.apache.org/jira/browse/SPARK-19026

SPARK_LOCAL_DIRS (Standalone) can  be a comma-separated list of multiple directories on different disks, e.g. SPARK_LOCAL_DIRS=/dir1,/dir2,/dir3, if there is a IOExecption when create sub directory on dir3 , the sub directory which have been created successfully on dir1 and dir2 cannot be deleted anymore when the application finishes.
So we should catch the IOExecption at Utils.createDirectory  , otherwise the variable "appDirectories(appId)" which the function maybeCleanupApplication calls will not be set then dir1 and dir2 will not be cleaned up .

Author: zuotingbing <zuo.tingbing9@zte.com.cn>

Closes #16439 from zuotingbing/master.
2017-01-08 09:29:01 +00:00
Kay Ousterhout 2e139eed31 [SPARK-17931] Eliminate unnecessary task (de) serialization
In the existing code, there are three layers of serialization
    involved in sending a task from the scheduler to an executor:
        - A Task object is serialized
        - The Task object is copied to a byte buffer that also
          contains serialized information about any additional JARs,
          files, and Properties needed for the task to execute. This
          byte buffer is stored as the member variable serializedTask
          in the TaskDescription class.
        - The TaskDescription is serialized (in addition to the serialized
          task + JARs, the TaskDescription class contains the task ID and
          other metadata) and sent in a LaunchTask message.

While it *is* necessary to have two layers of serialization, so that
the JAR, file, and Property info can be deserialized prior to
deserializing the Task object, the third layer of deserialization is
unnecessary.  This commit eliminates a layer of serialization by moving
the JARs, files, and Properties into the TaskDescription class.

This commit also serializes the Properties manually (by traversing the map),
as is done with the JARs and files, which reduces the final serialized size.

Unit tests

This is a simpler alternative to the approach proposed in #15505.

shivaram and I did some benchmarking of this and #15505 on a 20-machine m2.4xlarge EC2 machines (160 cores). We ran ~30 trials of code [1] (a very simple job with 10K tasks per stage) and measured the average time per stage:

Before this change: 2490ms
With this change: 2345 ms (so ~6% improvement over the baseline)
With witgo's approach in #15505: 2046 ms (~18% improvement over baseline)

The reason that #15505 has a more significant improvement is that it also moves the serialization from the TaskSchedulerImpl thread to the CoarseGrainedSchedulerBackend thread. I added that functionality on top of this change, and got almost the same improvement [1] as #15505 (average of 2103ms). I think we should decouple these two changes, both so we have some record of the improvement form each individual improvement, and because this change is more about simplifying the code base (the improvement is negligible) while the other is about performance improvement.  The plan, currently, is to merge this PR and then merge the remaining part of #15505 that moves serialization.

[1] The reason the improvement wasn't quite as good as with #15505 when we ran the benchmarks is almost certainly because, at the point when we ran the benchmarks, I hadn't updated the code to manually serialize the Properties (instead the code was using Java's default serialization for the Properties object, whereas #15505 manually serialized the Properties).  This PR has since been updated to manually serialize the Properties, just like the other maps.

Author: Kay Ousterhout <kayousterhout@gmail.com>

Closes #16053 from kayousterhout/SPARK-17931.
2017-01-06 10:48:08 -06:00
jerryshao 4a4c3dc9ca [SPARK-19033][CORE] Add admin acls for history server
## What changes were proposed in this pull request?

Current HistoryServer's ACLs is derived from application event-log, which means the newly changed ACLs cannot be applied to the old data, this will become a problem where newly added admin cannot access the old application history UI, only the new application can be affected.

So here propose to add admin ACLs for history server, any configured user/group could have the view access to all the applications, while the view ACLs derived from application run-time still take effect.

## How was this patch tested?

Unit test added.

Author: jerryshao <sshao@hortonworks.com>

Closes #16470 from jerryshao/SPARK-19033.
2017-01-06 10:07:54 -06:00
Rui Li f5d18af6a8 [SPARK-14958][CORE] Failed task not handled when there's error deserializing failure reason
## What changes were proposed in this pull request?

TaskResultGetter tries to deserialize the TaskEndReason before handling the failed task. If an error is thrown during deserialization, the failed task won't be handled, which leaves the job hanging.
The PR proposes to handle the failed task in a finally block.
## How was this patch tested?

In my case I hit a NoClassDefFoundError and the job hangs. Manually verified the patch can fix it.

Author: Rui Li <rui.li@intel.com>
Author: Rui Li <lirui@apache.org>
Author: Rui Li <shlr@cn.ibm.com>

Closes #12775 from lirui-intel/SPARK-14958.
2017-01-05 14:51:13 -08:00
Kay Ousterhout 00074b5778 [SPARK-19062] Utils.writeByteBuffer bug fix
This commit changes Utils.writeByteBuffer so that it does not change
the position of the ByteBuffer that it writes out, and adds a unit test for
this functionality.

cc mridulm

Author: Kay Ousterhout <kayousterhout@gmail.com>

Closes #16462 from kayousterhout/SPARK-19062.
2017-01-04 11:21:09 -08:00
Niranjan Padmanabhan a1e40b1f5d
[MINOR][DOCS] Remove consecutive duplicated words/typo in Spark Repo
## What changes were proposed in this pull request?
There are many locations in the Spark repo where the same word occurs consecutively. Sometimes they are appropriately placed, but many times they are not. This PR removes the inappropriately duplicated words.

## How was this patch tested?
N/A since only docs or comments were updated.

Author: Niranjan Padmanabhan <niranjan.padmanabhan@gmail.com>

Closes #16455 from neurons/np.structure_streaming_doc.
2017-01-04 15:07:29 +00:00
Weiqing Yang e5c307c50a
[MINOR] Add missing sc.stop() to end of examples
## What changes were proposed in this pull request?

Add `finally` clause for `sc.stop()` in the `test("register and deregister Spark listener from SparkContext")`.

## How was this patch tested?
Pass the build and unit tests.

Author: Weiqing Yang <yangweiqing001@gmail.com>

Closes #16426 from weiqingy/testIssue.
2017-01-03 09:56:42 +00:00
Sergei Lebedev 67fb33e7e0
[SPARK-19010][CORE] Include Kryo exception in case of overflow
## What changes were proposed in this pull request?

This is to workaround an implicit result of #4947 which suppressed the
original Kryo exception if the overflow happened during serialization.

## How was this patch tested?

`KryoSerializerSuite` was augmented to reflect this change.

Author: Sergei Lebedev <superbobry@gmail.com>

Closes #16416 from superbobry/patch-1.
2016-12-28 10:30:38 +00:00
Shixiong Zhu 7026ee23e0 [SPARK-17755][CORE] Use workerRef to send RegisterWorkerResponse to avoid the race condition
## What changes were proposed in this pull request?

The root cause of this issue is that RegisterWorkerResponse and LaunchExecutor are sent via two different channels (TCP connections) and their order is not guaranteed.

This PR changes the master and worker codes to use `workerRef` to send RegisterWorkerResponse, so that RegisterWorkerResponse and LaunchExecutor are sent via the same connection. Hence `LaunchExecutor` will always be after `RegisterWorkerResponse` and never be ignored.

## How was this patch tested?

Jenkins

Author: Shixiong Zhu <shixiong@databricks.com>

Closes #16345 from zsxwing/SPARK-17755.
2016-12-25 23:48:14 -08:00
Kousuke Saruta f2ceb2abe9
[SPARK-18837][WEBUI] Very long stage descriptions do not wrap in the UI
## What changes were proposed in this pull request?

This issue was reported by wangyum.

In the AllJobsPage, JobPage and StagePage, the description length was limited before like as follows.

![ui-2 0 0](https://cloud.githubusercontent.com/assets/4736016/21319673/8b225246-c651-11e6-9041-4fcdd04f4dec.gif)

But recently, the limitation seems to have been accidentally removed.

![ui-2 1 0](https://cloud.githubusercontent.com/assets/4736016/21319825/104779f6-c652-11e6-8bfa-dfd800396352.gif)

The cause is that some tables are no longer `sortable` class although they were, and `sortable` class does not only mark tables as sortable but also limited the width of their child `td` elements.
The reason why now some tables are not `sortable` class is because another sortable mechanism was introduced by #13620 and #13708 with pagination feature.

To fix this issue, I've introduced new class `table-cell-width-limited` which limits the description cell width and the description is like what it was.

<img width="1260" alt="2016-12-20 1 00 34" src="https://cloud.githubusercontent.com/assets/4736016/21320478/89141c7a-c654-11e6-8494-f8f91325980b.png">

## How was this patch tested?

Tested manually with my browser.

Author: Kousuke Saruta <sarutak@oss.nttdata.co.jp>

Closes #16338 from sarutak/SPARK-18837.
2016-12-24 13:02:58 +00:00
Shixiong Zhu a848f0ba84 [SPARK-18991][CORE] Change ContextCleaner.referenceBuffer to use ConcurrentHashMap to make it faster
## What changes were proposed in this pull request?

The time complexity of ConcurrentHashMap's `remove` is O(1). Changing ContextCleaner.referenceBuffer's type from `ConcurrentLinkedQueue` to `ConcurrentHashMap's` will make the removal much faster.

## How was this patch tested?

Jenkins

Author: Shixiong Zhu <shixiong@databricks.com>

Closes #16390 from zsxwing/SPARK-18991.
2016-12-23 15:38:41 -08:00
Shixiong Zhu f252cb5d16 [SPARK-18972][CORE] Fix the netty thread names for RPC
## What changes were proposed in this pull request?

Right now the name of threads created by Netty for Spark RPC are `shuffle-client-**` and `shuffle-server-**`. It's pretty confusing.

This PR just uses the module name in TransportConf to set the thread name. In addition, it also includes the following minor fixes:

- TransportChannelHandler.channelActive and channelInactive should call the corresponding super methods.
- Make ShuffleBlockFetcherIterator throw NoSuchElementException if it has no more elements. Otherwise,  if the caller calls `next` without `hasNext`, it will just hang.

## How was this patch tested?

Jenkins

Author: Shixiong Zhu <shixiong@databricks.com>

Closes #16380 from zsxwing/SPARK-18972.
2016-12-22 16:22:55 -08:00
saturday_s ce99f51d2e [SPARK-18537][WEB UI] Add a REST api to serve spark streaming information
## What changes were proposed in this pull request?

This PR is an inheritance from #16000, and is a completion of #15904.

**Description**

- Augment the `org.apache.spark.status.api.v1` package for serving streaming information.
- Retrieve the streaming information through StreamingJobProgressListener.

> this api should cover exceptly the same amount of information as you can get from the web interface
> the implementation is base on the current REST implementation of spark-core
> and will be available for running applications only
>
> https://issues.apache.org/jira/browse/SPARK-18537

## How was this patch tested?

Local test.

Author: saturday_s <shi.indetail@gmail.com>
Author: Chan Chor Pang <ChorPang.Chan@access-company.com>
Author: peterCPChan <universknight@gmail.com>

Closes #16253 from saturday-shi/SPARK-18537.
2016-12-22 12:51:37 -08:00
jerryshao 31da755c80 [SPARK-18975][CORE] Add an API to remove SparkListener
## What changes were proposed in this pull request?

In current Spark we could add customized SparkListener through `SparkContext#addListener` API, but there's no equivalent API to remove the registered one. In our scenario SparkListener will be added repeatedly accordingly to the changed environment. If lacks the ability to remove listeners, there might be many registered listeners finally, this is unnecessary and potentially affects the performance. So here propose to add an API to remove registered listener.

## How was this patch tested?

Add an unit test to verify it.

Author: jerryshao <sshao@hortonworks.com>

Closes #16382 from jerryshao/SPARK-18975.
2016-12-22 11:18:22 -08:00
hyukjinkwon 4186aba632
[SPARK-18922][TESTS] Fix more resource-closing-related and path-related test failures in identified ones on Windows
## What changes were proposed in this pull request?

There are several tests failing due to resource-closing-related and path-related  problems on Windows as below.

- `SQLQuerySuite`:

```
- specifying database name for a temporary table is not allowed *** FAILED *** (125 milliseconds)
  org.apache.spark.sql.AnalysisException: Path does not exist: file:/C:projectsspark  arget mpspark-1f4471ab-aac0-4239-ae35-833d54b37e52;
  at org.apache.spark.sql.execution.datasources.DataSource$$anonfun$14.apply(DataSource.scala:382)
  at org.apache.spark.sql.execution.datasources.DataSource$$anonfun$14.apply(DataSource.scala:370)
```

- `JsonSuite`:

```
- Loading a JSON dataset from a text file with SQL *** FAILED *** (94 milliseconds)
  org.apache.spark.sql.AnalysisException: Path does not exist: file:/C:projectsspark  arget mpspark-c918a8b7-fc09-433c-b9d0-36c0f78ae918;
  at org.apache.spark.sql.execution.datasources.DataSource$$anonfun$14.apply(DataSource.scala:382)
  at org.apache.spark.sql.execution.datasources.DataSource$$anonfun$14.apply(DataSource.scala:370)
```

- `StateStoreSuite`:

```
- SPARK-18342: commit fails when rename fails *** FAILED *** (16 milliseconds)
  java.lang.IllegalArgumentException: java.net.URISyntaxException: Relative path in absolute URI: StateStoreSuite29777261fs://C:%5Cprojects%5Cspark%5Ctarget%5Ctmp%5Cspark-ef349862-7281-4963-aaf3-add0d670a4ad%5C?????-2218c2f8-2cf6-4f80-9cdf-96354e8246a77685899733421033312/0
  at org.apache.hadoop.fs.Path.initialize(Path.java:206)
  at org.apache.hadoop.fs.Path.<init>(Path.java:116)
  at org.apache.hadoop.fs.Path.<init>(Path.java:89)
  ...
  Cause: java.net.URISyntaxException: Relative path in absolute URI: StateStoreSuite29777261fs://C:%5Cprojects%5Cspark%5Ctarget%5Ctmp%5Cspark-ef349862-7281-4963-aaf3-add0d670a4ad%5C?????-2218c2f8-2cf6-4f80-9cdf-96354e8246a77685899733421033312/0
  at java.net.URI.checkPath(URI.java:1823)
  at java.net.URI.<init>(URI.java:745)
  at org.apache.hadoop.fs.Path.initialize(Path.java:203)
```

- `HDFSMetadataLogSuite`:

```
- FileManager: FileContextManager *** FAILED *** (94 milliseconds)
  java.io.IOException: Failed to delete: C:\projects\spark\target\tmp\spark-415bb0bd-396b-444d-be82-04599e025f21
  at org.apache.spark.util.Utils$.deleteRecursively(Utils.scala:1010)
  at org.apache.spark.sql.test.SQLTestUtils$class.withTempDir(SQLTestUtils.scala:127)
  at org.apache.spark.sql.execution.streaming.HDFSMetadataLogSuite.withTempDir(HDFSMetadataLogSuite.scala:38)

- FileManager: FileSystemManager *** FAILED *** (78 milliseconds)
  java.io.IOException: Failed to delete: C:\projects\spark\target\tmp\spark-ef8222cd-85aa-47c0-a396-bc7979e15088
  at org.apache.spark.util.Utils$.deleteRecursively(Utils.scala:1010)
  at org.apache.spark.sql.test.SQLTestUtils$class.withTempDir(SQLTestUtils.scala:127)
  at org.apache.spark.sql.execution.streaming.HDFSMetadataLogSuite.withTempDir(HDFSMetadataLogSuite.scala:38)
```

And, there are some tests being failed due to the length limitation on cmd in Windows as below:

- `LauncherBackendSuite`:

```
- local: launcher handle *** FAILED *** (30 seconds, 120 milliseconds)
  The code passed to eventually never returned normally. Attempted 283 times over 30.0960053 seconds. Last failure message: The reference was null. (LauncherBackendSuite.scala:56)
  org.scalatest.exceptions.TestFailedDueToTimeoutException:
  at org.scalatest.concurrent.Eventually$class.tryTryAgain$1(Eventually.scala:420)
  at org.scalatest.concurrent.Eventually$class.eventually(Eventually.scala:438)

- standalone/client: launcher handle *** FAILED *** (30 seconds, 47 milliseconds)
  The code passed to eventually never returned normally. Attempted 282 times over 30.037987100000002 seconds. Last failure message: The reference was null. (LauncherBackendSuite.scala:56)
  org.scalatest.exceptions.TestFailedDueToTimeoutException:
  at org.scalatest.concurrent.Eventually$class.tryTryAgain$1(Eventually.scala:420)
  at org.scalatest.concurrent.Eventually$class.eventually(Eventually.scala:438)
```

The executed command is, https://gist.github.com/HyukjinKwon/d3fdd2e694e5c022992838a618a516bd, which is 16K length; however, the length limitation is 8K on Windows. So, it is being failed to launch.

This PR proposes to fix the test failures on Windows and skip the tests failed due to the length limitation

## How was this patch tested?

Manually tested via AppVeyor

**Before**

`SQLQuerySuite `: https://ci.appveyor.com/project/spark-test/spark/build/306-pr-references
`JsonSuite`: https://ci.appveyor.com/project/spark-test/spark/build/307-pr-references
`StateStoreSuite` : https://ci.appveyor.com/project/spark-test/spark/build/305-pr-references
`HDFSMetadataLogSuite`: https://ci.appveyor.com/project/spark-test/spark/build/304-pr-references
`LauncherBackendSuite`: https://ci.appveyor.com/project/spark-test/spark/build/303-pr-references

**After**

`SQLQuerySuite`: https://ci.appveyor.com/project/spark-test/spark/build/293-SQLQuerySuite
`JsonSuite`: https://ci.appveyor.com/project/spark-test/spark/build/294-JsonSuite
`StateStoreSuite`: https://ci.appveyor.com/project/spark-test/spark/build/297-StateStoreSuite
`HDFSMetadataLogSuite`: https://ci.appveyor.com/project/spark-test/spark/build/319-pr-references
`LauncherBackendSuite`: failed test skipped.

Author: hyukjinkwon <gurwls223@gmail.com>

Closes #16335 from HyukjinKwon/more-fixes-on-windows.
2016-12-22 16:15:54 +00:00
Dongjoon Hyun f489339c75
[SPARK-18953][CORE][WEB UI] Do now show the link to a dead worker on the master page
## What changes were proposed in this pull request?

For a dead worker, we will not be able to see its worker page anyway. This PR removes the links to dead workers from the master page.

## How was this patch tested?

Since this is UI change, please do the following steps manually.

**1. Start a master and a slave**

```
sbin/start-master.sh
sbin/start-slave.sh spark://10.22.16.140:7077
```

![1_live_worker_a](https://cloud.githubusercontent.com/assets/9700541/21373572/d7e187d6-c6d4-11e6-9110-f4371d215dec.png)

**2. Stop the slave**
```
sbin/stop-slave.sh
```

![2_dead_worker_a](https://cloud.githubusercontent.com/assets/9700541/21373579/dd9e9704-c6d4-11e6-9047-a22cb0aa83ed.png)

**3. Start a slave**

```
sbin/start-slave.sh spark://10.22.16.140:7077
```

![3_dead_worder_a_and_live_worker_b](https://cloud.githubusercontent.com/assets/9700541/21373582/e1b207f4-c6d4-11e6-89cb-6d8970175a5e.png)

**4. Stop the slave**

```
sbin/stop-slave.sh
```

![4_dead_worker_a_and_b](https://cloud.githubusercontent.com/assets/9700541/21373584/e5fecb4e-c6d4-11e6-95d3-49defe366946.png)

**5. Driver list testing**

Do the followings and stop the slave in a minute by `sbin/stop-slave.sh`.

```
sbin/start-master.sh
sbin/start-slave.sh spark://10.22.16.140:7077
bin/spark-submit --master=spark://10.22.16.140:7077 --deploy-mode=cluster --class org.apache.spark.examples.SparkPi examples/target/scala-2.11/jars/spark-examples_2.11-2.2.0-SNAPSHOT.jar 10000
```

![5_dead_worker_in_driver_list](https://cloud.githubusercontent.com/assets/9700541/21401320/be6cc9fc-c768-11e6-8de7-6512961296a5.png)

Author: Dongjoon Hyun <dongjoon@apache.org>

Closes #16366 from dongjoon-hyun/SPARK-18953.
2016-12-22 09:43:46 +00:00
Ryan Williams afd9bc1d8a [SPARK-17807][CORE] split test-tags into test-JAR
Remove spark-tag's compile-scope dependency (and, indirectly, spark-core's compile-scope transitive-dependency) on scalatest by splitting test-oriented tags into spark-tags' test JAR.

Alternative to #16303.

Author: Ryan Williams <ryan.blake.williams@gmail.com>

Closes #16311 from ryan-williams/tt.
2016-12-21 16:37:20 -08:00
Holden Karau 047a9d92ca [SPARK-18576][PYTHON] Add basic TaskContext information to PySpark
## What changes were proposed in this pull request?

Adds basic TaskContext information to PySpark.

## How was this patch tested?

New unit tests to `tests.py` & existing unit tests.

Author: Holden Karau <holden@us.ibm.com>

Closes #16211 from holdenk/SPARK-18576-pyspark-taskcontext.
2016-12-20 15:51:21 -08:00
Josh Rosen fa829ce21f [SPARK-18761][CORE] Introduce "task reaper" to oversee task killing in executors
## What changes were proposed in this pull request?

Spark's current task cancellation / task killing mechanism is "best effort" because some tasks may not be interruptible or may not respond to their "killed" flags being set. If a significant fraction of a cluster's task slots are occupied by tasks that have been marked as killed but remain running then this can lead to a situation where new jobs and tasks are starved of resources that are being used by these zombie tasks.

This patch aims to address this problem by adding a "task reaper" mechanism to executors. At a high-level, task killing now launches a new thread which attempts to kill the task and then watches the task and periodically checks whether it has been killed. The TaskReaper will periodically re-attempt to call `TaskRunner.kill()` and will log warnings if the task keeps running. I modified TaskRunner to rename its thread at the start of the task, allowing TaskReaper to take a thread dump and filter it in order to log stacktraces from the exact task thread that we are waiting to finish. If the task has not stopped after a configurable timeout then the TaskReaper will throw an exception to trigger executor JVM death, thereby forcibly freeing any resources consumed by the zombie tasks.

This feature is flagged off by default and is controlled by four new configurations under the `spark.task.reaper.*` namespace. See the updated `configuration.md` doc for details.

## How was this patch tested?

Tested via a new test case in `JobCancellationSuite`, plus manual testing.

Author: Josh Rosen <joshrosen@databricks.com>

Closes #16189 from JoshRosen/cancellation.
2016-12-19 18:43:59 -08:00
Josh Rosen 5857b9ac2d [SPARK-18928] Check TaskContext.isInterrupted() in FileScanRDD, JDBCRDD & UnsafeSorter
## What changes were proposed in this pull request?

In order to respond to task cancellation, Spark tasks must periodically check `TaskContext.isInterrupted()`, but this check is missing on a few critical read paths used in Spark SQL, including `FileScanRDD`, `JDBCRDD`, and UnsafeSorter-based sorts. This can cause interrupted / cancelled tasks to continue running and become zombies (as also described in #16189).

This patch aims to fix this problem by adding `TaskContext.isInterrupted()` checks to these paths. Note that I could have used `InterruptibleIterator` to simply wrap a bunch of iterators but in some cases this would have an adverse performance penalty or might not be effective due to certain special uses of Iterators in Spark SQL. Instead, I inlined `InterruptibleIterator`-style logic into existing iterator subclasses.

## How was this patch tested?

Tested manually in `spark-shell` with two different reproductions of non-cancellable tasks, one involving scans of huge files and another involving sort-merge joins that spill to disk. Both causes of zombie tasks are fixed by the changes added here.

Author: Josh Rosen <joshrosen@databricks.com>

Closes #16340 from JoshRosen/sql-task-interruption.
2016-12-20 01:19:38 +01:00
Shivaram Venkataraman 4cb49412d1 [SPARK-18836][CORE] Serialize one copy of task metrics in DAGScheduler
## What changes were proposed in this pull request?

Right now we serialize the empty task metrics once per task – Since this is shared across all tasks we could use the same serialized task metrics across all tasks of a stage.

## How was this patch tested?

- [x] Run tests on EC2 to measure performance improvement

Author: Shivaram Venkataraman <shivaram@cs.berkeley.edu>

Closes #16261 from shivaram/task-metrics-one-copy.
2016-12-19 14:53:01 -08:00
Yuming Wang 1e5c51f336
[SPARK-18827][CORE] Fix cannot read broadcast on disk
## What changes were proposed in this pull request?
`NoSuchElementException` will throw since https://github.com/apache/spark/pull/15056 if a broadcast cannot cache in memory. The reason is that that change cannot cover `!unrolled.hasNext` in `next()` function.

This change is to cover the `!unrolled.hasNext` and check `hasNext` before calling `next` in `blockManager.getLocalValues` to make it  more robust.

We can cache and read broadcast even it cannot fit in memory from this pull request.

Exception log:
```
16/12/10 10:10:04 INFO UnifiedMemoryManager: Will not store broadcast_131 as the required space (1048576 bytes) exceeds our memory limit (122764 bytes)
16/12/10 10:10:04 WARN MemoryStore: Failed to reserve initial memory threshold of 1024.0 KB for computing block broadcast_131 in memory.
16/12/10 10:10:04 WARN MemoryStore: Not enough space to cache broadcast_131 in memory! (computed 384.0 B so far)
16/12/10 10:10:04 INFO MemoryStore: Memory use = 95.6 KB (blocks) + 0.0 B (scratch space shared across 0 tasks(s)) = 95.6 KB. Storage limit = 119.9 KB.
16/12/10 10:10:04 ERROR Utils: Exception encountered
java.util.NoSuchElementException
	at org.apache.spark.util.collection.PrimitiveVector$$anon$1.next(PrimitiveVector.scala:58)
	at org.apache.spark.storage.memory.PartiallyUnrolledIterator.next(MemoryStore.scala:700)
	at org.apache.spark.util.CompletionIterator.next(CompletionIterator.scala:30)
	at org.apache.spark.broadcast.TorrentBroadcast$$anonfun$readBroadcastBlock$1$$anonfun$2.apply(TorrentBroadcast.scala:210)
	at org.apache.spark.broadcast.TorrentBroadcast$$anonfun$readBroadcastBlock$1$$anonfun$2.apply(TorrentBroadcast.scala:210)
	at scala.Option.map(Option.scala:146)
	at org.apache.spark.broadcast.TorrentBroadcast$$anonfun$readBroadcastBlock$1.apply(TorrentBroadcast.scala:210)
	at org.apache.spark.util.Utils$.tryOrIOException(Utils.scala:1269)
	at org.apache.spark.broadcast.TorrentBroadcast.readBroadcastBlock(TorrentBroadcast.scala:206)
	at org.apache.spark.broadcast.TorrentBroadcast._value$lzycompute(TorrentBroadcast.scala:66)
	at org.apache.spark.broadcast.TorrentBroadcast._value(TorrentBroadcast.scala:66)
	at org.apache.spark.broadcast.TorrentBroadcast.getValue(TorrentBroadcast.scala:96)
	at org.apache.spark.broadcast.Broadcast.value(Broadcast.scala:70)
	at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:86)
	at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:53)
	at org.apache.spark.scheduler.Task.run(Task.scala:108)
	at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:282)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
	at java.lang.Thread.run(Thread.java:745)
16/12/10 10:10:04 ERROR Executor: Exception in task 1.0 in stage 86.0 (TID 134423)
java.io.IOException: java.util.NoSuchElementException
	at org.apache.spark.util.Utils$.tryOrIOException(Utils.scala:1276)
	at org.apache.spark.broadcast.TorrentBroadcast.readBroadcastBlock(TorrentBroadcast.scala:206)
	at org.apache.spark.broadcast.TorrentBroadcast._value$lzycompute(TorrentBroadcast.scala:66)
	at org.apache.spark.broadcast.TorrentBroadcast._value(TorrentBroadcast.scala:66)
	at org.apache.spark.broadcast.TorrentBroadcast.getValue(TorrentBroadcast.scala:96)
	at org.apache.spark.broadcast.Broadcast.value(Broadcast.scala:70)
	at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:86)
	at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:53)
	at org.apache.spark.scheduler.Task.run(Task.scala:108)
	at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:282)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
	at java.lang.Thread.run(Thread.java:745)
Caused by: java.util.NoSuchElementException
	at org.apache.spark.util.collection.PrimitiveVector$$anon$1.next(PrimitiveVector.scala:58)
	at org.apache.spark.storage.memory.PartiallyUnrolledIterator.next(MemoryStore.scala:700)
	at org.apache.spark.util.CompletionIterator.next(CompletionIterator.scala:30)
	at org.apache.spark.broadcast.TorrentBroadcast$$anonfun$readBroadcastBlock$1$$anonfun$2.apply(TorrentBroadcast.scala:210)
	at org.apache.spark.broadcast.TorrentBroadcast$$anonfun$readBroadcastBlock$1$$anonfun$2.apply(TorrentBroadcast.scala:210)
	at scala.Option.map(Option.scala:146)
	at org.apache.spark.broadcast.TorrentBroadcast$$anonfun$readBroadcastBlock$1.apply(TorrentBroadcast.scala:210)
	at org.apache.spark.util.Utils$.tryOrIOException(Utils.scala:1269)
	... 12 more
```

## How was this patch tested?

Add unit test

Author: Yuming Wang <wgyumg@gmail.com>

Closes #16252 from wangyum/SPARK-18827.
2016-12-18 09:08:02 +00:00
uncleGen 6d2379b3b7
[SPARK-18485][CORE] Underlying integer overflow when create ChunkedByteBufferOutputStream in MemoryStore
## What changes were proposed in this pull request?

There is an underlying integer overflow when create ChunkedByteBufferOutputStream in MemoryStore. This PR provide a check before cast.

## How was this patch tested?

add new unit test

Author: uncleGen <hustyugm@gmail.com>

Closes #15915 from uncleGen/SPARK-18485.
2016-12-17 13:19:30 +00:00
hyukjinkwon 2bc1c95154 [SPARK-18895][TESTS] Fix resource-closing-related and path-related test failures in identified ones on Windows
## What changes were proposed in this pull request?

There are several tests failing due to resource-closing-related and path-related  problems on Windows as below.

- `RPackageUtilsSuite`:

```
- build an R package from a jar end to end *** FAILED *** (1 second, 625 milliseconds)
  java.io.IOException: Unable to delete file: C:\projects\spark\target\tmp\1481729427517-0\a\dep2\d\dep2-d.jar
  at org.apache.commons.io.FileUtils.forceDelete(FileUtils.java:2279)
  at org.apache.commons.io.FileUtils.cleanDirectory(FileUtils.java:1653)
  at org.apache.commons.io.FileUtils.deleteDirectory(FileUtils.java:1535)

- faulty R package shows documentation *** FAILED *** (359 milliseconds)
  java.io.IOException: Unable to delete file: C:\projects\spark\target\tmp\1481729428970-0\dep1-c.jar
  at org.apache.commons.io.FileUtils.forceDelete(FileUtils.java:2279)
  at org.apache.commons.io.FileUtils.cleanDirectory(FileUtils.java:1653)
  at org.apache.commons.io.FileUtils.deleteDirectory(FileUtils.java:1535)

- SparkR zipping works properly *** FAILED *** (47 milliseconds)
  java.util.regex.PatternSyntaxException: Unknown character property name {r} near index 4

C:\projects\spark\target\tmp\1481729429282-0

    ^
  at java.util.regex.Pattern.error(Pattern.java:1955)
  at java.util.regex.Pattern.charPropertyNodeFor(Pattern.java:2781)
```

- `InputOutputMetricsSuite`:

```
- input metrics for old hadoop with coalesce *** FAILED *** (240 milliseconds)
  java.io.IOException: Not a file: file:/C:/projects/spark/core/ignored
  at org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:277)
  at org.apache.spark.rdd.HadoopRDD.getPartitions(HadoopRDD.scala:202)
  at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:252)

- input metrics with cache and coalesce *** FAILED *** (109 milliseconds)
  java.io.IOException: Not a file: file:/C:/projects/spark/core/ignored
  at org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:277)
  at org.apache.spark.rdd.HadoopRDD.getPartitions(HadoopRDD.scala:202)
  at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:252)

- input metrics for new Hadoop API with coalesce *** FAILED *** (0 milliseconds)
  java.lang.IllegalArgumentException: Wrong FS: file://C:\projects\spark\target\tmp\spark-9366ec94-dac7-4a5c-a74b-3e7594a692ab\test\InputOutputMetricsSuite.txt, expected: file:///
  at org.apache.hadoop.fs.FileSystem.checkPath(FileSystem.java:642)
  at org.apache.hadoop.fs.FileSystem.makeQualified(FileSystem.java:462)
  at org.apache.hadoop.fs.FilterFileSystem.makeQualified(FilterFileSystem.java:114)

- input metrics when reading text file *** FAILED *** (110 milliseconds)
  java.io.IOException: Not a file: file:/C:/projects/spark/core/ignored
  at org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:277)
  at org.apache.spark.rdd.HadoopRDD.getPartitions(HadoopRDD.scala:202)
  at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:252)

- input metrics on records read - simple *** FAILED *** (125 milliseconds)
  java.io.IOException: Not a file: file:/C:/projects/spark/core/ignored
  at org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:277)
  at org.apache.spark.rdd.HadoopRDD.getPartitions(HadoopRDD.scala:202)
  at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:252)

- input metrics on records read - more stages *** FAILED *** (110 milliseconds)
  java.io.IOException: Not a file: file:/C:/projects/spark/core/ignored
  at org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:277)
  at org.apache.spark.rdd.HadoopRDD.getPartitions(HadoopRDD.scala:202)
  at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:252)

- input metrics on records - New Hadoop API *** FAILED *** (16 milliseconds)
  java.lang.IllegalArgumentException: Wrong FS: file://C:\projects\spark\target\tmp\spark-3f10a1a4-7820-4772-b821-25fd7523bf6f\test\InputOutputMetricsSuite.txt, expected: file:///
  at org.apache.hadoop.fs.FileSystem.checkPath(FileSystem.java:642)
  at org.apache.hadoop.fs.FileSystem.makeQualified(FileSystem.java:462)
  at org.apache.hadoop.fs.FilterFileSystem.makeQualified(FilterFileSystem.java:114)

- input metrics on records read with cache *** FAILED *** (93 milliseconds)
  java.io.IOException: Not a file: file:/C:/projects/spark/core/ignored
  at org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:277)
  at org.apache.spark.rdd.HadoopRDD.getPartitions(HadoopRDD.scala:202)
  at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:252)

- input read/write and shuffle read/write metrics all line up *** FAILED *** (93 milliseconds)
  java.io.IOException: Not a file: file:/C:/projects/spark/core/ignored
  at org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:277)
  at org.apache.spark.rdd.HadoopRDD.getPartitions(HadoopRDD.scala:202)
  at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:252)

- input metrics with interleaved reads *** FAILED *** (0 milliseconds)
  java.lang.IllegalArgumentException: Wrong FS: file://C:\projects\spark\target\tmp\spark-2638d893-e89b-47ce-acd0-bbaeee78dd9b\InputOutputMetricsSuite_cart.txt, expected: file:///
  at org.apache.hadoop.fs.FileSystem.checkPath(FileSystem.java:642)
  at org.apache.hadoop.fs.FileSystem.makeQualified(FileSystem.java:462)
  at org.apache.hadoop.fs.FilterFileSystem.makeQualified(FilterFileSystem.java:114)

- input metrics with old CombineFileInputFormat *** FAILED *** (157 milliseconds)
  17947 was not greater than or equal to 300000 (InputOutputMetricsSuite.scala:324)
  org.scalatest.exceptions.TestFailedException:
  at org.scalatest.Assertions$class.newAssertionFailedException(Assertions.scala:500)
  at org.scalatest.FunSuite.newAssertionFailedException(FunSuite.scala:1555)
  at org.scalatest.Assertions$AssertionsHelper.macroAssert(Assertions.scala:466)

- input metrics with new CombineFileInputFormat *** FAILED *** (16 milliseconds)
  java.lang.IllegalArgumentException: Wrong FS: file://C:\projects\spark\target\tmp\spark-11920c08-19d8-4c7c-9fba-28ed72b79f80\test\InputOutputMetricsSuite.txt, expected: file:///
  at org.apache.hadoop.fs.FileSystem.checkPath(FileSystem.java:642)
  at org.apache.hadoop.fs.FileSystem.makeQualified(FileSystem.java:462)
  at org.apache.hadoop.fs.FilterFileSystem.makeQualified(FilterFileSystem.java:114)
```

- `ReplayListenerSuite`:

```
- End-to-end replay *** FAILED *** (121 milliseconds)
  java.io.IOException: No FileSystem for scheme: C
  at org.apache.hadoop.fs.FileSystem.getFileSystemClass(FileSystem.java:2421)
  at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:2428)

- End-to-end replay with compression *** FAILED *** (516 milliseconds)
  java.io.IOException: No FileSystem for scheme: C
  at org.apache.hadoop.fs.FileSystem.getFileSystemClass(FileSystem.java:2421)
  at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:2428)
  at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:88)
```

- `EventLoggingListenerSuite`:

```
- End-to-end event logging *** FAILED *** (7 seconds, 435 milliseconds)
  java.io.IOException: No FileSystem for scheme: C
  at org.apache.hadoop.fs.FileSystem.getFileSystemClass(FileSystem.java:2421)
  at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:2428)
  at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:88)

- End-to-end event logging with compression *** FAILED *** (1 second)
  java.io.IOException: No FileSystem for scheme: C
  at org.apache.hadoop.fs.FileSystem.getFileSystemClass(FileSystem.java:2421)
  at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:2428)
  at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:88)

- Event log name *** FAILED *** (16 milliseconds)
  "file:/[]base-dir/app1" did not equal "file:/[C:/]base-dir/app1" (EventLoggingListenerSuite.scala:123)
  org.scalatest.exceptions.TestFailedException:
  at org.scalatest.Assertions$class.newAssertionFailedException(Assertions.scala:500)
  at org.scalatest.FunSuite.newAssertionFailedException(FunSuite.scala:1555)
  at org.scalatest.Assertions$AssertionsHelper.macroAssert(Assertions.scala:466)
```

This PR proposes to fix the test failures on Windows

## How was this patch tested?

Manually tested via AppVeyor

**Before**

`RPackageUtilsSuite`: https://ci.appveyor.com/project/spark-test/spark/build/273-RPackageUtilsSuite-before
`InputOutputMetricsSuite`: https://ci.appveyor.com/project/spark-test/spark/build/272-InputOutputMetricsSuite-before
`ReplayListenerSuite`: https://ci.appveyor.com/project/spark-test/spark/build/274-ReplayListenerSuite-before
`EventLoggingListenerSuite`: https://ci.appveyor.com/project/spark-test/spark/build/275-EventLoggingListenerSuite-before

**After**

`RPackageUtilsSuite`: https://ci.appveyor.com/project/spark-test/spark/build/270-RPackageUtilsSuite
`InputOutputMetricsSuite`: https://ci.appveyor.com/project/spark-test/spark/build/271-InputOutputMetricsSuite
`ReplayListenerSuite`: https://ci.appveyor.com/project/spark-test/spark/build/277-ReplayListenerSuite-after
`EventLoggingListenerSuite`: https://ci.appveyor.com/project/spark-test/spark/build/278-EventLoggingListenerSuite-after

Author: hyukjinkwon <gurwls223@gmail.com>

Closes #16305 from HyukjinKwon/RPackageUtilsSuite-InputOutputMetricsSuite.
2016-12-16 21:32:24 -08:00
Mark Hamstra 295db8259b [SPARK-17769][CORE][SCHEDULER] Some FetchFailure refactoring
## What changes were proposed in this pull request?

Readability rewrites.
Changed order of `failedStage.failedOnFetchAndShouldAbort(task.stageAttemptId)` and `disallowStageRetryForTest` evaluation.
Stage resubmission guard condition changed from `failedStages.isEmpty` to `!failedStages.contains(failedStage)`
Log all resubmission of stages
## How was this patch tested?

existing tests

Author: Mark Hamstra <markhamstra@gmail.com>

Closes #15335 from markhamstra/SPARK-17769.
2016-12-16 12:46:32 -08:00
Aliaksandr.Bedrytski f7a574a6cb
[SPARK-18708][CORE] Improvement/improve docs in spark context file
## What changes were proposed in this pull request?

SparkContext.scala was created a long time ago and contains several types of Scaladocs/Javadocs mixed together. Public methods/fields should have a Scaladoc that is formatted in the same way everywhere. This pull request also adds scaladoc to methods/fields that did not have it before.

## How was this patch tested?

No actual code was modified, only comments.

Please review http://spark.apache.org/contributing.html before opening a pull request.

Author: Aliaksandr.Bedrytski <aliaksandr.bedrytski@valtech.co.uk>

Closes #16137 from Mironor/improvement/improve-docs-in-spark-context-file.
2016-12-16 17:47:08 +00:00
Michal Senkyr 836c95b108
[SPARK-18723][DOC] Expanded programming guide information on wholeTex…
## What changes were proposed in this pull request?

Add additional information to wholeTextFiles in the Programming Guide. Also explain partitioning policy difference in relation to textFile and its impact on performance.

Also added reference to the underlying CombineFileInputFormat

## How was this patch tested?

Manual build of documentation and inspection in browser

```
cd docs
jekyll serve --watch
```

Author: Michal Senkyr <mike.senkyr@gmail.com>

Closes #16157 from michalsenkyr/wholeTextFilesExpandedDocs.
2016-12-16 17:43:39 +00:00
root 53ab8fb346
[SPARK-18742][CORE] Clarify that user-defined BroadcastFactory is not supported
## What changes were proposed in this pull request?
After SPARK-12588 Remove HTTPBroadcast [1], the one and only implementation of BroadcastFactory is TorrentBroadcastFactory and the spark.broadcast.factory conf has removed.

however the scaladoc says [2]:

/**
 * An interface for all the broadcast implementations in Spark (to allow
 * multiple broadcast implementations). SparkContext uses a user-specified
 * BroadcastFactory implementation to instantiate a particular broadcast for the
 * entire Spark job.
 */

so we should modify the comment that SparkContext will not use a  user-specified BroadcastFactory implementation

[1] https://issues.apache.org/jira/browse/SPARK-12588
[2] https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/broadcast/BroadcastFactory.scala#L25-L30

## How was this patch tested?
unit test added

Author: root <root@iZbp1gsnrlfzjxh82cz80vZ.(none)>
Author: windpiger <songjun@outlook.com>

Closes #16173 from windpiger/addBroadFactoryConf.
2016-12-16 10:16:57 +00:00
Imran Rashid 93cdb8a7d0 [SPARK-8425][CORE] Application Level Blacklisting
## What changes were proposed in this pull request?

This builds upon the blacklisting introduced in SPARK-17675 to add blacklisting of executors and nodes for an entire Spark application.  Resources are blacklisted based on tasks that fail, in tasksets that eventually complete successfully; they are automatically returned to the pool of active resources based on a timeout.  Full details are available in a design doc attached to the jira.
## How was this patch tested?

Added unit tests, ran them via Jenkins, also ran a handful of them in a loop to check for flakiness.

The added tests include:
- verifying BlacklistTracker works correctly
- verifying TaskSchedulerImpl interacts with BlacklistTracker correctly (via a mock BlacklistTracker)
- an integration test for the entire scheduler with blacklisting in a few different scenarios

Author: Imran Rashid <irashid@cloudera.com>
Author: mwws <wei.mao@intel.com>

Closes #14079 from squito/blacklist-SPARK-8425.
2016-12-15 08:29:56 -06:00
hyukjinkwon 169b9d73ee
[SPARK-18830][TESTS] Fix tests in PipedRDDSuite to pass on Windows
## What changes were proposed in this pull request?

This PR proposes to fix the tests failed on Windows as below:

```
[info] - pipe with empty partition *** FAILED *** (672 milliseconds)
[info]   Set(0, 4, 5) did not equal Set(0, 5, 6) (PipedRDDSuite.scala:145)
[info]   org.scalatest.exceptions.TestFailedException:
...
```

In this case, `wc -c` counts the characters on both Windows and Linux but the newlines characters on Windows are `\r\n` which are two. So, the counts ends up one more for each.

```
[info] - test pipe exports map_input_file *** FAILED *** (62 milliseconds)
[info]   java.lang.IllegalStateException: Subprocess exited with status 1. Command ran: printenv map_input_file
[info]   at org.apache.spark.rdd.PipedRDD$$anon$1.hasNext(PipedRDD.scala:178)
...
```

```
[info] - test pipe exports mapreduce_map_input_file *** FAILED *** (172 milliseconds)
[info]   java.lang.IllegalStateException: Subprocess exited with status 1. Command ran: printenv mapreduce_map_input_file
[info]   at org.apache.spark.rdd.PipedRDD$$anon$1.hasNext(PipedRDD.scala:178)
...
```

`printenv` command prints the environment variables; however, when environment variables are set to `ProcessBuilder` as lower-cased keys, `printenv` in Windows ignores and does not print this although it is actually set and accessible. (this was tested in [here](https://ci.appveyor.com/project/spark-test/spark/build/208-PipedRDDSuite) for upper-cases with this [diff](https://github.com/apache/spark/compare/master...spark-test:74d39da) and [here](https://ci.appveyor.com/project/spark-test/spark/build/203-PipedRDDSuite) for lower-cases with this [diff](https://github.com/apache/spark/compare/master...spark-test:fde5e37f28032c15a8d8693ba033a8a779a26317). It seems a bug in `printenv`.
(BTW, note that environment variables on Windows are case-insensitive).

This is (I believe) a thirdparty tool on Windows that resembles `printenv` on Linux (installed in AppVeyor environment or Windows Server 2012 R2). This command does not exist, at least, for Windows 7 and 10 (manually tested).

On Windows, we can use `cmd.exe /C set [varname]` officially for this purpose. We could fix the tests with this in order to test if the environment variable is set.

## How was this patch tested?

Manually tested via AppVeyor.

**Before**
https://ci.appveyor.com/project/spark-test/spark/build/194-PipedRDDSuite

**After**
https://ci.appveyor.com/project/spark-test/spark/build/226-PipedRDDSuite

Author: hyukjinkwon <gurwls223@gmail.com>

Closes #16254 from HyukjinKwon/pipe-errors.
2016-12-14 19:27:29 +00:00
hyukjinkwon c6b8eb71a9
[SPARK-18842][TESTS][LAUNCHER] De-duplicate paths in classpaths in commands for local-cluster mode to work around the path length limitation on Windows
## What changes were proposed in this pull request?

Currently, some tests are being failed and hanging on Windows due to this problem. For the reason in SPARK-18718, some tests using `local-cluster` mode were disabled on Windows due to the length limitation by paths given to classpaths.

The limitation seems roughly 32K (see the [blog in MS](https://blogs.msdn.microsoft.com/oldnewthing/20031210-00/?p=41553/) and [another reference](https://support.thoughtworks.com/hc/en-us/articles/213248526-Getting-around-maximum-command-line-length-is-32767-characters-on-Windows)) but in `local-cluster` mode, executors were being launched as processes with the command such as [here](https://gist.github.com/HyukjinKwon/5bc81061c250d4af5a180869b59d42ea) in (only) tests.

This length is roughly 40K due to the classpaths given to `java` command. However, it seems duplicates are almost half of them. So, if we deduplicate the paths, it seems reduced to roughly 20K with the command, [here](https://gist.github.com/HyukjinKwon/dad0c8db897e5e094684a2dc6a417790).

Maybe, we should consider as some more paths are added in the future but it seems better than disabling all the tests for now with minimised changes.

Therefore, this PR proposes to deduplicate the paths in classpaths in case of launching executors as processes in `local-cluster` mode.

## How was this patch tested?

Existing tests in `ShuffleSuite` and `BroadcastJoinSuite` manually via AppVeyor

Author: hyukjinkwon <gurwls223@gmail.com>

Closes #16266 from HyukjinKwon/disable-local-cluster-tests.
2016-12-14 19:24:24 +00:00
Imran Rashid ac013ea589 [SPARK-18846][SCHEDULER] Fix flakiness in SchedulerIntegrationSuite
There is a small race in SchedulerIntegrationSuite.
The test assumes that the taskscheduler thread
processing that last task will finish before the DAGScheduler processes
the task event and notifies the job waiter, but that is not 100%
guaranteed.

ran the test locally a bunch of times, never failed, though admittedly
it never failed locally for me before either.  However I am nearly 100%
certain this is what caused the failure of one jenkins build
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/68694/consoleFull
(which is long gone now, sorry -- I fixed it as part of
https://github.com/apache/spark/pull/14079 initially)

Author: Imran Rashid <irashid@cloudera.com>

Closes #16270 from squito/sched_integ_flakiness.
2016-12-14 12:27:01 -06:00
Alex Bozarth aebf44e50b
[SPARK-18816][WEB UI] Executors Logs column only ran visibility check on initial table load
## What changes were proposed in this pull request?

When I added a visibility check for the logs column on the executors page in #14382 the method I used only ran the check on the initial DataTable creation and not subsequent page loads. I moved the check out of the table definition and instead it runs on each page load. The jQuery DataTable functionality used is the same.

## How was this patch tested?

Tested Manually

No visible UI changes to screenshot.

Author: Alex Bozarth <ajbozart@us.ibm.com>

Closes #16256 from ajbozarth/spark18816.
2016-12-13 21:37:46 +00:00
Shixiong Zhu fb3081d3b3 [SPARK-13747][CORE] Fix potential ThreadLocal leaks in RPC when using ForkJoinPool
## What changes were proposed in this pull request?

Some places in SQL may call `RpcEndpointRef.askWithRetry` (e.g., ParquetFileFormat.buildReader -> SparkContext.broadcast -> ... -> BlockManagerMaster.updateBlockInfo -> RpcEndpointRef.askWithRetry), which will finally call `Await.result`. It may cause `java.lang.IllegalArgumentException: spark.sql.execution.id is already set` when running in Scala ForkJoinPool.

This PR includes the following changes to fix this issue:

- Remove `ThreadUtils.awaitResult`
- Rename `ThreadUtils. awaitResultInForkJoinSafely` to `ThreadUtils.awaitResult`
- Replace `Await.result` in RpcTimeout with `ThreadUtils.awaitResult`.

## How was this patch tested?

Jenkins

Author: Shixiong Zhu <shixiong@databricks.com>

Closes #16230 from zsxwing/fix-SPARK-13747.
2016-12-13 09:53:22 -08:00
Jacek Laskowski 096f868b74
[MINOR][CORE][SQL] Remove explicit RDD and Partition overrides
## What changes were proposed in this pull request?

I **believe** that I _only_ removed duplicated code (that adds nothing but noise). I'm gonna remove the comment after Jenkins has built the changes with no issues and Spark devs has agreed to include the changes.

Remove explicit `RDD` and `Partition` overrides (that turn out code duplication)

## How was this patch tested?

Local build. Awaiting Jenkins.

…cation)

Author: Jacek Laskowski <jacek@japila.pl>

Closes #16145 from jaceklaskowski/rdd-overrides-removed.
2016-12-13 09:40:16 +00:00
Marcelo Vanzin bc59951bab [SPARK-18773][CORE] Make commons-crypto config translation consistent.
This change moves the logic that translates Spark configuration to
commons-crypto configuration to the network-common module. It also
extends TransportConf and ConfigProvider to provide the necessary
interfaces for the translation to work.

As part of the change, I removed SystemPropertyConfigProvider, which
was mostly used as an "empty config" in unit tests, and adjusted the
very few tests that required a specific config.

I also changed the config keys for AES encryption to live under the
"spark.network." namespace, which is more correct than their previous
names under "spark.authenticate.".

Tested via existing unit test.

Author: Marcelo Vanzin <vanzin@cloudera.com>

Closes #16200 from vanzin/SPARK-18773.
2016-12-12 16:27:04 -08:00
Steve Loughran 586d198228 [SPARK-15844][CORE] HistoryServer doesn't come up if spark.authenticate = true
## What changes were proposed in this pull request?

During history server startup, the spark configuration is examined. If security.authentication is
set, log at debug and set the value to false, so that {{SecurityManager}} can be created.

## How was this patch tested?

A new test in `HistoryServerSuite` sets the `spark.authenticate` property to true, tries to create a security manager via a new package-private method `HistoryServer.createSecurityManager(SparkConf)`. This is the method used in `HistoryServer.main`. All other instantiations of a security manager in `HistoryServerSuite` have been switched to the new method, for consistency with the production code.

Author: Steve Loughran <stevel@apache.org>

Closes #13579 from steveloughran/history/SPARK-15844-security.
2016-12-12 12:24:53 -08:00
hyukjinkwon e094d01156
[SPARK-18803][TESTS] Fix JarEntry-related & path-related test failures and skip some tests by path length limitation on Windows
## What changes were proposed in this pull request?

This PR proposes to fix some tests being failed on Windows as below for several problems.

### Incorrect path handling

- FileSuite
  ```
  [info] - binary file input as byte array *** FAILED *** (500 milliseconds)
  [info]   "file:/C:/projects/spark/target/tmp/spark-e7c3a3b8-0a4b-4a7f-9ebe-7c4883e48624/record-bytestream-00000.bin" did not contain "C:\projects\spark\target\tmp\spark-e7c3a3b8-0a4b-4a7f-9ebe-7c4883e48624\record-bytestream-00000.bin" (FileSuite.scala:258)
  [info]   org.scalatest.exceptions.TestFailedException:
  [info]   at org.scalatest.Assertions$class.newAssertionFailedException(Assertions.scala:500)
  ...
  ```
  ```
  [info] - Get input files via old Hadoop API *** FAILED *** (1 second, 94 milliseconds)
  [info]   Set("/C:/projects/spark/target/tmp/spark-cf5b1f8b-c5ed-43e0-8d17-546ebbfa8200/output/part-00000", "/C:/projects/spark/target/tmp/spark-cf5b1f8b-c5ed-43e0-8d17-546ebbfa8200/output/part-00001") did not equal Set("C:\projects\spark\target\tmp\spark-cf5b1f8b-c5ed-43e0-8d17-546ebbfa8200\output/part-00000", "C:\projects\spark\target\tmp\spark-cf5b1f8b-c5ed-43e0-8d17-546ebbfa8200\output/part-00001") (FileSuite.scala:535)
  [info]   org.scalatest.exceptions.TestFailedException:
  [info]   at org.scalatest.Assertions$class.newAssertionFailedException(Assertions.scala:500)
  ...
  ```

  ```
  [info] - Get input files via new Hadoop API *** FAILED *** (313 milliseconds)
  [info]   Set("/C:/projects/spark/target/tmp/spark-12bc1540-1111-4df6-9c4d-79e0e614407c/output/part-00000", "/C:/projects/spark/target/tmp/spark-12bc1540-1111-4df6-9c4d-79e0e614407c/output/part-00001") did not equal Set("C:\projects\spark\target\tmp\spark-12bc1540-1111-4df6-9c4d-79e0e614407c\output/part-00000", "C:\projects\spark\target\tmp\spark-12bc1540-1111-4df6-9c4d-79e0e614407c\output/part-00001") (FileSuite.scala:549)
  [info]   org.scalatest.exceptions.TestFailedException:
  ...
  ```

- TaskResultGetterSuite

  ```
  [info] - handling results larger than max RPC message size *** FAILED *** (1 second, 579 milliseconds)
  [info]   1 did not equal 0 Expect result to be removed from the block manager. (TaskResultGetterSuite.scala:129)
  [info]   org.scalatest.exceptions.TestFailedException:
  [info]   ...
  [info]   Cause: java.net.URISyntaxException: Illegal character in path at index 12: string:///C:\projects\spark\target\tmp\spark-93c485af-68da-440f-a907-aac7acd5fc25\repro\MyException.java
  [info]   at java.net.URI$Parser.fail(URI.java:2848)
  [info]   at java.net.URI$Parser.checkChars(URI.java:3021)
  ...
  ```
  ```
  [info] - failed task deserialized with the correct classloader (SPARK-11195) *** FAILED *** (0 milliseconds)
  [info]   java.lang.IllegalArgumentException: Illegal character in path at index 12: string:///C:\projects\spark\target\tmp\spark-93c485af-68da-440f-a907-aac7acd5fc25\repro\MyException.java
  [info]   at java.net.URI.create(URI.java:852)
  ...
  ```

- SparkSubmitSuite

  ```
  [info]   java.lang.IllegalArgumentException: Illegal character in path at index 12: string:///C:\projects\spark\target\tmp\1481210831381-0\870903339\MyLib.java
  [info]   at java.net.URI.create(URI.java:852)
  [info]   at org.apache.spark.TestUtils$.org$apache$spark$TestUtils$$createURI(TestUtils.scala:112)
  ...
  ```

### Incorrect separate for JarEntry

After the path fix from above, then `TaskResultGetterSuite` throws another exception as below:

```
[info] - failed task deserialized with the correct classloader (SPARK-11195) *** FAILED *** (907 milliseconds)
[info]   java.lang.ClassNotFoundException: repro.MyException
[info]   at java.net.URLClassLoader.findClass(URLClassLoader.java:381)
...
```

This is because `Paths.get` concatenates the given paths to an OS-specific path (Windows `\` and Linux `/`). However, for `JarEntry` we should comply ZIP specification meaning it should be always `/` according to ZIP specification.

See `4.4.17 file name: (Variable)` in https://pkware.cachefly.net/webdocs/casestudies/APPNOTE.TXT

### Long path problem on Windows

Some tests in `ShuffleSuite` via `ShuffleNettySuite` were skipped due to the same reason with SPARK-18718

## How was this patch tested?

Manually via AppVeyor.

**Before**

- `FileSuite`, `TaskResultGetterSuite`,`SparkSubmitSuite`
  https://ci.appveyor.com/project/spark-test/spark/build/164-tmp-windows-base (please grep each to check each)
- `ShuffleSuite`
  https://ci.appveyor.com/project/spark-test/spark/build/157-tmp-windows-base

**After**

- `FileSuite`
  https://ci.appveyor.com/project/spark-test/spark/build/166-FileSuite
- `TaskResultGetterSuite`
  https://ci.appveyor.com/project/spark-test/spark/build/173-TaskResultGetterSuite
- `SparkSubmitSuite`
  https://ci.appveyor.com/project/spark-test/spark/build/167-SparkSubmitSuite
- `ShuffleSuite`
  https://ci.appveyor.com/project/spark-test/spark/build/176-ShuffleSuite

Author: hyukjinkwon <gurwls223@gmail.com>

Closes #16234 from HyukjinKwon/test-errors-windows.
2016-12-10 19:55:22 +00:00
WangTaoTheTonic 3a3e65adaf
[SPARK-18606][HISTORYSERVER] remove useless elements while searching
## What changes were proposed in this pull request?

When we search applications in HistoryServer, it will include all contents between <td> tag, which including useless elemtns like "<span title...", "a href" and making results confused.
We should remove those to make it clear.

## How was this patch tested?

manual tests.

Before:
![before](https://cloud.githubusercontent.com/assets/5276001/20662840/28bcc874-b590-11e6-9115-12fb64e49898.jpg)

After:
![after](https://cloud.githubusercontent.com/assets/5276001/20662844/2f717af2-b590-11e6-97dc-a48b08a54247.jpg)

Author: WangTaoTheTonic <wangtao111@huawei.com>

Closes #16031 from WangTaoTheTonic/span.
2016-12-10 16:43:08 +00:00
Davies Liu cf33a86285 [SPARK-4105] retry the fetch or stage if shuffle block is corrupt
## What changes were proposed in this pull request?

There is an outstanding issue that existed for a long time: Sometimes the shuffle blocks are corrupt and can't be decompressed. We recently hit this in three different workloads, sometimes we can reproduce it by every try, sometimes can't. I also found that when the corruption happened, the beginning and end of the blocks are correct, the corruption happen in the middle. There was one case that the string of block id is corrupt by one character. It seems that it's very likely the corruption is introduced by some weird machine/hardware, also the checksum (16 bits) in TCP is not strong enough to identify all the corruption.

Unfortunately, Spark does not have checksum for shuffle blocks or broadcast, the job will fail if any corruption happen in the shuffle block from disk, or broadcast blocks during network. This PR try to detect the corruption after fetching shuffle blocks by decompressing them, because most of the compression already have checksum in them. It will retry the block, or failed with FetchFailure, so the previous stage could be retried on different (still random) machines.

Checksum for broadcast will be added by another PR.

## How was this patch tested?

Added unit tests

Author: Davies Liu <davies@databricks.com>

Closes #15923 from davies/detect_corrupt.
2016-12-09 15:44:22 -08:00
Xiangrui Meng fd48d80a61 [SPARK-17822][R] Make JVMObjectTracker a member variable of RBackend
## What changes were proposed in this pull request?

* This PR changes `JVMObjectTracker` from `object` to `class` and let its instance associated with each RBackend. So we can manage the lifecycle of JVM objects when there are multiple `RBackend` sessions. `RBackend.close` will clear the object tracker explicitly.
* I assume that `SQLUtils` and `RRunner` do not need to track JVM instances, which could be wrong.
* Small refactor of `SerDe.sqlSerDe` to increase readability.

## How was this patch tested?

* Added unit tests for `JVMObjectTracker`.
* Wait for Jenkins to run full tests.

Author: Xiangrui Meng <meng@databricks.com>

Closes #16154 from mengxr/SPARK-17822.
2016-12-09 07:51:46 -08:00
Jacek Laskowski b162cc0c28
[MINOR][CORE][SQL][DOCS] Typo fixes
## What changes were proposed in this pull request?

Typo fixes

## How was this patch tested?

Local build. Awaiting the official build.

Author: Jacek Laskowski <jacek@japila.pl>

Closes #16144 from jaceklaskowski/typo-fixes.
2016-12-09 18:45:57 +08:00
Shixiong Zhu 26432df9cc [SPARK-18751][CORE] Fix deadlock when SparkContext.stop is called in Utils.tryOrStopSparkContext
## What changes were proposed in this pull request?

When `SparkContext.stop` is called in `Utils.tryOrStopSparkContext` (the following three places), it will cause deadlock because the `stop` method needs to wait for the thread running `stop` to exit.

- ContextCleaner.keepCleaning
- LiveListenerBus.listenerThread.run
- TaskSchedulerImpl.start

This PR adds `SparkContext.stopInNewThread` and uses it to eliminate the potential deadlock. I also removed my changes in #15775 since they are not necessary now.

## How was this patch tested?

Jenkins

Author: Shixiong Zhu <shixiong@databricks.com>

Closes #16178 from zsxwing/fix-stop-deadlock.
2016-12-08 11:54:04 -08:00
Ergin Seyfe ed8869ebbf [SPARK-8617][WEBUI] HistoryServer: Include in-progress files during cleanup
## What changes were proposed in this pull request?
- Removed the`attempt.completed ` filter so cleaner would include the orphan inprogress files.
- Use loading time for inprogress files as lastUpdated. Keep using the modTime for completed files. First one will prevent deletion of inprogress job files. Second one will ensure that lastUpdated time won't change for completed jobs in an event of HistoryServer reboot.

## How was this patch tested?
Added new unittests and via existing tests.

Author: Ergin Seyfe <eseyfe@fb.com>

Closes #16165 from seyfe/clear_old_inprogress_files.
2016-12-08 10:21:09 -08:00
hyukjinkwon 7f3c778fd0
[SPARK-18718][TESTS] Skip some test failures due to path length limitation and fix tests to pass on Windows
## What changes were proposed in this pull request?

There are some tests failed on Windows due to the wrong format of path and the limitation of path length as below:

This PR proposes both to fix the failed tests by fixing the path for the tests below:

- `InsertSuite`
  ```
  Exception encountered when attempting to run a suite with class name: org.apache.spark.sql.sources.InsertSuite *** ABORTED *** (12 seconds, 547 milliseconds)
      org.apache.spark.sql.AnalysisException: Path does not exist: file:/C:projectsspark	arget	mpspark-177945ef-9128-42b4-8c07-de31f78bbbd6;
      at org.apache.spark.sql.execution.datasources.DataSource$$anonfun$14.apply(DataSource.scala:382)
      at org.apache.spark.sql.execution.datasources.DataSource$$anonfun$14.apply(DataSource.scala:370)
      at scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:241)
  ```

- `PathOptionSuite`
  ```
  - path option also exist for write path *** FAILED *** (1 second, 93 milliseconds)
    "C:[projectsspark	arget	mp]spark-5ab34a58-df8d-..." did not equal "C:[\projects\spark\target\tmp\]spark-5ab34a58-df8d-..." (PathOptionSuite.scala:93)
    org.scalatest.exceptions.TestFailedException:
        at org.scalatest.Assertions$class.newAssertionFailedException(Assertions.scala:500)
        at org.scalatest.FunSuite.newAssertionFailedException(FunSuite.scala:1555)
    ...
  ```

- `UDFSuite`
  ```
  - SPARK-8005 input_file_name *** FAILED *** (2 seconds, 234 milliseconds)
    "file:///C:/projects/spark/target/tmp/spark-e4e5720a-2006-48f9-8b11-797bf59794bf/part-00001-26fb05e4-603d-471d-ae9d-b9549e0c7765.snappy.parquet" did not contain "C:\projects\spark\target\tmp\spark-e4e5720a-2006-48f9-8b11-797bf59794bf" (UDFSuite.scala:67)
    org.scalatest.exceptions.TestFailedException:
      at org.scalatest.Assertions$class.newAssertionFailedException(Assertions.scala:500)
      at org.scalatest.FunSuite.newAssertionFailedException(FunSuite.scala:1555)
    ...
  ```

and to skip the tests belows which are being failed on Windows due to path length limitation.

- `SparkLauncherSuite`
  ```
  Test org.apache.spark.launcher.SparkLauncherSuite.testChildProcLauncher failed: java.lang.AssertionError: expected:<0> but was:<1>, took 0.062 sec
    at org.apache.spark.launcher.SparkLauncherSuite.testChildProcLauncher(SparkLauncherSuite.java:177)
      ...
  ```

  The stderr from the process is `The filename or extension is too long` which is equivalent to the one below.

- `BroadcastJoinSuite`
  ```
  04:09:40.882 ERROR org.apache.spark.deploy.worker.ExecutorRunner: Error running executor
  java.io.IOException: Cannot run program "C:\Progra~1\Java\jdk1.8.0\bin\java" (in directory "C:\projects\spark\work\app-20161205040542-0000\51658"): CreateProcess error=206, The filename or extension is too long
      at java.lang.ProcessBuilder.start(ProcessBuilder.java:1048)
      at org.apache.spark.deploy.worker.ExecutorRunner.org$apache$spark$deploy$worker$ExecutorRunner$$fetchAndRunExecutor(ExecutorRunner.scala:167)
      at org.apache.spark.deploy.worker.ExecutorRunner$$anon$1.run(ExecutorRunner.scala:73)
  Caused by: java.io.IOException: CreateProcess error=206, The filename or extension is too long
      at java.lang.ProcessImpl.create(Native Method)
      at java.lang.ProcessImpl.<init>(ProcessImpl.java:386)
      at java.lang.ProcessImpl.start(ProcessImpl.java:137)
      at java.lang.ProcessBuilder.start(ProcessBuilder.java:1029)
      ... 2 more
  04:09:40.929 ERROR org.apache.spark.deploy.worker.ExecutorRunner: Error running executor

    (appearently infinite same error messages)

  ...
  ```

## How was this patch tested?

Manually tested via AppVeyor.

**Before**

`InsertSuite`: https://ci.appveyor.com/project/spark-test/spark/build/148-InsertSuite-pr
`PathOptionSuite`: https://ci.appveyor.com/project/spark-test/spark/build/139-PathOptionSuite-pr
`UDFSuite`: https://ci.appveyor.com/project/spark-test/spark/build/143-UDFSuite-pr
`SparkLauncherSuite`: https://ci.appveyor.com/project/spark-test/spark/build/141-SparkLauncherSuite-pr
`BroadcastJoinSuite`: https://ci.appveyor.com/project/spark-test/spark/build/145-BroadcastJoinSuite-pr

**After**

`PathOptionSuite`: https://ci.appveyor.com/project/spark-test/spark/build/140-PathOptionSuite-pr
`SparkLauncherSuite`: https://ci.appveyor.com/project/spark-test/spark/build/142-SparkLauncherSuite-pr
`UDFSuite`: https://ci.appveyor.com/project/spark-test/spark/build/144-UDFSuite-pr
`InsertSuite`: https://ci.appveyor.com/project/spark-test/spark/build/147-InsertSuite-pr
`BroadcastJoinSuite`: https://ci.appveyor.com/project/spark-test/spark/build/149-BroadcastJoinSuite-pr

Author: hyukjinkwon <gurwls223@gmail.com>

Closes #16147 from HyukjinKwon/fix-tests.
2016-12-08 23:02:05 +08:00
Shixiong Zhu b47b892e45 [SPARK-18774][CORE][SQL] Ignore non-existing files when ignoreCorruptFiles is enabled
## What changes were proposed in this pull request?

When `ignoreCorruptFiles` is enabled, it's better to also ignore non-existing files.

## How was this patch tested?

Jenkins

Author: Shixiong Zhu <shixiong@databricks.com>

Closes #16203 from zsxwing/ignore-file-not-found.
2016-12-07 22:37:04 -08:00
sarutak bb94f61a7a [SPARK-18762][WEBUI] Web UI should be http:4040 instead of https:4040
## What changes were proposed in this pull request?

When SSL is enabled, the Spark shell shows:
```
Spark context Web UI available at https://192.168.99.1:4040
```
This is wrong because 4040 is http, not https. It redirects to the https port.
More importantly, this introduces several broken links in the UI. For example, in the master UI, the worker link is https:8081 instead of http:8081 or https:8481.

CC: mengxr liancheng

I manually tested accessing by accessing MasterPage, WorkerPage and HistoryServer with SSL enabled.

Author: sarutak <sarutak@oss.nttdata.co.jp>

Closes #16190 from sarutak/SPARK-18761.
2016-12-07 11:41:23 -08:00
Shixiong Zhu dbf3e298a1 [SPARK-18764][CORE] Add a warning log when skipping a corrupted file
## What changes were proposed in this pull request?

It's better to add a warning log when skipping a corrupted file. It will be helpful when we want to finish the job first, then find them in the log and fix these files.

## How was this patch tested?

Jenkins

Author: Shixiong Zhu <shixiong@databricks.com>

Closes #16192 from zsxwing/SPARK-18764.
2016-12-07 10:30:05 -08:00
Jie Xiong c496d03b52 [SPARK-18208][SHUFFLE] Executor OOM due to a growing LongArray in BytesToBytesMap
## What changes were proposed in this pull request?

BytesToBytesMap currently does not release the in-memory storage (the longArray variable) after it spills to disk. This is typically not a problem during aggregation because the longArray should be much smaller than the pages, and because we grow the longArray at a conservative rate.

However this can lead to an OOM when an already running task is allocated more than its fair share, this can happen because of a scheduling delay. In this case the longArray can grow beyond the fair share of memory for the task. This becomes problematic when the task spills and the long array is not freed, that causes subsequent memory allocation requests to be denied by the memory manager resulting in an OOM.

This PR fixes this issuing by freeing the longArray when the BytesToBytesMap spills.

## How was this patch tested?

Existing tests and tested on realworld workloads.

Author: Jie Xiong <jiexiong@fb.com>
Author: jiexiong <jiexiong@gmail.com>

Closes #15722 from jiexiong/jie_oom_fix.
2016-12-07 04:33:30 -08:00
Sean Owen 79f5f281bb
[SPARK-18678][ML] Skewed reservoir sampling in SamplingUtils
## What changes were proposed in this pull request?

Fix reservoir sampling bias for small k. An off-by-one error meant that the probability of replacement was slightly too high -- k/(l-1) after l element instead of k/l, which matters for small k.

## How was this patch tested?

Existing test plus new test case.

Author: Sean Owen <sowen@cloudera.com>

Closes #16129 from srowen/SPARK-18678.
2016-12-07 17:34:45 +08:00
Peter Ableda 05d416ffc6 [SPARK-18740] Log spark.app.name in driver logs
## What changes were proposed in this pull request?

Added simple logInfo line to print out the `spark.app.name` in the driver logs

## How was this patch tested?

Spark was built and tested with SparkPi app. Example log:
```
16/12/06 05:49:50 INFO spark.SparkContext: Running Spark version 2.0.0
16/12/06 05:49:52 INFO spark.SparkContext: Submitted application: Spark Pi
16/12/06 05:49:52 INFO spark.SecurityManager: Changing view acls to: root
16/12/06 05:49:52 INFO spark.SecurityManager: Changing modify acls to: root
```

Author: Peter Ableda <peter.ableda@cloudera.com>

Closes #16172 from peterableda/feature/print_appname.
2016-12-06 10:12:27 -08:00
hyukjinkwon b8c7b8d31d
[SPARK-18672][CORE] Close recordwriter in SparkHadoopMapReduceWriter before committing
## What changes were proposed in this pull request?

It seems some APIs such as `PairRDDFunctions.saveAsHadoopDataset()` do not close the record writer before issuing the commit for the task.

On Windows, the output in the temp directory is being open and output committer tries to rename it from temp directory to the output directory after finishing writing.

So, it fails to move the file. It seems we should close the writer actually before committing the task like the other writers such as `FileFormatWriter`.

Identified failure was as below:

```
FAILURE! - in org.apache.spark.JavaAPISuite
writeWithNewAPIHadoopFile(org.apache.spark.JavaAPISuite)  Time elapsed: 0.25 sec  <<< ERROR!
org.apache.spark.SparkException: Job aborted.
	at org.apache.spark.JavaAPISuite.writeWithNewAPIHadoopFile(JavaAPISuite.java:1231)
Caused by: org.apache.spark.SparkException:
Job aborted due to stage failure: Task 0 in stage 0.0 failed 1 times, most recent failure: Lost task 0.0 in stage 0.0 (TID 0, localhost, executor driver): org.apache.spark.SparkException: Task failed while writing rows
	at org.apache.spark.internal.io.SparkHadoopMapReduceWriter$.org$apache$spark$internal$io$SparkHadoopMapReduceWriter$$executeTask(SparkHadoopMapReduceWriter.scala:182)
	at org.apache.spark.internal.io.SparkHadoopMapReduceWriter$$anonfun$3.apply(SparkHadoopMapReduceWriter.scala:100)
	at org.apache.spark.internal.io.SparkHadoopMapReduceWriter$$anonfun$3.apply(SparkHadoopMapReduceWriter.scala:99)
	at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
	at org.apache.spark.scheduler.Task.run(Task.scala:108)
	at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:282)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
	at java.lang.Thread.run(Thread.java:745)
Caused by: java.io.IOException: Could not rename file:/C:/projects/spark/core/target/tmp/1480553515529-0/output/_temporary/0/_temporary/attempt_20161201005155_0000_r_000000_0 to file:/C:/projects/spark/core/target/tmp/1480553515529-0/output/_temporary/0/task_20161201005155_0000_r_000000
	at org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter.commitTask(FileOutputCommitter.java:436)
	at org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter.commitTask(FileOutputCommitter.java:415)
	at org.apache.spark.mapred.SparkHadoopMapRedUtil$.performCommit$1(SparkHadoopMapRedUtil.scala:50)
	at org.apache.spark.mapred.SparkHadoopMapRedUtil$.commitTask(SparkHadoopMapRedUtil.scala:76)
	at org.apache.spark.internal.io.HadoopMapReduceCommitProtocol.commitTask(HadoopMapReduceCommitProtocol.scala:153)
	at org.apache.spark.internal.io.SparkHadoopMapReduceWriter$$anonfun$4.apply(SparkHadoopMapReduceWriter.scala:167)
	at org.apache.spark.internal.io.SparkHadoopMapReduceWriter$$anonfun$4.apply(SparkHadoopMapReduceWriter.scala:156)
	at org.apache.spark.util.Utils$.tryWithSafeFinallyAndFailureCallbacks(Utils.scala:1341)
	at org.apache.spark.internal.io.SparkHadoopMapReduceWriter$.org$apache$spark$internal$io$SparkHadoopMapReduceWriter$$executeTask(SparkHadoopMapReduceWriter.scala:168)
	... 8 more
Driver stacktrace:
	at org.apache.spark.JavaAPISuite.writeWithNewAPIHadoopFile(JavaAPISuite.java:1231)
Caused by: org.apache.spark.SparkException: Task failed while writing rows
Caused by: java.io.IOException: Could not rename file:/C:/projects/spark/core/target/tmp/1480553515529-0/output/_temporary/0/_temporary/attempt_20161201005155_0000_r_000000_0 to file:/C:/projects/spark/core/target/tmp/1480553515529-0/output/_temporary/0/task_20161201005155_0000_r_000000
```

This PR proposes to close this before committing the task.

## How was this patch tested?

Manually tested via AppVeyor.

**Before**

https://ci.appveyor.com/project/spark-test/spark/build/94-scala-tests

**After**

https://ci.appveyor.com/project/spark-test/spark/build/93-scala-tests

Author: hyukjinkwon <gurwls223@gmail.com>

Closes #16098 from HyukjinKwon/close-wirter-first.
2016-12-06 12:20:16 +08:00
Reynold Xin e9730b707d [SPARK-18702][SQL] input_file_block_start and input_file_block_length
## What changes were proposed in this pull request?
We currently have function input_file_name to get the path of the input file, but don't have functions to get the block start offset and length. This patch introduces two functions:

1. input_file_block_start: returns the file block start offset, or -1 if not available.

2. input_file_block_length: returns the file block length, or -1 if not available.

## How was this patch tested?
Updated existing test cases in ColumnExpressionSuite that covered input_file_name to also cover the two new functions.

Author: Reynold Xin <rxin@databricks.com>

Closes #16133 from rxin/SPARK-18702.
2016-12-04 21:51:10 -08:00
Reynold Xin c7c7265950 [SPARK-18695] Bump master branch version to 2.2.0-SNAPSHOT
## What changes were proposed in this pull request?
This patch bumps master branch version to 2.2.0-SNAPSHOT.

## How was this patch tested?
N/A

Author: Reynold Xin <rxin@databricks.com>

Closes #16126 from rxin/SPARK-18695.
2016-12-02 21:09:37 -08:00
Eric Liang 294163ee93 [SPARK-18679][SQL] Fix regression in file listing performance for non-catalog tables
## What changes were proposed in this pull request?

In Spark 2.1 ListingFileCatalog was significantly refactored (and renamed to InMemoryFileIndex). This introduced a regression where parallelism could only be introduced at the very top of the tree. However, in many cases (e.g. `spark.read.parquet(topLevelDir)`), the top of the tree is only a single directory.

This PR simplifies and fixes the parallel recursive listing code to allow parallelism to be introduced at any level during recursive descent (though note that once we decide to list a sub-tree in parallel, the sub-tree is listed in serial on executors).

cc mallman  cloud-fan

## How was this patch tested?

Checked metrics in unit tests.

Author: Eric Liang <ekl@databricks.com>

Closes #16112 from ericl/spark-18679.
2016-12-02 20:59:39 +08:00
Liang-Chi Hsieh dbf842b7a8 [SPARK-18666][WEB UI] Remove the codes checking deprecated config spark.sql.unsafe.enabled
## What changes were proposed in this pull request?

`spark.sql.unsafe.enabled` is deprecated since 1.6. There still are codes in UI to check it. We should remove it and clean the codes.

## How was this patch tested?

Changes to related existing unit test.

Please review http://spark.apache.org/contributing.html before opening a pull request.

Author: Liang-Chi Hsieh <viirya@gmail.com>

Closes #16095 from viirya/remove-deprecated-config-code.
2016-12-01 01:57:58 -08:00
wm624@hotmail.com 2eb6764fbb [SPARK-18476][SPARKR][ML] SparkR Logistic Regression should should support output original label.
## What changes were proposed in this pull request?

Similar to SPARK-18401, as a classification algorithm, logistic regression should support output original label instead of supporting index label.

In this PR, original label output is supported and test cases are modified and added. Document is also modified.

## How was this patch tested?

Unit tests.

Author: wm624@hotmail.com <wm624@hotmail.com>

Closes #15910 from wangmiao1981/audit.
2016-11-30 20:32:17 -08:00
Shixiong Zhu c4979f6ea8 [SPARK-18655][SS] Ignore Structured Streaming 2.0.2 logs in history server
## What changes were proposed in this pull request?

As `queryStatus` in StreamingQueryListener events was removed in #15954, parsing 2.0.2 structured streaming logs will throw the following errror:

```
[info]   com.fasterxml.jackson.databind.exc.UnrecognizedPropertyException: Unrecognized field "queryStatus" (class org.apache.spark.sql.streaming.StreamingQueryListener$QueryTerminatedEvent), not marked as ignorable (2 known properties: "id", "exception"])
[info]  at [Source: {"Event":"org.apache.spark.sql.streaming.StreamingQueryListener$QueryTerminatedEvent","queryStatus":{"name":"query-1","id":1,"timestamp":1480491532753,"inputRate":0.0,"processingRate":0.0,"latency":null,"sourceStatuses":[{"description":"FileStreamSource[file:/Users/zsx/stream]","offsetDesc":"#0","inputRate":0.0,"processingRate":0.0,"triggerDetails":{"latency.getOffset.source":"1","triggerId":"1"}}],"sinkStatus":{"description":"FileSink[/Users/zsx/stream2]","offsetDesc":"[#0]"},"triggerDetails":{}},"exception":null}; line: 1, column: 521] (through reference chain: org.apache.spark.sql.streaming.QueryTerminatedEvent["queryStatus"])
[info]   at com.fasterxml.jackson.databind.exc.UnrecognizedPropertyException.from(UnrecognizedPropertyException.java:51)
[info]   at com.fasterxml.jackson.databind.DeserializationContext.reportUnknownProperty(DeserializationContext.java:839)
[info]   at com.fasterxml.jackson.databind.deser.std.StdDeserializer.handleUnknownProperty(StdDeserializer.java:1045)
[info]   at com.fasterxml.jackson.databind.deser.BeanDeserializerBase.handleUnknownProperty(BeanDeserializerBase.java:1352)
[info]   at com.fasterxml.jackson.databind.deser.BeanDeserializerBase.handleUnknownProperties(BeanDeserializerBase.java:1306)
[info]   at com.fasterxml.jackson.databind.deser.BeanDeserializer._deserializeUsingPropertyBased(BeanDeserializer.java:453)
[info]   at com.fasterxml.jackson.databind.deser.BeanDeserializerBase.deserializeFromObjectUsingNonDefault(BeanDeserializerBase.java:1099)
...
```

This PR just ignores such errors and adds a test to make sure we can read 2.0.2 logs.

## How was this patch tested?

`query-event-logs-version-2.0.2.txt` has all types of events generated by Structured Streaming in Spark 2.0.2. `testQuietly("ReplayListenerBus should ignore broken event jsons generated in 2.0.2")` verified we can load them without any error.

Author: Shixiong Zhu <shixiong@databricks.com>

Closes #16085 from zsxwing/SPARK-18655.
2016-11-30 16:18:53 -08:00
Marcelo Vanzin 93e9d880bf [SPARK-18546][CORE] Fix merging shuffle spills when using encryption.
The problem exists because it's not possible to just concatenate encrypted
partition data from different spill files; currently each partition would
have its own initial vector to set up encryption, and the final merged file
should contain a single initial vector for each merged partiton, otherwise
iterating over each record becomes really hard.

To fix that, UnsafeShuffleWriter now decrypts the partitions when merging,
so that the merged file contains a single initial vector at the start of
the partition data.

Because it's not possible to do that using the fast transferTo path, when
encryption is enabled UnsafeShuffleWriter will revert back to using file
streams when merging. It may be possible to use a hybrid approach when
using encryption, using an intermediate direct buffer when reading from
files and encrypting the data, but that's better left for a separate patch.

As part of the change I made DiskBlockObjectWriter take a SerializerManager
instead of a "wrap stream" closure, since that makes it easier to test the
code without having to mock SerializerManager functionality.

Tested with newly added unit tests (UnsafeShuffleWriterSuite for the write
side and ExternalAppendOnlyMapSuite for integration), and by running some
apps that failed without the fix.

Author: Marcelo Vanzin <vanzin@cloudera.com>

Closes #15982 from vanzin/SPARK-18546.
2016-11-30 14:10:32 -08:00
Josh Rosen c51c772594 [SPARK-18640] Add synchronization to TaskScheduler.runningTasksByExecutors
## What changes were proposed in this pull request?

The method `TaskSchedulerImpl.runningTasksByExecutors()` accesses the mutable `executorIdToRunningTaskIds` map without proper synchronization. In addition, as markhamstra pointed out in #15986, the signature's use of parentheses is a little odd given that this is a pure getter method.

This patch fixes both issues.

## How was this patch tested?

Covered by existing tests.

Author: Josh Rosen <joshrosen@databricks.com>

Closes #16073 from JoshRosen/runningTasksByExecutors-thread-safety.
2016-11-30 14:47:41 -05:00
uncleGen 56c82edabd [SPARK-18617][CORE][STREAMING] Close "kryo auto pick" feature for Spark Streaming
## What changes were proposed in this pull request?

#15992 provided a solution to fix the bug, i.e. **receiver data can not be deserialized properly**. As zsxwing said, it is a critical bug, but we should not break APIs between maintenance releases. It may be a rational choice to close auto pick kryo serializer for Spark Streaming in the first step. I will continue #15992 to optimize the solution.

## How was this patch tested?

existing ut

Author: uncleGen <hustyugm@gmail.com>

Closes #16052 from uncleGen/SPARK-18617.
2016-11-29 23:45:06 -08:00
Josh Rosen 9a02f68212 [SPARK-18553][CORE] Fix leak of TaskSetManager following executor loss
## What changes were proposed in this pull request?

_This is the master branch version of #15986; the original description follows:_

This patch fixes a critical resource leak in the TaskScheduler which could cause RDDs and ShuffleDependencies to be kept alive indefinitely if an executor with running tasks is permanently lost and the associated stage fails.

This problem was originally identified by analyzing the heap dump of a driver belonging to a cluster that had run out of shuffle space. This dump contained several `ShuffleDependency` instances that were retained by `TaskSetManager`s inside the scheduler but were not otherwise referenced. Each of these `TaskSetManager`s was considered a "zombie" but had no running tasks and therefore should have been cleaned up. However, these zombie task sets were still referenced by the `TaskSchedulerImpl.taskIdToTaskSetManager` map.

Entries are added to the `taskIdToTaskSetManager` map when tasks are launched and are removed inside of `TaskScheduler.statusUpdate()`, which is invoked by the scheduler backend while processing `StatusUpdate` messages from executors. The problem with this design is that a completely dead executor will never send a `StatusUpdate`. There is [some code](072f4c518c/core/src/main/scala/org/apache/spark/scheduler/TaskSchedulerImpl.scala (L338)) in `statusUpdate` which handles tasks that exit with the `TaskState.LOST` state (which is supposed to correspond to a task failure triggered by total executor loss), but this state only seems to be used in Mesos fine-grained mode. There doesn't seem to be any code which performs per-task state cleanup for tasks that were running on an executor that completely disappears without sending any sort of final death message. The `executorLost` and [`removeExecutor`](072f4c518c/core/src/main/scala/org/apache/spark/scheduler/TaskSchedulerImpl.scala (L527)) methods don't appear to perform any cleanup of the `taskId -> *` mappings, causing the leaks observed here.

This patch's fix is to maintain a `executorId -> running task id` mapping so that these `taskId -> *` maps can be properly cleaned up following an executor loss.

There are some potential corner-case interactions that I'm concerned about here, especially some details in [the comment](072f4c518c/core/src/main/scala/org/apache/spark/scheduler/TaskSchedulerImpl.scala (L523)) in `removeExecutor`, so I'd appreciate a very careful review of these changes.

## How was this patch tested?

I added a new unit test to `TaskSchedulerImplSuite`.

/cc kayousterhout and markhamstra, who reviewed #15986.

Author: Josh Rosen <joshrosen@databricks.com>

Closes #16045 from JoshRosen/fix-leak-following-total-executor-loss-master.
2016-11-29 16:27:25 -08:00
hyukjinkwon 1a870090e4
[SPARK-18615][DOCS] Switch to multi-line doc to avoid a genjavadoc bug for backticks
## What changes were proposed in this pull request?

Currently, single line comment does not mark down backticks to `<code>..</code>` but prints as they are (`` `..` ``). For example, the line below:

```scala
/** Return an RDD with the pairs from `this` whose keys are not in `other`. */
```

So, we could work around this as below:

```scala
/**
 * Return an RDD with the pairs from `this` whose keys are not in `other`.
 */
```

- javadoc

  - **Before**
    ![2016-11-29 10 39 14](https://cloud.githubusercontent.com/assets/6477701/20693606/e64c8f90-b622-11e6-8dfc-4a029216e23d.png)

  - **After**
    ![2016-11-29 10 39 08](https://cloud.githubusercontent.com/assets/6477701/20693607/e7280d36-b622-11e6-8502-d2e21cd5556b.png)

- scaladoc (this one looks fine either way)

  - **Before**
    ![2016-11-29 10 38 22](https://cloud.githubusercontent.com/assets/6477701/20693640/12c18aa8-b623-11e6-901a-693e2f6f8066.png)

  - **After**
    ![2016-11-29 10 40 05](https://cloud.githubusercontent.com/assets/6477701/20693642/14eb043a-b623-11e6-82ac-7cd0000106d1.png)

I suspect this is related with SPARK-16153 and genjavadoc issue in ` typesafehub/genjavadoc#85`.

## How was this patch tested?

I found them via

```
grep -r "\/\*\*.*\`" . | grep .scala
````

and then checked if each is in the public API documentation with manually built docs (`jekyll build`) with Java 7.

Author: hyukjinkwon <gurwls223@gmail.com>

Closes #16050 from HyukjinKwon/javadoc-markdown.
2016-11-29 13:50:24 +00:00
hyukjinkwon f830bb9170
[SPARK-3359][DOCS] Make javadoc8 working for unidoc/genjavadoc compatibility in Java API documentation
## What changes were proposed in this pull request?

This PR make `sbt unidoc` complete with Java 8.

This PR roughly includes several fixes as below:

- Fix unrecognisable class and method links in javadoc by changing it from `[[..]]` to `` `...` ``

  ```diff
  - * A column that will be computed based on the data in a [[DataFrame]].
  + * A column that will be computed based on the data in a `DataFrame`.
  ```

- Fix throws annotations so that they are recognisable in javadoc

- Fix URL links to `<a href="http..."></a>`.

  ```diff
  - * [[http://en.wikipedia.org/wiki/Decision_tree_learning Decision tree]] model for regression.
  + * <a href="http://en.wikipedia.org/wiki/Decision_tree_learning">
  + * Decision tree (Wikipedia)</a> model for regression.
  ```

  ```diff
  -   * see http://en.wikipedia.org/wiki/Receiver_operating_characteristic
  +   * see <a href="http://en.wikipedia.org/wiki/Receiver_operating_characteristic">
  +   * Receiver operating characteristic (Wikipedia)</a>
  ```

- Fix < to > to

  - `greater than`/`greater than or equal to` or `less than`/`less than or equal to` where applicable.

  - Wrap it with `{{{...}}}` to print them in javadoc or use `{code ...}` or `{literal ..}`. Please refer https://github.com/apache/spark/pull/16013#discussion_r89665558

- Fix `</p>` complaint

## How was this patch tested?

Manually tested by `jekyll build` with Java 7 and 8

```
java version "1.7.0_80"
Java(TM) SE Runtime Environment (build 1.7.0_80-b15)
Java HotSpot(TM) 64-Bit Server VM (build 24.80-b11, mixed mode)
```

```
java version "1.8.0_45"
Java(TM) SE Runtime Environment (build 1.8.0_45-b14)
Java HotSpot(TM) 64-Bit Server VM (build 25.45-b02, mixed mode)
```

Author: hyukjinkwon <gurwls223@gmail.com>

Closes #16013 from HyukjinKwon/SPARK-3359-errors-more.
2016-11-29 09:41:32 +00:00
Davies Liu 7d5cb3af76 [SPARK-18188] add checksum for blocks of broadcast
## What changes were proposed in this pull request?

A TorrentBroadcast is serialized and compressed first, then splitted as fixed size blocks, if any block is corrupt when fetching from remote, the decompression/deserialization will fail without knowing which block is corrupt. Also, the corrupt block is kept in block manager and reported to driver, so other tasks (in same executor or from different executor) will also fail because of it.

This PR add checksum for each block, and check it after fetching a block from remote executor, because it's very likely that the corruption happen in network. When the corruption happen, it will throw the block away and throw an exception to fail the task, which will be retried.

Added a config for it: `spark.broadcast.checksum`, which is true by default.

## How was this patch tested?

Existing tests.

Author: Davies Liu <davies@databricks.com>

Closes #15935 from davies/broadcast_checksum.
2016-11-29 00:00:33 -08:00
Marcelo Vanzin 8b325b17ec [SPARK-18547][CORE] Propagate I/O encryption key when executors register.
This change modifies the method used to propagate encryption keys used during
shuffle. Instead of relying on YARN's UserGroupInformation credential propagation,
this change explicitly distributes the key using the messages exchanged between
driver and executor during registration. When RPC encryption is enabled, this means
key propagation is also secure.

This allows shuffle encryption to work in non-YARN mode, which means that it's
easier to write unit tests for areas of the code that are affected by the feature.

The key is stored in the SecurityManager; because there are many instances of
that class used in the code, the key is only guaranteed to exist in the instance
managed by the SparkEnv. This path was chosen to avoid storing the key in the
SparkConf, which would risk having the key being written to disk as part of the
configuration (as, for example, is done when starting YARN applications).

Tested by new and existing unit tests (which were moved from the YARN module to
core), and by running apps with shuffle encryption enabled.

Author: Marcelo Vanzin <vanzin@cloudera.com>

Closes #15981 from vanzin/SPARK-18547.
2016-11-28 21:10:57 -08:00
Imran Rashid 8b1609bebe [SPARK-18117][CORE] Add test for TaskSetBlacklist
## What changes were proposed in this pull request?

This adds tests to verify the interaction between TaskSetBlacklist and
TaskSchedulerImpl.  TaskSetBlacklist was introduced by SPARK-17675 but
it neglected to add these tests.

This change does not fix any bugs -- it is just for increasing test
coverage.
## How was this patch tested?

Jenkins

Author: Imran Rashid <irashid@cloudera.com>

Closes #15644 from squito/taskset_blacklist_test_update.
2016-11-28 13:47:09 -06:00
Mark Grover 237c3b9642 [SPARK-18535][UI][YARN] Redact sensitive information from Spark logs and UI
## What changes were proposed in this pull request?

This patch adds a new property called `spark.secret.redactionPattern` that
allows users to specify a scala regex to decide which Spark configuration
properties and environment variables in driver and executor environments
contain sensitive information. When this regex matches the property or
environment variable name, its value is redacted from the environment UI and
various logs like YARN and event logs.

This change uses this property to redact information from event logs and YARN
logs. It also, updates the UI code to adhere to this property instead of
hardcoding the logic to decipher which properties are sensitive.

Here's an image of the UI post-redaction:
![image](https://cloud.githubusercontent.com/assets/1709451/20506215/4cc30654-b007-11e6-8aee-4cde253fba2f.png)

Here's the text in the YARN logs, post-redaction:
``HADOOP_CREDSTORE_PASSWORD -> *********(redacted)``

Here's the text in the event logs, post-redaction:
``...,"spark.executorEnv.HADOOP_CREDSTORE_PASSWORD":"*********(redacted)","spark.yarn.appMasterEnv.HADOOP_CREDSTORE_PASSWORD":"*********(redacted)",...``

## How was this patch tested?
1. Unit tests are added to ensure that redaction works.
2. A YARN job reading data off of S3 with confidential information
(hadoop credential provider password) being provided in the environment
variables of driver and executor. And, afterwards, logs were grepped to make
sure that no mention of secret password was present. It was also ensure that
the job was able to read the data off of S3 correctly, thereby ensuring that
the sensitive information was being trickled down to the right places to read
the data.
3. The event logs were checked to make sure no mention of secret password was
present.
4. UI environment tab was checked to make sure there was no secret information
being displayed.

Author: Mark Grover <mark@apache.org>

Closes #15971 from markgrover/master_redaction.
2016-11-28 08:59:47 -08:00
Takuya UESHIN a88329d455 [SPARK-18583][SQL] Fix nullability of InputFileName.
## What changes were proposed in this pull request?

The nullability of `InputFileName` should be `false`.

## How was this patch tested?

Existing tests.

Author: Takuya UESHIN <ueshin@happy-camper.st>

Closes #16007 from ueshin/issues/SPARK-18583.
2016-11-25 20:25:29 -08:00
hyukjinkwon 51b1c1551d
[SPARK-3359][BUILD][DOCS] More changes to resolve javadoc 8 errors that will help unidoc/genjavadoc compatibility
## What changes were proposed in this pull request?

This PR only tries to fix things that looks pretty straightforward and were fixed in other previous PRs before.

This PR roughly fixes several things as below:

- Fix unrecognisable class and method links in javadoc by changing it from `[[..]]` to `` `...` ``

  ```
  [error] .../spark/sql/core/target/java/org/apache/spark/sql/streaming/DataStreamReader.java:226: error: reference not found
  [error]    * Loads text files and returns a {link DataFrame} whose schema starts with a string column named
  ```

- Fix an exception annotation and remove code backticks in `throws` annotation

  Currently, sbt unidoc with Java 8 complains as below:

  ```
  [error] .../java/org/apache/spark/sql/streaming/StreamingQuery.java:72: error: unexpected text
  [error]    * throws StreamingQueryException, if <code>this</code> query has terminated with an exception.
  ```

  `throws` should specify the correct class name from `StreamingQueryException,` to `StreamingQueryException` without backticks. (see [JDK-8007644](https://bugs.openjdk.java.net/browse/JDK-8007644)).

- Fix `[[http..]]` to `<a href="http..."></a>`.

  ```diff
  -   * [[https://blogs.oracle.com/java-platform-group/entry/diagnosing_tls_ssl_and_https Oracle
  -   * blog page]].
  +   * <a href="https://blogs.oracle.com/java-platform-group/entry/diagnosing_tls_ssl_and_https">
  +   * Oracle blog page</a>.
  ```

   `[[http...]]` link markdown in scaladoc is unrecognisable in javadoc.

- It seems class can't have `return` annotation. So, two cases of this were removed.

  ```
  [error] .../java/org/apache/spark/mllib/regression/IsotonicRegression.java:27: error: invalid use of return
  [error]    * return New instance of IsotonicRegression.
  ```

- Fix < to `&lt;` and > to `&gt;` according to HTML rules.

- Fix `</p>` complaint

- Exclude unrecognisable in javadoc, `constructor`, `todo` and `groupname`.

## How was this patch tested?

Manually tested by `jekyll build` with Java 7 and 8

```
java version "1.7.0_80"
Java(TM) SE Runtime Environment (build 1.7.0_80-b15)
Java HotSpot(TM) 64-Bit Server VM (build 24.80-b11, mixed mode)
```

```
java version "1.8.0_45"
Java(TM) SE Runtime Environment (build 1.8.0_45-b14)
Java HotSpot(TM) 64-Bit Server VM (build 25.45-b02, mixed mode)
```

Note: this does not yet make sbt unidoc suceed with Java 8 yet but it reduces the number of errors with Java 8.

Author: hyukjinkwon <gurwls223@gmail.com>

Closes #15999 from HyukjinKwon/SPARK-3359-errors.
2016-11-25 11:27:07 +00:00
n.fraison f42db0c0c1
[SPARK-18119][SPARK-CORE] Namenode safemode check is only performed on one namenode which can stuck the startup of SparkHistory server
## What changes were proposed in this pull request?

Instead of using the setSafeMode method that check the first namenode used the one which permitts to check only for active NNs
## How was this patch tested?

manual tests

Please review https://cwiki.apache.org/confluence/display/SPARK/Contributing+to+Spark before opening a pull request.

This commit is contributed by Criteo SA under the Apache v2 licence.

Author: n.fraison <n.fraison@criteo.com>

Closes #15648 from ashangit/SPARK-18119.
2016-11-25 09:45:51 +00:00
Reynold Xin 9785ed40d7 [SPARK-18557] Downgrade confusing memory leak warning message
## What changes were proposed in this pull request?
TaskMemoryManager has a memory leak detector that gets called at task completion callback and checks whether any memory has not been released. If they are not released by the time the callback is invoked, TaskMemoryManager releases them.

The current error message says something like the following:
```
WARN  [Executor task launch worker-0]
org.apache.spark.memory.TaskMemoryManager - leak 16.3 MB memory from
org.apache.spark.unsafe.map.BytesToBytesMap33fb6a15
In practice, there are multiple reasons why these can be triggered in the normal code path (e.g. limit, or task failures), and the fact that these messages are log means the "leak" is fixed by TaskMemoryManager.
```

To not confuse users, this patch downgrade the message from warning to debug level, and avoids using the word "leak" since it is not actually a leak.

## How was this patch tested?
N/A - this is a simple logging improvement.

Author: Reynold Xin <rxin@databricks.com>

Closes #15989 from rxin/SPARK-18557.
2016-11-23 04:22:26 -08:00
Eric Liang 85235ed6c6 [SPARK-18545][SQL] Verify number of hive client RPCs in PartitionedTablePerfStatsSuite
## What changes were proposed in this pull request?

This would help catch accidental O(n) calls to the hive client as in https://issues.apache.org/jira/browse/SPARK-18507

## How was this patch tested?

Checked that the test fails before https://issues.apache.org/jira/browse/SPARK-18507 was patched. cc cloud-fan

Author: Eric Liang <ekl@databricks.com>

Closes #15985 from ericl/spark-18545.
2016-11-23 20:14:08 +08:00
Kazuaki Ishizaki d93b655247 [SPARK-18458][CORE] Fix signed integer overflow problem at an expression in RadixSort.java
## What changes were proposed in this pull request?

This PR avoids that a result of an expression is negative due to signed integer overflow (e.g. 0x10?????? * 8 < 0). This PR casts each operand to `long` before executing a calculation. Since the result is interpreted as long, the result of the expression is positive.

## How was this patch tested?

Manually executed query82 of TPC-DS with 100TB

Author: Kazuaki Ishizaki <ishizaki@jp.ibm.com>

Closes #15907 from kiszk/SPARK-18458.
2016-11-19 21:50:20 -08:00
Stavros Kontopoulos ea77c81ec0 [SPARK-17062][MESOS] add conf option to mesos dispatcher
Adds --conf option to set spark configuration properties in mesos dispacther.
Properties provided with --conf take precedence over properties within the properties file.
The reason for this PR is that for simple configuration or testing purposes we need to provide a property file (ideally a shared one for a cluster) even if we just provide a single property.

Manually tested.

Author: Stavros Kontopoulos <st.kontopoulos@gmail.com>
Author: Stavros Kontopoulos <stavros.kontopoulos@lightbend.com>

Closes #14650 from skonto/dipatcher_conf.
2016-11-19 16:04:49 -08:00
Sean Owen 8b1e1088eb
[SPARK-18353][CORE] spark.rpc.askTimeout defalut value is not 120s
## What changes were proposed in this pull request?

Avoid hard-coding spark.rpc.askTimeout to non-default in Client; fix doc about spark.rpc.askTimeout default

## How was this patch tested?

Existing tests

Author: Sean Owen <sowen@cloudera.com>

Closes #15833 from srowen/SPARK-18353.
2016-11-19 11:28:25 +00:00
hyukjinkwon d5b1d5fc80
[SPARK-18445][BUILD][DOCS] Fix the markdown for Note:/NOTE:/Note that/'''Note:''' across Scala/Java API documentation
## What changes were proposed in this pull request?

It seems in Scala/Java,

- `Note:`
- `NOTE:`
- `Note that`
- `'''Note:'''`
- `note`

This PR proposes to fix those to `note` to be consistent.

**Before**

- Scala
  ![2016-11-17 6 16 39](https://cloud.githubusercontent.com/assets/6477701/20383180/1a7aed8c-acf2-11e6-9611-5eaf6d52c2e0.png)

- Java
  ![2016-11-17 6 14 41](https://cloud.githubusercontent.com/assets/6477701/20383096/c8ffc680-acf1-11e6-914a-33460bf1401d.png)

**After**

- Scala
  ![2016-11-17 6 16 44](https://cloud.githubusercontent.com/assets/6477701/20383167/09940490-acf2-11e6-937a-0d5e1dc2cadf.png)

- Java
  ![2016-11-17 6 13 39](https://cloud.githubusercontent.com/assets/6477701/20383132/e7c2a57e-acf1-11e6-9c47-b849674d4d88.png)

## How was this patch tested?

The notes were found via

```bash
grep -r "NOTE: " . | \ # Note:|NOTE:|Note that|'''Note:'''
grep -v "// NOTE: " | \  # starting with // does not appear in API documentation.
grep -E '.scala|.java' | \ # java/scala files
grep -v Suite | \ # exclude tests
grep -v Test | \ # exclude tests
grep -e 'org.apache.spark.api.java' \ # packages appear in API documenation
-e 'org.apache.spark.api.java.function' \ # note that this is a regular expression. So actual matches were mostly `org/apache/spark/api/java/functions ...`
-e 'org.apache.spark.api.r' \
...
```

```bash
grep -r "Note that " . | \ # Note:|NOTE:|Note that|'''Note:'''
grep -v "// Note that " | \  # starting with // does not appear in API documentation.
grep -E '.scala|.java' | \ # java/scala files
grep -v Suite | \ # exclude tests
grep -v Test | \ # exclude tests
grep -e 'org.apache.spark.api.java' \ # packages appear in API documenation
-e 'org.apache.spark.api.java.function' \
-e 'org.apache.spark.api.r' \
...
```

```bash
grep -r "Note: " . | \ # Note:|NOTE:|Note that|'''Note:'''
grep -v "// Note: " | \  # starting with // does not appear in API documentation.
grep -E '.scala|.java' | \ # java/scala files
grep -v Suite | \ # exclude tests
grep -v Test | \ # exclude tests
grep -e 'org.apache.spark.api.java' \ # packages appear in API documenation
-e 'org.apache.spark.api.java.function' \
-e 'org.apache.spark.api.r' \
...
```

```bash
grep -r "'''Note:'''" . | \ # Note:|NOTE:|Note that|'''Note:'''
grep -v "// '''Note:''' " | \  # starting with // does not appear in API documentation.
grep -E '.scala|.java' | \ # java/scala files
grep -v Suite | \ # exclude tests
grep -v Test | \ # exclude tests
grep -e 'org.apache.spark.api.java' \ # packages appear in API documenation
-e 'org.apache.spark.api.java.function' \
-e 'org.apache.spark.api.r' \
...
```

And then fixed one by one comparing with API documentation/access modifiers.

After that, manually tested via `jekyll build`.

Author: hyukjinkwon <gurwls223@gmail.com>

Closes #15889 from HyukjinKwon/SPARK-18437.
2016-11-19 11:24:15 +00:00
hyukjinkwon 40d59ff5ea
[SPARK-18422][CORE] Fix wholeTextFiles test to pass on Windows in JavaAPISuite
## What changes were proposed in this pull request?

This PR fixes the test `wholeTextFiles` in `JavaAPISuite.java`. This is failed due to the different path format on Windows.

For example, the path in `container` was

```
C:\projects\spark\target\tmp\1478967560189-0/part-00000
```

whereas `new URI(res._1()).getPath()` was as below:

```
/C:/projects/spark/target/tmp/1478967560189-0/part-00000
```

## How was this patch tested?

Tests in `JavaAPISuite.java`.

Tested via AppVeyor.

**Before**
Build: https://ci.appveyor.com/project/spark-test/spark/build/63-JavaAPISuite-1
Diff: https://github.com/apache/spark/compare/master...spark-test:JavaAPISuite-1

```
[info] Test org.apache.spark.JavaAPISuite.wholeTextFiles started
[error] Test org.apache.spark.JavaAPISuite.wholeTextFiles failed: java.lang.AssertionError: expected:<spark is easy to use.
[error] > but was:<null>, took 0.578 sec
[error]     at org.apache.spark.JavaAPISuite.wholeTextFiles(JavaAPISuite.java:1089)
...
```

**After**
Build started: [CORE] `org.apache.spark.JavaAPISuite` [![PR-15866](https://ci.appveyor.com/api/projects/status/github/spark-test/spark?branch=198DDA52-F201-4D2B-BE2F-244E0C1725B2&svg=true)](https://ci.appveyor.com/project/spark-test/spark/branch/198DDA52-F201-4D2B-BE2F-244E0C1725B2)
Diff: https://github.com/apache/spark/compare/master...spark-test:198DDA52-F201-4D2B-BE2F-244E0C1725B2

```
[info] Test org.apache.spark.JavaAPISuite.wholeTextFiles started
...
```

Author: hyukjinkwon <gurwls223@gmail.com>

Closes #15866 from HyukjinKwon/SPARK-18422.
2016-11-18 21:45:18 +00:00
anabranch 49b6f456ac
[SPARK-18365][DOCS] Improve Sample Method Documentation
## What changes were proposed in this pull request?

I found the documentation for the sample method to be confusing, this adds more clarification across all languages.

- [x] Scala
- [x] Python
- [x] R
- [x] RDD Scala
- [ ] RDD Python with SEED
- [X] RDD Java
- [x] RDD Java with SEED
- [x] RDD Python

## How was this patch tested?

NA

Please review https://cwiki.apache.org/confluence/display/SPARK/Contributing+to+Spark before opening a pull request.

Author: anabranch <wac.chambers@gmail.com>
Author: Bill Chambers <bill@databricks.com>

Closes #15815 from anabranch/SPARK-18365.
2016-11-17 11:34:55 +00:00
Xianyang Liu 7569cf6cb8
[SPARK-18420][BUILD] Fix the errors caused by lint check in Java
## What changes were proposed in this pull request?

Small fix, fix the errors caused by lint check in Java

- Clear unused objects and `UnusedImports`.
- Add comments around the method `finalize` of `NioBufferedFileInputStream`to turn off checkstyle.
- Cut the line which is longer than 100 characters into two lines.

## How was this patch tested?
Travis CI.
```
$ build/mvn -T 4 -q -DskipTests -Pyarn -Phadoop-2.3 -Pkinesis-asl -Phive -Phive-thriftserver install
$ dev/lint-java
```
Before:
```
Checkstyle checks failed at following occurrences:
[ERROR] src/main/java/org/apache/spark/network/util/TransportConf.java:[21,8] (imports) UnusedImports: Unused import - org.apache.commons.crypto.cipher.CryptoCipherFactory.
[ERROR] src/test/java/org/apache/spark/network/sasl/SparkSaslSuite.java:[516,5] (modifier) RedundantModifier: Redundant 'public' modifier.
[ERROR] src/main/java/org/apache/spark/io/NioBufferedFileInputStream.java:[133] (coding) NoFinalizer: Avoid using finalizer method.
[ERROR] src/main/java/org/apache/spark/sql/catalyst/expressions/UnsafeMapData.java:[71] (sizes) LineLength: Line is longer than 100 characters (found 113).
[ERROR] src/main/java/org/apache/spark/sql/catalyst/expressions/UnsafeArrayData.java:[112] (sizes) LineLength: Line is longer than 100 characters (found 110).
[ERROR] src/test/java/org/apache/spark/sql/catalyst/expressions/HiveHasherSuite.java:[31,17] (modifier) ModifierOrder: 'static' modifier out of order with the JLS suggestions.
[ERROR]src/main/java/org/apache/spark/examples/ml/JavaLogisticRegressionWithElasticNetExample.java:[64] (sizes) LineLength: Line is longer than 100 characters (found 103).
[ERROR] src/main/java/org/apache/spark/examples/ml/JavaInteractionExample.java:[22,8] (imports) UnusedImports: Unused import - org.apache.spark.ml.linalg.Vectors.
[ERROR] src/main/java/org/apache/spark/examples/ml/JavaInteractionExample.java:[51] (regexp) RegexpSingleline: No trailing whitespace allowed.
```

After:
```
$ build/mvn -T 4 -q -DskipTests -Pyarn -Phadoop-2.3 -Pkinesis-asl -Phive -Phive-thriftserver install
$ dev/lint-java
Using `mvn` from path: /home/travis/build/ConeyLiu/spark/build/apache-maven-3.3.9/bin/mvn
Checkstyle checks passed.
```

Author: Xianyang Liu <xyliu0530@icloud.com>

Closes #15865 from ConeyLiu/master.
2016-11-16 11:59:00 +00:00
WangTaoTheTonic 637a0bb88f
[SPARK-18396][HISTORYSERVER] Duration" column makes search result confused, maybe we should make it unsearchable
## What changes were proposed in this pull request?

When we search data in History Server, it will check if any columns contains the search string. Duration is represented as long value in table, so if we search simple string like "003", "111", the duration containing "003", ‘111“ will be showed, which make not much sense to users.
We cannot simply transfer the long value to meaning format like "1 h", "3.2 min" because they are also used for sorting. Better way to handle it is ban "Duration" columns from searching.

## How was this patch tested

manually tests.

Before("local-1478225166651" pass the filter because its duration in long value, which is "257244245" contains search string "244"):
![before](https://cloud.githubusercontent.com/assets/5276001/20203166/f851ffc6-a7ff-11e6-8fe6-91a90ca92b23.jpg)

After:
![after](https://cloud.githubusercontent.com/assets/5276001/20178646/2129fbb0-a78d-11e6-9edb-39f885ce3ed0.jpg)

Author: WangTaoTheTonic <wangtao111@huawei.com>

Closes #15838 from WangTaoTheTonic/duration.
2016-11-14 12:22:36 +01:00
Sean Owen f95b124c68 [SPARK-18382][WEBUI] "run at null:-1" in UI when no file/line info in call site info
## What changes were proposed in this pull request?

Avoid reporting null/-1 file / line number in call sites if encountering StackTraceElement without this info

## How was this patch tested?

Existing tests

Author: Sean Owen <sowen@cloudera.com>

Closes #15862 from srowen/SPARK-18382.
2016-11-14 16:52:07 +09:00
Guoqiang Li bc41d997ea
[SPARK-18375][SPARK-18383][BUILD][CORE] Upgrade netty to 4.0.42.Final
## What changes were proposed in this pull request?

One of the important changes for 4.0.42.Final is "Support any FileRegion implementation when using epoll transport netty/netty#5825".
In 4.0.42.Final, `MessageWithHeader` can work properly when `spark.[shuffle|rpc].io.mode` is set to epoll

## How was this patch tested?

Existing tests

Author: Guoqiang Li <witgo@qq.com>

Closes #15830 from witgo/SPARK-18375_netty-4.0.42.
2016-11-12 09:49:14 +00:00
Weiqing Yang 3af894511b [SPARK-16759][CORE] Add a configuration property to pass caller contexts of upstream applications into Spark
## What changes were proposed in this pull request?

Many applications take Spark as a computing engine and run on it. This PR adds a configuration property `spark.log.callerContext` that can be used by Spark's upstream applications (e.g. Oozie) to set up their caller contexts into Spark. In the end, Spark will combine its own caller context with the caller contexts of its upstream applications, and write them into Yarn RM log and HDFS audit log.

The audit log has a config to truncate the caller contexts passed in (default 128). The caller contexts will be sent over rpc, so it should be concise. The call context written into HDFS log and Yarn log consists of two parts: the information `A` specified by Spark itself and the value `B` of `spark.log.callerContext` property.  Currently `A` typically takes 64 to 74 characters,  so `B` can have up to 50 characters (mentioned in the doc `running-on-yarn.md`)
## How was this patch tested?

Manual tests. I have run some Spark applications with `spark.log.callerContext` configuration in Yarn client/cluster mode, and verified that the caller contexts were written into Yarn RM log and HDFS audit log correctly.

The ways to configure `spark.log.callerContext` property:
- In spark-defaults.conf:

```
spark.log.callerContext  infoSpecifiedByUpstreamApp
```
- In app's source code:

```
val spark = SparkSession
      .builder
      .appName("SparkKMeans")
      .config("spark.log.callerContext", "infoSpecifiedByUpstreamApp")
      .getOrCreate()
```

When running on Spark Yarn cluster mode, the driver is unable to pass 'spark.log.callerContext' to Yarn client and AM since Yarn client and AM have already started before the driver performs `.config("spark.log.callerContext", "infoSpecifiedByUpstreamApp")`.

The following  example shows the command line used to submit a SparkKMeans application and the corresponding records in Yarn RM log and HDFS audit log.

Command:

```
./bin/spark-submit --verbose --executor-cores 3 --num-executors 1 --master yarn --deploy-mode client --class org.apache.spark.examples.SparkKMeans examples/target/original-spark-examples_2.11-2.1.0-SNAPSHOT.jar hdfs://localhost:9000/lr_big.txt 2 5
```

Yarn RM log:

<img width="1440" alt="screen shot 2016-10-19 at 9 12 03 pm" src="https://cloud.githubusercontent.com/assets/8546874/19547050/7d2f278c-9649-11e6-9df8-8d5ff12609f0.png">

HDFS audit log:

<img width="1400" alt="screen shot 2016-10-19 at 10 18 14 pm" src="https://cloud.githubusercontent.com/assets/8546874/19547102/096060ae-964a-11e6-981a-cb28efd5a058.png">

Author: Weiqing Yang <yangweiqing001@gmail.com>

Closes #15563 from weiqingy/SPARK-16759.
2016-11-11 18:36:23 -08:00
Vinayak a531fe1a82 [SPARK-17843][WEB UI] Indicate event logs pending for processing on history server UI
## What changes were proposed in this pull request?

History Server UI's application listing to display information on currently under process event logs so a user knows that pending this processing an application may not list on the UI.

When there are no event logs under process, the application list page has a "Last Updated" date-time at the top indicating the date-time of the last _completed_ scan of the event logs. The value is displayed to the user in his/her local time zone.
## How was this patch tested?

All unit tests pass. Particularly all the suites under org.apache.spark.deploy.history.\* were run to test changes.
- Very first startup - Pending logs - no logs processed yet:

<img width="1280" alt="screen shot 2016-10-24 at 3 07 04 pm" src="https://cloud.githubusercontent.com/assets/12079825/19640981/b8d2a96a-99fc-11e6-9b1f-2d736fe90e48.png">
- Very first startup - Pending logs - some logs processed:

<img width="1280" alt="screen shot 2016-10-24 at 3 18 42 pm" src="https://cloud.githubusercontent.com/assets/12079825/19641087/3f8e3bae-99fd-11e6-9ef1-e0e70d71d8ef.png">
- Last updated - No currently pending logs:

<img width="1280" alt="screen shot 2016-10-17 at 8 34 37 pm" src="https://cloud.githubusercontent.com/assets/12079825/19443100/4d13946c-94a9-11e6-8ee2-c442729bb206.png">
- Last updated - With some currently pending logs:

<img width="1280" alt="screen shot 2016-10-24 at 3 09 31 pm" src="https://cloud.githubusercontent.com/assets/12079825/19640903/7323ba3a-99fc-11e6-8359-6a45753dbb28.png">
- No applications found and No currently pending logs:

<img width="1280" alt="screen shot 2016-10-24 at 3 24 26 pm" src="https://cloud.githubusercontent.com/assets/12079825/19641364/03a2cb04-99fe-11e6-87d6-d09587fc6201.png">

Author: Vinayak <vijoshi5@in.ibm.com>

Closes #15410 from vijoshi/SAAS-608_master.
2016-11-11 12:54:16 -06:00
Eric Liang a3356343cb [SPARK-18185] Fix all forms of INSERT / OVERWRITE TABLE for Datasource tables
## What changes were proposed in this pull request?

As of current 2.1, INSERT OVERWRITE with dynamic partitions against a Datasource table will overwrite the entire table instead of only the partitions matching the static keys, as in Hive. It also doesn't respect custom partition locations.

This PR adds support for all these operations to Datasource tables managed by the Hive metastore. It is implemented as follows
- During planning time, the full set of partitions affected by an INSERT or OVERWRITE command is read from the Hive metastore.
- The planner identifies any partitions with custom locations and includes this in the write task metadata.
- FileFormatWriter tasks refer to this custom locations map when determining where to write for dynamic partition output.
- When the write job finishes, the set of written partitions is compared against the initial set of matched partitions, and the Hive metastore is updated to reflect the newly added / removed partitions.

It was necessary to introduce a method for staging files with absolute output paths to `FileCommitProtocol`. These files are not handled by the Hadoop output committer but are moved to their final locations when the job commits.

The overwrite behavior of legacy Datasource tables is also changed: no longer will the entire table be overwritten if a partial partition spec is present.

cc cloud-fan yhuai

## How was this patch tested?

Unit tests, existing tests.

Author: Eric Liang <ekl@databricks.com>
Author: Wenchen Fan <wenchen@databricks.com>

Closes #15814 from ericl/sc-5027.
2016-11-10 17:00:43 -08:00
wm624@hotmail.com 22a9d064e9
[SPARK-14914][CORE] Fix Resource not closed after using, for unit tests and example
## What changes were proposed in this pull request?

This is a follow-up work of #15618.

Close file source;
For any newly created streaming context outside the withContext, explicitly close the context.

## How was this patch tested?

Existing unit tests.

Author: wm624@hotmail.com <wm624@hotmail.com>

Closes #15818 from wangmiao1981/rtest.
2016-11-10 10:54:36 +00:00
jiangxingbo 64fbdf1aa9 [SPARK-18191][CORE][FOLLOWUP] Call setConf if OutputFormat is Configurable.
## What changes were proposed in this pull request?

We should call `setConf` if `OutputFormat` is `Configurable`, this should be done before we create `OutputCommitter` and `RecordWriter`.
This is follow up of #15769, see discussion [here](https://github.com/apache/spark/pull/15769/files#r87064229)

## How was this patch tested?

Add test of this case in `PairRDDFunctionsSuite`.

Author: jiangxingbo <jiangxb1987@gmail.com>

Closes #15823 from jiangxb1987/config-format.
2016-11-09 13:14:26 -08:00
Vinayak 06a13ecca7 [SPARK-16808][CORE] History Server main page does not honor APPLICATION_WEB_PROXY_BASE
## What changes were proposed in this pull request?

Application links generated on the history server UI no longer (regression from 1.6) contain the configured spark.ui.proxyBase in the links. To address this, made the uiRoot available globally to all javascripts for Web UI. Updated the mustache template (historypage-template.html) to include the uiroot for rendering links to the applications.

The existing test was not sufficient to verify the scenario where ajax call is used to populate the application listing template, so added a new selenium test case to cover this scenario.

## How was this patch tested?

Existing tests and a new unit test.
No visual changes to the UI.

Author: Vinayak <vijoshi5@in.ibm.com>

Closes #15742 from vijoshi/SPARK-16808_master.
2016-11-09 10:40:14 -08:00
Shixiong Zhu b6de0c98c7 [SPARK-18280][CORE] Fix potential deadlock in StandaloneSchedulerBackend.dead
## What changes were proposed in this pull request?

"StandaloneSchedulerBackend.dead" is called in a RPC thread, so it should not call "SparkContext.stop" in the same thread. "SparkContext.stop" will block until all RPC threads exit, if it's called inside a RPC thread, it will be dead-lock.

This PR add a thread local flag inside RPC threads. `SparkContext.stop` uses it to decide if launching a new thread to stop the SparkContext.

## How was this patch tested?

Jenkins

Author: Shixiong Zhu <shixiong@databricks.com>

Closes #15775 from zsxwing/SPARK-18280.
2016-11-08 13:14:56 -08:00
jiangxingbo 9c419698fe [SPARK-18191][CORE] Port RDD API to use commit protocol
## What changes were proposed in this pull request?

This PR port RDD API to use commit protocol, the changes made here:
1. Add new internal helper class that saves an RDD using a Hadoop OutputFormat named `SparkNewHadoopWriter`, it's similar with `SparkHadoopWriter` but uses commit protocol. This class supports the newer `mapreduce` API, instead of the old `mapred` API which is supported by `SparkHadoopWriter`;
2. Rewrite `PairRDDFunctions.saveAsNewAPIHadoopDataset` function, so it uses commit protocol now.

## How was this patch tested?
Exsiting test cases.

Author: jiangxingbo <jiangxb1987@gmail.com>

Closes #15769 from jiangxb1987/rdd-commit.
2016-11-08 09:41:01 -08:00
fidato 6f3697136a [SPARK-16575][CORE] partition calculation mismatch with sc.binaryFiles
## What changes were proposed in this pull request?

This Pull request comprises of the critical bug SPARK-16575 changes. This change rectifies the issue with BinaryFileRDD partition calculations as  upon creating an RDD with sc.binaryFiles, the resulting RDD always just consisted of two partitions only.
## How was this patch tested?

The original issue ie. getNumPartitions on binary Files RDD (always having two partitions) was first replicated and then tested upon the changes. Also the unit tests have been checked and passed.

This contribution is my original work and I licence the work to the project under the project's open source license

srowen hvanhovell rxin vanzin skyluc kmader zsxwing datafarmer Please have a look .

Author: fidato <fidato.july13@gmail.com>

Closes #15327 from fidato13/SPARK-16575.
2016-11-07 18:41:17 -08:00
Josh Rosen 3a710b94b0 [SPARK-18236] Reduce duplicate objects in Spark UI and HistoryServer
## What changes were proposed in this pull request?

When profiling heap dumps from the HistoryServer and live Spark web UIs, I found a large amount of memory being wasted on duplicated objects and strings. This patch's changes remove most of this duplication, resulting in over 40% memory savings for some benchmarks.

- **Task metrics** (6441f0624dfcda9c7193a64bfb416a145b5aabdf): previously, every `TaskUIData` object would have its own instances of `InputMetricsUIData`, `OutputMetricsUIData`, `ShuffleReadMetrics`, and `ShuffleWriteMetrics`, but for many tasks these metrics are irrelevant because they're all zero. This patch changes how we construct these metrics in order to re-use a single immutable "empty" value for the cases where these metrics are empty.
- **TaskInfo.accumulables** (ade86db901127bf13c0e0bdc3f09c933a093bb76): Previously, every `TaskInfo` object had its own empty `ListBuffer` for holding updates from named accumulators. Tasks which didn't use named accumulators still paid for the cost of allocating and storing this empty buffer. To avoid this overhead, I changed the `val` with a mutable buffer into a `var` which holds an immutable Scala list, allowing tasks which do not have named accumulator updates to share the same singleton `Nil` object.
- **String.intern() in JSONProtocol** (7e05630e9a78c455db8c8c499f0590c864624e05): in the HistoryServer, executor hostnames and ids are deserialized from JSON, leading to massive duplication of these string objects. By calling `String.intern()` on the deserialized values we can remove all of this duplication. Since Spark now requires Java 7+ we don't have to worry about string interning exhausting the permgen (see http://java-performance.info/string-intern-in-java-6-7-8/).

## How was this patch tested?

I ran

```
sc.parallelize(1 to 100000, 100000).count()
```

in `spark-shell` with event logging enabled, then loaded that event log in the HistoryServer, performed a full GC, and took a heap dump. According to YourKit, the changes in this patch reduced memory consumption by roughly 28 megabytes (or 770k Java objects):

![image](https://cloud.githubusercontent.com/assets/50748/19953276/4f3a28aa-a129-11e6-93df-d7fa91396f66.png)

Here's a table illustrating the drop in objects due to deduplication (the drop is <100k for some objects because some events were dropped from the listener bus; this is a separate, existing bug that I'll address separately after CPU-profiling):

![image](https://cloud.githubusercontent.com/assets/50748/19953290/6a271290-a129-11e6-93ad-b825f1448886.png)

Author: Josh Rosen <joshrosen@databricks.com>

Closes #15743 from JoshRosen/spark-ui-memory-usage.
2016-11-07 16:14:19 -08:00
Hyukjin Kwon 8f0ea011a7 [SPARK-14914][CORE] Fix Resource not closed after using, mostly for unit tests
## What changes were proposed in this pull request?

Close `FileStreams`, `ZipFiles` etc to release the resources after using. Not closing the resources will cause IO Exception to be raised while deleting temp files.
## How was this patch tested?

Existing tests

Author: U-FAREAST\tl <tl@microsoft.com>
Author: hyukjinkwon <gurwls223@gmail.com>
Author: Tao LI <tl@microsoft.com>

Closes #15618 from HyukjinKwon/SPARK-14914-1.
2016-11-07 12:47:39 -08:00
Susan X. Huynh 9a87c31385
[SPARK-17964][SPARKR] Enable SparkR with Mesos client mode and cluster mode
## What changes were proposed in this pull request?

Enabled SparkR with Mesos client mode and cluster mode. Just a few changes were required to get this working on Mesos: (1) removed the SparkR on Mesos error checks and (2) do not require "--class" to be specified for R apps. The logic to check spark.mesos.executor.home was already in there.

sun-rui

## How was this patch tested?

1. SparkSubmitSuite
2. On local mesos cluster (on laptop): ran SparkR shell, spark-submit client mode, and spark-submit cluster mode, with the "examples/src/main/R/dataframe.R" example application.
3. On multi-node mesos cluster: ran SparkR shell, spark-submit client mode, and spark-submit cluster mode, with the "examples/src/main/R/dataframe.R" example application. I tested with the following --conf values set: spark.mesos.executor.docker.image and spark.mesos.executor.home

This contribution is my original work and I license the work to the project under the project's open source license.

Author: Susan X. Huynh <xhuynh@mesosphere.com>

Closes #15700 from susanxhuynh/susan-r-branch.
2016-11-05 17:45:15 +00:00
Weiqing Yang 8a9ca19247 [SPARK-17710][FOLLOW UP] Add comments to state why 'Utils.classForName' is not used
## What changes were proposed in this pull request?
Add comments.

## How was this patch tested?
Build passed.

Author: Weiqing Yang <yangweiqing001@gmail.com>

Closes #15776 from weiqingy/SPARK-17710.
2016-11-04 23:44:46 -07:00
Josh Rosen 0e3312ee72 [SPARK-18256] Improve the performance of event log replay in HistoryServer
## What changes were proposed in this pull request?

This patch significantly improves the performance of event log replay in the HistoryServer via two simple changes:

- **Don't use `extractOpt`**: it turns out that `json4s`'s `extractOpt` method uses exceptions for control flow, causing huge performance bottlenecks due to the overhead of initializing exceptions. To avoid this overhead, we can simply use our own` Utils.jsonOption` method. This patch replaces all uses of `extractOpt` with `Utils.jsonOption` and adds a style checker rule to ban the use of the slow `extractOpt` method.
- **Don't call `Utils.getFormattedClassName` for every event**: the old code called` Utils.getFormattedClassName` dozens of times per replayed event in order to match up class names in events with SparkListener event names. By simply storing the results of these calls in constants rather than recomputing them, we're able to eliminate a huge performance hotspot by removing thousands of expensive `Class.getSimpleName` calls.

## How was this patch tested?

Tested by profiling the replay of a long event log using YourKit. For an event log containing 1000+ jobs, each of which had thousands of tasks, the changes in this patch cut the replay time in half:

![image](https://cloud.githubusercontent.com/assets/50748/19980953/31154622-a1bd-11e6-9be4-21fbb9b3f9a7.png)

Prior to this patch's changes, the two slowest methods in log replay were internal exceptions thrown by `Json4S` and calls to `Class.getSimpleName()`:

![image](https://cloud.githubusercontent.com/assets/50748/19981052/87416cce-a1bd-11e6-9f25-06a7cd391822.png)

After this patch, these hotspots are completely eliminated.

Author: Josh Rosen <joshrosen@databricks.com>

Closes #15756 from JoshRosen/speed-up-jsonprotocol.
2016-11-04 19:32:26 -07:00
Adam Roberts a42d738c5d [SPARK-18197][CORE] Optimise AppendOnlyMap implementation
## What changes were proposed in this pull request?
This improvement works by using the fastest comparison test first and we observed a 1% throughput performance improvement on PageRank (HiBench large profile) with this change.

We used tprof and before the change in AppendOnlyMap.changeValue (where the optimisation occurs) this method was being used for 8053 profiling ticks representing 0.72% of the overall application time.

After this change we observed this method only occurring for 2786 ticks and for 0.25% of the overall time.

## How was this patch tested?
Existing unit tests and for performance we used HiBench large, profiling with tprof and IBM Healthcenter.

Author: Adam Roberts <aroberts@uk.ibm.com>

Closes #15714 from a-roberts/patch-9.
2016-11-04 12:06:06 -07:00
Dongjoon Hyun 27602c3375 [SPARK-18200][GRAPHX][FOLLOW-UP] Support zero as an initial capacity in OpenHashSet
## What changes were proposed in this pull request?

This is a follow-up PR of #15741 in order to keep `nextPowerOf2` consistent.

**Before**
```
nextPowerOf2(0) => 2
nextPowerOf2(1) => 1
nextPowerOf2(2) => 2
nextPowerOf2(3) => 4
nextPowerOf2(4) => 4
nextPowerOf2(5) => 8
```

**After**
```
nextPowerOf2(0) => 1
nextPowerOf2(1) => 1
nextPowerOf2(2) => 2
nextPowerOf2(3) => 4
nextPowerOf2(4) => 4
nextPowerOf2(5) => 8
```

## How was this patch tested?

N/A

Author: Dongjoon Hyun <dongjoon@apache.org>

Closes #15754 from dongjoon-hyun/SPARK-18200-2.
2016-11-03 23:15:33 -07:00
Sean Owen dc4c600986 [SPARK-18138][DOCS] Document that Java 7, Python 2.6, Scala 2.10, Hadoop < 2.6 are deprecated in Spark 2.1.0
## What changes were proposed in this pull request?

Document that Java 7, Python 2.6, Scala 2.10, Hadoop < 2.6 are deprecated in Spark 2.1.0. This does not actually implement any of the change in SPARK-18138, just peppers the documentation with notices about it.

## How was this patch tested?

Doc build

Author: Sean Owen <sowen@cloudera.com>

Closes #15733 from srowen/SPARK-18138.
2016-11-03 17:27:23 -07:00
Reynold Xin 937af592e6 [SPARK-18219] Move commit protocol API (internal) from sql/core to core module
## What changes were proposed in this pull request?
This patch moves the new commit protocol API from sql/core to core module, so we can use it in the future in the RDD API.

As part of this patch, I also moved the speficiation of the random uuid for the write path out of the commit protocol, and instead pass in a job id.

## How was this patch tested?
N/A

Author: Reynold Xin <rxin@databricks.com>

Closes #15731 from rxin/SPARK-18219.
2016-11-03 02:42:48 -07:00
Dongjoon Hyun d24e736471 [SPARK-18200][GRAPHX] Support zero as an initial capacity in OpenHashSet
## What changes were proposed in this pull request?

[SPARK-18200](https://issues.apache.org/jira/browse/SPARK-18200) reports Apache Spark 2.x raises `java.lang.IllegalArgumentException: requirement failed: Invalid initial capacity` while running `triangleCount`. The root cause is that `VertexSet`, a type alias of `OpenHashSet`, does not allow zero as a initial size. This PR loosens the restriction to allow zero.

## How was this patch tested?

Pass the Jenkins test with a new test case in `OpenHashSetSuite`.

Author: Dongjoon Hyun <dongjoon@apache.org>

Closes #15741 from dongjoon-hyun/SPARK-18200.
2016-11-02 23:50:50 -07:00
Jeff Zhang 3c24299b71 [SPARK-18160][CORE][YARN] spark.files & spark.jars should not be passed to driver in yarn mode
## What changes were proposed in this pull request?

spark.files is still passed to driver in yarn mode, so SparkContext will still handle it which cause the error in the jira desc.

## How was this patch tested?

Tested manually in a 5 node cluster. As this issue only happens in multiple node cluster, so I didn't write test for it.

Author: Jeff Zhang <zjffdu@apache.org>

Closes #15669 from zjffdu/SPARK-18160.
2016-11-02 11:47:45 -07:00
Xiangrui Meng 02f203107b [SPARK-14393][SQL] values generated by non-deterministic functions shouldn't change after coalesce or union
## What changes were proposed in this pull request?

When a user appended a column using a "nondeterministic" function to a DataFrame, e.g., `rand`, `randn`, and `monotonically_increasing_id`, the expected semantic is the following:
- The value in each row should remain unchanged, as if we materialize the column immediately, regardless of later DataFrame operations.

However, since we use `TaskContext.getPartitionId` to get the partition index from the current thread, the values from nondeterministic columns might change if we call `union` or `coalesce` after. `TaskContext.getPartitionId` returns the partition index of the current Spark task, which might not be the corresponding partition index of the DataFrame where we defined the column.

See the unit tests below or JIRA for examples.

This PR uses the partition index from `RDD.mapPartitionWithIndex` instead of `TaskContext` and fixes the partition initialization logic in whole-stage codegen, normal codegen, and codegen fallback. `initializeStatesForPartition(partitionIndex: Int)` was added to `Projection`, `Nondeterministic`, and `Predicate` (codegen) and initialized right after object creation in `mapPartitionWithIndex`. `newPredicate` now returns a `Predicate` instance rather than a function for proper initialization.
## How was this patch tested?

Unit tests. (Actually I'm not very confident that this PR fixed all issues without introducing new ones ...)

cc: rxin davies

Author: Xiangrui Meng <meng@databricks.com>

Closes #15567 from mengxr/SPARK-14393.
2016-11-02 11:41:49 -07:00
Sean Owen 9c8deef64e
[SPARK-18076][CORE][SQL] Fix default Locale used in DateFormat, NumberFormat to Locale.US
## What changes were proposed in this pull request?

Fix `Locale.US` for all usages of `DateFormat`, `NumberFormat`
## How was this patch tested?

Existing tests.

Author: Sean Owen <sowen@cloudera.com>

Closes #15610 from srowen/SPARK-18076.
2016-11-02 09:39:15 +00:00
Jacek Laskowski 70a5db7bbd
[SPARK-18204][WEBUI] Remove SparkUI.appUIAddress
## What changes were proposed in this pull request?

Removing `appUIAddress` attribute since it is no longer in use.
## How was this patch tested?

Local build

Author: Jacek Laskowski <jacek@japila.pl>

Closes #15603 from jaceklaskowski/sparkui-fixes.
2016-11-02 09:21:26 +00:00
Ryan Blue 2dc0480816 [SPARK-17532] Add lock debugging info to thread dumps.
## What changes were proposed in this pull request?

This adds information to the web UI thread dump page about the JVM locks
held by threads and the locks that threads are blocked waiting to
acquire. This should help find cases where lock contention is causing
Spark applications to run slowly.
## How was this patch tested?

Tested by applying this patch and viewing the change in the web UI.

![thread-lock-info](https://cloud.githubusercontent.com/assets/87915/18493057/6e5da870-79c3-11e6-8c20-f54c18a37544.png)

Additions:
- A "Thread Locking" column with the locks held by the thread or that are blocking the thread
- Links from the a blocked thread to the thread holding the lock
- Stack frames show where threads are inside `synchronized` blocks, "holding Monitor(...)"

Author: Ryan Blue <blue@apache.org>

Closes #15088 from rdblue/SPARK-17532-add-thread-lock-info.
2016-11-02 00:08:30 -07:00
Josh Rosen b929537b6e [SPARK-18182] Expose ReplayListenerBus.read() overload which takes string iterator
The `ReplayListenerBus.read()` method is used when implementing a custom `ApplicationHistoryProvider`. The current interface only exposes a `read()` method which takes an `InputStream` and performs stream-to-lines conversion itself, but it would also be useful to expose an overloaded method which accepts an iterator of strings, thereby enabling events to be provided from non-`InputStream` sources.

Author: Josh Rosen <joshrosen@databricks.com>

Closes #15698 from JoshRosen/replay-listener-bus-interface.
2016-11-01 16:49:41 -07:00
Shixiong Zhu d2923f1732 [SPARK-18143][SQL] Ignore Structured Streaming event logs to avoid breaking history server
## What changes were proposed in this pull request?

Because of the refactoring work in Structured Streaming, the event logs generated by Strucutred Streaming in Spark 2.0.0 and 2.0.1 cannot be parsed.

This PR just ignores these logs in ReplayListenerBus because no places use them.
## How was this patch tested?
- Generated events logs using Spark 2.0.0 and 2.0.1, and saved them as `structured-streaming-query-event-logs-2.0.0.txt` and `structured-streaming-query-event-logs-2.0.1.txt`
- The new added test makes sure ReplayListenerBus will skip these bad jsons.

Author: Shixiong Zhu <shixiong@databricks.com>

Closes #15663 from zsxwing/fix-event-log.
2016-10-31 00:11:33 -07:00
Hossein 2881a2d1d1 [SPARK-17919] Make timeout to RBackend configurable in SparkR
## What changes were proposed in this pull request?

This patch makes RBackend connection timeout configurable by user.

## How was this patch tested?
N/A

Author: Hossein <hossein@databricks.com>

Closes #15471 from falaki/SPARK-17919.
2016-10-30 16:17:23 -07:00
Eric Liang 90d3b91f4c [SPARK-18103][SQL] Rename *FileCatalog to *FileIndex
## What changes were proposed in this pull request?

To reduce the number of components in SQL named *Catalog, rename *FileCatalog to *FileIndex. A FileIndex is responsible for returning the list of partitions / files to scan given a filtering expression.

```
TableFileCatalog => CatalogFileIndex
FileCatalog => FileIndex
ListingFileCatalog => InMemoryFileIndex
MetadataLogFileCatalog => MetadataLogFileIndex
PrunedTableFileCatalog => PrunedInMemoryFileIndex
```

cc yhuai marmbrus

## How was this patch tested?

N/A

Author: Eric Liang <ekl@databricks.com>
Author: Eric Liang <ekhliang@gmail.com>

Closes #15634 from ericl/rename-file-provider.
2016-10-30 13:14:45 -07:00
wm624@hotmail.com 701a9d361b
[SPARK-CORE][TEST][MINOR] Fix the wrong comment in test
## What changes were proposed in this pull request?

While learning core scheduler code, I found two lines of wrong comments. This PR simply corrects the comments.

## How was this patch tested?

Author: wm624@hotmail.com <wm624@hotmail.com>

Closes #15631 from wangmiao1981/Rbug.
2016-10-27 10:00:37 +02:00
Yin Huai d3b4831d00 [SPARK-18132] Fix checkstyle
This PR fixes checkstyle.

Author: Yin Huai <yhuai@databricks.com>

Closes #15656 from yhuai/fix-format.
2016-10-26 22:22:23 -07:00
Miao Wang a76846cfb1 [SPARK-18126][SPARK-CORE] getIteratorZipWithIndex accepts negative value as index
## What changes were proposed in this pull request?

(Please fill in changes proposed in this fix)

`Utils.getIteratorZipWithIndex` was added to deal with number of records > 2147483647 in one partition.

method `getIteratorZipWithIndex` accepts `startIndex` < 0, which leads to negative index.

This PR just adds a defensive check on `startIndex` to make sure it is >= 0.

## How was this patch tested?

Add a new unit test.

Author: Miao Wang <miaowang@Miaos-MacBook-Pro.local>

Closes #15639 from wangmiao1981/zip.
2016-10-27 01:17:32 +02:00
Shixiong Zhu 7ac70e7ba8 [SPARK-13747][SQL] Fix concurrent executions in ForkJoinPool for SQL
## What changes were proposed in this pull request?

Calling `Await.result` will allow other tasks to be run on the same thread when using ForkJoinPool. However, SQL uses a `ThreadLocal` execution id to trace Spark jobs launched by a query, which doesn't work perfectly in ForkJoinPool.

This PR just uses `Awaitable.result` instead to  prevent ForkJoinPool from running other tasks in the current waiting thread.

## How was this patch tested?

Jenkins

Author: Shixiong Zhu <shixiong@databricks.com>

Closes #15520 from zsxwing/SPARK-13747.
2016-10-26 10:36:36 -07:00
Shuai Lin 402205ddf7
[SPARK-17802] Improved caller context logging.
## What changes were proposed in this pull request?

[SPARK-16757](https://issues.apache.org/jira/browse/SPARK-16757) sets the hadoop `CallerContext` when calling hadoop/hdfs apis to make spark applications more diagnosable in hadoop/hdfs logs. However, the `org.apache.hadoop.ipc.CallerContext` class is only added since [hadoop 2.8](https://issues.apache.org/jira/browse/HDFS-9184), which is not officially releaed yet. So each time `utils.CallerContext.setCurrentContext()` is called (e.g [when a task is created](https://github.com/apache/spark/blob/b678e46/core/src/main/scala/org/apache/spark/scheduler/Task.scala#L95-L96)), a "java.lang.ClassNotFoundException: org.apache.hadoop.ipc.CallerContext"
error is logged, which pollutes the spark logs when there are lots of tasks.

This patch improves this behaviour by only logging the `ClassNotFoundException` once.

## How was this patch tested?

Existing tests.

Author: Shuai Lin <linshuai2012@gmail.com>

Closes #15377 from lins05/spark-17802-improve-callercontext-logging.
2016-10-26 14:31:47 +02:00
Alex Bozarth 5d0f81da49
[SPARK-4411][WEB UI] Add "kill" link for jobs in the UI
## What changes were proposed in this pull request?

Currently users can kill stages via the web ui but not jobs directly (jobs are killed if one of their stages is). I've added the ability to kill jobs via the web ui. This code change is based on #4823 by lianhuiwang and updated to work with the latest code matching how stages are currently killed. In general I've copied the kill stage code warning and note comments and all. I also updated applicable tests and documentation.

## How was this patch tested?

Manually tested and dev/run-tests

![screen shot 2016-10-11 at 4 49 43 pm](https://cloud.githubusercontent.com/assets/13952758/19292857/12f1b7c0-8fd4-11e6-8982-210249f7b697.png)

Author: Alex Bozarth <ajbozart@us.ibm.com>
Author: Lianhui Wang <lianhuiwang09@gmail.com>

Closes #15441 from ajbozarth/spark4411.
2016-10-26 14:26:54 +02:00
hayashidac c329a568b5 [SPARK-16988][SPARK SHELL] spark history server log needs to be fixed to show https url when ssl is enabled
spark history server log needs to be fixed to show https url when ssl is enabled

Author: chie8842 <chie@chie-no-Mac-mini.local>

Closes #15611 from hayashidac/SPARK-16988.
2016-10-26 07:13:48 +09:00
Vinayak c5fe3dd4f5 [SPARK-18010][CORE] Reduce work performed for building up the application list for the History Server app list UI page
## What changes were proposed in this pull request?
allow ReplayListenerBus to skip deserialising and replaying certain events using an inexpensive check of the event log entry. Use this to ensure that when event log replay is triggered for building the application list, we get the ReplayListenerBus to skip over all but the few events needed for our immediate purpose. Refer [SPARK-18010] for the motivation behind this change.

## How was this patch tested?

Tested with existing HistoryServer and ReplayListener unit test suites. All tests pass.

Please review https://cwiki.apache.org/confluence/display/SPARK/Contributing+to+Spark before opening a pull request.

Author: Vinayak <vijoshi5@in.ibm.com>

Closes #15556 from vijoshi/SAAS-467_master.
2016-10-25 10:36:03 -07:00
Kay Ousterhout 483c37c581 [SPARK-17894][HOTFIX] Fix broken build from
The named parameter in an overridden class isn't supported in Scala 2.10 so was breaking the build.

cc zsxwing

Author: Kay Ousterhout <kayousterhout@gmail.com>

Closes #15617 from kayousterhout/hotfix.
2016-10-24 20:16:00 -07:00
Eren Avsarogullari 81d6933e75 [SPARK-17894][CORE] Ensure uniqueness of TaskSetManager name.
`TaskSetManager` should have unique name to avoid adding duplicate ones to parent `Pool` via `SchedulableBuilder`. This problem has been surfaced with following discussion: [[PR: Avoid adding duplicate schedulables]](https://github.com/apache/spark/pull/15326)

**Proposal** :
There is 1x1 relationship between `stageAttemptId` and `TaskSetManager` so `taskSet.Id` covering both `stageId` and `stageAttemptId` looks to be used for uniqueness of `TaskSetManager` name instead of just `stageId`.

**Current TaskSetManager Name** :
`var name = "TaskSet_" + taskSet.stageId.toString`
**Sample**: TaskSet_0

**Proposed TaskSetManager Name** :
`val name = "TaskSet_" + taskSet.Id ` `// taskSet.Id = (stageId + "." + stageAttemptId)`
**Sample** : TaskSet_0.0

Added new Unit Test.

Author: erenavsarogullari <erenavsarogullari@gmail.com>

Closes #15463 from erenavsarogullari/SPARK-17894.
2016-10-24 15:33:54 -07:00
Zheng RuiFeng c64a8ff397
[SPARK-18049][MLLIB][TEST] Add missing tests for truePositiveRate and weightedTruePositiveRate
## What changes were proposed in this pull request?
Add missing tests for `truePositiveRate` and `weightedTruePositiveRate` in `MulticlassMetricsSuite`

## How was this patch tested?
added testing

Author: Zheng RuiFeng <ruifengz@foxmail.com>

Closes #15585 from zhengruifeng/mc_missing_test.
2016-10-24 10:25:24 +01:00
Sandeep Singh bc167a2a53 [SPARK-928][CORE] Add support for Unsafe-based serializer in Kryo
## What changes were proposed in this pull request?
Now since we have migrated to Kryo-3.0.0 in https://issues.apache.org/jira/browse/SPARK-11416, we can gives users option to use unsafe SerDer. It can turned by setting `spark.kryo.useUnsafe` to `true`

## How was this patch tested?
Ran existing tests

```
     Benchmark Kryo Unsafe vs safe Serialization: Best/Avg Time(ms)    Rate(M/s)   Per Row(ns)   Relative
      ------------------------------------------------------------------------------------------------
      basicTypes: Int unsafe:true                    160 /  178         98.5          10.1       1.0X
      basicTypes: Long unsafe:true                   210 /  218         74.9          13.4       0.8X
      basicTypes: Float unsafe:true                  203 /  213         77.5          12.9       0.8X
      basicTypes: Double unsafe:true                 226 /  235         69.5          14.4       0.7X
      Array: Int unsafe:true                        1087 / 1101         14.5          69.1       0.1X
      Array: Long unsafe:true                       2758 / 2844          5.7         175.4       0.1X
      Array: Float unsafe:true                      1511 / 1552         10.4          96.1       0.1X
      Array: Double unsafe:true                     2942 / 2972          5.3         187.0       0.1X
      Map of string->Double unsafe:true             2645 / 2739          5.9         168.2       0.1X
      basicTypes: Int unsafe:false                   211 /  218         74.7          13.4       0.8X
      basicTypes: Long unsafe:false                  247 /  253         63.6          15.7       0.6X
      basicTypes: Float unsafe:false                 211 /  216         74.5          13.4       0.8X
      basicTypes: Double unsafe:false                227 /  233         69.2          14.4       0.7X
      Array: Int unsafe:false                       3012 / 3032          5.2         191.5       0.1X
      Array: Long unsafe:false                      4463 / 4515          3.5         283.8       0.0X
      Array: Float unsafe:false                     2788 / 2868          5.6         177.2       0.1X
      Array: Double unsafe:false                    3558 / 3752          4.4         226.2       0.0X
      Map of string->Double unsafe:false            2806 / 2933          5.6         178.4       0.1X
```

Author: Sandeep Singh <sandeep@techaddict.me>
Author: Sandeep Singh <sandeep@origamilogic.com>

Closes #12913 from techaddict/SPARK-928.
2016-10-22 12:03:37 -07:00
WeichenXu 4f1dcd3dce [SPARK-18051][SPARK CORE] fix bug of custom PartitionCoalescer causing serialization exception
## What changes were proposed in this pull request?

add a require check in `CoalescedRDD` to make sure the passed in `partitionCoalescer` to be `serializable`.
and update the document for api `RDD.coalesce`

## How was this patch tested?

Manual.(test code in jira [SPARK-18051])

Author: WeichenXu <WeichenXu123@outlook.com>

Closes #15587 from WeichenXu123/fix_coalescer_bug.
2016-10-22 11:59:28 -07:00
Eric Liang 3eca283aca [SPARK-17994][SQL] Add back a file status cache for catalog tables
## What changes were proposed in this pull request?

In SPARK-16980, we removed the full in-memory cache of table partitions in favor of loading only needed partitions from the metastore. This greatly improves the initial latency of queries that only read a small fraction of table partitions.

However, since the metastore does not store file statistics, we need to discover those from remote storage. With the loss of the in-memory file status cache this has to happen on each query, increasing the latency of repeated queries over the same partitions.

The proposal is to add back a per-table cache of partition contents, i.e. Map[Path, Array[FileStatus]]. This cache would be retained per-table, and can be invalidated through refreshTable() and refreshByPath(). Unlike the prior cache, it can be incrementally updated as new partitions are read.

## How was this patch tested?

Existing tests and new tests in `HiveTablePerfStatsSuite`.

cc mallman

Author: Eric Liang <ekl@databricks.com>
Author: Michael Allman <michael@videoamp.com>
Author: Eric Liang <ekhliang@gmail.com>

Closes #15539 from ericl/meta-cache.
2016-10-22 22:08:28 +08:00
w00228970 c1f344f1a0 [SPARK-17929][CORE] Fix deadlock when CoarseGrainedSchedulerBackend reset
## What changes were proposed in this pull request?

https://issues.apache.org/jira/browse/SPARK-17929

Now `CoarseGrainedSchedulerBackend` reset will get the lock,
```
  protected def reset(): Unit = synchronized {
    numPendingExecutors = 0
    executorsPendingToRemove.clear()

    // Remove all the lingering executors that should be removed but not yet. The reason might be
    // because (1) disconnected event is not yet received; (2) executors die silently.
    executorDataMap.toMap.foreach { case (eid, _) =>
      driverEndpoint.askWithRetry[Boolean](
        RemoveExecutor(eid, SlaveLost("Stale executor after cluster manager re-registered.")))
    }
  }
```
 but on removeExecutor also need the lock "CoarseGrainedSchedulerBackend.this.synchronized", this will cause deadlock.

```
   private def removeExecutor(executorId: String, reason: ExecutorLossReason): Unit = {
      logDebug(s"Asked to remove executor $executorId with reason $reason")
      executorDataMap.get(executorId) match {
        case Some(executorInfo) =>
          // This must be synchronized because variables mutated
          // in this block are read when requesting executors
          val killed = CoarseGrainedSchedulerBackend.this.synchronized {
            addressToExecutorId -= executorInfo.executorAddress
            executorDataMap -= executorId
            executorsPendingLossReason -= executorId
            executorsPendingToRemove.remove(executorId).getOrElse(false)
          }
     ...

## How was this patch tested?

manual test.

Author: w00228970 <wangfei1@huawei.com>

Closes #15481 from scwf/spark-17929.
2016-10-21 14:43:55 -07:00
Hossein e371040a01 [SPARK-17811] SparkR cannot parallelize data.frame with NA or NULL in Date columns
## What changes were proposed in this pull request?
NA date values are serialized as "NA" and NA time values are serialized as NaN from R. In the backend we did not have proper logic to deal with them. As a result we got an IllegalArgumentException for Date and wrong value for time. This PR adds support for deserializing NA as Date and Time.

## How was this patch tested?
* [x] TODO

Author: Hossein <hossein@databricks.com>

Closes #15421 from falaki/SPARK-17811.
2016-10-21 12:38:52 -07:00
Alex Bozarth 3a237512b1
[SPARK-13275][WEB UI] Visually clarified executors start time in timeline
## What changes were proposed in this pull request?

Updated the Executors added/removed bubble in the time line so it's clearer where it starts. Now the bubble is left justified on the start time (still also denoted by the line) rather than center justified.

## How was this patch tested?

Manually tested UI

<img width="596" alt="screen shot 2016-10-17 at 6 04 36 pm" src="https://cloud.githubusercontent.com/assets/13952758/19496563/e6c9186e-953c-11e6-85e4-63309a553f65.png">
<img width="492" alt="screen shot 2016-10-17 at 5 54 09 pm" src="https://cloud.githubusercontent.com/assets/13952758/19496568/e9f06132-953c-11e6-8901-54405ebc7f5b.png">

Author: Alex Bozarth <ajbozart@us.ibm.com>

Closes #15536 from ajbozarth/spark13275.
2016-10-21 11:39:32 +01:00
Zheng RuiFeng a8ea4da8d0
[SPARK-17331][FOLLOWUP][ML][CORE] Avoid allocating 0-length arrays
## What changes were proposed in this pull request?

`Array[T]()` -> `Array.empty[T]` to avoid allocating 0-length arrays.
Use regex `find . -name '*.scala' | xargs -i bash -c 'egrep "Array\[[A-Za-z]+\]\(\)" -n {} && echo {}'` to find modification candidates.

cc srowen

## How was this patch tested?
existing tests

Author: Zheng RuiFeng <ruifengz@foxmail.com>

Closes #15564 from zhengruifeng/avoid_0_length_array.
2016-10-21 09:49:37 +01:00
Jagadeesan 595893d33a
[SPARK-17960][PYSPARK][UPGRADE TO PY4J 0.10.4]
## What changes were proposed in this pull request?

1) Upgrade the Py4J version on the Java side
2) Update the py4j src zip file we bundle with Spark

## How was this patch tested?

Existing doctests & unit tests pass

Author: Jagadeesan <as2@us.ibm.com>

Closes #15514 from jagadeesanas2/SPARK-17960.
2016-10-21 09:48:24 +01:00
WeichenXu 39755169fb [SPARK-18003][SPARK CORE] Fix bug of RDD zipWithIndex & zipWithUniqueId index value overflowing
## What changes were proposed in this pull request?

- Fix bug of RDD `zipWithIndex` generating wrong result when one partition contains more than 2147483647 records.

- Fix bug of RDD `zipWithUniqueId` generating wrong result when one partition contains more than 2147483647 records.

## How was this patch tested?

test added.

Author: WeichenXu <WeichenXu123@outlook.com>

Closes #15550 from WeichenXu123/fix_rdd_zipWithIndex_overflow.
2016-10-19 23:41:38 -07:00
Alex Bozarth 444c2d22e3 [SPARK-10541][WEB UI] Allow ApplicationHistoryProviders to provide their own text when there aren't any complete apps
## What changes were proposed in this pull request?

I've added a method to `ApplicationHistoryProvider` that returns the html paragraph to display when there are no applications. This allows providers other than `FsHistoryProvider` to determine what is printed. The current hard coded text is now moved into `FsHistoryProvider` since it assumed that's what was being used before.

I chose to make the function return html rather than text because the current text block had inline html in it and it allows a new implementation of `ApplicationHistoryProvider` more versatility. I did not see any security issues with this since injecting html here requires implementing `ApplicationHistoryProvider` and can't be done outside of code.

## How was this patch tested?

Manual testing and dev/run-tests

No visible changes to the UI

Author: Alex Bozarth <ajbozart@us.ibm.com>

Closes #15490 from ajbozarth/spark10541.
2016-10-19 13:01:33 -07:00
Yu Peng 2629cd7460 [SPARK-17711][TEST-HADOOP2.2] Fix hadoop2.2 compilation error
## What changes were proposed in this pull request?

Fix hadoop2.2 compilation error.

## How was this patch tested?

Existing tests.

cc tdas zsxwing

Author: Yu Peng <loneknightpy@gmail.com>

Closes #15537 from loneknightpy/fix-17711.
2016-10-18 19:43:08 -07:00
Guoqiang Li 4518642abd [SPARK-17930][CORE] The SerializerInstance instance used when deserializing a TaskResult is not reused
## What changes were proposed in this pull request?
The following code is called when the DirectTaskResult instance is deserialized

```scala

  def value(): T = {
    if (valueObjectDeserialized) {
      valueObject
    } else {
      // Each deserialization creates a new instance of SerializerInstance, which is very time-consuming
      val resultSer = SparkEnv.get.serializer.newInstance()
      valueObject = resultSer.deserialize(valueBytes)
      valueObjectDeserialized = true
      valueObject
    }
  }

```

In the case of stage has a lot of tasks, reuse SerializerInstance instance can improve the scheduling performance of three times

The test data is TPC-DS 2T (Parquet) and  SQL statement as follows (query 2):

```sql

select  i_item_id,
        avg(ss_quantity) agg1,
        avg(ss_list_price) agg2,
        avg(ss_coupon_amt) agg3,
        avg(ss_sales_price) agg4
 from store_sales, customer_demographics, date_dim, item, promotion
 where ss_sold_date_sk = d_date_sk and
       ss_item_sk = i_item_sk and
       ss_cdemo_sk = cd_demo_sk and
       ss_promo_sk = p_promo_sk and
       cd_gender = 'M' and
       cd_marital_status = 'M' and
       cd_education_status = '4 yr Degree' and
       (p_channel_email = 'N' or p_channel_event = 'N') and
       d_year = 2001
 group by i_item_id
 order by i_item_id
 limit 100;

```

`spark-defaults.conf` file:

```
spark.master                           yarn-client
spark.executor.instances               20
spark.driver.memory                    16g
spark.executor.memory                  30g
spark.executor.cores                   5
spark.default.parallelism              100
spark.sql.shuffle.partitions           100000
spark.serializer                       org.apache.spark.serializer.KryoSerializer
spark.driver.maxResultSize              0
spark.rpc.netty.dispatcher.numThreads   8
spark.executor.extraJavaOptions          -XX:+UseG1GC -XX:+UseStringDeduplication -XX:G1HeapRegionSize=16M -XX:MetaspaceSize=256M
spark.cleaner.referenceTracking.blocking true
spark.cleaner.referenceTracking.blocking.shuffle true

```

Performance test results are as follows

[SPARK-17930](https://github.com/witgo/spark/tree/SPARK-17930)| [ed14633](ed14633414])
------------ | -------------
54.5 s|231.7 s

## How was this patch tested?

Existing tests.

Author: Guoqiang Li <witgo@qq.com>

Closes #15512 from witgo/SPARK-17930.
2016-10-18 13:46:57 -07:00
Yu Peng 231f39e3f6 [SPARK-17711] Compress rolled executor log
## What changes were proposed in this pull request?

This PR adds support for executor log compression.

## How was this patch tested?

Unit tests

cc: yhuai tdas mengxr

Author: Yu Peng <loneknightpy@gmail.com>

Closes #15285 from loneknightpy/compress-executor-log.
2016-10-18 13:23:31 -07:00
Liwei Lin 7d878cf2da [SQL][STREAMING][TEST] Fix flaky tests in StreamingQueryListenerSuite
This work has largely been done by lw-lin in his PR #15497. This is a slight refactoring of it.

## What changes were proposed in this pull request?
There were two sources of flakiness in StreamingQueryListener test.

- When testing with manual clock, consecutive attempts to advance the clock can occur without the stream execution thread being unblocked and doing some work between the two attempts. Hence the following can happen with the current ManualClock.
```
+-----------------------------------+--------------------------------+
|      StreamExecution thread       |         testing thread         |
+-----------------------------------+--------------------------------+
|  ManualClock.waitTillTime(100) {  |                                |
|        _isWaiting = true          |                                |
|            wait(10)               |                                |
|        still in wait(10)          |  if (_isWaiting) advance(100)  |
|        still in wait(10)          |  if (_isWaiting) advance(200)  | <- this should be disallowed !
|        still in wait(10)          |  if (_isWaiting) advance(300)  | <- this should be disallowed !
|      wake up from wait(10)        |                                |
|       current time is 600         |                                |
|       _isWaiting = false          |                                |
|  }                                |                                |
+-----------------------------------+--------------------------------+
```

- Second source of flakiness is that the adding data to memory stream may get processing in any trigger, not just the first trigger.

My fix is to make the manual clock wait for the other stream execution thread to start waiting for the clock at the right wait start time. That is, `advance(200)` (see above) will wait for stream execution thread to complete the wait that started at time 0, and start a new wait at time 200 (i.e. time stamp after the previous `advance(100)`).

In addition, since this is a feature that is solely used by StreamExecution, I removed all the non-generic code from ManualClock and put them in StreamManualClock inside StreamTest.

## How was this patch tested?
Ran existing unit test MANY TIME in Jenkins

Author: Tathagata Das <tathagata.das1565@gmail.com>
Author: Liwei Lin <lwlin7@gmail.com>

Closes #15519 from tdas/metrics-flaky-test-fix.
2016-10-18 00:49:57 -07:00
Sital Kedia c7ac027d5f [SPARK-17839][CORE] Use Nio's directbuffer instead of BufferedInputStream in order to avoid additional copy from os buffer cache to user buffer
## What changes were proposed in this pull request?

Currently we use BufferedInputStream to read the shuffle file which copies the file content from os buffer cache to the user buffer. This adds additional latency in reading the spill files. We made a change to use java nio's direct buffer to read the spill files and for certain pipelines spilling significant amount of data, we see up to 7% speedup for the entire pipeline.

## How was this patch tested?
Tested by running the job in the cluster and observed up to 7% speedup.

Author: Sital Kedia <skedia@fb.com>

Closes #15408 from sitalkedia/skedia/nio_spill_read.
2016-10-17 11:03:04 -07:00
Reynold Xin 72a6e7a57a Revert "[SPARK-17637][SCHEDULER] Packed scheduling for Spark tasks across executors"
This reverts commit ed14633414.

The patch merged had obvious quality and documentation issue. The idea is useful, and we should work towards improving its quality and merging it in again.
2016-10-15 22:31:37 -07:00
Zhan Zhang ed14633414 [SPARK-17637][SCHEDULER] Packed scheduling for Spark tasks across executors
## What changes were proposed in this pull request?

Restructure the code and implement two new task assigner.
PackedAssigner: try to allocate tasks to the executors with least available cores, so that spark can release reserved executors when dynamic allocation is enabled.

BalancedAssigner: try to allocate tasks to the executors with more available cores in order to balance the workload across all executors.

By default, the original round robin assigner is used.

We test a pipeline, and new PackedAssigner  save around 45% regarding the reserved cpu and memory with dynamic allocation enabled.

## How was this patch tested?

(Please explain how this patch was tested. E.g. unit tests, integration tests, manual tests)
Both unit test in TaskSchedulerImplSuite and manual tests in production pipeline.

Author: Zhan Zhang <zhanzhang@fb.com>

Closes #15218 from zhzhan/packed-scheduler.
2016-10-15 18:45:04 -07:00
Michael Allman 6ce1b675ee [SPARK-16980][SQL] Load only catalog table partition metadata required to answer a query
(This PR addresses https://issues.apache.org/jira/browse/SPARK-16980.)

## What changes were proposed in this pull request?

In a new Spark session, when a partitioned Hive table is converted to use Spark's `HadoopFsRelation` in `HiveMetastoreCatalog`, metadata for every partition of that table are retrieved from the metastore and loaded into driver memory. In addition, every partition's metadata files are read from the filesystem to perform schema inference.

If a user queries such a table with predicates which prune that table's partitions, we would like to be able to answer that query without consulting partition metadata which are not involved in the query. When querying a table with a large number of partitions for some data from a small number of partitions (maybe even a single partition), the current conversion strategy is highly inefficient. I suspect this scenario is not uncommon in the wild.

In addition to being inefficient in running time, the current strategy is inefficient in its use of driver memory. When the sum of the number of partitions of all tables loaded in a driver reaches a certain level (somewhere in the tens of thousands), their cached data exhaust all driver heap memory in the default configuration. I suspect this scenario is less common (in that not too many deployments work with tables with tens of thousands of partitions), however this does illustrate how large the memory footprint of this metadata can be. With tables with hundreds or thousands of partitions, I would expect the `HiveMetastoreCatalog` table cache to represent a significant portion of the driver's heap space.

This PR proposes an alternative approach. Basically, it makes four changes:

1. It adds a new method, `listPartitionsByFilter` to the Catalyst `ExternalCatalog` trait which returns the partition metadata for a given sequence of partition pruning predicates.
1. It refactors the `FileCatalog` type hierarchy to include a new `TableFileCatalog` to efficiently return files only for partitions matching a sequence of partition pruning predicates.
1. It removes partition loading and caching from `HiveMetastoreCatalog`.
1. It adds a new Catalyst optimizer rule, `PruneFileSourcePartitions`, which applies a plan's partition-pruning predicates to prune out unnecessary partition files from a `HadoopFsRelation`'s underlying file catalog.

The net effect is that when a query over a partitioned Hive table is planned, the analyzer retrieves the table metadata from `HiveMetastoreCatalog`. As part of this operation, the `HiveMetastoreCatalog` builds a `HadoopFsRelation` with a `TableFileCatalog`. It does not load any partition metadata or scan any files. The optimizer prunes-away unnecessary table partitions by sending the partition-pruning predicates to the relation's `TableFileCatalog `. The `TableFileCatalog` in turn calls the `listPartitionsByFilter` method on its external catalog. This queries the Hive metastore, passing along those filters.

As a bonus, performing partition pruning during optimization leads to a more accurate relation size estimate. This, along with c481bdf, can lead to automatic, safe application of the broadcast optimization in a join where it might previously have been omitted.

## Open Issues

1. This PR omits partition metadata caching. I can add this once the overall strategy for the cold path is established, perhaps in a future PR.
1. This PR removes and omits partitioned Hive table schema reconciliation. As a result, it fails to find Parquet schema columns with upper case letters because of the Hive metastore's case-insensitivity. This issue may be fixed by #14750, but that PR appears to have stalled. ericl has contributed to this PR a workaround for Parquet wherein schema reconciliation occurs at query execution time instead of planning. Whether ORC requires a similar patch is an open issue.
1. This PR omits an implementation of `listPartitionsByFilter` for the `InMemoryCatalog`.
1. This PR breaks parquet log output redirection during query execution. I can work around this by running `Class.forName("org.apache.spark.sql.execution.datasources.parquet.ParquetFileFormat$")` first thing in a Spark shell session, but I haven't figured out how to fix this properly.

## How was this patch tested?

The current Spark unit tests were run, and some ad-hoc tests were performed to validate that only the necessary partition metadata is loaded.

Author: Michael Allman <michael@videoamp.com>
Author: Eric Liang <ekl@databricks.com>
Author: Eric Liang <ekhliang@gmail.com>

Closes #14690 from mallman/spark-16980-lazy_partition_fetching.
2016-10-14 18:26:18 -07:00
invkrh 28b645b1e6
[SPARK-17855][CORE] Remove query string from jar url
## What changes were proposed in this pull request?

Spark-submit support jar url with http protocol. However, if the url contains any query strings, `worker.DriverRunner.downloadUserJar()` method will throw "Did not see expected jar" exception. This is because this method checks the existance of a downloaded jar whose name contains query strings. This is a problem when your jar is located on some web service which requires some additional information to retrieve the file.

This pr just removes query strings before checking jar existance on worker.

## How was this patch tested?

For now, you can only test this patch by manual test.
* Deploy a spark cluster locally
* Make sure apache httpd service is on
* Save an uber jar, e.g spark-job.jar under `/var/www/html/`
* Use http://localhost/spark-job.jar?param=1 as jar url when running `spark-submit`
* Job should be launched

Author: invkrh <invkrh@gmail.com>

Closes #15420 from invkrh/spark-17855.
2016-10-14 12:52:08 +01:00
jerryshao 7bf8a40498 [SPARK-17686][CORE] Support printing out scala and java version with spark-submit --version command
## What changes were proposed in this pull request?

In our universal gateway service we need to specify different jars to Spark according to scala version. For now only after launching Spark application can we know which version of Scala it depends on. It makes hard for us to support different Scala + Spark versions to pick the right jars.

So here propose to print out Scala version according to Spark version in "spark-submit --version", so that user could leverage this output to make the choice without needing to launching application.

## How was this patch tested?

Manually verified in local environment.

Author: jerryshao <sshao@hortonworks.com>

Closes #15456 from jerryshao/SPARK-17686.
2016-10-13 03:29:14 -04:00
Alex Bozarth 6f2fa6c54a [SPARK-11272][WEB UI] Add support for downloading event logs from HistoryServer UI
## What changes were proposed in this pull request?

This is a reworked PR based on feedback in #9238 after it was closed and not reopened. As suggested in that PR I've only added the download feature. This functionality already exists in the api and this allows easier access to download event logs to share with others.

I've attached a screenshot of the committed version, but I will also include alternate options with screen shots in the comments below. I'm personally not sure which option is best.

## How was this patch tested?

Manual testing

![screen shot 2016-10-07 at 6 11 12 pm](https://cloud.githubusercontent.com/assets/13952758/19209213/832fe48e-8cba-11e6-9840-749b1be4d399.png)

Author: Alex Bozarth <ajbozart@us.ibm.com>

Closes #15400 from ajbozarth/spark11272.
2016-10-13 03:24:37 -04:00
Imran Rashid 9ce7d3e542 [SPARK-17675][CORE] Expand Blacklist for TaskSets
## What changes were proposed in this pull request?

This is a step along the way to SPARK-8425.

To enable incremental review, the first step proposed here is to expand the blacklisting within tasksets. In particular, this will enable blacklisting for
* (task, executor) pairs (this already exists via an undocumented config)
* (task, node)
* (taskset, executor)
* (taskset, node)

Adding (task, node) is critical to making spark fault-tolerant of one-bad disk in a cluster, without requiring careful tuning of "spark.task.maxFailures". The other additions are also important to avoid many misleading task failures and long scheduling delays when there is one bad node on a large cluster.

Note that some of the code changes here aren't really required for just this -- they put pieces in place for SPARK-8425 even though they are not used yet (eg. the `BlacklistTracker` helper is a little out of place, `TaskSetBlacklist` holds onto a little more info than it needs to for just this change, and `ExecutorFailuresInTaskSet` is more complex than it needs to be).

## How was this patch tested?

Added unit tests, run tests via jenkins.

Author: Imran Rashid <irashid@cloudera.com>
Author: mwws <wei.mao@intel.com>

Closes #15249 from squito/taskset_blacklist_only.
2016-10-12 16:43:03 -05:00
Shixiong Zhu 47776e7c0c [SPARK-17850][CORE] Add a flag to ignore corrupt files
## What changes were proposed in this pull request?

Add a flag to ignore corrupt files. For Spark core, the configuration is `spark.files.ignoreCorruptFiles`. For Spark SQL, it's `spark.sql.files.ignoreCorruptFiles`.

## How was this patch tested?

The added unit tests

Author: Shixiong Zhu <shixiong@databricks.com>

Closes #15422 from zsxwing/SPARK-17850.
2016-10-12 13:51:53 -07:00
Hossein 5cc503f4fe [SPARK-17790][SPARKR] Support for parallelizing R data.frame larger than 2GB
## What changes were proposed in this pull request?
If the R data structure that is being parallelized is larger than `INT_MAX` we use files to transfer data to JVM. The serialization protocol mimics Python pickling. This allows us to simply call `PythonRDD.readRDDFromFile` to create the RDD.

I tested this on my MacBook. Following code works with this patch:
```R
intMax <- .Machine$integer.max
largeVec <- 1:intMax
rdd <- SparkR:::parallelize(sc, largeVec, 2)
```

## How was this patch tested?
* [x] Unit tests

Author: Hossein <hossein@databricks.com>

Closes #15375 from falaki/SPARK-17790.
2016-10-12 10:32:38 -07:00
Wenchen Fan b9a147181d [SPARK-17720][SQL] introduce static SQL conf
## What changes were proposed in this pull request?

SQLConf is session-scoped and mutable. However, we do have the requirement for a static SQL conf, which is global and immutable, e.g. the `schemaStringThreshold` in `HiveExternalCatalog`, the flag to enable/disable hive support, the global temp view database in https://github.com/apache/spark/pull/14897.

Actually we've already implemented static SQL conf implicitly via `SparkConf`, this PR just make it explicit and expose it to users, so that they can see the config value via SQL command or `SparkSession.conf`, and forbid users to set/unset static SQL conf.

## How was this patch tested?

new tests in SQLConfSuite

Author: Wenchen Fan <wenchen@databricks.com>

Closes #15295 from cloud-fan/global-conf.
2016-10-11 20:27:08 -07:00
Bryan Cutler 658c7147f5
[SPARK-17808][PYSPARK] Upgraded version of Pyrolite to 4.13
## What changes were proposed in this pull request?
Upgraded to a newer version of Pyrolite which supports serialization of a BinaryType StructField for PySpark.SQL

## How was this patch tested?
Added a unit test which fails with a raised ValueError when using the previous version of Pyrolite 4.9 and Python3

Author: Bryan Cutler <cutlerb@gmail.com>

Closes #15386 from BryanCutler/pyrolite-upgrade-SPARK-17808.
2016-10-11 08:29:52 +02:00
Ergin Seyfe 19a5bae47f [SPARK-17816][CORE] Fix ConcurrentModificationException issue in BlockStatusesAccumulator
## What changes were proposed in this pull request?
Change the BlockStatusesAccumulator to return immutable object when value method is called.

## How was this patch tested?
Existing tests plus I verified this change by running a pipeline which consistently repro this issue.

This is the stack trace for this exception:
`
java.util.ConcurrentModificationException
        at java.util.ArrayList$Itr.checkForComodification(ArrayList.java:901)
        at java.util.ArrayList$Itr.next(ArrayList.java:851)
        at scala.collection.convert.Wrappers$JIteratorWrapper.next(Wrappers.scala:43)
        at scala.collection.Iterator$class.foreach(Iterator.scala:893)
        at scala.collection.AbstractIterator.foreach(Iterator.scala:1336)
        at scala.collection.IterableLike$class.foreach(IterableLike.scala:72)
        at scala.collection.AbstractIterable.foreach(Iterable.scala:54)
        at scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:59)
        at scala.collection.mutable.ListBuffer.$plus$plus$eq(ListBuffer.scala:183)
        at scala.collection.mutable.ListBuffer.$plus$plus$eq(ListBuffer.scala:45)
        at scala.collection.TraversableLike$class.to(TraversableLike.scala:590)
        at scala.collection.AbstractTraversable.to(Traversable.scala:104)
        at scala.collection.TraversableOnce$class.toList(TraversableOnce.scala:294)
        at scala.collection.AbstractTraversable.toList(Traversable.scala:104)
        at org.apache.spark.util.JsonProtocol$.accumValueToJson(JsonProtocol.scala:314)
        at org.apache.spark.util.JsonProtocol$$anonfun$accumulableInfoToJson$5.apply(JsonProtocol.scala:291)
        at org.apache.spark.util.JsonProtocol$$anonfun$accumulableInfoToJson$5.apply(JsonProtocol.scala:291)
        at scala.Option.map(Option.scala:146)
        at org.apache.spark.util.JsonProtocol$.accumulableInfoToJson(JsonProtocol.scala:291)
        at org.apache.spark.util.JsonProtocol$$anonfun$taskInfoToJson$12.apply(JsonProtocol.scala:283)
        at org.apache.spark.util.JsonProtocol$$anonfun$taskInfoToJson$12.apply(JsonProtocol.scala:283)
        at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
        at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
        at scala.collection.immutable.List.foreach(List.scala:381)
        at scala.collection.generic.TraversableForwarder$class.foreach(TraversableForwarder.scala:35)
        at scala.collection.mutable.ListBuffer.foreach(ListBuffer.scala:45)
        at scala.collection.TraversableLike$class.map(TraversableLike.scala:234)
        at scala.collection.AbstractTraversable.map(Traversable.scala:104)
        at org.apache.spark.util.JsonProtocol$.taskInfoToJson(JsonProtocol.scala:283)
        at org.apache.spark.util.JsonProtocol$.taskEndToJson(JsonProtocol.scala:145)
        at org.apache.spark.util.JsonProtocol$.sparkEventToJson(JsonProtocol.scala:76)
`

Author: Ergin Seyfe <eseyfe@fb.com>

Closes #15371 from seyfe/race_cond_jsonprotocal.
2016-10-10 20:41:31 -07:00
Dhruve Ashar 4bafacaa5f [SPARK-17417][CORE] Fix # of partitions for Reliable RDD checkpointing
## What changes were proposed in this pull request?
Currently the no. of partition files are limited to 10000 files (%05d format). If there are more than 10000 part files, the logic goes for a toss while recreating the RDD as it sorts them by string. More details can be found in the JIRA desc [here](https://issues.apache.org/jira/browse/SPARK-17417).

## How was this patch tested?
I tested this patch by checkpointing a RDD and then manually renaming part files to the old format and tried to access the RDD. It was successfully created from the old format. Also verified loading a sample parquet file and saving it as multiple formats - CSV, JSON, Text, Parquet, ORC and read them successfully back from the saved files. I couldn't launch the unit test from my local box, so will wait for the Jenkins output.

Author: Dhruve Ashar <dhruveashar@gmail.com>

Closes #15370 from dhruve/bug/SPARK-17417.
2016-10-10 10:55:57 -05:00
Wenchen Fan 23ddff4b2b [SPARK-17338][SQL] add global temp view
## What changes were proposed in this pull request?

Global temporary view is a cross-session temporary view, which means it's shared among all sessions. Its lifetime is the lifetime of the Spark application, i.e. it will be automatically dropped when the application terminates. It's tied to a system preserved database `global_temp`(configurable via SparkConf), and we must use the qualified name to refer a global temp view, e.g. SELECT * FROM global_temp.view1.

changes for `SessionCatalog`:

1. add a new field `gloabalTempViews: GlobalTempViewManager`, to access the shared global temp views, and the global temp db name.
2. `createDatabase` will fail if users wanna create `global_temp`, which is system preserved.
3. `setCurrentDatabase` will fail if users wanna set `global_temp`, which is system preserved.
4. add `createGlobalTempView`, which is used in `CreateViewCommand` to create global temp views.
5. add `dropGlobalTempView`, which is used in `CatalogImpl` to drop global temp view.
6. add `alterTempViewDefinition`, which is used in `AlterViewAsCommand` to update the view definition for local/global temp views.
7. `renameTable`/`dropTable`/`isTemporaryTable`/`lookupRelation`/`getTempViewOrPermanentTableMetadata`/`refreshTable` will handle global temp views.

changes for SQL commands:

1. `CreateViewCommand`/`AlterViewAsCommand` is updated to support global temp views
2. `ShowTablesCommand` outputs a new column `database`, which is used to distinguish global and local temp views.
3. other commands can also handle global temp views if they call `SessionCatalog` APIs which accepts global temp views, e.g. `DropTableCommand`, `AlterTableRenameCommand`, `ShowColumnsCommand`, etc.

changes for other public API

1. add a new method `dropGlobalTempView` in `Catalog`
2. `Catalog.findTable` can find global temp view
3. add a new method `createGlobalTempView` in `Dataset`

## How was this patch tested?

new tests in `SQLViewSuite`

Author: Wenchen Fan <wenchen@databricks.com>

Closes #14897 from cloud-fan/global-temp-view.
2016-10-10 15:48:57 +08:00
Weiqing Yang 8a6bbe095b
[MINOR][SQL] Use resource path for test_script.sh
## What changes were proposed in this pull request?
This PR modified the test case `test("script")` to use resource path for `test_script.sh`. Make the test case portable (even in IntelliJ).

## How was this patch tested?
Passed the test case.
Before:
Run `test("script")` in IntelliJ:
```
Caused by: org.apache.spark.SparkException: Subprocess exited with status 127. Error: bash: src/test/resources/test_script.sh: No such file or directory
```
After:
Test passed.

Author: Weiqing Yang <yangweiqing001@gmail.com>

Closes #15246 from weiqingy/hivetest.
2016-10-08 12:12:35 +01:00
Sean Owen 4201ddcc07
[SPARK-17768][CORE] Small (Sum,Count,Mean)Evaluator problems and suboptimalities
## What changes were proposed in this pull request?

Fix:

- GroupedMeanEvaluator and GroupedSumEvaluator are unused, as is the StudentTCacher support class
- CountEvaluator can return a lower bound < 0, when counts can't be negative
- MeanEvaluator will actually fail on exactly 1 datum (yields t-test with 0 DOF)
- CountEvaluator uses a normal distribution, which may be an inappropriate approximation (leading to above)
- Test for SumEvaluator asserts incorrect expected sums – e.g. after observing 10% of data has sum of 2, expectation should be 20, not 38
- CountEvaluator, MeanEvaluator have no unit tests to catch these
- Duplication of distribution code across CountEvaluator, GroupedCountEvaluator
- The stats in each could use a bit of documentation as I had to guess at them
- (Code could use a few cleanups and optimizations too)

## How was this patch tested?

Existing and new tests

Author: Sean Owen <sowen@cloudera.com>

Closes #15341 from srowen/SPARK-17768.
2016-10-08 11:31:12 +01:00
Alex Bozarth 362ba4b6f8
[SPARK-17793][WEB UI] Sorting on the description on the Job or Stage page doesn’t always work
## What changes were proposed in this pull request?

Added secondary sorting on stage name for the description column. This provide a clearer behavior in the common case where the Description column only comprises of Stage names instead of the option description value.

## How was this patch tested?

manual testing and dev/run-tests

Screenshots of sorting on both description and stage name as well as an example of both:
![screen shot 2016-10-04 at 1 09 39 pm](https://cloud.githubusercontent.com/assets/13952758/19135523/067b042e-8b1a-11e6-912e-e6371d006d21.png)
![screen shot 2016-10-04 at 1 09 51 pm](https://cloud.githubusercontent.com/assets/13952758/19135526/06960936-8b1a-11e6-85e9-8aaf694c5f7b.png)
![screen shot 2016-10-05 at 1 14 45 pm](https://cloud.githubusercontent.com/assets/13952758/19135525/069547da-8b1a-11e6-8692-6524c75c4c07.png)
![screen shot 2016-10-05 at 1 14 51 pm](https://cloud.githubusercontent.com/assets/13952758/19135524/0694b4d2-8b1a-11e6-92dc-c8aa514e4f62.png)
![screen shot 2016-10-05 at 4 42 52 pm](https://cloud.githubusercontent.com/assets/13952758/19135618/e232eafe-8b1a-11e6-88b3-ff0bbb26b7f8.png)

Author: Alex Bozarth <ajbozart@us.ibm.com>

Closes #15366 from ajbozarth/spark17793.
2016-10-08 11:24:00 +01:00
Sean Owen cff5607552 [SPARK-17707][WEBUI] Web UI prevents spark-submit application to be finished
## What changes were proposed in this pull request?

This expands calls to Jetty's simple `ServerConnector` constructor to explicitly specify a `ScheduledExecutorScheduler` that makes daemon threads. It should otherwise result in exactly the same configuration, because the other args are copied from the constructor that is currently called.

(I'm not sure we should change the Hive Thriftserver impl, but I did anyway.)

This also adds `sc.stop()` to the quick start guide example.

## How was this patch tested?

Existing tests; _pending_ at least manual verification of the fix.

Author: Sean Owen <sowen@cloudera.com>

Closes #15381 from srowen/SPARK-17707.
2016-10-07 10:31:41 -07:00
Brian Cho e56614cba9 [SPARK-16827] Stop reporting spill metrics as shuffle metrics
## What changes were proposed in this pull request?

Fix a bug where spill metrics were being reported as shuffle metrics. Eventually these spill metrics should be reported (SPARK-3577), but separate from shuffle metrics. The fix itself basically reverts the line to what it was in 1.6.

## How was this patch tested?

Tested on a job that was reporting shuffle writes even for the final stage, when no shuffle writes should take place. After the change the job no longer shows these writes.

Before:
![screen shot 2016-10-03 at 6 39 59 pm](https://cloud.githubusercontent.com/assets/1514239/19085897/dbf59a92-8a20-11e6-9f68-a978860c0d74.png)

After:
<img width="1052" alt="screen shot 2016-10-03 at 11 44 44 pm" src="https://cloud.githubusercontent.com/assets/1514239/19085903/e173a860-8a20-11e6-85e3-d47f9835f494.png">

Author: Brian Cho <bcho@fb.com>

Closes #15347 from dafrista/shuffle-metrics.
2016-10-07 11:37:18 -04:00
Alex Bozarth 24097d8474
[SPARK-17795][WEB UI] Sorting on stage or job tables doesn’t reload page on that table
## What changes were proposed in this pull request?

Added anchor on table header id to sorting links on job and stage tables. This make the page reload after a sort load the page at the sorted table.

This only changes page load behavior so no UI changes

## How was this patch tested?

manually tested and dev/run-tests

Author: Alex Bozarth <ajbozart@us.ibm.com>

Closes #15369 from ajbozarth/spark17795.
2016-10-07 11:47:37 +01:00
Shixiong Zhu 9293734d35 [SPARK-17346][SQL] Add Kafka source for Structured Streaming
## What changes were proposed in this pull request?

This PR adds a new project ` external/kafka-0-10-sql` for Structured Streaming Kafka source.

It's based on the design doc: https://docs.google.com/document/d/19t2rWe51x7tq2e5AOfrsM9qb8_m7BRuv9fel9i0PqR8/edit?usp=sharing

tdas did most of work and part of them was inspired by koeninger's work.

### Introduction

The Kafka source is a structured streaming data source to poll data from Kafka. The schema of reading data is as follows:

Column | Type
---- | ----
key | binary
value | binary
topic | string
partition | int
offset | long
timestamp | long
timestampType | int

The source can deal with deleting topics. However, the user should make sure there is no Spark job processing the data when deleting a topic.

### Configuration

The user can use `DataStreamReader.option` to set the following configurations.

Kafka Source's options | value | default | meaning
------ | ------- | ------ | -----
startingOffset | ["earliest", "latest"] | "latest" | The start point when a query is started, either "earliest" which is from the earliest offset, or "latest" which is just from the latest offset. Note: This only applies when a new Streaming query is started, and that resuming will always pick up from where the query left off.
failOnDataLost | [true, false] | true | Whether to fail the query when it's possible that data is lost (e.g., topics are deleted, or offsets are out of range). This may be a false alarm. You can disable it when it doesn't work as you expected.
subscribe | A comma-separated list of topics | (none) | The topic list to subscribe. Only one of "subscribe" and "subscribeParttern" options can be specified for Kafka source.
subscribePattern | Java regex string | (none) | The pattern used to subscribe the topic. Only one of "subscribe" and "subscribeParttern" options can be specified for Kafka source.
kafka.consumer.poll.timeoutMs | long | 512 | The timeout in milliseconds to poll data from Kafka in executors
fetchOffset.numRetries | int | 3 | Number of times to retry before giving up fatch Kafka latest offsets.
fetchOffset.retryIntervalMs | long | 10 | milliseconds to wait before retrying to fetch Kafka offsets

Kafka's own configurations can be set via `DataStreamReader.option` with `kafka.` prefix, e.g, `stream.option("kafka.bootstrap.servers", "host:port")`

### Usage

* Subscribe to 1 topic
```Scala
spark
  .readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "host:port")
  .option("subscribe", "topic1")
  .load()
```

* Subscribe to multiple topics
```Scala
spark
  .readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "host:port")
  .option("subscribe", "topic1,topic2")
  .load()
```

* Subscribe to a pattern
```Scala
spark
  .readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "host:port")
  .option("subscribePattern", "topic.*")
  .load()
```

## How was this patch tested?

The new unit tests.

Author: Shixiong Zhu <shixiong@databricks.com>
Author: Tathagata Das <tathagata.das1565@gmail.com>
Author: Shixiong Zhu <zsxwing@gmail.com>
Author: cody koeninger <cody@koeninger.org>

Closes #15102 from zsxwing/kafka-source.
2016-10-05 16:45:45 -07:00
Shixiong Zhu 221b418b1c [SPARK-17778][TESTS] Mock SparkContext to reduce memory usage of BlockManagerSuite
## What changes were proposed in this pull request?

Mock SparkContext to reduce memory usage of BlockManagerSuite

## How was this patch tested?

Jenkins

Author: Shixiong Zhu <shixiong@databricks.com>

Closes #15350 from zsxwing/SPARK-17778.
2016-10-05 14:54:55 -07:00
sumansomasundar 7d51608835
[SPARK-16962][CORE][SQL] Fix misaligned record accesses for SPARC architectures
## What changes were proposed in this pull request?

Made changes to record length offsets to make them uniform throughout various areas of Spark core and unsafe

## How was this patch tested?

This change affects only SPARC architectures and was tested on X86 architectures as well for regression.

Author: sumansomasundar <suman.somasundar@oracle.com>

Closes #14762 from sumansomasundar/master.
2016-10-04 10:31:56 +01:00
Sean Owen 8e8de0073d
[SPARK-17671][WEBUI] Spark 2.0 history server summary page is slow even set spark.history.ui.maxApplications
## What changes were proposed in this pull request?

Return Iterator of applications internally in history server, for consistency and performance. See https://github.com/apache/spark/pull/15248 for some back-story.

The code called by and calling HistoryServer.getApplicationList wants an Iterator, but this method materializes an Iterable, which potentially causes a performance problem. It's simpler too to make this internal method also pass through an Iterator.

## How was this patch tested?

Existing tests.

Author: Sean Owen <sowen@cloudera.com>

Closes #15321 from srowen/SPARK-17671.
2016-10-04 10:29:22 +01:00
Tao LI 76dc2d9073 [SPARK-14914][CORE][SQL] Skip/fix some test cases on Windows due to limitation of Windows
## What changes were proposed in this pull request?

This PR proposes to fix/skip some tests failed on Windows. This PR takes over https://github.com/apache/spark/pull/12696.

**Before**

- **SparkSubmitSuite**

  ```
[info] - launch simple application with spark-submit *** FAILED *** (202 milliseconds)
[info]   java.io.IOException: Cannot run program "./bin/spark-submit" (in directory "C:\projects\spark"): CreateProcess error=2, The system cannot find the file specifie

[info] - includes jars passed in through --jars *** FAILED *** (1 second, 625 milliseconds)
[info]   java.io.IOException: Cannot run program "./bin/spark-submit" (in directory "C:\projects\spark"): CreateProcess error=2, The system cannot find the file specified
```

- **DiskStoreSuite**

  ```
[info] - reads of memory-mapped and non memory-mapped files are equivalent *** FAILED *** (1 second, 78 milliseconds)
[info]   diskStoreMapped.remove(blockId) was false (DiskStoreSuite.scala:41)
```

**After**

- **SparkSubmitSuite**

  ```
[info] - launch simple application with spark-submit (578 milliseconds)
[info] - includes jars passed in through --jars (1 second, 875 milliseconds)
```

- **DiskStoreSuite**

  ```
[info] DiskStoreSuite:
[info] - reads of memory-mapped and non memory-mapped files are equivalent !!! CANCELED !!! (766 milliseconds
```

For `CreateTableAsSelectSuite` and `FsHistoryProviderSuite`, I could not reproduce as the Java version seems higher than the one that has the bugs about `setReadable(..)` and `setWritable(...)` but as they are bugs reported clearly, it'd be sensible to skip those. We should revert the changes for both back as soon as we drop the support of Java 7.

## How was this patch tested?

Manually tested via AppVeyor.

Closes #12696

Author: Tao LI <tl@microsoft.com>
Author: U-FAREAST\tl <tl@microsoft.com>
Author: hyukjinkwon <gurwls223@gmail.com>

Closes #15320 from HyukjinKwon/SPARK-14914.
2016-10-02 16:01:02 -07:00
Eric Liang 4bcd9b728b [SPARK-17740] Spark tests should mock / interpose HDFS to ensure that streams are closed
## What changes were proposed in this pull request?

As a followup to SPARK-17666, ensure filesystem connections are not leaked at least in unit tests. This is done here by intercepting filesystem calls as suggested by JoshRosen . At the end of each test, we assert no filesystem streams are left open.

This applies to all tests using SharedSQLContext or SharedSparkContext.

## How was this patch tested?

I verified that tests in sql and core are indeed using the filesystem backend, and fixed the detected leaks. I also checked that reverting https://github.com/apache/spark/pull/15245 causes many actual test failures due to connection leaks.

Author: Eric Liang <ekl@databricks.com>
Author: Eric Liang <ekhliang@gmail.com>

Closes #15306 from ericl/sc-4672.
2016-09-30 23:51:36 -07:00
Shubham Chopra a26afd5219 [SPARK-15353][CORE] Making peer selection for block replication pluggable
## What changes were proposed in this pull request?

This PR makes block replication strategies pluggable. It provides two trait that can be implemented, one that maps a host to its topology and is used in the master, and the second that helps prioritize a list of peers for block replication and would run in the executors.

This patch contains default implementations of these traits that make sure current Spark behavior is unchanged.

## How was this patch tested?

This patch should not change Spark behavior in any way, and was tested with unit tests for storage.

Author: Shubham Chopra <schopra31@bloomberg.net>

Closes #13152 from shubhamchopra/RackAwareBlockReplication.
2016-09-30 18:24:39 -07:00
Imran Rashid 3993ebca23 [SPARK-17676][CORE] FsHistoryProvider should ignore hidden files
## What changes were proposed in this pull request?

FsHistoryProvider was writing a hidden file (to check the fs's clock).
Even though it deleted the file immediately, sometimes another thread
would try to scan the files on the fs in-between, and then there would
be an error msg logged which was very misleading for the end-user.
(The logged error was harmless, though.)

## How was this patch tested?

I added one unit test, but to be clear, that test was passing before.  The actual change in behavior in that test is just logging (after the change, there is no more logged error), which I just manually verified.

Author: Imran Rashid <irashid@cloudera.com>

Closes #15250 from squito/SPARK-17676.
2016-09-29 15:40:35 -07:00
Brian Cho 027dea8f29 [SPARK-17715][SCHEDULER] Make task launch logs DEBUG
## What changes were proposed in this pull request?

Ramp down the task launch logs from INFO to DEBUG. Task launches can happen orders of magnitude more than executor registration so it makes the logs easier to handle if they are different log levels. For larger jobs, there can be 100,000s of task launches which makes the driver log huge.

## How was this patch tested?

No tests, as this is a trivial change.

Author: Brian Cho <bcho@fb.com>

Closes #15290 from dafrista/ramp-down-task-logging.
2016-09-29 15:59:17 -04:00
Gang Wu cb87b3ced9 [SPARK-17672] Spark 2.0 history server web Ui takes too long for a single application
Added a new API getApplicationInfo(appId: String) in class ApplicationHistoryProvider and class SparkUI to get app info. In this change, FsHistoryProvider can directly fetch one app info in O(1) time complexity compared to O(n) before the change which used an Iterator.find() interface.

Both ApplicationCache and OneApplicationResource classes adopt this new api.

 manual tests

Author: Gang Wu <wgtmac@uber.com>

Closes #15247 from wgtmac/SPARK-17671.
2016-09-29 15:51:38 -04:00
Imran Rashid 7f779e7439 [SPARK-17648][CORE] TaskScheduler really needs offers to be an IndexedSeq
## What changes were proposed in this pull request?

The Seq[WorkerOffer] is accessed by index, so it really should be an
IndexedSeq, otherwise an O(n) operation becomes O(n^2).  In practice
this hasn't been an issue b/c where these offers are generated, the call
to `.toSeq` just happens to create an IndexedSeq anyway.I got bitten by
this in performance tests I was doing, and its better for the types to be
more precise so eg. a change in Scala doesn't destroy performance.

## How was this patch tested?

Unit tests via jenkins.

Author: Imran Rashid <irashid@cloudera.com>

Closes #15221 from squito/SPARK-17648.
2016-09-29 15:36:40 -04:00
Weiqing Yang 7dfad4b132 [SPARK-17710][HOTFIX] Fix ClassCircularityError in ReplSuite tests in Maven build: use 'Class.forName' instead of 'Utils.classForName'
## What changes were proposed in this pull request?
Fix ClassCircularityError in ReplSuite tests when Spark is built by Maven build.

## How was this patch tested?
(1)
```
build/mvn -DskipTests -Phadoop-2.3 -Pyarn -Phive -Phive-thriftserver -Pkinesis-asl -Pmesos clean package
```
Then test:
```
build/mvn -Dtest=none -DwildcardSuites=org.apache.spark.repl.ReplSuite test
```
ReplSuite tests passed

(2)
Manual Tests against some Spark applications in Yarn client mode and Yarn cluster mode. Need to check if spark caller contexts are written into HDFS hdfs-audit.log and Yarn RM audit log successfully.

Author: Weiqing Yang <yangweiqing001@gmail.com>

Closes #15286 from Sherry302/SPARK-16757.
2016-09-28 20:20:03 -05:00
w00228970 46d1203bf2 [SPARK-17644][CORE] Do not add failedStages when abortStage for fetch failure
## What changes were proposed in this pull request?
| Time        |Thread 1 ,  Job1          | Thread 2 ,  Job2  |
|:-------------:|:-------------:|:-----:|
| 1 | abort stage due to FetchFailed |  |
| 2 | failedStages += failedStage |    |
| 3 |      |  task failed due to  FetchFailed |
| 4 |      |  can not post ResubmitFailedStages because failedStages is not empty |

Then job2 of thread2 never resubmit the failed stage and hang.

We should not add the failedStages when abortStage for fetch failure

## How was this patch tested?

added unit test

Author: w00228970 <wangfei1@huawei.com>
Author: wangfei <wangfei_hello@126.com>

Closes #15213 from scwf/dag-resubmit.
2016-09-28 12:02:59 -07:00
Liang-Chi Hsieh e7bce9e187 [SPARK-17056][CORE] Fix a wrong assert regarding unroll memory in MemoryStore
## What changes were proposed in this pull request?

There is an assert in MemoryStore's putIteratorAsValues method which is used to check if unroll memory is not released too much. This assert looks wrong.

## How was this patch tested?

Jenkins tests.

Author: Liang-Chi Hsieh <simonh@tw.ibm.com>

Closes #14642 from viirya/fix-unroll-memory.
2016-09-27 16:00:39 -07:00
Weiqing Yang 6a68c5d7b4 [SPARK-16757] Set up Spark caller context to HDFS and YARN
## What changes were proposed in this pull request?

1. Pass `jobId` to Task.
2. Invoke Hadoop APIs.
    * A new function `setCallerContext` is added in `Utils`. `setCallerContext` function invokes APIs of   `org.apache.hadoop.ipc.CallerContext` to set up spark caller contexts, which will be written into `hdfs-audit.log` and Yarn RM audit log.
    * For HDFS: Spark sets up its caller context by invoking`org.apache.hadoop.ipc.CallerContext` in `Task` and Yarn `Client` and `ApplicationMaster`.
    * For Yarn: Spark sets up its caller context by invoking `org.apache.hadoop.ipc.CallerContext` in Yarn `Client`.

## How was this patch tested?
Manual Tests against some Spark applications in Yarn client mode and Yarn cluster mode. Need to check if spark caller contexts are written into HDFS hdfs-audit.log and Yarn RM audit log successfully.

For example, run SparkKmeans in Yarn client mode:
```
./bin/spark-submit --verbose --executor-cores 3 --num-executors 1 --master yarn --deploy-mode client --class org.apache.spark.examples.SparkKMeans examples/target/original-spark-examples_2.11-2.1.0-SNAPSHOT.jar hdfs://localhost:9000/lr_big.txt 2 5
```

**Before**:
There will be no Spark caller context in records of `hdfs-audit.log` and Yarn RM audit log.

**After**:
Spark caller contexts will be written in records of `hdfs-audit.log` and Yarn RM audit log.

These are records in `hdfs-audit.log`:
```
2016-09-20 11:54:24,116 INFO FSNamesystem.audit: allowed=true	ugi=wyang (auth:SIMPLE)	ip=/127.0.0.1	cmd=open	src=/lr_big.txt	dst=null	perm=null	proto=rpc	callerContext=SPARK_CLIENT_AppId_application_1474394339641_0005
2016-09-20 11:54:28,164 INFO FSNamesystem.audit: allowed=true	ugi=wyang (auth:SIMPLE)	ip=/127.0.0.1	cmd=open	src=/lr_big.txt	dst=null	perm=null	proto=rpc	callerContext=SPARK_TASK_AppId_application_1474394339641_0005_JobId_0_StageId_0_AttemptId_0_TaskId_2_AttemptNum_0
2016-09-20 11:54:28,164 INFO FSNamesystem.audit: allowed=true	ugi=wyang (auth:SIMPLE)	ip=/127.0.0.1	cmd=open	src=/lr_big.txt	dst=null	perm=null	proto=rpc	callerContext=SPARK_TASK_AppId_application_1474394339641_0005_JobId_0_StageId_0_AttemptId_0_TaskId_1_AttemptNum_0
2016-09-20 11:54:28,164 INFO FSNamesystem.audit: allowed=true	ugi=wyang (auth:SIMPLE)	ip=/127.0.0.1	cmd=open	src=/lr_big.txt	dst=null	perm=null	proto=rpc	callerContext=SPARK_TASK_AppId_application_1474394339641_0005_JobId_0_StageId_0_AttemptId_0_TaskId_0_AttemptNum_0
```
```
2016-09-20 11:59:33,868 INFO FSNamesystem.audit: allowed=true	ugi=wyang (auth:SIMPLE)	ip=/127.0.0.1	cmd=mkdirs	src=/private/tmp/hadoop-wyang/nm-local-dir/usercache/wyang/appcache/application_1474394339641_0006/container_1474394339641_0006_01_000001/spark-warehouse	dst=null	perm=wyang:supergroup:rwxr-xr-x	proto=rpc	callerContext=SPARK_APPLICATION_MASTER_AppId_application_1474394339641_0006_AttemptId_1
2016-09-20 11:59:37,214 INFO FSNamesystem.audit: allowed=true	ugi=wyang (auth:SIMPLE)	ip=/127.0.0.1	cmd=open	src=/lr_big.txt	dst=null	perm=null	proto=rpc	callerContext=SPARK_TASK_AppId_application_1474394339641_0006_AttemptId_1_JobId_0_StageId_0_AttemptId_0_TaskId_1_AttemptNum_0
2016-09-20 11:59:37,215 INFO FSNamesystem.audit: allowed=true	ugi=wyang (auth:SIMPLE)	ip=/127.0.0.1	cmd=open	src=/lr_big.txt	dst=null	perm=null	proto=rpc	callerContext=SPARK_TASK_AppId_application_1474394339641_0006_AttemptId_1_JobId_0_StageId_0_AttemptId_0_TaskId_2_AttemptNum_0
2016-09-20 11:59:37,215 INFO FSNamesystem.audit: allowed=true	ugi=wyang (auth:SIMPLE)	ip=/127.0.0.1	cmd=open	src=/lr_big.txt	dst=null	perm=null	proto=rpc	callerContext=SPARK_TASK_AppId_application_1474394339641_0006_AttemptId_1_JobId_0_StageId_0_AttemptId_0_TaskId_0_AttemptNum_0
2016-09-20 11:59:42,391 INFO FSNamesystem.audit: allowed=true	ugi=wyang (auth:SIMPLE)	ip=/127.0.0.1	cmd=open	src=/lr_big.txt	dst=null	perm=null	proto=rpc	callerContext=SPARK_TASK_AppId_application_1474394339641_0006_AttemptId_1_JobId_0_StageId_0_AttemptId_0_TaskId_3_AttemptNum_0
```
This is a record in Yarn RM log:
```
2016-09-20 11:59:24,050 INFO org.apache.hadoop.yarn.server.resourcemanager.RMAuditLogger: USER=wyang	IP=127.0.0.1	OPERATION=Submit Application Request	TARGET=ClientRMService	RESULT=SUCCESS	APPID=application_1474394339641_0006	CALLERCONTEXT=SPARK_CLIENT_AppId_application_1474394339641_0006
```

Author: Weiqing Yang <yangweiqing001@gmail.com>

Closes #14659 from Sherry302/callercontextSubmit.
2016-09-27 08:10:38 -05:00
Ding Fei 6ee28423ad Fix two comments since Actor is not used anymore.
## What changes were proposed in this pull request?

Fix two comments since Actor is not used anymore.

Author: Ding Fei <danis@danix>

Closes #15251 from danix800/comment-fixing.
2016-09-26 23:09:51 -07:00
Shixiong Zhu bde85f8b70 [SPARK-17649][CORE] Log how many Spark events got dropped in LiveListenerBus
## What changes were proposed in this pull request?

Log how many Spark events got dropped in LiveListenerBus so that the user can get insights on how to set a correct event queue size.

## How was this patch tested?

Jenkins

Author: Shixiong Zhu <shixiong@databricks.com>

Closes #15220 from zsxwing/SPARK-17649.
2016-09-26 10:44:35 -07:00
Burak Yavuz 59d87d2407 [SPARK-17650] malformed url's throw exceptions before bricking Executors
## What changes were proposed in this pull request?

When a malformed URL was sent to Executors through `sc.addJar` and `sc.addFile`, the executors become unusable, because they constantly throw `MalformedURLException`s and can never acknowledge that the file or jar is just bad input.

This PR tries to fix that problem by making sure MalformedURLs can never be submitted through `sc.addJar` and `sc.addFile`. Another solution would be to blacklist bad files and jars on Executors. Maybe fail the first time, and then ignore the second time (but print a warning message).

## How was this patch tested?

Unit tests in SparkContextSuite

Author: Burak Yavuz <brkyvz@gmail.com>

Closes #15224 from brkyvz/SPARK-17650.
2016-09-25 22:57:31 -07:00
jisookim 90a30f4634 [SPARK-12221] add cpu time to metrics
Currently task metrics don't support executor CPU time, so there's no way to calculate how much CPU time a stage/task took from History Server metrics. This PR enables reporting CPU time.

Author: jisookim <jisookim0513@gmail.com>

Closes #10212 from jisookim0513/add-cpu-time-metric.
2016-09-23 13:43:47 -07:00
Holden Karau 90d5754212
[SPARK-16861][PYSPARK][CORE] Refactor PySpark accumulator API on top of Accumulator V2
## What changes were proposed in this pull request?

Move the internals of the PySpark accumulator API from the old deprecated API on top of the new accumulator API.

## How was this patch tested?

The existing PySpark accumulator tests (both unit tests and doc tests at the start of accumulator.py).

Author: Holden Karau <holden@us.ibm.com>

Closes #14467 from holdenk/SPARK-16861-refactor-pyspark-accumulator-api.
2016-09-23 09:44:30 +01:00
Marcelo Vanzin a4aeb7677b [SPARK-17639][BUILD] Add jce.jar to buildclasspath when building.
This was missing, preventing code that uses javax.crypto to properly
compile in Spark.

Author: Marcelo Vanzin <vanzin@cloudera.com>

Closes #15204 from vanzin/SPARK-17639.
2016-09-22 21:35:25 -07:00
Dhruve Ashar 17b72d31e0 [SPARK-17365][CORE] Remove/Kill multiple executors together to reduce RPC call time.
## What changes were proposed in this pull request?
We are killing multiple executors together instead of iterating over expensive RPC calls to kill single executor.

## How was this patch tested?
Executed sample spark job to observe executors being killed/removed with dynamic allocation enabled.

Author: Dhruve Ashar <dashar@yahoo-inc.com>
Author: Dhruve Ashar <dhruveashar@gmail.com>

Closes #15152 from dhruve/impr/SPARK-17365.
2016-09-22 10:10:37 -07:00
Yanbo Liang c133907c5d [SPARK-17577][SPARKR][CORE] SparkR support add files to Spark job and get by executors
## What changes were proposed in this pull request?
Scala/Python users can add files to Spark job by submit options ```--files``` or ```SparkContext.addFile()```. Meanwhile, users can get the added file by ```SparkFiles.get(filename)```.
We should also support this function for SparkR users, since they also have the requirements for some shared dependency files. For example, SparkR users can download third party R packages to driver firstly, add these files to the Spark job as dependency by this API and then each executor can install these packages by ```install.packages```.

## How was this patch tested?
Add unit test.

Author: Yanbo Liang <ybliang8@gmail.com>

Closes #15131 from yanboliang/spark-17577.
2016-09-21 20:08:28 -07:00
jerryshao 8c3ee2bc42 [SPARK-17512][CORE] Avoid formatting to python path for yarn and mesos cluster mode
## What changes were proposed in this pull request?

Yarn and mesos cluster mode support remote python path (HDFS/S3 scheme) by their own mechanism, it is not necessary to check and format the python when running on these modes. This is a potential regression compared to 1.6, so here propose to fix it.

## How was this patch tested?

Unit test to verify SparkSubmit arguments, also with local cluster verification. Because of lack of `MiniDFSCluster` support in Spark unit test, there's no integration test added.

Author: jerryshao <sshao@hortonworks.com>

Closes #15137 from jerryshao/SPARK-17512.
2016-09-21 17:57:21 -04:00
Imran Rashid 9fcf1c51d5 [SPARK-17623][CORE] Clarify type of TaskEndReason with a failed task.
## What changes were proposed in this pull request?

In TaskResultGetter, enqueueFailedTask currently deserializes the result
as a TaskEndReason. But the type is actually more specific, its a
TaskFailedReason. This just leads to more blind casting later on – it
would be more clear if the msg was cast to the right type immediately,
so method parameter types could be tightened.

## How was this patch tested?

Existing unit tests via jenkins.  Note that the code was already performing a blind-cast to a TaskFailedReason before in any case, just in a different spot, so there shouldn't be any behavior change.

Author: Imran Rashid <irashid@cloudera.com>

Closes #15181 from squito/SPARK-17623.
2016-09-21 17:49:36 -04:00
Marcelo Vanzin 2cd1bfa4f0 [SPARK-4563][CORE] Allow driver to advertise a different network address.
The goal of this feature is to allow the Spark driver to run in an
isolated environment, such as a docker container, and be able to use
the host's port forwarding mechanism to be able to accept connections
from the outside world.

The change is restricted to the driver: there is no support for achieving
the same thing on executors (or the YARN AM for that matter). Those still
need full access to the outside world so that, for example, connections
can be made to an executor's block manager.

The core of the change is simple: add a new configuration that tells what's
the address the driver should bind to, which can be different than the address
it advertises to executors (spark.driver.host). Everything else is plumbing
the new configuration where it's needed.

To use the feature, the host starting the container needs to set up the
driver's port range to fall into a range that is being forwarded; this
required the block manager port to need a special configuration just for
the driver, which falls back to the existing spark.blockManager.port when
not set. This way, users can modify the driver settings without affecting
the executors; it would theoretically be nice to also have different
retry counts for driver and executors, but given that docker (at least)
allows forwarding port ranges, we can probably live without that for now.

Because of the nature of the feature it's kinda hard to add unit tests;
I just added a simple one to make sure the configuration works.

This was tested with a docker image running spark-shell with the following
command:

 docker blah blah blah \
   -p 38000-38100:38000-38100 \
   [image] \
   spark-shell \
     --num-executors 3 \
     --conf spark.shuffle.service.enabled=false \
     --conf spark.dynamicAllocation.enabled=false \
     --conf spark.driver.host=[host's address] \
     --conf spark.driver.port=38000 \
     --conf spark.driver.blockManager.port=38020 \
     --conf spark.ui.port=38040

Running on YARN; verified the driver works, executors start up and listen
on ephemeral ports (instead of using the driver's config), and that caching
and shuffling (without the shuffle service) works. Clicked through the UI
to make sure all pages (including executor thread dumps) worked. Also tested
apps without docker, and ran unit tests.

Author: Marcelo Vanzin <vanzin@cloudera.com>

Closes #15120 from vanzin/SPARK-4563.
2016-09-21 14:42:41 -07:00
erenavsarogullari dd7561d337
[CORE][MINOR] Add minor code change to TaskState and Task
## What changes were proposed in this pull request?
- TaskState and ExecutorState expose isFailed and isFinished functions. It can be useful to add test coverage for different states. Currently, Other enums do not expose any functions so this PR aims just these two enums.
- `private` access modifier is added for Finished Task States Set
- A minor doc change is added.

## How was this patch tested?
New Unit tests are added and run locally.

Author: erenavsarogullari <erenavsarogullari@gmail.com>

Closes #15143 from erenavsarogullari/SPARK-17584.
2016-09-21 14:47:18 +01:00
Yanbo Liang d3b8869763 [SPARK-17585][PYSPARK][CORE] PySpark SparkContext.addFile supports adding files recursively
## What changes were proposed in this pull request?
Users would like to add a directory as dependency in some cases, they can use ```SparkContext.addFile``` with argument ```recursive=true``` to recursively add all files under the directory by using Scala. But Python users can only add file not directory, we should also make it supported.

## How was this patch tested?
Unit test.

Author: Yanbo Liang <ybliang8@gmail.com>

Closes #15140 from yanboliang/spark-17585.
2016-09-21 01:37:03 -07:00
wm624@hotmail.com 61876a4279
[CORE][DOC] Fix errors in comments
## What changes were proposed in this pull request?
While reading source code of CORE and SQL core, I found some minor errors in comments such as extra space, missing blank line and grammar error.

I fixed these minor errors and might find more during my source code study.

## How was this patch tested?
Manually build

Author: wm624@hotmail.com <wm624@hotmail.com>

Closes #15151 from wangmiao1981/mem.
2016-09-21 09:33:29 +01:00
Weiqing Yang 1ea49916ac [MINOR][BUILD] Fix CheckStyle Error
## What changes were proposed in this pull request?
This PR is to fix the code style errors before 2.0.1 release.

## How was this patch tested?
Manual.

Before:
```
./dev/lint-java
Using `mvn` from path: /usr/local/bin/mvn
Checkstyle checks failed at following occurrences:
[ERROR] src/main/java/org/apache/spark/network/client/TransportClient.java:[153] (sizes) LineLength: Line is longer than 100 characters (found 107).
[ERROR] src/main/java/org/apache/spark/network/client/TransportClient.java:[196] (sizes) LineLength: Line is longer than 100 characters (found 108).
[ERROR] src/main/java/org/apache/spark/network/client/TransportClient.java:[239] (sizes) LineLength: Line is longer than 100 characters (found 115).
[ERROR] src/main/java/org/apache/spark/network/server/TransportRequestHandler.java:[119] (sizes) LineLength: Line is longer than 100 characters (found 107).
[ERROR] src/main/java/org/apache/spark/network/server/TransportRequestHandler.java:[129] (sizes) LineLength: Line is longer than 100 characters (found 104).
[ERROR] src/main/java/org/apache/spark/network/util/LevelDBProvider.java:[124,11] (modifier) ModifierOrder: 'static' modifier out of order with the JLS suggestions.
[ERROR] src/main/java/org/apache/spark/network/util/TransportConf.java:[26] (regexp) RegexpSingleline: No trailing whitespace allowed.
[ERROR] src/main/java/org/apache/spark/util/collection/unsafe/sort/PrefixComparators.java:[33] (sizes) LineLength: Line is longer than 100 characters (found 110).
[ERROR] src/main/java/org/apache/spark/util/collection/unsafe/sort/PrefixComparators.java:[38] (sizes) LineLength: Line is longer than 100 characters (found 110).
[ERROR] src/main/java/org/apache/spark/util/collection/unsafe/sort/PrefixComparators.java:[43] (sizes) LineLength: Line is longer than 100 characters (found 106).
[ERROR] src/main/java/org/apache/spark/util/collection/unsafe/sort/PrefixComparators.java:[48] (sizes) LineLength: Line is longer than 100 characters (found 110).
[ERROR] src/main/java/org/apache/spark/util/collection/unsafe/sort/UnsafeInMemorySorter.java:[0] (misc) NewlineAtEndOfFile: File does not end with a newline.
[ERROR] src/main/java/org/apache/spark/util/collection/unsafe/sort/UnsafeSorterSpillReader.java:[67] (sizes) LineLength: Line is longer than 100 characters (found 106).
[ERROR] src/main/java/org/apache/spark/network/yarn/YarnShuffleService.java:[200] (regexp) RegexpSingleline: No trailing whitespace allowed.
[ERROR] src/main/java/org/apache/spark/network/yarn/YarnShuffleService.java:[309] (regexp) RegexpSingleline: No trailing whitespace allowed.
[ERROR] src/main/java/org/apache/spark/network/yarn/YarnShuffleService.java:[332] (regexp) RegexpSingleline: No trailing whitespace allowed.
[ERROR] src/main/java/org/apache/spark/network/yarn/YarnShuffleService.java:[348] (regexp) RegexpSingleline: No trailing whitespace allowed.
 ```
After:
```
./dev/lint-java
Using `mvn` from path: /usr/local/bin/mvn
Checkstyle checks passed.
```

Author: Weiqing Yang <yangweiqing001@gmail.com>

Closes #15170 from Sherry302/fixjavastyle.
2016-09-20 21:48:25 -07:00
Shixiong Zhu 80d6655921 [SPARK-17438][WEBUI] Show Application.executorLimit in the application page
## What changes were proposed in this pull request?

This PR adds `Application.executorLimit` to the applicatino page

## How was this patch tested?

Checked the UI manually.

Screenshots:

1. Dynamic allocation is disabled

<img width="484" alt="screen shot 2016-09-07 at 4 21 49 pm" src="https://cloud.githubusercontent.com/assets/1000778/18332029/210056ea-7518-11e6-9f52-76d96046c1c0.png">

2. Dynamic allocation is enabled.

<img width="466" alt="screen shot 2016-09-07 at 4 25 30 pm" src="https://cloud.githubusercontent.com/assets/1000778/18332034/2c07700a-7518-11e6-8fce-aebe25014902.png">

Author: Shixiong Zhu <shixiong@databricks.com>

Closes #15001 from zsxwing/fix-core-info.
2016-09-19 14:00:42 -04:00
hyukjinkwon 7151011b38
[SPARK-17586][BUILD] Do not call static member via instance reference
## What changes were proposed in this pull request?

This PR fixes a warning message as below:

```
[WARNING] .../UnsafeInMemorySorter.java:284: warning: [static] static method should be qualified by type name, TaskMemoryManager, instead of by an expression
[WARNING]       currentPageNumber = memoryManager.decodePageNumber(recordPointer)
```

by referencing the static member via class not instance reference.

## How was this patch tested?

Existing tests should cover this - Jenkins tests.

Author: hyukjinkwon <gurwls223@gmail.com>

Closes #15141 from HyukjinKwon/SPARK-17586.
2016-09-18 19:18:49 +01:00
Josh Rosen 8faa5217b4 [SPARK-17491] Close serialization stream to fix wrong answer bug in putIteratorAsBytes()
## What changes were proposed in this pull request?

`MemoryStore.putIteratorAsBytes()` may silently lose values when used with `KryoSerializer` because it does not properly close the serialization stream before attempting to deserialize the already-serialized values, which may cause values buffered in Kryo's internal buffers to not be read.

This is the root cause behind a user-reported "wrong answer" bug in PySpark caching reported by bennoleslie on the Spark user mailing list in a thread titled "pyspark persist MEMORY_ONLY vs MEMORY_AND_DISK". Due to Spark 2.0's automatic use of KryoSerializer for "safe" types (such as byte arrays, primitives, etc.) this misuse of serializers manifested itself as silent data corruption rather than a StreamCorrupted error (which you might get from JavaSerializer).

The minimal fix, implemented here, is to close the serialization stream before attempting to deserialize written values. In addition, this patch adds several additional assertions / precondition checks to prevent misuse of `PartiallySerializedBlock` and `ChunkedByteBufferOutputStream`.

## How was this patch tested?

The original bug was masked by an invalid assert in the memory store test cases: the old assert compared two results record-by-record with `zip` but didn't first check that the lengths of the two collections were equal, causing missing records to go unnoticed. The updated test case reproduced this bug.

In addition, I added a new `PartiallySerializedBlockSuite` to unit test that component.

Author: Josh Rosen <joshrosen@databricks.com>

Closes #15043 from JoshRosen/partially-serialized-block-values-iterator-bugfix.
2016-09-17 11:46:15 -07:00
David Navas 9dbd4b864e
[SPARK-17529][CORE] Implement BitSet.clearUntil and use it during merge joins
## What changes were proposed in this pull request?

Add a clearUntil() method on BitSet (adapted from the pre-existing setUntil() method).
Use this method to clear the subset of the BitSet which needs to be used during merge joins.

## How was this patch tested?

dev/run-tests, as well as performance tests on skewed data as described in jira.

I expect there to be a small local performance hit using BitSet.clearUntil rather than BitSet.clear for normally shaped (unskewed) joins (additional read on the last long).  This is expected to be de-minimis and was not specifically tested.

Author: David Navas <davidn@clearstorydata.com>

Closes #15084 from davidnavas/bitSet.
2016-09-17 16:22:23 +01:00
Xin Ren f15d41be3c
[SPARK-17567][DOCS] Use valid url to Spark RDD paper
https://issues.apache.org/jira/browse/SPARK-17567

## What changes were proposed in this pull request?

Documentation (http://spark.apache.org/docs/latest/api/scala/#org.apache.spark.rdd.RDD) contains broken link to Spark paper (http://www.cs.berkeley.edu/~matei/papers/2012/nsdi_spark.pdf).

I found it elsewhere (https://www.usenix.org/system/files/conference/nsdi12/nsdi12-final138.pdf) and I hope it is the same one. It should be uploaded to and linked from some Apache controlled storage, so it won't break again.

## How was this patch tested?

Tested manually on local laptop.

Author: Xin Ren <iamshrek@126.com>

Closes #15121 from keypointt/SPARK-17567.
2016-09-17 12:30:25 +01:00
Josh Rosen 1202075c95 [SPARK-17484] Prevent invalid block locations from being reported after put() exceptions
## What changes were proposed in this pull request?

If a BlockManager `put()` call failed after the BlockManagerMaster was notified of a block's availability then incomplete cleanup logic in a `finally` block would never send a second block status method to inform the master of the block's unavailability. This, in turn, leads to fetch failures and used to be capable of causing complete job failures before #15037 was fixed.

This patch addresses this issue via multiple small changes:

- The `finally` block now calls `removeBlockInternal` when cleaning up from a failed `put()`; in addition to removing the `BlockInfo` entry (which was _all_ that the old cleanup logic did), this code (redundantly) tries to remove the block from the memory and disk stores (as an added layer of defense against bugs lower down in the stack) and optionally notifies the master of block removal (which now happens during exception-triggered cleanup).
- When a BlockManager receives a request for a block that it does not have it will now notify the master to update its block locations. This ensures that bad metadata pointing to non-existent blocks will eventually be fixed. Note that I could have implemented this logic in the block manager client (rather than in the remote server), but that would introduce the problem of distinguishing between transient and permanent failures; on the server, however, we know definitively that the block isn't present.
- Catch `NonFatal` instead of `Exception` to avoid swallowing `InterruptedException`s thrown from synchronous block replication calls.

This patch depends upon the refactorings in #15036, so that other patch will also have to be backported when backporting this fix.

For more background on this issue, including example logs from a real production failure, see [SPARK-17484](https://issues.apache.org/jira/browse/SPARK-17484).

## How was this patch tested?

Two new regression tests in BlockManagerSuite.

Author: Josh Rosen <joshrosen@databricks.com>

Closes #15085 from JoshRosen/SPARK-17484.
2016-09-15 11:54:17 -07:00
Josh Rosen 5b8f7377d5 [SPARK-17547] Ensure temp shuffle data file is cleaned up after error
SPARK-8029 (#9610) modified shuffle writers to first stage their data to a temporary file in the same directory as the final destination file and then to atomically rename this temporary file at the end of the write job. However, this change introduced the potential for the temporary output file to be leaked if an exception occurs during the write because the shuffle writers' existing error cleanup code doesn't handle deletion of the temp file.

This patch avoids this potential cause of disk-space leaks by adding `finally` blocks to ensure that temp files are always deleted if they haven't been renamed.

Author: Josh Rosen <joshrosen@databricks.com>

Closes #15104 from JoshRosen/cleanup-tmp-data-file-in-shuffle-writer.
2016-09-15 11:22:58 -07:00
Tejas Patil b479278142 [SPARK-17451][CORE] CoarseGrainedExecutorBackend should inform driver before self-kill
## What changes were proposed in this pull request?

Jira : https://issues.apache.org/jira/browse/SPARK-17451

`CoarseGrainedExecutorBackend` in some failure cases exits the JVM. While this does not have any issue, from the driver UI there is no specific reason captured for this. In this PR, I am adding functionality to `exitExecutor` to notify driver that the executor is exiting.

## How was this patch tested?

Ran the change over a test env and took down shuffle service before the executor could register to it. In the driver logs, where the job failure reason is mentioned (ie. `Job aborted due to stage ...` it gives the correct reason:

Before:
`ExecutorLostFailure (executor ZZZZZZZZZ exited caused by one of the running tasks) Reason: Remote RPC client disassociated. Likely due to containers exceeding thresholds, or network issues. Check driver logs for WARN messages.`

After:
`ExecutorLostFailure (executor ZZZZZZZZZ exited caused by one of the running tasks) Reason: Unable to create executor due to java.util.concurrent.TimeoutException: Timeout waiting for task.`

Author: Tejas Patil <tejasp@fb.com>

Closes #15013 from tejasapatil/SPARK-17451_inform_driver.
2016-09-15 10:23:41 -07:00
cenyuhai ad79fc0a84 [SPARK-17406][WEB UI] limit timeline executor events
## What changes were proposed in this pull request?
The job page will be too slow to open when there are thousands of executor events(added or removed). I found that in ExecutorsTab file, executorIdToData will not remove elements, it will increase all the time.Before this pr, it looks like [timeline1.png](https://issues.apache.org/jira/secure/attachment/12827112/timeline1.png). After this pr, it looks like [timeline2.png](https://issues.apache.org/jira/secure/attachment/12827113/timeline2.png)(we can set how many executor events will be displayed)

Author: cenyuhai <cenyuhai@didichuxing.com>

Closes #14969 from cenyuhai/SPARK-17406.
2016-09-15 09:58:53 +01:00
codlife 647ee05e58 [SPARK-17521] Error when I use sparkContext.makeRDD(Seq())
## What changes were proposed in this pull request?

 when i use sc.makeRDD below
```
val data3 = sc.makeRDD(Seq())
println(data3.partitions.length)
```
I got an error:
Exception in thread "main" java.lang.IllegalArgumentException: Positive number of slices required

We can fix this bug just modify the last line ,do a check of seq.size
```
  def makeRDD[T: ClassTag](seq: Seq[(T, Seq[String])]): RDD[T] = withScope {
    assertNotStopped()
    val indexToPrefs = seq.zipWithIndex.map(t => (t._2, t._1._2)).toMap
    new ParallelCollectionRDD[T](this, seq.map(_._1), math.max(seq.size, defaultParallelism), indexToPrefs)
  }
```

## How was this patch tested?

 manual tests

(If this patch involves UI changes, please attach a screenshot; otherwise, remove this)

Author: codlife <1004910847@qq.com>
Author: codlife <wangjianfei15@otcaix.iscas.ac.cn>

Closes #15077 from codlife/master.
2016-09-15 09:38:13 +01:00
Xing SHI bb32294362 [SPARK-17465][SPARK CORE] Inappropriate memory management in org.apache.spark.storage.MemoryStore may lead to memory leak
The expression like `if (memoryMap(taskAttemptId) == 0) memoryMap.remove(taskAttemptId)` in method `releaseUnrollMemoryForThisTask` and `releasePendingUnrollMemoryForThisTask` should be called after release memory operation, whatever `memoryToRelease` is > 0 or not.

If the memory of a task has been set to 0 when calling a `releaseUnrollMemoryForThisTask` or a `releasePendingUnrollMemoryForThisTask` method, the key in the memory map corresponding to that task will never be removed from the hash map.

See the details in [SPARK-17465](https://issues.apache.org/jira/browse/SPARK-17465).

Author: Xing SHI <shi-kou@indetail.co.jp>

Closes #15022 from saturday-shi/SPARK-17465.
2016-09-14 13:59:57 -07:00
Shixiong Zhu e33bfaed3b [SPARK-17463][CORE] Make CollectionAccumulator and SetAccumulator's value can be read thread-safely
## What changes were proposed in this pull request?

Make CollectionAccumulator and SetAccumulator's value can be read thread-safely to fix the ConcurrentModificationException reported in [JIRA](https://issues.apache.org/jira/browse/SPARK-17463).

## How was this patch tested?

Existing tests.

Author: Shixiong Zhu <shixiong@databricks.com>

Closes #15063 from zsxwing/SPARK-17463.
2016-09-14 13:33:51 -07:00
Xin Wu 040e46979d [SPARK-10747][SQL] Support NULLS FIRST|LAST clause in ORDER BY
## What changes were proposed in this pull request?
Currently, ORDER BY clause returns nulls value according to sorting order (ASC|DESC), considering null value is always smaller than non-null values.
However, SQL2003 standard support NULLS FIRST or NULLS LAST to allow users to specify whether null values should be returned first or last, regardless of sorting order (ASC|DESC).

This PR is to support this new feature.

## How was this patch tested?
New test cases are added to test NULLS FIRST|LAST for regular select queries and windowing queries.

(If this patch involves UI changes, please attach a screenshot; otherwise, remove this)

Author: Xin Wu <xinwu@us.ibm.com>

Closes #14842 from xwu0226/SPARK-10747.
2016-09-14 21:14:29 +02:00
wm624@hotmail.com 18b4f035f4 [CORE][DOC] remove redundant comment
## What changes were proposed in this pull request?
In the comment, there is redundant `the estimated`.

This PR simply remove the redundant comment and adjusts format.

Author: wm624@hotmail.com <wm624@hotmail.com>

Closes #15091 from wangmiao1981/comment.
2016-09-14 09:49:15 +01:00
Jagadeesan def7c265f5 [SPARK-17449][DOCUMENTATION] Relation between heartbeatInterval and…
## What changes were proposed in this pull request?

The relation between spark.network.timeout and spark.executor.heartbeatInterval should be mentioned in the document.

… network timeout]

Author: Jagadeesan <as2@us.ibm.com>

Closes #15042 from jagadeesanas2/SPARK-17449.
2016-09-14 09:03:16 +01:00
Josh Rosen f9c580f110 [SPARK-17485] Prevent failed remote reads of cached blocks from failing entire job
## What changes were proposed in this pull request?

In Spark's `RDD.getOrCompute` we first try to read a local copy of a cached RDD block, then a remote copy, and only fall back to recomputing the block if no cached copy (local or remote) can be read. This logic works correctly in the case where no remote copies of the block exist, but if there _are_ remote copies and reads of those copies fail (due to network issues or internal Spark bugs) then the BlockManager will throw a `BlockFetchException` that will fail the task (and which could possibly fail the whole job if the read failures keep occurring).

In the cases of TorrentBroadcast and task result fetching we really do want to fail the entire job in case no remote blocks can be fetched, but this logic is inappropriate for reads of cached RDD blocks because those can/should be recomputed in case cached blocks are unavailable.

Therefore, I think that the `BlockManager.getRemoteBytes()` method should never throw on remote fetch errors and, instead, should handle failures by returning `None`.

## How was this patch tested?

Block manager changes should be covered by modified tests in `BlockManagerSuite`: the old tests expected exceptions to be thrown on failed remote reads, while the modified tests now expect `None` to be returned from the `getRemote*` method.

I also manually inspected all usages of `BlockManager.getRemoteValues()`, `getRemoteBytes()`, and `get()` to verify that they correctly pattern-match on the result and handle `None`. Note that these `None` branches are already exercised because the old `getRemoteBytes` returned `None` when no remote locations for the block could be found (which could occur if an executor died and its block manager de-registered with the master).

Author: Josh Rosen <joshrosen@databricks.com>

Closes #15037 from JoshRosen/SPARK-17485.
2016-09-12 15:43:57 -07:00
Josh Rosen 3d40896f41 [SPARK-17483] Refactoring in BlockManager status reporting and block removal
This patch makes three minor refactorings to the BlockManager:

- Move the `if (info.tellMaster)` check out of `reportBlockStatus`; this fixes an issue where a debug logging message would incorrectly claim to have reported a block status to the master even though no message had been sent (in case `info.tellMaster == false`). This also makes it easier to write code which unconditionally sends block statuses to the master (which is necessary in another patch of mine).
- Split  `removeBlock()` into two methods, the existing method and an internal `removeBlockInternal()` method which is designed to be called by internal code that already holds a write lock on the block. This is also needed by a followup patch.
- Instead of calling `getCurrentBlockStatus()` in `removeBlock()`, just pass `BlockStatus.empty`; the block status should always be empty following complete removal of a block.

These changes were originally authored as part of a bug fix patch which is targeted at branch-2.0 and master; I've split them out here into their own separate PR in order to make them easier to review and so that the behavior-changing parts of my other patch can be isolated to their own PR.

Author: Josh Rosen <joshrosen@databricks.com>

Closes #15036 from JoshRosen/cache-failure-race-conditions-refactorings-only.
2016-09-12 13:09:33 -07:00
Sean Zhong 1742c3ab86 [SPARK-17503][CORE] Fix memory leak in Memory store when unable to cache the whole RDD in memory
## What changes were proposed in this pull request?

   MemoryStore may throw OutOfMemoryError when trying to cache a super big RDD that cannot fit in memory.
   ```
   scala> sc.parallelize(1 to 1000000000, 100).map(x => new Array[Long](1000)).cache().count()

   java.lang.OutOfMemoryError: Java heap space
	at $line14.$read$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$anonfun$1.apply(<console>:24)
	at $line14.$read$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$anonfun$1.apply(<console>:23)
	at scala.collection.Iterator$$anon$11.next(Iterator.scala:409)
	at scala.collection.Iterator$JoinIterator.next(Iterator.scala:232)
	at org.apache.spark.storage.memory.PartiallyUnrolledIterator.next(MemoryStore.scala:683)
	at org.apache.spark.InterruptibleIterator.next(InterruptibleIterator.scala:43)
	at org.apache.spark.util.Utils$.getIteratorSize(Utils.scala:1684)
	at org.apache.spark.rdd.RDD$$anonfun$count$1.apply(RDD.scala:1134)
	at org.apache.spark.rdd.RDD$$anonfun$count$1.apply(RDD.scala:1134)
	at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1915)
	at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1915)
	at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:70)
	at org.apache.spark.scheduler.Task.run(Task.scala:86)
	at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
	at java.lang.Thread.run(Thread.java:745)
   ```

Spark MemoryStore uses SizeTrackingVector as a temporary unrolling buffer to store all input values that it has read so far before transferring the values to storage memory cache. The problem is that when the input RDD is too big for caching in memory, the temporary unrolling memory SizeTrackingVector is not garbage collected in time. As SizeTrackingVector can occupy all available storage memory, it may cause the executor JVM to run out of memory quickly.

More info can be found at https://issues.apache.org/jira/browse/SPARK-17503

## How was this patch tested?

Unit test and manual test.

### Before change

Heap memory consumption
<img width="702" alt="screen shot 2016-09-12 at 4 16 15 pm" src="https://cloud.githubusercontent.com/assets/2595532/18429524/60d73a26-7906-11e6-9768-6f286f5c58c8.png">

Heap dump
<img width="1402" alt="screen shot 2016-09-12 at 4 34 19 pm" src="https://cloud.githubusercontent.com/assets/2595532/18429577/cbc1ef20-7906-11e6-847b-b5903f450b3b.png">

### After change

Heap memory consumption
<img width="706" alt="screen shot 2016-09-12 at 4 29 10 pm" src="https://cloud.githubusercontent.com/assets/2595532/18429503/4abe9342-7906-11e6-844a-b2f815072624.png">

Author: Sean Zhong <seanzhong@databricks.com>

Closes #15056 from clockfly/memory_store_leak.
2016-09-12 11:30:06 -07:00
WeichenXu 8087ecf8da [SPARK CORE][MINOR] fix "default partitioner cannot partition array keys" error message in PairRDDfunctions
## What changes were proposed in this pull request?

In order to avoid confusing user,
error message in `PairRDDfunctions`
`Default partitioner cannot partition array keys.`
is updated,
the one in `partitionBy` is replaced with
`Specified partitioner cannot partition array keys.`
other is replaced with
`Specified or default partitioner cannot partition array keys.`

## How was this patch tested?

N/A

Author: WeichenXu <WeichenXu123@outlook.com>

Closes #15045 from WeichenXu123/fix_partitionBy_error_message.
2016-09-12 12:23:16 +01:00
codlife 4efcdb7fea [SPARK-17447] Performance improvement in Partitioner.defaultPartitioner without sortBy
## What changes were proposed in this pull request?

if there are many rdds in some situations,the sort will loss he performance servely,actually we needn't sort the rdds , we can just scan the rdds one time to gain the same goal.

## How was this patch tested?

manual tests

Author: codlife <1004910847@qq.com>

Closes #15039 from codlife/master.
2016-09-12 12:10:46 +01:00
cenyuhai cc87280fcd [SPARK-17171][WEB UI] DAG will list all partitions in the graph
## What changes were proposed in this pull request?
DAG will list all partitions in the graph, it is too slow and hard to see all graph.
Always we don't want to see all partitions,we just want to see the relations of DAG graph.
So I just show 2 root nodes for Rdds.

Before this PR, the DAG graph looks like [dag1.png](https://issues.apache.org/jira/secure/attachment/12824702/dag1.png), [dag3.png](https://issues.apache.org/jira/secure/attachment/12825456/dag3.png), after this PR, the DAG graph looks like [dag2.png](https://issues.apache.org/jira/secure/attachment/12824703/dag2.png),[dag4.png](https://issues.apache.org/jira/secure/attachment/12825457/dag4.png)

Author: cenyuhai <cenyuhai@didichuxing.com>
Author: 岑玉海 <261810726@qq.com>

Closes #14737 from cenyuhai/SPARK-17171.
2016-09-12 11:52:56 +01:00
Josh Rosen 72eec70bdb [SPARK-17486] Remove unused TaskMetricsUIData.updatedBlockStatuses field
The `TaskMetricsUIData.updatedBlockStatuses` field is assigned to but never read, increasing the memory consumption of the web UI. We should remove this field.

Author: Josh Rosen <joshrosen@databricks.com>

Closes #15038 from JoshRosen/remove-updated-block-statuses-from-TaskMetricsUIData.
2016-09-11 21:51:22 -07:00
Ryan Blue 6ea5055fa7 [SPARK-17396][CORE] Share the task support between UnionRDD instances.
## What changes were proposed in this pull request?

Share the ForkJoinTaskSupport between UnionRDD instances to avoid creating a huge number of threads if lots of RDDs are created at the same time.

## How was this patch tested?

This uses existing UnionRDD tests.

Author: Ryan Blue <blue@apache.org>

Closes #14985 from rdblue/SPARK-17396-use-shared-pool.
2016-09-10 10:18:53 +01:00
Joseph K. Bradley 65b814bf50 [SPARK-17456][CORE] Utility for parsing Spark versions
## What changes were proposed in this pull request?

This patch adds methods for extracting major and minor versions as Int types in Scala from a Spark version string.

Motivation: There are many hacks within Spark's codebase to identify and compare Spark versions. We should add a simple utility to standardize these code paths, especially since there have been mistakes made in the past. This will let us add unit tests as well.  Currently, I want this functionality to check Spark versions to provide backwards compatibility for ML model persistence.

## How was this patch tested?

Unit tests

Author: Joseph K. Bradley <joseph@databricks.com>

Closes #15017 from jkbradley/version-parsing.
2016-09-09 05:35:10 -07:00
Gurvinder Singh 92ce8d4849 [SPARK-15487][WEB UI] Spark Master UI to reverse proxy Application and Workers UI
## What changes were proposed in this pull request?

This pull request adds the functionality to enable accessing worker and application UI through master UI itself. Thus helps in accessing SparkUI when running spark cluster in closed networks e.g. Kubernetes. Cluster admin needs to expose only spark master UI and rest of the UIs can be in the private network, master UI will reverse proxy the connection request to corresponding resource. It adds the path for workers/application UIs as

WorkerUI: <http/https>://master-publicIP:<port>/target/workerID/
ApplicationUI: <http/https>://master-publicIP:<port>/target/appID/

This makes it easy for users to easily protect the Spark master cluster access by putting some reverse proxy e.g. https://github.com/bitly/oauth2_proxy

## How was this patch tested?

The functionality has been tested manually and there is a unit test too for testing access to worker UI with reverse proxy address.

pwendell bomeng BryanCutler can you please review it, thanks.

Author: Gurvinder Singh <gurvinder.singh@uninett.no>

Closes #13950 from gurvindersingh/rproxy.
2016-09-08 17:20:20 -07:00
Eric Liang 649fa4bf1d [SPARK-17370] Shuffle service files not invalidated when a slave is lost
## What changes were proposed in this pull request?

DAGScheduler invalidates shuffle files when an executor loss event occurs, but not when the external shuffle service is enabled. This is because when shuffle service is on, the shuffle file lifetime can exceed the executor lifetime.

However, it also doesn't invalidate shuffle files when the shuffle service itself is lost (due to whole slave loss). This can cause long hangs when slaves are lost since the file loss is not detected until a subsequent stage attempts to read the shuffle files.

The proposed fix is to also invalidate shuffle files when an executor is lost due to a `SlaveLost` event.

## How was this patch tested?

Unit tests, also verified on an actual cluster that slave loss invalidates shuffle files immediately as expected.

cc mateiz

Author: Eric Liang <ekl@databricks.com>

Closes #14931 from ericl/sc-4439.
2016-09-07 12:33:50 -07:00
hyukjinkwon 6b41195bca [SPARK-17339][SPARKR][CORE] Fix some R tests and use Path.toUri in SparkContext for Windows paths in SparkR
## What changes were proposed in this pull request?

This PR fixes the Windows path issues in several APIs. Please refer https://issues.apache.org/jira/browse/SPARK-17339 for more details.

## How was this patch tested?

Tests via AppVeyor CI - https://ci.appveyor.com/project/HyukjinKwon/spark/build/82-SPARK-17339-fix-r

Also, manually,

![2016-09-06 3 14 38](https://cloud.githubusercontent.com/assets/6477701/18263406/b93a98be-7444-11e6-9521-b28ee65a4771.png)

Author: hyukjinkwon <gurwls223@gmail.com>

Closes #14960 from HyukjinKwon/SPARK-17339.
2016-09-07 19:24:03 +09:00
Liwei Lin 3ce3a282c8 [SPARK-17359][SQL][MLLIB] Use ArrayBuffer.+=(A) instead of ArrayBuffer.append(A) in performance critical paths
## What changes were proposed in this pull request?

We should generally use `ArrayBuffer.+=(A)` rather than `ArrayBuffer.append(A)`, because `append(A)` would involve extra boxing / unboxing.

## How was this patch tested?

N/A

Author: Liwei Lin <lwlin7@gmail.com>

Closes #14914 from lw-lin/append_to_plus_eq_v2.
2016-09-07 10:04:00 +01:00
Eric Liang c07cbb3534 [SPARK-17371] Resubmitted shuffle outputs can get deleted by zombie map tasks
## What changes were proposed in this pull request?

It seems that old shuffle map tasks hanging around after a stage resubmit will delete intended shuffle output files on stop(), causing downstream stages to fail even after successful resubmit completion. This can happen easily if the prior map task is waiting for a network timeout when its stage is resubmitted.

This can cause unnecessary stage resubmits, sometimes multiple times as fetch fails cause a cascade of shuffle file invalidations, and confusing FetchFailure messages that report shuffle index files missing from the local disk.

Given that IndexShuffleBlockResolver commits data atomically, it seems unnecessary to ever delete committed task output: even in the rare case that a task is failed after it finishes committing shuffle output, it should be safe to retain that output.

## How was this patch tested?

Prior to the fix proposed in https://github.com/apache/spark/pull/14931, I was able to reproduce this behavior by killing slaves in the middle of a large shuffle. After this patch, stages were no longer resubmitted multiple times due to shuffle index loss.

cc JoshRosen vanzin

Author: Eric Liang <ekl@databricks.com>

Closes #14932 from ericl/dont-remove-committed-files.
2016-09-06 16:55:22 -07:00
Shixiong Zhu 175b434411 [SPARK-17316][CORE] Fix the 'ask' type parameter in 'removeExecutor'
## What changes were proposed in this pull request?

Fix the 'ask' type parameter in 'removeExecutor' to eliminate a lot of error logs `Cannot cast java.lang.Boolean to scala.runtime.Nothing$`

## How was this patch tested?

Jenkins

Author: Shixiong Zhu <shixiong@databricks.com>

Closes #14983 from zsxwing/SPARK-17316-3.
2016-09-06 16:49:06 -07:00
Josh Rosen 29cfab3f15 [SPARK-17110] Fix StreamCorruptionException in BlockManager.getRemoteValues()
## What changes were proposed in this pull request?

This patch fixes a `java.io.StreamCorruptedException` error affecting remote reads of cached values when certain data types are used. The problem stems from #11801 / SPARK-13990, a patch to have Spark automatically pick the "best" serializer when caching RDDs. If PySpark cached a PythonRDD, then this would be cached as an `RDD[Array[Byte]]` and the automatic serializer selection would pick KryoSerializer for replication and block transfer. However, the `getRemoteValues()` / `getRemoteBytes()` code path did not pass proper class tags in order to enable the same serializer to be used during deserialization, causing Java to be inappropriately used instead of Kryo, leading to the StreamCorruptedException.

We already fixed a similar bug in #14311, which dealt with similar issues in block replication. Prior to that patch, it seems that we had no tests to ensure that block replication actually succeeded. Similarly, prior to this bug fix patch it looks like we had no tests to perform remote reads of cached data, which is why this bug was able to remain latent for so long.

This patch addresses the bug by modifying `BlockManager`'s `get()` and  `getRemoteValues()` methods to accept ClassTags, allowing the proper class tag to be threaded in the `getOrElseUpdate` code path (which is used by `rdd.iterator`)

## How was this patch tested?

Extended the caching tests in `DistributedSuite` to exercise the `getRemoteValues` path, plus manual testing to verify that the PySpark bug reproduction in SPARK-17110 is fixed.

Author: Josh Rosen <joshrosen@databricks.com>

Closes #14952 from JoshRosen/SPARK-17110.
2016-09-06 15:07:28 -07:00
Shivansh e75c162e9e [SPARK-17308] Improved the spark core code by replacing all pattern match on boolean value by if/else block.
## What changes were proposed in this pull request?
Improved the code quality of spark by replacing all pattern match on boolean value by if/else block.

## How was this patch tested?

By running the tests

Author: Shivansh <shiv4nsh@gmail.com>

Closes #14873 from shiv4nsh/SPARK-17308.
2016-09-04 12:39:26 +01:00
wm624@hotmail.com e9b58e9ef8 [SPARK-16829][SPARKR] sparkR sc.setLogLevel doesn't work
(Please fill in changes proposed in this fix)

./bin/sparkR
Launching java with spark-submit command /Users/mwang/spark_ws_0904/bin/spark-submit "sparkr-shell" /var/folders/s_/83b0sgvj2kl2kwq4stvft_pm0000gn/T//RtmpQxJGiZ/backend_porte9474603ed1e
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel).

> sc.setLogLevel("INFO")
Error: could not find function "sc.setLogLevel"

sc.setLogLevel doesn't exist.

R has a function setLogLevel.

I rename the setLogLevel function to sc.setLogLevel.

(Please explain how this patch was tested. E.g. unit tests, integration tests, manual tests)
Change unit test. Run unit tests.
Manually tested it in sparkR shell.

Author: wm624@hotmail.com <wm624@hotmail.com>

Closes #14433 from wangmiao1981/sc.
2016-09-03 13:56:20 -07:00
Robert Kruszewski 806d8a8e98 [SPARK-16984][SQL] don't try whole dataset immediately when first partition doesn't have…
## What changes were proposed in this pull request?

Try increase number of partitions to try so we don't revert to all.

## How was this patch tested?

Empirically. This is common case optimization.

Author: Robert Kruszewski <robertk@palantir.com>

Closes #14573 from robert3005/robertk/execute-take-backoff.
2016-09-02 17:14:43 +02:00
Kousuke Saruta 7ee24dac8e [SPARK-17352][WEBUI] Executor computing time can be negative-number because of calculation error
## What changes were proposed in this pull request?

In StagePage, executor-computing-time is calculated but calculation error can occur potentially because it's calculated by subtraction of floating numbers.

Following capture is an example.

<img width="949" alt="capture-timeline" src="https://cloud.githubusercontent.com/assets/4736016/18152359/43f07a28-7030-11e6-8cbd-8e73bf4c4c67.png">

## How was this patch tested?

Manual tests.

Author: Kousuke Saruta <sarutak@oss.nttdata.co.jp>

Closes #14908 from sarutak/SPARK-17352.
2016-09-02 10:26:43 +01:00
Kousuke Saruta 2ab8dbddaa [SPARK-17342][WEBUI] Style of event timeline is broken
## What changes were proposed in this pull request?

SPARK-15373 (#13158) updated the version of vis.js to 4.16.1. As of 4.0.0, some class was renamed like 'timeline to vis-timeline' but that ticket didn't care and now style is broken.

In this PR, I've restored the style by modifying `timeline-view.css` and `timeline-view.js`.

## How was this patch tested?

manual tests.

(If this patch involves UI changes, please attach a screenshot; otherwise, remove this)

* Before
<img width="1258" alt="2016-09-01 1 38 31" src="https://cloud.githubusercontent.com/assets/4736016/18141311/fddf1bac-6ff3-11e6-935f-28b389073b39.png">

* After
<img width="1256" alt="2016-09-01 3 30 19" src="https://cloud.githubusercontent.com/assets/4736016/18141394/49af65dc-6ff4-11e6-8640-70e20300f3c3.png">

Author: Kousuke Saruta <sarutak@oss.nttdata.co.jp>

Closes #14900 from sarutak/SPARK-17342.
2016-09-02 08:46:15 +01:00
Sean Owen 3893e8c576 [SPARK-17331][CORE][MLLIB] Avoid allocating 0-length arrays
## What changes were proposed in this pull request?

Avoid allocating some 0-length arrays, esp. in UTF8String, and by using Array.empty in Scala over Array[T]()

## How was this patch tested?

Jenkins

Author: Sean Owen <sowen@cloudera.com>

Closes #14895 from srowen/SPARK-17331.
2016-09-01 12:13:07 -07:00
Angus Gerry a0aac4b775 [SPARK-16533][CORE] resolve deadlocking in driver when executors die
## What changes were proposed in this pull request?
This pull request reverts the changes made as a part of #14605, which simply side-steps the deadlock issue. Instead, I propose the following approach:
* Use `scheduleWithFixedDelay` when calling `ExecutorAllocationManager.schedule` for scheduling executor requests. The intent of this is that if invocations are delayed beyond the default schedule interval on account of lock contention, then we avoid a situation where calls to `schedule` are made back-to-back, potentially releasing and then immediately reacquiring these locks - further exacerbating contention.
* Replace a number of calls to `askWithRetry` with `ask` inside of message handling code in `CoarseGrainedSchedulerBackend` and its ilk. This allows us queue messages with the relevant endpoints, release whatever locks we might be holding, and then block whilst awaiting the response. This change is made at the cost of being able to retry should sending the message fail, as retrying outside of the lock could easily cause race conditions if other conflicting messages have been sent whilst awaiting a response. I believe this to be the lesser of two evils, as in many cases these RPC calls are to process local components, and so failures are more likely to be deterministic, and timeouts are more likely to be caused by lock contention.

## How was this patch tested?
Existing tests, and manual tests under yarn-client mode.

Author: Angus Gerry <angolon@gmail.com>

Closes #14710 from angolon/SPARK-16533.
2016-09-01 10:35:31 -07:00
Sean Owen 5d84c7fd83 [SPARK-17332][CORE] Make Java Loggers static members
## What changes were proposed in this pull request?

Make all Java Loggers static members

## How was this patch tested?

Jenkins

Author: Sean Owen <sowen@cloudera.com>

Closes #14896 from srowen/SPARK-17332.
2016-08-31 11:09:14 -07:00
Shixiong Zhu 9bcb33c541 [SPARK-17316][CORE] Make CoarseGrainedSchedulerBackend.removeExecutor non-blocking
## What changes were proposed in this pull request?

StandaloneSchedulerBackend.executorRemoved is a blocking call right now. It may cause some deadlock since it's called inside StandaloneAppClient.ClientEndpoint.

This PR just changed CoarseGrainedSchedulerBackend.removeExecutor to be non-blocking. It's safe since the only two usages (StandaloneSchedulerBackend and YarnSchedulerEndpoint) don't need the return value).

## How was this patch tested?

Jenkins unit tests.

Author: Shixiong Zhu <shixiong@databricks.com>

Closes #14882 from zsxwing/SPARK-17316.
2016-08-31 10:56:02 -07:00
Alex Bozarth f7beae6da0 [SPARK-17243][WEB UI] Spark 2.0 History Server won't load with very large application history
## What changes were proposed in this pull request?

With the new History Server the summary page loads the application list via the the REST API, this makes it very slow to impossible to load with large (10K+) application history. This pr fixes this by adding the `spark.history.ui.maxApplications` conf to limit the number of applications the History Server displays. This is accomplished using a new optional `limit` param for the `applications` api. (Note this only applies to what the summary page displays, all the Application UI's are still accessible if the user knows the App ID and goes to the Application UI directly.)

I've also added a new test for the `limit` param in `HistoryServerSuite.scala`

## How was this patch tested?

Manual testing and dev/run-tests

Author: Alex Bozarth <ajbozart@us.ibm.com>

Closes #14835 from ajbozarth/spark17243.
2016-08-30 16:33:54 -05:00
Josh Rosen fb20084313 [SPARK-17304] Fix perf. issue caused by TaskSetManager.abortIfCompletelyBlacklisted
This patch addresses a minor scheduler performance issue that was introduced in #13603. If you run

```
sc.parallelize(1 to 100000, 100000).map(identity).count()
```

then most of the time ends up being spent in `TaskSetManager.abortIfCompletelyBlacklisted()`:

![image](https://cloud.githubusercontent.com/assets/50748/18071032/428732b0-6e07-11e6-88b2-c9423cd61f53.png)

When processing resource offers, the scheduler uses a nested loop which considers every task set at multiple locality levels:

```scala
   for (taskSet <- sortedTaskSets; maxLocality <- taskSet.myLocalityLevels) {
      do {
        launchedTask = resourceOfferSingleTaskSet(
            taskSet, maxLocality, shuffledOffers, availableCpus, tasks)
      } while (launchedTask)
    }
```

In order to prevent jobs with globally blacklisted tasks from hanging, #13603 added a `taskSet.abortIfCompletelyBlacklisted` call inside of  `resourceOfferSingleTaskSet`; if a call to `resourceOfferSingleTaskSet` fails to schedule any tasks, then `abortIfCompletelyBlacklisted` checks whether the tasks are completely blacklisted in order to figure out whether they will ever be schedulable. The problem with this placement of the call is that the last call to `resourceOfferSingleTaskSet` in the `while` loop will return `false`, implying that  `resourceOfferSingleTaskSet` will call `abortIfCompletelyBlacklisted`, so almost every call to `resourceOffers` will trigger the `abortIfCompletelyBlacklisted` check for every task set.

Instead, I think that this call should be moved out of the innermost loop and should be called _at most_ once per task set in case none of the task set's tasks can be scheduled at any locality level.

Before this patch's changes, the microbenchmark example that I posted above took 35 seconds to run, but it now only takes 15 seconds after this change.

/cc squito and kayousterhout for review.

Author: Josh Rosen <joshrosen@databricks.com>

Closes #14871 from JoshRosen/bail-early-if-no-cpus.
2016-08-30 13:15:21 -07:00
Ferdinand Xu 4b4e329e49 [SPARK-5682][CORE] Add encrypted shuffle in spark
This patch is using Apache Commons Crypto library to enable shuffle encryption support.

Author: Ferdinand Xu <cheng.a.xu@intel.com>
Author: kellyzly <kellyzly@126.com>

Closes #8880 from winningsix/SPARK-10771.
2016-08-30 09:15:31 -07:00
Xin Ren 27209252f0 [MINOR][MLLIB][SQL] Clean up unused variables and unused import
## What changes were proposed in this pull request?

Clean up unused variables and unused import statements, unnecessary `return` and `toArray`, and some more style improvement,  when I walk through the code examples.

## How was this patch tested?

Testet manually on local laptop.

Author: Xin Ren <iamshrek@126.com>

Closes #14836 from keypointt/codeWalkThroughML.
2016-08-30 11:24:55 +01:00
Xin Ren 2d76cb11f5 [SPARK-17276][CORE][TEST] Stop env params output on Jenkins job page
https://issues.apache.org/jira/browse/SPARK-17276

## What changes were proposed in this pull request?

When trying to find error msg in a failed Jenkins build job, I'm annoyed by the huge env output.
The env parameter output should be muted.

![screen shot 2016-08-26 at 10 52 07 pm](https://cloud.githubusercontent.com/assets/3925641/18025581/b8d567ba-6be2-11e6-9eeb-6aec223f1730.png)

## How was this patch tested?

Tested manually on local laptop.

Author: Xin Ren <iamshrek@126.com>

Closes #14848 from keypointt/SPARK-17276.
2016-08-30 11:18:29 +01:00
Robert Kruszewski 9fbced5b25 [SPARK-17216][UI] fix event timeline bars length
## What changes were proposed in this pull request?

Make event timeline bar expand to full length of the bar (which is total time)

This issue occurs only on chrome, firefox looks fine. Haven't tested other browsers.

## How was this patch tested?
Inspection in browsers

Before
![screen shot 2016-08-24 at 3 38 24 pm](https://cloud.githubusercontent.com/assets/512084/17935104/0d6cda74-6a12-11e6-9c66-e00cfa855606.png)

After
![screen shot 2016-08-24 at 3 36 39 pm](https://cloud.githubusercontent.com/assets/512084/17935114/15740ea4-6a12-11e6-83a1-7c06eef6abb8.png)

Author: Robert Kruszewski <robertk@palantir.com>

Closes #14791 from robert3005/robertk/event-timeline.
2016-08-27 08:47:15 +01:00
Yin Huai a6bca3ad02 [SPARK-17266][TEST] Add empty strings to the regressionTests of PrefixComparatorsSuite
## What changes were proposed in this pull request?
This PR adds a regression test to PrefixComparatorsSuite's "String prefix comparator" because this test failed on jenkins once (https://amplab.cs.berkeley.edu/jenkins/job/spark-master-test-sbt-hadoop-2.4/1620/testReport/junit/org.apache.spark.util.collection.unsafe.sort/PrefixComparatorsSuite/String_prefix_comparator/).

I could not reproduce it locally. But, let's this test case in the regressionTests.

Author: Yin Huai <yhuai@databricks.com>

Closes #14837 from yhuai/SPARK-17266.
2016-08-26 19:38:52 -07:00
Michael Gummelt 8e5475be3c [SPARK-16967] move mesos to module
## What changes were proposed in this pull request?

Move Mesos code into a mvn module

## How was this patch tested?

unit tests
manually submitting a client mode and cluster mode job
spark/mesos integration test suite

Author: Michael Gummelt <mgummelt@mesosphere.io>

Closes #14637 from mgummelt/mesos-module.
2016-08-26 12:25:22 -07:00
Marcelo Vanzin 9b5a1d1d53 [SPARK-17240][CORE] Make SparkConf serializable again.
Make the config reader transient, and initialize it lazily so that
serialization works with both java and kryo (and hopefully any other
custom serializer).

Added unit test to make sure SparkConf remains serializable and the
reader works with both built-in serializers.

Author: Marcelo Vanzin <vanzin@cloudera.com>

Closes #14813 from vanzin/SPARK-17240.
2016-08-25 16:11:42 -07:00
Sean Owen 2bcd5d5ce3 [SPARK-17193][CORE] HadoopRDD NPE at DEBUG log level when getLocationInfo == null
## What changes were proposed in this pull request?

Handle null from Hadoop getLocationInfo directly instead of catching (and logging) exception

## How was this patch tested?

Jenkins tests

Author: Sean Owen <sowen@cloudera.com>

Closes #14760 from srowen/SPARK-17193.
2016-08-25 09:45:49 +01:00
Alex Bozarth 891ac2b914 [SPARK-15083][WEB UI] History Server can OOM due to unlimited TaskUIData
## What changes were proposed in this pull request?

Based on #12990 by tankkyo

Since the History Server currently loads all application's data it can OOM if too many applications have a significant task count. `spark.ui.trimTasks` (default: false) can be set to true to trim tasks by `spark.ui.retainedTasks` (default: 10000)

(This is a "quick fix" to help those running into the problem until a update of how the history server loads app data can be done)

## How was this patch tested?

Manual testing and dev/run-tests

![spark-15083](https://cloud.githubusercontent.com/assets/13952758/17713694/fe82d246-63b0-11e6-9697-b87ea75ff4ef.png)

Author: Alex Bozarth <ajbozart@us.ibm.com>

Closes #14673 from ajbozarth/spark15083.
2016-08-24 14:39:41 -05:00
Sean Owen 0b3a4be92c [SPARK-16781][PYSPARK] java launched by PySpark as gateway may not be the same java used in the spark environment
## What changes were proposed in this pull request?

Update to py4j 0.10.3 to enable JAVA_HOME support

## How was this patch tested?

Pyspark tests

Author: Sean Owen <sowen@cloudera.com>

Closes #14748 from srowen/SPARK-16781.
2016-08-24 20:04:09 +01:00
Weiqing Yang 673a80d223 [MINOR][BUILD] Fix Java CheckStyle Error
## What changes were proposed in this pull request?
As Spark 2.0.1 will be released soon (mentioned in the spark dev mailing list), besides the critical bugs, it's better to fix the code style errors before the release.

Before:
```
./dev/lint-java
Checkstyle checks failed at following occurrences:
[ERROR] src/main/java/org/apache/spark/util/collection/unsafe/sort/UnsafeExternalSorter.java:[525] (sizes) LineLength: Line is longer than 100 characters (found 119).
[ERROR] src/main/java/org/apache/spark/examples/sql/streaming/JavaStructuredNetworkWordCount.java:[64] (sizes) LineLength: Line is longer than 100 characters (found 103).
```
After:
```
./dev/lint-java
Using `mvn` from path: /usr/local/bin/mvn
Checkstyle checks passed.
```
## How was this patch tested?
Manual.

Author: Weiqing Yang <yangweiqing001@gmail.com>

Closes #14768 from Sherry302/fixjavastyle.
2016-08-24 10:12:44 +01:00
Tejas Patil c1937dd19a [SPARK-16862] Configurable buffer size in UnsafeSorterSpillReader
## What changes were proposed in this pull request?

Jira: https://issues.apache.org/jira/browse/SPARK-16862

`BufferedInputStream` used in `UnsafeSorterSpillReader` uses the default 8k buffer to read data off disk. This PR makes it configurable to improve on disk reads. I have made the default value to be 1 MB as with that value I observed improved performance.

## How was this patch tested?

I am relying on the existing unit tests.

## Performance

After deploying this change to prod and setting the config to 1 mb, there was a 12% reduction in the CPU time and 19.5% reduction in CPU reservation time.

Author: Tejas Patil <tejasp@fb.com>

Closes #14726 from tejasapatil/spill_buffer_2.
2016-08-23 18:48:08 -07:00
Eric Liang 8e223ea67a [SPARK-16550][SPARK-17042][CORE] Certain classes fail to deserialize in block manager replication
## What changes were proposed in this pull request?

This is a straightforward clone of JoshRosen 's original patch. I have follow-up changes to fix block replication for repl-defined classes as well, but those appear to be flaking tests so I'm going to leave that for SPARK-17042

## How was this patch tested?

End-to-end test in ReplSuite (also more tests in DistributedSuite from the original patch).

Author: Eric Liang <ekl@databricks.com>

Closes #14311 from ericl/spark-16550.
2016-08-22 16:32:14 -07:00
wm624@hotmail.com e328f577e8 [SPARK-17002][CORE] Document that spark.ssl.protocol. is required for SSL
## What changes were proposed in this pull request?

`spark.ssl.enabled`=true, but failing to set `spark.ssl.protocol` will fail and throw meaningless exception. `spark.ssl.protocol` is required when `spark.ssl.enabled`.

Improvement: require `spark.ssl.protocol` when initializing SSLContext, otherwise throws an exception to indicate that.

Remove the OrElse("default").

Document this requirement in configure.md

## How was this patch tested?

(Please explain how this patch was tested. E.g. unit tests, integration tests, manual tests)

Manual tests:
Build document and check document

Configure `spark.ssl.enabled` only, it throws exception below:
6/08/16 16:04:37 INFO SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users  with view permissions: Set(mwang); groups with view permissions: Set(); users  with modify permissions: Set(mwang); groups with modify permissions: Set()
Exception in thread "main" java.lang.IllegalArgumentException: requirement failed: spark.ssl.protocol is required when enabling SSL connections.
	at scala.Predef$.require(Predef.scala:224)
	at org.apache.spark.SecurityManager.<init>(SecurityManager.scala:285)
	at org.apache.spark.deploy.master.Master$.startRpcEnvAndEndpoint(Master.scala:1026)
	at org.apache.spark.deploy.master.Master$.main(Master.scala:1011)
	at org.apache.spark.deploy.master.Master.main(Master.scala)

Configure `spark.ssl.protocol`  and `spark.ssl.protocol`
It works fine.

Author: wm624@hotmail.com <wm624@hotmail.com>

Closes #14674 from wangmiao1981/ssl.
2016-08-21 11:51:46 +01:00
Bryan Cutler 9f37d4eac2 [SPARK-12666][CORE] SparkSubmit packages fix for when 'default' conf doesn't exist in dependent module
## What changes were proposed in this pull request?

Adding a "(runtime)" to the dependency configuration will set a fallback configuration to be used if the requested one is not found.  E.g. with the setting "default(runtime)", Ivy will look for the conf "default" in the module ivy file and if not found will look for the conf "runtime".  This can help with the case when using "sbt publishLocal" which does not write a "default" conf in the published ivy.xml file.

## How was this patch tested?
used spark-submit with --packages option for a package published locally with no default conf, and a package resolved from Maven central.

Author: Bryan Cutler <cutlerb@gmail.com>

Closes #13428 from BryanCutler/fallback-package-conf-SPARK-12666.
2016-08-20 13:45:26 -07:00
Sital Kedia cf0cce9036 [SPARK-17113] [SHUFFLE] Job failure due to Executor OOM in offheap mode
## What changes were proposed in this pull request?

This PR fixes executor OOM in offheap mode due to bug in Cooperative Memory Management for UnsafeExternSorter.  UnsafeExternalSorter was checking if memory page is being used by upstream by comparing the base object address of the current page with the base object address of upstream. However, in case of offheap memory allocation, the base object addresses are always null, so there was no spilling happening and eventually the operator would OOM.

Following is the stack trace this issue addresses -
java.lang.OutOfMemoryError: Unable to acquire 1220 bytes of memory, got 0
	at org.apache.spark.memory.MemoryConsumer.allocatePage(MemoryConsumer.java:120)
	at org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter.acquireNewPageIfNecessary(UnsafeExternalSorter.java:341)
	at org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter.insertRecord(UnsafeExternalSorter.java:362)
	at org.apache.spark.sql.execution.UnsafeExternalRowSorter.insertRow(UnsafeExternalRowSorter.java:93)
	at org.apache.spark.sql.execution.UnsafeExternalRowSorter.sort(UnsafeExternalRowSorter.java:170)

## How was this patch tested?

Tested by running the failing job.

Author: Sital Kedia <skedia@fb.com>

Closes #14693 from sitalkedia/fix_offheap_oom.
2016-08-19 11:27:30 -07:00
Kousuke Saruta 071eaaf9d2 [SPARK-11227][CORE] UnknownHostException can be thrown when NameNode HA is enabled.
## What changes were proposed in this pull request?

If the following conditions are satisfied, executors don't load properties in `hdfs-site.xml` and UnknownHostException can be thrown.

(1) NameNode HA is enabled
(2) spark.eventLogging is disabled or logging path is NOT on HDFS
(3) Using Standalone or Mesos for the cluster manager
(4) There are no code to load `HdfsCondition` class in the driver regardless of directly or indirectly.
(5) The tasks access to HDFS

(There might be some more conditions...)

For example, following code causes UnknownHostException when the conditions above are satisfied.
```
sc.textFile("<path on HDFS>").collect

```

```
java.lang.IllegalArgumentException: java.net.UnknownHostException: hacluster
	at org.apache.hadoop.security.SecurityUtil.buildTokenService(SecurityUtil.java:378)
	at org.apache.hadoop.hdfs.NameNodeProxies.createNonHAProxy(NameNodeProxies.java:310)
	at org.apache.hadoop.hdfs.NameNodeProxies.createProxy(NameNodeProxies.java:176)
	at org.apache.hadoop.hdfs.DFSClient.<init>(DFSClient.java:678)
	at org.apache.hadoop.hdfs.DFSClient.<init>(DFSClient.java:619)
	at org.apache.hadoop.hdfs.DistributedFileSystem.initialize(DistributedFileSystem.java:149)
	at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:2653)
	at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:92)
	at org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:2687)
	at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:2669)
	at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:371)
	at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:170)
	at org.apache.hadoop.mapred.JobConf.getWorkingDirectory(JobConf.java:656)
	at org.apache.hadoop.mapred.FileInputFormat.setInputPaths(FileInputFormat.java:438)
	at org.apache.hadoop.mapred.FileInputFormat.setInputPaths(FileInputFormat.java:411)
	at org.apache.spark.SparkContext$$anonfun$hadoopFile$1$$anonfun$32.apply(SparkContext.scala:986)
	at org.apache.spark.SparkContext$$anonfun$hadoopFile$1$$anonfun$32.apply(SparkContext.scala:986)
	at org.apache.spark.rdd.HadoopRDD$$anonfun$getJobConf$6.apply(HadoopRDD.scala:177)
	at org.apache.spark.rdd.HadoopRDD$$anonfun$getJobConf$6.apply(HadoopRDD.scala:177)
	at scala.Option.map(Option.scala:146)
	at org.apache.spark.rdd.HadoopRDD.getJobConf(HadoopRDD.scala:177)
	at org.apache.spark.rdd.HadoopRDD$$anon$1.<init>(HadoopRDD.scala:213)
	at org.apache.spark.rdd.HadoopRDD.compute(HadoopRDD.scala:209)
	at org.apache.spark.rdd.HadoopRDD.compute(HadoopRDD.scala:102)
	at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:318)
	at org.apache.spark.rdd.RDD.iterator(RDD.scala:282)
	at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
	at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:318)
	at org.apache.spark.rdd.RDD.iterator(RDD.scala:282)
	at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:70)
	at org.apache.spark.scheduler.Task.run(Task.scala:85)
	at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
	at java.lang.Thread.run(Thread.java:745)
Caused by: java.net.UnknownHostException: hacluster
```

But following code doesn't cause the Exception because `textFile` method loads `HdfsConfiguration` indirectly.

```
sc.textFile("<path on HDFS>").collect
```

When a job includes some operations which access to HDFS, the object of `org.apache.hadoop.Configuration` is wrapped by `SerializableConfiguration`,  serialized and broadcasted from driver to executors and each executor deserialize the object with `loadDefaults` false so HDFS related properties should be set before broadcasted.

## How was this patch tested?
Tested manually on my standalone cluster.

Author: Kousuke Saruta <sarutak@oss.nttdata.co.jp>

Closes #13738 from sarutak/SPARK-11227.
2016-08-19 10:11:25 -05:00
Alex Bozarth e98eb2146f [SPARK-16673][WEB UI] New Executor Page removed conditional for Logs and Thread Dump columns
## What changes were proposed in this pull request?

When #13670 switched `ExecutorsPage` to use JQuery DataTables it incidentally removed the conditional for the Logs and Thread Dump columns. I reimplemented the conditional display of the Logs and Thread dump columns as it was before the switch.

## How was this patch tested?

Manually tested and dev/run-tests

![both](https://cloud.githubusercontent.com/assets/13952758/17186879/da8dd1a8-53eb-11e6-8b0c-d0ff0156a9a7.png)
![dump](https://cloud.githubusercontent.com/assets/13952758/17186881/dab08a04-53eb-11e6-8b1c-50ffd0bf2ae8.png)
![logs](https://cloud.githubusercontent.com/assets/13952758/17186880/dab04d00-53eb-11e6-8754-68dd64d6d9f4.png)

Author: Alex Bozarth <ajbozart@us.ibm.com>

Closes #14382 from ajbozarth/spark16673.
2016-08-19 10:04:20 -05:00
Nick Lavers 5377fc6236 [SPARK-16961][CORE] Fixed off-by-one error that biased randomizeInPlace
JIRA issue link:
https://issues.apache.org/jira/browse/SPARK-16961

Changed one line of Utils.randomizeInPlace to allow elements to stay in place.

Created a unit test that runs a Pearson's chi squared test to determine whether the output diverges significantly from a uniform distribution.

Author: Nick Lavers <nick.lavers@videoamp.com>

Closes #14551 from nicklavers/SPARK-16961-randomizeInPlace.
2016-08-19 10:11:59 +01:00
Steve Loughran cc97ea188e [SPARK-16736][CORE][SQL] purge superfluous fs calls
A review of the code, working back from Hadoop's `FileSystem.exists()` and `FileSystem.isDirectory()` code, then removing uses of the calls when superfluous.

1. delete is harmless if called on a nonexistent path, so don't do any checks before deletes
1. any `FileSystem.exists()`  check before `getFileStatus()` or `open()` is superfluous as the operation itself does the check. Instead the `FileNotFoundException` is caught and triggers the downgraded path. When a `FileNotFoundException` was thrown before, the code still creates a new FNFE with the error messages. Though now the inner exceptions are nested, for easier diagnostics.

Initially, relying on Jenkins test runs.

One troublespot here is that some of the codepaths are clearly error situations; it's not clear that they have coverage anyway. Trying to create the failure conditions in tests would be ideal, but it will also be hard.

Author: Steve Loughran <stevel@apache.org>

Closes #14371 from steveloughran/cloud/SPARK-16736-superfluous-fs-calls.
2016-08-17 11:43:01 -07:00
Marcelo Vanzin 5da6c4b24f [SPARK-16671][CORE][SQL] Consolidate code to do variable substitution.
Both core and sql have slightly different code that does variable substitution
of config values. This change refactors that code and encapsulates the logic
of reading config values and expading variables in a new helper class, which
can be configured so that both core and sql can use it without losing existing
functionality, and allows for easier testing and makes it easier to add more
features in the future.

Tested with existing and new unit tests, and by running spark-shell with
some configs referencing variables and making sure it behaved as expected.

Author: Marcelo Vanzin <vanzin@cloudera.com>

Closes #14468 from vanzin/SPARK-16671.
2016-08-15 11:09:54 -07:00
Stavros Kontopoulos 1a028bdefa [SPARK-11714][MESOS] Make Spark on Mesos honor port restrictions on coarse grain mode
- Make mesos coarse grained scheduler accept port offers and pre-assign ports

Previous attempt was for fine grained: https://github.com/apache/spark/pull/10808

Author: Stavros Kontopoulos <stavros.kontopoulos@lightbend.com>
Author: Stavros Kontopoulos <stavros.kontopoulos@typesafe.com>

Closes #11157 from skonto/honour_ports_coarse.
2016-08-15 09:55:32 +01:00
Zhenglai Zhang 2a3d286f34 [WIP][MINOR][TYPO] Fix several trivival typos
## What changes were proposed in this pull request?

* Fixed one typo `"overriden"` as `"overridden"`, also make sure no other same typo.
* Fixed one typo `"lowcase"` as `"lowercase"`, also make sure no other same typo.

## How was this patch tested?

Since the change is very tiny, so I just make sure compilation is successful.
I am new to the spark community,  please feel free to let me do other necessary steps.

Thanks in advance!

----
Updated: Found another typo `lowcase` later and fixed then in the same patch

Author: Zhenglai Zhang <zhenglaizhang@hotmail.com>

Closes #14622 from zhenglaizhang/fixtypo.
2016-08-14 16:10:34 +01:00
Xin Ren 7f7133bdcc [MINOR][CORE] fix warnings on depreciated methods in MesosClusterSchedulerSuite and DiskBlockObjectWriterSuite
## What changes were proposed in this pull request?

Fixed warnings below after scanning through warnings during build:

```
[warn] /home/jenkins/workspace/SparkPullRequestBuilder/core/src/test/scala/org/apache/spark/scheduler/cluster/mesos/MesosClusterSchedulerSuite.scala:34: imported `Utils' is permanently hidden by definition of object Utils in package mesos
[warn] import org.apache.spark.scheduler.cluster.mesos.Utils
[warn]                                                 ^
```

and
```
[warn] /home/jenkins/workspace/SparkPullRequestBuilder/core/src/test/scala/org/apache/spark/storage/DiskBlockObjectWriterSuite.scala:113: method shuffleBytesWritten in class ShuffleWriteMetrics is deprecated: use bytesWritten instead
[warn]     assert(writeMetrics.shuffleBytesWritten === file.length())
[warn]                         ^
[warn] /home/jenkins/workspace/SparkPullRequestBuilder/core/src/test/scala/org/apache/spark/storage/DiskBlockObjectWriterSuite.scala:119: method shuffleBytesWritten in class ShuffleWriteMetrics is deprecated: use bytesWritten instead
[warn]     assert(writeMetrics.shuffleBytesWritten === file.length())
[warn]                         ^
[warn] /home/jenkins/workspace/SparkPullRequestBuilder/core/src/test/scala/org/apache/spark/storage/DiskBlockObjectWriterSuite.scala:131: method shuffleBytesWritten in class ShuffleWriteMetrics is deprecated: use bytesWritten instead
[warn]     assert(writeMetrics.shuffleBytesWritten === file.length())
[warn]                         ^
[warn] /home/jenkins/workspace/SparkPullRequestBuilder/core/src/test/scala/org/apache/spark/storage/DiskBlockObjectWriterSuite.scala:135: method shuffleBytesWritten in class ShuffleWriteMetrics is deprecated: use bytesWritten instead
[warn]     assert(writeMetrics.shuffleBytesWritten === file.length())
[warn]                         ^
```

## How was this patch tested?

Tested manually on local laptop.

Author: Xin Ren <iamshrek@126.com>

Closes #14609 from keypointt/suiteWarnings.
2016-08-13 11:29:42 +01:00
hongshen 993923c8f5 [SPARK-16985] Change dataFormat from yyyyMMddHHmm to yyyyMMddHHmmss
## What changes were proposed in this pull request?

In our cluster, sometimes the sql output maybe overrided. When I submit some sql, all insert into the same table, and the sql will cost less one minute, here is the detail,
1 sql1, 11:03 insert into table.
2 sql2, 11:04:11 insert into table.
3 sql3, 11:04:48 insert into table.
4 sql4, 11:05 insert into table.
5 sql5, 11:06 insert into table.
The sql3's output file will override the sql2's output file. here is the log:
```
16/05/04 11:04:11 INFO hive.SparkHiveHadoopWriter: XXfinalPath=hdfs://tl-sng-gdt-nn-tdw.tencent-distribute.com:54310/tmp/assorz/tdw-tdwadmin/20160504/04559505496526517_-1_1204544348/10000/_tmp.p_20160428/attempt_201605041104_0001_m_000000_1

16/05/04 11:04:48 INFO hive.SparkHiveHadoopWriter: XXfinalPath=hdfs://tl-sng-gdt-nn-tdw.tencent-distribute.com:54310/tmp/assorz/tdw-tdwadmin/20160504/04559505496526517_-1_212180468/10000/_tmp.p_20160428/attempt_201605041104_0001_m_000000_1

```

The reason is the output file use SimpleDateFormat("yyyyMMddHHmm"), if two sql insert into the same table in the same minute, the output will be overrite. I think we should change dateFormat to "yyyyMMddHHmmss", in our cluster, we can't finished a sql in one second.

## How was this patch tested?

(Please explain how this patch was tested. E.g. unit tests, integration tests, manual tests)

(If this patch involves UI changes, please attach a screenshot; otherwise, remove this)

Author: hongshen <shenh062326@126.com>

Closes #14574 from shenh062326/SPARK-16985.
2016-08-12 09:58:02 +01:00
Jeff Zhang 7a9e25c383 [SPARK-13081][PYSPARK][SPARK_SUBMIT] Allow set pythonExec of driver and executor through conf…
Before this PR, user have to export environment variable to specify the python of driver & executor which is not so convenient for users. This PR is trying to allow user to specify python through configuration "--pyspark-driver-python" & "--pyspark-executor-python"

Manually test in local & yarn mode for pyspark-shell and pyspark batch mode.

Author: Jeff Zhang <zjffdu@apache.org>

Closes #13146 from zjffdu/SPARK-13081.
2016-08-11 20:08:39 -07:00
huangzhaowei 4ec5c360ce [SPARK-16868][WEB UI] Fix executor be both dead and alive on executor ui.
## What changes were proposed in this pull request?
In a heavy pressure of the spark application, since the executor will register it to driver block manager twice(because of heart beats), the executor will show as picture show:
![image](https://cloud.githubusercontent.com/assets/7404824/17467245/c1359094-5d4e-11e6-843a-f6d6347e1bf6.png)

## How was this patch tested?
NA

Details in: [SPARK-16868](https://issues.apache.org/jira/browse/SPARK-16868)

Author: huangzhaowei <carlmartinmax@gmail.com>

Closes #14530 from SaintBacchus/SPARK-16868.
2016-08-11 14:56:03 -07:00