Commit graph

5542 commits

Author SHA1 Message Date
Sean Owen e87741589a [SPARK-16193][TESTS] Address flaky ExternalAppendOnlyMapSuite spilling tests
## What changes were proposed in this pull request?

Make spill tests wait until job has completed before returning the number of stages that spilled

## How was this patch tested?

Existing Jenkins tests.

Author: Sean Owen <sowen@cloudera.com>

Closes #13896 from srowen/SPARK-16193.
2016-06-25 12:14:14 +01:00
Alex Bozarth 3ee9695d1f [SPARK-1301][WEB UI] Added anchor links to Accumulators and Tasks on StagePage
## What changes were proposed in this pull request?

Sometimes the "Aggregated Metrics by Executor" table on the Stage page can get very long, so anchor links to the Accumulators and Tasks tables below it have been added to the summary at the top of the page. This has been done in the same way as the Jobs and Stages pages. Note: the Accumulators link only displays when the table exists.

## How was this patch tested?

Manually Tested and dev/run-tests

![justtasks](https://cloud.githubusercontent.com/assets/13952758/15165269/6e8efe8c-16c9-11e6-9784-cffe966fdcf0.png)
![withaccumulators](https://cloud.githubusercontent.com/assets/13952758/15165270/7019ec9e-16c9-11e6-8649-db69ed7a317d.png)

Author: Alex Bozarth <ajbozart@us.ibm.com>

Closes #13037 from ajbozarth/spark1301.
2016-06-25 09:27:22 +01:00
Sital Kedia bf665a9586 [SPARK-15958] Make initial buffer size for the Sorter configurable
## What changes were proposed in this pull request?

Currently the initial buffer size in the sorter is hard-coded and is too small for large workloads. As a result, the sorter spends significant time expanding the buffer and copying the data. It would be useful to make it configurable.

## How was this patch tested?

Tested by running a job on the cluster.

Author: Sital Kedia <skedia@fb.com>

Closes #13699 from sitalkedia/config_sort_buffer_upstream.
2016-06-25 09:13:39 +01:00
Liwei Lin a4851ed050 [SPARK-15963][CORE] Catch TaskKilledException correctly in Executor.TaskRunner
## The problem

Before this change, if either of the following cases happened to a task, the task would be marked as `FAILED` instead of `KILLED`:
- the task was killed before it was deserialized
- `executor.kill()` marked `taskRunner.killed`, but before calling `task.killed()` the worker thread threw the `TaskKilledException`

The reason is, in the `catch` block of the current [Executor.TaskRunner](https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/executor/Executor.scala#L362)'s implementation, we are mistakenly catching:
```scala
case _: TaskKilledException | _: InterruptedException if task.killed => ...
```
the semantics of which is:
- **(**`TaskKilledException` **OR** `InterruptedException`**)** **AND** `task.killed`

Then when `TaskKilledException` is thrown but `task.killed` is not marked, we would mark the task as `FAILED` (which should really be `KILLED`).

## What changes were proposed in this pull request?

This patch alters the catch condition's semantics from:
- **(**`TaskKilledException` **OR** `InterruptedException`**)** **AND** `task.killed`

to

- `TaskKilledException` **OR** **(**`InterruptedException` **AND** `task.killed`**)**

so that we can catch `TaskKilledException` correctly and mark the task as `KILLED` correctly.
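
The difference in guard precedence can be reproduced outside Spark. The following is a minimal sketch, assuming a stand-in `TaskKilledException` class and a plain `killed` flag in place of `task.killed` (both hypothetical, not Spark's actual types):

```scala
// Stand-in for Spark's exception; hypothetical, for illustration only.
class TaskKilledException extends RuntimeException

object CatchSemantics {
  // Old condition: the guard applies to BOTH alternatives of the pattern.
  def oldState(e: Throwable, killed: Boolean): String = e match {
    case _: TaskKilledException | _: InterruptedException if killed => "KILLED"
    case _ => "FAILED"
  }

  // New condition: TaskKilledException always matches; the guard only
  // applies to InterruptedException.
  def newState(e: Throwable, killed: Boolean): String = e match {
    case _: TaskKilledException => "KILLED"
    case _: InterruptedException if killed => "KILLED"
    case _ => "FAILED"
  }
}
```

With `killed = false`, `oldState` reports a `TaskKilledException` as `FAILED`, while `newState` reports it as `KILLED`.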

## How was this patch tested?

Added unit test which failed before the change, ran new test 1000 times manually

Author: Liwei Lin <lwlin7@gmail.com>

Closes #13685 from lw-lin/fix-task-killed.
2016-06-24 10:09:04 -05:00
Sean Owen 158af162ea [SPARK-16129][CORE][SQL] Eliminate direct use of commons-lang classes in favor of commons-lang3
## What changes were proposed in this pull request?

Replace use of `commons-lang` with `commons-lang3` and forbid the former via scalastyle; remove `NotImplementedException` from `commons-lang` in favor of JDK `UnsupportedOperationException`

## How was this patch tested?

Jenkins tests

Author: Sean Owen <sowen@cloudera.com>

Closes #13843 from srowen/SPARK-16129.
2016-06-24 10:35:54 +01:00
peng.zhang f4fd7432fb [SPARK-16125][YARN] Fix not test yarn cluster mode correctly in YarnClusterSuite
## What changes were proposed in this pull request?

Since SPARK-13220 (Deprecate "yarn-client" and "yarn-cluster"), YarnClusterSuite doesn't test "yarn cluster" mode correctly.
This pull request fixes it.

## How was this patch tested?
Unit test


Author: peng.zhang <peng.zhang@xiaomi.com>

Closes #13836 from renozhang/SPARK-16125-test-yarn-cluster-mode.
2016-06-24 08:28:32 +01:00
Ryan Blue 738f134bf4 [SPARK-13723][YARN] Change behavior of --num-executors with dynamic allocation.
## What changes were proposed in this pull request?

This changes the behavior of --num-executors and spark.executor.instances when using dynamic allocation. Instead of turning dynamic allocation off, it uses the value for the initial number of executors.

This change was discussed on [SPARK-13723](https://issues.apache.org/jira/browse/SPARK-13723). I highly recommend making it while we can still change the behavior for 2.0.0. In practice, the 1.x behavior surprises users (it is not clear that it disables dynamic allocation) and wastes cluster resources because users rarely notice the log message.
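
As a rough illustration of the new behavior, the sketch below picks an initial executor count when dynamic allocation is on. The helper and its rule (take the largest of the relevant settings) are assumptions made for illustration; it is not the real `Utils.getDynamicAllocationInitialExecutors`, whose exact logic may differ.

```scala
object InitialExecutors {
  // Hypothetical helper, not Spark's implementation.
  // Assumes the largest of the configured values wins, so a value passed via
  // --num-executors seeds the initial count instead of disabling
  // dynamic allocation.
  def initial(minExecutors: Int, initialExecutors: Int, numExecutors: Option[Int]): Int =
    (Seq(minExecutors, initialExecutors) ++ numExecutors.toSeq).max
}
```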

## How was this patch tested?

This patch updates tests and adds a test for Utils.getDynamicAllocationInitialExecutors.

Author: Ryan Blue <blue@apache.org>

Closes #13338 from rdblue/SPARK-13723-num-executors-with-dynamic-allocation.
2016-06-23 14:03:46 -05:00
Dongjoon Hyun 5eef1e6c6a [SPARK-15660][CORE] Update RDD variance/stdev description and add popVariance/popStdev
## What changes were proposed in this pull request?

In SPARK-11490, `variance/stdev` are redefined as the **sample** `variance/stdev` instead of population ones. This PR updates the remaining outdated documentation to prevent users from misunderstanding. This will update the following Scala/Java API docs.

- http://spark.apache.org/docs/2.0.0-preview/api/scala/index.html#org.apache.spark.api.java.JavaDoubleRDD
- http://spark.apache.org/docs/2.0.0-preview/api/scala/index.html#org.apache.spark.rdd.DoubleRDDFunctions
- http://spark.apache.org/docs/2.0.0-preview/api/scala/index.html#org.apache.spark.util.StatCounter
- http://spark.apache.org/docs/2.0.0-preview/api/java/org/apache/spark/api/java/JavaDoubleRDD.html
- http://spark.apache.org/docs/2.0.0-preview/api/java/org/apache/spark/rdd/DoubleRDDFunctions.html
- http://spark.apache.org/docs/2.0.0-preview/api/java/org/apache/spark/util/StatCounter.html

Also, this PR explicitly adds the `popVariance` and `popStdev` functions.
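
For reference, the two definitions differ only in the divisor. A minimal plain-Scala sketch follows; it is independent of Spark's `StatCounter`, and the object and method names are illustrative:

```scala
object VarianceSketch {
  // Sum of squared deviations from the mean.
  private def sumSquaredDev(xs: Seq[Double]): Double = {
    val mean = xs.sum / xs.size
    xs.map(x => (x - mean) * (x - mean)).sum
  }

  // Population variance: divide by n.
  def popVariance(xs: Seq[Double]): Double = sumSquaredDev(xs) / xs.size

  // Sample variance: divide by n - 1 (Bessel's correction).
  def sampleVariance(xs: Seq[Double]): Double = sumSquaredDev(xs) / (xs.size - 1)
}
```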

## How was this patch tested?

Pass the updated Jenkins tests.

Author: Dongjoon Hyun <dongjoon@apache.org>

Closes #13403 from dongjoon-hyun/SPARK-15660.
2016-06-23 11:07:34 +01:00
Prajwal Tuladhar 044971eca0 [SPARK-16131] initialize internal logger lazily in Scala preferred way
## What changes were proposed in this pull request?

Initialize logger instance lazily in Scala preferred way
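
The idiom in question is presumably `lazy val`. A small sketch, using a string as a stand-in for a real logger (the names here are illustrative, not Spark's `Logging` trait):

```scala
object LazyLoggerSketch {
  // Counts how many times the stand-in logger is constructed.
  var created = 0

  class Component {
    // `lazy val` defers construction until first access and caches the result;
    // @transient keeps the field out of Java serialization.
    @transient private lazy val log: String = {
      created += 1
      "logger-for-" + getClass.getName
    }

    def logName: String = log
  }
}
```

Constructing a `Component` does not build the logger; the first call to `logName` does, and later calls reuse it.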

## How was this patch tested?

By running `./build/mvn clean test` locally

Author: Prajwal Tuladhar <praj@infynyxx.com>

Closes #13842 from infynyxx/spark_internal_logger.
2016-06-22 16:30:10 -07:00
Eric Liang 6f915c9ec2 [SPARK-16003] SerializationDebugger runs into infinite loop
## What changes were proposed in this pull request?

This fixes SerializationDebugger to not recurse forever when `writeReplace` returns an object of the same class, which is the case for at least the `SQLMetrics` class.

See also the OpenJDK unit tests on the behavior of recursive `writeReplace()`:
f4d80957e8/test/java/io/Serializable/nestedReplace/NestedReplace.java
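
The termination rule can be sketched without reflection. The example below assumes a hypothetical `Replaceable` trait standing in for classes that define `writeReplace()`; it is not the debugger's actual code:

```scala
object WriteReplaceSketch {
  // Hypothetical stand-in for objects that define writeReplace().
  trait Replaceable { def writeReplace(): AnyRef }

  // Follow replacements, but stop once the replacement has the same class
  // as the object it replaces; otherwise a self-replacing class loops forever.
  @annotation.tailrec
  def resolve(obj: AnyRef): AnyRef = obj match {
    case r: Replaceable =>
      val replaced = r.writeReplace()
      if (replaced.getClass == obj.getClass) replaced else resolve(replaced)
    case other => other
  }

  // Always returns a fresh object of its own class.
  class SelfReplacing(val n: Int) extends Replaceable {
    def writeReplace(): AnyRef = new SelfReplacing(n + 1)
  }
}
```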

cc davies cloud-fan

## How was this patch tested?

Unit tests for SerializationDebugger.

Author: Eric Liang <ekl@databricks.com>

Closes #13814 from ericl/spark-16003.
2016-06-22 12:12:34 -07:00
Imran Rashid cf1995a976 [SPARK-15783][CORE] Fix Flakiness in BlacklistIntegrationSuite
## What changes were proposed in this pull request?

Several changes here -- the first two were causing failures w/ BlacklistIntegrationSuite

1. The testing framework didn't include the reviveOffers thread, so the test which involved delay scheduling might never submit offers late enough for the delay scheduling to kick in.  So added in the periodic revive offers, just like the real scheduler.

2. `assertEmptyDataStructures` would occasionally fail, because it appeared there was still an active job.  This is because in DAGScheduler, the jobWaiter is notified of the job completion before the data structures are cleaned up.  Most of the time the test code that is waiting on the jobWaiter won't become active until after the data structures are cleared, but occasionally the race goes the other way, and the assertions fail.

3. `DAGSchedulerSuite` was not stopping all the inner parts it was setting up, so each test was leaking a number of threads.  So we stop those parts too.

4. Turns out that `assertMapOutputAvailable` is not terribly useful in this framework -- most of the places I was trying to use it suffer from some race.

5. When there is an exception in the backend, try to improve the error msg a little bit.  Before, the exception was printed to the console, but the test would fail w/ a timeout, and the logs wouldn't show anything.

## How was this patch tested?

I ran all the tests in `BlacklistIntegrationSuite` 5k times and everything in `DAGSchedulerSuite` 1k times on my laptop.  Also I ran a full jenkins build with `BlacklistIntegrationSuite` 500 times and `DAGSchedulerSuite` 50 times, see https://github.com/apache/spark/pull/13548.  (I tried more times but jenkins timed out.)

To check for more leaked threads, I added some code to dump the list of all threads at the end of each test in DAGSchedulerSuite, which is how I discovered the mapOutputTracker and eventLoop were leaking threads.  (I removed that code from the final pr, just part of the testing.)

And I'll run Jenkins on this a couple of times to do one more check.

Author: Imran Rashid <irashid@cloudera.com>

Closes #13565 from squito/blacklist_extra_tests.
2016-06-22 08:35:41 -05:00
Shixiong Zhu c399c7f0e4 [SPARK-16002][SQL] Sleep when no new data arrives to avoid 100% CPU usage
## What changes were proposed in this pull request?

Add a configuration to allow people to set a minimum polling delay when no new data arrives (default is 10ms). This PR also cleans up some INFO logs.
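
The idea is to sleep between empty polls instead of spinning. A minimal sketch, assuming a hypothetical `fetch` function and parameter name (only the 10 ms default comes from the description above):

```scala
object PollingSketch {
  // Polls `fetch` until it yields data, sleeping `pollingDelayMs` between
  // empty polls instead of busy-waiting. Returns the data and the poll count.
  def pollUntilData[T](fetch: () => Option[T], pollingDelayMs: Long = 10L): (T, Int) = {
    var polls = 0
    var result: Option[T] = None
    while (result.isEmpty) {
      polls += 1
      result = fetch()
      if (result.isEmpty) Thread.sleep(pollingDelayMs)
    }
    (result.get, polls)
  }
}
```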

## How was this patch tested?

Existing unit tests.

Author: Shixiong Zhu <shixiong@databricks.com>

Closes #13718 from zsxwing/SPARK-16002.
2016-06-21 12:42:49 -07:00
hyukjinkwon 4f7f1c4362 [SPARK-16044][SQL] input_file_name() returns empty strings in data sources based on NewHadoopRDD
## What changes were proposed in this pull request?

This PR makes the `input_file_name()` function return file paths instead of empty strings for external data sources based on `NewHadoopRDD`, such as [spark-redshift](cba5eee1ab/src/main/scala/com/databricks/spark/redshift/RedshiftRelation.scala (L149)) and [spark-xml](https://github.com/databricks/spark-xml/blob/master/src/main/scala/com/databricks/spark/xml/util/XmlFile.scala#L39-L47).

With such external data sources, the code below:

```scala
df.select(input_file_name).show()
```

will produce

- **Before**
  ```
+-----------------+
|input_file_name()|
+-----------------+
|                 |
+-----------------+
```

- **After**
  ```
+--------------------+
|   input_file_name()|
+--------------------+
|file:/private/var...|
+--------------------+
```

## How was this patch tested?

Unit tests in `ColumnExpressionSuite`.

Author: hyukjinkwon <gurwls223@gmail.com>

Closes #13759 from HyukjinKwon/SPARK-16044.
2016-06-20 21:55:34 -07:00
Shixiong Zhu 62d8fe2089 [SPARK-16017][CORE] Send hostname from CoarseGrainedExecutorBackend to driver
## What changes were proposed in this pull request?

[SPARK-15395](https://issues.apache.org/jira/browse/SPARK-15395) changed how the driver gets the executor host, so the driver now gets the executor IP address instead of the host name. This PR just sends the hostname from executors to the driver so that the driver can pass it to TaskScheduler.

## How was this patch tested?

Existing unit tests.

Author: Shixiong Zhu <shixiong@databricks.com>

Closes #13741 from zsxwing/SPARK-16017.
2016-06-17 15:48:17 -07:00
Kay Ousterhout c8809db5a5 [SPARK-15926] Improve readability of DAGScheduler stage creation methods
## What changes were proposed in this pull request?

This pull request refactors parts of the DAGScheduler to improve readability, focusing on the code around stage creation.  One goal of this change is to make it clearer which functions may create new stages (as opposed to looking up stages that already exist).  There are no functionality changes in this pull request.  In more detail:

* shuffleToMapStage was renamed to shuffleIdToMapStage (when reading the existing code I have sometimes struggled to remember what the key is -- is it a stage? A stage id? This change is intended to avoid that confusion)
* Cleaned up the code to create shuffle map stages.  Previously, creating a shuffle map stage involved 3 different functions (newOrUsedShuffleStage, newShuffleMapStage, and getShuffleMapStage), and it wasn't clear what the purpose of each function was.  With the new code, a single function (getOrCreateShuffleMapStage) is responsible for getting a stage (if it already exists) or creating new shuffle map stages and any missing ancestor stages, and it delegates to createShuffleMapStage when new stages need to be created.  There's some remaining confusion here because the getOrCreateParentStages call in createShuffleMapStage may recursively create ancestor stages; this is an issue I plan to fix in a future pull request, because it's trickier to fix and involves a slight functionality change.
* newResultStage was renamed to createResultStage, for consistency with naming around shuffle map stages.
* getParentStages has been renamed to getOrCreateParentStages, to make it clear that this function will sometimes create missing ancestor stages.
* The only *slight* functionality change is that on line 478, updateJobIdStageIdMaps now uses a stage's parents instance variable rather than re-calculating them (I couldn't see any reason why they'd need to be re-calculated, and suspect this is just leftover from older code).
* getAncestorShuffleDependencies was renamed to getMissingAncestorShuffleDependencies, to make it clear that this only returns dependencies that have not yet been run.
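
The get-or-create split described above can be sketched with toy types. This is a simplified illustration, not the DAGScheduler code: `Stage` and `Scheduler` are hypothetical, and ancestor-stage creation is left out.

```scala
import scala.collection.mutable

object StageLookupSketch {
  // Toy stand-in for a shuffle map stage; not Spark's ShuffleMapStage.
  final case class Stage(id: Int, shuffleId: Int)

  class Scheduler {
    // Keyed by shuffle id, which the renamed map makes explicit.
    val shuffleIdToMapStage = mutable.HashMap.empty[Int, Stage]
    private var nextStageId = 0

    // Only this method actually creates a stage.
    private def createShuffleMapStage(shuffleId: Int): Stage = {
      val stage = Stage(nextStageId, shuffleId)
      nextStageId += 1
      shuffleIdToMapStage(shuffleId) = stage
      stage
    }

    // Returns the existing stage if present, otherwise delegates to creation.
    def getOrCreateShuffleMapStage(shuffleId: Int): Stage =
      shuffleIdToMapStage.getOrElse(shuffleId, createShuffleMapStage(shuffleId))
  }
}
```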

cc squito markhamstra JoshRosen (who requested more DAG scheduler commenting long ago -- an issue this pull request tries, in part, to address)

FYI rxin

Author: Kay Ousterhout <kayousterhout@gmail.com>

Closes #13677 from kayousterhout/SPARK-15926.
2016-06-17 12:12:46 -07:00
Nezih Yigitbasi 63470afc99 [SPARK-15782][YARN] Fix spark.jars and spark.yarn.dist.jars handling
When `--packages` is specified with spark-shell the classes from those packages cannot be found, which I think is due to some of the changes in SPARK-12343.

Tested manually with both scala 2.10 and 2.11 repls.

vanzin davies can you guys please review?

Author: Marcelo Vanzin <vanzin@cloudera.com>
Author: Nezih Yigitbasi <nyigitbasi@netflix.com>

Closes #13709 from nezihyigitbasi/SPARK-15782.
2016-06-16 18:20:16 -07:00
Alex Bozarth e849285df0 [SPARK-15868][WEB UI] Executors table in Executors tab should sort Executor IDs in numerical order
## What changes were proposed in this pull request?

Currently the Executors table sorts by id using a string sort (since ids are stored as strings). Since the id is a number (other than the driver), we should be sorting numerically. I have changed both the initial sort on page load and the table sort to sort on id numerically, treating non-numeric strings (like the driver) as "-1".
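
The sort key can be sketched in a few lines of Scala. This illustrates the rule only; the object and method names are hypothetical and this is not the UI code itself.

```scala
import scala.util.Try

object ExecutorIdSort {
  // Non-numeric ids such as "driver" are treated as -1, so they sort first.
  def sortKey(id: String): Int = Try(id.toInt).getOrElse(-1)

  def sortIds(ids: Seq[String]): Seq[String] = ids.sortBy(sortKey)
}
```

A plain string sort would place "10" before "2"; sorting by the numeric key does not.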

## How was this patch tested?

Manually tested and dev/run-tests

![pageload](https://cloud.githubusercontent.com/assets/13952758/16027882/d32edd0a-318e-11e6-9faf-fc972b7c36ab.png)
![sorted](https://cloud.githubusercontent.com/assets/13952758/16027883/d34541c6-318e-11e6-9ed7-6bfc0cd4152e.png)

Author: Alex Bozarth <ajbozart@us.ibm.com>

Closes #13654 from ajbozarth/spark15868.
2016-06-16 14:29:11 -07:00
Sean Owen 457126e420 [SPARK-15796][CORE] Reduce spark.memory.fraction default to avoid overrunning old gen in JVM default config
## What changes were proposed in this pull request?

Reduce `spark.memory.fraction` default to 0.6 in order to make it fit within default JVM old generation size (2/3 heap). See JIRA discussion. This means a full cache doesn't spill into the new gen. CC andrewor14

## How was this patch tested?

Jenkins tests.

Author: Sean Owen <sowen@cloudera.com>

Closes #13618 from srowen/SPARK-15796.
2016-06-16 23:04:10 +02:00
Narine Kokhlikyan 7c6c692637 [SPARK-12922][SPARKR][WIP] Implement gapply() on DataFrame in SparkR
## What changes were proposed in this pull request?

gapply() applies an R function on groups grouped by one or more columns of a DataFrame, and returns a DataFrame. It is like GroupedDataSet.flatMapGroups() in the Dataset API.
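
The group-then-apply shape can be shown with plain Scala collections. The sketch below is an analogy only; `gapplyLike` is a hypothetical helper and does not use the SparkR or Dataset APIs.

```scala
object GapplySketch {
  // Groups rows by a key and applies `f` to each (key, group) pair,
  // concatenating the results, similar in spirit to flatMapGroups.
  def gapplyLike[K, R, O](rows: Seq[R])(key: R => K)(f: (K, Seq[R]) => Seq[O]): Seq[O] =
    rows.groupBy(key).toSeq.flatMap { case (k, group) => f(k, group) }
}
```

For example, computing an average per group amounts to passing a function that maps each group to a single `(key, mean)` row.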

Please let me know what you think and if you have any ideas to improve it.

Thank you!

## How was this patch tested?
Unit tests.
1. Primitive test with different column types
2. Add a boolean column
3. Compute average by a group

Author: Narine Kokhlikyan <narine.kokhlikyan@gmail.com>
Author: NarineK <narine.kokhlikyan@us.ibm.com>

Closes #12836 from NarineK/gapply2.
2016-06-15 21:42:05 -07:00
Davies Liu a153e41c08 Revert "[SPARK-15782][YARN] Set spark.jars system property in client mode"
This reverts commit 4df8df5c2e.
2016-06-15 15:55:07 -07:00
Imran Rashid cafc696d09 [HOTFIX][CORE] fix flaky BasicSchedulerIntegrationTest
## What changes were proposed in this pull request?

SPARK-15927 exacerbated a race in BasicSchedulerIntegrationTest, so it went from very unlikely to fairly frequent.  The issue is that stage numbering is not completely deterministic, but these tests treated it like it was.  So turn off the tests.

## How was this patch tested?

On my laptop the test failed about 10% of the time before this change, and didn't fail in 500 runs after the change.

Author: Imran Rashid <irashid@cloudera.com>

Closes #13688 from squito/hotfix_basic_scheduler.
2016-06-15 16:44:18 -05:00
Nezih Yigitbasi 4df8df5c2e [SPARK-15782][YARN] Set spark.jars system property in client mode
## What changes were proposed in this pull request?

When `--packages` is specified with `spark-shell` the classes from those packages cannot be found, which I think is due to some of the changes in `SPARK-12343`. In particular `SPARK-12343` removes a line that sets the `spark.jars` system property in client mode, which is used by the repl main class to set the classpath.

## How was this patch tested?

Tested manually.

This system property is used by the repl to populate its classpath. If
this is not set properly the classes for external packages cannot be
found.

tgravescs vanzin as you may be familiar with this part of the code.

Author: Nezih Yigitbasi <nyigitbasi@netflix.com>

Closes #13527 from nezihyigitbasi/repl-fix.
2016-06-15 14:07:36 -07:00
Tejas Patil 279bd4aa5f [SPARK-15826][CORE] PipedRDD to allow configurable char encoding
## What changes were proposed in this pull request?

Link to jira which describes the problem: https://issues.apache.org/jira/browse/SPARK-15826

The fix in this PR is to allow users to specify the encoding in the pipe() operation. For backward compatibility,
the default value remains the system default.
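
Why the encoding matters can be seen by decoding the same bytes two ways. A minimal sketch, assuming a hypothetical `decode` helper rather than the actual `pipe()` signature:

```scala
import java.nio.charset.Charset

object PipeEncodingSketch {
  // Decodes bytes written by a child process, using an explicit encoding
  // and falling back to the platform default when none is given.
  def decode(bytes: Array[Byte], encoding: Option[String] = None): String = {
    val charset = encoding.map(Charset.forName).getOrElse(Charset.defaultCharset())
    new String(bytes, charset)
  }
}
```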

## How was this patch tested?

Ran existing unit tests

Author: Tejas Patil <tejasp@fb.com>

Closes #13563 from tejasapatil/pipedrdd_utf8.
2016-06-15 12:03:00 -07:00
Liwei Lin 9b234b55d1 [SPARK-15518][CORE][FOLLOW-UP] Rename LocalSchedulerBackendEndpoint -> LocalSchedulerBackend
## What changes were proposed in this pull request?

This patch is a follow-up to https://github.com/apache/spark/pull/13288 completing the renaming:
 - LocalScheduler -> LocalSchedulerBackend~~Endpoint~~

## How was this patch tested?

Updated test cases to reflect the name change.

Author: Liwei Lin <lwlin7@gmail.com>

Closes #13683 from lw-lin/rename-backend.
2016-06-15 11:52:36 -07:00
Marcelo Vanzin 40eeef9525 [SPARK-15046][YARN] Parse value of token renewal interval correctly.
Use the config variable definition both to set and parse the value,
avoiding issues with code expecting the value in a different format.

Tested by running spark-submit with --principal / --keytab.

Author: Marcelo Vanzin <vanzin@cloudera.com>

Closes #13669 from vanzin/SPARK-15046.
2016-06-15 09:09:21 -05:00
Kay Ousterhout 5d50d4f0f9 [SPARK-15927] Eliminate redundant DAGScheduler code.
To try to eliminate redundant code to traverse the RDD dependency graph,
this PR creates a new function getShuffleDependencies that returns
shuffle dependencies that are immediate parents of a given RDD.  This
new function is used by getParentStages and
getAncestorShuffleDependencies.
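
A toy model of the traversal, for illustration only. The `Node`, `Narrow`, and `Shuffle` types below are hypothetical stand-ins for RDDs and their dependencies, and the real function may be implemented differently (for example, iteratively):

```scala
object ShuffleDepsSketch {
  // Toy model of an RDD lineage; these are not Spark's classes.
  sealed trait Dep { def parent: Node }
  final case class Narrow(parent: Node) extends Dep
  final case class Shuffle(parent: Node, shuffleId: Int) extends Dep
  final case class Node(name: String, deps: List[Dep])

  // Walks through narrow dependencies only and collects the first shuffle
  // dependency met on each path, i.e. the immediate shuffle parents.
  def getShuffleDependencies(node: Node): Set[Shuffle] =
    node.deps.flatMap {
      case s: Shuffle => Set(s)
      case Narrow(p)  => getShuffleDependencies(p)
    }.toSet
}
```

The traversal stops at each shuffle boundary, so shuffle dependencies of more distant ancestors are not returned.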

Author: Kay Ousterhout <kayousterhout@gmail.com>

Closes #13646 from kayousterhout/SPARK-15927.
2016-06-14 17:27:01 -07:00
Sean Owen 6151d2641f [MINOR] Clean up several build warnings, mostly due to internal use of old accumulators
## What changes were proposed in this pull request?

Another PR to clean up recent build warnings. This particularly cleans up several instances of the old accumulator API usage in tests that are straightforward to update. I think this qualifies as "minor".

## How was this patch tested?

Jenkins

Author: Sean Owen <sowen@cloudera.com>

Closes #13642 from srowen/BuildWarnings.
2016-06-14 09:40:07 -07:00
Dongjoon Hyun 938434dc78 [SPARK-15913][CORE] Dispatcher.stopped should be enclosed by synchronized block.
## What changes were proposed in this pull request?

`Dispatcher.stopped` is guarded by `this`, but it is used without synchronization in the `postMessage` function. This PR fixes this and also makes the exception message more accurate.
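
The pattern is to read and write the guarded flag only while holding the same lock. A simplified sketch follows; this toy `Dispatcher` is hypothetical and returns a Boolean where the real code throws an exception:

```scala
object DispatcherSketch {
  class Dispatcher {
    // Guarded by `this`: every read and write happens inside synchronized.
    private var stopped = false

    def stop(): Unit = synchronized { stopped = true }

    // Returns false instead of enqueueing once the dispatcher is stopped.
    def postMessage(msg: String): Boolean = synchronized {
      if (stopped) false
      else { /* enqueue msg here */ true }
    }
  }
}
```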

## How was this patch tested?

Pass the existing Jenkins tests.

Author: Dongjoon Hyun <dongjoon@apache.org>

Closes #13634 from dongjoon-hyun/SPARK-15913.
2016-06-13 10:30:17 -07:00
Sean Owen 0a6f090837 [SPARK-15876][CORE] Remove support for "zk://" master URL
## What changes were proposed in this pull request?

Remove deprecated support for `zk://` master (`mesos://zk://` remains supported)

## How was this patch tested?

Jenkins

Author: Sean Owen <sowen@cloudera.com>

Closes #13625 from srowen/SPARK-15876.
2016-06-12 11:46:33 -07:00
Sean Owen f51dfe616b [SPARK-15086][CORE][STREAMING] Deprecate old Java accumulator API
## What changes were proposed in this pull request?

- Deprecate old Java accumulator API; should use Scala now
- Update Java tests and examples
- Don't bother testing old accumulator API in Java 8 (too)
- (fix a misspelling too)

## How was this patch tested?

Jenkins tests

Author: Sean Owen <sowen@cloudera.com>

Closes #13606 from srowen/SPARK-15086.
2016-06-12 11:44:33 -07:00
bomeng 50248dcfff [SPARK-15806][DOCUMENTATION] update doc for SPARK_MASTER_IP
## What changes were proposed in this pull request?

SPARK_MASTER_IP is a deprecated environment variable. It is replaced by SPARK_MASTER_HOST according to MasterArguments.scala.

## How was this patch tested?

Manually verified.

Author: bomeng <bmeng@us.ibm.com>

Closes #13543 from bomeng/SPARK-15806.
2016-06-12 14:25:48 +01:00
Imran Rashid 8cc22b0085 [SPARK-15878][CORE][TEST] fix cleanup in EventLoggingListenerSuite and ReplayListenerSuite
## What changes were proposed in this pull request?

These tests weren't properly using `LocalSparkContext` so weren't cleaning up correctly when tests failed.

## How was this patch tested?

Jenkins.

Author: Imran Rashid <irashid@cloudera.com>

Closes #13602 from squito/SPARK-15878_cleanup_replaylistener.
2016-06-12 12:54:57 +01:00
Eric Liang e1f986c7a3 [SPARK-15860] Metrics for codegen size and perf
## What changes were proposed in this pull request?

Adds codahale metrics for the codegen source text size and how long it takes to compile. The size is particularly interesting, since the JVM does have hard limits on how large methods can get.

To simplify, I added the metrics under a statically-initialized source that is always registered with SparkEnv.

## How was this patch tested?

Unit tests

Author: Eric Liang <ekl@databricks.com>

Closes #13586 from ericl/spark-15860.
2016-06-11 23:16:21 -07:00
Eric Liang c06c58bbbb [SPARK-14851][CORE] Support radix sort with nullable longs
## What changes were proposed in this pull request?

This adds support for radix sort of nullable long fields. When a sort field is null and radix sort is enabled, we keep nulls in a separate region of the sort buffer so that radix sort does not need to deal with them. This also has performance benefits when sorting smaller integer types, since the current representation of nulls in two's complement (Long.MIN_VALUE) otherwise forces a full-width radix sort.

This strategy for nulls does mean the sort is no longer stable. cc davies
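
The null-handling idea can be sketched independently of radix sort. In the example below the standard library sort stands in for the radix sort, nulls are modelled as `None`, and placing them first is an assumption:

```scala
object NullRegionSketch {
  // Keeps nulls (modelled as None) in a separate region and sorts only the
  // non-null keys, so the sorting routine never has to handle nulls.
  def sortWithNulls(keys: Seq[Option[Long]]): Seq[Option[Long]] = {
    val (nulls, nonNulls) = keys.partition(_.isEmpty)
    nulls ++ nonNulls.flatten.sorted.map(Some(_))
  }
}
```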

## How was this patch tested?

Existing randomized sort tests for correctness. I also tested some TPCDS queries and there does not seem to be any significant regression for non-null sorts.

Some test queries (best of 5 runs each).
Before change:
scala> val start = System.nanoTime; spark.range(5000000).selectExpr("if(id > 5, cast(hash(id) as long), NULL) as h").coalesce(1).orderBy("h").collect(); (System.nanoTime - start) / 1e6
start: Long = 3190437233227987
res3: Double = 4716.471091

After change:
scala> val start = System.nanoTime; spark.range(5000000).selectExpr("if(id > 5, cast(hash(id) as long), NULL) as h").coalesce(1).orderBy("h").collect(); (System.nanoTime - start) / 1e6
start: Long = 3190367870952791
res4: Double = 2981.143045

Author: Eric Liang <ekl@databricks.com>

Closes #13161 from ericl/sc-2998.
2016-06-11 15:42:58 -07:00
Sean Owen 3761330dd0 [SPARK-15879][DOCS][UI] Update logo in UI and docs to add "Apache"
## What changes were proposed in this pull request?

Use new Spark logo including "Apache" (now, with crushed PNGs). Remove old unreferenced logo files.

## How was this patch tested?

Manual check of generated HTML site and Spark UI. I searched for references to the deleted files to make sure they were not used.

Author: Sean Owen <sowen@cloudera.com>

Closes #13609 from srowen/SPARK-15879.
2016-06-11 12:46:07 +01:00
wangyang 026eb90644 [SPARK-15875] Try to use Seq.isEmpty and Seq.nonEmpty instead of Seq.length == 0 and Seq.length > 0
## What changes were proposed in this pull request?

In Scala, immutable.List.length is an expensive operation so we should
avoid using Seq.length == 0 or Seq.length > 0, and use Seq.isEmpty and Seq.nonEmpty instead.
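
A small sketch of the two equivalent checks (the object and method names are illustrative):

```scala
object EmptinessSketch {
  // List.length walks the whole list (O(n)); isEmpty and nonEmpty only
  // inspect the head cell (O(1)), and they give the same answers.
  def hasElementsSlow(xs: List[Int]): Boolean = xs.length > 0
  def hasElementsFast(xs: List[Int]): Boolean = xs.nonEmpty
}
```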

## How was this patch tested?
existing tests

Author: wangyang <wangyang@haizhi.com>

Closes #13601 from yangw1234/isEmpty.
2016-06-10 13:10:03 -07:00
Kay Ousterhout 5c16ad0d52 Revert [SPARK-14485][CORE] ignore task finished for executor lost
This reverts commit 695dbc816a.

This change is being reverted because it hurts performance of some jobs, and
only helps in a narrow set of cases.  For more discussion, refer to the JIRA.

Author: Kay Ousterhout <kayousterhout@gmail.com>

Closes #13580 from kayousterhout/revert-SPARK-14485.
2016-06-10 12:50:50 -07:00
Reynold Xin 254bc8c34e [SPARK-15866] Rename listAccumulator collectionAccumulator
## What changes were proposed in this pull request?
SparkContext.listAccumulator, by Spark's convention, makes it sound like "list" is a verb and the method should return a list of accumulators. This patch renames the method and the class to collection accumulator.

## How was this patch tested?
Updated test case to reflect the names.

Author: Reynold Xin <rxin@databricks.com>

Closes #13594 from rxin/SPARK-15866.
2016-06-10 11:08:39 -07:00
Eric Liang b914e1930f [SPARK-15794] Should truncate toString() of very wide plans
## What changes were proposed in this pull request?

With very wide tables, e.g. thousands of fields, the plan output is unreadable and often causes OOMs due to inefficient string processing. This truncates all struct and operator field lists to a user configurable threshold to limit performance impact.

It would also be nice to optimize string generation to avoid this sort of O(n^2) slowdown entirely (i.e. use StringBuilder everywhere including expressions), but this is probably too large of a change for 2.0 at this point, and truncation has other benefits for usability.
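
A sketch of what such truncation might look like. The helper below is hypothetical (its name, signature, and output format are assumptions), not the function added by this patch:

```scala
object TruncateSketch {
  // Hypothetical helper: renders at most `maxFields` entries and summarizes
  // the rest, so output size stays bounded for very wide plans.
  def truncatedString(fields: Seq[String], maxFields: Int): String =
    if (fields.length > maxFields) {
      val shown = fields.take(math.max(maxFields - 1, 0))
      val dropped = fields.length - shown.length
      (shown :+ s"... $dropped more fields").mkString("[", ", ", "]")
    } else {
      fields.mkString("[", ", ", "]")
    }
}
```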

## How was this patch tested?

Added a microbenchmark that covers this case particularly well. I also ran the microbenchmark while varying the truncation threshold.

```
numFields = 5
wide shallowly nested struct field r/w:  Best/Avg Time(ms)    Rate(M/s)   Per Row(ns)   Relative
------------------------------------------------------------------------------------------------
2000 wide x 50 rows (write in-mem)            2336 / 2558          0.0       23364.4       0.1X

numFields = 25
wide shallowly nested struct field r/w:  Best/Avg Time(ms)    Rate(M/s)   Per Row(ns)   Relative
------------------------------------------------------------------------------------------------
2000 wide x 50 rows (write in-mem)            4237 / 4465          0.0       42367.9       0.1X

numFields = 100
wide shallowly nested struct field r/w:  Best/Avg Time(ms)    Rate(M/s)   Per Row(ns)   Relative
------------------------------------------------------------------------------------------------
2000 wide x 50 rows (write in-mem)          10458 / 11223          0.0      104582.0       0.0X

numFields = Infinity
wide shallowly nested struct field r/w:  Best/Avg Time(ms)    Rate(M/s)   Per Row(ns)   Relative
------------------------------------------------------------------------------------------------
[info]   java.lang.OutOfMemoryError: Java heap space
```

Author: Eric Liang <ekl@databricks.com>
Author: Eric Liang <ekhliang@gmail.com>

Closes #13537 from ericl/truncated-string.
2016-06-09 18:05:16 -07:00
Eric Liang 4e8ac6edd5 [SPARK-15735] Allow specifying min time to run in microbenchmarks
## What changes were proposed in this pull request?

This makes microbenchmarks run for at least 2 seconds by default, to allow some time for JIT compilation to kick in.

## How was this patch tested?

Tested manually with existing microbenchmarks. This change is backwards compatible in that existing microbenchmarks which specified numIters per-case will still run exactly that number of iterations. Microbenchmarks which previously overrode defaultNumIters now override minNumIters.
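
The loop condition can be sketched as follows. This is an illustration under assumed names (`runCase`, `minTimeNanos`), not the Benchmark class itself:

```scala
object MinTimeBenchSketch {
  // Runs `body` until at least `minNumIters` iterations AND `minTimeNanos`
  // have elapsed, then returns the number of iterations executed.
  def runCase(minNumIters: Int, minTimeNanos: Long)(body: => Unit): Int = {
    val start = System.nanoTime()
    var iters = 0
    while (iters < minNumIters || System.nanoTime() - start < minTimeNanos) {
      body
      iters += 1
    }
    iters
  }
}
```

With a minimum time of zero the case runs exactly `minNumIters` times; with a non-zero minimum time, fast bodies are repeated until the time budget is used.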

cc hvanhovell

Author: Eric Liang <ekl@databricks.com>
Author: Eric Liang <ekhliang@gmail.com>

Closes #13472 from ericl/spark-15735.
2016-06-08 16:21:41 -07:00
zhonghaihua 695dbc816a [SPARK-14485][CORE] ignore task finished for executor lost and removed by driver
Currently, when an executor is removed by the driver because of a heartbeat timeout, the driver re-queues the tasks on that executor and sends a kill command to the cluster to kill it.
However, a task running on that executor may finish and return its result to the driver before the kill command takes effect. In that case, the driver accepts the task-finished event and ignores the speculative and re-queued tasks.
Because the executor has already been removed by the driver, the result of the finished task cannot be saved in the driver, since the BlockManagerId has also been removed from BlockManagerMaster. The result data of the stage is therefore incomplete, which causes a fetch failure. For more details, see [SPARK-14485](https://issues.apache.org/jira/browse/SPARK-14485).
This PR introduces a mechanism to ignore this kind of task-finished event.

N/A

Author: zhonghaihua <793507405@qq.com>

Closes #12258 from zhonghaihua/ignoreTaskFinishForExecutorLostAndRemovedByDriver.
2016-06-07 16:32:27 -07:00
Imran Rashid 36d3dfa59a [SPARK-15783][CORE] still some flakiness in these blacklist tests so ignore for now
## What changes were proposed in this pull request?

There is still some flakiness in BlacklistIntegrationSuite, so turning it off for the moment to avoid breaking more builds -- will turn it back on with more fixes.

## How was this patch tested?

jenkins.

Author: Imran Rashid <irashid@cloudera.com>

Closes #13528 from squito/ignore_blacklist.
2016-06-06 12:53:11 -07:00
Dhruve Ashar fa4bc8ea8b [SPARK-14279][BUILD] Pick the spark version from pom
## What changes were proposed in this pull request?
Change the way Spark picks up version information. Also embed the build information to better identify the running Spark version.

More context can be found here : https://github.com/apache/spark/pull/12152

## How was this patch tested?
Ran the mvn and sbt builds to verify the version information was being displayed correctly on executing `spark-submit --version`

![image](https://cloud.githubusercontent.com/assets/7732317/15197251/f7c673a2-1795-11e6-8b2f-88f2a70cf1c1.png)

Author: Dhruve Ashar <dhruveashar@gmail.com>

Closes #13061 from dhruve/impr/SPARK-14279.
2016-06-06 09:42:50 -07:00
Zheng RuiFeng fd8af39713 [MINOR] Fix Typos 'an -> a'
## What changes were proposed in this pull request?

`an -> a`

Use cmds like `find . -name '*.R' | xargs -i sh -c "grep -in ' an [^aeiou]' {} && echo {}"` to generate candidates, and review them one by one.

## How was this patch tested?
manual tests

Author: Zheng RuiFeng <ruifengz@foxmail.com>

Closes #13515 from zhengruifeng/an_a.
2016-06-06 09:35:47 +01:00
Brett Randall 4e767d0f90 [SPARK-15723] Fixed local-timezone-brittle test where short-timezone form "EST" is …
## What changes were proposed in this pull request?

Stop using the abbreviated and ambiguous timezone "EST" in a test, since it is machine-local default timezone dependent, and fails in different timezones.  Fixed [SPARK-15723](https://issues.apache.org/jira/browse/SPARK-15723).

## How was this patch tested?

Note that to reproduce this problem in any locale/timezone, you can modify the scalatest-maven-plugin argLine to add a timezone:

    <argLine>-ea -Xmx3g -XX:MaxPermSize=${MaxPermGen} -XX:ReservedCodeCacheSize=${CodeCacheSize} -Duser.timezone="Australia/Sydney"</argLine>

and run

    $ mvn test -DwildcardSuites=org.apache.spark.status.api.v1.SimpleDateParamSuite -Dtest=none

Equally, this will fix it in an affected timezone:

    <argLine>-ea -Xmx3g -XX:MaxPermSize=${MaxPermGen} -XX:ReservedCodeCacheSize=${CodeCacheSize} -Duser.timezone="America/New_York"</argLine>

To test the fix, apply the above change to `pom.xml` to set test TZ to `Australia/Sydney`, and confirm the test now passes.
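
Why the abbreviation is ambiguous can be illustrated with a small Python sketch. It is not the test's parsing code: the timestamp is arbitrary, and the two fixed offsets are assumptions standing for the US and Australian readings of "EST".

```python
from datetime import datetime, timedelta, timezone

# Two zones that have both been abbreviated "EST" (offsets assumed for illustration).
US_EASTERN_STANDARD = timezone(timedelta(hours=-5), "EST")
AU_EASTERN_STANDARD = timezone(timedelta(hours=10), "EST")

# The same wall-clock reading, interpreted under each meaning of "EST".
wall_clock = datetime(2015, 2, 20, 17, 21, 17)
as_us = wall_clock.replace(tzinfo=US_EASTERN_STANDARD)
as_au = wall_clock.replace(tzinfo=AU_EASTERN_STANDARD)

# The two interpretations name different instants.
gap = as_us - as_au

# An explicit UTC offset leaves no room for interpretation.
unambiguous = datetime.fromisoformat("2015-02-20T17:21:17-05:00")
print(gap, unambiguous == as_us)
```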

Author: Brett Randall <javabrett@gmail.com>

Closes #13462 from javabrett/SPARK-15723-SimpleDateParamSuite.
2016-06-05 15:31:56 +01:00
Davies Liu 3074f575a3 [SPARK-15391] [SQL] manage the temporary memory of timsort
## What changes were proposed in this pull request?

Currently, the temporary buffer used by TimSort is always allocated on-heap without bookkeeping, which can cause OOM in both on-heap and off-heap mode.

This PR manages that memory by preallocating it together with the pointer array, as is already done for RadixSort. This works in both on-heap and off-heap mode.

This PR also changes the loadFactor of BytesToBytesMap to 0.5 (it was 0.70), which enables us to use radix sort and also makes sure that we have enough memory for TimSort.

## How was this patch tested?

Existing tests.

Author: Davies Liu <davies@databricks.com>

Closes #13318 from davies/fix_timsort.
2016-06-03 16:45:09 -07:00
Xin Wu 28ad0f7b0d [SPARK-15681][CORE] allow lowercase or mixed case log level string when calling sc.setLogLevel
## What changes were proposed in this pull request?
Currently the `SparkContext` API `setLogLevel(level: String)` cannot handle lowercase or mixed-case input strings, even though `org.apache.log4j.Level.toLevel` accepts lowercase or mixed case.

This PR is to allow case-insensitive user input for the log level.
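
One way to accept case-insensitive input is to normalize before validating. The Python sketch below assumes that approach; `normalize_log_level` is a hypothetical helper, not the Spark API, and the level names are the standard log4j 1.x levels.

```python
# Standard log4j 1.x level names.
VALID_LOG_LEVELS = {"ALL", "DEBUG", "ERROR", "FATAL", "INFO", "OFF", "TRACE", "WARN"}


def normalize_log_level(level):
    """Upper-case the user's input, then validate it against the known levels."""
    upper = level.upper()
    if upper not in VALID_LOG_LEVELS:
        raise ValueError(
            f"Supplied level {level!r} did not match one of: "
            + ", ".join(sorted(VALID_LOG_LEVELS))
        )
    return upper


print(normalize_log_level("warn"), normalize_log_level("Info"))
```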

## How was this patch tested?
A unit testcase is added.

Author: Xin Wu <xinwu@us.ibm.com>

Closes #13422 from xwu0226/reset_loglevel.
2016-06-03 14:26:48 -07:00
bomeng 8fa00dd05f [SPARK-15737][CORE] fix jetty warning
## What changes were proposed in this pull request?

After upgrading Jetty to 9.2, we always see "WARN org.eclipse.jetty.server.handler.AbstractHandler: No Server set for org.eclipse.jetty.server.handler.ErrorHandler" while running any test cases.

This PR will fix it.

## How was this patch tested?

The existing test cases will cover it.

Author: bomeng <bmeng@us.ibm.com>

Closes #13475 from bomeng/SPARK-15737.
2016-06-03 09:59:15 -07:00
Imran Rashid c2f0cb4f63 [SPARK-15714][CORE] Fix flaky o.a.s.scheduler.BlacklistIntegrationSuite
## What changes were proposed in this pull request?

BlacklistIntegrationSuite (introduced by SPARK-10372) is a bit flaky because of some race conditions:
1. Failed jobs might have non-empty results, because the resultHandler will be invoked for successful tasks (if there are task successes before failures)
2. taskScheduler.taskIdToTaskSetManager must be protected by a lock on taskScheduler

(1) has failed a handful of jenkins builds recently.  I don't think I've seen (2) in jenkins, but I've run into it with some uncommitted tests I'm working on where there are lots more tasks.

While I was in there, I also made an unrelated fix to `runningTasks` in the test framework -- there was a pointless `O(n)` operation to remove completed tasks, could be `O(1)`.
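
The difference between the two removal costs can be sketched in Python. The data structure actually used in the test framework is not shown here, so the list-versus-set contrast below is an assumption and `RunningTasks` is a hypothetical name.

```python
class RunningTasks:
    """Tracks in-flight task ids in a hash set."""

    def __init__(self):
        self._tasks = set()

    def start(self, task_id):
        self._tasks.add(task_id)

    def complete(self, task_id):
        # Hash-set removal is O(1) on average; removing from a list with
        # list.remove(task_id) would scan the list, which is O(n).
        self._tasks.discard(task_id)

    def __len__(self):
        return len(self._tasks)


running = RunningTasks()
for task_id in range(1000):
    running.start(task_id)
running.complete(500)
remaining = len(running)
```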

## How was this patch tested?

I modified the o.a.s.scheduler.BlacklistIntegrationSuite to have it run the tests 1k times on my laptop.  It failed 11 times before this change, and none with it.  (Pretty sure all the failures were problem (1), though I didn't check all of them).

Also the full suite of tests via jenkins.

Author: Imran Rashid <irashid@cloudera.com>

Closes #13454 from squito/SPARK-15714.
2016-06-03 11:49:33 -05:00
Josh Rosen 229f902257 [SPARK-15736][CORE] Gracefully handle loss of DiskStore files
If an RDD partition is cached on disk and the DiskStore file is lost, then reads of that cached partition will fail and the missing partition is supposed to be recomputed by a new task attempt. In the current BlockManager implementation, however, the missing file does not trigger any metadata updates / does not invalidate the cache, so subsequent task attempts will be scheduled on the same executor and the doomed read will be repeatedly retried, leading to repeated task failures and eventually a total job failure.

In order to fix this problem, the executor with the missing file needs to properly mark the corresponding block as missing so that it stops advertising itself as a cache location for that block.
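
The intended behavior can be sketched in Python, assuming a store that maps block ids to file paths. `BlockStore` and its methods are hypothetical and much simpler than the real BlockManager.

```python
import os
import tempfile


class BlockStore:
    """Toy disk-backed block store that forgets blocks whose file has vanished."""

    def __init__(self):
        self.block_paths = {}

    def put(self, block_id, path):
        self.block_paths[block_id] = path

    def has_block(self, block_id):
        # What this store would advertise as a cache location.
        return block_id in self.block_paths

    def read(self, block_id):
        path = self.block_paths.get(block_id)
        if path is None:
            return None
        try:
            with open(path, "rb") as f:
                return f.read()
        except FileNotFoundError:
            # Mark the block as missing so it is no longer advertised and
            # later attempts recompute it instead of retrying the same read.
            del self.block_paths[block_id]
            return None


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "rdd_0_0")
    with open(path, "wb") as f:
        f.write(b"partition data")
    store = BlockStore()
    store.put("rdd_0_0", path)
    first_read = store.read("rdd_0_0")
    os.remove(path)  # simulate the lost file
    second_read = store.read("rdd_0_0")
    still_advertised = store.has_block("rdd_0_0")
```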

This patch fixes this bug and adds an end-to-end regression test (in `FailureSuite`) and a set of unit tests (in `BlockManagerSuite`).

Author: Josh Rosen <joshrosen@databricks.com>

Closes #13473 from JoshRosen/handle-missing-cache-files.
2016-06-02 17:36:31 -07:00