Commit graph

3483 commits

Author SHA1 Message Date
Rui Li 33e64ecacb SPARK-2277: make TaskScheduler track hosts on rack
Hi mateiz, I've created [SPARK-2277](https://issues.apache.org/jira/browse/SPARK-2277) to make TaskScheduler track hosts on each rack. Please help to review, thanks.

Author: Rui Li <rui.li@intel.com>

Closes #1212 from lirui-intel/trackHostOnRack and squashes the following commits:

2b4bd0f [Rui Li] SPARK-2277: refine UT
fbde838 [Rui Li] SPARK-2277: add UT
7bbe658 [Rui Li] SPARK-2277: rename the method
5e4ef62 [Rui Li] SPARK-2277: remove unnecessary import
79ac750 [Rui Li] SPARK-2277: make TaskScheduler track hosts on rack
2014-07-16 22:53:37 +05:30
Henry Saputra 9c12de5092 [SPARK-2500] Move the logInfo for registering BlockManager to BlockManagerMasterActor.register method
PR for SPARK-2500

Move the logInfo call for BlockManager to BlockManagerMasterActor.register instead of BlockManagerInfo constructor.

Previously the loginfo call for registering the registering a BlockManager is happening in the BlockManagerInfo constructor. This kind of confusing because the code could call "new BlockManagerInfo" without actually registering a BlockManager and could confuse when reading the log files.

Author: Henry Saputra <henry.saputra@gmail.com>

Closes #1424 from hsaputra/move_registerblockmanager_log_to_registration_method and squashes the following commits:

3370b4a [Henry Saputra] Move the loginfo for BlockManager to BlockManagerMasterActor.register instead of BlockManagerInfo constructor.
2014-07-15 21:21:52 -07:00
Reynold Xin 4576d80a51 [SPARK-2469] Use Snappy (instead of LZF) for default shuffle compression codec
This reduces shuffle compression memory usage by 3x.

Author: Reynold Xin <rxin@apache.org>

Closes #1415 from rxin/snappy and squashes the following commits:

06c1a01 [Reynold Xin] SPARK-2469: Use Snappy (instead of LZF) for default shuffle compression codec.
2014-07-15 18:47:39 -07:00
witgo 72ea56da8e SPARK-1291: Link the spark UI to RM ui in yarn-client mode
Author: witgo <witgo@qq.com>

Closes #1112 from witgo/SPARK-1291 and squashes the following commits:

6022bcd [witgo] review commit
1fbb925 [witgo] add addAmIpFilter to yarn alpha
210299c [witgo] review commit
1b92a07 [witgo] review commit
6896586 [witgo] Add comments to addWebUIFilter
3e9630b [witgo] review commit
142ee29 [witgo] review commit
1fe7710 [witgo] Link the spark UI to RM ui in yarn-client mode
2014-07-15 13:52:56 -05:00
William Benton cb09e93c1d Reformat multi-line closure argument.
Author: William Benton <willb@redhat.com>

Closes #1419 from willb/reformat-2486 and squashes the following commits:

2676231 [William Benton] Reformat multi-line closure argument.
2014-07-15 09:13:39 -07:00
Reynold Xin dd95abada7 [SPARK-2399] Add support for LZ4 compression.
Based on Greg Bowyer's patch from JIRA https://issues.apache.org/jira/browse/SPARK-2399

Author: Reynold Xin <rxin@apache.org>

Closes #1416 from rxin/lz4 and squashes the following commits:

6c8fefe [Reynold Xin] Fixed typo.
8a14d38 [Reynold Xin] [SPARK-2399] Add support for LZ4 compression.
2014-07-15 01:46:57 -07:00
lianhuiwang 7446f5ff93 discarded exceeded completedDrivers
When completedDrivers number exceeds the threshold, the first Max(spark.deploy.retainedDrivers, 1) will be discarded.

Author: lianhuiwang <lianhuiwang09@gmail.com>

Closes #1114 from lianhuiwang/retained-drivers and squashes the following commits:

8789418 [lianhuiwang] discarded exceeded completedDrivers
2014-07-15 00:22:06 -07:00
Kousuke Saruta c6d75745de [SPARK-2390] Files in staging directory cannot be deleted and wastes the space of HDFS
When running jobs with YARN Cluster mode and using HistoryServer, the files in the Staging Directory (~/.sparkStaging on HDFS) cannot be deleted.
HistoryServer uses directory where event log is written, and the directory is represented as a instance of o.a.h.f.FileSystem created by using FileSystem.get.

On the other hand, ApplicationMaster has a instance named fs, which also created by using FileSystem.get.

FileSystem.get returns cached same instance when URI passed to the method represents same file system and the method is called by same user.
Because of the behavior, when the directory for event log is on HDFS, fs of ApplicationMaster and fileSystem of FileLogger is same instance.
When shutting down ApplicationMaster, fileSystem.close is called in FileLogger#stop, which is invoked by SparkContext#stop indirectly.

And ApplicationMaster#cleanupStagingDir also called by JVM shutdown hook. In this method, fs.delete(stagingDirPath) is invoked.
Because fs.delete in ApplicationMaster is called after fileSystem.close in FileLogger, fs.delete fails and results not deleting files in the staging directory.

I think, calling fileSystem.delete is not needed.

Author: Kousuke Saruta <sarutak@oss.nttdata.co.jp>

Closes #1326 from sarutak/SPARK-2390 and squashes the following commits:

10e1a88 [Kousuke Saruta] Removed fileSystem.close from FileLogger.scala not to prevent any other FileSystem operation
2014-07-14 23:55:39 -07:00
Aaron Davidson a2aa7bebae Add/increase severity of warning in documentation of groupBy()
groupBy()/groupByKey() is notorious for being a very convenient API that can lead to poor performance when used incorrectly.

This PR just makes it clear that users should be cautious not to rely on this API when they really want a different (more performant) one, such as reduceByKey().

(Note that one source of confusion is the name; this groupBy() is not the same as a SQL GROUP-BY, which is used for aggregation and is more similar in nature to Spark's reduceByKey().)

Author: Aaron Davidson <aaron@databricks.com>

Closes #1380 from aarondav/warning and squashes the following commits:

f60da39 [Aaron Davidson] Give better advice
d0afb68 [Aaron Davidson] Add/increase severity of warning in documentation of groupBy()
2014-07-14 23:38:12 -07:00
William Benton 1f99fea53b SPARK-2486: Utils.getCallSite is now resilient to bogus frames
When running Spark under certain instrumenting profilers,
Utils.getCallSite could crash with an NPE.  This commit
makes it more resilient to failures occurring while inspecting
stack frames.

Author: William Benton <willb@redhat.com>

Closes #1413 from willb/spark-2486 and squashes the following commits:

b7c0274 [William Benton] Use explicit null checks instead of Try()
0f0c1ae [William Benton] Utils.getCallSite is now resilient to bogus frames
2014-07-14 23:09:13 -07:00
li-zhihui 3dd8af7a66 [SPARK-1946] Submit tasks after (configured ratio) executors have been registered
Because submitting tasks and registering executors are asynchronous, in most situation, early stages' tasks run without preferred locality.

A simple solution is sleeping few seconds in application, so that executors have enough time to register.

The PR add 2 configuration properties to make TaskScheduler submit tasks after a few of executors have been registered.

\# Submit tasks only after (registered executors / total executors) arrived the ratio, default value is 0
spark.scheduler.minRegisteredExecutorsRatio = 0.8

\# Whatever minRegisteredExecutorsRatio is arrived, submit tasks after the maxRegisteredWaitingTime(millisecond), default value is 30000
spark.scheduler.maxRegisteredExecutorsWaitingTime = 5000

Author: li-zhihui <zhihui.li@intel.com>

Closes #900 from li-zhihui/master and squashes the following commits:

b9f8326 [li-zhihui] Add logs & edit docs
1ac08b1 [li-zhihui] Add new configs to user docs
22ead12 [li-zhihui] Move waitBackendReady to postStartHook
c6f0522 [li-zhihui] Bug fix: numExecutors wasn't set & use constant DEFAULT_NUMBER_EXECUTORS
4d6d847 [li-zhihui] Move waitBackendReady to TaskSchedulerImpl.start & some code refactor
0ecee9a [li-zhihui] Move waitBackendReady from DAGScheduler.submitStage to TaskSchedulerImpl.submitTasks
4261454 [li-zhihui] Add docs for new configs & code style
ce0868a [li-zhihui] Code style, rename configuration property name of minRegisteredRatio & maxRegisteredWaitingTime
6cfb9ec [li-zhihui] Code style, revert default minRegisteredRatio of yarn to 0, driver get --num-executors in yarn/alpha
812c33c [li-zhihui] Fix driver lost --num-executors option in yarn-cluster mode
e7b6272 [li-zhihui] support yarn-cluster
37f7dc2 [li-zhihui] support yarn mode(percentage style)
3f8c941 [li-zhihui] submit stage after (configured ratio of) executors have been registered
2014-07-14 15:32:49 -05:00
Daoyuan 38ccd6ebd4 move some test file to match src code
Just move some test suite to corresponding package

Author: Daoyuan <daoyuan.wang@intel.com>

Closes #1401 from adrian-wang/movetestfiles and squashes the following commits:

d1a6803 [Daoyuan] move some test file to match src code
2014-07-14 10:40:44 -07:00
Daniel Darabos 2245c87af4 Use the Executor's ClassLoader in sc.objectFile().
This makes it possible to read classes from the object file which were specified in the user-provided jars. (By default ObjectInputStream uses latestUserDefinedLoader, which may or may not be the right one.)

I created this because I ran into the following problem. I have x:RDD[X] with X being defined in the jar that I provide to SparkContext. I save it with x.saveAsObjectFile("x"). I try to load it with sc.objectFile\[X\]("x"). It fails with ClassNotFoundException.

After a good while of debugging I figured out that Utils.deserialize() most likely uses the ClassLoader of Utils. This is the bootstrap ClassLoader, so it is not aware of the dynamically added jars. This patch fixes the issue.

A more robust fix would be to always default to Thread.currentThread.getContextClassLoader. This would prevent this problem from biting anyone in the future. It would be a bit harder to test though. On the topic of testing, if you'd like to see tests for this, I will need some hand-holding. Thanks!

Author: Daniel Darabos <darabos.daniel@gmail.com>

Closes #181 from darabos/master and squashes the following commits:

45a011a [Daniel Darabos] Add test for SPARK-1877. (Fixed in 52eb54d.)
e13e090 [Daniel Darabos] Merge branch 'master' of https://github.com/apache/spark
61fe0d0 [Daniel Darabos] Fix style (line too long).
1b5df2c [Daniel Darabos] Use the Executor's ClassLoader in sc.objectFile(). This makes it possible to read classes from the object file which were specified in the user-provided jars. (By default ObjectInputStream uses latestUserDefinedLoader, which may or may not be the right one.)
2014-07-12 00:07:42 -07:00
Andrew Or f4f46dec5a [Minor] Remove unused val in Master
Author: Andrew Or <andrewor14@gmail.com>

Closes #1365 from andrewor14/master-fs and squashes the following commits:

497f100 [Andrew Or] Sneak in a space and hope no one will notice
05ba6da [Andrew Or] Remove unused val
2014-07-11 00:21:16 -07:00
Prashant Sharma 628932b8d0 [SPARK-1776] Have Spark's SBT build read dependencies from Maven.
Patch introduces the new way of working also retaining the existing ways of doing things.

For example build instruction for yarn in maven is
`mvn -Pyarn -PHadoop2.2 clean package -DskipTests`
in sbt it can become
`MAVEN_PROFILES="yarn, hadoop-2.2" sbt/sbt clean assembly`
Also supports
`sbt/sbt -Pyarn -Phadoop-2.2 -Dhadoop.version=2.2.0 clean assembly`

Author: Prashant Sharma <prashant.s@imaginea.com>
Author: Patrick Wendell <pwendell@gmail.com>

Closes #772 from ScrapCodes/sbt-maven and squashes the following commits:

a8ac951 [Prashant Sharma] Updated sbt version.
62b09bb [Prashant Sharma] Improvements.
fa6221d [Prashant Sharma] Excluding sql from mima
4b8875e [Prashant Sharma] Sbt assembly no longer builds tools by default.
72651ca [Prashant Sharma] Addresses code reivew comments.
acab73d [Prashant Sharma] Revert "Small fix to run-examples script."
ac4312c [Prashant Sharma] Revert "minor fix"
6af91ac [Prashant Sharma] Ported oldDeps back. + fixes issues with prev commit.
65cf06c [Prashant Sharma] Servelet API jars mess up with the other servlet jars on the class path.
446768e [Prashant Sharma] minor fix
89b9777 [Prashant Sharma] Merge conflicts
d0a02f2 [Prashant Sharma] Bumped up pom versions, Since the build now depends on pom it is better updated there. + general cleanups.
dccc8ac [Prashant Sharma] updated mima to check against 1.0
a49c61b [Prashant Sharma] Fix for tools jar
a2f5ae1 [Prashant Sharma] Fixes a bug in dependencies.
cf88758 [Prashant Sharma] cleanup
9439ea3 [Prashant Sharma] Small fix to run-examples script.
96cea1f [Prashant Sharma] SPARK-1776 Have Spark's SBT build read dependencies from Maven.
36efa62 [Patrick Wendell] Set project name in pom files and added eclipse/intellij plugins.
4973dbd [Patrick Wendell] Example build using pom reader.
2014-07-10 11:03:37 -07:00
Masayoshi TSUZUKI c2babc089b SPARK-2115: Stage kill link is too close to stage details link
Moved (kill) link to the right side. Add confirmation dialog when (kill) link is clicked.

Author: Masayoshi TSUZUKI <tsudukim@oss.nttdata.co.jp>

Closes #1350 from tsudukim/feature/SPARK-2115 and squashes the following commits:

e2263b0 [Masayoshi TSUZUKI] Moved (kill) link to the right side. Add confirmation dialog when (kill) link is clicked.
2014-07-10 01:18:37 -07:00
Kay Ousterhout 339441f545 [SPARK-2384] Add tooltips to UI.
This patch adds tooltips to clarify some points of confusion in the UI.  When users mouse over some of the table headers (shuffle read, write, and input size) as well as over the "scheduler delay" metric shown for each stage, a black tool tip (see image below) pops up describing the metric in more detail.  After the tooltip mechanism is added by this commit, I imagine others may want to add more tooltips for other things in the UI, but I think this is a good starting point.

![tooltip](https://cloud.githubusercontent.com/assets/1108612/3491905/994e179e-059f-11e4-92f2-c6c12d248d81.jpg)

This looks scary-big but much of it is adding the bootstrap tool tip JavaScript.

Also I have no idea what to put for the license in tooltip (I left it the same -- the Twitter apache header) or for JQuery (left it as nothing) -- @mateiz what's the right thing here?

cc @pwendell @andrewor14 @rxin

Author: Kay Ousterhout <kayousterhout@gmail.com>

Closes #1314 from kayousterhout/tooltips and squashes the following commits:

19981b5 [Kay Ousterhout] Exclude non-licensed javascript files from style check
d9ab5a9 [Kay Ousterhout] Response to Andrew's review
7752449 [Kay Ousterhout] [SPARK-2384] Add tooltips to UI.
2014-07-08 22:57:21 -07:00
Andrew Or bf04a390e4 [SPARK-2392] Executors should not start their own HTTP servers
Executors currently start their own unused HTTP file servers. This is because we use the same SparkEnv class for both executors and drivers, and we do not distinguish this case.

In the longer term, we should separate out SparkEnv for the driver and SparkEnv for the executors.

Author: Andrew Or <andrewor14@gmail.com>

Closes #1335 from andrewor14/executor-http-server and squashes the following commits:

46ef263 [Andrew Or] Start HTTP server only on the driver
2014-07-08 17:35:31 -07:00
Daniel Darabos c8a2313cdf [SPARK-2403] Catch all errors during serialization in DAGScheduler
https://issues.apache.org/jira/browse/SPARK-2403

Spark hangs for us whenever we forget to register a class with Kryo. This should be a simple fix for that. But let me know if you have a better suggestion.

I did not write a new test for this. It would be pretty complicated and I'm not sure it's worthwhile for such a simple change. Let me know if you disagree.

Author: Daniel Darabos <darabos.daniel@gmail.com>

Closes #1329 from darabos/spark-2403 and squashes the following commits:

3aceaad [Daniel Darabos] Print full stack trace for miscellaneous exceptions during serialization.
52c22ba [Daniel Darabos] Only catch NonFatal exceptions.
361e962 [Daniel Darabos] Catch all errors during serialization in DAGScheduler.
2014-07-08 10:43:46 -07:00
witgo 3cd5029be7 Resolve sbt warnings during build Ⅱ
Author: witgo <witgo@qq.com>

Closes #1153 from witgo/expectResult and squashes the following commits:

97541d8 [witgo] merge master
ead26e7 [witgo] Resolve sbt warnings during build
2014-07-08 00:31:42 -07:00
ankit.bhardwaj 42f3abd529 [SPARK-2306]:BoundedPriorityQueue is private and not registered with Kry...
Due to the non registration of BoundedPriorityQueue  with kryoserializer, operations which are dependend on BoundedPriorityQueue are giving exceptions.One such instance is using top along with kryo serialization.
Fixed the issue by registering BoundedPriorityQueue with kryoserializer.

Author: ankit.bhardwaj <ankit.bhardwaj@guavus.com>

Closes #1299 from AnkitBhardwaj12/BoundedPriorityQueueWithKryoIssue and squashes the following commits:

a4ae8ed [ankit.bhardwaj] [SPARK-2306]:BoundedPriorityQueue is private and not registered with Kryo
2014-07-04 22:06:10 -07:00
Reynold Xin 0db5d5a22e Added SignalLogger to HistoryServer.
This was omitted in #1260. @aarondav

Author: Reynold Xin <rxin@apache.org>

Closes #1300 from rxin/historyServer and squashes the following commits:

af720a3 [Reynold Xin] Added SignalLogger to HistoryServer.
2014-07-04 17:33:07 -07:00
Aaron Davidson 97a0bfe1c0 SPARK-2282: Reuse PySpark Accumulator sockets to avoid crashing Spark
JIRA: https://issues.apache.org/jira/browse/SPARK-2282

This issue is caused by a buildup of sockets in the TIME_WAIT stage of TCP, which is a stage that lasts for some period of time after the communication closes.

This solution simply allows us to reuse sockets that are in TIME_WAIT, to avoid issues with the buildup of the rapid creation of these sockets.

Author: Aaron Davidson <aaron@databricks.com>

Closes #1220 from aarondav/SPARK-2282 and squashes the following commits:

2e5cab3 [Aaron Davidson] SPARK-2282: Reuse PySpark Accumulator sockets to avoid crashing Spark
2014-07-03 23:02:36 -07:00
Andrew Or 3894a49be9 [SPARK-2307][Reprise] Correctly report RDD blocks on SparkUI
**Problem.** The existing code in `ExecutorPage.scala` requires a linear scan through all the blocks to filter out the uncached ones. Every refresh could be expensive if there are many blocks and many executors.

**Solution.** The proper semantics should be the following: `StorageStatusListener` should contain only block statuses that are cached. This means as soon as a block is unpersisted by any mean, its status should be removed. This is reflected in the changes made in `StorageStatusListener.scala`.

Further, the `StorageTab` must stop relying on the `StorageStatusListener` changing a dropped block's status to `StorageLevel.NONE` (which no longer happens). This is reflected in the changes made in `StorageTab.scala` and `StorageUtils.scala`.

----------

If you have been following this chain of PRs like pwendell, you will quickly notice that this reverts the changes in #1249, which reverts the changes in #1080. In other words, we are adding back the changes from #1080, and fixing SPARK-2307 on top of those changes. Please ask questions if you are confused.

Author: Andrew Or <andrewor14@gmail.com>

Closes #1255 from andrewor14/storage-ui-fix-reprise and squashes the following commits:

45416fa [Andrew Or] Merge branch 'master' of github.com:apache/spark into storage-ui-fix-reprise
a82ea25 [Andrew Or] Add tests for StorageStatusListener
8773b01 [Andrew Or] Update comment / minor changes
3afde3f [Andrew Or] Correctly report the number of blocks on SparkUI
2014-07-03 22:48:23 -07:00
Aaron Davidson 586feb5c95 [SPARK-2350] Don't NPE while launching drivers
Prior to this change, we could throw a NPE if we launch a driver while another one is waiting, because removing from an iterator while iterating over it is not safe.

Author: Aaron Davidson <aaron@databricks.com>

Closes #1289 from aarondav/master-fail and squashes the following commits:

1cf1cf4 [Aaron Davidson] SPARK-2350: Don't NPE while launching drivers
2014-07-03 22:31:41 -07:00
Raymond Liu 5fa0a05763 [SPARK-1097] Workaround Hadoop conf ConcurrentModification issue
Workaround Hadoop conf ConcurrentModification issue

Author: Raymond Liu <raymond.liu@intel.com>

Closes #1273 from colorant/hadoopRDD and squashes the following commits:

994e98b [Raymond Liu] Address comments
e2cda3d [Raymond Liu] Workaround Hadoop conf ConcurrentModification issue
2014-07-03 19:24:22 -07:00
Andrew Or c480537739 [SPARK] Fix NPE for ExternalAppendOnlyMap
It did not handle null keys very gracefully before.

Author: Andrew Or <andrewor14@gmail.com>

Closes #1288 from andrewor14/fix-external and squashes the following commits:

312b8d8 [Andrew Or] Abstract key hash code
ed5adf9 [Andrew Or] Fix NPE for ExternalAppendOnlyMap
2014-07-03 10:26:50 -07:00
yantangzhai 3bbeca6489 [SPARK-2324] SparkContext should not exit directly when spark.local.dir is a list of multiple paths and one of them has error
The spark.local.dir is configured as a list of multiple paths as follows /data1/sparkenv/local,/data2/sparkenv/local. If the disk data2 of the driver node has error, the application will exit since DiskBlockManager exits directly at createLocalDirs. If the disk data2 of the worker node has error, the executor will exit either.
DiskBlockManager should not exit directly at createLocalDirs if one of spark.local.dir has error. Since spark.local.dir has multiple paths, a problem should not affect the overall situation.
I think DiskBlockManager could ignore the bad directory at createLocalDirs.

Author: yantangzhai <tyz0303@163.com>

Closes #1274 from YanTangZhai/SPARK-2324 and squashes the following commits:

609bf48 [yantangzhai] [SPARK-2324] SparkContext should not exit directly when spark.local.dir is a list of multiple paths and one of them has error
df08673 [yantangzhai] [SPARK-2324] SparkContext should not exit directly when spark.local.dir is a list of multiple paths and one of them has error
2014-07-03 10:14:35 -07:00
Kay Ousterhout 05c3d90e35 [SPARK-2185] Emit warning when task size exceeds a threshold.
This functionality was added in an earlier commit but shortly
after was removed due to a bad git merge (totally my fault).

Author: Kay Ousterhout <kayousterhout@gmail.com>

Closes #1149 from kayousterhout/warning_bug and squashes the following commits:

3f1bb00 [Kay Ousterhout] Fixed Json tests
462a664 [Kay Ousterhout] Removed task set name from warning message
e89b2f6 [Kay Ousterhout] Fixed Json tests.
7af424c [Kay Ousterhout] Emit warning when task size exceeds a threshold.
2014-07-01 01:56:51 -07:00
Peter MacKinnon 3319a3e3c6 SPARK-2332 [build] add exclusion for old servlet-api on hadoop-client in core
Fix for class of test suite failures in jenkins

Author: Peter MacKinnon <pmackinn@redhat.com>

Closes #1271 from pdmack/master and squashes the following commits:

cfe59fd [Peter MacKinnon] exclude servlet-api in hadoop-client for sbt
6f39fec [Peter MacKinnon] add exclusion for old servlet-api on hadoop-client in core
2014-07-01 00:28:38 -07:00
Reynold Xin 5fccb567b3 [SPARK-2318] When exiting on a signal, print the signal name first.
Author: Reynold Xin <rxin@apache.org>

Closes #1260 from rxin/signalhandler1 and squashes the following commits:

8e73552 [Reynold Xin] Uh add Logging back in ApplicationMaster.
0402ba8 [Reynold Xin] Synchronize SignalLogger.register.
dc70705 [Reynold Xin] Added SignalLogger to YARN ApplicationMaster.
79a21b4 [Reynold Xin] Added license header.
0da052c [Reynold Xin] Added the SignalLogger itself.
e587d2e [Reynold Xin] [SPARK-2318] When exiting on a signal, print the signal name first.
2014-06-30 15:12:38 -07:00
Reynold Xin 358ae1534d [SPARK-2322] Exception in resultHandler should NOT crash DAGScheduler and shutdown SparkContext.
This should go into 1.0.1.

Author: Reynold Xin <rxin@apache.org>

Closes #1264 from rxin/SPARK-2322 and squashes the following commits:

c77c07f [Reynold Xin] Added comment to SparkDriverExecutionException and a test case for accumulator.
5d8d920 [Reynold Xin] [SPARK-2322] Exception in resultHandler could crash DAGScheduler and shutdown SparkContext.
2014-06-30 11:50:22 -07:00
Andrew Ash 6803642253 SPARK-2077 Log serializer that actually ends up being used
I could settle with this being a debug also if we provided an example of how to turn it on in `log4j.properties`

https://issues.apache.org/jira/browse/SPARK-2077

Author: Andrew Ash <andrew@andrewash.com>

Closes #1017 from ash211/SPARK-2077 and squashes the following commits:

580f680 [Andrew Ash] Drop to debug
0266415 [Andrew Ash] SPARK-2077 Log serializer that actually ends up being used
2014-06-29 23:29:05 -07:00
William Benton a484030dae SPARK-897: preemptively serialize closures
These commits cause `ClosureCleaner.clean` to attempt to serialize the cleaned closure with the default closure serializer and throw a `SparkException` if doing so fails.  This behavior is enabled by default but can be disabled at individual callsites of `SparkContext.clean`.

Commit 98e01ae8 fixes some no-op assertions in `GraphSuite` that this work exposed; I'm happy to put that in a separate PR if that would be more appropriate.

Author: William Benton <willb@redhat.com>

Closes #143 from willb/spark-897 and squashes the following commits:

bceab8a [William Benton] Commented DStream corner cases for serializability checking.
64d04d2 [William Benton] FailureSuite now checks both messages and causes.
3b3f74a [William Benton] Stylistic and doc cleanups.
b215dea [William Benton] Fixed spurious failures in ImplicitOrderingSuite
be1ecd6 [William Benton] Don't check serializability of DStream transforms.
abe816b [William Benton] Make proactive serializability checking optional.
5bfff24 [William Benton] Adds proactive closure-serializablilty checking
ed2ccf0 [William Benton] Test cases for SPARK-897.
2014-06-29 23:27:34 -07:00
jerryshao 66135a341d [SPARK-2104] Fix task serializing issues when sort with Java non serializable class
Details can be see in [SPARK-2104](https://issues.apache.org/jira/browse/SPARK-2104). This work is based on Reynold's work, add some unit tests to validate the issue.

@rxin , would you please take a look at this PR, thanks a lot.

Author: jerryshao <saisai.shao@intel.com>

Closes #1245 from jerryshao/SPARK-2104 and squashes the following commits:

c8ee362 [jerryshao] Make field partitions transient
2b41917 [jerryshao] Minor changes
47d763c [jerryshao] Fix task serializing issue when sort with Java non serializable class
2014-06-29 23:00:00 -07:00
Kay Ousterhout 7b71a0e096 [SPARK-1683] Track task read metrics.
This commit adds a new metric in TaskMetrics to record
the input data size and displays this information in the UI.

An earlier version of this commit also added the read time,
which can be useful for diagnosing straggler problems,
but unfortunately that change introduced a significant performance
regression for jobs that don't do much computation. In order to
track read time, we'll need to do sampling.

The screenshots below show the UI with the new "Input" field,
which I added to the stage summary page, the executor summary page,
and the per-stage page.

![image](https://cloud.githubusercontent.com/assets/1108612/3167930/2627f92a-eb77-11e3-861c-98ea5bb7a1a2.png)

![image](https://cloud.githubusercontent.com/assets/1108612/3167936/475a889c-eb77-11e3-9706-f11c48751f17.png)

![image](https://cloud.githubusercontent.com/assets/1108612/3167948/80ebcf12-eb77-11e3-87ed-349fce6a770c.png)

Author: Kay Ousterhout <kayousterhout@gmail.com>

Closes #962 from kayousterhout/read_metrics and squashes the following commits:

f13b67d [Kay Ousterhout] Correctly format input bytes on executor page
8b70cde [Kay Ousterhout] Added comment about potential inaccuracy of bytesRead
d1016e8 [Kay Ousterhout] Udated SparkListenerSuite test
8461492 [Kay Ousterhout] Miniscule style fix
ae04d99 [Kay Ousterhout] Remove input metrics for parallel collections
719f19d [Kay Ousterhout] Style fixes
bb6ec62 [Kay Ousterhout] Small fixes
869ac7b [Kay Ousterhout] Updated Json tests
44a0301 [Kay Ousterhout] Fixed accidentally added line
4bd0568 [Kay Ousterhout] Added input source, renamed Hdfs to Hadoop.
f27e535 [Kay Ousterhout] Updates based on review comments and to fix rebase
bf41029 [Kay Ousterhout] Updated Json tests to pass
0fc33e0 [Kay Ousterhout] Added explicit backward compatibility test
4e52925 [Kay Ousterhout] Added Json output and associated tests.
365400b [Kay Ousterhout] [SPARK-1683] Track task read metrics.
2014-06-29 22:01:42 -07:00
Reynold Xin cdf613fc52 [SPARK-2320] Reduce exception/code block font size in web ui
Author: Reynold Xin <rxin@apache.org>

Closes #1261 from rxin/ui-pre-size and squashes the following commits:

7ab1a69 [Reynold Xin] [SPARK-2320] Reduce exception/code block font size in web ui
2014-06-29 16:46:28 -07:00
Reynold Xin 2053d793cc Improve MapOutputTracker error logging.
Author: Reynold Xin <rxin@apache.org>

Closes #1258 from rxin/mapOutputTracker and squashes the following commits:

a7c95b6 [Reynold Xin] Improve MapOutputTracker error logging.
2014-06-28 21:05:03 -07:00
Andrew Or f17510e371 [SPARK-2259] Fix highly misleading docs on cluster / client deploy modes
The existing docs are highly misleading. For standalone mode, for example, it encourages the user to use standalone-cluster mode, which is not officially supported. The safeguards have been added in Spark submit itself to prevent bad documentation from leading users down the wrong path in the future.

This PR is prompted by countless headaches users of Spark have run into on the mailing list.

Author: Andrew Or <andrewor14@gmail.com>

Closes #1200 from andrewor14/submit-docs and squashes the following commits:

5ea2460 [Andrew Or] Rephrase cluster vs client explanation
c827f32 [Andrew Or] Clarify spark submit messages
9f7ed8f [Andrew Or] Clarify client vs cluster deploy mode + add safeguards
2014-06-27 16:11:31 -07:00
Andrew Or 21e0f77b63 [SPARK-2307] SparkUI - storage tab displays incorrect RDDs
The issue here is that the `StorageTab` listens for updates from the `StorageStatusListener`, but when a block is kicked out of the cache, `StorageStatusListener` removes it from its list. Thus, there is no way for the `StorageTab` to know whether a block has been dropped.

This issue was introduced in #1080, which was itself a bug fix. Here we revert that PR and offer a different fix for the original bug (SPARK-2144).

Author: Andrew Or <andrewor14@gmail.com>

Closes #1249 from andrewor14/storage-ui-fix and squashes the following commits:

af019ce [Andrew Or] Fix SPARK-2307
2014-06-27 15:23:25 -07:00
witgo 18f29b96c7 SPARK-2181:The keys for sorting the columns of Executor page in SparkUI are incorrect
Author: witgo <witgo@qq.com>

Closes #1135 from witgo/SPARK-2181 and squashes the following commits:

39dad90 [witgo] The keys for sorting the columns of Executor page in SparkUI are incorrect
2014-06-26 21:59:21 -07:00
Xiangrui Meng c23f5db32b [SPARK-2251] fix concurrency issues in random sampler
The following code is very likely to throw an exception:

~~~
val rdd = sc.parallelize(0 until 111, 10).sample(false, 0.1)
rdd.zip(rdd).count()
~~~

because the same random number generator is used in compute partitions.

Author: Xiangrui Meng <meng@databricks.com>

Closes #1229 from mengxr/fix-sample and squashes the following commits:

f1ee3d7 [Xiangrui Meng] fix concurrency issues in random sampler
2014-06-26 21:46:55 -07:00
Reynold Xin d1636dd72f [SPARK-2297][UI] Make task attempt and speculation more explicit in UI.
New UI:

![screen shot 2014-06-26 at 1 43 52 pm](https://cloud.githubusercontent.com/assets/323388/3404643/82b9ddc6-fd73-11e3-96f9-f7592a7aee79.png)

Author: Reynold Xin <rxin@apache.org>

Closes #1236 from rxin/ui-task-attempt and squashes the following commits:

3b645dd [Reynold Xin] Expose attemptId in Stage.
c0474b1 [Reynold Xin] Beefed up unit test.
c404bdd [Reynold Xin] Fix ReplayListenerSuite.
f56be4b [Reynold Xin] Fixed JsonProtocolSuite.
e29e0f7 [Reynold Xin] Minor update.
5e4354a [Reynold Xin] [SPARK-2297][UI] Make task attempt and speculation more explicit in UI.
2014-06-26 21:13:26 -07:00
Reynold Xin bf578deaf2 Removed throwable field from FetchFailedException and added MetadataFetchFailedException
FetchFailedException used to have a Throwable field, but in reality we never propagate any of the throwable/exceptions back to the driver because Executor explicitly looks for FetchFailedException and then sends FetchFailed as the TaskEndReason.

This pull request removes the throwable and adds a MetadataFetchFailedException that extends FetchFailedException (so now MapOutputTracker throws MetadataFetchFailedException instead).

Author: Reynold Xin <rxin@apache.org>

Closes #1227 from rxin/metadataFetchException and squashes the following commits:

5cb1e0a [Reynold Xin] MetadataFetchFailedException extends FetchFailedException.
8861ee2 [Reynold Xin] Throw MetadataFetchFailedException in MapOutputTracker.
2014-06-26 21:12:16 -07:00
Reynold Xin 6587ef7c17 [SPARK-2286][UI] Report exception/errors for failed tasks that are not ExceptionFailure
Also added inline doc for each TaskEndReason.

Author: Reynold Xin <rxin@apache.org>

Closes #1225 from rxin/SPARK-2286 and squashes the following commits:

6a7959d [Reynold Xin] Fix unit test failure.
cf9d5eb [Reynold Xin] Merge branch 'master' into SPARK-2286
a61fae1 [Reynold Xin] Move to line above ...
38c7391 [Reynold Xin] [SPARK-2286][UI] Report exception/errors for failed tasks that are not ExceptionFailure.
2014-06-26 14:03:22 -07:00
Reynold Xin 4a346e242c [SPARK-2284][UI] Mark all failed tasks as failures.
Previously only tasks failed with ExceptionFailure reason was marked as failure.

Author: Reynold Xin <rxin@apache.org>

Closes #1224 from rxin/SPARK-2284 and squashes the following commits:

be79dbd [Reynold Xin] [SPARK-2284][UI] Mark all failed tasks as failures.
2014-06-25 22:35:03 -07:00
Mark Hamstra b88a59a668 [SPARK-1749] Job cancellation when SchedulerBackend does not implement killTask
This is a fixed up version of #686 (cc @markhamstra @pwendell).  The last commit (the only one I authored) reflects the changes I made from Mark's original patch.

Author: Mark Hamstra <markhamstra@gmail.com>
Author: Kay Ousterhout <kayousterhout@gmail.com>

Closes #1219 from kayousterhout/mark-SPARK-1749 and squashes the following commits:

42dfa7e [Kay Ousterhout] Got rid of terrible double-negative name
80b3205 [Kay Ousterhout] Don't notify listeners of job failure if it wasn't successfully cancelled.
d156d33 [Mark Hamstra] Do nothing in no-kill submitTasks
9312baa [Mark Hamstra] code review update
cc353c8 [Mark Hamstra] scalastyle
e61f7f8 [Mark Hamstra] Catch UnsupportedOperationException when DAGScheduler tries to cancel a job on a SchedulerBackend that does not implement killTask
2014-06-25 20:57:48 -07:00
Sebastien Rainville 1132e472ec [SPARK-2204] Launch tasks on the proper executors in mesos fine-grained mode
The scheduler for Mesos in fine-grained mode launches tasks on the wrong executors. `MesosSchedulerBackend.resourceOffers(SchedulerDriver, List[Offer])` is assuming that `TaskSchedulerImpl.resourceOffers(Seq[WorkerOffer])` is returning task lists in the same order as the offers it was passed, but in the current implementation `TaskSchedulerImpl.resourceOffers` shuffles the offers to avoid assigning the tasks always to the same executors. The result is that the tasks are launched on the wrong executors. The jobs are sometimes able to complete, but most of the time they fail. It seems that as soon as something goes wrong with a task for some reason Spark is not able to recover since it's mistaken as to where the tasks are actually running. Also, it seems that the more the cluster is under load the more likely the job is to fail because there's a higher probability that Spark is trying to launch a task on a slave that doesn't actually have enough resources, again because it's using the wrong offers.

The solution is to not assume that the order in which the tasks are returned is the same as the offers, and simply launch the tasks on the executor decided by `TaskSchedulerImpl.resourceOffers`. What I am not sure about is that I considered slaveId and executorId to be the same, which is true at least in my setup, but I don't know if that is always true.

I tested this on top of the 1.0.0 release and it seems to work fine on our cluster.

Author: Sebastien Rainville <sebastien@hopper.com>

Closes #1140 from sebastienrainville/fine-grained-mode-fix-master and squashes the following commits:

a98b0e0 [Sebastien Rainville] Use a HashMap to retrieve the offer indices
d6ffe54 [Sebastien Rainville] Launch tasks on the proper executors in mesos fine-grained mode
2014-06-25 13:21:18 -07:00
Reynold Xin 7ff2c754f3 [SPARK-2270] Kryo cannot serialize results returned by asJavaIterable
and thus groupBy/cogroup are broken in Java APIs when Kryo is used).

@pwendell this should be merged into 1.0.1.

Thanks @sorenmacbeth for reporting this & helping out with the fix.

Author: Reynold Xin <rxin@apache.org>

Closes #1206 from rxin/kryo-iterable-2270 and squashes the following commits:

09da0aa [Reynold Xin] Updated the comment.
009bf64 [Reynold Xin] [SPARK-2270] Kryo cannot serialize results returned by asJavaIterable (and thus groupBy/cogroup are broken in Java APIs when Kryo is used).
2014-06-25 12:43:22 -07:00
Andrew Or 9aa603296c [SPARK-2258 / 2266] Fix a few worker UI bugs
**SPARK-2258.** Worker UI displays zombie processes if the executor throws an exception before a process is launched. This is because we only inform the Worker of the change if the process is already launched, which in this case it isn't.

**SPARK-2266.** We expose "Some(app-id)" on the log page. This is fairly minor.

Author: Andrew Or <andrewor14@gmail.com>

Closes #1213 from andrewor14/fix-worker-ui and squashes the following commits:

c1223fe [Andrew Or] Fix worker UI bugs
2014-06-25 12:23:08 -07:00