Commit graph

4864 commits

Author SHA1 Message Date
Matei Zaharia aa638ed9c1 Merge pull request #189 from tgravescs/sparkYarnErrorHandling
Impove Spark on Yarn Error handling

Improve cli error handling and only allow a certain number of worker failures before failing the application.  This will help prevent users from doing foolish things and their jobs running forever.  For instance using 32 bit java but trying to allocate 8G containers. This loops forever without this change, now it errors out after a certain number of retries.  The number of tries is configurable.  Also increase the frequency we ping the RM to increase speed at which we get containers if they die. The Yarn MR app defaults to pinging the RM every 1 seconds, so the default of 5 seconds here is fine. But that is configurable as well in case people want to change it.

I do want to make sure there aren't any cases that calling stopExecutors in CoarseGrainedSchedulerBackend would cause problems?  I couldn't think of any and testing on standalone cluster as well as yarn.
2013-11-19 16:05:44 -08:00
Matei Zaharia 55925805fc Merge pull request #187 from aarondav/example-bcast-test
Enable the Broadcast examples to work in a cluster setting

Since they rely on println to display results, we need to first collect those results to the driver to have them actually display locally.

This issue came up on the mailing lists [here](http://mail-archives.apache.org/mod_mbox/incubator-spark-user/201311.mbox/%3C2013111909591557147628%40ict.ac.cn%3E).
2013-11-19 16:04:01 -08:00
tgravescs 4093e9393a Impove Spark on Yarn Error handling 2013-11-19 12:44:00 -06:00
Henry Saputra 9c934b640f Remove the semicolons at the end of Scala code to make it more pure Scala code.
Also remove unused imports as I found them along the way.
Remove return statements when returning value in the Scala code.

Passing compile and tests.
2013-11-19 10:19:03 -08:00
Matthew Taylor f639b65eab PartitionPruningRDD is using index from parent(review changes) 2013-11-19 10:48:48 +00:00
Aaron Davidson 50fd8d98c0 Enable the Broadcast examples to work in a cluster setting
Since they rely on println to display results, we need to first collect
those results to the driver to have them actually display locally.
2013-11-18 22:51:35 -08:00
Matthew Taylor 13b9bf494b PartitionPruningRDD is using index from parent 2013-11-19 06:27:33 +00:00
Holden Karau e163e31c20 Add spaces 2013-11-18 20:13:25 -08:00
Holden Karau 7de180fd13 Remove explicit boxing 2013-11-18 20:05:05 -08:00
Marek Kolodziej 99cfe89c68 Updates to reflect pull request code review 2013-11-18 22:00:36 -05:00
Marek Kolodziej 09bdfe3b16 XORShift RNG with unit tests and benchmark
To run unit test, start SBT console and type:
compile
test-only org.apache.spark.util.XORShiftRandomSuite
To run benchmark, type:
project core
console
Once the Scala console starts, type:
org.apache.spark.util.XORShiftRandom.benchmark(100000000)
2013-11-18 15:21:43 -05:00
Russell Cardullo 1360f62d15 Cleanup GraphiteSink.scala based on feedback
* Reorder imports according to the style guide
* Consistently use propertyToOption in all places
2013-11-18 08:53:39 -08:00
Harvey Feng a98f5a0ebb Misc style changes in the 'yarn' package. 2013-11-17 22:35:56 -08:00
shiyun.wxm eda05fa439 use HashSet.empty[Long] instead of Seq[Long] 2013-11-18 13:31:14 +08:00
Reynold Xin e2ebc3a9d8 Merge pull request #182 from rxin/vector
Slightly enhanced PrimitiveVector:

1. Added trim() method
2. Added size method.
3. Renamed getUnderlyingArray to array.
4. Minor documentation update.
2013-11-17 18:42:18 -08:00
Reynold Xin 26f616d73a Merge pull request #3 from aarondav/pv-test
Add PrimitiveVectorSuite and fix bug in resize()
2013-11-17 18:18:16 -08:00
Aaron Davidson 85763f4942 Add PrimitiveVectorSuite and fix bug in resize() 2013-11-17 18:16:51 -08:00
Reynold Xin 16a2286d6d Return the vector itself for trim and resize method in PrimitiveVector. 2013-11-17 17:52:02 -08:00
BlackNiuza ecfbaf2442 rename "a" to "statusId" 2013-11-18 09:51:40 +08:00
Reynold Xin c30979c7d6 Slightly enhanced PrimitiveVector:
1. Added trim() method
2. Added size method.
3. Renamed getUnderlyingArray to array.
4. Minor documentation update.
2013-11-17 17:09:40 -08:00
BlackNiuza b60839e56a correct number of tasks in ExecutorsUI 2013-11-17 21:38:57 +08:00
Matei Zaharia 1b5b358309 Merge pull request #178 from hsaputra/simplecleanupcode
Simple cleanup on Spark's Scala code

Simple cleanup on Spark's Scala code while testing some modules:
-) Remove some of unused imports as I found them
-) Remove ";" in the imports statements
-) Remove () at the end of method calls like size that does not have size effect.
2013-11-16 11:44:10 -08:00
Henry Saputra c33f802044 Simple cleanup on Spark's Scala code while testing core and yarn modules:
-) Remove some of unused imports as I found them
-) Remove ";" in the imports statements
-) Remove () at the end of method call like size that does not have size effect.
2013-11-15 10:32:20 -08:00
Raymond Liu f6b2e590b1 Merge pull request #1 from aarondav/scala210-master
Various merge corrections
2013-11-14 23:04:55 -08:00
Aaron Davidson ce1d2af7e4 Use Kafka 2.10 (again) 2013-11-14 23:00:39 -08:00
Matei Zaharia 96e0fb4630 Merge pull request #173 from kayousterhout/scheduler_hang
Fix bug where scheduler could hang after task failure.

When a task fails, we need to call reviveOffers() so that the
task can be rescheduled on a different machine. In the current code,
the state in ClusterTaskSetManager indicating which tasks are
pending may be updated after revive offers is called (there's a
race condition here), so when revive offers is called, the task set
manager does not yet realize that there are failed tasks that need
to be relaunched.

This isn't currently unit tested but will be once my pull request for
merging the cluster and local schedulers goes in -- at which point
many more of the unit tests will exercise the code paths through
the cluster scheduler (currently the failure test suite uses the local
scheduler, which is why we didn't see this bug before).
2013-11-14 22:29:28 -08:00
Aaron Davidson f629ba95b6 Various merge corrections
I've diff'd this patch against my own -- since they were both created
independently, this means that two sets of eyes have gone over all the
merge conflicts that were created, so I'm feeling significantly more
confident in the resulting PR.

@rxin has looked at the changes to the repl and is resoundingly
confident that they are correct.
2013-11-14 22:13:09 -08:00
Matei Zaharia dfd40e9f6f Merge pull request #175 from kayousterhout/no_retry_not_serializable
Don't retry tasks when they fail due to a NotSerializableException

As with my previous pull request, this will be unit tested once the Cluster and Local schedulers get merged.
2013-11-14 19:44:50 -08:00
Matei Zaharia ed25105fd9 Merge pull request #174 from ahirreddy/master
Write Spark UI url to driver file on HDFS

This makes the SIMR code path simpler
2013-11-14 19:43:55 -08:00
Raymond Liu d4cd32330e Some fixes for previous master merge commits 2013-11-15 10:22:31 +08:00
Kay Ousterhout 29c88e408e Don't retry tasks when they fail due to a NotSerializableException 2013-11-14 15:15:19 -08:00
Kay Ousterhout b4546ba9e6 Fix bug where scheduler could hang after task failure.
When a task fails, we need to call reviveOffers() so that the
task can be rescheduled on a different machine. In the current code,
the state in ClusterTaskSetManager indicating which tasks are
pending may be updated after revive offers is called (there's a
race condition here), so when revive offers is called, the task set
manager does not yet realize that there are failed tasks that need
to be relaunched.
2013-11-14 13:55:03 -08:00
Reynold Xin 1a4cfbea33 Merge pull request #169 from kayousterhout/mesos_fix
Don't ignore spark.cores.max when using Mesos Coarse mode

totalCoresAcquired is decremented but never incremented, causing Spark to effectively ignore spark.cores.max in coarse grained Mesos mode.
2013-11-14 10:32:11 -08:00
Reynold Xin 5a4f483652 Merge pull request #170 from liancheng/hadooprdd-doc-typo
Fixed a scaladoc typo in HadoopRDD.scala
2013-11-14 10:30:36 -08:00
Reynold Xin d76f5203af Merge pull request #171 from RIA-pierre-borckmans/master
Fixed typos in the CDH4 distributions version codes.

Nothing important, but annoying when doing a copy/paste...
2013-11-14 10:25:48 -08:00
RIA-pierre-borckmans bef398e572 Fixed typos in the CDH4 distributions version codes. 2013-11-14 11:33:48 +01:00
Lian, Cheng cc8995c8f4 Fixed a scaladoc typo in HadoopRDD.scala 2013-11-14 18:17:05 +08:00
Kay Ousterhout 5125cd3466 Don't ignore spark.cores.max when using Mesos Coarse mode 2013-11-13 23:06:17 -08:00
Raymond Liu a60620b76a Merge branch 'master' into scala-2.10 2013-11-14 12:44:19 +08:00
Matei Zaharia 2054c61a18 Merge pull request #159 from liancheng/dagscheduler-actor-refine
Migrate the daemon thread started by DAGScheduler to Akka actor

`DAGScheduler` adopts an event queue and a daemon thread polling the it to process events sent to a `DAGScheduler`.  This is a classical actor use case.  By migrating this thread to Akka actor, we may benefit from both cleaner code and better performance (context switching cost of Akka actor is much less than that of a native thread).

But things become a little complicated when taking existing test code into consideration.

Code in `DAGSchedulerSuite` is somewhat tightly coupled with `DAGScheduler`, and directly calls `DAGScheduler.processEvent` instead of posting event messages to `DAGScheduler`.  To minimize code change, I chose to let the actor to delegate messages to `processEvent`.  Maybe this doesn't follow conventional actor usage, but I tried to make it apparently correct.

Another tricky part is that, since `DAGScheduler` depends on the `ActorSystem` provided by its field `env`, `env` cannot be null.  But the `dagScheduler` field created in `DAGSchedulerSuite.before` was given a null `env`.  What's more, `BlockManager.blockIdsToBlockManagers` checks whether `env` is null to determine whether to run the production code or the test code (bad smell here, huh?).  I went through all callers of `BlockManager.blockIdsToBlockManagers`, and made sure that if `env != null` holds, then `blockManagerMaster == null` must also hold.  That's the logic behind `BlockManager.scala` [line 896](https://github.com/liancheng/incubator-spark/compare/dagscheduler-actor-refine?expand=1#diff-2b643ea78c1add0381754b1f47eec132L896).

At last, since `DAGScheduler` instances are always `start()`ed after creation, I removed the `start()` method, and starts the `eventProcessActor` within the constructor.
2013-11-13 16:49:55 -08:00
Matei Zaharia 9290e5bcd2 Merge pull request #165 from NathanHowell/kerberos-master
spark-assembly.jar fails to authenticate with YARN ResourceManager

The META-INF/services/ sbt MergeStrategy was discarding support for Kerberos, among others. This pull request changes to a merge strategy similar to sbt-assembly's default. I've also included an update to sbt-assembly 0.9.2, a minor fix to it's zip file handling.
2013-11-13 16:48:44 -08:00
Ahir Reddy 0ea1f8b225 Write Spark UI url to driver file on HDFS 2013-11-13 15:23:36 -08:00
Matei Zaharia 39af914b27 Merge pull request #166 from ahirreddy/simr-spark-ui
SIMR Backend Scheduler will now write Spark UI URL to HDFS, which is to ...

...be retrieved by SIMR clients
2013-11-13 08:39:05 -08:00
Raymond Liu 0f2e3c6e31 Merge branch 'master' into scala-2.10 2013-11-13 16:55:11 +08:00
Matei Zaharia f49ea28d25 Merge pull request #137 from tgravescs/sparkYarnJarsHdfsRebase
Allow spark on yarn to be run from HDFS.

Allows the spark.jar, app.jar, and log4j.properties to be put into hdfs.  Allows you to specify the files on a different hdfs cluster and it will copy them over. It makes sure permissions are correct and makes sure to put things into public distributed cache so they can be reused amongst users if their permissions are appropriate.  Also add a bit of error handling for missing arguments.
2013-11-12 19:13:39 -08:00
Matei Zaharia 87f2f4e5c2 Merge pull request #153 from ankurdave/stop-spot-cluster
Enable stopping and starting a spot cluster

Clusters launched using `--spot-price` contain an on-demand master and spot slaves. Because EC2 does not support stopping spot instances, the spark-ec2 script previously could only destroy such clusters.

This pull request makes it possible to stop and restart a spot cluster.
* The `stop` command works as expected for a spot cluster: the master is stopped and the slaves are terminated.
* To start a stopped spot cluster, the user must invoke `launch --use-existing-master`. This launches fresh spot slaves but resumes the existing master.
2013-11-12 16:26:09 -08:00
Matei Zaharia b8bf04a085 Merge pull request #160 from xiajunluan/JIRA-923
Fix bug JIRA-923

Fix column sort issue in UI for JIRA-923.
https://spark-project.atlassian.net/browse/SPARK-923

Conflicts:
	core/src/main/scala/org/apache/spark/ui/jobs/StagePage.scala
	core/src/main/scala/org/apache/spark/ui/jobs/StageTable.scala
2013-11-12 16:19:50 -08:00
Ahir Reddy ccb099e804 SIMR Backend Scheduler will now write Spark UI URL to HDFS, which is to be retrieved by SIMR clients 2013-11-12 15:58:41 -08:00
Nathan Howell 48eac0bcbf Upgrade to sbt-assembly 0.9.2 2013-11-12 13:29:25 -08:00
Nathan Howell 23146a6705 spark-assembly.jar fails to authenticate with YARN ResourceManager
sbt-assembly is setup to pick the first META-INF/services/org.apache.hadoop.security.SecurityInfo file instead of merging them. This causes Kerberos authentication to fail, this manifests itself in the "info:null" debug log statement:

    DEBUG SaslRpcClient: Get token info proto:interface org.apache.hadoop.yarn.api.ApplicationClientProtocolPB info:null
    DEBUG SaslRpcClient: Get kerberos info proto:interface org.apache.hadoop.yarn.api.ApplicationClientProtocolPB info:null
    ERROR UserGroupInformation: PriviledgedActionException as:foo@BAR (auth:KERBEROS) cause:org.apache.hadoop.security.AccessControlException: Client cannot authenticate via:[TOKEN, KERBEROS]
    DEBUG UserGroupInformation: PrivilegedAction as:foo@BAR (auth:KERBEROS) from:org.apache.hadoop.ipc.Client$Connection.handleSaslConnectionFailure(Client.java:583)
    WARN Client: Exception encountered while connecting to the server : org.apache.hadoop.security.AccessControlException: Client cannot authenticate via:[TOKEN, KERBEROS]
    ERROR UserGroupInformation: PriviledgedActionException as:foo@BAR (auth:KERBEROS) cause:java.io.IOException: org.apache.hadoop.security.AccessControlException: Client cannot authenticate via:[TOKEN, KERBEROS]

This previously would just contain a single class:

$ unzip -c assembly/target/scala-2.10/spark-assembly-0.9.0-incubating-SNAPSHOT-hadoop2.2.0.jar META-INF/services/org.apache.hadoop.security.SecurityInfo
Archive:  assembly/target/scala-2.10/spark-assembly-0.9.0-incubating-SNAPSHOT-hadoop2.2.0.jar
  inflating: META-INF/services/org.apache.hadoop.security.SecurityInfo

    org.apache.hadoop.security.AnnotatedSecurityInfo

And now has the full list of classes:

$ unzip -c assembly/target/scala-2.10/spark-assembly-0.9.0-incubating-SNAPSHOT-hadoop2.2.0.jar META-INF/services/org.apache.hadoop.security.SecurityInfoArchive:  assembly/target/scala-2.10/spark-assembly-0.9.0-incubating-SNAPSHOT-hadoop2.2.0.jar
  inflating: META-INF/services/org.apache.hadoop.security.SecurityInfo

    org.apache.hadoop.security.AnnotatedSecurityInfo
    org.apache.hadoop.mapreduce.v2.app.MRClientSecurityInfo
    org.apache.hadoop.mapreduce.v2.security.client.ClientHSSecurityInfo
    org.apache.hadoop.yarn.security.client.ClientRMSecurityInfo
    org.apache.hadoop.yarn.security.ContainerManagerSecurityInfo
    org.apache.hadoop.yarn.security.SchedulerSecurityInfo
    org.apache.hadoop.yarn.security.admin.AdminSecurityInfo
    org.apache.hadoop.yarn.server.RMNMSecurityInfoClass
2013-11-12 13:27:50 -08:00