The code in LocalScheduler/LocalTaskSetManager was nearly identical
to the code in ClusterScheduler/ClusterTaskSetManager. The redundancy
made updating the schedulers unnecessarily painful and error-prone.
This commit combines the two into a single TaskScheduler/
TaskSetManager.
Fixed incorrect log message in local scheduler
This change is especially relevant at the moment: some users are seeing this failure, and the log message is misleading because the tests set the max failures to 0, not 4.
Pull SparkHadoopUtil out of SparkEnv (jira SPARK-886)
Having the logic to initialize the correct SparkHadoopUtil in SparkEnv prevents it from being used until after the SparkContext is initialized. This causes issues like https://spark-project.atlassian.net/browse/SPARK-886. It also makes it hard to use in singleton objects. For instance, I want to use it in the security code.
Add support for local:// URI scheme for addJars()
This PR adds support for a new URI scheme for SparkContext.addJars(): `local://file/path`.
The *local* scheme indicates that the `/file/path` exists on every worker node. It exists for big library JARs, which would be really expensive to serve using the standard HTTP fileserver distribution method, especially on big clusters. Today the only inexpensive way of doing this (assuming such a file is on every host, via say NFS, rsync, etc.) is to add the JAR to the SPARK_CLASSPATH, but we want a method where the user does not need to modify the Spark configuration.
I would add something to the docs, but it's not obvious where to add it.
Oh, and it would be great if this could be merged in time for 0.8.1.
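To illustrate the scheme handling described above, here is a minimal sketch in plain Scala. It is not the code from this PR: `JarPaths`, `classify`, and the three result types are hypothetical names, and the treatment of schemes other than `local` and `file` is an assumption. Because the sketch parses with `java.net.URI`, the usage below takes the single-slash form `local:/file/path`, whose path component is `/file/path`.

```scala
import java.net.URI

// Hypothetical helper, not Spark's implementation: decides how a jar
// reference would be made available to worker nodes.
object JarPaths {
  sealed trait JarSource
  // The jar already exists at this path on every worker; nothing is shipped.
  final case class OnEveryWorker(path: String) extends JarSource
  // The jar lives on the driver and has to be served to the workers.
  final case class ServedByDriver(path: String) extends JarSource
  // Any other scheme is passed through untouched (assumption).
  final case class PassedThrough(uri: String) extends JarSource

  def classify(jar: String): JarSource = {
    val uri = new URI(jar)
    uri.getScheme match {
      case "local"       => OnEveryWorker(uri.getPath)
      case null | "file" => ServedByDriver(uri.getPath)
      case _             => PassedThrough(jar)
    }
  }
}
```

For example, `JarPaths.classify("local:/file/path")` yields `OnEveryWorker("/file/path")`, while a bare path such as `/tmp/app.jar` is classified as `ServedByDriver`.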
Display both task ID and task attempt ID in UI, and rename taskId to taskAttemptId
Previously only the task attempt ID was shown in the UI; this was confusing because the job can be shown as complete while there are tasks still running. Showing the task ID in addition to the attempt ID makes it clear which tasks are redundant.
This commit also renames taskId to taskAttemptId in TaskInfo and in the local/cluster schedulers. This identifier was used to uniquely identify attempts, not tasks, so the old naming was confusing. The new naming is also more consistent with MapReduce.
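The distinction between a task and its attempts can be sketched as follows. This is an illustration only: the trimmed-down `TaskInfo` record, its `taskId` and `finished` fields, and the `redundantAttempts` helper are hypothetical and do not mirror the real class.

```scala
object Attempts {
  // Hypothetical, trimmed-down record; the real TaskInfo carries more fields.
  final case class TaskInfo(taskId: Int, taskAttemptId: Long, finished: Boolean)

  // Attempts that are still running although another attempt of the
  // same task has already finished.
  def redundantAttempts(infos: Seq[TaskInfo]): Seq[TaskInfo] = {
    val finishedTasks = infos.filter(_.finished).map(_.taskId).toSet
    infos.filter(info => !info.finished && finishedTasks.contains(info.taskId))
  }
}
```

With only attempt IDs on screen, the two attempts of one task look like unrelated rows; grouping by task ID is what makes the leftover attempt recognizable as redundant.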
Added new Spark Streaming operations
New operations
- transformWith which allows arbitrary 2-to-1 DStream transform, added to Scala and Java API
- StreamingContext.transform which allows arbitrary n-to-1 DStream transform
- leftOuterJoin and rightOuterJoin between 2 DStreams, added to Scala and Java API
- missing variations of join and cogroup added to Scala and Java API
- missing JavaStreamingContext.union
Updated a number of Java and Scala API docs
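As a rough picture of what leftOuterJoin computes for the key-value pairs of a single batch, here is a model on plain Scala collections. It does not use Spark Streaming; the `Joins` object is a hypothetical stand-in that only mirrors the result shape `(K, (V, Option[W]))`.

```scala
object Joins {
  // Every left pair is kept; keys missing on the right are paired with None.
  def leftOuterJoin[K, V, W](
      left: Seq[(K, V)],
      right: Seq[(K, W)]): Seq[(K, (V, Option[W]))] = {
    val rightByKey: Map[K, Seq[W]] =
      right.groupBy(_._1).map { case (k, pairs) => k -> pairs.map(_._2) }
    left.flatMap { case (k, v) =>
      rightByKey.get(k) match {
        case Some(ws) => ws.map(w => (k, (v, Option(w))))
        case None     => Seq((k, (v, Option.empty[W])))
      }
    }
  }
}
```

rightOuterJoin is the mirror image: every right pair is kept and the left value becomes optional.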
Properly display the name of a stage in the UI.
This fixes a bug introduced by the fix for SPARK-940, which
changed the UI to display the RDD name rather than the stage
name. As a result, no name for the stage was shown when
using the Spark shell, which meant that there was no way to
click on the stage to see more details (e.g., the running
tasks). This commit changes the UI back to using the
stage name.
@pwendell -- let me know if this change was intentional
This patch adds an operator called repartition with more straightforward
semantics than the current `coalesce` operator. There are a few use cases
where this operator is useful:
1. If a user wants to increase the number of partitions in the RDD. This
is more common now with streaming. E.g. a user is ingesting data on one
node but they want to add more partitions to ensure parallelism of
subsequent operations across threads or the cluster.
Right now they have to call rdd.coalesce(numSplits, shuffle=true) - that's
super confusing.
2. If a user has input data where the number of partitions is not known. E.g.
> sc.textFile("some file").coalesce(50)....
This is both semantically vague (am I growing or shrinking this RDD?) and
may not work correctly if the base RDD has fewer than 50 partitions.
The new operator forces shuffles every time, so it will always produce exactly
the requested number of partitions. It also throws an exception rather than
silently failing if bad input is passed.
I am currently adding streaming tests (requires refactoring some of the test
suite to allow testing at partition granularity), so this is not ready for
merge yet. But feedback is welcome.
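The difference between the two operators can be modeled on partition counts alone. The sketch below is a toy model in plain Scala, not the patch itself: `PartitionCounts` is a hypothetical name, and treating a non-positive count as the "bad input" is an assumption.

```scala
// Toy model of partition counts only; it does not use Spark.
object PartitionCounts {
  // Without a shuffle, coalesce can merge partitions but cannot create new ones.
  def coalesce(current: Int, requested: Int, shuffle: Boolean = false): Int =
    if (shuffle) requested else math.min(current, requested)

  // repartition always shuffles and rejects invalid input up front.
  def repartition(current: Int, requested: Int): Int = {
    require(requested > 0, s"Number of partitions must be positive, got $requested")
    coalesce(current, requested, shuffle = true)
  }
}
```

In this model `coalesce(10, 50)` stays at 10 partitions, which is the surprise from use case 2, whereas `repartition(10, 50)` gives 50.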
Show "GETTING_RESULTS" state in UI.
This commit adds a set of calls using the SparkListener interface
that indicate when a task is remotely fetching results, so that
we can display this (potentially time-consuming) phase of execution
to users through the UI.
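A simplified sketch of the idea: the scheduler emits an extra event when a task starts fetching its result, and a listener maps events to the status the UI shows. The event and class names here are hypothetical stand-ins, not the actual SparkListener interface.

```scala
import scala.collection.mutable

object TaskStates {
  // Hypothetical, simplified events; the real listener API differs.
  sealed trait TaskEvent
  final case class TaskStart(taskAttemptId: Long) extends TaskEvent
  final case class TaskGettingResult(taskAttemptId: Long) extends TaskEvent
  final case class TaskEnd(taskAttemptId: Long) extends TaskEvent

  // Tracks the status string a UI could display for each task attempt.
  class StatusTracker {
    private val status = mutable.Map.empty[Long, String]

    def onEvent(event: TaskEvent): Unit = event match {
      case TaskStart(id)         => status(id) = "RUNNING"
      case TaskGettingResult(id) => status(id) = "GETTING_RESULTS"
      case TaskEnd(id)           => status(id) = "FINISHED"
    }

    def statusOf(taskAttemptId: Long): Option[String] = status.get(taskAttemptId)
  }
}
```

Without the middle event, a task that spends a long time fetching a large result would simply appear to be running.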
This should fix SPARK-902, an issue where some
Java API Function classes could cause
AbstractMethodErrors when user code is compiled
using the Eclipse compiler.
Thanks to @MartinWeindel for diagnosing this
problem.
(This PR subsumes / closes #30)
This patch fixes a bug where the Spark UI didn't display the correct number of total
tasks if the number of tasks in a Stage doesn't equal the number of RDD partitions.
It also cleans up the listener API a bit by embedding this information in the
StageInfo class rather than passing it separately.
Split MapOutputTracker into Master/Worker classes
Previously, MapOutputTracker contained fields and methods that were only applicable to the master or worker instances. This commit introduces a MasterMapOutputTracker class to prevent the master-specific methods from being accessed on workers.
I also renamed a few methods and made others protected/private.
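The shape of the split can be sketched like this. It is a simplified illustration, not the real classes: the method names, the `Vector[Option[String]]` representation of map output locations, and the enclosing `Trackers` object are all assumptions.

```scala
import scala.collection.mutable

object Trackers {
  // Shared read path, available on both master and workers.
  class MapOutputTracker {
    protected val statuses = mutable.Map.empty[Int, Vector[Option[String]]]

    def getOutputLocations(shuffleId: Int): Option[Vector[Option[String]]] =
      statuses.get(shuffleId)
  }

  // Master-only write path; workers never see these methods.
  class MasterMapOutputTracker extends MapOutputTracker {
    def registerShuffle(shuffleId: Int, numMaps: Int): Unit = {
      require(!statuses.contains(shuffleId), s"Shuffle $shuffleId registered twice")
      statuses(shuffleId) = Vector.fill[Option[String]](numMaps)(None)
    }

    def registerMapOutput(shuffleId: Int, mapId: Int, location: String): Unit =
      statuses(shuffleId) = statuses(shuffleId).updated(mapId, Some(location))
  }
}
```

Code that holds a plain `MapOutputTracker` cannot call `registerShuffle` at all, so misuse on a worker becomes a compile error instead of a runtime surprise.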