Commit graph

2333 commits

Author SHA1 Message Date
Evan Chan f3679fd494 Add local: URI support to addFile as well 2013-11-01 11:08:03 -07:00
Matei Zaharia 8f1098a3f0 Merge pull request #117 from stephenh/avoid_concurrent_modification_exception
Handle ConcurrentModificationExceptions in SparkContext init.

System.getProperties.toMap will fail-fast when concurrently modified,
and it seems like some other thread started by SparkContext does
a System.setProperty during it's initialization.

Handle this by just looping on ConcurrentModificationException, which
seems the safest, since the non-fail-fast methods (Hastable.entrySet)
have undefined behavior under concurrent modification.
2013-10-30 20:11:48 -07:00
Matei Zaharia dc9ce16f6b Merge pull request #126 from kayousterhout/local_fix
Fixed incorrect log message in local scheduler

This change is especially relevant at the moment, because some users are seeing this failure, and the log message is misleading/incorrect (because for the tests, the max failures is set to 0, not 4)
2013-10-30 17:01:56 -07:00
Matei Zaharia 33de11c51d Merge pull request #124 from tgravescs/sparkHadoopUtilFix
Pull SparkHadoopUtil out of SparkEnv (jira SPARK-886)

Having the logic to initialize the correct SparkHadoopUtil in SparkEnv prevents it from being used until after the SparkContext is initialized.   This causes issues like https://spark-project.atlassian.net/browse/SPARK-886.  It also makes it hard to use in singleton objects.  For instance I want to use it in the security code.
2013-10-30 16:58:27 -07:00
Kay Ousterhout ff038eb4e0 Fixed incorrect log message in local scheduler 2013-10-30 15:27:23 -07:00
Matei Zaharia 618c1f6cf3 Merge pull request #125 from velvia/2013-10/local-jar-uri
Add support for local:// URI scheme for addJars()

This PR adds support for a new URI scheme for SparkContext.addJars():  `local://file/path`.
The *local* scheme indicates that the `/file/path` exists on every worker node.    The reason for its existence is for big library JARs, which would be really expensive to serve using the standard HTTP fileserver distribution method, especially for big clusters.  Today the only inexpensive method (assuming such a file is on every host, via say NFS, rsync, etc.) of doing this is to add the JAR to the SPARK_CLASSPATH, but we want a method where the user does not need to modify the Spark configuration.

I would add something to the docs, but it's not obvious where to add it.

Oh, and it would be great if this could be merged in time for 0.8.1.
2013-10-30 12:03:44 -07:00
Stephen Haberman 09f3b677cb Avoid match errors when filtering for spark.hadoop settings. 2013-10-30 12:29:39 -05:00
tgravescs f231aaa24c move the hadoopJobMetadata back into SparkEnv 2013-10-30 11:46:12 -05:00
Evan Chan de0285556a Add support for local:// URI scheme for addJars()
This indicates that a jar is available locally on each worker node.
2013-10-30 09:41:35 -07:00
tgravescs 54d9c6f253 Merge remote-tracking branch 'upstream/master' into sparkHadoopUtilFix 2013-10-30 10:41:21 -05:00
Josh Rosen cb9c8a922f Extract BlockInfo classes from BlockManager.
This saves space, since the inner classes needed
to keep a reference to the enclosing BlockManager.
2013-10-29 18:06:51 -07:00
Stephen Haberman 3a388c320c Use Properties.clone() instead. 2013-10-29 19:20:40 -05:00
Josh Rosen 846b1cf5ab Store fewer BlockInfo fields for shuffle blocks. 2013-10-29 15:14:29 -07:00
tgravescs eeb5f64c67 Remove SparkHadoopUtil stuff from SparkEnv 2013-10-29 17:12:16 -05:00
Josh Rosen 2d7cf6a271 Restructure BlockInfo fields to reduce memory use. 2013-10-27 23:01:03 -07:00
Matei Zaharia aec9bf9060 Merge pull request #112 from kayousterhout/ui_task_attempt_id
Display both task ID and task attempt ID in UI, and rename taskId to taskAttemptId

Previously only the task attempt ID was shown in the UI; this was confusing because the job can be shown as complete while there are tasks still running.  Showing the task ID in addition to the attempt ID makes it clear which tasks are redundant.

This commit also renames taskId to taskAttemptId in TaskInfo and in the local/cluster schedulers.  This identifier was used to uniquely identify attempts, not tasks, so the current naming was confusing.  The new naming is also more consistent with map reduce.
2013-10-27 19:32:00 -07:00
Stephen Haberman a6ae2b4832 Handle ConcurrentModificationExceptions in SparkContext init.
System.getProperties.toMap will fail-fast when concurrently modified,
and it seems like some other thread started by SparkContext does
a System.setProperty during it's initialization.

Handle this by just looping on ConcurrentModificationException, which
seems the safest, since the non-fail-fast methods (Hastable.entrySet)
have undefined behavior under concurrent modification.
2013-10-27 14:08:32 -05:00
Aaron Davidson 4261e834cb Use flag instead of name check. 2013-10-26 23:53:38 -07:00
Aaron Davidson 596f18479e Eliminate extra memory usage when shuffle file consolidation is disabled
Otherwise, we see SPARK-946 even when shuffle file consolidation is disabled.
Fixing SPARK-946 is still forthcoming.
2013-10-26 22:35:01 -07:00
Kay Ousterhout ae22b4dd99 Display both task ID and task index in UI 2013-10-26 22:18:39 -07:00
Matei Zaharia bab496c120 Merge pull request #108 from alig/master
Changes to enable executing by using HDFS as a synchronization point between driver and executors, as well as ensuring executors exit properly.
2013-10-25 18:28:43 -07:00
Matei Zaharia d307db6e55 Merge pull request #102 from tdas/transform
Added new Spark Streaming operations

New operations
- transformWith which allows arbitrary 2-to-1 DStream transform, added to Scala and Java API
- StreamingContext.transform to allow arbitrary n-to-1 DStream
- leftOuterJoin and rightOuterJoin between 2 DStreams, added to Scala and Java API
- missing variations of join and cogroup added to Scala Java API
- missing JavaStreamingContext.union

Updated a number of Java and Scala API docs
2013-10-25 17:26:06 -07:00
Ali Ghodsi eef261c892 fixing comments on PR 2013-10-25 16:48:33 -07:00
Matei Zaharia 85e2cab6f6 Merge pull request #111 from kayousterhout/ui_name
Properly display the name of a stage in the UI.

This fixes a bug introduced by the fix for SPARK-940, which
changed the UI to display the RDD name rather than the stage
name. As a result, no name for the stage was shown when
using the Spark shell, which meant that there was no way to
click on the stage to see more details (e.g., the running
tasks). This commit changes the UI back to using the
stage name.

@pwendell -- let me know if this change was intentional
2013-10-25 14:46:06 -07:00
Tathagata Das dc9570782a Merge branch 'apache-master' into transform 2013-10-25 14:22:23 -07:00
Kay Ousterhout a9c8d83aaf Properly display the name of a stage in the UI.
This fixes a bug introduced by the fix for SPARK-940, which
changed the UI to display the RDD name rather than the stage
name. As a result, no name for the stage was shown when
using the Spark shell, which meant that there was no way to
click on the stage to see more details (e.g., the running
tasks). This commit changes the UI back to using the
stage name.
2013-10-25 12:00:09 -07:00
Patrick Wendell e5f6d5697b Spacing fix 2013-10-24 22:08:06 -07:00
Patrick Wendell 31e92b72e3 Adding Java versions and associated tests 2013-10-24 21:14:56 -07:00
Patrick Wendell 05ac9940ee Adding tests 2013-10-24 14:31:34 -07:00
Patrick Wendell 2fda84fe3f Always use a shuffle 2013-10-24 14:31:34 -07:00
Patrick Wendell 08c1a42d7d Add a repartition operator.
This patch adds an operator called repartition with more straightforward
semantics than the current `coalesce` operator. There are a few use cases
where this operator is useful:

1. If a user wants to increase the number of partitions in the RDD. This
is more common now with streaming. E.g. a user is ingesting data on one
node but they want to add more partitions to ensure parallelism of
subsequent operations across threads or the cluster.

Right now they have to call rdd.coalesce(numSplits, shuffle=true) - that's
super confusing.

2. If a user has input data where the number of partitions is not known. E.g.

> sc.textFile("some file").coalesce(50)....

This is both vague semantically (am I growing or shrinking this RDD) but also,
may not work correctly if the base RDD has fewer than 50 partitions.

The new operator forces shuffles every time, so it will always produce exactly
the number of new partitions. It also throws an exception rather than silently
not-working if a bad input is passed.

I am currently adding streaming tests (requires refactoring some of the test
suite to allow testing at partition granularity), so this is not ready for
merge yet. But feedback is welcome.
2013-10-24 14:31:33 -07:00
Ali Ghodsi 05a0df2b9e Makes Spark SIMR ready. 2013-10-24 11:59:51 -07:00
Tathagata Das 0400aba1c0 Merge branch 'apache-master' into transform 2013-10-24 11:05:00 -07:00
Tathagata Das bacfe5ebca Added JavaStreamingContext.transform 2013-10-24 10:56:24 -07:00
Matei Zaharia 1dc776b863 Merge pull request #93 from kayousterhout/ui_new_state
Show "GETTING_RESULTS" state in UI.

This commit adds a set of calls using the SparkListener interface
that indicate when a task is remotely fetching results, so that
we can display this (potentially time-consuming) phase of execution
to users through the UI.
2013-10-23 22:05:52 -07:00
Kay Ousterhout b45352e373 Clear akka frame size property in tests 2013-10-23 18:23:28 -07:00
Kay Ousterhout c42f5d1787 Fixed broken tests 2013-10-23 17:35:01 -07:00
Josh Rosen 210858ac02 Add unpersist() to JavaDoubleRDD and JavaPairRDD.
Also add support for new optional `blocking` argument.
2013-10-23 17:27:01 -07:00
Kay Ousterhout a5f8f54ecd Merge remote-tracking branch 'upstream/master' into ui_new_state
Conflicts:
	core/src/test/scala/org/apache/spark/scheduler/SparkListenerSuite.scala
2013-10-23 16:06:28 -07:00
Tathagata Das 9fccb17a5f Removed Function3.call() based on Josh's comment. 2013-10-23 12:07:07 -07:00
Tathagata Das fe8626efd1 Merge branch 'apache-master' into transform 2013-10-22 23:40:40 -07:00
Tathagata Das 72d2e1dd77 Fixed bug in Java transformWith, added more Java testcases for transform and transformWith, added missing variations of Java join and cogroup, updated various Scala and Java API docs. 2013-10-22 23:35:51 -07:00
Josh Rosen 768eb9c962 Remove redundant Java Function call() definitions
This should fix SPARK-902, an issue where some
Java API Function classes could cause
AbstractMethodErrors when user code is compiled
using the Eclipse compiler.

Thanks to @MartinWeindel for diagnosing this
problem.

(This PR subsumes / closes #30)
2013-10-22 14:26:52 -07:00
Patrick Wendell ab5ece19a3 Formatting cleanup 2013-10-22 13:03:08 -07:00
Patrick Wendell c22046b3cc Minor clean-up in review 2013-10-22 11:00:50 -07:00
Patrick Wendell 7de0ea4d42 Response to code review and adding some more tests 2013-10-22 11:00:50 -07:00
Patrick Wendell 2fa3c4c49c Fix for Spark-870.
This patch fixes a bug where the Spark UI didn't display the correct number of total
tasks if the number of tasks in a Stage doesn't equal the number of RDD partitions.

It also cleans up the listener API a bit by embedding this information in the
StageInfo class rather than passing it seperately.
2013-10-22 11:00:25 -07:00
Patrick Wendell a854f5bfcf SPARK-940: Do not directly pass Stage objects to SparkListener. 2013-10-22 11:00:06 -07:00
Matei Zaharia a0e08f0fb9 Merge pull request #82 from JoshRosen/map-output-tracker-refactoring
Split MapOutputTracker into Master/Worker classes

Previously, MapOutputTracker contained fields and methods that were only applicable to the master or worker instances.  This commit introduces a MasterMapOutputTracker class to prevent the master-specific methods from being accessed on workers.

I also renamed a few methods and made others protected/private.
2013-10-22 10:20:43 -07:00
Kay Ousterhout 37b9b4cc11 Shorten GETTING_RESULT to GET_RESULT 2013-10-22 10:05:33 -07:00