4571 commits

Author SHA1 Message Date
Evan Chan de0285556a Add support for local:// URI scheme for addJars()
This indicates that a jar is available locally on each worker node.
2013-10-30 09:41:35 -07:00
tgravescs 54d9c6f253 Merge remote-tracking branch 'upstream/master' into sparkHadoopUtilFix 2013-10-30 10:41:21 -05:00
Matei Zaharia 745dc42908 Merge pull request #118 from JoshRosen/blockinfo-memory-usage
Reduce the memory footprint of BlockInfo objects

This pull request reduces the memory footprint of all BlockInfo objects and makes additional optimizations for shuffle blocks.  For all BlockInfo objects, these changes remove two boolean fields and one Object field.  For shuffle blocks, we additionally remove an Object field and a boolean field.

When storing tens of thousands of these objects, this may add up to significant memory savings.  A ShuffleBlockInfo now only needs to wrap a single long.

This was motivated by a [report of high blockInfo memory usage during shuffles](https://mail-archives.apache.org/mod_mbox/incubator-spark-user/201310.mbox/%3C20131026134353.202b2b9b%40sh9%3E).

I haven't run benchmarks to measure the exact memory savings.

/cc @aarondav
2013-10-29 23:47:10 -07:00
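The idea that a shuffle block's metadata can shrink to "a single long" can be sketched roughly as below. This is a hypothetical illustration, not the code from the pull request: the class name `CompactBlockInfo`, its methods, and the sentinel values are all assumptions. It shows negative values of the size field standing in for boolean state flags.

```java
/** Hypothetical sketch: block state and size packed into one long field. */
public class CompactBlockInfo {
    // Assumed sentinels. A real block size is never negative, so negative
    // values are free to encode state instead of separate boolean fields.
    private static final long PENDING = -1L;
    private static final long FAILED = -2L;

    private volatile long size = PENDING;

    public boolean isPending() {
        return size == PENDING;
    }

    public boolean isFailed() {
        return size == FAILED;
    }

    /** Records a successful write of the given number of bytes. */
    public void markReady(long sizeInBytes) {
        if (sizeInBytes < 0) {
            throw new IllegalArgumentException("size must be non-negative");
        }
        size = sizeInBytes;
    }

    public void markFailed() {
        size = FAILED;
    }

    /** Returns the size, or throws if the block is pending or failed. */
    public long getSize() {
        long current = size; // read the volatile field once
        if (current < 0) {
            throw new IllegalStateException("block is not ready");
        }
        return current;
    }
}
```

The trade-off in this sketch is that state and size can no longer be read independently, which is acceptable when a size is only meaningful once the block is ready.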
tgravescs e5e0ebdb11 fix sparkhdfs lr test 2013-10-29 20:12:45 -05:00
Josh Rosen cb9c8a922f Extract BlockInfo classes from BlockManager.
This saves space, since the inner classes needed
to keep a reference to the enclosing BlockManager.
2013-10-29 18:06:51 -07:00
Stephen Haberman 3a388c320c Use Properties.clone() instead. 2013-10-29 19:20:40 -05:00
Josh Rosen 846b1cf5ab Store fewer BlockInfo fields for shuffle blocks. 2013-10-29 15:14:29 -07:00
tgravescs eeb5f64c67 Remove SparkHadoopUtil stuff from SparkEnv 2013-10-29 17:12:16 -05:00
Reynold Xin f0e23a023c Merge pull request #119 from soulmachine/master
A little revise for the document
2013-10-29 01:41:44 -04:00
soulmachine a197137fde A little revise for the document 2013-10-29 00:28:56 +08:00
Josh Rosen 2d7cf6a271 Restructure BlockInfo fields to reduce memory use. 2013-10-27 23:01:03 -07:00
Matei Zaharia aec9bf9060 Merge pull request #112 from kayousterhout/ui_task_attempt_id
Display both task ID and task attempt ID in UI, and rename taskId to taskAttemptId

Previously only the task attempt ID was shown in the UI; this was confusing because the job can be shown as complete while there are tasks still running.  Showing the task ID in addition to the attempt ID makes it clear which tasks are redundant.

This commit also renames taskId to taskAttemptId in TaskInfo and in the local/cluster schedulers.  This identifier was used to uniquely identify attempts, not tasks, so the current naming was confusing.  The new naming is also more consistent with map reduce.
2013-10-27 19:32:00 -07:00
Reynold Xin d4df4749a8 Merge pull request #115 from aarondav/shuffle-fix
Eliminate extra memory usage when shuffle file consolidation is disabled

Otherwise, we see SPARK-946 even when shuffle file consolidation is disabled.
Fixing SPARK-946 is still forthcoming.
2013-10-27 22:11:21 -04:00
Stephen Haberman a6ae2b4832 Handle ConcurrentModificationExceptions in SparkContext init.
System.getProperties.toMap will fail-fast when concurrently modified,
and it seems like some other thread started by SparkContext does
a System.setProperty during its initialization.

Handle this by just looping on ConcurrentModificationException, which
seems the safest, since the non-fail-fast methods (Hashtable.entrySet)
have undefined behavior under concurrent modification.
2013-10-27 14:08:32 -05:00
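The "loop on ConcurrentModificationException" approach described in the commit above can be sketched as follows. This is a minimal illustration under assumptions, not the commit's actual Scala code: the class `PropertySnapshot` and the method `snapshot` are hypothetical names, and whether the exception can occur at all depends on the JVM version and on how the properties are iterated.

```java
import java.util.ConcurrentModificationException;
import java.util.HashMap;
import java.util.Map;
import java.util.Properties;

/** Hypothetical helper: copy a Properties object, retrying on concurrent modification. */
public class PropertySnapshot {
    public static Map<String, String> snapshot(Properties props) {
        while (true) {
            try {
                // Build a fresh copy on every attempt so that a failed
                // attempt never leaks partial results.
                Map<String, String> copy = new HashMap<>();
                for (Map.Entry<Object, Object> entry : props.entrySet()) {
                    copy.put(String.valueOf(entry.getKey()),
                             String.valueOf(entry.getValue()));
                }
                return copy;
            } catch (ConcurrentModificationException e) {
                // Another thread changed the properties mid-iteration; try again.
            }
        }
    }
}
```

The loop has no retry limit, so it assumes that concurrent writes are brief, as they would be during start-up.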
Aaron Davidson 4261e834cb Use flag instead of name check. 2013-10-26 23:53:38 -07:00
Aaron Davidson 596f18479e Eliminate extra memory usage when shuffle file consolidation is disabled
Otherwise, we see SPARK-946 even when shuffle file consolidation is disabled.
Fixing SPARK-946 is still forthcoming.
2013-10-26 22:35:01 -07:00
Kay Ousterhout ae22b4dd99 Display both task ID and task index in UI 2013-10-26 22:18:39 -07:00
Patrick Wendell e018f2d0ae Merge pull request #113 from pwendell/master
Improve error message when multiple assembly jars are present.

This can happen easily if building different hadoop versions. Right now it gives a class not found exception.
2013-10-26 11:39:15 -07:00
Reynold Xin 662ee9f321 Merge pull request #114 from soulmachine/master
A little revise for the document
2013-10-26 11:35:59 -07:00
soulmachine 2eed6bbd10 A little revise for the document 2013-10-26 15:13:57 +08:00
Patrick Wendell 4ba32678e0 Adding improved error message when multiple assembly jars are present.
This can happen easily if building different hadoop versions.
2013-10-25 19:01:15 -07:00
Matei Zaharia bab496c120 Merge pull request #108 from alig/master
Changes to enable executing by using HDFS as a synchronization point between driver and executors, as well as ensuring executors exit properly.
2013-10-25 18:28:43 -07:00
Matei Zaharia d307db6e55 Merge pull request #102 from tdas/transform
Added new Spark Streaming operations

New operations
- transformWith which allows arbitrary 2-to-1 DStream transform, added to Scala and Java API
- StreamingContext.transform to allow arbitrary n-to-1 DStream
- leftOuterJoin and rightOuterJoin between 2 DStreams, added to Scala and Java API
- missing variations of join and cogroup added to Scala and Java API
- missing JavaStreamingContext.union

Updated a number of Java and Scala API docs
2013-10-25 17:26:06 -07:00
Ali Ghodsi eef261c892 fixing comments on PR 2013-10-25 16:48:33 -07:00
Matei Zaharia 85e2cab6f6 Merge pull request #111 from kayousterhout/ui_name
Properly display the name of a stage in the UI.

This fixes a bug introduced by the fix for SPARK-940, which
changed the UI to display the RDD name rather than the stage
name. As a result, no name for the stage was shown when
using the Spark shell, which meant that there was no way to
click on the stage to see more details (e.g., the running
tasks). This commit changes the UI back to using the
stage name.

@pwendell -- let me know if this change was intentional
2013-10-25 14:46:06 -07:00
Tathagata Das dc9570782a Merge branch 'apache-master' into transform 2013-10-25 14:22:23 -07:00
Kay Ousterhout a9c8d83aaf Properly display the name of a stage in the UI.
This fixes a bug introduced by the fix for SPARK-940, which
changed the UI to display the RDD name rather than the stage
name. As a result, no name for the stage was shown when
using the Spark shell, which meant that there was no way to
click on the stage to see more details (e.g., the running
tasks). This commit changes the UI back to using the
stage name.
2013-10-25 12:00:09 -07:00
Reynold Xin ab35ec4f0f Merge pull request #110 from pwendell/master
Exclude jopt from kafka dependency.

Kafka uses an older version of jopt that causes bad conflicts with the version
used by spark-perf. It's not easy to remove this downstream because of the way
that spark-perf uses Spark (by including a spark assembly as an unmanaged jar).
This fixes the problem at its source by just never including it.
2013-10-25 10:16:18 -07:00
Patrick Wendell af4a529f6e Exclude jopt from kafka dependency.
Kafka uses an older version of jopt that causes bad conflicts with the version
used by spark-perf. It's not easy to remove this downstream because of the way
that spark-perf uses Spark (by including a spark assembly as an unmanaged jar).
This fixes the problem at its source by just never including it.
2013-10-25 09:20:30 -07:00
Reynold Xin 4f2c9438b4 Merge pull request #109 from pwendell/master
Adding Java/Java Streaming versions of `repartition` with associated tests
2013-10-24 22:32:02 -07:00
Patrick Wendell ad5f579cbf Style fixes 2013-10-24 22:18:53 -07:00
Patrick Wendell e5f6d5697b Spacing fix 2013-10-24 22:08:06 -07:00
Patrick Wendell a351fd4aed Small spacing fix 2013-10-24 21:16:30 -07:00
Patrick Wendell 31e92b72e3 Adding Java versions and associated tests 2013-10-24 21:14:56 -07:00
Reynold Xin 99ad4a613a Merge pull request #106 from pwendell/master
Add a `repartition` operator.

This patch adds an operator called repartition with more straightforward
semantics than the current `coalesce` operator. There are a few use cases
where this operator is useful:

1. If a user wants to increase the number of partitions in the RDD. This
is more common now with streaming. E.g. a user is ingesting data on one
node but they want to add more partitions to ensure parallelism of
subsequent operations across threads or the cluster.

Right now they have to call rdd.coalesce(numSplits, shuffle=true) - that's
super confusing.

2. If a user has input data where the number of partitions is not known. E.g.

> sc.textFile("some file").coalesce(50)....

This is both semantically vague (am I growing or shrinking this RDD?) and
may not work correctly if the base RDD has fewer than 50 partitions.

The new operator forces a shuffle every time, so it will always produce exactly
the requested number of partitions. It also throws an exception rather than
silently failing if a bad input is passed.

I am currently adding streaming tests (requires refactoring some of the test
suite to allow testing at partition granularity), so this is not ready for
merge yet. But feedback is welcome.
2013-10-24 17:08:39 -07:00
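The difference in resulting partition counts described in the pull request above can be modelled with plain integers. This is only a sketch of the semantics, not Spark code: the class `PartitionCounts` is hypothetical, and treating "bad input" as a non-positive count is an assumption, since the message does not say which inputs are rejected.

```java
/** Hypothetical model of partition counts only; no RDDs are involved. */
public class PartitionCounts {
    /**
     * Without a shuffle, coalesce can keep or reduce the partition count
     * but cannot increase it. With a shuffle, the requested count is produced.
     */
    public static int coalesce(int current, int requested, boolean shuffle) {
        requirePositive(requested);
        return shuffle ? requested : Math.min(current, requested);
    }

    /** Always shuffles, so the result is exactly the requested count. */
    public static int repartition(int current, int requested) {
        return coalesce(current, requested, true);
    }

    // Assumed validation: reject non-positive partition counts loudly.
    private static void requirePositive(int n) {
        if (n <= 0) {
            throw new IllegalArgumentException("partition count must be positive: " + n);
        }
    }
}
```

In this model, `coalesce(10, 50, false)` stays at 10 partitions while `repartition(10, 50)` yields 50, which is the ambiguity the new operator is meant to remove.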
Patrick Wendell 39f6f75588 Some clean-up of tests 2013-10-24 16:43:33 -07:00
Tathagata Das e962a6e6ee Fixed accidental bug. 2013-10-24 15:17:26 -07:00
Patrick Wendell 9423532fab Removing Java for now 2013-10-24 14:31:34 -07:00
Patrick Wendell 05ac9940ee Adding tests 2013-10-24 14:31:34 -07:00
Patrick Wendell 2fda84fe3f Always use a shuffle 2013-10-24 14:31:34 -07:00
Patrick Wendell 08c1a42d7d Add a repartition operator.
This patch adds an operator called repartition with more straightforward
semantics than the current `coalesce` operator. There are a few use cases
where this operator is useful:

1. If a user wants to increase the number of partitions in the RDD. This
is more common now with streaming. E.g. a user is ingesting data on one
node but they want to add more partitions to ensure parallelism of
subsequent operations across threads or the cluster.

Right now they have to call rdd.coalesce(numSplits, shuffle=true) - that's
super confusing.

2. If a user has input data where the number of partitions is not known. E.g.

> sc.textFile("some file").coalesce(50)....

This is both semantically vague (am I growing or shrinking this RDD?) and
may not work correctly if the base RDD has fewer than 50 partitions.

The new operator forces a shuffle every time, so it will always produce exactly
the requested number of partitions. It also throws an exception rather than
silently failing if a bad input is passed.

I am currently adding streaming tests (requires refactoring some of the test
suite to allow testing at partition granularity), so this is not ready for
merge yet. But feedback is welcome.
2013-10-24 14:31:33 -07:00
Ali Ghodsi 05a0df2b9e Makes Spark SIMR ready. 2013-10-24 11:59:51 -07:00
Reynold Xin 5429d62dfa Merge pull request #107 from ScrapCodes/scala-2.10
Updating to latest akka 2.2.3, which fixes our only failing test Driver Suite.
2013-10-24 11:15:55 -07:00
Tathagata Das 0400aba1c0 Merge branch 'apache-master' into transform 2013-10-24 11:05:00 -07:00
Tathagata Das bacfe5ebca Added JavaStreamingContext.transform 2013-10-24 10:56:24 -07:00
Prashant Sharma c77ca1fed9 Updating to latest akka 2.2.3, which fixes our only failing Driver Suite 2013-10-24 16:11:40 +05:30
Matei Zaharia 1dc776b863 Merge pull request #93 from kayousterhout/ui_new_state
Show "GETTING_RESULTS" state in UI.

This commit adds a set of calls using the SparkListener interface
that indicate when a task is remotely fetching results, so that
we can display this (potentially time-consuming) phase of execution
to users through the UI.
2013-10-23 22:05:52 -07:00
Reynold Xin c4b187d1db Merge pull request #105 from pwendell/doc-fix
Fixing broken links in programming guide

Unfortunately these are broken in 0.8.0.
2013-10-23 21:56:18 -07:00
Patrick Wendell 4e093b88f8 Fixing broken links in programming guide 2013-10-23 21:28:23 -07:00
Kay Ousterhout b45352e373 Clear akka frame size property in tests 2013-10-23 18:23:28 -07:00