Commit graph

4742 commits

Author SHA1 Message Date
Dan Crankshaw 6cb21ce889 groupEdges() now compiles. Still need some unit tests 2013-10-06 15:33:35 -07:00
Aaron Davidson 718e8c2052 Change url format to spark://host1:port1,host2:port2
This replaces the format of spark://host1:port1,spark://host2:port2 and is more
consistent with ZooKeeper's zk:// urls.
2013-10-06 00:02:08 -07:00
Aaron Davidson e1190229e1 Add end-to-end test for standalone scheduler fault tolerance
Docker files drawn mostly from Matt Masse. Some updates from Andre Schumacher.
2013-10-05 23:20:31 -07:00
Patrick Wendell d585613ee2 Merge pull request #37 from pwendell/merge-0.8
merge in remaining changes from `branch-0.8`

This merges in the following changes from `branch-0.8`:

- The scala version is included in the published maven artifact names
- A unit tests which had non-deterministic failures is ignored (see SPARK-908)
- A minor documentation change shows the short version instead of the full version
- Moving the kafka jar to be "provided"
- Changing the default spark ec2 version.
- Some spacing changes caused by Maven's release plugin

Note that I've squashed this into a single commit rather than pull in the branch-0.8 history. There are a bunch of release/revert commits there that make the history super ugly.
2013-10-05 22:57:05 -07:00
Patrick Wendell aa9fb84994 Merging build changes in from 0.8 2013-10-05 22:07:00 -07:00
Dan Crankshaw 730a3156d3 Added initial groupEdges code. Still a prototype, I haven't figured out quite how it should all work yet. 2013-10-05 19:44:28 -07:00
Matei Zaharia 4a25b116d4 Merge pull request #20 from harveyfeng/hadoop-config-cache
Allow users to pass broadcasted Configurations and cache InputFormats across Hadoop file reads.

Note: originally from https://github.com/mesos/spark/pull/942

Currently motivated by Shark queries on Hive-partitioned tables, where there's a JobConf broadcast for every Hive-partition (i.e., every subdirectory read). The only thing different about those JobConfs is the input path - the Hadoop Configuration that the JobConfs are constructed from remain the same.
This PR only modifies the old Hadoop API RDDs, but similar additions to the new API might reduce computation latencies a little bit for high-frequency FileInputDStreams (which only uses the new API right now).

As a small bonus, added InputFormats caching, to avoid reflection calls for every RDD#compute().

Few other notes:

Added a general soft-reference hashmap in SparkHadoopUtil because I wanted to avoid adding another class to SparkEnv.
SparkContext default hadoopConfiguration isn't cached. There's no equals() method for Configuration, so there isn't a good way to determine when configuration properties have changed.
2013-10-05 19:28:55 -07:00
Harvey Feng 6a2bbec5e3 Some comments regarding JobConf and InputFormat caching for HadoopRDDs. 2013-10-05 17:53:58 -07:00
Reynold Xin 8fc68d04bd Merge pull request #36 from pwendell/versions
Bumping EC2 default version in master to .

This change was already made on . This PR ports the change up to master.
2013-10-05 17:24:35 -07:00
Harvey Feng 96929f28bb Make HadoopRDD object Spark private. 2013-10-05 17:14:19 -07:00
Patrick Wendell 2484b84678 Bumping EC2 default version in master to 0.8.0. 2013-10-05 16:59:11 -07:00
Harvey Feng b5e93c1227 Fix API changes; lines > 100 chars. 2013-10-05 16:57:08 -07:00
Dan Crankshaw bfedbee13a Edge partitioner now partitions by canonical edge so all edges between two vertices (in either direction) will be sent to same machine. 2013-10-05 16:04:57 -07:00
Dan Crankshaw e096cbe90e Added 2D canonical edge partitioner 2013-10-05 15:20:15 -07:00
Aaron Davidson 0f070279e7 Address Matei's comments 2013-10-05 15:15:29 -07:00
Matei Zaharia 100222b048 Merge pull request #27 from davidmccauley/master
SPARK-920/921 - JSON endpoint updates

920 - Removal of duplicate scheme part of Spark URI, it was appearing as spark://spark//host:port in the JSON field.

JSON now delivered as:
url:spark://127.0.0.1:7077

921 - Adding the URL of the Main Application UI will allow custom interfaces (that use the JSON output) to redirect from the standalone UI.
2013-10-05 13:38:59 -07:00
Matei Zaharia 08641932bd Merge pull request #33 from AndreSchumacher/pyspark_partition_key_change
Fixing SPARK-602: PythonPartitioner

Currently PythonPartitioner determines partition ID by hashing a
byte-array representation of PySpark's key. This PR lets
PythonPartitioner use the actual partition ID, which is required e.g.
for sorting via PySpark.
2013-10-05 13:25:18 -07:00
Mridul Muralidharan b5025d90bb - Allow for finer control of cleaner
- Address review comments, move to incubator spark
- Also includes a change to speculation - including preventing exceptions in rare cases.
2013-10-06 00:35:51 +05:30
Aaron Davidson db6f154940 Fix race conditions during recovery
One major change was the use of messages instead of raw functions as the
parameter of Akka scheduled timers. Since messages are serialized, unlike
raw functions, the behavior is easier to think about and doesn't cause
race conditions when exceptions are thrown.

Another change is to avoid using global pointers that might change without
a lock.
2013-10-04 19:54:33 -07:00
Kay Ousterhout 7b5ae23a37 Renamed StandaloneX to CoarseGrainedX.
The previous names were confusing because the components weren't just
used in Standalone mode -- in fact, the scheduler used for Standalone
mode is called SparkDeploySchedulerBackend. So, the previous names
were misleading.
2013-10-04 13:56:43 -07:00
Andre Schumacher c84946fe21 Fixing SPARK-602: PythonPartitioner
Currently PythonPartitioner determines partition ID by hashing a
byte-array representation of PySpark's key. This PR lets
PythonPartitioner use the actual partition ID, which is required e.g.
for sorting via PySpark.
2013-10-04 11:56:47 -07:00
Dan Crankshaw 61ffcdeae7 Merge pull request #15 from dcrankshaw/master
Add synthetic generators
2013-10-04 10:52:17 -07:00
Nick Pentreath 93b96b44d7 Adding implicit feedback ALS to MLlib user guide 2013-10-04 14:39:44 +02:00
Nick Pentreath c6ceaeae50 Style fix using 'if' rather than 'match' on boolean 2013-10-04 13:52:53 +02:00
Nick Pentreath 6a7836cddc Fixing closing brace indentation 2013-10-04 13:33:01 +02:00
Nick Pentreath 0bd9b373d1 Reverting to using comma-delimited split 2013-10-04 13:30:33 +02:00
Nick Pentreath 1cbdcb9cb6 Merge remote-tracking branch 'upstream/master' into implicit-als 2013-10-04 13:25:34 +02:00
Dan Crankshaw da3e123afb Removed some comments 2013-10-03 18:11:35 -07:00
Dan Crankshaw 1ee60d3b34 Fixed bug in sampleLogNormal 2013-10-03 17:46:37 -07:00
Reynold Xin d29e8035a0 Added countAsync and various unit tests for async actions. 2013-10-03 15:13:44 -07:00
Matei Zaharia 232765f7b2 Merge pull request #26 from Du-Li/master
fixed a wildcard bug in make-distribution.sh; ask sbt to check local
maven repo in project/SparkBuild.scala

(1) fixed a wildcard bug in make-distribution.sh:
with the wildcard * in quotes, this cp command failed. it worked after
moving the wildcard out quotes.

(2) ask sbt to check local maven repo in SparkBuild.scala:
To build Spark (0.9.0-SNAPSHOT) with the HEAD of mesos (0.15.0), I must
do "make maven-install" under mesos/build, which publishes the java .jar
file under ~/.m2. However, when building Spark (after pointing mesos to
version 0.15.0), sbt uses ivy which by default only checks ~/.ivy2. This
change is to tell sbt to also check ~/.m2.
2013-10-03 12:00:48 -07:00
Matei Zaharia 405e69bb20 Merge pull request #25 from CruncherBigData/master
Update README: updated the link
2013-10-03 10:52:41 -07:00
Matei Zaharia 49dbfccf6b Merge pull request #28 from tgravescs/sparYarnAppName
Allow users to set the application name for Spark on Yarn
2013-10-03 10:52:06 -07:00
Dan Crankshaw 27b442dc06 Fixed annotation import 2013-10-03 10:29:00 -07:00
Dan Crankshaw 8edd499eff Added rmat graph generator 2013-10-03 10:21:34 -07:00
tgravescs 0fff4ee852 Adding in the --addJars option to make SparkContext.addJar work on yarn and cleanup
the classpaths
2013-10-03 11:52:16 -05:00
tgravescs c021b8c202 Add default value to usage statement 2013-10-03 08:07:19 -05:00
Reynold Xin 802bfb870d - Created AsyncRDDActions.
- Make FutureJob a Scala Future instead of Java Future.
2013-10-03 01:22:28 -07:00
Reynold Xin e8e917f209 Merge branch 'master' into kill
Conflicts:
	core/src/main/scala/org/apache/spark/TaskEndReason.scala
	core/src/main/scala/org/apache/spark/executor/Executor.scala
	core/src/main/scala/org/apache/spark/scheduler/cluster/ClusterTaskSetManager.scala
	core/src/test/scala/org/apache/spark/scheduler/DAGSchedulerSuite.scala
2013-10-02 23:01:34 -07:00
Matei Zaharia e597ea34a6 Merge pull request #10 from kayousterhout/results_through-bm
Send Task results through the block manager when larger than Akka frame size (fixes SPARK-669).

This change requires adding an extra failure mode: tasks can complete
successfully, but the result gets lost or flushed from the block manager
before it's been fetched.

This change also moves the deserialization of tasks into a separate thread, so it's no longer part of the DAG scheduler's tight loop. This should improve scheduler throughput, particularly when tasks are sending back large results.

Thanks Josh for writing the original version of this patch!

This is duplicated from the mesos/spark repo: https://github.com/mesos/spark/pull/835
2013-10-02 21:14:24 -07:00
Reynold Xin 1c48ba0d9f Merge remote-tracking branch 'origin' into kill
Conflicts:
	core/src/main/scala/org/apache/spark/scheduler/TaskScheduler.scala
	core/src/main/scala/org/apache/spark/scheduler/cluster/ClusterTaskSetManager.scala
	core/src/test/scala/org/apache/spark/scheduler/DAGSchedulerSuite.scala
2013-10-02 16:40:44 -07:00
tgravescs bc3b20abdc Allow users to set the application name for Spark on Yarn 2013-10-02 12:54:17 -05:00
David McCauley 1577b373a9 SPARK-921 - Add Application UI URL to ApplicationInfo Json output 2013-10-02 15:03:41 +01:00
David McCauley 351da54676 SPARK-920 - JSON endpoint URI scheme part (spark://) duplicated 2013-10-02 13:23:38 +01:00
Du Li 9fd6bba60d ask ivy/sbt to check local maven repo under ~/.m2 2013-10-01 15:46:51 -07:00
Du Li 0d19f00e9e fixed a bug of using wildcard in quotes 2013-10-01 15:42:06 -07:00
CruncherBigData c85f720588 Update README 2013-10-01 09:05:03 -07:00
Kay Ousterhout 0dcad2edcb Added additional unit test for repeated task failures 2013-09-30 23:26:15 -07:00
Kay Ousterhout dea4677c88 Fixed compilation errors and broken test. 2013-09-30 22:07:01 -07:00
Kay Ousterhout 8deda427bc Merge remote-tracking branch 'upstream/master' into results_through-bm
Conflicts:
	core/src/main/scala/org/apache/spark/scheduler/cluster/ClusterScheduler.scala
	core/src/main/scala/org/apache/spark/scheduler/cluster/ClusterTaskSetManager.scala
	core/src/main/scala/org/apache/spark/scheduler/local/LocalTaskSetManager.scala
2013-09-30 10:16:58 -07:00