4456 commits

Author SHA1 Message Date
Patrick Wendell 8b377718b8 Responses to review 2013-10-07 20:03:35 -07:00
Matei Zaharia a8725bf8f8 Don't allocate Kryo buffers unless needed 2013-10-07 19:16:35 -07:00
Patrick Wendell 391133f66a Fix inconsistent and incorrect log messages in shuffle read path 2013-10-07 17:24:18 -07:00
Patrick Wendell 02f37ee853 Merge pull request #39 from pwendell/master
Adding Shark 0.7.1 to EC2 scripts

This adds a newer version of Shark to the ec2 scripts. I've tested this for both Hadoop1 and Hadoop2 clusters.
2013-10-07 15:48:52 -07:00
Patrick Wendell 3745a1827f Adding Shark 0.7.1 to EC2 scripts 2013-10-07 15:03:42 -07:00
Andre Schumacher fdbae41e88 SPARK-705: implement sortByKey() in PySpark 2013-10-07 12:16:33 -07:00
Reynold Xin 5218e46178 Updated Kryo registration. 2013-10-07 11:48:50 -07:00
Reynold Xin 4f916f5302 Created a MessageToPartition class to send messages without saving the partition id. 2013-10-07 11:31:00 -07:00
Reynold Xin 213b70a2db Merge pull request #31 from sundeepn/branch-0.8
Resolving package conflicts with hadoop 0.23.9

Hadoop 0.23.9 has a package conflict with easymock's dependencies.

(cherry picked from commit 023e3fdf00)
Signed-off-by: Reynold Xin <rxin@apache.org>
2013-10-07 10:54:22 -07:00
Nick Pentreath a5e58b8f98 Merge branch 'master' into implicit-als 2013-10-07 11:46:17 +02:00
Nick Pentreath b0f5f4d441 Bumping up test matrix size to eliminate random failures 2013-10-07 11:44:22 +02:00
Dan Crankshaw 2a8f3db94d Fixed groupEdgeTriplets - it now passes a basic unit test.
The problem was with the way the EdgeTripletRDD iterator worked. Calling
toList on it returned the last value repeatedly. Fixed by overriding
toList in the iterator.
2013-10-06 19:52:40 -07:00
Kay Ousterhout fdc52b2f8b Added back fully qualified class name 2013-10-06 18:45:43 -07:00
Dan Crankshaw 0d3ea36fd8 Added a groupEdges and a groupEdgeTriplets method. For some reason the groupEdgeTriplets method isn't properly iterating through the set of edges and thus is returning the wrong result. groupEdges seems to be working. 2013-10-06 18:34:23 -07:00
Dan Crankshaw 6cb21ce889 groupEdges() now compiles. Still need some unit tests 2013-10-06 15:33:35 -07:00
Aaron Davidson 718e8c2052 Change url format to spark://host1:port1,host2:port2
This replaces the format of spark://host1:port1,spark://host2:port2 and is more
consistent with ZooKeeper's zk:// urls.
2013-10-06 00:02:08 -07:00
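As a rough illustration of the URL format introduced in 718e8c2052 (a sketch only, not Spark's actual parsing code; `parse_master_url` is a hypothetical helper), a single `spark://` prefix followed by comma-separated `host:port` pairs could be split like this:

```python
def parse_master_url(url):
    """Split spark://host1:port1,host2:port2 into (host, port) pairs.

    Hypothetical helper for illustration; not part of Spark.
    """
    prefix = "spark://"
    if not url.startswith(prefix):
        raise ValueError("expected a URL starting with " + prefix)
    masters = []
    for entry in url[len(prefix):].split(","):
        host, sep, port = entry.rpartition(":")
        # Reject empty parts, non-numeric ports, and entries that repeat the
        # scheme (the older spark://host1:port1,spark://host2:port2 form).
        if not sep or not host or "/" in host or not port.isdigit():
            raise ValueError("expected host:port, got " + repr(entry))
        masters.append((host, int(port)))
    return masters


print(parse_master_url("spark://host1:7077,host2:7078"))
```

Under these assumptions the older form, which repeats `spark://` for every master, is rejected rather than silently accepted.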
Aaron Davidson e1190229e1 Add end-to-end test for standalone scheduler fault tolerance
Docker files drawn mostly from Matt Masse. Some updates from Andre Schumacher.
2013-10-05 23:20:31 -07:00
Patrick Wendell d585613ee2 Merge pull request #37 from pwendell/merge-0.8
merge in remaining changes from `branch-0.8`

This merges in the following changes from `branch-0.8`:

- The scala version is included in the published maven artifact names
- A unit test that had non-deterministic failures is ignored (see SPARK-908)
- A minor documentation change shows the short version instead of the full version
- Moving the kafka jar to be "provided"
- Changing the default spark ec2 version.
- Some spacing changes caused by Maven's release plugin

Note that I've squashed this into a single commit rather than pull in the branch-0.8 history. There are a bunch of release/revert commits there that make the history super ugly.
2013-10-05 22:57:05 -07:00
Patrick Wendell aa9fb84994 Merging build changes in from 0.8 2013-10-05 22:07:00 -07:00
Dan Crankshaw 730a3156d3 Added initial groupEdges code. Still a prototype, I haven't figured out quite how it should all work yet. 2013-10-05 19:44:28 -07:00
Matei Zaharia 4a25b116d4 Merge pull request #20 from harveyfeng/hadoop-config-cache
Allow users to pass broadcasted Configurations and cache InputFormats across Hadoop file reads.

Note: originally from https://github.com/mesos/spark/pull/942

Currently motivated by Shark queries on Hive-partitioned tables, where there's a JobConf broadcast for every Hive-partition (i.e., every subdirectory read). The only thing different about those JobConfs is the input path - the Hadoop Configuration that the JobConfs are constructed from remains the same.
This PR only modifies the old Hadoop API RDDs, but similar additions to the new API might reduce computation latencies a little bit for high-frequency FileInputDStreams (which only use the new API right now).

As a small bonus, added InputFormats caching, to avoid reflection calls for every RDD#compute().

A few other notes:

- Added a general soft-reference hashmap in SparkHadoopUtil because I wanted to avoid adding another class to SparkEnv.
- SparkContext default hadoopConfiguration isn't cached. There's no equals() method for Configuration, so there isn't a good way to determine when configuration properties have changed.
2013-10-05 19:28:55 -07:00
Harvey Feng 6a2bbec5e3 Some comments regarding JobConf and InputFormat caching for HadoopRDDs. 2013-10-05 17:53:58 -07:00
Reynold Xin 8fc68d04bd Merge pull request #36 from pwendell/versions
Bumping EC2 default version in master to .

This change was already made on . This PR ports the change up to master.
2013-10-05 17:24:35 -07:00
Harvey Feng 96929f28bb Make HadoopRDD object Spark private. 2013-10-05 17:14:19 -07:00
Patrick Wendell 2484b84678 Bumping EC2 default version in master to 0.8.0. 2013-10-05 16:59:11 -07:00
Harvey Feng b5e93c1227 Fix API changes; lines > 100 chars. 2013-10-05 16:57:08 -07:00
Dan Crankshaw bfedbee13a Edge partitioner now partitions by canonical edge so all edges between two vertices (in either direction) will be sent to same machine. 2013-10-05 16:04:57 -07:00
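The idea of partitioning by canonical edge in bfedbee13a can be sketched as follows. This is a minimal illustration, assuming integer vertex ids and a simple one-dimensional hash; it is not the project's partitioner (the commits refer to a 2D scheme), and `canonical_edge` and `edge_partition` are hypothetical names:

```python
def canonical_edge(src, dst):
    """Order the endpoints so (a, b) and (b, a) map to the same pair."""
    return (src, dst) if src <= dst else (dst, src)


def edge_partition(src, dst, num_partitions):
    """Pick a partition from the canonical edge, ignoring direction.

    The mixing function is an arbitrary stand-in chosen for illustration.
    """
    lo, hi = canonical_edge(src, dst)
    return (lo * 31 + hi) % num_partitions


# Both directions of the same edge land in the same partition.
print(edge_partition(3, 7, 4) == edge_partition(7, 3, 4))
```

Because the endpoints are ordered before hashing, every edge between two given vertices is assigned to one partition regardless of direction.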
Dan Crankshaw e096cbe90e Added 2D canonical edge partitioner 2013-10-05 15:20:15 -07:00
Aaron Davidson 0f070279e7 Address Matei's comments 2013-10-05 15:15:29 -07:00
Matei Zaharia 100222b048 Merge pull request #27 from davidmccauley/master
SPARK-920/921 - JSON endpoint updates

920 - Removal of duplicate scheme part of Spark URI, it was appearing as spark://spark//host:port in the JSON field.

JSON now delivered as:
url:spark://127.0.0.1:7077

921 - Adding the URL of the Main Application UI will allow custom interfaces (that use the JSON output) to redirect from the standalone UI.
2013-10-05 13:38:59 -07:00
Matei Zaharia 08641932bd Merge pull request #33 from AndreSchumacher/pyspark_partition_key_change
Fixing SPARK-602: PythonPartitioner

Currently PythonPartitioner determines partition ID by hashing a
byte-array representation of PySpark's key. This PR lets
PythonPartitioner use the actual partition ID, which is required e.g.
for sorting via PySpark.
2013-10-05 13:25:18 -07:00
Mridul Muralidharan b5025d90bb - Allow for finer control of cleaner
- Address review comments, move to incubator spark
- Also includes a change to speculation - including preventing exceptions in rare cases.
2013-10-06 00:35:51 +05:30
Aaron Davidson db6f154940 Fix race conditions during recovery
One major change was the use of messages instead of raw functions as the
parameter of Akka scheduled timers. Since messages are serialized, unlike
raw functions, the behavior is easier to think about and doesn't cause
race conditions when exceptions are thrown.

Another change is to avoid using global pointers that might change without
a lock.
2013-10-04 19:54:33 -07:00
Kay Ousterhout 7b5ae23a37 Renamed StandaloneX to CoarseGrainedX.
The previous names were confusing because the components weren't just
used in Standalone mode -- in fact, the scheduler used for Standalone
mode is called SparkDeploySchedulerBackend. So, the previous names
were misleading.
2013-10-04 13:56:43 -07:00
Andre Schumacher c84946fe21 Fixing SPARK-602: PythonPartitioner
Currently PythonPartitioner determines partition ID by hashing a
byte-array representation of PySpark's key. This PR lets
PythonPartitioner use the actual partition ID, which is required e.g.
for sorting via PySpark.
2013-10-04 11:56:47 -07:00
Dan Crankshaw 61ffcdeae7 Merge pull request #15 from dcrankshaw/master
Add synthetic generators
2013-10-04 10:52:17 -07:00
Nick Pentreath 93b96b44d7 Adding implicit feedback ALS to MLlib user guide 2013-10-04 14:39:44 +02:00
Nick Pentreath c6ceaeae50 Style fix using 'if' rather than 'match' on boolean 2013-10-04 13:52:53 +02:00
Nick Pentreath 6a7836cddc Fixing closing brace indentation 2013-10-04 13:33:01 +02:00
Nick Pentreath 0bd9b373d1 Reverting to using comma-delimited split 2013-10-04 13:30:33 +02:00
Nick Pentreath 1cbdcb9cb6 Merge remote-tracking branch 'upstream/master' into implicit-als 2013-10-04 13:25:34 +02:00
Dan Crankshaw da3e123afb Removed some comments 2013-10-03 18:11:35 -07:00
Dan Crankshaw 1ee60d3b34 Fixed bug in sampleLogNormal 2013-10-03 17:46:37 -07:00
Reynold Xin d29e8035a0 Added countAsync and various unit tests for async actions. 2013-10-03 15:13:44 -07:00
Matei Zaharia 232765f7b2 Merge pull request #26 from Du-Li/master
fixed a wildcard bug in make-distribution.sh; ask sbt to check local
maven repo in project/SparkBuild.scala

(1) fixed a wildcard bug in make-distribution.sh:
With the wildcard * in quotes, this cp command failed. It worked after
moving the wildcard out of the quotes.

(2) ask sbt to check local maven repo in SparkBuild.scala:
To build Spark (0.9.0-SNAPSHOT) with the HEAD of mesos (0.15.0), I must
do "make maven-install" under mesos/build, which publishes the java .jar
file under ~/.m2. However, when building Spark (after pointing mesos to
version 0.15.0), sbt uses ivy which by default only checks ~/.ivy2. This
change is to tell sbt to also check ~/.m2.
2013-10-03 12:00:48 -07:00
Matei Zaharia 405e69bb20 Merge pull request #25 from CruncherBigData/master
Update README: updated the link
2013-10-03 10:52:41 -07:00
Matei Zaharia 49dbfccf6b Merge pull request #28 from tgravescs/sparYarnAppName
Allow users to set the application name for Spark on Yarn
2013-10-03 10:52:06 -07:00
Dan Crankshaw 27b442dc06 Fixed annotation import 2013-10-03 10:29:00 -07:00
Dan Crankshaw 8edd499eff Added rmat graph generator 2013-10-03 10:21:34 -07:00
tgravescs 0fff4ee852 Adding in the --addJars option to make SparkContext.addJar work on yarn and clean up
the classpaths
2013-10-03 11:52:16 -05:00