ODIn/spark-instrumented-optimizer

Author	SHA1	Message	Date
Patrick Wendell	8b377718b8	Responses to review	2013-10-07 20:03:35 -07:00
Matei Zaharia	a8725bf8f8	Don't allocate Kryo buffers unless needed	2013-10-07 19:16:35 -07:00
Patrick Wendell	391133f66a	Fix inconsistent and incorrect log messages in shuffle read path	2013-10-07 17:24:18 -07:00
Patrick Wendell	02f37ee853	Merge pull request #39 from pwendell/master Adding Shark 0.7.1 to EC2 scripts This adds a newer version of Shark to the ec2 scripts. I've tested this for both Hadoop1 and Hadoop2 clusters.	2013-10-07 15:48:52 -07:00
Patrick Wendell	3745a1827f	Adding Shark 0.7.1 to EC2 scripts	2013-10-07 15:03:42 -07:00
Andre Schumacher	fdbae41e88	SPARK-705: implement sortByKey() in PySpark	2013-10-07 12:16:33 -07:00
Reynold Xin	5218e46178	Updated Kryo registration.	2013-10-07 11:48:50 -07:00
Reynold Xin	4f916f5302	Created a MessageToPartition class to send messages without saving the partition id.	2013-10-07 11:31:00 -07:00
Reynold Xin	213b70a2db	Merge pull request #31 from sundeepn/branch-0.8 Resolving package conflicts with hadoop 0.23.9 Hadoop 0.23.9 is having a package conflict with easymock's dependencies. (cherry picked from commit `023e3fdf00`) Signed-off-by: Reynold Xin <rxin@apache.org>	2013-10-07 10:54:22 -07:00
Nick Pentreath	a5e58b8f98	Merge branch 'master' into implicit-als	2013-10-07 11:46:17 +02:00
Nick Pentreath	b0f5f4d441	Bumping up test matrix size to eliminate random failures	2013-10-07 11:44:22 +02:00
Dan Crankshaw	2a8f3db94d	Fixed groupEdgeTriplets - it now passes a basic unit test. The problem was with the way the EdgeTripletRDD iterator worked. Calling toList on it returned the last value repeatedly. Fixed by overriding toList in the iterator.	2013-10-06 19:52:40 -07:00
Kay Ousterhout	fdc52b2f8b	Added back fully qualified class name	2013-10-06 18:45:43 -07:00
Dan Crankshaw	0d3ea36fd8	Added a groupEdges and a groupEdgeTriplets method. For some reason the groupEdgeTriplets method isn't properly iterating through the set of edges and thus is returning the wrong result. groupEdges seems to be working.	2013-10-06 18:34:23 -07:00
Dan Crankshaw	6cb21ce889	groupEdges() now compiles. Still need some unit tests	2013-10-06 15:33:35 -07:00
Aaron Davidson	718e8c2052	Change url format to spark://host1:port1,host2:port2 This replaces the format of spark://host1:port1,spark://host2:port2 and is more consistent with ZooKeeper's zk:// urls.	2013-10-06 00:02:08 -07:00
Aaron Davidson	e1190229e1	Add end-to-end test for standalone scheduler fault tolerance Docker files drawn mostly from Matt Masse. Some updates from Andre Schumacher.	2013-10-05 23:20:31 -07:00
Patrick Wendell	d585613ee2	Merge pull request #37 from pwendell/merge-0.8 merge in remaining changes from `branch-0.8` This merges in the following changes from `branch-0.8`: - The scala version is included in the published maven artifact names - A unit test which had non-deterministic failures is ignored (see SPARK-908) - A minor documentation change shows the short version instead of the full version - Moving the kafka jar to be "provided" - Changing the default spark ec2 version. - Some spacing changes caused by Maven's release plugin Note that I've squashed this into a single commit rather than pull in the branch-0.8 history. There are a bunch of release/revert commits there that make the history super ugly.	2013-10-05 22:57:05 -07:00
Patrick Wendell	aa9fb84994	Merging build changes in from 0.8	2013-10-05 22:07:00 -07:00
Dan Crankshaw	730a3156d3	Added initial groupEdges code. Still a prototype, I haven't figured out quite how it should all work yet.	2013-10-05 19:44:28 -07:00
Matei Zaharia	4a25b116d4	Merge pull request #20 from harveyfeng/hadoop-config-cache Allow users to pass broadcasted Configurations and cache InputFormats across Hadoop file reads. Note: originally from https://github.com/mesos/spark/pull/942 Currently motivated by Shark queries on Hive-partitioned tables, where there's a JobConf broadcast for every Hive-partition (i.e., every subdirectory read). The only thing different about those JobConfs is the input path - the Hadoop Configuration that the JobConfs are constructed from remains the same. This PR only modifies the old Hadoop API RDDs, but similar additions to the new API might reduce computation latencies a little bit for high-frequency FileInputDStreams (which only uses the new API right now). As a small bonus, added InputFormats caching, to avoid reflection calls for every RDD#compute(). Few other notes: Added a general soft-reference hashmap in SparkHadoopUtil because I wanted to avoid adding another class to SparkEnv. SparkContext default hadoopConfiguration isn't cached. There's no equals() method for Configuration, so there isn't a good way to determine when configuration properties have changed.	2013-10-05 19:28:55 -07:00
Harvey Feng	6a2bbec5e3	Some comments regarding JobConf and InputFormat caching for HadoopRDDs.	2013-10-05 17:53:58 -07:00
Reynold Xin	8fc68d04bd	Merge pull request #36 from pwendell/versions Bumping EC2 default version in master to `0.8.0`. This change was already made on `branch-0.8`. This PR ports the change up to master.	2013-10-05 17:24:35 -07:00
Harvey Feng	96929f28bb	Make HadoopRDD object Spark private.	2013-10-05 17:14:19 -07:00
Patrick Wendell	2484b84678	Bumping EC2 default version in master to `0.8.0`.	2013-10-05 16:59:11 -07:00
Harvey Feng	b5e93c1227	Fix API changes; lines > 100 chars.	2013-10-05 16:57:08 -07:00
Dan Crankshaw	bfedbee13a	Edge partitioner now partitions by canonical edge so all edges between two vertices (in either direction) will be sent to same machine.	2013-10-05 16:04:57 -07:00
Dan Crankshaw	e096cbe90e	Added 2D canonical edge partitioner	2013-10-05 15:20:15 -07:00
Aaron Davidson	0f070279e7	Address Matei's comments	2013-10-05 15:15:29 -07:00
Matei Zaharia	100222b048	Merge pull request #27 from davidmccauley/master SPARK-920/921 - JSON endpoint updates 920 - Removal of duplicate scheme part of Spark URI, it was appearing as spark://spark//host:port in the JSON field. JSON now delivered as: url:spark://127.0.0.1:7077 921 - Adding the URL of the Main Application UI will allow custom interfaces (that use the JSON output) to redirect from the standalone UI.	2013-10-05 13:38:59 -07:00
Matei Zaharia	08641932bd	Merge pull request #33 from AndreSchumacher/pyspark_partition_key_change Fixing SPARK-602: PythonPartitioner Currently PythonPartitioner determines partition ID by hashing a byte-array representation of PySpark's key. This PR lets PythonPartitioner use the actual partition ID, which is required e.g. for sorting via PySpark.	2013-10-05 13:25:18 -07:00
Mridul Muralidharan	b5025d90bb	- Allow for finer control of cleaner - Address review comments, move to incubator spark - Also includes a change to speculation - including preventing exceptions in rare cases.	2013-10-06 00:35:51 +05:30
Aaron Davidson	db6f154940	Fix race conditions during recovery One major change was the use of messages instead of raw functions as the parameter of Akka scheduled timers. Since messages are serialized, unlike raw functions, the behavior is easier to think about and doesn't cause race conditions when exceptions are thrown. Another change is to avoid using global pointers that might change without a lock.	2013-10-04 19:54:33 -07:00
Kay Ousterhout	7b5ae23a37	Renamed StandaloneX to CoarseGrainedX. The previous names were confusing because the components weren't just used in Standalone mode -- in fact, the scheduler used for Standalone mode is called SparkDeploySchedulerBackend. So, the previous names were misleading.	2013-10-04 13:56:43 -07:00
Andre Schumacher	c84946fe21	Fixing SPARK-602: PythonPartitioner Currently PythonPartitioner determines partition ID by hashing a byte-array representation of PySpark's key. This PR lets PythonPartitioner use the actual partition ID, which is required e.g. for sorting via PySpark.	2013-10-04 11:56:47 -07:00
Dan Crankshaw	61ffcdeae7	Merge pull request #15 from dcrankshaw/master Add synthetic generators	2013-10-04 10:52:17 -07:00
Nick Pentreath	93b96b44d7	Adding implicit feedback ALS to MLlib user guide	2013-10-04 14:39:44 +02:00
Nick Pentreath	c6ceaeae50	Style fix using 'if' rather than 'match' on boolean	2013-10-04 13:52:53 +02:00
Nick Pentreath	6a7836cddc	Fixing closing brace indentation	2013-10-04 13:33:01 +02:00
Nick Pentreath	0bd9b373d1	Reverting to using comma-delimited split	2013-10-04 13:30:33 +02:00
Nick Pentreath	1cbdcb9cb6	Merge remote-tracking branch 'upstream/master' into implicit-als	2013-10-04 13:25:34 +02:00
Dan Crankshaw	da3e123afb	Removed some comments	2013-10-03 18:11:35 -07:00
Dan Crankshaw	1ee60d3b34	Fixed bug in sampleLogNormal	2013-10-03 17:46:37 -07:00
Reynold Xin	d29e8035a0	Added countAsync and various unit tests for async actions.	2013-10-03 15:13:44 -07:00
Matei Zaharia	232765f7b2	Merge pull request #26 from Du-Li/master fixed a wildcard bug in make-distribution.sh; ask sbt to check local maven repo in project/SparkBuild.scala (1) fixed a wildcard bug in make-distribution.sh: with the wildcard * in quotes, this cp command failed. it worked after moving the wildcard out of the quotes. (2) ask sbt to check local maven repo in SparkBuild.scala: To build Spark (0.9.0-SNAPSHOT) with the HEAD of mesos (0.15.0), I must do "make maven-install" under mesos/build, which publishes the java .jar file under ~/.m2. However, when building Spark (after pointing mesos to version 0.15.0), sbt uses ivy which by default only checks ~/.ivy2. This change is to tell sbt to also check ~/.m2.	2013-10-03 12:00:48 -07:00
Matei Zaharia	405e69bb20	Merge pull request #25 from CruncherBigData/master Update README: updated the link	2013-10-03 10:52:41 -07:00
Matei Zaharia	49dbfccf6b	Merge pull request #28 from tgravescs/sparYarnAppName Allow users to set the application name for Spark on Yarn	2013-10-03 10:52:06 -07:00
Dan Crankshaw	27b442dc06	Fixed annotation import	2013-10-03 10:29:00 -07:00
Dan Crankshaw	8edd499eff	Added rmat graph generator	2013-10-03 10:21:34 -07:00
tgravescs	0fff4ee852	Adding in the --addJars option to make SparkContext.addJar work on yarn and cleanup the classpaths	2013-10-03 11:52:16 -05:00

4456 commits