ODIn/spark-instrumented-optimizer

Author	SHA1	Message	Date
Joseph E. Gonzalez	80e4ec3278	IndexedRDD now only supports unique keys	2013-10-16 00:16:44 -07:00
Joseph E. Gonzalez	3cb6dffce0	adding indexed reduce by key	2013-10-15 18:55:06 -07:00
Joseph E. Gonzalez	bf059691f0	Adding a few extra comments.	2013-10-14 19:59:11 -07:00
Joseph E. Gonzalez	11a44d0ec9	Introducing indexedrdd The rest of indexed rdd	2013-10-14 19:46:42 -07:00
Reynold Xin	3b11f43e36	Merge pull request #57 from aarondav/bid Refactor BlockId into an actual type Converts all of our BlockId strings into actual BlockId types. Here are some advantages of doing this now: + Type safety + Code clarity - it's now obvious what the key of a shuffle or rdd block is, for instance. Additionally, appearing in tuple/map type signatures is a big readability bonus. A Seq[(String, BlockStatus)] is not very clear. Further, we can now use more Scala features, like matching on BlockId types. + Explicit usage - we can now formally tell where various BlockIds are being used (without doing string searches); this makes updating current BlockIds a much clearer process, and compiler-supported. (I'm looking at you, shuffle file consolidation.) + It will only get harder to make this change as time goes on. Downside is, of course, that this is a very invasive change touching a lot of different files, which will inevitably lead to merge conflicts for many.	2013-10-14 14:20:01 -07:00
Aaron Davidson	4a45019fb0	Address Matei's comments	2013-10-14 00:24:17 -07:00
Aaron Davidson	da896115ec	Change BlockId filename to name + rest of Patrick's comments	2013-10-13 11:15:02 -07:00
Aaron Davidson	d60352283c	Add unit test and address rest of Reynold's comments	2013-10-12 22:45:15 -07:00
Aaron Davidson	a395911138	Refactor BlockId into an actual type This is an unfortunately invasive change which converts all of our BlockId strings into actual BlockId types. Here are some advantages of doing this now: + Type safety + Code clarity - it's now obvious what the key of a shuffle or rdd block is, for instance. Additionally, appearing in tuple/map type signatures is a big readability bonus. A Seq[(String, BlockStatus)] is not very clear. Further, we can now use more Scala features, like matching on BlockId types. + Explicit usage - we can now formally tell where various BlockIds are being used (without doing string searches); this makes updating current BlockIds a much clearer process, and compiler-supported. (I'm looking at you, shuffle file consolidation.) + It will only get harder to make this change as time goes on. Since this touches a lot of files, it'd be best to either get this patch in quickly or throw it on the ground to avoid too many secondary merge conflicts.	2013-10-12 22:44:57 -07:00
Reynold Xin	99796904ae	Merge pull request #52 from harveyfeng/hadoop-closure Add an optional closure parameter to HadoopRDD instantiation to use when creating local JobConfs. Having HadoopRDD accept this optional closure eliminates the need for the HadoopFileRDD added earlier. It makes the HadoopRDD more general, in that the caller can specify any JobConf initialization flow.	2013-10-12 21:23:26 -07:00
Harvey Feng	6c32aab87d	Remove the new HadoopRDD constructor from SparkContext API, plus some minor style changes.	2013-10-12 21:02:08 -07:00
Reynold Xin	dca80094d3	Merge pull request #54 from aoiwelle/remove_unused_imports Remove unnecessary mutable imports It appears that the imports aren't necessary here.	2013-10-11 16:08:15 -07:00
Matei Zaharia	fb25f32300	Merge pull request #53 from witgo/master Add a zookeeper compile dependency to fix build in maven Add a zookeeper compile dependency to fix build in maven	2013-10-11 15:44:43 -07:00
Matei Zaharia	d6ead47809	Merge pull request #32 from mridulm/master Address review comments, move to incubator spark Also includes a small fix to speculative execution. <edit> Continued from https://github.com/mesos/spark/pull/914 </edit>	2013-10-11 15:43:01 -07:00
Neal Wiggins	67d4a31f87	Remove unnecessary mutable imports	2013-10-11 09:47:27 -07:00
LiGuoqiang	fc60c412ab	Add a zookeeper compile dependency to fix build in maven	2013-10-11 16:31:47 +08:00
Matei Zaharia	8f11c36fe1	Merge remote-tracking branch 'tgravescs/sparkYarnDistCache' Closes #11 Conflicts: docs/running-on-yarn.md yarn/src/main/scala/org/apache/spark/deploy/yarn/ClientArguments.scala	2013-10-10 19:34:33 -07:00
Matei Zaharia	c71499b779	Merge pull request #19 from aarondav/master-zk Standalone Scheduler fault tolerance using ZooKeeper This patch implements full distributed fault tolerance for standalone scheduler Masters. There is only one master Leader at a time, which is actively serving scheduling requests. If this Leader crashes, another master will eventually be elected, reconstruct the state from the first Master, and continue serving scheduling requests. Leader election is performed using the ZooKeeper leader election pattern. We try to minimize the use of ZooKeeper and the assumptions about ZooKeeper's behavior, so there is a layer of retries and session monitoring on top of the ZooKeeper client. Master failover follows directly from the single-node Master recovery via the file system (patch `d5a96fe`), save that the Master state is stored in ZooKeeper instead. Configuration: By default, no recovery mechanism is enabled (spark.deploy.recoveryMode = NONE). By setting spark.deploy.recoveryMode to ZOOKEEPER and setting spark.deploy.zookeeper.url to an appropriate ZooKeeper URL, ZooKeeper recovery mode is enabled. By setting spark.deploy.recoveryMode to FILESYSTEM and setting spark.deploy.recoveryDirectory to an appropriate directory accessible by the Master, we will keep the behavior from `d5a96fe`. Additionally, places where a Master could be specified by a spark:// url can now take comma-delimited lists to specify backup masters. Note that this is only used for registration of NEW Workers and application Clients. Once a Worker or Client has registered with the Master Leader, it is "in the system" and will never need to register again.	2013-10-10 17:16:42 -07:00
Harvey Feng	5a99e67894	Add an optional closure parameter to HadoopRDD instantiation to use when creating any local JobConfs.	2013-10-10 16:35:52 -07:00
Aaron Davidson	66c20635fa	Minor clarification and cleanup to spark-standalone.md	2013-10-10 14:45:12 -07:00
Matei Zaharia	cd08f73483	Merge pull request #44 from mateiz/fast-map A fast and low-memory append-only map for shuffle operations This is a continuation of the old repo's pull request https://github.com/mesos/spark/pull/823 to add a more efficient hashmap class for shuffles. I've optimized and tested this more thoroughly now so I think it's good to go. I've also addressed some of the comments that were outstanding there. The idea is to reduce the cost of shuffles by taking advantage of the properties their hashmaps need. In particular, the hashmaps there are append-only, and a common operation is updating a key's value based on the old value. The included AppendOnlyMap class uses open hashing to use less space than Java's (by not having a linked list per bucket), does not support deletes, and has a changeValue operation to update a key in place without following the hash chain twice. In micro-benchmarks against java.util.HashMap and scala.collection.mutable.HashMap, this is 20-30% smaller and 10-40% faster depending on the number and type of keys. It's also noticeably faster than fastutil's Object2ObjectOpenHashMap. I've also tested this in Spark apps now. While the speed gain is modest (partly due to other overheads, like serialization), there is some, and I think the lower memory usage is worth it. 
Here's one example where the speedup is most noticeable, in spark-shell on local mode: ``` scala> val nums = sc.parallelize(1 to 8).flatMap(x => (1 to 5e6.toInt)).cache scala> nums.count scala> def time(x: => Unit) = { val now = System.currentTimeMillis; x; System.currentTimeMillis - now } scala> (1 to 8).map(_ => time(nums.map(x => (x % 100000, x)).reduceByKey(_ + _).count) / 1000.0) ``` This prints the following times before and after this change: ``` Before: Vector(4.368, 2.635, 2.549, 2.522, 2.233, 2.222, 2.214, 2.195) After: Vector(3.588, 1.741, 1.706, 1.648, 1.777, 1.81, 1.776, 1.731) ``` I've also run the spark-perf suite, enhanced with some tests that use Ints (https://github.com/amplab/spark-perf/pull/9), and it shows some speedup on those, but less on the string ones (presumably due to existing overhead): https://gist.github.com/mateiz/6897121.	2013-10-10 13:55:47 -07:00
Matei Zaharia	001d13f7b9	Merge branch 'master' into fast-map Conflicts: core/src/test/scala/org/apache/spark/scheduler/SparkListenerSuite.scala	2013-10-10 13:26:43 -07:00
Aaron Davidson	42d8b8efe6	Address Matei's comments on documentation Updates to the documentation and changing some logError()s to logWarning()s.	2013-10-10 00:33:47 -07:00
Reynold Xin	320418f7c8	Merge pull request #49 from mateiz/kryo-fix-2 Fix Chill serialization of Range objects It used to write out each element one by one, creating very large objects.	2013-10-09 16:55:30 -07:00
Reynold Xin	215238cb39	Merge pull request #50 from kayousterhout/SPARK-908 Fix race condition in SparkListenerSuite (fixes SPARK-908).	2013-10-09 16:49:44 -07:00
Matei Zaharia	c84c205289	Fix Chill serialization of Range objects, which used to write out each element, and register user and Spark classes before Chill's serializers to let them override Chill's behavior in general.	2013-10-09 16:23:40 -07:00
Kay Ousterhout	36966f65df	Style fixes	2013-10-09 15:36:34 -07:00
Kay Ousterhout	3f7e9b265c	Fixed comment to use javadoc style	2013-10-09 15:23:04 -07:00
Kay Ousterhout	a34a4e8174	Fix race condition in SparkListenerSuite (fixes SPARK-908).	2013-10-09 15:07:53 -07:00
Matei Zaharia	7827efc87b	Merge pull request #46 from mateiz/py-sort-update Fix PySpark docs and an overly long line of code after #38 Just noticed these after merging that commit (https://github.com/apache/incubator-spark/pull/38).	2013-10-09 15:07:25 -07:00
Patrick Wendell	7b3ae04ea7	Merge pull request #45 from pwendell/metrics_units Use standard abbreviation in metrics description (MBytes -> MB) This is a small change - older commits are shown here because Github hasn't sync'ed yet with apache.	2013-10-09 12:14:19 -07:00
Matei Zaharia	478b2b7edc	Fix PySpark docs and an overly long line of code after `fdbae41e`	2013-10-09 12:08:04 -07:00
Matei Zaharia	b4fa11f6c9	Merge pull request #38 from AndreSchumacher/pyspark_sorting SPARK-705: implement sortByKey() in PySpark This PR contains the implementation of a RangePartitioner in Python and uses its partition ID's to get a global sort in PySpark.	2013-10-09 11:59:47 -07:00
Patrick Wendell	bd3bcc5f8e	Use standard abbreviations in metrics labels	2013-10-09 11:16:24 -07:00
Patrick Wendell	19d445d37c	Merge pull request #22 from GraceH/metrics-naming SPARK-900 Use coarser grained naming for metrics see SPARK-900 Use coarser grained naming for metrics. Now the new metric name is formatted as {XXX.YYY.ZZZ.COUNTER_UNIT}, XXX.YYY.ZZZ represents the group name, which can group several metrics under the same Ganglia view.	2013-10-09 11:08:34 -07:00
Matei Zaharia	3218fa795f	Merge pull request #4 from MLnick/implicit-als Adding algorithm for implicit feedback data to ALS This PR adds the commonly used "implicit feedback" variant to ALS. The implementation is based in part on Mahout's implementation, which is in turn based on [Collaborative Filtering for Implicit Feedback Datasets](http://research.yahoo.com/pub/2433). It has been adapted for the blocked approach used in MLlib. I have tested this implementation against the MovieLens 100k, 1m and 10m datasets, and confirmed that it produces the same RMSE score as Mahout, as well as my own port of Mahout's implicit ALS implementation to Spark (not that RMSE is necessarily the best metric to judge by for implicit feedback, but it provides a consistent metric for comparison). It turned out to be more straightforward than I had thought to add this. The main additions are: 1. Adding `implicitPrefs` boolean flag and `alpha` parameter 2. Added the `computeYtY` method. In each least-squares step, the algorithm requires the computation of `YtY`, where `Y` is the {user, item} factor matrix. Since the factors are already block-distributed in an `RDD`, this is quite straightforward to compute but does add an extra operation over the explicit version (but only twice per iteration) 3. Finally the actual solve step in `updateBlock` boils down to: * a multiplication of the `XtX` matrix by `alpha * rating` * a multiplication of the `Xty` vector by `1 + alpha * rating` * when solving for the factor vector, the implicit variant adds the `YtY` matrix to the LHS 4. Added `trainImplicit` methods in the `ALS` object 5. Added test cases for both Scala and Java - based on achieving a confidence-weighted RMSE score < 0.4 (this is taken from Mahout's test cases) It would be great to get some feedback on this and have people test things out against some datasets (MovieLens and others and perhaps proprietary datasets) both locally and on a cluster if possible.
I have not yet tested on a cluster but will try to do that soon. I have tried to make things as efficient as possible but if there are potential improvements let me know. The results of a run against ml-1m are below (note the vanilla RMSE scores will be very different from the explicit variant): INPUTS ``` iterations=10 factors=10 lambda=0.01 alpha=1 implicitPrefs=true ``` RESULTS ``` Spark MLlib 0.8.0-SNAPSHOT RMSE = 3.1544 Time: 24.834 sec ``` ``` My own port of Mahout's ALS to Spark (updated to 0.8.0-SNAPSHOT) RMSE = 3.1543 Time: 58.708 sec ``` ``` Mahout 0.8 time ./factorize-movielens-1M.sh /path/to/ratings/ml-1m/ratings.dat real 3m48.648s user 6m39.254s sys 0m14.505s RMSE = 3.1539 ``` Results of a run against ml-10m ``` Spark MLlib RMSE = 3.1200 Time: 162.348 sec ``` ``` Mahout 0.8 real 23m2.220s user 43m39.185s sys 0m25.316s RMSE = 3.1187 ```	2013-10-08 23:44:55 -07:00
Matei Zaharia	12d593129d	Create fewer function objects in uses of AppendOnlyMap.changeValue	2013-10-08 23:16:51 -07:00
Matei Zaharia	0b35051f19	Address some comments on code clarity	2013-10-08 23:16:17 -07:00
Matei Zaharia	4acbc5afdd	Moved files that were in the wrong directory after package rename	2013-10-08 23:16:17 -07:00
Matei Zaharia	0e40cfabf8	Fix some review comments	2013-10-08 23:16:16 -07:00
Matei Zaharia	b535db7d89	Added a fast and low-memory append-only map implementation for cogroup and parallel reduce operations	2013-10-08 23:14:38 -07:00
Reynold Xin	e67d5b962a	Merge pull request #43 from mateiz/kryo-fix Don't allocate Kryo buffers unless needed I noticed that the Kryo serializer could be slower than the Java one by 2-3x on small shuffles because it spends a lot of time initializing Kryo Input and Output objects. This is because our default buffer size for them is very large. Since the serializer is often used on streams, I made the initialization lazy for that, and used a smaller buffer (auto-managed by Kryo) for input.	2013-10-08 22:57:38 -07:00
Grace Huang	f7628e4033	remove those futile suffixes like number/count	2013-10-09 08:36:41 +08:00
Aaron Davidson	4ea8ee468f	Add docs for standalone scheduler fault tolerance Also fix a couple HTML/Markdown issues in other files.	2013-10-08 14:18:31 -07:00
Aaron Davidson	749233b869	Revert change to spark-class Also adds comment about how to configure for FaultToleranceTest.	2013-10-08 11:41:52 -07:00
Aaron Davidson	1cd57cd4d3	Add license agreements to dockerfiles	2013-10-08 11:41:12 -07:00
Grace Huang	22bed59d2d	create metrics name manually.	2013-10-08 18:01:11 +08:00
Grace Huang	188abbf8f1	Revert "SPARK-900 Use coarser grained naming for metrics" This reverts commit `4b68be5f3c`.	2013-10-08 17:45:14 +08:00
Grace Huang	a2af6b543a	Revert "remedy the line-wrap while exceeding 100 chars" This reverts commit `892fb8ffa8`.	2013-10-08 17:44:56 +08:00
Reynold Xin	ea34c52102	Merge pull request #42 from pwendell/shuffle-read-perf Fix inconsistent and incorrect log messages in shuffle read path The user-facing messages generated by the CacheManager are currently wrong and somewhat misleading. This patch makes the messages more accurate. It also uses a consistent representation of the partition being fetched (`rdd_xx_yy`) so that it's easier for users to trace what is going on when reading logs.	2013-10-07 20:45:58 -07:00
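
The BlockId refactor described in `a395911138` (merged as pull request #57) replaces string block keys with real types so that the compiler can check them and code can pattern-match on them. The following is a minimal sketch of that idea, not the code from the patch: the class names, fields, the `parse` and `describe` helpers, and the `shuffle_` string format are assumptions made for illustration. Only the `rdd_xx_yy` form appears in the log above.

```scala
// Hypothetical sketch of typed block identifiers; names and formats are illustrative.
sealed abstract class BlockId {
  // String form of the id, e.g. for logs or file names.
  def name: String
}

case class RDDBlockId(rddId: Int, splitIndex: Int) extends BlockId {
  def name: String = s"rdd_${rddId}_${splitIndex}"
}

case class ShuffleBlockId(shuffleId: Int, mapId: Int, reduceId: Int) extends BlockId {
  def name: String = s"shuffle_${shuffleId}_${mapId}_${reduceId}"
}

object BlockId {
  // Regex extractors only match when the whole string fits the pattern.
  private val RDD = "rdd_([0-9]+)_([0-9]+)".r
  private val Shuffle = "shuffle_([0-9]+)_([0-9]+)_([0-9]+)".r

  // Turn a string name back into a typed id, or None if it is not recognized.
  def parse(name: String): Option[BlockId] = name match {
    case RDD(rddId, splitIndex) => Some(RDDBlockId(rddId.toInt, splitIndex.toInt))
    case Shuffle(s, m, r) => Some(ShuffleBlockId(s.toInt, m.toInt, r.toInt))
    case _ => None
  }

  // Because BlockId is sealed, the compiler warns if a case is missing here.
  def describe(id: BlockId): String = id match {
    case RDDBlockId(rddId, split) => s"partition $split of RDD $rddId"
    case ShuffleBlockId(s, m, r) => s"output of map $m for reduce $r in shuffle $s"
  }
}
```

With types like these, a signature such as `Seq[(BlockId, BlockStatus)]` states what the key is, which is the readability gain the commit message points to over `Seq[(String, BlockStatus)]`.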

4212 commits
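
The append-only map from pull request #44 (`cd08f73483`) is described as storing entries without a linked list per bucket, supporting no deletes, and offering a `changeValue` operation that updates a key in place with a single probe sequence. The sketch below illustrates that technique with linear probing over one flat array (usually called open addressing). It is not Spark's `AppendOnlyMap`: the class name `SimpleAppendOnlyMap`, the 0.7 load factor, the hash mixing, and the `get` helper are assumptions, and null keys are not supported.

```scala
// Illustrative sketch only; not the implementation from the patch.
class SimpleAppendOnlyMap[K, V](initialCapacity: Int = 64) {
  require(initialCapacity > 0, "capacity must be positive")

  // Capacity is kept at a power of two so (hash & mask) selects a slot.
  private var capacity = nextPowerOfTwo(initialCapacity)
  private var mask = capacity - 1
  private var curSize = 0
  // Keys and values share one flat array: key at 2*i, value at 2*i + 1.
  private var data = new Array[AnyRef](2 * capacity)

  def size: Int = curSize

  // Insert or update in one probe sequence. `update` receives
  // (keyAlreadyPresent, oldValue) and returns the value to store.
  def changeValue(key: K, update: (Boolean, V) => V): V = {
    val k = key.asInstanceOf[AnyRef]
    require(k ne null, "null keys are not supported in this sketch")
    val pos = probe(k)
    val existed = data(2 * pos) ne null
    val newValue = update(existed, data(2 * pos + 1).asInstanceOf[V])
    data(2 * pos) = k
    data(2 * pos + 1) = newValue.asInstanceOf[AnyRef]
    if (!existed) {
      curSize += 1
      // Growing early guarantees that probing always finds an empty slot.
      if (curSize > capacity * 0.7) grow()
    }
    newValue
  }

  def get(key: K): Option[V] = {
    val pos = probe(key.asInstanceOf[AnyRef])
    if (data(2 * pos) eq null) None else Some(data(2 * pos + 1).asInstanceOf[V])
  }

  // Linear probing: walk forward until we find the key or an empty slot.
  private def probe(k: AnyRef): Int = {
    val h = k.hashCode
    var pos = (h ^ (h >>> 16)) & mask
    while ((data(2 * pos) ne null) && data(2 * pos) != k) pos = (pos + 1) & mask
    pos
  }

  // Double the table and re-insert every entry; nothing is ever deleted.
  private def grow(): Unit = {
    val oldData = data
    val oldCapacity = capacity
    capacity *= 2
    mask = capacity - 1
    data = new Array[AnyRef](2 * capacity)
    var i = 0
    while (i < oldCapacity) {
      val k = oldData(2 * i)
      if (k ne null) {
        val pos = probe(k)
        data(2 * pos) = k
        data(2 * pos + 1) = oldData(2 * i + 1)
      }
      i += 1
    }
  }

  private def nextPowerOfTwo(n: Int): Int = {
    val high = Integer.highestOneBit(n)
    if (high == n) n else high << 1
  }
}

object AppendOnlyMapDemo {
  def main(args: Array[String]): Unit = {
    // Sum values per key, in the style of a reduceByKey combine step.
    val sums = new SimpleAppendOnlyMap[Int, Int]()
    for (x <- 1 to 1000) {
      sums.changeValue(x % 10, (existed, old) => if (existed) old + x else x)
    }
    println(sums.get(0))
  }
}
```

Because entries are never removed, the table needs no tombstones, and a lookup can stop at the first empty slot. That is what keeps both the layout and the probing loop this simple.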