ODIn/spark-instrumented-optimizer

Author	SHA1	Message	Date
Reynold Xin	0353f74a9a	Put the job cancellation handling into the dagscheduler's main event loop.	2013-10-10 00:28:00 -07:00
Reynold Xin	dbae7795ba	Merge branch 'master' of github.com:apache/incubator-spark into kill. Conflicts: core/src/main/scala/org/apache/spark/CacheManager.scala, core/src/main/scala/org/apache/spark/rdd/HadoopRDD.scala, core/src/main/scala/org/apache/spark/scheduler/DAGSchedulerSource.scala	2013-10-09 22:57:35 -07:00
Reynold Xin	53895f9cde	Implemented FutureAction, FutureJob, CancellablePromise. Implemented more unit tests for async actions.	2013-10-09 22:43:06 -07:00
Reynold Xin	320418f7c8	Merge pull request #49 from mateiz/kryo-fix-2. Fix Chill serialization of Range objects. It used to write out each element one by one, creating very large objects.	2013-10-09 16:55:30 -07:00
Reynold Xin	215238cb39	Merge pull request #50 from kayousterhout/SPARK-908. Fix race condition in SparkListenerSuite (fixes SPARK-908).	2013-10-09 16:49:44 -07:00
Matei Zaharia	c84c205289	Fix Chill serialization of Range objects, which used to write out each element, and register user and Spark classes before Chill's serializers to let them override Chill's behavior in general.	2013-10-09 16:23:40 -07:00
Kay Ousterhout	36966f65df	Style fixes	2013-10-09 15:36:34 -07:00
Kay Ousterhout	3f7e9b265c	Fixed comment to use javadoc style	2013-10-09 15:23:04 -07:00
Kay Ousterhout	a34a4e8174	Fix race condition in SparkListenerSuite (fixes SPARK-908).	2013-10-09 15:07:53 -07:00
Matei Zaharia	7827efc87b	Merge pull request #46 from mateiz/py-sort-update. Fix PySpark docs and an overly long line of code after #38. Just noticed these after merging that commit (https://github.com/apache/incubator-spark/pull/38).	2013-10-09 15:07:25 -07:00
Patrick Wendell	7b3ae04ea7	Merge pull request #45 from pwendell/metrics_units. Use standard abbreviation in metrics description (MBytes -> MB). This is a small change; older commits are shown here because GitHub hasn't synced with Apache yet.	2013-10-09 12:14:19 -07:00
Matei Zaharia	478b2b7edc	Fix PySpark docs and an overly long line of code after `fdbae41e`	2013-10-09 12:08:04 -07:00
Matei Zaharia	b4fa11f6c9	Merge pull request #38 from AndreSchumacher/pyspark_sorting. SPARK-705: implement sortByKey() in PySpark. This PR contains the implementation of a RangePartitioner in Python and uses its partition IDs to get a global sort in PySpark.	2013-10-09 11:59:47 -07:00
Patrick Wendell	bd3bcc5f8e	Use standard abbreviations in metrics labels	2013-10-09 11:16:24 -07:00
Patrick Wendell	19d445d37c	Merge pull request #22 from GraceH/metrics-naming. SPARK-900: Use coarser grained naming for metrics. The new metric name is formatted as {XXX.YYY.ZZZ.COUNTER_UNIT}, where XXX.YYY.ZZZ is the group name, which can group several metrics under the same Ganglia view.	2013-10-09 11:08:34 -07:00
Shivaram Venkataraman	484166d520	Add new SBT target for dependency assembly	2013-10-09 04:24:34 -07:00
Matei Zaharia	3218fa795f	Merge pull request #4 from MLnick/implicit-als. Adding algorithm for implicit feedback data to ALS. This PR adds the commonly used "implicit feedback" variant to ALS. The implementation is based in part on Mahout's implementation, which is in turn based on [Collaborative Filtering for Implicit Feedback Datasets](http://research.yahoo.com/pub/2433). It has been adapted for the blocked approach used in MLlib. I have tested this implementation against the MovieLens 100k, 1m and 10m datasets, and confirmed that it produces the same RMSE score as Mahout, as well as my own port of Mahout's implicit ALS implementation to Spark (not that RMSE is necessarily the best metric to judge by for implicit feedback, but it provides a consistent metric for comparison). It turned out to be more straightforward than I had thought to add this. The main additions are: 1. Adding `implicitPrefs` boolean flag and `alpha` parameter 2. Added the `computeYtY` method. In each least-squares step, the algorithm requires the computation of `YtY`, where `Y` is the {user, item} factor matrix. Since the factors are already block-distributed in an `RDD`, this is quite straightforward to compute but does add an extra operation over the explicit version (but only twice per iteration) 3. Finally the actual solve step in `updateBlock` boils down to: * a multiplication of the `XtX` matrix by `alpha * rating` * a multiplication of the `Xty` vector by `1 + alpha * rating` * when solving for the factor vector, the implicit variant adds the `YtY` matrix to the LHS 4. Added `trainImplicit` methods in the `ALS` object 5. Added test cases for both Scala and Java - based on achieving a confidence-weighted RMSE score < 0.4 (this is taken from Mahout's test cases) It would be great to get some feedback on this and have people test things out against some datasets (MovieLens and others and perhaps proprietary datasets) both locally and on a cluster if possible. I have not yet tested on a cluster but will try to do that soon. I have tried to make things as efficient as possible but if there are potential improvements let me know. The results of a run against ml-1m are below (note the vanilla RMSE scores will be very different from the explicit variant): INPUTS ``` iterations=10 factors=10 lambda=0.01 alpha=1 implicitPrefs=true ``` RESULTS ``` Spark MLlib 0.8.0-SNAPSHOT RMSE = 3.1544 Time: 24.834 sec ``` ``` My own port of Mahout's ALS to Spark (updated to 0.8.0-SNAPSHOT) RMSE = 3.1543 Time: 58.708 sec ``` ``` Mahout 0.8 time ./factorize-movielens-1M.sh /path/to/ratings/ml-1m/ratings.dat real 3m48.648s user 6m39.254s sys 0m14.505s RMSE = 3.1539 ``` Results of a run against ml-10m ``` Spark MLlib RMSE = 3.1200 Time: 162.348 sec ``` ``` Mahout 0.8 real 23m2.220s user 43m39.185s sys 0m25.316s RMSE = 3.1187 ```	2013-10-08 23:44:55 -07:00
Matei Zaharia	12d593129d	Create fewer function objects in uses of AppendOnlyMap.changeValue	2013-10-08 23:16:51 -07:00
Matei Zaharia	0b35051f19	Address some comments on code clarity	2013-10-08 23:16:17 -07:00
Matei Zaharia	4acbc5afdd	Moved files that were in the wrong directory after package rename	2013-10-08 23:16:17 -07:00
Matei Zaharia	0e40cfabf8	Fix some review comments	2013-10-08 23:16:16 -07:00
Matei Zaharia	b535db7d89	Added a fast and low-memory append-only map implementation for cogroup and parallel reduce operations	2013-10-08 23:14:38 -07:00
Reynold Xin	e67d5b962a	Merge pull request #43 from mateiz/kryo-fix. Don't allocate Kryo buffers unless needed. I noticed that the Kryo serializer could be slower than the Java one by 2-3x on small shuffles because it spends a lot of time initializing Kryo Input and Output objects. This is because our default buffer size for them is very large. Since the serializer is often used on streams, I made the initialization lazy for that, and used a smaller buffer (auto-managed by Kryo) for input.	2013-10-08 22:57:38 -07:00
Grace Huang	f7628e4033	remove those futile suffixes like number/count	2013-10-09 08:36:41 +08:00
Aaron Davidson	4ea8ee468f	Add docs for standalone scheduler fault tolerance. Also fix a couple of HTML/Markdown issues in other files.	2013-10-08 14:18:31 -07:00
Aaron Davidson	749233b869	Revert change to spark-class. Also adds comment about how to configure for FaultToleranceTest.	2013-10-08 11:41:52 -07:00
Aaron Davidson	1cd57cd4d3	Add license agreements to dockerfiles	2013-10-08 11:41:12 -07:00
Grace Huang	22bed59d2d	create metrics name manually.	2013-10-08 18:01:11 +08:00
Grace Huang	188abbf8f1	Revert "SPARK-900 Use coarser grained naming for metrics" This reverts commit `4b68be5f3c`.	2013-10-08 17:45:14 +08:00
Grace Huang	a2af6b543a	Revert "remedy the line-wrap while exceeding 100 chars" This reverts commit `892fb8ffa8`.	2013-10-08 17:44:56 +08:00
Patrick Wendell	9e9e9e1b42	Making the timing block more narrow for the sync	2013-10-07 21:28:12 -07:00
Reynold Xin	ea34c52102	Merge pull request #42 from pwendell/shuffle-read-perf. Fix inconsistent and incorrect log messages in shuffle read path. The user-facing messages generated by the CacheManager are currently wrong and somewhat misleading. This patch makes the messages more accurate. It also uses a consistent representation of the partition being fetched (`rdd_xx_yy`) so that it's easier for users to trace what is going on when reading logs.	2013-10-07 20:45:58 -07:00
Patrick Wendell	8b377718b8	Responses to review	2013-10-07 20:03:35 -07:00
Matei Zaharia	a8725bf8f8	Don't allocate Kryo buffers unless needed	2013-10-07 19:16:35 -07:00
Patrick Wendell	391133f66a	Fix inconsistent and incorrect log messages in shuffle read path	2013-10-07 17:24:18 -07:00
Patrick Wendell	b08306c5cf	Minor cleanup	2013-10-07 16:30:25 -07:00
Patrick Wendell	02f37ee853	Merge pull request #39 from pwendell/master. Adding Shark 0.7.1 to EC2 scripts. This adds a newer version of Shark to the EC2 scripts. I've tested this for both Hadoop1 and Hadoop2 clusters.	2013-10-07 15:48:52 -07:00
Patrick Wendell	524d01ea31	Perf benchmark	2013-10-07 15:15:42 -07:00
Patrick Wendell	d15acd6457	Trying new approach with writes	2013-10-07 15:15:42 -07:00
Patrick Wendell	a224c8c9b8	Adding option to force sync to the filesystem	2013-10-07 15:15:42 -07:00
Patrick Wendell	3478ca6762	Track and report write throughput for shuffle tasks.	2013-10-07 15:15:41 -07:00
Patrick Wendell	3745a1827f	Adding Shark 0.7.1 to EC2 scripts	2013-10-07 15:03:42 -07:00
Andre Schumacher	fdbae41e88	SPARK-705: implement sortByKey() in PySpark	2013-10-07 12:16:33 -07:00
Reynold Xin	5218e46178	Updated Kryo registration.	2013-10-07 11:48:50 -07:00
Reynold Xin	4f916f5302	Created a MessageToPartition class to send messages without saving the partition id.	2013-10-07 11:31:00 -07:00
Reynold Xin	213b70a2db	Merge pull request #31 from sundeepn/branch-0.8. Resolving package conflicts with Hadoop 0.23.9. Hadoop 0.23.9 has a package conflict with EasyMock's dependencies. (cherry picked from commit `023e3fdf00`) Signed-off-by: Reynold Xin <rxin@apache.org>	2013-10-07 10:54:22 -07:00
Nick Pentreath	a5e58b8f98	Merge branch 'master' into implicit-als	2013-10-07 11:46:17 +02:00
Nick Pentreath	b0f5f4d441	Bumping up test matrix size to eliminate random failures	2013-10-07 11:44:22 +02:00
Dan Crankshaw	2a8f3db94d	Fixed groupEdgeTriplets - it now passes a basic unit test. The problem was with the way the EdgeTripletRDD iterator worked. Calling toList on it returned the last value repeatedly. Fixed by overriding toList in the iterator.	2013-10-06 19:52:40 -07:00
Kay Ousterhout	fdc52b2f8b	Added back fully qualified class name	2013-10-06 18:45:43 -07:00

4743 commits