ODIn/spark-instrumented-optimizer

Author	SHA1	Message	Date
Ankur Dave	af645be5b8	Fix all code examples in guide	2014-01-13 22:29:45 -08:00
Ankur Dave	2cd9358ccf	Finish `6f6f8c928c`	2014-01-13 22:29:23 -08:00
Patrick Wendell	08b9fec93d	Merge pull request #409 from tdas/unpersist Automatically unpersisting RDDs that have been cleaned up from DStreams Earlier, RDDs generated by DStreams were forgotten but not unpersisted. The system relied on the natural BlockManager LRU to drop the data. The cleaner.ttl was a hammer to clean up RDDs, but it has to be set separately and very conservatively (at best, a few minutes). This automatic unpersisting lets the system handle this automatically, which reduces memory usage. As a side effect, it will also improve GC performance, as fewer objects are stored in memory. In fact, for some workloads, it may allow RDDs to be cached as deserialized, which speeds up processing without too much GC overhead. This is disabled by default. To enable it, set the configuration spark.streaming.unpersist to true. In a future release, this will be set to true by default. Also reduced the sleep time in TaskSchedulerImpl.stop() from 5 seconds to 1 second. From my conversation with Matei, there does not seem to be any good reason for the sleep that lets messages be sent out to be so long.	2014-01-13 22:29:03 -08:00
Ankur Dave	76ebdae798	Fix bug in GraphLoader.edgeListFile that caused srcId > dstId	2014-01-13 22:20:45 -08:00
Ankur Dave	c6dbfd1694	Edge object must be public for Edge case class	2014-01-13 22:08:44 -08:00
Ankur Dave	6f6f8c928c	Wrap methods in the appropriate class/object declaration	2014-01-13 21:55:35 -08:00
Ankur Dave	67795dbbfb	Write Graph Builders section in guide	2014-01-13 21:45:11 -08:00
Ankur Dave	e14a14bcde	Remove K-Core and LDA sections from guide; they are unimplemented	2014-01-13 21:12:58 -08:00
Ankur Dave	c28e5a08ee	Improve scaladoc links	2014-01-13 21:11:39 -08:00
Ankur Dave	59e4384e19	Fix Pregel SSSP example in programming guide	2014-01-13 21:02:38 -08:00
Ankur Dave	c6023bee60	Fix infinite loop in GraphGenerators.generateRandomEdges The loop occurred when numEdges < numVertices. This commit fixes it by allowing generateRandomEdges to generate a multigraph.	2014-01-13 21:02:37 -08:00
Ankur Dave	84d6af8021	Make Graph{,Impl,Ops} serializable to work around capture	2014-01-13 21:02:37 -08:00
Ankur Dave	d4d9ece1af	Remove Graph.statistics and GraphImpl.printLineage	2014-01-13 21:02:37 -08:00
Andrew Or	839934140f	Wording changes per Patrick	2014-01-13 20:51:38 -08:00
Matei Zaharia	cc93c2abb1	Disable MLlib tests for now while Jenkins is still on Python 2.6	2014-01-13 20:46:46 -08:00
Patrick Wendell	b07bc02a00	Merge pull request #412 from harveyfeng/master Add default value for HadoopRDD's `cloneRecords` constructor arg Small mend to https://github.com/apache/incubator-spark/pull/359/files#diff-1 for backwards compatibility	2014-01-13 20:45:22 -08:00
Reynold Xin	33022d6656	Adjusted visibility of various components.	2014-01-13 19:58:53 -08:00
Patrick Wendell	a2fee38ee0	Merge pull request #411 from tdas/filestream-fix Improved logic of finding new files in FileInputDStream Earlier, if HDFS had a hiccup and reported the existence of a new file (mod time T sec) at time T + 1 sec, then fileStream could have missed that file. With this change, it should be able to find files that are delayed by up to <batch size> seconds. That is, even if a file is reported at T + <batch time> sec, the file stream should be able to catch it. The new logic, at a high level, is as follows. It keeps track of the new files it found in the previous interval and the mod time of the oldest of those files (let's call it X). Then, in the current interval, it will ignore those files that were seen in the previous interval and those which have a mod time older than X. So if a new file gets reported by HDFS in the current interval but has a mod time in the previous interval, it will be considered. However, if the mod time is earlier than the previous interval (that is, earlier than X), the file will be ignored. This is the current limitation, and a future version should improve this behavior further. Also reduced line lengths in DStream to <=100 chars.	2014-01-13 19:45:26 -08:00
Harvey	9e84e70509	Add default value for HadoopRDD's `cloneRecords` constructor arg, to maintain backwards compatibility.	2014-01-13 19:43:40 -08:00
Joseph E. Gonzalez	ee8931d2c6	Finished documenting vertexrdd.	2014-01-13 19:30:35 -08:00
Patrick Wendell	d4cd5debf4	Fix for Kryo Serializer	2014-01-13 19:03:59 -08:00
Reynold Xin	0fbc0b0561	Merge branch 'graphx' of github.com:ankurdave/incubator-spark into graphx	2014-01-13 18:51:22 -08:00
Reynold Xin	0b18bfba1a	Updated doc for PageRank.	2014-01-13 18:51:04 -08:00
Reynold Xin	9317286b72	More cleanup.	2014-01-13 18:45:35 -08:00
Reynold Xin	8e5c732430	Moved SVDPlusPlusConf into SVDPlusPlus object itself.	2014-01-13 18:45:20 -08:00
Raymond Liu	4c22c55ad6	Address comments to fix code formats	2014-01-14 10:41:42 +08:00
Joseph E. Gonzalez	552de5d42e	Finished second pass on pregel docs.	2014-01-13 18:40:43 -08:00
Joseph E. Gonzalez	622b7f7d39	Minor changes in graphx programming guide.	2014-01-13 18:40:43 -08:00
Raymond Liu	161ab93989	Yarn workerRunnable refactor	2014-01-14 10:36:00 +08:00
Raymond Liu	79a5ba3497	Yarn Client refactor	2014-01-14 10:33:48 +08:00
Reynold Xin	1dce9ce446	Moved PartitionStrategy's into an object.	2014-01-13 18:32:04 -08:00
Reynold Xin	ae06d2c22f	Updated GraphGenerator.	2014-01-13 18:31:49 -08:00
Reynold Xin	87f335db78	Made more things private.	2014-01-13 18:30:26 -08:00
Reynold Xin	a4e12af7aa	Merge branch 'graphx' of github.com:ankurdave/incubator-spark into graphx Conflicts: graphx/src/main/scala/org/apache/spark/graphx/Pregel.scala	2014-01-13 17:42:59 -08:00
Reynold Xin	02a8f54bfa	Miscellaneous doc update.	2014-01-13 17:40:36 -08:00
Tathagata Das	1233b3de01	Merge remote-tracking branch 'apache/master' into filestream-fix	2014-01-13 17:29:19 -08:00
Joseph E. Gonzalez	cfe4a29dcb	Improvements in example code for the programming guide as well as adding serialization support for GraphImpl to address issues with failed closure capture.	2014-01-13 17:18:31 -08:00
Ankur Dave	ae4b75d94a	Add EdgeDirection.Either and use it to fix CC bug The bug was due to a misunderstanding of the activeSetOpt parameter to Graph.mapReduceTriplets. Passing EdgeDirection.Both causes mapReduceTriplets to run only on edges with both vertices in the active set. This commit adds EdgeDirection.Either, which causes mapReduceTriplets to run on edges with either vertex in the active set. This is what connected components needed.	2014-01-13 17:03:03 -08:00
Ankur Dave	1bd5cefcae	Remove aggregateNeighbors	2014-01-13 17:03:03 -08:00
Tathagata Das	c0bb38e8aa	Improved file input stream further.	2014-01-13 16:54:52 -08:00
Reynold Xin	dc041cd3b6	Merge branch 'scaladoc1' of github.com:rxin/incubator-spark into graphx	2014-01-13 16:25:21 -08:00
Reynold Xin	01c0d72b32	Merge pull request #410 from rxin/scaladoc1 Updated JavaStreamingContext to make scaladoc compile. `sbt/sbt doc` used to fail. This fixed it.	2014-01-13 16:24:30 -08:00
Reynold Xin	e2d25d2dfe	Merge branch 'master' into graphx	2014-01-13 16:21:26 -08:00
Reynold Xin	30328c347b	Updated JavaStreamingContext to make scaladoc compile. `sbt/sbt doc` used to fail. This fixed it.	2014-01-13 15:58:39 -08:00
Ankur Dave	8038da2328	Merge pull request #2 from jegonzal/GraphXCCIssue Improving documentation and identifying potential bug in CC calculation.	2014-01-13 14:59:30 -08:00
Tathagata Das	27311b1332	Added unpersisting and modified testsuite to better test out metadata cleaning.	2014-01-13 14:57:07 -08:00
Ankur Dave	97cd27e31b	Add graph loader links to doc	2014-01-13 14:54:48 -08:00
Ankur Dave	15ca89b11e	Fix mapReduceTriplets links in doc	2014-01-13 14:54:33 -08:00
Joseph E. Gonzalez	80e4d98dc6	Improving documentation and identifying potential bug in CC calculation.	2014-01-13 13:40:16 -08:00
Patrick Wendell	c3816de504	Changing option wording per discussion with Andrew	2014-01-13 13:25:06 -08:00


6395 commits