ODIn/spark-instrumented-optimizer

Author	SHA1	Message	Date
Reynold Xin	f12e506c9e	Fixed a typo in JavaSparkContext's API doc.	2014-01-14 11:42:28 -08:00
Reynold Xin	1b5623fd0b	Maintain Serializable API compatibility by reverting back to java.io.Serializable for Broadcast and Accumulator.	2014-01-14 11:30:59 -08:00
Reynold Xin	55db77416b	Added license header for package.scala in the Java API package.	2014-01-14 11:20:12 -08:00
Reynold Xin	f8c12e9457	Added package doc for the Java API.	2014-01-14 11:16:25 -08:00
Reynold Xin	6a12b9ebc5	Updated API doc for Accumulable and Accumulator.	2014-01-14 11:16:08 -08:00
Reynold Xin	71b3007dbd	Broadcast variable visibility change & doc update. Note that previously the Broadcast class was accidentally marked as private[spark]. It needs to be public for broadcast variables to work. Also exposing the broadcast variable id.	2014-01-14 11:15:21 -08:00
Joseph E. Gonzalez	0bba7738a2	Additional edits for clarity in the graphx programming guide.	2014-01-14 10:31:54 -08:00
Reynold Xin	3fcc68bfa5	Merge pull request #423 from jegonzal/GraphXProgrammingGuide Improving the graphx-programming-guide This PR will track a few minor improvements to the content and formatting of the graphx-programming-guide.	2014-01-14 09:44:43 -08:00
Joseph E. Gonzalez	486f37c59c	Improving the graphx-programming-guide.	2014-01-14 09:43:33 -08:00
Frank Dai	57fcfc75b3	Added parentheses because getDouble() also has a side effect	2014-01-14 18:56:11 +08:00
Patrick Wendell	fa75e5e1c5	Merge pull request #420 from pwendell/header-files Add missing header files	2014-01-14 01:18:34 -08:00
Patrick Wendell	23034798d7	Add missing header files	2014-01-14 01:17:13 -08:00
Saurabh Rawat	1442cd5d50	Modifications as suggested in PR feedback: more variants of mapPartitions added to JavaRDDLike; move setGenerator to JavaRDDLike; clean up	2014-01-14 14:19:02 +05:30
Patrick Wendell	980250b1ee	Merge pull request #416 from tdas/filestream-fix Removed unnecessary DStream operations and updated docs Removed StreamingContext.registerInputStream and registerOutputStream - they were useless. InputDStream has been made to register itself, and just registering a DStream as an output stream causes RDD objects to be created, but the RDDs will not be computed at all. Also made DStream.register() private[streaming] for the same reasons. Updated docs, especially added package documentation for the streaming package. Also, changed NetworkWordCount's input storage level to use MEMORY_ONLY; replication on the local machine causes warning messages (as replication fails), which is scary for a new user trying out his/her first example.	2014-01-14 00:05:37 -08:00
Tathagata Das	f8bd828c7c	Fixed loose ends in docs.	2014-01-14 00:03:46 -08:00
Tathagata Das	f8e239e058	Merge remote-tracking branch 'apache/master' into filestream-fix Conflicts: streaming/src/main/scala/org/apache/spark/streaming/dstream/DStream.scala	2014-01-13 23:57:27 -08:00
Reza Zadeh	845e568fad	Merge remote-tracking branch 'upstream/master' into sparsesvd	2014-01-13 23:52:34 -08:00
Frank Dai	a3da468d8b	Merge remote-tracking branch 'upstream/master' into code-style	2014-01-14 15:29:17 +08:00
Patrick Wendell	055be5c694	Merge pull request #415 from pwendell/shuffle-compress Enable compression by default for spills	2014-01-13 23:26:44 -08:00
Patrick Wendell	0984647aae	Enable compression by default for spills	2014-01-13 23:25:25 -08:00
Tathagata Das	4e497db8f3	Removed StreamingContext.registerInputStream and registerOutputStream - they were useless as InputDStream has been made to register itself. Also made DStream.register() private[streaming] - not useful to expose the confusing function. Updated a lot of documentation.	2014-01-13 23:23:46 -08:00
Patrick Wendell	fdaabdc673	Merge pull request #380 from mateiz/py-bayes Add Naive Bayes to Python MLlib, and some API fixes - Added a Python wrapper for Naive Bayes - Updated the Scala Naive Bayes to match the style of our other algorithms better and in particular make it easier to call from Java (added builder pattern, removed default value in train method) - Updated Python MLlib functions to not require a SparkContext; we can get that from the RDD the user gives - Added a toString method in LabeledPoint - Made the Python MLlib tests run as part of run-tests as well (before they could only be run individually through each file)	2014-01-13 23:08:26 -08:00
Frank Dai	c2852cf42e	Indent two spaces	2014-01-14 14:59:01 +08:00
Patrick Wendell	4a805aff5e	Merge pull request #367 from ankurdave/graphx GraphX: Unifying Graphs and Tables GraphX extends Spark's distributed fault-tolerant collections API and interactive console with a new graph API which leverages recent advances in graph systems (e.g., [GraphLab](http://graphlab.org)) to enable users to easily and interactively build, transform, and reason about graph structured data at scale. See http://amplab.github.io/graphx/. Thanks to @jegonzal, @rxin, @ankurdave, @dcrankshaw, @jianpingjwang, @amatsukawa, @kellrott, and @adamnovak. Tasks left: - [x] Graph-level uncache - [x] Uncache previous iterations in Pregel - [x] ~~Uncache previous iterations in GraphLab~~ (postponed to post-release) - [x] - Describe GC issue with GraphLab - [ ] Write `docs/graphx-programming-guide.md` - [x] - Mention future Bagel support in docs - [ ] - Section on caching/uncaching in docs: As with Spark, cache something that is used more than once. In an iterative algorithm, try to cache and force (i.e., materialize) something every iteration, then uncache the cached things that depended on the newly materialized RDD but that won't be referenced again. - [x] Undo modifications to core collections and instead copy them to org.apache.spark.graphx - [x] Make Graph serializable to work around capture in Spark shell - [x] Rename graph -> graphx in package name and subproject - [x] Remove standalone PageRank - [x] ~~Fix amplab/graphx#52 by checking `iter.hasNext`~~	2014-01-13 22:58:38 -08:00
Joseph E. Gonzalez	80e73ed000	Adding minimal additional functionality to EdgeRDD	2014-01-13 22:56:57 -08:00
Patrick Wendell	945fe7a37e	Merge pull request #408 from pwendell/external-serializers Improvements to external sorting 1. Adds the option of compressing outputs. 2. Adds batching to the serialization to prevent OOM on the read side. 3. Slight renaming of config options. 4. Use Spark's buffer size for reads in addition to writes.	2014-01-13 22:56:12 -08:00
Joseph E. Gonzalez	4bafc4f41f	adding documentation about EdgeRDD	2014-01-13 22:55:54 -08:00
Patrick Wendell	68641bce61	Merge pull request #413 from rxin/scaladoc Adjusted visibility of various components and documentation for 0.9.0 release.	2014-01-13 22:54:13 -08:00
Frank Dai	12386b3eea	Since getLong() and getInt() have side effects, restore parentheses, and remove an empty line	2014-01-14 14:53:10 +08:00
Frank Dai	0d94d74edf	Code clean up for mllib	2014-01-14 14:37:26 +08:00
Patrick Wendell	0ca0d4d657	Merge pull request #401 from andrewor14/master External sorting - Add number of bytes spilled to Web UI Additionally, update test suite for external sorting to induce spilling.	2014-01-13 22:32:21 -08:00
Ankur Dave	af645be5b8	Fix all code examples in guide	2014-01-13 22:29:45 -08:00
Ankur Dave	2cd9358ccf	Finish `6f6f8c928c`	2014-01-13 22:29:23 -08:00
Patrick Wendell	08b9fec93d	Merge pull request #409 from tdas/unpersist Automatically unpersisting RDDs that have been cleaned up from DStreams Earlier, RDDs generated by DStreams were forgotten but not unpersisted. The system relied on the natural BlockManager LRU to drop the data. The cleaner.ttl was a hammer to clean up RDDs, but it is something that needs to be set separately and needs to be set very conservatively (at best, a few minutes). This automatic unpersisting allows the system to handle this automatically, which reduces memory usage. As a side effect, it will also improve GC performance as there are fewer objects stored in memory. In fact, for some workloads, it may allow RDDs to be cached as deserialized, which speeds up processing without too much GC overhead. This is disabled by default. To enable it, set the configuration spark.streaming.unpersist to true. In a future release, this will be set to true by default. Also, reduced sleep time in TaskSchedulerImpl.stop() from 5 seconds to 1 second. From my conversation with Matei, there does not seem to be any good reason for the sleep that lets messages be sent out to be so long.	2014-01-13 22:29:03 -08:00
Ankur Dave	76ebdae798	Fix bug in GraphLoader.edgeListFile that caused srcId > dstId	2014-01-13 22:20:45 -08:00
Ankur Dave	c6dbfd1694	Edge object must be public for Edge case class	2014-01-13 22:08:44 -08:00
Ankur Dave	6f6f8c928c	Wrap methods in the appropriate class/object declaration	2014-01-13 21:55:35 -08:00
Ankur Dave	67795dbbfb	Write Graph Builders section in guide	2014-01-13 21:45:11 -08:00
Ankur Dave	e14a14bcde	Remove K-Core and LDA sections from guide; they are unimplemented	2014-01-13 21:12:58 -08:00
Ankur Dave	c28e5a08ee	Improve scaladoc links	2014-01-13 21:11:39 -08:00
Ankur Dave	59e4384e19	Fix Pregel SSSP example in programming guide	2014-01-13 21:02:38 -08:00
Ankur Dave	c6023bee60	Fix infinite loop in GraphGenerators.generateRandomEdges The loop occurred when numEdges < numVertices. This commit fixes it by allowing generateRandomEdges to generate a multigraph.	2014-01-13 21:02:37 -08:00
Ankur Dave	84d6af8021	Make Graph{,Impl,Ops} serializable to work around capture	2014-01-13 21:02:37 -08:00
Ankur Dave	d4d9ece1af	Remove Graph.statistics and GraphImpl.printLineage	2014-01-13 21:02:37 -08:00
Andrew Or	839934140f	Wording changes per Patrick	2014-01-13 20:51:38 -08:00
Matei Zaharia	cc93c2abb1	Disable MLlib tests for now while Jenkins is still on Python 2.6	2014-01-13 20:46:46 -08:00
Patrick Wendell	b07bc02a00	Merge pull request #412 from harveyfeng/master Add default value for HadoopRDD's `cloneRecords` constructor arg Small mend to https://github.com/apache/incubator-spark/pull/359/files#diff-1 for backwards compatibility	2014-01-13 20:45:22 -08:00
Reynold Xin	33022d6656	Adjusted visibility of various components.	2014-01-13 19:58:53 -08:00
Patrick Wendell	a2fee38ee0	Merge pull request #411 from tdas/filestream-fix Improved logic of finding new files in FileInputDStream Earlier, if HDFS had a hiccup and reported the existence of a new file (mod time T sec) at time T + 1 sec, then fileStream could have missed that file. With this change, it should be able to find files that are delayed by up to <batch size> seconds. That is, even if a file is reported at T + <batch time> sec, the file stream should be able to catch it. The new logic, at a high level, is as follows. It keeps track of the new files it found in the previous interval and the mod time of the oldest of those files (let's call it X). Then in the current interval, it will ignore those files that were seen in the previous interval and those which have a mod time older than X. So if a new file gets reported by HDFS in the current interval, but has a mod time in the previous interval, it will be considered. However, if the mod time is earlier than the previous interval (that is, earlier than X), the file will be ignored. This is the current limitation, and a future version would improve this behavior further. Also reduced line lengths in DStream to <=100 chars.	2014-01-13 19:45:26 -08:00
Harvey	9e84e70509	Add default value for HadoopRDD's `cloneRecords` constructor arg, to maintain backwards compatibility.	2014-01-13 19:43:40 -08:00


6326 commits