Commit graph

539 commits

Author SHA1 Message Date
Matei Zaharia 2ffdaefbcb Clarify that Python 2.7 is only needed for MLlib 2014-01-15 14:20:39 -08:00
Patrick Wendell 494d3c0774 Merge pull request #433 from markhamstra/debFix
Updated Debian packaging
2014-01-15 10:00:50 -08:00
CrazyJvm 263933da97 Remove "-XX:+UseCompressedStrings" option
Remove the "-XX:+UseCompressedStrings" option from the tuning guide, since JDK 7 no longer supports it.
2014-01-15 22:26:15 +08:00
Reynold Xin 3d9e66d92a Merge pull request #436 from ankurdave/VertexId-case
Rename VertexID -> VertexId in GraphX
2014-01-14 23:17:05 -08:00
Mark Hamstra 147a943df0 Removed repl-bin and updated maven build doc. 2014-01-14 22:17:24 -08:00
Ankur Dave f4d9019aa8 VertexID -> VertexId 2014-01-14 22:17:18 -08:00
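For context on the rename: `VertexId` is a type alias for `Long` in `org.apache.spark.graphx`, so only the casing of the alias changes, not the underlying type. A minimal sketch of building a graph with it, assuming a Spark shell with `sc` defined and GraphX on the classpath:

```
import org.apache.spark.graphx._
import org.apache.spark.rdd.RDD

// Vertex IDs are plain Longs under the VertexId alias.
val vertices: RDD[(VertexId, String)] =
  sc.parallelize(Seq((1L, "alice"), (2L, "bob")))
val edges: RDD[Edge[String]] =
  sc.parallelize(Seq(Edge(1L, 2L, "follows")))
val graph: Graph[String, String] = Graph(vertices, edges)
```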
Reynold Xin 3a386e2389 Merge pull request #424 from jegonzal/GraphXProgrammingGuide
Additional edits for clarity in the graphx programming guide.

Added an overview of the Graph and GraphOps functions and fixed numerous typos.
2014-01-14 21:52:50 -08:00
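On the Graph/GraphOps split this overview covers: structural members live on `Graph`, while derived operations such as degree computations come from `GraphOps` through an implicit conversion. A small sketch, reusing the `graph` and imports from the sketch above:

```
// These members come from GraphOps, available via an implicit conversion:
val inDeg: VertexRDD[Int] = graph.inDegrees   // in-degree of each vertex
val maxDeg = graph.degrees.reduce((a, b) => if (a._2 > b._2) a else b)
```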
Ankur Dave 1210ec2945 Describe GraphX caching and uncaching in guide 2014-01-14 17:25:38 -08:00
Joseph E. Gonzalez 0bba7738a2 Additional edits for clarity in the graphx programming guide. 2014-01-14 10:31:54 -08:00
Joseph E. Gonzalez 486f37c59c Improving the graphx-programming-guide. 2014-01-14 09:43:33 -08:00
Patrick Wendell 980250b1ee Merge pull request #416 from tdas/filestream-fix
Removed unnecessary DStream operations and updated docs

Removed StreamingContext.registerInputStream and registerOutputStream - they were useless. InputDStream has been made to register itself, and just registering a DStream as an output stream causes RDD objects to be created, but the RDDs will not be computed at all. Also made DStream.register() private[streaming] for the same reasons.

Updated docs, especially adding package documentation for the streaming package.

Also, changed NetworkWordCount's input storage level to MEMORY_ONLY, since replication on the local machine causes warning messages (as replication fails), which is alarming for a new user trying out their first example.
2014-01-14 00:05:37 -08:00
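The resulting user-facing pattern looks roughly like this sketch (host and port are hypothetical): creating an input stream registers it automatically, and the storage level is passed explicitly, as NetworkWordCount now does with MEMORY_ONLY.

```
import org.apache.spark.storage.StorageLevel
import org.apache.spark.streaming.{Seconds, StreamingContext}

val ssc = new StreamingContext("local[2]", "NetworkWordCount", Seconds(1))
// Creating the stream registers it; no registerInputStream call is needed.
val lines = ssc.socketTextStream("localhost", 9999, StorageLevel.MEMORY_ONLY)
lines.flatMap(_.split(" ")).map(w => (w, 1)).reduceByKey(_ + _).print()
ssc.start()
ssc.awaitTermination()
```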
Tathagata Das f8bd828c7c Fixed loose ends in docs. 2014-01-14 00:03:46 -08:00
Tathagata Das f8e239e058 Merge remote-tracking branch 'apache/master' into filestream-fix
Conflicts:
	streaming/src/main/scala/org/apache/spark/streaming/dstream/DStream.scala
2014-01-13 23:57:27 -08:00
Reza Zadeh 845e568fad Merge remote-tracking branch 'upstream/master' into sparsesvd 2014-01-13 23:52:34 -08:00
Patrick Wendell 0984647aae Enable compression by default for spills 2014-01-13 23:25:25 -08:00
Tathagata Das 4e497db8f3 Removed StreamingContext.registerInputStream and registerOutputStream - they were useless, as InputDStream has been made to register itself. Also made DStream.register() private[streaming] - it is not useful to expose this confusing function. Updated a lot of documentation. 2014-01-13 23:23:46 -08:00
Patrick Wendell fdaabdc673 Merge pull request #380 from mateiz/py-bayes
Add Naive Bayes to Python MLlib, and some API fixes

- Added a Python wrapper for Naive Bayes
- Updated the Scala Naive Bayes to match the style of our other
  algorithms better and in particular make it easier to call from Java
  (added builder pattern, removed default value in train method)
- Updated Python MLlib functions to not require a SparkContext; we can
  get that from the RDD the user gives
- Added a toString method in LabeledPoint
- Made the Python MLlib tests run as part of run-tests as well (before
  they could only be run individually through each file)
2014-01-13 23:08:26 -08:00
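The Scala entry point that the new Python wrapper mirrors looks roughly like this sketch (the toy data is made up; in this era LabeledPoint takes an Array[Double] of features):

```
import org.apache.spark.mllib.classification.NaiveBayes
import org.apache.spark.mllib.regression.LabeledPoint

// Toy training set: a label followed by a feature array (values illustrative).
val training = sc.parallelize(Seq(
  LabeledPoint(0.0, Array(1.0, 0.0)),
  LabeledPoint(1.0, Array(0.0, 1.0))))

val model = NaiveBayes.train(training, lambda = 1.0) // lambda: additive smoothing
val prediction = model.predict(Array(1.0, 0.0))
```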
Patrick Wendell 4a805aff5e Merge pull request #367 from ankurdave/graphx
GraphX: Unifying Graphs and Tables

GraphX extends Spark's distributed fault-tolerant collections API and interactive console with a new graph API which leverages recent advances in graph systems (e.g., [GraphLab](http://graphlab.org)) to enable users to easily and interactively build, transform, and reason about graph-structured data at scale. See http://amplab.github.io/graphx/.

Thanks to @jegonzal, @rxin, @ankurdave, @dcrankshaw, @jianpingjwang, @amatsukawa, @kellrott, and @adamnovak.

Tasks left:
- [x] Graph-level uncache
- [x] Uncache previous iterations in Pregel
- [x] ~~Uncache previous iterations in GraphLab~~ (postponed to post-release)
- [x] Describe GC issue with GraphLab
- [ ] Write `docs/graphx-programming-guide.md`
- [x] Mention future Bagel support in docs
- [ ] Section on caching/uncaching in docs: As with Spark, cache something that is used more than once. In an iterative algorithm, try to cache and force (i.e., materialize) something every iteration, then uncache the cached things that depended on the newly materialized RDD but that won't be referenced again (see the sketch after this entry).
- [x] Undo modifications to core collections and instead copy them to org.apache.spark.graphx
- [x] Make Graph serializable to work around capture in Spark shell
- [x] Rename graph -> graphx in package name and subproject
- [x] Remove standalone PageRank
- [x] ~~Fix amplab/graphx#52 by checking `iter.hasNext`~~
2014-01-13 22:58:38 -08:00
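The caching/uncaching task above describes a concrete pattern. A minimal sketch of the per-iteration cache/materialize/unpersist cycle, assuming a Spark shell with `sc` defined; `nextIteration` is a hypothetical stand-in for a real update step:

```
import org.apache.spark.SparkContext._ // pair-RDD implicits (mapValues)

// Hypothetical update step standing in for one iteration of a real algorithm.
def nextIteration(ranks: org.apache.spark.rdd.RDD[(Long, Double)]) =
  ranks.mapValues(r => 0.15 + 0.85 * r)

var ranks = sc.parallelize(Seq((1L, 1.0), (2L, 1.0))).cache()
for (i <- 1 to 10) {
  val prev = ranks
  ranks = nextIteration(prev).cache()
  ranks.count()     // force (materialize) this iteration's result
  prev.unpersist()  // then uncache what it depended on
}
```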
Patrick Wendell 945fe7a37e Merge pull request #408 from pwendell/external-serializers
Improvements to external sorting

1. Adds the option of compressing outputs.
2. Adds batching to the serialization to prevent OOM on the read side.
3. Slight renaming of config options.
4. Use Spark's buffer size for reads in addition to writes.
2014-01-13 22:56:12 -08:00
Joseph E. Gonzalez 4bafc4f41f adding documentation about EdgeRDD 2014-01-13 22:55:54 -08:00
Ankur Dave af645be5b8 Fix all code examples in guide 2014-01-13 22:29:45 -08:00
Ankur Dave 2cd9358ccf Finish 6f6f8c928c 2014-01-13 22:29:23 -08:00
Ankur Dave 6f6f8c928c Wrap methods in the appropriate class/object declaration 2014-01-13 21:55:35 -08:00
Ankur Dave 67795dbbfb Write Graph Builders section in guide 2014-01-13 21:45:11 -08:00
Ankur Dave e14a14bcde Remove K-Core and LDA sections from guide; they are unimplemented 2014-01-13 21:12:58 -08:00
Ankur Dave 59e4384e19 Fix Pregel SSSP example in programming guide 2014-01-13 21:02:38 -08:00
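For reference, the guide's SSSP example rests on Graph.pregel; a sketch along those lines, where `sourceId` and `edgeGraph` (a graph whose edge attributes hold distances) are assumed:

```
import org.apache.spark.graphx._

val sourceId: VertexId = 1L // hypothetical source vertex
val initialGraph = edgeGraph.mapVertices((id, _) =>
  if (id == sourceId) 0.0 else Double.PositiveInfinity)

val sssp = initialGraph.pregel(Double.PositiveInfinity)(
  (id, dist, newDist) => math.min(dist, newDist),   // vertex program
  triplet =>                                        // send messages along edges
    if (triplet.srcAttr + triplet.attr < triplet.dstAttr)
      Iterator((triplet.dstId, triplet.srcAttr + triplet.attr))
    else
      Iterator.empty,
  (a, b) => math.min(a, b)                          // merge incoming messages
)
```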
Joseph E. Gonzalez ee8931d2c6 Finished documenting VertexRDD. 2014-01-13 19:30:35 -08:00
Joseph E. Gonzalez 552de5d42e Finished second pass on pregel docs. 2014-01-13 18:40:43 -08:00
Joseph E. Gonzalez 622b7f7d39 Minor changes in graphx programming guide. 2014-01-13 18:40:43 -08:00
Joseph E. Gonzalez cfe4a29dcb Improvements in example code for the programming guide as well as adding serialization support for GraphImpl to address issues with failed closure capture. 2014-01-13 17:18:31 -08:00
Ankur Dave 1bd5cefcae Remove aggregateNeighbors 2014-01-13 17:03:03 -08:00
Reynold Xin e2d25d2dfe Merge branch 'master' into graphx 2014-01-13 16:21:26 -08:00
Ankur Dave 8038da2328 Merge pull request #2 from jegonzal/GraphXCCIssue
Improving documentation and identifying potential bug in CC calculation.
2014-01-13 14:59:30 -08:00
Ankur Dave 97cd27e31b Add graph loader links to doc 2014-01-13 14:54:48 -08:00
Ankur Dave 15ca89b11e Fix mapReduceTriplets links in doc 2014-01-13 14:54:33 -08:00
Joseph E. Gonzalez 80e4d98dc6 Improving documentation and identifying potential bug in CC calculation. 2014-01-13 13:40:16 -08:00
Patrick Wendell c3816de504 Changing option wording per discussion with Andrew 2014-01-13 13:25:06 -08:00
Patrick Wendell 5d61e051c2 Improvements to external sorting
1. Adds the option of compressing outputs.
2. Adds batching to the serialization to prevent OOM on the read side.
3. Slight renaming of config options.
4. Use Spark's buffer size for reads in addition to writes.
2014-01-13 12:21:39 -08:00
Patrick Wendell b93f9d42f2 Merge pull request #400 from tdas/dstream-move
Moved DStream and PairDStream to org.apache.spark.streaming.dstream

Similar to the package location of `org.apache.spark.rdd.RDD`, `DStream` has been moved from `org.apache.spark.streaming.DStream` to `org.apache.spark.streaming.dstream.DStream`. I know that the package name is a little long, but I think it's better to keep it consistent with Spark's structure.

Also fixed persistence of windowed DStreams. The RDDs generated by a windowed DStream are essentially unions of underlying RDDs, and persisting these union RDDs would store numerous copies of the underlying data. Instead, setting the persistence level on the windowed DStream now sets the persistence level of the underlying DStream.
2014-01-13 12:18:05 -08:00
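Concretely, the fixed behavior looks like this sketch (reusing the `lines` stream from the earlier streaming sketch):

```
import org.apache.spark.storage.StorageLevel
import org.apache.spark.streaming.Seconds

// Per this commit, persisting a windowed stream sets the level on the parent
// DStream rather than storing many copies through the window's union RDDs.
val windowed = lines.window(Seconds(30), Seconds(10))
windowed.persist(StorageLevel.MEMORY_ONLY_SER)
```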
Joseph E. Gonzalez 66c9d0092a Tested and corrected all examples up to mask in the graphx-programming-guide. 2014-01-12 22:11:13 -08:00
Ankur Dave 1efe78a101 Use GraphLoader for algorithms examples in doc 2014-01-12 22:03:03 -08:00
Tathagata Das 777c181d2f Merge remote-tracking branch 'apache/master' into dstream-move
Conflicts:
	streaming/src/main/scala/org/apache/spark/streaming/dstream/DStream.scala
2014-01-12 21:59:51 -08:00
Ankur Dave d691e9f47e Move algorithms to GraphOps 2014-01-12 21:47:16 -08:00
Ankur Dave 20c509b805 Add TriangleCount example 2014-01-12 21:41:32 -08:00
Patrick Wendell 0b96d85c20 Merge pull request #399 from pwendell/consolidate-off
Disable shuffle file consolidation by default

After running various performance tests for the 0.9 release, this still seems to have performance issues even on XFS. So let's keep this off by default for 0.9, and users can experiment with it depending on their disk configurations.
2014-01-12 21:31:43 -08:00
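Consolidation remains available behind a flag for users who want to experiment on suitable filesystems; a sketch (option name as used in this era):

```
import org.apache.spark.SparkConf

// Off by default for 0.9 per this commit; opt in explicitly to experiment.
val conf = new SparkConf().set("spark.shuffle.consolidateFiles", "true")
```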
Joseph E. Gonzalez c787ff5640 Documenting Pregel API 2014-01-12 20:49:52 -08:00
Patrick Wendell 2802cc80bc Disable shuffle file consolidation by default 2014-01-12 19:16:43 -08:00
Matei Zaharia 54d3486ee9 Fix Scala version in docs (it was printed as 2.1) 2014-01-12 17:49:59 -08:00
Patrick Wendell f4d77f8cb8 Rename DStream.foreach to DStream.foreachRDD
`foreachRDD` makes it clear that the granularity of this operator is per-RDD.
As it stands, `foreach` is inconsistent with `map`, `filter`, and the other
DStream operators, which get pushed down to individual records within each RDD.
2014-01-12 17:21:00 -08:00
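A quick sketch of the per-RDD granularity the new name highlights (`wordCounts` is a hypothetical DStream):

```
// foreachRDD hands you each generated RDD as a whole, unlike map and filter,
// which apply to the individual records inside every RDD.
wordCounts.foreachRDD { rdd =>
  println("records in this batch: " + rdd.count())
}
```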
Ankur Dave 7a4bb863c7 Add connected components example to doc 2014-01-12 16:58:18 -08:00
Ankur Dave 5e35d39e0f Add PageRank example and data 2014-01-12 13:10:53 -08:00
Tathagata Das 448aef6790 Moved DStream, DStreamCheckpointData and PairDStream from org.apache.spark.streaming to org.apache.spark.streaming.dstream. 2014-01-12 11:31:54 -08:00
Ankur Dave f096f4eaf1 Link methods in programming guide; document VertexID 2014-01-12 10:55:29 -08:00
Matei Zaharia 224f1a754a Update Python required version to 2.7, and mention MLlib support 2014-01-12 00:15:34 -08:00
Matei Zaharia 4c28a2bad8 Update some Python MLlib parameters to use camelCase, and tweak docs
We've used camel case in other Spark methods so it felt reasonable to
keep using it here and make the code match Scala/Java as much as
possible. Note that parameter names matter in Python because it allows
passing optional parameters by name.
2014-01-11 22:30:48 -08:00
Matei Zaharia 9a0dfdf868 Add Naive Bayes to Python MLlib, and some API fixes
- Added a Python wrapper for Naive Bayes
- Updated the Scala Naive Bayes to match the style of our other
  algorithms better and in particular make it easier to call from Java
  (added builder pattern, removed default value in train method)
- Updated Python MLlib functions to not require a SparkContext; we can
  get that from the RDD the user gives
- Added a toString method in LabeledPoint
- Made the Python MLlib tests run as part of run-tests as well (before
  they could only be run individually through each file)
2014-01-11 22:30:48 -08:00
Joseph E. Gonzalez cf57b1b055 Correcting typos in documentation. 2014-01-11 17:13:10 -08:00
Joseph E. Gonzalez 64c4593586 Finished documenting join operators and revised some of the initial presentation. 2014-01-11 13:48:35 -08:00
Reza Zadeh f324d53555 Merge remote-tracking branch 'upstream/master' into sparsesvd 2014-01-11 13:27:15 -08:00
Ankur Dave 732333d78e Remove GraphLab 2014-01-11 11:49:35 -08:00
Joseph E. Gonzalez fac44bbe2c Finished documenting structural operators and starting join operators. 2014-01-11 11:28:01 -08:00
Joseph E. Gonzalez 1f45e4e572 starting structural operator discussion. 2014-01-11 09:27:00 -08:00
Joseph E. Gonzalez 56a245c6bc Addressing comment about Graph Processing in docs. 2014-01-11 00:21:17 -08:00
Joseph E. Gonzalez 0c9d39bbaa More organizational changes and dropping the benchmark plot. 2014-01-11 00:09:08 -08:00
Joseph E. Gonzalez b8a44f12a5 More edits. 2014-01-10 23:52:24 -08:00
Ankur Dave 362b9422e4 Soften wording about GraphX superseding Bagel 2014-01-10 23:48:32 -08:00
Patrick Wendell d37408f39c Merge pull request #377 from andrewor14/master
External Sorting for Aggregator and CoGroupedRDDs (Revisited)

(This pull request is re-opened from https://github.com/apache/incubator-spark/pull/303, which was closed because Jenkins / github was misbehaving)

The target issue for this patch is the out-of-memory exceptions triggered by aggregate operations such as reduce, groupBy, join, and cogroup. The existing AppendOnlyMap used by these operations resides purely in memory, and grows with the size of the input data until the amount of allocated memory is exceeded. Under large workloads, this problem is aggravated by the fact that OOM frequently occurs only after a very long (> 1 hour) map phase, in which case the entire job must be restarted.

The solution is to spill the contents of this map to disk once a certain memory threshold is exceeded. This functionality is provided by ExternalAppendOnlyMap, which additionally sorts this buffer before writing it out to disk, and later merges these buffers back in sorted order.

Under normal circumstances in which OOM is not triggered, ExternalAppendOnlyMap is simply a wrapper around AppendOnlyMap and incurs little overhead. Only when the memory usage is expected to exceed the given threshold does ExternalAppendOnlyMap spill to disk.
2014-01-10 16:25:01 -08:00
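A conceptual sketch of the spill-on-threshold idea described above (not Spark's actual ExternalAppendOnlyMap; in-memory sequences stand in for the sorted on-disk runs):

```
import scala.collection.mutable

class SpillableMap[K: Ordering, V](maxEntries: Int, merge: (V, V) => V) {
  private val inMemory = mutable.Map.empty[K, V]
  private val runs = mutable.ArrayBuffer.empty[Seq[(K, V)]] // stand-ins for on-disk runs

  def insert(k: K, v: V): Unit = {
    inMemory(k) = inMemory.get(k).map(merge(_, v)).getOrElse(v)
    if (inMemory.size > maxEntries) spill() // threshold exceeded: write a sorted run
  }

  private def spill(): Unit = {
    runs += inMemory.toSeq.sortBy(_._1) // sort the buffer before writing it out
    inMemory.clear()
  }

  // The real code merge-sorts the runs back together; grouping suffices for a sketch.
  def result(): Map[K, V] =
    (runs.flatten ++ inMemory).groupBy(_._1).map { case (k, kvs) =>
      k -> kvs.map(_._2).reduce(merge)
    }
}
```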
Andrew Or 2e393cd5fd Update documentation for externalSorting 2014-01-10 15:45:38 -08:00
Andrew Or e4c51d2113 Address Patrick's and Reynold's comments
Aside from trivial formatting changes, use nulls instead of Options for
DiskMapIterator, and add documentation for spark.shuffle.externalSorting
and spark.shuffle.memoryFraction.

Also, set spark.shuffle.memoryFraction to 0.3 and spark.storage.memoryFraction to 0.6.
2014-01-10 15:09:51 -08:00
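The settings named above, as a configuration sketch (values per this commit):

```
import org.apache.spark.SparkConf

val conf = new SparkConf()
  .set("spark.shuffle.externalSorting", "true")  // spilling on/off
  .set("spark.shuffle.memoryFraction", "0.3")    // value per this commit
  .set("spark.storage.memoryFraction", "0.6")    // value per this commit
```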
Thomas Graves 7cef8435d7 Merge pull request #371 from tgravescs/yarn_client_addjar_misc_fixes
Yarn client addjar and misc fixes

Fix the addJar functionality in yarn-client mode, add support for the other options supported in yarn-standalone mode, set the application type on YARN in Hadoop 2.x, add documentation, and change the heartbeat interval to use the same code as yarn-standalone so it doesn't take so long to get containers and exit.
2014-01-10 15:34:15 -06:00
Ankur Dave 3eb83191cb Generate GraphX docs 2014-01-10 11:37:28 -08:00
Ankur Dave 6bd9a78e78 Add back Bagel links to docs, but mark them superseded 2014-01-10 11:37:10 -08:00
Joseph E. Gonzalez b1eeefb401 WIP. Updating figures and cleaning up initial skeleton for GraphX Programming guide. 2014-01-10 00:39:08 -08:00
Patrick Wendell dd03cea02a Merge pull request #378 from pwendell/consolidate_on
Enable shuffle consolidation by default.

Bump this to being enabled for 0.9.0.
2014-01-09 23:38:03 -08:00
Reza Zadeh 21c8a54c08 Merge remote-tracking branch 'upstream/master' into sparsesvd
Conflicts:
	docs/mllib-guide.md
2014-01-09 22:45:32 -08:00
Patrick Wendell 460f655cc6 Enable shuffle consolidation by default.
Bump this to being enabled for 0.9.0.
2014-01-09 22:42:50 -08:00
Patrick Wendell 300eaa994c Merge pull request #353 from pwendell/ipython-simplify
Simplify and fix pyspark script.

This patch removes compatibility for IPython < 1.0 but fixes the launch
script and makes it much simpler.

I tested this using the three commands in the PySpark documentation page:

1. IPYTHON=1 ./pyspark
2. IPYTHON_OPTS="notebook" ./pyspark
3. IPYTHON_OPTS="notebook --pylab inline" ./pyspark

There are two changes:
- We rely on PYTHONSTARTUP env var to start PySpark
- Removed the quotes around $IPYTHON_OPTS... having quotes
  gloms them together as a single argument passed to `exec` which
  seemed to cause ipython to fail (it instead expects them as
  multiple arguments).
2014-01-09 20:29:51 -08:00
Patrick Wendell d86a85e9ca Merge pull request #293 from pwendell/standalone-driver
SPARK-998: Support Launching Driver Inside of Standalone Mode

[NOTE: I need to bring the tests up to date with new changes, so for now they will fail]

This patch provides support for launching driver programs inside of a standalone cluster manager. It also supports monitoring and re-launching of driver programs, which is useful for long-running, recoverable applications such as Spark Streaming jobs. For those jobs, this patch allows a deployment mode which is resilient to the failure of any worker node, failure of a master node (provided a multi-master setup), and even failures of the application itself, provided they are recoverable on a restart. Driver information, such as the status and logs from a driver, is displayed in the UI.

There are a few small TODO's here, but the code is generally feature-complete. They are:
- Bring tests up to date and add test coverage
- Restarting on failure should be optional and maybe off by default.
- See if we can re-use akka connections to facilitate clients behind a firewall

A sensible place to start for review would be to look at the `DriverClient` class, which gives users the ability to launch their driver program. I've also added an example program (`DriverSubmissionTest`) that allows you to test this locally and play around with killing workers, etc. Most of the code is devoted to persisting driver state in the cluster manager, exposing it in the UI, and dealing correctly with various types of failures.

Instructions to test locally:
- `sbt/sbt assembly/assembly examples/assembly`
- start a local version of the standalone cluster manager

```
./spark-class org.apache.spark.deploy.client.DriverClient \
  -j -Dspark.test.property=something \
  -e SPARK_TEST_KEY=SOMEVALUE \
  launch spark://10.99.1.14:7077 \
  ../path-to-examples-assembly-jar \
  org.apache.spark.examples.DriverSubmissionTest 1000 some extra options --some-option-here -X 13
```
- Go in the UI and make sure it started correctly, look at the output etc
- Kill workers, the driver program, masters, etc.
2014-01-09 18:37:52 -08:00
Ankur Dave b5b0de2de5 Start fixing formatting of graphx-programming-guide 2014-01-09 13:24:25 -08:00
Ankur Dave e4483582fc Add docs/graphx-programming-guide.md from 7210257ba3038d5e22d4b60fe9c3113dc45c3dff:README.md 2014-01-09 10:24:43 -08:00
Thomas Graves c617083e47 yarn-client addJar fix and misc other fixes 2014-01-09 10:24:35 -06:00
Ankur Dave 91227566bc Merge remote-tracking branch 'spark-upstream/master' into HEAD
Conflicts:
	README.md
	core/src/main/scala/org/apache/spark/util/collection/OpenHashMap.scala
	core/src/main/scala/org/apache/spark/util/collection/OpenHashSet.scala
	core/src/main/scala/org/apache/spark/util/collection/PrimitiveKeyOpenHashMap.scala
	pom.xml
	project/SparkBuild.scala
	repl/src/main/scala/org/apache/spark/repl/SparkILoop.scala
2014-01-08 21:19:08 -08:00
Patrick Wendell 112c0a1776 Fixing config option "retained_stages" => "retainedStages".
This is a very esoteric option and it's out of sync with the style we use.
So it seems fitting to fix it for 0.9.0.
2014-01-08 21:16:16 -08:00
Thomas Graves 6eef78d769 Merge pull request #345 from colorant/yarn
Support distributing extra files to workers for yarn-client mode

So that users don't need to package every dependency into one assembly jar as the Spark app jar.
2014-01-08 08:49:20 -06:00
Patrick Wendell bc81ce040d Merge remote-tracking branch 'apache-github/master' into standalone-driver
Conflicts:
	core/src/test/scala/org/apache/spark/deploy/JsonProtocolSuite.scala
	pom.xml
2014-01-08 00:38:31 -08:00
Patrick Wendell c78b381e91 Fixes 2014-01-08 00:09:12 -08:00
Patrick Wendell bb6a39a687 Merge pull request #322 from falaki/MLLibDocumentationImprovement
SPARK-1009 Updated MLlib docs to show how to use it in Python

In addition, added detailed examples for regression, clustering, and recommendation algorithms in a separate Scala section. Fixed a few minor issues with existing documentation.
2014-01-07 22:32:18 -08:00
Hossein Falaki 46cb980a5f Fixed merge conflict 2014-01-07 21:28:26 -08:00
Patrick Wendell 82a1d38aea Simplify and fix pyspark script.
This patch removes compatibility for IPython < 1.0 but fixes the launch
script and makes it much simpler.

I tested this using the three commands in the PySpark documentation page:

1. IPYTHON=1 ./pyspark
2. IPYTHON_OPTS="notebook" ./pyspark
3. IPYTHON_OPTS="notebook --pylab inline" ./pyspark

There are two changes:
- We rely on PYTHONSTARTUP env var to start PySpark
- Removed the quotes around $IPYTHON_OPTS... having quotes
  gloms them together as a single argument passed to `exec` which
  seemed to cause ipython to fail (it instead expects them as
  multiple arguments).
2014-01-07 17:55:25 -08:00
Reza Zadeh 4f38b6fab5 documentation for sparsematrix 2014-01-07 17:19:28 -08:00
Matei Zaharia 2c421749ea Address review comments 2014-01-07 19:30:23 -05:00
Matei Zaharia d8bcc8e9a0 Add way to limit default # of cores used by applications in standalone mode
Also documents the spark.deploy.spreadOut option.
2014-01-07 14:35:52 -05:00
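A sketch of the two standalone-mode settings this commit touches (these apply on the cluster side; values are illustrative):

```
import org.apache.spark.SparkConf

val conf = new SparkConf()
  .set("spark.deploy.defaultCores", "4")  // cap for apps that don't set spark.cores.max
  .set("spark.deploy.spreadOut", "true")  // spread an app's executors across workers
```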
Patrick Wendell c3cf0475e8 Merge pull request #339 from ScrapCodes/conf-improvements
Conf improvements

There are two new features.

1. Allow users to set arbitrary akka configurations via spark conf.

2. Allow configuration to be printed in logs for diagnosis.
2014-01-07 00:54:25 -08:00
Reynold Xin a862cafacf Merge pull request #331 from holdenk/master
Add a script to download sbt if not present on the system

As per the discussion on the dev mailing list, this script will use the system sbt if present, or otherwise attempt to install the sbt launcher. The fallback error message in the event it fails instructs the user to install sbt. While the URLs it fetches from aren't controlled by the Spark project directly, they are stable and are the current authoritative sources.
2014-01-07 00:18:20 -08:00
Prashant Sharma c729fa7c8e Formatting-related fixes suggested by Patrick. 2014-01-07 13:08:16 +05:30
Prashant Sharma b84dc780d3 Allow configuration to be printed in logs for diagnosis. 2014-01-07 13:01:43 +05:30
Prashant Sharma b3018811e1 Allow users to set arbitrary akka configurations via spark conf. 2014-01-07 13:01:43 +05:30
Patrick Wendell b72cceba27 Some doc fixes 2014-01-06 22:05:53 -08:00
Raymond Liu 67af803136 Export --file for YarnClient mode to support sending extra files to workers on the YARN cluster 2014-01-07 10:24:11 +08:00
Patrick Wendell c0498f9265 Merge remote-tracking branch 'apache-github/master' into standalone-driver
Conflicts:
	core/src/main/scala/org/apache/spark/deploy/client/AppClient.scala
	core/src/main/scala/org/apache/spark/deploy/client/TestClient.scala
	core/src/main/scala/org/apache/spark/deploy/master/Master.scala
	core/src/main/scala/org/apache/spark/deploy/worker/Worker.scala
	core/src/main/scala/org/apache/spark/scheduler/cluster/SparkDeploySchedulerBackend.scala
2014-01-06 17:29:21 -08:00