802 commits

Author SHA1 Message Date
Andrew Ash 069bb94206 Clarify spark.default.parallelism
It's the task count across the cluster, not per worker, per machine, per core, or anything else.
2014-01-21 14:49:35 -08:00
Sandy Ryza adf42611f1 Incorporate Tom's comments - update doc and code to reflect that core requests may not always be honored 2014-01-21 00:38:02 -08:00
Patrick Wendell c324ac10ee Force use of LZF when spilling data 2014-01-20 19:00:48 -08:00
Patrick Wendell cdb003e376 Removing docs on akka options 2014-01-20 16:40:58 -08:00
Sandy Ryza 3e85b87d90 SPARK-1033. Ask for cores in Yarn container requests 2014-01-20 14:42:32 -08:00
Yinan Li 584323c6b1 Addressed comments from Reynold
Signed-off-by: Yinan Li <liyinan926@gmail.com>
2014-01-18 21:28:17 -08:00
Patrick Wendell bf5699543b Merge pull request #462 from mateiz/conf-file-fix
Remove Typesafe Config usage and conf files to fix nested property names

With Typesafe Config we had the subtle problem of no longer allowing
nested property names, which are used for a few of our properties:
http://apache-spark-developers-list.1001551.n3.nabble.com/Config-properties-broken-in-master-td208.html

This PR is for branch 0.9 but should be added into master too.
(cherry picked from commit 34e911ce9a)

Signed-off-by: Patrick Wendell <pwendell@gmail.com>
2014-01-18 16:20:00 -08:00
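
The nested-name problem in the commit above can be illustrated with a small sketch. This is not Typesafe Config itself; it only mimics a tree-shaped configuration in which a key holds either a value or a subtree, never both. The pair `spark.speculation` / `spark.speculation.interval` is used as an assumed example of nested names, and `to_tree` is a hypothetical helper.

```python
def to_tree(flat):
    """Fold dotted keys into nested dicts, as a tree-shaped config would."""
    tree = {}
    for key, value in flat.items():
        node = tree
        parts = key.split(".")
        for part in parts[:-1]:
            child = node.setdefault(part, {})
            if not isinstance(child, dict):
                # A parent segment is already a plain value, so it cannot
                # also act as a container for nested keys.
                raise ValueError(f"{key!r}: {part!r} already holds a value")
            node = child
        leaf = parts[-1]
        if isinstance(node.get(leaf), dict):
            raise ValueError(f"{key!r}: {leaf!r} already holds a subtree")
        node[leaf] = value
    return tree


# Assumed example pair: one name is a prefix of the other.
props = {
    "spark.speculation": "true",
    "spark.speculation.interval": "100",
}

# A flat key/value map stores both entries without trouble.
flat_ok = len(props) == 2

# The tree-shaped representation rejects the second key.
try:
    to_tree(props)
    conflict = None
except ValueError as err:
    conflict = str(err)

print(conflict)
```

A flat property map has no such restriction, which is consistent with the PR's choice to drop the tree-shaped config format.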
Yinan Li fd833e7ab1 Allow files added through SparkContext.addFile() to be overwritten
This is useful for the cases when a file needs to be refreshed and downloaded
by the executors periodically.

Signed-off-by: Yinan Li <liyinan926@gmail.com>
2014-01-18 15:26:59 -08:00
Reza Zadeh caf97a25a2 Merge remote-tracking branch 'upstream/master' into sparsesvd 2014-01-17 14:34:03 -08:00
Reza Zadeh 5c639d70df 0index docs 2014-01-17 14:31:39 -08:00
Reza Zadeh cb13b15a60 use 0-indexing 2014-01-17 13:55:42 -08:00
Reza Zadeh d28bf41827 changes from PR 2014-01-17 13:39:40 -08:00
Reynold Xin 0675ca50f3 Merge pull request #439 from CrazyJvm/master
SPARK-1024 Remove "-XX:+UseCompressedStrings" option from tuning guide

remove "-XX:+UseCompressedStrings" option from tuning guide since jdk7 no longer supports this.
2014-01-15 16:09:03 -08:00
Matei Zaharia 2ffdaefbcb Clarify that Python 2.7 is only needed for MLlib 2014-01-15 14:20:39 -08:00
Patrick Wendell 494d3c0774 Merge pull request #433 from markhamstra/debFix
Updated Debian packaging
2014-01-15 10:00:50 -08:00
CrazyJvm 263933da97 remove "-XX:+UseCompressedStrings" option
remove "-XX:+UseCompressedStrings" option from tuning guide since jdk7 no longer supports this.
2014-01-15 22:26:15 +08:00
Reynold Xin 3d9e66d92a Merge pull request #436 from ankurdave/VertexId-case
Rename VertexID -> VertexId in GraphX
2014-01-14 23:17:05 -08:00
Mark Hamstra 147a943df0 Removed repl-bin and updated maven build doc. 2014-01-14 22:17:24 -08:00
Ankur Dave f4d9019aa8 VertexID -> VertexId 2014-01-14 22:17:18 -08:00
Reynold Xin 3a386e2389 Merge pull request #424 from jegonzal/GraphXProgrammingGuide
Additional edits for clarity in the graphx programming guide.

Added an overview of the Graph and GraphOps functions and fixed numerous typos.
2014-01-14 21:52:50 -08:00
Ankur Dave 1210ec2945 Describe GraphX caching and uncaching in guide 2014-01-14 17:25:38 -08:00
Joseph E. Gonzalez 0bba7738a2 Additional edits for clarity in the graphx programming guide. 2014-01-14 10:31:54 -08:00
Joseph E. Gonzalez 486f37c59c Improving the graphx-programming-guide. 2014-01-14 09:43:33 -08:00
Patrick Wendell 980250b1ee Merge pull request #416 from tdas/filestream-fix
Removed unnecessary DStream operations and updated docs

Removed StreamingContext.registerInputStream and registerOutputStream - they were useless. InputDStream has been made to register itself, and just registering a DStream as an output stream causes RDD objects to be created, but the RDDs will not be computed at all. Also made DStream.register() private[streaming] for the same reasons.

Updated docs; in particular, added package documentation for the streaming package.

Also, changed NetworkWordCount's input storage level to use MEMORY_ONLY: replication on the local machine causes warning messages (as replication fails), which is scary for a new user trying out his/her first example.
2014-01-14 00:05:37 -08:00
Tathagata Das f8bd828c7c Fixed loose ends in docs. 2014-01-14 00:03:46 -08:00
Tathagata Das f8e239e058 Merge remote-tracking branch 'apache/master' into filestream-fix
Conflicts:
	streaming/src/main/scala/org/apache/spark/streaming/dstream/DStream.scala
2014-01-13 23:57:27 -08:00
Reza Zadeh 845e568fad Merge remote-tracking branch 'upstream/master' into sparsesvd 2014-01-13 23:52:34 -08:00
Patrick Wendell 0984647aae Enable compression by default for spills 2014-01-13 23:25:25 -08:00
Tathagata Das 4e497db8f3 Removed StreamingContext.registerInputStream and registerOutputStream - they were useless as InputDStream has been made to register itself. Also made DStream.register() private[streaming] - not useful to expose the confusing function. Updated a lot of documentation. 2014-01-13 23:23:46 -08:00
Patrick Wendell fdaabdc673 Merge pull request #380 from mateiz/py-bayes
Add Naive Bayes to Python MLlib, and some API fixes

- Added a Python wrapper for Naive Bayes
- Updated the Scala Naive Bayes to match the style of our other
  algorithms better and in particular make it easier to call from Java
  (added builder pattern, removed default value in train method)
- Updated Python MLlib functions to not require a SparkContext; we can
  get that from the RDD the user gives
- Added a toString method in LabeledPoint
- Made the Python MLlib tests run as part of run-tests as well (before
  they could only be run individually through each file)
2014-01-13 23:08:26 -08:00
Patrick Wendell 4a805aff5e Merge pull request #367 from ankurdave/graphx
GraphX: Unifying Graphs and Tables

GraphX extends Spark's distributed fault-tolerant collections API and interactive console with a new graph API which leverages recent advances in graph systems (e.g., [GraphLab](http://graphlab.org)) to enable users to easily and interactively build, transform, and reason about graph structured data at scale. See http://amplab.github.io/graphx/.

Thanks to @jegonzal, @rxin, @ankurdave, @dcrankshaw, @jianpingjwang, @amatsukawa, @kellrott, and @adamnovak.

Tasks left:
- [x] Graph-level uncache
- [x] Uncache previous iterations in Pregel
- [x] ~~Uncache previous iterations in GraphLab~~ (postponed to post-release)
- [x] - Describe GC issue with GraphLab
- [ ] Write `docs/graphx-programming-guide.md`
- [x] - Mention future Bagel support in docs
- [ ] - Section on caching/uncaching in docs: As with Spark, cache something that is used more than once. In an iterative algorithm, try to cache and force (i.e., materialize) something every iteration, then uncache the cached things that depended on the newly materialized RDD but that won't be referenced again.
- [x] Undo modifications to core collections and instead copy them to org.apache.spark.graphx
- [x] Make Graph serializable to work around capture in Spark shell
- [x] Rename graph -> graphx in package name and subproject
- [x] Remove standalone PageRank
- [x] ~~Fix amplab/graphx#52 by checking `iter.hasNext`~~
2014-01-13 22:58:38 -08:00
Patrick Wendell 945fe7a37e Merge pull request #408 from pwendell/external-serializers
Improvements to external sorting

1. Adds the option of compressing outputs.
2. Adds batching to the serialization to prevent OOM on the read side.
3. Slight renaming of config options.
4. Use Spark's buffer size for reads in addition to writes.
2014-01-13 22:56:12 -08:00
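
Item 2 above (batching the serialization) can be sketched as follows. This is one possible reading, not the code from the PR: records are written as independent, length-prefixed batches so that a reader only ever has to hold one batch at a time. The function names, the 4-byte length prefix, and the use of `pickle` are all assumptions made for illustration.

```python
import io
import pickle
import struct


def _flush(batch, out):
    """Write one batch as a 4-byte big-endian length followed by its payload."""
    payload = pickle.dumps(batch)
    out.write(struct.pack(">I", len(payload)))
    out.write(payload)


def write_batched(records, out, batch_size):
    """Serialize records in independent batches of at most batch_size items."""
    batch = []
    for rec in records:
        batch.append(rec)
        if len(batch) == batch_size:
            _flush(batch, out)
            batch = []
    if batch:
        _flush(batch, out)


def read_batched(src):
    """Yield records one batch at a time; only one batch is decoded at once."""
    while True:
        header = src.read(4)
        if not header:
            return
        (size,) = struct.unpack(">I", header)
        yield from pickle.loads(src.read(size))


buf = io.BytesIO()
write_batched(range(10), buf, batch_size=4)
buf.seek(0)
restored = list(read_batched(buf))
```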
Joseph E. Gonzalez 4bafc4f41f adding documentation about EdgeRDD 2014-01-13 22:55:54 -08:00
Ankur Dave af645be5b8 Fix all code examples in guide 2014-01-13 22:29:45 -08:00
Ankur Dave 2cd9358ccf Finish 6f6f8c928c 2014-01-13 22:29:23 -08:00
Ankur Dave 6f6f8c928c Wrap methods in the appropriate class/object declaration 2014-01-13 21:55:35 -08:00
Ankur Dave 67795dbbfb Write Graph Builders section in guide 2014-01-13 21:45:11 -08:00
Ankur Dave e14a14bcde Remove K-Core and LDA sections from guide; they are unimplemented 2014-01-13 21:12:58 -08:00
Ankur Dave 59e4384e19 Fix Pregel SSSP example in programming guide 2014-01-13 21:02:38 -08:00
Joseph E. Gonzalez ee8931d2c6 Finished documenting vertexrdd. 2014-01-13 19:30:35 -08:00
Joseph E. Gonzalez 552de5d42e Finished second pass on pregel docs. 2014-01-13 18:40:43 -08:00
Joseph E. Gonzalez 622b7f7d39 Minor changes in graphx programming guide. 2014-01-13 18:40:43 -08:00
Joseph E. Gonzalez cfe4a29dcb Improvements in example code for the programming guide as well as adding serialization support for GraphImpl to address issues with failed closure capture. 2014-01-13 17:18:31 -08:00
Ankur Dave 1bd5cefcae Remove aggregateNeighbors 2014-01-13 17:03:03 -08:00
Reynold Xin e2d25d2dfe Merge branch 'master' into graphx 2014-01-13 16:21:26 -08:00
Ankur Dave 8038da2328 Merge pull request #2 from jegonzal/GraphXCCIssue
Improving documentation and identifying potential bug in CC calculation.
2014-01-13 14:59:30 -08:00
Ankur Dave 97cd27e31b Add graph loader links to doc 2014-01-13 14:54:48 -08:00
Ankur Dave 15ca89b11e Fix mapReduceTriplets links in doc 2014-01-13 14:54:33 -08:00
Joseph E. Gonzalez 80e4d98dc6 Improving documentation and identifying potential bug in CC calculation. 2014-01-13 13:40:16 -08:00
Patrick Wendell c3816de504 Changing option wording per discussion with Andrew 2014-01-13 13:25:06 -08:00
Patrick Wendell 5d61e051c2 Improvements to external sorting
1. Adds the option of compressing outputs.
2. Adds batching to the serialization to prevent OOM on the read side.
3. Slight renaming of config options.
4. Use Spark's buffer size for reads in addition to writes.
2014-01-13 12:21:39 -08:00
Patrick Wendell b93f9d42f2 Merge pull request #400 from tdas/dstream-move
Moved DStream and PairDStream to org.apache.spark.streaming.dstream

Similar to the package location of `org.apache.spark.rdd.RDD`, `DStream` has been moved from `org.apache.spark.streaming.DStream` to `org.apache.spark.streaming.dstream.DStream`. I know that the package name is a little long, but I think it's better to keep it consistent with Spark's structure.

Also fixed persistence of windowed DStream. The RDDs generated by a windowed DStream are essentially unions of underlying RDDs, and persisting these union RDDs would store numerous copies of the underlying data. Instead, setting the persistence level on the windowed DStream now sets the persistence level of the underlying DStream.
2014-01-13 12:18:05 -08:00
Joseph E. Gonzalez 66c9d0092a Tested and corrected all examples up to mask in the graphx-programming-guide. 2014-01-12 22:11:13 -08:00
Ankur Dave 1efe78a101 Use GraphLoader for algorithms examples in doc 2014-01-12 22:03:03 -08:00
Tathagata Das 777c181d2f Merge remote-tracking branch 'apache/master' into dstream-move
Conflicts:
	streaming/src/main/scala/org/apache/spark/streaming/dstream/DStream.scala
2014-01-12 21:59:51 -08:00
Ankur Dave d691e9f47e Move algorithms to GraphOps 2014-01-12 21:47:16 -08:00
Ankur Dave 20c509b805 Add TriangleCount example 2014-01-12 21:41:32 -08:00
Patrick Wendell 0b96d85c20 Merge pull request #399 from pwendell/consolidate-off
Disable shuffle file consolidation by default

After running various performance tests for the 0.9 release, this still seems to have performance issues even on XFS. So let's keep this off-by-default for 0.9 and users can experiment with it depending on their disk configurations.
2014-01-12 21:31:43 -08:00
Joseph E. Gonzalez c787ff5640 Documenting Pregel API 2014-01-12 20:49:52 -08:00
Patrick Wendell 2802cc80bc Disable shuffle file consolidation by default 2014-01-12 19:16:43 -08:00
Matei Zaharia 54d3486ee9 Fix Scala version in docs (it was printed as 2.1) 2014-01-12 17:49:59 -08:00
Patrick Wendell f4d77f8cb8 Rename DStream.foreach to DStream.foreachRDD
`foreachRDD` makes it clear that the granularity of this operator is per-RDD.
As it stands, `foreach` is inconsistent with `map`, `filter`, and the other
DStream operators which get pushed down to individual records within each RDD.
2014-01-12 17:21:00 -08:00
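
The granularity difference described in the commit above can be shown with a toy model. This is plain Python, not the Spark Streaming API: a "stream" is a list of batches, each batch standing in for one RDD, and `stream_map` / `stream_foreach_rdd` are hypothetical helpers.

```python
# A toy "stream": each inner list stands in for one batch (one RDD).
batches = [[1, 2, 3], [4, 5], [6]]


def stream_map(stream, fn):
    """Per-record operator: fn sees one record at a time."""
    return [[fn(record) for record in batch] for batch in stream]


def stream_foreach_rdd(stream, fn):
    """Per-batch operator: fn sees a whole batch at a time."""
    for batch in stream:
        fn(batch)


# fn is called six times, once per record.
doubled = stream_map(batches, lambda x: x * 2)

# fn is called three times, once per batch.
batch_sizes = []
stream_foreach_rdd(batches, lambda batch: batch_sizes.append(len(batch)))
```

In this model the old name `foreach` would suggest six calls, while the operator actually makes three, which is the mismatch the rename addresses.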
Ankur Dave 7a4bb863c7 Add connected components example to doc 2014-01-12 16:58:18 -08:00
Ankur Dave 5e35d39e0f Add PageRank example and data 2014-01-12 13:10:53 -08:00
Tathagata Das 448aef6790 Moved DStream, DStreamCheckpointData and PairDStream from org.apache.spark.streaming to org.apache.spark.streaming.dstream. 2014-01-12 11:31:54 -08:00
Ankur Dave f096f4eaf1 Link methods in programming guide; document VertexID 2014-01-12 10:55:29 -08:00
Matei Zaharia 224f1a754a Update Python required version to 2.7, and mention MLlib support 2014-01-12 00:15:34 -08:00
Matei Zaharia 4c28a2bad8 Update some Python MLlib parameters to use camelCase, and tweak docs
We've used camel case in other Spark methods so it felt reasonable to
keep using it here and make the code match Scala/Java as much as
possible. Note that parameter names matter in Python because it allows
passing optional parameters by name.
2014-01-11 22:30:48 -08:00
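
The point about parameter names can be seen in a short sketch. The `train` function and its parameters are hypothetical and are not taken from the MLlib API; they only show that a keyword name is part of a Python function's public interface.

```python
def train(data, numIterations=100, stepSize=1.0, regParam=0.0):
    """Hypothetical trainer; returns its settings so the call is visible."""
    return {
        "n": len(data),
        "numIterations": numIterations,
        "stepSize": stepSize,
        "regParam": regParam,
    }


# Callers can skip earlier optional parameters by naming the one they want.
settings = train([1.0, 2.0, 3.0], stepSize=0.5)

# A caller using a different spelling of the name is rejected outright.
try:
    train([1.0], step_size=0.5)
    rejected = False
except TypeError:
    rejected = True
```

Because of this, renaming a parameter is an API change for Python callers even when the positional order stays the same.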
Matei Zaharia 9a0dfdf868 Add Naive Bayes to Python MLlib, and some API fixes
- Added a Python wrapper for Naive Bayes
- Updated the Scala Naive Bayes to match the style of our other
  algorithms better and in particular make it easier to call from Java
  (added builder pattern, removed default value in train method)
- Updated Python MLlib functions to not require a SparkContext; we can
  get that from the RDD the user gives
- Added a toString method in LabeledPoint
- Made the Python MLlib tests run as part of run-tests as well (before
  they could only be run individually through each file)
2014-01-11 22:30:48 -08:00
Joseph E. Gonzalez cf57b1b055 Correcting typos in documentation. 2014-01-11 17:13:10 -08:00
Joseph E. Gonzalez 64c4593586 Finished documenting join operators and revised some of the initial presentation. 2014-01-11 13:48:35 -08:00
Reza Zadeh f324d53555 Merge remote-tracking branch 'upstream/master' into sparsesvd 2014-01-11 13:27:15 -08:00
Ankur Dave 732333d78e Remove GraphLab 2014-01-11 11:49:35 -08:00
Joseph E. Gonzalez fac44bbe2c Finished documenting structural operators and starting join operators. 2014-01-11 11:28:01 -08:00
Joseph E. Gonzalez 1f45e4e572 starting structural operator discussion. 2014-01-11 09:27:00 -08:00
Joseph E. Gonzalez 56a245c6bc Addressing comment about Graph Processing in docs. 2014-01-11 00:21:17 -08:00
Joseph E. Gonzalez 0c9d39bbaa More organizational changes and dropping the benchmark plot. 2014-01-11 00:09:08 -08:00
Joseph E. Gonzalez b8a44f12a5 More edits. 2014-01-10 23:52:24 -08:00
Ankur Dave 362b9422e4 Soften wording about GraphX superseding Bagel 2014-01-10 23:48:32 -08:00
Patrick Wendell d37408f39c Merge pull request #377 from andrewor14/master
External Sorting for Aggregator and CoGroupedRDDs (Revisited)

(This pull request is re-opened from https://github.com/apache/incubator-spark/pull/303, which was closed because Jenkins / github was misbehaving)

The target issue for this patch is the out-of-memory exceptions triggered by aggregate operations such as reduce, groupBy, join, and cogroup. The existing AppendOnlyMap used by these operations resides purely in memory, and grows with the size of the input data until the amount of allocated memory is exceeded. Under large workloads, this problem is aggravated by the fact that OOM frequently occurs only after a very long (> 1 hour) map phase, in which case the entire job must be restarted.

The solution is to spill the contents of this map to disk once a certain memory threshold is exceeded. This functionality is provided by ExternalAppendOnlyMap, which additionally sorts this buffer before writing it out to disk, and later merges these buffers back in sorted order.

Under normal circumstances in which OOM is not triggered, ExternalAppendOnlyMap is simply a wrapper around AppendOnlyMap and incurs little overhead. Only when the memory usage is expected to exceed the given threshold does ExternalAppendOnlyMap spill to disk.
2014-01-10 16:25:01 -08:00
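
The spill-and-merge idea described in the pull request above can be sketched in a few lines. This is a minimal illustration, not the ExternalAppendOnlyMap implementation: the class name, the use of a key count as a stand-in for the memory threshold, and the `pickle`-based spill files are all assumptions.

```python
import heapq
import os
import pickle
import tempfile


class SpillingCombiner:
    """Toy combine-by-key that spills sorted runs to disk past a size limit."""

    def __init__(self, merge_value, max_keys_in_memory):
        self.merge_value = merge_value
        self.max_keys = max_keys_in_memory  # stand-in for a memory threshold
        self.current = {}
        self.spill_paths = []

    def insert(self, key, value):
        if key in self.current:
            self.current[key] = self.merge_value(self.current[key], value)
        else:
            self.current[key] = value
        if len(self.current) > self.max_keys:
            self._spill()

    def _spill(self):
        """Sort the in-memory map by key and write it out as one run."""
        fd, path = tempfile.mkstemp(suffix=".spill")
        with os.fdopen(fd, "wb") as f:
            for item in sorted(self.current.items()):
                pickle.dump(item, f)
        self.spill_paths.append(path)
        self.current = {}

    @staticmethod
    def _read_run(path):
        with open(path, "rb") as f:
            while True:
                try:
                    yield pickle.load(f)
                except EOFError:
                    return

    def items(self):
        """Merge all sorted runs and combine equal keys. Single use."""
        runs = [self._read_run(p) for p in self.spill_paths]
        runs.append(iter(sorted(self.current.items())))
        pending = None
        for key, value in heapq.merge(*runs, key=lambda kv: kv[0]):
            if pending is not None and pending[0] == key:
                pending = (key, self.merge_value(pending[1], value))
            else:
                if pending is not None:
                    yield pending
                pending = (key, value)
        if pending is not None:
            yield pending
        for path in self.spill_paths:
            os.remove(path)  # spill files are deleted once fully merged


combiner = SpillingCombiner(merge_value=lambda a, b: a + b, max_keys_in_memory=2)
for word in ["b", "a", "c", "a", "b", "a"]:
    combiner.insert(word, 1)
spill_count = len(combiner.spill_paths)
counts = dict(combiner.items())
```

When the limit is never exceeded, no run is written and `items()` just iterates the in-memory map, which mirrors the low-overhead case described above.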
Andrew Or 2e393cd5fd Update documentation for externalSorting 2014-01-10 15:45:38 -08:00
Andrew Or e4c51d2113 Address Patrick's and Reynold's comments
Aside from trivial formatting changes, use nulls instead of Options for
DiskMapIterator, and add documentation for spark.shuffle.externalSorting
and spark.shuffle.memoryFraction.

Also, set spark.shuffle.memoryFraction to 0.3, and spark.storage.memoryFraction = 0.6.
2014-01-10 15:09:51 -08:00
Thomas Graves 7cef8435d7 Merge pull request #371 from tgravescs/yarn_client_addjar_misc_fixes
Yarn client addjar and misc fixes

Fix the addJar functionality in yarn-client mode, add support for the other options supported in yarn-standalone mode, set the application type on yarn in hadoop 2.X, add documentation, change heartbeat interval to be same code as the yarn-standalone so it doesn't take so long to get containers and exit.
2014-01-10 15:34:15 -06:00
Ankur Dave 3eb83191cb Generate GraphX docs 2014-01-10 11:37:28 -08:00
Ankur Dave 6bd9a78e78 Add back Bagel links to docs, but mark them superseded 2014-01-10 11:37:10 -08:00
Joseph E. Gonzalez b1eeefb401 WIP. Updating figures and cleaning up initial skeleton for GraphX Programming guide. 2014-01-10 00:39:08 -08:00
Patrick Wendell dd03cea02a Merge pull request #378 from pwendell/consolidate_on
Enable shuffle consolidation by default.

Bump this to being enabled for 0.9.0.
2014-01-09 23:38:03 -08:00
Reza Zadeh 21c8a54c08 Merge remote-tracking branch 'upstream/master' into sparsesvd
Conflicts:
	docs/mllib-guide.md
2014-01-09 22:45:32 -08:00
Patrick Wendell 460f655cc6 Enable shuffle consolidation by default.
Bump this to being enabled for 0.9.0.
2014-01-09 22:42:50 -08:00
Patrick Wendell 300eaa994c Merge pull request #353 from pwendell/ipython-simplify
Simplify and fix pyspark script.

This patch removes compatibility for IPython < 1.0 but fixes the launch
script and makes it much simpler.

I tested this using the three commands in the PySpark documentation page:

1. IPYTHON=1 ./pyspark
2. IPYTHON_OPTS="notebook" ./pyspark
3. IPYTHON_OPTS="notebook --pylab inline" ./pyspark

There are two changes:
- We rely on PYTHONSTARTUP env var to start PySpark
- Removed the quotes around $IPYTHON_OPTS... having quotes
  gloms them together as a single argument passed to `exec` which
  seemed to cause ipython to fail (it instead expects them as
  multiple arguments).
2014-01-09 20:29:51 -08:00
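
The quoting issue described in the second change can be modeled by looking at the argument list the launched program would receive. The sketch below does not run a shell; it only mimics the two cases, and the exact `exec` line in the script is an assumption.

```python
ipython_opts = "notebook --pylab inline"

# Quoted, as in: exec ipython "$IPYTHON_OPTS"
# The quotes keep the whole value together as a single argument.
argv_quoted = ["ipython", ipython_opts]

# Unquoted, as in: exec ipython $IPYTHON_OPTS
# The shell splits the value on whitespace into separate arguments.
argv_unquoted = ["ipython", *ipython_opts.split()]

print(argv_quoted)
print(argv_unquoted)
```

In the quoted case the program sees one argument containing spaces, which matches the failure described above.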
Patrick Wendell d86a85e9ca Merge pull request #293 from pwendell/standalone-driver
SPARK-998: Support Launching Driver Inside of Standalone Mode

[NOTE: I need to bring the tests up to date with new changes, so for now they will fail]

This patch provides support for launching driver programs inside of a standalone cluster manager. It also supports monitoring and re-launching of driver programs which is useful for long running, recoverable applications such as Spark Streaming jobs. For those jobs, this patch allows a deployment mode which is resilient to the failure of any worker node, failure of a master node (provided a multi-master setup), and even failures of the application itself, provided they are recoverable on a restart. Driver information, such as the status and logs from a driver, is displayed in the UI.

There are a few small TODO's here, but the code is generally feature-complete. They are:
- Bring tests up to date and add test coverage
- Restarting on failure should be optional and maybe off by default.
- See if we can re-use akka connections to facilitate clients behind a firewall

A sensible place to start for review would be to look at the `DriverClient` class which presents users the ability to launch their driver program. I've also added an example program (`DriverSubmissionTest`) that allows you to test this locally and play around with killing workers, etc. Most of the code is devoted to persisting driver state in the cluster manager, exposing it in the UI, and dealing correctly with various types of failures.

Instructions to test locally:
- `sbt/sbt assembly/assembly examples/assembly`
- start a local version of the standalone cluster manager

```
./spark-class org.apache.spark.deploy.client.DriverClient \
  -j -Dspark.test.property=something \
  -e SPARK_TEST_KEY=SOMEVALUE \
  launch spark://10.99.1.14:7077 \
  ../path-to-examples-assembly-jar \
  org.apache.spark.examples.DriverSubmissionTest 1000 some extra options --some-option-here -X 13
```
- Go in the UI and make sure it started correctly, look at the output etc
- Kill workers, the driver program, masters, etc.
2014-01-09 18:37:52 -08:00
Ankur Dave b5b0de2de5 Start fixing formatting of graphx-programming-guide 2014-01-09 13:24:25 -08:00
Ankur Dave e4483582fc Add docs/graphx-programming-guide.md from 7210257ba3038d5e22d4b60fe9c3113dc45c3dff:README.md 2014-01-09 10:24:43 -08:00
Thomas Graves c617083e47 yarn-client addJar fix and misc other 2014-01-09 10:24:35 -06:00
Ankur Dave 91227566bc Merge remote-tracking branch 'spark-upstream/master' into HEAD
Conflicts:
	README.md
	core/src/main/scala/org/apache/spark/util/collection/OpenHashMap.scala
	core/src/main/scala/org/apache/spark/util/collection/OpenHashSet.scala
	core/src/main/scala/org/apache/spark/util/collection/PrimitiveKeyOpenHashMap.scala
	pom.xml
	project/SparkBuild.scala
	repl/src/main/scala/org/apache/spark/repl/SparkILoop.scala
2014-01-08 21:19:08 -08:00
Patrick Wendell 112c0a1776 Fixing config option "retained_stages" => "retainedStages".
This is a very esoteric option and it's out of sync with the style we use.
So it seems fitting to fix it for 0.9.0.
2014-01-08 21:16:16 -08:00
Thomas Graves 6eef78d769 Merge pull request #345 from colorant/yarn
support distributing extra files to worker for yarn client mode

So that the user doesn't need to package all dependencies into one assembly jar as the Spark app jar
2014-01-08 08:49:20 -06:00
Patrick Wendell bc81ce040d Merge remote-tracking branch 'apache-github/master' into standalone-driver
Conflicts:
	core/src/test/scala/org/apache/spark/deploy/JsonProtocolSuite.scala
	pom.xml
2014-01-08 00:38:31 -08:00
Patrick Wendell c78b381e91 Fixes 2014-01-08 00:09:12 -08:00
Patrick Wendell bb6a39a687 Merge pull request #322 from falaki/MLLibDocumentationImprovement
SPARK-1009 Updated MLlib docs to show how to use it in Python

In addition added detailed examples for regression, clustering and recommendation algorithms in a separate Scala section. Fixed a few minor issues with existing documentation.
2014-01-07 22:32:18 -08:00
Hossein Falaki 46cb980a5f Fixed merge conflict 2014-01-07 21:28:26 -08:00
Patrick Wendell 82a1d38aea Simplify and fix pyspark script.
This patch removes compatibility for IPython < 1.0 but fixes the launch
script and makes it much simpler.

I tested this using the three commands in the PySpark documentation page:

1. IPYTHON=1 ./pyspark
2. IPYTHON_OPTS="notebook" ./pyspark
3. IPYTHON_OPTS="notebook --pylab inline" ./pyspark

There are two changes:
- We rely on PYTHONSTARTUP env var to start PySpark
- Removed the quotes around $IPYTHON_OPTS... having quotes
  gloms them together as a single argument passed to `exec` which
  seemed to cause ipython to fail (it instead expects them as
  multiple arguments).
2014-01-07 17:55:25 -08:00
Reza Zadeh 4f38b6fab5 documentation for sparsematrix 2014-01-07 17:19:28 -08:00
Matei Zaharia 2c421749ea Address review comments 2014-01-07 19:30:23 -05:00
Matei Zaharia d8bcc8e9a0 Add way to limit default # of cores used by applications on standalone mode
Also documents the spark.deploy.spreadOut option.
2014-01-07 14:35:52 -05:00
Patrick Wendell c3cf0475e8 Merge pull request #339 from ScrapCodes/conf-improvements
Conf improvements

There are two new features.

1. Allow users to set arbitrary akka configurations via spark conf.

2. Allow configuration to be printed in logs for diagnosis.
2014-01-07 00:54:25 -08:00
Reynold Xin a862cafacf Merge pull request #331 from holdenk/master
Add a script to download sbt if not present on the system

As per the discussion on the dev mailing list this script will use the system sbt if present or otherwise attempt to install the sbt launcher. The fall back error message in the event it fails instructs the user to install sbt. While the URLs it fetches from aren't controlled by the spark project directly, they are stable and the current authoritative sources.
2014-01-07 00:18:20 -08:00
Prashant Sharma c729fa7c8e formatting related fixes suggested by Patrick. 2014-01-07 13:08:16 +05:30
Prashant Sharma b84dc780d3 Allow configuration to be printed in logs for diagnosis. 2014-01-07 13:01:43 +05:30
Prashant Sharma b3018811e1 Allow users to set arbitrary akka configurations via spark conf. 2014-01-07 13:01:43 +05:30
Patrick Wendell b72cceba27 Some doc fixes 2014-01-06 22:05:53 -08:00
Raymond Liu 67af803136 Export --file for YarnClient mode to support sending extra files to worker on yarn cluster 2014-01-07 10:24:11 +08:00
Patrick Wendell c0498f9265 Merge remote-tracking branch 'apache-github/master' into standalone-driver
Conflicts:
	core/src/main/scala/org/apache/spark/deploy/client/AppClient.scala
	core/src/main/scala/org/apache/spark/deploy/client/TestClient.scala
	core/src/main/scala/org/apache/spark/deploy/master/Master.scala
	core/src/main/scala/org/apache/spark/deploy/worker/Worker.scala
	core/src/main/scala/org/apache/spark/scheduler/cluster/SparkDeploySchedulerBackend.scala
2014-01-06 17:29:21 -08:00
Hossein Falaki 150089dae1 Added proper evaluation example for collaborative filtering and fixed typo 2014-01-06 12:43:17 -08:00
Andrew Ash 2dd4fb5698 Clarify spark.cores.max
It controls the count of cores across the cluster, not on a per-machine basis.
2014-01-06 09:01:46 -08:00
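
The cluster-wide meaning of the cap can be made concrete with a toy allocator. This is an illustrative sketch, not Spark's scheduler: `allocate` is a hypothetical function that hands out cores round-robin until the total granted across all workers reaches the cap.

```python
def allocate(free_cores_per_worker, cores_max):
    """Grant cores one at a time, round-robin, up to a cluster-wide cap."""
    granted = [0] * len(free_cores_per_worker)
    # The cap applies to the sum over all workers, not to each worker.
    remaining = min(cores_max, sum(free_cores_per_worker))
    i = 0
    while remaining > 0:
        if granted[i] < free_cores_per_worker[i]:
            granted[i] += 1
            remaining -= 1
        i = (i + 1) % len(granted)
    return granted


# Three workers with 8 free cores each and a cap of 12.
per_worker = allocate([8, 8, 8], cores_max=12)
total = sum(per_worker)
```

With a per-machine reading the application would end up with 24 cores here; under the cluster-wide reading it gets 12 in total.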
Holden Karau d86dc74d79 Code review feedback 2014-01-05 22:05:30 -08:00
Reza Zadeh 746148bc18 fix docs to use SparseMatrix 2014-01-05 18:03:57 -08:00
Reza Zadeh 73daa700bd add k parameter 2014-01-04 01:52:28 -08:00
Patrick Wendell 604fad9c39 Merge remote-tracking branch 'apache-github/master' into remove-binaries
Conflicts:
	core/src/test/scala/org/apache/spark/DriverSuite.scala
	docs/python-programming-guide.md
2014-01-03 21:29:33 -08:00
Hossein Falaki 8b5be06752 Added table of contents and minor fixes 2014-01-03 16:38:33 -08:00
Patrick Wendell 4ae101ff38 Merge pull request #317 from ScrapCodes/spark-915-segregate-scripts
Spark-915 segregate scripts
2014-01-03 11:24:35 -08:00
Prashant Sharma 74ba97fcf7 sbin/spark-class* -> bin/spark-class* 2014-01-03 15:08:01 +05:30
Prashant Sharma 94f2fffa23 fixed review comments 2014-01-03 14:43:37 +05:30
Prashant Sharma b4bb80002b Merge branch 'master' into spark-1002-remove-jars 2014-01-03 12:12:04 +05:30
Raymond Liu f442afc22e fix docs for yarn 2014-01-03 14:14:35 +08:00
Raymond Liu ebdfa6bb97 Using name yarn-alpha/yarn instead of yarn-2.0/yarn-2.2 2014-01-03 12:14:38 +08:00
Raymond Liu 7815a3ace9 Update maven build documentation 2014-01-03 12:12:38 +08:00
Raymond Liu be343d2a56 Fix yarn/README.md and update docs/running-on-yarn.md 2014-01-03 12:12:38 +08:00
Hossein Falaki 81989e2664 Commented the last part of collaborative filtering examples that lead to errors 2014-01-02 16:22:13 -08:00
Hossein Falaki c189c8362c Added Scala and Python examples for mllib 2014-01-02 15:22:20 -08:00
Prashant Sharma 59e8009b8d A few leftover document changes 2014-01-02 21:48:44 +05:30
Prashant Sharma a3f90a2ecf pyspark -> bin/pyspark 2014-01-02 18:50:12 +05:30
Prashant Sharma 94b7a7fe37 run-example -> bin/run-example 2014-01-02 18:41:21 +05:30
Prashant Sharma b810a85cdd spark-shell -> bin/spark-shell 2014-01-02 18:37:40 +05:30
Prashant Sharma 980afd280a Merge branch 'scripts-reorg' of github.com:shane-huang/incubator-spark into spark-915-segregate-scripts
Conflicts:
	bin/spark-shell
	core/pom.xml
	core/src/main/scala/org/apache/spark/SparkContext.scala
	core/src/main/scala/org/apache/spark/scheduler/cluster/mesos/CoarseMesosSchedulerBackend.scala
	core/src/main/scala/org/apache/spark/ui/UIWorkloadGenerator.scala
	core/src/test/scala/org/apache/spark/DriverSuite.scala
	python/run-tests
	sbin/compute-classpath.sh
	sbin/spark-class
	sbin/stop-slaves.sh
2014-01-02 17:55:21 +05:30
Reza Zadeh 61405785bc Merge remote-tracking branch 'upstream/master' into sparsesvd 2014-01-02 01:50:30 -08:00
Prashant Sharma 6be4c11194 Removed sbt folder and changed docs accordingly 2014-01-02 14:09:37 +05:30
Reza Zadeh 53ccf65362 doc tweaks 2014-01-01 20:03:47 -08:00
Reza Zadeh 97dc527849 doc tweak 2014-01-01 20:02:37 -08:00
Reza Zadeh b941b6f7b0 doc tweaks 2014-01-01 20:01:13 -08:00
Reza Zadeh dd0d3f008b New documentation 2014-01-01 19:53:04 -08:00
Matei Zaharia 0fa5809768 Updated docs for SparkConf and handled review comments 2013-12-30 22:17:28 -05:00
Patrick Wendell 6ffa9bb226 Documentation and adding supervise option 2013-12-29 11:26:56 -08:00
Reynold Xin 72a17b69f5 Revert "Merge pull request #310 from jyunfan/master"
This reverts commit 79b20e4dbe, reversing
changes made to 7375047d51.
2013-12-28 21:25:40 -10:00
Jyun-Fan Tsai 17f6620a71 Fix typo in the Accumulators section
val => var
2013-12-29 11:30:02 +08:00
fengdong ad8ce0148a changed the example links in the scala-programming-guid 2013-12-18 19:03:32 +08:00
fengdong ddebaf8280 Fixed the example link. 2013-12-18 11:00:36 +08:00
Reynold Xin 7db9165961 Merge pull request #251 from pwendell/master
Fix list rendering in YARN markdown docs.

This is some minor clean-up which makes the list render correctly.
2013-12-14 14:16:34 -08:00
Prashant Sharma d3090b79a5 A few corrections to documentation. 2013-12-12 10:12:06 +05:30
Prashant Sharma 603af51bb5 Merge branch 'master' into akka-bug-fix
Conflicts:
	core/pom.xml
	core/src/main/scala/org/apache/spark/scheduler/DAGScheduler.scala
	pom.xml
	project/SparkBuild.scala
	streaming/pom.xml
	yarn/src/main/scala/org/apache/spark/deploy/yarn/YarnAllocationHandler.scala
2013-12-11 10:21:53 +05:30
Patrick Wendell 1291dd4dce Fix list rendering in YARN markdown docs. 2013-12-10 16:38:33 -08:00
Patrick Wendell 0428145ed4 Small fix 2013-12-07 22:33:11 -08:00
Patrick Wendell b3e87c0f51 Adding HDP 2.0 version 2013-12-07 22:31:46 -08:00
Patrick Wendell 41c60b337a Various broken links in documentation 2013-12-07 22:31:44 -08:00
Patrick Wendell 6494d62fe4 Merge pull request #240 from pwendell/master
SPARK-917 Improve API links in nav bar
2013-12-07 11:56:16 -08:00
Patrick Wendell dd331a6b26 SPARK-917 Improve API links in nav bar 2013-12-07 11:49:49 -08:00
Aaron Davidson cb6ac8aafb Correct spelling error in configuration.md 2013-12-07 01:40:01 -08:00
Patrick Wendell 7a1d1c93b8 Minor formatting fix in config file 2013-12-06 20:28:22 -08:00
Patrick Wendell 1b38f5f277 Merge pull request #236 from pwendell/shuffle-docs
Adding disclaimer for shuffle file consolidation
2013-12-06 20:16:15 -08:00
Patrick Wendell b9451acdf4 Adding disclaimer for shuffle file consolidation 2013-12-06 19:25:28 -08:00
Patrick Wendell bb6e25c663 Minor doc fixes and updating README 2013-12-06 17:42:28 -08:00
Ali Ghodsi e2c2914faa more docs 2013-12-06 16:54:06 -08:00
Ali Ghodsi f2fb4b4228 Updated documentation about the YARN v2.2 build process 2013-12-06 16:31:26 -08:00
Patrick Wendell 5d460253d6 Merge pull request #228 from pwendell/master
Document missing configs and set shuffle consolidation to false.
2013-12-05 12:31:24 -08:00
Patrick Wendell 1450b8ef87 Small changes from Matei review 2013-12-04 18:49:32 -08:00
Patrick Wendell b1c6fa1584 Document missing configs and set shuffle consolidation to false. 2013-12-04 18:39:34 -08:00
Andrew Ash 0c5af38b86 Typo: applicaton 2013-12-04 12:30:25 -08:00
Prashant Sharma 17987778da Merge branch 'master' into wip-scala-2.10
Conflicts:
	core/src/main/scala/org/apache/spark/api/python/PythonRDD.scala
	core/src/main/scala/org/apache/spark/rdd/MapPartitionsRDD.scala
	core/src/main/scala/org/apache/spark/rdd/MapPartitionsWithContextRDD.scala
	core/src/main/scala/org/apache/spark/rdd/RDD.scala
	python/pyspark/rdd.py
2013-11-27 14:44:12 +05:30
Prashant Sharma 54862af5ee Improvements from the review comments and followed Boy Scout Rule. 2013-11-27 14:26:28 +05:30
Prashant Sharma dca946ff67 Documenting the newly added spark properties. 2013-11-26 20:47:38 +05:30
Andrew Ash 08afef37a0 Update tuning.md
Clarify when serializer is used based on recent user@ mailing list discussion.
2013-11-25 17:08:52 -08:00
Matei Zaharia eb4296c8f7 Merge pull request #101 from colorant/yarn-client-scheduler
For SPARK-527, Support spark-shell when running on YARN

sync to trunk and resubmit here

In the current YARN mode, the application runs inside the Application Master as a user program, so the whole Spark context is remote.

This approach does not support applications that involve local interaction and need to run where they are launched.

So in this pull request I have added a YarnClientClusterScheduler and backend.

With this scheduler, the user application is launched locally, while the executors are launched by YARN on remote nodes with a thin AM that only launches the executors and monitors the Driver Actor status, so that when the client app is done, it can finish the YARN application as well.

This enables spark-shell to run on YARN.

This also enables other Spark applications to run the Spark context locally with the master URL "yarn-client". Thus, e.g., SparkPi can print its result locally on the console instead of in the log of the remote machine where the AM is running.

Docs also updated to show how to use this yarn-client mode.
2013-11-25 15:25:29 -08:00
Prashant Sharma 44fd30d3fb Merge branch 'master' into scala-2.10-wip
Conflicts:
	core/src/main/scala/org/apache/spark/rdd/RDD.scala
	project/SparkBuild.scala
2013-11-25 18:10:54 +05:30
Reynold Xin 6bcac986b2 Merge branch 'master' of github.com:apache/incubator-spark 2013-11-25 15:47:47 +08:00
Matei Zaharia 859d62dc2a Merge pull request #151 from russellcardullo/add-graphite-sink
Add graphite sink for metrics

This adds a metrics sink for graphite.  The sink must
be configured with the host and port of a graphite node
and optionally may be configured with a prefix that will
be prepended to all metrics that are sent to graphite.
2013-11-24 16:19:51 -08:00
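As a rough illustration of the prefix behaviour described in the commit above, the sketch below formats a metric line in Graphite's plaintext protocol (`<path> <value> <timestamp>`). The function name `graphite_line` is hypothetical and is not part of Spark, and joining the prefix with a dot is an assumption about how the prefix is prepended.

```python
import time


def graphite_line(name, value, prefix=None, timestamp=None):
    """Format one metric as '<path> <value> <timestamp>' plus a newline."""
    # Assumption: the optional prefix is joined to the metric name with a dot.
    path = f"{prefix}.{name}" if prefix else name
    # Graphite expects a Unix timestamp in whole seconds.
    ts = int(time.time()) if timestamp is None else int(timestamp)
    return f"{path} {value} {ts}\n"
```

A sink would then write such lines to the configured host and port over a socket; that part is omitted here.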
Raymond Liu ab3cefde53 Add YarnClientClusterScheduler and Backend.
With this scheduler, the user application is launched locally,
while the executors will be launched by YARN on remote nodes.

This enables spark-shell to run upon YARN.
2013-11-22 09:23:27 +08:00
Prashant Sharma 95d8dbce91 Merge branch 'master' of github.com:apache/incubator-spark into scala-2.10-temp
Conflicts:
	core/src/main/scala/org/apache/spark/util/collection/PrimitiveVector.scala
	streaming/src/main/scala/org/apache/spark/streaming/api/java/JavaStreamingContext.scala
2013-11-21 12:34:46 +05:30
Neal Wiggins 21b5478ed6 Fix Kryo Serializer buffer inconsistency
The documentation here is inconsistent with the coded default and other documentation.
2013-11-20 16:19:25 -08:00
tgravescs 4093e9393a Improve Spark on Yarn Error handling 2013-11-19 12:44:00 -06:00
Aaron Davidson f629ba95b6 Various merge corrections
I've diff'd this patch against my own -- since they were both created
independently, this means that two sets of eyes have gone over all the
merge conflicts that were created, so I'm feeling significantly more
confident in the resulting PR.

@rxin has looked at the changes to the repl and is resoundingly
confident that they are correct.
2013-11-14 22:13:09 -08:00
RIA-pierre-borckmans bef398e572 Fixed typos in the CDH4 distributions version codes. 2013-11-14 11:33:48 +01:00
Raymond Liu a60620b76a Merge branch 'master' into scala-2.10 2013-11-14 12:44:19 +08:00
Raymond Liu 0f2e3c6e31 Merge branch 'master' into scala-2.10 2013-11-13 16:55:11 +08:00
Russell Cardullo ef85a51f85 Add graphite sink for metrics
This adds a metrics sink for graphite.  The sink must
be configured with the host and port of a graphite node
and optionally may be configured with a prefix that will
be prepended to all metrics that are sent to graphite.
2013-11-08 16:36:03 -08:00
Reynold Xin 551a43fd3d Merge branch 'master' of github.com:apache/incubator-spark into mergemerge
Conflicts:
	README.md
	core/src/main/scala/org/apache/spark/util/collection/OpenHashMap.scala
	core/src/main/scala/org/apache/spark/util/collection/OpenHashSet.scala
	core/src/main/scala/org/apache/spark/util/collection/PrimitiveKeyOpenHashMap.scala
2013-11-04 21:02:36 -08:00
tgravescs a35472e1dd Allow spark on yarn to be run from HDFS. Allows the spark.jar, app.jar, and log4j.properties to be put into hdfs. 2013-11-04 16:16:28 -06:00
Fabrizio (Misto) Milo 3f89354c45 fix persistent-hdfs 2013-11-01 17:47:37 -07:00
Evan Chan e54a37fe15 Document all the URIs for addJar/addFile 2013-11-01 10:58:11 -07:00
Ankur Dave 5064f9b2d2 Merge remote-tracking branch 'spark-upstream/master'
Conflicts:
	project/SparkBuild.scala
2013-10-30 15:59:09 -07:00
Joseph E. Gonzalez 41b3122120 Starting to improve README. 2013-10-29 20:57:55 -07:00
Patrick Wendell 08c1a42d7d Add a repartition operator.
This patch adds an operator called repartition with more straightforward
semantics than the current `coalesce` operator. There are a few use cases
where this operator is useful:

1. If a user wants to increase the number of partitions in the RDD. This
is more common now with streaming. E.g. a user is ingesting data on one
node but they want to add more partitions to ensure parallelism of
subsequent operations across threads or the cluster.

Right now they have to call rdd.coalesce(numSplits, shuffle=true) - that's
super confusing.

2. If a user has input data where the number of partitions is not known. E.g.

> sc.textFile("some file").coalesce(50)....

This is both vague semantically (am I growing or shrinking this RDD) but also,
may not work correctly if the base RDD has fewer than 50 partitions.

The new operator forces shuffles every time, so it will always produce exactly
the number of new partitions. It also throws an exception rather than silently
not-working if a bad input is passed.

I am currently adding streaming tests (requires refactoring some of the test
suite to allow testing at partition granularity), so this is not ready for
merge yet. But feedback is welcome.
2013-10-24 14:31:33 -07:00
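The difference between the two operators described in the commit above can be sketched with a toy model in plain Python, where an RDD is just a list of partitions. This is not Spark code: the functions `coalesce` and `repartition` below are simplified stand-ins, and the round-robin placement is an assumption made for illustration only.

```python
def coalesce(partitions, n):
    """Toy model of coalesce without a shuffle: merges partitions, never adds them."""
    if n >= len(partitions):
        # Cannot grow without a shuffle, so the partition count is unchanged.
        return [list(p) for p in partitions]
    merged = [[] for _ in range(n)]
    for i, part in enumerate(partitions):
        merged[i % n].extend(part)  # whole partitions are merged together
    return merged


def repartition(partitions, n):
    """Toy model of repartition: always redistributes into exactly n partitions."""
    if n <= 0:
        # Bad input fails loudly rather than silently doing nothing.
        raise ValueError("number of partitions must be positive")
    out = [[] for _ in range(n)]
    elements = [x for part in partitions for x in part]
    for i, x in enumerate(elements):
        out[i % n].append(x)  # element-level redistribution (the "shuffle")
    return out


rdd = [[1, 2, 3, 4, 5, 6]]  # data ingested into a single partition
```

In this model `coalesce(rdd, 3)` still has one partition, while `repartition(rdd, 3)` always has three, which is the ambiguity the commit message sets out to remove.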
Matei Zaharia 452aa36d67 Merge pull request #97 from ewencp/pyspark-system-properties
Add classmethod to SparkContext to set system properties.

Add a new classmethod to SparkContext to set system properties, as is
possible in Scala/Java. Unlike the Java/Scala implementations, there's
no access to System until the JVM bridge is created. Since
SparkContext handles that, move the initialization of the JVM
connection to a separate classmethod that can safely be called
repeatedly as long as the same instance (or no instance) is provided.
2013-10-22 23:15:33 -07:00
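The initialization pattern described in the pull request above (a classmethod that can be called repeatedly, with either the same instance or no instance) might look roughly like the following. The class `Context` and all of its members are hypothetical stand-ins and do not reproduce PySpark's actual implementation; the "bridge" is a placeholder object rather than a real JVM connection.

```python
class Context:
    """Hypothetical stand-in for a context class that owns a shared bridge."""

    _gateway = None    # shared bridge, created at most once
    _active = None     # the instance that currently owns the bridge, if any
    _properties = {}   # stand-in for system properties set through the bridge

    @classmethod
    def _ensure_initialized(cls, instance=None):
        # Safe to call repeatedly: the bridge is only created on the first call.
        if cls._gateway is None:
            cls._gateway = object()  # placeholder for launching the real bridge
        if instance is not None:
            if cls._active is not None and cls._active is not instance:
                raise ValueError("another context instance is already active")
            cls._active = instance

    @classmethod
    def set_system_property(cls, key, value):
        # No instance exists yet, so initialize the bridge with instance=None.
        cls._ensure_initialized()
        cls._properties[key] = value

    def __init__(self):
        Context._ensure_initialized(self)
```

The point of the design is that properties can be set before any context object exists, while constructing the context later reuses the same bridge.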
Ewen Cheslack-Postava c8748c25eb Add notes to python documentation about using SparkContext.setSystemProperty. 2013-10-22 11:49:52 -07:00
Aaron Davidson 962bec97ee Docs: Fix links to RDD API documentation 2013-10-22 09:39:36 -07:00
Reynold Xin f628804c02 Merge pull request #76 from pwendell/master
Clarify compression property.

Clarifies that this governs compression of internal data, not input
data or output data.
2013-10-18 23:19:42 -07:00
Patrick Wendell 6b62836285 Clarify compression property.
Clarifies that this governs compression of internal data, not input
data or output data.
2013-10-18 23:08:44 -07:00
Mosharaf Chowdhury 35b2415fb3 Code styling. Updated doc. 2013-10-17 13:14:12 -07:00
Matei Zaharia 8f11c36fe1 Merge remote-tracking branch 'tgravescs/sparkYarnDistCache'
Closes #11

Conflicts:
	docs/running-on-yarn.md
	yarn/src/main/scala/org/apache/spark/deploy/yarn/ClientArguments.scala
2013-10-10 19:34:33 -07:00
Matei Zaharia c71499b779 Merge pull request #19 from aarondav/master-zk
Standalone Scheduler fault tolerance using ZooKeeper

This patch implements full distributed fault tolerance for standalone scheduler Masters.
There is only one master Leader at a time, which is actively serving scheduling
requests. If this Leader crashes, another master will eventually be elected, reconstruct
the state from the first Master, and continue serving scheduling requests.

Leader election is performed using the ZooKeeper leader election pattern. We try to minimize
the use of ZooKeeper and the assumptions about ZooKeeper's behavior, so there is a layer of
retries and session monitoring on top of the ZooKeeper client.

Master failover follows directly from the single-node Master recovery via the file
system (patch d5a96fe), save that the Master state is stored in ZooKeeper instead.

Configuration:
By default, no recovery mechanism is enabled (spark.deploy.recoveryMode = NONE).
By setting spark.deploy.recoveryMode to ZOOKEEPER and setting spark.deploy.zookeeper.url
to an appropriate ZooKeeper URL, ZooKeeper recovery mode is enabled.
By setting spark.deploy.recoveryMode to FILESYSTEM and setting spark.deploy.recoveryDirectory
to an appropriate directory accessible by the Master, we will keep the behavior from d5a96fe.

Additionally, places where a Master could be specified by a spark:// url can now take
comma-delimited lists to specify backup masters. Note that this is only used for registration
of NEW Workers and application Clients. Once a Worker or Client has registered with the
Master Leader, it is "in the system" and will never need to register again.
2013-10-10 17:16:42 -07:00
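The configuration rules in the pull request above can be summarized in a small sketch. The property names come from the commit message; the helper functions `validate_recovery` and `parse_masters` are hypothetical, and the exact shape of the comma-delimited URL (`spark://host1:port,host2:port`) is an assumption.

```python
def validate_recovery(props):
    """Return the recovery mode after checking that its companion property is set."""
    mode = props.get("spark.deploy.recoveryMode", "NONE")  # NONE is the default
    required = {
        "NONE": None,
        "ZOOKEEPER": "spark.deploy.zookeeper.url",
        "FILESYSTEM": "spark.deploy.recoveryDirectory",
    }
    if mode not in required:
        raise ValueError(f"unknown recovery mode: {mode}")
    key = required[mode]
    if key is not None and not props.get(key):
        raise ValueError(f"{mode} recovery requires {key}")
    return mode


def parse_masters(url):
    """Split a comma-delimited spark:// URL into one URL per master."""
    scheme = "spark://"
    if not url.startswith(scheme):
        raise ValueError("expected a spark:// URL")
    return [scheme + hostport for hostport in url[len(scheme):].split(",") if hostport]
```

For example, `parse_masters("spark://host1:7077,host2:7077")` yields one URL for the primary and one for the backup master.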
Aaron Davidson 66c20635fa Minor clarification and cleanup to spark-standalone.md 2013-10-10 14:45:12 -07:00
Aaron Davidson 42d8b8efe6 Address Matei's comments on documentation
Updates to the documentation and changing some logError()s to logWarning()s.
2013-10-10 00:33:47 -07:00
Prashant Sharma 026ab75661 Merge branch 'master' of github.com:apache/incubator-spark into scala-2.10 2013-10-10 09:42:55 +05:30
Matei Zaharia 478b2b7edc Fix PySpark docs and an overly long line of code after fdbae41e 2013-10-09 12:08:04 -07:00
Aaron Davidson 4ea8ee468f Add docs for standalone scheduler fault tolerance
Also fix a couple HTML/Markdown issues in other files.
2013-10-08 14:18:31 -07:00
Prashant Sharma 7be75682b9 Merge branch 'master' into wip-merge-master
Conflicts:
	bagel/pom.xml
	core/pom.xml
	core/src/test/scala/org/apache/spark/ui/UISuite.scala
	examples/pom.xml
	mllib/pom.xml
	pom.xml
	project/SparkBuild.scala
	repl/pom.xml
	streaming/pom.xml
	tools/pom.xml

In scala 2.10, a shorter representation is used for naming artifacts
 so changed to shorter scala version for artifacts and made it a property in pom.
2013-10-08 11:29:40 +05:30
Nick Pentreath a5e58b8f98 Merge branch 'master' into implicit-als 2013-10-07 11:46:17 +02:00
Patrick Wendell aa9fb84994 Merging build changes in from 0.8 2013-10-05 22:07:00 -07:00
Prashant Sharma c810ee0690 Merge branch 'master' into scala-2.10
Conflicts:
	core/src/test/scala/org/apache/spark/DistributedSuite.scala
	project/SparkBuild.scala
2013-10-05 15:52:57 +05:30
Nick Pentreath 93b96b44d7 Adding implicit feedback ALS to MLlib user guide 2013-10-04 14:39:44 +02:00
tgravescs 0fff4ee852 Adding in the --addJars option to make SparkContext.addJar work on yarn and cleanup
the classpaths
2013-10-03 11:52:16 -05:00
tgravescs bc3b20abdc Allow users to set the application name for Spark on Yarn 2013-10-02 12:54:17 -05:00
Prashant Sharma 5829692885 Merge branch 'master' into scala-2.10
Conflicts:
	core/src/main/scala/org/apache/spark/ui/jobs/JobProgressUI.scala
	docs/_config.yml
	project/SparkBuild.scala
	repl/src/main/scala/org/apache/spark/repl/SparkILoop.scala
2013-10-01 11:57:24 +05:30
shane-huang 84849baf88 Merge branch 'reorgscripts' into scripts-reorg 2013-09-27 09:28:33 +08:00
Prashant Sharma 604dc40996 Sync with master and some build fixes 2013-09-26 11:40:02 +05:30
Patrick Wendell 6079721fa1 Update build version in master 2013-09-24 11:41:51 -07:00
Y.CORP.YAHOO.COM\tgraves 9d4246863a Support distributed cache files and archives on spark on yarn and attempt to cleanup the staging directory on exit 2013-09-23 09:09:59 -05:00
shane-huang fcfe4f9204 add admin scripts to sbin
Signed-off-by: shane-huang <shengsheng.huang@intel.com>
2013-09-23 12:42:34 +08:00
shane-huang dfbdc9ddb7 added spark-class and spark-executor to sbin
Signed-off-by: shane-huang <shengsheng.huang@intel.com>
2013-09-23 11:28:58 +08:00
Jey Kottalam ac0dd99394 Fix typo in Maven build docs 2013-09-15 13:29:22 -07:00
Patrick Wendell dbd2c4fd94 Merge pull request #932 from pwendell/mesos-version
Bumping Mesos version to 0.13.0
2013-09-15 13:20:41 -07:00
Patrick Wendell c856860c5b Bumping Mesos version to 0.13.0 2013-09-15 12:46:26 -07:00
Patrick Wendell 362ea0c051 Explain yarn.version in Maven build docs 2013-09-15 12:40:49 -07:00
Prashant Sharma a90e0eff59 version changed 2.9.3 -> 2.10 in shell script. 2013-09-15 12:47:20 +05:30
Benjamin Hindman 8e2602dd70 More updates to Spark on Mesos documentation. 2013-09-11 16:08:54 -07:00
Benjamin Hindman a0f0c1bed2 Updated Spark on Mesos documentation. 2013-09-11 16:05:25 -07:00
Patrick Wendell bddf135670 Change port from 3030 to 4040 2013-09-11 10:01:38 -07:00
Matei Zaharia 2425eb85ca Update Python API features 2013-09-10 11:12:59 -07:00
Patrick Wendell cefee1ed1a Document fortran dependency for MLBase 2013-09-09 21:45:04 -07:00
Matei Zaharia 7a5c4b647b Small tweaks to MLlib docs 2013-09-08 21:47:24 -07:00
Matei Zaharia 7d3204b056 Merge pull request #905 from mateiz/docs2
Job scheduling and cluster mode docs
2013-09-08 21:39:12 -07:00
Matei Zaharia b458854977 Fix some review comments 2013-09-08 21:25:49 -07:00
Ameet Talwalkar 81a8bd46ac response to PR comments 2013-09-08 19:21:30 -07:00
Ameet Talwalkar bf280c8b0f Merge remote-tracking branch 'upstream/master' 2013-09-08 18:41:38 -07:00
Patrick Wendell f68848d95d Merge pull request #906 from pwendell/ganglia-sink
Clean-up of Metrics Code/Docs and Add Ganglia Sink
2013-09-08 18:32:16 -07:00
Ameet Talwalkar 5ac62dbbd0 updates based on comments to PR 2013-09-08 17:39:08 -07:00
Matei Zaharia 5a587fb98d Updated cluster diagram to show caches 2013-09-08 13:51:57 -07:00
Patrick Wendell c190b48bf5 Adding more docs and some code cleanup 2013-09-08 13:46:28 -07:00
Matei Zaharia af8ffdb73c Review comments 2013-09-08 13:36:50 -07:00
Matei Zaharia c0d375107f Some tweaks to CDH/HDP doc 2013-09-08 00:44:41 -07:00
Matei Zaharia f261d2a60f Added cluster overview doc, made logo higher-resolution, and added more
details on monitoring
2013-09-08 00:29:11 -07:00
Matei Zaharia 651a96adf7 More fair scheduler docs and property names.
Also changed uses of "job" terminology to "application" when they
referred to an entire Spark program, to avoid confusion.
2013-09-08 00:29:11 -07:00
Matei Zaharia 98fb69822c Work in progress:
- Add job scheduling docs
- Rename some fair scheduler properties
- Organize intro page better
- Link to Apache wiki for "contributing to Spark"
2013-09-08 00:29:11 -07:00
Matei Zaharia 38488aca8a Merge pull request #900 from pwendell/cdh-docs
Provide docs to describe running on CDH/HDP cluster.
2013-09-08 00:28:53 -07:00
Patrick Wendell 22b982d2bc File rename 2013-09-07 14:38:54 -07:00
Matei Zaharia cfde85e395 Merge pull request #901 from ooyala/2013-09/0.8-doc-changes
0.8 Doc changes for make-distribution.sh
2013-09-07 13:53:08 -07:00
Patrick Wendell 61c4762d45 Changes based on feedback 2013-09-07 11:55:10 -07:00
Evan Chan be1ee28ca6 CR feedback from Matei 2013-09-07 08:56:24 -07:00
Matei Zaharia afe46ba36e Merge pull request #892 from jey/fix-yarn-assembly
YARN build fixes
2013-09-07 07:28:51 -07:00
Evan Chan ff1dbf2106 Add references to make-distribution.sh 2013-09-06 14:20:44 -07:00
Evan Chan 88d53f0dff "launch" scripts is more accurate terminology 2013-09-06 14:03:44 -07:00
Evan Chan 5a18b854a7 Easier way to start the master 2013-09-06 13:59:43 -07:00
Evan Chan 76d5d2d3c5 Add notes about starting spark-shell 2013-09-06 13:53:00 -07:00
Patrick Wendell a2a0cf9d68 Docs describing Spark monitoring and instrumentation 2013-09-06 13:52:57 -07:00
Patrick Wendell e653a9d891 Provide docs to describe running on CDH/HDP cluster.
This doc consolidates information relevant to CDH/HDP users in a single place.
2013-09-06 13:49:57 -07:00
Jey Kottalam 35ed09f1d1 Clarify YARN example 2013-09-06 11:31:16 -07:00
Ameet Talwalkar d52edfa753 updated content 2013-09-05 21:06:50 -07:00
Y.CORP.YAHOO.COM\tgraves c8cc276110 Review comment changes and update to org.apache packaging 2013-09-03 10:50:21 -05:00
Y.CORP.YAHOO.COM\tgraves 547fc4a412 Merge remote-tracking branch 'mesos/master' into yarnUILink
Conflicts:
	core/src/main/scala/org/apache/spark/ui/UIUtils.scala
	core/src/main/scala/org/apache/spark/ui/jobs/PoolTable.scala
	core/src/main/scala/org/apache/spark/ui/jobs/StageTable.scala
	docs/running-on-yarn.md
2013-09-03 08:36:59 -05:00
Matei Zaharia 2615cad30b Some doc improvements
- List higher-level projects that run on Spark
- Tweak CSS
2013-09-02 13:35:28 -07:00
Matei Zaharia 9329a7d4cd Fix spark.io.compression.codec and change default codec to LZF 2013-09-02 10:15:22 -07:00
Matei Zaharia 9ee1e9db2e Doc improvements 2013-09-01 22:12:03 -07:00
Matei Zaharia 3db404a43a Run script fixes for Windows after package & assembly change 2013-09-01 23:45:57 +00:00
Matei Zaharia 0a8cc30921 Move some classes to more appropriate packages:
* RDD, *RDDFunctions -> org.apache.spark.rdd
* Utils, ClosureCleaner, SizeEstimator -> org.apache.spark.util
* JavaSerializer, KryoSerializer -> org.apache.spark.serializer
2013-09-01 14:13:16 -07:00
Matei Zaharia 5b4dea2143 More fixes 2013-09-01 14:13:16 -07:00
Matei Zaharia 5701eb92c7 Fix some URLs 2013-09-01 14:13:16 -07:00
Matei Zaharia debcf24389 Fix over-zealous find-and-replace in HTML 2013-09-01 14:13:16 -07:00
Matei Zaharia d27cd03f30 Fix more URLs in docs 2013-09-01 14:13:16 -07:00
Matei Zaharia 4f422032e5 Update docs for new package 2013-09-01 14:13:15 -07:00
Matei Zaharia 4d1cb59fe1 Small tweak to docs gradient 2013-09-01 14:13:15 -07:00
Matei Zaharia 46eecd110a Initial work to rename package to org.apache.spark 2013-09-01 14:13:13 -07:00
Patrick Wendell 0e375a3cc2 Add assembly plug-in links 2013-09-01 09:43:42 -07:00
Patrick Wendell 6371febe18 Better docs 2013-08-31 19:09:06 -07:00
Matei Zaharia 9ddad0dcb4 Fixes suggested by Patrick 2013-08-31 17:40:33 -07:00
Matei Zaharia 4819baa658 More updates, describing changes to recommended use of environment vars
and new Python stuff
2013-08-31 14:21:10 -07:00
Matei Zaharia 4293533032 Update docs about HDFS versions 2013-08-30 15:04:43 -07:00
Y.CORP.YAHOO.COM\tgraves 96452eea56 fix up minor things 2013-08-30 16:04:31 -05:00
Y.CORP.YAHOO.COM\tgraves bac46266a9 Link the Spark UI to the Yarn UI 2013-08-30 15:55:32 -05:00
Matei Zaharia f3a964848d More doc improvements + better warnings when you haven't built Spark 2013-08-30 12:41:25 -07:00
Matei Zaharia 23762efda2 New hardware provisioning doc, and updates to menus 2013-08-30 10:16:26 -07:00
Matei Zaharia 1b0f69c623 Change docs color theme for 0.8 2013-08-30 10:15:58 -07:00
Matei Zaharia e11bc18294 Update Maven docs 2013-08-29 21:19:07 -07:00
Matei Zaharia 2de756ff19 Update some build instructions because only sbt assembly and mvn package
are now needed
2013-08-29 21:19:06 -07:00
Matei Zaharia 53cd50c069 Change build and run instructions to use assemblies
This commit makes Spark invocation saner by using an assembly JAR to
find all of Spark's dependencies instead of adding all the JARs in
lib_managed. It also packages the examples into an assembly and uses
that as SPARK_EXAMPLES_JAR. Finally, it replaces the old "run" script
with two better-named scripts: "run-examples" for examples, and
"spark-class" for Spark internal classes (e.g. REPL, master, etc). This
is also designed to minimize the confusion people have in trying to use
"run" to run their own classes; it's not meant to do that, but now at
least if they look at it, they can modify run-examples to do a decent
job for them.

As part of this, Bagel's examples are also now properly moved to the
examples package instead of bagel.
2013-08-29 21:19:04 -07:00
Matei Zaharia baa84e7e4c Merge pull request #865 from tgravescs/fixtmpdir
Spark on Yarn should use yarn approved directories for spark.local.dir and tmp
2013-08-28 12:44:46 -07:00
Y.CORP.YAHOO.COM\tgraves 63dc635de6 fix typos 2013-08-26 17:06:20 -05:00
Y.CORP.YAHOO.COM\tgraves c9464c74a1 Add ability for user to specify environment variables 2013-08-26 16:44:27 -05:00
Y.CORP.YAHOO.COM\tgraves 6dd64e8bb2 Update docs and remove old reference to --user option 2013-08-26 14:29:24 -05:00
Patrick Wendell 2cfe52ef55 Version bump for ec2 docs 2013-08-24 15:16:53 -07:00
Patrick Wendell 4879685910 Merge remote-tracking branch 'mesos/master' into ec2-updates 2013-08-24 14:50:58 -07:00
Matei Zaharia 5a6ac12840 Merge pull request #701 from ScrapCodes/documentation-suggestions
Documentation suggestions for spark streaming.
2013-08-22 22:08:03 -07:00
Prashant Sharma 2bc348e92c Linking custom receiver guide 2013-08-23 09:44:02 +05:30
Prashant Sharma 39a1d58da4 Improved documentation for spark custom receiver 2013-08-23 09:38:50 +05:30
Jey Kottalam f9cc1fbf27 Remove references to unsupported Hadoop versions 2013-08-21 17:14:36 -07:00
Patrick Wendell 6be6b71c8c Merge branch 'master' into ec2-updates
Conflicts:
	ec2/spark_ec2.py
2013-08-21 15:34:31 -07:00
Jey Kottalam 6585f49841 Update build docs 2013-08-21 14:51:56 -07:00
Jey Kottalam 9c6f8df30f Update jekyll plugin to match docs/README.md 2013-08-21 12:57:56 -07:00
Matei Zaharia 53b1c30607 Update docs for Spark UI port 2013-08-20 22:57:11 -07:00
Matei Zaharia aa2b89d98d Merge remote-tracking branch 'jey/hadoop-agnostic'
Conflicts:
	core/src/main/scala/spark/PairRDDFunctions.scala
2013-08-20 10:14:15 -07:00
Matei Zaharia 2a4ed10210 Address some review comments:
- When a resourceOffers() call has multiple offers, force the TaskSets
  to consider them in increasing order of locality levels so that they
  get a chance to launch stuff locally across all offers

- Simplify ClusterScheduler.prioritizeContainers

- Add docs on the new configuration options
2013-08-18 19:51:07 -07:00
Jey Kottalam 14b6bcdf93 update YARN docs 2013-08-15 16:50:37 -07:00
Evan Sparks 4346f0a1e9 Merge pull request #809 from shivaram/sgd-cleanup
Clean up scaladoc in ML Lib.
2013-08-12 12:12:12 -07:00
Shivaram Venkataraman 8b5e3e2eb5 Add ML Lib scaladoc to API dropdown 2013-08-11 23:52:43 -07:00
Patrick Wendell 9244524146 Removing dead docs 2013-08-11 20:33:58 -07:00
Shivaram Venkataraman 4935a2558b Clean up scaladoc in ML Lib.
Also build and copy ML Lib scaladoc in Spark docs build.
Some more minor cleanup with respect to naming, test locations etc.
2013-08-11 19:02:43 -07:00
Matei Zaharia de6c4c995a Merge pull request #787 from ash211/master
Update spark-standalone.md
2013-08-06 17:09:50 -07:00
Andrew Ash afc2c80fdb Update spark-standalone.md 2013-08-07 00:44:43 +01:00
Patrick Wendell 5cc725a0e3 Merge branch 'master' into ec2-updates
Conflicts:
	ec2/deploy.generic/root/mesos-ec2/ec2-variables.sh
2013-07-31 21:35:12 -07:00
Patrick Wendell b7b627d5bb Updating relevant documentation 2013-07-31 21:28:27 -07:00
Matei Zaharia 3097d75d6f Merge remote-tracking branch 'dlyubimov/SPARK-827'
Conflicts:
	docs/configuration.md
2013-07-31 18:36:43 -07:00
Reynold Xin 5227043f84 Documentation update for compression codec. 2013-07-30 17:12:16 -07:00
Matei Zaharia 497f55755f Add docs about ipython 2013-07-29 02:51:43 -04:00
Dmitriy Lyubimov 0862494d44 typo 2013-07-27 23:16:20 -07:00
Dmitriy Lyubimov f5067abe85 changes per comments. 2013-07-27 23:08:00 -07:00
Ubuntu 88a0823c58 Consistently invoke bash with /usr/bin/env bash in scripts to make code more portable (JIRA Ticket SPARK-817) 2013-07-18 00:51:18 +00:00
Matei Zaharia af3c9d5042 Add Apache license headers and LICENSE and NOTICE files 2013-07-16 17:21:33 -07:00
Matei Zaharia d47c16f78d Add an option to disable reference tracking in Kryo 2013-07-15 01:55:54 +00:00
Andy Konwinski cd7259b4b8 Fixes typos in Spark Streaming Programming Guide
These typos were reported on the spark-users mailing list, see: https://groups.google.com/d/msg/spark-users/SyLGgJlKCrI/LpeBypOkSMUJ
2013-07-12 11:51:14 -07:00
Matei Zaharia 1ffadb2d9e Merge remote-tracking branch 'pwendell/ui-updates'
Conflicts:
	core/src/main/scala/spark/scheduler/DAGScheduler.scala
	core/src/main/scala/spark/util/AkkaUtils.scala
	pom.xml
2013-07-06 15:51:41 -07:00
root 7cd490ef5b Clarify that PySpark is not supported on Windows 2013-07-01 06:26:43 +00:00
Matei Zaharia 5bbd0eec84 Update docs on SCALA_LIBRARY_PATH 2013-06-30 17:00:40 -07:00
Matei Zaharia 03d0b858c8 Made use of spark.executor.memory setting consistent and documented it
Conflicts:

	core/src/main/scala/spark/SparkContext.scala
2013-06-30 15:46:46 -07:00
Matei Zaharia aea727f68d Simplify Python docs a little to do substring search 2013-06-26 21:15:09 -07:00
Patrick Wendell a59c15a37e Adding config option for retained stages 2013-06-26 08:54:57 -07:00
Tathagata Das c89af0a7f9 Merge branch 'master' into streaming
Conflicts:
	.gitignore
2013-06-24 23:57:47 -07:00
Matei Zaharia b5df1cd668 ADD_JARS environment variable for spark-shell 2013-06-22 17:14:44 -07:00
Reynold Xin 0eab7a78b9 Fixed a couple typos and formatting problems in the YARN documentation. 2013-05-17 18:05:46 -07:00
Reynold Xin 7760d78b3a Merge branch 'master' of https://github.com/mridulm/spark 2013-05-17 17:58:36 -07:00
Mridul Muralidharan da2642bead Fix example jar name 2013-05-17 06:58:46 +05:30
Reynold Xin 3b3300383a Updated Scala version in docs generation ruby script. 2013-05-16 16:51:28 -07:00
Mridul Muralidharan f16c781709 Fix documentation to use yarn-standalone as master 2013-05-16 17:50:22 +05:30
Mridul Muralidharan 87540a7b38 Fix running on yarn documentation 2013-05-16 15:27:58 +05:30
Andrew Ash afcad7b3aa Docs: Mention spark shell's default for MASTER 2013-05-15 14:45:14 -03:00
Mridul Muralidharan ee37612bc9 1) Add support for HADOOP_CONF_DIR (and/or YARN_CONF_DIR - use either) : which is used to specify the client side configuration directory : which needs to be part of the CLASSPATH.
2) Move from var+=".." to var="$var.." : the former does not work on older bash shells unfortunately.
2013-05-11 11:12:22 +05:30
Matei Zaharia cf54b824ff Merge pull request #580 from pwendell/quickstart
SPARK-739 Have quickstart standalone job use README
2013-04-25 11:45:58 -07:00
Patrick Wendell a72134a6ac SPARK-739 Have quickstart standalone job use README 2013-04-25 10:39:28 -07:00
Mridul Muralidharan dd515ca3ee Attempt at fixing merge conflict 2013-04-24 09:24:17 +05:30
Mridul Muralidharan ac2e8e8720 Add some basic documentation 2013-04-19 00:13:19 +05:30
seanm ab0f834dbb adding spark.streaming.blockInterval property 2013-04-16 11:57:05 -06:00
Andy Konwinski 60a91b3b59 Update quick-start.md heading on Operations (not just Transformations). 2013-04-12 12:34:51 -07:00
Andrew Ash 6efc8cae8f Typos: cluser -> cluster 2013-04-10 13:44:10 -03:00
Matei Zaharia 65caa8f711 Merge remote-tracking branch 'jey/bump-development-version-to-0.8.0'
Conflicts:
	docs/_config.yml
	project/SparkBuild.scala
2013-04-08 12:43:17 -04:00
Matei Zaharia a1586412d6 Updated link to SBT 2013-04-07 20:31:19 -04:00
Matei Zaharia 34a47b8bc9 Update Scala version in docs 2013-04-07 20:27:03 -04:00
Matei Zaharia a98996d1fe Merge pull request #545 from ash211/patch-1
Don't use deprecated Application in example
2013-03-29 22:12:15 -07:00
Jey Kottalam bc8ba222ff Bump development version to 0.8.0 2013-03-28 15:42:01 -07:00
Andrew Ash e8f3669c63 Update tuning.md
Make the example more compilable
2013-03-28 19:17:39 -03:00
Andrew Ash 4e2c965383 Don't use deprecated Application in example
As of 2.9.0 extending from Application is not recommended

http://www.scala-lang.org/api/2.9.3/index.html#scala.Application
2013-03-28 17:47:37 -03:00
Andy Konwinski 446b801b3b Fixing typos pointed out by Matei 2013-03-20 17:30:31 -07:00
Andy Konwinski ad7f0452ab Adds page to docs about building using Maven.
Adds links to new instructions in:
* The main Spark project README.md
* The docs nav menu called "More"
* The docs Overview page under the "Building" and "Where to Go from Here" sections
2013-03-17 15:02:40 -07:00
Andy Konwinski c9097628fc Fix broken link to YARN documentation. 2013-03-13 14:51:13 -07:00