Commit graph

315 commits

Author SHA1 Message Date
Aaron Davidson f629ba95b6 Various merge corrections
I've diff'd this patch against my own -- since they were both created
independently, this means that two sets of eyes have gone over all the
merge conflicts that were created, so I'm feeling significantly more
confident in the resulting PR.

@rxin has looked at the changes to the repl and is resoundingly
confident that they are correct.
2013-11-14 22:13:09 -08:00
Raymond Liu a60620b76a Merge branch 'master' into scala-2.10 2013-11-14 12:44:19 +08:00
Raymond Liu 0f2e3c6e31 Merge branch 'master' into scala-2.10 2013-11-13 16:55:11 +08:00
tgravescs a35472e1dd Allow spark on yarn to be run from HDFS. Allows the spark.jar, app.jar, and log4j.properties to be put into hdfs. 2013-11-04 16:16:28 -06:00
Fabrizio (Misto) Milo 3f89354c45 fix persistent-hdfs 2013-11-01 17:47:37 -07:00
Evan Chan e54a37fe15 Document all the URIs for addJar/addFile 2013-11-01 10:58:11 -07:00
Patrick Wendell 08c1a42d7d Add a repartition operator.
This patch adds an operator called repartition with more straightforward
semantics than the current `coalesce` operator. There are a few use cases
where this operator is useful:

1. If a user wants to increase the number of partitions in the RDD. This
is more common now with streaming. E.g. a user is ingesting data on one
node but they want to add more partitions to ensure parallelism of
subsequent operations across threads or the cluster.

Right now they have to call rdd.coalesce(numSplits, shuffle=true) - that's
super confusing.

2. If a user has input data where the number of partitions is not known. E.g.

> sc.textFile("some file").coalesce(50)....

This is both vague semantically (am I growing or shrinking this RDD) but also,
may not work correctly if the base RDD has fewer than 50 partitions.

The new operator forces shuffles every time, so it will always produce exactly
the number of new partitions. It also throws an exception rather than silently
not-working if a bad input is passed.

I am currently adding streaming tests (requires refactoring some of the test
suite to allow testing at partition granularity), so this is not ready for
merge yet. But feedback is welcome.
2013-10-24 14:31:33 -07:00
Matei Zaharia 452aa36d67 Merge pull request #97 from ewencp/pyspark-system-properties
Add classmethod to SparkContext to set system properties.

Add a new classmethod to SparkContext to set system properties like is
possible in Scala/Java. Unlike the Java/Scala implementations, there's
no access to System until the JVM bridge is created. Since
SparkContext handles that, move the initialization of the JVM
connection to a separate classmethod that can safely be called
repeatedly as long as the same instance (or no instance) is provided.
2013-10-22 23:15:33 -07:00
Ewen Cheslack-Postava c8748c25eb Add notes to python documentation about using SparkContext.setSystemProperty. 2013-10-22 11:49:52 -07:00
Aaron Davidson 962bec97ee Docs: Fix links to RDD API documentation 2013-10-22 09:39:36 -07:00
Reynold Xin f628804c02 Merge pull request #76 from pwendell/master
Clarify compression property.

Clarifies that this governs compression of internal data, not input
data or output data.
2013-10-18 23:19:42 -07:00
Patrick Wendell 6b62836285 Clarify compression property.
Clarifies that this governs compression of internal data, not input
data or output data.
2013-10-18 23:08:44 -07:00
Mosharaf Chowdhury 35b2415fb3 Code styling. Updated doc. 2013-10-17 13:14:12 -07:00
Matei Zaharia 8f11c36fe1 Merge remote-tracking branch 'tgravescs/sparkYarnDistCache'
Closes #11

Conflicts:
	docs/running-on-yarn.md
	yarn/src/main/scala/org/apache/spark/deploy/yarn/ClientArguments.scala
2013-10-10 19:34:33 -07:00
Matei Zaharia c71499b779 Merge pull request #19 from aarondav/master-zk
Standalone Scheduler fault tolerance using ZooKeeper

This patch implements full distributed fault tolerance for standalone scheduler Masters.
There is only one master Leader at a time, which is actively serving scheduling
requests. If this Leader crashes, another master will eventually be elected, reconstruct
the state from the first Master, and continue serving scheduling requests.

Leader election is performed using the ZooKeeper leader election pattern. We try to minimize
the use of ZooKeeper and the assumptions about ZooKeeper's behavior, so there is a layer of
retries and session monitoring on top of the ZooKeeper client.

Master failover follows directly from the single-node Master recovery via the file
system (patch d5a96fe), save that the Master state is stored in ZooKeeper instead.

Configuration:
By default, no recovery mechanism is enabled (spark.deploy.recoveryMode = NONE).
By setting spark.deploy.recoveryMode to ZOOKEEPER and setting spark.deploy.zookeeper.url
to an appropriate ZooKeeper URL, ZooKeeper recovery mode is enabled.
By setting spark.deploy.recoveryMode to FILESYSTEM and setting spark.deploy.recoveryDirectory
to an appropriate directory accessible by the Master, we will keep the behavior of from d5a96fe.

Additionally, places where a Master could be specificied by a spark:// url can now take
comma-delimited lists to specify backup masters. Note that this is only used for registration
of NEW Workers and application Clients. Once a Worker or Client has registered with the
Master Leader, it is "in the system" and will never need to register again.
2013-10-10 17:16:42 -07:00
Aaron Davidson 66c20635fa Minor clarification and cleanup to spark-standalone.md 2013-10-10 14:45:12 -07:00
Aaron Davidson 42d8b8efe6 Address Matei's comments on documentation
Updates to the documentation and changing some logError()s to logWarning()s.
2013-10-10 00:33:47 -07:00
Prashant Sharma 026ab75661 Merge branch 'master' of github.com:apache/incubator-spark into scala-2.10 2013-10-10 09:42:55 +05:30
Matei Zaharia 478b2b7edc Fix PySpark docs and an overly long line of code after fdbae41e 2013-10-09 12:08:04 -07:00
Aaron Davidson 4ea8ee468f Add docs for standalone scheduler fault tolerance
Also fix a couple HTML/Markdown issues in other files.
2013-10-08 14:18:31 -07:00
Prashant Sharma 7be75682b9 Merge branch 'master' into wip-merge-master
Conflicts:
	bagel/pom.xml
	core/pom.xml
	core/src/test/scala/org/apache/spark/ui/UISuite.scala
	examples/pom.xml
	mllib/pom.xml
	pom.xml
	project/SparkBuild.scala
	repl/pom.xml
	streaming/pom.xml
	tools/pom.xml

In scala 2.10, a shorter representation is used for naming artifacts
 so changed to shorter scala version for artifacts and made it a property in pom.
2013-10-08 11:29:40 +05:30
Nick Pentreath a5e58b8f98 Merge branch 'master' into implicit-als 2013-10-07 11:46:17 +02:00
Patrick Wendell aa9fb84994 Merging build changes in from 0.8 2013-10-05 22:07:00 -07:00
Prashant Sharma c810ee0690 Merge branch 'master' into scala-2.10
Conflicts:
	core/src/test/scala/org/apache/spark/DistributedSuite.scala
	project/SparkBuild.scala
2013-10-05 15:52:57 +05:30
Nick Pentreath 93b96b44d7 Adding implicit feedback ALS to MLlib user guide 2013-10-04 14:39:44 +02:00
tgravescs 0fff4ee852 Adding in the --addJars option to make SparkContext.addJar work on yarn and cleanup
the classpaths
2013-10-03 11:52:16 -05:00
tgravescs bc3b20abdc Allow users to set the application name for Spark on Yarn 2013-10-02 12:54:17 -05:00
Prashant Sharma 5829692885 Merge branch 'master' into scala-2.10
Conflicts:
	core/src/main/scala/org/apache/spark/ui/jobs/JobProgressUI.scala
	docs/_config.yml
	project/SparkBuild.scala
	repl/src/main/scala/org/apache/spark/repl/SparkILoop.scala
2013-10-01 11:57:24 +05:30
Prashant Sharma 604dc40996 Sync with master and some build fixes 2013-09-26 11:40:02 +05:30
Patrick Wendell 6079721fa1 Update build version in master 2013-09-24 11:41:51 -07:00
Y.CORP.YAHOO.COM\tgraves 9d4246863a Support distributed cache files and archives on spark on yarn and attempt to cleanup the staging directory on exit 2013-09-23 09:09:59 -05:00
Jey Kottalam ac0dd99394 Fix typo in Maven build docs 2013-09-15 13:29:22 -07:00
Patrick Wendell dbd2c4fd94 Merge pull request #932 from pwendell/mesos-version
Bumping Mesos version to 0.13.0
2013-09-15 13:20:41 -07:00
Patrick Wendell c856860c5b Bumping Mesos version to 0.13.0 2013-09-15 12:46:26 -07:00
Patrick Wendell 362ea0c051 Explain yarn.version in Maven build docs 2013-09-15 12:40:49 -07:00
Prashant Sharma a90e0eff59 version changed 2.9.3 -> 2.10 in shell script. 2013-09-15 12:47:20 +05:30
Benjamin Hindman 8e2602dd70 More updates to Spark on Mesos documentation. 2013-09-11 16:08:54 -07:00
Benjamin Hindman a0f0c1bed2 Updated Spark on Mesos documentation. 2013-09-11 16:05:25 -07:00
Patrick Wendell bddf135670 Change port from 3030 to 4040 2013-09-11 10:01:38 -07:00
Matei Zaharia 2425eb85ca Update Python API features 2013-09-10 11:12:59 -07:00
Patrick Wendell cefee1ed1a Document fortran dependency for MLBase 2013-09-09 21:45:04 -07:00
Matei Zaharia 7a5c4b647b Small tweaks to MLlib docs 2013-09-08 21:47:24 -07:00
Matei Zaharia 7d3204b056 Merge pull request #905 from mateiz/docs2
Job scheduling and cluster mode docs
2013-09-08 21:39:12 -07:00
Matei Zaharia b458854977 Fix some review comments 2013-09-08 21:25:49 -07:00
Ameet Talwalkar 81a8bd46ac respose to PR comments 2013-09-08 19:21:30 -07:00
Ameet Talwalkar bf280c8b0f Merge remote-tracking branch 'upstream/master' 2013-09-08 18:41:38 -07:00
Patrick Wendell f68848d95d Merge pull request #906 from pwendell/ganglia-sink
Clean-up of Metrics Code/Docs and Add Ganglia Sink
2013-09-08 18:32:16 -07:00
Ameet Talwalkar 5ac62dbbd0 updates based on comments to PR 2013-09-08 17:39:08 -07:00
Matei Zaharia 5a587fb98d Updated cluster diagram to show caches 2013-09-08 13:51:57 -07:00
Patrick Wendell c190b48bf5 Adding more docs and some code cleanup 2013-09-08 13:46:28 -07:00