ODIn/spark-instrumented-optimizer

Author	SHA1	Message	Date
Aaron Davidson	f629ba95b6	Various merge corrections. I've diff'd this patch against my own; since they were both created independently, two sets of eyes have gone over all the merge conflicts that were created, so I'm feeling significantly more confident in the resulting PR. @rxin has looked at the changes to the repl and is resoundingly confident that they are correct.	2013-11-14 22:13:09 -08:00
Raymond Liu	a60620b76a	Merge branch 'master' into scala-2.10	2013-11-14 12:44:19 +08:00
Raymond Liu	0f2e3c6e31	Merge branch 'master' into scala-2.10	2013-11-13 16:55:11 +08:00
tgravescs	a35472e1dd	Allow spark on yarn to be run from HDFS. Allows the spark.jar, app.jar, and log4j.properties to be put into hdfs.	2013-11-04 16:16:28 -06:00
Fabrizio (Misto) Milo	3f89354c45	fix persistent-hdfs	2013-11-01 17:47:37 -07:00
Evan Chan	e54a37fe15	Document all the URIs for addJar/addFile	2013-11-01 10:58:11 -07:00
Patrick Wendell	08c1a42d7d	Add a `repartition` operator. This patch adds an operator called repartition with more straightforward semantics than the current `coalesce` operator. There are a few use cases where this operator is useful: 1. If a user wants to increase the number of partitions in the RDD. This is more common now with streaming. E.g. a user is ingesting data on one node but they want to add more partitions to ensure parallelism of subsequent operations across threads or the cluster. Right now they have to call rdd.coalesce(numSplits, shuffle=true), which is super confusing. 2. If a user has input data where the number of partitions is not known. E.g. sc.textFile("some file").coalesce(50).... This is vague semantically (am I growing or shrinking this RDD?) and also may not work correctly if the base RDD has fewer than 50 partitions. The new operator forces a shuffle every time, so it will always produce exactly the requested number of new partitions. It also throws an exception rather than silently not working if a bad input is passed. I am currently adding streaming tests (requires refactoring some of the test suite to allow testing at partition granularity), so this is not ready for merge yet. But feedback is welcome.	2013-10-24 14:31:33 -07:00
Matei Zaharia	452aa36d67	Merge pull request #97 from ewencp/pyspark-system-properties Add classmethod to SparkContext to set system properties. Add a new classmethod to SparkContext to set system properties like is possible in Scala/Java. Unlike the Java/Scala implementations, there's no access to System until the JVM bridge is created. Since SparkContext handles that, move the initialization of the JVM connection to a separate classmethod that can safely be called repeatedly as long as the same instance (or no instance) is provided.	2013-10-22 23:15:33 -07:00
Ewen Cheslack-Postava	c8748c25eb	Add notes to python documentation about using SparkContext.setSystemProperty.	2013-10-22 11:49:52 -07:00
Aaron Davidson	962bec97ee	Docs: Fix links to RDD API documentation	2013-10-22 09:39:36 -07:00
Reynold Xin	f628804c02	Merge pull request #76 from pwendell/master Clarify compression property. Clarifies that this governs compression of internal data, not input data or output data.	2013-10-18 23:19:42 -07:00
Patrick Wendell	6b62836285	Clarify compression property. Clarifies that this governs compression of internal data, not input data or output data.	2013-10-18 23:08:44 -07:00
Mosharaf Chowdhury	35b2415fb3	Code styling. Updated doc.	2013-10-17 13:14:12 -07:00
Matei Zaharia	8f11c36fe1	Merge remote-tracking branch 'tgravescs/sparkYarnDistCache' Closes #11 Conflicts: docs/running-on-yarn.md yarn/src/main/scala/org/apache/spark/deploy/yarn/ClientArguments.scala	2013-10-10 19:34:33 -07:00
Matei Zaharia	c71499b779	Merge pull request #19 from aarondav/master-zk Standalone Scheduler fault tolerance using ZooKeeper This patch implements full distributed fault tolerance for standalone scheduler Masters. There is only one master Leader at a time, which is actively serving scheduling requests. If this Leader crashes, another master will eventually be elected, reconstruct the state from the first Master, and continue serving scheduling requests. Leader election is performed using the ZooKeeper leader election pattern. We try to minimize the use of ZooKeeper and the assumptions about ZooKeeper's behavior, so there is a layer of retries and session monitoring on top of the ZooKeeper client. Master failover follows directly from the single-node Master recovery via the file system (patch `d5a96fe`), save that the Master state is stored in ZooKeeper instead. Configuration: By default, no recovery mechanism is enabled (spark.deploy.recoveryMode = NONE). By setting spark.deploy.recoveryMode to ZOOKEEPER and setting spark.deploy.zookeeper.url to an appropriate ZooKeeper URL, ZooKeeper recovery mode is enabled. By setting spark.deploy.recoveryMode to FILESYSTEM and setting spark.deploy.recoveryDirectory to an appropriate directory accessible by the Master, we will keep the behavior from `d5a96fe`. Additionally, places where a Master could be specified by a spark:// url can now take comma-delimited lists to specify backup masters. Note that this is only used for registration of NEW Workers and application Clients. Once a Worker or Client has registered with the Master Leader, it is "in the system" and will never need to register again.	2013-10-10 17:16:42 -07:00
Aaron Davidson	66c20635fa	Minor clarification and cleanup to spark-standalone.md	2013-10-10 14:45:12 -07:00
Aaron Davidson	42d8b8efe6	Address Matei's comments on documentation Updates to the documentation and changing some logError()s to logWarning()s.	2013-10-10 00:33:47 -07:00
Prashant Sharma	026ab75661	Merge branch 'master' of github.com:apache/incubator-spark into scala-2.10	2013-10-10 09:42:55 +05:30
Matei Zaharia	478b2b7edc	Fix PySpark docs and an overly long line of code after `fdbae41e`	2013-10-09 12:08:04 -07:00
Aaron Davidson	4ea8ee468f	Add docs for standalone scheduler fault tolerance Also fix a couple HTML/Markdown issues in other files.	2013-10-08 14:18:31 -07:00
Prashant Sharma	7be75682b9	Merge branch 'master' into wip-merge-master Conflicts: bagel/pom.xml core/pom.xml core/src/test/scala/org/apache/spark/ui/UISuite.scala examples/pom.xml mllib/pom.xml pom.xml project/SparkBuild.scala repl/pom.xml streaming/pom.xml tools/pom.xml In scala 2.10, a shorter representation is used for naming artifacts so changed to shorter scala version for artifacts and made it a property in pom.	2013-10-08 11:29:40 +05:30
Nick Pentreath	a5e58b8f98	Merge branch 'master' into implicit-als	2013-10-07 11:46:17 +02:00
Patrick Wendell	aa9fb84994	Merging build changes in from 0.8	2013-10-05 22:07:00 -07:00
Prashant Sharma	c810ee0690	Merge branch 'master' into scala-2.10 Conflicts: core/src/test/scala/org/apache/spark/DistributedSuite.scala project/SparkBuild.scala	2013-10-05 15:52:57 +05:30
Nick Pentreath	93b96b44d7	Adding implicit feedback ALS to MLlib user guide	2013-10-04 14:39:44 +02:00
tgravescs	0fff4ee852	Adding in the --addJars option to make SparkContext.addJar work on yarn and cleanup the classpaths	2013-10-03 11:52:16 -05:00
tgravescs	bc3b20abdc	Allow users to set the application name for Spark on Yarn	2013-10-02 12:54:17 -05:00
Prashant Sharma	5829692885	Merge branch 'master' into scala-2.10 Conflicts: core/src/main/scala/org/apache/spark/ui/jobs/JobProgressUI.scala docs/_config.yml project/SparkBuild.scala repl/src/main/scala/org/apache/spark/repl/SparkILoop.scala	2013-10-01 11:57:24 +05:30
Prashant Sharma	604dc40996	Sync with master and some build fixes	2013-09-26 11:40:02 +05:30
Patrick Wendell	6079721fa1	Update build version in master	2013-09-24 11:41:51 -07:00
Y.CORP.YAHOO.COM\tgraves	9d4246863a	Support distributed cache files and archives on spark on yarn and attempt to cleanup the staging directory on exit	2013-09-23 09:09:59 -05:00
Jey Kottalam	ac0dd99394	Fix typo in Maven build docs	2013-09-15 13:29:22 -07:00
Patrick Wendell	dbd2c4fd94	Merge pull request #932 from pwendell/mesos-version Bumping Mesos version to 0.13.0	2013-09-15 13:20:41 -07:00
Patrick Wendell	c856860c5b	Bumping Mesos version to 0.13.0	2013-09-15 12:46:26 -07:00
Patrick Wendell	362ea0c051	Explain yarn.version in Maven build docs	2013-09-15 12:40:49 -07:00
Prashant Sharma	a90e0eff59	version changed 2.9.3 -> 2.10 in shell script.	2013-09-15 12:47:20 +05:30
Benjamin Hindman	8e2602dd70	More updates to Spark on Mesos documentation.	2013-09-11 16:08:54 -07:00
Benjamin Hindman	a0f0c1bed2	Updated Spark on Mesos documentation.	2013-09-11 16:05:25 -07:00
Patrick Wendell	bddf135670	Change port from 3030 to 4040	2013-09-11 10:01:38 -07:00
Matei Zaharia	2425eb85ca	Update Python API features	2013-09-10 11:12:59 -07:00
Patrick Wendell	cefee1ed1a	Document fortran dependency for MLBase	2013-09-09 21:45:04 -07:00
Matei Zaharia	7a5c4b647b	Small tweaks to MLlib docs	2013-09-08 21:47:24 -07:00
Matei Zaharia	7d3204b056	Merge pull request #905 from mateiz/docs2 Job scheduling and cluster mode docs	2013-09-08 21:39:12 -07:00
Matei Zaharia	b458854977	Fix some review comments	2013-09-08 21:25:49 -07:00
Ameet Talwalkar	81a8bd46ac	response to PR comments	2013-09-08 19:21:30 -07:00
Ameet Talwalkar	bf280c8b0f	Merge remote-tracking branch 'upstream/master'	2013-09-08 18:41:38 -07:00
Patrick Wendell	f68848d95d	Merge pull request #906 from pwendell/ganglia-sink Clean-up of Metrics Code/Docs and Add Ganglia Sink	2013-09-08 18:32:16 -07:00
Ameet Talwalkar	5ac62dbbd0	updates based on comments to PR	2013-09-08 17:39:08 -07:00
Matei Zaharia	5a587fb98d	Updated cluster diagram to show caches	2013-09-08 13:51:57 -07:00
Patrick Wendell	c190b48bf5	Adding more docs and some code cleanup	2013-09-08 13:46:28 -07:00
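
Note on commit 08c1a42d7d: the difference between `coalesce` and the proposed `repartition`, as that commit message describes it, can be sketched with a toy model. This is plain Python over lists of lists, not Spark code. The function names mirror the operators, but the round-robin redistribution, the merge strategy, and the choice to treat a non-positive partition count as the "bad input" are all assumptions made for illustration.

```python
from typing import List

Partitions = List[List[int]]


def repartition(parts: Partitions, n: int) -> Partitions:
    """Toy model: always redistribute ("shuffle") into exactly n partitions."""
    if n <= 0:
        # Assumption: a non-positive count is the kind of bad input
        # that should fail loudly instead of being ignored.
        raise ValueError("number of partitions must be positive")
    out: Partitions = [[] for _ in range(n)]
    flat = [x for p in parts for x in p]
    for i, x in enumerate(flat):
        out[i % n].append(x)  # round-robin stands in for a real shuffle
    return out


def coalesce(parts: Partitions, n: int, shuffle: bool = False) -> Partitions:
    """Toy model: without a shuffle, partitions can only be merged."""
    if n <= 0:
        raise ValueError("number of partitions must be positive")
    if shuffle:
        return repartition(parts, n)
    if n >= len(parts):
        # Cannot grow without a shuffle: the partition count stays as it was.
        return [list(p) for p in parts]
    merged: Partitions = [[] for _ in range(n)]
    for i, p in enumerate(parts):
        merged[i % n].extend(p)
    return merged


if __name__ == "__main__":
    data = [[1, 2, 3], [4, 5]]
    print(len(coalesce(data, 50)))     # partition count is unchanged
    print(len(repartition(data, 50)))  # exactly 50 partitions, many empty
```

In this model, asking `coalesce` for 50 partitions on a two-partition input leaves two partitions, which is the ambiguity the commit message points out, while `repartition` returns the requested count regardless of the input.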

315 commits