ODIn/spark-instrumented-optimizer

Author	SHA1	Message	Date
Ankur Dave	5064f9b2d2	Merge remote-tracking branch 'spark-upstream/master' Conflicts: project/SparkBuild.scala	2013-10-30 15:59:09 -07:00
Joseph E. Gonzalez	41b3122120	Strating to improve README.	2013-10-29 20:57:55 -07:00
Patrick Wendell	08c1a42d7d	Add a `repartition` operator. This patch adds an operator called repartition with more straightforward semantics than the current `coalesce` operator. There are a few use cases where this operator is useful: 1. If a user wants to increase the number of partitions in the RDD. This is more common now with streaming. E.g. a user is ingesting data on one node but they want to add more partitions to ensure parallelism of subsequent operations across threads or the cluster. Right now they have to call rdd.coalesce(numSplits, shuffle=true) - that's super confusing. 2. If a user has input data where the number of partitions is not known. E.g. > sc.textFile("some file").coalesce(50).... This is both vague semantically (am I growing or shrinking this RDD) but also, may not work correctly if the base RDD has fewer than 50 partitions. The new operator forces shuffles every time, so it will always produce exactly the number of new partitions. It also throws an exception rather than silently not-working if a bad input is passed. I am currently adding streaming tests (requires refactoring some of the test suite to allow testing at partition granularity), so this is not ready for merge yet. But feedback is welcome.	2013-10-24 14:31:33 -07:00
Matei Zaharia	452aa36d67	Merge pull request #97 from ewencp/pyspark-system-properties Add classmethod to SparkContext to set system properties. Add a new classmethod to SparkContext to set system properties like is possible in Scala/Java. Unlike the Java/Scala implementations, there's no access to System until the JVM bridge is created. Since SparkContext handles that, move the initialization of the JVM connection to a separate classmethod that can safely be called repeatedly as long as the same instance (or no instance) is provided.	2013-10-22 23:15:33 -07:00
Ewen Cheslack-Postava	c8748c25eb	Add notes to python documentation about using SparkContext.setSystemProperty.	2013-10-22 11:49:52 -07:00
Aaron Davidson	962bec97ee	Docs: Fix links to RDD API documentation	2013-10-22 09:39:36 -07:00
Reynold Xin	f628804c02	Merge pull request #76 from pwendell/master Clarify compression property. Clarifies that this governs compression of internal data, not input data or output data.	2013-10-18 23:19:42 -07:00
Patrick Wendell	6b62836285	Clarify compression property. Clarifies that this governs compression of internal data, not input data or output data.	2013-10-18 23:08:44 -07:00
Mosharaf Chowdhury	35b2415fb3	Code styling. Updated doc.	2013-10-17 13:14:12 -07:00
Matei Zaharia	8f11c36fe1	Merge remote-tracking branch 'tgravescs/sparkYarnDistCache' Closes #11 Conflicts: docs/running-on-yarn.md yarn/src/main/scala/org/apache/spark/deploy/yarn/ClientArguments.scala	2013-10-10 19:34:33 -07:00
Matei Zaharia	c71499b779	Merge pull request #19 from aarondav/master-zk Standalone Scheduler fault tolerance using ZooKeeper This patch implements full distributed fault tolerance for standalone scheduler Masters. There is only one master Leader at a time, which is actively serving scheduling requests. If this Leader crashes, another master will eventually be elected, reconstruct the state from the first Master, and continue serving scheduling requests. Leader election is performed using the ZooKeeper leader election pattern. We try to minimize the use of ZooKeeper and the assumptions about ZooKeeper's behavior, so there is a layer of retries and session monitoring on top of the ZooKeeper client. Master failover follows directly from the single-node Master recovery via the file system (patch `d5a96fe`), save that the Master state is stored in ZooKeeper instead. Configuration: By default, no recovery mechanism is enabled (spark.deploy.recoveryMode = NONE). By setting spark.deploy.recoveryMode to ZOOKEEPER and setting spark.deploy.zookeeper.url to an appropriate ZooKeeper URL, ZooKeeper recovery mode is enabled. By setting spark.deploy.recoveryMode to FILESYSTEM and setting spark.deploy.recoveryDirectory to an appropriate directory accessible by the Master, we will keep the behavior of from `d5a96fe`. Additionally, places where a Master could be specificied by a spark:// url can now take comma-delimited lists to specify backup masters. Note that this is only used for registration of NEW Workers and application Clients. Once a Worker or Client has registered with the Master Leader, it is "in the system" and will never need to register again.	2013-10-10 17:16:42 -07:00
Aaron Davidson	66c20635fa	Minor clarification and cleanup to spark-standalone.md	2013-10-10 14:45:12 -07:00
Aaron Davidson	42d8b8efe6	Address Matei's comments on documentation Updates to the documentation and changing some logError()s to logWarning()s.	2013-10-10 00:33:47 -07:00
Matei Zaharia	478b2b7edc	Fix PySpark docs and an overly long line of code after `fdbae41e`	2013-10-09 12:08:04 -07:00
Aaron Davidson	4ea8ee468f	Add docs for standalone scheduler fault tolerance Also fix a couple HTML/Markdown issues in other files.	2013-10-08 14:18:31 -07:00
Nick Pentreath	a5e58b8f98	Merge branch 'master' into implicit-als	2013-10-07 11:46:17 +02:00
Patrick Wendell	aa9fb84994	Merging build changes in from 0.8	2013-10-05 22:07:00 -07:00
Nick Pentreath	93b96b44d7	Adding implicit feedback ALS to MLlib user guide	2013-10-04 14:39:44 +02:00
tgravescs	0fff4ee852	Adding in the --addJars option to make SparkContext.addJar work on yarn and cleanup the classpaths	2013-10-03 11:52:16 -05:00
tgravescs	bc3b20abdc	Allow users to set the application name for Spark on Yarn	2013-10-02 12:54:17 -05:00
Patrick Wendell	6079721fa1	Update build version in master	2013-09-24 11:41:51 -07:00
$Y.CORP.YAHOO.COM\tgraves$ Y.CORP.YAHOO.COM\tgraves	9d4246863a	Support distributed cache files and archives on spark on yarn and attempt to cleanup the staging directory on exit	2013-09-23 09:09:59 -05:00
Jey Kottalam	ac0dd99394	Fix typo in Maven build docs	2013-09-15 13:29:22 -07:00
Patrick Wendell	dbd2c4fd94	Merge pull request #932 from pwendell/mesos-version Bumping Mesos version to 0.13.0	2013-09-15 13:20:41 -07:00
Patrick Wendell	c856860c5b	Bumping Mesos version to 0.13.0	2013-09-15 12:46:26 -07:00
Patrick Wendell	362ea0c051	Explain yarn.version in Maven build docs	2013-09-15 12:40:49 -07:00
Benjamin Hindman	8e2602dd70	More updates to Spark on Mesos documentation.	2013-09-11 16:08:54 -07:00
Benjamin Hindman	a0f0c1bed2	Updated Spark on Mesos documentation.	2013-09-11 16:05:25 -07:00
Patrick Wendell	bddf135670	Change port from 3030 to 4040	2013-09-11 10:01:38 -07:00
Matei Zaharia	2425eb85ca	Update Python API features	2013-09-10 11:12:59 -07:00
Patrick Wendell	cefee1ed1a	Document fortran dependency for MLBase	2013-09-09 21:45:04 -07:00
Matei Zaharia	7a5c4b647b	Small tweaks to MLlib docs	2013-09-08 21:47:24 -07:00
Matei Zaharia	7d3204b056	Merge pull request #905 from mateiz/docs2 Job scheduling and cluster mode docs	2013-09-08 21:39:12 -07:00
Matei Zaharia	b458854977	Fix some review comments	2013-09-08 21:25:49 -07:00
Ameet Talwalkar	81a8bd46ac	respose to PR comments	2013-09-08 19:21:30 -07:00
Ameet Talwalkar	bf280c8b0f	Merge remote-tracking branch 'upstream/master'	2013-09-08 18:41:38 -07:00
Patrick Wendell	f68848d95d	Merge pull request #906 from pwendell/ganglia-sink Clean-up of Metrics Code/Docs and Add Ganglia Sink	2013-09-08 18:32:16 -07:00
Ameet Talwalkar	5ac62dbbd0	updates based on comments to PR	2013-09-08 17:39:08 -07:00
Matei Zaharia	5a587fb98d	Updated cluster diagram to show caches	2013-09-08 13:51:57 -07:00
Patrick Wendell	c190b48bf5	Adding more docs and some code cleanup	2013-09-08 13:46:28 -07:00
Matei Zaharia	af8ffdb73c	Review comments	2013-09-08 13:36:50 -07:00
Matei Zaharia	c0d375107f	Some tweaks to CDH/HDP doc	2013-09-08 00:44:41 -07:00
Matei Zaharia	f261d2a60f	Added cluster overview doc, made logo higher-resolution, and added more details on monitoring	2013-09-08 00:29:11 -07:00
Matei Zaharia	651a96adf7	More fair scheduler docs and property names. Also changed uses of "job" terminology to "application" when they referred to an entire Spark program, to avoid confusion.	2013-09-08 00:29:11 -07:00
Matei Zaharia	98fb69822c	Work in progress: - Add job scheduling docs - Rename some fair scheduler properties - Organize intro page better - Link to Apache wiki for "contributing to Spark"	2013-09-08 00:29:11 -07:00
Matei Zaharia	38488aca8a	Merge pull request #900 from pwendell/cdh-docs Provide docs to describe running on CDH/HDP cluster.	2013-09-08 00:28:53 -07:00
Patrick Wendell	22b982d2bc	File rename	2013-09-07 14:38:54 -07:00
Matei Zaharia	cfde85e395	Merge pull request #901 from ooyala/2013-09/0.8-doc-changes 0.8 Doc changes for make-distribution.sh	2013-09-07 13:53:08 -07:00
Patrick Wendell	61c4762d45	Changes based on feedback	2013-09-07 11:55:10 -07:00
Evan Chan	be1ee28ca6	CR feedback from Matei	2013-09-07 08:56:24 -07:00

1 2 3 4 5 ...

305 commits