Commit graph

2577 commits

Author SHA1 Message Date
Tathagata Das 097e120c0c Refactored streaming scheduler and added listener interface.
- Refactored Scheduler + JobManager to JobGenerator + JobScheduler and
  added JobSet for cleaner code. Moved scheduler related code to
  streaming.scheduler package.
- Added StreamingListener trait (similar to SparkListener) to enable
  gathering of streaming stats such as processing times and delays.
  Use StreamingContext.addListener() to add listeners.
- Deduped some code in streaming tests by modifying TestSuiteBase, and
  added StreamingListenerSuite.
2013-12-12 20:48:02 -08:00
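The listener pattern this commit describes, sketched minimally below; the trait, callback, and bus class here are illustrative stand-ins, not the exact API the commit adds:

```scala
// Illustrative sketch (not the actual Spark API): a trait of no-op callbacks
// plus an addListener() registration point, as described above.
trait StreamingListener {
  def onBatchCompleted(processingTimeMs: Long, totalDelayMs: Long): Unit = {}
}

class ListenerBus {
  private val listeners = scala.collection.mutable.ArrayBuffer[StreamingListener]()
  def addListener(l: StreamingListener): Unit = listeners += l
  // The scheduler posts batch stats; every registered listener gets them.
  def postBatchCompleted(processingTimeMs: Long, totalDelayMs: Long): Unit =
    listeners.foreach(_.onBatchCompleted(processingTimeMs, totalDelayMs))
}
```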
Prashant Sharma 603af51bb5 Merge branch 'master' into akka-bug-fix
Conflicts:
	core/pom.xml
	core/src/main/scala/org/apache/spark/scheduler/DAGScheduler.scala
	pom.xml
	project/SparkBuild.scala
	streaming/pom.xml
	yarn/src/main/scala/org/apache/spark/deploy/yarn/YarnAllocationHandler.scala
2013-12-11 10:21:53 +05:30
Binh Nguyen 0b494f7db4 Hook directly to Scala API 2013-12-10 11:17:52 -08:00
Binh Nguyen e85af50767 Leave default value of numPartitions to Scala code. 2013-12-10 11:04:14 -08:00
Grega Kespret 558af87334 Fix tests. 2013-12-10 11:43:42 +01:00
Binh Nguyen c82d4f079b Use braces to shorten the line. 2013-12-10 01:04:52 -08:00
Binh Nguyen 5013fb64b2 Expose numPartitions parameter in JavaPairRDD.sortByKey()
This change makes the Java and Scala APIs for sortByKey() the same.
2013-12-10 00:38:16 -08:00
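A usage sketch of the now-aligned API, assuming an existing SparkContext `sc` (parameter names as in the Scala API of this era):

```scala
import org.apache.spark.SparkContext._  // implicit pair-RDD conversions in this era

val pairs = sc.parallelize(Seq(("b", 2), ("a", 1), ("c", 3)))
// numPartitions can now be passed explicitly from Java as well.
val sorted = pairs.sortByKey(ascending = true, numPartitions = 4)
```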
Prashant Sharma 17db6a9041 Style fixes and addressed review comments at #221 2013-12-10 11:47:16 +05:30
Patrick Wendell 5b74609d97 License headers 2013-12-09 16:41:01 -08:00
Grega Kespret 14a1df6572 Fix for spark.task.maxFailures not enforced correctly. 2013-12-09 10:39:02 +01:00
wangda.tan ee68a85cff SPARK-968, added sc finalize code to avoid akka rebinding to the same port 2013-12-09 09:38:58 +08:00
Aaron Davidson 40f63eb034 Merge master into 127 2013-12-08 11:16:52 -08:00
wangda.tan 850c4b709a Merge branch 'master' of https://github.com/leftnoteasy/incubator-spark-1 2013-12-09 00:12:46 +08:00
wangda.tan 48e4f2ad14 SPARK-968, In stage UI, add an overview section that shows task stats grouped by executor id 2013-12-09 00:02:59 +08:00
Matei Zaharia e0392343a0 Merge pull request #190 from markhamstra/Stages4Jobs
stageId <--> jobId mapping in DAGScheduler

Okay, I think this one is ready to go -- or at least it's ready for review and discussion.  It's a carry-over of https://github.com/mesos/spark/pull/842 with updates for the newer job cancellation functionality.  The prior discussion still applies.  I've actually changed the job cancellation flow a bit: Instead of ``cancelTasks`` going to the TaskScheduler and then ``taskSetFailed`` coming back to the DAGScheduler (resulting in ``abortStage`` there), the DAGScheduler now takes care of figuring out which stages should be cancelled, tells the TaskScheduler to cancel tasks for those stages, then does the cleanup within the DAGScheduler directly without the need for any further prompting by the TaskScheduler.

I know of three outstanding issues, each of which can and should, I believe, be handled in follow-up pull requests:

1) https://spark-project.atlassian.net/browse/SPARK-960
2) JobLogger should be re-factored to eliminate duplication
3) Related to 2), the WebUI should also become a consumer of the DAGScheduler's new understanding of the relationship between jobs and stages so that it can display progress indication and the like grouped by job.  Right now, some of this information is just being sent out as part of ``SparkListenerJobStart`` messages, but more or different job <--> stage information may need to be exported from the DAGScheduler to meet listeners' needs.

Except for the eventQueue -> Actor commit, the rest can be cherry-picked almost cleanly into branch-0.8.  A little merging is needed in MapOutputTracker and the DAGScheduler.  Merged versions of those files are in aba2b40ce0

Note that between the recent Actor change in the DAGScheduler and the cleaning up of DAGScheduler data structures on job completion in this PR, some races have been introduced into the DAGSchedulerSuite.  Those tests usually pass, and I don't think that better-behaved code that doesn't directly inspect DAGScheduler data structures should be seeing any problems, but I'll work on fixing DAGSchedulerSuite as either an addition to this PR or as a separate request.

UPDATE: Fixed the race that I introduced.  Created a JIRA issue (SPARK-965) for the one that was introduced with the switch to eventProcessorActor in the DAGScheduler.
2013-12-06 11:49:59 -08:00
Matei Zaharia bfa68609d9 Merge pull request #233 from hsaputra/changecontexttobackend
Change the name of input argument in ClusterScheduler#initialize from context to backend.

The SchedulerBackend used to be called ClusterSchedulerContext, so this just makes a small
change to the input param of ClusterScheduler#initialize to reflect the new name.
2013-12-06 11:04:03 -08:00
Matei Zaharia 3fb302c08d Merge pull request #205 from kayousterhout/logging
Added logging of scheduler delays to UI

This commit adds two metrics to the UI:

1) The time to get task results, if they're fetched remotely

2) The scheduler delay.  When the scheduler starts getting overwhelmed (because it can't keep up with the rate at which tasks are being submitted), the result is that tasks get delayed on the tail-end: the message from the worker saying that the task has completed ends up in a long queue and takes a while to be processed by the scheduler.  This commit records that delay in the UI so that users can tell when the scheduler is becoming the bottleneck.
2013-12-06 11:03:32 -08:00
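A hedged sketch of the delay computation the description implies (field names are illustrative, not the exact TaskInfo/TaskMetrics members):

```scala
// Scheduler delay ≈ total round trip minus the time actually spent running
// the task and fetching its result; large values mean the completion message
// sat in the scheduler's queue, i.e. the scheduler is becoming the bottleneck.
def schedulerDelayMs(launchTime: Long, finishTime: Long,
                     executorRunTimeMs: Long, resultFetchTimeMs: Long): Long =
  math.max(0L, (finishTime - launchTime) - executorRunTimeMs - resultFetchTimeMs)
```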
Matei Zaharia 87676a6af2 Merge pull request #220 from rxin/zippart
Memoize preferred locations in ZippedPartitionsBaseRDD

so preferred location computation doesn't lead to exponential explosion.

This was a problem in GraphX, where we have a whole chain of RDDs that are ZippedPartitionsRDDs, and the preferred locations were taking an eternity to compute.

(cherry picked from commit e36fe55a03)
Signed-off-by: Reynold Xin <rxin@apache.org>
2013-12-06 11:01:42 -08:00
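The memoization pattern described, reduced to a hypothetical class (the real change memoizes inside ZippedPartitionsBaseRDD; this sketch only shows the idea):

```scala
// Compute the (recursive) preferred-location set once per partition via
// lazy val, so repeated queries don't re-walk the whole RDD chain.
class MemoizedPart(computeLocs: () => Seq[String]) {
  lazy val preferredLocations: Seq[String] = computeLocs()
}
```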
Aaron Davidson 94b5881ee9 Fix long lines 2013-12-06 00:22:00 -08:00
Aaron Davidson 5a864e3fce Rename SparkActorSystem to IndestructibleActorSystem 2013-12-06 00:21:43 -08:00
Prashant Sharma c9cd2af71e Merge branch 'wip-scala-2.10' into akka-bug-fix 2013-12-06 13:32:15 +05:30
Mark Hamstra ee888f6b25 FutureAction result tests 2013-12-05 23:01:18 -08:00
Prashant Sharma 4e70480038 Left-over akka -> akka.tcp changes 2013-12-06 12:29:53 +05:30
Henry Saputra 1cb259cb57 Change the name of input argument in ClusterScheduler#initialize from context to backend.
The SchedulerBackend used to be called ClusterSchedulerContext, so this just makes a small
change to the input param of ClusterScheduler#initialize to reflect the new name.
2013-12-05 18:50:26 -08:00
Mark Hamstra aebb123fd3 jobWaiter.synchronized before jobWaiter.wait 2013-12-05 17:16:44 -08:00
Patrick Wendell 5d460253d6 Merge pull request #228 from pwendell/master
Document missing configs and set shuffle consolidation to false.
2013-12-05 12:31:24 -08:00
Patrick Wendell 75d161b357 Forcing shuffle consolidation in DiskBlockManagerSuite 2013-12-05 11:36:41 -08:00
Matei Zaharia 72b696156c Merge pull request #199 from harveyfeng/yarn-2.2
Hadoop 2.2 migration

Includes support for the YARN API stabilized in the Hadoop 2.2 release, and a few style patches.

Short description for each set of commits:

a98f5a0 - "Misc style changes in the 'yarn' package"
a67ebf4 - "A few more style fixes in the 'yarn' package"
Both of these are minor style changes to the existing YARN code, such as fixing lines over 100 chars.

ab8652f - "Add a 'new-yarn' directory ... "
Copies everything from `SPARK_HOME/yarn` to `SPARK_HOME/new-yarn`. No actual code changes here.

4f1c3fa - "Hadoop 2.2 YARN API migration ..."
API patches to code in the `SPARK_HOME/new-yarn` directory. There are a few more small style changes mixed in, too.
Based on @colorant's Hadoop 2.2 support for the scala-2.10 branch in #141.

a1a1c62 - "Add optional Hadoop 2.2 settings in sbt build ... "
If Spark should be built against Hadoop 2.2, then:
a) the `org.apache.spark.deploy.yarn` package will be compiled from the `new-yarn` directory.
b) Protobuf v2.5 will be used as a Spark dependency, since Hadoop 2.2 depends on it. Also, Spark will be built against a version of Akka v2.0.5 that's built against Protobuf 2.5, named `akka-2.0.5-protobuf-2.5`. The patched Akka is here: https://github.com/harveyfeng/akka/tree/2.0.5-protobuf-2.5, and was published to local Ivy during testing.

There's also a new boolean environment variable, `SPARK_IS_NEW_HADOOP`, that users can manually set if their `SPARK_HADOOP_VERSION` specification does not start with `2.2`, which is how the build file tries to detect a 2.2 version. Not sure if this is necessary or done in the best way, though...
2013-12-04 23:33:04 -08:00
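A hedged sketch of the kind of conditional the build change adds (not the actual SparkBuild.scala contents; the default and old protobuf versions shown are assumptions):

```scala
// Decide whether to build against Hadoop 2.2, per the description above:
// SPARK_IS_NEW_HADOOP overrides the prefix check on SPARK_HADOOP_VERSION.
val hadoopVersion = sys.env.getOrElse("SPARK_HADOOP_VERSION", "1.0.4") // default assumed
val isNewHadoop = sys.env.get("SPARK_IS_NEW_HADOOP")
  .map(_.toBoolean)
  .getOrElse(hadoopVersion.startsWith("2.2"))
val yarnSourceDir   = if (isNewHadoop) "new-yarn" else "yarn"
val protobufVersion = if (isNewHadoop) "2.5.0" else "2.4.1" // old version assumed
```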
Patrick Wendell b1c6fa1584 Document missing configs and set shuffle consolidation to false. 2013-12-04 18:39:34 -08:00
Patrick Wendell 182f9baeed Merge pull request #227 from pwendell/master
Fix small bug in web UI and minor clean-up.

There was a bug where sorting order didn't work correctly for write time metrics.

I also cleaned up some earlier code that fixed the same issue for read and
write bytes.
2013-12-04 15:52:07 -08:00
Patrick Wendell 380b90b9b3 Fix small bug in web UI and minor clean-up.
There was a bug where sorting order didn't work correctly for write time metrics.

I also cleaned up some earlier code that fixed the same issue for read and
write bytes.
2013-12-04 14:41:48 -08:00
Andrew Ash 217611680d Add missing space after "Serialized" in StorageLevel
Current code creates outputs like:

scala> res0.getStorageLevel.description
res2: String = Serialized1x Replicated
2013-12-04 11:29:20 -08:00
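The bug reduced to a sketch; the real description method also covers the memory/disk flags, so this signature is illustrative only:

```scala
// The missing trailing space after "Serialized" produced "Serialized1x Replicated".
def description(serialized: Boolean, replication: Int): String = {
  val prefix = if (serialized) "Serialized " else "Deserialized "
  prefix + replication + "x Replicated"
}
```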
Matei Zaharia d6e5473872 Merge pull request #223 from rxin/transient
Mark partitioner, name, and generator field in RDD as @transient.

As part of the effort to reduce serialized task size.
2013-12-04 10:28:50 -08:00
Reynold Xin 974a69d79c Marked doCheckpointCalled as transient. 2013-12-03 11:34:38 -08:00
Mark Hamstra 403234dd0d SparkListenerJobStart posted from local jobs 2013-12-03 09:57:32 -08:00
Mark Hamstra f55d0b935d Synchronous, inline cleanup after runLocally 2013-12-03 09:57:32 -08:00
Mark Hamstra c9fcd909d0 Local jobs post SparkListenerJobEnd, and DAGScheduler data structure
cleanup always occurs before any posting of SparkListenerJobEnd.
2013-12-03 09:57:32 -08:00
Mark Hamstra 9ae2d094a9 Tightly couple stageIdToJobIds and jobIdToStageIds 2013-12-03 09:57:32 -08:00
Mark Hamstra 27c45e5236 Cleaned up job cancellation handling 2013-12-03 09:57:32 -08:00
Mark Hamstra 686a420ddc Refactoring to make job removal, stage removal, task cancellation clearer 2013-12-03 09:57:32 -08:00
Mark Hamstra 205566e56e Improved comment 2013-12-03 09:57:32 -08:00
Mark Hamstra 94087c463b Removed redundant residual re: reverted refactoring. 2013-12-03 09:57:31 -08:00
Mark Hamstra 982797dcba Fixed intended side-effects 2013-12-03 09:57:31 -08:00
Mark Hamstra 6f8359b5ad Actor instead of eventQueue for LocalJobCompleted 2013-12-03 09:57:31 -08:00
Mark Hamstra 51458ab4a1 Added stageId <--> jobId mapping in DAGScheduler
...and make sure that DAGScheduler data structures are cleaned up on job completion.
  Initial effort and discussion at https://github.com/mesos/spark/pull/842
2013-12-03 09:57:31 -08:00
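A minimal sketch of the bidirectional bookkeeping named above (the field names stageIdToJobIds/jobIdToStageIds come from these commits; the helper is illustrative):

```scala
import scala.collection.mutable.{HashMap, HashSet}

val jobIdToStageIds = new HashMap[Int, HashSet[Int]]
val stageIdToJobIds = new HashMap[Int, HashSet[Int]]

// Register both directions together so the two maps can never disagree,
// and cleanup on job completion can walk either side.
def register(jobId: Int, stageId: Int): Unit = {
  jobIdToStageIds.getOrElseUpdate(jobId, new HashSet[Int]) += stageId
  stageIdToJobIds.getOrElseUpdate(stageId, new HashSet[Int]) += jobId
}
```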
Reynold Xin 58d9bbcfec Merge pull request #217 from aarondav/mesos-urls
Re-enable zk:// urls for Mesos SparkContexts

This was broken in PR #71 when we explicitly disallow anything that didn't fit a mesos:// url.
Although it is not really clear that a zk:// url should match Mesos, it is what the docs say and it is necessary for backwards compatibility.

Additionally added a unit test for the creation of all types of TaskSchedulers. Since YARN and Mesos are not necessarily available in the system, they are allowed to pass as long as the YARN/Mesos code paths are exercised.
2013-12-02 21:58:53 -08:00
Prashant Sharma 09e8be9a62 Made running SparkActorSystem specific to executors only. 2013-12-03 11:27:45 +05:30
Aaron Davidson 0f24576c08 Cleanup and documentation of SparkActorSystem 2013-12-03 11:05:12 +05:30
Reynold Xin e34b4693d3 Mark partitioner, name, and generator field in RDD as @transient. 2013-12-02 21:24:44 -08:00
Kay Ousterhout 58b3aff9a8 Fixed problem with scheduler delay 2013-12-02 20:30:03 -08:00
Aaron Davidson f6c8c1c7b6 Cleanup and documentation of SparkActorSystem 2013-12-02 11:42:53 -08:00
Prashant Sharma 5b11028a04 Made akka capable of tolerating fatal exceptions and moving on. 2013-12-02 10:47:39 +05:30
Reynold Xin 740922f25d Merge pull request #219 from sundeepn/schedulerexception
Scheduler quits when newStage fails

The current scheduler thread does not handle exceptions from newStage while launching new jobs. The thread fails on any exception triggered at that level, leaving the cluster hanging with no scheduler.
2013-12-01 12:46:58 -08:00
Sundeep Narravula be3ea2394f Log exception in scheduler in addition to passing it to the caller.
Code Styling changes.
2013-12-01 00:50:34 -08:00
Reynold Xin 9cf7f31e4d Memoize preferred locations in ZippedPartitionsBaseRDD so preferred location computation doesn't lead to exponential explosion.
(cherry picked from commit e36fe55a03)
Signed-off-by: Reynold Xin <rxin@apache.org>
2013-11-30 18:10:52 -08:00
Sundeep Narravula 4d53830eb7 Scheduler quits when createStage fails.
The current scheduler thread does not handle exceptions from createStage while launching new jobs. The thread fails on any exception triggered at that level, leaving the cluster hanging with no scheduler.
2013-11-30 16:18:12 -08:00
Aaron Davidson 96df26be47 Add spaces between tests 2013-11-29 13:20:43 -08:00
Prashant Sharma 5618af6803 Merge branch 'master' into wip-scala-2.10 2013-11-29 13:41:21 +05:30
Prashant Sharma 1bc83ca791 Changed defaults for akka to almost disable failure detector. 2013-11-29 13:41:05 +05:30
Lian, Cheng 4a1d966e26 More comments 2013-11-29 16:02:58 +08:00
Lian, Cheng 1e25086009 Updated some inline comments in DAGScheduler 2013-11-29 15:56:47 +08:00
Aaron Davidson 081a0b6861 Add unit test for SparkContext scheduler creation
Since YARN and Mesos are not necessarily available in the system,
they are allowed to pass as long as the YARN/Mesos code paths are
exercised.
2013-11-28 20:40:57 -08:00
Aaron Davidson 37f161cf6b Re-enable zk:// urls for Mesos SparkContexts
This was broken in PR #71 when we explicitly disallow anything that
didn't fit a mesos:// url.

Although it is not really clear that a zk:// url should match Mesos,
it is what the docs say and it is necessary for backwards compatibility.
2013-11-28 20:37:56 -08:00
Lian, Cheng 18def5d6f2 Bugfix: SPARK-965 & SPARK-966
SPARK-965: https://spark-project.atlassian.net/browse/SPARK-965
SPARK-966: https://spark-project.atlassian.net/browse/SPARK-966

* Add back DAGScheduler.start(), eventProcessActor is created and started here.

  Notice that this function is only called by SparkContext.

* Cancel the scheduled stage resubmission task when stopping eventProcessActor

* Add a new DAGSchedulerEvent ResubmitFailedStages

  This event message is sent by the scheduled stage resubmission task to eventProcessActor.  In this way, DAGScheduler.resubmitFailedStages is guaranteed to be executed from the same thread that runs DAGScheduler.processEvent.

  Please refer to discussion in SPARK-966 for details.
2013-11-28 17:46:06 +08:00
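The thread-confinement idea behind ResubmitFailedStages, sketched with plain Java executors rather than the actual Akka actor (names beyond the event itself are illustrative):

```scala
import java.util.concurrent.{Executors, LinkedBlockingQueue, TimeUnit}

sealed trait DAGSchedulerEvent
case object ResubmitFailedStages extends DAGSchedulerEvent

val eventQueue = new LinkedBlockingQueue[DAGSchedulerEvent]()

// The timer thread only *enqueues* the event; resubmitFailedStages() then
// runs on the single event-loop thread — the same one that runs
// processEvent() — which is exactly the guarantee described above.
val timer = Executors.newSingleThreadScheduledExecutor()
timer.scheduleAtFixedRate(new Runnable {
  def run(): Unit = eventQueue.put(ResubmitFailedStages)
}, 5, 5, TimeUnit.SECONDS)
```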
Prashant Sharma 3ec5d74766 Fixed the broken build. 2013-11-28 13:02:28 +05:30
Matei Zaharia 743a31a7ca Merge pull request #210 from haitaoyao/http-timeout
add http timeout for httpbroadcast

While pulling task bytecode from the HttpBroadcast server, there's no timeout value set. This may cause the Spark executor code to hang while other tasks in the same executor process wait for the lock. I have encountered this issue in my cluster. Here's the stacktrace I captured: https://gist.github.com/haitaoyao/7655830

So add a timeout value to ensure the task fails fast.
2013-11-27 18:24:39 -08:00
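A hedged sketch of the fix (URL and timeout values are illustrative, not the actual HttpBroadcast code):

```scala
import java.net.URL

// Without explicit timeouts, a dead broadcast server blocks this fetch (and
// the lock other tasks are waiting on) forever; with them, the task fails fast.
val conn = new URL("http://driver-host:33333/broadcast_0").openConnection()
conn.setConnectTimeout(60 * 1000)
conn.setReadTimeout(60 * 1000)
val in = conn.getInputStream()
```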
Prashant Sharma 17987778da Merge branch 'master' into wip-scala-2.10
Conflicts:
	core/src/main/scala/org/apache/spark/api/python/PythonRDD.scala
	core/src/main/scala/org/apache/spark/rdd/MapPartitionsRDD.scala
	core/src/main/scala/org/apache/spark/rdd/MapPartitionsWithContextRDD.scala
	core/src/main/scala/org/apache/spark/rdd/RDD.scala
	python/pyspark/rdd.py
2013-11-27 14:44:12 +05:30
Prashant Sharma 54862af5ee Improvements from the review comments and followed Boy Scout Rule. 2013-11-27 14:26:28 +05:30
Matei Zaharia fb6875dd5c Merge pull request #146 from JoshRosen/pyspark-custom-serializers
Custom Serializers for PySpark

This pull request adds support for custom serializers to PySpark.  For now, all Python-transformed (or parallelize()d) RDDs are serialized with the same serializer that's specified when creating SparkContext.

For now, PySpark includes `PickleSerDe` and `MarshalSerDe` classes for using Python's `pickle` and `marshal` serializers.  It's pretty easy to add support for other serializers, although I still need to add instructions on this.

A few notable changes:

- The Scala `PythonRDD` class no longer manipulates Pickled objects; data from `textFile` is written to Python as MUTF-8 strings.  The Python code performs the appropriate bookkeeping to track which deserializer should be used when reading an underlying JavaRDD.  This mechanism could also be used to support other data exchange formats, such as MsgPack.
- Several magic numbers were refactored into constants.
- Batching is implemented by wrapping / decorating an unbatched SerDe.
2013-11-26 20:55:40 -08:00
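The Scala-side half of the first change, sketched: text records are written with DataOutputStream.writeUTF, which produces the MUTF-8 framing mentioned above (the stream destination here is illustrative):

```scala
import java.io.{ByteArrayOutputStream, DataOutputStream}

val out = new DataOutputStream(new ByteArrayOutputStream())
// writeUTF emits a 2-byte length followed by modified UTF-8 bytes; the
// Python side then tracks which deserializer to apply to the underlying RDD.
Seq("line one", "line two").foreach(s => out.writeUTF(s))
```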
Matei Zaharia 330ada1766 Merge pull request #207 from henrydavidge/master
Log a warning if a task's serialized size is very big

As per Reynold's instructions, we now create a warning-level log entry if a task's serialized size is too big. "Too big" is currently defined as 100 KB. This warning message is generated at most once for each stage.
2013-11-26 19:08:33 -08:00
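The check described, sketched with illustrative names (warn at most once per stage when a serialized task crosses the 100 KB threshold):

```scala
val warnThresholdBytes = 100 * 1024
var warnedStages = Set.empty[Int]

// Warn once per stage, then remember the stage so we stay quiet afterwards.
def maybeWarn(stageId: Int, serializedSizeBytes: Int): Unit =
  if (serializedSizeBytes > warnThresholdBytes && !warnedStages(stageId)) {
    warnedStages += stageId
    println(s"WARN: stage $stageId has a serialized task of $serializedSizeBytes bytes")
  }
```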
Harvey Feng afe4fe7f5e Merge remote-tracking branch 'origin/master' into yarn-2.2
Conflicts:
	yarn/src/main/scala/org/apache/spark/deploy/yarn/Client.scala
2013-11-26 15:03:03 -08:00
hhd 57579934f0 Emit warning when task size > 100KB 2013-11-26 16:58:39 -05:00
Mark Hamstra ed7ecb93ce [SPARK-963] Wait for SparkListenerBus eventQueue to be empty before checking jobLogger state 2013-11-26 13:30:17 -08:00
Reynold Xin cb976dfb50 Merge pull request #209 from pwendell/better-docs
Improve docs for shuffle instrumentation
2013-11-26 10:23:19 -08:00
Prashant Sharma 560e44a8e1 Restored master address for client. 2013-11-26 18:18:05 +05:30
haitao.yao db998a6e14 add http timeout for httpbroadcast 2013-11-26 18:23:48 +08:00
Prashant Sharma d092a8cc6a Fixed compile time warnings and formatting post merge. 2013-11-26 15:21:50 +05:30
Matei Zaharia 18d6df0e17 Merge pull request #86 from holdenk/master
Add histogram functionality to DoubleRDDFunctions

This pull request add histogram functionality to the DoubleRDDFunctions.
2013-11-26 00:00:07 -08:00
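A usage sketch of the new API, assuming an existing SparkContext `sc`:

```scala
import org.apache.spark.SparkContext._  // implicit DoubleRDDFunctions conversion in this era

val data = sc.parallelize(Seq(1.0, 2.5, 3.7, 4.2, 9.9))
// Even buckets: returns the bucket boundaries and the count per bucket.
val (buckets, counts) = data.histogram(5)
```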
Patrick Wendell 297c09d4bb Improve docs for shuffle instrumentation 2013-11-25 22:53:28 -08:00
Holden Karau 7222ee2977 Fix the test 2013-11-25 21:06:42 -08:00
Matei Zaharia 0e2109ddb2 Merge pull request #204 from rxin/hash
OpenHashSet fixes

Incorporated ideas from pull request #200.
- Use Murmur Hash 3 finalization step to scramble the bits of HashCode
  instead of the simpler version in java.util.HashMap; the latter one
  had trouble with ranges of consecutive integers. Murmur Hash 3 is used
  by fastutil.
- Don't check keys for equality when re-inserting due to growing the
  table; the keys will already be unique.
- Remember the grow threshold instead of recomputing it on each insert

Also added unit tests for size estimation for specialized hash sets and maps.
2013-11-25 20:48:37 -08:00
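For reference, the Murmur Hash 3 32-bit finalization step ("fmix") the description refers to:

```scala
// Scrambles the bits of a hash code; ranges of consecutive integers no
// longer land in consecutive buckets the way they did with the simpler
// java.util.HashMap-style mixer.
def fmix32(hash: Int): Int = {
  var h = hash
  h ^= h >>> 16
  h *= 0x85ebca6b
  h ^= h >>> 13
  h *= 0xc2b2ae35
  h ^= h >>> 16
  h
}
```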
Matei Zaharia 14bb465bb3 Merge pull request #201 from rxin/mappartitions
Use the proper partition index in mapPartitionsWithIndex

mapPartitionsWithIndex uses TaskContext.partitionId as the partition index. TaskContext.partitionId used to be identical to the partition index in an RDD. However, pull request #186 introduced a scenario (with partition pruning) in which the two can be different. This pull request uses the right partition index in all mapPartitionsWithIndex related calls.

Also removed the extra MapPartitionsWithContextRDD and put all the mapPartitions related functionality in MapPartitionsRDD.
2013-11-25 18:50:18 -08:00
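A usage sketch, assuming an existing SparkContext `sc`; after this change the index passed to the closure is the partition's index within this RDD, even under partition pruning:

```scala
val rdd = sc.parallelize(1 to 10, 4)
// Tag each element with the index of the partition it lives in.
val tagged = rdd.mapPartitionsWithIndex { (idx, iter) =>
  iter.map(x => (idx, x))
}
```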
Matei Zaharia eb4296c8f7 Merge pull request #101 from colorant/yarn-client-scheduler
For SPARK-527, Support spark-shell when running on YARN

sync to trunk and resubmit here

In the current YARN mode approach, the application is run in the Application Master as a user program, so the whole Spark context lives on the remote side.

This approach won't support applications that involve local interaction and need to run where they are launched.

So in this pull request I have added a YarnClientClusterScheduler and backend.

With this scheduler, the user application is launched locally, while the executors are launched by YARN on remote nodes with a thin AM that only launches the executors and monitors the Driver Actor status, so that when the client app is done, it can finish the YARN application as well.

This enables spark-shell to run on YARN.

This also enables other Spark applications to run the Spark context locally with the master URL "yarn-client". Thus, e.g., SparkPi can output its result locally on the console instead of in the log of the remote machine where the AM runs.

Docs are also updated to show how to use this yarn-client mode.
2013-11-25 15:25:29 -08:00
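A usage sketch of the new mode (app name illustrative):

```scala
import org.apache.spark.SparkContext

// "yarn-client" keeps the driver — and e.g. SparkPi's console output — on the
// launching machine, while executors run under YARN on remote nodes.
val sc = new SparkContext("yarn-client", "SparkPi")
```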
Prashant Sharma 44fd30d3fb Merge branch 'master' into scala-2.10-wip
Conflicts:
	core/src/main/scala/org/apache/spark/rdd/RDD.scala
	project/SparkBuild.scala
2013-11-25 18:10:54 +05:30
Prashant Sharma 489862a657 Remote death watch has a funny bug.
https://gist.github.com/ScrapCodes/4805fd84906e40b7b03d
2013-11-25 18:00:02 +05:30
Reynold Xin 466fd06475 Incorporated ideas from pull request #200.
- Use Murmur Hash 3 finalization step to scramble the bits of HashCode
  instead of the simpler version in java.util.HashMap; the latter one
  had trouble with ranges of consecutive integers. Murmur Hash 3 is used
  by fastutil.

- Don't check keys for equality when re-inserting due to growing the
  table; the keys will already be unique

- Remember the grow threshold instead of recomputing it on each insert
2013-11-25 18:27:26 +08:00
Reynold Xin 95c55df1c2 Added unit tests for size estimation for specialized hash sets and maps. 2013-11-25 18:27:06 +08:00
Prashant Sharma 77929cfeed Fine-tuned defaults for Akka and restored tracking of disassociated events, for they are delivered when a remote TCP socket is closed. Also gave transport failure heartbeats a larger interval, since they are mostly not needed now that we use remote death watch instead. 2013-11-25 14:13:21 +05:30
Matei Zaharia 859d62dc2a Merge pull request #151 from russellcardullo/add-graphite-sink
Add graphite sink for metrics

This adds a metrics sink for graphite.  The sink must
be configured with the host and port of a graphite node
and optionally may be configured with a prefix that will
be prepended to all metrics that are sent to graphite.
2013-11-24 16:19:51 -08:00
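A hedged sketch of the sink configuration the description implies, in metrics.properties form (property names patterned on Spark's metrics config conventions; treat the exact keys as assumptions):

```
# Host and port of a Graphite node are required; the prefix is optional and
# is prepended to every metric sent to Graphite.
*.sink.graphite.class=org.apache.spark.metrics.sink.GraphiteSink
*.sink.graphite.host=graphite.example.com
*.sink.graphite.port=2003
*.sink.graphite.prefix=spark
```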
Matei Zaharia 65de73c7f8 Merge pull request #185 from mkolod/random-number-generator
XORShift RNG with unit tests and benchmark

This patch was introduced to address SPARK-950 - the discussion below the ticket explains not only the rationale, but also the design and testing decisions: https://spark-project.atlassian.net/browse/SPARK-950

To run the unit tests, start the SBT console and type:

    compile
    test-only org.apache.spark.util.XORShiftRandomSuite

To run the benchmark, type:

    project core
    console

Once the Scala console starts, type:

    org.apache.spark.util.XORShiftRandom.benchmark(100000000)
XORShiftRandom is also an object with a main method taking the
number of iterations as an argument, so you can also run it
from the command line.
2013-11-24 15:52:33 -08:00
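For reference, the xorshift step such a generator is built on (the shift triple shown is one of Marsaglia's published 64-bit triples; the actual XORShiftRandom implementation may differ):

```scala
// Three shift-and-xor steps scramble the 64-bit seed — very cheap per draw,
// with period 2^64 - 1 for any non-zero seed.
def nextSeed(seed: Long): Long = {
  var x = seed
  x ^= x << 21
  x ^= x >>> 35
  x ^= x << 4
  x
}
```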
Reynold Xin 972171b9d9 Merge pull request #197 from aarondav/patrick-fix
Fix 'timeWriting' stat for shuffle files

Due to concurrent git branches, changes from shuffle file consolidation patch
caused the shuffle write timing patch to no longer actually measure the time,
since it requires time be measured after the stream has been closed.
2013-11-25 07:50:46 +08:00
Reynold Xin e9ff13ec72 Consolidated both mapPartitions related RDDs into a single MapPartitionsRDD.
Also changed the semantics of the index parameter in mapPartitionsWithIndex from the partition index of the output partition to the partition index in the current RDD.
2013-11-24 17:56:43 +08:00
Matei Zaharia 9837a60234 Some other optimizations to AppendOnlyMap:
- Don't check keys for equality when re-inserting due to growing the
  table; the keys will already be unique
- Remember the grow threshold instead of recomputing it on each insert
2013-11-23 17:38:29 -08:00
Matei Zaharia 7535d7fbcb Fixes to AppendOnlyMap:
- Use Murmur Hash 3 finalization step to scramble the bits of HashCode
  instead of the simpler version in java.util.HashMap; the latter one
  had trouble with ranges of consecutive integers. Murmur Hash 3 is used
  by fastutil.
- Use Object.equals() instead of Scala's == to compare keys, because the
  latter does extra casts for numeric types (see the equals method in
  https://github.com/scala/scala/blob/master/src/library/scala/runtime/BoxesRunTime.java)
2013-11-23 17:21:37 -08:00
Harvey Feng 4f1c3fa5d7 Hadoop 2.2 YARN API migration for SPARK_HOME/new-yarn 2013-11-23 17:08:30 -08:00
Ankur Dave c1507afc6c Support preservesPartitioning in RDD.zipPartitions 2013-11-23 03:03:31 -08:00
Aaron Davidson ccea38b759 Fix 'timeWriting' stat for shuffle files
Due to concurrent git branches, changes from shuffle file consolidation patch
caused the shuffle write timing patch to no longer actually measure the time,
since it requires time be measured after the stream has been closed.
2013-11-21 21:36:08 -08:00
Reynold Xin f20093c3af Merge pull request #196 from pwendell/master
TimeTrackingOutputStream should pass on calls to close() and flush().

Without this fix you get a huge number of open files when running shuffles.
2013-11-22 10:12:13 +08:00
Raymond Liu ab3cefde53 Add YarnClientClusterScheduler and Backend.
With this scheduler, the user application is launched locally,
while the executors will be launched by YARN on remote nodes.

This enables spark-shell to run upon YARN.
2013-11-22 09:23:27 +08:00
Patrick Wendell 53b94ef2f5 TimeTrackingOutputStream should pass on calls to close() and flush().
Without this fix you get a huge number of open files after running
shuffles.
2013-11-21 17:20:15 -08:00
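A hedged sketch of the wrapper these two commits fix (the class name comes from the commits; the fields and structure are illustrative):

```scala
import java.io.OutputStream

// A write-timing wrapper must forward close() and flush() to the underlying
// stream — and time them too, since close() can flush buffered data —
// otherwise file handles pile up after shuffles.
class TimeTrackingOutputStream(out: OutputStream) extends OutputStream {
  var timeWritingNs = 0L
  private def timed(op: => Unit): Unit = {
    val start = System.nanoTime()
    op
    timeWritingNs += System.nanoTime() - start
  }
  override def write(b: Int): Unit = timed(out.write(b))
  override def write(b: Array[Byte], off: Int, len: Int): Unit = timed(out.write(b, off, len))
  override def flush(): Unit = timed(out.flush())
  override def close(): Unit = timed(out.close())
}
```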