Commit graph

5898 commits

Author SHA1 Message Date
Josh Rosen 3f2571fe98 Remove unused isShuffle field from Dependency. 2012-10-07 00:03:55 -07:00
Matei Zaharia b2fc3dd902 Log message 2012-10-07 06:43:52 +00:00
Matei Zaharia ea096f7cd5 More logging 2012-10-07 06:35:48 +00:00
root 554b42cb24 Log more info in MapOutputTracker 2012-10-07 05:02:18 +00:00
root a73b25826b Made Akka thread pool and message batch sizes configurable 2012-10-07 04:19:54 +00:00
root ce915cadee Made run script add test-classes onto the classpath only if SPARK_TESTING is set; fixes #216 2012-10-07 04:19:16 +00:00
root 975009d688 Avoid acquiring locks in BlockManager when fetching shuffle outputs 2012-10-07 04:02:10 +00:00
root 0bc63f7ef1 Log initial number of fetches in reducer 2012-10-07 03:51:04 +00:00
Matei Zaharia dc28a3ac0a Modified shuffle to limit the maximum outstanding data size in bytes,
instead of the maximum number of outstanding fetches. This should make
it faster when there are many small map output files, as well as more
robust to overallocating memory on large map outputs.
2012-10-06 20:07:10 -07:00
Matei Zaharia 9a3b3f32a3 Pass sizes of map outputs back to MapOutputTracker 2012-10-06 18:46:04 -07:00
Matei Zaharia 0e42832e6a Made block store return the size of each block put in 2012-10-06 18:00:53 -07:00
Matei Zaharia b0110de5b6 Warn about user programs that try to set spark.cache.class 2012-10-06 17:27:14 -07:00
Matei Zaharia 65113b7e1b Only group elements ten at a time into SequenceFile records in
saveAsObjectFile
2012-10-06 17:14:41 -07:00
Matei Zaharia 716e10ca32 Minor formatting fixes 2012-10-05 22:03:06 -07:00
Matei Zaharia 70f02fa912 Merge branch 'dev' of github.com:mesos/spark into dev 2012-10-05 22:00:22 -07:00
Andy Konwinski a242cdd0a6 Factor subclasses of RDD out of RDD.scala into their own classes
in the rdd package.
2012-10-05 19:53:54 -07:00
Andy Konwinski d7363a6b8a Moves all files in core/src/main/scala/ that have RDD in their name
from that directory to a new core/src/main/scala/rdd directory.
2012-10-05 19:23:45 -07:00
Andy Konwinski e0067da082 Moves all files in core/src/main/scala/ that have RDD in them from
package spark to package spark.rdd and updates all references to them.
2012-10-05 19:23:45 -07:00
Matei Zaharia 69588baf65 Cleaning up code slightly 2012-10-05 19:16:09 -07:00
root f52bc09a34 Reduce some overly aggressive logging in connection manager 2012-10-06 01:54:39 +00:00
Shivaram Venkataraman b6e4f46a96 Fix SizeEstimator tests to work with String classes in JDK 6 and 7
Conflicts:

	core/src/test/scala/spark/BoundedMemoryCacheSuite.scala
2012-10-05 16:58:57 -07:00
Matei Zaharia e3ae98b54e Merge pull request #247 from squito/dev
Dev
2012-10-05 10:27:18 -07:00
Imran Rashid e0698f8f26 change tests to show utility of localValue 2012-10-04 23:05:42 -07:00
Imran Rashid 82a3327862 make accumulator.localValue public, add tests
Conflicts:
	core/src/test/scala/spark/AccumulatorSuite.scala
2012-10-04 23:05:01 -07:00
Matei Zaharia 8c82f43db3 Scaladoc documentation for some core Spark functionality 2012-10-04 22:59:36 -07:00
Matei Zaharia c04aeaf365 Merge pull request #243 from andyk/mesos-from-maven-central
Removes the included mesos-0.9.0.jar and pulls it from Maven Central instead
2012-10-03 11:10:00 -07:00
Reynold Xin 45f4b7cc7e Made Serializer and JavaSerializer non private. 2012-10-03 10:20:59 -07:00
Andy Konwinski 5897567679 Removes the included mesos-0.9.0.jar and adds a libraryDependency to
the build file so that mesos-0.9.0-incubating.jar (which contains the
same class files, but has a silightly different name) will be pulled
down from Maven Central instead.
2012-10-03 08:58:05 -07:00
Matei Zaharia 833f1d0c86 Made StorageLevel public 2012-10-03 08:27:25 -07:00
Matei Zaharia 6cf5dffc72 Make more stuff private[spark] 2012-10-02 22:28:55 -07:00
Mosharaf Chowdhury 119e50c7b9 Conflict fixed 2012-10-02 22:25:39 -07:00
Matei Zaharia 626f701931 Merge pull request #240 from dennybritz/private_classes
Package-Private Classes
2012-10-02 21:24:32 -07:00
Denny 0361353a70 Make Java API abstract wrapped functions private 2012-10-02 20:02:53 -07:00
Denny b9badcd5bd accidentially removed trait 2012-10-02 19:35:07 -07:00
Denny 18a1faedf6 Stylistic changes and Public Accumulable and Broadcast 2012-10-02 19:28:37 -07:00
Denny b7a913e1fa Make dependency classes public - used by spark 2012-10-02 19:04:23 -07:00
Denny 4d9f4b01af Make classes package private 2012-10-02 19:00:19 -07:00
Matei Zaharia 97cbd699d7 Merge branch 'dev' of github.com:mesos/spark into dev 2012-10-02 17:31:01 -07:00
Matei Zaharia 5fda59ab99 Added a test for overly large blocks in memory store 2012-10-02 17:30:40 -07:00
Matei Zaharia 6098f7e87a Fixed cache replacement behavior of BlockManager:
- Partitions that get dropped to disk will now be loaded back into RAM
  after they're accessed again
- Same-RDD rule for cache replacement is now implemented (don't drop
  partitions from an RDD to make room for other partitions from itself)
- Items stored as MEMORY_AND_DISK go into memory only first, instead of
  being eagerly written out to disk
- MemoryStore.ensureFreeSpace is called within a lock on the writer
  thread to prevent race conditions (this can still be optimized to
  allow multiple concurrent calls to it but it's a start)
- MemoryStore does not accept blocks larger than its limit
2012-10-02 17:25:38 -07:00
Reynold Xin 7997585616 Added a check to make sure SPARK_MEM <= memoryPerSlave for local cluster
mode.
2012-10-02 15:45:25 -07:00
Reynold Xin 0898a21b95 Merge branch 'dev' of https://github.com/mesos/spark into dev 2012-10-02 13:08:01 -07:00
Matei Zaharia 22684653a5 Revert "Place Spray repo ahead of Cloudera in Maven search path"
This reverts commit 42e0a68082.
2012-10-02 12:01:32 -07:00
Reynold Xin b8cd681169 Allow whitespaces in cluster URL configuration for local cluster. 2012-10-02 11:52:12 -07:00
Matei Zaharia 42e0a68082 Place Spray repo ahead of Cloudera in Maven search path 2012-10-02 11:37:19 -07:00
Matei Zaharia b9fb8d6463 Include date in folder name for Spark local dir. 2012-10-01 15:55:16 -07:00
Matei Zaharia bc881e4798 Merge branch 'dev' of github.com:mesos/spark into dev 2012-10-01 15:21:56 -07:00
Matei Zaharia 802aa8aef9 Some bug fixes and logging fixes for broadcast. 2012-10-01 15:20:42 -07:00
Matei Zaharia 74a9244255 Write all unit test output to a file 2012-10-01 15:07:42 -07:00
Reynold Xin f264153162 Fixed #232: DirectBuffer's cleaner was empty and Spark tried to invoke
clean on it.
2012-10-01 14:07:34 -07:00
Matei Zaharia 3b348f909d Improve log messages from BlockManager 2012-10-01 12:01:38 -07:00
Matei Zaharia 0b84871dbc Remove some printlns in tests 2012-10-01 10:57:26 -07:00
Matei Zaharia 53f90d0f0e Use underscores instead of colons in RDD IDs 2012-10-01 10:48:53 -07:00
Matei Zaharia 2314132d57 Added a (failing) test for LRU with MEMORY_AND_DISK. 2012-09-30 22:52:16 -07:00
Matei Zaharia 3128c57f90 Simplified Class / ClassLoader test 2012-09-30 21:48:27 -07:00
Matei Zaharia 83143f9a5f Fixed several bugs that caused weird behavior with files in spark-shell:
- SizeEstimator was following through a ClassLoader field of Hadoop
  JobConfs, which referenced the whole interpreter, Scala compiler, etc.
  Chaos ensued, giving an estimated size in the tens of gigabytes.
- Broadcast variables in local mode were only stored as MEMORY_ONLY and
  never made accessible over a server, so they fell out of the cache when
  they were deemed too large and couldn't be reloaded.
2012-09-30 21:19:39 -07:00
Matei Zaharia fd0374b9de Comment 2012-09-29 21:43:06 -07:00
Matei Zaharia 5718cef2a4 Removed Logging trait from CoalescedRDD since we don't log anything 2012-09-29 21:40:43 -07:00
Matei Zaharia 143ef4f90d Added a CoalescedRDD class for reducing the number of partitions in an RDD. 2012-09-29 21:30:52 -07:00
Matei Zaharia c45758ddde Comment 2012-09-29 20:27:54 -07:00
Matei Zaharia ebd52347b5 Merge branch 'dev' of github.com:mesos/spark into dev 2012-09-29 20:22:31 -07:00
Matei Zaharia 9b326d01e9 Made BlockManager unmap memory-mapped files when necessary to reduce the
number of open files. Also optimized sending of disk-based blocks.
2012-09-29 20:21:54 -07:00
Matei Zaharia 2f11e3c285 Merge pull request #227 from JoshRosen/fix/distinct_numsplits
Allow controlling number of splits in distinct().
2012-09-28 23:57:24 -07:00
Josh Rosen 8654165e69 Use null as dummy value in distinct(). 2012-09-28 23:55:17 -07:00
Josh Rosen 37c199bbb0 Allow controlling number of splits in distinct(). 2012-09-28 23:44:19 -07:00
Matei Zaharia 56dcad5936 Don't create a Cache in SparkEnv because we don't use it 2012-09-28 23:40:56 -07:00
Matei Zaharia 1d44644f4f Logging tweaks 2012-09-28 23:28:16 -07:00
Matei Zaharia 815d6bd69a Renamed subdirs option 2012-09-28 19:02:41 -07:00
Matei Zaharia e54e1d7043 Made subdirs per local dir configurable, and reduced lock usage a bit 2012-09-28 19:00:50 -07:00
Matei Zaharia ae8c7d6cfa Made disk store use multiple directories, deleted ShuffleManager 2012-09-28 18:28:13 -07:00
Matei Zaharia 3d7267999d Print and track user call sites in more places in Spark 2012-09-28 17:42:00 -07:00
Matei Zaharia 9f6efbf06a Merge pull request #225 from pwendell/dev
Log message which records RDD origin
2012-09-28 16:28:07 -07:00
Matei Zaharia 0121a26bd1 Changed the way tasks' dependency files are sent to workers so that
custom serializers or Kryo registrators can be loaded.
2012-09-28 16:14:05 -07:00
Patrick Wendell 9fc78f8f29 Fixing some whitespace issues 2012-09-28 16:05:50 -07:00
Patrick Wendell bc909c2903 Changes based on Matei's comments 2012-09-28 16:04:36 -07:00
Patrick Wendell c387e40fb1 Log message which records RDD origin
This adds tracking to determine the "origin" of an RDD. Origin is defined by
the boundary between the user's code and the spark code, during an RDD's
instantiation. It is meant to help users understand where a Spark RDD is
coming from in their code.

This patch also logs origin data when stages are submitted to the scheduler.

Finally, it adds a new log message to fix an inconsitency in the way that
dependent stages (those missing parents) and independent stages (those
without) are logged during submission.
2012-09-28 15:51:46 -07:00
Matei Zaharia 2a8bfbca00 Fixed a bug where isLocal was set to false when using local[K] 2012-09-28 14:50:54 -07:00
Matei Zaharia 4a138403ef Fix a bug in JAR fetcher that made it always fetch the JAR 2012-09-27 21:32:06 -07:00
Matei Zaharia 009b0e37e7 Added an option to compress blocks in the block store 2012-09-27 18:45:44 -07:00
Matei Zaharia 7bcb08cef5 Renamed storage levels to something cleaner; fixes #223. 2012-09-27 17:50:59 -07:00
Matei Zaharia 920fab23c3 Merge pull request #222 from rxin/dev
Added MapPartitionsWithSplitRDD.
2012-09-26 23:16:45 -07:00
Matei Zaharia ea05fc130b Updates to standalone cluster, web UI and deploy docs. 2012-09-26 22:54:39 -07:00
Matei Zaharia 1ef4f0fbd2 Allow controlling number of splits in sortByKey. 2012-09-26 19:18:47 -07:00
Reynold Xin 1ad1331a34 Added MapPartitionsWithSplitRDD. 2012-09-26 17:11:28 -07:00
Matei Zaharia ee71fa49c1 Look for Kryo registrator using context class loader 2012-09-26 14:15:16 -07:00
Matei Zaharia d71a358c46 Fixed a test that was getting extremely lucky before, and increased the
number of samples used for sorting
2012-09-26 00:25:34 -07:00
Matei Zaharia 051785c7e6 Several fixes to sampling issues pointed out by Henry Milner:
- takeSample was biased towards earlier partitions
- There were some range errors in takeSample
- SampledRDDs with replacement didn't produce appropriate counts
  across partitions (we took exactly frac of each one)
2012-09-25 21:46:58 -07:00
Matei Zaharia 4d3339a3ec Merge pull request #217 from rxin/dev
Added a method to RDD to expose the ClassManifest.
2012-09-24 23:52:32 -07:00
Reynold Xin 7a4cd92861 Renamed RDD.manifest to RDD.elementClassManifest 2012-09-24 23:42:33 -07:00
Matei Zaharia 296e24b440 Merge pull request #218 from rnpandya/dev
Scripts to start Spark under windows
2012-09-24 21:10:31 -07:00
Reynold Xin 348bcbca1f Added a method to RDD to expose the ClassManifest. 2012-09-24 16:56:27 -07:00
Ravi Pandya 39215357af Windows command scripts for sbt and run 2012-09-24 15:43:19 -07:00
Matei Zaharia 6eeb379cf8 Fix some test issues 2012-09-24 15:39:58 -07:00
Matei Zaharia f855e4fad2 Merge pull request #208 from rxin/dev
Separated ShuffledRDD into multiple classes.
2012-09-24 12:32:01 -07:00
root 107a5ca879 Make default number of parallel fetches slightly smaller since it doesn't seem to hurt performance much and it will cause slightly less GC. 2012-09-23 06:06:12 +00:00
root e41cab04ca Avoid creating an extra buffer when saving a stream of values as DISK_ONLY 2012-09-23 05:56:44 +00:00
Denny afb7ccc838 HTTP File server fixes. 2012-09-21 10:58:13 -07:00
root 6d28dde370 Rename our toIterator method into asIterator to prevent confusion with the
Scala collection one, which often *copies* a collection.
2012-09-21 06:02:55 +00:00
root a642051ade Fixed a performance bug in BlockManager that was creating garbage when
returning deserialized, in-memory RDDs.
2012-09-21 05:42:21 +00:00
root 8feb5caacd Fixed an issue with ordering of classloader setup that was causing Java deserializer to break 2012-09-21 05:13:19 +00:00
Reynold Xin 6b5980da79 Set a limited number of retry in standalone deploy mode. 2012-09-19 15:41:56 -07:00
Reynold Xin 397d3816e1 Separated ShuffledRDD into multiple classes: RepartitionShuffledRDD,
ShuffledSortedRDD, and ShuffledAggregatedRDD.
2012-09-19 12:31:45 -07:00
Denny ca64d16a2d When a file is downloaded, make it executable. That's neccsary for scripts (e.g. in Shark) 2012-09-17 10:08:37 -07:00
Matei Zaharia 840cbcf849 Change default serializer to Java.. it had accidentally become Kryo. 2012-09-13 17:19:26 -07:00
Matei Zaharia b4dfa25c8a Store shuffle map outputs as DISK_ONLY 2012-09-12 16:05:57 -07:00
Matei Zaharia 2d761e3353 Ported performance and FT improvements from latest streaming work 2012-09-12 14:54:40 -07:00
Matei Zaharia 9b4cd1648b Fix bugs with Connection's shutdown callback failing to get its address 2012-09-12 14:54:14 -07:00
Matei Zaharia 9199775d41 Wait for Akka to really shut down in SparkEnv.stop() 2012-09-12 14:50:37 -07:00
Denny 5e4076e3f2 Merge branch 'dev' into feature/fileserver
Conflicts:
	core/src/main/scala/spark/SparkContext.scala
2012-09-11 16:57:17 -07:00
Denny 77873d2c8e Formatting 2012-09-11 16:51:46 -07:00
Denny 24b9b37314 Subclass URLClassLoader instead of using reflection 2012-09-11 16:51:08 -07:00
Denny 31c53e917d Use stageId as index for fileSet caches. 2012-09-11 16:10:45 -07:00
Matei Zaharia 943df48348 Merge branch 'dev' of github.com:mesos/spark into dev 2012-09-11 16:00:37 -07:00
Matei Zaharia 6d7f907e73 Manually merge pull request #175 by Imran Rashid 2012-09-11 16:00:06 -07:00
Reynold Xin 7af7c79ce5 Updated the logError call from the previous commit to conform to
logError API.
2012-09-11 14:32:24 -07:00
Reynold Xin 38b9119c96 Log entire exception (including stack trace) in BlockManagerWorker. 2012-09-11 11:31:35 -07:00
Denny 4d3471dd07 Fix serialization bugs and added local cluster tests 2012-09-10 15:39:58 -07:00
Tathagata Das c63a606458 Made NewHadoopRDD broadcast its job configuration (same as HadoopRDD). 2012-09-10 19:51:27 +00:00
Denny b864c36a30 Dynamically adding jar files and caching fileSets. 2012-09-10 12:49:09 -07:00
Denny f275fb07da General FileServer
A general fileserver for both JARs and regular files.
2012-09-10 12:48:59 -07:00
Matei Zaharia a13780670d Added a unit test for local-cluster mode and simplified some of the code involved in that 2012-09-10 12:48:58 -07:00
Denny f2ac55840c Add shutdown hook to Executor Runner and execute code to shutdown local cluster in Scheduler Backend 2012-09-10 12:48:58 -07:00
Denny 9ead8ab14e Set SPARK_LAUNCH_WITH_SCALA=0 in Executor Runner 2012-09-10 12:48:58 -07:00
Denny 8bb3c73977 Renamed spark-cluster to spark-local. 2012-09-10 12:48:58 -07:00
Denny a367c20f49 Fix wrong counting 2012-09-10 12:48:57 -07:00
Denny 93fe331e6d Delete old DeployUtils. 2012-09-10 12:48:57 -07:00
Denny cf074f9c96 Renamed class. 2012-09-10 12:48:57 -07:00
Denny 3749f94184 Start a standalone cluster locally. 2012-09-10 12:48:57 -07:00
Matei Zaharia 995982b3c9 Added a unit test for local-cluster mode and simplified some of the code involved in that 2012-09-07 17:08:36 -07:00
Matei Zaharia 8d2fcc2832 Merge pull request #189 from dennybritz/feature/localcluster
Simulating a Spark standalone cluster locally
2012-09-07 15:43:43 -07:00
Denny 7ff9311add Add shutdown hook to Executor Runner and execute code to shutdown local cluster in Scheduler Backend 2012-09-07 14:09:12 -07:00
Denny 4e7b264cf7 Set SPARK_LAUNCH_WITH_SCALA=0 in Executor Runner 2012-09-07 11:39:44 -07:00
haoyuan db08a362aa commit opt for grep scalibility test. 2012-09-07 02:17:52 +00:00
root c2da64409a Randomize the order of block fetches in getMultiple 2012-09-06 23:16:26 +00:00
root 9ef90c95f4 Bug fix 2012-09-06 00:43:46 +00:00
root 2fa6d999fd Tuning Akka more 2012-09-06 00:16:39 +00:00
Denny 886183e591 Renamed spark-cluster to spark-local. 2012-09-05 17:10:54 -07:00
root 215544820f Serialize map output locations more efficiently, and only once, in MapOutputTracker 2012-09-05 23:54:04 +00:00
root dc68febdce User Spark's closure serializer for the ShuffleMapTask cache 2012-09-05 23:06:59 +00:00
Reynold Xin c308fbcb79 Removed cache add/remove log messages from CacheTracker.
Added log messages on BlockManagerMaster to reflect block add/remove.
Also did some minor cleanup of storage package code.
2012-09-05 15:59:48 -07:00
root ed937a821f Merge branch 'dev' of github.com:radlab/spark into dev 2012-09-05 22:26:49 +00:00
root 1d6b36d3c3 Further tuning for network performance 2012-09-05 22:26:37 +00:00
root 3fa0d7f0c9 Serialize BlockRDD more efficiently 2012-09-05 08:28:15 +00:00
root 4a5d0d249e Merge branch 'dev' of github.com:radlab/spark into dev 2012-09-05 08:23:09 +00:00
root efc7668d16 Allow serializing HttpBroadcast through Kryo 2012-09-05 08:22:57 +00:00
root 75487b2f5a Broadcast the JobConf in HadoopRDD to reduce task sizes 2012-09-05 08:14:50 +00:00
root b7ad291ac5 Tuning Akka for more connections 2012-09-05 07:08:07 +00:00
root fc186dc18a Merge branch 'dev' of github.com:radlab/spark into dev 2012-09-05 05:53:18 +00:00
root 4ea032a142 Some changes to make important log output visible even if we set the logging to WARNING 2012-09-05 05:53:07 +00:00
Denny babbca0a2f Fix wrong counting 2012-09-04 22:04:18 -07:00
Denny 9326509f66 Delete old DeployUtils. 2012-09-04 21:15:23 -07:00
Denny 1588d4dbe6 Renamed class. 2012-09-04 21:13:25 -07:00
Denny 22dde6e020 Start a standalone cluster locally. 2012-09-04 20:56:30 -07:00
Tathagata Das 7c09ad0e04 Changed DStream member access permissions from private to protected. Updated StateDStream to checkpoint RDDs and forget lineage. 2012-09-04 19:11:49 -07:00
Matei Zaharia a842c63044 Minor formatting fixes 2012-09-03 16:24:00 -07:00
Tathagata Das b8e9e8ea78 Merge branch 'dev' of github.com:radlab/spark into dev 2012-09-02 02:35:32 -07:00
root ceabf71257 tweaks 2012-09-01 21:52:42 +00:00
root 6025889be0 More raw network receiver programs 2012-09-01 20:51:07 +00:00
Harvey 3076b038f4 Start fetching a remote block when a received remote block has been passed
to the reduce function
2012-09-01 12:01:35 -07:00
Matei Zaharia f84d2bbe55 Bug fixes to RateLimitedOutputStream 2012-09-01 00:31:15 -07:00
Matei Zaharia 44758aa8e2 First work towards a RawInputDStream and a sender program for it. 2012-09-01 00:17:59 -07:00
root c42e7ac282 More block manager fixes 2012-09-01 04:31:11 +00:00
Matei Zaharia 389fb4cc54 End runJob() with a SparkException when a task fails too many times in
one of the cluster schedulers.
2012-08-31 17:47:43 -07:00
root 113277549c Really fixed the replication-3 issue. The problem was a few buffers not being rewound. 2012-08-31 05:39:35 +00:00
Mosharaf Chowdhury 31ffe8d528 Synchronization bug fix in broadcast implementations 2012-08-30 22:26:43 -07:00
Matei Zaharia 101ae493e2 Replicate serialized blocks properly, without sharing a ByteBuffer. 2012-08-30 22:24:14 -07:00
Mosharaf Chowdhury 3883532545 Bug fix. Fixed log messages. Updated BroadcastTest example to have iterations. 2012-08-30 21:43:00 -07:00
Matei Zaharia a480dec6b2 Deserialize multi-get results in the caller's thread. This fixes an
issue with shared buffers in the KryoSerializer.
2012-08-30 20:01:06 -07:00
Matei Zaharia 1b3e3352eb Deserialize multi-get results in the caller's thread. This fixes an
issue with shared buffers with the KryoSerializer.
2012-08-30 17:59:25 -07:00
root c4366eb764 Fixes to ShuffleFetcher 2012-08-31 00:34:24 +00:00
Reynold Xin a8a2a08a1a Added a test for testing map-side combine on/off switch. 2012-08-30 12:34:28 -07:00
Reynold Xin 5945bcdcc5 Added a new flag in Aggregator to indicate applying map side combiners. 2012-08-29 23:32:08 -07:00
Reynold Xin c68e820b2a Merge branch 'dev' of github.com:mesos/spark into dev 2012-08-29 23:01:19 -07:00
Reynold Xin 940869dfda Disable running combiners on map tasks when mergeCombiners function is
not specified by the user.
2012-08-29 23:00:02 -07:00
Tathagata Das 4db3a96766 Made minor changes to reduce compilation errors in Eclipse. Twirl stuff still does not compile in Eclipse. 2012-08-29 13:04:01 -07:00
Matei Zaharia bf2e9cb08e Fault tolerance and block store fixes discovered through streaming tests. 2012-08-27 23:07:50 -07:00
Matei Zaharia 17af2df0cd Log levels 2012-08-27 23:07:32 -07:00
Matei Zaharia b4a2214218 More fault tolerance fixes to catch lost tasks 2012-08-27 22:49:29 -07:00
Reynold Xin 3a6a95dc24 Removed the deserialization cache for ShuffleMapTask because it was
causing concurrency problems (some variables in Shark get set to null).
The cost of task deserialization on slaves is trivial compared with the
execution time of the task anyway.
2012-08-27 22:33:15 -07:00
Josh Rosen bff6a46359 Add pipe(), saveAsTextFile(), sc.union() to Python API. 2012-08-27 00:24:47 -07:00
Josh Rosen 200d248dcc Simplify Python worker; pipeline the map step of partitionBy(). 2012-08-27 00:24:39 -07:00
Josh Rosen f79a1e4d2a Add broadcast variables to Python API. 2012-08-27 00:16:47 -07:00
Matei Zaharia b914cd0dfa Serialize generation correctly in ShuffleMapTask 2012-08-26 20:07:59 -07:00
Matei Zaharia 69c2ab0408 logging 2012-08-26 20:00:58 -07:00
Matei Zaharia 117e3f8c86 Fix a bug that was causing FetchFailedException not to be thrown 2012-08-26 19:52:56 -07:00
Matei Zaharia 3c9c44a8d3 More helpful log messages 2012-08-26 19:37:43 -07:00
Matei Zaharia 26dfd20c9a Detect disconnected slaves in StandaloneScheduler 2012-08-26 18:56:56 -07:00
Matei Zaharia 29e83f39e9 Fix replication with MEMORY_ONLY_DESER_2 2012-08-26 18:16:25 -07:00
Matei Zaharia 06ef7c3d1b Less debug info 2012-08-26 16:29:20 -07:00
Matei Zaharia 741899b21e Fix sendMessageReliablySync 2012-08-26 16:26:06 -07:00
Matei Zaharia 5a8015d2db Merge remote-tracking branch 'public/dev' into dev 2012-08-24 16:11:44 -07:00
Matei Zaharia 2c16ae36d7 Set log level in tests to WARN 2012-08-23 20:38:14 -07:00
Matei Zaharia deedb9e7b7 Fix further issues with tests and broadcast.
The broadcast fix is to store values as MEMORY_ONLY_DESER instead of
MEMORY_ONLY, which will save substantial time on serialization.
2012-08-23 20:31:49 -07:00
Matei Zaharia 59b831b9d1 Fixed test failures due to broadcast not stopping correctly 2012-08-23 19:59:55 -07:00
Matei Zaharia 7310a6f499 Merge pull request #147 from mosharaf/dev
Broadcast refactoring/cleaning up
2012-08-23 19:38:28 -07:00
Josh Rosen 607b53abfc Use numpy in Python k-means example. 2012-08-22 00:43:55 -07:00
Josh Rosen fd94e5443c Use only cPickle for serialization in Python API.
Objects serialized with JSON can be compared for equality, but JSON can be slow
to serialize and only supports a limited range of data types.
2012-08-21 14:01:27 -07:00
Matei Zaharia 25a6a39e6d Added other SparkContext constructors to JavaSparkContext 2012-08-19 18:59:16 -07:00
Josh Rosen 886b39de55 Add Python API. 2012-08-18 22:33:51 -07:00
Shivaram Venkataraman 0f4fbb057b Change BlockManagerSuite test cases to use a deterministic size estimator and
update the results to match the new estimates
2012-08-13 13:32:23 -07:00
Shivaram Venkataraman 22ba3a3f77 Add test-cases for 32-bit and no-compressed oops scenarios. 2012-08-13 13:32:10 -07:00
Shivaram Venkataraman 1f68c4b03b Update test cases to match the new size estimates. Uses 64-bit and compressed
oops setting to get deterministic results
2012-08-13 13:31:54 -07:00
Shivaram Venkataraman 1ea269110c Move object size and pointer size initialization into a function to enable unit-testing 2012-08-13 13:31:45 -07:00
Shivaram Venkataraman 44661df9cc If spark.test.useCompressedOops is set, use that to infer compressed oops
setting. This is useful to get a deterministic test case
2012-08-13 13:31:39 -07:00
Shivaram Venkataraman 0dd8fe73ba Use HotSpotDiagnosticMXBean to get if CompressedOops are in use or not 2012-08-13 13:31:29 -07:00
Shivaram Venkataraman 80104ce1da Add link to Java wiki which specifies what changes with compressed oops 2012-08-13 13:31:21 -07:00
Shivaram Venkataraman 00ab5490b3 Changes to make size estimator more accurate. Fixes object size, pointer size
according to architecture and also aligns objects and arrays when computing
instance sizes. Verified using Eclipse Memory Analysis Tool (MAT)
2012-08-13 13:31:11 -07:00
Matei Zaharia 6ae3c375a9 Renamed apply() to call() in Java API and allowed it to throw Exceptions 2012-08-12 23:10:19 +02:00
Matei Zaharia 0141879c40 Use Promises instead of having a Future wait on a thread in
ConnectionManager.
2012-08-12 22:16:32 +02:00
Matei Zaharia 845a870242 Return remotely fetched blocks in a pipelined fashion from BlockManager 2012-08-12 20:01:38 +02:00
Matei Zaharia e17ed9a21d Switch to Akka futures in connection manager.
It's still not good because each Future ends up waiting on a lock, but
it seems to work better than Scala Actors, and more importantly it
allows us to use onComplete and other listeners on futures.
2012-08-12 19:40:37 +02:00
Matei Zaharia ad8a7612a4 Changed multi-get method in BlockManager to return an iterator 2012-08-12 19:18:01 +02:00
Matei Zaharia 3c94e5c188 Merge pull request #168 from shivaram/dev
Use JavaConversion to get a scala iterator
2012-08-10 00:57:33 -07:00
Matei Zaharia e463e7a333 Merge pull request #167 from JoshRosen/piped-rdd-fixes
Detect non-zero exit status from PipedRDD process
2012-08-10 00:56:42 -07:00
Josh Rosen 59c22fb444 Print exit status in PipedRDD failure exception. 2012-08-10 00:33:56 -07:00
Shivaram Venkataraman 1803cce692 Use an implicit conversion to get the scala iterator 2012-08-08 14:31:04 -07:00
Shivaram Venkataraman 674fcf56bf Use JavaConversion to get a scala iterator 2012-08-08 14:10:23 -07:00
Shivaram Venkataraman f4aaec7a48 Avoid a copy in ShuffleMapTask by creating an iterator that will be used by the
block manager.
2012-08-08 00:47:02 -07:00
Mosharaf Chowdhury d821dd3ccc BroadcastManager is a class now (replaced Braodcast object) 2012-08-05 01:10:51 -07:00
Mosharaf Chowdhury b4804119f9 Merge remote-tracking branch 'upstream/dev' into dev 2012-08-04 20:42:12 -07:00
Matei Zaharia 88b016db2a Merge pull request #160 from dennybritz/clusterscripts
Standalone cluster scripts
2012-08-04 17:45:20 -07:00
Mosharaf Chowdhury 1b0534af8f Merge branch 'dev' into bc-bm 2012-08-04 00:30:08 -07:00
Mosharaf Chowdhury d11b457e67 Merge remote-tracking branch 'upstream/dev' into dev 2012-08-04 00:28:10 -07:00
Mosharaf Chowdhury 24b7eb872c Bug fixed. Broadcast now works with BlockManager. 2012-08-04 00:27:28 -07:00
Shivaram Venkataraman ce3444d2cb Fix testcheckpoint to reuse spark context defined in the class 2012-08-03 18:52:26 -07:00
Matei Zaharia 62898b631f Made range partition balance tests more aggressive.
This is because we pull out such a large sample (10x the number of
partitions) that we should expect pretty good balance. The tests are
also deterministic so there's no worry about them failing irreproducibly.
2012-08-03 16:46:48 -04:00
Matei Zaharia 6601a6212b Added a unit test for cross-partition balancing in sort, and changes to
RangePartitioner to make it pass. It turns out that the first partition
was always kind of small due to how we picked partition boundaries.
2012-08-03 16:40:45 -04:00
Harvey 1170de3757 Fix for partitioning when sorting in descending order 2012-08-03 16:40:38 -04:00
Paul Cavallaro d05c0f97ca Logging Throwables in Info and Debug
Logging Throwables in logInfo and logDebug instead of swallowing them.

Conflicts:

	core/src/main/scala/spark/Logging.scala
2012-08-03 16:40:21 -04:00
Denny 0008994044 merged dev branch 2012-08-02 16:00:33 -07:00
Denny 53008c2d8a Settings variables and bugfix for stop script. 2012-08-02 15:59:39 -07:00
Matei Zaharia 71a958b0b7 Merge branch 'dev' of github.com:mesos/spark into dev
Conflicts:
	project/SparkBuild.scala
2012-08-02 17:23:13 -04:00
Denny 7312a5c30f Use spray's implicit Marshaller for Futures. 2012-08-02 14:11:27 -07:00
Denny ba7e30fb5e Mostly stlyistic changes. 2012-08-02 13:55:09 -07:00
Shivaram Venkataraman 1a07bb9ba4 Avoid an extra partition copy by passing an iterator to blockManager.put 2012-08-02 12:22:33 -07:00
Shivaram Venkataraman 6790908b11 Use maxMemory to better estimate memory available for BlockManager cache 2012-08-02 12:05:05 -07:00
Denny 863c31b7c1 Moved resources into static folder 2012-08-02 09:48:36 -07:00
Tathagata Das 1c0aeee960 Merge branch 'dev' of github.com:radlab/spark into dev 2012-08-01 22:11:41 -07:00
Tathagata Das 3be54c2a8a 1. Refactored SparkStreamContext, Scheduler, InputRDS, FileInputRDS and a few other files.
2. Modified Time class to represent milliseconds (long) directly, instead of LongTime.
3. Added new files QueueInputRDS, RecurringTimer, etc.
4. Added RDDSuite as the skeleton for testcases.
5. Added two examples in spark.streaming.examples.
6. Removed all past examples and a few unnecessary files. Moved a number of files to spark.streaming.util.
2012-08-01 22:09:27 -07:00
Denny 0ee44c225e Spark standalone mode cluster scripts.
Heavily inspired by Hadoop cluster scripts ;-)
2012-08-01 20:38:52 -07:00
Denny 6c670c37dd Webui improvements. 2012-08-01 19:47:57 -07:00
Denny 1b29e90a79 merge dev branch 2012-08-01 14:06:09 -07:00
Denny 011220fa55 Compact job page. 2012-08-01 11:26:45 -07:00
Denny 7a295fee96 Spark WebUI Implementation. 2012-08-01 11:01:09 -07:00
Mosharaf Chowdhury f23395e8c5 Merge remote-tracking branch 'upstream/dev' into dev 2012-07-30 19:39:49 -07:00
Matei Zaharia 7814ecbd47 Merge remote-tracking branch 'public/dev' into dev 2012-07-30 15:05:49 -07:00
Matei Zaharia 3ee2530c0c Merge branch 'block-manager-fix' into dev 2012-07-30 13:58:46 -07:00
Matei Zaharia 400221f851 Merge branch 'dev' of git://github.com/tdas/spark into dev 2012-07-30 13:54:57 -07:00
Matei Zaharia ed1b0f8388 Made BlockManagerMaster no longer be a singleton.
Also cleaned up a few formatting things throughout block manager code.
2012-07-30 13:53:47 -07:00
Matei Zaharia f471c82558 Various reorganization and formatting fixes 2012-07-30 11:24:01 -07:00
Mosharaf Chowdhury 5932a87cac Merge remote-tracking branch 'upstream/dev' into dev 2012-07-29 18:20:45 -07:00
Matei Zaharia e2e71a1fb5 Merge remote-tracking branch 'public/dev' into dev 2012-07-28 20:26:59 -07:00
Matei Zaharia d7f089323a Fixed AccumulatorSuite to clean up SparkContext with BeforeAndAfter 2012-07-28 20:25:42 -07:00
Imran Rashid f7149c5e46 tasks cannot access value of accumulator 2012-07-28 20:16:17 -07:00
Imran Rashid 244cbbe33a one more minor cleanup to scaladoc 2012-07-28 20:16:10 -07:00
Imran Rashid 3b392c67db fix up scaladoc, naming of type parameters 2012-07-28 20:16:01 -07:00
Imran Rashid f1face1ea9 rename addToAccum to addAccumulator 2012-07-28 20:16:01 -07:00
Imran Rashid 2d666b9d76 add some functionality to Vector, delete copy in AccumulatorSuite 2012-07-28 20:15:51 -07:00
Imran Rashid edc6972f8e move Vector class into core and spark.util package 2012-07-28 20:15:42 -07:00
Imran Rashid 83659af11c Accumulator now inherits from Accumulable, whcih simplifies a bunch of other things (eg., no +:=)
Conflicts:

	core/src/main/scala/spark/Accumulators.scala
2012-07-28 20:13:51 -07:00
Imran Rashid 79d58ed20a improve scaladoc 2012-07-28 20:12:41 -07:00
Imran Rashid ae07f3864c add Accumulatable, add corresponding docs & tests for accumulators 2012-07-28 20:12:41 -07:00
Matei Zaharia 47b7ebad12 Added the Spark Streaing code, ported to Akka 2 2012-07-28 20:03:26 -07:00
Matei Zaharia f6f917bd00 Add a sleep to prevent a failing test.
The BlockManager's put seems to be slightly asynchronous, which can
cause it to fail this test by not removing stuff from the cache before
we put the next value. We should probably change the semantics of put()
in this case but it's hard right now. It will also be hard for
asynchronously replicated puts.
2012-07-27 16:59:36 -07:00
Matei Zaharia c0c78d2119 Renamed test more descriptively 2012-07-27 16:28:18 -07:00
Matei Zaharia dee8ff1b9d Added a second version of union() without varargs. 2012-07-27 16:27:52 -07:00
Tathagata Das cf429699e1 Updated the new checkpoint RDD to remember partitioning of the original RDD. 2012-07-27 23:16:37 +00:00
Mosharaf Chowdhury b5be936d7c Broadcasts using BlockManager instead of BoundedMemoryCache 2012-07-27 15:38:46 -07:00
Mosharaf Chowdhury 1f19fbb8db Merge remote-tracking branch 'upstream/dev' into dev
Conflicts:
	core/src/main/scala/spark/broadcast/Broadcast.scala
2012-07-27 15:18:23 -07:00
Matei Zaharia b51d733a57 Fixed Java union methods having same erasure.
Changed union() methods on lists to take a separate "first element"
argument in order to differentiate them to the compiler, because Java 7
considered it an error to have them all take Lists parameterized with
different types.
2012-07-27 12:23:27 -07:00
Tathagata Das 3e271c3b61 Merge branch 'dev' of github.com:tdas/spark into dev 2012-07-27 12:01:04 -07:00
Tathagata Das 024905f682 Added BlockRDD and a first-cut version of checkpoint() to RDD class. 2012-07-27 12:00:49 -07:00
Tathagata Das d1eee44a03 Fixed more stuff in BoundedMemoryCache. 2012-07-27 18:33:32 +00:00
Tathagata Das d1b7f41671 Fixed bug in BoundedMemoryCache. 2012-07-27 09:00:45 -07:00
Tathagata Das 435d129bec Fixed bugs in block dropping code of MemoryStore and changed synchronized HashMap to ConcurrentHashMap in BlockManager. 2012-07-27 10:02:26 +00:00
Tathagata Das 0426769f89 Modified the block dropping code for better performance. 2012-07-26 20:53:45 -07:00
Matei Zaharia 5c5aa2ff81 Merge pull request #153 from JoshRosen/new-java-api
Java API
2012-07-26 17:20:52 -07:00
Josh Rosen c5e2810dc7 Add persist(), splits(), glom(), and mapPartitions() to Java API. 2012-07-26 12:46:47 -07:00
Josh Rosen bf61c10072 Detect non-zero exit status from PipedRDD process. 2012-07-26 11:32:59 -07:00
Josh Rosen 6a78e88237 Minor cleanup and optimizations in Java API.
- Add override keywords.
- Cache RDDs and counts in TC example.
- Clean up JavaRDDLike's abstract methods.
2012-07-24 09:47:00 -07:00
Denny 4f4a34c025 Stlystic changes
Conflicts:

	core/src/test/scala/spark/MesosSchedulerSuite.scala
2012-07-23 16:32:20 -07:00
Denny 866e6949df Always destroy SparkContext in after block for the unit tests.
Conflicts:

	core/src/test/scala/spark/ShuffleSuite.scala
2012-07-23 16:29:17 -07:00
Matei Zaharia 600e99728d Fix a bug where an input path was added to a Hadoop job configuration twice 2012-07-23 16:16:19 -07:00
Josh Rosen 042dcbde33 Add type annotations to Java API methods.
Add missing Scala Map to java.util.Map conversions.
2012-07-22 17:35:29 -07:00
Josh Rosen e23938c3be Use mapValues() in JavaPairRDD.cogroupResultToJava(). 2012-07-22 15:10:01 -07:00
Josh Rosen 01dce3f569 Add Java API
Add distinct() method to RDD.

Fix bug in DoubleRDDFunctions.
2012-07-18 17:34:29 -07:00
Mosharaf Chowdhury 85cd9979f2 Fix for isLocal 2012-07-13 01:13:14 -07:00
Mosharaf Chowdhury 1c83fd4b66 Merged with Upstream dev 2012-07-13 01:08:28 -07:00
Mosharaf Chowdhury bb4ee580fa Cleaning BitTorrentBroadcast code... 2012-07-13 01:04:01 -07:00
Mosharaf Chowdhury 8ccffe21da Cleaned TreeBroadcast 2012-07-13 00:54:25 -07:00
Matei Zaharia 628bb5ca7f Allow null keys in Spark's reduce and group by 2012-07-12 18:36:02 -07:00
Matei Zaharia e2a67a8024 Fixes to coarse-grained Mesos scheduler in dealing with failed nodes 2012-07-12 18:21:52 -07:00
Matei Zaharia be622cf867 Formatting 2012-07-11 17:31:44 -07:00
Matei Zaharia e8ae77df24 Added more methods for loading/saving with new Hadoop API 2012-07-11 17:31:33 -07:00
Mosharaf Chowdhury 34999d97f5 Added stop() to the Broadcast subsystem 2012-07-10 01:03:47 -07:00
Mosharaf Chowdhury d6a9680604 Slightly better check for isLocal 2012-07-10 00:16:47 -07:00
Mosharaf Chowdhury 701f49e0d9 Refactoring 2012-07-09 22:39:47 -07:00
Mosharaf Chowdhury cf1c60a1de Refactoring 2012-07-09 22:07:46 -07:00
Mosharaf Chowdhury e71f69ad3d Refactoring 2012-07-09 22:07:17 -07:00
Mosharaf Chowdhury ca02a92332 Refactored TrackMultipleValues out. 2012-07-09 21:35:39 -07:00
Mosharaf Chowdhury 654576ef1a Tweaks 2012-07-09 21:12:42 -07:00
Mosharaf Chowdhury 425c247269 Removed some unused stuff 2012-07-08 14:29:04 -07:00
Matei Zaharia 0a47284003 More work to allow Spark to run on the standalone deploy cluster. 2012-07-08 14:00:04 -07:00
Mosharaf Chowdhury c7c5258e25 Compiles without Dfs 2012-07-08 13:22:12 -07:00
Mosharaf Chowdhury 178bb29f05 Removed Chained and Dfs broadcast implementations 2012-07-08 11:57:00 -07:00
Matei Zaharia 1aa63f775b Added back coarse-grained Mesos scheduler based on StandaloneScheduler. 2012-07-08 10:52:13 -07:00
Matei Zaharia c5cc10cda3 More work on standalone scheduler 2012-07-06 20:17:44 -07:00
Matei Zaharia 909b325243 Further refactoring, and start of a standalone scheduler backend 2012-07-06 17:56:44 -07:00
Matei Zaharia 4e2fe0bdaf Miscellaneous bug fixes 2012-07-06 16:33:40 -07:00
Matei Zaharia e72afdb817 Some refactoring to make cluster scheduler pluggable. 2012-07-06 15:23:26 -07:00
Matei Zaharia 5d1a887bed Further updates to run processes on cluster. 2012-07-01 17:13:31 -07:00
Matei Zaharia 51c46eaca0 More work on standalone deploy system. 2012-07-01 01:05:59 -07:00
Matei Zaharia a6eb9fda61 Detect connection and disconnection of slaves 2012-06-30 17:46:56 -07:00
Matei Zaharia 408b5a1332 More work on deploy code (adding Worker class) 2012-06-30 16:45:57 -07:00
Matei Zaharia 2fb6e7d71e Initial framework to get a master and web UI up. 2012-06-30 14:45:55 -07:00
Matei Zaharia c53670b9bf Various code style fixes, mostly from IntelliJ IDEA 2012-06-29 18:47:12 -07:00
Matei Zaharia c6be4ffbf9 Fixes to CoarseMesosScheduler 2012-06-29 16:18:51 -07:00
Matei Zaharia 3a58efa5a5 Allow binding to a free port and change Akka logging to use SLF4J. Also
fixes various bugs in the previous code when running on Mesos.
2012-06-29 16:02:21 -07:00
Matei Zaharia 3920189932 Upgraded to Akka 2 and fixed test execution (which was still parallel
across projects).
2012-06-28 23:51:28 -07:00
root 6ad3e1f1b4 Various fixes when running on Mesos 2012-06-20 06:48:26 +00:00
Tathagata Das e896a505e2 Added testcase for ByteBufferInputStream bugs. 2012-06-17 16:11:12 -07:00
Tathagata Das 40536e3668 Fixed nasty corner case bug in ByteBufferInputStream. Could not add a test case for this as I could not figure out how to deterministically reproduce the bug in a short testcase. 2012-06-17 13:28:41 -07:00
Matei Zaharia 2893b30550 Various fixes to get unit tests running. In particular, shut down
ConnectionManager and DAGScheduler properly, plus a fix to
LocalScheduler that was not merged in from 0.5 and was actually caught
by one of the tests.
2012-06-17 00:28:45 -07:00
Matei Zaharia b3eeac55b8 Fixed HttpBroadcast to work with this branch's Serializer. 2012-06-15 23:54:38 -07:00
Matei Zaharia f58da6164e Merge branch 'master' into dev 2012-06-15 23:47:11 -07:00
Tathagata Das 5f54bdf98b Added shutdown for akka to SparkContext.stop(). Helps a little, but many testsuites still fail. 2012-06-13 20:49:00 -04:00
Tathagata Das c6156da9e2 Multiple bug fixes to pass the testsuites ShuffleSuite and BlockManagerSuite. 2012-06-13 16:26:49 -04:00
Matei Zaharia 879bc0bece Merge branch 'master' into mesos-0.9 2012-06-09 16:24:16 -07:00
Matei Zaharia 4b05798c06 Further bug fix to HttpBroadcast 2012-06-09 16:24:03 -07:00
Matei Zaharia 587a16a7ef Merge branch 'master' into mesos-0.9 2012-06-09 16:17:07 -07:00
Matei Zaharia 8ed662862e Bug fix to HttpBroadcast 2012-06-09 16:16:55 -07:00
Matei Zaharia 2fd9f994ae Merge branch 'master' into mesos-0.9 2012-06-09 15:58:35 -07:00
Matei Zaharia e75b1b5cb4 Change the default broadcast implementation to a simple HTTP-based
broadcast. Fixes #139.
2012-06-09 15:58:07 -07:00
Matei Zaharia a96558caa3 Performance improvements to shuffle operations: in particular, preserve
RDD partitioning in more cases where it's possible, and use iterators
instead of materializing collections when doing joins.
2012-06-09 14:44:18 -07:00
Matei Zaharia c2c7299d7a Added BlockManagerSuite, which I'd forgotten to merge. 2012-06-07 13:47:10 -07:00
Matei Zaharia 63051dd2bc Merge in engine improvements from the Spark Streaming project, developed
jointly with Tathagata Das and Haoyuan Li. This commit imports the changes
and ports them to Mesos 0.9, but does not yet pass unit tests due to
various classes not supporting a graceful stop() yet.
2012-06-07 12:45:38 -07:00
Matei Zaharia 7e1c97fc4b Merge branch 'master' into mesos-0.9 2012-06-06 16:48:59 -07:00
Matei Zaharia 048276799a Commit task outputs to Hadoop-supported storage systems in parallel on the
cluster instead of on the master. Fixes #110.
2012-06-06 16:46:53 -07:00
Matei Zaharia 6888bc7191 Merge branch 'master' into mesos-0.9 2012-06-06 16:14:19 -07:00
Matei Zaharia 6ae2746d1e Handle arrays that contain the same element many times better in
SizeEstimator. Also added a test for SizeEstimator. Fixes #136.
2012-06-06 16:13:02 -07:00
Matei Zaharia 0a617958d1 Some refactoring to make BoundedMemoryCache test similar to others 2012-06-06 16:12:08 -07:00
Matei Zaharia dbc3c86ae3 Merge branch 'master' into mesos-0.9
Conflicts:
	core/src/main/scala/spark/Executor.scala
2012-06-03 17:44:04 -07:00
Matei Zaharia e141f644ca Merge pull request #132 from Benky/rb-first-iteration
Little refactoring and unit tests for CacheTrackerActor
2012-05-26 13:15:06 -07:00
Richard Benkovsky ae64920337 MesosScheduler refactoring 2012-05-22 11:04:54 +02:00
Richard Benkovsky 3a1bcd4028 Added tests for CacheTrackerActor 2012-05-22 11:04:54 +02:00
Richard Benkovsky 8f2f736d53 Little refactoring 2012-05-22 11:04:54 +02:00
Richard Benkovsky 518506a7c5 Added tests for Utils.copyStream 2012-05-22 11:04:51 +02:00
Richard Benkovsky f162fc2beb Formating fixed 2012-05-22 09:45:38 +02:00
Richard Benkovsky 565245871f BoundedMemoryCache.put fails when estimated size of 'value' is larger than cache capacity 2012-05-20 22:13:35 +02:00
Richard Benkovsky 822a4be37d Utils.memoryBytesToString fixed 2012-05-19 15:13:20 +02:00
Reynold Xin d0c6e9f639 Made some RDD dependencies transient to reduce the amount of data needed
to be serialized in closure serialization. This can significantly reduce
the task setup time in Shark when the query involves a large number of
(Hive) partitions.
2012-05-16 14:16:55 -07:00
Reynold Xin 16461e2eda Updated Cache's put method to use a case class for response. Previously
it was pretty ugly that put() should return -1 for failures.
2012-05-15 00:31:52 -07:00
Reynold Xin 019e48833f Added the capacity to report cache usage status back to the cache
trackor. This is essential for building a dashboard to see the status of
caches on all slaves.
2012-05-14 18:39:04 -07:00
Matei Zaharia f48742683a Made caches dataset-aware so that they won't cyclically evict partitions
from the same dataset.
2012-05-06 20:14:40 -07:00
Matei Zaharia bd2ab635a7 Fixed the way the JAR server is created after finding issue at Twitter 2012-05-05 20:05:15 -07:00
Matei Zaharia 32a4f4623c Merge pull request #129 from mesos/rxin
Force serialize/deserialize task results in local execution mode.
2012-04-24 16:18:39 -07:00
Reynold Xin 761ea65a98 Added a test for the previous commit (failing to serialize task results
would throw an exception for local tasks).
2012-04-24 15:14:35 -07:00
Reynold Xin 9821cd4d42 Force serialize/deserialize task results in local execution mode. 2012-04-24 14:55:28 -07:00
Antonio 3e48818993 Removed commented-out System.exit call 2012-04-23 11:42:58 -07:00
Antonio 39d99168dc Added exception handling instead of just exiting in LocalScheduler for tasks that throw exceptions 2012-04-20 14:46:43 -07:00
Reynold Xin e601b3b9e5 Added the ability to set environmental variables in piped rdd. 2012-04-17 16:40:56 -07:00
Matei Zaharia 3b745176e0 Bug fix to pluggable closure serialization change 2012-04-12 17:53:02 +00:00
Matei Zaharia 112655f032 Merge pull request #121 from rxin/kryo-closure
Added an option (spark.closure.serializer) to specify the serializer for closures.
2012-04-10 14:21:02 -07:00
Reynold Xin d295ccb43c Added a closureSerializer field in SparkEnv and use it to serialize
tasks.
2012-04-10 13:29:46 -07:00
Reynold Xin 968f75f6af Added an option (spark.closure.serializer) to specify the serializer for
closures. This enables using Kryo as the closure serializer.
2012-04-09 21:59:56 -07:00
Matei Zaharia a69c0738d1 Merge branch 'master' into mesos-0.9 2012-04-08 23:41:36 -07:00
Matei Zaharia a633974143 Merge branch 'master' of github.com:mesos/spark 2012-04-08 23:41:25 -07:00
Matei Zaharia 0229d5390f Merge branch 'master' into mesos-0.9 2012-04-08 23:39:37 -07:00
Matei Zaharia d401e1b3e8 Fix a possible deadlock in MesosScheduler 2012-04-08 23:38:49 -07:00
Ankur Dave 7be1c7b331 Report entry dropping in BoundedMemoryCache 2012-04-06 15:49:32 -07:00
Matei Zaharia a8bb324ed9 Merge branch 'master' into mesos-0.9 2012-04-05 14:53:22 -07:00
Matei Zaharia 816d4e5840 Pass local IP address instead of hostname in spark.master.host. Fixes #117. 2012-04-05 14:53:17 -07:00
Matei Zaharia 335a6036ad Converted some tabs to spaces 2012-04-05 11:58:01 -07:00
Matei Zaharia 8c95a85438 Use Runtime.maxMemory instead of Runtime.totalMemory in
BoundedMemoryCache, in case the JVM was not started with its initial
heap size equaling its maximum one (-Xms == -Xmx).
2012-03-30 13:39:35 -04:00
Matei Zaharia 03d5b3b48d Use Runtime.maxMemory instead of Runtime.totalMemory in
BoundedMemoryCache, in case the JVM was not started with its initial
heap size equaling its maximum one (-Xms == -Xmx).
2012-03-30 13:38:19 -04:00
Matei Zaharia 95fb1a16b8 Use Mesos 0.9 RC3 JAR and protobuf 2.4.1 2012-03-30 11:38:49 -04:00
Matei Zaharia dfa3b6b544 Fixes to work with the very latest Mesos 0.9 API 2012-03-29 22:12:35 -04:00
Matei Zaharia 4d52cc6738 Merge branch 'master' into mesos-0.9 2012-03-29 21:29:39 -04:00
Reynold Xin 42dcdbcb2f Removed the extra spaces in OrderedRDDFunctions and SortedRDD. 2012-03-29 15:21:57 -07:00
Matei Zaharia 08cda89e8a Further fixes to how Mesos is found and used 2012-03-17 13:39:14 -07:00
Matei Zaharia 3c3fdf6eca Merge branch 'master' into mesos-0.9 2012-03-17 13:09:21 -07:00
Matei Zaharia c7af538ac1 Some fixes to sorting for when the RDD has fewer elements than the
number of partitions we ask to partition it into. Also, removed a test
that was taking way too long to run.
2012-03-17 13:08:36 -07:00
Matei Zaharia a099a63a8a Initial work to make Spark compile with Mesos 0.9 and Hadoop 1.0 2012-03-17 12:31:34 -07:00
Matei Zaharia a5e2b6a6bd Merge pull request #112 from cengle/master
Changed HadoopRDD to get key and value containers from the RecordReader instead of through reflection
2012-03-06 13:38:32 -08:00
Matei Zaharia 97eee50825 Fixes a nasty bug that could happen when tasks fail, because calling
wait() with a timeout of 0 on a Java object means "wait forever".
2012-03-01 13:43:17 -08:00
Cliff Engle dd68cb6099 Get key and value container from RecordReader 2012-02-29 16:33:23 -08:00
Matei Zaharia 1e10df0a46 Merge pull request #111 from alupher/master
Adding sorting to RDDs
2012-02-24 15:50:14 -08:00
Antonio 0d93d95bcf Removed unnecessary import 2012-02-21 19:57:12 -08:00
Antonio 2990298f71 Added sorting testing suite 2012-02-21 19:54:21 -08:00
Matei Zaharia aa04f87cd2 Added support for parallel execution of jobs in DAGScheduler. 2012-02-19 22:50:23 -08:00
Antonio 620798161b Added fixes to sorting 2012-02-13 00:07:39 -08:00
Matei Zaharia 2587ce1690 Fixed a deadlock that occured with MesosScheduler due to an earlier
synchronization change
2012-02-11 21:22:45 -08:00
Antonio e93f622665 Added sorting by key for pair RDDs 2012-02-11 00:56:28 -08:00
Matei Zaharia 98f008b721 Formatting fixes 2012-02-10 10:52:03 -08:00
Matei Zaharia 7660a8b12f Merge branch 'formatting'
Conflicts:
	core/src/main/scala/spark/DAGScheduler.scala
	core/src/main/scala/spark/SimpleShuffleFetcher.scala
	core/src/main/scala/spark/SparkContext.scala
2012-02-10 10:42:14 -08:00
haoyuan 194c42ab79 Code format. 2012-02-10 08:19:53 -08:00
Matei Zaharia 8f5ed51234 Delete Spark's temporary directories when the JVM exits. 2012-02-09 22:58:24 -08:00
Matei Zaharia c0a0df3285 Made the default cache BoundedMemoryCache, and reduced its default size 2012-02-09 22:32:02 -08:00
Matei Zaharia a766780f4c Added some tests for multithreaded access to Spark. 2012-02-09 22:27:53 -08:00
Matei Zaharia 0e93891d3d Replaced LocalFileShuffle with a non-singleton ShuffleManager class
and made DAGScheduler automatically set SparkEnv.
2012-02-09 22:14:56 -08:00
haoyuan 445e0bb1b5 Format the code a bit mroe. 2012-02-09 15:50:26 -08:00
haoyuan 651932e703 Format the code as coding style agreed by Matei/TD/Haoyuan 2012-02-09 13:26:23 -08:00
Matei Zaharia e02dc83a5b IO optimizations 2012-02-06 20:40:39 -08:00
Matei Zaharia c40e766368 Use java.util.HashMap in shuffles 2012-02-06 19:20:25 -08:00
Matei Zaharia b267175ab5 Synchronization fix in case SparkContext is used from multiple threads. 2012-02-06 14:28:18 -08:00
Matei Zaharia 43a3335090 Simplifying test 2012-02-05 22:46:51 -08:00
Hiral Patel b47952342e Add register immutable map to kryo serializer 2012-01-26 15:24:20 -08:00
Matei Zaharia fabcc82528 Merge pull request #103 from edisontung/master
Made improvements to takeSample. Also changed SparkLocalKMeans to SparkKMeans
2012-01-13 19:20:03 -08:00
Matei Zaharia fd5581a0d3 Fixed a failure recovery bug and added some tests for fault recovery. 2012-01-13 19:17:27 -08:00
Matei Zaharia eb05154b7a Fixed a failure recovery bug and added some tests for fault recovery. 2012-01-13 19:08:25 -08:00
Edison Tung 1ecc221f84 Fixed bugs
I've fixed the bugs detailed in the diff. One of the bugs was already
fixed on the local file (forgot to commit).
2012-01-09 11:59:52 -08:00
Matei Zaharia e269f6f7ea Register RDDs with the MapOutputTracker even if they have no partitions.
Fixes #105.
2012-01-05 15:59:20 -05:00
Matei Zaharia 3034fc0d91 Merge commit 'ad4ebff42c1b738746b2b9ecfbb041b6d06e3e16' 2011-12-14 18:19:43 +01:00
Matei Zaharia 6a650cbbdf Make Spark port default to 7077 so that it's not an ephemeral port that might be taken 2011-12-14 18:18:22 +01:00
Matei Zaharia 735843a049 Merge remote-tracking branch 'origin/charles-newhadoop' 2011-12-02 21:59:30 -08:00
Charles Reiss 66f05f383e Add new Hadoop API reading support. 2011-12-01 14:02:10 -08:00
Charles Reiss 02d43e6986 Add new Hadoop API writing support. 2011-12-01 14:01:28 -08:00
Edison Tung 42f8847a21 Revert de01b6deaaee1b43321e0aac330f4a98c0ea61c6^..HEAD 2011-12-01 13:43:25 -08:00
Edison Tung de01b6deaa Fixed bug in RDD
Math.min takes 2 args, not 1. This was not committed earlier for some
reason
2011-12-01 13:34:37 -08:00
Matei Zaharia 22b8fcf632 Added fold() and aggregate() operations that reuse an object to
merge results into rather than requiring a new object allocation
for each element merged. Fixes #95.
2011-11-30 11:37:47 -08:00
Matei Zaharia 09dd58b3a7 Send SPARK_JAVA_OPTS to slave nodes. 2011-11-30 11:34:58 -08:00
Edison Tung a3bc012af8 added takeSamples method
takeSamples method takes a specified number of samples from the RDD and
outputs it in an array.
2011-11-21 16:38:44 -08:00
Ankur Dave ad4ebff42c Deduplicate exceptions when printing them
The first time they appear, exceptions are printed in full, including
a stack trace. After that, they are printed in abbreviated form. They
are periodically reprinted in full; the reprint interval defaults to 5
seconds and is configurable using the property
spark.logging.exceptionPrintInterval.
2011-11-14 01:54:53 +00:00
Ankur Dave 35b6358a7c Report errors in tasks to the driver via a Mesos status update
When a task throws an exception, the Spark executor previously just
logged it to a local file on the slave and exited. This commit causes
Spark to also report the exception back to the driver using a Mesos
status update, so the user doesn't have to look through a log file on
the slave.

Here's what the reporting currently looks like:

    # ./run spark.examples.ExceptionHandlingTest master@203.0.113.1:5050
    [...]
    11/10/26 21:04:13 INFO spark.SimpleJob: Lost TID 1 (task 0:1)
    11/10/26 21:04:13 INFO spark.SimpleJob: Loss was due to java.lang.Exception: Testing exception handling
    [...]
    11/10/26 21:04:16 INFO spark.SparkContext: Job finished in 5.988547328 s
2011-11-14 01:54:53 +00:00
Matei Zaharia 07532021fe Bug fix: reject offers that we didn't find any tasks for 2011-11-08 23:05:54 -08:00
Matei Zaharia 9e4c79a4d3 Closure cleaner unit test 2011-11-08 00:40:15 -08:00
Matei Zaharia f346e64637 Updates to the closure cleaner to work better with closures in classes.
Before, the cleaner attempted to clone $outer objects that were classes
(as opposed to nested closures) and preserve only their used fields,
which was bad because it would miss fields that are accessed indirectly
by methods, and in general it would confuse user code. Now we keep a
reference to those objects without cloning them. This is not perfect
because the user still needs to be careful of what they'll carry along
into closures, but it works better in some cases that seemed confusing
before. We need to improve the documentation on what variables get
passed along with a closure and possibly add some debugging tools for it
as well.

Fixes #71 -- that code now works in the REPL.
2011-11-08 00:33:28 -08:00
Matei Zaharia c2b7fd6899 Make parallelize() work efficiently for ranges of Long, Double, etc
(splitting them into sub-ranges). Fixes #87.
2011-11-02 15:16:02 -07:00
Matei Zaharia 157279e9eb Update Spark to work with the latest Mesos API 2011-10-30 14:10:56 -07:00
root 3a0e6c4363 Miscellaneous fixes:
- Executor should initialize logging properly
- groupByKey should allow custom partitioner
2011-10-17 18:07:35 +00:00
root 62aa820084 Merge branch 'ankur-master' 2011-10-14 02:14:07 +00:00
Ankur Dave 2d7057bf5d Implement PairRDDFunctions.partitionBy 2011-10-09 15:52:09 -07:00
Ankur Dave 06637cb69e Fix PairRDDFunctions.groupWith partitioning
This commit fixes a bug in groupWith that was causing it to destroy
partitioning information. It replaces a call to map with a call to
mapValues, which preserves partitioning.
2011-10-09 15:48:46 -07:00
Ankur Dave 2911a783d6 Add custom partitioner support to PairRDDFunctions.combineByKey 2011-10-09 15:47:20 -07:00
Ankur Dave 6c6e47e3cd Use BufferedOutputStream in ShuffleMapTask 2011-10-09 15:43:31 -07:00
Matei Zaharia 1069740264 Added a jarOfObject method to get the JAR of the class that an object
belongs to, which seems like a more common case.
2011-08-29 23:27:10 -07:00
Matei Zaharia 0aa23bf17e Added a convenience method for getting the JAR file that loaded a class
(useful for jobs to pass their own JAR files to SparkContext).
2011-08-29 22:59:44 -07:00
Matei Zaharia a161f00610 Made a log message slightly less ugly 2011-08-27 16:58:54 -07:00
Matei Zaharia 3759bcd061 New mesos.jar 2011-08-10 14:03:48 -07:00
Matei Zaharia c22043f150 Minor fix: can use >= when checking memory 2011-08-02 19:11:17 -07:00
Ismael Juma 6ff57f5594 Use scala.math instead of Math as the latter is deprecated. 2011-08-02 10:25:47 +01:00
Ismael Juma 620de2dd1d Change currentThread to Thread.currentThread as the former is deprecated. 2011-08-02 10:25:16 +01:00
Ismael Juma 0fba22b3d2 Fix issue #65: Change @serializable to extends Serializable in 2.9 branch
Note that we use scala.Serializable introduced in Scala 2.9 instead of
java.io.Serializable. Also, case classes inherit from scala.Serializable by
default.
2011-08-02 10:16:33 +01:00
Matei Zaharia 711575391d Merge branch 'scala-2.9'
Conflicts:
	project/build/SparkProject.scala
2011-08-01 15:25:26 -07:00
Matei Zaharia 4050d661c5 Updated to newest Mesos API, which includes better memory accounting
by specifying per-executor memory.
2011-08-01 13:54:48 -07:00
Matei Zaharia d12122502b Various improvements to Kryo serializer:
- Replaced modified Kryo version with the standard one augmented with
  the kryo-serializers package, which includes support for classes with
  no-arg constructors (that was why we had a modified Kryo before)
- The kryo-serializers version also fixes issue #72.
- Added a bunch of tests.
- Serialize maps and a few other common types properly by default.
2011-07-21 22:09:33 -07:00
Matei Zaharia baa72e2747 Removed a debug statement that slipped in as a println 2011-07-21 16:09:33 -07:00
Matei Zaharia 2bfd7931e8 Merge branch 'new-rdds-protobuf'
Conflicts:
	core/src/main/scala/spark/Executor.scala
	core/src/main/scala/spark/RDD.scala
2011-07-21 16:08:39 -07:00
Matei Zaharia 1450fd74d9 Merge branch 'master' into scala-2.9 2011-07-14 17:37:24 -04:00
Matei Zaharia ccf48388cd Lowered default number of splits for files 2011-07-14 17:37:04 -04:00
Matei Zaharia 146a18c2a4 Merge branch 'master' into scala-2.9 2011-07-14 17:29:17 -04:00
Matei Zaharia c8eb8b2b90 Set class loader for remote actors to fix a bug that happens in 2.9 2011-07-14 17:29:11 -04:00
Matei Zaharia 8ea67307b9 Merge branch 'master' into scala-2.9 2011-07-14 14:47:12 -04:00
Matei Zaharia e4c3402d2d Renamed ParallelArray to ParallelCollection 2011-07-14 14:47:01 -04:00
Matei Zaharia 9ac461d85d Remove RDD.toString because it looked confusing 2011-07-14 14:39:32 -04:00
Matei Zaharia 797b4547c3 Fix tracking of updates in accumulators to solve an issue that would manifest in the 2.9 interpreter 2011-07-14 14:08:34 -04:00
Matei Zaharia 3efd9e94d8 Merge branch 'master' into scala-2.9 2011-07-14 12:42:57 -04:00
Matei Zaharia 0ccfe20755 Forgot to add a file 2011-07-14 12:42:50 -04:00
Matei Zaharia 38f38dda5b Merge branch 'master' into scala-2.9 2011-07-14 12:42:02 -04:00
Matei Zaharia 969644df8e Cleaned up a few issues to do with default parallelism levels. Also
renamed HadoopFileWriter to HadoopWriter (since it's not only for files)
and fixed a bug for lookup().
2011-07-14 12:40:56 -04:00
Matei Zaharia 2fb906e8e5 Merge branch 'master' into scala-2.9 2011-07-14 00:20:14 -04:00
Matei Zaharia 2604939f64 Simplified and documented code a little and added test 2011-07-14 00:19:00 -04:00
Matei Zaharia 2439e51a03 Merge branch 'master' into implicit-sequencefile 2011-07-13 23:20:22 -04:00
Matei Zaharia d0c7958364 Merge branch 'master' into scala-2.9
Conflicts:
	core/src/main/scala/spark/HadoopFileWriter.scala
2011-07-13 23:09:33 -04:00
Matei Zaharia 9c0069188b Updated save code to allow non-file-based OutputFormats and added a test
for file-related stuff
2011-07-13 23:04:06 -04:00
Matei Zaharia da8a3b8926 Increase default value of spark.locality.wait a little 2011-07-13 20:07:24 -04:00
Matei Zaharia 080869c6ef Merge branch 'master' into scala-2.9 2011-07-13 00:20:08 -04:00
Matei Zaharia 842e14d567 Added mapPartitions operation and a bunch of tests for RDD ops 2011-07-13 00:19:52 -04:00
Matei Zaharia 9b568d37f7 Merge branch 'master' into scala-2.9
Conflicts:
	core/src/main/scala/spark/RDD.scala
2011-07-11 22:25:53 -04:00
Matei Zaharia d05fea24f3 Simplified parallel shuffle fetcher to use URLConnection 2011-07-11 22:12:36 -04:00
Matei Zaharia 25c3a7781c Moved PairRDD and SequenceFileRDD functions to separate source files 2011-07-10 00:06:15 -04:00
Matei Zaharia b7f1f62ff5 bug fix 2011-07-09 18:53:02 -04:00
Matei Zaharia 003480f374 Register byte[] with Kryo serializer 2011-07-09 18:08:07 -04:00
Matei Zaharia aea5cb4413 Added parallel shuffle fetcher 2011-07-09 17:25:56 -04:00
Matei Zaharia 4b1646a25f Support for non-filesystem-based Hadoop data sources 2011-07-06 20:37:55 -04:00
Matei Zaharia 07a97d47c2 Support for non-filesystem-based Hadoop data sources 2011-07-06 20:37:34 -04:00
Matei Zaharia 3488c386a9 Initial work to make stuff like sequenceFile[Int, Int] work without
requiring the user to provide a Writable type. The approach here might
not be the best but it seems to work correctly.
2011-06-28 17:07:04 -07:00
Matei Zaharia 5633299ec6 Merge remote-tracking branch 'origin/master' into scala-2.9 2011-06-27 22:50:59 -07:00
Matei Zaharia b0ecf1ee41 Don't pass a null context when running tasks locally 2011-06-27 22:50:43 -07:00
Matei Zaharia 85cad5d9dd Fixed HadoopFileWriter to compile for Scala 2.9 2011-06-27 22:44:14 -07:00
Matei Zaharia 393607d5ef Merge branch 'master' into scala-2.9 2011-06-27 18:08:25 -07:00
Matei Zaharia 2f652f1656 Fix a compile error 2011-06-27 18:07:16 -07:00
Tathagata Das 3f08e1129f Merge branch 'master' into td-rdd-save
Conflicts:
	core/src/main/scala/spark/SparkContext.scala
2011-06-27 13:43:44 -07:00
Tathagata Das ad842ac823 Merge branch 'master' into td-rdd-save
Conflicts:
	core/src/main/scala/spark/RDD.scala
2011-06-27 13:39:11 -07:00
Matei Zaharia bae8a97968 Merge branch 'master' into scala-2.9
Conflicts:
	repl/src/main/scala/spark/repl/SparkInterpreterLoop.scala
2011-06-26 19:22:27 -07:00
Matei Zaharia c4dd68ae21 Merge branch 'mos-bt'
This merge keeps only the broadcast work in mos-bt because the structure
of shuffle has changed with the new RDD design. We still need some kind
of parallel shuffle but that will be added later.

Conflicts:
	core/src/main/scala/spark/BitTorrentBroadcast.scala
	core/src/main/scala/spark/ChainedBroadcast.scala
	core/src/main/scala/spark/RDD.scala
	core/src/main/scala/spark/SparkContext.scala
	core/src/main/scala/spark/Utils.scala
	core/src/main/scala/spark/shuffle/BasicLocalFileShuffle.scala
	core/src/main/scala/spark/shuffle/DfsShuffle.scala
2011-06-26 18:22:12 -07:00
Tathagata Das 38f2ba99cc Further changes to HadoopFileWriter. Implemented ability to save RDDs as SequenceFiles and ObjectFiles.
1> HadoopFileWriter changed to take class types as constructor parameters (no more generic type)
2> Multiple types of RDD.saveAsHadoopFile() implemented to provide more saving options
3> RDD.saveAsSequenceFile() automatically converts basic types to Writable types before saving as SequenceFile
4> RDD.saveAsObjectFile() serializes objects and saves them to a ObjectFile
5> SparkContext.objectFile() opens the saved ObjectFiles
2011-06-24 19:51:21 -07:00
Olivier Grisel 2e3531d8bf Implemented RDD.leftOuterJoin and RDD.rightOuterJoin 2011-06-24 11:00:51 +02:00
Tathagata Das 3d2befe831 Improved HadoopFileWriter (saves key and value classes to jobconf) 2011-06-23 08:11:22 -07:00
Olivier Grisel 005d1605a4 add missing test for RDD.groupWith 2011-06-23 02:10:52 +02:00
Matei Zaharia 214250016a Added simple version of lookup 2011-06-20 11:59:16 -07:00
Matei Zaharia 23b42af70a Merge branch 'master' into scala-2.9 2011-06-19 23:06:21 -07:00
Matei Zaharia 23b1c309fb Added pipe() operation on RDDs for mapping through a shell command. 2011-06-19 23:05:19 -07:00
Tathagata Das b5e6645505 Cleaner reimplementation of HadoopFileWriter. Introduced TaskContext.
1> HadoopFileWriter works correctly with task failures
2> It can also take an user specified JobConf object for configuration settings
3> A Task can now get information like stage ID, split ID, and attempt ID using TaskContext class
4> Minor changes in SparkContext, DAGScheduler and subclasses to allow specification of TaskContext as a parameter
2011-06-16 20:57:57 -07:00
Tathagata Das 869836a2fa Implemented TaskContext to hold contextual information (jobID, taskID, attemptID) of a task 2011-06-10 19:47:28 -07:00
Tathagata Das 389e56156f HadoopFileWriter changed to use Hadoop's OutputCommitter 2011-06-09 15:29:22 -07:00
Tathagata Das 24d845833c First-cut implementation of RDD.SaveAsText 2011-06-05 04:14:43 -07:00
Matei Zaharia 3297706ab2 Merge remote-tracking branch 'origin/master' into scala-2.9 2011-06-01 11:46:31 -07:00
Matei Zaharia 9bb448a151 Catch Throwable instead of Exception in LocalScheduler and Executor. Fixes #57. 2011-06-01 11:45:47 -07:00
Matei Zaharia 850fe3274e Make the runJob API public. Fixes #56. 2011-06-01 11:38:44 -07:00
Ismael Juma 82f10bd794 Remove unnecessary toStream calls. 2011-06-01 16:12:42 +01:00
Matei Zaharia 10fe324845 Merge remote-tracking branch 'origin/master' into scala-2.9 2011-05-31 23:48:11 -07:00
Matei Zaharia 5166d76843 Ensure logging is initialized before spawning any threads to fix issue #45 2011-05-31 23:47:32 -07:00
Matei Zaharia 0afd35a8dd Some docs in ClosureCleaner 2011-05-31 22:06:30 -07:00
Matei Zaharia 8b0390d344 Instantiate NullWritable properly in HadoopFile 2011-05-30 23:54:14 -07:00
Matei Zaharia 4096c2287e Various fixes 2011-05-29 18:46:01 -07:00
Matei Zaharia ef706ae959 Merge branch 'master' into new-rdds-protobuf
Conflicts:
	run
2011-05-29 16:20:23 -07:00
Matei Zaharia c501cff924 Executor was looking for the wrong constructor for ExecutorClassLoader 2011-05-29 16:15:59 -07:00
Ismael Juma 1396678baa Move REPL classes to separate module. 2011-05-27 11:22:50 +01:00
Ismael Juma 051da8b4ad Delete liblzf from lib as it's no longer used. 2011-05-27 11:22:10 +01:00
Ismael Juma ae1a1f91f1 Remove several dependencies from git and configure them as SBT managed dependencies.
Upgrade some of the dependencies while at it.
2011-05-27 11:22:01 +01:00
Ismael Juma 164ef4c751 Use explicit asInstanceOf instead of misleading unchecked pattern matching.
Also enable -unchecked warnings in SBT build file.
2011-05-27 07:57:10 +01:00
Ismael Juma 89c8ea2bb2 Replace deprecated - and -- with suggested filterNot (which is uglier). 2011-05-26 22:22:37 +01:00
Ismael Juma 94f05683bd Replace deprecated first with head. 2011-05-26 22:13:41 +01:00
Ismael Juma 0b6a862b68 Use math instead of Math as the latter is deprecated. 2011-05-26 22:06:36 +01:00
Ismael Juma 1f27d94c48 Use Array.iterator instead of Iterator.fromArray as the latter is deprecated. 2011-05-26 22:04:42 +01:00
Ismael Juma 1993a8e556 Use += instead of + for mutable sequences as the latter is deprecated. 2011-05-26 21:59:48 +01:00
root 5ef938615f Initial work on making stuff compile with protobuf Mesos 2011-05-24 22:27:08 +00:00
Matei Zaharia cec427e777 Fixed a bug with preferred locations having changed meaning in new RDDs 2011-05-22 17:12:29 -07:00
Matei Zaharia 4c888b2933 Fix queue type for executor 2011-05-22 16:42:05 -07:00
Matei Zaharia bea3a33012 doc tweak 2011-05-22 16:03:41 -07:00
Matei Zaharia 9bde5a54cb class loader fix 2011-05-22 16:00:41 -07:00
Matei Zaharia 91c07a33d9 Various fixes to serialization 2011-05-21 22:50:08 -07:00
Matei Zaharia f61b61c4ac Merge branch 'master' into new-rdds 2011-05-21 21:25:58 -07:00
Matei Zaharia 24a1e7f838 Scheduler can now recover from lost map outputs 2011-05-20 00:19:53 -07:00
Matei Zaharia 82329b0b28 Updated scheduler to support running on just some partitions of final RDD 2011-05-19 12:47:09 -07:00
Matei Zaharia 328e51b693 Various minor fixes 2011-05-19 11:19:25 -07:00
Matei Zaharia fd1d255821 Stop objectifying various trackers, caches, etc. 2011-05-17 12:41:13 -07:00
Matei Zaharia 4db50e26c7 Fixed unit tests by making them clean up the SparkContext after use and
thus clean up the various singletons (RDDCache, MapOutputTracker, etc).
This isn't perfect yet (ideally we shouldn't use singleton objects at
all) but we can fix that later.
2011-05-13 12:03:58 -07:00
Matei Zaharia aca8150c52 Ensure that AddedToCache messages make it home before tasks finish 2011-05-13 11:43:52 -07:00
Matei Zaharia 16c886a581 Optimization for count() 2011-05-13 10:41:34 -07:00
Mosharaf Chowdhury db7a2c4897 Issue #42 fixed. 2011-04-28 14:30:48 -07:00
Ankur Dave a4c04f3f6f Error handling for disk I/O in DiskSpillingCache
Also renamed the property spark.DiskSpillingCache.cacheDir to spark.diskSpillingCache.cacheDir in order to follow conventions.
2011-04-27 23:23:29 -07:00
Ankur Dave 12ff0d2dc3 Bring an entry back into memory after fetching it from disk 2011-04-27 22:59:05 -07:00
Ankur Dave e30313aa2c Added DiskSpillingCache
DiskSpillingCache is a BoundedMemoryCache that spills entries to disk
when it runs out of space. Currently the implementation is very
simple. In particular, it's missing the following features:

- Error handling for disk I/O, including checking of disk space levels
- Bringing an entry back into memory after fetching it from disk

In addition, here are some features that aren't critical but should be
implemented soon:

- Spilling based on a user-set priority in addition to LRU
- Caching into a subdirectory of spark.DiskSpillingCache.cacheDir
  rather than the root directory
2011-04-27 22:32:35 -07:00
Mosharaf Chowdhury 60d1121343 Refactoring: daemonThreadFactories have all been moved to the Utils
object instead of having multiple copies in Broadcast and Shuffle
objects.
2011-04-27 22:13:01 -07:00
Mosharaf Chowdhury e898e108a3 Cleanup + refactoring... 2011-04-27 22:00:24 -07:00
Mosharaf Chowdhury 0567646180 Shuffle is also working from its own subpackage. 2011-04-27 21:11:41 -07:00
Mosharaf Chowdhury 2742de707a Removed some shuffle implementations. Remaining ones all use local files
to write map outputs.
2011-04-27 20:53:43 -07:00
Mosharaf Chowdhury 9d78779257 Merge branch 'mos-shuffle-tracked' into mos-bt
Conflicts:
	core/src/main/scala/spark/Broadcast.scala
2011-04-27 20:47:07 -07:00
Mosharaf Chowdhury ac7e066383 Merge branch 'master' into mos-shuffle-tracked
Conflicts:
	.gitignore
	core/src/main/scala/spark/LocalFileShuffle.scala
	src/scala/spark/BasicLocalFileShuffle.scala
	src/scala/spark/Broadcast.scala
	src/scala/spark/LocalFileShuffle.scala
2011-04-27 14:35:03 -07:00
Mosharaf Chowdhury 4e4c41026c Added support for custom classes. (from 49ea48) 2011-04-27 12:30:16 -07:00
Mosharaf Chowdhury 65848da8df Refacoring... 2011-04-26 17:41:31 -07:00
Mosharaf Chowdhury b8ab7862b8 Moved broadcast-related code to separate directory under spark.broadcast
package.
2011-04-26 17:22:52 -07:00
Mosharaf Chowdhury e31007248c Merge branch 'master' into mos-bt 2011-04-26 12:04:14 -07:00
Mosharaf Chowdhury 9257a55e3a Refactoring... 2011-04-26 11:45:36 -07:00
Mosharaf Chowdhury 9d2d533493 Temporary fix for issue #42. 2011-04-21 17:40:26 -07:00
Timothy Hunter 5c9535228a fixed small bug when classpath has some strange formatting 2011-04-18 17:12:29 -07:00
Mosharaf Chowdhury a8f47a62b9 Renamed MaxRxPeers to MaxTxPeers to MaxTxSlots and MaxRxSlots
respectively for clarity (most probably they were misunderstood and
misused)
2011-04-13 16:24:19 -07:00
Matei Zaharia 94ba95bcb2 Added flatMapValues 2011-04-12 19:51:58 -07:00
Mosharaf Chowdhury b67a968b5d hasBlocks is now AtomicInteger (even though it was ok) 2011-04-02 22:03:18 -07:00
Mosharaf Chowdhury 5bf3c83b13 BroadcastSuperTracker (right now for BT) is contacted over TCP instead
of direct procedure call.

Need to do the same for others and consolidate all broadcast mechanisms.
2011-04-01 19:31:28 -07:00
Mosharaf Chowdhury 733a130108 Formatting... 2011-04-01 14:51:24 -07:00
Mosharaf Chowdhury 4636aea598 Formatting... 2011-04-01 14:49:59 -07:00
Mosharaf Chowdhury addd569e52 Each broadcasted variable can have different blockSize. Corresponding
logic to adapt blockSize based on network condition is not yet
implemented.

Formatting + consolidation.
2011-03-31 14:51:46 -07:00
Mosharaf Chowdhury 815f3411ec Consolidated Broadcast config params. 2011-03-30 16:45:51 -07:00
Mosharaf Chowdhury a18a28b08e Removed gossip-related code that were already commented out.
More formatting.
2011-03-30 14:22:09 -07:00
Mosharaf Chowdhury 43aceafd70 Formatting... 2011-03-30 12:18:50 -07:00
Mosharaf Chowdhury 73b165220d Random is the default choice; rarestFirst didn't work well in
experiments.
2011-03-29 13:06:43 -07:00
Matei Zaharia d840fa8d0c Merge remote branch 'origin/custom-serialization' into new-rdds 2011-03-09 00:40:07 -08:00
root ff5b13799a Some tweaks to make Kryo cache work better 2011-03-09 03:31:50 -05:00
Matei Zaharia 7febdfbe29 Better reuse of buffers in Kryo serialization 2011-03-08 12:36:36 -08:00
Matei Zaharia 8ee3ec29ee Merge remote branch 'origin/custom-serialization' into new-rdds 2011-03-08 11:58:19 -08:00
Matei Zaharia 7408230bfa Updated modified Kryo to use objenesis 2011-03-08 11:58:08 -08:00
Matei Zaharia ab1216cb14 Register None and Nil properly 2011-03-08 11:52:58 -08:00
Matei Zaharia d39f5dd15e Merge remote branch 'origin/custom-serialization' into new-rdds 2011-03-08 10:28:50 -08:00
Matei Zaharia 4f0d0a7b73 stuff 2011-03-08 10:28:26 -08:00
Matei Zaharia 8b6f3db415 Merge remote branch 'origin/custom-serialization' into new-rdds 2011-03-07 19:20:28 -08:00
Matei Zaharia 38f6bce33d Added SerializingCache 2011-03-07 19:16:24 -08:00
Matei Zaharia 6316c7979d Remove some logging 2011-03-07 18:56:36 -08:00
Matei Zaharia e7b4b047a6 Added pluggable serializers and Kryo serialization 2011-03-07 18:41:53 -08:00
Matei Zaharia 467f056e29 Remove commented code 2011-03-06 23:38:41 -08:00
Matei Zaharia bce95b8458 Finished cogroup stuff 2011-03-06 23:38:16 -08:00
Matei Zaharia 04c2d6a60c stuff 2011-03-06 19:27:03 -08:00
Matei Zaharia 0fb691dd28 Various fixes to get MesosScheduler working with new RDDs 2011-03-06 16:16:38 -08:00
Matei Zaharia 1df5a65a01 Pass cache locations correctly to DAGScheduler. 2011-03-06 12:16:38 -08:00
Matei Zaharia e1436f1eaa Merge remote branch 'origin/master' into new-rdds 2011-03-06 11:11:47 -08:00
Matei Zaharia 370b95816f Added sampling for large arrays in SizeEstimator 2011-03-06 11:11:20 -08:00
Matei Zaharia a789e9aaea Merge remote branch 'origin/master' into new-rdds 2011-03-01 10:33:37 -08:00
Matei Zaharia 021c50a8d4 Remove unnecessary lock which was there to work around a bug in
Configuration in Hadoop 0.20.0
2011-03-01 10:28:38 -08:00
Matei Zaharia adaba4d550 Removed old slf4j jars that came with Hadoop 2011-03-01 10:28:21 -08:00
Matei Zaharia 447debb771 Updated Hadoop to 0.20.2 to include some bug fixes 2011-03-01 10:27:48 -08:00
Matei Zaharia 9e59afd710 More work on new RDD design 2011-02-27 19:15:52 -08:00
Matei Zaharia f38f86d59e More stuff 2011-02-27 14:27:12 -08:00
Matei Zaharia 2e6023f2bf stuff 2011-02-26 23:41:44 -08:00
Matei Zaharia 309367c477 Initial work towards new RDD design 2011-02-26 23:15:33 -08:00
Mosharaf Chowdhury 0416cc22d2 Picking peers weighted by the number of rare blocks they have. A block is rare if there are at most 2 copies in the neighborhood. Better number can be used (some function of neighborhood size) 2011-02-15 16:27:44 -08:00
Mosharaf Chowdhury cf81da9485 Optimization: Master sends out at least one copy of each block first regardless of whatever a client is asking for. Once one copy of each block is out, Master then responds to specific blocks from individual receivers. 2011-02-14 15:08:33 -08:00
Mosharaf Chowdhury 2b946fb2d1 pickBlockRarestFirst and gossips commented OUT for now.
Problem with the rarestFirst implemention is that we are picking peers randomly first and then picking blocks from the random peer using rarestFirst. NOT the right away to do it. It should be the other way around.
Problem with gossip is that peers might end up overwriting newer information by older ones. To fix that we either have to have timestamps or must match the bitVectors before overwriting.
2011-02-13 13:53:15 -08:00
Mosharaf Chowdhury ca2895ebb0 Fix in rarestFirst implemenation.
If there are more than one rarest blocks, pick randomly between them (was deterministic before)
2011-02-10 20:37:44 -08:00
Mosharaf Chowdhury 520bbdc7e3 Peers now gossip about their neighbors when they talk. 2011-02-10 20:15:30 -08:00
Matei Zaharia dc24aecd8f Close record readers in HadoopFile after finishing a split 2011-02-10 12:07:48 -08:00
Mosharaf Chowdhury 441462bc7f Fixed some warnings during compilation. 2011-02-09 12:11:43 -08:00
Mosharaf Chowdhury 1a73c0d265 Merged with master. Using sbt. 2011-02-09 10:48:48 -08:00
Mosharaf Chowdhury 495b38658e Merge branch 'master' into mos-bt 2011-02-09 10:40:23 -08:00
Matei Zaharia 99f3f23efa Changed default shuffle to LocalFileShuffle because it's way faster for small files 2011-02-08 17:03:03 -08:00
Matei Zaharia ec28b607fd Merge branch 'master' into sbt
Conflicts:
	Makefile
	core/src/main/java/spark/compress/lzf/LZF.java
	core/src/main/java/spark/compress/lzf/LZFInputStream.java
	core/src/main/java/spark/compress/lzf/LZFOutputStream.java
	core/src/main/native/spark_compress_lzf_LZF.c
	run
2011-02-02 00:25:54 -08:00
Matei Zaharia e5c4cd8a5e Made examples and core subprojects 2011-02-01 15:11:08 -08:00