Commit graph

679 commits

Author SHA1 Message Date
Charles Reiss cf79de425d Fix NullPointerException when unregistering a map output twice. 2012-11-27 16:12:05 -08:00
Matei Zaharia 3ff6f4bdee Merge pull request #304 from mbautin/configurable_local_ip
SPARK-624: make the default local IP customizable
2012-11-19 13:23:39 -08:00
mbautin 00f4e3ff9c Addressing Matei's comment: SPARK_LOCAL_IP environment variable 2012-11-19 11:52:10 -08:00
Charles Reiss 12c24e786c Set default uncaught exception handler to exit.
Among other things, should prevent OutOfMemoryErrors in some daemon threads
(such as the network manager) from causing a spark executor to enter a state
where it cannot make progress but does not report an error.
2012-11-16 20:12:31 -08:00
mbautin 1f5a7e0e64 SPARK-624: make the default local IP customizable 2012-11-15 13:57:47 -08:00
Matei Zaharia c23a74df0a Use DNS names instead of IP addresses in standalone mode, to allow
matching with data locality hints from storage systems.
2012-11-15 00:10:52 -08:00
Matei Zaharia 173e0354c0 Detect correctly when one has disconnected from a standalone cluster.
SPARK-617 #resolve
2012-11-11 21:06:57 -08:00
root acf8272324 Fix K-means example a little 2012-11-10 23:07:21 -08:00
Tathagata Das 9915989bfa Incorporated Matei's suggestions. Tested with 5 producer(consumer) threads each doing 50k puts (gets), took 15 minutes to run, no errors or deadlocks. 2012-11-09 15:46:15 -08:00
Tathagata Das de00bc63db Fixed deadlock in BlockManager.
1. Changed the lock structure of BlockManager by replacing the 337 coarse-grained locks to use BlockInfo objects as per-block fine-grained locks.
2. Changed the MemoryStore lock structure by making the block putting threads lock on a different object (not the memory store) thus making sure putting threads minimally blocks to the getting treads.
3. Added spark.storage.ThreadingTest to stress test the BlockManager using 5 block producer and 5 block consumer threads.
2012-11-09 14:09:37 -08:00
Matei Zaharia 6607f546cc Added an option to spread out jobs in the standalone mode. 2012-11-08 23:13:12 -08:00
Matei Zaharia 66cbdee941 Fix for connections not being reused (from Josh Rosen) 2012-11-08 09:53:40 -08:00
Imran Rashid 809b2bb1fe fix bug in getting slave id out of mesos 2012-11-08 00:34:28 -08:00
Matei Zaharia bb1bce7924 Various fixes to standalone mode and web UI:
- Don't report a job as finishing multiple times
- Don't show state of workers as LOADING when they're running
- Show start and finish times in web UI
- Sort web UI tables by ID and time by default
2012-11-07 16:49:53 -08:00
Matei Zaharia e2b8477487 Made Akka timeout and message frame size configurable, and upped the defaults 2012-11-06 15:58:05 -08:00
Shivaram Venkataraman a7d967a1ca Remove unnecessary hash-map put in MemoryStore 2012-11-01 10:46:38 -07:00
Josh Rosen 2ccf3b6652 Fix PySpark hash partitioning bug.
A Java array's hashCode is based on its object
identify, not its elements, so this was causing
serialized keys to be hashed incorrectly.

This commit adds a PySpark-specific workaround
and adds more tests.
2012-10-28 22:30:28 -07:00
root e782187b4a Don't throw an error in the block manager when a block is cached on the master due to
a locally computed operation

Conflicts:

	core/src/main/scala/spark/storage/BlockManagerMaster.scala
2012-10-26 00:33:45 -07:00
Matei Zaharia f63a40fd99 Strip leading mesos:// in URLs passed to Mesos 2012-10-24 21:52:13 -07:00
Matei Zaharia d290e964ea Merge pull request #281 from rxin/memreport
Added a method to report slave memory status; force serialize accumulator update in local mode.
2012-10-23 22:04:35 -07:00
Matei Zaharia 0bd20c63e2 Merge remote-tracking branch 'JoshRosen/shuffle_refactoring' into dev
Conflicts:
	core/src/main/scala/spark/Dependency.scala
	core/src/main/scala/spark/rdd/CoGroupedRDD.scala
	core/src/main/scala/spark/rdd/ShuffledRDD.scala
2012-10-23 22:01:45 -07:00
Josh Rosen d4f2e5b0ef Remove PYTHONPATH from SparkContext's executorEnvs.
It makes more sense to pass it in the dictionary
of environment variables that is used to construct
PythonRDD.
2012-10-22 10:28:59 -07:00
Josh Rosen c23bf1aff4 Add PySpark README and run scripts. 2012-10-20 00:22:27 +00:00
Josh Rosen 52989c8a2c Update Python API for v0.6.0 compatibility. 2012-10-19 10:24:49 -07:00
Josh Rosen e21eb6e00d Merge tag 'v0.6.0' into python-api 2012-10-19 09:44:32 -07:00
Thomas Dudziak d9c2a89c57 Support for Hadoop 2 distributions such as cdh4 2012-10-18 16:08:54 -07:00
Reynold Xin 4a3fb06ac2 Updated Kryo to 2.20. 2012-10-16 01:10:01 -07:00
Reynold Xin 63fae9bc23 Serialize accumulator updates in TaskResult for local mode. 2012-10-15 21:38:28 -07:00
Reynold Xin 42d20fa8da Added a method to report slave memory status. 2012-10-14 22:30:53 -07:00
Matei Zaharia 64dbf8d372 Made ShuffleDependency automatically find a shuffle ID for itself 2012-10-14 10:00:22 -07:00
Matei Zaharia 8815aeba0c Take executor environment vars as an arguemnt to SparkContext 2012-10-13 15:31:11 -07:00
Josh Rosen 33cd3a0c12 Remove map-side combining from ShuffleMapTask.
This separation of concerns simplifies the 
ShuffleDependency and ShuffledRDD interfaces.

Map-side combining can be performed in a
mapPartitions() call prior to shuffling the RDD.

I don't anticipate this having much of a 
performance impact: in both approaches, each tuple
is hashed twice: once in the bucket partitioning
and once in the combiner's hashtable.  The same
steps are being performed, but in a different
order and through one extra Iterator.
2012-10-13 14:59:20 -07:00
Josh Rosen 10bcd217d2 Remove mapSideCombine field from Aggregator.
Instead, the presence or absense of a ShuffleDependency's aggregator
will control whether map-side combining is performed.
2012-10-13 14:59:20 -07:00
Josh Rosen 4775c55641 Change ShuffleFetcher to return an Iterator. 2012-10-13 14:59:20 -07:00
Josh Rosen 110832e88f Add helper methods to Aggregator. 2012-10-13 14:57:56 -07:00
Denny 0700d1920a Protect from null env variables in mesos. 2012-10-13 13:57:59 -07:00
Denny 21047d923e Protect from setting null environment variables. 2012-10-13 13:44:24 -07:00
Denny fa41d50f7d Don't use system envs for Mesos. 2012-10-13 13:15:50 -07:00
Denny 67c42a41d0 Let the user specify environment variables to be passed to the Executors.
Also removed unused variables in the ExecutorRunner.
2012-10-13 13:08:44 -07:00
Matei Zaharia b4067cbad4 More doc updates, and moved Serializer to a subpackage. 2012-10-12 18:19:21 -07:00
Matei Zaharia 8d7b77bcb5 Some doc and usability improvements:
- Added a StorageLevels class for easy access to StorageLevel constants
  in Java
- Added doc comments on Function classes in Java
- Updated Accumulator and HadoopWriter docs slightly
2012-10-12 17:53:20 -07:00
Matei Zaharia dca496bb77 Document cartesian() operation 2012-10-12 14:46:41 -07:00
Matei Zaharia 23015ccac0 Merge pull request #271 from shivaram/block-manager-npe-fix
Change block manager to accept a ArrayBuffer
2012-10-12 14:36:28 -07:00
Patrick Wendell dc8adbd359 Adding Java documentation 2012-10-11 00:49:03 -07:00
Shivaram Venkataraman 2cf40c5fd5 Change block manager to accept a ArrayBuffer instead of an iterator to ensure
that the computation can proceed even if we run out of memory to cache the
block. Update CacheTracker to use this new interface
2012-10-11 00:42:46 -07:00
Denny d3f095f904 Fixed bug when fetching Jar dependencies.
Instead of checking currentFiles check currentJars.
2012-10-10 16:09:53 -07:00
Matei Zaharia ee2fcb2ce6 Added documentation to all the *RDDFunction classes, and moved them into
the spark package to make them more visible. Also documented various
other miscellaneous things in the API.
2012-10-09 18:38:36 -07:00
Matei Zaharia bc0bc672d0 Updates to documentation:
- Edited quick start and tuning guide to simplify them a little
- Simplified top menu bar
- Made private a SparkContext constructor parameter that was left as
  public
- Various small fixes
2012-10-09 14:30:23 -07:00
Andy Konwinski 1d79ff6028 Fixes a typo, adds scaladoc comments to SparkContext constructors. 2012-10-08 22:49:17 -07:00
Patrick Wendell ac310098ef More docs in RDD class 2012-10-08 22:25:11 -07:00
Andy Konwinski bd688940a1 A start on scaladoc for the public APIs. 2012-10-08 21:13:29 -07:00
Mosharaf Chowdhury edc67bfba8 Merge branch 'dev' into bc-fix-dev 2012-10-08 16:19:13 -07:00
Matei Zaharia efc5423210 Made compression configurable separately for shuffle, broadcast and RDDs 2012-10-07 11:30:53 -07:00
Matei Zaharia 039cc6228e Merge pull request #251 from JoshRosen/docs/internals
Document Dependency classes and make minor interface improvements
2012-10-07 09:56:53 -07:00
Reynold Xin f66c0e9561 Changed the println to logInfo in Utils.fetchFile. 2012-10-07 01:53:24 -07:00
Matei Zaharia d72db3d7dc Merge pull request #250 from rxin/dev
Fixed a bug in addFile that if the file is specified as "file:///", the symlink is created incorrectly for local mode.
2012-10-07 00:56:53 -07:00
Reynold Xin 80f59e17e2 Fixed a bug in addFile that if the file is specified as "file:///", the
symlink is created wrong for local mode.
2012-10-07 00:54:38 -07:00
Josh Rosen e10308f5a0 Make ShuffleDependency.aggregator explicitly optional.
It was confusing to be using

    new Aggregator[K, V, V](null, null, null, false)

to represent the absence of an aggregator.
2012-10-07 00:36:04 -07:00
Matei Zaharia f930fe5d81 Improve error message 2012-10-07 07:34:36 +00:00
Matei Zaharia a3bf0ce57f Don't crash on ask timeout exceptions in deploy.Client.stop() (fixes a crash in tests) 2012-10-07 07:25:41 +00:00
Matei Zaharia eca570f66a Removed the need to sleep in tests due to waiting for Akka to shut down 2012-10-07 00:17:59 -07:00
Josh Rosen 4f72066a9a Document the Dependency classes. 2012-10-07 00:05:37 -07:00
Josh Rosen 3f2571fe98 Remove unused isShuffle field from Dependency. 2012-10-07 00:03:55 -07:00
Matei Zaharia b2fc3dd902 Log message 2012-10-07 06:43:52 +00:00
Matei Zaharia ea096f7cd5 More logging 2012-10-07 06:35:48 +00:00
root 554b42cb24 Log more info in MapOutputTracker 2012-10-07 05:02:18 +00:00
root a73b25826b Made Akka thread pool and message batch sizes configurable 2012-10-07 04:19:54 +00:00
root ce915cadee Made run script add test-classes onto the classpath only if SPARK_TESTING is set; fixes #216 2012-10-07 04:19:16 +00:00
root 975009d688 Avoid acquiring locks in BlockManager when fetching shuffle outputs 2012-10-07 04:02:10 +00:00
root 0bc63f7ef1 Log initial number of fetches in reducer 2012-10-07 03:51:04 +00:00
Matei Zaharia dc28a3ac0a Modified shuffle to limit the maximum outstanding data size in bytes,
instead of the maximum number of outstanding fetches. This should make
it faster when there are many small map output files, as well as more
robust to overallocating memory on large map outputs.
2012-10-06 20:07:10 -07:00
Matei Zaharia 9a3b3f32a3 Pass sizes of map outputs back to MapOutputTracker 2012-10-06 18:46:04 -07:00
Matei Zaharia 0e42832e6a Made block store return the size of each block put in 2012-10-06 18:00:53 -07:00
Matei Zaharia b0110de5b6 Warn about user programs that try to set spark.cache.class 2012-10-06 17:27:14 -07:00
Matei Zaharia 65113b7e1b Only group elements ten at a time into SequenceFile records in
saveAsObjectFile
2012-10-06 17:14:41 -07:00
Matei Zaharia 716e10ca32 Minor formatting fixes 2012-10-05 22:03:06 -07:00
Matei Zaharia 70f02fa912 Merge branch 'dev' of github.com:mesos/spark into dev 2012-10-05 22:00:22 -07:00
Andy Konwinski a242cdd0a6 Factor subclasses of RDD out of RDD.scala into their own classes
in the rdd package.
2012-10-05 19:53:54 -07:00
Andy Konwinski d7363a6b8a Moves all files in core/src/main/scala/ that have RDD in their name
from that directory to a new core/src/main/scala/rdd directory.
2012-10-05 19:23:45 -07:00
Andy Konwinski e0067da082 Moves all files in core/src/main/scala/ that have RDD in them from
package spark to package spark.rdd and updates all references to them.
2012-10-05 19:23:45 -07:00
Matei Zaharia 69588baf65 Cleaning up code slightly 2012-10-05 19:16:09 -07:00
root f52bc09a34 Reduce some overly aggressive logging in connection manager 2012-10-06 01:54:39 +00:00
Matei Zaharia e3ae98b54e Merge pull request #247 from squito/dev
Dev
2012-10-05 10:27:18 -07:00
Imran Rashid e0698f8f26 change tests to show utility of localValue 2012-10-04 23:05:42 -07:00
Imran Rashid 82a3327862 make accumulator.localValue public, add tests
Conflicts:
	core/src/test/scala/spark/AccumulatorSuite.scala
2012-10-04 23:05:01 -07:00
Matei Zaharia 8c82f43db3 Scaladoc documentation for some core Spark functionality 2012-10-04 22:59:36 -07:00
Reynold Xin 45f4b7cc7e Made Serializer and JavaSerializer non private. 2012-10-03 10:20:59 -07:00
Matei Zaharia 833f1d0c86 Made StorageLevel public 2012-10-03 08:27:25 -07:00
Matei Zaharia 6cf5dffc72 Make more stuff private[spark] 2012-10-02 22:28:55 -07:00
Mosharaf Chowdhury 119e50c7b9 Conflict fixed 2012-10-02 22:25:39 -07:00
Matei Zaharia 626f701931 Merge pull request #240 from dennybritz/private_classes
Package-Private Classes
2012-10-02 21:24:32 -07:00
Denny 0361353a70 Make Java API abstract wrapped functions private 2012-10-02 20:02:53 -07:00
Denny b9badcd5bd accidentially removed trait 2012-10-02 19:35:07 -07:00
Denny 18a1faedf6 Stylistic changes and Public Accumulable and Broadcast 2012-10-02 19:28:37 -07:00
Denny b7a913e1fa Make dependency classes public - used by spark 2012-10-02 19:04:23 -07:00
Denny 4d9f4b01af Make classes package private 2012-10-02 19:00:19 -07:00
Matei Zaharia 97cbd699d7 Merge branch 'dev' of github.com:mesos/spark into dev 2012-10-02 17:31:01 -07:00
Matei Zaharia 6098f7e87a Fixed cache replacement behavior of BlockManager:
- Partitions that get dropped to disk will now be loaded back into RAM
  after they're accessed again
- Same-RDD rule for cache replacement is now implemented (don't drop
  partitions from an RDD to make room for other partitions from itself)
- Items stored as MEMORY_AND_DISK go into memory only first, instead of
  being eagerly written out to disk
- MemoryStore.ensureFreeSpace is called within a lock on the writer
  thread to prevent race conditions (this can still be optimized to
  allow multiple concurrent calls to it but it's a start)
- MemoryStore does not accept blocks larger than its limit
2012-10-02 17:25:38 -07:00
Reynold Xin 7997585616 Added a check to make sure SPARK_MEM <= memoryPerSlave for local cluster
mode.
2012-10-02 15:45:25 -07:00
Reynold Xin 0898a21b95 Merge branch 'dev' of https://github.com/mesos/spark into dev 2012-10-02 13:08:01 -07:00
Matei Zaharia 22684653a5 Revert "Place Spray repo ahead of Cloudera in Maven search path"
This reverts commit 42e0a68082.
2012-10-02 12:01:32 -07:00
Reynold Xin b8cd681169 Allow whitespaces in cluster URL configuration for local cluster. 2012-10-02 11:52:12 -07:00
Matei Zaharia 42e0a68082 Place Spray repo ahead of Cloudera in Maven search path 2012-10-02 11:37:19 -07:00
Matei Zaharia b9fb8d6463 Include date in folder name for Spark local dir. 2012-10-01 15:55:16 -07:00
Matei Zaharia bc881e4798 Merge branch 'dev' of github.com:mesos/spark into dev 2012-10-01 15:21:56 -07:00
Matei Zaharia 802aa8aef9 Some bug fixes and logging fixes for broadcast. 2012-10-01 15:20:42 -07:00
Reynold Xin f264153162 Fixed #232: DirectBuffer's cleaner was empty and Spark tried to invoke
clean on it.
2012-10-01 14:07:34 -07:00
Matei Zaharia 3b348f909d Improve log messages from BlockManager 2012-10-01 12:01:38 -07:00
Matei Zaharia 53f90d0f0e Use underscores instead of colons in RDD IDs 2012-10-01 10:48:53 -07:00
Matei Zaharia 2314132d57 Added a (failing) test for LRU with MEMORY_AND_DISK. 2012-09-30 22:52:16 -07:00
Matei Zaharia 3128c57f90 Simplified Class / ClassLoader test 2012-09-30 21:48:27 -07:00
Matei Zaharia 83143f9a5f Fixed several bugs that caused weird behavior with files in spark-shell:
- SizeEstimator was following through a ClassLoader field of Hadoop
  JobConfs, which referenced the whole interpreter, Scala compiler, etc.
  Chaos ensued, giving an estimated size in the tens of gigabytes.
- Broadcast variables in local mode were only stored as MEMORY_ONLY and
  never made accessible over a server, so they fell out of the cache when
  they were deemed too large and couldn't be reloaded.
2012-09-30 21:19:39 -07:00
Matei Zaharia fd0374b9de Comment 2012-09-29 21:43:06 -07:00
Matei Zaharia 5718cef2a4 Removed Logging trait from CoalescedRDD since we don't log anything 2012-09-29 21:40:43 -07:00
Matei Zaharia 143ef4f90d Added a CoalescedRDD class for reducing the number of partitions in an RDD. 2012-09-29 21:30:52 -07:00
Matei Zaharia ebd52347b5 Merge branch 'dev' of github.com:mesos/spark into dev 2012-09-29 20:22:31 -07:00
Matei Zaharia 9b326d01e9 Made BlockManager unmap memory-mapped files when necessary to reduce the
number of open files. Also optimized sending of disk-based blocks.
2012-09-29 20:21:54 -07:00
Matei Zaharia 2f11e3c285 Merge pull request #227 from JoshRosen/fix/distinct_numsplits
Allow controlling number of splits in distinct().
2012-09-28 23:57:24 -07:00
Josh Rosen 8654165e69 Use null as dummy value in distinct(). 2012-09-28 23:55:17 -07:00
Josh Rosen 37c199bbb0 Allow controlling number of splits in distinct(). 2012-09-28 23:44:19 -07:00
Matei Zaharia 56dcad5936 Don't create a Cache in SparkEnv because we don't use it 2012-09-28 23:40:56 -07:00
Matei Zaharia 1d44644f4f Logging tweaks 2012-09-28 23:28:16 -07:00
Matei Zaharia 815d6bd69a Renamed subdirs option 2012-09-28 19:02:41 -07:00
Matei Zaharia e54e1d7043 Made subdirs per local dir configurable, and reduced lock usage a bit 2012-09-28 19:00:50 -07:00
Matei Zaharia ae8c7d6cfa Made disk store use multiple directories, deleted ShuffleManager 2012-09-28 18:28:13 -07:00
Matei Zaharia 3d7267999d Print and track user call sites in more places in Spark 2012-09-28 17:42:00 -07:00
Matei Zaharia 9f6efbf06a Merge pull request #225 from pwendell/dev
Log message which records RDD origin
2012-09-28 16:28:07 -07:00
Matei Zaharia 0121a26bd1 Changed the way tasks' dependency files are sent to workers so that
custom serializers or Kryo registrators can be loaded.
2012-09-28 16:14:05 -07:00
Patrick Wendell 9fc78f8f29 Fixing some whitespace issues 2012-09-28 16:05:50 -07:00
Patrick Wendell bc909c2903 Changes based on Matei's comments 2012-09-28 16:04:36 -07:00
Patrick Wendell c387e40fb1 Log message which records RDD origin
This adds tracking to determine the "origin" of an RDD. Origin is defined by
the boundary between the user's code and the spark code, during an RDD's
instantiation. It is meant to help users understand where a Spark RDD is
coming from in their code.

This patch also logs origin data when stages are submitted to the scheduler.

Finally, it adds a new log message to fix an inconsitency in the way that
dependent stages (those missing parents) and independent stages (those
without) are logged during submission.
2012-09-28 15:51:46 -07:00
Matei Zaharia 2a8bfbca00 Fixed a bug where isLocal was set to false when using local[K] 2012-09-28 14:50:54 -07:00
Matei Zaharia 4a138403ef Fix a bug in JAR fetcher that made it always fetch the JAR 2012-09-27 21:32:06 -07:00
Matei Zaharia 009b0e37e7 Added an option to compress blocks in the block store 2012-09-27 18:45:44 -07:00
Matei Zaharia 7bcb08cef5 Renamed storage levels to something cleaner; fixes #223. 2012-09-27 17:50:59 -07:00
Matei Zaharia 920fab23c3 Merge pull request #222 from rxin/dev
Added MapPartitionsWithSplitRDD.
2012-09-26 23:16:45 -07:00
Matei Zaharia ea05fc130b Updates to standalone cluster, web UI and deploy docs. 2012-09-26 22:54:39 -07:00
Matei Zaharia 1ef4f0fbd2 Allow controlling number of splits in sortByKey. 2012-09-26 19:18:47 -07:00
Reynold Xin 1ad1331a34 Added MapPartitionsWithSplitRDD. 2012-09-26 17:11:28 -07:00
Matei Zaharia ee71fa49c1 Look for Kryo registrator using context class loader 2012-09-26 14:15:16 -07:00
Matei Zaharia d71a358c46 Fixed a test that was getting extremely lucky before, and increased the
number of samples used for sorting
2012-09-26 00:25:34 -07:00
Matei Zaharia 051785c7e6 Several fixes to sampling issues pointed out by Henry Milner:
- takeSample was biased towards earlier partitions
- There were some range errors in takeSample
- SampledRDDs with replacement didn't produce appropriate counts
  across partitions (we took exactly frac of each one)
2012-09-25 21:46:58 -07:00
Matei Zaharia 4d3339a3ec Merge pull request #217 from rxin/dev
Added a method to RDD to expose the ClassManifest.
2012-09-24 23:52:32 -07:00
Reynold Xin 7a4cd92861 Renamed RDD.manifest to RDD.elementClassManifest 2012-09-24 23:42:33 -07:00
Matei Zaharia 296e24b440 Merge pull request #218 from rnpandya/dev
Scripts to start Spark under windows
2012-09-24 21:10:31 -07:00
Reynold Xin 348bcbca1f Added a method to RDD to expose the ClassManifest. 2012-09-24 16:56:27 -07:00
Ravi Pandya 39215357af Windows command scripts for sbt and run 2012-09-24 15:43:19 -07:00
Matei Zaharia 6eeb379cf8 Fix some test issues 2012-09-24 15:39:58 -07:00
Matei Zaharia f855e4fad2 Merge pull request #208 from rxin/dev
Separated ShuffledRDD into multiple classes.
2012-09-24 12:32:01 -07:00
root 107a5ca879 Make default number of parallel fetches slightly smaller since it doesn't seem to hurt performance much and it will cause slightly less GC. 2012-09-23 06:06:12 +00:00