Matei Zaharia
173e0354c0
Detect correctly when one has disconnected from a standalone cluster.
...
SPARK-617 #resolve
2012-11-11 21:06:57 -08:00
Denny
68e0a88282
Merge branch 'master' into blockmanagerUI
2012-11-11 14:00:02 -08:00
Denny
b829fba749
Merge branch 'master' into blockmanagerUI
...
Conflicts:
core/src/main/twirl/spark/deploy/worker/index.scala.html
2012-11-11 13:59:40 -08:00
Tathagata Das
04e9e9d93c
Refactored BlockManagerMaster (not BlockManagerMasterActor) to simplify the code and fix a livelock caused by unlimited attempts to contact the master. Also added test cases in BlockManagerSuite to test the BlockManagerMaster methods getPeers and getLocations.
2012-11-11 08:54:21 -08:00
root
acf8272324
Fix K-means example a little
2012-11-10 23:07:21 -08:00
Tathagata Das
355c8e4b17
Fixed deadlock in BlockManager.
2012-11-09 16:28:45 -08:00
Tathagata Das
9915989bfa
Incorporated Matei's suggestions. Tested with 5 producer (consumer) threads, each doing 50k puts (gets); the run took 15 minutes, with no errors or deadlocks.
2012-11-09 15:46:15 -08:00
Tathagata Das
de00bc63db
Fixed deadlock in BlockManager.
...
1. Changed the lock structure of BlockManager by replacing the 337 coarse-grained locks with BlockInfo objects used as per-block fine-grained locks.
2. Changed the MemoryStore lock structure by making the block-putting threads lock on a different object (not the memory store), so that putting threads block the getting threads as little as possible.
3. Added spark.storage.ThreadingTest to stress test the BlockManager using 5 block producer and 5 block consumer threads.
2012-11-09 14:09:37 -08:00
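Note on de00bc63db: the per-block locking idea can be sketched as follows. This is a minimal Python illustration, not the Scala code from the commit; the class names `BlockInfo` and `FineGrainedStore` and the `stress` helper are hypothetical, and only the general pattern (one lock per block plus a briefly held map lock) is taken from the message above.

```python
import threading


class BlockInfo:
    """Hypothetical per-block record that also carries the block's own lock."""

    def __init__(self):
        self.lock = threading.Lock()
        self.value = None


class FineGrainedStore:
    """Sketch of a store that locks per block instead of locking the whole store."""

    def __init__(self):
        self._infos = {}
        # Guards only the dictionary itself and is held very briefly.
        self._map_lock = threading.Lock()

    def _info(self, block_id):
        # Look up (or create) the record for this block under the map lock.
        with self._map_lock:
            return self._infos.setdefault(block_id, BlockInfo())

    def put(self, block_id, value):
        info = self._info(block_id)
        with info.lock:  # only writers and readers of this block contend here
            info.value = value

    def get(self, block_id):
        info = self._info(block_id)
        with info.lock:
            return info.value  # None if the block was never put


def stress(store, producers=5, puts_per_thread=1000):
    """Run several producer threads against the store, in the spirit of a threading test."""

    def produce(thread_id):
        for i in range(puts_per_thread):
            store.put(f"block-{thread_id}-{i}", i)

    threads = [threading.Thread(target=produce, args=(t,)) for t in range(producers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
```

Because each block has its own lock, a slow put on one block does not stall gets on unrelated blocks; the only shared lock protects the dictionary lookup.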
Matei Zaharia
6607f546cc
Added an option to spread out jobs in the standalone mode.
2012-11-08 23:13:12 -08:00
Matei Zaharia
66cbdee941
Fix for connections not being reused (from Josh Rosen)
2012-11-08 09:53:40 -08:00
Imran Rashid
809b2bb1fe
fix bug in getting slave id out of mesos
2012-11-08 00:34:28 -08:00
Matei Zaharia
bb1bce7924
Various fixes to standalone mode and web UI:
...
- Don't report a job as finishing multiple times
- Don't show state of workers as LOADING when they're running
- Show start and finish times in web UI
- Sort web UI tables by ID and time by default
2012-11-07 16:49:53 -08:00
Matei Zaharia
e2b8477487
Made Akka timeout and message frame size configurable, and upped the defaults
2012-11-06 15:58:05 -08:00
Tathagata Das
72b2303f99
Fixed major bugs in checkpointing.
2012-11-05 11:41:36 -08:00
Tathagata Das
d154238789
Made checkpointing of the DStream graph work with checkpointing of RDDs. For streams that require checkpointing of their RDDs, the default checkpoint interval is set to 10 seconds.
2012-11-04 12:12:06 -08:00
Shivaram Venkataraman
a7d967a1ca
Remove unnecessary hash-map put in MemoryStore
2012-11-01 10:46:38 -07:00
Tathagata Das
34e569f40e
Added 'synchronized' to RDD serialization to ensure checkpoint-related changes are reflected atomically in the task closure. Added tests to ensure that running jobs on an RDD while checkpointing is in progress does not hurt the result of the job.
2012-10-31 00:56:40 -07:00
Tathagata Das
0dcd770fdc
Added checkpointing support to all RDDs, along with CheckpointSuite to test checkpointing in them.
2012-10-30 16:09:37 -07:00
Denny
ceec1a1a6a
Nicer storage level format on RDD page
2012-10-29 15:03:01 -07:00
Denny
eb95212f4d
code Formatting
2012-10-29 14:57:32 -07:00
Denny
531ac136bf
BlockManager UI.
2012-10-29 14:53:47 -07:00
Tathagata Das
ac12abc17f
Modified RDD API to make dependencies a var (so they can be changed to a checkpointed Hadoop RDD) and to make other references to parent RDDs go either through dependencies or through a weak reference (to allow finalizing when dependencies no longer refer to the parent).
2012-10-29 11:55:27 -07:00
Josh Rosen
2ccf3b6652
Fix PySpark hash partitioning bug.
...
A Java array's hashCode is based on its object
identity, not its elements, so this was causing
serialized keys to be hashed incorrectly.
This commit adds a PySpark-specific workaround
and adds more tests.
2012-10-28 22:30:28 -07:00
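Note on 2ccf3b6652: the effect of hashing by identity rather than by contents can be illustrated with a small Python analogy. This is not the JVM or PySpark code from the commit; `identity_partition` and `content_partition` are hypothetical helpers, and `id()` stands in for an identity-based hash.

```python
def identity_partition(key, num_partitions):
    # Stand-in for an identity-based hash: the result depends on which
    # object this is, not on the bytes it holds.
    return id(key) % num_partitions


def content_partition(key, num_partitions):
    # Hash the bytes themselves so that equal keys always land together.
    return hash(bytes(key)) % num_partitions


# Two distinct objects holding the same serialized key.
a = bytearray(b"spark")
b = bytearray(b"spark")

same_contents = a == b            # the keys are equal by value
same_object = a is b              # but they are different objects
agree_on_content = content_partition(a, 8) == content_partition(b, 8)
```

With identity-based hashing, two equal serialized keys may be routed to different partitions, so a grouping or reduction by key can silently produce split results. Content-based hashing avoids that.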
root
e782187b4a
Don't throw an error in the block manager when a block is cached on the master due to
...
a locally computed operation
Conflicts:
core/src/main/scala/spark/storage/BlockManagerMaster.scala
2012-10-26 00:33:45 -07:00
Matei Zaharia
863a55ae42
Merge remote-tracking branch 'public/master' into dev
...
Conflicts:
core/src/main/scala/spark/BlockStoreShuffleFetcher.scala
core/src/main/scala/spark/KryoSerializer.scala
core/src/main/scala/spark/MapOutputTracker.scala
core/src/main/scala/spark/RDD.scala
core/src/main/scala/spark/SparkContext.scala
core/src/main/scala/spark/executor/Executor.scala
core/src/main/scala/spark/network/Connection.scala
core/src/main/scala/spark/network/ConnectionManagerTest.scala
core/src/main/scala/spark/rdd/BlockRDD.scala
core/src/main/scala/spark/rdd/NewHadoopRDD.scala
core/src/main/scala/spark/scheduler/ShuffleMapTask.scala
core/src/main/scala/spark/scheduler/cluster/StandaloneSchedulerBackend.scala
core/src/main/scala/spark/storage/BlockManager.scala
core/src/main/scala/spark/storage/BlockMessage.scala
core/src/main/scala/spark/storage/BlockStore.scala
core/src/main/scala/spark/storage/StorageLevel.scala
core/src/main/scala/spark/util/AkkaUtils.scala
project/SparkBuild.scala
run
2012-10-24 23:21:00 -07:00
Matei Zaharia
f63a40fd99
Strip leading mesos:// in URLs passed to Mesos
2012-10-24 21:52:13 -07:00
Matei Zaharia
d290e964ea
Merge pull request #281 from rxin/memreport
...
Added a method to report slave memory status; force serialize accumulator update in local mode.
2012-10-23 22:04:35 -07:00
Matei Zaharia
0bd20c63e2
Merge remote-tracking branch 'JoshRosen/shuffle_refactoring' into dev
...
Conflicts:
core/src/main/scala/spark/Dependency.scala
core/src/main/scala/spark/rdd/CoGroupedRDD.scala
core/src/main/scala/spark/rdd/ShuffledRDD.scala
2012-10-23 22:01:45 -07:00
Josh Rosen
d4f2e5b0ef
Remove PYTHONPATH from SparkContext's executorEnvs.
...
It makes more sense to pass it in the dictionary
of environment variables that is used to construct
PythonRDD.
2012-10-22 10:28:59 -07:00
Josh Rosen
c23bf1aff4
Add PySpark README and run scripts.
2012-10-20 00:22:27 +00:00
Josh Rosen
52989c8a2c
Update Python API for v0.6.0 compatibility.
2012-10-19 10:24:49 -07:00
Josh Rosen
e21eb6e00d
Merge tag 'v0.6.0' into python-api
2012-10-19 09:44:32 -07:00
Thomas Dudziak
d9c2a89c57
Support for Hadoop 2 distributions such as cdh4
2012-10-18 16:08:54 -07:00
Reynold Xin
4a3fb06ac2
Updated Kryo to 2.20.
2012-10-16 01:10:01 -07:00
Reynold Xin
63fae9bc23
Serialize accumulator updates in TaskResult for local mode.
2012-10-15 21:38:28 -07:00
Reynold Xin
42d20fa8da
Added a method to report slave memory status.
2012-10-14 22:30:53 -07:00
Matei Zaharia
64dbf8d372
Made ShuffleDependency automatically find a shuffle ID for itself
2012-10-14 10:00:22 -07:00
Tathagata Das
e95ff45b53
Implemented checkpointing of StreamingContext and DStream graph.
2012-10-13 20:10:49 -07:00
Matei Zaharia
8815aeba0c
Take executor environment vars as an argument to SparkContext
2012-10-13 15:31:11 -07:00
Josh Rosen
33cd3a0c12
Remove map-side combining from ShuffleMapTask.
...
This separation of concerns simplifies the
ShuffleDependency and ShuffledRDD interfaces.
Map-side combining can be performed in a
mapPartitions() call prior to shuffling the RDD.
I don't anticipate this having much of a
performance impact: in both approaches, each tuple
is hashed twice: once in the bucket partitioning
and once in the combiner's hashtable. The same
steps are being performed, but in a different
order and through one extra Iterator.
2012-10-13 14:59:20 -07:00
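Note on 33cd3a0c12: the idea of combining within each partition before the shuffle can be sketched in plain Python. The function names below are hypothetical and the sketch does not use Spark; it only mirrors the order of steps described above (per-partition combine, bucket partitioning, then merging combiners on the reduce side).

```python
def combine_partition(records, create_combiner, merge_value):
    """Map-side step: fold one partition's (key, value) pairs into combiners."""
    combined = {}
    for key, value in records:
        if key in combined:
            combined[key] = merge_value(combined[key], value)
        else:
            combined[key] = create_combiner(value)
    return list(combined.items())


def shuffle(partitions, num_buckets):
    """Route each (key, combiner) pair to a bucket chosen by the key's hash."""
    buckets = [[] for _ in range(num_buckets)]
    for partition in partitions:
        for key, combiner in partition:
            buckets[hash(key) % num_buckets].append((key, combiner))
    return buckets


def reduce_bucket(bucket, merge_combiners):
    """Reduce-side step: merge the combiners that arrived for each key."""
    merged = {}
    for key, combiner in bucket:
        if key in merged:
            merged[key] = merge_combiners(merged[key], combiner)
        else:
            merged[key] = combiner
    return merged


def count_by_key(partitions, num_buckets=2):
    """Example pipeline: count values per key with a pre-shuffle combine."""
    pre_combined = [
        combine_partition(p, lambda v: v, lambda c, v: c + v) for p in partitions
    ]
    totals = {}
    for bucket in shuffle(pre_combined, num_buckets):
        totals.update(reduce_bucket(bucket, lambda x, y: x + y))
    return totals
```

Each key is hashed once in the combiner's table and once when choosing a bucket, which matches the observation in the message that the same work is done in a different order.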
Josh Rosen
10bcd217d2
Remove mapSideCombine field from Aggregator.
...
Instead, the presence or absence of a ShuffleDependency's aggregator
will control whether map-side combining is performed.
2012-10-13 14:59:20 -07:00
Josh Rosen
4775c55641
Change ShuffleFetcher to return an Iterator.
2012-10-13 14:59:20 -07:00
Josh Rosen
110832e88f
Add helper methods to Aggregator.
2012-10-13 14:57:56 -07:00
Denny
0700d1920a
Protect from null env variables in mesos.
2012-10-13 13:57:59 -07:00
Denny
21047d923e
Protect from setting null environment variables.
2012-10-13 13:44:24 -07:00
Denny
fa41d50f7d
Don't use system envs for Mesos.
2012-10-13 13:15:50 -07:00
Denny
67c42a41d0
Let the user specify environment variables to be passed to the Executors.
...
Also removed unused variables in the ExecutorRunner.
2012-10-13 13:08:44 -07:00
Matei Zaharia
b4067cbad4
More doc updates, and moved Serializer to a subpackage.
2012-10-12 18:19:21 -07:00
Matei Zaharia
8d7b77bcb5
Some doc and usability improvements:
...
- Added a StorageLevels class for easy access to StorageLevel constants
in Java
- Added doc comments on Function classes in Java
- Updated Accumulator and HadoopWriter docs slightly
2012-10-12 17:53:20 -07:00
Matei Zaharia
dca496bb77
Document cartesian() operation
2012-10-12 14:46:41 -07:00
Matei Zaharia
23015ccac0
Merge pull request #271 from shivaram/block-manager-npe-fix
...
Change block manager to accept an ArrayBuffer
2012-10-12 14:36:28 -07:00
Patrick Wendell
dc8adbd359
Adding Java documentation
2012-10-11 00:49:03 -07:00
Shivaram Venkataraman
2cf40c5fd5
Change block manager to accept an ArrayBuffer instead of an iterator to ensure
...
that the computation can proceed even if we run out of memory to cache the
block. Update CacheTracker to use this new interface
2012-10-11 00:42:46 -07:00
Denny
d3f095f904
Fixed bug when fetching Jar dependencies.
...
Instead of checking currentFiles check currentJars.
2012-10-10 16:09:53 -07:00
Matei Zaharia
ee2fcb2ce6
Added documentation to all the *RDDFunction classes, and moved them into
...
the spark package to make them more visible. Also documented various
other miscellaneous things in the API.
2012-10-09 18:38:36 -07:00
Matei Zaharia
bc0bc672d0
Updates to documentation:
...
- Edited quick start and tuning guide to simplify them a little
- Simplified top menu bar
- Made private a SparkContext constructor parameter that was left as
public
- Various small fixes
2012-10-09 14:30:23 -07:00
Andy Konwinski
1d79ff6028
Fixes a typo, adds scaladoc comments to SparkContext constructors.
2012-10-08 22:49:17 -07:00
Patrick Wendell
ac310098ef
More docs in RDD class
2012-10-08 22:25:11 -07:00
Andy Konwinski
bd688940a1
A start on scaladoc for the public APIs.
2012-10-08 21:13:29 -07:00
Mosharaf Chowdhury
edc67bfba8
Merge branch 'dev' into bc-fix-dev
2012-10-08 16:19:13 -07:00
Matei Zaharia
efc5423210
Made compression configurable separately for shuffle, broadcast and RDDs
2012-10-07 11:30:53 -07:00
Matei Zaharia
039cc6228e
Merge pull request #251 from JoshRosen/docs/internals
...
Document Dependency classes and make minor interface improvements
2012-10-07 09:56:53 -07:00
Reynold Xin
f66c0e9561
Changed the println to logInfo in Utils.fetchFile.
2012-10-07 01:53:24 -07:00
Matei Zaharia
d72db3d7dc
Merge pull request #250 from rxin/dev
...
Fixed a bug in addFile that if the file is specified as "file:///", the symlink is created incorrectly for local mode.
2012-10-07 00:56:53 -07:00
Reynold Xin
80f59e17e2
Fixed a bug in addFile that if the file is specified as "file:///", the
...
symlink is created incorrectly for local mode.
2012-10-07 00:54:38 -07:00
Josh Rosen
e10308f5a0
Make ShuffleDependency.aggregator explicitly optional.
...
It was confusing to be using
new Aggregator[K, V, V](null, null, null, false)
to represent the absence of an aggregator.
2012-10-07 00:36:04 -07:00
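Note on e10308f5a0: the difference between a placeholder aggregator full of nulls and an explicitly optional one can be sketched in Python. The dataclasses below are hypothetical stand-ins that borrow the names from the message; they are not the Scala classes.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Aggregator:
    # Illustrative fields; the real class is defined in the Scala code base.
    create_combiner: Callable
    merge_value: Callable
    merge_combiners: Callable


@dataclass
class ShuffleDependency:
    shuffle_id: int
    # Absence is stated directly instead of passing an Aggregator of nulls.
    aggregator: Optional[Aggregator] = None


def needs_combining(dep: ShuffleDependency) -> bool:
    """Callers test for presence rather than inspecting null fields."""
    return dep.aggregator is not None
```

Making absence explicit means a reader of the call site can tell at a glance that no aggregation is intended.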
Matei Zaharia
f930fe5d81
Improve error message
2012-10-07 07:34:36 +00:00
Matei Zaharia
a3bf0ce57f
Don't crash on ask timeout exceptions in deploy.Client.stop() (fixes a crash in tests)
2012-10-07 07:25:41 +00:00
Matei Zaharia
eca570f66a
Removed the need to sleep in tests due to waiting for Akka to shut down
2012-10-07 00:17:59 -07:00
Josh Rosen
4f72066a9a
Document the Dependency classes.
2012-10-07 00:05:37 -07:00
Josh Rosen
3f2571fe98
Remove unused isShuffle field from Dependency.
2012-10-07 00:03:55 -07:00
Matei Zaharia
b2fc3dd902
Log message
2012-10-07 06:43:52 +00:00
Matei Zaharia
ea096f7cd5
More logging
2012-10-07 06:35:48 +00:00
root
554b42cb24
Log more info in MapOutputTracker
2012-10-07 05:02:18 +00:00
root
a73b25826b
Made Akka thread pool and message batch sizes configurable
2012-10-07 04:19:54 +00:00
root
ce915cadee
Made run script add test-classes onto the classpath only if SPARK_TESTING is set; fixes #216
2012-10-07 04:19:16 +00:00
root
975009d688
Avoid acquiring locks in BlockManager when fetching shuffle outputs
2012-10-07 04:02:10 +00:00
root
0bc63f7ef1
Log initial number of fetches in reducer
2012-10-07 03:51:04 +00:00
Matei Zaharia
dc28a3ac0a
Modified shuffle to limit the maximum outstanding data size in bytes,
...
instead of the maximum number of outstanding fetches. This should make
it faster when there are many small map output files, as well as more
robust to overallocating memory on large map outputs.
2012-10-06 20:07:10 -07:00
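Note on dc28a3ac0a: limiting outstanding fetches by bytes rather than by request count can be sketched with a small simulation. `simulate_fetches` is a hypothetical helper, and the rule that one request is always allowed (so an oversized block cannot stall progress) is an assumption of this sketch, not something stated in the commit.

```python
from collections import deque


def simulate_fetches(block_sizes, max_bytes_in_flight):
    """Issue fetches under a byte budget; return (peak bytes, peak request count)."""
    pending = deque(block_sizes)
    in_flight = deque()
    bytes_in_flight = 0
    peak_bytes = 0
    peak_requests = 0

    while pending or in_flight:
        # Keep issuing requests while the next block fits in the byte budget.
        # Assumption: always allow one request so a large block still proceeds.
        while pending and (
            not in_flight or bytes_in_flight + pending[0] <= max_bytes_in_flight
        ):
            size = pending.popleft()
            in_flight.append(size)
            bytes_in_flight += size

        peak_bytes = max(peak_bytes, bytes_in_flight)
        peak_requests = max(peak_requests, len(in_flight))

        # Pretend the oldest outstanding request has completed.
        bytes_in_flight -= in_flight.popleft()

    return peak_bytes, peak_requests
```

With many small map outputs the budget admits many concurrent requests, while a few large outputs are fetched nearly one at a time, which is the behaviour the message describes.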
Matei Zaharia
9a3b3f32a3
Pass sizes of map outputs back to MapOutputTracker
2012-10-06 18:46:04 -07:00
Matei Zaharia
0e42832e6a
Made block store return the size of each block put in
2012-10-06 18:00:53 -07:00
Matei Zaharia
b0110de5b6
Warn about user programs that try to set spark.cache.class
2012-10-06 17:27:14 -07:00
Matei Zaharia
65113b7e1b
Only group elements ten at a time into SequenceFile records in
...
saveAsObjectFile
2012-10-06 17:14:41 -07:00
Matei Zaharia
716e10ca32
Minor formatting fixes
2012-10-05 22:03:06 -07:00
Matei Zaharia
70f02fa912
Merge branch 'dev' of github.com:mesos/spark into dev
2012-10-05 22:00:22 -07:00
Andy Konwinski
a242cdd0a6
Factor subclasses of RDD out of RDD.scala into their own classes
...
in the rdd package.
2012-10-05 19:53:54 -07:00
Andy Konwinski
d7363a6b8a
Moves all files in core/src/main/scala/ that have RDD in their name
...
from that directory to a new core/src/main/scala/rdd directory.
2012-10-05 19:23:45 -07:00
Andy Konwinski
e0067da082
Moves all files in core/src/main/scala/ that have RDD in them from
...
package spark to package spark.rdd and updates all references to them.
2012-10-05 19:23:45 -07:00
Matei Zaharia
69588baf65
Cleaning up code slightly
2012-10-05 19:16:09 -07:00
root
f52bc09a34
Reduce some overly aggressive logging in connection manager
2012-10-06 01:54:39 +00:00
Matei Zaharia
e3ae98b54e
Merge pull request #247 from squito/dev
...
Dev
2012-10-05 10:27:18 -07:00
Imran Rashid
e0698f8f26
change tests to show utility of localValue
2012-10-04 23:05:42 -07:00
Imran Rashid
82a3327862
make accumulator.localValue public, add tests
...
Conflicts:
core/src/test/scala/spark/AccumulatorSuite.scala
2012-10-04 23:05:01 -07:00
Matei Zaharia
8c82f43db3
Scaladoc documentation for some core Spark functionality
2012-10-04 22:59:36 -07:00
Reynold Xin
45f4b7cc7e
Made Serializer and JavaSerializer non private.
2012-10-03 10:20:59 -07:00
Matei Zaharia
833f1d0c86
Made StorageLevel public
2012-10-03 08:27:25 -07:00
Matei Zaharia
6cf5dffc72
Make more stuff private[spark]
2012-10-02 22:28:55 -07:00
Mosharaf Chowdhury
119e50c7b9
Conflict fixed
2012-10-02 22:25:39 -07:00
Matei Zaharia
626f701931
Merge pull request #240 from dennybritz/private_classes
...
Package-Private Classes
2012-10-02 21:24:32 -07:00
Denny
0361353a70
Make Java API abstract wrapped functions private
2012-10-02 20:02:53 -07:00
Denny
b9badcd5bd
accidentally removed trait
2012-10-02 19:35:07 -07:00
Denny
18a1faedf6
Stylistic changes and Public Accumulable and Broadcast
2012-10-02 19:28:37 -07:00
Denny
b7a913e1fa
Make dependency classes public - used by spark
2012-10-02 19:04:23 -07:00
Denny
4d9f4b01af
Make classes package private
2012-10-02 19:00:19 -07:00
Matei Zaharia
97cbd699d7
Merge branch 'dev' of github.com:mesos/spark into dev
2012-10-02 17:31:01 -07:00
Matei Zaharia
6098f7e87a
Fixed cache replacement behavior of BlockManager:
...
- Partitions that get dropped to disk will now be loaded back into RAM
after they're accessed again
- Same-RDD rule for cache replacement is now implemented (don't drop
partitions from an RDD to make room for other partitions from itself)
- Items stored as MEMORY_AND_DISK go into memory only first, instead of
being eagerly written out to disk
- MemoryStore.ensureFreeSpace is called within a lock on the writer
thread to prevent race conditions (this can still be optimized to
allow multiple concurrent calls to it but it's a start)
- MemoryStore does not accept blocks larger than its limit
2012-10-02 17:25:38 -07:00
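Note on 6098f7e87a: the same-RDD rule and the size limit can be sketched as an LRU store that refuses to evict blocks belonging to the RDD being inserted. The `MemoryStore` class below is a hypothetical Python illustration that reuses the name from the message; it ignores disk storage and locking.

```python
from collections import OrderedDict


class MemoryStore:
    """Sketch of LRU eviction that never drops blocks of the RDD being stored."""

    def __init__(self, max_bytes):
        self.max_bytes = max_bytes
        self.used = 0
        self.blocks = OrderedDict()  # block_id -> (rdd_id, size), oldest first

    def get(self, block_id):
        # A hit refreshes the block's position in the LRU order.
        if block_id in self.blocks:
            self.blocks.move_to_end(block_id)
            return True
        return False

    def put(self, block_id, rdd_id, size):
        if size > self.max_bytes:
            return False  # blocks larger than the limit are not accepted
        victims = self._select_victims(rdd_id, size)
        if victims is None:
            return False  # not enough room without breaking the same-RDD rule
        for victim in victims:
            self.used -= self.blocks.pop(victim)[1]
        self.blocks[block_id] = (rdd_id, size)
        self.used += size
        return True

    def _select_victims(self, rdd_id, size):
        free = self.max_bytes - self.used
        victims = []
        for block_id, (owner, block_size) in self.blocks.items():
            if free >= size:
                break
            if owner == rdd_id:
                continue  # same-RDD rule: do not evict our own partitions
            victims.append(block_id)
            free += block_size
        return victims if free >= size else None
```

Victims are chosen before anything is removed, so a put that cannot succeed leaves the store unchanged.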
Reynold Xin
7997585616
Added a check to make sure SPARK_MEM <= memoryPerSlave for local cluster
...
mode.
2012-10-02 15:45:25 -07:00
Reynold Xin
0898a21b95
Merge branch 'dev' of https://github.com/mesos/spark into dev
2012-10-02 13:08:01 -07:00
Matei Zaharia
22684653a5
Revert "Place Spray repo ahead of Cloudera in Maven search path"
...
This reverts commit 42e0a68082.
2012-10-02 12:01:32 -07:00
Reynold Xin
b8cd681169
Allow whitespaces in cluster URL configuration for local cluster.
2012-10-02 11:52:12 -07:00
Matei Zaharia
42e0a68082
Place Spray repo ahead of Cloudera in Maven search path
2012-10-02 11:37:19 -07:00
Matei Zaharia
b9fb8d6463
Include date in folder name for Spark local dir.
2012-10-01 15:55:16 -07:00
Matei Zaharia
bc881e4798
Merge branch 'dev' of github.com:mesos/spark into dev
2012-10-01 15:21:56 -07:00
Matei Zaharia
802aa8aef9
Some bug fixes and logging fixes for broadcast.
2012-10-01 15:20:42 -07:00
Reynold Xin
f264153162
Fixed #232: DirectBuffer's cleaner was empty and Spark tried to invoke
...
clean on it.
2012-10-01 14:07:34 -07:00
Matei Zaharia
3b348f909d
Improve log messages from BlockManager
2012-10-01 12:01:38 -07:00
Matei Zaharia
53f90d0f0e
Use underscores instead of colons in RDD IDs
2012-10-01 10:48:53 -07:00
Matei Zaharia
2314132d57
Added a (failing) test for LRU with MEMORY_AND_DISK.
2012-09-30 22:52:16 -07:00
Matei Zaharia
3128c57f90
Simplified Class / ClassLoader test
2012-09-30 21:48:27 -07:00
Matei Zaharia
83143f9a5f
Fixed several bugs that caused weird behavior with files in spark-shell:
...
- SizeEstimator was following through a ClassLoader field of Hadoop
JobConfs, which referenced the whole interpreter, Scala compiler, etc.
Chaos ensued, giving an estimated size in the tens of gigabytes.
- Broadcast variables in local mode were only stored as MEMORY_ONLY and
never made accessible over a server, so they fell out of the cache when
they were deemed too large and couldn't be reloaded.
2012-09-30 21:19:39 -07:00
Matei Zaharia
fd0374b9de
Comment
2012-09-29 21:43:06 -07:00
Matei Zaharia
5718cef2a4
Removed Logging trait from CoalescedRDD since we don't log anything
2012-09-29 21:40:43 -07:00
Matei Zaharia
143ef4f90d
Added a CoalescedRDD class for reducing the number of partitions in an RDD.
2012-09-29 21:30:52 -07:00
Matei Zaharia
ebd52347b5
Merge branch 'dev' of github.com:mesos/spark into dev
2012-09-29 20:22:31 -07:00
Matei Zaharia
9b326d01e9
Made BlockManager unmap memory-mapped files when necessary to reduce the
...
number of open files. Also optimized sending of disk-based blocks.
2012-09-29 20:21:54 -07:00
Matei Zaharia
2f11e3c285
Merge pull request #227 from JoshRosen/fix/distinct_numsplits
...
Allow controlling number of splits in distinct().
2012-09-28 23:57:24 -07:00
Josh Rosen
8654165e69
Use null as dummy value in distinct().
2012-09-28 23:55:17 -07:00
Josh Rosen
37c199bbb0
Allow controlling number of splits in distinct().
2012-09-28 23:44:19 -07:00
Matei Zaharia
56dcad5936
Don't create a Cache in SparkEnv because we don't use it
2012-09-28 23:40:56 -07:00
Matei Zaharia
1d44644f4f
Logging tweaks
2012-09-28 23:28:16 -07:00
Matei Zaharia
815d6bd69a
Renamed subdirs option
2012-09-28 19:02:41 -07:00
Matei Zaharia
e54e1d7043
Made subdirs per local dir configurable, and reduced lock usage a bit
2012-09-28 19:00:50 -07:00
Matei Zaharia
ae8c7d6cfa
Made disk store use multiple directories, deleted ShuffleManager
2012-09-28 18:28:13 -07:00
Matei Zaharia
3d7267999d
Print and track user call sites in more places in Spark
2012-09-28 17:42:00 -07:00
Matei Zaharia
9f6efbf06a
Merge pull request #225 from pwendell/dev
...
Log message which records RDD origin
2012-09-28 16:28:07 -07:00
Matei Zaharia
0121a26bd1
Changed the way tasks' dependency files are sent to workers so that
...
custom serializers or Kryo registrators can be loaded.
2012-09-28 16:14:05 -07:00
Patrick Wendell
9fc78f8f29
Fixing some whitespace issues
2012-09-28 16:05:50 -07:00
Patrick Wendell
bc909c2903
Changes based on Matei's comments
2012-09-28 16:04:36 -07:00
Patrick Wendell
c387e40fb1
Log message which records RDD origin
...
This adds tracking to determine the "origin" of an RDD. Origin is defined by
the boundary between the user's code and the spark code, during an RDD's
instantiation. It is meant to help users understand where a Spark RDD is
coming from in their code.
This patch also logs origin data when stages are submitted to the scheduler.
Finally, it adds a new log message to fix an inconsistency in the way that
dependent stages (those missing parents) and independent stages (those
without) are logged during submission.
2012-09-28 15:51:46 -07:00
Matei Zaharia
2a8bfbca00
Fixed a bug where isLocal was set to false when using local[K]
2012-09-28 14:50:54 -07:00
Matei Zaharia
4a138403ef
Fix a bug in JAR fetcher that made it always fetch the JAR
2012-09-27 21:32:06 -07:00
Matei Zaharia
009b0e37e7
Added an option to compress blocks in the block store
2012-09-27 18:45:44 -07:00
Matei Zaharia
7bcb08cef5
Renamed storage levels to something cleaner; fixes #223.
2012-09-27 17:50:59 -07:00
Matei Zaharia
920fab23c3
Merge pull request #222 from rxin/dev
...
Added MapPartitionsWithSplitRDD.
2012-09-26 23:16:45 -07:00
Matei Zaharia
ea05fc130b
Updates to standalone cluster, web UI and deploy docs.
2012-09-26 22:54:39 -07:00
Matei Zaharia
1ef4f0fbd2
Allow controlling number of splits in sortByKey.
2012-09-26 19:18:47 -07:00
Reynold Xin
1ad1331a34
Added MapPartitionsWithSplitRDD.
2012-09-26 17:11:28 -07:00
Matei Zaharia
ee71fa49c1
Look for Kryo registrator using context class loader
2012-09-26 14:15:16 -07:00
Matei Zaharia
d71a358c46
Fixed a test that was getting extremely lucky before, and increased the
...
number of samples used for sorting
2012-09-26 00:25:34 -07:00
Matei Zaharia
051785c7e6
Several fixes to sampling issues pointed out by Henry Milner:
...
- takeSample was biased towards earlier partitions
- There were some range errors in takeSample
- SampledRDDs with replacement didn't produce appropriate counts
across partitions (we took exactly frac of each one)
2012-09-25 21:46:58 -07:00
Matei Zaharia
4d3339a3ec
Merge pull request #217 from rxin/dev
...
Added a method to RDD to expose the ClassManifest.
2012-09-24 23:52:32 -07:00
Reynold Xin
7a4cd92861
Renamed RDD.manifest to RDD.elementClassManifest
2012-09-24 23:42:33 -07:00
Matei Zaharia
296e24b440
Merge pull request #218 from rnpandya/dev
...
Scripts to start Spark under windows
2012-09-24 21:10:31 -07:00
Reynold Xin
348bcbca1f
Added a method to RDD to expose the ClassManifest.
2012-09-24 16:56:27 -07:00
Ravi Pandya
39215357af
Windows command scripts for sbt and run
2012-09-24 15:43:19 -07:00
Matei Zaharia
6eeb379cf8
Fix some test issues
2012-09-24 15:39:58 -07:00
Matei Zaharia
f855e4fad2
Merge pull request #208 from rxin/dev
...
Separated ShuffledRDD into multiple classes.
2012-09-24 12:32:01 -07:00
root
107a5ca879
Make default number of parallel fetches slightly smaller since it doesn't seem to hurt performance much and it will cause slightly less GC.
2012-09-23 06:06:12 +00:00
root
e41cab04ca
Avoid creating an extra buffer when saving a stream of values as DISK_ONLY
2012-09-23 05:56:44 +00:00
Denny
afb7ccc838
HTTP File server fixes.
2012-09-21 10:58:13 -07:00
root
6d28dde370
Rename our toIterator method into asIterator to prevent confusion with the
...
Scala collection one, which often *copies* a collection.
2012-09-21 06:02:55 +00:00
root
a642051ade
Fixed a performance bug in BlockManager that was creating garbage when
...
returning deserialized, in-memory RDDs.
2012-09-21 05:42:21 +00:00
root
8feb5caacd
Fixed an issue with ordering of classloader setup that was causing Java deserializer to break
2012-09-21 05:13:19 +00:00
Reynold Xin
6b5980da79
Set a limited number of retries in standalone deploy mode.
2012-09-19 15:41:56 -07:00
Reynold Xin
397d3816e1
Separated ShuffledRDD into multiple classes: RepartitionShuffledRDD,
...
ShuffledSortedRDD, and ShuffledAggregatedRDD.
2012-09-19 12:31:45 -07:00
Denny
ca64d16a2d
When a file is downloaded, make it executable. That's necessary for scripts (e.g. in Shark).
2012-09-17 10:08:37 -07:00
Matei Zaharia
840cbcf849
Change default serializer to Java.. it had accidentally become Kryo.
2012-09-13 17:19:26 -07:00
Matei Zaharia
b4dfa25c8a
Store shuffle map outputs as DISK_ONLY
2012-09-12 16:05:57 -07:00
Matei Zaharia
2d761e3353
Ported performance and FT improvements from latest streaming work
2012-09-12 14:54:40 -07:00
Matei Zaharia
9b4cd1648b
Fix bugs with Connection's shutdown callback failing to get its address
2012-09-12 14:54:14 -07:00
Matei Zaharia
9199775d41
Wait for Akka to really shut down in SparkEnv.stop()
2012-09-12 14:50:37 -07:00
Denny
5e4076e3f2
Merge branch 'dev' into feature/fileserver
...
Conflicts:
core/src/main/scala/spark/SparkContext.scala
2012-09-11 16:57:17 -07:00
Denny
77873d2c8e
Formatting
2012-09-11 16:51:46 -07:00
Denny
24b9b37314
Subclass URLClassLoader instead of using reflection
2012-09-11 16:51:08 -07:00
Denny
31c53e917d
Use stageId as index for fileSet caches.
2012-09-11 16:10:45 -07:00
Matei Zaharia
943df48348
Merge branch 'dev' of github.com:mesos/spark into dev
2012-09-11 16:00:37 -07:00
Matei Zaharia
6d7f907e73
Manually merge pull request #175 by Imran Rashid
2012-09-11 16:00:06 -07:00
Reynold Xin
7af7c79ce5
Updated the logError call from the previous commit to conform to
...
logError API.
2012-09-11 14:32:24 -07:00
Reynold Xin
38b9119c96
Log entire exception (including stack trace) in BlockManagerWorker.
2012-09-11 11:31:35 -07:00
Denny
4d3471dd07
Fix serialization bugs and added local cluster tests
2012-09-10 15:39:58 -07:00
Tathagata Das
c63a606458
Made NewHadoopRDD broadcast its job configuration (same as HadoopRDD).
2012-09-10 19:51:27 +00:00
Denny
b864c36a30
Dynamically adding jar files and caching fileSets.
2012-09-10 12:49:09 -07:00
Denny
f275fb07da
General FileServer
...
A general fileserver for both JARs and regular files.
2012-09-10 12:48:59 -07:00
Matei Zaharia
a13780670d
Added a unit test for local-cluster mode and simplified some of the code involved in that
2012-09-10 12:48:58 -07:00
Denny
f2ac55840c
Add shutdown hook to Executor Runner and execute code to shutdown local cluster in Scheduler Backend
2012-09-10 12:48:58 -07:00
Denny
9ead8ab14e
Set SPARK_LAUNCH_WITH_SCALA=0 in Executor Runner
2012-09-10 12:48:58 -07:00
Denny
8bb3c73977
Renamed spark-cluster to spark-local.
2012-09-10 12:48:58 -07:00
Denny
a367c20f49
Fix wrong counting
2012-09-10 12:48:57 -07:00
Denny
93fe331e6d
Delete old DeployUtils.
2012-09-10 12:48:57 -07:00
Denny
cf074f9c96
Renamed class.
2012-09-10 12:48:57 -07:00
Denny
3749f94184
Start a standalone cluster locally.
2012-09-10 12:48:57 -07:00
Matei Zaharia
995982b3c9
Added a unit test for local-cluster mode and simplified some of the code involved in that
2012-09-07 17:08:36 -07:00
Matei Zaharia
8d2fcc2832
Merge pull request #189 from dennybritz/feature/localcluster
...
Simulating a Spark standalone cluster locally
2012-09-07 15:43:43 -07:00
Denny
7ff9311add
Add shutdown hook to Executor Runner and execute code to shutdown local cluster in Scheduler Backend
2012-09-07 14:09:12 -07:00
Denny
4e7b264cf7
Set SPARK_LAUNCH_WITH_SCALA=0 in Executor Runner
2012-09-07 11:39:44 -07:00
haoyuan
db08a362aa
commit opt for grep scalability test.
2012-09-07 02:17:52 +00:00
root
c2da64409a
Randomize the order of block fetches in getMultiple
2012-09-06 23:16:26 +00:00
root
9ef90c95f4
Bug fix
2012-09-06 00:43:46 +00:00
root
2fa6d999fd
Tuning Akka more
2012-09-06 00:16:39 +00:00
Denny
886183e591
Renamed spark-cluster to spark-local.
2012-09-05 17:10:54 -07:00
root
215544820f
Serialize map output locations more efficiently, and only once, in MapOutputTracker
2012-09-05 23:54:04 +00:00
root
dc68febdce
Use Spark's closure serializer for the ShuffleMapTask cache
2012-09-05 23:06:59 +00:00
Reynold Xin
c308fbcb79
Removed cache add/remove log messages from CacheTracker.
...
Added log messages on BlockManagerMaster to reflect block add/remove.
Also did some minor cleanup of storage package code.
2012-09-05 15:59:48 -07:00
root
ed937a821f
Merge branch 'dev' of github.com:radlab/spark into dev
2012-09-05 22:26:49 +00:00
root
1d6b36d3c3
Further tuning for network performance
2012-09-05 22:26:37 +00:00
root
3fa0d7f0c9
Serialize BlockRDD more efficiently
2012-09-05 08:28:15 +00:00
root
4a5d0d249e
Merge branch 'dev' of github.com:radlab/spark into dev
2012-09-05 08:23:09 +00:00
root
efc7668d16
Allow serializing HttpBroadcast through Kryo
2012-09-05 08:22:57 +00:00
root
75487b2f5a
Broadcast the JobConf in HadoopRDD to reduce task sizes
2012-09-05 08:14:50 +00:00
root
b7ad291ac5
Tuning Akka for more connections
2012-09-05 07:08:07 +00:00
root
fc186dc18a
Merge branch 'dev' of github.com:radlab/spark into dev
2012-09-05 05:53:18 +00:00
root
4ea032a142
Some changes to make important log output visible even if we set the logging to WARNING
2012-09-05 05:53:07 +00:00
Denny
babbca0a2f
Fix wrong counting
2012-09-04 22:04:18 -07:00
Denny
9326509f66
Delete old DeployUtils.
2012-09-04 21:15:23 -07:00
Denny
1588d4dbe6
Renamed class.
2012-09-04 21:13:25 -07:00
Denny
22dde6e020
Start a standalone cluster locally.
2012-09-04 20:56:30 -07:00
Tathagata Das
7c09ad0e04
Changed DStream member access permissions from private to protected. Updated StateDStream to checkpoint RDDs and forget lineage.
2012-09-04 19:11:49 -07:00
Matei Zaharia
a842c63044
Minor formatting fixes
2012-09-03 16:24:00 -07:00
Tathagata Das
b8e9e8ea78
Merge branch 'dev' of github.com:radlab/spark into dev
2012-09-02 02:35:32 -07:00
root
ceabf71257
tweaks
2012-09-01 21:52:42 +00:00
root
6025889be0
More raw network receiver programs
2012-09-01 20:51:07 +00:00
Harvey
3076b038f4
Start fetching a remote block when a received remote block has been passed
...
to the reduce function
2012-09-01 12:01:35 -07:00
Matei Zaharia
f84d2bbe55
Bug fixes to RateLimitedOutputStream
2012-09-01 00:31:15 -07:00
Matei Zaharia
44758aa8e2
First work towards a RawInputDStream and a sender program for it.
2012-09-01 00:17:59 -07:00
root
c42e7ac282
More block manager fixes
2012-09-01 04:31:11 +00:00
Matei Zaharia
389fb4cc54
End runJob() with a SparkException when a task fails too many times in
...
one of the cluster schedulers.
2012-08-31 17:47:43 -07:00
root
113277549c
Really fixed the replication-3 issue. The problem was a few buffers not being rewound.
2012-08-31 05:39:35 +00:00
Mosharaf Chowdhury
31ffe8d528
Synchronization bug fix in broadcast implementations
2012-08-30 22:26:43 -07:00
Matei Zaharia
101ae493e2
Replicate serialized blocks properly, without sharing a ByteBuffer.
2012-08-30 22:24:14 -07:00
Mosharaf Chowdhury
3883532545
Bug fix. Fixed log messages. Updated BroadcastTest example to have iterations.
2012-08-30 21:43:00 -07:00
Matei Zaharia
a480dec6b2
Deserialize multi-get results in the caller's thread. This fixes an
...
issue with shared buffers in the KryoSerializer.
2012-08-30 20:01:06 -07:00
Matei Zaharia
1b3e3352eb
Deserialize multi-get results in the caller's thread. This fixes an
...
issue with shared buffers with the KryoSerializer.
2012-08-30 17:59:25 -07:00
root
c4366eb764
Fixes to ShuffleFetcher
2012-08-31 00:34:24 +00:00
Reynold Xin
5945bcdcc5
Added a new flag in Aggregator to indicate applying map side combiners.
2012-08-29 23:32:08 -07:00
Reynold Xin
c68e820b2a
Merge branch 'dev' of github.com:mesos/spark into dev
2012-08-29 23:01:19 -07:00
Reynold Xin
940869dfda
Disable running combiners on map tasks when mergeCombiners function is
...
not specified by the user.
2012-08-29 23:00:02 -07:00
Tathagata Das
4db3a96766
Made minor changes to reduce compilation errors in Eclipse. Twirl stuff still does not compile in Eclipse.
2012-08-29 13:04:01 -07:00
Matei Zaharia
bf2e9cb08e
Fault tolerance and block store fixes discovered through streaming tests.
2012-08-27 23:07:50 -07:00
Matei Zaharia
17af2df0cd
Log levels
2012-08-27 23:07:32 -07:00
Matei Zaharia
b4a2214218
More fault tolerance fixes to catch lost tasks
2012-08-27 22:49:29 -07:00
Reynold Xin
3a6a95dc24
Removed the deserialization cache for ShuffleMapTask because it was
...
causing concurrency problems (some variables in Shark get set to null).
The cost of task deserialization on slaves is trivial compared with the
execution time of the task anyway.
2012-08-27 22:33:15 -07:00
Josh Rosen
bff6a46359
Add pipe(), saveAsTextFile(), sc.union() to Python API.
2012-08-27 00:24:47 -07:00
Josh Rosen
200d248dcc
Simplify Python worker; pipeline the map step of partitionBy().
2012-08-27 00:24:39 -07:00
Josh Rosen
f79a1e4d2a
Add broadcast variables to Python API.
2012-08-27 00:16:47 -07:00
Matei Zaharia
b914cd0dfa
Serialize generation correctly in ShuffleMapTask
2012-08-26 20:07:59 -07:00
Matei Zaharia
69c2ab0408
logging
2012-08-26 20:00:58 -07:00
Matei Zaharia
117e3f8c86
Fix a bug that was causing FetchFailedException not to be thrown
2012-08-26 19:52:56 -07:00
Matei Zaharia
3c9c44a8d3
More helpful log messages
2012-08-26 19:37:43 -07:00
Matei Zaharia
26dfd20c9a
Detect disconnected slaves in StandaloneScheduler
2012-08-26 18:56:56 -07:00
Matei Zaharia
29e83f39e9
Fix replication with MEMORY_ONLY_DESER_2
2012-08-26 18:16:25 -07:00
Matei Zaharia
06ef7c3d1b
Less debug info
2012-08-26 16:29:20 -07:00
Matei Zaharia
741899b21e
Fix sendMessageReliablySync
2012-08-26 16:26:06 -07:00
Matei Zaharia
5a8015d2db
Merge remote-tracking branch 'public/dev' into dev
2012-08-24 16:11:44 -07:00
Matei Zaharia
deedb9e7b7
Fix further issues with tests and broadcast.
...
The broadcast fix is to store values as MEMORY_ONLY_DESER instead of
MEMORY_ONLY, which will save substantial time on serialization.
2012-08-23 20:31:49 -07:00
Matei Zaharia
59b831b9d1
Fixed test failures due to broadcast not stopping correctly
2012-08-23 19:59:55 -07:00
Matei Zaharia
7310a6f499
Merge pull request #147 from mosharaf/dev
...
Broadcast refactoring/cleaning up
2012-08-23 19:38:28 -07:00
Josh Rosen
607b53abfc
Use numpy in Python k-means example.
2012-08-22 00:43:55 -07:00
Josh Rosen
fd94e5443c
Use only cPickle for serialization in Python API.
...
Objects serialized with JSON can be compared for equality, but JSON can be slow
to serialize and only supports a limited range of data types.
2012-08-21 14:01:27 -07:00
Matei Zaharia
25a6a39e6d
Added other SparkContext constructors to JavaSparkContext
2012-08-19 18:59:16 -07:00
Josh Rosen
886b39de55
Add Python API.
2012-08-18 22:33:51 -07:00
Shivaram Venkataraman
1ea269110c
Move object size and pointer size initialization into a function to enable unit-testing
2012-08-13 13:31:45 -07:00
Shivaram Venkataraman
44661df9cc
If spark.test.useCompressedOops is set, use that to infer compressed oops
...
setting. This is useful to get a deterministic test case
2012-08-13 13:31:39 -07:00
Shivaram Venkataraman
0dd8fe73ba
Use HotSpotDiagnosticMXBean to check whether CompressedOops are in use
2012-08-13 13:31:29 -07:00
Shivaram Venkataraman
80104ce1da
Add link to Java wiki which specifies what changes with compressed oops
2012-08-13 13:31:21 -07:00
Shivaram Venkataraman
00ab5490b3
Changes to make size estimator more accurate. Fixes object size, pointer size
...
according to architecture and also aligns objects and arrays when computing
instance sizes. Verified using Eclipse Memory Analysis Tool (MAT)
2012-08-13 13:31:11 -07:00
Matei Zaharia
6ae3c375a9
Renamed apply() to call() in Java API and allowed it to throw Exceptions
2012-08-12 23:10:19 +02:00
Matei Zaharia
0141879c40
Use Promises instead of having a Future wait on a thread in
...
ConnectionManager.
2012-08-12 22:16:32 +02:00
Matei Zaharia
845a870242
Return remotely fetched blocks in a pipelined fashion from BlockManager
2012-08-12 20:01:38 +02:00
Matei Zaharia
e17ed9a21d
Switch to Akka futures in connection manager.
...
It's still not good because each Future ends up waiting on a lock, but
it seems to work better than Scala Actors, and more importantly it
allows us to use onComplete and other listeners on futures.
2012-08-12 19:40:37 +02:00
Matei Zaharia
ad8a7612a4
Changed multi-get method in BlockManager to return an iterator
2012-08-12 19:18:01 +02:00
Matei Zaharia
3c94e5c188
Merge pull request #168 from shivaram/dev
...
Use JavaConversion to get a scala iterator
2012-08-10 00:57:33 -07:00
Matei Zaharia
e463e7a333
Merge pull request #167 from JoshRosen/piped-rdd-fixes
...
Detect non-zero exit status from PipedRDD process
2012-08-10 00:56:42 -07:00
Josh Rosen
59c22fb444
Print exit status in PipedRDD failure exception.
2012-08-10 00:33:56 -07:00
Shivaram Venkataraman
1803cce692
Use an implicit conversion to get the scala iterator
2012-08-08 14:31:04 -07:00
Shivaram Venkataraman
674fcf56bf
Use JavaConversion to get a scala iterator
2012-08-08 14:10:23 -07:00
Shivaram Venkataraman
f4aaec7a48
Avoid a copy in ShuffleMapTask by creating an iterator that will be used by the
...
block manager.
2012-08-08 00:47:02 -07:00
Mosharaf Chowdhury
d821dd3ccc
BroadcastManager is a class now (replaced Broadcast object)
2012-08-05 01:10:51 -07:00
Mosharaf Chowdhury
b4804119f9
Merge remote-tracking branch 'upstream/dev' into dev
2012-08-04 20:42:12 -07:00
Matei Zaharia
88b016db2a
Merge pull request #160 from dennybritz/clusterscripts
...
Standalone cluster scripts
2012-08-04 17:45:20 -07:00
Mosharaf Chowdhury
1b0534af8f
Merge branch 'dev' into bc-bm
2012-08-04 00:30:08 -07:00
Mosharaf Chowdhury
d11b457e67
Merge remote-tracking branch 'upstream/dev' into dev
2012-08-04 00:28:10 -07:00
Mosharaf Chowdhury
24b7eb872c
Bug fixed. Broadcast now works with BlockManager.
2012-08-04 00:27:28 -07:00
Matei Zaharia
6601a6212b
Added a unit test for cross-partition balancing in sort, and changes to
...
RangePartitioner to make it pass. It turns out that the first partition
was always kind of small due to how we picked partition boundaries.
2012-08-03 16:40:45 -04:00
Harvey
1170de3757
Fix for partitioning when sorting in descending order
2012-08-03 16:40:38 -04:00
Paul Cavallaro
d05c0f97ca
Logging Throwables in Info and Debug
...
Logging Throwables in logInfo and logDebug instead of swallowing them.
Conflicts:
core/src/main/scala/spark/Logging.scala
2012-08-03 16:40:21 -04:00
Denny
0008994044
merged dev branch
2012-08-02 16:00:33 -07:00
Denny
53008c2d8a
Settings variables and bugfix for stop script.
2012-08-02 15:59:39 -07:00
Matei Zaharia
71a958b0b7
Merge branch 'dev' of github.com:mesos/spark into dev
...
Conflicts:
project/SparkBuild.scala
2012-08-02 17:23:13 -04:00
Denny
7312a5c30f
Use spray's implicit Marshaller for Futures.
2012-08-02 14:11:27 -07:00
Denny
ba7e30fb5e
Mostly stylistic changes.
2012-08-02 13:55:09 -07:00
Shivaram Venkataraman
1a07bb9ba4
Avoid an extra partition copy by passing an iterator to blockManager.put
2012-08-02 12:22:33 -07:00
Shivaram Venkataraman
6790908b11
Use maxMemory to better estimate memory available for BlockManager cache
2012-08-02 12:05:05 -07:00
Denny
863c31b7c1
Moved resources into static folder
2012-08-02 09:48:36 -07:00
Tathagata Das
1c0aeee960
Merge branch 'dev' of github.com:radlab/spark into dev
2012-08-01 22:11:41 -07:00
Tathagata Das
3be54c2a8a
1. Refactored SparkStreamContext, Scheduler, InputRDS, FileInputRDS and a few other files.
...
2. Modified Time class to represent milliseconds (long) directly, instead of LongTime.
3. Added new files QueueInputRDS, RecurringTimer, etc.
4. Added RDDSuite as the skeleton for test cases.
5. Added two examples in spark.streaming.examples.
6. Removed all past examples and a few unnecessary files. Moved a number of files to spark.streaming.util.
2012-08-01 22:09:27 -07:00
Denny
0ee44c225e
Spark standalone mode cluster scripts.
...
Heavily inspired by Hadoop cluster scripts ;-)
2012-08-01 20:38:52 -07:00
Denny
6c670c37dd
Webui improvements.
2012-08-01 19:47:57 -07:00
Denny
1b29e90a79
merge dev branch
2012-08-01 14:06:09 -07:00
Denny
011220fa55
Compact job page.
2012-08-01 11:26:45 -07:00
Denny
7a295fee96
Spark WebUI Implementation.
2012-08-01 11:01:09 -07:00