Commit graph

755 commits

Author SHA1 Message Date
Stephen Haberman 8dc06069fe Rename RDD.tupleBy to keyBy. 2013-01-06 15:21:45 -06:00
Matei Zaharia 8fd3a70c18 Add PairRDD.keys() and values() to Java API 2013-01-05 22:46:45 -05:00
Matei Zaharia b1663752c6 Merge pull request #351 from stephenh/values
Add PairRDDFunctions.keys and values.
2013-01-05 19:15:54 -08:00
Matei Zaharia 0982572519 Add methods called just 'accumulator' for int/double in Java API 2013-01-05 22:11:28 -05:00
Matei Zaharia 86af64b0a6 Fix Accumulators in Java, and add a test for them 2013-01-05 20:55:17 -05:00
Stephen Haberman 1fdb6946b5 Add RDD.tupleBy. 2013-01-05 13:07:59 -06:00
Stephen Haberman 6a0db3b449 Fix typo. 2013-01-05 12:56:17 -06:00
Stephen Haberman f4e6b9361f Add RDD.collect(PartialFunction). 2013-01-05 12:14:08 -06:00
Stephen Haberman 8d57c78c83 Add PairRDDFunctions.keys and values. 2013-01-05 12:04:01 -06:00
Josh Rosen 33beba3965 Change PySpark RDD.take() to not call iterator(). 2013-01-03 14:52:21 -08:00
Josh Rosen b58340dbd9 Rename top-level 'pyspark' directory to 'python' 2013-01-01 15:05:00 -08:00
Josh Rosen 170e451fbd Minor documentation and style fixes for PySpark. 2013-01-01 13:52:14 -08:00
Matei Zaharia 55809fbc6d Merge pull request #349 from woggling/cache-finally
Avoid stalls when computation of cached RDD throws exception
2013-01-01 08:21:33 -08:00
Charles Reiss 58072a7340 Remove some dead comments 2013-01-01 08:07:44 -08:00
Charles Reiss 21636ee4fa Test with exception while computing cached RDD. 2013-01-01 08:07:40 -08:00
Charles Reiss feadaf72f4 Mark key as not loading in CacheTracker even when compute() fails 2013-01-01 07:57:20 -08:00
Josh Rosen f803953998 Raise exception when hashing Java arrays (SPARK-597) 2012-12-31 20:20:11 -08:00
Josh Rosen 59195c68ec Update PySpark for compatibility with TaskContext. 2012-12-29 16:01:03 -08:00
Josh Rosen c5cee53f20 Merge remote-tracking branch 'origin/master' into python-api
Conflicts:
	docs/quick-start.md
2012-12-29 16:00:51 -08:00
Josh Rosen 7ec3595de2 Fix bug (introduced by batching) in PySpark take() 2012-12-28 22:21:16 -08:00
Josh Rosen 397e67103c Change Utils.fetchFile() warning to SparkException. 2012-12-28 17:37:13 -08:00
Josh Rosen d64fa72d2e Add addFile() and addJar() to JavaSparkContext. 2012-12-28 17:00:57 -08:00
Josh Rosen bd237d4a9d Add synchronization to LocalScheduler.updateDependencies(). 2012-12-28 17:00:57 -08:00
Josh Rosen f1bf4f0385 Skip deletion of files in clearFiles().
This fixes an issue where Spark could delete
original files in the current working directory
that were added to the job using addFile().

There was also the potential for addFile() to
overwrite local files, which is addressed by
changing Utils.fetchFile() to log a warning
instead of overwriting a file with new contents.

This is a short-term fix; a better long-term
solution would be to remove the dependence on
storing files in the current working directory,
since we can't change the cwd from Java.
2012-12-28 17:00:57 -08:00
Josh Rosen fbadb1cda5 Mark api.python classes as private; echo Java output to stderr. 2012-12-28 09:06:11 -08:00
Josh Rosen 1dca0c5180 Remove debug output from PythonPartitioner. 2012-12-26 18:23:06 -08:00
Josh Rosen 4608902fb8 Use filesystem to collect RDDs in PySpark.
Passing large volumes of data through Py4J seems
to be slow.  It appears to be faster to write the
data to the local filesystem and read it back from
Python.
2012-12-24 17:20:10 -08:00
Mark Hamstra 903f3518df fall back to filter-map-collect when calling lookup() on an RDD without a partitioner 2012-12-24 13:18:45 -08:00
Mark Hamstra 61be8566e2 Allow distinct() to be called without parentheses when using the default number of splits. 2012-12-24 02:36:47 -08:00
Reynold Xin 60f7338092 Remove the call to close input stream in Kryo serializer. 2012-12-21 15:49:33 -08:00
Matei Zaharia 3334b7c6b5 Merge pull request #341 from rxin/4a3fb06ac2d11125feb08acbbd4df76d1e91b677
Kryo2 update against Spark master
2012-12-21 15:31:23 -08:00
Matei Zaharia 5e51b889fe Merge pull request #327 from rxin/spark-633
Added the ability in block manager to remove blocks.
2012-12-20 11:33:38 -08:00
Reynold Xin 9397c5014e Let the slave notify the master block removal. 2012-12-20 01:37:09 -08:00
Reynold Xin 68c52d80ec Moved BlockManager's IdGenerator into BlockManager object. Removed some
excessive debug messages.
2012-12-19 15:27:23 -08:00
Patrick Wendell bfac06e1f6 SPARK-616: Logging dead workers in Web UI.
This patch keeps track of which workers have died and marks them
as such in the master web UI. It also handles workers which die and
re-register using different actor ID's.
2012-12-17 23:09:05 -08:00
Matei Zaharia b82a6dd2c7 Merge pull request #332 from JoshRosen/spark-607
Add try-finally to handle MapOutputTracker timeouts
2012-12-14 11:41:16 -08:00
Reynold Xin 06f855c24d Merge branch 'spark-633' of github.com:rxin/spark into spark-633 2012-12-14 00:27:24 -08:00
Reynold Xin 8c01295b85 Fixed conflicts from merging Charles' and TD's block manager changes. 2012-12-14 00:26:36 -08:00
Charles Reiss c528932a41 Code review cleanup. 2012-12-13 22:37:16 -08:00
Charles Reiss 0aad42b5e7 Have standalone cluster report exit codes to clients. Addresses SPARK-639. 2012-12-13 22:37:16 -08:00
Reynold Xin 0235667f73 Merge branch 'master' of github.com:mesos/spark into spark-633 2012-12-13 22:33:41 -08:00
Reynold Xin 97434f49b8 Merged TD's block manager refactoring. 2012-12-13 22:32:19 -08:00
Reynold Xin f4a9e1b9be Fixed the broken Java unit test from SPARK-635. 2012-12-13 22:22:12 -08:00
Reynold Xin 41e58a519a Merge branch 'master' of github.com:mesos/spark into spark-633 2012-12-13 22:06:47 -08:00
Josh Rosen cf52d9cade Add try-finally to handle MapOutputTracker timeouts. 2012-12-13 21:53:30 -08:00
Matei Zaharia 05e225f988 Merge pull request #329 from woggling/executor-status-codes
Executor exit status codes
2012-12-13 20:14:10 -08:00
Charles Reiss b054d3b222 ExecutorLostReason -> ExecutorLossReason 2012-12-13 18:44:07 -08:00
Charles Reiss 24d7aa2d15 Extra whitespace in ExecutorExitCode 2012-12-13 18:39:23 -08:00
Reynold Xin dc7d7fc286 Merge branch 'master' of github.com:mesos/spark into spark-633 2012-12-13 16:48:34 -08:00
Reynold Xin 4f076e105e SPARK-635: Pass a TaskContext object to compute() interface and use
that to close Hadoop input stream. Incorporated Matei's command.
2012-12-13 16:41:15 -08:00
Charles Reiss 829206f1a7 Explain slaveLost calls made by StandaloneSchedulerBackend 2012-12-13 16:23:36 -08:00
Charles Reiss a4041dd87f Log duplicate slaveLost() calls in ClusterScheduler. 2012-12-13 16:23:36 -08:00
Charles Reiss fa9df4a45d Normalize executor exit statuses and report them to the user. 2012-12-13 16:23:31 -08:00
Reynold Xin eacb98e900 SPARK-635: Pass a TaskContext object to compute() interface and use that
to close Hadoop input stream.
2012-12-13 15:41:53 -08:00
Josh Rosen 7c9e3d1c21 Return success or failure in BlockStore.remove(). 2012-12-13 15:22:27 -08:00
Reynold Xin 1b7a0451ed Added the ability in block manager to remove blocks. 2012-12-13 00:04:42 -08:00
Charles Reiss 1d8e2e6cff Call slaveLost on executor death for standalone clusters. 2012-12-12 21:15:34 -08:00
Reynold Xin 21b271f5bd Suppress shuffle block updates when a slave node comes back. 2012-12-10 20:36:03 -08:00
Matei Zaharia a1a2daa7ef Merge pull request #317 from woggling/block-manager-heartbeat
Implement block manager heartbeat
2012-12-10 11:03:55 -08:00
Charles Reiss b6b62d774f Decrease BlockManagerMaster logging verbosity 2012-12-10 00:31:55 -08:00
Charles Reiss 5d3e917d09 Use Akka scheduler for BlockManager heart beats.
Adds required ActorSystem argument to BlockManager constructors.
2012-12-10 00:31:50 -08:00
Charles Reiss b53dd28c90 Changed default block manager heartbeat interval to 5 s 2012-12-09 23:03:34 -08:00
Matei Zaharia beb440089e Merge pull request #310 from tomdz/master-mavenized
Maven build setup
2012-12-09 21:40:05 -08:00
Matei Zaharia e1d7cd2276 Search for a non-loopback address in Utils.getLocalIpAddress 2012-12-08 00:33:11 -08:00
Charles Reiss 714c8d32d5 Don't divide by milliseconds by 1000 more. 2012-12-06 18:38:34 -08:00
Charles Reiss 8f0819520c map -> foreach 2012-12-06 18:29:50 -08:00
Charles Reiss 7a033fd795 Make LocalSparkCluster use distinct IPs 2012-12-06 00:03:08 -08:00
Charles Reiss a2a94fdbc7 Tests for block manager heartbeats. 2012-12-05 23:36:05 -08:00
Charles Reiss d21ca010ac Add block manager heart beats.
Renames old message called 'HeartBeat' to 'BlockUpdate'.

The BlockManager periodically sends a heart beat message to the master.
If the manager is currently not registered. The master responds to the
heart beat by indicating whether the BlockManager is currently registered
with the master. Additionally, the master now also responds to block
updates by indicating whether the BlockManager in question is registered.
When the BlockManager detects (by heart beat or failed block update)
that it stopped being registered, it reregisters and sends block
updates for all its blocks.
2012-12-05 23:35:20 -08:00
Charles Reiss c9e54a6755 Track block managers by hostname; handle manager removal. 2012-12-05 23:35:20 -08:00
Charles Reiss 5afa2ee9e9 Actually put millis in _lastSeenMs 2012-12-05 23:35:20 -08:00
Charles Reiss 813ac71459 Don't use bogus port number in notifyADeadHost(). 2012-12-05 23:35:20 -08:00
Josh Rosen cdaa0fad51 Use external addresses in standalone WebUI on EC2. 2012-12-01 18:19:13 -08:00
Thomas Dudziak 84e584fa8c Code review feedback fix 2012-11-28 19:46:06 -08:00
Matei Zaharia f86960cba9 Merge pull request #313 from rxin/pde_size_compress
Added a partition preserving flag to MapPartitionsWithSplitRDD.
2012-11-27 22:39:25 -08:00
Matei Zaharia 3ebd8e1885 Added zip to Java API 2012-11-27 22:38:09 -08:00
Matei Zaharia 27e43abd19 Added a zip() operation for RDDs with the same shape (number of
partitions and number of elements in each partition)
2012-11-27 22:27:47 -08:00
Matei Zaharia f410a111ad Merge branch 'master' of github.com:mesos/spark 2012-11-27 20:51:58 -08:00
Josh Rosen 7d71b9a56a Fix NullPointerException caused by unregistered map outputs. 2012-11-27 20:51:51 -08:00
Matei Zaharia 935c468b71 Merge pull request #311 from woggling/map-output-npe
Fix NullPointerException when map output unregistered from MapOutputTracker twice
2012-11-27 20:50:48 -08:00
Reynold Xin bd6dd1a3a6 Added a partition preserving flag to MapPartitionsWithSplitRDD. 2012-11-27 19:43:30 -08:00
Reynold Xin f24bfd2dd1 For size compression, compress non zero values into non zero values. 2012-11-27 19:20:45 -08:00
Thomas Dudziak 3b643e86bc Updated versions in the pom.xml files to match current master 2012-11-27 17:50:42 -08:00
Charles Reiss cf79de425d Fix NullPointerException when unregistering a map output twice. 2012-11-27 16:12:05 -08:00
Charles Reiss 5fa868b98b Tests for MapOutputTracker. 2012-11-27 16:05:36 -08:00
Thomas Dudziak 69297c64be Addressed code review comments 2012-11-27 15:45:16 -08:00
Thomas Dudziak 811a32257b Added maven and debian build files 2012-11-20 16:19:51 -08:00
Matei Zaharia 3ff6f4bdee Merge pull request #304 from mbautin/configurable_local_ip
SPARK-624: make the default local IP customizable
2012-11-19 13:23:39 -08:00
mbautin 00f4e3ff9c Addressing Matei's comment: SPARK_LOCAL_IP environment variable 2012-11-19 11:52:10 -08:00
Charles Reiss 12c24e786c Set default uncaught exception handler to exit.
Among other things, should prevent OutOfMemoryErrors in some daemon threads
(such as the network manager) from causing a spark executor to enter a state
where it cannot make progress but does not report an error.
2012-11-16 20:12:31 -08:00
mbautin 1f5a7e0e64 SPARK-624: make the default local IP customizable 2012-11-15 13:57:47 -08:00
Matei Zaharia c23a74df0a Use DNS names instead of IP addresses in standalone mode, to allow
matching with data locality hints from storage systems.
2012-11-15 00:10:52 -08:00
Matei Zaharia 173e0354c0 Detect correctly when one has disconnected from a standalone cluster.
SPARK-617 #resolve
2012-11-11 21:06:57 -08:00
root acf8272324 Fix K-means example a little 2012-11-10 23:07:21 -08:00
Tathagata Das 9915989bfa Incorporated Matei's suggestions. Tested with 5 producer(consumer) threads each doing 50k puts (gets), took 15 minutes to run, no errors or deadlocks. 2012-11-09 15:46:15 -08:00
Tathagata Das de00bc63db Fixed deadlock in BlockManager.
1. Changed the lock structure of BlockManager by replacing the 337 coarse-grained locks to use BlockInfo objects as per-block fine-grained locks.
2. Changed the MemoryStore lock structure by making the block putting threads lock on a different object (not the memory store) thus making sure putting threads minimally blocks to the getting treads.
3. Added spark.storage.ThreadingTest to stress test the BlockManager using 5 block producer and 5 block consumer threads.
2012-11-09 14:09:37 -08:00
Matei Zaharia 6607f546cc Added an option to spread out jobs in the standalone mode. 2012-11-08 23:13:12 -08:00
Matei Zaharia 66cbdee941 Fix for connections not being reused (from Josh Rosen) 2012-11-08 09:53:40 -08:00
Imran Rashid 809b2bb1fe fix bug in getting slave id out of mesos 2012-11-08 00:34:28 -08:00
Matei Zaharia bb1bce7924 Various fixes to standalone mode and web UI:
- Don't report a job as finishing multiple times
- Don't show state of workers as LOADING when they're running
- Show start and finish times in web UI
- Sort web UI tables by ID and time by default
2012-11-07 16:49:53 -08:00