Commit graph

2785 commits

Author SHA1 Message Date
Josh Rosen 61569906cc Fix SPARK-978: ClassCastException in PySpark cartesian. 2014-01-23 15:09:19 -08:00
Josh Rosen 0035dbbc81 Fix SPARK-1034: Py4JException on PySpark Cartesian Result 2014-01-23 13:05:59 -08:00
Josh Rosen fad6aacfb0 Merge pull request #406 from eklavya/master
Extending Java API coverage

Hi,

I have added three new methods to JavaRDD.

Please review and merge.
2014-01-23 11:14:15 -08:00
eklavya 60e7457266 fixed ClassTag in mapPartitions 2014-01-23 17:40:36 +05:30
Patrick Wendell a1cd185122 Merge pull request #496 from pwendell/master
Fix bug in worker clean-up in UI

Introduced in d5a96fec (/cc @aarondav).

This should be picked into 0.8 and 0.9 as well. The bug causes old (zombie) workers on a node to not disappear immediately from the UI when a new one registers.
2014-01-22 19:37:29 -08:00
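A minimal Scala sketch of the clean-up pattern described above (the `WorkerInfo`, `workers`, and `registerWorker` names are hypothetical stand-ins, not the actual Master internals): when a worker registers, any dead entry already recorded for the same host and port is evicted immediately instead of lingering in the UI.

```scala
import scala.collection.mutable

// Hypothetical sketch: evict zombie workers at the same address on registration.
case class WorkerInfo(id: String, host: String, port: Int, var alive: Boolean)

class Master {
  private val workers = mutable.HashSet[WorkerInfo]()

  def registerWorker(worker: WorkerInfo): Unit = {
    // Drop stale entries for this host:port so the UI reflects reality at once.
    val stale = workers.filter(w =>
      w.host == worker.host && w.port == worker.port && !w.alive)
    workers --= stale
    workers += worker
  }
}
```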
Patrick Wendell 034dce2a7e Merge pull request #447 from CodingCat/SPARK-1027
fix for SPARK-1027

fix for SPARK-1027  (https://spark-project.atlassian.net/browse/SPARK-1027)

FIXES

1. change sparkHome from String to Option[String] in ApplicationDescription

2. remove the sparkHome parameter from the LaunchExecutor message

3. adjust the affected files
2014-01-22 18:58:02 -08:00
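A small Scala sketch of the Option[String] change (the class and field names are simplified from the actual deploy code): absence of a Spark home becomes explicit instead of being encoded as null or an empty string.

```scala
// Simplified sketch of the ApplicationDescription change.
case class AppDesc(name: String, sparkHome: Option[String])

object Example extends App {
  val desc = AppDesc("TestClient", Some("/opt/spark")) // was a bare String before
  // Consumers choose a fallback explicitly instead of checking for null or "".
  val home = desc.sparkHome.getOrElse(sys.env.getOrElse("SPARK_HOME", "."))
  println(home)
}
```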
Patrick Wendell 6285513147 Fix bug in worker clean-up in UI
Introduced in d5a96fec. This should be picked into 0.8 and 0.9 as well.
2014-01-22 18:19:52 -08:00
CodingCat 2b3c461451 refactor sparkHome to val
clean code
2014-01-22 20:20:46 -05:00
Kay Ousterhout 19da82c50f Fixed bug where task set managers are added to the queue twice
This bug causes a small performance hit, because each task set
manager is presented with every rejected resource offer twice,
but it does not lead to any incorrect functionality.
2014-01-22 09:52:12 -08:00
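A tiny hedged sketch of the fix's shape (the queue and manager types here are hypothetical): guard the enqueue so a task set manager is registered at most once.

```scala
import scala.collection.mutable.ArrayBuffer

// Hypothetical sketch: a duplicate entry means every rejected resource offer
// is replayed against the same manager twice.
class SchedulableQueue {
  private val pending = ArrayBuffer[String]()

  def addTaskSetManager(manager: String): Unit = {
    if (!pending.contains(manager)) pending += manager // enqueue at most once
  }
}
```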
Henry Saputra 90ea9d5a8f Replace code that checks for Option != None with Option.isDefined calls in Scala code.
This hopefully makes the code cleaner.
2014-01-21 23:22:10 -08:00
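For illustration, the before/after shape of this clean-up in Scala:

```scala
val maybeHome: Option[String] = sys.env.get("SPARK_HOME")

// Before: comparing an Option against None.
if (maybeHome != None) println("home set")

// After: the idiomatic check states the intent directly.
if (maybeHome.isDefined) println("home set")
```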
Patrick Wendell a9bcc980b6 Style clean-up 2014-01-21 00:05:28 -08:00
Patrick Wendell a917a87e02 Adding small code comment 2014-01-20 23:11:45 -08:00
Patrick Wendell d46df96de3 Avoid matching attempt files in the checkpoint 2014-01-20 20:03:23 -08:00
Patrick Wendell de526ad527 Remove shuffle files if they are still present on a machine. 2014-01-20 19:11:22 -08:00
Patrick Wendell f84400e86c Fixing speculation bug 2014-01-20 19:05:03 -08:00
Patrick Wendell c324ac10ee Force use of LZF when spilling data 2014-01-20 19:00:48 -08:00
Patrick Wendell 1b299142a8 Bug fix for reporting of spill output 2014-01-20 18:34:00 -08:00
Patrick Wendell 54867e9566 Minor fixes 2014-01-20 18:33:21 -08:00
Patrick Wendell cdb003e376 Removing docs on akka options 2014-01-20 16:40:58 -08:00
CodingCat 29f4b6a2d9 fix for SPARK-1027
change TestClient & Worker to Some("xxx")

kill manager if it is started

remove unnecessary .get calls when fetching "SPARK_HOME" values
2014-01-20 02:50:30 -05:00
CodingCat f9a95d6736 executor creation failure should not make the worker restart 2014-01-20 02:50:30 -05:00
Thomas Graves dd56b2125e update comment 2014-01-19 12:21:39 -06:00
Thomas Graves ceb79a3931 Only log an error on a missing jar so the Spark examples can still run. 2014-01-19 12:16:58 -06:00
Yinan Li 584323c6b1 Addressed comments from Reynold
Signed-off-by: Yinan Li <liyinan926@gmail.com>
2014-01-18 21:28:17 -08:00
Patrick Wendell 73dfd42fba Merge pull request #437 from mridulm/master
Minor api usability changes

- Expose checkpoint directory - since it is autogenerated now
- null check for jars
- Expose SparkHadoopUtil: so that configuration creation is abstracted even from user code, avoiding duplication of functionality already in Spark.
2014-01-18 16:23:56 -08:00
Patrick Wendell bf5699543b Merge pull request #462 from mateiz/conf-file-fix
Remove Typesafe Config usage and conf files to fix nested property names

With Typesafe Config we had the subtle problem of no longer allowing
nested property names, which are used for a few of our properties:
http://apache-spark-developers-list.1001551.n3.nabble.com/Config-properties-broken-in-master-td208.html

This PR is for branch 0.9 but should be added into master too.
(cherry picked from commit 34e911ce9a)

Signed-off-by: Patrick Wendell <pwendell@gmail.com>
2014-01-18 16:20:00 -08:00
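The clash is easy to reproduce: in a tree-structured config model a key cannot be both a leaf value and the parent of another key (the property names below are real Spark settings of this era, used for illustration).

```scala
import java.util.Properties

// Fine for flat java.util.Properties, since keys are just strings:
val props = new Properties()
props.setProperty("spark.speculation", "true")         // a leaf value...
props.setProperty("spark.speculation.interval", "100") // ...and also a prefix

// A nested-tree parser such as Typesafe Config must treat "spark.speculation"
// as both an object and a value here, which it rejects.
```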
Yinan Li fd833e7ab1 Allow files added through SparkContext.addFile() to be overwritten
This is useful when a file needs to be refreshed and re-downloaded
by the executors periodically.

Signed-off-by: Yinan Li <liyinan926@gmail.com>
2014-01-18 15:26:59 -08:00
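A usage sketch, assuming the overwrite behavior is gated by the spark.files.overwrite setting and using a hypothetical file path:

```scala
import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf()
  .setMaster("local[2]")
  .setAppName("addfile-refresh")
  .set("spark.files.overwrite", "true") // allow replacing an already-added file

val sc = new SparkContext(conf)
sc.addFile("/tmp/lookup-table.txt") // hypothetical path; executors download it
// ...later, after the file is regenerated, add it again to push the new copy:
sc.addFile("/tmp/lookup-table.txt")
sc.stop()
```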
Patrick Wendell 5316bcac3c Use renamed shuffle spill config in CoGroupedRDD.scala 2014-01-18 11:58:42 -08:00
Mridul Muralidharan b690e11d9c Address review comment 2014-01-17 18:28:55 +05:30
Patrick Wendell d4fd89e3c8 Merge pull request #438 from ScrapCodes/clone-records-java-api
Clone records java api
2014-01-16 23:17:30 -08:00
Prashant Sharma fcb4fc653d adding clone records field to equivalent java apis 2014-01-17 11:16:03 +05:30
Mridul Muralidharan edd82c58a2 Use method, not variable 2014-01-16 17:26:42 +05:30
Mridul Muralidharan 1a0da89277 Address review comments 2014-01-16 17:23:25 +05:30
Reynold Xin c06a307ca2 Merge pull request #445 from kayousterhout/exec_lost
Fail rather than hanging if a task crashes the JVM.

Prior to this commit, if a task crashes the JVM, the task (and
all other tasks running on that executor) is marked as KILLED rather
than FAILED.  As a result, the TaskSetManager will retry the task
indefinitely rather than failing the job after maxFailures. Eventually,
this makes the job hang, because the Standalone Scheduler removes
the application after 10 workers have failed, and then the app is left
in a state where it's disconnected from the master and waiting to reconnect.
This commit fixes that problem by marking tasks as FAILED rather than
KILLED when an executor is lost.

The downside of this commit is that if task A fails because another
task running on the same executor caused the VM to crash, the failure
will incorrectly be counted as a failure of task A. This should not
be an issue because we typically set maxFailures to 3, and it is
unlikely that a task will be co-located with a JVM-crashing task
multiple times.
2014-01-15 23:47:25 -08:00
Kay Ousterhout a268d63411 Fail rather than hanging if a task crashes the JVM.
Prior to this commit, if a task crashes the JVM, the task (and
all other tasks running on that executor) is marked as KILLED rather
than FAILED.  As a result, the TaskSetManager will retry the task
indefinitely rather than failing the job after maxFailures. This
commit fixes that problem by marking tasks as FAILED rather than
KILLED when an executor is lost.

The downside of this commit is that if task A fails because another
task running on the same executor caused the VM to crash, the failure
will incorrectly be counted as a failure of task A. This should not
be an issue because we typically set maxFailures to 3, and it is
unlikely that a task will be co-located with a JVM-crashing task
multiple times.
2014-01-15 16:03:40 -08:00
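A hedged sketch of the state change (the types below are simplified stand-ins, not the actual scheduler enums): only deliberate kills stay KILLED, while executor loss now produces FAILED attempts that count toward maxFailures.

```scala
// Simplified sketch, not the actual scheduler code.
sealed trait TaskEndReason
case object Killed extends TaskEndReason // retried without counting failures
case object Failed extends TaskEndReason // counts toward maxFailures

def reasonForLostExecutor(killedByUser: Boolean): TaskEndReason =
  if (killedByUser) Killed // an intentional kill should not fail the job
  else Failed              // a crashed executor now counts as a task failure
```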
Patrick Wendell 59f475c79f Merge pull request #442 from pwendell/standalone
Workers should use working directory as spark home if it's not specified

If users don't set SPARK_HOME in their environment file when launching an application, the standalone cluster should default to the spark home of the worker.
2014-01-15 13:55:14 -08:00
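Sketched in Scala with hypothetical names, the fallback amounts to:

```scala
// If the application did not specify a Spark home, the worker falls back to
// its own working directory (names hypothetical).
val appSparkHome: Option[String] = None // nothing set by the application
val workerWorkingDir = new java.io.File(".").getAbsolutePath
val effectiveSparkHome = appSparkHome.getOrElse(workerWorkingDir)
```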
Patrick Wendell 00a3f7eec5 Workers should use working directory as spark home if it's not specified 2014-01-15 11:05:36 -08:00
Mridul Muralidharan 0aea33d39e Expose method and class so that we can use them from user code (particularly since the checkpoint directory is autogenerated now) 2014-01-15 12:44:44 +05:30
Tathagata Das 0e15bd7827 Merge remote-tracking branch 'apache/master' into filestream-fix 2014-01-14 22:21:20 -08:00
Tathagata Das 1f4718c480 Changed SparkConf to not be serializable. And also fixed unit-test log paths in log4j.properties of external modules. 2014-01-14 22:20:14 -08:00
Reynold Xin 74b46acdc5 Merge pull request #428 from pwendell/writeable-objects
Don't clone records for text files
2014-01-14 14:59:13 -08:00
Reynold Xin d601a76d1f Merge pull request #427 from pwendell/deprecate-aggregator
Deprecate rather than remove old combineValuesByKey function
2014-01-14 14:52:24 -08:00
Patrick Wendell b1b22b7a13 Style fix 2014-01-14 13:56:27 -08:00
Patrick Wendell 8ea2cd56e4 Adding fix covering combineCombinersByKey as well 2014-01-14 13:52:23 -08:00
Patrick Wendell b683608c9f Deprecate rather than remove old combineValuesByKey function 2014-01-14 12:15:10 -08:00
Patrick Wendell 6f965a46a9 Don't clone records for text files 2014-01-14 11:57:53 -08:00
Reynold Xin f12e506c9e Fixed a typo in JavaSparkContext's API doc. 2014-01-14 11:42:28 -08:00
Reynold Xin 1b5623fd0b Maintain Serializable API compatibility by reverting back to java.io.Serializable for Broadcast and Accumulator. 2014-01-14 11:30:59 -08:00
Reynold Xin 55db77416b Added license header for package.scala in the Java API package. 2014-01-14 11:20:12 -08:00
Reynold Xin f8c12e9457 Added package doc for the Java API. 2014-01-14 11:16:25 -08:00
Reynold Xin 6a12b9ebc5 Updated API doc for Accumulable and Accumulator. 2014-01-14 11:16:08 -08:00
Reynold Xin 71b3007dbd Broadcast variable visibility change & doc update.
Note that previously the Broadcast class was accidentally marked as private[spark]. It needs to be public
for broadcast variables to work. Also exposing the broadcast variable id.
2014-01-14 11:15:21 -08:00
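Why the visibility matters: user code holds Broadcast[T] values directly, as in this small usage sketch (local-mode setup for illustration):

```scala
import org.apache.spark.{SparkConf, SparkContext}

val sc = new SparkContext(
  new SparkConf().setMaster("local[2]").setAppName("broadcast-demo"))

val lookup = sc.broadcast(Map("a" -> 1, "b" -> 2))
println(lookup.id) // the newly exposed broadcast variable id

val mapped = sc.parallelize(Seq("a", "b", "a")).map(k => lookup.value(k)).collect()
println(mapped.mkString(","))
sc.stop()
```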
Patrick Wendell 23034798d7 Add missing header files 2014-01-14 01:17:13 -08:00
Saurabh Rawat 1442cd5d50 Modifications as suggested in PR feedback:
- more variants of mapPartitions added to JavaRDDLike
- move setGenerator to JavaRDDLike
- clean up
2014-01-14 14:19:02 +05:30
Patrick Wendell 0984647aae Enable compression by default for spills 2014-01-13 23:25:25 -08:00
Patrick Wendell 4a805aff5e Merge pull request #367 from ankurdave/graphx
GraphX: Unifying Graphs and Tables

GraphX extends Spark's distributed fault-tolerant collections API and interactive console with a new graph API which leverages recent advances in graph systems (e.g., [GraphLab](http://graphlab.org)) to enable users to easily and interactively build, transform, and reason about graph structured data at scale. See http://amplab.github.io/graphx/.

Thanks to @jegonzal, @rxin, @ankurdave, @dcrankshaw, @jianpingjwang, @amatsukawa, @kellrott, and @adamnovak.

Tasks left:
- [x] Graph-level uncache
- [x] Uncache previous iterations in Pregel
- [x] ~~Uncache previous iterations in GraphLab~~ (postponed to post-release)
- [x] Describe GC issue with GraphLab
- [ ] Write `docs/graphx-programming-guide.md`
- [x] Mention future Bagel support in docs
- [ ] Section on caching/uncaching in docs: As with Spark, cache something that is used more than once. In an iterative algorithm, try to cache and force (i.e., materialize) something every iteration, then uncache the cached things that depended on the newly materialized RDD but that won't be referenced again (see the caching sketch after this list).
- [x] Undo modifications to core collections and instead copy them to org.apache.spark.graphx
- [x] Make Graph serializable to work around capture in Spark shell
- [x] Rename graph -> graphx in package name and subproject
- [x] Remove standalone PageRank
- [x] ~~Fix amplab/graphx#52 by checking `iter.hasNext`~~
2014-01-13 22:58:38 -08:00
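The caching discipline mentioned in the checklist, sketched against a generic iterative Spark job (not GraphX internals; the damping-style update is just a placeholder):

```scala
import org.apache.spark.rdd.RDD

// Cache and force each iteration's result, then unpersist its predecessor
// once nothing will reference it again.
def iterate(initial: RDD[Double], steps: Int): RDD[Double] = {
  var current = initial.cache()
  current.count() // force materialization
  for (_ <- 1 to steps) {
    val next = current.map(v => v * 0.85 + 0.15).cache()
    next.count()        // materialize next before dropping its parent
    current.unpersist() // safe now: next no longer needs current's blocks
    current = next
  }
  current
}
```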
Patrick Wendell 945fe7a37e Merge pull request #408 from pwendell/external-serializers
Improvements to external sorting

1. Adds the option of compressing outputs.
2. Adds batching to the serialization to prevent OOM on the read side.
3. Slight renaming of config options.
4. Use Spark's buffer size for reads in addition to writes.
2014-01-13 22:56:12 -08:00
Patrick Wendell 68641bce61 Merge pull request #413 from rxin/scaladoc
Adjusted visibility of various components and documentation for 0.9.0 release.
2014-01-13 22:54:13 -08:00
Patrick Wendell 0ca0d4d657 Merge pull request #401 from andrewor14/master
External sorting - Add number of bytes spilled to Web UI

Additionally, update test suite for external sorting to induce spilling.
2014-01-13 22:32:21 -08:00
Patrick Wendell 08b9fec93d Merge pull request #409 from tdas/unpersist
Automatically unpersisting RDDs that have been cleaned up from DStreams

Earlier, RDDs generated by DStreams were forgotten but never unpersisted; the system relied on the BlockManager's natural LRU to drop the data. The cleaner.ttl was a blunt hammer for cleaning up RDDs, since it has to be set separately and very conservatively (a few minutes at best). Automatic unpersisting lets the system handle this itself, which reduces memory usage. As a side effect it also improves GC performance, since fewer objects are kept in memory. In fact, for some workloads it may allow RDDs to be cached deserialized, which speeds up processing without too much GC overhead.

This is disabled by default. To enable it, set the configuration spark.streaming.unpersist to true. In a future release, it will be enabled by default.

Also, reduced the sleep time in TaskSchedulerImpl.stop() from 5 seconds to 1 second. From my conversation with Matei, there does not seem to be any good reason for the sleep that lets messages be sent out to be so long.
2014-01-13 22:29:03 -08:00
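Enabling the behavior, per the description above:

```scala
import org.apache.spark.SparkConf

val conf = new SparkConf()
  .setAppName("streaming-app")
  .set("spark.streaming.unpersist", "true") // off by default in this release
```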
Andrew Or 839934140f Wording changes per Patrick 2014-01-13 20:51:38 -08:00
Reynold Xin 33022d6656 Adjusted visibility of various components. 2014-01-13 19:58:53 -08:00
Harvey 9e84e70509 Add default value for HadoopRDD's cloneRecords constructor arg, to maintain backwards compatibility. 2014-01-13 19:43:40 -08:00
Patrick Wendell d4cd5debf4 Fix for Kryo Serializer 2014-01-13 19:03:59 -08:00
Reynold Xin e2d25d2dfe Merge branch 'master' into graphx 2014-01-13 16:21:26 -08:00
Tathagata Das 27311b1332 Added unpersisting and modified testsuite to better test out metadata cleaning. 2014-01-13 14:57:07 -08:00
Patrick Wendell c3816de504 Changing option wording per discussion with Andrew 2014-01-13 13:25:06 -08:00
Patrick Wendell 5d61e051c2 Improvements to external sorting
1. Adds the option of compressing outputs.
2. Adds batching to the serialization to prevent OOM on the read side.
3. Slight renaming of config options.
4. Use Spark's buffer size for reads in addition to writes.
2014-01-13 12:21:39 -08:00
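Item 2's batching is the interesting part. A hedged sketch (batch size and stream handling are illustrative, not Spark's actual spill format): writing a bounded batch per object stream means the reader never has to hold more than one batch of deserialized objects at a time.

```scala
import java.io.{ObjectOutputStream, OutputStream}

// Illustrative sketch: one ObjectOutputStream per batch bounds what the
// reading side must buffer, preventing OOM when deserializing large spills.
def writeBatched[T](out: OutputStream, items: Iterator[T],
                    batchSize: Int = 10000): Unit = {
  while (items.hasNext) {
    val oos = new ObjectOutputStream(out)
    var written = 0
    while (items.hasNext && written < batchSize) {
      oos.writeObject(items.next())
      written += 1
    }
    oos.flush() // the reader opens a fresh stream per batch
  }
}
```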
Saurabh Rawat e922973373 Modifications as suggested in PR feedback:
- mapPartitions, foreachPartition moved to JavaRDDLike
- call scala rdd's setGenerator instead of setting directly in JavaRDD
2014-01-13 23:40:04 +05:30
eklavya fa42951e3b Remove default param from mapPartitions 2014-01-13 18:13:22 +05:30
eklavya 8fe562c0fa Remove classtag from mapPartitions. 2014-01-13 18:09:58 +05:30
eklavya 6a65feebc7 Added foreachPartition method to JavaRDD. 2014-01-13 17:56:47 +05:30
eklavya dbadc6b994 Added mapPartitions method to JavaRDD. 2014-01-13 17:56:10 +05:30
eklavya aae8a01425 Added setter method setGenerator to JavaRDD. 2014-01-13 17:53:35 +05:30
Andrew Or a1f0992fae Report bytes spilled for both memory and disk on Web UI 2014-01-12 23:42:57 -08:00
Andrew Or 69c9aebed0 Enable external sorting by default 2014-01-12 22:43:01 -08:00
Reynold Xin e6ed13f255 Merge pull request #397 from pwendell/host-port
Remove now un-needed hostPort option

I noticed this was logging some scary error messages in various places. After looking into it, I found it is no longer really used. I removed the option and rewrote the one remaining use case (it was unnecessary there anyway).
2014-01-12 22:35:14 -08:00
Andrew Or 8d40e7222f Get rid of spill map in SparkEnv 2014-01-12 22:34:33 -08:00
Patrick Wendell 0b96d85c20 Merge pull request #399 from pwendell/consolidate-off
Disable shuffle file consolidation by default

After running various performance tests for the 0.9 release, this still seems to have performance issues even on XFS. So let's keep it off by default for 0.9; users can experiment with it depending on their disk configurations.
2014-01-12 21:31:43 -08:00
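Users who still want to experiment can opt in explicitly; assuming the flag name of this era:

```scala
import org.apache.spark.SparkConf

val conf = new SparkConf()
  .set("spark.shuffle.consolidateFiles", "true") // off by default as of 0.9
```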
Patrick Wendell 0ab505a29e Merge pull request #395 from hsaputra/remove_simpleredundantreturn_scala
Remove simple redundant return statements for Scala methods/functions

Remove simple redundant return statements for Scala methods/functions:

-) Only change simple return statements at the end of a method
-) Ignore complex if-else checks
-) Ignore the ones inside synchronized blocks
-) Make small changes converting var to val where possible, and remove () for simple getters

This hopefully makes the review simpler =)

Passes compile and tests.
2014-01-12 21:31:04 -08:00
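For illustration, the shape of the simple case this PR targets:

```scala
// Before: an explicit return as the last statement of the method.
def isBlankOld(s: String): Boolean = {
  return s.trim.isEmpty
}

// After: the final expression is the result; the return keyword is redundant.
def isBlank(s: String): Boolean = {
  s.trim.isEmpty
}
```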
Patrick Wendell 2802cc80bc Disable shuffle file consolidation by default 2014-01-12 19:16:43 -08:00
Henry Saputra 5a8abfb70e Address code review concerns and comments. 2014-01-12 19:15:09 -08:00
Tathagata Das aa2c993858 Merge remote-tracking branch 'apache/master' into error-handling 2014-01-12 17:37:46 -08:00
Patrick Wendell 074f50232f Merge pull request #396 from pwendell/executor-env
Setting load defaults to true in executor

This preserves the behavior of earlier releases. If properties are set for the executors via `spark-env.sh` on the slaves, they should take precedence over Spark defaults. This is useful if system administrators are setting properties for a standalone cluster, such as shuffle locations.

/cc @andrewor14 who initially reported this issue.
2014-01-12 17:01:13 -08:00
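The one-line shape of the change, sketched (the actual call site in the executor is simplified away):

```scala
import org.apache.spark.SparkConf

// loadDefaults = true makes the executor's conf pick up spark.* system
// properties, such as those exported through spark-env.sh on the slaves.
val executorConf = new SparkConf(loadDefaults = true) // previously false
```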
Reynold Xin 82e2b92c6d Merge pull request #392 from rxin/listenerbus
Stop SparkListenerBus daemon thread when DAGScheduler is stopped.

Otherwise this leads to hundreds of SparkListenerBus daemon threads in our unit tests (and is also problematic if a user application launches multiple SparkContexts).
2014-01-12 16:55:11 -08:00
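A hedged sketch of the shutdown pattern (queue-based, with a stop marker; not the actual SparkListenerBus code):

```scala
import java.util.concurrent.LinkedBlockingQueue

// Hypothetical sketch: the bus drains events on a daemon thread and exits
// when it sees a stop marker, instead of outliving its DAGScheduler.
class ListenerBus {
  private case object Stop
  private val queue = new LinkedBlockingQueue[Any]()

  private val thread = new Thread("listener-bus") {
    setDaemon(true)
    override def run(): Unit = {
      var running = true
      while (running) queue.take() match {
        case Stop  => running = false // posted by DAGScheduler.stop()
        case event => ()              // deliver the event to listeners here
      }
    }
  }
  thread.start()

  def post(event: Any): Unit = queue.put(event)
  def stop(): Unit = queue.put(Stop)
}
```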
Patrick Wendell 0bb33076e2 Removing mentions in tests 2014-01-12 16:53:58 -08:00
Patrick Wendell 0d4886c000 Remove now un-needed hostPort option 2014-01-12 16:47:52 -08:00
Patrick Wendell cfb1e6c13c Setting load defaults to true in executor 2014-01-12 15:35:08 -08:00
Henry Saputra f1c5eca494 Fix accidental comment modification. 2014-01-12 10:40:21 -08:00
Henry Saputra 91a563608e Merge branch 'master' into remove_simpleredundantreturn_scala 2014-01-12 10:34:13 -08:00
Henry Saputra 93a65e5fde Remove simple redundant return statements for Scala methods/functions:
-) Only change simple return statements at the end of a method
-) Ignore complex if-else checks
-) Ignore the ones inside synchronized blocks
2014-01-12 10:30:04 -08:00
Tathagata Das 18f4889d96 Merge remote-tracking branch 'apache/master' into error-handling 2014-01-11 23:40:57 -08:00
Tathagata Das f5108ffc24 Converted JobScheduler to use actors for event handling. Changed protected[streaming] to private[streaming] in StreamingContext and DStream. Added waitForStop to StreamingContext, and StreamingContextSuite. 2014-01-11 23:15:09 -08:00
Reynold Xin 288a878999 Merge pull request #389 from rxin/clone-writables
Minor update for clone writables and more documentation.
2014-01-11 21:53:19 -08:00
Reynold Xin dbc11df411 Merge pull request #388 from pwendell/master
Fix UI bug introduced in #244.

The 'duration' field was incorrectly renamed to 'task time' in the table that
lists stages.
2014-01-11 18:07:13 -08:00
Reynold Xin 362cda18bc Renamed cloneKeyValues to cloneRecords; updated docs. 2014-01-11 18:01:29 -08:00
Patrick Wendell 07b952e1d1 Revert "Fix default TTL for metadata cleaner"
This reverts commit 669ba4caa9.
2014-01-11 16:07:10 -08:00
Reynold Xin 2180c87188 Stop SparkListenerBus daemon thread when DAGScheduler is stopped. 2014-01-11 13:36:37 -08:00
Reynold Xin b0fbfccadc Minor update for clone writables and more documentation. 2014-01-11 12:35:10 -08:00
Reynold Xin ee6e7f9b8c Merge pull request #359 from ScrapCodes/clone-writables
We clone Hadoop keys and values by default and reuse objects if asked to.

We try to clone the most common types of Writables and call WritableUtils.clone otherwise. The intention is to optimize: for NullWritable, for example, there is no need to clone at all, and for Long, int, and String, creating a new object with the value set is hopefully faster than copying the object.

There is another way to do this PR, where we ask separately whether to clone keys and values, but I could not think of a use case for it except when one of them is actually a NullWritable, which I have already worked around. So I thought that would be unnecessary.
2014-01-11 12:07:55 -08:00
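A sketch of the cloning rule the description lays out (simplified; the actual helper in the PR may differ): skip cloning where it cannot matter, construct cheap writables directly, and fall back to WritableUtils.clone otherwise.

```scala
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.io._

// Simplified sketch of the per-type cloning strategy described above.
def cloneWritable[T <: Writable](value: T, conf: Configuration): T = (value match {
  case w: NullWritable => w                       // singleton: nothing to clone
  case w: LongWritable => new LongWritable(w.get) // cheaper than a generic copy
  case w: IntWritable  => new IntWritable(w.get)
  case w: Text         => new Text(w)
  case other           => WritableUtils.clone(other, conf)
}).asInstanceOf[T]
```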