ODIn/spark-instrumented-optimizer

Author	SHA1	Message	Date
Patrick Wendell	fb98488fc8	Clean up and simplify Spark configuration Over time as we've added more deployment modes, this have gotten a bit unwieldy with user-facing configuration options in Spark. Going forward we'll advise all users to run `spark-submit` to launch applications. This is a WIP patch but it makes the following improvements: 1. Improved `spark-env.sh.template` which was missing a lot of things users now set in that file. 2. Removes the shipping of SPARK_CLASSPATH, SPARK_JAVA_OPTS, and SPARK_LIBRARY_PATH to the executors on the cluster. This was an ugly hack. Instead it introduces config variables spark.executor.extraJavaOpts, spark.executor.extraLibraryPath, and spark.executor.extraClassPath. 3. Adds ability to set these same variables for the driver using `spark-submit`. 4. Allows you to load system properties from a `spark-defaults.conf` file when running `spark-submit`. This will allow setting both SparkConf options and other system properties utilized by `spark-submit`. 5. Made `SPARK_LOCAL_IP` an environment variable rather than a SparkConf property. This is more consistent with it being set on each node. Author: Patrick Wendell <pwendell@gmail.com> Closes #299 from pwendell/config-cleanup and squashes the following commits: 127f301 [Patrick Wendell] Improvements to testing a006464 [Patrick Wendell] Moving properties file template. 
b4b496c [Patrick Wendell] spark-defaults.properties -> spark-defaults.conf 0086939 [Patrick Wendell] Minor style fixes af09e3e [Patrick Wendell] Mention config file in docs and clean-up docs b16e6a2 [Patrick Wendell] Cleanup of spark-submit script and Scala quick start guide af0adf7 [Patrick Wendell] Automatically add user jar a56b125 [Patrick Wendell] Responses to Tom's review d50c388 [Patrick Wendell] Merge remote-tracking branch 'apache/master' into config-cleanup a762901 [Patrick Wendell] Fixing test failures ffa00fe [Patrick Wendell] Review feedback fda0301 [Patrick Wendell] Note 308f1f6 [Patrick Wendell] Properly escape quotes and other clean-up for YARN e83cd8f [Patrick Wendell] Changes to allow re-use of test applications be42f35 [Patrick Wendell] Handle case where SPARK_HOME is not set c2a2909 [Patrick Wendell] Test compile fixes 4ee6f9d [Patrick Wendell] Making YARN doc changes consistent afc9ed8 [Patrick Wendell] Cleaning up line limits and two compile errors. b08893b [Patrick Wendell] Additional improvements. ace4ead [Patrick Wendell] Responses to review feedback. b72d183 [Patrick Wendell] Review feedback for spark env file 46555c1 [Patrick Wendell] Review feedback and import clean-ups 437aed1 [Patrick Wendell] Small fix 761ebcd [Patrick Wendell] Library path and classpath for drivers 7cc70e4 [Patrick Wendell] Clean up terminology inside of spark-env script 5b0ba8e [Patrick Wendell] Don't ship executor envs 84cc5e5 [Patrick Wendell] Small clean-up 1f75238 [Patrick Wendell] SPARK_JAVA_OPTS --> SPARK_MASTER_OPTS for master settings 4982331 [Patrick Wendell] Remove SPARK_LIBRARY_PATH 6eaf7d0 [Patrick Wendell] executorJavaOpts 0faa3b6 [Patrick Wendell] Stash of adding config options in submit script and YARN ac2d65e [Patrick Wendell] Change spark.local.dir -> SPARK_LOCAL_DIRS	2014-04-21 10:26:33 -07:00
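The settings named in this commit could sit in a `spark-defaults.conf` file roughly as follows. This is a hypothetical fragment: the property names are copied from the commit message above (names in later releases may differ), every value and path is an illustrative assumption, and the whitespace-separated key/value layout is assumed as well.

```
# spark-defaults.conf (illustrative values only; paths are made up)
spark.executor.extraJavaOpts      -XX:+PrintGCDetails
spark.executor.extraClassPath     /opt/example/lib/extra.jar
spark.executor.extraLibraryPath   /opt/example/native
```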
Chen Chao	9edd88782e	update spark.default.parallelism Actually, the value 8 is only valid in Mesos fine-grained mode: `override def defaultParallelism() = sc.conf.getInt("spark.default.parallelism", 8)` while in coarse-grained modes, including Mesos coarse-grained, the value of the property depends on the number of cores: `override def defaultParallelism(): Int = { conf.getInt("spark.default.parallelism", math.max(totalCoreCount.get(), 2)) }` Author: Chen Chao <crazyjvm@gmail.com> Closes #389 from CrazyJvm/patch-2 and squashes the following commits: 84a7fe4 [Chen Chao] miss </li> at the end of every single line 04a9796 [Chen Chao] change format ee0fae0 [Chen Chao] update spark.default.parallelism	2014-04-16 09:14:18 -07:00
Sundeep Narravula	2c557837b4	SPARK-1202 - Add a "cancel" button in the UI for stages Author: Sundeep Narravula <sundeepn@superduel.local> Author: Sundeep Narravula <sundeepn@dhcpx-204-110.corp.yahoo.com> Closes #246 from sundeepn/uikilljob and squashes the following commits: 5fdd0e2 [Sundeep Narravula] Fix test string f6fdff1 [Sundeep Narravula] Format fix; reduced line size to less than 100 chars d1daeb9 [Sundeep Narravula] Incorporating review comments. 8d97923 [Sundeep Narravula] Ability to kill jobs through the UI. This behavior can be turned on by setting the following variable: spark.ui.killEnabled=true (default=false). Adding DAGScheduler event StageCancelled and corresponding handlers. Added cancellation reason to handlers.	2014-04-10 17:10:11 -07:00
Holden Karau	fa0524fd02	Spark-939: allow user jars to take precedence over spark jars I still need to do a small bit of re-factoring [mostly the one Java file I'll switch it back to a Scala file and use it in both the close loaders], but comments on other things I should do would be great. Author: Holden Karau <holden@pigscanfly.ca> Closes #217 from holdenk/spark-939-allow-user-jars-to-take-precedence-over-spark-jars and squashes the following commits: cf0cac9 [Holden Karau] Fix the executorclassloader 1955232 [Holden Karau] Fix long line in TestUtils 8f89965 [Holden Karau] Fix tests for new class name 7546549 [Holden Karau] CR feedback, merge some of the testutils methods down, rename the classloader 644719f [Holden Karau] User the class generator for the repl class loader tests too f0b7114 [Holden Karau] Fix the core/src/test/scala/org/apache/spark/executor/ExecutorURLClassLoaderSuite.scala tests 204b199 [Holden Karau] Fix the generated classes 9f68f10 [Holden Karau] Start rewriting the ExecutorURLClassLoaderSuite to not use the hard coded classes 858aba2 [Holden Karau] Remove a bunch of test junk 261aaee [Holden Karau] simplify executorurlclassloader a bit 7a7bf5f [Holden Karau] CR feedback d4ae848 [Holden Karau] rewrite component into scala aa95083 [Holden Karau] CR feedback 7752594 [Holden Karau] re-add https comment a0ef85a [Holden Karau] Fix style issues 125ea7f [Holden Karau] Easier to just remove those files, we don't need them bb8d179 [Holden Karau] Fix issues with the repl class loader 241b03d [Holden Karau] fix my rat excludes a343350 [Holden Karau] Update rat-excludes and remove a useless file d90d217 [Holden Karau] Fix fall back with custom class loader and add a test for it 4919bf9 [Holden Karau] Fix parent calling class loader issue 8a67302 [Holden Karau] Test are good 9e2d236 [Holden Karau] It works comrade 691ee00 [Holden Karau] It works ish dc4fe44 [Holden Karau] Does not depend on being in my home directory 47046ff [Holden Karau] Remove bad 
import' 22d83cb [Holden Karau] Add a test suite for the executor url class loader suite 7ef4628 [Holden Karau] Clean up 792d961 [Holden Karau] Almost works 16aecd1 [Holden Karau] Doesn't quite work 8d2241e [Holden Karau] Adda FakeClass for testing ClassLoader precedence options 648b559 [Holden Karau] Both class loaders compile. Now for testing e1d9f71 [Holden Karau] One loader workers.	2014-04-08 22:30:03 -07:00
Evan Chan	1440154c27	SPARK-1154: Clean up app folders in worker nodes This is a fix for [SPARK-1154](https://issues.apache.org/jira/browse/SPARK-1154). The issue is that worker nodes fill up with a huge number of app-* folders after some time. This change adds a periodic cleanup task which asynchronously deletes app directories older than a configurable TTL. Two new configuration parameters have been introduced: spark.worker.cleanup_interval spark.worker.app_data_ttl This change does not include moving the downloads of application jars to a location outside of the work directory. We will address that if we have time, but that potentially involves caching so it will come either as part of this PR or a separate PR. Author: Evan Chan <ev@ooyala.com> Author: Kelvin Chu <kelvinkwchu@yahoo.com> Closes #288 from velvia/SPARK-1154-cleanup-app-folders and squashes the following commits: 0689995 [Evan Chan] CR from @aarondav - move config, clarify for standalone mode 9f10d96 [Evan Chan] CR from @pwendell - rename configs and add cleanup.enabled f2f6027 [Evan Chan] CR from @andrewor14 553d8c2 [Kelvin Chu] change the variable name to currentTimeMillis since it actually tracks in seconds 8dc9cb5 [Kelvin Chu] Fixed a bug in Utils.findOldFiles() after merge. cb52f2b [Kelvin Chu] Change the name of findOldestFiles() to findOldFiles() 72f7d2d [Kelvin Chu] Fix a bug of Utils.findOldestFiles(). file.lastModified is returned in milliseconds. ad99955 [Kelvin Chu] Add unit test for Utils.findOldestFiles() dc1a311 [Evan Chan] Don't recompute current time with every new file e3c408e [Evan Chan] Document the two new settings b92752b [Evan Chan] SPARK-1154: Add a periodic task to clean up app directories	2014-04-06 19:21:40 -07:00
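The TTL-based cleanup described in this commit can be sketched as below. This is a minimal illustration, not the code from the PR: the class `OldDirFinder` and the method `findOlderThan` are hypothetical names. It assumes the TTL is given in seconds, while `File.lastModified()` returns milliseconds, which is the unit mismatch one of the squashed commits mentions fixing.

```java
import java.io.File;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not Spark's Utils.findOldFiles().
final class OldDirFinder {
    /**
     * Returns the sub-directories of workDir whose last-modified time is
     * older than ttlSeconds, measured against nowMillis.
     */
    static List<File> findOlderThan(File workDir, long ttlSeconds, long nowMillis) {
        // lastModified() is in milliseconds, so convert the TTL before comparing.
        long cutoffMillis = nowMillis - ttlSeconds * 1000L;
        List<File> old = new ArrayList<>();
        File[] children = workDir.listFiles();
        if (children == null) {
            return old; // not a directory, or an I/O error occurred
        }
        for (File child : children) {
            if (child.isDirectory() && child.lastModified() < cutoffMillis) {
                old.add(child);
            }
        }
        return old;
    }
}
```

Passing the current time in as a parameter means it is read once per sweep rather than once per file, in the spirit of the "Don't recompute current time with every new file" commit.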
Haoyuan Li	b50ddfde03	SPARK-1305: Support persisting RDD's directly to Tachyon Move the PR#468 of apache-incubator-spark to the apache-spark "Adding an option to persist Spark RDD blocks into Tachyon." Author: Haoyuan Li <haoyuan@cs.berkeley.edu> Author: RongGu <gurongwalker@gmail.com> Closes #158 from RongGu/master and squashes the following commits: 72b7768 [Haoyuan Li] merge master 9f7fa1b [Haoyuan Li] fix code style ae7834b [Haoyuan Li] minor cleanup a8b3ec6 [Haoyuan Li] merge master branch e0f4891 [Haoyuan Li] better check offheap. 55b5918 [RongGu] address matei's comment on the replication of offHeap storagelevel 7cd4600 [RongGu] remove some logic code for tachyonstore's replication 51149e7 [RongGu] address aaron's comment on returning value of the remove() function in tachyonstore 8adfcfa [RongGu] address arron's comment on inTachyonSize 120e48a [RongGu] changed the root-level dir name in Tachyon 5cc041c [Haoyuan Li] address aaron's comments 9b97935 [Haoyuan Li] address aaron's comments d9a6438 [Haoyuan Li] fix for pspark 77d2703 [Haoyuan Li] change python api.git status 3dcace4 [Haoyuan Li] address matei's comments 91fa09d [Haoyuan Li] address patrick's comments 589eafe [Haoyuan Li] use TRY_CACHE instead of MUST_CACHE 64348b2 [Haoyuan Li] update conf docs. 
ed73e19 [Haoyuan Li] Merge branch 'master' of github.com:RongGu/spark-1 619a9a8 [RongGu] set number of directories in TachyonStore back to 64; added a TODO tag for duplicated code from the DiskStore be79d77 [RongGu] find a way to clean up some unnecessay metods and classed to make the code simpler 49cc724 [Haoyuan Li] update docs with off_headp option 4572f9f [RongGu] reserving the old apply function API of StorageLevel 04301d3 [RongGu] rename StorageLevel.TACHYON to Storage.OFF_HEAP c9aeabf [RongGu] rename the StorgeLevel.TACHYON as StorageLevel.OFF_HEAP 76805aa [RongGu] unifies the config properties name prefix; add the configs into docs/configuration.md e700d9c [RongGu] add the SparkTachyonHdfsLR example and some comments fd84156 [RongGu] use randomUUID to generate sparkapp directory name on tachyon;minor code style fix 939e467 [Haoyuan Li] 0.4.1-thrift from maven central 86a2eab [Haoyuan Li] tachyon 0.4.1-thrift is in the staging repo. but jenkins failed to download it. temporarily revert it back to 0.4.1 16c5798 [RongGu] make the dependency on tachyon as tachyon-0.4.1-thrift eacb2e8 [RongGu] Merge branch 'master' of https://github.com/RongGu/spark-1 bbeb4de [RongGu] fix the JsonProtocolSuite test failure problem 6adb58f [RongGu] Merge branch 'master' of https://github.com/RongGu/spark-1 d827250 [RongGu] fix JsonProtocolSuie test failure 716e93b [Haoyuan Li] revert the version ca14469 [Haoyuan Li] bump tachyon version to 0.4.1-thrift 2825a13 [RongGu] up-merging to the current master branch of the apache spark 6a22c1a [Haoyuan Li] fix scalastyle 8968b67 [Haoyuan Li] exclude more libraries from tachyon dependency to be the same as referencing tachyon-client. 77be7e8 [RongGu] address mateiz's comment about the temp folder name problem. The implementation followed mateiz's advice. 
1dcadf9 [Haoyuan Li] typo bf278fa [Haoyuan Li] fix python tests e82909c [Haoyuan Li] minor cleanup 776a56c [Haoyuan Li] address patrick's and ali's comments from the previous PR 8859371 [Haoyuan Li] various minor fixes and clean up e3ddbba [Haoyuan Li] add doc to use Tachyon cache mode. fcaeab2 [Haoyuan Li] address Aaron's comment e554b1e [Haoyuan Li] add python code 47304b3 [Haoyuan Li] make tachyonStore in BlockMananger lazy val; add more comments StorageLevels. dc8ef24 [Haoyuan Li] add old storelevel constructor e01a271 [Haoyuan Li] update tachyon 0.4.1 8011a96 [RongGu] fix a brought-in mistake in StorageLevel 70ca182 [RongGu] a bit change in comment 556978b [RongGu] fix the scalastyle errors 791189b [RongGu] "Adding an option to persist Spark RDD blocks into Tachyon." move the PR#468 of apache-incubator-spark to the apache-spark	2014-04-04 20:38:20 -07:00
Shivaram Venkataraman	f8111eaeb0	SPARK-1319: Fix scheduler to account for tasks using > 1 CPUs. Move CPUS_PER_TASK to TaskSchedulerImpl as the value is a constant and use it in both Mesos and CoarseGrained scheduler backends. Thanks @kayousterhout for the design discussion Author: Shivaram Venkataraman <shivaram@eecs.berkeley.edu> Closes #219 from shivaram/multi-cpus and squashes the following commits: 5c7d685 [Shivaram Venkataraman] Don't pass availableCpus to TaskSetManager 260e4d5 [Shivaram Venkataraman] Add a check for non-zero CPUs in TaskSetManager 73fcf6f [Shivaram Venkataraman] Add documentation for spark.task.cpus 647bc45 [Shivaram Venkataraman] Fix scheduler to account for tasks using > 1 CPUs. Move CPUS_PER_TASK to TaskSchedulerImpl as the value is a constant and use it in both Mesos and CoarseGrained scheduler backends.	2014-03-25 13:05:30 -07:00
Andrew Or	79d07d6604	[SPARK-1132] Persisting Web UI through refactoring the SparkListener interface The fleeting nature of the Spark Web UI has long been a problem reported by many users: The existing Web UI disappears as soon as the associated application terminates. This is because SparkUI is tightly coupled with SparkContext, and cannot be instantiated independently from it. To solve this, some state must be saved to persistent storage while the application is still running. The approach taken by this PR involves persisting the UI state through SparkListenerEvents. This requires a major refactor of the SparkListener interface because existing events (1) maintain deep references, making de/serialization is difficult, and (2) do not encode all the information displayed on the UI. In this design, each existing listener for the UI (e.g. ExecutorsListener) maintains state that can be fully constructed from SparkListenerEvents. This state is then supplied to the parent UI (e.g. ExecutorsUI), which renders the associated page(s) on demand. This PR introduces two important classes: the EventLoggingListener, and the ReplayListenerBus. In a live application, SparkUI registers an EventLoggingListener with the SparkContext in addition to the existing listeners. Over the course of the application, this listener serializes and logs all events to persisted storage. Then, after the application has finished, the SparkUI can be revived by replaying all the logged events to the existing UI listeners through the ReplayListenerBus. This feature is currently integrated with the Master Web UI, which optionally rebuilds a SparkUI from event logs as soon as the corresponding application finishes. More details can be found in the commit messages, comments within the code, and the [design doc](https://spark-project.atlassian.net/secure/attachment/12900/PersistingSparkWebUI.pdf). Comments and feedback are most welcome. 
Author: Andrew Or <andrewor14@gmail.com> Author: andrewor14 <andrewor14@gmail.com> Closes #42 from andrewor14/master and squashes the following commits: e5f14fa [Andrew Or] Merge github.com:apache/spark a1c5cd9 [Andrew Or] Merge github.com:apache/spark b8ba817 [Andrew Or] Remove UI from map when removing application in Master 83af656 [Andrew Or] Scraps and pieces (no functionality change) 222adcd [Andrew Or] Merge github.com:apache/spark 124429f [Andrew Or] Clarify LiveListenerBus behavior + Add tests for new behavior f80bd31 [Andrew Or] Simplify static handler and BlockManager status update logic 9e14f97 [Andrew Or] Moved around functionality + renamed classes per Patrick 6740e49 [Andrew Or] Fix comment nits 650eb12 [Andrew Or] Add unit tests + Fix bugs found through tests 45fd84c [Andrew Or] Remove now deprecated test c5c2c8f [Andrew Or] Remove list of (TaskInfo, TaskMetrics) from StageInfo 3456090 [Andrew Or] Address Patrick's comments bf80e3d [Andrew Or] Imports, comments, and code formatting, once again (minor) ac69ec8 [Andrew Or] Fix test fail d801d11 [Andrew Or] Merge github.com:apache/spark (major) dc93915 [Andrew Or] Imports, comments, and code formatting (minor) 77ba283 [Andrew Or] Address Kay's and Patrick's comments b6eaea7 [Andrew Or] Treating SparkUI as a handler of MasterUI d59da5f [Andrew Or] Avoid logging all the blocks on each executor d6e3b4a [Andrew Or] Merge github.com:apache/spark ca258a4 [Andrew Or] Master UI - add support for reading compressed event logs 176e68e [Andrew Or] Fix deprecated message for JavaSparkContext (minor) 4f69c4a [Andrew Or] Master UI - Rebuild SparkUI on application finish 291b2be [Andrew Or] Correct directory in log message "INFO: Logging events to <dir>" 1ba3407 [Andrew Or] Add a few configurable options to event logging e375431 [Andrew Or] Add new constructors for SparkUI 18b256d [Andrew Or] Refactor out event logging and replaying logic from UI bb4c503 [Andrew Or] Use a more mnemonic path for logging aef411c [Andrew 
Or] Fix bug: storage status was not reflected on UI in the local case 03eda0b [Andrew Or] Fix HDFS flush behavior 36b3e5d [Andrew Or] Add HDFS support for event logging cceff2b [andrewor14] Fix 100 char format fail 2fee310 [Andrew Or] Address Patrick's comments 2981d61 [Andrew Or] Move SparkListenerBus out of DAGScheduler + Clean up 5d2cec1 [Andrew Or] JobLogger: ID -> Id 0503e4b [Andrew Or] Fix PySpark tests + remove sc.clearFiles/clearJars 4d2fb0c [Andrew Or] Fix format fail faa113e [Andrew Or] General clean up d47585f [Andrew Or] Clean up FileLogger 472fd8a [Andrew Or] Fix a couple of tests 996d7a2 [Andrew Or] Reflect RDD unpersist on UI 7b2f811 [Andrew Or] Guard against TaskMetrics NPE + Fix tests d1f4285 [Andrew Or] Migrate from lift-json to json4s-jackson 28019ca [Andrew Or] Merge github.com:apache/spark bbe3501 [Andrew Or] Embed storage status and RDD info in Task events 6631c02 [Andrew Or] More formatting changes, this time mainly for Json DSL 70e7e7a [Andrew Or] Formatting changes e9e1c6d [Andrew Or] Move all JSON de/serialization logic to JsonProtocol d646df6 [Andrew Or] Completely decouple SparkUI from SparkContext 6814da0 [Andrew Or] Explicitly register each UI listener rather than through some magic 64d2ce1 [Andrew Or] Fix BlockManagerUI bug by introducing new event 4273013 [Andrew Or] Add a gateway SparkListener to simplify event logging 904c729 [Andrew Or] Fix another major bug 5ac906d [Andrew Or] Mostly naming, formatting, and code style changes 3fd584e [Andrew Or] Fix two major bugs f3fc13b [Andrew Or] General refactor 4dfcd22 [Andrew Or] Merge git://git.apache.org/incubator-spark into persist-ui b3976b0 [Andrew Or] Add functionality of reconstructing a persisted UI from SparkContext 8add36b [Andrew Or] JobProgressUI: Add JSON functionality d859efc [Andrew Or] BlockManagerUI: Add JSON functionality c4cd480 [Andrew Or] Also deserialize new events 8a2ebe6 [Andrew Or] Fix bugs for EnvironmentUI and ExecutorsUI de8a1cd [Andrew Or] Serialize events both 
to and from JSON (rather than just to) bf0b2e9 [Andrew Or] ExecutorUI: Serialize events rather than arbitary executor information bb222b9 [Andrew Or] ExecutorUI: render completely from JSON dcbd312 [Andrew Or] Add JSON Serializability for all SparkListenerEvent's 10ed49d [Andrew Or] Merge github.com:apache/incubator-spark into persist-ui 8e09306 [Andrew Or] Use JSON for ExecutorsUI e3ae35f [Andrew Or] Merge github.com:apache/incubator-spark 3ddeb7e [Andrew Or] Also privatize fields 090544a [Andrew Or] Privatize methods 13920c9 [Andrew Or] Update docs bd5a1d7 [Andrew Or] Typo: phyiscal -> physical 287ef44 [Andrew Or] Avoid reading the entire batch into memory; also simplify streaming logic 3df7005 [Andrew Or] Merge branch 'master' of github.com:andrewor14/incubator-spark a531d2e [Andrew Or] Relax assumptions on compressors and serializers when batching 164489d [Andrew Or] Relax assumptions on compressors and serializers when batching	2014-03-19 13:17:01 -07:00
Patrick Wendell	faf4cad1de	Fix markup errors introduced in #33 (SPARK-1189) These were causing errors on the configuration page. Author: Patrick Wendell <pwendell@gmail.com> Closes #111 from pwendell/master and squashes the following commits: 8467a86 [Patrick Wendell] Fix markup errors introduced in #33 (SPARK-1189)	2014-03-09 11:57:06 -07:00
Jiacheng Guo	f6f9d02e85	Add timeout for fetch file Currently, when fetching a file, the connection's connect timeout and read timeout are based on the default JVM settings. This change makes them use spark.worker.timeout. This can be useful when the connection between workers is not perfect, and it prevents the task set from being removed prematurely. Author: Jiacheng Guo <guojc03@gmail.com> Closes #98 from guojc/master and squashes the following commits: abfe698 [Jiacheng Guo] add space according request 2a37c34 [Jiacheng Guo] Add timeout for fetch file Currently, when fetching a file, the connection's connect timeout and read timeout are based on the default JVM settings. This change makes them use spark.worker.timeout. This can be useful when the connection between workers is not perfect, and it prevents the task set from being removed prematurely.	2014-03-09 11:38:40 -07:00
Thomas Graves	7edbea41b4	SPARK-1189: Add Security to Spark - Akka, Http, ConnectionManager, UI use servlets resubmit pull request. was https://github.com/apache/incubator-spark/pull/332. Author: Thomas Graves <tgraves@apache.org> Closes #33 from tgravescs/security-branch-0.9-with-client-rebase and squashes the following commits: dfe3918 [Thomas Graves] Fix merge conflict since startUserClass now using runAsUser 05eebed [Thomas Graves] Fix dependency lost in upmerge d1040ec [Thomas Graves] Fix up various imports 05ff5e0 [Thomas Graves] Fix up imports after upmerging to master ac046b3 [Thomas Graves] Merge remote-tracking branch 'upstream/master' into security-branch-0.9-with-client-rebase 13733e1 [Thomas Graves] Pass securityManager and SparkConf around where we can. Switch to use sparkConf for reading config whereever possible. Added ConnectionManagerSuite unit tests. 4a57acc [Thomas Graves] Change UI createHandler routines to createServlet since they now return servlets 2f77147 [Thomas Graves] Rework from comments 50dd9f2 [Thomas Graves] fix header in SecurityManager ecbfb65 [Thomas Graves] Fix spacing and formatting b514bec [Thomas Graves] Fix reference to config ed3d1c1 [Thomas Graves] Add security.md 6f7ddf3 [Thomas Graves] Convert SaslClient and SaslServer to scala, change spark.authenticate.ui to spark.ui.acls.enable, and fix up various other things from review comments 2d9e23e [Thomas Graves] Merge remote-tracking branch 'upstream/master' into security-branch-0.9-with-client-rebase_rework 5721c5a [Thomas Graves] update AkkaUtilsSuite test for the actorSelection changes, fix typos based on comments, and remove extra lines I missed in rebase from AkkaUtils f351763 [Thomas Graves] Add Security to Spark - Akka, Http, ConnectionManager, UI to use servlets	2014-03-06 18:27:50 -06:00
Kyle Ellrott	40566e10aa	SPARK-942: Do not materialize partitions when DISK_ONLY storage level is used This is a port of a pull request original targeted at incubator-spark: https://github.com/apache/incubator-spark/pull/180 Essentially if a user returns a generative iterator (from a flatMap operation), when trying to persist the data, Spark would first unroll the iterator into an ArrayBuffer, and then try to figure out if it could store the data. In cases where the user provided an iterator that generated more data then available memory, this would case a crash. With this patch, if the user requests a persist with a 'StorageLevel.DISK_ONLY', the iterator will be unrolled as it is inputed into the serializer. To do this, two changes where made: 1) The type of the 'values' argument in the putValues method of the BlockStore interface was changed from ArrayBuffer to Iterator (and all code interfacing with this method was modified to connect correctly. 2) The JavaSerializer now calls the ObjectOutputStream 'reset' method every 1000 objects. This was done because the ObjectOutputStream caches objects (thus preventing them from being GC'd) to write more compact serialization. If reset is never called, eventually the memory fills up, if it is called too often then the serialization streams become much larger because of redundant class descriptions. Author: Kyle Ellrott <kellrott@gmail.com> Closes #50 from kellrott/iterator-to-disk and squashes the following commits: 9ef7cb8 [Kyle Ellrott] Fixing formatting issues. 60e0c57 [Kyle Ellrott] Fixing issues (formatting, variable names, etc.) from review comments 8aa31cd [Kyle Ellrott] Merge ../incubator-spark into iterator-to-disk 33ac390 [Kyle Ellrott] Merge branch 'iterator-to-disk' of github.com:kellrott/incubator-spark into iterator-to-disk 2f684ea [Kyle Ellrott] Refactoring the BlockManager to replace the Either[Either[A,B]] usage. Now using trait 'Values'. 
Also modified BlockStore.putBytes call to return PutResult, so that it behaves like putValues. f70d069 [Kyle Ellrott] Adding docs for spark.serializer.objectStreamReset configuration 7ccc74b [Kyle Ellrott] Moving the 'LargeIteratorSuite' to simply test persistance of iterators. It doesn't try to invoke a OOM error any more 16a4cea [Kyle Ellrott] Streamlined the LargeIteratorSuite unit test. It should now run in ~25 seconds. Confirmed that it still crashes an unpatched copy of Spark. c2fb430 [Kyle Ellrott] Removing more un-needed array-buffer to iterator conversions 627a8b7 [Kyle Ellrott] Wrapping a few long lines 0f28ec7 [Kyle Ellrott] Adding second putValues to BlockStore interface that accepts an ArrayBuffer (rather then an Iterator). This will allow BlockStores to have slightly different behaviors dependent on whether they get an Iterator or ArrayBuffer. In the case of the MemoryStore, it needs to duplicate and cache an Iterator into an ArrayBuffer, but if handed a ArrayBuffer, it can skip the duplication. 656c33e [Kyle Ellrott] Fixing the JavaSerializer to read from the SparkConf rather then the System property. 8644ee8 [Kyle Ellrott] Merge branch 'master' into iterator-to-disk 00c98e0 [Kyle Ellrott] Making the Java ObjectStreamSerializer reset rate configurable by the system variable 'spark.serializer.objectStreamReset', default is not 10000. 40fe1d7 [Kyle Ellrott] Removing rouge space 31fe08e [Kyle Ellrott] Removing un-needed semi-colons 9df0276 [Kyle Ellrott] Added check to make sure that streamed-to-dist RDD actually returns good data in the LargeIteratorSuite a6424ba [Kyle Ellrott] Wrapping long line 2eeda75 [Kyle Ellrott] Fixing dumb mistake ("||" instead of "&&") 0e6f808 [Kyle Ellrott] Deleting temp output directory when done 95c7f67 [Kyle Ellrott] Simplifying StorageLevel checks 56f71cd [Kyle Ellrott] Merge branch 'master' into iterator-to-disk 44ec35a [Kyle Ellrott] Adding some comments. 
5eb2b7e [Kyle Ellrott] Changing the JavaSerializer reset to occur every 1000 objects. f403826 [Kyle Ellrott] Merge branch 'master' into iterator-to-disk 81d670c [Kyle Ellrott] Adding unit test for straight to disk iterator methods. d32992f [Kyle Ellrott] Merge remote-tracking branch 'origin/master' into iterator-to-disk cac1fad [Kyle Ellrott] Fixing MemoryStore, so that it converts incoming iterators to ArrayBuffer objects. This was previously done higher up the stack. efe1102 [Kyle Ellrott] Changing CacheManager and BlockManager to pass iterators directly to the serializer when a 'DISK_ONLY' persist is called. This is in response to SPARK-942.	2014-03-06 14:51:19 -08:00
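The periodic `reset` described in this commit can be sketched as below. This is a minimal illustration under stated assumptions, not Spark's JavaSerializer: the class `ResettingWriter` and the method `writeAll` are hypothetical names, and the interval is passed in as a plain argument rather than read from `spark.serializer.objectStreamReset`. Only the JDK's `ObjectOutputStream.writeObject` and `ObjectOutputStream.reset` are real API here.

```java
import java.io.IOException;
import java.io.ObjectOutputStream;
import java.io.OutputStream;
import java.io.UncheckedIOException;
import java.util.Iterator;

// Hypothetical helper, not Spark's JavaSerializer.
final class ResettingWriter {
    /**
     * Streams every element of values into sink, calling reset() after each
     * resetInterval objects so the stream drops its references to objects it
     * has already written. Returns how many times reset() was called.
     */
    static int writeAll(Iterator<?> values, OutputStream sink, int resetInterval) {
        int sinceReset = 0;
        int resets = 0;
        try (ObjectOutputStream out = new ObjectOutputStream(sink)) {
            while (values.hasNext()) {
                out.writeObject(values.next()); // one element at a time, no buffering of the iterator
                sinceReset++;
                if (resetInterval > 0 && sinceReset >= resetInterval) {
                    out.reset(); // forget previously written objects so they can be garbage-collected
                    sinceReset = 0;
                    resets++;
                }
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return resets;
    }
}
```

The trade-off is the one the commit message states: never resetting lets the stream's object cache grow without bound, while resetting too often repeats class descriptions and enlarges the output.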
CodingCat	1865dd681b	SPARK-1178: missing document of spark.scheduler.revive.interval https://spark-project.atlassian.net/browse/SPARK-1178 The configuration on spark.scheduler.revive.interval is undocumented but actually used https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/scheduler/cluster/CoarseGrainedSchedulerBackend.scala#L64 Author: CodingCat <zhunansjtu@gmail.com> Closes #74 from CodingCat/SPARK-1178 and squashes the following commits: 783ec69 [CodingCat] missing document of spark.scheduler.revive.interval	2014-03-04 10:28:17 -08:00
Andrew Or	1896c6e7c9	Merge pull request #533 from andrewor14/master. Closes #533 . External spilling - generalize batching logic The existing implementation consists of a hack for Kryo specifically and only works for LZF compression. Introducing an intermediate batch-level stream takes care of pre-fetching and other arbitrary behavior of higher level streams in a more general way. Author: Andrew Or <andrewor14@gmail.com> == Merge branch commits == commit 3ddeb7ef89a0af2b685fb5d071aa0f71c975cc82 Author: Andrew Or <andrewor14@gmail.com> Date: Wed Feb 5 12:09:32 2014 -0800 Also privatize fields commit 090544a87a0767effd0c835a53952f72fc8d24f0 Author: Andrew Or <andrewor14@gmail.com> Date: Wed Feb 5 10:58:23 2014 -0800 Privatize methods commit 13920c918efe22e66a1760b14beceb17a61fd8cc Author: Andrew Or <andrewor14@gmail.com> Date: Tue Feb 4 16:34:15 2014 -0800 Update docs commit bd5a1d7350467ed3dc19c2de9b2c9f531f0e6aa3 Author: Andrew Or <andrewor14@gmail.com> Date: Tue Feb 4 13:44:24 2014 -0800 Typo: phyiscal -> physical commit 287ef44e593ad72f7434b759be3170d9ee2723d2 Author: Andrew Or <andrewor14@gmail.com> Date: Tue Feb 4 13:38:32 2014 -0800 Avoid reading the entire batch into memory; also simplify streaming logic Additionally, address formatting comments. commit 3df700509955f7074821e9aab1e74cb53c58b5a5 Merge: a531d2e 164489d Author: Andrew Or <andrewor14@gmail.com> Date: Mon Feb 3 18:27:49 2014 -0800 Merge branch 'master' of github.com:andrewor14/incubator-spark commit a531d2e347acdcecf2d0ab72cd4f965ab5e145d8 Author: Andrew Or <andrewor14@gmail.com> Date: Mon Feb 3 18:18:04 2014 -0800 Relax assumptions on compressors and serializers when batching This commit introduces an intermediate layer of an input stream on the batch level. This guards against interference from higher level streams (i.e. compression and deserialization streams), especially pre-fetching, without specifically targeting particular libraries (Kryo) and forcing shuffle spill compression to use LZF. 
commit 164489d6f176bdecfa9dabec2dfce5504d1ee8af Author: Andrew Or <andrewor14@gmail.com> Date: Mon Feb 3 18:18:04 2014 -0800 Relax assumptions on compressors and serializers when batching This commit introduces an intermediate layer of an input stream on the batch level. This guards against interference from higher level streams (i.e. compression and deserialization streams), especially pre-fetching, without specifically targeting particular libraries (Kryo) and forcing shuffle spill compression to use LZF.	2014-02-06 22:05:53 -08:00
Reynold Xin	ac712e48af	Merge pull request #524 from rxin/doc Added spark.shuffle.file.buffer.kb to configuration doc. Author: Reynold Xin <rxin@apache.org> == Merge branch commits == commit 0eea1d761ff772ff89be234e1e28035d54e5a7de Author: Reynold Xin <rxin@apache.org> Date: Wed Jan 29 14:40:48 2014 -0800 Added spark.shuffle.file.buffer.kb to configuration doc.	2014-01-30 09:33:18 -08:00
Tathagata Das	7930209614	Merge pull request #497 from tdas/docs-update Updated Spark Streaming Programming Guide Here is the updated version of the Spark Streaming Programming Guide. This is still a work in progress, but the major changes are in place. So feedback is most welcome. In general, I have tried to make the guide to easier to understand even if the reader does not know much about Spark. The updated website is hosted here - http://www.eecs.berkeley.edu/~tdas/spark_docs/streaming-programming-guide.html The major changes are: - Overview illustrates the usecases of Spark Streaming - various input sources and various output sources - An example right after overview to quickly give an idea of what Spark Streaming program looks like - Made Java API and examples a first class citizen like Scala by using tabs to show both Scala and Java examples (similar to AMPCamp tutorial's code tabs) - Highlighted the DStream operations updateStateByKey and transform because of their powerful nature - Updated driver node failure recovery text to highlight automatic recovery in Spark standalone mode - Added information about linking and using the external input sources like Kafka and Flume - In general, reorganized the sections to better show the Basic section and the more advanced sections like Tuning and Recovery. Todos: - Links to the docs of external Kafka, Flume, etc - Illustrate window operation with figure as well as example. Author: Tathagata Das <tathagata.das1565@gmail.com> == Merge branch commits == commit 18ff10556570b39d672beeb0a32075215cfcc944 Author: Tathagata Das <tathagata.das1565@gmail.com> Date: Tue Jan 28 21:49:30 2014 -0800 Fixed a lot of broken links. commit 34a5a6008dac2e107624c7ff0db0824ee5bae45f Author: Tathagata Das <tathagata.das1565@gmail.com> Date: Tue Jan 28 18:02:28 2014 -0800 Updated github url to use SPARK_GITHUB_URL variable. 
commit f338a60ae8069e0a382d2cb170227e5757cc0b7a Author: Tathagata Das <tathagata.das1565@gmail.com> Date: Mon Jan 27 22:42:42 2014 -0800 More updates based on Patrick and Harvey's comments. commit 89a81ff25726bf6d26163e0dd938290a79582c0f Author: Tathagata Das <tathagata.das1565@gmail.com> Date: Mon Jan 27 13:08:34 2014 -0800 Updated docs based on Patricks PR comments. commit d5b6196b532b5746e019b959a79ea0cc013a8fc3 Author: Tathagata Das <tathagata.das1565@gmail.com> Date: Sun Jan 26 20:15:58 2014 -0800 Added spark.streaming.unpersist config and info on StreamingListener interface. commit e3dcb46ab83d7071f611d9b5008ba6bc16c9f951 Author: Tathagata Das <tathagata.das1565@gmail.com> Date: Sun Jan 26 18:41:12 2014 -0800 Fixed docs on StreamingContext.getOrCreate. commit 6c29524639463f11eec721e4d17a9d7159f2944b Author: Tathagata Das <tathagata.das1565@gmail.com> Date: Thu Jan 23 18:49:39 2014 -0800 Added example and figure for window operations, and links to Kafka and Flume API docs. commit f06b964a51bb3b21cde2ff8bdea7d9785f6ce3a9 Author: Tathagata Das <tathagata.das1565@gmail.com> Date: Wed Jan 22 22:49:12 2014 -0800 Fixed missing endhighlight tag in the MLlib guide. commit 036a7d46187ea3f2a0fb8349ef78f10d6c0b43a9 Merge: eab351d a1cd185 Author: Tathagata Das <tathagata.das1565@gmail.com> Date: Wed Jan 22 22:17:42 2014 -0800 Merge remote-tracking branch 'apache/master' into docs-update commit eab351d05c0baef1d4b549e1581310087158d78d Author: Tathagata Das <tathagata.das1565@gmail.com> Date: Wed Jan 22 22:17:15 2014 -0800 Update Spark Streaming Programming Guide.	2014-01-28 21:51:05 -08:00
Reynold Xin	84670f2715	Merge pull request #466 from liyinan926/file-overwrite-new Allow files added through SparkContext.addFile() to be overwritten This is useful for the cases when a file needs to be refreshed and downloaded by the executors periodically. For example, a possible use case is: the driver periodically renews a Hadoop delegation token and writes it to a token file. The token file needs to be downloaded by the executors whenever it gets renewed. However, the current implementation throws an exception when the target file exists and its contents do not match those of the new source. This PR adds an option to allow files to be overwritten to support use cases similar to the above.	2014-01-27 17:08:35 -08:00
Andrew Ash	069bb94206	Clarify spark.default.parallelism It's the task count across the cluster, not per worker, per machine, per core, or anything else.	2014-01-21 14:49:35 -08:00
Patrick Wendell	c324ac10ee	Force use of LZF when spilling data	2014-01-20 19:00:48 -08:00
Patrick Wendell	cdb003e376	Removing docs on akka options	2014-01-20 16:40:58 -08:00
Yinan Li	584323c6b1	Addressed comments from Reynold Signed-off-by: Yinan Li <liyinan926@gmail.com>	2014-01-18 21:28:17 -08:00
Patrick Wendell	bf5699543b	Merge pull request #462 from mateiz/conf-file-fix Remove Typesafe Config usage and conf files to fix nested property names With Typesafe Config we had the subtle problem of no longer allowing nested property names, which are used for a few of our properties: http://apache-spark-developers-list.1001551.n3.nabble.com/Config-properties-broken-in-master-td208.html This PR is for branch 0.9 but should be added into master too. (cherry picked from commit `34e911ce9a`) Signed-off-by: Patrick Wendell <pwendell@gmail.com>	2014-01-18 16:20:00 -08:00
Yinan Li	fd833e7ab1	Allow files added through SparkContext.addFile() to be overwritten This is useful for the cases when a file needs to be refreshed and downloaded by the executors periodically. Signed-off-by: Yinan Li <liyinan926@gmail.com>	2014-01-18 15:26:59 -08:00
Patrick Wendell	0984647aae	Enable compression by default for spills	2014-01-13 23:25:25 -08:00
Patrick Wendell	c3816de504	Changing option wording per discussion with Andrew	2014-01-13 13:25:06 -08:00
Patrick Wendell	5d61e051c2	Improvements to external sorting 1. Adds the option of compressing outputs. 2. Adds batching to the serialization to prevent OOM on the read side. 3. Slight renaming of config options. 4. Use Spark's buffer size for reads in addition to writes.	2014-01-13 12:21:39 -08:00
Patrick Wendell	2802cc80bc	Disable shuffle file consolidation by default	2014-01-12 19:16:43 -08:00
Patrick Wendell	d37408f39c	Merge pull request #377 from andrewor14/master External Sorting for Aggregator and CoGroupedRDDs (Revisited) (This pull request is re-opened from https://github.com/apache/incubator-spark/pull/303, which was closed because Jenkins / github was misbehaving) The target issue for this patch is the out-of-memory exceptions triggered by aggregate operations such as reduce, groupBy, join, and cogroup. The existing AppendOnlyMap used by these operations resides purely in memory, and grows with the size of the input data until the amount of allocated memory is exceeded. Under large workloads, this problem is aggravated by the fact that OOM frequently occurs only after a very long (> 1 hour) map phase, in which case the entire job must be restarted. The solution is to spill the contents of this map to disk once a certain memory threshold is exceeded. This functionality is provided by ExternalAppendOnlyMap, which additionally sorts this buffer before writing it out to disk, and later merges these buffers back in sorted order. Under normal circumstances in which OOM is not triggered, ExternalAppendOnlyMap is simply a wrapper around AppendOnlyMap and incurs little overhead. Only when the memory usage is expected to exceed the given threshold does ExternalAppendOnlyMap spill to disk.	2014-01-10 16:25:01 -08:00
Andrew Or	2e393cd5fd	Update documentation for externalSorting	2014-01-10 15:45:38 -08:00
Andrew Or	e4c51d2113	Address Patrick's and Reynold's comments Aside from trivial formatting changes, use nulls instead of Options for DiskMapIterator, and add documentation for spark.shuffle.externalSorting and spark.shuffle.memoryFraction. Also, set spark.shuffle.memoryFraction to 0.3, and spark.storage.memoryFraction = 0.6.	2014-01-10 15:09:51 -08:00
Patrick Wendell	460f655cc6	Enable shuffle consolidation by default. Bump this to being enabled for 0.9.0.	2014-01-09 22:42:50 -08:00
Patrick Wendell	112c0a1776	Fixing config option "retained_stages" => "retainedStages". This is a very esoteric option and it's out of sync with the style we use. So it seems fitting to fix it for 0.9.0.	2014-01-08 21:16:16 -08:00
Matei Zaharia	2c421749ea	Address review comments	2014-01-07 19:30:23 -05:00
Matei Zaharia	d8bcc8e9a0	Add way to limit default # of cores used by applications on standalone mode Also documents the spark.deploy.spreadOut option.	2014-01-07 14:35:52 -05:00
Prashant Sharma	c729fa7c8e	formatting related fixes suggested by Patrick.	2014-01-07 13:08:16 +05:30
Prashant Sharma	b84dc780d3	Allow configuration to be printed in logs for diagnosis.	2014-01-07 13:01:43 +05:30
Prashant Sharma	b3018811e1	Allow users to set arbitrary akka configurations via spark conf.	2014-01-07 13:01:43 +05:30
Andrew Ash	2dd4fb5698	Clarify spark.cores.max It controls the count of cores across the cluster, not on a per-machine basis.	2014-01-06 09:01:46 -08:00
Matei Zaharia	0fa5809768	Updated docs for SparkConf and handled review comments	2013-12-30 22:17:28 -05:00
Prashant Sharma	d3090b79a5	A few corrections to documentation.	2013-12-12 10:12:06 +05:30
Prashant Sharma	603af51bb5	Merge branch 'master' into akka-bug-fix Conflicts: core/pom.xml core/src/main/scala/org/apache/spark/scheduler/DAGScheduler.scala pom.xml project/SparkBuild.scala streaming/pom.xml yarn/src/main/scala/org/apache/spark/deploy/yarn/YarnAllocationHandler.scala	2013-12-11 10:21:53 +05:30
Aaron Davidson	cb6ac8aafb	Correct spelling error in configuration.md	2013-12-07 01:40:01 -08:00
Patrick Wendell	7a1d1c93b8	Minor formatting fix in config file	2013-12-06 20:28:22 -08:00
Patrick Wendell	b9451acdf4	Adding disclaimer for shuffle file consolidation	2013-12-06 19:25:28 -08:00
Patrick Wendell	1450b8ef87	Small changes from Matei review	2013-12-04 18:49:32 -08:00
Patrick Wendell	b1c6fa1584	Document missing configs and set shuffle consolidation to false.	2013-12-04 18:39:34 -08:00
Prashant Sharma	54862af5ee	Improvements from the review comments and followed Boy Scout Rule.	2013-11-27 14:26:28 +05:30
Prashant Sharma	dca946ff67	Documenting the newly added spark properties.	2013-11-26 20:47:38 +05:30
Reynold Xin	f628804c02	Merge pull request #76 from pwendell/master Clarify compression property. Clarifies that this governs compression of internal data, not input data or output data.	2013-10-18 23:19:42 -07:00
Patrick Wendell	6b62836285	Clarify compression property. Clarifies that this governs compression of internal data, not input data or output data.	2013-10-18 23:08:44 -07:00

90 commits