Commit graph

186 commits

Author SHA1 Message Date
nemccarthy 4cca3917dc [SPARK-6313] Add config option to disable file locks/fetchFile cache to support NFS mounts.

This is a workaround for now, with the goal of finding a more permanent solution.
https://issues.apache.org/jira/browse/SPARK-6313
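
For illustration, a minimal spark-shell-style sketch of how the new option might be applied, assuming it is read from SparkConf like any other property (the property name is taken from the squashed commits below):

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Sketch: turn off the fetchFile cache (and its lock files) when executors share
// their working directory over an NFS mount; the cache stays enabled by default.
val conf = new SparkConf()
  .setAppName("nfs-friendly-app")
  .set("spark.files.useFetchCache", "false")
val sc = new SparkContext(conf)
```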

Author: nemccarthy <nathan@nemccarthy.me>

Closes #5036 from nemccarthy/master and squashes the following commits:

2eaaf42 [nemccarthy] [SPARK-6313] Update config wording doc for spark.files.useFetchCache
5de7eb4 [nemccarthy] [SPARK-6313] Add config option to disable file locks/fetchFile cache to support NFS mounts
2015-03-17 09:33:11 -07:00
Brennon York 127268bc39 [SPARK-6329][Docs]: Minor doc changes for Mesos and TOC
Updated the configuration docs to cover the minor items that Reynold had left over from SPARK-1182; specifically, I updated the `running-on-mesos` link to point directly to `running-on-mesos#configuration` and upgraded the `yarn`, `mesos`, etc. bullets to `<h5>` tags in the hope that they'll get pushed into the TOC.

Author: Brennon York <brennon.york@capitalone.com>

Closes #5022 from brennonyork/SPARK-6329 and squashes the following commits:

42a10a9 [Brennon York] minor doc fixes
2015-03-14 17:28:13 +00:00
Tathagata Das cd3b68d93a [SPARK-6128][Streaming][Documentation] Updates to Spark Streaming Programming Guide
Updates to the documentation are as follows:

- Added information on Kafka Direct API and Kafka Python API
- Added joins to the main streaming guide
- Improved details on the fault-tolerance semantics

Generated docs located here
http://people.apache.org/~tdas/spark-1.3.0-temp-docs/streaming-programming-guide.html#fault-tolerance-semantics

More things to add:
- Configuration for Kafka receive rate
- Maybe add concurrentJobs

Author: Tathagata Das <tathagata.das1565@gmail.com>

Closes #4956 from tdas/streaming-guide-update-1.3 and squashes the following commits:

819408c [Tathagata Das] Minor fixes.
debe484 [Tathagata Das] Added DataFrames and MLlib
380cf8d [Tathagata Das] Fix link
04167a6 [Tathagata Das] Merge remote-tracking branch 'apache-github/master' into streaming-guide-update-1.3
0b77486 [Tathagata Das] Updates based on Josh's comments.
86c4c2a [Tathagata Das] Updated streaming guides
82de92a [Tathagata Das] Add Kafka to Python api docs
2015-03-11 18:48:21 -07:00
Andrew Or 258d154c9f [SPARK-6048] SparkConf should not translate deprecated configs on set
There are multiple issues with translating on set, as outlined in the JIRA.

This PR reverts the translation logic added to `SparkConf`. In the future, after the 1.3.0 release we will figure out a way to reorganize the internal structure more elegantly. For now, let's preserve the existing semantics of `SparkConf` since it's a public interface. Unfortunately this means duplicating some code for now, but this is all internal and we can always clean it up later.
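
A minimal sketch of the preserved `SparkConf` semantics, assuming no key rewriting happens at set time (as described above; the deprecated/replacement key pair comes from SPARK-2996 elsewhere in this log):

```scala
import org.apache.spark.SparkConf

val conf = new SparkConf()
conf.set("spark.yarn.user.classpath.first", "true")           // deprecated key is stored verbatim
assert(conf.get("spark.yarn.user.classpath.first") == "true") // what you set is what you read back
// Translation to the replacement key (spark.executor.userClassPathFirst) now
// happens when the value is consumed, not when it is set.
```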

Author: Andrew Or <andrew@databricks.com>

Closes #4799 from andrewor14/conf-set-translate and squashes the following commits:

11c525b [Andrew Or] Move warning to driver
10e77b5 [Andrew Or] Add documentation for deprecation precedence
a369cb1 [Andrew Or] Merge branch 'master' of github.com:apache/spark into conf-set-translate
c26a9e3 [Andrew Or] Revert all translate logic in SparkConf
fef6c9c [Andrew Or] Restore deprecation logic for spark.executor.userClassPathFirst
94b4dfa [Andrew Or] Translate on get, not set
2015-03-02 16:36:42 -08:00
Li Zhihui 10094a523e Modify default value description for spark.scheduler.minRegisteredResourcesRatio on docs.
The configuration is not currently supported in Mesos mode.
See https://github.com/apache/spark/pull/1462
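
As a hedged illustration of the setting being documented (a sketch, not the doc text; the ratio is a fraction of registered resources to wait for before scheduling begins):

```scala
import org.apache.spark.SparkConf

// Sketch: wait until 80% of the requested resources have registered before
// scheduling starts (standalone/YARN; not supported on Mesos per the note above).
val conf = new SparkConf().set("spark.scheduler.minRegisteredResourcesRatio", "0.8")
```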

Author: Li Zhihui <zhihui.li@intel.com>

Closes #4781 from li-zhihui/fixdocconf and squashes the following commits:

63e7a44 [Li Zhihui] Modify default value description for spark.scheduler.minRegisteredResourcesRatio on docs.
2015-02-26 13:07:49 -08:00
Brennon York 46a044a36a [SPARK-1182][Docs] Sort the configuration parameters in configuration.md
Sorts all configuration options present on the `configuration.md` page to ease readability.

Author: Brennon York <brennon.york@capitalone.com>

Closes #3863 from brennonyork/SPARK-1182 and squashes the following commits:

5696f21 [Brennon York] fixed merge conflict with port comments
81a7b10 [Brennon York] capitalized A in Allocation
e240486 [Brennon York] moved all spark.mesos properties into the running-on-mesos doc
7de5f75 [Brennon York] moved serialization from application to compression and serialization section
a16fec0 [Brennon York] moved shuffle settings from network to shuffle
f8fa286 [Brennon York] sorted encryption category
1023f15 [Brennon York] moved initialExecutors
e9d62aa [Brennon York] fixed akka.heartbeat.interval
25e6f6f [Brennon York] moved spark.executer.user*
4625ade [Brennon York] added spark.executor.extra* items
4ee5648 [Brennon York] fixed merge conflicts
1b49234 [Brennon York] sorting mishap
2b5758b [Brennon York] sorting mishap
6fbdf42 [Brennon York] sorting mishap
55dc6f8 [Brennon York] sorted security
ec34294 [Brennon York] sorted dynamic allocation
2a7c4a3 [Brennon York] sorted scheduling
aa9acdc [Brennon York] sorted networking
a4380b8 [Brennon York] sorted execution behavior
27f3919 [Brennon York] sorted compression and serialization
80a5bbb [Brennon York] sorted spark ui
3f32e5b [Brennon York] sorted shuffle behavior
6c51b38 [Brennon York] sorted runtime environment
efe9d6f [Brennon York] sorted application properties
2015-02-25 16:12:56 -08:00
Sean Owen 7d8e6a2e44 SPARK-5930 [DOCS] Documented default of spark.shuffle.io.retryWait is confusing
Clarify default max wait in spark.shuffle.io.retryWait docs

CC andrewor14

Author: Sean Owen <sowen@cloudera.com>

Closes #4769 from srowen/SPARK-5930 and squashes the following commits:

ae2792b [Sean Owen] Clarify default max wait in spark.shuffle.io.retryWait docs
2015-02-25 12:20:44 -08:00
CodingCat 242d49584c [SPARK-5724] fix the misconfiguration in AkkaUtils
https://issues.apache.org/jira/browse/SPARK-5724

In AkkaUtils, we set several failure-detector-related parameters as follows:

```
val akkaConf = ConfigFactory.parseMap(conf.getAkkaConf.toMap[String, String])
      .withFallback(akkaSslConfig).withFallback(ConfigFactory.parseString(
      s"""
      |akka.daemonic = on
      |akka.loggers = [""akka.event.slf4j.Slf4jLogger""]
      |akka.stdout-loglevel = "ERROR"
      |akka.jvm-exit-on-fatal-error = off
      |akka.remote.require-cookie = "$requireCookie"
      |akka.remote.secure-cookie = "$secureCookie"
      |akka.remote.transport-failure-detector.heartbeat-interval = $akkaHeartBeatInterval s
      |akka.remote.transport-failure-detector.acceptable-heartbeat-pause = $akkaHeartBeatPauses s
      |akka.remote.transport-failure-detector.threshold = $akkaFailureDetector
      |akka.actor.provider = "akka.remote.RemoteActorRefProvider"
      |akka.remote.netty.tcp.transport-class = "akka.remote.transport.netty.NettyTransport"
      |akka.remote.netty.tcp.hostname = "$host"
      |akka.remote.netty.tcp.port = $port
      |akka.remote.netty.tcp.tcp-nodelay = on
      |akka.remote.netty.tcp.connection-timeout = $akkaTimeout s
      |akka.remote.netty.tcp.maximum-frame-size = ${akkaFrameSize}B
      |akka.remote.netty.tcp.execution-pool-size = $akkaThreads
      |akka.actor.default-dispatcher.throughput = $akkaBatchSize
      |akka.log-config-on-start = $logAkkaConfig
      |akka.remote.log-remote-lifecycle-events = $lifecycleEvents
      |akka.log-dead-letters = $lifecycleEvents
      |akka.log-dead-letters-during-shutdown = $lifecycleEvents
      """.stripMargin))

```

Actually, there is no parameter named "akka.remote.transport-failure-detector.threshold"
(see: http://doc.akka.io/docs/akka/2.3.4/general/configuration.html);
what does exist is "akka.remote.watch-failure-detector.threshold".

Author: CodingCat <zhunansjtu@gmail.com>

Closes #4512 from CodingCat/SPARK-5724 and squashes the following commits:

bafe56e [CodingCat] fix the grammar in configuration doc
338296e [CodingCat] remove failure-detector related info
8bfcfd4 [CodingCat] fix the misconfiguration in AkkaUtils
2015-02-23 11:29:25 +00:00
Ilya Ganelin 6bddc40353 SPARK-5570: No docs stating that `new SparkConf().set("spark.driver.memory", ...)` will not work
I've updated the documentation to reflect the true behavior of this setting in client vs. cluster mode.
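
A short sketch of the behavior being documented (illustrative only; in client mode the driver JVM is already running by the time user code executes):

```scala
import org.apache.spark.{SparkConf, SparkContext}

// In client mode this has no effect on the driver heap, because the JVM has
// already been launched by the time this line runs:
val conf = new SparkConf().set("spark.driver.memory", "4g")
val sc = new SparkContext(conf)

// Instead, set the value before the driver starts, e.g.
//   spark-submit --driver-memory 4g ...
// or put spark.driver.memory in conf/spark-defaults.conf.
```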

Author: Ilya Ganelin <ilya.ganelin@capitalone.com>

Closes #4665 from ilganeli/SPARK-5570 and squashes the following commits:

5d1c8dd [Ilya Ganelin] Added example configuration code
a51700a [Ilya Ganelin] Getting rid of extra spaces
85f7a08 [Ilya Ganelin] Reworded note
5889d43 [Ilya Ganelin] Formatting adjustment
f149ba1 [Ilya Ganelin] Minor updates
1fec7a5 [Ilya Ganelin] Updated to add clarification for other driver properties
db47595 [Ilya Ganelin] Slight formatting update
c899564 [Ilya Ganelin] Merge remote-tracking branch 'upstream/master' into SPARK-5570
17b751d [Ilya Ganelin] Updated documentation for driver-memory to reflect its true behavior in client vs cluster mode
2015-02-19 15:53:20 -08:00
Marcelo Vanzin 20a6013106 [SPARK-2996] Implement userClassPathFirst for driver, yarn.
Yarn's config option `spark.yarn.user.classpath.first` does not work the same way as
`spark.files.userClassPathFirst`; Yarn's version is a lot more dangerous, in that it
modifies the system classpath, instead of restricting the changes to the user's class
loader. So this change implements the behavior of the latter for Yarn, and deprecates
the more dangerous choice.

To be able to achieve feature-parity, I also implemented the option for drivers (the existing
option only applies to executors). So now there are two options, each controlling whether
to apply userClassPathFirst to the driver or executors. The old option was deprecated, and
aliased to the new one (`spark.executor.userClassPathFirst`).

The existing "child-first" class loader also had to be fixed. It didn't handle resources, and it
was also doing some things that ended up causing JVM errors depending on how things
were being called.
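
For reference, a sketch of how the two resulting options might be enabled (property names as given above; this is illustrative, not the commit's own example):

```scala
import org.apache.spark.SparkConf

// Give classes from the user's jars precedence in the driver and executor class
// loaders, without touching the system classpath (unlike the deprecated
// spark.yarn.user.classpath.first behavior).
val conf = new SparkConf()
  .set("spark.driver.userClassPathFirst", "true")
  .set("spark.executor.userClassPathFirst", "true")
```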

Author: Marcelo Vanzin <vanzin@cloudera.com>

Closes #3233 from vanzin/SPARK-2996 and squashes the following commits:

9cf9cf1 [Marcelo Vanzin] Merge branch 'master' into SPARK-2996
a1499e2 [Marcelo Vanzin] Remove SPARK_HOME propagation.
fa7df88 [Marcelo Vanzin] Remove 'test.resource' file, create it dynamically.
a8c69f1 [Marcelo Vanzin] Review feedback.
cabf962 [Marcelo Vanzin] Merge branch 'master' into SPARK-2996
a1b8d7e [Marcelo Vanzin] Merge branch 'master' into SPARK-2996
3f768e3 [Marcelo Vanzin] Merge branch 'master' into SPARK-2996
2ce3c7a [Marcelo Vanzin] Merge branch 'master' into SPARK-2996
0e6d6be [Marcelo Vanzin] Merge branch 'master' into SPARK-2996
70d4044 [Marcelo Vanzin] Fix pyspark/yarn-cluster test.
0fe7777 [Marcelo Vanzin] Merge branch 'master' into SPARK-2996
0e6ef19 [Marcelo Vanzin] Move class loaders around and make names more meaninful.
fe970a7 [Marcelo Vanzin] Review feedback.
25d4fed [Marcelo Vanzin] Merge branch 'master' into SPARK-2996
3cb6498 [Marcelo Vanzin] Call the right loadClass() method on the parent.
fbb8ab5 [Marcelo Vanzin] Add locking in loadClass() to avoid deadlocks.
2e6c4b7 [Marcelo Vanzin] Mention new setting in documentation.
b6497f9 [Marcelo Vanzin] Merge branch 'master' into SPARK-2996
a10f379 [Marcelo Vanzin] Some feedback.
3730151 [Marcelo Vanzin] Merge branch 'master' into SPARK-2996
f513871 [Marcelo Vanzin] Merge branch 'master' into SPARK-2996
44010b6 [Marcelo Vanzin] Merge branch 'master' into SPARK-2996
7b57cba [Marcelo Vanzin] Remove now outdated message.
5304d64 [Marcelo Vanzin] Merge branch 'master' into SPARK-2996
35949c8 [Marcelo Vanzin] Merge branch 'master' into SPARK-2996
54e1a98 [Marcelo Vanzin] Merge branch 'master' into SPARK-2996
d1273b2 [Marcelo Vanzin] Add test file to rat exclude.
fa1aafa [Marcelo Vanzin] Remove write check on user jars.
89d8072 [Marcelo Vanzin] Cleanups.
a963ea3 [Marcelo Vanzin] Implement spark.driver.userClassPathFirst for standalone cluster mode.
50afa5f [Marcelo Vanzin] Fix Yarn executor command line.
7d14397 [Marcelo Vanzin] Register user jars in executor up front.
7f8603c [Marcelo Vanzin] Fix yarn-cluster mode without userClassPathFirst.
20373f5 [Marcelo Vanzin] Fix ClientBaseSuite.
55c88fa [Marcelo Vanzin] Run all Yarn integration tests via spark-submit.
0b64d92 [Marcelo Vanzin] Add deprecation warning to yarn option.
4a84d87 [Marcelo Vanzin] Fix the child-first class loader.
d0394b8 [Marcelo Vanzin] Add "deprecated configs" to SparkConf.
46d8cf2 [Marcelo Vanzin] Update doc with new option, change name to "userClassPathFirst".
a314f2d [Marcelo Vanzin] Enable driver class path isolation in SparkSubmit.
91f7e54 [Marcelo Vanzin] [yarn] Enable executor class path isolation.
a853e74 [Marcelo Vanzin] Re-work CoarseGrainedExecutorBackend command line arguments.
89522ef [Marcelo Vanzin] Add class path isolation support for Yarn cluster mode.
2015-02-09 21:17:28 -08:00
Andrew Or fe3740c4c8 [SPARK-5636] Ramp up faster in dynamic allocation
A recent patch (#4051) made the initial number of executors default to 0. With this change, any Spark application using dynamic allocation's default settings will ramp up very slowly. Since we never request more executors than are needed to saturate the pending tasks, it is safe to ramp up quickly. The current default of 60 seconds may be too slow.

Author: Andrew Or <andrew@databricks.com>

Closes #4409 from andrewor14/dynamic-allocation-interval and squashes the following commits:

d3cc485 [Andrew Or] Lower request interval
2015-02-06 10:55:13 -08:00
Matei Zaharia 4d74f0601a [SPARK-5608] Improve SEO of Spark documentation pages
- Add meta description tags on some of the most important doc pages
- Shorten the titles of some pages to have more relevant keywords; for
  example there's no reason to have "Spark SQL Programming Guide - Spark
  1.2.0 documentation", we can just say "Spark SQL - Spark 1.2.0
  documentation".

Author: Matei Zaharia <matei@databricks.com>

Closes #4381 from mateiz/docs-seo and squashes the following commits:

4940563 [Matei Zaharia] [SPARK-5608] Improve SEO of Spark documentation pages
2015-02-05 11:12:50 -08:00
Josh Rosen 9a7ce70eab [SPARK-5411] Allow SparkListeners to be specified in SparkConf and loaded when creating SparkContext
This patch introduces a new configuration option, `spark.extraListeners`, that allows SparkListeners to be specified in SparkConf and registered before the SparkContext is initialized.  From the configuration documentation:

> A comma-separated list of classes that implement SparkListener; when initializing SparkContext, instances of these classes will be created and registered with Spark's listener bus. If a class has a single-argument constructor that accepts a SparkConf, that constructor will be called; otherwise, a zero-argument constructor will be called. If no valid constructor can be found, the SparkContext creation will fail with an exception.

The motivation for this patch is to allow monitoring code to be easily injected into existing Spark programs without having to modify those programs' code.
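
A hedged sketch of the usage described above (class and package names are hypothetical; the SparkConf-constructor behavior is the one quoted from the documentation excerpt):

```scala
package com.example

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.scheduler.{SparkListener, SparkListenerJobEnd}

// A listener with a single SparkConf-argument constructor, registered through
// spark.extraListeners so it is attached before any job runs.
class JobEndLogger(conf: SparkConf) extends SparkListener {
  override def onJobEnd(jobEnd: SparkListenerJobEnd): Unit =
    println(s"Job ${jobEnd.jobId} finished in app ${conf.get("spark.app.name")}")
}

object ListenerDemo {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
      .setAppName("listener-demo")
      .set("spark.extraListeners", "com.example.JobEndLogger") // fully-qualified class name
    val sc = new SparkContext(conf)
    sc.parallelize(1 to 100).count() // runs a job, which triggers onJobEnd
    sc.stop()
  }
}
```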

Author: Josh Rosen <joshrosen@databricks.com>

Closes #4111 from JoshRosen/SPARK-5190-register-sparklistener-in-sc-constructor and squashes the following commits:

8370839 [Josh Rosen] Two minor fixes after merging with master
6e0122c [Josh Rosen] Merge remote-tracking branch 'origin/master' into SPARK-5190-register-sparklistener-in-sc-constructor
1a5b9a0 [Josh Rosen] Remove SPARK_EXTRA_LISTENERS environment variable.
2daff9b [Josh Rosen] Add a couple of explanatory comments for SPARK_EXTRA_LISTENERS.
b9973da [Josh Rosen] Add test to ensure that conf and env var settings are merged, not overriden.
d6f3113 [Josh Rosen] Use getConstructors() instead of try-catch to find right constructor.
d0d276d [Josh Rosen] Move code into setupAndStartListenerBus() method
b22b379 [Josh Rosen] Instantiate SparkListeners from classes listed in configurations.
9c0d8f1 [Josh Rosen] Revert "[SPARK-5190] Allow SparkListeners to be registered before SparkContext starts."
217ecc0 [Josh Rosen] Revert "Add addSparkListener to JavaSparkContext"
25988f3 [Josh Rosen] Add addSparkListener to JavaSparkContext
163ba19 [Josh Rosen] [SPARK-5190] Allow SparkListeners to be registered before SparkContext starts.
2015-02-04 17:18:03 -08:00
Jacek Lewandowski cfea30037f Spark 3883: SSL support for HttpServer and Akka
SPARK-3883: SSL support for Akka connections and Jetty based file servers.

This story introduced the following changes:
- Introduced SSLOptions object which holds the SSL configuration and can build the appropriate configuration for Akka or Jetty. SSLOptions can be created by parsing SparkConf entries at a specified namespace.
- SSLOptions is created and kept by SecurityManager
- All Akka actor address creation snippets based on interpolated strings were replaced by dedicated methods from AkkaUtils. Those methods select the proper Akka protocol - whether akka.tcp or akka.ssl.tcp
- Added test cases for AkkaUtils, FileServer, SSLOptions and SecurityManager
- Added a way for executors and the driver to use node-local SSL configuration in standalone mode. This is done by specifying spark.ssl.useNodeLocalConf in SparkConf (see the sketch after this list).
- Made CoarseGrainedExecutorBackend not overwrite the settings that are part of the executor startup configuration - they are passed from the Worker anyway
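
A sketch of what the SparkConf wiring might look like; apart from spark.ssl.useNodeLocalConf, the spark.ssl.* key names below are assumptions about the namespace rather than names taken from this commit:

```scala
import org.apache.spark.SparkConf

val conf = new SparkConf()
  .set("spark.ssl.enabled", "true")                 // assumed key in the spark.ssl namespace
  .set("spark.ssl.keyStore", "/path/to/keystore")   // assumed key
  .set("spark.ssl.keyStorePassword", "changeit")    // assumed key
  .set("spark.ssl.useNodeLocalConf", "true")        // named in this commit: prefer node-local SSL config
```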

Refer to https://github.com/apache/spark/pull/3571 for discussion and details

Author: Jacek Lewandowski <lewandowski.jacek@gmail.com>
Author: Jacek Lewandowski <jacek.lewandowski@datastax.com>

Closes #3571 from jacek-lewandowski/SPARK-3883-master and squashes the following commits:

9ef4ed1 [Jacek Lewandowski] Merge pull request #2 from jacek-lewandowski/SPARK-3883-docs2
fb31b49 [Jacek Lewandowski] SPARK-3883: Added SSL setup documentation
2532668 [Jacek Lewandowski] SPARK-3883: Refactored AkkaUtils.protocol method to not use Try
90a8762 [Jacek Lewandowski] SPARK-3883: Refactored methods to resolve Akka address and made it possible to easily configure multiple communication layers for SSL
72b2541 [Jacek Lewandowski] SPARK-3883: A reference to the fallback SSLOptions can be provided when constructing SSLOptions
93050f4 [Jacek Lewandowski] SPARK-3883: SSL support for HttpServer and Akka
2015-02-02 17:27:26 -08:00
Sandy Ryza b2047b55c5 SPARK-4585. Spark dynamic executor allocation should use minExecutors as initial number

Author: Sandy Ryza <sandy@cloudera.com>

Closes #4051 from sryza/sandy-spark-4585 and squashes the following commits:

d1dd039 [Sandy Ryza] Add spark.dynamicAllocation.initialNumExecutors and make min and max not required
b7c59dc [Sandy Ryza] SPARK-4585. Spark dynamic executor allocation should use minExecutors as initial number
2015-02-02 12:27:08 -08:00
Yandu Oppacher 3bead67d59 [SPARK-4387][PySpark] Refactoring python profiling code to make it extensible
This PR is based on #3255, fixing conflicts and code style.

Closes #3255.

Author: Yandu Oppacher <yandu.oppacher@jadedpixel.com>
Author: Davies Liu <davies@databricks.com>

Closes #3901 from davies/refactor-python-profile-code and squashes the following commits:

b4a9306 [Davies Liu] fix tests
4b79ce8 [Davies Liu] add docstring for profiler_cls
2700e47 [Davies Liu] use BasicProfiler as default
349e341 [Davies Liu] more refactor
6a5d4df [Davies Liu] refactor and fix tests
31bf6b6 [Davies Liu] fix code style
0864b5d [Yandu Oppacher] Remove unused method
76a6c37 [Yandu Oppacher] Added a profile collector to accumulate the profilers per stage
9eefc36 [Yandu Oppacher] Fix doc
9ace076 [Yandu Oppacher] Refactor of profiler, and moved tests around
8739aff [Yandu Oppacher] Code review fixes
9bda3ec [Yandu Oppacher] Refactor profiler code
2015-01-28 13:48:06 -08:00
Sean Owen c586b45dd2 SPARK-3852 [DOCS] Document spark.driver.extra* configs
As per the JIRA. I copied the `spark.executor.extra*` text, but removed info that appears to be specific to the `executor` config and not `driver`.

Author: Sean Owen <sowen@cloudera.com>

Closes #4185 from srowen/SPARK-3852 and squashes the following commits:

f60a8a1 [Sean Owen] Document spark.driver.extra* configs
2015-01-25 15:08:35 -08:00
WangTaoTheTonic 2be82b1e66 [SPARK-1507][YARN]specify # cores for ApplicationMaster
Built on top of the changes in https://github.com/apache/spark/pull/3806.

https://issues.apache.org/jira/browse/SPARK-1507

`--driver-cores` and `spark.driver.cores` for all cluster modes and `spark.yarn.am.cores` for yarn client mode.

Author: WangTaoTheTonic <barneystinson@aliyun.com>
Author: WangTao <barneystinson@aliyun.com>

Closes #4018 from WangTaoTheTonic/SPARK-1507 and squashes the following commits:

01419d3 [WangTaoTheTonic] amend the args name
b255795 [WangTaoTheTonic] indet thing
d86557c [WangTaoTheTonic] some comments amend
43c9392 [WangTao] fix compile error
b39a100 [WangTao] specify # cores for ApplicationMaster
2015-01-16 09:16:56 -08:00
uncleGen 39e333ec43 [SPARK-5131][Streaming][DOC]: There is a discrepancy in WAL implementation and configuration doc.
There is a discrepancy between the WAL implementation and the configuration doc.

Author: uncleGen <hustyugm@gmail.com>

Closes #3930 from uncleGen/master-clean-doc and squashes the following commits:

3a4245f [uncleGen] doc typo
8e407d3 [uncleGen] doc typo
2015-01-13 10:07:19 -08:00
lewuathe 1656aae2b4 [SPARK-5073] spark.storage.memoryMapThreshold have two default value
Because the typical OS page size is about 4KB, the default value of spark.storage.memoryMapThreshold is consolidated to 2 * 4096.

Author: lewuathe <lewuathe@me.com>

Closes #3900 from Lewuathe/integrate-memoryMapThreshold and squashes the following commits:

e417acd [lewuathe] [SPARK-5073] Update docs/configuration
834aba4 [lewuathe] [SPARK-5073] Fix style
adcea33 [lewuathe] [SPARK-5073] Integrate memory map threshold to 2MB
fcce2e5 [lewuathe] [SPARK-5073] spark.storage.memoryMapThreshold have two default value
2015-01-11 13:50:42 -08:00
Reynold Xin bbcba3a943 [SPARK-5093] Set spark.network.timeout to 120s consistently.
Author: Reynold Xin <rxin@databricks.com>

Closes #3903 from rxin/timeout-120 and squashes the following commits:

7c2138e [Reynold Xin] [SPARK-5093] Set spark.network.timeout to 120s consistently.
2015-01-05 15:19:53 -08:00
Varun Saxena d3f07fd23c [SPARK-4688] Have a single shared network timeout in Spark
[SPARK-4688] Have a single shared network timeout in Spark

Author: Varun Saxena <vsaxena.varun@gmail.com>
Author: varunsaxena <vsaxena.varun@gmail.com>

Closes #3562 from varunsaxena/SPARK-4688 and squashes the following commits:

6e97f72 [Varun Saxena] [SPARK-4688] Single shared network timeout
cd783a2 [Varun Saxena] SPARK-4688
d6f8c29 [Varun Saxena] SCALA-4688
9562b15 [Varun Saxena] SPARK-4688
a75f014 [varunsaxena] SPARK-4688
594226c [varunsaxena] SPARK-4688
2015-01-05 10:32:37 -08:00
Josh Rosen 939ba1f8f6 [SPARK-4835] Disable validateOutputSpecs for Spark Streaming jobs
This patch disables output spec. validation for jobs launched through Spark Streaming, since this interferes with checkpoint recovery.

Hadoop OutputFormats have a `checkOutputSpecs` method which performs certain checks prior to writing output, such as checking whether the output directory already exists.  SPARK-1100 added checks for FileOutputFormat, SPARK-1677 (#947) added a SparkConf configuration to disable these checks, and SPARK-2309 (#1088) extended these checks to run for all OutputFormats, not just FileOutputFormat.

In Spark Streaming, we might have to re-process a batch during checkpoint recovery, so `save` actions may be called multiple times.  In addition to `DStream`'s own save actions, users might use `transform` or `foreachRDD` and call the `RDD` and `PairRDD` save actions.  When output spec. validation is enabled, the second call to these actions will fail due to the existing output.

This patch automatically disables output spec. validation for jobs submitted by the Spark Streaming scheduler.  This is done by using Scala's `DynamicVariable` to propagate the bypass setting without having to mutate SparkConf or introduce a global variable.
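
A rough sketch of the `DynamicVariable` pattern referred to above; the names here are illustrative, not the actual Spark internals:

```scala
import scala.util.DynamicVariable

// Thread-local, dynamically scoped flag: validation stays enabled by default and
// is bypassed only inside the wrapped block, with no mutation of SparkConf.
object OutputSpecValidation {
  private val disabled = new DynamicVariable[Boolean](false)

  def isDisabled: Boolean = disabled.value

  def runWithValidationDisabled[T](body: => T): T =
    disabled.withValue(true)(body)
}

// The streaming scheduler would submit its output jobs roughly like:
//   OutputSpecValidation.runWithValidationDisabled { rdd.saveAsTextFile(path) }
```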

Author: Josh Rosen <joshrosen@databricks.com>

Closes #3832 from JoshRosen/SPARK-4835 and squashes the following commits:

36eaf35 [Josh Rosen] Add comment explaining use of transform() in test.
6485cf8 [Josh Rosen] Add test case in Streaming; fix bug for transform()
7b3e06a [Josh Rosen] Remove Streaming-specific setting to undo this change; update conf. guide
bf9094d [Josh Rosen] Revise disableOutputSpecValidation() comment to not refer to Spark Streaming.
e581d17 [Josh Rosen] Deduplicate isOutputSpecValidationEnabled logic.
762e473 [Josh Rosen] [SPARK-4835] Disable validateOutputSpecs for Spark Streaming jobs.
2015-01-04 20:26:18 -08:00
wangxiaojing 6645e52580 [SPARK-4982][DOC] spark.ui.retainedJobs description is wrong in Spark UI configuration guide
Author: wangxiaojing <u9jing@gmail.com>

Closes #3818 from wangxiaojing/SPARK-4982 and squashes the following commits:

fe2ad5f [wangxiaojing] change stages to jobs
2014-12-29 10:45:26 -08:00
Aaron Davidson fbca6b6ce2 [SPARK-4864] Add documentation to Netty-based configs
Author: Aaron Davidson <aaron@databricks.com>

Closes #3713 from aarondav/netty-configs and squashes the following commits:

8a8b373 [Aaron Davidson] Address Patrick's comments
3b1f84e [Aaron Davidson] [SPARK-4864] Add documentation to Netty-based configs
2014-12-22 13:09:22 -08:00
Andrew Or 15c03e1e0e [SPARK-4140] Document dynamic allocation
Once the external shuffle service is also documented, the dynamic allocation section will link to it. Let me know if the whole dynamic allocation section should be moved to its own page; I personally think the organization might be cleaner that way.
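
A minimal sketch of enabling the feature being documented; the companion external-shuffle-service key is an assumption here, mentioned only because the text says that service will be documented separately:

```scala
import org.apache.spark.SparkConf

val conf = new SparkConf()
  .set("spark.dynamicAllocation.enabled", "true") // the property documented in this patch
  .set("spark.shuffle.service.enabled", "true")   // assumed name of the external shuffle service toggle
```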

This patch builds on top of oza's work in #3689.

aarondav pwendell

Author: Andrew Or <andrew@databricks.com>
Author: Tsuyoshi Ozawa <ozawa.tsuyoshi@gmail.com>

Closes #3731 from andrewor14/document-dynamic-allocation and squashes the following commits:

1281447 [Andrew Or] Address a few comments
b9843f2 [Andrew Or] Document the configs as well
246fb44 [Andrew Or] Merge branch 'SPARK-4839' of github.com:oza/spark into document-dynamic-allocation
8c64004 [Andrew Or] Add documentation for dynamic allocation (without configs)
6827b56 [Tsuyoshi Ozawa] Fixing a documentation of spark.dynamicAllocation.enabled.
53cff58 [Tsuyoshi Ozawa] Adding a documentation about dynamic resource allocation.
2014-12-19 19:36:20 -08:00
Ryan Williams 8176b7a02e [SPARK-4668] Fix some documentation typos.
Author: Ryan Williams <ryan.blake.williams@gmail.com>

Closes #3523 from ryan-williams/tweaks and squashes the following commits:

d2eddaa [Ryan Williams] code review feedback
ce27fc1 [Ryan Williams] CoGroupedRDD comment nit
c6cfad9 [Ryan Williams] remove unnecessary if statement
b74ea35 [Ryan Williams] comment fix
b0221f0 [Ryan Williams] fix a gendered pronoun
c71ffed [Ryan Williams] use names on a few boolean parameters
89954aa [Ryan Williams] clarify some comments in {Security,Shuffle}Manager
e465dac [Ryan Williams] Saved building-spark.md with Dillinger.io
83e8358 [Ryan Williams] fix pom.xml typo
dc4662b [Ryan Williams] typo fixes in tuning.md, configuration.md
2014-12-15 14:52:17 -08:00
Tathagata Das b004150adb [SPARK-4806] Streaming doc update for 1.2
Important updates to the streaming programming guide
- Make the fault-tolerance properties easier to understand, with information about write ahead logs
- Update the information about deploying the Spark Streaming app with information about driver HA
- Update Receiver guide to discuss reliable vs unreliable receivers.

Author: Tathagata Das <tathagata.das1565@gmail.com>
Author: Josh Rosen <joshrosen@databricks.com>
Author: Josh Rosen <rosenville@gmail.com>

Closes #3653 from tdas/streaming-doc-update-1.2 and squashes the following commits:

f53154a [Tathagata Das] Addressed Josh's comments.
ce299e4 [Tathagata Das] Minor update.
ca19078 [Tathagata Das] Minor change
f746951 [Tathagata Das] Mentioned performance problem with WAL
7787209 [Tathagata Das] Merge branch 'streaming-doc-update-1.2' of github.com:tdas/spark into streaming-doc-update-1.2
2184729 [Tathagata Das] Updated Kafka and Flume guides with reliability information.
2f3178c [Tathagata Das] Added more information about writing reliable receivers in the custom receiver guide.
91aa5aa [Tathagata Das] Improved API Docs menu
5707581 [Tathagata Das] Added Pythn API badge
b9c8c24 [Tathagata Das] Merge pull request #26 from JoshRosen/streaming-programming-guide
b8c8382 [Josh Rosen] minor fixes
a4ef126 [Josh Rosen] Restructure parts of the fault-tolerance section to read a bit nicer when skipping over the headings
65f66cd [Josh Rosen] Fix broken link to fault-tolerance semantics section.
f015397 [Josh Rosen] Minor grammar / pluralization fixes.
3019f3a [Josh Rosen] Fix minor Markdown formatting issues
aa8bb87 [Tathagata Das] Small update.
195852c [Tathagata Das] Updated based on Josh's comments, updated receiver reliability and deploying section, and also updated configuration.
17b99fb [Tathagata Das] Merge remote-tracking branch 'apache-github/master' into streaming-doc-update-1.2
a0217c0 [Tathagata Das] Changed Deploying menu layout
67fcffc [Tathagata Das] Added cluster mode + supervise example to submitting application guide.
e45453b [Tathagata Das] Update streaming guide, added deploying section.
192c7a7 [Tathagata Das] Added more info about Python API, and rewrote the checkpointing section.
2014-12-11 06:21:23 -08:00
Sandy Ryza cda94d15ea SPARK-4770. [DOC] [YARN] spark.scheduler.minRegisteredResourcesRatio documented default is incorrect for YARN

Author: Sandy Ryza <sandy@cloudera.com>

Closes #3624 from sryza/sandy-spark-4770 and squashes the following commits:

bd81a3a [Sandy Ryza] SPARK-4770. [DOC] [YARN] spark.scheduler.minRegisteredResourcesRatio documented default is incorrect for YARN
2014-12-08 16:28:36 -08:00
Kay Ousterhout d9a148ba6a [SPARK-4686] Link to allowed master URLs is broken
The link points to the old Scala programming guide; it should point to the submitting applications page.

This should be backported to 1.1.2 (it's been broken since 1.0).

Author: Kay Ousterhout <kayousterhout@gmail.com>

Closes #3542 from kayousterhout/SPARK-4686 and squashes the following commits:

a8fc43b [Kay Ousterhout] [SPARK-4686] Link to allowed master URLs is broken
2014-12-02 09:06:02 -08:00
arahuja d240760191 [SPARK-4344][DOCS] adding documentation on spark.yarn.user.classpath.first
The documentation for the two parameters is the same, with a pointer from the standalone parameter to the YARN parameter.

Author: arahuja <aahuja11@gmail.com>

Closes #3209 from arahuja/yarn-classpath-first-param and squashes the following commits:

51cb9b2 [arahuja] [SPARK-4344][DOCS] adding documentation for YARN on userClassPathFirst
2014-11-25 08:23:41 -06:00
WangTao e421072da0 [SPARK-3722][Docs]minor improvement and fix in docs
https://issues.apache.org/jira/browse/SPARK-3722

Author: WangTao <barneystinson@aliyun.com>

Closes #2579 from WangTaoTheTonic/docsWork and squashes the following commits:

6f91cec [WangTao] use more wording express
29d22fa [WangTao] delete the specified version link
34cb4ea [WangTao] Update running-on-yarn.md
4ee1a26 [WangTao] minor improvement and fix in docs
2014-11-14 08:09:42 -06:00
Sandy Ryza c6f4e70421 SPARK-4230. Doc for spark.default.parallelism is incorrect
Author: Sandy Ryza <sandy@cloudera.com>

Closes #3107 from sryza/sandy-spark-4230 and squashes the following commits:

37a1d19 [Sandy Ryza] Clear up a couple things
34d53de [Sandy Ryza] SPARK-4230. Doc for spark.default.parallelism is incorrect
2014-11-10 12:40:41 -08:00
jay@apache.org 868cd4c3ca SPARK-4040. Update documentation to exemplify use of local (n) value, fo...
This is a minor docs update which helps to clarify the way local[n] is used for streaming apps.
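
A small sketch of the point being documented, assuming the usual word-count-style streaming setup:

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

// local[n] with n > 1: one thread runs the receiver, the rest process the data.
// Plain "local" or local[1] would starve the processing side.
val conf = new SparkConf().setMaster("local[2]").setAppName("NetworkWordCount")
val ssc = new StreamingContext(conf, Seconds(1))
```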

Author: jay@apache.org <jayunit100>

Closes #2964 from jayunit100/SPARK-4040 and squashes the following commits:

35b5a5e [jay@apache.org] SPARK-4040: Update documentation to exemplify use of local (n) value.
2014-11-05 15:45:34 -08:00
Aaron Davidson 1ae51f6dc7 [SPARK-4183] Enable NettyBlockTransferService by default
Note that we're turning this on for at least the first part of the QA period as a trial. We want to enable this (and deprecate the NioBlockTransferService) as soon as possible in the hopes that NettyBlockTransferService will be more stable and easier to maintain. We will turn it off if we run into major issues.

Author: Aaron Davidson <aaron@databricks.com>

Closes #3049 from aarondav/enable-netty and squashes the following commits:

bb981cc [Aaron Davidson] [SPARK-4183] Enable NettyBlockTransferService by default
2014-11-02 18:14:57 -08:00
Davies Liu 6181577e99 [SPARK-3466] Limit size of results that a driver collects for each action
Right now, operations like collect() and take() can crash the driver with an OOM if they bring back too much data.

This PR introduces spark.driver.maxResultSize; once it is set, the driver will abort a job if the job's result is larger than this limit.

By default, it's 1g (for backward compatibility in most cases).

In local mode, the driver and executor share the same JVM, so the default setting cannot protect the JVM from OOM.
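
A spark-shell-style sketch of the new limit (the value syntax follows the "1g" default mentioned above):

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Abort any job whose collected results would exceed 2g instead of risking a
// driver OOM; the default limit is 1g.
val conf = new SparkConf().set("spark.driver.maxResultSize", "2g")
val sc = new SparkContext(conf)
```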

cc mateiz

Author: Davies Liu <davies@databricks.com>

Closes #3003 from davies/collect and squashes the following commits:

248ed5e [Davies Liu] fix compile
272522e [Davies Liu] address comments
2c35773 [Davies Liu] add sizes in message of abort()
5d62303 [Davies Liu] address comments
bc3c077 [Davies Liu] Merge branch 'master' of github.com:apache/spark into collect
11f97c5 [Davies Liu] address comments
47b144f [Davies Liu] check the size of result before send and fetch
3d81af2 [Davies Liu] address comments
ca8267d [Davies Liu] limit the size of data by collect
2014-11-02 00:03:51 -07:00
Patrick Wendell 7894de276b Revert "[SPARK-4183] Enable NettyBlockTransferService by default"
This reverts commit 59e626c701.
2014-11-01 15:18:58 -07:00
Aaron Davidson 59e626c701 [SPARK-4183] Enable NettyBlockTransferService by default
Note that we're turning this on for at least the first part of the QA period as a trial. We want to enable this (and deprecate the NioBlockTransferService) as soon as possible in the hopes that NettyBlockTransferService will be more stable and easier to maintain. We will turn it off if we run into major issues.

Author: Aaron Davidson <aaron@databricks.com>

Closes #3049 from aarondav/enable-netty and squashes the following commits:

bb981cc [Aaron Davidson] [SPARK-4183] Enable NettyBlockTransferService by default
2014-11-01 13:15:24 -07:00
Josh Rosen 9530316887 [SPARK-2321] Stable pull-based progress / status API
This pull request is a first step towards the implementation of a stable, pull-based progress / status API for Spark (see [SPARK-2321](https://issues.apache.org/jira/browse/SPARK-2321)).  For now, I'd like to discuss the basic implementation, API names, and overall interface design.  Once we arrive at a good design, I'll go back and add additional methods to expose more information via these APIs.

#### Design goals:

- Pull-based API
- Usable from Java / Scala / Python (eventually, likely with a wrapper)
- Can be extended to expose more information without introducing binary incompatibilities.
- Returns immutable objects.
- Don't leak any implementation details, preserving our freedom to change the implementation.

#### Implementation:

- Add public methods (`getJobInfo`, `getStageInfo`) to SparkContext to allow status / progress information to be retrieved.
- Add public interfaces (`SparkJobInfo`, `SparkStageInfo`) for our API return values.  These interfaces consist entirely of Java-style getter methods.  The interfaces are currently implemented in Java.  I decided to explicitly separate the interface from its implementation (`SparkJobInfoImpl`, `SparkStageInfoImpl`) in order to prevent users from constructing these responses themselves.
- Allow an existing JobProgressListener to be used when constructing a live SparkUI.  This allows us to re-use these listeners in the implementation of this status API.  There are a few reasons why this listener re-use makes sense:
   - The status API and web UI are guaranteed to show consistent information.
   - These listeners are already well-tested.
   - The same garbage-collection / information retention configurations can apply to both this API and the web UI.
- Extend JobProgressListener to maintain `jobId -> Job` and `stageId -> Stage` mappings.

The progress API methods are implemented in a separate trait that's mixed into SparkContext.  This helps to avoid SparkContext.scala from becoming larger and more difficult to read.

Author: Josh Rosen <joshrosen@databricks.com>
Author: Josh Rosen <joshrosen@apache.org>

Closes #2696 from JoshRosen/progress-reporting-api and squashes the following commits:

e6aa78d [Josh Rosen] Add tests.
b585c16 [Josh Rosen] Accept SparkListenerBus instead of more specific subclasses.
c96402d [Josh Rosen] Address review comments.
2707f98 [Josh Rosen] Expose current stage attempt id
c28ba76 [Josh Rosen] Update demo code:
646ff1d [Josh Rosen] Document spark.ui.retainedJobs.
7f47d6d [Josh Rosen] Clean up SparkUI constructors, per Andrew's feedback.
b77b3d8 [Josh Rosen] Merge remote-tracking branch 'origin/master' into progress-reporting-api
787444c [Josh Rosen] Move status API methods into trait that can be mixed into SparkContext.
f9a9a00 [Josh Rosen] More review comments:
3dc79af [Josh Rosen] Remove creation of unused listeners in SparkContext.
249ca16 [Josh Rosen] Address several review comments:
da5648e [Josh Rosen] Add example of basic progress reporting in Java.
7319ffd [Josh Rosen] Add getJobIdsForGroup() and num*Tasks() methods.
cc568e5 [Josh Rosen] Add note explaining that interfaces should not be implemented outside of Spark.
6e840d4 [Josh Rosen] Remove getter-style names and "consistent snapshot" semantics:
08cbec9 [Josh Rosen] Begin to sketch the interfaces for a stable, public status API.
ac2d13a [Josh Rosen] Add jobId->stage, stageId->stage mappings in JobProgressListener
24de263 [Josh Rosen] Create UI listeners in SparkContext instead of in Tabs:
2014-10-25 00:06:57 -07:00
Sandy Ryza 6bb56faea8 SPARK-1813. Add a utility to SparkConf that makes using Kryo really easy
Author: Sandy Ryza <sandy@cloudera.com>

Closes #789 from sryza/sandy-spark-1813 and squashes the following commits:

48b05e9 [Sandy Ryza] Simplify
b824932 [Sandy Ryza] Allow both spark.kryo.classesToRegister and spark.kryo.registrator at the same time
6a15bb7 [Sandy Ryza] Small fix
a2278c0 [Sandy Ryza] Respond to review comments
6ef592e [Sandy Ryza] SPARK-1813. Add a utility to SparkConf that makes using Kryo really easy
2014-10-21 21:53:09 -07:00
Josh Rosen 7e63bb49c5 [SPARK-2546] Clone JobConf for each task (branch-1.0 / 1.1 backport)
This patch attempts to fix SPARK-2546 in `branch-1.0` and `branch-1.1`.  The underlying problem is that thread-safety issues in Hadoop Configuration objects may cause Spark tasks to get stuck in infinite loops.  The approach taken here is to clone a new copy of the JobConf for each task rather than sharing a single copy between tasks.  Note that there are still Configuration thread-safety issues that may affect the driver, but these seem much less likely to occur in practice and will be more complex to fix (see discussion on the SPARK-2546 ticket).

This cloning is guarded by a new configuration option (`spark.hadoop.cloneConf`) and is disabled by default in order to avoid unexpected performance regressions for workloads that are unaffected by the Configuration thread-safety issues.
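
A minimal sketch of opting in to the new behavior (disabled by default, as noted above):

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Clone the Hadoop JobConf per task to avoid the Configuration thread-safety
// issue described above, at some extra per-task cost.
val conf = new SparkConf().set("spark.hadoop.cloneConf", "true")
val sc = new SparkContext(conf)
```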

Author: Josh Rosen <joshrosen@apache.org>

Closes #2684 from JoshRosen/jobconf-fix-backport and squashes the following commits:

f14f259 [Josh Rosen] Add configuration option to control cloning of Hadoop JobConf.
b562451 [Josh Rosen] Remove unused jobConfCacheKey field.
dd25697 [Josh Rosen] [SPARK-2546] [1.0 / 1.1 backport] Clone JobConf for each task.

(cherry picked from commit 2cd40db2b3)
Signed-off-by: Josh Rosen <joshrosen@databricks.com>

Conflicts:
	core/src/main/scala/org/apache/spark/rdd/HadoopRDD.scala
2014-10-19 00:35:05 -07:00
WangTaoTheTonic e7f4ea8a52 [SPARK-3890][Docs]remove redundant spark.executor.memory in doc
Introduced in f7e79bc42c; I'm not sure why we need two spark.executor.memory entries here.

Author: WangTaoTheTonic <barneystinson@aliyun.com>
Author: WangTao <barneystinson@aliyun.com>

Closes #2745 from WangTaoTheTonic/redundantconfig and squashes the following commits:

e7564dc [WangTao] too long line
fdbdb1f [WangTaoTheTonic] trivial workaround
d06b6e5 [WangTaoTheTonic] remove redundant spark.executor.memory in doc
2014-10-16 19:12:57 -07:00
Aaron Davidson 7f7b50ed9d [SPARK-3923] Increase Akka heartbeat pause above heartbeat interval
Something about the 2.3.4 upgrade seems to have caused the issue to manifest, where all the services disconnect from each other after exactly 1000 seconds (which is the heartbeat interval). [This post](https://groups.google.com/forum/#!topic/akka-user/X3xzpTCbEFs) suggests that the heartbeat pause should be greater than the heartbeat interval, and increasing the pause from 600s to 6000s seems to have rectified the issue. My current cluster has now exceeded 1400s of uptime without failure!

I do not know why this fixed it, because the threshold we have set for the failure detector is the exponent of a timeout, and 300 is extremely large. Perhaps the default failure detector changed in 2.3.4 and now ignores threshold.

Author: Aaron Davidson <aaron@databricks.com>

Closes #2784 from aarondav/fix-timeout and squashes the following commits:

bd1151a [Aaron Davidson] Increase pause, don't decrease interval
9cb0372 [Aaron Davidson] [SPARK-3923] Decrease Akka heartbeat interval below heartbeat pause
2014-10-16 18:58:18 -07:00
nartz 13cab5ba44 add spark.driver.memory to config docs
It took me a minute to track this down, so I thought it could be useful to have it in the docs.

I'm unsure whether 512mb is the default for spark.driver.memory. Also, there could be a better value for the 'description' to differentiate it from spark.executor.memory.

Author: nartz <nartzpod@gmail.com>
Author: Nathan Artz <nathanartz@Nathans-MacBook-Pro.local>

Closes #2410 from nartz/docs/add-spark-driver-memory-to-config-docs and squashes the following commits:

a2f6c62 [nartz] Update configuration.md
74521b8 [Nathan Artz] add spark.driver.memory to config docs
2014-10-09 00:02:11 -07:00
Brenden Matthews a8c52d5343 [SPARK-3535][Mesos] Fix resource handling.
Author: Brenden Matthews <brenden@diddyinc.com>

Closes #2401 from brndnmtthws/master and squashes the following commits:

4abaa5d [Brenden Matthews] [SPARK-3535][Mesos] Fix resource handling.
2014-10-03 12:58:04 -07:00
EugenCepoi f0811f928e SPARK-2058: Overriding SPARK_HOME/conf with SPARK_CONF_DIR
Update of PR #997.

With this PR, setting SPARK_CONF_DIR overrides SPARK_HOME/conf (not only spark-defaults.conf and spark-env).

Author: EugenCepoi <cepoi.eugen@gmail.com>

Closes #2481 from EugenCepoi/SPARK-2058 and squashes the following commits:

0bb32c2 [EugenCepoi] use orElse orNull and fixing trailing percent in compute-classpath.cmd
77f35d7 [EugenCepoi] SPARK-2058: Overriding SPARK_HOME/conf with SPARK_CONF_DIR
2014-10-03 10:03:15 -07:00
scwf c6469a02f1 [SPARK-3766][Doc]Snappy is also the default compress codec for broadcast variables
Author: scwf <wangfei1@huawei.com>

Closes #2632 from scwf/compress-doc and squashes the following commits:

7983a1a [scwf] snappy is the default compression codec for broadcast
2014-10-02 13:47:30 -07:00
Davies Liu c5414b6818 [SPARK-3478] [PySpark] Profile the Python tasks
This patch adds profiling support for PySpark; it will show the profiling results
before the driver exits. Here is one example:

```
============================================================
Profile of RDD<id=3>
============================================================
         5146507 function calls (5146487 primitive calls) in 71.094 seconds

   Ordered by: internal time, cumulative time

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
  5144576   68.331    0.000   68.331    0.000 statcounter.py:44(merge)
       20    2.735    0.137   71.071    3.554 statcounter.py:33(__init__)
       20    0.017    0.001    0.017    0.001 {cPickle.dumps}
     1024    0.003    0.000    0.003    0.000 t.py:16(<lambda>)
       20    0.001    0.000    0.001    0.000 {reduce}
       21    0.001    0.000    0.001    0.000 {cPickle.loads}
       20    0.001    0.000    0.001    0.000 copy_reg.py:95(_slotnames)
       41    0.001    0.000    0.001    0.000 serializers.py:461(read_int)
       40    0.001    0.000    0.002    0.000 serializers.py:179(_batched)
       62    0.000    0.000    0.000    0.000 {method 'read' of 'file' objects}
       20    0.000    0.000   71.072    3.554 rdd.py:863(<lambda>)
       20    0.000    0.000    0.001    0.000 serializers.py:198(load_stream)
    40/20    0.000    0.000   71.072    3.554 rdd.py:2093(pipeline_func)
       41    0.000    0.000    0.002    0.000 serializers.py:130(load_stream)
       40    0.000    0.000   71.072    1.777 rdd.py:304(func)
       20    0.000    0.000   71.094    3.555 worker.py:82(process)
```

Also, users can show the profiling results manually with `sc.show_profiles()` or dump them to disk
with `sc.dump_profiles(path)`, for example:

```python
>>> sc._conf.set("spark.python.profile", "true")
>>> rdd = sc.parallelize(range(100)).map(str)
>>> rdd.count()
100
>>> sc.show_profiles()
============================================================
Profile of RDD<id=1>
============================================================
         284 function calls (276 primitive calls) in 0.001 seconds

   Ordered by: internal time, cumulative time

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        4    0.000    0.000    0.000    0.000 serializers.py:198(load_stream)
        4    0.000    0.000    0.000    0.000 {reduce}
     12/4    0.000    0.000    0.001    0.000 rdd.py:2092(pipeline_func)
        4    0.000    0.000    0.000    0.000 {cPickle.loads}
        4    0.000    0.000    0.000    0.000 {cPickle.dumps}
      104    0.000    0.000    0.000    0.000 rdd.py:852(<genexpr>)
        8    0.000    0.000    0.000    0.000 serializers.py:461(read_int)
       12    0.000    0.000    0.000    0.000 rdd.py:303(func)
```
Profiling is disabled by default and can be enabled by "spark.python.profile=true".

Also, users can have the results dumped to disk automatically for later analysis, via "spark.python.profile.dump=path_to_dump".

This is a bugfix of #2351. cc JoshRosen

Author: Davies Liu <davies.liu@gmail.com>

Closes #2556 from davies/profiler and squashes the following commits:

e68df5a [Davies Liu] Merge branch 'master' of github.com:apache/spark into profiler
858e74c [Davies Liu] compatitable with python 2.6
7ef2aa0 [Davies Liu] bugfix, add tests for show_profiles and dump_profiles()
2b0daf2 [Davies Liu] fix docs
7a56c24 [Davies Liu] bugfix
cba9463 [Davies Liu] move show_profiles and dump_profiles to SparkContext
fb9565b [Davies Liu] Merge branch 'master' of github.com:apache/spark into profiler
116d52a [Davies Liu] Merge branch 'master' of github.com:apache/spark into profiler
09d02c3 [Davies Liu] Merge branch 'master' into profiler
c23865c [Davies Liu] Merge branch 'master' into profiler
15d6f18 [Davies Liu] add docs for two configs
dadee1a [Davies Liu] add docs string and clear profiles after show or dump
4f8309d [Davies Liu] address comment, add tests
0a5b6eb [Davies Liu] fix Python UDF
4b20494 [Davies Liu] add profile for python
2014-09-30 18:24:57 -07:00
Josh Rosen f872e4fb80 Revert "[SPARK-3478] [PySpark] Profile the Python tasks"
This reverts commit 1aa549ba98.
2014-09-26 14:47:14 -07:00
Davies Liu 1aa549ba98 [SPARK-3478] [PySpark] Profile the Python tasks
This patch adds profiling support for PySpark; it will show the profiling results
before the driver exits. Here is one example:

```
============================================================
Profile of RDD<id=3>
============================================================
         5146507 function calls (5146487 primitive calls) in 71.094 seconds

   Ordered by: internal time, cumulative time

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
  5144576   68.331    0.000   68.331    0.000 statcounter.py:44(merge)
       20    2.735    0.137   71.071    3.554 statcounter.py:33(__init__)
       20    0.017    0.001    0.017    0.001 {cPickle.dumps}
     1024    0.003    0.000    0.003    0.000 t.py:16(<lambda>)
       20    0.001    0.000    0.001    0.000 {reduce}
       21    0.001    0.000    0.001    0.000 {cPickle.loads}
       20    0.001    0.000    0.001    0.000 copy_reg.py:95(_slotnames)
       41    0.001    0.000    0.001    0.000 serializers.py:461(read_int)
       40    0.001    0.000    0.002    0.000 serializers.py:179(_batched)
       62    0.000    0.000    0.000    0.000 {method 'read' of 'file' objects}
       20    0.000    0.000   71.072    3.554 rdd.py:863(<lambda>)
       20    0.000    0.000    0.001    0.000 serializers.py:198(load_stream)
    40/20    0.000    0.000   71.072    3.554 rdd.py:2093(pipeline_func)
       41    0.000    0.000    0.002    0.000 serializers.py:130(load_stream)
       40    0.000    0.000   71.072    1.777 rdd.py:304(func)
       20    0.000    0.000   71.094    3.555 worker.py:82(process)
```

Also, users can show the profiling results manually with `sc.show_profiles()` or dump them to disk
with `sc.dump_profiles(path)`, for example:

```python
>>> sc._conf.set("spark.python.profile", "true")
>>> rdd = sc.parallelize(range(100)).map(str)
>>> rdd.count()
100
>>> sc.show_profiles()
============================================================
Profile of RDD<id=1>
============================================================
         284 function calls (276 primitive calls) in 0.001 seconds

   Ordered by: internal time, cumulative time

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        4    0.000    0.000    0.000    0.000 serializers.py:198(load_stream)
        4    0.000    0.000    0.000    0.000 {reduce}
     12/4    0.000    0.000    0.001    0.000 rdd.py:2092(pipeline_func)
        4    0.000    0.000    0.000    0.000 {cPickle.loads}
        4    0.000    0.000    0.000    0.000 {cPickle.dumps}
      104    0.000    0.000    0.000    0.000 rdd.py:852(<genexpr>)
        8    0.000    0.000    0.000    0.000 serializers.py:461(read_int)
       12    0.000    0.000    0.000    0.000 rdd.py:303(func)
```
Profiling is disabled by default and can be enabled by "spark.python.profile=true".

Also, users can have the results dumped to disk automatically for later analysis, via "spark.python.profile.dump=path_to_dump".

Author: Davies Liu <davies.liu@gmail.com>

Closes #2351 from davies/profiler and squashes the following commits:

7ef2aa0 [Davies Liu] bugfix, add tests for show_profiles and dump_profiles()
2b0daf2 [Davies Liu] fix docs
7a56c24 [Davies Liu] bugfix
cba9463 [Davies Liu] move show_profiles and dump_profiles to SparkContext
fb9565b [Davies Liu] Merge branch 'master' of github.com:apache/spark into profiler
116d52a [Davies Liu] Merge branch 'master' of github.com:apache/spark into profiler
09d02c3 [Davies Liu] Merge branch 'master' into profiler
c23865c [Davies Liu] Merge branch 'master' into profiler
15d6f18 [Davies Liu] add docs for two configs
dadee1a [Davies Liu] add docs string and clear profiles after show or dump
4f8309d [Davies Liu] address comment, add tests
0a5b6eb [Davies Liu] fix Python UDF
4b20494 [Davies Liu] add profile for python
2014-09-26 09:27:42 -07:00