Commit graph

863 commits

Author SHA1 Message Date
Ryan Williams 8176b7a02e [SPARK-4668] Fix some documentation typos.
Author: Ryan Williams <ryan.blake.williams@gmail.com>

Closes #3523 from ryan-williams/tweaks and squashes the following commits:

d2eddaa [Ryan Williams] code review feedback
ce27fc1 [Ryan Williams] CoGroupedRDD comment nit
c6cfad9 [Ryan Williams] remove unnecessary if statement
b74ea35 [Ryan Williams] comment fix
b0221f0 [Ryan Williams] fix a gendered pronoun
c71ffed [Ryan Williams] use names on a few boolean parameters
89954aa [Ryan Williams] clarify some comments in {Security,Shuffle}Manager
e465dac [Ryan Williams] Saved building-spark.md with Dillinger.io
83e8358 [Ryan Williams] fix pom.xml typo
dc4662b [Ryan Williams] typo fixes in tuning.md, configuration.md
2014-12-15 14:52:17 -08:00
Tathagata Das b004150adb [SPARK-4806] Streaming doc update for 1.2
Important updates to the streaming programming guide
- Make the fault-tolerance properties easier to understand, with information about write ahead logs
- Update the information about deploying the spark streaming app with information about Driver HA
- Update Receiver guide to discuss reliable vs unreliable receivers.

Author: Tathagata Das <tathagata.das1565@gmail.com>
Author: Josh Rosen <joshrosen@databricks.com>
Author: Josh Rosen <rosenville@gmail.com>

Closes #3653 from tdas/streaming-doc-update-1.2 and squashes the following commits:

f53154a [Tathagata Das] Addressed Josh's comments.
ce299e4 [Tathagata Das] Minor update.
ca19078 [Tathagata Das] Minor change
f746951 [Tathagata Das] Mentioned performance problem with WAL
7787209 [Tathagata Das] Merge branch 'streaming-doc-update-1.2' of github.com:tdas/spark into streaming-doc-update-1.2
2184729 [Tathagata Das] Updated Kafka and Flume guides with reliability information.
2f3178c [Tathagata Das] Added more information about writing reliable receivers in the custom receiver guide.
91aa5aa [Tathagata Das] Improved API Docs menu
5707581 [Tathagata Das] Added Pythn API badge
b9c8c24 [Tathagata Das] Merge pull request #26 from JoshRosen/streaming-programming-guide
b8c8382 [Josh Rosen] minor fixes
a4ef126 [Josh Rosen] Restructure parts of the fault-tolerance section to read a bit nicer when skipping over the headings
65f66cd [Josh Rosen] Fix broken link to fault-tolerance semantics section.
f015397 [Josh Rosen] Minor grammar / pluralization fixes.
3019f3a [Josh Rosen] Fix minor Markdown formatting issues
aa8bb87 [Tathagata Das] Small update.
195852c [Tathagata Das] Updated based on Josh's comments, updated receiver reliability and deploying section, and also updated configuration.
17b99fb [Tathagata Das] Merge remote-tracking branch 'apache-github/master' into streaming-doc-update-1.2
a0217c0 [Tathagata Das] Changed Deploying menu layout
67fcffc [Tathagata Das] Added cluster mode + supervise example to submitting application guide.
e45453b [Tathagata Das] Update streaming guide, added deploying section.
192c7a7 [Tathagata Das] Added more info about Python API, and rewrote the checkpointing section.
2014-12-11 06:21:23 -08:00
Andrew Ash 652b781a9b SPARK-3526 Add section about data locality to the tuning guide
cc kayousterhout

I have a few outstanding questions from compiling this documentation:
- What's the difference between NO_PREF and ANY?  I understand the implications of the ordering but don't know what an example of each would be
- Why is NO_PREF ahead of RACK_LOCAL?  I would think it'd be better to schedule rack-local tasks ahead of no preference if you could only do one or the other.  Is the idea to wait longer and hope for the rack-local tasks to turn into node-local or better?
- Will there be a datacenter-local locality level in the future?  Apache Cassandra for example has this level

Author: Andrew Ash <andrew@andrewash.com>

Closes #2519 from ash211/SPARK-3526 and squashes the following commits:

44cff28 [Andrew Ash] Link to spark.locality parameters rather than copying the list
6d5d966 [Andrew Ash] Stay focused on Spark, no astronaut architecture mumbo-jumbo
20e0e31 [Andrew Ash] SPARK-3526 Add section about data locality to the tuning guide
2014-12-10 15:01:15 -08:00
Andrew Or 56212831c6 [SPARK-4771][Docs] Document standalone cluster supervise mode
tdas looks like streaming already refers to the supervise mode. The link from there is broken though.

Author: Andrew Or <andrew@databricks.com>

Closes #3627 from andrewor14/document-supervise and squashes the following commits:

9ca0908 [Andrew Or] Wording changes
2b55ed2 [Andrew Or] Document standalone cluster supervise mode
2014-12-10 12:41:36 -08:00
Sandy Ryza 912563aa35 SPARK-4338. [YARN] Ditch yarn-alpha.
Sorry if this is a little premature with 1.2 still not out the door, but it will make other work like SPARK-4136 and SPARK-2089 a lot easier.

Author: Sandy Ryza <sandy@cloudera.com>

Closes #3215 from sryza/sandy-spark-4338 and squashes the following commits:

1c5ac08 [Sandy Ryza] Update building Spark docs and remove unnecessary newline
9c1421c [Sandy Ryza] SPARK-4338. Ditch yarn-alpha.
2014-12-09 11:02:43 -08:00
Sandy Ryza cda94d15ea SPARK-4770. [DOC] [YARN] spark.scheduler.minRegisteredResourcesRatio doc...
...umented default is incorrect for YARN

Author: Sandy Ryza <sandy@cloudera.com>

Closes #3624 from sryza/sandy-spark-4770 and squashes the following commits:

bd81a3a [Sandy Ryza] SPARK-4770. [DOC] [YARN] spark.scheduler.minRegisteredResourcesRatio documented default is incorrect for YARN
2014-12-08 16:28:36 -08:00
CrazyJvm 6eb1b6f620 Streaming doc : do you mean inadvertently?
Author: CrazyJvm <crazyjvm@gmail.com>

Closes #3620 from CrazyJvm/streaming-foreachRDD and squashes the following commits:

b72886b [CrazyJvm] do you mean inadvertently?
2014-12-05 13:42:13 -08:00
Andrew Or fd8525334c Revert "SPARK-2624 add datanucleus jars to the container in yarn-cluster"
This reverts commit a975dc3279.
2014-12-04 21:53:49 -08:00
Masayoshi TSUZUKI ca379039f7 [SPARK-4464] Description about configuration options need to be modified in docs.
Added description about -h and -host.
Modified description about -i and -ip which are now deprecated.
Added description about --properties-file.

Author: Masayoshi TSUZUKI <tsudukim@oss.nttdata.co.jp>

Closes #3329 from tsudukim/feature/SPARK-4464 and squashes the following commits:

6c07caf [Masayoshi TSUZUKI] [SPARK-4464] Description about configuration options need to be modified in docs.
2014-12-04 19:33:02 -08:00
Andy Konwinski 15cf3b0125 Fix typo in Spark SQL docs.
Author: Andy Konwinski <andykonwinski@gmail.com>

Closes #3611 from andyk/patch-3 and squashes the following commits:

7bab333 [Andy Konwinski] Fix typo in Spark SQL docs.
2014-12-04 18:27:02 -08:00
Masayoshi TSUZUKI ddfc09c363 [SPARK-4421] Wrong link in spark-standalone.html
Modified the link of building Spark.

Author: Masayoshi TSUZUKI <tsudukim@oss.nttdata.co.jp>

Closes #3279 from tsudukim/feature/SPARK-4421 and squashes the following commits:

56e31c1 [Masayoshi TSUZUKI] Modified the link of building Spark.
2014-12-04 18:14:36 -08:00
lewuathe ab8177da2d [SPARK-4652][DOCS] Add docs about spark-git-repo option
There might be some cases when WIPS spark version need to be run
on EC2 cluster. In order to setup this type of cluster more easily,
add --spark-git-repo option description to ec2 documentation.

Author: lewuathe <lewuathe@me.com>
Author: Josh Rosen <joshrosen@databricks.com>

Closes #3513 from Lewuathe/doc-for-development-spark-cluster and squashes the following commits:

6dae8ee [lewuathe] Wrap consistent with other descriptions
cfaf9be [lewuathe] Add docs about spark-git-repo option

(Editing / cleanup by Josh Rosen)
2014-12-04 15:24:36 -08:00
Xiangrui Meng 7e758d7092 [FIX][DOC] Fix broken links in ml-guide.md
and some minor changes in ScalaDoc.

Author: Xiangrui Meng <meng@databricks.com>

Closes #3601 from mengxr/SPARK-4575-fix and squashes the following commits:

c559768 [Xiangrui Meng] minor code update
ce94da8 [Xiangrui Meng] Java Bean -> JavaBean
0b5c182 [Xiangrui Meng] fix links in ml-guide
2014-12-04 20:16:35 +08:00
Joseph K. Bradley 469a6e5f3b [SPARK-4575] [mllib] [docs] spark.ml pipelines doc + bug fixes
Documentation:
* Added ml-guide.md, linked from mllib-guide.md
* Updated mllib-guide.md with small section pointing to ml-guide.md

Examples:
* CrossValidatorExample
* SimpleParamsExample
* (I copied these + the SimpleTextClassificationPipeline example into the ml-guide.md)

Bug fixes:
* PipelineModel: did not use ParamMaps correctly
* UnaryTransformer: issues with TypeTag serialization (Thanks to mengxr for that fix!)

CC: mengxr shivaram  etrain  Documentation for Pipelines: I know the docs are not complete, but the goal is to have enough to let interested people get started using spark.ml and to add more docs once the package is more established/complete.

Author: Joseph K. Bradley <joseph@databricks.com>
Author: jkbradley <joseph.kurata.bradley@gmail.com>
Author: Xiangrui Meng <meng@databricks.com>

Closes #3588 from jkbradley/ml-package-docs and squashes the following commits:

d393b5c [Joseph K. Bradley] fixed bug in Pipeline (typo from last commit).  updated examples for CV and Params for spark.ml
c38469c [Joseph K. Bradley] Updated ml-guide with CV examples
99f88c2 [Joseph K. Bradley] Fixed bug in PipelineModel.transform* with usage of params.  Updated CrossValidatorExample to use more training examples so it is less likely to get a 0-size fold.
ea34dc6 [jkbradley] Merge pull request #4 from mengxr/ml-package-docs
3b83ec0 [Xiangrui Meng] replace TypeTag with explicit datatype
41ad9b1 [Joseph K. Bradley] Added examples for spark.ml: SimpleParamsExample + Java version, CrossValidatorExample + Java version.  CrossValidatorExample not working yet.  Added programming guide for spark.ml, but need to add CrossValidatorExample to it once CrossValidatorExample works.
2014-12-04 17:00:06 +08:00
Joseph K. Bradley 529439bd50 [docs] Fix outdated comment in tuning guide
When you use the SPARK_JAVA_OPTS env variable, Spark complains:

```
SPARK_JAVA_OPTS was detected (set to ' -verbose:gc -XX:+PrintGCDetails -XX:+PrintGCTimeStamps ').
This is deprecated in Spark 1.0+.

Please instead use:
 - ./spark-submit with conf/spark-defaults.conf to set defaults for an application
 - ./spark-submit with --driver-java-options to set -X options for a driver
 - spark.executor.extraJavaOptions to set -X options for executors
 - SPARK_DAEMON_JAVA_OPTS to set java options for standalone daemons (master or worker)
```

This updates the docs to redirect the user to the relevant part of the configuration docs.

CC: mengxr  but please CC someone else as needed

Author: Joseph K. Bradley <joseph@databricks.com>

Closes #3592 from jkbradley/tuning-doc and squashes the following commits:

0760ce1 [Joseph K. Bradley] fixed outdated comment in tuning guide
2014-12-04 00:59:32 -08:00
Joseph K. Bradley 657a88835d [SPARK-4580] [SPARK-4610] [mllib] [docs] Documentation for tree ensembles + DecisionTree API fix
Major changes:
* Added programming guide sections for tree ensembles
* Added examples for tree ensembles
* Updated DecisionTree programming guide with more info on parameters
* **API change**: Standardized the tree parameter for the number of classes (for classification)

Minor changes:
* Updated decision tree documentation
* Updated existing tree and tree ensemble examples
 * Use train/test split, and compute test error instead of training error.
 * Fixed decision_tree_runner.py to actually use the number of classes it computes from data. (small bug fix)

Note: I know this is a lot of lines, but most is covered by:
* Programming guide sections for gradient boosting and random forests.  (The changes are probably best viewed by generating the docs locally.)
* New examples (which were copied from the programming guide)
* The "numClasses" renaming

I have run all examples and relevant unit tests.

CC: mengxr manishamde codedeft

Author: Joseph K. Bradley <joseph@databricks.com>
Author: Joseph K. Bradley <joseph.kurata.bradley@gmail.com>

Closes #3461 from jkbradley/ensemble-docs and squashes the following commits:

70a75f3 [Joseph K. Bradley] updated forest vs boosting comparison
d1de753 [Joseph K. Bradley] Added note about toString and toDebugString for DecisionTree to migration guide
8e87f8f [Joseph K. Bradley] Combined GBT and RandomForest guides into one ensembles guide
6fab846 [Joseph K. Bradley] small fixes based on review
b9f8576 [Joseph K. Bradley] updated decision tree doc
375204c [Joseph K. Bradley] fixed python style
2b60b6e [Joseph K. Bradley] merged Java RandomForest examples into 1 file.  added header.  Fixed small bug in same example in the programming guide.
706d332 [Joseph K. Bradley] updated python DT runner to print full model if it is small
c76c823 [Joseph K. Bradley] added migration guide for mllib
abe5ed7 [Joseph K. Bradley] added examples for random forest in Java and Python to examples folder
07fc11d [Joseph K. Bradley] Renamed numClassesForClassification to numClasses everywhere in trees and ensembles. This is a breaking API change, but it was necessary to correct an API inconsistency in Spark 1.1 (where Python DecisionTree used numClasses but Scala used numClassesForClassification).
cdfdfbc [Joseph K. Bradley] added examples for GBT
6372a2b [Joseph K. Bradley] updated decision tree examples to use random split.  tested all of them.
ad3e695 [Joseph K. Bradley] added gbt and random forest to programming guide.  still need to update their examples
2014-12-04 09:57:50 +08:00
Joseph K. Bradley 27ab0b8a03 [SPARK-4711] [mllib] [docs] Programming guide advice on choosing optimizer
I have heard requests for the docs to include advice about choosing an optimization method. The programming guide could include a brief statement about this (so the user does not have to read the whole optimization section).

CC: mengxr

Author: Joseph K. Bradley <joseph@databricks.com>

Closes #3569 from jkbradley/lr-doc and squashes the following commits:

654aeb5 [Joseph K. Bradley] updated section header for mllib-optimization
5035ad0 [Joseph K. Bradley] updated based on review
94f6dec [Joseph K. Bradley] Updated linear methods and optimization docs with quick advice on choosing an optimization method
2014-12-04 08:58:03 +08:00
Masayoshi TSUZUKI 692f49378f [SPARK-4642] Add description about spark.yarn.queue to running-on-YARN document.
Added descriptions about these parameters.
- spark.yarn.queue

Modified description about the defalut value of this parameter.
- spark.yarn.submit.file.replication

Author: Masayoshi TSUZUKI <tsudukim@oss.nttdata.co.jp>

Closes #3500 from tsudukim/feature/SPARK-4642 and squashes the following commits:

ce99655 [Masayoshi TSUZUKI] better gramatically.
21cf624 [Masayoshi TSUZUKI] Removed intentionally undocumented properties.
88cac9b [Masayoshi TSUZUKI] [SPARK-4642] Documents about running-on-YARN needs update
2014-12-03 13:16:24 -08:00
Jim Lim a975dc3279 SPARK-2624 add datanucleus jars to the container in yarn-cluster
If `spark-submit` finds the datanucleus jars, it adds them to the driver's classpath, but does not add it to the container.

This patch modifies the yarn deployment class to copy all `datanucleus-*` jars found in `[spark-home]/libs` to the container.

Author: Jim Lim <jim@quixey.com>

Closes #3238 from jimjh/SPARK-2624 and squashes the following commits:

3633071 [Jim Lim] SPARK-2624 update documentation and comments
fe95125 [Jim Lim] SPARK-2624 keep java imports together
6c31fe0 [Jim Lim] SPARK-2624 update documentation
6690fbf [Jim Lim] SPARK-2624 add tests
d28d8e9 [Jim Lim] SPARK-2624 add spark.yarn.datanucleus.dir option
84e6cba [Jim Lim] SPARK-2624 add datanucleus jars to the container in yarn-cluster
2014-12-03 11:16:29 -08:00
Kay Ousterhout d9a148ba6a [SPARK-4686] Link to allowed master URLs is broken
The link points to the old scala programming guide; it should point to the submitting applications page.

This should be backported to 1.1.2 (it's been broken as of 1.0).

Author: Kay Ousterhout <kayousterhout@gmail.com>

Closes #3542 from kayousterhout/SPARK-4686 and squashes the following commits:

a8fc43b [Kay Ousterhout] [SPARK-4686] Link to allowed master URLs is broken
2014-12-02 09:06:02 -08:00
Daoyuan Wang 5edbcbfb61 [SQL][DOC] Date type in SQL programming guide
Author: Daoyuan Wang <daoyuan.wang@intel.com>

Closes #3535 from adrian-wang/datedoc and squashes the following commits:

18ff1ed [Daoyuan Wang] [DOC] Date type
2014-12-01 14:04:07 -08:00
wangfei 7b79957879 [SQL] Minor fix for doc and comment
Author: wangfei <wangfei1@huawei.com>

Closes #3533 from scwf/sql-doc1 and squashes the following commits:

962910b [wangfei] doc and comment fix
2014-12-01 14:02:02 -08:00
Cheng Lian 5db8dcaf49 [SPARK-4258][SQL][DOC] Documents spark.sql.parquet.filterPushdown
Documents `spark.sql.parquet.filterPushdown`, explains why it's turned off by default and when it's safe to be turned on.

<!-- Reviewable:start -->
[<img src="https://reviewable.io/review_button.png" height=40 alt="Review on Reviewable"/>](https://reviewable.io/reviews/apache/spark/3440)
<!-- Reviewable:end -->

Author: Cheng Lian <lian@databricks.com>

Closes #3440 from liancheng/parquet-filter-pushdown-doc and squashes the following commits:

2104311 [Cheng Lian] Documents spark.sql.parquet.filterPushdown
2014-12-01 13:09:51 -08:00
Madhu Siddalingaiah 2b233f5fc4 Documentation: add description for repartitionAndSortWithinPartitions
Author: Madhu Siddalingaiah <madhu@madhu.com>

Closes #3390 from msiddalingaiah/master and squashes the following commits:

cbccbfe [Madhu Siddalingaiah] Documentation: replace <b> with <code> (again)
332f7a2 [Madhu Siddalingaiah] Documentation: replace <b> with <code>
cd2b05a [Madhu Siddalingaiah] Merge remote-tracking branch 'upstream/master'
0fc12d7 [Madhu Siddalingaiah] Documentation: add description for repartitionAndSortWithinPartitions
2014-12-01 08:45:34 -08:00
Cheng Lian 2a4d389f70 [DOC] Fixes formatting typo in SQL programming guide
<!-- Reviewable:start -->
[<img src="https://reviewable.io/review_button.png" height=40 alt="Review on Reviewable"/>](https://reviewable.io/reviews/apache/spark/3498)
<!-- Reviewable:end -->

Author: Cheng Lian <lian@databricks.com>

Closes #3498 from liancheng/fix-sql-doc-typo and squashes the following commits:

865ecd7 [Cheng Lian] Fixes formatting typo in SQL programming guide
2014-11-30 19:04:07 -08:00
lewuathe a217ec5fd5 [SPARK-4656][Doc] Typo in Programming Guide markdown
Grammatical error in Programming Guide document

Author: lewuathe <lewuathe@me.com>

Closes #3412 from Lewuathe/typo-programming-guide and squashes the following commits:

a3e2f00 [lewuathe] Typo in Programming Guide markdown
2014-11-30 17:18:50 -08:00
Takuya UESHIN 0fcd24cc54 [DOCS][BUILD] Add instruction to use change-version-to-2.11.sh in 'Building for Scala 2.11'.
To build with Scala 2.11, we have to execute `change-version-to-2.11.sh` before Maven execute, otherwise inter-module dependencies are broken.

Author: Takuya UESHIN <ueshin@happy-camper.st>

Closes #3361 from ueshin/docs/building-spark_2.11 and squashes the following commits:

1d29126 [Takuya UESHIN] Add instruction to use change-version-to-2.11.sh in 'Building for Scala 2.11'.
2014-11-30 00:10:31 -05:00
Sean Owen 5d7fe178b3 SPARK-4170 [CORE] Closure problems when running Scala app that "extends App"
Warn against subclassing scala.App, and remove one instance of this in examples

Author: Sean Owen <sowen@cloudera.com>

Closes #3497 from srowen/SPARK-4170 and squashes the following commits:

4a6131f [Sean Owen] Restore multiline string formatting
a8ca895 [Sean Owen] Warn against subclassing scala.App, and remove one instance of this in examples
2014-11-27 09:03:17 -08:00
CodingCat 5af53ada65 [SPARK-732][SPARK-3628][CORE][RESUBMIT] eliminate duplicate update on accmulator
https://issues.apache.org/jira/browse/SPARK-3628

In current implementation, the accumulator will be updated for every successfully finished task, even the task is from a resubmitted stage, which makes the accumulator counter-intuitive

In this patch, I changed the way for the DAGScheduler to update the accumulator,

DAGScheduler maintains a HashTable, mapping the stage id to the received <accumulator_id , value> pairs. Only when the stage becomes independent, (no job needs it any more), we accumulate the values of the <accumulator_id , value> pairs, when a task finished, we check if the HashTable has contained such stageId, it saves the accumulator_id, value only when the task is the first finished task of a new stage or the stage is running for the first attempt...

Author: CodingCat <zhunansjtu@gmail.com>

Closes #2524 from CodingCat/SPARK-732-1 and squashes the following commits:

701a1e8 [CodingCat] roll back change on Accumulator.scala
1433e6f [CodingCat] make MIMA happy
b233737 [CodingCat] address Matei's comments
02261b8 [CodingCat] rollback  some changes
6b0aff9 [CodingCat] update document
2b2e8cf [CodingCat] updateAccumulator
83b75f8 [CodingCat] style fix
84570d2 [CodingCat] re-enable  the bad accumulator guard
1e9e14d [CodingCat] add NPE guard
21b6840 [CodingCat] simplify the patch
88d1f03 [CodingCat] fix rebase error
f74266b [CodingCat] add test case for resubmitted result stage
5cf586f [CodingCat] de-duplicate on task level
138f9b3 [CodingCat] make MIMA happy
67593d2 [CodingCat] make if allowing duplicate update as an option of accumulator
2014-11-26 16:52:04 -08:00
Xiangrui Meng 7eba0fbe45 [Spark-4509] Revert EC2 tag-based cluster membership patch
This PR reverts changes related to tag-based cluster membership. As discussed in SPARK-3332, we didn't figure out a safe strategy to use tags to determine cluster membership, because tagging is not atomic. The following changes are reverted:

SPARK-2333: 94053a7b76
SPARK-3213: 7faf755ae4
SPARK-3608: 78d4220fa0.

I tested launch, login, and destroy. It is easy to check the diff by comparing it to Josh's patch for branch-1.1:

https://github.com/apache/spark/pull/2225/files

JoshRosen I sent the PR to master. It might be easier for us to keep master and branch-1.2 the same at this time. We can always re-apply the patch once we figure out a stable solution.

Author: Xiangrui Meng <meng@databricks.com>

Closes #3453 from mengxr/SPARK-4509 and squashes the following commits:

f0b708b [Xiangrui Meng] revert 94053a7b76
4298ea5 [Xiangrui Meng] revert 7faf755ae4
35963a1 [Xiangrui Meng] Revert "SPARK-3608 Break if the instance tag naming succeeds"
2014-11-25 16:07:09 -08:00
Andrew Or 9afcbe494a [SPARK-4546] Improve HistoryServer first time user experience
The documentation points the user to run the following
```
sbin/start-history-server.sh
```
The first thing this does is throw an exception that complains a log directory is not specified. The exception message itself does not say anything about what to set. Instead we should have a default and a landing page with a better message. The new default log directory is `file:/tmp/spark-events`.

This is what it looks like as of this PR:

![after](https://issues.apache.org/jira/secure/attachment/12682985/after.png)

Author: Andrew Or <andrew@databricks.com>

Closes #3411 from andrewor14/minor-history-improvements and squashes the following commits:

f33d6b3 [Andrew Or] Point user to set config if default log dir does not exist
fc4c17a [Andrew Or] Improve HistoryServer UX
2014-11-25 15:48:02 -08:00
arahuja d240760191 [SPARK-4344][DOCS] adding documentation on spark.yarn.user.classpath.first
The documentation for the two parameters is the same with a pointer from the standalone parameter to the yarn parameter

Author: arahuja <aahuja11@gmail.com>

Closes #3209 from arahuja/yarn-classpath-first-param and squashes the following commits:

51cb9b2 [arahuja] [SPARK-4344][DOCS] adding documentation for YARN on userClassPathFirst
2014-11-25 08:23:41 -06:00
wangfei 0fe54cff19 [DOC][Build] Wrong cmd for build spark with apache hadoop 2.4.X and hive 12
Author: wangfei <wangfei1@huawei.com>

Closes #3335 from scwf/patch-10 and squashes the following commits:

d343113 [wangfei] add '-Phive'
60d595e [wangfei] [DOC] Wrong cmd for build spark with apache hadoop 2.4.X and Hive 12 support
2014-11-24 22:32:39 -08:00
Sandy Ryza 29372b6318 SPARK-4457. Document how to build for Hadoop versions greater than 2.4
Author: Sandy Ryza <sandy@cloudera.com>

Closes #3322 from sryza/sandy-spark-4457 and squashes the following commits:

5e72b77 [Sandy Ryza] Feedback
0cf05c1 [Sandy Ryza] Caveat
be8084b [Sandy Ryza] SPARK-4457. Document how to build for Hadoop versions greater than 2.4
2014-11-24 13:28:48 -06:00
Reynold Xin 28fdc6f682 [Doc][GraphX] Remove unused png files. 2014-11-21 00:30:58 -08:00
Reynold Xin b97070ec78 [Doc][GraphX] Remove Motivation section and did some minor update. 2014-11-21 00:29:02 -08:00
Davies Liu 8cd6eea629 add Sphinx as a dependency of building docs
Author: Davies Liu <davies@databricks.com>

Closes #3388 from davies/doc_readme and squashes the following commits:

daa1482 [Davies Liu] add Sphinx dependency
2014-11-20 19:13:16 -08:00
Joseph E. Gonzalez 377b068209 Updating GraphX programming guide and documentation
This pull request revises the programming guide to reflect changes in the GraphX API as well as the deprecated mapReduceTriplets operator.

Author: Joseph E. Gonzalez <joseph.e.gonzalez@gmail.com>

Closes #3359 from jegonzal/GraphXProgrammingGuide and squashes the following commits:

4421964 [Joseph E. Gonzalez] updating documentation for graphx
2014-11-19 16:53:33 -08:00
Marcelo Vanzin 397d3aae5b Bumping version to 1.3.0-SNAPSHOT.
Author: Marcelo Vanzin <vanzin@cloudera.com>

Closes #3277 from vanzin/version-1.3 and squashes the following commits:

7c3c396 [Marcelo Vanzin] Added temp repo to sbt build.
5f404ff [Marcelo Vanzin] Add another exclusion.
19457e7 [Marcelo Vanzin] Update old version to 1.2, add temporary 1.2 repo.
3c8d705 [Marcelo Vanzin] Workaround for MIMA checks.
e940810 [Marcelo Vanzin] Bumping version to 1.3.0-SNAPSHOT.
2014-11-18 21:24:18 -08:00
Josh Rosen 0f3ceb56c7 [SPARK-4180] [Core] Prevent creation of multiple active SparkContexts
This patch adds error-detection logic to throw an exception when attempting to create multiple active SparkContexts in the same JVM, since this is currently unsupported and has been known to cause confusing behavior (see SPARK-2243 for more details).

**The solution implemented here is only a partial fix.**  A complete fix would have the following properties:

1. Only one SparkContext may ever be under construction at any given time.
2. Once a SparkContext has been successfully constructed, any subsequent construction attempts should fail until the active SparkContext is stopped.
3. If the SparkContext constructor throws an exception, then all resources created in the constructor should be cleaned up (SPARK-4194).
4. If a user attempts to create a SparkContext but the creation fails, then the user should be able to create new SparkContexts.

This PR only provides 2) and 4); we should be able to provide all of these properties, but the correct fix will involve larger changes to SparkContext's construction / initialization, so we'll target it for a different Spark release.

### The correct solution:

I think that the correct way to do this would be to move the construction of SparkContext's dependencies into a static method in the SparkContext companion object.  Specifically, we could make the default SparkContext constructor `private` and change it to accept a `SparkContextDependencies` object that contains all of SparkContext's dependencies (e.g. DAGScheduler, ContextCleaner, etc.).  Secondary constructors could call a method on the SparkContext companion object to create the `SparkContextDependencies` and pass the result to the primary SparkContext constructor.  For example:

```scala
class SparkContext private (deps: SparkContextDependencies) {
  def this(conf: SparkConf) {
    this(SparkContext.getDeps(conf))
  }
}

object SparkContext(
  private[spark] def getDeps(conf: SparkConf): SparkContextDependencies = synchronized {
    if (anotherSparkContextIsActive) { throw Exception(...) }
    var dagScheduler: DAGScheduler = null
    try {
        dagScheduler = new DAGScheduler(...)
        [...]
    } catch {
      case e: Exception =>
         Option(dagScheduler).foreach(_.stop())
          [...]
    }
    SparkContextDependencies(dagScheduler, ....)
  }
}
```

This gives us mutual exclusion and ensures that any resources created during the failed SparkContext initialization are properly cleaned up.

This indirection is necessary to maintain binary compatibility.  In retrospect, it would have been nice if SparkContext had no private constructors and could only be created through builder / factory methods on its companion object, since this buys us lots of flexibility and makes dependency injection easier.

### Alternative solutions:

As an alternative solution, we could refactor SparkContext's primary constructor to perform all object creation in a giant `try-finally` block.  Unfortunately, this will require us to turn a bunch of `vals` into `vars` so that they can be assigned from the `try` block.  If we still want `vals`, we could wrap each `val` in its own `try` block (since the try block can return a value), but this will lead to extremely messy code and won't guard against the introduction of future code which doesn't properly handle failures.

The more complex approach outlined above gives us some nice dependency injection benefits, so I think that might be preferable to a `var`-ification.

### This PR's solution:

- At the start of the constructor, check whether some other SparkContext is active; if so, throw an exception.
- If another SparkContext might be under construction (or has thrown an exception during construction), allow the new SparkContext to begin construction but log a warning (since resources might have been leaked from a failed creation attempt).
- At the end of the SparkContext constructor, check whether some other SparkContext constructor has raced and successfully created an active context.  If so, throw an exception.

This guarantees that no two SparkContexts will ever be active and exposed to users (since we check at the very end of the constructor).  If two threads race to construct SparkContexts, then one of them will win and another will throw an exception.

This exception can be turned into a warning by setting `spark.driver.allowMultipleContexts = true`.  The exception is disabled in unit tests, since there are some suites (such as Hive) that may require more significant refactoring to clean up their SparkContexts.  I've made a few changes to other suites' test fixtures to properly clean up SparkContexts so that the unit test logs contain fewer warnings.

Author: Josh Rosen <joshrosen@databricks.com>

Closes #3121 from JoshRosen/SPARK-4180 and squashes the following commits:

23c7123 [Josh Rosen] Merge remote-tracking branch 'origin/master' into SPARK-4180
d38251b [Josh Rosen] Address latest round of feedback.
c0987d3 [Josh Rosen] Accept boolean instead of SparkConf in methods.
85a424a [Josh Rosen] Incorporate more review feedback.
372d0d3 [Josh Rosen] Merge remote-tracking branch 'origin/master' into SPARK-4180
f5bb78c [Josh Rosen] Update mvn build, too.
d809cb4 [Josh Rosen] Improve handling of failed SparkContext creation attempts.
79a7e6f [Josh Rosen] Fix commented out test
a1cba65 [Josh Rosen] Merge remote-tracking branch 'origin/master' into SPARK-4180
7ba6db8 [Josh Rosen] Add utility to set system properties in tests.
4629d5c [Josh Rosen] Set spark.driver.allowMultipleContexts=true in tests.
ed17e14 [Josh Rosen] Address review feedback; expose hack workaround for existing unit tests.
1c66070 [Josh Rosen] Merge remote-tracking branch 'origin/master' into SPARK-4180
06c5c54 [Josh Rosen] Add / improve SparkContext cleanup in streaming BasicOperationsSuite
d0437eb [Josh Rosen] StreamingContext.stop() should stop SparkContext even if StreamingContext has not been started yet.
c4d35a2 [Josh Rosen] Log long form of creation site to aid debugging.
918e878 [Josh Rosen] Document "one SparkContext per JVM" limitation.
afaa7e3 [Josh Rosen] [SPARK-4180] Prevent creations of multiple active SparkContexts.
2014-11-17 12:48:18 -08:00
Andy Konwinski cec1116b4b [DOCS][SQL] Fix broken link to Row class scaladoc
Author: Andy Konwinski <andykonwinski@gmail.com>

Closes #3323 from andyk/patch-2 and squashes the following commits:

4699fdc [Andy Konwinski] Fix broken link to Row class scaladoc
2014-11-17 11:52:47 -08:00
zsxwing 861223ee5b [SPARK-4363][Doc] Update the Broadcast example
Author: zsxwing <zsxwing@gmail.com>

Closes #3226 from zsxwing/SPARK-4363 and squashes the following commits:

8109914 [zsxwing] Update the Broadcast example
2014-11-14 22:28:48 -08:00
Sandy Ryza f5f757e4ed SPARK-4375. no longer require -Pscala-2.10
It seems like the winds might have moved away from this approach, but wanted to post the PR anyway because I got it working and to show what it would look like.

Author: Sandy Ryza <sandy@cloudera.com>

Closes #3239 from sryza/sandy-spark-4375 and squashes the following commits:

0ffbe95 [Sandy Ryza] Enable -Dscala-2.11 in sbt
cd42d94 [Sandy Ryza] Update doc
f6644c3 [Sandy Ryza] SPARK-4375 take 2
2014-11-14 14:21:57 -08:00
WangTao e421072da0 [SPARK-3722][Docs]minor improvement and fix in docs
https://issues.apache.org/jira/browse/SPARK-3722

Author: WangTao <barneystinson@aliyun.com>

Closes #2579 from WangTaoTheTonic/docsWork and squashes the following commits:

6f91cec [WangTao] use more wording express
29d22fa [WangTao] delete the specified version link
34cb4ea [WangTao] Update running-on-yarn.md
4ee1a26 [WangTao] minor improvement and fix in docs
2014-11-14 08:09:42 -06:00
Prashant Sharma daaca14c16 Support cross building for Scala 2.11
Let's give this another go using a version of Hive that shades its JLine dependency.

Author: Prashant Sharma <prashant.s@imaginea.com>
Author: Patrick Wendell <pwendell@gmail.com>

Closes #3159 from pwendell/scala-2.11-prashant and squashes the following commits:

e93aa3e [Patrick Wendell] Restoring -Phive-thriftserver profile and cleaning up build script.
f65d17d [Patrick Wendell] Fixing build issue due to merge conflict
a8c41eb [Patrick Wendell] Reverting dev/run-tests back to master state.
7a6eb18 [Patrick Wendell] Merge remote-tracking branch 'apache/master' into scala-2.11-prashant
583aa07 [Prashant Sharma] REVERT ME: removed hive thirftserver
3680e58 [Prashant Sharma] Revert "REVERT ME: Temporarily removing some Cli tests."
935fb47 [Prashant Sharma] Revert "Fixed by disabling a few tests temporarily."
925e90f [Prashant Sharma] Fixed by disabling a few tests temporarily.
2fffed3 [Prashant Sharma] Exclude groovy from sbt build, and also provide a way for such instances in future.
8bd4e40 [Prashant Sharma] Switched to gmaven plus, it fixes random failures observer with its predecessor gmaven.
5272ce5 [Prashant Sharma] SPARK_SCALA_VERSION related bugs.
2121071 [Patrick Wendell] Migrating version detection to PySpark
b1ed44d [Patrick Wendell] REVERT ME: Temporarily removing some Cli tests.
1743a73 [Patrick Wendell] Removing decimal test that doesn't work with Scala 2.11
f5cad4e [Patrick Wendell] Add Scala 2.11 docs
210d7e1 [Patrick Wendell] Revert "Testing new Hive version with shaded jline"
48518ce [Patrick Wendell] Remove association of Hive and Thriftserver profiles.
e9d0a06 [Patrick Wendell] Revert "Enable thritfserver for Scala 2.10 only"
67ec364 [Patrick Wendell] Guard building of thriftserver around Scala 2.10 check
8502c23 [Patrick Wendell] Enable thritfserver for Scala 2.10 only
e22b104 [Patrick Wendell] Small fix in pom file
ec402ab [Patrick Wendell] Various fixes
0be5a9d [Patrick Wendell] Testing new Hive version with shaded jline
4eaec65 [Prashant Sharma] Changed scripts to ignore target.
5167bea [Prashant Sharma] small correction
a4fcac6 [Prashant Sharma] Run against scala 2.11 on jenkins.
80285f4 [Prashant Sharma] MAven equivalent of setting spark.executor.extraClasspath during tests.
034b369 [Prashant Sharma] Setting test jars on executor classpath during tests from sbt.
d4874cb [Prashant Sharma] Fixed Python Runner suite. null check should be first case in scala 2.11.
6f50f13 [Prashant Sharma] Fixed build after rebasing with master. We should use ${scala.binary.version} instead of just 2.10
e56ca9d [Prashant Sharma] Print an error if build for 2.10 and 2.11 is spotted.
937c0b8 [Prashant Sharma] SCALA_VERSION -> SPARK_SCALA_VERSION
cb059b0 [Prashant Sharma] Code review
0476e5e [Prashant Sharma] Scala 2.11 support with repl and all build changes.
2014-11-11 21:36:48 -08:00
Kousuke Saruta 3c07b8f082 [SPARK-4330][Doc] Link to proper URL for YARN overview
In running-on-yarn.md, a link to YARN overview is here.
But the URL is to YARN alpha's.
It should be stable's.

Author: Kousuke Saruta <sarutak@oss.nttdata.co.jp>

Closes #3196 from sarutak/SPARK-4330 and squashes the following commits:

30baa21 [Kousuke Saruta] Fixed running-on-yarn.md to point proper URL for YARN
2014-11-10 22:18:00 -08:00
Sandy Ryza c6f4e70421 SPARK-4230. Doc for spark.default.parallelism is incorrect
Author: Sandy Ryza <sandy@cloudera.com>

Closes #3107 from sryza/sandy-spark-4230 and squashes the following commits:

37a1d19 [Sandy Ryza] Clear up a couple things
34d53de [Sandy Ryza] SPARK-4230. Doc for spark.default.parallelism is incorrect
2014-11-10 12:40:41 -08:00
Sean Owen 8c99a47a4f SPARK-971 [DOCS] Link to Confluence wiki from project website / documentation
This is a trivial change to add links to the wiki from `README.md` and the main docs page. It is already linked to from spark.apache.org.

Author: Sean Owen <sowen@cloudera.com>

Closes #3169 from srowen/SPARK-971 and squashes the following commits:

dcb84d0 [Sean Owen] Add link to wiki from README, docs home page
2014-11-09 17:40:48 -08:00
wangfei 636d7bcc96 [SQL][DOC][Minor] Spark SQL Hive now support dynamic partitioning
Author: wangfei <wangfei1@huawei.com>

Closes #3127 from scwf/patch-9 and squashes the following commits:

e39a560 [wangfei] now support dynamic partitioning
2014-11-07 11:43:35 -08:00
jay@apache.org 868cd4c3ca SPARK-4040. Update documentation to exemplify use of local (n) value, fo...
This is a minor docs update which helps to clarify the way local[n] is used for streaming apps.

Author: jay@apache.org <jayunit100>

Closes #2964 from jayunit100/SPARK-4040 and squashes the following commits:

35b5a5e [jay@apache.org] SPARK-4040: Update documentation to exemplify use of local (n) value.
2014-11-05 15:45:34 -08:00