Author: Nicholas Chammas <nicholas.chammas@gmail.com>
Closes #3772 from nchammas/patch-1 and squashes the following commits:
b7d9083 [Nicholas Chammas] [Docs] Minor typo fixes
Author: Tsuyoshi Ozawa <ozawa.tsuyoshi@lab.ntt.co.jp>
Closes #3757 from oza/SPARK-4915 and squashes the following commits:
3b0d6d6 [Tsuyoshi Ozawa] Fix classname to be specified for external shuffle service.
Once the external shuffle service is also documented, the dynamic allocation section will link to it. Let me know if the whole dynamic allocation section should be moved to its own page; I personally think the organization might be cleaner that way.
This patch builds on top of oza's work in #3689.
aarondav pwendell
Author: Andrew Or <andrew@databricks.com>
Author: Tsuyoshi Ozawa <ozawa.tsuyoshi@gmail.com>
Closes #3731 from andrewor14/document-dynamic-allocation and squashes the following commits:
1281447 [Andrew Or] Address a few comments
b9843f2 [Andrew Or] Document the configs as well
246fb44 [Andrew Or] Merge branch 'SPARK-4839' of github.com:oza/spark into document-dynamic-allocation
8c64004 [Andrew Or] Add documentation for dynamic allocation (without configs)
6827b56 [Tsuyoshi Ozawa] Fixing a documentation of spark.dynamicAllocation.enabled.
53cff58 [Tsuyoshi Ozawa] Adding a documentation about dynamic resource allocation.
The signature of registerKryoClasses actually takes Array[Class[_]], not Seq.
Author: Eran Medan <ehrann.mehdan@gmail.com>
Closes #3747 from eranation/patch-1 and squashes the following commits:
ee9885d [Eran Medan] change signature of example to match released code
Author: Timothy Chen <tnachen@gmail.com>
Closes #3349 from tnachen/mesos_doc and squashes the following commits:
737ef49 [Timothy Chen] Add TOC
5ca546a [Timothy Chen] Update description around cores requested.
26283a5 [Timothy Chen] Add mesos specific configurations into doc
... changed to a time period
Author: Sandy Ryza <sandy@cloudera.com>
Closes #3471 from sryza/sandy-spark-3779 and squashes the following commits:
20b9887 [Sandy Ryza] Deprecate old property
42b5df7 [Sandy Ryza] Review feedback
9a959a1 [Sandy Ryza] SPARK-3779. yarn spark.yarn.applicationMaster.waitTries config should be changed to a time period
Currently, there is no way to pass YARN AM-specific Java options. This causes potential issues when reading the classpath from a Hadoop configuration file. Hadoop configuration replaces variables in its properties with the system properties passed in the Java options. How to specify the value depends on the Hadoop distribution.
The new options are SPARK_YARN_JAVA_OPTS or spark.yarn.extraJavaOptions. I made them global Spark-level settings, because typically we don't want users to specify this on the command line each time they submit a Spark job once it is set up in spark-defaults.conf.
In addition, allowing extra options to be passed to the AM provides more flexibility.
For example, in the following valid mapred-site.xml file, the classpath specifies values using a system property. Hadoop handles it correctly because the Java options are passed in.
In this example, Spark currently breaks because hadoop.version is not passed in.
<property>
<name>mapreduce.application.classpath</name>
<value>/etc/hadoop/${hadoop.version}/mapreduce/*</value>
</property>
In the meantime, we cannot rely on mapreduce.admin.map.child.java.opts in mapred-site.xml, because it specifies its own extra Java options, which do not apply to Spark.
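The substitution problem described above can be illustrated with a minimal sketch. It is not Hadoop's implementation: the `expand` helper and the version string `2.6.0` are hypothetical, and the sketch assumes that a placeholder with no matching system property is left in the value unchanged.

```python
import re

# Matches ${name} placeholders such as ${hadoop.version}.
_PLACEHOLDER = re.compile(r"\$\{([^}]+)\}")


def expand(value, system_properties):
    """Replace ${name} with system_properties[name]; leave unknown names untouched."""
    def lookup(match):
        name = match.group(1)
        return system_properties.get(name, match.group(0))
    return _PLACEHOLDER.sub(lookup, value)


classpath = "/etc/hadoop/${hadoop.version}/mapreduce/*"

# With the property supplied, as a -Dhadoop.version=... Java option would do.
# "2.6.0" is a made-up version string used only for illustration.
resolved = expand(classpath, {"hadoop.version": "2.6.0"})

# Without it, the placeholder survives and the path matches nothing useful.
unresolved = expand(classpath, {})

print(resolved)
print(unresolved)
```

The second call shows the failure mode: when the AM's JVM never receives the property, the classpath entry still contains the literal placeholder.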
Author: Zhan Zhang <zhazhan@gmail.com>
Closes #3409 from zhzhan/Spark-4461 and squashes the following commits:
daec3d0 [Zhan Zhang] solve review comments
08f44a7 [Zhan Zhang] add warning in driver mode if spark.yarn.am.extraJavaOptions is configured
5a505d3 [Zhan Zhang] solve review comments
4ed43ad [Zhan Zhang] solve review comments
ad777ed [Zhan Zhang] Merge branch 'master' into Spark-4461
3e9e574 [Zhan Zhang] Merge branch 'master' of https://github.com/apache/spark
e3f9abe [Zhan Zhang] solve review comments
8963552 [Zhan Zhang] rebase
f8f6700 [Zhan Zhang] Merge branch 'master' of https://github.com/apache/spark
dea1692 [Zhan Zhang] change the option key name to client mode specific
90d5dff [Zhan Zhang] rebase
8ac9254 [Zhan Zhang] Merge branch 'master' of https://github.com/apache/spark
092a25f [Zhan Zhang] solve review comments
bc5a9ae [Zhan Zhang] solve review comments
782b014 [Zhan Zhang] add new configuration to docs/running-on-yarn.md and remove it from spark-defaults.conf.template
6faaa97 [Zhan Zhang] solve review comments
369863f [Zhan Zhang] clean up unnecessary var
733de9c [Zhan Zhang] Merge branch 'master' into Spark-4461
a68e7f0 [Zhan Zhang] Merge branch 'master' of https://github.com/apache/spark
864505a [Zhan Zhang] Add extra java options to be passed to Yarn application master
15830fc [Zhan Zhang] Merge branch 'master' of https://github.com/apache/spark
685d911 [Zhan Zhang] Merge branch 'master' of https://github.com/apache/spark
03ebad3 [Zhan Zhang] Merge branch 'master' of https://github.com/zhzhan/spark
46d9e3d [Zhan Zhang] Merge branch 'master' of https://github.com/apache/spark
ebb213a [Zhan Zhang] revert
b983ef3 [Zhan Zhang] test
c4efb9b [Zhan Zhang] Merge branch 'master' of https://github.com/apache/spark
779d67b [Zhan Zhang] Merge branch 'master' of https://github.com/apache/spark
4daae6d [Zhan Zhang] Merge branch 'master' of https://github.com/apache/spark
12e1be5 [Zhan Zhang] Merge branch 'master' of https://github.com/apache/spark
ce0ca7b [Zhan Zhang] Merge branch 'master' of https://github.com/apache/spark
93f3081 [Zhan Zhang] Merge branch 'master' of https://github.com/apache/spark
3764505 [Zhan Zhang] Merge branch 'master' of https://github.com/apache/spark
a9d372b [Zhan Zhang] Merge branch 'master' of https://github.com/zhzhan/spark
a00f60f [Zhan Zhang] Merge branch 'master' of https://github.com/apache/spark
497b0f4 [Zhan Zhang] Merge branch 'master' of https://github.com/apache/spark
4a2e36d [Zhan Zhang] Merge branch 'master' of https://github.com/apache/spark
a72c0d4 [Zhan Zhang] Merge branch 'master' of https://github.com/apache/spark
301eb4a [Zhan Zhang] Merge branch 'master' of https://github.com/apache/spark
cedcc6f [Zhan Zhang] Merge branch 'master' of https://github.com/apache/spark
adf4924 [Zhan Zhang] Merge branch 'master' of https://github.com/apache/spark
d10bf00 [Zhan Zhang] Merge branch 'master' of https://github.com/apache/spark
7e0cc36 [Zhan Zhang] Merge branch 'master' of https://github.com/apache/spark
68deb11 [Zhan Zhang] Merge branch 'master' of https://github.com/apache/spark
3ee3b2b [Zhan Zhang] Merge branch 'master' of https://github.com/apache/spark
2b0d513 [Zhan Zhang] Merge branch 'master' of https://github.com/apache/spark
1ccd7cc [Zhan Zhang] Merge branch 'master' of https://github.com/apache/spark
af9feb9 [Zhan Zhang] Merge branch 'master' of https://github.com/apache/spark
e4c1982 [Zhan Zhang] Merge branch 'master' of https://github.com/apache/spark
921e914 [Zhan Zhang] Merge branch 'master' of https://github.com/apache/spark
789ea21 [Zhan Zhang] Merge branch 'master' of https://github.com/apache/spark
cb53a2c [Zhan Zhang] Merge branch 'master' of https://github.com/apache/spark
f6a8a40 [Zhan Zhang] revert
ba14f28 [Zhan Zhang] test
* This commit aims to avoid the confusion I faced when trying
to submit a regular, valid multi-line JSON file; see also
http://apache-spark-user-list.1001560.n3.nabble.com/Loading-JSON-Dataset-fails-with-com-fasterxml-jackson-databind-JsonMappingException-td20041.html
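A small sketch of why the layout matters, using only Python's standard `json` module rather than Spark itself. The `parse_per_line` helper is hypothetical and the sample records are illustrative. It assumes a reader that treats every line as a separate JSON document, which is the behavior the note describes.

```python
import json

# One complete JSON object per line: the layout the note describes.
line_delimited = '{"name": "Michael"}\n{"name": "Andy", "age": 30}\n'

# A regular, valid, pretty-printed JSON document spanning several lines.
multi_line = '{\n  "name": "Michael"\n}\n'


def parse_per_line(text):
    """Parse each non-empty line as its own JSON document."""
    return [json.loads(line) for line in text.splitlines() if line.strip()]


records = parse_per_line(line_delimited)

try:
    parse_per_line(multi_line)
    multi_line_ok = True
except ValueError:
    # json.JSONDecodeError is a subclass of ValueError.
    multi_line_ok = False

print(records)
print(multi_line_ok)
```

The multi-line document is valid JSON as a whole, but its first line on its own is just `{`, so a per-line reader rejects it.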
Author: Peter Vandenabeele <peter@vandenabeele.com>
Closes #3517 from petervandenabeele/pv-docs-note-on-jsonFile-format/01 and squashes the following commits:
1f98e52 [Peter Vandenabeele] Revert to people.json and simple Note text
6b6e062 [Peter Vandenabeele] Change the "JSON" connotation to "txt"
fca7dfb [Peter Vandenabeele] Add a Note on jsonFile having separate JSON objects per line
Add HTTP protocol support and test cases to the Spark Thrift server, so users can deploy the Thrift server in either TCP or HTTP mode.
Author: Judy Nash <judynash@microsoft.com>
Author: judynash <judynash@microsoft.com>
Closes #3672 from judynash/master and squashes the following commits:
526315d [Judy Nash] correct spacing on startThriftServer method
31a6520 [Judy Nash] fix code style issues and update sql programming guide format issue
47bf87e [Judy Nash] modify withJdbcStatement method definition to meet less than 100 line length
2e9c11c [Judy Nash] add thrift server in http mode documentation on sql programming guide
1cbd305 [Judy Nash] Merge remote-tracking branch 'upstream/master'
2b1d312 [Judy Nash] updated http thrift server support based on feedback
377532c [judynash] add HTTP protocol spark thrift server
Based on this gist:
https://gist.github.com/amar-analytx/0b62543621e1f246c0a2
We use security group IDs instead of security group names to get around this issue:
https://github.com/boto/boto/issues/350
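A rough sketch of the idea, written without importing boto so it runs anywhere. The `Group` tuple, the `launch_kwargs` helper, and the ids are hypothetical. The keyword names `security_group_ids` and `subnet_id` are meant to mirror boto's instance launch parameters as I understand them and should be checked against the boto version in use.

```python
from collections import namedtuple

# Stand-in for a security group object; only the two fields used here.
Group = namedtuple("Group", ["id", "name"])


def launch_kwargs(groups, subnet_id=None):
    """Build keyword arguments for an instance launch call.

    Groups are always passed by id, never by name, so the same code path
    works whether or not a subnet is given.
    """
    kwargs = {"security_group_ids": [g.id for g in groups]}
    if subnet_id is not None:
        kwargs["subnet_id"] = subnet_id
    return kwargs


# Hypothetical ids, for illustration only.
master = Group(id="sg-11111111", name="cluster-master")
kwargs = launch_kwargs([master], subnet_id="subnet-22222222")
print(kwargs)
```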
Author: Mike Jennings <mvj101@gmail.com>
Author: Mike Jennings <mvj@google.com>
Closes #2872 from mvj101/SPARK-3405 and squashes the following commits:
be9cb43 [Mike Jennings] `pep8 spark_ec2.py` runs cleanly.
4dc6756 [Mike Jennings] Remove duplicate comment
731d94c [Mike Jennings] Update for code review.
ad90a36 [Mike Jennings] Merge branch 'master' of https://github.com/apache/spark into SPARK-3405
1ebffa1 [Mike Jennings] Merge branch 'master' into SPARK-3405
52aaeec [Mike Jennings] [SPARK-3405] add subnet-id and vpc-id options to spark_ec2.py
Important updates to the streaming programming guide
- Make the fault-tolerance properties easier to understand, with information about write ahead logs
- Update the information about deploying the spark streaming app with information about Driver HA
- Update Receiver guide to discuss reliable vs unreliable receivers.
Author: Tathagata Das <tathagata.das1565@gmail.com>
Author: Josh Rosen <joshrosen@databricks.com>
Author: Josh Rosen <rosenville@gmail.com>
Closes #3653 from tdas/streaming-doc-update-1.2 and squashes the following commits:
f53154a [Tathagata Das] Addressed Josh's comments.
ce299e4 [Tathagata Das] Minor update.
ca19078 [Tathagata Das] Minor change
f746951 [Tathagata Das] Mentioned performance problem with WAL
7787209 [Tathagata Das] Merge branch 'streaming-doc-update-1.2' of github.com:tdas/spark into streaming-doc-update-1.2
2184729 [Tathagata Das] Updated Kafka and Flume guides with reliability information.
2f3178c [Tathagata Das] Added more information about writing reliable receivers in the custom receiver guide.
91aa5aa [Tathagata Das] Improved API Docs menu
5707581 [Tathagata Das] Added Python API badge
b9c8c24 [Tathagata Das] Merge pull request #26 from JoshRosen/streaming-programming-guide
b8c8382 [Josh Rosen] minor fixes
a4ef126 [Josh Rosen] Restructure parts of the fault-tolerance section to read a bit nicer when skipping over the headings
65f66cd [Josh Rosen] Fix broken link to fault-tolerance semantics section.
f015397 [Josh Rosen] Minor grammar / pluralization fixes.
3019f3a [Josh Rosen] Fix minor Markdown formatting issues
aa8bb87 [Tathagata Das] Small update.
195852c [Tathagata Das] Updated based on Josh's comments, updated receiver reliability and deploying section, and also updated configuration.
17b99fb [Tathagata Das] Merge remote-tracking branch 'apache-github/master' into streaming-doc-update-1.2
a0217c0 [Tathagata Das] Changed Deploying menu layout
67fcffc [Tathagata Das] Added cluster mode + supervise example to submitting application guide.
e45453b [Tathagata Das] Update streaming guide, added deploying section.
192c7a7 [Tathagata Das] Added more info about Python API, and rewrote the checkpointing section.
cc kayousterhout
I have a few outstanding questions from compiling this documentation:
- What's the difference between NO_PREF and ANY? I understand the implications of the ordering but don't know what an example of each would be
- Why is NO_PREF ahead of RACK_LOCAL? I would think it'd be better to schedule rack-local tasks ahead of no preference if you could only do one or the other. Is the idea to wait longer and hope for the rack-local tasks to turn into node-local or better?
- Will there be a datacenter-local locality level in the future? Apache Cassandra for example has this level
Author: Andrew Ash <andrew@andrewash.com>
Closes #2519 from ash211/SPARK-3526 and squashes the following commits:
44cff28 [Andrew Ash] Link to spark.locality parameters rather than copying the list
6d5d966 [Andrew Ash] Stay focused on Spark, no astronaut architecture mumbo-jumbo
20e0e31 [Andrew Ash] SPARK-3526 Add section about data locality to the tuning guide
tdas looks like streaming already refers to the supervise mode. The link from there is broken though.
Author: Andrew Or <andrew@databricks.com>
Closes #3627 from andrewor14/document-supervise and squashes the following commits:
9ca0908 [Andrew Or] Wording changes
2b55ed2 [Andrew Or] Document standalone cluster supervise mode
Sorry if this is a little premature with 1.2 still not out the door, but it will make other work like SPARK-4136 and SPARK-2089 a lot easier.
Author: Sandy Ryza <sandy@cloudera.com>
Closes #3215 from sryza/sandy-spark-4338 and squashes the following commits:
1c5ac08 [Sandy Ryza] Update building Spark docs and remove unnecessary newline
9c1421c [Sandy Ryza] SPARK-4338. Ditch yarn-alpha.
...umented default is incorrect for YARN
Author: Sandy Ryza <sandy@cloudera.com>
Closes #3624 from sryza/sandy-spark-4770 and squashes the following commits:
bd81a3a [Sandy Ryza] SPARK-4770. [DOC] [YARN] spark.scheduler.minRegisteredResourcesRatio documented default is incorrect for YARN
Author: CrazyJvm <crazyjvm@gmail.com>
Closes #3620 from CrazyJvm/streaming-foreachRDD and squashes the following commits:
b72886b [CrazyJvm] do you mean inadvertently?
Added description about -h and -host.
Modified description about -i and -ip which are now deprecated.
Added description about --properties-file.
Author: Masayoshi TSUZUKI <tsudukim@oss.nttdata.co.jp>
Closes #3329 from tsudukim/feature/SPARK-4464 and squashes the following commits:
6c07caf [Masayoshi TSUZUKI] [SPARK-4464] Description about configuration options need to be modified in docs.
Author: Andy Konwinski <andykonwinski@gmail.com>
Closes #3611 from andyk/patch-3 and squashes the following commits:
7bab333 [Andy Konwinski] Fix typo in Spark SQL docs.
Modified the link of building Spark.
Author: Masayoshi TSUZUKI <tsudukim@oss.nttdata.co.jp>
Closes #3279 from tsudukim/feature/SPARK-4421 and squashes the following commits:
56e31c1 [Masayoshi TSUZUKI] Modified the link of building Spark.
There may be cases where a work-in-progress (WIP) Spark version needs to be run
on an EC2 cluster. To make setting up this type of cluster easier,
add a description of the --spark-git-repo option to the EC2 documentation.
Author: lewuathe <lewuathe@me.com>
Author: Josh Rosen <joshrosen@databricks.com>
Closes #3513 from Lewuathe/doc-for-development-spark-cluster and squashes the following commits:
6dae8ee [lewuathe] Wrap consistent with other descriptions
cfaf9be [lewuathe] Add docs about spark-git-repo option
(Editing / cleanup by Josh Rosen)
and some minor changes in ScalaDoc.
Author: Xiangrui Meng <meng@databricks.com>
Closes #3601 from mengxr/SPARK-4575-fix and squashes the following commits:
c559768 [Xiangrui Meng] minor code update
ce94da8 [Xiangrui Meng] Java Bean -> JavaBean
0b5c182 [Xiangrui Meng] fix links in ml-guide
Documentation:
* Added ml-guide.md, linked from mllib-guide.md
* Updated mllib-guide.md with small section pointing to ml-guide.md
Examples:
* CrossValidatorExample
* SimpleParamsExample
* (I copied these + the SimpleTextClassificationPipeline example into the ml-guide.md)
Bug fixes:
* PipelineModel: did not use ParamMaps correctly
* UnaryTransformer: issues with TypeTag serialization (Thanks to mengxr for that fix!)
CC: mengxr shivaram etrain Documentation for Pipelines: I know the docs are not complete, but the goal is to have enough to let interested people get started using spark.ml and to add more docs once the package is more established/complete.
Author: Joseph K. Bradley <joseph@databricks.com>
Author: jkbradley <joseph.kurata.bradley@gmail.com>
Author: Xiangrui Meng <meng@databricks.com>
Closes #3588 from jkbradley/ml-package-docs and squashes the following commits:
d393b5c [Joseph K. Bradley] fixed bug in Pipeline (typo from last commit). updated examples for CV and Params for spark.ml
c38469c [Joseph K. Bradley] Updated ml-guide with CV examples
99f88c2 [Joseph K. Bradley] Fixed bug in PipelineModel.transform* with usage of params. Updated CrossValidatorExample to use more training examples so it is less likely to get a 0-size fold.
ea34dc6 [jkbradley] Merge pull request #4 from mengxr/ml-package-docs
3b83ec0 [Xiangrui Meng] replace TypeTag with explicit datatype
41ad9b1 [Joseph K. Bradley] Added examples for spark.ml: SimpleParamsExample + Java version, CrossValidatorExample + Java version. CrossValidatorExample not working yet. Added programming guide for spark.ml, but need to add CrossValidatorExample to it once CrossValidatorExample works.
When you use the SPARK_JAVA_OPTS env variable, Spark complains:
```
SPARK_JAVA_OPTS was detected (set to ' -verbose:gc -XX:+PrintGCDetails -XX:+PrintGCTimeStamps ').
This is deprecated in Spark 1.0+.
Please instead use:
- ./spark-submit with conf/spark-defaults.conf to set defaults for an application
- ./spark-submit with --driver-java-options to set -X options for a driver
- spark.executor.extraJavaOptions to set -X options for executors
- SPARK_DAEMON_JAVA_OPTS to set java options for standalone daemons (master or worker)
```
This updates the docs to redirect the user to the relevant part of the configuration docs.
CC: mengxr but please CC someone else as needed
Author: Joseph K. Bradley <joseph@databricks.com>
Closes #3592 from jkbradley/tuning-doc and squashes the following commits:
0760ce1 [Joseph K. Bradley] fixed outdated comment in tuning guide
Major changes:
* Added programming guide sections for tree ensembles
* Added examples for tree ensembles
* Updated DecisionTree programming guide with more info on parameters
* **API change**: Standardized the tree parameter for the number of classes (for classification)
Minor changes:
* Updated decision tree documentation
* Updated existing tree and tree ensemble examples
* Use train/test split, and compute test error instead of training error.
* Fixed decision_tree_runner.py to actually use the number of classes it computes from data. (small bug fix)
Note: I know this is a lot of lines, but most is covered by:
* Programming guide sections for gradient boosting and random forests. (The changes are probably best viewed by generating the docs locally.)
* New examples (which were copied from the programming guide)
* The "numClasses" renaming
I have run all examples and relevant unit tests.
CC: mengxr manishamde codedeft
Author: Joseph K. Bradley <joseph@databricks.com>
Author: Joseph K. Bradley <joseph.kurata.bradley@gmail.com>
Closes #3461 from jkbradley/ensemble-docs and squashes the following commits:
70a75f3 [Joseph K. Bradley] updated forest vs boosting comparison
d1de753 [Joseph K. Bradley] Added note about toString and toDebugString for DecisionTree to migration guide
8e87f8f [Joseph K. Bradley] Combined GBT and RandomForest guides into one ensembles guide
6fab846 [Joseph K. Bradley] small fixes based on review
b9f8576 [Joseph K. Bradley] updated decision tree doc
375204c [Joseph K. Bradley] fixed python style
2b60b6e [Joseph K. Bradley] merged Java RandomForest examples into 1 file. added header. Fixed small bug in same example in the programming guide.
706d332 [Joseph K. Bradley] updated python DT runner to print full model if it is small
c76c823 [Joseph K. Bradley] added migration guide for mllib
abe5ed7 [Joseph K. Bradley] added examples for random forest in Java and Python to examples folder
07fc11d [Joseph K. Bradley] Renamed numClassesForClassification to numClasses everywhere in trees and ensembles. This is a breaking API change, but it was necessary to correct an API inconsistency in Spark 1.1 (where Python DecisionTree used numClasses but Scala used numClassesForClassification).
cdfdfbc [Joseph K. Bradley] added examples for GBT
6372a2b [Joseph K. Bradley] updated decision tree examples to use random split. tested all of them.
ad3e695 [Joseph K. Bradley] added gbt and random forest to programming guide. still need to update their examples
I have heard requests for the docs to include advice about choosing an optimization method. The programming guide could include a brief statement about this (so the user does not have to read the whole optimization section).
CC: mengxr
Author: Joseph K. Bradley <joseph@databricks.com>
Closes #3569 from jkbradley/lr-doc and squashes the following commits:
654aeb5 [Joseph K. Bradley] updated section header for mllib-optimization
5035ad0 [Joseph K. Bradley] updated based on review
94f6dec [Joseph K. Bradley] Updated linear methods and optimization docs with quick advice on choosing an optimization method
Added descriptions of these parameters.
- spark.yarn.queue
Modified the description of the default value of this parameter.
- spark.yarn.submit.file.replication
Author: Masayoshi TSUZUKI <tsudukim@oss.nttdata.co.jp>
Closes #3500 from tsudukim/feature/SPARK-4642 and squashes the following commits:
ce99655 [Masayoshi TSUZUKI] better grammatically.
21cf624 [Masayoshi TSUZUKI] Removed intentionally undocumented properties.
88cac9b [Masayoshi TSUZUKI] [SPARK-4642] Documents about running-on-YARN needs update
If `spark-submit` finds the datanucleus jars, it adds them to the driver's classpath, but does not add them to the container.
This patch modifies the yarn deployment class to copy all `datanucleus-*` jars found in `[spark-home]/libs` to the container.
Author: Jim Lim <jim@quixey.com>
Closes #3238 from jimjh/SPARK-2624 and squashes the following commits:
3633071 [Jim Lim] SPARK-2624 update documentation and comments
fe95125 [Jim Lim] SPARK-2624 keep java imports together
6c31fe0 [Jim Lim] SPARK-2624 update documentation
6690fbf [Jim Lim] SPARK-2624 add tests
d28d8e9 [Jim Lim] SPARK-2624 add spark.yarn.datanucleus.dir option
84e6cba [Jim Lim] SPARK-2624 add datanucleus jars to the container in yarn-cluster
The link points to the old scala programming guide; it should point to the submitting applications page.
This should be backported to 1.1.2 (it's been broken as of 1.0).
Author: Kay Ousterhout <kayousterhout@gmail.com>
Closes #3542 from kayousterhout/SPARK-4686 and squashes the following commits:
a8fc43b [Kay Ousterhout] [SPARK-4686] Link to allowed master URLs is broken
Author: Daoyuan Wang <daoyuan.wang@intel.com>
Closes #3535 from adrian-wang/datedoc and squashes the following commits:
18ff1ed [Daoyuan Wang] [DOC] Date type
Documents `spark.sql.parquet.filterPushdown`, explains why it's turned off by default and when it's safe to be turned on.
Author: Cheng Lian <lian@databricks.com>
Closes #3440 from liancheng/parquet-filter-pushdown-doc and squashes the following commits:
2104311 [Cheng Lian] Documents spark.sql.parquet.filterPushdown
Author: Cheng Lian <lian@databricks.com>
Closes #3498 from liancheng/fix-sql-doc-typo and squashes the following commits:
865ecd7 [Cheng Lian] Fixes formatting typo in SQL programming guide
Grammatical error in Programming Guide document
Author: lewuathe <lewuathe@me.com>
Closes #3412 from Lewuathe/typo-programming-guide and squashes the following commits:
a3e2f00 [lewuathe] Typo in Programming Guide markdown
To build with Scala 2.11, we have to run `change-version-to-2.11.sh` before running Maven; otherwise, inter-module dependencies are broken.
Author: Takuya UESHIN <ueshin@happy-camper.st>
Closes #3361 from ueshin/docs/building-spark_2.11 and squashes the following commits:
1d29126 [Takuya UESHIN] Add instruction to use change-version-to-2.11.sh in 'Building for Scala 2.11'.
Warn against subclassing scala.App, and remove one instance of this in examples
Author: Sean Owen <sowen@cloudera.com>
Closes #3497 from srowen/SPARK-4170 and squashes the following commits:
4a6131f [Sean Owen] Restore multiline string formatting
a8ca895 [Sean Owen] Warn against subclassing scala.App, and remove one instance of this in examples
https://issues.apache.org/jira/browse/SPARK-3628
In the current implementation, the accumulator is updated for every successfully finished task, even if the task belongs to a resubmitted stage, which makes the accumulator's value counter-intuitive.
In this patch, I changed the way the DAGScheduler updates accumulators.
The DAGScheduler maintains a hash table mapping each stage id to the received <accumulator_id, value> pairs. The values of these pairs are accumulated only when the stage becomes independent (no job needs it any more). When a task finishes, we check whether the hash table already contains its stageId; the <accumulator_id, value> pair is saved only when the task is the first finished task of a new stage or the stage is running its first attempt...
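The intent, counting each piece of work once even when a stage is resubmitted, can be sketched in a few lines. This is a deliberately simplified illustration, not the DAGScheduler's data structure. The `DedupingAccumulator` class is hypothetical, and it deduplicates on a (stage, partition) key, which is an assumption made for the sketch.

```python
class DedupingAccumulator:
    """Adds a task's update only the first time that (stage, partition) finishes."""

    def __init__(self):
        self.value = 0
        self._seen = set()

    def task_finished(self, stage_id, partition_id, update):
        key = (stage_id, partition_id)
        if key in self._seen:
            # A rerun of work that was already counted; ignore it.
            return False
        self._seen.add(key)
        self.value += update
        return True


acc = DedupingAccumulator()
acc.task_finished(stage_id=0, partition_id=0, update=5)
acc.task_finished(stage_id=0, partition_id=1, update=7)
# Stage 0 is resubmitted and partition 0 runs again with the same update.
acc.task_finished(stage_id=0, partition_id=0, update=5)
print(acc.value)
```

Without the `_seen` check, the rerun would be added a second time and the total would no longer match the amount of distinct work done.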
Author: CodingCat <zhunansjtu@gmail.com>
Closes #2524 from CodingCat/SPARK-732-1 and squashes the following commits:
701a1e8 [CodingCat] roll back change on Accumulator.scala
1433e6f [CodingCat] make MIMA happy
b233737 [CodingCat] address Matei's comments
02261b8 [CodingCat] rollback some changes
6b0aff9 [CodingCat] update document
2b2e8cf [CodingCat] updateAccumulator
83b75f8 [CodingCat] style fix
84570d2 [CodingCat] re-enable the bad accumulator guard
1e9e14d [CodingCat] add NPE guard
21b6840 [CodingCat] simplify the patch
88d1f03 [CodingCat] fix rebase error
f74266b [CodingCat] add test case for resubmitted result stage
5cf586f [CodingCat] de-duplicate on task level
138f9b3 [CodingCat] make MIMA happy
67593d2 [CodingCat] make if allowing duplicate update as an option of accumulator
This PR reverts changes related to tag-based cluster membership. As discussed in SPARK-3332, we didn't figure out a safe strategy to use tags to determine cluster membership, because tagging is not atomic. The following changes are reverted:
SPARK-2333: 94053a7b76
SPARK-3213: 7faf755ae4
SPARK-3608: 78d4220fa0.
I tested launch, login, and destroy. It is easy to check the diff by comparing it to Josh's patch for branch-1.1:
https://github.com/apache/spark/pull/2225/files
JoshRosen I sent the PR to master. It might be easier for us to keep master and branch-1.2 the same at this time. We can always re-apply the patch once we figure out a stable solution.
Author: Xiangrui Meng <meng@databricks.com>
Closes #3453 from mengxr/SPARK-4509 and squashes the following commits:
f0b708b [Xiangrui Meng] revert 94053a7b76
4298ea5 [Xiangrui Meng] revert 7faf755ae4
35963a1 [Xiangrui Meng] Revert "SPARK-3608 Break if the instance tag naming succeeds"
The documentation tells the user to run the following:
```
sbin/start-history-server.sh
```
The first thing this does is throw an exception that complains a log directory is not specified. The exception message itself does not say anything about what to set. Instead we should have a default and a landing page with a better message. The new default log directory is `file:/tmp/spark-events`.
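The "default plus a helpful message" behavior can be sketched as follows. This is not the HistoryServer code: `check_log_dir` is a hypothetical helper and the message wording is made up. The sketch assumes `spark.history.fs.logDirectory` is the relevant configuration key and handles only local `file:` paths.

```python
import os
import tempfile

DEFAULT_LOG_DIR = "file:/tmp/spark-events"


def check_log_dir(conf):
    """Return (log_dir, message); message is None when the directory exists."""
    log_dir = conf.get("spark.history.fs.logDirectory", DEFAULT_LOG_DIR)
    # Strip the "file:" scheme to get a local path for this sketch.
    path = log_dir[len("file:"):] if log_dir.startswith("file:") else log_dir
    if os.path.isdir(path):
        return log_dir, None
    return log_dir, (
        "Log directory %s does not exist. Create it, or set "
        "spark.history.fs.logDirectory to an existing directory." % log_dir
    )


# An existing directory produces no message.
existing = tempfile.mkdtemp()
_, ok_message = check_log_dir({"spark.history.fs.logDirectory": "file:" + existing})

# A missing directory produces a message that names the setting to change.
_, missing_message = check_log_dir(
    {"spark.history.fs.logDirectory": "file:/no/such/dir/for/this/sketch"}
)
print(missing_message)
```

The point of the design is that the error names the setting to change, instead of only reporting that something is missing.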
This is what it looks like as of this PR:
![after](https://issues.apache.org/jira/secure/attachment/12682985/after.png)
Author: Andrew Or <andrew@databricks.com>
Closes #3411 from andrewor14/minor-history-improvements and squashes the following commits:
f33d6b3 [Andrew Or] Point user to set config if default log dir does not exist
fc4c17a [Andrew Or] Improve HistoryServer UX
The documentation for the two parameters is the same with a pointer from the standalone parameter to the yarn parameter
Author: arahuja <aahuja11@gmail.com>
Closes #3209 from arahuja/yarn-classpath-first-param and squashes the following commits:
51cb9b2 [arahuja] [SPARK-4344][DOCS] adding documentation for YARN on userClassPathFirst
Author: wangfei <wangfei1@huawei.com>
Closes #3335 from scwf/patch-10 and squashes the following commits:
d343113 [wangfei] add '-Phive'
60d595e [wangfei] [DOC] Wrong cmd for build spark with apache hadoop 2.4.X and Hive 12 support
Author: Sandy Ryza <sandy@cloudera.com>
Closes #3322 from sryza/sandy-spark-4457 and squashes the following commits:
5e72b77 [Sandy Ryza] Feedback
0cf05c1 [Sandy Ryza] Caveat
be8084b [Sandy Ryza] SPARK-4457. Document how to build for Hadoop versions greater than 2.4
Author: Davies Liu <davies@databricks.com>
Closes #3388 from davies/doc_readme and squashes the following commits:
daa1482 [Davies Liu] add Sphinx dependency
This pull request revises the programming guide to reflect changes in the GraphX API as well as the deprecated mapReduceTriplets operator.
Author: Joseph E. Gonzalez <joseph.e.gonzalez@gmail.com>
Closes #3359 from jegonzal/GraphXProgrammingGuide and squashes the following commits:
4421964 [Joseph E. Gonzalez] updating documentation for graphx
Author: Marcelo Vanzin <vanzin@cloudera.com>
Closes #3277 from vanzin/version-1.3 and squashes the following commits:
7c3c396 [Marcelo Vanzin] Added temp repo to sbt build.
5f404ff [Marcelo Vanzin] Add another exclusion.
19457e7 [Marcelo Vanzin] Update old version to 1.2, add temporary 1.2 repo.
3c8d705 [Marcelo Vanzin] Workaround for MIMA checks.
e940810 [Marcelo Vanzin] Bumping version to 1.3.0-SNAPSHOT.