## What changes were proposed in this pull request?
Fix the broken links in the programming guide, in the GraphX migration guide and in the section on understanding closures.
## How was this patch tested?
By running the test cases and checking the links.
Author: Shivansh <shiv4nsh@gmail.com>
Closes#14503 from shiv4nsh/SPARK-16911.
## What changes were proposed in this pull request?
The default value of `spark.sql.broadcastTimeout` is 300s, but this property does not appear in any of the Spark docs, so add `spark.sql.broadcastTimeout` to docs/sql-programming-guide.md to help people fix this timeout error when it happens.
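As a sketch of how a user would raise the timeout once it is documented (the app name is hypothetical):

```scala
import org.apache.spark.sql.SparkSession

// Raise the broadcast join timeout from the 300s default to 10 minutes.
val spark = SparkSession.builder()
  .appName("BroadcastTimeoutExample") // hypothetical app name
  .config("spark.sql.broadcastTimeout", "600") // seconds to wait for a broadcast
  .getOrCreate()
```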
## How was this patch tested?
Not needed (documentation-only change).
JIRA_ID: SPARK-16870
Author: keliang <keliang@cmss.chinamobile.com>
Closes#14477 from biglobster/keliang.
## What changes were proposed in this pull request?
In the programming guide, the accumulator section mixes the old and new APIs, which is confusing. This is unnecessary for Scala, so all references to the old API are removed there. Java is mostly fixed up, except for the custom accumulator example, because I don't think an equivalent API exists yet. Python has not yet implemented the new API.
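For reference, a minimal sketch of the new-style Scala accumulator API the section now sticks to (assumes an existing SparkContext `sc`; the accumulator name is illustrative):

```scala
// Built-in long accumulator from the Spark 2.0 API; the older
// sc.accumulator(...) calls are the ones removed from the guide.
val counter = sc.longAccumulator("eventCounter") // hypothetical name
sc.parallelize(1 to 100).foreach(_ => counter.add(1))
println(counter.value) // 100
```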
## How was this patch tested?
built doc locally
Author: Bryan Cutler <cutlerb@gmail.com>
Closes#14516 from BryanCutler/fixup-accumulator-programming-guide-SPARK-15702.
## What changes were proposed in this pull request?
Doc for the Kafka 0.10 integration
## How was this patch tested?
Scala code examples were taken from my example repo, so hopefully they compile.
Author: cody koeninger <cody@koeninger.org>
Closes#14385 from koeninger/SPARK-16312.
## What changes were proposed in this pull request?
Shuffle fetch on a large intermediate dataset is slow because the shuffle service opens and closes the index file for each shuffle fetch. This change introduces a cache for the index information so that we can avoid accessing the index files on every block fetch.
## How was this patch tested?
Tested by running a job on the cluster and the shuffle read time was reduced by 50%.
Author: Sital Kedia <skedia@fb.com>
Closes#12944 from sitalkedia/shuffle_service.
## What changes were proposed in this pull request?
This PR makes various minor updates to examples of all language bindings to make sure they are consistent with each other. Some typos and missing parts (JDBC example in Scala/Java/Python) are also fixed.
## How was this patch tested?
Manually tested.
Author: Cheng Lian <lian@databricks.com>
Closes#14368 from liancheng/revise-examples.
## What changes were proposed in this pull request?
Fix the link at http://spark.apache.org/docs/latest/ml-guide.html.
## How was this patch tested?
None
Author: Sun Dapeng <sdp@apache.org>
Closes#14386 from sundapeng/doclink.
## What changes were proposed in this pull request?
New config var: `spark.mesos.docker.containerizer` = {"mesos", "docker" (default)}
This adds support for running docker containers via the Mesos unified containerizer: http://mesos.apache.org/documentation/latest/container-image/
The benefit is dropping the dependency on `dockerd`, and all the costs it incurs.
I've also updated the supported Mesos version to 0.28.2 for support of the required protobufs.
This is blocked on: https://github.com/apache/spark/pull/14167
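A sketch of opting into the unified containerizer, using the config name exactly as given above (the image name is hypothetical):

```scala
import org.apache.spark.SparkConf

// Run the executor's docker image via Mesos' unified containerizer,
// removing the dependency on dockerd.
val conf = new SparkConf()
  .set("spark.mesos.executor.docker.image", "myrepo/spark:2.0") // hypothetical image
  .set("spark.mesos.docker.containerizer", "mesos")             // default: "docker"
```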
## How was this patch tested?
- manually testing jobs submitted with both "mesos" and "docker" settings for the new config var.
- spark/mesos integration test suite
Author: Michael Gummelt <mgummelt@mesosphere.io>
Closes#14275 from mgummelt/unified-containerizer.
## What changes were proposed in this pull request?
Added a missing keyword in the Java example.
## How was this patch tested?
wasn't
Author: Bartek Wiśniewski <wedi@Ava.local>
Closes#14381 from wedi-dev/quickfix/missing_keyword.
## What changes were proposed in this pull request?
Adds a new SparkConf property, `spark.metrics.namespace`, that allows users to set a custom root namespace for driver and executor metrics in the metrics system.
By default, the root namespace used for driver and executor metrics is the value of `spark.app.id`. However, users often want to track these metrics across apps, which is hard to do with the application ID (i.e. `spark.app.id`) since it changes with every invocation of the app. For such use cases, users can set the `spark.metrics.namespace` property to another Spark configuration key like `spark.app.name`, whose value is then used to populate the root namespace of the metrics system (the app name, in our example). `spark.metrics.namespace` can be set to any arbitrary Spark property key, whose value would be used to set the root namespace of the metrics system. Metrics that do not come from drivers or executors are never prefixed with `spark.app.id`, nor does the `spark.metrics.namespace` property have any such effect on them.
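A sketch of the use case above; treat the `${...}` substitution syntax for referencing another configuration key as an assumption here:

```scala
import org.apache.spark.SparkConf

// Group driver/executor metrics under the stable app name instead of the
// per-invocation app ID (spark.app.id, the default root namespace).
val conf = new SparkConf()
  .setAppName("nightly-etl") // hypothetical app name
  .set("spark.metrics.namespace", "${spark.app.name}")
```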
## How was this patch tested?
Added new unit tests, modified existing unit tests.
Author: Mark Grover <mark@apache.org>
Closes#14270 from markgrover/spark-5847.
## What changes were proposed in this pull request?
Mesos agents by default will not pull docker images which are already cached locally. In order to run Spark executors from mutable tags like `:latest`, this commit introduces a Spark setting, `spark.mesos.executor.docker.forcePullImage`. Setting this flag to `true` tells the Mesos agent to force pull the docker image (the default is `false`, which is consistent with the previous implementation and Mesos' default behaviour).
Author: Philipp Hoffmann <mail@philipphoffmann.de>
Closes#14348 from philipphoffmann/force-pull-image.
## What changes were proposed in this pull request?
Minor doc fix for the `spark.speculation.quantile` configuration parameter. The docs incorrectly state that it is a percentage, when it is actually a fraction.
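A sketch of the corrected reading, with the quantile given as a fraction:

```scala
import org.apache.spark.SparkConf

// Speculatively re-launch slow tasks once 75% (a fraction, not a
// percentage) of the tasks in a stage have completed.
val conf = new SparkConf()
  .set("spark.speculation", "true")
  .set("spark.speculation.quantile", "0.75")
```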
## How was this patch tested?
I tried building the documentation but got some unidoc errors. I also got them when building off origin/master, so I don't think I caused that problem. I did run the web app and saw the changes reflected as expected.
Author: Nicholas Brown <nbrown@adroitdigital.com>
Closes#14352 from nwbvt/master.
## What changes were proposed in this pull request?
This PR fixes a wrong description of the default Parquet compression codec.
Author: Takeshi YAMAMURO <linguin.m.s@gmail.com>
Closes#14351 from maropu/FixParquetDoc.
## What changes were proposed in this pull request?
Mesos agents by default will not pull docker images which are already cached locally. In order to run Spark executors from mutable tags like `:latest`, this commit introduces a Spark setting, `spark.mesos.executor.docker.forcePullImage`. Setting this flag to `true` tells the Mesos agent to force pull the docker image (the default is `false`, which is consistent with the previous implementation and Mesos' default behaviour).
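A sketch of the setting in use (the image name is hypothetical):

```scala
import org.apache.spark.SparkConf

// Force the Mesos agent to re-pull the image on every launch so that
// executors pick up new pushes to the mutable :latest tag.
val conf = new SparkConf()
  .set("spark.mesos.executor.docker.image", "myrepo/spark:latest") // hypothetical
  .set("spark.mesos.executor.docker.forcePullImage", "true")       // default: false
```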
## How was this patch tested?
I ran a sample application including this change on a Mesos cluster and verified the correct behaviour both with and without force pulling the executor image. As expected, the image is force pulled if the flag is set.
Author: Philipp Hoffmann <mail@philipphoffmann.de>
Closes#13051 from philipphoffmann/force-pull-image.
This PR is based on PR #14098 authored by wangmiao1981.
## What changes were proposed in this pull request?
This PR replaces the original Python Spark SQL example file with the following three files:
- `sql/basic.py`
Demonstrates basic Spark SQL features.
- `sql/datasource.py`
Demonstrates various Spark SQL data sources.
- `sql/hive.py`
Demonstrates Spark SQL Hive interaction.
This PR also removes hard-coded Python example snippets in the SQL programming guide by extracting snippets from the above files using the `include_example` Liquid template tag.
## How was this patch tested?
Manually tested.
Author: wm624@hotmail.com <wm624@hotmail.com>
Author: Cheng Lian <lian@databricks.com>
Closes#14317 from liancheng/py-examples-update.
Clarify documentation on spark.task.maxFailures
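A minimal sketch of the setting being clarified; the default is 4, and the job aborts once any single task has failed that many times:

```scala
import org.apache.spark.SparkConf

// Allow each task up to 8 attempts before the job is aborted.
val conf = new SparkConf()
  .set("spark.task.maxFailures", "8")
```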
No tests were run, as this is a documentation-only change.
Author: Tom Graves <tgraves@yahoo-inc.com>
Closes#14287 from tgravescs/SPARK-16650.
## What changes were proposed in this pull request?
Added a new configuration namespace: `spark.mesos.driverEnv.*`
This allows a user submitting a job in cluster mode to set arbitrary environment variables on the driver.
`spark.mesos.driverEnv.KEY=VAL` will result in the env var `KEY` being set to `VAL` on the driver.
I've also refactored the tests a bit so we can re-use code in MesosClusterScheduler.
And I've refactored the command building logic in `buildDriverCommand`. Command builder values were very intertwined before, and now it's easier to determine exactly how each variable is set.
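A sketch of the namespace in use for a cluster-mode submission (`DB_URL` and its value are hypothetical):

```scala
import org.apache.spark.SparkConf

// Each spark.mesos.driverEnv.* entry becomes an environment variable
// in the driver's environment on the Mesos cluster.
val conf = new SparkConf()
  .set("spark.mesos.driverEnv.KEY", "VAL")
  .set("spark.mesos.driverEnv.DB_URL", "jdbc:postgresql://db:5432/app") // hypothetical
```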
## How was this patch tested?
unit tests
Author: Michael Gummelt <mgummelt@mesosphere.io>
Closes#14167 from mgummelt/driver-env-vars.
## What changes were proposed in this pull request?
Change "parquet" to "csv" in a comment, to match the input format being read.
## How was this patch tested?
N/A (doc change only)
Author: Holden Karau <holden@us.ibm.com>
Closes#14274 from holdenk/minor-docfix-schema-of-csv-rather-than-parquet.
1. Created executorspage-template.html for displaying application information in DataTables.
2. Added a REST API endpoint "allexecutors" to be able to see all executors created for a particular application.
3. The executorspage.js uses jQuery to fetch the data from the /api/v1/applications/appid/allexecutors REST API and uses DataTables to display the executors for the application. It also generates a summary of dead/live and total executors created during the life of the application.
4. Similar changes apply to the Executors page on the history server for a given application.
Snapshots of how it looks now:
<img width="938" alt="screen shot 2016-06-14 at 2 45 44 pm" src="https://cloud.githubusercontent.com/assets/6090397/16060092/ad1de03a-324b-11e6-8469-9eaa3f2548b5.png">
New Executors Page screenshot looks like this:
<img width="1436" alt="screen shot 2016-06-15 at 10 12 01 am" src="https://cloud.githubusercontent.com/assets/6090397/16085514/ee7004f0-32e1-11e6-9340-33d91e407f2b.png">
Author: Kishor Patil <kpatil@yahoo-inc.com>
Closes#13670 from kishorvpatil/execTemplates.
## What changes were proposed in this pull request?
Update monitoring.md.
Author: Weiqing Yang <yangweiqing001@gmail.com>
Closes#14163 from Sherry302/master.
## What changes were proposed in this pull request?
Update the `refreshTable` API in the Python code of the sql-programming-guide.
This API was added in SPARK-15820.
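The guide change documents the Python call; for reference, a sketch of the equivalent Scala catalog API (table name hypothetical, assumes a SparkSession `spark`):

```scala
// Invalidate and refresh all cached metadata for the given table.
spark.catalog.refreshTable("my_table") // hypothetical table name
```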
## How was this patch tested?
N/A
Author: WeichenXu <WeichenXu123@outlook.com>
Closes#14220 from WeichenXu123/update_sql_doc_catalog.
## What changes were proposed in this pull request?
Minor fixes correcting some typos, punctuation, and grammar.
Adding more anchors for easy navigation.
Fixing minor issues with code snippets.
## How was this patch tested?
`jekyll serve`
Author: Ahmed Mahran <ahmed.mahran@mashin.io>
Closes#14234 from ahmed-mahran/b-struct-streaming-docs.
## What changes were proposed in this pull request?
This PR moves the last remaining hard-coded Scala example snippet from the SQL programming guide into `SparkSqlExample.scala`. It also renames all Scala/Java example files so that every "Sql" in the file names is updated to "SQL".
## How was this patch tested?
Manually verified the generated HTML page.
Author: Cheng Lian <lian@databricks.com>
Closes#14245 from liancheng/minor-scala-example-update.
## What changes were proposed in this pull request?
Fix code style from ad hoc review of RC4 doc
## How was this patch tested?
manual
shivaram
Author: Felix Cheung <felixcheung_m@hotmail.com>
Closes#14250 from felixcheung/rdocs2rc4.
## What changes were proposed in this pull request?
Updates programming guide for spark.gapply/spark.gapplyCollect.
Similar to other examples, I used the `faithful` dataset to demonstrate gapply's functionality.
Please, let me know if you prefer another example.
## How was this patch tested?
Existing test cases in R
Author: Narine Kokhlikyan <narine@slice.com>
Closes#14090 from NarineK/gapplyProgGuide.
## What changes were proposed in this pull request?
Made DataFrame-based API primary
* Spark doc menu bar and other places now link to ml-guide.html, not mllib-guide.html
* mllib-guide.html keeps RDD-specific list of features, with a link at the top redirecting people to ml-guide.html
* ml-guide.html includes a "maintenance mode" announcement about the RDD-based API
* **Reviewers: please check this carefully**
* (minor) Titles for DF API no longer include "- spark.ml" suffix. Titles for RDD API have "- RDD-based API" suffix
* Moved migration guide to ml-guide from mllib-guide
* Also moved past guides from mllib-migration-guides to ml-migration-guides, with a redirect link on mllib-migration-guides
* **Reviewers**: I did not change any of the content of the migration guides.
Reorganized DataFrame-based guide:
* ml-guide.html mimics the old mllib-guide.html page in terms of content: overview, migration guide, etc.
* Moved Pipeline description into ml-pipeline.html and moved tuning into ml-tuning.html
* **Reviewers**: I did not change the content of these guides, except some intro text.
* Sidebar remains the same, but with pipeline and tuning sections added
Other:
* ml-classification-regression.html: Moved text about linear methods to new section in page
## How was this patch tested?
Generated docs locally
Author: Joseph K. Bradley <joseph@databricks.com>
Closes#14213 from jkbradley/ml-guide-2.0.
If a custom Jekyll template tag throws Ruby's equivalent of a "file not found" exception, then Jekyll will stop the doc building process but will exit with a successful status, causing our doc publishing jobs to silently fail.
This is caused by https://github.com/jekyll/jekyll/issues/5104, a case of bad error-handling logic in Jekyll. This patch works around this by updating our `include_example.rb` plugin to catch the exception and exit rather than allowing it to bubble up and be ignored by Jekyll.
I tested this manually with
```
rm ./examples/src/main/scala/org/apache/spark/examples/sql/SparkSQLExample.scala
cd docs
SKIP_API=1 jekyll build
echo $?
```
Author: Josh Rosen <joshrosen@databricks.com>
Closes#14209 from JoshRosen/fix-doc-building.
## What changes were proposed in this pull request?
Fixes a typo in the sql programming guide
## How was this patch tested?
Building docs locally
Author: Shivaram Venkataraman <shivaram@cs.berkeley.edu>
Closes#14208 from shivaram/spark-sql-doc-fix.
This prevents the NM from starting when something is wrong, which would otherwise lead to later errors that are confusing and harder to debug.
Added a unit test to verify startup fails if something is wrong.
Author: Marcelo Vanzin <vanzin@cloudera.com>
Closes#14162 from vanzin/SPARK-16505.
## What changes were proposed in this pull request?
Minor documentation update for a code example, code style, and a missed reference to `sparkR.init`.
## How was this patch tested?
manual
shivaram
Author: Felix Cheung <felixcheung_m@hotmail.com>
Closes#14178 from felixcheung/rcsvprogrammingguide.
## What changes were proposed in this pull request?
Updated structured streaming programming guide with new windowed example.
## How was this patch tested?
Docs
Author: James Thomas <jamesjoethomas@gmail.com>
Closes#14183 from jjthomas/ss_docs_update.
## What changes were proposed in this pull request?
Add documentation for asynchronous actions to the actions section of the programming guide.
## How was this patch tested?
Checked the documentation indentation and formatting with a Markdown preview.
Author: sandy <phalodi@gmail.com>
Closes#14104 from phalodi/SPARK-16438.
- Hard-coded Spark SQL sample snippets were moved into source files under examples sub-project.
- Removed the inconsistency between Scala and Java Spark SQL examples
- Scala and Java Spark SQL examples were updated
The work is still in progress. All involved examples were tested manually. An additional round of testing will be done after the code review.
![image](https://cloud.githubusercontent.com/assets/6235869/16710314/51851606-462a-11e6-9fbe-0818daef65e4.png)
Author: aokolnychyi <okolnychyyanton@gmail.com>
Closes#14119 from aokolnychyi/spark_16303.
## What changes were proposed in this pull request?
When a query uses only metadata (for example, a partition key), it can return results based on metadata alone, without scanning files. Hive did this in HIVE-1003.
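A sketch of the kind of query this optimization serves, assuming a hypothetical table `logs` partitioned by `dt` and an existing SparkSession `spark`:

```scala
// The distinct partition values can be answered from the catalog's
// partition metadata, without scanning any data files.
spark.sql("SELECT DISTINCT dt FROM logs").show() // hypothetical table `logs`
```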
## How was this patch tested?
add unit tests
Author: Lianhui Wang <lianhuiwang09@gmail.com>
Author: Wenchen Fan <wenchen@databricks.com>
Author: Lianhui Wang <lianhuiwang@users.noreply.github.com>
Closes#13494 from lianhuiwang/metadata-only.
## What changes were proposed in this pull request?
Some minor changes for documentation page "Spark Streaming + Kinesis Integration".
Moved "streaming-kinesis-arch.png" before the bullet list, not in between the bullets.
## How was this patch tested?
Tested manually, on my local machine.
Author: Xin Ren <iamshrek@126.com>
Closes#14097 from keypointt/kinesisDoc.
## What changes were proposed in this pull request?
* Update SparkR ML section to make them consistent with SparkR API docs.
* Since #13972 added labelling support for the ```include_example``` Jekyll plugin, we can split the single ```ml.R``` example file into multiple line blocks with different labels and include them in different algorithms/models in the generated HTML page.
## How was this patch tested?
Only docs update, manually check the generated docs.
Author: Yanbo Liang <ybliang8@gmail.com>
Closes#14011 from yanboliang/r-user-guide-update.
## What changes were proposed in this pull request?
After SPARK-16476 (committed earlier today as #14128), we can finally bump the version number.
## How was this patch tested?
N/A
Author: Reynold Xin <rxin@databricks.com>
Closes#14130 from rxin/SPARK-16477.
## What changes were proposed in this pull request?
Documentation changes to indicate that fine-grained mode is now deprecated. No code changes were made, and all fine-grained mode instructions were left in place. We can remove all of that once the deprecation cycle completes (Does Spark have a standard deprecation cycle? One major version?)
Blocked on https://github.com/apache/spark/pull/14059
## How was this patch tested?
Viewed in Github
Author: Michael Gummelt <mgummelt@mesosphere.io>
Closes#14078 from mgummelt/deprecate-fine-grained.
## What changes were proposed in this pull request?
docs
## How was this patch tested?
viewed the docs in github
Author: Michael Gummelt <mgummelt@mesosphere.io>
Closes#14059 from mgummelt/coarse-grained.
## What changes were proposed in this pull request?
I searched the whole docs directory for usages of SQLContext and updated the following places:
- docs/configuration.md, SparkR code snippets.
- docs/streaming-programming-guide.md, several code examples.
## How was this patch tested?
N/A
Author: WeichenXu <WeichenXu123@outlook.com>
Closes#14025 from WeichenXu123/WIP_SQLContext_update.
## What changes were proposed in this pull request?
Coincidentally, I discovered that a couple images were unused in `docs/`, and then searched and found more, and then realized some PNGs were pretty big and could be crushed, and before I knew it, had done the same for the ASF site (not committed yet).
No functional change at all, just less superfluous image data.
## How was this patch tested?
`jekyll serve`
Author: Sean Owen <sowen@cloudera.com>
Closes#14029 from srowen/RemoveCompressImages.
## What changes were proposed in this pull request?
I extracted 6 example programs from the GraphX programming guide and replaced them with
`include_example` labels.
The 6 example programs are:
- AggregateMessagesExample.scala
- SSSPExample.scala
- TriangleCountingExample.scala
- ConnectedComponentsExample.scala
- ComprehensiveExample.scala
- PageRankExample.scala
All the example code can run using
`bin/run-example graphx.EXAMPLE_NAME`
## How was this patch tested?
Manual.
Author: WeichenXu <WeichenXu123@outlook.com>
Closes#14015 from WeichenXu123/graphx_example_plugin.
## What changes were proposed in this pull request?
There are two test data files used by the GraphX examples in the "graphx/data" directory.
I moved them into the "data/" directory, because the "graphx" directory is meant for code files, and the other test data files (such as the mllib and streaming test data) are all kept there.
I also updated the GraphX document where it references the moved data files.
## How was this patch tested?
N/A
Author: WeichenXu <WeichenXu123@outlook.com>
Closes#14010 from WeichenXu123/move_graphx_data_dir.
This PR adds the breaking changes from [SPARK-14810](https://issues.apache.org/jira/browse/SPARK-14810) to the migration guide.
## How was this patch tested?
Built docs locally.
Author: Nick Pentreath <nickp@za.ibm.com>
Closes#13924 from MLnick/SPARK-15643-migration-guide.
## What changes were proposed in this pull request?
This PR adds labelling support for the `include_example` Jekyll plugin, so that we may split a single source file into multiple line blocks with different labels, and include them in multiple code snippets in the generated HTML page.
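A sketch of what a labelled block might look like in an example source file (the `$example on/off$` marker style, the label name, and the SparkSession `spark` are illustrative assumptions); the guide page would then pull just that block with an `include_example` tag referencing the label:

```scala
// Everything between the markers can be included in the docs on its own.
// $example on:create_df$
val df = spark.read.json("examples/src/main/resources/people.json")
df.show()
// $example off:create_df$
```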
## How was this patch tested?
Manually tested.
<img width="923" alt="screenshot at jun 29 19-53-21" src="https://cloud.githubusercontent.com/assets/230655/16451099/66a76db2-3e33-11e6-84fb-63104c2f0688.png">
Author: Cheng Lian <lian@databricks.com>
Closes#13972 from liancheng/include-example-with-labels.
## What changes were proposed in this pull request?
YARN has supported rolling log aggregation since 2.6. Previously, logs would only be aggregated to HDFS after the application finished, which is quite painful for long-running applications like Spark Streaming or the thrift server; out-of-disk problems also occur when a log file grows too large. So this proposes adding support for rolling log aggregation for Spark on YARN.
One limitation is that log4j should be configured with a file appender; Spark itself uses a console appender by default, in which case the log file will not be recreated once it is removed after aggregation. But I think most production users will have changed their log4j configuration from the default, so this is not a big problem.
## How was this patch tested?
Manually verified with Hadoop 2.7.1.
Author: jerryshao <sshao@hortonworks.com>
Closes#13712 from jerryshao/SPARK-15990.
## What changes were proposed in this pull request?
Update ```spark.ml``` and ```spark.mllib``` migration guide from 1.6 to 2.0.
## How was this patch tested?
Docs update, no tests.
Author: Yanbo Liang <ybliang8@gmail.com>
Closes#13378 from yanboliang/spark-13448.
## What changes were proposed in this pull request?
This PR makes several updates to SQL programming guide.
Author: Yin Huai <yhuai@databricks.com>
Closes#13938 from yhuai/doc.
## What changes were proposed in this pull request?
Made changes to HashingTF, QuantileDiscretizer, and CountVectorizer.
Author: GayathriMurali <gayathri.m@intel.com>
Closes#13745 from GayathriMurali/SPARK-15997.
## What changes were proposed in this pull request?
This changes the behavior of --num-executors and spark.executor.instances when using dynamic allocation. Instead of turning dynamic allocation off, the value is used as the initial number of executors.
This change was discussed in [SPARK-13723](https://issues.apache.org/jira/browse/SPARK-13723). I highly recommend adopting it while we can still change the behavior for 2.0.0. In practice, the 1.x behavior is unexpected for users (it is not clear that it disables dynamic allocation) and wastes cluster resources because users rarely notice the log message.
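A sketch of the new behavior (values illustrative):

```scala
import org.apache.spark.SparkConf

// With dynamic allocation enabled, spark.executor.instances (or
// --num-executors) now seeds the initial executor count instead of
// disabling dynamic allocation.
val conf = new SparkConf()
  .set("spark.dynamicAllocation.enabled", "true")
  .set("spark.executor.instances", "10") // starts at 10, may scale up or down
```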
## How was this patch tested?
This patch updates tests and adds a test for Utils.getDynamicAllocationInitialExecutors.
Author: Ryan Blue <blue@apache.org>
Closes#13338 from rdblue/SPARK-13723-num-executors-with-dynamic-allocation.
## What changes were proposed in this pull request?
Updated setJobGroup, cancelJobGroup, and clearJobGroup to not require sc/SparkContext as a parameter.
Also updated roxygen2 doc and R programming guide on deprecations.
## How was this patch tested?
unit tests
Author: Felix Cheung <felixcheung_m@hotmail.com>
Closes#13838 from felixcheung/rjobgroup.
## What changes were proposed in this pull request?
Guide for
- UDFs with dapply, dapplyCollect
- spark.lapply for running parallel R functions
## How was this patch tested?
build locally
<img width="654" alt="screen shot 2016-06-14 at 03 12 56" src="https://cloud.githubusercontent.com/assets/3419881/16039344/12a3b6a0-31de-11e6-8d77-fe23308075c0.png">
Author: Kai Jiang <jiangkai@gmail.com>
Closes#13660 from vectorijk/spark-15672-R-guide-update.
## What changes were proposed in this pull request?
Doc changes
## How was this patch tested?
manual
liancheng
Author: Felix Cheung <felixcheung_m@hotmail.com>
Closes#13827 from felixcheung/sqldocdeprecate.
## What changes were proposed in this pull request?
jira: https://issues.apache.org/jira/browse/SPARK-16045
2.0 Audit: Update document for StopWordsRemover and Binarizer.
## How was this patch tested?
manual review for doc
Author: Yuhao Yang <hhbyyh@gmail.com>
Author: Yuhao Yang <yuhao.yang@intel.com>
Closes#13375 from hhbyyh/stopdoc.
## What changes were proposed in this pull request?
Update docs for the two parameters `spark.sql.files.maxPartitionBytes` and `spark.sql.files.openCostInBytes` in Other Configuration Options.
## How was this patch tested?
N/A
Author: Takeshi YAMAMURO <linguin.m.s@gmail.com>
Closes#13797 from maropu/SPARK-15894-2.
## What changes were proposed in this pull request?
Update doc as per discussion in PR #13592
## How was this patch tested?
manual
shivaram liancheng
Author: Felix Cheung <felixcheung_m@hotmail.com>
Closes#13799 from felixcheung/rsqlprogrammingguide.
This has changed from 1.6: memory is now stored off-heap using Spark's own off-heap support instead of in Tachyon.
Author: Eric Liang <ekl@databricks.com>
Closes#13744 from ericl/spark-16025.
## What changes were proposed in this pull request?
Initial SQL programming guide update for Spark 2.0. Content like the 1.6-to-2.0 migration guide is still incomplete.
We may also want to add more examples for Scala/Java Dataset typed transformations, as in the sketch below.
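A sketch of the kind of typed transformation example that could be added (the case class, data, and an existing SparkSession `spark` are assumed, e.g. in the spark-shell):

```scala
case class Person(name: String, age: Long)

import spark.implicits._

// Typed Dataset transformations operate on Person values rather than Rows.
val ds = Seq(Person("Ana", 30), Person("Bo", 25)).toDS()
ds.filter(_.age > 26).map(_.name).show()
```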
## How was this patch tested?
N/A
Author: Cheng Lian <lian@databricks.com>
Closes#13592 from liancheng/sql-programming-guide-2.0.
## What changes were proposed in this pull request?
roxygen2 doc, programming guide, example updates
## How was this patch tested?
manual checks
shivaram
Author: Felix Cheung <felixcheung_m@hotmail.com>
Closes#13751 from felixcheung/rsparksessiondoc.
## What changes were proposed in this pull request?
In the 2.0 document, the line "A full example that produces the experiment described in the PIC paper can be found under examples/." is redundant.
There is already "Find full example code at "examples/src/main/scala/org/apache/spark/examples/mllib/PowerIterationClusteringExample.scala" in the Spark repo.".
We should remove the first line, to be consistent with the other documents.
## How was this patch tested?
Manual test
Author: wm624@hotmail.com <wm624@hotmail.com>
Closes#13755 from wangmiao1981/doc.
## What changes were proposed in this pull request?
Make user guide changes to SparkR documentation for all changes that happened in 2.0 to Machine Learning APIs
Author: GayathriMurali <gayathri.m@intel.com>
Closes#13285 from GayathriMurali/SPARK-15129.
## What changes were proposed in this pull request?
- Add ML doc for ML isotonic regression
- Add Scala example for ML isotonic regression
- Add Java example for ML isotonic regression
- Add Python example for ML isotonic regression
- Modify Scala example for MLlib isotonic regression
- Modify Java example for MLlib isotonic regression
- Modify Python example for MLlib isotonic regression
- Add data/mllib/sample_isotonic_regression_libsvm_data.txt
- Delete data/mllib/sample_isotonic_regression_data.txt
## How was this patch tested?
N/A
Author: WeichenXu <WeichenXu123@outlook.com>
Closes#13381 from WeichenXu123/add_isotonic_regression_doc.
## What changes were proposed in this pull request?
Reduce `spark.memory.fraction` default to 0.6 in order to make it fit within default JVM old generation size (2/3 heap). See JIRA discussion. This means a full cache doesn't spill into the new gen. CC andrewor14
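A sketch making the new default explicit (the value is a fraction of usable heap shared by execution and storage):

```scala
import org.apache.spark.SparkConf

// The JVM old generation defaults to 2/3 of the heap, so capping the
// unified memory region at 0.6 keeps a full cache inside the old gen.
val conf = new SparkConf()
  .set("spark.memory.fraction", "0.6") // the new default, shown for visibility
```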
## How was this patch tested?
Jenkins tests.
Author: Sean Owen <sowen@cloudera.com>
Closes#13618 from srowen/SPARK-15796.
## What changes were proposed in this pull request?
minor typo
## How was this patch tested?
Minor typo in the doc; should be self-explanatory.
Author: Mortada Mehyar <mortada.mehyar@gmail.com>
Closes#13639 from mortada/typo.
## What changes were proposed in this pull request?
- Deprecate old Java accumulator API; should use Scala now
- Update Java tests and examples
- Don't bother testing old accumulator API in Java 8 (too)
- (fix a misspelling too)
## How was this patch tested?
Jenkins tests
Author: Sean Owen <sowen@cloudera.com>
Closes#13606 from srowen/SPARK-15086.
## What changes were proposed in this pull request?
SPARK_MASTER_IP is a deprecated environment variable. It is replaced by SPARK_MASTER_HOST according to MasterArguments.scala.
## How was this patch tested?
Manually verified.
Author: bomeng <bmeng@us.ibm.com>
Closes#13543 from bomeng/SPARK-15806.
## What changes were proposed in this pull request?
Like `SPARK_JAVA_OPTS` and `SPARK_CLASSPATH`, we will remove the documentation for `SPARK_WORKER_INSTANCES` to discourage users from using it. If it is actually used, SparkConf will show a warning message as before.
## How was this patch tested?
Manually tested.
Author: bomeng <bmeng@us.ibm.com>
Closes#13533 from bomeng/SPARK-15781.
## What changes were proposed in this pull request?
Use new Spark logo including "Apache" (now, with crushed PNGs). Remove old unreferenced logo files.
## How was this patch tested?
Manual check of generated HTML site and Spark UI. I searched for references to the deleted files to make sure they were not used.
Author: Sean Owen <sowen@cloudera.com>
Closes#13609 from srowen/SPARK-15879.
## What changes were proposed in this pull request?
Fixing the documentation for the groupby/agg example in Python.
## How was this patch tested?
The existing example in the documentation does not contain valid syntax (a missing parenthesis) and does not use `Column` in the expression for `agg()`.
After the fix, here's how I tested it:
```
In [1]: from pyspark.sql import Row
In [2]: import pyspark.sql.functions as func
In [3]: %cpaste
Pasting code; enter '--' alone on the line to stop or use Ctrl-D.
:records = [{'age': 19, 'department': 1, 'expense': 100},
: {'age': 20, 'department': 1, 'expense': 200},
: {'age': 21, 'department': 2, 'expense': 300},
: {'age': 22, 'department': 2, 'expense': 300},
: {'age': 23, 'department': 3, 'expense': 300}]
:--
In [4]: df = sqlContext.createDataFrame([Row(**d) for d in records])
In [5]: df.groupBy("department").agg(df["department"], func.max("age"), func.sum("expense")).show()
+----------+----------+--------+------------+
|department|department|max(age)|sum(expense)|
+----------+----------+--------+------------+
| 1| 1| 20| 300|
| 2| 2| 22| 600|
| 3| 3| 23| 300|
+----------+----------+--------+------------+
```
Author: Mortada Mehyar <mortada.mehyar@gmail.com>
Closes#13587 from mortada/groupby_agg_doc_fix.
## What changes were proposed in this pull request?
The Scala version mentioned in the sbt configuration file is 2.11, so the path of the target JAR should be `/target/scala-2.11/simple-project_2.11-1.0.jar`.
## How was this patch tested?
n/a
Author: prabs <prabsmails@gmail.com>
Author: Prabeesh K <prabsmails@gmail.com>
Closes#13554 from prabeesh/master.
## What changes were proposed in this pull request?
When fitting ```LinearRegressionModel``` (with the "l-bfgs" solver) and ```LogisticRegressionModel``` without an intercept on a dataset with a constant nonzero column, spark.ml produces the same model as R glmnet but a different one from LIBSVM.
When fitting ```AFTSurvivalRegressionModel``` without an intercept on a dataset with a constant nonzero column, spark.ml produces a different model compared with R survival::survreg.
We should output a warning message and clarify this condition in the documentation.
## How was this patch tested?
Document change, no unit test.
cc mengxr
Author: Yanbo Liang <ybliang8@gmail.com>
Closes#12731 from yanboliang/spark-13590.
While there, also document spark.files and spark.jars. Text is the
same as the spark-submit help text with some minor adjustments.
Author: Marcelo Vanzin <vanzin@cloudera.com>
Closes#13502 from vanzin/SPARK-15760.
## What changes were proposed in this pull request?
I used spell-check tools to find typos in the Spark documents and fixed them.
## How was this patch tested?
N/A
Author: WeichenXu <WeichenXu123@outlook.com>
Closes#13538 from WeichenXu123/fix_doc_typo.
## What changes were proposed in this pull request?
1. Removed precision and recall in `ml.MulticlassClassificationEvaluator`.
2. Updated the user guide for `mllib.weightedFMeasure`.
## How was this patch tested?
local build
Author: Ruifeng Zheng <ruifengz@foxmail.com>
Closes#13390 from zhengruifeng/clarify_f1.
## What changes were proposed in this pull request?
The patch updates the code & docs in the example module as well as the related doc module:
- [ ] [docs] `streaming-programming-guide.md`
- [x] scala code part
- [ ] java code part
- [ ] python code part
- [x] [examples] `RecoverableNetworkWordCount.scala`
- [ ] [examples] `JavaRecoverableNetworkWordCount.java`
- [ ] [examples] `recoverable_network_wordcount.py`
## How was this patch tested?
Ran the examples and verified results manually.
Author: Liwei Lin <lwlin7@gmail.com>
Closes#12981 from lw-lin/accumulatorV2-examples.
## What changes were proposed in this pull request?
Update the accumulator section of the programming guide (Scala only).
The Java and Python versions are not modified, because those APIs are not done yet.
## How was this patch tested?
N/A
Author: WeichenXu <WeichenXu123@outlook.com>
Closes#13441 from WeichenXu123/update_doc_accumulatorV2_clean.
## What changes were proposed in this pull request?
Fixed broken Java code examples in the streaming documentation.
Attn: tdas
Author: Matthew Wise <matthew.rs.wise@gmail.com>
Closes#13388 from mawise/fix_docs_java_streaming_example.
## What changes were proposed in this pull request?
* Document ```WeightedLeastSquares```(normal equation) and ```IterativelyReweightedLeastSquares```.
* Copy ```L-BFGS``` documents from ```spark.mllib``` to ```spark.ml```.
Since the section ```Optimization of linear methods``` is aimed at developers, I think we should provide a brief introduction to the optimization methods, necessary references, and how they are implemented in Spark. It is not necessary to paste all the mathematical formulas and derivations here; if developers/users want to learn more, they can follow the references.
## How was this patch tested?
Document update, no tests.
Author: Yanbo Liang <ybliang8@gmail.com>
Closes#13262 from yanboliang/spark-15484.
## What changes were proposed in this pull request?
This patch adds a user guide section for generalized linear regression and includes the examples from [#12754](https://github.com/apache/spark/pull/12754).
## How was this patch tested?
Documentation only, no tests required.
## Approach
In general, it is a bit unclear what level of detail ought to be included in the user guide since there is a lot of variability within the current user guide. I tried to give a fairly brief mathematical introduction to GLMs, and cover what types of problems they could be used for. Additionally, I included a brief blurb on the IRLS solver. The input/output columns are given in a table as is found elsewhere in the docs (though, again, these appear rather intermittently in the current docs), as well as a table providing the supported families and their link functions.
Author: sethah <seth.hendrickson16@gmail.com>
Closes#13139 from sethah/SPARK-15186.
## What changes were proposed in this pull request?
Remove several obsolete env variables no longer supported by Spark on YARN, and update the docs to include several changes with 2.0.
## How was this patch tested?
N/A
CC vanzin tgravescs
Author: jerryshao <sshao@hortonworks.com>
Closes#13296 from jerryshao/yarn-doc.
## What changes were proposed in this pull request?
`a` -> `an`
I used a regex to find potentially erroneous lines:
`grep -in ' a [aeiou]' mllib/src/main/scala/org/apache/spark/ml/*/*scala`
and review them line by line.
## How was this patch tested?
local build
`lint-java` checking
Author: Zheng RuiFeng <ruifengz@foxmail.com>
Closes#13317 from zhengruifeng/a_an.
## What changes were proposed in this pull request?
Follow-up on the earlier PR: in here we fix up the roxygen2 doc examples.
Also adds to the programming guide migration section.
## How was this patch tested?
SparkR tests
Author: felixcheung <felixcheung_m@hotmail.com>
Closes#13340 from felixcheung/sqlcontextdoc.
This patch provides detail on what to do for keytab-less Oozie launches of Spark apps, and adds some debug-level diagnostics of what credentials have been submitted.
Author: Steve Loughran <stevel@hortonworks.com>
Author: Steve Loughran <stevel@apache.org>
Closes#11033 from steveloughran/stevel/feature/SPARK-13148-oozie.
## What changes were proposed in this pull request?
Under the "Upgrading From SparkR 1.5.x to 1.6.x" section, added a note that SparkSQL converts `NA` in R to `null`.
## How was this patch tested?
Document update, no tests.
Author: Krishna Kalyan <krishnakalyan3@gmail.com>
Closes#13268 from krishnakalyan3/spark-12071-1.
## What changes were proposed in this pull request?
PySpark: Add links to the predictors from the models in regression.py, improve linear and isotonic pydoc in minor ways.
User guide / R: Switch the installed package list to be enough to build the R docs on a "fresh" install on ubuntu and add sudo to match the rest of the commands.
User Guide: Add a note about using gem2.0 for systems with both 1.9 and 2.0 (e.g. some ubuntu but maybe more).
## How was this patch tested?
built pydocs locally, tested new user build instructions
Author: Holden Karau <holden@us.ibm.com>
Closes#13199 from holdenk/SPARK-15412-improve-linear-isotonic-regression-pydoc.
This PR adds a note to clarify that the ML API for ALS only supports integers for user/item ids, and that other types for these columns can be used but the ids must fall within integer range.
(Refer [SPARK-14891](https://issues.apache.org/jira/browse/SPARK-14891)).
Also cleaned up a reference to `mllib` in the ML doc.
## How was this patch tested?
Built and viewed User Guide doc locally.
Author: Nick Pentreath <nickp@za.ibm.com>
Closes#13278 from MLnick/SPARK-15502-als-int-id-doc-note.