## What changes were proposed in this pull request?
Correct some typos and incorrectly worded sentences.
## How was this patch tested?
Doc changes only.
Note that many of these changes were identified by whomfire01
Author: sethah <seth.hendrickson16@gmail.com>
Closes#13180 from sethah/ml_guide_audit.
## What changes were proposed in this pull request?
Update the unit test code, examples, and documents to remove calls to deprecated method `dataset.registerTempTable`.
## How was this patch tested?
This PR only changes the unit test code, examples, and comments. It should be safe.
This is a follow up of PR https://github.com/apache/spark/pull/12945 which was merged.
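The deprecation follows a common pattern: the old method becomes a thin alias that warns and then delegates to the replacement. A minimal pure-Python sketch of that pattern (a toy stand-in, not Spark's actual implementation):

```python
import warnings

class DataFrame:
    """Toy stand-in illustrating how a deprecated alias delegates to its replacement."""

    def __init__(self):
        self.catalog = {}

    def createOrReplaceTempView(self, name):
        # Preferred method: register this DataFrame under the given view name.
        self.catalog[name] = self

    def registerTempTable(self, name):
        # Deprecated alias: warn, then delegate to the new method.
        warnings.warn(
            "registerTempTable is deprecated; use createOrReplaceTempView instead",
            DeprecationWarning,
        )
        self.createOrReplaceTempView(name)

df = DataFrame()
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    df.registerTempTable("people")

print("people" in df.catalog)                      # True
print(issubclass(caught[0].category, DeprecationWarning))  # True
```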
Author: Sean Zhong <seanzhong@databricks.com>
Closes#13098 from clockfly/spark-15171-remove-deprecation.
## What changes were proposed in this pull request?
We should now begin copying algorithm details from the spark.mllib guide to spark.ml as needed, rather than just linking back to the corresponding algorithms in the spark.mllib user guide.
## How was this patch tested?
manual review for doc.
Author: Yuhao Yang <hhbyyh@gmail.com>
Author: Yuhao Yang <yuhao.yang@intel.com>
Closes#12957 from hhbyyh/tfidfdoc.
## What changes were proposed in this pull request?
See JIRA for the motivation. The changes are almost entirely movement of text and edits to sections. Minor changes to text include:
- Copying in / merging text from the "Useful Developer Tools" wiki, in areas of
- Docker
- R
- Running one test
- standardizing on ./build/mvn not mvn, and likewise for ./build/sbt
- correcting some typos
- standardizing code block formatting
No text has been removed from this doc; text has been imported from the https://cwiki.apache.org/confluence/display/SPARK/Useful+Developer+Tools wiki
## How was this patch tested?
Jekyll doc build and inspection of resulting HTML in browser.
Author: Sean Owen <sowen@cloudera.com>
Closes#13124 from srowen/SPARK-15333.
## What changes were proposed in this pull request?
Add guide doc and examples for GaussianMixture in Spark.ml in Java, Scala and Python.
## How was this patch tested?
Manual compile and test all examples
Author: wm624@hotmail.com <wm624@hotmail.com>
Closes#12788 from wangmiao1981/example.
## What changes were proposed in this pull request?
1. Rename matrix arguments in BreezeUtil to uppercase to match the docs
2. Fix several typos in ML and SQL
## How was this patch tested?
manual tests
Author: Zheng RuiFeng <ruifengz@foxmail.com>
Closes#13078 from zhengruifeng/fix_ann.
## What changes were proposed in this pull request?
Renaming the streaming-kafka artifact to include the Kafka version, in anticipation of needing a different artifact for later Kafka versions.
## How was this patch tested?
Unit tests
Author: cody koeninger <cody@koeninger.org>
Closes#12946 from koeninger/SPARK-15085.
## What changes were proposed in this pull request?
1. Create a LIBSVM-format dataset for LDA: `data/mllib/sample_lda_libsvm_data.txt`
2. Add a Python example
3. Read the datafile directly in the examples
4. Also, change to `SparkSession` in `aft_survival_regression.py`
## How was this patch tested?
manual tests
`./bin/spark-submit examples/src/main/python/ml/lda_example.py`
Author: Zheng RuiFeng <ruifengz@foxmail.com>
Closes#12927 from zhengruifeng/lda_pe.
This PR:
* Clarifies that Spark *does* support Python 3, starting with Python 3.4.
Author: Nicholas Chammas <nicholas.chammas@gmail.com>
Closes#13017 from nchammas/supported-python-versions.
## What changes were proposed in this pull request?
A Python example for ml.kmeans already exists, but it is not included in the user guide.
1. Small changes like: `example_on` `example_off`
2. Add it to the user guide
3. Update examples to read the datafile directly
## How was this patch tested?
manual tests
`./bin/spark-submit examples/src/main/python/ml/kmeans_example.py`
Author: Zheng RuiFeng <ruifengz@foxmail.com>
Closes#12925 from zhengruifeng/km_pe.
## What changes were proposed in this pull request?
1. Add BisectingKMeans to ml-clustering.md
2. Add the missing Scala BisectingKMeansExample
3. Create a new datafile `data/mllib/sample_kmeans_data.txt`
## How was this patch tested?
manual tests
Author: Zheng RuiFeng <ruifengz@foxmail.com>
Closes#11844 from zhengruifeng/doc_bkm.
## What changes were proposed in this pull request?
1. Add a Python example for OneVsRest
2. Remove argument parsing
## How was this patch tested?
manual tests
`./bin/spark-submit examples/src/main/python/ml/one_vs_rest_example.py`
Author: Zheng RuiFeng <ruifengz@foxmail.com>
Closes#12920 from zhengruifeng/ovr_pe.
## What changes were proposed in this pull request?
The current build documents don't specify that for PySpark tests we need to include Hive in the assembly; otherwise the ORC tests fail.
## How was this patch tested?
Manually built the docs locally. Ran the provided build command followed by the PySpark SQL tests.
![pyspark2](https://cloud.githubusercontent.com/assets/59893/13190008/8829cde4-d70f-11e5-8ff5-a88b7894d2ad.png)
Author: Holden Karau <holden@us.ibm.com>
Closes#11278 from holdenk/SPARK-13382-update-pyspark-testing-notes-r2.
## What changes were proposed in this pull request?
The configuration setting `spark.executor.logs.rolling.size.maxBytes` was changed to `spark.executor.logs.rolling.maxSize` in 1.4 or so.
This commit fixes a remaining reference to the old name in the documentation.
Also the description for `spark.executor.logs.rolling.maxSize` was edited to clearly state that the unit for the size is bytes.
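For reference, the old and corrected property names side by side (the value here is illustrative and is specified in bytes):

```
# Old name (pre-1.4, no longer valid):
# spark.executor.logs.rolling.size.maxBytes  134217728

# Current name; the unit is bytes:
spark.executor.logs.rolling.maxSize  134217728
```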
## How was this patch tested?
no tests
Author: Philipp Hoffmann <mail@philipphoffmann.de>
Closes#13001 from philipphoffmann/patch-3.
## What changes were proposed in this pull request?
* Since Spark now supports a native CSV reader, it is no longer necessary to use the third-party ```spark-csv``` in ```examples/src/main/r/data-manipulation.R```. Meanwhile, remove all ```spark-csv``` usage in SparkR.
* Running R applications through ```sparkR``` is not supported as of Spark 2.0, so we change to use ```./bin/spark-submit``` to run the example.
## How was this patch tested?
Offline test.
Author: Yanbo Liang <ybliang8@gmail.com>
Closes#13005 from yanboliang/r-df-examples.
## What changes were proposed in this pull request?
Fixed some minor errors found when reviewing the feature.ml user guide.
## How was this patch tested?
built docs locally
Author: Bryan Cutler <cutlerb@gmail.com>
Closes#12940 from BryanCutler/feature.ml-doc_fixes-DOCS-MINOR.
## What changes were proposed in this pull request?
Add the missing python example for QuantileDiscretizer
## How was this patch tested?
manual tests
Author: Zheng RuiFeng <ruifengz@foxmail.com>
Closes#12281 from zhengruifeng/discret_pe.
## What changes were proposed in this pull request?
Create a maven profile for executing the docker integration tests using maven
Remove docker integration tests from main sbt build
Update documentation on how to run docker integration tests from sbt
## How was this patch tested?
Manual test of the docker integration tests as in :
mvn -Pdocker-integration-tests -pl :spark-docker-integration-tests_2.11 compile test
## Other comments
Note that the DB2 Docker tests are still disabled, as there is a kernel version issue on the AMPLab Jenkins slaves and we would need to get them to the right level before enabling those tests. They do run OK locally with the updates from PR #12348.
Author: Luciano Resende <lresende@apache.org>
Closes#12508 from lresende/docker.
Remove history server functionality from standalone Master. Previously, the Master process rebuilt a SparkUI once the application was completed which sometimes caused problems, such as OOM, when the application event log is large (see SPARK-6270). Keeping this functionality out of the Master will help to simplify the process and increase stability.
Testing for this change included running core unit tests and manually running an application on a standalone cluster to verify that it completed successfully and that the Master UI functioned correctly. Also added 2 unit tests to verify killing an application and driver from MasterWebUI makes the correct request to the Master.
Author: Bryan Cutler <cutlerb@gmail.com>
Closes#10991 from BryanCutler/remove-history-master-SPARK-12299.
## What changes were proposed in this pull request?
Currently only a list of users can be specified for view and modify acls. This change enables a group of admins/devs/users to be provisioned for viewing and modifying Spark jobs.
**Changes Proposed in the fix**
Three new corresponding config entries have been added where the user can specify the groups to be given access.
```
spark.admin.acls.groups
spark.modify.acls.groups
spark.ui.view.acls.groups
```
New config entries were added because specifying the users and groups explicitly is a better and cleaner way compared to specifying them in the existing config entry using a delimiter.
A generic trait has been introduced to provide the user-to-group mapping, which makes it pluggable to support a variety of mapping protocols, similar to the one used in Hadoop. A default Unix shell based implementation has been provided.
Custom user to group mapping protocol can be specified and configured by the entry ```spark.user.groups.mapping```
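The pluggable mapping can be pictured as a single-method interface. A hypothetical Python sketch of the idea (Spark's actual trait is in Scala, and all names below are illustrative):

```python
from abc import ABC, abstractmethod

class GroupMappingProvider(ABC):
    """Illustrative interface: resolve the groups a user belongs to."""

    @abstractmethod
    def get_groups(self, username):
        ...

class StaticGroupMapping(GroupMappingProvider):
    """Toy implementation backed by a fixed table; a real Unix-shell-based
    implementation would query the OS for the user's groups instead."""

    def __init__(self, table):
        self.table = table

    def get_groups(self, username):
        return self.table.get(username, set())

def can_view(username, view_acl_groups, mapping):
    # A user may view the UI if any of their groups appears in the view ACL.
    return bool(mapping.get_groups(username) & view_acl_groups)

mapping = StaticGroupMapping({"alice": {"devs", "admins"}, "bob": {"interns"}})
print(can_view("alice", {"devs"}, mapping))  # True
print(can_view("bob", {"devs"}, mapping))    # False
```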
**How was this patch tested?**
We ran different spark jobs setting the config entries in combinations of admin, modify and ui acls. For modify acls we tried killing the job stages from the ui and using yarn commands. For view acls we tried accessing the UI tabs and the logs. Headless accounts were used to launch these jobs and different users tried to modify and view the jobs to ensure that the groups mapping applied correctly.
Additional Unit tests have been added without modifying the existing ones. These test for different ways of setting the acls through configuration and/or API and validate the expected behavior.
Author: Dhruve Ashar <dhruveashar@gmail.com>
Closes#12760 from dhruve/impr/SPARK-4224.
## What changes were proposed in this pull request?
Some Python snippets were using Scala imports and comments.
## How was this patch tested?
Generated the docs locally with `SKIP_API=1 jekyll build` and viewed the changes in the browser.
Author: Shuai Lin <linshuai2012@gmail.com>
Closes#12869 from lins05/fix-mllib-python-snippets.
## What changes were proposed in this pull request?
https://issues.apache.org/jira/browse/SPARK-13973
Following discussion with srowen, the IPYTHON and IPYTHON_OPTS variables are removed. If they are set in the user's environment, pyspark will not execute and will print an error message. Failing noisily will force users to remove these options and learn the new configuration scheme, which is much more sustainable and less confusing.
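The replacement scheme uses the `PYSPARK_DRIVER_PYTHON` environment variable; for example, to launch pyspark under IPython (illustrative shell fragment):

```
# Old, now rejected with an error:
#   IPYTHON=1 ./bin/pyspark

# New configuration scheme:
PYSPARK_DRIVER_PYTHON=ipython ./bin/pyspark
```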
## How was this patch tested?
Manual testing; set IPYTHON=1 and verified that the error message prints.
Author: pshearer <pshearer@massmutual.com>
Author: shearerp <shearerp@umich.edu>
Closes#12528 from shearerp/master.
## What changes were proposed in this pull request?
dapply() applies an R function on each partition of a DataFrame and returns a new DataFrame.
The function signature is:
dapply(df, function(localDF) {}, schema = NULL)
R function input: local data.frame from the partition on local node
R function output: local data.frame
Schema specifies the Row format of the resulting DataFrame. It must match the R function's output.
If schema is not specified, each partition of the result DataFrame will be serialized in R into a single byte array. Such resulting DataFrame can be processed by successive calls to dapply().
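The per-partition contract can be sketched in plain Python (a conceptual stand-in, not SparkR's implementation): the user function sees one local chunk of rows at a time, and the per-partition results together form the new DataFrame.

```python
def dapply_sketch(partitions, func):
    """Apply `func` to each partition (a local list of rows) independently,
    as a SparkR worker would apply the user's R function to a local data.frame."""
    return [func(part) for part in partitions]

# A DataFrame with two partitions of (name, age) rows.
partitions = [
    [("alice", 30), ("bob", 25)],
    [("carol", 41)],
]

# User function: add one to every age; the output rows must match the
# declared schema of (name, age).
def add_year(local_rows):
    return [(name, age + 1) for name, age in local_rows]

result = dapply_sketch(partitions, add_year)
print(result)  # [[('alice', 31), ('bob', 26)], [('carol', 42)]]
```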
## How was this patch tested?
SparkR unit tests.
Author: Sun Rui <rui.sun@intel.com>
Author: Sun Rui <sunrui2016@gmail.com>
Closes#12493 from sun-rui/SPARK-12919.
## What changes were proposed in this pull request?
Add simple clarification that Spark can be cross-built for other Scala versions.
## How was this patch tested?
Automated doc build
Author: Sean Owen <sowen@cloudera.com>
Closes#12757 from srowen/SPARK-14882.
This work is based on twinkle-sachdeva's proposal. In parallel to the existing mechanism for AM failures, this adds a similar mechanism for executor failure tracking, which is useful for long-running Spark services to mitigate executor failure problems.
Please help to review, tgravescs sryza and vanzin
Author: jerryshao <sshao@hortonworks.com>
Closes#10241 from jerryshao/SPARK-6735.
## What changes were proposed in this pull request?
Add the missing python example for VectorSlicer
## How was this patch tested?
manual tests
Author: Zheng RuiFeng <ruifengz@foxmail.com>
Closes#12282 from zhengruifeng/vecslicer_pe.
## What changes were proposed in this pull request?
Documentation changes
## How was this patch tested?
No tests
Author: Michael Gummelt <mgummelt@mesosphere.io>
Closes#12664 from mgummelt/fix-dynamic-docs.
## What changes were proposed in this pull request?
This issue aims to fix some errors in R examples and make them up-to-date in docs and example modules.
- Remove the wrong usage of `map`. We need to use `lapply` in `sparkR` if needed. However, `lapply` is private so far. The corrected example will be added later.
- Fix the wrong example in Section `Generic Load/Save Functions` of `docs/sql-programming-guide.md` for consistency
- Fix datatypes in `sparkr.md`.
- Update a data result in `sparkr.md`.
- Replace deprecated functions to remove warnings: jsonFile -> read.json, parquetFile -> read.parquet
- Use up-to-date R-like functions: loadDF -> read.df, saveDF -> write.df, saveAsParquetFile -> write.parquet
- Replace `SparkR DataFrame` with `SparkDataFrame` in `dataframe.R` and `data-manipulation.R`.
- Other minor syntax fixes and a typo.
## How was this patch tested?
Manual.
Author: Dongjoon Hyun <dongjoon@apache.org>
Closes#12649 from dongjoon-hyun/SPARK-14883.
## What changes were proposed in this pull request?
Added screenshot + minor fixes to improve reading
## How was this patch tested?
Manual
Author: Jacek Laskowski <jacek@japila.pl>
Closes#12569 from jaceklaskowski/docs-accumulators.
Add to the REST API docs details on the `?` query parameters, with examples from the test suite.
I've used the existing table, adding all the fields to the second table.
see [in the pr](https://github.com/steveloughran/spark/blob/history/SPARK-13267-doc-params/docs/monitoring.md).
There's a slightly more sophisticated option: make the table 3 columns wide, and for all existing entries, have the initial `td` span 2 columns. The new entries would then have an empty 1st column, param in 2nd and text in 3rd, with any examples after a `br` entry.
Author: Steve Loughran <stevel@hortonworks.com>
Closes#11152 from steveloughran/history/SPARK-13267-doc-params.
## What changes were proposed in this pull request?
Fixed inadvertent roxygen2 doc changes, added class name change to programming guide
Follow up of #12621
## How was this patch tested?
manually checked
Author: felixcheung <felixcheung_m@hotmail.com>
Closes#12647 from felixcheung/rdataframe.
## What changes were proposed in this pull request?
The patch makes event log processing multi threaded.
## How was this patch tested?
Existing tests pass; no new tests are needed, as this is a performance improvement. I tested the patch locally by generating one big event log (big1), one small event log (small1), and another big event log (big2). Without this patch, the UI does not render any app for almost 30 seconds, then big2 and small1 appear; after another 30-second delay, big1 finally shows up. With this change, small1 shows up immediately, and big1 and big2 come up in 30 seconds. Locally it also displays them in the correct order in the UI.
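The effect can be sketched with a thread pool: logs are parsed concurrently, so a small log is no longer stuck behind a large one (an illustrative Python sketch, not the history server's actual code):

```python
from concurrent.futures import ThreadPoolExecutor, as_completed
import time

def parse_log(name, size):
    # Stand-in for replaying an event log; bigger logs take longer to parse.
    time.sleep(size * 0.01)
    return name

logs = [("big1", 30), ("small1", 1), ("big2", 25)]

completed = []
with ThreadPoolExecutor(max_workers=3) as pool:
    futures = [pool.submit(parse_log, name, size) for name, size in logs]
    # Results become visible as each parse finishes, not in submission order.
    for future in as_completed(futures):
        completed.append(future.result())

print(completed[0])  # small1 -- the small log is visible first
```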
Author: Parth Brahmbhatt <pbrahmbhatt@netflix.com>
Closes#11800 from Parth-Brahmbhatt/SPARK-13988.
## What changes were proposed in this pull request?
Restore `ec2-scripts.md` as a redirect to amplab/spark-ec2 docs
## How was this patch tested?
`jekyll build` and checked with the browser
Author: Sean Owen <sowen@cloudera.com>
Closes#12534 from srowen/SPARK-14742.
## What changes were proposed in this pull request?
Currently, the docs for TF-IDF only refer to using HashingTF with IDF. However, CountVectorizer can also be used. We should probably amend the user guide and examples to show this.
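The difference lies in how term frequencies are produced: HashingTF maps terms to indices via a hash (fast but irreversible), while a count-based vectorizer builds an explicit vocabulary. A plain-Python sketch of the count-then-IDF pipeline (conceptual only, not the Spark API; the smoothed IDF below follows the usual log((n+1)/(df+1)) formulation):

```python
import math

docs = [["spark", "is", "fast"], ["spark", "scales"], ["docs", "matter"]]

# Build an explicit vocabulary, as a count-based vectorizer does
# (HashingTF would instead hash each term into a fixed-size index space).
vocab = sorted({term for doc in docs for term in doc})
index = {term: i for i, term in enumerate(vocab)}

# Term-frequency vectors: one count per vocabulary entry.
tf = [[doc.count(term) for term in vocab] for doc in docs]

# Smoothed inverse document frequency: log((n + 1) / (df + 1)).
n = len(docs)
df = [sum(1 for row in tf if row[i] > 0) for i in range(len(vocab))]
idf = [math.log((n + 1) / (d + 1)) for d in df]

tfidf = [[count * weight for count, weight in zip(row, idf)] for row in tf]

# "spark" appears in two documents, so it is down-weighted relative to "fast".
print(tfidf[0][index["fast"]] > tfidf[0][index["spark"]])  # True
```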
## How was this patch tested?
unit tests and doc generation
Author: Yuhao Yang <hhbyyh@gmail.com>
Closes#12454 from hhbyyh/tfdoc.
## What changes were proposed in this pull request?
The sort shuffle manager has been the default since Spark 1.2. It is time to remove the old hash shuffle manager.
## How was this patch tested?
Removed some tests related to the old manager.
Author: Reynold Xin <rxin@databricks.com>
Closes#12423 from rxin/SPARK-14667.
## What changes were proposed in this pull request?
Add the missing python example for ChiSqSelector
## How was this patch tested?
manual tests
Author: Zheng RuiFeng <ruifengz@foxmail.com>
Closes#12283 from zhengruifeng/chi2_pe.
## What changes were proposed in this pull request?
Removing references to assembly jar in documentation.
Adding an additional (previously undocumented) usage of spark-submit to run examples.
## How was this patch tested?
Ran spark-submit usage to ensure formatting was fine. Ran examples using SparkSubmit.
Author: Mark Grover <mark@apache.org>
Closes#12365 from markgrover/spark-14601.
## What changes were proposed in this pull request?
The configuration docs are updated to reflect the changes introduced with [SPARK-12384](https://issues.apache.org/jira/browse/SPARK-12384). This allows the user to specify initial heap memory settings through extraJavaOptions for the executor, driver, and AM.
## How was this patch tested?
The changes are tested in [SPARK-12384](https://issues.apache.org/jira/browse/SPARK-12384). This is just documenting the changes made.
Author: Dhruve Ashar <dhruveashar@gmail.com>
Closes#12333 from dhruve/doc/SPARK-14572.
jira: https://issues.apache.org/jira/browse/SPARK-13089
Add section in ml-classification.md for NaiveBayes DataFrame-based API, plus example code (using include_example to clip code from examples/ folder files).
Author: Yuhao Yang <hhbyyh@gmail.com>
Closes#11015 from hhbyyh/naiveBayesDoc.
## What changes were proposed in this pull request?
Add python CountVectorizerExample
## How was this patch tested?
manual tests
Author: Zheng RuiFeng <ruifengz@foxmail.com>
Closes#11917 from zhengruifeng/cv_pe.
## What changes were proposed in this pull request?
This PR fixes the `age` data types from `integer` to `long` in `SQL Programming Guide: JSON Datasets`.
## How was this patch tested?
Manual.
Author: Dongjoon Hyun <dongjoon@apache.org>
Closes#12290 from dongjoon-hyun/minor_fix_type_in_json_example.
## What changes were proposed in this pull request?
Add three Python examples.
## How was this patch tested?
manual tests
Author: Zheng RuiFeng <ruifengz@foxmail.com>
Closes#12063 from zhengruifeng/dct_pe.
Docs change to remove the sentence claiming Mesos does not support cluster mode; that statement was inaccurate.
Author: Michael Gummelt <mgummelt@mesosphere.io>
Closes#12249 from mgummelt/fix-mesos-cluster-docs.
## What changes were proposed in this pull request?
Since not having the correct ZooKeeper URL causes job failure, the documentation should include all parameters.
## How was this patch tested?
no tests necessary
Author: Malte <elmalto@users.noreply.github.com>
Closes#12218 from elmalto/patch-1.
## What changes were proposed in this pull request?
This patch removes DirectParquetOutputCommitter. This was initially created by Databricks as a faster way to write Parquet data to S3. However, given how the underlying S3 Hadoop implementation works, this committer only works when there are no failures. If there are multiple attempts of the same task (e.g. speculation or task failures or node failures), the output data can be corrupted. I don't think this performance optimization outweighs the correctness issue.
## How was this patch tested?
Removed the related tests also.
Author: Reynold Xin <rxin@databricks.com>
Closes#12229 from rxin/SPARK-10063.