## What changes were proposed in this pull request?
Update the `refreshTable` API usage in the Python code of the SQL programming guide.
This API was added in SPARK-15820.
## How was this patch tested?
N/A
Author: WeichenXu <WeichenXu123@outlook.com>
Closes #14220 from WeichenXu123/update_sql_doc_catalog.
## What changes were proposed in this pull request?
Minor fixes correcting typos, punctuation, and grammar.
Adding more anchors for easy navigation.
Fixing minor issues with code snippets.
## How was this patch tested?
`jekyll serve`
Author: Ahmed Mahran <ahmed.mahran@mashin.io>
Closes #14234 from ahmed-mahran/b-struct-streaming-docs.
## What changes were proposed in this pull request?
This PR moves the last remaining hard-coded Scala example snippet from the SQL programming guide into `SparkSqlExample.scala`. It also renames all Scala/Java example files so that every "Sql" in the file names becomes "SQL".
## How was this patch tested?
Manually verified the generated HTML page.
Author: Cheng Lian <lian@databricks.com>
Closes #14245 from liancheng/minor-scala-example-update.
## What changes were proposed in this pull request?
Fix code style from ad hoc review of RC4 doc
## How was this patch tested?
manual
shivaram
Author: Felix Cheung <felixcheung_m@hotmail.com>
Closes #14250 from felixcheung/rdocs2rc4.
## What changes were proposed in this pull request?
Updates programming guide for spark.gapply/spark.gapplyCollect.
As in the other examples, I used the `faithful` dataset to demonstrate gapply's functionality.
Please let me know if you prefer another example.
## How was this patch tested?
Existing test cases in R
Author: Narine Kokhlikyan <narine@slice.com>
Closes #14090 from NarineK/gapplyProgGuide.
## What changes were proposed in this pull request?
Made DataFrame-based API primary
* Spark doc menu bar and other places now link to ml-guide.html, not mllib-guide.html
* mllib-guide.html keeps RDD-specific list of features, with a link at the top redirecting people to ml-guide.html
* ml-guide.html includes a "maintenance mode" announcement about the RDD-based API
* **Reviewers: please check this carefully**
* (minor) Titles for DF API no longer include "- spark.ml" suffix. Titles for RDD API have "- RDD-based API" suffix
* Moved migration guide to ml-guide from mllib-guide
* Also moved past guides from mllib-migration-guides to ml-migration-guides, with a redirect link on mllib-migration-guides
* **Reviewers**: I did not change any of the content of the migration guides.
Reorganized DataFrame-based guide:
* ml-guide.html mimics the old mllib-guide.html page in terms of content: overview, migration guide, etc.
* Moved Pipeline description into ml-pipeline.html and moved tuning into ml-tuning.html
* **Reviewers**: I did not change the content of these guides, except some intro text.
* Sidebar remains the same, but with pipeline and tuning sections added
Other:
* ml-classification-regression.html: Moved text about linear methods to new section in page
## How was this patch tested?
Generated docs locally
Author: Joseph K. Bradley <joseph@databricks.com>
Closes #14213 from jkbradley/ml-guide-2.0.
If a custom Jekyll template tag throws Ruby's equivalent of a "file not found" exception, then Jekyll will stop the doc building process but will exit with a successful status, causing our doc publishing jobs to silently fail.
This is caused by https://github.com/jekyll/jekyll/issues/5104, a case of bad error-handling logic in Jekyll. This patch works around this by updating our `include_example.rb` plugin to catch the exception and exit rather than allowing it to bubble up and be ignored by Jekyll.
I tested this manually with
```
rm ./examples/src/main/scala/org/apache/spark/examples/sql/SparkSQLExample.scala
cd docs
SKIP_API=1 jekyll build
echo $?
```
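The workaround described above can be sketched as follows. This is a minimal Python analogue, not the Ruby code of `include_example.rb`: the function name and message format are hypothetical, and the only point it illustrates is catching the "file not found" error at the source and exiting with a non-zero status instead of letting a caller swallow it.

```python
import sys


def include_example(path):
    """Return the contents of an example file, aborting the build if it is missing.

    Hypothetical stand-in for the plugin's include step.
    """
    try:
        with open(path, encoding="utf-8") as handle:
            return handle.read()
    except FileNotFoundError as error:
        # Exit explicitly with a non-zero status rather than letting the
        # exception propagate to a caller that might ignore it.
        print(f"include_example: cannot read {path}: {error}", file=sys.stderr)
        sys.exit(1)
```

With this shape, a missing example file ends the process with status 1, which is what `echo $?` in the manual test above is meant to reveal.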
Author: Josh Rosen <joshrosen@databricks.com>
Closes #14209 from JoshRosen/fix-doc-building.
## What changes were proposed in this pull request?
Fixes a typo in the sql programming guide
## How was this patch tested?
Building docs locally
Author: Shivaram Venkataraman <shivaram@cs.berkeley.edu>
Closes #14208 from shivaram/spark-sql-doc-fix.
This prevents the NM from starting when something is wrong, which would
lead to later errors which are confusing and harder to debug.
Added a unit test to verify startup fails if something is wrong.
Author: Marcelo Vanzin <vanzin@cloudera.com>
Closes #14162 from vanzin/SPARK-16505.
## What changes were proposed in this pull request?
Minor documentation update for code example, code style, and missed reference to "sparkR.init"
## How was this patch tested?
manual
shivaram
Author: Felix Cheung <felixcheung_m@hotmail.com>
Closes #14178 from felixcheung/rcsvprogrammingguide.
## What changes were proposed in this pull request?
Updated structured streaming programming guide with new windowed example.
## How was this patch tested?
Docs
Author: James Thomas <jamesjoethomas@gmail.com>
Closes #14183 from jjthomas/ss_docs_update.
## What changes were proposed in this pull request?
Add documentation on asynchronous actions to the actions section of the programming guide.
## How was this patch tested?
Checked the documentation indentation and formatting with a Markdown preview.
Author: sandy <phalodi@gmail.com>
Closes #14104 from phalodi/SPARK-16438.
- Hard-coded Spark SQL sample snippets were moved into source files under examples sub-project.
- Removed the inconsistency between Scala and Java Spark SQL examples
- Scala and Java Spark SQL examples were updated
The work is still in progress. All involved examples were tested manually. An additional round of testing will be done after the code review.
![image](https://cloud.githubusercontent.com/assets/6235869/16710314/51851606-462a-11e6-9fbe-0818daef65e4.png)
Author: aokolnychyi <okolnychyyanton@gmail.com>
Closes #14119 from aokolnychyi/spark_16303.
## What changes were proposed in this pull request?
When a query uses only metadata (for example, partition keys), it can return results based on that metadata without scanning files. Hive did this in HIVE-1003.
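To illustrate the idea only (this is not Spark's optimizer rule), the sketch below answers a "distinct partition values" question from Hive-style partition paths alone, without opening any data file. The function name and the sample paths are hypothetical.

```python
def distinct_partition_values(file_paths, partition_key):
    """Collect the distinct values of one partition key from Hive-style paths."""
    values = set()
    for path in file_paths:
        for segment in path.split("/"):
            key, sep, value = segment.partition("=")
            # Only `key=value` directory segments carry partition metadata.
            if sep and key == partition_key:
                values.add(value)
    return sorted(values)


# Hypothetical table layout; the files themselves are never read.
paths = [
    "warehouse/events/year=2015/part-00000.parquet",
    "warehouse/events/year=2015/part-00001.parquet",
    "warehouse/events/year=2016/part-00000.parquet",
]
print(distinct_partition_values(paths, "year"))
```

The result depends only on directory names, which is why such a query can skip the file scan entirely.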
## How was this patch tested?
Added unit tests.
Author: Lianhui Wang <lianhuiwang09@gmail.com>
Author: Wenchen Fan <wenchen@databricks.com>
Author: Lianhui Wang <lianhuiwang@users.noreply.github.com>
Closes #13494 from lianhuiwang/metadata-only.
## What changes were proposed in this pull request?
Some minor changes for documentation page "Spark Streaming + Kinesis Integration".
Moved "streaming-kinesis-arch.png" before the bullet list, not in between the bullets.
## How was this patch tested?
Tested manually, on my local machine.
Author: Xin Ren <iamshrek@126.com>
Closes #14097 from keypointt/kinesisDoc.
## What changes were proposed in this pull request?
* Update the SparkR ML section to make it consistent with the SparkR API docs.
* #13972 added labelling support for the ```include_example``` Jekyll plugin, so we can now split the single ```ml.R``` example file into multiple line blocks with different labels and include them under different algorithms/models in the generated HTML page.
## How was this patch tested?
Only docs update, manually check the generated docs.
Author: Yanbo Liang <ybliang8@gmail.com>
Closes #14011 from yanboliang/r-user-guide-update.
## What changes were proposed in this pull request?
After SPARK-16476 (committed earlier today as #14128), we can finally bump the version number.
## How was this patch tested?
N/A
Author: Reynold Xin <rxin@databricks.com>
Closes #14130 from rxin/SPARK-16477.
## What changes were proposed in this pull request?
Documentation changes to indicate that fine-grained mode is now deprecated. No code changes were made, and all fine-grained mode instructions were left in place. We can remove all of that once the deprecation cycle completes (Does Spark have a standard deprecation cycle? One major version?)
Blocked on https://github.com/apache/spark/pull/14059
## How was this patch tested?
Viewed in GitHub
Author: Michael Gummelt <mgummelt@mesosphere.io>
Closes #14078 from mgummelt/deprecate-fine-grained.
## What changes were proposed in this pull request?
docs
## How was this patch tested?
Viewed the docs in GitHub
Author: Michael Gummelt <mgummelt@mesosphere.io>
Closes #14059 from mgummelt/coarse-grained.
## What changes were proposed in this pull request?
I searched the whole docs directory for uses of SQLContext and updated the following places:
- docs/configuration.md, SparkR code snippets.
- docs/streaming-programming-guide.md, several code examples.
## How was this patch tested?
N/A
Author: WeichenXu <WeichenXu123@outlook.com>
Closes #14025 from WeichenXu123/WIP_SQLContext_update.
## What changes were proposed in this pull request?
Coincidentally, I discovered that a couple images were unused in `docs/`, and then searched and found more, and then realized some PNGs were pretty big and could be crushed, and before I knew it, had done the same for the ASF site (not committed yet).
No functional change at all, just less superfluous image data.
## How was this patch tested?
`jekyll serve`
Author: Sean Owen <sowen@cloudera.com>
Closes #14029 from srowen/RemoveCompressImages.
## What changes were proposed in this pull request?
I extracted 6 example programs from the GraphX programming guide and replaced them with
`include_example` labels.
The 6 example programs are:
- AggregateMessagesExample.scala
- SSSPExample.scala
- TriangleCountingExample.scala
- ConnectedComponentsExample.scala
- ComprehensiveExample.scala
- PageRankExample.scala
All of the examples can be run using
`bin/run-example graphx.EXAMPLE_NAME`
## How was this patch tested?
Manual.
Author: WeichenXu <WeichenXu123@outlook.com>
Closes #14015 from WeichenXu123/graphx_example_plugin.
## What changes were proposed in this pull request?
Two test data files used by the GraphX examples live in the "graphx/data" directory.
I moved them into the "data/" directory, because the "graphx" directory is meant for code files and the other test data files (such as the MLlib and streaming test data) are all in "data/".
I also updated the GraphX documentation wherever it references the moved data files.
## How was this patch tested?
N/A
Author: WeichenXu <WeichenXu123@outlook.com>
Closes #14010 from WeichenXu123/move_graphx_data_dir.
This PR adds the breaking changes from [SPARK-14810](https://issues.apache.org/jira/browse/SPARK-14810) to the migration guide.
## How was this patch tested?
Built docs locally.
Author: Nick Pentreath <nickp@za.ibm.com>
Closes #13924 from MLnick/SPARK-15643-migration-guide.
## What changes were proposed in this pull request?
This PR adds labelling support for the `include_example` Jekyll plugin, so that we may split a single source file into multiple line blocks with different labels, and include them in multiple code snippets in the generated HTML page.
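The labelled-block idea can be sketched roughly as below. This is a Python illustration, not the Ruby plugin itself; the `$example on:<label>$` / `$example off:<label>$` marker syntax and the function name are assumptions made for the sketch.

```python
import re


def select_labelled_lines(source, label):
    """Return the lines between the on/off markers for one label (assumed syntax)."""
    on = re.compile(r"\$example on:" + re.escape(label) + r"\$")
    off = re.compile(r"\$example off:" + re.escape(label) + r"\$")
    selected, active = [], False
    for line in source.splitlines():
        if on.search(line):
            active = True
        elif off.search(line):
            active = False
        elif active:
            selected.append(line)
    return selected


# Hypothetical example file with two labelled blocks.
source = """\
// $example on:create$
val df = spark.read.json("people.json")
// $example off:create$
println("not shown in the guide")
// $example on:show$
df.show()
// $example off:show$
"""
print(select_labelled_lines(source, "create"))
print(select_labelled_lines(source, "show"))
```

Each label selects its own block, so one source file can feed several snippets on the generated page.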
## How was this patch tested?
Manually tested.
<img width="923" alt="screenshot at jun 29 19-53-21" src="https://cloud.githubusercontent.com/assets/230655/16451099/66a76db2-3e33-11e6-84fb-63104c2f0688.png">
Author: Cheng Lian <lian@databricks.com>
Closes #13972 from liancheng/include-example-with-labels.
## What changes were proposed in this pull request?
YARN has supported rolling log aggregation since 2.6. Previously, logs were aggregated to HDFS only after the application finished, which is quite painful for long-running applications such as Spark Streaming and the Thrift server. Out-of-disk problems can also occur when a log file grows too large. This PR therefore proposes adding support for rolling log aggregation for Spark on YARN.
One limitation is that log4j must be configured to use a file appender. Spark itself uses the console appender by default, in which case the log file is not re-created once it is removed after aggregation. I think many production users have already changed their log4j configuration from the default, so this is not a big problem.
## How was this patch tested?
Manually verified with Hadoop 2.7.1.
Author: jerryshao <sshao@hortonworks.com>
Closes #13712 from jerryshao/SPARK-15990.
## What changes were proposed in this pull request?
Update ```spark.ml``` and ```spark.mllib``` migration guide from 1.6 to 2.0.
## How was this patch tested?
Docs update, no tests.
Author: Yanbo Liang <ybliang8@gmail.com>
Closes #13378 from yanboliang/spark-13448.
## What changes were proposed in this pull request?
This PR makes several updates to SQL programming guide.
Author: Yin Huai <yhuai@databricks.com>
Closes #13938 from yhuai/doc.
## What changes were proposed in this pull request?
Made changes to HashingTF, QuantileVectorizer and CountVectorizer
Author: GayathriMurali <gayathri.m@intel.com>
Closes #13745 from GayathriMurali/SPARK-15997.
## What changes were proposed in this pull request?
This changes the behavior of --num-executors and spark.executor.instances when using dynamic allocation. Instead of turning dynamic allocation off, it uses the value for the initial number of executors.
This change was discussed on [SPARK-13723](https://issues.apache.org/jira/browse/SPARK-13723). I highly recommend adopting it while we can still change the behavior for 2.0.0. In practice, the 1.x behavior surprises users (it is not clear that it disables dynamic allocation) and wastes cluster resources because users rarely notice the log message.
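A rough sketch of the new behavior, in Python rather than Spark's Scala: with dynamic allocation on, `spark.executor.instances` contributes to the starting executor count instead of switching dynamic allocation off. Taking the maximum of the three settings is an assumption of this sketch, as are the function name and the zero defaults; the description above only says the value is used as the initial number of executors.

```python
def initial_executors(conf):
    """Pick the starting executor count when dynamic allocation is enabled.

    Assumption: the largest of the three settings wins.
    """
    candidates = [
        int(conf.get("spark.dynamicAllocation.minExecutors", 0)),
        int(conf.get("spark.dynamicAllocation.initialExecutors", 0)),
        int(conf.get("spark.executor.instances", 0)),
    ]
    return max(candidates)


# Hypothetical configuration: --num-executors 10 with a minimum of 2.
conf = {
    "spark.dynamicAllocation.minExecutors": "2",
    "spark.executor.instances": "10",
}
print(initial_executors(conf))
```

Under this reading, a job submitted with `--num-executors 10` starts with 10 executors and can still scale up or down afterwards.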
## How was this patch tested?
This patch updates tests and adds a test for Utils.getDynamicAllocationInitialExecutors.
Author: Ryan Blue <blue@apache.org>
Closes #13338 from rdblue/SPARK-13723-num-executors-with-dynamic-allocation.
## What changes were proposed in this pull request?
Updated setJobGroup, cancelJobGroup, clearJobGroup to not require sc/SparkContext as parameter.
Also updated roxygen2 doc and R programming guide on deprecations.
## How was this patch tested?
unit tests
Author: Felix Cheung <felixcheung_m@hotmail.com>
Closes #13838 from felixcheung/rjobgroup.
## What changes were proposed in this pull request?
Guide for
- UDFs with dapply, dapplyCollect
- spark.lapply for running parallel R functions
## How was this patch tested?
build locally
<img width="654" alt="screen shot 2016-06-14 at 03 12 56" src="https://cloud.githubusercontent.com/assets/3419881/16039344/12a3b6a0-31de-11e6-8d77-fe23308075c0.png">
Author: Kai Jiang <jiangkai@gmail.com>
Closes #13660 from vectorijk/spark-15672-R-guide-update.
## What changes were proposed in this pull request?
Doc changes
## How was this patch tested?
manual
liancheng
Author: Felix Cheung <felixcheung_m@hotmail.com>
Closes #13827 from felixcheung/sqldocdeprecate.
## What changes were proposed in this pull request?
jira: https://issues.apache.org/jira/browse/SPARK-16045
2.0 Audit: Update document for StopWordsRemover and Binarizer.
## How was this patch tested?
manual review for doc
Author: Yuhao Yang <hhbyyh@gmail.com>
Author: Yuhao Yang <yuhao.yang@intel.com>
Closes #13375 from hhbyyh/stopdoc.
## What changes were proposed in this pull request?
Update docs for two parameters, `spark.sql.files.maxPartitionBytes` and `spark.sql.files.openCostInBytes`, in Other Configuration Options.
## How was this patch tested?
N/A
Author: Takeshi YAMAMURO <linguin.m.s@gmail.com>
Closes #13797 from maropu/SPARK-15894-2.
## What changes were proposed in this pull request?
Update doc as per discussion in PR #13592
## How was this patch tested?
manual
shivaram liancheng
Author: Felix Cheung <felixcheung_m@hotmail.com>
Closes #13799 from felixcheung/rsqlprogrammingguide.
This has changed from 1.6, and now stores memory off-heap using Spark's off-heap support instead of in Tachyon.
Author: Eric Liang <ekl@databricks.com>
Closes #13744 from ericl/spark-16025.
## What changes were proposed in this pull request?
Initial SQL programming guide update for Spark 2.0. Contents like 1.6 to 2.0 migration guide are still incomplete.
We may also want to add more examples for Scala/Java Dataset typed transformations.
## How was this patch tested?
N/A
Author: Cheng Lian <lian@databricks.com>
Closes #13592 from liancheng/sql-programming-guide-2.0.
## What changes were proposed in this pull request?
roxygen2 doc, programming guide, example updates
## How was this patch tested?
manual checks
shivaram
Author: Felix Cheung <felixcheung_m@hotmail.com>
Closes #13751 from felixcheung/rsparksessiondoc.
## What changes were proposed in this pull request?
In the 2.0 documentation, the line "A full example that produces the experiment described in the PIC paper can be found under examples/." is redundant.
There is already the line "Find full example code at "examples/src/main/scala/org/apache/spark/examples/mllib/PowerIterationClusteringExample.scala" in the Spark repo.".
We should remove the first line, which makes this page consistent with the other documents.
## How was this patch tested?
Manual test
Author: wm624@hotmail.com <wm624@hotmail.com>
Closes #13755 from wangmiao1981/doc.
## What changes were proposed in this pull request?
Make user guide changes to SparkR documentation for all changes that happened in 2.0 to Machine Learning APIs
Author: GayathriMurali <gayathri.m@intel.com>
Closes #13285 from GayathriMurali/SPARK-15129.
## What changes were proposed in this pull request?
add ml doc for ml isotonic regression
add scala example for ml isotonic regression
add java example for ml isotonic regression
add python example for ml isotonic regression
modify scala example for mllib isotonic regression
modify java example for mllib isotonic regression
modify python example for mllib isotonic regression
add data/mllib/sample_isotonic_regression_libsvm_data.txt
delete data/mllib/sample_isotonic_regression_data.txt
## How was this patch tested?
N/A
Author: WeichenXu <WeichenXu123@outlook.com>
Closes #13381 from WeichenXu123/add_isotonic_regression_doc.
## What changes were proposed in this pull request?
Reduce `spark.memory.fraction` default to 0.6 in order to make it fit within default JVM old generation size (2/3 heap). See JIRA discussion. This means a full cache doesn't spill into the new gen. CC andrewor14
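The arithmetic behind this can be checked with a short sketch. Several inputs are assumptions not stated above: a 300 MB reservation subtracted from the heap before the fraction applies, 0.75 as the larger comparison value, and a 4 GB example heap. Only the 0.6 default and the 2/3 old generation share come from the description.

```python
RESERVED_MB = 300  # assumed fixed reservation taken off the heap first


def unified_memory_mb(heap_mb, fraction):
    """Memory governed by spark.memory.fraction, under the assumed formula."""
    return (heap_mb - RESERVED_MB) * fraction


def old_gen_mb(heap_mb):
    """Default JVM old generation size, taken as 2/3 of the heap."""
    return heap_mb * 2 / 3


heap = 4096  # example heap size in MB (assumption)
for fraction in (0.75, 0.6):
    region = unified_memory_mb(heap, fraction)
    print(fraction, round(region), region <= old_gen_mb(heap))
```

Under these assumptions, the 0.75 region is larger than the old generation while the 0.6 region fits inside it, which is the property the change is after.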
## How was this patch tested?
Jenkins tests.
Author: Sean Owen <sowen@cloudera.com>
Closes #13618 from srowen/SPARK-15796.
## What changes were proposed in this pull request?
minor typo
## How was this patch tested?
minor typo in the doc, should be self explanatory
Author: Mortada Mehyar <mortada.mehyar@gmail.com>
Closes #13639 from mortada/typo.