### What changes were proposed in this pull request?
Bugfix: link to correction location of Pyspark Dataframe documentation
### Why are the changes needed?
Current website returns "Not found"
### Does this PR introduce _any_ user-facing change?
Website fix
### How was this patch tested?
Documentation change
Closes#33420 from dominikgehl/feature/SPARK-36209.
Authored-by: Dominik Gehl <dog@open.ch>
Signed-off-by: Sean Owen <srowen@gmail.com>
(cherry picked from commit 3a1db2ddd4)
Signed-off-by: Sean Owen <srowen@gmail.com>
### What changes were proposed in this pull request?
This PR will remove references to these "blacklist" and "whitelist" terms besides the blacklisting feature as a whole, which can be handled in a separate JIRA/PR.
This touches quite a few files, but the changes are straightforward (variable/method/etc. name changes) and most quite self-contained.
### Why are the changes needed?
As per discussion on the Spark dev list, it will be beneficial to remove references to problematic language that can alienate potential community members. One such reference is "blacklist" and "whitelist". While it seems to me that there is some valid debate as to whether these terms have racist origins, the cultural connotations are inescapable in today's world.
### Does this PR introduce _any_ user-facing change?
In the test file `HiveQueryFileTest`, a developer has the ability to specify the system property `spark.hive.whitelist` to specify a list of Hive query files that should be tested. This system property has been renamed to `spark.hive.includelist`. The old property has been kept for compatibility, but will log a warning if used. I am open to feedback from others on whether keeping a deprecated property here is unnecessary given that this is just for developers running tests.
### How was this patch tested?
Existing tests should be suitable since no behavior changes are expected as a result of this PR.
Closes#28874 from xkrogen/xkrogen-SPARK-32036-rename-blacklists.
Authored-by: Erik Krogen <ekrogen@linkedin.com>
Signed-off-by: Thomas Graves <tgraves@apache.org>
### What changes were proposed in this pull request?
This change replaces the world slave with alternatives matching the context.
### Why are the changes needed?
There is no need to call things slave, we might as well use better clearer names.
### Does this PR introduce _any_ user-facing change?
Yes, the ouput JSON does change. To allow backwards compatibility this is an additive change.
The shell scripts for starting & stopping workers are renamed, and for backwards compatibility old scripts are added to call through to the new ones while printing a deprecation message to stderr.
### How was this patch tested?
Existing tests.
Closes#28864 from holdenk/SPARK-32004-drop-references-to-slave.
Lead-authored-by: Holden Karau <hkarau@apple.com>
Co-authored-by: Holden Karau <holden@pigscanfly.ca>
Signed-off-by: Holden Karau <hkarau@apple.com>
### What changes were proposed in this pull request?
Change the link to the Scala API document.
```
$ git grep "#org.apache.spark.package"
docs/_layouts/global.html: <li><a href="api/scala/index.html#org.apache.spark.package">Scala</a></li>
docs/index.md:* [Spark Scala API (Scaladoc)](api/scala/index.html#org.apache.spark.package)
docs/rdd-programming-guide.md:[Scala](api/scala/#org.apache.spark.package), [Java](api/java/), [Python](api/python/) and [R](api/R/).
```
### Why are the changes needed?
The home page link for Scala API document is incorrect after upgrade to 3.0
### Does this PR introduce any user-facing change?
Document UI change only.
### How was this patch tested?
Local test, attach screenshots below:
Before:
![image](https://user-images.githubusercontent.com/4833765/74335713-c2385300-4dd7-11ea-95d8-f5a3639d2578.png)
After:
![image](https://user-images.githubusercontent.com/4833765/74335727-cbc1bb00-4dd7-11ea-89d9-4dcc1310e679.png)
Closes#27549 from xuanyuanking/scala-doc.
Authored-by: Yuanjian Li <xyliyuanjian@gmail.com>
Signed-off-by: Sean Owen <srowen@gmail.com>
### What changes were proposed in this pull request?
- Remove SQLContext.createExternalTable and Catalog.createExternalTable, deprecated in favor of createTable since 2.2.0, plus tests of deprecated methods
- Remove HiveContext, deprecated in 2.0.0, in favor of `SparkSession.builder.enableHiveSupport`
- Remove deprecated KinesisUtils.createStream methods, plus tests of deprecated methods, deprecate in 2.2.0
- Remove deprecated MLlib (not Spark ML) linear method support, mostly utility constructors and 'train' methods, and associated docs. This includes methods in LinearRegression, LogisticRegression, Lasso, RidgeRegression. These have been deprecated since 2.0.0
- Remove deprecated Pyspark MLlib linear method support, including LogisticRegressionWithSGD, LinearRegressionWithSGD, LassoWithSGD
- Remove 'runs' argument in KMeans.train() method, which has been a no-op since 2.0.0
- Remove deprecated ChiSqSelector isSorted protected method
- Remove deprecated 'yarn-cluster' and 'yarn-client' master argument in favor of 'yarn' and deploy mode 'cluster', etc
Notes:
- I was not able to remove deprecated DataFrameReader.json(RDD) in favor of DataFrameReader.json(Dataset); the former was deprecated in 2.2.0, but, it is still needed to support Pyspark's .json() method, which can't use a Dataset.
- Looks like SQLContext.createExternalTable was not actually deprecated in Pyspark, but, almost certainly was meant to be? Catalog.createExternalTable was.
- I afterwards noted that the toDegrees, toRadians functions were almost removed fully in SPARK-25908, but Felix suggested keeping just the R version as they hadn't been technically deprecated. I'd like to revisit that. Do we really want the inconsistency? I'm not against reverting it again, but then that implies leaving SQLContext.createExternalTable just in Pyspark too, which seems weird.
- I *kept* LogisticRegressionWithSGD, LinearRegressionWithSGD, LassoWithSGD, RidgeRegressionWithSGD in Pyspark, though deprecated, as it is hard to remove them (still used by StreamingLogisticRegressionWithSGD?) and they are not fully removed in Scala. Maybe should not have been deprecated.
### Why are the changes needed?
Deprecated items are easiest to remove in a major release, so we should do so as much as possible for Spark 3. This does not target items deprecated 'recently' as of Spark 2.3, which is still 18 months old.
### Does this PR introduce any user-facing change?
Yes, in that deprecated items are removed from some public APIs.
### How was this patch tested?
Existing tests.
Closes#25684 from srowen/SPARK-28980.
Lead-authored-by: Sean Owen <sean.owen@databricks.com>
Co-authored-by: HyukjinKwon <gurwls223@apache.org>
Signed-off-by: Sean Owen <sean.owen@databricks.com>
## What changes were proposed in this pull request?
Add AL2 license to metadata of all .md files.
This seemed to be the tidiest way as it will get ignored by .md renderers and other tools. Attempts to write them as markdown comments revealed that there is no such standard thing.
## How was this patch tested?
Doc build
Closes#24243 from srowen/SPARK-26918.
Authored-by: Sean Owen <sean.owen@databricks.com>
Signed-off-by: Sean Owen <sean.owen@databricks.com>
## What changes were proposed in this pull request?
Spark apps do not need to package Spark. In fact it can cause problems in some cases. Our examples should show depending on Spark as a 'provided' dependency.
Packaging Spark makes the app much bigger by tens of megabytes. It can also bring in conflicting dependencies that wouldn't otherwise be a problem. https://issues.apache.org/jira/browse/SPARK-26146 was what reminded me of this.
## How was this patch tested?
Doc build
Closes#23938 from srowen/Provided.
Authored-by: Sean Owen <sean.owen@databricks.com>
Signed-off-by: Sean Owen <sean.owen@databricks.com>
## What changes were proposed in this pull request?
Fix Typos.
## How was this patch tested?
NA
Closes#23145 from kjmrknsn/docUpdate.
Authored-by: Keiji Yoshida <kjmrknsn@gmail.com>
Signed-off-by: Sean Owen <sean.owen@databricks.com>
## What changes were proposed in this pull request?
Remove Kafka 0.8 integration
## How was this patch tested?
Existing tests, build scripts
Closes#22703 from srowen/SPARK-25705.
Authored-by: Sean Owen <sean.owen@databricks.com>
Signed-off-by: Sean Owen <sean.owen@databricks.com>
## What changes were proposed in this pull request?
Removes all vestiges of Flume in the build, for Spark 3.
I don't think this needs Jenkins config changes.
## How was this patch tested?
Existing tests.
Closes#22692 from srowen/SPARK-25598.
Authored-by: Sean Owen <sean.owen@databricks.com>
Signed-off-by: Sean Owen <sean.owen@databricks.com>
## What changes were proposed in this pull request?
Fix a few broken links and typos, and, nit, use HTTPS more consistently esp. on scripts and Apache links
## How was this patch tested?
Doc build
Closes#22172 from srowen/DocTypo.
Authored-by: Sean Owen <sean.owen@databricks.com>
Signed-off-by: hyukjinkwon <gurwls223@apache.org>
## What changes were proposed in this pull request?
Updated streaming guide for direct stream and link to integration guide.
## How was this patch tested?
jekyll build
Author: Rekha Joshi <rekhajoshm@gmail.com>
Closes#21683 from rekhajoshm/SPARK-24507.
## What changes were proposed in this pull request?
Easy fix in the documentation.
## How was this patch tested?
N/A
Closes#20948
Author: Daniel Sakuma <dsakuma@gmail.com>
Closes#20928 from dsakuma/fix_typo_configuration_docs.
## What changes were proposed in this pull request?
Fix spelling in quick-start doc.
## How was this patch tested?
Doc only.
Author: Shashwat Anand <me@shashwat.me>
Closes#20336 from ashashwat/SPARK-23165.
Change-Id: I88c272444ca734dc2cbc2592607c11287b90a383
## What changes were proposed in this pull request?
The documentation on File DStreams is enhanced to
1. Detail the exact timestamp logic for examining directories and files.
1. Detail how object stores different from filesystems, and so how using them as a source of data should be treated with caution, possibly publishing data to the store differently (direct PUTs as opposed to stage + rename)
## How was this patch tested?
n/a
Author: Steve Loughran <stevel@hortonworks.com>
Closes#17743 from steveloughran/cloud/SPARK-20448-document-dstream-blobstore.
## What changes were proposed in this pull request?
Put Kafka 0.8 support behind a kafka-0-8 profile.
## How was this patch tested?
Existing tests, but, until PR builder and Jenkins configs are updated the effect here is to not build or test Kafka 0.8 support at all.
Author: Sean Owen <sowen@cloudera.com>
Closes#19134 from srowen/SPARK-21893.
## What changes were proposed in this pull request?
Update internal references from programming-guide to rdd-programming-guide
See 5ddf243fd8 and https://github.com/apache/spark/pull/18485#issuecomment-314789751
Let's keep the redirector even if it's problematic to build, but not rely on it internally.
## How was this patch tested?
(Doc build)
Author: Sean Owen <sowen@cloudera.com>
Closes#18625 from srowen/SPARK-21267.2.
- Move external/java8-tests tests into core, streaming, sql and remove
- Remove MaxPermGen and related options
- Fix some reflection / TODOs around Java 8+ methods
- Update doc references to 1.7/1.8 differences
- Remove Java 7/8 related build profiles
- Update some plugins for better Java 8 compatibility
- Fix a few Java-related warnings
For the future:
- Update Java 8 examples to fully use Java 8
- Update Java tests to use lambdas for simplicity
- Update Java internal implementations to use lambdas
## How was this patch tested?
Existing tests
Author: Sean Owen <sowen@cloudera.com>
Closes#16871 from srowen/SPARK-19493.
Spark's I/O encryption uses an ephemeral key for each driver instance.
So driver B cannot decrypt data written by driver A since it doesn't
have the correct key.
The write ahead log is used for recovery, thus needs to be readable by
a different driver. So it cannot be encrypted by Spark's I/O encryption
code.
The BlockManager APIs used by the WAL code to write the data automatically
encrypt data, so changes are needed so that callers can to opt out of
encryption.
Aside from that, the "putBytes" API in the BlockManager does not do
encryption, so a separate situation arised where the WAL would write
unencrypted data to the BM and, when those blocks were read, decryption
would fail. So the WAL code needs to ask the BM to encrypt that data
when encryption is enabled; this code is not optimal since it results
in a (temporary) second copy of the data block in memory, but should be
OK for now until a more performant solution is added. The non-encryption
case should not be affected.
Tested with new unit tests, and by running streaming apps that do
recovery using the WAL data with I/O encryption turned on.
Author: Marcelo Vanzin <vanzin@cloudera.com>
Closes#16862 from vanzin/SPARK-19520.
## What changes were proposed in this pull request?
Added missing Java example under section "Design Patterns for using foreachRDD". Now this section has examples in all 3 languages, improving consistency of documentation.
## How was this patch tested?
Manual.
Generated docs using command "SKIP_API=1 jekyll build" and verified generated HTML page manually.
The syntax of example has been tested for correctness using sample code on Java1.7 and Spark 2.2.0-SNAPSHOT.
Author: adesharatushar <tushar_adeshara@persistent.com>
Closes#16408 from adesharatushar/streaming-doc-fix.
## What changes were proposed in this pull request?
Updates links to the wiki to links to the new location of content on spark.apache.org.
## How was this patch tested?
Doc builds
Author: Sean Owen <sowen@cloudera.com>
Closes#15967 from srowen/SPARK-18073.1.
## What changes were proposed in this pull request?
The discussion of the interaction of Accumulators and Broadcast Variables should logically follow the discussion on Checkpointing. As currently written, this section discusses Checkpointing before it is formally introduced. To remedy this:
- Rename this section to "Accumulators, Broadcast Variables, and Checkpoints", and
- Move this section after "Checkpointing".
## How was this patch tested?
Testing: ran
$ SKIP_API=1 jekyll build
, and verified changes in a Web browser pointed at docs/_site/index.html.
Author: José Hiram Soltren <jose@cloudera.com>
Closes#15281 from jsoltren/doc-changes.
## What changes were proposed in this pull request?
Point references to spark-packages.org to https://cwiki.apache.org/confluence/display/SPARK/Third+Party+Projects
This will be accompanied by a parallel change to the spark-website repo, and additional changes to this wiki.
## How was this patch tested?
Jenkins tests.
Author: Sean Owen <sowen@cloudera.com>
Closes#15075 from srowen/SPARK-17445.
## What changes were proposed in this pull request?
(Please fill in changes proposed in this fix)
Streaming doc correction.
## How was this patch tested?
(Please explain how this patch was tested. E.g. unit tests, integration tests, manual tests)
(If this patch involves UI changes, please attach a screenshot; otherwise, remove this)
Author: Satendra Kumar <satendra@knoldus.com>
Closes#14996 from satendrakumar06/patch-1.
## What changes were proposed in this pull request?
Fix minor typos python example code in streaming programming guide
## How was this patch tested?
N/A
Author: Dmitriy Sokolov <silentsokolov@gmail.com>
Closes#14805 from silentsokolov/fix-typos.
## What changes were proposed in this pull request?
Updated links of external dstream projects.
## How was this patch tested?
Just document changes.
Author: Shixiong Zhu <shixiong@databricks.com>
Closes#14814 from zsxwing/dstream-link.
## What changes were proposed in this pull request?
When documentation is built is should reference examples from the same build. There are times when the docs have links that point to files in the GitHub head which may not be valid on the current release. Changed that in URLs to make them point to the right tag in git using ```SPARK_VERSION_SHORT```
…from its own release version] [Streaming programming guide]
Author: Jagadeesan <as2@us.ibm.com>
Closes#14596 from jagadeesanas2/SPARK-12370.
## What changes were proposed in this pull request?
Fix the broken links in the programming guide of the Graphx Migration and understanding closures
## How was this patch tested?
By running the test cases and checking the links.
Author: Shivansh <shiv4nsh@gmail.com>
Closes#14503 from shiv4nsh/SPARK-16911.
## What changes were proposed in this pull request?
Doc for the Kafka 0.10 integration
## How was this patch tested?
Scala code examples were taken from my example repo, so hopefully they compile.
Author: cody koeninger <cody@koeninger.org>
Closes#14385 from koeninger/SPARK-16312.
## What changes were proposed in this pull request?
added missing keyword for java example
## How was this patch tested?
wasn't
Author: Bartek Wiśniewski <wedi@Ava.local>
Closes#14381 from wedi-dev/quickfix/missing_keyword.
## What changes were proposed in this pull request?
Made DataFrame-based API primary
* Spark doc menu bar and other places now link to ml-guide.html, not mllib-guide.html
* mllib-guide.html keeps RDD-specific list of features, with a link at the top redirecting people to ml-guide.html
* ml-guide.html includes a "maintenance mode" announcement about the RDD-based API
* **Reviewers: please check this carefully**
* (minor) Titles for DF API no longer include "- spark.ml" suffix. Titles for RDD API have "- RDD-based API" suffix
* Moved migration guide to ml-guide from mllib-guide
* Also moved past guides from mllib-migration-guides to ml-migration-guides, with a redirect link on mllib-migration-guides
* **Reviewers**: I did not change any of the content of the migration guides.
Reorganized DataFrame-based guide:
* ml-guide.html mimics the old mllib-guide.html page in terms of content: overview, migration guide, etc.
* Moved Pipeline description into ml-pipeline.html and moved tuning into ml-tuning.html
* **Reviewers**: I did not change the content of these guides, except some intro text.
* Sidebar remains the same, but with pipeline and tuning sections added
Other:
* ml-classification-regression.html: Moved text about linear methods to new section in page
## How was this patch tested?
Generated docs locally
Author: Joseph K. Bradley <joseph@databricks.com>
Closes#14213 from jkbradley/ml-guide-2.0.
## What changes were proposed in this pull request?
I search the whole documents directory using SQLContext, and update the following places:
- docs/configuration.md, sparkR code snippets.
- docs/streaming-programming-guide.md, several example code.
## How was this patch tested?
N/A
Author: WeichenXu <WeichenXu123@outlook.com>
Closes#14025 from WeichenXu123/WIP_SQLContext_update.
## What changes were proposed in this pull request?
- Deprecate old Java accumulator API; should use Scala now
- Update Java tests and examples
- Don't bother testing old accumulator API in Java 8 (too)
- (fix a misspelling too)
## How was this patch tested?
Jenkins tests
Author: Sean Owen <sowen@cloudera.com>
Closes#13606 from srowen/SPARK-15086.
## What changes were proposed in this pull request?
I use spell check tools checks typo in spark documents and fix them.
## How was this patch tested?
N/A
Author: WeichenXu <WeichenXu123@outlook.com>
Closes#13538 from WeichenXu123/fix_doc_typo.
## What changes were proposed in this pull request?
The patch updates the codes & docs in the example module as well as the related doc module:
- [ ] [docs] `streaming-programming-guide.md`
- [x] scala code part
- [ ] java code part
- [ ] python code part
- [x] [examples] `RecoverableNetworkWordCount.scala`
- [ ] [examples] `JavaRecoverableNetworkWordCount.java`
- [ ] [examples] `recoverable_network_wordcount.py`
## How was this patch tested?
Ran the examples and verified results manually.
Author: Liwei Lin <lwlin7@gmail.com>
Closes#12981 from lw-lin/accumulatorV2-examples.
## What changes were proposed in this pull request?
Fixed broken java code examples in streaming documentation
Attn: tdas
Author: Matthew Wise <matthew.rs.wise@gmail.com>
Closes#13388 from mawise/fix_docs_java_streaming_example.
## What changes were proposed in this pull request?
`a` -> `an`
I use regex to generate potential error lines:
`grep -in ' a [aeiou]' mllib/src/main/scala/org/apache/spark/ml/*/*scala`
and review them line by line.
## How was this patch tested?
local build
`lint-java` checking
Author: Zheng RuiFeng <ruifengz@foxmail.com>
Closes#13317 from zhengruifeng/a_an.
## What changes were proposed in this pull request?
Update the unit test code, examples, and documents to remove calls to deprecated method `dataset.registerTempTable`.
## How was this patch tested?
This PR only changes the unit test code, examples, and comments. It should be safe.
This is a follow up of PR https://github.com/apache/spark/pull/12945 which was merged.
Author: Sean Zhong <seanzhong@databricks.com>
Closes#13098 from clockfly/spark-15171-remove-deprecation.
## What changes were proposed in this pull request?
Renaming the streaming-kafka artifact to include kafka version, in anticipation of needing a different artifact for later kafka versions
## How was this patch tested?
Unit tests
Author: cody koeninger <cody@koeninger.org>
Closes#12946 from koeninger/SPARK-15085.
## What changes were proposed in this pull request?
Straggler references to Tachyon were removed:
- for docs, `tachyon` has been generalized as `off-heap memory`;
- for Mesos test suits, the key-value `tachyon:true`/`tachyon:false` has been changed to `os:centos`/`os:ubuntu`, since `os` is an example constrain used by the [Mesos official docs](http://mesos.apache.org/documentation/attributes-resources/).
## How was this patch tested?
Existing test suites.
Author: Liwei Lin <lwlin7@gmail.com>
Closes#12129 from lw-lin/tachyon-cleanup.
## What changes were proposed in this pull request?
This PR removes all docs about the old streaming-akka, streaming-zeromq, streaming-mqtt and streaming-twitter projects since I have already copied them to https://github.com/spark-packages
Also remove mqtt_wordcount.py that I forgot to remove previously.
## How was this patch tested?
Jenkins PR Build.
Author: Shixiong Zhu <shixiong@databricks.com>
Closes#11824 from zsxwing/remove-doc.
## What changes were proposed in this pull request?
I have copied the docs of Streaming Akka to https://github.com/spark-packages/dstream-akka/blob/master/README.md
So we can remove them from Spark now.
## How was this patch tested?
Only document changes.
(If this patch involves UI changes, please attach a screenshot; otherwise, remove this)
Author: Shixiong Zhu <shixiong@databricks.com>
Closes#11711 from zsxwing/remove-akka-doc.
## What changes were proposed in this pull request?
In order to make `docs/examples` (and other related code) more simple/readable/user-friendly, this PR replaces existing codes like the followings by using `diamond` operator.
```
- final ArrayList<Product2<Object, Object>> dataToWrite =
- new ArrayList<Product2<Object, Object>>();
+ final ArrayList<Product2<Object, Object>> dataToWrite = new ArrayList<>();
```
Java 7 or higher supports **diamond** operator which replaces the type arguments required to invoke the constructor of a generic class with an empty set of type parameters (<>). Currently, Spark Java code use mixed usage of this.
## How was this patch tested?
Manual.
Pass the existing tests.
Author: Dongjoon Hyun <dongjoon@apache.org>
Closes#11541 from dongjoon-hyun/SPARK-13702.
## What changes were proposed in this pull request?
The reference to StatefulNetworkWordCount.scala from updateStatesByKey documentation should be removed, till there is a example for updateStatesByKey.
## How was this patch tested?
Have tested the new documentation with jekyll build.
Author: rmishra <rmishra@pivotal.io>
Closes#11545 from rishitesh/SPARK-13705.
## What changes were proposed in this pull request?
This PR tries to fix all typos in all markdown files under `docs` module,
and fixes similar typos in other comments, too.
## How was the this patch tested?
manual tests.
Author: Dongjoon Hyun <dongjoon@apache.org>
Closes#11300 from dongjoon-hyun/minor_fix_typos.
Clarify that reduce functions need to be commutative, and fold functions do not
See https://github.com/apache/spark/pull/11091
Author: Sean Owen <sowen@cloudera.com>
Closes#11217 from srowen/SPARK-13339.