## What changes were proposed in this pull request?
Some PySpark examples need a SparkContext and get it by accessing the session's internal `_sc` attribute directly. These examples should use the public `sparkContext` property of `SparkSession` instead.
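A minimal sketch of the preferred pattern (the app name and the tiny job are illustrative):
```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("SparkContextFromSession").getOrCreate()

# Use the public property instead of reaching into the internal `_sc` attribute.
sc = spark.sparkContext
print(sc.parallelize(range(10)).count())

spark.stop()
```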
## How was this patch tested?
Ran modified examples
Author: Bryan Cutler <cutlerb@gmail.com>
Closes#13303 from BryanCutler/pyspark-session-sparkContext-MINOR.
## What changes were proposed in this pull request?
Use `SparkSession` according to [SPARK-15031](https://issues.apache.org/jira/browse/SPARK-15031)
`MLlib` (the RDD-based API) is no longer recommended, so the `MLlib` examples are skipped in this PR.
A `StreamingContext` cannot be obtained directly from a `SparkSession`, so the Streaming examples are skipped as well.
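As a quick illustration of the point above, a `StreamingContext` still has to be built from the SparkContext exposed by the session rather than obtained from the `SparkSession` itself (app name and batch interval here are arbitrary):
```python
from pyspark.sql import SparkSession
from pyspark.streaming import StreamingContext

spark = SparkSession.builder.appName("StreamingFromSession").getOrCreate()

# No SparkSession API returns a StreamingContext directly; construct it from the SparkContext.
ssc = StreamingContext(spark.sparkContext, 1)  # 1-second batch interval
```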
cc andrewor14
## How was this patch tested?
manual tests with spark-submit
Author: Zheng RuiFeng <ruifengz@foxmail.com>
Closes#13164 from zhengruifeng/use_sparksession_ii.
## What changes were proposed in this pull request?
Update example code in examples/src/main/r/ml.R to reflect the new algorithms.
* spark.glm and glm
* spark.survreg
* spark.naiveBayes
* spark.kmeans
## How was this patch tested?
Offline test.
Author: Yanbo Liang <ybliang8@gmail.com>
Closes#13000 from yanboliang/spark-15222.
## What changes were proposed in this pull request?
The RDD-based MLlib API is no longer recommended, and some of its methods are even deprecated.
Update the warning message to recommend using ML instead.
```scala
def showWarning() {
  System.err.println(
    """WARN: This is a naive implementation of Logistic Regression and is given as an example!
      |Please use either org.apache.spark.mllib.classification.LogisticRegressionWithSGD or
      |org.apache.spark.mllib.classification.LogisticRegressionWithLBFGS
      |for more conventional use.
    """.stripMargin)
}
```
To
```scala
def showWarning() {
  System.err.println(
    """WARN: This is a naive implementation of Logistic Regression and is given as an example!
      |Please use org.apache.spark.ml.classification.LogisticRegression
      |for more conventional use.
    """.stripMargin)
}
```
## How was this patch tested?
local build
Author: Zheng RuiFeng <ruifengz@foxmail.com>
Closes#13190 from zhengruifeng/update_recd.
## What changes were proposed in this pull request?
In this DataFrame example, we use `VectorImplicits._`, which is a private API. Since the `Vectors` object has a public API, we use `Vectors.fromML` instead of the implicits.
## How was this patch tested?
Manually run the example.
Author: wm624@hotmail.com <wm624@hotmail.com>
Closes#13213 from wangmiao1981/ml.
## What changes were proposed in this pull request?
Refactor all Java tests that use SparkSession to extend SharedSparkSession.
## How was this patch tested?
Existing Tests
Author: Sandeep Singh <sandeep@techaddict.me>
Closes#13101 from techaddict/SPARK-15296.
## What changes were proposed in this pull request?
It seems most of the Python examples were changed to use SparkSession by https://github.com/apache/spark/pull/12809. That PR noted that both examples below:
- `simple_params_example.py`
- `aft_survival_regression.py`
were left unchanged because they did not work. It seems `aft_survival_regression.py` was later updated by https://github.com/apache/spark/pull/13050, but `simple_params_example.py` was not.
This PR corrects the example and makes it use SparkSession.
In more detail, it seems `threshold` was replaced by `thresholds` here and there by 5a23213c14. However, when `lr.fit(training, paramMap)` is called, this overwrites the values. So, `threshold` was 5 and `thresholds` becomes 5.5 (by `1 / (1 + thresholds(0) / thresholds(1))`).
According to the comment at 354f8f11bd/mllib/src/main/scala/org/apache/spark/ml/classification/LogisticRegression.scala (L58-L61), this is not allowed.
So, in this PR, the example sets an equivalent value so that it does not throw an exception.
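To make the constraint concrete, a tiny sketch of the documented equivalence between the two params (the numbers are illustrative, not taken from the example):
```python
# For binary classification, Spark ML documents:
#   threshold = 1 / (1 + thresholds[0] / thresholds[1])
# so both params must be set to consistent values.
thresholds = [0.45, 0.55]                                # illustrative pair
threshold = 1.0 / (1.0 + thresholds[0] / thresholds[1])
print(threshold)                                         # 0.55, consistent with the pair above
```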
## How was this patch tested?
Manually (`mvn package -DskipTests && spark-submit simple_params_example.py`)
Author: hyukjinkwon <gurwls223@gmail.com>
Closes#13135 from HyukjinKwon/SPARK-15031.
## What changes were proposed in this pull request?
I used IntelliJ IDEA to search for usages of the deprecated `SparkContext.accumulator` in the whole Spark project and updated the code (except the test code for the accumulator method itself).
## How was this patch tested?
Existing unit tests
Author: WeichenXu <WeichenXu123@outlook.com>
Closes#13112 from WeichenXu123/update_accuV2_in_mllib.
## What changes were proposed in this pull request?
Update the unit test code, examples, and documents to remove calls to the deprecated method `dataset.registerTempTable`.
## How was this patch tested?
This PR only changes the unit test code, examples, and comments. It should be safe.
This is a follow up of PR https://github.com/apache/spark/pull/12945 which was merged.
Author: Sean Zhong <seanzhong@databricks.com>
Closes#13098 from clockfly/spark-15171-remove-deprecation.
## What changes were proposed in this pull request?
Once SPARK-14487 and SPARK-14549 are merged, we will migrate to using the new vector and matrix types in the new ML pipeline-based APIs.
## How was this patch tested?
Unit tests
Author: DB Tsai <dbt@netflix.com>
Author: Liang-Chi Hsieh <simonh@tw.ibm.com>
Author: Xiangrui Meng <meng@databricks.com>
Closes#12627 from dbtsai/SPARK-14615-NewML.
## What changes were proposed in this pull request?
Copying & pasting the example in ml-collaborative-filtering.html into spark-shell, we see the following errors:
```
scala> case class Rating(userId: Int, movieId: Int, rating: Float, timestamp: Long)
defined class Rating

scala> object Rating {
     |   def parseRating(str: String): Rating = {
     |     val fields = str.split("::")
     |     assert(fields.size == 4)
     |     Rating(fields(0).toInt, fields(1).toInt, fields(2).toFloat, fields(3).toLong)
     |   }
     | }
<console>:29: error: Rating.type does not take parameters
       Rating(fields(0).toInt, fields(1).toInt, fields(2).toFloat, fields(3).toLong)
       ^
```
The standard Scala REPL gives the same error. The Scala/spark-shell REPL has some quirks (e.g. packages are also not well supported).
The reason for the error is that the Scala/spark-shell REPL discards previous definitions when we define an object with the same name as an existing class. Solution: rename the `Rating` object.
## How was this patch tested?
Manually tested:
1. `./bin/run-example ALSExample`
2. Copy & paste the example in the generated documentation. It works fine.
Author: wm624@hotmail.com <wm624@hotmail.com>
Closes#13110 from wangmiao1981/repl.
## What changes were proposed in this pull request?
Add a user guide section and examples for GaussianMixture in spark.ml, in Java, Scala, and Python.
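A minimal Python sketch of the kind of example added (the data path follows the sample data shipped with Spark; k and the printed fields are illustrative):
```python
from pyspark.sql import SparkSession
from pyspark.ml.clustering import GaussianMixture

spark = SparkSession.builder.appName("GaussianMixtureExample").getOrCreate()

dataset = spark.read.format("libsvm").load("data/mllib/sample_kmeans_data.txt")

gmm = GaussianMixture(k=2)
model = gmm.fit(dataset)

print("Mixture weights: " + str(model.weights))
model.gaussiansDF.show(truncate=False)  # per-component mean and covariance

spark.stop()
```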
## How was this patch tested?
Manual compile and test all examples
Author: wm624@hotmail.com <wm624@hotmail.com>
Closes#12788 from wangmiao1981/example.
## What changes were proposed in this pull request?
Add Scala/Java/Python examples for `GeneralizedLinearRegression`.
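A hedged Python sketch of such an example (parameter values are illustrative; the data file is the sample shipped with Spark):
```python
from pyspark.sql import SparkSession
from pyspark.ml.regression import GeneralizedLinearRegression

spark = SparkSession.builder.appName("GeneralizedLinearRegressionExample").getOrCreate()

dataset = spark.read.format("libsvm").load("data/mllib/sample_linear_regression_data.txt")

glr = GeneralizedLinearRegression(family="gaussian", link="identity",
                                  maxIter=10, regParam=0.3)
model = glr.fit(dataset)

print("Coefficients: " + str(model.coefficients))
print("Intercept: " + str(model.intercept))

spark.stop()
```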
## How was this patch tested?
They are examples and have been tested offline.
Author: Yanbo Liang <ybliang8@gmail.com>
Closes#12754 from yanboliang/spark-14979.
## What changes were proposed in this pull request?
Deprecates `registerTempTable` and adds `dataset.createTempView` and `dataset.createOrReplaceTempView`.
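A short Python sketch of the replacement pattern (the DataFrame contents and view name are arbitrary):
```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("TempViewExample").getOrCreate()

df = spark.createDataFrame([(1, "a"), (2, "b")], ["id", "value"])

# Instead of the deprecated df.registerTempTable("records"):
df.createOrReplaceTempView("records")
spark.sql("SELECT id, value FROM records WHERE id = 1").show()

spark.stop()
```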
## How was this patch tested?
Unit tests.
Author: Sean Zhong <seanzhong@databricks.com>
Closes#12945 from clockfly/spark-15171.
## What changes were proposed in this pull request?
1. Use `SparkSession` according to [SPARK-15031](https://issues.apache.org/jira/browse/SPARK-15031).
2. Update indentation for `SparkContext` according to [SPARK-15134](https://issues.apache.org/jira/browse/SPARK-15134).
3. Also remove some duplicate spaces and add missing periods.
## How was this patch tested?
manual tests
Author: Zheng RuiFeng <ruifengz@foxmail.com>
Closes#13050 from zhengruifeng/use_sparksession.
## What changes were proposed in this pull request?
Rename the streaming-kafka artifact to include the Kafka version, in anticipation of needing a different artifact for later Kafka versions.
## How was this patch tested?
Unit tests
Author: cody koeninger <cody@koeninger.org>
Closes#12946 from koeninger/SPARK-15085.
## What changes were proposed in this pull request?
This fixes compile errors.
## How was this patch tested?
Pass the Jenkins tests.
Author: Dongjoon Hyun <dongjoon@apache.org>
Closes#13053 from dongjoon-hyun/hotfix_sqlquerysuite.
## What changes were proposed in this pull request?
1. Create a libsvm-format dataset for LDA: `data/mllib/sample_lda_libsvm_data.txt`.
2. Add a Python example (see the sketch below).
3. Read the data file directly in the examples.
4. Also change `aft_survival_regression.py` to use `SparkSession`.
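A minimal sketch of what the Python LDA example looks like (hyperparameters and printed output are illustrative):
```python
from pyspark.sql import SparkSession
from pyspark.ml.clustering import LDA

spark = SparkSession.builder.appName("LDAExample").getOrCreate()

# Read the libsvm-format LDA sample data added by this change.
dataset = spark.read.format("libsvm").load("data/mllib/sample_lda_libsvm_data.txt")

lda = LDA(k=10, maxIter=10)
model = lda.fit(dataset)

print("Log likelihood: " + str(model.logLikelihood(dataset)))
model.describeTopics(3).show(truncate=False)

spark.stop()
```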
## How was this patch tested?
manual tests
`./bin/spark-submit examples/src/main/python/ml/lda_example.py`
Author: Zheng RuiFeng <ruifengz@foxmail.com>
Closes#12927 from zhengruifeng/lda_pe.
## What changes were proposed in this pull request?
A Python example for ml.kmeans already exists but is not included in the user guide.
1. Small changes, such as adding `example_on` / `example_off` markers.
2. Add it to the user guide (see the sketch below).
3. Update the examples to read the data file directly.
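A hedged sketch of the Python KMeans example referenced above (k and seed are illustrative; the data file is the sample shipped with Spark):
```python
from pyspark.sql import SparkSession
from pyspark.ml.clustering import KMeans

spark = SparkSession.builder.appName("KMeansExample").getOrCreate()

dataset = spark.read.format("libsvm").load("data/mllib/sample_kmeans_data.txt")

kmeans = KMeans(k=2, seed=1)
model = kmeans.fit(dataset)

print("Cluster centers:")
for center in model.clusterCenters():
    print(center)

spark.stop()
```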
## How was this patch tested?
manual tests
`./bin/spark-submit examples/src/main/python/ml/kmeans_example.py`
Author: Zheng RuiFeng <ruifengz@foxmail.com>
Closes#12925 from zhengruifeng/km_pe.
## What changes were proposed in this pull request?
1. Add BisectingKMeans to ml-clustering.md.
2. Add the missing Scala BisectingKMeansExample.
3. Create a new data file `data/mllib/sample_kmeans_data.txt`.
## How was this patch tested?
manual tests
Author: Zheng RuiFeng <ruifengz@foxmail.com>
Closes#11844 from zhengruifeng/doc_bkm.
## What changes were proposed in this pull request?
1. Add a Python example for OneVsRest (see the sketch below).
2. Remove argument parsing.
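A minimal sketch of the Python OneVsRest pattern (the base classifier settings and split are illustrative; the data file is the sample shipped with Spark):
```python
from pyspark.sql import SparkSession
from pyspark.ml.classification import LogisticRegression, OneVsRest

spark = SparkSession.builder.appName("OneVsRestExample").getOrCreate()

data = spark.read.format("libsvm").load("data/mllib/sample_multiclass_classification_data.txt")
train, test = data.randomSplit([0.8, 0.2], seed=42)

# OneVsRest wraps a binary classifier to handle multiclass labels.
lr = LogisticRegression(maxIter=10, regParam=0.01)
ovr = OneVsRest(classifier=lr)
model = ovr.fit(train)

model.transform(test).select("label", "prediction").show(5)

spark.stop()
```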
## How was this patch tested?
manual tests
`./bin/spark-submit examples/src/main/python/ml/one_vs_rest_example.py`
Author: Zheng RuiFeng <ruifengz@foxmail.com>
Closes#12920 from zhengruifeng/ovr_pe.
This PR removes `sqlContext` in examples. All actual usages were replaced in https://github.com/apache/spark/pull/12809, but some remain in comments.
Manual style checking.
Author: hyukjinkwon <gurwls223@gmail.com>
Closes#13006 from HyukjinKwon/minor-docs.
## What changes were proposed in this pull request?
* Since Spark now supports a native CSV reader, it is not necessary to use the third-party `spark-csv` in `examples/src/main/r/data-manipulation.R`. Meanwhile, remove all `spark-csv` usage in SparkR.
* Running R applications through `sparkR` is not supported as of Spark 2.0, so we change to using `./bin/spark-submit` to run the example.
## How was this patch tested?
Offline test.
Author: Yanbo Liang <ybliang8@gmail.com>
Closes#13005 from yanboliang/r-df-examples.
Cleans up ALS examples by removing unnecessary casts to double for `rating` and `prediction` columns, since `RegressionEvaluator` now supports `Double` & `Float` input types.
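For illustration, a trimmed Python sketch of the evaluation step, where the `rating` and `prediction` columns can now be passed to `RegressionEvaluator` without casting (the toy ratings and parameters are made up):
```python
from pyspark.sql import SparkSession
from pyspark.ml.recommendation import ALS
from pyspark.ml.evaluation import RegressionEvaluator

spark = SparkSession.builder.appName("ALSExample").getOrCreate()

ratings = spark.createDataFrame(
    [(0, 0, 4.0), (0, 1, 2.0), (1, 1, 3.0), (1, 2, 4.0), (2, 1, 1.0), (2, 2, 5.0)],
    ["userId", "movieId", "rating"])

als = ALS(maxIter=5, regParam=0.01, userCol="userId", itemCol="movieId", ratingCol="rating")
model = als.fit(ratings)

# No explicit cast of `rating` / `prediction` to double is needed any more.
predictions = model.transform(ratings)
evaluator = RegressionEvaluator(metricName="rmse", labelCol="rating", predictionCol="prediction")
print("RMSE = " + str(evaluator.evaluate(predictions)))

spark.stop()
```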
## How was this patch tested?
Manual compile and run with `run-example ml.ALSExample` and `spark-submit examples/src/main/python/ml/als_example.py`.
Author: Nick Pentreath <nickp@za.ibm.com>
Closes#12892 from MLnick/als-examples-cleanup.
## What changes were proposed in this pull request?
Add the missing python example for QuantileDiscretizer
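A small sketch of what the Python QuantileDiscretizer example does (column names, values, and bucket count are illustrative):
```python
from pyspark.sql import SparkSession
from pyspark.ml.feature import QuantileDiscretizer

spark = SparkSession.builder.appName("QuantileDiscretizerExample").getOrCreate()

df = spark.createDataFrame(
    [(0, 18.0), (1, 19.0), (2, 8.0), (3, 5.0), (4, 2.2)], ["id", "hour"])

discretizer = QuantileDiscretizer(numBuckets=3, inputCol="hour", outputCol="result")
discretizer.fit(df).transform(df).show()

spark.stop()
```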
## How was this patch tested?
manual tests
Author: Zheng RuiFeng <ruifengz@foxmail.com>
Closes#12281 from zhengruifeng/discret_pe.
## What changes were proposed in this pull request?
This issue addresses the comments in SPARK-15031 and also fixes Java linter errors.
- Use multiline format in SparkSession builder patterns.
- Update `binary_classification_metrics_example.py` to use `SparkSession`.
- Fix Java Linter errors (in SPARK-13745, SPARK-15031, and so far)
## How was this patch tested?
Passed the Jenkins tests and ran `dev/lint-java` manually.
Author: Dongjoon Hyun <dongjoon@apache.org>
Closes#12911 from dongjoon-hyun/SPARK-15134.
## What changes were proposed in this pull request?
Remove the `withHiveSupport` method of `SparkSession`; use `enableHiveSupport` instead.
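A one-line sketch of the replacement (the app name is arbitrary):
```python
from pyspark.sql import SparkSession

# Instead of the removed withHiveSupport helper, enable Hive support on the builder.
spark = SparkSession.builder \
    .appName("HiveEnabledSession") \
    .enableHiveSupport() \
    .getOrCreate()
```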
## How was this patch tested?
ran tests locally
Author: Sandeep Singh <sandeep@techaddict.me>
Closes#12851 from techaddict/SPARK-15072.
## What changes were proposed in this pull request?
This PR aims to update Scala/Python/Java examples by replacing `SQLContext` with newly added `SparkSession`.
- Use the **SparkSession Builder Pattern** in 154 (Scala 55, Java 52, Python 47) files.
- Add `getConf` to the Python SparkContext class: `python/pyspark/context.py`
- Replace the **SQLContext Singleton Pattern** with the **SparkSession Singleton Pattern** (see the sketch after this list):
  - `SqlNetworkWordCount.scala`
  - `JavaSqlNetworkWordCount.java`
  - `sql_network_wordcount.py`
Now, `SQLContext` is used only in the R examples and the following two Python examples. The Python examples are untouched in this PR since they already fail with some unknown issue.
- `simple_params_example.py`
- `aft_survival_regression.py`
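A hedged sketch of the SparkSession singleton pattern mentioned above, as it might look in the Python streaming example (the helper name and the module-level cache are assumptions):
```python
from pyspark.sql import SparkSession


def get_spark_session_instance(spark_conf):
    """Lazily create (or reuse) a single SparkSession, e.g. inside foreachRDD calls."""
    if "spark_session_singleton" not in globals():
        globals()["spark_session_singleton"] = SparkSession \
            .builder \
            .config(conf=spark_conf) \
            .getOrCreate()
    return globals()["spark_session_singleton"]
```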
## How was this patch tested?
Manual.
Author: Dongjoon Hyun <dongjoon@apache.org>
Closes#12809 from dongjoon-hyun/SPARK-15031.
## What changes were proposed in this pull request?
Add Python 3 compatibility to the Python examples.
## How was this patch tested?
manual tests
Author: Zheng RuiFeng <ruifengz@foxmail.com>
Closes#12868 from zhengruifeng/fix_gmm_py.
## What changes were proposed in this pull request?
This is a Python port of the corresponding Scala builder-pattern code. `sql.py` is modified as a target example case.
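A minimal sketch of the multiline builder pattern as it would be used in a Python example (the app name and config key are illustrative):
```python
from pyspark.sql import SparkSession

spark = SparkSession \
    .builder \
    .appName("PythonSQL") \
    .config("spark.some.config.option", "some-value") \
    .getOrCreate()

spark.range(5).show()

spark.stop()
```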
## How was this patch tested?
Manual.
Author: Dongjoon Hyun <dongjoon@apache.org>
Closes#12860 from dongjoon-hyun/SPARK-15084.
## What changes were proposed in this pull request?
Users should use the builder pattern instead.
## How was this patch tested?
Jenkins.
Author: Andrew Or <andrew@databricks.com>
Closes#12873 from andrewor14/spark-session-constructor.
## What changes were proposed in this pull request?
Now that `SQLContext` is kept only for backward compatibility, we had better use `SparkSession` in the Spark 2.0 examples.
## How was this patch tested?
It's just an example change. After building, run `bin/run-example org.apache.spark.examples.sql.RDDRelation`.
Author: Dongjoon Hyun <dongjoon@apache.org>
Closes#12808 from dongjoon-hyun/rddrelation.
## What changes were proposed in this pull request?
In the spark.ml documentation, the LogisticRegression Scala example uses `sqlCtx`. This is inconsistent with the Java and Python examples, which use `sqlContext`. In addition, a user can't copy & paste the example to run it in spark-shell, because `sqlCtx` doesn't exist in spark-shell while `sqlContext` does.
Change the Scala example referred to by the spark.ml documentation.
## How was this patch tested?
Compiled the example Scala file; it passes compilation.
Author: wm624@hotmail.com <wm624@hotmail.com>
Closes#12717 from wangmiao1981/doc.
Spark's published POMs are supposed to be flattened and not contain variable substitution (see SPARK-3812), but the dummy dependency that was required for this was accidentally removed. We should re-introduce this dependency in order to fix an issue where the un-flattened POMs cause the wrong dependencies to be included in Scala 2.10 published POMs.
Author: Josh Rosen <joshrosen@databricks.com>
Closes#12706 from JoshRosen/SPARK-14925-published-poms-should-be-flattened.
## What changes were proposed in this pull request?
Add the missing python example for VectorSlicer
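A small sketch of the Python VectorSlicer usage (feature values and indices are illustrative):
```python
from pyspark.sql import SparkSession
from pyspark.ml.feature import VectorSlicer
from pyspark.ml.linalg import Vectors

spark = SparkSession.builder.appName("VectorSlicerExample").getOrCreate()

df = spark.createDataFrame(
    [(Vectors.dense([-2.0, 2.3, 0.0]),), (Vectors.dense([0.0, 1.5, 0.5]),)],
    ["userFeatures"])

slicer = VectorSlicer(inputCol="userFeatures", outputCol="features", indices=[1, 2])
slicer.transform(df).show(truncate=False)

spark.stop()
```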
## How was this patch tested?
manual tests
Author: Zheng RuiFeng <ruifengz@foxmail.com>
Closes#12282 from zhengruifeng/vecslicer_pe.
## What changes were proposed in this pull request?
Use `Long.parseLong`, which returns a primitive.
Using a series of `append()` calls avoids creating an extra `StringBuilder` instance.
## How was this patch tested?
Unit tests
Author: Azeem Jiva <azeemj@gmail.com>
Closes#12520 from javawithjiva/minor.
## What changes were proposed in this pull request?
This removes the class `HiveContext` itself along with all code usages associated with it. The bulk of the work was already done in #12485. This is mainly just code cleanup and actually removing the class.
Note: A couple of things will break after this patch. These will be fixed separately.
- the python HiveContext
- all the documentation / comments referencing HiveContext
- there will be no more HiveContext in the REPL (fixed by #12589)
## How was this patch tested?
No change in functionality.
Author: Andrew Or <andrew@databricks.com>
Closes#12585 from andrewor14/delete-hive-context.
First, make all dependencies in the examples module `provided`, and explicitly
list a couple that somehow get promoted to `compile` by Maven. This
means that to run the streaming examples, the streaming connector package needs
to be provided to run-example using --packages or --jars, just like regular
apps.
Also, remove a couple of outdated examples. HBase has had Spark bindings for
a while and is even including them in the HBase distribution in the next
version, making the examples obsolete. The same applies to Cassandra, which
seems to have a proper Spark binding library already.
I just tested the build, which passes, and ran SparkPi. The examples jars
directory now has only two jars:
```
$ ls -1 examples/target/scala-2.11/jars/
scopt_2.11-3.3.0.jar
spark-examples_2.11-2.0.0-SNAPSHOT.jar
```
Author: Marcelo Vanzin <vanzin@cloudera.com>
Closes#12544 from vanzin/SPARK-14744.
## What changes were proposed in this pull request?
This issue aims to fix some errors in R examples and make them up-to-date in docs and example modules.
- Remove the wrong usage of `map`. We need to use `lapply` in `sparkR` if needed. However, `lapply` is private so far. The corrected example will be added later.
- Fix the wrong example in Section `Generic Load/Save Functions` of `docs/sql-programming-guide.md` for consistency
- Fix datatypes in `sparkr.md`.
- Update a data result in `sparkr.md`.
- Replace deprecated functions to remove warnings: jsonFile -> read.json, parquetFile -> read.parquet
- Use up-to-date R-like functions: loadDF -> read.df, saveDF -> write.df, saveAsParquetFile -> write.parquet
- Replace `SparkR DataFrame` with `SparkDataFrame` in `dataframe.R` and `data-manipulation.R`.
- Other minor syntax fixes and a typo.
## How was this patch tested?
Manual.
Author: Dongjoon Hyun <dongjoon@apache.org>
Closes#12649 from dongjoon-hyun/SPARK-14883.
## What changes were proposed in this pull request?
Java `sampleByKey` methods should accept `Map` with `java.lang.Double` values
## How was this patch tested?
Existing (updated) Jenkins tests
Author: Sean Owen <sowen@cloudera.com>
Closes#12637 from srowen/SPARK-14873.
## What changes were proposed in this pull request?
`JavaStreamingContext.awaitTermination` methods should be declared as `throws[InterruptedException]` so that this exception can be handled in Java code. Note this is not just a doc change, but an API change, since now (in Java) the method has a checked exception to handle. All await-like methods in Java APIs behave this way, so seems worthwhile for 2.0.
## How was this patch tested?
Jenkins tests
Author: Sean Owen <sowen@cloudera.com>
Closes#12418 from srowen/SPARK-8393.
## What changes were proposed in this pull request?
Currently, the docs for TF-IDF only refer to using HashingTF with IDF. However, CountVectorizer can also be used. We should probably amend the user guide and examples to show this.
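For illustration, a hedged Python sketch of TF-IDF built with CountVectorizer instead of HashingTF (the tiny corpus is made up):
```python
from pyspark.sql import SparkSession
from pyspark.ml.feature import CountVectorizer, IDF, Tokenizer

spark = SparkSession.builder.appName("TfIdfWithCountVectorizer").getOrCreate()

sentences = spark.createDataFrame(
    [(0.0, "hi i heard about spark"), (1.0, "i wish java could use case classes")],
    ["label", "sentence"])

words = Tokenizer(inputCol="sentence", outputCol="words").transform(sentences)

# CountVectorizer learns a vocabulary from the corpus, unlike HashingTF's fixed hashing.
cv_model = CountVectorizer(inputCol="words", outputCol="rawFeatures").fit(words)
featurized = cv_model.transform(words)

idf_model = IDF(inputCol="rawFeatures", outputCol="features").fit(featurized)
idf_model.transform(featurized).select("label", "features").show(truncate=False)

spark.stop()
```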
## How was this patch tested?
unit tests and doc generation
Author: Yuhao Yang <hhbyyh@gmail.com>
Closes#12454 from hhbyyh/tfdoc.
## What changes were proposed in this pull request?
Move the spark-examples.jar from being in examples/target to examples/target/scala-2.11/jars
## How was this patch tested?
Built distribution to make sure examples jar was being included in the tarball.
Ran run-example to make sure examples were run.
Author: Mark Grover <mark@apache.org>
Closes#12476 from markgrover/spark-14711.
## What changes were proposed in this pull request?
Add the missing python example for ChiSqSelector
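A minimal sketch of the Python ChiSqSelector usage (data values and parameters are illustrative):
```python
from pyspark.sql import SparkSession
from pyspark.ml.feature import ChiSqSelector
from pyspark.ml.linalg import Vectors

spark = SparkSession.builder.appName("ChiSqSelectorExample").getOrCreate()

df = spark.createDataFrame(
    [(Vectors.dense([0.0, 0.0, 18.0, 1.0]), 1.0),
     (Vectors.dense([0.0, 1.0, 12.0, 0.0]), 0.0),
     (Vectors.dense([1.0, 0.0, 15.0, 0.1]), 0.0)],
    ["features", "clicked"])

selector = ChiSqSelector(numTopFeatures=1, featuresCol="features",
                         outputCol="selectedFeatures", labelCol="clicked")
selector.fit(df).transform(df).show()

spark.stop()
```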
## How was this patch tested?
manual tests
Author: Zheng RuiFeng <ruifengz@foxmail.com>
Closes#12283 from zhengruifeng/chi2_pe.
## What changes were proposed in this pull request?
https://issues.apache.org/jira/browse/SPARK-14299
Delete duplications in scala/examples/ml.
TrainValidationSplitExample.scala --> ModelSelectionViaTrainValidationSplitExample
CrossValidatorExample.scala --> ModelSelectionViaCrossValidationExample
## How was this patch tested?
Existing tests passed.
Author: Xusen Yin <yinxusen@gmail.com>
Closes#12366 from yinxusen/SPARK-14299-2.
## What changes were proposed in this pull request?
This PR removes
- Inappropriate type notations
For example, from
```scala
words.foreachRDD { (rdd: RDD[String], time: Time) =>
...
```
to
```scala
words.foreachRDD { (rdd, time) =>
...
```
- Extra anonymous closures within functional transformations.
For example,
```scala
.map(item => {
...
})
```
which can simply be written as below:
```scala
.map { item =>
...
}
```
and corrects some obvious style nits.
## How was this patch tested?
This was tested after adding rules to `scalastyle-config.xml`, which ended up not catching everything perfectly.
The rules applied were below:
- For the first correction,
```xml
<check customId="NoExtraClosure" level="error" class="org.scalastyle.file.RegexChecker" enabled="true">
<parameters><parameter name="regex">(?m)\.[a-zA-Z_][a-zA-Z0-9]*\(\s*[^,]+s*=>\s*\{[^\}]+\}\s*\)</parameter></parameters>
</check>
```
```xml
<check customId="NoExtraClosure" level="error" class="org.scalastyle.file.RegexChecker" enabled="true">
<parameters><parameter name="regex">\.[a-zA-Z_][a-zA-Z0-9]*\s*[\{|\(]([^\n>,]+=>)?\s*\{([^()]|(?R))*\}^[,]</parameter></parameters>
</check>
```
- For the second correction
```xml
<check customId="TypeNotation" level="error" class="org.scalastyle.file.RegexChecker" enabled="true">
<parameters><parameter name="regex">\.[a-zA-Z_][a-zA-Z0-9]*\s*[\{|\(]\s*\([^):]*:R))*\}^[,]</parameter></parameters>
</check>
```
**Those rules were not added**
Author: hyukjinkwon <gurwls223@gmail.com>
Closes#12413 from HyukjinKwon/SPARK-style.
## What changes were proposed in this pull request?
This PR removes extra anonymous closures within functional transformations.
For example,
```scala
.map(item => {
...
})
```
which can simply be written as below:
```scala
.map { item =>
...
}
```
## How was this patch tested?
Related unit tests and `sbt scalastyle`.
Author: hyukjinkwon <gurwls223@gmail.com>
Closes#12382 from HyukjinKwon/minor-extra-closers.
jira: https://issues.apache.org/jira/browse/SPARK-13089
Add a section to ml-classification.md for the NaiveBayes DataFrame-based API, plus example code (using include_example to clip code from files in the examples/ folder).
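A short Python sketch of the DataFrame-based NaiveBayes API covered by that section (the split and parameters are illustrative; the data file is the sample shipped with Spark):
```python
from pyspark.sql import SparkSession
from pyspark.ml.classification import NaiveBayes

spark = SparkSession.builder.appName("NaiveBayesExample").getOrCreate()

data = spark.read.format("libsvm").load("data/mllib/sample_libsvm_data.txt")
train, test = data.randomSplit([0.7, 0.3], seed=1234)

nb = NaiveBayes(smoothing=1.0, modelType="multinomial")
model = nb.fit(train)
model.transform(test).select("label", "prediction").show(5)

spark.stop()
```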
Author: Yuhao Yang <hhbyyh@gmail.com>
Closes#11015 from hhbyyh/naiveBayesDoc.