## What changes were proposed in this pull request?
This change modifies the implementation of `DataFrameWriter.save` so that it works with the jdbc data source, and the call to `jdbc` merely delegates to `save`.
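For illustration, a minimal sketch of the two now-equivalent write paths (the URL, table name, and credentials are hypothetical):
```scala
import java.util.Properties
import org.apache.spark.sql.{DataFrame, SaveMode}

// A hypothetical DataFrame `df` written to a hypothetical JDBC table.
def writeBoth(df: DataFrame): Unit = {
  val url = "jdbc:postgresql://localhost:5432/testdb"   // assumed URL
  val props = new Properties()
  props.setProperty("user", "test")
  props.setProperty("password", "secret")

  // Existing API: DataFrameWriter.jdbc(...)
  df.write.mode(SaveMode.Append).jdbc(url, "people", props)

  // After this change, the generic save() path works for jdbc as well,
  // and jdbc(...) above simply delegates to it.
  df.write
    .format("jdbc")
    .mode(SaveMode.Append)
    .option("url", url)
    .option("dbtable", "people")
    .option("user", "test")
    .option("password", "secret")
    .save()
}
```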
## How was this patch tested?
This was tested via unit tests in JDBCWriteSuite, to which I added one new test covering this scenario.
## Additional details
rxin This seems to have been most recently touched by you and was also commented on in the JIRA.
This contribution is my original work and I license the work to the project under the project's open source license.
Author: Justin Pihony <justin.pihony@gmail.com>
Author: Justin Pihony <justin.pihony@typesafe.com>
Closes#12601 from JustinPihony/jdbc_reconciliation.
The code is equivalent, but a comprehension is usually faster than a call to `map`.
Author: Gaetan Semet <gaetan@xeberon.net>
Closes#14863 from Stibbons/map_comprehension.
## What changes were proposed in this pull request?
We propose to fix the Encoder type in the Dataset example
## How was this patch tested?
The PR will be tested with the current unit test cases
Author: CodingCat <zhunansjtu@gmail.com>
Closes#14901 from CodingCat/SPARK-17347.
## What changes were proposed in this pull request?
Allow centering / mean scaling of sparse vectors in StandardScaler, if requested. This is for compatibility with `VectorAssembler` in common usages.
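A hedged sketch of the requested behaviour (column names and data are made up); centering a vector column, which `VectorAssembler` may emit in sparse form, now works when `withMean` is explicitly enabled:
```scala
import org.apache.spark.ml.feature.{StandardScaler, VectorAssembler}
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("ScalerSketch").getOrCreate()

// Hypothetical input with mostly-zero columns.
val df = spark.createDataFrame(Seq(
  (0.0, 0.0, 3.0),
  (0.0, 5.0, 0.0),
  (2.0, 0.0, 0.0)
)).toDF("a", "b", "c")

val assembled = new VectorAssembler()
  .setInputCols(Array("a", "b", "c"))
  .setOutputCol("features")
  .transform(df)

// withMean = true previously rejected sparse vectors; with this change they are
// densified and centered when the caller asks for it.
val scaler = new StandardScaler()
  .setInputCol("features")
  .setOutputCol("scaledFeatures")
  .setWithMean(true)
  .setWithStd(true)

scaler.fit(assembled).transform(assembled).show(truncate = false)
```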
## How was this patch tested?
Jenkins tests, including new cases to reflect the new behavior.
Author: Sean Owen <sowen@cloudera.com>
Closes#14663 from srowen/SPARK-17001.
## What changes were proposed in this pull request?
As Spark 2.0.1 will be released soon (mentioned on the Spark dev mailing list), besides the critical bugs it is better to fix the code style errors before the release.
Before:
```
./dev/lint-java
Checkstyle checks failed at following occurrences:
[ERROR] src/main/java/org/apache/spark/util/collection/unsafe/sort/UnsafeExternalSorter.java:[525] (sizes) LineLength: Line is longer than 100 characters (found 119).
[ERROR] src/main/java/org/apache/spark/examples/sql/streaming/JavaStructuredNetworkWordCount.java:[64] (sizes) LineLength: Line is longer than 100 characters (found 103).
```
After:
```
./dev/lint-java
Using `mvn` from path: /usr/local/bin/mvn
Checkstyle checks passed.
```
## How was this patch tested?
Manual.
Author: Weiqing Yang <yangweiqing001@gmail.com>
Closes#14768 from Sherry302/fixjavastyle.
## What changes were proposed in this pull request?
In the R SQL example, the appName is "MyApp", while in the Scala, Java, and Python examples the appName is "x Spark SQL basic example".
I made the R example consistent with the other examples.
## How was this patch tested?
Manual test
Author: wm624@hotmail.com <wm624@hotmail.com>
Closes#14703 from wangmiao1981/example.
## What changes were proposed in this pull request?
Originally this PR was based on #14491, but I realised that fixing the examples is more sensible than fixing the comments.
This PR fixes three things below:
- Fix two wrong examples in `structured-streaming-programming-guide.md`: loading via `read.load(..)` without `as` yields a `Dataset<Row>`, not a `Dataset<String>`, in Java (see the sketch after this list).
- Fix indentation across `structured-streaming-programming-guide.md`. Python uses 4 spaces while Scala and Java use 2; these are inconsistent across the examples.
- Fix `StructuredNetworkWordCountWindowed` and `StructuredNetworkWordCount` in Java and Scala to initially load a `DataFrame`/`Dataset<Row>`, to be consistent with the comments and some examples in `structured-streaming-programming-guide.md`, and to match the Scala and Java versions to the Python one (which initially loads it as a `DataFrame`).
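As a hedged Scala illustration of the first point (the socket source and options follow the guide's word-count example, from memory):
```scala
import org.apache.spark.sql.{DataFrame, Dataset, SparkSession}

val spark = SparkSession.builder().appName("StructuredSketch").getOrCreate()
import spark.implicits._

// load() alone yields an untyped DataFrame (Dataset[Row] here, Dataset<Row> in Java) ...
val lines: DataFrame = spark.readStream
  .format("socket")
  .option("host", "localhost")
  .option("port", 9999)
  .load()

// ... and only an explicit as[String] turns it into a typed Dataset[String].
val typedLines: Dataset[String] = lines.as[String]
```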
## How was this patch tested?
N/A
Closes https://github.com/apache/spark/pull/14491
Author: hyukjinkwon <gurwls223@gmail.com>
Author: Ganesh Chand <ganeshchand@Ganeshs-MacBook-Pro-2.local>
Closes#14564 from HyukjinKwon/SPARK-16886.
## What changes were proposed in this pull request?
This PR is to fix the minor Java linter errors as follows:
[ERROR] src/main/java/org/apache/spark/sql/catalyst/expressions/VariableLengthRowBasedKeyValueBatch.java:[42,10] (modifier) RedundantModifier: Redundant 'final' modifier.
[ERROR] src/main/java/org/apache/spark/sql/catalyst/expressions/VariableLengthRowBasedKeyValueBatch.java:[97,10] (modifier) RedundantModifier: Redundant 'final' modifier.
## How was this patch tested?
Manual test.
dev/lint-java
Using `mvn` from path: /usr/local/bin/mvn
Checkstyle checks passed.
Author: Weiqing Yang <yangweiqing001@gmail.com>
Closes#14532 from Sherry302/master.
## What changes were proposed in this pull request?
Improve example outputs to better reflect the functionality being presented. This mostly consisted of modifying what is printed at the end of an example, such as calling show() with truncate=False, but sometimes required minor tweaks to the example data to get relevant output. Explicitly set parameters when they are used as part of the example. Fixed Java examples that failed to run because they used old-style MLlib Vectors or had a problem with the schema. Synced examples between the different APIs.
## How was this patch tested?
Ran each example for Scala, Python, and Java and made sure output was legible on a terminal of width 100.
Author: Bryan Cutler <cutlerb@gmail.com>
Closes#14308 from BryanCutler/ml-examples-improve-output-SPARK-16260.
## What changes were proposed in this pull request?
Modify the Java example, which is also reflected in the documentation.
## How was this patch tested?
Ran test cases.
Author: sandy <phalodi@gmail.com>
Closes#14436 from phalodi/SPARK-16816.
## What changes were proposed in this pull request?
mllib.LDAExample uses the ML pipeline and the MLlib LDA algorithm. The former transforms the original data into the ML `Vector` format, while the latter uses the MLlib `Vector` format.
## How was this patch tested?
Test manually.
Author: Xusen Yin <yinxusen@gmail.com>
Closes#14212 from yinxusen/SPARK-16558.
## What changes were proposed in this pull request?
This PR makes various minor updates to examples of all language bindings to make sure they are consistent with each other. Some typos and missing parts (JDBC example in Scala/Java/Python) are also fixed.
## How was this patch tested?
Manually tested.
Author: Cheng Lian <lian@databricks.com>
Closes#14368 from liancheng/revise-examples.
## What changes were proposed in this pull request?
Some Java examples are using mllib.linalg.Vectors instead of ml.linalg.Vectors and causes an exception when run. Also there are some Java examples that incorrectly specify data types in the schema, also causing an exception.
## How was this patch tested?
Ran corrected examples locally
Author: Bryan Cutler <cutlerb@gmail.com>
Closes#14405 from BryanCutler/java-examples-ml.Vectors-fix-SPARK-16800.
## What changes were proposed in this pull request?
Use foreach/for instead of map where the operation requires execution of the body for its side effects rather than defining a transformation
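A toy Scala sketch of the distinction (not code from the patch):
```scala
val names = Seq("alpha", "beta", "gamma")

// map builds and returns a new collection; using it only for printing allocates
// a useless Seq[Unit] and, on lazy collections, may not run at all.
val wasted: Seq[Unit] = names.map(println)

// foreach (or a for loop) is the right tool when the body is executed purely
// for its side effects and no transformed collection is needed.
names.foreach(println)
for (name <- names) {
  println(name)
}
```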
## How was this patch tested?
Jenkins
Author: Sean Owen <sowen@cloudera.com>
Closes#14332 from srowen/SPARK-16694.
This PR is based on PR #14098 authored by wangmiao1981.
## What changes were proposed in this pull request?
This PR replaces the original Python Spark SQL example file with the following three files:
- `sql/basic.py`
Demonstrates basic Spark SQL features.
- `sql/datasource.py`
Demonstrates various Spark SQL data sources.
- `sql/hive.py`
Demonstrates Spark SQL Hive interaction.
This PR also removes hard-coded Python example snippets in the SQL programming guide by extracting snippets from the above files using the `include_example` Liquid template tag.
## How was this patch tested?
Manually tested.
Author: wm624@hotmail.com <wm624@hotmail.com>
Author: Cheng Lian <lian@databricks.com>
Closes#14317 from liancheng/py-examples-update.
https://issues.apache.org/jira/browse/SPARK-16535
## What changes were proposed in this pull request?
When scanning through the pom.xml files of the sub-projects, I found the warning below (screenshot attached):
```
Definition of groupId is redundant, because it's inherited from the parent
```
![screen shot 2016-07-13 at 3 13 11 pm](https://cloud.githubusercontent.com/assets/3925641/16823121/744f893e-4916-11e6-8a52-042f83b9db4e.png)
I've tried to remove some of the lines with groupId definition, and the build on my local machine is still ok.
```
<groupId>org.apache.spark</groupId>
```
I also just found that `<maven.version>3.3.9</maven.version>` is being used in Spark 2.x, and Maven 3 supports versionless parent elements: Maven 3 removes the need to specify the parent version in sub-modules. This is great (as of Maven 3.1).
ref: http://stackoverflow.com/questions/3157240/maven-3-worth-it/3166762#3166762
## How was this patch tested?
I've tested by re-building the project, and build succeeded.
Author: Xin Ren <iamshrek@126.com>
Closes#14189 from keypointt/SPARK-16535.
## What changes were proposed in this pull request?
This PR fixes four Java linter `LineLength` errors. They are only `LineLength` errors, but we had better remove all Java linter errors before the release.
## How was this patch tested?
After passing Jenkins, ran `./dev/lint-java`.
Author: Dongjoon Hyun <dongjoon@apache.org>
Closes#14255 from dongjoon-hyun/minor_java_linter.
## What changes were proposed in this pull request?
This PR moves the last remaining hard-coded Scala example snippet from the SQL programming guide into `SparkSqlExample.scala`. It also renames all Scala/Java example files so that all "Sql" in the file names is updated to "SQL".
## How was this patch tested?
Manually verified the generated HTML page.
Author: Cheng Lian <lian@databricks.com>
Closes#14245 from liancheng/minor-scala-example-update.
## What changes were proposed in this pull request?
The second arg of the method `update()` is never used, so I deleted it.
## How was this patch tested?
local run with `./bin/spark-submit examples/src/main/python/als.py`
Author: Zheng RuiFeng <ruifengz@foxmail.com>
Closes#14247 from zhengruifeng/als_refine.
## What changes were proposed in this pull request?
Fix code style from ad hoc review of RC4 doc
## How was this patch tested?
manual
shivaram
Author: Felix Cheung <felixcheung_m@hotmail.com>
Closes#14250 from felixcheung/rdocs2rc4.
## What changes were proposed in this pull request?
Cleanup of examples, mostly from PySpark ML, to fix minor issues: unused imports, style consistency, pipeline_example being a duplicate, use of the future print function, and a spelling error.
* The "Pipeline Example" is duplicated by "Simple Text Classification Pipeline" in Scala, Python, and Java.
* "Estimator Transformer Param Example" is duplicated by "Simple Params Example" in Scala, Python and Java
* Synced random_forest_classifier_example.py with Scala by adding an IndexToString label converter
* Synced train_validation_split.py (in Scala ModelSelectionViaTrainValidationExample) by adjusting data split, adding grid for intercept.
* RegexTokenizer was doing nothing in tokenizer_example.py and JavaTokenizerExample.java, synced with Scala version
## How was this patch tested?
local tests and run modified examples
Author: Bryan Cutler <cutlerb@gmail.com>
Closes#14081 from BryanCutler/examples-cleanup-SPARK-16403.
## What changes were proposed in this pull request?
Minor example updates
## How was this patch tested?
manual
shivaram
Author: Felix Cheung <felixcheung_m@hotmail.com>
Closes#14171 from felixcheung/rexample.
- Hard-coded Spark SQL sample snippets were moved into source files under examples sub-project.
- Removed the inconsistency between Scala and Java Spark SQL examples
- Scala and Java Spark SQL examples were updated
The work is still in progress. All involved examples were tested manually. An additional round of testing will be done after the code review.
![image](https://cloud.githubusercontent.com/assets/6235869/16710314/51851606-462a-11e6-9fbe-0818daef65e4.png)
Author: aokolnychyi <okolnychyyanton@gmail.com>
Closes#14119 from aokolnychyi/spark_16303.
## What changes were proposed in this pull request?
A structured streaming example with event time windowing.
## How was this patch tested?
Run locally
Author: James Thomas <jamesjoethomas@gmail.com>
Closes#13957 from jjthomas/current.
## What changes were proposed in this pull request?
* Update the SparkR ML section to make it consistent with the SparkR API docs.
* Since #13972 added labelling support for the ```include_example``` Jekyll plugin, we can split the single ```ml.R``` example file into multiple line blocks with different labels and include them under different algorithms/models in the generated HTML page.
## How was this patch tested?
Only docs update, manually check the generated docs.
Author: Yanbo Liang <ybliang8@gmail.com>
Closes#14011 from yanboliang/r-user-guide-update.
## What changes were proposed in this pull request?
After SPARK-16476 (committed earlier today as #14128), we can finally bump the version number.
## How was this patch tested?
N/A
Author: Reynold Xin <rxin@databricks.com>
Closes#14130 from rxin/SPARK-16477.
## What changes were proposed in this pull request?
1). Remove unused import in the Scala example;
2). Move the SparkSession import outside of the example on/off block;
3). Change parameter settings to be the same as in Scala;
4). Change comments to be consistent;
5). Make sure that Scala and Python use the same data set.
I did one pass and fixed the above issues. There are missing examples in Python, which might be added later.
TODO: Some examples have comments on how to run them, but many are missing; we can add them later.
## How was this patch tested?
Manually test them
Author: wm624@hotmail.com <wm624@hotmail.com>
Closes#14021 from wangmiao1981/ann.
## What changes were proposed in this pull request?
I extracted 6 example programs from the GraphX programming guide and replaced them with the
`include_example` label.
The 6 example programs are:
- AggregateMessagesExample.scala
- SSSPExample.scala
- TriangleCountingExample.scala
- ConnectedComponentsExample.scala
- ComprehensiveExample.scala
- PageRankExample.scala
All the example code can run using
`bin/run-example graphx.EXAMPLE_NAME`
## How was this patch tested?
Manual.
Author: WeichenXu <WeichenXu123@outlook.com>
Closes#14015 from WeichenXu123/graphx_example_plugin.
## What changes were proposed in this pull request?
This PR adds labelling support for the `include_example` Jekyll plugin, so that we may split a single source file into multiple line blocks with different labels, and include them in multiple code snippets in the generated HTML page.
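For context, a hedged sketch of what labelled markers could look like in a Scala example source, extrapolated from the existing `$example on$` / `$example off$` convention (the label names and the exact labelled-marker syntax here are assumptions):
```scala
// $example on:init_session$
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("Spark SQL basic example")
  .getOrCreate()
// $example off:init_session$

// $example on:create_df$
val df = spark.read.json("examples/src/main/resources/people.json")
df.show()
// $example off:create_df$
```
Each labelled block can then be pulled into a different code snippet of the generated guide, while unlabelled markers keep their old behavior.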
## How was this patch tested?
Manually tested.
<img width="923" alt="screenshot at jun 29 19-53-21" src="https://cloud.githubusercontent.com/assets/230655/16451099/66a76db2-3e33-11e6-84fb-63104c2f0688.png">
Author: Cheng Lian <lian@databricks.com>
Closes#13972 from liancheng/include-example-with-labels.
## What changes were proposed in this pull request?
Some appNames in ML examples are incorrect, mostly in PySpark but one in Scala. This corrects the names.
## How was this patch tested?
Style, local tests
Author: Bryan Cutler <cutlerb@gmail.com>
Closes#13949 from BryanCutler/pyspark-example-appNames-fix-SPARK-16261.
## What changes were proposed in this pull request?
Network word count example for structured streaming
## How was this patch tested?
Run locally
Author: James Thomas <jamesjoethomas@gmail.com>
Author: James Thomas <jamesthomas@Jamess-MacBook-Pro.local>
Closes#13816 from jjthomas/master.
## What changes were proposed in this pull request?
We need to convert ML Vectors to the old MLlib style before performing Statistics.colStats operations on the DataFrame
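The fix itself is in the Python example; a hedged Scala sketch of the same conversion (data and column name are made up):
```scala
import org.apache.spark.ml.linalg.{Vector => MLVector, Vectors => MLVectors}
import org.apache.spark.mllib.linalg.{Vectors => OldVectors}
import org.apache.spark.mllib.stat.Statistics
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("ColStatsSketch").getOrCreate()

// A hypothetical DataFrame whose "features" column holds new-style ml.linalg vectors.
val df = spark.createDataFrame(Seq(
  Tuple1(MLVectors.dense(1.0, 2.0)),
  Tuple1(MLVectors.dense(3.0, 4.0))
)).toDF("features")

// Statistics.colStats still expects the old mllib.linalg.Vector type,
// so convert each ML vector back before calling it.
val oldVectors = df.select("features").rdd
  .map(row => OldVectors.fromML(row.getAs[MLVector]("features")))

val summary = Statistics.colStats(oldVectors)
println(summary.mean)
```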
## How was this patch tested?
Ran example, local tests
Author: Bryan Cutler <cutlerb@gmail.com>
Closes#13928 from BryanCutler/pyspark-ml-example-vector-conv-SPARK-16231.
## What changes were proposed in this pull request?
reduce the denominator of SparkPi by 1
## How was this patch tested?
integration tests
Author: 杨浩 <yanghaogn@163.com>
Closes#13910 from yanghaogn/patch-1.
## What changes were proposed in this pull request?
Made changes to HashingTF, QuantileVectorizer and CountVectorizer
Author: GayathriMurali <gayathri.m@intel.com>
Closes#13745 from GayathriMurali/SPARK-15997.
## What changes were proposed in this pull request?
roxygen2 doc, programming guide, example updates
## How was this patch tested?
manual checks
shivaram
Author: Felix Cheung <felixcheung_m@hotmail.com>
Closes#13751 from felixcheung/rsparksessiondoc.
## What changes were proposed in this pull request?
Make user guide changes to SparkR documentation for all changes that happened in 2.0 to Machine Learning APIs
Author: GayathriMurali <gayathri.m@intel.com>
Closes#13285 from GayathriMurali/SPARK-15129.
## What changes were proposed in this pull request?
- add ml doc for ml isotonic regression
- add scala example for ml isotonic regression
- add java example for ml isotonic regression
- add python example for ml isotonic regression
- modify scala example for mllib isotonic regression
- modify java example for mllib isotonic regression
- modify python example for mllib isotonic regression
- add data/mllib/sample_isotonic_regression_libsvm_data.txt
- delete data/mllib/sample_isotonic_regression_data.txt
## How was this patch tested?
N/A
Author: WeichenXu <WeichenXu123@outlook.com>
Closes#13381 from WeichenXu123/add_isotonic_regression_doc.
## What changes were proposed in this pull request?
Currently, R examples (`dataframe.R` and `data-manipulation.R`) fail like the following. We had better update them before releasing 2.0 RC. This PR updates them to use up-to-date APIs.
```bash
$ bin/spark-submit examples/src/main/r/dataframe.R
...
Warning message:
'createDataFrame(sqlContext...)' is deprecated.
Use 'createDataFrame(data, schema = NULL, samplingRatio = 1.0)' instead.
See help("Deprecated")
...
Warning message:
'read.json(sqlContext...)' is deprecated.
Use 'read.json(path)' instead.
See help("Deprecated")
...
Error: could not find function "registerTempTable"
Execution halted
```
## How was this patch tested?
Manual.
```
curl -LO http://s3-us-west-2.amazonaws.com/sparkr-data/flights.csv
bin/spark-submit examples/src/main/r/dataframe.R
bin/spark-submit examples/src/main/r/data-manipulation.R flights.csv
```
Author: Dongjoon Hyun <dongjoon@apache.org>
Closes#13714 from dongjoon-hyun/SPARK-15996.
## What changes were proposed in this pull request?
- Deprecate old Java accumulator API; should use Scala now
- Update Java tests and examples
- Don't bother testing old accumulator API in Java 8 (too)
- (fix a misspelling too)
## How was this patch tested?
Jenkins tests
Author: Sean Owen <sowen@cloudera.com>
Closes#13606 from srowen/SPARK-15086.
## What changes were proposed in this pull request?
This PR fixes Python examples to use the new ML Vector and Matrix APIs in the ML pipeline based algorithms.
I first executed this shell command, `grep -r "from pyspark.mllib" .`, and then ran all the matched examples.
Some of tests in `ml` produced the error messages as below:
```
pyspark.sql.utils.IllegalArgumentException: u'requirement failed: Input type must be VectorUDT but got org.apache.spark.mllib.linalg.VectorUDTf71b0bce.'
```
So, I fixed them to use the new ones, just as some Python tests were fixed in https://github.com/apache/spark/pull/12627
## How was this patch tested?
Manually tested for all the examples listed by `grep -r "from pyspark.mllib" .`.
Author: hyukjinkwon <gurwls223@gmail.com>
Closes#13393 from HyukjinKwon/SPARK-14615.
## What changes were proposed in this pull request?
Instead of using a local variable `sc` as in the following example, this PR uses `spark.sparkContext`. This makes the examples more concise and also fixes some misleading usage, i.e., creating a SparkContext from a SparkSession.
```
- println("Creating SparkContext")
- val sc = spark.sparkContext
-
println("Writing local file to DFS")
val dfsFilename = dfsDirPath + "/dfs_read_write_test"
- val fileRDD = sc.parallelize(fileContents)
+ val fileRDD = spark.sparkContext.parallelize(fileContents)
```
This will change 12 files (+30 lines, -52 lines).
## How was this patch tested?
Manual.
Author: Dongjoon Hyun <dongjoon@apache.org>
Closes#13520 from dongjoon-hyun/SPARK-15773.
## What changes were proposed in this pull request?
Made DefaultParamsReadable, DefaultParamsWritable public. Also added relevant doc and annotations. Added UnaryTransformerExample to demonstrate use of UnaryTransformer and DefaultParamsReadable,Writable.
## How was this patch tested?
Wrote example making use of the now-public APIs. Compiled and ran locally
Author: Joseph K. Bradley <joseph@databricks.com>
Closes#13461 from jkbradley/defaultparamswritable.
## What changes were proposed in this pull request?
Since [SPARK-15617](https://issues.apache.org/jira/browse/SPARK-15617) deprecated ```precision``` in ```MulticlassClassificationEvaluator```, many ML examples are broken.
```python
pyspark.sql.utils.IllegalArgumentException: u'MulticlassClassificationEvaluator_4c3bb1d73d8cc0cedae6 parameter metricName given invalid value precision.'
```
We should use ```accuracy``` to replace ```precision``` in these examples.
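The broken examples are mostly Python; a hedged Scala equivalent of the one-line fix (column names are illustrative):
```scala
import org.apache.spark.ml.evaluation.MulticlassClassificationEvaluator

// "precision" as a metric name is deprecated by SPARK-15617;
// "accuracy" is the drop-in replacement used in these examples.
val evaluator = new MulticlassClassificationEvaluator()
  .setLabelCol("indexedLabel")
  .setPredictionCol("prediction")
  .setMetricName("accuracy")   // was: .setMetricName("precision")

// val accuracy = evaluator.evaluate(predictions)  // `predictions` comes from a fitted model
```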
## How was this patch tested?
Offline tests.
Author: Yanbo Liang <ybliang8@gmail.com>
Closes#13519 from yanboliang/spark-15771.
## What changes were proposed in this pull request?
See [SPARK-15605](https://issues.apache.org/jira/browse/SPARK-15605) for the details of this bug. This PR fixes 2 major bugs in this example:
* The Java example class uses the Param ```maxIter```, and it fails when calling ```Param.shouldOwn```. We need to add a public method which returns the ```maxIter``` object, because ```Params.params``` uses Java reflection to list all public methods whose return type is ```Param``` and invokes them to get all defined param objects in the instance.
* The ```uid``` member defined in the Java class is initialized after Scala traits such as ```HasFeaturesCol```. So when ```HasFeaturesCol``` is constructed, it gets a null ```uid```, which causes the ```Param.shouldOwn``` check to fail.
So, here are my changes:
* Add a public method:
```public IntParam getMaxIterParam() {return maxIter;}```
* Use a Java anonymous class overriding ```uid()``` to define the ```uid```; this solves the second problem described above.
* To make ```getMaxIterParam``` invokable via Java reflection, we must make the two classes (MyJavaLogisticRegression and MyJavaLogisticRegressionModel) public, so I made them public static inner classes.
## How was this patch tested?
Offline tests.
Author: Yanbo Liang <ybliang8@gmail.com>
Closes#13353 from yanboliang/spark-15605.
## What changes were proposed in this pull request?
The patch updates the codes & docs in the example module as well as the related doc module:
- [ ] [docs] `streaming-programming-guide.md`
- [x] scala code part
- [ ] java code part
- [ ] python code part
- [x] [examples] `RecoverableNetworkWordCount.scala`
- [ ] [examples] `JavaRecoverableNetworkWordCount.java`
- [ ] [examples] `recoverable_network_wordcount.py`
## How was this patch tested?
Ran the examples and verified results manually.
Author: Liwei Lin <lwlin7@gmail.com>
Closes#12981 from lw-lin/accumulatorV2-examples.
## What changes were proposed in this pull request?
This PR changes function `SparkSession.builder.sparkContext(..)` from **private[sql]** into **private[spark]**, and uses it if applicable like the followings.
```
- val spark = SparkSession.builder().config(sc.getConf).getOrCreate()
+ val spark = SparkSession.builder().sparkContext(sc).getOrCreate()
```
## How was this patch tested?
Pass the existing Jenkins tests.
Author: Dongjoon Hyun <dongjoon@apache.org>
Closes#13365 from dongjoon-hyun/SPARK-15618.
## What changes were proposed in this pull request?
No code change, just some typo fixing.
## How was this patch tested?
Manually ran the project build with tests, and the build succeeded.
Author: Xin Ren <iamshrek@126.com>
Closes#13385 from keypointt/codeWalkThroughStreaming.
## What changes were proposed in this pull request?
The temp directory used to save records is not deleted after program exit in DataFrameExample. Although deleteOnExit is called, it doesn't work because the directory is not empty. Similar things happened in ContextCleanerSuite. Update the code to make sure the temp directory is deleted after the program exits.
## How was this patch tested?
unit tests and local build.
Author: dding3 <ding.ding@intel.com>
Closes#13328 from dding3/master.
## What changes were proposed in this pull request?
In the MLlib naive Bayes example, the Scala and Python examples don't use libsvm data, but the Java one does.
I changed the Scala and Python examples to use the same libsvm data as the Java example.
## How was this patch tested?
Manual tests
Author: wm624@hotmail.com <wm624@hotmail.com>
Closes#13301 from wangmiao1981/example.
## What changes were proposed in this pull request?
`a` -> `an`
I used a regex to find potential error lines:
`grep -in ' a [aeiou]' mllib/src/main/scala/org/apache/spark/ml/*/*scala`
and reviewed them line by line.
## How was this patch tested?
local build
`lint-java` checking
Author: Zheng RuiFeng <ruifengz@foxmail.com>
Closes#13317 from zhengruifeng/a_an.
## What changes were proposed in this pull request?
Several classes and methods have been deprecated and are creating lots of build warnings in branch-2.0. This issue is to identify and fix those items:
* WithSGD classes: Change to make class not deprecated, object deprecated, and public class constructor deprecated. Any public use will require a deprecated API. We need to keep a non-deprecated private API since we cannot eliminate certain uses: Python API, streaming algs, and examples.
* Use in PythonMLlibAPI: Change to using private constructors
* Streaming algs: No warnings after we un-deprecate the classes
* Examples: Deprecate or change ones which use deprecated APIs
* MulticlassMetrics fields (precision, etc.)
* LinearRegressionSummary.model field
## How was this patch tested?
Existing tests. Checked for warnings manually.
Author: Sean Owen <sowen@cloudera.com>
Author: Joseph K. Bradley <joseph@databricks.com>
Closes#13314 from jkbradley/warning-cleanups.
## What changes were proposed in this pull request?
The Binarization Scala example uses `val dataFrame: DataFrame = spark.createDataFrame(data).toDF("label", "feature")`, which can't be pasted into the spark-shell because `DataFrame` is not imported. Compared with other examples, this explicit type is not required.
So I removed it from the code.
## How was this patch tested?
Manually tested
Author: wm624@hotmail.com <wm624@hotmail.com>
Closes#13266 from wangmiao1981/unit.
## What changes were proposed in this pull request?
Some PySpark examples need a SparkContext and get it by accessing _sc directly from the session. These examples should use the provided property `sparkContext` in `SparkSession` instead.
## How was this patch tested?
Ran modified examples
Author: Bryan Cutler <cutlerb@gmail.com>
Closes#13303 from BryanCutler/pyspark-session-sparkContext-MINOR.
## What changes were proposed in this pull request?
Use `SparkSession` according to [SPARK-15031](https://issues.apache.org/jira/browse/SPARK-15031)
`MLlib` is not recommended for use now, so examples in `MLlib` are ignored in this PR.
`StreamingContext` cannot be obtained directly from `SparkSession`, so examples in `Streaming` are ignored too.
cc andrewor14
## How was this patch tested?
manual tests with spark-submit
Author: Zheng RuiFeng <ruifengz@foxmail.com>
Closes#13164 from zhengruifeng/use_sparksession_ii.
## What changes were proposed in this pull request?
Update example code in examples/src/main/r/ml.R to reflect the new algorithms.
* spark.glm and glm
* spark.survreg
* spark.naiveBayes
* spark.kmeans
## How was this patch tested?
Offline test.
Author: Yanbo Liang <ybliang8@gmail.com>
Closes#13000 from yanboliang/spark-15222.
## What changes were proposed in this pull request?
MLlib is not recommended for use, and some methods are even deprecated.
Update the warning message to recommend using ML instead.
```
def showWarning() {
System.err.println(
"""WARN: This is a naive implementation of Logistic Regression and is given as an example!
|Please use either org.apache.spark.mllib.classification.LogisticRegressionWithSGD or
|org.apache.spark.mllib.classification.LogisticRegressionWithLBFGS
|for more conventional use.
""".stripMargin)
}
```
To
```
def showWarning() {
System.err.println(
"""WARN: This is a naive implementation of Logistic Regression and is given as an example!
|Please use org.apache.spark.ml.classification.LogisticRegression
|for more conventional use.
""".stripMargin)
}
```
## How was this patch tested?
local build
Author: Zheng RuiFeng <ruifengz@foxmail.com>
Closes#13190 from zhengruifeng/update_recd.
## What changes were proposed in this pull request?
In this DataFrame example, we use VectorImplicits._, which is a private API.
Since the Vectors object has a public API, we use Vectors.fromML instead of the implicits.
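A minimal sketch of the public conversion (vector values are made up):
```scala
import org.apache.spark.ml.linalg.{Vectors => NewVectors}
import org.apache.spark.mllib.linalg.{Vector => OldVector, Vectors => OldVectors}

val mlVec = NewVectors.dense(1.0, 0.0, 3.0)

// Instead of the private VectorImplicits._ conversion, use the public factory method
// to get back an old-style mllib vector.
val mllibVec: OldVector = OldVectors.fromML(mlVec)
```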
## How was this patch tested?
Manually run the example.
Author: wm624@hotmail.com <wm624@hotmail.com>
Closes#13213 from wangmiao1981/ml.
## What changes were proposed in this pull request?
Refactor all Java tests that use SparkSession to extend SharedSparkSession
## How was this patch tested?
Existing Tests
Author: Sandeep Singh <sandeep@techaddict.me>
Closes#13101 from techaddict/SPARK-15296.
## What changes were proposed in this pull request?
It seems most of the Python examples were changed to use SparkSession by https://github.com/apache/spark/pull/12809. That PR said both examples below:
- `simple_params_example.py`
- `aft_survival_regression.py`
were not changed because they do not work. It seems `aft_survival_regression.py` was changed by https://github.com/apache/spark/pull/13050 but `simple_params_example.py` is not yet.
This PR corrects the example and makes it use SparkSession.
In more detail, it seems `threshold` was replaced with `thresholds` here and there by 5a23213c14. However, when it calls `lr.fit(training, paramMap)` this overwrites the values. So, `threshold` was 5 and `thresholds` becomes 5.5 (by `1 / (1 + thresholds(0) / thresholds(1))`).
According to the comment below, this is not allowed: 354f8f11bd/mllib/src/main/scala/org/apache/spark/ml/classification/LogisticRegression.scala (L58-L61).
So, in this PR, it sets the equivalent value so that this does not throw an exception.
## How was this patch tested?
Manually (`mvn package -DskipTests && spark-submit simple_params_example.py`)
Author: hyukjinkwon <gurwls223@gmail.com>
Closes#13135 from HyukjinKwon/SPARK-15031.
## What changes were proposed in this pull request?
I used IntelliJ IDEA to search for usages of the deprecated SparkContext.accumulator in the whole Spark project and updated the code (except the test code for the accumulator method itself).
## How was this patch tested?
Existing unit tests
Author: WeichenXu <WeichenXu123@outlook.com>
Closes#13112 from WeichenXu123/update_accuV2_in_mllib.
## What changes were proposed in this pull request?
Update the unit test code, examples, and documents to remove calls to deprecated method `dataset.registerTempTable`.
## How was this patch tested?
This PR only changes the unit test code, examples, and comments. It should be safe.
This is a follow up of PR https://github.com/apache/spark/pull/12945 which was merged.
Author: Sean Zhong <seanzhong@databricks.com>
Closes#13098 from clockfly/spark-15171-remove-deprecation.
## What changes were proposed in this pull request?
Once SPARK-14487 and SPARK-14549 are merged, we will migrate to use the new vector and matrix types in the new ML pipeline based APIs.
## How was this patch tested?
Unit tests
Author: DB Tsai <dbt@netflix.com>
Author: Liang-Chi Hsieh <simonh@tw.ibm.com>
Author: Xiangrui Meng <meng@databricks.com>
Closes#12627 from dbtsai/SPARK-14615-NewML.
## What changes were proposed in this pull request?
Copying & pasting the example in ml-collaborative-filtering.html into spark-shell, we see the following errors.
scala> case class Rating(userId: Int, movieId: Int, rating: Float, timestamp: Long)
defined class Rating
scala> object Rating {
def parseRating(str: String): Rating = { | val fields = str.split("::") | assert(fields.size == 4) | Rating(fields(0).toInt, fields(1).toInt, fields(2).toFloat, fields(3).toLong) | }
}
<console>:29: error: Rating.type does not take parameters
Rating(fields(0).toInt, fields(1).toInt, fields(2).toFloat, fields(3).toLong)
^
The standard Scala REPL has the same error.
The Scala/spark-shell REPL has some quirks (e.g. packages are also not well supported).
The reason for the errors is that the Scala/spark-shell REPL discards previous definitions when we define an object with the same name as an existing class. Solution: we can rename the Rating object.
## How was this patch tested?
Manually test it: 1). ./bin/run-example ALSExample
2). copy & paste example in the generated document. It works fine.
Author: wm624@hotmail.com <wm624@hotmail.com>
Closes#13110 from wangmiao1981/repl.
## What changes were proposed in this pull request?
Add guide doc and examples for GaussianMixture in Spark.ml in Java, Scala and Python.
## How was this patch tested?
Manually compiled and tested all examples
Author: wm624@hotmail.com <wm624@hotmail.com>
Closes#12788 from wangmiao1981/example.
## What changes were proposed in this pull request?
Add Scala/Java/Python examples for ```GeneralizedLinearRegression```.
## How was this patch tested?
They are examples and have been tested offline.
Author: Yanbo Liang <ybliang8@gmail.com>
Closes#12754 from yanboliang/spark-14979.
## What changes were proposed in this pull request?
Deprecates registerTempTable and adds dataset.createTempView and dataset.createOrReplaceTempView.
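A hedged sketch of the old call and its replacements (the data file is the standard example data set):
```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("TempViewSketch").getOrCreate()
val df = spark.read.json("examples/src/main/resources/people.json")

// Deprecated:
// df.registerTempTable("people")

// New APIs: createTempView fails if the view already exists,
// createOrReplaceTempView overwrites it.
df.createOrReplaceTempView("people")
spark.sql("SELECT name FROM people").show()
```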
## How was this patch tested?
Unit tests.
Author: Sean Zhong <seanzhong@databricks.com>
Closes#12945 from clockfly/spark-15171.
## What changes were proposed in this pull request?
1, Use `SparkSession` according to [SPARK-15031](https://issues.apache.org/jira/browse/SPARK-15031)
2, Update indent for `SparkContext` according to [SPARK-15134](https://issues.apache.org/jira/browse/SPARK-15134)
3, BTW, remove some duplicate space and add missing '.'
## How was this patch tested?
manual tests
Author: Zheng RuiFeng <ruifengz@foxmail.com>
Closes#13050 from zhengruifeng/use_sparksession.
## What changes were proposed in this pull request?
Renaming the streaming-kafka artifact to include kafka version, in anticipation of needing a different artifact for later kafka versions
## How was this patch tested?
Unit tests
Author: cody koeninger <cody@koeninger.org>
Closes#12946 from koeninger/SPARK-15085.
## What changes were proposed in this pull request?
This fixes compile errors.
## How was this patch tested?
Pass the Jenkins tests.
Author: Dongjoon Hyun <dongjoon@apache.org>
Closes#13053 from dongjoon-hyun/hotfix_sqlquerysuite.
## What changes were proposed in this pull request?
1, create a libsvm-type dataset for LDA: `data/mllib/sample_lda_libsvm_data.txt`
2, add a Python example
3, directly read the data file in the examples
4, BTW, change to `SparkSession` in `aft_survival_regression.py`
## How was this patch tested?
manual tests
`./bin/spark-submit examples/src/main/python/ml/lda_example.py`
Author: Zheng RuiFeng <ruifengz@foxmail.com>
Closes#12927 from zhengruifeng/lda_pe.
## What changes were proposed in this pull request?
A Python example for ml.kmeans already exists but is not included in the user guide.
1, small changes like: `example_on` `example_off`
2, add it to the user guide
3, update the example to directly read the data file
## How was this patch tested?
manual tests
`./bin/spark-submit examples/src/main/python/ml/kmeans_example.py`
Author: Zheng RuiFeng <ruifengz@foxmail.com>
Closes#12925 from zhengruifeng/km_pe.
## What changes were proposed in this pull request?
1, add BisectingKMeans to ml-clustering.md
2, add the missing Scala BisectingKMeansExample
3, create a new datafile `data/mllib/sample_kmeans_data.txt`
## How was this patch tested?
manual tests
Author: Zheng RuiFeng <ruifengz@foxmail.com>
Closes#11844 from zhengruifeng/doc_bkm.
## What changes were proposed in this pull request?
1, Add python example for OneVsRest
2, remove args-parsing
## How was this patch tested?
manual tests
`./bin/spark-submit examples/src/main/python/ml/one_vs_rest_example.py`
Author: Zheng RuiFeng <ruifengz@foxmail.com>
Closes#12920 from zhengruifeng/ovr_pe.
This PR removes `sqlContext` in examples. Actual usage was all replaced in https://github.com/apache/spark/pull/12809 but there are some in comments.
Manual style checking.
Author: hyukjinkwon <gurwls223@gmail.com>
Closes#13006 from HyukjinKwon/minor-docs.
## What changes were proposed in this pull request?
* Since Spark has supported a native csv reader, it is not necessary to use the third-party ```spark-csv``` in ```examples/src/main/r/data-manipulation.R```. Meanwhile, remove all ```spark-csv``` usage in SparkR.
* Running R applications through ```sparkR``` is not supported as of Spark 2.0, so we change to use ```./bin/spark-submit``` to run the example.
## How was this patch tested?
Offline test.
Author: Yanbo Liang <ybliang8@gmail.com>
Closes#13005 from yanboliang/r-df-examples.
Cleans up ALS examples by removing unnecessary casts to double for `rating` and `prediction` columns, since `RegressionEvaluator` now supports `Double` & `Float` input types.
## How was this patch tested?
Manual compile and run with `run-example ml.ALSExample` and `spark-submit examples/src/main/python/ml/als_example.py`.
Author: Nick Pentreath <nickp@za.ibm.com>
Closes#12892 from MLnick/als-examples-cleanup.
## What changes were proposed in this pull request?
Add the missing python example for QuantileDiscretizer
## How was this patch tested?
manual tests
Author: Zheng RuiFeng <ruifengz@foxmail.com>
Closes#12281 from zhengruifeng/discret_pe.
## What changes were proposed in this pull request?
This issue addresses the comments in SPARK-15031 and also fixes java-linter errors.
- Use multiline format in SparkSession builder patterns (see the sketch after this list).
- Update `binary_classification_metrics_example.py` to use `SparkSession`.
- Fix Java linter errors (in SPARK-13745, SPARK-15031, and others so far)
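A hedged sketch of the multiline builder format referred to in the first item (the application name is illustrative):
```scala
import org.apache.spark.sql.SparkSession

// Single-line form that the review asked to avoid:
// val spark = SparkSession.builder().appName("BinaryClassificationMetricsExample").getOrCreate()

// Multiline builder format:
val spark = SparkSession
  .builder()
  .appName("BinaryClassificationMetricsExample")
  .getOrCreate()
```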
## How was this patch tested?
After passing the Jenkins tests, ran `dev/lint-java` manually.
Author: Dongjoon Hyun <dongjoon@apache.org>
Closes#12911 from dongjoon-hyun/SPARK-15134.
## What changes were proposed in this pull request?
Removing the `withHiveSupport` method of `SparkSession`; use `enableHiveSupport` instead
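A sketch of the replacement call (the exact signature of the removed helper is not shown here, and the app name is illustrative):
```scala
import org.apache.spark.sql.SparkSession

// Hive support was previously requested via the removed SparkSession.withHiveSupport helper;
// it is now requested through the builder:
val spark = SparkSession
  .builder()
  .appName("HiveExample")
  .enableHiveSupport()
  .getOrCreate()
```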
## How was this patch tested?
ran tests locally
Author: Sandeep Singh <sandeep@techaddict.me>
Closes#12851 from techaddict/SPARK-15072.
## What changes were proposed in this pull request?
This PR aims to update Scala/Python/Java examples by replacing `SQLContext` with newly added `SparkSession`.
- Use **SparkSession Builder Pattern** in 154 (Scala 55, Java 52, Python 47) files.
- Add `getConf` in the Python SparkContext class: `python/pyspark/context.py`
- Replace the **SQLContext Singleton Pattern** with the **SparkSession Singleton Pattern** (see the sketch after this list):
- `SqlNetworkWordCount.scala`
- `JavaSqlNetworkWordCount.java`
- `sql_network_wordcount.py`
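A hedged sketch of the singleton pattern the streaming SQL examples switch to, mirroring (from memory) what `SqlNetworkWordCount.scala` does:
```scala
import org.apache.spark.SparkConf
import org.apache.spark.sql.SparkSession

/** Lazily instantiated singleton SparkSession, reusable across streaming batches. */
object SparkSessionSingleton {
  @transient private var instance: SparkSession = _

  def getInstance(sparkConf: SparkConf): SparkSession = {
    if (instance == null) {
      instance = SparkSession
        .builder()
        .config(sparkConf)
        .getOrCreate()
    }
    instance
  }
}
```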
Now, `SQLContext` is used only in R examples and the following two Python examples. The Python examples are untouched in this PR since they already fail with an unknown issue.
- `simple_params_example.py`
- `aft_survival_regression.py`
## How was this patch tested?
Manual.
Author: Dongjoon Hyun <dongjoon@apache.org>
Closes#12809 from dongjoon-hyun/SPARK-15031.
## What changes were proposed in this pull request?
Add python3 compatibility in python examples
## How was this patch tested?
manual tests
Author: Zheng RuiFeng <ruifengz@foxmail.com>
Closes#12868 from zhengruifeng/fix_gmm_py.
## What changes were proposed in this pull request?
This is a python port of corresponding Scala builder pattern code. `sql.py` is modified as a target example case.
## How was this patch tested?
Manual.
Author: Dongjoon Hyun <dongjoon@apache.org>
Closes#12860 from dongjoon-hyun/SPARK-15084.
## What changes were proposed in this pull request?
Users should use the builder pattern instead.
## How was this patch tested?
Jenkins.
Author: Andrew Or <andrew@databricks.com>
Closes#12873 from andrewor14/spark-session-constructor.
## What changes were proposed in this pull request?
Now, `SQLContext` is used for backward-compatibility, we had better use `SparkSession` in Spark 2.0 examples.
## How was this patch tested?
It's just an example change. After building, run `bin/run-example org.apache.spark.examples.sql.RDDRelation`.
Author: Dongjoon Hyun <dongjoon@apache.org>
Closes#12808 from dongjoon-hyun/rddrelation.
## What changes were proposed in this pull request?
In the spark.ml documentation, the LogisticRegression Scala example uses sqlCtx. It is inconsistent with the Java and Python examples, which use sqlContext. In addition, a user can't copy & paste the example to run it in spark-shell, as sqlCtx doesn't exist in spark-shell while sqlContext does.
Change the Scala example referenced by the spark.ml documentation.
## How was this patch tested?
Compile the example scala file and it passes compilation.
Author: wm624@hotmail.com <wm624@hotmail.com>
Closes#12717 from wangmiao1981/doc.
Spark's published POMs are supposed to be flattened and not contain variable substitution (see SPARK-3812), but the dummy dependency that was required for this was accidentally removed. We should re-introduce this dependency in order to fix an issue where the un-flattened POMs cause the wrong dependencies to be included in Scala 2.10 published POMs.
Author: Josh Rosen <joshrosen@databricks.com>
Closes#12706 from JoshRosen/SPARK-14925-published-poms-should-be-flattened.
## What changes were proposed in this pull request?
Add the missing python example for VectorSlicer
## How was this patch tested?
manual tests
Author: Zheng RuiFeng <ruifengz@foxmail.com>
Closes#12282 from zhengruifeng/vecslicer_pe.
## What changes were proposed in this pull request?
Use Long.parseLong, which returns a primitive.
Using a series of append() calls reduces the creation of an extra StringBuilder.
## How was this patch tested?
Unit tests
Author: Azeem Jiva <azeemj@gmail.com>
Closes#12520 from javawithjiva/minor.
## What changes were proposed in this pull request?
This removes the class `HiveContext` itself along with all code usages associated with it. The bulk of the work was already done in #12485. This is mainly just code cleanup and actually removing the class.
Note: A couple of things will break after this patch. These will be fixed separately.
- the python HiveContext
- all the documentation / comments referencing HiveContext
- there will be no more HiveContext in the REPL (fixed by #12589)
## How was this patch tested?
No change in functionality.
Author: Andrew Or <andrew@databricks.com>
Closes#12585 from andrewor14/delete-hive-context.
First, make all dependencies in the examples module provided, and explicitly
list a couple of ones that somehow are promoted to compile by maven. This
means that to run streaming examples, the streaming connector package needs
to be provided to run-examples using --packages or --jars, just like regular
apps.
Also, remove a couple of outdated examples. HBase has had Spark bindings for
a while and is even including them in the HBase distribution in the next
version, making the examples obsolete. The same applies to Cassandra, which
seems to have a proper Spark binding library already.
I just tested the build, which passes, and ran SparkPi. The examples jars
directory now has only two jars:
```
$ ls -1 examples/target/scala-2.11/jars/
scopt_2.11-3.3.0.jar
spark-examples_2.11-2.0.0-SNAPSHOT.jar
```
Author: Marcelo Vanzin <vanzin@cloudera.com>
Closes#12544 from vanzin/SPARK-14744.
## What changes were proposed in this pull request?
This issue aims to fix some errors in R examples and make them up-to-date in docs and example modules.
- Remove the wrong usage of `map`. We need to use `lapply` in `sparkR` if needed. However, `lapply` is private so far. The corrected example will be added later.
- Fix the wrong example in Section `Generic Load/Save Functions` of `docs/sql-programming-guide.md` for consistency
- Fix datatypes in `sparkr.md`.
- Update a data result in `sparkr.md`.
- Replace deprecated functions to remove warnings: jsonFile -> read.json, parquetFile -> read.parquet
- Use up-to-date R-like functions: loadDF -> read.df, saveDF -> write.df, saveAsParquetFile -> write.parquet
- Replace `SparkR DataFrame` with `SparkDataFrame` in `dataframe.R` and `data-manipulation.R`.
- Other minor syntax fixes and a typo.
## How was this patch tested?
Manual.
Author: Dongjoon Hyun <dongjoon@apache.org>
Closes#12649 from dongjoon-hyun/SPARK-14883.
## What changes were proposed in this pull request?
Java `sampleByKey` methods should accept `Map` with `java.lang.Double` values
## How was this patch tested?
Existing (updated) Jenkins tests
Author: Sean Owen <sowen@cloudera.com>
Closes#12637 from srowen/SPARK-14873.
## What changes were proposed in this pull request?
`JavaStreamingContext.awaitTermination` methods should be declared as `throws[InterruptedException]` so that this exception can be handled in Java code. Note this is not just a doc change, but an API change, since now (in Java) the method has a checked exception to handle. All await-like methods in Java APIs behave this way, so seems worthwhile for 2.0.
## How was this patch tested?
Jenkins tests
Author: Sean Owen <sowen@cloudera.com>
Closes#12418 from srowen/SPARK-8393.
## What changes were proposed in this pull request?
Currently, the docs for TF-IDF only refer to using HashingTF with IDF. However, CountVectorizer can also be used. We should probably amend the user guide and examples to show this.
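A hedged sketch of the alternative the guide should mention (data and column names are made up):
```scala
import org.apache.spark.ml.feature.{CountVectorizer, IDF, Tokenizer}
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("TfIdfSketch").getOrCreate()

val sentences = spark.createDataFrame(Seq(
  (0, "spark is great"),
  (1, "spark is fast")
)).toDF("id", "sentence")

val words = new Tokenizer().setInputCol("sentence").setOutputCol("words").transform(sentences)

// Term frequencies via CountVectorizer instead of HashingTF (keeps an explicit vocabulary).
val cvModel = new CountVectorizer()
  .setInputCol("words")
  .setOutputCol("rawFeatures")
  .fit(words)
val featurized = cvModel.transform(words)

// The IDF rescaling step is the same as in the HashingTF-based flow.
val idfModel = new IDF().setInputCol("rawFeatures").setOutputCol("features").fit(featurized)
idfModel.transform(featurized).select("id", "features").show(truncate = false)
```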
## How was this patch tested?
unit tests and doc generation
Author: Yuhao Yang <hhbyyh@gmail.com>
Closes#12454 from hhbyyh/tfdoc.
## What changes were proposed in this pull request?
Move the spark-examples.jar from being in examples/target to examples/target/scala-2.11/jars
## How was this patch tested?
Built distribution to make sure examples jar was being included in the tarball.
Ran run-example to make sure examples were run.
Author: Mark Grover <mark@apache.org>
Closes#12476 from markgrover/spark-14711.
## What changes were proposed in this pull request?
Add the missing python example for ChiSqSelector
## How was this patch tested?
manual tests
Author: Zheng RuiFeng <ruifengz@foxmail.com>
Closes#12283 from zhengruifeng/chi2_pe.