Commit graph

918 commits

Author SHA1 Message Date
Gaetan Semet b3c2291228 [SPARK-16992][PYSPARK] use map comprehension in doc
Code is equivalent, but map comprehency is most of the time faster than a map.

Author: Gaetan Semet <gaetan@xeberon.net>

Closes #14863 from Stibbons/map_comprehension.
2016-09-12 12:21:33 +01:00
CodingCat 97da41039b [SPARK-17347][SQL][EXAMPLES] Encoder in Dataset example has incorrect type
## What changes were proposed in this pull request?

We propose to fix the Encoder type in the Dataset example

## How was this patch tested?

The PR will be tested with the current unit test cases

Author: CodingCat <zhunansjtu@gmail.com>

Closes #14901 from CodingCat/SPARK-17347.
2016-09-03 10:03:40 +01:00
Sean Owen e07baf1412 [SPARK-17001][ML] Enable standardScaler to standardize sparse vectors when withMean=True
## What changes were proposed in this pull request?

Allow centering / mean scaling of sparse vectors in StandardScaler, if requested. This is for compatibility with `VectorAssembler` in common usages.

## How was this patch tested?

Jenkins tests, including new caes to reflect the new behavior.

Author: Sean Owen <sowen@cloudera.com>

Closes #14663 from srowen/SPARK-17001.
2016-08-27 08:48:56 +01:00
Weiqing Yang 673a80d223 [MINOR][BUILD] Fix Java CheckStyle Error
## What changes were proposed in this pull request?
As Spark 2.0.1 will be released soon (mentioned in the spark dev mailing list), besides the critical bugs, it's better to fix the code style errors before the release.

Before:
```
./dev/lint-java
Checkstyle checks failed at following occurrences:
[ERROR] src/main/java/org/apache/spark/util/collection/unsafe/sort/UnsafeExternalSorter.java:[525] (sizes) LineLength: Line is longer than 100 characters (found 119).
[ERROR] src/main/java/org/apache/spark/examples/sql/streaming/JavaStructuredNetworkWordCount.java:[64] (sizes) LineLength: Line is longer than 100 characters (found 103).
```
After:
```
./dev/lint-java
Using `mvn` from path: /usr/local/bin/mvn
Checkstyle checks passed.
```
## How was this patch tested?
Manual.

Author: Weiqing Yang <yangweiqing001@gmail.com>

Closes #14768 from Sherry302/fixjavastyle.
2016-08-24 10:12:44 +01:00
wm624@hotmail.com 3e5fdeb3fb [SPARKR][EXAMPLE] change example APP name
## What changes were proposed in this pull request?

(Please fill in changes proposed in this fix)

For R SQL example, appname is "MyApp". While examples in scala, Java and python, the appName is "x Spark SQL basic example".

I made the R example consistent with other examples.

## How was this patch tested?

(Please explain how this patch was tested. E.g. unit tests, integration tests, manual tests)

Manual test
(If this patch involves UI changes, please attach a screenshot; otherwise, remove this)

Author: wm624@hotmail.com <wm624@hotmail.com>

Closes #14703 from wangmiao1981/example.
2016-08-20 07:00:51 -07:00
hyukjinkwon 7186e8c318 [SPARK-16886][EXAMPLES][DOC] Fix some examples to be consistent and indentation in documentation
## What changes were proposed in this pull request?

Originally this PR was based on #14491 but I realised that fixing examples are more sensible rather than comments.

This PR fixes three things below:

 - Fix two wrong examples in `structured-streaming-programming-guide.md`. Loading via `read.load(..)` without `as` will be `Dataset<Row>` not `Dataset<String>` in Java.

- Fix indentation across `structured-streaming-programming-guide.md`. Python has 4 spaces and Scala and Java have double spaces. These are inconsistent across the examples.

- Fix `StructuredNetworkWordCountWindowed` and  `StructuredNetworkWordCount` in Java and Scala to initially load `DataFrame` and `Dataset<Row>` to be consistent with the comments and some examples in `structured-streaming-programming-guide.md` and to match Scala and Java to Python one (Python one loads it as `DataFrame` initially).

## How was this patch tested?

N/A

Closes https://github.com/apache/spark/pull/14491

Author: hyukjinkwon <gurwls223@gmail.com>
Author: Ganesh Chand <ganeshchand@Ganeshs-MacBook-Pro-2.local>

Closes #14564 from HyukjinKwon/SPARK-16886.
2016-08-11 11:31:52 +01:00
Weiqing Yang e10ca8de49 [SPARK-16945] Fix Java Lint errors
## What changes were proposed in this pull request?
This PR is to fix the minor Java linter errors as following:
[ERROR] src/main/java/org/apache/spark/sql/catalyst/expressions/VariableLengthRowBasedKeyValueBatch.java:[42,10] (modifier) RedundantModifier: Redundant 'final' modifier.
[ERROR] src/main/java/org/apache/spark/sql/catalyst/expressions/VariableLengthRowBasedKeyValueBatch.java:[97,10] (modifier) RedundantModifier: Redundant 'final' modifier.

## How was this patch tested?
Manual test.
dev/lint-java
Using `mvn` from path: /usr/local/bin/mvn
Checkstyle checks passed.

Author: Weiqing Yang <yangweiqing001@gmail.com>

Closes #14532 from Sherry302/master.
2016-08-08 09:24:37 +01:00
Bryan Cutler 180fd3e0a3 [SPARK-16421][EXAMPLES][ML] Improve ML Example Outputs
## What changes were proposed in this pull request?
Improve example outputs to better reflect the functionality that is being presented.  This mostly consisted of modifying what was printed at the end of the example, such as calling show() with truncate=False, but sometimes required minor tweaks in the example data to get relevant output.  Explicitly set parameters when they are used as part of the example.  Fixed Java examples that failed to run because of using old-style MLlib Vectors or problem with schema.  Synced examples between different APIs.

## How was this patch tested?
Ran each example for Scala, Python, and Java and made sure output was legible on a terminal of width 100.

Author: Bryan Cutler <cutlerb@gmail.com>

Closes #14308 from BryanCutler/ml-examples-improve-output-SPARK-16260.
2016-08-05 20:57:46 +01:00
sandy cbdff49357 [SPARK-16816] Modify java example which is also reflect in documentation exmaple
## What changes were proposed in this pull request?

Modify java example which is also reflect in document.

## How was this patch tested?

run test cases.

Author: sandy <phalodi@gmail.com>

Closes #14436 from phalodi/SPARK-16816.
2016-08-02 10:34:01 -07:00
Xusen Yin dd8514fa20 [SPARK-16558][EXAMPLES][MLLIB] examples/mllib/LDAExample should use MLVector instead of MLlib Vector
## What changes were proposed in this pull request?

mllib.LDAExample uses ML pipeline and MLlib LDA algorithm. The former transforms original data into MLVector format, while the latter uses MLlibVector format.

## How was this patch tested?

Test manually.

Author: Xusen Yin <yinxusen@gmail.com>

Closes #14212 from yinxusen/SPARK-16558.
2016-08-02 07:28:46 -07:00
Cheng Lian 10e1c0e638 [SPARK-16734][EXAMPLES][SQL] Revise examples of all language bindings
## What changes were proposed in this pull request?

This PR makes various minor updates to examples of all language bindings to make sure they are consistent with each other. Some typos and missing parts (JDBC example in Scala/Java/Python) are also fixed.

## How was this patch tested?

Manually tested.

Author: Cheng Lian <lian@databricks.com>

Closes #14368 from liancheng/revise-examples.
2016-08-02 15:02:40 +08:00
Bryan Cutler a6290e51e4 [SPARK-16800][EXAMPLES][ML] Fix Java examples that fail to run due to exception
## What changes were proposed in this pull request?
Some Java examples are using mllib.linalg.Vectors instead of ml.linalg.Vectors and causes an exception when run.  Also there are some Java examples that incorrectly specify data types in the schema, also causing an exception.

## How was this patch tested?
Ran corrected examples locally

Author: Bryan Cutler <cutlerb@gmail.com>

Closes #14405 from BryanCutler/java-examples-ml.Vectors-fix-SPARK-16800.
2016-07-30 08:08:33 -07:00
Sean Owen 0dc4310b47 [SPARK-16694][CORE] Use for/foreach rather than map for Unit expressions whose side effects are required
## What changes were proposed in this pull request?

Use foreach/for instead of map where operation requires execution of body, not actually defining a transformation

## How was this patch tested?

Jenkins

Author: Sean Owen <sowen@cloudera.com>

Closes #14332 from srowen/SPARK-16694.
2016-07-30 04:42:38 -07:00
Cheng Lian 53b2456d1d [SPARK-16380][EXAMPLES] Update SQL examples and programming guide for Python language binding
This PR is based on PR #14098 authored by wangmiao1981.

## What changes were proposed in this pull request?

This PR replaces the original Python Spark SQL example file with the following three files:

- `sql/basic.py`

  Demonstrates basic Spark SQL features.

- `sql/datasource.py`

  Demonstrates various Spark SQL data sources.

- `sql/hive.py`

  Demonstrates Spark SQL Hive interaction.

This PR also removes hard-coded Python example snippets in the SQL programming guide by extracting snippets from the above files using the `include_example` Liquid template tag.

## How was this patch tested?

Manually tested.

Author: wm624@hotmail.com <wm624@hotmail.com>
Author: Cheng Lian <lian@databricks.com>

Closes #14317 from liancheng/py-examples-update.
2016-07-23 11:41:24 -07:00
Dongjoon Hyun 556a9437ac [MINOR][BUILD] Fix Java Linter LineLength errors
## What changes were proposed in this pull request?

This PR fixes four java linter `LineLength` errors. Those are all `LineLength` errors, but we had better remove all java linter errors before release.

## How was this patch tested?

After pass the Jenkins, `./dev/lint-java`.

Author: Dongjoon Hyun <dongjoon@apache.org>

Closes #14255 from dongjoon-hyun/minor_java_linter.
2016-07-19 11:51:43 +01:00
Cheng Lian 1426a08052 [SPARK-16303][DOCS][EXAMPLES] Minor Scala/Java example update
## What changes were proposed in this pull request?

This PR moves one and the last hard-coded Scala example snippet from the SQL programming guide into `SparkSqlExample.scala`. It also renames all Scala/Java example files so that all "Sql" in the file names are updated to "SQL".

## How was this patch tested?

Manually verified the generated HTML page.

Author: Cheng Lian <lian@databricks.com>

Closes #14245 from liancheng/minor-scala-example-update.
2016-07-18 23:07:59 -07:00
Zheng RuiFeng e5fbb182c0 [MINOR] Remove unused arg in als.py
## What changes were proposed in this pull request?
The second arg in method `update()` is never used. So I delete it.

## How was this patch tested?
local run with `./bin/spark-submit examples/src/main/python/als.py`

Author: Zheng RuiFeng <ruifengz@foxmail.com>

Closes #14247 from zhengruifeng/als_refine.
2016-07-18 22:57:13 -07:00
Felix Cheung 75f0efe74d [SPARKR][DOCS] minor code sample update in R programming guide
## What changes were proposed in this pull request?

Fix code style from ad hoc review of RC4 doc

## How was this patch tested?

manual

shivaram

Author: Felix Cheung <felixcheung_m@hotmail.com>

Closes #14250 from felixcheung/rdocs2rc4.
2016-07-18 16:01:57 -07:00
Bryan Cutler e3f8a03367 [SPARK-16403][EXAMPLES] Cleanup to remove unused imports, consistent style, minor fixes
## What changes were proposed in this pull request?

Cleanup of examples, mostly from PySpark-ML to fix minor issues:  unused imports, style consistency, pipeline_example is a duplicate, use future print funciton, and a spelling error.

* The "Pipeline Example" is duplicated by "Simple Text Classification Pipeline" in Scala, Python, and Java.

* "Estimator Transformer Param Example" is duplicated by "Simple Params Example" in Scala, Python and Java

* Synced random_forest_classifier_example.py with Scala by adding IndexToString label converted

* Synced train_validation_split.py (in Scala ModelSelectionViaTrainValidationExample) by adjusting data split, adding grid for intercept.

* RegexTokenizer was doing nothing in tokenizer_example.py and JavaTokenizerExample.java, synced with Scala version

## How was this patch tested?
local tests and run modified examples

Author: Bryan Cutler <cutlerb@gmail.com>

Closes #14081 from BryanCutler/examples-cleanup-SPARK-16403.
2016-07-14 09:12:46 +01:00
Felix Cheung b4baf086ca [SPARKR][MINOR] R examples and test updates
## What changes were proposed in this pull request?

Minor example updates

## How was this patch tested?

manual

shivaram

Author: Felix Cheung <felixcheung_m@hotmail.com>

Closes #14171 from felixcheung/rexample.
2016-07-13 13:33:34 -07:00
aokolnychyi 772c213ec7 [SPARK-16303][DOCS][EXAMPLES] Updated SQL programming guide and examples
- Hard-coded Spark SQL sample snippets were moved into source files under examples sub-project.
- Removed the inconsistency between Scala and Java Spark SQL examples
- Scala and Java Spark SQL examples were updated

The work is still in progress. All involved examples were tested manually. An additional round of testing will be done after the code review.

![image](https://cloud.githubusercontent.com/assets/6235869/16710314/51851606-462a-11e6-9fbe-0818daef65e4.png)

Author: aokolnychyi <okolnychyyanton@gmail.com>

Closes #14119 from aokolnychyi/spark_16303.
2016-07-13 16:12:11 +08:00
James Thomas 9e2c763dbb [SPARK-16114][SQL] structured streaming event time window example
## What changes were proposed in this pull request?

A structured streaming example with event time windowing.

## How was this patch tested?

Run locally

Author: James Thomas <jamesjoethomas@gmail.com>

Closes #13957 from jjthomas/current.
2016-07-11 17:57:51 -07:00
Yanbo Liang 2ad031be67 [SPARKR][DOC] SparkR ML user guides update for 2.0
## What changes were proposed in this pull request?
* Update SparkR ML section to make them consistent with SparkR API docs.
* Since #13972 adds labelling support for the ```include_example``` Jekyll plugin, so that we can split the single ```ml.R``` example file into multiple line blocks with different labels, and include them in different algorithms/models in the generated HTML page.

## How was this patch tested?
Only docs update, manually check the generated docs.

Author: Yanbo Liang <ybliang8@gmail.com>

Closes #14011 from yanboliang/r-user-guide-update.
2016-07-11 14:31:11 -07:00
Xin Ren 9cb1eb7af7 [SPARK-16381][SQL][SPARKR] Update SQL examples and programming guide for R language binding
https://issues.apache.org/jira/browse/SPARK-16381

## What changes were proposed in this pull request?

Update SQL examples and programming guide for R language binding.

Here I just follow example https://github.com/apache/spark/compare/master...liancheng:example-snippet-extraction, created a separate R file to store all the example code.

## How was this patch tested?

Manual test on my local machine.
Screenshot as below:

![screen shot 2016-07-06 at 4 52 25 pm](https://cloud.githubusercontent.com/assets/3925641/16638180/13925a58-439a-11e6-8d57-8451a63dcae9.png)

Author: Xin Ren <iamshrek@126.com>

Closes #14082 from keypointt/SPARK-16381.
2016-07-11 20:05:28 +08:00
wm624@hotmail.com a539b724c1 [SPARK-16260][ML][EXAMPLE] PySpark ML Example Improvements and Cleanup
## What changes were proposed in this pull request?
1). Remove unused import in Scala example;

2). Move spark session import outside example off;

3). Change parameter setting the same as Scala;

4). Change comment to be consistent;

5). Make sure that Scala and python using the same data set;

I did one pass and fixed the above issues. There are missing examples in python, which might be added later.

TODO: For some examples, there are comments on how to run examples; But there are many missing. We can add them later.

## How was this patch tested?

(Please explain how this patch was tested. E.g. unit tests, integration tests, manual tests)

Manually test them

Author: wm624@hotmail.com <wm624@hotmail.com>

Closes #14021 from wangmiao1981/ann.
2016-07-03 23:23:02 -07:00
WeichenXu 0bd7cd18bc [SPARK-16345][DOCUMENTATION][EXAMPLES][GRAPHX] Extract graphx programming guide example snippets from source files instead of hard code them
## What changes were proposed in this pull request?

I extract 6 example programs from GraphX programming guide and replace them with
`include_example` label.

The 6 example programs are:
- AggregateMessagesExample.scala
- SSSPExample.scala
- TriangleCountingExample.scala
- ConnectedComponentsExample.scala
- ComprehensiveExample.scala
- PageRankExample.scala

All the example code can run using
`bin/run-example graphx.EXAMPLE_NAME`

## How was this patch tested?

Manual.

Author: WeichenXu <WeichenXu123@outlook.com>

Closes #14015 from WeichenXu123/graphx_example_plugin.
2016-07-02 16:29:00 +01:00
Cheng Lian bde1d6a615 [SPARK-16294][SQL] Labelling support for the include_example Jekyll plugin
## What changes were proposed in this pull request?

This PR adds labelling support for the `include_example` Jekyll plugin, so that we may split a single source file into multiple line blocks with different labels, and include them in multiple code snippets in the generated HTML page.

## How was this patch tested?

Manually tested.

<img width="923" alt="screenshot at jun 29 19-53-21" src="https://cloud.githubusercontent.com/assets/230655/16451099/66a76db2-3e33-11e6-84fb-63104c2f0688.png">

Author: Cheng Lian <lian@databricks.com>

Closes #13972 from liancheng/include-example-with-labels.
2016-06-29 22:50:53 -07:00
Bryan Cutler 21385d02a9 [SPARK-16261][EXAMPLES][ML] Fixed incorrect appNames in ML Examples
## What changes were proposed in this pull request?

Some appNames in ML examples are incorrect, mostly in PySpark but one in Scala.  This corrects the names.

## How was this patch tested?
Style, local tests

Author: Bryan Cutler <cutlerb@gmail.com>

Closes #13949 from BryanCutler/pyspark-example-appNames-fix-SPARK-16261.
2016-06-29 14:06:38 +02:00
James Thomas 3554713a16 [SPARK-16114][SQL] structured streaming network word count examples
## What changes were proposed in this pull request?

Network word count example for structured streaming

## How was this patch tested?

Run locally

Author: James Thomas <jamesjoethomas@gmail.com>
Author: James Thomas <jamesthomas@Jamess-MacBook-Pro.local>

Closes #13816 from jjthomas/master.
2016-06-28 16:12:48 -07:00
Bryan Cutler 1aa191e58e [SPARK-16231][PYSPARK][ML][EXAMPLES] dataframe_example.py fails to convert ML style vectors
## What changes were proposed in this pull request?
Need to convert ML Vectors to the old MLlib style before doing Statistics.colStats operations on the DataFrame

## How was this patch tested?
Ran example, local tests

Author: Bryan Cutler <cutlerb@gmail.com>

Closes #13928 from BryanCutler/pyspark-ml-example-vector-conv-SPARK-16231.
2016-06-27 12:58:39 -07:00
杨浩 b452026324 [SPARK-16214][EXAMPLES] fix the denominator of SparkPi
## What changes were proposed in this pull request?

reduce the denominator of SparkPi by 1

## How was this patch tested?

  integration tests

Author: 杨浩 <yanghaogn@163.com>

Closes #13910 from yanghaogn/patch-1.
2016-06-27 08:31:52 +01:00
GayathriMurali be88383e15 [SPARK-15997][DOC][ML] Update user guide for HashingTF, QuantileVectorizer and CountVectorizer
## What changes were proposed in this pull request?

Made changes to HashingTF,QuantileVectorizer and CountVectorizer

Author: GayathriMurali <gayathri.m@intel.com>

Closes #13745 from GayathriMurali/SPARK-15997.
2016-06-24 13:25:40 +02:00
Felix Cheung 359c2e827d [SPARK-15159][SPARKR] SparkSession roxygen2 doc, programming guide, example updates
## What changes were proposed in this pull request?

roxygen2 doc, programming guide, example updates

## How was this patch tested?

manual checks
shivaram

Author: Felix Cheung <felixcheung_m@hotmail.com>

Closes #13751 from felixcheung/rsparksessiondoc.
2016-06-20 13:46:24 -07:00
GayathriMurali af2a4b0826 [SPARK-15129][R][DOC] R API changes in ML
## What changes were proposed in this pull request?

Make user guide changes to SparkR documentation for all changes that happened in 2.0 to Machine Learning APIs

Author: GayathriMurali <gayathri.m@intel.com>

Closes #13285 from GayathriMurali/SPARK-15129.
2016-06-17 21:10:29 -07:00
WeichenXu 9040d83bc2 [SPARK-15608][ML][EXAMPLES][DOC] add examples and documents of ml.isotonic regression
## What changes were proposed in this pull request?

add ml doc for ml isotonic regression
add scala example for ml isotonic regression
add java example for ml isotonic regression
add python example for ml isotonic regression

modify scala example for mllib isotonic regression
modify java example for mllib isotonic regression
modify python example for mllib isotonic regression

add data/mllib/sample_isotonic_regression_libsvm_data.txt
delete data/mllib/sample_isotonic_regression_data.txt
## How was this patch tested?

N/A

Author: WeichenXu <WeichenXu123@outlook.com>

Closes #13381 from WeichenXu123/add_isotonic_regression_doc.
2016-06-16 17:35:40 -07:00
Dongjoon Hyun a865f6e052 [SPARK-15996][R] Fix R examples by removing deprecated functions
## What changes were proposed in this pull request?

Currently, R examples(`dataframe.R` and `data-manipulation.R`) fail like the following. We had better update them before releasing 2.0 RC. This PR updates them to use up-to-date APIs.

```bash
$ bin/spark-submit examples/src/main/r/dataframe.R
...
Warning message:
'createDataFrame(sqlContext...)' is deprecated.
Use 'createDataFrame(data, schema = NULL, samplingRatio = 1.0)' instead.
See help("Deprecated")
...
Warning message:
'read.json(sqlContext...)' is deprecated.
Use 'read.json(path)' instead.
See help("Deprecated")
...
Error: could not find function "registerTempTable"
Execution halted
```

## How was this patch tested?

Manual.
```
curl -LO http://s3-us-west-2.amazonaws.com/sparkr-data/flights.csv
bin/spark-submit examples/src/main/r/dataframe.R
bin/spark-submit examples/src/main/r/data-manipulation.R flights.csv
```

Author: Dongjoon Hyun <dongjoon@apache.org>

Closes #13714 from dongjoon-hyun/SPARK-15996.
2016-06-16 12:46:25 -07:00
Wenchen Fan e2ab79d5ea [SPARK-15898][SQL] DataFrameReader.text should return DataFrame
## What changes were proposed in this pull request?

We want to maintain API compatibility for DataFrameReader.text, and will introduce a new API called DataFrameReader.textFile which returns Dataset[String].

affected PRs:
https://github.com/apache/spark/pull/11731
https://github.com/apache/spark/pull/13104
https://github.com/apache/spark/pull/13184

## How was this patch tested?

N/A

Author: Wenchen Fan <wenchen@databricks.com>

Closes #13604 from cloud-fan/revert.
2016-06-12 21:36:41 -07:00
Sean Owen f51dfe616b [SPARK-15086][CORE][STREAMING] Deprecate old Java accumulator API
## What changes were proposed in this pull request?

- Deprecate old Java accumulator API; should use Scala now
- Update Java tests and examples
- Don't bother testing old accumulator API in Java 8 (too)
- (fix a misspelling too)

## How was this patch tested?

Jenkins tests

Author: Sean Owen <sowen@cloudera.com>

Closes #13606 from srowen/SPARK-15086.
2016-06-12 11:44:33 -07:00
hyukjinkwon 99f3c82776 [SPARK-14615][ML][FOLLOWUP] Fix Python examples to use the new ML Vector and Matrix APIs in the ML pipeline based algorithms
## What changes were proposed in this pull request?

This PR fixes Python examples to use the new ML Vector and Matrix APIs in the ML pipeline based algorithms.

I firstly executed this shell command, `grep -r "from pyspark.mllib" .` and then executed them all.
Some of tests in `ml` produced the error messages as below:

```
pyspark.sql.utils.IllegalArgumentException: u'requirement failed: Input type must be VectorUDT but got org.apache.spark.mllib.linalg.VectorUDTf71b0bce.'
```

So, I fixed them to use new ones just identically with some Python tests fixed in https://github.com/apache/spark/pull/12627

## How was this patch tested?

Manually tested for all the examples listed by `grep -r "from pyspark.mllib" .`.

Author: hyukjinkwon <gurwls223@gmail.com>

Closes #13393 from HyukjinKwon/SPARK-14615.
2016-06-10 18:29:26 -07:00
Dongjoon Hyun 2022afe57d [SPARK-15773][CORE][EXAMPLE] Avoid creating local variable sc in examples if possible
## What changes were proposed in this pull request?

Instead of using local variable `sc` like the following example, this PR uses `spark.sparkContext`. This makes examples more concise, and also fixes some misleading, i.e., creating SparkContext from SparkSession.
```
-    println("Creating SparkContext")
-    val sc = spark.sparkContext
-
     println("Writing local file to DFS")
     val dfsFilename = dfsDirPath + "/dfs_read_write_test"
-    val fileRDD = sc.parallelize(fileContents)
+    val fileRDD = spark.sparkContext.parallelize(fileContents)
```

This will change 12 files (+30 lines, -52 lines).

## How was this patch tested?

Manual.

Author: Dongjoon Hyun <dongjoon@apache.org>

Closes #13520 from dongjoon-hyun/SPARK-15773.
2016-06-10 15:40:29 -07:00
Joseph K. Bradley 4c74ee8d8e [SPARK-15721][ML] Make DefaultParamsReadable, DefaultParamsWritable public
## What changes were proposed in this pull request?

Made DefaultParamsReadable, DefaultParamsWritable public.  Also added relevant doc and annotations.  Added UnaryTransformerExample to demonstrate use of UnaryTransformer and DefaultParamsReadable,Writable.

## How was this patch tested?

Wrote example making use of the now-public APIs.  Compiled and ran locally

Author: Joseph K. Bradley <joseph@databricks.com>

Closes #13461 from jkbradley/defaultparamswritable.
2016-06-06 09:49:45 -07:00
Yanbo Liang a95252823e [SPARK-15771][ML][EXAMPLES] Use 'accuracy' rather than 'precision' in many ML examples
## What changes were proposed in this pull request?
Since [SPARK-15617](https://issues.apache.org/jira/browse/SPARK-15617) deprecated ```precision``` in ```MulticlassClassificationEvaluator```, many ML examples broken.
```python
pyspark.sql.utils.IllegalArgumentException: u'MulticlassClassificationEvaluator_4c3bb1d73d8cc0cedae6 parameter metricName given invalid value precision.'
```
We should use ```accuracy``` to replace ```precision``` in these examples.

## How was this patch tested?
Offline tests.

Author: Yanbo Liang <ybliang8@gmail.com>

Closes #13519 from yanboliang/spark-15771.
2016-06-06 09:36:34 +01:00
Yanbo Liang 4fe7c7bd1e [SPARK-15605][ML][EXAMPLES] Fix broken ML JavaDeveloperApiExample.
## What changes were proposed in this pull request?
See [SPARK-15605](https://issues.apache.org/jira/browse/SPARK-15605) for the detail of this bug. This PR fix 2 major bugs in this example:
* The java example class use Param ```maxIter```, it will fail when calling ```Param.shouldOwn```. We need add a public method which return the ```maxIter``` Object. Because ```Params.params``` use java reflection to list all public method whose return type is ```Param```, and invoke them to get all defined param objects in the instance.
* The ```uid``` member defined in Java class will be initialized after Scala traits such as ```HasFeaturesCol```. So when ```HasFeaturesCol``` being constructed, they get ```uid``` with null, which will cause ```Param.shouldOwn``` check fail.

so, here is my changes:
* Add public method:
```public IntParam getMaxIterParam() {return maxIter;}```

* Use Java anonymous class overriding ```uid()``` to defined the ```uid```, and it solve the second problem described above.
* To make the ```getMaxIterParam ``` can be invoked using java reflection, we must make the two class (MyJavaLogisticRegression and MyJavaLogisticRegressionModel) public. so I make them become inner public static class.

## How was this patch tested?
Offline tests.

Author: Yanbo Liang <ybliang8@gmail.com>

Closes #13353 from yanboliang/spark-15605.
2016-06-02 11:10:13 -05:00
Liwei Lin a0eec8e8ff [SPARK-15208][WIP][CORE][STREAMING][DOCS] Update Spark examples with AccumulatorV2
## What changes were proposed in this pull request?

The patch updates the codes & docs in the example module as well as the related doc module:

- [ ] [docs] `streaming-programming-guide.md`
  - [x] scala code part
  - [ ] java code part
  - [ ] python code part
- [x] [examples] `RecoverableNetworkWordCount.scala`
- [ ] [examples] `JavaRecoverableNetworkWordCount.java`
- [ ] [examples] `recoverable_network_wordcount.py`

## How was this patch tested?

Ran the examples and verified results manually.

Author: Liwei Lin <lwlin7@gmail.com>

Closes #12981 from lw-lin/accumulatorV2-examples.
2016-06-02 11:07:15 -05:00
Dongjoon Hyun 85d6b0db9f [SPARK-15618][SQL][MLLIB] Use SparkSession.builder.sparkContext if applicable.
## What changes were proposed in this pull request?

This PR changes function `SparkSession.builder.sparkContext(..)` from **private[sql]** into **private[spark]**, and uses it if applicable like the followings.
```
- val spark = SparkSession.builder().config(sc.getConf).getOrCreate()
+ val spark = SparkSession.builder().sparkContext(sc).getOrCreate()
```

## How was this patch tested?

Pass the existing Jenkins tests.

Author: Dongjoon Hyun <dongjoon@apache.org>

Closes #13365 from dongjoon-hyun/SPARK-15618.
2016-05-31 17:40:44 -07:00
Xin Ren 5728aa558e [SPARK-15645][STREAMING] Fix some typos of Streaming module
## What changes were proposed in this pull request?

No code change, just some typo fixing.

## How was this patch tested?

Manually run project build with testing, and build is successful.

Author: Xin Ren <iamshrek@126.com>

Closes #13385 from keypointt/codeWalkThroughStreaming.
2016-05-30 08:40:03 -05:00
dding3 88c9c467a3 [SPARK-15562][ML] Delete temp directory after program exit in DataFrameExample
## What changes were proposed in this pull request?
Temp directory used to save records is not deleted after program exit in DataFrameExample. Although it called deleteOnExit, it doesn't work as the directory is not empty. Similar things happend in ContextCleanerSuite. Update the code to make sure temp directory is deleted after program exit.

## How was this patch tested?

unit tests and local build.

Author: dding3 <ding.ding@intel.com>

Closes #13328 from dding3/master.
2016-05-27 21:01:50 -05:00
wm624@hotmail.com 5d4dafe8fd [SPARK-15449][MLLIB][EXAMPLE] Wrong Data Format - Documentation Issue
## What changes were proposed in this pull request?

(Please fill in changes proposed in this fix)
In the MLLib naivebayes example, scala and python example doesn't use libsvm data, but Java does.

I make changes in scala and python example to use the libsvm data as the same as Java example.

## How was this patch tested?

Manual tests

Author: wm624@hotmail.com <wm624@hotmail.com>

Closes #13301 from wangmiao1981/example.
2016-05-27 20:59:24 -05:00
Zheng RuiFeng 6b1a6180e7 [MINOR] Fix Typos 'a -> an'
## What changes were proposed in this pull request?

`a` -> `an`

I use regex to generate potential error lines:
`grep -in ' a [aeiou]' mllib/src/main/scala/org/apache/spark/ml/*/*scala`
and review them line by line.

## How was this patch tested?

local build
`lint-java` checking

Author: Zheng RuiFeng <ruifengz@foxmail.com>

Closes #13317 from zhengruifeng/a_an.
2016-05-26 22:39:14 -07:00
Sean Owen b0a03feef2 [SPARK-15457][MLLIB][ML] Eliminate some warnings from MLlib about deprecations
## What changes were proposed in this pull request?

Several classes and methods have been deprecated and are creating lots of build warnings in branch-2.0. This issue is to identify and fix those items:
* WithSGD classes: Change to make class not deprecated, object deprecated, and public class constructor deprecated. Any public use will require a deprecated API. We need to keep a non-deprecated private API since we cannot eliminate certain uses: Python API, streaming algs, and examples.
  * Use in PythonMLlibAPI: Change to using private constructors
  * Streaming algs: No warnings after we un-deprecate the classes
  * Examples: Deprecate or change ones which use deprecated APIs
* MulticlassMetrics fields (precision, etc.)
* LinearRegressionSummary.model field

## How was this patch tested?

Existing tests.  Checked for warnings manually.

Author: Sean Owen <sowen@cloudera.com>
Author: Joseph K. Bradley <joseph@databricks.com>

Closes #13314 from jkbradley/warning-cleanups.
2016-05-26 14:25:28 -07:00
wm624@hotmail.com e451f7f0c3 [SPARK-15492][ML][DOC] Binarization scala example copy & paste to spark-shell error
## What changes were proposed in this pull request?

(Please fill in changes proposed in this fix)
The Binarization scala example val dataFrame : Dataframe = spark.createDataFrame(data).toDF("label", "feature"), which can't be pasted in the spark-shell as Dataframe is not imported. Compared with other examples, this explicit type is not required.

So I removed Dataframe in the code.
## How was this patch tested?

(Please explain how this patch was tested. E.g. unit tests, integration tests, manual tests)
Manually tested

Author: wm624@hotmail.com <wm624@hotmail.com>

Closes #13266 from wangmiao1981/unit.
2016-05-26 12:36:36 +02:00
Bryan Cutler 9c297df3d4 [MINOR] [PYSPARK] [EXAMPLES] Changed examples to use SparkSession.sparkContext instead of _sc
## What changes were proposed in this pull request?

Some PySpark examples need a SparkContext and get it by accessing _sc directly from the session.  These examples should use the provided property `sparkContext` in `SparkSession` instead.

## How was this patch tested?
Ran modified examples

Author: Bryan Cutler <cutlerb@gmail.com>

Closes #13303 from BryanCutler/pyspark-session-sparkContext-MINOR.
2016-05-25 14:29:14 -07:00
gatorsmile 6cb8f836da [SPARK-15396][SQL][DOC] It can't connect hive metastore database
#### What changes were proposed in this pull request?
The `hive.metastore.warehouse.dir` property in hive-site.xml is deprecated since Spark 2.0.0. Users might not be able to connect to the existing metastore if they do not use the new conf parameter `spark.sql.warehouse.dir`.

This PR is to update the document and example for explaining the latest changes in the configuration of default location of database.

Below is the screenshot of the latest generated docs:

<img width="681" alt="screenshot 2016-05-20 08 38 10" src="https://cloud.githubusercontent.com/assets/11567269/15433296/a05c4ace-1e66-11e6-8d2b-73682b32e9c2.png">

<img width="789" alt="screenshot 2016-05-20 08 53 26" src="https://cloud.githubusercontent.com/assets/11567269/15433734/645dc42e-1e68-11e6-9476-effc9f8721bb.png">

<img width="789" alt="screenshot 2016-05-20 08 53 37" src="https://cloud.githubusercontent.com/assets/11567269/15433738/68569f92-1e68-11e6-83d3-ef5bb221a8d8.png">

No change is made in the R's example.

<img width="860" alt="screenshot 2016-05-20 08 54 38" src="https://cloud.githubusercontent.com/assets/11567269/15433779/965b8312-1e68-11e6-8bc4-53c88ceacde2.png">

#### How was this patch tested?
N/A

Author: gatorsmile <gatorsmile@gmail.com>

Closes #13225 from gatorsmile/document.
2016-05-21 23:12:27 -07:00
Zheng RuiFeng 127bf1bb07 [SPARK-15031][EXAMPLE] Use SparkSession in examples
## What changes were proposed in this pull request?
Use `SparkSession` according to [SPARK-15031](https://issues.apache.org/jira/browse/SPARK-15031)

`MLLLIB` is not recommended to use now, so examples in `MLLIB` are ignored in this PR.
`StreamingContext` can not be directly obtained from `SparkSession`, so example in `Streaming` are ignored too.

cc andrewor14

## How was this patch tested?
manual tests with spark-submit

Author: Zheng RuiFeng <ruifengz@foxmail.com>

Closes #13164 from zhengruifeng/use_sparksession_ii.
2016-05-20 16:40:33 -07:00
Yanbo Liang 9a9c6f5c22 [SPARK-15222][SPARKR][ML] SparkR ML examples update in 2.0
## What changes were proposed in this pull request?
Update example code in examples/src/main/r/ml.R to reflect the new algorithms.
* spark.glm and glm
* spark.survreg
* spark.naiveBayes
* spark.kmeans

## How was this patch tested?
Offline test.

Author: Yanbo Liang <ybliang8@gmail.com>

Closes #13000 from yanboliang/spark-15222.
2016-05-20 09:30:20 -07:00
Zheng RuiFeng 47a2940da9 [SPARK-15398][ML] Update the warning message to recommend ML usage
## What changes were proposed in this pull request?
MLlib are not recommended to use, and some methods are even deprecated.
Update the warning message to recommend ML usage.
```
  def showWarning() {
    System.err.println(
      """WARN: This is a naive implementation of Logistic Regression and is given as an example!
        |Please use either org.apache.spark.mllib.classification.LogisticRegressionWithSGD or
        |org.apache.spark.mllib.classification.LogisticRegressionWithLBFGS
        |for more conventional use.
      """.stripMargin)
  }
```
To
```
  def showWarning() {
    System.err.println(
      """WARN: This is a naive implementation of Logistic Regression and is given as an example!
        |Please use org.apache.spark.ml.classification.LogisticRegression
        |for more conventional use.
      """.stripMargin)
  }
```

## How was this patch tested?
local build

Author: Zheng RuiFeng <ruifengz@foxmail.com>

Closes #13190 from zhengruifeng/update_recd.
2016-05-19 23:26:11 -07:00
wm624@hotmail.com 4c7a6b385c [SPARK-15363][ML][EXAMPLE] Example code shouldn't use VectorImplicits._, asML/fromML
## What changes were proposed in this pull request?

(Please fill in changes proposed in this fix)
In this DataFrame example, we use VectorImplicits._, which is private API.

Since Vectors object has public API, we use Vectors.fromML instead of implicts.

## How was this patch tested?

(Please explain how this patch was tested. E.g. unit tests, integration tests, manual tests)

Manually run the example.

Author: wm624@hotmail.com <wm624@hotmail.com>

Closes #13213 from wangmiao1981/ml.
2016-05-19 23:21:17 -07:00
Sandeep Singh 01cf649c4f [SPARK-15296][MLLIB] Refactor All Java Tests that use SparkSession
## What changes were proposed in this pull request?
Refactor All Java Tests that use SparkSession, to extend SharedSparkSesion

## How was this patch tested?
Existing Tests

Author: Sandeep Singh <sandeep@techaddict.me>

Closes #13101 from techaddict/SPARK-15296.
2016-05-19 20:38:44 -07:00
hyukjinkwon e2ec32dab8 [SPARK-15031][EXAMPLES][FOLLOW-UP] Make Python param example working with SparkSession
## What changes were proposed in this pull request?

It seems most of Python examples were changed to use SparkSession by https://github.com/apache/spark/pull/12809. This PR said both examples below:

- `simple_params_example.py`
- `aft_survival_regression.py`

are not changed because it dose not work. It seems `aft_survival_regression.py` is changed by https://github.com/apache/spark/pull/13050 but `simple_params_example.py` is not yet.

This PR corrects the example and make this use SparkSession.

In more detail, it seems `threshold` is replaced to `thresholds` here and there by 5a23213c14. However, when it calls `lr.fit(training, paramMap)` this overwrites the values. So, `threshold` was 5 and `thresholds` becomes 5.5 (by `1 / (1 + thresholds(0) / thresholds(1)`).

According to the comment below. this is not allowed, 354f8f11bd/mllib/src/main/scala/org/apache/spark/ml/classification/LogisticRegression.scala (L58-L61).

So, in this PR, it sets the equivalent value so that this does not throw an exception.

## How was this patch tested?

Manully (`mvn package -DskipTests && spark-submit simple_params_example.py`)

Author: hyukjinkwon <gurwls223@gmail.com>

Closes #13135 from HyukjinKwon/SPARK-15031.
2016-05-19 08:52:41 +02:00
WeichenXu 2f9047b5eb [SPARK-15322][MLLIB][CORE][SQL] update deprecate accumulator usage into accumulatorV2 in spark project
## What changes were proposed in this pull request?

I use Intellj-IDEA to search usage of deprecate SparkContext.accumulator in the whole spark project, and update the code.(except those test code for accumulator method itself)

## How was this patch tested?

Exisiting unit tests

Author: WeichenXu <WeichenXu123@outlook.com>

Closes #13112 from WeichenXu123/update_accuV2_in_mllib.
2016-05-18 11:48:46 +01:00
Sean Zhong 25b315e6ca [SPARK-15171][SQL] Remove the references to deprecated method dataset.registerTempTable
## What changes were proposed in this pull request?

Update the unit test code, examples, and documents to remove calls to deprecated method `dataset.registerTempTable`.

## How was this patch tested?

This PR only changes the unit test code, examples, and comments. It should be safe.
This is a follow up of PR https://github.com/apache/spark/pull/12945 which was merged.

Author: Sean Zhong <seanzhong@databricks.com>

Closes #13098 from clockfly/spark-15171-remove-deprecation.
2016-05-18 09:01:59 +08:00
DB Tsai e2efe0529a [SPARK-14615][ML] Use the new ML Vector and Matrix in the ML pipeline based algorithms
## What changes were proposed in this pull request?

Once SPARK-14487 and SPARK-14549 are merged, we will migrate to use the new vector and matrix type in the new ml pipeline based apis.

## How was this patch tested?

Unit tests

Author: DB Tsai <dbt@netflix.com>
Author: Liang-Chi Hsieh <simonh@tw.ibm.com>
Author: Xiangrui Meng <meng@databricks.com>

Closes #12627 from dbtsai/SPARK-14615-NewML.
2016-05-17 12:51:07 -07:00
wm624@hotmail.com bebe5f9811 [SPARK-15318][ML][EXAMPLE] spark.ml Collaborative Filtering example does not work in spark-shell
## What changes were proposed in this pull request?

(Please fill in changes proposed in this fix)

copy & paste example in ml-collaborative-filtering.html into spark-shell, we see the following errors.
scala> case class Rating(userId: Int, movieId: Int, rating: Float, timestamp: Long)
defined class Rating

scala> object Rating {
def parseRating(str: String): Rating = { | val fields = str.split("::") | assert(fields.size == 4) | Rating(fields(0).toInt, fields(1).toInt, fields(2).toFloat, fields(3).toLong) | }
}
<console>:29: error: Rating.type does not take parameters
Rating(fields(0).toInt, fields(1).toInt, fields(2).toFloat, fields(3).toLong)
^
In standard scala repl, it has the same error.

Scala/spark-shell repl has some quirks (e.g. packages are also not well supported).

The reason of errors is that scala/spark-shell repl discards previous definitions when we define the Object with the same class name. Solution: We can rename the Object Rating.

## How was this patch tested?

(Please explain how this patch was tested. E.g. unit tests, integration tests, manual tests)

Manually test it: 1). ./bin/run-example ALSExample
2). copy & paste example in the generated document. It works fine.

Author: wm624@hotmail.com <wm624@hotmail.com>

Closes #13110 from wangmiao1981/repl.
2016-05-17 16:51:01 +01:00
wm624@hotmail.com 4134ff0c65 [SPARK-14434][ML] User guide doc and examples for GaussianMixture in spark.ml
## What changes were proposed in this pull request?

(Please fill in changes proposed in this fix)

Add guide doc and examples for GaussianMixture in Spark.ml in Java, Scala and Python.

## How was this patch tested?

(Please explain how this patch was tested. E.g. unit tests, integration tests, manual tests)

Manual compile and test all examples

Author: wm624@hotmail.com <wm624@hotmail.com>

Closes #12788 from wangmiao1981/example.
2016-05-17 15:20:47 +02:00
Yanbo Liang f116a84ef8 [SPARK-14979][ML][PYSPARK] Add examples for GeneralizedLinearRegression
## What changes were proposed in this pull request?
Add Scala/Java/Python examples for ```GeneralizedLinearRegression```.

## How was this patch tested?
They are examples and have been tested offline.

Author: Yanbo Liang <ybliang8@gmail.com>

Closes #12754 from yanboliang/spark-14979.
2016-05-16 09:55:35 +02:00
Sean Zhong 33c6eb5218 [SPARK-15171][SQL] Deprecate registerTempTable and add dataset.createTempView
## What changes were proposed in this pull request?

Deprecates registerTempTable and add dataset.createTempView, dataset.createOrReplaceTempView.

## How was this patch tested?

Unit tests.

Author: Sean Zhong <seanzhong@databricks.com>

Closes #12945 from clockfly/spark-15171.
2016-05-12 15:51:53 +08:00
Zheng RuiFeng 9e266d07a4 [SPARK-15031][SPARK-15134][EXAMPLE][DOC] Use SparkSession and update indent in examples
## What changes were proposed in this pull request?
1, Use `SparkSession` according to [SPARK-15031](https://issues.apache.org/jira/browse/SPARK-15031)
2, Update indent for `SparkContext` according to [SPARK-15134](https://issues.apache.org/jira/browse/SPARK-15134)
3, BTW, remove some duplicate space and add missing '.'

## How was this patch tested?
manual tests

Author: Zheng RuiFeng <ruifengz@foxmail.com>

Closes #13050 from zhengruifeng/use_sparksession.
2016-05-11 22:45:30 -07:00
Dongjoon Hyun e1576478bd [SPARK-14933][HOTFIX] Replace sqlContext with spark.
## What changes were proposed in this pull request?

This fixes compile errors.

## How was this patch tested?

Pass the Jenkins tests.

Author: Dongjoon Hyun <dongjoon@apache.org>

Closes #13053 from dongjoon-hyun/hotfix_sqlquerysuite.
2016-05-11 10:03:51 -07:00
Zheng RuiFeng d88afabdfa [SPARK-15150][EXAMPLE][DOC] Update LDA examples
## What changes were proposed in this pull request?
1,create a libsvm-type dataset for lda: `data/mllib/sample_lda_libsvm_data.txt`
2,add python example
3,directly read the datafile in examples
4,BTW, change to `SparkSession` in `aft_survival_regression.py`

## How was this patch tested?
manual tests
`./bin/spark-submit examples/src/main/python/ml/lda_example.py`

Author: Zheng RuiFeng <ruifengz@foxmail.com>

Closes #12927 from zhengruifeng/lda_pe.
2016-05-11 12:49:41 +02:00
Zheng RuiFeng 8beae59144 [SPARK-15149][EXAMPLE][DOC] update kmeans example
## What changes were proposed in this pull request?
Python example for ml.kmeans already exists, but not included in user guide.
1,small changes like: `example_on` `example_off`
2,add it to user guide
3,update examples to directly read datafile

## How was this patch tested?
manual tests
`./bin/spark-submit examples/src/main/python/ml/kmeans_example.py

Author: Zheng RuiFeng <ruifengz@foxmail.com>

Closes #12925 from zhengruifeng/km_pe.
2016-05-11 10:01:43 +02:00
Zheng RuiFeng cef73b5638 [SPARK-14340][EXAMPLE][DOC] Update Examples and User Guide for ml.BisectingKMeans
## What changes were proposed in this pull request?

1, add BisectingKMeans to ml-clustering.md
2, add the missing Scala BisectingKMeansExample
3, create a new datafile `data/mllib/sample_kmeans_data.txt`

## How was this patch tested?

manual tests

Author: Zheng RuiFeng <ruifengz@foxmail.com>

Closes #11844 from zhengruifeng/doc_bkm.
2016-05-11 09:56:36 +02:00
Zheng RuiFeng ad1a8466e9 [SPARK-15141][EXAMPLE][DOC] Update OneVsRest Examples
## What changes were proposed in this pull request?
1, Add python example for OneVsRest
2, remove args-parsing

## How was this patch tested?
manual tests
`./bin/spark-submit examples/src/main/python/ml/one_vs_rest_example.py`

Author: Zheng RuiFeng <ruifengz@foxmail.com>

Closes #12920 from zhengruifeng/ovr_pe.
2016-05-11 09:53:36 +02:00
hyukjinkwon 2992a215c9 [MINOR][DOCS] Remove remaining sqlContext in documentation at examples
This PR removes `sqlContext` in examples. Actual usage was all replaced in https://github.com/apache/spark/pull/12809 but there are some in comments.

Manual style checking.

Author: hyukjinkwon <gurwls223@gmail.com>

Closes #13006 from HyukjinKwon/minor-docs.
2016-05-09 10:55:17 -07:00
Yanbo Liang ee3b171562 [MINOR] [SPARKR] Update data-manipulation.R to use native csv reader
## What changes were proposed in this pull request?
* Since Spark has supported native csv reader, it does not necessary to use the third party ```spark-csv``` in ```examples/src/main/r/data-manipulation.R```. Meanwhile, remove all ```spark-csv``` usage in SparkR.
* Running R applications through ```sparkR``` is not supported as of Spark 2.0, so we change to use ```./bin/spark-submit``` to run the example.

## How was this patch tested?
Offline test.

Author: Yanbo Liang <ybliang8@gmail.com>

Closes #13005 from yanboliang/r-df-examples.
2016-05-09 09:58:36 -07:00
Nick Pentreath b0cafdb6cc [MINOR][ML][PYSPARK] ALS example cleanup
Cleans up ALS examples by removing unnecessary casts to double for `rating` and `prediction` columns, since `RegressionEvaluator` now supports `Double` & `Float` input types.

## How was this patch tested?

Manual compile and run with `run-example ml.ALSExample` and `spark-submit examples/src/main/python/ml/als_example.py`.

Author: Nick Pentreath <nickp@za.ibm.com>

Closes #12892 from MLnick/als-examples-cleanup.
2016-05-07 10:57:40 +02:00
Zheng RuiFeng 76ad04d9a0 [SPARK-14512] [DOC] Add python example for QuantileDiscretizer
## What changes were proposed in this pull request?
Add the missing python example for QuantileDiscretizer

## How was this patch tested?
manual tests

Author: Zheng RuiFeng <ruifengz@foxmail.com>

Closes #12281 from zhengruifeng/discret_pe.
2016-05-06 10:47:13 -07:00
Dongjoon Hyun 2c170dd3d7 [SPARK-15134][EXAMPLE] Indent SparkSession builder patterns and update binary_classification_metrics_example.py
## What changes were proposed in this pull request?

This issue addresses the comments in SPARK-15031 and also fix java-linter errors.
- Use multiline format in SparkSession builder patterns.
- Update `binary_classification_metrics_example.py` to use `SparkSession`.
- Fix Java Linter errors (in SPARK-13745, SPARK-15031, and so far)

## How was this patch tested?

After passing the Jenkins tests and run `dev/lint-java` manually.

Author: Dongjoon Hyun <dongjoon@apache.org>

Closes #12911 from dongjoon-hyun/SPARK-15134.
2016-05-05 14:37:50 -07:00
Sandeep Singh ed6f3f8a5f [SPARK-15072][SQL][REPL][EXAMPLES] Remove SparkSession.withHiveSupport
## What changes were proposed in this pull request?
Removing the `withHiveSupport` method of `SparkSession`, instead use `enableHiveSupport`

## How was this patch tested?
ran tests locally

Author: Sandeep Singh <sandeep@techaddict.me>

Closes #12851 from techaddict/SPARK-15072.
2016-05-05 14:35:15 -07:00
Dongjoon Hyun cdce4e62a5 [SPARK-15031][EXAMPLE] Use SparkSession in Scala/Python/Java example.
## What changes were proposed in this pull request?

This PR aims to update Scala/Python/Java examples by replacing `SQLContext` with newly added `SparkSession`.

- Use **SparkSession Builder Pattern** in 154(Scala 55, Java 52, Python 47) files.
- Add `getConf` in Python SparkContext class: `python/pyspark/context.py`
- Replace **SQLContext Singleton Pattern** with **SparkSession Singleton Pattern**:
  - `SqlNetworkWordCount.scala`
  - `JavaSqlNetworkWordCount.java`
  - `sql_network_wordcount.py`

Now, `SQLContexts` are used only in R examples and the following two Python examples. The python examples are untouched in this PR since it already fails some unknown issue.
- `simple_params_example.py`
- `aft_survival_regression.py`

## How was this patch tested?

Manual.

Author: Dongjoon Hyun <dongjoon@apache.org>

Closes #12809 from dongjoon-hyun/SPARK-15031.
2016-05-04 14:31:36 -07:00
Zheng RuiFeng 4530250f5a [MINOR] Add python3 compatibility in python examples
## What changes were proposed in this pull request?
Add python3 compatibility in python examples

## How was this patch tested?
manual tests

Author: Zheng RuiFeng <ruifengz@foxmail.com>

Closes #12868 from zhengruifeng/fix_gmm_py.
2016-05-04 10:59:36 -07:00
Dongjoon Hyun 0903a185c7 [SPARK-15084][PYTHON][SQL] Use builder pattern to create SparkSession in PySpark.
## What changes were proposed in this pull request?

This is a python port of corresponding Scala builder pattern code. `sql.py` is modified as a target example case.

## How was this patch tested?

Manual.

Author: Dongjoon Hyun <dongjoon@apache.org>

Closes #12860 from dongjoon-hyun/SPARK-15084.
2016-05-03 18:05:40 -07:00
Andrew Or 588cac414a [SPARK-15073][SQL] Hide SparkSession constructor from the public
## What changes were proposed in this pull request?

Users should use the builder pattern instead.

## How was this patch tested?

Jenks.

Author: Andrew Or <andrew@databricks.com>

Closes #12873 from andrewor14/spark-session-constructor.
2016-05-03 13:47:58 -07:00
Dongjoon Hyun f86f71763c [MINOR][EXAMPLE] Use SparkSession instead of SQLContext in RDDRelation.scala
## What changes were proposed in this pull request?

Now, `SQLContext` is used for backward-compatibility, we had better use `SparkSession` in Spark 2.0 examples.

## How was this patch tested?

It's just example change. After building, run `bin/run-example org.apache.spark.examples.sql.RDDRelation`.

Author: Dongjoon Hyun <dongjoon@apache.org>

Closes #12808 from dongjoon-hyun/rddrelation.
2016-04-30 00:15:04 -07:00
wm624@hotmail.com c74fd1e546 [SPARK-14937][ML][DOCUMENT] spark.ml LogisticRegression sqlCtx in scala is inconsistent with java and python
## What changes were proposed in this pull request?
In spark.ml document, the LogisticRegression scala example uses sqlCtx. It is inconsistent with java and python examples which use sqlContext. In addition, a user can't copy & paste to run the example in spark-shell as sqlCtx doesn't exist in spark-shell while sqlContext exists.

Change the scala example referred by the spark.ml example.

## How was this patch tested?

Compile the example scala file and it passes compilation.

Author: wm624@hotmail.com <wm624@hotmail.com>

Closes #12717 from wangmiao1981/doc.
2016-04-27 11:56:57 -07:00
Zheng RuiFeng e88476c8c6 [SPARK-14514][DOC] Add python example for VectorSlicer
## What changes were proposed in this pull request?
Add the missing python example for VectorSlicer

## How was this patch tested?
manual tests

Author: Zheng RuiFeng <ruifengz@foxmail.com>

Closes #12282 from zhengruifeng/vecslicer_pe.
2016-04-26 14:38:29 -07:00
Azeem Jiva de6e633420 [SPARK-14756][CORE] Use parseLong instead of valueOf
## What changes were proposed in this pull request?

Use Long.parseLong which returns a primative.
Use a series of appends() reduces the creation of an extra StringBuilder type

## How was this patch tested?

Unit tests

Author: Azeem Jiva <azeemj@gmail.com>

Closes #12520 from javawithjiva/minor.
2016-04-26 11:49:04 +01:00
Andrew Or 3c5e65c339 [SPARK-14721][SQL] Remove HiveContext (part 2)
## What changes were proposed in this pull request?

This removes the class `HiveContext` itself along with all code usages associated with it. The bulk of the work was already done in #12485. This is mainly just code cleanup and actually removing the class.

Note: A couple of things will break after this patch. These will be fixed separately.
- the python HiveContext
- all the documentation / comments referencing HiveContext
- there will be no more HiveContext in the REPL (fixed by #12589)

## How was this patch tested?

No change in functionality.

Author: Andrew Or <andrew@databricks.com>

Closes #12585 from andrewor14/delete-hive-context.
2016-04-25 13:23:05 -07:00
Marcelo Vanzin a680562a6f [SPARK-14744][EXAMPLES] Clean up examples packaging, remove outdated examples.
First, make all dependencies in the examples module provided, and explicitly
list a couple of ones that somehow are promoted to compile by maven. This
means that to run streaming examples, the streaming connector package needs
to be provided to run-examples using --packages or --jars, just like regular
apps.

Also, remove a couple of outdated examples. HBase has had Spark bindings for
a while and is even including them in the HBase distribution in the next
version, making the examples obsolete. The same applies to Cassandra, which
seems to have a proper Spark binding library already.

I just tested the build, which passes, and ran SparkPi. The examples jars
directory now has only two jars:

```
$ ls -1 examples/target/scala-2.11/jars/
scopt_2.11-3.3.0.jar
spark-examples_2.11-2.0.0-SNAPSHOT.jar
```

Author: Marcelo Vanzin <vanzin@cloudera.com>

Closes #12544 from vanzin/SPARK-14744.
2016-04-25 10:20:51 -07:00
Dongjoon Hyun 6ab4d9e0c7 [SPARK-14883][DOCS] Fix wrong R examples and make them up-to-date
## What changes were proposed in this pull request?

This issue aims to fix some errors in R examples and make them up-to-date in docs and example modules.

- Remove the wrong usage of `map`. We need to use `lapply` in `sparkR` if needed. However, `lapply` is private so far. The corrected example will be added later.
- Fix the wrong example in Section `Generic Load/Save Functions` of `docs/sql-programming-guide.md` for consistency
- Fix datatypes in `sparkr.md`.
- Update a data result in `sparkr.md`.
- Replace deprecated functions to remove warnings: jsonFile -> read.json, parquetFile -> read.parquet
- Use up-to-date R-like functions: loadDF -> read.df, saveDF -> write.df, saveAsParquetFile -> write.parquet
- Replace `SparkR DataFrame` with `SparkDataFrame` in `dataframe.R` and `data-manipulation.R`.
- Other minor syntax fixes and a typo.

## How was this patch tested?

Manual.

Author: Dongjoon Hyun <dongjoon@apache.org>

Closes #12649 from dongjoon-hyun/SPARK-14883.
2016-04-24 22:10:27 -07:00
Sean Owen be0d5d3bbe [SPARK-14873][CORE] Java sampleByKey methods take ju.Map but with Scala Double values; results in type Object
## What changes were proposed in this pull request?

Java `sampleByKey` methods should accept `Map` with `java.lang.Double` values

## How was this patch tested?

Existing (updated) Jenkins tests

Author: Sean Owen <sowen@cloudera.com>

Closes #12637 from srowen/SPARK-14873.
2016-04-23 10:47:50 -07:00
Sean Owen 8bd05c9db2 [SPARK-8393][STREAMING] JavaStreamingContext#awaitTermination() throws non-declared InterruptedException
## What changes were proposed in this pull request?

`JavaStreamingContext.awaitTermination` methods should be declared as `throws[InterruptedException]` so that this exception can be handled in Java code. Note this is not just a doc change, but an API change, since now (in Java) the method has a checked exception to handle. All await-like methods in Java APIs behave this way, so seems worthwhile for 2.0.

## How was this patch tested?

Jenkins tests

Author: Sean Owen <sowen@cloudera.com>

Closes #12418 from srowen/SPARK-8393.
2016-04-21 11:03:16 +01:00
Yuhao Yang ed9d803854 [SPARK-14635][ML] Documentation and Examples for TF-IDF only refer to HashingTF
## What changes were proposed in this pull request?

Currently, the docs for TF-IDF only refer to using HashingTF with IDF. However, CountVectorizer can also be used. We should probably amend the user guide and examples to show this.

## How was this patch tested?

unit tests and doc generation

Author: Yuhao Yang <hhbyyh@gmail.com>

Closes #12454 from hhbyyh/tfdoc.
2016-04-20 11:45:08 +01:00
Zheng RuiFeng 9bfb35da1e [SPARK-14515][DOC] Add python example for ChiSqSelector
## What changes were proposed in this pull request?
Add the missing python example for ChiSqSelector

## How was this patch tested?
manual tests

Author: Zheng RuiFeng <ruifengz@foxmail.com>

Closes #12283 from zhengruifeng/chi2_pe.
2016-04-18 17:14:22 -07:00
hyukjinkwon 6fc1e72d9b [MINOR] Revert removing explicit typing (changed in some examples and StatFunctions)
## What changes were proposed in this pull request?

This PR reverts some changes in https://github.com/apache/spark/pull/12413. (please see the discussion in that PR).

from
```scala
    words.foreachRDD { (rdd, time) =>
    ...
```

to
```scala
    words.foreachRDD { (rdd: RDD[String], time: Time) =>
    ...
```

Also, this was discussed in dev-mailing list, [here](http://apache-spark-developers-list.1001551.n3.nabble.com/Question-about-Scala-style-explicit-typing-within-transformation-functions-and-anonymous-val-td17173.html)

## How was this patch tested?

This was tested with `sbt scalastyle`.

Author: hyukjinkwon <gurwls223@gmail.com>

Closes #12452 from HyukjinKwon/revert-explicit-typing.
2016-04-18 13:45:03 -07:00
Xusen Yin 8c62edb70f [SPARK-14299][EXAMPLES] Remove duplications for scala.examples.ml
## What changes were proposed in this pull request?

https://issues.apache.org/jira/browse/SPARK-14299

Delete duplications in scala/examples/ml.

TrainValidationSplitExample.scala --> ModelSelectionViaTrainValidationSplitExample
CrossValidatorExample.scala --> ModelSelectionViaCrossValidationExample

## How was this patch tested?

Existing tests passed.

(If this patch involves UI changes, please attach a screenshot; otherwise, remove this)

Author: Xusen Yin <yinxusen@gmail.com>

Closes #12366 from yinxusen/SPARK-14299-2.
2016-04-18 13:34:36 -07:00
hyukjinkwon 9f678e9754 [MINOR] Remove inappropriate type notation and extra anonymous closure within functional transformations
## What changes were proposed in this pull request?

This PR removes

- Inappropriate type notations
    For example, from
    ```scala
    words.foreachRDD { (rdd: RDD[String], time: Time) =>
    ...
    ```
    to
    ```scala
    words.foreachRDD { (rdd, time) =>
    ...
    ```

- Extra anonymous closure within functional transformations.
    For example,
    ```scala
    .map(item => {
      ...
    })
    ```

    which can be just simply as below:

    ```scala
    .map { item =>
      ...
    }
    ```

and corrects some obvious style nits.

## How was this patch tested?

This was tested after adding rules in `scalastyle-config.xml`, which ended up with not finding all perfectly.

The rules applied were below:

- For the first correction,

```xml
<check customId="NoExtraClosure" level="error" class="org.scalastyle.file.RegexChecker" enabled="true">
    <parameters><parameter name="regex">(?m)\.[a-zA-Z_][a-zA-Z0-9]*\(\s*[^,]+s*=>\s*\{[^\}]+\}\s*\)</parameter></parameters>
</check>
```

```xml
<check customId="NoExtraClosure" level="error" class="org.scalastyle.file.RegexChecker" enabled="true">
    <parameters><parameter name="regex">\.[a-zA-Z_][a-zA-Z0-9]*\s*[\{|\(]([^\n>,]+=>)?\s*\{([^()]|(?R))*\}^[,]</parameter></parameters>
</check>
```

- For the second correction
```xml
<check customId="TypeNotation" level="error" class="org.scalastyle.file.RegexChecker" enabled="true">
    <parameters><parameter name="regex">\.[a-zA-Z_][a-zA-Z0-9]*\s*[\{|\(]\s*\([^):]*:R))*\}^[,]</parameter></parameters>
</check>
```

**Those rules were not added**

Author: hyukjinkwon <gurwls223@gmail.com>

Closes #12413 from HyukjinKwon/SPARK-style.
2016-04-16 14:56:23 +01:00
hyukjinkwon 6fc3dc8839 [MINOR][SQL] Remove extra anonymous closure within functional transformations
## What changes were proposed in this pull request?

This PR removes extra anonymous closure within functional transformations.

For example,

```scala
.map(item => {
  ...
})
```

which can be just simply as below:

```scala
.map { item =>
  ...
}
```

## How was this patch tested?

Related unit tests and `sbt scalastyle`.

Author: hyukjinkwon <gurwls223@gmail.com>

Closes #12382 from HyukjinKwon/minor-extra-closers.
2016-04-14 09:43:41 +01:00
Yuhao Yang 781df49983 [SPARK-13089][ML] [Doc] spark.ml Naive Bayes user guide and examples
jira: https://issues.apache.org/jira/browse/SPARK-13089

Add section in ml-classification.md for NaiveBayes DataFrame-based API, plus example code (using include_example to clip code from examples/ folder files).

Author: Yuhao Yang <hhbyyh@gmail.com>

Closes #11015 from hhbyyh/naiveBayesDoc.
2016-04-13 13:58:35 -07:00
Zheng RuiFeng fcdd69260e [SPARK-14509][DOC] Add python CountVectorizerExample
## What changes were proposed in this pull request?
Add python CountVectorizerExample

## How was this patch tested?
manual tests

Author: Zheng RuiFeng <ruifengz@foxmail.com>

Closes #11917 from zhengruifeng/cv_pe.
2016-04-13 13:56:23 -07:00
Dongjoon Hyun b0f5497e95 [SPARK-14508][BUILD] Add a new ScalaStyle Rule OmitBracesInCase
## What changes were proposed in this pull request?

According to the [Spark Code Style Guide](https://cwiki.apache.org/confluence/display/SPARK/Spark+Code+Style+Guide) and [Scala Style Guide](http://docs.scala-lang.org/style/control-structures.html#curlybraces), we had better enforce the following rule.
  ```
  case: Always omit braces in case clauses.
  ```
This PR makes a new ScalaStyle rule, 'OmitBracesInCase', and enforces it to the code.

## How was this patch tested?

Pass the Jenkins tests (including Scala style checking)

Author: Dongjoon Hyun <dongjoon@apache.org>

Closes #12280 from dongjoon-hyun/SPARK-14508.
2016-04-12 00:43:28 -07:00
Joseph K. Bradley e9e1adc036 [MINOR][ML] Fixed MLlib build warnings
## What changes were proposed in this pull request?

Fixes to eliminate warnings during package and doc builds.

## How was this patch tested?

Existing unit tests

Author: Joseph K. Bradley <joseph@databricks.com>

Closes #12263 from jkbradley/warning-cleanups.
2016-04-12 03:24:26 +01:00
Xiangrui Meng 1c751fcf48 [SPARK-14500] [ML] Accept Dataset[_] instead of DataFrame in MLlib APIs
## What changes were proposed in this pull request?

This PR updates MLlib APIs to accept `Dataset[_]` as input where `DataFrame` was the input type. This PR doesn't change the output type. In Java, `Dataset[_]` maps to `Dataset<?>`, which includes `Dataset<Row>`. Some implementations were changed in order to return `DataFrame`. Tests and examples were updated. Note that this is a breaking change for subclasses of Transformer/Estimator.

Lol, we don't have to rename the input argument, which has been `dataset` since Spark 1.2.

TODOs:
- [x] update MiMaExcludes (seems all covered by explicit filters from SPARK-13920)
- [x] Python
- [x] add a new test to accept Dataset[LabeledPoint]
- [x] remove unused imports of Dataset

## How was this patch tested?

Exiting unit tests with some modifications.

cc: rxin jkbradley

Author: Xiangrui Meng <meng@databricks.com>

Closes #12274 from mengxr/SPARK-14500.
2016-04-11 09:28:28 -07:00
Örjan Lundberg b5c785629a Update KMeansExample.scala
## What changes were proposed in this pull request?
example does not work wo DataFrame import

## How was this patch tested?

example doc only

example does not work wo DataFrame import

Author: Örjan Lundberg <orjan.lundberg@gmail.com>

Closes #12277 from oluies/patch-1.
2016-04-10 16:30:30 +01:00
Yong Tang 72e66bb270 [SPARK-14301][EXAMPLES] Java examples code merge and clean up.
## What changes were proposed in this pull request?

This fix tries to remove duplicate Java code in examples/mllib and examples/ml. The following changes have been made:

```
deleted: ml/JavaCrossValidatorExample.java (duplicate of JavaModelSelectionViaCrossValidationExample.java)
deleted: ml/JavaTrainValidationSplitExample.java (duplicated of JavaModelSelectionViaTrainValidationSplitExample.java)
deleted: mllib/JavaFPGrowthExample.java (duplicate of JavaSimpleFPGrowth.java)
deleted: mllib/JavaLDAExample.java (duplicate of JavaLatentDirichletAllocationExample.java)
deleted: mllib/JavaKMeans.java (merged with JavaKMeansExample.java)
deleted: mllib/JavaLR.java (duplicate of JavaLinearRegressionWithSGDExample.java)
updated: mllib/JavaKMeansExample.java (merged with mllib/JavaKMeans.java)
```

## How was this patch tested?
Existing tests passed.

Author: Yong Tang <yong.tang.github@outlook.com>

Closes #12143 from yongtang/SPARK-14301.
2016-04-10 02:37:11 +01:00
Zheng RuiFeng adb9d73cd6 [SPARK-14339][DOC] Add python examples for DCT,MinMaxScaler,MaxAbsScaler
## What changes were proposed in this pull request?
add three python examples

## How was this patch tested?
manual tests

Author: Zheng RuiFeng <ruifengz@foxmail.com>

Closes #12063 from zhengruifeng/dct_pe.
2016-04-09 11:25:39 -07:00
Dongjoon Hyun d717ae1fd7 [SPARK-14444][BUILD] Add a new scalastyle NoScalaDoc to prevent ScalaDoc-style multiline comments
## What changes were proposed in this pull request?

According to the [Spark Code Style Guide](https://cwiki.apache.org/confluence/display/SPARK/Spark+Code+Style+Guide#SparkCodeStyleGuide-Indentation), this PR adds a new scalastyle rule to prevent the followings.
```
/** In Spark, we don't use the ScalaDoc style so this
  * is not correct.
  */
```

## How was this patch tested?

Pass the Jenkins tests (including `lint-scala`).

Author: Dongjoon Hyun <dongjoon@apache.org>

Closes #12221 from dongjoon-hyun/SPARK-14444.
2016-04-06 16:02:55 -07:00
Dongjoon Hyun 3f749f7ed4 [SPARK-14355][BUILD] Fix typos in Exception/Testcase/Comments and static analysis results
## What changes were proposed in this pull request?

This PR contains the following 5 types of maintenance fix over 59 files (+94 lines, -93 lines).
- Fix typos(exception/log strings, testcase name, comments) in 44 lines.
- Fix lint-java errors (MaxLineLength) in 6 lines. (New codes after SPARK-14011)
- Use diamond operators in 40 lines. (New codes after SPARK-13702)
- Fix redundant semicolon in 5 lines.
- Rename class `InferSchemaSuite` to `CSVInferSchemaSuite` in CSVInferSchemaSuite.scala.

## How was this patch tested?

Manual and pass the Jenkins tests.

Author: Dongjoon Hyun <dongjoon@apache.org>

Closes #12139 from dongjoon-hyun/SPARK-14355.
2016-04-03 18:14:16 -07:00
Dongjoon Hyun 4a6e78abd9 [MINOR][DOCS] Use multi-line JavaDoc comments in Scala code.
## What changes were proposed in this pull request?

This PR aims to fix all Scala-Style multiline comments into Java-Style multiline comments in Scala codes.
(All comment-only changes over 77 files: +786 lines, −747 lines)

## How was this patch tested?

Manual.

Author: Dongjoon Hyun <dongjoon@apache.org>

Closes #12130 from dongjoon-hyun/use_multiine_javadoc_comments.
2016-04-02 17:50:40 -07:00
Jacek Laskowski 06694f1c68 [MINOR] Typo fixes
## What changes were proposed in this pull request?

Typo fixes. No functional changes.

## How was this patch tested?

Built the sources and ran with samples.

Author: Jacek Laskowski <jacek@japila.pl>

Closes #11802 from jaceklaskowski/typo-fixes.
2016-04-02 08:12:04 -07:00
Dongjoon Hyun 1808465855 [MINOR] Fix newly added java-lint errors
## What changes were proposed in this pull request?

This PR fixes some newly added java-lint errors(unused-imports, line-lengsth).

## How was this patch tested?

Pass the Jenkins tests.

Author: Dongjoon Hyun <dongjoon@apache.org>

Closes #11968 from dongjoon-hyun/SPARK-14167.
2016-03-26 11:55:49 +00:00
Shixiong Zhu d23ad7c1c9 [SPARK-13874][DOC] Remove docs of streaming-akka, streaming-zeromq, streaming-mqtt and streaming-twitter
## What changes were proposed in this pull request?

This PR removes all docs about the old streaming-akka, streaming-zeromq, streaming-mqtt and streaming-twitter projects since I have already copied them to https://github.com/spark-packages

Also remove mqtt_wordcount.py that I forgot to remove previously.

## How was this patch tested?

Jenkins PR Build.

Author: Shixiong Zhu <shixiong@databricks.com>

Closes #11824 from zsxwing/remove-doc.
2016-03-26 01:47:27 -07:00
Shixiong Zhu 24587ce433 [SPARK-14073][STREAMING][TEST-MAVEN] Move flume back to Spark
## What changes were proposed in this pull request?

This PR moves flume back to Spark as per the discussion in the dev mail-list.

## How was this patch tested?

Existing Jenkins tests.

Author: Shixiong Zhu <shixiong@databricks.com>

Closes #11895 from zsxwing/move-flume-back.
2016-03-25 17:37:16 -07:00
Xin Ren d283223a5a [SPARK-13017][DOCS] Replace example code in mllib-feature-extraction.md using include_example
Replace example code in mllib-feature-extraction.md using include_example
https://issues.apache.org/jira/browse/SPARK-13017

The example code in the user guide is embedded in the markdown and hence it is not easy to test. It would be nice to automatically test them. This JIRA is to discuss options to automate example code testing and see what we can do in Spark 1.6.

Goal is to move actual example code to spark/examples and test compilation in Jenkins builds. Then in the markdown, we can reference part of the code to show in the user guide. This requires adding a Jekyll tag that is similar to https://github.com/jekyll/jekyll/blob/master/lib/jekyll/tags/include.rb, e.g., called include_example.
`{% include_example scala/org/apache/spark/examples/mllib/TFIDFExample.scala %}`
Jekyll will find `examples/src/main/scala/org/apache/spark/examples/mllib/TFIDFExample.scala` and pick code blocks marked "example" and replace code block in
`{% highlight %}`
 in the markdown.

See more sub-tasks in parent ticket: https://issues.apache.org/jira/browse/SPARK-11337

Author: Xin Ren <iamshrek@126.com>

Closes #11142 from keypointt/SPARK-13017.
2016-03-24 14:25:10 -07:00
Xin Ren dd9ca7b960 [SPARK-13019][DOCS] fix for scala-2.10 build: Replace example code in mllib-statistics.md using include_example
## What changes were proposed in this pull request?

This PR for ticket SPARK-13019 is based on previous PR(https://github.com/apache/spark/pull/11108).
Since PR(https://github.com/apache/spark/pull/11108) is breaking scala-2.10 build, more work is needed to fix build errors.

What I did new in this PR is adding keyword argument for 'fractions':
`    val approxSample = data.sampleByKey(withReplacement = false, fractions = fractions)`
`    val exactSample = data.sampleByKeyExact(withReplacement = false, fractions = fractions)`

I reopened ticket on JIRA but sorry I don't know how to reopen a GitHub pull request, so I just submitting a new pull request.
## How was this patch tested?

Manual build testing on local machine, build based on scala-2.10.

Author: Xin Ren <iamshrek@126.com>

Closes #11901 from keypointt/SPARK-13019.
2016-03-24 09:34:54 +00:00
Xiangrui Meng 43ef1e52bf Revert "[SPARK-13019][DOCS] Replace example code in mllib-statistics.md using include_example"
This reverts commit 1af8de200c.
2016-03-21 17:42:30 -07:00
Xin Ren 1af8de200c [SPARK-13019][DOCS] Replace example code in mllib-statistics.md using include_example
https://issues.apache.org/jira/browse/SPARK-13019

The example code in the user guide is embedded in the markdown and hence it is not easy to test. It would be nice to automatically test them. This JIRA is to discuss options to automate example code testing and see what we can do in Spark 1.6.

Goal is to move actual example code to spark/examples and test compilation in Jenkins builds. Then in the markdown, we can reference part of the code to show in the user guide. This requires adding a Jekyll tag that is similar to https://github.com/jekyll/jekyll/blob/master/lib/jekyll/tags/include.rb, e.g., called include_example.
`{% include_example scala/org/apache/spark/examples/mllib/SummaryStatisticsExample.scala %}`
Jekyll will find `examples/src/main/scala/org/apache/spark/examples/mllib/SummaryStatisticsExample.scala` and pick code blocks marked "example" and replace code block in
`{% highlight %}`
 in the markdown.

See more sub-tasks in parent ticket: https://issues.apache.org/jira/browse/SPARK-11337

Author: Xin Ren <iamshrek@126.com>

Closes #11108 from keypointt/SPARK-13019.
2016-03-21 16:09:34 -07:00
Dongjoon Hyun 20fd254101 [SPARK-14011][CORE][SQL] Enable LineLength Java checkstyle rule
## What changes were proposed in this pull request?

[Spark Coding Style Guide](https://cwiki.apache.org/confluence/display/SPARK/Spark+Code+Style+Guide) has 100-character limit on lines, but it's disabled for Java since 11/09/15. This PR enables **LineLength** checkstyle again. To help that, this also introduces **RedundantImport** and **RedundantModifier**, too. The following is the diff on `checkstyle.xml`.

```xml
-        <!-- TODO: 11/09/15 disabled - the lengths are currently > 100 in many places -->
-        <!--
         <module name="LineLength">
             <property name="max" value="100"/>
             <property name="ignorePattern" value="^package.*|^import.*|a href|href|http://|https://|ftp://"/>
         </module>
-        -->
         <module name="NoLineWrap"/>
         <module name="EmptyBlock">
             <property name="option" value="TEXT"/>
 -167,5 +164,7
         </module>
         <module name="CommentsIndentation"/>
         <module name="UnusedImports"/>
+        <module name="RedundantImport"/>
+        <module name="RedundantModifier"/>
```

## How was this patch tested?

Currently, `lint-java` is disabled in Jenkins. It needs a manual test.
After passing the Jenkins tests, `dev/lint-java` should passes locally.

Author: Dongjoon Hyun <dongjoon@apache.org>

Closes #11831 from dongjoon-hyun/SPARK-14011.
2016-03-21 07:58:57 +00:00
Zheng RuiFeng 53f32a22da [MINOR][DOC] Fix nits in JavaStreamingTestExample
## What changes were proposed in this pull request?

Fix some nits discussed in https://github.com/apache/spark/pull/11776#issuecomment-198207419
use !rdd.isEmpty instead of rdd.count > 0
use static instead of AtomicInteger
remove unneeded "throws Exception"

## How was this patch tested?

manual tests

Author: Zheng RuiFeng <ruifengz@foxmail.com>

Closes #11821 from zhengruifeng/je_fix.
2016-03-18 12:34:14 +00:00
Wenchen Fan 8ef3399aff [SPARK-13928] Move org.apache.spark.Logging into org.apache.spark.internal.Logging
## What changes were proposed in this pull request?

Logging was made private in Spark 2.0. If we move it, then users would be able to create a Logging trait themselves to avoid changing their own code.

## How was this patch tested?

existing tests.

Author: Wenchen Fan <wenchen@databricks.com>

Closes #11764 from cloud-fan/logger.
2016-03-17 19:23:38 +08:00
Zheng RuiFeng 204c9dec2c [MINOR][DOC] Add JavaStreamingTestExample
## What changes were proposed in this pull request?

Add the java example of StreamingTest

## How was this patch tested?

manual tests in CLI: bin/run-example mllib.JavaStreamingTestExample dataDir 5 100

Author: Zheng RuiFeng <ruifengz@foxmail.com>

Closes #11776 from zhengruifeng/streaming_je.
2016-03-17 11:09:02 +02:00
Shixiong Zhu 06dec37455 [SPARK-13843][STREAMING] Remove streaming-flume, streaming-mqtt, streaming-zeromq, streaming-akka, streaming-twitter to Spark packages
## What changes were proposed in this pull request?

Currently there are a few sub-projects, each for integrating with different external sources for Streaming.  Now that we have better ability to include external libraries (spark packages) and with Spark 2.0 coming up, we can move the following projects out of Spark to https://github.com/spark-packages

- streaming-flume
- streaming-akka
- streaming-mqtt
- streaming-zeromq
- streaming-twitter

They are just some ancillary packages and considering the overhead of maintenance, running tests and PR failures, it's better to maintain them out of Spark. In addition, these projects can have their different release cycles and we can release them faster.

I have already copied these projects to https://github.com/spark-packages

## How was this patch tested?

Jenkins tests

Author: Shixiong Zhu <shixiong@databricks.com>

Closes #11672 from zsxwing/remove-external-pkg.
2016-03-14 16:56:04 -07:00
Dongjoon Hyun acdf219703 [MINOR][DOCS] Fix more typos in comments/strings.
## What changes were proposed in this pull request?

This PR fixes 135 typos over 107 files:
* 121 typos in comments
* 11 typos in testcase name
* 3 typos in log messages

## How was this patch tested?

Manual.

Author: Dongjoon Hyun <dongjoon@apache.org>

Closes #11689 from dongjoon-hyun/fix_more_typos.
2016-03-14 09:07:39 +00:00
Sean Owen 1840852841 [SPARK-13823][CORE][STREAMING][SQL] Always specify Charset in String <-> byte[] conversions (and remaining Coverity items)
## What changes were proposed in this pull request?

- Fixes calls to `new String(byte[])` or `String.getBytes()` that rely on platform default encoding, to use UTF-8
- Same for `InputStreamReader` and `OutputStreamWriter` constructors
- Standardizes on UTF-8 everywhere
- Standardizes specifying the encoding with `StandardCharsets.UTF-8`, not the Guava constant or "UTF-8" (which means handling `UnuspportedEncodingException`)
- (also addresses the other remaining Coverity scan issues, which are pretty trivial; these are separated into commit 1deecd8d9c )

## How was this patch tested?

Jenkins tests

Author: Sean Owen <sowen@cloudera.com>

Closes #11657 from srowen/SPARK-13823.
2016-03-13 21:03:49 -07:00
Cheng Lian c079420d7c [SPARK-13841][SQL] Removes Dataset.collectRows()/takeRows()
## What changes were proposed in this pull request?

This PR removes two methods, `collectRows()` and `takeRows()`, from `Dataset[T]`. These methods were added in PR #11443, and were later considered not useful.

## How was this patch tested?

Existing tests should do the work.

Author: Cheng Lian <lian@databricks.com>

Closes #11678 from liancheng/remove-collect-rows-and-take-rows.
2016-03-13 12:02:52 +08:00
Zheng RuiFeng 42afd72c65 [SPARK-13814] [PYSPARK] Delete unnecessary imports in python examples files
JIRA:  https://issues.apache.org/jira/browse/SPARK-13814

## What changes were proposed in this pull request?

delete unnecessary imports in python examples files

## How was this patch tested?

manual tests

Author: Zheng RuiFeng <ruifengz@foxmail.com>

Closes #11651 from zhengruifeng/del_import_pe.
2016-03-11 13:49:37 -08:00
Cheng Lian 6d37e1eb90 [SPARK-13817][BUILD][SQL] Re-enable MiMA and removes object DataFrame
## What changes were proposed in this pull request?

PR #11443 temporarily disabled MiMA check, this PR re-enables it.

One extra change is that `object DataFrame` is also removed. The only purpose of introducing `object DataFrame` was to use it as an internal factory for creating `Dataset[Row]`. By replacing this internal factory with `Dataset.newDataFrame`, both `DataFrame` and `DataFrame$` are entirely removed from the API, so that we can simply put a `MissingClassProblem` filter in `MimaExcludes.scala` for most DataFrame API  changes.

## How was this patch tested?

Tested by MiMA check triggered by Jenkins.

Author: Cheng Lian <lian@databricks.com>

Closes #11656 from liancheng/re-enable-mima.
2016-03-11 22:17:50 +08:00
Nick Pentreath 8fff0f92a4 [HOT-FIX][SQL][ML] Fix compile error from use of DataFrame in Java MaxAbsScaler example
## What changes were proposed in this pull request?

Fix build failure introduced in #11392 (change `DataFrame` -> `Dataset<Row>`).

## How was this patch tested?

Existing build/unit tests

Author: Nick Pentreath <nick.pentreath@gmail.com>

Closes #11653 from MLnick/java-maxabs-example-fix.
2016-03-11 10:20:39 +02:00
Yuhao Yang 0b713e0455 [SPARK-13512][ML] add example and doc for MaxAbsScaler
## What changes were proposed in this pull request?

jira: https://issues.apache.org/jira/browse/SPARK-13512
Add example and doc for ml.feature.MaxAbsScaler.

## How was this patch tested?
 unit tests

Author: Yuhao Yang <hhbyyh@gmail.com>

Closes #11392 from hhbyyh/maxabsdoc.
2016-03-11 09:31:35 +02:00
Zheng RuiFeng d18276cb1d [SPARK-13672][ML] Add python examples of BisectingKMeans in ML and MLLIB
JIRA: https://issues.apache.org/jira/browse/SPARK-13672

## What changes were proposed in this pull request?

add two python examples of BisectingKMeans for ml and mllib

## How was this patch tested?

manual tests

Author: Zheng RuiFeng <ruifengz@foxmail.com>

Closes #11515 from zhengruifeng/mllib_bkm_pe.
2016-03-11 09:21:12 +02:00
Cheng Lian 1d542785b9 [SPARK-13244][SQL] Migrates DataFrame to Dataset
## What changes were proposed in this pull request?

This PR unifies DataFrame and Dataset by migrating existing DataFrame operations to Dataset and make `DataFrame` a type alias of `Dataset[Row]`.

Most Scala code changes are source compatible, but Java API is broken as Java knows nothing about Scala type alias (mostly replacing `DataFrame` with `Dataset<Row>`).

There are several noticeable API changes related to those returning arrays:

1.  `collect`/`take`

    -   Old APIs in class `DataFrame`:

        ```scala
        def collect(): Array[Row]
        def take(n: Int): Array[Row]
        ```

    -   New APIs in class `Dataset[T]`:

        ```scala
        def collect(): Array[T]
        def take(n: Int): Array[T]

        def collectRows(): Array[Row]
        def takeRows(n: Int): Array[Row]
        ```

    Two specialized methods `collectRows` and `takeRows` are added because Java doesn't support returning generic arrays. Thus, for example, `DataFrame.collect(): Array[T]` actually returns `Object` instead of `Array<T>` from Java side.

    Normally, Java users may fall back to `collectAsList` and `takeAsList`.  The two new specialized versions are added to avoid performance regression in ML related code (but maybe I'm wrong and they are not necessary here).

1.  `randomSplit`

    -   Old APIs in class `DataFrame`:

        ```scala
        def randomSplit(weights: Array[Double], seed: Long): Array[DataFrame]
        def randomSplit(weights: Array[Double]): Array[DataFrame]
        ```

    -   New APIs in class `Dataset[T]`:

        ```scala
        def randomSplit(weights: Array[Double], seed: Long): Array[Dataset[T]]
        def randomSplit(weights: Array[Double]): Array[Dataset[T]]
        ```

    Similar problem as above, but hasn't been addressed for Java API yet.  We can probably add `randomSplitAsList` to fix this one.

1.  `groupBy`

    Some original `DataFrame.groupBy` methods have conflicting signature with original `Dataset.groupBy` methods.  To distinguish these two, typed `Dataset.groupBy` methods are renamed to `groupByKey`.

Other noticeable changes:

1.  Dataset always do eager analysis now

    We used to support disabling DataFrame eager analysis to help reporting partially analyzed malformed logical plan on analysis failure.  However, Dataset encoders requires eager analysi during Dataset construction.  To preserve the error reporting feature, `AnalysisException` now takes an extra `Option[LogicalPlan]` argument to hold the partially analyzed plan, so that we can check the plan tree when reporting test failures.  This plan is passed by `QueryExecution.assertAnalyzed`.

## How was this patch tested?

Existing tests do the work.

## TODO

- [ ] Fix all tests
- [ ] Re-enable MiMA check
- [ ] Update ScalaDoc (`since`, `group`, and example code)

Author: Cheng Lian <lian@databricks.com>
Author: Yin Huai <yhuai@databricks.com>
Author: Wenchen Fan <wenchen@databricks.com>
Author: Cheng Lian <liancheng@users.noreply.github.com>

Closes #11443 from liancheng/ds-to-df.
2016-03-10 17:00:17 -08:00
Dongjoon Hyun 91fed8e9c5 [SPARK-3854][BUILD] Scala style: require spaces before {.
## What changes were proposed in this pull request?

Since the opening curly brace, '{', has many usages as discussed in [SPARK-3854](https://issues.apache.org/jira/browse/SPARK-3854), this PR adds a ScalaStyle rule to prevent '){' pattern  for the following majority pattern and fixes the code accordingly. If we enforce this in ScalaStyle from now, it will improve the Scala code quality and reduce review time.
```
// Correct:
if (true) {
  println("Wow!")
}

// Incorrect:
if (true){
   println("Wow!")
}
```
IntelliJ also shows new warnings based on this.

## How was this patch tested?

Pass the Jenkins ScalaStyle test.

Author: Dongjoon Hyun <dongjoon@apache.org>

Closes #11637 from dongjoon-hyun/SPARK-3854.
2016-03-10 15:57:22 -08:00
JeremyNixon 3e3c3d58d8 [SPARK-13706][ML] Add Python Example for Train Validation Split
## What changes were proposed in this pull request?

This pull request adds a python example for train validation split.

## How was this patch tested?

This was style tested through lint-python, generally tested with ./dev/run-tests, and run in notebook and shell environments. It was viewed in docs locally with jekyll serve.

This contribution is my original work and I license it to Spark under its open source license.

Author: JeremyNixon <jnixon2@gmail.com>

Closes #11547 from JeremyNixon/tvs_example.
2016-03-10 09:18:15 +02:00
Dongjoon Hyun c3689bc24e [SPARK-13702][CORE][SQL][MLLIB] Use diamond operator for generic instance creation in Java code.
## What changes were proposed in this pull request?

In order to make `docs/examples` (and other related code) more simple/readable/user-friendly, this PR replaces existing codes like the followings by using `diamond` operator.

```
-    final ArrayList<Product2<Object, Object>> dataToWrite =
-      new ArrayList<Product2<Object, Object>>();
+    final ArrayList<Product2<Object, Object>> dataToWrite = new ArrayList<>();
```

Java 7 or higher supports **diamond** operator which replaces the type arguments required to invoke the constructor of a generic class with an empty set of type parameters (<>). Currently, Spark Java code use mixed usage of this.

## How was this patch tested?

Manual.
Pass the existing tests.

Author: Dongjoon Hyun <dongjoon@apache.org>

Closes #11541 from dongjoon-hyun/SPARK-13702.
2016-03-09 10:31:26 +00:00
Dongjoon Hyun f3201aeeb0 [SPARK-13692][CORE][SQL] Fix trivial Coverity/Checkstyle defects
## What changes were proposed in this pull request?

This issue fixes the following potential bugs and Java coding style detected by Coverity and Checkstyle.

- Implement both null and type checking in equals functions.
- Fix wrong type casting logic in SimpleJavaBean2.equals.
- Add `implement Cloneable` to `UTF8String` and `SortedIterator`.
- Remove dereferencing before null check in `AbstractBytesToBytesMapSuite`.
- Fix coding style: Add '{}' to single `for` statement in mllib examples.
- Remove unused imports in `ColumnarBatch` and `JavaKinesisStreamSuite`.
- Remove unused fields in `ChunkFetchIntegrationSuite`.
- Add `stop()` to prevent resource leak.

Please note that the last two checkstyle errors exist on newly added commits after [SPARK-13583](https://issues.apache.org/jira/browse/SPARK-13583).

## How was this patch tested?

manual via `./dev/lint-java` and Coverity site.

Author: Dongjoon Hyun <dongjoon@apache.org>

Closes #11530 from dongjoon-hyun/SPARK-13692.
2016-03-09 10:12:23 +00:00
Yury Liavitski 03f57a6c2d Fixing the type of the sentiment happiness value
## What changes were proposed in this pull request?

Added the conversion to int for the 'happiness value' read from the file. Otherwise, later on line 75 the multiplication will multiply a string by a number, yielding values like "-2-2" instead of -4.

## How was this patch tested?

Tested manually.

Author: Yury Liavitski <seconds.before@gmail.com>
Author: Yury Liavitski <yury.liavitski@il111.ice.local>

Closes #11540 from heliocentrist/fix-sentiment-value-type.
2016-03-07 10:54:33 +00:00
Dongjoon Hyun 941b270b70 [MINOR] Fix typos in comments and testcase name of code
## What changes were proposed in this pull request?

This PR fixes typos in comments and testcase name of code.

## How was this patch tested?

manual.

Author: Dongjoon Hyun <dongjoon@apache.org>

Closes #11481 from dongjoon-hyun/minor_fix_typos_in_code.
2016-03-03 22:42:12 +00:00
Xin Ren 70f6f9649b [SPARK-13013][DOCS] Replace example code in mllib-clustering.md using include_example
Replace example code in mllib-clustering.md using include_example
https://issues.apache.org/jira/browse/SPARK-13013

The example code in the user guide is embedded in the markdown and hence it is not easy to test. It would be nice to automatically test them. This JIRA is to discuss options to automate example code testing and see what we can do in Spark 1.6.

Goal is to move actual example code to spark/examples and test compilation in Jenkins builds. Then in the markdown, we can reference part of the code to show in the user guide. This requires adding a Jekyll tag that is similar to https://github.com/jekyll/jekyll/blob/master/lib/jekyll/tags/include.rb, e.g., called include_example.
`{% include_example scala/org/apache/spark/examples/mllib/KMeansExample.scala %}`
Jekyll will find `examples/src/main/scala/org/apache/spark/examples/mllib/KMeansExample.scala` and pick code blocks marked "example" and replace code block in
`{% highlight %}`
 in the markdown.

See more sub-tasks in parent ticket: https://issues.apache.org/jira/browse/SPARK-11337

Author: Xin Ren <iamshrek@126.com>

Closes #11116 from keypointt/SPARK-13013.
2016-03-03 09:32:47 -08:00
Dongjoon Hyun b5f02d6743 [SPARK-13583][CORE][STREAMING] Remove unused imports and add checkstyle rule
## What changes were proposed in this pull request?

After SPARK-6990, `dev/lint-java` keeps Java code healthy and helps PR review by saving much time.
This issue aims remove unused imports from Java/Scala code and add `UnusedImports` checkstyle rule to help developers.

## How was this patch tested?
```
./dev/lint-java
./build/sbt compile
```

Author: Dongjoon Hyun <dongjoon@apache.org>

Closes #11438 from dongjoon-hyun/SPARK-13583.
2016-03-03 10:12:32 +00:00
Sean Owen e97fc7f176 [SPARK-13423][WIP][CORE][SQL][STREAMING] Static analysis fixes for 2.x
## What changes were proposed in this pull request?

Make some cross-cutting code improvements according to static analysis. These are individually up for discussion since they exist in separate commits that can be reverted. The changes are broadly:

- Inner class should be static
- Mismatched hashCode/equals
- Overflow in compareTo
- Unchecked warnings
- Misuse of assert, vs junit.assert
- get(a) + getOrElse(b) -> getOrElse(a,b)
- Array/String .size -> .length (occasionally, -> .isEmpty / .nonEmpty) to avoid implicit conversions
- Dead code
- tailrec
- exists(_ == ) -> contains find + nonEmpty -> exists filter + size -> count
- reduce(_+_) -> sum map + flatten -> map

The most controversial may be .size -> .length simply because of its size. It is intended to avoid implicits that might be expensive in some places.

## How was the this patch tested?

Existing Jenkins unit tests.

Author: Sean Owen <sowen@cloudera.com>

Closes #11292 from srowen/SPARK-13423.
2016-03-03 09:54:09 +00:00
Dongjoon Hyun 02b7677e95 [HOT-FIX] Recover some deprecations for 2.10 compatibility.
## What changes were proposed in this pull request?

#11479 [SPARK-13627] broke 2.10 compatibility: [2.10-Build](https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Compile/job/spark-master-compile-maven-scala-2.10/292/console)
At this moment, we need to support both 2.10 and 2.11.
This PR recovers some deprecated methods which were replace by [SPARK-13627].

## How was this patch tested?

Jenkins build: Both 2.10, 2.11.

Author: Dongjoon Hyun <dongjoon@apache.org>

Closes #11488 from dongjoon-hyun/hotfix_compatibility_with_2.10.
2016-03-03 09:53:02 +00:00
Dongjoon Hyun 9c274ac4a6 [SPARK-13627][SQL][YARN] Fix simple deprecation warnings.
## What changes were proposed in this pull request?

This PR aims to fix the following deprecation warnings.
  * MethodSymbolApi.paramss--> paramLists
  * AnnotationApi.tpe -> tree.tpe
  * BufferLike.readOnly -> toList.
  * StandardNames.nme -> termNames
  * scala.tools.nsc.interpreter.AbstractFileClassLoader -> scala.reflect.internal.util.AbstractFileClassLoader
  * TypeApi.declarations-> decls

## How was this patch tested?

Check the compile build log and pass the tests.
```
./build/sbt
```

Author: Dongjoon Hyun <dongjoon@apache.org>

Closes #11479 from dongjoon-hyun/SPARK-13627.
2016-03-02 20:34:22 -08:00
Dongjoon Hyun 366f26d2da [MINOR][STREAMING] Replace deprecated apply with create in example.
## What changes were proposed in this pull request?

Twitter Algebird deprecated `apply` in HyperLogLog.scala.
```
deprecated("Use toHLL", since = "0.10.0 / 2015-05")
def apply[T <% Array[Byte]](t: T) = create(t)
```
This PR replace the deprecated usage `apply` with new `create`
according to the upstream change.

## How was this patch tested?
manual.
```
/bin/spark-submit --class org.apache.spark.examples.streaming.TwitterAlgebirdHLL examples/target/scala-2.11/spark-examples-2.0.0-SNAPSHOT-hadoop2.2.0.jar
```

Author: Dongjoon Hyun <dongjoon@apache.org>

Closes #11451 from dongjoon-hyun/replace_deprecated_hll_apply.
2016-03-02 11:48:23 +00:00
Zheng RuiFeng 3c5f5e3b5c [SPARK-13550][ML] Add java example for ml.clustering.BisectingKMeans
JIRA: https://issues.apache.org/jira/browse/SPARK-13550

## What changes were proposed in this pull request?

Just add a java example for ml.clustering.BisectingKMeans

## How was this patch tested?

manual tests were done.

(If this patch involves UI changes, please attach a screenshot; otherwise, remove this)

Author: Zheng RuiFeng <ruifengz@foxmail.com>

Closes #11428 from zhengruifeng/ml_bkm_je.
2016-02-29 23:55:26 -08:00
Zheng RuiFeng 0a4b620f31 [SPARK-13551][MLLIB] Fix wrong comment and remove meanless lines in mllib.JavaBisectingKMeansExample
JIRA: https://issues.apache.org/jira/browse/SPARK-13551

## What changes were proposed in this pull request?

Fix wrong comment and remove meanless lines in mllib.JavaBisectingKMeansExample

## How was this patch tested?

manual test

Author: Zheng RuiFeng <ruifengz@foxmail.com>

Closes #11429 from zhengruifeng/mllib_bkm_je.
2016-02-29 22:24:43 -08:00
Dongjoon Hyun 7af0de076f [SPARK-11381][DOCS] Replace example code in mllib-linear-methods.md using include_example
## What changes were proposed in this pull request?

This PR replaces example codes in `mllib-linear-methods.md` using `include_example`
by doing the followings:
  * Extracts the example codes(Scala,Java,Python) as files in `example` module.
  * Merges some dialog-style examples into a single file.
  * Hide redundant codes in HTML for the consistency with other docs.

## How was the this patch tested?

manual test.
This PR can be tested by document generations, `SKIP_API=1 jekyll build`.

Author: Dongjoon Hyun <dongjoon@apache.org>

Closes #11320 from dongjoon-hyun/SPARK-11381.
2016-02-26 08:31:55 -08:00
Cheng Lian 99dfcedbfd [SPARK-13457][SQL] Removes DataFrame RDD operations
## What changes were proposed in this pull request?

This is another try of PR #11323.

This PR removes DataFrame RDD operations except for `foreach` and `foreachPartitions` (they are actions rather than transformations). Original calls are now replaced by calls to methods of `DataFrame.rdd`.

PR #11323 was reverted because it introduced a regression: both `DataFrame.foreach` and `DataFrame.foreachPartitions` wrap underlying RDD operations with `withNewExecutionId` to track Spark jobs. But they are removed in #11323.

## How was the this patch tested?

No extra tests are added. Existing tests should do the work.

Author: Cheng Lian <lian@databricks.com>

Closes #11388 from liancheng/remove-df-rdd-ops.
2016-02-27 00:28:30 +08:00
Davies Liu 751724b132 Revert "[SPARK-13457][SQL] Removes DataFrame RDD operations"
This reverts commit 157fe64f3e.
2016-02-25 11:53:48 -08:00
Cheng Lian 157fe64f3e [SPARK-13457][SQL] Removes DataFrame RDD operations
## What changes were proposed in this pull request?

This PR removes DataFrame RDD operations. Original calls are now replaced by calls to methods of `DataFrame.rdd`.

## How was the this patch tested?

No extra tests are added. Existing tests should do the work.

Author: Cheng Lian <lian@databricks.com>

Closes #11323 from liancheng/remove-df-rdd-ops.
2016-02-25 23:07:59 +08:00
JeremyNixon 230bbeaa61 [SPARK-10759][ML] update cross validator with include_example
This pull request uses {%include_example%} to add an example for the python cross validator to ml-guide.

Author: JeremyNixon <jnixon2@gmail.com>

Closes #11240 from JeremyNixon/pipeline_include_example.
2016-02-23 15:57:29 -08:00
movelikeriver 5cd3e6f60b [SPARK-13257][IMPROVEMENT] Refine naive Bayes example by checking model after loading it
Refine naive Bayes example by checking model after loading it

Author: movelikeriver <mars.lenjoy@gmail.com>

Closes #11125 from movelikeriver/naive_bayes.
2016-02-22 23:58:54 -08:00
Devaraj K 02b1fefffb [SPARK-13012][DOCUMENTATION] Replace example code in ml-guide.md using include_example
Replaced example code in ml-guide.md using include_example

Author: Devaraj K <devaraj@apache.org>

Closes #11053 from devaraj-kavali/SPARK-13012.
2016-02-22 17:21:37 -08:00
Devaraj K 9f410871ca [SPARK-13016][DOCUMENTATION] Replace example code in mllib-dimensionality-reduction.md using include_example
Replaced example example code in mllib-dimensionality-reduction.md using
include_example

Author: Devaraj K <devaraj@apache.org>

Closes #11132 from devaraj-kavali/SPARK-13016.
2016-02-22 17:16:56 -08:00
Luciano Resende 1a340da8d7 [SPARK-13248][STREAMING] Remove deprecated Streaming APIs.
Remove deprecated Streaming APIs and adjust sample applications.

Author: Luciano Resende <lresende@apache.org>

Closes #11139 from lresende/streaming-deprecated-apis.
2016-02-21 16:27:56 +00:00
shijinkui 97ee85daf6 [SPARK-12953][EXAMPLES] RDDRelation writer set overwrite mode
https://issues.apache.org/jira/browse/SPARK-12953

fix error when run RDDRelation.main():
"path file:/Users/sjk/pair.parquet already exists"

Set DataFrameWriter's mode to SaveMode.Overwrite

Author: shijinkui <shijinkui666@163.com>

Closes #10864 from shijinkui/set_mode.
2016-02-17 15:08:22 -08:00
BenFradet 00c72d27bf [SPARK-12247][ML][DOC] Documentation for spark.ml's ALS and collaborative filtering in general
This documents the implementation of ALS in `spark.ml` with example code in scala, java and python.

Author: BenFradet <benjamin.fradet@gmail.com>

Closes #10411 from BenFradet/SPARK-12247.
2016-02-16 13:03:28 +00:00
Xin Ren e4675c2402 [SPARK-13018][DOCS] Replace example code in mllib-pmml-model-export.md using include_example
Replace example code in mllib-pmml-model-export.md using include_example
https://issues.apache.org/jira/browse/SPARK-13018

The example code in the user guide is embedded in the markdown and hence it is not easy to test. It would be nice to automatically test them. This JIRA is to discuss options to automate example code testing and see what we can do in Spark 1.6.

Goal is to move actual example code to spark/examples and test compilation in Jenkins builds. Then in the markdown, we can reference part of the code to show in the user guide. This requires adding a Jekyll tag that is similar to https://github.com/jekyll/jekyll/blob/master/lib/jekyll/tags/include.rb, e.g., called include_example.
`{% include_example scala/org/apache/spark/examples/mllib/PMMLModelExportExample.scala %}`
Jekyll will find `examples/src/main/scala/org/apache/spark/examples/mllib/PMMLModelExportExample.scala` and pick code blocks marked "example" and replace code block in
`{% highlight %}`
 in the markdown.

See more sub-tasks in parent ticket: https://issues.apache.org/jira/browse/SPARK-11337

Author: Xin Ren <iamshrek@126.com>

Closes #11126 from keypointt/SPARK-13018.
2016-02-15 20:17:21 -08:00
Liang-Chi Hsieh e3441e3f68 [SPARK-12363][MLLIB] Remove setRun and fix PowerIterationClustering failed test
JIRA: https://issues.apache.org/jira/browse/SPARK-12363

This issue is pointed by yanboliang. When `setRuns` is removed from PowerIterationClustering, one of the tests will be failed. I found that some `dstAttr`s of the normalized graph are not correct values but 0.0. By setting `TripletFields.All` in `mapTriplets` it can work.

Author: Liang-Chi Hsieh <viirya@gmail.com>
Author: Xiangrui Meng <meng@databricks.com>

Closes #10539 from viirya/fix-poweriter.
2016-02-13 15:56:20 -08:00
Sean Owen 68ed3632c5 [SPARK-13170][STREAMING] Investigate replacing SynchronizedQueue as it is deprecated
Replace SynchronizeQueue with synchronized access to a Queue

Author: Sean Owen <sowen@cloudera.com>

Closes #11111 from srowen/SPARK-13170.
2016-02-09 11:23:29 +00:00
sachin aggarwal d9ba4d27f4 [SPARK-13177][EXAMPLES] Update ActorWordCount example to not directly use low level linked list as it is deprecated.
Author: sachin aggarwal <different.sachin@gmail.com>

Closes #11113 from agsachin/master.
2016-02-09 08:52:58 +00:00
Sean Owen 649e9d0f5b [SPARK-3369][CORE][STREAMING] Java mapPartitions Iterator->Iterable is inconsistent with Scala's Iterator->Iterator
Fix Java function API methods for flatMap and mapPartitions to require producing only an Iterator, not Iterable. Also fix DStream.flatMap to require a function producing TraversableOnce only, not Traversable.

CC rxin pwendell for API change; tdas since it also touches streaming.

Author: Sean Owen <sowen@cloudera.com>

Closes #10413 from srowen/SPARK-3369.
2016-01-26 11:55:28 +00:00
Mark Grover 006906db59 [SPARK-12960] [PYTHON] Some examples are missing support for python2
Without importing the print_function, the lines later on like ```print("Usage: direct_kafka_wordcount.py <broker_list> <topic>", file=sys.stderr)``` fail when using python2.*. Import fixes that problem and doesn't break anything on python3 either.

Author: Mark Grover <mark@apache.org>

Closes #10872 from markgrover/python2_compat.
2016-01-21 21:17:36 -08:00
Shixiong Zhu b7d74a602f [SPARK-7799][SPARK-12786][STREAMING] Add "streaming-akka" project
Include the following changes:

1. Add "streaming-akka" project and org.apache.spark.streaming.akka.AkkaUtils for creating an actorStream
2. Remove "StreamingContext.actorStream" and "JavaStreamingContext.actorStream"
3. Update the ActorWordCount example and add the JavaActorWordCount example
4. Make "streaming-zeromq" depend on "streaming-akka" and update the codes accordingly

Author: Shixiong Zhu <shixiong@databricks.com>

Closes #10744 from zsxwing/streaming-akka-2.
2016-01-20 13:55:41 -08:00
Reynold Xin ad1503f92e [SPARK-12667] Remove block manager's internal "external block store" API
This pull request removes the external block store API. This is rarely used, and the file system interface is actually a better, more standard way to interact with external storage systems.

There are some other things to remove also, as pointed out by JoshRosen. We will do those as follow-up pull requests.

Author: Reynold Xin <rxin@databricks.com>

Closes #10752 from rxin/remove-offheap.
2016-01-15 12:03:28 -08:00
Reynold Xin fe7246fea6 [SPARK-12830] Java style: disallow trailing whitespaces.
Author: Reynold Xin <rxin@databricks.com>

Closes #10764 from rxin/SPARK-12830.
2016-01-14 23:33:45 -08:00
Kousuke Saruta 39ae04e6b7 [SPARK-12692][BUILD][STREAMING] Scala style: Fix the style violation (Space before "," or ":")
Fix the style violation (space before , and :).
This PR is a followup for #10643.

Author: Kousuke Saruta <sarutak@oss.nttdata.co.jp>

Closes #10685 from sarutak/SPARK-12692-followup-streaming.
2016-01-11 21:06:22 -08:00
Yanbo Liang ee4ee02b86 [SPARK-12603][MLLIB] PySpark MLlib GaussianMixtureModel should support single instance predict/predictSoft
PySpark MLlib ```GaussianMixtureModel``` should support single instance ```predict/predictSoft``` just like Scala do.

Author: Yanbo Liang <ybliang8@gmail.com>

Closes #10552 from yanboliang/spark-12603.
2016-01-11 14:43:25 -08:00
Udo Klein bd723bd53d removed lambda from sortByKey()
According to the documentation the sortByKey method does not take a lambda as an argument, thus the example is flawed. Removed the argument completely as this will default to ascending sort.

Author: Udo Klein <git@blinkenlight.net>

Closes #10640 from udoklein/patch-1.
2016-01-11 09:30:08 +00:00
Kousuke Saruta e5904bb5e7 [SPARK-12692][BUILD][MLLIB] Scala style: Fix the style violation (Space before "," or ":")
Fix the style violation (space before , and :).
This PR is a followup for #10643.

Author: Kousuke Saruta <sarutak@oss.nttdata.co.jp>

Closes #10684 from sarutak/SPARK-12692-followup-mllib.
2016-01-10 12:38:57 -08:00
Sean Owen 659fd9d04b [SPARK-4819] Remove Guava's "Optional" from public API
Replace Guava `Optional` with (an API clone of) Java 8 `java.util.Optional` (edit: and a clone of Guava `Optional`)

See also https://github.com/apache/spark/pull/10512

Author: Sean Owen <sowen@cloudera.com>

Closes #10513 from srowen/SPARK-4819.
2016-01-08 13:02:30 -08:00
Udo Klein 8c70cb4c62 fixed numVertices in transitive closure example
Author: Udo Klein <git@blinkenlight.net>

Closes #10642 from udoklein/patch-2.
2016-01-08 20:32:37 +00:00
Sean Owen b9c8353378 [SPARK-12618][CORE][STREAMING][SQL] Clean up build warnings: 2.0.0 edition
Fix most build warnings: mostly deprecated API usages. I'll annotate some of the changes below. CC rxin who is leading the charge to remove the deprecated APIs.

Author: Sean Owen <sowen@cloudera.com>

Closes #10570 from srowen/SPARK-12618.
2016-01-08 17:47:44 +00:00
Shixiong Zhu c0c397509b [SPARK-12510][STREAMING] Refactor ActorReceiver to support Java
This PR includes the following changes:

1. Rename `ActorReceiver` to `ActorReceiverSupervisor`
2. Remove `ActorHelper`
3. Add a new `ActorReceiver` for Scala and `JavaActorReceiver` for Java
4. Add `JavaActorWordCount` example

Author: Shixiong Zhu <shixiong@databricks.com>

Closes #10457 from zsxwing/java-actor-stream.
2016-01-07 15:26:55 -08:00
Jacek Laskowski 8113dbda0b [STREAMING][DOCS][EXAMPLES] Minor fixes
Author: Jacek Laskowski <jacek@japila.pl>

Closes #10603 from jaceklaskowski/streaming-actor-custom-receiver.
2016-01-07 00:27:13 -08:00
Reynold Xin 8ce645d4ee [SPARK-12615] Remove some deprecated APIs in RDD/SparkContext
I looked at each case individually and it looks like they can all be removed. The only one that I had to think twice was toArray (I even thought about un-deprecating it, until I realized it was a problem in Java to have toArray returning java.util.List).

Author: Reynold Xin <rxin@databricks.com>

Closes #10569 from rxin/SPARK-12615.
2016-01-05 11:10:14 -08:00
Marcelo Vanzin 7058dc1150 [SPARK-3873][EXAMPLES] Import ordering fixes.
Author: Marcelo Vanzin <vanzin@cloudera.com>

Closes #10575 from vanzin/SPARK-3873-examples.
2016-01-04 22:42:54 -08:00
Sean Owen 15bd73627e [SPARK-12481][CORE][STREAMING][SQL] Remove usage of Hadoop deprecated APIs and reflection that supported 1.x
Remove use of deprecated Hadoop APIs now that 2.2+ is required

Author: Sean Owen <sowen@cloudera.com>

Closes #10446 from srowen/SPARK-12481.
2016-01-02 13:15:53 +00:00
Reynold Xin ee8f8d3184 [SPARK-12588] Remove HttpBroadcast in Spark 2.0.
We switched to TorrentBroadcast in Spark 1.1, and HttpBroadcast has been undocumented since then. It's time to remove it in Spark 2.0.

Author: Reynold Xin <rxin@databricks.com>

Closes #10531 from rxin/SPARK-12588.
2015-12-30 18:07:07 -08:00
Shixiong Zhu 20591afd79 [SPARK-12429][STREAMING][DOC] Add Accumulator and Broadcast example for Streaming
This PR adds Scala, Java and Python examples to show how to use Accumulator and Broadcast in Spark Streaming to support checkpointing.

Author: Shixiong Zhu <shixiong@databricks.com>

Closes #10385 from zsxwing/accumulator-broadcast-example.
2015-12-22 16:39:10 -08:00
Jeff L ea59b0f3a6 [SPARK-9057][STREAMING] Twitter example joining to static RDD of word sentiment values
Example of joining a static RDD of word sentiments to a streaming RDD of Tweets in order to demo the usage of the transform() method.

Author: Jeff L <sha0lin@alumni.carnegiemellon.edu>

Closes #8431 from Agent007/SPARK-9057.
2015-12-18 15:06:54 +00:00
Shixiong Zhu 540b5aeadc [SPARK-12410][STREAMING] Fix places that use '.' and '|' directly in split
String.split accepts a regular expression, so we should escape "." and "|".

Author: Shixiong Zhu <shixiong@databricks.com>

Closes #10361 from zsxwing/reg-bug.
2015-12-17 13:23:48 -08:00
Yanbo Liang 1a8b2a17db [SPARK-12364][ML][SPARKR] Add ML example for SparkR
We have DataFrame example for SparkR, we also need to add ML example under ```examples/src/main/r```.

cc mengxr jkbradley shivaram

Author: Yanbo Liang <ybliang8@gmail.com>

Closes #10324 from yanboliang/spark-12364.
2015-12-16 12:59:22 -08:00
Yu ISHIKAWA 7b6dc29d0e [SPARK-6518][MLLIB][EXAMPLE][DOC] Add example code and user guide for bisecting k-means
This PR includes only an example code in order to finish it quickly.
I'll send another PR for the docs soon.

Author: Yu ISHIKAWA <yuu.ishikawa@gmail.com>

Closes #9952 from yu-iskw/SPARK-6518.
2015-12-16 10:55:42 -08:00
Yu ISHIKAWA 26d70bd2b4 [SPARK-12215][ML][DOC] User guide section for KMeans in spark.ml
cc jkbradley

Author: Yu ISHIKAWA <yuu.ishikawa@gmail.com>

Closes #10244 from yu-iskw/SPARK-12215.
2015-12-16 10:43:45 -08:00
Yanbo Liang b24c12d733 [MINOR][ML] Rename weights to coefficients for examples/DeveloperApiExample
Rename ```weights``` to ```coefficients``` for examples/DeveloperApiExample.

cc mengxr jkbradley

Author: Yanbo Liang <ybliang8@gmail.com>

Closes #10280 from yanboliang/spark-coefficients.
2015-12-15 16:29:39 -08:00
Xusen Yin 98b212d36b [SPARK-12199][DOC] Follow-up: Refine example code in ml-features.md
https://issues.apache.org/jira/browse/SPARK-12199

Follow-up PR of SPARK-11551. Fix some errors in ml-features.md

mengxr

Author: Xusen Yin <yinxusen@gmail.com>

Closes #10193 from yinxusen/SPARK-12199.
2015-12-12 17:47:01 -08:00
Yanbo Liang a0ff6d16ef [SPARK-11978][ML] Move dataset_example.py to examples/ml and rename to dataframe_example.py
Since ```Dataset``` has a new meaning in Spark 1.6, we should rename it to avoid confusion.
#9873 finished the work of Scala example, here we focus on the Python one.
Move dataset_example.py to ```examples/ml``` and rename to ```dataframe_example.py```.
BTW, fix minor missing issues of #9873.
cc mengxr

Author: Yanbo Liang <ybliang8@gmail.com>

Closes #9957 from yanboliang/SPARK-11978.
2015-12-11 18:02:24 -08:00
Yanbo Liang 0fb9825556 [SPARK-12146][SPARKR] SparkR jsonFile should support multiple input files
* ```jsonFile``` should support multiple input files, such as:
```R
jsonFile(sqlContext, c(“path1”, “path2”)) # character vector as arguments
jsonFile(sqlContext, “path1,path2”)
```
* Meanwhile, ```jsonFile``` has been deprecated by Spark SQL and will be removed at Spark 2.0. So we mark ```jsonFile``` deprecated and use ```read.json``` at SparkR side.
* Replace all ```jsonFile``` with ```read.json``` at test_sparkSQL.R, but still keep jsonFile test case.
* If this PR is accepted, we should also make almost the same change for ```parquetFile```.

cc felixcheung sun-rui shivaram

Author: Yanbo Liang <ybliang8@gmail.com>

Closes #10145 from yanboliang/spark-12146.
2015-12-11 11:47:35 -08:00
Bryan Cutler 6a6c1fc5c8 [SPARK-11713] [PYSPARK] [STREAMING] Initial RDD updateStateByKey for PySpark
Adding ability to define an initial state RDD for use with updateStateByKey PySpark.  Added unit test and changed stateful_network_wordcount example to use initial RDD.

Author: Bryan Cutler <bjcutler@us.ibm.com>

Closes #10082 from BryanCutler/initial-rdd-updateStateByKey-SPARK-11713.
2015-12-10 14:21:15 -08:00
Tathagata Das bd2cd4f53d [SPARK-12244][SPARK-12245][STREAMING] Rename trackStateByKey to mapWithState and change tracking function signature
SPARK-12244:

Based on feedback from early users and personal experience attempting to explain it, the name trackStateByKey had two problem.
"trackState" is a completely new term which really does not give any intuition on what the operation is
the resultant data stream of objects returned by the function is called in docs as the "emitted" data for the lack of a better.
"mapWithState" makes sense because the API is like a mapping function like (Key, Value) => T with State as an additional parameter. The resultant data stream is "mapped data". So both problems are solved.

SPARK-12245:

From initial experiences, not having the key in the function makes it hard to return mapped stuff, as the whole information of the records is not there. Basically the user is restricted to doing something like mapValue() instead of map(). So adding the key as a parameter.

Author: Tathagata Das <tathagata.das1565@gmail.com>

Closes #10224 from tdas/rename.
2015-12-09 20:47:15 -08:00
Xusen Yin 051c6a066f [SPARK-11551][DOC] Replace example code in ml-features.md using include_example
PR on behalf of somideshmukh, thanks!

Author: Xusen Yin <yinxusen@gmail.com>
Author: somideshmukh <somilde@us.ibm.com>

Closes #10219 from yinxusen/SPARK-11551.
2015-12-09 12:00:48 -08:00
BenFradet 06746b3005 [SPARK-12159][ML] Add user guide section for IndexToString transformer
Documentation regarding the `IndexToString` label transformer with code snippets in Scala/Java/Python.

Author: BenFradet <benjamin.fradet@gmail.com>

Closes #10166 from BenFradet/SPARK-12159.
2015-12-08 12:45:34 -08:00
Yuhao Yang 5cb4695051 [SPARK-11605][MLLIB] ML 1.6 QA: API: Java compatibility, docs
jira: https://issues.apache.org/jira/browse/SPARK-11605
Check Java compatibility for MLlib for this release.

fix:

1. `StreamingTest.registerStream` needs java friendly interface.

2. `GradientBoostedTreesModel.computeInitialPredictionAndError` and `GradientBoostedTreesModel.updatePredictionError` has java compatibility issue. Mark them as `developerAPI`.

TBD:
[updated] no fix for now per discussion.
`org.apache.spark.mllib.classification.LogisticRegressionModel`
`public scala.Option<java.lang.Object> getThreshold();` has wrong return type for Java invocation.
`SVMModel` has the similar issue.

Yet adding a `scala.Option<java.util.Double> getThreshold()` would result in an overloading error due to the same function signature. And adding a new function with different name seems to be not necessary.

cc jkbradley feynmanliang

Author: Yuhao Yang <hhbyyh@gmail.com>

Closes #10102 from hhbyyh/javaAPI.
2015-12-08 11:46:26 -08:00
Yuhao Yang 872a2ee281 [SPARK-10393] use ML pipeline in LDA example
jira: https://issues.apache.org/jira/browse/SPARK-10393

Since the logic of the text processing part has been moved to ML estimators/transformers, replace the related code in LDA Example with the ML pipeline.

Author: Yuhao Yang <hhbyyh@gmail.com>
Author: yuhaoyang <yuhao@zhanglipings-iMac.local>

Closes #8551 from hhbyyh/ldaExUpdate.
2015-12-08 10:29:51 -08:00
Cheng Lian da2012a0e1 [SPARK-11551][DOC][EXAMPLE] Revert PR #10002
This reverts PR #10002, commit 78209b0cca.

The original PR wasn't tested on Jenkins before being merged.

Author: Cheng Lian <lian@databricks.com>

Closes #10200 from liancheng/revert-pr-10002.
2015-12-08 19:18:59 +08:00
Yanbo Liang 4a39b5a1be [SPARK-11958][SPARK-11957][ML][DOC] SQLTransformer user guide and example code
Add ```SQLTransformer``` user guide, example code and make Scala API doc more clear.

Author: Yanbo Liang <ybliang8@gmail.com>

Closes #10006 from yanboliang/spark-11958.
2015-12-07 23:50:57 -08:00
somideshmukh 78209b0cca [SPARK-11551][DOC][EXAMPLE] Replace example code in ml-features.md using include_example
Made new patch contaning only markdown examples moved to exmaple/folder.
Ony three  java code were not shfted since they were contaning compliation error ,these classes are
1)StandardScale 2)NormalizerExample 3)VectorIndexer

Author: Xusen Yin <yinxusen@gmail.com>
Author: somideshmukh <somilde@us.ibm.com>

Closes #10002 from somideshmukh/SomilBranch1.33.
2015-12-07 23:26:34 -08:00
Xusen Yin 871e85d9c1 [SPARK-11963][DOC] Add docs for QuantileDiscretizer
https://issues.apache.org/jira/browse/SPARK-11963

Author: Xusen Yin <yinxusen@gmail.com>

Closes #9962 from yinxusen/SPARK-11963.
2015-12-07 13:16:47 -08:00
Shixiong Zhu 3af53e61fd [SPARK-12084][CORE] Fix codes that uses ByteBuffer.array incorrectly
`ByteBuffer` doesn't guarantee all contents in `ByteBuffer.array` are valid. E.g, a ByteBuffer returned by `ByteBuffer.slice`. We should not use the whole content of `ByteBuffer` unless we know that's correct.

This patch fixed all places that use `ByteBuffer.array` incorrectly.

Author: Shixiong Zhu <shixiong@databricks.com>

Closes #10083 from zsxwing/bytebuffer-array.
2015-12-04 17:02:04 -08:00
Josh Rosen b7204e1d41 [SPARK-12112][BUILD] Upgrade to SBT 0.13.9
We should upgrade to SBT 0.13.9, since this is a requirement in order to use SBT's new Maven-style resolution features (which will be done in a separate patch, because it's blocked by some binary compatibility issues in the POM reader plugin).

I also upgraded Scalastyle to version 0.8.0, which was necessary in order to fix a Scala 2.10.5 compatibility issue (see https://github.com/scalastyle/scalastyle/issues/156). The newer Scalastyle is slightly stricter about whitespace surrounding tokens, so I fixed the new style violations.

Author: Josh Rosen <joshrosen@databricks.com>

Closes #10112 from JoshRosen/upgrade-to-sbt-0.13.9.
2015-12-05 08:15:30 +08:00
Dmitry Erastov d0d8222778 [SPARK-6990][BUILD] Add Java linting script; fix minor warnings
This replaces https://github.com/apache/spark/pull/9696

Invoke Checkstyle and print any errors to the console, failing the step.
Use Google's style rules modified according to
https://cwiki.apache.org/confluence/display/SPARK/Spark+Code+Style+Guide
Some important checks are disabled (see TODOs in `checkstyle.xml`) due to
multiple violations being present in the codebase.

Suggest fixing those TODOs in a separate PR(s).

More on Checkstyle can be found on the [official website](http://checkstyle.sourceforge.net/).

Sample output (from [build 46345](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/46345/consoleFull)) (duplicated because I run the build twice with different profiles):

> Checkstyle checks failed at following occurrences:
[ERROR] src/main/java/org/apache/spark/sql/execution/datasources/parquet/UnsafeRowParquetRecordReader.java:[217,7] (coding) MissingSwitchDefault: switch without "default" clause.
> [ERROR] src/main/java/org/apache/spark/sql/execution/datasources/parquet/SpecificParquetRecordReaderBase.java:[198,10] (modifier) ModifierOrder: 'protected' modifier out of order with the JLS suggestions.
> [ERROR] src/main/java/org/apache/spark/sql/execution/datasources/parquet/UnsafeRowParquetRecordReader.java:[217,7] (coding) MissingSwitchDefault: switch without "default" clause.
> [ERROR] src/main/java/org/apache/spark/sql/execution/datasources/parquet/SpecificParquetRecordReaderBase.java:[198,10] (modifier) ModifierOrder: 'protected' modifier out of order with the JLS suggestions.
> [error] running /home/jenkins/workspace/SparkPullRequestBuilder2/dev/lint-java ; received return code 1

Also fix some of the minor violations that didn't require sweeping changes.

Apologies for the previous botched PRs - I finally figured out the issue.

cr: JoshRosen, pwendell

> I state that the contribution is my original work, and I license the work to the project under the project's open source license.

Author: Dmitry Erastov <derastov@gmail.com>

Closes #9867 from dskrvk/master.
2015-12-04 12:03:45 -08:00
Xusen Yin e76431f886 [SPARK-11961][DOC] Add docs of ChiSqSelector
https://issues.apache.org/jira/browse/SPARK-11961

Author: Xusen Yin <yinxusen@gmail.com>

Closes #9965 from yinxusen/SPARK-11961.
2015-12-01 15:21:53 -08:00
Feynman Liang 5535888930 [SPARK-11960][MLLIB][DOC] User guide for streaming tests
CC jkbradley mengxr josepablocam

Author: Feynman Liang <feynman.liang@gmail.com>

Closes #10005 from feynmanliang/streaming-test-user-guide.
2015-11-30 15:38:44 -08:00
Yanbo Liang de64b65f7c [SPARK-11975][ML] Remove duplicate mllib example (DT/RF/GBT in Java/Python)
Remove duplicate mllib example (DT/RF/GBT in Java/Python).
Since we have tutorial code for DT/RF/GBT classification/regression in Scala/Java/Python and example applications for DT/RF/GBT in Scala, so we mark these as duplicated and remove them.
mengxr

Author: Yanbo Liang <ybliang8@gmail.com>

Closes #9954 from yanboliang/SPARK-11975.
2015-11-30 15:01:08 -08:00
Yuhao Yang e232720a65 [SPARK-11689][ML] Add user guide and example code for LDA under spark.ml
jira: https://issues.apache.org/jira/browse/SPARK-11689

Add simple user guide for LDA under spark.ml and example code under examples/. Use include_example to include example code in the user guide markdown. Check SPARK-11606 for instructions.

Original PR is reverted due to document build error. https://github.com/apache/spark/pull/9722

mengxr feynmanliang yinxusen  Sorry for the troubling.

Author: Yuhao Yang <hhbyyh@gmail.com>

Closes #9974 from hhbyyh/ldaMLExample.
2015-11-30 14:56:51 -08:00
Yanbo Liang 56a0aba0a6 [SPARK-11952][ML] Remove duplicate ml examples
Remove duplicate ml examples (only for ml).  mengxr

Author: Yanbo Liang <ybliang8@gmail.com>

Closes #9933 from yanboliang/SPARK-11685.
2015-11-24 09:52:53 -08:00
Yanbo Liang 98d7ec7df4 [SPARK-11920][ML][DOC] ML LinearRegression should use correct dataset in examples and user guide doc
ML ```LinearRegression``` use ```data/mllib/sample_libsvm_data.txt``` as dataset in examples and user guide doc, but it's actually classification dataset rather than regression dataset. We should use ```data/mllib/sample_linear_regression_data.txt``` instead.
The deeper causes is that ```LinearRegression``` with "normal" solver can not solve this dataset correctly, may be due to the ill condition and unreasonable label. This issue has been reported at [SPARK-11918](https://issues.apache.org/jira/browse/SPARK-11918).
It will confuse users if they run the example code but get exception, so we should make this change which can clearly illustrate the usage of ```LinearRegression``` algorithm.

Author: Yanbo Liang <ybliang8@gmail.com>

Closes #9905 from yanboliang/spark-11920.
2015-11-23 11:51:29 -08:00
Xiangrui Meng fe89c1817d [SPARK-11895][ML] rename and refactor DatasetExample under mllib/examples
We used the name `Dataset` to refer to `SchemaRDD` in 1.2 in ML pipelines and created this example file. Since `Dataset` has a new meaning in Spark 1.6, we should rename it to avoid confusion. This PR also removes support for dense format to simplify the example code.

cc: yinxusen

Author: Xiangrui Meng <meng@databricks.com>

Closes #9873 from mengxr/SPARK-11895.
2015-11-22 21:45:46 -08:00
Xiangrui Meng a2dce22e0a Revert "[SPARK-11689][ML] Add user guide and example code for LDA under spark.ml"
This reverts commit e359d5dcf5.
2015-11-20 16:51:47 -08:00
Vikas Nelamangala ed47b1e660 [SPARK-11549][DOCS] Replace example code in mllib-evaluation-metrics.md using include_example
Author: Vikas Nelamangala <vikasnelamangala@Vikass-MacBook-Pro.local>

Closes #9689 from vikasnp/master.
2015-11-20 15:18:41 -08:00
Yuhao Yang e359d5dcf5 [SPARK-11689][ML] Add user guide and example code for LDA under spark.ml
jira: https://issues.apache.org/jira/browse/SPARK-11689

Add simple user guide for LDA under spark.ml and example code under examples/. Use include_example to include example code in the user guide markdown. Check SPARK-11606 for instructions.

Author: Yuhao Yang <hhbyyh@gmail.com>

Closes #9722 from hhbyyh/ldaMLExample.
2015-11-20 09:57:09 -08:00
Yuhao Yang 67c75828ff [SPARK-11816][ML] fix some style issue in ML/MLlib examples
jira: https://issues.apache.org/jira/browse/SPARK-11816
Currently I only fixed some obvious comments issue like
// scalastyle:off println
on the bottom.

Yet the style in examples is not quite consistent, like only half of the examples  are with
// Example usage: ./bin/run-example mllib.FPGrowthExample \,

Author: Yuhao Yang <hhbyyh@gmail.com>

Closes #9808 from hhbyyh/exampleStyle.
2015-11-18 18:49:46 -08:00
Josh Rosen 4b11712190 [SPARK-11495] Fix potential socket / file handle leaks that were found via static analysis
The HP Fortify Opens Source Review team (https://www.hpfod.com/open-source-review-project) reported a handful of potential resource leaks that were discovered using their static analysis tool. We should fix the issues identified by their scan.

Author: Josh Rosen <joshrosen@databricks.com>

Closes #9455 from JoshRosen/fix-potential-resource-leaks.
2015-11-18 16:00:35 -08:00
Viveka Kulharia 1429e0a2b5 rmse was wrongly calculated
It was multiplying with U instaed of dividing by U

Author: Viveka Kulharia <vivkul@iitk.ac.in>

Closes #9771 from vivkul/patch-1.
2015-11-18 09:10:15 +00:00
Xusen Yin 9154f89bef [SPARK-11728] Replace example code in ml-ensembles.md using include_example
JIRA issue https://issues.apache.org/jira/browse/SPARK-11728.

The ml-ensembles.md file contains `OneVsRestExample`. Instead of writing new code files of two `OneVsRestExample`s, I use two existing files in the examples directory, they are `OneVsRestExample.scala` and `JavaOneVsRestExample.scala`.

Author: Xusen Yin <yinxusen@gmail.com>

Closes #9716 from yinxusen/SPARK-11728.
2015-11-17 23:44:06 -08:00
Rohan Bhanderi e29656f8e7 [MINOR] Correct comments in JavaDirectKafkaWordCount
Author: Rohan Bhanderi <rohan.bhanderi@sjsu.edu>

Closes #9781 from RohanBhanderi/patch-3.
2015-11-17 15:45:46 -08:00
Xusen Yin 328eb49e62 [SPARK-11729] Replace example code in ml-linear-methods.md using include_example
JIRA link: https://issues.apache.org/jira/browse/SPARK-11729

Author: Xusen Yin <yinxusen@gmail.com>

Closes #9713 from yinxusen/SPARK-11729.
2015-11-17 13:59:59 -08:00
jerryshao bd10eb81c9 [EXAMPLE][MINOR] Add missing awaitTermination in click stream example
Author: jerryshao <sshao@hortonworks.com>

Closes #9730 from jerryshao/clickstream-fix.
2015-11-16 17:02:21 -08:00
Rohan Bhanderi 22e96b87fb Typo in comment: use 2 seconds instead of 1
Use 2 seconds batch size as duration specified in JavaStreamingContext constructor is 2000 ms

Author: Rohan Bhanderi <rohan.bhanderi@sjsu.edu>

Closes #9714 from RohanBhanderi/patch-2.
2015-11-14 13:38:53 +00:00
Yanbo Liang 99693fef0a [SPARK-11723][ML][DOC] Use LibSVM data source rather than MLUtils.loadLibSVMFile to load DataFrame
Use LibSVM data source rather than MLUtils.loadLibSVMFile to load DataFrame, include:
* Use libSVM data source for all example codes under examples/ml, and remove unused import.
* Use libSVM data source for user guides under ml-*** which were omitted by #8697.
* Fix bug: We should use ```sqlContext.read().format("libsvm").load(path)``` at Java side, but the API doc and user guides misuse as ```sqlContext.read.format("libsvm").load(path)```.
* Code cleanup.

mengxr

Author: Yanbo Liang <ybliang8@gmail.com>

Closes #9690 from yanboliang/spark-11723.
2015-11-13 08:43:05 -08:00
Rishabh Bhardwaj 61a28486cc [SPARK-11445][DOCS] Replaced example code in mllib-ensembles.md using include_example
I have made the required changes and tested.
Kindly review the changes.

Author: Rishabh Bhardwaj <rbnext29@gmail.com>

Closes #9407 from rishabhbhardwaj/SPARK-11445.
2015-11-13 08:36:46 -08:00
Yanbo Liang ea5ae2705a [SPARK-11629][ML][PYSPARK][DOC] Python example code for Multilayer Perceptron Classification
Add Python example code for Multilayer Perceptron Classification, and make example code in user guide document testable. mengxr

Author: Yanbo Liang <ybliang8@gmail.com>

Closes #9594 from yanboliang/spark-11629.
2015-11-12 21:29:43 -08:00
Shixiong Zhu 0f1d00a905 [SPARK-11663][STREAMING] Add Java API for trackStateByKey
TODO
- [x] Add Java API
- [x] Add API tests
- [x] Add a function test

Author: Shixiong Zhu <shixiong@databricks.com>

Closes #9636 from zsxwing/java-track.
2015-11-12 17:48:43 -08:00
Tathagata Das 99f5f98861 [SPARK-11290][STREAMING] Basic implementation of trackStateByKey
Current updateStateByKey provides stateful processing in Spark Streaming. It allows the user to maintain per-key state and manage that state using an updateFunction. The updateFunction is called for each key, and it uses new data and existing state of the key, to generate an updated state. However, based on community feedback, we have learnt the following lessons.
* Need for more optimized state management that does not scan every key
* Need to make it easier to implement common use cases - (a) timeout of idle data, (b) returning items other than state

The high level idea that of this PR
* Introduce a new API trackStateByKey that, allows the user to update per-key state, and emit arbitrary records. The new API is necessary as this will have significantly different semantics than the existing updateStateByKey API. This API will have direct support for timeouts.
* Internally, the system will keep the state data as a map/list within the partitions of the state RDDs. The new data RDDs will be partitioned appropriately, and for all the key-value data, it will lookup the map/list in the state RDD partition and create a new list/map of updated state data. The new state RDD partition will be created based on the update data and if necessary, with old data.
Here is the detailed design doc. Please take a look and provide feedback as comments.
https://docs.google.com/document/d/1NoALLyd83zGs1hNGMm0Pc5YOVgiPpMHugGMk6COqxxE/edit#heading=h.ph3w0clkd4em

This is still WIP. Major things left to be done.
- [x] Implement basic functionality of state tracking, with initial RDD and timeouts
- [x] Unit tests for state tracking
- [x] Unit tests for initial RDD and timeout
- [ ] Unit tests for TrackStateRDD
       - [x] state creating, updating, removing
       - [ ] emitting
       - [ ] checkpointing
- [x] Misc unit tests for State, TrackStateSpec, etc.
- [x] Update docs and experimental tags

Author: Tathagata Das <tathagata.das1565@gmail.com>

Closes #9256 from tdas/trackStateByKey.
2015-11-10 23:16:18 -08:00
Pravin Gadakh 638c51d938 [SPARK-11550][DOCS] Replace example code in mllib-optimization.md using include_example
Author: Pravin Gadakh <pravingadakh177@gmail.com>

Closes #9516 from pravingadakh/SPARK-11550.
2015-11-10 14:47:04 -08:00
Xusen Yin a81f47ff74 [SPARK-11382] Replace example code in mllib-decision-tree.md using include_example
https://issues.apache.org/jira/browse/SPARK-11382

B.T.W. I fix an error in naive_bayes_example.py.

Author: Xusen Yin <yinxusen@gmail.com>

Closes #9596 from yinxusen/SPARK-11382.
2015-11-10 10:05:53 -08:00
Rishabh Bhardwaj b7720fa455 [SPARK-11548][DOCS] Replaced example code in mllib-collaborative-filtering.md using include_example
Kindly review the changes.

Author: Rishabh Bhardwaj <rbnext29@gmail.com>

Closes #9519 from rishabhbhardwaj/SPARK-11337.
2015-11-09 14:27:36 -08:00
sachin aggarwal 51d41e4b1a [SPARK-11552][DOCS][Replaced example code in ml-decision-tree.md using include_example]
I have tested it on my local, it is working fine, please review

Author: sachin aggarwal <different.sachin@gmail.com>

Closes #9539 from agsachin/SPARK-11552-real.
2015-11-09 14:25:42 -08:00
Yanbo Liang d50a66cc04 [SPARK-10689][ML][DOC] User guide and example code for AFTSurvivalRegression
Add user guide and example code for ```AFTSurvivalRegression```.

Author: Yanbo Liang <ybliang8@gmail.com>

Closes #9491 from yanboliang/spark-10689.
2015-11-09 08:57:29 -08:00
Pravin Gadakh 820064e613 [SPARK-11380][DOCS] Replace example code in mllib-frequent-pattern-mining.md using include_example
Author: Pravin Gadakh <pravingadakh177@gmail.com>
Author: Pravin Gadakh <prgadakh@in.ibm.com>

Closes #9340 from pravingadakh/SPARK-11380.
2015-11-04 08:32:08 -08:00
Rishabh Bhardwaj 2804674a7a [SPARK-11383][DOCS] Replaced example code in mllib-naive-bayes.md/mllib-isotonic-regression.md using include_example
I have made the required changes in mllib-naive-bayes.md/mllib-isotonic-regression.md and also verified them.
Kindle Review it.

Author: Rishabh Bhardwaj <rbnext29@gmail.com>

Closes #9353 from rishabhbhardwaj/SPARK-11383.
2015-11-02 14:03:50 -08:00
Xusen Yin 943d4fa204 [SPARK-11289][DOC] Substitute code examples in ML features extractors with include_example
mengxr https://issues.apache.org/jira/browse/SPARK-11289

I make some changes in ML feature extractors. I.e. TF-IDF, Word2Vec, and CountVectorizer. I add new example code in spark/examples, hope it is the right place to add those examples.

Author: Xusen Yin <yinxusen@gmail.com>

Closes #9266 from yinxusen/SPARK-11289.
2015-10-26 21:17:53 -07:00
sethah 098be27ad5 [SPARK-9715] [ML] Store numFeatures in all ML PredictionModel types
All prediction models should store `numFeatures` indicating the number of features the model was trained on. Default value of -1 added for backwards compatibility.

Author: sethah <seth.hendrickson16@gmail.com>

Closes #8675 from sethah/SPARK-9715.
2015-09-23 15:00:52 -07:00
Feynman Liang aeef44a3e3 [SPARK-3147] [MLLIB] [STREAMING] Streaming 2-sample statistical significance testing
Implementation of significance testing using Streaming API.

Author: Feynman Liang <fliang@databricks.com>
Author: Feynman Liang <feynman.liang@gmail.com>

Closes #4716 from feynmanliang/ab_testing.
2015-09-21 13:11:28 -07:00
Josh Rosen b3a7480ab0 [SPARK-10330] Add Scalastyle rule to require use of SparkHadoopUtil JobContext methods
This is a followup to #8499 which adds a Scalastyle rule to mandate the use of SparkHadoopUtil's JobContext accessor methods and fixes the existing violations.

Author: Josh Rosen <joshrosen@databricks.com>

Closes #8521 from JoshRosen/SPARK-10330-part2.
2015-09-12 16:23:55 -07:00
martinzapletal e8ea5bafee [SPARK-9910] [ML] User guide for train validation split
Author: martinzapletal <zapletal-martin@email.cz>

Closes #8377 from zapletal-martin/SPARK-9910.
2015-08-28 21:03:48 -07:00
Shuo Xiang 45723214e6 [SPARK-10336][example] fix not being able to set intercept in LR example
`fitIntercept` is a command line option but not set in the main program.

dbtsai

Author: Shuo Xiang <sxiang@pinterest.com>

Closes #8510 from coderxiang/intercept and squashes the following commits:

57c9b7d [Shuo Xiang] fix not being able to set intercept in LR example
2015-08-28 13:09:13 -07:00
Sean Owen 69c9c17716 [SPARK-9613] [CORE] Ban use of JavaConversions and migrate all existing uses to JavaConverters
Replace `JavaConversions` implicits with `JavaConverters`

Most occurrences I've seen so far are necessary conversions; a few have been avoidable. None are in critical code as far as I see, yet.

Author: Sean Owen <sowen@cloudera.com>

Closes #8033 from srowen/SPARK-9613.
2015-08-25 12:33:13 +01:00
zsxwing 1f29d502e7 [SPARK-9812] [STREAMING] Fix Python 3 compatibility issue in PySpark Streaming and some docs
This PR includes the following fixes:
1. Use `range` instead of `xrange` in `queue_stream.py` to support Python 3.
2. Fix the issue that `utf8_decoder` will return `bytes` rather than `str` when receiving an empty `bytes` in Python 3.
3. Fix the commands in docs so that the user can copy them directly to the command line. The previous commands was broken in the middle of a path, so when copying to the command line, the path would be split to two parts by the extra spaces, which forces the user to fix it manually.

Author: zsxwing <zsxwing@gmail.com>

Closes #8315 from zsxwing/SPARK-9812.
2015-08-19 18:36:01 -07:00
Herman van Hovell a85fb6c07f [SPARK-9980] [BUILD] Fix SBT publishLocal error due to invalid characters in doc
Tiny modification to a few comments ```sbt publishLocal``` work again.

Author: Herman van Hovell <hvanhovell@questtec.nl>

Closes #8209 from hvanhovell/SPARK-9980.
2015-08-15 10:46:04 +01:00
Davies Liu 37586e5449 [HOTFIX] fix duplicated braces
Author: Davies Liu <davies@databricks.com>

Closes #8219 from davies/fix_typo.
2015-08-14 20:56:55 -07:00
lewuathe 2932e25da4 [SPARK-9073] [ML] spark.ml Models copy() should call setParent when there is a parent
Copied ML models must have the same parent of original ones

Author: lewuathe <lewuathe@me.com>
Author: Lewuathe <lewuathe@me.com>

Closes #7447 from Lewuathe/SPARK-9073.
2015-08-13 09:17:19 -07:00
Prabeesh K 853809e948 [SPARK-5155] [PYSPARK] [STREAMING] Mqtt streaming support in Python
This PR is based on #4229, thanks prabeesh.

Closes #4229

Author: Prabeesh K <prabsmails@gmail.com>
Author: zsxwing <zsxwing@gmail.com>
Author: prabs <prabsmails@gmail.com>
Author: Prabeesh K <prabeesh.k@namshi.com>

Closes #7833 from zsxwing/pr4229 and squashes the following commits:

9570bec [zsxwing] Fix the variable name and check null in finally
4a9c79e [zsxwing] Fix pom.xml indentation
abf5f18 [zsxwing] Merge branch 'master' into pr4229
935615c [zsxwing] Fix the flaky MQTT tests
47278c5 [zsxwing] Include the project class files
478f844 [zsxwing] Add unpack
5f8a1d4 [zsxwing] Make the maven build generate the test jar for Python MQTT tests
734db99 [zsxwing] Merge branch 'master' into pr4229
126608a [Prabeesh K] address the comments
b90b709 [Prabeesh K] Merge pull request #1 from zsxwing/pr4229
d07f454 [zsxwing] Register StreamingListerner before starting StreamingContext; Revert unncessary changes; fix the python unit test
a6747cb [Prabeesh K] wait for starting the receiver before publishing data
87fc677 [Prabeesh K] address the comments:
97244ec [zsxwing] Make sbt build the assembly test jar for streaming mqtt
80474d1 [Prabeesh K] fix
1f0cfe9 [Prabeesh K] python style fix
e1ee016 [Prabeesh K] scala style fix
a5a8f9f [Prabeesh K] added Python test
9767d82 [Prabeesh K] implemented Python-friendly class
a11968b [Prabeesh K] fixed python style
795ec27 [Prabeesh K] address comments
ee387ae [Prabeesh K] Fix assembly jar location of mqtt-assembly
3f4df12 [Prabeesh K] updated version
b34c3c1 [prabs] adress comments
3aa7fff [prabs] Added Python streaming mqtt word count example
b7d42ff [prabs] Mqtt streaming support in Python
2015-08-10 16:33:23 -07:00
Holden Karau 5a23213c14 [SPARK-8069] [ML] Add multiclass thresholds for ProbabilisticClassifier
This PR replaces the old "threshold" with a generalized "thresholds" Param.  We keep getThreshold,setThreshold for backwards compatibility for binary classification.

Note that the primary author of this PR is holdenk

Author: Holden Karau <holden@pigscanfly.ca>
Author: Joseph K. Bradley <joseph@databricks.com>

Closes #7909 from jkbradley/holdenk-SPARK-8069-add-cutoff-aka-threshold-to-random-forest and squashes the following commits:

3952977 [Joseph K. Bradley] fixed pyspark doc test
85febc8 [Joseph K. Bradley] made python unit tests a little more robust
7eb1d86 [Joseph K. Bradley] small cleanups
6cc2ed8 [Joseph K. Bradley] Fixed remaining merge issues.
0255e44 [Joseph K. Bradley] Many cleanups for thresholds, some more tests
7565a60 [Holden Karau] fix pep8 style checks, add a getThreshold method similar to our LogisticRegression.scala one for API compat
be87f26 [Holden Karau] Convert threshold to thresholds in the python code, add specialized support for Array[Double] to shared parems codegen, etc.
6747dad [Holden Karau] Override raw2prediction for ProbabilisticClassifier, fix some tests
25df168 [Holden Karau] Fix handling of thresholds in LogisticRegression
c02d6c0 [Holden Karau] No default for thresholds
5e43628 [Holden Karau] CR feedback and fixed the renamed test
f3fbbd1 [Holden Karau] revert the changes to random forest :(
51f581c [Holden Karau] Add explicit types to public methods, fix long line
f7032eb [Holden Karau] Fix a java test bug, remove some unecessary changes
adf15b4 [Holden Karau] rename the classifier suite test to ProbabilisticClassifierSuite now that we only have it in Probabilistic
398078a [Holden Karau] move the thresholding around a bunch based on the design doc
4893bdc [Holden Karau] Use numtrees of 3 since previous result was tied (one tree for each) and the switch from different max methods picked a different element (since they were equal I think this is ok)
638854c [Holden Karau] Add a scala RandomForestClassifierSuite test based on corresponding python test
e09919c [Holden Karau] Fix return type, I need more coffee....
8d92cac [Holden Karau] Use ClassifierParams as the head
3456ed3 [Holden Karau] Add explicit return types even though just test
a0f3b0c [Holden Karau] scala style fixes
6f14314 [Holden Karau] Since hasthreshold/hasthresholds is in root classifier now
ffc8dab [Holden Karau] Update the sharedParams
0420290 [Holden Karau] Allow us to override the get methods selectively
978e77a [Holden Karau] Move HasThreshold into classifier params and start defining the overloaded getThreshold/getThresholds functions
1433e52 [Holden Karau] Revert "try and hide threshold but chainges the API so no dice there"
1f09a2e [Holden Karau] try and hide threshold but chainges the API so no dice there
efb9084 [Holden Karau] move setThresholds only to where its used
6b34809 [Holden Karau] Add a test with thresholding for the RFCS
74f54c3 [Holden Karau] Fix creation of vote array
1986fa8 [Holden Karau] Setting the thresholds only makes sense if the underlying class hasn't overridden predict, so lets push it down.
2f44b18 [Holden Karau] Add a global default of null for thresholds param
f338cfc [Holden Karau] Wait that wasn't a good idea, Revert "Some progress towards unifying threshold and thresholds"
634b06f [Holden Karau] Some progress towards unifying threshold and thresholds
85c9e01 [Holden Karau] Test passes again... little fnur
099c0f3 [Holden Karau] Move thresholds around some more (set on model not trainer)
0f46836 [Holden Karau] Start adding a classifiersuite
f70eb5e [Holden Karau] Fix test compile issues
a7d59c8 [Holden Karau] Move thresholding into Classifier trait
5d999d2 [Holden Karau] Some more progress, start adding a test (maybe try and see if we can find a better thing to use for the base of the test)
1fed644 [Holden Karau] Use thresholds to scale scores in random forest classifcation
31d6bf2 [Holden Karau] Start threading the threshold info through
0ef228c [Holden Karau] Add hasthresholds
2015-08-04 10:12:22 -07:00
Sean Owen 76d74090d6 [SPARK-9534] [BUILD] Enable javac lint for scalac parity; fix a lot of build warnings, 1.5.0 edition
Enable most javac lint warnings; fix a lot of build warnings. In a few cases, touch up surrounding code in the process.

I'll explain several of the changes inline in comments.

Author: Sean Owen <sowen@cloudera.com>

Closes #7862 from srowen/SPARK-9534 and squashes the following commits:

ea51618 [Sean Owen] Enable most javac lint warnings; fix a lot of build warnings. In a few cases, touch up surrounding code in the process.
2015-08-04 12:02:26 +01:00
Yu ISHIKAWA 244016a95c [SPARK-9149] [ML] [EXAMPLES] Add an example of spark.ml KMeans
[SPARK-9149] Add an example of spark.ml KMeans - ASF JIRA https://issues.apache.org/jira/browse/SPARK-9149

jkbradley Should we support other data formats, such as TSV or CSV. I have implemented these examples which support only space separated file which is same as the example for `spark.mllib`'s `KMeans`.

Author: Yu ISHIKAWA <yuu.ishikawa@gmail.com>

Closes #7697 from yu-iskw/SPARK-9149 and squashes the following commits:

7137bad [Yu ISHIKAWA] Fix the typo
56b9da2 [Yu ISHIKAWA] Fix the place of the wrong import statment
554e574 [Yu ISHIKAWA] Change the way to format input data in KMeansExample
e7a948a [Yu ISHIKAWA] Import spark.ml.clustering.KMeans
1901e0c [Yu ISHIKAWA] Change how to initialize an array for a DataFrame schema
d8043f5 [Yu ISHIKAWA] Return a value directly
d81bf55 [Yu ISHIKAWA] Fix a typo and its access specifiers
3e0862d [Yu ISHIKAWA] Make KMeansExample more simple
51ce9c1 [Yu ISHIKAWA] Make JavaKMeansExample more simple
a5a01e0 [Yu ISHIKAWA] Fix a Javadoc about the command to execute the example
b09ec13 [Yu ISHIKAWA] [SPARK-9149][ML][Examples] Add an example of spark.ml KMeans
2015-08-02 09:00:58 +01:00
Jonathan Alter e14b545d2d [SPARK-7977] [BUILD] Disallowing println
Author: Jonathan Alter <jonalter@users.noreply.github.com>

Closes #7093 from jonalter/SPARK-7977 and squashes the following commits:

ccd44cc [Jonathan Alter] Changed println to log in ThreadingSuite
7fcac3e [Jonathan Alter] Reverting to println in ThreadingSuite
10724b6 [Jonathan Alter] Changing some printlns to logs in tests
eeec1e7 [Jonathan Alter] Merge branch 'master' of github.com:apache/spark into SPARK-7977
0b1dcb4 [Jonathan Alter] More println cleanup
aedaf80 [Jonathan Alter] Merge branch 'master' of github.com:apache/spark into SPARK-7977
925fd98 [Jonathan Alter] Merge branch 'master' of github.com:apache/spark into SPARK-7977
0c16fa3 [Jonathan Alter] Replacing some printlns with logs
45c7e05 [Jonathan Alter] Merge branch 'master' of github.com:apache/spark into SPARK-7977
5c8e283 [Jonathan Alter] Allowing println in audit-release examples
5b50da1 [Jonathan Alter] Allowing printlns in example files
ca4b477 [Jonathan Alter] Merge branch 'master' of github.com:apache/spark into SPARK-7977
83ab635 [Jonathan Alter] Fixing new printlns
54b131f [Jonathan Alter] Merge branch 'master' of github.com:apache/spark into SPARK-7977
1cd8a81 [Jonathan Alter] Removing some unnecessary comments and printlns
b837c3a [Jonathan Alter] Disallowing println
2015-07-10 11:34:01 +01:00
Daniel Emaasit (PhD Student) 293225e0cd [SPARK-8124] [SPARKR] Created more examples on SparkR DataFrames
Here are more examples on SparkR DataFrames including creating a Spark Contect and a SQL
context, loading data and simple data manipulation.

Author: Daniel Emaasit (PhD Student) <daniel.emaasit@gmail.com>

Closes #6668 from Emaasit/dan-dev and squashes the following commits:

3a97867 [Daniel Emaasit (PhD Student)] Used fewer rows for createDataFrame
f7227f9 [Daniel Emaasit (PhD Student)] Using command line arguments
a550f70 [Daniel Emaasit (PhD Student)] Used base R functions
33f9882 [Daniel Emaasit (PhD Student)] Renamed file
b6603e3 [Daniel Emaasit (PhD Student)] changed "Describe" function to "describe"
90565dd [Daniel Emaasit (PhD Student)] Deleted the getting-started file
b95a103 [Daniel Emaasit (PhD Student)] Deleted this file
cc55cd8 [Daniel Emaasit (PhD Student)] combined all the code into one .R file
c6933af [Daniel Emaasit (PhD Student)] changed variable name to SQLContext
8e0fe14 [Daniel Emaasit (PhD Student)] provided two options for creating DataFrames
2653573 [Daniel Emaasit (PhD Student)] Updates to a comment and variable name
275b787 [Daniel Emaasit (PhD Student)] Added the Apache License at the top of the file
2e8f724 [Daniel Emaasit (PhD Student)] Added the Apache License at the top of the file
486f44e [Daniel Emaasit (PhD Student)] Added the Apache License at the file
d705112 [Daniel Emaasit (PhD Student)] Created more examples on SparkR DataFrames
2015-07-06 11:08:36 -07:00
zsxwing 75b9fe4c5f [SPARK-8378] [STREAMING] Add the Python API for Flume
Author: zsxwing <zsxwing@gmail.com>

Closes #6830 from zsxwing/flume-python and squashes the following commits:

78dfdac [zsxwing] Fix the compile error in the test code
f1bf3c0 [zsxwing] Address TD's comments
0449723 [zsxwing] Add sbt goal streaming-flume-assembly/assembly
e93736b [zsxwing] Fix the test case for determine_modules_to_test
9d5821e [zsxwing] Fix pyspark_core dependencies
f9ee681 [zsxwing] Merge branch 'master' into flume-python
7a55837 [zsxwing] Add streaming_flume_assembly to run-tests.py
b96b0de [zsxwing] Merge branch 'master' into flume-python
ce85e83 [zsxwing] Fix incompatible issues for Python 3
01cbb3d [zsxwing] Add import sys
152364c [zsxwing] Fix the issue that StringIO doesn't work in Python 3
14ba0ff [zsxwing] Add flume-assembly for sbt building
b8d5551 [zsxwing] Merge branch 'master' into flume-python
4762c34 [zsxwing] Fix the doc
0336579 [zsxwing] Refactor Flume unit tests and also add tests for Python API
9f33873 [zsxwing] Add the Python API for Flume
2015-07-01 11:59:24 -07:00
Shuo Xiang 5452457410 [SPARK-8551] [ML] Elastic net python code example
Author: Shuo Xiang <shuoxiangpub@gmail.com>

Closes #6946 from coderxiang/en-java-code-example and squashes the following commits:

7a4bdf8 [Shuo Xiang] address comments
cddb02b [Shuo Xiang] add elastic net python example code
f4fa534 [Shuo Xiang] Merge remote-tracking branch 'upstream/master'
6ad4865 [Shuo Xiang] Merge remote-tracking branch 'upstream/master'
180b496 [Shuo Xiang] Merge remote-tracking branch 'upstream/master'
aa0717d [Shuo Xiang] Merge remote-tracking branch 'upstream/master'
5f109b4 [Shuo Xiang] Merge remote-tracking branch 'upstream/master'
c5c5bfe [Shuo Xiang] Merge remote-tracking branch 'upstream/master'
98804c9 [Shuo Xiang] fix bug in topBykey and update test
2015-06-29 23:50:34 -07:00
Liang-Chi Hsieh 4a462c282c [HOTFIX] Fix scala style in DFSReadWriteTest that causes tests failed
This scala style problem causes tested failed.

Author: Liang-Chi Hsieh <viirya@gmail.com>

Closes #6907 from viirya/hotfix_style and squashes the following commits:

c53f188 [Liang-Chi Hsieh] Fix scala style.
2015-06-19 11:36:59 -07:00