Commit graph

815 commits

Author SHA1 Message Date
wm624@hotmail.com bebe5f9811 [SPARK-15318][ML][EXAMPLE] spark.ml Collaborative Filtering example does not work in spark-shell
## What changes were proposed in this pull request?

When the example in ml-collaborative-filtering.html is copied and pasted into spark-shell, the following error appears.
scala> case class Rating(userId: Int, movieId: Int, rating: Float, timestamp: Long)
defined class Rating

scala> object Rating {
     |   def parseRating(str: String): Rating = {
     |     val fields = str.split("::")
     |     assert(fields.size == 4)
     |     Rating(fields(0).toInt, fields(1).toInt, fields(2).toFloat, fields(3).toLong)
     |   }
     | }
<console>:29: error: Rating.type does not take parameters
Rating(fields(0).toInt, fields(1).toInt, fields(2).toFloat, fields(3).toLong)
^
The standard Scala REPL produces the same error.

The Scala/spark-shell REPL has some quirks (e.g. packages are also not well supported).

The error occurs because the Scala/spark-shell REPL discards the previous definition when an object with the same name as the class is defined. Solution: rename the object Rating.
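A minimal sketch of the rename workaround described above. The object name `RatingParser` is hypothetical and chosen only for illustration; the actual patch may have used a different name or structure.

```scala
// The case class keeps its name, so Rating(...) still resolves to its generated apply method.
case class Rating(userId: Int, movieId: Int, rating: Float, timestamp: Long)

// Hypothetical name: any object name other than "Rating" avoids shadowing the case class
// when the definitions are pasted into the REPL one at a time.
object RatingParser {
  def parseRating(str: String): Rating = {
    val fields = str.split("::")
    assert(fields.size == 4)
    Rating(fields(0).toInt, fields(1).toInt, fields(2).toFloat, fields(3).toLong)
  }
}
```

Because the parser no longer shares a name with the case class, pasting the two definitions separately should not replace the `Rating` constructor.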

## How was this patch tested?

Manually tested:
1. Run `./bin/run-example ALSExample`.
2. Copy and paste the example from the generated document. It works fine.

Author: wm624@hotmail.com <wm624@hotmail.com>

Closes #13110 from wangmiao1981/repl.
2016-05-17 16:51:01 +01:00
wm624@hotmail.com 4134ff0c65 [SPARK-14434][ML] User guide doc and examples for GaussianMixture in spark.ml
## What changes were proposed in this pull request?

Add guide doc and examples for GaussianMixture in spark.ml in Java, Scala and Python.

## How was this patch tested?

Manually compiled and tested all examples.

Author: wm624@hotmail.com <wm624@hotmail.com>

Closes #12788 from wangmiao1981/example.
2016-05-17 15:20:47 +02:00
Yanbo Liang f116a84ef8 [SPARK-14979][ML][PYSPARK] Add examples for GeneralizedLinearRegression
## What changes were proposed in this pull request?
Add Scala/Java/Python examples for ```GeneralizedLinearRegression```.

## How was this patch tested?
They are examples and have been tested offline.

Author: Yanbo Liang <ybliang8@gmail.com>

Closes #12754 from yanboliang/spark-14979.
2016-05-16 09:55:35 +02:00
Sean Zhong 33c6eb5218 [SPARK-15171][SQL] Deprecate registerTempTable and add dataset.createTempView
## What changes were proposed in this pull request?

Deprecates registerTempTable and adds dataset.createTempView and dataset.createOrReplaceTempView.

## How was this patch tested?

Unit tests.

Author: Sean Zhong <seanzhong@databricks.com>

Closes #12945 from clockfly/spark-15171.
2016-05-12 15:51:53 +08:00
Zheng RuiFeng 9e266d07a4 [SPARK-15031][SPARK-15134][EXAMPLE][DOC] Use SparkSession and update indent in examples
## What changes were proposed in this pull request?
1, Use `SparkSession` according to [SPARK-15031](https://issues.apache.org/jira/browse/SPARK-15031)
2, Update indent for `SparkContext` according to [SPARK-15134](https://issues.apache.org/jira/browse/SPARK-15134)
3, BTW, remove some duplicate space and add missing '.'

## How was this patch tested?
manual tests

Author: Zheng RuiFeng <ruifengz@foxmail.com>

Closes #13050 from zhengruifeng/use_sparksession.
2016-05-11 22:45:30 -07:00
cody koeninger 89e67d6667 [SPARK-15085][STREAMING][KAFKA] Rename streaming-kafka artifact
## What changes were proposed in this pull request?
Renaming the streaming-kafka artifact to include kafka version, in anticipation of needing a different artifact for later kafka versions

## How was this patch tested?
Unit tests

Author: cody koeninger <cody@koeninger.org>

Closes #12946 from koeninger/SPARK-15085.
2016-05-11 12:15:41 -07:00
Dongjoon Hyun e1576478bd [SPARK-14933][HOTFIX] Replace sqlContext with spark.
## What changes were proposed in this pull request?

This fixes compile errors.

## How was this patch tested?

Pass the Jenkins tests.

Author: Dongjoon Hyun <dongjoon@apache.org>

Closes #13053 from dongjoon-hyun/hotfix_sqlquerysuite.
2016-05-11 10:03:51 -07:00
Zheng RuiFeng d88afabdfa [SPARK-15150][EXAMPLE][DOC] Update LDA examples
## What changes were proposed in this pull request?
1,create a libsvm-type dataset for lda: `data/mllib/sample_lda_libsvm_data.txt`
2,add python example
3,directly read the datafile in examples
4,BTW, change to `SparkSession` in `aft_survival_regression.py`

## How was this patch tested?
manual tests
`./bin/spark-submit examples/src/main/python/ml/lda_example.py`

Author: Zheng RuiFeng <ruifengz@foxmail.com>

Closes #12927 from zhengruifeng/lda_pe.
2016-05-11 12:49:41 +02:00
Zheng RuiFeng 8beae59144 [SPARK-15149][EXAMPLE][DOC] update kmeans example
## What changes were proposed in this pull request?
Python example for ml.kmeans already exists, but not included in user guide.
1,small changes like: `example_on` `example_off`
2,add it to user guide
3,update examples to directly read datafile

## How was this patch tested?
manual tests
`./bin/spark-submit examples/src/main/python/ml/kmeans_example.py`

Author: Zheng RuiFeng <ruifengz@foxmail.com>

Closes #12925 from zhengruifeng/km_pe.
2016-05-11 10:01:43 +02:00
Zheng RuiFeng cef73b5638 [SPARK-14340][EXAMPLE][DOC] Update Examples and User Guide for ml.BisectingKMeans
## What changes were proposed in this pull request?

1, add BisectingKMeans to ml-clustering.md
2, add the missing Scala BisectingKMeansExample
3, create a new datafile `data/mllib/sample_kmeans_data.txt`

## How was this patch tested?

manual tests

Author: Zheng RuiFeng <ruifengz@foxmail.com>

Closes #11844 from zhengruifeng/doc_bkm.
2016-05-11 09:56:36 +02:00
Zheng RuiFeng ad1a8466e9 [SPARK-15141][EXAMPLE][DOC] Update OneVsRest Examples
## What changes were proposed in this pull request?
1, Add python example for OneVsRest
2, remove args-parsing

## How was this patch tested?
manual tests
`./bin/spark-submit examples/src/main/python/ml/one_vs_rest_example.py`

Author: Zheng RuiFeng <ruifengz@foxmail.com>

Closes #12920 from zhengruifeng/ovr_pe.
2016-05-11 09:53:36 +02:00
hyukjinkwon 2992a215c9 [MINOR][DOCS] Remove remaining sqlContext in documentation at examples
This PR removes `sqlContext` in examples. Actual usage was all replaced in https://github.com/apache/spark/pull/12809 but there are some in comments.

Manual style checking.

Author: hyukjinkwon <gurwls223@gmail.com>

Closes #13006 from HyukjinKwon/minor-docs.
2016-05-09 10:55:17 -07:00
Yanbo Liang ee3b171562 [MINOR] [SPARKR] Update data-manipulation.R to use native csv reader
## What changes were proposed in this pull request?
* Since Spark now supports a native csv reader, it is not necessary to use the third-party ```spark-csv``` in ```examples/src/main/r/data-manipulation.R```. Meanwhile, remove all ```spark-csv``` usage in SparkR.
* Running R applications through ```sparkR``` is not supported as of Spark 2.0, so we change to use ```./bin/spark-submit``` to run the example.

## How was this patch tested?
Offline test.

Author: Yanbo Liang <ybliang8@gmail.com>

Closes #13005 from yanboliang/r-df-examples.
2016-05-09 09:58:36 -07:00
Nick Pentreath b0cafdb6cc [MINOR][ML][PYSPARK] ALS example cleanup
Cleans up ALS examples by removing unnecessary casts to double for `rating` and `prediction` columns, since `RegressionEvaluator` now supports `Double` & `Float` input types.

## How was this patch tested?

Manual compile and run with `run-example ml.ALSExample` and `spark-submit examples/src/main/python/ml/als_example.py`.

Author: Nick Pentreath <nickp@za.ibm.com>

Closes #12892 from MLnick/als-examples-cleanup.
2016-05-07 10:57:40 +02:00
Zheng RuiFeng 76ad04d9a0 [SPARK-14512] [DOC] Add python example for QuantileDiscretizer
## What changes were proposed in this pull request?
Add the missing python example for QuantileDiscretizer

## How was this patch tested?
manual tests

Author: Zheng RuiFeng <ruifengz@foxmail.com>

Closes #12281 from zhengruifeng/discret_pe.
2016-05-06 10:47:13 -07:00
Dongjoon Hyun 2c170dd3d7 [SPARK-15134][EXAMPLE] Indent SparkSession builder patterns and update binary_classification_metrics_example.py
## What changes were proposed in this pull request?

This issue addresses the comments in SPARK-15031 and also fixes java-linter errors.
- Use multiline format in SparkSession builder patterns.
- Update `binary_classification_metrics_example.py` to use `SparkSession`.
- Fix Java Linter errors (in SPARK-13745, SPARK-15031, and so far)

## How was this patch tested?

Passed the Jenkins tests and ran `dev/lint-java` manually.

Author: Dongjoon Hyun <dongjoon@apache.org>

Closes #12911 from dongjoon-hyun/SPARK-15134.
2016-05-05 14:37:50 -07:00
Sandeep Singh ed6f3f8a5f [SPARK-15072][SQL][REPL][EXAMPLES] Remove SparkSession.withHiveSupport
## What changes were proposed in this pull request?
Removes the `withHiveSupport` method of `SparkSession`; use `enableHiveSupport` instead.

## How was this patch tested?
ran tests locally

Author: Sandeep Singh <sandeep@techaddict.me>

Closes #12851 from techaddict/SPARK-15072.
2016-05-05 14:35:15 -07:00
Dongjoon Hyun cdce4e62a5 [SPARK-15031][EXAMPLE] Use SparkSession in Scala/Python/Java example.
## What changes were proposed in this pull request?

This PR aims to update Scala/Python/Java examples by replacing `SQLContext` with newly added `SparkSession`.

- Use **SparkSession Builder Pattern** in 154(Scala 55, Java 52, Python 47) files.
- Add `getConf` in Python SparkContext class: `python/pyspark/context.py`
- Replace **SQLContext Singleton Pattern** with **SparkSession Singleton Pattern**:
  - `SqlNetworkWordCount.scala`
  - `JavaSqlNetworkWordCount.java`
  - `sql_network_wordcount.py`

Now, `SQLContext` is used only in R examples and the following two Python examples. The Python examples are untouched in this PR since they already fail because of some unknown issue.
- `simple_params_example.py`
- `aft_survival_regression.py`

## How was this patch tested?

Manual.

Author: Dongjoon Hyun <dongjoon@apache.org>

Closes #12809 from dongjoon-hyun/SPARK-15031.
2016-05-04 14:31:36 -07:00
Zheng RuiFeng 4530250f5a [MINOR] Add python3 compatibility in python examples
## What changes were proposed in this pull request?
Add python3 compatibility in python examples

## How was this patch tested?
manual tests

Author: Zheng RuiFeng <ruifengz@foxmail.com>

Closes #12868 from zhengruifeng/fix_gmm_py.
2016-05-04 10:59:36 -07:00
Dongjoon Hyun 0903a185c7 [SPARK-15084][PYTHON][SQL] Use builder pattern to create SparkSession in PySpark.
## What changes were proposed in this pull request?

This is a python port of corresponding Scala builder pattern code. `sql.py` is modified as a target example case.

## How was this patch tested?

Manual.

Author: Dongjoon Hyun <dongjoon@apache.org>

Closes #12860 from dongjoon-hyun/SPARK-15084.
2016-05-03 18:05:40 -07:00
Andrew Or 588cac414a [SPARK-15073][SQL] Hide SparkSession constructor from the public
## What changes were proposed in this pull request?

Users should use the builder pattern instead.

## How was this patch tested?

Jenkins.

Author: Andrew Or <andrew@databricks.com>

Closes #12873 from andrewor14/spark-session-constructor.
2016-05-03 13:47:58 -07:00
Dongjoon Hyun f86f71763c [MINOR][EXAMPLE] Use SparkSession instead of SQLContext in RDDRelation.scala
## What changes were proposed in this pull request?

`SQLContext` is now kept for backward compatibility, so Spark 2.0 examples should use `SparkSession` instead.

## How was this patch tested?

It's just an example change. After building, run `bin/run-example org.apache.spark.examples.sql.RDDRelation`.

Author: Dongjoon Hyun <dongjoon@apache.org>

Closes #12808 from dongjoon-hyun/rddrelation.
2016-04-30 00:15:04 -07:00
wm624@hotmail.com c74fd1e546 [SPARK-14937][ML][DOCUMENT] spark.ml LogisticRegression sqlCtx in scala is inconsistent with java and python
## What changes were proposed in this pull request?
In spark.ml document, the LogisticRegression scala example uses sqlCtx. It is inconsistent with java and python examples which use sqlContext. In addition, a user can't copy & paste to run the example in spark-shell as sqlCtx doesn't exist in spark-shell while sqlContext exists.

Change the scala example referred by the spark.ml example.

## How was this patch tested?

Compile the example scala file and it passes compilation.

Author: wm624@hotmail.com <wm624@hotmail.com>

Closes #12717 from wangmiao1981/doc.
2016-04-27 11:56:57 -07:00
Josh Rosen 75879ac3c0 [SPARK-14925][BUILD] Re-introduce 'unused' dependency so that published POMs are flattened
Spark's published POMs are supposed to be flattened and not contain variable substitution (see SPARK-3812), but the dummy dependency that was required for this was accidentally removed. We should re-introduce this dependency in order to fix an issue where the un-flattened POMs cause the wrong dependencies to be included in Scala 2.10 published POMs.

Author: Josh Rosen <joshrosen@databricks.com>

Closes #12706 from JoshRosen/SPARK-14925-published-poms-should-be-flattened.
2016-04-26 15:14:17 -07:00
Zheng RuiFeng e88476c8c6 [SPARK-14514][DOC] Add python example for VectorSlicer
## What changes were proposed in this pull request?
Add the missing python example for VectorSlicer

## How was this patch tested?
manual tests

Author: Zheng RuiFeng <ruifengz@foxmail.com>

Closes #12282 from zhengruifeng/vecslicer_pe.
2016-04-26 14:38:29 -07:00
Azeem Jiva de6e633420 [SPARK-14756][CORE] Use parseLong instead of valueOf
## What changes were proposed in this pull request?

Use `Long.parseLong`, which returns a primitive.
Using a series of `append()` calls avoids creating an extra StringBuilder.
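The difference can be sketched as follows. This is an illustration of the two JDK calls and of chained appends, not the code touched by the patch; the object and method names are hypothetical.

```scala
object ParseLongSketch {
  // Long.parseLong returns a primitive long, so no boxed object is allocated.
  def parsePrimitive(s: String): Long = java.lang.Long.parseLong(s)

  // Long.valueOf returns a boxed java.lang.Long, which is unboxed again if a primitive is needed.
  def parseBoxed(s: String): java.lang.Long = java.lang.Long.valueOf(s)

  // Chained append calls write into one builder instead of first building
  // an intermediate concatenated string (e.g. append(host + ":" + port)).
  def describe(host: String, port: Int): String =
    new java.lang.StringBuilder().append(host).append(':').append(port).toString
}
```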

## How was this patch tested?

Unit tests

Author: Azeem Jiva <azeemj@gmail.com>

Closes #12520 from javawithjiva/minor.
2016-04-26 11:49:04 +01:00
Andrew Or 3c5e65c339 [SPARK-14721][SQL] Remove HiveContext (part 2)
## What changes were proposed in this pull request?

This removes the class `HiveContext` itself along with all code usages associated with it. The bulk of the work was already done in #12485. This is mainly just code cleanup and actually removing the class.

Note: A couple of things will break after this patch. These will be fixed separately.
- the python HiveContext
- all the documentation / comments referencing HiveContext
- there will be no more HiveContext in the REPL (fixed by #12589)

## How was this patch tested?

No change in functionality.

Author: Andrew Or <andrew@databricks.com>

Closes #12585 from andrewor14/delete-hive-context.
2016-04-25 13:23:05 -07:00
Marcelo Vanzin a680562a6f [SPARK-14744][EXAMPLES] Clean up examples packaging, remove outdated examples.
First, make all dependencies in the examples module provided, and explicitly
list a couple of ones that somehow are promoted to compile by maven. This
means that to run streaming examples, the streaming connector package needs
to be provided to run-examples using --packages or --jars, just like regular
apps.

Also, remove a couple of outdated examples. HBase has had Spark bindings for
a while and is even including them in the HBase distribution in the next
version, making the examples obsolete. The same applies to Cassandra, which
seems to have a proper Spark binding library already.

I just tested the build, which passes, and ran SparkPi. The examples jars
directory now has only two jars:

```
$ ls -1 examples/target/scala-2.11/jars/
scopt_2.11-3.3.0.jar
spark-examples_2.11-2.0.0-SNAPSHOT.jar
```

Author: Marcelo Vanzin <vanzin@cloudera.com>

Closes #12544 from vanzin/SPARK-14744.
2016-04-25 10:20:51 -07:00
Dongjoon Hyun 6ab4d9e0c7 [SPARK-14883][DOCS] Fix wrong R examples and make them up-to-date
## What changes were proposed in this pull request?

This issue aims to fix some errors in R examples and make them up-to-date in docs and example modules.

- Remove the wrong usage of `map`. We need to use `lapply` in `sparkR` if needed. However, `lapply` is private so far. The corrected example will be added later.
- Fix the wrong example in Section `Generic Load/Save Functions` of `docs/sql-programming-guide.md` for consistency
- Fix datatypes in `sparkr.md`.
- Update a data result in `sparkr.md`.
- Replace deprecated functions to remove warnings: jsonFile -> read.json, parquetFile -> read.parquet
- Use up-to-date R-like functions: loadDF -> read.df, saveDF -> write.df, saveAsParquetFile -> write.parquet
- Replace `SparkR DataFrame` with `SparkDataFrame` in `dataframe.R` and `data-manipulation.R`.
- Other minor syntax fixes and a typo.

## How was this patch tested?

Manual.

Author: Dongjoon Hyun <dongjoon@apache.org>

Closes #12649 from dongjoon-hyun/SPARK-14883.
2016-04-24 22:10:27 -07:00
Sean Owen be0d5d3bbe [SPARK-14873][CORE] Java sampleByKey methods take ju.Map but with Scala Double values; results in type Object
## What changes were proposed in this pull request?

Java `sampleByKey` methods should accept `Map` with `java.lang.Double` values
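A rough sketch of the idea, assuming a helper (the hypothetical `toScalaFractions`) that accepts a `java.util.Map` with boxed `java.lang.Double` values and converts it for use on the Scala side. It does not reproduce the actual `sampleByKey` signature.

```scala
import java.util.{HashMap => JHashMap, Map => JMap}

object FractionsSketch {
  // Declaring the value type as java.lang.Double keeps the Java-facing signature
  // usable as Map<K, Double> rather than a map of Object values.
  def toScalaFractions[K](fractions: JMap[K, java.lang.Double]): Map[K, Double] = {
    val builder = Map.newBuilder[K, Double]
    val it = fractions.entrySet().iterator()
    while (it.hasNext) {
      val entry = it.next()
      // Unbox explicitly so the Scala map holds primitive-backed Double values.
      builder += (entry.getKey -> entry.getValue.doubleValue)
    }
    builder.result()
  }

  // Builds a map the way a Java caller would and converts it.
  def example(): Map[String, Double] = {
    val javaMap = new JHashMap[String, java.lang.Double]()
    javaMap.put("a", java.lang.Double.valueOf(0.5))
    javaMap.put("b", java.lang.Double.valueOf(1.0))
    toScalaFractions(javaMap)
  }
}
```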

## How was this patch tested?

Existing (updated) Jenkins tests

Author: Sean Owen <sowen@cloudera.com>

Closes #12637 from srowen/SPARK-14873.
2016-04-23 10:47:50 -07:00
Sean Owen 8bd05c9db2 [SPARK-8393][STREAMING] JavaStreamingContext#awaitTermination() throws non-declared InterruptedException
## What changes were proposed in this pull request?

`JavaStreamingContext.awaitTermination` methods should be declared as `throws[InterruptedException]` so that this exception can be handled in Java code. Note this is not just a doc change, but an API change, since now (in Java) the method has a checked exception to handle. All await-like methods in Java APIs behave this way, so seems worthwhile for 2.0.
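As an illustration of the annotation involved, the sketch below uses a hypothetical stand-in class, not `JavaStreamingContext` itself. The `@throws` annotation makes the compiled method declare the checked exception, so Java callers can catch it.

```scala
// Hypothetical stand-in class used only to show the annotation.
class AwaitSketch {
  // Without @throws, Scala emits no "throws" clause and Java code cannot
  // catch InterruptedException from this method without a compile error.
  @throws[InterruptedException]
  def awaitTermination(timeoutMs: Long): Unit = Thread.sleep(timeoutMs)
}
```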

## How was this patch tested?

Jenkins tests

Author: Sean Owen <sowen@cloudera.com>

Closes #12418 from srowen/SPARK-8393.
2016-04-21 11:03:16 +01:00
Yuhao Yang ed9d803854 [SPARK-14635][ML] Documentation and Examples for TF-IDF only refer to HashingTF
## What changes were proposed in this pull request?

Currently, the docs for TF-IDF only refer to using HashingTF with IDF. However, CountVectorizer can also be used. We should probably amend the user guide and examples to show this.

## How was this patch tested?

unit tests and doc generation

Author: Yuhao Yang <hhbyyh@gmail.com>

Closes #12454 from hhbyyh/tfdoc.
2016-04-20 11:45:08 +01:00
Mark Grover 2b151b6b93 [SPARK-14711][BUILD] Examples jar not a part of distribution.
## What changes were proposed in this pull request?

Move the spark-examples.jar from being in examples/target to examples/target/scala-2.11/jars

## How was this patch tested?

Built distribution to make sure examples jar was being included in the tarball.
Ran run-example to make sure examples were run.

Author: Mark Grover <mark@apache.org>

Closes #12476 from markgrover/spark-14711.
2016-04-18 17:44:42 -07:00
Zheng RuiFeng 9bfb35da1e [SPARK-14515][DOC] Add python example for ChiSqSelector
## What changes were proposed in this pull request?
Add the missing python example for ChiSqSelector

## How was this patch tested?
manual tests

Author: Zheng RuiFeng <ruifengz@foxmail.com>

Closes #12283 from zhengruifeng/chi2_pe.
2016-04-18 17:14:22 -07:00
hyukjinkwon 6fc1e72d9b [MINOR] Revert removing explicit typing (changed in some examples and StatFunctions)
## What changes were proposed in this pull request?

This PR reverts some changes in https://github.com/apache/spark/pull/12413. (please see the discussion in that PR).

from
```scala
    words.foreachRDD { (rdd, time) =>
    ...
```

to
```scala
    words.foreachRDD { (rdd: RDD[String], time: Time) =>
    ...
```

Also, this was discussed in dev-mailing list, [here](http://apache-spark-developers-list.1001551.n3.nabble.com/Question-about-Scala-style-explicit-typing-within-transformation-functions-and-anonymous-val-td17173.html)
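For reference, both forms compile and behave identically; the explicit form only spells out the parameter types. The sketch below uses a plain `Seq` and `foldLeft` rather than `foreachRDD`, so it runs without Spark.

```scala
object ExplicitTypingSketch {
  val words: Seq[String] = Seq("spark", "streaming")

  // Parameter types inferred from foldLeft's signature.
  val inferred: Int = words.foldLeft(0) { (total, word) => total + word.length }

  // Same closure with explicit parameter types, the style this PR restores.
  val explicit: Int = words.foldLeft(0) { (total: Int, word: String) => total + word.length }
}
```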

## How was this patch tested?

This was tested with `sbt scalastyle`.

Author: hyukjinkwon <gurwls223@gmail.com>

Closes #12452 from HyukjinKwon/revert-explicit-typing.
2016-04-18 13:45:03 -07:00
Xusen Yin 8c62edb70f [SPARK-14299][EXAMPLES] Remove duplications for scala.examples.ml
## What changes were proposed in this pull request?

https://issues.apache.org/jira/browse/SPARK-14299

Delete duplications in scala/examples/ml.

TrainValidationSplitExample.scala --> ModelSelectionViaTrainValidationSplitExample
CrossValidatorExample.scala --> ModelSelectionViaCrossValidationExample

## How was this patch tested?

Existing tests passed.

Author: Xusen Yin <yinxusen@gmail.com>

Closes #12366 from yinxusen/SPARK-14299-2.
2016-04-18 13:34:36 -07:00
hyukjinkwon 9f678e9754 [MINOR] Remove inappropriate type notation and extra anonymous closure within functional transformations
## What changes were proposed in this pull request?

This PR removes

- Inappropriate type notations
    For example, from
    ```scala
    words.foreachRDD { (rdd: RDD[String], time: Time) =>
    ...
    ```
    to
    ```scala
    words.foreachRDD { (rdd, time) =>
    ...
    ```

- Extra anonymous closure within functional transformations.
    For example,
    ```scala
    .map(item => {
      ...
    })
    ```

    which can be written more simply as below:

    ```scala
    .map { item =>
      ...
    }
    ```

and corrects some obvious style nits.
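A small runnable sketch of the closure style change, using a plain `Seq` so it does not need Spark. The names are hypothetical.

```scala
object ClosureStyleSketch {
  val words: Seq[String] = Seq("spark", "ml", "sql")

  // Extra anonymous closure: parentheses wrapping a braced block.
  val verbose: Seq[Int] = words.map(item => {
    item.length
  })

  // Preferred form: a single braced function literal.
  val concise: Seq[Int] = words.map { item =>
    item.length
  }
}
```

Both expressions produce the same result; only the syntax differs.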

## How was this patch tested?

This was tested after adding rules in `scalastyle-config.xml`, which did not end up finding every occurrence.

The rules applied were below:

- For the first correction,

```xml
<check customId="NoExtraClosure" level="error" class="org.scalastyle.file.RegexChecker" enabled="true">
    <parameters><parameter name="regex">(?m)\.[a-zA-Z_][a-zA-Z0-9]*\(\s*[^,]+s*=>\s*\{[^\}]+\}\s*\)</parameter></parameters>
</check>
```

```xml
<check customId="NoExtraClosure" level="error" class="org.scalastyle.file.RegexChecker" enabled="true">
    <parameters><parameter name="regex">\.[a-zA-Z_][a-zA-Z0-9]*\s*[\{|\(]([^\n>,]+=>)?\s*\{([^()]|(?R))*\}^[,]</parameter></parameters>
</check>
```

- For the second correction
```xml
<check customId="TypeNotation" level="error" class="org.scalastyle.file.RegexChecker" enabled="true">
    <parameters><parameter name="regex">\.[a-zA-Z_][a-zA-Z0-9]*\s*[\{|\(]\s*\([^):]*:R))*\}^[,]</parameter></parameters>
</check>
```

**Those rules were not added**

Author: hyukjinkwon <gurwls223@gmail.com>

Closes #12413 from HyukjinKwon/SPARK-style.
2016-04-16 14:56:23 +01:00
hyukjinkwon 6fc3dc8839 [MINOR][SQL] Remove extra anonymous closure within functional transformations
## What changes were proposed in this pull request?

This PR removes extra anonymous closure within functional transformations.

For example,

```scala
.map(item => {
  ...
})
```

which can be written more simply as below:

```scala
.map { item =>
  ...
}
```

## How was this patch tested?

Related unit tests and `sbt scalastyle`.

Author: hyukjinkwon <gurwls223@gmail.com>

Closes #12382 from HyukjinKwon/minor-extra-closers.
2016-04-14 09:43:41 +01:00
Yuhao Yang 781df49983 [SPARK-13089][ML] [Doc] spark.ml Naive Bayes user guide and examples
jira: https://issues.apache.org/jira/browse/SPARK-13089

Add section in ml-classification.md for NaiveBayes DataFrame-based API, plus example code (using include_example to clip code from examples/ folder files).

Author: Yuhao Yang <hhbyyh@gmail.com>

Closes #11015 from hhbyyh/naiveBayesDoc.
2016-04-13 13:58:35 -07:00
Zheng RuiFeng fcdd69260e [SPARK-14509][DOC] Add python CountVectorizerExample
## What changes were proposed in this pull request?
Add python CountVectorizerExample

## How was this patch tested?
manual tests

Author: Zheng RuiFeng <ruifengz@foxmail.com>

Closes #11917 from zhengruifeng/cv_pe.
2016-04-13 13:56:23 -07:00
Dongjoon Hyun b0f5497e95 [SPARK-14508][BUILD] Add a new ScalaStyle Rule OmitBracesInCase
## What changes were proposed in this pull request?

According to the [Spark Code Style Guide](https://cwiki.apache.org/confluence/display/SPARK/Spark+Code+Style+Guide) and [Scala Style Guide](http://docs.scala-lang.org/style/control-structures.html#curlybraces), we had better enforce the following rule.
  ```
  case: Always omit braces in case clauses.
  ```
This PR makes a new ScalaStyle rule, 'OmitBracesInCase', and enforces it to the code.
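A short sketch of what the rule is about, assuming it targets braces that wrap the body of a `case` clause. The object and method names are hypothetical.

```scala
object CaseBracesSketch {
  // Style the rule is meant to flag: braces around each case body.
  def withBraces(x: Int): String = x match {
    case 0 => { "zero" }
    case _ => { "other" }
  }

  // Preferred style: case bodies without braces.
  def withoutBraces(x: Int): String = x match {
    case 0 => "zero"
    case _ => "other"
  }
}
```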

## How was this patch tested?

Pass the Jenkins tests (including Scala style checking)

Author: Dongjoon Hyun <dongjoon@apache.org>

Closes #12280 from dongjoon-hyun/SPARK-14508.
2016-04-12 00:43:28 -07:00
Joseph K. Bradley e9e1adc036 [MINOR][ML] Fixed MLlib build warnings
## What changes were proposed in this pull request?

Fixes to eliminate warnings during package and doc builds.

## How was this patch tested?

Existing unit tests

Author: Joseph K. Bradley <joseph@databricks.com>

Closes #12263 from jkbradley/warning-cleanups.
2016-04-12 03:24:26 +01:00
Xiangrui Meng 1c751fcf48 [SPARK-14500] [ML] Accept Dataset[_] instead of DataFrame in MLlib APIs
## What changes were proposed in this pull request?

This PR updates MLlib APIs to accept `Dataset[_]` as input where `DataFrame` was the input type. This PR doesn't change the output type. In Java, `Dataset[_]` maps to `Dataset<?>`, which includes `Dataset<Row>`. Some implementations were changed in order to return `DataFrame`. Tests and examples were updated. Note that this is a breaking change for subclasses of Transformer/Estimator.
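The shape of the change can be sketched without Spark. `Row`, `Dataset` and `DataFrame` below are hypothetical stand-ins defined locally, not the Spark classes; the sketch only shows a method that accepts `Dataset[_]` while still returning a `DataFrame`.

```scala
object DatasetSketch {
  // Hypothetical stand-ins for illustration only, NOT org.apache.spark.sql types.
  final case class Row(values: Seq[Any])
  final case class Dataset[T](rows: Seq[T])
  type DataFrame = Dataset[Row]

  // Input widened to Dataset[_]; output type stays DataFrame.
  def transform(dataset: Dataset[_]): DataFrame =
    Dataset(dataset.rows.map(r => Row(Seq(r))))

  // A typed dataset and a DataFrame are both accepted by the same method.
  val fromTyped: DataFrame = transform(Dataset(Seq(1, 2, 3)))
  val fromUntyped: DataFrame = transform(fromTyped)
}
```

A subclass that overrides `transform(dataset: DataFrame)` would no longer match the widened signature, which is presumably why the PR calls this a breaking change for subclasses.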

Lol, we don't have to rename the input argument, which has been `dataset` since Spark 1.2.

TODOs:
- [x] update MiMaExcludes (seems all covered by explicit filters from SPARK-13920)
- [x] Python
- [x] add a new test to accept Dataset[LabeledPoint]
- [x] remove unused imports of Dataset

## How was this patch tested?

Existing unit tests with some modifications.

cc: rxin jkbradley

Author: Xiangrui Meng <meng@databricks.com>

Closes #12274 from mengxr/SPARK-14500.
2016-04-11 09:28:28 -07:00
Örjan Lundberg b5c785629a Update KMeansExample.scala
## What changes were proposed in this pull request?
The example does not work without the DataFrame import.

## How was this patch tested?

example doc only

Author: Örjan Lundberg <orjan.lundberg@gmail.com>

Closes #12277 from oluies/patch-1.
2016-04-10 16:30:30 +01:00
Yong Tang 72e66bb270 [SPARK-14301][EXAMPLES] Java examples code merge and clean up.
## What changes were proposed in this pull request?

This fix tries to remove duplicate Java code in examples/mllib and examples/ml. The following changes have been made:

```
deleted: ml/JavaCrossValidatorExample.java (duplicate of JavaModelSelectionViaCrossValidationExample.java)
deleted: ml/JavaTrainValidationSplitExample.java (duplicate of JavaModelSelectionViaTrainValidationSplitExample.java)
deleted: mllib/JavaFPGrowthExample.java (duplicate of JavaSimpleFPGrowth.java)
deleted: mllib/JavaLDAExample.java (duplicate of JavaLatentDirichletAllocationExample.java)
deleted: mllib/JavaKMeans.java (merged with JavaKMeansExample.java)
deleted: mllib/JavaLR.java (duplicate of JavaLinearRegressionWithSGDExample.java)
updated: mllib/JavaKMeansExample.java (merged with mllib/JavaKMeans.java)
```

## How was this patch tested?
Existing tests passed.

Author: Yong Tang <yong.tang.github@outlook.com>

Closes #12143 from yongtang/SPARK-14301.
2016-04-10 02:37:11 +01:00
Zheng RuiFeng adb9d73cd6 [SPARK-14339][DOC] Add python examples for DCT,MinMaxScaler,MaxAbsScaler
## What changes were proposed in this pull request?
add three python examples

## How was this patch tested?
manual tests

Author: Zheng RuiFeng <ruifengz@foxmail.com>

Closes #12063 from zhengruifeng/dct_pe.
2016-04-09 11:25:39 -07:00
Dongjoon Hyun d717ae1fd7 [SPARK-14444][BUILD] Add a new scalastyle NoScalaDoc to prevent ScalaDoc-style multiline comments
## What changes were proposed in this pull request?

According to the [Spark Code Style Guide](https://cwiki.apache.org/confluence/display/SPARK/Spark+Code+Style+Guide#SparkCodeStyleGuide-Indentation), this PR adds a new scalastyle rule to prevent the following.
```
/** In Spark, we don't use the ScalaDoc style so this
  * is not correct.
  */
```
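For contrast, a sketch of the Java-style multiline comment that the style guide asks for instead. The object and method are hypothetical placeholders that give the comment something to document.

```scala
object CommentStyleSketch {
  /**
   * Java-style multiline comment: each continuation line starts with an
   * asterisk aligned under the first asterisk of the opening marker.
   */
  def answer: Int = 42
}
```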

## How was this patch tested?

Pass the Jenkins tests (including `lint-scala`).

Author: Dongjoon Hyun <dongjoon@apache.org>

Closes #12221 from dongjoon-hyun/SPARK-14444.
2016-04-06 16:02:55 -07:00
Marcelo Vanzin 24d7d2e453 [SPARK-13579][BUILD] Stop building the main Spark assembly.
This change modifies the "assembly/" module to just copy needed
dependencies to its build directory, and modifies the packaging
script to pick those up (and remove duplicate jars packages in the
examples module).

I also made some minor adjustments to dependencies to remove some
test jars from the final packaging, and remove jars that conflict with each
other when packaged separately (e.g. servlet api).

Also note that this change restores guava in applications' classpaths, even
though it's still shaded inside Spark. This is now needed for the Hadoop
libraries that are packaged with Spark, which now are not processed by
the shade plugin.

Author: Marcelo Vanzin <vanzin@cloudera.com>

Closes #11796 from vanzin/SPARK-13579.
2016-04-04 16:52:22 -07:00
Dongjoon Hyun 3f749f7ed4 [SPARK-14355][BUILD] Fix typos in Exception/Testcase/Comments and static analysis results
## What changes were proposed in this pull request?

This PR contains the following 5 types of maintenance fix over 59 files (+94 lines, -93 lines).
- Fix typos(exception/log strings, testcase name, comments) in 44 lines.
- Fix lint-java errors (MaxLineLength) in 6 lines. (New codes after SPARK-14011)
- Use diamond operators in 40 lines. (New codes after SPARK-13702)
- Fix redundant semicolon in 5 lines.
- Rename class `InferSchemaSuite` to `CSVInferSchemaSuite` in CSVInferSchemaSuite.scala.

## How was this patch tested?

Manual and pass the Jenkins tests.

Author: Dongjoon Hyun <dongjoon@apache.org>

Closes #12139 from dongjoon-hyun/SPARK-14355.
2016-04-03 18:14:16 -07:00
Dongjoon Hyun 4a6e78abd9 [MINOR][DOCS] Use multi-line JavaDoc comments in Scala code.
## What changes were proposed in this pull request?

This PR aims to convert all Scala-style multiline comments into Java-style multiline comments in Scala code.
(All comment-only changes over 77 files: +786 lines, −747 lines)

## How was this patch tested?

Manual.

Author: Dongjoon Hyun <dongjoon@apache.org>

Closes #12130 from dongjoon-hyun/use_multiine_javadoc_comments.
2016-04-02 17:50:40 -07:00
Jacek Laskowski 06694f1c68 [MINOR] Typo fixes
## What changes were proposed in this pull request?

Typo fixes. No functional changes.

## How was this patch tested?

Built the sources and ran with samples.

Author: Jacek Laskowski <jacek@japila.pl>

Closes #11802 from jaceklaskowski/typo-fixes.
2016-04-02 08:12:04 -07:00
Dongjoon Hyun 1808465855 [MINOR] Fix newly added java-lint errors
## What changes were proposed in this pull request?

This PR fixes some newly added java-lint errors (unused-imports, line-length).

## How was this patch tested?

Pass the Jenkins tests.

Author: Dongjoon Hyun <dongjoon@apache.org>

Closes #11968 from dongjoon-hyun/SPARK-14167.
2016-03-26 11:55:49 +00:00
Shixiong Zhu d23ad7c1c9 [SPARK-13874][DOC] Remove docs of streaming-akka, streaming-zeromq, streaming-mqtt and streaming-twitter
## What changes were proposed in this pull request?

This PR removes all docs about the old streaming-akka, streaming-zeromq, streaming-mqtt and streaming-twitter projects since I have already copied them to https://github.com/spark-packages

Also remove mqtt_wordcount.py that I forgot to remove previously.

## How was this patch tested?

Jenkins PR Build.

Author: Shixiong Zhu <shixiong@databricks.com>

Closes #11824 from zsxwing/remove-doc.
2016-03-26 01:47:27 -07:00
Shixiong Zhu 24587ce433 [SPARK-14073][STREAMING][TEST-MAVEN] Move flume back to Spark
## What changes were proposed in this pull request?

This PR moves flume back to Spark as per the discussion on the dev mailing list.

## How was this patch tested?

Existing Jenkins tests.

Author: Shixiong Zhu <shixiong@databricks.com>

Closes #11895 from zsxwing/move-flume-back.
2016-03-25 17:37:16 -07:00
Xin Ren d283223a5a [SPARK-13017][DOCS] Replace example code in mllib-feature-extraction.md using include_example
Replace example code in mllib-feature-extraction.md using include_example
https://issues.apache.org/jira/browse/SPARK-13017

The example code in the user guide is embedded in the markdown and hence it is not easy to test. It would be nice to automatically test them. This JIRA is to discuss options to automate example code testing and see what we can do in Spark 1.6.

Goal is to move actual example code to spark/examples and test compilation in Jenkins builds. Then in the markdown, we can reference part of the code to show in the user guide. This requires adding a Jekyll tag that is similar to https://github.com/jekyll/jekyll/blob/master/lib/jekyll/tags/include.rb, e.g., called include_example.
`{% include_example scala/org/apache/spark/examples/mllib/TFIDFExample.scala %}`
Jekyll will find `examples/src/main/scala/org/apache/spark/examples/mllib/TFIDFExample.scala`, pick the code blocks marked "example", and use them to replace the code block in `{% highlight %}` in the markdown.
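
The extraction step described above can be sketched roughly as follows. This is a minimal Python sketch, not the actual Jekyll tag (which would be a Ruby plugin); the `$example on$` / `$example off$` marker strings and the `extract_example` helper are assumptions made for illustration.

```python
def extract_example(source, on="$example on$", off="$example off$"):
    """Return only the lines that sit between the on/off marker comments.

    The marker strings are assumed here; they are not taken from this PR.
    """
    selected = []
    inside = False
    for line in source.splitlines():
        if on in line:
            inside = True       # the marker line itself is not copied
        elif off in line:
            inside = False
        elif inside:
            selected.append(line)
    return "\n".join(selected)


# Hypothetical example file content, used only to exercise the helper.
sample = """import org.apache.spark.SparkContext
// $example on$
val tf = hashingTF.transform(documents)
// $example off$
sc.stop()"""

snippet = extract_example(sample)
```

With this approach the setup and teardown code stays in the compiled example file, while only the marked region would be shown in the user guide.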

See more sub-tasks in parent ticket: https://issues.apache.org/jira/browse/SPARK-11337

Author: Xin Ren <iamshrek@126.com>

Closes #11142 from keypointt/SPARK-13017.
2016-03-24 14:25:10 -07:00
Xin Ren dd9ca7b960 [SPARK-13019][DOCS] fix for scala-2.10 build: Replace example code in mllib-statistics.md using include_example
## What changes were proposed in this pull request?

This PR for ticket SPARK-13019 is based on previous PR(https://github.com/apache/spark/pull/11108).
Since PR(https://github.com/apache/spark/pull/11108) is breaking scala-2.10 build, more work is needed to fix build errors.

The new change in this PR is adding the keyword argument for `fractions`:
`    val approxSample = data.sampleByKey(withReplacement = false, fractions = fractions)`
`    val exactSample = data.sampleByKeyExact(withReplacement = false, fractions = fractions)`

I reopened the ticket on JIRA, but I don't know how to reopen a GitHub pull request, so I am submitting a new pull request.
## How was this patch tested?

Manual build testing on local machine, build based on scala-2.10.

Author: Xin Ren <iamshrek@126.com>

Closes #11901 from keypointt/SPARK-13019.
2016-03-24 09:34:54 +00:00
Xiangrui Meng 43ef1e52bf Revert "[SPARK-13019][DOCS] Replace example code in mllib-statistics.md using include_example"
This reverts commit 1af8de200c.
2016-03-21 17:42:30 -07:00
Xin Ren 1af8de200c [SPARK-13019][DOCS] Replace example code in mllib-statistics.md using include_example
https://issues.apache.org/jira/browse/SPARK-13019

The example code in the user guide is embedded in the markdown and hence it is not easy to test. It would be nice to automatically test them. This JIRA is to discuss options to automate example code testing and see what we can do in Spark 1.6.

Goal is to move actual example code to spark/examples and test compilation in Jenkins builds. Then in the markdown, we can reference part of the code to show in the user guide. This requires adding a Jekyll tag that is similar to https://github.com/jekyll/jekyll/blob/master/lib/jekyll/tags/include.rb, e.g., called include_example.
`{% include_example scala/org/apache/spark/examples/mllib/SummaryStatisticsExample.scala %}`
Jekyll will find `examples/src/main/scala/org/apache/spark/examples/mllib/SummaryStatisticsExample.scala`, pick the code blocks marked "example", and use them to replace the code block in `{% highlight %}` in the markdown.

See more sub-tasks in parent ticket: https://issues.apache.org/jira/browse/SPARK-11337

Author: Xin Ren <iamshrek@126.com>

Closes #11108 from keypointt/SPARK-13019.
2016-03-21 16:09:34 -07:00
Dongjoon Hyun 20fd254101 [SPARK-14011][CORE][SQL] Enable LineLength Java checkstyle rule
## What changes were proposed in this pull request?

[Spark Coding Style Guide](https://cwiki.apache.org/confluence/display/SPARK/Spark+Code+Style+Guide) has a 100-character limit on lines, but the check has been disabled for Java since 11/09/15. This PR enables the **LineLength** checkstyle rule again. To help with that, it also introduces **RedundantImport** and **RedundantModifier**. The following is the diff on `checkstyle.xml`.

```xml
-        <!-- TODO: 11/09/15 disabled - the lengths are currently > 100 in many places -->
-        <!--
         <module name="LineLength">
             <property name="max" value="100"/>
             <property name="ignorePattern" value="^package.*|^import.*|a href|href|http://|https://|ftp://"/>
         </module>
-        -->
         <module name="NoLineWrap"/>
         <module name="EmptyBlock">
             <property name="option" value="TEXT"/>
@@ -167,5 +164,7 @@
         </module>
         <module name="CommentsIndentation"/>
         <module name="UnusedImports"/>
+        <module name="RedundantImport"/>
+        <module name="RedundantModifier"/>
```

## How was this patch tested?

Currently, `lint-java` is disabled in Jenkins. It needs a manual test.
After passing the Jenkins tests, `dev/lint-java` should pass locally.

Author: Dongjoon Hyun <dongjoon@apache.org>

Closes #11831 from dongjoon-hyun/SPARK-14011.
2016-03-21 07:58:57 +00:00
Zheng RuiFeng 53f32a22da [MINOR][DOC] Fix nits in JavaStreamingTestExample
## What changes were proposed in this pull request?

Fix some nits discussed in https://github.com/apache/spark/pull/11776#issuecomment-198207419
use !rdd.isEmpty instead of rdd.count > 0
use static instead of AtomicInteger
remove unneeded "throws Exception"
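
The first nit (an emptiness check instead of a full count) can be illustrated without Spark. The sketch below is a plain-Python analogy over generators, not the RDD API; `is_empty` and `tracked` are hypothetical helpers. It shows why stopping at the first element is cheaper than counting everything.

```python
def is_empty(items):
    """Return True if the iterable yields nothing; stops at the first element."""
    for _ in items:
        return False
    return True


def tracked(n, seen):
    """Yield 0..n-1 and record every element that is actually produced."""
    for i in range(n):
        seen.append(i)
        yield i


# Emptiness check: only one element is ever produced.
seen_by_check = []
non_empty = not is_empty(tracked(1000, seen_by_check))

# Counting: every element is produced just to compare the total with zero.
seen_by_count = []
non_empty_by_count = sum(1 for _ in tracked(1000, seen_by_count)) > 0
```

Both variants give the same answer, but the counting variant walks the whole input first.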

## How was this patch tested?

manual tests

Author: Zheng RuiFeng <ruifengz@foxmail.com>

Closes #11821 from zhengruifeng/je_fix.
2016-03-18 12:34:14 +00:00
Wenchen Fan 8ef3399aff [SPARK-13928] Move org.apache.spark.Logging into org.apache.spark.internal.Logging
## What changes were proposed in this pull request?

Logging was made private in Spark 2.0. If we move it, then users would be able to create a Logging trait themselves to avoid changing their own code.

## How was this patch tested?

existing tests.

Author: Wenchen Fan <wenchen@databricks.com>

Closes #11764 from cloud-fan/logger.
2016-03-17 19:23:38 +08:00
Zheng RuiFeng 204c9dec2c [MINOR][DOC] Add JavaStreamingTestExample
## What changes were proposed in this pull request?

Add the Java example for StreamingTest

## How was this patch tested?

manual tests in CLI: bin/run-example mllib.JavaStreamingTestExample dataDir 5 100

Author: Zheng RuiFeng <ruifengz@foxmail.com>

Closes #11776 from zhengruifeng/streaming_je.
2016-03-17 11:09:02 +02:00
Marcelo Vanzin 48978abfa4 [SPARK-13576][BUILD] Don't create assembly for examples.
As part of the goal to stop creating assemblies in Spark, this change
modifies the mvn and sbt builds to not create an assembly for examples.

Instead, dependencies are copied to the build directory (under
target/scala-xx/jars), and in the final archive, into the "examples/jars"
directory.

To avoid having to deal too much with Windows batch files, I made examples
run through the launcher library; the spark-submit launcher now has a
special mode to run examples, which adds all the necessary jars to the
spark-submit command line, and replaces the bash and batch scripts that
were used to run examples. The scripts are now just a thin wrapper around
spark-submit; another advantage is that now all spark-submit options are
supported.

There are a few glitches; in the mvn build, a lot of duplicated dependencies
get copied, because they are promoted to "compile" scope due to extra
dependencies in the examples module (such as HBase). In the sbt build,
all dependencies are copied, because there doesn't seem to be an easy
way to filter things.

I plan to clean some of this up when the rest of the tasks are finished.
When the main assembly is replaced with jars, we can remove duplicate jars
from the examples directory during packaging.

Tested by running SparkPi in: maven build, sbt build, dist created by
make-distribution.sh.

Finally: note that running the "assembly" target in sbt doesn't build
the examples anymore. You need to run "package" for that.

Author: Marcelo Vanzin <vanzin@cloudera.com>

Closes #11452 from vanzin/SPARK-13576.
2016-03-15 09:44:51 -07:00
Shixiong Zhu 06dec37455 [SPARK-13843][STREAMING] Remove streaming-flume, streaming-mqtt, streaming-zeromq, streaming-akka, streaming-twitter to Spark packages
## What changes were proposed in this pull request?

Currently there are a few sub-projects, each for integrating with different external sources for Streaming.  Now that we have better ability to include external libraries (spark packages) and with Spark 2.0 coming up, we can move the following projects out of Spark to https://github.com/spark-packages

- streaming-flume
- streaming-akka
- streaming-mqtt
- streaming-zeromq
- streaming-twitter

They are just ancillary packages, and considering the overhead of maintenance, running tests and PR failures, it's better to maintain them outside of Spark. In addition, these projects can have their own release cycles and we can release them faster.

I have already copied these projects to https://github.com/spark-packages

## How was this patch tested?

Jenkins tests

Author: Shixiong Zhu <shixiong@databricks.com>

Closes #11672 from zsxwing/remove-external-pkg.
2016-03-14 16:56:04 -07:00
Dongjoon Hyun acdf219703 [MINOR][DOCS] Fix more typos in comments/strings.
## What changes were proposed in this pull request?

This PR fixes 135 typos over 107 files:
* 121 typos in comments
* 11 typos in testcase name
* 3 typos in log messages

## How was this patch tested?

Manual.

Author: Dongjoon Hyun <dongjoon@apache.org>

Closes #11689 from dongjoon-hyun/fix_more_typos.
2016-03-14 09:07:39 +00:00
Sean Owen 1840852841 [SPARK-13823][CORE][STREAMING][SQL] Always specify Charset in String <-> byte[] conversions (and remaining Coverity items)
## What changes were proposed in this pull request?

- Fixes calls to `new String(byte[])` or `String.getBytes()` that rely on platform default encoding, to use UTF-8
- Same for `InputStreamReader` and `OutputStreamWriter` constructors
- Standardizes on UTF-8 everywhere
- Standardizes specifying the encoding with `StandardCharsets.UTF_8`, not the Guava constant or "UTF-8" (which means handling `UnsupportedEncodingException`)
- (also addresses the other remaining Coverity scan issues, which are pretty trivial; these are separated into commit 1deecd8d9c )
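
The changes in this PR are to Java and Scala code. As a rough analogy only, the Python sketch below shows the same idea of always naming the charset at the byte/string boundary, and what happens when the two sides disagree. `to_bytes` and `from_bytes` are hypothetical helper names.

```python
def to_bytes(text):
    """Encode text with an explicitly named charset instead of a default."""
    return text.encode("utf-8")


def from_bytes(data):
    """Decode bytes with the same explicitly named charset."""
    return data.decode("utf-8")


original = "caf\u00e9"                    # "café"
encoded = to_bytes(original)
round_tripped = from_bytes(encoded)

# Decoding with a different charset does not fail, it silently garbles the text.
mismatched = encoded.decode("latin-1")
```

The mismatch case is the failure mode that relying on a platform default risks: no exception, just different characters.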

## How was this patch tested?

Jenkins tests

Author: Sean Owen <sowen@cloudera.com>

Closes #11657 from srowen/SPARK-13823.
2016-03-13 21:03:49 -07:00
Cheng Lian c079420d7c [SPARK-13841][SQL] Removes Dataset.collectRows()/takeRows()
## What changes were proposed in this pull request?

This PR removes two methods, `collectRows()` and `takeRows()`, from `Dataset[T]`. These methods were added in PR #11443, and were later considered not useful.

## How was this patch tested?

Existing tests should do the work.

Author: Cheng Lian <lian@databricks.com>

Closes #11678 from liancheng/remove-collect-rows-and-take-rows.
2016-03-13 12:02:52 +08:00
Zheng RuiFeng 42afd72c65 [SPARK-13814] [PYSPARK] Delete unnecessary imports in python examples files
JIRA:  https://issues.apache.org/jira/browse/SPARK-13814

## What changes were proposed in this pull request?

delete unnecessary imports in python examples files

## How was this patch tested?

manual tests

Author: Zheng RuiFeng <ruifengz@foxmail.com>

Closes #11651 from zhengruifeng/del_import_pe.
2016-03-11 13:49:37 -08:00
Cheng Lian 6d37e1eb90 [SPARK-13817][BUILD][SQL] Re-enable MiMA and removes object DataFrame
## What changes were proposed in this pull request?

PR #11443 temporarily disabled MiMA check, this PR re-enables it.

One extra change is that `object DataFrame` is also removed. The only purpose of introducing `object DataFrame` was to use it as an internal factory for creating `Dataset[Row]`. By replacing this internal factory with `Dataset.newDataFrame`, both `DataFrame` and `DataFrame$` are entirely removed from the API, so that we can simply put a `MissingClassProblem` filter in `MimaExcludes.scala` for most DataFrame API  changes.

## How was this patch tested?

Tested by MiMA check triggered by Jenkins.

Author: Cheng Lian <lian@databricks.com>

Closes #11656 from liancheng/re-enable-mima.
2016-03-11 22:17:50 +08:00
Nick Pentreath 8fff0f92a4 [HOT-FIX][SQL][ML] Fix compile error from use of DataFrame in Java MaxAbsScaler example
## What changes were proposed in this pull request?

Fix build failure introduced in #11392 (change `DataFrame` -> `Dataset<Row>`).

## How was this patch tested?

Existing build/unit tests

Author: Nick Pentreath <nick.pentreath@gmail.com>

Closes #11653 from MLnick/java-maxabs-example-fix.
2016-03-11 10:20:39 +02:00
Yuhao Yang 0b713e0455 [SPARK-13512][ML] add example and doc for MaxAbsScaler
## What changes were proposed in this pull request?

jira: https://issues.apache.org/jira/browse/SPARK-13512
Add example and doc for ml.feature.MaxAbsScaler.

## How was this patch tested?
 unit tests

Author: Yuhao Yang <hhbyyh@gmail.com>

Closes #11392 from hhbyyh/maxabsdoc.
2016-03-11 09:31:35 +02:00
Zheng RuiFeng d18276cb1d [SPARK-13672][ML] Add python examples of BisectingKMeans in ML and MLLIB
JIRA: https://issues.apache.org/jira/browse/SPARK-13672

## What changes were proposed in this pull request?

add two python examples of BisectingKMeans for ml and mllib

## How was this patch tested?

manual tests

Author: Zheng RuiFeng <ruifengz@foxmail.com>

Closes #11515 from zhengruifeng/mllib_bkm_pe.
2016-03-11 09:21:12 +02:00
Cheng Lian 1d542785b9 [SPARK-13244][SQL] Migrates DataFrame to Dataset
## What changes were proposed in this pull request?

This PR unifies DataFrame and Dataset by migrating existing DataFrame operations to Dataset and making `DataFrame` a type alias of `Dataset[Row]`.

Most Scala code changes are source compatible, but Java API is broken as Java knows nothing about Scala type alias (mostly replacing `DataFrame` with `Dataset<Row>`).

There are several noticeable API changes related to those returning arrays:

1.  `collect`/`take`

    -   Old APIs in class `DataFrame`:

        ```scala
        def collect(): Array[Row]
        def take(n: Int): Array[Row]
        ```

    -   New APIs in class `Dataset[T]`:

        ```scala
        def collect(): Array[T]
        def take(n: Int): Array[T]

        def collectRows(): Array[Row]
        def takeRows(n: Int): Array[Row]
        ```

    Two specialized methods `collectRows` and `takeRows` are added because Java doesn't support returning generic arrays. Thus, for example, `DataFrame.collect(): Array[T]` actually returns `Object` instead of `Array<T>` from Java side.

    Normally, Java users may fall back to `collectAsList` and `takeAsList`.  The two new specialized versions are added to avoid performance regression in ML related code (but maybe I'm wrong and they are not necessary here).

1.  `randomSplit`

    -   Old APIs in class `DataFrame`:

        ```scala
        def randomSplit(weights: Array[Double], seed: Long): Array[DataFrame]
        def randomSplit(weights: Array[Double]): Array[DataFrame]
        ```

    -   New APIs in class `Dataset[T]`:

        ```scala
        def randomSplit(weights: Array[Double], seed: Long): Array[Dataset[T]]
        def randomSplit(weights: Array[Double]): Array[Dataset[T]]
        ```

    Similar problem as above, but hasn't been addressed for Java API yet.  We can probably add `randomSplitAsList` to fix this one.

1.  `groupBy`

    Some original `DataFrame.groupBy` methods have conflicting signature with original `Dataset.groupBy` methods.  To distinguish these two, typed `Dataset.groupBy` methods are renamed to `groupByKey`.

Other noticeable changes:

1.  Dataset always does eager analysis now

    We used to support disabling DataFrame eager analysis to help report a partially analyzed, malformed logical plan on analysis failure.  However, Dataset encoders require eager analysis during Dataset construction.  To preserve the error reporting feature, `AnalysisException` now takes an extra `Option[LogicalPlan]` argument to hold the partially analyzed plan, so that we can check the plan tree when reporting test failures.  This plan is passed by `QueryExecution.assertAnalyzed`.

## How was this patch tested?

Existing tests do the work.

## TODO

- [ ] Fix all tests
- [ ] Re-enable MiMA check
- [ ] Update ScalaDoc (`since`, `group`, and example code)

Author: Cheng Lian <lian@databricks.com>
Author: Yin Huai <yhuai@databricks.com>
Author: Wenchen Fan <wenchen@databricks.com>
Author: Cheng Lian <liancheng@users.noreply.github.com>

Closes #11443 from liancheng/ds-to-df.
2016-03-10 17:00:17 -08:00
Dongjoon Hyun 91fed8e9c5 [SPARK-3854][BUILD] Scala style: require spaces before {.
## What changes were proposed in this pull request?

Since the opening curly brace, '{', has many usages as discussed in [SPARK-3854](https://issues.apache.org/jira/browse/SPARK-3854), this PR adds a ScalaStyle rule that forbids the '){' pattern, enforcing the prevailing style shown below, and fixes the code accordingly. Enforcing this in ScalaStyle from now on will improve Scala code quality and reduce review time.
```
// Correct:
if (true) {
  println("Wow!")
}

// Incorrect:
if (true){
   println("Wow!")
}
```
IntelliJ also shows new warnings based on this.

## How was this patch tested?

Pass the Jenkins ScalaStyle test.

Author: Dongjoon Hyun <dongjoon@apache.org>

Closes #11637 from dongjoon-hyun/SPARK-3854.
2016-03-10 15:57:22 -08:00
JeremyNixon 3e3c3d58d8 [SPARK-13706][ML] Add Python Example for Train Validation Split
## What changes were proposed in this pull request?

This pull request adds a python example for train validation split.

## How was this patch tested?

This was style tested through lint-python, generally tested with ./dev/run-tests, and run in notebook and shell environments. It was viewed in docs locally with jekyll serve.

This contribution is my original work and I license it to Spark under its open source license.

Author: JeremyNixon <jnixon2@gmail.com>

Closes #11547 from JeremyNixon/tvs_example.
2016-03-10 09:18:15 +02:00
Dongjoon Hyun c3689bc24e [SPARK-13702][CORE][SQL][MLLIB] Use diamond operator for generic instance creation in Java code.
## What changes were proposed in this pull request?

To make `docs/examples` (and other related code) simpler, more readable, and more user-friendly, this PR replaces code like the following with the `diamond` operator.

```
-    final ArrayList<Product2<Object, Object>> dataToWrite =
-      new ArrayList<Product2<Object, Object>>();
+    final ArrayList<Product2<Object, Object>> dataToWrite = new ArrayList<>();
```

Java 7 and higher support the **diamond** operator, which replaces the type arguments required to invoke the constructor of a generic class with an empty set of type parameters (<>). Currently, Spark's Java code uses it inconsistently.

## How was this patch tested?

Manual.
Pass the existing tests.

Author: Dongjoon Hyun <dongjoon@apache.org>

Closes #11541 from dongjoon-hyun/SPARK-13702.
2016-03-09 10:31:26 +00:00
Dongjoon Hyun f3201aeeb0 [SPARK-13692][CORE][SQL] Fix trivial Coverity/Checkstyle defects
## What changes were proposed in this pull request?

This issue fixes the following potential bugs and Java coding style issues detected by Coverity and Checkstyle.

- Implement both null and type checking in equals functions.
- Fix wrong type casting logic in SimpleJavaBean2.equals.
- Add `implements Cloneable` to `UTF8String` and `SortedIterator`.
- Remove dereferencing before null check in `AbstractBytesToBytesMapSuite`.
- Fix coding style: Add '{}' to single `for` statement in mllib examples.
- Remove unused imports in `ColumnarBatch` and `JavaKinesisStreamSuite`.
- Remove unused fields in `ChunkFetchIntegrationSuite`.
- Add `stop()` to prevent resource leak.

Please note that the last two checkstyle errors exist on newly added commits after [SPARK-13583](https://issues.apache.org/jira/browse/SPARK-13583).

## How was this patch tested?

manual via `./dev/lint-java` and Coverity site.

Author: Dongjoon Hyun <dongjoon@apache.org>

Closes #11530 from dongjoon-hyun/SPARK-13692.
2016-03-09 10:12:23 +00:00
Yury Liavitski 03f57a6c2d Fixing the type of the sentiment happiness value
## What changes were proposed in this pull request?

Added the conversion to int for the 'happiness value' read from the file. Otherwise, later on line 75 the multiplication will multiply a string by a number, yielding values like "-2-2" instead of -4.
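
The behaviour described above follows from Python's string repetition: multiplying a `str` by an `int` repeats the string instead of doing arithmetic. A minimal sketch, where `parse_happiness` and the variable names are hypothetical and not taken from the example file:

```python
def parse_happiness(field):
    """Convert the raw text field to an int before any arithmetic is done."""
    return int(field.strip())


raw_value = "-2"      # value as read from a text file
occurrences = 2

# Without the conversion, '*' repeats the string.
string_result = raw_value * occurrences

# With the conversion, '*' is ordinary integer multiplication.
numeric_result = parse_happiness(raw_value) * occurrences
```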

## How was this patch tested?

Tested manually.

Author: Yury Liavitski <seconds.before@gmail.com>
Author: Yury Liavitski <yury.liavitski@il111.ice.local>

Closes #11540 from heliocentrist/fix-sentiment-value-type.
2016-03-07 10:54:33 +00:00
Dongjoon Hyun 941b270b70 [MINOR] Fix typos in comments and testcase name of code
## What changes were proposed in this pull request?

This PR fixes typos in comments and testcase name of code.

## How was this patch tested?

manual.

Author: Dongjoon Hyun <dongjoon@apache.org>

Closes #11481 from dongjoon-hyun/minor_fix_typos_in_code.
2016-03-03 22:42:12 +00:00
Xin Ren 70f6f9649b [SPARK-13013][DOCS] Replace example code in mllib-clustering.md using include_example
Replace example code in mllib-clustering.md using include_example
https://issues.apache.org/jira/browse/SPARK-13013

The example code in the user guide is embedded in the markdown and hence it is not easy to test. It would be nice to automatically test them. This JIRA is to discuss options to automate example code testing and see what we can do in Spark 1.6.

Goal is to move actual example code to spark/examples and test compilation in Jenkins builds. Then in the markdown, we can reference part of the code to show in the user guide. This requires adding a Jekyll tag that is similar to https://github.com/jekyll/jekyll/blob/master/lib/jekyll/tags/include.rb, e.g., called include_example.
`{% include_example scala/org/apache/spark/examples/mllib/KMeansExample.scala %}`
Jekyll will find `examples/src/main/scala/org/apache/spark/examples/mllib/KMeansExample.scala`, pick the code blocks marked "example", and use them to replace the code block in `{% highlight %}` in the markdown.

See more sub-tasks in parent ticket: https://issues.apache.org/jira/browse/SPARK-11337

Author: Xin Ren <iamshrek@126.com>

Closes #11116 from keypointt/SPARK-13013.
2016-03-03 09:32:47 -08:00
Dongjoon Hyun b5f02d6743 [SPARK-13583][CORE][STREAMING] Remove unused imports and add checkstyle rule
## What changes were proposed in this pull request?

After SPARK-6990, `dev/lint-java` keeps Java code healthy and helps PR review by saving much time.
This issue aims to remove unused imports from Java/Scala code and add the `UnusedImports` checkstyle rule to help developers.

## How was this patch tested?
```
./dev/lint-java
./build/sbt compile
```

Author: Dongjoon Hyun <dongjoon@apache.org>

Closes #11438 from dongjoon-hyun/SPARK-13583.
2016-03-03 10:12:32 +00:00
Sean Owen e97fc7f176 [SPARK-13423][WIP][CORE][SQL][STREAMING] Static analysis fixes for 2.x
## What changes were proposed in this pull request?

Make some cross-cutting code improvements according to static analysis. These are individually up for discussion since they exist in separate commits that can be reverted. The changes are broadly:

- Inner class should be static
- Mismatched hashCode/equals
- Overflow in compareTo
- Unchecked warnings
- Misuse of assert, vs junit.assert
- get(a) + getOrElse(b) -> getOrElse(a,b)
- Array/String .size -> .length (occasionally, -> .isEmpty / .nonEmpty) to avoid implicit conversions
- Dead code
- tailrec
- exists(_ == ) -> contains
- find + nonEmpty -> exists
- filter + size -> count
- reduce(_+_) -> sum
- map + flatten -> flatMap

The most controversial may be .size -> .length simply because of its size. It is intended to avoid implicits that might be expensive in some places.

## How was this patch tested?

Existing Jenkins unit tests.

Author: Sean Owen <sowen@cloudera.com>

Closes #11292 from srowen/SPARK-13423.
2016-03-03 09:54:09 +00:00
Dongjoon Hyun 02b7677e95 [HOT-FIX] Recover some deprecations for 2.10 compatibility.
## What changes were proposed in this pull request?

#11479 [SPARK-13627] broke 2.10 compatibility: [2.10-Build](https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Compile/job/spark-master-compile-maven-scala-2.10/292/console)
At this moment, we need to support both 2.10 and 2.11.
This PR restores some deprecated methods which were replaced by [SPARK-13627].

## How was this patch tested?

Jenkins build: Both 2.10, 2.11.

Author: Dongjoon Hyun <dongjoon@apache.org>

Closes #11488 from dongjoon-hyun/hotfix_compatibility_with_2.10.
2016-03-03 09:53:02 +00:00
Dongjoon Hyun 9c274ac4a6 [SPARK-13627][SQL][YARN] Fix simple deprecation warnings.
## What changes were proposed in this pull request?

This PR aims to fix the following deprecation warnings.
  * MethodSymbolApi.paramss -> paramLists
  * AnnotationApi.tpe -> tree.tpe
  * BufferLike.readOnly -> toList
  * StandardNames.nme -> termNames
  * scala.tools.nsc.interpreter.AbstractFileClassLoader -> scala.reflect.internal.util.AbstractFileClassLoader
  * TypeApi.declarations -> decls

## How was this patch tested?

Check the compile build log and pass the tests.
```
./build/sbt
```

Author: Dongjoon Hyun <dongjoon@apache.org>

Closes #11479 from dongjoon-hyun/SPARK-13627.
2016-03-02 20:34:22 -08:00
Dongjoon Hyun 366f26d2da [MINOR][STREAMING] Replace deprecated apply with create in example.
## What changes were proposed in this pull request?

Twitter Algebird deprecated `apply` in HyperLogLog.scala.
```
@deprecated("Use toHLL", since = "0.10.0 / 2015-05")
def apply[T <% Array[Byte]](t: T) = create(t)
```
This PR replaces the deprecated usage of `apply` with the new `create`
according to the upstream change.

## How was this patch tested?
manual.
```
/bin/spark-submit --class org.apache.spark.examples.streaming.TwitterAlgebirdHLL examples/target/scala-2.11/spark-examples-2.0.0-SNAPSHOT-hadoop2.2.0.jar
```

Author: Dongjoon Hyun <dongjoon@apache.org>

Closes #11451 from dongjoon-hyun/replace_deprecated_hll_apply.
2016-03-02 11:48:23 +00:00
Zheng RuiFeng 3c5f5e3b5c [SPARK-13550][ML] Add java example for ml.clustering.BisectingKMeans
JIRA: https://issues.apache.org/jira/browse/SPARK-13550

## What changes were proposed in this pull request?

Just add a java example for ml.clustering.BisectingKMeans

## How was this patch tested?

manual tests were done.

Author: Zheng RuiFeng <ruifengz@foxmail.com>

Closes #11428 from zhengruifeng/ml_bkm_je.
2016-02-29 23:55:26 -08:00
Zheng RuiFeng 0a4b620f31 [SPARK-13551][MLLIB] Fix wrong comment and remove meanless lines in mllib.JavaBisectingKMeansExample
JIRA: https://issues.apache.org/jira/browse/SPARK-13551

## What changes were proposed in this pull request?

Fix wrong comment and remove meaningless lines in mllib.JavaBisectingKMeansExample

## How was this patch tested?

manual test

Author: Zheng RuiFeng <ruifengz@foxmail.com>

Closes #11429 from zhengruifeng/mllib_bkm_je.
2016-02-29 22:24:43 -08:00
Dongjoon Hyun 7af0de076f [SPARK-11381][DOCS] Replace example code in mllib-linear-methods.md using include_example
## What changes were proposed in this pull request?

This PR replaces the example code in `mllib-linear-methods.md` using `include_example`
by doing the following:
  * Extracts the example code (Scala, Java, Python) as files in the `example` module.
  * Merges some dialog-style examples into a single file.
  * Hides redundant code in HTML for consistency with other docs.

## How was this patch tested?

manual test.
This PR can be tested by document generations, `SKIP_API=1 jekyll build`.

Author: Dongjoon Hyun <dongjoon@apache.org>

Closes #11320 from dongjoon-hyun/SPARK-11381.
2016-02-26 08:31:55 -08:00
Cheng Lian 99dfcedbfd [SPARK-13457][SQL] Removes DataFrame RDD operations
## What changes were proposed in this pull request?

This is another try of PR #11323.

This PR removes DataFrame RDD operations except for `foreach` and `foreachPartitions` (they are actions rather than transformations). Original calls are now replaced by calls to methods of `DataFrame.rdd`.

PR #11323 was reverted because it introduced a regression: both `DataFrame.foreach` and `DataFrame.foreachPartitions` wrap underlying RDD operations with `withNewExecutionId` to track Spark jobs. But they are removed in #11323.

## How was this patch tested?

No extra tests are added. Existing tests should do the work.

Author: Cheng Lian <lian@databricks.com>

Closes #11388 from liancheng/remove-df-rdd-ops.
2016-02-27 00:28:30 +08:00
Davies Liu 751724b132 Revert "[SPARK-13457][SQL] Removes DataFrame RDD operations"
This reverts commit 157fe64f3e.
2016-02-25 11:53:48 -08:00
Cheng Lian 157fe64f3e [SPARK-13457][SQL] Removes DataFrame RDD operations
## What changes were proposed in this pull request?

This PR removes DataFrame RDD operations. Original calls are now replaced by calls to methods of `DataFrame.rdd`.

## How was this patch tested?

No extra tests are added. Existing tests should do the work.

Author: Cheng Lian <lian@databricks.com>

Closes #11323 from liancheng/remove-df-rdd-ops.
2016-02-25 23:07:59 +08:00
JeremyNixon 230bbeaa61 [SPARK-10759][ML] update cross validator with include_example
This pull request uses {%include_example%} to add an example for the python cross validator to ml-guide.

Author: JeremyNixon <jnixon2@gmail.com>

Closes #11240 from JeremyNixon/pipeline_include_example.
2016-02-23 15:57:29 -08:00
movelikeriver 5cd3e6f60b [SPARK-13257][IMPROVEMENT] Refine naive Bayes example by checking model after loading it
Refine naive Bayes example by checking model after loading it

Author: movelikeriver <mars.lenjoy@gmail.com>

Closes #11125 from movelikeriver/naive_bayes.
2016-02-22 23:58:54 -08:00
Devaraj K 02b1fefffb [SPARK-13012][DOCUMENTATION] Replace example code in ml-guide.md using include_example
Replaced example code in ml-guide.md using include_example

Author: Devaraj K <devaraj@apache.org>

Closes #11053 from devaraj-kavali/SPARK-13012.
2016-02-22 17:21:37 -08:00
Devaraj K 9f410871ca [SPARK-13016][DOCUMENTATION] Replace example code in mllib-dimensionality-reduction.md using include_example
Replaced example code in mllib-dimensionality-reduction.md using include_example

Author: Devaraj K <devaraj@apache.org>

Closes #11132 from devaraj-kavali/SPARK-13016.
2016-02-22 17:16:56 -08:00
Luciano Resende 1a340da8d7 [SPARK-13248][STREAMING] Remove deprecated Streaming APIs.
Remove deprecated Streaming APIs and adjust sample applications.

Author: Luciano Resende <lresende@apache.org>

Closes #11139 from lresende/streaming-deprecated-apis.
2016-02-21 16:27:56 +00:00
Sean Owen b84404865b [SPARK-13324][CORE][BUILD] Update plugin, test, example dependencies for 2.x
Phase 1: update plugin versions, test dependencies, some example and third-party versions

Author: Sean Owen <sowen@cloudera.com>

Closes #11206 from srowen/SPARK-13324.
2016-02-17 19:03:29 -08:00
shijinkui 97ee85daf6 [SPARK-12953][EXAMPLES] RDDRelation writer set overwrite mode
https://issues.apache.org/jira/browse/SPARK-12953

Fix the error when running RDDRelation.main():
"path file:/Users/sjk/pair.parquet already exists"

Set DataFrameWriter's mode to SaveMode.Overwrite

Author: shijinkui <shijinkui666@163.com>

Closes #10864 from shijinkui/set_mode.
2016-02-17 15:08:22 -08:00
BenFradet 00c72d27bf [SPARK-12247][ML][DOC] Documentation for spark.ml's ALS and collaborative filtering in general
This documents the implementation of ALS in `spark.ml` with example code in Scala, Java and Python.

Author: BenFradet <benjamin.fradet@gmail.com>

Closes #10411 from BenFradet/SPARK-12247.
2016-02-16 13:03:28 +00:00
Xin Ren e4675c2402 [SPARK-13018][DOCS] Replace example code in mllib-pmml-model-export.md using include_example
Replace example code in mllib-pmml-model-export.md using include_example
https://issues.apache.org/jira/browse/SPARK-13018

The example code in the user guide is embedded in the markdown and hence it is not easy to test. It would be nice to automatically test them. This JIRA is to discuss options to automate example code testing and see what we can do in Spark 1.6.

Goal is to move actual example code to spark/examples and test compilation in Jenkins builds. Then in the markdown, we can reference part of the code to show in the user guide. This requires adding a Jekyll tag that is similar to https://github.com/jekyll/jekyll/blob/master/lib/jekyll/tags/include.rb, e.g., called include_example.
`{% include_example scala/org/apache/spark/examples/mllib/PMMLModelExportExample.scala %}`
Jekyll will find `examples/src/main/scala/org/apache/spark/examples/mllib/PMMLModelExportExample.scala`, pick the code blocks marked "example", and use them to replace the code block in `{% highlight %}` in the markdown.

See more sub-tasks in parent ticket: https://issues.apache.org/jira/browse/SPARK-11337

Author: Xin Ren <iamshrek@126.com>

Closes #11126 from keypointt/SPARK-13018.
2016-02-15 20:17:21 -08:00