Commit graph

651 commits

Author SHA1 Message Date
Kousuke Saruta e5904bb5e7 [SPARK-12692][BUILD][MLLIB] Scala style: Fix the style violation (Space before "," or ":")
Fix the style violation (space before , and :).
This PR is a followup for #10643.

Author: Kousuke Saruta <sarutak@oss.nttdata.co.jp>

Closes #10684 from sarutak/SPARK-12692-followup-mllib.
2016-01-10 12:38:57 -08:00
Sean Owen 659fd9d04b [SPARK-4819] Remove Guava's "Optional" from public API
Replace Guava `Optional` with (an API clone of) Java 8 `java.util.Optional` (edit: and a clone of Guava `Optional`)

See also https://github.com/apache/spark/pull/10512

Author: Sean Owen <sowen@cloudera.com>

Closes #10513 from srowen/SPARK-4819.
2016-01-08 13:02:30 -08:00
Udo Klein 8c70cb4c62 fixed numVertices in transitive closure example
Author: Udo Klein <git@blinkenlight.net>

Closes #10642 from udoklein/patch-2.
2016-01-08 20:32:37 +00:00
Sean Owen b9c8353378 [SPARK-12618][CORE][STREAMING][SQL] Clean up build warnings: 2.0.0 edition
Fix most build warnings: mostly deprecated API usages. I'll annotate some of the changes below. CC rxin who is leading the charge to remove the deprecated APIs.

Author: Sean Owen <sowen@cloudera.com>

Closes #10570 from srowen/SPARK-12618.
2016-01-08 17:47:44 +00:00
Shixiong Zhu c0c397509b [SPARK-12510][STREAMING] Refactor ActorReceiver to support Java
This PR includes the following changes:

1. Rename `ActorReceiver` to `ActorReceiverSupervisor`
2. Remove `ActorHelper`
3. Add a new `ActorReceiver` for Scala and `JavaActorReceiver` for Java
4. Add `JavaActorWordCount` example

Author: Shixiong Zhu <shixiong@databricks.com>

Closes #10457 from zsxwing/java-actor-stream.
2016-01-07 15:26:55 -08:00
Jacek Laskowski 8113dbda0b [STREAMING][DOCS][EXAMPLES] Minor fixes
Author: Jacek Laskowski <jacek@japila.pl>

Closes #10603 from jaceklaskowski/streaming-actor-custom-receiver.
2016-01-07 00:27:13 -08:00
Reynold Xin 8ce645d4ee [SPARK-12615] Remove some deprecated APIs in RDD/SparkContext
I looked at each case individually and it looks like they can all be removed. The only one that I had to think twice was toArray (I even thought about un-deprecating it, until I realized it was a problem in Java to have toArray returning java.util.List).

Author: Reynold Xin <rxin@databricks.com>

Closes #10569 from rxin/SPARK-12615.
2016-01-05 11:10:14 -08:00
Marcelo Vanzin 7058dc1150 [SPARK-3873][EXAMPLES] Import ordering fixes.
Author: Marcelo Vanzin <vanzin@cloudera.com>

Closes #10575 from vanzin/SPARK-3873-examples.
2016-01-04 22:42:54 -08:00
Sean Owen 15bd73627e [SPARK-12481][CORE][STREAMING][SQL] Remove usage of Hadoop deprecated APIs and reflection that supported 1.x
Remove use of deprecated Hadoop APIs now that 2.2+ is required

Author: Sean Owen <sowen@cloudera.com>

Closes #10446 from srowen/SPARK-12481.
2016-01-02 13:15:53 +00:00
Reynold Xin ee8f8d3184 [SPARK-12588] Remove HttpBroadcast in Spark 2.0.
We switched to TorrentBroadcast in Spark 1.1, and HttpBroadcast has been undocumented since then. It's time to remove it in Spark 2.0.

Author: Reynold Xin <rxin@databricks.com>

Closes #10531 from rxin/SPARK-12588.
2015-12-30 18:07:07 -08:00
Shixiong Zhu 20591afd79 [SPARK-12429][STREAMING][DOC] Add Accumulator and Broadcast example for Streaming
This PR adds Scala, Java and Python examples to show how to use Accumulator and Broadcast in Spark Streaming to support checkpointing.

Author: Shixiong Zhu <shixiong@databricks.com>

Closes #10385 from zsxwing/accumulator-broadcast-example.
2015-12-22 16:39:10 -08:00
Reynold Xin 284e29a870 [SPARK-11808] Remove Bagel.
Author: Reynold Xin <rxin@databricks.com>

Closes #10395 from rxin/SPARK-11808.
2015-12-19 22:40:35 -08:00
Reynold Xin f496031bd2 Bump master version to 2.0.0-SNAPSHOT.
Author: Reynold Xin <rxin@databricks.com>

Closes #10387 from rxin/version-bump.
2015-12-19 15:13:05 -08:00
Jeff L ea59b0f3a6 [SPARK-9057][STREAMING] Twitter example joining to static RDD of word sentiment values
Example of joining a static RDD of word sentiments to a streaming RDD of Tweets in order to demo the usage of the transform() method.

Author: Jeff L <sha0lin@alumni.carnegiemellon.edu>

Closes #8431 from Agent007/SPARK-9057.
2015-12-18 15:06:54 +00:00
Shixiong Zhu 540b5aeadc [SPARK-12410][STREAMING] Fix places that use '.' and '|' directly in split
String.split accepts a regular expression, so we should escape "." and "|".

Author: Shixiong Zhu <shixiong@databricks.com>

Closes #10361 from zsxwing/reg-bug.
2015-12-17 13:23:48 -08:00
Yanbo Liang 1a8b2a17db [SPARK-12364][ML][SPARKR] Add ML example for SparkR
We have DataFrame example for SparkR, we also need to add ML example under ```examples/src/main/r```.

cc mengxr jkbradley shivaram

Author: Yanbo Liang <ybliang8@gmail.com>

Closes #10324 from yanboliang/spark-12364.
2015-12-16 12:59:22 -08:00
Yu ISHIKAWA 7b6dc29d0e [SPARK-6518][MLLIB][EXAMPLE][DOC] Add example code and user guide for bisecting k-means
This PR includes only an example code in order to finish it quickly.
I'll send another PR for the docs soon.

Author: Yu ISHIKAWA <yuu.ishikawa@gmail.com>

Closes #9952 from yu-iskw/SPARK-6518.
2015-12-16 10:55:42 -08:00
Yu ISHIKAWA 26d70bd2b4 [SPARK-12215][ML][DOC] User guide section for KMeans in spark.ml
cc jkbradley

Author: Yu ISHIKAWA <yuu.ishikawa@gmail.com>

Closes #10244 from yu-iskw/SPARK-12215.
2015-12-16 10:43:45 -08:00
Yanbo Liang b24c12d733 [MINOR][ML] Rename weights to coefficients for examples/DeveloperApiExample
Rename ```weights``` to ```coefficients``` for examples/DeveloperApiExample.

cc mengxr jkbradley

Author: Yanbo Liang <ybliang8@gmail.com>

Closes #10280 from yanboliang/spark-coefficients.
2015-12-15 16:29:39 -08:00
Xusen Yin 98b212d36b [SPARK-12199][DOC] Follow-up: Refine example code in ml-features.md
https://issues.apache.org/jira/browse/SPARK-12199

Follow-up PR of SPARK-11551. Fix some errors in ml-features.md

mengxr

Author: Xusen Yin <yinxusen@gmail.com>

Closes #10193 from yinxusen/SPARK-12199.
2015-12-12 17:47:01 -08:00
Yanbo Liang a0ff6d16ef [SPARK-11978][ML] Move dataset_example.py to examples/ml and rename to dataframe_example.py
Since ```Dataset``` has a new meaning in Spark 1.6, we should rename it to avoid confusion.
#9873 finished the work of Scala example, here we focus on the Python one.
Move dataset_example.py to ```examples/ml``` and rename to ```dataframe_example.py```.
BTW, fix minor missing issues of #9873.
cc mengxr

Author: Yanbo Liang <ybliang8@gmail.com>

Closes #9957 from yanboliang/SPARK-11978.
2015-12-11 18:02:24 -08:00
Yanbo Liang 0fb9825556 [SPARK-12146][SPARKR] SparkR jsonFile should support multiple input files
* ```jsonFile``` should support multiple input files, such as:
```R
jsonFile(sqlContext, c(“path1”, “path2”)) # character vector as arguments
jsonFile(sqlContext, “path1,path2”)
```
* Meanwhile, ```jsonFile``` has been deprecated by Spark SQL and will be removed at Spark 2.0. So we mark ```jsonFile``` deprecated and use ```read.json``` at SparkR side.
* Replace all ```jsonFile``` with ```read.json``` at test_sparkSQL.R, but still keep jsonFile test case.
* If this PR is accepted, we should also make almost the same change for ```parquetFile```.

cc felixcheung sun-rui shivaram

Author: Yanbo Liang <ybliang8@gmail.com>

Closes #10145 from yanboliang/spark-12146.
2015-12-11 11:47:35 -08:00
Bryan Cutler 6a6c1fc5c8 [SPARK-11713] [PYSPARK] [STREAMING] Initial RDD updateStateByKey for PySpark
Adding ability to define an initial state RDD for use with updateStateByKey PySpark.  Added unit test and changed stateful_network_wordcount example to use initial RDD.

Author: Bryan Cutler <bjcutler@us.ibm.com>

Closes #10082 from BryanCutler/initial-rdd-updateStateByKey-SPARK-11713.
2015-12-10 14:21:15 -08:00
Tathagata Das bd2cd4f53d [SPARK-12244][SPARK-12245][STREAMING] Rename trackStateByKey to mapWithState and change tracking function signature
SPARK-12244:

Based on feedback from early users and personal experience attempting to explain it, the name trackStateByKey had two problem.
"trackState" is a completely new term which really does not give any intuition on what the operation is
the resultant data stream of objects returned by the function is called in docs as the "emitted" data for the lack of a better.
"mapWithState" makes sense because the API is like a mapping function like (Key, Value) => T with State as an additional parameter. The resultant data stream is "mapped data". So both problems are solved.

SPARK-12245:

From initial experiences, not having the key in the function makes it hard to return mapped stuff, as the whole information of the records is not there. Basically the user is restricted to doing something like mapValue() instead of map(). So adding the key as a parameter.

Author: Tathagata Das <tathagata.das1565@gmail.com>

Closes #10224 from tdas/rename.
2015-12-09 20:47:15 -08:00
Xusen Yin 051c6a066f [SPARK-11551][DOC] Replace example code in ml-features.md using include_example
PR on behalf of somideshmukh, thanks!

Author: Xusen Yin <yinxusen@gmail.com>
Author: somideshmukh <somilde@us.ibm.com>

Closes #10219 from yinxusen/SPARK-11551.
2015-12-09 12:00:48 -08:00
BenFradet 06746b3005 [SPARK-12159][ML] Add user guide section for IndexToString transformer
Documentation regarding the `IndexToString` label transformer with code snippets in Scala/Java/Python.

Author: BenFradet <benjamin.fradet@gmail.com>

Closes #10166 from BenFradet/SPARK-12159.
2015-12-08 12:45:34 -08:00
Yuhao Yang 5cb4695051 [SPARK-11605][MLLIB] ML 1.6 QA: API: Java compatibility, docs
jira: https://issues.apache.org/jira/browse/SPARK-11605
Check Java compatibility for MLlib for this release.

fix:

1. `StreamingTest.registerStream` needs java friendly interface.

2. `GradientBoostedTreesModel.computeInitialPredictionAndError` and `GradientBoostedTreesModel.updatePredictionError` has java compatibility issue. Mark them as `developerAPI`.

TBD:
[updated] no fix for now per discussion.
`org.apache.spark.mllib.classification.LogisticRegressionModel`
`public scala.Option<java.lang.Object> getThreshold();` has wrong return type for Java invocation.
`SVMModel` has the similar issue.

Yet adding a `scala.Option<java.util.Double> getThreshold()` would result in an overloading error due to the same function signature. And adding a new function with different name seems to be not necessary.

cc jkbradley feynmanliang

Author: Yuhao Yang <hhbyyh@gmail.com>

Closes #10102 from hhbyyh/javaAPI.
2015-12-08 11:46:26 -08:00
Yuhao Yang 872a2ee281 [SPARK-10393] use ML pipeline in LDA example
jira: https://issues.apache.org/jira/browse/SPARK-10393

Since the logic of the text processing part has been moved to ML estimators/transformers, replace the related code in LDA Example with the ML pipeline.

Author: Yuhao Yang <hhbyyh@gmail.com>
Author: yuhaoyang <yuhao@zhanglipings-iMac.local>

Closes #8551 from hhbyyh/ldaExUpdate.
2015-12-08 10:29:51 -08:00
Cheng Lian da2012a0e1 [SPARK-11551][DOC][EXAMPLE] Revert PR #10002
This reverts PR #10002, commit 78209b0cca.

The original PR wasn't tested on Jenkins before being merged.

Author: Cheng Lian <lian@databricks.com>

Closes #10200 from liancheng/revert-pr-10002.
2015-12-08 19:18:59 +08:00
Yanbo Liang 4a39b5a1be [SPARK-11958][SPARK-11957][ML][DOC] SQLTransformer user guide and example code
Add ```SQLTransformer``` user guide, example code and make Scala API doc more clear.

Author: Yanbo Liang <ybliang8@gmail.com>

Closes #10006 from yanboliang/spark-11958.
2015-12-07 23:50:57 -08:00
somideshmukh 78209b0cca [SPARK-11551][DOC][EXAMPLE] Replace example code in ml-features.md using include_example
Made new patch contaning only markdown examples moved to exmaple/folder.
Ony three  java code were not shfted since they were contaning compliation error ,these classes are
1)StandardScale 2)NormalizerExample 3)VectorIndexer

Author: Xusen Yin <yinxusen@gmail.com>
Author: somideshmukh <somilde@us.ibm.com>

Closes #10002 from somideshmukh/SomilBranch1.33.
2015-12-07 23:26:34 -08:00
Xusen Yin 871e85d9c1 [SPARK-11963][DOC] Add docs for QuantileDiscretizer
https://issues.apache.org/jira/browse/SPARK-11963

Author: Xusen Yin <yinxusen@gmail.com>

Closes #9962 from yinxusen/SPARK-11963.
2015-12-07 13:16:47 -08:00
Shixiong Zhu 3af53e61fd [SPARK-12084][CORE] Fix codes that uses ByteBuffer.array incorrectly
`ByteBuffer` doesn't guarantee all contents in `ByteBuffer.array` are valid. E.g, a ByteBuffer returned by `ByteBuffer.slice`. We should not use the whole content of `ByteBuffer` unless we know that's correct.

This patch fixed all places that use `ByteBuffer.array` incorrectly.

Author: Shixiong Zhu <shixiong@databricks.com>

Closes #10083 from zsxwing/bytebuffer-array.
2015-12-04 17:02:04 -08:00
Josh Rosen b7204e1d41 [SPARK-12112][BUILD] Upgrade to SBT 0.13.9
We should upgrade to SBT 0.13.9, since this is a requirement in order to use SBT's new Maven-style resolution features (which will be done in a separate patch, because it's blocked by some binary compatibility issues in the POM reader plugin).

I also upgraded Scalastyle to version 0.8.0, which was necessary in order to fix a Scala 2.10.5 compatibility issue (see https://github.com/scalastyle/scalastyle/issues/156). The newer Scalastyle is slightly stricter about whitespace surrounding tokens, so I fixed the new style violations.

Author: Josh Rosen <joshrosen@databricks.com>

Closes #10112 from JoshRosen/upgrade-to-sbt-0.13.9.
2015-12-05 08:15:30 +08:00
Dmitry Erastov d0d8222778 [SPARK-6990][BUILD] Add Java linting script; fix minor warnings
This replaces https://github.com/apache/spark/pull/9696

Invoke Checkstyle and print any errors to the console, failing the step.
Use Google's style rules modified according to
https://cwiki.apache.org/confluence/display/SPARK/Spark+Code+Style+Guide
Some important checks are disabled (see TODOs in `checkstyle.xml`) due to
multiple violations being present in the codebase.

Suggest fixing those TODOs in a separate PR(s).

More on Checkstyle can be found on the [official website](http://checkstyle.sourceforge.net/).

Sample output (from [build 46345](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/46345/consoleFull)) (duplicated because I run the build twice with different profiles):

> Checkstyle checks failed at following occurrences:
[ERROR] src/main/java/org/apache/spark/sql/execution/datasources/parquet/UnsafeRowParquetRecordReader.java:[217,7] (coding) MissingSwitchDefault: switch without "default" clause.
> [ERROR] src/main/java/org/apache/spark/sql/execution/datasources/parquet/SpecificParquetRecordReaderBase.java:[198,10] (modifier) ModifierOrder: 'protected' modifier out of order with the JLS suggestions.
> [ERROR] src/main/java/org/apache/spark/sql/execution/datasources/parquet/UnsafeRowParquetRecordReader.java:[217,7] (coding) MissingSwitchDefault: switch without "default" clause.
> [ERROR] src/main/java/org/apache/spark/sql/execution/datasources/parquet/SpecificParquetRecordReaderBase.java:[198,10] (modifier) ModifierOrder: 'protected' modifier out of order with the JLS suggestions.
> [error] running /home/jenkins/workspace/SparkPullRequestBuilder2/dev/lint-java ; received return code 1

Also fix some of the minor violations that didn't require sweeping changes.

Apologies for the previous botched PRs - I finally figured out the issue.

cr: JoshRosen, pwendell

> I state that the contribution is my original work, and I license the work to the project under the project's open source license.

Author: Dmitry Erastov <derastov@gmail.com>

Closes #9867 from dskrvk/master.
2015-12-04 12:03:45 -08:00
Xusen Yin e76431f886 [SPARK-11961][DOC] Add docs of ChiSqSelector
https://issues.apache.org/jira/browse/SPARK-11961

Author: Xusen Yin <yinxusen@gmail.com>

Closes #9965 from yinxusen/SPARK-11961.
2015-12-01 15:21:53 -08:00
Feynman Liang 5535888930 [SPARK-11960][MLLIB][DOC] User guide for streaming tests
CC jkbradley mengxr josepablocam

Author: Feynman Liang <feynman.liang@gmail.com>

Closes #10005 from feynmanliang/streaming-test-user-guide.
2015-11-30 15:38:44 -08:00
Yanbo Liang de64b65f7c [SPARK-11975][ML] Remove duplicate mllib example (DT/RF/GBT in Java/Python)
Remove duplicate mllib example (DT/RF/GBT in Java/Python).
Since we have tutorial code for DT/RF/GBT classification/regression in Scala/Java/Python and example applications for DT/RF/GBT in Scala, so we mark these as duplicated and remove them.
mengxr

Author: Yanbo Liang <ybliang8@gmail.com>

Closes #9954 from yanboliang/SPARK-11975.
2015-11-30 15:01:08 -08:00
Yuhao Yang e232720a65 [SPARK-11689][ML] Add user guide and example code for LDA under spark.ml
jira: https://issues.apache.org/jira/browse/SPARK-11689

Add simple user guide for LDA under spark.ml and example code under examples/. Use include_example to include example code in the user guide markdown. Check SPARK-11606 for instructions.

Original PR is reverted due to document build error. https://github.com/apache/spark/pull/9722

mengxr feynmanliang yinxusen  Sorry for the troubling.

Author: Yuhao Yang <hhbyyh@gmail.com>

Closes #9974 from hhbyyh/ldaMLExample.
2015-11-30 14:56:51 -08:00
Yanbo Liang 56a0aba0a6 [SPARK-11952][ML] Remove duplicate ml examples
Remove duplicate ml examples (only for ml).  mengxr

Author: Yanbo Liang <ybliang8@gmail.com>

Closes #9933 from yanboliang/SPARK-11685.
2015-11-24 09:52:53 -08:00
Yanbo Liang 98d7ec7df4 [SPARK-11920][ML][DOC] ML LinearRegression should use correct dataset in examples and user guide doc
ML ```LinearRegression``` use ```data/mllib/sample_libsvm_data.txt``` as dataset in examples and user guide doc, but it's actually classification dataset rather than regression dataset. We should use ```data/mllib/sample_linear_regression_data.txt``` instead.
The deeper causes is that ```LinearRegression``` with "normal" solver can not solve this dataset correctly, may be due to the ill condition and unreasonable label. This issue has been reported at [SPARK-11918](https://issues.apache.org/jira/browse/SPARK-11918).
It will confuse users if they run the example code but get exception, so we should make this change which can clearly illustrate the usage of ```LinearRegression``` algorithm.

Author: Yanbo Liang <ybliang8@gmail.com>

Closes #9905 from yanboliang/spark-11920.
2015-11-23 11:51:29 -08:00
Xiangrui Meng fe89c1817d [SPARK-11895][ML] rename and refactor DatasetExample under mllib/examples
We used the name `Dataset` to refer to `SchemaRDD` in 1.2 in ML pipelines and created this example file. Since `Dataset` has a new meaning in Spark 1.6, we should rename it to avoid confusion. This PR also removes support for dense format to simplify the example code.

cc: yinxusen

Author: Xiangrui Meng <meng@databricks.com>

Closes #9873 from mengxr/SPARK-11895.
2015-11-22 21:45:46 -08:00
Xiangrui Meng a2dce22e0a Revert "[SPARK-11689][ML] Add user guide and example code for LDA under spark.ml"
This reverts commit e359d5dcf5.
2015-11-20 16:51:47 -08:00
Vikas Nelamangala ed47b1e660 [SPARK-11549][DOCS] Replace example code in mllib-evaluation-metrics.md using include_example
Author: Vikas Nelamangala <vikasnelamangala@Vikass-MacBook-Pro.local>

Closes #9689 from vikasnp/master.
2015-11-20 15:18:41 -08:00
Yuhao Yang e359d5dcf5 [SPARK-11689][ML] Add user guide and example code for LDA under spark.ml
jira: https://issues.apache.org/jira/browse/SPARK-11689

Add simple user guide for LDA under spark.ml and example code under examples/. Use include_example to include example code in the user guide markdown. Check SPARK-11606 for instructions.

Author: Yuhao Yang <hhbyyh@gmail.com>

Closes #9722 from hhbyyh/ldaMLExample.
2015-11-20 09:57:09 -08:00
Yuhao Yang 67c75828ff [SPARK-11816][ML] fix some style issue in ML/MLlib examples
jira: https://issues.apache.org/jira/browse/SPARK-11816
Currently I only fixed some obvious comments issue like
// scalastyle:off println
on the bottom.

Yet the style in examples is not quite consistent, like only half of the examples  are with
// Example usage: ./bin/run-example mllib.FPGrowthExample \,

Author: Yuhao Yang <hhbyyh@gmail.com>

Closes #9808 from hhbyyh/exampleStyle.
2015-11-18 18:49:46 -08:00
Josh Rosen 4b11712190 [SPARK-11495] Fix potential socket / file handle leaks that were found via static analysis
The HP Fortify Opens Source Review team (https://www.hpfod.com/open-source-review-project) reported a handful of potential resource leaks that were discovered using their static analysis tool. We should fix the issues identified by their scan.

Author: Josh Rosen <joshrosen@databricks.com>

Closes #9455 from JoshRosen/fix-potential-resource-leaks.
2015-11-18 16:00:35 -08:00
Viveka Kulharia 1429e0a2b5 rmse was wrongly calculated
It was multiplying with U instaed of dividing by U

Author: Viveka Kulharia <vivkul@iitk.ac.in>

Closes #9771 from vivkul/patch-1.
2015-11-18 09:10:15 +00:00
Xusen Yin 9154f89bef [SPARK-11728] Replace example code in ml-ensembles.md using include_example
JIRA issue https://issues.apache.org/jira/browse/SPARK-11728.

The ml-ensembles.md file contains `OneVsRestExample`. Instead of writing new code files of two `OneVsRestExample`s, I use two existing files in the examples directory, they are `OneVsRestExample.scala` and `JavaOneVsRestExample.scala`.

Author: Xusen Yin <yinxusen@gmail.com>

Closes #9716 from yinxusen/SPARK-11728.
2015-11-17 23:44:06 -08:00
Rohan Bhanderi e29656f8e7 [MINOR] Correct comments in JavaDirectKafkaWordCount
Author: Rohan Bhanderi <rohan.bhanderi@sjsu.edu>

Closes #9781 from RohanBhanderi/patch-3.
2015-11-17 15:45:46 -08:00