This PR adds Scala, Java, and Python examples showing how to use Accumulators and Broadcast variables in Spark Streaming so that they work with checkpointing.
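As a rough Scala sketch of the pattern these examples demonstrate (object names and the blacklist contents below are illustrative, not taken from the PR), the Accumulator and Broadcast variable are wrapped in lazily instantiated singletons so they can be re-created after the driver recovers from a checkpoint:
```scala
import org.apache.spark.{Accumulator, SparkContext}
import org.apache.spark.broadcast.Broadcast

// Lazily instantiated singleton: accumulators are not checkpointed, so the
// instance is re-created on demand after recovering from a checkpoint.
object DroppedWordsCounter {
  @volatile private var instance: Accumulator[Long] = null

  def getInstance(sc: SparkContext): Accumulator[Long] = {
    if (instance == null) {
      synchronized {
        if (instance == null) {
          instance = sc.accumulator(0L, "DroppedWordsCounter")
        }
      }
    }
    instance
  }
}

// Same pattern for a Broadcast variable holding a (hypothetical) word blacklist.
object WordBlacklist {
  @volatile private var instance: Broadcast[Seq[String]] = null

  def getInstance(sc: SparkContext): Broadcast[Seq[String]] = {
    if (instance == null) {
      synchronized {
        if (instance == null) {
          instance = sc.broadcast(Seq("a", "b", "c"))
        }
      }
    }
    instance
  }
}
```
Inside `foreachRDD`, the examples can then call `getInstance(rdd.sparkContext)` rather than capturing the Accumulator or Broadcast directly in the checkpointed closure.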
Author: Shixiong Zhu <shixiong@databricks.com>
Closes#10385 from zsxwing/accumulator-broadcast-example.
An example of joining a static RDD of word sentiments to a streaming RDD of Tweets, in order to demonstrate the usage of the transform() method.
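A minimal Scala sketch of the idea (names and types are illustrative, not the PR's actual code): the static sentiment RDD is joined against each batch of the stream inside transform().
```scala
import org.apache.spark.rdd.RDD
import org.apache.spark.streaming.dstream.DStream

// wordSentiments: static (word, sentimentScore) pairs loaded once.
// tweetWords:     streaming (word, count) pairs derived from tweets.
def joinWithSentiments(
    wordSentiments: RDD[(String, Double)],
    tweetWords: DStream[(String, Int)]): DStream[(String, (Int, Double))] = {
  // transform() exposes each batch as an RDD, so regular RDD operations
  // such as join() against a static RDD are available.
  tweetWords.transform(batchRdd => batchRdd.join(wordSentiments))
}
```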
Author: Jeff L <sha0lin@alumni.carnegiemellon.edu>
Closes#8431 from Agent007/SPARK-9057.
String.split accepts a regular expression, so we should escape "." and "|".
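For illustration (plain Scala, independent of the fix itself):
```scala
// "." and "|" are regex metacharacters, so splitting on them literally
// requires escaping.
"1.2.3".split(".")    // Array(): every character matches, so only empty (dropped) tokens remain
"1.2.3".split("\\.")  // Array("1", "2", "3")
"a|b|c".split("\\|")  // Array("a", "b", "c")
```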
Author: Shixiong Zhu <shixiong@databricks.com>
Closes#10361 from zsxwing/reg-bug.
We have a DataFrame example for SparkR; we also need to add an ML example under ```examples/src/main/r```.
cc mengxr jkbradley shivaram
Author: Yanbo Liang <ybliang8@gmail.com>
Closes#10324 from yanboliang/spark-12364.
This PR includes only the example code, in order to finish it quickly.
I'll send another PR for the docs soon.
Author: Yu ISHIKAWA <yuu.ishikawa@gmail.com>
Closes#9952 from yu-iskw/SPARK-6518.
Rename ```weights``` to ```coefficients``` for examples/DeveloperApiExample.
cc mengxr jkbradley
Author: Yanbo Liang <ybliang8@gmail.com>
Closes#10280 from yanboliang/spark-coefficients.
Since ```Dataset``` has a new meaning in Spark 1.6, we should rename it to avoid confusion.
#9873 finished the Scala example; here we focus on the Python one.
Move dataset_example.py to ```examples/ml``` and rename it to ```dataframe_example.py```.
Also fix minor issues missed in #9873.
cc mengxr
Author: Yanbo Liang <ybliang8@gmail.com>
Closes#9957 from yanboliang/SPARK-11978.
* ```jsonFile``` should support multiple input files, such as:
```R
jsonFile(sqlContext, c("path1", "path2")) # character vector as arguments
jsonFile(sqlContext, "path1,path2")
```
* Meanwhile, ```jsonFile``` has been deprecated by Spark SQL and will be removed in Spark 2.0, so we mark ```jsonFile``` as deprecated and use ```read.json``` on the SparkR side.
* Replace all ```jsonFile``` calls with ```read.json``` in test_sparkSQL.R, but still keep a jsonFile test case.
* If this PR is accepted, we should also make almost the same change for ```parquetFile```.
cc felixcheung sun-rui shivaram
Author: Yanbo Liang <ybliang8@gmail.com>
Closes#10145 from yanboliang/spark-12146.
Adds the ability to define an initial state RDD for use with updateStateByKey in PySpark. Added a unit test and changed the stateful_network_wordcount example to use an initial RDD.
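The PR itself targets PySpark; for reference, a minimal sketch of the analogous pre-existing Scala overload that accepts an initial state RDD (the seed counts below are made up):
```scala
import org.apache.spark.HashPartitioner
import org.apache.spark.streaming.StreamingContext
import org.apache.spark.streaming.dstream.DStream

def runningCounts(
    ssc: StreamingContext,
    wordCounts: DStream[(String, Int)]): DStream[(String, Int)] = {
  // State to start from, instead of beginning every key at zero.
  val initialRDD = ssc.sparkContext.parallelize(Seq(("hello", 1), ("world", 1)))

  // Standard running-count update function: new values plus previous state.
  val updateFunc = (newValues: Seq[Int], state: Option[Int]) =>
    Some(newValues.sum + state.getOrElse(0))

  wordCounts.updateStateByKey[Int](
    updateFunc,
    new HashPartitioner(ssc.sparkContext.defaultParallelism),
    initialRDD)
}
```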
Author: Bryan Cutler <bjcutler@us.ibm.com>
Closes#10082 from BryanCutler/initial-rdd-updateStateByKey-SPARK-11713.
SPARK-12244:
Based on feedback from early users and personal experience attempting to explain it, the name trackStateByKey had two problems:
* "trackState" is a completely new term which does not give any intuition about what the operation is.
* The resultant data stream of objects returned by the function is referred to in the docs as the "emitted" data, for lack of a better term.
"mapWithState" makes sense because the API is like a mapping function (Key, Value) => T with State as an additional parameter. The resultant data stream is the "mapped data". So both problems are solved.
SPARK-12245:
From initial experience, not having the key available in the function makes it hard to return mapped output, as the full information of the record is not there. Basically the user is restricted to doing something like mapValues() instead of map(), so this change adds the key as a parameter.
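A minimal sketch of the renamed API on a word-count style stream (reflecting the shape after both changes; the stream and names are illustrative):
```scala
import org.apache.spark.streaming.{State, StateSpec}
import org.apache.spark.streaming.dstream.DStream

def runningWordCounts(words: DStream[(String, Int)]): DStream[(String, Int)] = {
  // The mapping function now receives the key along with the new value and
  // the State handle; whatever it returns becomes the "mapped" stream.
  val mappingFunc = (word: String, count: Option[Int], state: State[Int]) => {
    val newCount = count.getOrElse(0) + state.getOption.getOrElse(0)
    state.update(newCount)
    (word, newCount)
  }
  words.mapWithState(StateSpec.function(mappingFunc))
}
```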
Author: Tathagata Das <tathagata.das1565@gmail.com>
Closes#10224 from tdas/rename.
Documentation regarding the `IndexToString` label transformer with code snippets in Scala/Java/Python.
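A short Scala sketch of the round trip the documentation describes (column names here are illustrative):
```scala
import org.apache.spark.ml.feature.{IndexToString, StringIndexer}
import org.apache.spark.sql.DataFrame

// Assumes a DataFrame with a string "category" column.
def roundTripLabels(df: DataFrame): DataFrame = {
  val indexerModel = new StringIndexer()
    .setInputCol("category")
    .setOutputCol("categoryIndex")
    .fit(df)

  val indexed = indexerModel.transform(df)

  // IndexToString maps the numeric indices produced by StringIndexer back to
  // the original string labels, e.g. to make predictions human-readable.
  new IndexToString()
    .setInputCol("categoryIndex")
    .setOutputCol("originalCategory")
    .setLabels(indexerModel.labels)
    .transform(indexed)
}
```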
Author: BenFradet <benjamin.fradet@gmail.com>
Closes#10166 from BenFradet/SPARK-12159.
jira: https://issues.apache.org/jira/browse/SPARK-11605
Check Java compatibility for MLlib for this release.
fix:
1. `StreamingTest.registerStream` needs a Java-friendly interface.
2. `GradientBoostedTreesModel.computeInitialPredictionAndError` and `GradientBoostedTreesModel.updatePredictionError` have Java compatibility issues. Mark them as `DeveloperApi`.
TBD:
[updated] no fix for now per discussion.
`org.apache.spark.mllib.classification.LogisticRegressionModel`
`public scala.Option<java.lang.Object> getThreshold();` has the wrong return type for Java invocation.
`SVMModel` has a similar issue.
Yet adding a `scala.Option<java.lang.Double> getThreshold()` would result in an overloading error, since the two methods would have the same signature, and adding a new function with a different name does not seem necessary.
cc jkbradley feynmanliang
Author: Yuhao Yang <hhbyyh@gmail.com>
Closes#10102 from hhbyyh/javaAPI.
jira: https://issues.apache.org/jira/browse/SPARK-10393
Since the logic of the text processing part has been moved to ML estimators/transformers, replace the related code in LDA Example with the ML pipeline.
Author: Yuhao Yang <hhbyyh@gmail.com>
Author: yuhaoyang <yuhao@zhanglipings-iMac.local>
Closes#8551 from hhbyyh/ldaExUpdate.
This reverts PR #10002, commit 78209b0cca.
The original PR wasn't tested on Jenkins before being merged.
Author: Cheng Lian <lian@databricks.com>
Closes#10200 from liancheng/revert-pr-10002.
Add a ```SQLTransformer``` user guide and example code, and make the Scala API doc clearer.
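A small sketch of what ```SQLTransformer``` does (the statement and column names are illustrative): it runs a SQL statement over the input DataFrame, with ```__THIS__``` standing for the input table.
```scala
import org.apache.spark.ml.feature.SQLTransformer
import org.apache.spark.sql.DataFrame

// Derives new columns v3 and v4 from existing numeric columns v1 and v2.
def addDerivedColumns(df: DataFrame): DataFrame = {
  val sqlTrans = new SQLTransformer()
    .setStatement("SELECT *, (v1 + v2) AS v3, (v1 * v2) AS v4 FROM __THIS__")
  sqlTrans.transform(df)
}
```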
Author: Yanbo Liang <ybliang8@gmail.com>
Closes#10006 from yanboliang/spark-11958.
Made a new patch containing only the markdown examples moved to the examples/ folder.
Only three Java examples were not moved, since they contained compilation errors; these classes are:
1) StandardScale 2) NormalizerExample 3) VectorIndexer
Author: Xusen Yin <yinxusen@gmail.com>
Author: somideshmukh <somilde@us.ibm.com>
Closes#10002 from somideshmukh/SomilBranch1.33.
`ByteBuffer` doesn't guarantee that all contents in `ByteBuffer.array` are valid, e.g., a ByteBuffer returned by `ByteBuffer.slice`. We should not use the whole content of a `ByteBuffer` unless we know that's correct.
This patch fixes all places that use `ByteBuffer.array` incorrectly.
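A small illustration (not taken from the patch) of why reading `ByteBuffer.array` directly is unsafe for a sliced buffer:
```scala
import java.nio.ByteBuffer
import java.nio.charset.StandardCharsets

val backing = ByteBuffer.wrap("0123456789".getBytes(StandardCharsets.UTF_8))
backing.position(2)
backing.limit(7)
val slice = backing.slice()            // valid window is "23456"

// Incorrect: array() returns the entire shared backing array ("0123456789").
val wrong = slice.array()

// Correct: copy only the valid region, respecting position() and remaining().
val right = new Array[Byte](slice.remaining())
slice.duplicate().get(right)           // right now holds the bytes of "23456"
```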
Author: Shixiong Zhu <shixiong@databricks.com>
Closes#10083 from zsxwing/bytebuffer-array.
We should upgrade to SBT 0.13.9, since this is a requirement in order to use SBT's new Maven-style resolution features (which will be done in a separate patch, because it's blocked by some binary compatibility issues in the POM reader plugin).
I also upgraded Scalastyle to version 0.8.0, which was necessary in order to fix a Scala 2.10.5 compatibility issue (see https://github.com/scalastyle/scalastyle/issues/156). The newer Scalastyle is slightly stricter about whitespace surrounding tokens, so I fixed the new style violations.
Author: Josh Rosen <joshrosen@databricks.com>
Closes#10112 from JoshRosen/upgrade-to-sbt-0.13.9.
This replaces https://github.com/apache/spark/pull/9696
Invoke Checkstyle and print any errors to the console, failing the step.
Use Google's style rules modified according to
https://cwiki.apache.org/confluence/display/SPARK/Spark+Code+Style+Guide
Some important checks are disabled (see TODOs in `checkstyle.xml`) due to
multiple violations being present in the codebase.
I suggest fixing those TODOs in separate PRs.
More on Checkstyle can be found on the [official website](http://checkstyle.sourceforge.net/).
Sample output (from [build 46345](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/46345/consoleFull)) (duplicated because I ran the build twice with different profiles):
> Checkstyle checks failed at following occurrences:
> [ERROR] src/main/java/org/apache/spark/sql/execution/datasources/parquet/UnsafeRowParquetRecordReader.java:[217,7] (coding) MissingSwitchDefault: switch without "default" clause.
> [ERROR] src/main/java/org/apache/spark/sql/execution/datasources/parquet/SpecificParquetRecordReaderBase.java:[198,10] (modifier) ModifierOrder: 'protected' modifier out of order with the JLS suggestions.
> [ERROR] src/main/java/org/apache/spark/sql/execution/datasources/parquet/UnsafeRowParquetRecordReader.java:[217,7] (coding) MissingSwitchDefault: switch without "default" clause.
> [ERROR] src/main/java/org/apache/spark/sql/execution/datasources/parquet/SpecificParquetRecordReaderBase.java:[198,10] (modifier) ModifierOrder: 'protected' modifier out of order with the JLS suggestions.
> [error] running /home/jenkins/workspace/SparkPullRequestBuilder2/dev/lint-java ; received return code 1
Also fix some of the minor violations that didn't require sweeping changes.
Apologies for the previous botched PRs - I finally figured out the issue.
cr: JoshRosen, pwendell
> I state that the contribution is my original work, and I license the work to the project under the project's open source license.
Author: Dmitry Erastov <derastov@gmail.com>
Closes#9867 from dskrvk/master.
Remove duplicate mllib example (DT/RF/GBT in Java/Python).
Since we have tutorial code for DT/RF/GBT classification/regression in Scala/Java/Python and example applications for DT/RF/GBT in Scala, we mark these as duplicated and remove them.
mengxr
Author: Yanbo Liang <ybliang8@gmail.com>
Closes#9954 from yanboliang/SPARK-11975.
jira: https://issues.apache.org/jira/browse/SPARK-11689
Add simple user guide for LDA under spark.ml and example code under examples/. Use include_example to include example code in the user guide markdown. Check SPARK-11606 for instructions.
The original PR was reverted due to a documentation build error. https://github.com/apache/spark/pull/9722
mengxr feynmanliang yinxusen Sorry for the trouble.
Author: Yuhao Yang <hhbyyh@gmail.com>
Closes#9974 from hhbyyh/ldaMLExample.
ML ```LinearRegression``` uses ```data/mllib/sample_libsvm_data.txt``` as the dataset in examples and the user guide doc, but it's actually a classification dataset rather than a regression dataset. We should use ```data/mllib/sample_linear_regression_data.txt``` instead.
The deeper cause is that ```LinearRegression``` with the "normal" solver cannot solve this dataset correctly, possibly due to ill conditioning and unreasonable labels. This issue has been reported at [SPARK-11918](https://issues.apache.org/jira/browse/SPARK-11918).
It will confuse users if they run the example code but get an exception, so we should make this change, which clearly illustrates the usage of the ```LinearRegression``` algorithm.
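For reference, a minimal Scala sketch of the intended usage with the regression dataset (parameter values are illustrative):
```scala
import org.apache.spark.ml.regression.LinearRegression
import org.apache.spark.sql.SQLContext

def fitLinearRegression(sqlContext: SQLContext): Unit = {
  // Load the regression dataset through the libsvm data source.
  val training = sqlContext.read.format("libsvm")
    .load("data/mllib/sample_linear_regression_data.txt")

  val lr = new LinearRegression()
    .setMaxIter(10)
    .setRegParam(0.3)
    .setElasticNetParam(0.8)

  val model = lr.fit(training)
  println(s"Coefficients: ${model.coefficients} Intercept: ${model.intercept}")
}
```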
Author: Yanbo Liang <ybliang8@gmail.com>
Closes#9905 from yanboliang/spark-11920.
We used the name `Dataset` to refer to `SchemaRDD` in 1.2 in ML pipelines and created this example file. Since `Dataset` has a new meaning in Spark 1.6, we should rename it to avoid confusion. This PR also removes support for dense format to simplify the example code.
cc: yinxusen
Author: Xiangrui Meng <meng@databricks.com>
Closes#9873 from mengxr/SPARK-11895.
jira: https://issues.apache.org/jira/browse/SPARK-11689
Add simple user guide for LDA under spark.ml and example code under examples/. Use include_example to include example code in the user guide markdown. Check SPARK-11606 for instructions.
Author: Yuhao Yang <hhbyyh@gmail.com>
Closes#9722 from hhbyyh/ldaMLExample.
jira: https://issues.apache.org/jira/browse/SPARK-11816
Currently I only fixed some obvious comment issues, like
// scalastyle:off println
at the bottom.
Yet the style in the examples is not quite consistent; for example, only half of the examples include
// Example usage: ./bin/run-example mllib.FPGrowthExample \
Author: Yuhao Yang <hhbyyh@gmail.com>
Closes#9808 from hhbyyh/exampleStyle.
The HP Fortify Open Source Review team (https://www.hpfod.com/open-source-review-project) reported a handful of potential resource leaks that were discovered using their static analysis tool. We should fix the issues identified by their scan.
Author: Josh Rosen <joshrosen@databricks.com>
Closes#9455 from JoshRosen/fix-potential-resource-leaks.
JIRA issue https://issues.apache.org/jira/browse/SPARK-11728.
The ml-ensembles.md file contains `OneVsRestExample`. Instead of writing new code files for the two `OneVsRestExample`s, I use two existing files in the examples directory: `OneVsRestExample.scala` and `JavaOneVsRestExample.scala`.
Author: Xusen Yin <yinxusen@gmail.com>
Closes#9716 from yinxusen/SPARK-11728.
Use a 2-second batch size, as the duration specified in the JavaStreamingContext constructor is 2000 ms.
Author: Rohan Bhanderi <rohan.bhanderi@sjsu.edu>
Closes#9714 from RohanBhanderi/patch-2.
Use the LibSVM data source rather than MLUtils.loadLibSVMFile to load DataFrames. This includes:
* Use the libSVM data source for all example code under examples/ml, and remove unused imports.
* Use the libSVM data source for the user guides under ml-*** that were omitted by #8697.
* Fix bug: we should use ```sqlContext.read().format("libsvm").load(path)``` on the Java side, but the API doc and user guides mistakenly use ```sqlContext.read.format("libsvm").load(path)```.
* Code cleanup.
mengxr
Author: Yanbo Liang <ybliang8@gmail.com>
Closes#9690 from yanboliang/spark-11723.
I have made the required changes and tested.
Kindly review the changes.
Author: Rishabh Bhardwaj <rbnext29@gmail.com>
Closes#9407 from rishabhbhardwaj/SPARK-11445.
Add Python example code for Multilayer Perceptron Classification, and make the example code in the user guide document testable. mengxr
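A Scala sketch of the same pipeline the Python example follows (layer sizes and parameters are illustrative, assuming a 4-feature, 3-class dataset with "label" and "features" columns):
```scala
import org.apache.spark.ml.classification.MultilayerPerceptronClassifier
import org.apache.spark.ml.evaluation.MulticlassClassificationEvaluator
import org.apache.spark.sql.DataFrame

def trainAndEvaluate(train: DataFrame, test: DataFrame): Double = {
  // Layers: 4 input features, two hidden layers of size 5 and 4, 3 output classes.
  val layers = Array[Int](4, 5, 4, 3)

  val trainer = new MultilayerPerceptronClassifier()
    .setLayers(layers)
    .setBlockSize(128)
    .setSeed(1234L)
    .setMaxIter(100)

  val model = trainer.fit(train)
  val predictions = model.transform(test)

  new MulticlassClassificationEvaluator()
    .setMetricName("precision")
    .evaluate(predictions)
}
```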
Author: Yanbo Liang <ybliang8@gmail.com>
Closes#9594 from yanboliang/spark-11629.
TODO
- [x] Add Java API
- [x] Add API tests
- [x] Add a function test
Author: Shixiong Zhu <shixiong@databricks.com>
Closes#9636 from zsxwing/java-track.
Current updateStateByKey provides stateful processing in Spark Streaming. It allows the user to maintain per-key state and manage that state using an updateFunction. The updateFunction is called for each key, and it uses the new data and the existing state of the key to generate an updated state. However, based on community feedback, we have learnt the following lessons:
* Need for more optimized state management that does not scan every key
* Need to make it easier to implement common use cases - (a) timeout of idle data, (b) returning items other than state
The high-level idea of this PR:
* Introduce a new API trackStateByKey that allows the user to update per-key state and emit arbitrary records. The new API is necessary as this will have significantly different semantics than the existing updateStateByKey API. This API will have direct support for timeouts.
* Internally, the system will keep the state data as a map/list within the partitions of the state RDDs. The new data RDDs will be partitioned appropriately, and for all the key-value data, it will look up the map/list in the state RDD partition and create a new list/map of updated state data. The new state RDD partition will be created based on the updated data and, if necessary, the old data.
Here is the detailed design doc. Please take a look and provide feedback as comments.
https://docs.google.com/document/d/1NoALLyd83zGs1hNGMm0Pc5YOVgiPpMHugGMk6COqxxE/edit#heading=h.ph3w0clkd4em
This is still WIP. Major things left to be done.
- [x] Implement basic functionality of state tracking, with initial RDD and timeouts
- [x] Unit tests for state tracking
- [x] Unit tests for initial RDD and timeout
- [ ] Unit tests for TrackStateRDD
  - [x] state creating, updating, removing
  - [ ] emitting
  - [ ] checkpointing
- [x] Misc unit tests for State, TrackStateSpec, etc.
- [x] Update docs and experimental tags
Author: Tathagata Das <tathagata.das1565@gmail.com>
Closes#9256 from tdas/trackStateByKey.