[SPARK-11960][MLLIB][DOC] User guide for streaming tests
CC jkbradley mengxr josepablocam Author: Feynman Liang <feynman.liang@gmail.com> Closes #10005 from feynmanliang/streaming-test-user-guide.
This commit is contained in:
parent
de64b65f7c
commit
5535888930
|
@ -34,6 +34,7 @@ We list major functionality from both below, with links to detailed guides.
|
|||
* [correlations](mllib-statistics.html#correlations)
|
||||
* [stratified sampling](mllib-statistics.html#stratified-sampling)
|
||||
* [hypothesis testing](mllib-statistics.html#hypothesis-testing)
|
||||
* [streaming significance testing](mllib-statistics.html#streaming-significance-testing)
|
||||
* [random data generation](mllib-statistics.html#random-data-generation)
|
||||
* [Classification and regression](mllib-classification-regression.html)
|
||||
* [linear models (SVMs, logistic regression, linear regression)](mllib-linear-methods.html)
|
||||
|
|
|
@ -521,6 +521,31 @@ print(testResult) # summary of the test including the p-value, test statistic,
|
|||
</div>
|
||||
</div>
|
||||
|
||||
### Streaming Significance Testing
|
||||
MLlib provides online implementations of some tests to support use cases
|
||||
like A/B testing. These tests may be performed on a Spark Streaming
|
||||
`DStream[(Boolean,Double)]` where the first element of each tuple
|
||||
indicates control group (`false`) or treatment group (`true`) and the
|
||||
second element is the value of an observation.
|
||||
|
||||
Streaming significance testing supports the following parameters:
|
||||
|
||||
* `peacePeriod` - The number of initial data points from the stream to
|
||||
ignore, used to mitigate novelty effects.
|
||||
* `windowSize` - The number of past batches to perform hypothesis
|
||||
testing over. Setting to `0` will perform cumulative processing using
|
||||
all prior batches.
|
||||
|
||||
|
||||
<div class="codetabs">
|
||||
<div data-lang="scala" markdown="1">
|
||||
[`StreamingTest`](api/scala/index.html#org.apache.spark.mllib.stat.test.StreamingTest)
|
||||
provides streaming hypothesis testing.
|
||||
|
||||
{% include_example scala/org/apache/spark/examples/mllib/StreamingTestExample.scala %}
|
||||
</div>
|
||||
</div>
|
||||
|
||||
|
||||
## Random data generation
|
||||
|
||||
|
|
|
@ -64,6 +64,7 @@ object StreamingTestExample {
|
|||
dir.toString
|
||||
})
|
||||
|
||||
// $example on$
|
||||
val data = ssc.textFileStream(dataDir).map(line => line.split(",") match {
|
||||
case Array(label, value) => (label.toBoolean, value.toDouble)
|
||||
})
|
||||
|
@ -75,6 +76,7 @@ object StreamingTestExample {
|
|||
|
||||
val out = streamingTest.registerStream(data)
|
||||
out.print()
|
||||
// $example off$
|
||||
|
||||
// Stop processing if test becomes significant or we time out
|
||||
var timeoutCounter = numBatchesTimeout
|
||||
|
|
Loading…
Reference in a new issue