[SPARK-11960][MLLIB][DOC] User guide for streaming tests

CC jkbradley mengxr josepablocam Author: Feynman Liang <feynman.liang@gmail.com> Closes #10005 from feynmanliang/streaming-test-user-guide.
2015-11-30 15:38:44 -08:00 · 2015-11-30 15:38:44 -08:00 · 5535888930
parent de64b65f7c
commit 5535888930
3 changed files with 28 additions and 0 deletions
--- a/docs/mllib-guide.md
+++ b/docs/mllib-guide.md
@ -34,6 +34,7 @@ We list major functionality from both below, with links to detailed guides.
  * [correlations](mllib-statistics.html#correlations)
  * [stratified sampling](mllib-statistics.html#stratified-sampling)
  * [hypothesis testing](mllib-statistics.html#hypothesis-testing)
+  * [streaming significance testing](mllib-statistics.html#streaming-significance-testing)
  * [random data generation](mllib-statistics.html#random-data-generation)
 * [Classification and regression](mllib-classification-regression.html)
  * [linear models (SVMs, logistic regression, linear regression)](mllib-linear-methods.html)
--- a/docs/mllib-statistics.md
+++ b/docs/mllib-statistics.md
@ -521,6 +521,31 @@ print(testResult) # summary of the test including the p-value, test statistic,
 </div>
 </div>

+### Streaming Significance Testing
+MLlib provides online implementations of some tests to support use cases
+like A/B testing. These tests may be performed on a Spark Streaming
+`DStream[(Boolean,Double)]` where the first element of each tuple
+indicates control group (`false`) or treatment group (`true`) and the
+second element is the value of an observation.
+
+Streaming significance testing supports the following parameters:
+
+* `peacePeriod` - The number of initial data points from the stream to
+ignore, used to mitigate novelty effects.
+* `windowSize` - The number of past batches to perform hypothesis
+testing over. Setting to `0` will perform cumulative processing using
+all prior batches.
+
+
+<div class="codetabs">
+<div data-lang="scala" markdown="1">
+[`StreamingTest`](api/scala/index.html#org.apache.spark.mllib.stat.test.StreamingTest)
+provides streaming hypothesis testing.
+
+{% include_example scala/org/apache/spark/examples/mllib/StreamingTestExample.scala %}
+</div>
+</div>
+

 ## Random data generation

--- a/examples/src/main/scala/org/apache/spark/examples/mllib/StreamingTestExample.scala
+++ b/examples/src/main/scala/org/apache/spark/examples/mllib/StreamingTestExample.scala
@ -64,6 +64,7 @@ object StreamingTestExample {
      dir.toString
    })

+    // $example on$
    val data = ssc.textFileStream(dataDir).map(line => line.split(",") match {
      case Array(label, value) => (label.toBoolean, value.toDouble)
    })
@ -75,6 +76,7 @@ object StreamingTestExample {

    val out = streamingTest.registerStream(data)
    out.print()
+    // $example off$

    // Stop processing if test becomes significant or we time out
    var timeoutCounter = numBatchesTimeout