[SPARK-20208][R][DOCS] Document R fpGrowth support
## What changes were proposed in this pull request?

Document fpGrowth in:

- vignettes
- programming guide
- code example

## How was this patch tested?

Manual tests.

Author: zero323 <zero323@users.noreply.github.com>

Closes #17557 from zero323/SPARK-20208.
parent e468a96c40
commit 702d85af2d
@@ -505,6 +505,10 @@ SparkR supports the following machine learning models and algorithms.
 
 * Alternating Least Squares (ALS)
 
+#### Frequent Pattern Mining
+
+* FP-growth
+
 #### Statistics
 
 * Kolmogorov-Smirnov Test
@@ -707,7 +711,7 @@ summary(tweedieGLM1)
 ```
 We can try other distributions in the tweedie family, for example, a compound Poisson distribution with a log link:
 ```{r}
-tweedieGLM2 <- spark.glm(carsDF, mpg ~ wt + hp, family = "tweedie",
+tweedieGLM2 <- spark.glm(carsDF, mpg ~ wt + hp, family = "tweedie",
                          var.power = 1.2, link.power = 0.0)
 summary(tweedieGLM2)
 ```
@@ -906,6 +910,37 @@ predicted <- predict(model, df)
 head(predicted)
 ```
 
+#### FP-growth
+
+`spark.fpGrowth` executes the FP-growth algorithm to mine frequent itemsets on a `SparkDataFrame`. The column named by `itemsCol` should contain arrays of values.
+
+```{r}
+df <- selectExpr(createDataFrame(data.frame(rawItems = c(
+  "T,R,U", "T,S", "V,R", "R,U,T,V", "R,S", "V,S,U", "U,R", "S,T", "V,R", "V,U,S",
+  "T,V,U", "R,V", "T,S", "T,S", "S,T", "S,U", "T,R", "V,R", "S,V", "T,S,U"
+))), "split(rawItems, ',') AS items")
+
+fpm <- spark.fpGrowth(df, minSupport = 0.2, minConfidence = 0.5)
+```
+
+The `spark.freqItemsets` method can be used to retrieve a `SparkDataFrame` with the frequent itemsets.
+
+```{r}
+head(spark.freqItemsets(fpm))
+```
+
+`spark.associationRules` returns a `SparkDataFrame` with the association rules.
+
+```{r}
+head(spark.associationRules(fpm))
+```
+
+We can make predictions based on the `antecedent`.
+
+```{r}
+head(predict(fpm, df))
+```
+
 #### Kolmogorov-Smirnov Test
 
 `spark.kstest` runs a two-sided, one-sample [Kolmogorov-Smirnov (KS) test](https://en.wikipedia.org/wiki/Kolmogorov%E2%80%93Smirnov_test).
examples/src/main/r/ml/fpm.R (new file, 50 lines)
@@ -0,0 +1,50 @@
+#
+# Licensed to the Apache Software Foundation (ASF) under one or more
+# contributor license agreements. See the NOTICE file distributed with
+# this work for additional information regarding copyright ownership.
+# The ASF licenses this file to You under the Apache License, Version 2.0
+# (the "License"); you may not use this file except in compliance with
+# the License. You may obtain a copy of the License at
+#
+#    http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+#
+
+# To run this example use
+# ./bin/spark-submit examples/src/main/r/ml/fpm.R
+
+# Load SparkR library into your R session
+library(SparkR)
+
+# Initialize SparkSession
+sparkR.session(appName = "SparkR-ML-fpm-example")
+
+# $example on$
+# Load training data
+
+df <- selectExpr(createDataFrame(data.frame(rawItems = c(
+  "1,2,5", "1,2,3,5", "1,2"
+))), "split(rawItems, ',') AS items")
+
+fpm <- spark.fpGrowth(df, itemsCol="items", minSupport=0.5, minConfidence=0.6)
+
+# Extracting frequent itemsets
+
+spark.freqItemsets(fpm)
+
+# Extracting association rules
+
+spark.associationRules(fpm)
+
+# Predict uses association rules and combines possible consequents
+
+predict(fpm, df)
+
+# $example off$
+
+sparkR.session.stop()
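
For intuition about what `minSupport = 0.5` and `minConfidence = 0.6` mean for the three transactions in `fpm.R`, the two measures can be checked by brute force in base R, without Spark. This is only a sketch of the standard definitions, not how `spark.fpGrowth` computes its results; the helper names `support` and `confidence` are hypothetical and are not part of SparkR.

```r
# Assumption: plain base R, no SparkR. The data are the three transactions
# from examples/src/main/r/ml/fpm.R, split into character vectors.
transactions <- strsplit(c("1,2,5", "1,2,3,5", "1,2"), ",")

# support(X): fraction of transactions that contain every item in X.
# (Hypothetical helper, not a SparkR function.)
support <- function(itemset) {
  mean(vapply(transactions, function(tx) all(itemset %in% tx), logical(1)))
}

# confidence(A => C): support of A and C together, divided by support of A.
# (Hypothetical helper, not a SparkR function.)
confidence <- function(antecedent, consequent) {
  support(union(antecedent, consequent)) / support(antecedent)
}

support(c("1", "2"))   # 1:   all three transactions contain both items
support("5")           # 2/3: at or above a minimum support of 0.5
support("3")           # 1/3: below a minimum support of 0.5
confidence("5", "1")   # 1:   every transaction with 5 also has 1
confidence("1", "5")   # 2/3: at or above a minimum confidence of 0.6
```

Under these definitions, an itemset counts as frequent when its support is at least the minimum support, and a rule is kept when its confidence is at least the minimum confidence.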