Liang-Chi Hsieh e847d86215 [SPARK-7556] [ML] [DOC] Add user guide for spark.ml Binarizer, including Scala, Java and Python examples

JIRA: https://issues.apache.org/jira/browse/SPARK-7556

Author: Liang-Chi Hsieh <viirya@gmail.com>

Closes #6116 from viirya/binarizer_doc and squashes the following commits:

40cb677 [Liang-Chi Hsieh] Better print out.
5b7ef1d [Liang-Chi Hsieh] Make examples more clear.
1bf9c09 [Liang-Chi Hsieh] For comments.
6cf8cba [Liang-Chi Hsieh] Add user guide for Binarizer.

(cherry picked from commit c8696337e2)
Signed-off-by: Joseph K. Bradley <joseph@databricks.com>

2015-05-15 15:05:13 -07:00

11 KiB

Raw Blame History

layout	title	displayTitle
global	Feature Extraction, Transformation, and Selection - SparkML	<a href="ml-guide.html">ML</a> - Features

This section covers algorithms for working with features, roughly divided into these groups:

Extraction: Extracting features from "raw" data
Transformation: Scaling, converting, or modifying features
Selection: Selecting a subset from a larger set of features

Table of Contents

This will become a table of contents (this text will be scraped). {:toc}

Feature Extractors

Hashing Term-Frequency (HashingTF)

HashingTF is a Transformer which takes sets of terms (e.g., String terms can be sets of words) and converts those sets into fixed-length feature vectors. The algorithm combines Term Frequency (TF) counts with the hashing trick for dimensionality reduction. Please refer to the MLlib user guide on TF-IDF for more details on Term-Frequency.

HashingTF is implemented in HashingTF. In the following code segment, we start with a set of sentences. We split each sentence into words using Tokenizer. For each sentence (bag of words), we hash it into a feature vector. This feature vector could then be passed to a learning algorithm.

{% highlight scala %} import org.apache.spark.ml.feature.{HashingTF, Tokenizer}

val sentenceDataFrame = sqlContext.createDataFrame(Seq( (0, "Hi I heard about Spark"), (0, "I wish Java could use case classes"), (1, "Logistic regression models are neat") )).toDF("label", "sentence") val tokenizer = new Tokenizer().setInputCol("sentence").setOutputCol("words") val wordsDataFrame = tokenizer.transform(sentenceDataFrame) val hashingTF = new HashingTF().setInputCol("words").setOutputCol("features").setNumFeatures(20) val featurized = hashingTF.transform(wordsDataFrame) featurized.select("features", "label").take(3).foreach(println) {% endhighlight %}

{% highlight java %} import com.google.common.collect.Lists;

import org.apache.spark.api.java.JavaRDD; import org.apache.spark.ml.feature.HashingTF; import org.apache.spark.ml.feature.Tokenizer; import org.apache.spark.mllib.linalg.Vector; import org.apache.spark.sql.DataFrame; import org.apache.spark.sql.Row; import org.apache.spark.sql.RowFactory; import org.apache.spark.sql.types.DataTypes; import org.apache.spark.sql.types.Metadata; import org.apache.spark.sql.types.StructField; import org.apache.spark.sql.types.StructType;

JavaRDD jrdd = jsc.parallelize(Lists.newArrayList( RowFactory.create(0, "Hi I heard about Spark"), RowFactory.create(0, "I wish Java could use case classes"), RowFactory.create(1, "Logistic regression models are neat") )); StructType schema = new StructType(new StructField[]{ new StructField("label", DataTypes.DoubleType, false, Metadata.empty()), new StructField("sentence", DataTypes.StringType, false, Metadata.empty()) }); DataFrame sentenceDataFrame = sqlContext.createDataFrame(jrdd, schema); Tokenizer tokenizer = new Tokenizer().setInputCol("sentence").setOutputCol("words"); DataFrame wordsDataFrame = tokenizer.transform(sentenceDataFrame); int numFeatures = 20; HashingTF hashingTF = new HashingTF() .setInputCol("words") .setOutputCol("features") .setNumFeatures(numFeatures); DataFrame featurized = hashingTF.transform(wordsDataFrame); for (Row r : featurized.select("features", "label").take(3)) { Vector features = r.getAs(0); Double label = r.getDouble(1); System.out.println(features); } {% endhighlight %}

{% highlight python %} from pyspark.ml.feature import HashingTF, Tokenizer

sentenceDataFrame = sqlContext.createDataFrame([ (0, "Hi I heard about Spark"), (0, "I wish Java could use case classes"), (1, "Logistic regression models are neat") ], ["label", "sentence"]) tokenizer = Tokenizer(inputCol="sentence", outputCol="words") wordsDataFrame = tokenizer.transform(sentenceDataFrame) hashingTF = HashingTF(inputCol="words", outputCol="features", numFeatures=20) featurized = hashingTF.transform(wordsDataFrame) for features_label in featurized.select("features", "label").take(3): print features_label {% endhighlight %}

Feature Transformers

Tokenizer

Tokenization is the process of taking text (such as a sentence) and breaking it into individual terms (usually words). A simple Tokenizer class provides this functionality. The example below shows how to split sentences into sequences of words.

Note: A more advanced tokenizer is provided via RegexTokenizer.

{% highlight scala %} import org.apache.spark.ml.feature.Tokenizer

val sentenceDataFrame = sqlContext.createDataFrame(Seq( (0, "Hi I heard about Spark"), (0, "I wish Java could use case classes"), (1, "Logistic regression models are neat") )).toDF("label", "sentence") val tokenizer = new Tokenizer().setInputCol("sentence").setOutputCol("words") val wordsDataFrame = tokenizer.transform(sentenceDataFrame) wordsDataFrame.select("words", "label").take(3).foreach(println) {% endhighlight %}

{% highlight java %} import com.google.common.collect.Lists;

import org.apache.spark.api.java.JavaRDD; import org.apache.spark.ml.feature.Tokenizer; import org.apache.spark.mllib.linalg.Vector; import org.apache.spark.sql.DataFrame; import org.apache.spark.sql.Row; import org.apache.spark.sql.RowFactory; import org.apache.spark.sql.types.DataTypes; import org.apache.spark.sql.types.Metadata; import org.apache.spark.sql.types.StructField; import org.apache.spark.sql.types.StructType;

JavaRDD jrdd = jsc.parallelize(Lists.newArrayList( RowFactory.create(0, "Hi I heard about Spark"), RowFactory.create(0, "I wish Java could use case classes"), RowFactory.create(1, "Logistic regression models are neat") )); StructType schema = new StructType(new StructField[]{ new StructField("label", DataTypes.DoubleType, false, Metadata.empty()), new StructField("sentence", DataTypes.StringType, false, Metadata.empty()) }); DataFrame sentenceDataFrame = sqlContext.createDataFrame(jrdd, schema); Tokenizer tokenizer = new Tokenizer().setInputCol("sentence").setOutputCol("words"); DataFrame wordsDataFrame = tokenizer.transform(sentenceDataFrame); for (Row r : wordsDataFrame.select("words", "label").take(3)) { java.util.List words = r.getList(0); for (String word : words) System.out.print(word + " "); System.out.println(); } {% endhighlight %}

{% highlight python %} from pyspark.ml.feature import Tokenizer

sentenceDataFrame = sqlContext.createDataFrame([ (0, "Hi I heard about Spark"), (0, "I wish Java could use case classes"), (1, "Logistic regression models are neat") ], ["label", "sentence"]) tokenizer = Tokenizer(inputCol="sentence", outputCol="words") wordsDataFrame = tokenizer.transform(sentenceDataFrame) for words_label in wordsDataFrame.select("words", "label").take(3): print words_label {% endhighlight %}

Binarizer

Binarization is the process of thresholding numerical features to binary features. As some probabilistic estimators make assumption that the input data is distributed according to Bernoulli distribution, a binarizer is useful for pre-processing the input data with continuous numerical features.

A simple Binarizer class provides this functionality. Besides the common parameters of inputCol and outputCol, Binarizer has the parameter threshold used for binarizing continuous numerical features. The features greater than the threshold, will be binarized to 1.0. The features equal to or less than the threshold, will be binarized to 0.0. The example below shows how to binarize numerical features.

{% highlight scala %} import org.apache.spark.ml.feature.Binarizer import org.apache.spark.sql.DataFrame

val data = Array( (0, 0.1), (1, 0.8), (2, 0.2) ) val dataFrame: DataFrame = sqlContext.createDataFrame(data).toDF("label", "feature")

val binarizer: Binarizer = new Binarizer() .setInputCol("feature") .setOutputCol("binarized_feature") .setThreshold(0.5)

val binarizedDataFrame = binarizer.transform(dataFrame) val binarizedFeatures = binarizedDataFrame.select("binarized_feature") binarizedFeatures.collect().foreach(println) {% endhighlight %}

{% highlight java %} import com.google.common.collect.Lists;

import org.apache.spark.api.java.JavaRDD; import org.apache.spark.ml.feature.Binarizer; import org.apache.spark.sql.DataFrame; import org.apache.spark.sql.Row; import org.apache.spark.sql.RowFactory; import org.apache.spark.sql.types.DataTypes; import org.apache.spark.sql.types.Metadata; import org.apache.spark.sql.types.StructField; import org.apache.spark.sql.types.StructType;

JavaRDD jrdd = jsc.parallelize(Lists.newArrayList( RowFactory.create(0, 0.1), RowFactory.create(1, 0.8), RowFactory.create(2, 0.2) )); StructType schema = new StructType(new StructField[]{ new StructField("label", DataTypes.DoubleType, false, Metadata.empty()), new StructField("feature", DataTypes.DoubleType, false, Metadata.empty()) }); DataFrame continuousDataFrame = jsql.createDataFrame(jrdd, schema); Binarizer binarizer = new Binarizer() .setInputCol("feature") .setOutputCol("binarized_feature") .setThreshold(0.5); DataFrame binarizedDataFrame = binarizer.transform(continuousDataFrame); DataFrame binarizedFeatures = binarizedDataFrame.select("binarized_feature"); for (Row r : binarizedFeatures.collect()) { Double binarized_value = r.getDouble(0); System.out.println(binarized_value); } {% endhighlight %}

{% highlight python %} from pyspark.ml.feature import Binarizer

continuousDataFrame = sqlContext.createDataFrame([ (0, 0.1), (1, 0.8), (2, 0.2) ], ["label", "feature"]) binarizer = Binarizer(threshold=0.5, inputCol="feature", outputCol="binarized_feature") binarizedDataFrame = binarizer.transform(continuousDataFrame) binarizedFeatures = binarizedDataFrame.select("binarized_feature") for binarized_feature, in binarizedFeatures.collect(): print binarized_feature {% endhighlight %}

11 KiB Raw Blame History

Feature Extractors

Hashing Term-Frequency (HashingTF)

Feature Transformers

Tokenizer

Binarizer

Feature Selectors

11 KiB

Raw Blame History