JIRA: https://issues.apache.org/jira/browse/SPARK-7556
Author: Liang-Chi Hsieh <viirya@gmail.com>
Closes #6116 from viirya/binarizer_doc and squashes the following commits:
40cb677 [Liang-Chi Hsieh] Better print out.
5b7ef1d [Liang-Chi Hsieh] Make examples more clear.
1bf9c09 [Liang-Chi Hsieh] For comments.
6cf8cba [Liang-Chi Hsieh] Add user guide for Binarizer.
(cherry picked from commit c8696337e2
)
Signed-off-by: Joseph K. Bradley <joseph@databricks.com>
11 KiB
layout | title | displayTitle |
---|---|---|
global | Feature Extraction, Transformation, and Selection - SparkML | <a href="ml-guide.html">ML</a> - Features |
This section covers algorithms for working with features, roughly divided into these groups:
- Extraction: Extracting features from "raw" data
- Transformation: Scaling, converting, or modifying features
- Selection: Selecting a subset from a larger set of features
Table of Contents
- This will become a table of contents (this text will be scraped). {:toc}
Feature Extractors
Hashing Term-Frequency (HashingTF)
HashingTF
is a Transformer
which takes sets of terms (e.g., String
terms can be sets of words) and converts those sets into fixed-length feature vectors.
The algorithm combines Term Frequency (TF) counts with the hashing trick for dimensionality reduction. Please refer to the MLlib user guide on TF-IDF for more details on Term-Frequency.
HashingTF is implemented in
HashingTF.
In the following code segment, we start with a set of sentences. We split each sentence into words using Tokenizer
. For each sentence (bag of words), we hash it into a feature vector. This feature vector could then be passed to a learning algorithm.
val sentenceDataFrame = sqlContext.createDataFrame(Seq( (0, "Hi I heard about Spark"), (0, "I wish Java could use case classes"), (1, "Logistic regression models are neat") )).toDF("label", "sentence") val tokenizer = new Tokenizer().setInputCol("sentence").setOutputCol("words") val wordsDataFrame = tokenizer.transform(sentenceDataFrame) val hashingTF = new HashingTF().setInputCol("words").setOutputCol("features").setNumFeatures(20) val featurized = hashingTF.transform(wordsDataFrame) featurized.select("features", "label").take(3).foreach(println) {% endhighlight %}
import org.apache.spark.api.java.JavaRDD; import org.apache.spark.ml.feature.HashingTF; import org.apache.spark.ml.feature.Tokenizer; import org.apache.spark.mllib.linalg.Vector; import org.apache.spark.sql.DataFrame; import org.apache.spark.sql.Row; import org.apache.spark.sql.RowFactory; import org.apache.spark.sql.types.DataTypes; import org.apache.spark.sql.types.Metadata; import org.apache.spark.sql.types.StructField; import org.apache.spark.sql.types.StructType;
JavaRDD jrdd = jsc.parallelize(Lists.newArrayList( RowFactory.create(0, "Hi I heard about Spark"), RowFactory.create(0, "I wish Java could use case classes"), RowFactory.create(1, "Logistic regression models are neat") )); StructType schema = new StructType(new StructField[]{ new StructField("label", DataTypes.DoubleType, false, Metadata.empty()), new StructField("sentence", DataTypes.StringType, false, Metadata.empty()) }); DataFrame sentenceDataFrame = sqlContext.createDataFrame(jrdd, schema); Tokenizer tokenizer = new Tokenizer().setInputCol("sentence").setOutputCol("words"); DataFrame wordsDataFrame = tokenizer.transform(sentenceDataFrame); int numFeatures = 20; HashingTF hashingTF = new HashingTF() .setInputCol("words") .setOutputCol("features") .setNumFeatures(numFeatures); DataFrame featurized = hashingTF.transform(wordsDataFrame); for (Row r : featurized.select("features", "label").take(3)) { Vector features = r.getAs(0); Double label = r.getDouble(1); System.out.println(features); } {% endhighlight %}
sentenceDataFrame = sqlContext.createDataFrame([ (0, "Hi I heard about Spark"), (0, "I wish Java could use case classes"), (1, "Logistic regression models are neat") ], ["label", "sentence"]) tokenizer = Tokenizer(inputCol="sentence", outputCol="words") wordsDataFrame = tokenizer.transform(sentenceDataFrame) hashingTF = HashingTF(inputCol="words", outputCol="features", numFeatures=20) featurized = hashingTF.transform(wordsDataFrame) for features_label in featurized.select("features", "label").take(3): print features_label {% endhighlight %}
Feature Transformers
Tokenizer
Tokenization is the process of taking text (such as a sentence) and breaking it into individual terms (usually words). A simple Tokenizer class provides this functionality. The example below shows how to split sentences into sequences of words.
Note: A more advanced tokenizer is provided via RegexTokenizer.
val sentenceDataFrame = sqlContext.createDataFrame(Seq( (0, "Hi I heard about Spark"), (0, "I wish Java could use case classes"), (1, "Logistic regression models are neat") )).toDF("label", "sentence") val tokenizer = new Tokenizer().setInputCol("sentence").setOutputCol("words") val wordsDataFrame = tokenizer.transform(sentenceDataFrame) wordsDataFrame.select("words", "label").take(3).foreach(println) {% endhighlight %}
import org.apache.spark.api.java.JavaRDD; import org.apache.spark.ml.feature.Tokenizer; import org.apache.spark.mllib.linalg.Vector; import org.apache.spark.sql.DataFrame; import org.apache.spark.sql.Row; import org.apache.spark.sql.RowFactory; import org.apache.spark.sql.types.DataTypes; import org.apache.spark.sql.types.Metadata; import org.apache.spark.sql.types.StructField; import org.apache.spark.sql.types.StructType;
JavaRDD jrdd = jsc.parallelize(Lists.newArrayList( RowFactory.create(0, "Hi I heard about Spark"), RowFactory.create(0, "I wish Java could use case classes"), RowFactory.create(1, "Logistic regression models are neat") )); StructType schema = new StructType(new StructField[]{ new StructField("label", DataTypes.DoubleType, false, Metadata.empty()), new StructField("sentence", DataTypes.StringType, false, Metadata.empty()) }); DataFrame sentenceDataFrame = sqlContext.createDataFrame(jrdd, schema); Tokenizer tokenizer = new Tokenizer().setInputCol("sentence").setOutputCol("words"); DataFrame wordsDataFrame = tokenizer.transform(sentenceDataFrame); for (Row r : wordsDataFrame.select("words", "label").take(3)) { java.util.List words = r.getList(0); for (String word : words) System.out.print(word + " "); System.out.println(); } {% endhighlight %}
sentenceDataFrame = sqlContext.createDataFrame([ (0, "Hi I heard about Spark"), (0, "I wish Java could use case classes"), (1, "Logistic regression models are neat") ], ["label", "sentence"]) tokenizer = Tokenizer(inputCol="sentence", outputCol="words") wordsDataFrame = tokenizer.transform(sentenceDataFrame) for words_label in wordsDataFrame.select("words", "label").take(3): print words_label {% endhighlight %}
Binarizer
Binarization is the process of thresholding numerical features to binary features. As some probabilistic estimators make assumption that the input data is distributed according to Bernoulli distribution, a binarizer is useful for pre-processing the input data with continuous numerical features.
A simple Binarizer class provides this functionality. Besides the common parameters of inputCol
and outputCol
, Binarizer
has the parameter threshold
used for binarizing continuous numerical features. The features greater than the threshold, will be binarized to 1.0. The features equal to or less than the threshold, will be binarized to 0.0. The example below shows how to binarize numerical features.
val data = Array( (0, 0.1), (1, 0.8), (2, 0.2) ) val dataFrame: DataFrame = sqlContext.createDataFrame(data).toDF("label", "feature")
val binarizer: Binarizer = new Binarizer() .setInputCol("feature") .setOutputCol("binarized_feature") .setThreshold(0.5)
val binarizedDataFrame = binarizer.transform(dataFrame) val binarizedFeatures = binarizedDataFrame.select("binarized_feature") binarizedFeatures.collect().foreach(println) {% endhighlight %}
import org.apache.spark.api.java.JavaRDD; import org.apache.spark.ml.feature.Binarizer; import org.apache.spark.sql.DataFrame; import org.apache.spark.sql.Row; import org.apache.spark.sql.RowFactory; import org.apache.spark.sql.types.DataTypes; import org.apache.spark.sql.types.Metadata; import org.apache.spark.sql.types.StructField; import org.apache.spark.sql.types.StructType;
JavaRDD jrdd = jsc.parallelize(Lists.newArrayList( RowFactory.create(0, 0.1), RowFactory.create(1, 0.8), RowFactory.create(2, 0.2) )); StructType schema = new StructType(new StructField[]{ new StructField("label", DataTypes.DoubleType, false, Metadata.empty()), new StructField("feature", DataTypes.DoubleType, false, Metadata.empty()) }); DataFrame continuousDataFrame = jsql.createDataFrame(jrdd, schema); Binarizer binarizer = new Binarizer() .setInputCol("feature") .setOutputCol("binarized_feature") .setThreshold(0.5); DataFrame binarizedDataFrame = binarizer.transform(continuousDataFrame); DataFrame binarizedFeatures = binarizedDataFrame.select("binarized_feature"); for (Row r : binarizedFeatures.collect()) { Double binarized_value = r.getDouble(0); System.out.println(binarized_value); } {% endhighlight %}
continuousDataFrame = sqlContext.createDataFrame([ (0, 0.1), (1, 0.8), (2, 0.2) ], ["label", "feature"]) binarizer = Binarizer(threshold=0.5, inputCol="feature", outputCol="binarized_feature") binarizedDataFrame = binarizer.transform(continuousDataFrame) binarizedFeatures = binarizedDataFrame.select("binarized_feature") for binarized_feature, in binarizedFeatures.collect(): print binarized_feature {% endhighlight %}