[SPARK-25584][ML][DOC] datasource for libsvm user guide

## What changes were proposed in this pull request?
It seems that the doc for the libsvm data source was not added in https://github.com/apache/spark/pull/22675.
This PR adds it.

## How was this patch tested?
Doc built locally:
![image](https://user-images.githubusercontent.com/7322292/62044350-4ad51480-b235-11e9-8f09-cbcbe9d3b7f9.png)

Closes #25286 from zhengruifeng/doc_libsvm_data_source.

Authored-by: zhengruifeng <ruifengz@foxmail.com>
Signed-off-by: Sean Owen <sean.owen@databricks.com>
zhengruifeng 2019-08-01 09:15:42 -05:00 committed by Sean Owen
parent 80ab19b9fd
commit b29829e2ab
2 changed files with 118 additions and 4 deletions


@@ -1,7 +1,7 @@
- text: Basic statistics
url: ml-statistics.html
- text: Data sources
-url: ml-datasource
+url: ml-datasource.html
- text: Pipelines
url: ml-pipeline.html
- text: Extracting, transforming and selecting features


@@ -63,7 +63,7 @@ scala> df.select("image.origin", "image.width", "image.height").show(truncate=fa
<div data-lang="java" markdown="1">
[`ImageDataSource`](api/java/org/apache/spark/ml/source/image/ImageDataSource.html)
-implements Spark SQL data source API for loading image data as DataFrame.
+implements Spark SQL data source API for loading image data as a DataFrame.
{% highlight java %}
Dataset<Row> imagesDF = spark.read().format("image").option("dropInvalid", true).load("data/mllib/images/origin/kittens");
@@ -83,7 +83,7 @@ Will output:
</div>
<div data-lang="python" markdown="1">
-In PySpark we provide Spark SQL data source API for loading image data as DataFrame.
+In PySpark we provide Spark SQL data source API for loading image data as a DataFrame.
{% highlight python %}
>>> df = spark.read.format("image").option("dropInvalid", true).load("data/mllib/images/origin/kittens")
@@ -100,7 +100,7 @@ In PySpark we provide Spark SQL data source API for loading image data as DataFr
</div>
<div data-lang="r" markdown="1">
-In SparkR we provide Spark SQL data source API for loading image data as DataFrame.
+In SparkR we provide Spark SQL data source API for loading image data as a DataFrame.
{% highlight r %}
> df = read.df("data/mllib/images/origin/kittens", "image")
@@ -120,4 +120,118 @@ In SparkR we provide Spark SQL data source API for loading image data as DataFra
</div>
</div>
## LIBSVM data source
This `LIBSVM` data source is used to load files in the `libsvm` format from a directory.
The loaded DataFrame has two columns: `label`, containing labels stored as doubles, and `features`, containing feature vectors stored as Vectors. A minimal sketch of the input file format follows the schema summary below.
The schemas of the columns are:
- label: `DoubleType` (represents the instance label)
- features: `VectorUDT` (represents the feature vector)
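Each line of a `LIBSVM` file encodes one labeled sparse vector as `label index1:value1 index2:value2 ...`, with one-based, ascending indices. As a minimal sketch (made-up values, assuming a `SparkSession` named `spark` as in spark-shell, and a hypothetical temporary path), the following writes three such lines and reads them back through this data source:

{% highlight scala %}
import java.nio.file.Files

// Three made-up records in LIBSVM format: "<label> <index>:<value> ...", indices are 1-based.
val sample = Seq(
  "0 1:0.5 3:1.2",
  "1 2:0.8 4:2.1",
  "0 1:1.5 4:0.3")

// Write them to a temporary file, then load them with the libsvm data source.
val path = Files.createTempFile("tiny_libsvm", ".txt")
Files.write(path, sample.mkString("\n").getBytes("UTF-8"))

val tiny = spark.read.format("libsvm").option("numFeatures", "4").load(path.toString)
tiny.show(false)  // yields the (label: double, features: vector) columns described above
{% endhighlight %}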
<div class="codetabs">
<div data-lang="scala" markdown="1">
[`LibSVMDataSource`](api/scala/index.html#org.apache.spark.ml.source.libsvm.LibSVMDataSource)
implements Spark SQL data source API for loading `LIBSVM` data as a DataFrame.
{% highlight scala %}
scala> val df = spark.read.format("libsvm").option("numFeatures", "780").load("data/mllib/sample_libsvm_data.txt")
df: org.apache.spark.sql.DataFrame = [label: double, features: vector]
scala> df.show(10)
+-----+--------------------+
|label| features|
+-----+--------------------+
| 0.0|(780,[127,128,129...|
| 1.0|(780,[158,159,160...|
| 1.0|(780,[124,125,126...|
| 1.0|(780,[152,153,154...|
| 1.0|(780,[151,152,153...|
| 0.0|(780,[129,130,131...|
| 1.0|(780,[158,159,160...|
| 1.0|(780,[99,100,101,...|
| 0.0|(780,[154,155,156...|
| 0.0|(780,[127,128,129...|
+-----+--------------------+
only showing top 10 rows
{% endhighlight %}
</div>
<div data-lang="java" markdown="1">
[`LibSVMDataSource`](api/java/org/apache/spark/ml/source/libsvm/LibSVMDataSource.html)
implements Spark SQL data source API for loading `LIBSVM` data as a DataFrame.
{% highlight java %}
Dataset<Row> df = spark.read().format("libsvm").option("numFeatures", "780").load("data/mllib/sample_libsvm_data.txt");
df.show(10);
/*
Will output:
+-----+--------------------+
|label| features|
+-----+--------------------+
| 0.0|(780,[127,128,129...|
| 1.0|(780,[158,159,160...|
| 1.0|(780,[124,125,126...|
| 1.0|(780,[152,153,154...|
| 1.0|(780,[151,152,153...|
| 0.0|(780,[129,130,131...|
| 1.0|(780,[158,159,160...|
| 1.0|(780,[99,100,101,...|
| 0.0|(780,[154,155,156...|
| 0.0|(780,[127,128,129...|
+-----+--------------------+
only showing top 10 rows
*/
{% endhighlight %}
</div>
<div data-lang="python" markdown="1">
In PySpark we provide Spark SQL data source API for loading `LIBSVM` data as a DataFrame.
{% highlight python %}
>>> df = spark.read.format("libsvm").option("numFeatures", "780").load("data/mllib/sample_libsvm_data.txt")
>>> df.show(10)
+-----+--------------------+
|label| features|
+-----+--------------------+
| 0.0|(780,[127,128,129...|
| 1.0|(780,[158,159,160...|
| 1.0|(780,[124,125,126...|
| 1.0|(780,[152,153,154...|
| 1.0|(780,[151,152,153...|
| 0.0|(780,[129,130,131...|
| 1.0|(780,[158,159,160...|
| 1.0|(780,[99,100,101,...|
| 0.0|(780,[154,155,156...|
| 0.0|(780,[127,128,129...|
+-----+--------------------+
only showing top 10 rows
{% endhighlight %}
</div>
<div data-lang="r" markdown="1">
In SparkR we provide Spark SQL data source API for loading `LIBSVM` data as a DataFrame.
{% highlight r %}
> df = read.df("data/mllib/sample_libsvm_data.txt", "libsvm")
> head(select(df, df$label, df$features), 10)
label features
1 0 <environment: 0x7fe6d35366e8>
2 1 <environment: 0x7fe6d353bf78>
3 1 <environment: 0x7fe6d3541840>
4 1 <environment: 0x7fe6d3545108>
5 1 <environment: 0x7fe6d354c8e0>
6 0 <environment: 0x7fe6d35501a8>
7 1 <environment: 0x7fe6d3555a70>
8 1 <environment: 0x7fe6d3559338>
9 0 <environment: 0x7fe6d355cc00>
10 0 <environment: 0x7fe6d35643d8>
{% endhighlight %}
</div>
</div>
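The examples above pass the `numFeatures` option explicitly; if it is left out, the data source determines the number of features by scanning the input, at the cost of one extra pass over the data, so setting it is preferable when the dimensionality is known in advance. The data source can also be used to save a DataFrame in the same format (it expects the `label`/`features` schema shown above). A brief sketch, again assuming a `SparkSession` named `spark`; the output path is illustrative only:

{% highlight scala %}
// Without "numFeatures" the feature dimension is inferred, requiring one extra pass over the data.
val df = spark.read.format("libsvm").load("data/mllib/sample_libsvm_data.txt")

// Save a (label, features) DataFrame back out in LIBSVM format (illustrative output path).
df.write.format("libsvm").save("/tmp/sample_libsvm_roundtrip")
{% endhighlight %}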