[SPARK-19134][EXAMPLE] Fix several sql, mllib and status api examples not working

## What changes were proposed in this pull request?

**binary_classification_metrics_example.py**

The LibSVM data source loads features as `ml.linalg.SparseVector`, whereas the example requires `mllib.linalg.SparseVector`. The equivalent Scala example, `BinaryClassificationMetricsExample.scala`, seems fine.

```bash
./bin/spark-submit examples/src/main/python/mllib/binary_classification_metrics_example.py
```

```
  File ".../spark/examples/src/main/python/mllib/binary_classification_metrics_example.py", line 39, in <lambda>
    .rdd.map(lambda row: LabeledPoint(row[0], row[1]))
  File ".../spark/python/pyspark/mllib/regression.py", line 54, in __init__
    self.features = _convert_to_vector(features)
  File ".../spark/python/pyspark/mllib/linalg/__init__.py", line 80, in _convert_to_vector
    raise TypeError("Cannot convert type %s into Vector" % type(l))
TypeError: Cannot convert type <class 'pyspark.ml.linalg.SparseVector'> into Vector
```
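
The fix (see the diff below) loads the data with `MLUtils.loadLibSVMFile`, which yields `mllib.linalg`-backed `LabeledPoint`s directly, instead of going through the `ml` LibSVM data source:

```python
from pyspark import SparkContext
from pyspark.mllib.util import MLUtils

sc = SparkContext(appName="BinaryClassificationMetricsExample")
# loadLibSVMFile returns an RDD of LabeledPoint whose features are
# mllib.linalg vectors, which is what LogisticRegressionWithLBFGS expects.
data = MLUtils.loadLibSVMFile(sc, "data/mllib/sample_binary_classification_data.txt")
```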

**status_api_demo.py** (this one does not work with Python 3.4.6)

The `Queue` module was renamed to `queue` in Python 3, so `import Queue` fails there.

```bash
PYSPARK_PYTHON=python3 ./bin/spark-submit examples/src/main/python/status_api_demo.py
```

```
Traceback (most recent call last):
  File ".../spark/examples/src/main/python/status_api_demo.py", line 22, in <module>
    import Queue
ImportError: No module named 'Queue'
```
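
The fix (see the diff below) aliases `queue` to `Queue` on Python 3 so the rest of the example keeps working on both versions:

```python
import sys

# Python 3 renamed the Queue module to queue.
if sys.version >= '3':
    import queue as Queue
else:
    import Queue
```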

**bisecting_k_means_example.py**

`BisectingKMeansModel` does not implement `save` and `load` in Python.

```bash
./bin/spark-submit examples/src/main/python/mllib/bisecting_k_means_example.py
```

```
Traceback (most recent call last):
  File ".../spark/examples/src/main/python/mllib/bisecting_k_means_example.py", line 46, in <module>
    model.save(sc, path)
AttributeError: 'BisectingKMeansModel' object has no attribute 'save'
```
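
Since the Python API offers no replacement, the fix simply removes the save-and-load section from this example (see the diff below).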

**elementwise_product_example.py**

It calls `collect` on a vector: transforming a single vector returns another vector, not an RDD, so there is nothing to `collect`.

```bash
./bin/spark-submit examples/src/main/python/mllib/elementwise_product_example.py
```

```
Traceback (most recent call last):
  File ".../spark/examples/src/main/python/mllib/elementwise_product_example.py", line 48, in <module>
    for each in transformedData2.collect():
  File ".../spark/python/pyspark/mllib/linalg/__init__.py", line 478, in __getattr__
    return getattr(self.array, item)
AttributeError: 'numpy.ndarray' object has no attribute 'collect'
```
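
The fix iterates over the transformed vector directly instead of calling `collect` (see the diff below):

```python
# transformedData2 is a vector, not an RDD; iterate over its elements directly.
for each in transformedData2:
    print(each)
```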

**The following three examples throw an exception because a relative path is set in `spark.sql.warehouse.dir`.**

**hive.py**

```bash
./bin/spark-submit examples/src/main/python/sql/hive.py
```

```
Traceback (most recent call last):
  File ".../spark/examples/src/main/python/sql/hive.py", line 47, in <module>
    spark.sql("CREATE TABLE IF NOT EXISTS src (key INT, value STRING) USING hive")
  File ".../spark/python/lib/pyspark.zip/pyspark/sql/session.py", line 541, in sql
  File ".../spark/python/lib/py4j-0.10.4-src.zip/py4j/java_gateway.py", line 1133, in __call__
  File ".../spark/python/lib/pyspark.zip/pyspark/sql/utils.py", line 69, in deco
pyspark.sql.utils.AnalysisException: 'org.apache.hadoop.hive.ql.metadata.HiveException: MetaException(message:java.lang.IllegalArgumentException: java.net.URISyntaxException: Relative path in absolute URI: file:./spark-warehouse);'
```
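
The fix resolves the warehouse location to an absolute path before it is passed to `spark.sql.warehouse.dir` (see the diff below):

```python
from os.path import abspath

# An absolute path avoids "Relative path in absolute URI: file:./spark-warehouse".
warehouse_location = abspath('spark-warehouse')
```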

**SparkHiveExample.scala**

```bash
./bin/run-example sql.hive.SparkHiveExample
```

```
Exception in thread "main" org.apache.hadoop.hive.ql.metadata.HiveException: Unable to alter table. java.lang.IllegalArgumentException: java.net.URISyntaxException: Relative path in absolute URI: file:./spark-warehouse
	at org.apache.hadoop.hive.ql.metadata.Hive.alterTable(Hive.java:498)
	at org.apache.hadoop.hive.ql.metadata.Hive.alterTable(Hive.java:484)
	at org.apache.hadoop.hive.ql.metadata.Hive.loadTable(Hive.java:1668)
```

**JavaSparkHiveExample.java**

```bash
./bin/run-example sql.hive.JavaSparkHiveExample
```

```
Exception in thread "main" org.apache.hadoop.hive.ql.metadata.HiveException: Unable to alter table. java.lang.IllegalArgumentException: java.net.URISyntaxException: Relative path in absolute URI: file:./spark-warehouse
	at org.apache.hadoop.hive.ql.metadata.Hive.alterTable(Hive.java:498)
	at org.apache.hadoop.hive.ql.metadata.Hive.alterTable(Hive.java:484)
	at org.apache.hadoop.hive.ql.metadata.Hive.loadTable(Hive.java:1668)
```
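
The Scala and Java examples get the same treatment: `warehouseLocation` is resolved with `new File("spark-warehouse").getAbsolutePath` before being passed to the session builder (see the diffs below).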

## How was this patch tested?

Manually via

```bash
./bin/spark-submit examples/src/main/python/mllib/binary_classification_metrics_example.py
PYSPARK_PYTHON=python3 ./bin/spark-submit examples/src/main/python/status_api_demo.py
./bin/spark-submit examples/src/main/python/mllib/bisecting_k_means_example.py
./bin/spark-submit examples/src/main/python/mllib/elementwise_product_example.py
./bin/spark-submit examples/src/main/python/sql/hive.py
./bin/run-example sql.hive.JavaSparkHiveExample
./bin/run-example sql.hive.SparkHiveExample
```

These were found via

```bash
find ./examples/src/main/python -name "*.py" -exec spark-submit {} \;
```

Author: hyukjinkwon <gurwls223@gmail.com>

Closes #16515 from HyukjinKwon/minor-example-fix.

7 changed files with 18 additions and 21 deletions:

**examples/src/main/java/org/apache/spark/examples/sql/hive/JavaSparkHiveExample.java**

```diff
@@ -17,6 +17,7 @@
 package org.apache.spark.examples.sql.hive;
 
 // $example on:spark_hive$
+import java.io.File;
 import java.io.Serializable;
 import java.util.ArrayList;
 import java.util.List;
@@ -56,7 +57,7 @@ public class JavaSparkHiveExample {
   public static void main(String[] args) {
     // $example on:spark_hive$
     // warehouseLocation points to the default location for managed databases and tables
-    String warehouseLocation = "spark-warehouse";
+    String warehouseLocation = new File("spark-warehouse").getAbsolutePath();
     SparkSession spark = SparkSession
       .builder()
       .appName("Java Spark Hive Example")
```

**examples/src/main/python/mllib/binary_classification_metrics_example.py**

```diff
@@ -18,25 +18,20 @@
 Binary Classification Metrics Example.
 """
 from __future__ import print_function
-from pyspark.sql import SparkSession
+from pyspark import SparkContext
 # $example on$
 from pyspark.mllib.classification import LogisticRegressionWithLBFGS
 from pyspark.mllib.evaluation import BinaryClassificationMetrics
-from pyspark.mllib.regression import LabeledPoint
+from pyspark.mllib.util import MLUtils
 # $example off$
 
 if __name__ == "__main__":
-    spark = SparkSession\
-        .builder\
-        .appName("BinaryClassificationMetricsExample")\
-        .getOrCreate()
+    sc = SparkContext(appName="BinaryClassificationMetricsExample")
 
     # $example on$
     # Several of the methods available in scala are currently missing from pyspark
     # Load training data in LIBSVM format
-    data = spark\
-        .read.format("libsvm").load("data/mllib/sample_binary_classification_data.txt")\
-        .rdd.map(lambda row: LabeledPoint(row[0], row[1]))
+    data = MLUtils.loadLibSVMFile(sc, "data/mllib/sample_binary_classification_data.txt")
 
     # Split data into training (60%) and test (40%)
     training, test = data.randomSplit([0.6, 0.4], seed=11)
@@ -58,4 +53,4 @@ if __name__ == "__main__":
     print("Area under ROC = %s" % metrics.areaUnderROC)
     # $example off$
 
-    spark.stop()
+    sc.stop()
```

**examples/src/main/python/mllib/bisecting_k_means_example.py**

```diff
@@ -40,11 +40,6 @@ if __name__ == "__main__":
     # Evaluate clustering
     cost = model.computeCost(parsedData)
     print("Bisecting K-means Cost = " + str(cost))
-
-    # Save and load model
-    path = "target/org/apache/spark/PythonBisectingKMeansExample/BisectingKMeansModel"
-    model.save(sc, path)
-    sameModel = BisectingKMeansModel.load(sc, path)
     # $example off$
 
     sc.stop()
```

**examples/src/main/python/mllib/elementwise_product_example.py**

```diff
@@ -45,7 +45,7 @@ if __name__ == "__main__":
         print(each)
 
     print("transformedData2:")
-    for each in transformedData2.collect():
+    for each in transformedData2:
         print(each)
 
     sc.stop()
```

**examples/src/main/python/sql/hive.py**

```diff
@@ -18,7 +18,7 @@
 from __future__ import print_function
 
 # $example on:spark_hive$
-from os.path import expanduser, join
+from os.path import expanduser, join, abspath
 
 from pyspark.sql import SparkSession
 from pyspark.sql import Row
@@ -34,7 +34,7 @@ Run with:
 
 if __name__ == "__main__":
     # $example on:spark_hive$
     # warehouse_location points to the default location for managed databases and tables
-    warehouse_location = 'spark-warehouse'
+    warehouse_location = abspath('spark-warehouse')
     spark = SparkSession \
         .builder \
```

**examples/src/main/python/status_api_demo.py**

```diff
@@ -19,6 +19,10 @@ from __future__ import print_function
 
 import time
 import threading
-import Queue
+import sys
+if sys.version >= '3':
+    import queue as Queue
+else:
+    import Queue
 
 from pyspark import SparkConf, SparkContext
```

**examples/src/main/scala/org/apache/spark/examples/sql/hive/SparkHiveExample.scala**

```diff
@@ -17,6 +17,8 @@
 package org.apache.spark.examples.sql.hive
 
 // $example on:spark_hive$
+import java.io.File
+
 import org.apache.spark.sql.Row
 import org.apache.spark.sql.SparkSession
 // $example off:spark_hive$
@@ -38,7 +40,7 @@ object SparkHiveExample {
 
     // $example on:spark_hive$
     // warehouseLocation points to the default location for managed databases and tables
-    val warehouseLocation = "spark-warehouse"
+    val warehouseLocation = new File("spark-warehouse").getAbsolutePath
     val spark = SparkSession
       .builder()
```