[SPARK-26315][PYSPARK] auto cast threshold from Integer to Float in approxSimilarityJoin of BucketedRandomProjectionLSHModel

## What changes were proposed in this pull request?

If the input parameter 'threshold' to the function approxSimilarityJoin is not a float (for example, an int), the call raises an exception. The fix is to convert 'threshold' to a float before calling the Java implementation method.

## How was this patch tested?

Added a new test case. Without this fix, the test throws the exception reported in the JIRA. With the fix, the test passes.

Closes #23313 from jerryjch/SPARK-26315.

Authored-by: Jing Chen He <jinghe@us.ibm.com>
Signed-off-by: Sean Owen <sean.owen@databricks.com>
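The coercion described above can be illustrated without a Spark session. The following is a minimal sketch, not the PySpark implementation: `to_float`, `fake_call_java`, and `approx_similarity_join` are hypothetical stand-ins for `TypeConverters.toFloat`, `_call_java`, and the patched wrapper, and the strict type check in `fake_call_java` is an assumption used to imitate a JVM method that only accepts a double.

```python
def to_float(value):
    """Hypothetical stand-in for the coercion step: numbers become float, anything else is rejected."""
    # bool is a subclass of int, so it is excluded explicitly (a choice made for this sketch).
    if isinstance(value, bool) or not isinstance(value, (int, float)):
        raise TypeError("Could not convert %r to float" % (value,))
    return float(value)


def fake_call_java(method, dataset_a, dataset_b, threshold, dist_col):
    """Hypothetical stand-in for the JVM call: it rejects a threshold that is not a float."""
    if type(threshold) is not float:
        raise TypeError("%s: threshold must be a float, got %s"
                        % (method, type(threshold).__name__))
    return {"method": method, "threshold": threshold, "distCol": dist_col}


def approx_similarity_join(dataset_a, dataset_b, threshold, dist_col="distCol"):
    """Sketch of the patched wrapper: coerce the threshold first, then delegate."""
    threshold = to_float(threshold)
    return fake_call_java("approxSimilarityJoin", dataset_a, dataset_b, threshold, dist_col)


# An integer threshold, as in the new doctest, now reaches the backend as 3.0.
result = approx_similarity_join("dfA", "dfB", 3, dist_col="EuclideanDistance")
print(result["threshold"])  # 3.0
```

Doing the conversion in the Python wrapper keeps the caller-facing API permissive (both `3` and `3.0` work) while the backend still receives a single, predictable type.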
2018-12-15 08:41:16 -06:00 · 860f4497f2
parent 9ccae0c9e7
commit 860f4497f2
1 changed file with 11 additions and 0 deletions
--- a/python/pyspark/ml/feature.py
+++ b/python/pyspark/ml/feature.py
@@ -192,6 +192,7 @@ class LSHModel(JavaModel):
                 "datasetA" and "datasetB", and a column "distCol" is added to show the distance
                 between each pair.
        """
+        threshold = TypeConverters.toFloat(threshold)
        return self._call_java("approxSimilarityJoin", datasetA, datasetB, threshold, distCol)


@@ -239,6 +240,16 @@ class BucketedRandomProjectionLSH(JavaEstimator, LSHParams, HasInputCol, HasOutp
    |  3|  6| 2.23606797749979|
    +---+---+-----------------+
    ...
+    >>> model.approxSimilarityJoin(df, df2, 3, distCol="EuclideanDistance").select(
+    ...     col("datasetA.id").alias("idA"),
+    ...     col("datasetB.id").alias("idB"),
+    ...     col("EuclideanDistance")).show()
+    +---+---+-----------------+
+    |idA|idB|EuclideanDistance|
+    +---+---+-----------------+
+    |  3|  6| 2.23606797749979|
+    +---+---+-----------------+
+    ...
    >>> brpPath = temp_path + "/brp"
    >>> brp.save(brpPath)
    >>> brp2 = BucketedRandomProjectionLSH.load(brpPath)