spark-instrumented-optimizer/mllib
Weichen Xu b2300fca1e [SPARK-31676][ML] QuantileDiscretizer raise error parameter splits given invalid value (splits array includes -0.0 and 0.0)
### What changes were proposed in this pull request?

In `QuantileDiscretizer.getDistinctSplits`, normalize every `-0.0` in the splits array to `0.0` before invoking `distinct`:
```
    for (i <- 0 until splits.length) {
      if (splits(i) == -0.0) {
        splits(i) = 0.0
      }
    }
```
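
As a self-contained illustration of the normalization above (a sketch, not the actual Spark source), the snippet below applies the same comparison to a copy of the array and then deduplicates it. The helper name `normalizeZeros` and the sample values are assumptions made for this example.

```scala
// Hypothetical helper, not part of Spark: map every zero (either sign) to +0.0.
def normalizeZeros(splits: Array[Double]): Array[Double] =
  // `x == -0.0` is true for both 0.0 and -0.0 under primitive comparison,
  // so both signs of zero are rewritten to positive zero.
  splits.map(x => if (x == -0.0) 0.0 else x)

// Illustrative input containing both signs of zero.
val raw = Array(-1.0, -0.0, 0.0, 0.5)

// After normalization the two zeros are bit-identical, so distinct keeps one.
val deduped = normalizeZeros(raw).distinct
println(deduped.mkString(", "))
```

The patch itself updates the array in place; the sketch returns a new array only to keep the example free of side effects.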
### Why are the changes needed?
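Bug fix: when the computed splits contain both `-0.0` and `0.0`, `distinct` keeps both values, and the resulting splits array is rejected as an invalid `splits` parameter.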

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
Unit test.

#### Manual test:

~~~scala
import scala.util.Random
val rng = new Random(3)

val a1 = Array.tabulate(200)(_=>rng.nextDouble * 2.0 - 1.0) ++ Array.fill(20)(0.0) ++ Array.fill(20)(-0.0)

import spark.implicits._
val df1 = sc.parallelize(a1, 2).toDF("id")

import org.apache.spark.ml.feature.QuantileDiscretizer
val qd = new QuantileDiscretizer().setInputCol("id").setOutputCol("out").setNumBuckets(200).setRelativeError(0.0)

val model = qd.fit(df1) // raises an error on Spark master without this fix
~~~

### Explanation
In Scala, `0.0 == -0.0` is `true`, but `0.0.hashCode == -0.0.hashCode()` is `false`. This breaks the contract between `equals()` and `hashCode()`: if two objects are equal, they must have the same hash code.

`array.distinct` relies on `elem.hashCode`, so both zeros survive deduplication, which leads to this error.
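
The mismatch can be checked directly with a minimal sketch that compares the two zeros as primitives and then through `java.lang.Double`, whose `equals` and `hashCode` are defined on the bit pattern. The value names are chosen for this example only.

```scala
// Primitive comparison follows IEEE 754: the two zeros compare equal.
val primitiveEqual = 0.0 == -0.0

// java.lang.Double.equals compares bit patterns, so the boxed zeros differ.
val boxedEqual =
  java.lang.Double.valueOf(0.0).equals(java.lang.Double.valueOf(-0.0))

// The hash code is also derived from the bit pattern, so the hashes differ.
val sameHash =
  java.lang.Double.hashCode(0.0) == java.lang.Double.hashCode(-0.0)

println(s"primitive ==: $primitiveEqual, boxed equals: $boxedEqual, same hashCode: $sameHash")
```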

Test code for `distinct`:
```
import scala.util.Random
val rng = new Random(3)

val a1 = Array.tabulate(200)(_=>rng.nextDouble * 2.0 - 1.0) ++ Array.fill(20)(0.0) ++ Array.fill(20)(-0.0)
a1.distinct.sorted.foreach(x => print(x.toString + "\n"))
```

Then you will see output like:
```
...
-0.009292684662246975
-0.0033280686465135823
-0.0
0.0
0.0022219556032221366
0.02217419561977274
...
```
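
To see why an array containing both zeros is rejected, here is a small sketch that assumes the splits are validated as strictly increasing. The helper `isStrictlyIncreasing` is hypothetical and only stands in for whatever validation is actually applied; the sample values are taken from the output above.

```scala
// Hypothetical check, assumed to mirror a "strictly increasing" requirement on splits.
def isStrictlyIncreasing(splits: Array[Double]): Boolean =
  splits.sliding(2).forall {
    case Array(a, b) => a < b
    case _           => true // fewer than two elements: nothing to compare
  }

// Both zeros present: -0.0 < 0.0 is false, so the check fails.
val withBothZeros =
  Array(-0.0033280686465135823, -0.0, 0.0, 0.0022219556032221366)

// A single zero, as produced after normalization: the check passes.
val withOneZero =
  Array(-0.0033280686465135823, 0.0, 0.0022219556032221366)

println(isStrictlyIncreasing(withBothZeros))
println(isStrictlyIncreasing(withOneZero))
```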

Closes #28498 from WeichenXu123/SPARK-31676.

Authored-by: Weichen Xu <weichen.xu@databricks.com>
Signed-off-by: Sean Owen <srowen@gmail.com>
2020-05-14 09:24:40 -05:00
..
benchmarks [SPARK-29297][TESTS] Compare core/mllib module benchmarks in JDK8/11 2019-09-29 21:43:58 -07:00
src [SPARK-31676][ML] QuantileDiscretizer raise error parameter splits given invalid value (splits array includes -0.0 and 0.0) 2020-05-14 09:24:40 -05:00
pom.xml [SPARK-30950][BUILD] Setting version to 3.1.0-SNAPSHOT 2020-02-25 19:44:31 -08:00