[SPARK-31671][ML] Wrong error message in VectorAssembler

### What changes were proposed in this pull request?
When input column lengths cannot be inferred and handleInvalid = "keep", VectorAssembler throws a runtime exception. However, the error message lists every input column rather than only the columns whose lengths could not be inferred. This PR changes the message to name only the columns that are missing size metadata.

### Why are the changes needed?
This is a bug. Here is a simple example to reproduce it.

```
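import org.apache.spark.ml.feature.{VectorAssembler, VectorSizeHint}
import org.apache.spark.ml.linalg.Vectors
// assumes a spark-shell session, where spark.implicits._ (needed for toDF) is already in scope

// create a df without vector size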
val df = Seq(
  (Vectors.dense(1.0), Vectors.dense(2.0))
).toDF("n1", "n2")

// only set vector size hint for n1 column
val hintedDf = new VectorSizeHint()
  .setInputCol("n1")
  .setSize(1)
  .transform(df)

// assemble n1, n2
val output = new VectorAssembler()
  .setInputCols(Array("n1", "n2"))
  .setOutputCol("features")
  .setHandleInvalid("keep")
  .transform(hintedDf)

// because only n1 has vector size, the error message should tell us to set vector size for n2 too
output.show()
```

Expected error message:

```
Can not infer column lengths with handleInvalid = "keep". Consider using VectorSizeHint to add metadata for columns: [n2].
```

Actual error message:

```
Can not infer column lengths with handleInvalid = "keep". Consider using VectorSizeHint to add metadata for columns: [n1, n2].
```

This makes the exception hard to resolve, because the message does not show which column requires a VectorSizeHint. It is especially troublesome when there are a large number of columns to deal with.
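
To make the difference concrete, here is a minimal sketch in plain Scala with no Spark dependency. The `ErrorMessageSketch` object, the `message` helper, and the `knownSizes` map are hypothetical names used only for illustration. Only the string template and the `mkString`/`stripMargin` calls mirror the patched line.

```scala
// Hypothetical stand-in for illustration only; this is not the Spark implementation.
object ErrorMessageSketch {
  // Builds the message with the same template as the patched line:
  // stripMargin removes the leading '|', then the newline is replaced by a space.
  def message(cols: Seq[String]): String =
    s"""Can not infer column lengths with handleInvalid = "keep". Consider using VectorSizeHint
       |to add metadata for columns: ${cols.mkString("[", ", ", "]")}."""
      .stripMargin.replaceAll("\n", " ")

  def main(args: Array[String]): Unit = {
    val columns = Seq("n1", "n2")     // every vector input column
    val knownSizes = Map("n1" -> 1)   // assumed: sizes already present in column metadata
    val missingColumns = columns.filterNot(knownSizes.contains)

    println(message(columns))         // message built from all input columns
    println(message(missingColumns))  // message built only from columns lacking a size
  }
}
```

Passing `missingColumns` instead of `columns` is the entire behavioural change: the message then names only the columns that still need a size hint.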

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
Added a test in VectorAssemblerSuite.

Closes #28487 from fan31415/SPARK-31671.

Lead-authored-by: fan31415 <fan12356789@gmail.com>
Co-authored-by: yijiefan <fanyije@gmail.com>
Signed-off-by: Sean Owen <srowen@gmail.com>
fan31415 2020-05-11 18:23:23 -05:00 committed by Sean Owen
parent d7c3e9e53e
commit 64fb358a99
2 changed files with 12 additions and 1 deletions


@@ -233,7 +233,7 @@ object VectorAssembler extends DefaultParamsReadable[VectorAssembler] {
         getVectorLengthsFromFirstRow(dataset.na.drop(missingColumns), missingColumns)
       case (true, VectorAssembler.KEEP_INVALID) => throw new RuntimeException(
         s"""Can not infer column lengths with handleInvalid = "keep". Consider using VectorSizeHint
-           |to add metadata for columns: ${columns.mkString("[", ", ", "]")}."""
+           |to add metadata for columns: ${missingColumns.mkString("[", ", ", "]")}."""
           .stripMargin.replaceAll("\n", " "))
       case (_, _) => Map.empty
     }


@@ -261,4 +261,15 @@ class VectorAssemblerSuite
     val output = vectorAssembler.transform(dfWithNullsAndNaNs)
     assert(output.select("a").limit(1).collect().head == Row(Vectors.sparse(0, Seq.empty)))
   }
+
+  test("SPARK-31671: should give explicit error message when can not infer column lengths") {
+    val df = Seq(
+      (Vectors.dense(1.0), Vectors.dense(2.0))
+    ).toDF("n1", "n2")
+    val hintedDf = new VectorSizeHint().setInputCol("n1").setSize(1).transform(df)
+    val assembler = new VectorAssembler()
+      .setInputCols(Array("n1", "n2")).setOutputCol("features")
+    assert(!intercept[RuntimeException](assembler.setHandleInvalid("keep").transform(hintedDf))
+      .getMessage.contains("n1"), "should only show no vector size columns' name")
+  }
 }