spark-instrumented-optimizer/mllib
Liang Zhang 82d0aa37ae [SPARK-30762] Add dtype=float32 support to vector_to_array UDF
### What changes were proposed in this pull request?
In this PR, we add a parameter in the python function vector_to_array(col) that allows converting to a column of arrays of Float (32bits) in scala, which would be mapped to a numpy array of dtype=float32.

### Why are the changes needed?
In the downstream ML training, using float32 instead of float64 (default) would allow a larger batch size, i.e., allow more data to fit in the memory.

### Does this PR introduce any user-facing change?
Yes.
Old: `vector_to_array()` only take one param
```
df.select(vector_to_array("colA"), ...)
```
New: `vector_to_array()` can take an additional optional param: `dtype` = "float32" (or "float64")
```
df.select(vector_to_array("colA", "float32"), ...)
```

### How was this patch tested?
Unit test in scala.
doctest in python.

Closes #27522 from liangz1/udf-float32.

Authored-by: Liang Zhang <liang.zhang@databricks.com>
Signed-off-by: WeichenXu <weichen.xu@databricks.com>
2020-02-13 23:55:13 +08:00
..
benchmarks [SPARK-29297][TESTS] Compare core/mllib module benchmarks in JDK8/11 2019-09-29 21:43:58 -07:00
src [SPARK-30762] Add dtype=float32 support to vector_to_array UDF 2020-02-13 23:55:13 +08:00
pom.xml [INFRA] Reverts commit 56dcd79 and c216ef1 2019-12-16 19:57:44 -07:00