93ea353cae
## What changes were proposed in this pull request?

This PR proposes two things.

1. Deduplicate the type checking logic. While I am here, I checked each type. Currently, binary type, float type, nested struct type and array type are not supported.

   **For map and nested struct types:** they are expected to be unsupported per Spark's Arrow optimization:

   ```
   Exception in thread "serve-Arrow" java.lang.UnsupportedOperationException: Unsupported data type: map<string,double>
   ...
   ```

   ```
   Exception in thread "serve-Arrow" java.lang.UnsupportedOperationException: Unsupported data type: struct<type:tinyint,size:int,indices:array<int>,values:array<double>>
   ...
   ```

   Please follow the stack trace below to double-check:

   ```
   at org.apache.spark.sql.execution.arrow.ArrowUtils$.toArrowType(ArrowUtils.scala:56)
   at org.apache.spark.sql.execution.arrow.ArrowUtils$.toArrowField(ArrowUtils.scala:92)
   at org.apache.spark.sql.execution.arrow.ArrowUtils$.$anonfun$toArrowSchema$1(ArrowUtils.scala:116)
   at scala.collection.TraversableLike.$anonfun$map$1(TraversableLike.scala:237)
   at scala.collection.Iterator.foreach(Iterator.scala:941)
   at scala.collection.Iterator.foreach$(Iterator.scala:941)
   at scala.collection.AbstractIterator.foreach(Iterator.scala:1429)
   at scala.collection.IterableLike.foreach(IterableLike.scala:74)
   at scala.collection.IterableLike.foreach$(IterableLike.scala:73)
   at org.apache.spark.sql.types.StructType.foreach(StructType.scala:99)
   at scala.collection.TraversableLike.map(TraversableLike.scala:237)
   at scala.collection.TraversableLike.map$(TraversableLike.scala:230)
   at org.apache.spark.sql.types.StructType.map(StructType.scala:99)
   at org.apache.spark.sql.execution.arrow.ArrowUtils$.toArrowSchema(ArrowUtils.scala:115)
   at org.apache.spark.sql.execution.arrow.ArrowBatchStreamWriter.<init>(ArrowConverters.scala:50)
   at org.apache.spark.sql.Dataset.$anonfun$collectAsArrowToR$2(Dataset.scala:3215)
   at org.apache.spark.sql.Dataset.$anonfun$collectAsArrowToR$2$adapted(Dataset.scala:3212)
   ```

   **For float and binary types:** they produce corrupt values in some cases. This needs to be investigated separately.

   **For array type:**

   ```
   Error in Table__to_dataframe(x, use_threads = use_threads) :
     cannot handle Array of type list
   ```

   This seems to be a limitation of Arrow's R library, and it also needs to be investigated separately.

2. While touching the type specification code across the Arrow optimization, move the Arrow optimization related tests into a separate file called `test_arrow.R`.

## How was this patch tested?

Tests were added, and the changes were also manually tested.

Closes #23969 from HyukjinKwon/SPARK-26920.

Authored-by: Hyukjin Kwon <gurwls223@apache.org>

Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
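The idea behind the deduplication in item 1 can be illustrated with a small sketch: validate the whole schema against the known-unsupported Arrow types once, up front, so the user gets one clear error instead of an opaque failure inside the Arrow writer thread. This is a minimal, hypothetical Python illustration of that pattern; `check_arrow_types`, the schema representation, and the type list are assumptions for this sketch, not SparkR's actual API.

```python
# Types the Arrow optimization cannot serialize, per the PR description
# (map/struct fail in ArrowUtils, float/binary corrupt values, array
# hits a limitation in Arrow's R library). Illustrative list only.
UNSUPPORTED_ARROW_TYPES = {"binary", "float", "array", "map", "struct"}

def check_arrow_types(schema):
    """Raise on any column whose type the Arrow path cannot handle.

    `schema` is a list of (column_name, type_string) pairs, e.g.
    [("id", "int"), ("scores", "map<string,double>")].
    """
    for name, dtype in schema:
        # Reduce a parameterized type to its base name,
        # e.g. "map<string,double>" -> "map".
        base = dtype.split("<")[0]
        if base in UNSUPPORTED_ARROW_TYPES:
            raise ValueError(
                f"Unsupported data type for Arrow optimization: "
                f"{dtype} (column '{name}')"
            )

# A schema with only supported types passes silently; one with a map
# column fails fast with a single descriptive error.
check_arrow_types([("id", "int"), ("name", "string")])
try:
    check_arrow_types([("id", "int"), ("scores", "map<string,double>")])
except ValueError as e:
    print(e)
```

Centralizing the check this way is what lets both the serialization path and the tests share one definition of "supported", instead of each call site repeating its own partial list.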