spark-instrumented-optimizer

History

Hyukjin Kwon 93ea353cae [SPARK-26920][R] Deduplicate type checking across Arrow optimization in SparkR

## What changes were proposed in this pull request?

This PR proposes two things.

1. Deduplicates the type checking logic. While I am here, I checked each type. Currently, binary type, float type, nested struct type and array type are not supported.

For map and nested struct types: they are expected to be unsupported per Spark's Arrow optimization.

```
Exception in thread "serve-Arrow" java.lang.UnsupportedOperationException: Unsupported data type: map<string,double>
...
```

```
Exception in thread "serve-Arrow" java.lang.UnsupportedOperationException: Unsupported data type: struct<type:tinyint,size:int,indices:array<int>,values:array<double>>
...
```

Please follow the trace below to double check.

```
at org.apache.spark.sql.execution.arrow.ArrowUtils$.toArrowType(ArrowUtils.scala:56)
at org.apache.spark.sql.execution.arrow.ArrowUtils$.toArrowField(ArrowUtils.scala:92)
at org.apache.spark.sql.execution.arrow.ArrowUtils$.$anonfun$toArrowSchema$1(ArrowUtils.scala:116)
at scala.collection.TraversableLike.$anonfun$map$1(TraversableLike.scala:237)
at scala.collection.Iterator.foreach(Iterator.scala:941)
at scala.collection.Iterator.foreach$(Iterator.scala:941)
at scala.collection.AbstractIterator.foreach(Iterator.scala:1429)
at scala.collection.IterableLike.foreach(IterableLike.scala:74)
at scala.collection.IterableLike.foreach$(IterableLike.scala:73)
at org.apache.spark.sql.types.StructType.foreach(StructType.scala:99)
at scala.collection.TraversableLike.map(TraversableLike.scala:237)
at scala.collection.TraversableLike.map$(TraversableLike.scala:230)
at org.apache.spark.sql.types.StructType.map(StructType.scala:99)
at org.apache.spark.sql.execution.arrow.ArrowUtils$.toArrowSchema(ArrowUtils.scala:115)
at org.apache.spark.sql.execution.arrow.ArrowBatchStreamWriter.<init>(ArrowConverters.scala:50)
at org.apache.spark.sql.Dataset.$anonfun$collectAsArrowToR$2(Dataset.scala:3215)
at org.apache.spark.sql.Dataset.$anonfun$collectAsArrowToR$2$adapted(Dataset.scala:3212)
```

For float and binary types: they cause corrupt values in some cases. This needs to be investigated separately.

For array type:

```
Error in Table__to_dataframe(x, use_threads = use_threads) : cannot handle Array of type list
```

This seems to be a limitation of Arrow's R library. It needs to be investigated separately as well.

2. While I am touching the type specification code across Arrow optimization, I move the Arrow optimization related tests into a separate file called `test_arrow.R`.

## How was this patch tested?

Tests were added and also manually tested.

Closes #23969 from HyukjinKwon/SPARK-26920.

Authored-by: Hyukjin Kwon <gurwls223@apache.org>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>

2019-03-11 10:23:24 +09:00
inst	[SPARK-26830][SQL][R] Vectorized R dapply() implementation	2019-02-27 14:29:58 +09:00
R	[SPARK-26920][R] Deduplicate type checking across Arrow optimization in SparkR	2019-03-11 10:23:24 +09:00
src-native	[SPARK-6811] Copy SparkR lib in make-distribution.sh	2015-05-23 00:04:01 -07:00
tests	[SPARK-26920][R] Deduplicate type checking across Arrow optimization in SparkR	2019-03-11 10:23:24 +09:00
vignettes	[SPARK-19827][R] spark.ml R API for PIC	2018-12-10 18:28:13 -06:00
.lintr	[SPARK-22063][R] Fixes lint check failures in R by latest commit sha1 ID of lint-r	2017-10-01 18:42:45 +09:00
.Rbuildignore	[SPARK-20877][SPARKR][FOLLOWUP] clean up after test move	2017-06-11 03:00:44 -07:00
DESCRIPTION	[R] update package description	2019-02-21 19:00:36 +08:00
NAMESPACE	[SPARK-24779][R] Add map_concat / map_from_entries / an option in months_between UDF to disable rounding-off	2019-01-31 19:38:32 +08:00