spark-instrumented-optimizer

History

hyukjinkwon e8982ca7ad [SPARK-25981][R] Enables Arrow optimization from R DataFrame to Spark DataFrame ## What changes were proposed in this pull request? This PR targets to support Arrow optimization for conversion from R DataFrame to Spark DataFrame. Like PySpark side, it falls back to non-optimization code path when it's unable to use Arrow optimization. This can be tested as below: ```bash $ ./bin/sparkR --conf spark.sql.execution.arrow.enabled=true ``` ```r collect(createDataFrame(mtcars)) ``` ### Requirements - R 3.5.x - Arrow package 0.12+ ```bash Rscript -e 'remotes::install_github("apache/arrowapache-arrow-0.12.0", subdir = "r")' ``` Note: currently, Arrow R package is not in CRAN. Please take a look at ARROW-3204. Note: currently, Arrow R package seems not supporting Windows. Please take a look at ARROW-3204. ### Benchmarks Shall ```bash sync && sudo purge ./bin/sparkR --conf spark.sql.execution.arrow.enabled=false ``` ```bash sync && sudo purge ./bin/sparkR --conf spark.sql.execution.arrow.enabled=true ``` R code ```r createDataFrame(mtcars) # Initializes rdf <- read.csv("500000.csv") test <- function() { options(digits.secs = 6) # milliseconds start.time <- Sys.time() createDataFrame(rdf) end.time <- Sys.time() time.taken <- end.time - start.time print(time.taken) } test() ``` Data (350 MB): ```r object.size(read.csv("500000.csv")) 350379504 bytes ``` "500000 Records" http://eforexcel.com/wp/downloads-16-sample-csv-files-data-sets-for-testing/ Results ``` Time difference of 29.9468 secs ``` ``` Time difference of 3.222129 secs ``` The performance improvement was around 950%. Actually, this PR improves around 1200%+ because this PR includes a small optimization about regular R DataFrame -> Spark DatFrame. See https://github.com/apache/spark/pull/22954#discussion_r231847272 ### Limitations: For now, Arrow optimization with R does not support when the data is `raw`, and when user explicitly gives float type in the schema. They produce corrupt values. In this case, we decide to fall back to non-optimization code path. ## How was this patch tested? Small test was added. I manually forced to set this optimization `true` for _all_ R tests and they were _all_ passed (with few of fallback warnings). TODOs: - [x] Draft codes - [x] make the tests passed - [x] make the CRAN check pass - [x] Performance measurement - [x] Supportability investigation (for instance types) - [x] Wait for Arrow 0.12.0 release - [x] Fix and match it to Arrow 0.12.0 Closes #22954 from HyukjinKwon/r-arrow-createdataframe. Lead-authored-by: hyukjinkwon <gurwls223@apache.org> Co-authored-by: Hyukjin Kwon <gurwls223@apache.org> Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>		2019-01-27 10:45:49 +08:00
..
inst	[MINOR][R] Fix indents of sparkR welcome message to be consistent with pyspark and spark-shell	2018-12-13 20:05:49 +08:00
R	[SPARK-25981][R] Enables Arrow optimization from R DataFrame to Spark DataFrame	2019-01-27 10:45:49 +08:00
src-native	[SPARK-6811] Copy SparkR lib in make-distribution.sh	2015-05-23 00:04:01 -07:00
tests	[SPARK-25981][R] Enables Arrow optimization from R DataFrame to Spark DataFrame	2019-01-27 10:45:49 +08:00
vignettes	[SPARK-19827][R] spark.ml R API for PIC	2018-12-10 18:28:13 -06:00
.lintr	[SPARK-22063][R] Fixes lint check failures in R by latest commit sha1 ID of lint-r	2017-10-01 18:42:45 +09:00
.Rbuildignore	[SPARK-20877][SPARKR][FOLLOWUP] clean up after test move	2017-06-11 03:00:44 -07:00
DESCRIPTION	[SPARK-26014][R] Deprecate R prior to version 3.4 in SparkR	2018-11-15 17:20:49 +08:00
NAMESPACE	[SPARK-19827][R] spark.ml R API for PIC	2018-12-10 18:28:13 -06:00