[SPARK-24572][SPARKR] "eager execution" for R shell, IDE

## What changes were proposed in this pull request?

Check the `spark.sql.repl.eagerEval.enabled` configuration property in the SparkDataFrame `show()` method. If the `SparkSession` has eager execution enabled, the data is returned to the R client as soon as the data frame is created. So instead of seeing this
```
> df <- createDataFrame(faithful)
> df
SparkDataFrame[eruptions:double, waiting:double]
```
you will see
```
> df <- createDataFrame(faithful)
> df
+---------+-------+
|eruptions|waiting|
+---------+-------+
|      3.6|   79.0|
|      1.8|   54.0|
|    3.333|   74.0|
|    2.283|   62.0|
|    4.533|   85.0|
|    2.883|   55.0|
|      4.7|   88.0|
|      3.6|   85.0|
|     1.95|   51.0|
|     4.35|   85.0|
|    1.833|   54.0|
|    3.917|   84.0|
|      4.2|   78.0|
|     1.75|   47.0|
|      4.7|   83.0|
|    2.167|   52.0|
|     1.75|   62.0|
|      4.8|   84.0|
|      1.6|   52.0|
|     4.25|   79.0|
+---------+-------+
only showing top 20 rows
```
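
For the output above, the property must be set when the session is created; a minimal sketch (mirroring the new test case; requires a SparkR installation):

```r
# Enable eager execution when starting the SparkR session
sparkR.session(sparkConfig = list(spark.sql.repl.eagerEval.enabled = "true"))

df <- createDataFrame(faithful)
df  # now prints the top rows instead of the schema string
```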

## How was this patch tested?
Manual tests as well as unit tests (one new test case is added).

Author: adrian555 <v2ave10p>

Closes #22455 from adrian555/eager_execution.
Authored by adrian555 on 2018-10-24 23:42:06 -07:00; committed by Felix Cheung
parent 19ada15d1b
commit ddd1b1e8ae
4 changed files with 148 additions and 9 deletions


```diff
@@ -226,7 +226,9 @@ setMethod("showDF",
 #' show
 #'
-#' Print class and type information of a Spark object.
+#' If eager evaluation is enabled and the Spark object is a SparkDataFrame, evaluate the
+#' SparkDataFrame and print top rows of the SparkDataFrame, otherwise, print the class
+#' and type information of the Spark object.
 #'
 #' @param object a Spark object. Can be a SparkDataFrame, Column, GroupedData, WindowSpec.
 #'
```
```diff
@@ -244,11 +246,33 @@ setMethod("showDF",
 #' @note show(SparkDataFrame) since 1.4.0
 setMethod("show", "SparkDataFrame",
           function(object) {
-            cols <- lapply(dtypes(object), function(l) {
-              paste(l, collapse = ":")
-            })
-            s <- paste(cols, collapse = ", ")
-            cat(paste(class(object), "[", s, "]\n", sep = ""))
+            allConf <- sparkR.conf()
+            prop <- allConf[["spark.sql.repl.eagerEval.enabled"]]
+            if (!is.null(prop) && identical(prop, "true")) {
+              argsList <- list()
+              argsList$x <- object
+              prop <- allConf[["spark.sql.repl.eagerEval.maxNumRows"]]
+              if (!is.null(prop)) {
+                numRows <- as.integer(prop)
+                if (numRows > 0) {
+                  argsList$numRows <- numRows
+                }
+              }
+              prop <- allConf[["spark.sql.repl.eagerEval.truncate"]]
+              if (!is.null(prop)) {
+                truncate <- as.integer(prop)
+                if (truncate > 0) {
+                  argsList$truncate <- truncate
+                }
+              }
+              do.call(showDF, argsList)
+            } else {
+              cols <- lapply(dtypes(object), function(l) {
+                paste(l, collapse = ":")
+              })
+              s <- paste(cols, collapse = ", ")
+              cat(paste(class(object), "[", s, "]\n", sep = ""))
+            }
           })

 #' DataTypes
```
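
The `do.call` pattern above forwards `numRows` and `truncate` to `showDF` only when the corresponding properties are set, so `showDF`'s own defaults apply otherwise. A standalone sketch of the same pattern (plain R, no Spark required; `conf` and `render` are hypothetical stand-ins for `sparkR.conf()` and `showDF`):

```r
# Hypothetical config map standing in for the result of sparkR.conf()
conf <- list(spark.sql.repl.eagerEval.maxNumRows = "5")

# Hypothetical stand-in for showDF: defaults apply when an argument is omitted
render <- function(x, numRows = 20, truncate = 20) {
  paste("showing", numRows, "rows, truncating at", truncate, "chars")
}

# Build the argument list dynamically, adding optional args only when configured
argsList <- list(x = "df")
prop <- conf[["spark.sql.repl.eagerEval.maxNumRows"]]
if (!is.null(prop) && as.integer(prop) > 0) {
  argsList$numRows <- as.integer(prop)
}

do.call(render, argsList)  # numRows comes from config; truncate keeps its default of 20
```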


@@ -0,0 +1,72 @@
```r
#
# Licensed to the Apache Software Foundation (ASF) under one or more
# contributor license agreements.  See the NOTICE file distributed with
# this work for additional information regarding copyright ownership.
# The ASF licenses this file to You under the Apache License, Version 2.0
# (the "License"); you may not use this file except in compliance with
# the License.  You may obtain a copy of the License at
#
#    http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
#
library(testthat)

context("test show SparkDataFrame when eager execution is enabled.")

test_that("eager execution is not enabled", {
  # Start Spark session without eager execution enabled
  sparkR.session(master = sparkRTestMaster, enableHiveSupport = FALSE)

  df <- createDataFrame(faithful)
  expect_is(df, "SparkDataFrame")
  expected <- "eruptions:double, waiting:double"
  expect_output(show(df), expected)

  # Stop Spark session
  sparkR.session.stop()
})

test_that("eager execution is enabled", {
  # Start Spark session with eager execution enabled
  sparkConfig <- list(spark.sql.repl.eagerEval.enabled = "true")
  sparkR.session(master = sparkRTestMaster, enableHiveSupport = FALSE, sparkConfig = sparkConfig)

  df <- createDataFrame(faithful)
  expect_is(df, "SparkDataFrame")
  expected <- paste0("(+---------+-------+\n",
                     "|eruptions|waiting|\n",
                     "+---------+-------+\n)*",
                     "(only showing top 20 rows)")
  expect_output(show(df), expected)

  # Stop Spark session
  sparkR.session.stop()
})

test_that("eager execution is enabled with maxNumRows and truncate set", {
  # Start Spark session with eager execution enabled
  sparkConfig <- list(spark.sql.repl.eagerEval.enabled = "true",
                      spark.sql.repl.eagerEval.maxNumRows = as.integer(5),
                      spark.sql.repl.eagerEval.truncate = as.integer(2))
  sparkR.session(master = sparkRTestMaster, enableHiveSupport = FALSE, sparkConfig = sparkConfig)

  df <- arrange(createDataFrame(faithful), "waiting")
  expect_is(df, "SparkDataFrame")
  expected <- paste0("(+---------+-------+\n",
                     "|eruptions|waiting|\n",
                     "+---------+-------+\n",
                     "| 1.| 43|\n)*",
                     "(only showing top 5 rows)")
  expect_output(show(df), expected)

  # Stop Spark session
  sparkR.session.stop()
})
```


@@ -450,6 +450,48 @@ print(model.summaries)
{% endhighlight %}
</div>

### Eager execution

If eager execution is enabled, the data is returned to the R client immediately when the `SparkDataFrame` is created. Eager execution is not enabled by default; it can be enabled by setting the configuration property `spark.sql.repl.eagerEval.enabled` to `true` when the `SparkSession` is started up.

The maximum number of rows and the maximum number of characters per column to display can be controlled with the `spark.sql.repl.eagerEval.maxNumRows` and `spark.sql.repl.eagerEval.truncate` configuration properties, respectively. These properties take effect only when eager execution is enabled. If they are not set explicitly, up to 20 rows and up to 20 characters per column are shown by default.

<div data-lang="r" markdown="1">
{% highlight r %}
# Start up spark session with eager execution enabled
sparkR.session(master = "local[*]",
               sparkConfig = list(spark.sql.repl.eagerEval.enabled = "true",
                                  spark.sql.repl.eagerEval.maxNumRows = as.integer(10)))

# Create a grouped and sorted SparkDataFrame
df <- createDataFrame(faithful)
df2 <- arrange(summarize(groupBy(df, df$waiting), count = n(df$waiting)), "waiting")

# Similar to R data.frame, displays the data returned, instead of SparkDataFrame class string
df2
##+-------+-----+
##|waiting|count|
##+-------+-----+
##|   43.0|    1|
##|   45.0|    3|
##|   46.0|    5|
##|   47.0|    4|
##|   48.0|    3|
##|   49.0|    5|
##|   50.0|    5|
##|   51.0|    6|
##|   52.0|    5|
##|   53.0|    7|
##+-------+-----+
##only showing top 10 rows
{% endhighlight %}
</div>

Note that to enable eager execution in the `sparkR` shell, add the `spark.sql.repl.eagerEval.enabled=true` configuration property to the `--conf` option.

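For example, when launching the shell directly (a sketch; assumes the `sparkR` launcher is on your `PATH` or under `$SPARK_HOME/bin`):

```shell
sparkR --conf spark.sql.repl.eagerEval.enabled=true \
       --conf spark.sql.repl.eagerEval.maxNumRows=10
```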
## Running SQL Queries from SparkR
A SparkDataFrame can also be registered as a temporary view in Spark SQL and that allows you to run SQL queries over its data.
The `sql` function enables applications to run SQL queries programmatically and returns the result as a `SparkDataFrame`.


```diff
@@ -1482,9 +1482,10 @@ object SQLConf {
   val REPL_EAGER_EVAL_ENABLED = buildConf("spark.sql.repl.eagerEval.enabled")
     .doc("Enables eager evaluation or not. When true, the top K rows of Dataset will be " +
       "displayed if and only if the REPL supports the eager evaluation. Currently, the " +
-      "eager evaluation is only supported in PySpark. For the notebooks like Jupyter, " +
-      "the HTML table (generated by _repr_html_) will be returned. For plain Python REPL, " +
-      "the returned outputs are formatted like dataframe.show().")
+      "eager evaluation is supported in PySpark and SparkR. In PySpark, for the notebooks like " +
+      "Jupyter, the HTML table (generated by _repr_html_) will be returned. For plain Python " +
+      "REPL, the returned outputs are formatted like dataframe.show(). In SparkR, the returned " +
+      "outputs are showed similar to R data.frame would.")
     .booleanConf
     .createWithDefault(false)
```
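
As a quick sanity check from an R session, the effective values can be read back with `sparkR.conf` (a sketch; requires an active SparkR session):

```r
# Returns the set value, or the supplied default when the property is unset
sparkR.conf("spark.sql.repl.eagerEval.enabled", "false")
sparkR.conf("spark.sql.repl.eagerEval.maxNumRows", "20")
```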