[SPARK-8506] Add packages to R context created through init.
Author: Holden Karau <holden@pigscanfly.ca>

Closes #6928 from holdenk/SPARK-8506-sparkr-does-not-provide-an-easy-way-to-depend-on-spark-packages-when-performing-init-from-inside-of-r and squashes the following commits:

b60dd63 [Holden Karau] Add an example with the spark-csv package
fa8bc92 [Holden Karau] typo: sparm -> spark
865a90c [Holden Karau] strip spaces for comparision
c7a4471 [Holden Karau] Add some documentation
c1a9233 [Holden Karau] refactor for testing
c818556 [Holden Karau] Add pakages to R
Commit 43e66192f4 (parent 1173483f3f)
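In short, the change lets packages from spark-packages.org be requested when the SparkR context is created, instead of only through spark-submit flags. A minimal usage sketch (the master value and package coordinate below are placeholders for illustration, not part of this commit):

    library(SparkR)
    # sparkPackages is the new argument introduced by this commit;
    # the coordinate is only an example package.
    sc <- sparkR.init(master = "local[2]", appName = "example",
                      sparkPackages = "com.databricks:spark-csv_2.11:1.0.3")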
@@ -34,24 +34,36 @@ connectBackend <- function(hostname, port, timeout = 6000) {
   con
 }
 
-launchBackend <- function(args, sparkHome, jars, sparkSubmitOpts) {
+determineSparkSubmitBin <- function() {
   if (.Platform$OS.type == "unix") {
     sparkSubmitBinName = "spark-submit"
   } else {
     sparkSubmitBinName = "spark-submit.cmd"
   }
+  sparkSubmitBinName
+}
+
+generateSparkSubmitArgs <- function(args, sparkHome, jars, sparkSubmitOpts, packages) {
+  if (jars != "") {
+    jars <- paste("--jars", jars)
+  }
+
+  if (packages != "") {
+    packages <- paste("--packages", packages)
+  }
+
+  combinedArgs <- paste(jars, packages, sparkSubmitOpts, args, sep = " ")
+  combinedArgs
+}
+
+launchBackend <- function(args, sparkHome, jars, sparkSubmitOpts, packages) {
+  sparkSubmitBinName <- determineSparkSubmitBin()
   if (sparkHome != "") {
     sparkSubmitBin <- file.path(sparkHome, "bin", sparkSubmitBinName)
   } else {
     sparkSubmitBin <- sparkSubmitBinName
   }
-  if (jars != "") {
-    jars <- paste("--jars", jars)
-  }
-
-  combinedArgs <- paste(jars, sparkSubmitOpts, args, sep = " ")
+  combinedArgs <- generateSparkSubmitArgs(args, sparkHome, jars, sparkSubmitOpts, packages)
   cat("Launching java with spark-submit command", sparkSubmitBin, combinedArgs, "\n")
   invisible(system2(sparkSubmitBin, combinedArgs, wait = F))
 }
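As a sanity check on the refactor, a rough sketch of what generateSparkSubmitArgs produces once the helpers above are defined; the argument values here are made up for illustration:

    # With both jars and packages supplied, the flags are concatenated
    # in the order jars, packages, sparkSubmitOpts, args.
    generateSparkSubmitArgs(args = "script.R",
                            sparkHome = "",
                            jars = "extra.jar",
                            sparkSubmitOpts = "sparkr-shell",
                            packages = "com.databricks:spark-csv_2.11:1.0.3")
    # "--jars extra.jar --packages com.databricks:spark-csv_2.11:1.0.3 sparkr-shell script.R"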
@@ -81,6 +81,7 @@ sparkR.stop <- function() {
 #' @param sparkExecutorEnv Named list of environment variables to be used when launching executors.
 #' @param sparkJars Character string vector of jar files to pass to the worker nodes.
 #' @param sparkRLibDir The path where R is installed on the worker nodes.
+#' @param sparkPackages Character string vector of packages from spark-packages.org
 #' @export
 #' @examples
 #'\dontrun{
@@ -100,7 +101,8 @@ sparkR.init <- function(
   sparkEnvir = list(),
   sparkExecutorEnv = list(),
   sparkJars = "",
-  sparkRLibDir = "") {
+  sparkRLibDir = "",
+  sparkPackages = "") {
 
   if (exists(".sparkRjsc", envir = .sparkREnv)) {
     cat("Re-using existing Spark Context. Please stop SparkR with sparkR.stop() or restart R to create a new Spark Context\n")
@@ -129,7 +131,8 @@ sparkR.init <- function(
     args = path,
     sparkHome = sparkHome,
     jars = jars,
-    sparkSubmitOpts = Sys.getenv("SPARKR_SUBMIT_ARGS", "sparkr-shell"))
+    sparkSubmitOpts = Sys.getenv("SPARKR_SUBMIT_ARGS", "sparkr-shell"),
+    packages = sparkPackages)
   # wait atmost 100 seconds for JVM to launch
   wait <- 0.1
   for (i in 1:25) {
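For context, a sketch of how the new argument flows through launchBackend together with SPARKR_SUBMIT_ARGS; the environment-variable value and coordinate shown are assumptions, not defaults set by this commit:

    library(SparkR)
    Sys.setenv(SPARKR_SUBMIT_ARGS = "--driver-memory 2g sparkr-shell")
    sc <- sparkR.init(sparkPackages = "com.databricks:spark-csv_2.11:1.0.3")
    # launchBackend then invokes spark-submit with roughly:
    #   --packages com.databricks:spark-csv_2.11:1.0.3 --driver-memory 2g sparkr-shell <shell script path>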
R/pkg/inst/tests/test_client.R (new file, 32 lines)
@@ -0,0 +1,32 @@
+#
+# Licensed to the Apache Software Foundation (ASF) under one or more
+# contributor license agreements.  See the NOTICE file distributed with
+# this work for additional information regarding copyright ownership.
+# The ASF licenses this file to You under the Apache License, Version 2.0
+# (the "License"); you may not use this file except in compliance with
+# the License.  You may obtain a copy of the License at
+#
+#    http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+#
+
+context("functions in client.R")
+
+test_that("adding spark-testing-base as a package works", {
+  args <- generateSparkSubmitArgs("", "", "", "",
+                                  "holdenk:spark-testing-base:1.3.0_0.0.5")
+  expect_equal(gsub("[[:space:]]", "", args),
+               gsub("[[:space:]]", "",
+                    "--packages holdenk:spark-testing-base:1.3.0_0.0.5"))
+})
+
+test_that("no package specified doesn't add packages flag", {
+  args <- generateSparkSubmitArgs("", "", "", "", "")
+  expect_equal(gsub("[[:space:]]", "", args),
+               "")
+})
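The same expectations can be poked at interactively outside the test harness; a sketch assuming an installed SparkR build, using `:::` on the assumption that the helper is not exported:

    library(SparkR)
    # With nothing supplied, only separator whitespace remains, so no flag is emitted.
    args <- SparkR:::generateSparkSubmitArgs("", "", "", "", "")
    identical(gsub("[[:space:]]", "", args), "")  # TRUE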
@@ -27,9 +27,9 @@ All of the examples on this page use sample data included in R or the Spark dist
 <div data-lang="r" markdown="1">
 The entry point into SparkR is the `SparkContext` which connects your R program to a Spark cluster.
 You can create a `SparkContext` using `sparkR.init` and pass in options such as the application name
-etc. Further, to work with DataFrames we will need a `SQLContext`, which can be created from the
-SparkContext. If you are working from the SparkR shell, the `SQLContext` and `SparkContext` should
-already be created for you.
+, any spark packages depended on, etc. Further, to work with DataFrames we will need a `SQLContext`,
+which can be created from the SparkContext. If you are working from the SparkR shell, the
+`SQLContext` and `SparkContext` should already be created for you.
 
 {% highlight r %}
 sc <- sparkR.init()
@@ -62,7 +62,16 @@ head(df)
 
 SparkR supports operating on a variety of data sources through the `DataFrame` interface. This section describes the general methods for loading and saving data using Data Sources. You can check the Spark SQL programming guide for more [specific options](sql-programming-guide.html#manually-specifying-options) that are available for the built-in data sources.
 
-The general method for creating DataFrames from data sources is `read.df`. This method takes in the `SQLContext`, the path for the file to load and the type of data source. SparkR supports reading JSON and Parquet files natively and through [Spark Packages](http://spark-packages.org/) you can find data source connectors for popular file formats like [CSV](http://spark-packages.org/package/databricks/spark-csv) and [Avro](http://spark-packages.org/package/databricks/spark-avro).
+The general method for creating DataFrames from data sources is `read.df`. This method takes in the `SQLContext`, the path for the file to load and the type of data source. SparkR supports reading JSON and Parquet files natively and through [Spark Packages](http://spark-packages.org/) you can find data source connectors for popular file formats like [CSV](http://spark-packages.org/package/databricks/spark-csv) and [Avro](http://spark-packages.org/package/databricks/spark-avro). These packages can either be added by
+specifying `--packages` with `spark-submit` or `sparkR` commands, or when creating a context through `init`
+you can specify the packages with the `sparkPackages` argument.
+
+<div data-lang="r" markdown="1">
+{% highlight r %}
+sc <- sparkR.init(sparkPackages = "com.databricks:spark-csv_2.11:1.0.3")
+sqlContext <- sparkRSQL.init(sc)
+{% endhighlight %}
+</div>
 
 We can see how to use data sources using an example JSON input file. Note that the file that is used here is _not_ a typical JSON file. Each line in the file must contain a separate, self-contained valid JSON object. As a consequence, a regular multi-line JSON file will most often fail.
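Building on the documentation example above, a hedged sketch of actually reading a CSV file through the spark-csv source; the file path and the `header` option are assumptions based on that package's documentation, not part of this commit:

    # Assumes sc was created with the spark-csv package, as in the guide snippet above.
    sqlContext <- sparkRSQL.init(sc)
    df <- read.df(sqlContext, "path/to/people.csv",
                  source = "com.databricks.spark.csv", header = "true")
    head(df)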