spark-instrumented-optimizer/R/README.md
Shubhanshu Mishra d7415991a1 [SPARK-12910] Fixes : R version for installing sparkR
Author: Shubhanshu Mishra <smishra8@illinois.edu>

Closes #10836 from napsternxg/master.
2016-01-20 18:06:06 -08:00

# R on Spark
SparkR is an R package that provides a light-weight frontend to use Spark from R.
### Installing sparkR
The SparkR libraries need to be created in `$SPARK_HOME/R/lib`. This can be done by running the script `$SPARK_HOME/R/install-dev.sh`.
By default, the script uses the system-wide installation of R. To use a user-installed copy of R instead, set the environment variable `R_HOME` to the full path of the base directory where R is installed before running the `install-dev.sh` script.
Example:
```
# /home/username/R is where R is installed; /home/username/R/bin contains the files R and Rscript
export R_HOME=/home/username/R
./install-dev.sh
```
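Before pointing the install script at a custom R, it can help to confirm that the directory really looks like an R installation. The following is a minimal sketch and not part of SparkR: `check_r_home` is a hypothetical helper that only verifies that `bin/R` and `bin/Rscript` exist and are executable under the given directory.

```shell
# Hypothetical helper (not shipped with SparkR): succeed only if DIR/bin
# contains executable files named R and Rscript.
check_r_home() {
  dir="$1"
  for exe in R Rscript; do
    if [ ! -x "$dir/bin/$exe" ]; then
      echo "missing or not executable: $dir/bin/$exe" >&2
      return 1
    fi
  done
  return 0
}

# Usage (assumes the current directory is $SPARK_HOME/R):
#   check_r_home /home/username/R && R_HOME=/home/username/R ./install-dev.sh
```

The check is deliberately shallow: it does not inspect the R version, so a version mismatch is still reported by the install script itself.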
### SparkR development
#### Build Spark
Build Spark with [Maven](http://spark.apache.org/docs/latest/building-spark.html#building-with-buildmvn) and include the `-Psparkr` profile to build the R package. For example, to use the default Hadoop versions, you can run:
```
build/mvn -DskipTests -Psparkr package
```
#### Running sparkR
You can start using SparkR by launching the SparkR shell with

    ./bin/sparkR

The `sparkR` script automatically creates a SparkContext, with Spark running in local mode by default. To specify the Spark master of a cluster for the automatically created SparkContext, you can run

    ./bin/sparkR --master "local[2]"

To set other options like driver memory or executor memory, you can pass [spark-submit](http://spark.apache.org/docs/latest/submitting-applications.html) arguments to `./bin/sparkR`.
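
As an illustration of passing `spark-submit` options through, the sketch below assembles a `sparkR` invocation with the `--driver-memory` and `--executor-memory` options. The `sparkr_cmd` helper and the `2g` values are assumptions made for this example. It prints the command rather than running it, so the result can be inspected first.

```shell
# Hypothetical helper: print a sparkR command line that forwards
# spark-submit memory options. Nothing is executed.
sparkr_cmd() {
  master="$1"
  driver_mem="$2"
  executor_mem="$3"
  printf '%s\n' "./bin/sparkR --master \"$master\" --driver-memory $driver_mem --executor-memory $executor_mem"
}

# Example values only; adjust them to the machine or cluster in use.
sparkr_cmd "local[2]" 2g 2g
```

Run the printed command from the Spark installation directory once the options look right.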
#### Using SparkR from RStudio
If you wish to use SparkR from RStudio or other R frontends, you will need to set some environment variables that point SparkR to your Spark installation. For example:
```
# Set this to where Spark is installed
Sys.setenv(SPARK_HOME="/Users/shivaram/spark")
# This line loads SparkR from the installed directory
.libPaths(c(file.path(Sys.getenv("SPARK_HOME"), "R", "lib"), .libPaths()))
library(SparkR)
sc <- sparkR.init(master="local")
```
#### Making changes to SparkR
The [instructions](https://cwiki.apache.org/confluence/display/SPARK/Contributing+to+Spark) for making contributions to Spark also apply to SparkR.
If you only make R file changes (i.e. no Scala changes) then you can just re-install the R package using `R/install-dev.sh` and test your changes.
Once you have made your changes, please include unit tests for them and run existing unit tests using the `run-tests.sh` script as described below.
#### Generating documentation
The SparkR documentation (Rd files and HTML files) is not part of the source repository. To generate it, run the script `R/create-docs.sh`. This script uses `devtools` and `knitr` to generate the docs, so both packages need to be installed on the machine before you run it.
### Examples, Unit tests
SparkR comes with several sample programs in the `examples/src/main/r` directory.
To run one of them, use `./bin/sparkR <filename> <args>`. For example:

    ./bin/sparkR examples/src/main/r/dataframe.R

You can also run the unit tests for SparkR with the commands below (you need to install the [testthat](http://cran.r-project.org/web/packages/testthat/index.html) package first):

    R -e 'install.packages("testthat", repos="http://cran.us.r-project.org")'
    ./R/run-tests.sh

### Running on YARN
The `./bin/spark-submit` and `./bin/sparkR` scripts can also be used to submit jobs to YARN clusters. You will need to set the YARN configuration directory before doing so. For example, on CDH you can run:
```
export YARN_CONF_DIR=/etc/hadoop/conf
./bin/spark-submit --master yarn examples/src/main/r/dataframe.R
```
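A job submitted with a missing or mistyped `YARN_CONF_DIR` can fail in ways that are hard to read, so a small pre-flight check may be useful. This is a sketch under stated assumptions: `check_yarn_conf` is a hypothetical helper, not something Spark provides, and it only confirms that the variable is set and names an existing directory.

```shell
# Hypothetical pre-flight check: YARN_CONF_DIR must be set and must
# point to an existing directory. Its contents are not validated.
check_yarn_conf() {
  if [ -z "${YARN_CONF_DIR:-}" ]; then
    echo "YARN_CONF_DIR is not set" >&2
    return 1
  fi
  if [ ! -d "$YARN_CONF_DIR" ]; then
    echo "YARN_CONF_DIR is not a directory: $YARN_CONF_DIR" >&2
    return 1
  fi
  return 0
}

# Usage:
#   export YARN_CONF_DIR=/etc/hadoop/conf
#   check_yarn_conf && ./bin/spark-submit --master yarn examples/src/main/r/dataframe.R
```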