# R on Spark

SparkR is an R package that provides a light-weight frontend to use Spark from R.

### Installing sparkR

Libraries of sparkR need to be created in `$SPARK_HOME/R/lib`. This can be done by running the script `$SPARK_HOME/R/install-dev.sh`.
By default, the above script uses the system-wide installation of R. However, this can be changed to any user-installed location of R by setting the environment variable `R_HOME` to the full path of the base directory where R is installed, before running the install-dev.sh script.
Example:
```bash
# where /home/username/R is where R is installed and /home/username/R/bin contains the files R and Rscript
export R_HOME=/home/username/R
./install-dev.sh
```
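To double-check that the package landed where expected, a minimal sanity check could look like the sketch below; it assumes `SPARK_HOME` is set to the root of your Spark checkout (the script itself does not require this variable).

```bash
# install-dev.sh installs the SparkR package into $SPARK_HOME/R/lib;
# SPARK_HOME is assumed to point at the root of the Spark source tree.
ls "$SPARK_HOME/R/lib/SparkR"

# Load the freshly built package from that library path and print its version.
R -e 'library(SparkR, lib.loc = file.path(Sys.getenv("SPARK_HOME"), "R", "lib")); packageVersion("SparkR")'
```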
### SparkR development

#### Build Spark

Build Spark with [Maven](https://spark.apache.org/docs/latest/building-spark.html#buildmvn) or [SBT](https://spark.apache.org/docs/latest/building-spark.html#building-with-sbt), and include the `-Psparkr` profile to build the R package. For example, to use the default Hadoop versions you can run:
```bash
# Maven
./build/mvn -DskipTests -Psparkr package

# SBT
./build/sbt -Psparkr package
```
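The `-Psparkr` profile can be combined with other optional build profiles in the same invocation. The example below adds Hive support; treat it as a sketch, since the set of available profiles depends on your Spark version.

```bash
# Maven build with both the SparkR and Hive profiles enabled
# (profile availability may vary across Spark versions).
./build/mvn -DskipTests -Psparkr -Phive package
```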
#### Running sparkR

You can start using SparkR by launching the SparkR shell with

    ./bin/sparkR

The `sparkR` script automatically creates a SparkContext with Spark by default in
local mode. To specify the Spark master of a cluster for the automatically created
SparkContext, you can run

    ./bin/sparkR --master "local[2]"

To set other options like driver memory, executor memory, etc., you can pass in the [spark-submit](https://spark.apache.org/docs/latest/submitting-applications.html) arguments to `./bin/sparkR`.
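For example, the following launches the shell with two local cores and a larger driver heap; the values are only illustrative.

```bash
# Pass spark-submit options (here: master URL and driver memory) through to the SparkR shell.
./bin/sparkR --master "local[2]" --driver-memory 2g
```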
#### Using SparkR from RStudio

If you wish to use SparkR from RStudio, please refer to the [SparkR documentation](https://spark.apache.org/docs/latest/sparkr.html#starting-up-from-rstudio).
#### Making changes to SparkR

The [instructions](https://spark.apache.org/contributing.html) for making contributions to Spark also apply to SparkR.
If you only make R file changes (i.e., no Scala changes), then you can just re-install the R package using `R/install-dev.sh` and test your changes.
Once you have made your changes, please include unit tests for them and run the existing unit tests using the `R/run-tests.sh` script as described below.
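A typical R-only iteration therefore looks roughly like the sketch below; it assumes Spark was already built once with `-Psparkr` and that the R test dependencies (such as `testthat`) are installed.

```bash
# Re-install the SparkR package after editing the R sources.
R/install-dev.sh

# Run the existing SparkR unit tests against the locally built Spark.
R/run-tests.sh
```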
#### Generating documentation

The SparkR documentation (Rd files and HTML files) is not a part of the source repository. To generate it, you can run the script `R/create-docs.sh`. This script uses `devtools` and `knitr` to generate the docs, and these packages need to be installed on the machine before using the script. Also, you may need to install these [prerequisites](https://github.com/apache/spark/tree/master/docs#prerequisites). See also `R/DOCUMENTATION.md`.
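As a rough sketch, installing the two R packages and then generating the docs could look like this (the CRAN mirror URL is just an example):

```bash
# Install the R packages that create-docs.sh depends on (example CRAN mirror).
Rscript -e 'install.packages(c("devtools", "knitr"), repos = "https://cloud.r-project.org")'

# Generate the Rd files and HTML documentation.
R/create-docs.sh
```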
### Examples, Unit tests

SparkR comes with several sample programs in the `examples/src/main/r` directory.
To run one of them, use `./bin/spark-submit <filename> <args>`. For example:

```bash
./bin/spark-submit examples/src/main/r/dataframe.R
```
You can run R unit tests by following the instructions under [Running R Tests](https://spark.apache.org/docs/latest/building-spark.html#running-r-tests).

### Running on YARN

The `./bin/spark-submit` script can also be used to submit jobs to YARN clusters. You will need to set the YARN conf dir before doing so. For example, on CDH you can run:
```bash
export YARN_CONF_DIR=/etc/hadoop/conf
./bin/spark-submit --master yarn examples/src/main/r/dataframe.R
```
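Additional spark-submit options can be passed in the same way; for example (resource values here are purely illustrative):

```bash
# Ask YARN for two executors with 2g of memory each (illustrative values).
./bin/spark-submit --master yarn --num-executors 2 --executor-memory 2g \
  examples/src/main/r/dataframe.R
```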