Commit graph

123 commits

zero323 8a272ddc9d [SPARK-20438][R] SparkR wrappers for split and repeat
## What changes were proposed in this pull request?

Add wrappers for `o.a.s.sql.functions`:

- `split` as `split_string`
- `repeat` as `repeat_string`
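
A minimal usage sketch of the new wrappers (illustrative data; assumes an active SparkR session):

```r
df <- createDataFrame(data.frame(s = "a,b,c", stringsAsFactors = FALSE))
# split_string takes a regex pattern; repeat_string takes a repeat count
head(select(df, split_string(df$s, ","), repeat_string(df$s, 2)))
```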

## How was this patch tested?

Existing tests, additional unit tests, `check-cran.sh`

Author: zero323 <zero323@users.noreply.github.com>

Closes #17729 from zero323/SPARK-20438.
2017-04-24 10:56:57 -07:00
zero323 fd648bff63 [SPARK-20371][R] Add wrappers for collect_list and collect_set
## What changes were proposed in this pull request?

Adds wrappers for `collect_list` and `collect_set`.
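
A minimal sketch (illustrative data; `alias` names the result columns):

```r
df <- createDataFrame(data.frame(k = c("a", "a", "b"), v = c(1, 1, 2)))
head(agg(groupBy(df, "k"),
         alias(collect_list(df$v), "all_vals"),       # keeps duplicates
         alias(collect_set(df$v), "distinct_vals")))  # drops duplicates
```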

## How was this patch tested?

Unit tests, `check-cran.sh`

Author: zero323 <zero323@users.noreply.github.com>

Closes #17672 from zero323/SPARK-20371.
2017-04-21 12:06:21 -07:00
zero323 46c5749768 [SPARK-20375][R] R wrappers for array and map
## What changes were proposed in this pull request?

Adds wrappers for `o.a.s.sql.functions.array` and `o.a.s.sql.functions.map`
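
A hedged sketch, assuming the R-side wrappers are exposed as `create_array` and `create_map` (avoiding a clash with base R's `array`):

```r
df <- createDataFrame(data.frame(a = 1, b = 2))
head(select(df,
            create_array(df$a, df$b),      # array<double> column
            create_map(lit("a"), df$a)))   # map<string,double> column
```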

## How was this patch tested?

Unit tests, `check-cran.sh`

Author: zero323 <zero323@users.noreply.github.com>

Closes #17674 from zero323/SPARK-20375.
2017-04-19 21:19:46 -07:00
Felix Cheung 5a693b4138 [SPARK-20195][SPARKR][SQL] add createTable catalog API and deprecate createExternalTable
## What changes were proposed in this pull request?

Following up on #17483, add createTable (which is new in 2.2.0) and deprecate createExternalTable, plus a number of minor fixes
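
A hedged sketch of the new API (path and options hypothetical):

```r
# new in 2.2.0; createExternalTable now emits a deprecation warning
df <- createTable("people", path = "/tmp/people.json", source = "json")
```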

## How was this patch tested?

manual, unit tests

Author: Felix Cheung <felixcheung_m@hotmail.com>

Closes #17511 from felixcheung/rceatetable.
2017-04-06 09:15:13 -07:00
zero323 b34f7665dd [SPARK-19825][R][ML] spark.ml R API for FPGrowth
## What changes were proposed in this pull request?

Adds SparkR API for FPGrowth: [SPARK-19825](https://issues.apache.org/jira/browse/SPARK-19825):

- `spark.fpGrowth` - model training.
- `freqItemsets` and `associationRules` methods with new corresponding generics.
- Scala helper: `org.apache.spark.ml.r.FPGrowthWrapper`
- unit tests.
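
A minimal usage sketch (transaction data illustrative):

```r
raw <- data.frame(rawItems = c("1,2,5", "1,2,3,5", "1,2"))
df <- selectExpr(createDataFrame(raw), "split(rawItems, ',') AS items")
model <- spark.fpGrowth(df, itemsCol = "items",
                        minSupport = 0.5, minConfidence = 0.6)
head(freqItemsets(model))      # frequent itemsets with their frequency
head(associationRules(model))  # antecedent => consequent rules
```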

## How was this patch tested?

Feature specific unit tests.

Author: zero323 <zero323@users.noreply.github.com>

Closes #17170 from zero323/SPARK-19825.
2017-04-03 23:42:04 -07:00
Felix Cheung 93dbfe705f [SPARK-20159][SPARKR][SQL] Support all catalog API in R
## What changes were proposed in this pull request?

Add a set of catalog API in R

```
"currentDatabase",
"listColumns",
"listDatabases",
"listFunctions",
"listTables",
"recoverPartitions",
"refreshByPath",
"refreshTable",
"setCurrentDatabase",
```
https://github.com/apache/spark/pull/17483/files#diff-6929e6c5e59017ff954e110df20ed7ff
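
A quick sketch of how these read at the R prompt (table names hypothetical):

```r
currentDatabase()             # e.g. "default"
setCurrentDatabase("default")
listTables()                  # returns a SparkDataFrame of tables
listColumns("mytable")
recoverPartitions("mytable")
refreshTable("mytable")
```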

## How was this patch tested?

manual tests, unit tests

Author: Felix Cheung <felixcheung_m@hotmail.com>

Closes #17483 from felixcheung/rcatalog.
2017-04-02 11:59:27 -07:00
Felix Cheung c40597720e [SPARK-20020][SPARKR] DataFrame checkpoint API
## What changes were proposed in this pull request?

Add checkpoint, setCheckpointDir API to R
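
A minimal sketch (checkpoint directory hypothetical):

```r
setCheckpointDir("/tmp/spark-checkpoints")
df <- createDataFrame(faithful)
df2 <- checkpoint(df)   # truncates lineage; materialized to the dir above
```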

## How was this patch tested?

unit tests, manual tests

Author: Felix Cheung <felixcheung_m@hotmail.com>

Closes #17351 from felixcheung/rdfcheckpoint.
2017-03-19 22:34:18 -07:00
Felix Cheung 5c165596da [SPARK-19654][SPARKR][SS] Structured Streaming API for R
## What changes were proposed in this pull request?

Add "experimental" API for SS in R

## How was this patch tested?

manual, unit tests

Author: Felix Cheung <felixcheung_m@hotmail.com>

Closes #16982 from felixcheung/rss.
2017-03-18 16:26:48 -07:00
Felix Cheung 80d5338b32 [SPARK-19795][SPARKR] add column functions to_json, from_json
## What changes were proposed in this pull request?

Add column functions: to_json, from_json, and tests covering error cases.
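
A minimal round-trip sketch (schema and data illustrative):

```r
df <- sql("SELECT named_struct('name', 'Bob') AS s")
j <- select(df, alias(to_json(df$s), "js"))   # struct -> JSON string
schema <- structType(structField("name", "string"))
head(select(j, from_json(j$js, schema)))      # JSON string -> struct
```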

## How was this patch tested?

unit tests, manual

Author: Felix Cheung <felixcheung_m@hotmail.com>

Closes #17134 from felixcheung/rtojson.
2017-03-05 12:37:02 -08:00
Felix Cheung 671bc08ed5 [SPARK-19399][SPARKR] Add R coalesce API for DataFrame and Column
## What changes were proposed in this pull request?

Add coalesce on DataFrame for down partitioning without shuffle and coalesce on Column
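
A sketch of both variants:

```r
df <- createDataFrame(data.frame(a = c(NA, 2), b = c(1, NA)))
df1 <- coalesce(df, 1L)                   # fewer partitions, no shuffle
head(select(df, coalesce(df$a, df$b)))    # first non-null value per row
```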

## How was this patch tested?

manual, unit tests

Author: Felix Cheung <felixcheung_m@hotmail.com>

Closes #16739 from felixcheung/rcoalesce.
2017-02-15 10:45:37 -08:00
wm624@hotmail.com 3973403d5d [SPARK-19456][SPARKR] Add LinearSVC R API
## What changes were proposed in this pull request?

The linear SVM classifier was newly added into ML and a Python API has been added. This JIRA adds the R-side API.

Marked as WIP, as I am designing unit tests.
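
A hedged sketch of the wrapper (parameter values illustrative):

```r
training <- read.df("data/mllib/sample_libsvm_data.txt", source = "libsvm")
model <- spark.svmLinear(training, label ~ features, regParam = 0.01)
summary(model)
head(predict(model, training))
```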

## How was this patch tested?

Author: wm624@hotmail.com <wm624@hotmail.com>

Closes #16800 from wangmiao1981/svc.
2017-02-15 01:15:50 -08:00
anabranch 7a7ce272fe [SPARK-16609] Add to_date/to_timestamp with format functions
## What changes were proposed in this pull request?

This pull request adds two new user facing functions:
- `to_date` which accepts an expression and a format and returns a date.
- `to_timestamp` which accepts an expression and a format and returns a timestamp.

For example, given a date in the format `2016-21-05` (yyyy-dd-MM):

### Date Function
*Previously*
```
to_date(unix_timestamp(lit("2016-21-05"), "yyyy-dd-MM").cast("timestamp"))
```
*Current*
```
to_date(lit("2016-21-05"), "yyyy-dd-MM")
```

### Timestamp Function
*Previously*
```
unix_timestamp(lit("2016-21-05"), "yyyy-dd-MM").cast("timestamp")
```
*Current*
```
to_timestamp(lit("2016-21-05"), "yyyy-dd-MM")
```
### Tasks

- [X] Add `to_date` to Scala Functions
- [x] Add `to_date` to Python Functions
- [x] Add `to_date` to SQL Functions
- [X] Add `to_timestamp` to Scala Functions
- [x] Add `to_timestamp` to Python Functions
- [x] Add `to_timestamp` to SQL Functions
- [x] Add function to R

## How was this patch tested?

- [x] Add Functions to `DateFunctionsSuite`
- Test new `ParseToTimestamp` Expression (*not necessary*)
- Test new `ParseToDate` Expression (*not necessary*)
- [x] Add test for R
- [x] Add test for Python in test.py

Author: anabranch <wac.chambers@gmail.com>
Author: Bill Chambers <bill@databricks.com>
Author: anabranch <bill@databricks.com>

Closes #16138 from anabranch/SPARK-16609.
2017-02-07 15:50:30 +01:00
Felix Cheung 385d73848b [SPARK-19333][SPARKR] Add Apache License headers to R files
## What changes were proposed in this pull request?

add header

## How was this patch tested?

Manual run to check vignettes html is created properly

Author: Felix Cheung <felixcheung_m@hotmail.com>

Closes #16709 from felixcheung/rfilelicense.
2017-01-27 10:31:28 -08:00
Felix Cheung 90817a6cd0 [SPARK-18788][SPARKR] Add API for getNumPartitions
## What changes were proposed in this pull request?

With documentation noting that this converts the DataFrame into an RDD internally
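
A minimal sketch:

```r
df <- createDataFrame(cars)
getNumPartitions(df)   # integer count of the underlying RDD's partitions
```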

## How was this patch tested?

unit tests, manual tests

Author: Felix Cheung <felixcheung_m@hotmail.com>

Closes #16668 from felixcheung/rgetnumpartitions.
2017-01-26 21:06:39 -08:00
wm624@hotmail.com c0ba284300 [SPARK-18821][SPARKR] Bisecting k-means wrapper in SparkR
## What changes were proposed in this pull request?

Add an R wrapper for bisecting k-means.

As JIRA is down, I will update the title to link to the corresponding JIRA later.
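
A hedged usage sketch (hyperparameters illustrative):

```r
df <- createDataFrame(iris)
model <- spark.bisectingKmeans(df, Sepal_Length ~ Sepal_Width, k = 4)
summary(model)
head(predict(model, df))
```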

## How was this patch tested?

Add new unit tests.

Author: wm624@hotmail.com <wm624@hotmail.com>

Closes #16566 from wangmiao1981/bk.
2017-01-26 21:01:59 -08:00
Felix Cheung 17579bda3c [SPARK-18958][SPARKR] R API toJSON on DataFrame
## What changes were proposed in this pull request?

This makes it easier to integrate with other components expecting a row-based JSON format.
This replaces the non-public toJSON RDD API.
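
A minimal sketch, assuming toJSON returns a SparkDataFrame of JSON strings, one per row:

```r
df <- createDataFrame(data.frame(name = "Bob", age = 30))
head(toJSON(df))   # e.g. {"name":"Bob","age":30}
```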

## How was this patch tested?

manual, unit tests

Author: Felix Cheung <felixcheung_m@hotmail.com>

Closes #16368 from felixcheung/rJSON.
2016-12-22 20:54:38 -08:00
Felix Cheung 7e8994ffd3 [SPARK-18903][SPARKR] Add API to get SparkUI URL
## What changes were proposed in this pull request?

API for SparkUI URL from SparkContext
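
A sketch, assuming the API is exposed as `sparkR.uiWebUrl()`:

```r
sparkR.uiWebUrl()
# [1] "http://10.0.2.1:4040"
```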

## How was this patch tested?

manual, unit tests

Author: Felix Cheung <felixcheung_m@hotmail.com>

Closes #16367 from felixcheung/rwebui.
2016-12-21 17:21:17 -08:00
Felix Cheung c3d3a9d0e8 [SPARK-18590][SPARKR] build R source package when making distribution
## What changes were proposed in this pull request?

This PR has 2 key changes. One, we are building a source package (aka bundle package) for SparkR which could be released on CRAN. Two, the official Spark binary distributions should include SparkR installed from this source package instead (which has the help/vignettes rds files needed for those features to work when the SparkR package is loaded in R, whereas the earlier approach with devtools does not).

But, because of various differences in how R performs different tasks, this PR is a fair bit more complicated. More details below.

This PR also includes a few minor fixes.

### more details

These are the additional steps in make-distribution; please see [here](https://github.com/apache/spark/blob/master/R/CRAN_RELEASE.md) on what's going to a CRAN release, which is now run during make-distribution.sh.
1. package needs to be installed because the first code block in vignettes is `library(SparkR)` without lib path
2. `R CMD build` will build vignettes (this process runs Spark/SparkR code and captures outputs into pdf documentation)
3. `R CMD check` on the source package will install package and build vignettes again (this time from source packaged) - this is a key step required to release R package on CRAN
 (tests are skipped here, but they will need to pass for the CRAN release process to succeed - ideally, during release signoff we should install from the R source package and run the tests)
4. `R CMD INSTALL` on the source package (this is the only way to generate doc/vignettes rds files correctly; step 1 does not)
 (the output of this step is what we package into Spark dist and sparkr.zip)

Alternatively, `R CMD build` should already be installing the package in a temp directory; we might be able to find that location and pass it as the `lib.loc` parameter, or we could try calling `R CMD INSTALL --build pkg` instead.
But in any case, despite installing the package multiple times, this is relatively fast.
Building vignettes takes a while though.

## How was this patch tested?

Manually, CI.

Author: Felix Cheung <felixcheung_m@hotmail.com>

Closes #16014 from felixcheung/rdist.
2016-12-08 11:29:31 -08:00
Felix Cheung 55964c15a7 [SPARK-18239][SPARKR] Gradient Boosted Tree for R
## What changes were proposed in this pull request?

Gradient Boosted Tree in R.
With a few minor improvements to RandomForest in R.
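
A hedged usage sketch (hyperparameters illustrative):

```r
df <- read.df("data/mllib/sample_libsvm_data.txt", source = "libsvm")
model <- spark.gbt(df, label ~ features, type = "classification", maxIter = 10)
summary(model)
write.ml(model, "/tmp/gbtModel")   # persist; load back with read.ml
```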

Since this is relatively isolated I'd like to target this for branch-2.1

## How was this patch tested?

manual tests, unit tests

Author: Felix Cheung <felixcheung_m@hotmail.com>

Closes #15746 from felixcheung/rgbt.
2016-11-08 16:00:45 -08:00
Felix Cheung b6879b8b35 [SPARK-16137][SPARKR] randomForest for R
## What changes were proposed in this pull request?

Random Forest Regression and Classification for R
Clean-up/reordering generics.R
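
A hedged sketch of the regression variant:

```r
df <- createDataFrame(longley)
model <- spark.randomForest(df, Employed ~ ., type = "regression", numTrees = 20)
summary(model)
```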

## How was this patch tested?

manual tests, unit tests

Author: Felix Cheung <felixcheung_m@hotmail.com>

Closes #15607 from felixcheung/rrandomforest.
2016-10-30 16:19:19 -07:00
wm624@hotmail.com 29cea8f332 [SPARK-17157][SPARKR] Add multiclass logistic regression SparkR Wrapper
## What changes were proposed in this pull request?

As we discussed in #14818, I added a separate R wrapper spark.logit for logistic regression.

This single interface supports both binary and multinomial logistic regression. It also has "predict" and "summary" for binary logistic regression.
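
A hedged sketch covering the multinomial case (regularization value illustrative):

```r
training <- read.df("data/mllib/sample_multiclass_classification_data.txt",
                    source = "libsvm")
model <- spark.logit(training, label ~ features, regParam = 0.5)
summary(model)
head(predict(model, training))
```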

## How was this patch tested?

New unit tests are added.

Author: wm624@hotmail.com <wm624@hotmail.com>

Closes #15365 from wangmiao1981/glm.
2016-10-26 16:12:55 -07:00
WeichenXu fb0a8a8dd7 [SPARK-17961][SPARKR][SQL] Add storageLevel to DataFrame for SparkR
## What changes were proposed in this pull request?

Add storageLevel to DataFrame for SparkR.
This is similar to this PR: https://github.com/apache/spark/pull/13780,

but in R I do not create a class for `StorageLevel`;
instead I add a method `storageToString`.
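
A minimal sketch of the resulting API:

```r
df <- createDataFrame(faithful)
persist(df, "MEMORY_AND_DISK")
storageLevel(df)   # human-readable string via storageToString
```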

## How was this patch tested?

test added.

Author: WeichenXu <WeichenXu123@outlook.com>

Closes #15516 from WeichenXu123/storageLevel_df_r.
2016-10-26 13:26:43 -07:00
Felix Cheung e21e1c946c [SPARK-18013][SPARKR] add crossJoin API
## What changes were proposed in this pull request?

Add crossJoin and do not default to cross join if joinExpr is left out
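
A minimal sketch:

```r
df1 <- createDataFrame(data.frame(a = 1:2))
df2 <- createDataFrame(data.frame(b = c("x", "y")))
nrow(crossJoin(df1, df2))   # 4: the explicit Cartesian product
# join(df1, df2) no longer defaults to a cross join when joinExpr is omitted
```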

## How was this patch tested?

unit test

Author: Felix Cheung <felixcheung_m@hotmail.com>

Closes #15559 from felixcheung/rcrossjoin.
2016-10-21 12:35:37 -07:00
Felix Cheung 3180272d2d [SPARKR] fix warnings
## What changes were proposed in this pull request?

Fix for a bunch of test warnings that were added recently.
We need to investigate why warnings are not turning into errors.

```
Warnings -----------------------------------------------------------------------
1. createDataFrame uses files for large objects (test_sparkSQL.R#215) - Use Sepal_Length instead of Sepal.Length  as column name

2. createDataFrame uses files for large objects (test_sparkSQL.R#215) - Use Sepal_Width instead of Sepal.Width  as column name

3. createDataFrame uses files for large objects (test_sparkSQL.R#215) - Use Petal_Length instead of Petal.Length  as column name

4. createDataFrame uses files for large objects (test_sparkSQL.R#215) - Use Petal_Width instead of Petal.Width  as column name

Consider adding
  importFrom("utils", "object.size")
to your NAMESPACE file.
```

## How was this patch tested?

unit tests

Author: Felix Cheung <felixcheung_m@hotmail.com>

Closes #15560 from felixcheung/rwarnings.
2016-10-20 21:12:55 -07:00
Yanbo Liang c133907c5d [SPARK-17577][SPARKR][CORE] SparkR support add files to Spark job and get by executors
## What changes were proposed in this pull request?
Scala/Python users can add files to a Spark job via the submit option ```--files``` or ```SparkContext.addFile()```, and retrieve an added file via ```SparkFiles.get(filename)```.
We should also support this for SparkR users, since they have the same need for shared dependency files. For example, SparkR users can first download third-party R packages to the driver, add these files to the Spark job as dependencies through this API, and then each executor can install these packages via ```install.packages```.
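
A hedged sketch of the new R API (file path hypothetical):

```r
spark.addFile("/path/to/mypkg.tar.gz")
# later, on the driver or any executor:
spark.getSparkFiles("mypkg.tar.gz")   # resolves to the local copy
```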

## How was this patch tested?
Add unit test.

Author: Yanbo Liang <ybliang8@gmail.com>

Closes #15131 from yanboliang/spark-17577.
2016-09-21 20:08:28 -07:00
Junyang Qian abb2f92103 [SPARK-17315][SPARKR] Kolmogorov-Smirnov test SparkR wrapper
## What changes were proposed in this pull request?

This PR adds a Kolmogorov-Smirnov test wrapper to SparkR. The wrapper currently supports only the one-sample test against a normal distribution.
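
A minimal sketch (testing 100 standard-normal draws against N(0, 1)):

```r
df <- createDataFrame(data.frame(test = rnorm(100)))
kt <- spark.kstest(df, "test", "norm", c(0, 1))
summary(kt)   # test statistic, p-value, null hypothesis
```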

## How was this patch tested?

R unit test.

Author: Junyang Qian <junyangq@databricks.com>

Closes #14881 from junyangq/SPARK-17315.
2016-09-03 12:26:30 -07:00
Felix Cheung 812333e433 [SPARK-17376][SPARKR] Spark version should be available in R
## What changes were proposed in this pull request?

Add sparkR.version() API.

```
> sparkR.version()
[1] "2.1.0-SNAPSHOT"
```

## How was this patch tested?

manual, unit tests

Author: Felix Cheung <felixcheung_m@hotmail.com>

Closes #14935 from felixcheung/rsparksessionversion.
2016-09-02 10:12:10 -07:00
Shivaram Venkataraman 736a7911cb [SPARK-16581][SPARKR] Make JVM backend calling functions public
## What changes were proposed in this pull request?

This change exposes a public API in SparkR to create objects and call methods on the Spark driver JVM.
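
A minimal sketch of the now-public calls:

```r
arr <- sparkR.newJObject("java.util.ArrayList")
sparkR.callJMethod(arr, "add", 42L)
sparkR.callJMethod(arr, "size")                       # 1
sparkR.callJStatic("java.lang.Math", "max", 1L, 5L)   # 5
```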

## How was this patch tested?

Unit tests, CRAN checks

Author: Shivaram Venkataraman <shivaram@cs.berkeley.edu>

Closes #14775 from shivaram/sparkr-java-api.
2016-08-29 12:55:32 -07:00
Xin Ren 2fbdb60639 [SPARK-16445][MLLIB][SPARKR] Multilayer Perceptron Classifier wrapper in SparkR
https://issues.apache.org/jira/browse/SPARK-16445

## What changes were proposed in this pull request?

Create Multilayer Perceptron Classifier wrapper in SparkR

## How was this patch tested?

Tested manually on local machine

Author: Xin Ren <iamshrek@126.com>

Closes #14447 from keypointt/SPARK-16445.
2016-08-24 11:18:10 -07:00
Felix Cheung 71afeeea4e [SPARK-16508][SPARKR] doc updates and more CRAN check fixes
## What changes were proposed in this pull request?

- replace ``` ` ``` in code doc with `\code{thing}`
- remove the added `...` for drop(DataFrame)
- fix remaining CRAN check warnings

## How was this patch tested?

create doc with knitr

junyangq

Author: Felix Cheung <felixcheung_m@hotmail.com>

Closes #14734 from felixcheung/rdoccleanup.
2016-08-22 15:53:10 -07:00
Junyang Qian acac7a508a [SPARK-16443][SPARKR] Alternating Least Squares (ALS) wrapper
## What changes were proposed in this pull request?

Add Alternating Least Squares wrapper in SparkR. Unit tests have been updated.
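
A hedged usage sketch (ratings data illustrative):

```r
ratings <- createDataFrame(data.frame(
  user = c(0, 0, 1, 2), item = c(0, 1, 1, 2), rating = c(4.0, 2.0, 3.0, 5.0)))
model <- spark.als(ratings, "rating", "user", "item", rank = 10, maxIter = 5)
summary(model)
head(predict(model, ratings))
```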

## How was this patch tested?

SparkR unit tests.

![screen shot 2016-07-27 at 3 50 31 pm](https://cloud.githubusercontent.com/assets/15318264/17195347/f7a6352a-5411-11e6-8e21-61a48070192a.png)
![screen shot 2016-07-27 at 3 50 46 pm](https://cloud.githubusercontent.com/assets/15318264/17195348/f7a7d452-5411-11e6-845f-6d292283bc28.png)

Author: Junyang Qian <junyangq@databricks.com>

Closes #14384 from junyangq/SPARK-16443.
2016-08-19 14:24:09 -07:00
Xusen Yin b72bb62d42 [SPARK-16447][ML][SPARKR] LDA wrapper in SparkR
## What changes were proposed in this pull request?

Add LDA Wrapper in SparkR with the following interfaces:

- spark.lda(data, ...)

- spark.posterior(object, newData, ...)

- spark.perplexity(object, ...)

- summary(object)

- write.ml(object)

- read.ml(path)
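
A hedged sketch tying the interfaces together (hyperparameters illustrative):

```r
df <- read.df("data/mllib/sample_lda_libsvm_data.txt", source = "libsvm")
model <- spark.lda(df, k = 10, maxIter = 20)
summary(model)
spark.perplexity(model, df)
```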

## How was this patch tested?

Test with SparkR unit test.

Author: Xusen Yin <yinxusen@gmail.com>

Closes #14229 from yinxusen/SPARK-16447.
2016-08-18 05:33:52 -07:00
Yanbo Liang 4d92af310a [SPARK-16446][SPARKR][ML] Gaussian Mixture Model wrapper in SparkR
## What changes were proposed in this pull request?
Gaussian Mixture Model wrapper in SparkR, similarly to R's ```mvnormalmixEM```.
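
A minimal usage sketch:

```r
df <- createDataFrame(faithful)
model <- spark.gaussianMixture(df, ~ waiting + eruptions, k = 2)
summary(model)
head(predict(model, df))
```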

## How was this patch tested?
Unit test.

Author: Yanbo Liang <ybliang8@gmail.com>

Closes #14392 from yanboliang/spark-16446.
2016-08-17 11:18:33 -07:00
wm624@hotmail.com 363793f2bf [SPARK-16444][SPARKR] Isotonic Regression wrapper in SparkR
## What changes were proposed in this pull request?

Add Isotonic Regression wrapper in SparkR:

- wrappers in R and Scala
- unit tests
- documentation
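
A minimal usage sketch:

```r
df <- createDataFrame(data.frame(label = c(1.0, 2.0, 3.5),
                                 feature = c(1.0, 2.0, 3.0)))
model <- spark.isoreg(df, label ~ feature, isotonic = TRUE)
head(predict(model, df))
```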

## How was this patch tested?
Manually tested with sudo ./R/run-tests.sh

Author: wm624@hotmail.com <wm624@hotmail.com>

Closes #14182 from wangmiao1981/isoR.
2016-08-17 06:15:04 -07:00
Junyang Qian 214ba66a03 [SPARK-16579][SPARKR] add install.spark function
## What changes were proposed in this pull request?

Add an `install.spark` function to the SparkR package. Users can run `install.spark()` to install Spark to a local directory within R.

Updates:

Several changes have been made:

- `install.spark()`
    - check existence of tar file in the cache folder, and download only if not found
    - look-up order for mirror_url: user-provided -> preferred mirror site from the Apache website -> hardcoded backup option
    - use Spark 2.0.0

- `sparkR.session()`
    - can install spark when not found in `SPARK_HOME`
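
A hedged sketch (mirror URL hypothetical):

```r
install.spark()   # use cached tar file if present, else download
install.spark(mirrorUrl = "http://apache.osuosl.org", overwrite = TRUE)
```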

## How was this patch tested?

Manual tests, running the check-cran.sh script added in #14173.

Author: Junyang Qian <junyangq@databricks.com>

Closes #14258 from junyangq/SPARK-16579.
2016-08-10 11:18:23 -07:00
Shivaram Venkataraman c33e4b0d96 [SPARK-16507][SPARKR] Add a CRAN checker, fix Rd aliases
## What changes were proposed in this pull request?

Add a check-cran.sh script that runs `R CMD check` as CRAN does. Also fixes a number of issues pointed out by the check. These include
- Updating `DESCRIPTION` to be appropriate
- Adding a .Rbuildignore to ignore lintr, src-native, html that are non-standard files / dirs
- Adding aliases to all S4 methods in DataFrame, Column, GroupedData etc.  This is required as stated in https://cran.r-project.org/doc/manuals/r-release/R-exts.html#Documenting-S4-classes-and-methods
- Other minor fixes

## How was this patch tested?

SparkR unit tests, running the above-mentioned script

Author: Shivaram Venkataraman <shivaram@cs.berkeley.edu>

Closes #14173 from shivaram/sparkr-cran-changes.
2016-07-16 17:06:44 -07:00
Sun Rui 093ebbc628 [SPARK-16509][SPARKR] Rename window.partitionBy and window.orderBy to windowPartitionBy and windowOrderBy.
## What changes were proposed in this pull request?
Rename window.partitionBy and window.orderBy to windowPartitionBy and windowOrderBy to pass CRAN package check.
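
A minimal sketch of the renamed API:

```r
df <- createDataFrame(data.frame(dept = c("a", "a", "b"),
                                 salary = c(100, 200, 150)))
ws <- orderBy(windowPartitionBy("dept"), "salary")
head(select(df, df$dept, df$salary, over(rank(), ws)))
```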

## How was this patch tested?
SparkR unit tests.

Author: Sun Rui <sunrui2016@gmail.com>

Closes #14192 from sun-rui/SPARK-16509.
2016-07-14 09:38:42 -07:00
Narine Kokhlikyan 26afb4ce40 [SPARK-16012][SPARKR] Implement gapplyCollect which will apply a R function on each group similar to gapply and collect the result back to R data.frame
## What changes were proposed in this pull request?
gapplyCollect() does gapply() on a SparkDataFrame and collects the result back to R. Compared to gapply() + collect(), gapplyCollect() offers both a performance optimization and programming convenience, as no schema needs to be provided.

This is similar to dapplyCollect().
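
A minimal sketch:

```r
df <- createDataFrame(faithful)
result <- gapplyCollect(df, "waiting", function(key, x) {
  y <- data.frame(key, mean(x$eruptions))
  colnames(y) <- c("waiting", "avg_eruptions")
  y
})
head(result)   # plain R data.frame; no schema argument needed
```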

## How was this patch tested?
Added test cases for gapplyCollect similar to dapplyCollect

Author: Narine Kokhlikyan <narine@slice.com>

Closes #13760 from NarineK/gapplyCollect.
2016-07-01 13:55:13 -07:00
Dongjoon Hyun 46395db80e [SPARK-16289][SQL] Implement posexplode table generating function
## What changes were proposed in this pull request?

This PR implements the `posexplode` table generating function. Currently, the master branch raises the following exception for a `map` argument, which differs from Hive.

**Before**
```scala
scala> sql("select posexplode(map('a', 1, 'b', 2))").show
org.apache.spark.sql.AnalysisException: No handler for Hive UDF ... posexplode() takes an array as a parameter; line 1 pos 7
```

**After**
```scala
scala> sql("select posexplode(map('a', 1, 'b', 2))").show
+---+---+-----+
|pos|key|value|
+---+---+-----+
|  0|  a|    1|
|  1|  b|    2|
+---+---+-----+
```

For an `array` argument, the behavior after is the same as before.
```
scala> sql("select posexplode(array(1, 2, 3))").show
+---+---+
|pos|col|
+---+---+
|  0|  1|
|  1|  2|
|  2|  3|
+---+---+
```

## How was this patch tested?

Pass the Jenkins tests with newly added testcases.

Author: Dongjoon Hyun <dongjoon@apache.org>

Closes #13971 from dongjoon-hyun/SPARK-16289.
2016-06-30 12:03:54 -07:00
Felix Cheung 30b182bcc0 [SPARK-16184][SPARKR] conf API for SparkSession
## What changes were proposed in this pull request?

Add `conf` method to get Runtime Config from SparkSession

## How was this patch tested?

unit tests, manual tests

This is how it works in sparkR shell:
```
 SparkSession available as 'spark'.
> conf()
$hive.metastore.warehouse.dir
[1] "file:/opt/spark-2.0.0-bin-hadoop2.6/R/spark-warehouse"

$spark.app.id
[1] "local-1466749575523"

$spark.app.name
[1] "SparkR"

$spark.driver.host
[1] "10.0.2.1"

$spark.driver.port
[1] "45629"

$spark.executorEnv.LD_LIBRARY_PATH
[1] "$LD_LIBRARY_PATH:/usr/lib/R/lib:/usr/lib/x86_64-linux-gnu:/usr/lib/jvm/default-java/jre/lib/amd64/server"

$spark.executor.id
[1] "driver"

$spark.home
[1] "/opt/spark-2.0.0-bin-hadoop2.6"

$spark.master
[1] "local[*]"

$spark.sql.catalogImplementation
[1] "hive"

$spark.submit.deployMode
[1] "client"

> conf("spark.master")
$spark.master
[1] "local[*]"

```

Author: Felix Cheung <felixcheung_m@hotmail.com>

Closes #13885 from felixcheung/rconf.
2016-06-26 13:10:43 -07:00
Felix Cheung dbfdae4e41 [SPARK-16096][SPARKR] add union and deprecate unionAll
## What changes were proposed in this pull request?

add union and deprecate unionAll, separate roxygen2 doc for rbind (since their usage and parameter lists are quite different)

`explode` is also deprecated; its replacement seems to be a combination of calls, so I'm not sure if we should deprecate it in SparkR yet.
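
A minimal sketch:

```r
df1 <- createDataFrame(data.frame(a = 1))
df2 <- createDataFrame(data.frame(a = 2))
df3 <- union(df1, df2)        # unionAll now emits a deprecation warning
df4 <- rbind(df1, df2, df1)   # rbind accepts any number of SparkDataFrames
```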

## How was this patch tested?

unit tests, manual checks for r doc

Author: Felix Cheung <felixcheung_m@hotmail.com>

Closes #13805 from felixcheung/runion.
2016-06-21 13:36:50 -07:00
Dongjoon Hyun 217db56ba1 [SPARK-15294][R] Add pivot to SparkR
## What changes were proposed in this pull request?

This PR adds the `pivot` function to SparkR for API parity. Since this PR is based on https://github.com/apache/spark/pull/13295, mhnatiuk should be credited for the work he did.
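
A minimal sketch (earnings data illustrative):

```r
df <- createDataFrame(data.frame(
  year = c(2015, 2015, 2016), course = c("dotNET", "Java", "Java"),
  earnings = c(10000, 20000, 30000)))
head(agg(pivot(groupBy(df, "year"), "course"), earnings = "sum"))
```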

## How was this patch tested?

Pass the Jenkins tests (including new testcase.)

Author: Dongjoon Hyun <dongjoon@apache.org>

Closes #13786 from dongjoon-hyun/SPARK-15294.
2016-06-20 21:09:39 -07:00
Dongjoon Hyun b0f2fb5b97 [SPARK-16053][R] Add spark_partition_id in SparkR
## What changes were proposed in this pull request?

This PR adds `spark_partition_id` virtual column function in SparkR for API parity.

The following is just an example to illustrate a SparkR usage on a partitioned parquet table created by `spark.range(10).write.mode("overwrite").parquet("/tmp/t1")`.
```r
> collect(select(read.parquet('/tmp/t1'), c('id', spark_partition_id())))
   id SPARK_PARTITION_ID()
1   3                    0
2   4                    0
3   8                    1
4   9                    1
5   0                    2
6   1                    3
7   2                    4
8   5                    5
9   6                    6
10  7                    7
```

## How was this patch tested?

Pass the Jenkins tests (including new testcase).

Author: Dongjoon Hyun <dongjoon@apache.org>

Closes #13768 from dongjoon-hyun/SPARK-16053.
2016-06-20 13:41:03 -07:00
Dongjoon Hyun c44bf137c7 [SPARK-16051][R] Add read.orc/write.orc to SparkR
## What changes were proposed in this pull request?

This issue adds `read.orc/write.orc` to SparkR for API parity.
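
A minimal round-trip sketch (paths hypothetical):

```r
df <- read.df("examples/src/main/resources/people.json", source = "json")
write.orc(df, "/tmp/people.orc")
head(read.orc("/tmp/people.orc"))
```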

## How was this patch tested?

Pass the Jenkins tests (with new testcases).

Author: Dongjoon Hyun <dongjoon@apache.org>

Closes #13763 from dongjoon-hyun/SPARK-16051.
2016-06-20 11:30:26 -07:00
Felix Cheung 36e812d4b6 [SPARK-16029][SPARKR] SparkR add dropTempView and deprecate dropTempTable
## What changes were proposed in this pull request?

Add dropTempView and deprecate dropTempTable
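
A minimal sketch:

```r
df <- createDataFrame(faithful)
createOrReplaceTempView(df, "faithful_view")
dropTempView("faithful_view")   # replaces the deprecated dropTempTable
```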

## How was this patch tested?

unit tests

shivaram liancheng

Author: Felix Cheung <felixcheung_m@hotmail.com>

Closes #13753 from felixcheung/rdroptempview.
2016-06-20 11:24:41 -07:00
Dongjoon Hyun 9613424898 [SPARK-16059][R] Add monotonically_increasing_id function in SparkR
## What changes were proposed in this pull request?

This PR adds `monotonically_increasing_id` column function in SparkR for API parity.
After this PR, SparkR supports the followings.

```r
> df <- read.json("examples/src/main/resources/people.json")
> collect(select(df, monotonically_increasing_id(), df$name, df$age))
  monotonically_increasing_id()    name age
1                             0 Michael  NA
2                             1    Andy  30
3                             2  Justin  19
```

## How was this patch tested?

Pass the Jenkins tests (with added testcase).

Author: Dongjoon Hyun <dongjoon@apache.org>

Closes #13774 from dongjoon-hyun/SPARK-16059.
2016-06-20 11:12:41 -07:00
Felix Cheung 8c198e246d [SPARK-15159][SPARKR] SparkR SparkSession API
## What changes were proposed in this pull request?

This PR introduces the new SparkSession API for SparkR.
`sparkR.session.getOrCreate()` and `sparkR.session.stop()`

"getOrCreate" is a bit unusual in R but it's important to name this clearly.

SparkR implementation should
- SparkSession is the main entrypoint (vs SparkContext; due to limited functionality supported with SparkContext in SparkR)
- SparkSession replaces SQLContext and HiveContext (both a wrapper around SparkSession, and because of API changes, supporting all 3 would be a lot more work)
- Changes to SparkSession is mostly transparent to users due to SPARK-10903
- Full backward compatibility is expected - users should be able to initialize everything just in Spark 1.6.1 (`sparkR.init()`), but with deprecation warning
- Mostly cosmetic changes to parameter list - users should be able to move to `sparkR.session.getOrCreate()` easily
- An advanced syntax with named parameters (aka varargs aka "...") is supported; that should be closer to the Builder syntax that is in Scala/Python (which unfortunately does not work in R because it would look like this: `enableHiveSupport(config(config(master(appName(builder(), "foo"), "local"), "first", "value"), "next", "value"))`)
- Updating config on an existing SparkSession is supported, the behavior is the same as Python, in which config is applied to both SparkContext and SparkSession
- Some SparkSession changes are not matched in SparkR, mostly because it would be breaking API change: `catalog` object, `createOrReplaceTempView`
- Other SQLContext workarounds are replicated in SparkR, eg. `tables`, `tableNames`
- `sparkR` shell is updated to use the SparkSession entrypoint (`sqlContext` is removed, just like with Scala/Python)
- All tests are updated to use the SparkSession entrypoint
- A bug in `read.jdbc` is fixed
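
A hedged sketch of the entrypoint (which ultimately shipped spelled `sparkR.session()`):

```r
sparkR.session(master = "local[*]", appName = "example",
               sparkConfig = list(spark.driver.memory = "1g"))
df <- createDataFrame(faithful)   # no sqlContext argument needed anymore
sparkR.session.stop()
```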

TODO
- [x] Add more tests
- [ ] Separate PR - update all roxygen2 doc coding example
- [ ] Separate PR - update SparkR programming guide

## How was this patch tested?

unit tests, manual tests

shivaram sun-rui rxin

Author: Felix Cheung <felixcheung_m@hotmail.com>
Author: felixcheung <felixcheung_m@hotmail.com>

Closes #13635 from felixcheung/rsparksession.
2016-06-17 21:36:01 -07:00
Dongjoon Hyun 7d65a0db4a [SPARK-16005][R] Add randomSplit to SparkR
## What changes were proposed in this pull request?

This PR adds `randomSplit` to SparkR for API parity.
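
A minimal sketch:

```r
df <- createDataFrame(data.frame(id = 1:1000))
splits <- randomSplit(df, c(0.7, 0.3), seed = 42)
train <- splits[[1]]
test  <- splits[[2]]
```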

## How was this patch tested?

Pass the Jenkins tests (with new testcase.)

Author: Dongjoon Hyun <dongjoon@apache.org>

Closes #13721 from dongjoon-hyun/SPARK-16005.
2016-06-17 16:07:33 -07:00
Felix Cheung ef3cc4fc09 [SPARK-15925][SPARKR] R DataFrame add back registerTempTable, add tests
## What changes were proposed in this pull request?

Add registerTempTable back to DataFrame, with a deprecation warning
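
A minimal sketch of the deprecated and preferred forms:

```r
df <- createDataFrame(faithful)
registerTempTable(df, "faithful_tbl")        # works, but warns
createOrReplaceTempView(df, "faithful_tbl")  # preferred
```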

## How was this patch tested?

unit tests
shivaram liancheng

Author: Felix Cheung <felixcheung_m@hotmail.com>

Closes #13722 from felixcheung/rregistertemptable.
2016-06-17 15:56:03 -07:00
Narine Kokhlikyan 7c6c692637 [SPARK-12922][SPARKR][WIP] Implement gapply() on DataFrame in SparkR
## What changes were proposed in this pull request?

gapply() applies an R function on groups grouped by one or more columns of a DataFrame, and returns a DataFrame. It is like GroupedDataSet.flatMapGroups() in the Dataset API.
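
A minimal sketch (the schema describes the output rows):

```r
df <- createDataFrame(faithful)
schema <- structType(structField("waiting", "double"),
                     structField("avg_eruptions", "double"))
result <- gapply(df, "waiting", function(key, x) {
  data.frame(key, mean(x$eruptions))
}, schema)
head(collect(result))
```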

Please let me know what you think and whether you have any ideas to improve it.

Thank you!

## How was this patch tested?
Unit tests.
1. Primitive test with different column types
2. Add a boolean column
3. Compute average by a group

Author: Narine Kokhlikyan <narine.kokhlikyan@gmail.com>
Author: NarineK <narine.kokhlikyan@us.ibm.com>

Closes #12836 from NarineK/gapply2.
2016-06-15 21:42:05 -07:00