Commit graph

522 commits

Author SHA1 Message Date
Felix Cheung d749c06677 [SPARK-19130][SPARKR] Support setting literal value as column implicitly
## What changes were proposed in this pull request?

```
df$foo <- 1
```

instead of
```
df$foo <- lit(1)
```

## How was this patch tested?

unit tests

Author: Felix Cheung <felixcheung_m@hotmail.com>

Closes #16510 from felixcheung/rlitcol.
2017-01-11 08:29:09 -08:00
Felix Cheung 9bc3507e41 [SPARK-19133][SPARKR][ML] fix glm for Gamma, clarify glm family supported
## What changes were proposed in this pull request?

R supports a longer list of families than Spark does; this clarifies which families are supported and fixes `glm` for the Gamma family.
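
A minimal sketch (hypothetical data and settings) of fitting one of the families Spark does support:

```r
df <- suppressWarnings(createDataFrame(iris))
# Gamma regression on a strictly positive response; the family argument takes
# R family objects, and Gamma is among those this fix covers
model <- spark.glm(df, Sepal_Length ~ Sepal_Width, family = Gamma(link = "inverse"))
summary(model)
```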

## How was this patch tested?

manual

Author: Felix Cheung <felixcheung_m@hotmail.com>

Closes #16511 from felixcheung/rdocglmfamily.
2017-01-10 11:42:07 -08:00
anabranch 19d9d4c855 [SPARK-19126][DOCS] Update Join Documentation Across Languages
## What changes were proposed in this pull request?

- [X] Make sure all join types are clearly mentioned
- [X] Make join labeling/style consistent
- [X] Make join label ordering docs the same
- [X] Improve join documentation according to above for Scala
- [X] Improve join documentation according to above for Python
- [X] Improve join documentation according to above for R (see the R sketch below)
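
A small hedged R sketch (hypothetical frames) of one of the documented join types:

```r
df1 <- createDataFrame(data.frame(key = c(1, 2), v1 = c("a", "b")))
df2 <- createDataFrame(data.frame(key = c(2, 3), v2 = c("x", "y")))
# explicit join type string; "left_outer" is one of the types the docs enumerate
joined <- join(df1, df2, df1$key == df2$key, "left_outer")
```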

## How was this patch tested?
No tests b/c docs.

Please review http://spark.apache.org/contributing.html before opening a pull request.

Author: anabranch <wac.chambers@gmail.com>

Closes #16504 from anabranch/SPARK-19126.
2017-01-08 20:37:46 -08:00
anabranch 1f6ded6455 [SPARK-19127][DOCS] Update Rank Function Documentation
## What changes were proposed in this pull request?

- [X] Fix inconsistencies in the function reference for `dense_rank` and `rank`
- [X] Make all languages equivalent in their reference to `dense_rank` and `rank` (a short sketch follows).
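
A hedged sketch (hypothetical data) contrasting the two functions over a window:

```r
df <- createDataFrame(data.frame(dept = c("a", "a", "b"), salary = c(100, 100, 90)))
ws <- orderBy(windowPartitionBy("dept"), "salary")
# rank() leaves gaps after ties, dense_rank() does not
df <- withColumn(df, "rank", over(rank(), ws))
df <- withColumn(df, "dense_rank", over(dense_rank(), ws))
```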

## How was this patch tested?

N/A for docs.

Please review http://spark.apache.org/contributing.html before opening a pull request.

Author: anabranch <wac.chambers@gmail.com>

Closes #16505 from anabranch/SPARK-19127.
2017-01-08 17:53:53 -08:00
Yanbo Liang 6b6b555a1e [SPARK-18862][SPARKR][ML] Split SparkR mllib.R into multiple files
## What changes were proposed in this pull request?
SparkR ```mllib.R``` is getting bigger as we add more ML wrappers, so I'd like to split it into multiple files to make it easier to maintain:
* mllib_classification.R
* mllib_clustering.R
* mllib_recommendation.R
* mllib_regression.R
* mllib_stat.R
* mllib_tree.R
* mllib_utils.R

Note: Only reorg, no actual code change.

## How was this patch tested?
Existing tests.

Author: Yanbo Liang <ybliang8@gmail.com>

Closes #16312 from yanboliang/spark-18862.
2017-01-08 01:10:36 -08:00
Yanbo Liang cdda3372a3 [MINOR] Bump R version to 2.2.0.
## What changes were proposed in this pull request?
#16126 bumps the master branch version to 2.2.0-SNAPSHOT, but it seems the R version was omitted.

## How was this patch tested?
N/A

Author: Yanbo Liang <ybliang8@gmail.com>

Closes #16488 from yanboliang/r-version.
2017-01-07 14:33:17 +00:00
Felix Cheung 17579bda3c [SPARK-18958][SPARKR] R API toJSON on DataFrame
## What changes were proposed in this pull request?

It would make it easier to integrate with other components expecting a row-based JSON format.
This replaces the non-public toJSON RDD API.
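
A hedged sketch of the intended usage, assuming `toJSON` returns a SparkDataFrame with one JSON string per row:

```r
df <- createDataFrame(data.frame(name = c("Alice", "Bob"), age = c(29L, 31L)))
jdf <- toJSON(df)
showDF(jdf)   # e.g. {"name":"Alice","age":29}, one row per record
```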

## How was this patch tested?

manual, unit tests

Author: Felix Cheung <felixcheung_m@hotmail.com>

Closes #16368 from felixcheung/rJSON.
2016-12-22 20:54:38 -08:00
Felix Cheung 7e8994ffd3 [SPARK-18903][SPARKR] Add API to get SparkUI URL
## What changes were proposed in this pull request?

API for SparkUI URL from SparkContext
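
A hedged usage sketch (the accessor name `sparkR.uiWebUrl` is assumed here):

```r
sparkR.session()
# returns the SparkUI URL of the current SparkContext
sparkR.uiWebUrl()
```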

## How was this patch tested?

manual, unit tests

Author: Felix Cheung <felixcheung_m@hotmail.com>

Closes #16367 from felixcheung/rwebui.
2016-12-21 17:21:17 -08:00
Felix Cheung 38fd163d0d [SPARK-18849][ML][SPARKR][DOC] vignettes final check reorg
## What changes were proposed in this pull request?

Reorganizing content (copy/paste)

## How was this patch tested?

https://felixcheung.github.io/sparkr-vignettes.html

Previous:
https://felixcheung.github.io/sparkr-vignettes_old.html

Author: Felix Cheung <felixcheung_m@hotmail.com>

Closes #16301 from felixcheung/rvignettespass2.
2016-12-17 14:37:34 -08:00
Dongjoon Hyun 1169db44bc [SPARK-18897][SPARKR] Fix SparkR SQL Test to drop test table
## What changes were proposed in this pull request?

SparkR tests, `R/run-tests.sh`, succeed only once because `test_sparkSQL.R` does not clean up the test table, `people`.

As a result, rows accumulate in the `people` table on every run and the test cases fail.

The following is the failure result for the second run.

```r
Failed -------------------------------------------------------------------------
1. Failure: create DataFrame from RDD (test_sparkSQL.R#204) -------------------
collect(sql("SELECT age from people WHERE name = 'Bob'"))$age not equal to c(16).
Lengths differ: 2 vs 1

2. Failure: create DataFrame from RDD (test_sparkSQL.R#206) -------------------
collect(sql("SELECT height from people WHERE name ='Bob'"))$height not equal to c(176.5).
Lengths differ: 2 vs 1
```
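
The cleanup amounts to a teardown along these lines (sketch):

```r
# drop the table created by the suite so reruns start from a clean state
sql("DROP TABLE IF EXISTS people")
```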

## How was this patch tested?

Manual. Run `run-tests.sh` twice and check if it passes without failures.

Author: Dongjoon Hyun <dongjoon@apache.org>

Closes #16310 from dongjoon-hyun/SPARK-18897.
2016-12-16 11:30:21 -08:00
Felix Cheung 7d858bc5ce [SPARK-18849][ML][SPARKR][DOC] vignettes final check update
## What changes were proposed in this pull request?

doc cleanup

## How was this patch tested?

~~vignettes is not building for me. I'm going to kick off a full clean build and try again and attach output here for review.~~
Output html here: https://felixcheung.github.io/sparkr-vignettes.html

Author: Felix Cheung <felixcheung_m@hotmail.com>

Closes #16286 from felixcheung/rvignettespass.
2016-12-14 21:51:52 -08:00
wm624@hotmail.com 3243885316 [SPARK-18865][SPARKR] SparkR vignettes MLP and LDA updates
## What changes were proposed in this pull request?

While doing the QA work, I found the following issues:

1). `spark.mlp` doesn't include an example;
2). `spark.mlp` and `spark.lda` have redundant parameter explanations;
3). `spark.lda` document misses default values for some parameters.

I also changed the `spark.logit` regParam in the examples, as we discussed in #16222.
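
For reference, a minimal `spark.mlp` call of the kind such an example could use (hedged sketch; hypothetical data path and layer sizes):

```r
df <- read.df("data/mllib/sample_multiclass_classification_data.txt", source = "libsvm")
# 4 input features, two hidden layers, 3 output classes
model <- spark.mlp(df, label ~ features, layers = c(4, 5, 4, 3), blockSize = 128)
summary(model)
```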

## How was this patch tested?

Manual test

Author: wm624@hotmail.com <wm624@hotmail.com>

Closes #16284 from wangmiao1981/ks.
2016-12-14 17:07:27 -08:00
Joseph K. Bradley 7862742570 [SPARK-18795][ML][SPARKR][DOC] Added KSTest section to SparkR vignettes
## What changes were proposed in this pull request?

Added a short section for KSTest.
Also added the logreg model to the list of ML models in the vignette. (This will be reorganized under SPARK-18849)
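
A minimal `spark.kstest` sketch (hypothetical data), testing a column against the standard normal:

```r
df <- createDataFrame(data.frame(test = rnorm(100)))
ksResult <- spark.kstest(df, "test", "norm")   # defaults to N(0, 1)
summary(ksResult)
```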

![screen shot 2016-12-14 at 1 37 31 pm](https://cloud.githubusercontent.com/assets/5084283/21202140/7f24e240-c202-11e6-9362-458208bb9159.png)

## How was this patch tested?

Manually tested example locally.
Built vignettes locally.

Author: Joseph K. Bradley <joseph@databricks.com>

Closes #16283 from jkbradley/ksTest-vignette.
2016-12-14 14:10:40 -08:00
wm624@hotmail.com f2ddabfa09 [MINOR][SPARKR] fix kstest example error and add unit test
## What changes were proposed in this pull request?

While adding vignettes for kstest, I found some errors in the example:
1. There is a typo in the kstest example;
2. print.summary.KStest doesn't work with the example;

Fix the example errors;
Add a new unit test for print.summary.KStest;

## How was this patch tested?
Manual test;
Add new unit test;

Author: wm624@hotmail.com <wm624@hotmail.com>

Closes #16259 from wangmiao1981/ks.
2016-12-13 18:52:05 -08:00
Xiangrui Meng 594b14f1eb [SPARK-18793][SPARK-18794][R] add spark.randomForest/spark.gbt to vignettes
## What changes were proposed in this pull request?

Mention `spark.randomForest` and `spark.gbt` in vignettes. Keep the content minimal since users can type `?spark.randomForest` to see the full doc.
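
Minimal calls of the kind the vignettes mention (hedged sketch; hypothetical settings):

```r
df <- suppressWarnings(createDataFrame(iris))
rfModel <- spark.randomForest(df, Species ~ ., type = "classification", numTrees = 10)
# spark.gbt classification is binary-only, so a regression formula is used here
gbtModel <- spark.gbt(df, Sepal_Length ~ Sepal_Width + Petal_Length, type = "regression")
```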

cc: jkbradley

Author: Xiangrui Meng <meng@databricks.com>

Closes #16264 from mengxr/SPARK-18793.
2016-12-13 16:59:09 -08:00
wm624@hotmail.com 2aa16d03db [SPARK-18797][SPARKR] Update spark.logit in sparkr-vignettes
## What changes were proposed in this pull request?
spark.logit was added in 2.1. We need to update sparkr-vignettes to reflect the changes. This is part of the SparkR QA work.

## How was this patch tested?

Manual build html. Please see attached image for the result.
![test](https://cloud.githubusercontent.com/assets/5033592/21032237/01b565fe-bd5d-11e6-8b59-4de4b6ef611d.jpeg)

Author: wm624@hotmail.com <wm624@hotmail.com>

Closes #16222 from wangmiao1981/veg.
2016-12-12 22:41:11 -08:00
Felix Cheung 8a51cfdcad [SPARK-18810][SPARKR] SparkR install.spark does not work for RCs, snapshots
## What changes were proposed in this pull request?

Support overriding the download url (including the version directory) via an environment variable, `SPARKR_RELEASE_DOWNLOAD_URL`

## How was this patch tested?

unit test, manually testing
- snapshot build url
  - download when spark jar not cached
  - when spark jar is cached
- RC build url
  - download when spark jar not cached
  - when spark jar is cached
- multiple cached spark versions
- starting with sparkR shell

To use this,
```
SPARKR_RELEASE_DOWNLOAD_URL=http://this_is_the_url_to_spark_release_tgz R
```
then in R,
```
library(SparkR) # or specify lib.loc
sparkR.session()
```

Author: Felix Cheung <felixcheung_m@hotmail.com>

Closes #16248 from felixcheung/rinstallurl.
2016-12-12 14:40:41 -08:00
Felix Cheung 3e11d5bfef [SPARK-18807][SPARKR] Should suppress output print for calls to JVM methods with void return values
## What changes were proposed in this pull request?

Several SparkR APIs that call into JVM methods with void return values get their NULL results printed out, especially when running in a REPL or IDE.
For example:
```
> setLogLevel("WARN")
NULL
```
We should fix this to make the result clearer.
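
A possible shape of the fix, as a sketch (`getSparkContext`/`callJMethod` are SparkR internals, assumed here):

```r
setLogLevel <- function(level) {
  sc <- SparkR:::getSparkContext()
  # invisible() keeps the NULL return of the void JVM call from auto-printing
  invisible(SparkR:::callJMethod(sc, "setLogLevel", level))
}
```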

Also found a small change to the return value of dropTempView in 2.1; adding doc and a test for it.

## How was this patch tested?

manually - I didn't find an expect_*() method in testthat for this

Author: Felix Cheung <felixcheung_m@hotmail.com>

Closes #16237 from felixcheung/rinvis.
2016-12-09 19:06:05 -08:00
wm624@hotmail.com 86a96034cc [SPARK-18349][SPARKR] Update R API documentation on ml model summary
## What changes were proposed in this pull request?
In this PR, the documentation of the `summary` method is improved to the format:

returns summary information of the fitted model, which is a list. The list includes .......

Since `summary` in R is mainly about the model, and is not the same as the `summary` object on the Scala side (if there is one), the Scala API doc is not linked here.

In the current documentation, some `return` descriptions end with `.` and some don't; `.` is added to the ones missing it.

Since the spark.logit `summary` is undergoing a big refactoring, this PR doesn't include it. It will be changed when the `spark.logit` PR is merged.

## How was this patch tested?

Manual build.

Author: wm624@hotmail.com <wm624@hotmail.com>

Closes #16150 from wangmiao1981/audit2.
2016-12-08 22:08:19 -08:00
Felix Cheung c3d3a9d0e8 [SPARK-18590][SPARKR] build R source package when making distribution
## What changes were proposed in this pull request?

This PR has 2 key changes. One, we are building a source package (aka bundle package) for SparkR which could be released on CRAN. Two, the official Spark binary distributions should include SparkR installed from this source package instead (which has the help/vignettes rds files needed for those to work when the SparkR package is loaded in R, whereas the earlier approach with devtools does not).

But, because of various differences in how R performs different tasks, this PR is a fair bit more complicated. More details below.

This PR also includes a few minor fixes.

### more details

These are the additional steps in make-distribution; please see [here](https://github.com/apache/spark/blob/master/R/CRAN_RELEASE.md) on what goes into a CRAN release, which is now run during make-distribution.sh.
1. the package needs to be installed because the first code block in the vignettes is `library(SparkR)` without a lib path
2. `R CMD build` will build the vignettes (this process runs Spark/SparkR code and captures outputs into pdf documentation)
3. `R CMD check` on the source package will install the package and build the vignettes again (this time from the packaged source) - this is a key step required to release an R package on CRAN
 (tests are skipped here, but they will need to pass for the CRAN release process to succeed - ideally, during release signoff we should install from the R source package and run tests)
4. `R CMD INSTALL` on the source package (this is the only way to generate doc/vignettes rds files correctly, not in step # 1)
 (the output of this step is what we package into the Spark dist and sparkr.zip)

Alternatively,
   R CMD build should already be installing the package in a temp directory, though it might just be finding this location and setting it as the lib.loc parameter; another approach is perhaps we could try calling `R CMD INSTALL --build pkg` instead.
 But in any case, despite installing the package multiple times, this is relatively fast.
Building vignettes takes a while though.

## How was this patch tested?

Manually, CI.

Author: Felix Cheung <felixcheung_m@hotmail.com>

Closes #16014 from felixcheung/rdist.
2016-12-08 11:29:31 -08:00
Yanbo Liang 97255497d8 [SPARK-18326][SPARKR][ML] Review SparkR ML wrappers API for 2.1
## What changes were proposed in this pull request?
Reviewing SparkR ML wrappers API for 2.1 release, mainly two issues:
* Remove ```probabilityCol``` from the argument list of ```spark.logit``` and ```spark.randomForest```, since it is used when making predictions and should be an argument of ```predict```; we will work on this at [SPARK-18618](https://issues.apache.org/jira/browse/SPARK-18618) in the next release cycle.
* Fix ```spark.als``` params to make it consistent with MLlib.

## How was this patch tested?
Existing tests.

Author: Yanbo Liang <ybliang8@gmail.com>

Closes #16169 from yanboliang/spark-18326.
2016-12-07 20:23:28 -08:00
Sean Owen 79f5f281bb [SPARK-18678][ML] Skewed reservoir sampling in SamplingUtils
## What changes were proposed in this pull request?

Fix reservoir sampling bias for small k. An off-by-one error meant that the probability of replacement was slightly too high: k/(l-1) after the l-th element instead of k/l, which matters for small k.
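
An illustrative (non-Spark) reservoir sampler showing the corrected probability:

```r
reservoirSample <- function(stream, k) {
  reservoir <- stream[seq_len(min(k, length(stream)))]
  if (length(stream) > k) {
    for (l in (k + 1):length(stream)) {
      j <- sample.int(l, 1)            # uniform on 1..l, so P(j <= k) = k/l
      if (j <= k) reservoir[j] <- stream[l]
    }
  }
  reservoir
}
```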

## How was this patch tested?

Existing test plus new test case.

Author: Sean Owen <sowen@cloudera.com>

Closes #16129 from srowen/SPARK-18678.
2016-12-07 17:34:45 +08:00
Yanbo Liang 90b59d1bf2 [SPARK-18686][SPARKR][ML] Several cleanups and improvements for spark.logit.
## What changes were proposed in this pull request?
Several cleanups and improvements for ```spark.logit```:
* ```summary``` should return the coefficients matrix, and should output labels for each class if the model is a multinomial logistic regression model.
* ```summary``` should not return ```areaUnderROC, roc, pr, ...```, since most of them are DataFrames, which are less important for R users. Meanwhile, these metrics ignore instance weights (setting all to 1.0), which will be changed in a later Spark version. To avoid introducing breaking changes, we do not expose them currently.
* SparkR test improvement: comparing the training result with native R glmnet.
* Remove argument ```aggregationDepth``` from ```spark.logit```, since it's an expert Param (related to Spark architecture and job execution) that would rarely be used by R users.

## How was this patch tested?
Unit tests.

The ```summary``` output after this change:
multinomial logistic regression:
```
> df <- suppressWarnings(createDataFrame(iris))
> model <- spark.logit(df, Species ~ ., regParam = 0.5)
> summary(model)
$coefficients
             versicolor  virginica   setosa
(Intercept)  1.514031    -2.609108   1.095077
Sepal_Length 0.02511006  0.2649821   -0.2900921
Sepal_Width  -0.5291215  -0.02016446 0.549286
Petal_Length 0.03647411  0.1544119   -0.190886
Petal_Width  0.000236092 0.4195804   -0.4198165
```
binomial logistic regression:
```
> df <- suppressWarnings(createDataFrame(iris))
> training <- df[df$Species %in% c("versicolor", "virginica"), ]
> model <- spark.logit(training, Species ~ ., regParam = 0.5)
> summary(model)
$coefficients
             Estimate
(Intercept)  -6.053815
Sepal_Length 0.2449379
Sepal_Width  0.1648321
Petal_Length 0.4730718
Petal_Width  1.031947
```

Author: Yanbo Liang <ybliang8@gmail.com>

Closes #16117 from yanboliang/spark-18686.
2016-12-07 00:31:11 -08:00
Felix Cheung b019b3a8ac [SPARK-18643][SPARKR] SparkR hangs at session start when installed as a package without Spark
## What changes were proposed in this pull request?

If SparkR is running as a package and has previously downloaded the Spark jar, it should be able to run as before without having to set SPARK_HOME. Basically, with this bug the auto-install of Spark only works in the first session.

This seems to be a regression from the earlier behavior.

The fix is to always try to install or check for the cached Spark if running in an interactive session.
As discussed before, we should probably only install Spark if running in an interactive session (R shell, RStudio, etc.)

## How was this patch tested?

Manually

Author: Felix Cheung <felixcheung_m@hotmail.com>

Closes #16077 from felixcheung/rsessioninteractive.
2016-12-04 20:25:11 -08:00
Yanbo Liang a985dd8e99 [SPARK-18291][SPARKR][ML] Revert "[SPARK-18291][SPARKR][ML] SparkR glm predict should output original label when family = binomial."
## What changes were proposed in this pull request?
It's better to fix this issue by providing an option ```type``` for users to change the ```predict``` output schema, so they can output probabilities, log-space predictions, or original labels. In order not to introduce a breaking API change in 2.1, revert this change first and add it back after [SPARK-18618](https://issues.apache.org/jira/browse/SPARK-18618) is resolved.

## How was this patch tested?
Existing unit tests.

This reverts commit daa975f4bf.

Author: Yanbo Liang <ybliang8@gmail.com>

Closes #16118 from yanboliang/spark-18291-revert.
2016-12-02 12:16:57 -08:00
wm624@hotmail.com 2eb6764fbb [SPARK-18476][SPARKR][ML] SparkR Logistic Regression should support outputting the original label.
## What changes were proposed in this pull request?

Similar to SPARK-18401, as a classification algorithm, logistic regression should support outputting the original label instead of the indexed label.

In this PR, original label output is supported and test cases are modified and added. Document is also modified.

## How was this patch tested?

Unit tests.

Author: wm624@hotmail.com <wm624@hotmail.com>

Closes #15910 from wangmiao1981/audit.
2016-11-30 20:32:17 -08:00
Burak Yavuz 0d1bf2b6c8 [SPARK-18510] Fix data corruption from inferred partition column dataTypes
## What changes were proposed in this pull request?

### The Issue

If I specify my schema when doing
```scala
spark.read
  .schema(someSchemaWherePartitionColumnsAreStrings)
```
but the partition inference infers it as IntegerType (or, I assume, LongType or DoubleType - basically fixed-size types), then once UnsafeRows are generated, your data will be corrupted.

### Proposed solution

The partition handling code path is kind of a mess. In my fix I'm probably adding to the mess, but at least trying to standardize the code path.

The real issue is that a user that uses the `spark.read` code path can never clearly specify what the partition columns are. If you try to specify the fields in `schema`, we practically ignore what the user provides, and fall back to our inferred data types. What happens in the end is data corruption.

My solution tries to fix this by always trying to infer partition columns the first time you specify the table. Once we find what the partition columns are, we try to find them in the user specified schema and use the dataType provided there, or fall back to the smallest common data type.

We will ALWAYS append partition columns to the user's schema, even if they didn't ask for it. We will only use the data type they provided if they specified it. While this is confusing, this has been the behavior since Spark 1.6, and I didn't want to change this behavior in the QA period of Spark 2.1. We may revisit this decision later.

A side effect of this PR is that we won't need https://github.com/apache/spark/pull/15942 if this PR goes in.

## How was this patch tested?

Regression tests

Author: Burak Yavuz <brkyvz@gmail.com>

Closes #15951 from brkyvz/partition-corruption.
2016-11-23 11:48:59 -08:00
Sean Owen 7e0cd1d9b1 [SPARK-18073][DOCS][WIP] Migrate wiki to spark.apache.org web site
## What changes were proposed in this pull request?

Updates links to the wiki so they point to the new location of the content on spark.apache.org.

## How was this patch tested?

Doc builds

Author: Sean Owen <sowen@cloudera.com>

Closes #15967 from srowen/SPARK-18073.1.
2016-11-23 11:25:47 +00:00
Yanbo Liang 982b82e32e [SPARK-18501][ML][SPARKR] Fix spark.glm errors when fitting on collinear data
## What changes were proposed in this pull request?
* Fix SparkR ```spark.glm``` errors when fitting on collinear data, since ```standard error of coefficients, t value and p value``` are not available in this condition (a repro sketch follows the list).
* Scala/Python GLM summary should throw an exception if users request ```standard error of coefficients, t value and p value``` but the underlying WLS was solved by local "l-bfgs".
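
A hedged repro sketch with two perfectly collinear predictors:

```r
d <- data.frame(y = rnorm(10), x1 = 1:10)
d$x2 <- 2 * d$x1              # x2 is perfectly collinear with x1
df <- createDataFrame(d)
model <- spark.glm(df, y ~ x1 + x2)
summary(model)   # standard errors, t and p values are unavailable here
```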

## How was this patch tested?
Add unit tests.

Author: Yanbo Liang <ybliang8@gmail.com>

Closes #15930 from yanboliang/spark-18501.
2016-11-22 19:17:48 -08:00
hyukjinkwon 4922f9cdca [SPARK-18514][DOCS] Fix the markdown for Note:/NOTE:/Note that across R API documentation
## What changes were proposed in this pull request?

It seems in R, there are

- `Note:`
- `NOTE:`
- `Note that`

This PR proposes to fix those to `Note:` to be consistent.

**Before**

![2016-11-21 11 30 07](https://cloud.githubusercontent.com/assets/6477701/20468848/2f27b0fa-afde-11e6-89e3-993701269dbe.png)

**After**

![2016-11-21 11 29 44](https://cloud.githubusercontent.com/assets/6477701/20468851/39469664-afde-11e6-9929-ad80be7fc405.png)

## How was this patch tested?

The notes were found via

```bash
grep -r "NOTE: " .
grep -r "Note that " .
```

And then fixed one by one comparing with API documentation.

After that, manually tested via `sh create-docs.sh` under `./R`.

Author: hyukjinkwon <gurwls223@gmail.com>

Closes #15952 from HyukjinKwon/SPARK-18514.
2016-11-22 11:26:10 +00:00
Yanbo Liang acb9715779 [SPARK-18444][SPARKR] SparkR running in yarn-cluster mode should not download Spark package.
## What changes were proposed in this pull request?
When running a SparkR job in yarn-cluster mode, it will download the Spark package from the Apache website, which is not necessary.
```
./bin/spark-submit --master yarn-cluster ./examples/src/main/r/dataframe.R
```
The following is output:
```
Attaching package: ‘SparkR’

The following objects are masked from ‘package:stats’:

    cov, filter, lag, na.omit, predict, sd, var, window

The following objects are masked from ‘package:base’:

    as.data.frame, colnames, colnames<-, drop, endsWith, intersect,
    rank, rbind, sample, startsWith, subset, summary, transform, union

Spark not found in SPARK_HOME:
Spark not found in the cache directory. Installation will start.
MirrorUrl not provided.
Looking for preferred site from apache website...
......
```
There's no ```SPARK_HOME``` in yarn-cluster mode since the R process runs on a remote host of the YARN cluster rather than on the client host. The JVM comes up first and the R process then connects to it. So in such cases we should never have to download Spark, as Spark is already running.

## How was this patch tested?
Offline test.

Author: Yanbo Liang <ybliang8@gmail.com>

Closes #15888 from yanboliang/spark-18444.
2016-11-22 00:05:30 -08:00
anabranch 49b6f456ac [SPARK-18365][DOCS] Improve Sample Method Documentation
## What changes were proposed in this pull request?

I found the documentation for the sample method to be confusing; this adds more clarification across all languages (a short R sketch follows the checklist).

- [x] Scala
- [x] Python
- [x] R
- [x] RDD Scala
- [ ] RDD Python with SEED
- [X] RDD Java
- [x] RDD Java with SEED
- [x] RDD Python
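
A short R sketch of the clarified semantics (`fraction` is a sampling probability, not an exact row count):

```r
df <- suppressWarnings(createDataFrame(iris))
s <- sample(df, withReplacement = FALSE, fraction = 0.1, seed = 42)
count(s)   # roughly 15 of 150 rows; the exact count varies
```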

## How was this patch tested?

NA

Please review https://cwiki.apache.org/confluence/display/SPARK/Contributing+to+Spark before opening a pull request.

Author: anabranch <wac.chambers@gmail.com>
Author: Bill Chambers <bill@databricks.com>

Closes #15815 from anabranch/SPARK-18365.
2016-11-17 11:34:55 +00:00
Yanbo Liang 95eb06bd7d [SPARK-18438][SPARKR][ML] spark.mlp should support RFormula.
## What changes were proposed in this pull request?
```spark.mlp``` should support ```RFormula``` like other ML algorithm wrappers.
BTW, I did some cleanup and improvement for ```spark.mlp```.

## How was this patch tested?
Unit tests.

Author: Yanbo Liang <ybliang8@gmail.com>

Closes #15883 from yanboliang/spark-18438.
2016-11-16 01:04:18 -08:00
Yanbo Liang 07be232ea1 [SPARK-18412][SPARKR][ML] Fix exception for some SparkR ML algorithms training on libsvm data
## What changes were proposed in this pull request?
* Fix the following exception, which is thrown when ```spark.randomForest``` (classification), ```spark.gbt``` (classification), ```spark.naiveBayes``` and ```spark.glm``` (binomial family) are fitted on libsvm data.
```
java.lang.IllegalArgumentException: requirement failed: If label column already exists, forceIndexLabel can not be set with true.
```
See [SPARK-18412](https://issues.apache.org/jira/browse/SPARK-18412) for more detail on how to reproduce this bug (a repro sketch follows the list).
* Refactor out ```getFeaturesAndLabels``` to RWrapperUtils, since lots of ML algorithm wrappers use this function.
* Drop some unwanted columns when making prediction.
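
A hedged repro of the previously failing case (hypothetical data path):

```r
df <- read.df("data/mllib/sample_libsvm_data.txt", source = "libsvm")
# fitting a classifier on libsvm data used to hit the forceIndexLabel error above
model <- spark.randomForest(df, label ~ features, type = "classification")
```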

## How was this patch tested?
Add unit test.

Author: Yanbo Liang <ybliang8@gmail.com>

Closes #15851 from yanboliang/spark-18412.
2016-11-13 20:25:12 -08:00
Felix Cheung ba23f768f7 [SPARK-18264][SPARKR] build vignettes with package, update vignettes for CRAN release build and add info on release
## What changes were proposed in this pull request?

Changes to DESCRIPTION to build vignettes.
Changes the metadata for vignettes to generate the recommended format (less than about 10% of the previous size). Unfortunately it does not look as nice
(before - left, after - right)

![image](https://cloud.githubusercontent.com/assets/8969467/20040492/b75883e6-a40d-11e6-9534-25cdd5d59a8b.png)

![image](https://cloud.githubusercontent.com/assets/8969467/20040490/a40f4d42-a40d-11e6-8c91-af00ddcbdad9.png)

Also add information on how to run build/release to CRAN later.

## How was this patch tested?

manually, unit tests

shivaram

We need this for branch-2.1

Author: Felix Cheung <felixcheung_m@hotmail.com>

Closes #15790 from felixcheung/rpkgvignettes.
2016-11-11 15:49:55 -08:00
Yanbo Liang 5ddf69470b [SPARK-18401][SPARKR][ML] SparkR random forest should support output original label.
## What changes were proposed in this pull request?
SparkR ```spark.randomForest``` classification prediction should output the original label rather than the indexed label. This issue is very similar to [SPARK-18291](https://issues.apache.org/jira/browse/SPARK-18291).

## How was this patch tested?
Add unit tests.

Author: Yanbo Liang <ybliang8@gmail.com>

Closes #15842 from yanboliang/spark-18401.
2016-11-10 17:13:10 -08:00
Felix Cheung 55964c15a7 [SPARK-18239][SPARKR] Gradient Boosted Tree for R
## What changes were proposed in this pull request?

Gradient Boosted Tree in R.
With a few minor improvements to RandomForest in R.

Since this is relatively isolated I'd like to target this for branch-2.1

## How was this patch tested?

manual tests, unit tests

Author: Felix Cheung <felixcheung_m@hotmail.com>

Closes #15746 from felixcheung/rgbt.
2016-11-08 16:00:45 -08:00
Yanbo Liang daa975f4bf [SPARK-18291][SPARKR][ML] SparkR glm predict should output original label when family = binomial.
## What changes were proposed in this pull request?
SparkR ```spark.glm``` predict should output the original label when family = "binomial".

## How was this patch tested?
Add unit test.
You can also run the following code to test:
```R
training <- suppressWarnings(createDataFrame(iris))
training <- training[training$Species %in% c("versicolor", "virginica"), ]
model <- spark.glm(training, Species ~ Sepal_Length + Sepal_Width,family = binomial(link = "logit"))
showDF(predict(model, training))
```
Before this change:
```
+------------+-----------+------------+-----------+----------+-----+-------------------+
|Sepal_Length|Sepal_Width|Petal_Length|Petal_Width|   Species|label|         prediction|
+------------+-----------+------------+-----------+----------+-----+-------------------+
|         7.0|        3.2|         4.7|        1.4|versicolor|  0.0| 0.8271421517601544|
|         6.4|        3.2|         4.5|        1.5|versicolor|  0.0| 0.6044595910413112|
|         6.9|        3.1|         4.9|        1.5|versicolor|  0.0| 0.7916340858281998|
|         5.5|        2.3|         4.0|        1.3|versicolor|  0.0|0.16080518180591158|
|         6.5|        2.8|         4.6|        1.5|versicolor|  0.0| 0.6112229217050189|
|         5.7|        2.8|         4.5|        1.3|versicolor|  0.0| 0.2555087295500885|
|         6.3|        3.3|         4.7|        1.6|versicolor|  0.0| 0.5681507664364834|
|         4.9|        2.4|         3.3|        1.0|versicolor|  0.0|0.05990570219972002|
|         6.6|        2.9|         4.6|        1.3|versicolor|  0.0| 0.6644434078306246|
|         5.2|        2.7|         3.9|        1.4|versicolor|  0.0|0.11293577405862379|
|         5.0|        2.0|         3.5|        1.0|versicolor|  0.0|0.06152372321585971|
|         5.9|        3.0|         4.2|        1.5|versicolor|  0.0|0.35250697207602555|
|         6.0|        2.2|         4.0|        1.0|versicolor|  0.0|0.32267018290814303|
|         6.1|        2.9|         4.7|        1.4|versicolor|  0.0|  0.433391153814592|
|         5.6|        2.9|         3.6|        1.3|versicolor|  0.0| 0.2280744262436993|
|         6.7|        3.1|         4.4|        1.4|versicolor|  0.0| 0.7219848389339459|
|         5.6|        3.0|         4.5|        1.5|versicolor|  0.0|0.23527698971404695|
|         5.8|        2.7|         4.1|        1.0|versicolor|  0.0|  0.285024533520016|
|         6.2|        2.2|         4.5|        1.5|versicolor|  0.0| 0.4107047877447493|
|         5.6|        2.5|         3.9|        1.1|versicolor|  0.0|0.20083561961645083|
+------------+-----------+------------+-----------+----------+-----+-------------------+
```
After this change:
```
+------------+-----------+------------+-----------+----------+-----+----------+
|Sepal_Length|Sepal_Width|Petal_Length|Petal_Width|   Species|label|prediction|
+------------+-----------+------------+-----------+----------+-----+----------+
|         7.0|        3.2|         4.7|        1.4|versicolor|  0.0| virginica|
|         6.4|        3.2|         4.5|        1.5|versicolor|  0.0| virginica|
|         6.9|        3.1|         4.9|        1.5|versicolor|  0.0| virginica|
|         5.5|        2.3|         4.0|        1.3|versicolor|  0.0|versicolor|
|         6.5|        2.8|         4.6|        1.5|versicolor|  0.0| virginica|
|         5.7|        2.8|         4.5|        1.3|versicolor|  0.0|versicolor|
|         6.3|        3.3|         4.7|        1.6|versicolor|  0.0| virginica|
|         4.9|        2.4|         3.3|        1.0|versicolor|  0.0|versicolor|
|         6.6|        2.9|         4.6|        1.3|versicolor|  0.0| virginica|
|         5.2|        2.7|         3.9|        1.4|versicolor|  0.0|versicolor|
|         5.0|        2.0|         3.5|        1.0|versicolor|  0.0|versicolor|
|         5.9|        3.0|         4.2|        1.5|versicolor|  0.0|versicolor|
|         6.0|        2.2|         4.0|        1.0|versicolor|  0.0|versicolor|
|         6.1|        2.9|         4.7|        1.4|versicolor|  0.0|versicolor|
|         5.6|        2.9|         3.6|        1.3|versicolor|  0.0|versicolor|
|         6.7|        3.1|         4.4|        1.4|versicolor|  0.0| virginica|
|         5.6|        3.0|         4.5|        1.5|versicolor|  0.0|versicolor|
|         5.8|        2.7|         4.1|        1.0|versicolor|  0.0|versicolor|
|         6.2|        2.2|         4.5|        1.5|versicolor|  0.0|versicolor|
|         5.6|        2.5|         3.9|        1.1|versicolor|  0.0|versicolor|
+------------+-----------+------------+-----------+----------+-----+----------+
```

Author: Yanbo Liang <ybliang8@gmail.com>

Closes #15788 from yanboliang/spark-18291.
2016-11-07 04:07:19 -08:00
hyukjinkwon 15d3926884 [MINOR][DOCUMENTATION] Fix some minor descriptions in functions consistently with expressions
## What changes were proposed in this pull request?

This PR proposes to improve documentation and fix some descriptions equivalent to several minor fixes identified in https://github.com/apache/spark/pull/15677

Also, this suggests changing `Note:` and `NOTE:` to `.. note::`, consistent with the others, which renders nicely.

## How was this patch tested?

Jenkins tests and manually.

For PySpark, `Note:` and `NOTE:` to `.. note::` make the document as below:

**From**

![2016-11-04 6 53 35](https://cloud.githubusercontent.com/assets/6477701/20002648/42989922-a2c5-11e6-8a32-b73eda49e8c3.png)
![2016-11-04 6 53 45](https://cloud.githubusercontent.com/assets/6477701/20002650/429fb310-a2c5-11e6-926b-e030d7eb0185.png)
![2016-11-04 6 54 11](https://cloud.githubusercontent.com/assets/6477701/20002649/429d570a-a2c5-11e6-9e7e-44090f337e32.png)
![2016-11-04 6 53 51](https://cloud.githubusercontent.com/assets/6477701/20002647/4297fc74-a2c5-11e6-801a-b89fbcbfca44.png)
![2016-11-04 6 53 51](https://cloud.githubusercontent.com/assets/6477701/20002697/749f5780-a2c5-11e6-835f-022e1f2f82e3.png)

**To**

![2016-11-04 7 03 48](https://cloud.githubusercontent.com/assets/6477701/20002659/4961b504-a2c5-11e6-9ee0-ef0751482f47.png)
![2016-11-04 7 04 03](https://cloud.githubusercontent.com/assets/6477701/20002660/49871d3a-a2c5-11e6-85ea-d9a5d11efeff.png)
![2016-11-04 7 04 28](https://cloud.githubusercontent.com/assets/6477701/20002662/498e0f14-a2c5-11e6-803d-c0c5aeda4153.png)
![2016-11-04 7 33 39](https://cloud.githubusercontent.com/assets/6477701/20002731/a76e30d2-a2c5-11e6-993b-0481b8342d6b.png)

Author: hyukjinkwon <gurwls223@gmail.com>

Closes #15765 from HyukjinKwon/minor-function-doc.
2016-11-05 21:47:33 -07:00
Felix Cheung a08463b1d3 [SPARK-14393][SQL][DOC] update doc for python and R
## What changes were proposed in this pull request?

minor doc update that should go to master & branch-2.1

## How was this patch tested?

manual

Author: Felix Cheung <felixcheung_m@hotmail.com>

Closes #15747 from felixcheung/pySPARK-14393.
2016-11-03 22:27:35 -07:00
wm624@hotmail.com e89202523b [SPARKR][TEST] remove unnecessary suppressWarnings
## What changes were proposed in this pull request?

In test_mllib.R, there are two unnecessary suppressWarnings. This PR just removes them.

## How was this patch tested?

Existing unit tests.

Author: wm624@hotmail.com <wm624@hotmail.com>

Closes #15697 from wangmiao1981/rtest.
2016-11-03 15:27:18 -07:00
Wenchen Fan 3a1bc6f478 [SPARK-17470][SQL] unify path for data source table and locationUri for hive serde table
## What changes were proposed in this pull request?

Due to a limitation of the Hive metastore (a table location must be a directory path, not a file path), we always store `path` for data source tables in storage properties, instead of in the `locationUri` field. However, we should not expose this difference at the `CatalogTable` level, but just treat it as a hack in `HiveExternalCatalog`, like how we store the table schema of data source tables in table properties.

This PR unifies `path` and `locationUri` outside of `HiveExternalCatalog`, both data source table and hive serde table should use the `locationUri` field.

This PR also unifies the way we handle the default table location for managed tables. Previously, the default table location of a Hive serde managed table was set by the external catalog, but that of a data source table was set by the command. After this PR, we follow the Hive way and the default table location is always set by the external catalog.

For managed non-file-based tables, we will assign a default table location and create an empty directory for it; the table location will be removed when the table is dropped. This is reasonable as the metastore doesn't care whether a table is file-based or not, and an empty table directory does no harm.
For external non-file-based tables, ideally we could omit the table location, but due to a Hive metastore issue, we will assign a random location to it and remove it right after the table is created. See SPARK-15269 for more details. This is fine as it's well isolated in `HiveExternalCatalog`.

To keep the existing behaviour of the `path` option, in this PR we always add the `locationUri` to storage properties using key `path`, before passing storage properties to `DataSource` as data source options.
## How was this patch tested?

existing tests.

Author: Wenchen Fan <wenchen@databricks.com>

Closes #15024 from cloud-fan/path.
2016-11-02 18:05:14 -07:00
eyal farago f151bd1af8 [SPARK-16839][SQL] Simplify Struct creation code path
## What changes were proposed in this pull request?

Simplify struct creation, especially the aspect of `CleanupAliases` which missed some aliases when handling trees created by `CreateStruct`.

This PR includes:

1. A failing test (create struct with nested aliases, some of the aliases survive `CleanupAliases`).
2. A fix that transforms `CreateStruct` into a `CreateNamedStruct` constructor, effectively eliminating `CreateStruct` from all expression trees.
3. A `NamePlaceHolder` used by `CreateStruct` when column names cannot be extracted from unresolved `NamedExpression`.
4. A new Analyzer rule that resolves `NamePlaceHolder` into a string literal once the `NamedExpression` is resolved.
5. `CleanupAliases` code was simplified as it no longer has to deal with `CreateStruct`'s top level columns.

## How was this patch tested?
Ran all test suites in package org.apache.spark.sql, especially the analysis suite, making sure the added test initially fails; after applying the suggested fix, the entire analysis package runs successfully.

Modified a few tests that expected `CreateStruct`, which is now transformed into `CreateNamedStruct`.

Author: eyal farago <eyal farago>
Author: Herman van Hovell <hvanhovell@databricks.com>
Author: eyal farago <eyal.farago@gmail.com>
Author: Eyal Farago <eyal.farago@actimize.com>
Author: Hyukjin Kwon <gurwls223@gmail.com>
Author: eyalfa <eyal.farago@gmail.com>

Closes #15718 from hvanhovell/SPARK-16839-2.
2016-11-02 11:12:20 +01:00
hyukjinkwon 1ecfafa086 [SPARK-17838][SPARKR] Check named arguments for options and use formatted R friendly message from JVM exception message
## What changes were proposed in this pull request?

This PR proposes to
- improve the R-friendly error messages rather than raw JVM exception one.

  As `read.json`, `read.text`, `read.orc`, `read.parquet` and `read.jdbc` are executed along the same path as `read.df`, and `write.json`, `write.text`, `write.orc`, `write.parquet` and `write.jdbc` share the same path as `write.df`, it seems safe to call `handledCallJMethod` to handle
  JVM messages.
-  prevent the `zero-length variable name` error and print the ignored options as a warning message.

**Before**

``` r
> read.json("path", a = 1, 2, 3, "a")
Error in env[[name]] <- value :
  zero-length variable name
```

``` r
> read.json("arbitrary_path")
Error in invokeJava(isStatic = FALSE, objId$id, methodName, ...) :
  org.apache.spark.sql.AnalysisException: Path does not exist: file:/...;
  at org.apache.spark.sql.execution.datasources.DataSource$$anonfun$12.apply(DataSource.scala:398)
  ...

> read.orc("arbitrary_path")
Error in invokeJava(isStatic = FALSE, objId$id, methodName, ...) :
  org.apache.spark.sql.AnalysisException: Path does not exist: file:/...;
  at org.apache.spark.sql.execution.datasources.DataSource$$anonfun$12.apply(DataSource.scala:398)
  ...

> read.text("arbitrary_path")
Error in invokeJava(isStatic = FALSE, objId$id, methodName, ...) :
  org.apache.spark.sql.AnalysisException: Path does not exist: file:/...;
  at org.apache.spark.sql.execution.datasources.DataSource$$anonfun$12.apply(DataSource.scala:398)
  ...

> read.parquet("arbitrary_path")
Error in invokeJava(isStatic = FALSE, objId$id, methodName, ...) :
  org.apache.spark.sql.AnalysisException: Path does not exist: file:/...;
  at org.apache.spark.sql.execution.datasources.DataSource$$anonfun$12.apply(DataSource.scala:398)
  ...
```

``` r
> write.json(df, "existing_path")
Error in invokeJava(isStatic = FALSE, objId$id, methodName, ...) :
  org.apache.spark.sql.AnalysisException: path file:/... already exists.;
  at org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelationCommand.run(InsertIntoHadoopFsRelationCommand.scala:68)

> write.orc(df, "existing_path")
Error in invokeJava(isStatic = FALSE, objId$id, methodName, ...) :
  org.apache.spark.sql.AnalysisException: path file:/... already exists.;
  at org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelationCommand.run(InsertIntoHadoopFsRelationCommand.scala:68)

> write.text(df, "existing_path")
Error in invokeJava(isStatic = FALSE, objId$id, methodName, ...) :
  org.apache.spark.sql.AnalysisException: path file:/... already exists.;
  at org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelationCommand.run(InsertIntoHadoopFsRelationCommand.scala:68)

> write.parquet(df, "existing_path")
Error in invokeJava(isStatic = FALSE, objId$id, methodName, ...) :
  org.apache.spark.sql.AnalysisException: path file:/... already exists.;
  at org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelationCommand.run(InsertIntoHadoopFsRelationCommand.scala:68)
```

**After**

``` r
read.json("arbitrary_path", a = 1, 2, 3, "a")
Unnamed arguments ignored: 2, 3, a.
```

``` r
> read.json("arbitrary_path")
Error in json : analysis error - Path does not exist: file:/...

> read.orc("arbitrary_path")
Error in orc : analysis error - Path does not exist: file:/...

> read.text("arbitrary_path")
Error in text : analysis error - Path does not exist: file:/...

> read.parquet("arbitrary_path")
Error in parquet : analysis error - Path does not exist: file:/...
```

``` r
> write.json(df, "existing_path")
Error in json : analysis error - path file:/... already exists.;

> write.orc(df, "existing_path")
Error in orc : analysis error - path file:/... already exists.;

> write.text(df, "existing_path")
Error in text : analysis error - path file:/... already exists.;

> write.parquet(df, "existing_path")
Error in parquet : analysis error - path file:/... already exists.;
```
## How was this patch tested?

Unit tests in `test_utils.R` and `test_sparkSQL.R`.

Author: hyukjinkwon <gurwls223@gmail.com>

Closes #15608 from HyukjinKwon/SPARK-17838.
2016-11-01 22:14:53 -07:00
Herman van Hovell 0cba535af3 Revert "[SPARK-16839][SQL] redundant aliases after cleanupAliases"
This reverts commit 5441a6269e.
2016-11-01 17:30:37 +01:00
eyal farago 5441a6269e [SPARK-16839][SQL] redundant aliases after cleanupAliases
## What changes were proposed in this pull request?

Simplify struct creation, especially the aspect of `CleanupAliases` which missed some aliases when handling trees created by `CreateStruct`.

This PR includes:

1. A failing test (create struct with nested aliases, some of the aliases survive `CleanupAliases`).
2. A fix that transforms `CreateStruct` into a `CreateNamedStruct` constructor, effectively eliminating `CreateStruct` from all expression trees.
3. A `NamePlaceHolder` used by `CreateStruct` when column names cannot be extracted from unresolved `NamedExpression`.
4. A new Analyzer rule that resolves `NamePlaceHolder` into a string literal once the `NamedExpression` is resolved.
5. `CleanupAliases` code was simplified as it no longer has to deal with `CreateStruct`'s top level columns.

## How was this patch tested?

Ran all test suites in package org.apache.spark.sql, especially the analysis suite, making sure the added test initially fails; after applying the suggested fix, the entire analysis package runs successfully.

Modified a few tests that expected `CreateStruct`, which is now transformed into `CreateNamedStruct`.

Credit goes to hvanhovell for assisting with this PR.

Author: eyal farago <eyal farago>
Author: eyal farago <eyal.farago@gmail.com>
Author: Herman van Hovell <hvanhovell@databricks.com>
Author: Eyal Farago <eyal.farago@actimize.com>
Author: Hyukjin Kwon <gurwls223@gmail.com>
Author: eyalfa <eyal.farago@gmail.com>

Closes #14444 from eyalfa/SPARK-16839_redundant_aliases_after_cleanupAliases.
2016-11-01 17:12:20 +01:00
Felix Cheung b6879b8b35 [SPARK-16137][SPARKR] randomForest for R
## What changes were proposed in this pull request?

Random Forest Regression and Classification for R
Clean-up/reordering generics.R

## How was this patch tested?

manual tests, unit tests

Author: Felix Cheung <felixcheung_m@hotmail.com>

Closes #15607 from felixcheung/rrandomforest.
2016-10-30 16:19:19 -07:00
Hossein 2881a2d1d1 [SPARK-17919] Make timeout to RBackend configurable in SparkR
## What changes were proposed in this pull request?

This patch makes the RBackend connection timeout configurable by the user.
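
A hedged usage sketch; the config key name is an assumption, not confirmed by this log:

```r
# spark.r.backendConnectionTimeout is assumed to be the new config key (seconds)
sparkR.session(sparkConfig = list(spark.r.backendConnectionTimeout = "6000"))
```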

## How was this patch tested?
N/A

Author: Hossein <hossein@databricks.com>

Closes #15471 from falaki/SPARK-17919.
2016-10-30 16:17:23 -07:00
Felix Cheung 44c8bfda79 [SQL][DOC] updating doc for JSON source to link to jsonlines.org
## What changes were proposed in this pull request?

API and programming guide doc changes for Scala, Python and R.

## How was this patch tested?

manual test

Author: Felix Cheung <felixcheung_m@hotmail.com>

Closes #15629 from felixcheung/jsondoc.
2016-10-26 23:06:11 -07:00
Felix Cheung 1dbe9896b7 [SPARK-17157][SPARKR][FOLLOW-UP] doc fixes
## What changes were proposed in this pull request?

a couple of small late finding fixes for doc

## How was this patch tested?

manually
wangmiao1981

Author: Felix Cheung <felixcheung_m@hotmail.com>

Closes #15650 from felixcheung/logitfix.
2016-10-26 23:02:54 -07:00
wm624@hotmail.com 29cea8f332 [SPARK-17157][SPARKR] Add multiclass logistic regression SparkR Wrapper
## What changes were proposed in this pull request?

As we discussed in #14818, I added a separate R wrapper spark.logit for logistic regression.

This single interface supports both binary and multinomial logistic regression. It also has "predict" and "summary" for binary logistic regression.

## How was this patch tested?

New unit tests are added.

Author: wm624@hotmail.com <wm624@hotmail.com>

Closes #15365 from wangmiao1981/glm.
2016-10-26 16:12:55 -07:00
WeichenXu fb0a8a8dd7 [SPARK-17961][SPARKR][SQL] Add storageLevel to DataFrame for SparkR
## What changes were proposed in this pull request?

Add storageLevel to DataFrame for SparkR.
This is similar to this PR: https://github.com/apache/spark/pull/13780

but in R I do not make a class for `StorageLevel`;
instead I add a method `storageToString`.
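
A hedged sketch of the new getter:

```r
df <- createDataFrame(data.frame(x = 1:3))
persist(df, "MEMORY_AND_DISK")
storageLevel(df)   # a string description produced via storageToString (assumed)
```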

## How was this patch tested?

test added.

Author: WeichenXu <WeichenXu123@outlook.com>

Closes #15516 from WeichenXu123/storageLevel_df_r.
2016-10-26 13:26:43 -07:00
WeichenXu 12b3e8d2e0 [SPARK-18007][SPARKR][ML] update SparkR MLP - add initialWeights parameter
## What changes were proposed in this pull request?

update SparkR MLP, add the initialWeights parameter.

## How was this patch tested?

test added.

Author: WeichenXu <WeichenXu123@outlook.com>

Closes #15552 from WeichenXu123/mlp_r_add_initialWeight_param.
2016-10-25 21:42:59 -07:00
Felix Cheung 3a423f5a03 [SPARKR][BRANCH-2.0] R merge API doc and example fix
## What changes were proposed in this pull request?

Fixes for R doc

## How was this patch tested?

N/A

Author: Felix Cheung <felixcheung_m@hotmail.com>

Closes #15589 from felixcheung/rdocmergefix.

(cherry picked from commit 0e0d83a597)
Signed-off-by: Felix Cheung <felixcheung@apache.org>
2016-10-23 10:53:43 -07:00
Hossein e371040a01 [SPARK-17811] SparkR cannot parallelize data.frame with NA or NULL in Date columns
## What changes were proposed in this pull request?
NA date values are serialized as "NA" and NA time values are serialized as NaN from R. In the backend we did not have proper logic to deal with them. As a result we got an IllegalArgumentException for Date and a wrong value for time. This PR adds support for deserializing NA for Date and Time.
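
The kind of input that used to fail, as a hedged sketch:

```r
# the NA Date value used to raise IllegalArgumentException during deserialization
df <- createDataFrame(data.frame(d = as.Date(c("2016-01-01", NA))))
collect(df)$d
```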

## How was this patch tested?
* [x] TODO

Author: Hossein <hossein@databricks.com>

Closes #15421 from falaki/SPARK-17811.
2016-10-21 12:38:52 -07:00
Felix Cheung e21e1c946c [SPARK-18013][SPARKR] add crossJoin API
## What changes were proposed in this pull request?

Add crossJoin, and do not default to a cross join when joinExpr is left out.
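
A hedged sketch of the new explicit API:

```r
df1 <- createDataFrame(data.frame(a = 1:2))
df2 <- createDataFrame(data.frame(b = c("x", "y")))
wide <- crossJoin(df1, df2)   # the Cartesian product must now be requested explicitly
```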

## How was this patch tested?

unit test

Author: Felix Cheung <felixcheung_m@hotmail.com>

Closes #15559 from felixcheung/rcrossjoin.
2016-10-21 12:35:37 -07:00
Felix Cheung 4efdc764ed [SPARK-17674][SPARKR] check for warning in test output
## What changes were proposed in this pull request?

The testthat library we are using for testing R redirects warnings (and disables `options("warn" = 2)`), so we need a way to detect any new warning and fail.

## How was this patch tested?

manual testing, Jenkins

Author: Felix Cheung <felixcheung_m@hotmail.com>

Closes #15576 from felixcheung/rtestwarning.
2016-10-21 12:34:14 -07:00
Felix Cheung 3180272d2d [SPARKR] fix warnings
## What changes were proposed in this pull request?

Fix for a bunch of test warnings that were added recently.
We need to investigate why warnings are not turning into errors.

```
Warnings -----------------------------------------------------------------------
1. createDataFrame uses files for large objects (test_sparkSQL.R#215) - Use Sepal_Length instead of Sepal.Length  as column name

2. createDataFrame uses files for large objects (test_sparkSQL.R#215) - Use Sepal_Width instead of Sepal.Width  as column name

3. createDataFrame uses files for large objects (test_sparkSQL.R#215) - Use Petal_Length instead of Petal.Length  as column name

4. createDataFrame uses files for large objects (test_sparkSQL.R#215) - Use Petal_Width instead of Petal.Width  as column name

Consider adding
  importFrom("utils", "object.size")
to your NAMESPACE file.
```

## How was this patch tested?

unit tests

Author: Felix Cheung <felixcheung_m@hotmail.com>

Closes #15560 from felixcheung/rwarnings.
2016-10-20 21:12:55 -07:00
Hossein 5cc503f4fe [SPARK-17790][SPARKR] Support for parallelizing R data.frame larger than 2GB
## What changes were proposed in this pull request?
If the R data structure being parallelized is larger than `INT_MAX`, we use files to transfer data to the JVM. The serialization protocol mimics Python pickling. This allows us to simply call `PythonRDD.readRDDFromFile` to create the RDD.

I tested this on my MacBook. Following code works with this patch:
```R
intMax <- .Machine$integer.max
largeVec <- 1:intMax
rdd <- SparkR:::parallelize(sc, largeVec, 2)
```

## How was this patch tested?
* [x] Unit tests

Author: Hossein <hossein@databricks.com>

Closes #15375 from falaki/SPARK-17790.
2016-10-12 10:32:38 -07:00
Wenchen Fan b9a147181d [SPARK-17720][SQL] introduce static SQL conf
## What changes were proposed in this pull request?

SQLConf is session-scoped and mutable. However, we do have the requirement for a static SQL conf, which is global and immutable, e.g. the `schemaStringThreshold` in `HiveExternalCatalog`, the flag to enable/disable hive support, the global temp view database in https://github.com/apache/spark/pull/14897.

Actually we've already implemented static SQL conf implicitly via `SparkConf`; this PR just makes it explicit and exposes it to users, so that they can see the config value via a SQL command or `SparkSession.conf`, and forbids users from setting/unsetting static SQL confs.

## How was this patch tested?

new tests in SQLConfSuite

Author: Wenchen Fan <wenchen@databricks.com>

Closes #15295 from cloud-fan/global-conf.
2016-10-11 20:27:08 -07:00
Yanbo Liang 23405f324a [SPARK-15153][ML][SPARKR] Fix SparkR spark.naiveBayes error when label is numeric type
## What changes were proposed in this pull request?
Fix the SparkR ```spark.naiveBayes``` error when the response variable of the dataset is of numeric type.
See details and how to reproduce this bug at [SPARK-15153](https://issues.apache.org/jira/browse/SPARK-15153).
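
A hedged repro sketch with a numeric response:

```r
# a numeric 0/1 label (rather than a character/factor) was the failing case
df <- createDataFrame(data.frame(label = c(0, 1, 0, 1), f1 = c(0, 1, 1, 0)))
model <- spark.naiveBayes(df, label ~ f1)
```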

## How was this patch tested?
Add unit test.

Author: Yanbo Liang <ybliang8@gmail.com>

Closes #15431 from yanboliang/spark-15153-2.
2016-10-11 12:41:35 -07:00
hyukjinkwon 9d8ae853ec [SPARK-17665][SPARKR] Support options/mode all for read/write APIs and options in other types
## What changes were proposed in this pull request?

This PR includes the changes below (a short usage sketch follows the list):

  - Support `mode`/`options` in `read.parquet`, `write.parquet`, `read.orc`, `write.orc`, `read.text`, `write.text`, `read.json` and `write.json` APIs

  - Support other types (logical, numeric and string) as options for `write.df`, `read.df`, `read.parquet`, `write.parquet`, `read.orc`, `write.orc`, `read.text`, `write.text`, `read.json` and `write.json`
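
A short sketch of the new pass-through (hypothetical path):

```r
df <- createDataFrame(data.frame(name = "Bob", age = 16L))
# mode (and other named options) can now be passed to the type-specific writers
write.json(df, "people.json", mode = "overwrite")
df2 <- read.json("people.json")
```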

## How was this patch tested?

Unit tests in `test_sparkSQL.R`/ `utils.R`.

Author: hyukjinkwon <gurwls223@gmail.com>

Closes #15239 from HyukjinKwon/SPARK-17665.
2016-10-07 11:34:49 -07:00
hyukjinkwon c9fe10d4ed [SPARK-17658][SPARKR] read.df/write.df API taking path optionally in SparkR
## What changes were proposed in this pull request?

The `write.df`/`read.df` APIs require a path, which is not actually always necessary in Spark. Currently it only affects the datasources implementing `CreatableRelationProvider`. Spark does not have internal data sources implementing this, but it'd affect other external datasources.

In addition, we'd be able to use this in Spark's JDBC datasource after https://github.com/apache/spark/pull/12601 is merged.

**Before**

 - `read.df`

  ```r
> read.df(source = "json")
Error in dispatchFunc("read.df(path = NULL, source = NULL, schema = NULL, ...)",  :
  argument "x" is missing with no default
```

  ```r
> read.df(path = c(1, 2))
Error in dispatchFunc("read.df(path = NULL, source = NULL, schema = NULL, ...)",  :
  argument "x" is missing with no default
```

  ```r
> read.df(c(1, 2))
Error in invokeJava(isStatic = TRUE, className, methodName, ...) :
  java.lang.ClassCastException: java.lang.Double cannot be cast to java.lang.String
	at org.apache.spark.sql.execution.datasources.DataSource.hasMetadata(DataSource.scala:300)
	at
...
In if (is.na(object)) { :
...
```

 - `write.df`

  ```r
> write.df(df, source = "json")
Error in (function (classes, fdef, mtable)  :
  unable to find an inherited method for function ‘write.df’ for signature ‘"function", "missing"’
```

  ```r
> write.df(df, source = c(1, 2))
Error in (function (classes, fdef, mtable)  :
  unable to find an inherited method for function ‘write.df’ for signature ‘"SparkDataFrame", "missing"’
```

  ```r
> write.df(df, mode = TRUE)
Error in (function (classes, fdef, mtable)  :
  unable to find an inherited method for function ‘write.df’ for signature ‘"SparkDataFrame", "missing"’
```

**After**

- `read.df`

  ```r
> read.df(source = "json")
Error in loadDF : analysis error - Unable to infer schema for JSON at . It must be specified manually;
```

  ```r
> read.df(path = c(1, 2))
Error in f(x, ...) : path should be charactor, null or omitted.
```

  ```r
> read.df(c(1, 2))
Error in f(x, ...) : path should be charactor, null or omitted.
```

- `write.df`

  ```r
> write.df(df, source = "json")
Error in save : illegal argument - 'path' is not specified
```

  ```r
> write.df(df, source = c(1, 2))
Error in .local(df, path, ...) :
  source should be charactor, null or omitted. It is 'parquet' by default.
```

  ```r
> write.df(df, mode = TRUE)
Error in .local(df, path, ...) :
  mode should be charactor or omitted. It is 'error' by default.
```

## How was this patch tested?

Unit tests in `test_sparkSQL.R`

Author: hyukjinkwon <gurwls223@gmail.com>

Closes #15231 from HyukjinKwon/write-default-r.
2016-10-04 22:58:43 -07:00
Felix Cheung 068c198e95 [SPARKR][DOC] minor formatting and output cleanup for R vignettes
## What changes were proposed in this pull request?

Clean up output, format table, truncate long example output, hide warnings

(new - Left; existing - Right)
![image](https://cloud.githubusercontent.com/assets/8969467/19064018/5dcde4d0-89bc-11e6-857b-052df3f52a4e.png)

![image](https://cloud.githubusercontent.com/assets/8969467/19064034/6db09956-89bc-11e6-8e43-232d5c3fe5e6.png)

![image](https://cloud.githubusercontent.com/assets/8969467/19064058/88f09590-89bc-11e6-9993-61639e29dfdd.png)

![image](https://cloud.githubusercontent.com/assets/8969467/19064066/95ccbf64-89bc-11e6-877f-45af03ddcadc.png)

![image](https://cloud.githubusercontent.com/assets/8969467/19064082/a8445404-89bc-11e6-8532-26d8bc9b206f.png)

## How was this patch tested?

Run create-doc.sh manually

Author: Felix Cheung <felixcheung_m@hotmail.com>

Closes #15340 from felixcheung/vignettes.
2016-10-04 09:22:26 -07:00
hyukjinkwon 4a83395681 [SPARK-17499][SPARKR][FOLLOWUP] Check null first for layers in spark.mlp to avoid warnings in test results
## What changes were proposed in this pull request?

Some tests in `test_mllib.r` are as below:

```r
expect_error(spark.mlp(df, layers = NULL), "layers must be a integer vector with length > 1.")
expect_error(spark.mlp(df, layers = c()), "layers must be a integer vector with length > 1.")
```

The problem is that `is.na` is internally called via `na.omit` in `spark.mlp`, which causes warnings as below:

```
Warnings -----------------------------------------------------------------------
1. spark.mlp (test_mllib.R#400) - is.na() applied to non-(list or vector) of type 'NULL'

2. spark.mlp (test_mllib.R#401) - is.na() applied to non-(list or vector) of type 'NULL'
```
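
The fix, roughly, is to validate `layers` before anything calls `is.na`/`na.omit` on it. A minimal sketch of the pattern (the message text is taken from the tests above):

```r
# check NULL first so na.omit() is never reached with a NULL value
if (is.null(layers) || length(layers) <= 1 || !is.numeric(layers)) {
  stop("layers must be a integer vector with length > 1.")
}
layers <- as.integer(na.omit(layers))
```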

## How was this patch tested?

Manually tested. Also, Jenkins tests and AppVeyor.

Author: hyukjinkwon <gurwls223@gmail.com>

Closes #15232 from HyukjinKwon/remove-warnnings.
2016-09-27 21:19:59 -07:00
Yanbo Liang 93c743f1ac [SPARK-17577][FOLLOW-UP][SPARKR] SparkR spark.addFile supports adding directory recursively
## What changes were proposed in this pull request?
#15140 exposed ```JavaSparkContext.addFile(path: String, recursive: Boolean)``` to Python/R, so we can update SparkR ```spark.addFile``` to support adding directories recursively.
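
A hedged usage sketch of the new argument (directory path hypothetical):

```r
# ship a whole directory of dependencies to every node
spark.addFile("/path/to/deps", recursive = TRUE)
# added files are then resolvable on any node by name
spark.getSparkFiles("deps")
```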

## How was this patch tested?
Added unit test.

Author: Yanbo Liang <ybliang8@gmail.com>

Closes #15216 from yanboliang/spark-17577-2.
2016-09-26 16:47:57 -07:00
Jeff Zhang f62ddc5983 [SPARK-17210][SPARKR] sparkr.zip is not distributed to executors when running sparkr in RStudio
## What changes were proposed in this pull request?

Spark adds sparkr.zip to the archives only in YARN mode (SparkSubmit.scala).
```
    if (args.isR && clusterManager == YARN) {
      val sparkRPackagePath = RUtils.localSparkRPackagePath
      if (sparkRPackagePath.isEmpty) {
        printErrorAndExit("SPARK_HOME does not exist for R application in YARN mode.")
      }
      val sparkRPackageFile = new File(sparkRPackagePath.get, SPARKR_PACKAGE_ARCHIVE)
      if (!sparkRPackageFile.exists()) {
        printErrorAndExit(s"$SPARKR_PACKAGE_ARCHIVE does not exist for R application in YARN mode.")
      }
      val sparkRPackageURI = Utils.resolveURI(sparkRPackageFile.getAbsolutePath).toString

      // Distribute the SparkR package.
      // Assigns a symbol link name "sparkr" to the shipped package.
      args.archives = mergeFileLists(args.archives, sparkRPackageURI + "#sparkr")

      // Distribute the R package archive containing all the built R packages.
      if (!RUtils.rPackages.isEmpty) {
        val rPackageFile =
          RPackageUtils.zipRLibraries(new File(RUtils.rPackages.get), R_PACKAGE_ARCHIVE)
        if (!rPackageFile.exists()) {
          printErrorAndExit("Failed to zip all the built R packages.")
        }

        val rPackageURI = Utils.resolveURI(rPackageFile.getAbsolutePath).toString
        // Assigns a symbol link name "rpkg" to the shipped package.
        args.archives = mergeFileLists(args.archives, rPackageURI + "#rpkg")
      }
    }
```
So it is necessary to pass spark.master from the R process to the JVM. Otherwise sparkr.zip won't be distributed to the executors. Besides that, I also pass spark.yarn.keytab/spark.yarn.principal to the Spark side, because the JVM process needs them to access a secured cluster.

## How was this patch tested?

Verify it manually in R Studio using the following code.
```
Sys.setenv(SPARK_HOME="/Users/jzhang/github/spark")
.libPaths(c(file.path(Sys.getenv("SPARK_HOME"), "R", "lib"), .libPaths()))
library(SparkR)
sparkR.session(master="yarn-client", sparkConfig = list(spark.executor.instances="1"))
df <- as.DataFrame(mtcars)
head(df)

```

…

Author: Jeff Zhang <zjffdu@apache.org>

Closes #14784 from zjffdu/SPARK-17210.
2016-09-23 11:37:43 -07:00
WeichenXu f89808b0fd [SPARK-17499][SPARKR][ML][MLLIB] make the default params in sparkR spark.mlp consistent with MultilayerPerceptronClassifier
## What changes were proposed in this pull request?

Update `MultilayerPerceptronClassifierWrapper.fit` parameter types:
`layers: Array[Int]`
`seed: String`

Update several default params in SparkR `spark.mlp`:
`tol` --> 1e-6
`stepSize` --> 0.03
`seed` --> NULL (when seed is NULL, the Scala-side wrapper regards it as a `null` value and the default seed is used)
The R-side `seed` only supports 32-bit integers.

Remove the `layers` default value, and move it in front of the parameters that have default values.
Add a `layers` parameter validation check.
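
An illustrative call with the new defaults (data and layer sizes hypothetical):

```r
# layers has no default and now comes first among the tunable params;
# tol = 1e-6, stepSize = 0.03 and seed = NULL are the new defaults
model <- spark.mlp(df, layers = c(4, 5, 4, 3), blockSize = 128,
                   maxIter = 100, seed = 1)
```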

## How was this patch tested?

tests added.

Author: WeichenXu <WeichenXu123@outlook.com>

Closes #15051 from WeichenXu123/update_py_mlp_default.
2016-09-23 11:14:22 -07:00
Shivaram Venkataraman 9f24a17c59 Skip building R vignettes if Spark is not built
## What changes were proposed in this pull request?

When we build the docs separately we don't have the JAR files from the Spark build in the same tree. As the SparkR vignettes need to launch a SparkContext to be built, we skip building them if the JAR files don't exist.

## How was this patch tested?

To test this we can run the following:
```
build/mvn -DskipTests -Psparkr clean
./R/create-docs.sh
```
You should see a line `Skipping R vignettes as Spark JARs not found` at the end

Author: Shivaram Venkataraman <shivaram@cs.berkeley.edu>

Closes #15200 from shivaram/sparkr-vignette-skip.
2016-09-22 11:52:42 -07:00
Yanbo Liang 6902edab7e [SPARK-17315][FOLLOW-UP][SPARKR][ML] Fix print of Kolmogorov-Smirnov test summary
## What changes were proposed in this pull request?
#14881 added a Kolmogorov-Smirnov test wrapper to SparkR. I found that ```print.summary.KSTest``` was implemented inappropriately and had no effect.
Running the following code for KSTest:
```r
data <- data.frame(test = c(0.1, 0.15, 0.2, 0.3, 0.25, -1, -0.5))
df <- createDataFrame(data)
testResult <- spark.kstest(df, "test", "norm")
summary(testResult)
```
Before this PR:
![image](https://cloud.githubusercontent.com/assets/1962026/18615016/b9a2823a-7d4f-11e6-934b-128beade355e.png)
After this PR:
![image](https://cloud.githubusercontent.com/assets/1962026/18615014/aafe2798-7d4f-11e6-8b99-c705bb9fe8f2.png)
The new implementation is similar to [```print.summary.GeneralizedLinearRegressionModel```](https://github.com/apache/spark/blob/master/R/pkg/R/mllib.R#L284) in SparkR and [```print.summary.glm```](https://svn.r-project.org/R/trunk/src/library/stats/R/glm.R) in native R.

BTW, I removed the comparison against ```print.summary.KSTest``` output in the unit test, since it only wraps the summary output, which has already been checked. Another reason is that these comparisons would print summary information to the test console, making the test output a mess.

## How was this patch tested?
Existing test.

Author: Yanbo Liang <ybliang8@gmail.com>

Closes #15139 from yanboliang/spark-17315.
2016-09-21 20:14:18 -07:00
Yanbo Liang c133907c5d [SPARK-17577][SPARKR][CORE] SparkR support add files to Spark job and get by executors
## What changes were proposed in this pull request?
Scala/Python users can add files to a Spark job by the submit option ```--files``` or ```SparkContext.addFile()```. Meanwhile, users can get an added file by ```SparkFiles.get(filename)```.
We should also support this function for SparkR users, since they have the same need for shared dependency files. For example, SparkR users can download third-party R packages to the driver first, add these files to the Spark job as dependencies by this API, and then each executor can install the packages by ```install.packages```.
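
A sketch of the corresponding SparkR calls (file path hypothetical):

```r
# add a shared dependency from the driver...
spark.addFile("/tmp/deps/pkg.tar.gz")
# ...and resolve its local path on any node
spark.getSparkFiles("pkg.tar.gz")
```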

## How was this patch tested?
Add unit test.

Author: Yanbo Liang <ybliang8@gmail.com>

Closes #15131 from yanboliang/spark-17577.
2016-09-21 20:08:28 -07:00
Sean Owen d720a40194 [SPARK-17297][DOCS] Clarify window/slide duration as absolute time, not relative to a calendar
## What changes were proposed in this pull request?

Clarify that slide and window duration are absolute, and not relative to a calendar.

## How was this patch tested?

Doc build (no functional change)

Author: Sean Owen <sowen@cloudera.com>

Closes #15142 from srowen/SPARK-17297.
2016-09-19 09:38:25 +01:00
Sean Owen dc0a4c9161 [SPARK-17445][DOCS] Reference an ASF page as the main place to find third-party packages
## What changes were proposed in this pull request?

Point references to spark-packages.org to https://cwiki.apache.org/confluence/display/SPARK/Third+Party+Projects

This will be accompanied by a parallel change to the spark-website repo, and additional changes to this wiki.

## How was this patch tested?

Jenkins tests.

Author: Sean Owen <sowen@cloudera.com>

Closes #15075 from srowen/SPARK-17445.
2016-09-14 10:10:16 +01:00
junyangq a454a4d86b [SPARK-17317][SPARKR] Add SparkR vignette
## What changes were proposed in this pull request?

This PR tries to add a SparkR vignette, which serves as a friendly guide through the functionality provided by SparkR.

## How was this patch tested?

Manual test.

Author: junyangq <qianjunyang@gmail.com>
Author: Shivaram Venkataraman <shivaram@cs.berkeley.edu>
Author: Junyang Qian <junyangq@databricks.com>

Closes #14980 from junyangq/SPARKR-vignette.
2016-09-13 21:01:03 -07:00
Xin Ren 71b7d42f5f [SPARK-16445][MLLIB][SPARKR] Fix @return description for sparkR mlp summary() method
## What changes were proposed in this pull request?

Fix summary() method's `return` description for spark.mlp

## How was this patch tested?

Ran tests locally on my laptop.

Author: Xin Ren <iamshrek@126.com>

Closes #15015 from keypointt/SPARK-16445-2.
2016-09-10 09:52:53 -07:00
Yanbo Liang 2ed601217f [SPARK-17464][SPARKR][ML] SparkR spark.als argument reg should be 0.1 by default.
## What changes were proposed in this pull request?
The SparkR ```spark.als``` argument ```reg``` should be 0.1 by default, to be consistent with ML.

## How was this patch tested?
Existing tests.

Author: Yanbo Liang <ybliang8@gmail.com>

Closes #15021 from yanboliang/spark-17464.
2016-09-09 05:43:34 -07:00
Felix Cheung f0d21b7f90 [SPARK-17442][SPARKR] Additional arguments in write.df are not passed to data source
## What changes were proposed in this pull request?

Additional options were not passed down in `write.df`.

## How was this patch tested?

unit tests
falaki shivaram

Author: Felix Cheung <felixcheung_m@hotmail.com>

Closes #15010 from felixcheung/testreadoptions.
2016-09-08 08:22:58 -07:00
hyukjinkwon 6b41195bca [SPARK-17339][SPARKR][CORE] Fix some R tests and use Path.toUri in SparkContext for Windows paths in SparkR
## What changes were proposed in this pull request?

This PR fixes the Windows path issues in several APIs. Please refer https://issues.apache.org/jira/browse/SPARK-17339 for more details.

## How was this patch tested?

Tests via AppVeyor CI - https://ci.appveyor.com/project/HyukjinKwon/spark/build/82-SPARK-17339-fix-r

Also, manually,

![2016-09-06 3 14 38](https://cloud.githubusercontent.com/assets/6477701/18263406/b93a98be-7444-11e6-9521-b28ee65a4771.png)

Author: hyukjinkwon <gurwls223@gmail.com>

Closes #14960 from HyukjinKwon/SPARK-17339.
2016-09-07 19:24:03 +09:00
Clark Fitzgerald 9fccde4ff8 [SPARK-16785] R dapply doesn't return array or raw columns
## What changes were proposed in this pull request?

Fixed bug in `dapplyCollect` by changing the `compute` function of `worker.R` to explicitly handle raw (binary) vectors.

cc shivaram

## How was this patch tested?

Unit tests

Author: Clark Fitzgerald <clarkfitzg@gmail.com>

Closes #14783 from clarkfitzg/SPARK-16785.
2016-09-06 23:40:37 -07:00
Junyang Qian abb2f92103 [SPARK-17315][SPARKR] Kolmogorov-Smirnov test SparkR wrapper
## What changes were proposed in this pull request?

This PR tries to add a Kolmogorov-Smirnov test wrapper to SparkR. This wrapper implementation only supports the one-sample test against the normal distribution.

## How was this patch tested?

R unit test.

Author: Junyang Qian <junyangq@databricks.com>

Closes #14881 from junyangq/SPARK-17315.
2016-09-03 12:26:30 -07:00
Junyang Qian d2fde6b72c [SPARKR][MINOR] Fix docs for sparkR.session and count
## What changes were proposed in this pull request?

This PR tries to add some more explanation to `sparkR.session`. It also modifies the doc for `count` so that, when grouped in one doc, the description doesn't confuse users.

## How was this patch tested?

Manual test.

![screen shot 2016-09-02 at 1 21 36 pm](https://cloud.githubusercontent.com/assets/15318264/18217198/409613ac-7110-11e6-8dae-cb0c8df557bf.png)

Author: Junyang Qian <junyangq@databricks.com>

Closes #14942 from junyangq/fixSparkRSessionDoc.
2016-09-02 21:11:57 -07:00
Srinath Shankar e6132a6cf1 [SPARK-17298][SQL] Require explicit CROSS join for cartesian products
## What changes were proposed in this pull request?

Require the use of CROSS join syntax in SQL (and a new crossJoin
DataFrame API) to specify explicit cartesian products between relations.
By cartesian product we mean a join between relations R and S where
there is no join condition involving columns from both R and S.

If a cartesian product is detected in the absence of an explicit CROSS
join, an error must be thrown. Turning on the
"spark.sql.crossJoin.enabled" configuration flag will disable this check
and allow cartesian products without an explicit CROSS join.

The new crossJoin DataFrame API must be used to specify explicit cross
joins. The existing join(DataFrame) method will produce an INNER join
that will require a subsequent join condition.
That is, df1.join(df2) is equivalent to select * from df1, df2.

## How was this patch tested?

Added cross-join.sql to the SQLQueryTestSuite to test the check for cartesian products. Added a couple of tests to the DataFrameJoinSuite to test the crossJoin API. Modified various other test suites to explicitly specify a cross join where an INNER join or a comma-separated list was previously used.

Author: Srinath Shankar <srinath@databricks.com>

Closes #14866 from srinathshankar/crossjoin.
2016-09-03 00:20:43 +02:00
Felix Cheung eac1d0e921 [SPARK-17376][SPARKR] followup - change since version
## What changes were proposed in this pull request?

change since version in doc

## How was this patch tested?

manual

Author: Felix Cheung <felixcheung_m@hotmail.com>

Closes #14939 from felixcheung/rsparkversion2.
2016-09-02 11:08:25 -07:00
Felix Cheung 419eefd811 [SPARKR][DOC] regexp_extract should doc that it returns empty string when match fails
## What changes were proposed in this pull request?

Doc change - see https://issues.apache.org/jira/browse/SPARK-16324

## How was this patch tested?

manual check

Author: Felix Cheung <felixcheung_m@hotmail.com>

Closes #14934 from felixcheung/regexpextractdoc.
2016-09-02 10:28:37 -07:00
Felix Cheung 812333e433 [SPARK-17376][SPARKR] Spark version should be available in R
## What changes were proposed in this pull request?

Add sparkR.version() API.

```
> sparkR.version()
[1] "2.1.0-SNAPSHOT"
```

## How was this patch tested?

manual, unit tests

Author: Felix Cheung <felixcheung_m@hotmail.com>

Closes #14935 from felixcheung/rsparksessionversion.
2016-09-02 10:12:10 -07:00
wm624@hotmail.com 0f30cdedbd [SPARK-16883][SPARKR] SQL decimal type is not properly cast to number when collecting SparkDataFrame
## What changes were proposed in this pull request?

```r
registerTempTable(createDataFrame(iris), "iris")
str(collect(sql("select cast('1' as double) as x, cast('2' as decimal) as y  from iris limit 5")))

'data.frame':	5 obs. of  2 variables:
 $ x: num  1 1 1 1 1
 $ y:List of 5
  ..$ : num 2
  ..$ : num 2
  ..$ : num 2
  ..$ : num 2
  ..$ : num 2
```

The problem is that Spark returns the column type `decimal(10, 0)` instead of `decimal`. Thus, `decimal(10, 0)` is not handled correctly; it should be handled as "double".

As discussed in the JIRA thread, we have two potential fixes:
1). Scala-side fix: add a new case when writing the object back. However, I can't use spark.sql.types._ in Spark core due to dependency issues, and I couldn't find a way to do the type case match.

2). SparkR-side fix: add a helper function to check special types like `"decimal(10, 0)"` and replace them with `double`, which is a PRIMITIVE type. This helper is generic for adding new type handling in the future.
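
A minimal sketch of the SparkR-side helper idea (the function name and exact pattern here are hypothetical):

```r
# map special column type strings such as "decimal(10, 0)"
# to the primitive type "double"; return NULL for ordinary types
handleSpecialTypes <- function(type) {
  if (grepl("^decimal(.+)$", type)) "double" else NULL
}
```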

I open this PR to discuss the pros and cons of both approaches. If we want to do the Scala-side fix, we need to find a way to match the case of DecimalType and StructType in Spark core.

## How was this patch tested?

Manual test:
```r
> str(collect(sql("select cast('1' as double) as x, cast('2' as decimal) as y  from iris limit 5")))
'data.frame':	5 obs. of  2 variables:
 $ x: num  1 1 1 1 1
 $ y: num  2 2 2 2 2
```
R unit tests

Author: wm624@hotmail.com <wm624@hotmail.com>

Closes #14613 from wangmiao1981/type.
2016-09-02 01:47:17 -07:00
Xin Ren 7a5000f39e [SPARK-17241][SPARKR][MLLIB] SparkR spark.glm should have configurable regularization parameter
https://issues.apache.org/jira/browse/SPARK-17241

## What changes were proposed in this pull request?

Spark has a configurable L2 regularization parameter for generalized linear regression. It is very important to expose it in SparkR so that users can run ridge regression.
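
A hedged example of ridge regression with the new parameter (data and formula hypothetical):

```r
# L2 regularization strength is exposed as regParam
model <- spark.glm(df, Sepal_Length ~ Sepal_Width, family = "gaussian",
                   regParam = 0.3)
summary(model)
```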

## How was this patch tested?

Test manually on local laptop.

Author: Xin Ren <iamshrek@126.com>

Closes #14856 from keypointt/SPARK-17241.
2016-08-31 21:39:31 -07:00
Junyang Qian d008638fbe [SPARKR][MINOR] Fix windowPartitionBy example
## What changes were proposed in this pull request?

The usage in the original example is incorrect. This PR fixes it.

## How was this patch tested?

Manual test.

Author: Junyang Qian <junyangq@databricks.com>

Closes #14903 from junyangq/SPARKR-FixWindowPartitionByDoc.
2016-08-31 21:28:53 -07:00
Shivaram Venkataraman 2f9c27364e [SPARK-16581][SPARKR] Fix JVM API tests in SparkR
## What changes were proposed in this pull request?

Remove cleanup.jobj test. Use JVM wrapper API for other test cases.

## How was this patch tested?

Run R unit tests with testthat 1.0

Author: Shivaram Venkataraman <shivaram@cs.berkeley.edu>

Closes #14904 from shivaram/sparkr-jvm-tests-fix.
2016-08-31 16:56:41 -07:00
hyukjinkwon 50bb142332 [SPARK-17326][SPARKR] Fix tests with HiveContext in SparkR not to be skipped always
## What changes were proposed in this pull request?

Currently, `HiveContext` in SparkR is not being tested and always skipped.
This is because the initialization of `TestHiveContext` fails when trying to load non-existent data paths (test tables).

This is introduced from https://github.com/apache/spark/pull/14005

This PR enables these tests for SparkR.

## How was this patch tested?

Manually,

**Before** (on Mac OS)

```
...
Skipped ------------------------------------------------------------------------
1. create DataFrame from RDD (test_sparkSQL.R#200) - Hive is not build with SparkSQL, skipped
2. test HiveContext (test_sparkSQL.R#1041) - Hive is not build with SparkSQL, skipped
3. read/write ORC files (test_sparkSQL.R#1748) - Hive is not build with SparkSQL, skipped
4. enableHiveSupport on SparkSession (test_sparkSQL.R#2480) - Hive is not build with SparkSQL, skipped
5. sparkJars tag in SparkContext (test_Windows.R#21) - This test is only for Windows, skipped
...
```

**After** (on Mac OS)

```
...
Skipped ------------------------------------------------------------------------
1. sparkJars tag in SparkContext (test_Windows.R#21) - This test is only for Windows, skipped
...
```

Please refer the tests below (on Windows)
 - Before: https://ci.appveyor.com/project/HyukjinKwon/spark/build/45-test123
 - After: https://ci.appveyor.com/project/HyukjinKwon/spark/build/46-test123

Author: hyukjinkwon <gurwls223@gmail.com>

Closes #14889 from HyukjinKwon/SPARK-17326.
2016-08-31 14:02:21 -07:00
hyukjinkwon 9953442aca [MINOR][SPARKR] Verbose build comment in WINDOWS.md rather than promoting default build without Hive
## What changes were proposed in this pull request?

This PR fixes `WINDOWS.md` to refer readers to the other build profiles in http://spark.apache.org/docs/latest/building-spark.html#building-with-buildmvn rather than directly telling them to run `mvn -DskipTests -Psparkr package` without Hive support.

## How was this patch tested?

Manually,

<img width="626" alt="2016-08-31 6 01 08" src="https://cloud.githubusercontent.com/assets/6477701/18122549/f6297b2c-6fa4-11e6-9b5e-fd4347355d87.png">

Author: hyukjinkwon <gurwls223@gmail.com>

Closes #14890 from HyukjinKwon/minor-build-r.
2016-08-31 09:06:23 -07:00
Shivaram Venkataraman 736a7911cb [SPARK-16581][SPARKR] Make JVM backend calling functions public
## What changes were proposed in this pull request?

This change exposes a public API in SparkR to create objects and call methods on the Spark driver JVM.
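
Illustrative calls to the newly public API (the Java class and methods here are just examples):

```r
# create a JVM object and invoke methods on it from R
jarr <- sparkR.newJObject("java.util.ArrayList")
sparkR.callJMethod(jarr, "add", 1L)
# call a static method
sparkR.callJStatic("java.lang.Math", "min", 1L, 2L)
```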

## How was this patch tested?

Unit tests, CRAN checks

Author: Shivaram Venkataraman <shivaram@cs.berkeley.edu>

Closes #14775 from shivaram/sparkr-java-api.
2016-08-29 12:55:32 -07:00
Junyang Qian 6a0fda2c05 [SPARKR][MINOR] Fix LDA doc
## What changes were proposed in this pull request?

This PR tries to fix the name of the `SparkDataFrame` used in the example. Also, it gives a reference url of an example data file so that users can play with.

## How was this patch tested?

Manual test.

Author: Junyang Qian <junyangq@databricks.com>

Closes #14853 from junyangq/SPARKR-FixLDADoc.
2016-08-29 10:23:10 -07:00
Junyang Qian 1883216235 [SPARKR][MINOR] Fix example of spark.naiveBayes
## What changes were proposed in this pull request?

The original example doesn't work because the features are not categorical. This PR fixes this by changing to another dataset.

## How was this patch tested?

Manual test.

Author: Junyang Qian <junyangq@databricks.com>

Closes #14820 from junyangq/SPARK-FixNaiveBayes.
2016-08-26 11:01:48 -07:00
Junyang Qian 3a60be4b15 [SPARKR][MINOR] Add installation message for remote master mode and improve other messages
## What changes were proposed in this pull request?

This PR gives informative message to users when they try to connect to a remote master but don't have Spark package in their local machine.

As a clarification: for now, automatic installation will only happen if users start SparkR in the R console (rather than from sparkr-shell) and connect to a local master. In remote master mode, the local Spark package is still needed, but we will not trigger the install.spark function because the versions have to match those on the cluster, which involves more user input. Instead, we try to provide a detailed message that may help the users.

Some of the other messages have also been slightly changed.

## How was this patch tested?

Manual test.

Author: Junyang Qian <junyangq@databricks.com>

Closes #14761 from junyangq/SPARK-16579-V1.
2016-08-24 16:04:14 -07:00
Junyang Qian 18708f76c3 [SPARKR][MINOR] Add more examples to window function docs
## What changes were proposed in this pull request?

This PR adds more examples to window function docs to make them more accessible to the users.

It also fixes default value issues for `lag` and `lead`.
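
A short hedged example of the documented pattern (column names assumed):

```r
ws <- orderBy(windowPartitionBy("dept"), "salary")
# lag/lead default sensibly when no explicit default value is given
df$prev_salary <- over(lag(df$salary, 1), ws)
df$next_salary <- over(lead(df$salary, 1), ws)
```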

## How was this patch tested?

Manual test, R unit test.

Author: Junyang Qian <junyangq@databricks.com>

Closes #14779 from junyangq/SPARKR-FixWindowFunctionDocs.
2016-08-24 16:00:04 -07:00
Felix Cheung 945c04bcd4 [MINOR][SPARKR] fix R MLlib parameter documentation
## What changes were proposed in this pull request?

Fixed several misplaced param tags; they should be on the spark.* method generics.

## How was this patch tested?

run knitr
junyangq

Author: Felix Cheung <felixcheung_m@hotmail.com>

Closes #14792 from felixcheung/rdocmllib.
2016-08-24 15:59:09 -07:00
Xin Ren 2fbdb60639 [SPARK-16445][MLLIB][SPARKR] Multilayer Perceptron Classifier wrapper in SparkR
https://issues.apache.org/jira/browse/SPARK-16445

## What changes were proposed in this pull request?

Create Multilayer Perceptron Classifier wrapper in SparkR

## How was this patch tested?

Tested manually on local machine

Author: Xin Ren <iamshrek@126.com>

Closes #14447 from keypointt/SPARK-16445.
2016-08-24 11:18:10 -07:00
Junyang Qian d2932a0e98 [SPARKR][MINOR] Fix doc for show method
## What changes were proposed in this pull request?

The original doc of `show` put methods for multiple classes together but the text only talks about `SparkDataFrame`. This PR tries to fix this problem.

## How was this patch tested?

Manual test.

Author: Junyang Qian <junyangq@databricks.com>

Closes #14776 from junyangq/SPARK-FixShowDoc.
2016-08-24 10:40:09 -07:00
Junyang Qian 8fd63e808e [SPARKR][MINOR] Remove reference link for common Windows environment variables
## What changes were proposed in this pull request?

The PR removes a reference link in the doc for environment variables for common Windows folders. The CRAN check gave code 503 (service unavailable) on the original link.

## How was this patch tested?

Manual check.

Author: Junyang Qian <junyangq@databricks.com>

Closes #14767 from junyangq/SPARKR-RemoveLink.
2016-08-23 11:22:32 -07:00
Felix Cheung d2b3d3e63e [SPARKR][MINOR] Update R DESCRIPTION file
## What changes were proposed in this pull request?

Update DESCRIPTION

## How was this patch tested?

Run install and CRAN tests

Author: Felix Cheung <felixcheung_m@hotmail.com>

Closes #14764 from felixcheung/rpackagedescription.
2016-08-22 20:15:03 -07:00
Shivaram Venkataraman 920806ab27 [SPARK-16577][SPARKR] Add CRAN documentation checks to run-tests.sh
## What changes were proposed in this pull request?

This change adds CRAN documentation checks to be run as a part of `R/run-tests.sh`. As this script is also used by Jenkins, this means that we will get documentation checks on every PR going forward.

## How was this patch tested?

By running `R/run-tests.sh`, which is also used by Jenkins.

Author: Shivaram Venkataraman <shivaram@cs.berkeley.edu>

Closes #14759 from shivaram/sparkr-cran-jenkins.
2016-08-22 17:09:32 -07:00
Felix Cheung 71afeeea4e [SPARK-16508][SPARKR] doc updates and more CRAN check fixes
## What changes were proposed in this pull request?

- replace ``` ` ``` in code docs with `\code{thing}`
- remove the added `...` for drop(DataFrame)
- fix remaining CRAN check warnings

## How was this patch tested?

create doc with knitr

junyangq

Author: Felix Cheung <felixcheung_m@hotmail.com>

Closes #14734 from felixcheung/rdoccleanup.
2016-08-22 15:53:10 -07:00
Shivaram Venkataraman 6f3cd36f93 [SPARKR][MINOR] Add Xiangrui and Felix to maintainers
## What changes were proposed in this pull request?

This change adds Xiangrui Meng and Felix Cheung to the maintainers field in the package description.

## How was this patch tested?


Author: Shivaram Venkataraman <shivaram@cs.berkeley.edu>

Closes #14758 from shivaram/sparkr-maintainers.
2016-08-22 12:53:52 -07:00
Felix Cheung 0583ecda1b [SPARK-17173][SPARKR] R MLlib refactor, cleanup, reformat, fix deprecation in test
## What changes were proposed in this pull request?

refactor, cleanup, reformat, fix deprecation in test

## How was this patch tested?

unit tests, manual tests

Author: Felix Cheung <felixcheung_m@hotmail.com>

Closes #14735 from felixcheung/rmllibutil.
2016-08-22 12:27:33 -07:00
Junyang Qian 209e1b3c06 [SPARKR][MINOR] Fix Cache Folder Path in Windows
## What changes were proposed in this pull request?

This PR tries to fix the scheme of local cache folder in Windows. The name of the environment variable should be `LOCALAPPDATA` rather than `%LOCALAPPDATA%`.

## How was this patch tested?

Manual test in Windows 7.

Author: Junyang Qian <junyangq@databricks.com>

Closes #14743 from junyangq/SPARKR-FixWindowsInstall.
2016-08-22 10:03:48 -07:00
Xiangrui Meng ab7143463d [MINOR][R] add SparkR.Rcheck/ and SparkR_*.tar.gz to R/.gitignore
## What changes were proposed in this pull request?

Ignore temp files generated by `check-cran.sh`.

Author: Xiangrui Meng <meng@databricks.com>

Closes #14740 from mengxr/R-gitignore.
2016-08-21 10:31:25 -07:00
Yanbo Liang 7f08a60b6e [SPARK-16961][FOLLOW-UP][SPARKR] More robust test case for spark.gaussianMixture.
## What changes were proposed in this pull request?
#14551 fixed off-by-one bug in ```randomizeInPlace``` and some test failure caused by this fix.
But for SparkR ```spark.gaussianMixture``` test case, the fix is inappropriate. It only changed the output result of native R which should be compared by SparkR, however, it did not change the R code in annotation which is used for reproducing the result in native R. It will confuse users who can not reproduce the same result in native R. This PR sends a more robust test case which can produce same result between SparkR and native R.

## How was this patch tested?
Unit test update.

Author: Yanbo Liang <ybliang8@gmail.com>

Closes #14730 from yanboliang/spark-16961-followup.
2016-08-21 02:23:31 -07:00
Junyang Qian 01401e965b [SPARK-16508][SPARKR] Fix CRAN undocumented/duplicated arguments warnings.
## What changes were proposed in this pull request?

This PR tries to fix all the remaining "undocumented/duplicated arguments" warnings given by CRAN-check.

One warning left is the doc for R `stats::glm` exported in SparkR. To mute that warning, we would have to also provide documentation for all arguments of that non-SparkR function.

Some previous conversation is in #14558.

## How was this patch tested?

R unit test and `check-cran.sh` script (with no-test).

Author: Junyang Qian <junyangq@databricks.com>

Closes #14705 from junyangq/SPARK-16508-master.
2016-08-20 06:59:23 -07:00
Junyang Qian acac7a508a [SPARK-16443][SPARKR] Alternating Least Squares (ALS) wrapper
## What changes were proposed in this pull request?

Add Alternating Least Squares wrapper in SparkR. Unit tests have been updated.
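
A hedged usage sketch (column names and hyperparameters assumed):

```r
model <- spark.als(df, ratingCol = "rating", userCol = "user",
                   itemCol = "item", rank = 10, reg = 0.1)
head(predict(model, df))
```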

## How was this patch tested?

SparkR unit tests.

![screen shot 2016-07-27 at 3 50 31 pm](https://cloud.githubusercontent.com/assets/15318264/17195347/f7a6352a-5411-11e6-8e21-61a48070192a.png)
![screen shot 2016-07-27 at 3 50 46 pm](https://cloud.githubusercontent.com/assets/15318264/17195348/f7a7d452-5411-11e6-845f-6d292283bc28.png)

Author: Junyang Qian <junyangq@databricks.com>

Closes #14384 from junyangq/SPARK-16443.
2016-08-19 14:24:09 -07:00
Nick Lavers 5377fc6236 [SPARK-16961][CORE] Fixed off-by-one error that biased randomizeInPlace
JIRA issue link:
https://issues.apache.org/jira/browse/SPARK-16961

Changed one line of Utils.randomizeInPlace to allow elements to stay in place.

Created a unit test that runs a Pearson's chi-squared test to determine whether the output diverges significantly from a uniform distribution.

Author: Nick Lavers <nick.lavers@videoamp.com>

Closes #14551 from nicklavers/SPARK-16961-randomizeInPlace.
2016-08-19 10:11:59 +01:00
Xusen Yin b72bb62d42 [SPARK-16447][ML][SPARKR] LDA wrapper in SparkR
## What changes were proposed in this pull request?

Add an LDA wrapper in SparkR with the following interfaces (a usage sketch follows the list):

- spark.lda(data, ...)

- spark.posterior(object, newData, ...)

- spark.perplexity(object, ...)

- summary(object)

- write.ml(object)

- read.ml(path)
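
A minimal hedged sketch of these interfaces in use (data, k, and the model path are assumed):

```r
model <- spark.lda(df, k = 10, maxIter = 20)
summary(model)
topics <- spark.posterior(model, newDf)
perp <- spark.perplexity(model, newDf)
write.ml(model, "/tmp/lda-model")   # hypothetical path
model2 <- read.ml("/tmp/lda-model")
```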

## How was this patch tested?

Test with SparkR unit test.

Author: Xusen Yin <yinxusen@gmail.com>

Closes #14229 from yinxusen/SPARK-16447.
2016-08-18 05:33:52 -07:00
Yanbo Liang 4d92af310a [SPARK-16446][SPARKR][ML] Gaussian Mixture Model wrapper in SparkR
## What changes were proposed in this pull request?
A Gaussian Mixture Model wrapper in SparkR, similar to R's ```mvnormalmixEM```.
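
A hedged usage sketch (formula and k assumed):

```r
model <- spark.gaussianMixture(df, ~ V1 + V2, k = 2)
summary(model)
head(predict(model, df))
```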

## How was this patch tested?
Unit test.

Author: Yanbo Liang <ybliang8@gmail.com>

Closes #14392 from yanboliang/spark-16446.
2016-08-17 11:18:33 -07:00
wm624@hotmail.com 363793f2bf [SPARK-16444][SPARKR] Isotonic Regression wrapper in SparkR
## What changes were proposed in this pull request?

Add an Isotonic Regression wrapper in SparkR.

Wrappers in R and Scala are added, along with unit tests and documentation.
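
A minimal hedged sketch of the wrapper (column names assumed):

```r
model <- spark.isoreg(df, label ~ feature, isotonic = TRUE)
head(predict(model, df))
```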

## How was this patch tested?
Manually tested with sudo ./R/run-tests.sh

Author: wm624@hotmail.com <wm624@hotmail.com>

Closes #14182 from wangmiao1981/isoR.
2016-08-17 06:15:04 -07:00
Felix Cheung c34b546d67 [SPARK-16519][SPARKR] Handle SparkR RDD generics that create warnings in R CMD check
## What changes were proposed in this pull request?

Rename RDD functions for now to avoid CRAN check warnings.
Some RDD functions share generics with DataFrame functions (hence the problem), so after the renames we need to add new generics, for now.

## How was this patch tested?

unit tests

Author: Felix Cheung <felixcheung_m@hotmail.com>

Closes #14626 from felixcheung/rrddfunctions.
2016-08-16 11:19:18 -07:00
Yanbo Liang d37ea3c09c [MINOR][SPARKR] spark.glm weightCol should in the signature.
## What changes were proposed in this pull request?
Fix the issue that ```spark.glm``` ```weightCol``` should be in the signature.

## How was this patch tested?
Existing tests.

Author: Yanbo Liang <ybliang8@gmail.com>

Closes #14641 from yanboliang/weightCol.
2016-08-16 10:52:35 -07:00
Junyang Qian 564fe614c1 [SPARK-16508][SPARKR] Split docs for arrange and orderBy methods
## What changes were proposed in this pull request?

This PR splits the arrange and orderBy methods according to their functionality (the former for sorting a SparkDataFrame and the latter for a WindowSpec).

## How was this patch tested?

![screen shot 2016-08-06 at 6 39 19 pm](https://cloud.githubusercontent.com/assets/15318264/17459969/51eade28-5c05-11e6-8ca1-8d8a8e344bab.png)
![screen shot 2016-08-06 at 6 39 29 pm](https://cloud.githubusercontent.com/assets/15318264/17459966/51e3c246-5c05-11e6-8d35-3e905ca48676.png)
![screen shot 2016-08-06 at 6 40 02 pm](https://cloud.githubusercontent.com/assets/15318264/17459967/51e650ec-5c05-11e6-8698-0f037f5199ff.png)

Author: Junyang Qian <junyangq@databricks.com>

Closes #14522 from junyangq/SPARK-16508-0.
2016-08-15 11:03:03 -07:00
Junyang Qian 214ba66a03 [SPARK-16579][SPARKR] add install.spark function
## What changes were proposed in this pull request?

Add an install.spark function to the SparkR package. Users can run `install.spark()` to install Spark to a local directory within R (a usage sketch follows the list of updates below).

Updates:

Several changes have been made:

- `install.spark()`
    - check existence of tar file in the cache folder, and download only if not found
    - trial priority of mirror_url look-up: user-provided -> preferred mirror site from apache website -> hardcoded backup option
    - use 2.0.0

- `sparkR.session()`
    - can install spark when not found in `SPARK_HOME`
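
A hedged sketch of the function in use (argument values hypothetical):

```r
# download and install Spark under the local cache folder
install.spark(hadoopVersion = "2.7", overwrite = FALSE)
# or let sparkR.session() trigger the install when SPARK_HOME is unset
sparkR.session()
```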

## How was this patch tested?

Manual tests, running the check-cran.sh script added in #14173.

Author: Junyang Qian <junyangq@databricks.com>

Closes #14258 from junyangq/SPARK-16579.
2016-08-10 11:18:23 -07:00
Yanbo Liang d4a9122430 [SPARK-16710][SPARKR][ML] spark.glm should support weightCol
## What changes were proposed in this pull request?
Training GLMs on weighted datasets is a very important use case, but it is not currently supported by SparkR. Users can pass the argument ```weights``` to specify the weights vector in native R. For ```spark.glm```, we can pass in ```weightCol```, which is consistent with MLlib.
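
A hedged example of the new argument (column names assumed):

```r
# weightCol names a column of observation weights, as in MLlib
model <- spark.glm(df, y ~ x1 + x2, family = "gaussian", weightCol = "w")
summary(model)
```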

## How was this patch tested?
Unit test.

Author: Yanbo Liang <ybliang8@gmail.com>

Closes #14346 from yanboliang/spark-16710.
2016-08-10 10:53:48 -07:00
Xin Ren 1203c8415c [MINOR][SPARKR] R API documentation for "coltypes" is confusing
## What changes were proposed in this pull request?

R API documentation for "coltypes" is confusing, found when working on another ticket.

Current version http://spark.apache.org/docs/2.0.0/api/R/coltypes.html, where parameters have 2 "x" which is a duplicate, and also the example is not very clear

![current](https://cloud.githubusercontent.com/assets/3925641/17386808/effb98ce-59a2-11e6-9657-d477d258a80c.png)

![screen shot 2016-08-03 at 5 56 00 pm](https://cloud.githubusercontent.com/assets/3925641/17386884/91831096-59a3-11e6-84af-39890b3d45d8.png)

## How was this patch tested?

Tested manually on local machine. And the screenshots are like below:

![screen shot 2016-08-07 at 11 29 20 pm](https://cloud.githubusercontent.com/assets/3925641/17471144/df36633c-5cf6-11e6-8238-4e32ead0e529.png)

![screen shot 2016-08-03 at 5 56 22 pm](https://cloud.githubusercontent.com/assets/3925641/17386896/9d36cb26-59a3-11e6-9619-6dae29f7ab17.png)

Author: Xin Ren <iamshrek@126.com>

Closes #14489 from keypointt/rExample.
2016-08-10 00:49:06 -07:00
Felix Cheung b73defdd79 [SPARKR][DOCS] fix broken url in doc
## What changes were proposed in this pull request?

Fix a broken url. Also:

the sparkR.session.stop doc page should have its own name in the header, instead of saying "sparkR.stop":
![image](https://cloud.githubusercontent.com/assets/8969467/17080129/26d41308-50d9-11e6-8967-79d6c920313f.png)

Data type section is in the middle of a list of gapply/gapplyCollect subsections:
![image](https://cloud.githubusercontent.com/assets/8969467/17080122/f992d00a-50d8-11e6-8f2c-fd5786213920.png)

## How was this patch tested?

manual test

Author: Felix Cheung <felixcheung_m@hotmail.com>

Closes #14329 from felixcheung/rdoclinkfix.
2016-07-25 11:25:41 -07:00
Shivaram Venkataraman fc23263623 [SPARK-10683][SPARK-16510][SPARKR] Move SparkR include jar test to SparkSubmitSuite
## What changes were proposed in this pull request?

This change moves the include jar test from R to SparkSubmitSuite and uses a dynamically compiled jar. This helps us remove the binary jar from the R package and solves both the CRAN warnings and the lack of source being available for this jar.

## How was this patch tested?
SparkR unit tests, SparkSubmitSuite, check-cran.sh

Author: Shivaram Venkataraman <shivaram@cs.berkeley.edu>

Closes #14243 from shivaram/sparkr-jar-move.
2016-07-19 19:28:08 -07:00
krishnakalyan3 8ea3f4eaec [SPARK-16055][SPARKR] warning added while using sparkPackages with spark-submit
## What changes were proposed in this pull request?
https://issues.apache.org/jira/browse/SPARK-16055
When the sparkPackages argument is passed and we detect that we are in R script mode, we should print a warning that the --packages flag should be used with spark-submit.

## How was this patch tested?
Tested locally on my system.

Author: krishnakalyan3 <krishnakalyan3@gmail.com>

Closes #14179 from krishnakalyan3/spark-pkg.
2016-07-18 09:46:23 -07:00
Felix Cheung d27fe9ba67 [SPARK-16027][SPARKR] Fix R tests SparkSession init/stop
## What changes were proposed in this pull request?

Fix R SparkSession init/stop, and warnings of reusing existing Spark Context

## How was this patch tested?

unit tests

shivaram

Author: Felix Cheung <felixcheung_m@hotmail.com>

Closes #14177 from felixcheung/rsessiontest.
2016-07-17 19:02:21 -07:00
Shivaram Venkataraman c33e4b0d96 [SPARK-16507][SPARKR] Add a CRAN checker, fix Rd aliases
## What changes were proposed in this pull request?

Add a check-cran.sh script that runs `R CMD check` as CRAN. Also fixes a number of issues pointed out by the check. These include
- Updating `DESCRIPTION` to be appropriate
- Adding a .Rbuildignore to ignore lintr, src-native, html that are non-standard files / dirs
- Adding aliases to all S4 methods in DataFrame, Column, GroupedData etc.  This is required as stated in https://cran.r-project.org/doc/manuals/r-release/R-exts.html#Documenting-S4-classes-and-methods
- Other minor fixes

## How was this patch tested?

SparkR unit tests, running the above mentioned script

Author: Shivaram Venkataraman <shivaram@cs.berkeley.edu>

Closes #14173 from shivaram/sparkr-cran-changes.
2016-07-16 17:06:44 -07:00
Felix Cheung 611a8ca589 [SPARK-16538][SPARKR] Add more tests for namespace call to SparkSession functions
## What changes were proposed in this pull request?

More tests. I don't think this is critical for the Spark 2.0.0 RC; maybe for Spark 2.0.1 or 2.1.0.

## How was this patch tested?

unit tests

shivaram dongjoon-hyun

Author: Felix Cheung <felixcheung_m@hotmail.com>

Closes #14206 from felixcheung/rroutetests.
2016-07-15 13:58:57 -07:00
Felix Cheung 12005c88fb [SPARK-16538][SPARKR] fix R call with namespace operator on SparkSession functions
## What changes were proposed in this pull request?

Fix function routing to work with and without namespace operator `SparkR::createDataFrame`
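
Both call forms should now behave identically; a minimal sketch:

```r
df1 <- createDataFrame(iris)          # plain call
df2 <- SparkR::createDataFrame(iris)  # namespace-qualified call
```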

## How was this patch tested?

manual, unit tests

shivaram

Author: Felix Cheung <felixcheung_m@hotmail.com>

Closes #14195 from felixcheung/rroutedefault.
2016-07-14 09:45:30 -07:00
Sun Rui 093ebbc628 [SPARK-16509][SPARKR] Rename window.partitionBy and window.orderBy to windowPartitionBy and windowOrderBy.
## What changes were proposed in this pull request?
Rename window.partitionBy and window.orderBy to windowPartitionBy and windowOrderBy to pass CRAN package check.

## How was this patch tested?
SparkR unit tests.

Author: Sun Rui <sunrui2016@gmail.com>

Closes #14192 from sun-rui/SPARK-16509.
2016-07-14 09:38:42 -07:00
Felix Cheung fb2e8eeb0b [SPARKR][DOCS][MINOR] R programming guide to include csv data source example
## What changes were proposed in this pull request?

Minor documentation updates for a code example, code style, and a missed reference to "sparkR.init".

## How was this patch tested?

manual

shivaram

Author: Felix Cheung <felixcheung_m@hotmail.com>

Closes #14178 from felixcheung/rcsvprogrammingguide.
2016-07-13 15:09:23 -07:00
Felix Cheung b4baf086ca [SPARKR][MINOR] R examples and test updates
## What changes were proposed in this pull request?

Minor example updates

## How was this patch tested?

manual

shivaram

Author: Felix Cheung <felixcheung_m@hotmail.com>

Closes #14171 from felixcheung/rexample.
2016-07-13 13:33:34 -07:00
Felix Cheung 7f38b9d5f4 [SPARK-16144][SPARKR] update R API doc for mllib
## What changes were proposed in this pull request?

From SPARK-16140/PR #13921 - the issue is we left write.ml doc empty:
![image](https://cloud.githubusercontent.com/assets/8969467/16481934/856dd0ea-3e62-11e6-9474-e4d57d1ca001.png)

Here's what I meant as the fix:
![image](https://cloud.githubusercontent.com/assets/8969467/16481943/911f02ec-3e62-11e6-9d68-17363a9f5628.png)

![image](https://cloud.githubusercontent.com/assets/8969467/16481950/9bc057aa-3e62-11e6-8127-54870701c4b1.png)

I didn't realize there was already a JIRA on this. mengxr yanboliang

## How was this patch tested?

check doc generated.

Author: Felix Cheung <felixcheung_m@hotmail.com>

Closes #13993 from felixcheung/rmllibdoc.
2016-07-11 14:34:48 -07:00
Yanbo Liang 2ad031be67 [SPARKR][DOC] SparkR ML user guides update for 2.0
## What changes were proposed in this pull request?
* Update SparkR ML section to make them consistent with SparkR API docs.
* Since #13972 adds labelling support for the ```include_example``` Jekyll plugin, we can split the single ```ml.R``` example file into multiple line blocks with different labels, and include them in different algorithms/models in the generated HTML page.

## How was this patch tested?
Only docs update, manually check the generated docs.

Author: Yanbo Liang <ybliang8@gmail.com>

Closes #14011 from yanboliang/r-user-guide-update.
2016-07-11 14:31:11 -07:00
Dongjoon Hyun 142df4834b [SPARK-16429][SQL] Include StringType columns in describe()
## What changes were proposed in this pull request?

Currently, Spark `describe` supports `StringType` columns when they are named explicitly. However, `describe()` without arguments returns a dataset covering only the numeric columns. This PR aims to include `StringType` columns in the no-argument `describe()` as well.

**Background**
```scala
scala> spark.read.json("examples/src/main/resources/people.json").describe("age", "name").show()
+-------+------------------+-------+
|summary|               age|   name|
+-------+------------------+-------+
|  count|                 2|      3|
|   mean|              24.5|   null|
| stddev|7.7781745930520225|   null|
|    min|                19|   Andy|
|    max|                30|Michael|
+-------+------------------+-------+
```

**Before**
```scala
scala> spark.read.json("examples/src/main/resources/people.json").describe().show()
+-------+------------------+
|summary|               age|
+-------+------------------+
|  count|                 2|
|   mean|              24.5|
| stddev|7.7781745930520225|
|    min|                19|
|    max|                30|
+-------+------------------+
```

**After**
```scala
scala> spark.read.json("examples/src/main/resources/people.json").describe().show()
+-------+------------------+-------+
|summary|               age|   name|
+-------+------------------+-------+
|  count|                 2|      3|
|   mean|              24.5|   null|
| stddev|7.7781745930520225|   null|
|    min|                19|   Andy|
|    max|                30|Michael|
+-------+------------------+-------+
```

## How was this patch tested?

Pass the Jenkins with a update testcase.

Author: Dongjoon Hyun <dongjoon@apache.org>

Closes #14095 from dongjoon-hyun/SPARK-16429.
2016-07-08 14:36:50 -07:00
Dongjoon Hyun 6aa7d09f4e [SPARK-16425][R] describe() should not fail with non-numeric columns
## What changes were proposed in this pull request?

This PR prevents ERRORs when `summary(df)` is called on a `SparkDataFrame` with non-numeric columns. This failure happens only in `SparkR`.

**Before**
```r
> df <- createDataFrame(faithful)
> df <- withColumn(df, "boolean", df$waiting==79)
> summary(df)
16/07/07 14:15:16 ERROR RBackendHandler: describe on 34 failed
Error in invokeJava(isStatic = FALSE, objId$id, methodName, ...) :
  org.apache.spark.sql.AnalysisException: cannot resolve 'avg(`boolean`)' due to data type mismatch: function average requires numeric types, not BooleanType;
```

**After**
```r
> df <- createDataFrame(faithful)
> df <- withColumn(df, "boolean", df$waiting==79)
> summary(df)
SparkDataFrame[summary:string, eruptions:string, waiting:string]
```

## How was this patch tested?

Pass the Jenkins with a updated testcase.

Author: Dongjoon Hyun <dongjoon@apache.org>

Closes #14096 from dongjoon-hyun/SPARK-16425.
2016-07-07 17:47:29 -07:00
Felix Cheung f4767bcc7a [SPARK-16310][SPARKR] R na.string-like default for csv source
## What changes were proposed in this pull request?

Apply default "NA" as null string for R, like R read.csv na.string parameter.

https://stat.ethz.ch/R-manual/R-devel/library/utils/html/read.table.html
na.strings = "NA"

A user passing a csv file with NA values should get the same behavior with SparkR read.df(..., source = "csv"). A usage sketch follows.
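
A hedged sketch of the behavior (file path hypothetical):

```r
# "NA" strings in the csv become nulls by default
df <- read.df("data.csv", source = "csv", header = "true")
# override, analogous to read.csv(na.strings = ...)
df2 <- read.df("data.csv", source = "csv", na.strings = "-999")
```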

(couldn't open JIRA, will do that later)

## How was this patch tested?

unit tests

shivaram

Author: Felix Cheung <felixcheung_m@hotmail.com>

Closes #13984 from felixcheung/rcsvnastring.
2016-07-07 15:21:57 -07:00
Dongjoon Hyun d17e5f2f12 [SPARK-16233][R][TEST] ORC test should be enabled only when HiveContext is available.
## What changes were proposed in this pull request?

ORC test should be enabled only when HiveContext is available.

## How was this patch tested?

Manual.
```
$ R/run-tests.sh
...
1. create DataFrame from RDD (test_sparkSQL.R#200) - Hive is not build with SparkSQL, skipped

2. test HiveContext (test_sparkSQL.R#1021) - Hive is not build with SparkSQL, skipped

3. read/write ORC files (test_sparkSQL.R#1728) - Hive is not build with SparkSQL, skipped

4. enableHiveSupport on SparkSession (test_sparkSQL.R#2448) - Hive is not build with SparkSQL, skipped

5. sparkJars tag in SparkContext (test_Windows.R#21) - This test is only for Windows, skipped

DONE ===========================================================================
Tests passed.
```

Author: Dongjoon Hyun <dongjoon@apache.org>

Closes #14019 from dongjoon-hyun/SPARK-16233.
2016-07-01 15:35:19 -07:00
Sun Rui e4fa58c43c [SPARK-16299][SPARKR] Capture errors from R workers in daemon.R to avoid deletion of R session temporary directory.
## What changes were proposed in this pull request?
Capture errors from R workers in daemon.R to avoid deletion of R session temporary directory. See detailed description at https://issues.apache.org/jira/browse/SPARK-16299

## How was this patch tested?
SparkR unit tests.

Author: Sun Rui <sunrui2016@gmail.com>

Closes #13975 from sun-rui/SPARK-16299.
2016-07-01 14:37:03 -07:00
Narine Kokhlikyan 26afb4ce40 [SPARK-16012][SPARKR] Implement gapplyCollect which will apply a R function on each group similar to gapply and collect the result back to R data.frame
## What changes were proposed in this pull request?
gapplyCollect() does a gapply() on a SparkDataFrame and collects the result back to R. Compared to gapply() + collect(), gapplyCollect() offers performance optimization as well as programming convenience, as no schema needs to be provided.

This is similar to dapplyCollect().

## How was this patch tested?
Added test cases for gapplyCollect similar to dapplyCollect

Author: Narine Kokhlikyan <narine@slice.com>

Closes #13760 from NarineK/gapplyCollect.
2016-07-01 13:55:13 -07:00
Dongjoon Hyun 46395db80e [SPARK-16289][SQL] Implement posexplode table generating function
## What changes were proposed in this pull request?

This PR implements the `posexplode` table generating function. Currently, the master branch raises the following exception for a `map` argument, which differs from Hive.

**Before**
```scala
scala> sql("select posexplode(map('a', 1, 'b', 2))").show
org.apache.spark.sql.AnalysisException: No handler for Hive UDF ... posexplode() takes an array as a parameter; line 1 pos 7
```

**After**
```scala
scala> sql("select posexplode(map('a', 1, 'b', 2))").show
+---+---+-----+
|pos|key|value|
+---+---+-----+
|  0|  a|    1|
|  1|  b|    2|
+---+---+-----+
```

For an `array` argument, the behavior after this change is the same as before.
```
scala> sql("select posexplode(array(1, 2, 3))").show
+---+---+
|pos|col|
+---+---+
|  0|  1|
|  1|  2|
|  2|  3|
+---+---+
```

## How was this patch tested?

Pass the Jenkins tests with newly added testcases.

Author: Dongjoon Hyun <dongjoon@apache.org>

Closes #13971 from dongjoon-hyun/SPARK-16289.
2016-06-30 12:03:54 -07:00
Xin Ren 8c9cd0a7a7 [SPARK-16140][MLLIB][SPARKR][DOCS] Group k-means method in generated R doc
https://issues.apache.org/jira/browse/SPARK-16140

## What changes were proposed in this pull request?

Group the R doc of spark.kmeans, predict(KM), summary(KM), read/write.ml(KM) under Rd spark.kmeans. The example code was updated.

## How was this patch tested?

Tested on my local machine

On my laptop `jekyll build` is failing to build the API docs, so here I can only show you the HTML I manually generated from the Rd files, with no CSS applied, but the doc content should be there.

![screenshotkmeans](https://cloud.githubusercontent.com/assets/3925641/16403203/c2c9ca1e-3ca7-11e6-9e29-f2164aee75fc.png)

Author: Xin Ren <iamshrek@126.com>

Closes #13921 from keypointt/SPARK-16140.
2016-06-29 11:25:00 -07:00
Yanbo Liang c6a220d756 [MINOR][SPARKR] Fix arguments of survreg in SparkR
## What changes were proposed in this pull request?
Fix the wrong argument descriptions of ```survreg``` in SparkR.

## How was this patch tested?
```Arguments``` section of ```survreg``` doc before this PR (with wrong description for ```path``` and missing ```overwrite```):
![image](https://cloud.githubusercontent.com/assets/1962026/16447548/fe7a5ed4-3da1-11e6-8b96-b5bf2083b07e.png)

After this PR:
![image](https://cloud.githubusercontent.com/assets/1962026/16447617/368e0b18-3da2-11e6-8277-45640fb11859.png)

Author: Yanbo Liang <ybliang8@gmail.com>

Closes #13970 from yanboliang/spark-16143-followup.
2016-06-29 11:20:35 -07:00
Felix Cheung 823518c2b5 [SPARKR] add csv tests
## What changes were proposed in this pull request?

Add unit tests for csv data for SPARKR

## How was this patch tested?

unit tests

Author: Felix Cheung <felixcheung_m@hotmail.com>

Closes #13904 from felixcheung/rcsv.
2016-06-28 17:08:28 -07:00
WeichenXu d59ba8e307 [MINOR][SPARKR] update sparkR DataFrame.R comment
## What changes were proposed in this pull request?

Update the SparkR DataFrame.R comments: SQLContext ==> SparkSession.

## How was this patch tested?

N/A

Author: WeichenXu <WeichenXu123@outlook.com>

Closes #13946 from WeichenXu123/sparkR_comment_update_sparkSession.
2016-06-28 12:12:20 -07:00
Prashant Sharma f6b497fcdd [SPARK-16128][SQL] Allow setting length of characters to be truncated to, in Dataset.show function.
## What changes were proposed in this pull request?

Allowing truncation to a specific number of characters is convenient at times, especially while operating from the REPL. Sometimes those last few characters make all the difference, and showing everything brings in a whole lot of noise.

## How was this patch tested?
Existing tests. + 1 new test in DataFrameSuite.

For SparkR and pyspark, existing tests and manual testing.

Author: Prashant Sharma <prashsh1@in.ibm.com>
Author: Prashant Sharma <prashant@apache.org>

Closes #13839 from ScrapCodes/add_truncateTo_DF.show.
2016-06-28 17:11:06 +05:30
Junyang Qian 1b7fc58172 [SPARK-16143][R] group AFT survival regression methods docs in a single Rd
## What changes were proposed in this pull request?

This PR groups `spark.survreg`, `summary(AFT)`, `predict(AFT)`, `write.ml(AFT)` for survival regression into a single Rd.

## How was this patch tested?

Manually checked generated HTML doc. See attached screenshots.

![screen shot 2016-06-27 at 10 28 20 am](https://cloud.githubusercontent.com/assets/15318264/16392008/a14cf472-3c5e-11e6-9ce5-490ed1a52249.png)
![screen shot 2016-06-27 at 10 28 35 am](https://cloud.githubusercontent.com/assets/15318264/16392009/a14e333c-3c5e-11e6-8bd7-c2e9ba71f8e2.png)

Author: Junyang Qian <junyangq@databricks.com>

Closes #13927 from junyangq/SPARK-16143.
2016-06-27 20:32:27 -07:00
Felix Cheung 30b182bcc0 [SPARK-16184][SPARKR] conf API for SparkSession
## What changes were proposed in this pull request?

Add `conf` method to get Runtime Config from SparkSession

## How was this patch tested?

unit tests, manual tests

This is how it works in sparkR shell:
```
 SparkSession available as 'spark'.
> conf()
$hive.metastore.warehouse.dir
[1] "file:/opt/spark-2.0.0-bin-hadoop2.6/R/spark-warehouse"

$spark.app.id
[1] "local-1466749575523"

$spark.app.name
[1] "SparkR"

$spark.driver.host
[1] "10.0.2.1"

$spark.driver.port
[1] "45629"

$spark.executorEnv.LD_LIBRARY_PATH
[1] "$LD_LIBRARY_PATH:/usr/lib/R/lib:/usr/lib/x86_64-linux-gnu:/usr/lib/jvm/default-java/jre/lib/amd64/server"

$spark.executor.id
[1] "driver"

$spark.home
[1] "/opt/spark-2.0.0-bin-hadoop2.6"

$spark.master
[1] "local[*]"

$spark.sql.catalogImplementation
[1] "hive"

$spark.submit.deployMode
[1] "client"

> conf("spark.master")
$spark.master
[1] "local[*]"

```

Author: Felix Cheung <felixcheung_m@hotmail.com>

Closes #13885 from felixcheung/rconf.
2016-06-26 13:10:43 -07:00
Xiangrui Meng 4a40d43bb2 [SPARK-16142][R] group naiveBayes method docs in a single Rd
## What changes were proposed in this pull request?

This PR groups `spark.naiveBayes`, `summary(NB)`, `predict(NB)`, and `write.ml(NB)` into a single Rd.

## How was this patch tested?

Manually checked generated HTML doc. See attached screenshots.

![screen shot 2016-06-23 at 2 11 00 pm](https://cloud.githubusercontent.com/assets/829644/16320452/a5885e92-394c-11e6-994f-2ab5cddad86f.png)

![screen shot 2016-06-23 at 2 11 15 pm](https://cloud.githubusercontent.com/assets/829644/16320455/aad1f6d8-394c-11e6-8ef4-13bee989f52f.png)

Author: Xiangrui Meng <meng@databricks.com>

Closes #13877 from mengxr/SPARK-16142.
2016-06-23 21:43:13 -07:00
Felix Cheung b5a997667f [SPARK-16088][SPARKR] update setJobGroup, cancelJobGroup, clearJobGroup
## What changes were proposed in this pull request?

Updated setJobGroup, cancelJobGroup, clearJobGroup to not require sc/SparkContext as a parameter.
Also updated the roxygen2 docs and the R programming guide on the deprecations.
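
The updated calls, sketched (group name hypothetical):

```r
# no SparkContext argument required any more
setJobGroup("myJobs", "job group description", TRUE)
cancelJobGroup("myJobs")
clearJobGroup()
```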

## How was this patch tested?

unit tests

Author: Felix Cheung <felixcheung_m@hotmail.com>

Closes #13838 from felixcheung/rjobgroup.
2016-06-23 09:45:01 -07:00
Kai Jiang 43b04b7ecb [SPARK-15672][R][DOC] R programming guide update
## What changes were proposed in this pull request?
Guide for
- UDFs with dapply, dapplyCollect
- spark.lapply for running parallel R functions

## How was this patch tested?
build locally
<img width="654" alt="screen shot 2016-06-14 at 03 12 56" src="https://cloud.githubusercontent.com/assets/3419881/16039344/12a3b6a0-31de-11e6-8d77-fe23308075c0.png">

Author: Kai Jiang <jiangkai@gmail.com>

Closes #13660 from vectorijk/spark-15672-R-guide-update.
2016-06-22 12:50:36 -07:00
Junyang Qian ea3a12b014 [SPARK-16107][R] group glm methods in documentation
## What changes were proposed in this pull request?

This groups GLM methods (spark.glm, summary, print, predict and write.ml) in the documentation. The example code was updated.

## How was this patch tested?

N/A

![screen shot 2016-06-21 at 2 31 37 pm](https://cloud.githubusercontent.com/assets/15318264/16247077/f6eafc04-37bc-11e6-89a8-7898ff3e4078.png)
![screen shot 2016-06-21 at 2 31 45 pm](https://cloud.githubusercontent.com/assets/15318264/16247078/f6eb1c16-37bc-11e6-940a-2b595b10617c.png)

Author: Junyang Qian <junyangq@databricks.com>
Author: Junyang Qian <junyangq@Junyangs-MacBook-Pro.local>

Closes #13820 from junyangq/SPARK-16107.
2016-06-22 09:13:08 -07:00