Commit graph

59 commits

Author SHA1 Message Date
Yanbo Liang 83af297ac4 [SPARK-13925][ML][SPARKR] Expose R-like summary statistics in SparkR::glm for more family and link functions
## What changes were proposed in this pull request?
Expose R-like summary statistics in SparkR::glm for more family and link functions.
Note: Not all values in R [summary.glm](http://stat.ethz.ch/R-manual/R-patched/library/stats/html/summary.glm.html) are exposed, we only provide the most commonly used statistics in this PR. More statistics can be added in the followup work.

## How was this patch tested?
Unit tests.

SparkR Output:
```
Deviance Residuals:
(Note: These are approximate quantiles with relative error <= 0.01)
     Min        1Q    Median        3Q       Max
-0.95096  -0.16585  -0.00232   0.17410   0.72918

Coefficients:
                    Estimate  Std. Error  t value  Pr(>|t|)
(Intercept)         1.6765    0.23536     7.1231   4.4561e-11
Sepal_Length        0.34988   0.046301    7.5566   4.1873e-12
Species_versicolor  -0.98339  0.072075    -13.644  0
Species_virginica   -1.0075   0.093306    -10.798  0

(Dispersion parameter for gaussian family taken to be 0.08351462)

    Null deviance: 28.307  on 149  degrees of freedom
Residual deviance: 12.193  on 146  degrees of freedom
AIC: 59.22

Number of Fisher Scoring iterations: 1
```
R output:
```
Deviance Residuals:
     Min        1Q    Median        3Q       Max
-0.95096  -0.16522   0.00171   0.18416   0.72918

Coefficients:
                  Estimate Std. Error t value Pr(>|t|)
(Intercept)        1.67650    0.23536   7.123 4.46e-11 ***
Sepal.Length       0.34988    0.04630   7.557 4.19e-12 ***
Speciesversicolor -0.98339    0.07207 -13.644  < 2e-16 ***
Speciesvirginica  -1.00751    0.09331 -10.798  < 2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for gaussian family taken to be 0.08351462)

    Null deviance: 28.307  on 149  degrees of freedom
Residual deviance: 12.193  on 146  degrees of freedom
AIC: 59.217

Number of Fisher Scoring iterations: 2
```

cc mengxr

Author: Yanbo Liang <ybliang8@gmail.com>

Closes #12393 from yanboliang/spark-13925.
2016-04-15 08:23:51 -07:00
Burak Yavuz 1146c534d6 [SPARK-14353] Dataset Time Window window API for R
## What changes were proposed in this pull request?

The `window` function was added to Dataset with [this PR](https://github.com/apache/spark/pull/12008).
This PR adds the R API for this function.

With this PR, SQL, Java, and Scala will share the same APIs as in users can use:
 - `window(timeColumn, windowDuration)`
 - `window(timeColumn, windowDuration, slideDuration)`
 - `window(timeColumn, windowDuration, slideDuration, startTime)`

In Python and R, users can access all APIs above, but in addition they can do
 - In R:
   `window(timeColumn, windowDuration, startTime=...)`

that is, they can provide the startTime without providing the `slideDuration`. In this case, we will generate tumbling windows.

## How was this patch tested?

Unit tests + manual tests

Author: Burak Yavuz <brkyvz@gmail.com>

Closes #12141 from brkyvz/R-windows.
2016-04-05 17:21:41 -07:00
Yanbo Liang 13cbb2de70 [SPARK-13010][ML][SPARKR] Implement a simple wrapper of AFTSurvivalRegression in SparkR
## What changes were proposed in this pull request?
This PR continues the work in #11447, we implemented the wrapper of ```AFTSurvivalRegression``` named ```survreg``` in SparkR.

## How was this patch tested?
Test against output from R package survival's survreg.

cc mengxr felixcheung

Close #11447

Author: Yanbo Liang <ybliang8@gmail.com>

Closes #11932 from yanboliang/spark-13010-new.
2016-03-24 22:29:34 -07:00
Xusen Yin d6dc12ef01 [SPARK-13449] Naive Bayes wrapper in SparkR
## What changes were proposed in this pull request?

This PR continues the work in #11486 from yinxusen with some code refactoring. In R package e1071, `naiveBayes` supports both categorical (Bernoulli) and continuous features (Gaussian), while in MLlib we support Bernoulli and multinomial. This PR implements the common subset: Bernoulli.

I moved the implementation out from SparkRWrappers to NaiveBayesWrapper to make it easier to read. Argument names, default values, and summary now match e1071's naiveBayes.

I removed the preprocess part that omit NA values because we don't know which columns to process.

## How was this patch tested?

Test against output from R package e1071's naiveBayes.

cc: yanboliang yinxusen

Closes #11486

Author: Xusen Yin <yinxusen@gmail.com>
Author: Xiangrui Meng <meng@databricks.com>

Closes #11890 from mengxr/SPARK-13449.
2016-03-22 14:16:51 -07:00
Yanbo Liang 50e60e36f7 [SPARK-13504] [SPARKR] Add approxQuantile for SparkR
## What changes were proposed in this pull request?
Add ```approxQuantile``` for SparkR.
## How was this patch tested?
unit tests

Author: Yanbo Liang <ybliang8@gmail.com>

Closes #11383 from yanboliang/spark-13504 and squashes the following commits:

4f17adb [Yanbo Liang] Add approxQuantile for SparkR
2016-02-25 21:23:41 -08:00
Xusen Yin 8d29001dec [SPARK-13011] K-means wrapper in SparkR
https://issues.apache.org/jira/browse/SPARK-13011

Author: Xusen Yin <yinxusen@gmail.com>

Closes #11124 from yinxusen/SPARK-13011.
2016-02-23 15:42:58 -08:00
Yanbo Liang e7f9199e70 [SPARK-12903][SPARKR] Add covar_samp and covar_pop for SparkR
Add ```covar_samp``` and ```covar_pop``` for SparkR.
Should we also provide ```cov``` alias for ```covar_samp```? There is ```cov``` implementation at stats.R which masks ```stats::cov``` already, but may bring to breaking API change.

cc sun-rui felixcheung shivaram

Author: Yanbo Liang <ybliang8@gmail.com>

Closes #10829 from yanboliang/spark-12903.
2016-01-26 19:29:47 -08:00
Sun Rui 1b2a918e59 [SPARK-12204][SPARKR] Implement drop method for DataFrame in SparkR.
Author: Sun Rui <rui.sun@intel.com>

Closes #10201 from sun-rui/SPARK-12204.
2016-01-20 21:08:15 -08:00
felixcheung 488bbb216c [SPARK-12232][SPARKR] New R API for read.table to avoid name conflict
shivaram sorry it took longer to fix some conflicts, this is the change to add an alias for `table`

Author: felixcheung <felixcheung_m@hotmail.com>

Closes #10406 from felixcheung/readtable.
2016-01-19 18:31:03 -08:00
Sun Rui 3ac648289c [SPARK-12337][SPARKR] Implement dropDuplicates() method of DataFrame in SparkR.
Author: Sun Rui <rui.sun@intel.com>

Closes #10309 from sun-rui/SPARK-12337.
2016-01-19 16:37:18 -08:00
felixcheung 37fefa66cb [SPARK-12168][SPARKR] Add automated tests for conflicted function in R
Currently this is reported when loading the SparkR package in R (probably would add is.nan)
```
Loading required package: methods

Attaching package: ‘SparkR’

The following objects are masked from ‘package:stats’:

    cov, filter, lag, na.omit, predict, sd, var

The following objects are masked from ‘package:base’:

    colnames, colnames<-, intersect, rank, rbind, sample, subset,
    summary, table, transform
```

Adding this test adds an automated way to track changes to masked method.
Also, the second part of this test check for those functions that would not be accessible without namespace/package prefix.

Incidentally, this might point to how we would fix those inaccessible functions in base or stats.
Looking for feedback for adding this test.

Author: felixcheung <felixcheung_m@hotmail.com>

Closes #10171 from felixcheung/rmaskedtest.
2016-01-19 16:33:48 -08:00
Oscar D. Lara Yejas ba4a641902 [SPARK-11031][SPARKR] Method str() on a DataFrame
Author: Oscar D. Lara Yejas <odlaraye@oscars-mbp.usca.ibm.com>
Author: Oscar D. Lara Yejas <olarayej@mail.usf.edu>
Author: Oscar D. Lara Yejas <oscar.lara.yejas@us.ibm.com>
Author: Oscar D. Lara Yejas <odlaraye@oscars-mbp.attlocal.net>

Closes #9613 from olarayej/SPARK-11031.
2016-01-15 07:37:54 -08:00
Yanbo Liang 3d77cffec0 [SPARK-12645][SPARKR] SparkR support hash function
Add ```hash``` function for SparkR ```DataFrame```.

Author: Yanbo Liang <ybliang8@gmail.com>

Closes #10597 from yanboliang/spark-12645.
2016-01-09 12:29:51 +05:30
Yanbo Liang d1fea41363 [SPARK-12393][SPARKR] Add read.text and write.text for SparkR
Add ```read.text``` and ```write.text``` for SparkR.
cc sun-rui felixcheung shivaram

Author: Yanbo Liang <ybliang8@gmail.com>

Closes #10348 from yanboliang/spark-12393.
2016-01-06 12:05:41 +05:30
Yanbo Liang 22f6cd86fc [SPARK-12310][SPARKR] Add write.json and write.parquet for SparkR
Add ```write.json``` and ```write.parquet``` for SparkR, and deprecated ```saveAsParquetFile```.

Author: Yanbo Liang <ybliang8@gmail.com>

Closes #10281 from yanboliang/spark-12310.
2015-12-16 10:34:30 -08:00
Yanbo Liang 0fb9825556 [SPARK-12146][SPARKR] SparkR jsonFile should support multiple input files
* ```jsonFile``` should support multiple input files, such as:
```R
jsonFile(sqlContext, c(“path1”, “path2”)) # character vector as arguments
jsonFile(sqlContext, “path1,path2”)
```
* Meanwhile, ```jsonFile``` has been deprecated by Spark SQL and will be removed at Spark 2.0. So we mark ```jsonFile``` deprecated and use ```read.json``` at SparkR side.
* Replace all ```jsonFile``` with ```read.json``` at test_sparkSQL.R, but still keep jsonFile test case.
* If this PR is accepted, we should also make almost the same change for ```parquetFile```.

cc felixcheung sun-rui shivaram

Author: Yanbo Liang <ybliang8@gmail.com>

Closes #10145 from yanboliang/spark-12146.
2015-12-11 11:47:35 -08:00
Yanbo Liang eeb58722ad [SPARK-12198][SPARKR] SparkR support read.parquet and deprecate parquetFile
SparkR support ```read.parquet``` and deprecate ```parquetFile```. This change is similar with #10145 for ```jsonFile```.

Author: Yanbo Liang <ybliang8@gmail.com>

Closes #10191 from yanboliang/spark-12198.
2015-12-10 09:44:53 -08:00
Sun Rui c8d0e160da [SPARK-11774][SPARKR] Implement struct(), encode(), decode() functions in SparkR.
Author: Sun Rui <rui.sun@intel.com>

Closes #9804 from sun-rui/SPARK-11774.
2015-12-05 15:49:51 -08:00
felixcheung c793d2d9a1 [SPARK-9319][SPARKR] Add support for setting column names, types
Add support for for colnames, colnames<-, coltypes<-
Also added tests for names, names<- which have no test previously.

I merged with PR 8984 (coltypes). Clicked the wrong thing, crewed up the PR. Recreated it here. Was #9218

shivaram sun-rui

Author: felixcheung <felixcheung_m@hotmail.com>

Closes #9654 from felixcheung/colnamescoltypes.
2015-11-28 21:16:21 -08:00
Yanbo Liang ba02f6cb5a [SPARK-12025][SPARKR] Rename some window rank function names for SparkR
Change ```cumeDist -> cume_dist, denseRank -> dense_rank, percentRank -> percent_rank, rowNumber -> row_number``` at SparkR side.
There are two reasons that we should make this change:
* We should follow the [naming convention rule of R](http://www.inside-r.org/node/230645)
* Spark DataFrame has deprecated the old convention (such as ```cumeDist```) and will remove it in Spark 2.0.

It's better to fix this issue before 1.6 release, otherwise we will make breaking API change.
cc shivaram sun-rui

Author: Yanbo Liang <ybliang8@gmail.com>

Closes #10016 from yanboliang/SPARK-12025.
2015-11-27 11:48:01 -08:00
Sun Rui 224723e6a8 [SPARK-11773][SPARKR] Implement collection functions in SparkR.
Author: Sun Rui <rui.sun@intel.com>

Closes #9764 from sun-rui/SPARK-11773.
2015-11-18 08:41:45 -08:00
felixcheung 1a8e0468a1 [SPARK-11468] [SPARKR] add stddev/variance agg functions for Column
Checked names, none of them should conflict with anything in base

shivaram davies rxin

Author: felixcheung <felixcheung_m@hotmail.com>

Closes #9489 from felixcheung/rstddev.
2015-11-10 22:45:17 -08:00
Oscar D. Lara Yejas 47735cdc2a [SPARK-10863][SPARKR] Method coltypes() (New version)
This is a follow up on PR #8984, as the corresponding branch for such PR was damaged.

Author: Oscar D. Lara Yejas <olarayej@mail.usf.edu>

Closes #9579 from olarayej/SPARK-10863_NEW14.
2015-11-10 11:07:57 -08:00
adrian555 b9455d1f18 [SPARK-11260][SPARKR] with() function support
Author: adrian555 <wzhuang@us.ibm.com>
Author: Adrian Zhuang <adrian555@users.noreply.github.com>

Closes #9443 from adrian555/with.
2015-11-05 14:47:38 -08:00
Sun Rui 40c77fb23a [SPARK-11210][SPARKR] Add window functions into SparkR [step 2].
Author: Sun Rui <rui.sun@intel.com>

Closes #9196 from sun-rui/SPARK-11210.
2015-10-30 10:56:06 -07:00
Sun Rui dc3220ce11 [SPARK-11209][SPARKR] Add window functions into SparkR [step 1].
Author: Sun Rui <rui.sun@intel.com>

Closes #9193 from sun-rui/SPARK-11209.
2015-10-26 20:58:18 -07:00
Sun Rui 390b22fad6 [SPARK-10996] [SPARKR] Implement sampleBy() in DataFrameStatFunctions.
Author: Sun Rui <rui.sun@intel.com>

Closes #9023 from sun-rui/SPARK-10996.
2015-10-13 22:31:23 -07:00
Adrian Zhuang f7f28ee7a5 [SPARK-10913] [SPARKR] attach() function support
Bring the change code up to date.

Author: Adrian Zhuang <adrian555@users.noreply.github.com>
Author: adrian555 <wzhuang@us.ibm.com>

Closes #9031 from adrian555/attach2.
2015-10-13 10:21:07 -07:00
Narine Kokhlikyan 1e0aba90b9 [SPARK-10888] [SPARKR] Added as.DataFrame as a synonym to createDataFrame
as.DataFrame is more a R-style like signature.
Also, I'd like to know if we could make the context, e.g. sqlContext global, so that we do not have to specify it as an argument, when we each time create a dataframe.

Author: Narine Kokhlikyan <narine.kokhlikyan@gmail.com>

Closes #8952 from NarineK/sparkrasDataFrame.
2015-10-13 10:09:05 -07:00
Sun Rui 864de3bf40 [SPARK-10079] [SPARKR] Make 'column' and 'col' functions be S4 functions.
1.  Add a "col" function into DataFrame.
2.  Move the current "col" function in Column.R to functions.R, convert it to S4 function.
3.  Add a s4 "column" function in functions.R.
4.  Convert the "column" function in Column.R to S4 function. This is for private use.

Author: Sun Rui <rui.sun@intel.com>

Closes #8864 from sun-rui/SPARK-10079.
2015-10-09 23:05:38 -07:00
Rerngvit Yanggratoke 70f44ad2d8 [SPARK-10905] [SPARKR] Export freqItems() for DataFrameStatFunctions
[SPARK-10905][SparkR]: Export freqItems() for DataFrameStatFunctions
- Add function (together with roxygen2 doc) to DataFrame.R and generics.R
- Expose the function in NAMESPACE
- Add unit test for the function

Author: Rerngvit Yanggratoke <rerngvit@kth.se>

Closes #8962 from rerngvit/SPARK-10905.
2015-10-09 09:36:40 -07:00
Sun Rui f57c63d4c3 [SPARK-10752] [SPARKR] Implement corr() and cov in DataFrameStatFunctions.
Author: Sun Rui <rui.sun@intel.com>

Closes #8869 from sun-rui/SPARK-10752.
2015-10-07 09:46:37 -07:00
Oscar D. Lara Yejas f21e2da03f [SPARK-10807] [SPARKR] Added as.data.frame as a synonym for collect
Created method as.data.frame as a synonym for collect().

Author: Oscar D. Lara Yejas <olarayej@mail.usf.edu>
Author: olarayej <oscar.lara.yejas@us.ibm.com>
Author: Oscar D. Lara Yejas <oscar.lara.yejas@us.ibm.com>

Closes #8908 from olarayej/SPARK-10807.
2015-09-30 18:03:31 -07:00
felixcheung 2a4e00ca4d [SPARK-9803] [SPARKR] Add subset and transform + tests
Add subset and transform
Also reorganize `[` & `[[` to subset instead of select

Note: for transform, transform is very similar to mutate. Spark doesn't seem to replace existing column with the name in mutate (ie. `mutate(df, age = df$age + 2)` - returned DataFrame has 2 columns with the same name 'age'), so therefore not doing that for now in transform.
Though it is clearly stated it should replace column with matching name (should I open a JIRA for mutate/transform?)

Author: felixcheung <felixcheung_m@hotmail.com>

Closes #8503 from felixcheung/rsubset_transform.
2015-08-28 18:35:01 -07:00
Shivaram Venkataraman ad7f0f160b [SPARK-10308] [SPARKR] Add %in% to the exported namespace
I also checked all the other functions defined in column.R, functions.R and DataFrame.R and everything else looked fine.

cc yu-iskw

Author: Shivaram Venkataraman <shivaram@cs.berkeley.edu>

Closes #8473 from shivaram/in-namespace.
2015-08-26 18:13:07 -07:00
Yu ISHIKAWA d898c33f77 [SPARK-10106] [SPARKR] Add ifelse Column function to SparkR
### JIRA
[[SPARK-10106] Add `ifelse` Column function to SparkR - ASF JIRA](https://issues.apache.org/jira/browse/SPARK-10106)

Author: Yu ISHIKAWA <yuu.ishikawa@gmail.com>

Closes #8303 from yu-iskw/SPARK-10106.
2015-08-19 12:39:37 -07:00
Yu ISHIKAWA 2fcb9cb955 [SPARK-9856] [SPARKR] Add expression functions into SparkR whose params are complicated
I added lots of Column functinos into SparkR. And I also added `rand(seed: Int)` and `randn(seed: Int)` in Scala. Since we need such APIs for R integer type.

### JIRA
[[SPARK-9856] Add expression functions into SparkR whose params are complicated - ASF JIRA](https://issues.apache.org/jira/browse/SPARK-9856)

Author: Yu ISHIKAWA <yuu.ishikawa@gmail.com>

Closes #8264 from yu-iskw/SPARK-9856-3.
2015-08-19 10:41:14 -07:00
Yu ISHIKAWA bf32c1f7f4 [SPARK-10075] [SPARKR] Add when expressino function in SparkR
- Add `when` and `otherwise` as `Column` methods
- Add `When` as an expression function
- Add `%otherwise%` infix as an alias of `otherwise`

Since R doesn't support a feature like method chaining, `otherwise(when(condition, value), value)` style is a little annoying for me. If `%otherwise%` looks strange for shivaram, I can remove it. What do you think?

### JIRA
[[SPARK-10075] Add `when` expressino function in SparkR - ASF JIRA](https://issues.apache.org/jira/browse/SPARK-10075)

Author: Yu ISHIKAWA <yuu.ishikawa@gmail.com>

Closes #8266 from yu-iskw/SPARK-10075.
2015-08-18 20:27:36 -07:00
Yuu ISHIKAWA 1968276af0 [SPARK-10007] [SPARKR] Update NAMESPACE file in SparkR for simple parameters functions
### JIRA
[[SPARK-10007] Update `NAMESPACE` file in SparkR for simple parameters functions - ASF JIRA](https://issues.apache.org/jira/browse/SPARK-10007)

Author: Yuu ISHIKAWA <yuu.ishikawa@gmail.com>

Closes #8277 from yu-iskw/SPARK-10007.
2015-08-18 09:10:59 -07:00
Yu ISHIKAWA 26e760581f [SPARK-9871] [SPARKR] Add expression functions into SparkR which have a variable parameter
### Summary

- Add `lit` function
- Add `concat`, `greatest`, `least` functions

I think we need to improve `collect` function in order to implement `struct` function. Since `collect` doesn't work with arguments which includes a nested `list` variable. It seems that a list against `struct` still has `jobj` classes. So it would be better to solve this problem on another issue.

### JIRA
[[SPARK-9871] Add expression functions into SparkR which have a variable parameter - ASF JIRA](https://issues.apache.org/jira/browse/SPARK-9871)

Author: Yu ISHIKAWA <yuu.ishikawa@gmail.com>

Closes #8194 from yu-iskw/SPARK-9856.
2015-08-16 23:33:20 -07:00
Hossein 712f5b7a9a [SPARK-9318] [SPARK-9320] [SPARKR] Aliases for merge and summary functions on DataFrames
This PR adds synonyms for ```merge``` and ```summary``` in SparkR DataFrame API.

cc shivaram

Author: Hossein <hossein@databricks.com>

Closes #7806 from falaki/SPARK-9320 and squashes the following commits:

72600f7 [Hossein] Updated docs
92a6e75 [Hossein] Fixed merge generic signature issue
4c2b051 [Hossein] Fixing naming with mllib summary
0f3a64c [Hossein] Added ... to generic for merge
30fbaf8 [Hossein] Merged master
ae1a4cf [Hossein] Merge branch 'master' into SPARK-9320
e8eb86f [Hossein] Add a generic for merge
fc01f2d [Hossein] Added unit test
8d92012 [Hossein] Added merge as an alias for join
5b8bedc [Hossein] Added unit test
632693d [Hossein] Added summary as an alias for describe for DataFrame
2015-07-31 19:24:44 -07:00
Hossein 710c2b5dd2 [SPARK-9324] [SPARK-9322] [SPARK-9321] [SPARKR] Some aliases for R-like functions in DataFrames
Adds following aliases:
* unique (distinct)
* rbind (unionAll): accepts many DataFrames
* nrow (count)
* ncol
* dim
* names (columns): along with the replacement function to change names

Author: Hossein <hossein@databricks.com>

Closes #7764 from falaki/sparkR-alias and squashes the following commits:

56016f5 [Hossein] Updated R documentation
5e4a4d0 [Hossein] Removed extra code
f51cbef [Hossein] Merge branch 'master' into sparkR-alias
c1b88bd [Hossein] Moved setGeneric and other comments applied
d9307f8 [Hossein] Added tests
b5aa988 [Hossein] Added dim, ncol, nrow, names, rbind, and unique functions to DataFrames
2015-07-31 14:08:18 -07:00
Eric Liang e7905a9395 [SPARK-9463] [ML] Expose model coefficients with names in SparkR RFormula
Preview:

```
> summary(m)
            features coefficients
1        (Intercept)    1.6765001
2       Sepal_Length    0.3498801
3 Species.versicolor   -0.9833885
4  Species.virginica   -1.0075104

```

Design doc from umbrella task: https://docs.google.com/document/d/10NZNSEurN2EdWM31uFYsgayIPfCFHiuIu3pCWrUmP_c/edit

cc mengxr

Author: Eric Liang <ekl@databricks.com>

Closes #7771 from ericl/summary and squashes the following commits:

ccd54c3 [Eric Liang] second pass
a5ca93b [Eric Liang] comments
2772111 [Eric Liang] clean up
70483ef [Eric Liang] fix test
7c247d4 [Eric Liang] Merge branch 'master' into summary
3c55024 [Eric Liang] working
8c539aa [Eric Liang] first pass
2015-07-30 16:15:43 -07:00
Xiangrui Meng 2f5cbd860e [SPARK-8364] [SPARKR] Add crosstab to SparkR DataFrames
Add `crosstab` to SparkR DataFrames, which takes two column names and returns a local R data.frame. This is similar to `table` in R. However, `table` in SparkR is used for loading SQL tables as DataFrames. The return type is data.frame instead table for `crosstab` to be compatible with Scala/Python.

I couldn't run R tests successfully on my local. Many unit tests failed. So let's try Jenkins.

Author: Xiangrui Meng <meng@databricks.com>

Closes #7318 from mengxr/SPARK-8364 and squashes the following commits:

d75e894 [Xiangrui Meng] fix tests
53f6ddd [Xiangrui Meng] fix tests
f1348d6 [Xiangrui Meng] update test
47cb088 [Xiangrui Meng] Merge remote-tracking branch 'apache/master' into SPARK-8364
5621262 [Xiangrui Meng] first version without test
2015-07-22 21:40:23 -07:00
Eric Liang 1cbdd89918 [SPARK-9201] [ML] Initial integration of MLlib + SparkR using RFormula
This exposes the SparkR:::glm() and SparkR:::predict() APIs. It was necessary to change RFormula to silently drop the label column if it was missing from the input dataset, which is kind of a hack but necessary to integrate with the Pipeline API.

The umbrella design doc for MLlib + SparkR integration can be viewed here: https://docs.google.com/document/d/10NZNSEurN2EdWM31uFYsgayIPfCFHiuIu3pCWrUmP_c/edit

mengxr

Author: Eric Liang <ekl@databricks.com>

Closes #7483 from ericl/spark-8774 and squashes the following commits:

3dfac0c [Eric Liang] update
17ef516 [Eric Liang] more comments
1753a0f [Eric Liang] make glm generic
b0f50f8 [Eric Liang] equivalence test
550d56d [Eric Liang] export methods
c015697 [Eric Liang] second pass
117949a [Eric Liang] comments
5afbc67 [Eric Liang] test label columns
6b7f15f [Eric Liang] Fri Jul 17 14:20:22 PDT 2015
3a63ae5 [Eric Liang] Fri Jul 17 13:41:52 PDT 2015
ce61367 [Eric Liang] Fri Jul 17 13:41:17 PDT 2015
0299c59 [Eric Liang] Fri Jul 17 13:40:32 PDT 2015
e37603f [Eric Liang] Fri Jul 17 12:15:03 PDT 2015
d417d0c [Eric Liang] Merge remote-tracking branch 'upstream/master' into spark-8774
29a2ce7 [Eric Liang] Merge branch 'spark-8774-1' into spark-8774
d1959d2 [Eric Liang] clarify comment
2db68aa [Eric Liang] second round of comments
dc3c943 [Eric Liang] address comments
5765ec6 [Eric Liang] fix style checks
1f361b0 [Eric Liang] doc
d33211b [Eric Liang] r support
fb0826b [Eric Liang] [SPARK-8774] Add R model formula with basic support as a transformer
2015-07-20 20:49:38 -07:00
Liang-Chi Hsieh 0a795336df [SPARK-8807] [SPARKR] Add between operator in SparkR
JIRA: https://issues.apache.org/jira/browse/SPARK-8807

Add between operator in SparkR.

Author: Liang-Chi Hsieh <viirya@appier.com>

Closes #7356 from viirya/add_r_between and squashes the following commits:

7f51b44 [Liang-Chi Hsieh] Add test for non-numeric column.
c6a25c5 [Liang-Chi Hsieh] Add between function.
2015-07-15 23:36:57 -07:00
Hossein 1fa29c2df2 [SPARK-8452] [SPARKR] expose jobGroup API in SparkR
This pull request adds following methods to SparkR:

```R
setJobGroup()
cancelJobGroup()
clearJobGroup()
```
For each method, the spark context is passed as the first argument. There does not seem to be a good way to test these in R.

cc shivaram and davies

Author: Hossein <hossein@databricks.com>

Closes #6889 from falaki/SPARK-8452 and squashes the following commits:

9ce9f1e [Hossein] Added basic tests to verify methods can be called and won't throw errors
c706af9 [Hossein] Added examples
a2c19af [Hossein] taking spark context as first argument
343ca77 [Hossein] Added setJobGroup, cancelJobGroup and clearJobGroup to SparkR
2015-06-19 15:51:59 -07:00
Sun Rui 46576ab303 [SPARK-7227] [SPARKR] Support fillna / dropna in R DataFrame.
Author: Sun Rui <rui.sun@intel.com>

Closes #6183 from sun-rui/SPARK-7227 and squashes the following commits:

dd6f5b3 [Sun Rui] Rename readEnv() back to readMap(). Add alias na.omit() for dropna().
41cf725 [Sun Rui] [SPARK-7227][SPARKR] Support fillna / dropna in R DataFrame.
2015-05-31 15:01:59 -07:00
Shivaram Venkataraman a40bca0111 [SPARK-6811] Copy SparkR lib in make-distribution.sh
This change also remove native libraries from SparkR to make sure our distribution works across platforms

Tested by building on Mac, running on Amazon Linux (CentOS), Windows VM and vice-versa (built on Linux run on Mac)

I will also test this with YARN soon and update this PR.

Author: Shivaram Venkataraman <shivaram@cs.berkeley.edu>

Closes #6373 from shivaram/sparkr-binary and squashes the following commits:

ae41b5c [Shivaram Venkataraman] Remove native libraries from SparkR Also include the built SparkR package in make-distribution.sh
2015-05-23 00:04:01 -07:00
qhuang 50da9e8916 [SPARK-7226] [SPARKR] Support math functions in R DataFrame
Author: qhuang <qian.huang@intel.com>

Closes #6170 from hqzizania/master and squashes the following commits:

f20c39f [qhuang] add tests units and fixes
2a7d121 [qhuang] use a function name more familiar to R users
07aa72e [qhuang] Support math functions in R DataFrame
2015-05-15 14:06:16 -07:00