Commit graph

815 commits

Author SHA1 Message Date
felixcheung 39d3bc62a7 [SPARK-14594][SPARKR] check execution return status code
## What changes were proposed in this pull request?

When JVM backend fails without going proper error handling (eg. process crashed), the R error message could be ambiguous.

```
Error in if (returnStatus != 0) { : argument is of length zero
```

This change attempts to make it more clear (however, one would still need to investigate why JVM fails)

## How was this patch tested?

manually

Author: felixcheung <felixcheung_m@hotmail.com>

Closes #12622 from felixcheung/rreturnstatus.
2016-04-23 11:08:19 -07:00
felixcheung a55fbe2a16 [SPARK-12148][SPARKR] SparkR: rename DataFrame to SparkDataFrame
## What changes were proposed in this pull request?

Changed class name defined in R from "DataFrame" to "SparkDataFrame". A popular package, S4Vector already defines "DataFrame" - this change is to avoid conflict.

Aside from class name and API/roxygen2 references, SparkR APIs like `createDataFrame`, `as.DataFrame` are not changed (S4Vector does not define a "as.DataFrame").

Since in R, one would rarely reference type/class, this change should have minimal/almost-no impact to a SparkR user in terms of back compat.

## How was this patch tested?

SparkR tests, manually loading S4Vector then SparkR package

Author: felixcheung <felixcheung_m@hotmail.com>

Closes #12621 from felixcheung/rdataframe.
2016-04-23 00:20:27 -07:00
Sun Rui 1a7fc74ccf [SPARK-13178] RRDD faces with concurrency issue in case of rdd.zip(rdd).count().
## What changes were proposed in this pull request?
The concurrency issue reported in SPARK-13178 was fixed by the PR https://github.com/apache/spark/pull/10947 for SPARK-12792.
This PR just removes a workaround not needed anymore.

## How was this patch tested?
SparkR unit tests.

Author: Sun Rui <rui.sun@intel.com>

Closes #12606 from sun-rui/SPARK-13178.
2016-04-22 11:19:52 -07:00
Dongjoon Hyun 411454475a [SPARK-14780] [R] Add setLogLevel to SparkR
## What changes were proposed in this pull request?

This PR aims to add `setLogLevel` function to SparkR shell.

**Spark Shell**
```scala
scala> sc.setLogLevel("ERROR")
```

**PySpark**
```python
>>> sc.setLogLevel("ERROR")
```

**SparkR (this PR)**
```r
> setLogLevel(sc, "ERROR")
NULL
```

## How was this patch tested?

Pass the Jenkins tests including a new R testcase.

Author: Dongjoon Hyun <dongjoon@apache.org>

Closes #12547 from dongjoon-hyun/SPARK-14780.
2016-04-21 16:09:50 -07:00
Dongjoon Hyun 14869ae64e [SPARK-14639] [PYTHON] [R] Add bround function in Python/R.
## What changes were proposed in this pull request?

This issue aims to expose Scala `bround` function in Python/R API.
`bround` function is implemented in SPARK-14614 by extending current `round` function.
We used the following semantics from Hive.
```java
public static double bround(double input, int scale) {
    if (Double.isNaN(input) || Double.isInfinite(input)) {
      return input;
    }
    return BigDecimal.valueOf(input).setScale(scale, RoundingMode.HALF_EVEN).doubleValue();
}
```

After this PR, `pyspark` and `sparkR` also support `bround` function.

**PySpark**
```python
>>> from pyspark.sql.functions import bround
>>> sqlContext.createDataFrame([(2.5,)], ['a']).select(bround('a', 0).alias('r')).collect()
[Row(r=2.0)]
```

**SparkR**
```r
> df = createDataFrame(sqlContext, data.frame(x = c(2.5, 3.5)))
> head(collect(select(df, bround(df$x, 0))))
  bround(x, 0)
1            2
2            4
```

## How was this patch tested?

Pass the Jenkins tests (including new testcases).

Author: Dongjoon Hyun <dongjoon@apache.org>

Closes #12509 from dongjoon-hyun/SPARK-14639.
2016-04-19 22:28:11 -07:00
Sun Rui 8eedf0b553 [SPARK-13905][SPARKR] Change signature of as.data.frame() to be consistent with the R base package.
## What changes were proposed in this pull request?

Change the signature of as.data.frame() to be consistent with that in the R base package to meet R user's convention.

## How was this patch tested?
dev/lint-r
SparkR unit tests

Author: Sun Rui <rui.sun@intel.com>

Closes #11811 from sun-rui/SPARK-13905.
2016-04-19 19:57:03 -07:00
felixcheung ecd877e833 [SPARK-12224][SPARKR] R support for JDBC source
Add R API for `read.jdbc`, `write.jdbc`.

Tested this quite a bit manually with different combinations of parameters. It's not clear if we could have automated tests in R for this - Scala `JDBCSuite` depends on Java H2 in-memory database.

Refactored some code into util so they could be tested.

Core's R SerDe code needs to be updated to allow access to java.util.Properties as `jobj` handle which is required by DataFrameReader/Writer's `jdbc` method. It would be possible, though more code to add a `sql/r/SQLUtils` helper function.

Tested:
```
# with postgresql
../bin/sparkR --driver-class-path /usr/share/java/postgresql-9.4.1207.jre7.jar

# read.jdbc
df <- read.jdbc(sqlContext, "jdbc:postgresql://localhost/db", "films2", user = "user", password = "12345")
df <- read.jdbc(sqlContext, "jdbc:postgresql://localhost/db", "films2", user = "user", password = 12345)

# partitionColumn and numPartitions test
df <- read.jdbc(sqlContext, "jdbc:postgresql://localhost/db", "films2", partitionColumn = "did", lowerBound = 0, upperBound = 200, numPartitions = 4, user = "user", password = 12345)
a <- SparkR:::toRDD(df)
SparkR:::getNumPartitions(a)
[1] 4
SparkR:::collectPartition(a, 2L)

# defaultParallelism test
df <- read.jdbc(sqlContext, "jdbc:postgresql://localhost/db", "films2", partitionColumn = "did", lowerBound = 0, upperBound = 200, user = "user", password = 12345)
SparkR:::getNumPartitions(a)
[1] 2

# predicates test
df <- read.jdbc(sqlContext, "jdbc:postgresql://localhost/db", "films2", predicates = list("did<=105"), user = "user", password = 12345)
count(df) == 1

# write.jdbc, default save mode "error"
irisDf <- as.DataFrame(sqlContext, iris)
write.jdbc(irisDf, "jdbc:postgresql://localhost/db", "films2", user = "user", password = "12345")
"error, already exists"

write.jdbc(irisDf, "jdbc:postgresql://localhost/db", "iris", user = "user", password = "12345")
```

Author: felixcheung <felixcheung_m@hotmail.com>

Closes #10480 from felixcheung/rreadjdbc.
2016-04-19 15:59:47 -07:00
Yanbo Liang 83af297ac4 [SPARK-13925][ML][SPARKR] Expose R-like summary statistics in SparkR::glm for more family and link functions
## What changes were proposed in this pull request?
Expose R-like summary statistics in SparkR::glm for more family and link functions.
Note: Not all values in R [summary.glm](http://stat.ethz.ch/R-manual/R-patched/library/stats/html/summary.glm.html) are exposed, we only provide the most commonly used statistics in this PR. More statistics can be added in the followup work.

## How was this patch tested?
Unit tests.

SparkR Output:
```
Deviance Residuals:
(Note: These are approximate quantiles with relative error <= 0.01)
     Min        1Q    Median        3Q       Max
-0.95096  -0.16585  -0.00232   0.17410   0.72918

Coefficients:
                    Estimate  Std. Error  t value  Pr(>|t|)
(Intercept)         1.6765    0.23536     7.1231   4.4561e-11
Sepal_Length        0.34988   0.046301    7.5566   4.1873e-12
Species_versicolor  -0.98339  0.072075    -13.644  0
Species_virginica   -1.0075   0.093306    -10.798  0

(Dispersion parameter for gaussian family taken to be 0.08351462)

    Null deviance: 28.307  on 149  degrees of freedom
Residual deviance: 12.193  on 146  degrees of freedom
AIC: 59.22

Number of Fisher Scoring iterations: 1
```
R output:
```
Deviance Residuals:
     Min        1Q    Median        3Q       Max
-0.95096  -0.16522   0.00171   0.18416   0.72918

Coefficients:
                  Estimate Std. Error t value Pr(>|t|)
(Intercept)        1.67650    0.23536   7.123 4.46e-11 ***
Sepal.Length       0.34988    0.04630   7.557 4.19e-12 ***
Speciesversicolor -0.98339    0.07207 -13.644  < 2e-16 ***
Speciesvirginica  -1.00751    0.09331 -10.798  < 2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for gaussian family taken to be 0.08351462)

    Null deviance: 28.307  on 149  degrees of freedom
Residual deviance: 12.193  on 146  degrees of freedom
AIC: 59.217

Number of Fisher Scoring iterations: 2
```

cc mengxr

Author: Yanbo Liang <ybliang8@gmail.com>

Closes #12393 from yanboliang/spark-13925.
2016-04-15 08:23:51 -07:00
Yanbo Liang 75e05a5a96 [SPARK-12566][SPARK-14324][ML] GLM model family, link function support in SparkR:::glm
* SparkR glm supports families and link functions which match R's signature for family.
* SparkR glm API refactor. The comparative standard of the new API is R glm, so I only expose the arguments that R glm supports: ```formula, family, data, epsilon and maxit```.
* This PR is focus on glm() and predict(), summary statistics will be done in a separate PR after this get in.
* This PR depends on #12287 which make GLMs support link prediction at Scala side. After that merged, I will add more tests for predict() to this PR.

Unit tests.

cc mengxr jkbradley hhbyyh

Author: Yanbo Liang <ybliang8@gmail.com>

Closes #12294 from yanboliang/spark-12566.
2016-04-12 10:51:09 -07:00
gatorsmile 9f838bd242 [SPARK-14362][SPARK-14406][SQL][FOLLOW-UP] DDL Native Support: Drop View and Drop Table
#### What changes were proposed in this pull request?
This PR is to address the comment: https://github.com/apache/spark/pull/12146#discussion-diff-59092238. It removes the function `isViewSupported` from `SessionCatalog`. After the removal, we still can capture the user errors if users try to drop a table using `DROP VIEW`.

#### How was this patch tested?
Modified the existing test cases

Author: gatorsmile <gatorsmile@gmail.com>

Closes #12284 from gatorsmile/followupDropTable.
2016-04-10 20:46:15 -07:00
Burak Yavuz 1146c534d6 [SPARK-14353] Dataset Time Window window API for R
## What changes were proposed in this pull request?

The `window` function was added to Dataset with [this PR](https://github.com/apache/spark/pull/12008).
This PR adds the R API for this function.

With this PR, SQL, Java, and Scala will share the same APIs as in users can use:
 - `window(timeColumn, windowDuration)`
 - `window(timeColumn, windowDuration, slideDuration)`
 - `window(timeColumn, windowDuration, slideDuration, startTime)`

In Python and R, users can access all APIs above, but in addition they can do
 - In R:
   `window(timeColumn, windowDuration, startTime=...)`

that is, they can provide the startTime without providing the `slideDuration`. In this case, we will generate tumbling windows.

## How was this patch tested?

Unit tests + manual tests

Author: Burak Yavuz <brkyvz@gmail.com>

Closes #12141 from brkyvz/R-windows.
2016-04-05 17:21:41 -07:00
Yanbo Liang 22249afb4a [SPARK-14303][ML][SPARKR] Define and use KMeansWrapper for SparkR::kmeans
## What changes were proposed in this pull request?
Define and use ```KMeansWrapper``` for ```SparkR::kmeans```. It's only the code refactor for the original ```KMeans``` wrapper.

## How was this patch tested?
Existing tests.

cc mengxr

Author: Yanbo Liang <ybliang8@gmail.com>

Closes #12039 from yanboliang/spark-14059.
2016-03-31 23:49:58 -07:00
Sun Rui d3638d7bff [SPARK-12792] [SPARKR] Refactor RRDD to support R UDF.
## What changes were proposed in this pull request?

Refactor RRDD by separating the common logic interacting with the R worker to a new class RRunner, which can be used to evaluate R UDFs.

Now RRDD relies on RRuner for RDD computation and RRDD could be reomved if we want to remove RDD API in SparkR later.

## How was this patch tested?
dev/lint-r
SparkR unit tests

Author: Sun Rui <rui.sun@intel.com>

Closes #12024 from sun-rui/SPARK-12792_new.
2016-03-28 21:51:02 -07:00
Davies Liu e5a1b301fb Revert "[SPARK-12792] [SPARKR] Refactor RRDD to support R UDF."
This reverts commit 40984f6706.
2016-03-28 10:21:02 -07:00
Sun Rui 40984f6706 [SPARK-12792] [SPARKR] Refactor RRDD to support R UDF.
Refactor RRDD by separating the common logic interacting with the R worker to a new class RRunner, which can be used to evaluate R UDFs.

Now RRDD relies on RRuner for RDD computation and RRDD could be reomved if we want to remove RDD API in SparkR later.

Author: Sun Rui <rui.sun@intel.com>

Closes #10947 from sun-rui/SPARK-12792.
2016-03-28 10:14:28 -07:00
Andrew Or 20ddf5fddf [SPARK-14014][SQL] Integrate session catalog (attempt #2)
## What changes were proposed in this pull request?

This reopens #11836, which was merged but promptly reverted because it introduced flaky Hive tests.

## How was this patch tested?

See `CatalogTestCases`, `SessionCatalogSuite` and `HiveContextSuite`.

Author: Andrew Or <andrew@databricks.com>

Closes #11938 from andrewor14/session-catalog-again.
2016-03-24 22:59:35 -07:00
Yanbo Liang 13cbb2de70 [SPARK-13010][ML][SPARKR] Implement a simple wrapper of AFTSurvivalRegression in SparkR
## What changes were proposed in this pull request?
This PR continues the work in #11447, we implemented the wrapper of ```AFTSurvivalRegression``` named ```survreg``` in SparkR.

## How was this patch tested?
Test against output from R package survival's survreg.

cc mengxr felixcheung

Close #11447

Author: Yanbo Liang <ybliang8@gmail.com>

Closes #11932 from yanboliang/spark-13010-new.
2016-03-24 22:29:34 -07:00
Andrew Or c44d140cae Revert "[SPARK-14014][SQL] Replace existing catalog with SessionCatalog"
This reverts commit 5dfc01976b.
2016-03-23 22:21:15 -07:00
Andrew Or 5dfc01976b [SPARK-14014][SQL] Replace existing catalog with SessionCatalog
## What changes were proposed in this pull request?

`SessionCatalog`, introduced in #11750, is a catalog that keeps track of temporary functions and tables, and delegates metastore operations to `ExternalCatalog`. This functionality overlaps a lot with the existing `analysis.Catalog`.

As of this commit, `SessionCatalog` and `ExternalCatalog` will no longer be dead code. There are still things that need to be done after this patch, namely:
- SPARK-14013: Properly implement temporary functions in `SessionCatalog`
- SPARK-13879: Decide which DDL/DML commands to support natively in Spark
- SPARK-?????: Implement the ones we do want to support through `SessionCatalog`.
- SPARK-?????: Merge SQL/HiveContext

## How was this patch tested?

This is largely a refactoring task so there are no new tests introduced. The particularly relevant tests are `SessionCatalogSuite` and `ExternalCatalogSuite`.

Author: Andrew Or <andrew@databricks.com>
Author: Yin Huai <yhuai@databricks.com>

Closes #11836 from andrewor14/use-session-catalog.
2016-03-23 13:34:22 -07:00
Xusen Yin d6dc12ef01 [SPARK-13449] Naive Bayes wrapper in SparkR
## What changes were proposed in this pull request?

This PR continues the work in #11486 from yinxusen with some code refactoring. In R package e1071, `naiveBayes` supports both categorical (Bernoulli) and continuous features (Gaussian), while in MLlib we support Bernoulli and multinomial. This PR implements the common subset: Bernoulli.

I moved the implementation out from SparkRWrappers to NaiveBayesWrapper to make it easier to read. Argument names, default values, and summary now match e1071's naiveBayes.

I removed the preprocess part that omit NA values because we don't know which columns to process.

## How was this patch tested?

Test against output from R package e1071's naiveBayes.

cc: yanboliang yinxusen

Closes #11486

Author: Xusen Yin <yinxusen@gmail.com>
Author: Xiangrui Meng <meng@databricks.com>

Closes #11890 from mengxr/SPARK-13449.
2016-03-22 14:16:51 -07:00
Dongjoon Hyun 2082a49569 [MINOR][DOCS] Use spark-submit instead of sparkR to submit R script.
## What changes were proposed in this pull request?

Since `sparkR` is not used for submitting R Scripts from Spark 2.0, a user faces the following error message if he follows the instruction on `R/README.md`. This PR updates `R/README.md`.
```bash
$ ./bin/sparkR examples/src/main/r/dataframe.R
Running R applications through 'sparkR' is not supported as of Spark 2.0.
Use ./bin/spark-submit <R file>
```

## How was this patch tested?

Manual.

Author: Dongjoon Hyun <dongjoon@apache.org>

Closes #11842 from dongjoon-hyun/update_r_readme.
2016-03-19 13:23:34 +00:00
Sun Rui c7e68c3968 [SPARK-13812][SPARKR] Fix SparkR lint-r test errors.
## What changes were proposed in this pull request?

This PR fixes all newly captured SparkR lint-r errors after the lintr package is updated from github.

## How was this patch tested?

dev/lint-r
SparkR unit tests

Author: Sun Rui <rui.sun@intel.com>

Closes #11652 from sun-rui/SPARK-13812.
2016-03-13 14:30:44 -07:00
Yanbo Liang 4d535d1f1c [SPARK-13389][SPARKR] SparkR support first/last with ignore NAs
## What changes were proposed in this pull request?

SparkR support first/last with ignore NAs

cc sun-rui felixcheung shivaram

## How was the this patch tested?

unit tests

Author: Yanbo Liang <ybliang8@gmail.com>

Closes #11267 from yanboliang/spark-13389.
2016-03-10 17:31:19 -08:00
Oscar D. Lara Yejas 416e71af4d [SPARK-13327][SPARKR] Added parameter validations for colnames<-
Author: Oscar D. Lara Yejas <odlaraye@oscars-mbp.attlocal.net>
Author: Oscar D. Lara Yejas <odlaraye@oscars-mbp.usca.ibm.com>

Closes #11220 from olarayej/SPARK-13312-3.
2016-03-10 17:10:23 -08:00
Yanbo Liang 50e60e36f7 [SPARK-13504] [SPARKR] Add approxQuantile for SparkR
## What changes were proposed in this pull request?
Add ```approxQuantile``` for SparkR.
## How was this patch tested?
unit tests

Author: Yanbo Liang <ybliang8@gmail.com>

Closes #11383 from yanboliang/spark-13504 and squashes the following commits:

4f17adb [Yanbo Liang] Add approxQuantile for SparkR
2016-02-25 21:23:41 -08:00
Liang-Chi Hsieh 8930181833 [SPARK-13472] [SPARKR] Fix unstable Kmeans test in R
JIRA: https://issues.apache.org/jira/browse/SPARK-13472

## What changes were proposed in this pull request?

One Kmeans test in R is unstable and sometimes fails. We should fix it.

## How was this patch tested?

Unit test is modified in this PR.

Author: Liang-Chi Hsieh <viirya@gmail.com>

Closes #11345 from viirya/fix-kmeans-r-test and squashes the following commits:

f959f61 [Liang-Chi Hsieh] Sort resulted clusters.
2016-02-24 07:05:20 -08:00
Xusen Yin 8d29001dec [SPARK-13011] K-means wrapper in SparkR
https://issues.apache.org/jira/browse/SPARK-13011

Author: Xusen Yin <yinxusen@gmail.com>

Closes #11124 from yinxusen/SPARK-13011.
2016-02-23 15:42:58 -08:00
Dongjoon Hyun 024482bf51 [MINOR][DOCS] Fix all typos in markdown files of doc and similar patterns in other comments
## What changes were proposed in this pull request?

This PR tries to fix all typos in all markdown files under `docs` module,
and fixes similar typos in other comments, too.

## How was the this patch tested?

manual tests.

Author: Dongjoon Hyun <dongjoon@apache.org>

Closes #11300 from dongjoon-hyun/minor_fix_typos.
2016-02-22 09:52:07 +00:00
Cheng Lian d9efe63ecd [SPARK-12799] Simplify various string output for expressions
This PR introduces several major changes:

1. Replacing `Expression.prettyString` with `Expression.sql`

   The `prettyString` method is mostly an internal, developer faced facility for debugging purposes, and shouldn't be exposed to users.

1. Using SQL-like representation as column names for selected fields that are not named expression (back-ticks and double quotes should be removed)

   Before, we were using `prettyString` as column names when possible, and sometimes the result column names can be weird.  Here are several examples:

   Expression         | `prettyString` | `sql`      | Note
   ------------------ | -------------- | ---------- | ---------------
   `a && b`           | `a && b`       | `a AND b`  |
   `a.getField("f")`  | `a[f]`         | `a.f`      | `a` is a struct

1. Adding trait `NonSQLExpression` extending from `Expression` for expressions that don't have a SQL representation (e.g. Scala UDF/UDAF and Java/Scala object expressions used for encoders)

   `NonSQLExpression.sql` may return an arbitrary user facing string representation of the expression.

Author: Cheng Lian <lian@databricks.com>

Closes #10757 from liancheng/spark-12799.simplify-expression-string-methods.
2016-02-21 22:53:15 +08:00
Sean Owen fb7e21797e [SPARK-13339][DOCS] Clarify commutative / associative operator requirements for reduce, fold
Clarify that reduce functions need to be commutative, and fold functions do not

See https://github.com/apache/spark/pull/11091

Author: Sean Owen <sowen@cloudera.com>

Closes #11217 from srowen/SPARK-13339.
2016-02-19 10:26:38 +00:00
Sasaki Toru c2f21d8898 [SPARK-13264][DOC] Removed multi-byte characters in spark-env.sh.template
In spark-env.sh.template, there are multi-byte characters, this PR will remove it.

Author: Sasaki Toru <sasakitoa@nttdata.co.jp>

Closes #11149 from sasakitoa/remove_multibyte_in_sparkenv.
2016-02-11 09:30:36 +00:00
Yanbo Liang e7f9199e70 [SPARK-12903][SPARKR] Add covar_samp and covar_pop for SparkR
Add ```covar_samp``` and ```covar_pop``` for SparkR.
Should we also provide ```cov``` alias for ```covar_samp```? There is ```cov``` implementation at stats.R which masks ```stats::cov``` already, but may bring to breaking API change.

cc sun-rui felixcheung shivaram

Author: Yanbo Liang <ybliang8@gmail.com>

Closes #10829 from yanboliang/spark-12903.
2016-01-26 19:29:47 -08:00
Narine Kokhlikyan 8a88e12128 [SPARK-12629][SPARKR] Fixes for DataFrame saveAsTable method
I've tried to solve some of the issues mentioned in: https://issues.apache.org/jira/browse/SPARK-12629
Please, let me know what do you think.
Thanks!

Author: Narine Kokhlikyan <narine.kokhlikyan@gmail.com>

Closes #10580 from NarineK/sparkrSavaAsRable.
2016-01-22 10:35:02 -08:00
Sun Rui 1b2a918e59 [SPARK-12204][SPARKR] Implement drop method for DataFrame in SparkR.
Author: Sun Rui <rui.sun@intel.com>

Closes #10201 from sun-rui/SPARK-12204.
2016-01-20 21:08:15 -08:00
Shubhanshu Mishra d7415991a1 [SPARK-12910] Fixes : R version for installing sparkR
Testing code:
```
$ ./install-dev.sh
USING R_HOME = /usr/bin
ERROR: this R is version 2.15.1, package 'SparkR' requires R >= 3.0
```

Using the new argument:
```
$ ./install-dev.sh /content/username/SOFTWARE/R-3.2.3
USING R_HOME = /content/username/SOFTWARE/R-3.2.3/bin
* installing *source* package ‘SparkR’ ...
** R
** inst
** preparing package for lazy loading
Creating a new generic function for ‘colnames’ in package ‘SparkR’
Creating a new generic function for ‘colnames<-’ in package ‘SparkR’
Creating a new generic function for ‘cov’ in package ‘SparkR’
Creating a new generic function for ‘na.omit’ in package ‘SparkR’
Creating a new generic function for ‘filter’ in package ‘SparkR’
Creating a new generic function for ‘intersect’ in package ‘SparkR’
Creating a new generic function for ‘sample’ in package ‘SparkR’
Creating a new generic function for ‘transform’ in package ‘SparkR’
Creating a new generic function for ‘subset’ in package ‘SparkR’
Creating a new generic function for ‘summary’ in package ‘SparkR’
Creating a new generic function for ‘lag’ in package ‘SparkR’
Creating a new generic function for ‘rank’ in package ‘SparkR’
Creating a new generic function for ‘sd’ in package ‘SparkR’
Creating a new generic function for ‘var’ in package ‘SparkR’
Creating a new generic function for ‘predict’ in package ‘SparkR’
Creating a new generic function for ‘rbind’ in package ‘SparkR’
Creating a generic function for ‘lapply’ from package ‘base’ in package ‘SparkR’
Creating a generic function for ‘Filter’ from package ‘base’ in package ‘SparkR’
Creating a generic function for ‘alias’ from package ‘stats’ in package ‘SparkR’
Creating a generic function for ‘substr’ from package ‘base’ in package ‘SparkR’
Creating a generic function for ‘%in%’ from package ‘base’ in package ‘SparkR’
Creating a generic function for ‘mean’ from package ‘base’ in package ‘SparkR’
Creating a generic function for ‘unique’ from package ‘base’ in package ‘SparkR’
Creating a generic function for ‘nrow’ from package ‘base’ in package ‘SparkR’
Creating a generic function for ‘ncol’ from package ‘base’ in package ‘SparkR’
Creating a generic function for ‘head’ from package ‘utils’ in package ‘SparkR’
Creating a generic function for ‘factorial’ from package ‘base’ in package ‘SparkR’
Creating a generic function for ‘atan2’ from package ‘base’ in package ‘SparkR’
Creating a generic function for ‘ifelse’ from package ‘base’ in package ‘SparkR’
** help
No man pages found in package  ‘SparkR’
*** installing help indices
** building package indices
** testing if installed package can be loaded
* DONE (SparkR)

```

Author: Shubhanshu Mishra <smishra8@illinois.edu>

Closes #10836 from napsternxg/master.
2016-01-20 18:06:06 -08:00
Herman van Hovell 1017327930 [SPARK-12848][SQL] Change parsed decimal literal datatype from Double to Decimal
The current parser turns a decimal literal, for example ```12.1```, into a Double. The problem with this approach is that we convert an exact literal into a non-exact ```Double```. The PR changes this behavior, a Decimal literal is now converted into an extact ```BigDecimal```.

The behavior for scientific decimals, for example ```12.1e01```, is unchanged. This will be converted into a Double.

This PR replaces the ```BigDecimal``` literal by a ```Double``` literal, because the ```BigDecimal``` is the default now. You can use the double literal by appending a 'D' to the value, for instance: ```3.141527D```

cc davies rxin

Author: Herman van Hovell <hvanhovell@questtec.nl>

Closes #10796 from hvanhovell/SPARK-12848.
2016-01-20 15:13:01 -08:00
felixcheung 488bbb216c [SPARK-12232][SPARKR] New R API for read.table to avoid name conflict
shivaram sorry it took longer to fix some conflicts, this is the change to add an alias for `table`

Author: felixcheung <felixcheung_m@hotmail.com>

Closes #10406 from felixcheung/readtable.
2016-01-19 18:31:03 -08:00
Sun Rui 3ac648289c [SPARK-12337][SPARKR] Implement dropDuplicates() method of DataFrame in SparkR.
Author: Sun Rui <rui.sun@intel.com>

Closes #10309 from sun-rui/SPARK-12337.
2016-01-19 16:37:18 -08:00
felixcheung 37fefa66cb [SPARK-12168][SPARKR] Add automated tests for conflicted function in R
Currently this is reported when loading the SparkR package in R (probably would add is.nan)
```
Loading required package: methods

Attaching package: ‘SparkR’

The following objects are masked from ‘package:stats’:

    cov, filter, lag, na.omit, predict, sd, var

The following objects are masked from ‘package:base’:

    colnames, colnames<-, intersect, rank, rbind, sample, subset,
    summary, table, transform
```

Adding this test adds an automated way to track changes to masked method.
Also, the second part of this test check for those functions that would not be accessible without namespace/package prefix.

Incidentally, this might point to how we would fix those inaccessible functions in base or stats.
Looking for feedback for adding this test.

Author: felixcheung <felixcheung_m@hotmail.com>

Closes #10171 from felixcheung/rmaskedtest.
2016-01-19 16:33:48 -08:00
felixcheung 92502703f4 [SPARK-12862][SPARKR] Jenkins does not run R tests
Slight correction: I'm leaving sparkR as-is (ie. R file not supported) and fixed only run-tests.sh as shivaram described.

I also assume we are going to cover all doc changes in https://issues.apache.org/jira/browse/SPARK-12846 instead of here.

rxin shivaram zjffdu

Author: felixcheung <felixcheung_m@hotmail.com>

Closes #10792 from felixcheung/sparkRcmd.
2016-01-17 09:29:08 -08:00
Oscar D. Lara Yejas ba4a641902 [SPARK-11031][SPARKR] Method str() on a DataFrame
Author: Oscar D. Lara Yejas <odlaraye@oscars-mbp.usca.ibm.com>
Author: Oscar D. Lara Yejas <olarayej@mail.usf.edu>
Author: Oscar D. Lara Yejas <oscar.lara.yejas@us.ibm.com>
Author: Oscar D. Lara Yejas <odlaraye@oscars-mbp.attlocal.net>

Closes #9613 from olarayej/SPARK-11031.
2016-01-15 07:37:54 -08:00
Wenchen Fan 962e9bcf94 [SPARK-12756][SQL] use hash expression in Exchange
This PR makes bucketing and exchange share one common hash algorithm, so that we can guarantee the data distribution is same between shuffle and bucketed data source, which enables us to only shuffle one side when join a bucketed table and a normal one.

This PR also fixes the tests that are broken by the new hash behaviour in shuffle.

Author: Wenchen Fan <wenchen@databricks.com>

Closes #10703 from cloud-fan/use-hash-expr-in-shuffle.
2016-01-13 22:43:28 -08:00
Yanbo Liang 3d77cffec0 [SPARK-12645][SPARKR] SparkR support hash function
Add ```hash``` function for SparkR ```DataFrame```.

Author: Yanbo Liang <ybliang8@gmail.com>

Closes #10597 from yanboliang/spark-12645.
2016-01-09 12:29:51 +05:30
Yanbo Liang d1fea41363 [SPARK-12393][SPARKR] Add read.text and write.text for SparkR
Add ```read.text``` and ```write.text``` for SparkR.
cc sun-rui felixcheung shivaram

Author: Yanbo Liang <ybliang8@gmail.com>

Closes #10348 from yanboliang/spark-12393.
2016-01-06 12:05:41 +05:30
felixcheung cc4d5229c9 [SPARK-12625][SPARKR][SQL] replace R usage of Spark SQL deprecated API
rxin davies shivaram
Took save mode from my PR #10480, and move everything to writer methods. This is related to PR #10559

- [x] it seems jsonRDD() is broken, need to investigate - this is not a public API though; will look into some more tonight. (fixed)

Author: felixcheung <felixcheung_m@hotmail.com>

Closes #10584 from felixcheung/rremovedeprecated.
2016-01-04 22:32:07 -08:00
felixcheung c3d505602d [SPARK-12327][SPARKR] fix code for lintr warning for commented code
shivaram

Author: felixcheung <felixcheung_m@hotmail.com>

Closes #10408 from felixcheung/rcodecomment.
2016-01-03 20:53:35 +05:30
Hossein f6ecf14333 [SPARK-11199][SPARKR] Improve R context management story and add getOrCreate
* Changes api.r.SQLUtils to use ```SQLContext.getOrCreate``` instead of creating a new context.
* Adds a simple test

[SPARK-11199] #comment link with JIRA

Author: Hossein <hossein@databricks.com>

Closes #9185 from falaki/SPARK-11199.
2015-12-29 11:44:20 -08:00
Forest Fang d80cc90b55 [SPARK-12526][SPARKR] ifelse, when, otherwise` unable to take Column as value
`ifelse`, `when`, `otherwise` is unable to take `Column` typed S4 object as values.

For example:
```r
ifelse(lit(1) == lit(1), lit(2), lit(3))
ifelse(df$mpg > 0, df$mpg, 0)
```
will both fail with
```r
attempt to replicate an object of type 'environment'
```

The PR replaces `ifelse` calls with `if ... else ...` inside the function implementations to avoid attempt to vectorize(i.e. `rep()`). It remains to be discussed whether we should instead support vectorization in these functions for consistency because `ifelse` in base R is vectorized but I cannot foresee any scenarios these functions will want to be vectorized in SparkR.

For reference, added test cases which trigger failures:
```r
. Error: when(), otherwise() and ifelse() with column on a DataFrame ----------
error in evaluating the argument 'x' in selecting a method for function 'collect':
  error in evaluating the argument 'col' in selecting a method for function 'select':
  attempt to replicate an object of type 'environment'
Calls: when -> when -> ifelse -> ifelse

1: withCallingHandlers(eval(code, new_test_environment), error = capture_calls, message = function(c) invokeRestart("muffleMessage"))
2: eval(code, new_test_environment)
3: eval(expr, envir, enclos)
4: expect_equal(collect(select(df, when(df$a > 1 & df$b > 2, lit(1))))[, 1], c(NA, 1)) at test_sparkSQL.R:1126
5: expect_that(object, equals(expected, label = expected.label, ...), info = info, label = label)
6: condition(object)
7: compare(actual, expected, ...)
8: collect(select(df, when(df$a > 1 & df$b > 2, lit(1))))
Error: Test failures
Execution halted
```

Author: Forest Fang <forest.fang@outlook.com>

Closes #10481 from saurfang/spark-12526.
2015-12-29 12:45:24 +05:30
Reynold Xin f496031bd2 Bump master version to 2.0.0-SNAPSHOT.
Author: Reynold Xin <rxin@databricks.com>

Closes #10387 from rxin/version-bump.
2015-12-19 15:13:05 -08:00
Yanbo Liang 22f6cd86fc [SPARK-12310][SPARKR] Add write.json and write.parquet for SparkR
Add ```write.json``` and ```write.parquet``` for SparkR, and deprecated ```saveAsParquetFile```.

Author: Yanbo Liang <ybliang8@gmail.com>

Closes #10281 from yanboliang/spark-12310.
2015-12-16 10:34:30 -08:00
Jeff Zhang 2eb5af5f0d [SPARK-12318][SPARKR] Save mode in SparkR should be error by default
shivaram  Please help review.

Author: Jeff Zhang <zjffdu@apache.org>

Closes #10290 from zjffdu/SPARK-12318.
2015-12-16 10:32:32 -08:00
Shivaram Venkataraman fb3778de68 [SPARK-12327] Disable commented code lintr temporarily
cc yhuai felixcheung shaneknapp

Author: Shivaram Venkataraman <shivaram@cs.berkeley.edu>

Closes #10300 from shivaram/comment-lintr-disable.
2015-12-14 16:13:55 -08:00
gatorsmile 1e3526c2d3 [SPARK-12158][SPARKR][SQL] Fix 'sample' functions that break R unit test cases
The existing sample functions miss the parameter `seed`, however, the corresponding function interface in `generics` has such a parameter. Thus, although the function caller can call the function with the 'seed', we are not using the value.

This could cause SparkR unit tests failed. For example, I hit it in another PR:
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/47213/consoleFull

Author: gatorsmile <gatorsmile@gmail.com>

Closes #10160 from gatorsmile/sampleR.
2015-12-11 20:55:16 -08:00
Yanbo Liang 0fb9825556 [SPARK-12146][SPARKR] SparkR jsonFile should support multiple input files
* ```jsonFile``` should support multiple input files, such as:
```R
jsonFile(sqlContext, c(“path1”, “path2”)) # character vector as arguments
jsonFile(sqlContext, “path1,path2”)
```
* Meanwhile, ```jsonFile``` has been deprecated by Spark SQL and will be removed at Spark 2.0. So we mark ```jsonFile``` deprecated and use ```read.json``` at SparkR side.
* Replace all ```jsonFile``` with ```read.json``` at test_sparkSQL.R, but still keep jsonFile test case.
* If this PR is accepted, we should also make almost the same change for ```parquetFile```.

cc felixcheung sun-rui shivaram

Author: Yanbo Liang <ybliang8@gmail.com>

Closes #10145 from yanboliang/spark-12146.
2015-12-11 11:47:35 -08:00
Yanbo Liang d9d354ed40 [SPARK-12234][SPARKR] Fix ``subset` function error when only set `select`` argument
Fix ```subset``` function error when only set ```select``` argument. Please refer to the [JIRA](https://issues.apache.org/jira/browse/SPARK-12234) about the error and how to reproduce it.

cc sun-rui felixcheung shivaram

Author: Yanbo Liang <ybliang8@gmail.com>

Closes #10217 from yanboliang/spark-12234.
2015-12-10 10:18:58 -08:00
Yanbo Liang eeb58722ad [SPARK-12198][SPARKR] SparkR support read.parquet and deprecate parquetFile
SparkR support ```read.parquet``` and deprecate ```parquetFile```. This change is similar with #10145 for ```jsonFile```.

Author: Yanbo Liang <ybliang8@gmail.com>

Closes #10191 from yanboliang/spark-12198.
2015-12-10 09:44:53 -08:00
Sun Rui 39d677c8f1 [SPARK-12034][SPARKR] Eliminate warnings in SparkR test cases.
This PR:
1. Suppress all known warnings.
2. Cleanup test cases and fix some errors in test cases.
3. Fix errors in HiveContext related test cases. These test cases are actually not run previously due to a bug of creating TestHiveContext.
4. Support 'testthat' package version 0.11.0 which prefers that test cases be under 'tests/testthat'
5. Make sure the default Hadoop file system is local when running test cases.
6. Turn on warnings into errors.

Author: Sun Rui <rui.sun@intel.com>

Closes #10030 from sun-rui/SPARK-12034.
2015-12-07 10:38:17 -08:00
Yanbo Liang b6e8e63a0d [SPARK-12044][SPARKR] Fix usage of isnan, isNaN
1, Add ```isNaN``` to ```Column``` for SparkR. ```Column``` should has three related variable functions: ```isNaN, isNull, isNotNull```.
2, Replace ```DataFrame.isNaN``` with ```DataFrame.isnan``` at SparkR side. Because ```DataFrame.isNaN``` has been deprecated and will be removed at Spark 2.0.
<del>3, Add ```isnull``` to ```DataFrame``` for SparkR. ```DataFrame``` should has two related functions: ```isnan, isnull```.<del>

cc shivaram sun-rui felixcheung

Author: Yanbo Liang <ybliang8@gmail.com>

Closes #10037 from yanboliang/spark-12044.
2015-12-05 22:51:05 -08:00
Yanbo Liang 6979edf4e1 [SPARK-12115][SPARKR] Change numPartitions() to getNumPartitions() to be consistent with Scala/Python
Change ```numPartitions()``` to ```getNumPartitions()``` to be consistent with Scala/Python.
<del>Note: If we can not catch up with 1.6 release, it will be breaking change for 1.7 that we also need to explain in release note.<del>

cc sun-rui felixcheung shivaram

Author: Yanbo Liang <ybliang8@gmail.com>

Closes #10123 from yanboliang/spark-12115.
2015-12-05 16:39:01 -08:00
felixcheung 895b6c4747 [SPARK-11715][SPARKR] Add R support corr for Column Aggregration
Need to match existing method signature

Author: felixcheung <felixcheung_m@hotmail.com>

Closes #9680 from felixcheung/rcorr.
2015-12-05 16:00:12 -08:00
Sun Rui c8d0e160da [SPARK-11774][SPARKR] Implement struct(), encode(), decode() functions in SparkR.
Author: Sun Rui <rui.sun@intel.com>

Closes #9804 from sun-rui/SPARK-11774.
2015-12-05 15:49:51 -08:00
Sun Rui 5011f264fb [SPARK-12104][SPARKR] collect() does not handle multiple columns with same name.
Author: Sun Rui <rui.sun@intel.com>

Closes #10118 from sun-rui/SPARK-12104.
2015-12-03 21:11:10 -08:00
felixcheung 2213441e5e [SPARK-12019][SPARKR] Support character vector for sparkR.init(), check param and fix doc
and add tests.
Spark submit expects comma-separated list

Author: felixcheung <felixcheung_m@hotmail.com>

Closes #10034 from felixcheung/sparkrinitdoc.
2015-12-03 13:25:20 -08:00
Sun Rui cc7a1bc937 [SPARK-11781][SPARKR] SparkR has problem in inferring type of raw type.
Author: Sun Rui <rui.sun@intel.com>

Closes #9769 from sun-rui/SPARK-11781.
2015-11-29 11:08:26 -08:00
felixcheung c793d2d9a1 [SPARK-9319][SPARKR] Add support for setting column names, types
Add support for for colnames, colnames<-, coltypes<-
Also added tests for names, names<- which have no test previously.

I merged with PR 8984 (coltypes). Clicked the wrong thing, crewed up the PR. Recreated it here. Was #9218

shivaram sun-rui

Author: felixcheung <felixcheung_m@hotmail.com>

Closes #9654 from felixcheung/colnamescoltypes.
2015-11-28 21:16:21 -08:00
felixcheung 28e46ab463 [SPARK-12029][SPARKR] Improve column functions signature, param check, tests, fix doc and add examples
shivaram sun-rui

Author: felixcheung <felixcheung_m@hotmail.com>

Closes #10019 from felixcheung/rfunctionsdoc.
2015-11-28 21:02:05 -08:00
Yanbo Liang ba02f6cb5a [SPARK-12025][SPARKR] Rename some window rank function names for SparkR
Change ```cumeDist -> cume_dist, denseRank -> dense_rank, percentRank -> percent_rank, rowNumber -> row_number``` at SparkR side.
There are two reasons that we should make this change:
* We should follow the [naming convention rule of R](http://www.inside-r.org/node/230645)
* Spark DataFrame has deprecated the old convention (such as ```cumeDist```) and will remove it in Spark 2.0.

It's better to fix this issue before 1.6 release, otherwise we will make breaking API change.
cc shivaram sun-rui

Author: Yanbo Liang <ybliang8@gmail.com>

Closes #10016 from yanboliang/SPARK-12025.
2015-11-27 11:48:01 -08:00
felixcheung a6239d587c [SPARK-11756][SPARKR] Fix use of aliases - SparkR can not output help information for SparkR:::summary correctly
Fix use of aliases and changes uses of rdname and seealso
`aliases` is the hint for `?` - it should not be linked to some other name - those should be seealso
https://cran.r-project.org/web/packages/roxygen2/vignettes/rd.html

Clean up usage on family, as multiple use of family with the same rdname is causing duplicated See Also html blocks (like http://spark.apache.org/docs/latest/api/R/count.html)
Also changing some rdname for dplyr-like variant for better R user visibility in R doc, eg. rbind, summary, mutate, summarize

shivaram yanboliang

Author: felixcheung <felixcheung_m@hotmail.com>

Closes #9750 from felixcheung/rdocaliases.
2015-11-20 15:10:55 -08:00
felixcheung 1a93323c5b [SPARK-11339][SPARKR] Document the list of functions in R base package that are masked by functions with same name in SparkR
Added tests for function that are reported as masked, to make sure the base:: or stats:: function can be called.

For those we can't call, added them to SparkR programming guide.

It would seem to me `table, sample, subset, filter, cov` not working are not actually expected - I investigated/experimented with them but couldn't get them to work. It looks like as they are defined in base or stats they are missing the S3 generic, eg.
```
> methods("transform")
[1] transform,ANY-method       transform.data.frame
[3] transform,DataFrame-method transform.default
see '?methods' for accessing help and source code
> methods("subset")
[1] subset.data.frame       subset,DataFrame-method subset.default
[4] subset.matrix
see '?methods' for accessing help and source code
Warning message:
In .S3methods(generic.function, class, parent.frame()) :
  function 'subset' appears not to be S3 generic; found functions that look like S3 methods
```
Any idea?

More information on masking:
http://www.ats.ucla.edu/stat/r/faq/referencing_objects.htm
http://www.sfu.ca/~sweldon/howTo/guide4.pdf

This is what the output doc looks like (minus css):
![image](https://cloud.githubusercontent.com/assets/8969467/11229714/2946e5de-8d4d-11e5-94b0-dda9696b6fdd.png)

Author: felixcheung <felixcheung_m@hotmail.com>

Closes #9785 from felixcheung/rmasked.
2015-11-18 23:32:49 -08:00
Yanbo Liang e222d75849 [SPARK-11684][R][ML][DOC] Update SparkR glm API doc, user guide and example codes
This PR includes:
* Update SparkR:::glm, SparkR:::summary API docs.
* Update SparkR machine learning user guide and example codes to show:
  * supporting feature interaction in R formula.
  * summary for gaussian GLM model.
  * coefficients for binomial GLM model.

mengxr

Author: Yanbo Liang <ybliang8@gmail.com>

Closes #9727 from yanboliang/spark-11684.
2015-11-18 13:30:29 -08:00
Sun Rui 224723e6a8 [SPARK-11773][SPARKR] Implement collection functions in SparkR.
Author: Sun Rui <rui.sun@intel.com>

Closes #9764 from sun-rui/SPARK-11773.
2015-11-18 08:41:45 -08:00
zero323 a97d6f3a58 [SPARK-11281][SPARKR] Add tests covering the issue.
The goal of this PR is to add tests covering the issue to ensure that is was resolved by [SPARK-11086](https://issues.apache.org/jira/browse/SPARK-11086).

Author: zero323 <matthew.szymkiewicz@gmail.com>

Closes #9743 from zero323/SPARK-11281-tests.
2015-11-18 08:32:03 -08:00
Yanbo Liang 8fb775ba87 [SPARK-11755][R] SparkR should export "predict"
The bug described at [SPARK-11755](https://issues.apache.org/jira/browse/SPARK-11755), after exporting ```predict``` we can both get the help information from the SparkR and base R package like the following:
```Java
> help(predict)
Help on topic ‘predict’ was found in the following packages:

  Package               Library
  SparkR                /Users/yanboliang/data/trunk2/spark/R/lib
  stats                 /Library/Frameworks/R.framework/Versions/3.2/Resources/library

Choose one

1: Make predictions from a model {SparkR}
2: Model Predictions {stats}
```

Author: Yanbo Liang <ybliang8@gmail.com>

Closes #9732 from yanboliang/spark-11755.
2015-11-17 22:13:15 -08:00
Sun Rui 835a79d78e [SPARK-10500][SPARKR] sparkr.zip cannot be created if /R/lib is unwritable
The basic idea is that:
The archive of the SparkR package itself, that is sparkr.zip, is created during build process and is contained in the Spark binary distribution. No change to it after the distribution is installed as the directory it resides ($SPARK_HOME/R/lib) may not be writable.

When there is R source code contained in jars or Spark packages specified with "--jars" or "--packages" command line option, a temporary directory is created by calling Utils.createTempDir() where the R packages built from the R source code will be installed. The temporary directory is writable, and won't interfere with each other when there are multiple SparkR sessions, and will be deleted when this SparkR session ends. The R binary packages installed in the temporary directory then are packed into an archive named rpkg.zip.

sparkr.zip and rpkg.zip are distributed to the cluster in YARN modes.

The distribution of rpkg.zip in Standalone modes is not supported in this PR, and will be address in another PR.

Various R files are updated to accept multiple lib paths (one is for SparkR package, the other is for other R packages)  so that these package can be accessed in R.

Author: Sun Rui <rui.sun@intel.com>

Closes #9390 from sun-rui/SPARK-10500.
2015-11-15 19:29:09 -08:00
zero323 d7d9fa0b87 [SPARK-11086][SPARKR] Use dropFactors column-wise instead of nested loop when createDataFrame
Use `dropFactors` column-wise instead of nested loop when `createDataFrame` from a `data.frame`

At this moment SparkR createDataFrame is using nested loop to convert factors to character when called on a local data.frame.  It works but is incredibly slow especially with data.table (~ 2 orders of magnitude compared to PySpark / Pandas version on a DateFrame of size 1M rows x 2 columns).

A simple improvement is to apply `dropFactor `column-wise and then reshape output list.

It should at least partially address [SPARK-8277](https://issues.apache.org/jira/browse/SPARK-8277).

Author: zero323 <matthew.szymkiewicz@gmail.com>

Closes #9099 from zero323/SPARK-11086.
2015-11-15 19:15:27 -08:00
felixcheung ed04846e14 [SPARK-11263][SPARKR] lintr Throws Warnings on Commented Code in Documentation
Clean out hundreds of `style: Commented code should be removed.` from lintr

Like these:
```
/opt/spark-1.6.0-bin-hadoop2.6/R/pkg/R/DataFrame.R:513:3: style: Commented code should be removed.
# sc <- sparkR.init()
  ^~~~~~~~~~~~~~~~~~~
/opt/spark-1.6.0-bin-hadoop2.6/R/pkg/R/DataFrame.R:514:3: style: Commented code should be removed.
# sqlContext <- sparkRSQL.init(sc)
  ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
/opt/spark-1.6.0-bin-hadoop2.6/R/pkg/R/DataFrame.R:515:3: style: Commented code should be removed.
# path <- "path/to/file.json"
  ^~~~~~~~~~~~~~~~~~~~~~~~~~~
```

tried without export or rdname, neither work
instead, added this `#' noRd` to suppress .Rd file generation

also updated `family` for DataFrame functions for longer descriptive text instead of `dataframe_funcs`
![image](https://cloud.githubusercontent.com/assets/8969467/10933937/17bf5b1e-8291-11e5-9777-40fc632105dc.png)

this covers *most* of 'Commented code' but I left out a few that looks legitimate.

Author: felixcheung <felixcheung_m@hotmail.com>

Closes #9463 from felixcheung/rlintr.
2015-11-12 20:02:49 -08:00
JihongMa d292f74831 [SPARK-11420] Updating Stddev support via Imperative Aggregate
switched stddev support from DeclarativeAggregate to ImperativeAggregate.

Author: JihongMa <linlin200605@gmail.com>

Closes #9380 from JihongMA/SPARK-11420.
2015-11-12 13:47:34 -08:00
felixcheung 1a8e0468a1 [SPARK-11468] [SPARKR] add stddev/variance agg functions for Column
Checked names, none of them should conflict with anything in base

shivaram davies rxin

Author: felixcheung <felixcheung_m@hotmail.com>

Closes #9489 from felixcheung/rstddev.
2015-11-10 22:45:17 -08:00
Yanbo Liang f14e95115c [ML][R] SparkR::glm summary result to compare with native R
Follow up #9561. Due to [SPARK-11587](https://issues.apache.org/jira/browse/SPARK-11587) has been fixed, we should compare SparkR::glm summary result with native R output rather than hard-code one. mengxr

Author: Yanbo Liang <ybliang8@gmail.com>

Closes #9590 from yanboliang/glm-r-test.
2015-11-10 11:34:36 -08:00
Oscar D. Lara Yejas 47735cdc2a [SPARK-10863][SPARKR] Method coltypes() (New version)
This is a follow up on PR #8984, as the corresponding branch for such PR was damaged.

Author: Oscar D. Lara Yejas <olarayej@mail.usf.edu>

Closes #9579 from olarayej/SPARK-10863_NEW14.
2015-11-10 11:07:57 -08:00
Yin Huai e0701c7560 [SPARK-9830][SQL] Remove AggregateExpression1 and Aggregate Operator used to evaluate AggregateExpression1s
https://issues.apache.org/jira/browse/SPARK-9830

This PR contains the following main changes.
* Removing `AggregateExpression1`.
* Removing `Aggregate` operator, which is used to evaluate `AggregateExpression1`.
* Removing planner rule used to plan `Aggregate`.
* Linking `MultipleDistinctRewriter` to analyzer.
* Renaming `AggregateExpression2` to `AggregateExpression` and `AggregateFunction2` to `AggregateFunction`.
* Updating places where we create aggregate expression. The way to create aggregate expressions is `AggregateExpression(aggregateFunction, mode, isDistinct)`.
* Changing `val`s in `DeclarativeAggregate`s that touch children of this function to `lazy val`s (when we create aggregate expression in DataFrame API, children of an aggregate function can be unresolved).

Author: Yin Huai <yhuai@databricks.com>

Closes #9556 from yhuai/removeAgg1.
2015-11-10 11:06:29 -08:00
Shivaram Venkataraman c4e19b3819 [SPARK-11587][SPARKR] Fix the summary generic to match base R
The signature is summary(object, ...) as defined in
https://stat.ethz.ch/R-manual/R-devel/library/base/html/summary.html

Author: Shivaram Venkataraman <shivaram@cs.berkeley.edu>

Closes #9582 from shivaram/summary-fix.
2015-11-09 21:06:01 -08:00
felixcheung cd174882a5 [SPARK-9865][SPARKR] Flaky SparkR test: test_sparkSQL.R: sample on a DataFrame
Make sample test less flaky by setting the seed

Tested with
```
repeat {  if (count(sample(df, FALSE, 0.1)) == 3) { break } }
```

Author: felixcheung <felixcheung_m@hotmail.com>

Closes #9549 from felixcheung/rsample.
2015-11-09 10:26:09 -08:00
Yanbo Liang 8c0e1b50e9 [SPARK-11494][ML][R] Expose R-like summary statistics in SparkR::glm for linear regression
Expose R-like summary statistics in SparkR::glm for linear regression, the output of ```summary``` like
```Java
$DevianceResiduals
 Min        Max
 -0.9509607 0.7291832

$Coefficients
                   Estimate   Std. Error t value   Pr(>|t|)
(Intercept)        1.6765     0.2353597  7.123139  4.456124e-11
Sepal_Length       0.3498801  0.04630128 7.556598  4.187317e-12
Species_versicolor -0.9833885 0.07207471 -13.64402 0
Species_virginica  -1.00751   0.09330565 -10.79796 0
```

Author: Yanbo Liang <ybliang8@gmail.com>

Closes #9561 from yanboliang/spark-11494.
2015-11-09 08:56:22 -08:00
Imran Rashid 49f1a82037 [SPARK-10116][CORE] XORShiftRandom.hashSeed is random in high bits
https://issues.apache.org/jira/browse/SPARK-10116

This is really trivial, just happened to notice it -- if `XORShiftRandom.hashSeed` is really supposed to have random bits throughout (as the comment implies), it needs to do something for the conversion to `long`.

mengxr mkolod

Author: Imran Rashid <irashid@cloudera.com>

Closes #8314 from squito/SPARK-10116.
2015-11-06 20:06:24 +00:00
Davies Liu 2440106242 [SPARK-11542] [SPARKR] fix glm with long fomular
Because deparse() will break the long string into multiple lines, the deserialization will fail

Author: Davies Liu <davies@databricks.com>

Closes #9510 from davies/fix_glm.
2015-11-05 16:34:10 -08:00
adrian555 b9455d1f18 [SPARK-11260][SPARKR] with() function support
Author: adrian555 <wzhuang@us.ibm.com>
Author: Adrian Zhuang <adrian555@users.noreply.github.com>

Closes #9443 from adrian555/with.
2015-11-05 14:47:38 -08:00
Yanbo Liang e328b69c31 [SPARK-9492][ML][R] LogisticRegression in R should provide model statistics
Like ml ```LinearRegression```, ```LogisticRegression``` should provide a training summary including feature names and their coefficients.

Author: Yanbo Liang <ybliang8@gmail.com>

Closes #9303 from yanboliang/spark-9492.
2015-11-04 08:28:33 -08:00
lewuathe d648a4ad54 [DOC] Missing link to R DataFrame API doc
Author: lewuathe <lewuathe@me.com>
Author: Lewuathe <lewuathe@me.com>

Closes #9394 from Lewuathe/missing-link-to-R-dataframe.
2015-11-03 16:38:22 -08:00
vectorijk c020f7d9d4 [SPARK-10592] [ML] [PySpark] Deprecate weights and use coefficients instead in ML models
Deprecated in `LogisticRegression` and `LinearRegression`

Author: vectorijk <jiangkai@gmail.com>

Closes #9311 from vectorijk/spark-10592.
2015-11-02 16:12:04 -08:00
felixcheung bb5a2af034 [SPARK-11340][SPARKR] Support setting driver properties when starting Spark from R programmatically or from RStudio
Mapping spark.driver.memory from sparkEnvir to spark-submit commandline arguments.

shivaram suggested that we possibly add other spark.driver.* properties - do we want to add all of those? I thought those could be set in SparkConf?
sun-rui

Author: felixcheung <felixcheung_m@hotmail.com>

Closes #9290 from felixcheung/rdrivermem.
2015-10-30 13:51:32 -07:00
Sun Rui 40c77fb23a [SPARK-11210][SPARKR] Add window functions into SparkR [step 2].
Author: Sun Rui <rui.sun@intel.com>

Closes #9196 from sun-rui/SPARK-11210.
2015-10-30 10:56:06 -07:00
felixcheung d89be0bf81 [SPARK-11409][SPARKR] Enable url link in R doc for Persist
Quick one line doc fix
link is not clickable
![image](https://cloud.githubusercontent.com/assets/8969467/10833041/4e91dd7c-7e4c-11e5-8905-713b986dbbde.png)

shivaram

Author: felixcheung <felixcheung_m@hotmail.com>

Closes #9363 from felixcheung/rpersistdoc.
2015-10-29 21:01:10 -07:00
Yanbo Liang fba9e95452 [SPARK-11369][ML][R] SparkR glm should support setting standardize
SparkR glm currently support :
```formula, family = c(“gaussian”, “binomial”), data, lambda = 0, alpha = 0```
We should also support setting standardize which has been defined at [design documentation](https://docs.google.com/document/d/10NZNSEurN2EdWM31uFYsgayIPfCFHiuIu3pCWrUmP_c/edit)

Author: Yanbo Liang <ybliang8@gmail.com>

Closes #9331 from yanboliang/spark-11369.
2015-10-28 08:50:21 -07:00
Sun Rui dc3220ce11 [SPARK-11209][SPARKR] Add window functions into SparkR [step 1].
Author: Sun Rui <rui.sun@intel.com>

Closes #9193 from sun-rui/SPARK-11209.
2015-10-26 20:58:18 -07:00
Narine Kokhlikyan 3689beb98b [SPARK-10979][SPARKR] Sparkrmerge: Add merge to DataFrame with R signature
Add merge function to DataFrame, which supports R signature.
https://stat.ethz.ch/R-manual/R-devel/library/base/html/merge.html

Author: Narine Kokhlikyan <narine.kokhlikyan@gmail.com>

Closes #9012 from NarineK/sparkrmerge.
2015-10-26 15:12:25 -07:00
felixcheung 5e45812501 [SPARK-11294][SPARKR] Improve R doc for read.df, write.df, saveAsTable
Add examples for read.df, write.df; fix grouping for read.df, loadDF; fix formatting and text truncation for write.df, saveAsTable.

Several text issues:
![image](https://cloud.githubusercontent.com/assets/8969467/10708590/1303a44e-79c3-11e5-854f-3a2e16854cd7.png)
- text collapsed into a single paragraph
- text truncated at 2 places, eg. "overwrite: Existing data is expected to be overwritten by the contents of error:"

shivaram

Author: felixcheung <felixcheung_m@hotmail.com>

Closes #9261 from felixcheung/rdocreadwritedf.
2015-10-23 21:42:00 -07:00
Forest Fang 94e2064fa1 [SPARK-11244][SPARKR] sparkR.stop() should remove SQLContext
SparkR should remove `.sparkRSQLsc` and `.sparkRHivesc` when `sparkR.stop()` is called. Otherwise even when SparkContext is reinitialized, `sparkRSQL.init` returns the stale copy of the object and complains:

```r
sc <- sparkR.init("local")
sqlContext <- sparkRSQL.init(sc)
sparkR.stop()
sc <- sparkR.init("local")
sqlContext <- sparkRSQL.init(sc)
sqlContext
```
producing
```r
Error in callJMethod(x, "getClass") :
  Invalid jobj 1. If SparkR was restarted, Spark operations need to be re-executed.
```

I have added the check and removal only when SparkContext itself is initialized. I have also added corresponding test for this fix. Let me know if you want me to move the test to SQL test suite instead.

p.s. I tried lint-r but ended up a lots of errors on existing code.

Author: Forest Fang <forest.fang@outlook.com>

Closes #9205 from saurfang/sparkR.stop.
2015-10-22 09:34:07 -07:00
Davies Liu f8c6bec657 [SPARK-11197][SQL] run SQL on files directly
This PR introduce a new feature to run SQL directly on files without create a table, for example:

```
select id from json.`path/to/json/files` as j
```

Author: Davies Liu <davies@databricks.com>

Closes #9173 from davies/source.
2015-10-21 13:38:30 -07:00
felixcheung 1107bd958a [SPARK-11221][SPARKR] fix R doc for lit and add examples
Currently the documentation for `lit` is inconsistent with doc format, references "Scala symbol" and has no example. Fixing that.
shivaram

Author: felixcheung <felixcheung_m@hotmail.com>

Closes #9187 from felixcheung/rlit.
2015-10-20 21:40:22 -07:00
lewuathe 4c33a34ba3 [SPARK-10668] [ML] Use WeightedLeastSquares in LinearRegression with L…
…2 regularization if the number of features is small

Author: lewuathe <lewuathe@me.com>
Author: Lewuathe <sasaki@treasure-data.com>
Author: Kai Sasaki <sasaki@treasure-data.com>
Author: Lewuathe <lewuathe@me.com>

Closes #8884 from Lewuathe/SPARK-10668.
2015-10-19 10:46:10 -07:00
Sun Rui 390b22fad6 [SPARK-10996] [SPARKR] Implement sampleBy() in DataFrameStatFunctions.
Author: Sun Rui <rui.sun@intel.com>

Closes #9023 from sun-rui/SPARK-10996.
2015-10-13 22:31:23 -07:00
Monica Liu 8b32885704 [SPARK-10981] [SPARKR] SparkR Join improvements
I was having issues with collect() and orderBy() in Spark 1.5.0 so I used the DataFrame.R file and test_sparkSQL.R file from the Spark 1.5.1 download. I only modified the join() function in DataFrame.R to include "full", "fullouter", "left", "right", and "leftsemi" and added corresponding test cases in the test for join() and merge() in test_sparkSQL.R file.
Pull request because I filed this JIRA bug report:
https://issues.apache.org/jira/browse/SPARK-10981

Author: Monica Liu <liu.monica.f@gmail.com>

Closes #9029 from mfliu/master.
2015-10-13 22:24:52 -07:00
Adrian Zhuang f7f28ee7a5 [SPARK-10913] [SPARKR] attach() function support
Bring the change code up to date.

Author: Adrian Zhuang <adrian555@users.noreply.github.com>
Author: adrian555 <wzhuang@us.ibm.com>

Closes #9031 from adrian555/attach2.
2015-10-13 10:21:07 -07:00
Narine Kokhlikyan 1e0aba90b9 [SPARK-10888] [SPARKR] Added as.DataFrame as a synonym to createDataFrame
as.DataFrame is more a R-style like signature.
Also, I'd like to know if we could make the context, e.g. sqlContext global, so that we do not have to specify it as an argument, when we each time create a dataframe.

Author: Narine Kokhlikyan <narine.kokhlikyan@gmail.com>

Closes #8952 from NarineK/sparkrasDataFrame.
2015-10-13 10:09:05 -07:00
Sun Rui 5e3868ba13 [SPARK-10051] [SPARKR] Support collecting data of StructType in DataFrame
Two points in this PR:

1.    Originally thought was that a named R list is assumed to be a struct in SerDe. But this is problematic because some R functions will implicitly generate named lists that are not intended to be a struct when transferred by SerDe. So SerDe clients have to explicitly mark a names list as struct by changing its class from "list" to "struct".

2.    SerDe is in the Spark Core module, and data of StructType is represented as GenricRow which is defined in Spark SQL module. SerDe can't import GenricRow as in maven build  Spark SQL module depends on Spark Core module. So this PR adds a registration hook in SerDe to allow SQLUtils in Spark SQL module to register its functions for serialization and deserialization of StructType.

Author: Sun Rui <rui.sun@intel.com>

Closes #8794 from sun-rui/SPARK-10051.
2015-10-13 10:02:21 -07:00
Sun Rui 864de3bf40 [SPARK-10079] [SPARKR] Make 'column' and 'col' functions be S4 functions.
1.  Add a "col" function into DataFrame.
2.  Move the current "col" function in Column.R to functions.R, convert it to S4 function.
3.  Add a s4 "column" function in functions.R.
4.  Convert the "column" function in Column.R to S4 function. This is for private use.

Author: Sun Rui <rui.sun@intel.com>

Closes #8864 from sun-rui/SPARK-10079.
2015-10-09 23:05:38 -07:00
Rerngvit Yanggratoke 70f44ad2d8 [SPARK-10905] [SPARKR] Export freqItems() for DataFrameStatFunctions
[SPARK-10905][SparkR]: Export freqItems() for DataFrameStatFunctions
- Add function (together with roxygen2 doc) to DataFrame.R and generics.R
- Expose the function in NAMESPACE
- Add unit test for the function

Author: Rerngvit Yanggratoke <rerngvit@kth.se>

Closes #8962 from rerngvit/SPARK-10905.
2015-10-09 09:36:40 -07:00
Narine Kokhlikyan e8f90d9dda [SPARK-10836] [SPARKR] Added sort(x, decreasing, col, ... ) method to DataFrame
the sort function can be used as an alternative to arrange(... ).
As arguments it accepts x - dataframe, decreasing - TRUE/FALSE, a list of orderings for columns and the list of columns, represented as string names

for example:
sort(df, TRUE, "col1","col2","col3","col5") # for example, if we want to sort some of the columns in the same order

sort(df, decreasing=TRUE, "col1")
sort(df, decreasing=c(TRUE,FALSE), "col1","col2")

Author: Narine Kokhlikyan <narine.kokhlikyan@gmail.com>

Closes #8920 from NarineK/sparkrsort.
2015-10-08 09:53:44 -07:00
Sun Rui f57c63d4c3 [SPARK-10752] [SPARKR] Implement corr() and cov in DataFrameStatFunctions.
Author: Sun Rui <rui.sun@intel.com>

Closes #8869 from sun-rui/SPARK-10752.
2015-10-07 09:46:37 -07:00
felixcheung 721e8b5f35 [SPARK-10904] [SPARKR] Fix to support select(df, c("col1", "col2"))
The fix is to coerce `c("a", "b")` into a list such that it could be serialized to call JVM with.

Author: felixcheung <felixcheung_m@hotmail.com>

Closes #8961 from felixcheung/rselect.
2015-10-03 22:42:36 -07:00
Oscar D. Lara Yejas f21e2da03f [SPARK-10807] [SPARKR] Added as.data.frame as a synonym for collect
Created method as.data.frame as a synonym for collect().

Author: Oscar D. Lara Yejas <olarayej@mail.usf.edu>
Author: olarayej <oscar.lara.yejas@us.ibm.com>
Author: Oscar D. Lara Yejas <oscar.lara.yejas@us.ibm.com>

Closes #8908 from olarayej/SPARK-10807.
2015-09-30 18:03:31 -07:00
Narine Kokhlikyan 6fcee906d2 [SPARK-10760] [SPARKR] SparkR glm: the documentation in examples - family argument is missing
Hi everyone,

Since the family argument is required for the glm function, the execution of:

model <- glm(Sepal_Length ~ Sepal_Width, df)

is failing.

I've fixed the documentation by adding the family argument and also added the summay(model) which will show the coefficients for the model.

Thanks,
Narine

Author: Narine Kokhlikyan <narine.kokhlikyan@gmail.com>

Closes #8870 from NarineK/sparkrml.
2015-09-25 11:55:08 -07:00
Eric Liang 922338812c [SPARK-9681] [ML] Support R feature interactions in RFormula
This integrates the Interaction feature transformer with SparkR R formula support (i.e. support `:`).

To generate reasonable ML attribute names for feature interactions, it was necessary to add the ability to read attribute the original attribute names back from `StructField`, and also to specify custom group prefixes in `VectorAssembler`. This also has the side-benefit of cleaning up the double-underscores in the attributes generated for non-interaction terms.

mengxr

Author: Eric Liang <ekl@databricks.com>

Closes #8830 from ericl/interaction-2.
2015-09-25 00:43:22 -07:00
Sun Rui 896edb51ab [SPARK-10050] [SPARKR] Support collecting data of MapType in DataFrame.
1. Support collecting data of MapType from DataFrame.
2. Support data of MapType in createDataFrame.

Author: Sun Rui <rui.sun@intel.com>

Closes #8711 from sun-rui/SPARK-10050.
2015-09-16 13:20:39 -07:00
Reynold Xin 09b7e7c198 Update version to 1.6.0-SNAPSHOT.
Author: Reynold Xin <rxin@databricks.com>

Closes #8350 from rxin/1.6.
2015-09-15 00:54:20 -07:00
JihongMa f4a22808e0 [SPARK-6548] Adding stddev to DataFrame functions
Adding STDDEV support for DataFrame using 1-pass online /parallel algorithm to compute variance. Please review the code change.

Author: JihongMa <linlin200605@gmail.com>
Author: Jihong MA <linlin200605@gmail.com>
Author: Jihong MA <jihongma@jihongs-mbp.usca.ibm.com>
Author: Jihong MA <jihongma@Jihongs-MacBook-Pro.local>

Closes #6297 from JihongMA/SPARK-SQL.
2015-09-12 10:17:15 -07:00
Sun Rui 45e3be5c13 [SPARK-10049] [SPARKR] Support collecting data of ArraryType in DataFrame.
this PR :
1.  Enhance reflection in RBackend. Automatically matching a Java array to Scala Seq when finding methods. Util functions like seq(), listToSeq() in R side can be removed, as they will conflict with the Serde logic that transferrs a Scala seq to R side.

2.  Enhance the SerDe to support transferring  a Scala seq to R side. Data of ArrayType in DataFrame
after collection is observed to be of Scala Seq type.

3.  Support ArrayType in createDataFrame().

Author: Sun Rui <rui.sun@intel.com>

Closes #8458 from sun-rui/SPARK-10049.
2015-09-10 12:21:13 -07:00
Shivaram Venkataraman 143e521d1a [MINOR] Minor style fix in SparkR
`dev/lintr-r` passes on my machine now

Author: Shivaram Venkataraman <shivaram@cs.berkeley.edu>

Closes #8601 from shivaram/sparkr-style-fix.
2015-09-04 11:24:47 -07:00
CHOIJAEHONG af0e3125cb [SPARK-8951] [SPARKR] support Unicode characters in collect()
Spark gives an error message and does not show the output when a field of the result DataFrame contains characters in CJK.
I changed SerDe.scala in order that Spark support Unicode characters when writes a string to R.

Author: CHOIJAEHONG <redrock07@naver.com>

Closes #7494 from CHOIJAEHONG1/SPARK-8951.
2015-09-03 13:38:26 -07:00
felixcheung 2a4e00ca4d [SPARK-9803] [SPARKR] Add subset and transform + tests
Add subset and transform
Also reorganize `[` & `[[` to subset instead of select

Note: for transform, transform is very similar to mutate. Spark doesn't seem to replace existing column with the name in mutate (ie. `mutate(df, age = df$age + 2)` - returned DataFrame has 2 columns with the same name 'age'), so therefore not doing that for now in transform.
Though it is clearly stated it should replace column with matching name (should I open a JIRA for mutate/transform?)

Author: felixcheung <felixcheung_m@hotmail.com>

Closes #8503 from felixcheung/rsubset_transform.
2015-08-28 18:35:01 -07:00
Luciano Resende 499e8e154b [SPARK-8952] [SPARKR] - Wrap normalizePath calls with suppressWarnings
This is based on davies comment on SPARK-8952 which suggests to only call normalizePath() when path starts with '~'

Author: Luciano Resende <lresende@apache.org>

Closes #8343 from lresende/SPARK-8952.
2015-08-28 09:13:21 -07:00
Shivaram Venkataraman 2f99c37273 [SPARK-10328] [SPARKR] Fix generic for na.omit
S3 function is at https://stat.ethz.ch/R-manual/R-patched/library/stats/html/na.fail.html

Author: Shivaram Venkataraman <shivaram@cs.berkeley.edu>
Author: Shivaram Venkataraman <shivaram.venkataraman@gmail.com>
Author: Yu ISHIKAWA <yuu.ishikawa@gmail.com>

Closes #8495 from shivaram/na-omit-fix.
2015-08-28 00:37:50 -07:00
Shivaram Venkataraman e936cf8088 [SPARK-10219] [SPARKR] Fix varargsToEnv and add test case
cc sun-rui davies

Author: Shivaram Venkataraman <shivaram@cs.berkeley.edu>

Closes #8475 from shivaram/varargs-fix.
2015-08-26 22:27:31 -07:00
Yu ISHIKAWA 773ca037a4 [MINOR] [SPARKR] Fix some validation problems in SparkR
Getting rid of some validation problems in SparkR
https://github.com/apache/spark/pull/7883

cc shivaram

```
inst/tests/test_Serde.R:26:1: style: Trailing whitespace is superfluous.

^~
inst/tests/test_Serde.R:34:1: style: Trailing whitespace is superfluous.

^~
inst/tests/test_Serde.R:37:38: style: Trailing whitespace is superfluous.
  expect_equal(class(x), "character")
                                     ^~
inst/tests/test_Serde.R:50:1: style: Trailing whitespace is superfluous.

^~
inst/tests/test_Serde.R:55:1: style: Trailing whitespace is superfluous.

^~
inst/tests/test_Serde.R:60:1: style: Trailing whitespace is superfluous.

^~
inst/tests/test_sparkSQL.R:611:1: style: Trailing whitespace is superfluous.

^~
R/DataFrame.R:664:1: style: Trailing whitespace is superfluous.

^~~~~~~~~~~~~~
R/DataFrame.R:670:55: style: Trailing whitespace is superfluous.
                df <- data.frame(row.names = 1 : nrow)
                                                      ^~~~~~~~~~~~~~~~
R/DataFrame.R:672:1: style: Trailing whitespace is superfluous.

^~~~~~~~~~~~~~
R/DataFrame.R:686:49: style: Trailing whitespace is superfluous.
                    df[[names[colIndex]]] <- vec
                                                ^~~~~~~~~~~~~~~~~~
```

Author: Yu ISHIKAWA <yuu.ishikawa@gmail.com>

Closes #8474 from yu-iskw/minor-fix-sparkr.
2015-08-26 18:14:32 -07:00
Shivaram Venkataraman ad7f0f160b [SPARK-10308] [SPARKR] Add %in% to the exported namespace
I also checked all the other functions defined in column.R, functions.R and DataFrame.R and everything else looked fine.

cc yu-iskw

Author: Shivaram Venkataraman <shivaram@cs.berkeley.edu>

Closes #8473 from shivaram/in-namespace.
2015-08-26 18:13:07 -07:00
felixcheung 75d4773aa5 [SPARK-9316] [SPARKR] Add support for filtering using [ (synonym for filter / select)
Add support for
```
   df[df$name == "Smith", c(1,2)]
   df[df$age %in% c(19, 30), 1:2]
```

shivaram

Author: felixcheung <felixcheung_m@hotmail.com>

Closes #8394 from felixcheung/rsubset.
2015-08-25 23:48:16 -07:00
Sun Rui 71a138cd0e [SPARK-10048] [SPARKR] Support arbitrary nested Java array in serde.
This PR:
1. supports transferring arbitrary nested array from JVM to R side in SerDe;
2. based on 1, collect() implemenation is improved. Now it can support collecting data of complex types
   from a DataFrame.

Author: Sun Rui <rui.sun@intel.com>

Closes #8276 from sun-rui/SPARK-10048.
2015-08-25 13:14:10 -07:00
Yu ISHIKAWA d4549fe58f [SPARK-10214] [SPARKR] [DOCS] Improve SparkR Column, DataFrame API docs
cc: shivaram

## Summary

- Add name tags to each methods in DataFrame.R and column.R
- Replace `rdname column` with `rdname {each_func}`. i.e. alias method : `rdname column` =>  `rdname alias`

## Generated PDF File
https://drive.google.com/file/d/0B9biIZIU47lLNHN2aFpnQXlSeGs/view?usp=sharing

## JIRA
[[SPARK-10214] Improve SparkR Column, DataFrame API docs - ASF JIRA](https://issues.apache.org/jira/browse/SPARK-10214)

Author: Yu ISHIKAWA <yuu.ishikawa@gmail.com>

Closes #8414 from yu-iskw/SPARK-10214.
2015-08-25 00:28:51 -07:00
Yu ISHIKAWA 6511bf559b [SPARK-10118] [SPARKR] [DOCS] Improve SparkR API docs for 1.5 release
cc: shivaram

## Summary

- Modify `tdname` of expression functions. i.e. `ascii`: `rdname functions` => `rdname ascii`
- Replace the dynamical function definitions to the static ones because of thir documentations.

## Generated PDF File
https://drive.google.com/file/d/0B9biIZIU47lLX2t6ZjRoRnBTSEU/view?usp=sharing

## JIRA
[[SPARK-10118] Improve SparkR API docs for 1.5 release - ASF JIRA](https://issues.apache.org/jira/browse/SPARK-10118)

Author: Yu ISHIKAWA <yuu.ishikawa@gmail.com>
Author: Yuu ISHIKAWA <yuu.ishikawa@gmail.com>

Closes #8386 from yu-iskw/SPARK-10118.
2015-08-24 18:17:51 -07:00
Yu ISHIKAWA d898c33f77 [SPARK-10106] [SPARKR] Add ifelse Column function to SparkR
### JIRA
[[SPARK-10106] Add `ifelse` Column function to SparkR - ASF JIRA](https://issues.apache.org/jira/browse/SPARK-10106)

Author: Yu ISHIKAWA <yuu.ishikawa@gmail.com>

Closes #8303 from yu-iskw/SPARK-10106.
2015-08-19 12:39:37 -07:00
Yu ISHIKAWA 2fcb9cb955 [SPARK-9856] [SPARKR] Add expression functions into SparkR whose params are complicated
I added lots of Column functinos into SparkR. And I also added `rand(seed: Int)` and `randn(seed: Int)` in Scala. Since we need such APIs for R integer type.

### JIRA
[[SPARK-9856] Add expression functions into SparkR whose params are complicated - ASF JIRA](https://issues.apache.org/jira/browse/SPARK-9856)

Author: Yu ISHIKAWA <yuu.ishikawa@gmail.com>

Closes #8264 from yu-iskw/SPARK-9856-3.
2015-08-19 10:41:14 -07:00
Yu ISHIKAWA bf32c1f7f4 [SPARK-10075] [SPARKR] Add when expressino function in SparkR
- Add `when` and `otherwise` as `Column` methods
- Add `When` as an expression function
- Add `%otherwise%` infix as an alias of `otherwise`

Since R doesn't support a feature like method chaining, `otherwise(when(condition, value), value)` style is a little annoying for me. If `%otherwise%` looks strange for shivaram, I can remove it. What do you think?

### JIRA
[[SPARK-10075] Add `when` expressino function in SparkR - ASF JIRA](https://issues.apache.org/jira/browse/SPARK-10075)

Author: Yu ISHIKAWA <yuu.ishikawa@gmail.com>

Closes #8266 from yu-iskw/SPARK-10075.
2015-08-18 20:27:36 -07:00
Yu ISHIKAWA b4b35f133a [SPARKR] [MINOR] Get rid of a long line warning
```
R/functions.R:74:1: style: lines should not be more than 100 characters.
            jc <- callJStatic("org.apache.spark.sql.functions", "lit", ifelse(class(x) == "Column", xjc, x))
^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
```

Author: Yu ISHIKAWA <yuu.ishikawa@gmail.com>

Closes #8297 from yu-iskw/minor-lint-r.
2015-08-18 19:18:05 -07:00
Hossein 04e0fea79b Bump SparkR version string to 1.5.0
This patch is against master, but we need to apply it to 1.5 branch as well.

cc shivaram  and rxin

Author: Hossein <hossein@databricks.com>

Closes #8291 from falaki/SparkRVersion1.5.
2015-08-18 18:02:22 -07:00
Yuu ISHIKAWA 1968276af0 [SPARK-10007] [SPARKR] Update NAMESPACE file in SparkR for simple parameters functions
### JIRA
[[SPARK-10007] Update `NAMESPACE` file in SparkR for simple parameters functions - ASF JIRA](https://issues.apache.org/jira/browse/SPARK-10007)

Author: Yuu ISHIKAWA <yuu.ishikawa@gmail.com>

Closes #8277 from yu-iskw/SPARK-10007.
2015-08-18 09:10:59 -07:00
Yu ISHIKAWA 26e760581f [SPARK-9871] [SPARKR] Add expression functions into SparkR which have a variable parameter
### Summary

- Add `lit` function
- Add `concat`, `greatest`, `least` functions

I think we need to improve `collect` function in order to implement `struct` function. Since `collect` doesn't work with arguments which includes a nested `list` variable. It seems that a list against `struct` still has `jobj` classes. So it would be better to solve this problem on another issue.

### JIRA
[[SPARK-9871] Add expression functions into SparkR which have a variable parameter - ASF JIRA](https://issues.apache.org/jira/browse/SPARK-9871)

Author: Yu ISHIKAWA <yuu.ishikawa@gmail.com>

Closes #8194 from yu-iskw/SPARK-9856.
2015-08-16 23:33:20 -07:00
Sun Rui 5f9ce738fe [SPARK-8844] [SPARKR] head/collect is broken in SparkR.
This is a WIP patch for SPARK-8844  for collecting reviews.

This bug is about reading an empty DataFrame. in readCol(),
      lapply(1:numRows, function(x) {
does not take into consideration the case where numRows = 0.

Will add unit test case.

Author: Sun Rui <rui.sun@intel.com>

Closes #7419 from sun-rui/SPARK-8844.
2015-08-16 00:30:02 -07:00
Burak Yavuz 2fb4901b71 [SPARK-9916] [BUILD] [SPARKR] removed left-over sparkr.zip copy/create commands from codebase
sparkr.zip is now built by SparkSubmit on a need-to-build basis.

cc shivaram

Author: Burak Yavuz <brkyvz@gmail.com>

Closes #8147 from brkyvz/make-dist-fix.
2015-08-12 20:59:38 -07:00
Yu ISHIKAWA f4bc01f1f3 [SPARK-9855] [SPARKR] Add expression functions into SparkR whose params are simple
I added lots of expression functions for SparkR. This PR includes only functions whose params  are only `(Column)` or `(Column, Column)`.  And I think we need to improve how to test those functions. However, it would be better to work on another issue.

## Diff Summary

- Add lots of functions in `functions.R` and their generic in `generic.R`
- Add aliases for `ceiling` and `sign`
- Move expression functions from `column.R` to `functions.R`
- Modify `rdname` from `column` to `functions`

I haven't supported `not` function, because the name has a collesion with `testthat` package. I didn't think of the way  to define it.

## New Supported Functions

```
approxCountDistinct
ascii
base64
bin
bitwiseNOT
ceil (alias: ceiling)
crc32
dayofmonth
dayofyear
explode
factorial
hex
hour
initcap
isNaN
last_day
length
log2
ltrim
md5
minute
month
negate
quarter
reverse
round
rtrim
second
sha1
signum (alias: sign)
size
soundex
to_date
trim
unbase64
unhex
weekofyear
year

datediff
levenshtein
months_between
nanvl
pmod
```

## JIRA
[[SPARK-9855] Add expression functions into SparkR whose params are simple - ASF JIRA](https://issues.apache.org/jira/browse/SPARK-9855)

Author: Yu ISHIKAWA <yuu.ishikawa@gmail.com>

Closes #8123 from yu-iskw/SPARK-9855.
2015-08-12 18:33:27 -07:00
Eric Liang 74a293f453 [SPARK-9713] [ML] Document SparkR MLlib glm() integration in Spark 1.5
This documents the use of R model formulae in the SparkR guide. Also fixes some bugs in the R api doc.

mengxr

Author: Eric Liang <ekl@databricks.com>

Closes #8085 from ericl/docs.
2015-08-11 21:26:03 -07:00
Reynold Xin 4309262ec9 [SPARK-9700] Pick default page size more intelligently.
Previously, we use 64MB as the default page size, which was way too big for a lot of Spark applications (especially for single node).

This patch changes it so that the default page size, if unset by the user, is determined by the number of cores available and the total execution memory available.

Author: Reynold Xin <rxin@databricks.com>

Closes #8012 from rxin/pagesize and squashes the following commits:

16f4756 [Reynold Xin] Fixed failing test.
5afd570 [Reynold Xin] private...
0d5fb98 [Reynold Xin] Update default value.
674a6cd [Reynold Xin] Address review feedback.
dc00e05 [Reynold Xin] Merge with master.
73ebdb6 [Reynold Xin] [SPARK-9700] Pick default page size more intelligently.
2015-08-06 23:18:29 -07:00
Burak Yavuz c9a4c36d05 [SPARK-8313] R Spark packages support
shivaram cafreeman Could you please help me in testing this out? Exposing and running `rPackageBuilder` from inside the shell works, but for some reason, I can't get it to work during Spark Submit. It just starts relaunching Spark Submit.

For testing, you may use the R branch with [sbt-spark-package](https://github.com/databricks/sbt-spark-package). You can call spPackage, and then pass the jar using `--jars`.

Author: Burak Yavuz <brkyvz@gmail.com>

Closes #7139 from brkyvz/r-submit and squashes the following commits:

0de384f [Burak Yavuz] remove unused imports 2
d253708 [Burak Yavuz] removed unused imports
6603d0d [Burak Yavuz] addressed comments
4258ffe [Burak Yavuz] merged master
ddfcc06 [Burak Yavuz] added zipping test
3a1be7d [Burak Yavuz] don't zip
77995df [Burak Yavuz] fix URI
ac45527 [Burak Yavuz] added zipping of all libs
e6bf7b0 [Burak Yavuz] add println ignores
1bc5554 [Burak Yavuz] add assumes for tests
9778e03 [Burak Yavuz] addressed comments
b42b300 [Burak Yavuz] merged master
ffd134e [Burak Yavuz] Merge branch 'master' of github.com:apache/spark into r-submit
d867756 [Burak Yavuz] add apache header
eff5ba1 [Burak Yavuz] ready for review
8838edb [Burak Yavuz] Merge branch 'master' of github.com:apache/spark into r-submit
e5b5a06 [Burak Yavuz] added doc
bb751ce [Burak Yavuz] fix null bug
0226768 [Burak Yavuz] fixed issues
8810beb [Burak Yavuz] R packages support
2015-08-04 18:20:12 -07:00
Hossein 712f5b7a9a [SPARK-9318] [SPARK-9320] [SPARKR] Aliases for merge and summary functions on DataFrames
This PR adds synonyms for ```merge``` and ```summary``` in SparkR DataFrame API.

cc shivaram

Author: Hossein <hossein@databricks.com>

Closes #7806 from falaki/SPARK-9320 and squashes the following commits:

72600f7 [Hossein] Updated docs
92a6e75 [Hossein] Fixed merge generic signature issue
4c2b051 [Hossein] Fixing naming with mllib summary
0f3a64c [Hossein] Added ... to generic for merge
30fbaf8 [Hossein] Merged master
ae1a4cf [Hossein] Merge branch 'master' into SPARK-9320
e8eb86f [Hossein] Add a generic for merge
fc01f2d [Hossein] Added unit test
8d92012 [Hossein] Added merge as an alias for join
5b8bedc [Hossein] Added unit test
632693d [Hossein] Added summary as an alias for describe for DataFrame
2015-07-31 19:24:44 -07:00
Hossein 710c2b5dd2 [SPARK-9324] [SPARK-9322] [SPARK-9321] [SPARKR] Some aliases for R-like functions in DataFrames
Adds following aliases:
* unique (distinct)
* rbind (unionAll): accepts many DataFrames
* nrow (count)
* ncol
* dim
* names (columns): along with the replacement function to change names

Author: Hossein <hossein@databricks.com>

Closes #7764 from falaki/sparkR-alias and squashes the following commits:

56016f5 [Hossein] Updated R documentation
5e4a4d0 [Hossein] Removed extra code
f51cbef [Hossein] Merge branch 'master' into sparkR-alias
c1b88bd [Hossein] Moved setGeneric and other comments applied
d9307f8 [Hossein] Added tests
b5aa988 [Hossein] Added dim, ncol, nrow, names, rbind, and unique functions to DataFrames
2015-07-31 14:08:18 -07:00
Shivaram Venkataraman 82f47b8116 [SPARK-9510] [SPARKR] Remaining SparkR style fixes
With the change in this patch, I get no more warnings from `./dev/lint-r` in my machine

Author: Shivaram Venkataraman <shivaram@cs.berkeley.edu>

Closes #7834 from shivaram/sparkr-style-fixes and squashes the following commits:

716cd8e [Shivaram Venkataraman] Remaining SparkR style fixes
2015-07-31 14:02:44 -07:00
Yu ISHIKAWA fc0e57e5ab [SPARK-9053] [SPARKR] Fix spaces around parens, infix operators etc.
### JIRA
[[SPARK-9053] Fix spaces around parens, infix operators etc. - ASF JIRA](https://issues.apache.org/jira/browse/SPARK-9053)

### The Result of `lint-r`
[The result of lint-r at the rivision:a4c83cb1e4b066cd60264b6572fd3e51d160d26a](https://gist.github.com/yu-iskw/d253d7f8ef351f86443d)

Author: Yu ISHIKAWA <yuu.ishikawa@gmail.com>

Closes #7584 from yu-iskw/SPARK-9053 and squashes the following commits:

613170f [Yu ISHIKAWA] Ignore a warning about a space before a left parentheses
ede61e1 [Yu ISHIKAWA] Ignores two warnings about a space before a left parentheses. TODO: After updating `lintr`, we will remove the ignores
de3e0db [Yu ISHIKAWA] Add '## nolint start' & '## nolint end' statement to ignore infix space warnings
e233ea8 [Yu ISHIKAWA] [SPARK-9053][SparkR] Fix spaces around parens, infix operators etc.
2015-07-31 09:33:38 -07:00
Hossein 157840d1b1 [SPARK-8742] [SPARKR] Improve SparkR error messages for DataFrame API
This patch improves SparkR error message reporting, especially with DataFrame API. When there is a user error (e.g., malformed SQL query), the message of the cause is sent back through the RPC and the R client reads it and returns it back to user.

cc shivaram

Author: Hossein <hossein@databricks.com>

Closes #7742 from falaki/SPARK-8742 and squashes the following commits:

4f643c9 [Hossein] Not logging exceptions in RBackendHandler
4a8005c [Hossein] Returning stack track of causing exception from RBackendHandler
5cf17f0 [Hossein] Adding unit test for error messages from SQLContext
2af75d5 [Hossein] Reading error message in case of failure and stoping with that message
f479c99 [Hossein] Wrting exception cause message in JVM
2015-07-30 16:16:17 -07:00
Eric Liang e7905a9395 [SPARK-9463] [ML] Expose model coefficients with names in SparkR RFormula
Preview:

```
> summary(m)
            features coefficients
1        (Intercept)    1.6765001
2       Sepal_Length    0.3498801
3 Species.versicolor   -0.9833885
4  Species.virginica   -1.0075104

```

Design doc from umbrella task: https://docs.google.com/document/d/10NZNSEurN2EdWM31uFYsgayIPfCFHiuIu3pCWrUmP_c/edit

cc mengxr

Author: Eric Liang <ekl@databricks.com>

Closes #7771 from ericl/summary and squashes the following commits:

ccd54c3 [Eric Liang] second pass
a5ca93b [Eric Liang] comments
2772111 [Eric Liang] clean up
70483ef [Eric Liang] fix test
7c247d4 [Eric Liang] Merge branch 'master' into summary
3c55024 [Eric Liang] working
8c539aa [Eric Liang] first pass
2015-07-30 16:15:43 -07:00
Josh Rosen 520ec0ff9d [SPARK-8850] [SQL] Enable Unsafe mode by default
This pull request enables Unsafe mode by default in Spark SQL. In order to do this, we had to fix a number of small issues:

**List of fixed blockers**:

- [x] Make some default buffer sizes configurable so that HiveCompatibilitySuite can run properly (#7741).
- [x] Memory leak on grouped aggregation of empty input (fixed by #7560 to fix this)
- [x] Update planner to also check whether codegen is enabled before planning unsafe operators.
- [x] Investigate failing HiveThriftBinaryServerSuite test.  This turns out to be caused by a ClassCastException that occurs when Exchange tries to apply an interpreted RowOrdering to an UnsafeRow when range partitioning an RDD.  This could be fixed by #7408, but a shorter-term fix is to just skip the Unsafe exchange path when RangePartitioner is used.
- [x] Memory leak exceptions masking exceptions that actually caused tasks to fail (will be fixed by #7603).
- [x]  ~~https://issues.apache.org/jira/browse/SPARK-9162, to implement code generation for ScalaUDF.  This is necessary for `UDFSuite` to pass.  For now, I've just ignored this test in order to try to find other problems while we wait for a fix.~~ This is no longer necessary as of #7682.
- [x] Memory leaks from Limit after UnsafeExternalSort cause the memory leak detector to fail tests. This is a huge problem in the HiveCompatibilitySuite (fixed by f4ac642a4e5b2a7931c5e04e086bb10e263b1db6).
- [x] Tests in `AggregationQuerySuite` are failing due to NaN-handling issues in UnsafeRow, which were fixed in #7736.
- [x] `org.apache.spark.sql.ColumnExpressionSuite.rand` needs to be updated so that the planner check also matches `TungstenProject`.
- [x] After having lowered the buffer sizes to 4MB so that most of HiveCompatibilitySuite runs:
  - [x] Wrong answer in `join_1to1` (fixed by #7680)
  - [x] Wrong answer in `join_nulls` (fixed by #7680)
  - [x] Managed memory OOM / leak in `lateral_view`
  - [x] Seems to hang indefinitely in `partcols1`.  This might be a deadlock in script transformation or a bug in error-handling code? The hang was fixed by #7710.
  - [x] Error while freeing memory in `partcols1`: will be fixed by #7734.
- [x] After fixing the `partcols1` hang, it appears that a number of later tests have issues as well.
- [x] Fix thread-safety bug in codegen fallback expression evaluation (#7759).

Author: Josh Rosen <joshrosen@databricks.com>

Closes #7564 from JoshRosen/unsafe-by-default and squashes the following commits:

83c0c56 [Josh Rosen] Merge remote-tracking branch 'origin/master' into unsafe-by-default
f4cc859 [Josh Rosen] Merge remote-tracking branch 'origin/master' into unsafe-by-default
963f567 [Josh Rosen] Reduce buffer size for R tests
d6986de [Josh Rosen] Lower page size in PySpark tests
013b9da [Josh Rosen] Also match TungstenProject in checkNumProjects
5d0b2d3 [Josh Rosen] Add task completion callback to avoid leak in limit after sort
ea250da [Josh Rosen] Disable unsafe Exchange path when RangePartitioning is used
715517b [Josh Rosen] Enable Unsafe by default
2015-07-30 10:45:32 -07:00
Yuu ISHIKAWA 7492a33fdd [SPARK-9248] [SPARKR] Closing curly-braces should always be on their own line
### JIRA
[[SPARK-9248] Closing curly-braces should always be on their own line - ASF JIRA](https://issues.apache.org/jira/browse/SPARK-9248)

## The result of `dev/lint-r`
[The result of `dev/lint-r` for SPARK-9248 at the revistion:6175d6cfe795fbd88e3ee713fac375038a3993a8](https://gist.github.com/yu-iskw/96cadcea4ce664c41f81)

Author: Yuu ISHIKAWA <yuu.ishikawa@gmail.com>

Closes #7795 from yu-iskw/SPARK-9248 and squashes the following commits:

c8eccd3 [Yuu ISHIKAWA] [SPARK-9248][SparkR] Closing curly-braces should always be on their own line
2015-07-30 10:00:27 -07:00
Eric Liang 8d5bb5283c [SPARK-9391] [ML] Support minus, dot, and intercept operators in SparkR RFormula
Adds '.', '-', and intercept parsing to RFormula. Also splits RFormulaParser into a separate file.

Umbrella design doc here: https://docs.google.com/document/d/10NZNSEurN2EdWM31uFYsgayIPfCFHiuIu3pCWrUmP_c/edit?usp=sharing

mengxr

Author: Eric Liang <ekl@databricks.com>

Closes #7707 from ericl/string-features-2 and squashes the following commits:

8588625 [Eric Liang] exclude complex types for .
8106ffe [Eric Liang] comments
a9350bb [Eric Liang] s/var/val
9c50d4d [Eric Liang] Merge branch 'string-features' into string-features-2
581afb2 [Eric Liang] Merge branch 'master' into string-features
08ae539 [Eric Liang] Merge branch 'string-features' into string-features-2
f99131a [Eric Liang] comments
cecec43 [Eric Liang] Merge branch 'string-features' into string-features-2
0bf3c26 [Eric Liang] update docs
4592df2 [Eric Liang] intercept supports
7412a2e [Eric Liang] Fri Jul 24 14:56:51 PDT 2015
3cf848e [Eric Liang] fix the parser
0556c2b [Eric Liang] Merge branch 'string-features' into string-features-2
c302a2c [Eric Liang] fix tests
9d1ac82 [Eric Liang] Merge remote-tracking branch 'upstream/master' into string-features
e713da3 [Eric Liang] comments
cd231a9 [Eric Liang] Wed Jul 22 17:18:44 PDT 2015
4d79193 [Eric Liang] revert to seq + distinct
169a085 [Eric Liang] tweak functional test
a230a47 [Eric Liang] Merge branch 'master' into string-features
72bd6f3 [Eric Liang] fix merge
d841cec [Eric Liang] Merge branch 'master' into string-features
5b2c4a2 [Eric Liang] Mon Jul 20 18:45:33 PDT 2015
b01c7c5 [Eric Liang] add test
8a637db [Eric Liang] encoder wip
a1d03f4 [Eric Liang] refactor into estimator
2015-07-28 14:16:57 -07:00
trestletech 6143234062 Use vector-friendly comparison for packages argument.
Otherwise, `sparkR.init()` with multiple `sparkPackages` results in this warning:

```
Warning message:
In if (packages != "") { :
  the condition has length > 1 and only the first element will be used
```

Author: trestletech <jeff.allen@trestletechnology.net>

Closes #7701 from trestletech/compare-packages and squashes the following commits:

72c8b36 [trestletech] Correct function name.
c52db0e [trestletech] Added test for multiple packages.
3aab1a7 [trestletech] Use vector-friendly comparison for packages argument.
2015-07-28 10:45:19 -07:00
Eric Liang 8ddfa52c20 [SPARK-9230] [ML] Support StringType features in RFormula
This adds StringType feature support via OneHotEncoder. As part of this task it was necessary to change RFormula to an Estimator, so that factor levels could be determined from the training dataset.

Not sure if I am using uids correctly here, would be good to get reviewer help on that.
cc mengxr

Umbrella design doc: https://docs.google.com/document/d/10NZNSEurN2EdWM31uFYsgayIPfCFHiuIu3pCWrUmP_c/edit#

Author: Eric Liang <ekl@databricks.com>

Closes #7574 from ericl/string-features and squashes the following commits:

f99131a [Eric Liang] comments
0bf3c26 [Eric Liang] update docs
c302a2c [Eric Liang] fix tests
9d1ac82 [Eric Liang] Merge remote-tracking branch 'upstream/master' into string-features
e713da3 [Eric Liang] comments
4d79193 [Eric Liang] revert to seq + distinct
169a085 [Eric Liang] tweak functional test
a230a47 [Eric Liang] Merge branch 'master' into string-features
72bd6f3 [Eric Liang] fix merge
d841cec [Eric Liang] Merge branch 'master' into string-features
5b2c4a2 [Eric Liang] Mon Jul 20 18:45:33 PDT 2015
b01c7c5 [Eric Liang] add test
8a637db [Eric Liang] encoder wip
a1d03f4 [Eric Liang] refactor into estimator
2015-07-27 17:17:49 -07:00
Yu ISHIKAWA 3aec9f4e2d [SPARK-9249] [SPARKR] local variable assigned but may not be used
[[SPARK-9249] local variable assigned but may not be used - ASF JIRA](https://issues.apache.org/jira/browse/SPARK-9249)

https://gist.github.com/yu-iskw/0e5b0253c11769457ea5

Author: Yu ISHIKAWA <yuu.ishikawa@gmail.com>

Closes #7640 from yu-iskw/SPARK-9249 and squashes the following commits:

7a51cab [Yu ISHIKAWA] [SPARK-9249][SparkR] local variable assigned but may not be used
2015-07-24 09:10:57 -07:00
Xiangrui Meng ecfb312767 [SPARK-9243] [Documentation] null -> zero in crosstab doc
We forgot to update doc. brkyvz

Author: Xiangrui Meng <meng@databricks.com>

Closes #7608 from mengxr/SPARK-9243 and squashes the following commits:

0ea3236 [Xiangrui Meng] null -> zero in crosstab doc
2015-07-23 10:32:11 -07:00
Xiangrui Meng 2f5cbd860e [SPARK-8364] [SPARKR] Add crosstab to SparkR DataFrames
Add `crosstab` to SparkR DataFrames, which takes two column names and returns a local R data.frame. This is similar to `table` in R. However, `table` in SparkR is used for loading SQL tables as DataFrames. The return type is data.frame instead table for `crosstab` to be compatible with Scala/Python.

I couldn't run R tests successfully on my local. Many unit tests failed. So let's try Jenkins.

Author: Xiangrui Meng <meng@databricks.com>

Closes #7318 from mengxr/SPARK-8364 and squashes the following commits:

d75e894 [Xiangrui Meng] fix tests
53f6ddd [Xiangrui Meng] fix tests
f1348d6 [Xiangrui Meng] update test
47cb088 [Xiangrui Meng] Merge remote-tracking branch 'apache/master' into SPARK-8364
5621262 [Xiangrui Meng] first version without test
2015-07-22 21:40:23 -07:00
Eric Liang 1cbdd89918 [SPARK-9201] [ML] Initial integration of MLlib + SparkR using RFormula
This exposes the SparkR:::glm() and SparkR:::predict() APIs. It was necessary to change RFormula to silently drop the label column if it was missing from the input dataset, which is kind of a hack but necessary to integrate with the Pipeline API.

The umbrella design doc for MLlib + SparkR integration can be viewed here: https://docs.google.com/document/d/10NZNSEurN2EdWM31uFYsgayIPfCFHiuIu3pCWrUmP_c/edit

mengxr

Author: Eric Liang <ekl@databricks.com>

Closes #7483 from ericl/spark-8774 and squashes the following commits:

3dfac0c [Eric Liang] update
17ef516 [Eric Liang] more comments
1753a0f [Eric Liang] make glm generic
b0f50f8 [Eric Liang] equivalence test
550d56d [Eric Liang] export methods
c015697 [Eric Liang] second pass
117949a [Eric Liang] comments
5afbc67 [Eric Liang] test label columns
6b7f15f [Eric Liang] Fri Jul 17 14:20:22 PDT 2015
3a63ae5 [Eric Liang] Fri Jul 17 13:41:52 PDT 2015
ce61367 [Eric Liang] Fri Jul 17 13:41:17 PDT 2015
0299c59 [Eric Liang] Fri Jul 17 13:40:32 PDT 2015
e37603f [Eric Liang] Fri Jul 17 12:15:03 PDT 2015
d417d0c [Eric Liang] Merge remote-tracking branch 'upstream/master' into spark-8774
29a2ce7 [Eric Liang] Merge branch 'spark-8774-1' into spark-8774
d1959d2 [Eric Liang] clarify comment
2db68aa [Eric Liang] second round of comments
dc3c943 [Eric Liang] address comments
5765ec6 [Eric Liang] fix style checks
1f361b0 [Eric Liang] doc
d33211b [Eric Liang] r support
fb0826b [Eric Liang] [SPARK-8774] Add R model formula with basic support as a transformer
2015-07-20 20:49:38 -07:00
Yu ISHIKAWA 2bdf9914ab [SPARK-9052] [SPARKR] Fix comments after curly braces
[[SPARK-9052] Fix comments after curly braces - ASF JIRA](https://issues.apache.org/jira/browse/SPARK-9052)

This is the full result of lintr at the rivision:011551620faa87107a787530f074af3d9be7e695.
[[SPARK-9052] the result of lint-r at the revision:011551620faa87107a787530f074af3d9be7e695](https://gist.github.com/yu-iskw/e7246041b173a3f29482)

This is the difference of the result between before and after.
https://gist.github.com/yu-iskw/e7246041b173a3f29482/revisions

Author: Yu ISHIKAWA <yuu.ishikawa@gmail.com>

Closes #7440 from yu-iskw/SPARK-9052 and squashes the following commits:

015d738 [Yu ISHIKAWA] Fix the indentations and move the placement of commna
5cc30fe [Yu ISHIKAWA] Fix the indentation in a condition
4ead0e5 [Yu ISHIKAWA] [SPARK-9052][SparkR] Fix comments after curly braces
2015-07-21 11:38:22 +09:00
Yu ISHIKAWA 5a3c1ad087 [SPARK-9093] [SPARKR] Fix single-quotes strings in SparkR
[[SPARK-9093] Fix single-quotes strings in SparkR - ASF JIRA](https://issues.apache.org/jira/browse/SPARK-9093)

This is the result of lintr at the rivision:011551620faa87107a787530f074af3d9be7e695
[[SPARK-9093] The result of lintr at 011551620f](https://gist.github.com/yu-iskw/8c47acf3202796da4d01)

Author: Yu ISHIKAWA <yuu.ishikawa@gmail.com>

Closes #7439 from yu-iskw/SPARK-9093 and squashes the following commits:

61c391e [Yu ISHIKAWA] [SPARK-9093][SparkR] Fix single-quotes strings in SparkR
2015-07-17 17:00:50 +09:00
Liang-Chi Hsieh 0a795336df [SPARK-8807] [SPARKR] Add between operator in SparkR
JIRA: https://issues.apache.org/jira/browse/SPARK-8807

Add between operator in SparkR.

Author: Liang-Chi Hsieh <viirya@appier.com>

Closes #7356 from viirya/add_r_between and squashes the following commits:

7f51b44 [Liang-Chi Hsieh] Add test for non-numeric column.
c6a25c5 [Liang-Chi Hsieh] Add between function.
2015-07-15 23:36:57 -07:00
Liang-Chi Hsieh 6f6902597d [SPARK-8840] [SPARKR] Add float coercion on SparkR
JIRA: https://issues.apache.org/jira/browse/SPARK-8840

Currently the type coercion rules don't include float type. This PR simply adds it.

Author: Liang-Chi Hsieh <viirya@appier.com>

Closes #7280 from viirya/add_r_float_coercion and squashes the following commits:

c86dc0e [Liang-Chi Hsieh] For comments.
dbf0c1b [Liang-Chi Hsieh] Implicitly convert Double to Float based on provided schema.
733015a [Liang-Chi Hsieh] Add test case for DataFrame with float type.
30c2a40 [Liang-Chi Hsieh] Update test case.
52b5294 [Liang-Chi Hsieh] Merge remote-tracking branch 'upstream/master' into add_r_float_coercion
6f9159d [Liang-Chi Hsieh] Add another test case.
8db3244 [Liang-Chi Hsieh] schema also needs to support float. add test case.
0dcc992 [Liang-Chi Hsieh] Add float coercion on SparkR.
2015-07-15 09:48:33 -07:00
Sun Rui f650a005e0 [SPARK-8808] [SPARKR] Fix assignments in SparkR.
Author: Sun Rui <rui.sun@intel.com>

Closes #7395 from sun-rui/SPARK-8808 and squashes the following commits:

ce603bc [Sun Rui] Use '<-' instead of '='.
88590b1 [Sun Rui] Use '<-' instead of '='.
2015-07-14 22:21:01 -07:00
Sun Rui 7f487c8bde [SPARK-6797] [SPARKR] Add support for YARN cluster mode.
This PR enables SparkR to dynamically ship the SparkR binary package to the AM node in YARN cluster mode, thus it is no longer required that the SparkR package be installed on each worker node.

This PR uses the JDK jar tool to package the SparkR package, because jar is thought to be available on both Linux/Windows platforms where JDK has been installed.

This PR does not address the R worker involved in RDD API. Will address it in a separate JIRA issue.

This PR does not address SBT build. SparkR installation and packaging by SBT will be addressed in a separate JIRA issue.

R/install-dev.bat is not tested. shivaram , Could you help to test it?

Author: Sun Rui <rui.sun@intel.com>

Closes #6743 from sun-rui/SPARK-6797 and squashes the following commits:

ca63c86 [Sun Rui] Adjust MimaExcludes after rebase.
7313374 [Sun Rui] Fix unit test errors.
72695fb [Sun Rui] Fix unit test failures.
193882f [Sun Rui] Fix Mima test error.
fe25a33 [Sun Rui] Fix Mima test error.
35ecfa3 [Sun Rui] Fix comments.
c38a005 [Sun Rui] Unzipped SparkR binary package is still required for standalone and Mesos modes.
b05340c [Sun Rui] Fix scala style.
2ca5048 [Sun Rui] Fix comments.
1acefd1 [Sun Rui] Fix scala style.
0aa1e97 [Sun Rui] Fix scala style.
41d4f17 [Sun Rui] Add support for locating SparkR package for R workers required by RDD APIs.
49ff948 [Sun Rui] Invoke jar.exe with full path in install-dev.bat.
7b916c5 [Sun Rui] Use 'rem' consistently.
3bed438 [Sun Rui] Add a comment.
681afb0 [Sun Rui] Fix a bug that RRunner does not handle client deployment modes.
cedfbe2 [Sun Rui] [SPARK-6797][SPARKR] Add support for YARN cluster mode.
2015-07-13 08:21:47 -07:00
Liang-Chi Hsieh 59cc38944f [SPARK-8940] [SPARKR] Don't overwrite given schema in createDataFrame
JIRA: https://issues.apache.org/jira/browse/SPARK-8940

The given `schema` parameter will be overwritten in `createDataFrame` now. If it is not null, we shouldn't overwrite it.

Author: Liang-Chi Hsieh <viirya@appier.com>

Closes #7311 from viirya/df_not_overwrite_schema and squashes the following commits:

2385139 [Liang-Chi Hsieh] Don't overwrite given schema if it is not null.
2015-07-09 09:57:12 -07:00
Dirceu Semighini Filho 57c72fcce7 Small update in the readme file
Just change the attribute from -PsparkR to -Psparkr

Author: Dirceu Semighini Filho <dirceu.semighini@gmail.com>

Closes #7242 from dirceusemighini/patch-1 and squashes the following commits:

fad5991 [Dirceu Semighini Filho] Small update in the readme file
2015-07-06 13:28:07 -07:00
Yu ISHIKAWA a0cb111b22 [SPARK-8549] [SPARKR] Fix the line length of SparkR
[[SPARK-8549] Fix the line length of SparkR - ASF JIRA](https://issues.apache.org/jira/browse/SPARK-8549)

Author: Yu ISHIKAWA <yuu.ishikawa@gmail.com>

Closes #7204 from yu-iskw/SPARK-8549 and squashes the following commits:

6fb131a [Yu ISHIKAWA] Fix the typo
1737598 [Yu ISHIKAWA] [SPARK-8549][SparkR] Fix the line length of SparkR
2015-07-05 20:50:02 -07:00
Ilya Ganelin 3697232b7d [SPARK-3071] Increase default driver memory
I've updated default values in comments, documentation, and in the command line builder to be 1g based on comments in the JIRA. I've also updated most usages to point at a single variable defined in the Utils.scala and JavaUtils.java files. This wasn't possible in all cases (R, shell scripts etc.) but usage in most code is now pointing at the same place.

Please let me know if I've missed anything.

Will the spark-shell use the value within the command line builder during instantiation?

Author: Ilya Ganelin <ilya.ganelin@capitalone.com>

Closes #7132 from ilganeli/SPARK-3071 and squashes the following commits:

4074164 [Ilya Ganelin] String fix
271610b [Ilya Ganelin] Merge branch 'SPARK-3071' of github.com:ilganeli/spark into SPARK-3071
273b6e9 [Ilya Ganelin] Test fix
fd67721 [Ilya Ganelin] Update JavaUtils.java
26cc177 [Ilya Ganelin] test fix
e5db35d [Ilya Ganelin] Fixed test failure
39732a1 [Ilya Ganelin] merge fix
a6f7deb [Ilya Ganelin] Created default value for DRIVER MEM in Utils that's now used in almost all locations instead of setting manually in each
09ad698 [Ilya Ganelin] Update SubmitRestProtocolSuite.scala
19b6f25 [Ilya Ganelin] Missed one doc update
2698a3d [Ilya Ganelin] Updated default value for driver memory
2015-07-01 23:11:02 -07:00
Sun Rui 69c5dee2f0 [SPARK-7714] [SPARKR] SparkR tests should use more specific expectations than expect_true
1. Update the pattern 'expect_true(a == b)' to 'expect_equal(a, b)'.
2. Update the pattern 'expect_true(inherits(a, b))' to 'expect_is(a, b)'.
3. Update the pattern 'expect_true(identical(a, b))' to 'expect_identical(a, b)'.

Author: Sun Rui <rui.sun@intel.com>

Closes #7152 from sun-rui/SPARK-7714 and squashes the following commits:

8ad2440 [Sun Rui] Fix test case errors.
8fe9f0c [Sun Rui] Update the pattern 'expect_true(identical(a, b))' to 'expect_identical(a, b)'.
f1b8005 [Sun Rui] Update the pattern 'expect_true(inherits(a, b))' to 'expect_is(a, b)'.
f631e94 [Sun Rui] Update the pattern 'expect_true(a == b)' to 'expect_equal(a, b)'.
2015-07-01 09:50:12 -07:00
zsxwing 12671dd5e4 [SPARK-8434][SQL]Add a "pretty" parameter to the "show" method to display long strings
Sometimes the user may want to show the complete content of cells. Now `sql("set -v").show()` displays:

![screen shot 2015-06-18 at 4 34 51 pm](https://cloud.githubusercontent.com/assets/1000778/8227339/14d3c5ea-15d9-11e5-99b9-f00b7e93beef.png)

The user needs to use something like `sql("set -v").collect().foreach(r => r.toSeq.mkString("\t"))` to show the complete content.

This PR adds a `pretty` parameter to show. If `pretty` is false, `show` won't truncate strings or align cells right.

![screen shot 2015-06-18 at 4 21 44 pm](https://cloud.githubusercontent.com/assets/1000778/8227407/b6f8dcac-15d9-11e5-8219-8079280d76fc.png)

Author: zsxwing <zsxwing@gmail.com>

Closes #6877 from zsxwing/show and squashes the following commits:

22e28e9 [zsxwing] pretty -> truncate
e582628 [zsxwing] Add pretty parameter to the show method in R
a3cd55b [zsxwing] Fix calling showString in R
923cee4 [zsxwing] Add a "pretty" parameter to show to display long strings
2015-06-29 23:44:11 -07:00
cafreeman 9d11817765 [SPARK-8607] SparkR -- jars not being added to application classpath correctly
Add `getStaticClass` method in SparkR's `RBackendHandler`

This is a fix for the problem referenced in [SPARK-5185](https://issues.apache.org/jira/browse/SPARK-5185).

cc shivaram

Author: cafreeman <cfreeman@alteryx.com>

Closes #7001 from cafreeman/branch-1.4 and squashes the following commits:

8f81194 [cafreeman] Add missing license
31aedcf [cafreeman] Refactor test to call an external R script
2c22073 [cafreeman] Merge branch 'branch-1.4' of github.com:apache/spark into branch-1.4
0bea809 [cafreeman] Fixed relative path issue and added smaller JAR
ee25e60 [cafreeman] Merge branch 'branch-1.4' of github.com:apache/spark into branch-1.4
9a5c362 [cafreeman] test for including JAR when launching sparkContext
9101223 [cafreeman] Merge branch 'branch-1.4' of github.com:apache/spark into branch-1.4
5a80844 [cafreeman] Fix style nits
7c6bd0c [cafreeman] [SPARK-8607] SparkR

(cherry picked from commit 2579948bf5)
Signed-off-by: Shivaram Venkataraman <shivaram@cs.berkeley.edu>
2015-06-26 17:06:16 -07:00
cafreeman a56516fc92 [SPARK-8662] SparkR Update SparkSQL Test
Test `infer_type` using a more fine-grained approach rather than comparing environments. Since `all.equal`'s behavior has changed in R 3.2, the test became unpassable.

JIRA here:
https://issues.apache.org/jira/browse/SPARK-8662

Author: cafreeman <cfreeman@alteryx.com>

Closes #7045 from cafreeman/R32_Test and squashes the following commits:

b97cc52 [cafreeman] Add `checkStructField` utility
3381e5c [cafreeman] Update SparkSQL Test

(cherry picked from commit 78b31a2a63)
Signed-off-by: Shivaram Venkataraman <shivaram@cs.berkeley.edu>
2015-06-26 10:07:49 -07:00
Shivaram Venkataraman c392a9efab [SPARK-8637] [SPARKR] [HOTFIX] Fix packages argument, sparkSubmitBinName
cc cafreeman

Author: Shivaram Venkataraman <shivaram@cs.berkeley.edu>

Closes #7022 from shivaram/sparkr-init-hotfix and squashes the following commits:

9178d15 [Shivaram Venkataraman] Fix packages argument, sparkSubmitBinName
2015-06-25 10:56:00 -07:00
Holden Karau 43e66192f4 [SPARK-8506] Add pakages to R context created through init.
Author: Holden Karau <holden@pigscanfly.ca>

Closes #6928 from holdenk/SPARK-8506-sparkr-does-not-provide-an-easy-way-to-depend-on-spark-packages-when-performing-init-from-inside-of-r and squashes the following commits:

b60dd63 [Holden Karau] Add an example with the spark-csv package
fa8bc92 [Holden Karau] typo: sparm -> spark
865a90c [Holden Karau] strip spaces for comparision
c7a4471 [Holden Karau] Add some documentation
c1a9233 [Holden Karau] refactor for testing
c818556 [Holden Karau] Add pakages to R
2015-06-24 11:55:20 -07:00
Alok Singh f2fb0285ab [SPARK-8111] [SPARKR] SparkR shell should display Spark logo and version banner on startup.
spark version is taken from the environment variable SPARK_VERSION

Author: Alok  Singh <singhal@Aloks-MacBook-Pro.local>
Author: Alok  Singh <singhal@aloks-mbp.usca.ibm.com>

Closes #6944 from aloknsingh/aloknsingh_spark_jiras and squashes the following commits:

ed607bd [Alok  Singh] [SPARK-8111][SparkR] As per suggestion, 1) using the version from sparkContext rather than the Sys.env. 2) change "Welcome to SparkR!" to "Welcome to" followed by Spark logo and version
acd5b85 [Alok  Singh] fix the jira SPARK-8111 to add the spark version and logo. Currently spark version is taken from the environment variable SPARK_VERSION
2015-06-23 12:47:55 -07:00
Yu ISHIKAWA d4f633514a [SPARK-8431] [SPARKR] Add in operator to DataFrame Column in SparkR
[[SPARK-8431] Add in operator to DataFrame Column in SparkR - ASF JIRA](https://issues.apache.org/jira/browse/SPARK-8431)

Author: Yu ISHIKAWA <yuu.ishikawa@gmail.com>

Closes #6941 from yu-iskw/SPARK-8431 and squashes the following commits:

1f64423 [Yu ISHIKAWA] Modify the comment
f4309a7 [Yu ISHIKAWA] Make a `setMethod` for `%in%` be independent
6e37936 [Yu ISHIKAWA] Modify a variable name
c196173 [Yu ISHIKAWA] [SPARK-8431][SparkR] Add in operator to DataFrame Column in SparkR
2015-06-22 23:04:36 -07:00
Yu ISHIKAWA 44fa7df64d [SPARK-8548] [SPARKR] Remove the trailing whitespaces from the SparkR files
[[SPARK-8548] Remove the trailing whitespaces from the SparkR files - ASF JIRA](https://issues.apache.org/jira/browse/SPARK-8548)

- This is the result of `lint-r`
    https://gist.github.com/yu-iskw/0019b37a2c1167f33986

Author: Yu ISHIKAWA <yuu.ishikawa@gmail.com>

Closes #6945 from yu-iskw/SPARK-8548 and squashes the following commits:

0bd567a [Yu ISHIKAWA] [SPARK-8548][SparkR] Remove the trailing whitespaces from the SparkR files
2015-06-22 20:55:38 -07:00
Yu ISHIKAWA b1f3a489ef [SPARK-8537] [SPARKR] Add a validation rule about the curly braces in SparkR to .lintr
[[SPARK-8537] Add a validation rule about the curly braces in SparkR to `.lintr` - ASF JIRA](https://issues.apache.org/jira/browse/SPARK-8537)

Author: Yu ISHIKAWA <yuu.ishikawa@gmail.com>

Closes #6940 from yu-iskw/SPARK-8537 and squashes the following commits:

7eec1a0 [Yu ISHIKAWA] [SPARK-8537][SparkR] Add a validation rule about the curly braces in SparkR to `.lintr`
2015-06-22 14:35:38 -07:00
Yu ISHIKAWA 004f57374b [SPARK-8495] [SPARKR] Add a .lintr file to validate the SparkR files and the lint-r script
Thank Shivaram Venkataraman for your support. This is a prototype script to validate the R files.

Author: Yu ISHIKAWA <yuu.ishikawa@gmail.com>

Closes #6922 from yu-iskw/SPARK-6813 and squashes the following commits:

c1ffe6b [Yu ISHIKAWA] Modify to save result to a log file and add a rule to validate
5520806 [Yu ISHIKAWA] Exclude the .lintr file not to check Apache lincence
8f94680 [Yu ISHIKAWA] [SPARK-8495][SparkR] Add a `.lintr` file to validate the SparkR files and the `lint-r` script
2015-06-20 16:10:14 -07:00
Hossein 1fa29c2df2 [SPARK-8452] [SPARKR] expose jobGroup API in SparkR
This pull request adds following methods to SparkR:

```R
setJobGroup()
cancelJobGroup()
clearJobGroup()
```
For each method, the spark context is passed as the first argument. There does not seem to be a good way to test these in R.

cc shivaram and davies

Author: Hossein <hossein@databricks.com>

Closes #6889 from falaki/SPARK-8452 and squashes the following commits:

9ce9f1e [Hossein] Added basic tests to verify methods can be called and won't throw errors
c706af9 [Hossein] Added examples
a2c19af [Hossein] taking spark context as first argument
343ca77 [Hossein] Added setJobGroup, cancelJobGroup and clearJobGroup to SparkR
2015-06-19 15:51:59 -07:00
andrewor14 56d4e8a2d0 [SPARK-8350] [R] Log R unit test output to "unit-tests.log"
Right now it's logged to "R-unit-tests.log". Jenkins currently only archives files named "unit-tests.log", and this is what all other modules (e.g. SQL, network, REPL) use.
1. We should be consistent
2. I don't want to reconfigure Jenkins to accept a different file

shivaram

Author: andrewor14 <andrew@databricks.com>
Author: Andrew Or <andrew@databricks.com>

Closes #6807 from andrewor14/r-logs and squashes the following commits:

96005d2 [andrewor14] Nest unit-tests.log further until R
407c46c [andrewor14] Add target to log path
d7b68ae [Andrew Or] Log R unit test output to "unit-tests.log"
2015-06-15 08:16:22 -07:00
hqzizania a5c52c1a34 [SPARK-6820] [SPARKR] Convert NAs to null type in SparkR DataFrames
Author: hqzizania <qian.huang@intel.com>

Closes #6190 from hqzizania/R and squashes the following commits:

1641f9e [hqzizania] fixes and add test units
bb3411a [hqzizania] Convert NAs to null type in SparkR DataFrames
2015-06-08 21:40:12 -07:00
Shivaram Venkataraman 12f5eaeee1 [SPARK-8085] [SPARKR] Support user-specified schema in read.df
cc davies sun-rui

Author: Shivaram Venkataraman <shivaram@cs.berkeley.edu>

Closes #6620 from shivaram/sparkr-read-schema and squashes the following commits:

16a6726 [Shivaram Venkataraman] Fix loadDF to pass schema Also add a unit test
a229877 [Shivaram Venkataraman] Use wrapper function to DataFrameReader
ee70ba8 [Shivaram Venkataraman] Support user-specified schema in read.df
2015-06-05 10:19:03 -07:00
Shivaram Venkataraman 3dc005282a [SPARK-8027] [SPARKR] Move man pages creation to install-dev.sh
This also helps us get rid of the sparkr-docs maven profile as docs are now built by just using -Psparkr when the roxygen2 package is available

Related to discussion in #6567

cc pwendell srowen -- Let me know if this looks better

Author: Shivaram Venkataraman <shivaram@cs.berkeley.edu>

Closes #6593 from shivaram/sparkr-pom-cleanup and squashes the following commits:

b282241 [Shivaram Venkataraman] Remove sparkr-docs from release script as well
8f100a5 [Shivaram Venkataraman] Move man pages creation to install-dev.sh This also helps us get rid of the sparkr-docs maven profile as docs are now built by just using -Psparkr when the roxygen2 package is available
2015-06-04 12:52:16 -07:00
Shivaram Venkataraman 0576c3c4ff [SPARK-8084] [SPARKR] Make SparkR scripts fail on error
cc shaneknapp pwendell JoshRosen

Author: Shivaram Venkataraman <shivaram@cs.berkeley.edu>

Closes #6623 from shivaram/SPARK-8084 and squashes the following commits:

0ec5b26 [Shivaram Venkataraman] Make SparkR scripts fail on error
2015-06-03 17:02:16 -07:00
Sun Rui 708c63bbbe [SPARK-8063] [SPARKR] Spark master URL conflict between MASTER env variable and --master command line option.
Author: Sun Rui <rui.sun@intel.com>

Closes #6605 from sun-rui/SPARK-8063 and squashes the following commits:

51ca48b [Sun Rui] [SPARK-8063][SPARKR] Spark master URL conflict between MASTER env variable and --master command line option.
2015-06-03 11:56:35 -07:00
Sun Rui 46576ab303 [SPARK-7227] [SPARKR] Support fillna / dropna in R DataFrame.
Author: Sun Rui <rui.sun@intel.com>

Closes #6183 from sun-rui/SPARK-7227 and squashes the following commits:

dd6f5b3 [Sun Rui] Rename readEnv() back to readMap(). Add alias na.omit() for dropna().
41cf725 [Sun Rui] [SPARK-7227][SPARKR] Support fillna / dropna in R DataFrame.
2015-05-31 15:01:59 -07:00
Shivaram Venkataraman 5fb97dca9b [SPARK-7954] [SPARKR] Create SparkContext in sparkRSQL init
cc davies

Author: Shivaram Venkataraman <shivaram@cs.berkeley.edu>

Closes #6507 from shivaram/sparkr-init and squashes the following commits:

6fdd169 [Shivaram Venkataraman] Create SparkContext in sparkRSQL init
2015-05-29 15:08:30 -07:00
Shivaram Venkataraman 5f48e5c33b [SPARK-6806] [SPARKR] [DOCS] Add a new SparkR programming guide
This PR adds a new SparkR programming guide at the top-level. This will be useful for R users as our APIs don't directly match the Scala/Python APIs and as we need to explain SparkR without using RDDs as examples etc.

cc rxin davies pwendell

cc cafreeman -- Would be great if you could also take a look at this !

Author: Shivaram Venkataraman <shivaram@cs.berkeley.edu>

Closes #6490 from shivaram/sparkr-guide and squashes the following commits:

d5ff360 [Shivaram Venkataraman] Add a section on HiveContext, HQL queries
408dce5 [Shivaram Venkataraman] Fix link
dbb86e3 [Shivaram Venkataraman] Fix minor typo
9aff5e0 [Shivaram Venkataraman] Address comments, use dplyr-like syntax in example
d09703c [Shivaram Venkataraman] Fix default argument in read.df
ea816a1 [Shivaram Venkataraman] Add a new SparkR programming guide Also update write.df, read.df to handle defaults better
2015-05-29 14:11:58 -07:00
Shivaram Venkataraman a40bca0111 [SPARK-6811] Copy SparkR lib in make-distribution.sh
This change also remove native libraries from SparkR to make sure our distribution works across platforms

Tested by building on Mac, running on Amazon Linux (CentOS), Windows VM and vice-versa (built on Linux run on Mac)

I will also test this with YARN soon and update this PR.

Author: Shivaram Venkataraman <shivaram@cs.berkeley.edu>

Closes #6373 from shivaram/sparkr-binary and squashes the following commits:

ae41b5c [Shivaram Venkataraman] Remove native libraries from SparkR Also include the built SparkR package in make-distribution.sh
2015-05-23 00:04:01 -07:00
Davies Liu 7af3818c6b [SPARK-6806] [SPARKR] [DOCS] Fill in SparkR examples in programming guide
sqlCtx -> sqlContext

You can check the docs by:

```
$ cd docs
$ SKIP_SCALADOC=1 jekyll serve
```
cc shivaram

Author: Davies Liu <davies@databricks.com>

Closes #5442 from davies/r_docs and squashes the following commits:

7a12ec6 [Davies Liu] remove rdd in R docs
8496b26 [Davies Liu] remove the docs related to RDD
e23b9d6 [Davies Liu] delete R docs for RDD API
222e4ff [Davies Liu] Merge branch 'master' into r_docs
89684ce [Davies Liu] Merge branch 'r_docs' of github.com:davies/spark into r_docs
f0a10e1 [Davies Liu] address comments from @shivaram
f61de71 [Davies Liu] Update pairRDD.R
3ef7cf3 [Davies Liu] use + instead of function(a,b) a+b
2f10a77 [Davies Liu] address comments from @cafreeman
9c2a062 [Davies Liu] mention R api together with Python API
23f751a [Davies Liu] Fill in SparkR examples in programming guide
2015-05-23 00:01:40 -07:00
Josh Rosen c9fa870a6d [SPARK-7687] [SQL] DataFrame.describe() should cast all aggregates to String
In `DataFrame.describe()`, the `count` aggregate produces an integer, the `avg` and `stdev` aggregates produce doubles, and `min` and `max` aggregates can produce varying types depending on what type of column they're applied to.  As a result, we should cast all aggregate results to String so that `describe()`'s output types match its declared output schema.

Author: Josh Rosen <joshrosen@databricks.com>

Closes #6218 from JoshRosen/SPARK-7687 and squashes the following commits:

146b615 [Josh Rosen] Fix R test.
2974bd5 [Josh Rosen] Cast to string type instead
f206580 [Josh Rosen] Cast to double to fix SPARK-7687
307ecbf [Josh Rosen] Add failing regression test for SPARK-7687
2015-05-18 21:53:44 -07:00
qhuang 50da9e8916 [SPARK-7226] [SPARKR] Support math functions in R DataFrame
Author: qhuang <qian.huang@intel.com>

Closes #6170 from hqzizania/master and squashes the following commits:

f20c39f [qhuang] add tests units and fixes
2a7d121 [qhuang] use a function name more familiar to R users
07aa72e [qhuang] Support math functions in R DataFrame
2015-05-15 14:06:16 -07:00
Sun Rui df9b94a57c [SPARK-7482] [SPARKR] Rename some DataFrame API methods in SparkR to match their counterparts in Scala.
Author: Sun Rui <rui.sun@intel.com>

Closes #6007 from sun-rui/SPARK-7482 and squashes the following commits:

5c5cf5e [Sun Rui] Implement alias loadDF() as a new function.
3a30c10 [Sun Rui] Rename load()/save() to read.df()/write.df(). Also add loadDF()/saveDF() as aliases.
9f569d6 [Sun Rui] [SPARK-7482][SparkR] Rename some DataFrame API methods in SparkR to match their counterparts in Scala.
2015-05-12 23:52:30 -07:00
Joshi b94a93371c [SPARK-7435] [SPARKR] Make DataFrame.show() consistent with that of Scala and pySpark
Author: Joshi <rekhajoshm@gmail.com>
Author: Rekha Joshi <rekhajoshm@gmail.com>

Closes #5989 from rekhajoshm/fix/SPARK-7435 and squashes the following commits:

cfc9e02 [Joshi] Spark-7435[R]: updated patch for review comments
62becc1 [Joshi] SPARK-7435: Update to DataFrame
e3677c9 [Rekha Joshi] Merge pull request #1 from apache/master
2015-05-11 21:04:32 -07:00
Reynold Xin 0a4844f90a [SPARK-7462] By default retain group by columns in aggregate
Updated Java, Scala, Python, and R.

Author: Reynold Xin <rxin@databricks.com>
Author: Shivaram Venkataraman <shivaram@cs.berkeley.edu>

Closes #5996 from rxin/groupby-retain and squashes the following commits:

aac7119 [Reynold Xin] Merge branch 'groupby-retain' of github.com:rxin/spark into groupby-retain
f6858f6 [Reynold Xin] Merge branch 'master' into groupby-retain
5f923c0 [Reynold Xin] Merge pull request #15 from shivaram/sparkr-groupby-retrain
c1de670 [Shivaram Venkataraman] Revert workaround in SparkR to retain grouped cols Based on reverting code added in commit 9a6be746ef
b8b87e1 [Reynold Xin] Fixed DataFrameJoinSuite.
d910141 [Reynold Xin] Updated rest of the files
1e6e666 [Reynold Xin] [SPARK-7462] By default retain group by columns in aggregate
2015-05-11 11:35:16 -07:00
Shivaram Venkataraman 0835f1edd4 [SPARK-7512] [SPARKR] Fix RDD's show method to use getJRDD
Since the RDD object might be a Pipelined RDD we should use `getJRDD` to get the right handle to the Java object.

Fixes the bug reported at
http://stackoverflow.com/questions/30057702/sparkr-filterrdd-and-flatmap-not-working

cc concretevitamin

Author: Shivaram Venkataraman <shivaram@cs.berkeley.edu>

Closes #6035 from shivaram/sparkr-show-bug and squashes the following commits:

d70145c [Shivaram Venkataraman] Fix RDD's show method to use getJRDD Fixes the bug reported at http://stackoverflow.com/questions/30057702/sparkr-filterrdd-and-flatmap-not-working
2015-05-10 19:49:42 -07:00
Shivaram Venkataraman 0a901dd3a1 [SPARK-7231] [SPARKR] Changes to make SparkR DataFrame dplyr friendly.
Changes include
1. Rename sortDF to arrange
2. Add new aliases `group_by` and `sample_frac`, `summarize`
3. Add more user friendly column addition (mutate), rename
4. Support mean as an alias for avg in Scala and also support n_distinct, n as in dplyr

Using these changes we can pretty much run the examples as described in http://cran.rstudio.com/web/packages/dplyr/vignettes/introduction.html with the same syntax

The only thing missing in SparkR is auto resolving column names when used in an expression i.e. making something like `select(flights, delay)` works in dply but we right now need `select(flights, flights$delay)` or `select(flights, "delay")`. But this is a complicated change and I'll file a new issue for it

cc sun-rui rxin

Author: Shivaram Venkataraman <shivaram@cs.berkeley.edu>

Closes #6005 from shivaram/sparkr-df-api and squashes the following commits:

5e0716a [Shivaram Venkataraman] Fix some roxygen bugs
1254953 [Shivaram Venkataraman] Merge branch 'master' of https://github.com/apache/spark into sparkr-df-api
0521149 [Shivaram Venkataraman] Changes to make SparkR DataFrame dplyr friendly. Changes include 1. Rename sortDF to arrange 2. Add new aliases `group_by` and `sample_frac`, `summarize` 3. Add more user friendly column addition (mutate), rename 4. Support mean as an alias for avg in Scala and also support n_distinct, n as in dplyr
2015-05-08 18:29:57 -07:00
hqzizania 008a60dd37 [SPARK-6824] Fill the docs for DataFrame API in SparkR
This patch also removes the RDD docs from being built as a part of roxygen just by the method to delete
" ' '" of " \#' ".

Author: hqzizania <qian.huang@intel.com>
Author: qhuang <qian.huang@intel.com>

Closes #5969 from hqzizania/R1 and squashes the following commits:

6d27696 [qhuang] fixes in NAMESPACE
eb4b095 [qhuang] remove more docs
6394579 [qhuang] remove RDD docs in generics.R
6813860 [hqzizania] Fill the docs for DataFrame API in SparkR
857220f [hqzizania] remove the pairRDD docs from being built as a part of roxygen
c045d64 [hqzizania] remove the RDD docs from being built as a part of roxygen
2015-05-08 11:25:04 -07:00
Sun Rui 9cfa9a516e [SPARK-6812] [SPARKR] filter() on DataFrame does not work as expected.
According to the R manual: https://stat.ethz.ch/R-manual/R-devel/library/base/html/Startup.html,
" if a function .First is found on the search path, it is executed as .First(). Finally, function .First.sys() in the base package is run. This calls require to attach the default packages specified by options("defaultPackages")."
In .First() in profile/shell.R, we load SparkR package. This means SparkR package is loaded before default packages. If there are same names in default packages, they will overwrite those in SparkR. This is why filter() in SparkR is masked by filter() in stats, which is usually in the default package list.
We need to make sure SparkR is loaded after default packages. The solution is to append SparkR to default packages, instead of loading SparkR in .First().

BTW, I'd like to discuss our policy on how to solve name conflict. Previously, we rename API names from Scala API if there is name conflict with base or other commonly-used packages. However, from long term perspective, this is not good for API stability, because we can't predict name conflicts, for example, if in the future a name added in base package conflicts with an API in SparkR? So the better policy is to keep API name same as Scala's without worrying about name conflicts. When users use SparkR, they should load SparkR as last package, so that all API names are effective. Use can explicitly use :: to refer to hidden names from other packages. If we agree on this, I can submit a JIRA issue to change back some rename API methods, for example, DataFrame.sortDF().

Author: Sun Rui <rui.sun@intel.com>

Closes #5938 from sun-rui/SPARK-6812 and squashes the following commits:

b569145 [Sun Rui] [SPARK-6812][SparkR] filter() on DataFrame does not work as expected.
2015-05-06 22:48:16 -07:00
Shivaram Venkataraman 4e930420c1 [SPARK-6799] [SPARKR] Remove SparkR RDD examples, add dataframe examples
This PR also makes some of the DataFrame to RDD methods private as the RDD class is private in 1.4

cc rxin pwendell

Author: Shivaram Venkataraman <shivaram@cs.berkeley.edu>

Closes #5949 from shivaram/sparkr-examples and squashes the following commits:

6c42fdc [Shivaram Venkataraman] Remove SparkR RDD examples, add dataframe examples
2015-05-06 17:28:11 -07:00
qhuang a466944399 [SPARK-6841] [SPARKR] add support for mean, median, stdev etc.
Moving here from https://github.com/amplab-extras/SparkR-pkg/pull/241
sum() has been implemented. (https://github.com/amplab-extras/SparkR-pkg/pull/242)

Now Phase 1: mean, sd, var have been implemented, but some things still need to be improved with the suggestions in https://issues.apache.org/jira/browse/SPARK-6841

Author: qhuang <qian.huang@intel.com>

Closes #5446 from hqzizania/R and squashes the following commits:

f283572 [qhuang] add test unit for describe()
2e74d5a [qhuang] add describe() DataFrame API
2015-05-05 20:39:56 -07:00
Shivaram Venkataraman c688e3c5e4 [SPARK-7230] [SPARKR] Make RDD private in SparkR.
This change makes the RDD API private in SparkR and all internal uses of the SparkR API use SparkR::: to access private functions.

Author: Shivaram Venkataraman <shivaram@cs.berkeley.edu>

Closes #5895 from shivaram/rrdd-private and squashes the following commits:

bdb2f07 [Shivaram Venkataraman] Make RDD private in SparkR. This change also makes all internal uses of the SparkR API use SparkR::: to access private functions
2015-05-05 14:40:33 -07:00
云峤 f32e69ecc3 [SPARK-7319][SQL] Improve the output from DataFrame.show()
Author: 云峤 <chensong.cs@alibaba-inc.com>

Closes #5865 from kaka1992/df.show and squashes the following commits:

c79204b [云峤] Update
a1338f6 [云峤] Update python dataFrame show test and add empty df unit test.
734369c [云峤] Update python dataFrame show test and add empty df unit test.
84aec3e [云峤] Update python dataFrame show test and add empty df unit test.
159b3d5 [云峤] update
03ef434 [云峤] update
7394fd5 [云峤] update test show
ced487a [云峤] update pep8
b6e690b [云峤] Merge remote-tracking branch 'upstream/master' into df.show
30ac311 [云峤] [SPARK-7294] ADD BETWEEN
7d62368 [云峤] [SPARK-7294] ADD BETWEEN
baf839b [云峤] [SPARK-7294] ADD BETWEEN
d11d5b9 [云峤] [SPARK-7294] ADD BETWEEN
2015-05-04 12:08:38 -07:00
hlin09 ca9f4ebb8e [SPARK-6991] [SPARKR] Adds support for zipPartitions.
Author: hlin09 <hlin09pu@gmail.com>

Closes #5568 from hlin09/zipPartitions and squashes the following commits:

12c08a5 [hlin09] Fix comments
d2d32db [hlin09] Merge branch 'master' into zipPartitions
ec56d2f [hlin09] Fix test.
27655d3 [hlin09] Adds support for zipPartitions.
2015-04-27 15:04:37 -07:00
Jeff Harrison 7078f6028b [SPARK-6856] [R] Make RDD information more useful in SparkR
Author: Jeff Harrison <jeffrharrison@gmail.com>

Closes #5667 from His-name-is-Joof/joofspark and squashes the following commits:

f8814a6 [Jeff Harrison] newline added after RDD show() output
4d9d972 [Jeff Harrison] Merge branch 'master' into joofspark
9d2295e [Jeff Harrison] parallelize with 1:10
878b830 [Jeff Harrison] Merge branch 'master' into joofspark
c8c0b80 [Jeff Harrison] add test for RDD function show()
123be65 [Jeff Harrison] SPARK-6856
2015-04-27 13:38:25 -07:00
Sun Rui caf0136ec5 [SPARK-6852] [SPARKR] Accept numeric as numPartitions in SparkR.
Author: Sun Rui <rui.sun@intel.com>

Closes #5613 from sun-rui/SPARK-6852 and squashes the following commits:

abaf02e [Sun Rui] Change the type of default numPartitions from integer to numeric in generics.R.
29d67c1 [Sun Rui] [SPARK-6852][SPARKR] Accept numeric as numPartitions in SparkR.
2015-04-24 12:52:07 -07:00
Sun Rui ebb77b2aff [SPARK-7033] [SPARKR] Clean usage of split. Use partition instead where applicable.
Author: Sun Rui <rui.sun@intel.com>

Closes #5628 from sun-rui/SPARK-7033 and squashes the following commits:

046bc9e [Sun Rui] Clean split usage in tests.
d531c86 [Sun Rui] [SPARK-7033][SPARKR] Clean usage of split. Use partition instead where applicable.
2015-04-24 11:00:19 -07:00
Sun Rui 73db132bf5 [SPARK-6818] [SPARKR] Support column deletion in SparkR DataFrame API.
Author: Sun Rui <rui.sun@intel.com>

Closes #5655 from sun-rui/SPARK-6818 and squashes the following commits:

7c66570 [Sun Rui] [SPARK-6818][SPARKR] Support column deletion in SparkR DataFrame API.
2015-04-23 16:08:14 -07:00
cafreeman 59e206deb7 [SPARK-6807] [SparkR] Merge recent SparkR-pkg changes
This PR pulls in recent changes in SparkR-pkg, including

cartesian, intersection, sampleByKey, subtract, subtractByKey, except, and some API for StructType and StructField.

Author: cafreeman <cfreeman@alteryx.com>
Author: Davies Liu <davies@databricks.com>
Author: Zongheng Yang <zongheng.y@gmail.com>
Author: Shivaram Venkataraman <shivaram.venkataraman@gmail.com>
Author: Shivaram Venkataraman <shivaram@cs.berkeley.edu>
Author: Sun Rui <rui.sun@intel.com>

Closes #5436 from davies/R3 and squashes the following commits:

c2b09be [Davies Liu] SQLTypes -> schema
a5a02f2 [Davies Liu] Merge branch 'master' of github.com:apache/spark into R3
168b7fe [Davies Liu] sort generics
b1fe460 [Davies Liu] fix conflict in README.md
e74c04e [Davies Liu] fix schema.R
4f5ac09 [Davies Liu] Merge branch 'master' of github.com:apache/spark into R5
41f8184 [Davies Liu] rm man
ae78312 [Davies Liu] Merge pull request #237 from sun-rui/SPARKR-154_3
1bdcb63 [Zongheng Yang] Updates to README.md.
5a553e7 [cafreeman] Use object attribute instead of argument
71372d9 [cafreeman] Update docs and examples
8526d2e71 [cafreeman] Remove `tojson` functions
6ef5f2d [cafreeman] Fix spacing
7741d66 [cafreeman] Rename the SQL DataType function
141efd8 [Shivaram Venkataraman] Merge pull request #245 from hqzizania/upstream
9387402 [Davies Liu] fix style
40199eb [Shivaram Venkataraman] Move except into sorted position
07d0dbc [Sun Rui] [SPARKR-244] Fix test failure after integration of subtract() and subtractByKey() for RDD.
7e8caa3 [Shivaram Venkataraman] Merge pull request #246 from hlin09/fixCombineByKey
ed66c81 [cafreeman] Update `subtract` to work with `generics.R`
f3ba785 [cafreeman] Fixed duplicate export
275deb4 [cafreeman] Update `NAMESPACE` and tests
1a3b63d [cafreeman] new version of `CreateDF`
836c4bf [cafreeman] Update `createDataFrame` and `toDF`
be5d5c1 [cafreeman] refactor schema functions
40338a4 [Zongheng Yang] Merge pull request #244 from sun-rui/SPARKR-154_5
20b97a6 [Zongheng Yang] Merge pull request #234 from hqzizania/assist
ba54e34 [Shivaram Venkataraman] Merge pull request #238 from sun-rui/SPARKR-154_4
c9497a3 [Shivaram Venkataraman] Merge pull request #208 from lythesia/master
b317aa7 [Zongheng Yang] Merge pull request #243 from hqzizania/master
136a07e [Zongheng Yang] Merge pull request #242 from hqzizania/stats
cd66603 [cafreeman] new line at EOF
8b76e81 [Shivaram Venkataraman] Merge pull request #233 from redbaron/fail-early-on-missing-dep
7dd81b7 [cafreeman] Documentation
0e2a94f [cafreeman] Define functions for schema and fields
2015-04-17 13:42:19 -07:00
Shivaram Venkataraman 55f553a979 [SPARK-6855] [SPARKR] Set R includes to get the right collate order.
This prevents tools like devtools::document creating invalid collate orders

Author: Shivaram Venkataraman <shivaram@cs.berkeley.edu>

Closes #5462 from shivaram/collate-order and squashes the following commits:

f3db562 [Shivaram Venkataraman] Set R includes to get the right collate order. This prevents tools like devtools::document creating invalid collate orders
2015-04-16 13:06:34 -07:00
hlin09 0ba3fdd599 [Minor][SparkR] Minor refactor and removes redundancy related to cleanClosure.
1. Only use `cleanClosure` in creation of RRDDs. Normally, user and developer do not need to call `cleanClosure` in their function definition.
2. Removes redundant code (e.g. unnecessary wrapper functions) related to `cleanClosure`.

Author: hlin09 <hlin09pu@gmail.com>

Closes #5495 from hlin09/cleanClosureFix and squashes the following commits:

74ec303 [hlin09] Minor refactor and removes redundancy.
2015-04-13 20:43:24 -07:00
hlin09 d7f2c19867 [SPARK-6881][SparkR] Changes the checkpoint directory name.
Author: hlin09 <hlin09pu@gmail.com>

Closes #5493 from hlin09/fixCheckpointDir and squashes the following commits:

e67fc40 [hlin09] Change to temp dir.
1f7ed9e [hlin09] Change the checkpoint dir name.
2015-04-13 16:53:50 -07:00
Davies Liu 68ecdb7f99 [SPARK-6850] [SparkR] use one partition when we need to compare the whole result
Author: Davies Liu <davies@databricks.com>

Closes #5460 from davies/r_test and squashes the following commits:

0a593ce [Davies Liu] use one partition when we need to compare the whole result
2015-04-10 15:35:45 -07:00
Shivaram Venkataraman 2fe0a1aaee [SPARK-5654] Integrate SparkR
This pull requests integrates SparkR, an R frontend for Spark. The SparkR package contains both RDD and DataFrame APIs in R and is integrated with Spark's submission scripts to work on different cluster managers.

Some integration points that would be great to get feedback on:

1. Build procedure: SparkR requires R to be installed on the machine to be built. Right now we have a new Maven profile `-PsparkR` that can be used to enable SparkR builds

2. YARN cluster mode: The R package that is built needs to be present on the driver and all the worker nodes during execution. The R package location is currently set using SPARK_HOME, but this might not work on YARN cluster mode.

The SparkR package represents the work of many contributors and attached below is a list of people along with areas they worked on

edwardt (edwart) - Documentation improvements
Felix Cheung (felixcheung) - Documentation improvements
Hossein Falaki (falaki)  - Documentation improvements
Chris Freeman (cafreeman) - DataFrame API, Programming Guide
Todd Gao (7c00) - R worker Internals
Ryan Hafen (hafen) - SparkR Internals
Qian Huang (hqzizania) - RDD API
Hao Lin (hlin09) - RDD API, Closure cleaner
Evert Lammerts (evertlammerts) - DataFrame API
Davies Liu (davies) - DataFrame API, R worker internals, Merging with Spark
Yi Lu (lythesia) - RDD API, Worker internals
Matt Massie (massie) - Jenkins build
Harihar Nahak (hnahak87) - SparkR examples
Oscar Olmedo (oscaroboto) - Spark configuration
Antonio Piccolboni (piccolbo) - SparkR examples, Namespace bug fixes
Dan Putler (dputler) - Dataframe API, SparkR Install Guide
Ashutosh Raina (ashutoshraina) - Build improvements
Josh Rosen (joshrosen) - Travis CI build
Sun Rui (sun-rui)- RDD API, JVM Backend, Shuffle improvements
Shivaram Venkataraman (shivaram) - RDD API, JVM Backend, Worker Internals
Zongheng Yang (concretevitamin) - RDD API, Pipelined RDDs, Examples and EC2 guide

Author: Shivaram Venkataraman <shivaram@cs.berkeley.edu>
Author: Shivaram Venkataraman <shivaram.venkataraman@gmail.com>
Author: Zongheng Yang <zongheng.y@gmail.com>
Author: cafreeman <cfreeman@alteryx.com>
Author: Shivaram Venkataraman <shivaram@eecs.berkeley.edu>
Author: Davies Liu <davies@databricks.com>
Author: Davies Liu <davies.liu@gmail.com>
Author: hlin09 <hlin09pu@gmail.com>
Author: Sun Rui <rui.sun@intel.com>
Author: lythesia <iranaikimi@gmail.com>
Author: oscaroboto <oscarjr@gmail.com>
Author: Antonio Piccolboni <antonio@piccolboni.info>
Author: root <edward>
Author: edwardt <edwardt.tril@gmail.com>
Author: hqzizania <qian.huang@intel.com>
Author: dputler <dan.putler@gmail.com>
Author: Todd Gao <todd.gao.2013@gmail.com>
Author: Chris Freeman <cfreeman@alteryx.com>
Author: Felix Cheung <fcheung@AVVOMAC-119.local>
Author: Hossein <hossein@databricks.com>
Author: Evert Lammerts <evert@apache.org>
Author: Felix Cheung <fcheung@avvomac-119.t-mobile.com>
Author: felixcheung <felixcheung_m@hotmail.com>
Author: Ryan Hafen <rhafen@gmail.com>
Author: Ashutosh Raina <ashutoshraina@users.noreply.github.com>
Author: Oscar Olmedo <oscarjr@gmail.com>
Author: Josh Rosen <rosenville@gmail.com>
Author: Yi Lu <iranaikimi@gmail.com>
Author: Harihar Nahak <hnahak87@users.noreply.github.com>

Closes #5096 from shivaram/R and squashes the following commits:

da64742 [Davies Liu] fix Date serialization
59266d1 [Davies Liu] check exclusive of primary-py-file and primary-r-file
55808e4 [Davies Liu] fix tests
5581c75 [Davies Liu] update author of SparkR
f731b48 [Shivaram Venkataraman] Only run SparkR tests if R is installed
64eda24 [Shivaram Venkataraman] Merge branch 'R' of https://github.com/amplab-extras/spark into R
d7c3f22 [Shivaram Venkataraman] Address code review comments Changes include 1. Adding SparkR docs to API docs generated 2. Style fixes in SparkR scala files 3. Clean up of shell scripts and explanation of install-dev.sh
377151f [Shivaram Venkataraman] Merge remote-tracking branch 'apache/master' into R
eb5da53 [Shivaram Venkataraman] Merge pull request #3 from davies/R2
a18ff5c [Davies Liu] Update sparkR.R
5133f3a [Shivaram Venkataraman] Merge pull request #7 from hqzizania/R3
940b631 [hqzizania] [SPARKR-92] Phase 2: implement sum(rdd)
0e788c0 [Shivaram Venkataraman] Merge pull request #5 from hlin09/doc-fix
3487461 [hlin09] Add tests log in .gitignore.
1d1802e [Shivaram Venkataraman] Merge pull request #4 from felixcheung/r-require
11981b7 [felixcheung] Update R to fail early if SparkR package is missing
c300e08 [Davies Liu] remove duplicated file
b045701 [Davies Liu] Merge branch 'remote_r' into R
19c9368 [Davies Liu] Merge branch 'sparkr-sql' of github.com:amplab-extras/SparkR-pkg into remote_r
f8fa8af [Davies Liu] mute logging when start/stop context
e7104b6 [Davies Liu] remove ::: in SparkR
a1777eb [Davies Liu] move rules into R/.gitignore
e88b649 [Davies Liu] Merge branch 'R' of github.com:amplab-extras/spark into R
6e20e71 [Davies Liu] address comments
b433817 [Davies Liu] Merge branch 'master' of github.com:apache/spark into R
a1cedad [Shivaram Venkataraman] Merge pull request #228 from felixcheung/doc
e089151 [Davies Liu] Merge pull request #225 from sun-rui/SPARKR-154_2
463e28c [Davies Liu] Merge pull request #2 from shivaram/doc-fixes
bc2d6d8 [Shivaram Venkataraman] Remove arg from sparkR.stop and update docs
d425363 [Shivaram Venkataraman] Some doc fixes for column, generics, group
1f1a7e0 [Shivaram Venkataraman] Some fixes to DataFrame, RDD, SQLContext docs
104ad4e [Shivaram Venkataraman] Check the right env in exists
cf5cd99 [Shivaram Venkataraman] Remove unused numCols argument
85a50ec [Shivaram Venkataraman] Merge pull request #226 from RevolutionAnalytics/master
3eacfc0 [Davies Liu] fix flaky test
733380d [Davies Liu] update R examples (remove master from args)
b21a0da [Davies Liu] Merge pull request #1 from shivaram/log4j-tests
a1493d7 [Shivaram Venkataraman] Address comments
e1f83ab [Shivaram Venkataraman] Send Spark INFO logs to a file in SparkR tests
58276f5 [Shivaram Venkataraman] Merge branch 'R' of https://github.com/amplab-extras/spark into R
52cc92d [Shivaram Venkataraman] Add license to create-docs.sh
6ff5ea2 [Shivaram Venkataraman] Add instructions to generate docs
1f478c5 [Shivaram Venkataraman] Merge branch 'R' of https://github.com/amplab-extras/spark into R
02b4833 [Shivaram Venkataraman] Add a script to generate R docs (Rd, html) Also fix some issues with our documentation
d6d3729 [Davies Liu] enable spark and pyspark tests
0e5a83f [Davies Liu] fix code style
afd8a77 [Davies Liu] Merge branch 'R' of github.com:amplab-extras/spark into R
d87a181 [Davies Liu] fix flaky tests
7100fb9 [Shivaram Venkataraman] Fix libPaths in README
bdf3a14 [Davies Liu] Merge branch 'R' of github.com:amplab-extras/spark into R
05e7375 [Davies Liu] sort generics
b44e371 [Shivaram Venkataraman] Include RStudio instructions in README
855537f [Davies Liu] Merge branch 'R' of github.com:amplab-extras/spark into R
9fb6af3 [Davies Liu] mark R classes/objects are private
423ea3c [Shivaram Venkataraman] Ignore unknown jobj in cleanup
974e4ea [Davies Liu] fix flaky test
410ec18 [Davies Liu] fix zipRDD() tests
d8b24fc [Davies Liu] disable spark and python tests temporary
ce3ca62 [Davies Liu] fix license check
7da0049 [Davies Liu] fix build
2892e29 [Davies Liu] support R in YARN cluster
ebd4d07 [Davies Liu] Merge branch 'R' of github.com:amplab-extras/spark into R
38cbf59 [Davies Liu] fix test of zipRDD()
756ece0 [Shivaram Venkataraman] Update README remove outdated TODO
d436f26 [Davies Liu] add missing files
40d193a [Shivaram Venkataraman] Merge pull request #224 from sun-rui/SPARKR-224-new
1a16cd6 [Davies Liu] rm PROJECT_HOME
56670ef [Davies Liu] rm man page
ba4b80b [Davies Liu] Merge branch 'remote_r' into R
f04080c [Davies Liu] Merge branch 'sparkr-sql' of github.com:amplab-extras/SparkR-pkg into remote_r
028cbfb [Davies Liu] fix exit code of sparkr unit test
42d8b4c [Davies Liu] Merge branch 'R' of github.com:amplab-extras/spark into R
ef26015 [Davies Liu] Merge branch 'R' of github.com:amplab-extras/spark into R
a1870e8 [Shivaram Venkataraman] Merge pull request #214 from sun-rui/SPARKR-156_3
cb6e5e3 [Shivaram Venkataraman] Add scripts to start SparkR on windows
8030847 [Shivaram Venkataraman] Set windows file separators, install dirs
05afef0 [Shivaram Venkataraman] Only stop backend JVM if R launched it
95d2de3 [Davies Liu] fix spark-submit with R scripot
baefd9e [Shivaram Venkataraman] Make bin/sparkR use spark-submit As a part of this move the R initialization functions into first.R and first-submit.R
d6f2bdd [Shivaram Venkataraman] Fix run-tests path
ea90fab [Davies Liu] fix spark-submit with R path and sparkR -h
0e2412c [Davies Liu] fix bin/sparkR
9f6aa1f [Davies Liu] Merge branch 'R' of github.com:amplab-extras/spark into R
479e3fe [Davies Liu] change println() to logging
52ca6e5 [Shivaram Venkataraman] Add missing comma
716b16f [Shivaram Venkataraman] Merge branch 'R' of https://github.com/amplab-extras/spark into R
2d235d4 [Shivaram Venkataraman] Build SparkR with Maven profile
aae881b [Davies Liu] fix rat
ff776aa [Shivaram Venkataraman] Fix style
e4f1937 [Shivaram Venkataraman] Remove DFC example
f7b6936 [Davies Liu] remove Spark prefix for class
043959e [Davies Liu] cleanup
ba53b09 [Davies Liu] support R in spark-submit
f403b4a [Davies Liu] rm .travis.yml
c4a5bdf [Davies Liu] run sparkr tests in Spark
e8fc7ca [Davies Liu] fix .gitignore
35e5755 [Davies Liu] reduce size of example data
50bff63 [Davies Liu] add LICENSE header for R sources
facb6e0 [Davies Liu] add .gitignore for .o, .so, .Rd
18e5eed [Davies Liu] update docs
0a0e632 [Davies Liu] move sparkR into bin/
a76472f [Davies Liu] fix path of assembly jar
df3eeea [Davies Liu] move R/examples into examples/src/main/r
3415cc7 [Davies Liu] move Scala source into core/ and sql/
180fc9c [Davies Liu] move scala
014d253 [Davies Liu] delete man pages
49a8133 [Davies Liu] Merge branch 'remote_r' into R
44994c2 [Davies Liu] Moved files to R/
2fc553f [Shivaram Venkataraman] Merge pull request #222 from davies/column2
b043876 [Davies Liu] fix test
5e610cb [Davies Liu] add more API for Column
6f95d49 [Shivaram Venkataraman] Merge pull request #221 from shivaram/sparkr-stop-start
3214c6d [Shivaram Venkataraman] Merge pull request #217 from hlin09/cleanClosureFix
f5d3355 [Shivaram Venkataraman] Merge pull request #218 from davies/merge
70f620c [Davies Liu] address comments
4b1628d [Davies Liu] Merge branch 'sparkr-sql' of github.com:amplab-extras/SparkR-pkg into merge
3139325 [Shivaram Venkataraman] Merge pull request #212 from davies/toDF
6122e0e [Davies Liu] handle NULL
bc2ff38 [Davies Liu] handle NULL
7f5e70c [Davies Liu] Update SerDe.scala
46454e4 [Davies Liu] address comments
dd52cbc [Shivaram Venkataraman] Merge pull request #220 from shivaram/sparkr-utils-include
662938a [Shivaram Venkataraman] Include utils before SparkR for `head` to work Before this change calling `head` on a DataFrame would not work from the sparkR script as utils would be loaded after SparkR and placed ahead in the search list. This change requires utils to be loaded before SparkR
1bc2998 [Shivaram Venkataraman] Merge pull request #179 from evertlammerts/sparkr-sql
7695d36 [Evert Lammerts] added tests
8190127 [Evert Lammerts] fixed parquetFile signature
d8c8fcc [Shivaram Venkataraman] Merge pull request #219 from shivaram/sparkr-build-final
963c7ee [Davies Liu] Merge branch 'master' into merge
8bff523 [Shivaram Venkataraman] Remove staging repo now that 1.3 is released
e52258f [Davies Liu] Merge branch 'sparkr-sql' of github.com:amplab-extras/SparkR-pkg into toDF
05b9126 [Shivaram Venkataraman] Merge pull request #215 from davies/agg
8e1497d [Davies Liu] Update DataFrame.R
72adb14 [Davies Liu] Update SQLContext.R
66cc92a [Davies Liu] address commets
55c38bc [Shivaram Venkataraman] Merge pull request #216 from davies/select2
3e0555d [Shivaram Venkataraman] Merge pull request #193 from davies/daemon
0467474 [Davies Liu] add more selecter for DataFrame
9a6be74 [Davies Liu] include grouping columns in agg()
e87bb98 [Davies Liu] improve comment and logging
a6dc435 [Davies Liu] remove dependency of jsonlite
26a3621 [Davies Liu] support date.frame and Date/Time
4e4908a [Davies Liu] createDataFrame from rdd
5757b95 [Shivaram Venkataraman] Merge pull request #196 from davies/die
90f2692 [Shivaram Venkataraman] Merge pull request #211 from hlin09/generics
8583968 [Davies Liu] readFully()
46cea3d [Davies Liu] retry
01aa5ee [Davies Liu] add config for using daemon, refactor
ff948db [hlin09] Remove missingOrInteger.
ecdfda1 [hlin09] Remove duplication.
411b751 [Davies Liu] make RStudio happy
8f8813f [Davies Liu] switch back to use parallel
6bccbbf [hlin09] Move roxygen doc back to implementation.
ffd6e8e [Shivaram Venkataraman] Merge pull request #210 from hlin09/hlin09
471c794 [hlin09] Move getJRDD and broadcast's value to 00-generic.R.
89b886d [hlin09] Move setGeneric() to 00-generics.R.
97dde1a [hlin09] Add a test for access operators.
09ff163 [Shivaram Venkataraman] Merge pull request #204 from cafreeman/sparkr-sql
15a713f [cafreeman] Fix example for `dropTempTable`
dc1291b [hlin09] Add checks for namespace access operators in cleanClosure.
b4c0b2e [Davies Liu] use fork package
3db5649 [cafreeman] Merge branch 'sparkr-sql' of https://github.com/amplab-extras/SparkR-pkg into sparkr-sql
789be97 [Shivaram Venkataraman] Merge pull request #207 from shivaram/err-remove
e60578a [cafreeman] update tests to guarantee row order
5eec6fc [Shivaram Venkataraman] Merge pull request #206 from sun-rui/SPARKR-156_2
3f7aed6 [Sun Rui] Fix minor typos in the function description.
a8cebf0 [Shivaram Venkataraman] Remove print statement in SparkRBackendHandler This print statement is noisy for SQL methods which have multiple APIs (like loadDF). We already have a better error message when no valid methods are found
5e3a576 [Sun Rui] Fix indentation.
f3d99a6 [Sun Rui] [SPARKR-156] phase 2: implement zipWithIndex() of the RDD class.
a582810 [cafreeman] Merge branch 'dfMethods' into sparkr-sql
7a5d6fd [cafreeman] `withColumn` and `withColumnRenamed`
c5fa3b9 [cafreeman] New `select` method
bcb0bf5 [Shivaram Venkataraman] Merge pull request #180 from davies/group
9dd6a5a [Davies Liu] Update SparkRBackendHandler.scala
e6fb8d8 [Davies Liu] improve logging
428a99a [Davies Liu] remove test, catch exception
fef99de [cafreeman] `intersect`, `subtract`, `unionAll`
befbd32 [cafreeman] `insertInto`
9d01bcd [cafreeman] `dropTempTable`
d8c1c09 [Davies Liu] add test to start and stop context multiple times
18c6004 [Shivaram Venkataraman] Merge pull request #201 from sun-rui/SPARKR-156_1
dfb399a [Davies Liu] address comments
f06ccec [Sun Rui] Use mapply() instead of for statement.
3c7674f [Davies Liu] Merge branch 'die' of github.com:davies/SparkR-pkg into die
ac8a852 [Davies Liu] close monitor connection in sparkR.stop()
4d0fb56 [Shivaram Venkataraman] Merge pull request #203 from shivaram/sparkr-hive-fix
62b0760 [Shivaram Venkataraman] Fix test hive context package name
47a613f [Shivaram Venkataraman] Fix HiveContext package name
fb3b139 [Davies Liu] fix tests
d0d4626 [Shivaram Venkataraman] Merge pull request #199 from davies/load
8b7fb67 [Davies Liu] fix HiveContext
bb46832 [Davies Liu] Merge branch 'sparkr-sql' of github.com:amplab-extras/SparkR-pkg into load
e9e2a03 [Davies Liu] Merge branch 'sparkr-sql' of github.com:amplab-extras/SparkR-pkg into group
b875b4f [Davies Liu] fix style
de2abfa [Shivaram Venkataraman] Merge pull request #202 from cafreeman/sparkr-sql
3675fcf [cafreeman] Update `explain` and fixed doc for `toJSON`
5fd9575 [Davies Liu] Merge branch 'sparkr-sql' of github.com:amplab-extras/SparkR-pkg into load
6fac596 [Davies Liu] support Column expression in agg()
f10a24e [Davies Liu] address comments
ff8b005 [cafreeman] 'saveAsParquetFile`
a5c2887 [cafreeman] fix test
3fab0f8 [cafreeman] `showDF`
779c102 [cafreeman] `isLocal`
68b11cf [cafreeman] `toJSON`
0ac4abc [cafreeman] 'explain`
20242c4 [cafreeman] clean up docs
6a1fe64 [Shivaram Venkataraman] Merge pull request #198 from cafreeman/sparkr-sql
198c130 [Shivaram Venkataraman] Merge pull request #200 from shivaram/sparkr-sql-build
870acd4 [Shivaram Venkataraman] Use rc2 explicitly
8b9a963 [cafreeman] Merge branch 'sparkr-sql' of https://github.com/amplab-extras/SparkR-pkg into sparkr-sql
bc90115 [cafreeman] Fixed docs
3865f39 [Sun Rui] [SPARKR-156] phase 1: implement zipWithUniqueId() of the RDD class.
a37fd80 [Davies Liu] Update sparkR.R
d18f9d3 [Shivaram Venkataraman] Remove SparkR snapshot build We now have 1.3.0 RC2 on Apache Staging
8de958d [Davies Liu] Update SparkRBackend.scala
4e0becc [Shivaram Venkataraman] Merge pull request #194 from davies/api
197a79b [Davies Liu] add HiveContext (commented)
32aa01d [Shivaram Venkataraman] Merge pull request #191 from felixcheung/doc
5073e07 [Davies Liu] Merge branch 'sparkr-sql' of github.com:amplab-extras/SparkR-pkg into load
7918634 [cafreeman] Fix test
acea146 [cafreeman] remove extra line
74269f3 [cafreeman] Merge branch 'dfMethods' into sparkr-sql
cd7ac8a [Shivaram Venkataraman] Merge pull request #197 from cafreeman/sparkr-sql
494a4dd [cafreeman] update export
e14c328 [cafreeman] `selectExpr`
32b37d1 [cafreeman] Fixed indent in `join` test.
2e7b190 [Felix Cheung] small update on yarn deploy mode.
8ff29d6 [Davies Liu] fix tests
12a6db2 [Davies Liu] Merge branch 'sparkr-sql' of github.com:amplab-extras/SparkR-pkg into api
294ca4a [cafreeman] `join`, `sort`, and `filter`
4fa6343 [cafreeman] Refactor `join` generic for use with `DataFrame`
3f22c8d [Shivaram Venkataraman] Merge pull request #195 from cafreeman/sparkr-sql
2b6f980 [Davies Liu] shutdown the JVM after R process die
e8639c3 [cafreeman] New 1.3 repo and updates to `column.R`
ed9a89f [Davies Liu] address comments
03bcf20 [Davies Liu] Merge branch 'group' of github.com:davies/SparkR-pkg into group
39c253d [Davies Liu] Merge branch 'sparkr-sql' of github.com:amplab-extras/SparkR-pkg into group
98cc97a [Davies Liu] fix test and docs
e2d144a [Felix Cheung] Fixed small typos
3beadcf [Davies Liu] Merge branch 'sparkr-sql' of github.com:amplab-extras/SparkR-pkg into api
06cbc2d [Davies Liu] launch R worker by a daemon
8a676b1 [Shivaram Venkataraman] Merge pull request #188 from davies/column
524c122 [Davies Liu] Merge branch 'sparkr-sql' of github.com:amplab-extras/SparkR-pkg into column
f798402 [Davies Liu] Update column.R
1d0f2ae [Davies Liu] Update DataFrame.R
03402eb [Felix Cheung] Updates as per feedback on sparkR-submit
76cf2e0 [Shivaram Venkataraman] Merge pull request #192 from cafreeman/sparkr-sql
1955a09 [cafreeman] return object instead of a list of one object
f585929 [cafreeman] Fix brackets
e998356 [cafreeman] define generic for 'first' in RDD API
71d66a1 [Davies Liu] fix first(0
8ec21af [Davies Liu] fix signature
acae527 [Davies Liu] refactor
d7b17a4 [Davies Liu] fix approxCountDistinct
7dfe27d [Davies Liu] fix cyclic namespace dependency
8caf5bb [Davies Liu] use S4 methods
5c0bb24 [Felix Cheung] Doc updates: build and running on YARN
773baf0 [Zongheng Yang] Merge pull request #178 from davies/random
862f07c [Shivaram Venkataraman] Merge pull request #190 from shivaram/SPARKR-79
b457833 [Shivaram Venkataraman] Merge pull request #189 from shivaram/stdErrFix
f7caeb8 [Davies Liu] Update SparkRBackend.scala
8c4deae [Shivaram Venkataraman] Remove unused function
6e51c7f [Shivaram Venkataraman] Fix stderr redirection on executors
7afa4c9 [Shivaram Venkataraman] Merge pull request #186 from hlin09/funcDep3
4d36ab1 [hlin09] Add tests for broadcast variables.
3f57e56 [hlin09] Fix comments.
7b72487 [hlin09] Fix comments.
ae05bf1 [Davies Liu] Merge branch 'sparkr-sql' of github.com:amplab-extras/SparkR-pkg into column
abb4bb9 [Davies Liu] add Column and expression
eb8ac11 [Shivaram Venkataraman] Set Spark version 1.3.0 in Windows build
5c72e73 [Davies Liu] wait atmost 100 seconds
e425437 [Shivaram Venkataraman] Merge pull request #177 from lythesia/master
a00f502 [lythesia] fix indents
0346e5f [Davies Liu] address comment
6134649 [Shivaram Venkataraman] Merge pull request #187 from cafreeman/sparkr-sql
ad0935e [lythesia] minor fixes
b0e7f73 [cafreeman] Update `sampleDF` test
7b0d070 [lythesia] keep partitions check
889c265 [cafreeman] numToInt utility function
27dd3a0 [lythesia] modify tests for repartition
cad0f0c [cafreeman] Fix docs and indents
2808dcf [cafreeman] Three more DataFrame methods
5ef66fb [Davies Liu] send back the port via temporary file
3b46429 [Davies Liu] Merge branch 'master' of github.com:amplab-extras/SparkR-pkg into random
798f453 [cafreeman] Merge branch 'sparkr-sql' into dev
9aa4acf [Shivaram Venkataraman] Merge pull request #184 from davies/socket
020bce8 [Shivaram Venkataraman] Merge pull request #183 from cafreeman/sparkr-sql
222e06b [cafreeman] Lazy evaluation and formatting changes
e776324 [Davies Liu] fix import
211cc15 [cafreeman] Merge branch 'sparkr-sql' into dev
3351afd [hlin09] Replaces getDependencies with cleanClosure, to serialize UDFs to workers.
e7c56d6 [lythesia] fix random partition key
50c74b1 [Davies Liu] address comments
083c89f [cafreeman] Remove commented lines an unused import
dfa119b [hlin09] Improve the coverage of processClosure.
a41c9b9 [cafreeman] Merge branch 'wrapper' into sparkr-sql
1cd714f [cafreeman] Wrapper function docs.
db0cd9e [cafreeman] Clean up for wrapper functions
818c19f [cafreeman] Update schema-related functions
a57884e [cafreeman] Remove unused import
d72e830 [cafreeman] Add wrapper for `StructField` and `StructType`
2ea2ecf [lythesia] use generic arg
09b9512 [hlin09] add docs
f4f077c [hlin09] Add recursive cleanClosure for function access.
f84ad27 [hlin09] Merge remote-tracking branch 'upstream/master' into funcDep2
5300766 [Shivaram Venkataraman] Merge pull request #185 from hlin09/hlin09
07aa7c0 [hlin09] Unifies the implementation of lapply with lapplyParitionsWithIndex.
f4dbb0b [Davies Liu] use socket in worker
8282c59 [Davies Liu] Update DataFrame.R
ba495a8 [Davies Liu] Update NAMESPACE
36dffb3 [cafreeman] Add 'head` and `first`
534a95f [cafreeman] Schema-related methods
64f488d [cafreeman] Cache and Persist Methods
30d71fd [cafreeman] Standardize method arguments for DataFrame methods
785898b [Shivaram Venkataraman] Merge pull request #182 from cafreeman/sparkr-sql
2619003 [Shivaram Venkataraman] Merge pull request #181 from cafreeman/master
a9bbe0b [cafreeman] Update existing SparkSQL functions
8c241a3 [cafreeman] Merge with master, include changes to method args
68d6de4 [cafreeman] Fix typos
8d2ec6e [Davies Liu] add sum/max/min/avg/mean
774e687 [Davies Liu] add missing API in SQLContext
1e72b4b [Davies Liu] missing API in SQLContext
3294949 [Chris Freeman] Restore `rdd` argument to `getJRDD`
3a58ebc [Davies Liu] rm unrelated file
8bd93b5 [Davies Liu] fix signature
c652b4c [cafreeman] Update method signatures to use generic arg
48c8827 [Davies Liu] update NAMESPACE
84e2d8c [Davies Liu] groupBy and agg()
7c3ddbd [Davies Liu] create jmode in JVM
9465426 [Davies Liu] load and save
982f342 [lythesia] fix numeric issue
7651d84 [lythesia] fix coalesce
4e712e1 [Davies Liu] use random port in backend
041d22b [Shivaram Venkataraman] Merge pull request #172 from cafreeman/sparkr-sql
0d07770 [cafreeman] Added `limit` and updated `take`
301d8e5 [cafreeman] Remove extraneous map functions
0387db2 [cafreeman] Remove colNames
04c4b65 [lythesia] add repartition/coalesce
231deab [cafreeman] Change reserialize to serializeToBytes
acf7e1a [cafreeman] Rework the Scala to R DataFrame Conversion
481ae37 [cafreeman] Updated stale comments and standardized arg names
21d4a97 [hlin09] Adds cleanClosure to capture the function closures.
d24ffb4 [hlin09] Merge remote-tracking branch 'upstream/master' into funcDep2
8be02de [hlin09] Revert "loop 1-12 test pass."
fddb9cc [hlin09] Revert "add docs"
f8ef0ab [hlin09] Revert "More docs"
8e4b3da [hlin09] Revert "More docs"
57e005b [hlin09] Revert "fix tests."
c10148e [Shivaram Venkataraman] Merge pull request #174 from shivaram/sparkr-runner
910e3be [Shivaram Venkataraman] Add a timeout for initialization Also move sparkRBackend.stop into a finally block
bf52b17 [Shivaram Venkataraman] Merge remote-tracking branch 'amplab-sparkr/master' into sparkr-runner
08102b0 [Shivaram Venkataraman] Merge pull request #176 from lythesia/master
9c77b20 [Chris Freeman] Merge pull request #2 from shivaram/sparkr-sql
179ab38 [lythesia] add try counts and increase time interval
71a73b2 [Shivaram Venkataraman] Use a getter for serialization mode This change encapsulates the semantics of serialization mode for RDDs inside a getter function. For PipelinedRDDs if a backing JavaRDD is available we use that else we fall back to a default serialization mode
06bf250 [Shivaram Venkataraman] Merge pull request #173 from shivaram/windows-space-fix
88bf97f [Shivaram Venkataraman] Create SparkContext for R shell launch
f9268d9 [Shivaram Venkataraman] Fix code review comments
e6ad12d [Shivaram Venkataraman] Update comment describing sparkR-submit
17eda4c [Shivaram Venkataraman] Merge pull request #175 from falaki/docfix
ba2b72b [Hossein] Spark 1.1.0 is default
4cd7d3f [lythesia] retry backend connection
749e2d0 [Hossein] Updated README
bc04cf4 [Shivaram Venkataraman] Use SPARKR_BACKEND_PORT in sparkR.R as default Change SparkRRunner to use EXISTING_SPARKR_BACKEND_PORT to differentiate between the two
22a19ac [Shivaram Venkataraman] Use a semaphore to wait for backend to initalize Also pick a random port to avoid collisions
7f1f0f8 [cafreeman] Move comments to fit 100 char line length
8b84e4e [cafreeman] Make if statements more explicit
ce5d5ab [cafreeman] New tests for Union and Object File
b063320 [cafreeman] Changed 'serialized' to 'serializedMode'
0981dff [Zongheng Yang] Merge pull request #168 from sun-rui/SPARKR-153_2
86fc639 [Shivaram Venkataraman] Move sparkR-submit into pkg/inst
fd8f8a9 [Shivaram Venkataraman] Merge branch 'hqzizania-master'
a33dbea [Shivaram Venkataraman] Merge branch 'master' of https://github.com/hqzizania/SparkR-pkg into hqzizania-master
384e6e2 [Shivaram Venkataraman] Merge pull request #171 from hlin09/hlin09
1f5a6ac [hlin09] fixed comments
7f7596a [cafreeman] Additional handling for "row" serialization
8c3b8c5 [cafreeman] Add test for UnionRDD on "row" serialization
b1141f8 [cafreeman] Fixed formatting issues.
5db30bf [cafreeman] Changed serialized from bool to string
2f0c0b8 [cafreeman] Add check for serialized type
d243dfb [cafreeman] Clean up code
5ff63a2 [cafreeman] Change test from boolean to string
77fec1a [cafreeman] Updated .Rd files
9224989 [cafreeman] Various updates for DataFrame to RRDD
26af62b [cafreeman] DataFrame to RRDD
e004481 [cafreeman] Update UnionRDD test
5292be7 [hlin09] Adds support of pipeRDD().
e2a7560 [Shivaram Venkataraman] Merge pull request #170 from cafreeman/sparkr-sql
5d537f4 [cafreeman] Add pairRDD to Description
b6fa88e [cafreeman] Updating to current master
0cda231 [Sun Rui] [SPARKR-153] phase 2: implement aggregateByKey() and foldByKey().
95ee6b4 [Shivaram Venkataraman] Merge remote-tracking branch 'amplab-sparkr/master' into sparkr-runner
67fbc60 [Shivaram Venkataraman] Add support for SparkR shell to use spark-submit This ensures that SparkConf options are read in both in batch and interactive modes
2271030 [Shivaram Venkataraman] Merge pull request #167 from sun-rui/removePartionByInRDD
7fcb46a [Sun Rui] Remove partitionBy() in RDD.
52f94c4 [Shivaram Venkataraman] Merge pull request #160 from lythesia/master
59e2d54 [lythesia] merge with upstream
5836650 [Zongheng Yang] Merge pull request #163 from sun-rui/SPARKR-153_1
141723e [Sun Rui] fix comments.
f73a07e [Shivaram Venkataraman] Merge pull request #165 from shivaram/sparkr-sql-build
10ffc6d [Shivaram Venkataraman] Set Spark version to 1.3 using staging dependency Also fix the maven build
c91ede2 [Shivaram Venkataraman] Merge pull request #164 from hlin09/hlin09
9d335a9 [hlin09] Makes git to ignore Eclipse meta files.
94066bf [Sun Rui] [SPARKR-153] phase 1: implement fold() and aggregate().
9c391c7 [hqzizania] Merge remote-tracking branch 'upstream/master'
5f29551 [hqzizania] 	modified:   pkg/R/RDD.R 	modified:   pkg/R/context.R
d968664 [lythesia] fix comment
7972858 [Shivaram Venkataraman] Merge pull request #159 from sun-rui/SPARKR-150_2
7690878 [lythesia] separate out pair RDD functions
f4573c1 [Sun Rui] Use reduce() instead of sortBy().take() to get the ordered elements.
63e62ed [Sun Rui] [SPARKR-150] phase 2: implement takeOrdered() and top().
050390b [Shivaram Venkataraman] Fix bugs in inferring R file
8398f2e [Shivaram Venkataraman] Add sparkR-submit helper script Also adjust R file path for YARN cluster mode
bd6705b [Zongheng Yang] Merge pull request #154 from sun-rui/SPARKR-150
c7964c9 [Sun Rui] Merge with upstream master.
7feac38 [Sun Rui] Use default arguments for sortBy() and sortKeyBy().
de2bfb3 [Sun Rui] Fix minor comments and add more test cases.
0c6e071 [Zongheng Yang] Merge pull request #157 from lythesia/master
f5038c0 [lythesia] pull out anonymous functions in groupByKey
ba6f044 [lythesia] fixes for reduceByKeyLocally
343b6ab [Oscar Olmedo] Export sparkR.stop Closes #156 from oscaroboto/master
25639cf [Shivaram Venkataraman] Replace tabs with spaces
bb25920 [Shivaram Venkataraman] Merge branch 'dputler-master'
fd836db [hlin09] fix tests.
24a7f13 [hlin09] More docs
a465165 [hlin09] More docs
6ad4fc3 [hlin09] add docs
b082a35 [lythesia] add reduceByKeyLocally
7ca6512 [Shivaram Venkataraman] First cut of SparkRRunner
193f5fe [hlin09] loop 1-12 test pass.
345f1b8 [dputler] [SPARKR-195] Implemented project style guidelines for if-else statements
8043559 [Sun Rui] Add a TODO to use binary search in the range partitioner.
91b2fd6 [Sun Rui] Add more test cases.
e8ebbe4 [Shivaram Venkataraman] Merge pull request #152 from cafreeman/sparkr-sql
0c53d6c [dputler] Data frames now coerced to lists, and messages issued for a data frame or matrix on how they are parallelized
6d57ec0 [cafreeman] Remove json test file since we're using a temp
ac1ef09 [cafreeman] Update registerTempTable test
d9da451 [Sun Rui] [SPARKR-150] phase 1: implement sortBy() and sortByKey().
08ff30b [Shivaram Venkataraman] Merge pull request #153 from hqzizania/master
9767e8e [hqzizania] 	modified:   pkg/man/collect-methods.Rd
5d69f0a [hqzizania] 	modified:   pkg/R/RDD.R
4914091 [hqzizania] 	modified:   pkg/inst/tests/test_rdd.R
742a68b [cafreeman] Update test_sparkRSQL.R
a95823e [hqzizania] 	modified:   pkg/R/RDD.R
2d04526 [cafreeman] Formatting
fae9bdd [cafreeman] Renamed to SQLUtils.scala
39888ea [Chris Freeman] Update test_sparkSQL.R
fce2453 [cafreeman] Updated documentation for SQLContext
13fbf12 [cafreeman] Regenerated .Rd files
51ecf41 [cafreeman] Updated Scala object
30d7337 [cafreeman] Added SparkSQL test
74b3ed6 [cafreeman] Incorporate code feedback
554bda0 [Zongheng Yang] Merge pull request #147 from shivaram/sparkr-ec2-fixes
a5f4f8f [cafreeman] Squashed commit of the following:
f34bb88 [Shivaram Venkataraman] Remove profiling information from this PR
c662f29 [Zongheng Yang] Merge pull request #146 from shivaram/spark-1.2-build
21e9b74 [Zongheng Yang] Merge pull request #145 from lythesia/master
76f6b9e [Shivaram Venkataraman] Merge pull request #149 from hqzizania/master
1c2dbec [lythesia] minor fix for refactoring join code
5b380d3 [hqzizania] 	modified:   pkg/man/combineByKey.Rd 	modified:   pkg/man/groupByKey.Rd 	modified:   pkg/man/partitionBy.Rd 	modified:   pkg/man/reduceByKey.Rd
98794fe [hqzizania] 	modified:   pkg/R/RDD.R
b66534d [Zongheng Yang] Merge pull request #144 from shivaram/fix-rd-files
60da1df [Shivaram Venkataraman] Initialize timing variables
179aa75 [Shivaram Venkataraman] Bunch of fixes for longer running jobs 1. Increase the timeout for socket connection to wait for long jobs 2. Add some profiling information in worker.R 3. Put temp file writes before stdin writes in RRDD.scala
06d99f0 [Shivaram Venkataraman] Fix URI to have right number of slashes
add97f5 [Shivaram Venkataraman] Use URL encode to create valid URIs for jars
4eec962 [lythesia] refactor join functions
73430c6 [Shivaram Venkataraman] Make SparkR work on paths with spaces on Windows
aaf8f47 [Shivaram Venkataraman] Exclude hadoop client from Spark dependency
227ee42 [Zongheng Yang] Merge pull request #141 from shivaram/SPARKR-140
ac5ceb1 [Shivaram Venkataraman] Fix code review comments
32394de [Shivaram Venkataraman] Regenerate Rd files for SparkR This fixes a number of issues in SparkR man pages. The main changes are 1. Don't export or generate docs for PipelineRDD 2. Fix variable names for Filter, count to match base methods 3. Document missing arguments for sparkR.init, print.jobj etc.
e157bf6 [Shivaram Venkataraman] Use prev_serialized to track if JRDD is serialized This changes introduces a new variable in PipelineRDD environment to track if the prev_jrdd is serialized or not.
7428a7e [Zongheng Yang] Merge pull request #143 from shivaram/SPARKR-181
7dd1797 [Shivaram Venkataraman] Address code review comments
8f81c45 [Shivaram Venkataraman] Remove roxygen export for PipelinedRDD
0cb90f1 [Zongheng Yang] Merge pull request #142 from shivaram/SPARKR-169
d1c6e6c [Shivaram Venkataraman] Buffer stderr from R and return it on Exception This change buffers the last 100 lines from R process and passes these lines back to the driver if we have an exception. This will help users debug why their tasks failed on the cluster
d6c1393 [Shivaram Venkataraman] Suppress warnings from normalizePath
a382835 [Shivaram Venkataraman] Fix serialization tracking in pipelined RDDs When creating a pipeline RDD, we need to check if the JavaRDD belonging to the parent is serialized.
da39529 [Zongheng Yang] Merge pull request #140 from sun-rui/SPARKR-183
2814caa [Sun Rui] Merge with upstream master.
cd2a5b3 [Sun Rui] Add reference to Nagle's algorithm and clean code.
52356b6 [Shivaram Venkataraman] Merge pull request #139 from shivaram/fix-backend-exit
97e5a1f [Sun Rui] [SPARKR-183] Fix the issue that parallelize collect tests are slow.
a9f8e8e [Shivaram Venkataraman] Merge pull request #138 from concretevitamin/fix-collect-test
125ae43 [Shivaram Venkataraman] Fix SparkR backend to exit in more cases This change has two fixes 1. When the workspace is saved (from R or RStudio) the backend connection seems to be closed before the finalizer is run. In such cases we reopen the connection and stop the backend 2. With RStudio when R is restarted, there are port-conflicts which appear due to a race condition between the JVM and rsession restart. This change adds a 1 sec sleep to avoid this race.
12c102a [Zongheng Yang] Simplify a unit test.
9c0637a [Zongheng Yang] Merge pull request #137 from shivaram/fix-docs
0df0e18 [Shivaram Venkataraman] Fix documentation for includePackage
7549f88 [Zongheng Yang] Merge pull request #136 from shivaram/man-updates
7edbe46 [Shivaram Venkataraman] Add missing man pages
9cb9567 [Shivaram Venkataraman] Merge pull request #131 from shivaram/rJavaExpt
1fa722e [Shivaram Venkataraman] Rename to SerDe now
2fcb051 [Shivaram Venkataraman] Rename to SerDeJVMR
d112cf0 [Shivaram Venkataraman] Style fixes
9fd01cc [Shivaram Venkataraman] Remove unnecessary braces
0881931 [Shivaram Venkataraman] Some more style fixes
f00b531 [Shivaram Venkataraman] Address code review comments. Big changes include style fixes throughout for named arguments
c09ba05 [Shivaram Venkataraman] Change jobj id to be just an integer Add a new print.jobj that gets the class name and prints it Also add a utility function isInstanceOf
be05b16 [Shivaram Venkataraman] Check if context, connection exist before stopping
d596a23 [Shivaram Venkataraman] Address code review comments
396e7ac [Shivaram Venkataraman] Changes to make new backend work on Windows This change uses file.path to construct the Java binary path in a OS agnostic way and uses system2 to handle quoting binary paths correctly. Tests pass on Mac OSX and a Windows EC2 instance.
e7a4e03 [Shivaram Venkataraman] Remove unused file BACKEND.md
62f380b [Shivaram Venkataraman] Update worker.R to use new deserialization call
8b9c4e6 [Shivaram Venkataraman] Change RDD name, setName to use new backend
6dcd5c5 [Shivaram Venkataraman] Merge branch 'master' of https://github.com/amplab-extras/SparkR-pkg into rJavaExpt
0873397 [Shivaram Venkataraman] Refactor java object tracking into a new singleton. Also add comments describing each class
95db964 [Shivaram Venkataraman] Add comments, cleanup new R code
bcd4258 [Zongheng Yang] Merge pull request #130 from lythesia/master
74dbc5e [Sun Rui] Match method using parameter types.
7ad4a4d [Sun Rui] Use 1 char to represent types on the backend->client direction.
bace887 [Sun Rui] Use an integer count for the backend java object ID because Uniqueness isn't guaranteed by System.identityHashCode().
b38d04f [Sun Rui] Use 1 char to represent types on the client -> backend direction.
f88bc68 [lythesia] Merge branch 'master' of github.com:lythesia/SparkR-pkg
71d41f5 [lythesia] add test case for fullOuterJoin
eb4f423 [lythesia] --amend
cffecc5 [lythesia] add test case for fullOuterJoin
a547dd2 [Shivaram Venkataraman] Move classTag, rddRef into newJObject call This avoids them getting eagerly garbage collected
1255391 [Shivaram Venkataraman] Add a finalizer for jobj objects This enables Java objects to be garbage collected on the backend when they are no longer referenced in R. Also rename newJava to newJObject to be more consistent with callJMethod
70fa409 [Sun Rui] Add YARN Conf Dir to the class path when launching the backend.
a1108ca [lythesia] add fullOuterJoin in RDD.R
2152727 [Shivaram Venkataraman] Remove empty file
cd08bee [Shivaram Venkataraman] Update all functions to use new backend All unit tests pass.
9de49b7 [Shivaram Venkataraman] Add high level calls for methods, constructors Also update BACKEND.md
5a97ea4 [Shivaram Venkataraman] Add jobj S3 class that holds backend refs
e071d3e [Shivaram Venkataraman] Change SparkRBackend to use general method calls This change uses a custom protocl + JNI to invoke any method on a given object type. Also update serializers, deserializers to make code more concise
49f0404 [Shivaram Venkataraman] Merge pull request #129 from lythesia/master
7f8cd82 [lythesia] update man
4715ed2 [Yi Lu] Update RDD.R
5a53801 [lythesia] fix name,setName
4f3870b [lythesia] add name,setName in RDD.R
1c25700 [Shivaram Venkataraman] Merge pull request #128 from sun-rui/SPARKR-165
c8507d8 [Sun Rui] [SPARKR-165] IS_SCALAR is not present in R before 3.1
2cff2bd [Sun Rui] Add function to invoke Java method.
7a31da1 [Shivaram Venkataraman] Merge branch 'dputler-master'. Closes #119
0ceba82 [Shivaram Venkataraman] Merge branch 'master' of https://github.com/dputler/SparkR-pkg into dputler-master
735f70c [Shivaram Venkataraman] Merge pull request #125 from 7c00/rawcon
fccfe6c [Shivaram Venkataraman] Merge pull request #127 from sun-rui/SPARKR-164
387bd57 [Sun Rui] [SPARKR-164] Temporary files used by SparkR accumulat as time goes on.
5f2268f [Shivaram Venkataraman] Add support to stop backend
5f745c0 [Shivaram Venkataraman] Update notes in backend
22015c1 [Shivaram Venkataraman] Add first cut of SparkR Backend
52821da [Todd Gao] switch the order of packages and function deps
d7b0007 [Todd Gao] remove memCompress
cb6873e [Shivaram Venkataraman] Merge pull request #126 from sun-rui/SPARKR-147
c5962eb [Todd Gao] further optimize using rawConnection
f04c6e0 [Sun Rui] [SPARKR-147] Support multiple directories as input to textFile.
b7de604 [Todd Gao] optimize execFunctionDeps loading in worker.R
4d4fc30 [Shivaram Venkataraman] Merge pull request #122 from cafreeman/master
b508877 [cafreeman] Update SparkR_IDE_Setup.sh
21ed9d7 [cafreeman] Update build.sbt
f73ec16 [cafreeman] Delete SparkR_IDE_Setup_Guide.md
d63b026 [cafreeman] Delete SparkR_Quick_Start_Guide.md
6e6cb62 [cafreeman] Update SparkR_IDE_Setup.sh
bc6042b [cafreeman] Update build.sbt
a8197d5 [cafreeman] Merge remote-tracking branch 'upstream/master'
d671564 [Zongheng Yang] Merge pull request #123 from shivaram/jcheck-void
76b8d00 [Zongheng Yang] Merge pull request #124 from shivaram/master
b690d58 [Shivaram Venkataraman] Specify how to change Spark versions in README
0fb003d [Shivaram Venkataraman] Merge branch 'master' of https://github.com/amplab-extras/SparkR-pkg into jcheck-void
1c227b4 [Shivaram Venkataraman] Also add a check in context.R
96812b6 [Shivaram Venkataraman] Check for exceptions after void method calls
f5c216d [cafreeman] Merge remote-tracking branch 'upstream/master'
90c8933 [Zongheng Yang] Merge pull request #121 from shivaram/fix-sort-order
bd0e3b4 [Shivaram Venkataraman] Fix saveAsTextFile test case
2e55f67 [Shivaram Venkataraman] Merge branch 'master' of https://github.com/amplab-extras/SparkR-pkg into fix-sort-order
f10c607 [Shivaram Venkataraman] Merge pull request #118 from sun-rui/saveAsTextFile
6c9bfc0 [Sun Rui] Merge remote-tracking branch 'SparkR_upstream/master' into saveAsTextFile
6faedbe [cafreeman] Update SparkR_IDE_Setup_Guide.md
57008bc [cafreeman] Update SparkR_IDE_Setup.sh
bb1c17d [cafreeman] Update SparkR_IDE_Setup.sh
538bfdb [cafreeman] Update SparkR_Quick_Start_Guide.md
31322c6 [cafreeman] Update SparkR_IDE_Setup.sh
ca3f593 [Sun Rui] Refactor RRDD code.
df58d95 [cafreeman] Update SparkR_Quick_Start_Guide.md
b488c88 [cafreeman] Rename Spark_IDE_Setup.sh to SparkR_IDE_Setup.sh
b2545a4 [cafreeman] Added IDE Setup Guide
0ffb5de [cafreeman] Merge branch 'master' of https://github.com/cafreeman/SparkR-pkg
bd8fbfb [cafreeman] Merge remote-tracking branch 'upstream/master'
98efa5b [cafreeman] Added Quick Start Guide
3cf88f2 [Shivaram Venkataraman] Sort lists before comparing in unit tests Since Spark doesn't guarantee that shuffle results will always be in the same order, we need to sort the results before comparing for deterministic behavior
d621dbc [Shivaram Venkataraman] Merge pull request #120 from sun-rui/objectFile
c4a44d7 [Sun Rui] Add @seealso in comments and extract some common code into a function.
724e3a4 [cafreeman] Update Spark_IDE_Setup.sh
8153e5a [Sun Rui] [SPARKR-146] Support read/save object files in SparkR.
17f9909 [cafreeman] Update Spark_IDE_Setup.sh
a9eb080 [cafreeman] IDE Shell Script
64d800c [dputler] Merge remote branch 'upstream/master'
1fbdb2e [dputler] Added the ability for the user to specify a text file location throught the use of tilde expansion or just the file name if it is in the working directory.
d83c017 [Shivaram Venkataraman] Merge pull request #113 from sun-rui/stringHashCodeInC
a7d9cdb [Sun Rui] Fix build on Windows.
7d81b05 [Shivaram Venkataraman] Merge pull request #114 from hlin09/hlin09
47c4bb7 [hlin09] fix reviews
a457f7f [Shivaram Venkataraman] Merge pull request #116 from dputler/master
0fa48d1 [Shivaram Venkataraman] Merge pull request #117 from sun-rui/keyBy
85cfeb4 [Sun Rui] [SPARKR-144] Implement saveAsTextFile() in the RDD class.
09083d9 [Sun Rui] Add keyBy() to the RDD class.
caad5d7 [dputler] Adding the script to install software on the Cloudera Quick Start VM.
dca3d05 [hlin09] Minor fix.
ece5f7d [hlin09] Merge remote-tracking branch 'upstream/master' into hlin09
a40874b [hlin09] Use extendible accumulators aggregate the cogroup values.
d0347ce [Zongheng Yang] Merge pull request #112 from sun-rui/outer_join
492f76e [Sun Rui] Refine code and add description.
ba01358 [Shivaram Venkataraman] Merge pull request #115 from sun-rui/SPARKR-130
5c8e46e [Sun Rui] Fix per the review comments.
7190a2c [Sun Rui] Update comment to add a reference to storage levels.
1da705e [hlin09] Fix the review comments.
c4b77be [Sun Rui] [SPARKR-130] Add persist(storageLevel) API to RDD.
b424a1a [hlin09] Add function cogroup().
9770312 [Shivaram Venkataraman] Merge pull request #111 from hlin09/hlin09
cead7df [hlin09] fix review comments.
54f712e [Sun Rui] Implement string hash code in C.
425f0c6 [Sun Rui] Add leftOuterJoin() and rightOuterJoin() to the RDD class.
39509c7 [hlin09] add Rd file for foreach and foreachPartition.
63d6ac7 [hlin09] Adds function foreach() and foreachPartition().
9c954df [Zongheng Yang] Merge pull request #105 from sun-rui/join
c71228d [Sun Rui] Pre-allocate list with fixed length. Add test case for join() using string key.
bc3e9f6 [Shivaram Venkataraman] Merge pull request #108 from concretevitamin/take-optimize
c06fc90 [Zongheng Yang] Fix: only optimize for unserialized dataset case.
d399aeb [Zongheng Yang] Apply size-capping on logical representation instead of physical.
e4217dd [Zongheng Yang] Merge pull request #107 from shivaram/master
7952180 [Shivaram Venkataraman] Copy, use getLocalDirs from Spark Utils.scala
08e24c3 [Zongheng Yang] Merge pull request #109 from hlin09/hlin09
97d4e02 [Zongheng Yang] Min() upper-bound size with actual size.
bb779bf [hlin09] Rename the filter function to filterRDD to follow the API consistency. Filter() is also kept.
ce1661f [Zongheng Yang] Fix slow take(): deserialize only up to necessary # of elements.
4dca9b1 [Shivaram Venkataraman] Merge pull request #106 from hlin09/hlin09
1220d92 [hlin09] Adds function numPartitions().
2326a65 [Shivaram Venkataraman] Use SPARK_LOCAL_DIRS to create tmp files
e119757 [hlin09] Minor fix.
9c24c8b [hlin09] Adds function countByKey().
48fce67 [hlin09] Adds countByValue().
6679eef [Sun Rui] Update documentation for join().
70586b4 [Sun Rui] Add join() to the RDD class.
e6fb999 [Zongheng Yang] Merge pull request #103 from shivaram/rlibdir-fix
a21f146 [Shivaram Venkataraman] Merge pull request #102 from hlin09/hlin09
32eb619 [Shivaram Venkataraman] Merge pull request #104 from sun-rui/add_keys_values
d8692e9 [Sun Rui] Add keys() and values() for the RDD class.
18b9be1 [Shivaram Venkataraman] Allow users to set where SparkR is installed This also adds a warning if somebody tries to call sparkR.init multiple times.
a17f135 [hlin09] Adds tests for flatMap and flatMapValues.
4bcf59b [hlin09] Adds function flatMapValues.
4a193ef [Zongheng Yang] Merge pull request #101 from ashutoshraina/master
60d22f2 [Ashutosh Raina] changed sbt version
5400793 [Zongheng Yang] Merge pull request #98 from shivaram/windows-fixes-build
36d61a7 [Shivaram Venkataraman] Merge pull request #97 from hlin09/hlin09
f7d7d89 [hlin09] Remove redundant code in test.
6bbe823 [hlin09] minor style fix.
9b47f3a [Shivaram Venkataraman] Merge pull request #100 from hnahak87/patch-1
7f6e4ea [Harihar Nahak] Update logistic_regression.R
a605047 [Shivaram Venkataraman] Merge pull request #99 from hlin09/makefile
323151d [hlin09] Fix yar flag in Makefile to remove build error in Maven.
8911897 [hlin09] Make reserialize() private function in package.
79aee73 [Shivaram Venkataraman] Add notes on how to build SparkR on windows
49a99e7 [Shivaram Venkataraman] Clean up some commented code
ddc271b [Shivaram Venkataraman] Only append file:/// to non empty jar paths
a53952e [Shivaram Venkataraman] Add windows build scripts
325b179 [hlin09] Merge remote-tracking branch 'upstream/master' into hlin09
daf5040 [hlin09] Add reserialize() before union if two RDDs are not both serialized.
536afb1 [hlin09] Add new function of union().
7044677 [Shivaram Venkataraman] Merge branch 'master' of https://github.com/amplab-extras/SparkR-pkg into windows-fixes
d22a02d [Zongheng Yang] Merge pull request #94 from shivaram/windows-fixes-stdin
51924f7 [Shivaram Venkataraman] Merge pull request #90 from oscaroboto/master
eb97d85 [Shivaram Venkataraman] Merge pull request #96 from sun-rui/add_clarification_readme
5a128f4 [Sun Rui] Add clarification on setting Spark master when launching the SparkR shell.
187526a [oscaroboto] Update sparkR.R
32c567b [Shivaram Venkataraman] Merge pull request #95 from concretevitamin/master
4cd2d5e [Zongheng Yang] Notes about spark-ec2.
1c28e3b [Shivaram Venkataraman] Merge branch 'master' of https://github.com/amplab-extras/SparkR-pkg into windows-fixes
8e8a029 [Zongheng Yang] Merge pull request #92 from shivaram/sparkr-yarn
721043b [Zongheng Yang] Update README.md with YARN instructions.
1681f58 [Shivaram Venkataraman] Use temporary files for input instead of stdin This fixes a bug for Windows where stdin would get truncated
b084314 [oscaroboto] removed ... from example
44c93d4 [oscaroboto] Added example to SparkR.R
be82dcc [Shivaram Venkataraman] Merge pull request #93 from hlin09/hlin09
868554d [oscaroboto] Update sparkR.R
488ac47 [hlin09] Add generated Rd file of previous added functions, distinct() and mapValues().
b2740ad [hlin09] Add test for filter all elements. Add filter() as alias.
08d3631 [hlin09] Minor style fixes.
2c0e34f [hlin09] Adds function Filter(), which extracts the elements that satisfy a predicate.
5951d3b [Shivaram Venkataraman] Remove SBT plugin
4e70ced [oscaroboto] changed ExecutorEnv to sparkExecutorEnvMap, to make it consistent with sparkEnvirMap
903d18a [oscaroboto] changed executorEnv to sparkExecutorEnvMap,  will do the same in R
f97346e [oscaroboto] executorEnv to lower-case e
88a524e [oscaroboto] Added LD_LIBRARY_PATH to the ExecutorEnv. This is need so that the nodes can find libjvm.so, or if the master has a different LD_LIBRARY_PATH then the nodes. Make sure to export LD_LIBRARY_PATH  that includes the path to libjvm.so in the nodes.
1d208ae [oscaroboto] added the YARN_CONF_DIR to the classpath
8a9b75c [oscaroboto] forgot to change hm and ee inside the for loops
579db58 [Shivaram Venkataraman] Merge pull request #91 from sun-rui/add_max_min
4381efa [Sun Rui] use reduce() to implemement max() and min().
a5459c5 [Shivaram Venkataraman] Consolidate yarn flags
86b04eb [Shivaram Venkataraman] Don't use quotes around yarn
bf0797f [Shivaram Venkataraman] Add dependency on spark yarn module
af5fe77 [Shivaram Venkataraman] Fix SBT build, add dependency tree plugin
4917607 [Sun Rui] Add maximum() and minimum() API to RDD.
51bbbe4 [Shivaram Venkataraman] Changes to make SparkR work with YARN
9d5e3ab [oscaroboto] a few stylistic changes. Also change vars to sparkEnvirMap and eevars to ExecutorEnv, to match sparkR.R
578f545 [oscaroboto] a few stylistic changes
39eea2f [oscaroboto] Modification to dynamically create a sparkContext with YARN. Added .setExecutorEnv to the sparkConf in createSparkContext within the RRDD object. This modification was made together with sparkR.R
17ec42e [oscaroboto] A modification to dynamically create a sparkContext with YARN. sparkR.R modified to pass custom Jar file names and EnvironmentEnv to the sparkConf. RRDD.scala was also modified to accept the new inputs to creatSparkContext.
624ac9d [Shivaram Venkataraman] Merge pull request #87 from sun-rui/SPARKR-125
4f213db [Shivaram Venkataraman] Merge pull request #89 from sun-rui/SPARKR-108
eb833c5 [Shivaram Venkataraman] Merge pull request #88 from hlin09/hlin09
07bf971 [Sun Rui] [SPARKR-108] Implement map-side reduction for reduceByKey().
4accba1 [hlin09] Fixes style and adds an optional param 'numPartition' in distinct().
80d303a [hlin09] typo fixed.
e37a9b5 [hlin09] Adds function distinct() and mapValues().
08dac06 [Sun Rui] [SPARKR-125] Get the iterator of the parent RDD before launching a R worker process in compute() of RRDD/PairwiseRRDD
c4ba53c [Shivaram Venkataraman] Merge pull request #85 from edwardt/master
72a9d27 [root] reorder to keep relative ordering the same
f3fcb10 [root] fix up build.sbt also to match pom.xml
5ecbe3e [root] Make spark verison configurable in build script per ISSUE122
a44e63d [Shivaram Venkataraman] Merge pull request #84 from sun-rui/SPARKR-94
fbb5663 [Sun Rui] Add {} to one-line functions and add a test case for lookup where no match is found.
95beb4e [Shivaram Venkataraman] Merge pull request #82 from edwardt/master
36776c5 [edwardt] missed one 0.9.0 revert
b26deec [Sun Rui] [SPARKR-94] Add a  method to get an element of a pair RDD object by key.
1ba256e [edwardt] Keep 0.9.0 and says uses 1.1.0 by default
5380c43 [root] missed one version
21f74da [root] upgrade to spark version 1.1.0 to match lastest merge list
ddfcde9 [root] merge
67d067a [Shivaram Venkataraman] Merge pull request #81 from sun-rui/SparkR-117
993868f [Sun Rui] [SPARKR-117] Update Spark dependency to 1.1.0
d20661a [Zongheng Yang] Merge pull request #80 from sun-rui/master
0b2da9f [Sun Rui] Update Rd file and add a test case for mapPartitions.
5879648 [Sun Rui] Add mapPartitions() method to RDD for API consistency.
c033461 [Shivaram Venkataraman] Merge pull request #79 from sun-rui/fix-kmeans
f62b77e [Sun Rui] Adjust coding style.
b40911d [Sun Rui] Fix syntax error in examples/kmeans.R.
5304451 [Shivaram Venkataraman] Merge pull request #78 from sun-rui/master
70ffbfb [Sun Rui] Fix a bug that modifications to build.sbt won't trigger rebuilding.
a25696c [Shivaram Venkataraman] Merge pull request #76 from edwardt/addjira
b8bbd93 [edwardt] Update README.md
615d930 [edwardt] Update README.md
e522e69 [edwardt] Update README.md
03e6ced [edwardt] Update README.md
3007015 [root] don't check in gedit buffer file'
c35c9a6 [root] Add where to enter bugs ad feeback
469eae3 [edwardt] Update README.md
61b4a43 [edwardt] Update Makefile (style uniformity)
ce3337d [edwardt] Update README.md
7ff68fc [root] Merge branch 'master' of https://github.com/edwardt/SparkR-pkg
16353f5 [root] add links to devtools and install_github
513b9e5 [Shivaram Venkataraman] Merge pull request #72 from edwardt/master
31608a4 [edwardt] Update Makefile (style uniformity)
4ffe146 [root] Makefile: factor out SPARKR_VERSION to reduce potential copy&paste error; cp & rm called with -f in build/clean phase; .gitignore includes checkpoints and unit test log generated by run-tests.sh
715275f [Zongheng Yang] Merge pull request #68 from shivaram/master
90e2083 [Shivaram Venkataraman] Add return type to hasNext
8eb983d [Shivaram Venkataraman] Fix up comment
2206164 [Shivaram Venkataraman] Delete temporary files after they are read This change deletes temporary files used for communication between Rscript and the JVM once they have been completely read.
5881da7 [Zongheng Yang] Merge pull request #67 from shivaram/improve-shuffle
81251e2 [Shivaram Venkataraman] Address code review comments
a5f573f [Shivaram Venkataraman] Use a better list append in shuffles This is helpful in scenarios where we have a large number of values in a bucket
388e64d [Shivaram Venkataraman] Merge pull request #55 from RevolutionAnalytics/master
e1f95b6 [Zongheng Yang] Merge pull request #65 from concretevitamin/parallelize-fix
fc1a71a [Zongheng Yang] Fix that collect(parallelize(sc,1:72,15)) drops elements.
b8204c5 [Zongheng Yang] Minor: update a URL in README.
86f30c3 [Antonio Piccolboni] better fix for amplab-extras/SparkR-pkg#53
b3c318d [Antonio Piccolboni] delayed loading to have all namespaces available.
f323e97 [Antonio Piccolboni] tentative fix for amplab-extras/SparkR-pkg#53
6f82269 [Zongheng Yang] Merge pull request #48 from shivaram/master
8f433e5 [Shivaram Venkataraman] Move up Hadoop in pom.xml and add back protobufs As Hadoop 1.0.4 doesn't use protobufs, we can't exclude protobufs from Spark always. This change tries to order the dependencies so that the shader first picks up Hadoop's protobufs over Mesos.
bfe7e26 [Shivaram Venkataraman] Merge pull request #36 from RevolutionAnalytics/vectorize-examples
059ae41 [Antonio Piccolboni] and more formatting
9dbd531 [Antonio Piccolboni] more formatting per committer request
948738a [Antonio Piccolboni] converted tabs to spaces per project request
49f5f5a [Shivaram Venkataraman] Merge pull request #35 from shivaram/master
3eb5ad3 [Shivaram Venkataraman] on_failure -> after_failure in travis.yml
139bdee [Shivaram Venkataraman] Cache sbt, maven, ivy dependencies
4ebced2 [Shivaram Venkataraman] Merge pull request #34 from shivaram/master
8437061 [Shivaram Venkataraman] Exclude protobuf from Spark dependency in Maven This avoids pulling in multiple versions of protobuf from Mesos and Hadoop.
91aa527 [Antonio Piccolboni] vectorized version, 36s 10 slices 10^6 per slice. The older version takes 30 sec on 1/10th of data.
f137a57 [Antonio Piccolboni] for rstudio users
1f7ffb0 [Antonio Piccolboni] implemented using matrices and vectorized calls wherever possible
46b23df [Antonio Piccolboni] replace require with library
b15d7db [Antonio Piccolboni] faster parsing
8b7aeb3 [Antonio Piccolboni] 22x speed improvement, 3X mem impovement
c5bce07 [Zongheng Yang] Merge pull request #30 from shivaram/string-tests
21fa2d8 [Shivaram Venkataraman] Fix bug where serialized was not changed for RRRD Reason: When an RRDD is created in getJRDD we have converted any possibly unserialized RDD to a serialized RDD.
9d1ea20 [Shivaram Venkataraman] Merge branch 'master' of github.com:amplab/SparkR-pkg into string-tests
7b9348c [Shivaram Venkataraman] Add tests for partition with string keys Add two tests one with a string array and one from a textFile to test both codepaths
aacd726 [Shivaram Venkataraman] Update README with maven proxy instructions
803e62c [Shivaram Venkataraman] Merge pull request #28 from concretevitamin/master
7c093e6 [Zongheng Yang] Use inherits() to test an object's class.
061c591 [Shivaram Venkataraman] Merge pull request #26 from hafen/master
90f9fda [Ryan Hafen] Fix isRdd() to properly check for class
5b10cc7 [Zongheng Yang] Merge pull request #24 from shivaram/master
7014f83 [Shivaram Venkataraman] Remove unused transformers in maven's pom.xml
b00cea5 [Shivaram Venkataraman] Add support for a Maven build
11ec9b2 [Shivaram Venkataraman] Merge pull request #12 from concretevitamin/pipelined
6b18a90 [Zongheng Yang] Merge branch 'master' into pipelined
57127b8 [Zongheng Yang] Merge pull request #23 from shivaram/master
1ac3940 [Zongheng Yang] Review feedback.
a06fb34 [Zongheng Yang] Remove outdated comment.
0a1fc13 [Shivaram Venkataraman] Fixes for using SparkR with Hadoop2. 1. Exclude ASM, Netty from Hadoop similar to Spark. 2. Concat services files to ensure HDFS filesystems work. 3. Update README with an example
9a1db44 [Zongheng Yang] Merge pull request #22 from shivaram/master
e462448 [Shivaram Venkataraman] Use `$` for calling `put` instead of .jrcall
ed4559a [Shivaram Venkataraman] Add support for passing Spark environment vars This change creates a new `createSparkContext` method in RRDD as we can't pass Map<String, String> through rJava. Also use SPARK_MEM in local mode to increase heap size and update the README with some examples.
10228fb [Shivaram Venkataraman] Merge pull request #20 from concretevitamin/digit-ex
1398d9f [Zongheng Yang] Add linear_solver_mnist to examples/.
d484c2a [Zongheng Yang] Add tests for actions on PipelinedRDD.
d9cb95c [Zongheng Yang] Add setCheckpointDir() to context.R; comment fix.
f8bc8a9 [Zongheng Yang] Minor edits per Shivaram's comments.
8cd67f7 [Shivaram Venkataraman] Merge pull request #15 from shivaram/master
d4468a9 [Shivaram Venkataraman] Remove trailing comma
e2714b8 [Shivaram Venkataraman] Remove Apache Staging repo and update README
334eace [Zongheng Yang] Add a multi-transformation test to benchmark on pipelining.
5650ad7 [Zongheng Yang] Put serialized field inside env for both RDD and PipelinedRDD.
0b9e8bb [Zongheng Yang] First cut at PipelinedRDD.
a4c431e [Zongheng Yang] Add `isCheckpointed` field and checkpoint().
dac0795 [Zongheng Yang] Minor inline comment style fix.
bfb8e26 [Zongheng Yang] Add isCached field (inside an env) and unpersist().
295bff6 [Zongheng Yang] Merge pull request #11 from shivaram/master
4cb209c [Shivaram Venkataraman] Search rLibDir in worker before libPaths This ensures we pick up the SparkR intended and not an older version installed on the same machine
ef198ff [Zongheng Yang] Merge pull request #10 from shivaram/unit-tests
e0557a8 [Shivaram Venkataraman] Update travis to install plyr
8b18bc1 [Shivaram Venkataraman] Merge branch 'master' of github.com:amplab/SparkR-pkg into unit-tests
4a9ca31 [Shivaram Venkataraman] Use smaller broadcast and plyr instead of Matrix Matrix package takes around 2s to load and slows down unit tests.
21c6a61 [Zongheng Yang] Merge pull request #8 from shivaram/master
08c2947 [Shivaram Venkataraman] Move dev install directory to front of libPaths
bda42ee [Shivaram Venkataraman] Merge pull request #7 from JoshRosen/travis
cc5f5c0 [Josh Rosen] Add Travis CI integration (using craigcitro/r-travis)
b6c864b [Shivaram Venkataraman] Merge pull request #6 from concretevitamin/env-style-fix
4fcef22 [Zongheng Yang] Use one style ($) for accessing names in environments.
8a948c6 [Shivaram Venkataraman] Merge pull request #4 from shivaram/master
24978eb [Shivaram Venkataraman] Update README to use install_github
8899db4 [Shivaram Venkataraman] Update TODO.md
91792de [Shivaram Venkataraman] Update Spark requirements
f34f4bf [Shivaram Venkataraman] Check tests for failures and output error msg
cd750d3 [Shivaram Venkataraman] Update run-tests to use new path
1877b7c [Shivaram Venkataraman] Unset R_TESTS to make tests work with R CMD check Also silence Akka remoting logs and update Makefile to build on log4j changes
e60e18a [Shivaram Venkataraman] Update README to remove Spark installation notes
4450189 [Shivaram Venkataraman] Add Spark 0.9 dependency from Apache Staging Also clean up assembly jar from inst on make clean
5eb2131 [Shivaram Venkataraman] Update repo path in README
ec8210e [Shivaram Venkataraman] Remove broadcastId hack as it is public in Spark
9f0e080 [Shivaram Venkataraman] Merge branch 'install-github'
5c88fbd [Shivaram Venkataraman] Add helper script to run tests
77450a1 [Shivaram Venkataraman] Remove dependency on Spark Logging
6cb00d1 [Shivaram Venkataraman] Update README and add helper script install-dev.sh
28346ca [Shivaram Venkataraman] Only normalize if SPARK_HOME is not empty
0fd6571 [Shivaram Venkataraman] Normalize SPARK_HOME before passing it
ff96d5c [Shivaram Venkataraman] Pass in SPARK_HOME and jar file path
34c4dce [Shivaram Venkataraman] Move src into pkg and update Makefile This enables the package to be installed using install_github using devtools and automates the build procedure.
b25afed [Shivaram Venkataraman] Change package name to edu.berkeley.cs.amplab
c691464 [Shivaram Venkataraman] Add Apache 2.0 License file
27a4a4b [Shivaram Venkataraman] Add notes on how to compile roxygen2 docs
ca63844 [Shivaram Venkataraman] Add broadcast documentation Also generate documentation for sample, takeSample etc.
e4dd976 [Shivaram Venkataraman] Update TODO.md
e42d435 [Shivaram Venkataraman] Add support for broadcast variables
6b638e7 [Shivaram Venkataraman] Add the assembly jar to SparkContext
bf24e32 [Shivaram Venkataraman] Merge branch 'master' of github.com:amplab/SparkR-pkg
43c05ce [Zongheng Yang] Fix a flaky/incorrect test for sampleRDD().
c6a9dfc [Zongheng Yang] Initial port of the kmeans example.
6885581 [Zongheng Yang] Implement element-level sampleRDD() and takeSample() with tests.
d3a4987 [Zongheng Yang] Add a test for lapplyPartitionsWithIndex on pairwise RDD.
c7899c1 [Zongheng Yang] Add lapplyPartitionsWithIndex, with a test and an alias function.
a9a7436 [Shivaram Venkataraman] Add DFC example from Tselil, Benjamin and Jonah
fbc5a95 [Zongheng Yang] Implement take() and takeSample().
c4a3409 [Shivaram Venkataraman] Use RDD instead of RRDD
dfad3f5 [Zongheng Yang] Add test_utils.R: a unit test for convertJListToRList().
a45227d [Zongheng Yang] Update .gitignore.
238fe6e [Zongheng Yang] Add a unit test for textFile().
a88898b [Zongheng Yang] Rename test_rrd to test_rrdd
10c8baa [Shivaram Venkataraman] Make SparkR work as a standalone package. Changes include: 1. Adding a new `sbt` project that builds RRDD.scala 2. Change the onLoad functions to load the assembly jar for SparkR 3. Set rLibDir in RRDD.scala and worker.R to load things correctly
78adcd8 [Shivaram Venkataraman] Add a gitignore
ca6108f [Shivaram Venkataraman] Merge branch 'SparkR-scalacode' of ../SparkR
999bd61 [Shivaram Venkataraman] Update collectPartition in R and use ClassTag
c58f63e [Shivaram Venkataraman] Update collectPartition in R and use ClassTag
48265fd [Shivaram Venkataraman] Use new version of collectPartitions in take
d4fe086 [Shivaram Venkataraman] Move collectPartitions to JavaRDDLike Also remove numPartitions in JavaRDD and update R code
bfecd7b [Shivaram Venkataraman] Scala 2.10 changes 1. Update sparkR script 2. Use classTag instead of classManifest
092a4b3 [Shivaram Venkataraman] Add combineByKey, update TODO
ac0d81d [Shivaram Venkataraman] Add more documentation
d1dc3fa [Shivaram Venkataraman] Add more documentation
c515e3a [Shivaram Venkataraman] Update TODO
db56a34 [Shivaram Venkataraman] Add a test case for include package
41cea51 [Shivaram Venkataraman] Ensure all parent environments are serialized. Also add a test case with an inline function
a978e84 [Shivaram Venkataraman] Add support to include packages in the worker
12bf8ce [Shivaram Venkataraman] Add support to include packages in the worker
fb7e72c [Shivaram Venkataraman] Cleanup TODO
16ac314 [Shivaram Venkataraman] Add documentation for functions in context, sparkR
85b1d25 [Shivaram Venkataraman] Set license to Apache
88f1101 [Shivaram Venkataraman] Add unit test running instructions
c40768e [Shivaram Venkataraman] Update TODO
0c7efbf [Shivaram Venkataraman] Refactor RRDD.scala and add comments to functions
5880d42 [Shivaram Venkataraman] Refactor RRDD.scala and add comments to functions
2dee36c [Shivaram Venkataraman] Remove empty test file
a82219b [Shivaram Venkataraman] Update TODOs
5db00dc [Shivaram Venkataraman] Add reduceByKey, groupByKey and refactor shuffle Other changes include 1. Adding unit tests for basic RDD functions and shuffle 2. Add a word count example 3. Change the dependency serialization to handle double loading of SparkR    package 4. Allow partitionBy to operate on any RDDs to create pair-wise RDD.
f196479 [Shivaram Venkataraman] Add reduceByKey, groupByKey and refactor shuffle Other changes include 1. Adding unit tests for basic RDD functions and shuffle 2. Add a word count example 3. Change the dependency serialization to handle double loading of SparkR    package 4. Allow partitionBy to operate on any RDDs to create pair-wise RDD.
987e36f [Shivaram Venkataraman] Add perf todo
0b03265 [Shivaram Venkataraman] Update TODO with testing, docs todo
685aaad [Zongheng Yang] First cut at refactoring worker.R. Remove pairwiseWorker.R.
95b9ddc [Zongheng Yang] First cut at refactoring worker.R. Remove pairwiseWorker.R.
4f00895 [Zongheng Yang] Remove the unnecessary `pairwise' flag in RRDD class. Reasons:
75d36d9 [Zongheng Yang] Working versions: partitionBy() and collectPartition() for RRDD.
e3fbd9d [Zongheng Yang] Working versions: partitionBy() and collectPartition() for RRDD.
67a4335 [Zongheng Yang] Add unit test for parallelize() and collect() pairwise data.
100ae65 [Zongheng Yang] Properly parallelize() and collect() pairwise data.
cd0a5e2 [Zongheng Yang] Properly parallelize() and collect() pairwise data.
aea16c3 [Zongheng Yang] WIP: second cut at partitionBy. Running into R/Scala communication issues.
45eb943 [Zongheng Yang] WIP: second cut at partitionBy. Running into R/Scala communication issues.
11c893b [Zongheng Yang] WIP: need to figure out the logic of (whether or not) shipping a hash func
82c201a [Zongheng Yang] WIP: need to figure out the logic of (whether or not) shipping a hash func
b3bfad2 [Zongheng Yang] Update TODO: take() done.
0e45293 [Zongheng Yang] Add ability to parallelize key-val collections in R.
f60406a [Zongheng Yang] Add ability to parallelize key-val collections in R.
7d7fe3b [Zongheng Yang] Re-implement take(): take a partition at a time and append.
a054e55 [Zongheng Yang] Fix take() tests(): mode difference.
9de0935 [Zongheng Yang] Implement take() for RRDD.
1e4427e [Zongheng Yang] Implement take() for RRDD.
ec3cd67 [Shivaram Venkataraman] Use temp file in Spark to pipe output
417aaed [Shivaram Venkataraman] Use temp file in Spark to pipe output
bb0a3c3 [Shivaram Venkataraman] Add conf directory to classpath
9594d8a [Shivaram Venkataraman] Clean up LR example
3b26b58 [Shivaram Venkataraman] Add a list of things to do.
cabce68 [Shivaram Venkataraman] Fix warnings from package check
fde3f9c [Shivaram Venkataraman] Flatten by default and disable recursive unlist
ab2e061 [Shivaram Venkataraman] Create LIB_DIR before installing SparkR package
555220a [Shivaram Venkataraman] Add readme and update Makefile
1319cda [Shivaram Venkataraman] Make standalone programs run with sparkR
ae19fa8 [Shivaram Venkataraman] Add support for cache and use `tempfile`
4e89ca4 [Shivaram Venkataraman] Add support for apply, reduce, count Also serialize closures using `save` and add two examples
25a0bea [Shivaram Venkataraman] Add support for apply, reduce, count Also serialize closures using `save` and add two examples
f50223f [Zongheng Yang] Make parallelize() and collect() use lists. Add a few more tests for them.
fc7693f [Zongheng Yang] Refactor and enhance the previously added unit test a little bit.
6de9b81 [Zongheng Yang] Add a simple unit test for parallelize().
8b95155 [Zongheng Yang] Add testthat skeleton infrastructure
ef305bf [Zongheng Yang] parallelize() followed by collect() now work for vectors/lists of strings and numerics (should work for other primitives as well).
dc16af4 [Zongheng Yang] Comment: toArray() allocates memory for a copy
f50121e [Zongheng Yang] Make parallelize() return JavaRDD[Array[Byte]]. Add RRDD.scala with a helper function in the singleton object.
46eb063 [Zongheng Yang] Make parallelize() return JavaRDD[Array[Byte]]. Add RRDD.scala with a helper function in the singleton object.
6b4938a [Zongheng Yang] parallelize(): a raw can be parallelized by JavaSparkContext and get back JavaRDD
978aa0f [Zongheng Yang] Add parallelize() skeleton: only return serialized slices now
84c1fd2 [Zongheng Yang] Use .jsimplify() to get around generic List's get() type erasure problem
f16b891 [Zongheng Yang] Convert a few reflectionc alls to .jcall
1284c13 [Zongheng Yang] WIP on collect(): JavaListToRList() failed with errors.
4c2e516 [Zongheng Yang] Add simple prototype of S4 class RRDD. Make TextFile() returns an RRDD.
82aa17a [Zongheng Yang] Add textFile()
83ce63f [Zongheng Yang] Create a JavaSparkContext and save it in .sparkEnv using sparkR.init()
01cdf0e [Zongheng Yang] Add Makefile for SparkR
fc9cae2 [Shivaram Venkataraman] Add skeleton R package
2015-04-08 22:45:40 -07:00