Add R API for `read.jdbc`, `write.jdbc`.
Tested this quite a bit manually with different combinations of parameters. It's not clear if we could have automated tests in R for this - the Scala `JDBCSuite` depends on the Java H2 in-memory database.
Refactored some code into util so it could be tested.
Core's R SerDe code needs to be updated to allow access to java.util.Properties as a `jobj` handle, which is required by DataFrameReader/Writer's `jdbc` method. It would be possible, though it would require more code, to add a `sql/r/SQLUtils` helper function instead.
Tested:
```
# with postgresql
../bin/sparkR --driver-class-path /usr/share/java/postgresql-9.4.1207.jre7.jar
# read.jdbc
df <- read.jdbc(sqlContext, "jdbc:postgresql://localhost/db", "films2", user = "user", password = "12345")
df <- read.jdbc(sqlContext, "jdbc:postgresql://localhost/db", "films2", user = "user", password = 12345)
# partitionColumn and numPartitions test
df <- read.jdbc(sqlContext, "jdbc:postgresql://localhost/db", "films2", partitionColumn = "did", lowerBound = 0, upperBound = 200, numPartitions = 4, user = "user", password = 12345)
a <- SparkR:::toRDD(df)
SparkR:::getNumPartitions(a)
[1] 4
SparkR:::collectPartition(a, 2L)
# defaultParallelism test
df <- read.jdbc(sqlContext, "jdbc:postgresql://localhost/db", "films2", partitionColumn = "did", lowerBound = 0, upperBound = 200, user = "user", password = 12345)
SparkR:::getNumPartitions(a)
[1] 2
# predicates test
df <- read.jdbc(sqlContext, "jdbc:postgresql://localhost/db", "films2", predicates = list("did<=105"), user = "user", password = 12345)
count(df) == 1
# write.jdbc, default save mode "error"
irisDf <- as.DataFrame(sqlContext, iris)
write.jdbc(irisDf, "jdbc:postgresql://localhost/db", "films2", user = "user", password = "12345")
"error, already exists"
write.jdbc(irisDf, "jdbc:postgresql://localhost/db", "iris", user = "user", password = "12345")
```
Author: felixcheung <felixcheung_m@hotmail.com>
Closes #10480 from felixcheung/rreadjdbc.
## What changes were proposed in this pull request?
Expose R-like summary statistics in SparkR::glm for more family and link functions.
Note: Not all values in R [summary.glm](http://stat.ethz.ch/R-manual/R-patched/library/stats/html/summary.glm.html) are exposed; we only provide the most commonly used statistics in this PR. More statistics can be added in follow-up work.
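As a minimal sketch (not the exact test code from this PR), output like the SparkR block below can be produced as follows, assuming an initialized `sqlContext`; note that `createDataFrame` replaces the dots in the iris column names with underscores:
```r
df <- createDataFrame(sqlContext, iris)
model <- glm(Sepal_Width ~ Sepal_Length + Species, data = df, family = "gaussian")
summary(model)
```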
## How was this patch tested?
Unit tests.
SparkR Output:
```
Deviance Residuals:
(Note: These are approximate quantiles with relative error <= 0.01)
Min 1Q Median 3Q Max
-0.95096 -0.16585 -0.00232 0.17410 0.72918
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 1.6765 0.23536 7.1231 4.4561e-11
Sepal_Length 0.34988 0.046301 7.5566 4.1873e-12
Species_versicolor -0.98339 0.072075 -13.644 0
Species_virginica -1.0075 0.093306 -10.798 0
(Dispersion parameter for gaussian family taken to be 0.08351462)
Null deviance: 28.307 on 149 degrees of freedom
Residual deviance: 12.193 on 146 degrees of freedom
AIC: 59.22
Number of Fisher Scoring iterations: 1
```
R output:
```
Deviance Residuals:
Min 1Q Median 3Q Max
-0.95096 -0.16522 0.00171 0.18416 0.72918
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 1.67650 0.23536 7.123 4.46e-11 ***
Sepal.Length 0.34988 0.04630 7.557 4.19e-12 ***
Speciesversicolor -0.98339 0.07207 -13.644 < 2e-16 ***
Speciesvirginica -1.00751 0.09331 -10.798 < 2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
(Dispersion parameter for gaussian family taken to be 0.08351462)
Null deviance: 28.307 on 149 degrees of freedom
Residual deviance: 12.193 on 146 degrees of freedom
AIC: 59.217
Number of Fisher Scoring iterations: 2
```
cc mengxr
Author: Yanbo Liang <ybliang8@gmail.com>
Closes #12393 from yanboliang/spark-13925.
* SparkR glm supports families and link functions which match R's signature for family.
* SparkR glm API refactor. The new API is modeled on R's glm, so I only expose the arguments that R glm supports: ```formula, family, data, epsilon and maxit```.
* This PR focuses on glm() and predict() (see the sketch below); summary statistics will be done in a separate PR after this gets in.
* This PR depends on #12287, which makes GLMs support link prediction on the Scala side. After that is merged, I will add more tests for predict() to this PR.
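A hedged usage sketch of the R-style family/link signature, assuming a DataFrame `df` with a binary column `label` and numeric columns `x1` and `x2` (all names hypothetical):
```r
model <- glm(label ~ x1 + x2, data = df, family = binomial(link = "logit"))
preds <- predict(model, newData = df)
```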
Unit tests.
cc mengxr jkbradley hhbyyh
Author: Yanbo Liang <ybliang8@gmail.com>
Closes #12294 from yanboliang/spark-12566.
#### What changes were proposed in this pull request?
This PR is to address the comment: https://github.com/apache/spark/pull/12146#discussion-diff-59092238. It removes the function `isViewSupported` from `SessionCatalog`. After the removal, we can still capture user errors if users try to drop a table using `DROP VIEW`.
#### How was this patch tested?
Modified the existing test cases
Author: gatorsmile <gatorsmile@gmail.com>
Closes #12284 from gatorsmile/followupDropTable.
## What changes were proposed in this pull request?
The `window` function was added to Dataset with [this PR](https://github.com/apache/spark/pull/12008).
This PR adds the R API for this function.
With this PR, SQL, Java, and Scala will share the same APIs, in that users can use:
- `window(timeColumn, windowDuration)`
- `window(timeColumn, windowDuration, slideDuration)`
- `window(timeColumn, windowDuration, slideDuration, startTime)`
In Python and R, users can access all of the APIs above, and in addition they can do
- In R:
`window(timeColumn, windowDuration, startTime=...)`
that is, they can provide the `startTime` without providing the `slideDuration`. In this case, we will generate tumbling windows.
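A short sketch of the R API (assuming `df` has a TimestampType column `time`):
```r
w1 <- window(df$time, "1 minute")                            # tumbling windows
w2 <- window(df$time, "2 minutes", "30 seconds")             # sliding windows
w3 <- window(df$time, "2 minutes", startTime = "10 seconds") # tumbling with offset
counts <- count(groupBy(df, window(df$time, "1 minute")))
```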
## How was this patch tested?
Unit tests + manual tests
Author: Burak Yavuz <brkyvz@gmail.com>
Closes #12141 from brkyvz/R-windows.
## What changes were proposed in this pull request?
Refactor RRDD by separating the common logic for interacting with the R worker into a new class RRunner, which can be used to evaluate R UDFs.
Now RRDD relies on RRunner for RDD computation, and RRDD could be removed if we want to remove the RDD API in SparkR later.
## How was this patch tested?
dev/lint-r
SparkR unit tests
Author: Sun Rui <rui.sun@intel.com>
Closes #12024 from sun-rui/SPARK-12792_new.
Refactor RRDD by separating the common logic for interacting with the R worker into a new class RRunner, which can be used to evaluate R UDFs.
Now RRDD relies on RRunner for RDD computation, and RRDD could be removed if we want to remove the RDD API in SparkR later.
Author: Sun Rui <rui.sun@intel.com>
Closes #10947 from sun-rui/SPARK-12792.
## What changes were proposed in this pull request?
This reopens #11836, which was merged but promptly reverted because it introduced flaky Hive tests.
## How was this patch tested?
See `CatalogTestCases`, `SessionCatalogSuite` and `HiveContextSuite`.
Author: Andrew Or <andrew@databricks.com>
Closes #11938 from andrewor14/session-catalog-again.
## What changes were proposed in this pull request?
This PR continues the work in #11447; we implemented the wrapper of ```AFTSurvivalRegression``` named ```survreg``` in SparkR.
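A minimal usage sketch, assuming the `ovarian` dataset from the R survival package (dots in its column names become underscores in the DataFrame):
```r
library(survival) # for the ovarian dataset
df <- createDataFrame(sqlContext, ovarian)
model <- survreg(Surv(futime, fustat) ~ ecog_ps + rx, data = df)
summary(model)
```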
## How was this patch tested?
Test against output from R package survival's survreg.
cc mengxr felixcheung
Closes #11447
Author: Yanbo Liang <ybliang8@gmail.com>
Closes #11932 from yanboliang/spark-13010-new.
## What changes were proposed in this pull request?
`SessionCatalog`, introduced in #11750, is a catalog that keeps track of temporary functions and tables, and delegates metastore operations to `ExternalCatalog`. This functionality overlaps a lot with the existing `analysis.Catalog`.
As of this commit, `SessionCatalog` and `ExternalCatalog` will no longer be dead code. There are still things that need to be done after this patch, namely:
- SPARK-14013: Properly implement temporary functions in `SessionCatalog`
- SPARK-13879: Decide which DDL/DML commands to support natively in Spark
- SPARK-?????: Implement the ones we do want to support through `SessionCatalog`.
- SPARK-?????: Merge SQL/HiveContext
## How was this patch tested?
This is largely a refactoring task so there are no new tests introduced. The particularly relevant tests are `SessionCatalogSuite` and `ExternalCatalogSuite`.
Author: Andrew Or <andrew@databricks.com>
Author: Yin Huai <yhuai@databricks.com>
Closes #11836 from andrewor14/use-session-catalog.
## What changes were proposed in this pull request?
This PR continues the work in #11486 from yinxusen with some code refactoring. In R package e1071, `naiveBayes` supports both categorical (Bernoulli) and continuous features (Gaussian), while in MLlib we support Bernoulli and multinomial. This PR implements the common subset: Bernoulli.
I moved the implementation out from SparkRWrappers to NaiveBayesWrapper to make it easier to read. Argument names, default values, and summary now match e1071's naiveBayes.
I removed the preprocessing part that omits NA values, because we don't know which columns to process.
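A minimal usage sketch following e1071's interface, assuming the infert dataset:
```r
df <- createDataFrame(sqlContext, infert)
model <- naiveBayes(education ~ ., data = df, laplace = 0)
summary(model)
```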
## How was this patch tested?
Test against output from R package e1071's naiveBayes.
cc: yanboliang yinxusen
Closes #11486
Author: Xusen Yin <yinxusen@gmail.com>
Author: Xiangrui Meng <meng@databricks.com>
Closes #11890 from mengxr/SPARK-13449.
## What changes were proposed in this pull request?
This PR fixes all newly captured SparkR lint-r errors after the lintr package is updated from github.
## How was this patch tested?
dev/lint-r
SparkR unit tests
Author: Sun Rui <rui.sun@intel.com>
Closes #11652 from sun-rui/SPARK-13812.
## What changes were proposed in this pull request?
SparkR support for first/last with ignoring NAs.
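A hedged sketch of the intended usage, assuming the parameter is named `na.rm` as in base R (column name hypothetical):
```r
# the second argument skips NA values when computing first/last
collect(agg(df, first(df$age, na.rm = TRUE), last(df$age, na.rm = TRUE)))
```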
cc sun-rui felixcheung shivaram
## How was this patch tested?
unit tests
Author: Yanbo Liang <ybliang8@gmail.com>
Closes #11267 from yanboliang/spark-13389.
Author: Oscar D. Lara Yejas <odlaraye@oscars-mbp.attlocal.net>
Author: Oscar D. Lara Yejas <odlaraye@oscars-mbp.usca.ibm.com>
Closes #11220 from olarayej/SPARK-13312-3.
## What changes were proposed in this pull request?
Add ```approxQuantile``` for SparkR.
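A minimal usage sketch:
```r
df <- createDataFrame(sqlContext, data.frame(x = 1:100))
# 25th/50th/75th percentiles with a 1% relative error bound
approxQuantile(df, "x", c(0.25, 0.5, 0.75), 0.01)
```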
## How was this patch tested?
unit tests
Author: Yanbo Liang <ybliang8@gmail.com>
Closes #11383 from yanboliang/spark-13504 and squashes the following commits:
4f17adb [Yanbo Liang] Add approxQuantile for SparkR
JIRA: https://issues.apache.org/jira/browse/SPARK-13472
## What changes were proposed in this pull request?
One Kmeans test in R is unstable and sometimes fails. We should fix it.
## How was this patch tested?
Unit test is modified in this PR.
Author: Liang-Chi Hsieh <viirya@gmail.com>
Closes #11345 from viirya/fix-kmeans-r-test and squashes the following commits:
f959f61 [Liang-Chi Hsieh] Sort resulted clusters.
This PR introduces several major changes:
1. Replacing `Expression.prettyString` with `Expression.sql`
The `prettyString` method is mostly an internal, developer faced facility for debugging purposes, and shouldn't be exposed to users.
1. Using SQL-like representation as column names for selected fields that are not named expression (back-ticks and double quotes should be removed)
Before, we were using `prettyString` as column names when possible, and sometimes the result column names can be weird. Here are several examples:
Expression | `prettyString` | `sql` | Note
------------------ | -------------- | ---------- | ---------------
`a && b` | `a && b` | `a AND b` |
`a.getField("f")` | `a[f]` | `a.f` | `a` is a struct
1. Adding trait `NonSQLExpression` extending from `Expression` for expressions that don't have a SQL representation (e.g. Scala UDF/UDAF and Java/Scala object expressions used for encoders)
`NonSQLExpression.sql` may return an arbitrary user facing string representation of the expression.
Author: Cheng Lian <lian@databricks.com>
Closes #10757 from liancheng/spark-12799.simplify-expression-string-methods.
Add ```covar_samp``` and ```covar_pop``` for SparkR.
Should we also provide a ```cov``` alias for ```covar_samp```? There is a ```cov``` implementation in stats.R which already masks ```stats::cov```, but adding the alias may introduce a breaking API change.
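A minimal usage sketch:
```r
df <- createDataFrame(sqlContext, mtcars)
collect(agg(df, covar_samp(df$mpg, df$wt), covar_pop(df$mpg, df$wt)))
```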
cc sun-rui felixcheung shivaram
Author: Yanbo Liang <ybliang8@gmail.com>
Closes #10829 from yanboliang/spark-12903.
I've tried to solve some of the issues mentioned in: https://issues.apache.org/jira/browse/SPARK-12629
Please let me know what you think.
Thanks!
Author: Narine Kokhlikyan <narine.kokhlikyan@gmail.com>
Closes #10580 from NarineK/sparkrSavaAsRable.
The current parser turns a decimal literal, for example ```12.1```, into a Double. The problem with this approach is that we convert an exact literal into a non-exact ```Double```. The PR changes this behavior: a decimal literal is now converted into an exact ```BigDecimal```.
The behavior for scientific decimals, for example ```12.1e01```, is unchanged. This will be converted into a Double.
This PR replaces the ```BigDecimal``` literal with a ```Double``` literal, because ```BigDecimal``` is the default now. You can use the double literal by appending a 'D' to the value, for instance: ```3.141527D```
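A hypothetical illustration of the new behavior from SparkR (the column aliases are made up):
```r
# 12.1 now parses as an exact decimal; the 'D' suffix and scientific
# notation both yield doubles
printSchema(sql(sqlContext, "SELECT 12.1 AS exact_dec, 12.1D AS dbl, 12.1e1 AS sci"))
```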
cc davies rxin
Author: Herman van Hovell <hvanhovell@questtec.nl>
Closes #10796 from hvanhovell/SPARK-12848.
shivaram, sorry it took longer to fix some conflicts; this is the change to add an alias for `table`.
Author: felixcheung <felixcheung_m@hotmail.com>
Closes #10406 from felixcheung/readtable.
Currently this is what is reported when loading the SparkR package in R (we would probably also add is.nan):
```
Loading required package: methods
Attaching package: ‘SparkR’
The following objects are masked from ‘package:stats’:
cov, filter, lag, na.omit, predict, sd, var
The following objects are masked from ‘package:base’:
colnames, colnames<-, intersect, rank, rbind, sample, subset,
summary, table, transform
```
Adding this test gives us an automated way to track changes to masked methods.
Also, the second part of this test checks for those functions that would not be accessible without a namespace/package prefix.
Incidentally, this might point to how we could fix those inaccessible functions in base or stats.
Looking for feedback for adding this test.
Author: felixcheung <felixcheung_m@hotmail.com>
Closes #10171 from felixcheung/rmaskedtest.
Slight correction: I'm leaving sparkR as-is (i.e. R file not supported) and fixed only run-tests.sh as shivaram described.
I also assume we are going to cover all doc changes in https://issues.apache.org/jira/browse/SPARK-12846 instead of here.
rxin shivaram zjffdu
Author: felixcheung <felixcheung_m@hotmail.com>
Closes #10792 from felixcheung/sparkRcmd.
Author: Oscar D. Lara Yejas <odlaraye@oscars-mbp.usca.ibm.com>
Author: Oscar D. Lara Yejas <olarayej@mail.usf.edu>
Author: Oscar D. Lara Yejas <oscar.lara.yejas@us.ibm.com>
Author: Oscar D. Lara Yejas <odlaraye@oscars-mbp.attlocal.net>
Closes #9613 from olarayej/SPARK-11031.
This PR makes bucketing and exchange share one common hash algorithm, so that we can guarantee the data distribution is the same between shuffle and the bucketed data source, which enables us to shuffle only one side when joining a bucketed table with a normal one.
This PR also fixes the tests that are broken by the new hash behaviour in shuffle.
Author: Wenchen Fan <wenchen@databricks.com>
Closes #10703 from cloud-fan/use-hash-expr-in-shuffle.
Add ```read.text``` and ```write.text``` for SparkR.
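A minimal usage sketch (paths hypothetical):
```r
df <- read.text(sqlContext, "path/to/input.txt") # one string column, one row per line
write.text(df, "path/to/output")
```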
cc sun-rui felixcheung shivaram
Author: Yanbo Liang <ybliang8@gmail.com>
Closes #10348 from yanboliang/spark-12393.
rxin davies shivaram
Took the save mode from my PR #10480, and moved everything to writer methods. This is related to PR #10559.
- [x] it seems jsonRDD() is broken, need to investigate - this is not a public API though; will look into it some more tonight. (fixed)
Author: felixcheung <felixcheung_m@hotmail.com>
Closes #10584 from felixcheung/rremovedeprecated.
* Changes api.r.SQLUtils to use ```SQLContext.getOrCreate``` instead of creating a new context.
* Adds a simple test
[SPARK-11199] #comment link with JIRA
Author: Hossein <hossein@databricks.com>
Closes #9185 from falaki/SPARK-11199.
`ifelse`, `when`, `otherwise` is unable to take `Column` typed S4 object as values.
For example:
```r
ifelse(lit(1) == lit(1), lit(2), lit(3))
ifelse(df$mpg > 0, df$mpg, 0)
```
will both fail with
```r
attempt to replicate an object of type 'environment'
```
The PR replaces `ifelse` calls with `if ... else ...` inside the function implementations to avoid the attempt to vectorize (i.e. `rep()`). It remains to be discussed whether we should instead support vectorization in these functions for consistency, because `ifelse` in base R is vectorized, but I cannot foresee any scenarios where these functions would need to be vectorized in SparkR.
For reference, added test cases which trigger failures:
```r
. Error: when(), otherwise() and ifelse() with column on a DataFrame ----------
error in evaluating the argument 'x' in selecting a method for function 'collect':
error in evaluating the argument 'col' in selecting a method for function 'select':
attempt to replicate an object of type 'environment'
Calls: when -> when -> ifelse -> ifelse
1: withCallingHandlers(eval(code, new_test_environment), error = capture_calls, message = function(c) invokeRestart("muffleMessage"))
2: eval(code, new_test_environment)
3: eval(expr, envir, enclos)
4: expect_equal(collect(select(df, when(df$a > 1 & df$b > 2, lit(1))))[, 1], c(NA, 1)) at test_sparkSQL.R:1126
5: expect_that(object, equals(expected, label = expected.label, ...), info = info, label = label)
6: condition(object)
7: compare(actual, expected, ...)
8: collect(select(df, when(df$a > 1 & df$b > 2, lit(1))))
Error: Test failures
Execution halted
```
Author: Forest Fang <forest.fang@outlook.com>
Closes #10481 from saurfang/spark-12526.
Add ```write.json``` and ```write.parquet``` for SparkR, and deprecate ```saveAsParquetFile```.
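A minimal usage sketch (paths hypothetical):
```r
write.json(df, "path/to/out-json")
write.parquet(df, "path/to/out-parquet")
```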
Author: Yanbo Liang <ybliang8@gmail.com>
Closes #10281 from yanboliang/spark-12310.
The existing sample functions are missing the parameter `seed`; however, the corresponding function interface in `generics` has such a parameter. Thus, although the caller can pass a `seed`, we are not using the value.
This could cause SparkR unit tests to fail. For example, I hit it in another PR:
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/47213/consoleFull
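A short sketch of what honoring the seed enables (reproducible sampling):
```r
s1 <- collect(sample(df, withReplacement = FALSE, fraction = 0.5, seed = 42))
s2 <- collect(sample(df, withReplacement = FALSE, fraction = 0.5, seed = 42))
identical(s1, s2) # expected to be TRUE once the seed is actually used
```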
Author: gatorsmile <gatorsmile@gmail.com>
Closes #10160 from gatorsmile/sampleR.
* ```jsonFile``` should support multiple input files, such as:
```R
jsonFile(sqlContext, c("path1", "path2")) # character vector as arguments
jsonFile(sqlContext, "path1,path2")
```
* Meanwhile, ```jsonFile``` has been deprecated by Spark SQL and will be removed in Spark 2.0, so we mark ```jsonFile``` deprecated and use ```read.json``` on the SparkR side.
* Replace all ```jsonFile``` calls with ```read.json``` in test_sparkSQL.R, but still keep the jsonFile test case.
* If this PR is accepted, we should also make almost the same change for ```parquetFile```.
cc felixcheung sun-rui shivaram
Author: Yanbo Liang <ybliang8@gmail.com>
Closes #10145 from yanboliang/spark-12146.
Fix a ```subset``` function error when only the ```select``` argument is set. Please refer to the [JIRA](https://issues.apache.org/jira/browse/SPARK-12234) for the error and how to reproduce it.
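A sketch of the previously failing call (column names hypothetical):
```r
# subsetting with only `select` set now works
subset(df, select = c("name", "age"))
```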
cc sun-rui felixcheung shivaram
Author: Yanbo Liang <ybliang8@gmail.com>
Closes #10217 from yanboliang/spark-12234.
Add SparkR support for ```read.parquet``` and deprecate ```parquetFile```. This change is similar to #10145 for ```jsonFile```.
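A minimal usage sketch (path hypothetical):
```r
df <- read.parquet(sqlContext, "path/to/data.parquet") # replaces parquetFile()
```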
Author: Yanbo Liang <ybliang8@gmail.com>
Closes #10191 from yanboliang/spark-12198.
This PR:
1. Suppress all known warnings.
2. Cleanup test cases and fix some errors in test cases.
3. Fix errors in HiveContext related test cases. These test cases are actually not run previously due to a bug of creating TestHiveContext.
4. Support 'testthat' package version 0.11.0 which prefers that test cases be under 'tests/testthat'
5. Make sure the default Hadoop file system is local when running test cases.
6. Turn warnings into errors.
Author: Sun Rui <rui.sun@intel.com>
Closes #10030 from sun-rui/SPARK-12034.
1, Add ```isNaN``` to ```Column``` for SparkR. ```Column``` should have three related variable functions: ```isNaN, isNull, isNotNull```.
2, Replace ```DataFrame.isNaN``` with ```DataFrame.isnan``` on the SparkR side, because ```DataFrame.isNaN``` has been deprecated and will be removed in Spark 2.0.
<del>3, Add ```isnull``` to ```DataFrame``` for SparkR. ```DataFrame``` should have two related functions: ```isnan, isnull```.</del>
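A short sketch of the three column functions together:
```r
df <- createDataFrame(sqlContext, data.frame(x = c(1, NaN)))
collect(select(df, isNaN(df$x), isNull(df$x), isNotNull(df$x)))
```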
cc shivaram sun-rui felixcheung
Author: Yanbo Liang <ybliang8@gmail.com>
Closes #10037 from yanboliang/spark-12044.
Change ```numPartitions()``` to ```getNumPartitions()``` to be consistent with Scala/Python.
<del>Note: If we cannot catch up with the 1.6 release, it will be a breaking change for 1.7 that we also need to explain in the release notes.</del>
cc sun-rui felixcheung shivaram
Author: Yanbo Liang <ybliang8@gmail.com>
Closes #10123 from yanboliang/spark-12115.
Add support for colnames, colnames<-, coltypes<-.
Also added tests for names and names<-, which had no tests previously.
I merged with PR 8984 (coltypes). Clicked the wrong thing, screwed up the PR. Recreated it here. Was #9218.
shivaram sun-rui
Author: felixcheung <felixcheung_m@hotmail.com>
Closes #9654 from felixcheung/colnamescoltypes.
Change ```cumeDist -> cume_dist, denseRank -> dense_rank, percentRank -> percent_rank, rowNumber -> row_number``` at SparkR side.
There are two reasons that we should make this change:
* We should follow the [naming convention rule of R](http://www.inside-r.org/node/230645)
* Spark DataFrame has deprecated the old convention (such as ```cumeDist```) and will remove it in Spark 2.0.
It's better to fix this issue before the 1.6 release, otherwise we will make a breaking API change.
cc shivaram sun-rui
Author: Yanbo Liang <ybliang8@gmail.com>
Closes #10016 from yanboliang/SPARK-12025.
Added tests for functions that are reported as masked, to make sure the base:: or stats:: function can be called.
For those we can't call, added them to the SparkR programming guide.
It would seem to me that `table, sample, subset, filter, cov` not working is not actually expected - I investigated/experimented with them but couldn't get them to work. It looks like, as they are defined in base or stats, they are missing the S3 generic, e.g.:
```
> methods("transform")
[1] transform,ANY-method transform.data.frame
[3] transform,DataFrame-method transform.default
see '?methods' for accessing help and source code
> methods("subset")
[1] subset.data.frame subset,DataFrame-method subset.default
[4] subset.matrix
see '?methods' for accessing help and source code
Warning message:
In .S3methods(generic.function, class, parent.frame()) :
function 'subset' appears not to be S3 generic; found functions that look like S3 methods
```
Any idea?
More information on masking:
http://www.ats.ucla.edu/stat/r/faq/referencing_objects.htm
http://www.sfu.ca/~sweldon/howTo/guide4.pdf
This is what the output doc looks like (minus css):
![image](https://cloud.githubusercontent.com/assets/8969467/11229714/2946e5de-8d4d-11e5-94b0-dda9696b6fdd.png)
Author: felixcheung <felixcheung_m@hotmail.com>
Closes #9785 from felixcheung/rmasked.
The goal of this PR is to add tests covering the issue, to ensure that it was resolved by [SPARK-11086](https://issues.apache.org/jira/browse/SPARK-11086).
Author: zero323 <matthew.szymkiewicz@gmail.com>
Closes #9743 from zero323/SPARK-11281-tests.
The basic idea is that:
The archive of the SparkR package itself, that is sparkr.zip, is created during build process and is contained in the Spark binary distribution. No change to it after the distribution is installed as the directory it resides ($SPARK_HOME/R/lib) may not be writable.
When there is R source code contained in jars or Spark packages specified with "--jars" or "--packages" command line option, a temporary directory is created by calling Utils.createTempDir() where the R packages built from the R source code will be installed. The temporary directory is writable, and won't interfere with each other when there are multiple SparkR sessions, and will be deleted when this SparkR session ends. The R binary packages installed in the temporary directory then are packed into an archive named rpkg.zip.
sparkr.zip and rpkg.zip are distributed to the cluster in YARN modes.
The distribution of rpkg.zip in Standalone modes is not supported in this PR, and will be addressed in another PR.
Various R files are updated to accept multiple lib paths (one is for the SparkR package, the other is for other R packages) so that these packages can be accessed in R.
Author: Sun Rui <rui.sun@intel.com>
Closes #9390 from sun-rui/SPARK-10500.
Use `dropFactors` column-wise instead of a nested loop when `createDataFrame` is called on a `data.frame`.
At this moment SparkR's createDataFrame uses a nested loop to convert factors to character when called on a local data.frame. It works but is incredibly slow, especially with data.table (~ 2 orders of magnitude compared to the PySpark / Pandas version on a DataFrame of size 1M rows x 2 columns).
A simple improvement is to apply `dropFactor` column-wise and then reshape the output list.
It should at least partially address [SPARK-8277](https://issues.apache.org/jira/browse/SPARK-8277).
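A rough sketch of the column-wise idea (not the actual SparkR code):
```r
# convert each factor column once, instead of element-by-element
# inside a nested loop
data[] <- lapply(data, function(col) {
  if (is.factor(col)) as.character(col) else col
})
```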
Author: zero323 <matthew.szymkiewicz@gmail.com>
Closes #9099 from zero323/SPARK-11086.
Clean out hundreds of `style: Commented code should be removed.` from lintr
Like these:
```
/opt/spark-1.6.0-bin-hadoop2.6/R/pkg/R/DataFrame.R:513:3: style: Commented code should be removed.
# sc <- sparkR.init()
^~~~~~~~~~~~~~~~~~~
/opt/spark-1.6.0-bin-hadoop2.6/R/pkg/R/DataFrame.R:514:3: style: Commented code should be removed.
# sqlContext <- sparkRSQL.init(sc)
^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
/opt/spark-1.6.0-bin-hadoop2.6/R/pkg/R/DataFrame.R:515:3: style: Commented code should be removed.
# path <- "path/to/file.json"
^~~~~~~~~~~~~~~~~~~~~~~~~~~
```
tried without export or rdname; neither works
instead, added this `#' noRd` to suppress .Rd file generation
also updated `family` for DataFrame functions for longer descriptive text instead of `dataframe_funcs`
![image](https://cloud.githubusercontent.com/assets/8969467/10933937/17bf5b1e-8291-11e5-9777-40fc632105dc.png)
this covers *most* of 'Commented code' but I left out a few that look legitimate.
Author: felixcheung <felixcheung_m@hotmail.com>
Closes #9463 from felixcheung/rlintr.
switched stddev support from DeclarativeAggregate to ImperativeAggregate.
Author: JihongMa <linlin200605@gmail.com>
Closes #9380 from JihongMA/SPARK-11420.
Checked names, none of them should conflict with anything in base
shivaram davies rxin
Author: felixcheung <felixcheung_m@hotmail.com>
Closes #9489 from felixcheung/rstddev.
Follow up to #9561. Since [SPARK-11587](https://issues.apache.org/jira/browse/SPARK-11587) has been fixed, we should compare the SparkR::glm summary result with native R output rather than a hard-coded one. mengxr
Author: Yanbo Liang <ybliang8@gmail.com>
Closes #9590 from yanboliang/glm-r-test.
This is a follow up on PR #8984, as the corresponding branch for such PR was damaged.
Author: Oscar D. Lara Yejas <olarayej@mail.usf.edu>
Closes #9579 from olarayej/SPARK-10863_NEW14.
Make sample test less flaky by setting the seed
Tested with
```
repeat { if (count(sample(df, FALSE, 0.1)) == 3) { break } }
```
Author: felixcheung <felixcheung_m@hotmail.com>
Closes #9549 from felixcheung/rsample.
Expose R-like summary statistics in SparkR::glm for linear regression; the output of ```summary``` looks like
```
$DevianceResiduals
Min Max
-0.9509607 0.7291832
$Coefficients
Estimate Std. Error t value Pr(>|t|)
(Intercept) 1.6765 0.2353597 7.123139 4.456124e-11
Sepal_Length 0.3498801 0.04630128 7.556598 4.187317e-12
Species_versicolor -0.9833885 0.07207471 -13.64402 0
Species_virginica -1.00751 0.09330565 -10.79796 0
```
Author: Yanbo Liang <ybliang8@gmail.com>
Closes #9561 from yanboliang/spark-11494.
https://issues.apache.org/jira/browse/SPARK-10116
This is really trivial, just happened to notice it -- if `XORShiftRandom.hashSeed` is really supposed to have random bits throughout (as the comment implies), it needs to do something for the conversion to `long`.
mengxr mkolod
Author: Imran Rashid <irashid@cloudera.com>
Closes #8314 from squito/SPARK-10116.
Because deparse() will break a long string into multiple lines, the deserialization will fail.
Author: Davies Liu <davies@databricks.com>
Closes #9510 from davies/fix_glm.
Like ml ```LinearRegression```, ```LogisticRegression``` should provide a training summary including feature names and their coefficients.
Author: Yanbo Liang <ybliang8@gmail.com>
Closes #9303 from yanboliang/spark-9492.
Mapping spark.driver.memory from sparkEnvir to spark-submit commandline arguments.
shivaram suggested that we possibly add other spark.driver.* properties - do we want to add all of those? I thought those could be set in SparkConf?
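A minimal sketch of the resulting behavior (memory value hypothetical):
```r
# spark.driver.memory from sparkEnvir is now mapped to spark-submit arguments
sc <- sparkR.init(master = "local[*]",
                  sparkEnvir = list(spark.driver.memory = "2g"))
```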
sun-rui
Author: felixcheung <felixcheung_m@hotmail.com>
Closes #9290 from felixcheung/rdrivermem.
SparkR should remove `.sparkRSQLsc` and `.sparkRHivesc` when `sparkR.stop()` is called. Otherwise even when SparkContext is reinitialized, `sparkRSQL.init` returns the stale copy of the object and complains:
```r
sc <- sparkR.init("local")
sqlContext <- sparkRSQL.init(sc)
sparkR.stop()
sc <- sparkR.init("local")
sqlContext <- sparkRSQL.init(sc)
sqlContext
```
producing
```r
Error in callJMethod(x, "getClass") :
Invalid jobj 1. If SparkR was restarted, Spark operations need to be re-executed.
```
I have added the check and removal only when SparkContext itself is initialized. I have also added corresponding test for this fix. Let me know if you want me to move the test to SQL test suite instead.
p.s. I tried lint-r but ended up with lots of errors in existing code.
Author: Forest Fang <forest.fang@outlook.com>
Closes #9205 from saurfang/sparkR.stop.
This PR introduces a new feature to run SQL directly on files without creating a table, for example:
```
select id from json.`path/to/json/files` as j
```
Author: Davies Liu <davies@databricks.com>
Closes #9173 from davies/source.
…2 regularization if the number of features is small
Author: lewuathe <lewuathe@me.com>
Author: Lewuathe <sasaki@treasure-data.com>
Author: Kai Sasaki <sasaki@treasure-data.com>
Author: Lewuathe <lewuathe@me.com>
Closes #8884 from Lewuathe/SPARK-10668.
I was having issues with collect() and orderBy() in Spark 1.5.0, so I used the DataFrame.R file and test_sparkSQL.R file from the Spark 1.5.1 download. I only modified the join() function in DataFrame.R to include "full", "fullouter", "left", "right", and "leftsemi", and added corresponding test cases in the tests for join() and merge() in the test_sparkSQL.R file.
Pull request because I filed this JIRA bug report:
https://issues.apache.org/jira/browse/SPARK-10981
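A hedged sketch of the added join types (the DataFrames and the key column are hypothetical):
```r
outer <- join(df1, df2, df1$id == df2$id, "fullouter")
semi  <- join(df1, df2, df1$id == df2$id, "leftsemi")
```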
Author: Monica Liu <liu.monica.f@gmail.com>
Closes #9029 from mfliu/master.
Bring the changed code up to date.
Author: Adrian Zhuang <adrian555@users.noreply.github.com>
Author: adrian555 <wzhuang@us.ibm.com>
Closes #9031 from adrian555/attach2.
as.DataFrame is a more R-style signature.
Also, I'd like to know if we could make the context, e.g. sqlContext, global, so that we do not have to specify it as an argument each time we create a DataFrame.
Author: Narine Kokhlikyan <narine.kokhlikyan@gmail.com>
Closes #8952 from NarineK/sparkrasDataFrame.
Two points in this PR:
1. The original thought was that a named R list is assumed to be a struct in SerDe. But this is problematic because some R functions will implicitly generate named lists that are not intended to be a struct when transferred by SerDe. So SerDe clients have to explicitly mark a named list as a struct by changing its class from "list" to "struct".
2. SerDe is in the Spark Core module, and data of StructType is represented as GenericRow, which is defined in the Spark SQL module. SerDe can't import GenericRow, as in the maven build the Spark SQL module depends on the Spark Core module. So this PR adds a registration hook in SerDe to allow SQLUtils in the Spark SQL module to register its functions for serialization and deserialization of StructType.
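A hypothetical illustration of point 1 (marking a named list as a struct before SerDe transfer):
```r
s <- list(a = 1L, b = "x")
class(s) <- "struct" # explicit opt-in; plain named lists stay lists
```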
Author: Sun Rui <rui.sun@intel.com>
Closes #8794 from sun-rui/SPARK-10051.
1. Add a "col" function into DataFrame.
2. Move the current "col" function in Column.R to functions.R, and convert it to an S4 function.
3. Add an S4 "column" function in functions.R.
4. Convert the "column" function in Column.R to an S4 function. This is for private use.
Author: Sun Rui <rui.sun@intel.com>
Closes #8864 from sun-rui/SPARK-10079.
[SPARK-10905][SparkR]: Export freqItems() for DataFrameStatFunctions
- Add function (together with roxygen2 doc) to DataFrame.R and generics.R
- Expose the function in NAMESPACE
- Add unit test for the function
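A minimal usage sketch (column names hypothetical):
```r
# frequent items for two columns at 1% support
freqItems(df, c("title", "gender"), support = 0.01)
```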
Author: Rerngvit Yanggratoke <rerngvit@kth.se>
Closes #8962 from rerngvit/SPARK-10905.
The sort function can be used as an alternative to arrange(...).
As arguments it accepts: x - a DataFrame; decreasing - TRUE/FALSE or a vector of orderings for the columns; and the columns to sort by, represented as string names. For example:
```r
sort(df, TRUE, "col1", "col2", "col3", "col5") # sort several columns in the same order
sort(df, decreasing = TRUE, "col1")
sort(df, decreasing = c(TRUE, FALSE), "col1", "col2")
```
Author: Narine Kokhlikyan <narine.kokhlikyan@gmail.com>
Closes #8920 from NarineK/sparkrsort.
The fix is to coerce `c("a", "b")` into a list so that it can be serialized for the JVM call.
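A short sketch of the now-working call (column names hypothetical):
```r
select(df, c("name", "age")) # the character vector is coerced to a list internally
```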
Author: felixcheung <felixcheung_m@hotmail.com>
Closes #8961 from felixcheung/rselect.
Created method as.data.frame as a synonym for collect().
Author: Oscar D. Lara Yejas <olarayej@mail.usf.edu>
Author: olarayej <oscar.lara.yejas@us.ibm.com>
Author: Oscar D. Lara Yejas <oscar.lara.yejas@us.ibm.com>
Closes #8908 from olarayej/SPARK-10807.
This integrates the Interaction feature transformer with SparkR R formula support (i.e. support `:`).
To generate reasonable ML attribute names for feature interactions, it was necessary to add the ability to read the original attribute names back from `StructField`, and also to specify custom group prefixes in `VectorAssembler`. This also has the side benefit of cleaning up the double underscores in the attributes generated for non-interaction terms.
mengxr
Author: Eric Liang <ekl@databricks.com>
Closes #8830 from ericl/interaction-2.
1. Support collecting data of MapType from DataFrame.
2. Support data of MapType in createDataFrame.
Author: Sun Rui <rui.sun@intel.com>
Closes #8711 from sun-rui/SPARK-10050.
Adding STDDEV support for DataFrame using a one-pass online/parallel algorithm to compute variance. Please review the code change.
Author: JihongMa <linlin200605@gmail.com>
Author: Jihong MA <linlin200605@gmail.com>
Author: Jihong MA <jihongma@jihongs-mbp.usca.ibm.com>
Author: Jihong MA <jihongma@Jihongs-MacBook-Pro.local>
Closes #6297 from JihongMA/SPARK-SQL.
This PR:
1. Enhances reflection in RBackend, automatically matching a Java array to a Scala Seq when finding methods. Util functions like seq() and listToSeq() on the R side can be removed, as they would conflict with the SerDe logic that transfers a Scala Seq to the R side.
2. Enhances the SerDe to support transferring a Scala Seq to the R side. Data of ArrayType in a DataFrame after collection is observed to be of Scala Seq type.
3. Supports ArrayType in createDataFrame().
Author: Sun Rui <rui.sun@intel.com>
Closes #8458 from sun-rui/SPARK-10049.
Spark gives an error message and does not show the output when a field of the result DataFrame contains CJK characters.
I changed SerDe.scala so that Spark supports Unicode characters when writing a string to R.
Author: CHOIJAEHONG <redrock07@naver.com>
Closes #7494 from CHOIJAEHONG1/SPARK-8951.
Add subset and transform
Also reorganize `[` & `[[` to subset instead of select
Note: for transform, transform is very similar to mutate. Spark doesn't seem to replace an existing column with the same name in mutate (i.e. `mutate(df, age = df$age + 2)` returns a DataFrame with 2 columns named 'age'), so transform does not do that for now either.
Though it is clearly stated that it should replace a column with a matching name (should I open a JIRA for mutate/transform?).
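A hedged usage sketch (column names hypothetical):
```r
subset(df, df$age > 21, select = c("name", "age"))
df[df$age > 21, c("name", "age")] # `[` now routes to subset
transform(df, agePlusTwo = df$age + 2)
```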
Author: felixcheung <felixcheung_m@hotmail.com>
Closes #8503 from felixcheung/rsubset_transform.
This PR:
1. supports transferring an arbitrary nested array from the JVM to the R side in SerDe;
2. based on 1, improves the collect() implementation. Now it can support collecting data of complex types from a DataFrame.
Author: Sun Rui <rui.sun@intel.com>
Closes #8276 from sun-rui/SPARK-10048.
I added lots of Column functions to SparkR. I also added `rand(seed: Int)` and `randn(seed: Int)` in Scala, since we need such APIs for the R integer type.
### JIRA
[[SPARK-9856] Add expression functions into SparkR whose params are complicated - ASF JIRA](https://issues.apache.org/jira/browse/SPARK-9856)
Author: Yu ISHIKAWA <yuu.ishikawa@gmail.com>
Closes #8264 from yu-iskw/SPARK-9856-3.