Add ```write.json``` and ```write.parquet``` for SparkR, and deprecated ```saveAsParquetFile```.
Author: Yanbo Liang <ybliang8@gmail.com>
Closes#10281 from yanboliang/spark-12310.
The existing sample functions miss the parameter `seed`, however, the corresponding function interface in `generics` has such a parameter. Thus, although the function caller can call the function with the 'seed', we are not using the value.
This could cause SparkR unit tests failed. For example, I hit it in another PR:
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/47213/consoleFull
Author: gatorsmile <gatorsmile@gmail.com>
Closes#10160 from gatorsmile/sampleR.
* ```jsonFile``` should support multiple input files, such as:
```R
jsonFile(sqlContext, c(“path1”, “path2”)) # character vector as arguments
jsonFile(sqlContext, “path1,path2”)
```
* Meanwhile, ```jsonFile``` has been deprecated by Spark SQL and will be removed at Spark 2.0. So we mark ```jsonFile``` deprecated and use ```read.json``` at SparkR side.
* Replace all ```jsonFile``` with ```read.json``` at test_sparkSQL.R, but still keep jsonFile test case.
* If this PR is accepted, we should also make almost the same change for ```parquetFile```.
cc felixcheung sun-rui shivaram
Author: Yanbo Liang <ybliang8@gmail.com>
Closes#10145 from yanboliang/spark-12146.
Fix ```subset``` function error when only set ```select``` argument. Please refer to the [JIRA](https://issues.apache.org/jira/browse/SPARK-12234) about the error and how to reproduce it.
cc sun-rui felixcheung shivaram
Author: Yanbo Liang <ybliang8@gmail.com>
Closes#10217 from yanboliang/spark-12234.
SparkR support ```read.parquet``` and deprecate ```parquetFile```. This change is similar with #10145 for ```jsonFile```.
Author: Yanbo Liang <ybliang8@gmail.com>
Closes#10191 from yanboliang/spark-12198.
This PR:
1. Suppress all known warnings.
2. Cleanup test cases and fix some errors in test cases.
3. Fix errors in HiveContext related test cases. These test cases are actually not run previously due to a bug of creating TestHiveContext.
4. Support 'testthat' package version 0.11.0 which prefers that test cases be under 'tests/testthat'
5. Make sure the default Hadoop file system is local when running test cases.
6. Turn on warnings into errors.
Author: Sun Rui <rui.sun@intel.com>
Closes#10030 from sun-rui/SPARK-12034.
1, Add ```isNaN``` to ```Column``` for SparkR. ```Column``` should has three related variable functions: ```isNaN, isNull, isNotNull```.
2, Replace ```DataFrame.isNaN``` with ```DataFrame.isnan``` at SparkR side. Because ```DataFrame.isNaN``` has been deprecated and will be removed at Spark 2.0.
<del>3, Add ```isnull``` to ```DataFrame``` for SparkR. ```DataFrame``` should has two related functions: ```isnan, isnull```.<del>
cc shivaram sun-rui felixcheung
Author: Yanbo Liang <ybliang8@gmail.com>
Closes#10037 from yanboliang/spark-12044.
Change ```numPartitions()``` to ```getNumPartitions()``` to be consistent with Scala/Python.
<del>Note: If we can not catch up with 1.6 release, it will be breaking change for 1.7 that we also need to explain in release note.<del>
cc sun-rui felixcheung shivaram
Author: Yanbo Liang <ybliang8@gmail.com>
Closes#10123 from yanboliang/spark-12115.
Add support for for colnames, colnames<-, coltypes<-
Also added tests for names, names<- which have no test previously.
I merged with PR 8984 (coltypes). Clicked the wrong thing, crewed up the PR. Recreated it here. Was #9218
shivaram sun-rui
Author: felixcheung <felixcheung_m@hotmail.com>
Closes#9654 from felixcheung/colnamescoltypes.
Change ```cumeDist -> cume_dist, denseRank -> dense_rank, percentRank -> percent_rank, rowNumber -> row_number``` at SparkR side.
There are two reasons that we should make this change:
* We should follow the [naming convention rule of R](http://www.inside-r.org/node/230645)
* Spark DataFrame has deprecated the old convention (such as ```cumeDist```) and will remove it in Spark 2.0.
It's better to fix this issue before 1.6 release, otherwise we will make breaking API change.
cc shivaram sun-rui
Author: Yanbo Liang <ybliang8@gmail.com>
Closes#10016 from yanboliang/SPARK-12025.
Fix use of aliases and changes uses of rdname and seealso
`aliases` is the hint for `?` - it should not be linked to some other name - those should be seealso
https://cran.r-project.org/web/packages/roxygen2/vignettes/rd.html
Clean up usage on family, as multiple use of family with the same rdname is causing duplicated See Also html blocks (like http://spark.apache.org/docs/latest/api/R/count.html)
Also changing some rdname for dplyr-like variant for better R user visibility in R doc, eg. rbind, summary, mutate, summarize
shivaram yanboliang
Author: felixcheung <felixcheung_m@hotmail.com>
Closes#9750 from felixcheung/rdocaliases.
Added tests for function that are reported as masked, to make sure the base:: or stats:: function can be called.
For those we can't call, added them to SparkR programming guide.
It would seem to me `table, sample, subset, filter, cov` not working are not actually expected - I investigated/experimented with them but couldn't get them to work. It looks like as they are defined in base or stats they are missing the S3 generic, eg.
```
> methods("transform")
[1] transform,ANY-method transform.data.frame
[3] transform,DataFrame-method transform.default
see '?methods' for accessing help and source code
> methods("subset")
[1] subset.data.frame subset,DataFrame-method subset.default
[4] subset.matrix
see '?methods' for accessing help and source code
Warning message:
In .S3methods(generic.function, class, parent.frame()) :
function 'subset' appears not to be S3 generic; found functions that look like S3 methods
```
Any idea?
More information on masking:
http://www.ats.ucla.edu/stat/r/faq/referencing_objects.htmhttp://www.sfu.ca/~sweldon/howTo/guide4.pdf
This is what the output doc looks like (minus css):
![image](https://cloud.githubusercontent.com/assets/8969467/11229714/2946e5de-8d4d-11e5-94b0-dda9696b6fdd.png)
Author: felixcheung <felixcheung_m@hotmail.com>
Closes#9785 from felixcheung/rmasked.
This PR includes:
* Update SparkR:::glm, SparkR:::summary API docs.
* Update SparkR machine learning user guide and example codes to show:
* supporting feature interaction in R formula.
* summary for gaussian GLM model.
* coefficients for binomial GLM model.
mengxr
Author: Yanbo Liang <ybliang8@gmail.com>
Closes#9727 from yanboliang/spark-11684.
The goal of this PR is to add tests covering the issue to ensure that is was resolved by [SPARK-11086](https://issues.apache.org/jira/browse/SPARK-11086).
Author: zero323 <matthew.szymkiewicz@gmail.com>
Closes#9743 from zero323/SPARK-11281-tests.
The bug described at [SPARK-11755](https://issues.apache.org/jira/browse/SPARK-11755), after exporting ```predict``` we can both get the help information from the SparkR and base R package like the following:
```Java
> help(predict)
Help on topic ‘predict’ was found in the following packages:
Package Library
SparkR /Users/yanboliang/data/trunk2/spark/R/lib
stats /Library/Frameworks/R.framework/Versions/3.2/Resources/library
Choose one
1: Make predictions from a model {SparkR}
2: Model Predictions {stats}
```
Author: Yanbo Liang <ybliang8@gmail.com>
Closes#9732 from yanboliang/spark-11755.
The basic idea is that:
The archive of the SparkR package itself, that is sparkr.zip, is created during build process and is contained in the Spark binary distribution. No change to it after the distribution is installed as the directory it resides ($SPARK_HOME/R/lib) may not be writable.
When there is R source code contained in jars or Spark packages specified with "--jars" or "--packages" command line option, a temporary directory is created by calling Utils.createTempDir() where the R packages built from the R source code will be installed. The temporary directory is writable, and won't interfere with each other when there are multiple SparkR sessions, and will be deleted when this SparkR session ends. The R binary packages installed in the temporary directory then are packed into an archive named rpkg.zip.
sparkr.zip and rpkg.zip are distributed to the cluster in YARN modes.
The distribution of rpkg.zip in Standalone modes is not supported in this PR, and will be address in another PR.
Various R files are updated to accept multiple lib paths (one is for SparkR package, the other is for other R packages) so that these package can be accessed in R.
Author: Sun Rui <rui.sun@intel.com>
Closes#9390 from sun-rui/SPARK-10500.
Use `dropFactors` column-wise instead of nested loop when `createDataFrame` from a `data.frame`
At this moment SparkR createDataFrame is using nested loop to convert factors to character when called on a local data.frame. It works but is incredibly slow especially with data.table (~ 2 orders of magnitude compared to PySpark / Pandas version on a DateFrame of size 1M rows x 2 columns).
A simple improvement is to apply `dropFactor `column-wise and then reshape output list.
It should at least partially address [SPARK-8277](https://issues.apache.org/jira/browse/SPARK-8277).
Author: zero323 <matthew.szymkiewicz@gmail.com>
Closes#9099 from zero323/SPARK-11086.
Clean out hundreds of `style: Commented code should be removed.` from lintr
Like these:
```
/opt/spark-1.6.0-bin-hadoop2.6/R/pkg/R/DataFrame.R:513:3: style: Commented code should be removed.
# sc <- sparkR.init()
^~~~~~~~~~~~~~~~~~~
/opt/spark-1.6.0-bin-hadoop2.6/R/pkg/R/DataFrame.R:514:3: style: Commented code should be removed.
# sqlContext <- sparkRSQL.init(sc)
^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
/opt/spark-1.6.0-bin-hadoop2.6/R/pkg/R/DataFrame.R:515:3: style: Commented code should be removed.
# path <- "path/to/file.json"
^~~~~~~~~~~~~~~~~~~~~~~~~~~
```
tried without export or rdname, neither work
instead, added this `#' noRd` to suppress .Rd file generation
also updated `family` for DataFrame functions for longer descriptive text instead of `dataframe_funcs`
![image](https://cloud.githubusercontent.com/assets/8969467/10933937/17bf5b1e-8291-11e5-9777-40fc632105dc.png)
this covers *most* of 'Commented code' but I left out a few that looks legitimate.
Author: felixcheung <felixcheung_m@hotmail.com>
Closes#9463 from felixcheung/rlintr.
switched stddev support from DeclarativeAggregate to ImperativeAggregate.
Author: JihongMa <linlin200605@gmail.com>
Closes#9380 from JihongMA/SPARK-11420.
Checked names, none of them should conflict with anything in base
shivaram davies rxin
Author: felixcheung <felixcheung_m@hotmail.com>
Closes#9489 from felixcheung/rstddev.
Follow up #9561. Due to [SPARK-11587](https://issues.apache.org/jira/browse/SPARK-11587) has been fixed, we should compare SparkR::glm summary result with native R output rather than hard-code one. mengxr
Author: Yanbo Liang <ybliang8@gmail.com>
Closes#9590 from yanboliang/glm-r-test.
This is a follow up on PR #8984, as the corresponding branch for such PR was damaged.
Author: Oscar D. Lara Yejas <olarayej@mail.usf.edu>
Closes#9579 from olarayej/SPARK-10863_NEW14.
https://issues.apache.org/jira/browse/SPARK-9830
This PR contains the following main changes.
* Removing `AggregateExpression1`.
* Removing `Aggregate` operator, which is used to evaluate `AggregateExpression1`.
* Removing planner rule used to plan `Aggregate`.
* Linking `MultipleDistinctRewriter` to analyzer.
* Renaming `AggregateExpression2` to `AggregateExpression` and `AggregateFunction2` to `AggregateFunction`.
* Updating places where we create aggregate expression. The way to create aggregate expressions is `AggregateExpression(aggregateFunction, mode, isDistinct)`.
* Changing `val`s in `DeclarativeAggregate`s that touch children of this function to `lazy val`s (when we create aggregate expression in DataFrame API, children of an aggregate function can be unresolved).
Author: Yin Huai <yhuai@databricks.com>
Closes#9556 from yhuai/removeAgg1.
Make sample test less flaky by setting the seed
Tested with
```
repeat { if (count(sample(df, FALSE, 0.1)) == 3) { break } }
```
Author: felixcheung <felixcheung_m@hotmail.com>
Closes#9549 from felixcheung/rsample.
Expose R-like summary statistics in SparkR::glm for linear regression, the output of ```summary``` like
```Java
$DevianceResiduals
Min Max
-0.9509607 0.7291832
$Coefficients
Estimate Std. Error t value Pr(>|t|)
(Intercept) 1.6765 0.2353597 7.123139 4.456124e-11
Sepal_Length 0.3498801 0.04630128 7.556598 4.187317e-12
Species_versicolor -0.9833885 0.07207471 -13.64402 0
Species_virginica -1.00751 0.09330565 -10.79796 0
```
Author: Yanbo Liang <ybliang8@gmail.com>
Closes#9561 from yanboliang/spark-11494.
https://issues.apache.org/jira/browse/SPARK-10116
This is really trivial, just happened to notice it -- if `XORShiftRandom.hashSeed` is really supposed to have random bits throughout (as the comment implies), it needs to do something for the conversion to `long`.
mengxr mkolod
Author: Imran Rashid <irashid@cloudera.com>
Closes#8314 from squito/SPARK-10116.
Because deparse() will break the long string into multiple lines, the deserialization will fail
Author: Davies Liu <davies@databricks.com>
Closes#9510 from davies/fix_glm.
Like ml ```LinearRegression```, ```LogisticRegression``` should provide a training summary including feature names and their coefficients.
Author: Yanbo Liang <ybliang8@gmail.com>
Closes#9303 from yanboliang/spark-9492.
Mapping spark.driver.memory from sparkEnvir to spark-submit commandline arguments.
shivaram suggested that we possibly add other spark.driver.* properties - do we want to add all of those? I thought those could be set in SparkConf?
sun-rui
Author: felixcheung <felixcheung_m@hotmail.com>
Closes#9290 from felixcheung/rdrivermem.
SparkR glm currently support :
```formula, family = c(“gaussian”, “binomial”), data, lambda = 0, alpha = 0```
We should also support setting standardize which has been defined at [design documentation](https://docs.google.com/document/d/10NZNSEurN2EdWM31uFYsgayIPfCFHiuIu3pCWrUmP_c/edit)
Author: Yanbo Liang <ybliang8@gmail.com>
Closes#9331 from yanboliang/spark-11369.
Add examples for read.df, write.df; fix grouping for read.df, loadDF; fix formatting and text truncation for write.df, saveAsTable.
Several text issues:
![image](https://cloud.githubusercontent.com/assets/8969467/10708590/1303a44e-79c3-11e5-854f-3a2e16854cd7.png)
- text collapsed into a single paragraph
- text truncated at 2 places, eg. "overwrite: Existing data is expected to be overwritten by the contents of error:"
shivaram
Author: felixcheung <felixcheung_m@hotmail.com>
Closes#9261 from felixcheung/rdocreadwritedf.
SparkR should remove `.sparkRSQLsc` and `.sparkRHivesc` when `sparkR.stop()` is called. Otherwise even when SparkContext is reinitialized, `sparkRSQL.init` returns the stale copy of the object and complains:
```r
sc <- sparkR.init("local")
sqlContext <- sparkRSQL.init(sc)
sparkR.stop()
sc <- sparkR.init("local")
sqlContext <- sparkRSQL.init(sc)
sqlContext
```
producing
```r
Error in callJMethod(x, "getClass") :
Invalid jobj 1. If SparkR was restarted, Spark operations need to be re-executed.
```
I have added the check and removal only when SparkContext itself is initialized. I have also added corresponding test for this fix. Let me know if you want me to move the test to SQL test suite instead.
p.s. I tried lint-r but ended up a lots of errors on existing code.
Author: Forest Fang <forest.fang@outlook.com>
Closes#9205 from saurfang/sparkR.stop.