## What changes were proposed in this pull request?
Move all existing tests to a non-installed directory so that they are never run from an installed SparkR package
For a follow-up PR:
- remove all skip_on_cran() calls in tests
- clean up test timer
- improve or change basic tests that do run on CRAN (suggestions welcome)
It looks like `R CMD build pkg` still puts pkg\tests (i.e. the full tests) into the source package, but `R CMD INSTALL` on that source package does not install these tests (and so `R CMD check` does not run them)
## How was this patch tested?
- [x] unit tests, Jenkins
- [x] AppVeyor
- [x] make a source package, install it, `R CMD check` it - verify the full tests are not installed or run
Author: Felix Cheung <felixcheung_m@hotmail.com>
Closes #18264 from felixcheung/rtestset.
## What changes were proposed in this pull request?
to investigate how long they run
## How was this patch tested?
Jenkins, AppVeyor
Author: Felix Cheung <felixcheung_m@hotmail.com>
Closes #18104 from felixcheung/rtimetest.
## What changes were proposed in this pull request?
This change skips tests that use the Hadoop libraries while running
on CRAN check with Windows as the operating system. This is to handle
cases where the Hadoop winutils binaries are missing on the target
system. The skipped tests consist of
1. Tests that save and load a model in MLlib
2. Tests that save and load CSV, JSON and Parquet files in SQL
3. Hive tests
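The skip condition described above can be sketched in base R as follows. This is only an illustration: the helper name `should_skip_hadoop_tests` and its arguments are hypothetical, and the `NOT_CRAN` environment variable is assumed as the CRAN indicator (the convention used by testthat's `skip_on_cran()`); the actual patch may check these conditions differently.

```r
# Hypothetical helper (not taken from the patch): decide whether tests that
# need the Hadoop libraries should be skipped. Arguments default to the
# current environment so the logic can also be exercised with explicit values.
should_skip_hadoop_tests <- function(sysname = Sys.info()[["sysname"]],
                                     hadoop_home = Sys.getenv("HADOOP_HOME"),
                                     not_cran = Sys.getenv("NOT_CRAN")) {
  on_cran <- !identical(not_cran, "true")   # assumption: NOT_CRAN=true means "not CRAN"
  on_windows <- identical(sysname, "Windows")
  winutils_missing <- !nzchar(hadoop_home)  # assumption: unset HADOOP_HOME means no winutils
  on_cran && on_windows && winutils_missing
}

# Example: a CRAN check on Windows with HADOOP_HOME unset.
should_skip_hadoop_tests(sysname = "Windows", hadoop_home = "", not_cran = "")
```

Inside a test, such a helper would typically guard a call to testthat's `skip()` so that the affected save/load and Hive tests are reported as skipped rather than failed.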
## How was this patch tested?
Tested by running on a local Windows VM with HADOOP_HOME unset. Also tested with https://win-builder.r-project.org
Author: Shivaram Venkataraman <shivaram@cs.berkeley.edu>
Closes #17966 from shivaram/sparkr-windows-cran.
## What changes were proposed in this pull request?
- [x] need to test by running R CMD check --as-cran
- [x] sanity check vignettes
## How was this patch tested?
Jenkins
Author: Felix Cheung <felixcheung_m@hotmail.com>
Closes #17945 from felixcheung/rchangesforpackage.
## What changes were proposed in this pull request?
General rule on whether to skip a test:
skip if
- RDD tests
- tests that could run long or are complicated (streaming, hivecontext)
- tests on error conditions
- tests that are unlikely to change or break
## How was this patch tested?
unit tests, `R CMD check --as-cran`, `R CMD check`
Author: Felix Cheung <felixcheung_m@hotmail.com>
Closes #17817 from felixcheung/rskiptest.
## What changes were proposed in this pull request?
When Kmeans is run with initMode = "random" and some random seeds, the actual number of clusters may not equal the configured `k`.
In this case, summary(model) returns an error because the number of columns of the coefficient matrix does not equal k.
Example:
> col1 <- c(1, 2, 3, 4, 0, 1, 2, 3, 4, 0)
> col2 <- c(1, 2, 3, 4, 0, 1, 2, 3, 4, 0)
> col3 <- c(1, 2, 3, 4, 0, 1, 2, 3, 4, 0)
> cols <- as.data.frame(cbind(col1, col2, col3))
> df <- createDataFrame(cols)
>
> model2 <- spark.kmeans(data = df, ~ ., k = 5, maxIter = 10, initMode = "random", seed = 22222, tol = 1E-5)
>
> summary(model2)
Error in `colnames<-`(`*tmp*`, value = c("col1", "col2", "col3")) :
length of 'dimnames' [2] not equal to array extent
In addition: Warning message:
In matrix(coefficients, ncol = k) :
data length [9] is not a sub-multiple or multiple of the number of rows [2]
Fix: Get the actual cluster size in the summary and use it to build the coefficient matrix.
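The failure and the fix can be reproduced in base R without Spark. The sketch below mirrors the `matrix(coefficients, ncol = k)` call from the warning above; the helper name `build_centers` and the sample values are illustrative assumptions, not the patch's code.

```r
# Hypothetical helper: reshape a flat vector of cluster centers into a
# (clusters x features) matrix and label its rows and columns.
build_centers <- function(coefficients, features, k) {
  centers <- t(matrix(coefficients, ncol = k))  # one column per cluster, then transpose
  colnames(centers) <- features
  rownames(centers) <- seq_len(k)
  centers
}

features <- c("col1", "col2", "col3")
# Made-up values: 3 clusters of 3 features each (9 numbers),
# even though k = 5 was requested.
coefficients <- c(0, 0, 0, 1.5, 1.5, 1.5, 3.5, 3.5, 3.5)

# Using the configured k = 5 fails when the column names are assigned.
bad <- try(suppressWarnings(build_centers(coefficients, features, k = 5)),
           silent = TRUE)
inherits(bad, "try-error")

# Using the actual number of clusters gives a well-formed matrix.
# Here the count is inferred from the vector length; the fix described
# above reads it from the model summary instead.
actual_k <- length(coefficients) / length(features)
centers <- build_centers(coefficients, features, k = actual_k)
dim(centers)
```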
## How was this patch tested?
Add unit tests.
Author: wm624@hotmail.com <wm624@hotmail.com>
Closes #16666 from wangmiao1981/kmeans.
## What changes were proposed in this pull request?
The `coefficients` component in the model summary should be a 'matrix', but the underlying structure is actually a list. This affects several models; 'AFTSurvivalRegressionModel' is the exception and has the correct implementation. The fix is to first `unlist` the coefficients returned from `callJMethod` before converting them to a matrix. An example illustrates the issue:
```
data(iris)
df <- createDataFrame(iris)
model <- spark.glm(df, Sepal_Length ~ Sepal_Width, family = "gaussian")
s <- summary(model)
> str(s$coefficients)
List of 8
$ : num 6.53
$ : num -0.223
$ : num 0.479
$ : num 0.155
$ : num 13.6
$ : num -1.44
$ : num 0
$ : num 0.152
- attr(*, "dim")= int [1:2] 2 4
- attr(*, "dimnames")=List of 2
..$ : chr [1:2] "(Intercept)" "Sepal_Width"
..$ : chr [1:4] "Estimate" "Std. Error" "t value" "Pr(>|t|)"
> s$coefficients[, 2]
$`(Intercept)`
[1] 0.4788963
$Sepal_Width
[1] 0.1550809
```
This shows that the underlying structure of coefficients is still `list`.
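The same behavior can be reproduced in base R, which also shows why calling `unlist` first resolves it. In this sketch the values are copied from the output above, and the plain list `raw` stands in for whatever `callJMethod` returns; that stand-in is an assumption made for illustration.

```r
# Stand-in for the coefficients returned from the JVM: a list of numbers.
raw <- list(6.53, -0.223, 0.479, 0.155, 13.6, -1.44, 0, 0.152)
dn <- list(c("(Intercept)", "Sepal_Width"),
           c("Estimate", "Std. Error", "t value", "Pr(>|t|)"))

# Without unlist: matrix() keeps the list, so extracted columns are lists too.
as_list <- matrix(raw, ncol = 4, dimnames = dn)
is.list(as_list)
is.list(as_list[, 2])

# With unlist first: a plain numeric matrix, so columns are numeric vectors.
as_numeric <- matrix(unlist(raw), ncol = 4, dimnames = dn)
is.numeric(as_numeric)
as_numeric[, "Std. Error"]
```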
felixcheung wangmiao1981
Author: actuaryzhang <actuaryzhang10@gmail.com>
Closes #16730 from actuaryzhang/sparkRCoef.
## What changes were proposed in this pull request?
Add R wrapper for bisecting Kmeans.
As JIRA is down, I will update the title to link to the corresponding JIRA later.
## How was this patch tested?
Add new unit tests.
Author: wm624@hotmail.com <wm624@hotmail.com>
Closes #16566 from wangmiao1981/bk.
## What changes were proposed in this pull request?
`spark.gaussianMixture` now outputs the total log-likelihood of the model, like R's `mvnormalmixEM`.
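For reference, the total log-likelihood of a Gaussian mixture is the sum, over observations, of the log of the weighted component densities. Below is a minimal univariate sketch in base R with made-up data and parameters; `mixture_loglik` is a hypothetical helper, not part of SparkR, and the real model works on multivariate data.

```r
# Hypothetical helper: total log-likelihood of a 1-D Gaussian mixture.
mixture_loglik <- function(x, weights, means, sds) {
  # dens[i, j] = weights[j] * N(x[i] | means[j], sds[j])
  dens <- vapply(seq_along(weights),
                 function(j) weights[j] * dnorm(x, means[j], sds[j]),
                 numeric(length(x)))
  dens <- matrix(dens, nrow = length(x))  # keep matrix shape when length(x) == 1
  sum(log(rowSums(dens)))
}

# Made-up observations and a two-component mixture.
x <- c(-1.2, -0.8, 0.1, 2.9, 3.4)
mixture_loglik(x, weights = c(0.6, 0.4), means = c(0, 3), sds = c(1, 1))
```

With a single component of weight 1, this reduces to the ordinary Gaussian log-likelihood `sum(dnorm(x, mean, sd, log = TRUE))`.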
## How was this patch tested?
R unit test.
Author: Yanbo Liang <ybliang8@gmail.com>
Closes #16646 from yanboliang/spark-19291.
## What changes were proposed in this pull request?
spark.lda passes the optimizer "em" or "online" as a string to the backend. However, LDAWrapper doesn't set the optimizer based on the value from R. Therefore, for optimizer "em", the `isDistributed` field is FALSE, although it should be TRUE according to the Scala code.
In addition, the `summary` method should bring back the results related to `DistributedLDAModel`.
## How was this patch tested?
Manually tested by comparing with the Scala example.
Modified the current unit test: fixed the incorrect unit test and added the necessary tests for the `summary` method.
Author: wm624@hotmail.com <wm624@hotmail.com>
Closes #16464 from wangmiao1981/new.
## What changes were proposed in this pull request?
spark.kmeans doesn't have an interface to set initSteps, seed and tol. As the Spark Kmeans algorithm doesn't take the same set of parameters as R kmeans, we should maintain a different interface in spark.kmeans.
Add the missing parameters and the corresponding documentation.
Modified existing unit tests to take additional parameters.
Author: wm624@hotmail.com <wm624@hotmail.com>
Closes #16523 from wangmiao1981/kmeans.
## What changes were proposed in this pull request?
SparkR `mllib.R` is getting bigger as we add more ML wrappers. I'd like to split it into multiple files to make it easier to maintain:
* mllib_classification.R
* mllib_clustering.R
* mllib_recommendation.R
* mllib_regression.R
* mllib_stat.R
* mllib_tree.R
* mllib_utils.R
Note: Only reorg, no actual code change.
## How was this patch tested?
Existing tests.
Author: Yanbo Liang <ybliang8@gmail.com>
Closes #16312 from yanboliang/spark-18862.