ODIn/spark-instrumented-optimizer

Author	SHA1	Message	Date
Hossein	e371040a01	[SPARK-17811] SparkR cannot parallelize data.frame with NA or NULL in Date columns ## What changes were proposed in this pull request? NA date values are serialized as "NA" and NA time values are serialized as NaN from R. In the backend we did not have proper logic to deal with them. As a result we got an IllegalArgumentException for Date and wrong value for time. This PR adds support for deserializing NA as Date and Time. ## How was this patch tested? * [x] TODO Author: Hossein <hossein@databricks.com> Closes #15421 from falaki/SPARK-17811.	2016-10-21 12:38:52 -07:00
Felix Cheung	e21e1c946c	[SPARK-18013][SPARKR] add crossJoin API ## What changes were proposed in this pull request? Add crossJoin and do not default to cross join if joinExpr is left out ## How was this patch tested? unit test Author: Felix Cheung <felixcheung_m@hotmail.com> Closes #15559 from felixcheung/rcrossjoin.	2016-10-21 12:35:37 -07:00
Felix Cheung	3180272d2d	[SPARKR] fix warnings ## What changes were proposed in this pull request? Fix for a bunch of test warnings that were added recently. We need to investigate why warnings are not turning into errors. ``` Warnings ----------------------------------------------------------------------- 1. createDataFrame uses files for large objects (test_sparkSQL.R#215) - Use Sepal_Length instead of Sepal.Length as column name 2. createDataFrame uses files for large objects (test_sparkSQL.R#215) - Use Sepal_Width instead of Sepal.Width as column name 3. createDataFrame uses files for large objects (test_sparkSQL.R#215) - Use Petal_Length instead of Petal.Length as column name 4. createDataFrame uses files for large objects (test_sparkSQL.R#215) - Use Petal_Width instead of Petal.Width as column name Consider adding importFrom("utils", "object.size") to your NAMESPACE file. ``` ## How was this patch tested? unit tests Author: Felix Cheung <felixcheung_m@hotmail.com> Closes #15560 from felixcheung/rwarnings.	2016-10-20 21:12:55 -07:00
Hossein	5cc503f4fe	[SPARK-17790][SPARKR] Support for parallelizing R data.frame larger than 2GB ## What changes were proposed in this pull request? If the R data structure that is being parallelized is larger than `INT_MAX` we use files to transfer data to JVM. The serialization protocol mimics Python pickling. This allows us to simply call `PythonRDD.readRDDFromFile` to create the RDD. I tested this on my MacBook. Following code works with this patch: ```R intMax <- .Machine$integer.max largeVec <- 1:intMax rdd <- SparkR:::parallelize(sc, largeVec, 2) ``` ## How was this patch tested? * [x] Unit tests Author: Hossein <hossein@databricks.com> Closes #15375 from falaki/SPARK-17790.	2016-10-12 10:32:38 -07:00
Wenchen Fan	b9a147181d	[SPARK-17720][SQL] introduce static SQL conf ## What changes were proposed in this pull request? SQLConf is session-scoped and mutable. However, we do have the requirement for a static SQL conf, which is global and immutable, e.g. the `schemaStringThreshold` in `HiveExternalCatalog`, the flag to enable/disable hive support, the global temp view database in https://github.com/apache/spark/pull/14897. Actually we've already implemented static SQL conf implicitly via `SparkConf`, this PR just make it explicit and expose it to users, so that they can see the config value via SQL command or `SparkSession.conf`, and forbid users to set/unset static SQL conf. ## How was this patch tested? new tests in SQLConfSuite Author: Wenchen Fan <wenchen@databricks.com> Closes #15295 from cloud-fan/global-conf.	2016-10-11 20:27:08 -07:00
Yanbo Liang	23405f324a	[SPARK-15153][ML][SPARKR] Fix SparkR spark.naiveBayes error when label is numeric type ## What changes were proposed in this pull request? Fix SparkR ```spark.naiveBayes``` error when response variable of dataset is numeric type. See details and how to reproduce this bug at [SPARK-15153](https://issues.apache.org/jira/browse/SPARK-15153). ## How was this patch tested? Add unit test. Author: Yanbo Liang <ybliang8@gmail.com> Closes #15431 from yanboliang/spark-15153-2.	2016-10-11 12:41:35 -07:00
hyukjinkwon	9d8ae853ec	[SPARK-17665][SPARKR] Support options/mode all for read/write APIs and options in other types ## What changes were proposed in this pull request? This PR includes the changes below: - Support `mode`/`options` in `read.parquet`, `write.parquet`, `read.orc`, `write.orc`, `read.text`, `write.text`, `read.json` and `write.json` APIs - Support other types (logical, numeric and string) as options for `write.df`, `read.df`, `read.parquet`, `write.parquet`, `read.orc`, `write.orc`, `read.text`, `write.text`, `read.json` and `write.json` ## How was this patch tested? Unit tests in `test_sparkSQL.R`/ `utils.R`. Author: hyukjinkwon <gurwls223@gmail.com> Closes #15239 from HyukjinKwon/SPARK-17665.	2016-10-07 11:34:49 -07:00
hyukjinkwon	c9fe10d4ed	[SPARK-17658][SPARKR] read.df/write.df API taking path optionally in SparkR ## What changes were proposed in this pull request? `write.df`/`read.df` API require path which is not actually always necessary in Spark. Currently, it only affects the datasources implementing `CreatableRelationProvider`. Currently, Spark currently does not have internal data sources implementing this but it'd affect other external datasources. In addition we'd be able to use this way in Spark's JDBC datasource after https://github.com/apache/spark/pull/12601 is merged. Before - `read.df` ```r > read.df(source = "json") Error in dispatchFunc("read.df(path = NULL, source = NULL, schema = NULL, ...)", : argument "x" is missing with no default ``` ```r > read.df(path = c(1, 2)) Error in dispatchFunc("read.df(path = NULL, source = NULL, schema = NULL, ...)", : argument "x" is missing with no default ``` ```r > read.df(c(1, 2)) Error in invokeJava(isStatic = TRUE, className, methodName, ...) : java.lang.ClassCastException: java.lang.Double cannot be cast to java.lang.String at org.apache.spark.sql.execution.datasources.DataSource.hasMetadata(DataSource.scala:300) at ... In if (is.na(object)) { : ... ``` - `write.df` ```r > write.df(df, source = "json") Error in (function (classes, fdef, mtable) : unable to find an inherited method for function ‘write.df’ for signature ‘"function", "missing"’ ``` ```r > write.df(df, source = c(1, 2)) Error in (function (classes, fdef, mtable) : unable to find an inherited method for function ‘write.df’ for signature ‘"SparkDataFrame", "missing"’ ``` ```r > write.df(df, mode = TRUE) Error in (function (classes, fdef, mtable) : unable to find an inherited method for function ‘write.df’ for signature ‘"SparkDataFrame", "missing"’ ``` After - `read.df` ```r > read.df(source = "json") Error in loadDF : analysis error - Unable to infer schema for JSON at . It must be specified manually; ``` ```r > read.df(path = c(1, 2)) Error in f(x, ...) : path should be charactor, null or omitted. ``` ```r > read.df(c(1, 2)) Error in f(x, ...) : path should be charactor, null or omitted. ``` - `write.df` ```r > write.df(df, source = "json") Error in save : illegal argument - 'path' is not specified ``` ```r > write.df(df, source = c(1, 2)) Error in .local(df, path, ...) : source should be charactor, null or omitted. It is 'parquet' by default. ``` ```r > write.df(df, mode = TRUE) Error in .local(df, path, ...) : mode should be charactor or omitted. It is 'error' by default. ``` ## How was this patch tested? Unit tests in `test_sparkSQL.R` Author: hyukjinkwon <gurwls223@gmail.com> Closes #15231 from HyukjinKwon/write-default-r.	2016-10-04 22:58:43 -07:00
Yanbo Liang	93c743f1ac	[SPARK-17577][FOLLOW-UP][SPARKR] SparkR spark.addFile supports adding directory recursively ## What changes were proposed in this pull request? #15140 exposed ```JavaSparkContext.addFile(path: String, recursive: Boolean)``` to Python/R, then we can update SparkR ```spark.addFile``` to support adding directory recursively. ## How was this patch tested? Added unit test. Author: Yanbo Liang <ybliang8@gmail.com> Closes #15216 from yanboliang/spark-17577-2.	2016-09-26 16:47:57 -07:00
WeichenXu	f89808b0fd	[SPARK-17499][SPARKR][ML][MLLIB] make the default params in sparkR spark.mlp consistent with MultilayerPerceptronClassifier ## What changes were proposed in this pull request? update `MultilayerPerceptronClassifierWrapper.fit` paramter type: `layers: Array[Int]` `seed: String` update several default params in sparkR `spark.mlp`: `tol` --> 1e-6 `stepSize` --> 0.03 `seed` --> NULL ( when seed == NULL, the scala-side wrapper regard it as a `null` value and the seed will use the default one ) r-side `seed` only support 32bit integer. remove `layers` default value, and move it in front of those parameters with default value. add `layers` parameter validation check. ## How was this patch tested? tests added. Author: WeichenXu <WeichenXu123@outlook.com> Closes #15051 from WeichenXu123/update_py_mlp_default.	2016-09-23 11:14:22 -07:00
Yanbo Liang	6902edab7e	[SPARK-17315][FOLLOW-UP][SPARKR][ML] Fix print of Kolmogorov-Smirnov test summary ## What changes were proposed in this pull request? #14881 added Kolmogorov-Smirnov Test wrapper to SparkR. I found that ```print.summary.KSTest``` was implemented inappropriately and result in no effect. Running the following code for KSTest: ```Scala data <- data.frame(test = c(0.1, 0.15, 0.2, 0.3, 0.25, -1, -0.5)) df <- createDataFrame(data) testResult <- spark.kstest(df, "test", "norm") summary(testResult) ``` Before this PR: ![image](https://cloud.githubusercontent.com/assets/1962026/18615016/b9a2823a-7d4f-11e6-934b-128beade355e.png) After this PR: ![image](https://cloud.githubusercontent.com/assets/1962026/18615014/aafe2798-7d4f-11e6-8b99-c705bb9fe8f2.png) The new implementation is similar with [```print.summary.GeneralizedLinearRegressionModel```](https://github.com/apache/spark/blob/master/R/pkg/R/mllib.R#L284) of SparkR and [```print.summary.glm```](https://svn.r-project.org/R/trunk/src/library/stats/R/glm.R) of native R. BTW, I removed the comparison of ```print.summary.KSTest``` in unit test, since it's only wrappers of the summary output which has been checked. Another reason is that these comparison will output summary information to the test console, it will make the test output in a mess. ## How was this patch tested? Existing test. Author: Yanbo Liang <ybliang8@gmail.com> Closes #15139 from yanboliang/spark-17315.	2016-09-21 20:14:18 -07:00
Yanbo Liang	c133907c5d	[SPARK-17577][SPARKR][CORE] SparkR support add files to Spark job and get by executors ## What changes were proposed in this pull request? Scala/Python users can add files to Spark job by submit options ```--files``` or ```SparkContext.addFile()```. Meanwhile, users can get the added file by ```SparkFiles.get(filename)```. We should also support this function for SparkR users, since they also have the requirements for some shared dependency files. For example, SparkR users can download third party R packages to driver firstly, add these files to the Spark job as dependency by this API and then each executor can install these packages by ```install.packages```. ## How was this patch tested? Add unit test. Author: Yanbo Liang <ybliang8@gmail.com> Closes #15131 from yanboliang/spark-17577.	2016-09-21 20:08:28 -07:00
Xin Ren	71b7d42f5f	[SPARK-16445][MLLIB][SPARKR] Fix @return description for sparkR mlp summary() method ## What changes were proposed in this pull request? Fix summary() method's `return` description for spark.mlp ## How was this patch tested? Ran tests locally on my laptop. Author: Xin Ren <iamshrek@126.com> Closes #15015 from keypointt/SPARK-16445-2.	2016-09-10 09:52:53 -07:00
Felix Cheung	f0d21b7f90	[SPARK-17442][SPARKR] Additional arguments in write.df are not passed to data source ## What changes were proposed in this pull request? additional options were not passed down in write.df. ## How was this patch tested? unit tests falaki shivaram Author: Felix Cheung <felixcheung_m@hotmail.com> Closes #15010 from felixcheung/testreadoptions.	2016-09-08 08:22:58 -07:00
hyukjinkwon	6b41195bca	[SPARK-17339][SPARKR][CORE] Fix some R tests and use Path.toUri in SparkContext for Windows paths in SparkR ## What changes were proposed in this pull request? This PR fixes the Windows path issues in several APIs. Please refer https://issues.apache.org/jira/browse/SPARK-17339 for more details. ## How was this patch tested? Tests via AppVeyor CI - https://ci.appveyor.com/project/HyukjinKwon/spark/build/82-SPARK-17339-fix-r Also, manually, ![2016-09-06 3 14 38](https://cloud.githubusercontent.com/assets/6477701/18263406/b93a98be-7444-11e6-9521-b28ee65a4771.png) Author: hyukjinkwon <gurwls223@gmail.com> Closes #14960 from HyukjinKwon/SPARK-17339.	2016-09-07 19:24:03 +09:00
Clark Fitzgerald	9fccde4ff8	[SPARK-16785] R dapply doesn't return array or raw columns ## What changes were proposed in this pull request? Fixed bug in `dapplyCollect` by changing the `compute` function of `worker.R` to explicitly handle raw (binary) vectors. cc shivaram ## How was this patch tested? Unit tests Author: Clark Fitzgerald <clarkfitzg@gmail.com> Closes #14783 from clarkfitzg/SPARK-16785.	2016-09-06 23:40:37 -07:00
Junyang Qian	abb2f92103	[SPARK-17315][SPARKR] Kolmogorov-Smirnov test SparkR wrapper ## What changes were proposed in this pull request? This PR tries to add Kolmogorov-Smirnov Test wrapper to SparkR. This wrapper implementation only supports one sample test against normal distribution. ## How was this patch tested? R unit test. Author: Junyang Qian <junyangq@databricks.com> Closes #14881 from junyangq/SPARK-17315.	2016-09-03 12:26:30 -07:00
Felix Cheung	812333e433	[SPARK-17376][SPARKR] Spark version should be available in R ## What changes were proposed in this pull request? Add sparkR.version() API. ``` > sparkR.version() [1] "2.1.0-SNAPSHOT" ``` ## How was this patch tested? manual, unit tests Author: Felix Cheung <felixcheung_m@hotmail.com> Closes #14935 from felixcheung/rsparksessionversion.	2016-09-02 10:12:10 -07:00
wm624@hotmail.com	0f30cdedbd	[SPARK-16883][SPARKR] SQL decimal type is not properly cast to number when collecting SparkDataFrame ## What changes were proposed in this pull request? (Please fill in changes proposed in this fix) registerTempTable(createDataFrame(iris), "iris") str(collect(sql("select cast('1' as double) as x, cast('2' as decimal) as y from iris limit 5"))) 'data.frame': 5 obs. of 2 variables: $ x: num 1 1 1 1 1 $ y:List of 5 ..$ : num 2 ..$ : num 2 ..$ : num 2 ..$ : num 2 ..$ : num 2 The problem is that spark returns `decimal(10, 0)` col type, instead of `decimal`. Thus, `decimal(10, 0)` is not handled correctly. It should be handled as "double". As discussed in JIRA thread, we can have two potential fixes: 1). Scala side fix to add a new case when writing the object back; However, I can't use spark.sql.types._ in Spark core due to dependency issues. I don't find a way of doing type case match; 2). SparkR side fix: Add a helper function to check special type like `"decimal(10, 0)"` and replace it with `double`, which is PRIMITIVE type. This special helper is generic for adding new types handling in the future. I open this PR to discuss pros and cons of both approaches. If we want to do Scala side fix, we need to find a way to match the case of DecimalType and StructType in Spark Core. ## How was this patch tested? (Please explain how this patch was tested. E.g. unit tests, integration tests, manual tests) Manual test: > str(collect(sql("select cast('1' as double) as x, cast('2' as decimal) as y from iris limit 5"))) 'data.frame': 5 obs. of 2 variables: $ x: num 1 1 1 1 1 $ y: num 2 2 2 2 2 R Unit tests Author: wm624@hotmail.com <wm624@hotmail.com> Closes #14613 from wangmiao1981/type.	2016-09-02 01:47:17 -07:00
Xin Ren	7a5000f39e	[SPARK-17241][SPARKR][MLLIB] SparkR spark.glm should have configurable regularization parameter https://issues.apache.org/jira/browse/SPARK-17241 ## What changes were proposed in this pull request? Spark has configurable L2 regularization parameter for generalized linear regression. It is very important to have them in SparkR so that users can run ridge regression. ## How was this patch tested? Test manually on local laptop. Author: Xin Ren <iamshrek@126.com> Closes #14856 from keypointt/SPARK-17241.	2016-08-31 21:39:31 -07:00
Shivaram Venkataraman	2f9c27364e	[SPARK-16581][SPARKR] Fix JVM API tests in SparkR ## What changes were proposed in this pull request? Remove cleanup.jobj test. Use JVM wrapper API for other test cases. ## How was this patch tested? Run R unit tests with testthat 1.0 Author: Shivaram Venkataraman <shivaram@cs.berkeley.edu> Closes #14904 from shivaram/sparkr-jvm-tests-fix.	2016-08-31 16:56:41 -07:00
hyukjinkwon	50bb142332	[SPARK-17326][SPARKR] Fix tests with HiveContext in SparkR not to be skipped always ## What changes were proposed in this pull request? Currently, `HiveContext` in SparkR is not being tested and always skipped. This is because the initiation of `TestHiveContext` is being failed due to trying to load non-existing data paths (test tables). This is introduced from https://github.com/apache/spark/pull/14005 This enables the tests with SparkR. ## How was this patch tested? Manually, Before (on Mac OS) ``` ... Skipped ------------------------------------------------------------------------ 1. create DataFrame from RDD (test_sparkSQL.R#200) - Hive is not build with SparkSQL, skipped 2. test HiveContext (test_sparkSQL.R#1041) - Hive is not build with SparkSQL, skipped 3. read/write ORC files (test_sparkSQL.R#1748) - Hive is not build with SparkSQL, skipped 4. enableHiveSupport on SparkSession (test_sparkSQL.R#2480) - Hive is not build with SparkSQL, skipped 5. sparkJars tag in SparkContext (test_Windows.R#21) - This test is only for Windows, skipped ... ``` After (on Mac OS) ``` ... Skipped ------------------------------------------------------------------------ 1. sparkJars tag in SparkContext (test_Windows.R#21) - This test is only for Windows, skipped ... ``` Please refer the tests below (on Windows) - Before: https://ci.appveyor.com/project/HyukjinKwon/spark/build/45-test123 - After: https://ci.appveyor.com/project/HyukjinKwon/spark/build/46-test123 Author: hyukjinkwon <gurwls223@gmail.com> Closes #14889 from HyukjinKwon/SPARK-17326.	2016-08-31 14:02:21 -07:00
Shivaram Venkataraman	736a7911cb	[SPARK-16581][SPARKR] Make JVM backend calling functions public ## What changes were proposed in this pull request? This change exposes a public API in SparkR to create objects, call methods on the Spark driver JVM ## How was this patch tested? (Please explain how this patch was tested. E.g. unit tests, integration tests, manual tests) Unit tests, CRAN checks Author: Shivaram Venkataraman <shivaram@cs.berkeley.edu> Closes #14775 from shivaram/sparkr-java-api.	2016-08-29 12:55:32 -07:00
Xin Ren	2fbdb60639	[SPARK-16445][MLLIB][SPARKR] Multilayer Perceptron Classifier wrapper in SparkR https://issues.apache.org/jira/browse/SPARK-16445 ## What changes were proposed in this pull request? Create Multilayer Perceptron Classifier wrapper in SparkR ## How was this patch tested? Tested manually on local machine Author: Xin Ren <iamshrek@126.com> Closes #14447 from keypointt/SPARK-16445.	2016-08-24 11:18:10 -07:00
Felix Cheung	0583ecda1b	[SPARK-17173][SPARKR] R MLlib refactor, cleanup, reformat, fix deprecation in test ## What changes were proposed in this pull request? refactor, cleanup, reformat, fix deprecation in test ## How was this patch tested? unit tests, manual tests Author: Felix Cheung <felixcheung_m@hotmail.com> Closes #14735 from felixcheung/rmllibutil.	2016-08-22 12:27:33 -07:00
Yanbo Liang	7f08a60b6e	[SPARK-16961][FOLLOW-UP][SPARKR] More robust test case for spark.gaussianMixture. ## What changes were proposed in this pull request? #14551 fixed off-by-one bug in ```randomizeInPlace``` and some test failure caused by this fix. But for SparkR ```spark.gaussianMixture``` test case, the fix is inappropriate. It only changed the output result of native R which should be compared by SparkR, however, it did not change the R code in annotation which is used for reproducing the result in native R. It will confuse users who can not reproduce the same result in native R. This PR sends a more robust test case which can produce same result between SparkR and native R. ## How was this patch tested? Unit test update. Author: Yanbo Liang <ybliang8@gmail.com> Closes #14730 from yanboliang/spark-16961-followup.	2016-08-21 02:23:31 -07:00
Junyang Qian	acac7a508a	[SPARK-16443][SPARKR] Alternating Least Squares (ALS) wrapper ## What changes were proposed in this pull request? Add Alternating Least Squares wrapper in SparkR. Unit tests have been updated. ## How was this patch tested? SparkR unit tests. (If this patch involves UI changes, please attach a screenshot; otherwise, remove this) ![screen shot 2016-07-27 at 3 50 31 pm](https://cloud.githubusercontent.com/assets/15318264/17195347/f7a6352a-5411-11e6-8e21-61a48070192a.png) ![screen shot 2016-07-27 at 3 50 46 pm](https://cloud.githubusercontent.com/assets/15318264/17195348/f7a7d452-5411-11e6-845f-6d292283bc28.png) Author: Junyang Qian <junyangq@databricks.com> Closes #14384 from junyangq/SPARK-16443.	2016-08-19 14:24:09 -07:00
Nick Lavers	5377fc6236	[SPARK-16961][CORE] Fixed off-by-one error that biased randomizeInPlace JIRA issue link: https://issues.apache.org/jira/browse/SPARK-16961 Changed one line of Utils.randomizeInPlace to allow elements to stay in place. Created a unit test that runs a Pearson's chi squared test to determine whether the output diverges significantly from a uniform distribution. Author: Nick Lavers <nick.lavers@videoamp.com> Closes #14551 from nicklavers/SPARK-16961-randomizeInPlace.	2016-08-19 10:11:59 +01:00
Xusen Yin	b72bb62d42	[SPARK-16447][ML][SPARKR] LDA wrapper in SparkR ## What changes were proposed in this pull request? Add LDA Wrapper in SparkR with the following interfaces: - spark.lda(data, ...) - spark.posterior(object, newData, ...) - spark.perplexity(object, ...) - summary(object) - write.ml(object) - read.ml(path) ## How was this patch tested? Test with SparkR unit test. Author: Xusen Yin <yinxusen@gmail.com> Closes #14229 from yinxusen/SPARK-16447.	2016-08-18 05:33:52 -07:00
Yanbo Liang	4d92af310a	[SPARK-16446][SPARKR][ML] Gaussian Mixture Model wrapper in SparkR ## What changes were proposed in this pull request? Gaussian Mixture Model wrapper in SparkR, similarly to R's ```mvnormalmixEM```. ## How was this patch tested? Unit test. Author: Yanbo Liang <ybliang8@gmail.com> Closes #14392 from yanboliang/spark-16446.	2016-08-17 11:18:33 -07:00
wm624@hotmail.com	363793f2bf	[SPARK-16444][SPARKR] Isotonic Regression wrapper in SparkR ## What changes were proposed in this pull request? (Please fill in changes proposed in this fix) Add Isotonic Regression wrapper in SparkR Wrappers in R and Scala are added. Unit tests Documentation ## How was this patch tested? Manually tested with sudo ./R/run-tests.sh (Please explain how this patch was tested. E.g. unit tests, integration tests, manual tests) Author: wm624@hotmail.com <wm624@hotmail.com> Closes #14182 from wangmiao1981/isoR.	2016-08-17 06:15:04 -07:00
Felix Cheung	c34b546d67	[SPARK-16519][SPARKR] Handle SparkR RDD generics that create warnings in R CMD check ## What changes were proposed in this pull request? Rename RDD functions for now to avoid CRAN check warnings. Some RDD functions are sharing generics with DataFrame functions (hence the problem) so after the renames we need to add new generics, for now. ## How was this patch tested? unit tests Author: Felix Cheung <felixcheung_m@hotmail.com> Closes #14626 from felixcheung/rrddfunctions.	2016-08-16 11:19:18 -07:00
Junyang Qian	214ba66a03	[SPARK-16579][SPARKR] add install.spark function ## What changes were proposed in this pull request? Add an install_spark function to the SparkR package. User can run `install_spark()` to install Spark to a local directory within R. Updates: Several changes have been made: - `install.spark()` - check existence of tar file in the cache folder, and download only if not found - trial priority of mirror_url look-up: user-provided -> preferred mirror site from apache website -> hardcoded backup option - use 2.0.0 - `sparkR.session()` - can install spark when not found in `SPARK_HOME` ## How was this patch tested? Manual tests, running the check-cran.sh script added in #14173. Author: Junyang Qian <junyangq@databricks.com> Closes #14258 from junyangq/SPARK-16579.	2016-08-10 11:18:23 -07:00
Yanbo Liang	d4a9122430	[SPARK-16710][SPARKR][ML] spark.glm should support weightCol ## What changes were proposed in this pull request? Training GLMs on weighted dataset is very important use cases, but it is not supported by SparkR currently. Users can pass argument ```weights``` to specify the weights vector in native R. For ```spark.glm```, we can pass in the ```weightCol``` which is consistent with MLlib. ## How was this patch tested? Unit test. Author: Yanbo Liang <ybliang8@gmail.com> Closes #14346 from yanboliang/spark-16710.	2016-08-10 10:53:48 -07:00
Shivaram Venkataraman	fc23263623	[SPARK-10683][SPARK-16510][SPARKR] Move SparkR include jar test to SparkSubmitSuite ## What changes were proposed in this pull request? This change moves the include jar test from R to SparkSubmitSuite and uses a dynamically compiled jar. This helps us remove the binary jar from the R package and solves both the CRAN warnings and the lack of source being available for this jar. ## How was this patch tested? SparkR unit tests, SparkSubmitSuite, check-cran.sh Author: Shivaram Venkataraman <shivaram@cs.berkeley.edu> Closes #14243 from shivaram/sparkr-jar-move.	2016-07-19 19:28:08 -07:00
Felix Cheung	d27fe9ba67	[SPARK-16027][SPARKR] Fix R tests SparkSession init/stop ## What changes were proposed in this pull request? Fix R SparkSession init/stop, and warnings of reusing existing Spark Context ## How was this patch tested? unit tests shivaram Author: Felix Cheung <felixcheung_m@hotmail.com> Closes #14177 from felixcheung/rsessiontest.	2016-07-17 19:02:21 -07:00
Felix Cheung	611a8ca589	[SPARK-16538][SPARKR] Add more tests for namespace call to SparkSession functions ## What changes were proposed in this pull request? More tests I don't think this is critical for Spark 2.0.0 RC, maybe Spark 2.0.1 or 2.1.0. ## How was this patch tested? unit tests shivaram dongjoon-hyun Author: Felix Cheung <felixcheung_m@hotmail.com> Closes #14206 from felixcheung/rroutetests.	2016-07-15 13:58:57 -07:00
Felix Cheung	12005c88fb	[SPARK-16538][SPARKR] fix R call with namespace operator on SparkSession functions ## What changes were proposed in this pull request? Fix function routing to work with and without namespace operator `SparkR::createDataFrame` ## How was this patch tested? manual, unit tests shivaram Author: Felix Cheung <felixcheung_m@hotmail.com> Closes #14195 from felixcheung/rroutedefault.	2016-07-14 09:45:30 -07:00
Sun Rui	093ebbc628	[SPARK-16509][SPARKR] Rename window.partitionBy and window.orderBy to windowPartitionBy and windowOrderBy. ## What changes were proposed in this pull request? Rename window.partitionBy and window.orderBy to windowPartitionBy and windowOrderBy to pass CRAN package check. ## How was this patch tested? SparkR unit tests. Author: Sun Rui <sunrui2016@gmail.com> Closes #14192 from sun-rui/SPARK-16509.	2016-07-14 09:38:42 -07:00
Felix Cheung	fb2e8eeb0b	[SPARKR][DOCS][MINOR] R programming guide to include csv data source example ## What changes were proposed in this pull request? Minor documentation update for code example, code style, and missed reference to "sparkR.init" ## How was this patch tested? manual shivaram Author: Felix Cheung <felixcheung_m@hotmail.com> Closes #14178 from felixcheung/rcsvprogrammingguide.	2016-07-13 15:09:23 -07:00
Felix Cheung	b4baf086ca	[SPARKR][MINOR] R examples and test updates ## What changes were proposed in this pull request? Minor example updates ## How was this patch tested? manual shivaram Author: Felix Cheung <felixcheung_m@hotmail.com> Closes #14171 from felixcheung/rexample.	2016-07-13 13:33:34 -07:00
Dongjoon Hyun	142df4834b	[SPARK-16429][SQL] Include `StringType` columns in `describe()` ## What changes were proposed in this pull request? Currently, Spark `describe` supports `StringType`. However, `describe()` returns a dataset for only all numeric columns. This PR aims to include `StringType` columns in `describe()`, `describe` without argument. Background ```scala scala> spark.read.json("examples/src/main/resources/people.json").describe("age", "name").show() +-------+------------------+-------+ \|summary\| age\| name\| +-------+------------------+-------+ \| count\| 2\| 3\| \| mean\| 24.5\| null\| \| stddev\|7.7781745930520225\| null\| \| min\| 19\| Andy\| \| max\| 30\|Michael\| +-------+------------------+-------+ ``` Before ```scala scala> spark.read.json("examples/src/main/resources/people.json").describe().show() +-------+------------------+ \|summary\| age\| +-------+------------------+ \| count\| 2\| \| mean\| 24.5\| \| stddev\|7.7781745930520225\| \| min\| 19\| \| max\| 30\| +-------+------------------+ ``` After ```scala scala> spark.read.json("examples/src/main/resources/people.json").describe().show() +-------+------------------+-------+ \|summary\| age\| name\| +-------+------------------+-------+ \| count\| 2\| 3\| \| mean\| 24.5\| null\| \| stddev\|7.7781745930520225\| null\| \| min\| 19\| Andy\| \| max\| 30\|Michael\| +-------+------------------+-------+ ``` ## How was this patch tested? Pass the Jenkins with a update testcase. Author: Dongjoon Hyun <dongjoon@apache.org> Closes #14095 from dongjoon-hyun/SPARK-16429.	2016-07-08 14:36:50 -07:00
Dongjoon Hyun	6aa7d09f4e	[SPARK-16425][R] `describe()` should not fail with non-numeric columns ## What changes were proposed in this pull request? This PR prevents ERRORs when `summary(df)` is called for `SparkDataFrame` with not-numeric columns. This failure happens only in `SparkR`. Before ```r > df <- createDataFrame(faithful) > df <- withColumn(df, "boolean", df$waiting==79) > summary(df) 16/07/07 14:15:16 ERROR RBackendHandler: describe on 34 failed Error in invokeJava(isStatic = FALSE, objId$id, methodName, ...) : org.apache.spark.sql.AnalysisException: cannot resolve 'avg(`boolean`)' due to data type mismatch: function average requires numeric types, not BooleanType; ``` After ```r > df <- createDataFrame(faithful) > df <- withColumn(df, "boolean", df$waiting==79) > summary(df) SparkDataFrame[summary:string, eruptions:string, waiting:string] ``` ## How was this patch tested? Pass the Jenkins with a updated testcase. Author: Dongjoon Hyun <dongjoon@apache.org> Closes #14096 from dongjoon-hyun/SPARK-16425.	2016-07-07 17:47:29 -07:00
Felix Cheung	f4767bcc7a	[SPARK-16310][SPARKR] R na.string-like default for csv source ## What changes were proposed in this pull request? Apply default "NA" as null string for R, like R read.csv na.string parameter. https://stat.ethz.ch/R-manual/R-devel/library/utils/html/read.table.html na.strings = "NA" An user passing a csv file with NA value should get the same behavior with SparkR read.df(... source = "csv") (couldn't open JIRA, will do that later) ## How was this patch tested? unit tests shivaram Author: Felix Cheung <felixcheung_m@hotmail.com> Closes #13984 from felixcheung/rcsvnastring.	2016-07-07 15:21:57 -07:00
Dongjoon Hyun	d17e5f2f12	[SPARK-16233][R][TEST] ORC test should be enabled only when HiveContext is available. ## What changes were proposed in this pull request? ORC test should be enabled only when HiveContext is available. ## How was this patch tested? Manual. ``` $ R/run-tests.sh ... 1. create DataFrame from RDD (test_sparkSQL.R#200) - Hive is not build with SparkSQL, skipped 2. test HiveContext (test_sparkSQL.R#1021) - Hive is not build with SparkSQL, skipped 3. read/write ORC files (test_sparkSQL.R#1728) - Hive is not build with SparkSQL, skipped 4. enableHiveSupport on SparkSession (test_sparkSQL.R#2448) - Hive is not build with SparkSQL, skipped 5. sparkJars tag in SparkContext (test_Windows.R#21) - This test is only for Windows, skipped DONE =========================================================================== Tests passed. ``` Author: Dongjoon Hyun <dongjoon@apache.org> Closes #14019 from dongjoon-hyun/SPARK-16233.	2016-07-01 15:35:19 -07:00
Narine Kokhlikyan	26afb4ce40	[SPARK-16012][SPARKR] Implement gapplyCollect which will apply a R function on each group similar to gapply and collect the result back to R data.frame ## What changes were proposed in this pull request? gapplyCollect() does gapply() on a SparkDataFrame and collect the result back to R. Compared to gapply() + collect(), gapplyCollect() offers performance optimization as well as programming convenience, as no schema is needed to be provided. This is similar to dapplyCollect(). ## How was this patch tested? Added test cases for gapplyCollect similar to dapplyCollect Author: Narine Kokhlikyan <narine@slice.com> Closes #13760 from NarineK/gapplyCollect.	2016-07-01 13:55:13 -07:00
Dongjoon Hyun	46395db80e	[SPARK-16289][SQL] Implement posexplode table generating function ## What changes were proposed in this pull request? This PR implements `posexplode` table generating function. Currently, master branch raises the following exception for `map` argument. It's different from Hive. Before ```scala scala> sql("select posexplode(map('a', 1, 'b', 2))").show org.apache.spark.sql.AnalysisException: No handler for Hive UDF ... posexplode() takes an array as a parameter; line 1 pos 7 ``` After ```scala scala> sql("select posexplode(map('a', 1, 'b', 2))").show +---+---+-----+ \|pos\|key\|value\| +---+---+-----+ \| 0\| a\| 1\| \| 1\| b\| 2\| +---+---+-----+ ``` For `array` argument, `after` is the same with `before`. ``` scala> sql("select posexplode(array(1, 2, 3))").show +---+---+ \|pos\|col\| +---+---+ \| 0\| 1\| \| 1\| 2\| \| 2\| 3\| +---+---+ ``` ## How was this patch tested? Pass the Jenkins tests with newly added testcases. Author: Dongjoon Hyun <dongjoon@apache.org> Closes #13971 from dongjoon-hyun/SPARK-16289.	2016-06-30 12:03:54 -07:00
Felix Cheung	823518c2b5	[SPARKR] add csv tests ## What changes were proposed in this pull request? Add unit tests for csv data for SPARKR ## How was this patch tested? unit tests Author: Felix Cheung <felixcheung_m@hotmail.com> Closes #13904 from felixcheung/rcsv.	2016-06-28 17:08:28 -07:00
Prashant Sharma	f6b497fcdd	[SPARK-16128][SQL] Allow setting length of characters to be truncated to, in Dataset.show function. ## What changes were proposed in this pull request? Allowing truncate to a specific number of character is convenient at times, especially while operating from the REPL. Sometimes those last few characters make all the difference, and showing everything brings in whole lot of noise. ## How was this patch tested? Existing tests. + 1 new test in DataFrameSuite. For SparkR and pyspark, existing tests and manual testing. Author: Prashant Sharma <prashsh1@in.ibm.com> Author: Prashant Sharma <prashant@apache.org> Closes #13839 from ScrapCodes/add_truncateTo_DF.show.	2016-06-28 17:11:06 +05:30
Felix Cheung	30b182bcc0	[SPARK-16184][SPARKR] conf API for SparkSession ## What changes were proposed in this pull request? Add `conf` method to get Runtime Config from SparkSession ## How was this patch tested? unit tests, manual tests This is how it works in sparkR shell: ``` SparkSession available as 'spark'. > conf() $hive.metastore.warehouse.dir [1] "file:/opt/spark-2.0.0-bin-hadoop2.6/R/spark-warehouse" $spark.app.id [1] "local-1466749575523" $spark.app.name [1] "SparkR" $spark.driver.host [1] "10.0.2.1" $spark.driver.port [1] "45629" $spark.executorEnv.LD_LIBRARY_PATH [1] "$LD_LIBRARY_PATH:/usr/lib/R/lib:/usr/lib/x86_64-linux-gnu:/usr/lib/jvm/default-java/jre/lib/amd64/server" $spark.executor.id [1] "driver" $spark.home [1] "/opt/spark-2.0.0-bin-hadoop2.6" $spark.master [1] "local[]" $spark.sql.catalogImplementation [1] "hive" $spark.submit.deployMode [1] "client" > conf("spark.master") $spark.master [1] "local[]" ``` Author: Felix Cheung <felixcheung_m@hotmail.com> Closes #13885 from felixcheung/rconf.	2016-06-26 13:10:43 -07:00

1 2 3 4 5

242 commits