ODIn/spark-instrumented-optimizer

Author	SHA1	Message	Date
wangmiao1981	a7b46c627b	[SPARK-20307][SPARKR] SparkR: pass on setHandleInvalid to spark.mllib functions that use StringIndexer ## What changes were proposed in this pull request? For randomForest classifier, if test data contains unseen labels, it will throw an error. The StringIndexer already has the handleInvalid logic. The patch add a new method to set the underlying StringIndexer handleInvalid logic. This patch should also apply to other classifiers. This PR focuses on the main logic and randomForest classifier. I will do follow-up PR for other classifiers. ## How was this patch tested? Add a new unit test based on the error case in the JIRA. Author: wangmiao1981 <wm624@hotmail.com> Closes #18496 from wangmiao1981/handle.	2017-07-07 23:51:32 -07:00
actuaryzhang	e9a93f8140	[SPARK-20889][SPARKR][FOLLOWUP] Clean up grouped doc for column methods ## What changes were proposed in this pull request? Add doc for methods that were left out, and fix various style and consistency issues. Author: actuaryzhang <actuaryzhang10@gmail.com> Closes #18493 from actuaryzhang/sparkRDocCleanup.	2017-07-04 21:05:05 -07:00
actuaryzhang	cec3921504	[SPARK-20889][SPARKR] Grouped documentation for WINDOW column methods ## What changes were proposed in this pull request? Grouped documentation for column window methods. Author: actuaryzhang <actuaryzhang10@gmail.com> Closes #18481 from actuaryzhang/sparkRDocWindow.	2017-07-04 12:18:51 -07:00
actuaryzhang	52981715bb	[SPARK-20889][SPARKR] Grouped documentation for COLLECTION column methods ## What changes were proposed in this pull request? Grouped documentation for column collection methods. Author: actuaryzhang <actuaryzhang10@gmail.com> Author: Wayne Zhang <actuaryzhang10@gmail.com> Closes #18458 from actuaryzhang/sparkRDocCollection.	2017-06-29 23:00:50 -07:00
actuaryzhang	fddb63f463	[SPARK-20889][SPARKR] Grouped documentation for MISC column methods ## What changes were proposed in this pull request? Grouped documentation for column misc methods. Author: actuaryzhang <actuaryzhang10@gmail.com> Author: Wayne Zhang <actuaryzhang10@gmail.com> Closes #18448 from actuaryzhang/sparkRDocMisc.	2017-06-29 21:35:01 -07:00
actuaryzhang	a2d5623548	[SPARK-20889][SPARKR] Grouped documentation for NONAGGREGATE column methods ## What changes were proposed in this pull request? Grouped documentation for nonaggregate column methods. Author: actuaryzhang <actuaryzhang10@gmail.com> Author: Wayne Zhang <actuaryzhang10@gmail.com> Closes #18422 from actuaryzhang/sparkRDocNonAgg.	2017-06-29 01:23:13 -07:00
Felix Cheung	fc92d25f2a	Revert "[SPARK-21094][R] Terminate R's worker processes in the parent of R's daemon to prevent a leak" This reverts commit `6b3d02285e`.	2017-06-28 20:06:29 -07:00
hyukjinkwon	db44f5f3e8	[SPARK-21224][R] Specify a schema by using a DDL-formatted string when reading in R ## What changes were proposed in this pull request? This PR proposes to support a DDL-formetted string as schema as below: ```r mockLines <- c("{\"name\":\"Michael\"}", "{\"name\":\"Andy\", \"age\":30}", "{\"name\":\"Justin\", \"age\":19}") jsonPath <- tempfile(pattern = "sparkr-test", fileext = ".tmp") writeLines(mockLines, jsonPath) df <- read.df(jsonPath, "json", "name STRING, age DOUBLE") collect(df) ``` ## How was this patch tested? Tests added in `test_streaming.R` and `test_sparkSQL.R` and manual tests. Author: hyukjinkwon <gurwls223@gmail.com> Closes #18431 from HyukjinKwon/r-ddl-schema.	2017-06-28 19:36:00 -07:00
actuaryzhang	376d90d556	[SPARK-20889][SPARKR] Grouped documentation for STRING column methods ## What changes were proposed in this pull request? Grouped documentation for string column methods. Author: actuaryzhang <actuaryzhang10@gmail.com> Author: Wayne Zhang <actuaryzhang10@gmail.com> Closes #18366 from actuaryzhang/sparkRDocString.	2017-06-28 19:31:54 -07:00
actuaryzhang	e793bf248b	[SPARK-20889][SPARKR] Grouped documentation for MATH column methods ## What changes were proposed in this pull request? Grouped documentation for math column methods. Author: actuaryzhang <actuaryzhang10@gmail.com> Author: Wayne Zhang <actuaryzhang10@gmail.com> Closes #18371 from actuaryzhang/sparkRDocMath.	2017-06-27 23:15:45 -07:00
hyukjinkwon	6b3d02285e	[SPARK-21093][R] Terminate R's worker processes in the parent of R's daemon to prevent a leak ## What changes were proposed in this pull request? `mcfork` in R looks opening a pipe ahead but the existing logic does not properly close it when it is executed hot. This leads to the failure of more forking due to the limit for number of files open. This hot execution looks particularly for `gapply`/`gapplyCollect`. For unknown reason, this happens more easily in CentOS and could be reproduced in Mac too. All the details are described in https://issues.apache.org/jira/browse/SPARK-21093 This PR proposes simply to terminate R's worker processes in the parent of R's daemon to prevent a leak. ## How was this patch tested? I ran the codes below on both CentOS and Mac with that configuration disabled/enabled. ```r df <- createDataFrame(list(list(1L, 1, "1", 0.1)), c("a", "b", "c", "d")) collect(gapply(df, "a", function(key, x) { x }, schema(df))) collect(gapply(df, "a", function(key, x) { x }, schema(df))) ... # 30 times ``` Also, now it passes R tests on CentOS as below: ``` SparkSQL functions: Spark package found in SPARK_HOME: .../spark .............................................................................................................................................................. .............................................................................................................................................................. .............................................................................................................................................................. .............................................................................................................................................................. .............................................................................................................................................................. .................................................................................................................................... ``` Author: hyukjinkwon <gurwls223@gmail.com> Closes #18320 from HyukjinKwon/SPARK-21093.	2017-06-25 11:05:57 -07:00
hyukjinkwon	07479b3cfb	[SPARK-21149][R] Add job description API for R ## What changes were proposed in this pull request? Extend `setJobDescription` to SparkR API. ## How was this patch tested? It looks difficult to add a test. Manually tested as below: ```r df <- createDataFrame(iris) count(df) setJobDescription("This is an example job.") count(df) ``` prints ... ![2017-06-22 12 05 49](https://user-images.githubusercontent.com/6477701/27415670-2a649936-5743-11e7-8e95-312f1cd103af.png) Author: hyukjinkwon <gurwls223@gmail.com> Closes #18382 from HyukjinKwon/SPARK-21149.	2017-06-23 09:59:24 -07:00
actuaryzhang	19331b8e44	[SPARK-20889][SPARKR] Grouped documentation for DATETIME column methods ## What changes were proposed in this pull request? Grouped documentation for datetime column methods. Author: actuaryzhang <actuaryzhang10@gmail.com> Closes #18114 from actuaryzhang/sparkRDocDate.	2017-06-22 10:16:51 -07:00
wangmiao1981	53543374ce	[SPARK-20906][SPARKR] Constrained Logistic Regression for SparkR ## What changes were proposed in this pull request? PR https://github.com/apache/spark/pull/17715 Added Constrained Logistic Regression for ML. We should add it to SparkR. ## How was this patch tested? Add new unit tests. Author: wangmiao1981 <wm624@hotmail.com> Closes #18128 from wangmiao1981/test.	2017-06-21 20:42:45 -07:00
actuaryzhang	ad459cfb1d	[SPARK-20917][ML][SPARKR] SparkR supports string encoding consistent with R ## What changes were proposed in this pull request? Add `stringIndexerOrderType` to `spark.glm` and `spark.survreg` to support string encoding that is consistent with default R. ## How was this patch tested? new tests Author: actuaryzhang <actuaryzhang10@gmail.com> Closes #18140 from actuaryzhang/sparkRFormula.	2017-06-21 10:35:16 -07:00
Joseph K. Bradley	cc67bd5732	[SPARK-20929][ML] LinearSVC should use its own threshold param ## What changes were proposed in this pull request? LinearSVC should use its own threshold param, rather than the shared one, since it applies to rawPrediction instead of probability. This PR changes the param in the Scala, Python and R APIs. ## How was this patch tested? New unit test to make sure the threshold can be set to any Double value. Author: Joseph K. Bradley <joseph@databricks.com> Closes #18151 from jkbradley/ml-2.2-linearsvc-cleanup.	2017-06-19 23:04:17 -07:00
actuaryzhang	8965fe764a	[SPARK-20889][SPARKR] Grouped documentation for AGGREGATE column methods ## What changes were proposed in this pull request? Grouped documentation for the aggregate functions for Column. Author: actuaryzhang <actuaryzhang10@gmail.com> Closes #18025 from actuaryzhang/sparkRDoc4.	2017-06-19 19:41:24 -07:00
hyukjinkwon	9a145fd796	[MINOR] Bump SparkR and PySpark version to 2.3.0. ## What changes were proposed in this pull request? #17753 bumps master branch version to 2.3.0-SNAPSHOT, but it seems SparkR and PySpark version were omitted. ditto of https://github.com/apache/spark/pull/16488 / https://github.com/apache/spark/pull/17523 ## How was this patch tested? N/A Author: hyukjinkwon <gurwls223@gmail.com> Closes #18341 from HyukjinKwon/r-version.	2017-06-19 11:13:03 +01:00
actuaryzhang	110ce1f27b	[SPARK-20892][SPARKR] Add SQL trunc function to SparkR ## What changes were proposed in this pull request? Add SQL trunc function ## How was this patch tested? standard test Author: actuaryzhang <actuaryzhang10@gmail.com> Closes #18291 from actuaryzhang/sparkRTrunc2.	2017-06-18 18:00:27 -07:00
hyukjinkwon	05f83c532a	[SPARK-21128][R] Remove both "spark-warehouse" and "metastore_db" before listing files in R tests ## What changes were proposed in this pull request? This PR proposes to list the files in test _after_ removing both "spark-warehouse" and "metastore_db" so that the next run of R tests pass fine. This is sometimes a bit annoying. ## How was this patch tested? Manually running multiple times R tests via `./R/run-tests.sh`. Before Second run: ``` SparkSQL functions: Spark package found in SPARK_HOME: .../spark ............................................................................................................................................................... ............................................................................................................................................................... ............................................................................................................................................................... ............................................................................................................................................................... ............................................................................................................................................................... ....................................................................................................1234....................... Failed ------------------------------------------------------------------------- 1. Failure: No extra files are created in SPARK_HOME by starting session and making calls (test_sparkSQL.R#3384) length(list1) not equal to length(list2). 1/1 mismatches [1] 25 - 23 == 2 2. Failure: No extra files are created in SPARK_HOME by starting session and making calls (test_sparkSQL.R#3384) sort(list1, na.last = TRUE) not equal to sort(list2, na.last = TRUE). 10/25 mismatches x[16]: "metastore_db" y[16]: "pkg" x[17]: "pkg" y[17]: "R" x[18]: "R" y[18]: "README.md" x[19]: "README.md" y[19]: "run-tests.sh" x[20]: "run-tests.sh" y[20]: "SparkR_2.2.0.tar.gz" x[21]: "metastore_db" y[21]: "pkg" x[22]: "pkg" y[22]: "R" x[23]: "R" y[23]: "README.md" x[24]: "README.md" y[24]: "run-tests.sh" x[25]: "run-tests.sh" y[25]: "SparkR_2.2.0.tar.gz" 3. Failure: No extra files are created in SPARK_HOME by starting session and making calls (test_sparkSQL.R#3388) length(list1) not equal to length(list2). 1/1 mismatches [1] 25 - 23 == 2 4. Failure: No extra files are created in SPARK_HOME by starting session and making calls (test_sparkSQL.R#3388) sort(list1, na.last = TRUE) not equal to sort(list2, na.last = TRUE). 10/25 mismatches x[16]: "metastore_db" y[16]: "pkg" x[17]: "pkg" y[17]: "R" x[18]: "R" y[18]: "README.md" x[19]: "README.md" y[19]: "run-tests.sh" x[20]: "run-tests.sh" y[20]: "SparkR_2.2.0.tar.gz" x[21]: "metastore_db" y[21]: "pkg" x[22]: "pkg" y[22]: "R" x[23]: "R" y[23]: "README.md" x[24]: "README.md" y[24]: "run-tests.sh" x[25]: "run-tests.sh" y[25]: "SparkR_2.2.0.tar.gz" DONE =========================================================================== ``` After Second run: ``` SparkSQL functions: Spark package found in SPARK_HOME: .../spark ............................................................................................................................................................... ............................................................................................................................................................... ............................................................................................................................................................... ............................................................................................................................................................... ............................................................................................................................................................... ............................................................................................................................... ``` Author: hyukjinkwon <gurwls223@gmail.com> Closes #18335 from HyukjinKwon/SPARK-21128.	2017-06-18 11:26:27 -07:00
Yuming Wang	45824fb608	[MINOR][DOCS] Improve Running R Tests docs ## What changes were proposed in this pull request? Update Running R Tests dependence packages to: ```bash R -e "install.packages(c('knitr', 'rmarkdown', 'testthat', 'e1071', 'survival'), repos='http://cran.us.r-project.org')" ``` ## How was this patch tested? manual tests Author: Yuming Wang <wgyumg@gmail.com> Closes #18271 from wangyum/building-spark.	2017-06-16 11:03:54 +01:00
Xiao Li	2051428173	[SPARK-20980][SQL] Rename `wholeFile` to `multiLine` for both CSV and JSON ### What changes were proposed in this pull request? The current option name `wholeFile` is misleading for CSV users. Currently, it is not representing a record per file. Actually, one file could have multiple records. Thus, we should rename it. Now, the proposal is `multiLine`. ### How was this patch tested? N/A Author: Xiao Li <gatorsmile@gmail.com> Closes #18202 from gatorsmile/renameCVSOption.	2017-06-15 13:18:19 +08:00
Felix Cheung	9f4ff95524	[SPARK-20877][SPARKR][FOLLOWUP] clean up after test move ## What changes were proposed in this pull request? clean up after big test move ## How was this patch tested? unit tests, jenkins Author: Felix Cheung <felixcheung_m@hotmail.com> Closes #18267 from felixcheung/rtestset2.	2017-06-11 03:00:44 -07:00
Felix Cheung	dc4c351837	[SPARK-20877][SPARKR] refactor tests to basic tests only for CRAN ## What changes were proposed in this pull request? Move all existing tests to non-installed directory so that it will never run by installing SparkR package For a follow-up PR: - remove all skip_on_cran() calls in tests - clean up test timer - improve or change basic tests that do run on CRAN (if anyone has suggestion) It looks like `R CMD build pkg` will still put pkg\tests (ie. the full tests) into the source package but `R CMD INSTALL` on such source package does not install these tests (and so `R CMD check` does not run them) ## How was this patch tested? - [x] unit tests, Jenkins - [x] AppVeyor - [x] make a source package, install it, `R CMD check` it - verify the full tests are not installed or run Author: Felix Cheung <felixcheung_m@hotmail.com> Closes #18264 from felixcheung/rtestset.	2017-06-11 00:00:33 -07:00
Reynold Xin	b78e3849b2	[SPARK-21042][SQL] Document Dataset.union is resolution by position ## What changes were proposed in this pull request? Document Dataset.union is resolution by position, not by name, since this has been a confusing point for a lot of users. ## How was this patch tested? N/A - doc only change. Author: Reynold Xin <rxin@databricks.com> Closes #18256 from rxin/SPARK-21042.	2017-06-09 18:29:33 -07:00
Felix Cheung	382fefd187	[SPARK-20877][SPARKR][WIP] add timestamps to test runs ## What changes were proposed in this pull request? to investigate how long they run ## How was this patch tested? Jenkins, AppVeyor Author: Felix Cheung <felixcheung_m@hotmail.com> Closes #18104 from felixcheung/rtimetest.	2017-05-30 22:33:29 -07:00
Zheng RuiFeng	a97c497045	[SPARK-20849][DOC][SPARKR] Document R DecisionTree ## What changes were proposed in this pull request? 1, add an example for sparkr `decisionTree` 2, document it in user guide ## How was this patch tested? local submit Author: Zheng RuiFeng <ruifengz@foxmail.com> Closes #18067 from zhengruifeng/dt_example.	2017-05-25 23:00:50 -07:00
Yanbo Liang	ad09e4ca04	[MINOR][SPARKR][ML] Joint coefficients with intercept for SparkR linear SVM summary. ## What changes were proposed in this pull request? Joint coefficients with intercept for SparkR linear SVM summary. ## How was this patch tested? Existing tests. Author: Yanbo Liang <ybliang8@gmail.com> Closes #18035 from yanboliang/svm-r.	2017-05-23 16:16:14 +08:00
Shivaram Venkataraman	d06610f992	[SPARK-20727] Skip tests that use Hadoop utils on CRAN Windows ## What changes were proposed in this pull request? This change skips tests that use the Hadoop libraries while running on CRAN check with Windows as the operating system. This is to handle cases where the Hadoop winutils binaries are missing on the target system. The skipped tests consist of 1. Tests that save, load a model in MLlib 2. Tests that save, load CSV, JSON and Parquet files in SQL 3. Hive tests ## How was this patch tested? Tested by running on a local windows VM with HADOOP_HOME unset. Also testing with https://win-builder.r-project.org Author: Shivaram Venkataraman <shivaram@cs.berkeley.edu> Closes #17966 from shivaram/sparkr-windows-cran.	2017-05-22 23:04:22 -07:00
Zheng RuiFeng	4be3375835	[SPARK-15767][ML][SPARKR] Decision Tree wrapper in SparkR ## What changes were proposed in this pull request? support decision tree in R ## How was this patch tested? added tests Author: Zheng RuiFeng <ruifengz@foxmail.com> Closes #17981 from zhengruifeng/dt_r.	2017-05-22 10:40:49 -07:00
Wayne Zhang	7f203a248f	[SPARKR] Fix bad examples in DataFrame methods and style issues ## What changes were proposed in this pull request? Some examples in the DataFrame methods are syntactically wrong, even though they are pseudo code. Fix these and some style issues. Author: Wayne Zhang <actuaryzhang@uber.com> Closes #18003 from actuaryzhang/sparkRDoc3.	2017-05-19 11:18:20 -07:00
zero323	2d90c04f23	[SPARKR][DOCS][MINOR] Use consistent names in rollup and cube examples ## What changes were proposed in this pull request? Rename `carsDF` to `df` in SparkR `rollup` and `cube` examples. ## How was this patch tested? Manual tests. Author: zero323 <zero323@users.noreply.github.com> Closes #17988 from zero323/cube-docs.	2017-05-19 11:04:38 -07:00
zero323	5a799fd8c3	[SPARK-20726][SPARKR] wrapper for SQL broadcast ## What changes were proposed in this pull request? - Adds R wrapper for `o.a.s.sql.functions.broadcast`. - Renames `broadcast` to `broadcast_`. ## How was this patch tested? Unit tests, check `check-cran.sh`. Author: zero323 <zero323@users.noreply.github.com> Closes #17965 from zero323/SPARK-20726.	2017-05-14 13:22:19 -07:00
zero323	aa3df15904	[DOCS][SPARKR] Use verbose names for family annotations in functions.R ## What changes were proposed in this pull request? - Change current short annotations (same as Scala `group`) to verbose names (same as Scala `groupname`). Before: ![image](https://cloud.githubusercontent.com/assets/1554276/26033909/9a98b596-38b4-11e7-961e-15fd9ea7440d.png) After: ![image](https://cloud.githubusercontent.com/assets/1554276/26033903/727a9944-38b4-11e7-8873-b09c553f4ec3.png) - Add missing `family` annotations. ## How was this patch tested? `check-cran.R` (skipping tests), manual inspection. Author: zero323 <zero323@users.noreply.github.com> Closes #17976 from zero323/SPARKR-FUNCTIONS-DOCSTRINGS.	2017-05-14 11:43:28 -07:00
hyukjinkwon	720708ccdd	[SPARK-20639][SQL] Add single argument support for to_timestamp in SQL with documentation improvement ## What changes were proposed in this pull request? This PR proposes three things as below: - Use casting rules to a timestamp in `to_timestamp` by default (it was `yyyy-MM-dd HH:mm:ss`). - Support single argument for `to_timestamp` similarly with APIs in other languages. For example, the one below works ``` import org.apache.spark.sql.functions._ Seq("2016-12-31 00:12:00.00").toDF("a").select(to_timestamp(col("a"))).show() ``` prints ``` +----------------------------------------+ \|to_timestamp(`a`, 'yyyy-MM-dd HH:mm:ss')\| +----------------------------------------+ \| 2016-12-31 00:12:00\| +----------------------------------------+ ``` whereas this does not work in SQL. Before ``` spark-sql> SELECT to_timestamp('2016-12-31 00:12:00'); Error in query: Invalid number of arguments for function to_timestamp; line 1 pos 7 ``` After ``` spark-sql> SELECT to_timestamp('2016-12-31 00:12:00'); 2016-12-31 00:12:00 ``` - Related document improvement for SQL function descriptions and other API descriptions accordingly. Before ``` spark-sql> DESCRIBE FUNCTION extended to_date; ... Usage: to_date(date_str, fmt) - Parses the `left` expression with the `fmt` expression. Returns null with invalid input. Extended Usage: Examples: > SELECT to_date('2016-12-31', 'yyyy-MM-dd'); 2016-12-31 ``` ``` spark-sql> DESCRIBE FUNCTION extended to_timestamp; ... Usage: to_timestamp(timestamp, fmt) - Parses the `left` expression with the `format` expression to a timestamp. Returns null with invalid input. Extended Usage: Examples: > SELECT to_timestamp('2016-12-31', 'yyyy-MM-dd'); 2016-12-31 00:00:00.0 ``` After ``` spark-sql> DESCRIBE FUNCTION extended to_date; ... Usage: to_date(date_str[, fmt]) - Parses the `date_str` expression with the `fmt` expression to a date. Returns null with invalid input. By default, it follows casting rules to a date if the `fmt` is omitted. Extended Usage: Examples: > SELECT to_date('2009-07-30 04:17:52'); 2009-07-30 > SELECT to_date('2016-12-31', 'yyyy-MM-dd'); 2016-12-31 ``` ``` spark-sql> DESCRIBE FUNCTION extended to_timestamp; ... Usage: to_timestamp(timestamp[, fmt]) - Parses the `timestamp` expression with the `fmt` expression to a timestamp. Returns null with invalid input. By default, it follows casting rules to a timestamp if the `fmt` is omitted. Extended Usage: Examples: > SELECT to_timestamp('2016-12-31 00:12:00'); 2016-12-31 00:12:00 > SELECT to_timestamp('2016-12-31', 'yyyy-MM-dd'); 2016-12-31 00:00:00 ``` ## How was this patch tested? Added tests in `datetime.sql`. Author: hyukjinkwon <gurwls223@gmail.com> Closes #17901 from HyukjinKwon/to_timestamp_arg.	2017-05-12 16:42:58 +08:00
Felix Cheung	888b84abe8	[SPARK-20704][SPARKR] change CRAN test to run single thread ## What changes were proposed in this pull request? - [x] need to test by running R CMD check --as-cran - [x] sanity check vignettes ## How was this patch tested? Jenkins Author: Felix Cheung <felixcheung_m@hotmail.com> Closes #17945 from felixcheung/rchangesforpackage.	2017-05-11 23:10:04 -07:00
Felix Cheung	b952b44af4	[SPARK-20661][SPARKR][TEST][FOLLOWUP] SparkR tableNames() test fails ## What changes were proposed in this pull request? Change it to check for relative count like in this test https://github.com/apache/spark/blame/master/R/pkg/inst/tests/testthat/test_sparkSQL.R#L3355 for catalog APIs ## How was this patch tested? unit tests, this needs to combine with another commit with SQL change to check Author: Felix Cheung <felixcheung_m@hotmail.com> Closes #17905 from felixcheung/rtabletests.	2017-05-08 22:49:40 -07:00
Hossein	2abfee18b6	[SPARK-20661][SPARKR][TEST] SparkR tableNames() test fails ## What changes were proposed in this pull request? Cleaning existing temp tables before running tableNames tests ## How was this patch tested? SparkR Unit tests Author: Hossein <hossein@databricks.com> Closes #17903 from falaki/SPARK-20661.	2017-05-08 14:48:11 -07:00
Wayne Zhang	2fdaeb52bb	[SPARKR][DOC] fix typo in vignettes ## What changes were proposed in this pull request? Fix typo in vignettes Author: Wayne Zhang <actuaryzhang@uber.com> Closes #17884 from actuaryzhang/typo.	2017-05-07 23:16:30 -07:00
Felix Cheung	c24bdaab5a	[SPARK-20626][SPARKR] address date test warning with timezone on windows ## What changes were proposed in this pull request? set timezone on windows ## How was this patch tested? unit test, AppVeyor Author: Felix Cheung <felixcheung_m@hotmail.com> Closes #17892 from felixcheung/rtimestamptest.	2017-05-07 23:10:18 -07:00
zero323	1f73d3589a	[SPARK-20550][SPARKR] R wrapper for Dataset.alias ## What changes were proposed in this pull request? - Add SparkR wrapper for `Dataset.alias`. - Adjust roxygen annotations for `functions.alias` (including example usage). ## How was this patch tested? Unit tests, `check_cran.sh`. Author: zero323 <zero323@users.noreply.github.com> Closes #17825 from zero323/SPARK-20550.	2017-05-07 16:24:42 -07:00
Felix Cheung	7087e01194	[SPARK-20543][SPARKR][FOLLOWUP] Don't skip tests on AppVeyor ## What changes were proposed in this pull request? add environment ## How was this patch tested? wait for appveyor run Author: Felix Cheung <felixcheung_m@hotmail.com> Closes #17878 from felixcheung/appveyorrcran.	2017-05-07 13:10:10 -07:00
Felix Cheung	57b64703e6	[SPARK-20571][SPARKR][SS] Flaky Structured Streaming tests ## What changes were proposed in this pull request? Make tests more reliable by having it till processed. Increasing timeout value might help but ultimately the flakiness from processing delay when Jenkins is hard to account for. This isn't an actual public API supported ## How was this patch tested? unit tests Author: Felix Cheung <felixcheung_m@hotmail.com> Closes #17857 from felixcheung/rsstestrelia.	2017-05-04 01:54:59 -07:00
zero323	f21897fc15	[SPARK-20544][SPARKR] R wrapper for input_file_name ## What changes were proposed in this pull request? Adds wrapper for `o.a.s.sql.functions.input_file_name` ## How was this patch tested? Existing unit tests, additional unit tests, `check-cran.sh`. Author: zero323 <zero323@users.noreply.github.com> Closes #17818 from zero323/SPARK-20544.	2017-05-04 01:51:37 -07:00
zero323	9c36aa2791	[SPARK-20585][SPARKR] R generic hint support ## What changes were proposed in this pull request? Adds support for generic hints on `SparkDataFrame` ## How was this patch tested? Unit tests, `check-cran.sh` Author: zero323 <zero323@users.noreply.github.com> Closes #17851 from zero323/SPARK-20585.	2017-05-04 01:41:36 -07:00
Felix Cheung	b8302ccd02	[SPARK-20015][SPARKR][SS][DOC][EXAMPLE] Document R Structured Streaming (experimental) in R vignettes and R & SS programming guide, R example ## What changes were proposed in this pull request? Add - R vignettes - R programming guide - SS programming guide - R example Also disable spark.als in vignettes for now since it's failing (SPARK-20402) ## How was this patch tested? manually Author: Felix Cheung <felixcheung_m@hotmail.com> Closes #17814 from felixcheung/rdocss.	2017-05-04 00:27:10 -07:00
Felix Cheung	fc472bddd1	[SPARK-20543][SPARKR] skip tests when running on CRAN ## What changes were proposed in this pull request? General rule on skip or not: skip if - RDD tests - tests could run long or complicated (streaming, hivecontext) - tests on error conditions - tests won't likely change/break ## How was this patch tested? unit tests, `R CMD check --as-cran`, `R CMD check` Author: Felix Cheung <felixcheung_m@hotmail.com> Closes #17817 from felixcheung/rskiptest.	2017-05-03 21:40:18 -07:00
Felix Cheung	13f47dc503	[SPARK-20490][SPARKR][DOC] add family tag for not function ## What changes were proposed in this pull request? doc only ## How was this patch tested? manual Author: Felix Cheung <felixcheung_m@hotmail.com> Closes #17828 from felixcheung/rnotfamily.	2017-05-02 09:37:01 -07:00
zero323	90d77e971f	[SPARK-20532][SPARKR] Implement grouping and grouping_id ## What changes were proposed in this pull request? Adds R wrappers for: - `o.a.s.sql.functions.grouping` as `o.a.s.sql.functions.is_grouping` (to avoid shading `base::grouping` - `o.a.s.sql.functions.grouping_id` ## How was this patch tested? Existing unit tests, additional unit tests. `check-cran.sh`. Author: zero323 <zero323@users.noreply.github.com> Closes #17807 from zero323/SPARK-20532.	2017-05-01 21:39:17 -07:00
Felix Cheung	a355b667a3	[SPARK-20541][SPARKR][SS] support awaitTermination without timeout ## What changes were proposed in this pull request? Add without param for timeout - will need this to submit a job that runs until stopped Need this for 2.2 ## How was this patch tested? manually, unit test Author: Felix Cheung <felixcheung_m@hotmail.com> Closes #17815 from felixcheung/rssawaitinfinite.	2017-04-30 23:23:49 -07:00
zero323	80e9cf1b59	[SPARK-20490][SPARKR] Add R wrappers for eqNullSafe and ! / not ## What changes were proposed in this pull request? - Add null-safe equality operator `%<=>%` (sames as `o.a.s.sql.Column.eqNullSafe`, `o.a.s.sql.Column.<=>`) - Add boolean negation operator `!` and function `not `. ## How was this patch tested? Existing unit tests, additional unit tests, `check-cran.sh`. Author: zero323 <zero323@users.noreply.github.com> Closes #17783 from zero323/SPARK-20490.	2017-04-30 22:07:12 -07:00
zero323	ae3df4e98f	[SPARK-20535][SPARKR] R wrappers for explode_outer and posexplode_outer ## What changes were proposed in this pull request? Ad R wrappers for - `o.a.s.sql.functions.explode_outer` - `o.a.s.sql.functions.posexplode_outer` ## How was this patch tested? Additional unit tests, manual testing. Author: zero323 <zero323@users.noreply.github.com> Closes #17809 from zero323/SPARK-20535.	2017-04-30 12:33:03 -07:00
hyukjinkwon	70f1bcd7bc	[SPARK-20493][R] De-duplicate parse logics for DDL-like type strings in R ## What changes were proposed in this pull request? It seems we are using `SQLUtils.getSQLDataType` for type string in structField. It looks we can replace this with `CatalystSqlParser.parseDataType`. They look similar DDL-like type definitions as below: ```scala scala> Seq(Tuple1(Tuple1("a"))).toDF.show() ``` ``` +---+ \| _1\| +---+ \|[a]\| +---+ ``` ```scala scala> Seq(Tuple1(Tuple1("a"))).toDF.select($"_1".cast("struct<_1:string>")).show() ``` ``` +---+ \| _1\| +---+ \|[a]\| +---+ ``` Such type strings looks identical when R’s one as below: ```R > write.df(sql("SELECT named_struct('_1', 'a') as struct"), "/tmp/aa", "parquet") > collect(read.df("/tmp/aa", "parquet", structType(structField("struct", "struct<_1:string>")))) struct 1 a ``` R’s one is stricter because we are checking the types via regular expressions in R side ahead. Actual logics there look a bit different but as we check it ahead in R side, it looks replacing it would not introduce (I think) no behaviour changes. To make this sure, the tests dedicated for it were added in SPARK-20105. (It looks `structField` is the only place that calls this method). ## How was this patch tested? Existing tests - https://github.com/apache/spark/blob/master/R/pkg/inst/tests/testthat/test_sparkSQL.R#L143-L194 should cover this. Author: hyukjinkwon <gurwls223@gmail.com> Closes #17785 from HyukjinKwon/SPARK-20493.	2017-04-29 11:02:17 -07:00
zero323	b58cf77c4d	[DOCS][MINOR] Add missing since to SparkR repeat_string note. ## What changes were proposed in this pull request? Replace note repeat_string 2.3.0 with note repeat_string since 2.3.0 ## How was this patch tested? `create-docs.sh` Author: zero323 <zero323@users.noreply.github.com> Closes #17779 from zero323/REPEAT-NOTE.	2017-04-27 00:29:43 -07:00
Takeshi Yamamuro	b4724db19a	[SPARK-20425][SQL] Support a vertical display mode for Dataset.show ## What changes were proposed in this pull request? This pr added a new display mode for `Dataset.show` to print output rows vertically (one line per column value). In the current master, when printing Dataset with many columns, the readability is low like; ``` scala> val df = spark.range(100).selectExpr((0 until 100).map(i => s"rand() AS c$i"): _*) scala> df.show(3, 0) +------------------+------------------+------------------+-------------------+------------------+------------------+-------------------+------------------+------------------+------------------+------------------+-------------------+------------------+------------------+------------------+-------------------+-------------------+-------------------+------------------+------------------+-------------------+------------------+-------------------+------------------+-------------------+-------------------+-------------------+--------------------+-------------------+------------------+-------------------+--------------------+------------------+------------------+-------------------+-------------------+-------------------+------------------+------------------+-------------------+------------------+------------------+-------------------+-------------------+-------------------+------------------+--------------------+--------------------+-------------------+-------------------+-------------------+-------------------+-------------------+-------------------+--------------------+-------------------+-------------------+-------------------+-------------------+------------------+------------------+-------------------+-------------------+------------------+-------------------+------------------+------------------+-----------------+-------------------+-------------------+------------------+-------------------+------------------+-------------------+-------------------+-------------------+------------------+-------------------+------------------+-------------------+-------------------+-------------------+-------------------+-------------------+-------------------+-------------------+-------------------+------------------+-------------------+-------------------+------------------+------------------+------------------+-------------------+------------------+-------------------+------------------+-------------------+-------------------+-------------------+ \|c0 \|c1 \|c2 \|c3 \|c4 \|c5 \|c6 \|c7 \|c8 \|c9 \|c10 \|c11 \|c12 \|c13 \|c14 \|c15 \|c16 \|c17 \|c18 \|c19 \|c20 \|c21 \|c22 \|c23 \|c24 \|c25 \|c26 \|c27 \|c28 \|c29 \|c30 \|c31 \|c32 \|c33 \|c34 \|c35 \|c36 \|c37 \|c38 \|c39 \|c40 \|c41 \|c42 \|c43 \|c44 \|c45 \|c46 \|c47 \|c48 \|c49 \|c50 \|c51 \|c52 \|c53 \|c54 \|c55 \|c56 \|c57 \|c58 \|c59 \|c60 \|c61 \|c62 \|c63 \|c64 \|c65 \|c66 \|c67 \|c68 \|c69 \|c70 \|c71 \|c72 \|c73 \|c74 \|c75 \|c76 \|c77 \|c78 \|c79 \|c80 \|c81 \|c82 \|c83 \|c84 \|c85 \|c86 \|c87 \|c88 \|c89 \|c90 \|c91 \|c92 \|c93 \|c94 \|c95 \|c96 \|c97 \|c98 \|c99 \| +------------------+------------------+------------------+-------------------+------------------+------------------+-------------------+------------------+------------------+------------------+------------------+-------------------+------------------+------------------+------------------+-------------------+-------------------+-------------------+------------------+------------------+-------------------+------------------+-------------------+------------------+-------------------+-------------------+-------------------+--------------------+-------------------+------------------+-------------------+--------------------+------------------+------------------+-------------------+-------------------+-------------------+------------------+------------------+-------------------+------------------+------------------+-------------------+-------------------+-------------------+------------------+--------------------+--------------------+-------------------+-------------------+-------------------+-------------------+-------------------+-------------------+--------------------+-------------------+-------------------+-------------------+-------------------+------------------+------------------+-------------------+-------------------+------------------+-------------------+------------------+------------------+-----------------+-------------------+-------------------+------------------+-------------------+------------------+-------------------+-------------------+-------------------+------------------+-------------------+------------------+-------------------+-------------------+-------------------+-------------------+-------------------+-------------------+-------------------+-------------------+------------------+-------------------+-------------------+------------------+------------------+------------------+-------------------+------------------+-------------------+------------------+-------------------+-------------------+-------------------+ \|0.6306087152476858\|0.9174349686288383\|0.5511324165035159\|0.3320844128641819 \|0.7738486877101489\|0.2154915886962553\|0.4754997600674299 \|0.922780639280355 \|0.7136894772661909\|0.2277580838165979\|0.5926874459847249\|0.40311408392226633\|0.467830264333843 \|0.8330466896984213\|0.1893258482389527\|0.6320849515511165 \|0.7530911056912044 \|0.06700254871955424\|0.370528597355559 \|0.2755437445193154\|0.23704391110980128\|0.8067400174905822\|0.13597793616251852\|0.1708888820162453\|0.01672725007605702\|0.983118121881555 \|0.25040195628629924\|0.060537253723083384\|0.20000530582637488\|0.3400572407133511\|0.9375689433322597 \|0.057039316954370256\|0.8053269714347623\|0.5247817572228813\|0.28419308820527944\|0.9798908885194533 \|0.31805988175678146\|0.7034448027077574\|0.5400575751346084\|0.25336322371116216\|0.9361634546853429\|0.6118681368289798\|0.6295081549153907 \|0.13417468943957422\|0.41617137072255794\|0.7267230869252035\|0.023792726137561115\|0.5776157058356362 \|0.04884204913195467\|0.26728716103441275\|0.646680370807925 \|0.9782712690657244 \|0.16434031314818154\|0.20985522381321275\|0.24739842475440077 \|0.26335189682977334\|0.19604841662422068\|0.10742950487300651\|0.20283136488091502\|0.3100312319723688\|0.886959006630645 \|0.25157102269776244\|0.34428775168410786\|0.3500506818575777\|0.3781142441912052 \|0.8560316444386715\|0.4737104888956839\|0.735903101602148\|0.02236617130529006\|0.8769074095835873 \|0.2001426662503153\|0.5534032319238532 \|0.7289496620397098\|0.41955191309992157\|0.9337700133660436 \|0.34059094378451005\|0.6419144759403556\|0.08167496930341167\|0.9947099478497635\|0.48010888605366586\|0.22314796858167918\|0.17786598882331306\|0.7351521162297135 \|0.5422057170020095 \|0.9521927872726792 \|0.7459825486368227 \|0.40907708791990627\|0.8903819313311575\|0.7251413746923618 \|0.2977174938745204 \|0.9515209660203555\|0.9375968604766713\|0.5087851740042524\|0.4255237544908751 \|0.8023768698664653\|0.48003189618006703\|0.1775841829745185\|0.09050775629268382\|0.6743909291138167 \|0.2498415755876865 \| \|0.6866473844170801\|0.4774360641212433\|0.631696201340726 \|0.33979113021468343\|0.5663049010847052\|0.7280190472258865\|0.41370958502324806\|0.9977433873622218\|0.7671957338989901\|0.2788708556233931\|0.3355106391656496\|0.88478952319287 \|0.0333974166999893\|0.6061744715862606\|0.9617779139652359\|0.22484954822341863\|0.12770906021550898\|0.5577789629508672 \|0.2877649024640704\|0.5566577406549361\|0.9334933255278052 \|0.9166720585157266\|0.9689249324600591 \|0.6367502457478598\|0.7993572745928459 \|0.23213222324218108\|0.11928284054154137\|0.6173493362456599 \|0.0505122058694798 \|0.9050228629552983\|0.17112767911121707\|0.47395598348370005 \|0.5820498657823081\|0.6241124650645072\|0.18587258258036776\|0.14987593554122225\|0.3079446253653946 \|0.9414228822867968\|0.8362276265462365\|0.9155655305576353 \|0.5121559807153562\|0.8963362656525707\|0.22765970274318037\|0.8177039187132797 \|0.8190326635933787 \|0.5256005177032199\|0.8167598457269669 \|0.030936807130934496\|0.6733006585281015 \|0.4208049626816347 \|0.24603085738518538\|0.22719198954208153\|0.1622280557565281 \|0.22217325159218038\|0.014684419513742553\|0.08987111517447499\|0.2157764759142622 \|0.8223414104088321 \|0.4868624404491777 \|0.4016191733088167\|0.6169281906889263\|0.15603611040433385\|0.18289285085714913\|0.9538408988218972\|0.15037154865295121\|0.5364516961987454\|0.8077254873163031\|0.712600478545675\|0.7277477241003857 \|0.19822912960348305\|0.8305051199208777\|0.18631911396566114\|0.8909532487898342\|0.3470409226992506 \|0.35306974180587636\|0.9107058868891469 \|0.3321327206004986\|0.48952332459050607\|0.3630403307479373\|0.5400046826340376 \|0.5387377194310529 \|0.42860539421837585\|0.23214101630985995\|0.21438968839794847\|0.15370603160082352\|0.04355605642700022\|0.6096006707067466 \|0.6933354157094292\|0.06302172470859002\|0.03174631856164001\|0.664243581650643 \|0.7833239547446621\|0.696884598352864 \|0.34626385933237736\|0.9263495598791336\|0.404818892816584 \|0.2085585394755507\|0.6150004897990109 \|0.05391193524302473\|0.28188484028329097\| +------------------+------------------+------------------+-------------------+------------------+------------------+-------------------+------------------+------------------+------------------+------------------+-------------------+------------------+------------------+------------------+-------------------+-------------------+-------------------+------------------+------------------+-------------------+------------------+-------------------+------------------+-------------------+-------------------+-------------------+--------------------+-------------------+------------------+-------------------+--------------------+------------------+------------------+-------------------+-------------------+-------------------+------------------+------------------+-------------------+------------------+------------------+-------------------+-------------------+-------------------+------------------+--------------------+--------------------+-------------------+-------------------+-------------------+-------------------+-------------------+-------------------+--------------------+-------------------+-------------------+-------------------+-------------------+------------------+------------------+-------------------+-------------------+------------------+-------------------+------------------+------------------+-----------------+-------------------+-------------------+------------------+-------------------+------------------+-------------------+-------------------+-------------------+------------------+-------------------+------------------+-------------------+-------------------+-------------------+-------------------+-------------------+-------------------+-------------------+-------------------+------------------+-------------------+-------------------+------------------+------------------+------------------+-------------------+------------------+-------------------+------------------+-------------------+-------------------+-------------------+ only showing top 2 rows ``` `psql`, CLI for PostgreSQL, supports a vertical display mode for this case like: http://stackoverflow.com/questions/9604723/alternate-output-format-for-psql ``` -RECORD 0------------------- c0 \| 0.6306087152476858 c1 \| 0.9174349686288383 c2 \| 0.5511324165035159 ... c98 \| 0.05391193524302473 c99 \| 0.28188484028329097 -RECORD 1------------------- c0 \| 0.6866473844170801 c1 \| 0.4774360641212433 c2 \| 0.631696201340726 ... c98 \| 0.05391193524302473 c99 \| 0.28188484028329097 only showing top 2 rows ``` ## How was this patch tested? Added tests in `DataFrameSuite`. Author: Takeshi Yamamuro <yamamuro@apache.org> Closes #17733 from maropu/SPARK-20425.	2017-04-26 22:18:01 -07:00
Yanbo Liang	dbb06c689c	[MINOR][ML] Fix some PySpark & SparkR flaky tests ## What changes were proposed in this pull request? Some PySpark & SparkR tests run with tiny dataset and tiny ```maxIter```, which means they are not converged. I don’t think checking intermediate result during iteration make sense, and these intermediate result may vulnerable and not stable, so we should switch to check the converged result. We hit this issue at #17746 when we upgrade breeze to 0.13.1. ## How was this patch tested? Existing tests. Author: Yanbo Liang <ybliang8@gmail.com> Closes #17757 from yanboliang/flaky-test.	2017-04-26 21:34:18 +08:00
zero323	df58a95a33	[SPARK-20437][R] R wrappers for rollup and cube ## What changes were proposed in this pull request? - Add `rollup` and `cube` methods and corresponding generics. - Add short description to the vignette. ## How was this patch tested? - Existing unit tests. - Additional unit tests covering new features. - `check-cran.sh`. Author: zero323 <zero323@users.noreply.github.com> Closes #17728 from zero323/SPARK-20437.	2017-04-25 22:00:45 -07:00
Yanbo Liang	67eef47acf	[SPARK-20449][ML] Upgrade breeze version to 0.13.1 ## What changes were proposed in this pull request? Upgrade breeze version to 0.13.1, which fixed some critical bugs of L-BFGS-B. ## How was this patch tested? Existing unit tests. Author: Yanbo Liang <ybliang8@gmail.com> Closes #17746 from yanboliang/spark-20449.	2017-04-25 17:10:41 +00:00
zero323	8a272ddc9d	[SPARK-20438][R] SparkR wrappers for split and repeat ## What changes were proposed in this pull request? Add wrappers for `o.a.s.sql.functions`: - `split` as `split_string` - `repeat` as `repeat_string` ## How was this patch tested? Existing tests, additional unit tests, `check-cran.sh` Author: zero323 <zero323@users.noreply.github.com> Closes #17729 from zero323/SPARK-20438.	2017-04-24 10:56:57 -07:00
zero323	fd648bff63	[SPARK-20371][R] Add wrappers for collect_list and collect_set ## What changes were proposed in this pull request? Adds wrappers for `collect_list` and `collect_set`. ## How was this patch tested? Unit tests, `check-cran.sh` Author: zero323 <zero323@users.noreply.github.com> Closes #17672 from zero323/SPARK-20371.	2017-04-21 12:06:21 -07:00
zero323	46c5749768	[SPARK-20375][R] R wrappers for array and map ## What changes were proposed in this pull request? Adds wrappers for `o.a.s.sql.functions.array` and `o.a.s.sql.functions.map` ## How was this patch tested? Unit tests, `check-cran.sh` Author: zero323 <zero323@users.noreply.github.com> Closes #17674 from zero323/SPARK-20375.	2017-04-19 21:19:46 -07:00
Shixiong Zhu	4fea7848c4	[SPARK-20397][SPARKR][SS] Fix flaky test: test_streaming.R.Terminated by error ## What changes were proposed in this pull request? Checking a source parameter is asynchronous. When the query is created, it's not guaranteed that source has been created. This PR just increases the timeout of awaitTermination to ensure the parsing error is thrown. ## How was this patch tested? Jenkins Author: Shixiong Zhu <shixiong@databricks.com> Closes #17687 from zsxwing/SPARK-20397.	2017-04-19 13:10:44 -07:00
zero323	702d85af2d	[SPARK-20208][R][DOCS] Document R fpGrowth support ## What changes were proposed in this pull request? Document fpGrowth in: - vignettes - programming guide - code example ## How was this patch tested? Manual tests. Author: zero323 <zero323@users.noreply.github.com> Closes #17557 from zero323/SPARK-20208.	2017-04-18 19:59:18 -07:00
hyukjinkwon	24f09b39c7	[SPARK-19828][R][FOLLOWUP] Rename asJsonArray to as.json.array in from_json function in R ## What changes were proposed in this pull request? This was suggested to be `as.json.array` at the first place in the PR to SPARK-19828 but we could not do this as the lint check emits an error for multiple dots in the variable names. After SPARK-20278, now we are able to use `multiple.dots.in.names`. `asJsonArray` in `from_json` function is still able to be changed as 2.2 is not released yet. So, this PR proposes to rename `asJsonArray` to `as.json.array`. ## How was this patch tested? Jenkins tests, local tests with `./R/run-tests.sh` and manual `./dev/lint-r`. Existing tests should cover this. Author: hyukjinkwon <gurwls223@gmail.com> Closes #17653 from HyukjinKwon/SPARK-19828-followup.	2017-04-17 09:04:24 -07:00
hyukjinkwon	86d251c585	[SPARK-20278][R] Disable 'multiple_dots_linter' lint rule that is against project's code style ## What changes were proposed in this pull request? Currently, multi-dot separated variables in R is not allowed. For example, ```diff setMethod("from_json", signature(x = "Column", schema = "structType"), - function(x, schema, asJsonArray = FALSE, ...) { + function(x, schema, as.json.array = FALSE, ...) { if (asJsonArray) { jschema <- callJStatic("org.apache.spark.sql.types.DataTypes", "createArrayType", ``` produces an error as below: ``` R/functions.R:2462:31: style: Words within variable and function names should be separated by '_' rather than '.'. function(x, schema, as.json.array = FALSE, ...) { ^~~~~~~~~~~~~ ``` This seems against https://google.github.io/styleguide/Rguide.xml#identifiers which says > The preferred form for variable names is all lower case letters and words separated with dots This looks because lintr by default https://github.com/jimhester/lintr follows http://r-pkgs.had.co.nz/style.html as written in the README.md. Few cases seems not following Google's one as "a few tweaks". Per [SPARK-6813](https://issues.apache.org/jira/browse/SPARK-6813), we follow Google's R Style Guide with few exceptions https://google.github.io/styleguide/Rguide.xml. This is also merged into Spark's website - https://github.com/apache/spark-website/pull/43 Also, it looks we have no limit on function name. This rule also looks affecting to the name of functions as written in the README.md. > `multiple_dots_linter`: check that function and variable names are separated by _ rather than .. ## How was this patch tested? Manually tested `./dev/lint-r`with the manual change below in `R/functions.R`: ```diff setMethod("from_json", signature(x = "Column", schema = "structType"), - function(x, schema, asJsonArray = FALSE, ...) { + function(x, schema, as.json.array = FALSE, ...) { if (asJsonArray) { jschema <- callJStatic("org.apache.spark.sql.types.DataTypes", "createArrayType", ``` Before ```R R/functions.R:2462:31: style: Words within variable and function names should be separated by '_' rather than '.'. function(x, schema, as.json.array = FALSE, ...) { ^~~~~~~~~~~~~ ``` After ``` lintr checks passed. ``` Author: hyukjinkwon <gurwls223@gmail.com> Closes #17590 from HyukjinKwon/disable-dot-in-name.	2017-04-16 11:27:27 -07:00
Brendan Dwyer	044f7ecbfd	[SPARK-20298][SPARKR][MINOR] fixed spelling mistake "charactor" ## What changes were proposed in this pull request? Fixed spelling of "charactor" ## How was this patch tested? Spelling change only Author: Brendan Dwyer <brendan.dwyer@ibm.com> Closes #17611 from bdwyer2/SPARK-20298.	2017-04-12 09:24:41 +01:00
Felix Cheung	8feb799af0	[SPARK-20197][SPARKR] CRAN check fail with package installation ## What changes were proposed in this pull request? Test failed because SPARK_HOME is not set before Spark is installed. Author: Felix Cheung <felixcheung_m@hotmail.com> Closes #17516 from felixcheung/rdircheckincran.	2017-04-07 11:17:49 -07:00
Felix Cheung	5a693b4138	[SPARK-20195][SPARKR][SQL] add createTable catalog API and deprecate createExternalTable ## What changes were proposed in this pull request? Following up on #17483, add createTable (which is new in 2.2.0) and deprecate createExternalTable, plus a number of minor fixes ## How was this patch tested? manual, unit tests Author: Felix Cheung <felixcheung_m@hotmail.com> Closes #17511 from felixcheung/rceatetable.	2017-04-06 09:15:13 -07:00
Felix Cheung	bccc330193	[SPARK-20196][PYTHON][SQL] update doc for catalog functions for all languages, add pyspark refreshByPath API ## What changes were proposed in this pull request? Update doc to remove external for createTable, add refreshByPath in python ## How was this patch tested? manual Author: Felix Cheung <felixcheung_m@hotmail.com> Closes #17512 from felixcheung/catalogdoc.	2017-04-06 09:09:43 -07:00
Felix Cheung	c1b8b66750	[SPARKR][DOC] update doc for fpgrowth ## What changes were proposed in this pull request? minor update zero323 Author: Felix Cheung <felixcheung_m@hotmail.com> Closes #17526 from felixcheung/rfpgrowthfollowup.	2017-04-04 22:32:46 -07:00
hyukjinkwon	0e2ee82044	[MINOR][R] Reorder `Collate` fields in DESCRIPTION file ## What changes were proposed in this pull request? It seems cran check scripts corrects `R/pkg/DESCRIPTION` and follows the order in `Collate` fields. This PR proposes to fix `catalog.R`'s order so that running this script does not show up a small diff in this file every time. ## How was this patch tested? Manually via `./R/check-cran.sh`. Author: hyukjinkwon <gurwls223@gmail.com> Closes #17528 from HyukjinKwon/minor-reorder-description.	2017-04-04 11:42:14 -07:00
zero323	b34f7665dd	[SPARK-19825][R][ML] spark.ml R API for FPGrowth ## What changes were proposed in this pull request? Adds SparkR API for FPGrowth: [SPARK-19825](https://issues.apache.org/jira/browse/SPARK-19825): - `spark.fpGrowth` -model training. - `freqItemsets` and `associationRules` methods with new corresponding generics. - Scala helper: `org.apache.spark.ml.r. FPGrowthWrapper` - unit tests. ## How was this patch tested? Feature specific unit tests. Author: zero323 <zero323@users.noreply.github.com> Closes #17170 from zero323/SPARK-19825.	2017-04-03 23:42:04 -07:00
Felix Cheung	93dbfe705f	[SPARK-20159][SPARKR][SQL] Support all catalog API in R ## What changes were proposed in this pull request? Add a set of catalog API in R ``` "currentDatabase", "listColumns", "listDatabases", "listFunctions", "listTables", "recoverPartitions", "refreshByPath", "refreshTable", "setCurrentDatabase", ``` https://github.com/apache/spark/pull/17483/files#diff-6929e6c5e59017ff954e110df20ed7ff ## How was this patch tested? manual tests, unit tests Author: Felix Cheung <felixcheung_m@hotmail.com> Closes #17483 from felixcheung/rcatalog.	2017-04-02 11:59:27 -07:00
zuotingbing	76de2d1153	[SPARK-20123][BUILD] SPARK_HOME variable might have spaces in it(e.g. $SPARK… JIRA Issue: https://issues.apache.org/jira/browse/SPARK-20123 ## What changes were proposed in this pull request? If $SPARK_HOME or $FWDIR variable contains spaces, then use "./dev/make-distribution.sh --name custom-spark --tgz -Psparkr -Phadoop-2.7 -Phive -Phive-thriftserver -Pmesos -Pyarn" build spark will failed. ## How was this patch tested? manual tests Author: zuotingbing <zuo.tingbing9@zte.com.cn> Closes #17452 from zuotingbing/spark-bulid.	2017-04-02 15:31:13 +01:00
hyukjinkwon	3fada2f502	[SPARK-20105][TESTS][R] Add tests for checkType and type string in structField in R ## What changes were proposed in this pull request? It seems `checkType` and the type string in `structField` are not being tested closely. This string format currently seems SparkR-specific (see `d1f6c64c4b/sql/core/src/main/scala/org/apache/spark/sql/api/r/SQLUtils.scala (L93-L131)`) but resembles SQL type definition. Therefore, it seems nicer if we test positive/negative cases in R side. ## How was this patch tested? Unit tests in `test_sparkSQL.R`. Author: hyukjinkwon <gurwls223@gmail.com> Closes #17439 from HyukjinKwon/r-typestring-tests.	2017-03-27 10:43:00 -07:00
hyukjinkwon	3fbf0a5f92	[MINOR][DOCS] Match several documentation changes in Scala to R/Python ## What changes were proposed in this pull request? This PR proposes to match minor documentations changes in https://github.com/apache/spark/pull/17399 and https://github.com/apache/spark/pull/17380 to R/Python. ## How was this patch tested? Manual tests in Python , Python tests via `./python/run-tests.py --module=pyspark-sql` and lint-checks for Python/R. Author: hyukjinkwon <gurwls223@gmail.com> Closes #17429 from HyukjinKwon/minor-match-doc.	2017-03-26 18:40:00 -07:00
Yanbo Liang	478fbc866f	[SPARK-19925][SPARKR] Fix SparkR spark.getSparkFiles fails when it was called on executors. ## What changes were proposed in this pull request? SparkR ```spark.getSparkFiles``` fails when it was called on executors, see details at [SPARK-19925](https://issues.apache.org/jira/browse/SPARK-19925). ## How was this patch tested? Add unit tests, and verify this fix at standalone and yarn cluster. Author: Yanbo Liang <ybliang8@gmail.com> Closes #17274 from yanboliang/spark-19925.	2017-03-21 21:50:54 -07:00
Felix Cheung	a8877bdbba	[SPARK-19237][SPARKR][CORE] On Windows spark-submit should handle when java is not installed ## What changes were proposed in this pull request? When SparkR is installed as a R package there might not be any java runtime. If it is not there SparkR's `sparkR.session()` will block waiting for the connection timeout, hanging the R IDE/shell, without any notification or message. ## How was this patch tested? manually - [x] need to test on Windows Author: Felix Cheung <felixcheung_m@hotmail.com> Closes #16596 from felixcheung/rcheckjava.	2017-03-21 14:24:41 -07:00
Zheng RuiFeng	63f077fbe5	[SPARK-20041][DOC] Update docs for NaN handling in approxQuantile ## What changes were proposed in this pull request? Update docs for NaN handling in approxQuantile. ## How was this patch tested? existing tests. Author: Zheng RuiFeng <ruifengz@foxmail.com> Closes #17369 from zhengruifeng/doc_quantiles_nan.	2017-03-21 08:45:59 -07:00
Wenchen Fan	68d65fae71	[SPARK-19949][SQL] unify bad record handling in CSV and JSON ## What changes were proposed in this pull request? Currently JSON and CSV have exactly the same logic about handling bad records, this PR tries to abstract it and put it in a upper level to reduce code duplication. The overall idea is, we make the JSON and CSV parser to throw a BadRecordException, then the upper level, FailureSafeParser, handles bad records according to the parse mode. Behavior changes: 1. with PERMISSIVE mode, if the number of tokens doesn't match the schema, previously CSV parser will treat it as a legal record and parse as many tokens as possible. After this PR, we treat it as an illegal record, and put the raw record string in a special column, but we still parse as many tokens as possible. 2. all logging is removed as they are not very useful in practice. ## How was this patch tested? existing tests Author: Wenchen Fan <wenchen@databricks.com> Author: hyukjinkwon <gurwls223@gmail.com> Author: Wenchen Fan <cloud0fan@gmail.com> Closes #17315 from cloud-fan/bad-record2.	2017-03-20 21:43:14 -07:00
Felix Cheung	f14f81e900	[SPARK-20020][SPARKR][FOLLOWUP] DataFrame checkpoint API fix version tag ## What changes were proposed in this pull request? doc only change ## How was this patch tested? manual Author: Felix Cheung <felixcheung_m@hotmail.com> Closes #17356 from felixcheung/rdfcheckpoint2.	2017-03-19 23:49:26 -07:00
Felix Cheung	c40597720e	[SPARK-20020][SPARKR] DataFrame checkpoint API ## What changes were proposed in this pull request? Add checkpoint, setCheckpointDir API to R ## How was this patch tested? unit tests, manual tests Author: Felix Cheung <felixcheung_m@hotmail.com> Closes #17351 from felixcheung/rdfcheckpoint.	2017-03-19 22:34:18 -07:00
hyukjinkwon	0cdcf91145	[SPARK-19849][SQL] Support ArrayType in to_json to produce JSON array ## What changes were proposed in this pull request? This PR proposes to support an array of struct type in `to_json` as below: ```scala import org.apache.spark.sql.functions._ val df = Seq(Tuple1(Tuple1(1) :: Nil)).toDF("a") df.select(to_json($"a").as("json")).show() ``` ``` +----------+ \| json\| +----------+ \|[{"_1":1}]\| +----------+ ``` Currently, it throws an exception as below (a newline manually inserted for readability): ``` org.apache.spark.sql.AnalysisException: cannot resolve 'structtojson(`array`)' due to data type mismatch: structtojson requires that the expression is a struct expression.;; ``` This allows the roundtrip with `from_json` as below: ```scala import org.apache.spark.sql.functions._ import org.apache.spark.sql.types._ val schema = ArrayType(StructType(StructField("a", IntegerType) :: Nil)) val df = Seq("""[{"a":1}, {"a":2}]""").toDF("json").select(from_json($"json", schema).as("array")) df.show() // Read back. df.select(to_json($"array").as("json")).show() ``` ``` +----------+ \| array\| +----------+ \|[[1], [2]]\| +----------+ +-----------------+ \| json\| +-----------------+ \|[{"a":1},{"a":2}]\| +-----------------+ ``` Also, this PR proposes to rename from `StructToJson` to `StructsToJson ` and `JsonToStruct` to `JsonToStructs`. ## How was this patch tested? Unit tests in `JsonFunctionsSuite` and `JsonExpressionsSuite` for Scala, doctest for Python and test in `test_sparkSQL.R` for R. Author: hyukjinkwon <gurwls223@gmail.com> Closes #17192 from HyukjinKwon/SPARK-19849.	2017-03-19 22:33:01 -07:00
Felix Cheung	422aa67d1b	[SPARK-18817][SPARKR][SQL] change derby log output to temp dir ## What changes were proposed in this pull request? Passes R `tempdir()` (this is the R session temp dir, shared with other temp files/dirs) to JVM, set System.Property for derby home dir to move derby.log ## How was this patch tested? Manually, unit tests With this, these are relocated to under /tmp ``` # ls /tmp/RtmpG2M0cB/ derby.log ``` And they are removed automatically when the R session is ended. Author: Felix Cheung <felixcheung_m@hotmail.com> Closes #16330 from felixcheung/rderby.	2017-03-19 10:37:15 -07:00
hyukjinkwon	60262bc951	[MINOR][R] Reorder `Collate` fields in DESCRIPTION file ## What changes were proposed in this pull request? It seems cran check scripts corrects `R/pkg/DESCRIPTION` and follows the order in `Collate` fields. This PR proposes to fix this so that running this script does not show up a diff in this file. ## How was this patch tested? Manually via `./R/check-cran.sh`. Author: hyukjinkwon <gurwls223@gmail.com> Closes #17349 from HyukjinKwon/minor-cran.	2017-03-19 10:30:34 -07:00
Felix Cheung	5c165596da	[SPARK-19654][SPARKR][SS] Structured Streaming API for R ## What changes were proposed in this pull request? Add "experimental" API for SS in R ## How was this patch tested? manual, unit tests Author: Felix Cheung <felixcheung_m@hotmail.com> Closes #16982 from felixcheung/rss.	2017-03-18 16:26:48 -07:00
hyukjinkwon	d1f6c64c4b	[SPARK-19828][R] Support array type in from_json in R ## What changes were proposed in this pull request? Since we could not directly define the array type in R, this PR proposes to support array types in R as string types that are used in `structField` as below: ```R jsonArr <- "[{\"name\":\"Bob\"}, {\"name\":\"Alice\"}]" df <- as.DataFrame(list(list("people" = jsonArr))) collect(select(df, alias(from_json(df$people, "array<struct<name:string>>"), "arrcol"))) ``` prints ```R arrcol 1 Bob, Alice ``` ## How was this patch tested? Unit tests in `test_sparkSQL.R`. Author: hyukjinkwon <gurwls223@gmail.com> Closes #17178 from HyukjinKwon/SPARK-19828.	2017-03-14 19:51:25 -07:00
actuaryzhang	f6314eab4b	[SPARK-19391][SPARKR][ML] Tweedie GLM API for SparkR ## What changes were proposed in this pull request? Port Tweedie GLM #16344 to SparkR felixcheung yanboliang ## How was this patch tested? new test in SparkR Author: actuaryzhang <actuaryzhang10@gmail.com> Closes #16729 from actuaryzhang/sparkRTweedie.	2017-03-14 00:50:38 -07:00
Xin Ren	9f8ce4825e	[SPARK-19282][ML][SPARKR] RandomForest Wrapper and GBT Wrapper return param "maxDepth" to R models ## What changes were proposed in this pull request? RandomForest R Wrapper and GBT R Wrapper return param `maxDepth` to R models. Below 4 R wrappers are changed: * `RandomForestClassificationWrapper` * `RandomForestRegressionWrapper` * `GBTClassificationWrapper` * `GBTRegressionWrapper` ## How was this patch tested? Test manually on my local machine. Author: Xin Ren <iamshrek@126.com> Closes #17207 from keypointt/SPARK-19282.	2017-03-12 12:15:19 -07:00
Xiao Li	9a6ac7226f	[SPARK-19601][SQL] Fix CollapseRepartition rule to preserve shuffle-enabled Repartition ### What changes were proposed in this pull request? Observed by felixcheung in https://github.com/apache/spark/pull/16739, when users use the shuffle-enabled `repartition` API, they expect the partition they got should be the exact number they provided, even if they call shuffle-disabled `coalesce` later. Currently, `CollapseRepartition` rule does not consider whether shuffle is enabled or not. Thus, we got the following unexpected result. ```Scala val df = spark.range(0, 10000, 1, 5) val df2 = df.repartition(10) assert(df2.coalesce(13).rdd.getNumPartitions == 5) assert(df2.coalesce(7).rdd.getNumPartitions == 5) assert(df2.coalesce(3).rdd.getNumPartitions == 3) ``` This PR is to fix the issue. We preserve shuffle-enabled Repartition. ### How was this patch tested? Added a test case Author: Xiao Li <gatorsmile@gmail.com> Closes #16933 from gatorsmile/CollapseRepartition.	2017-03-08 09:36:01 -08:00
actuaryzhang	1f6c090c15	[SPARK-19818][SPARKR] rbind should check for name consistency of input data frames ## What changes were proposed in this pull request? Added checks for name consistency of input data frames in union. ## How was this patch tested? new test. Author: actuaryzhang <actuaryzhang10@gmail.com> Closes #17159 from actuaryzhang/sparkRUnion.	2017-03-06 21:55:11 -08:00
Felix Cheung	80d5338b32	[SPARK-19795][SPARKR] add column functions to_json, from_json ## What changes were proposed in this pull request? Add column functions: to_json, from_json, and tests covering error cases. ## How was this patch tested? unit tests, manual Author: Felix Cheung <felixcheung_m@hotmail.com> Closes #17134 from felixcheung/rtojson.	2017-03-05 12:37:02 -08:00
Yuming Wang	6b0cfd9fa5	[SPARK-19550][SPARKR][DOCS] Update R document to use JDK8 ## What changes were proposed in this pull request? Update R document to use JDK8. ## How was this patch tested? manual tests Author: Yuming Wang <wgyumg@gmail.com> Closes #17162 from wangyum/SPARK-19550.	2017-03-04 16:43:31 +00:00
Felix Cheung	8d6ef895ee	[SPARK-18352][DOCS] wholeFile JSON update doc and programming guide ## What changes were proposed in this pull request? Update doc for R, programming guide. Clarify default behavior for all languages. ## How was this patch tested? manually Author: Felix Cheung <felixcheung_m@hotmail.com> Closes #17128 from felixcheung/jsonwholefiledoc.	2017-03-02 01:02:38 -08:00
actuaryzhang	2ff1467d67	[DOC][MINOR][SPARKR] Update SparkR doc for names, columns and colnames Update R doc: 1. columns, names and colnames returns a vector of strings, not list as in current doc. 2. `colnames<-` does allow the subset assignment, so the length of `value` can be less than the number of columns, e.g., `colnames(df)[1] <- "a"`. felixcheung Author: actuaryzhang <actuaryzhang10@gmail.com> Closes #17115 from actuaryzhang/sparkRMinorDoc.	2017-03-01 12:35:56 -08:00
wm624@hotmail.com	89cd3845b6	[SPARK-19460][SPARKR] Update dataset used in R documentation, examples to reduce warning noise and confusions ## What changes were proposed in this pull request? Replace `iris` dataset with `Titanic` or other dataset in example and document. ## How was this patch tested? Manual and existing test Author: wm624@hotmail.com <wm624@hotmail.com> Closes #17032 from wangmiao1981/example.	2017-02-28 22:31:35 -08:00
Yuming Wang	9b8eca65dc	[SPARK-19660][CORE][SQL] Replace the configuration property names that are deprecated in the version of Hadoop 2.6 ## What changes were proposed in this pull request? Replace all the Hadoop deprecated configuration property names according to [DeprecatedProperties](https://hadoop.apache.org/docs/r2.6.0/hadoop-project-dist/hadoop-common/DeprecatedProperties.html). except: https://github.com/apache/spark/blob/v2.1.0/python/pyspark/sql/tests.py#L1533 https://github.com/apache/spark/blob/v2.1.0/sql/core/src/test/scala/org/apache/spark/sql/SQLQuerySuite.scala#L987 https://github.com/apache/spark/blob/v2.1.0/sql/core/src/main/scala/org/apache/spark/sql/execution/command/SetCommand.scala#L45 https://github.com/apache/spark/blob/v2.1.0/sql/core/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala#L614 ## How was this patch tested? Existing tests Author: Yuming Wang <wgyumg@gmail.com> Closes #16990 from wangyum/HadoopDeprecatedProperties.	2017-02-28 10:13:42 +00:00
actuaryzhang	7bf09433f5	[SPARK-19682][SPARKR] Issue warning (or error) when subset method "[[" takes vector index ## What changes were proposed in this pull request? The `[[` method is supposed to take a single index and return a column. This is different from base R which takes a vector index. We should check for this and issue warning or error when vector index is supplied (which is very likely given the behavior in base R). Currently I'm issuing a warning message and just take the first element of the vector index. We could change this to an error it that's better. ## How was this patch tested? new tests Author: actuaryzhang <actuaryzhang10@gmail.com> Closes #17017 from actuaryzhang/sparkRSubsetter.	2017-02-23 11:12:02 -08:00
wm624@hotmail.com	1f86e795b8	[SPARK-19616][SPARKR] weightCol and aggregationDepth should be improved for some SparkR APIs ## What changes were proposed in this pull request? This is a follow-up PR of #16800 When doing SPARK-19456, we found that "" should be consider a NULL column name and should not be set. aggregationDepth should be exposed as an expert parameter. ## How was this patch tested? Existing tests. Author: wm624@hotmail.com <wm624@hotmail.com> Closes #16945 from wangmiao1981/svc.	2017-02-22 11:50:24 -08:00
wm624@hotmail.com	8b57ea4a1e	[SPARK-19639][SPARKR][EXAMPLE] Add spark.svmLinear example and update vignettes ## What changes were proposed in this pull request? We recently add the spark.svmLinear API for SparkR. We need to add an example and update the vignettes. ## How was this patch tested? Manually run example. Author: wm624@hotmail.com <wm624@hotmail.com> Closes #16969 from wangmiao1981/example.	2017-02-17 21:21:10 -08:00
Yanbo Liang	b406598382	[SPARK-18285][SPARKR] SparkR approxQuantile supports input multiple columns ## What changes were proposed in this pull request? SparkR ```approxQuantile``` supports input multiple columns. ## How was this patch tested? Unit test. Author: Yanbo Liang <ybliang8@gmail.com> Closes #16951 from yanboliang/spark-19619.	2017-02-17 11:58:39 -08:00
Felix Cheung	671bc08ed5	[SPARK-19399][SPARKR] Add R coalesce API for DataFrame and Column ## What changes were proposed in this pull request? Add coalesce on DataFrame for down partitioning without shuffle and coalesce on Column ## How was this patch tested? manual, unit tests Author: Felix Cheung <felixcheung_m@hotmail.com> Closes #16739 from felixcheung/rcoalesce.	2017-02-15 10:45:37 -08:00
wm624@hotmail.com	3973403d5d	[SPARK-19456][SPARKR] Add LinearSVC R API ## What changes were proposed in this pull request? Linear SVM classifier is newly added into ML and python API has been added. This JIRA is to add R side API. Marked as WIP, as I am designing unit tests. ## How was this patch tested? Please review http://spark.apache.org/contributing.html before opening a pull request. Author: wm624@hotmail.com <wm624@hotmail.com> Closes #16800 from wangmiao1981/svc.	2017-02-15 01:15:50 -08:00
Felix Cheung	a3626ca333	[SPARK-19387][SPARKR] Tests do not run with SparkR source package in CRAN check ## What changes were proposed in this pull request? - this is cause by changes in SPARK-18444, SPARK-18643 that we no longer install Spark when `master = ""` (default), but also related to SPARK-18449 since the real `master` value is not known at the time the R code in `sparkR.session` is run. (`master` cannot default to "local" since it could be overridden by spark-submit commandline or spark config) - as a result, while running SparkR as a package in IDE is working fine, CRAN check is not as it is launching it via non-interactive script - fix is to add check to the beginning of each test and vignettes; the same would also work by changing `sparkR.session()` to `sparkR.session(master = "local")` in tests, but I think being more explicit is better. ## How was this patch tested? Tested this by reverting version to 2.1, since it needs to download the release jar with matching version. But since there are changes in 2.2 (specifically around SparkR ML) that are incompatible with 2.1, some tests are failing in this config. Will need to port this to branch-2.1 and retest with 2.1 release jar. manually as: ``` # modify DESCRIPTION to revert version to 2.1.0 SPARK_HOME=/usr/spark R CMD build pkg # run cran check without SPARK_HOME R CMD check --as-cran SparkR_2.1.0.tar.gz ``` Author: Felix Cheung <felixcheung_m@hotmail.com> Closes #16720 from felixcheung/rcranchecktest.	2017-02-14 13:51:27 -08:00
titicaca	bc0a0e6392	[SPARK-19342][SPARKR] bug fixed in collect method for collecting timestamp column ## What changes were proposed in this pull request? Fix a bug in collect method for collecting timestamp column, the bug can be reproduced as shown in the following codes and outputs: ``` library(SparkR) sparkR.session(master = "local") df <- data.frame(col1 = c(0, 1, 2), col2 = c(as.POSIXct("2017-01-01 00:00:01"), NA, as.POSIXct("2017-01-01 12:00:01"))) sdf1 <- createDataFrame(df) print(dtypes(sdf1)) df1 <- collect(sdf1) print(lapply(df1, class)) sdf2 <- filter(sdf1, "col1 > 0") print(dtypes(sdf2)) df2 <- collect(sdf2) print(lapply(df2, class)) ``` As we can see from the printed output, the column type of col2 in df2 is converted to numeric unexpectedly, when NA exists at the top of the column. This is caused by method `do.call(c, list)`, if we convert a list, i.e. `do.call(c, list(NA, as.POSIXct("2017-01-01 12:00:01"))`, the class of the result is numeric instead of POSIXct. Therefore, we need to cast the data type of the vector explicitly. ## How was this patch tested? The patch can be tested manually with the same code above. Author: titicaca <fangzhou.yang@hotmail.com> Closes #16689 from titicaca/sparkr-dev.	2017-02-12 10:42:15 -08:00
Dongjoon Hyun	c618ccdbe9	[SPARK-19464][BUILD][HOTFIX] run-tests should use hadoop2.6 ## What changes were proposed in this pull request? After SPARK-19464, SparkPullRequestBuilder fails because it still tries to use hadoop2.3. BEFORE https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/72595/console ``` ======================================================================== Building Spark ======================================================================== [error] Could not find hadoop2.3 in the list. Valid options are ['hadoop2.6', 'hadoop2.7'] Attempting to post to Github... > Post successful. ``` AFTER https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/72595/console ``` ======================================================================== Building Spark ======================================================================== [info] Building Spark (w/Hive 1.2.1) using SBT with these arguments: -Phadoop-2.6 -Pmesos -Pkinesis-asl -Pyarn -Phive-thriftserver -Phive test:package streaming-kafka-0-8-assembly/assembly streaming-flume-assembly/assembly streaming-kinesis-asl-assembly/assembly Using /usr/java/jdk1.8.0_60 as default JAVA_HOME. Note, this will be overridden by -java-home if it is set. ``` ## How was this patch tested? Pass the existing test. Author: Dongjoon Hyun <dongjoon@apache.org> Closes #16858 from dongjoon-hyun/hotfix_run-tests.	2017-02-08 21:28:04 +00:00
anabranch	7a7ce272fe	[SPARK-16609] Add to_date/to_timestamp with format functions ## What changes were proposed in this pull request? This pull request adds two new user facing functions: - `to_date` which accepts an expression and a format and returns a date. - `to_timestamp` which accepts an expression and a format and returns a timestamp. For example, Given a date in format: `2016-21-05`. (YYYY-dd-MM) ### Date Function Previously ``` to_date(unix_timestamp(lit("2016-21-05"), "yyyy-dd-MM").cast("timestamp")) ``` Current ``` to_date(lit("2016-21-05"), "yyyy-dd-MM") ``` ### Timestamp Function Previously ``` unix_timestamp(lit("2016-21-05"), "yyyy-dd-MM").cast("timestamp") ``` Current ``` to_timestamp(lit("2016-21-05"), "yyyy-dd-MM") ``` ### Tasks - [X] Add `to_date` to Scala Functions - [x] Add `to_date` to Python Functions - [x] Add `to_date` to SQL Functions - [X] Add `to_timestamp` to Scala Functions - [x] Add `to_timestamp` to Python Functions - [x] Add `to_timestamp` to SQL Functions - [x] Add function to R ## How was this patch tested? - [x] Add Functions to `DateFunctionsSuite` - Test new `ParseToTimestamp` Expression (not necessary) - Test new `ParseToDate` Expression (not necessary) - [x] Add test for R - [x] Add test for Python in test.py Please review http://spark.apache.org/contributing.html before opening a pull request. Author: anabranch <wac.chambers@gmail.com> Author: Bill Chambers <bill@databricks.com> Author: anabranch <bill@databricks.com> Closes #16138 from anabranch/SPARK-16609.	2017-02-07 15:50:30 +01:00
actuaryzhang	b94f4b6fa6	[SPARK-19452][SPARKR] Fix bug in the name assignment method ## What changes were proposed in this pull request? The names method fails to check for validity of the assignment values. This can be fixed by calling colnames within names. ## How was this patch tested? new tests. Author: actuaryzhang <actuaryzhang10@gmail.com> Closes #16794 from actuaryzhang/sparkRNames.	2017-02-05 11:37:45 -08:00
actuaryzhang	050c20cc90	[SPARK-19386][SPARKR][FOLLOWUP] fix error in vignettes ## What changes were proposed in this pull request? Current version has error in vignettes: ``` model <- spark.bisectingKmeans(df, Sepal_Length ~ Sepal_Width, k = 4) summary(kmeansModel) ``` `kmeansModel` does not exist... felixcheung wangmiao1981 Author: actuaryzhang <actuaryzhang10@gmail.com> Closes #16799 from actuaryzhang/sparkRVignettes.	2017-02-03 18:02:10 -08:00
krishnakalyan3	48aafeda7d	[SPARK-19386][SPARKR][DOC] Bisecting k-means in SparkR documentation ## What changes were proposed in this pull request? Update programming guide, example and vignette with Bisecting k-means. Author: krishnakalyan3 <krishnakalyan3@gmail.com> Closes #16767 from krishnakalyan3/bisecting-kmeans.	2017-02-03 12:19:47 -08:00
wm624@hotmail.com	9ac05225e8	[SPARK-19319][SPARKR] SparkR Kmeans summary returns error when the cluster size doesn't equal to k ## What changes were proposed in this pull request When Kmeans using initMode = "random" and some random seed, it is possible the actual cluster size doesn't equal to the configured `k`. In this case, summary(model) returns error due to the number of cols of coefficient matrix doesn't equal to k. Example: > col1 <- c(1, 2, 3, 4, 0, 1, 2, 3, 4, 0) > col2 <- c(1, 2, 3, 4, 0, 1, 2, 3, 4, 0) > col3 <- c(1, 2, 3, 4, 0, 1, 2, 3, 4, 0) > cols <- as.data.frame(cbind(col1, col2, col3)) > df <- createDataFrame(cols) > > model2 <- spark.kmeans(data = df, ~ ., k = 5, maxIter = 10, initMode = "random", seed = 22222, tol = 1E-5) > > summary(model2) Error in `colnames<-`(`tmp`, value = c("col1", "col2", "col3")) : length of 'dimnames' [2] not equal to array extent In addition: Warning message: In matrix(coefficients, ncol = k) : data length [9] is not a sub-multiple or multiple of the number of rows [2] Fix: Get the actual cluster size in the summary and use it to build the coefficient matrix. ## How was this patch tested? Add unit tests. Author: wm624@hotmail.com <wm624@hotmail.com> Closes #16666 from wangmiao1981/kmeans.	2017-01-31 21:16:37 -08:00
actuaryzhang	ce112cec4f	[SPARK-19395][SPARKR] Convert coefficients in summary to matrix ## What changes were proposed in this pull request? The `coefficients` component in model summary should be 'matrix' but the underlying structure is indeed list. This affects several models except for 'AFTSurvivalRegressionModel' which has the correct implementation. The fix is to first `unlist` the coefficients returned from the `callJMethod` before converting to matrix. An example illustrates the issues: ``` data(iris) df <- createDataFrame(iris) model <- spark.glm(df, Sepal_Length ~ Sepal_Width, family = "gaussian") s <- summary(model) > str(s$coefficients) List of 8 $ : num 6.53 $ : num -0.223 $ : num 0.479 $ : num 0.155 $ : num 13.6 $ : num -1.44 $ : num 0 $ : num 0.152 - attr(, "dim")= int [1:2] 2 4 - attr(, "dimnames")=List of 2 ..$ : chr [1:2] "(Intercept)" "Sepal_Width" ..$ : chr [1:4] "Estimate" "Std. Error" "t value" "Pr(>\|t\|)" > s$coefficients[, 2] $`(Intercept)` [1] 0.4788963 $Sepal_Width [1] 0.1550809 ``` This shows that the underlying structure of coefficients is still `list`. felixcheung wangmiao1981 Author: actuaryzhang <actuaryzhang10@gmail.com> Closes #16730 from actuaryzhang/sparkRCoef.	2017-01-31 12:20:43 -08:00
Felix Cheung	be7425e26a	[SPARKR][DOCS] update R API doc for subset/extract ## What changes were proposed in this pull request? With extract `[[` or replace `[[<-`, the parameter `i` is a column index, that needs to be corrected in doc. Also a few minor updates: examples, links. ## How was this patch tested? manual Author: Felix Cheung <felixcheung_m@hotmail.com> Closes #16721 from felixcheung/rsubsetdoc.	2017-01-30 18:47:14 -08:00
Felix Cheung	a7ab6f9a8f	[SPARK-19324][SPARKR] Spark VJM stdout output is getting dropped in SparkR ## What changes were proposed in this pull request? This affects mostly running job from the driver in client mode when results are expected to be through stdout (which should be somewhat rare, but possible) Before: ``` > a <- as.DataFrame(cars) > b <- group_by(a, "dist") > c <- count(b) > sparkR.callJMethod(c$countjc, "explain", TRUE) NULL ``` After: ``` > a <- as.DataFrame(cars) > b <- group_by(a, "dist") > c <- count(b) > sparkR.callJMethod(c$countjc, "explain", TRUE) count#11L NULL ``` Now, `column.explain()` doesn't seem very useful (we can get more extensive output with `DataFrame.explain()`) but there are other more complex examples with calls of `println` in Scala/JVM side, that are getting dropped. ## How was this patch tested? manual Author: Felix Cheung <felixcheung_m@hotmail.com> Closes #16670 from felixcheung/rjvmstdout.	2017-01-27 12:41:35 -08:00
Felix Cheung	385d73848b	[SPARK-19333][SPARKR] Add Apache License headers to R files ## What changes were proposed in this pull request? add header ## How was this patch tested? Manual run to check vignettes html is created properly Author: Felix Cheung <felixcheung_m@hotmail.com> Closes #16709 from felixcheung/rfilelicense.	2017-01-27 10:31:28 -08:00
Felix Cheung	90817a6cd0	[SPARK-18788][SPARKR] Add API for getNumPartitions ## What changes were proposed in this pull request? With doc to say this would convert DF into RDD ## How was this patch tested? unit tests, manual tests Author: Felix Cheung <felixcheung_m@hotmail.com> Closes #16668 from felixcheung/rgetnumpartitions.	2017-01-26 21:06:39 -08:00
wm624@hotmail.com	c0ba284300	[SPARK-18821][SPARKR] Bisecting k-means wrapper in SparkR ## What changes were proposed in this pull request? Add R wrapper for bisecting Kmeans. As JIRA is down, I will update title to link with corresponding JIRA later. ## How was this patch tested? Add new unit tests. Author: wm624@hotmail.com <wm624@hotmail.com> Closes #16566 from wangmiao1981/bk.	2017-01-26 21:01:59 -08:00
Felix Cheung	f27e024768	[SPARK-18823][SPARKR] add support for assigning to column ## What changes were proposed in this pull request? Support for ``` df[[myname]] <- 1 df[[2]] <- df$eruptions ``` ## How was this patch tested? manual tests, unit tests Author: Felix Cheung <felixcheung_m@hotmail.com> Closes #16663 from felixcheung/rcolset.	2017-01-24 00:23:23 -08:00
Yanbo Liang	0c589e3713	[SPARK-19291][SPARKR][ML] spark.gaussianMixture supports output log-likelihood. ## What changes were proposed in this pull request? ```spark.gaussianMixture``` supports output total log-likelihood for the model like R ```mvnormalmixEM```. ## How was this patch tested? R unit test. Author: Yanbo Liang <ybliang8@gmail.com> Closes #16646 from yanboliang/spark-19291.	2017-01-21 21:26:14 -08:00
Felix Cheung	278fa1eb30	[SPARK-19231][SPARKR] add error handling for download and untar for Spark release ## What changes were proposed in this pull request? When R is starting as a package and it needs to download the Spark release distribution we need to handle error for download and untar, and clean up, otherwise it will get stuck. ## How was this patch tested? manually Author: Felix Cheung <felixcheung_m@hotmail.com> Closes #16589 from felixcheung/rtarreturncode.	2017-01-18 09:53:14 -08:00
Felix Cheung	c84f7d3e1b	[SPARK-18828][SPARKR] Refactor scripts for R ## What changes were proposed in this pull request? Refactored script to remove duplications and clearer purpose for each script ## How was this patch tested? manually Author: Felix Cheung <felixcheung_m@hotmail.com> Closes #16249 from felixcheung/rscripts.	2017-01-16 13:49:12 -08:00
Felix Cheung	a115a54399	[SPARK-19232][SPARKR] Update Spark distribution download cache location on Windows ## What changes were proposed in this pull request? Windows seems to be the only place with appauthor in the path, for which we should say "Apache" (and case sensitive) Current path of `AppData\Local\spark\spark\Cache` is a bit odd. ## How was this patch tested? manual. Author: Felix Cheung <felixcheung_m@hotmail.com> Closes #16590 from felixcheung/rcachedir.	2017-01-16 09:35:52 -08:00
wm624@hotmail.com	12c8c21608	[SPARK-19066][SPARKR] SparkR LDA doesn't set optimizer correctly ## What changes were proposed in this pull request? spark.lda passes the optimizer "em" or "online" as a string to the backend. However, LDAWrapper doesn't set optimizer based on the value from R. Therefore, for optimizer "em", the `isDistributed` field is FALSE, which should be TRUE based on scala code. In addition, the `summary` method should bring back the results related to `DistributedLDAModel`. ## How was this patch tested? Manual tests by comparing with scala example. Modified the current unit test: fix the incorrect unit test and add necessary tests for `summary` method. Author: wm624@hotmail.com <wm624@hotmail.com> Closes #16464 from wangmiao1981/new.	2017-01-16 06:05:59 -08:00
Felix Cheung	b0e8eb6d3e	[SPARK-18335][SPARKR] createDataFrame to support numPartitions parameter ## What changes were proposed in this pull request? To allow specifying number of partitions when the DataFrame is created ## How was this patch tested? manual, unit tests Author: Felix Cheung <felixcheung_m@hotmail.com> Closes #16512 from felixcheung/rnumpart.	2017-01-13 10:08:14 -08:00
wm624@hotmail.com	7f24a0b6c3	[SPARK-19142][SPARKR] spark.kmeans should take seed, initSteps, and tol as parameters ## What changes were proposed in this pull request? spark.kmeans doesn't have interface to set initSteps, seed and tol. As Spark Kmeans algorithm doesn't take the same set of parameters as R kmeans, we should maintain a different interface in spark.kmeans. Add missing parameters and corresponding document. Modified existing unit tests to take additional parameters. Author: wm624@hotmail.com <wm624@hotmail.com> Closes #16523 from wangmiao1981/kmeans.	2017-01-12 22:27:57 -08:00
Felix Cheung	d749c06677	[SPARK-19130][SPARKR] Support setting literal value as column implicitly ## What changes were proposed in this pull request? ``` df$foo <- 1 ``` instead of ``` df$foo <- lit(1) ``` ## How was this patch tested? unit tests Author: Felix Cheung <felixcheung_m@hotmail.com> Closes #16510 from felixcheung/rlitcol.	2017-01-11 08:29:09 -08:00
Felix Cheung	9bc3507e41	[SPARK-19133][SPARKR][ML] fix glm for Gamma, clarify glm family supported ## What changes were proposed in this pull request? R family is a longer list than what Spark supports. ## How was this patch tested? manual Author: Felix Cheung <felixcheung_m@hotmail.com> Closes #16511 from felixcheung/rdocglmfamily.	2017-01-10 11:42:07 -08:00
anabranch	19d9d4c855	[SPARK-19126][DOCS] Update Join Documentation Across Languages ## What changes were proposed in this pull request? - [X] Make sure all join types are clearly mentioned - [X] Make join labeling/style consistent - [X] Make join label ordering docs the same - [X] Improve join documentation according to above for Scala - [X] Improve join documentation according to above for Python - [X] Improve join documentation according to above for R ## How was this patch tested? No tests b/c docs. Please review http://spark.apache.org/contributing.html before opening a pull request. Author: anabranch <wac.chambers@gmail.com> Closes #16504 from anabranch/SPARK-19126.	2017-01-08 20:37:46 -08:00
anabranch	1f6ded6455	[SPARK-19127][DOCS] Update Rank Function Documentation ## What changes were proposed in this pull request? - [X] Fix inconsistencies in function reference for dense rank and dense - [X] Make all languages equivalent in their reference to `dense_rank` and `rank`. ## How was this patch tested? N/A for docs. Please review http://spark.apache.org/contributing.html before opening a pull request. Author: anabranch <wac.chambers@gmail.com> Closes #16505 from anabranch/SPARK-19127.	2017-01-08 17:53:53 -08:00
Yanbo Liang	6b6b555a1e	[SPARK-18862][SPARKR][ML] Split SparkR mllib.R into multiple files ## What changes were proposed in this pull request? SparkR ```mllib.R``` is getting bigger as we add more ML wrappers, I'd like to split it into multiple files to make us easy to maintain: * mllib_classification.R * mllib_clustering.R * mllib_recommendation.R * mllib_regression.R * mllib_stat.R * mllib_tree.R * mllib_utils.R Note: Only reorg, no actual code change. ## How was this patch tested? Existing tests. Author: Yanbo Liang <ybliang8@gmail.com> Closes #16312 from yanboliang/spark-18862.	2017-01-08 01:10:36 -08:00
Yanbo Liang	cdda3372a3	[MINOR] Bump R version to 2.2.0. ## What changes were proposed in this pull request? #16126 bumps master branch version to 2.2.0-SNAPSHOT, but it seems R version was omitted. ## How was this patch tested? N/A Author: Yanbo Liang <ybliang8@gmail.com> Closes #16488 from yanboliang/r-version.	2017-01-07 14:33:17 +00:00
Felix Cheung	17579bda3c	[SPARK-18958][SPARKR] R API toJSON on DataFrame ## What changes were proposed in this pull request? It would make it easier to integrate with other component expecting row-based JSON format. This replaces the non-public toJSON RDD API. ## How was this patch tested? manual, unit tests Author: Felix Cheung <felixcheung_m@hotmail.com> Closes #16368 from felixcheung/rJSON.	2016-12-22 20:54:38 -08:00
Felix Cheung	7e8994ffd3	[SPARK-18903][SPARKR] Add API to get SparkUI URL ## What changes were proposed in this pull request? API for SparkUI URL from SparkContext ## How was this patch tested? manual, unit tests Author: Felix Cheung <felixcheung_m@hotmail.com> Closes #16367 from felixcheung/rwebui.	2016-12-21 17:21:17 -08:00
Felix Cheung	38fd163d0d	[SPARK-18849][ML][SPARKR][DOC] vignettes final check reorg ## What changes were proposed in this pull request? Reorganizing content (copy/paste) ## How was this patch tested? https://felixcheung.github.io/sparkr-vignettes.html Previous: https://felixcheung.github.io/sparkr-vignettes_old.html Author: Felix Cheung <felixcheung_m@hotmail.com> Closes #16301 from felixcheung/rvignettespass2.	2016-12-17 14:37:34 -08:00
Dongjoon Hyun	1169db44bc	[SPARK-18897][SPARKR] Fix SparkR SQL Test to drop test table ## What changes were proposed in this pull request? SparkR tests, `R/run-tests.sh`, succeeds only once because `test_sparkSQL.R` does not clean up the test table, `people`. As a result, the rows in `people` table are accumulated at every run and the test cases fail. The following is the failure result for the second run. ```r Failed ------------------------------------------------------------------------- 1. Failure: create DataFrame from RDD (test_sparkSQL.R#204) ------------------- collect(sql("SELECT age from people WHERE name = 'Bob'"))$age not equal to c(16). Lengths differ: 2 vs 1 2. Failure: create DataFrame from RDD (test_sparkSQL.R#206) ------------------- collect(sql("SELECT height from people WHERE name ='Bob'"))$height not equal to c(176.5). Lengths differ: 2 vs 1 ``` ## How was this patch tested? Manual. Run `run-tests.sh` twice and check if it passes without failures. Author: Dongjoon Hyun <dongjoon@apache.org> Closes #16310 from dongjoon-hyun/SPARK-18897.	2016-12-16 11:30:21 -08:00
Felix Cheung	7d858bc5ce	[SPARK-18849][ML][SPARKR][DOC] vignettes final check update ## What changes were proposed in this pull request? doc cleanup ## How was this patch tested? ~~vignettes is not building for me. I'm going to kick off a full clean build and try again and attach output here for review.~~ Output html here: https://felixcheung.github.io/sparkr-vignettes.html Author: Felix Cheung <felixcheung_m@hotmail.com> Closes #16286 from felixcheung/rvignettespass.	2016-12-14 21:51:52 -08:00
wm624@hotmail.com	3243885316	[SPARK-18865][SPARKR] SparkR vignettes MLP and LDA updates ## What changes were proposed in this pull request? When do the QA work, I found that the following issues: 1). `spark.mlp` doesn't include an example; 2). `spark.mlp` and `spark.lda` have redundant parameter explanations; 3). `spark.lda` document misses default values for some parameters. I also changed the `spark.logit` regParam in the examples, as we discussed in #16222. ## How was this patch tested? Manual test Author: wm624@hotmail.com <wm624@hotmail.com> Closes #16284 from wangmiao1981/ks.	2016-12-14 17:07:27 -08:00
Joseph K. Bradley	7862742570	[SPARK-18795][ML][SPARKR][DOC] Added KSTest section to SparkR vignettes ## What changes were proposed in this pull request? Added short section for KSTest. Also added logreg model to list of ML models in vignette. (This will be reorganized under SPARK-18849) ![screen shot 2016-12-14 at 1 37 31 pm](https://cloud.githubusercontent.com/assets/5084283/21202140/7f24e240-c202-11e6-9362-458208bb9159.png) ## How was this patch tested? Manually tested example locally. Built vignettes locally. Author: Joseph K. Bradley <joseph@databricks.com> Closes #16283 from jkbradley/ksTest-vignette.	2016-12-14 14:10:40 -08:00
wm624@hotmail.com	f2ddabfa09	[MINOR][SPARKR] fix kstest example error and add unit test ## What changes were proposed in this pull request? While adding vignettes for kstest, I found some errors in the example: 1. There is a typo of kstest; 2. print.summary.KStest doesn't work with the example; Fix the example errors; Add a new unit test for print.summary.KStest; ## How was this patch tested? Manual test; Add new unit test; Author: wm624@hotmail.com <wm624@hotmail.com> Closes #16259 from wangmiao1981/ks.	2016-12-13 18:52:05 -08:00
Xiangrui Meng	594b14f1eb	[SPARK-18793][SPARK-18794][R] add spark.randomForest/spark.gbt to vignettes ## What changes were proposed in this pull request? Mention `spark.randomForest` and `spark.gbt` in vignettes. Keep the content minimal since users can type `?spark.randomForest` to see the full doc. cc: jkbradley Author: Xiangrui Meng <meng@databricks.com> Closes #16264 from mengxr/SPARK-18793.	2016-12-13 16:59:09 -08:00
wm624@hotmail.com	2aa16d03db	[SPARK-18797][SPARKR] Update spark.logit in sparkr-vignettes ## What changes were proposed in this pull request? spark.logit is added in 2.1. We need to update spark-vignettes to reflect the changes. This is part of SparkR QA work. ## How was this patch tested? Manual build html. Please see attached image for the result. ![test](https://cloud.githubusercontent.com/assets/5033592/21032237/01b565fe-bd5d-11e6-8b59-4de4b6ef611d.jpeg) Author: wm624@hotmail.com <wm624@hotmail.com> Closes #16222 from wangmiao1981/veg.	2016-12-12 22:41:11 -08:00
Felix Cheung	8a51cfdcad	[SPARK-18810][SPARKR] SparkR install.spark does not work for RCs, snapshots ## What changes were proposed in this pull request? Support overriding the download url (include version directory) in an environment variable, `SPARKR_RELEASE_DOWNLOAD_URL` ## How was this patch tested? unit test, manually testing - snapshot build url - download when spark jar not cached - when spark jar is cached - RC build url - download when spark jar not cached - when spark jar is cached - multiple cached spark versions - starting with sparkR shell To use this, ``` SPARKR_RELEASE_DOWNLOAD_URL=http://this_is_the_url_to_spark_release_tgz R ``` then in R, ``` library(SparkR) # or specify lib.loc sparkR.session() ``` Author: Felix Cheung <felixcheung_m@hotmail.com> Closes #16248 from felixcheung/rinstallurl.	2016-12-12 14:40:41 -08:00
Felix Cheung	3e11d5bfef	[SPARK-18807][SPARKR] Should suppress output print for calls to JVM methods with void return values ## What changes were proposed in this pull request? Several SparkR API calling into JVM methods that have void return values are getting printed out, especially when running in a REPL or IDE. example: ``` > setLogLevel("WARN") NULL ``` We should fix this to make the result more clear. Also found a small change to return value of dropTempView in 2.1 - adding doc and test for it. ## How was this patch tested? manually - I didn't find a expect_*() method in testthat for this Author: Felix Cheung <felixcheung_m@hotmail.com> Closes #16237 from felixcheung/rinvis.	2016-12-09 19:06:05 -08:00
wm624@hotmail.com	86a96034cc	[SPARK-18349][SPARKR] Update R API documentation on ml model summary ## What changes were proposed in this pull request? In this PR, the document of `summary` method is improved in the format: returns summary information of the fitted model, which is a list. The list includes ....... Since `summary` in R is mainly about the model, which is not the same as `summary` object on scala side, if there is one, the scala API doc is not pointed here. In current document, some `return` have `.` and some don't have. `.` is added to missed ones. Since spark.logit `summary` has a big refactoring, this PR doesn't include this one. It will be changed when the `spark.logit` PR is merged. ## How was this patch tested? Manual build. Author: wm624@hotmail.com <wm624@hotmail.com> Closes #16150 from wangmiao1981/audit2.	2016-12-08 22:08:19 -08:00
Felix Cheung	c3d3a9d0e8	[SPARK-18590][SPARKR] build R source package when making distribution ## What changes were proposed in this pull request? This PR has 2 key changes. One, we are building source package (aka bundle package) for SparkR which could be released on CRAN. Two, we should include in the official Spark binary distributions SparkR installed from this source package instead (which would have help/vignettes rds needed for those to work when the SparkR package is loaded in R, whereas earlier approach with devtools does not) But, because of various differences in how R performs different tasks, this PR is a fair bit more complicated. More details below. This PR also includes a few minor fixes. ### more details These are the additional steps in make-distribution; please see [here](https://github.com/apache/spark/blob/master/R/CRAN_RELEASE.md) on what's going to a CRAN release, which is now run during make-distribution.sh. 1. package needs to be installed because the first code block in vignettes is `library(SparkR)` without lib path 2. `R CMD build` will build vignettes (this process runs Spark/SparkR code and captures outputs into pdf documentation) 3. `R CMD check` on the source package will install package and build vignettes again (this time from source packaged) - this is a key step required to release R package on CRAN (will skip tests here but tests will need to pass for CRAN release process to success - ideally, during release signoff we should install from the R source package and run tests) 4. `R CMD Install` on the source package (this is the only way to generate doc/vignettes rds files correctly, not in step # 1) (the output of this step is what we package into Spark dist and sparkr.zip) Alternatively, R CMD build should already be installing the package in a temp directory though it might just be finding this location and set it to lib.loc parameter; another approach is perhaps we could try calling `R CMD INSTALL --build pkg` instead. But in any case, despite installing the package multiple times this is relatively fast. Building vignettes takes a while though. ## How was this patch tested? Manually, CI. Author: Felix Cheung <felixcheung_m@hotmail.com> Closes #16014 from felixcheung/rdist.	2016-12-08 11:29:31 -08:00
Yanbo Liang	97255497d8	[SPARK-18326][SPARKR][ML] Review SparkR ML wrappers API for 2.1 ## What changes were proposed in this pull request? Reviewing SparkR ML wrappers API for 2.1 release, mainly two issues: * Remove ```probabilityCol``` from the argument list of ```spark.logit``` and ```spark.randomForest```. Since it was used when making prediction and should be an argument of ```predict```, and we will work on this at [SPARK-18618](https://issues.apache.org/jira/browse/SPARK-18618) in the next release cycle. * Fix ```spark.als``` params to make it consistent with MLlib. ## How was this patch tested? Existing tests. Author: Yanbo Liang <ybliang8@gmail.com> Closes #16169 from yanboliang/spark-18326.	2016-12-07 20:23:28 -08:00
Sean Owen	79f5f281bb	[SPARK-18678][ML] Skewed reservoir sampling in SamplingUtils ## What changes were proposed in this pull request? Fix reservoir sampling bias for small k. An off-by-one error meant that the probability of replacement was slightly too high -- k/(l-1) after l element instead of k/l, which matters for small k. ## How was this patch tested? Existing test plus new test case. Author: Sean Owen <sowen@cloudera.com> Closes #16129 from srowen/SPARK-18678.	2016-12-07 17:34:45 +08:00
Yanbo Liang	90b59d1bf2	[SPARK-18686][SPARKR][ML] Several cleanup and improvements for spark.logit. ## What changes were proposed in this pull request? Several cleanup and improvements for ```spark.logit```: * ```summary``` should return coefficients matrix, and should output labels for each class if the model is multinomial logistic regression model. * ```summary``` should not return ```areaUnderROC, roc, pr, ...```, since most of them are DataFrame which are less important for R users. Meanwhile, these metrics ignore instance weights (setting all to 1.0) which will be changed in later Spark version. In case it will introduce breaking changes, we do not expose them currently. * SparkR test improvement: comparing the training result with native R glmnet. * Remove argument ```aggregationDepth``` from ```spark.logit```, since it's an expert Param(related with Spark architecture and job execution) that would be used rarely by R users. ## How was this patch tested? Unit tests. The ```summary``` output after this change: multinomial logistic regression: ``` > df <- suppressWarnings(createDataFrame(iris)) > model <- spark.logit(df, Species ~ ., regParam = 0.5) > summary(model) $coefficients versicolor virginica setosa (Intercept) 1.514031 -2.609108 1.095077 Sepal_Length 0.02511006 0.2649821 -0.2900921 Sepal_Width -0.5291215 -0.02016446 0.549286 Petal_Length 0.03647411 0.1544119 -0.190886 Petal_Width 0.000236092 0.4195804 -0.4198165 ``` binomial logistic regression: ``` > df <- suppressWarnings(createDataFrame(iris)) > training <- df[df$Species %in% c("versicolor", "virginica"), ] > model <- spark.logit(training, Species ~ ., regParam = 0.5) > summary(model) $coefficients Estimate (Intercept) -6.053815 Sepal_Length 0.2449379 Sepal_Width 0.1648321 Petal_Length 0.4730718 Petal_Width 1.031947 ``` Author: Yanbo Liang <ybliang8@gmail.com> Closes #16117 from yanboliang/spark-18686.	2016-12-07 00:31:11 -08:00
Felix Cheung	b019b3a8ac	[SPARK-18643][SPARKR] SparkR hangs at session start when installed as a package without Spark ## What changes were proposed in this pull request? If SparkR is running as a package and it has previously downloaded Spark Jar it should be able to run as before without having to set SPARK_HOME. Basically with this bug the auto install Spark will only work in the first session. This seems to be a regression on the earlier behavior. Fix is to always try to install or check for the cached Spark if running in an interactive session. As discussed before, we should probably only install Spark iff running in an interactive session (R shell, RStudio etc) ## How was this patch tested? Manually Author: Felix Cheung <felixcheung_m@hotmail.com> Closes #16077 from felixcheung/rsessioninteractive.	2016-12-04 20:25:11 -08:00
Yanbo Liang	a985dd8e99	[SPARK-18291][SPARKR][ML] Revert "[SPARK-18291][SPARKR][ML] SparkR glm predict should output original label when family = binomial." ## What changes were proposed in this pull request? It's better we can fix this issue by providing an option ```type``` for users to change the ```predict``` output schema, then they could output probabilities, log-space predictions, or original labels. In order to not involve breaking API change for 2.1, so revert this change firstly and will add it back after [SPARK-18618](https://issues.apache.org/jira/browse/SPARK-18618) resolved. ## How was this patch tested? Existing unit tests. This reverts commit `daa975f4bf`. Author: Yanbo Liang <ybliang8@gmail.com> Closes #16118 from yanboliang/spark-18291-revert.	2016-12-02 12:16:57 -08:00

1 2 3 4 5 ...

647 commits