Commit graph

292 commits

Author SHA1 Message Date
Felix Cheung 7f38b9d5f4 [SPARK-16144][SPARKR] update R API doc for mllib
## What changes were proposed in this pull request?

From SPARK-16140/PR #13921 - the issue is that we left the write.ml doc empty:
![image](https://cloud.githubusercontent.com/assets/8969467/16481934/856dd0ea-3e62-11e6-9474-e4d57d1ca001.png)

Here's what I meant as the fix:
![image](https://cloud.githubusercontent.com/assets/8969467/16481943/911f02ec-3e62-11e6-9d68-17363a9f5628.png)

![image](https://cloud.githubusercontent.com/assets/8969467/16481950/9bc057aa-3e62-11e6-8127-54870701c4b1.png)

I didn't realize there was already a JIRA on this. mengxr yanboliang

## How was this patch tested?

Checked the generated doc.

Author: Felix Cheung <felixcheung_m@hotmail.com>

Closes #13993 from felixcheung/rmllibdoc.
2016-07-11 14:34:48 -07:00
Yanbo Liang 2ad031be67 [SPARKR][DOC] SparkR ML user guides update for 2.0
## What changes were proposed in this pull request?
* Update the SparkR ML sections to make them consistent with the SparkR API docs.
* Since #13972 adds labelling support for the ```include_example``` Jekyll plugin, we can split the single ```ml.R``` example file into multiple line blocks with different labels and include them in different algorithms/models on the generated HTML page.

## How was this patch tested?
Docs-only update; manually checked the generated docs.

Author: Yanbo Liang <ybliang8@gmail.com>

Closes #14011 from yanboliang/r-user-guide-update.
2016-07-11 14:31:11 -07:00
Dongjoon Hyun 142df4834b [SPARK-16429][SQL] Include StringType columns in describe()
## What changes were proposed in this pull request?

Currently, Spark `describe` supports `StringType` when columns are specified explicitly. However, the no-argument `describe()` returns a dataset for numeric columns only. This PR includes `StringType` columns in the no-argument `describe()` as well.

**Background**
```scala
scala> spark.read.json("examples/src/main/resources/people.json").describe("age", "name").show()
+-------+------------------+-------+
|summary|               age|   name|
+-------+------------------+-------+
|  count|                 2|      3|
|   mean|              24.5|   null|
| stddev|7.7781745930520225|   null|
|    min|                19|   Andy|
|    max|                30|Michael|
+-------+------------------+-------+
```

**Before**
```scala
scala> spark.read.json("examples/src/main/resources/people.json").describe().show()
+-------+------------------+
|summary|               age|
+-------+------------------+
|  count|                 2|
|   mean|              24.5|
| stddev|7.7781745930520225|
|    min|                19|
|    max|                30|
+-------+------------------+
```

**After**
```scala
scala> spark.read.json("examples/src/main/resources/people.json").describe().show()
+-------+------------------+-------+
|summary|               age|   name|
+-------+------------------+-------+
|  count|                 2|      3|
|   mean|              24.5|   null|
| stddev|7.7781745930520225|   null|
|    min|                19|   Andy|
|    max|                30|Michael|
+-------+------------------+-------+
```

## How was this patch tested?

Pass the Jenkins with an updated test case.

Author: Dongjoon Hyun <dongjoon@apache.org>

Closes #14095 from dongjoon-hyun/SPARK-16429.
2016-07-08 14:36:50 -07:00
Dongjoon Hyun 6aa7d09f4e [SPARK-16425][R] describe() should not fail with non-numeric columns
## What changes were proposed in this pull request?

This PR prevents ERRORs when `summary(df)` is called on a `SparkDataFrame` with non-numeric columns. This failure happens only in `SparkR`.

**Before**
```r
> df <- createDataFrame(faithful)
> df <- withColumn(df, "boolean", df$waiting==79)
> summary(df)
16/07/07 14:15:16 ERROR RBackendHandler: describe on 34 failed
Error in invokeJava(isStatic = FALSE, objId$id, methodName, ...) :
  org.apache.spark.sql.AnalysisException: cannot resolve 'avg(`boolean`)' due to data type mismatch: function average requires numeric types, not BooleanType;
```

**After**
```r
> df <- createDataFrame(faithful)
> df <- withColumn(df, "boolean", df$waiting==79)
> summary(df)
SparkDataFrame[summary:string, eruptions:string, waiting:string]
```

## How was this patch tested?

Pass the Jenkins with an updated test case.

Author: Dongjoon Hyun <dongjoon@apache.org>

Closes #14096 from dongjoon-hyun/SPARK-16425.
2016-07-07 17:47:29 -07:00
Felix Cheung f4767bcc7a [SPARK-16310][SPARKR] R na.string-like default for csv source
## What changes were proposed in this pull request?

Apply default "NA" as null string for R, like R read.csv na.string parameter.

https://stat.ethz.ch/R-manual/R-devel/library/utils/html/read.table.html
na.strings = "NA"

A user passing a CSV file with NA values should get the same behavior with SparkR read.df(... source = "csv")

(couldn't open JIRA, will do that later)
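
For illustration, a minimal sketch of the intended behavior (the file path and data are hypothetical; assumes an active Spark session):
```r
library(SparkR)
# "data/people.csv" is a hypothetical file in which missing values are
# written as the literal string "NA"
df <- read.df("data/people.csv", source = "csv", header = "true",
              na.strings = "NA")  # "NA" is the default after this change
head(df)  # cells containing "NA" come back as nulls, matching R's read.csv
```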

## How was this patch tested?

unit tests

shivaram

Author: Felix Cheung <felixcheung_m@hotmail.com>

Closes #13984 from felixcheung/rcsvnastring.
2016-07-07 15:21:57 -07:00
Dongjoon Hyun d17e5f2f12 [SPARK-16233][R][TEST] ORC test should be enabled only when HiveContext is available.
## What changes were proposed in this pull request?

ORC test should be enabled only when HiveContext is available.

## How was this patch tested?

Manual.
```
$ R/run-tests.sh
...
1. create DataFrame from RDD (test_sparkSQL.R#200) - Hive is not build with SparkSQL, skipped

2. test HiveContext (test_sparkSQL.R#1021) - Hive is not build with SparkSQL, skipped

3. read/write ORC files (test_sparkSQL.R#1728) - Hive is not build with SparkSQL, skipped

4. enableHiveSupport on SparkSession (test_sparkSQL.R#2448) - Hive is not build with SparkSQL, skipped

5. sparkJars tag in SparkContext (test_Windows.R#21) - This test is only for Windows, skipped

DONE ===========================================================================
Tests passed.
```

Author: Dongjoon Hyun <dongjoon@apache.org>

Closes #14019 from dongjoon-hyun/SPARK-16233.
2016-07-01 15:35:19 -07:00
Sun Rui e4fa58c43c [SPARK-16299][SPARKR] Capture errors from R workers in daemon.R to avoid deletion of R session temporary directory.
## What changes were proposed in this pull request?
Capture errors from R workers in daemon.R to avoid deletion of R session temporary directory. See detailed description at https://issues.apache.org/jira/browse/SPARK-16299

## How was this patch tested?
SparkR unit tests.

Author: Sun Rui <sunrui2016@gmail.com>

Closes #13975 from sun-rui/SPARK-16299.
2016-07-01 14:37:03 -07:00
Narine Kokhlikyan 26afb4ce40 [SPARK-16012][SPARKR] Implement gapplyCollect which will apply a R function on each group similar to gapply and collect the result back to R data.frame
## What changes were proposed in this pull request?
gapplyCollect() runs gapply() on a SparkDataFrame and collects the result back to R. Compared to gapply() + collect(), gapplyCollect() offers a performance optimization as well as programming convenience, since no schema needs to be provided.

This is similar to dapplyCollect().
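
A minimal sketch (assuming an active Spark session; the grouping column and aggregation are illustrative):
```r
library(SparkR)
df <- createDataFrame(mtcars)
# unlike gapply(), no output schema has to be provided
res <- gapplyCollect(df, "gear", function(key, x) {
  data.frame(gear = key[[1]], mean_disp = mean(x$disp))
})
head(res)  # res is a local R data.frame
```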

## How was this patch tested?
Added test cases for gapplyCollect similar to dapplyCollect

Author: Narine Kokhlikyan <narine@slice.com>

Closes #13760 from NarineK/gapplyCollect.
2016-07-01 13:55:13 -07:00
Dongjoon Hyun 46395db80e [SPARK-16289][SQL] Implement posexplode table generating function
## What changes were proposed in this pull request?

This PR implements the `posexplode` table generating function. Currently, the master branch raises the following exception for a `map` argument. It's different from Hive.

**Before**
```scala
scala> sql("select posexplode(map('a', 1, 'b', 2))").show
org.apache.spark.sql.AnalysisException: No handler for Hive UDF ... posexplode() takes an array as a parameter; line 1 pos 7
```

**After**
```scala
scala> sql("select posexplode(map('a', 1, 'b', 2))").show
+---+---+-----+
|pos|key|value|
+---+---+-----+
|  0|  a|    1|
|  1|  b|    2|
+---+---+-----+
```

For an `array` argument, the behavior after is the same as before.
```scala
scala> sql("select posexplode(array(1, 2, 3))").show
+---+---+
|pos|col|
+---+---+
|  0|  1|
|  1|  2|
|  2|  3|
+---+---+
```

## How was this patch tested?

Pass the Jenkins tests with newly added test cases.

Author: Dongjoon Hyun <dongjoon@apache.org>

Closes #13971 from dongjoon-hyun/SPARK-16289.
2016-06-30 12:03:54 -07:00
Xin Ren 8c9cd0a7a7 [SPARK-16140][MLLIB][SPARKR][DOCS] Group k-means method in generated R doc
https://issues.apache.org/jira/browse/SPARK-16140

## What changes were proposed in this pull request?

Group the R doc of spark.kmeans, predict(KM), summary(KM), read/write.ml(KM) under Rd spark.kmeans. The example code was updated.

## How was this patch tested?

Tested on my local machine

On my laptop `jekyll build` fails to build the API docs, so I can only show the HTML I generated manually from the Rd files, with no CSS applied, but the doc content should be there.

![screenshotkmeans](https://cloud.githubusercontent.com/assets/3925641/16403203/c2c9ca1e-3ca7-11e6-9e29-f2164aee75fc.png)

Author: Xin Ren <iamshrek@126.com>

Closes #13921 from keypointt/SPARK-16140.
2016-06-29 11:25:00 -07:00
Yanbo Liang c6a220d756 [MINOR][SPARKR] Fix arguments of survreg in SparkR
## What changes were proposed in this pull request?
Fix the wrong argument descriptions of ```survreg``` in SparkR.

## How was this patch tested?
```Arguments``` section of the ```survreg``` doc before this PR (with a wrong description for ```path``` and ```overwrite``` missing):
![image](https://cloud.githubusercontent.com/assets/1962026/16447548/fe7a5ed4-3da1-11e6-8b96-b5bf2083b07e.png)

After this PR:
![image](https://cloud.githubusercontent.com/assets/1962026/16447617/368e0b18-3da2-11e6-8277-45640fb11859.png)

Author: Yanbo Liang <ybliang8@gmail.com>

Closes #13970 from yanboliang/spark-16143-followup.
2016-06-29 11:20:35 -07:00
Felix Cheung 823518c2b5 [SPARKR] add csv tests
## What changes were proposed in this pull request?

Add unit tests for CSV data in SparkR.

## How was this patch tested?

unit tests

Author: Felix Cheung <felixcheung_m@hotmail.com>

Closes #13904 from felixcheung/rcsv.
2016-06-28 17:08:28 -07:00
WeichenXu d59ba8e307 [MINOR][SPARKR] update sparkR DataFrame.R comment
## What changes were proposed in this pull request?

Update the SparkR DataFrame.R comments:
SQLContext ==> SparkSession

## How was this patch tested?

N/A

Author: WeichenXu <WeichenXu123@outlook.com>

Closes #13946 from WeichenXu123/sparkR_comment_update_sparkSession.
2016-06-28 12:12:20 -07:00
Prashant Sharma f6b497fcdd [SPARK-16128][SQL] Allow setting length of characters to be truncated to, in Dataset.show function.
## What changes were proposed in this pull request?

Allowing truncation to a specific number of characters is convenient at times, especially while operating from the REPL. Sometimes those last few characters make all the difference, and showing everything brings in a whole lot of noise.
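
A sketch of the corresponding SparkR usage (assuming `showDF`'s `truncate` argument accepts a character count after this change, mirroring `Dataset.show`):
```r
library(SparkR)
df <- createDataFrame(data.frame(s = "a fairly long string value",
                                 stringsAsFactors = FALSE))
# assumption: a numeric truncate limits strings to that many characters
showDF(df, numRows = 5, truncate = 8)
```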

## How was this patch tested?
Existing tests, plus one new test in DataFrameSuite.

For SparkR and pyspark, existing tests and manual testing.

Author: Prashant Sharma <prashsh1@in.ibm.com>
Author: Prashant Sharma <prashant@apache.org>

Closes #13839 from ScrapCodes/add_truncateTo_DF.show.
2016-06-28 17:11:06 +05:30
Junyang Qian 1b7fc58172 [SPARK-16143][R] group AFT survival regression methods docs in a single Rd
## What changes were proposed in this pull request?

This PR groups `spark.survreg`, `summary(AFT)`, `predict(AFT)`, `write.ml(AFT)` for survival regression into a single Rd.

## How was this patch tested?

Manually checked generated HTML doc. See attached screenshots.

![screen shot 2016-06-27 at 10 28 20 am](https://cloud.githubusercontent.com/assets/15318264/16392008/a14cf472-3c5e-11e6-9ce5-490ed1a52249.png)
![screen shot 2016-06-27 at 10 28 35 am](https://cloud.githubusercontent.com/assets/15318264/16392009/a14e333c-3c5e-11e6-8bd7-c2e9ba71f8e2.png)

Author: Junyang Qian <junyangq@databricks.com>

Closes #13927 from junyangq/SPARK-16143.
2016-06-27 20:32:27 -07:00
Felix Cheung 30b182bcc0 [SPARK-16184][SPARKR] conf API for SparkSession
## What changes were proposed in this pull request?

Add `conf` method to get Runtime Config from SparkSession

## How was this patch tested?

unit tests, manual tests

This is how it works in sparkR shell:
```
 SparkSession available as 'spark'.
> conf()
$hive.metastore.warehouse.dir
[1] "file:/opt/spark-2.0.0-bin-hadoop2.6/R/spark-warehouse"

$spark.app.id
[1] "local-1466749575523"

$spark.app.name
[1] "SparkR"

$spark.driver.host
[1] "10.0.2.1"

$spark.driver.port
[1] "45629"

$spark.executorEnv.LD_LIBRARY_PATH
[1] "$LD_LIBRARY_PATH:/usr/lib/R/lib:/usr/lib/x86_64-linux-gnu:/usr/lib/jvm/default-java/jre/lib/amd64/server"

$spark.executor.id
[1] "driver"

$spark.home
[1] "/opt/spark-2.0.0-bin-hadoop2.6"

$spark.master
[1] "local[*]"

$spark.sql.catalogImplementation
[1] "hive"

$spark.submit.deployMode
[1] "client"

> conf("spark.master")
$spark.master
[1] "local[*]"

```

Author: Felix Cheung <felixcheung_m@hotmail.com>

Closes #13885 from felixcheung/rconf.
2016-06-26 13:10:43 -07:00
Xiangrui Meng 4a40d43bb2 [SPARK-16142][R] group naiveBayes method docs in a single Rd
## What changes were proposed in this pull request?

This PR groups `spark.naiveBayes`, `summary(NB)`, `predict(NB)`, and `write.ml(NB)` into a single Rd.

## How was this patch tested?

Manually checked generated HTML doc. See attached screenshots.

![screen shot 2016-06-23 at 2 11 00 pm](https://cloud.githubusercontent.com/assets/829644/16320452/a5885e92-394c-11e6-994f-2ab5cddad86f.png)

![screen shot 2016-06-23 at 2 11 15 pm](https://cloud.githubusercontent.com/assets/829644/16320455/aad1f6d8-394c-11e6-8ef4-13bee989f52f.png)

Author: Xiangrui Meng <meng@databricks.com>

Closes #13877 from mengxr/SPARK-16142.
2016-06-23 21:43:13 -07:00
Felix Cheung b5a997667f [SPARK-16088][SPARKR] update setJobGroup, cancelJobGroup, clearJobGroup
## What changes were proposed in this pull request?

Updated setJobGroup, cancelJobGroup, and clearJobGroup to no longer require sc/SparkContext as a parameter.
Also updated roxygen2 doc and R programming guide on deprecations.
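
A minimal sketch of the updated calls (the group id and description are hypothetical):
```r
library(SparkR)
# no sc/SparkContext argument needed after this change
setJobGroup("nightly-etl", "nightly ETL jobs", TRUE)  # TRUE = interruptOnCancel
# ... trigger some Spark jobs ...
cancelJobGroup("nightly-etl")  # cancel everything in the group
clearJobGroup()                # detach subsequent jobs from any group
```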

## How was this patch tested?

unit tests

Author: Felix Cheung <felixcheung_m@hotmail.com>

Closes #13838 from felixcheung/rjobgroup.
2016-06-23 09:45:01 -07:00
Kai Jiang 43b04b7ecb [SPARK-15672][R][DOC] R programming guide update
## What changes were proposed in this pull request?
Guide for (usage sketched below):
- UDFs with dapply, dapplyCollect
- spark.lapply for running parallel R functions
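
A minimal sketch of both APIs (assuming an active Spark session; data and functions are illustrative):
```r
library(SparkR)
df <- createDataFrame(mtcars)

# dapply: apply a function to each partition; an output schema is required
schema <- structType(structField("mpg", "double"),
                     structField("mpg2", "double"))
df2 <- dapply(df, function(x) data.frame(mpg = x$mpg, mpg2 = x$mpg * 2), schema)
head(df2)

# dapplyCollect: same idea, but collects to a local data.frame, no schema needed
local <- dapplyCollect(df, function(x) data.frame(mpg2 = x$mpg * 2))

# spark.lapply: run a local R function over a list in parallel
squares <- spark.lapply(1:10, function(x) x^2)
```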

## How was this patch tested?
Built locally.
<img width="654" alt="screen shot 2016-06-14 at 03 12 56" src="https://cloud.githubusercontent.com/assets/3419881/16039344/12a3b6a0-31de-11e6-8d77-fe23308075c0.png">

Author: Kai Jiang <jiangkai@gmail.com>

Closes #13660 from vectorijk/spark-15672-R-guide-update.
2016-06-22 12:50:36 -07:00
Junyang Qian ea3a12b014 [SPARK-16107][R] group glm methods in documentation
## What changes were proposed in this pull request?

This groups GLM methods (spark.glm, summary, print, predict and write.ml) in the documentation. The example code was updated.

## How was this patch tested?

N/A

![screen shot 2016-06-21 at 2 31 37 pm](https://cloud.githubusercontent.com/assets/15318264/16247077/f6eafc04-37bc-11e6-89a8-7898ff3e4078.png)
![screen shot 2016-06-21 at 2 31 45 pm](https://cloud.githubusercontent.com/assets/15318264/16247078/f6eb1c16-37bc-11e6-940a-2b595b10617c.png)

Author: Junyang Qian <junyangq@databricks.com>
Author: Junyang Qian <junyangq@Junyangs-MacBook-Pro.local>

Closes #13820 from junyangq/SPARK-16107.
2016-06-22 09:13:08 -07:00
Felix Cheung dbfdae4e41 [SPARK-16096][SPARKR] add union and deprecate unionAll
## What changes were proposed in this pull request?

Add union and deprecate unionAll; separate the roxygen2 doc for rbind (since their usage and parameter lists are quite different).

`explode` is also deprecated - but it seems the replacement is a combination of calls; not sure if we should deprecate it in SparkR yet.
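
For illustration, a minimal sketch of the renamed API (data is illustrative; the two inputs must share a schema):
```r
library(SparkR)
df1 <- createDataFrame(mtcars[1:5, ])
df2 <- createDataFrame(mtcars[6:10, ])

combined <- union(df1, df2)     # new name
combined <- unionAll(df1, df2)  # still works, but now deprecated
stacked  <- rbind(df1, df2)     # rbind gets its own doc page in this PR
```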

## How was this patch tested?

unit tests, manual checks for r doc

Author: Felix Cheung <felixcheung_m@hotmail.com>

Closes #13805 from felixcheung/runion.
2016-06-21 13:36:50 -07:00
Felix Cheung 57746295e6 [SPARK-16109][SPARKR][DOC] R more doc fixes
## What changes were proposed in this pull request?

Found these issues while reviewing for SPARK-16090

## How was this patch tested?

roxygen2 doc gen, checked output html

Author: Felix Cheung <felixcheung_m@hotmail.com>

Closes #13803 from felixcheung/rdocrd.
2016-06-21 11:01:42 -07:00
Xiangrui Meng 4f83ca1059 [SPARK-15177][.1][R] make SparkR model params and default values consistent with MLlib
## What changes were proposed in this pull request?

This PR is a subset of #13023 by yanboliang to make SparkR model param names and default values consistent with MLlib. I tried to avoid other changes from #13023 to keep this PR minimal. I will send a follow-up PR to improve the documentation.

Main changes:
* `spark.glm`: epsilon -> tol, maxit -> maxIter
* `spark.kmeans`: default k -> 2, default maxIter -> 20, default initMode -> "k-means||"
* `spark.naiveBayes`: laplace -> smoothing, default 1.0

## How was this patch tested?

Existing unit tests.

Author: Xiangrui Meng <meng@databricks.com>

Closes #13801 from mengxr/SPARK-15177.1.
2016-06-21 08:31:15 -07:00
Felix Cheung 843a1eba8e [SPARK-15319][SPARKR][DOCS] Fix SparkR doc layout for corr and other DataFrame stats functions
## What changes were proposed in this pull request?

Doc only changes. Please see screenshots.

Before:
http://spark.apache.org/docs/latest/api/R/statfunctions.html
![image](https://cloud.githubusercontent.com/assets/8969467/15264110/cd458826-1924-11e6-85bd-8ee2e2e1a85f.png)

After
![image](https://cloud.githubusercontent.com/assets/8969467/16218452/b9e89f08-3732-11e6-969d-a3a1796e7ad0.png)
(please ignore the style differences - this is due to not having the css in my local copy)

This is still a bit weird. As discussed in SPARK-15237, I think the better approach is to separate out the DataFrame stats functions instead of putting everything on one page. At least now it is clearer which description is on which function.

## How was this patch tested?

Build doc

Author: Felix Cheung <felixcheung_m@hotmail.com>
Author: felixcheung <felixcheung_m@hotmail.com>

Closes #13109 from felixcheung/rstatdoc.
2016-06-21 00:19:09 -07:00
Felix Cheung 09f4ceaeb0 [SPARKR][DOCS] R code doc cleanup
## What changes were proposed in this pull request?

I ran a full pass from A to Z and fixed the obvious duplications, improper grouping, etc.

There are still more doc issues to be cleaned up.

## How was this patch tested?

manual tests

Author: Felix Cheung <felixcheung_m@hotmail.com>

Closes #13798 from felixcheung/rdocseealso.
2016-06-20 23:51:08 -07:00
Dongjoon Hyun 217db56ba1 [SPARK-15294][R] Add pivot to SparkR
## What changes were proposed in this pull request?

This PR adds the `pivot` function to SparkR for API parity. Since this PR is based on https://github.com/apache/spark/pull/13295, mhnatiuk should be credited for the work he did.
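
A minimal sketch of the added function (the data and aggregation are illustrative):
```r
library(SparkR)
df <- createDataFrame(data.frame(
  year     = c(2015, 2015, 2016, 2016),
  course   = c("R", "Scala", "R", "Scala"),
  earnings = c(10, 20, 15, 25),
  stringsAsFactors = FALSE))

# pivot the distinct 'course' values into columns, then aggregate per year
result <- agg(pivot(groupBy(df, "year"), "course"), sum(df$earnings))
head(result)
```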

## How was this patch tested?

Pass the Jenkins tests (including a new test case).

Author: Dongjoon Hyun <dongjoon@apache.org>

Closes #13786 from dongjoon-hyun/SPARK-15294.
2016-06-20 21:09:39 -07:00
Narine Kokhlikyan e2b7eba87c remove duplicated docs in dapply
## What changes were proposed in this pull request?
Removed unnecessary duplicated documentation in dapply and dapplyCollect.

In this pull request I created separate R docs for dapply and dapplyCollect - kept dapply's documentation separate from dapplyCollect's and linked from one to the other.

## How was this patch tested?
Existing test cases.

Author: Narine Kokhlikyan <narine@slice.com>

Closes #13790 from NarineK/dapply-docs-fix.
2016-06-20 19:36:51 -07:00
Dongjoon Hyun d0eddb80ec [SPARK-14995][R] Add since tag in Roxygen documentation for SparkR API methods
## What changes were proposed in this pull request?

This PR adds `since` tags to Roxygen documentation according to the previous documentation archive.

https://home.apache.org/~dongjoon/spark-2.0.0-docs/api/R/

## How was this patch tested?

Manual.

Author: Dongjoon Hyun <dongjoon@apache.org>

Closes #13734 from dongjoon-hyun/SPARK-14995.
2016-06-20 14:24:41 -07:00
Felix Cheung 359c2e827d [SPARK-15159][SPARKR] SparkSession roxygen2 doc, programming guide, example updates
## What changes were proposed in this pull request?

roxygen2 doc, programming guide, example updates

## How was this patch tested?

manual checks
shivaram

Author: Felix Cheung <felixcheung_m@hotmail.com>

Closes #13751 from felixcheung/rsparksessiondoc.
2016-06-20 13:46:24 -07:00
Dongjoon Hyun b0f2fb5b97 [SPARK-16053][R] Add spark_partition_id in SparkR
## What changes were proposed in this pull request?

This PR adds the `spark_partition_id` virtual column function in SparkR for API parity.

The following is just an example illustrating SparkR usage on a partitioned Parquet table created by `spark.range(10).write.mode("overwrite").parquet("/tmp/t1")`.
```r
> collect(select(read.parquet('/tmp/t1'), c('id', spark_partition_id())))
   id SPARK_PARTITION_ID()
1   3                    0
2   4                    0
3   8                    1
4   9                    1
5   0                    2
6   1                    3
7   2                    4
8   5                    5
9   6                    6
10  7                    7
```

## How was this patch tested?

Pass the Jenkins tests (including a new test case).

Author: Dongjoon Hyun <dongjoon@apache.org>

Closes #13768 from dongjoon-hyun/SPARK-16053.
2016-06-20 13:41:03 -07:00
Felix Cheung aee1420eca [SPARKR] fix R roxygen2 doc for count on GroupedData
## What changes were proposed in this pull request?
fix code doc

## How was this patch tested?

manual

shivaram

Author: Felix Cheung <felixcheung_m@hotmail.com>

Closes #13782 from felixcheung/rcountdoc.
2016-06-20 12:31:00 -07:00
Felix Cheung 46d98e0a1f [SPARK-16028][SPARKR] spark.lapply can work with active context
## What changes were proposed in this pull request?

spark.lapply and setLogLevel

## How was this patch tested?

unit test

shivaram thunterdb

Author: Felix Cheung <felixcheung_m@hotmail.com>

Closes #13752 from felixcheung/rlapply.
2016-06-20 12:08:42 -07:00
Dongjoon Hyun c44bf137c7 [SPARK-16051][R] Add read.orc/write.orc to SparkR
## What changes were proposed in this pull request?

This issue adds `read.orc/write.orc` to SparkR for API parity.
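
A minimal usage sketch (the path is hypothetical; per SPARK-16233 above, ORC support requires a build with Hive):
```r
library(SparkR)
df <- createDataFrame(faithful)
write.orc(df, "/tmp/faithful.orc")      # hypothetical output path
orcDF <- read.orc("/tmp/faithful.orc")  # read it back as a SparkDataFrame
head(orcDF)
```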

## How was this patch tested?

Pass the Jenkins tests (with new test cases).

Author: Dongjoon Hyun <dongjoon@apache.org>

Closes #13763 from dongjoon-hyun/SPARK-16051.
2016-06-20 11:30:26 -07:00
Felix Cheung 36e812d4b6 [SPARK-16029][SPARKR] SparkR add dropTempView and deprecate dropTempTable
## What changes were proposed in this pull request?

Add dropTempView and deprecate dropTempTable
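
A minimal sketch (the view name is hypothetical):
```r
library(SparkR)
df <- createDataFrame(faithful)
createOrReplaceTempView(df, "faithful_view")
head(sql("SELECT * FROM faithful_view LIMIT 3"))
dropTempView("faithful_view")  # preferred over the deprecated dropTempTable()
```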

## How was this patch tested?

unit tests

shivaram liancheng

Author: Felix Cheung <felixcheung_m@hotmail.com>

Closes #13753 from felixcheung/rdroptempview.
2016-06-20 11:24:41 -07:00
Dongjoon Hyun 9613424898 [SPARK-16059][R] Add monotonically_increasing_id function in SparkR
## What changes were proposed in this pull request?

This PR adds the `monotonically_increasing_id` column function in SparkR for API parity.
After this PR, SparkR supports the following.

```r
> df <- read.json("examples/src/main/resources/people.json")
> collect(select(df, monotonically_increasing_id(), df$name, df$age))
  monotonically_increasing_id()    name age
1                             0 Michael  NA
2                             1    Andy  30
3                             2  Justin  19
```

## How was this patch tested?

Pass the Jenkins tests (with an added test case).

Author: Dongjoon Hyun <dongjoon@apache.org>

Closes #13774 from dongjoon-hyun/SPARK-16059.
2016-06-20 11:12:41 -07:00
Felix Cheung 8c198e246d [SPARK-15159][SPARKR] SparkR SparkSession API
## What changes were proposed in this pull request?

This PR introduces the new SparkSession API for SparkR.
`sparkR.session.getOrCreate()` and `sparkR.session.stop()`
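
For illustration, a hypothetical sketch of the proposed entrypoint (the parameter names are assumptions based on the description below, not a confirmed signature):
```r
library(SparkR)
# assumption: named parameters as described in this PR
sparkR.session.getOrCreate(master = "local[*]", appName = "example")
df <- createDataFrame(faithful)
head(df)
sparkR.session.stop()
```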

"getOrCreate" is a bit unusual in R but it's important to name this clearly.

The SparkR implementation:
- SparkSession is the main entrypoint (vs SparkContext; due to limited functionality supported with SparkContext in SparkR)
- SparkSession replaces SQLContext and HiveContext (both are wrappers around SparkSession, and because of API changes, supporting all 3 would be a lot more work)
- Changes to SparkSession are mostly transparent to users due to SPARK-10903
- Full backward compatibility is expected - users should be able to initialize everything just as in Spark 1.6.1 (`sparkR.init()`), but with a deprecation warning
- Mostly cosmetic changes to the parameter list - users should be able to move to `sparkR.session.getOrCreate()` easily
- An advanced syntax with named parameters (aka varargs aka "...") is supported; that should be closer to the Builder syntax in Scala/Python (which unfortunately does not work in R because it would look like this: `enableHiveSupport(config(config(master(appName(builder(), "foo"), "local"), "first", "value"), "next", "value"))`)
- Updating config on an existing SparkSession is supported, the behavior is the same as Python, in which config is applied to both SparkContext and SparkSession
- Some SparkSession changes are not matched in SparkR, mostly because it would be a breaking API change: `catalog` object, `createOrReplaceTempView`
- Other SQLContext workarounds are replicated in SparkR, e.g. `tables`, `tableNames`
- `sparkR` shell is updated to use the SparkSession entrypoint (`sqlContext` is removed, just like with Scala/Python)
- All tests are updated to use the SparkSession entrypoint
- A bug in `read.jdbc` is fixed

TODO
- [x] Add more tests
- [ ] Separate PR - update all roxygen2 doc coding example
- [ ] Separate PR - update SparkR programming guide

## How was this patch tested?

unit tests, manual tests

shivaram sun-rui rxin

Author: Felix Cheung <felixcheung_m@hotmail.com>
Author: felixcheung <felixcheung_m@hotmail.com>

Closes #13635 from felixcheung/rsparksession.
2016-06-17 21:36:01 -07:00
Dongjoon Hyun 7d65a0db4a [SPARK-16005][R] Add randomSplit to SparkR
## What changes were proposed in this pull request?

This PR adds `randomSplit` to SparkR for API parity.
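
A minimal sketch (the weights and seed are illustrative):
```r
library(SparkR)
df <- createDataFrame(data.frame(id = 1:100))
splits <- randomSplit(df, weights = c(0.8, 0.2), seed = 42)
trainDF <- splits[[1]]  # a list of SparkDataFrames is returned
testDF  <- splits[[2]]
count(trainDF)          # roughly 80 rows
```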

## How was this patch tested?

Pass the Jenkins tests (with a new test case).

Author: Dongjoon Hyun <dongjoon@apache.org>

Closes #13721 from dongjoon-hyun/SPARK-16005.
2016-06-17 16:07:33 -07:00
Felix Cheung ef3cc4fc09 [SPARK-15925][SPARKR] R DataFrame add back registerTempTable, add tests
## What changes were proposed in this pull request?

Add registerTempTable back to DataFrame, marked as deprecated.

## How was this patch tested?

unit tests
shivaram liancheng

Author: Felix Cheung <felixcheung_m@hotmail.com>

Closes #13722 from felixcheung/rregistertemptable.
2016-06-17 15:56:03 -07:00
Dongjoon Hyun 513a03e41e [SPARK-15908][R] Add varargs-type dropDuplicates() function in SparkR
## What changes were proposed in this pull request?

This PR adds a varargs-type `dropDuplicates` function to SparkR for API parity.
Refer to https://issues.apache.org/jira/browse/SPARK-15807, too.
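
A minimal sketch of the varargs form (the columns are illustrative):
```r
library(SparkR)
df <- createDataFrame(mtcars)
dedupAll  <- dropDuplicates(df)                 # consider all columns
dedupSome <- dropDuplicates(df, "cyl", "gear")  # varargs form added here
```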

## How was this patch tested?

Pass the Jenkins tests with new test cases.

Author: Dongjoon Hyun <dongjoon@apache.org>

Closes #13684 from dongjoon-hyun/SPARK-15908.
2016-06-16 20:35:17 -07:00
Kai Jiang 5fd20b66ff [SPARK-15490][R][DOC] SparkR 2.0 QA: New R APIs and API docs for non-MLib changes
## What changes were proposed in this pull request?
R docs changes, including typos, format, and layout.
## How was this patch tested?
Tested locally.

Author: Kai Jiang <jiangkai@gmail.com>

Closes #13394 from vectorijk/spark-15490.
2016-06-16 19:39:33 -07:00
Narine Kokhlikyan 7c6c692637 [SPARK-12922][SPARKR][WIP] Implement gapply() on DataFrame in SparkR
## What changes were proposed in this pull request?

gapply() applies an R function to groups defined by one or more columns of a DataFrame and returns a DataFrame. It is like GroupedDataSet.flatMapGroups() in the Dataset API.
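
A minimal sketch (assuming an active Spark session; the grouping column, function, and schema are illustrative):
```r
library(SparkR)
df <- createDataFrame(mtcars)
schema <- structType(structField("gear", "double"),
                     structField("mean_disp", "double"))
result <- gapply(df, "gear",
                 function(key, x) data.frame(key[[1]], mean(x$disp)),
                 schema)
head(result)  # result is a SparkDataFrame with the given schema
```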

Please let me know what you think and whether you have any ideas to improve it.

Thank you!

## How was this patch tested?
Unit tests.
1. Primitive test with different column types
2. Add a boolean column
3. Compute average by a group

Author: Narine Kokhlikyan <narine.kokhlikyan@gmail.com>
Author: NarineK <narine.kokhlikyan@us.ibm.com>

Closes #12836 from NarineK/gapply2.
2016-06-15 21:42:05 -07:00
Felix Cheung d30b7e6696 [SPARK-15637][SPARK-15931][SPARKR] Fix R masked functions checks
## What changes were proposed in this pull request?

Because of the fix in SPARK-15684, this exclusion is no longer necessary.

## How was this patch tested?

unit tests

shivaram

Author: Felix Cheung <felixcheung_m@hotmail.com>

Closes #13636 from felixcheung/rendswith.
2016-06-15 10:29:07 -07:00
Cheng Lian ced8d669b3 [SPARK-15925][SQL][SPARKR] Replaces registerTempTable with createOrReplaceTempView
## What changes were proposed in this pull request?

This PR replaces `registerTempTable` with `createOrReplaceTempView` as a follow-up task of #12945.

## How was this patch tested?

Existing SparkR tests.

Author: Cheng Lian <lian@databricks.com>

Closes #13644 from liancheng/spark-15925-temp-view-for-r.
2016-06-13 15:46:50 -07:00
Wenchen Fan e2ab79d5ea [SPARK-15898][SQL] DataFrameReader.text should return DataFrame
## What changes were proposed in this pull request?

We want to maintain API compatibility for DataFrameReader.text, and will introduce a new API called DataFrameReader.textFile which returns Dataset[String].

affected PRs:
https://github.com/apache/spark/pull/11731
https://github.com/apache/spark/pull/13104
https://github.com/apache/spark/pull/13184

## How was this patch tested?

N/A

Author: Wenchen Fan <wenchen@databricks.com>

Closes #13604 from cloud-fan/revert.
2016-06-12 21:36:41 -07:00
wm624@hotmail.com 2c8f40cea1 [SPARK-15766][SPARKR] R should export is.nan
## What changes were proposed in this pull request?

When reviewing SPARK-15545, we found that is.nan is not exported, but it should be.

Add it to the NAMESPACE.
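
A minimal sketch of the newly exported function on a Column (the data is illustrative):
```r
library(SparkR)
df <- createDataFrame(data.frame(x = c(1, NaN, 3)))
collect(filter(df, !is.nan(df$x)))  # drops the NaN row
```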

## How was this patch tested?

Manual tests.

Author: wm624@hotmail.com <wm624@hotmail.com>

Closes #13508 from wangmiao1981/unused.
2016-06-10 12:46:22 -07:00
wm624@hotmail.com 3ec4461c46 [SPARK-15684][SPARKR] Not mask startsWith and endsWith in R
## What changes were proposed in this pull request?

In R 3.3.0, startsWith and endsWith were added. This PR makes the two work in SparkR (a usage sketch follows the list):
1. Remove signature in generic.R
2. Add setMethod in column.R
3. Add unit tests
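
A minimal sketch of the Column methods (the data is illustrative):
```r
library(SparkR)
df <- createDataFrame(data.frame(name = c("Michael", "Andy", "Justin"),
                                 stringsAsFactors = FALSE))
collect(filter(df, startsWith(df$name, "M")))  # SparkR Column method
startsWith("Michael", "M")                     # base R (3.3.0) still works
```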

## How was this patch tested?
Manually tested through the SparkR shell for both column data and string data; these cases were added to the unit test file.

Author: wm624@hotmail.com <wm624@hotmail.com>

Closes #13476 from wangmiao1981/start.
2016-06-07 09:13:18 -07:00
Zheng RuiFeng fd8af39713 [MINOR] Fix Typos 'an -> a'
## What changes were proposed in this pull request?

`an -> a`

Use cmds like `find . -name '*.R' | xargs -i sh -c "grep -in ' an [^aeiou]' {} && echo {}"` to generate candidates, and review them one by one.

## How was this patch tested?
manual tests

Author: Zheng RuiFeng <ruifengz@foxmail.com>

Closes #13515 from zhengruifeng/an_a.
2016-06-06 09:35:47 +01:00
Kai Jiang 8a9110510c [MINOR][R][DOC] Fix R documentation generation instruction.
## What changes were proposed in this pull request?
changes in R/README.md

- Make the steps for generating SparkR documentation clearer
- link R/DOCUMENTATION.md from R/README.md
- turn on some code syntax highlighting in R/README.md

## How was this patch tested?
local test

Author: Kai Jiang <jiangkai@gmail.com>

Closes #13488 from vectorijk/R-Readme.
2016-06-05 13:03:02 -07:00
felixcheung 74c1b79f3f [SPARK-15637][SPARKR] fix R tests on R 3.2.2
## What changes were proposed in this pull request?

Change version check in R tests

## How was this patch tested?

R tests
shivaram

Author: felixcheung <felixcheung_m@hotmail.com>

Closes #13369 from felixcheung/rversioncheck.
2016-05-28 10:32:40 -07:00
felixcheung c82883239e [SPARK-10903] followup - update API doc for SqlContext
## What changes were proposed in this pull request?

Follow-up on the earlier PR - here we fix up the roxygen2 doc examples.
Also add to the programming guide migration section.

## How was this patch tested?

SparkR tests

Author: felixcheung <felixcheung_m@hotmail.com>

Closes #13340 from felixcheung/sqlcontextdoc.
2016-05-26 21:42:36 -07:00