Commit graph

274 commits

Author SHA1 Message Date
Kai Jiang 43b04b7ecb [SPARK-15672][R][DOC] R programming guide update
## What changes were proposed in this pull request?
Guide for
- UDFs with dapply, dapplyCollect
- spark.lapply for running parallel R functions

## How was this patch tested?
build locally
<img width="654" alt="screen shot 2016-06-14 at 03 12 56" src="https://cloud.githubusercontent.com/assets/3419881/16039344/12a3b6a0-31de-11e6-8d77-fe23308075c0.png">

Author: Kai Jiang <jiangkai@gmail.com>

Closes #13660 from vectorijk/spark-15672-R-guide-update.
2016-06-22 12:50:36 -07:00
Junyang Qian ea3a12b014 [SPARK-16107][R] group glm methods in documentation
## What changes were proposed in this pull request?

This groups GLM methods (spark.glm, summary, print, predict and write.ml) in the documentation. The example code was updated.

## How was this patch tested?

N/A

![screen shot 2016-06-21 at 2 31 37 pm](https://cloud.githubusercontent.com/assets/15318264/16247077/f6eafc04-37bc-11e6-89a8-7898ff3e4078.png)
![screen shot 2016-06-21 at 2 31 45 pm](https://cloud.githubusercontent.com/assets/15318264/16247078/f6eb1c16-37bc-11e6-940a-2b595b10617c.png)

Author: Junyang Qian <junyangq@databricks.com>
Author: Junyang Qian <junyangq@Junyangs-MacBook-Pro.local>

Closes #13820 from junyangq/SPARK-16107.
2016-06-22 09:13:08 -07:00
Felix Cheung dbfdae4e41 [SPARK-16096][SPARKR] add union and deprecate unionAll
## What changes were proposed in this pull request?

add union and deprecate unionAll, separate roxygen2 doc for rbind (since their usage and parameter lists are quite different)

`explode` is also deprecated - but seems like replacement is a combination of calls; not sure if we should deprecate it in SparkR, yet.

## How was this patch tested?

unit tests, manual checks for r doc

Author: Felix Cheung <felixcheung_m@hotmail.com>

Closes #13805 from felixcheung/runion.
2016-06-21 13:36:50 -07:00
Felix Cheung 57746295e6 [SPARK-16109][SPARKR][DOC] R more doc fixes
## What changes were proposed in this pull request?

Found these issues while reviewing for SPARK-16090

## How was this patch tested?

roxygen2 doc gen, checked output html

Author: Felix Cheung <felixcheung_m@hotmail.com>

Closes #13803 from felixcheung/rdocrd.
2016-06-21 11:01:42 -07:00
Xiangrui Meng 4f83ca1059 [SPARK-15177][.1][R] make SparkR model params and default values consistent with MLlib
## What changes were proposed in this pull request?

This PR is a subset of #13023 by yanboliang to make SparkR model param names and default values consistent with MLlib. I tried to avoid other changes from #13023 to keep this PR minimal. I will send a follow-up PR to improve the documentation.

Main changes:
* `spark.glm`: epsilon -> tol, maxit -> maxIter
* `spark.kmeans`: default k -> 2, default maxIter -> 20, default initMode -> "k-means||"
* `spark.naiveBayes`: laplace -> smoothing, default 1.0

## How was this patch tested?

Existing unit tests.

Author: Xiangrui Meng <meng@databricks.com>

Closes #13801 from mengxr/SPARK-15177.1.
2016-06-21 08:31:15 -07:00
Felix Cheung 843a1eba8e [SPARK-15319][SPARKR][DOCS] Fix SparkR doc layout for corr and other DataFrame stats functions
## What changes were proposed in this pull request?

Doc only changes. Please see screenshots.

Before:
http://spark.apache.org/docs/latest/api/R/statfunctions.html
![image](https://cloud.githubusercontent.com/assets/8969467/15264110/cd458826-1924-11e6-85bd-8ee2e2e1a85f.png)

After
![image](https://cloud.githubusercontent.com/assets/8969467/16218452/b9e89f08-3732-11e6-969d-a3a1796e7ad0.png)
(please ignore the style differences - this is due to not having the css in my local copy)

This is still a bit weird. As discussed in SPARK-15237, I think the better approach is to separate out the DataFrame stats function instead of putting everything on one page. At least now it is clearer which description is on which function.

## How was this patch tested?

Build doc

Author: Felix Cheung <felixcheung_m@hotmail.com>
Author: felixcheung <felixcheung_m@hotmail.com>

Closes #13109 from felixcheung/rstatdoc.
2016-06-21 00:19:09 -07:00
Felix Cheung 09f4ceaeb0 [SPARKR][DOCS] R code doc cleanup
## What changes were proposed in this pull request?

I ran a full pass from A to Z and fixed the obvious duplications, improper grouping etc.

There are still more doc issues to be cleaned up.

## How was this patch tested?

manual tests

Author: Felix Cheung <felixcheung_m@hotmail.com>

Closes #13798 from felixcheung/rdocseealso.
2016-06-20 23:51:08 -07:00
Dongjoon Hyun 217db56ba1 [SPARK-15294][R] Add pivot to SparkR
## What changes were proposed in this pull request?

This PR adds `pivot` function to SparkR for API parity. Since this PR is based on https://github.com/apache/spark/pull/13295 , mhnatiuk should be credited for the work he did.

## How was this patch tested?

Pass the Jenkins tests (including new testcase.)

Author: Dongjoon Hyun <dongjoon@apache.org>

Closes #13786 from dongjoon-hyun/SPARK-15294.
2016-06-20 21:09:39 -07:00
Narine Kokhlikyan e2b7eba87c remove duplicated docs in dapply
## What changes were proposed in this pull request?
Removed unnecessary duplicated documentation in dapply and dapplyCollect.

In this pull request I created separate R docs for dapply and dapplyCollect - kept dapply's documentation separate from dapplyCollect's and referred from one to another via a link.

## How was this patch tested?
Existing test cases.

Author: Narine Kokhlikyan <narine@slice.com>

Closes #13790 from NarineK/dapply-docs-fix.
2016-06-20 19:36:51 -07:00
Dongjoon Hyun d0eddb80ec [SPARK-14995][R] Add since tag in Roxygen documentation for SparkR API methods
## What changes were proposed in this pull request?

This PR adds `since` tags to Roxygen documentation according to the previous documentation archive.

https://home.apache.org/~dongjoon/spark-2.0.0-docs/api/R/

## How was this patch tested?

Manual.

Author: Dongjoon Hyun <dongjoon@apache.org>

Closes #13734 from dongjoon-hyun/SPARK-14995.
2016-06-20 14:24:41 -07:00
Felix Cheung 359c2e827d [SPARK-15159][SPARKR] SparkSession roxygen2 doc, programming guide, example updates
## What changes were proposed in this pull request?

roxygen2 doc, programming guide, example updates

## How was this patch tested?

manual checks
shivaram

Author: Felix Cheung <felixcheung_m@hotmail.com>

Closes #13751 from felixcheung/rsparksessiondoc.
2016-06-20 13:46:24 -07:00
Dongjoon Hyun b0f2fb5b97 [SPARK-16053][R] Add spark_partition_id in SparkR
## What changes were proposed in this pull request?

This PR adds `spark_partition_id` virtual column function in SparkR for API parity.

The following is just an example to illustrate a SparkR usage on a partitioned parquet table created by `spark.range(10).write.mode("overwrite").parquet("/tmp/t1")`.
```r
> collect(select(read.parquet('/tmp/t1'), c('id', spark_partition_id())))
   id SPARK_PARTITION_ID()
1   3                    0
2   4                    0
3   8                    1
4   9                    1
5   0                    2
6   1                    3
7   2                    4
8   5                    5
9   6                    6
10  7                    7
```

## How was this patch tested?

Pass the Jenkins tests (including new testcase).

Author: Dongjoon Hyun <dongjoon@apache.org>

Closes #13768 from dongjoon-hyun/SPARK-16053.
2016-06-20 13:41:03 -07:00
Felix Cheung aee1420eca [SPARKR] fix R roxygen2 doc for count on GroupedData
## What changes were proposed in this pull request?
fix code doc

## How was this patch tested?

manual

shivaram

Author: Felix Cheung <felixcheung_m@hotmail.com>

Closes #13782 from felixcheung/rcountdoc.
2016-06-20 12:31:00 -07:00
Felix Cheung 46d98e0a1f [SPARK-16028][SPARKR] spark.lapply can work with active context
## What changes were proposed in this pull request?

spark.lapply and setLogLevel

## How was this patch tested?

unit test

shivaram thunterdb

Author: Felix Cheung <felixcheung_m@hotmail.com>

Closes #13752 from felixcheung/rlapply.
2016-06-20 12:08:42 -07:00
Dongjoon Hyun c44bf137c7 [SPARK-16051][R] Add read.orc/write.orc to SparkR
## What changes were proposed in this pull request?

This issue adds `read.orc/write.orc` to SparkR for API parity.

## How was this patch tested?

Pass the Jenkins tests (with new testcases).

Author: Dongjoon Hyun <dongjoon@apache.org>

Closes #13763 from dongjoon-hyun/SPARK-16051.
2016-06-20 11:30:26 -07:00
Felix Cheung 36e812d4b6 [SPARK-16029][SPARKR] SparkR add dropTempView and deprecate dropTempTable
## What changes were proposed in this pull request?

Add dropTempView and deprecate dropTempTable

## How was this patch tested?

unit tests

shivaram liancheng

Author: Felix Cheung <felixcheung_m@hotmail.com>

Closes #13753 from felixcheung/rdroptempview.
2016-06-20 11:24:41 -07:00
Dongjoon Hyun 9613424898 [SPARK-16059][R] Add monotonically_increasing_id function in SparkR
## What changes were proposed in this pull request?

This PR adds `monotonically_increasing_id` column function in SparkR for API parity.
After this PR, SparkR supports the following.

```r
> df <- read.json("examples/src/main/resources/people.json")
> collect(select(df, monotonically_increasing_id(), df$name, df$age))
  monotonically_increasing_id()    name age
1                             0 Michael  NA
2                             1    Andy  30
3                             2  Justin  19
```

## How was this patch tested?

Pass the Jenkins tests (with added testcase).

Author: Dongjoon Hyun <dongjoon@apache.org>

Closes #13774 from dongjoon-hyun/SPARK-16059.
2016-06-20 11:12:41 -07:00
Felix Cheung 8c198e246d [SPARK-15159][SPARKR] SparkR SparkSession API
## What changes were proposed in this pull request?

This PR introduces the new SparkSession API for SparkR.
`sparkR.session.getOrCreate()` and `sparkR.session.stop()`

"getOrCreate" is a bit unusual in R but it's important to name this clearly.

The SparkR implementation follows these points:
- SparkSession is the main entrypoint (vs SparkContext; due to limited functionality supported with SparkContext in SparkR)
- SparkSession replaces SQLContext and HiveContext (both a wrapper around SparkSession, and because of API changes, supporting all 3 would be a lot more work)
- Changes to SparkSession are mostly transparent to users due to SPARK-10903
- Full backward compatibility is expected - users should be able to initialize everything just as in Spark 1.6.1 (`sparkR.init()`), but with a deprecation warning
- Mostly cosmetic changes to parameter list - users should be able to move to `sparkR.session.getOrCreate()` easily
- An advanced syntax with named parameters (aka varargs aka "...") is supported; that should be closer to the Builder syntax that is in Scala/Python (which unfortunately does not work in R because it will look like this: `enableHiveSupport(config(config(master(appName(builder(), "foo"), "local"), "first", "value"), "next", "value"))`)
- Updating config on an existing SparkSession is supported, the behavior is the same as Python, in which config is applied to both SparkContext and SparkSession
- Some SparkSession changes are not matched in SparkR, mostly because it would be breaking API change: `catalog` object, `createOrReplaceTempView`
- Other SQLContext workarounds are replicated in SparkR, eg. `tables`, `tableNames`
- `sparkR` shell is updated to use the SparkSession entrypoint (`sqlContext` is removed, just like with Scala/Python)
- All tests are updated to use the SparkSession entrypoint
- A bug in `read.jdbc` is fixed
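
The get-or-create behaviour described in the list above can be sketched in plain Python. This is only an illustration of the idea (one session per process, later config merged into the existing session); `Session`, `get_or_create`, and `stop` are hypothetical names, not the SparkR implementation.

```python
# Hypothetical module-level holder for the single active session.
_active_session = None


class Session:
    """Stand-in for a session object that only carries its config."""

    def __init__(self, **config):
        self.config = dict(config)


def get_or_create(**config):
    """Return the active session, creating it on first use.

    Config passed on later calls is merged into the existing session.
    """
    global _active_session
    if _active_session is None:
        _active_session = Session(**config)
    else:
        _active_session.config.update(config)
    return _active_session


def stop():
    """Drop the active session so the next call creates a new one."""
    global _active_session
    _active_session = None


first = get_or_create(master="local", app_name="foo")
second = get_or_create(first_key="value")
print(first is second)  # True: the same session object is reused
```

Keeping the session in one module-level variable is what makes repeated calls cheap and idempotent in this sketch.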

TODO
- [x] Add more tests
- [ ] Separate PR - update all roxygen2 doc coding example
- [ ] Separate PR - update SparkR programming guide

## How was this patch tested?

unit tests, manual tests

shivaram sun-rui rxin

Author: Felix Cheung <felixcheung_m@hotmail.com>
Author: felixcheung <felixcheung_m@hotmail.com>

Closes #13635 from felixcheung/rsparksession.
2016-06-17 21:36:01 -07:00
Dongjoon Hyun 7d65a0db4a [SPARK-16005][R] Add randomSplit to SparkR
## What changes were proposed in this pull request?

This PR adds `randomSplit` to SparkR for API parity.

## How was this patch tested?

Pass the Jenkins tests (with new testcase.)

Author: Dongjoon Hyun <dongjoon@apache.org>

Closes #13721 from dongjoon-hyun/SPARK-16005.
2016-06-17 16:07:33 -07:00
Felix Cheung ef3cc4fc09 [SPARK-15925][SPARKR] R DataFrame add back registerTempTable, add tests
## What changes were proposed in this pull request?

Add `registerTempTable` back to DataFrame, marked as deprecated.

## How was this patch tested?

unit tests
shivaram liancheng

Author: Felix Cheung <felixcheung_m@hotmail.com>

Closes #13722 from felixcheung/rregistertemptable.
2016-06-17 15:56:03 -07:00
Dongjoon Hyun 513a03e41e [SPARK-15908][R] Add varargs-type dropDuplicates() function in SparkR
## What changes were proposed in this pull request?

This PR adds varargs-type `dropDuplicates` function to SparkR for API parity.
Refer to https://issues.apache.org/jira/browse/SPARK-15807, too.

## How was this patch tested?

Pass the Jenkins tests with new testcases.

Author: Dongjoon Hyun <dongjoon@apache.org>

Closes #13684 from dongjoon-hyun/SPARK-15908.
2016-06-16 20:35:17 -07:00
Kai Jiang 5fd20b66ff [SPARK-15490][R][DOC] SparkR 2.0 QA: New R APIs and API docs for non-MLib changes
## What changes were proposed in this pull request?
R Docs changes
include typos, format, layout.
## How was this patch tested?
Test locally.

Author: Kai Jiang <jiangkai@gmail.com>

Closes #13394 from vectorijk/spark-15490.
2016-06-16 19:39:33 -07:00
Narine Kokhlikyan 7c6c692637 [SPARK-12922][SPARKR][WIP] Implement gapply() on DataFrame in SparkR
## What changes were proposed in this pull request?

gapply() applies an R function on groups grouped by one or more columns of a DataFrame, and returns a DataFrame. It is like GroupedDataSet.flatMapGroups() in the Dataset API.
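
As a rough illustration of the grouping semantics described above, here is a plain-Python sketch that groups rows by key columns, applies a function to each group, and concatenates the results. The function name mirrors the description, but the row representation (a list of dicts) and the `func(key, group)` shape are assumptions made for this example, not SparkR code.

```python
from collections import defaultdict


def gapply(rows, key_cols, func):
    """Group rows by key_cols, apply func(key, group), concatenate the outputs."""
    groups = defaultdict(list)
    for row in rows:
        key = tuple(row[c] for c in key_cols)
        groups[key].append(row)
    result = []
    for key, group in groups.items():
        result.extend(func(key, group))
    return result


def mean_value(key, group):
    """Return one output row holding the group's average of 'value'."""
    total = sum(r["value"] for r in group)
    return [{"dept": key[0], "avg": total / len(group)}]


rows = [
    {"dept": "a", "value": 1.0},
    {"dept": "b", "value": 4.0},
    {"dept": "a", "value": 3.0},
]
averages = gapply(rows, ["dept"], mean_value)
print(averages)
```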

Please let me know what you think and whether you have any ideas to improve it.

Thank you!

## How was this patch tested?
Unit tests.
1. Primitive test with different column types
2. Add a boolean column
3. Compute average by a group

Author: Narine Kokhlikyan <narine.kokhlikyan@gmail.com>
Author: NarineK <narine.kokhlikyan@us.ibm.com>

Closes #12836 from NarineK/gapply2.
2016-06-15 21:42:05 -07:00
Felix Cheung d30b7e6696 [SPARK-15637][SPARK-15931][SPARKR] Fix R masked functions checks
## What changes were proposed in this pull request?

Because of the fix in SPARK-15684, this exclusion is no longer necessary.

## How was this patch tested?

unit tests

shivaram

Author: Felix Cheung <felixcheung_m@hotmail.com>

Closes #13636 from felixcheung/rendswith.
2016-06-15 10:29:07 -07:00
Cheng Lian ced8d669b3 [SPARK-15925][SQL][SPARKR] Replaces registerTempTable with createOrReplaceTempView
## What changes were proposed in this pull request?

This PR replaces `registerTempTable` with `createOrReplaceTempView` as a follow-up task of #12945.

## How was this patch tested?

Existing SparkR tests.

Author: Cheng Lian <lian@databricks.com>

Closes #13644 from liancheng/spark-15925-temp-view-for-r.
2016-06-13 15:46:50 -07:00
Wenchen Fan e2ab79d5ea [SPARK-15898][SQL] DataFrameReader.text should return DataFrame
## What changes were proposed in this pull request?

We want to maintain API compatibility for DataFrameReader.text, and will introduce a new API called DataFrameReader.textFile which returns Dataset[String].

affected PRs:
https://github.com/apache/spark/pull/11731
https://github.com/apache/spark/pull/13104
https://github.com/apache/spark/pull/13184

## How was this patch tested?

N/A

Author: Wenchen Fan <wenchen@databricks.com>

Closes #13604 from cloud-fan/revert.
2016-06-12 21:36:41 -07:00
wm624@hotmail.com 2c8f40cea1 [SPARK-15766][SPARKR] R should export is.nan
## What changes were proposed in this pull request?

When reviewing SPARK-15545, we found that `is.nan` is not exported, although it should be.

Add it to the NAMESPACE.

## How was this patch tested?

Manual tests.

Author: wm624@hotmail.com <wm624@hotmail.com>

Closes #13508 from wangmiao1981/unused.
2016-06-10 12:46:22 -07:00
wm624@hotmail.com 3ec4461c46 [SPARK-15684][SPARKR] Not mask startsWith and endsWith in R
## What changes were proposed in this pull request?

R 3.3.0 added `startsWith` and `endsWith`. This PR makes both work in SparkR without masking the base functions.
1. Remove signature in generic.R
2. Add setMethod in column.R
3. Add unit tests
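
The idea behind the steps above (keep the existing function for strings and register an extra method for columns, rather than redefining the generic) has a loose analogue in Python's `functools.singledispatch`. The sketch below is only that analogy; `Column` and `starts_with` are hypothetical stand-ins.

```python
from functools import singledispatch


class Column:
    """Hypothetical stand-in for a column expression."""

    def __init__(self, expr):
        self.expr = expr


@singledispatch
def starts_with(x, prefix):
    raise TypeError(f"unsupported type: {type(x).__name__}")


@starts_with.register(str)
def _(x, prefix):
    # Plain strings keep the ordinary behaviour.
    return x.startswith(prefix)


@starts_with.register(Column)
def _(x, prefix):
    # Columns build a new expression instead of evaluating immediately.
    return Column(f"startswith({x.expr}, '{prefix}')")


print(starts_with("spark", "sp"))
print(starts_with(Column("name"), "A").expr)
```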

## How was this patch tested?
Manually test it through SparkR shell for both column data and string data, which are added into the unit test file.

Author: wm624@hotmail.com <wm624@hotmail.com>

Closes #13476 from wangmiao1981/start.
2016-06-07 09:13:18 -07:00
Zheng RuiFeng fd8af39713 [MINOR] Fix Typos 'an -> a'
## What changes were proposed in this pull request?

`an -> a`

Use cmds like `find . -name '*.R' | xargs -i sh -c "grep -in ' an [^aeiou]' {} && echo {}"` to generate candidates, and review them one by one.

## How was this patch tested?
manual tests

Author: Zheng RuiFeng <ruifengz@foxmail.com>

Closes #13515 from zhengruifeng/an_a.
2016-06-06 09:35:47 +01:00
Kai Jiang 8a9110510c [MINOR][R][DOC] Fix R documentation generation instruction.
## What changes were proposed in this pull request?
changes in R/README.md

- Make the steps for generating the SparkR documentation clearer.
- link R/DOCUMENTATION.md from R/README.md
- turn on some code syntax highlight in R/README.md

## How was this patch tested?
local test

Author: Kai Jiang <jiangkai@gmail.com>

Closes #13488 from vectorijk/R-Readme.
2016-06-05 13:03:02 -07:00
felixcheung 74c1b79f3f [SPARK-15637][SPARKR] fix R tests on R 3.2.2
## What changes were proposed in this pull request?

Change version check in R tests

## How was this patch tested?

R tests
shivaram

Author: felixcheung <felixcheung_m@hotmail.com>

Closes #13369 from felixcheung/rversioncheck.
2016-05-28 10:32:40 -07:00
felixcheung c82883239e [SPARK-10903] followup - update API doc for SqlContext
## What changes were proposed in this pull request?

Follow up on the earlier PR - in here we are fixing up roxygen2 doc examples.
Also add to the programming guide migration section.

## How was this patch tested?

SparkR tests

Author: felixcheung <felixcheung_m@hotmail.com>

Closes #13340 from felixcheung/sqlcontextdoc.
2016-05-26 21:42:36 -07:00
hyukjinkwon 1c403733b8 [SPARK-8603][SPARKR] Use shell() instead of system2() for SparkR on Windows
## What changes were proposed in this pull request?

This PR corrects SparkR to use `shell()` instead of `system2()` on Windows.

Using `system2(...)` on Windows does not process the Windows file separator `\`. `shell(translate = TRUE, ...)` handles this, so the call is now chosen according to the OS.

Existing tests failed on Windows due to this problem. For example, the following failed:

  ```
8. Failure: sparkJars tag in SparkContext (test_includeJAR.R#34)
9. Failure: sparkJars tag in SparkContext (test_includeJAR.R#36)
```

The cases above were caused by the use of `system2`.

In addition, this PR fixes some tests that failed on Windows.

  ```
5. Failure: sparkJars sparkPackages as comma-separated strings (test_context.R#128)
6. Failure: sparkJars sparkPackages as comma-separated strings (test_context.R#131)
7. Failure: sparkJars sparkPackages as comma-separated strings (test_context.R#134)
```

The cases above were due to a quirk of `normalizePath()`: if the path does not exist, it returns the input unchanged on Linux, but on Windows it returns the input prefixed with the current path.

  ```r
# On Linux
path <- normalizePath("aa")
print(path)
[1] "aa"

# On Windows
path <- normalizePath("aa")
print(path)
[1] "C:\\Users\\aa"
```

## How was this patch tested?

Jenkins tests and manually tested on a Windows machine as below:

Here is the [stdout](https://gist.github.com/HyukjinKwon/4bf35184f3a30f3bce987a58ec2bbbab) of testing.

Closes #7025

Author: hyukjinkwon <gurwls223@gmail.com>
Author: Hyukjin Kwon <gurwls223@gmail.com>
Author: Prakash PC <prakash.chinnu@gmail.com>

Closes #13165 from HyukjinKwon/pr/7025.
2016-05-26 20:55:06 -07:00
Xin Ren 6ab973ec51 [SPARK-15542][SPARKR] Make error message clear for script './R/install-dev.sh' when R is missing on Mac
https://issues.apache.org/jira/browse/SPARK-15542

## What changes were proposed in this pull request?

When running `./R/install-dev.sh` in a **Mac OS El Capitan** environment, I got
```
mbp185-xr:spark xin$ ./R/install-dev.sh
usage: dirname path
```
This message was very confusing. I then found that R was not properly configured on my Mac, and that the script uses `$(which R)` to get the R home.

I tried a similar situation on CentOS with R missing, and it gives a very clear error message, while macOS does not.
On CentOS:
```
[rootip-xxx-31-9-xx spark]# which R
/usr/bin/which: no R in (/usr/local/sbin:/usr/local/bin:/sbin:/bin:/usr/sbin:/usr/bin:/usr/lib/jvm/java-1.7.0-openjdk.x86_64/bin:/root/bin)
```
On Mac, however, `which R` returns nothing when R is not found, which causes the confusing message when the R build fails while running `R/install-dev.sh`:
```
mbp185-xr:spark xin$ which R
mbp185-xr:spark xin$
```

Here I just added a clear message for this misconfiguration of R when running `R/install-dev.sh`.
```
mbp185-xr:spark xin$ ./R/install-dev.sh
Cannot find R home by running 'which R', please make sure R is properly installed.
```

## How was this patch tested?
Manually tested on local machine.

Author: Xin Ren <iamshrek@126.com>

Closes #13308 from keypointt/SPARK-15542.
2016-05-26 21:25:13 -05:00
felixcheung c76457c8e4 [SPARK-10903][SPARKR] R - Simplify SQLContext method signatures and use a singleton
Eliminate the need to pass sqlContext to methods since it is a singleton - and we don't want to support multiple contexts in an R session.

Changes are made in a backward-compatible way, with a deprecation warning added. Method signatures for S3 methods are added in a concise, clean way so that the deprecated signatures can be removed easily in the next release (just delete a few lines per method).

Custom method dispatch is implemented to allow for multiple JVM reference types that are all 'jobj' in R and to avoid having to add 30 new exports.

Author: felixcheung <felixcheung_m@hotmail.com>

Closes #9192 from felixcheung/rsqlcontext.
2016-05-26 11:20:20 -07:00
wm624@hotmail.com 06bae8af17 [SPARK-15439][SPARKR] Failed to run unit test in SparkR
## What changes were proposed in this pull request?

There are some failures when running SparkR unit tests.
In this PR, I fixed two of these failures, in test_context.R and test_sparkSQL.R.
The first one is due to differing masked names. I added the missing names to the expected arrays.
The second one is because one PR removed the logic of a previous fix for the missing subset method.

The file privilege issue is still there and I am debugging it. The SparkR shell can run the test case successfully.
test_that("pipeRDD() on RDDs", {
  actual <- collect(pipeRDD(rdd, "more"))
When using the run-test script, it complains that there is no such directory, as below:
cannot open file '/tmp/Rtmp4FQbah/filee2273f9d47f7': No such file or directory

## How was this patch tested?

Manually test it

Author: wm624@hotmail.com <wm624@hotmail.com>

Closes #13284 from wangmiao1981/R.
2016-05-25 21:08:03 -07:00
Daoyuan Wang d642b27354 [SPARK-15397][SQL] fix string udf locate as hive
## What changes were proposed in this pull request?

In Hive, `locate("aa", "aaa", 0)` would yield 0, `locate("aa", "aaa", 1)` would yield 1 and `locate("aa", "aaa", 2)` would yield 2, while in Spark, `locate("aa", "aaa", 0)` would yield 1, `locate("aa", "aaa", 1)` would yield 2 and `locate("aa", "aaa", 2)` would yield 0. This results from a different understanding of the third parameter of the `locate` UDF: it is the starting index and is 1-based, so passing 0 always returns 0.
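
The Hive-style behaviour described above can be reproduced with a small plain-Python function. This is a sketch of the semantics only (1-based start position, 0 meaning "not found"), not the Spark implementation.

```python
def locate(substr, s, pos=1):
    """Return the 1-based index of substr in s, searching from 1-based pos.

    Returns 0 when substr is not found or when pos is less than 1.
    """
    if pos < 1:
        return 0
    # str.find is 0-based and returns -1 on failure, so shift by one both ways.
    return s.find(substr, pos - 1) + 1


print(locate("aa", "aaa", 0))  # 0
print(locate("aa", "aaa", 1))  # 1
print(locate("aa", "aaa", 2))  # 2
```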

## How was this patch tested?

tested with modified `StringExpressionsSuite` and `StringFunctionsSuite`

Author: Daoyuan Wang <daoyuan.wang@intel.com>

Closes #13186 from adrian-wang/locate.
2016-05-23 23:29:15 -07:00
hyukjinkwon a8e97d17b9 [MINOR][SPARKR][DOC] Add a description for running unit tests in Windows
## What changes were proposed in this pull request?

This PR adds the description for running unit tests in Windows.

## How was this patch tested?

On a bare machine (Windows 7, 32-bit), this was manually built and tested.

Author: hyukjinkwon <gurwls223@gmail.com>

Closes #13217 from HyukjinKwon/minor-r-doc.
2016-05-23 17:20:29 -07:00
Reynold Xin 4987f39ac7 [SPARK-14463][SQL] Document the semantics for read.text
## What changes were proposed in this pull request?
This patch is a follow-up to https://github.com/apache/spark/pull/13104 and adds documentation to clarify the semantics of read.text with respect to partitioning.

## How was this patch tested?
N/A

Author: Reynold Xin <rxin@databricks.com>

Closes #13184 from rxin/SPARK-14463.
2016-05-18 19:16:28 -07:00
Sun Rui b3930f74a0 [SPARK-15202][SPARKR] add dapplyCollect() method for DataFrame in SparkR.
## What changes were proposed in this pull request?

dapplyCollect() applies an R function on each partition of a SparkDataFrame and collects the result back to R as a data.frame.
```
dapplyCollect(df, function(ldf) {...})
```
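
To make the "collect" part concrete, here is a small plain-Python sketch in which partitions are modelled as a list of lists and the per-partition results are flattened into one local list. The data model and the name `dapply_collect` are assumptions for illustration, not SparkR internals.

```python
def dapply_collect(partitions, func):
    """Apply func to each partition and flatten the results into one local list."""
    collected = []
    for part in partitions:
        collected.extend(func(part))
    return collected


partitions = [[1, 2], [3], [4, 5, 6]]
local = dapply_collect(partitions, lambda part: [x * 10 for x in part])
print(local)  # [10, 20, 30, 40, 50, 60]
```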

## How was this patch tested?
SparkR unit tests.

Author: Sun Rui <sunrui2016@gmail.com>

Closes #12989 from sun-rui/SPARK-15202.
2016-05-12 17:50:55 -07:00
Yanbo Liang ee3b171562 [MINOR] [SPARKR] Update data-manipulation.R to use native csv reader
## What changes were proposed in this pull request?
* Since Spark now has a native csv reader, it is no longer necessary to use the third-party ```spark-csv``` in ```examples/src/main/r/data-manipulation.R```. Also remove all ```spark-csv``` usage in SparkR.
* Running R applications through ```sparkR``` is not supported as of Spark 2.0, so we change to use ```./bin/spark-submit``` to run the example.

## How was this patch tested?
Offline test.

Author: Yanbo Liang <ybliang8@gmail.com>

Closes #13005 from yanboliang/r-df-examples.
2016-05-09 09:58:36 -07:00
Sun Rui 454ba4d67e [SPARK-12479][SPARKR] sparkR collect on GroupedData throws R error "missing value where TRUE/FALSE needed"
## What changes were proposed in this pull request?

This PR is a workaround for NA handling in hash code computation.

This PR is on behalf of paulomagalhaes whose PR is https://github.com/apache/spark/pull/10436

## How was this patch tested?
SparkR unit tests.

Author: Sun Rui <sunrui2016@gmail.com>
Author: ray <ray@rays-MacBook-Air.local>

Closes #12976 from sun-rui/SPARK-12479.
2016-05-08 00:17:36 -07:00
Sun Rui 157a49aa41 [SPARK-11395][SPARKR] Support over and window specification in SparkR.
This PR:
1. Implement WindowSpec S4 class.
2. Implement Window.partitionBy() and Window.orderBy() as utility functions to create WindowSpec objects.
3. Implement over() of Column class.
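
For readers unfamiliar with window specifications, the following plain-Python sketch shows what evaluating a row-number function over a "partition by, then order by" specification amounts to. The function name and the list-of-dicts row model are assumptions for illustration; this is not the SparkR API.

```python
from collections import defaultdict


def row_number_over(rows, partition_by, order_by):
    """Attach a 1-based row_number within each partition, ordered by order_by."""
    partitions = defaultdict(list)
    for row in rows:
        partitions[row[partition_by]].append(row)
    result = []
    for key in partitions:
        ordered = sorted(partitions[key], key=lambda r: r[order_by])
        for i, row in enumerate(ordered, start=1):
            result.append({**row, "row_number": i})
    return result


rows = [
    {"dept": "a", "salary": 30},
    {"dept": "b", "salary": 10},
    {"dept": "a", "salary": 20},
]
numbered = row_number_over(rows, "dept", "salary")
print(numbered)
```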

Author: Sun Rui <rui.sun@intel.com>
Author: Sun Rui <sunrui2016@gmail.com>

Closes #10094 from sun-rui/SPARK-11395.
2016-05-05 18:49:43 -07:00
NarineK 22226fcc92 [SPARK-15110] [SPARKR] Implement repartitionByColumn for SparkR DataFrames
## What changes were proposed in this pull request?

Implement repartitionByColumn on DataFrame.
This will allow us to run R functions on each partition identified by column groups with dapply() method.
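
The property that matters here is that all rows sharing a column value end up in the same partition. A plain-Python sketch of that idea follows; it uses `zlib.crc32` as a stand-in hash, which is an assumption for this example and not what Spark uses internally.

```python
import zlib


def repartition_by_column(rows, col, num_partitions):
    """Place each row in the partition chosen by hashing its value in col."""
    partitions = [[] for _ in range(num_partitions)]
    for row in rows:
        # crc32 is deterministic across runs, unlike Python's built-in hash() for str.
        index = zlib.crc32(str(row[col]).encode("utf-8")) % num_partitions
        partitions[index].append(row)
    return partitions


rows = [{"k": "a", "v": 1}, {"k": "b", "v": 2}, {"k": "a", "v": 3}, {"k": "c", "v": 4}]
parts = repartition_by_column(rows, "k", 3)
for i, part in enumerate(parts):
    print(i, part)
```

A function applied per partition afterwards then sees every row for a given key together.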

## How was this patch tested?

Unit tests

Author: NarineK <narine.kokhlikyan@us.ibm.com>

Closes #12887 from NarineK/repartitionByColumns.
2016-05-05 12:00:55 -07:00
Sun Rui 8b6491fc0b [SPARK-15091][SPARKR] Fix warnings and a failure in SparkR test cases with testthat version 1.0.1
## What changes were proposed in this pull request?
Fix warnings and a failure in SparkR test cases with testthat version 1.0.1

## How was this patch tested?
SparkR unit test cases.

Author: Sun Rui <sunrui2016@gmail.com>

Closes #12867 from sun-rui/SPARK-15091.
2016-05-03 09:29:49 -07:00
Yanbo Liang 19a6d192d5 [SPARK-15030][ML][SPARKR] Support formula in spark.kmeans in SparkR
## What changes were proposed in this pull request?
* ```RFormula``` supports empty response variable like ```~ x + y```.
* Support formula in ```spark.kmeans``` in SparkR.
* Fix some outdated docs for SparkR.

## How was this patch tested?
Unit tests.

Author: Yanbo Liang <ybliang8@gmail.com>

Closes #12813 from yanboliang/spark-15030.
2016-04-30 08:37:56 -07:00
Xiangrui Meng b3ea579314 [SPARK-14831][.2][ML][R] rename ml.save/ml.load to write.ml/read.ml
## What changes were proposed in this pull request?

Continue the work of #12789 to rename ml.save/ml.load to write.ml/read.ml, which are more consistent with read.df/write.df and other methods in SparkR.

I didn't rename `data` to `df` because we still use `predict` for prediction, which uses `newData` to match the signature in R.

## How was this patch tested?

Existing unit tests.

cc: yanboliang thunterdb

Author: Xiangrui Meng <meng@databricks.com>

Closes #12807 from mengxr/SPARK-14831.
2016-04-30 00:45:44 -07:00
Timothy Hunter bc36fe6e89 [SPARK-14831][SPARKR] Make the SparkR MLlib API more consistent with Spark
## What changes were proposed in this pull request?

This PR splits the MLlib algorithms into two flavors:
 - the R flavor, which tries to mimic the existing R API for these algorithms (and works as an S4 specialization for Spark dataframes)
 - the Spark flavor, which follows the same API and naming conventions as the rest of the MLlib algorithms in the other languages

In practice, the former calls the latter.

## How was this patch tested?

The tests for the various algorithms were adapted to be run against both interfaces.

Author: Timothy Hunter <timhunter@databricks.com>

Closes #12789 from thunterdb/14831.
2016-04-29 23:13:03 -07:00
Sun Rui 4ae9fe091c [SPARK-12919][SPARKR] Implement dapply() on DataFrame in SparkR.
## What changes were proposed in this pull request?

dapply() applies an R function on each partition of a DataFrame and returns a new DataFrame.

The function signature is:

	dapply(df, function(localDF) {}, schema = NULL)

R function input: local data.frame from the partition on local node
R function output: local data.frame

Schema specifies the Row format of the resulting DataFrame. It must match the R function's output.
If schema is not specified, each partition of the result DataFrame will be serialized in R into a single byte array. Such resulting DataFrame can be processed by successive calls to dapply().
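
The contract described above (the function receives one whole partition as local data, returns local data, and the output must match the declared schema) can be sketched in plain Python. Here a partition is a list of dict rows and the schema is simply an ordered list of column names; both are simplifying assumptions. When no schema is given, this sketch just skips the check rather than serializing anything.

```python
def dapply(partitions, func, schema=None):
    """Apply func to every partition, keeping the partition structure.

    If schema (an ordered list of column names) is given, each output row
    must have exactly those columns in that order.
    """
    result = []
    for part in partitions:
        out = func(part)
        if schema is not None:
            for row in out:
                if list(row) != schema:
                    raise ValueError(f"row columns {list(row)} do not match schema {schema}")
        result.append(out)
    return result


def add_total(part):
    # The function sees the whole partition, so partition-level aggregates are available.
    total = sum(r["x"] for r in part)
    return [{"x": r["x"], "partition_total": total} for r in part]


partitions = [[{"x": 1}, {"x": 2}], [{"x": 3}]]
out = dapply(partitions, add_total, schema=["x", "partition_total"])
print(out)
```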

## How was this patch tested?
SparkR unit tests.

Author: Sun Rui <rui.sun@intel.com>
Author: Sun Rui <sunrui2016@gmail.com>

Closes #12493 from sun-rui/SPARK-12919.
2016-04-29 16:41:07 -07:00
Yanbo Liang 87ac84d437 [SPARK-14314][SPARK-14315][ML][SPARKR] Model persistence in SparkR (glm & kmeans)
SparkR ```glm``` and ```kmeans``` model persistence.

Unit tests.

Author: Yanbo Liang <ybliang8@gmail.com>
Author: Gayathri Murali <gayathri.m.softie@gmail.com>

Closes #12778 from yanboliang/spark-14311.
Closes #12680
Closes #12683
2016-04-29 09:43:04 -07:00