Commit graph

79 commits

Author SHA1 Message Date
Felix Cheung f0d21b7f90 [SPARK-17442][SPARKR] Additional arguments in write.df are not passed to data source
## What changes were proposed in this pull request?

additional options were not passed down in write.df.

## How was this patch tested?

unit tests
falaki shivaram

Author: Felix Cheung <felixcheung_m@hotmail.com>

Closes #15010 from felixcheung/testreadoptions.
2016-09-08 08:22:58 -07:00
Clark Fitzgerald 9fccde4ff8 [SPARK-16785] R dapply doesn't return array or raw columns
## What changes were proposed in this pull request?

Fixed bug in `dapplyCollect` by changing the `compute` function of `worker.R` to explicitly handle raw (binary) vectors.

cc shivaram

## How was this patch tested?

Unit tests

Author: Clark Fitzgerald <clarkfitzg@gmail.com>

Closes #14783 from clarkfitzg/SPARK-16785.
2016-09-06 23:40:37 -07:00
Felix Cheung 812333e433 [SPARK-17376][SPARKR] Spark version should be available in R
## What changes were proposed in this pull request?

Add sparkR.version() API.

```
> sparkR.version()
[1] "2.1.0-SNAPSHOT"
```

## How was this patch tested?

manual, unit tests

Author: Felix Cheung <felixcheung_m@hotmail.com>

Closes #14935 from felixcheung/rsparksessionversion.
2016-09-02 10:12:10 -07:00
wm624@hotmail.com 0f30cdedbd [SPARK-16883][SPARKR] SQL decimal type is not properly cast to number when collecting SparkDataFrame
## What changes were proposed in this pull request?

(Please fill in changes proposed in this fix)

registerTempTable(createDataFrame(iris), "iris")
str(collect(sql("select cast('1' as double) as x, cast('2' as decimal) as y  from iris limit 5")))

'data.frame':	5 obs. of  2 variables:
 $ x: num  1 1 1 1 1
 $ y:List of 5
  ..$ : num 2
  ..$ : num 2
  ..$ : num 2
  ..$ : num 2
  ..$ : num 2

The problem is that spark returns `decimal(10, 0)` col type, instead of `decimal`. Thus, `decimal(10, 0)` is not handled correctly. It should be handled as "double".

As discussed in JIRA thread, we can have two potential fixes:
1). Scala side fix to add a new case when writing the object back; However, I can't use spark.sql.types._ in Spark core due to dependency issues. I don't find a way of doing type case match;

2). SparkR side fix: Add a helper function to check special type like `"decimal(10, 0)"` and replace it with `double`, which is PRIMITIVE type. This special helper is generic for adding new types handling in the future.

I open this PR to discuss pros and cons of both approaches. If we want to do Scala side fix, we need to find a way to match the case of DecimalType and StructType in Spark Core.

## How was this patch tested?

(Please explain how this patch was tested. E.g. unit tests, integration tests, manual tests)

Manual test:
> str(collect(sql("select cast('1' as double) as x, cast('2' as decimal) as y  from iris limit 5")))
'data.frame':	5 obs. of  2 variables:
 $ x: num  1 1 1 1 1
 $ y: num  2 2 2 2 2
R Unit tests

Author: wm624@hotmail.com <wm624@hotmail.com>

Closes #14613 from wangmiao1981/type.
2016-09-02 01:47:17 -07:00
hyukjinkwon 50bb142332 [SPARK-17326][SPARKR] Fix tests with HiveContext in SparkR not to be skipped always
## What changes were proposed in this pull request?

Currently, `HiveContext` in SparkR is not being tested and always skipped.
This is because the initiation of `TestHiveContext` is being failed due to trying to load non-existing data paths (test tables).

This is introduced from https://github.com/apache/spark/pull/14005

This enables the tests with SparkR.

## How was this patch tested?

Manually,

**Before** (on Mac OS)

```
...
Skipped ------------------------------------------------------------------------
1. create DataFrame from RDD (test_sparkSQL.R#200) - Hive is not build with SparkSQL, skipped
2. test HiveContext (test_sparkSQL.R#1041) - Hive is not build with SparkSQL, skipped
3. read/write ORC files (test_sparkSQL.R#1748) - Hive is not build with SparkSQL, skipped
4. enableHiveSupport on SparkSession (test_sparkSQL.R#2480) - Hive is not build with SparkSQL, skipped
5. sparkJars tag in SparkContext (test_Windows.R#21) - This test is only for Windows, skipped
...
```

**After** (on Mac OS)

```
...
Skipped ------------------------------------------------------------------------
1. sparkJars tag in SparkContext (test_Windows.R#21) - This test is only for Windows, skipped
...
```

Please refer the tests below (on Windows)
 - Before: https://ci.appveyor.com/project/HyukjinKwon/spark/build/45-test123
 - After: https://ci.appveyor.com/project/HyukjinKwon/spark/build/46-test123

Author: hyukjinkwon <gurwls223@gmail.com>

Closes #14889 from HyukjinKwon/SPARK-17326.
2016-08-31 14:02:21 -07:00
Felix Cheung c34b546d67 [SPARK-16519][SPARKR] Handle SparkR RDD generics that create warnings in R CMD check
## What changes were proposed in this pull request?

Rename RDD functions for now to avoid CRAN check warnings.
Some RDD functions are sharing generics with DataFrame functions (hence the problem) so after the renames we need to add new generics, for now.

## How was this patch tested?

unit tests

Author: Felix Cheung <felixcheung_m@hotmail.com>

Closes #14626 from felixcheung/rrddfunctions.
2016-08-16 11:19:18 -07:00
Junyang Qian 214ba66a03 [SPARK-16579][SPARKR] add install.spark function
## What changes were proposed in this pull request?

Add an install_spark function to the SparkR package. User can run `install_spark()` to install Spark to a local directory within R.

Updates:

Several changes have been made:

- `install.spark()`
    - check existence of tar file in the cache folder, and download only if not found
    - trial priority of mirror_url look-up: user-provided -> preferred mirror site from apache website -> hardcoded backup option
    - use 2.0.0

- `sparkR.session()`
    - can install spark when not found in `SPARK_HOME`

## How was this patch tested?

Manual tests, running the check-cran.sh script added in #14173.

Author: Junyang Qian <junyangq@databricks.com>

Closes #14258 from junyangq/SPARK-16579.
2016-08-10 11:18:23 -07:00
Felix Cheung d27fe9ba67 [SPARK-16027][SPARKR] Fix R tests SparkSession init/stop
## What changes were proposed in this pull request?

Fix R SparkSession init/stop, and warnings of reusing existing Spark Context

## How was this patch tested?

unit tests

shivaram

Author: Felix Cheung <felixcheung_m@hotmail.com>

Closes #14177 from felixcheung/rsessiontest.
2016-07-17 19:02:21 -07:00
Felix Cheung 611a8ca589 [SPARK-16538][SPARKR] Add more tests for namespace call to SparkSession functions
## What changes were proposed in this pull request?

More tests
I don't think this is critical for Spark 2.0.0 RC, maybe Spark 2.0.1 or 2.1.0.

## How was this patch tested?

unit tests

shivaram dongjoon-hyun

Author: Felix Cheung <felixcheung_m@hotmail.com>

Closes #14206 from felixcheung/rroutetests.
2016-07-15 13:58:57 -07:00
Felix Cheung 12005c88fb [SPARK-16538][SPARKR] fix R call with namespace operator on SparkSession functions
## What changes were proposed in this pull request?

Fix function routing to work with and without namespace operator `SparkR::createDataFrame`

## How was this patch tested?

manual, unit tests

shivaram

Author: Felix Cheung <felixcheung_m@hotmail.com>

Closes #14195 from felixcheung/rroutedefault.
2016-07-14 09:45:30 -07:00
Sun Rui 093ebbc628 [SPARK-16509][SPARKR] Rename window.partitionBy and window.orderBy to windowPartitionBy and windowOrderBy.
## What changes were proposed in this pull request?
Rename window.partitionBy and window.orderBy to windowPartitionBy and windowOrderBy to pass CRAN package check.

## How was this patch tested?
SparkR unit tests.

Author: Sun Rui <sunrui2016@gmail.com>

Closes #14192 from sun-rui/SPARK-16509.
2016-07-14 09:38:42 -07:00
Felix Cheung fb2e8eeb0b [SPARKR][DOCS][MINOR] R programming guide to include csv data source example
## What changes were proposed in this pull request?

Minor documentation update for code example, code style, and missed reference to "sparkR.init"

## How was this patch tested?

manual

shivaram

Author: Felix Cheung <felixcheung_m@hotmail.com>

Closes #14178 from felixcheung/rcsvprogrammingguide.
2016-07-13 15:09:23 -07:00
Dongjoon Hyun 142df4834b [SPARK-16429][SQL] Include StringType columns in describe()
## What changes were proposed in this pull request?

Currently, Spark `describe` supports `StringType`. However, `describe()` returns a dataset for only all numeric columns. This PR aims to include `StringType` columns in `describe()`, `describe` without argument.

**Background**
```scala
scala> spark.read.json("examples/src/main/resources/people.json").describe("age", "name").show()
+-------+------------------+-------+
|summary|               age|   name|
+-------+------------------+-------+
|  count|                 2|      3|
|   mean|              24.5|   null|
| stddev|7.7781745930520225|   null|
|    min|                19|   Andy|
|    max|                30|Michael|
+-------+------------------+-------+
```

**Before**
```scala
scala> spark.read.json("examples/src/main/resources/people.json").describe().show()
+-------+------------------+
|summary|               age|
+-------+------------------+
|  count|                 2|
|   mean|              24.5|
| stddev|7.7781745930520225|
|    min|                19|
|    max|                30|
+-------+------------------+
```

**After**
```scala
scala> spark.read.json("examples/src/main/resources/people.json").describe().show()
+-------+------------------+-------+
|summary|               age|   name|
+-------+------------------+-------+
|  count|                 2|      3|
|   mean|              24.5|   null|
| stddev|7.7781745930520225|   null|
|    min|                19|   Andy|
|    max|                30|Michael|
+-------+------------------+-------+
```

## How was this patch tested?

Pass the Jenkins with a update testcase.

Author: Dongjoon Hyun <dongjoon@apache.org>

Closes #14095 from dongjoon-hyun/SPARK-16429.
2016-07-08 14:36:50 -07:00
Dongjoon Hyun 6aa7d09f4e [SPARK-16425][R] describe() should not fail with non-numeric columns
## What changes were proposed in this pull request?

This PR prevents ERRORs when `summary(df)` is called for `SparkDataFrame` with not-numeric columns. This failure happens only in `SparkR`.

**Before**
```r
> df <- createDataFrame(faithful)
> df <- withColumn(df, "boolean", df$waiting==79)
> summary(df)
16/07/07 14:15:16 ERROR RBackendHandler: describe on 34 failed
Error in invokeJava(isStatic = FALSE, objId$id, methodName, ...) :
  org.apache.spark.sql.AnalysisException: cannot resolve 'avg(`boolean`)' due to data type mismatch: function average requires numeric types, not BooleanType;
```

**After**
```r
> df <- createDataFrame(faithful)
> df <- withColumn(df, "boolean", df$waiting==79)
> summary(df)
SparkDataFrame[summary:string, eruptions:string, waiting:string]
```

## How was this patch tested?

Pass the Jenkins with a updated testcase.

Author: Dongjoon Hyun <dongjoon@apache.org>

Closes #14096 from dongjoon-hyun/SPARK-16425.
2016-07-07 17:47:29 -07:00
Felix Cheung f4767bcc7a [SPARK-16310][SPARKR] R na.string-like default for csv source
## What changes were proposed in this pull request?

Apply default "NA" as null string for R, like R read.csv na.string parameter.

https://stat.ethz.ch/R-manual/R-devel/library/utils/html/read.table.html
na.strings = "NA"

An user passing a csv file with NA value should get the same behavior with SparkR read.df(... source = "csv")

(couldn't open JIRA, will do that later)

## How was this patch tested?

unit tests

shivaram

Author: Felix Cheung <felixcheung_m@hotmail.com>

Closes #13984 from felixcheung/rcsvnastring.
2016-07-07 15:21:57 -07:00
Dongjoon Hyun d17e5f2f12 [SPARK-16233][R][TEST] ORC test should be enabled only when HiveContext is available.
## What changes were proposed in this pull request?

ORC test should be enabled only when HiveContext is available.

## How was this patch tested?

Manual.
```
$ R/run-tests.sh
...
1. create DataFrame from RDD (test_sparkSQL.R#200) - Hive is not build with SparkSQL, skipped

2. test HiveContext (test_sparkSQL.R#1021) - Hive is not build with SparkSQL, skipped

3. read/write ORC files (test_sparkSQL.R#1728) - Hive is not build with SparkSQL, skipped

4. enableHiveSupport on SparkSession (test_sparkSQL.R#2448) - Hive is not build with SparkSQL, skipped

5. sparkJars tag in SparkContext (test_Windows.R#21) - This test is only for Windows, skipped

DONE ===========================================================================
Tests passed.
```

Author: Dongjoon Hyun <dongjoon@apache.org>

Closes #14019 from dongjoon-hyun/SPARK-16233.
2016-07-01 15:35:19 -07:00
Narine Kokhlikyan 26afb4ce40 [SPARK-16012][SPARKR] Implement gapplyCollect which will apply a R function on each group similar to gapply and collect the result back to R data.frame
## What changes were proposed in this pull request?
gapplyCollect() does gapply() on a SparkDataFrame and collect the result back to R. Compared to gapply() + collect(), gapplyCollect() offers performance optimization as well as programming convenience, as no schema is needed to be provided.

This is similar to dapplyCollect().

## How was this patch tested?
Added test cases for gapplyCollect similar to dapplyCollect

Author: Narine Kokhlikyan <narine@slice.com>

Closes #13760 from NarineK/gapplyCollect.
2016-07-01 13:55:13 -07:00
Dongjoon Hyun 46395db80e [SPARK-16289][SQL] Implement posexplode table generating function
## What changes were proposed in this pull request?

This PR implements `posexplode` table generating function. Currently, master branch raises the following exception for `map` argument. It's different from Hive.

**Before**
```scala
scala> sql("select posexplode(map('a', 1, 'b', 2))").show
org.apache.spark.sql.AnalysisException: No handler for Hive UDF ... posexplode() takes an array as a parameter; line 1 pos 7
```

**After**
```scala
scala> sql("select posexplode(map('a', 1, 'b', 2))").show
+---+---+-----+
|pos|key|value|
+---+---+-----+
|  0|  a|    1|
|  1|  b|    2|
+---+---+-----+
```

For `array` argument, `after` is the same with `before`.
```
scala> sql("select posexplode(array(1, 2, 3))").show
+---+---+
|pos|col|
+---+---+
|  0|  1|
|  1|  2|
|  2|  3|
+---+---+
```

## How was this patch tested?

Pass the Jenkins tests with newly added testcases.

Author: Dongjoon Hyun <dongjoon@apache.org>

Closes #13971 from dongjoon-hyun/SPARK-16289.
2016-06-30 12:03:54 -07:00
Felix Cheung 823518c2b5 [SPARKR] add csv tests
## What changes were proposed in this pull request?

Add unit tests for csv data for SPARKR

## How was this patch tested?

unit tests

Author: Felix Cheung <felixcheung_m@hotmail.com>

Closes #13904 from felixcheung/rcsv.
2016-06-28 17:08:28 -07:00
Prashant Sharma f6b497fcdd [SPARK-16128][SQL] Allow setting length of characters to be truncated to, in Dataset.show function.
## What changes were proposed in this pull request?

Allowing truncate to a specific number of character is convenient at times, especially while operating from the REPL. Sometimes those last few characters make all the difference, and showing everything brings in whole lot of noise.

## How was this patch tested?
Existing tests. + 1 new test in DataFrameSuite.

For SparkR and pyspark, existing tests and manual testing.

Author: Prashant Sharma <prashsh1@in.ibm.com>
Author: Prashant Sharma <prashant@apache.org>

Closes #13839 from ScrapCodes/add_truncateTo_DF.show.
2016-06-28 17:11:06 +05:30
Felix Cheung 30b182bcc0 [SPARK-16184][SPARKR] conf API for SparkSession
## What changes were proposed in this pull request?

Add `conf` method to get Runtime Config from SparkSession

## How was this patch tested?

unit tests, manual tests

This is how it works in sparkR shell:
```
 SparkSession available as 'spark'.
> conf()
$hive.metastore.warehouse.dir
[1] "file:/opt/spark-2.0.0-bin-hadoop2.6/R/spark-warehouse"

$spark.app.id
[1] "local-1466749575523"

$spark.app.name
[1] "SparkR"

$spark.driver.host
[1] "10.0.2.1"

$spark.driver.port
[1] "45629"

$spark.executorEnv.LD_LIBRARY_PATH
[1] "$LD_LIBRARY_PATH:/usr/lib/R/lib:/usr/lib/x86_64-linux-gnu:/usr/lib/jvm/default-java/jre/lib/amd64/server"

$spark.executor.id
[1] "driver"

$spark.home
[1] "/opt/spark-2.0.0-bin-hadoop2.6"

$spark.master
[1] "local[*]"

$spark.sql.catalogImplementation
[1] "hive"

$spark.submit.deployMode
[1] "client"

> conf("spark.master")
$spark.master
[1] "local[*]"

```

Author: Felix Cheung <felixcheung_m@hotmail.com>

Closes #13885 from felixcheung/rconf.
2016-06-26 13:10:43 -07:00
Felix Cheung dbfdae4e41 [SPARK-16096][SPARKR] add union and deprecate unionAll
## What changes were proposed in this pull request?

add union and deprecate unionAll, separate roxygen2 doc for rbind (since their usage and parameter lists are quite different)

`explode` is also deprecated - but seems like replacement is a combination of calls; not sure if we should deprecate it in SparkR, yet.

## How was this patch tested?

unit tests, manual checks for r doc

Author: Felix Cheung <felixcheung_m@hotmail.com>

Closes #13805 from felixcheung/runion.
2016-06-21 13:36:50 -07:00
Dongjoon Hyun 217db56ba1 [SPARK-15294][R] Add pivot to SparkR
## What changes were proposed in this pull request?

This PR adds `pivot` function to SparkR for API parity. Since this PR is based on https://github.com/apache/spark/pull/13295 , mhnatiuk should be credited for the work he did.

## How was this patch tested?

Pass the Jenkins tests (including new testcase.)

Author: Dongjoon Hyun <dongjoon@apache.org>

Closes #13786 from dongjoon-hyun/SPARK-15294.
2016-06-20 21:09:39 -07:00
Dongjoon Hyun b0f2fb5b97 [SPARK-16053][R] Add spark_partition_id in SparkR
## What changes were proposed in this pull request?

This PR adds `spark_partition_id` virtual column function in SparkR for API parity.

The following is just an example to illustrate a SparkR usage on a partitioned parquet table created by `spark.range(10).write.mode("overwrite").parquet("/tmp/t1")`.
```r
> collect(select(read.parquet('/tmp/t1'), c('id', spark_partition_id())))
   id SPARK_PARTITION_ID()
1   3                    0
2   4                    0
3   8                    1
4   9                    1
5   0                    2
6   1                    3
7   2                    4
8   5                    5
9   6                    6
10  7                    7
```

## How was this patch tested?

Pass the Jenkins tests (including new testcase).

Author: Dongjoon Hyun <dongjoon@apache.org>

Closes #13768 from dongjoon-hyun/SPARK-16053.
2016-06-20 13:41:03 -07:00
Dongjoon Hyun c44bf137c7 [SPARK-16051][R] Add read.orc/write.orc to SparkR
## What changes were proposed in this pull request?

This issue adds `read.orc/write.orc` to SparkR for API parity.

## How was this patch tested?

Pass the Jenkins tests (with new testcases).

Author: Dongjoon Hyun <dongjoon@apache.org>

Closes #13763 from dongjoon-hyun/SPARK-16051.
2016-06-20 11:30:26 -07:00
Felix Cheung 36e812d4b6 [SPARK-16029][SPARKR] SparkR add dropTempView and deprecate dropTempTable
## What changes were proposed in this pull request?

Add dropTempView and deprecate dropTempTable

## How was this patch tested?

unit tests

shivaram liancheng

Author: Felix Cheung <felixcheung_m@hotmail.com>

Closes #13753 from felixcheung/rdroptempview.
2016-06-20 11:24:41 -07:00
Dongjoon Hyun 9613424898 [SPARK-16059][R] Add monotonically_increasing_id function in SparkR
## What changes were proposed in this pull request?

This PR adds `monotonically_increasing_id` column function in SparkR for API parity.
After this PR, SparkR supports the followings.

```r
> df <- read.json("examples/src/main/resources/people.json")
> collect(select(df, monotonically_increasing_id(), df$name, df$age))
  monotonically_increasing_id()    name age
1                             0 Michael  NA
2                             1    Andy  30
3                             2  Justin  19
```

## How was this patch tested?

Pass the Jenkins tests (with added testcase).

Author: Dongjoon Hyun <dongjoon@apache.org>

Closes #13774 from dongjoon-hyun/SPARK-16059.
2016-06-20 11:12:41 -07:00
Felix Cheung 8c198e246d [SPARK-15159][SPARKR] SparkR SparkSession API
## What changes were proposed in this pull request?

This PR introduces the new SparkSession API for SparkR.
`sparkR.session.getOrCreate()` and `sparkR.session.stop()`

"getOrCreate" is a bit unusual in R but it's important to name this clearly.

SparkR implementation should
- SparkSession is the main entrypoint (vs SparkContext; due to limited functionality supported with SparkContext in SparkR)
- SparkSession replaces SQLContext and HiveContext (both a wrapper around SparkSession, and because of API changes, supporting all 3 would be a lot more work)
- Changes to SparkSession is mostly transparent to users due to SPARK-10903
- Full backward compatibility is expected - users should be able to initialize everything just in Spark 1.6.1 (`sparkR.init()`), but with deprecation warning
- Mostly cosmetic changes to parameter list - users should be able to move to `sparkR.session.getOrCreate()` easily
- An advanced syntax with named parameters (aka varargs aka "...") is supported; that should be closer to the Builder syntax that is in Scala/Python (which unfortunately does not work in R because it will look like this: `enableHiveSupport(config(config(master(appName(builder(), "foo"), "local"), "first", "value"), "next, "value"))`
- Updating config on an existing SparkSession is supported, the behavior is the same as Python, in which config is applied to both SparkContext and SparkSession
- Some SparkSession changes are not matched in SparkR, mostly because it would be breaking API change: `catalog` object, `createOrReplaceTempView`
- Other SQLContext workarounds are replicated in SparkR, eg. `tables`, `tableNames`
- `sparkR` shell is updated to use the SparkSession entrypoint (`sqlContext` is removed, just like with Scale/Python)
- All tests are updated to use the SparkSession entrypoint
- A bug in `read.jdbc` is fixed

TODO
- [x] Add more tests
- [ ] Separate PR - update all roxygen2 doc coding example
- [ ] Separate PR - update SparkR programming guide

## How was this patch tested?

unit tests, manual tests

shivaram sun-rui rxin

Author: Felix Cheung <felixcheung_m@hotmail.com>
Author: felixcheung <felixcheung_m@hotmail.com>

Closes #13635 from felixcheung/rsparksession.
2016-06-17 21:36:01 -07:00
Dongjoon Hyun 7d65a0db4a [SPARK-16005][R] Add randomSplit to SparkR
## What changes were proposed in this pull request?

This PR adds `randomSplit` to SparkR for API parity.

## How was this patch tested?

Pass the Jenkins tests (with new testcase.)

Author: Dongjoon Hyun <dongjoon@apache.org>

Closes #13721 from dongjoon-hyun/SPARK-16005.
2016-06-17 16:07:33 -07:00
Felix Cheung ef3cc4fc09 [SPARK-15925][SPARKR] R DataFrame add back registerTempTable, add tests
## What changes were proposed in this pull request?

Add registerTempTable to DataFrame with Deprecate

## How was this patch tested?

unit tests
shivaram liancheng

Author: Felix Cheung <felixcheung_m@hotmail.com>

Closes #13722 from felixcheung/rregistertemptable.
2016-06-17 15:56:03 -07:00
Dongjoon Hyun 513a03e41e [SPARK-15908][R] Add varargs-type dropDuplicates() function in SparkR
## What changes were proposed in this pull request?

This PR adds varargs-type `dropDuplicates` function to SparkR for API parity.
Refer to https://issues.apache.org/jira/browse/SPARK-15807, too.

## How was this patch tested?

Pass the Jenkins tests with new testcases.

Author: Dongjoon Hyun <dongjoon@apache.org>

Closes #13684 from dongjoon-hyun/SPARK-15908.
2016-06-16 20:35:17 -07:00
Narine Kokhlikyan 7c6c692637 [SPARK-12922][SPARKR][WIP] Implement gapply() on DataFrame in SparkR
## What changes were proposed in this pull request?

gapply() applies an R function on groups grouped by one or more columns of a DataFrame, and returns a DataFrame. It is like GroupedDataSet.flatMapGroups() in the Dataset API.

Please, let me know what do you think and if you have any ideas to improve it.

Thank you!

## How was this patch tested?
Unit tests.
1. Primitive test with different column types
2. Add a boolean column
3. Compute average by a group

Author: Narine Kokhlikyan <narine.kokhlikyan@gmail.com>
Author: NarineK <narine.kokhlikyan@us.ibm.com>

Closes #12836 from NarineK/gapply2.
2016-06-15 21:42:05 -07:00
Cheng Lian ced8d669b3 [SPARK-15925][SQL][SPARKR] Replaces registerTempTable with createOrReplaceTempView
## What changes were proposed in this pull request?

This PR replaces `registerTempTable` with `createOrReplaceTempView` as a follow-up task of #12945.

## How was this patch tested?

Existing SparkR tests.

Author: Cheng Lian <lian@databricks.com>

Closes #13644 from liancheng/spark-15925-temp-view-for-r.
2016-06-13 15:46:50 -07:00
wm624@hotmail.com 3ec4461c46 [SPARK-15684][SPARKR] Not mask startsWith and endsWith in R
## What changes were proposed in this pull request?

In R 3.3.0, startsWith and endsWith are added. In this PR, I make the two work in SparkR.
1. Remove signature in generic.R
2. Add setMethod in column.R
3. Add unit tests

## How was this patch tested?
Manually test it through SparkR shell for both column data and string data, which are added into the unit test file.

Author: wm624@hotmail.com <wm624@hotmail.com>

Closes #13476 from wangmiao1981/start.
2016-06-07 09:13:18 -07:00
felixcheung c76457c8e4 [SPARK-10903][SPARKR] R - Simplify SQLContext method signatures and use a singleton
Eliminate the need to pass sqlContext to method since it is a singleton - and we don't want to support multiple contexts in a R session.

Changes are done in a back compat way with deprecation warning added. Method signature for S3 methods are added in a concise, clean approach such that in the next release the deprecated signature can be taken out easily/cleanly (just delete a few lines per method).

Custom method dispatch is implemented to allow for multiple JVM reference types that are all 'jobj' in R and to avoid having to add 30 new exports.

Author: felixcheung <felixcheung_m@hotmail.com>

Closes #9192 from felixcheung/rsqlcontext.
2016-05-26 11:20:20 -07:00
Daoyuan Wang d642b27354 [SPARK-15397][SQL] fix string udf locate as hive
## What changes were proposed in this pull request?

in hive, `locate("aa", "aaa", 0)` would yield 0, `locate("aa", "aaa", 1)` would yield 1 and `locate("aa", "aaa", 2)` would yield 2, while in Spark, `locate("aa", "aaa", 0)` would yield 1,  `locate("aa", "aaa", 1)` would yield 2 and  `locate("aa", "aaa", 2)` would yield 0. This results from the different understanding of the third parameter in udf `locate`. It means the starting index and starts from 1, so when we use 0, the return would always be 0.

## How was this patch tested?

tested with modified `StringExpressionsSuite` and `StringFunctionsSuite`

Author: Daoyuan Wang <daoyuan.wang@intel.com>

Closes #13186 from adrian-wang/locate.
2016-05-23 23:29:15 -07:00
Sun Rui b3930f74a0 [SPARK-15202][SPARKR] add dapplyCollect() method for DataFrame in SparkR.
## What changes were proposed in this pull request?

dapplyCollect() applies an R function on each partition of a SparkDataFrame and collects the result back to R as a data.frame.
```
dapplyCollect(df, function(ldf) {...})
```

## How was this patch tested?
SparkR unit tests.

Author: Sun Rui <sunrui2016@gmail.com>

Closes #12989 from sun-rui/SPARK-15202.
2016-05-12 17:50:55 -07:00
Sun Rui 157a49aa41 [SPARK-11395][SPARKR] Support over and window specification in SparkR.
This PR:
1. Implement WindowSpec S4 class.
2. Implement Window.partitionBy() and Window.orderBy() as utility functions to create WindowSpec objects.
3. Implement over() of Column class.

Author: Sun Rui <rui.sun@intel.com>
Author: Sun Rui <sunrui2016@gmail.com>

Closes #10094 from sun-rui/SPARK-11395.
2016-05-05 18:49:43 -07:00
NarineK 22226fcc92 [SPARK-15110] [SPARKR] Implement repartitionByColumn for SparkR DataFrames
## What changes were proposed in this pull request?

Implement repartitionByColumn on DataFrame.
This will allow us to run R functions on each partition identified by column groups with dapply() method.

## How was this patch tested?

Unit tests

Author: NarineK <narine.kokhlikyan@us.ibm.com>

Closes #12887 from NarineK/repartitionByColumns.
2016-05-05 12:00:55 -07:00
Sun Rui 8b6491fc0b [SPARK-15091][SPARKR] Fix warnings and a failure in SparkR test cases with testthat version 1.0.1
## What changes were proposed in this pull request?
Fix warnings and a failure in SparkR test cases with testthat version 1.0.1

## How was this patch tested?
SparkR unit test cases.

Author: Sun Rui <sunrui2016@gmail.com>

Closes #12867 from sun-rui/SPARK-15091.
2016-05-03 09:29:49 -07:00
Sun Rui 4ae9fe091c [SPARK-12919][SPARKR] Implement dapply() on DataFrame in SparkR.
## What changes were proposed in this pull request?

dapply() applies an R function on each partition of a DataFrame and returns a new DataFrame.

The function signature is:

	dapply(df, function(localDF) {}, schema = NULL)

R function input: local data.frame from the partition on local node
R function output: local data.frame

Schema specifies the Row format of the resulting DataFrame. It must match the R function's output.
If schema is not specified, each partition of the result DataFrame will be serialized in R into a single byte array. Such resulting DataFrame can be processed by successive calls to dapply().

## How was this patch tested?
SparkR unit tests.

Author: Sun Rui <rui.sun@intel.com>
Author: Sun Rui <sunrui2016@gmail.com>

Closes #12493 from sun-rui/SPARK-12919.
2016-04-29 16:41:07 -07:00
Sun Rui 9e785079b6 [SPARK-12235][SPARKR] Enhance mutate() to support replace existing columns.
Make the behavior of mutate more consistent with that in dplyr, besides support for replacing existing columns.
1. Throw error message when there are duplicated column names in the DataFrame being mutated.
2. when there are duplicated column names in specified columns by arguments, the last column of the same name takes effect.

Author: Sun Rui <rui.sun@intel.com>

Closes #10220 from sun-rui/SPARK-12235.
2016-04-28 09:33:58 -07:00
Oscar D. Lara Yejas e4bfb4aa73 [SPARK-13436][SPARKR] Added parameter drop to subsetting operator [
Added parameter drop to subsetting operator [. This is useful to get a Column from a DataFrame, given its name. R supports it.

In R:
```
> name <- "Sepal_Length"
> class(iris[, name])
[1] "numeric"
```
Currently, in SparkR:
```
> name <- "Sepal_Length"
> class(irisDF[, name])
[1] "DataFrame"
```

Previous code returns a DataFrame, which is inconsistent with R's behavior. SparkR should return a Column instead. Currently, in order for the user to return a Column given a column name as a character variable would be through `eval(parse(x))`, where x is the string `"irisDF$Sepal_Length"`. That itself is pretty hacky. `SparkR:::getColumn() `is another choice, but I don't see why this method should be externalized. Instead, following R's way to do things, the proposed implementation allows this:

```
> name <- "Sepal_Length"
> class(irisDF[, name, drop=T])
[1] "Column"

> class(irisDF[, name, drop=F])
[1] "DataFrame"
```

This is consistent with R:

```
> name <- "Sepal_Length"
> class(iris[, name])
[1] "numeric"
> class(iris[, name, drop=F])
[1] "data.frame"
```

Author: Oscar D. Lara Yejas <odlaraye@oscars-mbp.usca.ibm.com>
Author: Oscar D. Lara Yejas <odlaraye@oscars-mbp.attlocal.net>

Closes #11318 from olarayej/SPARK-13436.
2016-04-27 15:47:54 -07:00
Oscar D. Lara Yejas 0c99c23b7d [SPARK-13734][SPARKR] Added histogram function
## What changes were proposed in this pull request?

Added method histogram() to compute the histogram of a Column

Usage:

```
## Create a DataFrame from the Iris dataset
irisDF <- createDataFrame(sqlContext, iris)

## Render a histogram for the Sepal_Length column
histogram(irisDF, "Sepal_Length", nbins=12)

```
![histogram](https://cloud.githubusercontent.com/assets/13985649/13588486/e1e751c6-e484-11e5-85db-2fc2115c4bb2.png)

Note: Usage will change once SPARK-9325 is figured out so that histogram() only takes a Column as a parameter, as opposed to a DataFrame and a name

## How was this patch tested?

All unit tests pass. I added specific unit cases for different scenarios.

Author: Oscar D. Lara Yejas <odlaraye@oscars-mbp.usca.ibm.com>
Author: Oscar D. Lara Yejas <odlaraye@oscars-mbp.attlocal.net>

Closes #11569 from olarayej/SPARK-13734.
2016-04-26 15:34:30 -07:00
Reynold Xin 890abd1279 [SPARK-14869][SQL] Don't mask exceptions in ResolveRelations
## What changes were proposed in this pull request?
In order to support running SQL directly on files, we added some code in ResolveRelations to catch the exception thrown by catalog.lookupRelation and ignore it. This unfortunately masks all the exceptions. This patch changes the logic to simply test the table's existence.

## How was this patch tested?
I manually hacked some bugs into Spark and made sure the exceptions were being propagated up.

Author: Reynold Xin <rxin@databricks.com>

Closes #12634 from rxin/SPARK-14869.
2016-04-23 12:49:36 -07:00
felixcheung a55fbe2a16 [SPARK-12148][SPARKR] SparkR: rename DataFrame to SparkDataFrame
## What changes were proposed in this pull request?

Changed class name defined in R from "DataFrame" to "SparkDataFrame". A popular package, S4Vector already defines "DataFrame" - this change is to avoid conflict.

Aside from class name and API/roxygen2 references, SparkR APIs like `createDataFrame`, `as.DataFrame` are not changed (S4Vector does not define a "as.DataFrame").

Since in R, one would rarely reference type/class, this change should have minimal/almost-no impact to a SparkR user in terms of back compat.

## How was this patch tested?

SparkR tests, manually loading S4Vector then SparkR package

Author: felixcheung <felixcheung_m@hotmail.com>

Closes #12621 from felixcheung/rdataframe.
2016-04-23 00:20:27 -07:00
Dongjoon Hyun 14869ae64e [SPARK-14639] [PYTHON] [R] Add bround function in Python/R.
## What changes were proposed in this pull request?

This issue aims to expose Scala `bround` function in Python/R API.
`bround` function is implemented in SPARK-14614 by extending current `round` function.
We used the following semantics from Hive.
```java
public static double bround(double input, int scale) {
    if (Double.isNaN(input) || Double.isInfinite(input)) {
      return input;
    }
    return BigDecimal.valueOf(input).setScale(scale, RoundingMode.HALF_EVEN).doubleValue();
}
```

After this PR, `pyspark` and `sparkR` also support `bround` function.

**PySpark**
```python
>>> from pyspark.sql.functions import bround
>>> sqlContext.createDataFrame([(2.5,)], ['a']).select(bround('a', 0).alias('r')).collect()
[Row(r=2.0)]
```

**SparkR**
```r
> df = createDataFrame(sqlContext, data.frame(x = c(2.5, 3.5)))
> head(collect(select(df, bround(df$x, 0))))
  bround(x, 0)
1            2
2            4
```

## How was this patch tested?

Pass the Jenkins tests (including new testcases).

Author: Dongjoon Hyun <dongjoon@apache.org>

Closes #12509 from dongjoon-hyun/SPARK-14639.
2016-04-19 22:28:11 -07:00
Sun Rui 8eedf0b553 [SPARK-13905][SPARKR] Change signature of as.data.frame() to be consistent with the R base package.
## What changes were proposed in this pull request?

Change the signature of as.data.frame() to be consistent with that in the R base package to meet R user's convention.

## How was this patch tested?
dev/lint-r
SparkR unit tests

Author: Sun Rui <rui.sun@intel.com>

Closes #11811 from sun-rui/SPARK-13905.
2016-04-19 19:57:03 -07:00
gatorsmile 9f838bd242 [SPARK-14362][SPARK-14406][SQL][FOLLOW-UP] DDL Native Support: Drop View and Drop Table
#### What changes were proposed in this pull request?
This PR is to address the comment: https://github.com/apache/spark/pull/12146#discussion-diff-59092238. It removes the function `isViewSupported` from `SessionCatalog`. After the removal, we still can capture the user errors if users try to drop a table using `DROP VIEW`.

#### How was this patch tested?
Modified the existing test cases

Author: gatorsmile <gatorsmile@gmail.com>

Closes #12284 from gatorsmile/followupDropTable.
2016-04-10 20:46:15 -07:00
Burak Yavuz 1146c534d6 [SPARK-14353] Dataset Time Window window API for R
## What changes were proposed in this pull request?

The `window` function was added to Dataset with [this PR](https://github.com/apache/spark/pull/12008).
This PR adds the R API for this function.

With this PR, SQL, Java, and Scala will share the same APIs as in users can use:
 - `window(timeColumn, windowDuration)`
 - `window(timeColumn, windowDuration, slideDuration)`
 - `window(timeColumn, windowDuration, slideDuration, startTime)`

In Python and R, users can access all APIs above, but in addition they can do
 - In R:
   `window(timeColumn, windowDuration, startTime=...)`

that is, they can provide the startTime without providing the `slideDuration`. In this case, we will generate tumbling windows.

## How was this patch tested?

Unit tests + manual tests

Author: Burak Yavuz <brkyvz@gmail.com>

Closes #12141 from brkyvz/R-windows.
2016-04-05 17:21:41 -07:00