Commit graph

122 commits

Author SHA1 Message Date
Felix Cheung d27fe9ba67 [SPARK-16027][SPARKR] Fix R tests SparkSession init/stop
## What changes were proposed in this pull request?

Fix R SparkSession init/stop, and warnings of reusing existing Spark Context

## How was this patch tested?

unit tests

shivaram

Author: Felix Cheung <felixcheung_m@hotmail.com>

Closes #14177 from felixcheung/rsessiontest.
2016-07-17 19:02:21 -07:00
Felix Cheung 611a8ca589 [SPARK-16538][SPARKR] Add more tests for namespace call to SparkSession functions
## What changes were proposed in this pull request?

More tests
I don't think this is critical for Spark 2.0.0 RC, maybe Spark 2.0.1 or 2.1.0.

## How was this patch tested?

unit tests

shivaram dongjoon-hyun

Author: Felix Cheung <felixcheung_m@hotmail.com>

Closes #14206 from felixcheung/rroutetests.
2016-07-15 13:58:57 -07:00
Felix Cheung 12005c88fb [SPARK-16538][SPARKR] fix R call with namespace operator on SparkSession functions
## What changes were proposed in this pull request?

Fix function routing to work with and without namespace operator `SparkR::createDataFrame`

## How was this patch tested?

manual, unit tests

shivaram

Author: Felix Cheung <felixcheung_m@hotmail.com>

Closes #14195 from felixcheung/rroutedefault.
2016-07-14 09:45:30 -07:00
Sun Rui 093ebbc628 [SPARK-16509][SPARKR] Rename window.partitionBy and window.orderBy to windowPartitionBy and windowOrderBy.
## What changes were proposed in this pull request?
Rename window.partitionBy and window.orderBy to windowPartitionBy and windowOrderBy to pass CRAN package check.

## How was this patch tested?
SparkR unit tests.

Author: Sun Rui <sunrui2016@gmail.com>

Closes #14192 from sun-rui/SPARK-16509.
2016-07-14 09:38:42 -07:00
Felix Cheung fb2e8eeb0b [SPARKR][DOCS][MINOR] R programming guide to include csv data source example
## What changes were proposed in this pull request?

Minor documentation update for code example, code style, and missed reference to "sparkR.init"

## How was this patch tested?

manual

shivaram

Author: Felix Cheung <felixcheung_m@hotmail.com>

Closes #14178 from felixcheung/rcsvprogrammingguide.
2016-07-13 15:09:23 -07:00
Dongjoon Hyun 142df4834b [SPARK-16429][SQL] Include StringType columns in describe()
## What changes were proposed in this pull request?

Currently, Spark `describe` supports `StringType` when columns are named explicitly. However, `describe()` without arguments returns a dataset covering only the numeric columns. This PR includes `StringType` columns in the no-argument `describe()` as well.

**Background**
```scala
scala> spark.read.json("examples/src/main/resources/people.json").describe("age", "name").show()
+-------+------------------+-------+
|summary|               age|   name|
+-------+------------------+-------+
|  count|                 2|      3|
|   mean|              24.5|   null|
| stddev|7.7781745930520225|   null|
|    min|                19|   Andy|
|    max|                30|Michael|
+-------+------------------+-------+
```

**Before**
```scala
scala> spark.read.json("examples/src/main/resources/people.json").describe().show()
+-------+------------------+
|summary|               age|
+-------+------------------+
|  count|                 2|
|   mean|              24.5|
| stddev|7.7781745930520225|
|    min|                19|
|    max|                30|
+-------+------------------+
```

**After**
```scala
scala> spark.read.json("examples/src/main/resources/people.json").describe().show()
+-------+------------------+-------+
|summary|               age|   name|
+-------+------------------+-------+
|  count|                 2|      3|
|   mean|              24.5|   null|
| stddev|7.7781745930520225|   null|
|    min|                19|   Andy|
|    max|                30|Michael|
+-------+------------------+-------+
```
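
The summary table above can be reproduced with a small plain-Python sketch. This is only an illustration of the semantics (count, mean, stddev, min, max, with `null` mean and stddev for strings), not Spark's implementation. The `describe` helper and the in-memory `people` rows are assumptions that stand in for `people.json`.

```python
import statistics

def describe(rows, columns):
    """Return {stat: {column: value}} for numeric and string columns."""
    summary = {stat: {} for stat in ("count", "mean", "stddev", "min", "max")}
    for col in columns:
        # Null values are skipped, which is why count(age) is 2 but count(name) is 3.
        values = [row[col] for row in rows if row.get(col) is not None]
        numeric = bool(values) and all(isinstance(v, (int, float)) for v in values)
        summary["count"][col] = len(values)
        summary["mean"][col] = statistics.mean(values) if numeric else None
        # Sample standard deviation needs at least two values.
        summary["stddev"][col] = (
            statistics.stdev(values) if numeric and len(values) > 1 else None
        )
        # min/max are defined for strings too (lexicographic order).
        summary["min"][col] = min(values) if values else None
        summary["max"][col] = max(values) if values else None
    return summary

# Hypothetical stand-in for examples/src/main/resources/people.json
people = [
    {"name": "Michael", "age": None},
    {"name": "Andy", "age": 30},
    {"name": "Justin", "age": 19},
]
result = describe(people, ["age", "name"])
```

The string column gets a count, a minimum and a maximum, while its mean and standard deviation stay empty, matching the **After** table.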

## How was this patch tested?

Pass the Jenkins tests with an updated testcase.

Author: Dongjoon Hyun <dongjoon@apache.org>

Closes #14095 from dongjoon-hyun/SPARK-16429.
2016-07-08 14:36:50 -07:00
Dongjoon Hyun 6aa7d09f4e [SPARK-16425][R] describe() should not fail with non-numeric columns
## What changes were proposed in this pull request?

This PR prevents ERRORs when `summary(df)` is called for a `SparkDataFrame` with non-numeric columns. This failure happens only in `SparkR`.

**Before**
```r
> df <- createDataFrame(faithful)
> df <- withColumn(df, "boolean", df$waiting==79)
> summary(df)
16/07/07 14:15:16 ERROR RBackendHandler: describe on 34 failed
Error in invokeJava(isStatic = FALSE, objId$id, methodName, ...) :
  org.apache.spark.sql.AnalysisException: cannot resolve 'avg(`boolean`)' due to data type mismatch: function average requires numeric types, not BooleanType;
```

**After**
```r
> df <- createDataFrame(faithful)
> df <- withColumn(df, "boolean", df$waiting==79)
> summary(df)
SparkDataFrame[summary:string, eruptions:string, waiting:string]
```

## How was this patch tested?

Pass the Jenkins tests with an updated testcase.

Author: Dongjoon Hyun <dongjoon@apache.org>

Closes #14096 from dongjoon-hyun/SPARK-16425.
2016-07-07 17:47:29 -07:00
Felix Cheung f4767bcc7a [SPARK-16310][SPARKR] R na.string-like default for csv source
## What changes were proposed in this pull request?

Apply "NA" as the default null string for R, like the `na.strings` parameter of R's read.csv.

https://stat.ethz.ch/R-manual/R-devel/library/utils/html/read.table.html
na.strings = "NA"

A user passing a csv file with NA values should get the same behavior with SparkR read.df(... source = "csv")
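
As a rough illustration of what an `na.strings`-style default means, the plain-Python sketch below maps the literal string "NA" to a null value while reading CSV text. The `read_csv_with_na` helper and the sample data are hypothetical. This is not the SparkR or Spark CSV reader.

```python
import csv
import io

def read_csv_with_na(text, na_strings=("NA",)):
    """Parse CSV text, turning any cell listed in na_strings into None."""
    reader = csv.DictReader(io.StringIO(text))
    return [
        {key: (None if value in na_strings else value) for key, value in row.items()}
        for row in reader
    ]

# Hypothetical sample: Michael's age is recorded as NA.
rows = read_csv_with_na("name,age\nAndy,30\nMichael,NA\n")
```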

(couldn't open JIRA, will do that later)

## How was this patch tested?

unit tests

shivaram

Author: Felix Cheung <felixcheung_m@hotmail.com>

Closes #13984 from felixcheung/rcsvnastring.
2016-07-07 15:21:57 -07:00
Dongjoon Hyun d17e5f2f12 [SPARK-16233][R][TEST] ORC test should be enabled only when HiveContext is available.
## What changes were proposed in this pull request?

ORC test should be enabled only when HiveContext is available.

## How was this patch tested?

Manual.
```
$ R/run-tests.sh
...
1. create DataFrame from RDD (test_sparkSQL.R#200) - Hive is not build with SparkSQL, skipped

2. test HiveContext (test_sparkSQL.R#1021) - Hive is not build with SparkSQL, skipped

3. read/write ORC files (test_sparkSQL.R#1728) - Hive is not build with SparkSQL, skipped

4. enableHiveSupport on SparkSession (test_sparkSQL.R#2448) - Hive is not build with SparkSQL, skipped

5. sparkJars tag in SparkContext (test_Windows.R#21) - This test is only for Windows, skipped

DONE ===========================================================================
Tests passed.
```

Author: Dongjoon Hyun <dongjoon@apache.org>

Closes #14019 from dongjoon-hyun/SPARK-16233.
2016-07-01 15:35:19 -07:00
Narine Kokhlikyan 26afb4ce40 [SPARK-16012][SPARKR] Implement gapplyCollect which will apply a R function on each group similar to gapply and collect the result back to R data.frame
## What changes were proposed in this pull request?
gapplyCollect() runs gapply() on a SparkDataFrame and collects the result back to R. Compared to gapply() + collect(), gapplyCollect() offers a performance optimization as well as programming convenience, since no schema needs to be provided.

This is similar to dapplyCollect().
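
The group-then-apply-then-collect pattern described above can be sketched in plain Python. This shows the data flow only and does not touch Spark. The `gapply_collect` and `mean_waiting` names and the sample rows are assumptions made for illustration.

```python
from collections import defaultdict

def gapply_collect(rows, key_cols, func):
    """Group rows by key_cols, apply func(key, group) per group, collect results."""
    groups = defaultdict(list)
    for row in rows:
        key = tuple(row[c] for c in key_cols)
        groups[key].append(row)
    collected = []
    for key, group in groups.items():
        # func returns a list of output rows; no output schema is declared.
        collected.extend(func(key, group))
    return collected

def mean_waiting(key, group):
    """Hypothetical per-group function: average of the 'waiting' column."""
    avg = sum(r["waiting"] for r in group) / len(group)
    return [{"kind": key[0], "avg_waiting": avg}]

rows = [
    {"kind": "a", "waiting": 10},
    {"kind": "b", "waiting": 20},
    {"kind": "a", "waiting": 30},
]
result = gapply_collect(rows, ["kind"], mean_waiting)
```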

## How was this patch tested?
Added test cases for gapplyCollect similar to dapplyCollect

Author: Narine Kokhlikyan <narine@slice.com>

Closes #13760 from NarineK/gapplyCollect.
2016-07-01 13:55:13 -07:00
Dongjoon Hyun 46395db80e [SPARK-16289][SQL] Implement posexplode table generating function
## What changes were proposed in this pull request?

This PR implements the `posexplode` table-generating function. Currently, the master branch raises the following exception for a `map` argument, which differs from Hive's behavior.

**Before**
```scala
scala> sql("select posexplode(map('a', 1, 'b', 2))").show
org.apache.spark.sql.AnalysisException: No handler for Hive UDF ... posexplode() takes an array as a parameter; line 1 pos 7
```

**After**
```scala
scala> sql("select posexplode(map('a', 1, 'b', 2))").show
+---+---+-----+
|pos|key|value|
+---+---+-----+
|  0|  a|    1|
|  1|  b|    2|
+---+---+-----+
```

For an `array` argument, the behavior after this change is the same as before.
```
scala> sql("select posexplode(array(1, 2, 3))").show
+---+---+
|pos|col|
+---+---+
|  0|  1|
|  1|  2|
|  2|  3|
+---+---+
```
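
The two output shapes shown above (position, key, value for a map; position, element for an array) can be mimicked with a short Python generator. This is a sketch of the semantics under the assumption that a `dict` stands in for a SQL map and a `list` for a SQL array. It is not the Catalyst generator.

```python
def posexplode(collection):
    """Yield one row per element, prefixed with its 0-based position."""
    if isinstance(collection, dict):
        # Map argument: rows are (pos, key, value).
        for pos, (key, value) in enumerate(collection.items()):
            yield (pos, key, value)
    else:
        # Array argument: rows are (pos, col).
        for pos, col in enumerate(collection):
            yield (pos, col)

map_rows = list(posexplode({"a": 1, "b": 2}))   # [(0, 'a', 1), (1, 'b', 2)]
array_rows = list(posexplode([1, 2, 3]))        # [(0, 1), (1, 2), (2, 3)]
```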

## How was this patch tested?

Pass the Jenkins tests with newly added testcases.

Author: Dongjoon Hyun <dongjoon@apache.org>

Closes #13971 from dongjoon-hyun/SPARK-16289.
2016-06-30 12:03:54 -07:00
Felix Cheung 823518c2b5 [SPARKR] add csv tests
## What changes were proposed in this pull request?

Add unit tests for csv data for SPARKR

## How was this patch tested?

unit tests

Author: Felix Cheung <felixcheung_m@hotmail.com>

Closes #13904 from felixcheung/rcsv.
2016-06-28 17:08:28 -07:00
Prashant Sharma f6b497fcdd [SPARK-16128][SQL] Allow setting length of characters to be truncated to, in Dataset.show function.
## What changes were proposed in this pull request?

Allowing truncation to a specific number of characters is convenient at times, especially while operating from the REPL. Sometimes those last few characters make all the difference, and showing everything brings in a whole lot of noise.
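
A minimal plain-Python sketch of truncating a cell to a chosen width is given below. The `truncate_cell` helper is hypothetical, and the exact ellipsis rule (replace the last three kept characters with "..." when the width is at least 4) is an assumption made for illustration rather than a statement about `Dataset.show`.

```python
def truncate_cell(text, truncate=20):
    """Shorten text to at most `truncate` characters; 0 or less disables truncation."""
    if truncate <= 0 or len(text) <= truncate:
        return text
    if truncate < 4:
        # Too narrow for an ellipsis: keep the leading characters only.
        return text[:truncate]
    # Reserve three characters for the ellipsis marker.
    return text[: truncate - 3] + "..."
```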

## How was this patch tested?
Existing tests. + 1 new test in DataFrameSuite.

For SparkR and pyspark, existing tests and manual testing.

Author: Prashant Sharma <prashsh1@in.ibm.com>
Author: Prashant Sharma <prashant@apache.org>

Closes #13839 from ScrapCodes/add_truncateTo_DF.show.
2016-06-28 17:11:06 +05:30
Felix Cheung 30b182bcc0 [SPARK-16184][SPARKR] conf API for SparkSession
## What changes were proposed in this pull request?

Add `conf` method to get Runtime Config from SparkSession

## How was this patch tested?

unit tests, manual tests

This is how it works in sparkR shell:
```
 SparkSession available as 'spark'.
> conf()
$hive.metastore.warehouse.dir
[1] "file:/opt/spark-2.0.0-bin-hadoop2.6/R/spark-warehouse"

$spark.app.id
[1] "local-1466749575523"

$spark.app.name
[1] "SparkR"

$spark.driver.host
[1] "10.0.2.1"

$spark.driver.port
[1] "45629"

$spark.executorEnv.LD_LIBRARY_PATH
[1] "$LD_LIBRARY_PATH:/usr/lib/R/lib:/usr/lib/x86_64-linux-gnu:/usr/lib/jvm/default-java/jre/lib/amd64/server"

$spark.executor.id
[1] "driver"

$spark.home
[1] "/opt/spark-2.0.0-bin-hadoop2.6"

$spark.master
[1] "local[*]"

$spark.sql.catalogImplementation
[1] "hive"

$spark.submit.deployMode
[1] "client"

> conf("spark.master")
$spark.master
[1] "local[*]"

```

Author: Felix Cheung <felixcheung_m@hotmail.com>

Closes #13885 from felixcheung/rconf.
2016-06-26 13:10:43 -07:00
Felix Cheung dbfdae4e41 [SPARK-16096][SPARKR] add union and deprecate unionAll
## What changes were proposed in this pull request?

add union and deprecate unionAll, separate roxygen2 doc for rbind (since their usage and parameter lists are quite different)

`explode` is also deprecated - but seems like replacement is a combination of calls; not sure if we should deprecate it in SparkR, yet.

## How was this patch tested?

unit tests, manual checks for r doc

Author: Felix Cheung <felixcheung_m@hotmail.com>

Closes #13805 from felixcheung/runion.
2016-06-21 13:36:50 -07:00
Dongjoon Hyun 217db56ba1 [SPARK-15294][R] Add pivot to SparkR
## What changes were proposed in this pull request?

This PR adds `pivot` function to SparkR for API parity. Since this PR is based on https://github.com/apache/spark/pull/13295 , mhnatiuk should be credited for the work he did.

## How was this patch tested?

Pass the Jenkins tests (including new testcase.)

Author: Dongjoon Hyun <dongjoon@apache.org>

Closes #13786 from dongjoon-hyun/SPARK-15294.
2016-06-20 21:09:39 -07:00
Dongjoon Hyun b0f2fb5b97 [SPARK-16053][R] Add spark_partition_id in SparkR
## What changes were proposed in this pull request?

This PR adds `spark_partition_id` virtual column function in SparkR for API parity.

The following is just an example to illustrate a SparkR usage on a partitioned parquet table created by `spark.range(10).write.mode("overwrite").parquet("/tmp/t1")`.
```r
> collect(select(read.parquet('/tmp/t1'), c('id', spark_partition_id())))
   id SPARK_PARTITION_ID()
1   3                    0
2   4                    0
3   8                    1
4   9                    1
5   0                    2
6   1                    3
7   2                    4
8   5                    5
9   6                    6
10  7                    7
```

## How was this patch tested?

Pass the Jenkins tests (including new testcase).

Author: Dongjoon Hyun <dongjoon@apache.org>

Closes #13768 from dongjoon-hyun/SPARK-16053.
2016-06-20 13:41:03 -07:00
Dongjoon Hyun c44bf137c7 [SPARK-16051][R] Add read.orc/write.orc to SparkR
## What changes were proposed in this pull request?

This issue adds `read.orc/write.orc` to SparkR for API parity.

## How was this patch tested?

Pass the Jenkins tests (with new testcases).

Author: Dongjoon Hyun <dongjoon@apache.org>

Closes #13763 from dongjoon-hyun/SPARK-16051.
2016-06-20 11:30:26 -07:00
Felix Cheung 36e812d4b6 [SPARK-16029][SPARKR] SparkR add dropTempView and deprecate dropTempTable
## What changes were proposed in this pull request?

Add dropTempView and deprecate dropTempTable

## How was this patch tested?

unit tests

shivaram liancheng

Author: Felix Cheung <felixcheung_m@hotmail.com>

Closes #13753 from felixcheung/rdroptempview.
2016-06-20 11:24:41 -07:00
Dongjoon Hyun 9613424898 [SPARK-16059][R] Add monotonically_increasing_id function in SparkR
## What changes were proposed in this pull request?

This PR adds `monotonically_increasing_id` column function in SparkR for API parity.
After this PR, SparkR supports the following.

```r
> df <- read.json("examples/src/main/resources/people.json")
> collect(select(df, monotonically_increasing_id(), df$name, df$age))
  monotonically_increasing_id()    name age
1                             0 Michael  NA
2                             1    Andy  30
3                             2  Justin  19
```
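
The IDs in the example are consecutive because the data sits in a single partition. Spark's documentation describes the generated value as the partition ID in the upper 31 bits and the record number within the partition in the lower 33 bits. The plain-Python sketch below follows that layout. The `monotonically_increasing_ids` helper and its `partition_sizes` argument are hypothetical.

```python
def monotonically_increasing_ids(partition_sizes):
    """Return the IDs assigned to rows, given the row count of each partition."""
    ids = []
    for partition_id, size in enumerate(partition_sizes):
        # The partition ID occupies the bits above the 33-bit record counter.
        base = partition_id << 33
        ids.extend(base + record for record in range(size))
    return ids

single = monotonically_increasing_ids([3])     # [0, 1, 2]
double = monotonically_increasing_ids([2, 2])  # second partition starts at 2**33
```

The IDs are increasing and unique, but not consecutive across partitions.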

## How was this patch tested?

Pass the Jenkins tests (with added testcase).

Author: Dongjoon Hyun <dongjoon@apache.org>

Closes #13774 from dongjoon-hyun/SPARK-16059.
2016-06-20 11:12:41 -07:00
Felix Cheung 8c198e246d [SPARK-15159][SPARKR] SparkR SparkSession API
## What changes were proposed in this pull request?

This PR introduces the new SparkSession API for SparkR.
`sparkR.session.getOrCreate()` and `sparkR.session.stop()`

"getOrCreate" is a bit unusual in R but it's important to name this clearly.

Design points for the SparkR implementation:
- SparkSession is the main entrypoint (vs SparkContext; due to limited functionality supported with SparkContext in SparkR)
- SparkSession replaces SQLContext and HiveContext (both a wrapper around SparkSession, and because of API changes, supporting all 3 would be a lot more work)
- Changes to SparkSession are mostly transparent to users due to SPARK-10903
- Full backward compatibility is expected - users should be able to initialize everything just as in Spark 1.6.1 (`sparkR.init()`), but with a deprecation warning
- Mostly cosmetic changes to parameter list - users should be able to move to `sparkR.session.getOrCreate()` easily
- An advanced syntax with named parameters (aka varargs aka "...") is supported; that should be closer to the Builder syntax that is in Scala/Python (which unfortunately does not work in R because it would look like this: `enableHiveSupport(config(config(master(appName(builder(), "foo"), "local"), "first", "value"), "next", "value"))`)
- Updating config on an existing SparkSession is supported, the behavior is the same as Python, in which config is applied to both SparkContext and SparkSession
- Some SparkSession changes are not matched in SparkR, mostly because it would be breaking API change: `catalog` object, `createOrReplaceTempView`
- Other SQLContext workarounds are replicated in SparkR, eg. `tables`, `tableNames`
- `sparkR` shell is updated to use the SparkSession entrypoint (`sqlContext` is removed, just like with Scala/Python)
- All tests are updated to use the SparkSession entrypoint
- A bug in `read.jdbc` is fixed

TODO
- [x] Add more tests
- [ ] Separate PR - update all roxygen2 doc coding example
- [ ] Separate PR - update SparkR programming guide

## How was this patch tested?

unit tests, manual tests

shivaram sun-rui rxin

Author: Felix Cheung <felixcheung_m@hotmail.com>
Author: felixcheung <felixcheung_m@hotmail.com>

Closes #13635 from felixcheung/rsparksession.
2016-06-17 21:36:01 -07:00
Dongjoon Hyun 7d65a0db4a [SPARK-16005][R] Add randomSplit to SparkR
## What changes were proposed in this pull request?

This PR adds `randomSplit` to SparkR for API parity.
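
For readers unfamiliar with the operation, a weighted random split can be sketched in plain Python as follows. The `random_split` helper is hypothetical and only illustrates the idea: weights are normalized, and each row lands in exactly one split. It makes no claim about how Spark samples internally.

```python
import random

def random_split(rows, weights, seed=None):
    """Split rows into len(weights) lists, with sizes roughly proportional to weights."""
    total = float(sum(weights))
    bounds = []
    running = 0.0
    for weight in weights:
        running += weight / total
        bounds.append(running)
    bounds[-1] = 1.0  # guard against floating-point drift in the last bound
    rng = random.Random(seed)
    splits = [[] for _ in weights]
    for row in rows:
        draw = rng.random()  # uniform in [0, 1)
        for index, upper in enumerate(bounds):
            if draw < upper:
                splits[index].append(row)
                break
    return splits

train, test = random_split(list(range(100)), [3, 1], seed=42)
```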

## How was this patch tested?

Pass the Jenkins tests (with new testcase.)

Author: Dongjoon Hyun <dongjoon@apache.org>

Closes #13721 from dongjoon-hyun/SPARK-16005.
2016-06-17 16:07:33 -07:00
Felix Cheung ef3cc4fc09 [SPARK-15925][SPARKR] R DataFrame add back registerTempTable, add tests
## What changes were proposed in this pull request?

Add registerTempTable to DataFrame with Deprecate

## How was this patch tested?

unit tests
shivaram liancheng

Author: Felix Cheung <felixcheung_m@hotmail.com>

Closes #13722 from felixcheung/rregistertemptable.
2016-06-17 15:56:03 -07:00
Dongjoon Hyun 513a03e41e [SPARK-15908][R] Add varargs-type dropDuplicates() function in SparkR
## What changes were proposed in this pull request?

This PR adds varargs-type `dropDuplicates` function to SparkR for API parity.
Refer to https://issues.apache.org/jira/browse/SPARK-15807, too.
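
The varargs calling style can be illustrated with a plain-Python sketch in which column names are passed as separate arguments, and no arguments means "use all columns". The `drop_duplicates` helper is hypothetical, and keeping the first row seen for each key is an assumption of this sketch, not a guarantee about SparkR.

```python
def drop_duplicates(rows, *cols):
    """Keep one row per distinct combination of the given columns."""
    seen = set()
    unique = []
    for row in rows:
        key_cols = cols or tuple(sorted(row))  # no columns given: compare whole rows
        key = tuple((c, row[c]) for c in key_cols)
        if key not in seen:
            seen.add(key)
            unique.append(row)
    return unique

rows = [{"a": 1, "b": 1}, {"a": 1, "b": 2}, {"a": 1, "b": 1}]
by_all = drop_duplicates(rows)         # two distinct rows remain
by_a = drop_duplicates(rows, "a")      # one row remains
by_a_b = drop_duplicates(rows, "a", "b")
```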

## How was this patch tested?

Pass the Jenkins tests with new testcases.

Author: Dongjoon Hyun <dongjoon@apache.org>

Closes #13684 from dongjoon-hyun/SPARK-15908.
2016-06-16 20:35:17 -07:00
Narine Kokhlikyan 7c6c692637 [SPARK-12922][SPARKR][WIP] Implement gapply() on DataFrame in SparkR
## What changes were proposed in this pull request?

gapply() applies an R function on groups grouped by one or more columns of a DataFrame, and returns a DataFrame. It is like GroupedDataSet.flatMapGroups() in the Dataset API.

Please let me know what you think and whether you have any ideas to improve it.

Thank you!

## How was this patch tested?
Unit tests.
1. Primitive test with different column types
2. Add a boolean column
3. Compute average by a group

Author: Narine Kokhlikyan <narine.kokhlikyan@gmail.com>
Author: NarineK <narine.kokhlikyan@us.ibm.com>

Closes #12836 from NarineK/gapply2.
2016-06-15 21:42:05 -07:00
Cheng Lian ced8d669b3 [SPARK-15925][SQL][SPARKR] Replaces registerTempTable with createOrReplaceTempView
## What changes were proposed in this pull request?

This PR replaces `registerTempTable` with `createOrReplaceTempView` as a follow-up task of #12945.

## How was this patch tested?

Existing SparkR tests.

Author: Cheng Lian <lian@databricks.com>

Closes #13644 from liancheng/spark-15925-temp-view-for-r.
2016-06-13 15:46:50 -07:00
wm624@hotmail.com 3ec4461c46 [SPARK-15684][SPARKR] Not mask startsWith and endsWith in R
## What changes were proposed in this pull request?

In R 3.3.0, startsWith and endsWith are added. In this PR, I make the two work in SparkR.
1. Remove signature in generic.R
2. Add setMethod in column.R
3. Add unit tests

## How was this patch tested?
Manually test it through SparkR shell for both column data and string data, which are added into the unit test file.

Author: wm624@hotmail.com <wm624@hotmail.com>

Closes #13476 from wangmiao1981/start.
2016-06-07 09:13:18 -07:00
felixcheung c76457c8e4 [SPARK-10903][SPARKR] R - Simplify SQLContext method signatures and use a singleton
Eliminate the need to pass sqlContext to method since it is a singleton - and we don't want to support multiple contexts in a R session.

Changes are done in a back compat way with deprecation warning added. Method signature for S3 methods are added in a concise, clean approach such that in the next release the deprecated signature can be taken out easily/cleanly (just delete a few lines per method).

Custom method dispatch is implemented to allow for multiple JVM reference types that are all 'jobj' in R and to avoid having to add 30 new exports.

Author: felixcheung <felixcheung_m@hotmail.com>

Closes #9192 from felixcheung/rsqlcontext.
2016-05-26 11:20:20 -07:00
Daoyuan Wang d642b27354 [SPARK-15397][SQL] fix string udf locate as hive
## What changes were proposed in this pull request?

In Hive, `locate("aa", "aaa", 0)` yields 0, `locate("aa", "aaa", 1)` yields 1 and `locate("aa", "aaa", 2)` yields 2, while in Spark, `locate("aa", "aaa", 0)` yields 1, `locate("aa", "aaa", 1)` yields 2 and `locate("aa", "aaa", 2)` yields 0. The discrepancy comes from a different interpretation of the third parameter of the `locate` UDF. In Hive it is the starting index and counts from 1, so passing 0 always returns 0.
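
The Hive-style behavior described above (1-based start position, 1-based result, 0 for "not found") can be written as a small plain-Python function. This sketch reproduces the three Hive results quoted in this message. Treating every start position below 1 like 0 is an assumption, since only the 0 case is stated.

```python
def locate(substr, string, pos=1):
    """Return the 1-based index of substr in string, searching from 1-based pos."""
    if pos < 1:
        # Assumption: any start position below 1 behaves like 0 and finds nothing.
        return 0
    # str.find is 0-based and returns -1 when absent, so +1 maps that to 0.
    return string.find(substr, pos - 1) + 1

results = [locate("aa", "aaa", start) for start in (0, 1, 2)]  # [0, 1, 2]
```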

## How was this patch tested?

tested with modified `StringExpressionsSuite` and `StringFunctionsSuite`

Author: Daoyuan Wang <daoyuan.wang@intel.com>

Closes #13186 from adrian-wang/locate.
2016-05-23 23:29:15 -07:00
Sun Rui b3930f74a0 [SPARK-15202][SPARKR] add dapplyCollect() method for DataFrame in SparkR.
## What changes were proposed in this pull request?

dapplyCollect() applies an R function on each partition of a SparkDataFrame and collects the result back to R as a data.frame.
```
dapplyCollect(df, function(ldf) {...})
```

## How was this patch tested?
SparkR unit tests.

Author: Sun Rui <sunrui2016@gmail.com>

Closes #12989 from sun-rui/SPARK-15202.
2016-05-12 17:50:55 -07:00
Sun Rui 157a49aa41 [SPARK-11395][SPARKR] Support over and window specification in SparkR.
This PR:
1. Implement WindowSpec S4 class.
2. Implement Window.partitionBy() and Window.orderBy() as utility functions to create WindowSpec objects.
3. Implement over() of Column class.

Author: Sun Rui <rui.sun@intel.com>
Author: Sun Rui <sunrui2016@gmail.com>

Closes #10094 from sun-rui/SPARK-11395.
2016-05-05 18:49:43 -07:00
NarineK 22226fcc92 [SPARK-15110] [SPARKR] Implement repartitionByColumn for SparkR DataFrames
## What changes were proposed in this pull request?

Implement repartitionByColumn on DataFrame.
This will allow us to run R functions on each partition identified by column groups with dapply() method.

## How was this patch tested?

Unit tests

Author: NarineK <narine.kokhlikyan@us.ibm.com>

Closes #12887 from NarineK/repartitionByColumns.
2016-05-05 12:00:55 -07:00
Sun Rui 8b6491fc0b [SPARK-15091][SPARKR] Fix warnings and a failure in SparkR test cases with testthat version 1.0.1
## What changes were proposed in this pull request?
Fix warnings and a failure in SparkR test cases with testthat version 1.0.1

## How was this patch tested?
SparkR unit test cases.

Author: Sun Rui <sunrui2016@gmail.com>

Closes #12867 from sun-rui/SPARK-15091.
2016-05-03 09:29:49 -07:00
Sun Rui 4ae9fe091c [SPARK-12919][SPARKR] Implement dapply() on DataFrame in SparkR.
## What changes were proposed in this pull request?

dapply() applies an R function on each partition of a DataFrame and returns a new DataFrame.

The function signature is:

	dapply(df, function(localDF) {}, schema = NULL)

R function input: local data.frame from the partition on local node
R function output: local data.frame

Schema specifies the Row format of the resulting DataFrame. It must match the R function's output.
If schema is not specified, each partition of the result DataFrame will be serialized in R into a single byte array. Such resulting DataFrame can be processed by successive calls to dapply().
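
The per-partition contract (local data frame in, local data frame out, one call per partition) can be sketched in plain Python. Lists of rows stand in for partitions, and the `dapply` helper below is hypothetical. Schema handling and serialization are left out.

```python
def dapply(partitions, func):
    """Apply func once to each partition and return the transformed partitions."""
    # Each partition is handed to func as a local list of rows.
    return [func(list(partition)) for partition in partitions]

partitions = [[1, 2], [3]]
doubled = dapply(partitions, lambda local_rows: [x * 2 for x in local_rows])
# The function sees a whole partition at a time, so per-partition
# aggregates such as a row count are possible.
counts = dapply(partitions, lambda local_rows: [len(local_rows)])
```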

## How was this patch tested?
SparkR unit tests.

Author: Sun Rui <rui.sun@intel.com>
Author: Sun Rui <sunrui2016@gmail.com>

Closes #12493 from sun-rui/SPARK-12919.
2016-04-29 16:41:07 -07:00
Sun Rui 9e785079b6 [SPARK-12235][SPARKR] Enhance mutate() to support replace existing columns.
Make the behavior of mutate more consistent with that in dplyr, besides support for replacing existing columns.
1. Throw error message when there are duplicated column names in the DataFrame being mutated.
2. When there are duplicated column names among the columns specified as arguments, the last column of the same name takes effect.
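
Both rules can be illustrated with a plain-Python sketch operating on a single row. The `mutate` helper, its argument layout, and the choice to evaluate every new column against the original row are assumptions made for illustration.

```python
def mutate(columns, row, assignments):
    """Return (columns, row) after applying (name, function) assignments."""
    # Rule 1: refuse to mutate data that already has duplicated column names.
    if len(set(columns)) != len(columns):
        raise ValueError("duplicated column names in the data frame")
    # Rule 2: for repeated names in the arguments, the last one wins.
    effective = {}
    for name, func in assignments:
        effective[name] = func
    record = dict(zip(columns, row))
    new_values = {name: func(record) for name, func in effective.items()}
    record.update(new_values)  # replaces existing columns, appends new ones
    return list(record), list(record.values())

new_columns, new_row = mutate(
    ["a", "b"],
    [1, 2],
    [("b", lambda r: r["a"] + 10), ("c", lambda r: 0), ("c", lambda r: r["b"] * 2)],
)
```

Here the existing column `b` is replaced and, of the two definitions of `c`, only the second one is used.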

Author: Sun Rui <rui.sun@intel.com>

Closes #10220 from sun-rui/SPARK-12235.
2016-04-28 09:33:58 -07:00
Oscar D. Lara Yejas e4bfb4aa73 [SPARK-13436][SPARKR] Added parameter drop to subsetting operator [
Added parameter drop to subsetting operator [. This is useful to get a Column from a DataFrame, given its name. R supports it.

In R:
```
> name <- "Sepal_Length"
> class(iris[, name])
[1] "numeric"
```
Currently, in SparkR:
```
> name <- "Sepal_Length"
> class(irisDF[, name])
[1] "DataFrame"
```

The previous code returns a DataFrame, which is inconsistent with R's behavior. SparkR should return a Column instead. Currently, the only way for a user to get a Column from a column name held in a character variable is `eval(parse(x))`, where x is the string `"irisDF$Sepal_Length"`, which is pretty hacky. `SparkR:::getColumn()` is another choice, but I don't see why this method should be externalized. Instead, following R's way of doing things, the proposed implementation allows this:

```
> name <- "Sepal_Length"
> class(irisDF[, name, drop=T])
[1] "Column"

> class(irisDF[, name, drop=F])
[1] "DataFrame"
```

This is consistent with R:

```
> name <- "Sepal_Length"
> class(iris[, name])
[1] "numeric"
> class(iris[, name, drop=F])
[1] "data.frame"
```

Author: Oscar D. Lara Yejas <odlaraye@oscars-mbp.usca.ibm.com>
Author: Oscar D. Lara Yejas <odlaraye@oscars-mbp.attlocal.net>

Closes #11318 from olarayej/SPARK-13436.
2016-04-27 15:47:54 -07:00
Oscar D. Lara Yejas 0c99c23b7d [SPARK-13734][SPARKR] Added histogram function
## What changes were proposed in this pull request?

Added method histogram() to compute the histogram of a Column

Usage:

```
## Create a DataFrame from the Iris dataset
irisDF <- createDataFrame(sqlContext, iris)

## Render a histogram for the Sepal_Length column
histogram(irisDF, "Sepal_Length", nbins=12)

```
![histogram](https://cloud.githubusercontent.com/assets/13985649/13588486/e1e751c6-e484-11e5-85db-2fc2115c4bb2.png)

Note: Usage will change once SPARK-9325 is figured out so that histogram() only takes a Column as a parameter, as opposed to a DataFrame and a name

## How was this patch tested?

All unit tests pass. I added specific unit cases for different scenarios.

Author: Oscar D. Lara Yejas <odlaraye@oscars-mbp.usca.ibm.com>
Author: Oscar D. Lara Yejas <odlaraye@oscars-mbp.attlocal.net>

Closes #11569 from olarayej/SPARK-13734.
2016-04-26 15:34:30 -07:00
Reynold Xin 890abd1279 [SPARK-14869][SQL] Don't mask exceptions in ResolveRelations
## What changes were proposed in this pull request?
In order to support running SQL directly on files, we added some code in ResolveRelations to catch the exception thrown by catalog.lookupRelation and ignore it. This unfortunately masks all the exceptions. This patch changes the logic to simply test the table's existence.

## How was this patch tested?
I manually hacked some bugs into Spark and made sure the exceptions were being propagated up.

Author: Reynold Xin <rxin@databricks.com>

Closes #12634 from rxin/SPARK-14869.
2016-04-23 12:49:36 -07:00
felixcheung a55fbe2a16 [SPARK-12148][SPARKR] SparkR: rename DataFrame to SparkDataFrame
## What changes were proposed in this pull request?

Changed the class name defined in R from "DataFrame" to "SparkDataFrame". A popular package, S4Vectors, already defines "DataFrame"; this change avoids the conflict.

Aside from the class name and API/roxygen2 references, SparkR APIs like `createDataFrame` and `as.DataFrame` are not changed (S4Vectors does not define an "as.DataFrame").

Since in R, one would rarely reference type/class, this change should have minimal/almost-no impact to a SparkR user in terms of back compat.

## How was this patch tested?

SparkR tests, manually loading S4Vectors and then the SparkR package

Author: felixcheung <felixcheung_m@hotmail.com>

Closes #12621 from felixcheung/rdataframe.
2016-04-23 00:20:27 -07:00
Dongjoon Hyun 14869ae64e [SPARK-14639] [PYTHON] [R] Add bround function in Python/R.
## What changes were proposed in this pull request?

This issue aims to expose Scala `bround` function in Python/R API.
`bround` function is implemented in SPARK-14614 by extending current `round` function.
We used the following semantics from Hive.
```java
public static double bround(double input, int scale) {
    if (Double.isNaN(input) || Double.isInfinite(input)) {
      return input;
    }
    return BigDecimal.valueOf(input).setScale(scale, RoundingMode.HALF_EVEN).doubleValue();
}
```

After this PR, `pyspark` and `sparkR` also support `bround` function.

**PySpark**
```python
>>> from pyspark.sql.functions import bround
>>> sqlContext.createDataFrame([(2.5,)], ['a']).select(bround('a', 0).alias('r')).collect()
[Row(r=2.0)]
```

**SparkR**
```r
> df = createDataFrame(sqlContext, data.frame(x = c(2.5, 3.5)))
> head(collect(select(df, bround(df$x, 0))))
  bround(x, 0)
1            2
2            4
```
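
The HALF_EVEN ("banker's") rounding quoted from Hive above can be mirrored in plain Python with the `decimal` module. This is a sketch of the same semantics, not the code added by this PR. Going through `repr(value)` plays the role of `BigDecimal.valueOf(input)`, which also starts from the decimal string form of the double.

```python
import math
from decimal import Decimal, ROUND_HALF_EVEN

def bround(value, scale=0):
    """Round value to `scale` decimal places, with ties going to the even digit."""
    if math.isnan(value) or math.isinf(value):
        return value
    quantum = Decimal(1).scaleb(-scale)  # scale=2 gives Decimal('0.01')
    rounded = Decimal(repr(value)).quantize(quantum, rounding=ROUND_HALF_EVEN)
    return float(rounded)

values = [bround(2.5), bround(3.5)]  # [2.0, 4.0], as in the SparkR output above
```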

## How was this patch tested?

Pass the Jenkins tests (including new testcases).

Author: Dongjoon Hyun <dongjoon@apache.org>

Closes #12509 from dongjoon-hyun/SPARK-14639.
2016-04-19 22:28:11 -07:00
Sun Rui 8eedf0b553 [SPARK-13905][SPARKR] Change signature of as.data.frame() to be consistent with the R base package.
## What changes were proposed in this pull request?

Change the signature of as.data.frame() to be consistent with that in the R base package, matching R users' conventions.

## How was this patch tested?
dev/lint-r
SparkR unit tests

Author: Sun Rui <rui.sun@intel.com>

Closes #11811 from sun-rui/SPARK-13905.
2016-04-19 19:57:03 -07:00
gatorsmile 9f838bd242 [SPARK-14362][SPARK-14406][SQL][FOLLOW-UP] DDL Native Support: Drop View and Drop Table
#### What changes were proposed in this pull request?
This PR is to address the comment: https://github.com/apache/spark/pull/12146#discussion-diff-59092238. It removes the function `isViewSupported` from `SessionCatalog`. After the removal, we still can capture the user errors if users try to drop a table using `DROP VIEW`.

#### How was this patch tested?
Modified the existing test cases

Author: gatorsmile <gatorsmile@gmail.com>

Closes #12284 from gatorsmile/followupDropTable.
2016-04-10 20:46:15 -07:00
Burak Yavuz 1146c534d6 [SPARK-14353] Dataset Time Window window API for R
## What changes were proposed in this pull request?

The `window` function was added to Dataset with [this PR](https://github.com/apache/spark/pull/12008).
This PR adds the R API for this function.

With this PR, SQL, Java, and Scala share the same APIs; users can call:
 - `window(timeColumn, windowDuration)`
 - `window(timeColumn, windowDuration, slideDuration)`
 - `window(timeColumn, windowDuration, slideDuration, startTime)`

In Python and R, users can access all APIs above, but in addition they can do
 - In R:
   `window(timeColumn, windowDuration, startTime=...)`

that is, they can provide the startTime without providing the `slideDuration`. In this case, we will generate tumbling windows.
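
To make the difference between tumbling and sliding windows concrete, the plain-Python sketch below lists the windows a timestamp falls into. Integer seconds stand in for timestamps and durations, windows are treated as half-open intervals `[start, end)`, and the `windows` helper with its parameter names is hypothetical. It illustrates the idea only and is not Spark's implementation.

```python
def windows(ts, window, slide=None, start_time=0):
    """Return the (start, end) windows containing ts, earliest first."""
    # Without a slide duration the windows tumble: slide equals window length.
    slide = window if slide is None else slide
    # Latest window start at or before ts, aligned to start_time.
    start = ts - ((ts - start_time) % slide)
    found = []
    while start > ts - window:
        found.append((start, start + window))
        start -= slide
    return sorted(found)

tumbling = windows(12, 10)               # [(10, 20)]
sliding = windows(12, 10, slide=5)       # [(5, 15), (10, 20)]
shifted = windows(12, 10, start_time=3)  # [(3, 13)]
```

With only `start_time` given, each timestamp still belongs to exactly one window, which is the tumbling case described above.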

## How was this patch tested?

Unit tests + manual tests

Author: Burak Yavuz <brkyvz@gmail.com>

Closes #12141 from brkyvz/R-windows.
2016-04-05 17:21:41 -07:00
Andrew Or 20ddf5fddf [SPARK-14014][SQL] Integrate session catalog (attempt #2)
## What changes were proposed in this pull request?

This reopens #11836, which was merged but promptly reverted because it introduced flaky Hive tests.

## How was this patch tested?

See `CatalogTestCases`, `SessionCatalogSuite` and `HiveContextSuite`.

Author: Andrew Or <andrew@databricks.com>

Closes #11938 from andrewor14/session-catalog-again.
2016-03-24 22:59:35 -07:00
Andrew Or c44d140cae Revert "[SPARK-14014][SQL] Replace existing catalog with SessionCatalog"
This reverts commit 5dfc01976b.
2016-03-23 22:21:15 -07:00
Andrew Or 5dfc01976b [SPARK-14014][SQL] Replace existing catalog with SessionCatalog
## What changes were proposed in this pull request?

`SessionCatalog`, introduced in #11750, is a catalog that keeps track of temporary functions and tables, and delegates metastore operations to `ExternalCatalog`. This functionality overlaps a lot with the existing `analysis.Catalog`.

As of this commit, `SessionCatalog` and `ExternalCatalog` will no longer be dead code. There are still things that need to be done after this patch, namely:
- SPARK-14013: Properly implement temporary functions in `SessionCatalog`
- SPARK-13879: Decide which DDL/DML commands to support natively in Spark
- SPARK-?????: Implement the ones we do want to support through `SessionCatalog`.
- SPARK-?????: Merge SQL/HiveContext

## How was this patch tested?

This is largely a refactoring task so there are no new tests introduced. The particularly relevant tests are `SessionCatalogSuite` and `ExternalCatalogSuite`.

Author: Andrew Or <andrew@databricks.com>
Author: Yin Huai <yhuai@databricks.com>

Closes #11836 from andrewor14/use-session-catalog.
2016-03-23 13:34:22 -07:00
Sun Rui c7e68c3968 [SPARK-13812][SPARKR] Fix SparkR lint-r test errors.
## What changes were proposed in this pull request?

This PR fixes all SparkR lint-r errors newly reported after the lintr package was updated from GitHub.

## How was this patch tested?

dev/lint-r
SparkR unit tests

Author: Sun Rui <rui.sun@intel.com>

Closes #11652 from sun-rui/SPARK-13812.
2016-03-13 14:30:44 -07:00
Yanbo Liang 4d535d1f1c [SPARK-13389][SPARKR] SparkR support first/last with ignore NAs
## What changes were proposed in this pull request?

SparkR support first/last with ignore NAs

cc sun-rui felixcheung shivaram

## How was this patch tested?

unit tests

Author: Yanbo Liang <ybliang8@gmail.com>

Closes #11267 from yanboliang/spark-13389.
2016-03-10 17:31:19 -08:00
Oscar D. Lara Yejas 416e71af4d [SPARK-13327][SPARKR] Added parameter validations for colnames<-
Author: Oscar D. Lara Yejas <odlaraye@oscars-mbp.attlocal.net>
Author: Oscar D. Lara Yejas <odlaraye@oscars-mbp.usca.ibm.com>

Closes #11220 from olarayej/SPARK-13312-3.
2016-03-10 17:10:23 -08:00
Yanbo Liang 50e60e36f7 [SPARK-13504] [SPARKR] Add approxQuantile for SparkR
## What changes were proposed in this pull request?
Add ```approxQuantile``` for SparkR.
## How was this patch tested?
unit tests

Author: Yanbo Liang <ybliang8@gmail.com>

Closes #11383 from yanboliang/spark-13504 and squashes the following commits:

4f17adb [Yanbo Liang] Add approxQuantile for SparkR
2016-02-25 21:23:41 -08:00
Cheng Lian d9efe63ecd [SPARK-12799] Simplify various string output for expressions
This PR introduces several major changes:

1. Replacing `Expression.prettyString` with `Expression.sql`

   The `prettyString` method is mostly an internal, developer-facing facility for debugging purposes, and shouldn't be exposed to users.

1. Using a SQL-like representation as column names for selected fields that are not named expressions (back-ticks and double quotes should be removed)

   Before, we were using `prettyString` as column names when possible, and sometimes the result column names can be weird.  Here are several examples:

   Expression         | `prettyString` | `sql`      | Note
   ------------------ | -------------- | ---------- | ---------------
   `a && b`           | `a && b`       | `a AND b`  |
   `a.getField("f")`  | `a[f]`         | `a.f`      | `a` is a struct

1. Adding trait `NonSQLExpression` extending from `Expression` for expressions that don't have a SQL representation (e.g. Scala UDF/UDAF and Java/Scala object expressions used for encoders)

   `NonSQLExpression.sql` may return an arbitrary user-facing string representation of the expression.

Author: Cheng Lian <lian@databricks.com>

Closes #10757 from liancheng/spark-12799.simplify-expression-string-methods.
2016-02-21 22:53:15 +08:00
Yanbo Liang e7f9199e70 [SPARK-12903][SPARKR] Add covar_samp and covar_pop for SparkR
Add ```covar_samp``` and ```covar_pop``` for SparkR.
Should we also provide a ```cov``` alias for ```covar_samp```? There is already a ```cov``` implementation in stats.R that masks ```stats::cov```, but adding the alias may introduce a breaking API change.

cc sun-rui felixcheung shivaram

Author: Yanbo Liang <ybliang8@gmail.com>

Closes #10829 from yanboliang/spark-12903.
2016-01-26 19:29:47 -08:00
Narine Kokhlikyan 8a88e12128 [SPARK-12629][SPARKR] Fixes for DataFrame saveAsTable method
I've tried to solve some of the issues mentioned in: https://issues.apache.org/jira/browse/SPARK-12629
Please let me know what you think.
Thanks!

Author: Narine Kokhlikyan <narine.kokhlikyan@gmail.com>

Closes #10580 from NarineK/sparkrSavaAsRable.
2016-01-22 10:35:02 -08:00
Sun Rui 1b2a918e59 [SPARK-12204][SPARKR] Implement drop method for DataFrame in SparkR.
Author: Sun Rui <rui.sun@intel.com>

Closes #10201 from sun-rui/SPARK-12204.
2016-01-20 21:08:15 -08:00
Herman van Hovell 1017327930 [SPARK-12848][SQL] Change parsed decimal literal datatype from Double to Decimal
The current parser turns a decimal literal, for example ```12.1```, into a Double. The problem with this approach is that we convert an exact literal into a non-exact ```Double```. This PR changes that behavior: a decimal literal is now converted into an exact ```BigDecimal```.

The behavior for scientific decimals, for example ```12.1e01```, is unchanged. This will be converted into a Double.

This PR replaces the ```BigDecimal``` literal by a ```Double``` literal, because the ```BigDecimal``` is the default now. You can use the double literal by appending a 'D' to the value, for instance: ```3.141527D```
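
The exactness argument can be illustrated outside Spark. The sketch below uses Python's standard `decimal` module as a stand-in for `BigDecimal` and Python's `float` as a stand-in for `Double`; it says nothing about the parser itself.

```python
from decimal import Decimal

# Parsed from the literal text: the value 12.1 is kept exactly.
exact = Decimal("12.1")

# Built from a binary double: 12.1 has no finite binary expansion, so the
# stored value is only the nearest representable double.
approx = Decimal(12.1)

print(exact)
print(approx)
print(exact == approx)  # False: the double is not exactly 12.1

# The same effect shows up in arithmetic.
print(0.1 + 0.2 == 0.3)                                   # False with doubles
print(Decimal("0.1") + Decimal("0.2") == Decimal("0.3"))  # True with decimals
```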

cc davies rxin

Author: Herman van Hovell <hvanhovell@questtec.nl>

Closes #10796 from hvanhovell/SPARK-12848.
2016-01-20 15:13:01 -08:00
felixcheung 488bbb216c [SPARK-12232][SPARKR] New R API for read.table to avoid name conflict
shivaram sorry it took longer to fix some conflicts, this is the change to add an alias for `table`

Author: felixcheung <felixcheung_m@hotmail.com>

Closes #10406 from felixcheung/readtable.
2016-01-19 18:31:03 -08:00
Sun Rui 3ac648289c [SPARK-12337][SPARKR] Implement dropDuplicates() method of DataFrame in SparkR.
Author: Sun Rui <rui.sun@intel.com>

Closes #10309 from sun-rui/SPARK-12337.
2016-01-19 16:37:18 -08:00
felixcheung 92502703f4 [SPARK-12862][SPARKR] Jenkins does not run R tests
Slight correction: I'm leaving sparkR as-is (i.e. R file not supported) and fixed only run-tests.sh as shivaram described.

I also assume we are going to cover all doc changes in https://issues.apache.org/jira/browse/SPARK-12846 instead of here.

rxin shivaram zjffdu

Author: felixcheung <felixcheung_m@hotmail.com>

Closes #10792 from felixcheung/sparkRcmd.
2016-01-17 09:29:08 -08:00
Oscar D. Lara Yejas ba4a641902 [SPARK-11031][SPARKR] Method str() on a DataFrame
Author: Oscar D. Lara Yejas <odlaraye@oscars-mbp.usca.ibm.com>
Author: Oscar D. Lara Yejas <olarayej@mail.usf.edu>
Author: Oscar D. Lara Yejas <oscar.lara.yejas@us.ibm.com>
Author: Oscar D. Lara Yejas <odlaraye@oscars-mbp.attlocal.net>

Closes #9613 from olarayej/SPARK-11031.
2016-01-15 07:37:54 -08:00
Wenchen Fan 962e9bcf94 [SPARK-12756][SQL] use hash expression in Exchange
This PR makes bucketing and exchange share one common hash algorithm, so that we can guarantee the data distribution is the same between shuffle and a bucketed data source. This lets us shuffle only one side when joining a bucketed table with a normal one.

This PR also fixes the tests that are broken by the new hash behaviour in shuffle.
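
Why a shared hash lets one side skip the shuffle can be sketched in plain Python. This is an illustration only: `zlib.crc32` is a stand-in for Spark's hash expression, and the names `partition_id` and `partition` are hypothetical.

```python
import zlib

NUM_PARTITIONS = 4


def partition_id(key, num_partitions=NUM_PARTITIONS):
    # Stand-in deterministic hash; NOT the hash expression Spark uses.
    return zlib.crc32(key.encode("utf-8")) % num_partitions


def partition(rows, num_partitions=NUM_PARTITIONS):
    """Place each (key, value) row into the partition chosen by the shared hash."""
    parts = [[] for _ in range(num_partitions)]
    for key, value in rows:
        parts[partition_id(key, num_partitions)].append((key, value))
    return parts


# The bucketed side was laid out once, when the table was written.
bucketed = partition([("a", 1), ("b", 2), ("c", 3)])
# Only the normal side is shuffled at query time, with the same hash.
shuffled = partition([("a", "x"), ("c", "y"), ("d", "z")])

# Equal keys always land at the same partition index, so the join can run
# partition by partition without moving the bucketed rows.
joined = [
    (k1, v1, v2)
    for left, right in zip(bucketed, shuffled)
    for k1, v1 in left
    for k2, v2 in right
    if k1 == k2
]
print(sorted(joined))
```

If the two sides used different hash functions, matching keys could land in different partitions and both sides would need to be shuffled.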

Author: Wenchen Fan <wenchen@databricks.com>

Closes #10703 from cloud-fan/use-hash-expr-in-shuffle.
2016-01-13 22:43:28 -08:00
Yanbo Liang 3d77cffec0 [SPARK-12645][SPARKR] SparkR support hash function
Add ```hash``` function for SparkR ```DataFrame```.

Author: Yanbo Liang <ybliang8@gmail.com>

Closes #10597 from yanboliang/spark-12645.
2016-01-09 12:29:51 +05:30
Yanbo Liang d1fea41363 [SPARK-12393][SPARKR] Add read.text and write.text for SparkR
Add ```read.text``` and ```write.text``` for SparkR.
cc sun-rui felixcheung shivaram

Author: Yanbo Liang <ybliang8@gmail.com>

Closes #10348 from yanboliang/spark-12393.
2016-01-06 12:05:41 +05:30
felixcheung cc4d5229c9 [SPARK-12625][SPARKR][SQL] replace R usage of Spark SQL deprecated API
rxin davies shivaram
Took save mode from my PR #10480, and moved everything to writer methods. This is related to PR #10559

- [x] it seems jsonRDD() is broken, need to investigate - this is not a public API though; will look into some more tonight. (fixed)

Author: felixcheung <felixcheung_m@hotmail.com>

Closes #10584 from felixcheung/rremovedeprecated.
2016-01-04 22:32:07 -08:00
felixcheung c3d505602d [SPARK-12327][SPARKR] fix code for lintr warning for commented code
shivaram

Author: felixcheung <felixcheung_m@hotmail.com>

Closes #10408 from felixcheung/rcodecomment.
2016-01-03 20:53:35 +05:30
Hossein f6ecf14333 [SPARK-11199][SPARKR] Improve R context management story and add getOrCreate
* Changes api.r.SQLUtils to use ```SQLContext.getOrCreate``` instead of creating a new context.
* Adds a simple test

[SPARK-11199] #comment link with JIRA

Author: Hossein <hossein@databricks.com>

Closes #9185 from falaki/SPARK-11199.
2015-12-29 11:44:20 -08:00
Forest Fang d80cc90b55 [SPARK-12526][SPARKR] `ifelse`, `when`, `otherwise` unable to take Column as value
`ifelse`, `when`, and `otherwise` are unable to take `Column`-typed S4 objects as values.

For example:
```r
ifelse(lit(1) == lit(1), lit(2), lit(3))
ifelse(df$mpg > 0, df$mpg, 0)
```
will both fail with
```r
attempt to replicate an object of type 'environment'
```

The PR replaces `ifelse` calls with `if ... else ...` inside the function implementations to avoid the attempt to vectorize (i.e. `rep()`). It remains to be discussed whether we should instead support vectorization in these functions for consistency, because `ifelse` in base R is vectorized, but I cannot foresee any scenario in which these functions would need to be vectorized in SparkR.

For reference, added test cases which trigger failures:
```r
. Error: when(), otherwise() and ifelse() with column on a DataFrame ----------
error in evaluating the argument 'x' in selecting a method for function 'collect':
  error in evaluating the argument 'col' in selecting a method for function 'select':
  attempt to replicate an object of type 'environment'
Calls: when -> when -> ifelse -> ifelse

1: withCallingHandlers(eval(code, new_test_environment), error = capture_calls, message = function(c) invokeRestart("muffleMessage"))
2: eval(code, new_test_environment)
3: eval(expr, envir, enclos)
4: expect_equal(collect(select(df, when(df$a > 1 & df$b > 2, lit(1))))[, 1], c(NA, 1)) at test_sparkSQL.R:1126
5: expect_that(object, equals(expected, label = expected.label, ...), info = info, label = label)
6: condition(object)
7: compare(actual, expected, ...)
8: collect(select(df, when(df$a > 1 & df$b > 2, lit(1))))
Error: Test failures
Execution halted
```

Author: Forest Fang <forest.fang@outlook.com>

Closes #10481 from saurfang/spark-12526.
2015-12-29 12:45:24 +05:30
Yanbo Liang 22f6cd86fc [SPARK-12310][SPARKR] Add write.json and write.parquet for SparkR
Add ```write.json``` and ```write.parquet``` for SparkR, and deprecate ```saveAsParquetFile```.

Author: Yanbo Liang <ybliang8@gmail.com>

Closes #10281 from yanboliang/spark-12310.
2015-12-16 10:34:30 -08:00
gatorsmile 1e3526c2d3 [SPARK-12158][SPARKR][SQL] Fix 'sample' functions that break R unit test cases
The existing sample functions are missing the `seed` parameter, even though the corresponding function interface in `generics` has it. Thus, although the caller can pass a `seed`, the value is never used.

This could cause SparkR unit tests to fail. For example, I hit it in another PR:
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/47213/consoleFull
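
The bug pattern described above (a `seed` argument that is accepted but silently dropped) can be sketched in plain Python. The two functions below are hypothetical and use Python's `random` module, not SparkR's `sample`.

```python
import random


def sample_ignoring_seed(rows, fraction, seed=None):
    # Bug pattern: `seed` appears in the signature but is never used,
    # so repeated calls with the same seed can return different rows.
    rng = random.Random()
    return [r for r in rows if rng.random() < fraction]


def sample_with_seed(rows, fraction, seed=None):
    # Fixed version: the seed drives the generator, so results are repeatable.
    rng = random.Random(seed)
    return [r for r in rows if rng.random() < fraction]


rows = list(range(100))
first = sample_with_seed(rows, 0.5, seed=42)
second = sample_with_seed(rows, 0.5, seed=42)
print(first == second)  # True: same seed, same sample
```

A test that asserts on the exact rows of a seeded sample passes reliably only with the second version.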

Author: gatorsmile <gatorsmile@gmail.com>

Closes #10160 from gatorsmile/sampleR.
2015-12-11 20:55:16 -08:00
Yanbo Liang 0fb9825556 [SPARK-12146][SPARKR] SparkR jsonFile should support multiple input files
* ```jsonFile``` should support multiple input files, such as:
```R
jsonFile(sqlContext, c("path1", "path2")) # character vector as arguments
jsonFile(sqlContext, "path1,path2")
```
* Meanwhile, ```jsonFile``` has been deprecated by Spark SQL and will be removed in Spark 2.0. So we mark ```jsonFile``` as deprecated and use ```read.json``` on the SparkR side.
* Replace all ```jsonFile``` with ```read.json``` in test_sparkSQL.R, but still keep the jsonFile test case.
* If this PR is accepted, we should also make almost the same change for ```parquetFile```.

cc felixcheung sun-rui shivaram

Author: Yanbo Liang <ybliang8@gmail.com>

Closes #10145 from yanboliang/spark-12146.
2015-12-11 11:47:35 -08:00
Yanbo Liang d9d354ed40 [SPARK-12234][SPARKR] Fix `subset` function error when only set `select` argument
Fix the ```subset``` function error that occurs when only the ```select``` argument is set. Please refer to the [JIRA](https://issues.apache.org/jira/browse/SPARK-12234) for the error and how to reproduce it.

cc sun-rui felixcheung shivaram

Author: Yanbo Liang <ybliang8@gmail.com>

Closes #10217 from yanboliang/spark-12234.
2015-12-10 10:18:58 -08:00
Yanbo Liang eeb58722ad [SPARK-12198][SPARKR] SparkR support read.parquet and deprecate parquetFile
SparkR supports ```read.parquet``` and deprecates ```parquetFile```. This change is similar to #10145 for ```jsonFile```.

Author: Yanbo Liang <ybliang8@gmail.com>

Closes #10191 from yanboliang/spark-12198.
2015-12-10 09:44:53 -08:00
Sun Rui 39d677c8f1 [SPARK-12034][SPARKR] Eliminate warnings in SparkR test cases.
This PR:
1. Suppress all known warnings.
2. Clean up test cases and fix some errors in them.
3. Fix errors in HiveContext related test cases. These test cases were not actually run previously due to a bug in creating TestHiveContext.
4. Support 'testthat' package version 0.11.0 which prefers that test cases be under 'tests/testthat'
5. Make sure the default Hadoop file system is local when running test cases.
6. Turn warnings into errors.

Author: Sun Rui <rui.sun@intel.com>

Closes #10030 from sun-rui/SPARK-12034.
2015-12-07 10:38:17 -08:00
Renamed from R/pkg/inst/tests/test_sparkSQL.R