Commit graph

227 commits

Author SHA1 Message Date
Felix Cheung 7087e01194 [SPARK-20543][SPARKR][FOLLOWUP] Don't skip tests on AppVeyor
## What changes were proposed in this pull request?

add environment variable so tests are not skipped on AppVeyor

## How was this patch tested?

wait for appveyor run

Author: Felix Cheung <felixcheung_m@hotmail.com>

Closes #17878 from felixcheung/appveyorrcran.
2017-05-07 13:10:10 -07:00
Felix Cheung 57b64703e6 [SPARK-20571][SPARKR][SS] Flaky Structured Streaming tests
## What changes were proposed in this pull request?

Make tests more reliable by having them wait until the data is processed.
Increasing the timeout value might help, but ultimately the flakiness from processing delay on Jenkins is hard to account for. This isn't an actual public API that is supported.

## How was this patch tested?
unit tests

Author: Felix Cheung <felixcheung_m@hotmail.com>

Closes #17857 from felixcheung/rsstestrelia.
2017-05-04 01:54:59 -07:00
zero323 f21897fc15 [SPARK-20544][SPARKR] R wrapper for input_file_name
## What changes were proposed in this pull request?

Adds wrapper for `o.a.s.sql.functions.input_file_name`
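
A brief usage sketch of the new wrapper (the JSON path is hypothetical):

```R
# assumes an active SparkR session; the path is a placeholder
df <- read.df("/tmp/people.json", "json")
head(select(df, input_file_name()))  # the source file of each row
```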

## How was this patch tested?

Existing unit tests, additional unit tests, `check-cran.sh`.

Author: zero323 <zero323@users.noreply.github.com>

Closes #17818 from zero323/SPARK-20544.
2017-05-04 01:51:37 -07:00
zero323 9c36aa2791 [SPARK-20585][SPARKR] R generic hint support
## What changes were proposed in this pull request?

Adds support for generic hints on `SparkDataFrame`
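
For example, a broadcast hint can now be passed through the generic (a minimal sketch):

```R
df1 <- createDataFrame(mtcars)
df2 <- createDataFrame(mtcars)
# ask the planner to broadcast df2 in the join
joined <- join(df1, hint(df2, "broadcast"), df1$gear == df2$gear)
```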

## How was this patch tested?

Unit tests, `check-cran.sh`

Author: zero323 <zero323@users.noreply.github.com>

Closes #17851 from zero323/SPARK-20585.
2017-05-04 01:41:36 -07:00
Felix Cheung fc472bddd1 [SPARK-20543][SPARKR] skip tests when running on CRAN
## What changes were proposed in this pull request?

General rule on whether to skip or not - skip if:
- RDD tests
- tests that could run long or are complicated (streaming, hivecontext)
- tests on error conditions
- tests that are unlikely to change/break
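
A sketch of the resulting test pattern, assuming testthat's `skip_on_cran()` (which honors the `NOT_CRAN` environment variable) is the mechanism used:

```R
library(testthat)

test_that("an RDD-level behavior", {
  skip_on_cran()          # skipped during R CMD check --as-cran
  expect_equal(1 + 1, 2)  # placeholder for a long-running assertion
})
```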

## How was this patch tested?

unit tests, `R CMD check --as-cran`, `R CMD check`

Author: Felix Cheung <felixcheung_m@hotmail.com>

Closes #17817 from felixcheung/rskiptest.
2017-05-03 21:40:18 -07:00
zero323 90d77e971f [SPARK-20532][SPARKR] Implement grouping and grouping_id
## What changes were proposed in this pull request?

Adds R wrappers for:

- `o.a.s.sql.functions.grouping` as `is_grouping` (to avoid masking `base::grouping`)
- `o.a.s.sql.functions.grouping_id`
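
A hedged sketch over a `cube`; the wrapper names in released SparkR are assumed here to be `grouping_bit` and `grouping_id`:

```R
df <- createDataFrame(mtcars)
gdf <- cube(df, "cyl", "gear")
# grouping_bit marks whether a column is aggregated; grouping_id encodes the grouping level
collect(agg(gdf, grouping_bit(df$cyl), grouping_id(df$cyl, df$gear), avg(df$mpg)))
```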

## How was this patch tested?

Existing unit tests, additional unit tests. `check-cran.sh`.

Author: zero323 <zero323@users.noreply.github.com>

Closes #17807 from zero323/SPARK-20532.
2017-05-01 21:39:17 -07:00
Felix Cheung a355b667a3 [SPARK-20541][SPARKR][SS] support awaitTermination without timeout
## What changes were proposed in this pull request?

Add a variant without the timeout parameter - we will need this to submit a job that runs until stopped.
Needed for 2.2.
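
A minimal sketch, assuming `q` is a StreamingQuery handle returned by `write.stream()` (only one of the two calls would be used in practice):

```R
awaitTermination(q)        # new form: no timeout, blocks until the query stops
awaitTermination(q, 5000)  # existing form: returns after at most 5000 ms
```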

## How was this patch tested?

manually, unit test

Author: Felix Cheung <felixcheung_m@hotmail.com>

Closes #17815 from felixcheung/rssawaitinfinite.
2017-04-30 23:23:49 -07:00
zero323 80e9cf1b59 [SPARK-20490][SPARKR] Add R wrappers for eqNullSafe and ! / not
## What changes were proposed in this pull request?

- Add null-safe equality operator `%<=>%` (sames as `o.a.s.sql.Column.eqNullSafe`, `o.a.s.sql.Column.<=>`)
- Add boolean negation operator `!` and function `not`.
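
A short sketch contrasting the operators:

```R
df <- createDataFrame(data.frame(x = c(1, NA)))
collect(select(df,
               df$x == lit(1),     # plain equality: NULL for the NA row
               df$x %<=>% lit(1),  # null-safe equality: FALSE for the NA row
               !isNull(df$x)))     # the new `!` operator on Columns
```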

## How was this patch tested?

Existing unit tests, additional unit tests, `check-cran.sh`.

Author: zero323 <zero323@users.noreply.github.com>

Closes #17783 from zero323/SPARK-20490.
2017-04-30 22:07:12 -07:00
zero323 ae3df4e98f [SPARK-20535][SPARKR] R wrappers for explode_outer and posexplode_outer
## What changes were proposed in this pull request?

Add R wrappers for

- `o.a.s.sql.functions.explode_outer`
- `o.a.s.sql.functions.posexplode_outer`

## How was this patch tested?

Additional unit tests, manual testing.

Author: zero323 <zero323@users.noreply.github.com>

Closes #17809 from zero323/SPARK-20535.
2017-04-30 12:33:03 -07:00
hyukjinkwon 70f1bcd7bc [SPARK-20493][R] De-duplicate parse logics for DDL-like type strings in R
## What changes were proposed in this pull request?

It seems we are using `SQLUtils.getSQLDataType` for the type string in `structField`. It looks like we can replace this with `CatalystSqlParser.parseDataType`.

They handle similar DDL-like type definitions, as below:

```scala
scala> Seq(Tuple1(Tuple1("a"))).toDF.show()
```
```
+---+
| _1|
+---+
|[a]|
+---+
```

```scala
scala> Seq(Tuple1(Tuple1("a"))).toDF.select($"_1".cast("struct<_1:string>")).show()
```
```
+---+
| _1|
+---+
|[a]|
+---+
```

The equivalent type string on the R side looks identical, as below:

```R
> write.df(sql("SELECT named_struct('_1', 'a') as struct"), "/tmp/aa", "parquet")
> collect(read.df("/tmp/aa", "parquet", structType(structField("struct", "struct<_1:string>"))))
  struct
1      a
```

R’s version is stricter because we check the types via regular expressions on the R side ahead of time.

The actual logics look a bit different, but as we check them ahead on the R side, it looks like replacing it would not (I think) introduce behaviour changes. To make sure, tests dedicated to it were added in SPARK-20105. (It looks like `structField` is the only place that calls this method.)

## How was this patch tested?

Existing tests - https://github.com/apache/spark/blob/master/R/pkg/inst/tests/testthat/test_sparkSQL.R#L143-L194 should cover this.

Author: hyukjinkwon <gurwls223@gmail.com>

Closes #17785 from HyukjinKwon/SPARK-20493.
2017-04-29 11:02:17 -07:00
Yanbo Liang dbb06c689c [MINOR][ML] Fix some PySpark & SparkR flaky tests
## What changes were proposed in this pull request?
Some PySpark & SparkR tests run with a tiny dataset and tiny `maxIter`, which means they have not converged. I don't think checking intermediate results during iteration makes sense, and these intermediate results may be volatile and unstable, so we should switch to checking the converged result. We hit this issue in #17746 when we upgraded breeze to 0.13.1.

## How was this patch tested?
Existing tests.

Author: Yanbo Liang <ybliang8@gmail.com>

Closes #17757 from yanboliang/flaky-test.
2017-04-26 21:34:18 +08:00
zero323 df58a95a33 [SPARK-20437][R] R wrappers for rollup and cube
## What changes were proposed in this pull request?

- Add `rollup` and `cube` methods and corresponding generics.
- Add short description to the vignette.
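
A minimal sketch of the new grouping methods:

```R
df <- createDataFrame(mtcars)
collect(agg(rollup(df, "cyl", "gear"), avg(df$mpg)))  # hierarchical subtotals
collect(agg(cube(df, "cyl", "gear"), avg(df$mpg)))    # all grouping combinations
```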

## How was this patch tested?

- Existing unit tests.
- Additional unit tests covering new features.
- `check-cran.sh`.

Author: zero323 <zero323@users.noreply.github.com>

Closes #17728 from zero323/SPARK-20437.
2017-04-25 22:00:45 -07:00
Yanbo Liang 67eef47acf [SPARK-20449][ML] Upgrade breeze version to 0.13.1
## What changes were proposed in this pull request?
Upgrade breeze version to 0.13.1, which fixed some critical bugs of L-BFGS-B.

## How was this patch tested?
Existing unit tests.

Author: Yanbo Liang <ybliang8@gmail.com>

Closes #17746 from yanboliang/spark-20449.
2017-04-25 17:10:41 +00:00
zero323 8a272ddc9d [SPARK-20438][R] SparkR wrappers for split and repeat
## What changes were proposed in this pull request?

Add wrappers for `o.a.s.sql.functions`:

- `split` as `split_string`
- `repeat` as `repeat_string`
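
A usage sketch of the renamed wrappers:

```R
df <- createDataFrame(data.frame(s = "a,b,c", stringsAsFactors = FALSE))
collect(select(df, split_string(df$s, ","), repeat_string(df$s, 2)))
```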

## How was this patch tested?

Existing tests, additional unit tests, `check-cran.sh`

Author: zero323 <zero323@users.noreply.github.com>

Closes #17729 from zero323/SPARK-20438.
2017-04-24 10:56:57 -07:00
zero323 fd648bff63 [SPARK-20371][R] Add wrappers for collect_list and collect_set
## What changes were proposed in this pull request?

Adds wrappers for `collect_list` and `collect_set`.
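
A minimal aggregation sketch:

```R
df <- createDataFrame(mtcars)
# collect_list keeps duplicates; collect_set de-duplicates
collect(agg(groupBy(df, "cyl"), collect_list(df$gear), collect_set(df$gear)))
```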

## How was this patch tested?

Unit tests, `check-cran.sh`

Author: zero323 <zero323@users.noreply.github.com>

Closes #17672 from zero323/SPARK-20371.
2017-04-21 12:06:21 -07:00
zero323 46c5749768 [SPARK-20375][R] R wrappers for array and map
## What changes were proposed in this pull request?

Adds wrappers for `o.a.s.sql.functions.array` and `o.a.s.sql.functions.map`
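
A short sketch; the wrapper names in released SparkR are assumed here to be `create_array` and `create_map` (avoiding base R name clashes):

```R
df <- createDataFrame(data.frame(k = "a", v = 1, stringsAsFactors = FALSE))
collect(select(df, create_array(df$v, df$v), create_map(df$k, df$v)))
```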

## How was this patch tested?

Unit tests, `check-cran.sh`

Author: zero323 <zero323@users.noreply.github.com>

Closes #17674 from zero323/SPARK-20375.
2017-04-19 21:19:46 -07:00
Shixiong Zhu 4fea7848c4 [SPARK-20397][SPARKR][SS] Fix flaky test: test_streaming.R.Terminated by error
## What changes were proposed in this pull request?

Checking a source parameter is asynchronous. When the query is created, it is not guaranteed that the source has been created. This PR just increases the timeout of awaitTermination to ensure the parsing error is thrown.

## How was this patch tested?

Jenkins

Author: Shixiong Zhu <shixiong@databricks.com>

Closes #17687 from zsxwing/SPARK-20397.
2017-04-19 13:10:44 -07:00
hyukjinkwon 24f09b39c7 [SPARK-19828][R][FOLLOWUP] Rename asJsonArray to as.json.array in from_json function in R
## What changes were proposed in this pull request?

`as.json.array` was suggested in the first place in the PR for SPARK-19828, but we could not use it because the lint check emits an error for multiple dots in variable names.

After SPARK-20278, we are now able to use names like `multiple.dots.in.names`. `asJsonArray` in the `from_json` function can still be changed since 2.2 is not released yet.

So, this PR proposes to rename `asJsonArray` to `as.json.array`.
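
A hedged sketch of the renamed argument:

```R
jsonArr <- "[{\"name\":\"Bob\"}, {\"name\":\"Alice\"}]"
df <- as.DataFrame(list(list(people = jsonArr)))
schema <- structType(structField("name", "string"))
collect(select(df, from_json(df$people, schema, as.json.array = TRUE)))
```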

## How was this patch tested?

Jenkins tests, local tests with `./R/run-tests.sh` and manual `./dev/lint-r`. Existing tests should cover this.

Author: hyukjinkwon <gurwls223@gmail.com>

Closes #17653 from HyukjinKwon/SPARK-19828-followup.
2017-04-17 09:04:24 -07:00
Brendan Dwyer 044f7ecbfd [SPARK-20298][SPARKR][MINOR] fixed spelling mistake "charactor"
## What changes were proposed in this pull request?

Fixed spelling of "charactor"

## How was this patch tested?

Spelling change only

Author: Brendan Dwyer <brendan.dwyer@ibm.com>

Closes #17611 from bdwyer2/SPARK-20298.
2017-04-12 09:24:41 +01:00
Felix Cheung 5a693b4138 [SPARK-20195][SPARKR][SQL] add createTable catalog API and deprecate createExternalTable
## What changes were proposed in this pull request?

Following up on #17483, add createTable (which is new in 2.2.0) and deprecate createExternalTable, plus a number of minor fixes
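
A minimal sketch of the new API (the path is hypothetical):

```R
# previously: createExternalTable("people", path = "...", source = "json")
df <- createTable("people", path = "/tmp/people.json", source = "json")
```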

## How was this patch tested?

manual, unit tests

Author: Felix Cheung <felixcheung_m@hotmail.com>

Closes #17511 from felixcheung/rceatetable.
2017-04-06 09:15:13 -07:00
zero323 b34f7665dd [SPARK-19825][R][ML] spark.ml R API for FPGrowth
## What changes were proposed in this pull request?

Adds SparkR API for FPGrowth: [SPARK-19825](https://issues.apache.org/jira/browse/SPARK-19825):

- `spark.fpGrowth` - model training.
- `freqItemsets` and `associationRules` methods with new corresponding generics.
- Scala helper: `org.apache.spark.ml.r.FPGrowthWrapper`
- unit tests.
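
A small end-to-end sketch of the new API:

```R
df <- selectExpr(createDataFrame(data.frame(raw = c("a b", "a b c"),
                                            stringsAsFactors = FALSE)),
                 "split(raw, ' ') AS items")
model <- spark.fpGrowth(df, itemsCol = "items",
                        minSupport = 0.5, minConfidence = 0.5)
head(freqItemsets(model))
head(associationRules(model))
```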

## How was this patch tested?

Feature specific unit tests.

Author: zero323 <zero323@users.noreply.github.com>

Closes #17170 from zero323/SPARK-19825.
2017-04-03 23:42:04 -07:00
Felix Cheung 93dbfe705f [SPARK-20159][SPARKR][SQL] Support all catalog API in R
## What changes were proposed in this pull request?

Add a set of catalog API in R

```
"currentDatabase",
"listColumns",
"listDatabases",
"listFunctions",
"listTables",
"recoverPartitions",
"refreshByPath",
"refreshTable",
"setCurrentDatabase",
```
https://github.com/apache/spark/pull/17483/files#diff-6929e6c5e59017ff954e110df20ed7ff
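
A few of the new functions in use (a sketch):

```R
currentDatabase()              # e.g. "default"
collect(listTables())          # tables in the current database
collect(listFunctions())
setCurrentDatabase("default")
```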

## How was this patch tested?

manual tests, unit tests

Author: Felix Cheung <felixcheung_m@hotmail.com>

Closes #17483 from felixcheung/rcatalog.
2017-04-02 11:59:27 -07:00
hyukjinkwon 3fada2f502 [SPARK-20105][TESTS][R] Add tests for checkType and type string in structField in R
## What changes were proposed in this pull request?

It seems `checkType` and the type string in `structField` are not being tested closely. This string format currently seems SparkR-specific (see d1f6c64c4b/sql/core/src/main/scala/org/apache/spark/sql/api/r/SQLUtils.scala (L93-L131)) but resembles a SQL type definition.

Therefore, it seems nicer to test positive/negative cases on the R side.
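
For instance, the positive/negative cases look roughly like this (a sketch):

```R
structField("a", "array<map<string,integer>>")  # valid: accepted
# invalid strings should fail fast at construction time
tryCatch(structField("a", "nonexistent_type"),
         error = function(e) conditionMessage(e))
```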

## How was this patch tested?

Unit tests in `test_sparkSQL.R`.

Author: hyukjinkwon <gurwls223@gmail.com>

Closes #17439 from HyukjinKwon/r-typestring-tests.
2017-03-27 10:43:00 -07:00
Yanbo Liang 478fbc866f [SPARK-19925][SPARKR] Fix SparkR spark.getSparkFiles fails when it was called on executors.
## What changes were proposed in this pull request?
SparkR `spark.getSparkFiles` fails when called on executors; see details at [SPARK-19925](https://issues.apache.org/jira/browse/SPARK-19925).
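
A sketch of the now-working pattern (the file path is hypothetical):

```R
spark.addFile("/tmp/lookup.txt")
# previously failed on executors; now resolves the per-executor copy
spark.lapply(seq(2), function(i) spark.getSparkFiles("lookup.txt"))
```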

## How was this patch tested?
Add unit tests, and verify the fix on standalone and yarn clusters.

Author: Yanbo Liang <ybliang8@gmail.com>

Closes #17274 from yanboliang/spark-19925.
2017-03-21 21:50:54 -07:00
Felix Cheung a8877bdbba [SPARK-19237][SPARKR][CORE] On Windows spark-submit should handle when java is not installed
## What changes were proposed in this pull request?

When SparkR is installed as an R package there might not be any Java runtime.
If it is not there, SparkR's `sparkR.session()` will block waiting for the connection timeout, hanging the R IDE/shell, without any notification or message.

## How was this patch tested?

manually

- [x] need to test on Windows

Author: Felix Cheung <felixcheung_m@hotmail.com>

Closes #16596 from felixcheung/rcheckjava.
2017-03-21 14:24:41 -07:00
Wenchen Fan 68d65fae71 [SPARK-19949][SQL] unify bad record handling in CSV and JSON
## What changes were proposed in this pull request?

Currently JSON and CSV have exactly the same logic for handling bad records; this PR abstracts it and puts it at an upper level to reduce code duplication.

The overall idea is that we make the JSON and CSV parsers throw a `BadRecordException`, and the upper level, `FailureSafeParser`, handles bad records according to the parse mode.

Behavior changes:
1. with PERMISSIVE mode, if the number of tokens doesn't match the schema, previously the CSV parser treated it as a legal record and parsed as many tokens as possible. After this PR, we treat it as an illegal record and put the raw record string in a special column, but we still parse as many tokens as possible.
2. all logging is removed, as it is not very useful in practice.
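
A hedged sketch of the PERMISSIVE behavior as seen from the R side (the corrupt-record column name and the CSV path are assumptions; the column must be declared in the schema):

```R
schema <- structType(structField("a", "integer"),
                     structField("b", "integer"),
                     structField("_corrupt_record", "string"))
df <- read.df("/tmp/rows.csv", "csv", schema = schema, mode = "PERMISSIVE",
              columnNameOfCorruptRecord = "_corrupt_record")
# rows whose token count doesn't match land in _corrupt_record as raw strings
```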

## How was this patch tested?

existing tests

Author: Wenchen Fan <wenchen@databricks.com>
Author: hyukjinkwon <gurwls223@gmail.com>
Author: Wenchen Fan <cloud0fan@gmail.com>

Closes #17315 from cloud-fan/bad-record2.
2017-03-20 21:43:14 -07:00
Felix Cheung c40597720e [SPARK-20020][SPARKR] DataFrame checkpoint API
## What changes were proposed in this pull request?

Add checkpoint, setCheckpointDir API to R
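
A minimal sketch (the checkpoint directory is hypothetical):

```R
setCheckpointDir("/tmp/spark-checkpoints")
df <- createDataFrame(mtcars)
df2 <- checkpoint(df)  # truncates lineage; data is materialized reliably
```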

## How was this patch tested?

unit tests, manual tests

Author: Felix Cheung <felixcheung_m@hotmail.com>

Closes #17351 from felixcheung/rdfcheckpoint.
2017-03-19 22:34:18 -07:00
hyukjinkwon 0cdcf91145 [SPARK-19849][SQL] Support ArrayType in to_json to produce JSON array
## What changes were proposed in this pull request?

This PR proposes to support an array of struct type in `to_json` as below:

```scala
import org.apache.spark.sql.functions._

val df = Seq(Tuple1(Tuple1(1) :: Nil)).toDF("a")
df.select(to_json($"a").as("json")).show()
```

```
+----------+
|      json|
+----------+
|[{"_1":1}]|
+----------+
```

Currently, it throws an exception as below (a newline manually inserted for readability):

```
org.apache.spark.sql.AnalysisException: cannot resolve 'structtojson(`array`)' due to data type
mismatch: structtojson requires that the expression is a struct expression.;;
```

This allows the roundtrip with `from_json` as below:

```scala
import org.apache.spark.sql.functions._
import org.apache.spark.sql.types._

val schema = ArrayType(StructType(StructField("a", IntegerType) :: Nil))
val df = Seq("""[{"a":1}, {"a":2}]""").toDF("json").select(from_json($"json", schema).as("array"))
df.show()

// Read back.
df.select(to_json($"array").as("json")).show()
```

```
+----------+
|     array|
+----------+
|[[1], [2]]|
+----------+

+-----------------+
|             json|
+-----------------+
|[{"a":1},{"a":2}]|
+-----------------+
```

Also, this PR proposes to rename from `StructToJson` to `StructsToJson ` and `JsonToStruct` to `JsonToStructs`.

## How was this patch tested?

Unit tests in `JsonFunctionsSuite` and `JsonExpressionsSuite` for Scala, doctest for Python and test in `test_sparkSQL.R` for R.

Author: hyukjinkwon <gurwls223@gmail.com>

Closes #17192 from HyukjinKwon/SPARK-19849.
2017-03-19 22:33:01 -07:00
Felix Cheung 422aa67d1b [SPARK-18817][SPARKR][SQL] change derby log output to temp dir
## What changes were proposed in this pull request?

Passes R `tempdir()` (the R session temp dir, shared with other temp files/dirs) to the JVM and sets the system property for the Derby home dir, moving derby.log there.

## How was this patch tested?

Manually, unit tests

With this, the files are relocated under /tmp:
```
# ls /tmp/RtmpG2M0cB/
derby.log
```
And they are removed automatically when the R session is ended.

Author: Felix Cheung <felixcheung_m@hotmail.com>

Closes #16330 from felixcheung/rderby.
2017-03-19 10:37:15 -07:00
Felix Cheung 5c165596da [SPARK-19654][SPARKR][SS] Structured Streaming API for R
## What changes were proposed in this pull request?

Add "experimental" API for SS in R

## How was this patch tested?

manual, unit tests

Author: Felix Cheung <felixcheung_m@hotmail.com>

Closes #16982 from felixcheung/rss.
2017-03-18 16:26:48 -07:00
hyukjinkwon d1f6c64c4b [SPARK-19828][R] Support array type in from_json in R
## What changes were proposed in this pull request?

Since we could not directly define the array type in R, this PR proposes to support array types in R as string types that are used in `structField` as below:

```R
jsonArr <- "[{\"name\":\"Bob\"}, {\"name\":\"Alice\"}]"
df <- as.DataFrame(list(list("people" = jsonArr)))
collect(select(df, alias(from_json(df$people, "array<struct<name:string>>"), "arrcol")))
```

prints

```R
      arrcol
1 Bob, Alice
```

## How was this patch tested?

Unit tests in `test_sparkSQL.R`.

Author: hyukjinkwon <gurwls223@gmail.com>

Closes #17178 from HyukjinKwon/SPARK-19828.
2017-03-14 19:51:25 -07:00
actuaryzhang f6314eab4b [SPARK-19391][SPARKR][ML] Tweedie GLM API for SparkR
## What changes were proposed in this pull request?
Port Tweedie GLM #16344 to SparkR

felixcheung yanboliang
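
A hedged usage sketch (parameter names `var.power`/`link.power` as in the SparkR API; `link.power = 0` gives a log link):

```R
df <- createDataFrame(iris)
model <- spark.glm(df, Sepal_Length ~ Sepal_Width, family = "tweedie",
                   var.power = 1.2, link.power = 0)
summary(model)
```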

## How was this patch tested?
new test in SparkR

Author: actuaryzhang <actuaryzhang10@gmail.com>

Closes #16729 from actuaryzhang/sparkRTweedie.
2017-03-14 00:50:38 -07:00
Xin Ren 9f8ce4825e [SPARK-19282][ML][SPARKR] RandomForest Wrapper and GBT Wrapper return param "maxDepth" to R models
## What changes were proposed in this pull request?

RandomForest R Wrapper and GBT R Wrapper return param `maxDepth` to R models.

Below 4 R wrappers are changed:
* `RandomForestClassificationWrapper`
* `RandomForestRegressionWrapper`
* `GBTClassificationWrapper`
* `GBTRegressionWrapper`
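
A sketch of the change from the R side (exactly how `maxDepth` surfaces in the summary output is an assumption):

```R
df <- createDataFrame(iris)
model <- spark.randomForest(df, Species ~ ., "classification", maxDepth = 4)
summary(model)  # now assumed to report maxDepth alongside the other params
```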

## How was this patch tested?

Test manually on my local machine.

Author: Xin Ren <iamshrek@126.com>

Closes #17207 from keypointt/SPARK-19282.
2017-03-12 12:15:19 -07:00
Xiao Li 9a6ac7226f [SPARK-19601][SQL] Fix CollapseRepartition rule to preserve shuffle-enabled Repartition
### What changes were proposed in this pull request?

Observed by felixcheung in https://github.com/apache/spark/pull/16739: when users use the shuffle-enabled `repartition` API, they expect the number of partitions they get to be exactly the number they provided, even if they call the shuffle-disabled `coalesce` later.

Currently, the `CollapseRepartition` rule does not consider whether shuffle is enabled or not. Thus, we get the following unexpected result.

```Scala
    val df = spark.range(0, 10000, 1, 5)
    val df2 = df.repartition(10)
    assert(df2.coalesce(13).rdd.getNumPartitions == 5)
    assert(df2.coalesce(7).rdd.getNumPartitions == 5)
    assert(df2.coalesce(3).rdd.getNumPartitions == 3)
```

This PR is to fix the issue. We preserve shuffle-enabled Repartition.

### How was this patch tested?
Added a test case

Author: Xiao Li <gatorsmile@gmail.com>

Closes #16933 from gatorsmile/CollapseRepartition.
2017-03-08 09:36:01 -08:00
actuaryzhang 1f6c090c15 [SPARK-19818][SPARKR] rbind should check for name consistency of input data frames
## What changes were proposed in this pull request?
Added checks for name consistency of input data frames in union.
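
A minimal sketch of the new check:

```R
df1 <- createDataFrame(data.frame(a = 1, b = "x", stringsAsFactors = FALSE))
df2 <- createDataFrame(data.frame(b = "y", a = 2, stringsAsFactors = FALSE))
rbind(df1, df2)  # names must now match; mismatched names raise an error
```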

## How was this patch tested?
new test.

Author: actuaryzhang <actuaryzhang10@gmail.com>

Closes #17159 from actuaryzhang/sparkRUnion.
2017-03-06 21:55:11 -08:00
Felix Cheung 80d5338b32 [SPARK-19795][SPARKR] add column functions to_json, from_json
## What changes were proposed in this pull request?

Add column functions: to_json, from_json, and tests covering error cases.
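
A round-trip sketch of the new column functions:

```R
df <- sql("SELECT named_struct('name', 'Bob') AS s")
j <- select(df, alias(to_json(df$s), "js"))
collect(select(j, from_json(j$js, structType(structField("name", "string")))))
```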

## How was this patch tested?

unit tests, manual

Author: Felix Cheung <felixcheung_m@hotmail.com>

Closes #17134 from felixcheung/rtojson.
2017-03-05 12:37:02 -08:00
actuaryzhang 2ff1467d67 [DOC][MINOR][SPARKR] Update SparkR doc for names, columns and colnames
Update R doc:
1. `columns`, `names` and `colnames` return a vector of strings, not a **list** as the current doc says.
2. `colnames<-` does allow subset assignment, so the length of `value` can be less than the number of columns, e.g., `colnames(df)[1] <- "a"`.

felixcheung

Author: actuaryzhang <actuaryzhang10@gmail.com>

Closes #17115 from actuaryzhang/sparkRMinorDoc.
2017-03-01 12:35:56 -08:00
actuaryzhang 7bf09433f5 [SPARK-19682][SPARKR] Issue warning (or error) when subset method "[[" takes vector index
## What changes were proposed in this pull request?
The `[[` method is supposed to take a single index and return a column. This is different from base R, which takes a vector index. We should check for this and issue a warning or error when a vector index is supplied (which is very likely given the behavior in base R).

Currently I'm issuing a warning message and just taking the first element of the vector index. We could change this to an error if that's better.
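
A sketch of the new behavior:

```R
df <- createDataFrame(mtcars)
df[[c("mpg", "cyl")]]  # now warns and uses only the first element, "mpg"
```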

## How was this patch tested?
new tests

Author: actuaryzhang <actuaryzhang10@gmail.com>

Closes #17017 from actuaryzhang/sparkRSubsetter.
2017-02-23 11:12:02 -08:00
wm624@hotmail.com 1f86e795b8 [SPARK-19616][SPARKR] weightCol and aggregationDepth should be improved for some SparkR APIs
## What changes were proposed in this pull request?

This is a follow-up PR of #16800

When doing SPARK-19456, we found that "" should be considered a NULL column name and should not be set. aggregationDepth should be exposed as an expert parameter.

## How was this patch tested?
Existing tests.

Author: wm624@hotmail.com <wm624@hotmail.com>

Closes #16945 from wangmiao1981/svc.
2017-02-22 11:50:24 -08:00
Yanbo Liang b406598382 [SPARK-18285][SPARKR] SparkR approxQuantile supports input multiple columns
## What changes were proposed in this pull request?
Make SparkR `approxQuantile` support multiple input columns.
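
A usage sketch of the widened signature:

```R
df <- createDataFrame(mtcars)
# one vector of quantiles per requested column
approxQuantile(df, c("mpg", "disp"), c(0.25, 0.5, 0.75), 0.0)
```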

## How was this patch tested?
Unit test.

Author: Yanbo Liang <ybliang8@gmail.com>

Closes #16951 from yanboliang/spark-19619.
2017-02-17 11:58:39 -08:00
Felix Cheung 671bc08ed5 [SPARK-19399][SPARKR] Add R coalesce API for DataFrame and Column
## What changes were proposed in this pull request?

Add coalesce on DataFrame for down partitioning without shuffle and coalesce on Column
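
A sketch of both variants:

```R
df <- createDataFrame(mtcars)
df2 <- coalesce(df, 1)                # DataFrame: narrow repartitioning, no shuffle
select(df, coalesce(df$mpg, df$cyl))  # Column: first non-NA value
```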

## How was this patch tested?

manual, unit tests

Author: Felix Cheung <felixcheung_m@hotmail.com>

Closes #16739 from felixcheung/rcoalesce.
2017-02-15 10:45:37 -08:00
wm624@hotmail.com 3973403d5d [SPARK-19456][SPARKR] Add LinearSVC R API
## What changes were proposed in this pull request?

Linear SVM classifier was newly added into ML and a Python API has been added. This JIRA adds the R-side API.

Marked as WIP while the unit tests are being designed.
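
A hedged sketch of the R API (binary labels assumed):

```R
df <- createDataFrame(iris[iris$Species != "setosa", ])
model <- spark.svmLinear(df, Species ~ ., regParam = 0.01)
summary(model)
head(predict(model, df))
```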

## How was this patch tested?

Unit tests (in progress; see the WIP note above).

Author: wm624@hotmail.com <wm624@hotmail.com>

Closes #16800 from wangmiao1981/svc.
2017-02-15 01:15:50 -08:00
titicaca bc0a0e6392 [SPARK-19342][SPARKR] bug fixed in collect method for collecting timestamp column
## What changes were proposed in this pull request?

Fix a bug in collect method for collecting timestamp column, the bug can be reproduced as shown in the following codes and outputs:

```
library(SparkR)
sparkR.session(master = "local")
df <- data.frame(col1 = c(0, 1, 2),
                 col2 = c(as.POSIXct("2017-01-01 00:00:01"), NA, as.POSIXct("2017-01-01 12:00:01")))

sdf1 <- createDataFrame(df)
print(dtypes(sdf1))
df1 <- collect(sdf1)
print(lapply(df1, class))

sdf2 <- filter(sdf1, "col1 > 0")
print(dtypes(sdf2))
df2 <- collect(sdf2)
print(lapply(df2, class))
```

As we can see from the printed output, the column type of col2 in df2 is unexpectedly converted to numeric when an NA exists at the top of the column.

This is caused by `do.call(c, list)`: if we combine a list this way, e.g. `do.call(c, list(NA, as.POSIXct("2017-01-01 12:00:01")))`, the class of the result is numeric instead of POSIXct.

Therefore, we need to cast the data type of the vector explicitly.

## How was this patch tested?

The patch can be tested manually with the same code above.

Author: titicaca <fangzhou.yang@hotmail.com>

Closes #16689 from titicaca/sparkr-dev.
2017-02-12 10:42:15 -08:00
Dongjoon Hyun c618ccdbe9 [SPARK-19464][BUILD][HOTFIX] run-tests should use hadoop2.6
## What changes were proposed in this pull request?

After SPARK-19464, **SparkPullRequestBuilder** fails because it still tries to use hadoop2.3.

**BEFORE**
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/72595/console
```
========================================================================
Building Spark
========================================================================
[error] Could not find hadoop2.3 in the list. Valid options  are ['hadoop2.6', 'hadoop2.7']
Attempting to post to Github...
 > Post successful.
```

**AFTER**
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/72595/console
```
========================================================================
Building Spark
========================================================================
[info] Building Spark (w/Hive 1.2.1) using SBT with these arguments:  -Phadoop-2.6 -Pmesos -Pkinesis-asl -Pyarn -Phive-thriftserver -Phive test:package streaming-kafka-0-8-assembly/assembly streaming-flume-assembly/assembly streaming-kinesis-asl-assembly/assembly
Using /usr/java/jdk1.8.0_60 as default JAVA_HOME.
Note, this will be overridden by -java-home if it is set.
```

## How was this patch tested?

Pass the existing test.

Author: Dongjoon Hyun <dongjoon@apache.org>

Closes #16858 from dongjoon-hyun/hotfix_run-tests.
2017-02-08 21:28:04 +00:00
anabranch 7a7ce272fe [SPARK-16609] Add to_date/to_timestamp with format functions
## What changes were proposed in this pull request?

This pull request adds two new user facing functions:
- `to_date` which accepts an expression and a format and returns a date.
- `to_timestamp` which accepts an expression and a format and returns a timestamp.

For example, given a date in the format `2016-21-05` (yyyy-dd-MM):

### Date Function
*Previously*
```
to_date(unix_timestamp(lit("2016-21-05"), "yyyy-dd-MM").cast("timestamp"))
```
*Current*
```
to_date(lit("2016-21-05"), "yyyy-dd-MM")
```

### Timestamp Function
*Previously*
```
unix_timestamp(lit("2016-21-05"), "yyyy-dd-MM").cast("timestamp")
```
*Current*
```
to_timestamp(lit("2016-21-05"), "yyyy-dd-MM")
```
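
The R side (see the task list below) gets the same pair; a hedged SparkR sketch:

```R
df <- createDataFrame(data.frame(d = "2016-21-05", stringsAsFactors = FALSE))
collect(select(df, to_date(df$d, "yyyy-dd-MM"), to_timestamp(df$d, "yyyy-dd-MM")))
```
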
### Tasks

- [X] Add `to_date` to Scala Functions
- [x] Add `to_date` to Python Functions
- [x] Add `to_date` to SQL Functions
- [X] Add `to_timestamp` to Scala Functions
- [x] Add `to_timestamp` to Python Functions
- [x] Add `to_timestamp` to SQL Functions
- [x] Add function to R

## How was this patch tested?

- [x] Add Functions to `DateFunctionsSuite`
- Test new `ParseToTimestamp` Expression (*not necessary*)
- Test new `ParseToDate` Expression (*not necessary*)
- [x] Add test for R
- [x] Add test for Python in test.py


Author: anabranch <wac.chambers@gmail.com>
Author: Bill Chambers <bill@databricks.com>
Author: anabranch <bill@databricks.com>

Closes #16138 from anabranch/SPARK-16609.
2017-02-07 15:50:30 +01:00
actuaryzhang b94f4b6fa6 [SPARK-19452][SPARKR] Fix bug in the name assignment method
## What changes were proposed in this pull request?
The `names` replacement method fails to check the validity of the assignment values. This can be fixed by calling `colnames` within `names`.
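
A minimal sketch of the fixed validation:

```R
df <- createDataFrame(data.frame(a = 1, b = 2))
names(df) <- c("x")  # now rejected: length must match the number of columns
```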

## How was this patch tested?
new tests.

Author: actuaryzhang <actuaryzhang10@gmail.com>

Closes #16794 from actuaryzhang/sparkRNames.
2017-02-05 11:37:45 -08:00
wm624@hotmail.com 9ac05225e8 [SPARK-19319][SPARKR] SparkR Kmeans summary returns error when the cluster size doesn't equal to k
## What changes were proposed in this pull request?

When Kmeans is run with initMode = "random" and some random seed, it is possible that the actual cluster size doesn't equal the configured `k`.

In this case, summary(model) returns an error because the number of columns of the coefficient matrix doesn't equal k.

Example:
```
> col1 <- c(1, 2, 3, 4, 0, 1, 2, 3, 4, 0)
> col2 <- c(1, 2, 3, 4, 0, 1, 2, 3, 4, 0)
> col3 <- c(1, 2, 3, 4, 0, 1, 2, 3, 4, 0)
> cols <- as.data.frame(cbind(col1, col2, col3))
> df <- createDataFrame(cols)
> model2 <- spark.kmeans(data = df, ~ ., k = 5, maxIter = 10, initMode = "random", seed = 22222, tol = 1E-5)
> summary(model2)
Error in `colnames<-`(`*tmp*`, value = c("col1", "col2", "col3")) :
  length of 'dimnames' [2] not equal to array extent
In addition: Warning message:
In matrix(coefficients, ncol = k) :
  data length [9] is not a sub-multiple or multiple of the number of rows [2]
```

Fix: Get the actual cluster size in the summary and use it to build the coefficient matrix.

## How was this patch tested?

Add unit tests.

Author: wm624@hotmail.com <wm624@hotmail.com>

Closes #16666 from wangmiao1981/kmeans.
2017-01-31 21:16:37 -08:00
actuaryzhang ce112cec4f [SPARK-19395][SPARKR] Convert coefficients in summary to matrix
## What changes were proposed in this pull request?
The `coefficients` component in a model summary should be a matrix, but the underlying structure is actually a list. This affects several models, except for `AFTSurvivalRegressionModel`, which has the correct implementation. The fix is to first `unlist` the coefficients returned from the `callJMethod` before converting to a matrix. An example illustrates the issue:

```
data(iris)
df <- createDataFrame(iris)
model <- spark.glm(df, Sepal_Length ~ Sepal_Width, family = "gaussian")
s <- summary(model)

> str(s$coefficients)
List of 8
 $ : num 6.53
 $ : num -0.223
 $ : num 0.479
 $ : num 0.155
 $ : num 13.6
 $ : num -1.44
 $ : num 0
 $ : num 0.152
 - attr(*, "dim")= int [1:2] 2 4
 - attr(*, "dimnames")=List of 2
  ..$ : chr [1:2] "(Intercept)" "Sepal_Width"
  ..$ : chr [1:4] "Estimate" "Std. Error" "t value" "Pr(>|t|)"
> s$coefficients[, 2]
$`(Intercept)`
[1] 0.4788963

$Sepal_Width
[1] 0.1550809
```

This shows that the underlying structure of the coefficients is still a `list`.

felixcheung wangmiao1981

Author: actuaryzhang <actuaryzhang10@gmail.com>

Closes #16730 from actuaryzhang/sparkRCoef.
2017-01-31 12:20:43 -08:00
Felix Cheung a7ab6f9a8f [SPARK-19324][SPARKR] Spark JVM stdout output is getting dropped in SparkR
## What changes were proposed in this pull request?

This mostly affects running jobs from the driver in client mode when results are expected through stdout (which should be somewhat rare, but possible).

Before:
```
> a <- as.DataFrame(cars)
> b <- group_by(a, "dist")
> c <- count(b)
> sparkR.callJMethod(c$countjc, "explain", TRUE)
NULL
```

After:
```
> a <- as.DataFrame(cars)
> b <- group_by(a, "dist")
> c <- count(b)
> sparkR.callJMethod(c$countjc, "explain", TRUE)
count#11L
NULL
```

Now, `column.explain()` doesn't seem very useful (we can get more extensive output with `DataFrame.explain()`), but there are other, more complex examples with calls to `println` on the Scala/JVM side that are getting dropped.

## How was this patch tested?

manual

Author: Felix Cheung <felixcheung_m@hotmail.com>

Closes #16670 from felixcheung/rjvmstdout.
2017-01-27 12:41:35 -08:00
Felix Cheung 90817a6cd0 [SPARK-18788][SPARKR] Add API for getNumPartitions
## What changes were proposed in this pull request?

With doc noting that this converts the DataFrame into an RDD.
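
A minimal sketch:

```R
df <- createDataFrame(cars)
getNumPartitions(df)  # converts the DataFrame to an RDD under the hood
```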

## How was this patch tested?

unit tests, manual tests

Author: Felix Cheung <felixcheung_m@hotmail.com>

Closes #16668 from felixcheung/rgetnumpartitions.
2017-01-26 21:06:39 -08:00