Commit graph

18 commits

Author SHA1 Message Date
goldmedal a28728a9af [SPARK-21513][SQL][FOLLOWUP] Allow UDF to_json support converting MapType to json for PySpark and SparkR
## What changes were proposed in this pull request?
In previous work SPARK-21513, we has allowed `MapType` and `ArrayType` of `MapType`s convert to a json string but only for Scala API. In this follow-up PR, we will make SparkSQL support it for PySpark and SparkR, too. We also fix some little bugs and comments of the previous work in this follow-up PR.

### For PySpark
```
>>> data = [(1, {"name": "Alice"})]
>>> df = spark.createDataFrame(data, ("key", "value"))
>>> df.select(to_json(df.value).alias("json")).collect()
[Row(json=u'{"name":"Alice")']
>>> data = [(1, [{"name": "Alice"}, {"name": "Bob"}])]
>>> df = spark.createDataFrame(data, ("key", "value"))
>>> df.select(to_json(df.value).alias("json")).collect()
[Row(json=u'[{"name":"Alice"},{"name":"Bob"}]')]
```
### For SparkR
```
# Converts a map into a JSON object
df2 <- sql("SELECT map('name', 'Bob')) as people")
df2 <- mutate(df2, people_json = to_json(df2$people))
# Converts an array of maps into a JSON array
df2 <- sql("SELECT array(map('name', 'Bob'), map('name', 'Alice')) as people")
df2 <- mutate(df2, people_json = to_json(df2$people))
```
## How was this patch tested?
Add unit test cases.

cc viirya HyukjinKwon

Author: goldmedal <liugs963@gmail.com>

Closes #19223 from goldmedal/SPARK-21513-fp-PySaprkAndSparkR.
2017-09-15 11:53:10 +09:00
hyukjinkwon 07fd68a29f [SPARK-21897][PYTHON][R] Add unionByName API to DataFrame in Python and R
## What changes were proposed in this pull request?

This PR proposes to add a wrapper for `unionByName` API to R and Python as well.

**Python**

```python
df1 = spark.createDataFrame([[1, 2, 3]], ["col0", "col1", "col2"])
df2 = spark.createDataFrame([[4, 5, 6]], ["col1", "col2", "col0"])
df1.unionByName(df2).show()
```

```
+----+----+----+
|col0|col1|col3|
+----+----+----+
|   1|   2|   3|
|   6|   4|   5|
+----+----+----+
```

**R**

```R
df1 <- select(createDataFrame(mtcars), "carb", "am", "gear")
df2 <- select(createDataFrame(mtcars), "am", "gear", "carb")
head(unionByName(limit(df1, 2), limit(df2, 2)))
```

```
  carb am gear
1    4  1    4
2    4  1    4
3    4  1    4
4    4  1    4
```

## How was this patch tested?

Doctests for Python and unit test added in `test_sparkSQL.R` for R.

Author: hyukjinkwon <gurwls223@gmail.com>

Closes #19105 from HyukjinKwon/unionByName-r-python.
2017-09-03 21:03:21 +09:00
Felix Cheung 6077e3ef3c [SPARK-21801][SPARKR][TEST] unit test randomly fail with randomforest
## What changes were proposed in this pull request?

fix the random seed to eliminate variability

## How was this patch tested?

jenkins, appveyor, lots more jenkins

Author: Felix Cheung <felixcheung_m@hotmail.com>

Closes #19018 from felixcheung/rrftest.
2017-08-29 10:09:41 -07:00
Andrew Ray 5c9b301727 [SPARK-21584][SQL][SPARKR] Update R method for summary to call new implementation
## What changes were proposed in this pull request?

SPARK-21100 introduced a new `summary` method to the Scala/Java Dataset API that included  expanded statistics (vs `describe`) and control over which statistics to compute. Currently in the R API `summary` acts as an alias for `describe`. This patch updates the R API to call the new `summary` method in the JVM that includes additional statistics and ability to select which to compute.

This does not break the current interface as the present `summary` method does not take additional arguments like `describe` and the output was never meant to be used programmatically.

## How was this patch tested?

Modified and additional unit tests.

Author: Andrew Ray <ray.andrew@gmail.com>

Closes #18786 from aray/summary-r.
2017-08-21 23:08:27 -07:00
actuaryzhang 55aa4da285 [SPARK-21622][ML][SPARKR] Support offset in SparkR GLM
## What changes were proposed in this pull request?
Support offset in SparkR GLM #16699

Author: actuaryzhang <actuaryzhang10@gmail.com>

Closes #18831 from actuaryzhang/sparkROffset.
2017-08-06 15:14:12 -07:00
hyukjinkwon 97ba491836 [SPARK-21602][R] Add map_keys and map_values functions to R
## What changes were proposed in this pull request?

This PR adds `map_values` and `map_keys` to R API.

```r
> df <- createDataFrame(cbind(model = rownames(mtcars), mtcars))
> tmp <- mutate(df, v = create_map(df$model, df$cyl))
> head(select(tmp, map_keys(tmp$v)))
```
```
        map_keys(v)
1         Mazda RX4
2     Mazda RX4 Wag
3        Datsun 710
4    Hornet 4 Drive
5 Hornet Sportabout
6           Valiant
```
```r
> head(select(tmp, map_values(tmp$v)))
```
```
  map_values(v)
1             6
2             6
3             4
4             6
5             8
6             6
```

## How was this patch tested?

Manual tests and unit tests in `R/pkg/tests/fulltests/test_sparkSQL.R`

Author: hyukjinkwon <gurwls223@gmail.com>

Closes #18809 from HyukjinKwon/map-keys-values-r.
2017-08-03 23:00:00 +09:00
wangmiao1981 9570e81aa9 [SPARK-21381][SPARKR] SparkR: pass on setHandleInvalid for classification algorithms
## What changes were proposed in this pull request?

SPARK-20307 Added handleInvalid option to RFormula for tree-based classification algorithms. We should add this parameter for other classification algorithms in SparkR.

This is a followup PR for SPARK-20307.

## How was this patch tested?

New Unit tests are added.

Author: wangmiao1981 <wm624@hotmail.com>

Closes #18605 from wangmiao1981/class.
2017-07-31 20:37:06 -07:00
Yanbo Liang 69e5282d3c [SPARK-20307][ML][SPARKR][FOLLOW-UP] RFormula should handle invalid for both features and label column.
## What changes were proposed in this pull request?
```RFormula``` should handle invalid for both features and label column.
#18496 only handle invalid values in features column. This PR add handling invalid values for label column and test cases.

## How was this patch tested?
Add test cases.

Author: Yanbo Liang <ybliang8@gmail.com>

Closes #18613 from yanboliang/spark-20307.
2017-07-15 20:56:38 +08:00
Sean Owen 425c4ada4c [SPARK-19810][BUILD][CORE] Remove support for Scala 2.10
## What changes were proposed in this pull request?

- Remove Scala 2.10 build profiles and support
- Replace some 2.10 support in scripts with commented placeholders for 2.12 later
- Remove deprecated API calls from 2.10 support
- Remove usages of deprecated context bounds where possible
- Remove Scala 2.10 workarounds like ScalaReflectionLock
- Other minor Scala warning fixes

## How was this patch tested?

Existing tests

Author: Sean Owen <sowen@cloudera.com>

Closes #17150 from srowen/SPARK-19810.
2017-07-13 17:06:24 +08:00
hyukjinkwon 2bfd5accdc [SPARK-21266][R][PYTHON] Support schema a DDL-formatted string in dapply/gapply/from_json
## What changes were proposed in this pull request?

This PR supports schema in a DDL formatted string for `from_json` in R/Python and `dapply` and `gapply` in R, which are commonly used and/or consistent with Scala APIs.

Additionally, this PR exposes `structType` in R to allow working around in other possible corner cases.

**Python**

`from_json`

```python
from pyspark.sql.functions import from_json

data = [(1, '''{"a": 1}''')]
df = spark.createDataFrame(data, ("key", "value"))
df.select(from_json(df.value, "a INT").alias("json")).show()
```

**R**

`from_json`

```R
df <- sql("SELECT named_struct('name', 'Bob') as people")
df <- mutate(df, people_json = to_json(df$people))
head(select(df, from_json(df$people_json, "name STRING")))
```

`structType.character`

```R
structType("a STRING, b INT")
```

`dapply`

```R
dapply(createDataFrame(list(list(1.0)), "a"), function(x) {x}, "a DOUBLE")
```

`gapply`

```R
gapply(createDataFrame(list(list(1.0)), "a"), "a", function(key, x) { x }, "a DOUBLE")
```

## How was this patch tested?

Doc tests for `from_json` in Python and unit tests `test_sparkSQL.R` in R.

Author: hyukjinkwon <gurwls223@gmail.com>

Closes #18498 from HyukjinKwon/SPARK-21266.
2017-07-10 10:40:03 -07:00
wangmiao1981 a7b46c627b [SPARK-20307][SPARKR] SparkR: pass on setHandleInvalid to spark.mllib functions that use StringIndexer
## What changes were proposed in this pull request?

For randomForest classifier, if test data contains unseen labels, it will throw an error. The StringIndexer already has the handleInvalid logic. The patch add a new method to set the underlying StringIndexer handleInvalid logic.

This patch should also apply to other classifiers. This PR focuses on the main logic and randomForest classifier. I will do follow-up PR for other classifiers.

## How was this patch tested?

Add a new unit test based on the error case in the JIRA.

Author: wangmiao1981 <wm624@hotmail.com>

Closes #18496 from wangmiao1981/handle.
2017-07-07 23:51:32 -07:00
hyukjinkwon db44f5f3e8 [SPARK-21224][R] Specify a schema by using a DDL-formatted string when reading in R
## What changes were proposed in this pull request?

This PR proposes to support a DDL-formetted string as schema as below:

```r
mockLines <- c("{\"name\":\"Michael\"}",
               "{\"name\":\"Andy\", \"age\":30}",
               "{\"name\":\"Justin\", \"age\":19}")
jsonPath <- tempfile(pattern = "sparkr-test", fileext = ".tmp")
writeLines(mockLines, jsonPath)
df <- read.df(jsonPath, "json", "name STRING, age DOUBLE")
collect(df)
```

## How was this patch tested?

Tests added in `test_streaming.R` and `test_sparkSQL.R` and manual tests.

Author: hyukjinkwon <gurwls223@gmail.com>

Closes #18431 from HyukjinKwon/r-ddl-schema.
2017-06-28 19:36:00 -07:00
hyukjinkwon 07479b3cfb [SPARK-21149][R] Add job description API for R
## What changes were proposed in this pull request?

Extend `setJobDescription` to SparkR API.

## How was this patch tested?

It looks difficult to add a test. Manually tested as below:

```r
df <- createDataFrame(iris)
count(df)
setJobDescription("This is an example job.")
count(df)
```

prints ...

![2017-06-22 12 05 49](https://user-images.githubusercontent.com/6477701/27415670-2a649936-5743-11e7-8e95-312f1cd103af.png)

Author: hyukjinkwon <gurwls223@gmail.com>

Closes #18382 from HyukjinKwon/SPARK-21149.
2017-06-23 09:59:24 -07:00
wangmiao1981 53543374ce [SPARK-20906][SPARKR] Constrained Logistic Regression for SparkR
## What changes were proposed in this pull request?

PR https://github.com/apache/spark/pull/17715 Added Constrained Logistic Regression for ML. We should add it to SparkR.

## How was this patch tested?

Add new unit tests.

Author: wangmiao1981 <wm624@hotmail.com>

Closes #18128 from wangmiao1981/test.
2017-06-21 20:42:45 -07:00
actuaryzhang ad459cfb1d [SPARK-20917][ML][SPARKR] SparkR supports string encoding consistent with R
## What changes were proposed in this pull request?

Add `stringIndexerOrderType` to `spark.glm` and `spark.survreg` to support string encoding that is consistent with default R.

## How was this patch tested?
new tests

Author: actuaryzhang <actuaryzhang10@gmail.com>

Closes #18140 from actuaryzhang/sparkRFormula.
2017-06-21 10:35:16 -07:00
actuaryzhang 110ce1f27b [SPARK-20892][SPARKR] Add SQL trunc function to SparkR
## What changes were proposed in this pull request?

Add SQL trunc function

## How was this patch tested?
standard test

Author: actuaryzhang <actuaryzhang10@gmail.com>

Closes #18291 from actuaryzhang/sparkRTrunc2.
2017-06-18 18:00:27 -07:00
Felix Cheung 9f4ff95524 [SPARK-20877][SPARKR][FOLLOWUP] clean up after test move
## What changes were proposed in this pull request?

clean up after big test move

## How was this patch tested?

unit tests, jenkins

Author: Felix Cheung <felixcheung_m@hotmail.com>

Closes #18267 from felixcheung/rtestset2.
2017-06-11 03:00:44 -07:00
Felix Cheung dc4c351837 [SPARK-20877][SPARKR] refactor tests to basic tests only for CRAN
## What changes were proposed in this pull request?

Move all existing tests to non-installed directory so that it will never run by installing SparkR package

For a follow-up PR:
- remove all skip_on_cran() calls in tests
- clean up test timer
- improve or change basic tests that do run on CRAN (if anyone has suggestion)

It looks like `R CMD build pkg` will still put pkg\tests (ie. the full tests) into the source package but `R CMD INSTALL` on such source package does not install these tests (and so `R CMD check` does not run them)

## How was this patch tested?

- [x] unit tests, Jenkins
- [x] AppVeyor
- [x] make a source package, install it, `R CMD check` it - verify the full tests are not installed or run

Author: Felix Cheung <felixcheung_m@hotmail.com>

Closes #18264 from felixcheung/rtestset.
2017-06-11 00:00:33 -07:00