Commit graph

569 commits

Author SHA1 Message Date
Zhenhua Wang 655f6f86f8 [SPARK-22208][SQL] Improve percentile_approx by not rounding up targetError and starting from index 0
## What changes were proposed in this pull request?

Currently percentile_approx never returns the first element when percentile is in (relativeError, 1/N], where relativeError default 1/10000, and N is the total number of elements. But ideally, percentiles in [0, 1/N] should all return the first element as the answer.

For example, given input data 1 to 10, if a user queries 10% (or even less) percentile, it should return 1, because the first value 1 already reaches 10%. Currently it returns 2.

Based on the paper, targetError is not rounded up, and searching index should start from 0 instead of 1. By following the paper, we should be able to fix the cases mentioned above.

## How was this patch tested?

Added a new test case and fix existing test cases.

Author: Zhenhua Wang <wzh_zju@163.com>

Closes #19438 from wzhfy/improve_percentile_approx.
2017-10-11 00:16:12 -07:00
Liang-Chi Hsieh ae61f187aa [SPARK-22206][SQL][SPARKR] gapply in R can't work on empty grouping columns
## What changes were proposed in this pull request?

Looks like `FlatMapGroupsInRExec.requiredChildDistribution` didn't consider empty grouping attributes. It should be a problem when running `EnsureRequirements` and `gapply` in R can't work on empty grouping columns.

## How was this patch tested?

Added test.

Author: Liang-Chi Hsieh <viirya@gmail.com>

Closes #19436 from viirya/fix-flatmapinr-distribution.
2017-10-05 23:36:18 +09:00
Holden Karau 8fab7995d3 [SPARK-22167][R][BUILD] sparkr packaging issue allow zinc
## What changes were proposed in this pull request?

When zinc is running the pwd might be in the root of the project. A quick solution to this is to not go a level up incase we are in the root rather than root/core/. If we are in the root everything works fine, if we are in core add a script which goes and runs the level up

## How was this patch tested?

set -x in the SparkR install scripts.

Author: Holden Karau <holden@us.ibm.com>

Closes #19402 from holdenk/SPARK-22167-sparkr-packaging-issue-allow-zinc.
2017-10-02 11:46:51 -07:00
hyukjinkwon 02c91e03f9 [SPARK-22063][R] Fixes lint check failures in R by latest commit sha1 ID of lint-r
## What changes were proposed in this pull request?

Currently, we set lintr to jimhester/lintra769c0b (see [this](7d1175011c) and [SPARK-14074](https://issues.apache.org/jira/browse/SPARK-14074)).

I first tested and checked lintr-1.0.1 but it looks many important fixes are missing (for example, checking 100 length). So, I instead tried the latest commit, 5431140ffe, in my local and fixed the check failures.

It looks it has fixed many bugs and now finds many instances that I have observed and thought should be caught time to time, here I filed [the results](https://gist.github.com/HyukjinKwon/4f59ddcc7b6487a02da81800baca533c).

The downside looks it now takes about 7ish mins, (it was 2ish mins before) in my local.

## How was this patch tested?

Manually, `./dev/lint-r` after manually updating the lintr package.

Author: hyukjinkwon <gurwls223@gmail.com>
Author: zuotingbing <zuo.tingbing9@zte.com.cn>

Closes #19290 from HyukjinKwon/upgrade-r-lint.
2017-10-01 18:42:45 +09:00
Zhenhua Wang 365a29bdbf [SPARK-22100][SQL] Make percentile_approx support date/timestamp type and change the output type to be the same as input type
## What changes were proposed in this pull request?

The `percentile_approx` function previously accepted numeric type input and output double type results.

But since all numeric types, date and timestamp types are represented as numerics internally, `percentile_approx` can support them easily.

After this PR, it supports date type, timestamp type and numeric types as input types. The result type is also changed to be the same as the input type, which is more reasonable for percentiles.

This change is also required when we generate equi-height histograms for these types.

## How was this patch tested?

Added a new test and modified some existing tests.

Author: Zhenhua Wang <wangzhenhua@huawei.com>

Closes #19321 from wzhfy/approx_percentile_support_types.
2017-09-25 09:28:42 -07:00
hyukjinkwon a8d9ec8a60 [SPARK-21780][R] Simpler Dataset.sample API in R
## What changes were proposed in this pull request?

This PR make `sample(...)` able to omit `withReplacement` defaulting to `FALSE`.

In short, the following examples are allowed:

```r
> df <- createDataFrame(as.list(seq(10)))
> count(sample(df, fraction=0.5, seed=3))
[1] 4
> count(sample(df, fraction=1.0))
[1] 10
```

In addition, this PR also adds some type checking logics as below:

```r
> sample(df, fraction = "a")
Error in sample(df, fraction = "a") :
  fraction must be numeric; however, got character
> sample(df, fraction = 1, seed = NULL)
Error in sample(df, fraction = 1, seed = NULL) :
  seed must not be NULL or NA; however, got NULL
> sample(df, list(1), 1.0)
Error in sample(df, list(1), 1) :
  withReplacement must be logical; however, got list
> sample(df, fraction = -1.0)
...
Error in sample : illegal argument - requirement failed: Sampling fraction (-1.0) must be on interval [0, 1] without replacement
```

## How was this patch tested?

Manually tested, unit tests added in `R/pkg/tests/fulltests/test_sparkSQL.R`.

Author: hyukjinkwon <gurwls223@gmail.com>

Closes #19243 from HyukjinKwon/SPARK-21780.
2017-09-21 20:16:25 +09:00
Sean Owen e17901d6df [SPARK-22049][DOCS] Confusing behavior of from_utc_timestamp and to_utc_timestamp
## What changes were proposed in this pull request?

Clarify behavior of to_utc_timestamp/from_utc_timestamp with an example

## How was this patch tested?

Doc only change / existing tests

Author: Sean Owen <sowen@cloudera.com>

Closes #19276 from srowen/SPARK-22049.
2017-09-20 20:47:17 +09:00
goldmedal a28728a9af [SPARK-21513][SQL][FOLLOWUP] Allow UDF to_json support converting MapType to json for PySpark and SparkR
## What changes were proposed in this pull request?
In previous work SPARK-21513, we has allowed `MapType` and `ArrayType` of `MapType`s convert to a json string but only for Scala API. In this follow-up PR, we will make SparkSQL support it for PySpark and SparkR, too. We also fix some little bugs and comments of the previous work in this follow-up PR.

### For PySpark
```
>>> data = [(1, {"name": "Alice"})]
>>> df = spark.createDataFrame(data, ("key", "value"))
>>> df.select(to_json(df.value).alias("json")).collect()
[Row(json=u'{"name":"Alice")']
>>> data = [(1, [{"name": "Alice"}, {"name": "Bob"}])]
>>> df = spark.createDataFrame(data, ("key", "value"))
>>> df.select(to_json(df.value).alias("json")).collect()
[Row(json=u'[{"name":"Alice"},{"name":"Bob"}]')]
```
### For SparkR
```
# Converts a map into a JSON object
df2 <- sql("SELECT map('name', 'Bob')) as people")
df2 <- mutate(df2, people_json = to_json(df2$people))
# Converts an array of maps into a JSON array
df2 <- sql("SELECT array(map('name', 'Bob'), map('name', 'Alice')) as people")
df2 <- mutate(df2, people_json = to_json(df2$people))
```
## How was this patch tested?
Add unit test cases.

cc viirya HyukjinKwon

Author: goldmedal <liugs963@gmail.com>

Closes #19223 from goldmedal/SPARK-21513-fp-PySaprkAndSparkR.
2017-09-15 11:53:10 +09:00
Felix Cheung 36b48ee6e9 [SPARK-21801][SPARKR][TEST] set random seed for predictable test
## What changes were proposed in this pull request?

set.seed() before running tests

## How was this patch tested?

jenkins, appveyor

Author: Felix Cheung <felixcheung_m@hotmail.com>

Closes #19111 from felixcheung/rranseed.
2017-09-06 09:53:55 -07:00
hyukjinkwon 07fd68a29f [SPARK-21897][PYTHON][R] Add unionByName API to DataFrame in Python and R
## What changes were proposed in this pull request?

This PR proposes to add a wrapper for `unionByName` API to R and Python as well.

**Python**

```python
df1 = spark.createDataFrame([[1, 2, 3]], ["col0", "col1", "col2"])
df2 = spark.createDataFrame([[4, 5, 6]], ["col1", "col2", "col0"])
df1.unionByName(df2).show()
```

```
+----+----+----+
|col0|col1|col3|
+----+----+----+
|   1|   2|   3|
|   6|   4|   5|
+----+----+----+
```

**R**

```R
df1 <- select(createDataFrame(mtcars), "carb", "am", "gear")
df2 <- select(createDataFrame(mtcars), "am", "gear", "carb")
head(unionByName(limit(df1, 2), limit(df2, 2)))
```

```
  carb am gear
1    4  1    4
2    4  1    4
3    4  1    4
4    4  1    4
```

## How was this patch tested?

Doctests for Python and unit test added in `test_sparkSQL.R` for R.

Author: hyukjinkwon <gurwls223@gmail.com>

Closes #19105 from HyukjinKwon/unionByName-r-python.
2017-09-03 21:03:21 +09:00
Felix Cheung 6077e3ef3c [SPARK-21801][SPARKR][TEST] unit test randomly fail with randomforest
## What changes were proposed in this pull request?

fix the random seed to eliminate variability

## How was this patch tested?

jenkins, appveyor, lots more jenkins

Author: Felix Cheung <felixcheung_m@hotmail.com>

Closes #19018 from felixcheung/rrftest.
2017-08-29 10:09:41 -07:00
Felix Cheung 43cbfad999 [SPARK-21805][SPARKR] Disable R vignettes code on Windows
## What changes were proposed in this pull request?

Code in vignettes requires winutils on windows to run, when publishing to CRAN or building from source, winutils might not be available, so it's better to disable code run (so resulting vigenttes will not have output from code, but text is still there and code is still there)

fix * checking re-building of vignette outputs ... WARNING
and
> %LOCALAPPDATA% not found. Please define the environment variable or restart and enter an installation path in localDir.

## How was this patch tested?

jenkins, appveyor, r-hub

before: https://artifacts.r-hub.io/SparkR_2.2.0.tar.gz-49cecef3bb09db1db130db31604e0293/SparkR.Rcheck/00check.log
after: https://artifacts.r-hub.io/SparkR_2.2.0.tar.gz-86a066c7576f46794930ad114e5cff7c/SparkR.Rcheck/00check.log

Author: Felix Cheung <felixcheung_m@hotmail.com>

Closes #19016 from felixcheung/rvigwind.
2017-08-23 21:35:17 -07:00
Andrew Ray 5c9b301727 [SPARK-21584][SQL][SPARKR] Update R method for summary to call new implementation
## What changes were proposed in this pull request?

SPARK-21100 introduced a new `summary` method to the Scala/Java Dataset API that included  expanded statistics (vs `describe`) and control over which statistics to compute. Currently in the R API `summary` acts as an alias for `describe`. This patch updates the R API to call the new `summary` method in the JVM that includes additional statistics and ability to select which to compute.

This does not break the current interface as the present `summary` method does not take additional arguments like `describe` and the output was never meant to be used programmatically.

## How was this patch tested?

Modified and additional unit tests.

Author: Andrew Ray <ray.andrew@gmail.com>

Closes #18786 from aray/summary-r.
2017-08-21 23:08:27 -07:00
actuaryzhang 55aa4da285 [SPARK-21622][ML][SPARKR] Support offset in SparkR GLM
## What changes were proposed in this pull request?
Support offset in SparkR GLM #16699

Author: actuaryzhang <actuaryzhang10@gmail.com>

Closes #18831 from actuaryzhang/sparkROffset.
2017-08-06 15:14:12 -07:00
hyukjinkwon 97ba491836 [SPARK-21602][R] Add map_keys and map_values functions to R
## What changes were proposed in this pull request?

This PR adds `map_values` and `map_keys` to R API.

```r
> df <- createDataFrame(cbind(model = rownames(mtcars), mtcars))
> tmp <- mutate(df, v = create_map(df$model, df$cyl))
> head(select(tmp, map_keys(tmp$v)))
```
```
        map_keys(v)
1         Mazda RX4
2     Mazda RX4 Wag
3        Datsun 710
4    Hornet 4 Drive
5 Hornet Sportabout
6           Valiant
```
```r
> head(select(tmp, map_values(tmp$v)))
```
```
  map_values(v)
1             6
2             6
3             4
4             6
5             8
6             6
```

## How was this patch tested?

Manual tests and unit tests in `R/pkg/tests/fulltests/test_sparkSQL.R`

Author: hyukjinkwon <gurwls223@gmail.com>

Closes #18809 from HyukjinKwon/map-keys-values-r.
2017-08-03 23:00:00 +09:00
wangmiao1981 9570e81aa9 [SPARK-21381][SPARKR] SparkR: pass on setHandleInvalid for classification algorithms
## What changes were proposed in this pull request?

SPARK-20307 Added handleInvalid option to RFormula for tree-based classification algorithms. We should add this parameter for other classification algorithms in SparkR.

This is a followup PR for SPARK-20307.

## How was this patch tested?

New Unit tests are added.

Author: wangmiao1981 <wm624@hotmail.com>

Closes #18605 from wangmiao1981/class.
2017-07-31 20:37:06 -07:00
Yanbo Liang 69e5282d3c [SPARK-20307][ML][SPARKR][FOLLOW-UP] RFormula should handle invalid for both features and label column.
## What changes were proposed in this pull request?
```RFormula``` should handle invalid for both features and label column.
#18496 only handle invalid values in features column. This PR add handling invalid values for label column and test cases.

## How was this patch tested?
Add test cases.

Author: Yanbo Liang <ybliang8@gmail.com>

Closes #18613 from yanboliang/spark-20307.
2017-07-15 20:56:38 +08:00
Sean Owen 74ac1fb081 [SPARK-21267][DOCS][MINOR] Follow up to avoid referencing programming-guide redirector
## What changes were proposed in this pull request?

Update internal references from programming-guide to rdd-programming-guide

See 5ddf243fd8 and https://github.com/apache/spark/pull/18485#issuecomment-314789751

Let's keep the redirector even if it's problematic to build, but not rely on it internally.

## How was this patch tested?

(Doc build)

Author: Sean Owen <sowen@cloudera.com>

Closes #18625 from srowen/SPARK-21267.2.
2017-07-15 09:21:29 +01:00
Sean Owen 425c4ada4c [SPARK-19810][BUILD][CORE] Remove support for Scala 2.10
## What changes were proposed in this pull request?

- Remove Scala 2.10 build profiles and support
- Replace some 2.10 support in scripts with commented placeholders for 2.12 later
- Remove deprecated API calls from 2.10 support
- Remove usages of deprecated context bounds where possible
- Remove Scala 2.10 workarounds like ScalaReflectionLock
- Other minor Scala warning fixes

## How was this patch tested?

Existing tests

Author: Sean Owen <sowen@cloudera.com>

Closes #17150 from srowen/SPARK-19810.
2017-07-13 17:06:24 +08:00
hyukjinkwon 2bfd5accdc [SPARK-21266][R][PYTHON] Support schema a DDL-formatted string in dapply/gapply/from_json
## What changes were proposed in this pull request?

This PR supports schema in a DDL formatted string for `from_json` in R/Python and `dapply` and `gapply` in R, which are commonly used and/or consistent with Scala APIs.

Additionally, this PR exposes `structType` in R to allow working around in other possible corner cases.

**Python**

`from_json`

```python
from pyspark.sql.functions import from_json

data = [(1, '''{"a": 1}''')]
df = spark.createDataFrame(data, ("key", "value"))
df.select(from_json(df.value, "a INT").alias("json")).show()
```

**R**

`from_json`

```R
df <- sql("SELECT named_struct('name', 'Bob') as people")
df <- mutate(df, people_json = to_json(df$people))
head(select(df, from_json(df$people_json, "name STRING")))
```

`structType.character`

```R
structType("a STRING, b INT")
```

`dapply`

```R
dapply(createDataFrame(list(list(1.0)), "a"), function(x) {x}, "a DOUBLE")
```

`gapply`

```R
gapply(createDataFrame(list(list(1.0)), "a"), "a", function(key, x) { x }, "a DOUBLE")
```

## How was this patch tested?

Doc tests for `from_json` in Python and unit tests `test_sparkSQL.R` in R.

Author: hyukjinkwon <gurwls223@gmail.com>

Closes #18498 from HyukjinKwon/SPARK-21266.
2017-07-10 10:40:03 -07:00
hyukjinkwon 08e0d033b4 [SPARK-21093][R] Terminate R's worker processes in the parent of R's daemon to prevent a leak
## What changes were proposed in this pull request?

This is a retry for #18320. This PR was reverted due to unexpected test failures with -10 error code.

I was unable to reproduce in MacOS, CentOS and Ubuntu but only in Jenkins. So, the tests proceeded to verify this and revert the past try here - https://github.com/apache/spark/pull/18456

This new approach was tested in https://github.com/apache/spark/pull/18463.

**Test results**:

- With the part of suspicious change in the past try (466325d3fd)

  Tests ran 4 times and 2 times passed and 2 time failed.

- Without the part of suspicious change in the past try (466325d3fd)

  Tests ran 5 times and they all passed.

- With this new approach (0a7589c09f)

  Tests ran 5 times and they all passed.

It looks the cause is as below (see 466325d3fd):

```diff
+ exitCode <- 1
...
+   data <- parallel:::readChild(child)
+   if (is.raw(data)) {
+     if (unserialize(data) == exitCode) {
      ...
+     }
+   }

...

- parallel:::mcexit(0L)
+ parallel:::mcexit(0L, send = exitCode)
```

Two possibilities I think

 - `parallel:::mcexit(.. , send = exitCode)`

   https://stat.ethz.ch/R-manual/R-devel/library/parallel/html/mcfork.html

   > It sends send to the master (unless NULL) and then shuts down the child process.

   However, it looks possible that the parent attemps to terminate the child right after getting our custom exit code. So, the child gets terminated between "send" and "shuts down", failing to exit properly.

 - A bug between `parallel:::mcexit(..., send = ...)` and `parallel:::readChild`.

**Proposal**:

To resolve this, I simply decided to avoid both possibilities with this new approach here (9ff89a7859). To support this idea, I explained with some quotation of the documentation as below:

https://stat.ethz.ch/R-manual/R-devel/library/parallel/html/mcfork.html

> `readChild` and `readChildren` return a raw vector with a "pid" attribute if data were available, an integer vector of length one with the process ID if a child terminated or `NULL` if the child no longer exists (no children at all for `readChildren`).

`readChild` returns "an integer vector of length one with the process ID if a child terminated" so we can check if it is `integer` and the same selected "process ID". I believe this makes sure that the children are exited.

In case that children happen to send any data manually to parent (which is why we introduced the suspicious part of the change (466325d3fd)), this should be raw bytes and will be discarded (and then will try to read the next and check if it is `integer` in the next loop).

## How was this patch tested?

Manual tests and Jenkins tests.

Author: hyukjinkwon <gurwls223@gmail.com>

Closes #18465 from HyukjinKwon/SPARK-21093-retry-1.
2017-07-08 14:24:37 -07:00
Michael Patterson f5f02d213d [SPARK-20456][DOCS] Add examples for functions collection for pyspark
## What changes were proposed in this pull request?

This adds documentation to many functions in pyspark.sql.functions.py:
`upper`, `lower`, `reverse`, `unix_timestamp`, `from_unixtime`, `rand`, `randn`, `collect_list`, `collect_set`, `lit`
Add units to the trigonometry functions.
Renames columns in datetime examples to be more informative.
Adds links between some functions.

## How was this patch tested?

`./dev/lint-python`
`python python/pyspark/sql/functions.py`
`./python/run-tests.py --module pyspark-sql`

Author: Michael Patterson <map222@gmail.com>

Closes #17865 from map222/spark-20456.
2017-07-07 23:59:34 -07:00
wangmiao1981 a7b46c627b [SPARK-20307][SPARKR] SparkR: pass on setHandleInvalid to spark.mllib functions that use StringIndexer
## What changes were proposed in this pull request?

For randomForest classifier, if test data contains unseen labels, it will throw an error. The StringIndexer already has the handleInvalid logic. The patch add a new method to set the underlying StringIndexer handleInvalid logic.

This patch should also apply to other classifiers. This PR focuses on the main logic and randomForest classifier. I will do follow-up PR for other classifiers.

## How was this patch tested?

Add a new unit test based on the error case in the JIRA.

Author: wangmiao1981 <wm624@hotmail.com>

Closes #18496 from wangmiao1981/handle.
2017-07-07 23:51:32 -07:00
actuaryzhang e9a93f8140 [SPARK-20889][SPARKR][FOLLOWUP] Clean up grouped doc for column methods
## What changes were proposed in this pull request?
Add doc for methods that were left out, and fix various style and consistency issues.

Author: actuaryzhang <actuaryzhang10@gmail.com>

Closes #18493 from actuaryzhang/sparkRDocCleanup.
2017-07-04 21:05:05 -07:00
actuaryzhang cec3921504 [SPARK-20889][SPARKR] Grouped documentation for WINDOW column methods
## What changes were proposed in this pull request?

Grouped documentation for column window methods.

Author: actuaryzhang <actuaryzhang10@gmail.com>

Closes #18481 from actuaryzhang/sparkRDocWindow.
2017-07-04 12:18:51 -07:00
actuaryzhang 52981715bb [SPARK-20889][SPARKR] Grouped documentation for COLLECTION column methods
## What changes were proposed in this pull request?

Grouped documentation for column collection methods.

Author: actuaryzhang <actuaryzhang10@gmail.com>
Author: Wayne Zhang <actuaryzhang10@gmail.com>

Closes #18458 from actuaryzhang/sparkRDocCollection.
2017-06-29 23:00:50 -07:00
actuaryzhang fddb63f463 [SPARK-20889][SPARKR] Grouped documentation for MISC column methods
## What changes were proposed in this pull request?
Grouped documentation for column misc methods.

Author: actuaryzhang <actuaryzhang10@gmail.com>
Author: Wayne Zhang <actuaryzhang10@gmail.com>

Closes #18448 from actuaryzhang/sparkRDocMisc.
2017-06-29 21:35:01 -07:00
actuaryzhang a2d5623548 [SPARK-20889][SPARKR] Grouped documentation for NONAGGREGATE column methods
## What changes were proposed in this pull request?

Grouped documentation for nonaggregate column methods.

Author: actuaryzhang <actuaryzhang10@gmail.com>
Author: Wayne Zhang <actuaryzhang10@gmail.com>

Closes #18422 from actuaryzhang/sparkRDocNonAgg.
2017-06-29 01:23:13 -07:00
Felix Cheung fc92d25f2a Revert "[SPARK-21094][R] Terminate R's worker processes in the parent of R's daemon to prevent a leak"
This reverts commit 6b3d02285e.
2017-06-28 20:06:29 -07:00
hyukjinkwon db44f5f3e8 [SPARK-21224][R] Specify a schema by using a DDL-formatted string when reading in R
## What changes were proposed in this pull request?

This PR proposes to support a DDL-formetted string as schema as below:

```r
mockLines <- c("{\"name\":\"Michael\"}",
               "{\"name\":\"Andy\", \"age\":30}",
               "{\"name\":\"Justin\", \"age\":19}")
jsonPath <- tempfile(pattern = "sparkr-test", fileext = ".tmp")
writeLines(mockLines, jsonPath)
df <- read.df(jsonPath, "json", "name STRING, age DOUBLE")
collect(df)
```

## How was this patch tested?

Tests added in `test_streaming.R` and `test_sparkSQL.R` and manual tests.

Author: hyukjinkwon <gurwls223@gmail.com>

Closes #18431 from HyukjinKwon/r-ddl-schema.
2017-06-28 19:36:00 -07:00
actuaryzhang 376d90d556 [SPARK-20889][SPARKR] Grouped documentation for STRING column methods
## What changes were proposed in this pull request?

Grouped documentation for string column methods.

Author: actuaryzhang <actuaryzhang10@gmail.com>
Author: Wayne Zhang <actuaryzhang10@gmail.com>

Closes #18366 from actuaryzhang/sparkRDocString.
2017-06-28 19:31:54 -07:00
actuaryzhang e793bf248b [SPARK-20889][SPARKR] Grouped documentation for MATH column methods
## What changes were proposed in this pull request?

Grouped documentation for math column methods.

Author: actuaryzhang <actuaryzhang10@gmail.com>
Author: Wayne Zhang <actuaryzhang10@gmail.com>

Closes #18371 from actuaryzhang/sparkRDocMath.
2017-06-27 23:15:45 -07:00
hyukjinkwon 6b3d02285e [SPARK-21093][R] Terminate R's worker processes in the parent of R's daemon to prevent a leak
## What changes were proposed in this pull request?

`mcfork` in R looks opening a pipe ahead but the existing logic does not properly close it when it is executed hot. This leads to the failure of more forking due to the limit for number of files open.

This hot execution looks particularly for `gapply`/`gapplyCollect`. For unknown reason, this happens more easily in CentOS and could be reproduced in Mac too.

All the details are described in https://issues.apache.org/jira/browse/SPARK-21093

This PR proposes simply to terminate R's worker processes in the parent of R's daemon to prevent a leak.

## How was this patch tested?

I ran the codes below on both CentOS and Mac with that configuration disabled/enabled.

```r
df <- createDataFrame(list(list(1L, 1, "1", 0.1)), c("a", "b", "c", "d"))
collect(gapply(df, "a", function(key, x) { x }, schema(df)))
collect(gapply(df, "a", function(key, x) { x }, schema(df)))
...  # 30 times
```

Also, now it passes R tests on CentOS as below:

```
SparkSQL functions: Spark package found in SPARK_HOME: .../spark
..............................................................................................................................................................
..............................................................................................................................................................
..............................................................................................................................................................
..............................................................................................................................................................
..............................................................................................................................................................
....................................................................................................................................
```

Author: hyukjinkwon <gurwls223@gmail.com>

Closes #18320 from HyukjinKwon/SPARK-21093.
2017-06-25 11:05:57 -07:00
hyukjinkwon 07479b3cfb [SPARK-21149][R] Add job description API for R
## What changes were proposed in this pull request?

Extend `setJobDescription` to SparkR API.

## How was this patch tested?

It looks difficult to add a test. Manually tested as below:

```r
df <- createDataFrame(iris)
count(df)
setJobDescription("This is an example job.")
count(df)
```

prints ...

![2017-06-22 12 05 49](https://user-images.githubusercontent.com/6477701/27415670-2a649936-5743-11e7-8e95-312f1cd103af.png)

Author: hyukjinkwon <gurwls223@gmail.com>

Closes #18382 from HyukjinKwon/SPARK-21149.
2017-06-23 09:59:24 -07:00
actuaryzhang 19331b8e44 [SPARK-20889][SPARKR] Grouped documentation for DATETIME column methods
## What changes were proposed in this pull request?
Grouped documentation for datetime column methods.

Author: actuaryzhang <actuaryzhang10@gmail.com>

Closes #18114 from actuaryzhang/sparkRDocDate.
2017-06-22 10:16:51 -07:00
wangmiao1981 53543374ce [SPARK-20906][SPARKR] Constrained Logistic Regression for SparkR
## What changes were proposed in this pull request?

PR https://github.com/apache/spark/pull/17715 Added Constrained Logistic Regression for ML. We should add it to SparkR.

## How was this patch tested?

Add new unit tests.

Author: wangmiao1981 <wm624@hotmail.com>

Closes #18128 from wangmiao1981/test.
2017-06-21 20:42:45 -07:00
actuaryzhang ad459cfb1d [SPARK-20917][ML][SPARKR] SparkR supports string encoding consistent with R
## What changes were proposed in this pull request?

Add `stringIndexerOrderType` to `spark.glm` and `spark.survreg` to support string encoding that is consistent with default R.

## How was this patch tested?
new tests

Author: actuaryzhang <actuaryzhang10@gmail.com>

Closes #18140 from actuaryzhang/sparkRFormula.
2017-06-21 10:35:16 -07:00
Joseph K. Bradley cc67bd5732 [SPARK-20929][ML] LinearSVC should use its own threshold param
## What changes were proposed in this pull request?

LinearSVC should use its own threshold param, rather than the shared one, since it applies to rawPrediction instead of probability.  This PR changes the param in the Scala, Python and R APIs.

## How was this patch tested?

New unit test to make sure the threshold can be set to any Double value.

Author: Joseph K. Bradley <joseph@databricks.com>

Closes #18151 from jkbradley/ml-2.2-linearsvc-cleanup.
2017-06-19 23:04:17 -07:00
actuaryzhang 8965fe764a [SPARK-20889][SPARKR] Grouped documentation for AGGREGATE column methods
## What changes were proposed in this pull request?
Grouped documentation for the aggregate functions for Column.

Author: actuaryzhang <actuaryzhang10@gmail.com>

Closes #18025 from actuaryzhang/sparkRDoc4.
2017-06-19 19:41:24 -07:00
hyukjinkwon 9a145fd796 [MINOR] Bump SparkR and PySpark version to 2.3.0.
## What changes were proposed in this pull request?

#17753 bumps master branch version to 2.3.0-SNAPSHOT, but it seems SparkR and PySpark version were omitted.

ditto of https://github.com/apache/spark/pull/16488 / https://github.com/apache/spark/pull/17523

## How was this patch tested?

N/A

Author: hyukjinkwon <gurwls223@gmail.com>

Closes #18341 from HyukjinKwon/r-version.
2017-06-19 11:13:03 +01:00
actuaryzhang 110ce1f27b [SPARK-20892][SPARKR] Add SQL trunc function to SparkR
## What changes were proposed in this pull request?

Add SQL trunc function

## How was this patch tested?
standard test

Author: actuaryzhang <actuaryzhang10@gmail.com>

Closes #18291 from actuaryzhang/sparkRTrunc2.
2017-06-18 18:00:27 -07:00
hyukjinkwon 05f83c532a [SPARK-21128][R] Remove both "spark-warehouse" and "metastore_db" before listing files in R tests
## What changes were proposed in this pull request?

This PR proposes to list the files in test _after_ removing both "spark-warehouse" and "metastore_db" so that the next run of R tests pass fine. This is sometimes a bit annoying.

## How was this patch tested?

Manually running multiple times R tests via `./R/run-tests.sh`.

**Before**

Second run:

```
SparkSQL functions: Spark package found in SPARK_HOME: .../spark
...............................................................................................................................................................
...............................................................................................................................................................
...............................................................................................................................................................
...............................................................................................................................................................
...............................................................................................................................................................
....................................................................................................1234.......................

Failed -------------------------------------------------------------------------
1. Failure: No extra files are created in SPARK_HOME by starting session and making calls (test_sparkSQL.R#3384)
length(list1) not equal to length(list2).
1/1 mismatches
[1] 25 - 23 == 2

2. Failure: No extra files are created in SPARK_HOME by starting session and making calls (test_sparkSQL.R#3384)
sort(list1, na.last = TRUE) not equal to sort(list2, na.last = TRUE).
10/25 mismatches
x[16]: "metastore_db"
y[16]: "pkg"

x[17]: "pkg"
y[17]: "R"

x[18]: "R"
y[18]: "README.md"

x[19]: "README.md"
y[19]: "run-tests.sh"

x[20]: "run-tests.sh"
y[20]: "SparkR_2.2.0.tar.gz"

x[21]: "metastore_db"
y[21]: "pkg"

x[22]: "pkg"
y[22]: "R"

x[23]: "R"
y[23]: "README.md"

x[24]: "README.md"
y[24]: "run-tests.sh"

x[25]: "run-tests.sh"
y[25]: "SparkR_2.2.0.tar.gz"

3. Failure: No extra files are created in SPARK_HOME by starting session and making calls (test_sparkSQL.R#3388)
length(list1) not equal to length(list2).
1/1 mismatches
[1] 25 - 23 == 2

4. Failure: No extra files are created in SPARK_HOME by starting session and making calls (test_sparkSQL.R#3388)
sort(list1, na.last = TRUE) not equal to sort(list2, na.last = TRUE).
10/25 mismatches
x[16]: "metastore_db"
y[16]: "pkg"

x[17]: "pkg"
y[17]: "R"

x[18]: "R"
y[18]: "README.md"

x[19]: "README.md"
y[19]: "run-tests.sh"

x[20]: "run-tests.sh"
y[20]: "SparkR_2.2.0.tar.gz"

x[21]: "metastore_db"
y[21]: "pkg"

x[22]: "pkg"
y[22]: "R"

x[23]: "R"
y[23]: "README.md"

x[24]: "README.md"
y[24]: "run-tests.sh"

x[25]: "run-tests.sh"
y[25]: "SparkR_2.2.0.tar.gz"

DONE ===========================================================================
```

**After**

Second run:

```
SparkSQL functions: Spark package found in SPARK_HOME: .../spark
...............................................................................................................................................................
...............................................................................................................................................................
...............................................................................................................................................................
...............................................................................................................................................................
...............................................................................................................................................................
...............................................................................................................................
```

Author: hyukjinkwon <gurwls223@gmail.com>

Closes #18335 from HyukjinKwon/SPARK-21128.
2017-06-18 11:26:27 -07:00
Yuming Wang 45824fb608 [MINOR][DOCS] Improve Running R Tests docs
## What changes were proposed in this pull request?

Update Running R Tests dependence packages to:
```bash
R -e "install.packages(c('knitr', 'rmarkdown', 'testthat', 'e1071', 'survival'), repos='http://cran.us.r-project.org')"
```

## How was this patch tested?
manual tests

Author: Yuming Wang <wgyumg@gmail.com>

Closes #18271 from wangyum/building-spark.
2017-06-16 11:03:54 +01:00
Xiao Li 2051428173 [SPARK-20980][SQL] Rename wholeFile to multiLine for both CSV and JSON
### What changes were proposed in this pull request?
The current option name `wholeFile` is misleading for CSV users. Currently, it is not representing a record per file. Actually, one file could have multiple records. Thus, we should rename it. Now, the proposal is `multiLine`.

### How was this patch tested?
N/A

Author: Xiao Li <gatorsmile@gmail.com>

Closes #18202 from gatorsmile/renameCVSOption.
2017-06-15 13:18:19 +08:00
Felix Cheung 9f4ff95524 [SPARK-20877][SPARKR][FOLLOWUP] clean up after test move
## What changes were proposed in this pull request?

clean up after big test move

## How was this patch tested?

unit tests, jenkins

Author: Felix Cheung <felixcheung_m@hotmail.com>

Closes #18267 from felixcheung/rtestset2.
2017-06-11 03:00:44 -07:00
Felix Cheung dc4c351837 [SPARK-20877][SPARKR] refactor tests to basic tests only for CRAN
## What changes were proposed in this pull request?

Move all existing tests to non-installed directory so that it will never run by installing SparkR package

For a follow-up PR:
- remove all skip_on_cran() calls in tests
- clean up test timer
- improve or change basic tests that do run on CRAN (if anyone has suggestion)

It looks like `R CMD build pkg` will still put pkg\tests (ie. the full tests) into the source package but `R CMD INSTALL` on such source package does not install these tests (and so `R CMD check` does not run them)

## How was this patch tested?

- [x] unit tests, Jenkins
- [x] AppVeyor
- [x] make a source package, install it, `R CMD check` it - verify the full tests are not installed or run

Author: Felix Cheung <felixcheung_m@hotmail.com>

Closes #18264 from felixcheung/rtestset.
2017-06-11 00:00:33 -07:00
Reynold Xin b78e3849b2 [SPARK-21042][SQL] Document Dataset.union is resolution by position
## What changes were proposed in this pull request?
Document Dataset.union is resolution by position, not by name, since this has been a confusing point for a lot of users.

## How was this patch tested?
N/A - doc only change.

Author: Reynold Xin <rxin@databricks.com>

Closes #18256 from rxin/SPARK-21042.
2017-06-09 18:29:33 -07:00
Felix Cheung 382fefd187 [SPARK-20877][SPARKR][WIP] add timestamps to test runs
## What changes were proposed in this pull request?

to investigate how long they run

## How was this patch tested?

Jenkins, AppVeyor

Author: Felix Cheung <felixcheung_m@hotmail.com>

Closes #18104 from felixcheung/rtimetest.
2017-05-30 22:33:29 -07:00
Zheng RuiFeng a97c497045 [SPARK-20849][DOC][SPARKR] Document R DecisionTree
## What changes were proposed in this pull request?
1, add an example for sparkr `decisionTree`
2, document it in user guide

## How was this patch tested?
local submit

Author: Zheng RuiFeng <ruifengz@foxmail.com>

Closes #18067 from zhengruifeng/dt_example.
2017-05-25 23:00:50 -07:00
Yanbo Liang ad09e4ca04 [MINOR][SPARKR][ML] Joint coefficients with intercept for SparkR linear SVM summary.
## What changes were proposed in this pull request?
Joint coefficients with intercept for SparkR linear SVM summary.

## How was this patch tested?
Existing tests.

Author: Yanbo Liang <ybliang8@gmail.com>

Closes #18035 from yanboliang/svm-r.
2017-05-23 16:16:14 +08:00