Commit graph

19918 commits

Author SHA1 Message Date
wangzhenhua b803b66a81 [SPARK-21180][SQL] Remove conf from stats functions since now we have conf in LogicalPlan
## What changes were proposed in this pull request?

After wiring `SQLConf` in logical plan ([PR 18299](https://github.com/apache/spark/pull/18299)), we can remove the need of passing `conf` into `def stats` and `def computeStats`.

## How was this patch tested?

Covered by existing tests, plus some modified existing tests.

Author: wangzhenhua <wangzhenhua@huawei.com>
Author: Zhenhua Wang <wzh_zju@163.com>

Closes #18391 from wzhfy/removeConf.
2017-06-23 10:33:53 -07:00
hyukjinkwon 07479b3cfb [SPARK-21149][R] Add job description API for R
## What changes were proposed in this pull request?

Extend `setJobDescription` to SparkR API.

## How was this patch tested?

It looks difficult to add a test. Manually tested as below:

```r
df <- createDataFrame(iris)
count(df)
setJobDescription("This is an example job.")
count(df)
```

prints ...

![2017-06-22 12 05 49](https://user-images.githubusercontent.com/6477701/27415670-2a649936-5743-11e7-8e95-312f1cd103af.png)

Author: hyukjinkwon <gurwls223@gmail.com>

Closes #18382 from HyukjinKwon/SPARK-21149.
2017-06-23 09:59:24 -07:00
Takeshi Yamamuro f3dea60793 [SPARK-21144][SQL] Print a warning if the data schema and partition schema have the duplicate columns
## What changes were proposed in this pull request?
The current master outputs unexpected results when the data schema and partition schema have the duplicate columns:
```
withTempPath { dir =>
  val basePath = dir.getCanonicalPath
  spark.range(0, 3).toDF("foo").write.parquet(new Path(basePath, "foo=1").toString)
  spark.range(0, 3).toDF("foo").write.parquet(new Path(basePath, "foo=a").toString)
  spark.read.parquet(basePath).show()
}

+---+
|foo|
+---+
|  1|
|  1|
|  a|
|  a|
|  1|
|  a|
+---+
```
This patch added code to print a warning when the duplication found.

## How was this patch tested?
Manually checked.

Author: Takeshi Yamamuro <yamamuro@apache.org>

Closes #18375 from maropu/SPARK-21144-3.
2017-06-23 09:28:02 -07:00
hyukjinkwon 5dca10b8fd [SPARK-21193][PYTHON] Specify Pandas version in setup.py
## What changes were proposed in this pull request?

It looks we missed specifying the Pandas version. This PR proposes to fix it. For the current state, it should be Pandas 0.13.0 given my test. This PR propose to fix it as 0.13.0.

Running the codes below:

```python
from pyspark.sql.types import *

schema = StructType().add("a", IntegerType()).add("b", StringType())\
                     .add("c", BooleanType()).add("d", FloatType())
data = [
    (1, "foo", True, 3.0,), (2, "foo", True, 5.0),
    (3, "bar", False, -1.0), (4, "bar", False, 6.0),
]
spark.createDataFrame(data, schema).toPandas().dtypes
```

prints ...

**With Pandas 0.13.0** - released, 2014-01

```
a      int32
b     object
c       bool
d    float32
dtype: object
```

**With Pandas 0.12.0** -  - released, 2013-06

```
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File ".../spark/python/pyspark/sql/dataframe.py", line 1734, in toPandas
    pdf[f] = pdf[f].astype(t, copy=False)
TypeError: astype() got an unexpected keyword argument 'copy'
```

without `copy`

```
a      int32
b     object
c       bool
d    float32
dtype: object
```

**With Pandas 0.11.0** - released, 2013-03

```
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File ".../spark/python/pyspark/sql/dataframe.py", line 1734, in toPandas
    pdf[f] = pdf[f].astype(t, copy=False)
TypeError: astype() got an unexpected keyword argument 'copy'
```

without `copy`

```
a      int32
b     object
c       bool
d    float32
dtype: object
```

**With Pandas 0.10.0** -  released, 2012-12

```
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File ".../spark/python/pyspark/sql/dataframe.py", line 1734, in toPandas
    pdf[f] = pdf[f].astype(t, copy=False)
TypeError: astype() got an unexpected keyword argument 'copy'
```

without `copy`

```
a      int64  # <- this should be 'int32'
b     object
c       bool
d    float64  # <- this should be 'float32'
```

## How was this patch tested?

Manually tested with Pandas from 0.10.0 to 0.13.0.

Author: hyukjinkwon <gurwls223@gmail.com>

Closes #18403 from HyukjinKwon/SPARK-21193.
2017-06-23 21:51:55 +08:00
10129659 acd208ee50 [SPARK-21115][CORE] If the cores left is less than the coresPerExecutor,the cores left will not be allocated, so it should not to check in every schedule
## What changes were proposed in this pull request?
If we start an app with the param --total-executor-cores=4 and spark.executor.cores=3, the cores left is always 1, so it will try to allocate executors in the function org.apache.spark.deploy.master.startExecutorsOnWorkers in every schedule.
Another question is, is it will be better to allocate another executor with 1 core for the cores left.

## How was this patch tested?
unit test

Author: 10129659 <chen.yanshan@zte.com.cn>

Closes #18322 from eatoncys/leftcores.
2017-06-23 20:53:26 +08:00
jinxing 153dd49b74 [SPARK-21047] Add test suites for complicated cases in ColumnarBatchSuite
## What changes were proposed in this pull request?
Current ColumnarBatchSuite has very simple test cases for `Array` and `Struct`. This pr wants to add  some test suites for complicated cases in ColumnVector.

Author: jinxing <jinxing6042@126.com>

Closes #18327 from jinxing64/SPARK-21047.
2017-06-23 20:41:17 +08:00
Tathagata Das fe24634d14 [SPARK-21145][SS] Added StateStoreProviderId with queryRunId to reload StateStoreProviders when query is restarted
## What changes were proposed in this pull request?
StateStoreProvider instances are loaded on-demand in a executor when a query is started. When a query is restarted, the loaded provider instance will get reused. Now, there is a non-trivial chance, that the task of the previous query run is still running, while the tasks of the restarted run has started. So for a stateful partition, there may be two concurrent tasks related to the same stateful partition, and there for using the same provider instance. This can lead to inconsistent results and possibly random failures, as state store implementations are not designed to be thread-safe.

To fix this, I have introduced a `StateStoreProviderId`, that unique identifies a provider loaded in an executor. It has the query run id in it, thus making sure that restarted queries will force the executor to load a new provider instance, thus avoiding two concurrent tasks (from two different runs) from reusing the same provider instance.

Additional minor bug fixes
- All state stores related to query run is marked as deactivated in the `StateStoreCoordinator` so that the executors can unload them and clear resources.
- Moved the code that determined the checkpoint directory of a state store from implementation-specific code (`HDFSBackedStateStoreProvider`) to non-specific code (StateStoreId), so that implementation do not accidentally get it wrong.
  - Also added store name to the path, to support multiple stores per sql operator partition.

*Note:* This change does not address the scenario where two tasks of the same run (e.g. speculative tasks) are concurrently running in the same executor. The chance of this very small, because ideally speculative tasks should never run in the same executor.

## How was this patch tested?
Existing unit tests + new unit test.

Author: Tathagata Das <tathagata.das1565@gmail.com>

Closes #18355 from tdas/SPARK-21145.
2017-06-23 00:43:21 -07:00
Wang Gengliang b8a743b6a5 [SPARK-21174][SQL] Validate sampling fraction in logical operator level
## What changes were proposed in this pull request?

Currently the validation of sampling fraction in dataset is incomplete.
As an improvement, validate sampling fraction in logical operator level:
1) if with replacement: fraction should be nonnegative
2) else: fraction should be on interval [0, 1]
Also add test cases for the validation.

## How was this patch tested?
integration tests

gatorsmile cloud-fan
Please review http://spark.apache.org/contributing.html before opening a pull request.

Author: Wang Gengliang <ltnwgl@gmail.com>

Closes #18387 from gengliangwang/sample_ratio_validate.
2017-06-23 09:27:35 +08:00
Thomas Graves 5b5a69bea9 [SPARK-20923] turn tracking of TaskMetrics._updatedBlockStatuses off
## What changes were proposed in this pull request?
Turn tracking of TaskMetrics._updatedBlockStatuses off by default. As far as I can see its not used by anything and it uses a lot of memory when caching and processing a lot of blocks.  In my case it was taking 5GB of a 10GB heap and I even went up to 50GB heap and the job still ran out of memory.  With this change in place the same job easily runs in less then 10GB of heap.

We leave the api there as well as a config to turn it back on just in case anyone is using it.  TaskMetrics is exposed via SparkListenerTaskEnd so if users are relying on it they can turn it back on.

## How was this patch tested?

Ran unit tests that were modified and manually tested on a couple of jobs (with and without caching).  Clicked through the UI and didn't see anything missing.
Ran my very large hive query job with 200,000 small tasks, 1000 executors, cached 6+TB of data this runs fine now whereas without this change it would go into full gcs and eventually die.

Author: Thomas Graves <tgraves@thirteenroutine.corp.gq1.yahoo.com>
Author: Tom Graves <tgraves@yahoo-inc.com>

Closes #18162 from tgravescs/SPARK-20923.
2017-06-23 09:19:02 +08:00
Bryan Cutler e44697606f [SPARK-13534][PYSPARK] Using Apache Arrow to increase performance of DataFrame.toPandas
## What changes were proposed in this pull request?
Integrate Apache Arrow with Spark to increase performance of `DataFrame.toPandas`.  This has been done by using Arrow to convert data partitions on the executor JVM to Arrow payload byte arrays where they are then served to the Python process.  The Python DataFrame can then collect the Arrow payloads where they are combined and converted to a Pandas DataFrame.  All non-complex data types are currently supported, otherwise an `UnsupportedOperation` exception is thrown.

Additions to Spark include a Scala package private method `Dataset.toArrowPayloadBytes` that will convert data partitions in the executor JVM to `ArrowPayload`s as byte arrays so they can be easily served.  A package private class/object `ArrowConverters` that provide data type mappings and conversion routines.  In Python, a public method `DataFrame.collectAsArrow` is added to collect Arrow payloads and an optional flag in `toPandas(useArrow=False)` to enable using Arrow (uses the old conversion by default).

## How was this patch tested?
Added a new test suite `ArrowConvertersSuite` that will run tests on conversion of Datasets to Arrow payloads for supported types.  The suite will generate a Dataset and matching Arrow JSON data, then the dataset is converted to an Arrow payload and finally validated against the JSON data.  This will ensure that the schema and data has been converted correctly.

Added PySpark tests to verify the `toPandas` method is producing equal DataFrames with and without pyarrow.  A roundtrip test to ensure the pandas DataFrame produced by pyspark is equal to a one made directly with pandas.

Author: Bryan Cutler <cutlerb@gmail.com>
Author: Li Jin <ice.xelloss@gmail.com>
Author: Li Jin <li.jin@twosigma.com>
Author: Wes McKinney <wes.mckinney@twosigma.com>

Closes #15821 from BryanCutler/wip-toPandas_with_arrow-SPARK-13534.
2017-06-23 09:01:13 +08:00
jinxing 58434acdd8 [SPARK-19937] Collect metrics for remote bytes read to disk during shuffle.
In current code(https://github.com/apache/spark/pull/16989), big blocks are shuffled to disk.
This pr proposes to collect metrics for remote bytes fetched to disk.

Author: jinxing <jinxing6042@126.com>

Closes #18249 from jinxing64/SPARK-19937.
2017-06-22 14:10:51 -07:00
Lubo Zhang e55a105ae0 [SPARK-20599][SS] ConsoleSink should work with (batch)
## What changes were proposed in this pull request?

Currently, if we read a batch and want to display it on the console sink, it will lead a runtime exception.

Changes:

- In this PR, we add a match rule to check whether it is a ConsoleSinkProvider, we will display the Dataset
 if using console format.

## How was this patch tested?

spark.read.schema().json(path).write.format("console").save

Author: Lubo Zhang <lubo.zhang@intel.com>
Author: lubozhan <lubo.zhang@intel.com>

Closes #18347 from lubozhan/dev.
2017-06-22 11:18:58 -07:00
actuaryzhang 19331b8e44 [SPARK-20889][SPARKR] Grouped documentation for DATETIME column methods
## What changes were proposed in this pull request?
Grouped documentation for datetime column methods.

Author: actuaryzhang <actuaryzhang10@gmail.com>

Closes #18114 from actuaryzhang/sparkRDocDate.
2017-06-22 10:16:51 -07:00
Xingbo Jiang 2dadea95c8 [SPARK-20832][CORE] Standalone master should explicitly inform drivers of worker deaths and invalidate external shuffle service outputs
## What changes were proposed in this pull request?

In standalone mode, master should explicitly inform each active driver of any worker deaths, so the invalid external shuffle service outputs on the lost host would be removed from the shuffle mapStatus, thus we can avoid future `FetchFailure`s.

## How was this patch tested?
Manually tested by the following steps:
1. Start a standalone Spark cluster with one driver node and two worker nodes;
2. Run a Job with ShuffleMapStage, ensure the outputs distribute on each worker;
3. Run another Job to make all executors exit, but the workers are all alive;
4. Kill one of the workers;
5. Run rdd.collect(), before this change, we should see `FetchFailure`s and failed Stages, while after the change, the job should complete without failure.

Before the change:
![image](https://user-images.githubusercontent.com/4784782/27335366-c251c3d6-55fe-11e7-99dd-d1fdcb429210.png)

After the change:
![image](https://user-images.githubusercontent.com/4784782/27335393-d1c71640-55fe-11e7-89ed-bd760f1f39af.png)

Author: Xingbo Jiang <xingbo.jiang@databricks.com>

Closes #18362 from jiangxb1987/removeWorker.
2017-06-22 20:48:12 +08:00
actuaryzhang 97b307c87c [SQL][DOC] Fix documentation of lpad
## What changes were proposed in this pull request?
Fix incomplete documentation for `lpad`.

Author: actuaryzhang <actuaryzhang10@gmail.com>

Closes #18367 from actuaryzhang/SQLDoc.
2017-06-22 10:12:33 +01:00
hyukjinkwon 67c75021c5 [SPARK-21163][SQL] DataFrame.toPandas should respect the data type
## What changes were proposed in this pull request?

Currently we convert a spark DataFrame to Pandas Dataframe by `pd.DataFrame.from_records`. It infers the data type from the data and doesn't respect the spark DataFrame Schema. This PR fixes it.

## How was this patch tested?

a new regression test

Author: hyukjinkwon <gurwls223@gmail.com>
Author: Wenchen Fan <wenchen@databricks.com>
Author: Wenchen Fan <cloud0fan@gmail.com>

Closes #18378 from cloud-fan/to_pandas.
2017-06-22 16:22:02 +08:00
Shixiong Zhu d66b143eec [SPARK-21167][SS] Decode the path generated by File sink to handle special characters
## What changes were proposed in this pull request?

Decode the path generated by File sink to handle special characters.

## How was this patch tested?

The added unit test.

Author: Shixiong Zhu <shixiong@databricks.com>

Closes #18381 from zsxwing/SPARK-21167.
2017-06-21 23:43:21 -07:00
wangmiao1981 53543374ce [SPARK-20906][SPARKR] Constrained Logistic Regression for SparkR
## What changes were proposed in this pull request?

PR https://github.com/apache/spark/pull/17715 Added Constrained Logistic Regression for ML. We should add it to SparkR.

## How was this patch tested?

Add new unit tests.

Author: wangmiao1981 <wm624@hotmail.com>

Closes #18128 from wangmiao1981/test.
2017-06-21 20:42:45 -07:00
zero323 215281d88e [SPARK-20830][PYSPARK][SQL] Add posexplode and posexplode_outer
## What changes were proposed in this pull request?

Add Python wrappers for `o.a.s.sql.functions.explode_outer` and `o.a.s.sql.functions.posexplode_outer`.

## How was this patch tested?

Unit tests, doctests.

Author: zero323 <zero323@users.noreply.github.com>

Closes #18049 from zero323/SPARK-20830.
2017-06-21 14:59:52 -07:00
sjarvie ba78514da7 [SPARK-21125][PYTHON] Extend setJobDescription to PySpark and JavaSpark APIs
## What changes were proposed in this pull request?

Extend setJobDescription to PySpark and JavaSpark APIs

SPARK-21125

## How was this patch tested?

Testing was done by running a local Spark shell on the built UI. I originally had added a unit test but the PySpark context cannot easily access the Scala Spark Context's private variable with the Job Description key so I omitted the test, due to the simplicity of this addition.

Also ran the existing tests.

# Misc

This contribution is my original work and that I license the work to the project under the project's open source license.

Author: sjarvie <sjarvie@uber.com>

Closes #18332 from sjarvie/add_python_set_job_description.
2017-06-21 10:51:45 -07:00
hyukjinkwon 7a00c658d4 [SPARK-21147][SS] Throws an analysis exception when a user-specified schema is given in socket/rate sources
## What changes were proposed in this pull request?

This PR proposes to throw an exception if a schema is provided by user to socket source as below:

**socket source**

```scala
import org.apache.spark.sql.types._

val userSpecifiedSchema = StructType(
  StructField("name", StringType) ::
  StructField("area", StringType) :: Nil)
val df = spark.readStream.format("socket").option("host", "localhost").option("port", 9999).schema(userSpecifiedSchema).load
df.printSchema
```

Before

```
root
 |-- value: string (nullable = true)
```

After

```
org.apache.spark.sql.AnalysisException: The socket source does not support a user-specified schema.;
  at org.apache.spark.sql.execution.streaming.TextSocketSourceProvider.sourceSchema(socket.scala:199)
  at org.apache.spark.sql.execution.datasources.DataSource.sourceSchema(DataSource.scala:192)
  at org.apache.spark.sql.execution.datasources.DataSource.sourceInfo$lzycompute(DataSource.scala:87)
  at org.apache.spark.sql.execution.datasources.DataSource.sourceInfo(DataSource.scala:87)
  at org.apache.spark.sql.execution.streaming.StreamingRelation$.apply(StreamingRelation.scala:30)
  at org.apache.spark.sql.streaming.DataStreamReader.load(DataStreamReader.scala:150)
  ... 50 elided
```

**rate source**

```scala
spark.readStream.format("rate").schema(spark.range(1).schema).load().printSchema()
```

Before

```
root
 |-- timestamp: timestamp (nullable = true)
 |-- value: long (nullable = true)`
```

After

```
org.apache.spark.sql.AnalysisException: The rate source does not support a user-specified schema.;
  at org.apache.spark.sql.execution.streaming.RateSourceProvider.sourceSchema(RateSourceProvider.scala:57)
  at org.apache.spark.sql.execution.datasources.DataSource.sourceSchema(DataSource.scala:192)
  at org.apache.spark.sql.execution.datasources.DataSource.sourceInfo$lzycompute(DataSource.scala:87)
  at org.apache.spark.sql.execution.datasources.DataSource.sourceInfo(DataSource.scala:87)
  at org.apache.spark.sql.execution.streaming.StreamingRelation$.apply(StreamingRelation.scala:30)
  at org.apache.spark.sql.streaming.DataStreamReader.load(DataStreamReader.scala:150)
  ... 48 elided
```

## How was this patch tested?

Unit test in `TextSocketStreamSuite` and `RateSourceSuite`.

Author: hyukjinkwon <gurwls223@gmail.com>

Closes #18365 from HyukjinKwon/SPARK-21147.
2017-06-21 10:51:17 -07:00
actuaryzhang ad459cfb1d [SPARK-20917][ML][SPARKR] SparkR supports string encoding consistent with R
## What changes were proposed in this pull request?

Add `stringIndexerOrderType` to `spark.glm` and `spark.survreg` to support string encoding that is consistent with default R.

## How was this patch tested?
new tests

Author: actuaryzhang <actuaryzhang10@gmail.com>

Closes #18140 from actuaryzhang/sparkRFormula.
2017-06-21 10:35:16 -07:00
Xingbo Jiang cad88f17e8 [SPARK-17851][SQL][TESTS] Make sure all test sqls in catalyst pass checkAnalysis
## What changes were proposed in this pull request?

Currently we have several tens of test sqls in catalyst will fail at `SimpleAnalyzer.checkAnalysis`, we should make sure they are valid.

This PR makes the following changes:
1. Apply `checkAnalysis` on plans that tests `Optimizer` rules, but don't require the testcases for `Parser`/`Analyzer` pass `checkAnalysis`;
2. Fix testcases for `Optimizer` that would have fall.
## How was this patch tested?

Apply `SimpleAnalyzer.checkAnalysis` on plans in `PlanTest.comparePlans`, update invalid test cases.

Author: Xingbo Jiang <xingbo.jiang@databricks.com>
Author: jiangxingbo <jiangxb1987@gmail.com>

Closes #15417 from jiangxb1987/cptest.
2017-06-21 09:40:06 -07:00
Marcos P e92befcb4b [MINOR][DOC] modified issue link and updated status
## What changes were proposed in this pull request?

This PR aims to clarify some outdated comments that i found at **spark-catalyst** and **spark-sql** pom files. Maven bug still happening and in order to track it I have updated the issue link and also the status of the issue.

Author: Marcos P <mpenate@stratio.com>

Closes #18374 from mpenate/fix/mng-3559-comment.
2017-06-21 15:34:10 +01:00
Yuming Wang 987eb8fadd [MINOR][DOCS] Add lost <tr> tag for configuration.md
## What changes were proposed in this pull request?

Add lost `<tr>` tag for `configuration.md`.

## How was this patch tested?
N/A

Author: Yuming Wang <wgyumg@gmail.com>

Closes #18372 from wangyum/docs-missing-tr.
2017-06-21 15:30:31 +01:00
Li Yichao d107b3b910 [SPARK-20640][CORE] Make rpc timeout and retry for shuffle registration configurable.
## What changes were proposed in this pull request?

Currently the shuffle service registration timeout and retry has been hardcoded. This works well for small workloads but under heavy workload when the shuffle service is busy transferring large amount of data we see significant delay in responding to the registration request, as a result we often see the executors fail to register with the shuffle service, eventually failing the job. We need to make these two parameters configurable.

## How was this patch tested?

* Updated `BlockManagerSuite` to test registration timeout and max attempts configuration actually works.

cc sitalkedia

Author: Li Yichao <lyc@zhihu.com>

Closes #18092 from liyichao/SPARK-20640.
2017-06-21 21:54:29 +08:00
sureshthalamati 9ce714dca2 [SPARK-10655][SQL] Adding additional data type mappings to jdbc DB2dialect.
This patch adds DB2 specific data type mappings for decfloat, real, xml , and timestamp with time zone (DB2Z specific type)  types on read and for byte, short data types  on write to the to jdbc data source DB2 dialect. Default mapping does not work for these types when reading/writing from DB2 database.

Added docker test, and a JDBC unit test case.

Author: sureshthalamati <suresh.thalamati@gmail.com>

Closes #9162 from sureshthalamati/db2dialect_enhancements-spark-10655.
2017-06-20 22:35:42 -07:00
Reynold Xin b6b108826a [SPARK-21103][SQL] QueryPlanConstraints should be part of LogicalPlan
## What changes were proposed in this pull request?
QueryPlanConstraints should be part of LogicalPlan, rather than QueryPlan, since the constraint framework is only used for query plan rewriting and not for physical planning.

## How was this patch tested?
Should be covered by existing tests, since it is a simple refactoring.

Author: Reynold Xin <rxin@databricks.com>

Closes #18310 from rxin/SPARK-21103.
2017-06-20 11:34:22 -07:00
Wenchen Fan e862dc9049 [SPARK-21150][SQL] Persistent view stored in Hive metastore should be case preserving
## What changes were proposed in this pull request?

This is a regression in Spark 2.2. In Spark 2.2, we introduced a new way to resolve persisted view: https://issues.apache.org/jira/browse/SPARK-18209 , but this makes the persisted view non case-preserving because we store the schema in hive metastore directly. We should follow data source table and store schema in table properties.

## How was this patch tested?

new regression test

Author: Wenchen Fan <wenchen@databricks.com>

Closes #18360 from cloud-fan/view.
2017-06-20 09:15:33 -07:00
Xingbo Jiang ef1622899f [SPARK-20989][CORE] Fail to start multiple workers on one host if external shuffle service is enabled in standalone mode
## What changes were proposed in this pull request?

In standalone mode, if we enable external shuffle service by setting `spark.shuffle.service.enabled` to true, and then we try to start multiple workers on one host(by setting `SPARK_WORKER_INSTANCES=3` in spark-env.sh, and then run `sbin/start-slaves.sh`), we can only launch one worker on each host successfully and the rest of the workers fail to launch.
The reason is the port of external shuffle service if configed by `spark.shuffle.service.port`, so currently we could start no more than one external shuffle service on each host. In our case, each worker tries to start a external shuffle service, and only one of them succeeded doing this.

We should give explicit reason of failure instead of fail silently.

## How was this patch tested?
Manually test by the following steps:
1. SET `SPARK_WORKER_INSTANCES=1` in `conf/spark-env.sh`;
2. SET `spark.shuffle.service.enabled` to `true` in `conf/spark-defaults.conf`;
3. Run `sbin/start-all.sh`.

Before the change, you will see no error in the command line, as the following:
```
starting org.apache.spark.deploy.master.Master, logging to /Users/xxx/workspace/spark/logs/spark-xxx-org.apache.spark.deploy.master.Master-1-xxx.local.out
localhost: starting org.apache.spark.deploy.worker.Worker, logging to /Users/xxx/workspace/spark/logs/spark-xxx-org.apache.spark.deploy.worker.Worker-1-xxx.local.out
localhost: starting org.apache.spark.deploy.worker.Worker, logging to /Users/xxx/workspace/spark/logs/spark-xxx-org.apache.spark.deploy.worker.Worker-2-xxx.local.out
localhost: starting org.apache.spark.deploy.worker.Worker, logging to /Users/xxx/workspace/spark/logs/spark-xxx-org.apache.spark.deploy.worker.Worker-3-xxx.local.out
```
And you can see in the webUI that only one worker is running.

After the change, you get explicit error messages in the command line:
```
starting org.apache.spark.deploy.master.Master, logging to /Users/xxx/workspace/spark/logs/spark-xxx-org.apache.spark.deploy.master.Master-1-xxx.local.out
localhost: starting org.apache.spark.deploy.worker.Worker, logging to /Users/xxx/workspace/spark/logs/spark-xxx-org.apache.spark.deploy.worker.Worker-1-xxx.local.out
localhost: failed to launch: nice -n 0 /Users/xxx/workspace/spark/bin/spark-class org.apache.spark.deploy.worker.Worker --webui-port 8081 spark://xxx.local:7077
localhost:   17/06/13 23:24:53 INFO SecurityManager: Changing view acls to: xxx
localhost:   17/06/13 23:24:53 INFO SecurityManager: Changing modify acls to: xxx
localhost:   17/06/13 23:24:53 INFO SecurityManager: Changing view acls groups to:
localhost:   17/06/13 23:24:53 INFO SecurityManager: Changing modify acls groups to:
localhost:   17/06/13 23:24:53 INFO SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users  with view permissions: Set(xxx); groups with view permissions: Set(); users  with modify permissions: Set(xxx); groups with modify permissions: Set()
localhost:   17/06/13 23:24:54 INFO Utils: Successfully started service 'sparkWorker' on port 63354.
localhost:   Exception in thread "main" java.lang.IllegalArgumentException: requirement failed: Start multiple worker on one host failed because we may launch no more than one external shuffle service on each host, please set spark.shuffle.service.enabled to false or set SPARK_WORKER_INSTANCES to 1 to resolve the conflict.
localhost:   	at scala.Predef$.require(Predef.scala:224)
localhost:   	at org.apache.spark.deploy.worker.Worker$.main(Worker.scala:752)
localhost:   	at org.apache.spark.deploy.worker.Worker.main(Worker.scala)
localhost: full log in /Users/xxx/workspace/spark/logs/spark-xxx-org.apache.spark.deploy.worker.Worker-1-xxx.local.out
localhost: starting org.apache.spark.deploy.worker.Worker, logging to /Users/xxx/workspace/spark/logs/spark-xxx-org.apache.spark.deploy.worker.Worker-2-xxx.local.out
localhost: failed to launch: nice -n 0 /Users/xxx/workspace/spark/bin/spark-class org.apache.spark.deploy.worker.Worker --webui-port 8082 spark://xxx.local:7077
localhost:   17/06/13 23:24:56 INFO SecurityManager: Changing view acls to: xxx
localhost:   17/06/13 23:24:56 INFO SecurityManager: Changing modify acls to: xxx
localhost:   17/06/13 23:24:56 INFO SecurityManager: Changing view acls groups to:
localhost:   17/06/13 23:24:56 INFO SecurityManager: Changing modify acls groups to:
localhost:   17/06/13 23:24:56 INFO SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users  with view permissions: Set(xxx); groups with view permissions: Set(); users  with modify permissions: Set(xxx); groups with modify permissions: Set()
localhost:   17/06/13 23:24:56 INFO Utils: Successfully started service 'sparkWorker' on port 63359.
localhost:   Exception in thread "main" java.lang.IllegalArgumentException: requirement failed: Start multiple worker on one host failed because we may launch no more than one external shuffle service on each host, please set spark.shuffle.service.enabled to false or set SPARK_WORKER_INSTANCES to 1 to resolve the conflict.
localhost:   	at scala.Predef$.require(Predef.scala:224)
localhost:   	at org.apache.spark.deploy.worker.Worker$.main(Worker.scala:752)
localhost:   	at org.apache.spark.deploy.worker.Worker.main(Worker.scala)
localhost: full log in /Users/xxx/workspace/spark/logs/spark-xxx-org.apache.spark.deploy.worker.Worker-2-xxx.local.out
localhost: starting org.apache.spark.deploy.worker.Worker, logging to /Users/xxx/workspace/spark/logs/spark-xxx-org.apache.spark.deploy.worker.Worker-3-xxx.local.out
localhost: failed to launch: nice -n 0 /Users/xxx/workspace/spark/bin/spark-class org.apache.spark.deploy.worker.Worker --webui-port 8083 spark://xxx.local:7077
localhost:   17/06/13 23:24:59 INFO SecurityManager: Changing view acls to: xxx
localhost:   17/06/13 23:24:59 INFO SecurityManager: Changing modify acls to: xxx
localhost:   17/06/13 23:24:59 INFO SecurityManager: Changing view acls groups to:
localhost:   17/06/13 23:24:59 INFO SecurityManager: Changing modify acls groups to:
localhost:   17/06/13 23:24:59 INFO SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users  with view permissions: Set(xxx); groups with view permissions: Set(); users  with modify permissions: Set(xxx); groups with modify permissions: Set()
localhost:   17/06/13 23:24:59 INFO Utils: Successfully started service 'sparkWorker' on port 63360.
localhost:   Exception in thread "main" java.lang.IllegalArgumentException: requirement failed: Start multiple worker on one host failed because we may launch no more than one external shuffle service on each host, please set spark.shuffle.service.enabled to false or set SPARK_WORKER_INSTANCES to 1 to resolve the conflict.
localhost:   	at scala.Predef$.require(Predef.scala:224)
localhost:   	at org.apache.spark.deploy.worker.Worker$.main(Worker.scala:752)
localhost:   	at org.apache.spark.deploy.worker.Worker.main(Worker.scala)
localhost: full log in /Users/xxx/workspace/spark/logs/spark-xxx-org.apache.spark.deploy.worker.Worker-3-xxx.local.out
```

Author: Xingbo Jiang <xingbo.jiang@databricks.com>

Closes #18290 from jiangxb1987/start-slave.
2017-06-20 17:17:21 +08:00
Joseph K. Bradley cc67bd5732 [SPARK-20929][ML] LinearSVC should use its own threshold param
## What changes were proposed in this pull request?

LinearSVC should use its own threshold param, rather than the shared one, since it applies to rawPrediction instead of probability.  This PR changes the param in the Scala, Python and R APIs.

## How was this patch tested?

New unit test to make sure the threshold can be set to any Double value.

Author: Joseph K. Bradley <joseph@databricks.com>

Closes #18151 from jkbradley/ml-2.2-linearsvc-cleanup.
2017-06-19 23:04:17 -07:00
actuaryzhang 8965fe764a [SPARK-20889][SPARKR] Grouped documentation for AGGREGATE column methods
## What changes were proposed in this pull request?
Grouped documentation for the aggregate functions for Column.

Author: actuaryzhang <actuaryzhang10@gmail.com>

Closes #18025 from actuaryzhang/sparkRDoc4.
2017-06-19 19:41:24 -07:00
Yuming Wang 9b57cd8d5c [SPARK-21133][CORE] Fix HighlyCompressedMapStatus#writeExternal throws NPE
## What changes were proposed in this pull request?

Fix HighlyCompressedMapStatus#writeExternal NPE:
```
17/06/18 15:00:27 ERROR Utils: Exception encountered
java.lang.NullPointerException
        at org.apache.spark.scheduler.HighlyCompressedMapStatus$$anonfun$writeExternal$2.apply$mcV$sp(MapStatus.scala:171)
        at org.apache.spark.scheduler.HighlyCompressedMapStatus$$anonfun$writeExternal$2.apply(MapStatus.scala:167)
        at org.apache.spark.scheduler.HighlyCompressedMapStatus$$anonfun$writeExternal$2.apply(MapStatus.scala:167)
        at org.apache.spark.util.Utils$.tryOrIOException(Utils.scala:1303)
        at org.apache.spark.scheduler.HighlyCompressedMapStatus.writeExternal(MapStatus.scala:167)
        at java.io.ObjectOutputStream.writeExternalData(ObjectOutputStream.java:1459)
        at java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1430)
        at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1178)
        at java.io.ObjectOutputStream.writeArray(ObjectOutputStream.java:1378)
        at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1174)
        at java.io.ObjectOutputStream.writeObject(ObjectOutputStream.java:348)
        at org.apache.spark.MapOutputTracker$$anonfun$serializeMapStatuses$1.apply$mcV$sp(MapOutputTracker.scala:617)
        at org.apache.spark.MapOutputTracker$$anonfun$serializeMapStatuses$1.apply(MapOutputTracker.scala:616)
        at org.apache.spark.MapOutputTracker$$anonfun$serializeMapStatuses$1.apply(MapOutputTracker.scala:616)
        at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1337)
        at org.apache.spark.MapOutputTracker$.serializeMapStatuses(MapOutputTracker.scala:619)
        at org.apache.spark.MapOutputTrackerMaster.getSerializedMapOutputStatuses(MapOutputTracker.scala:562)
        at org.apache.spark.MapOutputTrackerMaster$MessageLoop.run(MapOutputTracker.scala:351)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
        at java.lang.Thread.run(Thread.java:745)
17/06/18 15:00:27 ERROR MapOutputTrackerMaster: java.lang.NullPointerException
java.io.IOException: java.lang.NullPointerException
        at org.apache.spark.util.Utils$.tryOrIOException(Utils.scala:1310)
        at org.apache.spark.scheduler.HighlyCompressedMapStatus.writeExternal(MapStatus.scala:167)
        at java.io.ObjectOutputStream.writeExternalData(ObjectOutputStream.java:1459)
        at java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1430)
        at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1178)
        at java.io.ObjectOutputStream.writeArray(ObjectOutputStream.java:1378)
        at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1174)
        at java.io.ObjectOutputStream.writeObject(ObjectOutputStream.java:348)
        at org.apache.spark.MapOutputTracker$$anonfun$serializeMapStatuses$1.apply$mcV$sp(MapOutputTracker.scala:617)
        at org.apache.spark.MapOutputTracker$$anonfun$serializeMapStatuses$1.apply(MapOutputTracker.scala:616)
        at org.apache.spark.MapOutputTracker$$anonfun$serializeMapStatuses$1.apply(MapOutputTracker.scala:616)
        at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1337)
        at org.apache.spark.MapOutputTracker$.serializeMapStatuses(MapOutputTracker.scala:619)
        at org.apache.spark.MapOutputTrackerMaster.getSerializedMapOutputStatuses(MapOutputTracker.scala:562)
        at org.apache.spark.MapOutputTrackerMaster$MessageLoop.run(MapOutputTracker.scala:351)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
        at java.lang.Thread.run(Thread.java:745)
Caused by: java.lang.NullPointerException
        at org.apache.spark.scheduler.HighlyCompressedMapStatus$$anonfun$writeExternal$2.apply$mcV$sp(MapStatus.scala:171)
        at org.apache.spark.scheduler.HighlyCompressedMapStatus$$anonfun$writeExternal$2.apply(MapStatus.scala:167)
        at org.apache.spark.scheduler.HighlyCompressedMapStatus$$anonfun$writeExternal$2.apply(MapStatus.scala:167)
        at org.apache.spark.util.Utils$.tryOrIOException(Utils.scala:1303)
        ... 17 more
17/06/18 15:00:27 INFO MapOutputTrackerMasterEndpoint: Asked to send map output locations for shuffle 0 to 10.17.47.20:50188
17/06/18 15:00:27 ERROR Utils: Exception encountered
java.lang.NullPointerException
        at org.apache.spark.scheduler.HighlyCompressedMapStatus$$anonfun$writeExternal$2.apply$mcV$sp(MapStatus.scala:171)
        at org.apache.spark.scheduler.HighlyCompressedMapStatus$$anonfun$writeExternal$2.apply(MapStatus.scala:167)
        at org.apache.spark.scheduler.HighlyCompressedMapStatus$$anonfun$writeExternal$2.apply(MapStatus.scala:167)
        at org.apache.spark.util.Utils$.tryOrIOException(Utils.scala:1303)
        at org.apache.spark.scheduler.HighlyCompressedMapStatus.writeExternal(MapStatus.scala:167)
        at java.io.ObjectOutputStream.writeExternalData(ObjectOutputStream.java:1459)
        at java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1430)
        at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1178)
        at java.io.ObjectOutputStream.writeArray(ObjectOutputStream.java:1378)
        at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1174)
        at java.io.ObjectOutputStream.writeObject(ObjectOutputStream.java:348)
        at org.apache.spark.MapOutputTracker$$anonfun$serializeMapStatuses$1.apply$mcV$sp(MapOutputTracker.scala:617)
        at org.apache.spark.MapOutputTracker$$anonfun$serializeMapStatuses$1.apply(MapOutputTracker.scala:616)
        at org.apache.spark.MapOutputTracker$$anonfun$serializeMapStatuses$1.apply(MapOutputTracker.scala:616)
        at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1337)
        at org.apache.spark.MapOutputTracker$.serializeMapStatuses(MapOutputTracker.scala:619)
        at org.apache.spark.MapOutputTrackerMaster.getSerializedMapOutputStatuses(MapOutputTracker.scala:562)
        at org.apache.spark.MapOutputTrackerMaster$MessageLoop.run(MapOutputTracker.scala:351)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
        at java.lang.Thread.run(Thread.java:745)
```

## How was this patch tested?

manual tests

Author: Yuming Wang <wgyumg@gmail.com>

Closes #18343 from wangyum/SPARK-21133.
2017-06-20 09:22:30 +08:00
Marcelo Vanzin 9eacc5e438 [INFRA] Close stale PRs.
Closes #18311
Closes #18278
2017-06-19 15:14:33 -07:00
sharkdtu 3d4d11a80f [SPARK-21138][YARN] Cannot delete staging dir when the clusters of "spark.yarn.stagingDir" and "spark.hadoop.fs.defaultFS" are different
## What changes were proposed in this pull request?

When I set different clusters for "spark.hadoop.fs.defaultFS" and "spark.yarn.stagingDir" as follows:
```
spark.hadoop.fs.defaultFS  hdfs://tl-nn-tdw.tencent-distribute.com:54310
spark.yarn.stagingDir hdfs://ss-teg-2-v2/tmp/spark
```
The staging dir can not be deleted, it will prompt following message:
```
java.lang.IllegalArgumentException: Wrong FS: hdfs://ss-teg-2-v2/tmp/spark/.sparkStaging/application_1496819138021_77618, expected: hdfs://tl-nn-tdw.tencent-distribute.com:54310
```

## How was this patch tested?

Existing tests

Author: sharkdtu <sharkdtu@tencent.com>

Closes #18352 from sharkdtu/master.
2017-06-19 14:54:54 -07:00
Marcelo Vanzin 581565dd87 [SPARK-21124][UI] Show correct application user in UI.
The jobs page currently shows the application user, but it assumes
the OS user is the same as the user running the application, which
may not be true in all scenarios (e.g., kerberos). While it might be
useful to show both in the UI, this change just chooses the application
user over the OS user, since the latter can be found in the environment
page if needed.

Tested in live application and in history server.

Author: Marcelo Vanzin <vanzin@cloudera.com>

Closes #18331 from vanzin/SPARK-21124.
2017-06-19 14:41:58 -07:00
Xianyang Liu 0a4b7e4f81 [MINOR] Fix some typo of the document
## What changes were proposed in this pull request?

Fix some typo of the document.

## How was this patch tested?

Existing tests.

Please review http://spark.apache.org/contributing.html before opening a pull request.

Author: Xianyang Liu <xianyang.liu@intel.com>

Closes #18350 from ConeyLiu/fixtypo.
2017-06-19 20:35:58 +01:00
Dongjoon Hyun ecc5631351 [MINOR][BUILD] Fix Java linter errors
## What changes were proposed in this pull request?

This PR cleans up a few Java linter errors for Apache Spark 2.2 release.

## How was this patch tested?

```bash
$ dev/lint-java
Using `mvn` from path: /usr/local/bin/mvn
Checkstyle checks passed.
```

We can check the result at Travis CI, [here](https://travis-ci.org/dongjoon-hyun/spark/builds/244297894).

Author: Dongjoon Hyun <dongjoon@apache.org>

Closes #18345 from dongjoon-hyun/fix_lint_java_2.
2017-06-19 20:17:54 +01:00
Yong Tang e5387018e7 [SPARK-19975][PYTHON][SQL] Add map_keys and map_values functions to Python
## What changes were proposed in this pull request?

This fix tries to address the issue in SPARK-19975 where we
have `map_keys` and `map_values` functions in SQL yet there
is no Python equivalent functions.

This fix adds `map_keys` and `map_values` functions to Python.

## How was this patch tested?

This fix is tested manually (See Python docs for examples).

Author: Yong Tang <yong.tang.github@outlook.com>

Closes #17328 from yongtang/SPARK-19975.
2017-06-19 11:40:07 -07:00
assafmendelson 66a792cd88 [SPARK-21123][DOCS][STRUCTURED STREAMING] Options for file stream source are in a wrong table
## What changes were proposed in this pull request?

The description for several options of File Source for structured streaming appeared in the File Sink description instead.

This pull request has two commits: The first includes changes to the version as it appeared in spark 2.1 and the second handled an additional option added for spark 2.2

## How was this patch tested?

Built the documentation by SKIP_API=1 jekyll build and visually inspected the structured streaming programming guide.

The original documentation was written by tdas and lw-lin

Author: assafmendelson <assaf.mendelson@gmail.com>

Closes #18342 from assafmendelson/spark-21123.
2017-06-19 10:58:58 -07:00
saturday_s e92ffe6f17 [SPARK-19688][STREAMING] Not to read spark.yarn.credentials.file from checkpoint.
## What changes were proposed in this pull request?

Reload the `spark.yarn.credentials.file` property when restarting a streaming application from checkpoint.

## How was this patch tested?

Manual tested with 1.6.3 and 2.1.1.
I didn't test this with master because of some compile problems, but I think it will be the same result.

## Notice

This should be merged into maintenance branches too.

jira: [SPARK-21008](https://issues.apache.org/jira/browse/SPARK-21008)

Author: saturday_s <shi.indetail@gmail.com>

Closes #18230 from saturday-shi/SPARK-21008.
2017-06-19 10:24:29 -07:00
hyukjinkwon 9a145fd796 [MINOR] Bump SparkR and PySpark version to 2.3.0.
## What changes were proposed in this pull request?

#17753 bumps master branch version to 2.3.0-SNAPSHOT, but it seems SparkR and PySpark version were omitted.

ditto of https://github.com/apache/spark/pull/16488 / https://github.com/apache/spark/pull/17523

## How was this patch tested?

N/A

Author: hyukjinkwon <gurwls223@gmail.com>

Closes #18341 from HyukjinKwon/r-version.
2017-06-19 11:13:03 +01:00
Xiao Li 9413b84b5a [SPARK-21132][SQL] DISTINCT modifier of function arguments should not be silently ignored
### What changes were proposed in this pull request?
We should not silently ignore `DISTINCT` when they are not supported in the function arguments. This PR is to block these cases and issue the error messages.

### How was this patch tested?
Added test cases for both regular functions and window functions

Author: Xiao Li <gatorsmile@gmail.com>

Closes #18340 from gatorsmile/firstCount.
2017-06-19 15:51:21 +08:00
Xingbo Jiang ea542d29b2 [SPARK-19824][CORE] Update JsonProtocol to keep consistent with the UI
## What changes were proposed in this pull request?

Fix any inconsistent part in JsonProtocol with the UI.
This PR also contains the modifications in #17181

## How was this patch tested?

Updated JsonProtocolSuite.

Before this change, localhost:8080/json shows:
```
{
  "url" : "spark://xingbos-MBP.local:7077",
  "workers" : [ {
    "id" : "worker-20170615172946-192.168.0.101-49450",
    "host" : "192.168.0.101",
    "port" : 49450,
    "webuiaddress" : "http://192.168.0.101:8081",
    "cores" : 8,
    "coresused" : 8,
    "coresfree" : 0,
    "memory" : 15360,
    "memoryused" : 1024,
    "memoryfree" : 14336,
    "state" : "ALIVE",
    "lastheartbeat" : 1497519481722
  }, {
    "id" : "worker-20170615172948-192.168.0.101-49452",
    "host" : "192.168.0.101",
    "port" : 49452,
    "webuiaddress" : "http://192.168.0.101:8082",
    "cores" : 8,
    "coresused" : 8,
    "coresfree" : 0,
    "memory" : 15360,
    "memoryused" : 1024,
    "memoryfree" : 14336,
    "state" : "ALIVE",
    "lastheartbeat" : 1497519484160
  }, {
    "id" : "worker-20170615172951-192.168.0.101-49469",
    "host" : "192.168.0.101",
    "port" : 49469,
    "webuiaddress" : "http://192.168.0.101:8083",
    "cores" : 8,
    "coresused" : 8,
    "coresfree" : 0,
    "memory" : 15360,
    "memoryused" : 1024,
    "memoryfree" : 14336,
    "state" : "ALIVE",
    "lastheartbeat" : 1497519486905
  } ],
  "cores" : 24,
  "coresused" : 24,
  "memory" : 46080,
  "memoryused" : 3072,
  "activeapps" : [ {
    "starttime" : 1497519426990,
    "id" : "app-20170615173706-0001",
    "name" : "Spark shell",
    "user" : "xingbojiang",
    "memoryperslave" : 1024,
    "submitdate" : "Thu Jun 15 17:37:06 CST 2017",
    "state" : "RUNNING",
    "duration" : 65362
  } ],
  "completedapps" : [ {
    "starttime" : 1497519250893,
    "id" : "app-20170615173410-0000",
    "name" : "Spark shell",
    "user" : "xingbojiang",
    "memoryperslave" : 1024,
    "submitdate" : "Thu Jun 15 17:34:10 CST 2017",
    "state" : "FINISHED",
    "duration" : 116895
  } ],
  "activedrivers" : [ ],
  "status" : "ALIVE"
}
```

After the change:
```
{
  "url" : "spark://xingbos-MBP.local:7077",
  "workers" : [ {
    "id" : "worker-20170615175032-192.168.0.101-49951",
    "host" : "192.168.0.101",
    "port" : 49951,
    "webuiaddress" : "http://192.168.0.101:8081",
    "cores" : 8,
    "coresused" : 8,
    "coresfree" : 0,
    "memory" : 15360,
    "memoryused" : 1024,
    "memoryfree" : 14336,
    "state" : "ALIVE",
    "lastheartbeat" : 1497520292900
  }, {
    "id" : "worker-20170615175034-192.168.0.101-49953",
    "host" : "192.168.0.101",
    "port" : 49953,
    "webuiaddress" : "http://192.168.0.101:8082",
    "cores" : 8,
    "coresused" : 8,
    "coresfree" : 0,
    "memory" : 15360,
    "memoryused" : 1024,
    "memoryfree" : 14336,
    "state" : "ALIVE",
    "lastheartbeat" : 1497520280301
  }, {
    "id" : "worker-20170615175037-192.168.0.101-49955",
    "host" : "192.168.0.101",
    "port" : 49955,
    "webuiaddress" : "http://192.168.0.101:8083",
    "cores" : 8,
    "coresused" : 8,
    "coresfree" : 0,
    "memory" : 15360,
    "memoryused" : 1024,
    "memoryfree" : 14336,
    "state" : "ALIVE",
    "lastheartbeat" : 1497520282884
  } ],
  "aliveworkers" : 3,
  "cores" : 24,
  "coresused" : 24,
  "memory" : 46080,
  "memoryused" : 3072,
  "activeapps" : [ {
    "id" : "app-20170615175122-0001",
    "starttime" : 1497520282115,
    "name" : "Spark shell",
    "cores" : 24,
    "user" : "xingbojiang",
    "memoryperslave" : 1024,
    "submitdate" : "Thu Jun 15 17:51:22 CST 2017",
    "state" : "RUNNING",
    "duration" : 10805
  } ],
  "completedapps" : [ {
    "id" : "app-20170615175058-0000",
    "starttime" : 1497520258766,
    "name" : "Spark shell",
    "cores" : 24,
    "user" : "xingbojiang",
    "memoryperslave" : 1024,
    "submitdate" : "Thu Jun 15 17:50:58 CST 2017",
    "state" : "FINISHED",
    "duration" : 9876
  } ],
  "activedrivers" : [ ],
  "completeddrivers" : [ ],
  "status" : "ALIVE"
}
```

Author: Xingbo Jiang <xingbo.jiang@databricks.com>

Closes #18303 from jiangxb1987/json-protocol.
2017-06-18 22:05:06 -07:00
liuxian 112bd9bfc5 [SPARK-21090][CORE] Optimize the unified memory manager code
## What changes were proposed in this pull request?
1.In `acquireStorageMemory`, when the Memory Mode is OFF_HEAP ,the `maxOffHeapMemory` should be modified to `maxOffHeapStorageMemory`. after this PR,it will same as ON_HEAP Memory Mode.
Because when acquire memory is between `maxOffHeapStorageMemory` and `maxOffHeapMemory`,it will fail surely, so if acquire memory is greater than  `maxOffHeapStorageMemory`(not greater than `maxOffHeapMemory`),we should fail fast.
2. Borrow memory from execution, `numBytes` modified to `numBytes - storagePool.memoryFree` will be more reasonable.
Because we just acquire `(numBytes - storagePool.memoryFree)`, unnecessary borrowed `numBytes` from execution

## How was this patch tested?
added unit test case

Author: liuxian <liu.xian3@zte.com.cn>

Closes #18296 from 10110346/wip-lx-0614.
2017-06-19 11:46:58 +08:00
Yuming Wang f913f158ec [SPARK-20948][SQL] Built-in SQL Function UnaryMinus/UnaryPositive support string type
## What changes were proposed in this pull request?

Built-in SQL Function UnaryMinus/UnaryPositive support string type, if it's string type, convert it to double type, after this PR:
```sql
spark-sql> select positive('-1.11'), negative('-1.11');
-1.11   1.11
spark-sql>
```

## How was this patch tested?

unit tests

Author: Yuming Wang <wgyumg@gmail.com>

Closes #18173 from wangyum/SPARK-20948.
2017-06-18 20:14:05 -07:00
Yuming Wang ce49428ef7 [SPARK-20749][SQL][FOLLOWUP] Support character_length
## What changes were proposed in this pull request?

The function `char_length` is shorthand for `character_length` function. Both Hive and Postgresql support `character_length`,  This PR add support for `character_length`.

Ref:
https://cwiki.apache.org/confluence/display/Hive/LanguageManual+UDF#LanguageManualUDF-StringFunctions
https://www.postgresql.org/docs/current/static/functions-string.html

## How was this patch tested?

unit tests

Author: Yuming Wang <wgyumg@gmail.com>

Closes #18330 from wangyum/SPARK-20749-character_length.
2017-06-18 18:56:53 -07:00
actuaryzhang 110ce1f27b [SPARK-20892][SPARKR] Add SQL trunc function to SparkR
## What changes were proposed in this pull request?

Add SQL trunc function

## How was this patch tested?
standard test

Author: actuaryzhang <actuaryzhang10@gmail.com>

Closes #18291 from actuaryzhang/sparkRTrunc2.
2017-06-18 18:00:27 -07:00
hyukjinkwon 05f83c532a [SPARK-21128][R] Remove both "spark-warehouse" and "metastore_db" before listing files in R tests
## What changes were proposed in this pull request?

This PR proposes to list the files in test _after_ removing both "spark-warehouse" and "metastore_db" so that the next run of R tests pass fine. This is sometimes a bit annoying.

## How was this patch tested?

Manually running multiple times R tests via `./R/run-tests.sh`.

**Before**

Second run:

```
SparkSQL functions: Spark package found in SPARK_HOME: .../spark
...............................................................................................................................................................
...............................................................................................................................................................
...............................................................................................................................................................
...............................................................................................................................................................
...............................................................................................................................................................
....................................................................................................1234.......................

Failed -------------------------------------------------------------------------
1. Failure: No extra files are created in SPARK_HOME by starting session and making calls (test_sparkSQL.R#3384)
length(list1) not equal to length(list2).
1/1 mismatches
[1] 25 - 23 == 2

2. Failure: No extra files are created in SPARK_HOME by starting session and making calls (test_sparkSQL.R#3384)
sort(list1, na.last = TRUE) not equal to sort(list2, na.last = TRUE).
10/25 mismatches
x[16]: "metastore_db"
y[16]: "pkg"

x[17]: "pkg"
y[17]: "R"

x[18]: "R"
y[18]: "README.md"

x[19]: "README.md"
y[19]: "run-tests.sh"

x[20]: "run-tests.sh"
y[20]: "SparkR_2.2.0.tar.gz"

x[21]: "metastore_db"
y[21]: "pkg"

x[22]: "pkg"
y[22]: "R"

x[23]: "R"
y[23]: "README.md"

x[24]: "README.md"
y[24]: "run-tests.sh"

x[25]: "run-tests.sh"
y[25]: "SparkR_2.2.0.tar.gz"

3. Failure: No extra files are created in SPARK_HOME by starting session and making calls (test_sparkSQL.R#3388)
length(list1) not equal to length(list2).
1/1 mismatches
[1] 25 - 23 == 2

4. Failure: No extra files are created in SPARK_HOME by starting session and making calls (test_sparkSQL.R#3388)
sort(list1, na.last = TRUE) not equal to sort(list2, na.last = TRUE).
10/25 mismatches
x[16]: "metastore_db"
y[16]: "pkg"

x[17]: "pkg"
y[17]: "R"

x[18]: "R"
y[18]: "README.md"

x[19]: "README.md"
y[19]: "run-tests.sh"

x[20]: "run-tests.sh"
y[20]: "SparkR_2.2.0.tar.gz"

x[21]: "metastore_db"
y[21]: "pkg"

x[22]: "pkg"
y[22]: "R"

x[23]: "R"
y[23]: "README.md"

x[24]: "README.md"
y[24]: "run-tests.sh"

x[25]: "run-tests.sh"
y[25]: "SparkR_2.2.0.tar.gz"

DONE ===========================================================================
```

**After**

Second run:

```
SparkSQL functions: Spark package found in SPARK_HOME: .../spark
...............................................................................................................................................................
...............................................................................................................................................................
...............................................................................................................................................................
...............................................................................................................................................................
...............................................................................................................................................................
...............................................................................................................................
```

Author: hyukjinkwon <gurwls223@gmail.com>

Closes #18335 from HyukjinKwon/SPARK-21128.
2017-06-18 11:26:27 -07:00
hyukjinkwon 75a6d05853 [MINOR][R] Add knitr and rmarkdown packages/improve output for version info in AppVeyor tests
## What changes were proposed in this pull request?

This PR proposes three things as below:

**Install packages per documentation** - this does not affect the tests itself (but CRAN which we are not doing via AppVeyor) up to my knowledge.

This adds `knitr` and `rmarkdown` per 45824fb608/R/WINDOWS.md (unit-tests) (please see 45824fb608)

**Improve logs/shorten logs** - actually, long logs can be a problem on AppVeyor (e.g., see https://github.com/apache/spark/pull/17873)

`R -e ...` repeats printing R information for each invocation as below:

```
R version 3.3.1 (2016-06-21) -- "Bug in Your Hair"
Copyright (C) 2016 The R Foundation for Statistical Computing
Platform: i386-w64-mingw32/i386 (32-bit)

R is free software and comes with ABSOLUTELY NO WARRANTY.
You are welcome to redistribute it under certain conditions.
Type 'license()' or 'licence()' for distribution details.

  Natural language support but running in an English locale

R is a collaborative project with many contributors.
Type 'contributors()' for more information and
'citation()' on how to cite R or R packages in publications.

Type 'demo()' for some demos, 'help()' for on-line help, or
'help.start()' for an HTML browser interface to help.
Type 'q()' to quit R.
```

It looks reducing the call might be slightly better and print out the versions together looks more readable.

Before:

```
# R information ...
> packageVersion('testthat')
[1] '1.0.2'
>
>

# R information ...
> packageVersion('e1071')
[1] '1.6.8'
>
>
... 3 more times
```

After:

```
# R information ...
> packageVersion('knitr'); packageVersion('rmarkdown'); packageVersion('testthat'); packageVersion('e1071'); packageVersion('survival')
[1] ‘1.16’
[1] ‘1.6’
[1] ‘1.0.2’
[1] ‘1.6.8’
[1] ‘2.41.3’
```

**Add`appveyor.yml`/`dev/appveyor-install-dependencies.ps1` for triggering the test**

Changing this file might break the test, e.g., https://github.com/apache/spark/pull/16927

## How was this patch tested?

Before (please see https://ci.appveyor.com/project/HyukjinKwon/spark/build/169-master)
After (please see the AppVeyor build in this PR):

Author: hyukjinkwon <gurwls223@gmail.com>

Closes #18336 from HyukjinKwon/minor-add-knitr-and-rmarkdown.
2017-06-18 08:43:47 +01:00