Commit graph

20699 commits

Author SHA1 Message Date
liuxian b8a08f25cc [SPARK-21506][DOC] The description of "spark.executor.cores" may be not correct
## What changes were proposed in this pull request?

The number of cores assigned to each executor is configurable. When this is not explicitly set,  multiple executors from the same application may be launched on the same worker too.

## How was this patch tested?
N/A

Author: liuxian <liu.xian3@zte.com.cn>

Closes #18711 from 10110346/executorcores.
2017-10-10 20:44:33 +08:00
WeichenXu 3b5c2a84bf [SPARK-21770][ML] ProbabilisticClassificationModel fix corner case: normalization of all-zero raw predictions
## What changes were proposed in this pull request?

Fix probabilisticClassificationModel corner case: normalization of all-zero raw predictions, throw IllegalArgumentException with description.

## How was this patch tested?

Test case added.

Author: WeichenXu <weichen.xu@databricks.com>

Closes #19106 from WeichenXu123/SPARK-21770.
2017-10-10 08:27:45 +01:00
Takuya UESHIN af8a34c787 [SPARK-22159][SQL][FOLLOW-UP] Make config names consistently end with "enabled".
## What changes were proposed in this pull request?

This is a follow-up of #19384.

In the previous pr, only definitions of the config names were modified, but we also need to modify the names in runtime or tests specified as string literal.

## How was this patch tested?

Existing tests but modified the config names.

Author: Takuya UESHIN <ueshin@databricks.com>

Closes #19462 from ueshin/issues/SPARK-22159/fup1.
2017-10-09 22:35:34 -07:00
Feng Liu bebd2e1ce1 [SPARK-22222][CORE] Fix the ARRAY_MAX in BufferHolder and add a test
## What changes were proposed in this pull request?

We should not break the assumption that the length of the allocated byte array is word rounded:
https://github.com/apache/spark/blob/master/sql/catalyst/src/main/java/org/apache/spark/sql/catalyst/expressions/UnsafeRow.java#L170
So we want to use `Integer.MAX_VALUE - 15` instead of `Integer.MAX_VALUE - 8` as the upper bound of an allocated byte array.

cc: srowen gatorsmile
## How was this patch tested?

Since the Spark unit test JVM has less than 1GB heap, here we run the test code as a submit job, so it can run on a JVM has 4GB memory.

Please review http://spark.apache.org/contributing.html before opening a pull request.

Author: Feng Liu <fengliu@databricks.com>

Closes #19460 from liufengdb/fix_array_max.
2017-10-09 21:34:37 -07:00
Jose Torres 71c2b81aa0 [SPARK-22230] Swap per-row order in state store restore.
## What changes were proposed in this pull request?
In state store restore, for each row, put the saved state before the row in the iterator instead of after.

This fixes an issue where agg(last('attr)) will forever return the last value of 'attr from the first microbatch.

## How was this patch tested?

new unit test

Author: Jose Torres <jose@databricks.com>

Closes #19461 from joseph-torres/SPARK-22230.
2017-10-09 16:34:39 -07:00
Ryan Blue 155ab6347e [SPARK-22170][SQL] Reduce memory consumption in broadcast joins.
## What changes were proposed in this pull request?

This updates the broadcast join code path to lazily decompress pages and
iterate through UnsafeRows to prevent all rows from being held in memory
while the broadcast table is being built.

## How was this patch tested?

Existing tests.

Author: Ryan Blue <blue@apache.org>

Closes #19394 from rdblue/broadcast-driver-memory.
2017-10-09 15:22:41 -07:00
Pavel Sakun dadd13f365 [SPARK] Misleading error message for missing --proxy-user value
Fix misleading error message when argument is expected.

## What changes were proposed in this pull request?

Change message to be accurate.

## How was this patch tested?

Messaging change, was tested manually.

Author: Pavel Sakun <pavel.sakun@gmail.com>

Closes #19457 from pavel-sakun/patch-1.
2017-10-09 23:00:04 +01:00
Marcelo Vanzin b650ee0265 [INFRA] Close stale PRs.
Closes #19423
Closes #19455
2017-10-09 13:55:55 -07:00
Thomas Graves a74ec6d7bb [SPARK-22218] spark shuffle services fails to update secret on app re-attempts
This patch fixes application re-attempts when running spark on yarn using the external shuffle service with security on.  Currently executors will fail to launch on any application re-attempt when launched on a nodemanager that had an executor from the first attempt.  The reason for this is because we aren't updating the secret key after the first application attempt.  The fix here is to just remove the containskey check to see if it already exists. In this way, we always add it and make sure its the most recent secret.  Similarly remove the check for containsKey on the remove since its just adding extra check that isn't really needed.

Note this worked before spark 2.2 because the check used to be contains (which was looking for the value) rather then containsKey, so that never matched and it was just always adding the new secret.

Patch was tested on a 10 node cluster as well as added the unit test.
The test ran was a wordcount where the output directory already existed.  With the bug present the application attempt failed with max number of executor Failures which were all saslExceptions.  With the fix present the application re-attempts fail with directory already exists or when you remove the directory between attempts the re-attemps succeed.

Author: Thomas Graves <tgraves@unharmedunarmed.corp.ne1.yahoo.com>

Closes #19450 from tgravescs/SPARK-22218.
2017-10-09 12:56:37 -07:00
Dongjoon Hyun f31e11404d [SPARK-21568][CORE] ConsoleProgressBar should only be enabled in shells
## What changes were proposed in this pull request?

This PR disables console progress bar feature in non-shell environment by overriding the configuration.

## How was this patch tested?

Manual. Run the following examples with and without `spark.ui.showConsoleProgress` in order to see progress bar on master branch and this PR.

**Scala Shell**
```scala
spark.range(1000000000).map(_ + 1).count
```

**PySpark**
```python
spark.range(10000000).rdd.map(lambda x: len(x)).count()
```

**Spark Submit**
```python
from pyspark.sql import SparkSession

if __name__ == "__main__":
    spark = SparkSession.builder.getOrCreate()
    spark.range(2000000).rdd.map(lambda row: len(row)).count()
    spark.stop()
```

Author: Dongjoon Hyun <dongjoon@apache.org>

Closes #19061 from dongjoon-hyun/SPARK-21568.
2017-10-09 12:53:10 -07:00
Nick Pentreath 98057583dd [SPARK-20679][ML] Support recommending for a subset of users/items in ALSModel
This PR adds methods `recommendForUserSubset` and `recommendForItemSubset` to `ALSModel`. These allow recommending for a specified set of user / item ids rather than for every user / item (as in the `recommendForAllX` methods).

The subset methods take a `DataFrame` as input, containing ids in the column specified by the param `userCol` or `itemCol`. The model will generate recommendations for each _unique_ id in this input dataframe.

## How was this patch tested?
New unit tests in `ALSSuite` and Python doctests in `ALS`. Ran updated examples locally.

Author: Nick Pentreath <nickp@za.ibm.com>

Closes #18748 from MLnick/als-recommend-df.
2017-10-09 10:42:33 +02:00
Yuanjian Li fe7b219ae3 [SPARK-22074][CORE] Task killed by other attempt task should not be resubmitted
## What changes were proposed in this pull request?

As the detail scenario described in [SPARK-22074](https://issues.apache.org/jira/browse/SPARK-22074), unnecessary resubmitted may cause stage hanging in currently release versions. This patch add a new var in TaskInfo to mark this task killed by other attempt or not.

## How was this patch tested?

Add a new UT `[SPARK-22074] Task killed by other attempt task should not be resubmitted` in TaskSetManagerSuite, this UT recreate the scenario in JIRA description, it failed without the changes in this PR and passed conversely.

Author: Yuanjian Li <xyliyuanjian@gmail.com>

Closes #19287 from xuanyuanking/SPARK-22074.
2017-10-09 14:16:25 +08:00
Sergei Lebedev c998a2ae0e [SPARK-22147][CORE] Removed redundant allocations from BlockId
## What changes were proposed in this pull request?

Prior to this commit BlockId.hashCode and BlockId.equals were defined
in terms of BlockId.name. This allowed the subclasses to be concise and
enforced BlockId.name as a single unique identifier for a block. All
subclasses override BlockId.name with an expression involving an
allocation of StringBuilder and ultimatelly String. This is suboptimal
since it induced unnecessary GC pressure on the dirver, see
BlockManagerMasterEndpoint.

The commit removes the definition of hashCode and equals from the base
class. No other change is necessary since all subclasses are in fact
case classes and therefore have auto-generated hashCode and equals. No
change of behaviour is expected.

Sidenote: you might be wondering, why did the subclasses use the base
implementation and the auto-generated one? Apparently, this behaviour
is documented in the spec. See this SO answer for details
https://stackoverflow.com/a/44990210/262432.

## How was this patch tested?

BlockIdSuite

Author: Sergei Lebedev <s.lebedev@criteo.com>

Closes #19369 from superbobry/blockid-equals-hashcode.
2017-10-08 12:58:39 +01:00
Kento NOZAWA 5eacc3bfa9 [SPARK-22156][MLLIB] Fix update equation of learning rate in Word2Vec.scala
## What changes were proposed in this pull request?

Current equation of learning rate is incorrect when `numIterations` > `1`.
This PR is based on [original C code](https://github.com/tmikolov/word2vec/blob/master/word2vec.c#L393).

cc: mengxr

## How was this patch tested?

manual tests

I modified [this example code](https://spark.apache.org/docs/2.1.1/mllib-feature-extraction.html#example).

### `numIteration=1`

#### Code

```scala
import org.apache.spark.mllib.feature.{Word2Vec, Word2VecModel}

val input = sc.textFile("data/mllib/sample_lda_data.txt").map(line => line.split(" ").toSeq)

val word2vec = new Word2Vec()

val model = word2vec.fit(input)

val synonyms = model.findSynonyms("1", 5)

for((synonym, cosineSimilarity) <- synonyms) {
  println(s"$synonym $cosineSimilarity")
}
```

#### Result

```
2 0.175856813788414
0 0.10971353203058243
4 0.09818313270807266
3 0.012947646901011467
9 -0.09881238639354706
```

### `numIteration=5`

#### Code

```scala
import org.apache.spark.mllib.feature.{Word2Vec, Word2VecModel}

val input = sc.textFile("data/mllib/sample_lda_data.txt").map(line => line.split(" ").toSeq)

val word2vec = new Word2Vec()
word2vec.setNumIterations(5)

val model = word2vec.fit(input)

val synonyms = model.findSynonyms("1", 5)

for((synonym, cosineSimilarity) <- synonyms) {
  println(s"$synonym $cosineSimilarity")
}
```

#### Result

```
0 0.9898583889007568
2 0.9808019399642944
4 0.9794934391975403
3 0.9506527781486511
9 -0.9065656661987305
```

Author: Kento NOZAWA <k_nzw@klis.tsukuba.ac.jp>

Closes #19372 from nzw0301/master.
2017-10-07 08:30:48 +01:00
Sergey Zhemzhitsky 2030f19511 [SPARK-21549][CORE] Respect OutputFormats with no output directory provided
## What changes were proposed in this pull request?

Fix for https://issues.apache.org/jira/browse/SPARK-21549 JIRA issue.

Since version 2.2 Spark does not respect OutputFormat with no output paths provided.
The examples of such formats are [Cassandra OutputFormat](08dfa3a7ac/src/main/java/org/apache/cassandra/hadoop2/AbstractColumnFamilyOutputFormat.java), [Aerospike OutputFormat](https://github.com/aerospike/aerospike-hadoop/blob/master/mapreduce/src/main/java/com/aerospike/hadoop/mapreduce/AerospikeOutputFormat.java), etc. which do not have an ability to rollback the results written to an external systems on job failure.

Provided output directory is required by Spark to allows files to be committed to an absolute output location, that is not the case for output formats which write data to external systems.

This pull request prevents accessing `absPathStagingDir` method that causes the error described in SPARK-21549 unless there are files to rename in `addedAbsPathFiles`.

## How was this patch tested?

Unit tests

Author: Sergey Zhemzhitsky <szhemzhitski@gmail.com>

Closes #19294 from szhem/SPARK-21549-abs-output-commits.
2017-10-06 20:43:53 -07:00
Liang-Chi Hsieh debcbec749 [SPARK-21947][SS] Check and report error when monotonically_increasing_id is used in streaming query
## What changes were proposed in this pull request?

`monotonically_increasing_id` doesn't work in Structured Streaming. We should throw an exception if a streaming query uses it.

## How was this patch tested?

Added test.

Author: Liang-Chi Hsieh <viirya@gmail.com>

Closes #19336 from viirya/SPARK-21947.
2017-10-06 13:10:04 -07:00
Xingbo Jiang 08b204fd2c [SPARK-22214][SQL] Refactor the list hive partitions code
## What changes were proposed in this pull request?

In this PR we make a few changes to the list hive partitions code, to make the code more extensible.
The following changes are made:
1. In `HiveClientImpl.getPartitions()`, call `client.getPartitions` instead of `shim.getAllPartitions` when `spec` is empty;
2. In `HiveTableScanExec`, previously we always call `listPartitionsByFilter` if the config `metastorePartitionPruning` is enabled, but actually, we'd better call `listPartitions` if `partitionPruningPred` is empty;
3.  We should use sessionCatalog instead of SharedState.externalCatalog in `HiveTableScanExec`.

## How was this patch tested?

Tested by existing test cases since this is code refactor, no regression or behavior change is expected.

Author: Xingbo Jiang <xingbo.jiang@databricks.com>

Closes #19444 from jiangxb1987/hivePartitions.
2017-10-06 12:53:35 -07:00
minixalpha c7b46d4d8a [SPARK-21877][DEPLOY, WINDOWS] Handle quotes in Windows command scripts
## What changes were proposed in this pull request?

All the windows command scripts can not handle quotes in parameter.

Run a windows command shell with parameter which has quotes can reproduce the bug:

```
C:\Users\meng\software\spark-2.2.0-bin-hadoop2.7> bin\spark-shell --driver-java-options " -Dfile.encoding=utf-8 "
'C:\Users\meng\software\spark-2.2.0-bin-hadoop2.7\bin\spark-shell2.cmd" --driver-java-options "' is not recognized as an internal or external command,
operable program or batch file.
```

Windows recognize "--driver-java-options" as part of the command.
All the Windows command script has the following code have the bug.

```
cmd /V /E /C "<other command>" %*
```

We should quote command and parameters like

```
cmd /V /E /C ""<other command>" %*"
```

## How was this patch tested?

Test manually on Windows 10 and Windows 7

We can verify it by the following demo:

```
C:\Users\meng\program\demo>cat a.cmd
echo off
cmd /V /E /C "b.cmd" %*

C:\Users\meng\program\demo>cat b.cmd
echo off
echo %*

C:\Users\meng\program\demo>cat c.cmd
echo off
cmd /V /E /C ""b.cmd" %*"

C:\Users\meng\program\demo>a.cmd "123"
'b.cmd" "123' is not recognized as an internal or external command,
operable program or batch file.

C:\Users\meng\program\demo>c.cmd "123"
"123"
```

With the spark-shell.cmd example, change it to the following code will make the command execute succeed.

```
cmd /V /E /C ""%~dp0spark-shell2.cmd" %*"
```

```
C:\Users\meng\software\spark-2.2.0-bin-hadoop2.7> bin\spark-shell  --driver-java-options " -Dfile.encoding=utf-8 "
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
...

```

Author: minixalpha <xkzalpha@gmail.com>

Closes #19090 from minixalpha/master.
2017-10-06 23:38:47 +09:00
Sean Owen 0c03297bf0 [SPARK-22142][BUILD][STREAMING] Move Flume support behind a profile, take 2
## What changes were proposed in this pull request?

Move flume behind a profile, take 2. See https://github.com/apache/spark/pull/19365 for most of the back-story.

This change should fix the problem by removing the examples module dependency and moving Flume examples to the module itself. It also adds deprecation messages, per a discussion on dev about deprecating for 2.3.0.

## How was this patch tested?

Existing tests, which still enable flume integration.

Author: Sean Owen <sowen@cloudera.com>

Closes #19412 from srowen/SPARK-22142.2.
2017-10-06 15:08:28 +01:00
gatorsmile 83488cc318 [SPARK-21871][SQL] Fix infinite loop when bytecode size is larger than spark.sql.codegen.hugeMethodLimit
## What changes were proposed in this pull request?
When exceeding `spark.sql.codegen.hugeMethodLimit`, the runtime fallbacks to the Volcano iterator solution. This could cause an infinite loop when `FileSourceScanExec` can use the columnar batch to read the data. This PR is to fix the issue.

## How was this patch tested?
Added a test

Author: gatorsmile <gatorsmile@gmail.com>

Closes #19440 from gatorsmile/testt.
2017-10-05 23:33:49 -07:00
Liang-Chi Hsieh ae61f187aa [SPARK-22206][SQL][SPARKR] gapply in R can't work on empty grouping columns
## What changes were proposed in this pull request?

Looks like `FlatMapGroupsInRExec.requiredChildDistribution` didn't consider empty grouping attributes. It should be a problem when running `EnsureRequirements` and `gapply` in R can't work on empty grouping columns.

## How was this patch tested?

Added test.

Author: Liang-Chi Hsieh <viirya@gmail.com>

Closes #19436 from viirya/fix-flatmapinr-distribution.
2017-10-05 23:36:18 +09:00
Shixiong Zhu c8affec21c [SPARK-22203][SQL] Add job description for file listing Spark jobs
## What changes were proposed in this pull request?

The user may be confused about some 10000-tasks jobs. We can add a job description for these jobs so that the user can figure it out.

## How was this patch tested?

The new unit test.

Before:
<img width="343" alt="screen shot 2017-10-04 at 3 22 09 pm" src="https://user-images.githubusercontent.com/1000778/31202567-f78d15c0-a917-11e7-841e-11b8bf8f0032.png">

After:
<img width="473" alt="screen shot 2017-10-04 at 3 13 51 pm" src="https://user-images.githubusercontent.com/1000778/31202576-fc01e356-a917-11e7-9c2b-7bf80b153adb.png">

Author: Shixiong Zhu <zsxwing@gmail.com>

Closes #19432 from zsxwing/SPARK-22203.
2017-10-04 20:58:48 -07:00
Tathagata Das 969ffd6317 [SPARK-22187][SS] Update unsaferow format for saved state such that we can set timeouts when state is null
## What changes were proposed in this pull request?

Currently, the group state of user-defined-type is encoded as top-level columns in the UnsafeRows stores in the state store. The timeout timestamp is also saved as (when needed) as the last top-level column. Since the group state is serialized to top-level columns, you cannot save "null" as a value of state (setting null in all the top-level columns is not equivalent). So we don't let the user set the timeout without initializing the state for a key. Based on user experience, this leads to confusion.

This PR is to change the row format such that the state is saved as nested columns. This would allow the state to be set to null, and avoid these confusing corner cases.

## How was this patch tested?
Refactored tests.

Author: Tathagata Das <tathagata.das1565@gmail.com>

Closes #19416 from tdas/SPARK-22187.
2017-10-04 19:25:22 -07:00
Wenchen Fan bb035f1ee5 [SPARK-22169][SQL] support byte length literal as identifier
## What changes were proposed in this pull request?

By definition the table name in Spark can be something like `123x`, `25a`, etc., with exceptions for literals like `12L`, `23BD`, etc. However, Spark SQL has a special byte length literal, which stops users to use digits followed by `b`, `k`, `m`, `g` as identifiers.

byte length literal is not a standard sql literal and is only used in the `tableSample` parser rule. This PR move the parsing of byte length literal from lexer to parser, so that users can use it as identifiers.

## How was this patch tested?

regression test

Author: Wenchen Fan <wenchen@databricks.com>

Closes #19392 from cloud-fan/parser-bug.
2017-10-04 13:13:51 -07:00
Takeshi Yamamuro 4a779bdac3 [SPARK-21871][SQL] Check actual bytecode size when compiling generated code
## What changes were proposed in this pull request?
This pr added code to check actual bytecode size when compiling generated code. In #18810, we added code to give up code compilation and use interpreter execution in `SparkPlan` if the line number of generated functions goes over `maxLinesPerFunction`. But, we already have code to collect metrics for compiled bytecode size in `CodeGenerator` object. So,we could easily reuse the code for this purpose.

## How was this patch tested?
Added tests in `WholeStageCodegenSuite`.

Author: Takeshi Yamamuro <yamamuro@apache.org>

Closes #19083 from maropu/SPARK-21871.
2017-10-04 10:08:24 -07:00
Kazuaki Ishizaki 64df08b647 [SPARK-20783][SQL] Create ColumnVector to abstract existing compressed column (batch method)
## What changes were proposed in this pull request?

This PR abstracts data compressed by `CompressibleColumnAccessor` using `ColumnVector` in batch method. When `ColumnAccessor.decompress` is called, `ColumnVector` will have uncompressed data. This batch decompress does not use `InternalRow` to reduce the number of memory accesses.

As first step of this implementation, this JIRA supports primitive data types. Another PR will support array and other data types.

This implementation decompress data in batch into uncompressed column batch, as rxin suggested at [here](https://github.com/apache/spark/pull/18468#issuecomment-316914076). Another implementation uses adapter approach [as cloud-fan suggested](https://github.com/apache/spark/pull/18468).

## How was this patch tested?

Added test suites

Author: Kazuaki Ishizaki <ishizaki@jp.ibm.com>

Closes #18704 from kiszk/SPARK-20783a.
2017-10-04 15:06:44 +08:00
Rekha Joshi d54670192a [SPARK-22193][SQL] Minor typo fix
## What changes were proposed in this pull request?

[SPARK-22193][SQL] Minor typo fix in SortMergeJoinExec. Nothing major, but it bothered me going into.Hence fixing

## How was this patch tested?
existing tests

Author: Rekha Joshi <rekhajoshm@gmail.com>
Author: rjoshi2 <rekhajoshm@gmail.com>

Closes #19422 from rekhajoshm/SPARK-22193.
2017-10-04 07:11:00 +01:00
Jose Torres 3099c574c5 [SPARK-22136][SS] Implement stream-stream outer joins.
## What changes were proposed in this pull request?

Allow one-sided outer joins between two streams when a watermark is defined.

## How was this patch tested?

new unit tests

Author: Jose Torres <jose@databricks.com>

Closes #19327 from joseph-torres/outerjoin.
2017-10-03 21:42:51 -07:00
gatorsmile 5f69433453 [SPARK-22171][SQL] Describe Table Extended Failed when Table Owner is Empty
## What changes were proposed in this pull request?

Users could hit `java.lang.NullPointerException` when the tables were created by Hive and the table's owner is `null` that are got from Hive metastore. `DESC EXTENDED` failed with the error:

> SQLExecutionException: java.lang.NullPointerException at scala.collection.immutable.StringOps$.length$extension(StringOps.scala:47) at scala.collection.immutable.StringOps.length(StringOps.scala:47) at scala.collection.IndexedSeqOptimized$class.isEmpty(IndexedSeqOptimized.scala:27) at scala.collection.immutable.StringOps.isEmpty(StringOps.scala:29) at scala.collection.TraversableOnce$class.nonEmpty(TraversableOnce.scala:111) at scala.collection.immutable.StringOps.nonEmpty(StringOps.scala:29) at org.apache.spark.sql.catalyst.catalog.CatalogTable.toLinkedHashMap(interface.scala:300) at org.apache.spark.sql.execution.command.DescribeTableCommand.describeFormattedTableInfo(tables.scala:565) at org.apache.spark.sql.execution.command.DescribeTableCommand.run(tables.scala:543) at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:66) at

## How was this patch tested?
Added a unit test case

Author: gatorsmile <gatorsmile@gmail.com>

Closes #19395 from gatorsmile/desc.
2017-10-03 21:27:58 -07:00
Sahil Takiar e36ec38d89 [SPARK-20466][CORE] HadoopRDD#addLocalConfiguration throws NPE
## What changes were proposed in this pull request?

Fix for SPARK-20466, full description of the issue in the JIRA. To summarize, `HadoopRDD` uses a metadata cache to cache `JobConf` objects. The cache uses soft-references, which means the JVM can delete entries from the cache whenever there is GC pressure. `HadoopRDD#getJobConf` had a bug where it would check if the cache contained the `JobConf`, if it did it would get the `JobConf` from the cache and return it. This doesn't work when soft-references are used as the JVM can delete the entry between the existence check and the get call.

## How was this patch tested?

Haven't thought of a good way to test this yet given the issue only occurs sometimes, and happens during high GC pressure. Was thinking of using mocks to verify `#getJobConf` is doing the right thing. I deleted the method `HadoopRDD#containsCachedMetadata` so that we don't hit this issue again.

Author: Sahil Takiar <stakiar@cloudera.com>

Closes #19413 from sahilTakiar/master.
2017-10-03 16:53:32 -07:00
gatorsmile e65b6b7ca1 [SPARK-22178][SQL] Refresh Persistent Views by REFRESH TABLE Command
## What changes were proposed in this pull request?
The underlying tables of persistent views are not refreshed when users issue the REFRESH TABLE command against the persistent views.

## How was this patch tested?
Added a test case

Author: gatorsmile <gatorsmile@gmail.com>

Closes #19405 from gatorsmile/refreshView.
2017-10-03 12:40:22 -07:00
Reynold Xin 4c5158eec9 [SPARK-21644][SQL] LocalLimit.maxRows is defined incorrectly
## What changes were proposed in this pull request?
The definition of `maxRows` in `LocalLimit` operator was simply wrong. This patch introduces a new `maxRowsPerPartition` method and uses that in pruning. The patch also adds more documentation on why we need local limit vs global limit.

Note that this previously has never been a bug because the way the code is structured, but future use of the maxRows could lead to bugs.

## How was this patch tested?
Should be covered by existing test cases.

Closes #18851

Author: gatorsmile <gatorsmile@gmail.com>
Author: Reynold Xin <rxin@databricks.com>

Closes #19393 from gatorsmile/pr-18851.
2017-10-03 12:38:13 -07:00
Takeshi Yamamuro fa225da746 [SPARK-22176][SQL] Fix overflow issue in Dataset.show
## What changes were proposed in this pull request?
This pr fixed an overflow issue below in `Dataset.show`:
```
scala> Seq((1, 2), (3, 4)).toDF("a", "b").show(Int.MaxValue)
org.apache.spark.sql.AnalysisException: The limit expression must be equal to or greater than 0, but got -2147483648;;
GlobalLimit -2147483648
+- LocalLimit -2147483648
   +- Project [_1#27218 AS a#27221, _2#27219 AS b#27222]
      +- LocalRelation [_1#27218, _2#27219]

  at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class.failAnalysis(CheckAnalysis.scala:41)
  at org.apache.spark.sql.catalyst.analysis.Analyzer.failAnalysis(Analyzer.scala:89)
  at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class.org$apache$spark$sql$catalyst$analysis$CheckAnalysis$$checkLimitClause(CheckAnalysis.scala:70)
  at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:234)
  at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:80)
  at org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:127)
```

## How was this patch tested?
Added tests in `DataFrameSuite`.

Author: Takeshi Yamamuro <yamamuro@apache.org>

Closes #19401 from maropu/MaxValueInShowString.
2017-10-02 15:25:33 -07:00
Gene Pang 4329eb2e73 [SPARK-16944][Mesos] Improve data locality when launching new executors when dynamic allocation is enabled
## What changes were proposed in this pull request?

Improve the Spark-Mesos coarse-grained scheduler to consider the preferred locations when dynamic allocation is enabled.

## How was this patch tested?

Added a unittest, and performed manual testing on AWS.

Author: Gene Pang <gene.pang@gmail.com>

Closes #18098 from gpang/mesos_data_locality.
2017-10-02 15:09:11 -07:00
Dongjoon Hyun e5431f2cfd [SPARK-22158][SQL] convertMetastore should not ignore table property
## What changes were proposed in this pull request?

From the beginning, convertMetastoreOrc ignores table properties and use an empty map instead. This PR fixes that. For the diff, please see [this](https://github.com/apache/spark/pull/19382/files?w=1). convertMetastoreParquet also ignore.

```scala
val options = Map[String, String]()
```

- [SPARK-14070: HiveMetastoreCatalog.scala](https://github.com/apache/spark/pull/11891/files#diff-ee66e11b56c21364760a5ed2b783f863R650)
- [Master branch: HiveStrategies.scala](https://github.com/apache/spark/blob/master/sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveStrategies.scala#L197
)

## How was this patch tested?

Pass the Jenkins with an updated test suite.

Author: Dongjoon Hyun <dongjoon@apache.org>

Closes #19382 from dongjoon-hyun/SPARK-22158.
2017-10-02 15:00:26 -07:00
Holden Karau 8fab7995d3 [SPARK-22167][R][BUILD] sparkr packaging issue allow zinc
## What changes were proposed in this pull request?

When zinc is running the pwd might be in the root of the project. A quick solution to this is to not go a level up incase we are in the root rather than root/core/. If we are in the root everything works fine, if we are in core add a script which goes and runs the level up

## How was this patch tested?

set -x in the SparkR install scripts.

Author: Holden Karau <holden@us.ibm.com>

Closes #19402 from holdenk/SPARK-22167-sparkr-packaging-issue-allow-zinc.
2017-10-02 11:46:51 -07:00
guoxiaolong 405c0e99e7 [SPARK-22173][WEB-UI] Table CSS style needs to be adjusted in History Page and in Executors Page.
## What changes were proposed in this pull request?

There is a problem with table CSS style.

1. At present, table CSS style is too crowded, and the table width cannot adapt itself.

2. Table CSS style is different from job page, stage page, task page, master page, worker page, etc. The Spark web UI needs to be consistent.

fix before:
![01](https://user-images.githubusercontent.com/26266482/31041261-c6766c3a-a5c4-11e7-97a7-96bd51ef12bd.png)

![02](https://user-images.githubusercontent.com/26266482/31041266-d75b6a32-a5c4-11e7-8071-e3bbbba39b80.png)

----------------------------------------------------------------------------------------------------------

fix after:
![1](https://user-images.githubusercontent.com/26266482/31041162-808a5a3e-a5c3-11e7-8d92-d763b500ce53.png)

![2](https://user-images.githubusercontent.com/26266482/31041166-86e583e0-a5c3-11e7-949c-11c370db9e27.png)

## How was this patch tested?

manual tests

Please review http://spark.apache.org/contributing.html before opening a pull request.

Author: guoxiaolong <guo.xiaolong1@zte.com.cn>

Closes #19397 from guoxiaolongzte/SPARK-22173.
2017-10-02 08:07:56 +01:00
Liang-Chi Hsieh 3ca367083e [SPARK-22001][ML][SQL] ImputerModel can do withColumn for all input columns at one pass
## What changes were proposed in this pull request?

SPARK-21690 makes one-pass `Imputer` by parallelizing the computation of all input columns. When we transform dataset with `ImputerModel`, we do `withColumn` on all input columns sequentially. We can also do this on all input columns at once by adding a `withColumns` API to `Dataset`.

The new `withColumns` API is for internal use only now.

## How was this patch tested?

Existing tests for `ImputerModel`'s change. Added tests for `withColumns` API.

Author: Liang-Chi Hsieh <viirya@gmail.com>

Closes #19229 from viirya/SPARK-22001.
2017-10-01 10:49:22 -07:00
hyukjinkwon 02c91e03f9 [SPARK-22063][R] Fixes lint check failures in R by latest commit sha1 ID of lint-r
## What changes were proposed in this pull request?

Currently, we set lintr to jimhester/lintra769c0b (see [this](7d1175011c) and [SPARK-14074](https://issues.apache.org/jira/browse/SPARK-14074)).

I first tested and checked lintr-1.0.1 but it looks many important fixes are missing (for example, checking 100 length). So, I instead tried the latest commit, 5431140ffe, in my local and fixed the check failures.

It looks it has fixed many bugs and now finds many instances that I have observed and thought should be caught time to time, here I filed [the results](https://gist.github.com/HyukjinKwon/4f59ddcc7b6487a02da81800baca533c).

The downside looks it now takes about 7ish mins, (it was 2ish mins before) in my local.

## How was this patch tested?

Manually, `./dev/lint-r` after manually updating the lintr package.

Author: hyukjinkwon <gurwls223@gmail.com>
Author: zuotingbing <zuo.tingbing9@zte.com.cn>

Closes #19290 from HyukjinKwon/upgrade-r-lint.
2017-10-01 18:42:45 +09:00
Takeshi Yamamuro c6610a997f [SPARK-22122][SQL] Use analyzed logical plans to count input rows in TPCDSQueryBenchmark
## What changes were proposed in this pull request?
Since the current code ignores WITH clauses to check input relations in TPCDS queries, this leads to inaccurate per-row processing time for benchmark results. For example, in `q2`, this fix could catch all the input relations: `web_sales`, `date_dim`, and `catalog_sales` (the current code catches `date_dim` only). The one-third of the TPCDS queries uses WITH clauses, so I think it is worth fixing this.

## How was this patch tested?
Manually checked.

Author: Takeshi Yamamuro <yamamuro@apache.org>

Closes #19344 from maropu/RespectWithInTPCDSBench.
2017-09-29 21:36:52 -07:00
gatorsmile 530fe68329 [SPARK-21904][SQL] Rename tempTables to tempViews in SessionCatalog
### What changes were proposed in this pull request?
`tempTables` is not right. To be consistent, we need to rename the internal variable names/comments to tempViews in SessionCatalog too.

### How was this patch tested?
N/A

Author: gatorsmile <gatorsmile@gmail.com>

Closes #19117 from gatorsmile/renameTempTablesToTempViews.
2017-09-29 19:35:32 -07:00
gatorsmile 472864014c Revert "[SPARK-22142][BUILD][STREAMING] Move Flume support behind a profile"
This reverts commit a2516f41ae.
2017-09-29 11:45:58 -07:00
gatorsmile 9ed7394a68 [SPARK-22161][SQL] Add Impala-modified TPC-DS queries
## What changes were proposed in this pull request?

Added IMPALA-modified TPCDS queries to TPC-DS query suites.

- Ref: https://github.com/cloudera/impala-tpcds-kit/tree/master/queries

## How was this patch tested?
N/A

Author: gatorsmile <gatorsmile@gmail.com>

Closes #19386 from gatorsmile/addImpalaQueries.
2017-09-29 08:59:42 -07:00
Holden Karau ecbe416ab5 [SPARK-22129][SPARK-22138] Release script improvements
## What changes were proposed in this pull request?

Use the GPG_KEY param, fix lsof to non-hardcoded path, remove version swap since it wasn't really needed. Use EXPORT on JAVA_HOME for downstream scripts as well.

## How was this patch tested?

Rolled 2.1.2 RC2

Author: Holden Karau <holden@us.ibm.com>

Closes #19359 from holdenk/SPARK-22129-fix-signing.
2017-09-29 08:04:14 -07:00
Sean Owen a2516f41ae [SPARK-22142][BUILD][STREAMING] Move Flume support behind a profile
## What changes were proposed in this pull request?

Add 'flume' profile to enable Flume-related integration modules

## How was this patch tested?

Existing tests; no functional change

Author: Sean Owen <sowen@cloudera.com>

Closes #19365 from srowen/SPARK-22142.
2017-09-29 08:26:53 +01:00
Wang Gengliang 0fa4dbe4f4 [SPARK-22141][FOLLOWUP][SQL] Add comments for the order of batches
## What changes were proposed in this pull request?
Add comments for specifying the position of  batch "Check Cartesian Products", as rxin suggested in https://github.com/apache/spark/pull/19362 .

## How was this patch tested?
Unit test

Author: Wang Gengliang <ltnwgl@gmail.com>

Closes #19379 from gengliangwang/SPARK-22141-followup.
2017-09-28 23:23:30 -07:00
Marco Gaido 161ba7eaa4 [SPARK-22146] FileNotFoundException while reading ORC files containing special characters
## What changes were proposed in this pull request?

Reading ORC files containing special characters like '%' fails with a FileNotFoundException.
This PR aims to fix the problem.

## How was this patch tested?

Added UT.

Author: Marco Gaido <marcogaido91@gmail.com>
Author: Marco Gaido <mgaido@hortonworks.com>

Closes #19368 from mgaido91/SPARK-22146.
2017-09-28 23:14:53 -07:00
Reynold Xin 323806e68f [SPARK-22160][SQL] Make sample points per partition (in range partitioner) configurable and bump the default value up to 100
## What changes were proposed in this pull request?
Spark's RangePartitioner hard codes the number of sampling points per partition to be 20. This is sometimes too low. This ticket makes it configurable, via spark.sql.execution.rangeExchange.sampleSizePerPartition, and raises the default in Spark SQL to be 100.

## How was this patch tested?
Added a pretty sophisticated test based on chi square test ...

Author: Reynold Xin <rxin@databricks.com>

Closes #19387 from rxin/SPARK-22160.
2017-09-28 21:07:12 -07:00
Reynold Xin d29d1e8799 [SPARK-22159][SQL] Make config names consistently end with "enabled".
## What changes were proposed in this pull request?
spark.sql.execution.arrow.enable and spark.sql.codegen.aggregate.map.twolevel.enable -> enabled

## How was this patch tested?
N/A

Author: Reynold Xin <rxin@databricks.com>

Closes #19384 from rxin/SPARK-22159.
2017-09-28 15:59:05 -07:00
Reynold Xin d74dee1336 [SPARK-22153][SQL] Rename ShuffleExchange -> ShuffleExchangeExec
## What changes were proposed in this pull request?
For some reason when we added the Exec suffix to all physical operators, we missed this one. I was looking for this physical operator today and couldn't find it, because I was looking for ExchangeExec.

## How was this patch tested?
This is a simple rename and should be covered by existing tests.

Author: Reynold Xin <rxin@databricks.com>

Closes #19376 from rxin/SPARK-22153.
2017-09-28 09:20:37 -07:00