Commit graph

17419 commits

Author SHA1 Message Date
petermaxlee a7b02db457 [SPARK-17015][SQL] group-by/order-by ordinal and arithmetic tests
## What changes were proposed in this pull request?
This patch adds three test files:
1. arithmetic.sql.out
2. order-by-ordinal.sql
3. group-by-ordinal.sql

This includes https://github.com/apache/spark/pull/14594.

## How was this patch tested?
This is a test case change.

Author: petermaxlee <petermaxlee@gmail.com>

Closes #14595 from petermaxlee/SPARK-17015.
2016-08-11 01:43:08 -07:00
petermaxlee 0db373aaf8 [SPARK-17011][SQL] Support testing exceptions in SQLQueryTestSuite
## What changes were proposed in this pull request?
This patch adds exception testing to SQLQueryTestSuite. When there is an exception in query execution, the query result contains the exception class along with the exception message.

As part of this, I moved some additional test cases for limit from SQLQuerySuite over to SQLQueryTestSuite.

## How was this patch tested?
This is a test harness change.

Author: petermaxlee <petermaxlee@gmail.com>

Closes #14592 from petermaxlee/SPARK-17011.
2016-08-10 23:22:14 -07:00
Tao Wang 7a6a3c3fbc [SPARK-17010][MINOR][DOC] Wrong description in memory management document
## What changes were proposed in this pull request?

Change the remaining percentage to the correct one.

## How was this patch tested?

Manual review

Author: Tao Wang <wangtao111@huawei.com>

Closes #14591 from WangTaoTheTonic/patch-1.
2016-08-10 22:30:18 -07:00
petermaxlee 665e175328 [SPARK-17007][SQL] Move test data files into a test-data folder
## What changes were proposed in this pull request?
This patch moves all the test data files in sql/core/src/test/resources to sql/core/src/test/resources/test-data, so we don't clutter the top level sql/core/src/test/resources. Also deleted sql/core/src/test/resources/old-repeated.parquet since it is no longer used.

The change will make it easier to spot sql-tests directory.

## How was this patch tested?
This is a test-only change.

Author: petermaxlee <petermaxlee@gmail.com>

Closes #14589 from petermaxlee/SPARK-17007.
2016-08-10 21:26:46 -07:00
petermaxlee 425c7c2dbd [SPARK-17008][SPARK-17009][SQL] Normalization and isolation in SQLQueryTestSuite.
## What changes were proposed in this pull request?
This patch enhances SQLQueryTestSuite in two ways:

1. SPARK-17009: Use a new SparkSession for each test case to provide stronger isolation (e.g. config changes in one test case do not impact another). That said, we do not currently isolate catalog changes.
2. SPARK-17008: Normalize query output using sorting, inspired by HiveComparisonTest.
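
The normalization in item 2 can be illustrated with a minimal sketch, assuming each output row has already been rendered as a string. The class and method names are hypothetical and are not taken from SQLQueryTestSuite itself.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

// Hypothetical helper for illustration; not the actual SQLQueryTestSuite code.
final class OutputNormalizer {
    // Returns a sorted copy of the rows so that row order does not affect comparison.
    static List<String> normalize(List<String> rows) {
        List<String> sorted = new ArrayList<>(rows);
        Collections.sort(sorted);
        return sorted;
    }

    public static void main(String[] args) {
        List<String> expected = Arrays.asList("1\ta", "2\tb");
        List<String> actual = Arrays.asList("2\tb", "1\ta");
        // Both sides are normalized before being compared.
        System.out.println(normalize(expected).equals(normalize(actual)));
    }
}
```

Sorting trades away the ability to check ordering, so a query whose result order matters would need a different comparison.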

I also ported a few new test cases over from SQLQuerySuite.

## How was this patch tested?
This is a test harness update.

Author: petermaxlee <petermaxlee@gmail.com>

Closes #14590 from petermaxlee/SPARK-17008.
2016-08-10 21:05:32 -07:00
jerryshao ab648c0004 [SPARK-14743][YARN] Add a configurable credential manager for Spark running on YARN
## What changes were proposed in this pull request?

Add a configurable token manager for Spark running on YARN.

### Current Problems ###

1. The supported token providers are hard-coded: currently only hdfs, hbase and hive are supported, and it is impossible for users to add a new token provider without code changes.
2. The same problem exists in the periodic token renewer and updater.

### Changes In This Proposal ###

To address the problems mentioned above and make the current code cleaner and easier to understand, this proposal makes 3 main changes:

1. Abstract a `ServiceTokenProvider` as well as a `ServiceTokenRenewable` interface for token providers. Each service that wants to communicate with Spark via tokens needs to implement this interface.
2. Provide a `ConfigurableTokenManager` to manage all the registered token providers, as well as the token renewer and updater. This class also offers the API for other modules to obtain tokens, get the renewal interval and so on.
3. Implement 3 built-in token providers, `HDFSTokenProvider`, `HiveTokenProvider` and `HBaseTokenProvider`, to keep the same semantics as supported today. Whether to load these built-in token providers is controlled by the configuration "spark.yarn.security.tokens.${service}.enabled"; by default all the built-in token providers are loaded.

### Behavior Changes ###

For the end user there is no behavior change: we still use the same configuration `spark.yarn.security.tokens.${service}.enabled` to decide which token provider is enabled (hbase or hive).

A user-implemented token provider (assume its name is "test") that needs to be added to this class should have two configurations:

1. `spark.yarn.security.tokens.test.enabled` set to true
2. `spark.yarn.security.tokens.test.class` set to the fully qualified class name.

So we keep the same semantics as the current code while adding one new configuration.
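
A minimal sketch of how the two keys could be read, assuming the configuration is available as a plain string map. `TokenProviderConf` and `com.example.TestTokenProvider` are hypothetical names used only for illustration; this is not the `ConfigurableTokenManager` implementation.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical reader for the keys described above; not Spark's actual code.
final class TokenProviderConf {
    static final String PREFIX = "spark.yarn.security.tokens.";

    // A missing key falls back to the supplied default (built-ins are loaded by default).
    static boolean isEnabled(Map<String, String> conf, String service, boolean defaultValue) {
        String value = conf.get(PREFIX + service + ".enabled");
        return value == null ? defaultValue : Boolean.parseBoolean(value);
    }

    // Returns the configured provider class name, or null when none is set.
    static String providerClass(Map<String, String> conf, String service) {
        return conf.get(PREFIX + service + ".class");
    }

    public static void main(String[] args) {
        Map<String, String> conf = new HashMap<>();
        conf.put(PREFIX + "test.enabled", "true");
        conf.put(PREFIX + "test.class", "com.example.TestTokenProvider"); // hypothetical class
        System.out.println(isEnabled(conf, "test", false) + " " + providerClass(conf, "test"));
    }
}
```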

### Current Status ###

- [x] token provider interface and management framework.
- [x] implement built-in token providers (hdfs, hbase, hive).
- [x] Coverage of unit test.
- [x] Integrated test with security cluster.

## How was this patch tested?

Unit test and integrated test.

Please suggest and review, any comment is greatly appreciated.

Author: jerryshao <sshao@hortonworks.com>

Closes #14065 from jerryshao/SPARK-16342.
2016-08-10 15:39:30 -07:00
Rajesh Balamohan bd2c12fb49 [SPARK-12920][CORE] Honor "spark.ui.retainedStages" to reduce mem-pressure
When a large number of jobs run concurrently on the Spark thrift server, the thrift server starts running at high CPU due to GC pressure. Job UI retention causes memory pressure with large jobs. https://issues.apache.org/jira/secure/attachment/12783302/SPARK-12920.profiler_job_progress_listner.png has the profiler snapshot. This PR honors `spark.ui.retainedStages` strictly to reduce memory pressure.

Manual and unit tests

Author: Rajesh Balamohan <rbalamohan@apache.org>

Closes #10846 from rajeshbalamohan/SPARK-12920.
2016-08-10 15:30:52 -07:00
Qifan Pu bf5cb8af4a [SPARK-16928] [SQL] Recursive call of ColumnVector::getInt() breaks JIT inlining
## What changes were proposed in this pull request?

In both `OnHeapColumnVector` and `OffHeapColumnVector`, we implemented `getInt()` with the following code pattern:
```
public int getInt(int rowId) {
  if (dictionary == null) {
    return intData[rowId];
  } else {
    return dictionary.decodeToInt(dictionaryIds.getInt(rowId));
  }
}
```
As `dictionaryIds` is also a `ColumnVector`, this results in a recursive call of `getInt()` and breaks JIT inlining. As a result, `getInt()` will not get inlined.

We fix this by adding a separate method `getDictId()` specific for `dictionaryIds` to use.
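
The shape of the fix can be sketched as follows. This is a simplified stand-in, assuming the dictionary is a plain `int[]` instead of the real dictionary object; `SimpleIntVector` and its factory methods are hypothetical and are not the actual `OnHeapColumnVector` code.

```java
// Simplified stand-in for the column vector classes; not Spark's actual code.
final class SimpleIntVector {
    private final int[] intData;                 // raw values, or dictionary ids
    private final int[] dictionary;              // decoded values; null if not dictionary-encoded
    private final SimpleIntVector dictionaryIds; // ids vector; null if not dictionary-encoded

    private SimpleIntVector(int[] intData, int[] dictionary, SimpleIntVector dictionaryIds) {
        this.intData = intData;
        this.dictionary = dictionary;
        this.dictionaryIds = dictionaryIds;
    }

    static SimpleIntVector plain(int[] values) {
        return new SimpleIntVector(values, null, null);
    }

    static SimpleIntVector dictionaryEncoded(int[] dictionary, int[] ids) {
        return new SimpleIntVector(null, dictionary, plain(ids));
    }

    int getInt(int rowId) {
        if (dictionary == null) {
            return intData[rowId];
        }
        // Calls getDictId() rather than getInt(), so getInt() never calls itself.
        return dictionary[dictionaryIds.getDictId(rowId)];
    }

    // Dedicated accessor for the ids vector.
    int getDictId(int rowId) {
        return intData[rowId];
    }

    public static void main(String[] args) {
        SimpleIntVector v = dictionaryEncoded(new int[] {10, 20}, new int[] {1, 0, 1});
        System.out.println(v.getInt(0) + " " + v.getInt(1) + " " + v.getInt(2));
    }
}
```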

## How was this patch tested?

We tested the difference with the following aggregate query on a TPCDS dataset (with scale factor = 5):
```
select
  max(ss_sold_date_sk) as max_ss_sold_date_sk
from store_sales
```
The query runtime is improved, from 202ms (before) to 159ms (after).

Author: Qifan Pu <qifan.pu@gmail.com>

Closes #14513 from ooq/SPARK-16928.
2016-08-10 14:45:13 -07:00
Junyang Qian 214ba66a03 [SPARK-16579][SPARKR] add install.spark function
## What changes were proposed in this pull request?

Add an `install.spark` function to the SparkR package. Users can run `install.spark()` to install Spark to a local directory from within R.

Updates:

Several changes have been made:

- `install.spark()`
    - check existence of tar file in the cache folder, and download only if not found
    - mirror_url look-up is tried in this order: user-provided -> preferred mirror site from apache website -> hardcoded backup option
    - use 2.0.0

- `sparkR.session()`
    - can install spark when not found in `SPARK_HOME`

## How was this patch tested?

Manual tests, running the check-cran.sh script added in #14173.

Author: Junyang Qian <junyangq@databricks.com>

Closes #14258 from junyangq/SPARK-16579.
2016-08-10 11:18:23 -07:00
Yanbo Liang d4a9122430 [SPARK-16710][SPARKR][ML] spark.glm should support weightCol
## What changes were proposed in this pull request?
Training GLMs on weighted datasets is a very important use case, but it is not currently supported by SparkR. In native R, users can pass the argument ```weights``` to specify the weights vector. For ```spark.glm```, we can pass in ```weightCol```, which is consistent with MLlib.

## How was this patch tested?
Unit test.

Author: Yanbo Liang <ybliang8@gmail.com>

Closes #14346 from yanboliang/spark-16710.
2016-08-10 10:53:48 -07:00
Liang-Chi Hsieh 19af298bb6 [SPARK-15639] [SPARK-16321] [SQL] Push down filter at RowGroups level for parquet reader
## What changes were proposed in this pull request?

The base class `SpecificParquetRecordReaderBase`, used by the vectorized Parquet reader, tries to get pushed-down filters from the given configuration. These pushed-down filters are used for RowGroups-level filtering. However, we don't set the filters to push down in the configuration, so the filters are not actually pushed down to do RowGroups-level filtering. This patch fixes this by setting up the filters for pushdown in the configuration for the reader.
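
As a rough illustration of what RowGroups-level filtering means, the sketch below skips row groups whose minimum value cannot satisfy a `value < bound` predicate. It assumes each row group exposes min/max statistics for the column; `RowGroupFilter` and `RowGroup` are hypothetical names and this is not the Parquet reader API.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Toy model of statistics-based row group skipping; not the Parquet library API.
final class RowGroupFilter {
    static final class RowGroup {
        final int min;
        final int max;

        RowGroup(int min, int max) {
            this.min = min;
            this.max = max;
        }
    }

    // Keeps only the row groups that may contain a value strictly less than bound.
    static List<RowGroup> keepForLessThan(List<RowGroup> groups, int bound) {
        List<RowGroup> kept = new ArrayList<>();
        for (RowGroup group : groups) {
            if (group.min < bound) {
                kept.add(group);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        // Every value in the filtered column is 101, so no group can match "< 100".
        List<RowGroup> groups = Arrays.asList(new RowGroup(101, 101), new RowGroup(101, 101));
        System.out.println(keepForLessThan(groups, 100).size());
    }
}
```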

The benchmark, which excludes the time for writing the Parquet file:

    test("Benchmark for Parquet") {
      val N = 500 << 12
      withParquetTable((0 until N).map(i => (101, i)), "t") {
        val benchmark = new Benchmark("Parquet reader", N)
        benchmark.addCase("reading Parquet file", 10) { iter =>
          sql("SELECT _1 FROM t where t._1 < 100").collect()
        }
        benchmark.run()
      }
    }

By default, `withParquetTable` runs tests for both the vectorized and non-vectorized readers. I only let it run the vectorized reader.

When we set the Parquet block size to 1024 to have multiple row groups, the benchmark is:

Before this patch:

The retrieved row groups: 8063

    Java HotSpot(TM) 64-Bit Server VM 1.8.0_71-b15 on Linux 3.19.0-25-generic
    Intel(R) Core(TM) i7-5557U CPU  3.10GHz
    Parquet reader:                          Best/Avg Time(ms)    Rate(M/s)   Per Row(ns)   Relative
    ------------------------------------------------------------------------------------------------
    reading Parquet file                           825 / 1233          2.5         402.6       1.0X

After this patch:

The retrieved row groups: 0

    Java HotSpot(TM) 64-Bit Server VM 1.8.0_71-b15 on Linux 3.19.0-25-generic
    Intel(R) Core(TM) i7-5557U CPU  3.10GHz
    Parquet reader:                          Best/Avg Time(ms)    Rate(M/s)   Per Row(ns)   Relative
    ------------------------------------------------------------------------------------------------
    reading Parquet file                           306 /  503          6.7         149.6       1.0X

Next, I ran the benchmark for the non-pushdown case using the same benchmark code but with the pushdown configuration disabled. This time the Parquet block size is the default value.

Before this patch:

    Java HotSpot(TM) 64-Bit Server VM 1.8.0_71-b15 on Linux 3.19.0-25-generic
    Intel(R) Core(TM) i7-5557U CPU  3.10GHz
    Parquet reader:                          Best/Avg Time(ms)    Rate(M/s)   Per Row(ns)   Relative
    ------------------------------------------------------------------------------------------------
    reading Parquet file                           136 /  238         15.0          66.5       1.0X

After this patch:

    Java HotSpot(TM) 64-Bit Server VM 1.8.0_71-b15 on Linux 3.19.0-25-generic
    Intel(R) Core(TM) i7-5557U CPU  3.10GHz
    Parquet reader:                          Best/Avg Time(ms)    Rate(M/s)   Per Row(ns)   Relative
    ------------------------------------------------------------------------------------------------
    reading Parquet file                           124 /  193         16.5          60.7       1.0X

For the non-pushdown case, the results suggest that this patch doesn't affect the normal code path.

I've manually output the `totalRowCount` in `SpecificParquetRecordReaderBase` to see if this patch actually filters the row groups. When running the above benchmark:

After this patch:
    `totalRowCount = 0`

Before this patch:
    `totalRowCount = 1024000`

## How was this patch tested?
Existing tests should pass.

Author: Liang-Chi Hsieh <simonh@tw.ibm.com>

Closes #13701 from viirya/vectorized-reader-push-down-filter2.
2016-08-10 10:03:55 -07:00
avulanov 11a6844beb [SPARK-15899][SQL] Fix the construction of the file path with hadoop Path
## What changes were proposed in this pull request?

Fix the construction of the file path. The previous way of constructing it created an incorrect path on Windows.

## How was this patch tested?

Run SQL unit tests on Windows

Author: avulanov <nashb@yandex.ru>

Closes #13868 from avulanov/SPARK-15899-file.
2016-08-10 10:25:00 +01:00
petermaxlee b9f8a11709 [SPARK-16866][SQL] Infrastructure for file-based SQL end-to-end tests
## What changes were proposed in this pull request?
This patch introduces SQLQueryTestSuite, a basic framework for end-to-end SQL test cases defined in spark/sql/core/src/test/resources/sql-tests. This is a more standard way to test SQL queries end-to-end in different open source database systems, because it is more manageable to work with files.

This is inspired by HiveCompatibilitySuite, but simplified for general Spark SQL tests. Once this is merged, I can work towards porting SQLQuerySuite over, and eventually also move the existing HiveCompatibilitySuite to use this framework.

Unlike HiveCompatibilitySuite, SQLQueryTestSuite compares both the output schema and the output data (in string form).

When there is a mismatch, the error message looks like the following:

```
[info] - blacklist.sql !!! IGNORED !!!
[info] - number-format.sql *** FAILED *** (2 seconds, 405 milliseconds)
[info]   Expected "...147483648	-214748364[8]", but got "...147483648	-214748364[9]" Result should match for query #1 (SQLQueryTestSuite.scala:171)
[info]   org.scalatest.exceptions.TestFailedException:
[info]   at org.scalatest.Assertions$class.newAssertionFailedException(Assertions.scala:495)
[info]   at org.scalatest.FunSuite.newAssertionFailedException(FunSuite.scala:1555)
[info]   at org.scalatest.Assertions$class.assertResult(Assertions.scala:1171)
```

## How was this patch tested?
This is a test infrastructure change.

Author: petermaxlee <petermaxlee@gmail.com>

Closes #14472 from petermaxlee/SPARK-16866.
2016-08-10 17:17:21 +08:00
Sean Owen 0578ff9681 [SPARK-16324][SQL] regexp_extract should doc that it returns empty string when match fails
## What changes were proposed in this pull request?

Doc that regexp_extract returns empty string when regex or group does not match

## How was this patch tested?

Jenkins test, with a few new test cases

Author: Sean Owen <sowen@cloudera.com>

Closes #14525 from srowen/SPARK-16324.
2016-08-10 10:14:43 +01:00
Timothy Chen eca58755fb [SPARK-16927][SPARK-16923] Override task properties at dispatcher.
## What changes were proposed in this pull request?

- enable setting default properties for all jobs submitted through the dispatcher [SPARK-16927]
- remove duplication of conf vars on cluster submitted jobs [SPARK-16923] (this is a small fix, so I'm including in the same PR)

## How was this patch tested?

mesos/spark integration test suite
manual testing

Author: Timothy Chen <tnachen@gmail.com>

Closes #14511 from mgummelt/override-props.
2016-08-10 10:11:03 +01:00
Andrew Ash bfda53f63a Typo: Fow -> For
Author: Andrew Ash <andrew@andrewash.com>

Closes #14563 from ash211/patch-8.
2016-08-10 10:09:35 +01:00
gatorsmile 2b10ebe6ac [SPARK-16185][SQL] Better Error Messages When Creating Table As Select Without Enabling Hive Support
#### What changes were proposed in this pull request?
When Hive support is not enabled, the following query makes the planner generate a confusing error message:
```Scala
sql("CREATE TABLE t2 SELECT a, b from t1")
```

```
assertion failed: No plan for CreateTable CatalogTable(
	Table: `t2`
	Created: Tue Aug 09 23:45:32 PDT 2016
	Last Access: Wed Dec 31 15:59:59 PST 1969
	Type: MANAGED
	Provider: hive
	Storage(InputFormat: org.apache.hadoop.mapred.TextInputFormat, OutputFormat: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat)), ErrorIfExists
+- Relation[a#19L,b#20L] parquet

java.lang.AssertionError: assertion failed: No plan for CreateTable CatalogTable(
	Table: `t2`
	Created: Tue Aug 09 23:45:32 PDT 2016
	Last Access: Wed Dec 31 15:59:59 PST 1969
	Type: MANAGED
	Provider: hive
	Storage(InputFormat: org.apache.hadoop.mapred.TextInputFormat, OutputFormat: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat)), ErrorIfExists
+- Relation[a#19L,b#20L] parquet
```

This PR is to issue a better error message:
```
Hive support is required to use CREATE Hive TABLE AS SELECT
```

#### How was this patch tested?
Added test cases in `DDLSuite.scala`

Author: gatorsmile <gatorsmile@gmail.com>

Closes #13886 from gatorsmile/createCatalogedTableAsSelect.
2016-08-10 17:05:50 +08:00
Dongjoon Hyun 41a7dbdd34 [SPARK-10601][SQL] Support MINUS set operator
## What changes were proposed in this pull request?

This PR adds the `MINUS` set operator, which is equivalent to `EXCEPT DISTINCT`. This will slightly improve compatibility with Oracle.
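
The semantics of `EXCEPT DISTINCT` (and therefore `MINUS`) can be sketched on in-memory lists: the result holds the distinct rows of the left input that do not appear in the right input. `MinusSemantics` is a hypothetical illustration, not how Spark plans the operator.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

// Illustrates the set semantics only; not Spark's implementation.
final class MinusSemantics {
    static List<Integer> minus(List<Integer> left, List<Integer> right) {
        // LinkedHashSet removes duplicates from the left side and keeps first-seen order.
        Set<Integer> result = new LinkedHashSet<>(left);
        result.removeAll(new HashSet<>(right));
        return new ArrayList<>(result);
    }

    public static void main(String[] args) {
        System.out.println(minus(Arrays.asList(1, 1, 2, 3), Arrays.asList(2)));
    }
}
```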

## How was this patch tested?

Passes the Jenkins tests with newly added test cases.

Author: Dongjoon Hyun <dongjoon@apache.org>

Closes #14570 from dongjoon-hyun/SPARK-10601.
2016-08-10 10:31:30 +02:00
gatorsmile bdd537164d [SPARK-16959][SQL] Rebuild Table Comment when Retrieving Metadata from Hive Metastore
### What changes were proposed in this pull request?
The `comment` in `CatalogTable` returned from Hive is always empty. We store it in the table property when creating a table. However, when we try to retrieve the table metadata from Hive metastore, we do not rebuild it. The `comment` is always empty.

This PR is to fix the issue.

### How was this patch tested?
Fixed the test case to verify the change.

Author: gatorsmile <gatorsmile@gmail.com>

Closes #14550 from gatorsmile/tableComment.
2016-08-10 16:25:01 +08:00
Xin Ren 1203c8415c [MINOR][SPARKR] R API documentation for "coltypes" is confusing
## What changes were proposed in this pull request?

R API documentation for "coltypes" is confusing, found when working on another ticket.

The current version, http://spark.apache.org/docs/2.0.0/api/R/coltypes.html, lists the parameter "x" twice, and the example is not very clear.

![current](https://cloud.githubusercontent.com/assets/3925641/17386808/effb98ce-59a2-11e6-9657-d477d258a80c.png)

![screen shot 2016-08-03 at 5 56 00 pm](https://cloud.githubusercontent.com/assets/3925641/17386884/91831096-59a3-11e6-84af-39890b3d45d8.png)

## How was this patch tested?

Tested manually on a local machine. The screenshots are below:

![screen shot 2016-08-07 at 11 29 20 pm](https://cloud.githubusercontent.com/assets/3925641/17471144/df36633c-5cf6-11e6-8238-4e32ead0e529.png)

![screen shot 2016-08-03 at 5 56 22 pm](https://cloud.githubusercontent.com/assets/3925641/17386896/9d36cb26-59a3-11e6-9619-6dae29f7ab17.png)

Author: Xin Ren <iamshrek@126.com>

Closes #14489 from keypointt/rExample.
2016-08-10 00:49:06 -07:00
Michał Kiełbowicz 9dc3e602d7 Fixed typo
## What changes were proposed in this pull request?

Fixed small typo - "value ... ~~in~~ is null"

## How was this patch tested?

Still compiles!

Author: Michał Kiełbowicz <jupblb@users.noreply.github.com>

Closes #14569 from jupblb/typo-fix.
2016-08-09 23:01:50 -07:00
Andrew Ash 121643bc76 Make logDir easily copy/paste-able
In many terminals double-clicking and dragging also includes the trailing period.  Simply remove this to make the value more easily copy/pasteable.

Example value:
`hdfs://mybox-123.net.example.com:8020/spark-events.`

Author: Andrew Ash <andrew@andrewash.com>

Closes #14566 from ash211/patch-9.
2016-08-09 21:11:52 -07:00
Josh Rosen b89b3a5c8e [SPARK-16956] Make ApplicationState.MAX_NUM_RETRY configurable
## What changes were proposed in this pull request?

This patch introduces a new configuration, `spark.deploy.maxExecutorRetries`, to let users configure an obscure behavior in the standalone master where the master will kill Spark applications which have experienced too many back-to-back executor failures. The current setting is a hardcoded constant (10); this patch replaces that with a new cluster-wide configuration.

**Background:** This application-killing was added in 6b5980da79 (from September 2012) and I believe that it was designed to prevent a faulty application whose executors could never launch from DOS'ing the Spark cluster via an infinite series of executor launch attempts. In a subsequent patch (#1360), this feature was refined to prevent applications which have running executors from being killed by this code path.

**Motivation for making this configurable:** Previously, if a Spark Standalone application experienced more than `ApplicationState.MAX_NUM_RETRY` executor failures and was left with no executors running then the Spark master would kill that application, but this behavior is problematic in environments where the Spark executors run on unstable infrastructure and can all simultaneously die. For instance, if your Spark driver runs on an on-demand EC2 instance while all workers run on ephemeral spot instances then it's possible for all executors to die at the same time while the driver stays alive. In this case, it may be desirable to keep the Spark application alive so that it can recover once new workers and executors are available. In order to accommodate this use-case, this patch modifies the Master to never kill faulty applications if `spark.deploy.maxExecutorRetries` is negative.
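
The decision described above can be sketched as a small predicate. `ExecutorRetryPolicy` is a hypothetical name, and whether the threshold comparison is `>=` or `>` is an assumption of this sketch, not something taken from the Master code.

```java
// Hypothetical sketch of the kill decision; not the standalone Master's actual code.
final class ExecutorRetryPolicy {
    static boolean shouldKillApplication(int failureCount, int maxExecutorRetries,
                                         boolean hasRunningExecutors) {
        // A negative spark.deploy.maxExecutorRetries disables application killing.
        if (maxExecutorRetries < 0) {
            return false;
        }
        // Assumption: the limit is reached at failureCount >= maxExecutorRetries.
        return failureCount >= maxExecutorRetries && !hasRunningExecutors;
    }

    public static void main(String[] args) {
        System.out.println(shouldKillApplication(20, -1, false));
        System.out.println(shouldKillApplication(10, 10, false));
    }
}
```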

I'd like to merge this patch into master, branch-2.0, and branch-1.6.

## How was this patch tested?

I tested this manually using `spark-shell` and `local-cluster` mode. This is a tricky feature to unit test and historically this code has not changed very often, so I'd prefer to skip the additional effort of adding a testing framework and would rather rely on manual tests and review for now.

Author: Josh Rosen <joshrosen@databricks.com>

Closes #14544 from JoshRosen/add-setting-for-max-executor-failures.
2016-08-09 11:21:45 -07:00
Davies Liu 92da22878b [SPARK-16905] SQL DDL: MSCK REPAIR TABLE
## What changes were proposed in this pull request?

MSCK REPAIR TABLE can be used to recover the partitions in the external catalog based on the partitions in the file system.

Another syntax is: ALTER TABLE table RECOVER PARTITIONS

The implementation in this PR only lists partitions (not the files within a partition) in the driver (in parallel if needed).

## How was this patch tested?

Added unit tests for it and Hive compatibility test suite.

Author: Davies Liu <davies@databricks.com>

Closes #14500 from davies/repair_table.
2016-08-09 10:04:36 -07:00
Mariusz Strzelecki 29081b587f [SPARK-16950] [PYSPARK] fromOffsets parameter support in KafkaUtils.createDirectStream for python3
## What changes were proposed in this pull request?

Adds the ability to use `KafkaUtils.createDirectStream` with starting offsets in Python 3 by using `java.lang.Number` instead of `Long` during parameter mapping in the Scala helper. This allows py4j to pass Integer or Long to the map and resolves the ClassCastException problems.
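
The idea can be sketched in isolation: accepting `java.lang.Number` and calling `longValue()` works whether the caller sends an `Integer` or a `Long`, whereas casting an `Integer` to `Long` fails. `OffsetConversion` and the string keys are hypothetical; the real helper maps topic/partition objects.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch of the conversion; not the actual Kafka helper code.
final class OffsetConversion {
    // Accepts Integer or Long values, since the Python side may send either.
    static Map<String, Long> toLongOffsets(Map<String, ? extends Number> fromOffsets) {
        Map<String, Long> result = new HashMap<>();
        for (Map.Entry<String, ? extends Number> entry : fromOffsets.entrySet()) {
            result.put(entry.getKey(), entry.getValue().longValue());
        }
        return result;
    }

    public static void main(String[] args) {
        Map<String, Number> offsets = new HashMap<>();
        offsets.put("topic-0", Integer.valueOf(5)); // illustrative key
        offsets.put("topic-1", Long.valueOf(7L));
        System.out.println(toLongOffsets(offsets));
    }
}
```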

## How was this patch tested?

unit tests

jerryshao  - could you please look at this PR?

Author: Mariusz Strzelecki <mariusz.strzelecki@allegrogroup.com>

Closes #14540 from szczeles/kafka_pyspark.
2016-08-09 09:44:43 -07:00
Yanbo Liang 182e11904b [SPARK-16933][ML] Fix AFTAggregator in AFTSurvivalRegression serializes unnecessary data.
## What changes were proposed in this pull request?
Similar to ```LeastSquaresAggregator``` in #14109, ```AFTAggregator``` used for ```AFTSurvivalRegression``` ends up serializing the ```parameters``` and ```featuresStd```, which is not necessary and can cause performance issues for high dimensional data. This patch removes this serialization. This PR is highly inspired by #14109.

## How was this patch tested?
I tested this locally and verified the serialization reduction.

Before patch
![image](https://cloud.githubusercontent.com/assets/1962026/17512035/abb93f04-5dda-11e6-97d3-8ae6b61a0dfd.png)

After patch
![image](https://cloud.githubusercontent.com/assets/1962026/17512024/9e0dc44c-5dda-11e6-93d0-6e130ba0d6aa.png)

Author: Yanbo Liang <ybliang8@gmail.com>

Closes #14519 from yanboliang/spark-16933.
2016-08-09 03:39:57 -07:00
Reynold Xin 511f52f842 [SPARK-16964][SQL] Remove private[sql] and private[spark] from sql.execution package
## What changes were proposed in this pull request?
This package is meant to be internal, and as a result it does not make sense to mark things as private[sql] or private[spark]. It simply makes debugging harder when Spark developers need to inspect the plans at runtime.

This patch removes all private[sql] and private[spark] visibility modifiers in org.apache.spark.sql.execution.

## How was this patch tested?
N/A - just visibility changes.

Author: Reynold Xin <rxin@databricks.com>

Closes #14554 from rxin/remote-private.
2016-08-09 18:22:14 +08:00
Michael Gummelt 62e6212441 [SPARK-16809] enable history server links in dispatcher UI
## What changes were proposed in this pull request?

Links the Spark Mesos Dispatcher UI to the history server UI

- adds spark.mesos.dispatcher.historyServer.url
- explicitly generates frameworkIDs for the launched drivers, so the dispatcher knows how to correlate drivers and frameworkIDs

## How was this patch tested?

manual testing

Author: Michael Gummelt <mgummelt@mesosphere.io>
Author: Sergiusz Urbaniak <sur@mesosphere.io>

Closes #14414 from mgummelt/history-server.
2016-08-09 10:55:33 +01:00
Dongjoon Hyun 2154345b6a [SPARK-16940][SQL] checkAnswer should raise TestFailedException for wrong results
## What changes were proposed in this pull request?

This PR changes the following line to make `checkAnswer` raise `TestFailedException` again instead of `java.util.NoSuchElementException: key not found: TZ` in environments without the `TZ` variable. This PR also adds a `QueryTestSuite` class for testing `QueryTest` itself.

```scala
- |Timezone Env: ${sys.env("TZ")}
+ |Timezone Env: ${sys.env.getOrElse("TZ", "")}
```
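
The same defensive lookup, written as a Java analog for illustration (the patch itself is the Scala one-liner above). `EnvLookup` is a hypothetical helper.

```java
import java.util.Map;

// Java analog of sys.env.getOrElse; hypothetical helper, not part of the patch.
final class EnvLookup {
    // Returns the fallback instead of failing when the variable is not set.
    static String getOrElse(Map<String, String> env, String key, String fallback) {
        String value = env.get(key);
        return value != null ? value : fallback;
    }

    public static void main(String[] args) {
        System.out.println("Timezone Env: " + getOrElse(System.getenv(), "TZ", ""));
    }
}
```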

## How was this patch tested?

Pass the Jenkins tests with a new test suite.

Author: Dongjoon Hyun <dongjoon@apache.org>

Closes #14528 from dongjoon-hyun/SPARK-16940.
2016-08-09 09:45:46 +01:00
Sun Rui af710e5bdd [SPARK-16522][MESOS] Spark application throws exception on exit.
## What changes were proposed in this pull request?
Spark applications running on Mesos throw an exception upon exit. For details, refer to https://issues.apache.org/jira/browse/SPARK-16522.

I am not sure if there is a better fix, so I will wait for review comments.

## How was this patch tested?
Manual test. Observed that the exception is gone upon application exit.

Author: Sun Rui <sunrui2016@gmail.com>

Closes #14175 from sun-rui/SPARK-16522.
2016-08-09 09:39:45 +01:00
Sean Owen 801e4d097f [SPARK-16606][CORE] Misleading warning for SparkContext.getOrCreate "WARN SparkContext: Use an existing SparkContext, some configuration may not take effect."
## What changes were proposed in this pull request?

SparkContext.getOrCreate shouldn't warn about ignored config if

- it wasn't ignored because a new context is created with it or
- no config was actually provided

## How was this patch tested?

Jenkins + existing tests.

Author: Sean Owen <sowen@cloudera.com>

Closes #14533 from srowen/SPARK-16606.
2016-08-09 09:38:12 +01:00
hyukjinkwon bb2b9d0a42 [SPARK-16610][SQL] Add orc.compress as an alias for compression option.
## What changes were proposed in this pull request?

For the ORC source, Spark SQL has a writer option `compression`, which is used to set the codec; its value is also set to `orc.compress` (the ORC conf used for the codec). However, if a user only sets `orc.compress` in the writer options, we should not use the default value of `compression` (snappy) as the codec. Instead, we should respect the value of `orc.compress`.

This PR makes the ORC data source respect `orc.compress` when `compression` is unset.

So, here is the behaviour,

 1. Check `compression` and use this if it is set.
 2. If `compression` is not set, check `orc.compress` and use it.
 3. If `compression` and `orc.compress` are not set, then use the default snappy.
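
The precedence above can be sketched as a lookup over the writer options. `OrcCompressionResolver` is a hypothetical name and the options are modeled as a plain string map; this is not the ORC data source code.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch of the option precedence; not the actual ORC data source code.
final class OrcCompressionResolver {
    static String resolveCodec(Map<String, String> options) {
        String compression = options.get("compression");
        if (compression != null) {
            return compression;          // 1. explicit `compression` wins
        }
        String orcCompress = options.get("orc.compress");
        if (orcCompress != null) {
            return orcCompress;          // 2. otherwise fall back to `orc.compress`
        }
        return "snappy";                 // 3. otherwise use the default
    }

    public static void main(String[] args) {
        Map<String, String> options = new HashMap<>();
        options.put("orc.compress", "zlib");
        System.out.println(resolveCodec(options));
    }
}
```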

## How was this patch tested?

Unit test in `OrcQuerySuite`.

Author: hyukjinkwon <gurwls223@gmail.com>

Closes #14518 from HyukjinKwon/SPARK-16610.
2016-08-09 10:23:54 +08:00
Alice e17a76efdb [SPARK-16563][SQL] fix spark sql thrift server FetchResults bug
## What changes were proposed in this pull request?

Add a constant iterator that points to the head of the result. This head iterator is used to reset the iterator when results are fetched from the first row repeatedly.
JIRA ticket https://issues.apache.org/jira/browse/SPARK-16563

## How was this patch tested?

This bug was found when using Cloudera HUE to connect to the Spark SQL thrift server: currently a SQL statement's result can be fetched only once. The fix was tested manually with Cloudera HUE. With this fix, HUE can fetch Spark SQL results repeatedly through the thrift server.

Author: Alice <alice.gugu@gmail.com>
Author: Alice <guhq@garena.com>

Closes #14218 from alicegugu/SparkSQLFetchResultsBug.
2016-08-08 18:00:04 -07:00
Sean Zhong bca43cd635 [SPARK-16898][SQL] Adds argument type information for typed logical plan like MapElements, TypedFilter, and AppendColumn
## What changes were proposed in this pull request?

This PR adds argument type information for typed logical plans like MapElements, TypedFilter, and AppendColumn, so that we can use this information in customized optimizer rules.

## How was this patch tested?

Existing test.

Author: Sean Zhong <seanzhong@databricks.com>

Closes #14494 from clockfly/add_more_info_for_typed_operator.
2016-08-09 08:36:50 +08:00
Herman van Hovell df10658831 [SPARK-16749][SQL] Simplify processing logic in LEAD/LAG processing.
## What changes were proposed in this pull request?
The logic for LEAD/LAG processing is more complex than it needs to be. This PR fixes that.

## How was this patch tested?
Existing tests.

Author: Herman van Hovell <hvanhovell@databricks.com>

Closes #14376 from hvanhovell/SPARK-16749.
2016-08-08 16:34:57 -07:00
Michael Gummelt 53d1c78779 Update docs to include SASL support for RPC
## What changes were proposed in this pull request?

Update docs to include SASL support for RPC

Evidence: https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/rpc/netty/NettyRpcEnv.scala#L63

## How was this patch tested?

Docs change only

Author: Michael Gummelt <mgummelt@mesosphere.io>

Closes #14549 from mgummelt/sasl.
2016-08-08 16:07:51 -07:00
Holden Karau 9216901d52 [SPARK-16779][TRIVIAL] Avoid using postfix operators where they do not add much and remove whitelisting
## What changes were proposed in this pull request?

Avoid using postfix operators for command execution in SQLQuerySuite, where they weren't whitelisted, and audit the existing whitelistings, removing postfix operators from most places. Notable places where postfix operators remain are the XML parsing and time units (seconds, millis, etc.), where they arguably improve readability.

## How was this patch tested?

Existing tests.

Author: Holden Karau <holden@us.ibm.com>

Closes #14407 from holdenk/SPARK-16779.
2016-08-08 15:54:03 -07:00
Tathagata Das 8650239050 [SPARK-16953] Make requestTotalExecutors public Developer API to be consistent with requestExecutors/killExecutors
## What changes were proposed in this pull request?

requestExecutors and killExecutors are public developer APIs for managing the number of executors allocated to the SparkContext. For consistency, requestTotalExecutors should also be a public Developer API, as it provides similar functionality. In fact, using requestTotalExecutors is more convenient than requestExecutors, as the former is idempotent and the latter is not.
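
The idempotence point can be shown with a toy model, assuming a single integer target. `ToyExecutorAllocator` is hypothetical and its methods only mirror the names of the SparkContext APIs, not their actual signatures.

```java
// Toy model of the two request styles; not the SparkContext API.
final class ToyExecutorAllocator {
    private int targetExecutors;

    // Additive: repeating the call keeps raising the target.
    void requestExecutors(int additional) {
        targetExecutors += additional;
    }

    // Absolute: repeating the call with the same argument changes nothing.
    void requestTotalExecutors(int total) {
        targetExecutors = total;
    }

    int target() {
        return targetExecutors;
    }

    public static void main(String[] args) {
        ToyExecutorAllocator additive = new ToyExecutorAllocator();
        additive.requestExecutors(4);
        additive.requestExecutors(4);

        ToyExecutorAllocator absolute = new ToyExecutorAllocator();
        absolute.requestTotalExecutors(4);
        absolute.requestTotalExecutors(4);

        System.out.println(additive.target() + " vs " + absolute.target());
    }
}
```

A caller that retries after a timeout can safely repeat the absolute request, while repeating the additive one would over-allocate.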

Author: Tathagata Das <tathagata.das1565@gmail.com>

Closes #14541 from tdas/SPARK-16953.
2016-08-08 12:52:04 -07:00
Marcelo Vanzin 1739e75fec [SPARK-16586][CORE] Handle JVM errors printed to stdout.
Some very rare JVM errors are printed to stdout, and that confuses
the code in spark-class. So add a check so that those cases are
detected and the proper error message is shown to the user.

Tested by running spark-submit after setting "ulimit -v 32000".

Closes #14231

Author: Marcelo Vanzin <vanzin@cloudera.com>

Closes #14508 from vanzin/SPARK-16586.
2016-08-08 10:34:54 -07:00
gatorsmile 5959df217d [SPARK-16936][SQL] Case Sensitivity Support for Refresh Temp Table
### What changes were proposed in this pull request?
Currently, the `refreshTable` API is always case sensitive.

When users use the view name without an exact case match, the API silently ignores the call. Users might expect that the command completed successfully. However, when they run subsequent SQL commands, they might still get an exception, like
```
Job aborted due to stage failure:
Task 1 in stage 4.0 failed 1 times, most recent failure: Lost task 1.0 in stage 4.0 (TID 7, localhost):
java.io.FileNotFoundException:
File file:/private/var/folders/4b/sgmfldk15js406vk7lw5llzw0000gn/T/spark-bd4b9ea6-9aec-49c5-8f05-01cff426211e/part-r-00000-0c84b915-c032-4f2e-abf5-1d48fdbddf38.snappy.parquet does not exist
```

This PR is to fix the issue.
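
A minimal sketch of the behavior described above, assuming the fix amounts to normalizing the name according to a case-sensitivity setting before the lookup. `ToyCatalog` and its methods are hypothetical and only model the idea; they are not Spark's catalog classes.

```python
class ToyCatalog:
    """Hypothetical catalog that resolves view names with optional case sensitivity."""

    def __init__(self, case_sensitive=False):
        self.case_sensitive = case_sensitive
        self.views = {}
        self.refreshed = []

    def _key(self, name):
        # Normalize the name only when the catalog is case insensitive.
        return name if self.case_sensitive else name.lower()

    def create_temp_view(self, name):
        self.views[self._key(name)] = name

    def refresh_table(self, name):
        key = self._key(name)
        if key not in self.views:
            return False  # nothing refreshed: the silent no-op described above
        self.refreshed.append(key)
        return True


insensitive = ToyCatalog(case_sensitive=False)
insensitive.create_temp_view("MyView")
refreshed_insensitive = insensitive.refresh_table("myview")

sensitive = ToyCatalog(case_sensitive=True)
sensitive.create_temp_view("MyView")
refreshed_sensitive = sensitive.refresh_table("myview")
```

With case-insensitive resolution the refresh reaches the view even though the case differs; with strict matching the call finds nothing and stale file references stay cached.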

### How was this patch tested?
Added a test case.

Author: gatorsmile <gatorsmile@gmail.com>

Closes #14523 from gatorsmile/refreshTempTable.
2016-08-08 22:34:28 +08:00
gatorsmile ab126909ce [SPARK-16457][SQL] Fix Wrong Messages when CTAS with a Partition By Clause
#### What changes were proposed in this pull request?
When doing a CTAS with a Partition By clause, we got a wrong error message.

For example,
```SQL
CREATE TABLE gen__tmp
PARTITIONED BY (key string)
AS SELECT key, value FROM mytable1
```
The error message we get now is like
```
Operation not allowed: Schema may not be specified in a Create Table As Select (CTAS) statement(line 2, pos 0)
```

However, based on the code, the message we should get is like
```
Operation not allowed: A Create Table As Select (CTAS) statement is not allowed to create a partitioned table using Hive's file formats. Please use the syntax of "CREATE TABLE tableName USING dataSource OPTIONS (...) PARTITIONED BY ..." to create a partitioned table through a CTAS statement.(line 2, pos 0)
```

Currently, partitioning columns are part of the schema. This PR fixes the bug by changing the detection order.
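
Why the order of the checks matters can be sketched as follows. The function names and shortened messages are hypothetical and do not reproduce Spark's parser code; the only assumption taken from the description is that partition columns also count as schema.

```python
# Hypothetical, shortened messages; not the exact strings Spark emits.
SCHEMA_ERROR = "Schema may not be specified in a CTAS statement"
PARTITION_ERROR = "CTAS may not create a partitioned table using Hive's file formats"


def check_schema_first(columns, partition_columns):
    # Old order: partition columns count as schema, so this check fires first
    # and masks the more specific message below.
    if columns or partition_columns:
        return SCHEMA_ERROR
    if partition_columns:
        return PARTITION_ERROR  # never reached
    return None


def check_partition_first(columns, partition_columns):
    # New order: test the more specific condition before the general one.
    if partition_columns:
        return PARTITION_ERROR
    if columns:
        return SCHEMA_ERROR
    return None


old_message = check_schema_first([], ["key"])
new_message = check_partition_first([], ["key"])
```

For a statement with only a PARTITIONED BY clause, the first ordering reports the generic schema error while the second reports the partitioning error.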

#### How was this patch tested?
Added test cases.

Author: gatorsmile <gatorsmile@gmail.com>

Closes #14113 from gatorsmile/ctas.
2016-08-08 22:26:44 +08:00
Sean Zhong 94a9d11ed1 [SPARK-16906][SQL] Adds auxiliary info like input class and input schema in TypedAggregateExpression
## What changes were proposed in this pull request?

This PR adds auxiliary info like input class and input schema in TypedAggregateExpression

## How was this patch tested?

Manual test.

Author: Sean Zhong <seanzhong@databricks.com>

Closes #14501 from clockfly/typed_aggregation.
2016-08-08 22:20:54 +08:00
Nattavut Sutyanyong 06f5dc8415 [SPARK-16804][SQL] Correlated subqueries containing non-deterministic operations return incorrect results
## What changes were proposed in this pull request?

This patch fixes the incorrect results in the rule ResolveSubquery in Catalyst's Analysis phase by returning an error message when the LIMIT is found in the path from the parent table to the correlated predicate in the subquery.

## How was this patch tested?

./dev/run-tests
a new unit test on the problematic pattern.

Author: Nattavut Sutyanyong <nsy.can@gmail.com>

Closes #14411 from nsyca/master.
2016-08-08 12:14:11 +02:00
Weiqing Yang e10ca8de49 [SPARK-16945] Fix Java Lint errors
## What changes were proposed in this pull request?
This PR fixes the following minor Java linter errors:
[ERROR] src/main/java/org/apache/spark/sql/catalyst/expressions/VariableLengthRowBasedKeyValueBatch.java:[42,10] (modifier) RedundantModifier: Redundant 'final' modifier.
[ERROR] src/main/java/org/apache/spark/sql/catalyst/expressions/VariableLengthRowBasedKeyValueBatch.java:[97,10] (modifier) RedundantModifier: Redundant 'final' modifier.

## How was this patch tested?
Manual test.
dev/lint-java
Using `mvn` from path: /usr/local/bin/mvn
Checkstyle checks passed.

Author: Weiqing Yang <yangweiqing001@gmail.com>

Closes #14532 from Sherry302/master.
2016-08-08 09:24:37 +01:00
sethah 1db1c6567b [SPARK-16404][ML] LeastSquaresAggregators serializes unnecessary data
## What changes were proposed in this pull request?
Similar to `LogisticAggregator`, `LeastSquaresAggregator` used for linear regression ends up serializing the coefficients and the features standard deviations, which is not necessary and can cause performance issues for high dimensional data. This patch removes this serialization.

In https://github.com/apache/spark/pull/13729 the approach was to pass these values directly to the add method. The approach used here, initially, is to mark these fields as transient instead which gives the benefit of keeping the signature of the add method simple and interpretable. The downside is that it requires the use of `transient lazy val`s which are difficult to reason about if one is not quite familiar with serialization in Scala/Spark.

## How was this patch tested?

**MLlib**
![image](https://cloud.githubusercontent.com/assets/7275795/16703660/436f79fa-4524-11e6-9022-ef00058ec718.png)

**ML without patch**
![image](https://cloud.githubusercontent.com/assets/7275795/16703831/c4d50b9e-4525-11e6-80cb-9b58c850cd41.png)

**ML with patch**
![image](https://cloud.githubusercontent.com/assets/7275795/16703675/63e0cf40-4524-11e6-9120-1f512a70e083.png)

Author: sethah <seth.hendrickson16@gmail.com>

Closes #14109 from sethah/LIR_serialize.
2016-08-08 00:00:15 -07:00
Tejas Patil e076fb05ac [SPARK-16919] Configurable update interval for console progress bar
## What changes were proposed in this pull request?

Currently the update interval for the console progress bar is hardcoded. This PR makes it configurable for users.

## How was this patch tested?

Ran a long-running job with a high update interval value; the updates were shown less frequently.

Author: Tejas Patil <tejasp@fb.com>

Closes #14507 from tejasapatil/SPARK-16919.
2016-08-08 06:22:37 +01:00
Dongjoon Hyun a16983c97b [SPARK-16939][SQL] Fix build error by using Tuple1 explicitly in StringFunctionsSuite
## What changes were proposed in this pull request?

This PR aims to fix a build error on branch 1.6 at 8d87252087, but I think we had better have this consistently in master branch, too. It's because there is another ongoing PR (https://github.com/apache/spark/pull/14525) about this.

https://amplab.cs.berkeley.edu/jenkins/job/spark-branch-1.6-compile-maven-with-yarn-2.3/286/console

```
[error] /home/jenkins/workspace/spark-branch-1.6-compile-maven-with-yarn-2.3/sql/core/src/test/scala/org/apache/spark/sql/StringFunctionsSuite.scala:82: value toDF is not a member of Seq[String]
[error]     val df = Seq("aaaac").toDF("s")
[error]                           ^
```

## How was this patch tested?

After passing Jenkins, run compilation test on branch 1.6.
```
build/mvn -DskipTests -Pyarn -Phadoop-2.3 -Pkinesis-asl -Phive -Phive-thriftserver install
```

Author: Dongjoon Hyun <dongjoon@apache.org>

Closes #14526 from dongjoon-hyun/SPARK-16939.
2016-08-07 20:51:54 +01:00
Sean Owen 8d87252087 [SPARK-16409][SQL] regexp_extract with optional groups causes NPE
## What changes were proposed in this pull request?

regexp_extract actually returns null when it shouldn't: when the regex matches but the requested optional group does not. This change makes it return an empty string, as apparently designed.
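
The underlying regex behavior can be reproduced with Python's `re` module, which, like the JVM regex engine, reports an optional group that did not participate in a match as a null value (`None`). The function below is a toy analogue, not Spark's implementation; returning an empty string when the pattern does not match at all is an assumption of this sketch.

```python
import re


def regexp_extract(text, pattern, group_index):
    """Toy analogue of the described behavior; not Spark's implementation."""
    match = re.search(pattern, text)
    if match is None:
        return ""  # assumption: no match at all also yields an empty string
    value = match.group(group_index)
    # An optional group that did not participate in the match yields None;
    # map it to an empty string instead of propagating the null.
    return "" if value is None else value


raw_group = re.search(r"(a+)(b)?(c)", "aaaac").group(2)
optional_missing = regexp_extract("aaaac", r"(a+)(b)?(c)", 2)
required_present = regexp_extract("aaaac", r"(a+)(b)?(c)", 1)
```

Here the overall pattern matches, but group 2 (`(b)?`) captures nothing, so the raw group value is `None` and the guarded function returns an empty string.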

## How was this patch tested?

Additional unit test

Author: Sean Owen <sowen@cloudera.com>

Closes #14504 from srowen/SPARK-16409.
2016-08-07 12:20:07 +01:00
Prince J Wesley bdfab9f942 [SPARK-16909][SPARK CORE] Streaming for postgreSQL JDBC driver
As per the PostgreSQL JDBC driver [implementation](ab2a6d8908/pgjdbc/src/main/java/org/postgresql/PGProperty.java (L99)), the default record fetch size is 0 (which means it caches all records).

This fix sets the default record fetch size to 10 to enable streaming of data.

Author: Prince J Wesley <princejohnwesley@gmail.com>

Closes #14502 from princejwesley/spark-postgres.
2016-08-07 12:18:11 +01:00
Shivansh 6c1ecb191b [SPARK-16911] Fix the links in the programming guide
## What changes were proposed in this pull request?

Fix the broken links in the programming guide for GraphX Migration and understanding closures

## How was this patch tested?

By running the test cases and checking the links.

Author: Shivansh <shiv4nsh@gmail.com>

Closes #14503 from shiv4nsh/SPARK-16911.
2016-08-07 09:30:18 +01:00