Commit graph

2233 commits

Author SHA1 Message Date
Joan 3ae25f244b [SPARK-13929] Use Scala reflection for UDTs
## What changes were proposed in this pull request?

Enable ScalaReflection and User Defined Types for plain Scala classes.

This involves the move of `schemaFor` from `ScalaReflection` trait (which is Runtime and Compile time (macros) reflection) to the `ScalaReflection` object (runtime reflection only) as I believe this code wouldn't work at compile time anyway as it manipulates `Class`'s that are not compiled yet.

## How was this patch tested?

Unit test

Author: Joan <joan@goyeau.com>

Closes #12149 from joan38/SPARK-13929-Scala-reflection.
2016-04-19 17:36:31 -07:00
Cheng Lian 10f273d8db [SPARK-14407][SQL] Hides HadoopFsRelation related data source API into execution/datasources package #12178
## What changes were proposed in this pull request?

This PR moves `HadoopFsRelation` related data source API into `execution/datasources` package.

Note that to avoid conflicts, this PR is based on #12153. Effective changes for this PR only consist of the last three commits. Will rebase after merging #12153.

## How was this patch tested?

Existing tests.

Author: Yin Huai <yhuai@databricks.com>
Author: Cheng Lian <lian@databricks.com>

Closes #12361 from liancheng/spark-14407-hide-hadoop-fs-relation.
2016-04-19 17:32:23 -07:00
Herman van Hovell da8859226e [SPARK-4226] [SQL] Support IN/EXISTS Subqueries
### What changes were proposed in this pull request?
This PR adds support for in/exists predicate subqueries to Spark. Predicate sub-queries are used as a filtering condition in a query (this is the only supported use case). A predicate sub-query comes in two forms:

- `[NOT] EXISTS(subquery)`
- `[NOT] IN (subquery)`

This PR is (loosely) based on the work of davies (https://github.com/apache/spark/pull/10706) and chenghao-intel (https://github.com/apache/spark/pull/9055). They should be credited for the work they did.

### How was this patch tested?
Modified parsing unit tests.
Added tests to `org.apache.spark.sql.SQLQuerySuite`

cc rxin, davies & chenghao-intel

Author: Herman van Hovell <hvanhovell@questtec.nl>

Closes #12306 from hvanhovell/SPARK-4226.
2016-04-19 15:16:02 -07:00
Wenchen Fan 5cb2e33609 [SPARK-14675][SQL] ClassFormatError when use Seq as Aggregator buffer type
## What changes were proposed in this pull request?

After https://github.com/apache/spark/pull/12067, we now use expressions to do the aggregation in `TypedAggregateExpression`. To implement buffer merge, we produce a new buffer deserializer expression by replacing `AttributeReference` with right-side buffer attribute, like other `DeclarativeAggregate`s do, and finally combine the left and right buffer deserializer with `Invoke`.

However, after https://github.com/apache/spark/pull/12338, we will add loop variable to class members when codegen `MapObjects`. If the `Aggregator` buffer type is `Seq`, which is implemented by `MapObjects` expression, we will add the same loop variable to class members twice(by left and right buffer deserializer), which cause the `ClassFormatError`.

This PR fixes this issue by calling `distinct` before declare the class menbers.

## How was this patch tested?

new regression test in `DatasetAggregatorSuite`

Author: Wenchen Fan <wenchen@databricks.com>

Closes #12468 from cloud-fan/bug.
2016-04-19 10:51:58 -07:00
Josh Rosen 947b9020b0 [SPARK-14676] Wrap and re-throw Await.result exceptions in order to capture full stacktrace
When `Await.result` throws an exception which originated from a different thread, the resulting stacktrace doesn't include the path leading to the `Await.result` call itself, making it difficult to identify the impact of these exceptions. For example, I've seen cases where broadcast cleaning errors propagate to the main thread and crash it but the resulting stacktrace doesn't include any of the main thread's code, making it difficult to pinpoint which exception crashed that thread.

This patch addresses this issue by explicitly catching, wrapping, and re-throwing exceptions that are thrown by `Await.result`.

I tested this manually using 16b31c8251, a patch which reproduces an issue where an RPC exception which occurs while unpersisting RDDs manages to crash the main thread without any useful stacktrace, and verified that informative, full stacktraces were generated after applying the fix in this PR.

/cc rxin nongli yhuai anabranch

Author: Josh Rosen <joshrosen@databricks.com>

Closes #12433 from JoshRosen/wrap-and-rethrow-await-exceptions.
2016-04-19 10:38:10 -07:00
Wenchen Fan 9ee95b6ecc [SPARK-14491] [SQL] refactor object operator framework to make it easy to eliminate serializations
## What changes were proposed in this pull request?

This PR tries to separate the serialization and deserialization logic from object operators, so that it's easier to eliminate unnecessary serializations in optimizer.

Typed aggregate related operators are special, they will deserialize the input row to multiple objects and it's difficult to simply use a deserializer operator to abstract it, so we still mix the deserialization logic there.

## How was this patch tested?

existing tests and new test in `EliminateSerializationSuite`

Author: Wenchen Fan <wenchen@databricks.com>

Closes #12260 from cloud-fan/encoder.
2016-04-19 10:00:44 -07:00
Cheng Lian 5e360c93be [SPARK-13681][SPARK-14458][SPARK-14566][SQL] Add back once removed CommitFailureTestRelationSuite and SimpleTextHadoopFsRelationSuite
## What changes were proposed in this pull request?

These test suites were removed while refactoring `HadoopFsRelation` related API. This PR brings them back.

This PR also fixes two regressions:

- SPARK-14458, which causes runtime error when saving partitioned tables using `FileFormat` data sources that are not able to infer their own schemata. This bug wasn't detected by any built-in data sources because all of them happen to have schema inference feature.

- SPARK-14566, which happens to be covered by SPARK-14458 and causes wrong query result or runtime error when
  - appending a Dataset `ds` to a persisted partitioned data source relation `t`, and
  - partition columns in `ds` don't all appear after data columns

## How was this patch tested?

`CommitFailureTestRelationSuite` uses a testing relation that always fails when committing write tasks to test write job cleanup.

`SimpleTextHadoopFsRelationSuite` uses a testing relation to test general `HadoopFsRelation` and `FileFormat` interfaces.

The two regressions are both covered by existing test cases.

Author: Cheng Lian <lian@databricks.com>

Closes #12179 from liancheng/spark-13681-commit-failure-test.
2016-04-19 09:37:00 -07:00
Dongjoon Hyun 3d46d796a3 [SPARK-14577][SQL] Add spark.sql.codegen.maxCaseBranches config option
## What changes were proposed in this pull request?

We currently disable codegen for `CaseWhen` if the number of branches is greater than 20 (in CaseWhen.MAX_NUM_CASES_FOR_CODEGEN). It would be better if this value is a non-public config defined in SQLConf.

## How was this patch tested?

Pass the Jenkins tests (including a new testcase `Support spark.sql.codegen.maxCaseBranches option`)

Author: Dongjoon Hyun <dongjoon@apache.org>

Closes #12353 from dongjoon-hyun/SPARK-14577.
2016-04-19 21:38:15 +08:00
Wenchen Fan d4b94ead92 [SPARK-14595][SQL] add input metrics for FileScanRDD
## What changes were proposed in this pull request?

This is roughly based on the input metrics logic in `SqlNewHadoopRDD`

## How was this patch tested?

Not sure how to write a test, I manually verified it in Spark UI.

Author: Wenchen Fan <wenchen@databricks.com>

Closes #12352 from cloud-fan/metrics.
2016-04-18 23:48:22 -07:00
Sameer Agarwal 6f88006895 [SPARK-14722][SQL] Rename upstreams() -> inputRDDs() in WholeStageCodegen
## What changes were proposed in this pull request?

Per rxin's suggestions, this patch renames `upstreams()` to `inputRDDs()` in `WholeStageCodegen` for better implied semantics

## How was this patch tested?

N/A

Author: Sameer Agarwal <sameer@databricks.com>

Closes #12486 from sameeragarwal/codegen-cleanup.
2016-04-18 20:28:58 -07:00
Sameer Agarwal 4eae1dbd7c [SPARK-14718][SQL] Avoid mutating ExprCode in doGenCode
## What changes were proposed in this pull request?

The `doGenCode` method currently takes in an `ExprCode`, mutates it and returns the java code to evaluate the given expression. It should instead just return a new `ExprCode` to avoid passing around mutable objects during code generation.

## How was this patch tested?

Existing Tests

Author: Sameer Agarwal <sameer@databricks.com>

Closes #12483 from sameeragarwal/new-exprcode-2.
2016-04-18 20:28:22 -07:00
Reynold Xin 5e92583d38 [SPARK-14667] Remove HashShuffleManager
## What changes were proposed in this pull request?
The sort shuffle manager has been the default since Spark 1.2. It is time to remove the old hash shuffle manager.

## How was this patch tested?
Removed some tests related to the old manager.

Author: Reynold Xin <rxin@databricks.com>

Closes #12423 from rxin/SPARK-14667.
2016-04-18 19:30:00 -07:00
Sameer Agarwal 8bd8121329 [SPARK-14710][SQL] Rename gen/genCode to genCode/doGenCode to better reflect the semantics
## What changes were proposed in this pull request?

Per rxin's suggestions, this patch renames `s/gen/genCode` and `s/genCode/doGenCode` to better reflect the semantics of these 2 function calls.

## How was this patch tested?

N/A (refactoring only)

Author: Sameer Agarwal <sameer@databricks.com>

Closes #12475 from sameeragarwal/gencode.
2016-04-18 14:03:40 -07:00
hyukjinkwon 6fc1e72d9b [MINOR] Revert removing explicit typing (changed in some examples and StatFunctions)
## What changes were proposed in this pull request?

This PR reverts some changes in https://github.com/apache/spark/pull/12413. (please see the discussion in that PR).

from
```scala
    words.foreachRDD { (rdd, time) =>
    ...
```

to
```scala
    words.foreachRDD { (rdd: RDD[String], time: Time) =>
    ...
```

Also, this was discussed in dev-mailing list, [here](http://apache-spark-developers-list.1001551.n3.nabble.com/Question-about-Scala-style-explicit-typing-within-transformation-functions-and-anonymous-val-td17173.html)

## How was this patch tested?

This was tested with `sbt scalastyle`.

Author: hyukjinkwon <gurwls223@gmail.com>

Closes #12452 from HyukjinKwon/revert-explicit-typing.
2016-04-18 13:45:03 -07:00
Andrew Or 28ee15702d [SPARK-14647][SQL] Group SQLContext/HiveContext state into SharedState
## What changes were proposed in this pull request?

This patch adds a SharedState that groups state shared across multiple SQLContexts. This is analogous to the SessionState added in SPARK-13526 that groups session-specific state. This cleanup makes the constructors of the contexts simpler and ultimately allows us to remove HiveContext in the near future.

## How was this patch tested?
Existing tests.

Author: Yin Huai <yhuai@databricks.com>

Closes #12463 from yhuai/sharedState.
2016-04-18 13:15:23 -07:00
Tathagata Das 775cf17eaa [SPARK-14473][SQL] Define analysis rules to catch operations not supported in streaming
## What changes were proposed in this pull request?

There are many operations that are currently not supported in the streaming execution. For example:
 - joining two streams
 - unioning a stream and a batch source
 - sorting
 - window functions (not time windows)
 - distinct aggregates

Furthermore, executing a query with a stream source as a batch query should also fail.

This patch add an additional step after analysis in the QueryExecution which will check that all the operations in the analyzed logical plan is supported or not.

## How was this patch tested?
unit tests.

Author: Tathagata Das <tathagata.das1565@gmail.com>

Closes #12246 from tdas/SPARK-14473.
2016-04-18 11:09:33 -07:00
Dongjoon Hyun 432d1399cb [SPARK-14614] [SQL] Add bround function
## What changes were proposed in this pull request?

This PR aims to add `bound` function (aka Banker's round) by extending current `round` implementation. [Hive supports `bround` since 1.3.0.](https://cwiki.apache.org/confluence/display/Hive/LanguageManual+UDF)

**Hive (1.3 ~ 2.0)**
```
hive> select round(2.5), bround(2.5);
OK
3.0	2.0
```

**After this PR**
```scala
scala> sql("select round(2.5), bround(2.5)").head
res0: org.apache.spark.sql.Row = [3,2]
```

## How was this patch tested?

Pass the Jenkins tests (with extended tests).

Author: Dongjoon Hyun <dongjoon@apache.org>

Closes #12376 from dongjoon-hyun/SPARK-14614.
2016-04-18 10:44:51 -07:00
Reynold Xin 1a3966472c [SPARK-14696][SQL] Add implicit encoders for boxed primitive types
## What changes were proposed in this pull request?
We currently only have implicit encoders for scala primitive types. We should also add implicit encoders for boxed primitives. Otherwise, the following code would not have an encoder:

```scala
sqlContext.range(1000).map { i => i }
```

## How was this patch tested?
Added a unit test case for this.

Author: Reynold Xin <rxin@databricks.com>

Closes #12466 from rxin/SPARK-14696.
2016-04-18 17:03:15 +08:00
Wenchen Fan 2f1d0320c9 [SPARK-13363][SQL] support Aggregator in RelationalGroupedDataset
## What changes were proposed in this pull request?

set the input encoder for `TypedColumn` in `RelationalGroupedDataset.agg`.

## How was this patch tested?

new tests in `DatasetAggregatorSuite`

close https://github.com/apache/spark/pull/11269

This PR brings https://github.com/apache/spark/pull/12359 up to date and fix the compile.

Author: Wenchen Fan <wenchen@databricks.com>

Closes #12451 from cloud-fan/agg.
2016-04-18 14:27:26 +08:00
Andrew Or 7de06a646d Revert "[SPARK-14647][SQL] Group SQLContext/HiveContext state into SharedState"
This reverts commit 5cefecc95a.
2016-04-17 17:35:41 -07:00
Subhobrata Dey 699a4dfd89 [SPARK-14632] randomSplit method fails on dataframes with maps in schema
## What changes were proposed in this pull request?

The patch fixes the issue with the randomSplit method which is not able to split dataframes which has maps in schema. The bug was introduced in spark 1.6.1.

## How was this patch tested?

Tested with unit tests.

(If this patch involves UI changes, please attach a screenshot; otherwise, remove this)

Author: Subhobrata Dey <sbcd90@gmail.com>

Closes #12438 from sbcd90/randomSplitIssue.
2016-04-17 15:18:32 -07:00
Andrew Or 5cefecc95a [SPARK-14647][SQL] Group SQLContext/HiveContext state into SharedState
## What changes were proposed in this pull request?

This patch adds a SharedState that groups state shared across multiple SQLContexts. This is analogous to the SessionState added in SPARK-13526 that groups session-specific state. This cleanup makes the constructors of the contexts simpler and ultimately allows us to remove HiveContext in the near future.

## How was this patch tested?
Existing tests.

Closes #12405

Author: Andrew Or <andrew@databricks.com>
Author: Yin Huai <yhuai@databricks.com>

Closes #12447 from yhuai/sharedState.
2016-04-16 14:00:53 -07:00
Reynold Xin 7319fcc1cd [SPARK-14677][SQL] follow up: make max iter num config internal
## What changes were proposed in this pull request?
This is a follow-up to make the max iteration number an internal config.

## How was this patch tested?
N/A

Author: Reynold Xin <rxin@databricks.com>

Closes #12441 from rxin/maxIterConfInternal.
2016-04-16 11:39:47 -07:00
hyukjinkwon 9f678e9754 [MINOR] Remove inappropriate type notation and extra anonymous closure within functional transformations
## What changes were proposed in this pull request?

This PR removes

- Inappropriate type notations
    For example, from
    ```scala
    words.foreachRDD { (rdd: RDD[String], time: Time) =>
    ...
    ```
    to
    ```scala
    words.foreachRDD { (rdd, time) =>
    ...
    ```

- Extra anonymous closure within functional transformations.
    For example,
    ```scala
    .map(item => {
      ...
    })
    ```

    which can be just simply as below:

    ```scala
    .map { item =>
      ...
    }
    ```

and corrects some obvious style nits.

## How was this patch tested?

This was tested after adding rules in `scalastyle-config.xml`, which ended up with not finding all perfectly.

The rules applied were below:

- For the first correction,

```xml
<check customId="NoExtraClosure" level="error" class="org.scalastyle.file.RegexChecker" enabled="true">
    <parameters><parameter name="regex">(?m)\.[a-zA-Z_][a-zA-Z0-9]*\(\s*[^,]+s*=>\s*\{[^\}]+\}\s*\)</parameter></parameters>
</check>
```

```xml
<check customId="NoExtraClosure" level="error" class="org.scalastyle.file.RegexChecker" enabled="true">
    <parameters><parameter name="regex">\.[a-zA-Z_][a-zA-Z0-9]*\s*[\{|\(]([^\n>,]+=>)?\s*\{([^()]|(?R))*\}^[,]</parameter></parameters>
</check>
```

- For the second correction
```xml
<check customId="TypeNotation" level="error" class="org.scalastyle.file.RegexChecker" enabled="true">
    <parameters><parameter name="regex">\.[a-zA-Z_][a-zA-Z0-9]*\s*[\{|\(]\s*\([^):]*:R))*\}^[,]</parameter></parameters>
</check>
```

**Those rules were not added**

Author: hyukjinkwon <gurwls223@gmail.com>

Closes #12413 from HyukjinKwon/SPARK-style.
2016-04-16 14:56:23 +01:00
Reynold Xin 527c780bb0 Revert "[SPARK-13363][SQL] support Aggregator in RelationalGroupedDataset"
This reverts commit 12854464c4.
2016-04-16 01:05:26 -07:00
Wenchen Fan 12854464c4 [SPARK-13363][SQL] support Aggregator in RelationalGroupedDataset
## What changes were proposed in this pull request?

set the input encoder for `TypedColumn` in `RelationalGroupedDataset.agg`.

## How was this patch tested?

new tests in `DatasetAggregatorSuite`

close https://github.com/apache/spark/pull/11269

Author: Wenchen Fan <wenchen@databricks.com>

Closes #12359 from cloud-fan/agg.
2016-04-16 00:31:51 -07:00
Reynold Xin f4be0946af [SPARK-14677][SQL] Make the max number of iterations configurable for Catalyst
## What changes were proposed in this pull request?
We currently hard code the max number of optimizer/analyzer iterations to 100. This patch makes it configurable. While I'm at it, I also added the SessionCatalog to the optimizer, so we can use information there in optimization.

## How was this patch tested?
Updated unit tests to reflect the change.

Author: Reynold Xin <rxin@databricks.com>

Closes #12434 from rxin/SPARK-14677.
2016-04-15 20:28:09 -07:00
Yin Huai b2dfa84959 [SPARK-14668][SQL] Move CurrentDatabase to Catalyst
## What changes were proposed in this pull request?

This PR moves `CurrentDatabase` from sql/hive package to sql/catalyst. It also adds the function description, which looks like the following.

```
scala> sqlContext.sql("describe function extended current_database").collect.foreach(println)
[Function: current_database]
[Class: org.apache.spark.sql.execution.command.CurrentDatabase]
[Usage: current_database() - Returns the current database.]
[Extended Usage:
> SELECT current_database()]
```

## How was this patch tested?
Existing tests

Author: Yin Huai <yhuai@databricks.com>

Closes #12424 from yhuai/SPARK-14668.
2016-04-15 17:48:41 -07:00
Sameer Agarwal 4df65184b6 [SPARK-14620][SQL] Use/benchmark a better hash in VectorizedHashMap
## What changes were proposed in this pull request?

This PR uses a better hashing algorithm while probing the AggregateHashMap:

```java
long h = 0
h = (h ^ (0x9e3779b9)) + key_1 + (h << 6) + (h >>> 2);
h = (h ^ (0x9e3779b9)) + key_2 + (h << 6) + (h >>> 2);
h = (h ^ (0x9e3779b9)) + key_3 + (h << 6) + (h >>> 2);
...
h = (h ^ (0x9e3779b9)) + key_n + (h << 6) + (h >>> 2);
return h
```

Depends on: https://github.com/apache/spark/pull/12345
## How was this patch tested?

    Java HotSpot(TM) 64-Bit Server VM 1.8.0_73-b02 on Mac OS X 10.11.4
    Intel(R) Core(TM) i7-4960HQ CPU  2.60GHz
    Aggregate w keys:                   Best/Avg Time(ms)    Rate(M/s)   Per Row(ns)   Relative
    -------------------------------------------------------------------------------------------
    codegen = F                              2417 / 2457          8.7         115.2       1.0X
    codegen = T hashmap = F                  1554 / 1581         13.5          74.1       1.6X
    codegen = T hashmap = T                   877 /  929         23.9          41.8       2.8X

Author: Sameer Agarwal <sameer@databricks.com>

Closes #12379 from sameeragarwal/hash.
2016-04-15 15:55:31 -07:00
Wenchen Fan 297ba3f1b4 [SPARK-14275][SQL] Reimplement TypedAggregateExpression to DeclarativeAggregate
## What changes were proposed in this pull request?

`ExpressionEncoder` is just a container for serialization and deserialization expressions, we can use these expressions to build `TypedAggregateExpression` directly, so that it can fit in `DeclarativeAggregate`, which is more efficient.

One trick is, for each buffer serializer expression, it will reference to the result object of serialization and function call. To avoid re-calculating this result object, we can serialize the buffer object to a single struct field, so that we can use a special `Expression` to only evaluate result object once.

## How was this patch tested?

existing tests

Author: Wenchen Fan <wenchen@databricks.com>

Closes #12067 from cloud-fan/typed_udaf.
2016-04-15 12:10:00 +08:00
Sameer Agarwal b5c60bcdca [SPARK-14447][SQL] Speed up TungstenAggregate w/ keys using VectorizedHashMap
## What changes were proposed in this pull request?

This patch speeds up group-by aggregates by around 3-5x by leveraging an in-memory `AggregateHashMap` (please see https://github.com/apache/spark/pull/12161), an append-only aggregate hash map that can act as a 'cache' for extremely fast key-value lookups while evaluating aggregates (and fall back to the `BytesToBytesMap` if a given key isn't found).

Architecturally, it is backed by a power-of-2-sized array for index lookups and a columnar batch that stores the key-value pairs. The index lookups in the array rely on linear probing (with a small number of maximum tries) and use an inexpensive hash function which makes it really efficient for a majority of lookups. However, using linear probing and an inexpensive hash function also makes it less robust as compared to the `BytesToBytesMap` (especially for a large number of keys or even for certain distribution of keys) and requires us to fall back on the latter for correctness.

## How was this patch tested?

    Java HotSpot(TM) 64-Bit Server VM 1.8.0_73-b02 on Mac OS X 10.11.4
    Intel(R) Core(TM) i7-4960HQ CPU  2.60GHz
    Aggregate w keys:                   Best/Avg Time(ms)    Rate(M/s)   Per Row(ns)   Relative
    -------------------------------------------------------------------------------------------
    codegen = F                              2124 / 2204          9.9         101.3       1.0X
    codegen = T hashmap = F                  1198 / 1364         17.5          57.1       1.8X
    codegen = T hashmap = T                   369 /  600         56.8          17.6       5.8X

Author: Sameer Agarwal <sameer@databricks.com>

Closes #12345 from sameeragarwal/tungsten-aggregate-integration.
2016-04-14 20:57:03 -07:00
Mark Grover ff9ae61a3b [SPARK-14601][DOC] Minor doc/usage changes related to removal of Spark assembly
## What changes were proposed in this pull request?

Removing references to assembly jar in documentation.
Adding an additional (previously undocumented) usage of spark-submit to run examples.

## How was this patch tested?

Ran spark-submit usage to ensure formatting was fine. Ran examples using SparkSubmit.

Author: Mark Grover <mark@apache.org>

Closes #12365 from markgrover/spark-14601.
2016-04-14 18:51:43 -07:00
Liang-Chi Hsieh 28efdd3fd7 [SPARK-14592][SQL] Native support for CREATE TABLE LIKE DDL command
## What changes were proposed in this pull request?
JIRA: https://issues.apache.org/jira/browse/SPARK-14592

This patch adds native support for DDL command `CREATE TABLE LIKE`.

The SQL syntax is like:

    CREATE TABLE table_name LIKE existing_table
    CREATE TABLE IF NOT EXISTS table_name LIKE existing_table

## How was this patch tested?
`HiveDDLCommandSuite`. `HiveQuerySuite` already tests `CREATE TABLE LIKE`.

Author: Liang-Chi Hsieh <simonh@tw.ibm.com>

This patch had conflicts when merged, resolved by
Committer: Andrew Or <andrew@databricks.com>

Closes #12362 from viirya/create-table-like.
2016-04-14 11:08:08 -07:00
Reynold Xin dac40b68dc [SPARK-14619] Track internal accumulators (metrics) by stage attempt
## What changes were proposed in this pull request?
When there are multiple attempts for a stage, we currently only reset internal accumulator values if all the tasks are resubmitted. It would make more sense to reset the accumulator values for each stage attempt. This will allow us to eventually get rid of the internal flag in the Accumulator class. This is part of my bigger effort to simplify accumulators and task metrics.

## How was this patch tested?
Covered by existing tests.

Author: Reynold Xin <rxin@databricks.com>

Closes #12378 from rxin/SPARK-14619.
2016-04-14 10:54:57 -07:00
Liwei Lin 3e27940a19 [SPARK-14630][BUILD][CORE][SQL][STREAMING] Code style: public abstract methods should have explicit return types
## What changes were proposed in this pull request?

Currently many public abstract methods (in abstract classes as well as traits) don't declare return types explicitly, such as in [o.a.s.streaming.dstream.InputDStream](https://github.com/apache/spark/blob/master/streaming/src/main/scala/org/apache/spark/streaming/dstream/InputDStream.scala#L110):
```scala
def start() // should be: def start(): Unit
def stop()  // should be: def stop(): Unit
```

These methods exist in core, sql, streaming; this PR fixes them.

## How was this patch tested?

N/A

## Which piece of scala style rule led to the changes?

the rule was added separately in https://github.com/apache/spark/pull/12396

Author: Liwei Lin <lwlin7@gmail.com>

Closes #12389 from lw-lin/public-abstract-methods.
2016-04-14 10:14:38 -07:00
gatorsmile 0d22092cd9 [SPARK-14125][SQL] Native DDL Support: Alter View
#### What changes were proposed in this pull request?
This PR is to provide a native DDL support for the following three Alter View commands:

Based on the Hive DDL document:
https://cwiki.apache.org/confluence/display/Hive/LanguageManual+DDL
##### 1. ALTER VIEW RENAME
**Syntax:**
```SQL
ALTER VIEW view_name RENAME TO new_view_name
```
- to change the name of a view to a different name
- not allowed to rename a view's name by ALTER TABLE

##### 2. ALTER VIEW SET TBLPROPERTIES
**Syntax:**
```SQL
ALTER VIEW view_name SET TBLPROPERTIES ('comment' = new_comment);
```
- to add metadata to a view
- not allowed to set views' properties by ALTER TABLE
- ignore it if trying to set a view's existing property key when the value is the same
- overwrite the value if trying to set a view's existing key to a different value

##### 3. ALTER VIEW UNSET TBLPROPERTIES
**Syntax:**
```SQL
ALTER VIEW view_name UNSET TBLPROPERTIES [IF EXISTS] ('comment', 'key')
```
- to remove metadata from a view
- not allowed to unset views' properties by ALTER TABLE
- issue an exception if trying to unset a view's non-existent key

#### How was this patch tested?
Added test cases to verify if it works properly.

Author: gatorsmile <gatorsmile@gmail.com>
Author: xiaoli <lixiao1983@gmail.com>
Author: Xiao Li <xiaoli@Xiaos-MacBook-Pro.local>

Closes #12324 from gatorsmile/alterView.
2016-04-14 08:34:11 -07:00
hyukjinkwon 6fc3dc8839 [MINOR][SQL] Remove extra anonymous closure within functional transformations
## What changes were proposed in this pull request?

This PR removes extra anonymous closure within functional transformations.

For example,

```scala
.map(item => {
  ...
})
```

which can be just simply as below:

```scala
.map { item =>
  ...
}
```

## How was this patch tested?

Related unit tests and `sbt scalastyle`.

Author: hyukjinkwon <gurwls223@gmail.com>

Closes #12382 from HyukjinKwon/minor-extra-closers.
2016-04-14 09:43:41 +01:00
hyukjinkwon b4819404a6 [SPARK-14596][SQL] Remove not used SqlNewHadoopRDD and some more unused imports
## What changes were proposed in this pull request?

Old `HadoopFsRelation` API includes `buildInternalScan()` which uses `SqlNewHadoopRDD` in `ParquetRelation`.
Because now the old API is removed, `SqlNewHadoopRDD` is not used anymore.

So, this PR removes `SqlNewHadoopRDD` and several unused imports.

This was discussed in https://github.com/apache/spark/pull/12326.

## How was this patch tested?

Several related existing unit tests and `sbt scalastyle`.

Author: hyukjinkwon <gurwls223@gmail.com>

Closes #12354 from HyukjinKwon/SPARK-14596.
2016-04-14 15:43:44 +08:00
Davies Liu 62b7f306fb [SPARK-14607] [SPARK-14484] [SQL] fix case-insensitive predicates in FileSourceStrategy
## What changes were proposed in this pull request?

When prune the partitions or push down predicates, case-sensitivity is not respected. In order to make it work with case-insensitive, this PR update the AttributeReference inside predicate to use the name from schema.

## How was this patch tested?

Add regression tests for case-insensitive.

Author: Davies Liu <davies@databricks.com>

Closes #12371 from davies/case_insensi.
2016-04-13 17:17:19 -07:00
Andrew Or 7d2ed8cc03 [SPARK-14388][SQL] Implement CREATE TABLE
## What changes were proposed in this pull request?

This patch implements the `CREATE TABLE` command using the `SessionCatalog`. Previously we handled only `CTAS` and `CREATE TABLE ... USING`. This requires us to refactor `CatalogTable` to accept various fields (e.g. bucket and skew columns) and pass them to Hive.

WIP: Note that I haven't verified whether this actually works yet! But I believe it does.

## How was this patch tested?

Tests will come in a future commit.

Author: Andrew Or <andrew@databricks.com>
Author: Yin Huai <yhuai@databricks.com>

Closes #12271 from andrewor14/create-table-ddl.
2016-04-13 11:08:34 -07:00
Wenchen Fan a5f8c9b15b [SPARK-14554][SQL][FOLLOW-UP] use checkDataset to check the result
## What changes were proposed in this pull request?

address this comment: https://github.com/apache/spark/pull/12322#discussion_r59417359

## How was this patch tested?

N/A

Author: Wenchen Fan <wenchen@databricks.com>

Closes #12346 from cloud-fan/tmp.
2016-04-13 11:41:09 +08:00
hyukjinkwon 587cd554af [MINOR][SQL] Remove some unused imports in datasources.
## What changes were proposed in this pull request?

It looks several recent commits for datasources (maybe while removing old `HadoopFsRelation` interface) missed removing some unused imports.

This PR removes some unused imports in datasources.

## How was this patch tested?

`sbt scalastyle` and some unit tests for them.

Author: hyukjinkwon <gurwls223@gmail.com>

Closes #12326 from HyukjinKwon/minor-imports.
2016-04-13 10:20:03 +08:00
Shixiong Zhu 768b3d623c [SPARK-14579][SQL] Fix a race condition in StreamExecution.processAllAvailable
## What changes were proposed in this pull request?

There is a race condition in `StreamExecution.processAllAvailable`. Here is an execution order to reproduce it.

| Time        |Thread 1           | MicroBatchThread  |
|:-------------:|:-------------:|:-----:|
| 1 | |  `dataAvailable in constructNextBatch` returns false  |
| 2 | addData(newData)      |   |
| 3 | `noNewData = false` in  processAllAvailable |  |
| 4 | | noNewData = true |
| 5 | `noNewData` is true so just return | |

The root cause is that `checking dataAvailable and change noNewData to true` is not atomic. This PR puts these two actions into `synchronized` to make sure they are atomic.

In addition, this PR also has the following changes:

- Make `committedOffsets` and `availableOffsets` volatile to make sure they can be seen in other threads.
- Copy the reference of `availableOffsets` to a local variable so that `sourceStatuses` can use a snapshot of `availableOffsets`.

## How was this patch tested?

Existing unit tests.

Author: Shixiong Zhu <shixiong@databricks.com>

Closes #12339 from zsxwing/race-condition.
2016-04-12 17:31:47 -07:00
Davies Liu 372baf0479 [SPARK-14578] [SQL] Fix codegen for CreateExternalRow with nested wide schema
## What changes were proposed in this pull request?

The wide schema, the expression of fields will be splitted into multiple functions, but the variable for loopVar can't be accessed in splitted functions, this PR change them as class member.

## How was this patch tested?

Added regression test.

Author: Davies Liu <davies@databricks.com>

Closes #12338 from davies/nested_row.
2016-04-12 17:26:37 -07:00
Davies Liu 1ef5f8cfa6 [SPARK-14544] [SQL] improve performance of SQL UI tab
## What changes were proposed in this pull request?

This PR improve the performance of SQL UI by:

1) remove the details column in all executions page (the first page in SQL tab). We can check the details by enter the execution page.
2) break-all is super slow in Chrome recently, so switch to break-word.
3) Using "display: none" to hide a block.
4) using one js closure for  for all the executions, not one for each.
5) remove the height limitation of details, don't need to scroll it in the tiny window.

## How was this patch tested?

Exists tests.

![ui](https://cloud.githubusercontent.com/assets/40902/14445712/68d7b258-0004-11e6-9b48-5d329b05d165.png)

Author: Davies Liu <davies@databricks.com>

Closes #12311 from davies/ui_perf.
2016-04-12 15:03:00 -07:00
bomeng bcd2076274 [SPARK-14414][SQL] improve the error message class hierarchy
## What changes were proposed in this pull request?

Before we are using `AnalysisException`, `ParseException`, `NoSuchFunctionException` etc when a parsing error encounters. I am trying to make it consistent and also **minimum** code impact to the current implementation by changing the class hierarchy.
1. `NoSuchItemException` is removed, since it is an abstract class and it just simply takes a message string.
2. `NoSuchDatabaseException`, `NoSuchTableException`, `NoSuchPartitionException` and `NoSuchFunctionException` now extends `AnalysisException`, as well as `ParseException`, they are all under `AnalysisException` umbrella, but you can also determine how to use them in a granular way.

## How was this patch tested?
The existing test cases should cover this patch.

Author: bomeng <bmeng@us.ibm.com>

Closes #12314 from bomeng/SPARK-14414.
2016-04-12 13:43:39 -07:00
Liwei Lin 852bbc6c00 [SPARK-14556][SQL] Code clean-ups for package o.a.s.sql.execution.streaming.state
## What changes were proposed in this pull request?

- `StateStoreConf.**max**DeltasForSnapshot` was renamed to `StateStoreConf.**min**DeltasForSnapshot`
- some state switch checks were added
- improved consistency between method names and string literals
- other comments & typo fix

## How was this patch tested?

N/A

Author: Liwei Lin <lwlin7@gmail.com>

Closes #12323 from lw-lin/streaming-state-clean-up.
2016-04-12 11:50:51 -07:00
Shixiong Zhu 6bf692147c [SPARK-14474][SQL] Move FileSource offset log into checkpointLocation
## What changes were proposed in this pull request?

Now that we have a single location for storing checkpointed state. This PR just propagates the checkpoint location into FileStreamSource so that we don't have one random log off on its own.

## How was this patch tested?

test("metadataPath should be in checkpointLocation")

Author: Shixiong Zhu <shixiong@databricks.com>

Closes #12247 from zsxwing/file-source-log-location.
2016-04-12 10:46:28 -07:00
Dongjoon Hyun b0f5497e95 [SPARK-14508][BUILD] Add a new ScalaStyle Rule OmitBracesInCase
## What changes were proposed in this pull request?

According to the [Spark Code Style Guide](https://cwiki.apache.org/confluence/display/SPARK/Spark+Code+Style+Guide) and [Scala Style Guide](http://docs.scala-lang.org/style/control-structures.html#curlybraces), we had better enforce the following rule.
  ```
  case: Always omit braces in case clauses.
  ```
This PR makes a new ScalaStyle rule, 'OmitBracesInCase', and enforces it to the code.

## How was this patch tested?

Pass the Jenkins tests (including Scala style checking)

Author: Dongjoon Hyun <dongjoon@apache.org>

Closes #12280 from dongjoon-hyun/SPARK-14508.
2016-04-12 00:43:28 -07:00
Wenchen Fan 678b96e77b [SPARK-14535][SQL] Remove buildInternalScan from FileFormat
## What changes were proposed in this pull request?

Now `HadoopFsRelation` with all kinds of file formats can be handled in `FileSourceStrategy`, we can remove the branches for  `HadoopFsRelation` in `FileSourceStrategy` and the `buildInternalScan` API from `FileFormat`.

## How was this patch tested?

existing tests.

Author: Wenchen Fan <wenchen@databricks.com>

Closes #12300 from cloud-fan/remove.
2016-04-11 22:59:42 -07:00
Wenchen Fan 52a801124f [SPARK-14554][SQL] disable whole stage codegen if there are too many input columns
## What changes were proposed in this pull request?

In https://github.com/apache/spark/pull/12047/files#diff-94a1f59bcc9b6758c4ca874652437634R529, we may split field expressions codes in `CreateExternalRow` to support wide table. However, the whole stage codegen framework doesn't support it, because the input for expressions is not always the input row, but can be `CodeGenContext.currentVars`, which doesn't work well with `CodeGenContext.splitExpressions`.

Actually we do have a check to guard against this cases, but it's incomplete, it only checks output fields.

This PR improves the whole stage codegen support check, to disable it if there are too many input fields, so that we can avoid splitting field expressions codes in `CreateExternalRow` for whole stage codegen.

TODO: Is it a better solution if we can make `CodeGenContext.currentVars` work well with `CodeGenContext.splitExpressions`?

## How was this patch tested?

new test in DatasetSuite.

Author: Wenchen Fan <wenchen@databricks.com>

Closes #12322 from cloud-fan/codegen.
2016-04-11 22:58:35 -07:00
gatorsmile 2d81ba542e [SPARK-14362][SPARK-14406][SQL][FOLLOW-UP] DDL Native Support: Drop View and Drop Table
#### What changes were proposed in this pull request?
In this PR, we are trying to address the comment in the original PR: dfce9665c4 (commitcomment-17057030)

In this PR, we checks if table/view exists at the beginning and then does not need to capture the exceptions, including `NoSuchTableException` and `InvalidTableException`. We still capture the NonFatal exception when doing `sqlContext.cacheManager.tryUncacheQuery`.

#### How was this patch tested?
The existing test cases should cover the code changes of this PR.

Author: gatorsmile <gatorsmile@gmail.com>

Closes #12321 from gatorsmile/dropViewFollowup.
2016-04-11 22:33:05 -07:00
Andrew Or 83fb96403b [SPARK-14132][SPARK-14133][SQL] Alter table partition DDLs
## What changes were proposed in this pull request?

This implements a few alter table partition commands using the `SessionCatalog`. In particular:
```
ALTER TABLE ... ADD PARTITION ...
ALTER TABLE ... DROP PARTITION ...
ALTER TABLE ... RENAME PARTITION ... TO ...
```
The following operations are not supported, and an `AnalysisException` with a helpful error message will be thrown if the user tries to use them:
```
ALTER TABLE ... EXCHANGE PARTITION ...
ALTER TABLE ... ARCHIVE PARTITION ...
ALTER TABLE ... UNARCHIVE PARTITION ...
ALTER TABLE ... TOUCH ...
ALTER TABLE ... COMPACT ...
ALTER TABLE ... CONCATENATE
MSCK REPAIR TABLE ...
```

## How was this patch tested?

`DDLSuite`, `DDLCommandSuite` and `HiveDDLCommandSuite`

Author: Andrew Or <andrew@databricks.com>

Closes #12220 from andrewor14/alter-partition-ddl.
2016-04-11 20:59:45 -07:00
Liang-Chi Hsieh 26d7af9119 [SPARK-14520][SQL] Use correct return type in VectorizedParquetInputFormat
## What changes were proposed in this pull request?
JIRA: https://issues.apache.org/jira/browse/SPARK-14520

`VectorizedParquetInputFormat` inherits `ParquetInputFormat` and overrides `createRecordReader`. However, its overridden `createRecordReader` returns a `ParquetRecordReader`. It should return a `RecordReader`. Otherwise, `ClassCastException` will be thrown.

## How was this patch tested?

Existing tests.

Author: Liang-Chi Hsieh <simonh@tw.ibm.com>

Closes #12292 from viirya/fix-vectorized-input-format.
2016-04-11 19:06:38 -07:00
Eric Liang 6f27027d96 [SPARK-14475] Propagate user-defined context from driver to executors
## What changes were proposed in this pull request?

This adds a new API call `TaskContext.getLocalProperty` for getting properties set in the driver from executors. These local properties are automatically propagated from the driver to executors. For streaming, the context for streaming tasks will be the initial driver context when ssc.start() is called.

## How was this patch tested?

Unit tests.

cc JoshRosen

Author: Eric Liang <ekl@databricks.com>

Closes #12248 from ericl/sc-2813.
2016-04-11 18:33:54 -07:00
Shixiong Zhu 2dacc81ec3 [SPARK-14494][SQL] Fix the race conditions in MemoryStream and MemorySink
## What changes were proposed in this pull request?

Make sure accessing mutable variables in MemoryStream and MemorySink are protected by `synchronized`.
This is probably why MemorySinkSuite failed here: https://amplab.cs.berkeley.edu/jenkins/job/spark-master-test-sbt-hadoop-2.2/650/testReport/junit/org.apache.spark.sql.streaming/MemorySinkSuite/registering_as_a_table/

## How was this patch tested?
Existing unit tests.

Author: Shixiong Zhu <shixiong@databricks.com>

Closes #12261 from zsxwing/memory-race-condition.
2016-04-11 10:42:51 -07:00
Rekha Joshi e82d95bf63 [SPARK-14372][SQL] Dataset.randomSplit() needs a Java version
## What changes were proposed in this pull request?

1.Added method randomSplitAsList() in Dataset for java
for https://issues.apache.org/jira/browse/SPARK-14372

## How was this patch tested?

TestSuite

Author: Rekha Joshi <rekhajoshm@gmail.com>
Author: Joshi <rekhajoshm@gmail.com>

Closes #12184 from rekhajoshm/SPARK-14372.
2016-04-11 17:13:30 +08:00
gatorsmile 9f838bd242 [SPARK-14362][SPARK-14406][SQL][FOLLOW-UP] DDL Native Support: Drop View and Drop Table
#### What changes were proposed in this pull request?
This PR is to address the comment: https://github.com/apache/spark/pull/12146#discussion-diff-59092238. It removes the function `isViewSupported` from `SessionCatalog`. After the removal, we still can capture the user errors if users try to drop a table using `DROP VIEW`.

#### How was this patch tested?
Modified the existing test cases

Author: gatorsmile <gatorsmile@gmail.com>

Closes #12284 from gatorsmile/followupDropTable.
2016-04-10 20:46:15 -07:00
Davies Liu fbf8d00883 [SPARK-14419] [MINOR] coding style cleanup
## What changes were proposed in this pull request?

Making them more consistent.

## How was this patch tested?

Existing tests.

Author: Davies Liu <davies@databricks.com>

Closes #12289 from davies/cleanup_style.
2016-04-10 18:10:44 -07:00
Dongjoon Hyun a7ce473bd0 [SPARK-14415][SQL] All functions should show usages by command DESC FUNCTION
## What changes were proposed in this pull request?

Currently, many functions do now show usages like the followings.
```
scala> sql("desc function extended `sin`").collect().foreach(println)
[Function: sin]
[Class: org.apache.spark.sql.catalyst.expressions.Sin]
[Usage: To be added.]
[Extended Usage:
To be added.]
```

This PR adds descriptions for functions and adds a testcase prevent adding function without usage.
```
scala>  sql("desc function extended `sin`").collect().foreach(println);
[Function: sin]
[Class: org.apache.spark.sql.catalyst.expressions.Sin]
[Usage: sin(x) - Returns the sine of x.]
[Extended Usage:
> SELECT sin(0);
 0.0]
```

The only exceptions are `cube`, `grouping`, `grouping_id`, `rollup`, `window`.

## How was this patch tested?

Pass the Jenkins tests (including new testcases.)

Author: Dongjoon Hyun <dongjoon@apache.org>

Closes #12185 from dongjoon-hyun/SPARK-14415.
2016-04-10 11:46:45 -07:00
Dongjoon Hyun aea30a1a9b [SPARK-14465][BUILD] Checkstyle should check all Java files
## What changes were proposed in this pull request?

Currently, `checkstyle` is configured to check the files under `src/main/java`. However, Spark has Java files in `src/main/scala`, too. This PR fixes the following configuration in `pom.xml` and the unchecked-so-far violations on those files.
```xml
-<sourceDirectory>${basedir}/src/main/java</sourceDirectory>
+<sourceDirectories>${basedir}/src/main/java,${basedir}/src/main/scala</sourceDirectories>
```

## How was this patch tested?

After passing the Jenkins build and manually `dev/lint-java`. (Note that Jenkins does not run `lint-java`)

Author: Dongjoon Hyun <dongjoon@apache.org>

Closes #12242 from dongjoon-hyun/SPARK-14465.
2016-04-09 21:31:20 -07:00
Nong Li 5989c85b53 [SPARK-14217] [SQL] Fix bug if parquet data has columns that use dictionary encoding for some of the data
## What changes were proposed in this pull request?

This PR is based on #12017

Currently, this causes batches where some values are dictionary encoded and some
which are not. The non-dictionary encoded values cause us to remove the dictionary
from the batch causing the first values to return garbage.

This patch fixes the issue by first decoding the dictionary for the values that are
already dictionary encoded before switching. A similar thing is done for the reverse
case where the initial values are not dictionary encoded.

## How was this patch tested?

This is difficult to test but replicated on a test cluster using a large tpcds data set.

Author: Nong Li <nong@databricks.com>
Author: Davies Liu <davies@databricks.com>

Closes #12279 from davies/fix_dict.
2016-04-09 17:45:10 -07:00
Davies Liu 5cb5edaf9c [SPARK-14419] [SQL] Improve HashedRelation for key fit within Long
## What changes were proposed in this pull request?

Currently, we use java HashMap for HashedRelation if the key could fit within a Long. The java HashMap and CompactBuffer are not memory efficient, the memory used by them is also accounted accurately.

This PR introduce a LongToUnsafeRowMap (similar to BytesToBytesMap) for better memory efficiency and performance.

This PR reopen #12190 to fix bugs.

## How was this patch tested?

Existing tests.

Author: Davies Liu <davies@databricks.com>

Closes #12278 from davies/long_map3.
2016-04-09 17:44:38 -07:00
gatorsmile dfce9665c4 [SPARK-14362][SPARK-14406][SQL] DDL Native Support: Drop View and Drop Table
#### What changes were proposed in this pull request?

This PR is to provide a native support for DDL `DROP VIEW` and `DROP TABLE`. The PR includes native parsing and native analysis.

Based on the HIVE DDL document for [DROP_VIEW_WEB_LINK](https://cwiki.apache.org/confluence/display/Hive/LanguageManual+DDL#LanguageManualDDL-
DropView
), `DROP VIEW` is defined as,
**Syntax:**
```SQL
DROP VIEW [IF EXISTS] [db_name.]view_name;
```
 - to remove metadata for the specified view.
 - illegal to use DROP TABLE on a view.
 - illegal to use DROP VIEW on a table.
 - this command only works in `HiveContext`. In `SQLContext`, we will get an exception.

This PR also handles `DROP TABLE`.
**Syntax:**
```SQL
DROP TABLE [IF EXISTS] table_name [PURGE];
```
- Previously, the `DROP TABLE` command only can drop Hive tables in `HiveContext`. Now, after this PR, this command also can drop temporary table, external table, external data source table in `SQLContext`.
- In `HiveContext`, we will not issue an exception if the to-be-dropped table does not exist and users did not specify `IF EXISTS`. Instead, we just log an error message. If `IF EXISTS` is specified, we will not issue any error message/exception.
- In `SQLContext`, we will issue an exception if the to-be-dropped table does not exist, unless `IF EXISTS` is specified.
- Data will not be deleted if the tables are `external`, unless table type is `managed_table`.

#### How was this patch tested?
For verifying command parsing, added test cases in `spark/sql/hive/HiveDDLCommandSuite.scala`
For verifying command analysis, added test cases in `spark/sql/hive/execution/HiveDDLSuite.scala`

Author: gatorsmile <gatorsmile@gmail.com>
Author: xiaoli <lixiao1983@gmail.com>
Author: Xiao Li <xiaoli@Xiaos-MacBook-Pro.local>

Closes #12146 from gatorsmile/dropView.
2016-04-09 17:40:36 -07:00
gatorsmile 9be5558e00 [SPARK-14481][SQL] Issue Exceptions for All Unsupported Options during Parsing
#### What changes were proposed in this pull request?
"Not good to slightly ignore all the un-supported options/clauses. We should either support it or throw an exception." A comment from yhuai in another PR https://github.com/apache/spark/pull/12146

- Can `Explain` be an exception? The `Formatted` clause is used in `HiveCompatibilitySuite`.
- Two unsupported clauses in `Drop Table` are handled in a separate PR: https://github.com/apache/spark/pull/12146

#### How was this patch tested?
Test cases are added to verify all the cases.

Author: gatorsmile <gatorsmile@gmail.com>

Closes #12255 from gatorsmile/warningToException.
2016-04-09 14:10:44 -07:00
Yong Tang cd2fed7012 [SPARK-14335][SQL] Describe function command returns wrong output
## What changes were proposed in this pull request?

…because some of built-in functions are not in function registry.

This fix tries to fix issues in `describe function` command where some of the outputs
still shows Hive's function because some built-in functions are not in FunctionRegistry.

The following built-in functions have been added to FunctionRegistry:
```
-
!
*
/
&
%
^
+
<
<=
<=>
=
==
>
>=
|
~
and
in
like
not
or
rlike
when
```

The following listed functions are not added, but hard coded in `commands.scala` (hvanhovell):
```
!=
<>
between
case
```
Below are the existing result of the above functions that have not been added:
```
spark-sql> describe function `!=`;
Function: <>
Class: org.apache.hadoop.hive.ql.udf.generic.GenericUDFOPNotEqual
Usage: a <> b - Returns TRUE if a is not equal to b
```
```
spark-sql> describe function `<>`;
Function: <>
Class: org.apache.hadoop.hive.ql.udf.generic.GenericUDFOPNotEqual
Usage: a <> b - Returns TRUE if a is not equal to b
```
```
spark-sql> describe function `between`;
Function: between
Class: org.apache.hadoop.hive.ql.udf.generic.GenericUDFBetween
Usage: between a [NOT] BETWEEN b AND c - evaluate if a is [not] in between b and c
```
```
spark-sql> describe function `case`;
Function: case
Class: org.apache.hadoop.hive.ql.udf.generic.GenericUDFCase
Usage: CASE a WHEN b THEN c [WHEN d THEN e]* [ELSE f] END - When a = b, returns c; when a = d, return e; else return f
```

## How was this patch tested?

Existing tests passed. Additional test cases added.

Author: Yong Tang <yong.tang.github@outlook.com>

Closes #12128 from yongtang/SPARK-14335.
2016-04-09 13:54:30 -07:00
Davies Liu f7ec854f1b Revert "[SPARK-14419] [SQL] Improve HashedRelation for key fit within Long"
This reverts commit 90c0a04506.
2016-04-09 13:51:28 -07:00
bomeng 10a95781ee [SPARK-14496][SQL] fix some javadoc typos
## What changes were proposed in this pull request?

Minor issues. Found 2 typos while browsing the code.

## How was this patch tested?
None.

Author: bomeng <bmeng@us.ibm.com>

Closes #12264 from bomeng/SPARK-14496.
2016-04-09 22:30:54 +09:00
Davies Liu 90c0a04506 [SPARK-14419] [SQL] Improve HashedRelation for key fit within Long
## What changes were proposed in this pull request?

Currently, we use java HashMap for HashedRelation if the key could fit within a Long. The java HashMap and CompactBuffer are not memory efficient, the memory used by them is also accounted accurately.

This PR introduce a LongToUnsafeRowMap (similar to BytesToBytesMap) for better memory efficiency and performance.

## How was this patch tested?

Updated existing tests.

Author: Davies Liu <davies@databricks.com>

Closes #12190 from davies/long_map2.
2016-04-09 00:37:55 -07:00
Reynold Xin 520dde48d0 [SPARK-14451][SQL] Move encoder definition into Aggregator interface
## What changes were proposed in this pull request?
When we first introduced Aggregators, we required the user of Aggregators to (implicitly) specify the encoders. It would actually make more sense to have the encoders be specified by the implementation of Aggregators, since each implementation should have the most state about how to encode its own data type.

Note that this simplifies the Java API because Java users no longer need to explicitly specify encoders for aggregators.

## How was this patch tested?
Updated unit tests.

Author: Reynold Xin <rxin@databricks.com>

Closes #12231 from rxin/SPARK-14451.
2016-04-09 00:00:39 -07:00
Reynold Xin 2f0b882e5c [SPARK-14482][SQL] Change default Parquet codec from gzip to snappy
## What changes were proposed in this pull request?
Based on our tests, gzip decompression is very slow (< 100MB/s), making queries decompression bound. Snappy can decompress at ~ 500MB/s on a single core.

This patch changes the default compression codec for Parquet output from gzip to snappy, and also introduces a ParquetOptions class to be more consistent with other data sources (e.g. CSV, JSON).

## How was this patch tested?
Should be covered by existing unit tests.

Author: Reynold Xin <rxin@databricks.com>

Closes #12256 from rxin/SPARK-14482.
2016-04-08 23:52:04 -07:00
Joseph K. Bradley d7af736b2c [SPARK-14498][ML][PYTHON][SQL] Many cleanups to ML and ML-related docs
## What changes were proposed in this pull request?

Cleanups to documentation.  No changes to code.
* GBT docs: Move Scala doc for private object GradientBoostedTrees to public docs for GBTClassifier,Regressor
* GLM regParam: needs doc saying it is for L2 only
* TrainValidationSplitModel: add .. versionadded:: 2.0.0
* Rename “_transformer_params_from_java” to “_transfer_params_from_java”
* LogReg Summary classes: “probability” col should not say “calibrated”
* LR summaries: coefficientStandardErrors —> document that intercept stderr comes last.  Same for t,p-values
* approxCountDistinct: Document meaning of “rsd" argument.
* LDA: note which params are for online LDA only

## How was this patch tested?

Doc build

Author: Joseph K. Bradley <joseph@databricks.com>

Closes #12266 from jkbradley/ml-doc-cleanups.
2016-04-08 20:15:44 -07:00
Sameer Agarwal 813e96e6fa [SPARK-14454] Better exception handling while marking tasks as failed
## What changes were proposed in this pull request?

This patch adds support for better handling of exceptions inside catch blocks if the code within the block throws an exception. For instance here is the code in a catch block before this change in `WriterContainer.scala`:

```scala
logError("Aborting task.", cause)
// call failure callbacks first, so we could have a chance to cleanup the writer.
TaskContext.get().asInstanceOf[TaskContextImpl].markTaskFailed(cause)
if (currentWriter != null) {
  currentWriter.close()
}
abortTask()
throw new SparkException("Task failed while writing rows.", cause)
```

If `markTaskFailed` or `currentWriter.close` throws an exception, we currently lose the original cause. This PR fixes this problem by implementing a utility function `Utils.tryWithSafeCatch` that suppresses (`Throwable.addSuppressed`) the exception that are thrown within the catch block and rethrowing the original exception.

## How was this patch tested?

No new functionality added

Author: Sameer Agarwal <sameer@databricks.com>

Closes #12234 from sameeragarwal/fix-exception.
2016-04-08 17:23:32 -07:00
Sameer Agarwal f8c9beca38 [SPARK-14394][SQL] Generate AggregateHashMap class for LongTypes during TungstenAggregate codegen
## What changes were proposed in this pull request?

This PR adds support for generating the `AggregateHashMap` class in `TungstenAggregate` if the aggregate group by keys/value are of `LongType`. Note that currently this generate aggregate is not actually used.

NB: This currently only supports `LongType` keys/values (please see `isAggregateHashMapSupported` in `TungstenAggregate`) and will be generalized to other data types in a subsequent PR.

## How was this patch tested?

Manually inspected the generated code. This is what the generated map looks like for 2 keys:

```java
/* 068 */   public class agg_GeneratedAggregateHashMap {
/* 069 */     private org.apache.spark.sql.execution.vectorized.ColumnarBatch batch;
/* 070 */     private int[] buckets;
/* 071 */     private int numBuckets;
/* 072 */     private int maxSteps;
/* 073 */     private int numRows = 0;
/* 074 */     private org.apache.spark.sql.types.StructType schema =
/* 075 */     new org.apache.spark.sql.types.StructType()
/* 076 */     .add("k1", org.apache.spark.sql.types.DataTypes.LongType)
/* 077 */     .add("k2", org.apache.spark.sql.types.DataTypes.LongType)
/* 078 */     .add("sum", org.apache.spark.sql.types.DataTypes.LongType);
/* 079 */
/* 080 */     public agg_GeneratedAggregateHashMap(int capacity, double loadFactor, int maxSteps) {
/* 081 */       assert (capacity > 0 && ((capacity & (capacity - 1)) == 0));
/* 082 */       this.maxSteps = maxSteps;
/* 083 */       numBuckets = (int) (capacity / loadFactor);
/* 084 */       batch = org.apache.spark.sql.execution.vectorized.ColumnarBatch.allocate(schema,
/* 085 */         org.apache.spark.memory.MemoryMode.ON_HEAP, capacity);
/* 086 */       buckets = new int[numBuckets];
/* 087 */       java.util.Arrays.fill(buckets, -1);
/* 088 */     }
/* 089 */
/* 090 */     public agg_GeneratedAggregateHashMap() {
/* 091 */       new agg_GeneratedAggregateHashMap(1 << 16, 0.25, 5);
/* 092 */     }
/* 093 */
/* 094 */     public org.apache.spark.sql.execution.vectorized.ColumnarBatch.Row findOrInsert(long agg_key, long agg_key1) {
/* 095 */       long h = hash(agg_key, agg_key1);
/* 096 */       int step = 0;
/* 097 */       int idx = (int) h & (numBuckets - 1);
/* 098 */       while (step < maxSteps) {
/* 099 */         // Return bucket index if it's either an empty slot or already contains the key
/* 100 */         if (buckets[idx] == -1) {
/* 101 */           batch.column(0).putLong(numRows, agg_key);
/* 102 */           batch.column(1).putLong(numRows, agg_key1);
/* 103 */           batch.column(2).putLong(numRows, 0);
/* 104 */           buckets[idx] = numRows++;
/* 105 */           return batch.getRow(buckets[idx]);
/* 106 */         } else if (equals(idx, agg_key, agg_key1)) {
/* 107 */           return batch.getRow(buckets[idx]);
/* 108 */         }
/* 109 */         idx = (idx + 1) & (numBuckets - 1);
/* 110 */         step++;
/* 111 */       }
/* 112 */       // Didn't find it
/* 113 */       return null;
/* 114 */     }
/* 115 */
/* 116 */     private boolean equals(int idx, long agg_key, long agg_key1) {
/* 117 */       return batch.column(0).getLong(buckets[idx]) == agg_key && batch.column(1).getLong(buckets[idx]) == agg_key1;
/* 118 */     }
/* 119 */
/* 120 */     // TODO: Improve this Hash Function
/* 121 */     private long hash(long agg_key, long agg_key1) {
/* 122 */       return agg_key ^ agg_key1;
/* 123 */     }
/* 124 */
/* 125 */   }
```

Author: Sameer Agarwal <sameer@databricks.com>

Closes #12161 from sameeragarwal/tungsten-aggregate.
2016-04-08 13:52:28 -07:00
tedyu 02757535b5 [SPARK-14448] Improvements to ColumnVector
## What changes were proposed in this pull request?

In this PR, two changes are proposed for ColumnVector :
1. ColumnVector should be declared as implementing AutoCloseable - it already has close() method
2. In OnHeapColumnVector#reserveInternal(), we only need to allocate new array when existing array is null or the length of existing array is shorter than the newCapacity.

## How was this patch tested?

Existing unit tests.

Author: tedyu <yuzhihong@gmail.com>

Closes #12225 from tedyu/master.
2016-04-08 12:25:36 -07:00
hyukjinkwon 73b56a3c6c [SPARK-14189][SQL] JSON data sources find compatible types even if inferred decimal type is not capable of the others
## What changes were proposed in this pull request?

https://issues.apache.org/jira/browse/SPARK-14189

When inferred types in the same field during finding compatible `DataType`, are `IntegralType` and `DecimalType` but `DecimalType` is not capable of the given `IntegralType`, JSON data source simply fails to find a compatible type resulting in `StringType`.

This can be observed when `prefersDecimal` is enabled.

```scala
def mixedIntegerAndDoubleRecords: RDD[String] =
  sqlContext.sparkContext.parallelize(
    """{"a": 3, "b": 1.1}""" ::
    """{"a": 3.1, "b": 1}""" :: Nil)

val jsonDF = sqlContext.read
  .option("prefersDecimal", "true")
  .json(mixedIntegerAndDoubleRecords)
  .printSchema()
```

- **Before**

```
root
 |-- a: string (nullable = true)
 |-- b: string (nullable = true)
```

- **After**

```
root
 |-- a: decimal(21, 1) (nullable = true)
 |-- b: decimal(21, 1) (nullable = true)
```
(Note that integer is inferred as `LongType` which becomes `DecimalType(20, 0)`)

## How was this patch tested?

unit tests were used and style tests by `dev/run_tests`.

Author: hyukjinkwon <gurwls223@gmail.com>

Closes #11993 from HyukjinKwon/SPARK-14189.
2016-04-08 00:30:26 -07:00
hyukjinkwon 725b860e2b [SPARK-14103][SQL] Parse unescaped quotes in CSV data source.
## What changes were proposed in this pull request?

This PR resolves the problem during parsing unescaped quotes in input data. For example, currently the data below:

```
"a"b,ccc,ddd
e,f,g
```

produces a data below:

- **Before**

```bash
["a"b,ccc,ddd[\n]e,f,g]  <- as a value.
```

- **After**

```bash
["a"b], [ccc], [ddd]
[e], [f], [g]
```

This PR bumps up the Univocity parser's version. This was fixed in `2.0.2`, https://github.com/uniVocity/univocity-parsers/issues/60.

## How was this patch tested?

Unit tests in `CSVSuite` and `sbt/sbt scalastyle`.

Author: hyukjinkwon <gurwls223@gmail.com>

Closes #12226 from HyukjinKwon/SPARK-14103-quote.
2016-04-08 00:28:59 -07:00
Reynold Xin 04fb7dba70 Replace getLocalizedMessage with just normal toString in exception handling in WriterContainer. 2016-04-07 21:41:41 -07:00
Wenchen Fan 49fb237081 [SPARK-14270][SQL] whole stage codegen support for typed filter
## What changes were proposed in this pull request?

We implement typed filter by `MapPartitions`, which doesn't work well with whole stage codegen. This PR use `Filter` to implement typed filter and we can get the whole stage codegen support for free.

This PR also introduced `DeserializeToObject` and `SerializeFromObject`, to seperate serialization logic from object operator, so that it's eaiser to write optimization rules for adjacent object operators.

## How was this patch tested?

existing tests.

Author: Wenchen Fan <wenchen@databricks.com>

Closes #12061 from cloud-fan/whole-stage-codegen.
2016-04-07 17:23:34 -07:00
Andrew Or ae1db91d15 [SPARK-14410][SQL] Push functions existence check into catalog
## What changes were proposed in this pull request?

This is a followup to #12117 and addresses some of the TODOs introduced there. In particular, the resolution of database is now pushed into session catalog, which knows about the current database. Further, the logic for checking whether a function exists is pushed into the external catalog.

No change in functionality is expected.

## How was this patch tested?

`SessionCatalogSuite`, `DDLSuite`

Author: Andrew Or <andrew@databricks.com>

Closes #12198 from andrewor14/function-exists.
2016-04-07 16:23:17 -07:00
Davies Liu aa852215f8 [SPARK-12740] [SPARK-13932] support grouping()/grouping_id() in having/order clause
## What changes were proposed in this pull request?

This PR brings the support of using grouping()/grouping_id() in HAVING/ORDER BY clause.

The resolved grouping()/grouping_id() will be replaced by unresolved "spark_gropuing_id" virtual attribute, then resolved by ResolveMissingAttribute.

This PR also fix the HAVING clause that access a grouping column that is not presented in SELECT clause, for example:
```sql
select count(1) from (select 1 as a) t group by a having a > 0
```
## How was this patch tested?

Add new tests.

Author: Davies Liu <davies@databricks.com>

Closes #12235 from davies/grouping_having.
2016-04-07 11:51:34 -07:00
Kousuke Saruta 8dcb0c7c97 [SPARK-14456][SQL][MINOR] Remove unused variables and logics in DataSource
## What changes were proposed in this pull request?

In DataSource#write method, the variables `dataSchema` and `equality`, and related logics are no longer used. Let's remove them.

## How was this patch tested?

Existing tests.

Author: Kousuke Saruta <sarutak@oss.nttdata.co.jp>

Closes #12237 from sarutak/SPARK-14456.
2016-04-07 11:03:39 -07:00
Tathagata Das 3aa7d76395 [SQL][TESTS] Fix for flaky test in ContinuousQueryManagerSuite
## What changes were proposed in this pull request?

The timeouts were lower the other timeouts in the test. Other tests were stable over the last month.

## How was this patch tested?

Jenkins tests.

Author: Tathagata Das <tathagata.das1565@gmail.com>

Closes #12219 from tdas/flaky-test-fix.
2016-04-07 10:51:49 -07:00
Reynold Xin 9ca0760d67 [SPARK-10063][SQL] Remove DirectParquetOutputCommitter
## What changes were proposed in this pull request?
This patch removes DirectParquetOutputCommitter. This was initially created by Databricks as a faster way to write Parquet data to S3. However, given how the underlying S3 Hadoop implementation works, this committer only works when there are no failures. If there are multiple attempts of the same task (e.g. speculation or task failures or node failures), the output data can be corrupted. I don't think this performance optimization outweighs the correctness issue.

## How was this patch tested?
Removed the related tests also.

Author: Reynold Xin <rxin@databricks.com>

Closes #12229 from rxin/SPARK-10063.
2016-04-07 00:51:45 -07:00
Reynold Xin e11aa9ec5c [SPARK-14452][SQL] Explicit APIs in Scala for specifying encoders
## What changes were proposed in this pull request?
The Scala Dataset public API currently only allows users to specify encoders through SQLContext.implicits. This is OK but sometimes people want to explicitly get encoders without a SQLContext (e.g. Aggregator implementations). This patch adds public APIs to Encoders class for getting Scala encoders.

## How was this patch tested?
None - I will update test cases once https://github.com/apache/spark/pull/12231 is merged.

Author: Reynold Xin <rxin@databricks.com>

Closes #12232 from rxin/SPARK-14452.
2016-04-07 00:46:57 -07:00
Herman van Hovell d76592276f [SPARK-12610][SQL] Left Anti Join
### What changes were proposed in this pull request?

This PR adds support for `LEFT ANTI JOIN` to Spark SQL. A `LEFT ANTI JOIN` is the exact opposite of a `LEFT SEMI JOIN` and can be used to identify rows in one dataset that are not in another dataset. Note that `nulls` on the left side of the join cannot match a row on the right hand side of the join; the result is that left anti join will always select a row with a `null` in one or more of its keys.

We currently add support for the following SQL join syntax:

    SELECT   *
    FROM      tbl1 A
              LEFT ANTI JOIN tbl2 B
               ON A.Id = B.Id

Or using a dataframe:

    tbl1.as("a").join(tbl2.as("b"), $"a.id" === $"b.id", "left_anti)

This PR provides serves as the basis for implementing `NOT EXISTS` and `NOT IN (...)` correlated sub-queries. It would also serve as good basis for implementing an more efficient `EXCEPT` operator.

The PR has been (losely) based on PR's by both davies (https://github.com/apache/spark/pull/10706) and chenghao-intel (https://github.com/apache/spark/pull/10563); credit should be given where credit is due.

This PR adds supports for `LEFT ANTI JOIN` to `BroadcastHashJoin` (including codegeneration), `ShuffledHashJoin` and `BroadcastNestedLoopJoin`.

### How was this patch tested?

Added tests to `JoinSuite` and ported `ExistenceJoinSuite` from https://github.com/apache/spark/pull/10563.

cc davies chenghao-intel rxin

Author: Herman van Hovell <hvanhovell@questtec.nl>

Closes #12214 from hvanhovell/SPARK-12610.
2016-04-06 19:25:10 -07:00
Luciano Resende 611dbce4bd [SPARK-12555][SQL] Result should not be corrupted after input columns are reordered
This PR add test case described in SPARK-12555 to validate that correct data is returned when input data is reordered and to avoid future regressions.

Author: Luciano Resende <lresende@apache.org>

Closes #11623 from lresende/SPARK-12555.
2016-04-07 08:35:00 +08:00
Marcelo Vanzin 864d1b4d66 [SPARK-14436][SQL] Make JavaDatasetAggregatorSuiteBase public.
Without this, unit tests that extend that class fail for me locally
on maven, because JUnit tries to run methods in that class and gets
an IllegalAccessError.

Author: Marcelo Vanzin <vanzin@cloudera.com>

Closes #12212 from vanzin/SPARK-14436.
2016-04-06 16:50:59 -07:00
Dongjoon Hyun d717ae1fd7 [SPARK-14444][BUILD] Add a new scalastyle NoScalaDoc to prevent ScalaDoc-style multiline comments
## What changes were proposed in this pull request?

According to the [Spark Code Style Guide](https://cwiki.apache.org/confluence/display/SPARK/Spark+Code+Style+Guide#SparkCodeStyleGuide-Indentation), this PR adds a new scalastyle rule to prevent the followings.
```
/** In Spark, we don't use the ScalaDoc style so this
  * is not correct.
  */
```

## How was this patch tested?

Pass the Jenkins tests (including `lint-scala`).

Author: Dongjoon Hyun <dongjoon@apache.org>

Closes #12221 from dongjoon-hyun/SPARK-14444.
2016-04-06 16:02:55 -07:00
Davies Liu 5a4b11a901 [SPARK-14224] [SPARK-14223] [SPARK-14310] [SQL] fix RowEncoder and parquet reader for wide table
## What changes were proposed in this pull request?

1) fix the RowEncoder for wide table (many columns) by splitting the generate code into multiple functions.
2) Separate DataSourceScan as RowDataSourceScan and BatchedDataSourceScan
3) Disable the returning columnar batch in parquet reader if there are many columns.
4) Added a internal config for maximum number of fields (nested) columns supported by whole stage codegen.

Closes #12098

## How was this patch tested?

Add a tests for table with 1000 columns.

Author: Davies Liu <davies@databricks.com>

Closes #12047 from davies/many_columns.
2016-04-06 15:33:39 -07:00
Shixiong Zhu a4ead6d388 [SPARK-14382][SQL] QueryProgress should be post after committedOffsets is updated
## What changes were proposed in this pull request?

Make sure QueryProgress is post after committedOffsets is updated. If QueryProgress is post before committedOffsets is updated, the listener may see a wrong sinkStatus (created from committedOffsets).

See https://amplab.cs.berkeley.edu/jenkins/job/spark-master-test-maven-hadoop-2.2/644/testReport/junit/org.apache.spark.sql.util/ContinuousQueryListenerSuite/single_listener/ for an example of the failure.

## How was this patch tested?

Existing unit tests.

Author: Shixiong Zhu <shixiong@databricks.com>

Closes #12155 from zsxwing/SPARK-14382.
2016-04-06 12:28:04 -07:00
Sameer Agarwal bb1fa5b218 [SPARK-14320][SQL] Make ColumnarBatch.Row mutable
## What changes were proposed in this pull request?

In order to leverage a data structure like `AggregateHashMap` (https://github.com/apache/spark/pull/12055) to speed up aggregates with keys, we need to make `ColumnarBatch.Row` mutable.

## How was this patch tested?

Unit test in `ColumnarBatchSuite`. Also, tested via `BenchmarkWholeStageCodegen`.

Author: Sameer Agarwal <sameer@databricks.com>

Closes #12103 from sameeragarwal/mutable-row.
2016-04-06 11:59:42 -07:00
bomeng 3c8d882165 [SPARK-14383][SQL] missing "|" in the g4 file
## What changes were proposed in this pull request?

A very trivial one. It missed "|" between DISTRIBUTE and UNSET.

## How was this patch tested?

I do not think it is really needed.

Author: bomeng <bmeng@us.ibm.com>

Closes #12156 from bomeng/SPARK-14383.
2016-04-06 11:12:48 -07:00
bomeng 5abd02c02b [SPARK-14429][SQL] Improve LIKE pattern in "SHOW TABLES / FUNCTIONS LIKE <pattern>" DDL
LIKE <pattern> is commonly used in SHOW TABLES / FUNCTIONS etc DDL. In the pattern, user can use `|` or `*` as wildcards.

1. Currently, we used `replaceAll()` to replace `*` with `.*`, but the replacement was scattered in several places; I have created an utility method and use it in all the places;

2. Consistency with Hive: the pattern is case insensitive in Hive and white spaces will be trimmed, but current pattern matching does not do that. For example, suppose we have tables (t1, t2, t3), `SHOW TABLES LIKE ' T* ' ` will list all the t-tables. Please use Hive to verify it.

3. Combined with `|`, the result will be sorted. For pattern like `'  B*|a*  '`, it will list the result in a-b order.

I've made some changes to the utility method to make sure we will get the same result as Hive does.

A new method was created in StringUtil and test cases were added.

andrewor14

Author: bomeng <bmeng@us.ibm.com>

Closes #12206 from bomeng/SPARK-14429.
2016-04-06 11:06:14 -07:00
Michael Armbrust 59236e5c5b [SPARK-14288][SQL] Memory Sink for streaming
This PR exposes the internal testing `MemorySink` though the data source API.  This will allow users to easily test streaming applications in the Spark shell or other local tests.

Usage:
```scala
inputStream.write
  .format("memory")
  .queryName("memStream")
  .startStream()

// Now you can query the result of the stream here.
sqlContext.table("memStream")
```

The most complicated part of the logic is choosing the checkpoint directory.  There are a few requirements we are attempting to satisfy here:
 - when working in the shell locally, it should just work with no extra configuration.
 - when working on a cluster you should be able to make it easily create the checkpoint on a distributed file system so you can test aggregation (state checkpoints are also stored in this directory and must be accessible from workers).
 - it should be clear that you can't resume since the data is just in memory.

The chosen algorithm proceeds as follows:
 - the user gives a checkpoint directory, use it
 - if the conf has a checkpoint location, use `$location/$queryName`
 - if neither, create a local directory
 - always check to make sure there are no offsets written to the directory

Author: Michael Armbrust <michael@databricks.com>

Closes #12119 from marmbrus/memorySink.
2016-04-06 10:05:02 -07:00
gatorsmile 68be5b9e8a [SPARK-14396][SQL] Throw Exceptions for DDLs of Partitioned Views
#### What changes were proposed in this pull request?

Because the concept of partitioning is associated with physical tables, we disable all the supports of partitioned views, which are defined in the following three commands in [Hive DDL Manual](https://cwiki.apache.org/confluence/display/Hive/LanguageManual+DDL#LanguageManualDDL-Create/Drop/AlterView):
```
ALTER VIEW view DROP [IF EXISTS] PARTITION spec1[, PARTITION spec2, ...];

ALTER VIEW view ADD [IF NOT EXISTS] PARTITION spec;

CREATE VIEW [IF NOT EXISTS] [db_name.]view_name [(column_name [COMMENT column_comment], ...) ]
  [COMMENT view_comment]
  [TBLPROPERTIES (property_name = property_value, ...)]
  AS SELECT ...;
```

An exception is thrown when users issue any of these three DDL commands.

#### How was this patch tested?
Added test cases for parsing create view and changed the existing test cases to verify if the exceptions are thrown.

Author: gatorsmile <gatorsmile@gmail.com>
Author: xiaoli <lixiao1983@gmail.com>
Author: Xiao Li <xiaoli@Xiaos-MacBook-Pro.local>

Closes #12169 from gatorsmile/viewPartition.
2016-04-05 22:33:44 -07:00
Andrew Or adbfdb878d [SPARK-14128][SQL] Alter table DDL followup
## What changes were proposed in this pull request?

This is just a followup to #12121, which implemented the alter table DDLs using the `SessionCatalog`. Specially, this corrects the behavior of setting the location of a datasource table. For datasource tables, we need to set the `locationUri` in addition to the `path` entry in the serde properties. Additionally, changing the location of a datasource table partition is not allowed.

## How was this patch tested?

`DDLSuite`

Author: Andrew Or <andrew@databricks.com>

Closes #12186 from andrewor14/alter-table-ddl-followup.
2016-04-05 21:23:20 -07:00
Wenchen Fan f6456fa80b [SPARK-14296][SQL] whole stage codegen support for Dataset.map
## What changes were proposed in this pull request?

This PR adds a new operator `MapElements` for `Dataset.map`, it's a 1-1 mapping and is easier to adapt to whole stage codegen framework.

## How was this patch tested?

new test in `WholeStageCodegenSuite`

Author: Wenchen Fan <wenchen@databricks.com>

Closes #12087 from cloud-fan/map.
2016-04-06 12:09:10 +08:00
Eric Liang 7d29c72f64 [SPARK-14359] Unit tests for java 8 lambda syntax with typed aggregates
## What changes were proposed in this pull request?

Adds unit tests for java 8 lambda syntax with typed aggregates as a follow-up to #12168

## How was this patch tested?

Unit tests.

Author: Eric Liang <ekl@databricks.com>

Closes #12181 from ericl/sc-2794-2.
2016-04-05 21:22:20 -05:00
Marcelo Vanzin d5ee9d5c24 [SPARK-529][SQL] Modify SQLConf to use new config API from core.
Because SQL keeps track of all known configs, some customization was
needed in SQLConf to allow that, since the core API does not have that
feature.

Tested via existing (and slightly updated) unit tests.

Author: Marcelo Vanzin <vanzin@cloudera.com>

Closes #11570 from vanzin/SPARK-529-sql.
2016-04-05 15:19:51 -07:00