Commit graph

704 commits

Author SHA1 Message Date
gatorsmile 5706472670 [SPARK-15215][SQL] Fix Explain Parsing and Output
#### What changes were proposed in this pull request?
This PR is to address a few existing issues in `EXPLAIN`:
- The `EXPLAIN` options `LOGICAL | FORMATTED | EXTENDED | CODEGEN` should not be 0 or more match. It should 0 or one match. Parser does not allow users to use more than one option in a single command.
- The option `LOGICAL` is not supported. Issue an exception when users specify this option in the command.
- The output of `EXPLAIN ` contains a weird empty line when the output of analyzed plan is empty. We should remove it. For example:
  ```
  == Parsed Logical Plan ==
  CreateTable CatalogTable(`t`,CatalogTableType(MANAGED),CatalogStorageFormat(None,Some(org.apache.hadoop.mapred.TextInputFormat),Some(org.apache.hadoop.hive.ql.io.  HiveIgnoreKeyTextOutputFormat),None,false,Map()),List(CatalogColumn(col,int,true,None)),List(),List(),List(),-1,,1462725171656,-1,Map(),None,None,None), false

  == Analyzed Logical Plan ==

  CreateTable CatalogTable(`t`,CatalogTableType(MANAGED),CatalogStorageFormat(None,Some(org.apache.hadoop.mapred.TextInputFormat),Some(org.apache.hadoop.hive.ql.io.  HiveIgnoreKeyTextOutputFormat),None,false,Map()),List(CatalogColumn(col,int,true,None)),List(),List(),List(),-1,,1462725171656,-1,Map(),None,None,None), false

  == Optimized Logical Plan ==
  CreateTable CatalogTable(`t`,CatalogTableType(MANAGED),CatalogStorageFormat(None,Some(org.apache.hadoop.mapred.TextInputFormat),Some(org.apache.hadoop.hive.ql.io.  HiveIgnoreKeyTextOutputFormat),None,false,Map()),List(CatalogColumn(col,int,true,None)),List(),List(),List(),-1,,1462725171656,-1,Map(),None,None,None), false
  ...
  ```

#### How was this patch tested?
Added and modified a few test cases

Author: gatorsmile <gatorsmile@gmail.com>

Closes #12991 from gatorsmile/explainCreateTable.
2016-05-10 11:53:37 +02:00
Wenchen Fan beb16ec556 [SPARK-15093][SQL] create/delete/rename directory for InMemoryCatalog operations if needed
## What changes were proposed in this pull request?

following operations have file system operation now:

1. CREATE DATABASE: create a dir
2. DROP DATABASE: delete the dir
3. CREATE TABLE: create a dir
4. DROP TABLE: delete the dir
5. RENAME TABLE: rename the dir
6. CREATE PARTITIONS: create a dir
7. RENAME PARTITIONS: rename the dir
8. DROP PARTITIONS: drop the dir

## How was this patch tested?

new tests in `ExternalCatalogSuite`

Author: Wenchen Fan <wenchen@databricks.com>

Closes #12871 from cloud-fan/catalog.
2016-05-09 10:47:45 -07:00
gatorsmile e9131ec277 [SPARK-15185][SQL] InMemoryCatalog: Silent Removal of an Existent Table/Function/Partitions by Rename
#### What changes were proposed in this pull request?
So far, in the implementation of InMemoryCatalog, we do not check if the new/destination table/function/partition exists or not. Thus, we just silently remove the existent table/function/partition.

This PR is to detect them and issue an appropriate exception.

#### How was this patch tested?
Added the related test cases. They also verify if HiveExternalCatalog also detects these errors.

Author: gatorsmile <gatorsmile@gmail.com>

Closes #12960 from gatorsmile/renameInMemoryCatalog.
2016-05-09 12:40:30 +08:00
gatorsmile 5c8fad7b9b [SPARK-15108][SQL] Describe Permanent UDTF
#### What changes were proposed in this pull request?
When Describe a UDTF, the command returns a wrong result. The command is unable to find the function, which has been created and cataloged in the catalog but not in the functionRegistry.

This PR is to correct it. If the function is not in the functionRegistry, we will check the catalog for collecting the information of the UDTF function.

#### How was this patch tested?
Added test cases to verify the results

Author: gatorsmile <gatorsmile@gmail.com>

Closes #12885 from gatorsmile/showFunction.
2016-05-06 11:43:07 -07:00
Wenchen Fan 55cc1c991a [SPARK-14139][SQL] RowEncoder should preserve schema nullability
## What changes were proposed in this pull request?

The problem is: In `RowEncoder`, we use `Invoke` to get the field of an external row, which lose the nullability information. This PR creates a `GetExternalRowField` expression, so that we can preserve the nullability info.

TODO: simplify the null handling logic in `RowEncoder`, to remove so many if branches, in follow-up PR.

## How was this patch tested?

new tests in `RowEncoderSuite`

Note that, This PR takes over https://github.com/apache/spark/pull/11980, with a little simplification, so all credits should go to koertkuipers

Author: Wenchen Fan <wenchen@databricks.com>
Author: Koert Kuipers <koert@tresata.com>

Closes #12364 from cloud-fan/nullable.
2016-05-06 01:08:04 +08:00
Cheng Lian f152fae306 [SPARK-14127][SQL] Native "DESC [EXTENDED | FORMATTED] <table>" DDL command
## What changes were proposed in this pull request?

This PR implements native `DESC [EXTENDED | FORMATTED] <table>` DDL command. Sample output:

```
scala> spark.sql("desc extended src").show(100, truncate = false)
+----------------------------+---------------------------------+-------+
|col_name                    |data_type                        |comment|
+----------------------------+---------------------------------+-------+
|key                         |int                              |       |
|value                       |string                           |       |
|                            |                                 |       |
|# Detailed Table Information|CatalogTable(`default`.`src`, ...|       |
+----------------------------+---------------------------------+-------+

scala> spark.sql("desc formatted src").show(100, truncate = false)
+----------------------------+----------------------------------------------------------+-------+
|col_name                    |data_type                                                 |comment|
+----------------------------+----------------------------------------------------------+-------+
|key                         |int                                                       |       |
|value                       |string                                                    |       |
|                            |                                                          |       |
|# Detailed Table Information|                                                          |       |
|Database:                   |default                                                   |       |
|Owner:                      |lian                                                      |       |
|Create Time:                |Mon Jan 04 17:06:00 CST 2016                              |       |
|Last Access Time:           |Thu Jan 01 08:00:00 CST 1970                              |       |
|Location:                   |hdfs://localhost:9000/user/hive/warehouse_hive121/src     |       |
|Table Type:                 |MANAGED                                                   |       |
|Table Parameters:           |                                                          |       |
|  transient_lastDdlTime     |1451898360                                                |       |
|                            |                                                          |       |
|# Storage Information       |                                                          |       |
|SerDe Library:              |org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe        |       |
|InputFormat:                |org.apache.hadoop.mapred.TextInputFormat                  |       |
|OutputFormat:               |org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat|       |
|Num Buckets:                |-1                                                        |       |
|Bucket Columns:             |[]                                                        |       |
|Sort Columns:               |[]                                                        |       |
|Storage Desc Parameters:    |                                                          |       |
|  serialization.format      |1                                                         |       |
+----------------------------+----------------------------------------------------------+-------+
```

## How was this patch tested?

A test case is added to `HiveDDLSuite` to check command output.

Author: Cheng Lian <lian@databricks.com>

Closes #12844 from liancheng/spark-14127-desc-table.
2016-05-04 16:44:09 +08:00
Wenchen Fan 6c12e801e8 [SPARK-15029] improve error message for Generate
## What changes were proposed in this pull request?

This PR improve the error message for `Generate` in 3 cases:

1. generator is nested in expressions, e.g. `SELECT explode(list) + 1 FROM tbl`
2. generator appears more than one time in SELECT, e.g. `SELECT explode(list), explode(list) FROM tbl`
3. generator appears in other operator which is not project, e.g. `SELECT * FROM tbl SORT BY explode(list)`

## How was this patch tested?

new tests in `AnalysisErrorSuite`

Author: Wenchen Fan <wenchen@databricks.com>

Closes #12810 from cloud-fan/bug.
2016-05-04 00:10:20 -07:00
Andrew Or 6ba17cd147 [SPARK-14414][SQL] Make DDL exceptions more consistent
## What changes were proposed in this pull request?

Just a bunch of small tweaks on DDL exception messages.

## How was this patch tested?

`DDLCommandSuite` et al.

Author: Andrew Or <andrew@databricks.com>

Closes #12853 from andrewor14/make-exceptions-consistent.
2016-05-03 18:07:53 -07:00
gatorsmile 71296c041e [SPARK-15056][SQL] Parse Unsupported Sampling Syntax and Issue Better Exceptions
#### What changes were proposed in this pull request?
Compared with the current Spark parser, there are two extra syntax are supported in Hive for sampling
- In `On` clauses, `rand()` is used for indicating sampling on the entire row instead of an individual column. For example,

   ```SQL
   SELECT * FROM source TABLESAMPLE(BUCKET 3 OUT OF 32 ON rand()) s;
   ```
- Users can specify the total length to be read. For example,

   ```SQL
   SELECT * FROM source TABLESAMPLE(100M) s;
   ```

Below is the link for references:
   https://cwiki.apache.org/confluence/display/Hive/LanguageManual+Sampling

This PR is to parse and capture these two extra syntax, and issue a better error message.

#### How was this patch tested?
Added test cases to verify the thrown exceptions

Author: gatorsmile <gatorsmile@gmail.com>

Closes #12838 from gatorsmile/bucketOnRand.
2016-05-03 23:20:18 +02:00
bomeng 0fd95be3cd [SPARK-15062][SQL] fix list type infer serializer issue
## What changes were proposed in this pull request?

Make serializer correctly inferred if the input type is `List[_]`, since `List[_]` is type of `Seq[_]`, before it was matched to different case (`case t if definedByConstructorParams(t)`).

## How was this patch tested?

New test case was added.

Author: bomeng <bmeng@us.ibm.com>

Closes #12849 from bomeng/SPARK-15062.
2016-05-02 18:20:29 -07:00
Herman van Hovell 1c19c2769e [SPARK-15047][SQL] Cleanup SQL Parser
## What changes were proposed in this pull request?
This PR addresses a few minor issues in SQL parser:

- Removes some unused rules and keywords in the grammar.
- Removes code path for fallback SQL parsing (was needed for Hive native parsing).
- Use `UnresolvedGenerator` instead of hard-coding `Explode` & `JsonTuple`.
- Adds a more generic way of creating error messages for unsupported Hive features.
- Use `visitFunctionName` as much as possible.
- Interpret a `CatalogColumn`'s `DataType` directly instead of parsing it again.

## How was this patch tested?
Existing tests.

Author: Herman van Hovell <hvanhovell@questtec.nl>

Closes #12826 from hvanhovell/SPARK-15047.
2016-05-02 18:12:31 -07:00
Herman van Hovell f362363d14 [SPARK-14785] [SQL] Support correlated scalar subqueries
## What changes were proposed in this pull request?
In this PR we add support for correlated scalar subqueries. An example of such a query is:
```SQL
select * from tbl1 a where a.value > (select max(value) from tbl2 b where b.key = a.key)
```
The implementation adds the `RewriteCorrelatedScalarSubquery` rule to the Optimizer. This rule plans these subqueries using `LEFT OUTER` joins. It currently supports rewrites for `Project`, `Aggregate` & `Filter` logical plans.

I could not find a well defined semantics for the use of scalar subqueries in an `Aggregate`. The current implementation currently evaluates the scalar subquery *before* aggregation. This means that you either have to make scalar subquery part of the grouping expression, or that you have to aggregate it further on. I am open to suggestions on this.

The implementation currently forces the uniqueness of a scalar subquery by enforcing that it is aggregated and that the resulting column is wrapped in an `AggregateExpression`.

## How was this patch tested?
Added tests to `SubquerySuite`.

Author: Herman van Hovell <hvanhovell@questtec.nl>

Closes #12822 from hvanhovell/SPARK-14785.
2016-05-02 16:32:31 -07:00
Davies Liu 95e372141a [SPARK-14781] [SQL] support nested predicate subquery
## What changes were proposed in this pull request?

In order to support nested predicate subquery, this PR introduce an internal join type ExistenceJoin, which will emit all the rows from left, plus an additional column, which presents there are any rows matched from right or not (it's not null-aware right now). This additional column could be used to replace the subquery in Filter.

In theory, all the predicate subquery could use this join type, but it's slower than LeftSemi and LeftAnti, so it's only used for nested subquery (subquery inside OR).

For example, the following SQL:
```sql
SELECT a FROM t  WHERE EXISTS (select 0) OR EXISTS (select 1)
```

This PR also fix a bug in predicate subquery push down through join (they should not).

Nested null-aware subquery is still not supported. For example,   `a > 3 OR b NOT IN (select bb from t)`

After this, we could run TPCDS query Q10, Q35, Q45

## How was this patch tested?

Added unit tests.

Author: Davies Liu <davies@databricks.com>

Closes #12820 from davies/or_exists.
2016-05-02 12:58:59 -07:00
Dongjoon Hyun 6e6320122e [SPARK-14830][SQL] Add RemoveRepetitionFromGroupExpressions optimizer.
## What changes were proposed in this pull request?

This PR aims to optimize GroupExpressions by removing repeating expressions. `RemoveRepetitionFromGroupExpressions` is added.

**Before**
```scala
scala> sql("select a+1 from values 1,2 T(a) group by a+1, 1+a, A+1, 1+A").explain()
== Physical Plan ==
WholeStageCodegen
:  +- TungstenAggregate(key=[(a#0 + 1)#6,(1 + a#0)#7,(A#0 + 1)#8,(1 + A#0)#9], functions=[], output=[(a + 1)#5])
:     +- INPUT
+- Exchange hashpartitioning((a#0 + 1)#6, (1 + a#0)#7, (A#0 + 1)#8, (1 + A#0)#9, 200), None
   +- WholeStageCodegen
      :  +- TungstenAggregate(key=[(a#0 + 1) AS (a#0 + 1)#6,(1 + a#0) AS (1 + a#0)#7,(A#0 + 1) AS (A#0 + 1)#8,(1 + A#0) AS (1 + A#0)#9], functions=[], output=[(a#0 + 1)#6,(1 + a#0)#7,(A#0 + 1)#8,(1 + A#0)#9])
      :     +- INPUT
      +- LocalTableScan [a#0], [[1],[2]]
```

**After**
```scala
scala> sql("select a+1 from values 1,2 T(a) group by a+1, 1+a, A+1, 1+A").explain()
== Physical Plan ==
WholeStageCodegen
:  +- TungstenAggregate(key=[(a#0 + 1)#6], functions=[], output=[(a + 1)#5])
:     +- INPUT
+- Exchange hashpartitioning((a#0 + 1)#6, 200), None
   +- WholeStageCodegen
      :  +- TungstenAggregate(key=[(a#0 + 1) AS (a#0 + 1)#6], functions=[], output=[(a#0 + 1)#6])
      :     +- INPUT
      +- LocalTableScan [a#0], [[1],[2]]
```

## How was this patch tested?

Pass the Jenkins tests (with a new testcase)

Author: Dongjoon Hyun <dongjoon@apache.org>

Closes #12590 from dongjoon-hyun/SPARK-14830.
2016-05-02 12:40:21 -07:00
Wenchen Fan 43b149fb88 [SPARK-14850][ML] convert primitive array from/to unsafe array directly in VectorUDT/MatrixUDT
## What changes were proposed in this pull request?

This PR adds `fromPrimitiveArray` and `toPrimitiveArray` in `UnsafeArrayData`, so that we can do the conversion much faster in VectorUDT/MatrixUDT.

## How was this patch tested?

existing tests and new test suite `UnsafeArraySuite`

Author: Wenchen Fan <wenchen@databricks.com>

Closes #12640 from cloud-fan/ml.
2016-04-29 23:04:51 -07:00
Yin Huai ac41fc648d [SPARK-14591][SQL] Remove DataTypeParser and add more keywords to the nonReserved list.
## What changes were proposed in this pull request?
CatalystSqlParser can parse data types. So, we do not need to have an individual DataTypeParser.

## How was this patch tested?
Existing tests

Author: Yin Huai <yhuai@databricks.com>

Closes #12796 from yhuai/removeDataTypeParser.
2016-04-29 22:49:12 -07:00
Reynold Xin 7945f9f6d4 [SPARK-14757] [SQL] Fix nullability bug in EqualNullSafe codegen
## What changes were proposed in this pull request?
This patch fixes a null handling bug in EqualNullSafe's code generation.

## How was this patch tested?
Updated unit test so they would fail without the fix.

Closes #12628.

Author: Reynold Xin <rxin@databricks.com>
Author: Arash Nabili <arash@levyx.com>

Closes #12799 from rxin/equalnullsafe.
2016-04-29 22:26:12 -07:00
Herman van Hovell 83061be697 [SPARK-14858] [SQL] Enable subquery pushdown
The previous subquery PRs did not include support for pushing subqueries used in filters (`WHERE`/`HAVING`) down. This PR adds this support. For example :
```scala
range(0, 10).registerTempTable("a")
range(5, 15).registerTempTable("b")
range(7, 25).registerTempTable("c")
range(3, 12).registerTempTable("d")
val plan = sql("select * from a join b on a.id = b.id left join c on c.id = b.id where a.id in (select id from d)")
plan.explain(true)
```
Leads to the following Analyzed & Optimized plans:
```
== Parsed Logical Plan ==
...

== Analyzed Logical Plan ==
id: bigint, id: bigint, id: bigint
Project [id#0L,id#4L,id#8L]
+- Filter predicate-subquery#16 [(id#0L = id#12L)]
   :  +- SubqueryAlias predicate-subquery#16 [(id#0L = id#12L)]
   :     +- Project [id#12L]
   :        +- SubqueryAlias d
   :           +- Range 3, 12, 1, 8, [id#12L]
   +- Join LeftOuter, Some((id#8L = id#4L))
      :- Join Inner, Some((id#0L = id#4L))
      :  :- SubqueryAlias a
      :  :  +- Range 0, 10, 1, 8, [id#0L]
      :  +- SubqueryAlias b
      :     +- Range 5, 15, 1, 8, [id#4L]
      +- SubqueryAlias c
         +- Range 7, 25, 1, 8, [id#8L]

== Optimized Logical Plan ==
Join LeftOuter, Some((id#8L = id#4L))
:- Join Inner, Some((id#0L = id#4L))
:  :- Join LeftSemi, Some((id#0L = id#12L))
:  :  :- Range 0, 10, 1, 8, [id#0L]
:  :  +- Range 3, 12, 1, 8, [id#12L]
:  +- Range 5, 15, 1, 8, [id#4L]
+- Range 7, 25, 1, 8, [id#8L]

== Physical Plan ==
...
```
I have also taken the opportunity to move quite a bit of code around:
- Rewriting subqueris and pulling out correlated predicated from subqueries has been moved into the analyzer. The analyzer transforms `Exists` and `InSubQuery` into `PredicateSubquery` expressions. A PredicateSubquery exposes the 'join' expressions and the proper references. This makes things like type coercion, optimization and planning easier to do.
- I have added support for `Aggregate` plans in subqueries. Any correlated expressions will be added to the grouping expressions. I have removed support for `Union` plans, since pulling in an outer reference from beneath a Union has no value (a filtered value could easily be part of another Union child).
- Resolution of subqueries is now done using `OuterReference`s. These are used to wrap any outer reference; this makes the identification of these references easier, and also makes dealing with duplicate attributes in the outer and inner plans easier. The resolution of subqueries initially used a resolution loop which would alternate between calling the analyzer and trying to resolve the outer references. We now use a dedicated analyzer which uses a special rule for outer reference resolution.

These changes are a stepping stone for enabling correlated scalar subqueries, enabling all Hive tests & allowing us to use predicate subqueries anywhere.

Current tests and added test cases in FilterPushdownSuite.

Author: Herman van Hovell <hvanhovell@questtec.nl>

Closes #12720 from hvanhovell/SPARK-14858.
2016-04-29 16:50:12 -07:00
gatorsmile 222dcf7937 [SPARK-12660][SPARK-14967][SQL] Implement Except Distinct by Left Anti Join
#### What changes were proposed in this pull request?
Replaces a logical `Except` operator with a `Left-anti Join` operator. This way, we can take advantage of all the benefits of join implementations (e.g. managed memory, code generation, broadcast joins).
```SQL
  SELECT a1, a2 FROM Tab1 EXCEPT SELECT b1, b2 FROM Tab2
  ==>  SELECT DISTINCT a1, a2 FROM Tab1 LEFT ANTI JOIN Tab2 ON a1<=>b1 AND a2<=>b2
```
 Note:
 1. This rule is only applicable to EXCEPT DISTINCT. Do not use it for EXCEPT ALL.
 2. This rule has to be done after de-duplicating the attributes; otherwise, the enerated
    join conditions will be incorrect.

This PR also corrects the existing behavior in Spark. Before this PR, the behavior is like
```SQL
  test("except") {
    val df_left = Seq(1, 2, 2, 3, 3, 4).toDF("id")
    val df_right = Seq(1, 3).toDF("id")

    checkAnswer(
      df_left.except(df_right),
      Row(2) :: Row(2) :: Row(4) :: Nil
    )
  }
```
After this PR, the result is corrected. We strictly follow the SQL compliance of `Except Distinct`.

#### How was this patch tested?
Modified and added a few test cases to verify the optimization rule and the results of operators.

Author: gatorsmile <gatorsmile@gmail.com>

Closes #12736 from gatorsmile/exceptByAntiJoin.
2016-04-29 15:30:36 +08:00
Dongjoon Hyun af92299fdb [SPARK-14664][SQL] Implement DecimalAggregates optimization for Window queries
## What changes were proposed in this pull request?

This PR aims to implement decimal aggregation optimization for window queries by improving existing `DecimalAggregates`. Historically, `DecimalAggregates` optimizer is designed to transform general `sum/avg(decimal)`, but it breaks recently added windows queries like the followings. The following queries work well without the current `DecimalAggregates` optimizer.

**Sum**
```scala
scala> sql("select sum(a) over () from (select explode(array(1.0,2.0)) a) t").head
java.lang.RuntimeException: Unsupported window function: MakeDecimal((sum(UnscaledValue(a#31)),mode=Complete,isDistinct=false),12,1)
scala> sql("select sum(a) over () from (select explode(array(1.0,2.0)) a) t").explain()
== Physical Plan ==
WholeStageCodegen
:  +- Project [sum(a) OVER (  ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING)#23]
:     +- INPUT
+- Window [MakeDecimal((sum(UnscaledValue(a#21)),mode=Complete,isDistinct=false),12,1) windowspecdefinition(ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING) AS sum(a) OVER (  ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING)#23]
   +- Exchange SinglePartition, None
      +- Generate explode([1.0,2.0]), false, false, [a#21]
         +- Scan OneRowRelation[]
```

**Average**
```scala
scala> sql("select avg(a) over () from (select explode(array(1.0,2.0)) a) t").head
java.lang.RuntimeException: Unsupported window function: cast(((avg(UnscaledValue(a#40)),mode=Complete,isDistinct=false) / 10.0) as decimal(6,5))
scala> sql("select avg(a) over () from (select explode(array(1.0,2.0)) a) t").explain()
== Physical Plan ==
WholeStageCodegen
:  +- Project [avg(a) OVER (  ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING)#44]
:     +- INPUT
+- Window [cast(((avg(UnscaledValue(a#42)),mode=Complete,isDistinct=false) / 10.0) as decimal(6,5)) windowspecdefinition(ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING) AS avg(a) OVER (  ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING)#44]
   +- Exchange SinglePartition, None
      +- Generate explode([1.0,2.0]), false, false, [a#42]
         +- Scan OneRowRelation[]
```

After this PR, those queries work fine and new optimized physical plans look like the followings.

**Sum**
```scala
scala> sql("select sum(a) over () from (select explode(array(1.0,2.0)) a) t").explain()
== Physical Plan ==
WholeStageCodegen
:  +- Project [sum(a) OVER (  ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING)#35]
:     +- INPUT
+- Window [MakeDecimal((sum(UnscaledValue(a#33)),mode=Complete,isDistinct=false) windowspecdefinition(ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING),12,1) AS sum(a) OVER (  ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING)#35]
   +- Exchange SinglePartition, None
      +- Generate explode([1.0,2.0]), false, false, [a#33]
         +- Scan OneRowRelation[]
```

**Average**
```scala
scala> sql("select avg(a) over () from (select explode(array(1.0,2.0)) a) t").explain()
== Physical Plan ==
WholeStageCodegen
:  +- Project [avg(a) OVER (  ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING)#47]
:     +- INPUT
+- Window [cast(((avg(UnscaledValue(a#45)),mode=Complete,isDistinct=false) windowspecdefinition(ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING) / 10.0) as decimal(6,5)) AS avg(a) OVER (  ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING)#47]
   +- Exchange SinglePartition, None
      +- Generate explode([1.0,2.0]), false, false, [a#45]
         +- Scan OneRowRelation[]
```

In this PR, *SUM over window* pattern matching is based on the code of hvanhovell ; he should be credited for the work he did.

## How was this patch tested?

Pass the Jenkins tests (with newly added testcases)

Author: Dongjoon Hyun <dongjoon@apache.org>

Closes #12421 from dongjoon-hyun/SPARK-14664.
2016-04-27 21:36:19 +02:00
Andrew Or d8a83a564f [SPARK-13477][SQL] Expose new user-facing Catalog interface
## What changes were proposed in this pull request?

#12625 exposed a new user-facing conf interface in `SparkSession`. This patch adds a catalog interface.

## How was this patch tested?

See `CatalogSuite`.

Author: Andrew Or <andrew@databricks.com>

Closes #12713 from andrewor14/user-facing-catalog.
2016-04-26 21:29:25 -07:00
Reynold Xin f36c9c8379 [SPARK-14888][SQL] UnresolvedFunction should use FunctionIdentifier
## What changes were proposed in this pull request?
This patch changes UnresolvedFunction and UnresolvedGenerator to use a FunctionIdentifier rather than just a String for function name. Also changed SessionCatalog to accept FunctionIdentifier in lookupFunction.

## How was this patch tested?
Updated related unit tests.

Author: Reynold Xin <rxin@databricks.com>

Closes #12659 from rxin/SPARK-14888.
2016-04-25 16:20:57 -07:00
gatorsmile 0c47e274ab [SPARK-13739][SQL] Push Predicate Through Window
#### What changes were proposed in this pull request?

For performance, predicates can be pushed through Window if and only if the following conditions are satisfied:
 1. All the expressions are part of window partitioning key. The expressions can be compound.
 2. Deterministic

#### How was this patch tested?

TODO:
- [X]  DSL needs to be modified for window
- [X] more tests will be added.

Author: gatorsmile <gatorsmile@gmail.com>
Author: xiaoli <lixiao1983@gmail.com>
Author: Xiao Li <xiaoli@Xiaos-MacBook-Pro.local>

Closes #11635 from gatorsmile/pushPredicateThroughWindow.
2016-04-25 22:32:34 +02:00
jliwork f0f1a8afde [SPARK-14548][SQL] Support not greater than and not less than operator in Spark SQL
!< means not less than which is equivalent to >=
!> means not greater than which is equivalent to <=

I'd to create a PR to support these two operators.

I've added new test cases in: DataFrameSuite, ExpressionParserSuite, JDBCSuite, PlanParserSuite, SQLQuerySuite

dilipbiswal viirya gatorsmile

Author: jliwork <jiali@us.ibm.com>

Closes #12316 from jliwork/SPARK-14548.
2016-04-24 11:22:06 -07:00
Dongjoon Hyun 3647120a5a [SPARK-14796][SQL] Add spark.sql.optimizer.inSetConversionThreshold config option.
## What changes were proposed in this pull request?

Currently, `OptimizeIn` optimizer replaces `In` expression into `InSet` expression if the size of set is greater than a constant, 10.
This issue aims to make a configuration `spark.sql.optimizer.inSetConversionThreshold` for that.

After this PR, `OptimizerIn` is configurable.
```scala
scala> sql("select a in (1,2,3) from (select explode(array(1,2)) a) T").explain()
== Physical Plan ==
WholeStageCodegen
:  +- Project [a#7 IN (1,2,3) AS (a IN (1, 2, 3))#8]
:     +- INPUT
+- Generate explode([1,2]), false, false, [a#7]
   +- Scan OneRowRelation[]

scala> sqlContext.setConf("spark.sql.optimizer.inSetConversionThreshold", "2")

scala> sql("select a in (1,2,3) from (select explode(array(1,2)) a) T").explain()
== Physical Plan ==
WholeStageCodegen
:  +- Project [a#16 INSET (1,2,3) AS (a IN (1, 2, 3))#17]
:     +- INPUT
+- Generate explode([1,2]), false, false, [a#16]
   +- Scan OneRowRelation[]
```

## How was this patch tested?

Pass the Jenkins tests (with a new testcase)

Author: Dongjoon Hyun <dongjoon@apache.org>

Closes #12562 from dongjoon-hyun/SPARK-14796.
2016-04-22 14:14:47 -07:00
Davies Liu c417cec067 [SPARK-14763][SQL] fix subquery resolution
## What changes were proposed in this pull request?

Currently, a column could be resolved wrongly if there are columns from both outer table and subquery have the same name, we should only resolve the attributes that can't be resolved within subquery. They may have same exprId than other attributes in subquery, so we should create alias for them.

Also, the column in IN subquery could have same exprId, we should create alias for them.

## How was this patch tested?

Added regression tests. Manually tests TPCDS Q70 and Q95, work well after this patch.

Author: Davies Liu <davies@databricks.com>

Closes #12539 from davies/fix_subquery.
2016-04-22 20:55:41 +02:00
Herman van Hovell d060da098a [SPARK-14762] [SQL] TPCDS Q90 fails to parse
### What changes were proposed in this pull request?
TPCDS Q90 fails to parse because it uses a reserved keyword as an Identifier; `AT` was used as an alias for one of the subqueries. `AT` is not a reserved keyword and should have been registerd as a in the `nonReserved` rule.

In order to prevent this from happening again I have added tests for all keywords that are non-reserved in Hive. See the `nonReserved`, `sql11ReservedKeywordsUsedAsCastFunctionName` & `sql11ReservedKeywordsUsedAsIdentifier` rules in https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/parse/IdentifiersParser.g.

### How was this patch tested?

Added tests to for all Hive non reserved keywords to `TableIdentifierParserSuite`.

cc davies

Author: Herman van Hovell <hvanhovell@questtec.nl>

Closes #12537 from hvanhovell/SPARK-14762.
2016-04-22 11:28:46 -07:00
Joan bf95b8da27 [SPARK-6429] Implement hashCode and equals together
## What changes were proposed in this pull request?

Implement some `hashCode` and `equals` together in order to enable the scalastyle.
This is a first batch, I will continue to implement them but I wanted to know your thoughts.

Author: Joan <joan@goyeau.com>

Closes #12157 from joan38/SPARK-6429-HashCode-Equals.
2016-04-22 12:24:12 +01:00
Takuya UESHIN f1fdb23821 [SPARK-14793] [SQL] Code generation for large complex type exceeds JVM size limit.
## What changes were proposed in this pull request?

Code generation for complex type, `CreateArray`, `CreateMap`, `CreateStruct`, `CreateNamedStruct`, exceeds JVM size limit for large elements.

We should split generated code into multiple `apply` functions if the complex types have large elements,  like `UnsafeProjection` or others for large expressions.

## How was this patch tested?

I added some tests to check if the generated codes for the expressions exceed or not.

Author: Takuya UESHIN <ueshin@happy-camper.st>

Closes #12559 from ueshin/issues/SPARK-14793.
2016-04-21 21:17:56 -07:00
Reynold Xin f181aee07c [SPARK-14821][SQL] Implement AnalyzeTable in sql/core and remove HiveSqlAstBuilder
## What changes were proposed in this pull request?
This patch moves analyze table parsing into SparkSqlAstBuilder and removes HiveSqlAstBuilder.

In order to avoid extensive refactoring, I created a common trait for CatalogRelation and MetastoreRelation, and match on that. In the future we should probably just consolidate the two into a single thing so we don't need this common trait.

## How was this patch tested?
Updated unit tests.

Author: Reynold Xin <rxin@databricks.com>

Closes #12584 from rxin/SPARK-14821.
2016-04-21 17:41:29 -07:00
Wenchen Fan 7abe9a6578 [SPARK-9013][SQL] generate MutableProjection directly instead of return a function
`MutableProjection` is not thread-safe and we won't use it in multiple threads. I think the reason that we return `() => MutableProjection` is not about thread safety, but to save the costs of generating code when we need same but individual mutable projections.

However, I only found one place that use this [feature](https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/Window.scala#L122-L123), and comparing to the troubles it brings, I think we should generate `MutableProjection` directly instead of return a function.

Author: Wenchen Fan <wenchen@databricks.com>

Closes #7373 from cloud-fan/project.
2016-04-20 00:44:02 -07:00
Wenchen Fan 856bc465d5 [SPARK-14600] [SQL] Push predicates through Expand
## What changes were proposed in this pull request?

https://issues.apache.org/jira/browse/SPARK-14600

This PR makes `Expand.output` have different attributes from the grouping attributes produced by the underlying `Project`, as they have different meaning, so that we can safely push down filter through `Expand`

## How was this patch tested?

existing tests.

Author: Wenchen Fan <wenchen@databricks.com>

Closes #12496 from cloud-fan/expand.
2016-04-19 21:53:19 -07:00
Joan 3ae25f244b [SPARK-13929] Use Scala reflection for UDTs
## What changes were proposed in this pull request?

Enable ScalaReflection and User Defined Types for plain Scala classes.

This involves the move of `schemaFor` from `ScalaReflection` trait (which is Runtime and Compile time (macros) reflection) to the `ScalaReflection` object (runtime reflection only) as I believe this code wouldn't work at compile time anyway as it manipulates `Class`'s that are not compiled yet.

## How was this patch tested?

Unit test

Author: Joan <joan@goyeau.com>

Closes #12149 from joan38/SPARK-13929-Scala-reflection.
2016-04-19 17:36:31 -07:00
Herman van Hovell da8859226e [SPARK-4226] [SQL] Support IN/EXISTS Subqueries
### What changes were proposed in this pull request?
This PR adds support for in/exists predicate subqueries to Spark. Predicate sub-queries are used as a filtering condition in a query (this is the only supported use case). A predicate sub-query comes in two forms:

- `[NOT] EXISTS(subquery)`
- `[NOT] IN (subquery)`

This PR is (loosely) based on the work of davies (https://github.com/apache/spark/pull/10706) and chenghao-intel (https://github.com/apache/spark/pull/9055). They should be credited for the work they did.

### How was this patch tested?
Modified parsing unit tests.
Added tests to `org.apache.spark.sql.SQLQuerySuite`

cc rxin, davies & chenghao-intel

Author: Herman van Hovell <hvanhovell@questtec.nl>

Closes #12306 from hvanhovell/SPARK-4226.
2016-04-19 15:16:02 -07:00
Josh Rosen 947b9020b0 [SPARK-14676] Wrap and re-throw Await.result exceptions in order to capture full stacktrace
When `Await.result` throws an exception which originated from a different thread, the resulting stacktrace doesn't include the path leading to the `Await.result` call itself, making it difficult to identify the impact of these exceptions. For example, I've seen cases where broadcast cleaning errors propagate to the main thread and crash it but the resulting stacktrace doesn't include any of the main thread's code, making it difficult to pinpoint which exception crashed that thread.

This patch addresses this issue by explicitly catching, wrapping, and re-throwing exceptions that are thrown by `Await.result`.

I tested this manually using 16b31c8251, a patch which reproduces an issue where an RPC exception which occurs while unpersisting RDDs manages to crash the main thread without any useful stacktrace, and verified that informative, full stacktraces were generated after applying the fix in this PR.

/cc rxin nongli yhuai anabranch

Author: Josh Rosen <joshrosen@databricks.com>

Closes #12433 from JoshRosen/wrap-and-rethrow-await-exceptions.
2016-04-19 10:38:10 -07:00
Wenchen Fan 9ee95b6ecc [SPARK-14491] [SQL] refactor object operator framework to make it easy to eliminate serializations
## What changes were proposed in this pull request?

This PR tries to separate the serialization and deserialization logic from object operators, so that it's easier to eliminate unnecessary serializations in optimizer.

Typed aggregate related operators are special, they will deserialize the input row to multiple objects and it's difficult to simply use a deserializer operator to abstract it, so we still mix the deserialization logic there.

## How was this patch tested?

existing tests and new test in `EliminateSerializationSuite`

Author: Wenchen Fan <wenchen@databricks.com>

Closes #12260 from cloud-fan/encoder.
2016-04-19 10:00:44 -07:00
Dongjoon Hyun 3d46d796a3 [SPARK-14577][SQL] Add spark.sql.codegen.maxCaseBranches config option
## What changes were proposed in this pull request?

We currently disable codegen for `CaseWhen` if the number of branches is greater than 20 (in CaseWhen.MAX_NUM_CASES_FOR_CODEGEN). It would be better if this value is a non-public config defined in SQLConf.

## How was this patch tested?

Pass the Jenkins tests (including a new testcase `Support spark.sql.codegen.maxCaseBranches option`)

Author: Dongjoon Hyun <dongjoon@apache.org>

Closes #12353 from dongjoon-hyun/SPARK-14577.
2016-04-19 21:38:15 +08:00
Sameer Agarwal 4eae1dbd7c [SPARK-14718][SQL] Avoid mutating ExprCode in doGenCode
## What changes were proposed in this pull request?

The `doGenCode` method currently takes in an `ExprCode`, mutates it and returns the java code to evaluate the given expression. It should instead just return a new `ExprCode` to avoid passing around mutable objects during code generation.

## How was this patch tested?

Existing Tests

Author: Sameer Agarwal <sameer@databricks.com>

Closes #12483 from sameeragarwal/new-exprcode-2.
2016-04-18 20:28:22 -07:00
Sameer Agarwal 8bd8121329 [SPARK-14710][SQL] Rename gen/genCode to genCode/doGenCode to better reflect the semantics
## What changes were proposed in this pull request?

Per rxin's suggestions, this patch renames `s/gen/genCode` and `s/genCode/doGenCode` to better reflect the semantics of these 2 function calls.

## How was this patch tested?

N/A (refactoring only)

Author: Sameer Agarwal <sameer@databricks.com>

Closes #12475 from sameeragarwal/gencode.
2016-04-18 14:03:40 -07:00
Reynold Xin e4ae974294 [HOTFIX] Fix Scala 2.10 compilation break. 2016-04-18 12:57:23 -07:00
Dongjoon Hyun d280d1da1a [SPARK-14580][SPARK-14655][SQL] Hive IfCoercion should preserve predicate.
## What changes were proposed in this pull request?

Currently, `HiveTypeCoercion.IfCoercion` removes all predicates whose return-type are null. However, some UDFs need evaluations because they are designed to throw exceptions. This PR fixes that to preserve the predicates. Also, `assert_true` is implemented as Spark SQL function.

**Before**
```
scala> sql("select if(assert_true(false),2,3)").head
res2: org.apache.spark.sql.Row = [3]
```

**After**
```
scala> sql("select if(assert_true(false),2,3)").head
... ASSERT_TRUE ...
```

**Hive**
```
hive> select if(assert_true(false),2,3);
OK
Failed with exception java.io.IOException:org.apache.hadoop.hive.ql.metadata.HiveException: ASSERT_TRUE(): assertion failed.
```

## How was this patch tested?

Pass the Jenkins tests (including a new testcase in `HivePlanTest`)

Author: Dongjoon Hyun <dongjoon@apache.org>

Closes #12340 from dongjoon-hyun/SPARK-14580.
2016-04-18 12:26:56 -07:00
Tathagata Das 775cf17eaa [SPARK-14473][SQL] Define analysis rules to catch operations not supported in streaming
## What changes were proposed in this pull request?

There are many operations that are currently not supported in the streaming execution. For example:
 - joining two streams
 - unioning a stream and a batch source
 - sorting
 - window functions (not time windows)
 - distinct aggregates

Furthermore, executing a query with a stream source as a batch query should also fail.

This patch add an additional step after analysis in the QueryExecution which will check that all the operations in the analyzed logical plan is supported or not.

## How was this patch tested?
unit tests.

Author: Tathagata Das <tathagata.das1565@gmail.com>

Closes #12246 from tdas/SPARK-14473.
2016-04-18 11:09:33 -07:00
Dongjoon Hyun 432d1399cb [SPARK-14614] [SQL] Add bround function
## What changes were proposed in this pull request?

This PR aims to add `bound` function (aka Banker's round) by extending current `round` implementation. [Hive supports `bround` since 1.3.0.](https://cwiki.apache.org/confluence/display/Hive/LanguageManual+UDF)

**Hive (1.3 ~ 2.0)**
```
hive> select round(2.5), bround(2.5);
OK
3.0	2.0
```

**After this PR**
```scala
scala> sql("select round(2.5), bround(2.5)").head
res0: org.apache.spark.sql.Row = [3,2]
```

## How was this patch tested?

Pass the Jenkins tests (with extended tests).

Author: Dongjoon Hyun <dongjoon@apache.org>

Closes #12376 from dongjoon-hyun/SPARK-14614.
2016-04-18 10:44:51 -07:00
Reynold Xin f4be0946af [SPARK-14677][SQL] Make the max number of iterations configurable for Catalyst
## What changes were proposed in this pull request?
We currently hard code the max number of optimizer/analyzer iterations to 100. This patch makes it configurable. While I'm at it, I also added the SessionCatalog to the optimizer, so we can use information there in optimization.

## How was this patch tested?
Updated unit tests to reflect the change.

Author: Reynold Xin <rxin@databricks.com>

Closes #12434 from rxin/SPARK-14677.
2016-04-15 20:28:09 -07:00
Yin Huai b2dfa84959 [SPARK-14668][SQL] Move CurrentDatabase to Catalyst
## What changes were proposed in this pull request?

This PR moves `CurrentDatabase` from sql/hive package to sql/catalyst. It also adds the function description, which looks like the following.

```
scala> sqlContext.sql("describe function extended current_database").collect.foreach(println)
[Function: current_database]
[Class: org.apache.spark.sql.execution.command.CurrentDatabase]
[Usage: current_database() - Returns the current database.]
[Extended Usage:
> SELECT current_database()]
```

## How was this patch tested?
Existing tests

Author: Yin Huai <yhuai@databricks.com>

Closes #12424 from yhuai/SPARK-14668.
2016-04-15 17:48:41 -07:00
Dongjoon Hyun d7e124edfe [SPARK-14545][SQL] Improve LikeSimplification by adding a%b rule
## What changes were proposed in this pull request?

Current `LikeSimplification` handles the following four rules.
- 'a%' => expr.StartsWith("a")
- '%b' => expr.EndsWith("b")
- '%a%' => expr.Contains("a")
- 'a' => EqualTo("a")

This PR adds the following rule.
- 'a%b' => expr.Length() >= 2 && expr.StartsWith("a") && expr.EndsWith("b")

Here, 2 is statically calculated from "a".size + "b".size.

**Before**
```
scala> sql("select a from (select explode(array('abc','adc')) a) T where a like 'a%c'").explain()
== Physical Plan ==
WholeStageCodegen
:  +- Filter a#5 LIKE a%c
:     +- INPUT
+- Generate explode([abc,adc]), false, false, [a#5]
   +- Scan OneRowRelation[]
```

**After**
```
scala> sql("select a from (select explode(array('abc','adc')) a) T where a like 'a%c'").explain()
== Physical Plan ==
WholeStageCodegen
:  +- Filter ((length(a#5) >= 2) && (StartsWith(a#5, a) && EndsWith(a#5, c)))
:     +- INPUT
+- Generate explode([abc,adc]), false, false, [a#5]
   +- Scan OneRowRelation[]
```

## How was this patch tested?

Pass the Jenkins tests (including new testcase).

Author: Dongjoon Hyun <dongjoon@apache.org>

Closes #12312 from dongjoon-hyun/SPARK-14545.
2016-04-14 13:34:29 -07:00
Davies Liu dbbe149070 [SPARK-14581] [SQL] push predicatese through more logical plans
## What changes were proposed in this pull request?

Right now, filter push down only works with Project, Aggregate, Generate and Join, they can't be pushed through many other plans.

This PR added support for Union, Intersect, Except and all unary plans.

## How was this patch tested?

Added tests.

Author: Davies Liu <davies@databricks.com>

Closes #12342 from davies/filter_hint.
2016-04-13 13:01:13 -07:00
Andrew Or 7d2ed8cc03 [SPARK-14388][SQL] Implement CREATE TABLE
## What changes were proposed in this pull request?

This patch implements the `CREATE TABLE` command using the `SessionCatalog`. Previously we handled only `CTAS` and `CREATE TABLE ... USING`. This requires us to refactor `CatalogTable` to accept various fields (e.g. bucket and skew columns) and pass them to Hive.

WIP: Note that I haven't verified whether this actually works yet! But I believe it does.

## How was this patch tested?

Tests will come in a future commit.

Author: Andrew Or <andrew@databricks.com>
Author: Yin Huai <yhuai@databricks.com>

Closes #12271 from andrewor14/create-table-ddl.
2016-04-13 11:08:34 -07:00
Davies Liu 85e68b4bea [SPARK-14562] [SQL] improve constraints propagation in Union
## What changes were proposed in this pull request?

Currently, Union only takes intersect of the constraints from it's children, all others are dropped, we should try to merge them together.

This PR try to merge the constraints that have the same reference but came from different children, for example: `a > 10` and `a < 100` could be merged as `a > 10 || a < 100`.

## How was this patch tested?

Added more cases in existing test.

Author: Davies Liu <davies@databricks.com>

Closes #12328 from davies/union_const.
2016-04-12 12:29:54 -07:00
Dongjoon Hyun b0f5497e95 [SPARK-14508][BUILD] Add a new ScalaStyle Rule OmitBracesInCase
## What changes were proposed in this pull request?

According to the [Spark Code Style Guide](https://cwiki.apache.org/confluence/display/SPARK/Spark+Code+Style+Guide) and [Scala Style Guide](http://docs.scala-lang.org/style/control-structures.html#curlybraces), we had better enforce the following rule.
  ```
  case: Always omit braces in case clauses.
  ```
This PR makes a new ScalaStyle rule, 'OmitBracesInCase', and enforces it to the code.

## How was this patch tested?

Pass the Jenkins tests (including Scala style checking)

Author: Dongjoon Hyun <dongjoon@apache.org>

Closes #12280 from dongjoon-hyun/SPARK-14508.
2016-04-12 00:43:28 -07:00