Commit graph

1335 commits

Author SHA1 Message Date
Michael Armbrust 17eec0a71b [SPARK-13664][SQL] Add a strategy for planning partitioned and bucketed scans of files
This PR adds a new strategy, `FileSourceStrategy`, that can be used for planning scans of collections of files that might be partitioned or bucketed.

Compared with the existing planning logic in `DataSourceStrategy` this version has the following desirable properties:
 - It removes the need to have `RDD`, `broadcastedHadoopConf` and other distributed concerns  in the public API of `org.apache.spark.sql.sources.FileFormat`
 - Partition column appending is delegated to the format to avoid an extra copy / devectorization when appending partition columns
 - It minimizes the amount of data that is shipped to each executor (i.e. it does not send the whole list of files to every worker in the form of a hadoop conf)
 - it natively supports bucketing files into partitions, and thus does not require coalescing / creating a `UnionRDD` with the correct partitioning.
 - Small files are automatically coalesced into fewer tasks using an approximate bin-packing algorithm.

Currently only a testing source is planned / tested using this strategy.  In follow-up PRs we will port the existing formats to this API.

A stub for `FileScanRDD` is also added, but most methods remain unimplemented.

Other minor cleanups:
 - partition pruning is pushed into `FileCatalog` so both the new and old code paths can use this logic.  This will also allow future implementations to use indexes or other tricks (i.e. a MySQL metastore)
 - The partitions from the `FileCatalog` now propagate information about file sizes all the way up to the planner so we can intelligently spread files out.
 - `Array` -> `Seq` in some internal APIs to avoid unnecessary `toArray` calls
 - Rename `Partition` to `PartitionDirectory` to differentiate partitions used earlier in pruning from those where we have already enumerated the files and their sizes.

Author: Michael Armbrust <michael@databricks.com>

Closes #11646 from marmbrus/fileStrategy.
2016-03-14 19:21:12 -07:00
Liang-Chi Hsieh 6a4bfcd62b [SPARK-13658][SQL] BooleanSimplification rule is slow with large boolean expressions
JIRA: https://issues.apache.org/jira/browse/SPARK-13658

## What changes were proposed in this pull request?

Quoted from JIRA description: When run TPCDS Q3 [1] with lots predicates to filter out the partitions, the optimizer rule BooleanSimplification take about 2 seconds (it use lots of sematicsEqual, which require copy the whole tree).

It will great if we could speedup it.

[1] https://github.com/cloudera/impala-tpcds-kit/blob/master/queries/q3.sql

How to speed up it:

When we ask the canonicalized expression in `Expression`, it calls `Canonicalize.execute` on itself. `Canonicalize.execute` basically transforms up all expressions included in this expression. However, we don't keep the canonicalized versions for these children expressions. So in next time we ask the canonicalized expressions for the children expressions (e.g., `BooleanSimplification`), we will rerun `Canonicalize.execute` on each of them. It wastes much time.

By forcing the children expressions to get and keep their canonicalized versions first, we can avoid re-canonicalize these expressions.

I simply benchmark it with an expression which is part of the where clause in TPCDS Q3:

    val testRelation = LocalRelation('ss_sold_date_sk.int, 'd_moy.int, 'i_manufact_id.int, 'ss_item_sk.string, 'i_item_sk.string, 'd_date_sk.int)

    val input = ('d_date_sk === 'ss_sold_date_sk) && ('ss_item_sk === 'i_item_sk) && ('i_manufact_id === 436) && ('d_moy === 12) && (('ss_sold_date_sk > 2415355 && 'ss_sold_date_sk < 2415385) || ('ss_sold_date_sk > 2415720 && 'ss_sold_date_sk < 2415750) || ('ss_sold_date_sk > 2416085 && 'ss_sold_date_sk < 2416115) || ('ss_sold_date_sk > 2416450 && 'ss_sold_date_sk < 2416480) || ('ss_sold_date_sk > 2416816 && 'ss_sold_date_sk < 2416846) || ('ss_sold_date_sk > 2417181 && 'ss_sold_date_sk < 2417211) || ('ss_sold_date_sk > 2417546 && 'ss_sold_date_sk < 2417576) || ('ss_sold_date_sk > 2417911 && 'ss_sold_date_sk < 2417941) || ('ss_sold_date_sk > 2418277 && 'ss_sold_date_sk < 2418307) || ('ss_sold_date_sk > 2418642 && 'ss_sold_date_sk < 2418672) || ('ss_sold_date_sk > 2419007 && 'ss_sold_date_sk < 2419037) || ('ss_sold_date_sk > 2419372 && 'ss_sold_date_sk < 2419402) || ('ss_sold_date_sk > 2419738 && 'ss_sold_date_sk < 2419768) || ('ss_sold_date_sk > 2420103 && 'ss_sold_date_sk < 2420133) || ('ss_sold_date_sk > 2420468 && 'ss_sold_date_sk < 2420498) || ('ss_sold_date_sk > 2420833 && 'ss_sold_date_sk < 2420863) || ('ss_sold_date_sk > 2421199 && 'ss_sold_date_sk < 2421229) || ('ss_sold_date_sk > 2421564 && 'ss_sold_date_sk < 2421594) || ('ss_sold_date_sk > 2421929 && 'ss_sold_date_sk < 2421959) || ('ss_sold_date_sk > 2422294 && 'ss_sold_date_sk < 2422324) || ('ss_sold_date_sk > 2422660 && 'ss_sold_date_sk < 2422690) || ('ss_sold_date_sk > 2423025 && 'ss_sold_date_sk < 2423055) || ('ss_sold_date_sk > 2423390 && 'ss_sold_date_sk < 2423420) || ('ss_sold_date_sk > 2423755 && 'ss_sold_date_sk < 2423785) || ('ss_sold_date_sk > 2424121 && 'ss_sold_date_sk < 2424151) || ('ss_sold_date_sk > 2424486 && 'ss_sold_date_sk < 2424516) || ('ss_sold_date_sk > 2424851 && 'ss_sold_date_sk < 2424881) || ('ss_sold_date_sk > 2425216 && 'ss_sold_date_sk < 2425246) || ('ss_sold_date_sk > 2425582 && 'ss_sold_date_sk < 2425612) || ('ss_sold_date_sk > 2425947 && 'ss_sold_date_sk < 2425977) || ('ss_sold_date_sk > 2426312 && 'ss_sold_date_sk < 2426342) || ('ss_sold_date_sk > 2426677 && 'ss_sold_date_sk < 2426707) || ('ss_sold_date_sk > 2427043 && 'ss_sold_date_sk < 2427073) || ('ss_sold_date_sk > 2427408 && 'ss_sold_date_sk < 2427438) || ('ss_sold_date_sk > 2427773 && 'ss_sold_date_sk < 2427803) || ('ss_sold_date_sk > 2428138 && 'ss_sold_date_sk < 2428168) || ('ss_sold_date_sk > 2428504 && 'ss_sold_date_sk < 2428534) || ('ss_sold_date_sk > 2428869 && 'ss_sold_date_sk < 2428899) || ('ss_sold_date_sk > 2429234 && 'ss_sold_date_sk < 2429264) || ('ss_sold_date_sk > 2429599 && 'ss_sold_date_sk < 2429629) || ('ss_sold_date_sk > 2429965 && 'ss_sold_date_sk < 2429995) || ('ss_sold_date_sk > 2430330 && 'ss_sold_date_sk < 2430360) || ('ss_sold_date_sk > 2430695 && 'ss_sold_date_sk < 2430725) || ('ss_sold_date_sk > 2431060 && 'ss_sold_date_sk < 2431090) || ('ss_sold_date_sk > 2431426 && 'ss_sold_date_sk < 2431456) || ('ss_sold_date_sk > 2431791 && 'ss_sold_date_sk < 2431821) || ('ss_sold_date_sk > 2432156 && 'ss_sold_date_sk < 2432186) || ('ss_sold_date_sk > 2432521 && 'ss_sold_date_sk < 2432551) || ('ss_sold_date_sk > 2432887 && 'ss_sold_date_sk < 2432917) || ('ss_sold_date_sk > 2433252 && 'ss_sold_date_sk < 2433282) || ('ss_sold_date_sk > 2433617 && 'ss_sold_date_sk < 2433647) || ('ss_sold_date_sk > 2433982 && 'ss_sold_date_sk < 2434012) || ('ss_sold_date_sk > 2434348 && 'ss_sold_date_sk < 2434378) || ('ss_sold_date_sk > 2434713 && 'ss_sold_date_sk < 2434743)))

    val plan = testRelation.where(input).analyze
    val actual = Optimize.execute(plan)

With this patch:

    352 milliseconds
    346 milliseconds
    340 milliseconds

Without this patch:

    585 milliseconds
    880 milliseconds
    677 milliseconds

## How was this patch tested?

Existing tests should pass.

Author: Liang-Chi Hsieh <simonh@tw.ibm.com>
Author: Liang-Chi Hsieh <viirya@gmail.com>

Closes #11647 from viirya/improve-expr-canonicalize.
2016-03-14 11:23:29 -07:00
Dongjoon Hyun acdf219703 [MINOR][DOCS] Fix more typos in comments/strings.
## What changes were proposed in this pull request?

This PR fixes 135 typos over 107 files:
* 121 typos in comments
* 11 typos in testcase name
* 3 typos in log messages

## How was this patch tested?

Manual.

Author: Dongjoon Hyun <dongjoon@apache.org>

Closes #11689 from dongjoon-hyun/fix_more_typos.
2016-03-14 09:07:39 +00:00
Sean Owen 1840852841 [SPARK-13823][CORE][STREAMING][SQL] Always specify Charset in String <-> byte[] conversions (and remaining Coverity items)
## What changes were proposed in this pull request?

- Fixes calls to `new String(byte[])` or `String.getBytes()` that rely on platform default encoding, to use UTF-8
- Same for `InputStreamReader` and `OutputStreamWriter` constructors
- Standardizes on UTF-8 everywhere
- Standardizes specifying the encoding with `StandardCharsets.UTF-8`, not the Guava constant or "UTF-8" (which means handling `UnuspportedEncodingException`)
- (also addresses the other remaining Coverity scan issues, which are pretty trivial; these are separated into commit 1deecd8d9c )

## How was this patch tested?

Jenkins tests

Author: Sean Owen <sowen@cloudera.com>

Closes #11657 from srowen/SPARK-13823.
2016-03-13 21:03:49 -07:00
Davies Liu ba8c86d06f [SPARK-13671] [SPARK-13311] [SQL] Use different physical plans for RDD and data sources
## What changes were proposed in this pull request?

This PR split the PhysicalRDD into two classes, PhysicalRDD and PhysicalScan. PhysicalRDD is used for DataFrames that is created from existing RDD. PhysicalScan is used for DataFrame that is created from data sources. This enable use to apply different optimization on both of them.

Also fix the problem for sameResult() on two DataSourceScan.

Also fix the equality check to toString for `In`. It's better to use Seq there, but we can't break this public API (sad).

## How was this patch tested?

Existing tests. Manually tested with TPCDS query Q59 and Q64, all those duplicated exchanges can be re-used now, also saw there are 40+% performance improvement (saving half of the scan).

Author: Davies Liu <davies@databricks.com>

Closes #11514 from davies/existing_rdd.
2016-03-12 00:48:36 -08:00
Andrew Or 66d9d0edfe [SPARK-13139][SQL] Parse Hive DDL commands ourselves
## What changes were proposed in this pull request?

This patch is ported over from viirya's changes in #11048. Currently for most DDLs we just pass the query text directly to Hive. Instead, we should parse these commands ourselves and in the future (not part of this patch) use the `HiveCatalog` to process these DDLs. This is a pretext to merging `SQLContext` and `HiveContext`.

Note: As of this patch we still pass the query text to Hive. The difference is that we now parse the commands ourselves so in the future we can just use our own catalog.

## How was this patch tested?

Jenkins, new `DDLCommandSuite`, which comprises of about 40% of the changes here.

Author: Andrew Or <andrew@databricks.com>

Closes #11573 from andrewor14/parser-plus-plus.
2016-03-11 15:13:48 -08:00
Wenchen Fan 6871cc8f3e [SPARK-12718][SPARK-13720][SQL] SQL generation support for window functions
## What changes were proposed in this pull request?

Add SQL generation support for window functions. The idea is simple, just treat `Window` operator like `Project`, i.e. add subquery to its child when necessary, generate a `SELECT ... FROM ...` SQL string, implement `sql` method for window related expressions, e.g. `WindowSpecDefinition`, `WindowFrame`, etc.

This PR also fixed SPARK-13720 by improving the process of adding extra `SubqueryAlias`(the `RecoverScopingInfo` rule). Before this PR, we update the qualifiers in project list while adding the subquery. However, this is incomplete as we need to update qualifiers in all ancestors that refer attributes here. In this PR, we split `RecoverScopingInfo` into 2 rules: `AddSubQuery` and `UpdateQualifier`. `AddSubQuery` only add subquery if necessary, and `UpdateQualifier` will re-propagate and update qualifiers bottom up.

Ideally we should put the bug fix part in an individual PR, but this bug also blocks the window stuff, so I put them together here.

Many thanks to gatorsmile for the initial discussion and test cases!

## How was this patch tested?

new tests in `LogicalPlanToSQLSuite`

Author: Wenchen Fan <wenchen@databricks.com>

Closes #11555 from cloud-fan/window.
2016-03-11 13:22:34 +08:00
gatorsmile 560489f4e1 [SPARK-13732][SPARK-13797][SQL] Remove projectList from Window and Eliminate useless Window
#### What changes were proposed in this pull request?

`projectList` is useless. Its value is always the same as the child.output. Remove it from the class `Window`. Removal can simplify the codes in Analyzer and Optimizer.

This PR is based on the discussion started by cloud-fan in a separate PR:
https://github.com/apache/spark/pull/5604#discussion_r55140466

This PR also eliminates useless `Window`.

cloud-fan yhuai

#### How was this patch tested?

Existing test cases cover it.

Author: gatorsmile <gatorsmile@gmail.com>
Author: xiaoli <lixiao1983@gmail.com>
Author: Xiao Li <xiaoli@Xiaos-MacBook-Pro.local>

Closes #11565 from gatorsmile/removeProjListWindow.
2016-03-11 11:59:18 +08:00
Sameer Agarwal c3a6269ca9 [SPARK-13789] Infer additional constraints from attribute equality
## What changes were proposed in this pull request?

This PR adds support for inferring an additional set of data constraints based on attribute equality. For e.g., if an operator has constraints of the form (`a = 5`, `a = b`), we can now automatically infer an additional constraint of the form `b = 5`

## How was this patch tested?

Tested that new constraints are properly inferred for filters (by adding a new test) and equi-joins (by modifying an existing test)

Author: Sameer Agarwal <sameer@databricks.com>

Closes #11618 from sameeragarwal/infer-isequal-constraints.
2016-03-10 17:29:45 -08:00
Cheng Lian 1d542785b9 [SPARK-13244][SQL] Migrates DataFrame to Dataset
## What changes were proposed in this pull request?

This PR unifies DataFrame and Dataset by migrating existing DataFrame operations to Dataset and make `DataFrame` a type alias of `Dataset[Row]`.

Most Scala code changes are source compatible, but Java API is broken as Java knows nothing about Scala type alias (mostly replacing `DataFrame` with `Dataset<Row>`).

There are several noticeable API changes related to those returning arrays:

1.  `collect`/`take`

    -   Old APIs in class `DataFrame`:

        ```scala
        def collect(): Array[Row]
        def take(n: Int): Array[Row]
        ```

    -   New APIs in class `Dataset[T]`:

        ```scala
        def collect(): Array[T]
        def take(n: Int): Array[T]

        def collectRows(): Array[Row]
        def takeRows(n: Int): Array[Row]
        ```

    Two specialized methods `collectRows` and `takeRows` are added because Java doesn't support returning generic arrays. Thus, for example, `DataFrame.collect(): Array[T]` actually returns `Object` instead of `Array<T>` from Java side.

    Normally, Java users may fall back to `collectAsList` and `takeAsList`.  The two new specialized versions are added to avoid performance regression in ML related code (but maybe I'm wrong and they are not necessary here).

1.  `randomSplit`

    -   Old APIs in class `DataFrame`:

        ```scala
        def randomSplit(weights: Array[Double], seed: Long): Array[DataFrame]
        def randomSplit(weights: Array[Double]): Array[DataFrame]
        ```

    -   New APIs in class `Dataset[T]`:

        ```scala
        def randomSplit(weights: Array[Double], seed: Long): Array[Dataset[T]]
        def randomSplit(weights: Array[Double]): Array[Dataset[T]]
        ```

    Similar problem as above, but hasn't been addressed for Java API yet.  We can probably add `randomSplitAsList` to fix this one.

1.  `groupBy`

    Some original `DataFrame.groupBy` methods have conflicting signature with original `Dataset.groupBy` methods.  To distinguish these two, typed `Dataset.groupBy` methods are renamed to `groupByKey`.

Other noticeable changes:

1.  Dataset always do eager analysis now

    We used to support disabling DataFrame eager analysis to help reporting partially analyzed malformed logical plan on analysis failure.  However, Dataset encoders requires eager analysi during Dataset construction.  To preserve the error reporting feature, `AnalysisException` now takes an extra `Option[LogicalPlan]` argument to hold the partially analyzed plan, so that we can check the plan tree when reporting test failures.  This plan is passed by `QueryExecution.assertAnalyzed`.

## How was this patch tested?

Existing tests do the work.

## TODO

- [ ] Fix all tests
- [ ] Re-enable MiMA check
- [ ] Update ScalaDoc (`since`, `group`, and example code)

Author: Cheng Lian <lian@databricks.com>
Author: Yin Huai <yhuai@databricks.com>
Author: Wenchen Fan <wenchen@databricks.com>
Author: Cheng Lian <liancheng@users.noreply.github.com>

Closes #11443 from liancheng/ds-to-df.
2016-03-10 17:00:17 -08:00
Dongjoon Hyun 91fed8e9c5 [SPARK-3854][BUILD] Scala style: require spaces before {.
## What changes were proposed in this pull request?

Since the opening curly brace, '{', has many usages as discussed in [SPARK-3854](https://issues.apache.org/jira/browse/SPARK-3854), this PR adds a ScalaStyle rule to prevent '){' pattern  for the following majority pattern and fixes the code accordingly. If we enforce this in ScalaStyle from now, it will improve the Scala code quality and reduce review time.
```
// Correct:
if (true) {
  println("Wow!")
}

// Incorrect:
if (true){
   println("Wow!")
}
```
IntelliJ also shows new warnings based on this.

## How was this patch tested?

Pass the Jenkins ScalaStyle test.

Author: Dongjoon Hyun <dongjoon@apache.org>

Closes #11637 from dongjoon-hyun/SPARK-3854.
2016-03-10 15:57:22 -08:00
Nong Li 747d2f5381 [SPARK-13790] Speed up ColumnVector's getDecimal
## What changes were proposed in this pull request?

We should reuse an object similar to the other non-primitive type getters. For
a query that computes averages over decimal columns, this shows a 10% speedup
on overall query times.

## How was this patch tested?

Existing tests and this benchmark

```
TPCDS Snappy:                       Best/Avg Time(ms)    Rate(M/s)   Per Row(ns)
--------------------------------------------------------------------------------
q27-agg (master)                       10627 / 11057         10.8          92.3
q27-agg (this patch)                     9722 / 9832         11.8          84.4
```

Author: Nong Li <nong@databricks.com>

Closes #11624 from nongli/spark-13790.
2016-03-10 13:31:19 -08:00
Sameer Agarwal 19f4ac6dc7 [SPARK-13759][SQL] Add IsNotNull constraints for expressions with an inequality
## What changes were proposed in this pull request?

This PR adds support for inferring `IsNotNull` constraints from expressions with an `!==`. More specifically, if an operator has a condition on `a !== b`, we know that both `a` and `b` in the operator output can no longer be null.

## How was this patch tested?

1. Modified a test in `ConstraintPropagationSuite` to test for expressions with an inequality.
2. Added a test in `NullFilteringSuite` for making sure an Inner join with a "non-equal" condition appropriately filters out null from their input.

cc nongli

Author: Sameer Agarwal <sameer@databricks.com>

Closes #11594 from sameeragarwal/isnotequal-constraints.
2016-03-10 12:16:46 -08:00
Yin Huai 790646125e Revert "[SPARK-13760][SQL] Fix BigDecimal constructor for FloatType"
This reverts commit 926e9c45a2.
2016-03-09 18:41:38 -08:00
Sameer Agarwal 926e9c45a2 [SPARK-13760][SQL] Fix BigDecimal constructor for FloatType
## What changes were proposed in this pull request?

A very minor change for using `BigDecimal.decimal(f: Float)` instead of `BigDecimal(f: float)`. The latter is deprecated and can result in inconsistencies due to an implicit conversion to `Double`.

## How was this patch tested?

N/A

cc yhuai

Author: Sameer Agarwal <sameer@databricks.com>

Closes #11597 from sameeragarwal/bigdecimal.
2016-03-09 18:16:29 -08:00
Sameer Agarwal dbf2a7cfad [SPARK-13781][SQL] Use ExpressionSets in ConstraintPropagationSuite
## What changes were proposed in this pull request?

This PR is a small follow up on https://github.com/apache/spark/pull/11338 (https://issues.apache.org/jira/browse/SPARK-13092) to use `ExpressionSet` as part of the verification logic in `ConstraintPropagationSuite`.
## How was this patch tested?

No new tests added. Just changes the verification logic in `ConstraintPropagationSuite`.

Author: Sameer Agarwal <sameer@databricks.com>

Closes #11611 from sameeragarwal/expression-set.
2016-03-09 15:27:18 -08:00
gatorsmile c6aa356cd8 [SPARK-13527][SQL] Prune Filters based on Constraints
#### What changes were proposed in this pull request?

Remove all the deterministic conditions in a [[Filter]] that are contained in the Child's Constraints.

For example, the first query can be simplified to the second one.

```scala
    val queryWithUselessFilter = tr1
      .where("tr1.a".attr > 10 || "tr1.c".attr < 10)
      .join(tr2.where('d.attr < 100), Inner, Some("tr1.a".attr === "tr2.a".attr))
      .where(
        ("tr1.a".attr > 10 || "tr1.c".attr < 10) &&
        'd.attr < 100 &&
        "tr2.a".attr === "tr1.a".attr)
```
```scala
    val query = tr1
      .where("tr1.a".attr > 10 || "tr1.c".attr < 10)
      .join(tr2.where('d.attr < 100), Inner, Some("tr1.a".attr === "tr2.a".attr))
```
#### How was this patch tested?

Six test cases are added.

Author: gatorsmile <gatorsmile@gmail.com>

Closes #11406 from gatorsmile/FilterRemoval.
2016-03-09 12:50:55 -08:00
Davies Liu 3dc9ae2e15 [SPARK-13523] [SQL] Reuse exchanges in a query
## What changes were proposed in this pull request?

It’s possible to have common parts in a query, for example, self join, it will be good to avoid the duplicated part to same CPUs and memory (Broadcast or cache).

Exchange will materialize the underlying RDD by shuffle or collect, it’s a great point to check duplicates and reuse them. Duplicated exchanges means they generate exactly the same result inside a query.

In order to find out the duplicated exchanges, we should be able to compare SparkPlan to check that they have same results or not. We already have that for LogicalPlan, so we should move that into QueryPlan to make it available for SparkPlan.

Once we can find the duplicated exchanges, we should replace all of them with same SparkPlan object (could be wrapped by ReusedExchage for explain), then the plan tree become a DAG. Since all the planner only work with tree, so this rule should be the last one for the entire planning.

After the rule, the plan will looks like:

```
WholeStageCodegen
:  +- Project [id#0L]
:     +- BroadcastHashJoin [id#0L], [id#2L], Inner, BuildRight, None
:        :- Project [id#0L]
:        :  +- BroadcastHashJoin [id#0L], [id#1L], Inner, BuildRight, None
:        :     :- Range 0, 1, 4, 1024, [id#0L]
:        :     +- INPUT
:        +- INPUT
:- BroadcastExchange HashedRelationBroadcastMode(true,List(id#1L),List(id#1L))
:  +- WholeStageCodegen
:     :  +- Range 0, 1, 4, 1024, [id#1L]
+- ReusedExchange [id#2L], BroadcastExchange HashedRelationBroadcastMode(true,List(id#1L),List(id#1L))
```

![bjoin](https://cloud.githubusercontent.com/assets/40902/13414787/209e8c5c-df0a-11e5-8a0f-edff69d89e83.png)

For three ways SortMergeJoin,
```
== Physical Plan ==
WholeStageCodegen
:  +- Project [id#0L]
:     +- SortMergeJoin [id#0L], [id#4L], None
:        :- INPUT
:        +- INPUT
:- WholeStageCodegen
:  :  +- Project [id#0L]
:  :     +- SortMergeJoin [id#0L], [id#3L], None
:  :        :- INPUT
:  :        +- INPUT
:  :- WholeStageCodegen
:  :  :  +- Sort [id#0L ASC], false, 0
:  :  :     +- INPUT
:  :  +- Exchange hashpartitioning(id#0L, 200), None
:  :     +- WholeStageCodegen
:  :        :  +- Range 0, 1, 4, 33554432, [id#0L]
:  +- WholeStageCodegen
:     :  +- Sort [id#3L ASC], false, 0
:     :     +- INPUT
:     +- ReusedExchange [id#3L], Exchange hashpartitioning(id#0L, 200), None
+- WholeStageCodegen
   :  +- Sort [id#4L ASC], false, 0
   :     +- INPUT
   +- ReusedExchange [id#4L], Exchange hashpartitioning(id#0L, 200), None
```
![sjoin](https://cloud.githubusercontent.com/assets/40902/13414790/27aea61c-df0a-11e5-8cbf-fbc985c31d95.png)

If the same ShuffleExchange or BroadcastExchange, execute()/executeBroadcast() will be called by different parents, they should cached the RDD/Broadcast, return the same one for all the parents.

## How was this patch tested?

Added some unit tests for this.  Had done some manual tests on TPCDS query Q59 and Q64, we can see some exchanges are re-used (this requires a change in PhysicalRDD to for sameResult, is be done in #11514 ).

Author: Davies Liu <davies@databricks.com>

Closes #11403 from davies/dedup.
2016-03-09 12:04:29 -08:00
gatorsmile 23369c3bd2 [SPARK-13763][SQL] Remove Project when its Child's Output is Nil
#### What changes were proposed in this pull request?

As shown in another PR: https://github.com/apache/spark/pull/11596, we are using `SELECT 1` as a dummy table, when the table is used for SQL statements in which a table reference is required, but the contents of the table are not important. For example,

```SQL
SELECT value FROM (select 1) dummyTable Lateral View explode(array(1,2,3)) adTable as value
```
Before the PR, the optimized plan contains a useless `Project` after Optimizer executing the `ColumnPruning` rule, as shown below:

```
== Analyzed Logical Plan ==
value: int
Project [value#22]
+- Generate explode(array(1, 2, 3)), true, false, Some(adtable), [value#22]
   +- SubqueryAlias dummyTable
      +- Project [1 AS 1#21]
         +- OneRowRelation$

== Optimized Logical Plan ==
Generate explode([1,2,3]), false, false, Some(adtable), [value#22]
+- Project
   +- OneRowRelation$
```

After the fix, the optimized plan removed the useless `Project`, as shown below:
```
== Optimized Logical Plan ==
Generate explode([1,2,3]), false, false, Some(adtable), [value#22]
+- OneRowRelation$
```

This PR is to remove `Project` when its Child's output is Nil

#### How was this patch tested?

Added a new unit test case into the suite `ColumnPruningSuite.scala`

Author: gatorsmile <gatorsmile@gmail.com>

Closes #11599 from gatorsmile/projectOneRowRelation.
2016-03-09 10:29:27 -08:00
Davies Liu 7791d0c3a9 Revert "[SPARK-13668][SQL] Reorder filter/join predicates to short-circuit isNotNull checks"
This reverts commit e430614eae.
2016-03-09 10:05:57 -08:00
Davies Liu 9634e17d01 [SPARK-13242] [SQL] codegen fallback in case-when if there many branches
## What changes were proposed in this pull request?

If there are many branches in a CaseWhen expression, the generated code could go above the 64K limit for single java method, will fail to compile. This PR change it to fallback to interpret mode if there are more than 20 branches.

This PR is based on #11243 and #11221, thanks to joehalliwell

Closes #11243
Closes #11221

## How was this patch tested?

Add a test with 50 branches.

Author: Davies Liu <davies@databricks.com>

Closes #11592 from davies/fix_when.
2016-03-09 09:27:28 -08:00
Dilip Biswal 53ba6d6e59 [SPARK-13698][SQL] Fix Analysis Exceptions when Using Backticks in Generate
## What changes were proposed in this pull request?
Analysis exception occurs while running the following query.
```
SELECT ints FROM nestedArray LATERAL VIEW explode(a.b) `a` AS `ints`
```
```
Failed to analyze query: org.apache.spark.sql.AnalysisException: cannot resolve '`ints`' given input columns: [a, `ints`]; line 1 pos 7
'Project ['ints]
+- Generate explode(a#0.b), true, false, Some(a), [`ints`#8]
   +- SubqueryAlias nestedarray
      +- LocalRelation [a#0], [[[[1,2,3]]]]
```

## How was this patch tested?

Added new unit tests in SQLQuerySuite and HiveQlSuite

Author: Dilip Biswal <dbiswal@us.ibm.com>

Closes #11538 from dilipbiswal/SPARK-13698.
2016-03-09 21:49:37 +08:00
Dongjoon Hyun c3689bc24e [SPARK-13702][CORE][SQL][MLLIB] Use diamond operator for generic instance creation in Java code.
## What changes were proposed in this pull request?

In order to make `docs/examples` (and other related code) more simple/readable/user-friendly, this PR replaces existing codes like the followings by using `diamond` operator.

```
-    final ArrayList<Product2<Object, Object>> dataToWrite =
-      new ArrayList<Product2<Object, Object>>();
+    final ArrayList<Product2<Object, Object>> dataToWrite = new ArrayList<>();
```

Java 7 or higher supports **diamond** operator which replaces the type arguments required to invoke the constructor of a generic class with an empty set of type parameters (<>). Currently, Spark Java code use mixed usage of this.

## How was this patch tested?

Manual.
Pass the existing tests.

Author: Dongjoon Hyun <dongjoon@apache.org>

Closes #11541 from dongjoon-hyun/SPARK-13702.
2016-03-09 10:31:26 +00:00
Takuya UESHIN 2c5af7d4d9 [SPARK-13640][SQL] Synchronize ScalaReflection.mirror method.
## What changes were proposed in this pull request?

`ScalaReflection.mirror` method should be synchronized when scala version is `2.10` because `universe.runtimeMirror` is not thread safe.

## How was this patch tested?

I added a test to check thread safety of `ScalaRefection.mirror` method in `ScalaReflectionSuite`, which will throw the following Exception in Scala `2.10` without this patch:

```
[info] - thread safety of mirror *** FAILED *** (49 milliseconds)
[info]   java.lang.UnsupportedOperationException: tail of empty list
[info]   at scala.collection.immutable.Nil$.tail(List.scala:339)
[info]   at scala.collection.immutable.Nil$.tail(List.scala:334)
[info]   at scala.reflect.internal.SymbolTable.popPhase(SymbolTable.scala:172)
[info]   at scala.reflect.internal.Symbols$Symbol.unsafeTypeParams(Symbols.scala:1477)
[info]   at scala.reflect.internal.Symbols$TypeSymbol.tpe(Symbols.scala:2777)
[info]   at scala.reflect.internal.Mirrors$RootsBase.init(Mirrors.scala:235)
[info]   at scala.reflect.runtime.JavaMirrors$class.createMirror(JavaMirrors.scala:34)
[info]   at scala.reflect.runtime.JavaMirrors$class.runtimeMirror(JavaMirrors.scala:61)
[info]   at scala.reflect.runtime.JavaUniverse.runtimeMirror(JavaUniverse.scala:12)
[info]   at scala.reflect.runtime.JavaUniverse.runtimeMirror(JavaUniverse.scala:12)
[info]   at org.apache.spark.sql.catalyst.ScalaReflection$.mirror(ScalaReflection.scala:36)
[info]   at org.apache.spark.sql.catalyst.ScalaReflectionSuite$$anonfun$12$$anonfun$apply$mcV$sp$1$$anonfun$apply$1$$anonfun$apply$2.apply(ScalaReflectionSuite.scala:256)
[info]   at org.apache.spark.sql.catalyst.ScalaReflectionSuite$$anonfun$12$$anonfun$apply$mcV$sp$1$$anonfun$apply$1$$anonfun$apply$2.apply(ScalaReflectionSuite.scala:252)
[info]   at scala.concurrent.impl.Future$PromiseCompletingRunnable.liftedTree1$1(Future.scala:24)
[info]   at scala.concurrent.impl.Future$PromiseCompletingRunnable.run(Future.scala:24)
[info]   at scala.concurrent.impl.ExecutionContextImpl$$anon$3.exec(ExecutionContextImpl.scala:107)
[info]   at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
[info]   at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
[info]   at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
[info]   at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
```

Notice that the test will pass when Scala version is `2.11`.

Author: Takuya UESHIN <ueshin@happy-camper.st>

Closes #11487 from ueshin/issues/SPARK-13640.
2016-03-09 10:23:27 +00:00
Jakob Odersky 035d3acdf3 [SPARK-7286][SQL] Deprecate !== in favour of =!=
This PR replaces #9925 which had issues with CI. **Please see the original PR for any previous discussions.**

## What changes were proposed in this pull request?
Deprecate the SparkSQL column operator !== and use =!= as an alternative.
Fixes subtle issues related to operator precedence (basically, !== does not have the same priority as its logical negation, ===).

## How was this patch tested?
All currently existing tests.

Author: Jakob Odersky <jodersky@gmail.com>

Closes #11588 from jodersky/SPARK-7286.
2016-03-08 18:11:09 -08:00
Sameer Agarwal e430614eae [SPARK-13668][SQL] Reorder filter/join predicates to short-circuit isNotNull checks
## What changes were proposed in this pull request?

If a filter predicate or a join condition consists of `IsNotNull` checks, we should reorder these checks such that these non-nullability checks are evaluated before the rest of the predicates.

For e.g., if a filter predicate is of the form `a > 5 && isNotNull(b)`, we should rewrite this as `isNotNull(b) && a > 5` during physical plan generation.

## How was this patch tested?

new unit tests that verify the physical plan for both filters and joins in `ReorderedPredicateSuite`

Author: Sameer Agarwal <sameer@databricks.com>

Closes #11511 from sameeragarwal/reorder-isnotnull.
2016-03-08 15:40:45 -08:00
Dongjoon Hyun 076009b949 [SPARK-13400] Stop using deprecated Octal escape literals
## What changes were proposed in this pull request?

This removes the remaining deprecated Octal escape literals. The followings are the warnings on those two lines.
```
LiteralExpressionSuite.scala:99: Octal escape literals are deprecated, use \u0000 instead.
HiveQlSuite.scala:74: Octal escape literals are deprecated, use \u002c instead.
```

## How was this patch tested?

Manual.
During building, there should be no warning on `Octal escape literals`.
```
mvn -DskipTests clean install
```

Author: Dongjoon Hyun <dongjoon@apache.org>

Closes #11584 from dongjoon-hyun/SPARK-13400.
2016-03-08 15:00:26 -08:00
Wenchen Fan 46881b4ea2 [SPARK-12727][SQL] support SQL generation for aggregate with multi-distinct
## What changes were proposed in this pull request?

This PR add SQL generation support for aggregate with multi-distinct, by simply moving the `DistinctAggregationRewriter` rule to optimizer.

More discussions are needed as this breaks an import contract: analyzed plan should be able to run without optimization.  However, the `ComputeCurrentTime` rule has kind of broken it already, and I think maybe we should add a new phase for this kind of rules, because strictly speaking they don't belong to analysis and is coupled with the physical plan implementation.

## How was this patch tested?

existing tests

Author: Wenchen Fan <wenchen@databricks.com>

Closes #11579 from cloud-fan/distinct.
2016-03-08 11:45:08 -08:00
Davies Liu 78d3b6051e [SPARK-13657] [SQL] Support parsing very long AND/OR expressions
## What changes were proposed in this pull request?

In order to avoid StackOverflow when parse a expression with hundreds of ORs, we should use loop instead of recursive functions to flatten the tree as list. This PR also build a balanced tree to reduce the depth of generated And/Or expression, to avoid StackOverflow in analyzer/optimizer.

## How was this patch tested?

Add new unit tests. Manually tested with TPCDS Q3 with hundreds predicates in it [1]. These predicates help to reduce the number of partitions, then the query time went from 60 seconds to 8 seconds.

[1] https://github.com/cloudera/impala-tpcds-kit/blob/master/queries/q3.sql

Author: Davies Liu <davies@databricks.com>

Closes #11501 from davies/long_or.
2016-03-08 10:23:19 -08:00
Wenchen Fan 7d05d02bff [SPARK-13637][SQL] use more information to simplify the code in Expand builder
## What changes were proposed in this pull request?

The code in `Expand.apply` can be simplified by existing information:

* the `groupByExprs` parameter are all `Attribute`s
* the `child` parameter is a `Project` that append aliased group by expressions to its child's output

## How was this patch tested?

by existing tests.

Author: Wenchen Fan <wenchen@databricks.com>

Closes #11485 from cloud-fan/expand.
2016-03-08 23:34:42 +08:00
Davies Liu 25bba58d16 [SPARK-13404] [SQL] Create variables for input row when it's actually used
## What changes were proposed in this pull request?

This PR change the way how we generate the code for the output variables passing from a plan to it's parent.

Right now, they are generated before call consume() of it's parent. It's not efficient, if the parent is a Filter or Join, which could filter out most the rows, the time to access some of the columns that are not used by the Filter or Join are wasted.

This PR try to improve this by defering the access of columns until they are actually used by a plan. After this PR, a plan does not need to generate code to evaluate the variables for output, just passing the ExprCode to its parent by `consume()`. In `parent.consumeChild()`, it will check the output from child and `usedInputs`, generate the code for those columns that is part of `usedInputs` before calling `doConsume()`.

This PR also change the `if` from
```
if (cond) {
  xxx
}
```
to
```
if (!cond) continue;
xxx
```
The new one could help to reduce the nested indents for multiple levels of Filter and BroadcastHashJoin.

It also added some comments for operators.

## How was the this patch tested?

Unit tests. Manually ran TPCDS Q55, this PR improve the performance about 30% (scale=10, from 2.56s to 1.96s)

Author: Davies Liu <davies@databricks.com>

Closes #11274 from davies/gen_defer.
2016-03-07 20:09:08 -08:00
Andrew Or da7bfac488 [SPARK-13689][SQL] Move helper things in CatalystQl to new utils object
## What changes were proposed in this pull request?

When we add more DDL parsing logic in the future, SparkQl will become very big. To keep it smaller, we'll introduce helper "parser objects", e.g. one to parse alter table commands. However, these parser objects will need to access some helper methods that exist in CatalystQl. The proposal is to move those methods to an isolated ParserUtils object.

This is based on viirya's changes in #11048. It prefaces the bigger fix for SPARK-13139 to make the diff of that patch smaller.

## How was this patch tested?

No change in functionality, so just Jenkins.

Author: Andrew Or <andrew@databricks.com>

Closes #11529 from andrewor14/parser-utils.
2016-03-07 18:01:27 -08:00
Michael Armbrust e720dda42e [SPARK-13665][SQL] Separate the concerns of HadoopFsRelation
`HadoopFsRelation` is used for reading most files into Spark SQL.  However today this class mixes the concerns of file management, schema reconciliation, scan building, bucketing, partitioning, and writing data.  As a result, many data sources are forced to reimplement the same functionality and the various layers have accumulated a fair bit of inefficiency.  This PR is a first cut at separating this into several components / interfaces that are each described below.  Additionally, all implementations inside of Spark (parquet, csv, json, text, orc, svmlib) have been ported to the new API `FileFormat`.  External libraries, such as spark-avro will also need to be ported to work with Spark 2.0.

### HadoopFsRelation
A simple `case class` that acts as a container for all of the metadata required to read from a datasource.  All discovery, resolution and merging logic for schemas and partitions has been removed.  This an internal representation that no longer needs to be exposed to developers.

```scala
case class HadoopFsRelation(
    sqlContext: SQLContext,
    location: FileCatalog,
    partitionSchema: StructType,
    dataSchema: StructType,
    bucketSpec: Option[BucketSpec],
    fileFormat: FileFormat,
    options: Map[String, String]) extends BaseRelation
```

### FileFormat
The primary interface that will be implemented by each different format including external libraries.  Implementors are responsible for reading a given format and converting it into `InternalRow` as well as writing out an `InternalRow`.  A format can optionally return a schema that is inferred from a set of files.

```scala
trait FileFormat {
  def inferSchema(
      sqlContext: SQLContext,
      options: Map[String, String],
      files: Seq[FileStatus]): Option[StructType]

  def prepareWrite(
      sqlContext: SQLContext,
      job: Job,
      options: Map[String, String],
      dataSchema: StructType): OutputWriterFactory

  def buildInternalScan(
      sqlContext: SQLContext,
      dataSchema: StructType,
      requiredColumns: Array[String],
      filters: Array[Filter],
      bucketSet: Option[BitSet],
      inputFiles: Array[FileStatus],
      broadcastedConf: Broadcast[SerializableConfiguration],
      options: Map[String, String]): RDD[InternalRow]
}
```

The current interface is based on what was required to get all the tests passing again, but still mixes a couple of concerns (i.e. `bucketSet` is passed down to the scan instead of being resolved by the planner).  Additionally, scans are still returning `RDD`s instead of iterators for single files.  In a future PR, bucketing should be removed from this interface and the scan should be isolated to a single file.

### FileCatalog
This interface is used to list the files that make up a given relation, as well as handle directory based partitioning.

```scala
trait FileCatalog {
  def paths: Seq[Path]
  def partitionSpec(schema: Option[StructType]): PartitionSpec
  def allFiles(): Seq[FileStatus]
  def getStatus(path: Path): Array[FileStatus]
  def refresh(): Unit
}
```

Currently there are two implementations:
 - `HDFSFileCatalog` - based on code from the old `HadoopFsRelation`.  Infers partitioning by recursive listing and caches this data for performance
 - `HiveFileCatalog` - based on the above, but it uses the partition spec from the Hive Metastore.

### ResolvedDataSource
Produces a logical plan given the following description of a Data Source (which can come from DataFrameReader or a metastore):
 - `paths: Seq[String] = Nil`
 - `userSpecifiedSchema: Option[StructType] = None`
 - `partitionColumns: Array[String] = Array.empty`
 - `bucketSpec: Option[BucketSpec] = None`
 - `provider: String`
 - `options: Map[String, String]`

This class is responsible for deciding which of the Data Source APIs a given provider is using (including the non-file based ones).  All reconciliation of partitions, buckets, schema from metastores or inference is done here.

### DataSourceAnalysis / DataSourceStrategy
Responsible for analyzing and planning reading/writing of data using any of the Data Source APIs, including:
 - pruning the files from partitions that will be read based on filters.
 - appending partition columns*
 - applying additional filters when a data source can not evaluate them internally.
 - constructing an RDD that is bucketed correctly when required*
 - sanity checking schema match-up and other analysis when writing.

*In the future we should do that following:
 - Break out file handling into its own Strategy as its sufficiently complex / isolated.
 - Push the appending of partition columns down in to `FileFormat` to avoid an extra copy / unvectorization.
 - Use a custom RDD for scans instead of `SQLNewNewHadoopRDD2`

Author: Michael Armbrust <michael@databricks.com>
Author: Wenchen Fan <wenchen@databricks.com>

Closes #11509 from marmbrus/fileDataSource.
2016-03-07 15:15:10 -08:00
gatorsmile b6071a7001 [SPARK-13722][SQL] No Push Down for Non-deterministics Predicates through Generate
#### What changes were proposed in this pull request?

Non-deterministic predicates should not be pushed through Generate.

#### How was this patch tested?

Added a test case in `FilterPushdownSuite.scala`

Author: gatorsmile <gatorsmile@gmail.com>

Closes #11562 from gatorsmile/pushPredicateDownWindow.
2016-03-07 12:09:27 -08:00
Sameer Agarwal ef77003178 [SPARK-13495][SQL] Add Null Filters in the query plan for Filters/Joins based on their data constraints
## What changes were proposed in this pull request?

This PR adds an optimizer rule to eliminate reading (unnecessary) NULL values if they are not required for correctness by inserting `isNotNull` filters is the query plan. These filters are currently inserted beneath existing `Filter` and `Join` operators and are inferred based on their data constraints.

Note: While this optimization is applicable to all types of join, it primarily benefits `Inner` and `LeftSemi` joins.

## How was this patch tested?

1. Added a new `NullFilteringSuite` that tests for `IsNotNull` filters in the query plan for joins and filters. Also, tests interaction with the `CombineFilters` optimizer rules.
2. Test generated ExpressionTrees via `OrcFilterSuite`
3. Test filter source pushdown logic via `SimpleTextHadoopFsRelationSuite`

cc yhuai nongli

Author: Sameer Agarwal <sameer@databricks.com>

Closes #11372 from sameeragarwal/gen-isnotnull.
2016-03-07 12:04:59 -08:00
Wenchen Fan 4896411176 [SPARK-13694][SQL] QueryPlan.expressions should always include all expressions
## What changes were proposed in this pull request?

It's weird that expressions don't always have all the expressions in it. This PR marks `QueryPlan.expressions` final to forbid sub classes overriding it to exclude some expressions. Currently only `Generate` override it, we can use `producedAttributes` to fix the unresolved attribute problem for it.

Note that this PR doesn't fix the problem in #11497

## How was this patch tested?

existing tests.

Author: Wenchen Fan <wenchen@databricks.com>

Closes #11532 from cloud-fan/generate.
2016-03-07 10:32:34 -08:00
Dilip Biswal d7eac9d795 [SPARK-13651] Generator outputs are not resolved correctly resulting in run time error
## What changes were proposed in this pull request?

```
Seq(("id1", "value1")).toDF("key", "value").registerTempTable("src")
sqlContext.sql("SELECT t1.* FROM src LATERAL VIEW explode(map('key1', 100, 'key2', 200)) t1 AS key, value")
```
Results in following logical plan

```
Project [key#2,value#3]
+- Generate explode(HiveGenericUDF#org.apache.hadoop.hive.ql.udf.generic.GenericUDFMap(key1,100,key2,200)), true, false, Some(genoutput), [key#2,value#3]
   +- SubqueryAlias src
      +- Project [_1#0 AS key#2,_2#1 AS value#3]
         +- LocalRelation [_1#0,_2#1], [[id1,value1]]
```

The above query fails with following runtime error.
```
java.lang.ClassCastException: java.lang.Integer cannot be cast to org.apache.spark.unsafe.types.UTF8String
	at org.apache.spark.sql.catalyst.expressions.BaseGenericInternalRow$class.getUTF8String(rows.scala:46)
	at org.apache.spark.sql.catalyst.expressions.GenericInternalRow.getUTF8String(rows.scala:221)
	at org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection.apply(generated.java:42)
	at org.apache.spark.sql.execution.Generate$$anonfun$doExecute$1$$anonfun$apply$9.apply(Generate.scala:98)
	at org.apache.spark.sql.execution.Generate$$anonfun$doExecute$1$$anonfun$apply$9.apply(Generate.scala:96)
	at scala.collection.Iterator$$anon$11.next(Iterator.scala:370)
	at scala.collection.Iterator$$anon$11.next(Iterator.scala:370)
	at scala.collection.Iterator$class.foreach(Iterator.scala:742)
	at scala.collection.AbstractIterator.foreach(Iterator.scala:1194)
        <stack-trace omitted.....>
```
In this case the generated outputs are wrongly resolved from its child (LocalRelation) due to
https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala#L537-L548
## How was this patch tested?

(Please explain how this patch was tested. E.g. unit tests, integration tests, manual tests)

Added unit tests in hive/SQLQuerySuite and AnalysisSuite

Author: Dilip Biswal <dbiswal@us.ibm.com>

Closes #11497 from dilipbiswal/spark-13651.
2016-03-07 09:46:28 -08:00
Andrew Or bc7a3ec290 [SPARK-13685][SQL] Rename catalog.Catalog to ExternalCatalog
## What changes were proposed in this pull request?

Today we have `analysis.Catalog` and `catalog.Catalog`. In the future the former will call the latter. When that happens, if both of them are still called `Catalog` it will be very confusing. This patch renames the latter `ExternalCatalog` because it is expected to talk to external systems.

## How was this patch tested?

Jenkins.

Author: Andrew Or <andrew@databricks.com>

Closes #11526 from andrewor14/rename-catalog.
2016-03-07 00:14:40 -08:00
gatorsmile adce5ee721 [SPARK-12720][SQL] SQL Generation Support for Cube, Rollup, and Grouping Sets
#### What changes were proposed in this pull request?

This PR is for supporting SQL generation for cube, rollup and grouping sets.

For example, a query using rollup:
```SQL
SELECT count(*) as cnt, key % 5, grouping_id() FROM t1 GROUP BY key % 5 WITH ROLLUP
```
Original logical plan:
```
  Aggregate [(key#17L % cast(5 as bigint))#47L,grouping__id#46],
            [(count(1),mode=Complete,isDistinct=false) AS cnt#43L,
             (key#17L % cast(5 as bigint))#47L AS _c1#45L,
             grouping__id#46 AS _c2#44]
  +- Expand [List(key#17L, value#18, (key#17L % cast(5 as bigint))#47L, 0),
             List(key#17L, value#18, null, 1)],
            [key#17L,value#18,(key#17L % cast(5 as bigint))#47L,grouping__id#46]
     +- Project [key#17L,
                 value#18,
                 (key#17L % cast(5 as bigint)) AS (key#17L % cast(5 as bigint))#47L]
        +- Subquery t1
           +- Relation[key#17L,value#18] ParquetRelation
```
Converted SQL:
```SQL
  SELECT count( 1) AS `cnt`,
         (`t1`.`key` % CAST(5 AS BIGINT)),
         grouping_id() AS `_c2`
  FROM `default`.`t1`
  GROUP BY (`t1`.`key` % CAST(5 AS BIGINT))
  GROUPING SETS (((`t1`.`key` % CAST(5 AS BIGINT))), ())
```

#### How was the this patch tested?

Added eight test cases in `LogicalPlanToSQLSuite`.

Author: gatorsmile <gatorsmile@gmail.com>
Author: xiaoli <lixiao1983@gmail.com>
Author: Xiao Li <xiaoli@Xiaos-MacBook-Pro.local>

Closes #11283 from gatorsmile/groupingSetsToSQL.
2016-03-05 19:25:03 +08:00
Andrew Or b7d4147421 [SPARK-13633][SQL] Move things into catalyst.parser package
## What changes were proposed in this pull request?

This patch simply moves things to existing package `o.a.s.sql.catalyst.parser` in an effort to reduce the size of the diff in #11048. This is conceptually the same as a recently merged patch #11482.

## How was this patch tested?

Jenkins.

Author: Andrew Or <andrew@databricks.com>

Closes #11506 from andrewor14/parser-package.
2016-03-04 10:32:00 -08:00
Davies Liu dd83c209f1 [SPARK-13603][SQL] support SQL generation for subquery
## What changes were proposed in this pull request?

This is support SQL generation for subquery expressions, which will be replaced to a SubqueryHolder inside SQLBuilder recursively.

## How was this patch tested?

Added unit tests.

Author: Davies Liu <davies@databricks.com>

Closes #11453 from davies/sql_subquery.
2016-03-04 16:18:15 +08:00
Davies Liu b373a88862 [SPARK-13415][SQL] Visualize subquery in SQL web UI
## What changes were proposed in this pull request?

This PR support visualization for subquery in SQL web UI, also improve the explain of subquery, especially when it's used together with whole stage codegen.

For example:
```python
>>> sqlContext.range(100).registerTempTable("range")
>>> sqlContext.sql("select id / (select sum(id) from range) from range where id > (select id from range limit 1)").explain(True)
== Parsed Logical Plan ==
'Project [unresolvedalias(('id / subquery#9), None)]
:  +- 'SubqueryAlias subquery#9
:     +- 'Project [unresolvedalias('sum('id), None)]
:        +- 'UnresolvedRelation `range`, None
+- 'Filter ('id > subquery#8)
   :  +- 'SubqueryAlias subquery#8
   :     +- 'GlobalLimit 1
   :        +- 'LocalLimit 1
   :           +- 'Project [unresolvedalias('id, None)]
   :              +- 'UnresolvedRelation `range`, None
   +- 'UnresolvedRelation `range`, None

== Analyzed Logical Plan ==
(id / scalarsubquery()): double
Project [(cast(id#0L as double) / cast(subquery#9 as double)) AS (id / scalarsubquery())#11]
:  +- SubqueryAlias subquery#9
:     +- Aggregate [(sum(id#0L),mode=Complete,isDistinct=false) AS sum(id)#10L]
:        +- SubqueryAlias range
:           +- Range 0, 100, 1, 4, [id#0L]
+- Filter (id#0L > subquery#8)
   :  +- SubqueryAlias subquery#8
   :     +- GlobalLimit 1
   :        +- LocalLimit 1
   :           +- Project [id#0L]
   :              +- SubqueryAlias range
   :                 +- Range 0, 100, 1, 4, [id#0L]
   +- SubqueryAlias range
      +- Range 0, 100, 1, 4, [id#0L]

== Optimized Logical Plan ==
Project [(cast(id#0L as double) / cast(subquery#9 as double)) AS (id / scalarsubquery())#11]
:  +- SubqueryAlias subquery#9
:     +- Aggregate [(sum(id#0L),mode=Complete,isDistinct=false) AS sum(id)#10L]
:        +- Range 0, 100, 1, 4, [id#0L]
+- Filter (id#0L > subquery#8)
   :  +- SubqueryAlias subquery#8
   :     +- GlobalLimit 1
   :        +- LocalLimit 1
   :           +- Project [id#0L]
   :              +- Range 0, 100, 1, 4, [id#0L]
   +- Range 0, 100, 1, 4, [id#0L]

== Physical Plan ==
WholeStageCodegen
:  +- Project [(cast(id#0L as double) / cast(subquery#9 as double)) AS (id / scalarsubquery())#11]
:     :  +- Subquery subquery#9
:     :     +- WholeStageCodegen
:     :        :  +- TungstenAggregate(key=[], functions=[(sum(id#0L),mode=Final,isDistinct=false)], output=[sum(id)#10L])
:     :        :     +- INPUT
:     :        +- Exchange SinglePartition, None
:     :           +- WholeStageCodegen
:     :              :  +- TungstenAggregate(key=[], functions=[(sum(id#0L),mode=Partial,isDistinct=false)], output=[sum#14L])
:     :              :     +- Range 0, 1, 4, 100, [id#0L]
:     +- Filter (id#0L > subquery#8)
:        :  +- Subquery subquery#8
:        :     +- CollectLimit 1
:        :        +- WholeStageCodegen
:        :           :  +- Project [id#0L]
:        :           :     +- Range 0, 1, 4, 100, [id#0L]
:        +- Range 0, 1, 4, 100, [id#0L]
```

The web UI looks like:

![subquery](https://cloud.githubusercontent.com/assets/40902/13377963/932bcbae-dda7-11e5-82f7-03c9be85d77c.png)

This PR also change the tree structure of WholeStageCodegen to make it consistent than others. Before this change, Both WholeStageCodegen and InputAdapter hold a references to the same plans, those could be updated without notify another, causing problems, this is discovered by #11403 .

## How was this patch tested?

Existing tests, also manual tests with the example query, check the explain and web UI.

Author: Davies Liu <davies@databricks.com>

Closes #11417 from davies/viz_subquery.
2016-03-03 17:36:48 -08:00
Dongjoon Hyun b5f02d6743 [SPARK-13583][CORE][STREAMING] Remove unused imports and add checkstyle rule
## What changes were proposed in this pull request?

After SPARK-6990, `dev/lint-java` keeps Java code healthy and helps PR review by saving much time.
This issue aims remove unused imports from Java/Scala code and add `UnusedImports` checkstyle rule to help developers.

## How was this patch tested?
```
./dev/lint-java
./build/sbt compile
```

Author: Dongjoon Hyun <dongjoon@apache.org>

Closes #11438 from dongjoon-hyun/SPARK-13583.
2016-03-03 10:12:32 +00:00
Sean Owen e97fc7f176 [SPARK-13423][WIP][CORE][SQL][STREAMING] Static analysis fixes for 2.x
## What changes were proposed in this pull request?

Make some cross-cutting code improvements according to static analysis. These are individually up for discussion since they exist in separate commits that can be reverted. The changes are broadly:

- Inner class should be static
- Mismatched hashCode/equals
- Overflow in compareTo
- Unchecked warnings
- Misuse of assert, vs junit.assert
- get(a) + getOrElse(b) -> getOrElse(a,b)
- Array/String .size -> .length (occasionally, -> .isEmpty / .nonEmpty) to avoid implicit conversions
- Dead code
- tailrec
- exists(_ == ) -> contains find + nonEmpty -> exists filter + size -> count
- reduce(_+_) -> sum map + flatten -> map

The most controversial may be .size -> .length simply because of its size. It is intended to avoid implicits that might be expensive in some places.

## How was the this patch tested?

Existing Jenkins unit tests.

Author: Sean Owen <sowen@cloudera.com>

Closes #11292 from srowen/SPARK-13423.
2016-03-03 09:54:09 +00:00
Dongjoon Hyun 02b7677e95 [HOT-FIX] Recover some deprecations for 2.10 compatibility.
## What changes were proposed in this pull request?

#11479 [SPARK-13627] broke 2.10 compatibility: [2.10-Build](https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Compile/job/spark-master-compile-maven-scala-2.10/292/console)
At this moment, we need to support both 2.10 and 2.11.
This PR recovers some deprecated methods which were replace by [SPARK-13627].

## How was this patch tested?

Jenkins build: Both 2.10, 2.11.

Author: Dongjoon Hyun <dongjoon@apache.org>

Closes #11488 from dongjoon-hyun/hotfix_compatibility_with_2.10.
2016-03-03 09:53:02 +00:00
Liang-Chi Hsieh 7b25dc7b7e [SPARK-13466] [SQL] Remove projects that become redundant after column pruning rule
JIRA: https://issues.apache.org/jira/browse/SPARK-13466

## What changes were proposed in this pull request?

With column pruning rule in optimizer, some Project operators will become redundant. We should remove these redundant Projects.

For an example query:

    val input = LocalRelation('key.int, 'value.string)

    val query =
      Project(Seq($"x.key", $"y.key"),
        Join(
          SubqueryAlias("x", input),
          BroadcastHint(SubqueryAlias("y", input)), Inner, None))

After the first run of column pruning, it would like:

    Project(Seq($"x.key", $"y.key"),
      Join(
        Project(Seq($"x.key"), SubqueryAlias("x", input)),
        Project(Seq($"y.key"),      <-- inserted by the rule
        BroadcastHint(SubqueryAlias("y", input))),
        Inner, None))

Actually we don't need the outside Project now. This patch will remove it:

    Join(
      Project(Seq($"x.key"), SubqueryAlias("x", input)),
      Project(Seq($"y.key"),
      BroadcastHint(SubqueryAlias("y", input))),
      Inner, None)

## How was the this patch tested?

Unit test is added into ColumnPruningSuite.

Author: Liang-Chi Hsieh <viirya@gmail.com>

Closes #11341 from viirya/remove-redundant-project.
2016-03-03 00:06:46 -08:00
Liang-Chi Hsieh 1085bd862a [SPARK-13635] [SQL] Enable LimitPushdown optimizer rule because we have whole-stage codegen for Limit
JIRA: https://issues.apache.org/jira/browse/SPARK-13635

## What changes were proposed in this pull request?

LimitPushdown optimizer rule has been disabled due to no whole-stage codegen for Limit. As we have whole-stage codegen for Limit now, we should enable it.

## How was this patch tested?

As we only re-enable LimitPushdown optimizer rule, no need to add new tests for it.

Author: Liang-Chi Hsieh <viirya@gmail.com>

Closes #11483 from viirya/enable-limitpushdown.
2016-03-02 23:46:23 -08:00
Dongjoon Hyun 9c274ac4a6 [SPARK-13627][SQL][YARN] Fix simple deprecation warnings.
## What changes were proposed in this pull request?

This PR aims to fix the following deprecation warnings.
  * MethodSymbolApi.paramss--> paramLists
  * AnnotationApi.tpe -> tree.tpe
  * BufferLike.readOnly -> toList.
  * StandardNames.nme -> termNames
  * scala.tools.nsc.interpreter.AbstractFileClassLoader -> scala.reflect.internal.util.AbstractFileClassLoader
  * TypeApi.declarations-> decls

## How was this patch tested?

Check the compile build log and pass the tests.
```
./build/sbt
```

Author: Dongjoon Hyun <dongjoon@apache.org>

Closes #11479 from dongjoon-hyun/SPARK-13627.
2016-03-02 20:34:22 -08:00
Wenchen Fan b60b813799 [SPARK-13617][SQL] remove unnecessary GroupingAnalytics trait
## What changes were proposed in this pull request?

The `trait GroupingAnalytics` only has one implementation, it's an unnecessary abstraction. This PR removes it, and does some code simplification when resolving `GroupingSet`.

## How was this patch tested?

existing tests

Author: Wenchen Fan <wenchen@databricks.com>

Closes #11469 from cloud-fan/groupingset.
2016-03-02 20:18:57 -08:00
gatorsmile 8f8d8a2315 [SPARK-13609] [SQL] Support Column Pruning for MapPartitions
#### What changes were proposed in this pull request?

This PR is to prune unnecessary columns when the operator is  `MapPartitions`. The solution is to add an extra `Project` in the child node.

For the other two operators `AppendColumns` and `MapGroups`, it sounds doable. More discussions are required. The major reason is the current implementation of the `inputPlan` of `groupBy` is based on the child of `AppendColumns`. It might be a bug? Thus, will submit a separate PR.

#### How was this patch tested?

Added a test case in ColumnPruningSuite to verify the rule. Added another test case in DatasetSuite.scala to verify the data.

Author: gatorsmile <gatorsmile@gmail.com>

Closes #11460 from gatorsmile/datasetPruningNew.
2016-03-02 09:59:22 -08:00