Commit graph

586 commits

Author SHA1 Message Date
Davies Liu 78d3b6051e [SPARK-13657] [SQL] Support parsing very long AND/OR expressions
## What changes were proposed in this pull request?

In order to avoid StackOverflow when parse a expression with hundreds of ORs, we should use loop instead of recursive functions to flatten the tree as list. This PR also build a balanced tree to reduce the depth of generated And/Or expression, to avoid StackOverflow in analyzer/optimizer.

## How was this patch tested?

Add new unit tests. Manually tested with TPCDS Q3 with hundreds predicates in it [1]. These predicates help to reduce the number of partitions, then the query time went from 60 seconds to 8 seconds.

[1] https://github.com/cloudera/impala-tpcds-kit/blob/master/queries/q3.sql

Author: Davies Liu <davies@databricks.com>

Closes #11501 from davies/long_or.
2016-03-08 10:23:19 -08:00
gatorsmile b6071a7001 [SPARK-13722][SQL] No Push Down for Non-deterministics Predicates through Generate
#### What changes were proposed in this pull request?

Non-deterministic predicates should not be pushed through Generate.

#### How was this patch tested?

Added a test case in `FilterPushdownSuite.scala`

Author: gatorsmile <gatorsmile@gmail.com>

Closes #11562 from gatorsmile/pushPredicateDownWindow.
2016-03-07 12:09:27 -08:00
Sameer Agarwal ef77003178 [SPARK-13495][SQL] Add Null Filters in the query plan for Filters/Joins based on their data constraints
## What changes were proposed in this pull request?

This PR adds an optimizer rule to eliminate reading (unnecessary) NULL values if they are not required for correctness by inserting `isNotNull` filters is the query plan. These filters are currently inserted beneath existing `Filter` and `Join` operators and are inferred based on their data constraints.

Note: While this optimization is applicable to all types of join, it primarily benefits `Inner` and `LeftSemi` joins.

## How was this patch tested?

1. Added a new `NullFilteringSuite` that tests for `IsNotNull` filters in the query plan for joins and filters. Also, tests interaction with the `CombineFilters` optimizer rules.
2. Test generated ExpressionTrees via `OrcFilterSuite`
3. Test filter source pushdown logic via `SimpleTextHadoopFsRelationSuite`

cc yhuai nongli

Author: Sameer Agarwal <sameer@databricks.com>

Closes #11372 from sameeragarwal/gen-isnotnull.
2016-03-07 12:04:59 -08:00
Andrew Or bc7a3ec290 [SPARK-13685][SQL] Rename catalog.Catalog to ExternalCatalog
## What changes were proposed in this pull request?

Today we have `analysis.Catalog` and `catalog.Catalog`. In the future the former will call the latter. When that happens, if both of them are still called `Catalog` it will be very confusing. This patch renames the latter `ExternalCatalog` because it is expected to talk to external systems.

## How was this patch tested?

Jenkins.

Author: Andrew Or <andrew@databricks.com>

Closes #11526 from andrewor14/rename-catalog.
2016-03-07 00:14:40 -08:00
Andrew Or b7d4147421 [SPARK-13633][SQL] Move things into catalyst.parser package
## What changes were proposed in this pull request?

This patch simply moves things to existing package `o.a.s.sql.catalyst.parser` in an effort to reduce the size of the diff in #11048. This is conceptually the same as a recently merged patch #11482.

## How was this patch tested?

Jenkins.

Author: Andrew Or <andrew@databricks.com>

Closes #11506 from andrewor14/parser-package.
2016-03-04 10:32:00 -08:00
Dongjoon Hyun b5f02d6743 [SPARK-13583][CORE][STREAMING] Remove unused imports and add checkstyle rule
## What changes were proposed in this pull request?

After SPARK-6990, `dev/lint-java` keeps Java code healthy and helps PR review by saving much time.
This issue aims remove unused imports from Java/Scala code and add `UnusedImports` checkstyle rule to help developers.

## How was this patch tested?
```
./dev/lint-java
./build/sbt compile
```

Author: Dongjoon Hyun <dongjoon@apache.org>

Closes #11438 from dongjoon-hyun/SPARK-13583.
2016-03-03 10:12:32 +00:00
Sean Owen e97fc7f176 [SPARK-13423][WIP][CORE][SQL][STREAMING] Static analysis fixes for 2.x
## What changes were proposed in this pull request?

Make some cross-cutting code improvements according to static analysis. These are individually up for discussion since they exist in separate commits that can be reverted. The changes are broadly:

- Inner class should be static
- Mismatched hashCode/equals
- Overflow in compareTo
- Unchecked warnings
- Misuse of assert, vs junit.assert
- get(a) + getOrElse(b) -> getOrElse(a,b)
- Array/String .size -> .length (occasionally, -> .isEmpty / .nonEmpty) to avoid implicit conversions
- Dead code
- tailrec
- exists(_ == ) -> contains find + nonEmpty -> exists filter + size -> count
- reduce(_+_) -> sum map + flatten -> map

The most controversial may be .size -> .length simply because of its size. It is intended to avoid implicits that might be expensive in some places.

## How was the this patch tested?

Existing Jenkins unit tests.

Author: Sean Owen <sowen@cloudera.com>

Closes #11292 from srowen/SPARK-13423.
2016-03-03 09:54:09 +00:00
Liang-Chi Hsieh 7b25dc7b7e [SPARK-13466] [SQL] Remove projects that become redundant after column pruning rule
JIRA: https://issues.apache.org/jira/browse/SPARK-13466

## What changes were proposed in this pull request?

With column pruning rule in optimizer, some Project operators will become redundant. We should remove these redundant Projects.

For an example query:

    val input = LocalRelation('key.int, 'value.string)

    val query =
      Project(Seq($"x.key", $"y.key"),
        Join(
          SubqueryAlias("x", input),
          BroadcastHint(SubqueryAlias("y", input)), Inner, None))

After the first run of column pruning, it would like:

    Project(Seq($"x.key", $"y.key"),
      Join(
        Project(Seq($"x.key"), SubqueryAlias("x", input)),
        Project(Seq($"y.key"),      <-- inserted by the rule
        BroadcastHint(SubqueryAlias("y", input))),
        Inner, None))

Actually we don't need the outside Project now. This patch will remove it:

    Join(
      Project(Seq($"x.key"), SubqueryAlias("x", input)),
      Project(Seq($"y.key"),
      BroadcastHint(SubqueryAlias("y", input))),
      Inner, None)

## How was the this patch tested?

Unit test is added into ColumnPruningSuite.

Author: Liang-Chi Hsieh <viirya@gmail.com>

Closes #11341 from viirya/remove-redundant-project.
2016-03-03 00:06:46 -08:00
gatorsmile 8f8d8a2315 [SPARK-13609] [SQL] Support Column Pruning for MapPartitions
#### What changes were proposed in this pull request?

This PR is to prune unnecessary columns when the operator is  `MapPartitions`. The solution is to add an extra `Project` in the child node.

For the other two operators `AppendColumns` and `MapGroups`, it sounds doable. More discussions are required. The major reason is the current implementation of the `inputPlan` of `groupBy` is based on the child of `AppendColumns`. It might be a bug? Thus, will submit a separate PR.

#### How was this patch tested?

Added a test case in ColumnPruningSuite to verify the rule. Added another test case in DatasetSuite.scala to verify the data.

Author: gatorsmile <gatorsmile@gmail.com>

Closes #11460 from gatorsmile/datasetPruningNew.
2016-03-02 09:59:22 -08:00
gatorsmile bc65f60ef7 [SPARK-13544][SQL] Rewrite/Propagate Constraints for Aliases in Aggregate
#### What changes were proposed in this pull request?

After analysis by Analyzer, two operators could have alias. They are `Project` and `Aggregate`. So far, we only rewrite and propagate constraints if `Alias` is defined in `Project`. This PR is to resolve this issue in `Aggregate`.

#### How was this patch tested?

Added a test case for `Aggregate` in `ConstraintPropagationSuite`.

marmbrus sameeragarwal

Author: gatorsmile <gatorsmile@gmail.com>

Closes #11422 from gatorsmile/validConstraintsInUnaryNodes.
2016-02-29 10:10:04 -08:00
Cheng Lian 3fa6491be6 [SPARK-13473][SQL] Don't push predicate through project with nondeterministic field(s)
## What changes were proposed in this pull request?

Predicates shouldn't be pushed through project with nondeterministic field(s).

See https://github.com/graphframes/graphframes/pull/23 and SPARK-13473 for more details.

This PR targets master, branch-1.6, and branch-1.5.

## How was this patch tested?

A test case is added in `FilterPushdownSuite`. It constructs a query plan where a filter is over a project with a nondeterministic field. Optimized query plan shouldn't change in this case.

Author: Cheng Lian <lian@databricks.com>

Closes #11348 from liancheng/spark-13473-no-ppd-through-nondeterministic-project-field.
2016-02-25 20:43:03 +08:00
Davies Liu 07f92ef1fa [SPARK-13376] [SPARK-13476] [SQL] improve column pruning
## What changes were proposed in this pull request?

This PR mostly rewrite the ColumnPruning rule to support most of the SQL logical plans (except those for Dataset).

This PR also fix a bug in Generate, it should always output UnsafeRow, added an regression test for that.

## How was this patch tested?

This is test by unit tests, also manually test with TPCDS Q78, which could prune all unused columns successfully, improved the performance by 78% (from 22s to 12s).

Author: Davies Liu <davies@databricks.com>

Closes #11354 from davies/fix_column_pruning.
2016-02-25 00:13:07 -08:00
Michael Armbrust 2b042577fb [SPARK-13092][SQL] Add ExpressionSet for constraint tracking
This PR adds a new abstraction called an `ExpressionSet` which attempts to canonicalize expressions to remove cosmetic differences.  Deterministic expressions that are in the set after canonicalization will always return the same answer given the same input (i.e. false positives should not be possible). However, it is possible that two canonical expressions that are not equal will in fact return the same answer given any input (i.e. false negatives are possible).

```scala
val set = AttributeSet('a + 1 :: 1 + 'a :: Nil)

set.iterator => Iterator('a + 1)
set.contains('a + 1) => true
set.contains(1 + 'a) => true
set.contains('a + 2) => false
```

Other relevant changes include:
 - Since this concept overlaps with the existing `semanticEquals` and `semanticHash`, those functions are also ported to this new infrastructure.
 - A memoized `canonicalized` version of the expression is added as a `lazy val` to `Expression` and is used by both `semanticEquals` and `ExpressionSet`.
 - A set of unit tests for `ExpressionSet` are added
 - Tests which expect `semanticEquals` to be less intelligent than it now is are updated.

As a followup, we should consider auditing the places where we do `O(n)` `semanticEquals` operations and replace them with `ExpressionSet`.  We should also consider consolidating `AttributeSet` as a specialized factory for an `ExpressionSet.`

Author: Michael Armbrust <michael@databricks.com>

Closes #11338 from marmbrus/expressionSet.
2016-02-24 19:43:00 -08:00
Yin Huai cbb0b65ad5 [SPARK-13383][SQL] Fix test
## What changes were proposed in this pull request?

Reverting SPARK-13376 (d563c8fa01) affects the test added by SPARK-13383. So, I am fixing the test.

Author: Yin Huai <yhuai@databricks.com>

Closes #11355 from yhuai/SPARK-13383-fix-test.
2016-02-24 16:13:55 -08:00
Reynold Xin f92f53faee Revert "[SPARK-13321][SQL] Support nested UNION in parser"
This reverts commit 55d6fdf22d.
2016-02-24 12:25:02 -08:00
Reynold Xin 65805ab6ea Revert "Revert "[SPARK-13383][SQL] Keep broadcast hint after column pruning""
This reverts commit 382b27babf.
2016-02-24 12:03:45 -08:00
Reynold Xin d563c8fa01 Revert "[SPARK-13376] [SQL] improve column pruning"
This reverts commit e9533b419e.
2016-02-24 11:58:32 -08:00
Reynold Xin 382b27babf Revert "[SPARK-13383][SQL] Keep broadcast hint after column pruning"
This reverts commit f373986997.
2016-02-24 11:58:12 -08:00
Liang-Chi Hsieh f373986997 [SPARK-13383][SQL] Keep broadcast hint after column pruning
JIRA: https://issues.apache.org/jira/browse/SPARK-13383

## What changes were proposed in this pull request?

When we do column pruning in Optimizer, we put additional Project on top of a logical plan. However, when we already wrap a BroadcastHint on a logical plan, the added Project will hide BroadcastHint after later execution.

We should take care of BroadcastHint when we do column pruning.

## How was the this patch tested?

Unit test is added.

Author: Liang-Chi Hsieh <viirya@gmail.com>

Closes #11260 from viirya/keep-broadcasthint.
2016-02-24 10:22:40 -08:00
Davies Liu e9533b419e [SPARK-13376] [SQL] improve column pruning
## What changes were proposed in this pull request?

This PR mostly rewrite the ColumnPruning rule to support most of the SQL logical plans (except those for Dataset).

## How was the this patch tested?

This is test by unit tests, also manually test with TPCDS Q78, which could prune all unused columns successfully, improved the performance by 78% (from 22s to 12s).

Author: Davies Liu <davies@databricks.com>

Closes #11256 from davies/fix_column_pruning.
2016-02-23 18:19:22 -08:00
Davies Liu c481bdf512 [SPARK-13329] [SQL] considering output for statistics of logical plan
The current implementation of statistics of UnaryNode does not considering output (for example, Project may product much less columns than it's child), we should considering it to have a better guess.

We usually only join with few columns from a parquet table, the size of projected plan could be much smaller than the original parquet files. Having a better guess of size help we choose between broadcast join or sort merge join.

After this PR, I saw a few queries choose broadcast join other than sort merge join without turning spark.sql.autoBroadcastJoinThreshold for every query, ended up with about 6-8X improvements on end-to-end time.

We use `defaultSize` of DataType to estimate the size of a column, currently For DecimalType/StringType/BinaryType and UDT, we are over-estimate too much (4096 Bytes), so this PR change them to some more reasonable values. Here are the new defaultSize for them:

DecimalType:  8 or 16 bytes, based on the precision
StringType:  20 bytes
BinaryType: 100 bytes
UDF: default size of SQL type

These numbers are not perfect (hard to have a perfect number for them), but should be better than 4096.

Author: Davies Liu <davies@databricks.com>

Closes #11210 from davies/statics.
2016-02-23 12:55:44 -08:00
Michael Armbrust c5bfe5d2a2 [SPARK-13440][SQL] ObjectType should accept any ObjectType, If should not care about nullability
The type checking functions of `If` and `UnwrapOption` are fixed to eliminate spurious failures.  `UnwrapOption` was checking for an input of `ObjectType` but `ObjectType`'s accept function was hard coded to return `false`.  `If`'s type check was returning a false negative in the case that the two options differed only by nullability.

Tests added:
 -  an end-to-end regression test is added to `DatasetSuite` for the reported failure.
 - all the unit tests in `ExpressionEncoderSuite` are augmented to also confirm successful analysis.  These tests are actually what pointed out the additional issues with `If` resolution.

Author: Michael Armbrust <michael@databricks.com>

Closes #11316 from marmbrus/datasetOptions.
2016-02-23 11:20:27 -08:00
gatorsmile 87250580f2 [SPARK-13263][SQL] SQL Generation Support for Tablesample
In the parser, tableSample clause is part of tableSource.
```
tableSource
init { gParent.pushMsg("table source", state); }
after { gParent.popMsg(state); }
    : tabname=tableName
    ((tableProperties) => props=tableProperties)?
    ((tableSample) => ts=tableSample)?
    ((KW_AS) => (KW_AS alias=Identifier)
    |
    (Identifier) => (alias=Identifier))?
    -> ^(TOK_TABREF $tabname $props? $ts? $alias?)
    ;
```

Two typical query samples using TABLESAMPLE are:
```
    "SELECT s.id FROM t0 TABLESAMPLE(10 PERCENT) s"
    "SELECT * FROM t0 TABLESAMPLE(0.1 PERCENT)"
```

FYI, the logical plan of a TABLESAMPLE query:
```
sql("SELECT * FROM t0 TABLESAMPLE(0.1 PERCENT)").explain(true)

== Analyzed Logical Plan ==
id: bigint
Project [id#16L]
+- Sample 0.0, 0.001, false, 381
   +- Subquery t0
      +- Relation[id#16L] ParquetRelation
```

Thanks! cc liancheng

Author: gatorsmile <gatorsmile@gmail.com>
Author: xiaoli <lixiao1983@gmail.com>
Author: Xiao Li <xiaoli@Xiaos-MacBook-Pro.local>

This patch had conflicts when merged, resolved by
Committer: Cheng Lian <lian@databricks.com>

Closes #11148 from gatorsmile/tablesplitsample.
2016-02-23 16:13:09 +08:00
Dongjoon Hyun 024482bf51 [MINOR][DOCS] Fix all typos in markdown files of doc and similar patterns in other comments
## What changes were proposed in this pull request?

This PR tries to fix all typos in all markdown files under `docs` module,
and fixes similar typos in other comments, too.

## How was the this patch tested?

manual tests.

Author: Dongjoon Hyun <dongjoon@apache.org>

Closes #11300 from dongjoon-hyun/minor_fix_typos.
2016-02-22 09:52:07 +00:00
Reynold Xin 9bf6a926a1 [HOTFIX] Fix compilation break 2016-02-21 19:37:35 -08:00
Liang-Chi Hsieh 55d6fdf22d [SPARK-13321][SQL] Support nested UNION in parser
JIRA: https://issues.apache.org/jira/browse/SPARK-13321

The following SQL can not be parsed with current parser:

    SELECT  `u_1`.`id` FROM (((SELECT  `t0`.`id` FROM `default`.`t0`) UNION ALL (SELECT  `t0`.`id` FROM `default`.`t0`)) UNION ALL (SELECT  `t0`.`id` FROM `default`.`t0`)) AS u_1

We should fix it.

Author: Liang-Chi Hsieh <viirya@gmail.com>

Closes #11204 from viirya/nested-union.
2016-02-21 19:10:17 -08:00
Andrew Or 6c3832b26e [SPARK-13080][SQL] Implement new Catalog API using Hive
## What changes were proposed in this pull request?

This is a step towards merging `SQLContext` and `HiveContext`. A new internal Catalog API was introduced in #10982 and extended in #11069. This patch introduces an implementation of this API using `HiveClient`, an existing interface to Hive. It also extends `HiveClient` with additional calls to Hive that are needed to complete the catalog implementation.

*Where should I start reviewing?* The new catalog introduced is `HiveCatalog`. This class is relatively simple because it just calls `HiveClientImpl`, where most of the new logic is. I would not start with `HiveClient`, `HiveQl`, or `HiveMetastoreCatalog`, which are modified mainly because of a refactor.

*Why is this patch so big?* I had to refactor HiveClient to remove an intermediate representation of databases, tables, partitions etc. After this refactor `CatalogTable` convert directly to and from `HiveTable` (etc.). Otherwise we would have to first convert `CatalogTable` to the intermediate representation and then convert that to HiveTable, which is messy.

The new class hierarchy is as follows:
```
org.apache.spark.sql.catalyst.catalog.Catalog
  - org.apache.spark.sql.catalyst.catalog.InMemoryCatalog
  - org.apache.spark.sql.hive.HiveCatalog
```

Note that, as of this patch, none of these classes are currently used anywhere yet. This will come in the future before the Spark 2.0 release.

## How was the this patch tested?
All existing unit tests, and HiveCatalogSuite that extends CatalogTestCases.

Author: Andrew Or <andrew@databricks.com>
Author: Reynold Xin <rxin@databricks.com>

Closes #11293 from rxin/hive-catalog.
2016-02-21 15:00:24 -08:00
Reynold Xin 0947f0989b [SPARK-13420][SQL] Rename Subquery logical plan to SubqueryAlias
## What changes were proposed in this pull request?
This patch renames logical.Subquery to logical.SubqueryAlias, which is a more appropriate name for this operator (versus subqueries as expressions).

## How was the this patch tested?
Unit tests.

Author: Reynold Xin <rxin@databricks.com>

Closes #11288 from rxin/SPARK-13420.
2016-02-21 11:31:46 -08:00
Cheng Lian d9efe63ecd [SPARK-12799] Simplify various string output for expressions
This PR introduces several major changes:

1. Replacing `Expression.prettyString` with `Expression.sql`

   The `prettyString` method is mostly an internal, developer faced facility for debugging purposes, and shouldn't be exposed to users.

1. Using SQL-like representation as column names for selected fields that are not named expression (back-ticks and double quotes should be removed)

   Before, we were using `prettyString` as column names when possible, and sometimes the result column names can be weird.  Here are several examples:

   Expression         | `prettyString` | `sql`      | Note
   ------------------ | -------------- | ---------- | ---------------
   `a && b`           | `a && b`       | `a AND b`  |
   `a.getField("f")`  | `a[f]`         | `a.f`      | `a` is a struct

1. Adding trait `NonSQLExpression` extending from `Expression` for expressions that don't have a SQL representation (e.g. Scala UDF/UDAF and Java/Scala object expressions used for encoders)

   `NonSQLExpression.sql` may return an arbitrary user facing string representation of the expression.

Author: Cheng Lian <lian@databricks.com>

Closes #10757 from liancheng/spark-12799.simplify-expression-string-methods.
2016-02-21 22:53:15 +08:00
Davies Liu 7925071280 [SPARK-13306] [SQL] uncorrelated scalar subquery
A scalar subquery is a subquery that only generate single row and single column, could be used as part of expression. Uncorrelated scalar subquery means it does not has a reference to external table.

All the uncorrelated scalar subqueries will be executed during prepare() of SparkPlan.

The plans for query
```sql
select 1 + (select 2 + (select 3))
```
looks like this
```
== Parsed Logical Plan ==
'Project [unresolvedalias((1 + subquery#1),None)]
:- OneRowRelation$
+- 'Subquery subquery#1
   +- 'Project [unresolvedalias((2 + subquery#0),None)]
      :- OneRowRelation$
      +- 'Subquery subquery#0
         +- 'Project [unresolvedalias(3,None)]
            +- OneRowRelation$

== Analyzed Logical Plan ==
_c0: int
Project [(1 + subquery#1) AS _c0#4]
:- OneRowRelation$
+- Subquery subquery#1
   +- Project [(2 + subquery#0) AS _c0#3]
      :- OneRowRelation$
      +- Subquery subquery#0
         +- Project [3 AS _c0#2]
            +- OneRowRelation$

== Optimized Logical Plan ==
Project [(1 + subquery#1) AS _c0#4]
:- OneRowRelation$
+- Subquery subquery#1
   +- Project [(2 + subquery#0) AS _c0#3]
      :- OneRowRelation$
      +- Subquery subquery#0
         +- Project [3 AS _c0#2]
            +- OneRowRelation$

== Physical Plan ==
WholeStageCodegen
:  +- Project [(1 + subquery#1) AS _c0#4]
:     :- INPUT
:     +- Subquery subquery#1
:        +- WholeStageCodegen
:           :  +- Project [(2 + subquery#0) AS _c0#3]
:           :     :- INPUT
:           :     +- Subquery subquery#0
:           :        +- WholeStageCodegen
:           :           :  +- Project [3 AS _c0#2]
:           :           :     +- INPUT
:           :           +- Scan OneRowRelation[]
:           +- Scan OneRowRelation[]
+- Scan OneRowRelation[]
```

Author: Davies Liu <davies@databricks.com>

Closes #11190 from davies/scalar_subquery.
2016-02-20 21:01:51 -08:00
gatorsmile f88c641bc8 [SPARK-13310] [SQL] Resolve Missing Sorting Columns in Generate
```scala
// case 1: missing sort columns are resolvable if join is true
sql("SELECT explode(a) AS val, b FROM data WHERE b < 2 order by val, c")
// case 2: missing sort columns are not resolvable if join is false. Thus, issue an error message in this case
sql("SELECT explode(a) AS val FROM data order by val, c")
```

When sort columns are not in `Generate`, we can resolve them when `join` is equal to `true`. Still trying to add more test cases for the other `UnaryNode` types.

Could you review the changes? davies cloud-fan Thanks!

Author: gatorsmile <gatorsmile@gmail.com>

Closes #11198 from gatorsmile/missingInSort.
2016-02-20 13:53:23 -08:00
Reynold Xin 6624a588c1 Revert "[SPARK-12567] [SQL] Add aes_{encrypt,decrypt} UDFs"
This reverts commit 4f9a664818.
2016-02-19 22:44:20 -08:00
Kai Jiang 4f9a664818 [SPARK-12567] [SQL] Add aes_{encrypt,decrypt} UDFs
Author: Kai Jiang <jiangkai@gmail.com>

Closes #10527 from vectorijk/spark-12567.
2016-02-19 22:28:47 -08:00
gatorsmile ec7a1d6e42 [SPARK-12594] [SQL] Outer Join Elimination by Filter Conditions
Conversion of outer joins, if the predicates in filter conditions can restrict the result sets so that all null-supplying rows are eliminated.

- `full outer` -> `inner` if both sides have such predicates
- `left outer` -> `inner` if the right side has such predicates
- `right outer` -> `inner` if the left side has such predicates
- `full outer` -> `left outer` if only the left side has such predicates
- `full outer` -> `right outer` if only the right side has such predicates

If applicable, this can greatly improve the performance, since outer join is much slower than inner join, full outer join is much slower than left/right outer join.

The original PR is https://github.com/apache/spark/pull/10542

Author: gatorsmile <gatorsmile@gmail.com>
Author: xiaoli <lixiao1983@gmail.com>
Author: Xiao Li <xiaoli@Xiaos-MacBook-Pro.local>

Closes #10567 from gatorsmile/outerJoinEliminationByFilterCond.
2016-02-19 22:27:10 -08:00
Sameer Agarwal 091f6a7830 [SPARK-13091][SQL] Rewrite/Propagate constraints for Aliases
This PR adds support for rewriting constraints if there are aliases in the query plan. For e.g., if there is a query of form `SELECT a, a AS b`, any constraints on `a` now also apply to `b`.

JIRA: https://issues.apache.org/jira/browse/SPARK-13091

cc marmbrus

Author: Sameer Agarwal <sameer@databricks.com>

Closes #11144 from sameeragarwal/alias.
2016-02-19 14:48:34 -08:00
Liang-Chi Hsieh c7c55637bf [SPARK-13384][SQL] Keep attribute qualifiers after dedup in Analyzer
JIRA: https://issues.apache.org/jira/browse/SPARK-13384

## What changes were proposed in this pull request?

When we de-duplicate attributes in Analyzer, we create new attributes. However, we don't keep original qualifiers. Some plans will be failed to analysed. We should keep original qualifiers in new attributes.

## How was the this patch tested?

Unit test is added.

Author: Liang-Chi Hsieh <viirya@gmail.com>

Closes #11261 from viirya/keep-attr-qualifiers.
2016-02-19 12:22:22 -08:00
Davies Liu 26f38bb83c [SPARK-13351][SQL] fix column pruning on Expand
Currently, the columns in projects of Expand that are not used by Aggregate are not pruned, this PR fix that.

Author: Davies Liu <davies@databricks.com>

Closes #11225 from davies/fix_pruning_expand.
2016-02-18 13:07:41 -08:00
Josh Rosen a8bbc4f50e [SPARK-12503][SPARK-12505] Limit pushdown in UNION ALL and OUTER JOIN
This patch adds a new optimizer rule for performing limit pushdown. Limits will now be pushed down in two cases:

- If a limit is on top of a `UNION ALL` operator, then a partition-local limit operator will be pushed to each of the union operator's children.
- If a limit is on top of an `OUTER JOIN` then a partition-local limit will be pushed to one side of the join. For `LEFT OUTER` and `RIGHT OUTER` joins, the limit will be pushed to the left and right side, respectively. For `FULL OUTER` join, we will only push limits when at most one of the inputs is already limited: if one input is limited we will push a smaller limit on top of it and if neither input is limited then we will limit the input which is estimated to be larger.

These optimizations were proposed previously by gatorsmile in #10451 and #10454, but those earlier PRs were closed and deferred for later because at that time Spark's physical `Limit` operator would trigger a full shuffle to perform global limits so there was a chance that pushdowns could actually harm performance by causing additional shuffles/stages. In #7334, we split the `Limit` operator into separate `LocalLimit` and `GlobalLimit` operators, so we can now push down only local limits (which don't require extra shuffles). This patch is based on both of gatorsmile's patches, with changes and simplifications due to partition-local-limiting.

When we push down the limit, we still keep the original limit in place, so we need a mechanism to ensure that the optimizer rule doesn't keep pattern-matching once the limit has been pushed down. In order to handle this, this patch adds a `maxRows` method to `SparkPlan` which returns the maximum number of rows that the plan can compute, then defines the pushdown rules to only push limits to children if the children's maxRows are greater than the limit's maxRows. This idea is carried over from #10451; see that patch for additional discussion.

Author: Josh Rosen <joshrosen@databricks.com>

Closes #11121 from JoshRosen/limit-pushdown-2.
2016-02-14 17:32:21 -08:00
Sean Owen 388cd9ea8d [SPARK-13172][CORE][SQL] Stop using RichException.getStackTrace it is deprecated
Replace `getStackTraceString` with `Utils.exceptionString`

Author: Sean Owen <sowen@cloudera.com>

Closes #11182 from srowen/SPARK-13172.
2016-02-13 21:05:48 -08:00
Davies Liu 5b805df279 [SPARK-12705] [SQL] push missing attributes for Sort
The current implementation of ResolveSortReferences can only push one missing attributes into it's child, it failed to analyze TPCDS Q98, because of there are two missing attributes in that (one from Window, another from Aggregate).

Author: Davies Liu <davies@databricks.com>

Closes #11153 from davies/resolve_sort.
2016-02-12 09:34:18 -08:00
gatorsmile e88bff1279 [SPARK-13235][SQL] Removed an Extra Distinct from the Plan when Using Union in SQL
Currently, the parser added two `Distinct` operators in the plan if we are using `Union` or `Union Distinct` in the SQL. This PR is to remove the extra `Distinct` from the plan.

For example, before the fix, the following query has a plan with two `Distinct`
```scala
sql("select * from t0 union select * from t0").explain(true)
```
```
== Parsed Logical Plan ==
'Project [unresolvedalias(*,None)]
+- 'Subquery u_2
   +- 'Distinct
      +- 'Project [unresolvedalias(*,None)]
         +- 'Subquery u_1
            +- 'Distinct
               +- 'Union
                  :- 'Project [unresolvedalias(*,None)]
                  :  +- 'UnresolvedRelation `t0`, None
                  +- 'Project [unresolvedalias(*,None)]
                     +- 'UnresolvedRelation `t0`, None

== Analyzed Logical Plan ==
id: bigint
Project [id#16L]
+- Subquery u_2
   +- Distinct
      +- Project [id#16L]
         +- Subquery u_1
            +- Distinct
               +- Union
                  :- Project [id#16L]
                  :  +- Subquery t0
                  :     +- Relation[id#16L] ParquetRelation
                  +- Project [id#16L]
                     +- Subquery t0
                        +- Relation[id#16L] ParquetRelation

== Optimized Logical Plan ==
Aggregate [id#16L], [id#16L]
+- Aggregate [id#16L], [id#16L]
   +- Union
      :- Project [id#16L]
      :  +- Relation[id#16L] ParquetRelation
      +- Project [id#16L]
         +- Relation[id#16L] ParquetRelation
```
After the fix, the plan is changed without the extra `Distinct` as follows:
```
== Parsed Logical Plan ==
'Project [unresolvedalias(*,None)]
+- 'Subquery u_1
   +- 'Distinct
      +- 'Union
         :- 'Project [unresolvedalias(*,None)]
         :  +- 'UnresolvedRelation `t0`, None
         +- 'Project [unresolvedalias(*,None)]
           +- 'UnresolvedRelation `t0`, None

== Analyzed Logical Plan ==
id: bigint
Project [id#17L]
+- Subquery u_1
   +- Distinct
      +- Union
        :- Project [id#16L]
        :  +- Subquery t0
        :     +- Relation[id#16L] ParquetRelation
        +- Project [id#16L]
          +- Subquery t0
          +- Relation[id#16L] ParquetRelation

== Optimized Logical Plan ==
Aggregate [id#17L], [id#17L]
+- Union
  :- Project [id#16L]
  :  +- Relation[id#16L] ParquetRelation
  +- Project [id#16L]
    +- Relation[id#16L] ParquetRelation
```

Author: gatorsmile <gatorsmile@gmail.com>

Closes #11120 from gatorsmile/unionDistinct.
2016-02-11 08:40:27 +01:00
Herman van Hovell 1842c55d89 [SPARK-13276] Catch bad characters at the end of a Table Identifier/Expression string
The parser currently parses the following strings without a hitch:
* Table Identifier:
  * `a.b.c` should fail, but results in the following table identifier `a.b`
  * `table!#` should fail, but results in the following table identifier `table`
* Expression
  * `1+2 r+e` should fail, but results in the following expression `1 + 2`

This PR fixes this by adding terminated rules for both expression parsing and table identifier parsing.

cc cloud-fan (we discussed this in https://github.com/apache/spark/pull/10649) jayadevanmurali (this causes your PR https://github.com/apache/spark/pull/11051 to fail)

Author: Herman van Hovell <hvanhovell@questtec.nl>

Closes #11159 from hvanhovell/SPARK-13276.
2016-02-11 08:30:58 +01:00
gatorsmile 663cc400f3 [SPARK-12725][SQL] Resolving Name Conflicts in SQL Generation and Name Ambiguity Caused by Internally Generated Expressions
Some analysis rules generate aliases or auxiliary attribute references with the same name but different expression IDs. For example, `ResolveAggregateFunctions` introduces `havingCondition` and `aggOrder`, and `DistinctAggregationRewriter` introduces `gid`.

This is OK for normal query execution since these attribute references get expression IDs. However, it's troublesome when converting resolved query plans back to SQL query strings since expression IDs are erased.

Here's an example Spark 1.6.0 snippet for illustration:
```scala
sqlContext.range(10).select('id as 'a, 'id as 'b).registerTempTable("t")
sqlContext.sql("SELECT SUM(a) FROM t GROUP BY a, b ORDER BY COUNT(a), COUNT(b)").explain(true)
```
The above code produces the following resolved plan:
```
== Analyzed Logical Plan ==
_c0: bigint
Project [_c0#101L]
+- Sort [aggOrder#102L ASC,aggOrder#103L ASC], true
   +- Aggregate [a#47L,b#48L], [(sum(a#47L),mode=Complete,isDistinct=false) AS _c0#101L,(count(a#47L),mode=Complete,isDistinct=false) AS aggOrder#102L,(count(b#48L),mode=Complete,isDistinct=false) AS aggOrder#103L]
      +- Subquery t
         +- Project [id#46L AS a#47L,id#46L AS b#48L]
            +- LogicalRDD [id#46L], MapPartitionsRDD[44] at range at <console>:26
```
Here we can see that both aggregate expressions in `ORDER BY` are extracted into an `Aggregate` operator, and both of them are named `aggOrder` with different expression IDs.

The solution is to automatically add the expression IDs into the attribute name for the Alias and AttributeReferences that are generated by Analyzer in SQL Generation.

In this PR, it also resolves another issue. Users could use the same name as the internally generated names. The duplicate names should not cause name ambiguity. When resolving the column, Catalyst should not pick the column that is internally generated.

Could you review the solution? marmbrus liancheng

I did not set the newly added flag for all the alias and attribute reference generated by Analyzers. Please let me know if I should do it? Thank you!

Author: gatorsmile <gatorsmile@gmail.com>

Closes #11050 from gatorsmile/namingConflicts.
2016-02-11 10:44:39 +08:00
Wenchen Fan 7fe4fe630a [SPARK-12888] [SQL] [FOLLOW-UP] benchmark the new hash expression
Adds the benchmark results as comments.

The codegen version is slower than the interpreted version for `simple` case becasue of 3 reasons:

1. codegen version use a more complex hash algorithm than interpreted version, i.e. `Murmur3_x86_32.hashInt` vs [simple multiplication and addition](https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/rows.scala#L153).
2. codegen version will write the hash value to a row first and then read it out. I tried to create a `GenerateHasher` that can generate code to return hash value directly and got about 60% speed up for the `simple` case, does it worth?
3. the row in `simple` case only has one int field, so the runtime reflection may be removed because of branch prediction, which makes the interpreted version faster.

The `array` case is also slow for similar reasons, e.g. array elements are of same type, so interpreted version can probably get rid of runtime reflection by branch prediction.

Author: Wenchen Fan <wenchen@databricks.com>

Closes #10917 from cloud-fan/hash-benchmark.
2016-02-09 13:06:36 -08:00
Wenchen Fan 8e4d15f707 [SPARK-13101][SQL] nullability of array type element should not fail analysis of encoder
nullability should only be considered as an optimization rather than part of the type system, so instead of failing analysis for mismatch nullability, we should pass analysis and add runtime null check.

Author: Wenchen Fan <wenchen@databricks.com>

Closes #11035 from cloud-fan/ignore-nullability.
2016-02-08 12:06:00 -08:00
Jakob Odersky 6883a5120c [SPARK-13171][CORE] Replace future calls with Future
Trivial search-and-replace to eliminate deprecation warnings in Scala 2.11.
Also works with 2.10

Author: Jakob Odersky <jakob@odersky.com>

Closes #11085 from jodersky/SPARK-13171.
2016-02-05 19:00:12 -08:00
Wenchen Fan 1ed354a536 [SPARK-12939][SQL] migrate encoder resolution logic to Analyzer
https://issues.apache.org/jira/browse/SPARK-12939

Now we will catch `ObjectOperator` in `Analyzer` and resolve the `fromRowExpression/deserializer` inside it.  Also update the `MapGroups` and `CoGroup` to pass in `dataAttributes`, so that we can correctly resolve value deserializer(the `child.output` contains both groupking key and values, which may mess things up if they have same-name attribtues). End-to-end tests are added.

follow-ups:

* remove encoders from typed aggregate expression.
* completely remove resolve/bind in `ExpressionEncoder`

Author: Wenchen Fan <wenchen@databricks.com>

Closes #10852 from cloud-fan/bug.
2016-02-05 14:34:12 -08:00
Andrew Or bd38dd6f75 [SPARK-13079][SQL] InMemoryCatalog follow-ups
This patch incorporates review feedback from #11069, which is already merged.

Author: Andrew Or <andrew@databricks.com>

Closes #11080 from andrewor14/catalog-follow-ups.
2016-02-04 12:20:18 -08:00
Josh Rosen 33212cb9a1 [SPARK-13168][SQL] Collapse adjacent repartition operators
Spark SQL should collapse adjacent `Repartition` operators and only keep the last one.

Author: Josh Rosen <joshrosen@databricks.com>

Closes #11064 from JoshRosen/collapse-repartition.
2016-02-04 11:08:50 -08:00
Reynold Xin dee801adb7 [SPARK-12828][SQL] Natural join follow-up
This is a small addendum to #10762 to make the code more robust again future changes.

Author: Reynold Xin <rxin@databricks.com>

Closes #11070 from rxin/SPARK-12828-natural-join.
2016-02-03 23:43:48 -08:00