ODIn/spark-instrumented-optimizer

Author	SHA1	Message	Date
Davies Liu	07f92ef1fa	[SPARK-13376] [SPARK-13476] [SQL] improve column pruning ## What changes were proposed in this pull request? This PR mostly rewrites the ColumnPruning rule to support most of the SQL logical plans (except those for Dataset). This PR also fixes a bug in Generate, which should always output UnsafeRow, and adds a regression test for that. ## How was this patch tested? This is tested by unit tests and was also manually tested with TPCDS Q78, which could prune all unused columns successfully and improved the performance by 78% (from 22s to 12s). Author: Davies Liu <davies@databricks.com> Closes #11354 from davies/fix_column_pruning.	2016-02-25 00:13:07 -08:00
Michael Armbrust	2b042577fb	[SPARK-13092][SQL] Add ExpressionSet for constraint tracking This PR adds a new abstraction called an `ExpressionSet` which attempts to canonicalize expressions to remove cosmetic differences. Deterministic expressions that are in the set after canonicalization will always return the same answer given the same input (i.e. false positives should not be possible). However, it is possible that two canonical expressions that are not equal will in fact return the same answer given any input (i.e. false negatives are possible). ```scala val set = AttributeSet('a + 1 :: 1 + 'a :: Nil) set.iterator => Iterator('a + 1) set.contains('a + 1) => true set.contains(1 + 'a) => true set.contains('a + 2) => false ``` Other relevant changes include: - Since this concept overlaps with the existing `semanticEquals` and `semanticHash`, those functions are also ported to this new infrastructure. - A memoized `canonicalized` version of the expression is added as a `lazy val` to `Expression` and is used by both `semanticEquals` and `ExpressionSet`. - A set of unit tests for `ExpressionSet` are added - Tests which expect `semanticEquals` to be less intelligent than it now is are updated. As a followup, we should consider auditing the places where we do `O(n)` `semanticEquals` operations and replace them with `ExpressionSet`. We should also consider consolidating `AttributeSet` as a specialized factory for an `ExpressionSet.` Author: Michael Armbrust <michael@databricks.com> Closes #11338 from marmbrus/expressionSet.	2016-02-24 19:43:00 -08:00
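The canonicalization idea described in SPARK-13092 can be illustrated with a minimal sketch in plain Scala. This is not Catalyst's implementation: `ToyExpr`, `ToyAttr`, `ToyLit`, `ToyAdd`, and `ToyExpressionSet` are hypothetical names, and the only cosmetic difference removed here is operand order in a commutative addition.

```scala
// Toy model of a canonicalizing expression set (hypothetical names, not Catalyst classes).
sealed trait ToyExpr
case class ToyAttr(name: String) extends ToyExpr
case class ToyLit(value: Int) extends ToyExpr
case class ToyAdd(left: ToyExpr, right: ToyExpr) extends ToyExpr

object ToyExpressionSet {
  // Canonicalize the children first, then order the operands of the commutative
  // addition by their string form, so `a + 1` and `1 + a` become the same tree.
  def canonicalize(e: ToyExpr): ToyExpr = e match {
    case ToyAdd(l, r) =>
      val ops = Seq(canonicalize(l), canonicalize(r)).sortBy(_.toString)
      ToyAdd(ops(0), ops(1))
    case other => other
  }

  // Build a set keyed on canonical forms; cosmetic duplicates collapse.
  def apply(exprs: Seq[ToyExpr]): Set[ToyExpr] = exprs.map(canonicalize).toSet

  // Membership is checked on the canonical form of the probe expression.
  def contains(set: Set[ToyExpr], e: ToyExpr): Boolean = set.contains(canonicalize(e))
}
```

As in the commit message, two canonical forms that differ may still compute the same value (for example `a + 2` versus `a + 1 + 1` in this toy), so such a set can miss equivalences but should not report false matches.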
Yin Huai	cbb0b65ad5	[SPARK-13383][SQL] Fix test ## What changes were proposed in this pull request? Reverting SPARK-13376 (`d563c8fa01`) affects the test added by SPARK-13383. So, I am fixing the test. Author: Yin Huai <yhuai@databricks.com> Closes #11355 from yhuai/SPARK-13383-fix-test.	2016-02-24 16:13:55 -08:00
Reynold Xin	f92f53faee	Revert "[SPARK-13321][SQL] Support nested UNION in parser" This reverts commit `55d6fdf22d`.	2016-02-24 12:25:02 -08:00
Reynold Xin	65805ab6ea	Revert "Revert "[SPARK-13383][SQL] Keep broadcast hint after column pruning"" This reverts commit `382b27babf`.	2016-02-24 12:03:45 -08:00
Reynold Xin	d563c8fa01	Revert "[SPARK-13376] [SQL] improve column pruning" This reverts commit `e9533b419e`.	2016-02-24 11:58:32 -08:00
Reynold Xin	382b27babf	Revert "[SPARK-13383][SQL] Keep broadcast hint after column pruning" This reverts commit `f373986997`.	2016-02-24 11:58:12 -08:00
Liang-Chi Hsieh	f373986997	[SPARK-13383][SQL] Keep broadcast hint after column pruning JIRA: https://issues.apache.org/jira/browse/SPARK-13383 ## What changes were proposed in this pull request? When we do column pruning in Optimizer, we put additional Project on top of a logical plan. However, when we already wrap a BroadcastHint on a logical plan, the added Project will hide BroadcastHint after later execution. We should take care of BroadcastHint when we do column pruning. ## How was the this patch tested? Unit test is added. Author: Liang-Chi Hsieh <viirya@gmail.com> Closes #11260 from viirya/keep-broadcasthint.	2016-02-24 10:22:40 -08:00
Davies Liu	e9533b419e	[SPARK-13376] [SQL] improve column pruning ## What changes were proposed in this pull request? This PR mostly rewrites the ColumnPruning rule to support most of the SQL logical plans (except those for Dataset). ## How was this patch tested? This is tested by unit tests and was also manually tested with TPCDS Q78, which could prune all unused columns successfully and improved the performance by 78% (from 22s to 12s). Author: Davies Liu <davies@databricks.com> Closes #11256 from davies/fix_column_pruning.	2016-02-23 18:19:22 -08:00
Davies Liu	c481bdf512	[SPARK-13329] [SQL] considering output for statistics of logical plan The current implementation of statistics of UnaryNode does not consider the output (for example, Project may produce far fewer columns than its child); we should consider it to have a better guess. We usually only join with a few columns from a parquet table, so the size of the projected plan could be much smaller than the original parquet files. Having a better guess of the size helps us choose between broadcast join and sort merge join. After this PR, I saw a few queries choose broadcast join rather than sort merge join without tuning spark.sql.autoBroadcastJoinThreshold for every query, ending up with about 6-8X improvements in end-to-end time. We use `defaultSize` of DataType to estimate the size of a column. Currently, for DecimalType/StringType/BinaryType and UDT, we over-estimate too much (4096 bytes), so this PR changes them to more reasonable values. Here are the new defaultSize values: DecimalType: 8 or 16 bytes, based on the precision StringType: 20 bytes BinaryType: 100 bytes UDT: default size of SQL type These numbers are not perfect (it is hard to have a perfect number for them), but should be better than 4096. Author: Davies Liu <davies@databricks.com> Closes #11210 from davies/statics.	2016-02-23 12:55:44 -08:00
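One way to picture the estimate in SPARK-13329 is the sketch below. It is a toy model, not Spark's code: `ToyStats`, its column types, and `projectedSize` are hypothetical, the scaling formula (child size times output row width over child row width) is an assumption about how "considering output" could work, and the precision threshold of 18 for decimals is also an assumption. The 20-byte and 100-byte defaults come from the commit message.

```scala
// Toy size estimation for a projection (hypothetical names; formula is an assumption).
object ToyStats {
  sealed trait ColType
  case object IntCol extends ColType
  case object LongCol extends ColType
  case object StringCol extends ColType
  case object BinaryCol extends ColType
  case class DecimalCol(precision: Int) extends ColType

  // Per-column default sizes in bytes.
  def defaultSize(t: ColType): Long = t match {
    case IntCol        => 4L
    case LongCol       => 8L
    case StringCol     => 20L  // from the commit message
    case BinaryCol     => 100L // from the commit message
    case DecimalCol(p) => if (p <= 18) 8L else 16L // threshold of 18 is an assumption
  }

  def rowSize(cols: Seq[ColType]): Long = cols.map(defaultSize).sum

  // Scale the child's size by the ratio of output row width to child row width.
  // The result is kept at 1 byte or more so later comparisons never see zero.
  def projectedSize(childSizeInBytes: BigInt,
                    childCols: Seq[ColType],
                    outputCols: Seq[ColType]): BigInt = {
    val childWidth = BigInt(math.max(rowSize(childCols), 1L))
    val scaled = childSizeInBytes * BigInt(rowSize(outputCols)) / childWidth
    scaled.max(BigInt(1))
  }
}
```

With a child of one string, one binary, and one long column (128 bytes per row) projected down to the long column alone, a 1280-byte child would be estimated at 80 bytes, which is the kind of shrinkage that could bring a plan under a broadcast threshold.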
Michael Armbrust	c5bfe5d2a2	[SPARK-13440][SQL] ObjectType should accept any ObjectType, If should not care about nullability The type checking functions of `If` and `UnwrapOption` are fixed to eliminate spurious failures. `UnwrapOption` was checking for an input of `ObjectType` but `ObjectType`'s accept function was hard coded to return `false`. `If`'s type check was returning a false negative in the case that the two options differed only by nullability. Tests added: - an end-to-end regression test is added to `DatasetSuite` for the reported failure. - all the unit tests in `ExpressionEncoderSuite` are augmented to also confirm successful analysis. These tests are actually what pointed out the additional issues with `If` resolution. Author: Michael Armbrust <michael@databricks.com> Closes #11316 from marmbrus/datasetOptions.	2016-02-23 11:20:27 -08:00
gatorsmile	87250580f2	[SPARK-13263][SQL] SQL Generation Support for Tablesample In the parser, tableSample clause is part of tableSource. ``` tableSource init { gParent.pushMsg("table source", state); } after { gParent.popMsg(state); } : tabname=tableName ((tableProperties) => props=tableProperties)? ((tableSample) => ts=tableSample)? ((KW_AS) => (KW_AS alias=Identifier) \| (Identifier) => (alias=Identifier))? -> ^(TOK_TABREF $tabname $props? $ts? $alias?) ; ``` Two typical query samples using TABLESAMPLE are: ``` "SELECT s.id FROM t0 TABLESAMPLE(10 PERCENT) s" "SELECT * FROM t0 TABLESAMPLE(0.1 PERCENT)" ``` FYI, the logical plan of a TABLESAMPLE query: ``` sql("SELECT * FROM t0 TABLESAMPLE(0.1 PERCENT)").explain(true) == Analyzed Logical Plan == id: bigint Project [id#16L] +- Sample 0.0, 0.001, false, 381 +- Subquery t0 +- Relation[id#16L] ParquetRelation ``` Thanks! cc liancheng Author: gatorsmile <gatorsmile@gmail.com> Author: xiaoli <lixiao1983@gmail.com> Author: Xiao Li <xiaoli@Xiaos-MacBook-Pro.local> This patch had conflicts when merged, resolved by Committer: Cheng Lian <lian@databricks.com> Closes #11148 from gatorsmile/tablesplitsample.	2016-02-23 16:13:09 +08:00
Dongjoon Hyun	024482bf51	[MINOR][DOCS] Fix all typos in markdown files of `doc` and similar patterns in other comments ## What changes were proposed in this pull request? This PR tries to fix all typos in all markdown files under `docs` module, and fixes similar typos in other comments, too. ## How was the this patch tested? manual tests. Author: Dongjoon Hyun <dongjoon@apache.org> Closes #11300 from dongjoon-hyun/minor_fix_typos.	2016-02-22 09:52:07 +00:00
Reynold Xin	9bf6a926a1	[HOTFIX] Fix compilation break	2016-02-21 19:37:35 -08:00
Liang-Chi Hsieh	55d6fdf22d	[SPARK-13321][SQL] Support nested UNION in parser JIRA: https://issues.apache.org/jira/browse/SPARK-13321 The following SQL can not be parsed with current parser: SELECT `u_1`.`id` FROM (((SELECT `t0`.`id` FROM `default`.`t0`) UNION ALL (SELECT `t0`.`id` FROM `default`.`t0`)) UNION ALL (SELECT `t0`.`id` FROM `default`.`t0`)) AS u_1 We should fix it. Author: Liang-Chi Hsieh <viirya@gmail.com> Closes #11204 from viirya/nested-union.	2016-02-21 19:10:17 -08:00
Andrew Or	6c3832b26e	[SPARK-13080][SQL] Implement new Catalog API using Hive ## What changes were proposed in this pull request? This is a step towards merging `SQLContext` and `HiveContext`. A new internal Catalog API was introduced in #10982 and extended in #11069. This patch introduces an implementation of this API using `HiveClient`, an existing interface to Hive. It also extends `HiveClient` with additional calls to Hive that are needed to complete the catalog implementation. Where should I start reviewing? The new catalog introduced is `HiveCatalog`. This class is relatively simple because it just calls `HiveClientImpl`, where most of the new logic is. I would not start with `HiveClient`, `HiveQl`, or `HiveMetastoreCatalog`, which are modified mainly because of a refactor. Why is this patch so big? I had to refactor HiveClient to remove an intermediate representation of databases, tables, partitions etc. After this refactor `CatalogTable` convert directly to and from `HiveTable` (etc.). Otherwise we would have to first convert `CatalogTable` to the intermediate representation and then convert that to HiveTable, which is messy. The new class hierarchy is as follows: ``` org.apache.spark.sql.catalyst.catalog.Catalog - org.apache.spark.sql.catalyst.catalog.InMemoryCatalog - org.apache.spark.sql.hive.HiveCatalog ``` Note that, as of this patch, none of these classes are currently used anywhere yet. This will come in the future before the Spark 2.0 release. ## How was the this patch tested? All existing unit tests, and HiveCatalogSuite that extends CatalogTestCases. Author: Andrew Or <andrew@databricks.com> Author: Reynold Xin <rxin@databricks.com> Closes #11293 from rxin/hive-catalog.	2016-02-21 15:00:24 -08:00
Reynold Xin	0947f0989b	[SPARK-13420][SQL] Rename Subquery logical plan to SubqueryAlias ## What changes were proposed in this pull request? This patch renames logical.Subquery to logical.SubqueryAlias, which is a more appropriate name for this operator (versus subqueries as expressions). ## How was the this patch tested? Unit tests. Author: Reynold Xin <rxin@databricks.com> Closes #11288 from rxin/SPARK-13420.	2016-02-21 11:31:46 -08:00
Cheng Lian	d9efe63ecd	[SPARK-12799] Simplify various string output for expressions This PR introduces several major changes: 1. Replacing `Expression.prettyString` with `Expression.sql` The `prettyString` method is mostly an internal, developer faced facility for debugging purposes, and shouldn't be exposed to users. 1. Using SQL-like representation as column names for selected fields that are not named expression (back-ticks and double quotes should be removed) Before, we were using `prettyString` as column names when possible, and sometimes the result column names can be weird. Here are several examples: Expression \| `prettyString` \| `sql` \| Note ------------------ \| -------------- \| ---------- \| --------------- `a && b` \| `a && b` \| `a AND b` \| `a.getField("f")` \| `a[f]` \| `a.f` \| `a` is a struct 1. Adding trait `NonSQLExpression` extending from `Expression` for expressions that don't have a SQL representation (e.g. Scala UDF/UDAF and Java/Scala object expressions used for encoders) `NonSQLExpression.sql` may return an arbitrary user facing string representation of the expression. Author: Cheng Lian <lian@databricks.com> Closes #10757 from liancheng/spark-12799.simplify-expression-string-methods.	2016-02-21 22:53:15 +08:00
Davies Liu	7925071280	[SPARK-13306] [SQL] uncorrelated scalar subquery A scalar subquery is a subquery that only generate single row and single column, could be used as part of expression. Uncorrelated scalar subquery means it does not has a reference to external table. All the uncorrelated scalar subqueries will be executed during prepare() of SparkPlan. The plans for query ```sql select 1 + (select 2 + (select 3)) ``` looks like this ``` == Parsed Logical Plan == 'Project [unresolvedalias((1 + subquery#1),None)] :- OneRowRelation$ +- 'Subquery subquery#1 +- 'Project [unresolvedalias((2 + subquery#0),None)] :- OneRowRelation$ +- 'Subquery subquery#0 +- 'Project [unresolvedalias(3,None)] +- OneRowRelation$ == Analyzed Logical Plan == _c0: int Project [(1 + subquery#1) AS _c0#4] :- OneRowRelation$ +- Subquery subquery#1 +- Project [(2 + subquery#0) AS _c0#3] :- OneRowRelation$ +- Subquery subquery#0 +- Project [3 AS _c0#2] +- OneRowRelation$ == Optimized Logical Plan == Project [(1 + subquery#1) AS _c0#4] :- OneRowRelation$ +- Subquery subquery#1 +- Project [(2 + subquery#0) AS _c0#3] :- OneRowRelation$ +- Subquery subquery#0 +- Project [3 AS _c0#2] +- OneRowRelation$ == Physical Plan == WholeStageCodegen : +- Project [(1 + subquery#1) AS _c0#4] : :- INPUT : +- Subquery subquery#1 : +- WholeStageCodegen : : +- Project [(2 + subquery#0) AS _c0#3] : : :- INPUT : : +- Subquery subquery#0 : : +- WholeStageCodegen : : : +- Project [3 AS _c0#2] : : : +- INPUT : : +- Scan OneRowRelation[] : +- Scan OneRowRelation[] +- Scan OneRowRelation[] ``` Author: Davies Liu <davies@databricks.com> Closes #11190 from davies/scalar_subquery.	2016-02-20 21:01:51 -08:00
gatorsmile	f88c641bc8	[SPARK-13310] [SQL] Resolve Missing Sorting Columns in Generate ```scala // case 1: missing sort columns are resolvable if join is true sql("SELECT explode(a) AS val, b FROM data WHERE b < 2 order by val, c") // case 2: missing sort columns are not resolvable if join is false. Thus, issue an error message in this case sql("SELECT explode(a) AS val FROM data order by val, c") ``` When sort columns are not in `Generate`, we can resolve them when `join` is equal to `true`. Still trying to add more test cases for the other `UnaryNode` types. Could you review the changes? davies cloud-fan Thanks! Author: gatorsmile <gatorsmile@gmail.com> Closes #11198 from gatorsmile/missingInSort.	2016-02-20 13:53:23 -08:00
Reynold Xin	6624a588c1	Revert "[SPARK-12567] [SQL] Add aes_{encrypt,decrypt} UDFs" This reverts commit `4f9a664818`.	2016-02-19 22:44:20 -08:00
Kai Jiang	4f9a664818	[SPARK-12567] [SQL] Add aes_{encrypt,decrypt} UDFs Author: Kai Jiang <jiangkai@gmail.com> Closes #10527 from vectorijk/spark-12567.	2016-02-19 22:28:47 -08:00
gatorsmile	ec7a1d6e42	[SPARK-12594] [SQL] Outer Join Elimination by Filter Conditions Conversion of outer joins, if the predicates in filter conditions can restrict the result sets so that all null-supplying rows are eliminated. - `full outer` -> `inner` if both sides have such predicates - `left outer` -> `inner` if the right side has such predicates - `right outer` -> `inner` if the left side has such predicates - `full outer` -> `left outer` if only the left side has such predicates - `full outer` -> `right outer` if only the right side has such predicates If applicable, this can greatly improve the performance, since outer join is much slower than inner join, full outer join is much slower than left/right outer join. The original PR is https://github.com/apache/spark/pull/10542 Author: gatorsmile <gatorsmile@gmail.com> Author: xiaoli <lixiao1983@gmail.com> Author: Xiao Li <xiaoli@Xiaos-MacBook-Pro.local> Closes #10567 from gatorsmile/outerJoinEliminationByFilterCond.	2016-02-19 22:27:10 -08:00
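The five conversions listed in SPARK-12594 can be written as a small decision function. This is a sketch only: `ToyOuterJoinElimination` and its flags are hypothetical, and deciding whether a filter predicate really rejects null-supplied rows from one side is assumed to have been done elsewhere.

```scala
// Toy decision table for outer join elimination (hypothetical names).
object ToyOuterJoinElimination {
  sealed trait JoinType
  case object Inner extends JoinType
  case object LeftOuter extends JoinType
  case object RightOuter extends JoinType
  case object FullOuter extends JoinType

  // leftRejectsNulls / rightRejectsNulls: the filter above the join has a predicate
  // on that side's columns which cannot hold for a null-supplied row.
  def simplify(joinType: JoinType,
               leftRejectsNulls: Boolean,
               rightRejectsNulls: Boolean): JoinType = joinType match {
    case FullOuter if leftRejectsNulls && rightRejectsNulls => Inner
    case FullOuter if leftRejectsNulls                      => LeftOuter
    case FullOuter if rightRejectsNulls                     => RightOuter
    case LeftOuter if rightRejectsNulls                     => Inner
    case RightOuter if leftRejectsNulls                     => Inner
    case other                                              => other
  }
}
```

The order of the `FullOuter` cases matters: the "both sides" case has to be tested before the single-sided ones.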
Sameer Agarwal	091f6a7830	[SPARK-13091][SQL] Rewrite/Propagate constraints for Aliases This PR adds support for rewriting constraints if there are aliases in the query plan. For e.g., if there is a query of form `SELECT a, a AS b`, any constraints on `a` now also apply to `b`. JIRA: https://issues.apache.org/jira/browse/SPARK-13091 cc marmbrus Author: Sameer Agarwal <sameer@databricks.com> Closes #11144 from sameeragarwal/alias.	2016-02-19 14:48:34 -08:00
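For the `SELECT a, a AS b` case in SPARK-13091, the rewrite can be sketched as copying each constraint on the aliased attribute onto the alias. The names below (`ToyAliasConstraints`, `ToyGt`) are hypothetical and the constraint language is reduced to a single "greater than" form.

```scala
// Toy alias propagation for constraints (hypothetical names, single predicate form).
object ToyAliasConstraints {
  case class ToyGt(attr: String, value: Int)

  // aliases maps a source attribute to its alias, e.g. "a" -> "b" for `a AS b`.
  // Every constraint on a source attribute is duplicated for the alias.
  def withAliases(constraints: Set[ToyGt], aliases: Map[String, String]): Set[ToyGt] = {
    val rewritten = aliases.toSeq.flatMap { case (source, alias) =>
      constraints.filter(_.attr == source).map(_.copy(attr = alias))
    }
    constraints ++ rewritten
  }
}
```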
Liang-Chi Hsieh	c7c55637bf	[SPARK-13384][SQL] Keep attribute qualifiers after dedup in Analyzer JIRA: https://issues.apache.org/jira/browse/SPARK-13384 ## What changes were proposed in this pull request? When we de-duplicate attributes in Analyzer, we create new attributes. However, we don't keep the original qualifiers, so some plans fail analysis. We should keep the original qualifiers in the new attributes. ## How was this patch tested? Unit test is added. Author: Liang-Chi Hsieh <viirya@gmail.com> Closes #11261 from viirya/keep-attr-qualifiers.	2016-02-19 12:22:22 -08:00
Davies Liu	26f38bb83c	[SPARK-13351][SQL] fix column pruning on Expand Currently, the columns in projects of Expand that are not used by Aggregate are not pruned; this PR fixes that. Author: Davies Liu <davies@databricks.com> Closes #11225 from davies/fix_pruning_expand.	2016-02-18 13:07:41 -08:00
Josh Rosen	a8bbc4f50e	[SPARK-12503][SPARK-12505] Limit pushdown in UNION ALL and OUTER JOIN This patch adds a new optimizer rule for performing limit pushdown. Limits will now be pushed down in two cases: - If a limit is on top of a `UNION ALL` operator, then a partition-local limit operator will be pushed to each of the union operator's children. - If a limit is on top of an `OUTER JOIN` then a partition-local limit will be pushed to one side of the join. For `LEFT OUTER` and `RIGHT OUTER` joins, the limit will be pushed to the left and right side, respectively. For `FULL OUTER` join, we will only push limits when at most one of the inputs is already limited: if one input is limited we will push a smaller limit on top of it and if neither input is limited then we will limit the input which is estimated to be larger. These optimizations were proposed previously by gatorsmile in #10451 and #10454, but those earlier PRs were closed and deferred for later because at that time Spark's physical `Limit` operator would trigger a full shuffle to perform global limits so there was a chance that pushdowns could actually harm performance by causing additional shuffles/stages. In #7334, we split the `Limit` operator into separate `LocalLimit` and `GlobalLimit` operators, so we can now push down only local limits (which don't require extra shuffles). This patch is based on both of gatorsmile's patches, with changes and simplifications due to partition-local-limiting. When we push down the limit, we still keep the original limit in place, so we need a mechanism to ensure that the optimizer rule doesn't keep pattern-matching once the limit has been pushed down. In order to handle this, this patch adds a `maxRows` method to `SparkPlan` which returns the maximum number of rows that the plan can compute, then defines the pushdown rules to only push limits to children if the children's maxRows are greater than the limit's maxRows. 
This idea is carried over from #10451; see that patch for additional discussion. Author: Josh Rosen <joshrosen@databricks.com> Closes #11121 from JoshRosen/limit-pushdown-2.	2016-02-14 17:32:21 -08:00
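The `maxRows` guard described for the limit pushdown can be sketched on a toy plan tree. Everything here is hypothetical (`ToyLimitPushdown`, `Scan`, `pushThroughUnion`), only the `UNION ALL` case is covered, and the toy treats a local limit of `n` as bounding its subtree at `n` rows, which is a simplification.

```scala
// Toy limit pushdown through a union, guarded by maxRows (hypothetical names).
object ToyLimitPushdown {
  sealed trait Plan { def maxRows: Option[Long] }

  case class Scan(name: String) extends Plan {
    def maxRows: Option[Long] = None // unknown upper bound
  }
  case class Union(children: Seq[Plan]) extends Plan {
    def maxRows: Option[Long] =
      if (children.forall(_.maxRows.isDefined)) Some(children.map(_.maxRows.get).sum)
      else None
  }
  case class LocalLimit(n: Long, child: Plan) extends Plan {
    def maxRows: Option[Long] = Some(child.maxRows.fold(n)(m => math.min(m, n)))
  }
  case class GlobalLimit(n: Long, child: Plan) extends Plan {
    def maxRows: Option[Long] = Some(child.maxRows.fold(n)(m => math.min(m, n)))
  }

  // Add a local limit only when the child may still produce more than n rows.
  private def maybePush(n: Long, child: Plan): Plan =
    if (child.maxRows.forall(_ > n)) LocalLimit(n, child) else child

  // The original limit stays in place; the guard stops the rule from re-firing.
  def pushThroughUnion(plan: Plan): Plan = plan match {
    case GlobalLimit(n, Union(children)) =>
      GlobalLimit(n, Union(children.map(maybePush(n, _))))
    case other => other
  }
}
```

Applying `pushThroughUnion` a second time leaves the plan unchanged, because each child already reports a `maxRows` that is not greater than the limit.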
Sean Owen	388cd9ea8d	[SPARK-13172][CORE][SQL] Stop using RichException.getStackTrace it is deprecated Replace `getStackTraceString` with `Utils.exceptionString` Author: Sean Owen <sowen@cloudera.com> Closes #11182 from srowen/SPARK-13172.	2016-02-13 21:05:48 -08:00
Davies Liu	5b805df279	[SPARK-12705] [SQL] push missing attributes for Sort The current implementation of ResolveSortReferences can only push one missing attribute into its child; it failed to analyze TPCDS Q98 because there are two missing attributes in that query (one from Window, another from Aggregate). Author: Davies Liu <davies@databricks.com> Closes #11153 from davies/resolve_sort.	2016-02-12 09:34:18 -08:00
gatorsmile	e88bff1279	[SPARK-13235][SQL] Removed an Extra Distinct from the Plan when Using Union in SQL Currently, the parser added two `Distinct` operators in the plan if we are using `Union` or `Union Distinct` in the SQL. This PR is to remove the extra `Distinct` from the plan. For example, before the fix, the following query has a plan with two `Distinct` ```scala sql("select * from t0 union select * from t0").explain(true) ``` ``` == Parsed Logical Plan == 'Project [unresolvedalias(,None)] +- 'Subquery u_2 +- 'Distinct +- 'Project [unresolvedalias(,None)] +- 'Subquery u_1 +- 'Distinct +- 'Union :- 'Project [unresolvedalias(,None)] : +- 'UnresolvedRelation `t0`, None +- 'Project [unresolvedalias(,None)] +- 'UnresolvedRelation `t0`, None == Analyzed Logical Plan == id: bigint Project [id#16L] +- Subquery u_2 +- Distinct +- Project [id#16L] +- Subquery u_1 +- Distinct +- Union :- Project [id#16L] : +- Subquery t0 : +- Relation[id#16L] ParquetRelation +- Project [id#16L] +- Subquery t0 +- Relation[id#16L] ParquetRelation == Optimized Logical Plan == Aggregate [id#16L], [id#16L] +- Aggregate [id#16L], [id#16L] +- Union :- Project [id#16L] : +- Relation[id#16L] ParquetRelation +- Project [id#16L] +- Relation[id#16L] ParquetRelation ``` After the fix, the plan is changed without the extra `Distinct` as follows: ``` == Parsed Logical Plan == 'Project [unresolvedalias(,None)] +- 'Subquery u_1 +- 'Distinct +- 'Union :- 'Project [unresolvedalias(,None)] : +- 'UnresolvedRelation `t0`, None +- 'Project [unresolvedalias(*,None)] +- 'UnresolvedRelation `t0`, None == Analyzed Logical Plan == id: bigint Project [id#17L] +- Subquery u_1 +- Distinct +- Union :- Project [id#16L] : +- Subquery t0 : +- Relation[id#16L] ParquetRelation +- Project [id#16L] +- Subquery t0 +- Relation[id#16L] ParquetRelation == Optimized Logical Plan == Aggregate [id#17L], [id#17L] +- Union :- Project [id#16L] : +- Relation[id#16L] ParquetRelation +- Project [id#16L] +- Relation[id#16L] 
ParquetRelation ``` Author: gatorsmile <gatorsmile@gmail.com> Closes #11120 from gatorsmile/unionDistinct.	2016-02-11 08:40:27 +01:00
Herman van Hovell	1842c55d89	[SPARK-13276] Catch bad characters at the end of a Table Identifier/Expression string The parser currently parses the following strings without a hitch: * Table Identifier: * `a.b.c` should fail, but results in the following table identifier `a.b` * `table!#` should fail, but results in the following table identifier `table` * Expression * `1+2 r+e` should fail, but results in the following expression `1 + 2` This PR fixes this by adding terminated rules for both expression parsing and table identifier parsing. cc cloud-fan (we discussed this in https://github.com/apache/spark/pull/10649) jayadevanmurali (this causes your PR https://github.com/apache/spark/pull/11051 to fail) Author: Herman van Hovell <hvanhovell@questtec.nl> Closes #11159 from hvanhovell/SPARK-13276.	2016-02-11 08:30:58 +01:00
gatorsmile	663cc400f3	[SPARK-12725][SQL] Resolving Name Conflicts in SQL Generation and Name Ambiguity Caused by Internally Generated Expressions Some analysis rules generate aliases or auxiliary attribute references with the same name but different expression IDs. For example, `ResolveAggregateFunctions` introduces `havingCondition` and `aggOrder`, and `DistinctAggregationRewriter` introduces `gid`. This is OK for normal query execution since these attribute references get expression IDs. However, it's troublesome when converting resolved query plans back to SQL query strings since expression IDs are erased. Here's an example Spark 1.6.0 snippet for illustration: ```scala sqlContext.range(10).select('id as 'a, 'id as 'b).registerTempTable("t") sqlContext.sql("SELECT SUM(a) FROM t GROUP BY a, b ORDER BY COUNT(a), COUNT(b)").explain(true) ``` The above code produces the following resolved plan: ``` == Analyzed Logical Plan == _c0: bigint Project [_c0#101L] +- Sort [aggOrder#102L ASC,aggOrder#103L ASC], true +- Aggregate [a#47L,b#48L], [(sum(a#47L),mode=Complete,isDistinct=false) AS _c0#101L,(count(a#47L),mode=Complete,isDistinct=false) AS aggOrder#102L,(count(b#48L),mode=Complete,isDistinct=false) AS aggOrder#103L] +- Subquery t +- Project [id#46L AS a#47L,id#46L AS b#48L] +- LogicalRDD [id#46L], MapPartitionsRDD[44] at range at <console>:26 ``` Here we can see that both aggregate expressions in `ORDER BY` are extracted into an `Aggregate` operator, and both of them are named `aggOrder` with different expression IDs. The solution is to automatically add the expression IDs into the attribute name for the Alias and AttributeReferences that are generated by Analyzer in SQL Generation. In this PR, it also resolves another issue. Users could use the same name as the internally generated names. The duplicate names should not cause name ambiguity. When resolving the column, Catalyst should not pick the column that is internally generated. Could you review the solution? 
marmbrus liancheng I did not set the newly added flag for all the alias and attribute reference generated by Analyzers. Please let me know if I should do it? Thank you! Author: gatorsmile <gatorsmile@gmail.com> Closes #11050 from gatorsmile/namingConflicts.	2016-02-11 10:44:39 +08:00
Wenchen Fan	7fe4fe630a	[SPARK-12888] [SQL] [FOLLOW-UP] benchmark the new hash expression Adds the benchmark results as comments. The codegen version is slower than the interpreted version for the `simple` case because of 3 reasons: 1. The codegen version uses a more complex hash algorithm than the interpreted version, i.e. `Murmur3_x86_32.hashInt` vs [simple multiplication and addition](https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/rows.scala#L153). 2. The codegen version writes the hash value to a row first and then reads it out. I tried to create a `GenerateHasher` that can generate code to return the hash value directly and got about a 60% speed up for the `simple` case; is it worth it? 3. The row in the `simple` case only has one int field, so the runtime reflection may be removed because of branch prediction, which makes the interpreted version faster. The `array` case is also slow for similar reasons, e.g. array elements are of the same type, so the interpreted version can probably get rid of runtime reflection by branch prediction. Author: Wenchen Fan <wenchen@databricks.com> Closes #10917 from cloud-fan/hash-benchmark.	2016-02-09 13:06:36 -08:00
Wenchen Fan	8e4d15f707	[SPARK-13101][SQL] nullability of array type element should not fail analysis of encoder Nullability should only be considered an optimization rather than part of the type system, so instead of failing analysis on mismatched nullability, we should pass analysis and add a runtime null check. Author: Wenchen Fan <wenchen@databricks.com> Closes #11035 from cloud-fan/ignore-nullability.	2016-02-08 12:06:00 -08:00
Jakob Odersky	6883a5120c	[SPARK-13171][CORE] Replace future calls with Future Trivial search-and-replace to eliminate deprecation warnings in Scala 2.11. Also works with 2.10 Author: Jakob Odersky <jakob@odersky.com> Closes #11085 from jodersky/SPARK-13171.	2016-02-05 19:00:12 -08:00
Wenchen Fan	1ed354a536	[SPARK-12939][SQL] migrate encoder resolution logic to Analyzer https://issues.apache.org/jira/browse/SPARK-12939 Now we will catch `ObjectOperator` in `Analyzer` and resolve the `fromRowExpression/deserializer` inside it. Also update `MapGroups` and `CoGroup` to pass in `dataAttributes`, so that we can correctly resolve the value deserializer (the `child.output` contains both grouping key and values, which may mess things up if they have same-name attributes). End-to-end tests are added. follow-ups: * remove encoders from typed aggregate expression. * completely remove resolve/bind in `ExpressionEncoder` Author: Wenchen Fan <wenchen@databricks.com> Closes #10852 from cloud-fan/bug.	2016-02-05 14:34:12 -08:00
Andrew Or	bd38dd6f75	[SPARK-13079][SQL] InMemoryCatalog follow-ups This patch incorporates review feedback from #11069, which is already merged. Author: Andrew Or <andrew@databricks.com> Closes #11080 from andrewor14/catalog-follow-ups.	2016-02-04 12:20:18 -08:00
Josh Rosen	33212cb9a1	[SPARK-13168][SQL] Collapse adjacent repartition operators Spark SQL should collapse adjacent `Repartition` operators and only keep the last one. Author: Josh Rosen <joshrosen@databricks.com> Closes #11064 from JoshRosen/collapse-repartition.	2016-02-04 11:08:50 -08:00
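The rewrite in SPARK-13168 can be pictured on a toy plan tree. The names (`ToyCollapseRepartition`, `Table`, `Repartition`) are hypothetical; "the last one" is read here as the outermost operator, i.e. the one applied last.

```scala
// Toy rule that collapses stacked repartition operators (hypothetical names).
object ToyCollapseRepartition {
  sealed trait Node
  case class Table(name: String) extends Node
  case class Repartition(numPartitions: Int, child: Node) extends Node

  // When one Repartition sits directly on another, drop the inner one and retry,
  // so any chain of adjacent operators ends up as the single outermost operator.
  def collapse(plan: Node): Node = plan match {
    case Repartition(n, Repartition(_, grandChild)) => collapse(Repartition(n, grandChild))
    case Repartition(n, child)                      => Repartition(n, collapse(child))
    case leaf                                       => leaf
  }
}
```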
Reynold Xin	dee801adb7	[SPARK-12828][SQL] Natural join follow-up This is a small addendum to #10762 to make the code more robust against future changes. Author: Reynold Xin <rxin@databricks.com> Closes #11070 from rxin/SPARK-12828-natural-join.	2016-02-03 23:43:48 -08:00
Daoyuan Wang	0f81318ae2	[SPARK-12828][SQL] add natural join support Jira: https://issues.apache.org/jira/browse/SPARK-12828 Author: Daoyuan Wang <daoyuan.wang@intel.com> Closes #10762 from adrian-wang/naturaljoin.	2016-02-03 21:05:53 -08:00
Andrew Or	a64831124c	[SPARK-13079][SQL] Extend and implement InMemoryCatalog This is a step towards consolidating `SQLContext` and `HiveContext`. This patch extends the existing Catalog API added in #10982 to include methods for handling table partitions. In particular, a partition is identified by `PartitionSpec`, which is just a `Map[String, String]`. The Catalog is still not used by anything yet, but its API is now more or less complete and an implementation is fully tested. About 200 lines are test code. Author: Andrew Or <andrew@databricks.com> Closes #11069 from andrewor14/catalog.	2016-02-03 19:32:41 -08:00
Herman van Hovell	9dd2741ebe	[SPARK-13157] [SQL] Support any kind of input for SQL commands. The ```SparkSqlLexer``` currently swallows characters which have not been defined in the grammar. This causes problems with SQL commands, such as: ```add jar file:///tmp/ab/TestUDTF.jar```. In this example the `````` is swallowed. This PR adds an extra Lexer rule to handle such input, and makes a tiny modification to the ```ASTNode```. cc davies liancheng Author: Herman van Hovell <hvanhovell@questtec.nl> Closes #11052 from hvanhovell/SPARK-13157.	2016-02-03 12:31:30 -08:00
Sameer Agarwal	138c300f97	[SPARK-12957][SQL] Initial support for constraint propagation in SparkSQL Based on the semantics of a query, we can derive a number of data constraints on output of each (logical or physical) operator. For instance, if a filter defines `‘a > 10`, we know that the output data of this filter satisfies 2 constraints: 1. `‘a > 10` 2. `isNotNull(‘a)` This PR proposes a possible way of keeping track of these constraints and propagating them in the logical plan, which can then help us build more advanced optimizations (such as pruning redundant filters, optimizing joins, among others). We define constraints as a set of (implicitly conjunctive) expressions. For e.g., if a filter operator has constraints = `Set(‘a > 10, ‘b < 100)`, it’s implied that the outputs satisfy both individual constraints (i.e., `‘a > 10` AND `‘b < 100`). Design Document: https://docs.google.com/a/databricks.com/document/d/1WQRgDurUBV9Y6CWOBS75PQIqJwT-6WftVa18xzm7nCo/edit?usp=sharing Author: Sameer Agarwal <sameer@databricks.com> Closes #10844 from sameeragarwal/constraints.	2016-02-02 22:22:50 -08:00
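The filter example in SPARK-12957 (`a > 10` implying `isNotNull(a)`) can be sketched as follows. This is a toy model with hypothetical names (`ToyConstraints`, `Pred`); it only knows two comparison forms and treats the resulting set as an implicit conjunction.

```scala
// Toy constraint derivation for a filter (hypothetical names, two comparison forms).
object ToyConstraints {
  sealed trait Pred
  case class GreaterThan(attr: String, value: Int) extends Pred
  case class LessThan(attr: String, value: Int) extends Pred
  case class IsNotNull(attr: String) extends Pred

  // A comparison can only hold for a non-null input, so it implies IsNotNull.
  private def implied(p: Pred): Set[Pred] = p match {
    case GreaterThan(a, _) => Set(IsNotNull(a))
    case LessThan(a, _)    => Set(IsNotNull(a))
    case _                 => Set.empty
  }

  // Constraints on a filter's output: its own conditions plus what they imply.
  def filterConstraints(conditions: Set[Pred]): Set[Pred] =
    conditions ++ conditions.flatMap(implied)
}
```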
Wenchen Fan	672032d0ab	[SPARK-13020][SQL][TEST] fix random generator for map type When we generate a map, we first randomly pick a length, then create a seq of key-value pairs with the expected length, and finally call `toMap`. However, `toMap` will remove all duplicated keys, which makes the actual map size much smaller than expected. This PR fixes this problem by putting keys in a set first, to guarantee we have enough keys to build a map with the expected length. Author: Wenchen Fan <wenchen@databricks.com> Closes #10930 from cloud-fan/random-generator.	2016-02-03 08:26:35 +08:00
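The problem and the fix in SPARK-13020 can be reproduced in a few lines of plain Scala. This is a sketch, not the test utility itself: `ToyMapGenerator`, `naive`, `withDistinctKeys`, and the `keySpace` parameter are hypothetical.

```scala
import scala.collection.mutable
import scala.util.Random

// Toy random map generators (hypothetical names).
object ToyMapGenerator {
  // Problematic approach: duplicate keys collapse in toMap, so the map can
  // end up smaller than `length`.
  def naive(rand: Random, length: Int, keySpace: Int): Map[Int, Int] =
    Seq.fill(length)(rand.nextInt(keySpace) -> rand.nextInt()).toMap

  // Fixed approach: collect distinct keys in a set first, then attach values.
  def withDistinctKeys(rand: Random, length: Int, keySpace: Int): Map[Int, Int] = {
    require(length <= keySpace, "not enough distinct keys available")
    val keys = mutable.LinkedHashSet.empty[Int]
    while (keys.size < length) keys += rand.nextInt(keySpace)
    keys.iterator.map(k => k -> rand.nextInt()).toMap
  }
}
```

With a small key space the difference is easy to observe: `withDistinctKeys` always returns exactly `length` entries, while `naive` returns at most `length`.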
Reynold Xin	be7a2fc071	[SPARK-13078][SQL] API and test cases for internal catalog This pull request creates an internal catalog API. The creation of this API is the first step towards consolidating SQLContext and HiveContext. I envision we will have two different implementations in Spark 2.0: (1) a simple in-memory implementation, and (2) an implementation based on the current HiveClient (ClientWrapper). I took a look at what Hive's internal metastore interface/implementation, and then created this API based on it. I believe this is the minimal set needed in order to achieve all the needed functionality. Author: Reynold Xin <rxin@databricks.com> Closes #10982 from rxin/SPARK-13078.	2016-02-01 14:11:52 -08:00
gatorsmile	8f26eb5ef6	[SPARK-12705][SPARK-10777][SQL] Analyzer Rule ResolveSortReferences JIRA: https://issues.apache.org/jira/browse/SPARK-12705 Scope: This PR is a general fix for sorting reference resolution when the child's `outputSet` does not have the order-by attributes (called missing attributes): - UnaryNode support is limited to `Project`, `Window`, `Aggregate`, `Distinct`, `Filter`, `RepartitionByExpression`. - We will not try to resolve the missing references inside a subquery, unless the outputSet of this subquery contains it. General Reference Resolution Rules: - Jump over the nodes with the following types: `Distinct`, `Filter`, `RepartitionByExpression`. There is no need to add missing attributes, because their `outputSet` is decided by their `inputSet`, which is the `outputSet` of their children. - Group-by expressions in `Aggregate`: missing order-by attributes are not allowed to be added into group-by expressions since that would change the query result. RDBMSs disallow it for the same reason. - Aggregate expressions in `Aggregate`: if the group-by expressions in `Aggregate` contain the missing attributes but the aggregate expressions do not, just add them to the aggregate expressions. This can resolve the analysisExceptions thrown by the three TPCDS queries. - `Project` and `Window` are special. We just need to add the missing attributes to their `projectList`. Implementation: 1. Traverse the whole tree in a pre-order manner to find all the resolvable missing order-by attributes. 2. Traverse the whole tree in a post-order manner to add the found missing order-by attributes to the node if their `inputSet` contains the attributes. 3. If the origins of the missing order-by attributes are different nodes, each pass only resolves the missing attributes that are from the same node. Risk: Low. This rule is triggered iff ```!s.resolved && child.resolved``` is true. Thus, very few cases are affected. Author: gatorsmile <gatorsmile@gmail.com> Closes #10678 from gatorsmile/sortWindows.	2016-02-01 11:57:13 -08:00
gatorsmile	5f686cc8b7	[SPARK-12656] [SQL] Implement Intersect with Left-semi Join Our current Intersect physical operator simply delegates to RDD.intersect. We should remove the Intersect physical operator and simply transform a logical intersect into a semi-join with distinct. This way, we can take advantage of all the benefits of join implementations (e.g. managed memory, code generation, broadcast joins). After a search, I found one of the mainstream RDBMS did the same. In their query explain, Intersect is replaced by Left-semi Join. Left-semi Join could help outer-join elimination in Optimizer, as shown in the PR: https://github.com/apache/spark/pull/10566 Author: gatorsmile <gatorsmile@gmail.com> Author: xiaoli <lixiao1983@gmail.com> Author: Xiao Li <xiaoli@Xiaos-MacBook-Pro.local> Closes #10630 from gatorsmile/IntersectBySemiJoin.	2016-01-29 11:22:12 -08:00
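The rewrite described above (intersect as a semi-join followed by distinct) can be shown on plain Scala collections. This is a sketch under simplifying assumptions: whole rows are compared with `==`, so SQL null semantics are not modeled, and `leftSemiJoin` and `intersect` are illustrative names rather than Spark operators.

```scala
object IntersectSketch {
  // Left-semi join: keep each left row that has at least one match on the right.
  def leftSemiJoin[A](left: Seq[A], right: Seq[A]): Seq[A] = {
    val rightRows = right.toSet
    left.filter(rightRows.contains)
  }

  // INTERSECT expressed as a distinct over a left-semi join on all columns.
  def intersect[A](left: Seq[A], right: Seq[A]): Seq[A] =
    leftSemiJoin(left, right).distinct
}
```

The distinct step is what removes the duplicate left rows that a semi-join alone would keep.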
Liang-Chi Hsieh	4637fc08a3	[SPARK-11955][SQL] Mark optional fields in merging schema for safely pushdowning filters in Parquet JIRA: https://issues.apache.org/jira/browse/SPARK-11955 Currently we simply skip pushdowning filters in parquet if we enable schema merging. However, we can actually mark particular fields in merging schema for safely pushdowning filters in parquet. Author: Liang-Chi Hsieh <viirya@appier.com> Author: Liang-Chi Hsieh <viirya@gmail.com> Closes #9940 from viirya/safe-pushdown-parquet-filters.	2016-01-28 16:25:21 -08:00
Nong Li	555127387a	[SPARK-12854][SQL] Implement complex types support in ColumnarBatch This patch adds support for complex types for ColumnarBatch. ColumnarBatch supports structs and arrays. There is a simple mapping between the richer catalyst types to these two. Strings are treated as an array of bytes. ColumnarBatch will contain a column for each node of the schema. Non-complex schemas consists of just leaf nodes. Structs represent an internal node with one child for each field. Arrays are internal nodes with one child. Structs just contain nullability. Arrays contain offsets and lengths into the child array. This structure is able to handle arbitrary nesting. It has the key property that we maintain columnar throughout and that primitive types are only stored in the leaf nodes and contiguous across rows. For example, if the schema is ``` array<array<int>> ``` There are three columns in the schema. The internal nodes each have one children. The leaf node contains all the int data stored consecutively. As part of this, this patch adds append APIs in addition to the Put APIs (e.g. putLong(rowid, v) vs appendLong(v)). These APIs are necessary when the batch contains variable length elements. The vectors are not fixed length and will grow as necessary. This should make the usage a lot simpler for the writer. Author: Nong Li <nong@databricks.com> Closes #10820 from nongli/spark-12854.	2016-01-26 17:34:01 -08:00
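To make the offsets-and-lengths layout above concrete, here is a minimal sketch of a single `array<int>` column. `ArrayColumn` and `fromRows` are hypothetical names; the real ColumnarBatch vectors are mutable and also track nullability, which this sketch leaves out.

```scala
object NestedArraySketch {
  // One array<int> column: row i covers child(offsets(i)) up to
  // child(offsets(i) + lengths(i) - 1); all ints live contiguously in `child`.
  final case class ArrayColumn(offsets: Vector[Int], lengths: Vector[Int], child: Vector[Int]) {
    def row(i: Int): Vector[Int] = child.slice(offsets(i), offsets(i) + lengths(i))
  }

  // Build the columnar form from row-oriented data.
  def fromRows(rows: Seq[Seq[Int]]): ArrayColumn = {
    val lengths = rows.map(_.length).toVector
    // Running sum of lengths gives each row's start position in the child.
    val offsets = lengths.scanLeft(0)(_ + _).init
    ArrayColumn(offsets, lengths, rows.flatten.toVector)
  }
}
```

For `array<array<int>>`, the same structure would nest once more: the outer column's child would itself be an offsets/lengths column over the int leaf.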
Wenchen Fan	be375fcbd2	[SPARK-12879] [SQL] improve the unsafe row writing framework As we begin to use unsafe row writing framework(`BufferHolder` and `UnsafeRowWriter`) in more and more places(`UnsafeProjection`, `UnsafeRowParquetRecordReader`, `GenerateColumnAccessor`, etc.), we should add more doc to it and make it easier to use. This PR abstract the technique used in `UnsafeRowParquetRecordReader`: avoid unnecessary operatition as more as possible. For example, do not always point the row to the buffer at the end, we only need to update the size of row. If all fields are of primitive type, we can even save the row size updating. Then we can apply this technique to more places easily. a local benchmark shows `UnsafeProjection` is up to 1.7x faster after this PR: old version ``` Intel(R) Core(TM) i7-4960HQ CPU 2.60GHz unsafe projection: Avg Time(ms) Avg Rate(M/s) Relative Rate ------------------------------------------------------------------------------- single long 2616.04 102.61 1.00 X single nullable long 3032.54 88.52 0.86 X primitive types 9121.05 29.43 0.29 X nullable primitive types 12410.60 21.63 0.21 X ``` new version ``` Intel(R) Core(TM) i7-4960HQ CPU 2.60GHz unsafe projection: Avg Time(ms) Avg Rate(M/s) Relative Rate ------------------------------------------------------------------------------- single long 1533.34 175.07 1.00 X single nullable long 2306.73 116.37 0.66 X primitive types 8403.93 31.94 0.18 X nullable primitive types 12448.39 21.56 0.12 X ``` For single non-nullable long(the best case), we can have about 1.7x speed up. Even it's nullable, we can still have 1.3x speed up. For other cases, it's not such a boost as the saved operations only take a little proportion of the whole process. The benchmark code is included in this PR. Author: Wenchen Fan <wenchen@databricks.com> Closes #10809 from cloud-fan/unsafe-projection.	2016-01-25 16:23:59 -08:00
Reynold Xin	423783a08b	[SPARK-12904][SQL] Strength reduction for integral and decimal literal comparisons This pull request implements strength reduction for comparing integral expressions and decimal literals, which is more common now because we switch to parsing fractional literals as decimal types (rather than doubles). I added the rules to the existing DecimalPrecision rule with some refactoring to simplify the control flow. I also moved DecimalPrecision rule into its own file due to the growing size. Author: Reynold Xin <rxin@databricks.com> Closes #10882 from rxin/SPARK-12904-1.	2016-01-23 12:13:05 -08:00
Wenchen Fan	f3934a8d65	[SPARK-12888][SQL] benchmark the new hash expression Benchmark it on 4 different schemas, the result: ``` Intel(R) Core(TM) i7-4960HQ CPU 2.60GHz Hash For simple: Avg Time(ms) Avg Rate(M/s) Relative Rate ------------------------------------------------------------------------------- interpreted version 31.47 266.54 1.00 X codegen version 64.52 130.01 0.49 X ``` ``` Intel(R) Core(TM) i7-4960HQ CPU 2.60GHz Hash For normal: Avg Time(ms) Avg Rate(M/s) Relative Rate ------------------------------------------------------------------------------- interpreted version 4068.11 0.26 1.00 X codegen version 1175.92 0.89 3.46 X ``` ``` Intel(R) Core(TM) i7-4960HQ CPU 2.60GHz Hash For array: Avg Time(ms) Avg Rate(M/s) Relative Rate ------------------------------------------------------------------------------- interpreted version 9276.70 0.06 1.00 X codegen version 14762.23 0.04 0.63 X ``` ``` Intel(R) Core(TM) i7-4960HQ CPU 2.60GHz Hash For map: Avg Time(ms) Avg Rate(M/s) Relative Rate ------------------------------------------------------------------------------- interpreted version 58869.79 0.01 1.00 X codegen version 9285.36 0.06 6.34 X ``` Author: Wenchen Fan <wenchen@databricks.com> Closes #10816 from cloud-fan/hash-benchmark.	2016-01-20 15:08:27 -08:00
gatorsmile	8f90c15187	[SPARK-12616][SQL] Making Logical Operator `Union` Support Arbitrary Number of Children The existing `Union` logical operator only supports two children. Thus, adding a new logical operator `Unions` which can have arbitrary number of children to replace the existing one. `Union` logical plan is a binary node. However, a typical use case for union is to union a very large number of input sources (DataFrames, RDDs, or files). It is not uncommon to union hundreds of thousands of files. In this case, our optimizer can become very slow due to the large number of logical unions. We should change the Union logical plan to support an arbitrary number of children, and add a single rule in the optimizer to collapse all adjacent `Unions` into a single `Unions`. Note that this problem doesn't exist in physical plan, because the physical `Unions` already supports arbitrary number of children. Author: gatorsmile <gatorsmile@gmail.com> Author: xiaoli <lixiao1983@gmail.com> Author: Xiao Li <xiaoli@Xiaos-MacBook-Pro.local> Closes #10577 from gatorsmile/unionAllMultiChildren.	2016-01-20 14:59:30 -08:00
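The collapsing rule mentioned above can be sketched on a toy plan tree. `Plan`, `Relation`, and this `Union` are stand-ins invented for illustration, not the Catalyst classes, and the sketch assumes UNION ALL semantics so that flattening never changes the result.

```scala
object UnionSketch {
  sealed trait Plan
  final case class Relation(name: String) extends Plan
  final case class Union(children: Seq[Plan]) extends Plan

  // Collapse adjacent (directly nested) unions into a single n-ary union.
  def flatten(plan: Plan): Plan = plan match {
    case Union(children) =>
      Union(children.map(flatten).flatMap {
        case Union(grandChildren) => grandChildren // splice nested children in place
        case other                => Seq(other)
      })
    case other => other
  }
}
```

A left-deep chain of binary unions over n inputs becomes one node with n children, which is what keeps optimizer traversals cheap for very wide unions.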
Davies Liu	8e4f894e98	[SPARK-12881] [SQL] subexpress elimination in mutable projection Author: Davies Liu <davies@databricks.com> Closes #10814 from davies/mutable_subexpr.	2016-01-20 10:02:40 -08:00
Reynold Xin	753b194511	[SPARK-12912][SQL] Add a test suite for EliminateSubQueries Also updated documentation to explain why ComputeCurrentTime and EliminateSubQueries are in the optimizer rather than analyzer. Author: Reynold Xin <rxin@databricks.com> Closes #10837 from rxin/optimizer-analyzer-comment.	2016-01-20 00:00:28 -08:00
Reynold Xin	3e84ef0a54	[SPARK-12770][SQL] Implement rules for branch elimination for CaseWhen The three optimization cases are: 1. If the first branch's condition is a true literal, remove the CaseWhen and use the value from that branch. 2. If a branch's condition is a false or null literal, remove that branch. 3. If only the else branch is left, remove the CaseWhen and use the value from the else branch. Author: Reynold Xin <rxin@databricks.com> Closes #10827 from rxin/SPARK-12770.	2016-01-19 16:14:41 -08:00
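The three cases listed above can be sketched on a toy expression type. Everything here (`Expr`, `Lit`, `Attr`, this `CaseWhen`, `simplify`) is hypothetical and only mirrors the shape of the rules; returning a null literal when no else branch exists is an assumption of the sketch.

```scala
object CaseWhenSketch {
  sealed trait Expr
  final case class Lit(value: Any) extends Expr // Lit(null) models a null literal
  final case class Attr(name: String) extends Expr
  final case class CaseWhen(branches: Seq[(Expr, Expr)], elseValue: Option[Expr]) extends Expr

  def simplify(e: Expr): Expr = e match {
    case CaseWhen(branches, elseValue) =>
      // Case 2: a branch whose condition is a false or null literal can never fire.
      val kept = branches.filter {
        case (Lit(false), _) | (Lit(null), _) => false
        case _                                => true
      }
      kept match {
        // Case 1: the first remaining condition is a true literal.
        case (Lit(true), value) +: _ => value
        // Case 3: only the else branch is left (assumed null if absent).
        case Seq() => elseValue.getOrElse(Lit(null))
        case _     => CaseWhen(kept, elseValue)
      }
    case other => other
  }
}
```

Applying case 2 before case 1 lets a true literal that follows only dead branches be picked up in the same pass.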
Jakob Odersky	c78e2080e0	[SPARK-12816][SQL] De-alias type when generating schemas Call `dealias` on local types to fix schema generation for abstract type members, such as ```scala type KeyValue = (Int, String) ``` Add simple test Author: Jakob Odersky <jodersky@gmail.com> Closes #10749 from jodersky/aliased-schema.	2016-01-19 12:31:03 -08:00
Reynold Xin	44fcf992aa	[SPARK-12873][SQL] Add more comment in HiveTypeCoercion for type widening I was reading this part of the analyzer code again and got confused by the difference between findWiderTypeForTwo and findTightestCommonTypeOfTwo. I also simplified WidenSetOperationTypes to make it a lot simpler. The easiest way to review this one is to just read the original code, and the new code. The logic is super simple. Author: Reynold Xin <rxin@databricks.com> Closes #10802 from rxin/SPARK-12873.	2016-01-18 11:08:44 -08:00
Herman van Hovell	7cd7f22025	[SPARK-12575][SQL] Grammar parity with existing SQL parser In this PR the new CatalystQl parser stack reaches grammar parity with the old Parser-Combinator based SQL Parser. This PR also replaces all uses of the old Parser, and removes it from the code base. Although the existing Hive and SQL parser dialects were mostly the same, some kinks had to be worked out: - The SQL Parser allowed syntax like ```APPROXIMATE(0.01) COUNT(DISTINCT a)```. In order to make this work we needed to hardcode approximate operators in the parser, or we would have to create an approximate expression. ```APPROXIMATE_COUNT_DISTINCT(a, 0.01)``` would also do the job and is much easier to maintain. So, this PR removes this keyword. - The old SQL Parser supports ```LIMIT``` clauses in nested queries. This is not supported anymore. See https://github.com/apache/spark/pull/10689 for the rationale for this. - Hive has a charset name char set literal combination it supports, for instance the following expression ```_ISO-8859-1 0x4341464562616265``` would yield this string: ```CAFEbabe```. Hive will only allow charset names to start with an underscore. This is quite annoying in spark because as soon as you use a tuple names will start with an underscore. In this PR we remove this feature from the parser. It would be quite easy to implement such a feature as an Expression later on. - Hive and the SQL Parser treat decimal literals differently. Hive will turn any decimal into a ```Double``` whereas the SQL Parser would convert a non-scientific decimal into a ```BigDecimal```, and would turn a scientific decimal into a Double. We follow Hive's behavior here. The new parser supports a big decimal literal, for instance: ```81923801.42BD```, which can be used when a big decimal is needed. cc rxin viirya marmbrus yhuai cloud-fan Author: Herman van Hovell <hvanhovell@questtec.nl> Closes #10745 from hvanhovell/SPARK-12575-2.	2016-01-15 15:19:10 -08:00
Davies Liu	c5e7076da7	[MINOR] [SQL] GeneratedExpressionCode -> ExprCode GeneratedExpressionCode is too long Author: Davies Liu <davies@databricks.com> Closes #10767 from davies/renaming.	2016-01-15 08:26:20 -08:00
Michael Armbrust	cc7af86afd	[SPARK-12813][SQL] Eliminate serialization for back to back operations The goal of this PR is to eliminate unnecessary translations when there are back-to-back `MapPartitions` operations. In order to achieve this I also made the following simplifications: - Operators no longer hold encoders; instead they have only the expressions that they need. The benefits here are twofold: the expressions are visible to transformations, so they go through the normal resolution/binding process, and now that they are visible we can change them on a case-by-case basis. - Operators no longer have type parameters. Since the engine is responsible for its own type checking, having the types visible to the compiler was an unnecessary complication. We still leverage the Scala compiler in the companion factory when constructing a new operator, but after this the types are discarded. Deferred to a follow-up PR: - Remove as much of the resolution/binding from Dataset/GroupedDataset as possible. We should still eagerly check resolution and throw an error, though, in the case of mismatches for an `as` operation. - Eliminate serializations in more cases by adding more cases to `EliminateSerialization` Author: Michael Armbrust <michael@databricks.com> Closes #10747 from marmbrus/encoderExpressions.	2016-01-14 17:44:56 -08:00
Reynold Xin	cbbcd8e425	[SPARK-12791][SQL] Simplify CaseWhen by breaking "branches" into "conditions" and "values" This pull request rewrites CaseWhen expression to break the single, monolithic "branches" field into a sequence of tuples (Seq[(condition, value)]) and an explicit optional elseValue field. Prior to this pull request, each even position in "branches" represents the condition for each branch, and each odd position represents the value for each branch. The use of them have been pretty confusing with a lot sliding windows or grouped(2) calls. Author: Reynold Xin <rxin@databricks.com> Closes #10734 from rxin/simplify-case.	2016-01-13 12:44:35 -08:00
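The layout change above amounts to the following conversion, shown here as a sketch. `split` is a hypothetical helper, and the assumption that an odd-length flat sequence ends with the else value is inferred from the described even/odd layout rather than stated in the message.

```scala
object BranchLayoutSketch {
  // Old flat layout: cond1, value1, cond2, value2, ..., [elseValue].
  // New layout: a sequence of (condition, value) pairs plus an optional else value.
  def split[E](flat: Seq[E]): (List[(E, E)], Option[E]) = {
    // grouped(2) yields pairs; a trailing single element is not a pair and is skipped here.
    val pairs = flat.grouped(2).collect { case Seq(c, v) => (c, v) }.toList
    val elseValue = if (flat.length % 2 == 1) Some(flat.last) else None
    (pairs, elseValue)
  }
}
```

Once the data is stored in the paired form, callers no longer need their own `grouped(2)` or sliding-window logic.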
Wenchen Fan	c2ea79f96a	[SPARK-12642][SQL] improve the hash expression to be decoupled from unsafe row https://issues.apache.org/jira/browse/SPARK-12642 Author: Wenchen Fan <wenchen@databricks.com> Closes #10694 from cloud-fan/hash-expr.	2016-01-13 12:29:02 -08:00
Kousuke Saruta	cb7b864a24	[SPARK-12692][BUILD][SQL] Scala style: Fix the style violation (Space before ",") Fix the style violation (space before , and :). This PR is a followup for #10643 and rework of #10685 . Author: Kousuke Saruta <sarutak@oss.nttdata.co.jp> Closes #10732 from sarutak/SPARK-12692-followup-sql.	2016-01-12 22:25:20 -08:00
Reynold Xin	b3b9ad23cf	[SPARK-12788][SQL] Simplify BooleanEquality by using casts. Author: Reynold Xin <rxin@databricks.com> Closes #10730 from rxin/SPARK-12788.	2016-01-12 18:45:55 -08:00
Reynold Xin	0d543b98f3	Revert "[SPARK-12692][BUILD][SQL] Scala style: Fix the style violation (Space before "," or ":")" This reverts commit `8cfa218f4f`.	2016-01-12 12:56:52 -08:00
Reynold Xin	0ed430e315	[SPARK-12768][SQL] Remove CaseKeyWhen expression This patch removes CaseKeyWhen expression and replaces it with a factory method that generates the equivalent CaseWhen. This reduces the amount of code we'd need to maintain in the future for both code generation and optimizer. Note that we introduced CaseKeyWhen to avoid duplicate evaluations of the key. This is no longer a problem because we now have common subexpression elimination. Author: Reynold Xin <rxin@databricks.com> Closes #10722 from rxin/SPARK-12768.	2016-01-12 11:13:08 -08:00
Reynold Xin	1d88879530	[SPARK-12762][SQL] Add unit test for SimplifyConditionals optimization rule This pull request does a few small things: 1. Separated if simplification from BooleanSimplification and created a new rule SimplifyConditionals. In the future we can also simplify other conditional expressions here. 2. Added unit test for SimplifyConditionals. 3. Renamed SimplifyCaseConversionExpressionsSuite to SimplifyStringCaseConversionSuite Author: Reynold Xin <rxin@databricks.com> Closes #10716 from rxin/SPARK-12762.	2016-01-12 10:58:57 -08:00
Kousuke Saruta	8cfa218f4f	[SPARK-12692][BUILD][SQL] Scala style: Fix the style violation (Space before "," or ":") Fix the style violation (space before , and :). This PR is a followup for #10643. Author: Kousuke Saruta <sarutak@oss.nttdata.co.jp> Closes #10718 from sarutak/SPARK-12692-followup-sql.	2016-01-12 00:51:00 -08:00
Herman van Hovell	fe9eb0b0ce	[SPARK-12576][SQL] Enable expression parsing in CatalystQl The PR allows us to use the new SQL parser to parse SQL expressions such as: ```1 + sin(x*x)``` We enable this functionality in this PR, but we will not start using this actively yet. This will be done as soon as we have reached grammar parity with the existing parser stack. cc rxin Author: Herman van Hovell <hvanhovell@questtec.nl> Closes #10649 from hvanhovell/SPARK-12576.	2016-01-11 16:29:37 -08:00
Liang-Chi Hsieh	95cd5d95ce	[SPARK-12577] [SQL] Better support of parentheses in partition by and order by clause of window function's over clause JIRA: https://issues.apache.org/jira/browse/SPARK-12577 Author: Liang-Chi Hsieh <viirya@gmail.com> Closes #10620 from viirya/fix-parentheses.	2016-01-08 21:48:06 -08:00
Cheng Lian	d9447cac74	[SPARK-12593][SQL] Converts resolved logical plan back to SQL This PR tries to enable Spark SQL to convert resolved logical plans back to SQL query strings. For now, the major use case is to canonicalize Spark SQL native view support. The major entry point is `SQLBuilder.toSQL`, which returns an `Option[String]` if the logical plan is recognized. The current version is still in WIP status, and is quite limited. Known limitations include: 1. The logical plan must be analyzed but not optimized The optimizer erases `Subquery` operators, which contain necessary scope information for SQL generation. Future versions should be able to recover erased scope information by inserting subqueries when necessary. 1. The logical plan must be created using HiveQL query string Query plans generated by composing arbitrary DataFrame API combinations are not supported yet. Operators within these query plans need to be rearranged into a canonical form that is more suitable for direct SQL generation. For example, the following query plan ``` Filter (a#1 < 10) +- MetastoreRelation default, src, None ``` need to be canonicalized into the following form before SQL generation: ``` Project [a#1, b#2, c#3] +- Filter (a#1 < 10) +- MetastoreRelation default, src, None ``` Otherwise, the SQL generation process will have to handle a large number of special cases. 1. Only a fraction of expressions and basic logical plan operators are supported in this PR Currently, 95.7% (1720 out of 1798) query plans in `HiveCompatibilitySuite` can be successfully converted to SQL query strings. Known unsupported components are: - Expressions - Part of math expressions - Part of string expressions (buggy?) - Null expressions - Calendar interval literal - Part of date time expressions - Complex type creators - Special `NOT` expressions, e.g. 
`NOT LIKE` and `NOT IN` - Logical plan operators/patterns - Cube, rollup, and grouping set - Script transformation - Generator - Distinct aggregation patterns that fit `DistinctAggregationRewriter` analysis rule - Window functions Support for window functions, generators, and cubes etc. will be added in follow-up PRs. This PR leverages `HiveCompatibilitySuite` for testing SQL generation in a "round-trip" manner: * For all select queries, we try to convert it back to SQL * If the query plan is convertible, we parse the generated SQL into a new logical plan * Run the new logical plan instead of the original one If the query plan is inconvertible, the test case simply falls back to the original logic. TODO - [x] Fix failed test cases - [x] Support for more basic expressions and logical plan operators (e.g. distinct aggregation etc.) - [x] Comments and documentation Author: Cheng Lian <lian@databricks.com> Closes #10541 from liancheng/sql-generation.	2016-01-08 14:08:13 -08:00
Liang-Chi Hsieh	cfe1ba56e4	[SPARK-12687] [SQL] Support from clause surrounded by `()`. JIRA: https://issues.apache.org/jira/browse/SPARK-12687 Some queries such as `(select 1 as a) union (select 2 as a)` can't work. This patch fixes it. Author: Liang-Chi Hsieh <viirya@gmail.com> Closes #10660 from viirya/fix-union.	2016-01-08 09:50:41 -08:00
Sean Owen	b9c8353378	[SPARK-12618][CORE][STREAMING][SQL] Clean up build warnings: 2.0.0 edition Fix most build warnings: mostly deprecated API usages. I'll annotate some of the changes below. CC rxin who is leading the charge to remove the deprecated APIs. Author: Sean Owen <sowen@cloudera.com> Closes #10570 from srowen/SPARK-12618.	2016-01-08 17:47:44 +00:00
Davies Liu	fd1dcfaf26	[SPARK-12542][SQL] support except/intersect in HiveQl Parse the SQL query with except/intersect in FROM clause for HiveQL. Author: Davies Liu <davies@databricks.com> Closes #10622 from davies/intersect.	2016-01-06 23:46:12 -08:00
Marcelo Vanzin	b3ba1be3b7	[SPARK-3873][TESTS] Import ordering fixes. Author: Marcelo Vanzin <vanzin@cloudera.com> Closes #10582 from vanzin/SPARK-3873-tests.	2016-01-05 19:07:39 -08:00
Liang-Chi Hsieh	d202ad2fc2	[SPARK-12439][SQL] Fix toCatalystArray and MapObjects JIRA: https://issues.apache.org/jira/browse/SPARK-12439 In toCatalystArray, we should look at the data type returned by dataTypeFor instead of silentSchemaFor to determine whether the element is a native type. An obvious problem arises when the element is of class Option[Int]: silentSchemaFor will return Int, and we will then wrongly recognize the element as a native type. There is another problem when using Option as an array element. When we encode data like Seq(Some(1), Some(2), None) with an encoder, we later use MapObjects to construct an array for it. But MapObjects doesn't check whether the return value of lambdaFunction is null. As a result, the decoded data for Seq(Some(1), Some(2), None) would be Seq(1, 2, -1) instead of Seq(1, 2, null). Author: Liang-Chi Hsieh <viirya@gmail.com> Closes #10391 from viirya/fix-catalystarray.	2016-01-05 12:33:21 -08:00
Wenchen Fan	76768337be	[SPARK-12480][FOLLOW-UP] use a single column vararg for hash address comments in #10435 This makes the API easier to use if user programmatically generate the call to hash, and they will get analysis exception if the arguments of hash is empty. Author: Wenchen Fan <wenchen@databricks.com> Closes #10588 from cloud-fan/hash.	2016-01-05 10:23:36 -08:00
Liang-Chi Hsieh	b3c48e39f4	[SPARK-12438][SQL] Add SQLUserDefinedType support for encoder JIRA: https://issues.apache.org/jira/browse/SPARK-12438 ScalaReflection lacks the support of SQLUserDefinedType. We should add it. Author: Liang-Chi Hsieh <viirya@gmail.com> Closes #10390 from viirya/encoder-udt.	2016-01-05 10:19:56 -08:00
Michael Armbrust	53beddc5bf	[SPARK-12568][SQL] Add BINARY to Encoders Author: Michael Armbrust <michael@databricks.com> Closes #10516 from marmbrus/datasetCleanup.	2016-01-04 23:23:41 -08:00
Wenchen Fan	b1a771231e	[SPARK-12480][SQL] add Hash expression that can calculate hash value for a group of expressions just write the arguments into unsafe row and use murmur3 to calculate hash code Author: Wenchen Fan <wenchen@databricks.com> Closes #10435 from cloud-fan/hash-expr.	2016-01-04 18:49:41 -08:00
Herman van Hovell	0171b71e95	[SPARK-12421][SQL] Prevent Internal/External row from exposing state. It is currently possible to change the values of the supposedly immutable ```GenericRow``` and ```GenericInternalRow``` classes. This is caused by the fact that scala's ArrayOps ```toArray``` (returned by calling ```toSeq```) will return the backing array instead of a copy. This PR fixes this problem. This PR was inspired by https://github.com/apache/spark/pull/10374 by apo1. cc apo1 sarutak marmbrus cloud-fan nongli (everyone in the previous conversation). Author: Herman van Hovell <hvanhovell@questtec.nl> Closes #10553 from hvanhovell/SPARK-12421.	2016-01-04 12:41:57 -08:00
Liang-Chi Hsieh	c9dbfcc653	[SPARK-11743][SQL] Move the test for arrayOfUDT A follow-up PR for #9712. Move the test for arrayOfUDT. Author: Liang-Chi Hsieh <viirya@gmail.com> Closes #10538 from viirya/move-udt-test.	2015-12-31 23:48:05 -08:00
Davies Liu	e6c77874b9	[SPARK-12585] [SQL] move numFields to constructor of UnsafeRow Right now, numFields will be passed in by pointTo(), then bitSetWidthInBytes is calculated, making pointTo() a little bit heavy. It should be part of constructor of UnsafeRow. Author: Davies Liu <davies@databricks.com> Closes #10528 from davies/numFields.	2015-12-30 22:16:37 -08:00
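For context on why `bitSetWidthInBytes` depends only on `numFields`, here is a sketch of the calculation. It assumes one null-tracking bit per field, rounded up to whole 8-byte words; `bitSetWidthInBytes` below is an illustrative standalone function, not a quote of the Spark source.

```scala
object BitSetWidthSketch {
  // One null bit per field, rounded up to a multiple of 64 bits (8 bytes).
  def bitSetWidthInBytes(numFields: Int): Int = ((numFields + 63) / 64) * 8
}
```

Because the value is a pure function of the field count, computing it once at construction avoids redoing the arithmetic on every `pointTo()` call.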
Wenchen Fan	aa48164a43	[SPARK-12495][SQL] use true as default value for propagateNull in NewInstance In most cases we should propagate null when calling `NewInstance`; so far there is only one case where we should stop null propagation: creating a product/Java bean. So I think it makes more sense to propagate null by default. This also fixes a bug when encoding a null array/map, which was first discovered in https://github.com/apache/spark/pull/10401 Author: Wenchen Fan <wenchen@databricks.com> Closes #10443 from cloud-fan/encoder.	2015-12-30 10:56:08 -08:00
Stephan Kessler	a6a4812434	[SPARK-7727][SQL] Avoid inner classes in RuleExecutor Moved (case) classes Strategy, Once, FixedPoint and Batch to the companion object. This is necessary if we want to have the Optimizer easily extendable in the following sense: Usually a user wants to add additional rules, and just take the ones that are already there. However, inner classes made that impossible since the code did not compile This allows easy extension of existing Optimizers see the DefaultOptimizerExtendableSuite for a corresponding test case. Author: Stephan Kessler <stephan.kessler@sap.com> Closes #10174 from stephankessler/SPARK-7727.	2015-12-28 12:46:20 -08:00
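A stripped-down picture of the structure above may help: strategies and batches defined in a companion-style object, so that callers can compose their own batch lists. This is a sketch only; the `Rule` type alias, the generic `execute` function, and the stop-when-unchanged loop are assumptions for illustration and do not reproduce Spark's `RuleExecutor`.

```scala
object RuleExecutorSketch {
  type Rule[T] = T => T

  sealed trait Strategy { def maxIterations: Int }
  case object Once extends Strategy { val maxIterations = 1 }
  final case class FixedPoint(maxIterations: Int) extends Strategy

  // A named group of rules run together under one strategy.
  final case class Batch[T](name: String, strategy: Strategy, rules: Rule[T]*)

  // Run each batch in order; within a batch, repeat the rules until the plan
  // stops changing or the strategy's iteration limit is reached.
  def execute[T](plan: T, batches: Seq[Batch[T]]): T =
    batches.foldLeft(plan) { (current, batch) =>
      var result = current
      var iteration = 0
      var changed = true
      while (changed && iteration < batch.strategy.maxIterations) {
        val next = batch.rules.foldLeft(result)((p, rule) => rule(p))
        changed = next != result
        result = next
        iteration += 1
      }
      result
    }
}
```

Because `Batch`, `Once`, and `FixedPoint` are not inner classes of a particular executor instance, a user can take an existing batch list and append extra batches to it.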
Liang-Chi Hsieh	50301c0a28	[SPARK-11164][SQL] Add InSet pushdown filter back for Parquet When the filter is ```"b in ('1', '2')"```, the filter is not pushed down to Parquet. Thanks! Author: gatorsmile <gatorsmile@gmail.com> Author: xiaoli <lixiao1983@gmail.com> Author: Xiao Li <xiaoli@Xiaos-MacBook-Pro.local> Closes #10278 from gatorsmile/parquetFilterNot.	2015-12-23 14:08:29 +08:00
Dilip Biswal	b374a25831	[SPARK-12102][SQL] Cast a non-nullable struct field to a nullable field during analysis Compare both the left and right sides of the case expression, ignoring nullability, when checking for type equality. Author: Dilip Biswal <dbiswal@us.ibm.com> Closes #10156 from dilipbiswal/spark-12102.	2015-12-22 15:21:49 -08:00
Cheng Lian	42bfde2983	[SPARK-12371][SQL] Runtime nullability check for NewInstance This PR adds a new expression `AssertNotNull` to ensure non-nullable fields of products and case classes don't receive null values at runtime. Author: Cheng Lian <lian@databricks.com> Closes #10331 from liancheng/dataset-nullability-check.	2015-12-22 19:41:44 +08:00
Davies Liu	4af647c77d	[SPARK-12054] [SQL] Consider nullability of expression in codegen This could simplify the generated code for expressions that are not nullable. This PR fixes lots of bugs about nullability. Author: Davies Liu <davies@databricks.com> Closes #10333 from davies/skip_nullable.	2015-12-18 10:09:17 -08:00
Herman van Hovell	658f66e620	[SPARK-8641][SQL] Native Spark Window functions This PR removes Hive windows functions from Spark and replaces them with (native) Spark ones. The PR is on par with Hive in terms of features. This has the following advantages: * Better memory management. * The ability to use spark UDAFs in Window functions. cc rxin / yhuai Author: Herman van Hovell <hvanhovell@questtec.nl> Closes #9819 from hvanhovell/SPARK-8641-2.	2015-12-17 15:16:35 -08:00
Wenchen Fan	a783a8ed49	[SPARK-12320][SQL] throw exception if the number of fields does not line up for Tuple encoder Author: Wenchen Fan <wenchen@databricks.com> Closes #10293 from cloud-fan/err-msg.	2015-12-16 13:20:12 -08:00
Davies Liu	54c512ba90	[SPARK-8745] [SQL] remove GenerateProjection cc rxin Author: Davies Liu <davies@databricks.com> Closes #10316 from davies/remove_generate_projection.	2015-12-16 10:22:48 -08:00
Wenchen Fan	a89e8b6122	[SPARK-10477][SQL] using DSL in ColumnPruningSuite to improve readability Author: Wenchen Fan <cloud0fan@outlook.com> Closes #8645 from cloud-fan/test.	2015-12-15 18:29:19 -08:00
Wenchen Fan	d8ec081c91	[SPARK-12252][SPARK-12131][SQL] refactor MapObjects to make it less hacky in https://github.com/apache/spark/pull/10133 we found that we should ensure the children of `TreeNode` are all accessible in the `productIterator`, or the behavior will be very confusing. In this PR, I try to fix this problem by exposing the `loopVar`. This also fixes SPARK-12131, which is caused by the hacky `MapObjects`. Author: Wenchen Fan <wenchen@databricks.com> Closes #10239 from cloud-fan/map-objects.	2015-12-10 15:11:13 +08:00
Wenchen Fan	381f17b540	[SPARK-12201][SQL] add type coercion rule for greatest/least checked with hive, greatest/least should cast their children to a tightest common type, i.e. `(int, long) => long`, `(int, string) => error`, `(decimal(10,5), decimal(5, 10)) => error` Author: Wenchen Fan <wenchen@databricks.com> Closes #10196 from cloud-fan/type-coercion.	2015-12-08 10:13:40 -08:00
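The coercion behavior above, for the int/long/string examples, can be sketched as a fold over a pairwise "tightest common type" function. The `DataType` objects here are toy stand-ins, the decimal case from the message is deliberately left out, and `None` plays the role of the analysis error.

```scala
object TightestTypeSketch {
  sealed trait DataType
  case object IntType extends DataType
  case object LongType extends DataType
  case object StringType extends DataType

  // Pairwise rule: identical types stay, int widens to long, anything else fails.
  def tightestCommonType(a: DataType, b: DataType): Option[DataType] = (a, b) match {
    case (x, y) if x == y                          => Some(x)
    case (IntType, LongType) | (LongType, IntType) => Some(LongType)
    case _                                         => None
  }

  // greatest/least would cast all children to this type; None means no such type exists.
  def coerceAll(types: Seq[DataType]): Option[DataType] =
    types.headOption.flatMap { head =>
      types.tail.foldLeft(Option(head))((acc, t) => acc.flatMap(tightestCommonType(_, t)))
    }
}
```

Folding pairwise means a single incompatible child is enough to make the whole call fail, which matches the `(int, string) => error` example.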
Davies Liu	9cde7d5fa8	[SPARK-12032] [SQL] Re-order inner joins to do join with conditions first Currently, the order of joins is exactly the same as in the SQL query. Some conditions may not be pushed down to the correct join, and those joins then become cross products and are extremely slow. This patch tries to re-order the inner joins (which are common in SQL queries): it picks the joins that have self-contained conditions first and delays those that do not have conditions. After this patch, the TPCDS queries Q64/65 can run hundreds of times faster. cc marmbrus nongli Author: Davies Liu <davies@databricks.com> Closes #10073 from davies/reorder_joins.	2015-12-07 10:34:18 -08:00
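One way to picture the reordering above is a greedy pass over table names. This is a sketch under strong simplifications, not the rule from the PR: tables are plain strings assumed to be unique, a `Condition` is reduced to the set of tables it references, and `reorder` is a hypothetical helper.

```scala
object JoinReorderSketch {
  // A join condition, reduced to the set of tables it mentions.
  final case class Condition(tables: Set[String])

  // Greedy ordering: always prefer a remaining table that can be joined through
  // a condition whose tables are all available; otherwise keep the original order.
  def reorder(tables: Seq[String], conditions: Seq[Condition]): Seq[String] = {
    def loop(joined: Vector[String], remaining: List[String]): Vector[String] =
      remaining match {
        case Nil => joined
        case _ =>
          val next = remaining.find { t =>
            val available = joined.toSet + t
            conditions.exists { c =>
              c.tables.size > 1 && c.tables.contains(t) && c.tables.subsetOf(available)
            }
          }.getOrElse(remaining.head) // no usable condition: accept a cross product
          loop(joined :+ next, remaining.filterNot(_ == next))
      }
    tables.toList match {
      case Nil          => Vector.empty
      case head :: tail => loop(Vector(head), tail)
    }
  }
}
```

In the query order `a, b, c` with conditions on `(a, c)` and `(c, b)`, joining `b` second would be a cross product; the greedy pass joins `c` second instead.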
gatorsmile	49efd03bac	[SPARK-12138][SQL] Escape \u in the generated comments of codegen When \u appears in a comment block (i.e. in /**/), code gen will break. So, in Expression and CodegenFallback, we escape \u to \\u. yhuai Please review it. I did reproduce it and it works after the fix. Thanks! Author: gatorsmile <gatorsmile@gmail.com> Closes #10155 from gatorsmile/escapeU.	2015-12-06 11:15:02 -08:00
Josh Rosen	b7204e1d41	[SPARK-12112][BUILD] Upgrade to SBT 0.13.9 We should upgrade to SBT 0.13.9, since this is a requirement in order to use SBT's new Maven-style resolution features (which will be done in a separate patch, because it's blocked by some binary compatibility issues in the POM reader plugin). I also upgraded Scalastyle to version 0.8.0, which was necessary in order to fix a Scala 2.10.5 compatibility issue (see https://github.com/scalastyle/scalastyle/issues/156). The newer Scalastyle is slightly stricter about whitespace surrounding tokens, so I fixed the new style violations. Author: Josh Rosen <joshrosen@databricks.com> Closes #10112 from JoshRosen/upgrade-to-sbt-0.13.9.	2015-12-05 08:15:30 +08:00
Yin Huai	5872a9d89f	[SPARK-11352][SQL] Escape */ in the generated comments. https://issues.apache.org/jira/browse/SPARK-11352 Author: Yin Huai <yhuai@databricks.com> Closes #10072 from yhuai/SPARK-11352.	2015-12-01 16:24:04 -08:00
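The two comment-escaping fixes listed above (SPARK-12138 and SPARK-11352) both guard text that is placed inside a `/* */` block of generated Java code. The sketch below shows the general idea with plain string replacement; the function names and the exact replacement strings are assumptions, not the code from either PR.

```scala
object CommentEscapeSketch {
  // A backslash followed by 'u' starts a unicode escape even inside a Java
  // comment, so the backslash is doubled to keep the text inert.
  def escapeUnicodeMarker(comment: String): String =
    comment.replace("\\u", "\\\\u")

  // A star followed by a slash would close the surrounding block comment early,
  // so a backslash is inserted between the two characters.
  def escapeTerminator(comment: String): String =
    comment.replace("*/", "*\\/")

  def toCommentSafe(comment: String): String =
    escapeTerminator(escapeUnicodeMarker(comment))
}
```

Either sequence can appear in user-supplied literals, which is why the escaping is applied to expression text before it is embedded in a generated comment.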


625 commits