Commit graph

580 commits

Author SHA1 Message Date
gatorsmile 8f26eb5ef6 [SPARK-12705][SPARK-10777][SQL] Analyzer Rule ResolveSortReferences
JIRA: https://issues.apache.org/jira/browse/SPARK-12705

**Scope:**
This PR is a general fix for sorting reference resolution when the child's `outputSet` does not have the order-by attributes (called, *missing attributes*):
  - UnaryNode support is limited to `Project`, `Window`, `Aggregate`, `Distinct`, `Filter`, `RepartitionByExpression`.
  - We will not try to resolve the missing references inside a subquery, unless the outputSet of this subquery contains it.

**General Reference Resolution Rules:**
  - Jump over the nodes with the following types: `Distinct`, `Filter`, `RepartitionByExpression`. Do not need to add missing attributes. The reason is their `outputSet` is decided by their `inputSet`, which is the `outputSet` of their children.
  - Group-by expressions in `Aggregate`: missing order-by attributes are not allowed to be added into group-by expressions since it will change the query result. Thus, in RDBMS, it is not allowed.
  - Aggregate expressions in `Aggregate`: if the group-by expressions in `Aggregate` contains the missing attributes but aggregate expressions do not have it, just add them into the aggregate expressions. This can resolve the analysisExceptions thrown by the three TCPDS queries.
  - `Project` and `Window` are special. We just need to add the missing attributes to their `projectList`.

**Implementation:**
  1. Traverse the whole tree in a pre-order manner to find all the resolvable missing order-by attributes.
  2. Traverse the whole tree in a post-order manner to add the found missing order-by attributes to the node if their `inputSet` contains the attributes.
  3. If the origins of the missing order-by attributes are different nodes, each pass only resolves the missing attributes that are from the same node.

**Risk:**
Low. This rule will be trigger iff ```!s.resolved && child.resolved``` is true. Thus, very few cases are affected.

Author: gatorsmile <gatorsmile@gmail.com>

Closes #10678 from gatorsmile/sortWindows.
2016-02-01 11:57:13 -08:00
gatorsmile 5f686cc8b7 [SPARK-12656] [SQL] Implement Intersect with Left-semi Join
Our current Intersect physical operator simply delegates to RDD.intersect. We should remove the Intersect physical operator and simply transform a logical intersect into a semi-join with distinct. This way, we can take advantage of all the benefits of join implementations (e.g. managed memory, code generation, broadcast joins).

After a search, I found one of the mainstream RDBMS did the same. In their query explain, Intersect is replaced by Left-semi Join. Left-semi Join could help outer-join elimination in Optimizer, as shown in the PR: https://github.com/apache/spark/pull/10566

Author: gatorsmile <gatorsmile@gmail.com>
Author: xiaoli <lixiao1983@gmail.com>
Author: Xiao Li <xiaoli@Xiaos-MacBook-Pro.local>

Closes #10630 from gatorsmile/IntersectBySemiJoin.
2016-01-29 11:22:12 -08:00
Liang-Chi Hsieh 4637fc08a3 [SPARK-11955][SQL] Mark optional fields in merging schema for safely pushdowning filters in Parquet
JIRA: https://issues.apache.org/jira/browse/SPARK-11955

Currently we simply skip pushdowning filters in parquet if we enable schema merging.

However, we can actually mark particular fields in merging schema for safely pushdowning filters in parquet.

Author: Liang-Chi Hsieh <viirya@appier.com>
Author: Liang-Chi Hsieh <viirya@gmail.com>

Closes #9940 from viirya/safe-pushdown-parquet-filters.
2016-01-28 16:25:21 -08:00
Nong Li 555127387a [SPARK-12854][SQL] Implement complex types support in ColumnarBatch
This patch adds support for complex types for ColumnarBatch. ColumnarBatch supports structs
and arrays. There is a simple mapping between the richer catalyst types to these two. Strings
are treated as an array of bytes.

ColumnarBatch will contain a column for each node of the schema. Non-complex schemas consists
of just leaf nodes. Structs represent an internal node with one child for each field. Arrays
are internal nodes with one child. Structs just contain nullability. Arrays contain offsets
and lengths into the child array. This structure is able to handle arbitrary nesting. It has
the key property that we maintain columnar throughout and that primitive types are only stored
in the leaf nodes and contiguous across rows. For example, if the schema is
```
array<array<int>>
```
There are three columns in the schema. The internal nodes each have one children. The leaf node contains all the int data stored consecutively.

As part of this, this patch adds append APIs in addition to the Put APIs (e.g. putLong(rowid, v)
vs appendLong(v)). These APIs are necessary when the batch contains variable length elements.
The vectors are not fixed length and will grow as necessary. This should make the usage a lot
simpler for the writer.

Author: Nong Li <nong@databricks.com>

Closes #10820 from nongli/spark-12854.
2016-01-26 17:34:01 -08:00
Wenchen Fan be375fcbd2 [SPARK-12879] [SQL] improve the unsafe row writing framework
As we begin to use unsafe row writing framework(`BufferHolder` and `UnsafeRowWriter`) in more and more places(`UnsafeProjection`, `UnsafeRowParquetRecordReader`, `GenerateColumnAccessor`, etc.), we should add more doc to it and make it easier to use.

This PR abstract the technique used in `UnsafeRowParquetRecordReader`: avoid unnecessary operatition as more as possible. For example, do not always point the row to the buffer at the end, we only need to update the size of row. If all fields are of primitive type, we can even save the row size updating. Then we can apply this technique to more places easily.

a local benchmark shows `UnsafeProjection` is up to 1.7x faster after this PR:
**old version**
```
Intel(R) Core(TM) i7-4960HQ CPU  2.60GHz
unsafe projection:                 Avg Time(ms)    Avg Rate(M/s)  Relative Rate
-------------------------------------------------------------------------------
single long                             2616.04           102.61         1.00 X
single nullable long                    3032.54            88.52         0.86 X
primitive types                         9121.05            29.43         0.29 X
nullable primitive types               12410.60            21.63         0.21 X
```

**new version**
```
Intel(R) Core(TM) i7-4960HQ CPU  2.60GHz
unsafe projection:                 Avg Time(ms)    Avg Rate(M/s)  Relative Rate
-------------------------------------------------------------------------------
single long                             1533.34           175.07         1.00 X
single nullable long                    2306.73           116.37         0.66 X
primitive types                         8403.93            31.94         0.18 X
nullable primitive types               12448.39            21.56         0.12 X
```

For single non-nullable long(the best case), we can have about 1.7x speed up. Even it's nullable, we can still have 1.3x speed up. For other cases, it's not such a boost as the saved operations only take a little proportion of the whole process.  The benchmark code is included in this PR.

Author: Wenchen Fan <wenchen@databricks.com>

Closes #10809 from cloud-fan/unsafe-projection.
2016-01-25 16:23:59 -08:00
Reynold Xin 423783a08b [SPARK-12904][SQL] Strength reduction for integral and decimal literal comparisons
This pull request implements strength reduction for comparing integral expressions and decimal literals, which is more common now because we switch to parsing fractional literals as decimal types (rather than doubles). I added the rules to the existing DecimalPrecision rule with some refactoring to simplify the control flow. I also moved DecimalPrecision rule into its own file due to the growing size.

Author: Reynold Xin <rxin@databricks.com>

Closes #10882 from rxin/SPARK-12904-1.
2016-01-23 12:13:05 -08:00
Wenchen Fan f3934a8d65 [SPARK-12888][SQL] benchmark the new hash expression
Benchmark it on 4 different schemas, the result:
```
Intel(R) Core(TM) i7-4960HQ CPU  2.60GHz
Hash For simple:                   Avg Time(ms)    Avg Rate(M/s)  Relative Rate
-------------------------------------------------------------------------------
interpreted version                       31.47           266.54         1.00 X
codegen version                           64.52           130.01         0.49 X
```

```
Intel(R) Core(TM) i7-4960HQ CPU  2.60GHz
Hash For normal:                   Avg Time(ms)    Avg Rate(M/s)  Relative Rate
-------------------------------------------------------------------------------
interpreted version                     4068.11             0.26         1.00 X
codegen version                         1175.92             0.89         3.46 X
```

```
Intel(R) Core(TM) i7-4960HQ CPU  2.60GHz
Hash For array:                    Avg Time(ms)    Avg Rate(M/s)  Relative Rate
-------------------------------------------------------------------------------
interpreted version                     9276.70             0.06         1.00 X
codegen version                        14762.23             0.04         0.63 X
```

```
Intel(R) Core(TM) i7-4960HQ CPU  2.60GHz
Hash For map:                      Avg Time(ms)    Avg Rate(M/s)  Relative Rate
-------------------------------------------------------------------------------
interpreted version                    58869.79             0.01         1.00 X
codegen version                         9285.36             0.06         6.34 X
```

Author: Wenchen Fan <wenchen@databricks.com>

Closes #10816 from cloud-fan/hash-benchmark.
2016-01-20 15:08:27 -08:00
gatorsmile 8f90c15187 [SPARK-12616][SQL] Making Logical Operator Union Support Arbitrary Number of Children
The existing `Union` logical operator only supports two children. Thus, adding a new logical operator `Unions` which can have arbitrary number of children to replace the existing one.

`Union` logical plan is a binary node. However, a typical use case for union is to union a very large number of input sources (DataFrames, RDDs, or files). It is not uncommon to union hundreds of thousands of files. In this case, our optimizer can become very slow due to the large number of logical unions. We should change the Union logical plan to support an arbitrary number of children, and add a single rule in the optimizer to collapse all adjacent `Unions` into a single `Unions`. Note that this problem doesn't exist in physical plan, because the physical `Unions` already supports arbitrary number of children.

Author: gatorsmile <gatorsmile@gmail.com>
Author: xiaoli <lixiao1983@gmail.com>
Author: Xiao Li <xiaoli@Xiaos-MacBook-Pro.local>

Closes #10577 from gatorsmile/unionAllMultiChildren.
2016-01-20 14:59:30 -08:00
Davies Liu 8e4f894e98 [SPARK-12881] [SQL] subexpress elimination in mutable projection
Author: Davies Liu <davies@databricks.com>

Closes #10814 from davies/mutable_subexpr.
2016-01-20 10:02:40 -08:00
Reynold Xin 753b194511 [SPARK-12912][SQL] Add a test suite for EliminateSubQueries
Also updated documentation to explain why ComputeCurrentTime and EliminateSubQueries are in the optimizer rather than analyzer.

Author: Reynold Xin <rxin@databricks.com>

Closes #10837 from rxin/optimizer-analyzer-comment.
2016-01-20 00:00:28 -08:00
Reynold Xin 3e84ef0a54 [SPARK-12770][SQL] Implement rules for branch elimination for CaseWhen
The three optimization cases are:

1. If the first branch's condition is a true literal, remove the CaseWhen and use the value from that branch.
2. If a branch's condition is a false or null literal, remove that branch.
3. If only the else branch is left, remove the CaseWhen and use the value from the else branch.

Author: Reynold Xin <rxin@databricks.com>

Closes #10827 from rxin/SPARK-12770.
2016-01-19 16:14:41 -08:00
Jakob Odersky c78e2080e0 [SPARK-12816][SQL] De-alias type when generating schemas
Call `dealias` on local types to fix schema generation for abstract type members, such as

```scala
type KeyValue = (Int, String)
```

Add simple test

Author: Jakob Odersky <jodersky@gmail.com>

Closes #10749 from jodersky/aliased-schema.
2016-01-19 12:31:03 -08:00
Reynold Xin 44fcf992aa [SPARK-12873][SQL] Add more comment in HiveTypeCoercion for type widening
I was reading this part of the analyzer code again and got confused by the difference between findWiderTypeForTwo and findTightestCommonTypeOfTwo.

I also simplified WidenSetOperationTypes to make it a lot simpler. The easiest way to review this one is to just read the original code, and the new code. The logic is super simple.

Author: Reynold Xin <rxin@databricks.com>

Closes #10802 from rxin/SPARK-12873.
2016-01-18 11:08:44 -08:00
Herman van Hovell 7cd7f22025 [SPARK-12575][SQL] Grammar parity with existing SQL parser
In this PR the new CatalystQl parser stack reaches grammar parity with the old Parser-Combinator based SQL Parser. This PR also replaces all uses of the old Parser, and removes it from the code base.

Although the existing Hive and SQL parser dialects were mostly the same, some kinks had to be worked out:
- The SQL Parser allowed syntax like ```APPROXIMATE(0.01) COUNT(DISTINCT a)```. In order to make this work we needed to hardcode approximate operators in the parser, or we would have to create an approximate expression. ```APPROXIMATE_COUNT_DISTINCT(a, 0.01)``` would also do the job and is much easier to maintain. So, this PR **removes** this keyword.
- The old SQL Parser supports ```LIMIT``` clauses in nested queries. This is **not supported** anymore. See https://github.com/apache/spark/pull/10689 for the rationale for this.
- Hive has a charset name char set literal combination it supports, for instance the following expression ```_ISO-8859-1 0x4341464562616265``` would yield this string: ```CAFEbabe```. Hive will only allow charset names to start with an underscore. This is quite annoying in spark because as soon as you use a tuple names will start with an underscore. In this PR we **remove** this feature from the parser. It would be quite easy to implement such a feature as an Expression later on.
- Hive and the SQL Parser treat decimal literals differently. Hive will turn any decimal into a ```Double``` whereas the SQL Parser would convert a non-scientific decimal into a ```BigDecimal```, and would turn a scientific decimal into a Double. We follow Hive's behavior here. The new parser supports a big decimal literal, for instance: ```81923801.42BD```, which can be used when a big decimal is needed.

cc rxin viirya marmbrus yhuai cloud-fan

Author: Herman van Hovell <hvanhovell@questtec.nl>

Closes #10745 from hvanhovell/SPARK-12575-2.
2016-01-15 15:19:10 -08:00
Davies Liu c5e7076da7 [MINOR] [SQL] GeneratedExpressionCode -> ExprCode
GeneratedExpressionCode is too long

Author: Davies Liu <davies@databricks.com>

Closes #10767 from davies/renaming.
2016-01-15 08:26:20 -08:00
Michael Armbrust cc7af86afd [SPARK-12813][SQL] Eliminate serialization for back to back operations
The goal of this PR is to eliminate unnecessary translations when there are back-to-back `MapPartitions` operations.  In order to achieve this I also made the following simplifications:

 - Operators no longer have hold encoders, instead they have only the expressions that they need.  The benefits here are twofold: the expressions are visible to transformations so go through the normal resolution/binding process.  now that they are visible we can change them on a case by case basis.
 - Operators no longer have type parameters.  Since the engine is responsible for its own type checking, having the types visible to the complier was an unnecessary complication.  We still leverage the scala compiler in the companion factory when constructing a new operator, but after this the types are discarded.

Deferred to a follow up PR:
 - Remove as much of the resolution/binding from Dataset/GroupedDataset as possible. We should still eagerly check resolution and throw an error though in the case of mismatches for an `as` operation.
 - Eliminate serializations in more cases by adding more cases to `EliminateSerialization`

Author: Michael Armbrust <michael@databricks.com>

Closes #10747 from marmbrus/encoderExpressions.
2016-01-14 17:44:56 -08:00
Reynold Xin cbbcd8e425 [SPARK-12791][SQL] Simplify CaseWhen by breaking "branches" into "conditions" and "values"
This pull request rewrites CaseWhen expression to break the single, monolithic "branches" field into a sequence of tuples (Seq[(condition, value)]) and an explicit optional elseValue field.

Prior to this pull request, each even position in "branches" represents the condition for each branch, and each odd position represents the value for each branch. The use of them have been pretty confusing with a lot sliding windows or grouped(2) calls.

Author: Reynold Xin <rxin@databricks.com>

Closes #10734 from rxin/simplify-case.
2016-01-13 12:44:35 -08:00
Wenchen Fan c2ea79f96a [SPARK-12642][SQL] improve the hash expression to be decoupled from unsafe row
https://issues.apache.org/jira/browse/SPARK-12642

Author: Wenchen Fan <wenchen@databricks.com>

Closes #10694 from cloud-fan/hash-expr.
2016-01-13 12:29:02 -08:00
Kousuke Saruta cb7b864a24 [SPARK-12692][BUILD][SQL] Scala style: Fix the style violation (Space before ",")
Fix the style violation (space before , and :).
This PR is a followup for #10643 and rework of #10685 .

Author: Kousuke Saruta <sarutak@oss.nttdata.co.jp>

Closes #10732 from sarutak/SPARK-12692-followup-sql.
2016-01-12 22:25:20 -08:00
Reynold Xin b3b9ad23cf [SPARK-12788][SQL] Simplify BooleanEquality by using casts.
Author: Reynold Xin <rxin@databricks.com>

Closes #10730 from rxin/SPARK-12788.
2016-01-12 18:45:55 -08:00
Reynold Xin 0d543b98f3 Revert "[SPARK-12692][BUILD][SQL] Scala style: Fix the style violation (Space before "," or ":")"
This reverts commit 8cfa218f4f.
2016-01-12 12:56:52 -08:00
Reynold Xin 0ed430e315 [SPARK-12768][SQL] Remove CaseKeyWhen expression
This patch removes CaseKeyWhen expression and replaces it with a factory method that generates the equivalent CaseWhen. This reduces the amount of code we'd need to maintain in the future for both code generation and optimizer.

Note that we introduced CaseKeyWhen to avoid duplicate evaluations of the key. This is no longer a problem because we now have common subexpression elimination.

Author: Reynold Xin <rxin@databricks.com>

Closes #10722 from rxin/SPARK-12768.
2016-01-12 11:13:08 -08:00
Reynold Xin 1d88879530 [SPARK-12762][SQL] Add unit test for SimplifyConditionals optimization rule
This pull request does a few small things:

1. Separated if simplification from BooleanSimplification and created a new rule SimplifyConditionals. In the future we can also simplify other conditional expressions here.

2. Added unit test for SimplifyConditionals.

3. Renamed SimplifyCaseConversionExpressionsSuite to SimplifyStringCaseConversionSuite

Author: Reynold Xin <rxin@databricks.com>

Closes #10716 from rxin/SPARK-12762.
2016-01-12 10:58:57 -08:00
Kousuke Saruta 8cfa218f4f [SPARK-12692][BUILD][SQL] Scala style: Fix the style violation (Space before "," or ":")
Fix the style violation (space before , and :).
This PR is a followup for #10643.

Author: Kousuke Saruta <sarutak@oss.nttdata.co.jp>

Closes #10718 from sarutak/SPARK-12692-followup-sql.
2016-01-12 00:51:00 -08:00
Herman van Hovell fe9eb0b0ce [SPARK-12576][SQL] Enable expression parsing in CatalystQl
The PR allows us to use the new SQL parser to parse SQL expressions such as: ```1 + sin(x*x)```

We enable this functionality in this PR, but we will not start using this actively yet. This will be done as soon as we have reached grammar parity with the existing parser stack.

cc rxin

Author: Herman van Hovell <hvanhovell@questtec.nl>

Closes #10649 from hvanhovell/SPARK-12576.
2016-01-11 16:29:37 -08:00
Liang-Chi Hsieh 95cd5d95ce [SPARK-12577] [SQL] Better support of parentheses in partition by and order by clause of window function's over clause
JIRA: https://issues.apache.org/jira/browse/SPARK-12577

Author: Liang-Chi Hsieh <viirya@gmail.com>

Closes #10620 from viirya/fix-parentheses.
2016-01-08 21:48:06 -08:00
Cheng Lian d9447cac74 [SPARK-12593][SQL] Converts resolved logical plan back to SQL
This PR tries to enable Spark SQL to convert resolved logical plans back to SQL query strings.  For now, the major use case is to canonicalize Spark SQL native view support.  The major entry point is `SQLBuilder.toSQL`, which returns an `Option[String]` if the logical plan is recognized.

The current version is still in WIP status, and is quite limited.  Known limitations include:

1.  The logical plan must be analyzed but not optimized

    The optimizer erases `Subquery` operators, which contain necessary scope information for SQL generation.  Future versions should be able to recover erased scope information by inserting subqueries when necessary.

1.  The logical plan must be created using HiveQL query string

    Query plans generated by composing arbitrary DataFrame API combinations are not supported yet.  Operators within these query plans need to be rearranged into a canonical form that is more suitable for direct SQL generation.  For example, the following query plan

    ```
    Filter (a#1 < 10)
     +- MetastoreRelation default, src, None
    ```

    need to be canonicalized into the following form before SQL generation:

    ```
    Project [a#1, b#2, c#3]
     +- Filter (a#1 < 10)
         +- MetastoreRelation default, src, None
    ```

    Otherwise, the SQL generation process will have to handle a large number of special cases.

1.  Only a fraction of expressions and basic logical plan operators are supported in this PR

    Currently, 95.7% (1720 out of 1798) query plans in `HiveCompatibilitySuite` can be successfully converted to SQL query strings.

    Known unsupported components are:

    - Expressions
      - Part of math expressions
      - Part of string expressions (buggy?)
      - Null expressions
      - Calendar interval literal
      - Part of date time expressions
      - Complex type creators
      - Special `NOT` expressions, e.g. `NOT LIKE` and `NOT IN`
    - Logical plan operators/patterns
      - Cube, rollup, and grouping set
      - Script transformation
      - Generator
      - Distinct aggregation patterns that fit `DistinctAggregationRewriter` analysis rule
      - Window functions

    Support for window functions, generators, and cubes etc. will be added in follow-up PRs.

This PR leverages `HiveCompatibilitySuite` for testing SQL generation in a "round-trip" manner:

*   For all select queries, we try to convert it back to SQL
*   If the query plan is convertible, we parse the generated SQL into a new logical plan
*   Run the new logical plan instead of the original one

If the query plan is inconvertible, the test case simply falls back to the original logic.

TODO

- [x] Fix failed test cases
- [x] Support for more basic expressions and logical plan operators (e.g. distinct aggregation etc.)
- [x] Comments and documentation

Author: Cheng Lian <lian@databricks.com>

Closes #10541 from liancheng/sql-generation.
2016-01-08 14:08:13 -08:00
Liang-Chi Hsieh cfe1ba56e4 [SPARK-12687] [SQL] Support from clause surrounded by ().
JIRA: https://issues.apache.org/jira/browse/SPARK-12687

Some queries such as `(select 1 as a) union (select 2 as a)` can't work. This patch fixes it.

Author: Liang-Chi Hsieh <viirya@gmail.com>

Closes #10660 from viirya/fix-union.
2016-01-08 09:50:41 -08:00
Sean Owen b9c8353378 [SPARK-12618][CORE][STREAMING][SQL] Clean up build warnings: 2.0.0 edition
Fix most build warnings: mostly deprecated API usages. I'll annotate some of the changes below. CC rxin who is leading the charge to remove the deprecated APIs.

Author: Sean Owen <sowen@cloudera.com>

Closes #10570 from srowen/SPARK-12618.
2016-01-08 17:47:44 +00:00
Davies Liu fd1dcfaf26 [SPARK-12542][SQL] support except/intersect in HiveQl
Parse the SQL query with except/intersect in FROM clause for HivQL.

Author: Davies Liu <davies@databricks.com>

Closes #10622 from davies/intersect.
2016-01-06 23:46:12 -08:00
Marcelo Vanzin b3ba1be3b7 [SPARK-3873][TESTS] Import ordering fixes.
Author: Marcelo Vanzin <vanzin@cloudera.com>

Closes #10582 from vanzin/SPARK-3873-tests.
2016-01-05 19:07:39 -08:00
Liang-Chi Hsieh d202ad2fc2 [SPARK-12439][SQL] Fix toCatalystArray and MapObjects
JIRA: https://issues.apache.org/jira/browse/SPARK-12439

In toCatalystArray, we should look at the data type returned by dataTypeFor instead of silentSchemaFor, to determine if the element is native type. An obvious problem is when the element is Option[Int] class, catalsilentSchemaFor will return Int, then we will wrongly recognize the element is native type.

There is another problem when using Option as array element. When we encode data like Seq(Some(1), Some(2), None) with encoder, we will use MapObjects to construct an array for it later. But in MapObjects, we don't check if the return value of lambdaFunction is null or not. That causes a bug that the decoded data for Seq(Some(1), Some(2), None) would be Seq(1, 2, -1), instead of Seq(1, 2, null).

Author: Liang-Chi Hsieh <viirya@gmail.com>

Closes #10391 from viirya/fix-catalystarray.
2016-01-05 12:33:21 -08:00
Wenchen Fan 76768337be [SPARK-12480][FOLLOW-UP] use a single column vararg for hash
address comments in #10435

This makes the API easier to use if user programmatically generate the call to hash, and they will get analysis exception if the arguments of hash is empty.

Author: Wenchen Fan <wenchen@databricks.com>

Closes #10588 from cloud-fan/hash.
2016-01-05 10:23:36 -08:00
Liang-Chi Hsieh b3c48e39f4 [SPARK-12438][SQL] Add SQLUserDefinedType support for encoder
JIRA: https://issues.apache.org/jira/browse/SPARK-12438

ScalaReflection lacks the support of SQLUserDefinedType. We should add it.

Author: Liang-Chi Hsieh <viirya@gmail.com>

Closes #10390 from viirya/encoder-udt.
2016-01-05 10:19:56 -08:00
Michael Armbrust 53beddc5bf [SPARK-12568][SQL] Add BINARY to Encoders
Author: Michael Armbrust <michael@databricks.com>

Closes #10516 from marmbrus/datasetCleanup.
2016-01-04 23:23:41 -08:00
Wenchen Fan b1a771231e [SPARK-12480][SQL] add Hash expression that can calculate hash value for a group of expressions
just write the arguments into unsafe row and use murmur3 to calculate hash code

Author: Wenchen Fan <wenchen@databricks.com>

Closes #10435 from cloud-fan/hash-expr.
2016-01-04 18:49:41 -08:00
Herman van Hovell 0171b71e95 [SPARK-12421][SQL] Prevent Internal/External row from exposing state.
It is currently possible to change the values of the supposedly immutable ```GenericRow``` and ```GenericInternalRow``` classes. This is caused by the fact that scala's ArrayOps ```toArray``` (returned by calling ```toSeq```) will return the backing array instead of a copy. This PR fixes this problem.

This PR was inspired by https://github.com/apache/spark/pull/10374 by apo1.

cc apo1 sarutak marmbrus cloud-fan nongli (everyone in the previous conversation).

Author: Herman van Hovell <hvanhovell@questtec.nl>

Closes #10553 from hvanhovell/SPARK-12421.
2016-01-04 12:41:57 -08:00
Liang-Chi Hsieh c9dbfcc653 [SPARK-11743][SQL] Move the test for arrayOfUDT
A following pr for #9712. Move the test for arrayOfUDT.

Author: Liang-Chi Hsieh <viirya@gmail.com>

Closes #10538 from viirya/move-udt-test.
2015-12-31 23:48:05 -08:00
Davies Liu e6c77874b9 [SPARK-12585] [SQL] move numFields to constructor of UnsafeRow
Right now, numFields will be passed in by pointTo(), then bitSetWidthInBytes is calculated, making pointTo() a little bit heavy.

It should be part of constructor of UnsafeRow.

Author: Davies Liu <davies@databricks.com>

Closes #10528 from davies/numFields.
2015-12-30 22:16:37 -08:00
Wenchen Fan aa48164a43 [SPARK-12495][SQL] use true as default value for propagateNull in NewInstance
Most of cases we should propagate null when call `NewInstance`, and so far there is only one case we should stop null propagation: create product/java bean. So I think it makes more sense to propagate null by dafault.

This also fixes a bug when encode null array/map, which is firstly discovered in https://github.com/apache/spark/pull/10401

Author: Wenchen Fan <wenchen@databricks.com>

Closes #10443 from cloud-fan/encoder.
2015-12-30 10:56:08 -08:00
Stephan Kessler a6a4812434 [SPARK-7727][SQL] Avoid inner classes in RuleExecutor
Moved (case) classes Strategy, Once, FixedPoint and Batch to the companion object. This is necessary if we want to have the Optimizer easily extendable in the following sense: Usually a user wants to add additional rules, and just take the ones that are already there. However, inner classes made that impossible since the code did not compile

This allows easy extension of existing Optimizers see the DefaultOptimizerExtendableSuite for a corresponding test case.

Author: Stephan Kessler <stephan.kessler@sap.com>

Closes #10174 from stephankessler/SPARK-7727.
2015-12-28 12:46:20 -08:00
Liang-Chi Hsieh 50301c0a28 [SPARK-11164][SQL] Add InSet pushdown filter back for Parquet
When the filter is ```"b in ('1', '2')"```, the filter is not pushed down to Parquet. Thanks!

Author: gatorsmile <gatorsmile@gmail.com>
Author: xiaoli <lixiao1983@gmail.com>
Author: Xiao Li <xiaoli@Xiaos-MacBook-Pro.local>

Closes #10278 from gatorsmile/parquetFilterNot.
2015-12-23 14:08:29 +08:00
Dilip Biswal b374a25831 [SPARK-12102][SQL] Cast a non-nullable struct field to a nullable field during analysis
Compare both left and right side of the case expression ignoring nullablity when checking for type equality.

Author: Dilip Biswal <dbiswal@us.ibm.com>

Closes #10156 from dilipbiswal/spark-12102.
2015-12-22 15:21:49 -08:00
Cheng Lian 42bfde2983 [SPARK-12371][SQL] Runtime nullability check for NewInstance
This PR adds a new expression `AssertNotNull` to ensure non-nullable fields of products and case classes don't receive null values at runtime.

Author: Cheng Lian <lian@databricks.com>

Closes #10331 from liancheng/dataset-nullability-check.
2015-12-22 19:41:44 +08:00
Davies Liu 4af647c77d [SPARK-12054] [SQL] Consider nullability of expression in codegen
This could simplify the generated code for expressions that is not nullable.

This PR fix lots of bugs about nullability.

Author: Davies Liu <davies@databricks.com>

Closes #10333 from davies/skip_nullable.
2015-12-18 10:09:17 -08:00
Herman van Hovell 658f66e620 [SPARK-8641][SQL] Native Spark Window functions
This PR removes Hive windows functions from Spark and replaces them with (native) Spark ones. The PR is on par with Hive in terms of features.

This has the following advantages:
* Better memory management.
* The ability to use spark UDAFs in Window functions.

cc rxin / yhuai

Author: Herman van Hovell <hvanhovell@questtec.nl>

Closes #9819 from hvanhovell/SPARK-8641-2.
2015-12-17 15:16:35 -08:00
Wenchen Fan a783a8ed49 [SPARK-12320][SQL] throw exception if the number of fields does not line up for Tuple encoder
Author: Wenchen Fan <wenchen@databricks.com>

Closes #10293 from cloud-fan/err-msg.
2015-12-16 13:20:12 -08:00
Davies Liu 54c512ba90 [SPARK-8745] [SQL] remove GenerateProjection
cc rxin

Author: Davies Liu <davies@databricks.com>

Closes #10316 from davies/remove_generate_projection.
2015-12-16 10:22:48 -08:00
Wenchen Fan a89e8b6122 [SPARK-10477][SQL] using DSL in ColumnPruningSuite to improve readability
Author: Wenchen Fan <cloud0fan@outlook.com>

Closes #8645 from cloud-fan/test.
2015-12-15 18:29:19 -08:00
Wenchen Fan d8ec081c91 [SPARK-12252][SPARK-12131][SQL] refactor MapObjects to make it less hacky
in https://github.com/apache/spark/pull/10133 we found that, we shoud ensure the children of `TreeNode` are all accessible in the `productIterator`, or the behavior will be very confusing.

In this PR, I try to fix this problem by expsing the `loopVar`.

This also fixes SPARK-12131 which is caused by the hacky `MapObjects`.

Author: Wenchen Fan <wenchen@databricks.com>

Closes #10239 from cloud-fan/map-objects.
2015-12-10 15:11:13 +08:00
Wenchen Fan 381f17b540 [SPARK-12201][SQL] add type coercion rule for greatest/least
checked with hive, greatest/least should cast their children to a tightest common type,
i.e. `(int, long) => long`, `(int, string) => error`, `(decimal(10,5), decimal(5, 10)) => error`

Author: Wenchen Fan <wenchen@databricks.com>

Closes #10196 from cloud-fan/type-coercion.
2015-12-08 10:13:40 -08:00
Davies Liu 9cde7d5fa8 [SPARK-12032] [SQL] Re-order inner joins to do join with conditions first
Currently, the order of joins is exactly the same as SQL query, some conditions may not pushed down to the correct join, then those join will become cross product and is extremely slow.

This patch try to re-order the inner joins (which are common in SQL query), pick the joins that have self-contain conditions first, delay those that does not have conditions.

After this patch, the TPCDS query Q64/65 can run hundreds times faster.

cc marmbrus nongli

Author: Davies Liu <davies@databricks.com>

Closes #10073 from davies/reorder_joins.
2015-12-07 10:34:18 -08:00
gatorsmile 49efd03bac [SPARK-12138][SQL] Escape \u in the generated comments of codegen
When \u appears in a comment block (i.e. in /**/), code gen will break. So, in Expression and CodegenFallback, we escape \u to \\u.

yhuai Please review it. I did reproduce it and it works after the fix. Thanks!

Author: gatorsmile <gatorsmile@gmail.com>

Closes #10155 from gatorsmile/escapeU.
2015-12-06 11:15:02 -08:00
Josh Rosen b7204e1d41 [SPARK-12112][BUILD] Upgrade to SBT 0.13.9
We should upgrade to SBT 0.13.9, since this is a requirement in order to use SBT's new Maven-style resolution features (which will be done in a separate patch, because it's blocked by some binary compatibility issues in the POM reader plugin).

I also upgraded Scalastyle to version 0.8.0, which was necessary in order to fix a Scala 2.10.5 compatibility issue (see https://github.com/scalastyle/scalastyle/issues/156). The newer Scalastyle is slightly stricter about whitespace surrounding tokens, so I fixed the new style violations.

Author: Josh Rosen <joshrosen@databricks.com>

Closes #10112 from JoshRosen/upgrade-to-sbt-0.13.9.
2015-12-05 08:15:30 +08:00
Yin Huai 5872a9d89f [SPARK-11352][SQL] Escape */ in the generated comments.
https://issues.apache.org/jira/browse/SPARK-11352

Author: Yin Huai <yhuai@databricks.com>

Closes #10072 from yhuai/SPARK-11352.
2015-12-01 16:24:04 -08:00
Wenchen Fan fd95eeaf49 [SPARK-11954][SQL] Encoder for JavaBeans
create java version of `constructorFor` and `extractorFor` in `JavaTypeInference`

Author: Wenchen Fan <wenchen@databricks.com>

This patch had conflicts when merged, resolved by
Committer: Michael Armbrust <michael@databricks.com>

Closes #9937 from cloud-fan/pojo.
2015-12-01 10:35:12 -08:00
Wenchen Fan 9df24624af [SPARK-11856][SQL] add type cast if the real type is different but compatible with encoder schema
When we build the `fromRowExpression` for an encoder, we set up a lot of "unresolved" stuff and lost the required data type, which may lead to runtime error if the real type doesn't match the encoder's schema.
For example, we build an encoder for `case class Data(a: Int, b: String)` and the real type is `[a: int, b: long]`, then we will hit runtime error and say that we can't construct class `Data` with int and long, because we lost the information that `b` should be a string.

Author: Wenchen Fan <wenchen@databricks.com>

Closes #9840 from cloud-fan/err-msg.
2015-12-01 10:24:53 -08:00
Herman van Hovell 3d28081e53 [SPARK-12024][SQL] More efficient multi-column counting.
In https://github.com/apache/spark/pull/9409 we enabled multi-column counting. The approach taken in that PR introduces a bit of overhead by first creating a row only to check if all of the columns are non-null.

This PR fixes that technical debt. Count now takes multiple columns as its input. In order to make this work I have also added support for multiple columns in the single distinct code path.

cc yhuai

Author: Herman van Hovell <hvanhovell@questtec.nl>

Closes #10015 from hvanhovell/SPARK-12024.
2015-11-29 14:13:11 -08:00
Reynold Xin de28e4d4de [SPARK-11973][SQL] Improve optimizer code readability.
This is a followup for https://github.com/apache/spark/pull/9959.

I added more documentation and rewrote some monadic code into simpler ifs.

Author: Reynold Xin <rxin@databricks.com>

Closes #9995 from rxin/SPARK-11973.
2015-11-26 18:47:54 -08:00
Dilip Biswal bc16a67562 [SPARK-11863][SQL] Unable to resolve order by if it contains mixture of aliases and real columns
this is based on https://github.com/apache/spark/pull/9844, with some bug fix and clean up.

The problems is that, normal operator should be resolved based on its child, but `Sort` operator can also be resolved based on its grandchild. So we have 3 rules that can resolve `Sort`: `ResolveReferences`, `ResolveSortReferences`(if grandchild is `Project`) and `ResolveAggregateFunctions`(if grandchild is `Aggregate`).
For example, `select c1 as a , c2 as b from tab group by c1, c2 order by a, c2`, we need to resolve `a` and `c2` for `Sort`. Firstly `a` will be resolved in `ResolveReferences` based on its child, and when we reach `ResolveAggregateFunctions`, we will try to resolve both `a` and `c2` based on its grandchild, but failed because `a` is not a legal aggregate expression.

whoever merge this PR, please give the credit to dilipbiswal

Author: Dilip Biswal <dbiswal@us.ibm.com>
Author: Wenchen Fan <wenchen@databricks.com>

Closes #9961 from cloud-fan/sort.
2015-11-26 11:31:28 -08:00
Davies Liu 27d69a0573 [SPARK-11973] [SQL] push filter through aggregation with alias and literals
Currently, filter can't be pushed through aggregation with alias or literals, this patch fix that.

After this patch, the time of TPC-DS query 4 go down to 13 seconds from 141 seconds (10x improvements).

cc nongli  yhuai

Author: Davies Liu <davies@databricks.com>

Closes #9959 from davies/push_filter2.
2015-11-26 00:19:42 -08:00
Daoyuan Wang 21e5606419 [SPARK-11983][SQL] remove all unused codegen fallback trait
Author: Daoyuan Wang <daoyuan.wang@intel.com>

Closes #9966 from adrian-wang/removeFallback.
2015-11-25 13:51:30 -08:00
Wenchen Fan 19530da690 [SPARK-11926][SQL] unify GetStructField and GetInternalRowField
Author: Wenchen Fan <wenchen@databricks.com>

Closes #9909 from cloud-fan/get-struct.
2015-11-24 11:09:01 -08:00
Wenchen Fan f2996e0d12 [SPARK-11921][SQL] fix nullable of encoder schema
Author: Wenchen Fan <wenchen@databricks.com>

Closes #9906 from cloud-fan/nullable.
2015-11-23 10:15:40 -08:00
Xiu Guo 94ce65dfcb [SPARK-11628][SQL] support column datatype of char(x) to recognize HiveChar
Can someone review my code to make sure I'm not missing anything? Thanks!

Author: Xiu Guo <xguo27@gmail.com>
Author: Xiu Guo <guoxi@us.ibm.com>

Closes #9612 from xguo27/SPARK-11628.
2015-11-23 08:53:40 -08:00
Liang-Chi Hsieh 426004a9c9 [SPARK-11908][SQL] Add NullType support to RowEncoder
JIRA: https://issues.apache.org/jira/browse/SPARK-11908

We should add NullType support to RowEncoder.

Author: Liang-Chi Hsieh <viirya@appier.com>

Closes #9891 from viirya/rowencoder-nulltype.
2015-11-22 10:36:47 -08:00
Nong Li 9ed4ad4265 [SPARK-11724][SQL] Change casting between int and timestamp to consistently treat int in seconds.
Hive has since changed this behavior as well. https://issues.apache.org/jira/browse/HIVE-3454

Author: Nong Li <nong@databricks.com>
Author: Nong Li <nongli@gmail.com>
Author: Yin Huai <yhuai@databricks.com>

Closes #9685 from nongli/spark-11724.
2015-11-20 14:19:34 -08:00
Wenchen Fan 3b9d2a347f [SPARK-11819][SQL] nice error message for missing encoder
before this PR, when users try to get an encoder for an un-supported class, they will only get a very simple error message like `Encoder for type xxx is not supported`.

After this PR, the error message become more friendly, for example:
```
No Encoder found for abc.xyz.NonEncodable
- array element class: "abc.xyz.NonEncodable"
- field (class: "scala.Array", name: "arrayField")
- root class: "abc.xyz.AnotherClass"
```

Author: Wenchen Fan <wenchen@databricks.com>

Closes #9810 from cloud-fan/error-message.
2015-11-20 12:04:42 -08:00
Liang-Chi Hsieh 60bfb11332 [SPARK-11817][SQL] Truncating the fractional seconds to prevent inserting a NULL
JIRA: https://issues.apache.org/jira/browse/SPARK-11817

Instead of return None, we should truncate the fractional seconds to prevent inserting NULL.

Author: Liang-Chi Hsieh <viirya@appier.com>

Closes #9834 from viirya/truncate-fractional-sec.
2015-11-20 11:43:45 -08:00
Wenchen Fan 47d1c2325c [SPARK-11750][SQL] revert SPARK-11727 and code clean up
After some experiment, I found it's not convenient to have separate encoder builders: `FlatEncoder` and `ProductEncoder`. For example, when create encoders for `ScalaUDF`, we have no idea if the type `T` is flat or not. So I revert the splitting change in https://github.com/apache/spark/pull/9693, while still keeping the bug fixes and tests.

Author: Wenchen Fan <wenchen@databricks.com>

Closes #9726 from cloud-fan/follow.
2015-11-19 12:54:25 -08:00
Reynold Xin f449992009 [SPARK-11849][SQL] Analyzer should replace current_date and current_timestamp with literals
We currently rely on the optimizer's constant folding to replace current_timestamp and current_date. However, this can still result in different values for different instances of current_timestamp/current_date if the optimizer is not running fast enough.

A better solution is to replace these functions in the analyzer in one shot.

Author: Reynold Xin <rxin@databricks.com>

Closes #9833 from rxin/SPARK-11849.
2015-11-19 10:48:04 -08:00
Reynold Xin e61367b9f9 [SPARK-11833][SQL] Add Java tests for Kryo/Java Dataset encoders
Also added some nicer error messages for incompatible types (private types and primitive types) for Kryo/Java encoder.

Author: Reynold Xin <rxin@databricks.com>

Closes #9823 from rxin/SPARK-11833.
2015-11-18 18:34:36 -08:00
Michael Armbrust 59a501359a [SPARK-11636][SQL] Support classes defined in the REPL with Encoders
Before this PR there were two things that would blow up if you called `df.as[MyClass]` if `MyClass` was defined in the REPL:
 - [x] Because `classForName` doesn't work on the munged names returned by `tpe.erasure.typeSymbol.asClass.fullName`
 - [x] Because we don't have anything to pass into the constructor for the `$outer` pointer.

Note that this PR is just adding the infrastructure for working with inner classes in encoder and is not yet sufficient to make them work in the REPL.  Currently, the implementation show in 95cec7d413 is causing a bug that breaks code gen due to some interaction between janino and the `ExecutorClassLoader`.  This will be addressed in a follow-up PR.

Author: Michael Armbrust <michael@databricks.com>

Closes #9602 from marmbrus/dataset-replClasses.
2015-11-18 16:48:09 -08:00
Reynold Xin 5df08949f5 [SPARK-11810][SQL] Java-based encoder for opaque types in Datasets.
This patch refactors the existing Kryo encoder expressions and adds support for Java serialization.

Author: Reynold Xin <rxin@databricks.com>

Closes #9802 from rxin/SPARK-11810.
2015-11-18 15:42:07 -08:00
Wenchen Fan 33b8373334 [SPARK-11725][SQL] correctly handle null inputs for UDF
If user use primitive parameters in UDF, there is no way for him to do the null-check for primitive inputs, so we are assuming the primitive input is null-propagatable for this case and return null if the input is null.

Author: Wenchen Fan <wenchen@databricks.com>

Closes #9770 from cloud-fan/udf.
2015-11-18 10:23:12 -08:00
Reynold Xin 5e2b44474c [SPARK-11802][SQL] Kryo-based encoder for opaque types in Datasets
I also found a bug with self-joins returning incorrect results in the Dataset API. Two test cases attached and filed SPARK-11803.

Author: Reynold Xin <rxin@databricks.com>

Closes #9789 from rxin/SPARK-11802.
2015-11-18 00:09:29 -08:00
Davies Liu 2f191c66b6 [SPARK-11643] [SQL] parse year with leading zero
Support the years between 0 <= year < 1000

Author: Davies Liu <davies@databricks.com>

Closes #9701 from davies/leading_zero.
2015-11-17 23:14:05 -08:00
gatorsmile 0158ff7737 [SPARK-8658][SQL][FOLLOW-UP] AttributeReference's equals method compares all the members
Based on the comment of cloud-fan in https://github.com/apache/spark/pull/9216, update the AttributeReference's hashCode function by including the hashCode of the other attributes including name, nullable and qualifiers.

Here, I am not 100% sure if we should include name in the hashCode calculation, since the original hashCode calculation does not include it.

marmbrus cloud-fan Please review if the changes are good.

Author: gatorsmile <gatorsmile@gmail.com>

Closes #9761 from gatorsmile/hashCodeNamedExpression.
2015-11-17 11:23:54 -08:00
Liang-Chi Hsieh d79d8b08ff [MINOR] [SQL] Fix randomly generated ArrayData in RowEncoderSuite
The randomly generated ArrayData used for the UDT `ExamplePoint` in `RowEncoderSuite` sometimes doesn't have enough elements. In this case, this test will fail. This patch is to fix it.

Author: Liang-Chi Hsieh <viirya@appier.com>

Closes #9757 from viirya/fix-randomgenerated-udt.
2015-11-16 23:16:17 -08:00
Bartlomiej Alberski 31296628ac [SPARK-11553][SQL] Primitive Row accessors should not convert null to default value
Invocation of getters for type extending AnyVal returns default value (if field value is null) instead of throwing NPE. Please check comments for SPARK-11553 issue for more details.

Author: Bartlomiej Alberski <bartlomiej.alberski@allegrogroup.com>

Closes #9642 from alberskib/bugfix/SPARK-11553.
2015-11-16 15:14:38 -08:00
Wenchen Fan b1a9662623 [SPARK-11754][SQL] consolidate ExpressionEncoder.tuple and Encoders.tuple
These 2 are very similar, we can consolidate them into one.

Also add tests for it and fix a bug.

Author: Wenchen Fan <wenchen@databricks.com>

Closes #9729 from cloud-fan/tuple.
2015-11-16 12:45:34 -08:00
Liang-Chi Hsieh b0c3fd34e4 [SPARK-11743] [SQL] Add UserDefinedType support to RowEncoder
JIRA: https://issues.apache.org/jira/browse/SPARK-11743

RowEncoder doesn't support UserDefinedType now. We should add the support for it.

Author: Liang-Chi Hsieh <viirya@appier.com>

Closes #9712 from viirya/rowencoder-udt.
2015-11-16 09:03:42 -08:00
Wenchen Fan 06f1fdba6d [SPARK-11752] [SQL] fix timezone problem for DateTimeUtils.getSeconds
code snippet to reproduce it:
```
TimeZone.setDefault(TimeZone.getTimeZone("Asia/Shanghai"))
val t = Timestamp.valueOf("1900-06-11 12:14:50.789")
val us = fromJavaTimestamp(t)
assert(getSeconds(us) === t.getSeconds)
```

it will be good to add a regression test for it, but the reproducing code need to change the default timezone, and even we change it back, the `lazy val defaultTimeZone` in `DataTimeUtils` is fixed.

Author: Wenchen Fan <wenchen@databricks.com>

Closes #9728 from cloud-fan/seconds.
2015-11-16 08:58:40 -08:00
Yin Huai 3e2e1873b2 [SPARK-11738] [SQL] Making ArrayType orderable
https://issues.apache.org/jira/browse/SPARK-11738

Author: Yin Huai <yhuai@databricks.com>

Closes #9718 from yhuai/makingArrayOrderable.
2015-11-15 13:59:59 -08:00
Wenchen Fan d7b2b97ad6 [SPARK-11727][SQL] Split ExpressionEncoder into FlatEncoder and ProductEncoder
also add more tests for encoders, and fix bugs that I found:

* when convert array to catalyst array, we can only skip element conversion for native types(e.g. int, long, boolean), not `AtomicType`(String is AtomicType but we need to convert it)
* we should also handle scala `BigDecimal` when convert from catalyst `Decimal`.
* complex map type should be supported

other issues that still in investigation:

* encode java `BigDecimal` and decode it back, seems we will loss precision info.
* when encode case class that defined inside a object, `ClassNotFound` exception will be thrown.

I'll remove unused code in a follow-up PR.

Author: Wenchen Fan <wenchen@databricks.com>

Closes #9693 from cloud-fan/split.
2015-11-13 11:25:33 -08:00
Daoyuan Wang 39b1e36fbc [SPARK-11396] [SQL] add native implementation of datetime function to_unix_timestamp
`to_unix_timestamp` is the deterministic version of `unix_timestamp`, as it accepts at least one parameters.

Since the behavior here is quite similar to `unix_timestamp`, I think the dataframe API is not necessary here.

Author: Daoyuan Wang <daoyuan.wang@intel.com>

Closes #9347 from adrian-wang/to_unix_timestamp.
2015-11-11 20:36:21 -08:00
Yin Huai 3121e78168 [SPARK-9830][SPARK-11641][SQL][FOLLOW-UP] Remove AggregateExpression1 and update toString of Exchange
https://issues.apache.org/jira/browse/SPARK-9830

This is the follow-up pr for https://github.com/apache/spark/pull/9556 to address davies' comments.

Author: Yin Huai <yhuai@databricks.com>

Closes #9607 from yhuai/removeAgg1-followup.
2015-11-10 16:25:22 -08:00
Nong Li 87aedc48c0 [SPARK-10371][SQL] Implement subexpr elimination for UnsafeProjections
This patch adds the building blocks for codegening subexpr elimination and implements
it end to end for UnsafeProjection. The building blocks can be used to do the same thing
for other operators.

It introduces some utilities to compute common sub expressions. Expressions can be added to
this data structure. The expr and its children will be recursively matched against existing
expressions (ones previously added) and grouped into common groups. This is built using
the existing `semanticEquals`. It does not understand things like commutative or associative
expressions. This can be done as future work.

After building this data structure, the codegen process takes advantage of it by:
  1. Generating a helper function in the generated class that computes the common
     subexpression. This is done for all common subexpressions that have at least
     two occurrences and the expression tree is sufficiently complex.
  2. When generating the apply() function, if the helper function exists, call that
     instead of regenerating the expression tree. Repeated calls to the helper function
     shortcircuit the evaluation logic.

Author: Nong Li <nong@databricks.com>
Author: Nong Li <nongli@gmail.com>

This patch had conflicts when merged, resolved by
Committer: Michael Armbrust <michael@databricks.com>

Closes #9480 from nongli/spark-10371.
2015-11-10 11:28:53 -08:00
Wenchen Fan 53600854c2 [SPARK-11590][SQL] use native json_tuple in lateral view
Author: Wenchen Fan <wenchen@databricks.com>

Closes #9562 from cloud-fan/json-tuple.
2015-11-10 11:21:31 -08:00
Yin Huai e0701c7560 [SPARK-9830][SQL] Remove AggregateExpression1 and Aggregate Operator used to evaluate AggregateExpression1s
https://issues.apache.org/jira/browse/SPARK-9830

This PR contains the following main changes.
* Removing `AggregateExpression1`.
* Removing `Aggregate` operator, which is used to evaluate `AggregateExpression1`.
* Removing planner rule used to plan `Aggregate`.
* Linking `MultipleDistinctRewriter` to analyzer.
* Renaming `AggregateExpression2` to `AggregateExpression` and `AggregateFunction2` to `AggregateFunction`.
* Updating places where we create aggregate expression. The way to create aggregate expressions is `AggregateExpression(aggregateFunction, mode, isDistinct)`.
* Changing `val`s in `DeclarativeAggregate`s that touch children of this function to `lazy val`s (when we create aggregate expression in DataFrame API, children of an aggregate function can be unresolved).

Author: Yin Huai <yhuai@databricks.com>

Closes #9556 from yhuai/removeAgg1.
2015-11-10 11:06:29 -08:00
Herman van Hovell 30c8ba71a7 [SPARK-11451][SQL] Support single distinct count on multiple columns.
This PR adds support for multiple column in a single count distinct aggregate to the new aggregation path.

cc yhuai

Author: Herman van Hovell <hvanhovell@questtec.nl>

Closes #9409 from hvanhovell/SPARK-11451.
2015-11-08 11:06:10 -08:00
Imran Rashid 49f1a82037 [SPARK-10116][CORE] XORShiftRandom.hashSeed is random in high bits
https://issues.apache.org/jira/browse/SPARK-10116

This is really trivial, just happened to notice it -- if `XORShiftRandom.hashSeed` is really supposed to have random bits throughout (as the comment implies), it needs to do something for the conversion to `long`.

mengxr mkolod

Author: Imran Rashid <irashid@cloudera.com>

Closes #8314 from squito/SPARK-10116.
2015-11-06 20:06:24 +00:00
Wenchen Fan 253e87e8ab [SPARK-11453][SQL][FOLLOW-UP] remove DecimalLit
A cleanup for https://github.com/apache/spark/pull/9085.

The `DecimalLit` is very similar to `FloatLit`, we can just keep one of them.
Also added low level unit test at `SqlParserSuite`

Author: Wenchen Fan <wenchen@databricks.com>

Closes #9482 from cloud-fan/parser.
2015-11-06 06:38:49 -08:00
Davies Liu 07414afac9 [SPARK-11537] [SQL] fix negative hours/minutes/seconds
Currently, if the Timestamp is before epoch (1970/01/01), the hours, minutes and seconds will be negative (also rounding up).

Author: Davies Liu <davies@databricks.com>

Closes #9502 from davies/neg_hour.
2015-11-05 17:02:22 -08:00
Dilip Biswal fc27dfbf0f [SPARK-11024][SQL] Optimize NULL in <inlist-expressions> by folding it to Literal(null)
Add a rule in optimizer to convert NULL [NOT] IN (expr1,...,expr2) to
Literal(null).

This is a follow up defect to SPARK-8654

cloud-fan Can you please take a look ?

Author: Dilip Biswal <dbiswal@us.ibm.com>

Closes #9348 from dilipbiswal/spark_11024.
2015-10-31 12:55:33 -07:00
Wenchen Fan 87f28fc240 [SPARK-11379][SQL] ExpressionEncoder can't handle top level primitive type correctly
For inner primitive type(e.g. inside `Product`), we use `schemaFor` to get the catalyst type for it, https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/ScalaReflection.scala#L403.

However, for top level primitive type, we use `dataTypeFor`, which is wrong.

Author: Wenchen Fan <wenchen@databricks.com>

Closes #9337 from cloud-fan/encoder.
2015-10-29 11:17:03 +01:00
Wenchen Fan 0cb7662d86 [SPARK-11351] [SQL] support hive interval literal
Author: Wenchen Fan <wenchen@databricks.com>

Closes #9304 from cloud-fan/interval.
2015-10-28 21:35:57 -07:00
Michael Armbrust 5a5f65905a [SPARK-11347] [SQL] Support for joinWith in Datasets
This PR adds a new operation `joinWith` to a `Dataset`, which returns a `Tuple` for each pair where a given `condition` evaluates to true.

```scala
case class ClassData(a: String, b: Int)

val ds1 = Seq(ClassData("a", 1), ClassData("b", 2)).toDS()
val ds2 = Seq(("a", 1), ("b", 2)).toDS()

> ds1.joinWith(ds2, $"_1" === $"a").collect()
res0: Array((ClassData("a", 1), ("a", 1)), (ClassData("b", 2), ("b", 2)))
```

This operation is similar to the relation `join` function with one important difference in the result schema. Since `joinWith` preserves objects present on either side of the join, the result schema is similarly nested into a tuple under the column names `_1` and `_2`.

This type of join can be useful both for preserving type-safety with the original object types as well as working with relational data where either side of the join has column names in common.

## Required Changes to Encoders
In the process of working on this patch, several deficiencies to the way that we were handling encoders were discovered.  Specifically, it turned out to be very difficult to `rebind` the non-expression based encoders to extract the nested objects from the results of joins (and also typed selects that return tuples).

As a result the following changes were made.
 - `ClassEncoder` has been renamed to `ExpressionEncoder` and has been improved to also handle primitive types.  Additionally, it is now possible to take arbitrary expression encoders and rewrite them into a single encoder that returns a tuple.
 - All internal operations on `Dataset`s now require an `ExpressionEncoder`.  If the users tries to pass a non-`ExpressionEncoder` in, an error will be thrown.  We can relax this requirement in the future by constructing a wrapper class that uses expressions to project the row to the expected schema, shielding the users code from the required remapping.  This will give us a nice balance where we don't force user encoders to understand attribute references and binding, but still allow our native encoder to leverage runtime code generation to construct specific encoders for a given schema that avoid an extra remapping step.
 - Additionally, the semantics for different types of objects are now better defined.  As stated in the `ExpressionEncoder` scaladoc:
  - Classes will have their sub fields extracted by name using `UnresolvedAttribute` expressions
  and `UnresolvedExtractValue` expressions.
  - Tuples will have their subfields extracted by position using `BoundReference` expressions.
  - Primitives will have their values extracted from the first ordinal with a schema that defaults
  to the name `value`.
 - Finally, the binding lifecycle for `Encoders` has now been unified across the codebase.  Encoders are now `resolved` to the appropriate schema in the constructor of `Dataset`.  This process replaces an unresolved expressions with concrete `AttributeReference` expressions.  Binding then happens on demand, when an encoder is going to be used to construct an object.  This closely mirrors the lifecycle for standard expressions when executing normal SQL or `DataFrame` queries.

Author: Michael Armbrust <michael@databricks.com>

Closes #9300 from marmbrus/datasets-tuples.
2015-10-27 13:28:52 -07:00
Jia Li 958a0ec8fa [SPARK-11277][SQL] sort_array throws exception scala.MatchError
I'm new to spark. I was trying out the sort_array function then hit this exception. I looked into the spark source code. I found the root cause is that sort_array does not check for an array of NULLs. It's not meaningful to sort an array of entirely NULLs anyway.

I'm adding a check on the input array type to SortArray. If the array consists of NULLs entirely, there is no need to sort such array. I have also added a test case for this.

Please help to review my fix. Thanks!

Author: Jia Li <jiali@us.ibm.com>

Closes #9247 from jliwork/SPARK-11277.
2015-10-27 10:57:08 +01:00
Davies Liu 487d409e71 [SPARK-11243][SQL] zero out padding bytes in UnsafeRow
For nested StructType, the underline buffer could be used for others before, we should zero out the padding bytes for those primitive types that have less than 8 bytes.

cc cloud-fan

Author: Davies Liu <davies@databricks.com>

Closes #9217 from davies/zero_out.
2015-10-23 01:33:14 -07:00