Commit graph

829 commits

Author SHA1 Message Date
gatorsmile 5f686cc8b7 [SPARK-12656] [SQL] Implement Intersect with Left-semi Join
Our current Intersect physical operator simply delegates to RDD.intersect. We should remove the Intersect physical operator and simply transform a logical intersect into a semi-join with distinct. This way, we can take advantage of all the benefits of join implementations (e.g. managed memory, code generation, broadcast joins).

After a search, I found that at least one mainstream RDBMS does the same: in its query explain output, Intersect is replaced by a left-semi join. Left-semi joins can also enable outer-join elimination in the optimizer, as shown in the PR: https://github.com/apache/spark/pull/10566
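
For illustration, a minimal sketch of the rewrite as a Catalyst rule, assuming null-safe equality on corresponding output columns (the rule name and exact shape are illustrative, not necessarily the code in this PR):

```scala
import org.apache.spark.sql.catalyst.expressions.{And, EqualNullSafe}
import org.apache.spark.sql.catalyst.plans.LeftSemi
import org.apache.spark.sql.catalyst.plans.logical.{Distinct, Intersect, Join, LogicalPlan}
import org.apache.spark.sql.catalyst.rules.Rule

object RewriteIntersectAsSemiJoin extends Rule[LogicalPlan] {
  def apply(plan: LogicalPlan): LogicalPlan = plan transform {
    case Intersect(left, right) =>
      // Null-safe equality on each pair of output columns preserves SQL
      // INTERSECT semantics for NULL values.
      val condition = left.output.zip(right.output)
        .map { case (l, r) => EqualNullSafe(l, r) }
        .reduceOption(And)
      // The semi join keeps left rows that have a match on the right;
      // Distinct then removes duplicates, matching INTERSECT semantics.
      Distinct(Join(left, right, LeftSemi, condition))
  }
}
```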

Author: gatorsmile <gatorsmile@gmail.com>
Author: xiaoli <lixiao1983@gmail.com>
Author: Xiao Li <xiaoli@Xiaos-MacBook-Pro.local>

Closes #10630 from gatorsmile/IntersectBySemiJoin.
2016-01-29 11:22:12 -08:00
Liang-Chi Hsieh 4637fc08a3 [SPARK-11955][SQL] Mark optional fields in merging schema for safely pushing down filters in Parquet
JIRA: https://issues.apache.org/jira/browse/SPARK-11955

Currently we simply skip pushing down filters in Parquet if schema merging is enabled.

However, we can mark the optional fields in the merged schema, so that filters on the remaining fields can still be safely pushed down to Parquet.

Author: Liang-Chi Hsieh <viirya@appier.com>
Author: Liang-Chi Hsieh <viirya@gmail.com>

Closes #9940 from viirya/safe-pushdown-parquet-filters.
2016-01-28 16:25:21 -08:00
Nong Li 555127387a [SPARK-12854][SQL] Implement complex types support in ColumnarBatch
This patch adds support for complex types in ColumnarBatch. ColumnarBatch supports structs and arrays. There is a simple mapping from the richer Catalyst types to these two. Strings are treated as an array of bytes.

ColumnarBatch will contain a column for each node of the schema. Non-complex schemas consist of just leaf nodes. Structs represent an internal node with one child for each field. Arrays are internal nodes with one child. Structs contain only nullability. Arrays contain offsets and lengths into the child array. This structure can handle arbitrary nesting. It has the key property that the layout stays columnar throughout and that primitive types are only stored in the leaf nodes, contiguous across rows. For example, if the schema is
```
array<array<int>>
```
there are three columns in the schema. The internal nodes each have one child. The leaf node contains all the int data stored consecutively.
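
For concreteness, a hypothetical layout of two rows `[[1, 2], [3]]` and `[[4]]` under this scheme:

```
row 0 = [[1, 2], [3]]        row 1 = [[4]]

root array column   (offset, length) into child:  (0, 2), (2, 1)
child array column  (offset, length) into leaf:   (0, 2), (2, 1), (3, 1)
leaf int column     (contiguous values):           1, 2, 3, 4
```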

As part of this, this patch adds append APIs in addition to the put APIs (e.g. putLong(rowid, v) vs. appendLong(v)). The append APIs are necessary when the batch contains variable-length elements; the vectors are not fixed length and will grow as necessary. This should make usage a lot simpler for the writer.

Author: Nong Li <nong@databricks.com>

Closes #10820 from nongli/spark-12854.
2016-01-26 17:34:01 -08:00
Wenchen Fan be375fcbd2 [SPARK-12879] [SQL] improve the unsafe row writing framework
As we begin to use the unsafe row writing framework (`BufferHolder` and `UnsafeRowWriter`) in more and more places (`UnsafeProjection`, `UnsafeRowParquetRecordReader`, `GenerateColumnAccessor`, etc.), we should document it better and make it easier to use.

This PR abstracts the technique used in `UnsafeRowParquetRecordReader`: avoid unnecessary operations as much as possible. For example, instead of always pointing the row to the buffer at the end, we only need to update the row size. If all fields are of primitive type, we can even skip the row size update. We can then apply this technique to more places easily.

A local benchmark shows `UnsafeProjection` is up to 1.7x faster after this PR:
**old version**
```
Intel(R) Core(TM) i7-4960HQ CPU  2.60GHz
unsafe projection:                 Avg Time(ms)    Avg Rate(M/s)  Relative Rate
-------------------------------------------------------------------------------
single long                             2616.04           102.61         1.00 X
single nullable long                    3032.54            88.52         0.86 X
primitive types                         9121.05            29.43         0.29 X
nullable primitive types               12410.60            21.63         0.21 X
```

**new version**
```
Intel(R) Core(TM) i7-4960HQ CPU  2.60GHz
unsafe projection:                 Avg Time(ms)    Avg Rate(M/s)  Relative Rate
-------------------------------------------------------------------------------
single long                             1533.34           175.07         1.00 X
single nullable long                    2306.73           116.37         0.66 X
primitive types                         8403.93            31.94         0.18 X
nullable primitive types               12448.39            21.56         0.12 X
```

For a single non-nullable long (the best case), we get about a 1.7x speedup. Even when it's nullable, we still get a 1.3x speedup. For the other cases the boost is smaller, since the saved operations account for only a small proportion of the whole process.  The benchmark code is included in this PR.

Author: Wenchen Fan <wenchen@databricks.com>

Closes #10809 from cloud-fan/unsafe-projection.
2016-01-25 16:23:59 -08:00
Reynold Xin 423783a08b [SPARK-12904][SQL] Strength reduction for integral and decimal literal comparisons
This pull request implements strength reduction for comparisons between integral expressions and decimal literals, which are more common now because we switched to parsing fractional literals as decimal types (rather than doubles). I added the rules to the existing DecimalPrecision rule with some refactoring to simplify the control flow. I also moved the DecimalPrecision rule into its own file due to its growing size.
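
As a hedged illustration of the idea (hypothetical literals; the PR's exact rule set may differ), comparisons of an INT column `i` against a fractional decimal literal can be rewritten so that no per-row cast is needed:

```
CAST(i AS DECIMAL(11, 1)) >  1.5   ==>   i >  1
CAST(i AS DECIMAL(11, 1)) >= 1.5   ==>   i >= 2
CAST(i AS DECIMAL(11, 1)) <  1.5   ==>   i <  2
```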

Author: Reynold Xin <rxin@databricks.com>

Closes #10882 from rxin/SPARK-12904-1.
2016-01-23 12:13:05 -08:00
Wenchen Fan f3934a8d65 [SPARK-12888][SQL] benchmark the new hash expression
Benchmarked it on 4 different schemas; the results:
```
Intel(R) Core(TM) i7-4960HQ CPU  2.60GHz
Hash For simple:                   Avg Time(ms)    Avg Rate(M/s)  Relative Rate
-------------------------------------------------------------------------------
interpreted version                       31.47           266.54         1.00 X
codegen version                           64.52           130.01         0.49 X
```

```
Intel(R) Core(TM) i7-4960HQ CPU  2.60GHz
Hash For normal:                   Avg Time(ms)    Avg Rate(M/s)  Relative Rate
-------------------------------------------------------------------------------
interpreted version                     4068.11             0.26         1.00 X
codegen version                         1175.92             0.89         3.46 X
```

```
Intel(R) Core(TM) i7-4960HQ CPU  2.60GHz
Hash For array:                    Avg Time(ms)    Avg Rate(M/s)  Relative Rate
-------------------------------------------------------------------------------
interpreted version                     9276.70             0.06         1.00 X
codegen version                        14762.23             0.04         0.63 X
```

```
Intel(R) Core(TM) i7-4960HQ CPU  2.60GHz
Hash For map:                      Avg Time(ms)    Avg Rate(M/s)  Relative Rate
-------------------------------------------------------------------------------
interpreted version                    58869.79             0.01         1.00 X
codegen version                         9285.36             0.06         6.34 X
```

Author: Wenchen Fan <wenchen@databricks.com>

Closes #10816 from cloud-fan/hash-benchmark.
2016-01-20 15:08:27 -08:00
gatorsmile 8f90c15187 [SPARK-12616][SQL] Making Logical Operator Union Support Arbitrary Number of Children
The existing `Union` logical operator only supports two children. Thus, this PR adds a new logical operator `Unions`, which can have an arbitrary number of children, to replace the existing one.

The `Union` logical plan is a binary node. However, a typical use case for union is to union a very large number of input sources (DataFrames, RDDs, or files). It is not uncommon to union hundreds of thousands of files. In this case, our optimizer can become very slow due to the large number of logical unions. We should change the Union logical plan to support an arbitrary number of children, and add a single rule in the optimizer to collapse all adjacent `Unions` into a single `Unions`. Note that this problem doesn't exist in the physical plan, because the physical union already supports an arbitrary number of children.
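
A minimal sketch of such a collapsing rule, assuming the n-ary operator holds its children in a Seq (names and shape are illustrative; the optimizer runs rules to a fixed point, which flattens arbitrarily deep trees):

```scala
import org.apache.spark.sql.catalyst.expressions.Attribute
import org.apache.spark.sql.catalyst.plans.logical.LogicalPlan
import org.apache.spark.sql.catalyst.rules.Rule

// Hypothetical n-ary union operator, as described above.
case class Unions(children: Seq[LogicalPlan]) extends LogicalPlan {
  override def output: Seq[Attribute] = children.head.output
}

object CollapseUnions extends Rule[LogicalPlan] {
  def apply(plan: LogicalPlan): LogicalPlan = plan transform {
    // Splice any child Unions directly into the parent.
    case Unions(children) if children.exists(_.isInstanceOf[Unions]) =>
      Unions(children.flatMap {
        case Unions(grandchildren) => grandchildren
        case other                 => Seq(other)
      })
  }
}
```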

Author: gatorsmile <gatorsmile@gmail.com>
Author: xiaoli <lixiao1983@gmail.com>
Author: Xiao Li <xiaoli@Xiaos-MacBook-Pro.local>

Closes #10577 from gatorsmile/unionAllMultiChildren.
2016-01-20 14:59:30 -08:00
Davies Liu 8e4f894e98 [SPARK-12881] [SQL] subexpression elimination in mutable projection
Author: Davies Liu <davies@databricks.com>

Closes #10814 from davies/mutable_subexpr.
2016-01-20 10:02:40 -08:00
Reynold Xin 753b194511 [SPARK-12912][SQL] Add a test suite for EliminateSubQueries
Also updated the documentation to explain why ComputeCurrentTime and EliminateSubQueries are in the optimizer rather than the analyzer.

Author: Reynold Xin <rxin@databricks.com>

Closes #10837 from rxin/optimizer-analyzer-comment.
2016-01-20 00:00:28 -08:00
Reynold Xin 3e84ef0a54 [SPARK-12770][SQL] Implement rules for branch elimination for CaseWhen
The three optimization cases are (a sketch follows the list):

1. If the first branch's condition is a true literal, remove the CaseWhen and use the value from that branch.
2. If a branch's condition is a false or null literal, remove that branch.
3. If only the else branch is left, remove the CaseWhen and use the value from the else branch.
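
A minimal sketch of the three cases, assuming the pair-based `CaseWhen(branches, elseValue)` representation introduced by SPARK-12791 (not necessarily the exact code in this PR):

```scala
import org.apache.spark.sql.catalyst.expressions.{CaseWhen, Expression, Literal}
import org.apache.spark.sql.catalyst.plans.logical.LogicalPlan
import org.apache.spark.sql.catalyst.rules.Rule
import org.apache.spark.sql.types.BooleanType

object SimplifyCaseWhenSketch extends Rule[LogicalPlan] {
  private val TrueLit = Literal(true, BooleanType)
  private def isFalseOrNull(e: Expression): Boolean =
    e == Literal(false, BooleanType) || e == Literal(null, BooleanType)

  def apply(plan: LogicalPlan): LogicalPlan = plan transformAllExpressions {
    // Case 1: the first condition is a true literal -- the whole CaseWhen
    // collapses to that branch's value.
    case CaseWhen((TrueLit, value) +: _, _) => value

    // Case 2: drop branches whose condition is a false or null literal.
    // Case 3: if no branch survives, use the else value (or null).
    case cw @ CaseWhen(branches, elseValue) if branches.exists(b => isFalseOrNull(b._1)) =>
      val kept = branches.filterNot(b => isFalseOrNull(b._1))
      if (kept.nonEmpty) CaseWhen(kept, elseValue)
      else elseValue.getOrElse(Literal.create(null, cw.dataType))
  }
}
```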

Author: Reynold Xin <rxin@databricks.com>

Closes #10827 from rxin/SPARK-12770.
2016-01-19 16:14:41 -08:00
Jakob Odersky c78e2080e0 [SPARK-12816][SQL] De-alias type when generating schemas
Call `dealias` on local types to fix schema generation for abstract type members, such as

```scala
type KeyValue = (Int, String)
```

Add simple test

Author: Jakob Odersky <jodersky@gmail.com>

Closes #10749 from jodersky/aliased-schema.
2016-01-19 12:31:03 -08:00
Reynold Xin 44fcf992aa [SPARK-12873][SQL] Add more comment in HiveTypeCoercion for type widening
I was reading this part of the analyzer code again and got confused by the difference between findWiderTypeForTwo and findTightestCommonTypeOfTwo.

I also reworked WidenSetOperationTypes to make it a lot simpler. The easiest way to review this one is to just read the original code and the new code; the logic is super simple.

Author: Reynold Xin <rxin@databricks.com>

Closes #10802 from rxin/SPARK-12873.
2016-01-18 11:08:44 -08:00
Herman van Hovell 7cd7f22025 [SPARK-12575][SQL] Grammar parity with existing SQL parser
In this PR the new CatalystQl parser stack reaches grammar parity with the old Parser-Combinator based SQL Parser. This PR also replaces all uses of the old Parser, and removes it from the code base.

Although the existing Hive and SQL parser dialects were mostly the same, some kinks had to be worked out:
- The SQL Parser allowed syntax like ```APPROXIMATE(0.01) COUNT(DISTINCT a)```. In order to make this work we needed to hardcode approximate operators in the parser, or we would have to create an approximate expression. ```APPROXIMATE_COUNT_DISTINCT(a, 0.01)``` would also do the job and is much easier to maintain. So, this PR **removes** this keyword.
- The old SQL Parser supports ```LIMIT``` clauses in nested queries. This is **not supported** anymore. See https://github.com/apache/spark/pull/10689 for the rationale for this.
- Hive supports a charset-name/charset-literal combination; for instance, the expression ```_ISO-8859-1 0x4341464562616265``` would yield this string: ```CAFEbabe```. Hive only allows charset names to start with an underscore. This is quite annoying in Spark, because as soon as you use a tuple, names will start with an underscore. In this PR we **remove** this feature from the parser. It would be quite easy to implement such a feature as an Expression later on.
- Hive and the SQL Parser treat decimal literals differently. Hive will turn any decimal into a ```Double``` whereas the SQL Parser would convert a non-scientific decimal into a ```BigDecimal```, and would turn a scientific decimal into a Double. We follow Hive's behavior here. The new parser supports a big decimal literal, for instance: ```81923801.42BD```, which can be used when a big decimal is needed.

cc rxin viirya marmbrus yhuai cloud-fan

Author: Herman van Hovell <hvanhovell@questtec.nl>

Closes #10745 from hvanhovell/SPARK-12575-2.
2016-01-15 15:19:10 -08:00
Davies Liu c5e7076da7 [MINOR] [SQL] GeneratedExpressionCode -> ExprCode
GeneratedExpressionCode is too long

Author: Davies Liu <davies@databricks.com>

Closes #10767 from davies/renaming.
2016-01-15 08:26:20 -08:00
Michael Armbrust cc7af86afd [SPARK-12813][SQL] Eliminate serialization for back to back operations
The goal of this PR is to eliminate unnecessary translations when there are back-to-back `MapPartitions` operations.  In order to achieve this I also made the following simplifications:

 - Operators no longer hold encoders; instead they have only the expressions that they need.  The benefits here are twofold: the expressions are visible to transformations, so they go through the normal resolution/binding process, and now that they are visible we can change them on a case-by-case basis.
 - Operators no longer have type parameters.  Since the engine is responsible for its own type checking, having the types visible to the compiler was an unnecessary complication.  We still leverage the Scala compiler in the companion factory when constructing a new operator, but after this the types are discarded.

Deferred to a follow up PR:
 - Remove as much of the resolution/binding from Dataset/GroupedDataset as possible. We should still eagerly check resolution, though, and throw an error in the case of mismatches for an `as` operation.
 - Eliminate serializations in more cases by adding more cases to `EliminateSerialization`

Author: Michael Armbrust <michael@databricks.com>

Closes #10747 from marmbrus/encoderExpressions.
2016-01-14 17:44:56 -08:00
Reynold Xin cbbcd8e425 [SPARK-12791][SQL] Simplify CaseWhen by breaking "branches" into "conditions" and "values"
This pull request rewrites the CaseWhen expression to break the single, monolithic "branches" field into a sequence of tuples (Seq[(condition, value)]) and an explicit optional elseValue field.

Prior to this pull request, each even position in "branches" represented the condition for a branch, and each odd position represented its value. Their use has been pretty confusing, with a lot of sliding-window or grouped(2) calls.
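
Schematically (with hypothetical condition/value expressions), the change in representation is:

```
before: CaseWhen(Seq(c1, v1, c2, v2, elseVal))             // flat, positional
after:  CaseWhen(Seq((c1, v1), (c2, v2)), Some(elseVal))   // explicit pairs + else
```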

Author: Reynold Xin <rxin@databricks.com>

Closes #10734 from rxin/simplify-case.
2016-01-13 12:44:35 -08:00
Wenchen Fan c2ea79f96a [SPARK-12642][SQL] improve the hash expression to be decoupled from unsafe row
https://issues.apache.org/jira/browse/SPARK-12642

Author: Wenchen Fan <wenchen@databricks.com>

Closes #10694 from cloud-fan/hash-expr.
2016-01-13 12:29:02 -08:00
Kousuke Saruta cb7b864a24 [SPARK-12692][BUILD][SQL] Scala style: Fix the style violation (Space before ",")
Fix the style violation (space before "," and ":").
This PR is a follow-up for #10643 and a rework of #10685.

Author: Kousuke Saruta <sarutak@oss.nttdata.co.jp>

Closes #10732 from sarutak/SPARK-12692-followup-sql.
2016-01-12 22:25:20 -08:00
Reynold Xin b3b9ad23cf [SPARK-12788][SQL] Simplify BooleanEquality by using casts.
Author: Reynold Xin <rxin@databricks.com>

Closes #10730 from rxin/SPARK-12788.
2016-01-12 18:45:55 -08:00
Reynold Xin 0d543b98f3 Revert "[SPARK-12692][BUILD][SQL] Scala style: Fix the style violation (Space before "," or ":")"
This reverts commit 8cfa218f4f.
2016-01-12 12:56:52 -08:00
Reynold Xin 0ed430e315 [SPARK-12768][SQL] Remove CaseKeyWhen expression
This patch removes the CaseKeyWhen expression and replaces it with a factory method that generates the equivalent CaseWhen. This reduces the amount of code we'd need to maintain in the future for both code generation and the optimizer.

Note that we introduced CaseKeyWhen to avoid duplicate evaluations of the key. This is no longer a problem because we now have common subexpression elimination.
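
A hedged sketch of such a factory (assumed shape): `CASE key WHEN k THEN v ... ELSE e END` becomes a plain `CaseWhen` over equality predicates.

```scala
import org.apache.spark.sql.catalyst.expressions.{CaseWhen, EqualTo, Expression}

object CaseKeyWhen {
  // `branches` alternates WHEN keys and THEN values; an odd trailing
  // expression, if present, is the ELSE value.
  def apply(key: Expression, branches: Seq[Expression]): Expression = {
    val cases = branches.grouped(2).collect {
      case Seq(whenKey, thenValue) => (EqualTo(key, whenKey), thenValue)
    }.toSeq
    val elseValue = if (branches.size % 2 == 1) Some(branches.last) else None
    // The repeated `key` is de-duplicated later by common subexpression
    // elimination, as noted above.
    CaseWhen(cases, elseValue)
  }
}
```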

Author: Reynold Xin <rxin@databricks.com>

Closes #10722 from rxin/SPARK-12768.
2016-01-12 11:13:08 -08:00
Reynold Xin 1d88879530 [SPARK-12762][SQL] Add unit test for SimplifyConditionals optimization rule
This pull request does a few small things:

1. Separated if simplification from BooleanSimplification and created a new rule SimplifyConditionals. In the future we can also simplify other conditional expressions here.

2. Added unit test for SimplifyConditionals.

3. Renamed SimplifyCaseConversionExpressionsSuite to SimplifyStringCaseConversionSuite

Author: Reynold Xin <rxin@databricks.com>

Closes #10716 from rxin/SPARK-12762.
2016-01-12 10:58:57 -08:00
Kousuke Saruta 8cfa218f4f [SPARK-12692][BUILD][SQL] Scala style: Fix the style violation (Space before "," or ":")
Fix the style violation (space before "," and ":").
This PR is a follow-up for #10643.

Author: Kousuke Saruta <sarutak@oss.nttdata.co.jp>

Closes #10718 from sarutak/SPARK-12692-followup-sql.
2016-01-12 00:51:00 -08:00
Herman van Hovell fe9eb0b0ce [SPARK-12576][SQL] Enable expression parsing in CatalystQl
The PR allows us to use the new SQL parser to parse SQL expressions such as: ```1 + sin(x*x)```

We enable this functionality in this PR, but we will not start using this actively yet. This will be done as soon as we have reached grammar parity with the existing parser stack.

cc rxin

Author: Herman van Hovell <hvanhovell@questtec.nl>

Closes #10649 from hvanhovell/SPARK-12576.
2016-01-11 16:29:37 -08:00
Liang-Chi Hsieh 95cd5d95ce [SPARK-12577] [SQL] Better support of parentheses in partition by and order by clause of window function's over clause
JIRA: https://issues.apache.org/jira/browse/SPARK-12577

Author: Liang-Chi Hsieh <viirya@gmail.com>

Closes #10620 from viirya/fix-parentheses.
2016-01-08 21:48:06 -08:00
Cheng Lian d9447cac74 [SPARK-12593][SQL] Converts resolved logical plan back to SQL
This PR tries to enable Spark SQL to convert resolved logical plans back to SQL query strings.  For now, the major use case is to canonicalize Spark SQL native view support.  The major entry point is `SQLBuilder.toSQL`, which returns an `Option[String]` if the logical plan is recognized.

The current version is still in WIP status, and is quite limited.  Known limitations include:

1.  The logical plan must be analyzed but not optimized

    The optimizer erases `Subquery` operators, which contain necessary scope information for SQL generation.  Future versions should be able to recover erased scope information by inserting subqueries when necessary.

2.  The logical plan must be created from a HiveQL query string

    Query plans generated by composing arbitrary DataFrame API combinations are not supported yet.  Operators within these query plans need to be rearranged into a canonical form that is more suitable for direct SQL generation.  For example, the following query plan

    ```
    Filter (a#1 < 10)
     +- MetastoreRelation default, src, None
    ```

    needs to be canonicalized into the following form before SQL generation:

    ```
    Project [a#1, b#2, c#3]
     +- Filter (a#1 < 10)
         +- MetastoreRelation default, src, None
    ```

    Otherwise, the SQL generation process will have to handle a large number of special cases.

3.  Only a fraction of expressions and basic logical plan operators are supported in this PR

    Currently, 95.7% (1720 out of 1798) query plans in `HiveCompatibilitySuite` can be successfully converted to SQL query strings.

    Known unsupported components are:

    - Expressions
      - Part of math expressions
      - Part of string expressions (buggy?)
      - Null expressions
      - Calendar interval literal
      - Part of date time expressions
      - Complex type creators
      - Special `NOT` expressions, e.g. `NOT LIKE` and `NOT IN`
    - Logical plan operators/patterns
      - Cube, rollup, and grouping set
      - Script transformation
      - Generator
      - Distinct aggregation patterns that fit `DistinctAggregationRewriter` analysis rule
      - Window functions

    Support for window functions, generators, and cubes etc. will be added in follow-up PRs.

This PR leverages `HiveCompatibilitySuite` for testing SQL generation in a "round-trip" manner:

*   For all select queries, we try to convert them back to SQL
*   If the query plan is convertible, we parse the generated SQL into a new logical plan
*   Run the new logical plan instead of the original one

If the query plan is inconvertible, the test case simply falls back to the original logic.

TODO

- [x] Fix failed test cases
- [x] Support for more basic expressions and logical plan operators (e.g. distinct aggregation etc.)
- [x] Comments and documentation

Author: Cheng Lian <lian@databricks.com>

Closes #10541 from liancheng/sql-generation.
2016-01-08 14:08:13 -08:00
Liang-Chi Hsieh cfe1ba56e4 [SPARK-12687] [SQL] Support from clause surrounded by ().
JIRA: https://issues.apache.org/jira/browse/SPARK-12687

Some queries, such as `(select 1 as a) union (select 2 as a)`, didn't work. This patch fixes that.

Author: Liang-Chi Hsieh <viirya@gmail.com>

Closes #10660 from viirya/fix-union.
2016-01-08 09:50:41 -08:00
Sean Owen b9c8353378 [SPARK-12618][CORE][STREAMING][SQL] Clean up build warnings: 2.0.0 edition
Fix most build warnings: mostly deprecated API usages. I'll annotate some of the changes below. CC rxin who is leading the charge to remove the deprecated APIs.

Author: Sean Owen <sowen@cloudera.com>

Closes #10570 from srowen/SPARK-12618.
2016-01-08 17:47:44 +00:00
Davies Liu fd1dcfaf26 [SPARK-12542][SQL] support except/intersect in HiveQl
Parse SQL queries with except/intersect in the FROM clause for HiveQL.

Author: Davies Liu <davies@databricks.com>

Closes #10622 from davies/intersect.
2016-01-06 23:46:12 -08:00
Marcelo Vanzin b3ba1be3b7 [SPARK-3873][TESTS] Import ordering fixes.
Author: Marcelo Vanzin <vanzin@cloudera.com>

Closes #10582 from vanzin/SPARK-3873-tests.
2016-01-05 19:07:39 -08:00
Liang-Chi Hsieh d202ad2fc2 [SPARK-12439][SQL] Fix toCatalystArray and MapObjects
JIRA: https://issues.apache.org/jira/browse/SPARK-12439

In toCatalystArray, we should look at the data type returned by dataTypeFor, instead of silentSchemaFor, to determine if the element is a native type. An obvious problem occurs when the element is of the Option[Int] class: silentSchemaFor will return Int, and we will wrongly recognize the element as a native type.

There is another problem when using Option as an array element. When we encode data like Seq(Some(1), Some(2), None) with an encoder, we will later use MapObjects to construct an array for it. But in MapObjects, we don't check whether the return value of lambdaFunction is null. That causes a bug where the decoded data for Seq(Some(1), Some(2), None) would be Seq(1, 2, -1) instead of Seq(1, 2, null).

Author: Liang-Chi Hsieh <viirya@gmail.com>

Closes #10391 from viirya/fix-catalystarray.
2016-01-05 12:33:21 -08:00
Wenchen Fan 76768337be [SPARK-12480][FOLLOW-UP] use a single column vararg for hash
address comments in #10435

This makes the API easier to use when users programmatically generate calls to hash, and they will get an analysis exception if the argument list of hash is empty.

Author: Wenchen Fan <wenchen@databricks.com>

Closes #10588 from cloud-fan/hash.
2016-01-05 10:23:36 -08:00
Liang-Chi Hsieh b3c48e39f4 [SPARK-12438][SQL] Add SQLUserDefinedType support for encoder
JIRA: https://issues.apache.org/jira/browse/SPARK-12438

ScalaReflection lacks support for SQLUserDefinedType. We should add it.

Author: Liang-Chi Hsieh <viirya@gmail.com>

Closes #10390 from viirya/encoder-udt.
2016-01-05 10:19:56 -08:00
Michael Armbrust 53beddc5bf [SPARK-12568][SQL] Add BINARY to Encoders
Author: Michael Armbrust <michael@databricks.com>

Closes #10516 from marmbrus/datasetCleanup.
2016-01-04 23:23:41 -08:00
Wenchen Fan b1a771231e [SPARK-12480][SQL] add Hash expression that can calculate hash value for a group of expressions
Just write the arguments into an unsafe row and use murmur3 to calculate the hash code.

Author: Wenchen Fan <wenchen@databricks.com>

Closes #10435 from cloud-fan/hash-expr.
2016-01-04 18:49:41 -08:00
Herman van Hovell 0171b71e95 [SPARK-12421][SQL] Prevent Internal/External row from exposing state.
It is currently possible to change the values of the supposedly immutable ```GenericRow``` and ```GenericInternalRow``` classes. This is caused by the fact that calling ```toArray``` on the wrapped array returned by ```toSeq``` will return the backing array instead of a copy. This PR fixes this problem.
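
A standalone illustration of the underlying Scala behavior (Scala 2.10/2.11; hypothetical values):

```scala
val backing = Array("a", "b", "c")
val seq: Seq[String] = backing.toSeq   // WrappedArray sharing `backing`
val leaked = seq.toArray               // may return `backing` itself, not a copy
leaked(0) = "mutated"
println(backing.mkString(", "))        // prints: mutated, b, c
```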

This PR was inspired by https://github.com/apache/spark/pull/10374 by apo1.

cc apo1 sarutak marmbrus cloud-fan nongli (everyone in the previous conversation).

Author: Herman van Hovell <hvanhovell@questtec.nl>

Closes #10553 from hvanhovell/SPARK-12421.
2016-01-04 12:41:57 -08:00
Liang-Chi Hsieh c9dbfcc653 [SPARK-11743][SQL] Move the test for arrayOfUDT
A follow-up PR for #9712. Move the test for arrayOfUDT.

Author: Liang-Chi Hsieh <viirya@gmail.com>

Closes #10538 from viirya/move-udt-test.
2015-12-31 23:48:05 -08:00
Davies Liu e6c77874b9 [SPARK-12585] [SQL] move numFields to constructor of UnsafeRow
Right now, numFields will be passed in by pointTo(), then bitSetWidthInBytes is calculated, making pointTo() a little bit heavy.

It should be part of the constructor of UnsafeRow.

Author: Davies Liu <davies@databricks.com>

Closes #10528 from davies/numFields.
2015-12-30 22:16:37 -08:00
Wenchen Fan aa48164a43 [SPARK-12495][SQL] use true as default value for propagateNull in NewInstance
In most cases we should propagate null when calling `NewInstance`, and so far there is only one case where we should stop null propagation: creating a product/Java bean. So I think it makes more sense to propagate null by default.

This also fixes a bug when encoding a null array/map, which was first discovered in https://github.com/apache/spark/pull/10401

Author: Wenchen Fan <wenchen@databricks.com>

Closes #10443 from cloud-fan/encoder.
2015-12-30 10:56:08 -08:00
Stephan Kessler a6a4812434 [SPARK-7727][SQL] Avoid inner classes in RuleExecutor
Moved the (case) classes Strategy, Once, FixedPoint and Batch to the companion object. This is necessary if we want to make the Optimizer easily extendable in the following sense: usually a user wants to add additional rules and just take the ones that are already there. However, inner classes made that impossible, since the code did not compile.

This allows easy extension of existing Optimizers; see DefaultOptimizerExtendableSuite for a corresponding test case.

Author: Stephan Kessler <stephan.kessler@sap.com>

Closes #10174 from stephankessler/SPARK-7727.
2015-12-28 12:46:20 -08:00
Liang-Chi Hsieh 50301c0a28 [SPARK-11164][SQL] Add InSet pushdown filter back for Parquet
When the filter is ```"b in ('1', '2')"```, the filter is not pushed down to Parquet. Thanks!

Author: gatorsmile <gatorsmile@gmail.com>
Author: xiaoli <lixiao1983@gmail.com>
Author: Xiao Li <xiaoli@Xiaos-MacBook-Pro.local>

Closes #10278 from gatorsmile/parquetFilterNot.
2015-12-23 14:08:29 +08:00
Dilip Biswal b374a25831 [SPARK-12102][SQL] Cast a non-nullable struct field to a nullable field during analysis
Compare both the left and right side of the case expression, ignoring nullability, when checking for type equality.

Author: Dilip Biswal <dbiswal@us.ibm.com>

Closes #10156 from dilipbiswal/spark-12102.
2015-12-22 15:21:49 -08:00
Cheng Lian 42bfde2983 [SPARK-12371][SQL] Runtime nullability check for NewInstance
This PR adds a new expression `AssertNotNull` to ensure non-nullable fields of products and case classes don't receive null values at runtime.

Author: Cheng Lian <lian@databricks.com>

Closes #10331 from liancheng/dataset-nullability-check.
2015-12-22 19:41:44 +08:00
Davies Liu 4af647c77d [SPARK-12054] [SQL] Consider nullability of expression in codegen
This could simplify the generated code for expressions that are not nullable.

This PR fixes lots of bugs related to nullability.

Author: Davies Liu <davies@databricks.com>

Closes #10333 from davies/skip_nullable.
2015-12-18 10:09:17 -08:00
Herman van Hovell 658f66e620 [SPARK-8641][SQL] Native Spark Window functions
This PR removes Hive window functions from Spark and replaces them with (native) Spark ones. The PR is on par with Hive in terms of features.

This has the following advantages:
* Better memory management.
* The ability to use spark UDAFs in Window functions.

cc rxin / yhuai

Author: Herman van Hovell <hvanhovell@questtec.nl>

Closes #9819 from hvanhovell/SPARK-8641-2.
2015-12-17 15:16:35 -08:00
Wenchen Fan a783a8ed49 [SPARK-12320][SQL] throw exception if the number of fields does not line up for Tuple encoder
Author: Wenchen Fan <wenchen@databricks.com>

Closes #10293 from cloud-fan/err-msg.
2015-12-16 13:20:12 -08:00
Davies Liu 54c512ba90 [SPARK-8745] [SQL] remove GenerateProjection
cc rxin

Author: Davies Liu <davies@databricks.com>

Closes #10316 from davies/remove_generate_projection.
2015-12-16 10:22:48 -08:00
Wenchen Fan a89e8b6122 [SPARK-10477][SQL] using DSL in ColumnPruningSuite to improve readability
Author: Wenchen Fan <cloud0fan@outlook.com>

Closes #8645 from cloud-fan/test.
2015-12-15 18:29:19 -08:00
Wenchen Fan d8ec081c91 [SPARK-12252][SPARK-12131][SQL] refactor MapObjects to make it less hacky
In https://github.com/apache/spark/pull/10133 we found that we should ensure the children of `TreeNode` are all accessible in the `productIterator`, or the behavior will be very confusing.

In this PR, I try to fix this problem by exposing the `loopVar`.

This also fixes SPARK-12131 which is caused by the hacky `MapObjects`.

Author: Wenchen Fan <wenchen@databricks.com>

Closes #10239 from cloud-fan/map-objects.
2015-12-10 15:11:13 +08:00
Wenchen Fan 381f17b540 [SPARK-12201][SQL] add type coercion rule for greatest/least
Checked with Hive: greatest/least should cast their children to the tightest common type, e.g. `(int, long) => long`, `(int, string) => error`, `(decimal(10,5), decimal(5, 10)) => error`.

Author: Wenchen Fan <wenchen@databricks.com>

Closes #10196 from cloud-fan/type-coercion.
2015-12-08 10:13:40 -08:00
Davies Liu 9cde7d5fa8 [SPARK-12032] [SQL] Re-order inner joins to do join with conditions first
Currently, the order of joins is exactly the same as in the SQL query; some conditions may not be pushed down to the correct join, and those joins then become cross products, which are extremely slow.

This patch tries to re-order the inner joins (which are common in SQL queries), picking the joins that have self-contained conditions first and delaying those that have no conditions.
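
A hedged sketch of the greedy reordering (assumed shape; the actual rule first extracts the inner-join inputs and their conditions from the plan):

```scala
import org.apache.spark.sql.catalyst.expressions.{And, Expression}
import org.apache.spark.sql.catalyst.plans.Inner
import org.apache.spark.sql.catalyst.plans.logical.{Join, LogicalPlan}

// Greedily join in a plan for which some condition is already fully
// evaluable, so no intermediate join degenerates into a cross product.
def createOrderedJoin(input: Seq[LogicalPlan], conditions: Seq[Expression]): LogicalPlan = {
  var joined = input.head
  var remaining = input.tail
  var pending = conditions
  while (remaining.nonEmpty) {
    val candidate = remaining.find { p =>
      pending.exists(_.references.subsetOf(joined.outputSet ++ p.outputSet))
    }.getOrElse(remaining.head) // no usable condition yet: delay, join anyway
    val (applicable, rest) =
      pending.partition(_.references.subsetOf(joined.outputSet ++ candidate.outputSet))
    joined = Join(joined, candidate, Inner, applicable.reduceOption(And))
    remaining = remaining.filterNot(_ eq candidate)
    pending = rest
  }
  joined
}
```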

After this patch, the TPC-DS queries Q64/65 can run hundreds of times faster.

cc marmbrus nongli

Author: Davies Liu <davies@databricks.com>

Closes #10073 from davies/reorder_joins.
2015-12-07 10:34:18 -08:00
gatorsmile 49efd03bac [SPARK-12138][SQL] Escape \u in the generated comments of codegen
When \u appears in a comment block (i.e. in /**/), codegen will break. So, in Expression and CodegenFallback, we escape \u to \\u.
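
The escaping itself is essentially a one-liner (assumed helper name):

```scala
// Replace the two-character sequence \u with \\u so that javac does not
// interpret it as a unicode escape inside the generated comment.
def escapeUnicodeEscape(comment: String): String = comment.replace("\\u", "\\\\u")
```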

yhuai Please review it. I did reproduce it and it works after the fix. Thanks!

Author: gatorsmile <gatorsmile@gmail.com>

Closes #10155 from gatorsmile/escapeU.
2015-12-06 11:15:02 -08:00
Josh Rosen b7204e1d41 [SPARK-12112][BUILD] Upgrade to SBT 0.13.9
We should upgrade to SBT 0.13.9, since this is a requirement in order to use SBT's new Maven-style resolution features (which will be done in a separate patch, because it's blocked by some binary compatibility issues in the POM reader plugin).

I also upgraded Scalastyle to version 0.8.0, which was necessary in order to fix a Scala 2.10.5 compatibility issue (see https://github.com/scalastyle/scalastyle/issues/156). The newer Scalastyle is slightly stricter about whitespace surrounding tokens, so I fixed the new style violations.

Author: Josh Rosen <joshrosen@databricks.com>

Closes #10112 from JoshRosen/upgrade-to-sbt-0.13.9.
2015-12-05 08:15:30 +08:00
Yin Huai 5872a9d89f [SPARK-11352][SQL] Escape */ in the generated comments.
https://issues.apache.org/jira/browse/SPARK-11352

Author: Yin Huai <yhuai@databricks.com>

Closes #10072 from yhuai/SPARK-11352.
2015-12-01 16:24:04 -08:00
Wenchen Fan fd95eeaf49 [SPARK-11954][SQL] Encoder for JavaBeans
Create Java versions of `constructorFor` and `extractorFor` in `JavaTypeInference`.

Author: Wenchen Fan <wenchen@databricks.com>

This patch had conflicts when merged, resolved by
Committer: Michael Armbrust <michael@databricks.com>

Closes #9937 from cloud-fan/pojo.
2015-12-01 10:35:12 -08:00
Wenchen Fan 9df24624af [SPARK-11856][SQL] add type cast if the real type is different but compatible with encoder schema
When we build the `fromRowExpression` for an encoder, we set up a lot of "unresolved" stuff and lose the required data type, which may lead to a runtime error if the real type doesn't match the encoder's schema.
For example, if we build an encoder for `case class Data(a: Int, b: String)` and the real type is `[a: int, b: long]`, then we hit a runtime error saying that we can't construct class `Data` from an int and a long, because we lost the information that `b` should be a string.

Author: Wenchen Fan <wenchen@databricks.com>

Closes #9840 from cloud-fan/err-msg.
2015-12-01 10:24:53 -08:00
Herman van Hovell 3d28081e53 [SPARK-12024][SQL] More efficient multi-column counting.
In https://github.com/apache/spark/pull/9409 we enabled multi-column counting. The approach taken in that PR introduces a bit of overhead by first creating a row only to check if all of the columns are non-null.

This PR fixes that technical debt. Count now takes multiple columns as its input. In order to make this work I have also added support for multiple columns in the single distinct code path.

cc yhuai

Author: Herman van Hovell <hvanhovell@questtec.nl>

Closes #10015 from hvanhovell/SPARK-12024.
2015-11-29 14:13:11 -08:00
Reynold Xin de28e4d4de [SPARK-11973][SQL] Improve optimizer code readability.
This is a followup for https://github.com/apache/spark/pull/9959.

I added more documentation and rewrote some monadic code into simpler ifs.

Author: Reynold Xin <rxin@databricks.com>

Closes #9995 from rxin/SPARK-11973.
2015-11-26 18:47:54 -08:00
Dilip Biswal bc16a67562 [SPARK-11863][SQL] Unable to resolve order by if it contains mixture of aliases and real columns
This is based on https://github.com/apache/spark/pull/9844, with some bug fixes and cleanup.

The problem is that a normal operator should be resolved based on its child, but the `Sort` operator can also be resolved based on its grandchild. So we have 3 rules that can resolve `Sort`: `ResolveReferences`, `ResolveSortReferences` (if the grandchild is `Project`) and `ResolveAggregateFunctions` (if the grandchild is `Aggregate`).
For example, in `select c1 as a , c2 as b from tab group by c1, c2 order by a, c2`, we need to resolve `a` and `c2` for `Sort`. First, `a` is resolved in `ResolveReferences` based on its child, and when we reach `ResolveAggregateFunctions`, we try to resolve both `a` and `c2` based on its grandchild, but fail because `a` is not a legal aggregate expression.

Whoever merges this PR, please give the credit to dilipbiswal.

Author: Dilip Biswal <dbiswal@us.ibm.com>
Author: Wenchen Fan <wenchen@databricks.com>

Closes #9961 from cloud-fan/sort.
2015-11-26 11:31:28 -08:00
Davies Liu 27d69a0573 [SPARK-11973] [SQL] push filter through aggregation with alias and literals
Currently, a filter can't be pushed through an aggregation with aliases or literals; this patch fixes that.

After this patch, the time for TPC-DS query 4 goes down from 141 seconds to 13 seconds (a 10x improvement).

cc nongli  yhuai

Author: Davies Liu <davies@databricks.com>

Closes #9959 from davies/push_filter2.
2015-11-26 00:19:42 -08:00
Daoyuan Wang 21e5606419 [SPARK-11983][SQL] remove all unused codegen fallback trait
Author: Daoyuan Wang <daoyuan.wang@intel.com>

Closes #9966 from adrian-wang/removeFallback.
2015-11-25 13:51:30 -08:00
Wenchen Fan 19530da690 [SPARK-11926][SQL] unify GetStructField and GetInternalRowField
Author: Wenchen Fan <wenchen@databricks.com>

Closes #9909 from cloud-fan/get-struct.
2015-11-24 11:09:01 -08:00
Wenchen Fan f2996e0d12 [SPARK-11921][SQL] fix nullable of encoder schema
Author: Wenchen Fan <wenchen@databricks.com>

Closes #9906 from cloud-fan/nullable.
2015-11-23 10:15:40 -08:00
Xiu Guo 94ce65dfcb [SPARK-11628][SQL] support column datatype of char(x) to recognize HiveChar
Can someone review my code to make sure I'm not missing anything? Thanks!

Author: Xiu Guo <xguo27@gmail.com>
Author: Xiu Guo <guoxi@us.ibm.com>

Closes #9612 from xguo27/SPARK-11628.
2015-11-23 08:53:40 -08:00
Liang-Chi Hsieh 426004a9c9 [SPARK-11908][SQL] Add NullType support to RowEncoder
JIRA: https://issues.apache.org/jira/browse/SPARK-11908

We should add NullType support to RowEncoder.

Author: Liang-Chi Hsieh <viirya@appier.com>

Closes #9891 from viirya/rowencoder-nulltype.
2015-11-22 10:36:47 -08:00
Nong Li 9ed4ad4265 [SPARK-11724][SQL] Change casting between int and timestamp to consistently treat int in seconds.
Hive has since changed this behavior as well. https://issues.apache.org/jira/browse/HIVE-3454

Author: Nong Li <nong@databricks.com>
Author: Nong Li <nongli@gmail.com>
Author: Yin Huai <yhuai@databricks.com>

Closes #9685 from nongli/spark-11724.
2015-11-20 14:19:34 -08:00
Wenchen Fan 3b9d2a347f [SPARK-11819][SQL] nice error message for missing encoder
Before this PR, when users tried to get an encoder for an unsupported class, they would only get a very simple error message like `Encoder for type xxx is not supported`.

After this PR, the error message becomes more friendly, for example:
```
No Encoder found for abc.xyz.NonEncodable
- array element class: "abc.xyz.NonEncodable"
- field (class: "scala.Array", name: "arrayField")
- root class: "abc.xyz.AnotherClass"
```

Author: Wenchen Fan <wenchen@databricks.com>

Closes #9810 from cloud-fan/error-message.
2015-11-20 12:04:42 -08:00
Liang-Chi Hsieh 60bfb11332 [SPARK-11817][SQL] Truncating the fractional seconds to prevent inserting a NULL
JIRA: https://issues.apache.org/jira/browse/SPARK-11817

Instead of returning None, we should truncate the fractional seconds to prevent inserting NULL.

Author: Liang-Chi Hsieh <viirya@appier.com>

Closes #9834 from viirya/truncate-fractional-sec.
2015-11-20 11:43:45 -08:00
Wenchen Fan 47d1c2325c [SPARK-11750][SQL] revert SPARK-11727 and code clean up
After some experimentation, I found it's not convenient to have separate encoder builders: `FlatEncoder` and `ProductEncoder`. For example, when creating encoders for `ScalaUDF`, we have no idea whether the type `T` is flat or not. So I revert the splitting change in https://github.com/apache/spark/pull/9693, while still keeping the bug fixes and tests.

Author: Wenchen Fan <wenchen@databricks.com>

Closes #9726 from cloud-fan/follow.
2015-11-19 12:54:25 -08:00
Reynold Xin f449992009 [SPARK-11849][SQL] Analyzer should replace current_date and current_timestamp with literals
We currently rely on the optimizer's constant folding to replace current_timestamp and current_date. However, this can still result in different values for different instances of current_timestamp/current_date if the optimizer is not running fast enough.

A better solution is to replace these functions in the analyzer in one shot.
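
A minimal sketch of the one-shot replacement (assumed shape; the rule is the `ComputeCurrentTime` mentioned elsewhere in this log): capture the time once and substitute the same literal at every occurrence.

```scala
import org.apache.spark.sql.catalyst.expressions.{CurrentDate, CurrentTimestamp, Literal}
import org.apache.spark.sql.catalyst.plans.logical.LogicalPlan
import org.apache.spark.sql.catalyst.rules.Rule
import org.apache.spark.sql.catalyst.util.DateTimeUtils
import org.apache.spark.sql.types.{DateType, TimestampType}

object ComputeCurrentTime extends Rule[LogicalPlan] {
  def apply(plan: LogicalPlan): LogicalPlan = {
    val currentMillis = System.currentTimeMillis()
    // One timestamp/date per query, so every occurrence sees the same value.
    val timestamp = Literal.create(currentMillis * 1000L, TimestampType)
    val date = Literal.create(DateTimeUtils.millisToDays(currentMillis), DateType)
    plan transformAllExpressions {
      case CurrentTimestamp() => timestamp
      case CurrentDate()      => date
    }
  }
}
```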

Author: Reynold Xin <rxin@databricks.com>

Closes #9833 from rxin/SPARK-11849.
2015-11-19 10:48:04 -08:00
Reynold Xin e61367b9f9 [SPARK-11833][SQL] Add Java tests for Kryo/Java Dataset encoders
Also added some nicer error messages for incompatible types (private types and primitive types) for Kryo/Java encoder.

Author: Reynold Xin <rxin@databricks.com>

Closes #9823 from rxin/SPARK-11833.
2015-11-18 18:34:36 -08:00
Michael Armbrust 59a501359a [SPARK-11636][SQL] Support classes defined in the REPL with Encoders
Before this PR there were two things that would blow up if you called `df.as[MyClass]` when `MyClass` was defined in the REPL:
 - [x] Because `classForName` doesn't work on the munged names returned by `tpe.erasure.typeSymbol.asClass.fullName`
 - [x] Because we don't have anything to pass into the constructor for the `$outer` pointer.

Note that this PR is just adding the infrastructure for working with inner classes in encoders and is not yet sufficient to make them work in the REPL.  Currently, the implementation shown in 95cec7d413 is causing a bug that breaks code gen due to some interaction between janino and the `ExecutorClassLoader`.  This will be addressed in a follow-up PR.

Author: Michael Armbrust <michael@databricks.com>

Closes #9602 from marmbrus/dataset-replClasses.
2015-11-18 16:48:09 -08:00
Reynold Xin 5df08949f5 [SPARK-11810][SQL] Java-based encoder for opaque types in Datasets.
This patch refactors the existing Kryo encoder expressions and adds support for Java serialization.

Author: Reynold Xin <rxin@databricks.com>

Closes #9802 from rxin/SPARK-11810.
2015-11-18 15:42:07 -08:00
Wenchen Fan 33b8373334 [SPARK-11725][SQL] correctly handle null inputs for UDF
If users use primitive parameters in a UDF, there is no way for them to do a null check on the primitive inputs, so we assume the primitive inputs are null-propagating in this case and return null if the input is null.

Author: Wenchen Fan <wenchen@databricks.com>

Closes #9770 from cloud-fan/udf.
2015-11-18 10:23:12 -08:00
Reynold Xin 5e2b44474c [SPARK-11802][SQL] Kryo-based encoder for opaque types in Datasets
I also found a bug with self-joins returning incorrect results in the Dataset API. Two test cases attached and filed SPARK-11803.

Author: Reynold Xin <rxin@databricks.com>

Closes #9789 from rxin/SPARK-11802.
2015-11-18 00:09:29 -08:00
Davies Liu 2f191c66b6 [SPARK-11643] [SQL] parse year with leading zero
Support years in the range 0 <= year < 1000.

Author: Davies Liu <davies@databricks.com>

Closes #9701 from davies/leading_zero.
2015-11-17 23:14:05 -08:00
gatorsmile 0158ff7737 [SPARK-8658][SQL][FOLLOW-UP] AttributeReference's equals method compares all the members
Based on cloud-fan's comment in https://github.com/apache/spark/pull/9216, update AttributeReference's hashCode function to include the hash codes of the other members: name, nullable and qualifiers.

Here, I am not 100% sure if we should include name in the hashCode calculation, since the original hashCode calculation does not include it.

marmbrus cloud-fan Please review if the changes are good.

Author: gatorsmile <gatorsmile@gmail.com>

Closes #9761 from gatorsmile/hashCodeNamedExpression.
2015-11-17 11:23:54 -08:00
Liang-Chi Hsieh d79d8b08ff [MINOR] [SQL] Fix randomly generated ArrayData in RowEncoderSuite
The randomly generated ArrayData used for the UDT `ExamplePoint` in `RowEncoderSuite` sometimes doesn't have enough elements, and the test then fails. This patch fixes it.

Author: Liang-Chi Hsieh <viirya@appier.com>

Closes #9757 from viirya/fix-randomgenerated-udt.
2015-11-16 23:16:17 -08:00
Bartlomiej Alberski 31296628ac [SPARK-11553][SQL] Primitive Row accessors should not convert null to default value
Invoking getters for a type extending AnyVal returns a default value (if the field value is null) instead of throwing an NPE. Please check the comments on the SPARK-11553 issue for more details.

Author: Bartlomiej Alberski <bartlomiej.alberski@allegrogroup.com>

Closes #9642 from alberskib/bugfix/SPARK-11553.
2015-11-16 15:14:38 -08:00
Wenchen Fan b1a9662623 [SPARK-11754][SQL] consolidate ExpressionEncoder.tuple and Encoders.tuple
These two are very similar; we can consolidate them into one.

Also adds tests for it and fixes a bug.

Author: Wenchen Fan <wenchen@databricks.com>

Closes #9729 from cloud-fan/tuple.
2015-11-16 12:45:34 -08:00
Liang-Chi Hsieh b0c3fd34e4 [SPARK-11743] [SQL] Add UserDefinedType support to RowEncoder
JIRA: https://issues.apache.org/jira/browse/SPARK-11743

RowEncoder doesn't support UserDefinedType yet. We should add support for it.

Author: Liang-Chi Hsieh <viirya@appier.com>

Closes #9712 from viirya/rowencoder-udt.
2015-11-16 09:03:42 -08:00
Wenchen Fan 06f1fdba6d [SPARK-11752] [SQL] fix timezone problem for DateTimeUtils.getSeconds
Code snippet to reproduce it:
```scala
import java.util.TimeZone
import java.sql.Timestamp
import org.apache.spark.sql.catalyst.util.DateTimeUtils.{fromJavaTimestamp, getSeconds}

TimeZone.setDefault(TimeZone.getTimeZone("Asia/Shanghai"))
val t = Timestamp.valueOf("1900-06-11 12:14:50.789")
val us = fromJavaTimestamp(t)
assert(getSeconds(us) == t.getSeconds)
```

It would be good to add a regression test for this, but the reproducing code needs to change the default timezone, and even if we change it back, the `lazy val defaultTimeZone` in `DateTimeUtils` is already fixed in place.

Author: Wenchen Fan <wenchen@databricks.com>

Closes #9728 from cloud-fan/seconds.
2015-11-16 08:58:40 -08:00
Yin Huai 3e2e1873b2 [SPARK-11738] [SQL] Making ArrayType orderable
https://issues.apache.org/jira/browse/SPARK-11738

Author: Yin Huai <yhuai@databricks.com>

Closes #9718 from yhuai/makingArrayOrderable.
2015-11-15 13:59:59 -08:00
Wenchen Fan d7b2b97ad6 [SPARK-11727][SQL] Split ExpressionEncoder into FlatEncoder and ProductEncoder
Also adds more tests for encoders, and fixes bugs that I found:

* When converting an array to a Catalyst array, we can only skip element conversion for native types (e.g. int, long, boolean), not for `AtomicType` (String is an AtomicType, but we need to convert it).
* We should also handle Scala `BigDecimal` when converting from Catalyst `Decimal`.
* Complex map types should be supported.

Other issues still under investigation:

* Encoding a Java `BigDecimal` and decoding it back seems to lose precision info.
* When encoding a case class defined inside an object, a `ClassNotFound` exception is thrown.

I'll remove unused code in a follow-up PR.

Author: Wenchen Fan <wenchen@databricks.com>

Closes #9693 from cloud-fan/split.
2015-11-13 11:25:33 -08:00
Daoyuan Wang 39b1e36fbc [SPARK-11396] [SQL] add native implementation of datetime function to_unix_timestamp
`to_unix_timestamp` is the deterministic version of `unix_timestamp`, as it requires at least one parameter.

Since the behavior here is quite similar to `unix_timestamp`, I think a DataFrame API is not necessary here.

Author: Daoyuan Wang <daoyuan.wang@intel.com>

Closes #9347 from adrian-wang/to_unix_timestamp.
2015-11-11 20:36:21 -08:00
Yin Huai 3121e78168 [SPARK-9830][SPARK-11641][SQL][FOLLOW-UP] Remove AggregateExpression1 and update toString of Exchange
https://issues.apache.org/jira/browse/SPARK-9830

This is the follow-up pr for https://github.com/apache/spark/pull/9556 to address davies' comments.

Author: Yin Huai <yhuai@databricks.com>

Closes #9607 from yhuai/removeAgg1-followup.
2015-11-10 16:25:22 -08:00
Nong Li 87aedc48c0 [SPARK-10371][SQL] Implement subexpr elimination for UnsafeProjections
This patch adds the building blocks for code-generating subexpression elimination and implements it end to end for UnsafeProjection. The building blocks can be used to do the same thing for other operators.

It introduces some utilities to compute common subexpressions. Expressions can be added to this data structure; an expression and its children are recursively matched against existing expressions (ones previously added) and grouped into common groups. This is built using the existing `semanticEquals`. It does not understand things like commutative or associative expressions; that can be done as future work.

After building this data structure, the codegen process takes advantage of it by:
  1. Generating a helper function in the generated class that computes the common
     subexpression. This is done for all common subexpressions that have at least
     two occurrences and the expression tree is sufficiently complex.
  2. When generating the apply() function, if the helper function exists, call that
     instead of regenerating the expression tree. Repeated calls to the helper function
     shortcircuit the evaluation logic.
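
A minimal sketch of the grouping bookkeeping described above (Spark's actual utility is richer and more efficient; the linear scan here avoids assuming anything about a semantic hash function):

```scala
import scala.collection.mutable
import org.apache.spark.sql.catalyst.expressions.Expression

// Group expressions (and, recursively, their children) by semanticEquals.
class EquivalentExpressionsSketch {
  private val groups = mutable.ArrayBuffer.empty[mutable.ArrayBuffer[Expression]]

  def addExprTree(expr: Expression): Unit = {
    groups.find(_.head.semanticEquals(expr)) match {
      case Some(group) => group += expr
      case None        => groups += mutable.ArrayBuffer(expr)
    }
    expr.children.foreach(addExprTree)
  }

  // Groups with at least two occurrences are candidates for a helper function.
  def commonSubexpressions: Seq[Seq[Expression]] =
    groups.filter(_.size > 1).map(_.toSeq).toSeq
}
```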

Author: Nong Li <nong@databricks.com>
Author: Nong Li <nongli@gmail.com>

This patch had conflicts when merged, resolved by
Committer: Michael Armbrust <michael@databricks.com>

Closes #9480 from nongli/spark-10371.
2015-11-10 11:28:53 -08:00
Wenchen Fan 53600854c2 [SPARK-11590][SQL] use native json_tuple in lateral view
Author: Wenchen Fan <wenchen@databricks.com>

Closes #9562 from cloud-fan/json-tuple.
2015-11-10 11:21:31 -08:00
Yin Huai e0701c7560 [SPARK-9830][SQL] Remove AggregateExpression1 and Aggregate Operator used to evaluate AggregateExpression1s
https://issues.apache.org/jira/browse/SPARK-9830

This PR contains the following main changes.
* Removing `AggregateExpression1`.
* Removing `Aggregate` operator, which is used to evaluate `AggregateExpression1`.
* Removing planner rule used to plan `Aggregate`.
* Linking `MultipleDistinctRewriter` to analyzer.
* Renaming `AggregateExpression2` to `AggregateExpression` and `AggregateFunction2` to `AggregateFunction`.
* Updating places where we create aggregate expression. The way to create aggregate expressions is `AggregateExpression(aggregateFunction, mode, isDistinct)`.
* Changing `val`s in `DeclarativeAggregate`s that touch children of this function to `lazy val`s (when we create aggregate expression in DataFrame API, children of an aggregate function can be unresolved).

Author: Yin Huai <yhuai@databricks.com>

Closes #9556 from yhuai/removeAgg1.
2015-11-10 11:06:29 -08:00
Herman van Hovell 30c8ba71a7 [SPARK-11451][SQL] Support single distinct count on multiple columns.
This PR adds support for multiple columns in a single count distinct aggregate to the new aggregation path.

cc yhuai

Author: Herman van Hovell <hvanhovell@questtec.nl>

Closes #9409 from hvanhovell/SPARK-11451.
2015-11-08 11:06:10 -08:00
Imran Rashid 49f1a82037 [SPARK-10116][CORE] XORShiftRandom.hashSeed is random in high bits
https://issues.apache.org/jira/browse/SPARK-10116

This is really trivial, just happened to notice it -- if `XORShiftRandom.hashSeed` is really supposed to have random bits throughout (as the comment implies), it needs to do something for the conversion to `long`.

mengxr mkolod

Author: Imran Rashid <irashid@cloudera.com>

Closes #8314 from squito/SPARK-10116.
2015-11-06 20:06:24 +00:00
Wenchen Fan 253e87e8ab [SPARK-11453][SQL][FOLLOW-UP] remove DecimalLit
A cleanup for https://github.com/apache/spark/pull/9085.

The `DecimalLit` is very similar to `FloatLit`; we can just keep one of them.
Also added a low-level unit test in `SqlParserSuite`.

Author: Wenchen Fan <wenchen@databricks.com>

Closes #9482 from cloud-fan/parser.
2015-11-06 06:38:49 -08:00
Davies Liu 07414afac9 [SPARK-11537] [SQL] fix negative hours/minutes/seconds
Currently, if the Timestamp is before the epoch (1970/01/01), the hours, minutes and seconds will be negative (and also rounded up).

Author: Davies Liu <davies@databricks.com>

Closes #9502 from davies/neg_hour.
2015-11-05 17:02:22 -08:00
Dilip Biswal fc27dfbf0f [SPARK-11024][SQL] Optimize NULL in <inlist-expressions> by folding it to Literal(null)
Add a rule in the optimizer to convert NULL [NOT] IN (expr1, ..., exprN) to Literal(null).
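
A minimal sketch of such a rule (assumed shape and name): NULL IN (...) can never evaluate to true or false, so both the IN and its negation fold to a null boolean literal.

```scala
import org.apache.spark.sql.catalyst.expressions.{In, Literal, Not}
import org.apache.spark.sql.catalyst.plans.logical.LogicalPlan
import org.apache.spark.sql.catalyst.rules.Rule
import org.apache.spark.sql.types.BooleanType

object FoldNullInList extends Rule[LogicalPlan] {
  def apply(plan: LogicalPlan): LogicalPlan = plan transformAllExpressions {
    case Not(In(Literal(null, _), _)) => Literal.create(null, BooleanType)
    case In(Literal(null, _), _)      => Literal.create(null, BooleanType)
  }
}
```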

This is a follow-up defect to SPARK-8654.

cloud-fan Can you please take a look ?

Author: Dilip Biswal <dbiswal@us.ibm.com>

Closes #9348 from dilipbiswal/spark_11024.
2015-10-31 12:55:33 -07:00
Wenchen Fan 87f28fc240 [SPARK-11379][SQL] ExpressionEncoder can't handle top level primitive type correctly
For inner primitive types (e.g. inside a `Product`), we use `schemaFor` to get the Catalyst type: https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/ScalaReflection.scala#L403.

However, for top-level primitive types, we use `dataTypeFor`, which is wrong.

Author: Wenchen Fan <wenchen@databricks.com>

Closes #9337 from cloud-fan/encoder.
2015-10-29 11:17:03 +01:00
Wenchen Fan 0cb7662d86 [SPARK-11351] [SQL] support hive interval literal
Author: Wenchen Fan <wenchen@databricks.com>

Closes #9304 from cloud-fan/interval.
2015-10-28 21:35:57 -07:00
Michael Armbrust 5a5f65905a [SPARK-11347] [SQL] Support for joinWith in Datasets
This PR adds a new operation `joinWith` to a `Dataset`, which returns a `Tuple` for each pair where a given `condition` evaluates to true.

```scala
case class ClassData(a: String, b: Int)

val ds1 = Seq(ClassData("a", 1), ClassData("b", 2)).toDS()
val ds2 = Seq(("a", 1), ("b", 2)).toDS()

> ds1.joinWith(ds2, $"_1" === $"a").collect()
res0: Array((ClassData("a", 1), ("a", 1)), (ClassData("b", 2), ("b", 2)))
```

This operation is similar to the relation `join` function with one important difference in the result schema. Since `joinWith` preserves objects present on either side of the join, the result schema is similarly nested into a tuple under the column names `_1` and `_2`.

This type of join can be useful both for preserving type-safety with the original object types as well as working with relational data where either side of the join has column names in common.

## Required Changes to Encoders
In the process of working on this patch, several deficiencies in the way that we were handling encoders were discovered.  Specifically, it turned out to be very difficult to `rebind` the non-expression based encoders to extract the nested objects from the results of joins (and also typed selects that return tuples).

As a result the following changes were made.
 - `ClassEncoder` has been renamed to `ExpressionEncoder` and has been improved to also handle primitive types.  Additionally, it is now possible to take arbitrary expression encoders and rewrite them into a single encoder that returns a tuple.
 - All internal operations on `Dataset`s now require an `ExpressionEncoder`.  If the user tries to pass a non-`ExpressionEncoder` in, an error will be thrown.  We can relax this requirement in the future by constructing a wrapper class that uses expressions to project the row to the expected schema, shielding the user's code from the required remapping.  This will give us a nice balance where we don't force user encoders to understand attribute references and binding, but still allow our native encoder to leverage runtime code generation to construct specific encoders for a given schema that avoid an extra remapping step.
 - Additionally, the semantics for different types of objects are now better defined.  As stated in the `ExpressionEncoder` scaladoc:
  - Classes will have their sub fields extracted by name using `UnresolvedAttribute` expressions
  and `UnresolvedExtractValue` expressions.
  - Tuples will have their subfields extracted by position using `BoundReference` expressions.
  - Primitives will have their values extracted from the first ordinal with a schema that defaults
  to the name `value`.
 - Finally, the binding lifecycle for `Encoders` has now been unified across the codebase.  Encoders are now `resolved` to the appropriate schema in the constructor of `Dataset`.  This process replaces an unresolved expressions with concrete `AttributeReference` expressions.  Binding then happens on demand, when an encoder is going to be used to construct an object.  This closely mirrors the lifecycle for standard expressions when executing normal SQL or `DataFrame` queries.

Author: Michael Armbrust <michael@databricks.com>

Closes #9300 from marmbrus/datasets-tuples.
2015-10-27 13:28:52 -07:00
Jia Li 958a0ec8fa [SPARK-11277][SQL] sort_array throws exception scala.MatchError
I'm new to Spark. I was trying out the sort_array function and hit this exception. I looked into the Spark source code and found that the root cause is that sort_array does not check for an array of NULLs. It's not meaningful to sort an array of entirely NULLs anyway.

I'm adding a check on the input array type to SortArray. If the array consists entirely of NULLs, there is no need to sort it. I have also added a test case for this.

Please help to review my fix. Thanks!

Author: Jia Li <jiali@us.ibm.com>

Closes #9247 from jliwork/SPARK-11277.
2015-10-27 10:57:08 +01:00
Davies Liu 487d409e71 [SPARK-11243][SQL] zero out padding bytes in UnsafeRow
For a nested StructType, the underlying buffer could have been used for other data before; we should zero out the padding bytes for those primitive types that occupy fewer than 8 bytes.

cc cloud-fan

Author: Davies Liu <davies@databricks.com>

Closes #9217 from davies/zero_out.
2015-10-23 01:33:14 -07:00
Reynold Xin cdea0174e3 [SPARK-11273][SQL] Move ArrayData/MapData/DataTypeParser to catalyst.util package
Author: Reynold Xin <rxin@databricks.com>

Closes #9239 from rxin/types-private.
2015-10-23 00:00:21 -07:00
Michael Armbrust 53e83a3a77 [SPARK-11116][SQL] First Draft of Dataset API
*This PR adds a new experimental API to Spark, tentatively named Datasets.*

A `Dataset` is a strongly-typed collection of objects that can be transformed in parallel using functional or relational operations.  Example usage is as follows:

### Functional
```scala
> val ds: Dataset[Int] = Seq(1, 2, 3).toDS()
> ds.filter(_ % 1 == 0).collect()
res1: Array[Int] = Array(1, 2, 3)
```

### Relational
```scala
scala> ds.toDF().show()
+-----+
|value|
+-----+
|    1|
|    2|
|    3|
+-----+

> ds.select(expr("value + 1").as[Int]).collect()
res11: Array[Int] = Array(2, 3, 4)
```

## Comparison to RDDs
 A `Dataset` differs from an `RDD` in the following ways:
  - The creation of a `Dataset` requires the presence of an explicit `Encoder` that can be
    used to serialize the object into a binary format.  Encoders are also capable of mapping the
    schema of a given object to the Spark SQL type system.  In contrast, RDDs rely on runtime
    reflection based serialization.
  - Internally, a `Dataset` is represented by a Catalyst logical plan and the data is stored
    in the encoded form.  This representation allows for additional logical operations and
    enables many operations (sorting, shuffling, etc.) to be performed without deserializing to
    an object.

A `Dataset` can be converted to an `RDD` by calling the `.rdd` method.

## Comparison to DataFrames

A `Dataset` can be thought of as a specialized DataFrame, where the elements map to a specific
JVM object type, instead of to a generic `Row` container. A DataFrame can be transformed into
a specific Dataset by calling `df.as[ElementType]`.  Similarly, you can transform a strongly-typed
`Dataset` to a generic DataFrame by calling `ds.toDF()`.

## Implementation Status and TODOs

This is a rough cut at the least controversial parts of the API.  The primary purpose here is to get something committed so that we can better parallelize further work and get early feedback on the API.  The following is being deferred to future PRs:
 - Joins and Aggregations (prototype here f11f91e6f0)
 - Support for Java

Additionally, the responsibility for binding an encoder to a given schema is currently done in a fairly ad-hoc fashion.  This is an internal detail, and what we are doing today works for the cases we care about.  However, as we add more APIs we'll probably need to do this in a more principled way (i.e. separate resolution from binding as we do in DataFrames).

## COMPATIBILITY NOTE
Long term we plan to make `DataFrame` extend `Dataset[Row]`.  However,
making this change to the class hierarchy would break the function signatures for the existing
function operations (map, flatMap, etc).  As such, this class should be considered a preview
of the final API.  Changes will be made to the interface after Spark 1.6.

Author: Michael Armbrust <michael@databricks.com>

Closes #9190 from marmbrus/dataset-infra.
2015-10-22 15:20:17 -07:00
Wenchen Fan 42d225f449 [SPARK-11216][SQL][FOLLOW-UP] add encoder/decoder for external row
address comments in https://github.com/apache/spark/pull/9184

Author: Wenchen Fan <wenchen@databricks.com>

Closes #9212 from cloud-fan/encoder.
2015-10-22 10:53:59 -07:00
Dilip Biswal dce2f8c9d7 [SPARK-8654][SQL] Analysis exception when using NULL IN (...) : invalid cast
In the analysis phase, while processing the rules for the IN predicate, we
compare the in-list types to the LHS expression type and generate a
cast operation if necessary. In the case of NULL [NOT] IN expr1, we end up
generating a cast from the in-list types to NULL, like CAST(1 AS NULL),
which is not a valid cast.

The fix is to find a common type between the LHS and RHS expressions and
cast all the expressions to that common type.
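
For example (hypothetical shell session), a query that previously failed analysis with an invalid cast now analyzes cleanly and evaluates to NULL:

```scala
// Both sides are cast to a common type, so no CAST(1 AS NULL) is generated.
sqlContext.sql("SELECT NULL IN (1, 2, 3)").collect()
```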

Author: Dilip Biswal <dbiswal@us.ibm.com>

This patch had conflicts when merged, resolved by
Committer: Michael Armbrust <michael@databricks.com>

Closes #9036 from dilipbiswal/spark_8654_new.
2015-10-21 14:29:03 -07:00
Davies Liu f8c6bec657 [SPARK-11197][SQL] run SQL on files directly
This PR introduces a new feature to run SQL directly on files without creating a table, for example:

```
select id from json.`path/to/json/files` as j
```

Author: Davies Liu <davies@databricks.com>

Closes #9173 from davies/source.
2015-10-21 13:38:30 -07:00
Dilip Biswal 49ea0e9d7c [SPARK-10534] [SQL] ORDER BY clause allows only columns that are present in the select projection list
Find the missing attributes by recursively looking
at the sort order expressions; the rest of the code
takes care of projecting them out.

Added description from cloud-fan

I want to explain a bit more about this bug.

When we resolve the sort ordering, we use a special method which only resolves UnresolvedAttributes and UnresolvedExtractValue. However, for something like Floor('a), even after 'a is resolved, the floor expression may still be unresolved due to a data type mismatch (for example, 'a is string type and Floor needs double type); it thus can't pass this filter, and we can't push down the missing attribute 'a.
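
A hypothetical query with this shape, assuming a table `t(a string, b string)`:

```scala
// `a` is not in the projection list, and floor(a) stays unresolved until
// `a` is cast to double, so the missing attribute must be found by walking
// the sort order expression recursively.
sqlContext.sql("SELECT b FROM t ORDER BY floor(a)")
```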

Author: Dilip Biswal <dbiswal@us.ibm.com>

Closes #9123 from dilipbiswal/SPARK-10534.
2015-10-21 11:10:32 -07:00
Wenchen Fan ccf536f903 [SPARK-11216] [SQL] add encoder/decoder for external row
Implement encoding/decoding for external rows based on `ClassEncoder`.

TODO:
* code cleanup
* ~~fix corner cases~~
* refactor the encoder interface
* improve test for product codegen, to cover more corner cases.

Author: Wenchen Fan <wenchen@databricks.com>

Closes #9184 from cloud-fan/encoder.
2015-10-21 11:06:34 -07:00
nitin goyal f62e326088 [SPARK-11179] [SQL] Push filters through aggregate
Push conjunctive predicates through Aggregate operators when their references are a subset of the groupingExpressions.

Query plan before optimisation :-
Filter ((c#138L = 2) && (a#0 = 3))
 Aggregate [a#0], [a#0,count(b#1) AS c#138L]
  Project [a#0,b#1]
   LocalRelation [a#0,b#1,c#2]

Query plan after optimisation :-
Filter (c#138L = 2)
 Aggregate [a#0], [a#0,count(b#1) AS c#138L]
  Filter (a#0 = 3)
   Project [a#0,b#1]
    LocalRelation [a#0,b#1,c#2]
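
A hypothetical DataFrame query with the same shape as the plans above (`input` is an assumed DataFrame with columns `a`, `b`, `c`):

```scala
import org.apache.spark.sql.functions._

// The predicate on the grouping column `a` can be pushed below the
// Aggregate; the predicate on the aggregated column `c` must stay above it.
val agg = input.groupBy("a").agg(count("b").as("c"))
val result = agg.filter(agg("c") === 2 && agg("a") === 3)
```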

Author: nitin goyal <nitin.goyal@guavus.com>
Author: nitin.goyal <nitin.goyal@guavus.com>

Closes #9167 from nitin2goyal/master.
2015-10-21 10:45:21 -07:00
Davies Liu 06e6b765d0 [SPARK-11149] [SQL] Improve cache performance for primitive types
This PR improves the performance by:

1) Generating an Iterator that takes Iterator[CachedBatch] as input and calls accessors (unrolling the loop over columns), avoiding the expensive Iterator.flatMap.

2) Using Unsafe.getInt/getLong/getFloat/getDouble instead of ByteBuffer.getInt/getLong/getFloat/getDouble; the latter actually reads byte by byte.

3) Removing the unnecessary copy() in Coalesce(), which is not related to the memory cache but was found during benchmarking.

The following benchmark showed that we can speed up the columnar cache of ints by 2x.

```
import time

path = '/opt/tpcds/store_sales/'
int_cols = ['ss_sold_date_sk', 'ss_sold_time_sk', 'ss_item_sk','ss_customer_sk']
df = sqlContext.read.parquet(path).select(int_cols).cache()
df.count()

t = time.time()
print df.select("*")._jdf.queryExecution().toRdd().count()
print time.time() - t
```

Author: Davies Liu <davies@databricks.com>

Closes #9145 from davies/byte_buffer.
2015-10-20 14:01:53 -07:00
Davies Liu 67d468f8d9 [SPARK-11111] [SQL] fast null-safe join
Currently, we use CartesianProduct for joins with a null-safe-equal condition.
```
scala> sqlContext.sql("select * from t a join t b on (a.i <=> b.i)").explain
== Physical Plan ==
TungstenProject [i#2,j#3,i#7,j#8]
 Filter (i#2 <=> i#7)
  CartesianProduct
   LocalTableScan [i#2,j#3], [[1,1]]
   LocalTableScan [i#7,j#8], [[1,1]]
```
Actually, we can rewrite it to an equal-join condition `coalesce(a.i, default) = coalesce(b.i, default)`, so that a partitioned join algorithm can be used; the original null-safe predicate is kept as a filter above the join, as the plan below shows.

After this PR, the plan will become:
```
>>> sqlContext.sql("select * from a join b ON a.id <=> b.id").explain()
TungstenProject [id#0L,id#1L]
 Filter (id#0L <=> id#1L)
  SortMergeJoin [coalesce(id#0L,0)], [coalesce(id#1L,0)]
   TungstenSort [coalesce(id#0L,0) ASC], false, 0
    TungstenExchange hashpartitioning(coalesce(id#0L,0),200)
     ConvertToUnsafe
      Scan PhysicalRDD[id#0L]
   TungstenSort [coalesce(id#1L,0) ASC], false, 0
    TungstenExchange hashpartitioning(coalesce(id#1L,0),200)
     ConvertToUnsafe
      Scan PhysicalRDD[id#1L]
```

Author: Davies Liu <davies@databricks.com>

Closes #9120 from davies/null_safe.
2015-10-20 13:40:24 -07:00
Wenchen Fan 478c7ce862 [SPARK-6740] [SQL] correctly parse NOT operator with comparison operations
We couldn't parse the `NOT` operator with comparison operations like `SELECT NOT TRUE > TRUE`; this PR fixes it.

Takes over https://github.com/apache/spark/pull/6326.

Author: Wenchen Fan <cloud0fan@outlook.com>

Closes #8617 from cloud-fan/not.
2015-10-20 13:38:25 -07:00
Wenchen Fan 7893cd95db [SPARK-11119] [SQL] cleanup for unsafe array and map
The purpose of this PR is to keep the unsafe format details only inside the unsafe classes themselves, so that when we use them (e.g. an unsafe array inside an unsafe map, or unsafe arrays and maps in the columnar cache), we don't need to understand the format before using them.

change list:
* The unsafe array's 4-byte numElements header is now required (it was optional before) and becomes part of the unsafe array format.
* With that change, the `sizeInBytes` of an unsafe array now counts the 4-byte header.
* The unsafe map's format was `[numElements] [key array numBytes] [key array content(without numElements header)] [value array content(without numElements header)]` before, which is a little hacky as it makes the unsafe array's header optional. I think saving 4 bytes is not a big deal, so the format is now: `[key array numBytes] [unsafe key array] [unsafe value array]`.
* With that change, the `sizeInBytes` of an unsafe map now counts both the map's header and the arrays' headers.

Author: Wenchen Fan <wenchen@databricks.com>

Closes #9131 from cloud-fan/unsafe.
2015-10-19 11:02:26 -07:00
Cheng Hao 9808052b5a [SPARK-11076] [SQL] Add decimal support for floor and ceil
Actually, none of the `UnaryMathExpression` implementations supports Decimal; I will create follow-ups for supporting it in the rest. This is the first PR, which should be a good place to review the approach I am taking.

Author: Cheng Hao <hao.cheng@intel.com>

Closes #9086 from chenghao-intel/ceiling.
2015-10-14 20:56:08 -07:00
Wenchen Fan 56d7da14ab [SPARK-10104] [SQL] Consolidate different forms of table identifiers
Right now, we have QualifiedTableName, TableIdentifier, and Seq[String] to represent table identifiers. We should have only one form, and TableIdentifier is the best one because it provides methods to get the table name and the database name, and to return unquoted and quoted strings.

Author: Wenchen Fan <wenchen@databricks.com>
Author: Wenchen Fan <cloud0fan@163.com>

Closes #8453 from cloud-fan/table-name.
2015-10-14 16:05:37 -07:00
Michael Armbrust 328d1b3e4b [SPARK-11090] [SQL] Constructor for Product types from InternalRow
This is a first draft of the ability to construct expressions that will take a catalyst internal row and construct a Product (case class or tuple) that has fields with the correct names.  Supported features include:
 - Nested classes
 - Maps
 - Efficiently handling of arrays of primitive types

Not yet supported:
 - Case classes that require custom collection types (i.e. List instead of Seq).

Author: Michael Armbrust <michael@databricks.com>

Closes #9100 from marmbrus/productContructor.
2015-10-13 17:09:17 -07:00
Michael Armbrust 9e66a53c99 [SPARK-10993] [SQL] Initial code generated encoder for product types
This PR is a first cut at code generating an encoder that takes a Scala `Product` type and converts it directly into the tungsten binary format.  This is done through the addition of a new set of expression that can be used to invoke methods on raw JVM objects, extracting fields and converting the result into the required format.  These can then be used directly in an `UnsafeProjection` allowing us to leverage the existing encoding logic.

According to some simple benchmarks, this can significantly speed up conversion (~4x).  However, replacing CatalystConverters is deferred to a later PR to keep this PR at a reasonable size.

```scala
case class SomeInts(a: Int, b: Int, c: Int, d: Int, e: Int)

val data = SomeInts(1, 2, 3, 4, 5)
val encoder = ProductEncoder[SomeInts]
val converter = CatalystTypeConverters.createToCatalystConverter(ScalaReflection.schemaFor[SomeInts].dataType)

(1 to 5).foreach {iter =>
  benchmark(s"converter $iter") {
    var i = 100000000
    while (i > 0) {
      val res = converter(data).asInstanceOf[InternalRow]
      assert(res.getInt(0) == 1)
      assert(res.getInt(1) == 2)
      i -= 1
    }
  }

  benchmark(s"encoder $iter") {
    var i = 100000000
    while (i > 0) {
      val res = encoder.toRow(data)
      assert(res.getInt(0) == 1)
      assert(res.getInt(1) == 2)
      i -= 1
    }
  }
}
```

Results:
```
[info] converter 1: 7170ms
[info] encoder 1: 1888ms
[info] converter 2: 6763ms
[info] encoder 2: 1824ms
[info] converter 3: 6912ms
[info] encoder 3: 1802ms
[info] converter 4: 7131ms
[info] encoder 4: 1798ms
[info] converter 5: 7350ms
[info] encoder 5: 1912ms
```

Author: Michael Armbrust <michael@databricks.com>

Closes #9019 from marmbrus/productEncoder.
2015-10-08 14:28:14 -07:00
Michael Armbrust a8226a9f14 Revert [SPARK-8654] [SQL] Fix Analysis exception when using NULL IN
This reverts commit dcbd58a929 from #8983

Author: Michael Armbrust <michael@databricks.com>

Closes #9034 from marmbrus/revert8654.
2015-10-08 13:49:10 -07:00
Dilip Biswal dcbd58a929 [SPARK-8654] [SQL] Fix Analysis exception when using NULL IN (...)
In the analysis phase, while processing the rules for the IN predicate, we
compare the in-list types to the LHS expression type and generate a
cast operation if necessary. In the case of NULL [NOT] IN expr1, we end up
generating a cast from the in-list types to NULL, like CAST(1 AS NULL),
which is not a valid cast.

The fix is to not generate such a cast if the LHS type is a NullType; instead,
we translate the expression to Literal(null).

Author: Dilip Biswal <dbiswal@us.ibm.com>

Closes #8983 from dilipbiswal/spark_8654.
2015-10-08 10:41:45 -07:00
Davies Liu 37526aca24 [SPARK-10980] [SQL] fix bug in create Decimal
The created decimal is wrong when using `Decimal(unscaled, precision, scale)` with unscaled > 1e18, precision > 18, and scale > 0.

This bug exists since the beginning.

Author: Davies Liu <davies@databricks.com>

Closes #9014 from davies/fix_decimal.
2015-10-07 15:51:09 -07:00
Josh Rosen a9ecd06149 [SPARK-10941] [SQL] Refactor AggregateFunction2 and AlgebraicAggregate interfaces to improve code clarity
This patch refactors several of the Aggregate2 interfaces in order to improve code clarity.

The biggest change is a refactoring of the `AggregateFunction2` class hierarchy. In the old code, we had a class named `AlgebraicAggregate` that inherited from `AggregateFunction2`, added a new set of methods, then banned the use of the inherited methods. I found this to be fairly confusing.

If you look carefully at the existing code, you'll see that subclasses of `AggregateFunction2` fall into two disjoint categories: imperative aggregation functions which directly extended `AggregateFunction2` and declarative, expression-based aggregate functions which extended `AlgebraicAggregate`. In order to make this more explicit, this patch refactors things so that `AggregateFunction2` is a sealed abstract class with two subclasses, `ImperativeAggregateFunction` and `ExpressionAggregateFunction`. The superclass, `AggregateFunction2`, now only contains methods and fields that are common to both subclasses.

After making this change, I updated the various AggregationIterator classes to comply with this new naming scheme. I also performed several small renamings in the aggregate interfaces themselves in order to improve clarity and rewrote or expanded a number of comments.

Author: Josh Rosen <joshrosen@databricks.com>

Closes #8973 from JoshRosen/tungsten-agg-comments.
2015-10-07 13:19:49 -07:00
Wenchen Fan c4871369db [SPARK-10585] [SQL] only copy data once when generate unsafe projection
This PR is a complete rewrite of GenerateUnsafeProjection, to accomplish the goal of copying data only once. The old GenerateUnsafeProjection code is still there to reduce review difficulty.

Instead of creating unsafe conversion code for struct, array and map, we create code of writing the content to the global row buffer.

Author: Wenchen Fan <cloud0fan@163.com>
Author: Wenchen Fan <cloud0fan@outlook.com>

Closes #8747 from cloud-fan/copy-once.
2015-10-05 13:00:58 -07:00
Cheng Hao 4d8c7c6d1c [SPARK-10865] [SPARK-10866] [SQL] Fix bug of ceil/floor, which should return Long instead of Double
The floor and ceiling functions should return the Long type, rather than Double.

Verified with MySQL & Hive.

Author: Cheng Hao <hao.cheng@intel.com>

Closes #8933 from chenghao-intel/ceiling.
2015-10-01 11:48:15 -07:00
Nathan Howell 89ea0041ae [SPARK-9617] [SQL] Implement json_tuple
This is an implementation of Hive's `json_tuple` function using Jackson Streaming.

Author: Nathan Howell <nhowell@godaddy.com>

Closes #7946 from NathanHowell/SPARK-9617.
2015-09-30 15:33:12 -07:00
Herman van Hovell 16fd2a2f42 [SPARK-9741] [SQL] Approximate Count Distinct using the new UDAF interface.
This PR implements a HyperLogLog based Approximate Count Distinct function using the new UDAF interface.

The implementation is inspired by the ClearSpring HyperLogLog implementation and should produce the same results.

There is still some documentation and testing left to do.

cc yhuai

Author: Herman van Hovell <hvanhovell@questtec.nl>

Closes #8362 from hvanhovell/SPARK-9741.
2015-09-30 10:12:52 -07:00
Wenchen Fan 5017c685f4 [SPARK-10740] [SQL] handle nondeterministic expressions correctly for set operations
https://issues.apache.org/jira/browse/SPARK-10740

Author: Wenchen Fan <cloud0fan@163.com>

Closes #8858 from cloud-fan/non-deter.
2015-09-22 12:14:59 -07:00
Yijie Shen c6f8135ee5 [SPARK-10539] [SQL] Project should not be pushed down through Intersect or Except #8742
Intersect and Except are both set operators, and they use all the columns to compare equality between rows. When pushing their Project parent down, the relations they are based on would change, so it is not an equivalent transformation.

JIRA: https://issues.apache.org/jira/browse/SPARK-10539

I added some comments based on the fix of https://github.com/apache/spark/pull/8742.

Author: Yijie Shen <henry.yijieshen@gmail.com>
Author: Yin Huai <yhuai@databricks.com>

Closes #8823 from yhuai/fix_set_optimization.
2015-09-18 13:20:13 -07:00
Yin Huai aad644fbe2 [SPARK-10639] [SQL] Need to convert UDAF's result from scala to sql type
https://issues.apache.org/jira/browse/SPARK-10639

Author: Yin Huai <yhuai@databricks.com>

Closes #8788 from yhuai/udafConversion.
2015-09-17 11:14:52 -07:00
Kevin Cox d39f15ea2b [SPARK-9794] [SQL] Fix datetime parsing in SparkSQL.
This fixes https://issues.apache.org/jira/browse/SPARK-9794 by using a real ISO 8601 parser (courtesy of the XML component of the standard Java library).

cc: angelini

Author: Kevin Cox <kevincox@kevincox.ca>

Closes #8396 from kevincox/kevincox-sql-time-parsing.
2015-09-16 15:30:17 -07:00
Wenchen Fan 31a229aa73 [SPARK-10475] [SQL] improve column pruning for Project on Sort
Sometimes we can't push down the whole `Project` through `Sort`, but we still have a chance to push down part of it.

Author: Wenchen Fan <cloud0fan@outlook.com>

Closes #8644 from cloud-fan/column-prune.
2015-09-15 13:36:52 -07:00
Davies Liu 7e32387ae6 [SPARK-10522] [SQL] Nanoseconds of Timestamp in Parquet should be positive
Otherwise Hive can't read it back correctly.

Thanks vanzin for reporting this.

Author: Davies Liu <davies@databricks.com>

Closes #8674 from davies/positive_nano.
2015-09-14 14:20:49 -07:00
Wenchen Fan d5d647380f [SPARK-10442] [SQL] fix string to boolean cast
When we cast a string to boolean in Hive, it returns `true` if the length of the string is > 0, and Spark SQL follows this behavior.

However, this behavior is very different from other SQL systems:

1. [presto](https://github.com/facebook/presto/blob/master/presto-main/src/main/java/com/facebook/presto/type/VarcharOperators.java#L89-L118) will return `true` for 't' 'true' '1', `false` for 'f' 'false' '0', throw exception for others.
2. [redshift](http://docs.aws.amazon.com/redshift/latest/dg/r_Boolean_type.html) will return `true` for 't' 'true' 'y' 'yes' '1', `false` for 'f' 'false' 'n' 'no' '0', null for others.
3. [postgresql](http://www.postgresql.org/docs/devel/static/datatype-boolean.html) will return `true` for 't' 'true' 'y' 'yes' 'on' '1', `false` for 'f' 'false' 'n' 'no' 'off' '0', throw exception for others.
4. [vertica](https://my.vertica.com/docs/5.0/HTML/Master/2983.htm) will return `true` for 't' 'true' 'y' 'yes' '1', `false` for 'f' 'false' 'n' 'no' '0', null for others.
5. [impala](http://www.cloudera.com/content/cloudera/en/documentation/cloudera-impala/latest/topics/impala_boolean.html) throw exception when try to cast string to boolean.
6. mysql, oracle, sqlserver don't have boolean type

Whether we should change the cast behavior to match other SQL systems is not decided yet; this PR is a test to see how many compatibility tests would fail if we changed it.

Author: Wenchen Fan <cloud0fan@outlook.com>

Closes #8698 from cloud-fan/string2boolean.
2015-09-11 14:15:16 -07:00
Yash Datta f892d927d7 [SPARK-7142] [SQL] Minor enhancement to BooleanSimplification Optimizer rule
Use these in the optimizer as well (sketched below):

            A and (not(A) or B) => A and B
            not(A and B) => not(A) or not(B)
            not(A or B) => not(A) and not(B)
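
A catalyst-style sketch of these rewrites (hypothetical and simplified; the real rule compares expressions semantically rather than by plain equality):

```scala
import org.apache.spark.sql.catalyst.expressions._

def simplify(e: Expression): Expression = e match {
  case And(a, Or(Not(b), c)) if a == b => And(a, c)  // A and (not(A) or B) => A and B
  case Not(And(a, b)) => Or(Not(a), Not(b))          // not(A and B) => not(A) or not(B)
  case Not(Or(a, b))  => And(Not(a), Not(b))         // not(A or B) => not(A) and not(B)
  case other => other
}
```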

Author: Yash Datta <Yash.Datta@guavus.com>

Closes #5700 from saucam/bool_simp.
2015-09-10 10:34:00 -07:00
Yin Huai 7a9dcbc91d [SPARK-10441] [SQL] Save data correctly to json.
https://issues.apache.org/jira/browse/SPARK-10441

Author: Yin Huai <yhuai@databricks.com>

Closes #8597 from yhuai/timestampJson.
2015-09-08 14:10:12 -07:00
Wenchen Fan c3c0e431a6 [SPARK-10176] [SQL] Show partially analyzed plans when checkAnswer fails to analyze
This PR takes over https://github.com/apache/spark/pull/8389.

This PR improves `checkAnswer` to print the partially analyzed plan in addition to the user-friendly error message, in order to aid debugging failing tests.

In doing so, I ran into a conflict with the various ways that we bring a SQLContext into the tests. Depending on the trait, we refer to the current context as `sqlContext`, `_sqlContext`, `ctx` or `hiveContext`, with access modifiers `public`, `protected` and `private` depending on the defining class.

I propose we refactor as follows:

1. All tests should only refer to a `protected sqlContext` when testing general features, and `protected hiveContext` when it is a method that only exists on a `HiveContext`.
2. All tests should only import `testImplicits._` (i.e., don't import `TestHive.implicits._`)

Author: Wenchen Fan <cloud0fan@outlook.com>

Closes #8584 from cloud-fan/cleanupTests.
2015-09-04 15:17:37 -07:00
Davies Liu bb7f352393 [SPARK-10323] [SQL] fix nullability of In/InSet/ArrayContain
After this PR, In/InSet/ArrayContains will return null if the value is null, instead of false. They will also return null when the set/array contains a null and no match is found.

Author: Davies Liu <davies@databricks.com>

Closes #8492 from davies/fix_in.
2015-08-28 14:38:20 -07:00
Davies Liu 7467b52ed0 [SPARK-10215] [SQL] Fix precision of division (follow the rule in Hive)
Follow Hive's rule for decimal division; see ac755ebe26/ql/src/java/org/apache/hadoop/hive/ql/udf/generic/GenericUDFOPDivide.java (L113)

cc chenghao-intel

Author: Davies Liu <davies@databricks.com>

Closes #8415 from davies/decimal_div2.
2015-08-25 15:20:24 -07:00
Davies Liu ec89bd840a [SPARK-10245] [SQL] Fix decimal literals with precision < scale
In BigDecimal or java.math.BigDecimal, the precision can be smaller than the scale; for example, BigDecimal("0.001") has precision = 1 and scale = 3. But DecimalType requires that the precision be at least the scale, so we should use the maximum of precision and scale when inferring the schema from a decimal literal.
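
For illustration, using `java.math.BigDecimal` directly:

```scala
val d = new java.math.BigDecimal("0.001")
d.precision  // 1: number of significant digits
d.scale      // 3: digits to the right of the decimal point
// With this fix, the inferred type is DecimalType(max(1, 3), 3) = DecimalType(3, 3).
```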

Author: Davies Liu <davies@databricks.com>

Closes #8428 from davies/smaller_decimal.
2015-08-25 14:55:34 -07:00
Davies Liu 2f493f7e39 [SPARK-10177] [SQL] fix reading Timestamp in parquet from Hive
We misunderstood the Julian days and nanoseconds-of-day in Parquet (as TimestampType) from Hive/Impala; they overlap, so they can't be added together directly.

In order to avoid confusing rounding when doing the conversion, we use `2440588` as the Julian day of the unix epoch (which is really 2440587.5).
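
A minimal sketch of the conversion described above (an assumed helper, not the exact Spark code):

```scala
// Convert Parquet/Hive's (Julian day, nanoseconds of day) pair into
// microseconds since the unix epoch, treating Julian day 2440588 as day 0.
def fromJulianDay(days: Int, nanoseconds: Long): Long = {
  val secondsSinceEpoch = (days - 2440588).toLong * 24 * 60 * 60
  secondsSinceEpoch * 1000000L + nanoseconds / 1000L
}
```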

Author: Davies Liu <davies@databricks.com>
Author: Cheng Lian <lian@databricks.com>

Closes #8400 from davies/timestamp_parquet.
2015-08-25 16:00:44 +08:00
Josh Rosen 82268f07ab [SPARK-9293] [SPARK-9813] Analysis should check that set operations are only performed on tables with equal numbers of columns
This patch adds an analyzer rule to ensure that set operations (union, intersect, and except) are only applied to tables with the same number of columns. Without this rule, there are scenarios where invalid queries can return incorrect results instead of failing with error messages; SPARK-9813 provides one example of this problem. In other cases, the invalid query can crash at runtime with extremely confusing exceptions.

I also performed a bit of cleanup to refactor some of those logical operators' code into a common `SetOperation` base class.

Author: Josh Rosen <joshrosen@databricks.com>

Closes #7631 from JoshRosen/SPARK-9293.
2015-08-25 00:04:10 -07:00
Josh Rosen d7b4c09527 [SPARK-10190] Fix NPE in CatalystTypeConverters Decimal toScala converter
This adds a missing null check to the Decimal `toScala` converter in `CatalystTypeConverters`, fixing an NPE.

Author: Josh Rosen <joshrosen@databricks.com>

Closes #8401 from JoshRosen/SPARK-10190.
2015-08-24 16:17:45 -07:00
Reynold Xin 2f2686a73f [SPARK-9242] [SQL] Audit UDAF interface.
A few minor changes:

1. Improved documentation
2. Rename apply(distinct....) to distinct.
3. Changed MutableAggregationBuffer from a trait to an abstract class.
4. Renamed returnDataType to dataType to be more consistent with other expressions.

And unrelated to UDAFs:

1. Renamed file names in expressions to use suffix "Expressions" to be more consistent.
2. Moved regexp related expressions out to its own file.
3. Renamed StringComparison => StringPredicate.

Author: Reynold Xin <rxin@databricks.com>

Closes #8321 from rxin/SPARK-9242.
2015-08-19 17:35:41 -07:00
Wenchen Fan b0dbaec4f9 [SPARK-6489] [SQL] add column pruning for Generate
This PR takes over https://github.com/apache/spark/pull/5358

Author: Wenchen Fan <cloud0fan@outlook.com>

Closes #8268 from cloud-fan/6489.
2015-08-19 15:05:06 -07:00
Daoyuan Wang 373a376c04 [SPARK-10083] [SQL] CaseWhen should support type coercion of DecimalType and FractionalType
create t1 (a decimal(7, 2), b long);
select case when 1=1 then a else 1.0 end from t1;
select case when 1=1 then a else b end from t1;

Author: Daoyuan Wang <daoyuan.wang@intel.com>

Closes #8270 from adrian-wang/casewhenfractional.
2015-08-19 14:31:51 -07:00
Davies Liu 1f4c4fe6df [SPARK-10090] [SQL] fix decimal scale of division
We should round the result of multiplying/dividing decimals to the expected precision/scale, and also check for overflow.

Author: Davies Liu <davies@databricks.com>

Closes #8287 from davies/decimal_division.
2015-08-19 14:03:47 -07:00
Davies Liu 5af3838d2e [SPARK-10038] [SQL] fix bug in generated unsafe projection when there is binary in ArrayData
In Java, the type for an array of arrays is slightly different from that of other arrays.

cc cloud-fan

Author: Davies Liu <davies@databricks.com>

Closes #8250 from davies/array_binary.
2015-08-17 23:27:55 -07:00
Yijie Shen b265e282b6 [SPARK-9526] [SQL] Utilize randomized tests to reveal potential bugs in sql expressions
JIRA: https://issues.apache.org/jira/browse/SPARK-9526

This PR is a follow up of #7830, aiming at utilizing randomized tests to reveal more potential bugs in sql expression.

Author: Yijie Shen <henry.yijieshen@gmail.com>

Closes #7855 from yjshen/property_check.
2015-08-17 14:10:19 -07:00
Wenchen Fan 570567258b [SPARK-9955] [SQL] correct error message for aggregate
We should skip unresolved `LogicalPlan`s in `PullOutNondeterministic`, as calling `output` on an unresolved `LogicalPlan` will produce a confusing error message.

Author: Wenchen Fan <cloud0fan@outlook.com>

Closes #8203 from cloud-fan/error-msg and squashes the following commits:

1c67ca7 [Wenchen Fan] move test
7593080 [Wenchen Fan] correct error message for aggregate
2015-08-15 14:13:12 -07:00
Wenchen Fan ec29f2034a [SPARK-9634] [SPARK-9323] [SQL] cleanup unnecessary Aliases in LogicalPlan at the end of analysis
Also alias the ExtractValue instead of wrapping it with UnresolvedAlias when resolving attributes in LogicalPlan, as this alias will be trimmed if it's unnecessary.

Based on #7957 without the changes to mllib, but instead maintaining earlier behavior when using `withColumn` on expressions that already have metadata.

Author: Wenchen Fan <cloud0fan@outlook.com>
Author: Michael Armbrust <michael@databricks.com>

Closes #8215 from marmbrus/pr/7957.
2015-08-14 20:59:54 -07:00
Andrew Or 8187b3ae47 [SPARK-9580] [SQL] Replace singletons in SQL tests
A fundamental limitation of the existing SQL tests is that *there is simply no way to create your own `SparkContext`*. This is a serious limitation because the user may wish to use a different master or config. As a case in point, `BroadcastJoinSuite` is entirely commented out because there is no way to make it pass with the existing infrastructure.

This patch removes the singletons `TestSQLContext` and `TestData`, and instead introduces a `SharedSQLContext` that starts a context per suite. Unfortunately the singletons were so ingrained in the SQL tests that this patch necessarily needed to touch *all* the SQL test files.

Author: Andrew Or <andrew@databricks.com>

Closes #8111 from andrewor14/sql-tests-refactor.
2015-08-13 17:42:01 -07:00
Cheng Lian 6993031011 [SPARK-9757] [SQL] Fixes persistence of Parquet relation with decimal column
PR #7967 enables us to save data source relations to metastore in Hive compatible format when possible. But it fails to persist Parquet relations with decimal column(s) to Hive metastore of versions lower than 1.2.0. This is because `ParquetHiveSerDe` in Hive versions prior to 1.2.0 doesn't support decimal. This PR checks for this case and falls back to Spark SQL specific metastore table format.

Author: Yin Huai <yhuai@databricks.com>
Author: Cheng Lian <lian@databricks.com>

Closes #8130 from liancheng/spark-9757/old-hive-parquet-decimal.
2015-08-13 16:16:50 +08:00
Josh Rosen dfe347d2ca [SPARK-9785] [SQL] HashPartitioning compatibility should consider expression ordering
HashPartitioning compatibility is currently defined w.r.t the _set_ of expressions, but the ordering of those expressions matters when computing hash codes; this could lead to incorrect answers if we mistakenly avoided a shuffle based on the assumption that HashPartitionings with the same expressions in different orders will produce equivalent row hashcodes. The first commit adds a regression test which illustrates this problem.

The fix for this is simple: make `HashPartitioning.compatibleWith` and `HashPartitioning.guarantees` sensitive to the expression ordering (i.e. do not perform set comparison).

Author: Josh Rosen <joshrosen@databricks.com>

Closes #8074 from JoshRosen/hashpartitioning-compatiblewith-fixes and squashes the following commits:

b61412f [Josh Rosen] Demonstrate that I haven't cheated in my fix
0b4d7d9 [Josh Rosen] Update so that clusteringSet is only used in satisfies().
dc9c9d7 [Josh Rosen] Add failing regression test for SPARK-9785
2015-08-11 08:52:15 -07:00
Reynold Xin d378396f86 [SPARK-9815] Rename PlatformDependent.UNSAFE -> Platform.
PlatformDependent.UNSAFE is way too verbose.

Author: Reynold Xin <rxin@databricks.com>

Closes #8094 from rxin/SPARK-9815 and squashes the following commits:

229b603 [Reynold Xin] [SPARK-9815] Rename PlatformDependent.UNSAFE -> Platform.
2015-08-11 08:41:06 -07:00
Davies Liu fe2fb7fb71 [SPARK-9620] [SQL] generated UnsafeProjection should support many columns or large expressions
Currently, a generated UnsafeProjection can reach Java's 64KB bytecode limit per method. This patch splits the generated expressions into multiple functions to avoid that limitation.

After this patch, we can work well with tables that have up to 64k columns (at which point we hit Java's limit on the number of constants per class), which should be enough in practice.
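
A hand-written sketch of the splitting strategy (hypothetical helper, not Spark's actual codegen API):

```scala
// Split the per-expression Java snippets into chunks and wrap each chunk in
// its own method, so no single generated method exceeds the 64KB bytecode
// limit; the caller splices the returned source into the generated class.
def splitExpressions(snippets: Seq[String], chunkSize: Int = 100): String = {
  val methods = snippets.grouped(chunkSize).zipWithIndex.map { case (body, i) =>
    s"private void apply_$i(InternalRow i) {\n${body.mkString("\n")}\n}"
  }.toSeq
  val calls = methods.indices.map(i => s"apply_$i(i);").mkString("\n")
  calls + "\n\n" + methods.mkString("\n")
}
```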

cc rxin

Author: Davies Liu <davies@databricks.com>

Closes #8044 from davies/wider_table and squashes the following commits:

9192e6c [Davies Liu] fix generated safe projection
d1ef81a [Davies Liu] fix failed tests
737b3d3 [Davies Liu] Merge branch 'master' of github.com:apache/spark into wider_table
ffcd132 [Davies Liu] address comments
1b95be4 [Davies Liu] put the generated class into sql package
77ed72d [Davies Liu] address comments
4518e17 [Davies Liu] Merge branch 'master' of github.com:apache/spark into wider_table
75ccd01 [Davies Liu] Merge branch 'master' of github.com:apache/spark into wider_table
495e932 [Davies Liu] support wider table with more than 1k columns for generated projections
2015-08-10 13:52:18 -07:00
Yijie Shen 68ccc6e184 [SPARK-8930] [SQL] Throw an AnalysisException with meaningful messages if DataFrame#explode takes a star in expressions
Author: Yijie Shen <henry.yijieshen@gmail.com>

Closes #8057 from yjshen/explode_star and squashes the following commits:

eae181d [Yijie Shen] change explaination message
54c9d11 [Yijie Shen] meaning message for * in explode
2015-08-09 11:44:51 -07:00
Wenchen Fan 2432c2e239 [SPARK-8382] [SQL] Improve Analysis Unit test framework
Author: Wenchen Fan <cloud0fan@outlook.com>

Closes #8025 from cloud-fan/analysis and squashes the following commits:

51461b1 [Wenchen Fan] move test file to test folder
ec88ace [Wenchen Fan] Improve Analysis Unit test framework
2015-08-07 11:28:43 -07:00
Wenchen Fan 1f62f104c7 [SPARK-9632][SQL] update InternalRow.toSeq to make it accept data type info
This re-applies #7955, which was reverted due to a race condition to fix build breaking.

Author: Wenchen Fan <cloud0fan@outlook.com>
Author: Reynold Xin <rxin@databricks.com>

Closes #8002 from rxin/InternalRow-toSeq and squashes the following commits:

332416a [Reynold Xin] Merge pull request #7955 from cloud-fan/toSeq
21665e2 [Wenchen Fan] fix hive again...
4addf29 [Wenchen Fan] fix hive
bc16c59 [Wenchen Fan] minor fix
33d802c [Wenchen Fan] pass data type info to InternalRow.toSeq
3dd033e [Wenchen Fan] move the default special getters implementation from InternalRow to BaseGenericInternalRow
2015-08-06 13:11:59 -07:00
Davies Liu 2eca46a17a Revert "[SPARK-9632][SQL] update InternalRow.toSeq to make it accept data type info"
This reverts commit 6e009cb9c4.
2015-08-06 11:15:37 -07:00
Wenchen Fan 6e009cb9c4 [SPARK-9632][SQL] update InternalRow.toSeq to make it accept data type info
Author: Wenchen Fan <cloud0fan@outlook.com>

Closes #7955 from cloud-fan/toSeq and squashes the following commits:

21665e2 [Wenchen Fan] fix hive again...
4addf29 [Wenchen Fan] fix hive
bc16c59 [Wenchen Fan] minor fix
33d802c [Wenchen Fan] pass data type info to InternalRow.toSeq
3dd033e [Wenchen Fan] move the default special getters implementation from InternalRow to BaseGenericInternalRow
2015-08-06 10:40:54 -07:00
Davies Liu 5b965d64ee [SPARK-9644] [SQL] Support update DecimalType with precision > 18 in UnsafeRow
In order to support updating a variable-length (actually fixed-length) object, the space should be preserved even when it's null. And we can't call setNullAt(i) for it anymore, because setNullAt(i) would remove the offset of the preserved space; we should call setDecimal(i, null, precision) instead.

After this, we can do hash-based aggregation on DecimalType with precision > 18. In tests, this decreased the end-to-end run time of an aggregation query from 37 seconds (sort-based) to 24 seconds (hash-based).

cc rxin

Author: Davies Liu <davies@databricks.com>

Closes #7978 from davies/update_decimal and squashes the following commits:

bed8100 [Davies Liu] isSettable -> isMutable
923c9eb [Davies Liu] address comments and fix bug
385891d [Davies Liu] Merge branch 'master' of github.com:apache/spark into update_decimal
36a1872 [Davies Liu] fix tests
cd6c524 [Davies Liu] support set decimal with precision > 18
2015-08-06 09:10:57 -07:00
zhichao.li aead18ffca [SPARK-8266] [SQL] add function translate
![translate](http://www.w3resource.com/PostgreSQL/postgresql-translate-function.png)

Author: zhichao.li <zhichao.li@intel.com>

Closes #7709 from zhichao-li/translate and squashes the following commits:

9418088 [zhichao.li] refine checking condition
f2ab77a [zhichao.li] clone string
9d88f2d [zhichao.li] fix indent
6aa2962 [zhichao.li] style
e575ead [zhichao.li] add python api
9d4bab0 [zhichao.li] add special case for fodable and refactor unittest
eda7ad6 [zhichao.li] update to use TernaryExpression
cdfd4be [zhichao.li] add function translate
2015-08-06 09:02:30 -07:00
Josh Rosen 9c878923db [SPARK-9054] [SQL] Rename RowOrdering to InterpretedOrdering; use newOrdering in SMJ
This patch renames `RowOrdering` to `InterpretedOrdering` and updates SortMergeJoin to use the `SparkPlan` methods for constructing its ordering so that it may benefit from codegen.

This is an updated version of #7408.

Author: Josh Rosen <joshrosen@databricks.com>

Closes #7973 from JoshRosen/SPARK-9054 and squashes the following commits:

e610655 [Josh Rosen] Add comment RE: Ascending ordering
34b8e0c [Josh Rosen] Import ordering
be19a0f [Josh Rosen] [SPARK-9054] [SQL] Rename RowOrdering to InterpretedOrdering; use newOrdering in more places.
2015-08-05 16:33:42 -07:00
Liang-Chi Hsieh e1e05873fc [SPARK-9403] [SQL] Add codegen support in In and InSet
This continues tarekauel's work in #7778.

Author: Liang-Chi Hsieh <viirya@appier.com>
Author: Tarek Auel <tarek.auel@googlemail.com>

Closes #7893 from viirya/codegen_in and squashes the following commits:

81ff97b [Liang-Chi Hsieh] For comments.
47761c6 [Liang-Chi Hsieh] Merge remote-tracking branch 'upstream/master' into codegen_in
cf4bf41 [Liang-Chi Hsieh] For comments.
f532b3c [Liang-Chi Hsieh] Merge remote-tracking branch 'upstream/master' into codegen_in
446bbcd [Liang-Chi Hsieh] Fix bug.
b3d0ab4 [Liang-Chi Hsieh] Merge remote-tracking branch 'upstream/master' into codegen_in
4610eff [Liang-Chi Hsieh] Relax the types of references and update optimizer test.
224f18e [Liang-Chi Hsieh] Beef up the test cases for In and InSet to include all primitive data types.
86dc8aa [Liang-Chi Hsieh] Only convert In to InSet when the number of items in set is more than the threshold.
b7ded7e [Tarek Auel] [SPARK-9403][SQL] codeGen in / inSet
2015-08-05 11:38:56 -07:00
Yin Huai 1f8c364b9c [SPARK-9141] [SQL] [MINOR] Fix comments of PR #7920
This is a follow-up of https://github.com/apache/spark/pull/7920 to fix comments.

Author: Yin Huai <yhuai@databricks.com>

Closes #7964 from yhuai/SPARK-9141-follow-up and squashes the following commits:

4d0ee80 [Yin Huai] Fix comments.
2015-08-05 11:03:02 -07:00
Michael Armbrust 23d982204b [SPARK-9141] [SQL] Remove project collapsing from DataFrame API
Currently we collapse successive projections that are added by `withColumn`.  However, this optimization violates the constraint that adding nodes to a plan will never change its analyzed form and thus breaks caching.  Instead of doing early optimization, in this PR I just fix some low-hanging slowness in the analyzer.  In particular, I add a mechanism for skipping already analyzed subplans, `resolveOperators` and `resolveExpression`.  Since trees are generally immutable after construction, it's safe to annotate a plan as already analyzed as any transformation will create a new tree with this bit no longer set.

Together these result in a faster analyzer than before, even with added timing instrumentation.
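
A toy illustration of the skip-if-analyzed idea (a hypothetical minimal tree, not Spark's `LogicalPlan`; the real `resolveOperators` applies a partial function post-order):

```scala
case class Node(name: String, children: Seq[Node], analyzed: Boolean = false) {
  def resolveOperators(rule: Node => Node): Node =
    if (analyzed) this  // trees are immutable, so an analyzed subtree is final
    else rule(copy(children = children.map(_.resolveOperators(rule))))
}
```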

```
Original Code
[info] 3430ms
[info] 2205ms
[info] 1973ms
[info] 1982ms
[info] 1916ms

Without Project Collapsing in DataFrame
[info] 44610ms
[info] 45977ms
[info] 46423ms
[info] 46306ms
[info] 54723ms

With analyzer optimizations
[info] 6394ms
[info] 4630ms
[info] 4388ms
[info] 4093ms
[info] 4113ms

With resolveOperators
[info] 2495ms
[info] 1380ms
[info] 1685ms
[info] 1414ms
[info] 1240ms
```

Author: Michael Armbrust <michael@databricks.com>

Closes #7920 from marmbrus/withColumnCache and squashes the following commits:

2145031 [Michael Armbrust] fix hive udfs tests
5a5a525 [Michael Armbrust] remove wrong comment
7a507d5 [Michael Armbrust] style
b59d710 [Michael Armbrust] revert small change
1fa5949 [Michael Armbrust] move logic into LogicalPlan, add tests
0e2cb43 [Michael Armbrust] Merge remote-tracking branch 'origin/master' into withColumnCache
c926e24 [Michael Armbrust] naming
e593a2d [Michael Armbrust] style
f5a929e [Michael Armbrust] [SPARK-9141][SQL] Remove project collapsing from DataFrame API
38b1c83 [Michael Armbrust] WIP
2015-08-05 09:01:45 -07:00
Davies Liu 781c8d71a0 [SPARK-9119] [SPARK-8359] [SQL] match Decimal.precision/scale with DecimalType
Let Decimal carry the correct precision and scale with DecimalType.

cc rxin yhuai

Author: Davies Liu <davies@databricks.com>

Closes #7925 from davies/decimal_scale and squashes the following commits:

e19701a [Davies Liu] some tweaks
57d78d2 [Davies Liu] fix tests
5d5bc69 [Davies Liu] match precision and scale with DecimalType
2015-08-04 23:12:49 -07:00
Pedro Rodriguez d34548587a [SPARK-8231] [SQL] Add array_contains
This PR is based on #7580, thanks to EntilZha

PR for work on https://issues.apache.org/jira/browse/SPARK-8231

Currently, I have an initial implementation for contains. Based on discussion on JIRA, it should behave same as Hive: https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/udf/generic/GenericUDFArrayContains.java#L102-L128

Main points are (a usage sketch follows the list):
1. If the array is empty, null, or the value is null, return false
2. If there is a type mismatch, throw error
3. If comparison is not supported, throw error
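
A hypothetical usage example matching those semantics (`df` is an assumed DataFrame):

```scala
import org.apache.spark.sql.functions._

// Per the rules above, an empty or null array, or a null search value,
// yields false rather than null; mismatched types raise an error instead.
df.select(array_contains(array(lit(1), lit(2), lit(3)), 2)).collect()  // true
```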

Closes #7580

Author: Pedro Rodriguez <prodriguez@trulia.com>
Author: Pedro Rodriguez <ski.rodriguez@gmail.com>
Author: Davies Liu <davies@databricks.com>

Closes #7949 from davies/array_contains and squashes the following commits:

d3c08bc [Davies Liu] use foreach() to avoid copy
bc3d1fe [Davies Liu] fix array_contains
719e37d [Davies Liu] Merge branch 'master' of github.com:apache/spark into array_contains
e352cf9 [Pedro Rodriguez] fixed diff from master
4d5b0ff [Pedro Rodriguez] added docs and another type check
ffc0591 [Pedro Rodriguez] fixed unit test
7a22deb [Pedro Rodriguez] Changed test to use strings instead of long/ints which are different between python 2 an 3
b5ffae8 [Pedro Rodriguez] fixed pyspark test
4e7dce3 [Pedro Rodriguez] added more docs
3082399 [Pedro Rodriguez] fixed unit test
46f9789 [Pedro Rodriguez] reverted change
d3ca013 [Pedro Rodriguez] Fixed type checking to match hive behavior, then added tests to insure this
8528027 [Pedro Rodriguez] added more tests
686e029 [Pedro Rodriguez] fix scala style
d262e9d [Pedro Rodriguez] reworked type checking code and added more tests
2517a58 [Pedro Rodriguez] removed unused import
28b4f71 [Pedro Rodriguez] fixed bug with type conversions and re-added tests
12f8795 [Pedro Rodriguez] fix scala style checks
e8a20a9 [Pedro Rodriguez] added python df (broken atm)
65b562c [Pedro Rodriguez] made array_contains nullable false
33b45aa [Pedro Rodriguez] reordered test
9623c64 [Pedro Rodriguez] fixed test
4b4425b [Pedro Rodriguez] changed Arrays in tests to Seqs
72cb4b1 [Pedro Rodriguez] added checkInputTypes and docs
69c46fb [Pedro Rodriguez] added tests and codegen
9e0bfc4 [Pedro Rodriguez] initial attempt at implementation
2015-08-04 22:34:02 -07:00
Yijie Shen a7fe48f687 [SPARK-9432][SQL] Audit expression unit tests to make sure we pass the proper numeric ranges
JIRA: https://issues.apache.org/jira/browse/SPARK-9432

Author: Yijie Shen <henry.yijieshen@gmail.com>

Closes #7933 from yjshen/numeric_ranges and squashes the following commits:

e719f78 [Yijie Shen] proper integral range check
2015-08-04 18:19:26 -07:00
Michael Armbrust 34a0eb2e89 [SPARK-9512][SQL] Revert SPARK-9251, Allow evaluation while sorting
The analysis rule has a bug and we ended up making the sorter still capable of doing evaluation, so let's revert this for now.

Author: Michael Armbrust <michael@databricks.com>

Closes #7906 from marmbrus/revertSortProjection and squashes the following commits:

2da6972 [Michael Armbrust] unrevert unrelated changes
4f2b00c [Michael Armbrust] Revert "[SPARK-9251][SQL] do not order by expressions which still need evaluation"
2015-08-04 10:07:53 -07:00
Davies Liu 73dedb589d [SPARK-8246] [SQL] Implement get_json_object
This is based on #7485, thanks to NathanHowell

Tests were copied from Hive, but do not seem to be super comprehensive. I've generally replicated Hive's unusual behavior rather than following a JSONPath reference, except for one case (as noted in the comments). I don't know if there is a way of fully replicating Hive's behavior without a slower TreeNode implementation, so I've erred on the side of performance instead.
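
A hypothetical usage example, following Hive's JSON path syntax:

```scala
// Extracts the value at path $.a.b from the JSON document; the result is
// returned as a string, as in Hive.
sqlContext.sql("""SELECT get_json_object('{"a": {"b": 2}}', '$.a.b')""").collect()
// -> "2"
```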

Author: Davies Liu <davies@databricks.com>
Author: Yin Huai <yhuai@databricks.com>
Author: Nathan Howell <nhowell@godaddy.com>

Closes #7901 from davies/get_json_object and squashes the following commits:

3ace9b9 [Davies Liu] Merge branch 'get_json_object' of github.com:davies/spark into get_json_object
98766fc [Davies Liu] Merge branch 'master' of github.com:apache/spark into get_json_object
a7dc6d0 [Davies Liu] Update JsonExpressionsSuite.scala
c818519 [Yin Huai] new results.
18ce26b [Davies Liu] fix tests
6ac29fb [Yin Huai] Golden files.
25eebef [Davies Liu] use HiveQuerySuite
e0ac6ec [Yin Huai] Golden answer files.
940c060 [Davies Liu] tweat code style
44084c5 [Davies Liu] Merge branch 'master' of github.com:apache/spark into get_json_object
9192d09 [Nathan Howell] Match Hive’s behavior for unwrapping arrays of one element
8dab647 [Nathan Howell] [SPARK-8246] [SQL] Implement get_json_object
2015-08-04 09:07:09 -07:00
Tarek Auel b1f88a38d5 [SPARK-8244] [SQL] string function: find in set
This PR is based on #7186 (just fixing the conflict), thanks to tarekauel.

find_in_set(string str, string strList): int

Returns the first occurrence of str in strList, where strList is a comma-delimited string. Returns null if either argument is null. Returns 0 if the first argument contains a comma. For example, find_in_set('ab', 'abc,b,ab,c,def') returns 3.

Only add this to SQL, not DataFrame.

Closes #7186

Author: Tarek Auel <tarek.auel@googlemail.com>
Author: Davies Liu <davies@databricks.com>

Closes #7900 from davies/find_in_set and squashes the following commits:

4334209 [Davies Liu] Merge branch 'master' of github.com:apache/spark into find_in_set
8f00572 [Davies Liu] Merge branch 'master' of github.com:apache/spark into find_in_set
243ede4 [Tarek Auel] [SPARK-8244][SQL] hive compatibility
1aaf64e [Tarek Auel] [SPARK-8244][SQL] unit test fix
e4093a4 [Tarek Auel] [SPARK-8244][SQL] final modifier for COMMA_UTF8
0d05df5 [Tarek Auel] Merge branch 'master' into SPARK-8244
208d710 [Tarek Auel] [SPARK-8244] address comments & bug fix
71b2e69 [Tarek Auel] [SPARK-8244] find_in_set
66c7fda [Tarek Auel] Merge branch 'master' into SPARK-8244
61b8ca2 [Tarek Auel] [SPARK-8224] removed loop and split; use unsafe String comparison
4f75a65 [Tarek Auel] Merge branch 'master' into SPARK-8244
e3b20c8 [Tarek Auel] [SPARK-8244] added type check
1c2bbb7 [Tarek Auel] [SPARK-8244] findInSet
2015-08-04 08:59:42 -07:00
Reynold Xin b2e4b85d2d Revert "[SPARK-9372] [SQL] Filter nulls in join keys"
This reverts commit 687c8c3715.
2015-08-03 14:51:15 -07:00
Reynold Xin 8be198c869 Two minor comments from code review on 191bf2689. 2015-08-03 04:26:18 -07:00
Davies Liu 191bf2689d [SPARK-9518] [SQL] cleanup generated UnsafeRowJoiner and fix bug
Currently, when copying the bitsets, we didn't consider that row1 may not sit at the beginning of the byte array.

cc rxin

Author: Davies Liu <davies@databricks.com>

Closes #7892 from davies/clean_join and squashes the following commits:

14cce9e [Davies Liu] cleanup generated UnsafeRowJoiner and fix bug
2015-08-03 04:23:26 -07:00
Yijie Shen 98d6d9c7a9 [SPARK-9549][SQL] fix bugs in expressions
JIRA: https://issues.apache.org/jira/browse/SPARK-9549

This PR fixes the following bugs:
1.  `UnaryMinus`'s codegen version would fail to compile when the input is `Long.MinValue`.
2.  `BinaryComparison` would fail to compile in codegen mode when comparing Boolean types.
3.  `AddMonths` would fail if passed a huge negative month, which would lead to accessing a negative index of the `monthDays` array.
4.  `Nanvl` would fail with operands of different types.

Author: Yijie Shen <henry.yijieshen@gmail.com>

Closes #7882 from yjshen/minor_bug_fix and squashes the following commits:

41bbd2c [Yijie Shen] fix bug in Nanvl type coercion
3dee204 [Yijie Shen] address comments
4fa5de0 [Yijie Shen] fix bugs in expressions
2015-08-03 00:15:24 -07:00
Wenchen Fan 608353c8e8 [SPARK-9404][SPARK-9542][SQL] unsafe array data and map data
This PR adds an UnsafeArrayData; currently we encode it this way:

* the first 4 bytes are the number of elements
* then each 4 bytes is the start offset of an element, unless it is negative, in which case the element is null
* followed by the elements themselves

An example: [10, 11, 12, 13, null, 14] will be encoded as:
6, 28, 32, 36, 40, -44, 44, 10, 11, 12, 13, 14

Note that, when we read an UnsafeArrayData from bytes, we can read the first 4 bytes as numElements and take the rest (first 4 bytes skipped) as the value region.

Unsafe map data just uses two unsafe arrays: the first 4 bytes are the number of elements, the second 4 bytes are the numBytes of the key array, and then follow the key array data and the value array data.

Author: Wenchen Fan <cloud0fan@outlook.com>

Closes #7752 from cloud-fan/unsafe-array and squashes the following commits:

3269bd7 [Wenchen Fan] fix a bug
6445289 [Wenchen Fan] add unit tests
49adf26 [Wenchen Fan] add unsafe map
20d1039 [Wenchen Fan] add comments and unsafe converter
821b8db [Wenchen Fan] add unsafe array
2015-08-02 23:41:16 -07:00
Yin Huai 687c8c3715 [SPARK-9372] [SQL] Filter nulls in join keys
This PR adds an optimization rule, `FilterNullsInJoinKey`, to add `Filter` before join operators to filter out rows having null values for join keys.

This optimization is guarded by a new SQL conf, `spark.sql.advancedOptimization`.

The code in this PR was authored by yhuai; I'm opening this PR to factor out this change from #7685, a larger pull request which contains two other optimizations.

Author: Yin Huai <yhuai@databricks.com>
Author: Josh Rosen <joshrosen@databricks.com>

Closes #7768 from JoshRosen/filter-nulls-in-join-key and squashes the following commits:

c02fc3f [Yin Huai] Address Josh's comments.
0a8e096 [Yin Huai] Update comments.
ea7d5a6 [Yin Huai] Make sure we do not keep adding filters.
be88760 [Yin Huai] Make it clear that FilterNullsInJoinKeySuite.scala is used to test FilterNullsInJoinKey.
8bb39ad [Yin Huai] Fix non-deterministic tests.
303236b [Josh Rosen] Revert changes that are unrelated to null join key filtering
40eeece [Josh Rosen] Merge remote-tracking branch 'origin/master' into filter-nulls-in-join-key
c57a954 [Yin Huai] Bug fix.
d3d2e64 [Yin Huai] First round of cleanup.
f9516b0 [Yin Huai] Style
c6667e7 [Yin Huai] Add PartitioningCollection.
e616d3b [Yin Huai] wip
7c2d2d8 [Yin Huai] Bug fix and refactoring.
69bb072 [Yin Huai] Introduce NullSafeHashPartitioning and NullUnsafePartitioning.
d5b84c3 [Yin Huai] Do not add unnessary filters.
2201129 [Yin Huai] Filter out rows that will not be joined in equal joins early.
2015-08-02 23:32:09 -07:00
Yin Huai 114ff926fc [SPARK-2205] [SQL] Avoid unnecessary exchange operators in multi-way joins
This PR adds `PartitioningCollection`, which is used to represent the `outputPartitioning` of SparkPlans with multiple children (e.g. `ShuffledHashJoin`). So, a `SparkPlan` can have multiple descriptions of its partitioning scheme. Taking `ShuffledHashJoin` as an example, it has two descriptions of its partitioning scheme, i.e. `left.outputPartitioning` and `right.outputPartitioning`. So a query like `select * from t1 join t2 on (t1.x = t2.x) join t3 on (t2.x = t3.x)` will only have three Exchange operators (when shuffled joins are needed) instead of four.

The code in this PR was authored by yhuai; I'm opening this PR to factor out this change from #7685, a larger pull request which contains two other optimizations.

Author: Yin Huai <yhuai@databricks.com>
Author: Josh Rosen <joshrosen@databricks.com>

Closes #7773 from JoshRosen/multi-way-join-planning-improvements and squashes the following commits:

5c45924 [Josh Rosen] Merge remote-tracking branch 'origin/master' into multi-way-join-planning-improvements
cd8269b [Josh Rosen] Refactor test to use SQLTestUtils
2963857 [Yin Huai] Revert unnecessary SqlConf change.
73913f7 [Yin Huai] Add comments and test. Also, revert the change in ShuffledHashOuterJoin for now.
4a99204 [Josh Rosen] Delete unrelated expression change
884ab95 [Josh Rosen] Carve out only SPARK-2205 changes.
247e5fa [Josh Rosen] Merge remote-tracking branch 'origin/master' into multi-way-join-planning-improvements
c57a954 [Yin Huai] Bug fix.
d3d2e64 [Yin Huai] First round of cleanup.
f9516b0 [Yin Huai] Style
c6667e7 [Yin Huai] Add PartitioningCollection.
e616d3b [Yin Huai] wip
7c2d2d8 [Yin Huai] Bug fix and refactoring.
69bb072 [Yin Huai] Introduce NullSafeHashPartitioning and NullUnsafePartitioning.
d5b84c3 [Yin Huai] Do not add unnessary filters.
2201129 [Yin Huai] Filter out rows that will not be joined in equal joins early.
2015-08-02 20:44:23 -07:00
Reynold Xin 30e89111d6 [SPARK-9546][SQL] Centralize orderable data type checking.
This pull request creates two isOrderable functions in RowOrdering that can be used to check whether a data type or a sequence of expressions can be used in sorting.

Author: Reynold Xin <rxin@databricks.com>

Closes #7880 from rxin/SPARK-9546 and squashes the following commits:

f9e322d [Reynold Xin] Fixed tests.
0439b43 [Reynold Xin] [SPARK-9546][SQL] Centralize orderable data type checking.
2015-08-02 20:12:03 -07:00
Reynold Xin 9d03ad910b [SPARK-9543][SQL] Add randomized testing for UnsafeKVExternalSorter.
The detailed approach is documented in UnsafeKVExternalSorterSuite.testKVSorter(), working as follows:

1. Create input by generating data randomly based on the given key/value schema (which is also randomly drawn from a list of candidate types)
2. Run UnsafeKVExternalSorter on the generated data
3. Collect the output from the sorter, and make sure the keys are sorted in ascending order
4. Sort the input by both key and value, and sort the sorter output also by both key and value. Compare the sorted input and sorted output together to make sure all the key/values match.
5. Check memory allocation to make sure there is no memory leak.

There is also a spill flag. When set to true, the sorter will spill probabilistically roughly every 100 records.

Author: Reynold Xin <rxin@databricks.com>

Closes #7873 from rxin/kvsorter-randomized-test and squashes the following commits:

a08c251 [Reynold Xin] Resource cleanup.
0488b5c [Reynold Xin] [SPARK-9543][SQL] Add randomized testing for UnsafeKVExternalSorter.
2015-08-02 17:54:30 -07:00
Liang-Chi Hsieh 0722f43316 [SPARK-7937][SQL] Support comparison on StructType
This brings #6519 up-to-date with master branch.

Closes #6519.

Author: Liang-Chi Hsieh <viirya@appier.com>
Author: Liang-Chi Hsieh <viirya@gmail.com>
Author: Reynold Xin <rxin@databricks.com>

Closes #7877 from rxin/sort-struct and squashes the following commits:

4968231 [Reynold Xin] Minor fixes.
2537813 [Reynold Xin] Merge branch 'compare_named_struct' of github.com:viirya/spark-1 into sort-struct
d2ba8ad [Liang-Chi Hsieh] Remove unused import.
3a3f40e [Liang-Chi Hsieh] Don't need to add compare to InternalRow because we can use RowOrdering.
dae6aad [Liang-Chi Hsieh] Fix nested struct.
d5349c7 [Liang-Chi Hsieh] Merge remote-tracking branch 'upstream/master' into compare_named_struct
43d4354 [Liang-Chi Hsieh] Merge remote-tracking branch 'upstream/master' into compare_named_struct
1f66196 [Liang-Chi Hsieh] Reuse RowOrdering and GenerateOrdering.
f8b2e9c [Liang-Chi Hsieh] Merge remote-tracking branch 'upstream/master' into compare_named_struct
1187a65 [Liang-Chi Hsieh] Fix scala style.
9d67f68 [Liang-Chi Hsieh] Fix wrong merging.
8f4d775 [Liang-Chi Hsieh] Merge remote-tracking branch 'upstream/master' into compare_named_struct
94b27d5 [Liang-Chi Hsieh] Remove test for error on complex type comparison.
2071693 [Liang-Chi Hsieh] Merge remote-tracking branch 'upstream/master' into compare_named_struct
3c142e4 [Liang-Chi Hsieh] Fix scala style.
cf58dc3 [Liang-Chi Hsieh] Use checkAnswer.
f651b8d [Liang-Chi Hsieh] Remove Either and move orderings to BinaryComparison to reuse it.
b6e1009 [Liang-Chi Hsieh] Fix scala style.
3922b54 [Liang-Chi Hsieh] Support ordering on named_struct.
2015-08-02 17:53:44 -07:00
Davies Liu 16b928c543 [SPARK-9529] [SQL] improve TungstenSort on DecimalType
Generate sort prefixes for DecimalType, and fix the random decimal generator

cc JoshRosen

Author: Davies Liu <davies@databricks.com>

Closes #7857 from davies/sort_decimal and squashes the following commits:

2433959 [Davies Liu] Merge branch 'master' of github.com:apache/spark into sort_decimal
de24253 [Davies Liu] fix style
0a54c1a [Davies Liu] sort decimal
2015-08-01 23:36:06 -07:00
Davies Liu 57084e0c7c [SPARK-9459] [SQL] use generated FromUnsafeProjection to do deep copy for UTF8String and struct
When accessing a column in UnsafeRow, it's good to avoid the copy; we should instead do a deep copy when turning the UnsafeRow into a generic Row. This PR brings a generated FromUnsafeProjection to do that.

This PR also fixes the expressions that cache a UTF8String, which should copy it as well.

Author: Davies Liu <davies@databricks.com>

Closes #7840 from davies/avoid_copy and squashes the following commits:

230c8a1 [Davies Liu] address comment
fd797c9 [Davies Liu] Merge branch 'master' of github.com:apache/spark into avoid_copy
e095dd0 [Davies Liu] rollback rename
8ef5b0b [Davies Liu] copy String in Columnar
81360b8 [Davies Liu] fix class name
9aecb88 [Davies Liu] use FromUnsafeProjection to do deep copy for UTF8String and struct
2015-08-01 21:50:42 -07:00
Davies Liu c1b0cbd762 [SPARK-8185] [SPARK-8188] [SPARK-8191] [SQL] function datediff, to_utc_timestamp, from_utc_timestamp
This PR is based on #7643 , thanks to adrian-wang

Author: Davies Liu <davies@databricks.com>
Author: Daoyuan Wang <daoyuan.wang@intel.com>

Closes #7847 from davies/datediff and squashes the following commits:

74333d7 [Davies Liu] fix bug
22d8a8c [Davies Liu] optimize
85cdd21 [Davies Liu] remove unnecessary tests
241d90c [Davies Liu] Merge branch 'master' of github.com:apache/spark into datediff
e9dc0f5 [Davies Liu] fix datediff/to_utc_timestamp/from_utc_timestamp
c360447 [Daoyuan Wang] function datediff, to_utc_timestamp, from_utc_timestamp (commits merged)
2015-08-01 21:46:46 -07:00
HuJiayin 00cd92f32f [SPARK-8269] [SQL] string function: initcap
This PR is based on #7208 , thanks to HuJiayin

Closes #7208
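
For reference, a usage sketch in spark-shell (assuming a 1.5-era `sqlContext` in scope); `initcap` capitalizes the first letter of each word and lowercases the rest:

```
sqlContext.sql("SELECT initcap('spark sql')").show()  // Spark Sql
```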

Author: HuJiayin <jiayin.hu@intel.com>
Author: Davies Liu <davies@databricks.com>

Closes #7850 from davies/initcap and squashes the following commits:

54472e9 [Davies Liu] fix python test
17ffe51 [Davies Liu] Merge branch 'master' of github.com:apache/spark into initcap
ca46390 [Davies Liu] Merge branch 'master' of github.com:apache/spark into initcap
3a906e4 [Davies Liu] implement title case in UTF8String
8b2506a [HuJiayin] Update functions.py
2cd43e5 [HuJiayin] fix python style check
b616c0e [HuJiayin] add python api
1f5a0ef [HuJiayin] add codegen
7e0c604 [HuJiayin] Merge branch 'master' of https://github.com/apache/spark into initcap
6a0b958 [HuJiayin] add column
c79482d [HuJiayin] support soundex
7ce416b [HuJiayin] support initcap rebase code
2015-08-01 21:44:57 -07:00
zhichao.li c5166f7a69 [SPARK-8263] [SQL] substr/substring should also support binary type
This is based on #7641, thanks to zhichao-li

Closes #7641

Author: zhichao.li <zhichao.li@intel.com>
Author: Davies Liu <davies@databricks.com>

Closes #7848 from davies/substr and squashes the following commits:

461b709 [Davies Liu] remove bytearry from tests
b45377a [Davies Liu] Merge branch 'master' of github.com:apache/spark into substr
01d795e [zhichao.li] scala style
99aa130 [zhichao.li] add substring to dataframe
4f68bfe [zhichao.li] add binary type support for substring
2015-08-01 08:48:46 -07:00
Cheng Hao cf6c9ca32a [SPARK-8232] [SQL] Add sort_array support
This PR is based on #7581 , just fix the conflict.
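
A usage sketch in spark-shell (1.5-era `sqlContext` assumed); the optional second argument selects ascending or descending order:

```
sqlContext.sql("SELECT sort_array(array(3, 1, 2))").show()         // [1,2,3]
sqlContext.sql("SELECT sort_array(array(3, 1, 2), false)").show()  // [3,2,1]
```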

Author: Cheng Hao <hao.cheng@intel.com>
Author: Davies Liu <davies@databricks.com>

Closes #7851 from davies/sort_array and squashes the following commits:

a80ef66 [Davies Liu] fix conflict
7cfda65 [Davies Liu] Merge branch 'master' of github.com:apache/spark into sort_array
664c960 [Cheng Hao] update the sort_array by using the ArrayData
276d2d5 [Cheng Hao] add empty line
0edab9c [Cheng Hao] Add ascending/descending support for sort_array
80fc0f8 [Cheng Hao] Add type checking
a42b678 [Cheng Hao] Add sort_array support
2015-08-01 08:32:29 -07:00
Davies Liu 60ea7ab4bb Revert "[SPARK-8232] [SQL] Add sort_array support"
This reverts commit 67ad4e21fc.
2015-08-01 00:41:15 -07:00
Wenchen Fan 1d59a4162b [SPARK-9480][SQL] add MapData and cleanup internal row stuff
This PR adds a `MapData` as the internal representation of map type in Spark SQL, and provides a default implementation backed by just two `ArrayData`s.

After that, we have specialized getters for all internal types, so I removed the generic getter in `ArrayData` and added a specialized `toArray` for it.
Also did some refactoring and cleanup for `InternalRow` and its subclasses.
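
A minimal sketch (not the actual Spark classes) of the "map as two parallel arrays" idea:

```
// Keys and values are stored as two same-length arrays; entry i of the map is
// (keyArray(i), valueArray(i)). In Spark these would be ArrayData, not plain
// Scala arrays.
class SimpleMapData(keyArray: Array[Any], valueArray: Array[Any]) {
  require(keyArray.length == valueArray.length, "parallel arrays must align")
  def numElements: Int = keyArray.length
  def foreachEntry(f: (Any, Any) => Unit): Unit =
    (0 until numElements).foreach(i => f(keyArray(i), valueArray(i)))
}
```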

Author: Wenchen Fan <cloud0fan@outlook.com>

Closes #7799 from cloud-fan/map-data and squashes the following commits:

77d482f [Wenchen Fan] fix python
e8f6682 [Wenchen Fan] skip MapData equality check in HiveInspectorSuite
40cc9db [Wenchen Fan] add toString
6e06ec9 [Wenchen Fan] some more cleanup
a90aca1 [Wenchen Fan] add MapData
2015-08-01 00:17:15 -07:00
Cheng Hao 67ad4e21fc [SPARK-8232] [SQL] Add sort_array support
Add expression `sort_array` support.

Author: Cheng Hao <hao.cheng@intel.com>

This patch had conflicts when merged, resolved by
Committer: Davies Liu <davies.liu@gmail.com>

Closes #7581 from chenghao-intel/sort_array and squashes the following commits:

664c960 [Cheng Hao] update the sort_array by using the ArrayData
276d2d5 [Cheng Hao] add empty line
0edab9c [Cheng Hao] Add ascending/descending support for sort_array
80fc0f8 [Cheng Hao] Add type checking
a42b678 [Cheng Hao] Add sort_array support
2015-07-31 23:11:22 -07:00
Liang-Chi Hsieh 3320b0ba26 [SPARK-9415][SQL] Throw AnalysisException when using MapType on Join and Aggregate
JIRA: https://issues.apache.org/jira/browse/SPARK-9415

Following up #7787, we shouldn't use MapType as grouping keys or join keys either.

Author: Liang-Chi Hsieh <viirya@appier.com>

Closes #7819 from viirya/map_join_groupby and squashes the following commits:

005ee0c [Liang-Chi Hsieh] For comments.
7463398 [Liang-Chi Hsieh] MapType can't be used as join keys, grouping keys.
2015-07-31 22:26:30 -07:00
zhichao.li 6996bd2e81 [SPARK-8264][SQL]add substring_index function
This PR is based on #7533 , thanks to zhichao-li

Closes #7533
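
A usage sketch in spark-shell (1.5-era `sqlContext` assumed): `substring_index(str, delim, count)` returns the substring before `count` occurrences of the delimiter, counting from the right when `count` is negative:

```
sqlContext.sql("SELECT substring_index('www.apache.org', '.', 2)").show()   // www.apache
sqlContext.sql("SELECT substring_index('www.apache.org', '.', -1)").show()  // org
```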

Author: zhichao.li <zhichao.li@intel.com>
Author: Davies Liu <davies@databricks.com>

Closes #7843 from davies/str_index and squashes the following commits:

391347b [Davies Liu] add python api
3ce7802 [Davies Liu] fix substringIndex
f2d29a1 [Davies Liu] Merge branch 'master' of github.com:apache/spark into str_index
515519b [zhichao.li] add foldable and remove null checking
9546991 [zhichao.li] scala style
67c253a [zhichao.li] hide some apis and clean code
b19b013 [zhichao.li] add codegen and clean code
ac863e9 [zhichao.li] reduce the calling of numChars
12e108f [zhichao.li] refine unittest
d92951b [zhichao.li] add lastIndexOf
52d7b03 [zhichao.li] add substring_index function
2015-07-31 21:18:01 -07:00
Reynold Xin 03377d2522 [SPARK-9358][SQL] Code generation for UnsafeRow joiner.
This patch creates a code generated unsafe row concatenator that can be used to concatenate/join two UnsafeRows into a single UnsafeRow.

Since this low-level code is inherently hard to test, the test suites employ randomized testing heavily in order to guarantee correctness.

Author: Reynold Xin <rxin@databricks.com>

Closes #7821 from rxin/rowconcat and squashes the following commits:

8717f35 [Reynold Xin] Rebase and code review.
72c5d8e [Reynold Xin] Fixed a bug.
a84ed2e [Reynold Xin] Fixed offset.
40c3fb2 [Reynold Xin] Reset random data generator.
f0913aa [Reynold Xin] Test fixes.
6687b6f [Reynold Xin] Updated documentation.
00354b9 [Reynold Xin] Support concat data as well.
e9a4347 [Reynold Xin] Updated.
6269f96 [Reynold Xin] Fixed a bug .
0f89716 [Reynold Xin] [SPARK-9358][SQL][WIP] Code generation for UnsafeRow concat.
2015-07-31 21:09:00 -07:00
Josh Rosen 8cb415a4b9 [SPARK-9451] [SQL] Support entries larger than default page size in BytesToBytesMap & integrate with ShuffleMemoryManager
This patch adds support for entries larger than the default page size in BytesToBytesMap.  These large rows are handled by allocating special overflow pages to hold individual entries.

In addition, this patch integrates BytesToBytesMap with the ShuffleMemoryManager:

- Move BytesToBytesMap from `unsafe` to `core` so that it can import `ShuffleMemoryManager`.
- Before allocating new data pages, ask the ShuffleMemoryManager to reserve the memory:
  - `putNewKey()` now returns a boolean to indicate whether the insert succeeded or failed due to a lack of memory.  The caller can use this value to respond to the memory pressure (e.g. by spilling); see the sketch after this list.
- `UnsafeFixedWidthAggregationMap.getAggregationBuffer()` now returns `null` to signal failure due to a lack of memory.
- Updated all uses of these classes to handle these error conditions.
- Added new tests for allocating large records and for allocations which fail due to memory pressure.
- Extended the `afterAll()` test teardown methods to detect ShuffleMemoryManager leaks.
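
A hedged sketch of the new failure-signaling pattern; the interface and names here are hypothetical stand-ins, not the real BytesToBytesMap API:

```
// Inserts now report memory-pressure failure via a return value instead of
// throwing, so the caller can react, e.g. by spilling and retrying.
trait ByteMapLike {
  def putNewKey(key: Array[Byte], value: Array[Byte]): Boolean
  def spill(): Unit
}

def insertOrSpill(map: ByteMapLike, key: Array[Byte], value: Array[Byte]): Unit = {
  if (!map.putNewKey(key, value)) {  // the memory manager refused a new page
    map.spill()                      // free memory, then retry once
    require(map.putNewKey(key, value), "insert failed even after spilling")
  }
}
```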

Author: Josh Rosen <joshrosen@databricks.com>

Closes #7762 from JoshRosen/large-rows and squashes the following commits:

ae7bc56 [Josh Rosen] Fix compilation
82fc657 [Josh Rosen] Merge remote-tracking branch 'origin/master' into large-rows
34ab943 [Josh Rosen] Remove semi
31a525a [Josh Rosen] Integrate BytesToBytesMap with ShuffleMemoryManager.
626b33c [Josh Rosen] Move code to sql/core and spark/core packages so that ShuffleMemoryManager can be integrated
ec4484c [Josh Rosen] Move BytesToBytesMap from unsafe package to core.
642ed69 [Josh Rosen] Rename size to numElements
bea1152 [Josh Rosen] Add basic test.
2cd3570 [Josh Rosen] Remove accidental duplicated code
07ff9ef [Josh Rosen] Basic support for large rows in BytesToBytesMap.
2015-07-31 19:19:27 -07:00
HuJiayin 4d5a6e7b60 [SPARK-8271][SQL]string function: soundex
This PR brings SQL function soundex(), see https://issues.apache.org/jira/browse/HIVE-9738

It's based on #7115 , thanks to HuJiayin
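
A usage sketch in spark-shell (1.5-era `sqlContext` assumed); soundex maps similar-sounding names to the same four-character code:

```
sqlContext.sql("SELECT soundex('Miller'), soundex('Muller')").show()  // M460, M460
```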

Author: HuJiayin <jiayin.hu@intel.com>
Author: Davies Liu <davies@databricks.com>

Closes #7812 from davies/soundex and squashes the following commits:

fa75941 [Davies Liu] Merge branch 'master' of github.com:apache/spark into soundex
a4bd6d8 [Davies Liu] fix soundex
2538908 [HuJiayin] add codegen soundex
d15d329 [HuJiayin] add back ut
ded1a14 [HuJiayin] Merge branch 'master' of https://github.com/apache/spark
e2dec2c [HuJiayin] support soundex rebase code
2015-07-31 16:05:26 -07:00
Davies Liu 0024da9157 [SQL] address comments for to_date/trunc
This PR address the comments in #7805

cc rxin

Author: Davies Liu <davies@databricks.com>

Closes #7817 from davies/trunc and squashes the following commits:

f729d5f [Davies Liu] rollback
cb7f7832 [Davies Liu] genCode() is protected
31e52ef [Davies Liu] fix style
ed1edc7 [Davies Liu] address comments for #7805
2015-07-31 11:07:34 -07:00
Liang-Chi Hsieh 0244170b66 [SPARK-9152][SQL] Implement code generation for Like and RLike
JIRA: https://issues.apache.org/jira/browse/SPARK-9152

This PR implements code generation for `Like` and `RLike`.

Author: Liang-Chi Hsieh <viirya@appier.com>

Closes #7561 from viirya/like_rlike_codegen and squashes the following commits:

fe5641b [Liang-Chi Hsieh] Add test for NonFoldableLiteral.
ccd1b43 [Liang-Chi Hsieh] For comments.
0086723 [Liang-Chi Hsieh] Merge remote-tracking branch 'upstream/master' into like_rlike_codegen
50df9a8 [Liang-Chi Hsieh] Use nullSafeCodeGen.
8092a68 [Liang-Chi Hsieh] Merge remote-tracking branch 'upstream/master' into like_rlike_codegen
696d451 [Liang-Chi Hsieh] Check expression foldable.
48e5536 [Liang-Chi Hsieh] Merge remote-tracking branch 'upstream/master' into like_rlike_codegen
aea58e0 [Liang-Chi Hsieh] For comments.
46d946f [Liang-Chi Hsieh] Merge remote-tracking branch 'upstream/master' into like_rlike_codegen
a0fb76e [Liang-Chi Hsieh] For comments.
6cffe3c [Liang-Chi Hsieh] For comments.
69f0fb6 [Liang-Chi Hsieh] Add code generation for Like and RLike.
2015-07-30 23:05:58 -07:00
Daoyuan Wang 83670fc9e6 [SPARK-8176] [SPARK-8197] [SQL] function to_date/ trunc
This PR is based on #6988 , thanks to adrian-wang .

This brings two SQL functions: to_date() and trunc().
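
A usage sketch in spark-shell (1.5-era `sqlContext` assumed); `trunc` takes a format such as 'YEAR' or 'MM':

```
sqlContext.sql("SELECT to_date('2015-07-30 19:22:38')").show()  // 2015-07-30
sqlContext.sql("SELECT trunc('2015-07-30', 'MM')").show()       // 2015-07-01
sqlContext.sql("SELECT trunc('2015-07-30', 'YEAR')").show()     // 2015-01-01
```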

Closes #6988

Author: Daoyuan Wang <daoyuan.wang@intel.com>
Author: Davies Liu <davies@databricks.com>

Closes #7805 from davies/to_date and squashes the following commits:

2c7beba [Davies Liu] Merge branch 'master' of github.com:apache/spark into to_date
310dd55 [Daoyuan Wang] remove dup test in rebase
980b092 [Daoyuan Wang] resolve rebase conflict
a476c5a [Daoyuan Wang] address comments from davies
d44ea5f [Daoyuan Wang] function to_date, trunc
2015-07-30 19:22:38 -07:00
Davies Liu 0b1a464b6e [SPARK-9425] [SQL] support DecimalType in UnsafeRow
This PR brings support for DecimalType in UnsafeRow: for precision <= 18 the value is settable in place, otherwise it is not settable.
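
Why precision <= 18 is the settable cutoff: an unscaled value with at most 18 digits always fits in a signed 64-bit long (Long.MaxValue has 19 digits), so the decimal can live in the row's fixed-width region and be updated in place. A sketch of the arithmetic, not Spark's code:

```
// 18 nines fit comfortably below Long.MaxValue (9223372036854775807).
val maxUnscaled18 = BigInt("9" * 18)
assert(maxUnscaled18 < BigInt(Long.MaxValue))

// The unscaled long plus the (precision, scale) from the schema recover the value.
val unscaled = 123456789L
val scale = 4
val value = BigDecimal(BigInt(unscaled), scale)  // 12345.6789
```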

Author: Davies Liu <davies@databricks.com>

Closes #7758 from davies/unsafe_decimal and squashes the following commits:

478b1ba [Davies Liu] address comments
536314c [Davies Liu] Merge branch 'master' of github.com:apache/spark into unsafe_decimal
7c2e77a [Davies Liu] fix JoinedRow
76d6fa4 [Davies Liu] fix tests
99d3151 [Davies Liu] Merge branch 'master' of github.com:apache/spark into unsafe_decimal
d49c6ae [Davies Liu] support DecimalType in UnsafeRow
2015-07-30 17:18:32 -07:00
Daoyuan Wang 1abf7dc16c [SPARK-8186] [SPARK-8187] [SPARK-8194] [SPARK-8198] [SPARK-9133] [SPARK-9290] [SQL] functions: date_add, date_sub, add_months, months_between, time-interval calculation
This PR is based on #7589 , thanks to adrian-wang

Added SQL function date_add, date_sub, add_months, month_between, also add a rule for
add/subtract of date/timestamp and interval.
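
A usage sketch for the new functions in spark-shell (1.5-era `sqlContext` assumed):

```
sqlContext.sql("SELECT date_add('2015-07-30', 1)").show()                   // 2015-07-31
sqlContext.sql("SELECT date_sub('2015-07-30', 30)").show()                  // 2015-06-30
sqlContext.sql("SELECT add_months('2015-07-30', 1)").show()                 // 2015-08-30
sqlContext.sql("SELECT months_between('2015-08-30', '2015-07-30')").show()  // 1.0
```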

Closes #7589

cc rxin

Author: Daoyuan Wang <daoyuan.wang@intel.com>
Author: Davies Liu <davies@databricks.com>

Closes #7754 from davies/date_add and squashes the following commits:

e8c633a [Davies Liu] Merge branch 'master' of github.com:apache/spark into date_add
9e8e085 [Davies Liu] Merge branch 'master' of github.com:apache/spark into date_add
6224ce4 [Davies Liu] fix conflict
bd18cd4 [Davies Liu] Merge branch 'master' of github.com:apache/spark into date_add
e47ff2c [Davies Liu] add python api, fix date functions
01943d0 [Davies Liu] Merge branch 'master' into date_add
522e91a [Daoyuan Wang] fix
e8a639a [Daoyuan Wang] fix
42df486 [Daoyuan Wang] fix style
87c4b77 [Daoyuan Wang] function add_months, months_between and some fixes
1a68e03 [Daoyuan Wang] poc of time interval calculation
c506661 [Daoyuan Wang] function date_add , date_sub
2015-07-30 13:21:46 -07:00
Daoyuan Wang 6d94bf6ac1 [SPARK-8174] [SPARK-8175] [SQL] function unix_timestamp, from_unixtime
unix_timestamp(): long
Gets current Unix timestamp in seconds.

unix_timestamp(string|date): long
Converts a time string in the format yyyy-MM-dd HH:mm:ss to a Unix timestamp (in seconds), using the default timezone and the default locale; returns null on failure: unix_timestamp('2009-03-20 11:30:01') = 1237573801

unix_timestamp(string date, string pattern): long
Converts a time string with the given pattern (see [http://docs.oracle.com/javase/tutorial/i18n/format/simpleDateFormat.html]) to a Unix timestamp (in seconds); returns null on failure: unix_timestamp('2009-03-20', 'yyyy-MM-dd') = 1237532400.

from_unixtime(bigint unixtime[, string format]): string
Converts the number of seconds from unix epoch (1970-01-01 00:00:00 UTC) to a string representing the timestamp of that moment in the current system time zone in the format of "1970-01-01 00:00:00".
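
The examples above, as a spark-shell sketch (1.5-era `sqlContext` assumed; results depend on the session time zone):

```
sqlContext.sql("SELECT unix_timestamp('2009-03-20 11:30:01')").show()       // 1237573801
sqlContext.sql("SELECT unix_timestamp('2009-03-20', 'yyyy-MM-dd')").show()  // 1237532400
sqlContext.sql("SELECT from_unixtime(0)").show()  // 1970-01-01 00:00:00 in a UTC session
```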

Jira:
https://issues.apache.org/jira/browse/SPARK-8174
https://issues.apache.org/jira/browse/SPARK-8175

Author: Daoyuan Wang <daoyuan.wang@intel.com>

Closes #7644 from adrian-wang/udfunixtime and squashes the following commits:

2fe20c4 [Daoyuan Wang] util.Date
ea2ec16 [Daoyuan Wang] use util.Date for better performance
a2cf929 [Daoyuan Wang] doc return null instead of 0
f6f070a [Daoyuan Wang] address comments from davies
6a4cbb3 [Daoyuan Wang] temp
56ded53 [Daoyuan Wang] rebase and address comments
14a8b37 [Daoyuan Wang] function unix_timestamp, from_unixtime
2015-07-30 11:14:09 -07:00
Wenchen Fan c0cc0eaec6 [SPARK-9390][SQL] create a wrapper for array type
Author: Wenchen Fan <cloud0fan@outlook.com>

Closes #7724 from cloud-fan/array-data and squashes the following commits:

d0408a1 [Wenchen Fan] fix python
661e608 [Wenchen Fan] rebase
f39256c [Wenchen Fan] fix hive...
6dbfa6f [Wenchen Fan] fix hive again...
8cb8842 [Wenchen Fan] remove element type parameter from getArray
43e9816 [Wenchen Fan] fix mllib
e719afc [Wenchen Fan] fix hive
4346290 [Wenchen Fan] address comment
d4a38da [Wenchen Fan] remove sizeInBytes and add license
7e283e2 [Wenchen Fan] create a wrapper for array type
2015-07-30 10:04:30 -07:00
Joseph Batchik 1221849f91 [SPARK-8005][SQL] Input file name
Users can now get the file name of the partition being read in. A thread-local variable in `SQLNewHadoopRDD` is set when the partition is computed. `SQLNewHadoopRDD` is moved to core so that the catalyst package can reach it.

This supports:

`df.select(inputFileName())`

and

`sqlContext.sql("select input_file_name() from table")`

Author: Joseph Batchik <josephbatchik@gmail.com>

Closes #7743 from JDrit/input_file_name and squashes the following commits:

abb8609 [Joseph Batchik] fixed failing test and changed the default value to be an empty string
d2f323d [Joseph Batchik] updates per review
102061f [Joseph Batchik] updates per review
75313f5 [Joseph Batchik] small fixes
c7f7b5a [Joseph Batchik] adding input file name to Spark SQL
2015-07-29 23:35:55 -07:00
Yijie Shen e127ec34d5 [SPARK-9428] [SQL] Add test cases for null inputs for expression unit tests
JIRA: https://issues.apache.org/jira/browse/SPARK-9428

Author: Yijie Shen <henry.yijieshen@gmail.com>

Closes #7748 from yjshen/string_cleanup and squashes the following commits:

e0c2b3d [Yijie Shen] update codegen in RegExpExtract and RegExpReplace
26614d2 [Yijie Shen] MathFunctionSuite
a402859 [Yijie Shen] complex_create, conditional and cast
6e4e608 [Yijie Shen] arithmetic and cast
52593c1 [Yijie Shen] null input test cases for StringExpressionSuite
2015-07-29 23:24:20 -07:00
Reynold Xin 27850af525 [SPARK-9462][SQL] Initialize nondeterministic expressions in code gen fallback mode.
Author: Reynold Xin <rxin@databricks.com>

Closes #7767 from rxin/SPARK-9462 and squashes the following commits:

ef3e2d9 [Reynold Xin] Removed println
713ac3a [Reynold Xin] More unit tests.
bb5c334 [Reynold Xin] [SPARK-9462][SQL] Initialize nondeterministic expressions in code gen fallback mode.
2015-07-29 21:24:47 -07:00
Reynold Xin 86505962e6 [SPARK-9448][SQL] GenerateUnsafeProjection should not share expressions across instances.
We accidentally moved the list of expressions from the generated code instance to the class wrapper, and as a result, different threads are sharing the same set of expressions, which causes problems for expressions with mutable state.

This pull request fixed that problem, and also added unit tests for all codegen classes, except GeneratedOrdering (which will never need any expressions since sort now only accepts bound references).

Author: Reynold Xin <rxin@databricks.com>

Closes #7759 from rxin/SPARK-9448 and squashes the following commits:

c09b50f [Reynold Xin] [SPARK-9448][SQL] GenerateUnsafeProjection should not share expressions across instances.
2015-07-29 16:49:02 -07:00
Josh Rosen 1b0099fc62 [SPARK-9411] [SQL] Make Tungsten page sizes configurable
We need to make page sizes configurable so we can reduce them in unit tests and increase them in real production workloads.  These sizes are now controlled by a new configuration, `spark.buffer.pageSize`.  The new default is 64 megabytes.
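
A sketch of tuning it (the size-string format is assumed to be accepted the way other Spark size configs are):

```
import org.apache.spark.SparkConf

// Lower the page size for unit tests (the patch itself uses 4MB in TestHive);
// production workloads would raise it instead.
val conf = new SparkConf().set("spark.buffer.pageSize", "4m")
```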

Author: Josh Rosen <joshrosen@databricks.com>

Closes #7741 from JoshRosen/SPARK-9411 and squashes the following commits:

a43c4db [Josh Rosen] Fix pow
2c0eefc [Josh Rosen] Fix MAXIMUM_PAGE_SIZE_BYTES comment + value
bccfb51 [Josh Rosen] Lower page size to 4MB in TestHive
ba54d4b [Josh Rosen] Make UnsafeExternalSorter's page size configurable
0045aa2 [Josh Rosen] Make UnsafeShuffle's page size configurable
bc734f0 [Josh Rosen] Rename configuration
e614858 [Josh Rosen] Makes BytesToBytesMap page size configurable
2015-07-29 16:00:30 -07:00
Reynold Xin 5340dfaf94 [SPARK-9430][SQL] Rename IntervalType to CalendarIntervalType.
We want to introduce a new IntervalType in 1.6 that is based only on the number of microseconds,
so intervals can be compared.

Renaming the existing IntervalType to CalendarIntervalType so we can do that in the future.

Author: Reynold Xin <rxin@databricks.com>

Closes #7745 from rxin/calendarintervaltype and squashes the following commits:

99f64e8 [Reynold Xin] One more line ...
13466c8 [Reynold Xin] Fixed tests.
e20f24e [Reynold Xin] [SPARK-9430][SQL] Rename IntervalType to CalendarIntervalType.
2015-07-29 13:49:22 -07:00
Reynold Xin 97906944e1 [SPARK-9127][SQL] Rand/Randn codegen fails with long seed.
Author: Reynold Xin <rxin@databricks.com>

Closes #7747 from rxin/SPARK-9127 and squashes the following commits:

e851418 [Reynold Xin] [SPARK-9127][SQL] Rand/Randn codegen fails with long seed.
2015-07-29 09:36:22 -07:00
Wenchen Fan 708794e8aa [SPARK-9251][SQL] do not order by expressions which still need evaluation
Per an offline discussion with rxin, it's weird to be computing stuff while doing sorting; we should only order by bound references during execution.

Author: Wenchen Fan <cloud0fan@outlook.com>

Closes #7593 from cloud-fan/sort and squashes the following commits:

7b1bef7 [Wenchen Fan] add test
daf206d [Wenchen Fan] add more comments
289bee0 [Wenchen Fan] do not order by expressions which still need evaluation
2015-07-29 00:08:45 -07:00
Davies Liu 15667a0afa [SPARK-9281] [SQL] use decimal or double when parsing SQL
Right now, we parse every floating-point number in SQL as a double. When such a literal is used in an expression together with DecimalType, it turns the decimal into a double as well, and using double also loses some precision.

This PR changes the parser to produce a decimal or a double depending on whether the literal uses scientific notation, see https://msdn.microsoft.com/en-us/library/ms179899.aspx

This is a breaking change; should we document it somewhere?
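
A sketch of the new behavior in spark-shell (1.5-era `sqlContext` assumed):

```
// Plain decimal literal -> DecimalType; scientific notation -> DoubleType.
sqlContext.sql("SELECT 1.5").schema    // decimal field
sqlContext.sql("SELECT 1.5e2").schema  // double field
```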

Author: Davies Liu <davies@databricks.com>

Closes #7642 from davies/parse_decimal and squashes the following commits:

1f576d9 [Davies Liu] Merge branch 'master' of github.com:apache/spark into parse_decimal
5e142b6 [Davies Liu] fix scala style
eca99de [Davies Liu] fix tests
2afe702 [Davies Liu] Merge branch 'master' of github.com:apache/spark into parse_decimal
f4a320b [Davies Liu] Update SqlParser.scala
1c48e34 [Davies Liu] use decimal or double when parsing SQL
2015-07-28 22:51:08 -07:00
Yijie Shen 6309b93467 [SPARK-9398] [SQL] Datetime cleanup
JIRA: https://issues.apache.org/jira/browse/SPARK-9398

Author: Yijie Shen <henry.yijieshen@gmail.com>

Closes #7725 from yjshen/date_null_check and squashes the following commits:

b4eade1 [Yijie Shen] inline daysToMonthEnd
d09acc1 [Yijie Shen] implement getLastDayOfMonth to avoid repeated evaluation
d857ec3 [Yijie Shen] add null check in DateExpressionSuite
2015-07-28 22:38:28 -07:00
Wenchen Fan 429b2f0df4 [SPARK-8608][SPARK-8609][SPARK-9083][SQL] reset mutable states of nondeterministic expression before evaluation and fix PullOutNondeterministic
We will do local projection for LocalRelation, and thus reuse the same Expression object among multiple evaluations. We should reset the mutable states of an Expression before evaluating it.

Fix `PullOutNondeterministic` rule to make it work for `Sort`.

Also got a chance to cleanup the dataframe test suite.

Author: Wenchen Fan <cloud0fan@outlook.com>

Closes #7674 from cloud-fan/show and squashes the following commits:

888934f [Wenchen Fan] fix sort
c0e93e8 [Wenchen Fan] local DataFrame with random columns should return same value when call `show`
2015-07-28 21:37:50 -07:00
Josh Rosen e78ec1a8fa [SPARK-9421] Fix null-handling bugs in UnsafeRow.getDouble, getFloat(), and get(ordinal, dataType)
UnsafeRow.getDouble and getFloat() return NaN when called on columns that are null, which is inconsistent with the behavior of other row classes (which is to return 0.0).

In addition, the generic get(ordinal, dataType) method should always return null for a null literal, but currently it handles nulls by calling the type-specific accessors.

This patch addresses both of these issues and adds a regression test.

Author: Josh Rosen <joshrosen@databricks.com>

Closes #7736 from JoshRosen/unsafe-row-null-fixes and squashes the following commits:

c8eb2ee [Josh Rosen] Fix test in UnsafeRowConverterSuite
6214682 [Josh Rosen] Fixes to null handling in UnsafeRow
2015-07-28 17:51:58 -07:00
Reynold Xin b7f54119f8 [SPARK-9420][SQL] Move expressions in sql/core package to catalyst.
Since catalyst package already depends on Spark core, we can move those expressions
into catalyst, and simplify function registry.

This is a followup of #7478.

Author: Reynold Xin <rxin@databricks.com>

Closes #7735 from rxin/SPARK-8003 and squashes the following commits:

2ffbdc3 [Reynold Xin] [SPARK-8003][SQL] Move expressions in sql/core package to catalyst.
2015-07-28 17:03:59 -07:00
Reynold Xin 9bbe0171cb [SPARK-8196][SQL] Fix null handling & documentation for next_day.
The original patch didn't handle nulls correctly for next_day.

Author: Reynold Xin <rxin@databricks.com>

Closes #7718 from rxin/next_day and squashes the following commits:

616a425 [Reynold Xin] Merged DatetimeExpressionsSuite into DateFunctionsSuite.
faa78cf [Reynold Xin] Merged DatetimeFunctionsSuite into DateExpressionsSuite.
6c4fb6a [Reynold Xin] [SPARK-8196][SQL] Fix null handling & documentation for next_day.
2015-07-28 09:43:39 -07:00
Reynold Xin 15724fac56 [SPARK-9394][SQL] Handle parentheses in CodeFormatter.
Our CodeFormatter currently does not handle parentheses, and as a result in code dump, we see code formatted this way:

```
foo(
a,
b,
c)
```

With this patch, it is formatted this way:
```
foo(
  a,
  b,
  c)
```

Author: Reynold Xin <rxin@databricks.com>

Closes #7712 from rxin/codeformat-parentheses and squashes the following commits:

c2b1c5f [Reynold Xin] Took square bracket out
3cfb174 [Reynold Xin] Code review feedback.
91f5bb1 [Reynold Xin] [SPARK-9394][SQL] Handle parentheses in CodeFormatter.
2015-07-28 00:52:26 -07:00
Reynold Xin 60f08c7c87 [SPARK-9373][SQL] Support StructType in Tungsten projection
This pull request updates GenerateUnsafeProjection to support StructType. If an input struct type is already backed by an UnsafeRow, GenerateUnsafeProjection copies the bytes directly into its buffer space without any conversion. However, if the input is not an UnsafeRow, GenerateUnsafeProjection recursively runs generated code to convert the input into an UnsafeRow and then copies it into the buffer space.

Also create a TungstenProject operator that projects data directly into UnsafeRow. Note that I'm not sure if this is the way we want to structure Unsafe+codegen operators, but we can defer that decision to follow-up pull requests.

Author: Reynold Xin <rxin@databricks.com>

Closes #7689 from rxin/tungsten-struct-type and squashes the following commits:

9162f42 [Reynold Xin] Support IntervalType in UnsafeRow's getter.
be9f377 [Reynold Xin] Fixed tests.
10c4b7c [Reynold Xin] Format generated code.
77e8d0e [Reynold Xin] Fixed NondeterministicSuite.
ac4951d [Reynold Xin] Yay.
ac203bf [Reynold Xin] More comments.
9f36216 [Reynold Xin] Updated comment.
6b781fe [Reynold Xin] Reset the change in DataFrameSuite.
525b95b [Reynold Xin] Merged with master, more documentation & test cases.
321859a [Reynold Xin] [SPARK-9373][SQL] Support StructType in Tungsten projection [WIP]
2015-07-27 22:51:15 -07:00
Daoyuan Wang 2e7f99a004 [SPARK-8195] [SPARK-8196] [SQL] udf next_day last_day
next_day returns the next date that falls on the given day of the week.
last_day returns the last day of the month that the given date belongs to.
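
A usage sketch in spark-shell (1.5-era `sqlContext` assumed); 2015-07-27 is a Monday:

```
sqlContext.sql("SELECT next_day('2015-07-27', 'Sunday')").show()  // 2015-08-02
sqlContext.sql("SELECT last_day('2015-07-27')").show()            // 2015-07-31
```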

Author: Daoyuan Wang <daoyuan.wang@intel.com>

Closes #6986 from adrian-wang/udfnlday and squashes the following commits:

ef7e3da [Daoyuan Wang] fix
02b3426 [Daoyuan Wang] address 2 comments
dc69630 [Daoyuan Wang] address comments from rxin
8846086 [Daoyuan Wang] address comments from rxin
d09bcce [Daoyuan Wang] multi fix
1a9de3d [Daoyuan Wang] function next_day and last_day
2015-07-27 21:08:56 -07:00
Wenchen Fan 75438422c2 [SPARK-9369][SQL] Support IntervalType in UnsafeRow
Author: Wenchen Fan <cloud0fan@outlook.com>

Closes #7688 from cloud-fan/interval and squashes the following commits:

5b36b17 [Wenchen Fan] fix codegen
a99ed50 [Wenchen Fan] address comment
9e6d319 [Wenchen Fan] Support IntervalType in UnsafeRow
2015-07-27 11:28:22 -07:00
Wenchen Fan dd9ae7945a [SPARK-9351] [SQL] remove literals from grouping expressions in Aggregate
Literals in grouping expressions have no effect at all; they only make our grouping key bigger, so we should remove them in the Optimizer.

I also make the old and new aggregation code consistent about literals in grouping here. In the old aggregation path, literals in grouping are actually already removed, but the new path does not remove them, so I explicitly make it a rule in the Optimizer.
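
A sketch of the effect in spark-shell (1.5-era `sqlContext` and `sc` assumed; the table here is a hypothetical setup just for illustration):

```
import sqlContext.implicits._
val t = sc.parallelize(1 to 4).map(i => (i % 2, i)).toDF("key", "value")
t.registerTempTable("t")

// The string literal contributes nothing to the grouping key, so the
// optimizer can strip it; both queries group the same way.
sqlContext.sql("SELECT key, count(*) FROM t GROUP BY key, 'const'")
sqlContext.sql("SELECT key, count(*) FROM t GROUP BY key")
```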

Author: Wenchen Fan <cloud0fan@outlook.com>

Closes #7583 from cloud-fan/minor and squashes the following commits:

471adff [Wenchen Fan] add test
0839925 [Wenchen Fan] use transformDown when rewrite final result expressions
2015-07-27 11:23:29 -07:00
Wenchen Fan e2f38167f8 [SPARK-9376] [SQL] use a seed in RandomDataGeneratorSuite
Make this test deterministic, i.e. make sure it passes no matter how many times we run it.

The original implementation uses a random seed, so there is a chance that we break the null-check assertion `assert(Iterator.fill(100)(generator()).contains(null))`.

Author: Wenchen Fan <cloud0fan@outlook.com>

Closes #7691 from cloud-fan/seed and squashes the following commits:

eae7281 [Wenchen Fan] use a seed in RandomDataGeneratorSuite
2015-07-27 11:02:16 -07:00
Yijie Shen fb5d43fb25 [SPARK-9356][SQL]Remove the internal use of DecimalType.Unlimited
JIRA: https://issues.apache.org/jira/browse/SPARK-9356

Author: Yijie Shen <henry.yijieshen@gmail.com>

Closes #7671 from yjshen/deprecated_unlimit and squashes the following commits:

c707f56 [Yijie Shen] remove pattern matching in changePrecision
4a1823c [Yijie Shen] remove internal occurrence of Decimal.Unlimited
2015-07-26 10:29:22 -07:00
Reynold Xin 4a01bfc2a2 [SPARK-9350][SQL] Introduce an InternalRow generic getter that requires a DataType
Currently UnsafeRow cannot support a generic getter. However, if the data type is known, we can support a generic getter.
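
A simplified sketch of the DataType-driven dispatch (not the exact InternalRow code), shown against the public Row API so it runs in spark-shell:

```
import org.apache.spark.sql.Row
import org.apache.spark.sql.types._

// Knowing the type up front lets a row without boxed storage route the call
// to the matching specialized getter.
def genericGet(row: Row, ordinal: Int, dt: DataType): Any = dt match {
  case IntegerType => row.getInt(ordinal)
  case LongType    => row.getLong(ordinal)
  case DoubleType  => row.getDouble(ordinal)
  case StringType  => row.getString(ordinal)
  case _           => row.get(ordinal)
}
```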

Author: Reynold Xin <rxin@databricks.com>

Closes #7666 from rxin/generic-getter-with-datatype and squashes the following commits:

ee2874c [Reynold Xin] Add a default implementation for getStruct.
1e109a0 [Reynold Xin] [SPARK-9350][SQL] Introduce an InternalRow generic getter that requires a DataType.
033ee88 [Reynold Xin] Removed getAs in non test code.
2015-07-25 23:52:37 -07:00
Reynold Xin b1f4b4abfd [SPARK-9348][SQL] Remove apply method on InternalRow.
Author: Reynold Xin <rxin@databricks.com>

Closes #7665 from rxin/remove-row-apply and squashes the following commits:

0b43001 [Reynold Xin] support getString in UnsafeRow.
176d633 [Reynold Xin] apply -> get.
2941324 [Reynold Xin] [SPARK-9348][SQL] Remove apply method on InternalRow.
2015-07-25 18:41:51 -07:00
Wenchen Fan 2c94d0f24a [SPARK-9192][SQL] add initialization phase for nondeterministic expression
Currently nondeterministic expressions are broken without an explicit initialization phase.

Let me take `MonotonicallyIncreasingID` as an example. This expression needs a mutable state to remember how many times it has been evaluated, so we use `transient var count: Long` there. Being transient, `count` will be reset to 0, and **only** to 0, when we serialize and deserialize it, as deserializing a transient variable results in the default value. There is *no way* to use another initial value for `count` until we add the explicit initialization phase.

Another use case is local execution for `LocalRelation`: there is no serialize/deserialize phase, and thus we can't reset mutable states for it.
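
A hedged sketch of the initialization-phase idea (the trait and method names are hypothetical, not Spark's exact API):

```
// Mutable state gets an explicit reset hook that the engine calls before
// evaluation, instead of relying on deserialization of transient fields.
trait Stateful {
  protected def initInternal(): Unit
  final def initialize(): Unit = initInternal()
}

class IncreasingId(start: Long) extends Stateful {
  @transient private var count: Long = _
  // Any starting value works now, not just the deserialization default of 0.
  override protected def initInternal(): Unit = { count = start }
  def eval(): Long = { count += 1; count }
}
```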

Author: Wenchen Fan <cloud0fan@outlook.com>

Closes #7535 from cloud-fan/init and squashes the following commits:

6c6f332 [Wenchen Fan] add test
ef68ff4 [Wenchen Fan] fix comments
9eac85e [Wenchen Fan] move init code to interpreted class
bb7d838 [Wenchen Fan] pulls out nondeterministic expressions into a project
b4a4fc7 [Wenchen Fan] revert a refactor
86fee36 [Wenchen Fan] add initialization phase for nondeterministic expression
2015-07-25 12:10:02 -07:00
Reynold Xin 215713e199 [SPARK-9334][SQL] Remove UnsafeRowConverter in favor of UnsafeProjection.
The two are redundant.

Once this patch is merged, I plan to remove the inbound conversions from unsafe aggregates.

Author: Reynold Xin <rxin@databricks.com>

Closes #7658 from rxin/unsafeconverters and squashes the following commits:

ed19e6c [Reynold Xin] Updated support types.
2a56d7e [Reynold Xin] [SPARK-9334][SQL] Remove UnsafeRowConverter in favor of UnsafeProjection.
2015-07-25 01:37:41 -07:00
Reynold Xin c84acd4aa4 [SPARK-9331][SQL] Add a code formatter to auto-format generated code.
The generated expression code can be hard to read since it is not indented well. This patch adds a code formatter that formats code automatically when we output it to the screen.

Author: Reynold Xin <rxin@databricks.com>

Closes #7656 from rxin/codeformatter and squashes the following commits:

5ba0e90 [Reynold Xin] [SPARK-9331][SQL] Add a code formatter to auto-format generated code.
2015-07-24 19:35:24 -07:00
Josh Rosen 6aceaf3d62 [SPARK-9295] Analysis should detect sorting on unsupported column types
This patch extends CheckAnalysis to throw errors for queries that try to sort on unsupported column types, such as ArrayType.

Author: Josh Rosen <joshrosen@databricks.com>

Closes #7633 from JoshRosen/SPARK-9295 and squashes the following commits:

23b2fbf [Josh Rosen] Embed function in foreach
bfe1451 [Josh Rosen] Update to allow sorting by null literals
2f1b802 [Josh Rosen] Add analysis rule to detect sorting on unsupported column types (SPARK-9295)
2015-07-24 11:34:23 -07:00
Josh Rosen c2b50d693e [SPARK-9292] Analysis should check that join conditions' data types are BooleanType
This patch adds an analysis check to ensure that join conditions' data types are BooleanType. This check is necessary in order to report proper errors for non-boolean DataFrame join conditions.

Author: Josh Rosen <joshrosen@databricks.com>

Closes #7630 from JoshRosen/SPARK-9292 and squashes the following commits:

aec6c7b [Josh Rosen] Check condition type in resolved()
75a3ea6 [Josh Rosen] Fix SPARK-9292.
2015-07-24 09:49:50 -07:00
Reynold Xin c8d71a4183 [SPARK-9305] Rename org.apache.spark.Row to Item.
It's a thing used in test cases, but named Row. Pretty annoying because every time I search for Row, it shows up before the Spark SQL Row, which is what a developer wants most of the time.

Author: Reynold Xin <rxin@databricks.com>

Closes #7638 from rxin/remove-row and squashes the following commits:

aeda52d [Reynold Xin] [SPARK-9305] Rename org.apache.spark.Row to Item.
2015-07-24 09:38:13 -07:00
Reynold Xin 431ca39be5 [SPARK-9285][SQL] Remove InternalRow's inheritance from Row.
I also changed InternalRow's size/length function to numFields, to make it more obvious that it is not about bytes, but the number of fields.

Author: Reynold Xin <rxin@databricks.com>

Closes #7626 from rxin/internalRow and squashes the following commits:

e124daf [Reynold Xin] Fixed test case.
805ceb7 [Reynold Xin] Commented out the failed test suite.
f8a9ca5 [Reynold Xin] Fixed more bugs. Still at least one more remaining.
76d9081 [Reynold Xin] Fixed data sources.
7807f70 [Reynold Xin] Fixed DataFrameSuite.
cb60cd2 [Reynold Xin] Code review & small bug fixes.
0a2948b [Reynold Xin] Fixed style.
3280d03 [Reynold Xin] [SPARK-9285][SQL] Remove InternalRow's inheritance from Row.
2015-07-24 09:37:36 -07:00
Davies Liu dfb18be036 [SPARK-9069] [SQL] follow up
Address comments for #7605

cc rxin

Author: Davies Liu <davies@databricks.com>

Closes #7634 from davies/decimal_unlimited2 and squashes the following commits:

b2d8b0d [Davies Liu] add doc and test for DecimalType.isWiderThan
65b251c [Davies Liu] fix test
6a91f32 [Davies Liu] fix style
ca9c973 [Davies Liu] address comments
2015-07-24 08:24:13 -07:00
Reynold Xin cb8c241f05 [SPARK-9200][SQL] Don't implicitly cast non-atomic types to string type.
Author: Reynold Xin <rxin@databricks.com>

Closes #7636 from rxin/complex-string-implicit-cast and squashes the following commits:

3e67327 [Reynold Xin] [SPARK-9200][SQL] Don't implicitly cast non-atomic types to string type.
2015-07-24 01:18:43 -07:00
Davies Liu 8a94eb23d5 [SPARK-9069] [SPARK-9264] [SQL] remove unlimited precision support for DecimalType
Remove Decimal.Unlimited (change to support precision up to 38, to match Hive and other databases).

In order to keep backward source compatibility, Decimal.Unlimited is still there, but changed to Decimal(38, 18).

If no precision and scale are provided, it's Decimal(10, 0) as before.
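
A sketch of the resulting defaults in spark-shell (assuming the 1.5-era `DecimalType(precision, scale)` constructor):

```
import org.apache.spark.sql.types.DecimalType

// The old "unlimited" alias now maps to the maximum supported precision:
val wide = DecimalType(38, 18)
// With no precision/scale given, the default stays (10, 0):
val dflt = DecimalType(10, 0)
```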

Author: Davies Liu <davies@databricks.com>

Closes #7605 from davies/decimal_unlimited and squashes the following commits:

aa3f115 [Davies Liu] fix tests and style
fb0d20d [Davies Liu] address comments
bfaae35 [Davies Liu] fix style
df93657 [Davies Liu] address comments and clean up
06727fd [Davies Liu] Merge branch 'master' of github.com:apache/spark into decimal_unlimited
4c28969 [Davies Liu] fix tests
8d783cc [Davies Liu] fix tests
788631c [Davies Liu] fix double with decimal in Union/except
1779bde [Davies Liu] fix scala style
c9c7c78 [Davies Liu] remove Decimal.Unlimited
2015-07-23 18:31:13 -07:00
Reynold Xin fb36397b3c Revert "[SPARK-8579] [SQL] support arbitrary object in UnsafeRow"
Reverts ObjectPool. As it stands, it has a few problems:

1. ObjectPool doesn't work with spilling and memory accounting.
2. I don't think in the long run the idea of an object pool is what we want to support, since it essentially goes back to unmanaged memory, creates pressure on GC, and makes it hard to account for the total in-memory size.
3. The ObjectPool patch removed the specialized getters for strings and binary, and as a result, actually introduced branches when reading non-primitive data types.

If we do want to support arbitrary user defined types in the future, I think we can just add an object array in UnsafeRow, rather than relying on indirect memory addressing through a pool. We also need to pick execution strategies that are optimized for those, rather than keeping a lot of unserialized JVM objects in memory during aggregation.

This is probably the hardest thing I had to revert in Spark, due to recent patches that also change the same part of the code. Would be great to get a careful look.

Author: Reynold Xin <rxin@databricks.com>

Closes #7591 from rxin/revert-object-pool and squashes the following commits:

01db0bc [Reynold Xin] Scala style.
eda89fc [Reynold Xin] Fixed describe.
2967118 [Reynold Xin] Fixed accessor for JoinedRow.
e3294eb [Reynold Xin] Merge branch 'master' into revert-object-pool
657855f [Reynold Xin] Temp commit.
c20f2c8 [Reynold Xin] Style fix.
fe37079 [Reynold Xin] Revert "[SPARK-8579] [SQL] support arbitrary object in UnsafeRow"
2015-07-23 01:51:34 -07:00
Yijie Shen 6d0d8b4069 [SPARK-8935] [SQL] Implement code generation for all casts
JIRA: https://issues.apache.org/jira/browse/SPARK-8935

Author: Yijie Shen <henry.yijieshen@gmail.com>

Closes #7365 from yjshen/cast_codegen and squashes the following commits:

ef6e8b5 [Yijie Shen] getColumn and setColumn in struct cast, autounboxing in array and map
eaece18 [Yijie Shen] remove null case in cast code gen
fd7eba4 [Yijie Shen] resolve comments
80378a5 [Yijie Shen] the missing self cast
611d66e [Yijie Shen] Bug fix: NullType & primitive object unboxing
6d5c0fe [Yijie Shen] rebase and add Interval codegen
9424b65 [Yijie Shen] tiny style fix
4a1c801 [Yijie Shen] remove CodeHolder class, use function instead.
3f5df88 [Yijie Shen] CodeHolder for complex dataTypes
c286f13 [Yijie Shen] moved all the cast code into class body
4edfd76 [Yijie Shen] [WIP] finished primitive part
2015-07-22 23:44:08 -07:00
Yijie Shen 86f80e2b47 [SPARK-9165] [SQL] codegen for CreateArray, CreateStruct and CreateNamedStruct
JIRA: https://issues.apache.org/jira/browse/SPARK-9165

Author: Yijie Shen <henry.yijieshen@gmail.com>

Closes #7537 from yjshen/array_struct_codegen and squashes the following commits:

3a6dce6 [Yijie Shen] use infix notation in createArray test
5e90f0a [Yijie Shen] resolve comments: classOf
39cefb8 [Yijie Shen] codegen for createArray createStruct & createNamedStruct
2015-07-22 12:19:59 -07:00
Wenchen Fan 76520955fd [SPARK-9082] [SQL] Filter using non-deterministic expressions should not be pushed down
Author: Wenchen Fan <cloud0fan@outlook.com>

Closes #7446 from cloud-fan/filter and squashes the following commits:

330021e [Wenchen Fan] add exists to tree node
2cab68c [Wenchen Fan] more enhance
949be07 [Wenchen Fan] push down part of predicate if possible
3912f84 [Wenchen Fan] address comments
8ce15ca [Wenchen Fan] fix bug
557158e [Wenchen Fan] Filter using non-deterministic expressions should not be pushed down
2015-07-22 11:45:51 -07:00
Reynold Xin a4c83cb1e4 [SPARK-9154][SQL] Rename formatString to format_string.
Also make format_string the canonical form, rather than printf.

Author: Reynold Xin <rxin@databricks.com>

Closes #7579 from rxin/format_strings and squashes the following commits:

53ee54f [Reynold Xin] Fixed unit tests.
52357e1 [Reynold Xin] Add format_string alias.
b40a42a [Reynold Xin] [SPARK-9154][SQL] Rename formatString to format_string.
2015-07-21 19:14:07 -07:00
Tarek Auel d4c7a7a364 [SPARK-9154] [SQL] codegen StringFormat
Jira: https://issues.apache.org/jira/browse/SPARK-9154

fixes bug of #7546

marmbrus I can't reopen the other PR because I didn't close it myself. Can you trigger Jenkins?

Author: Tarek Auel <tarek.auel@googlemail.com>

Closes #7571 from tarekauel/SPARK-9154 and squashes the following commits:

dcae272 [Tarek Auel] [SPARK-9154][SQL] build fix
1487602 [Tarek Auel] Merge remote-tracking branch 'upstream/master' into SPARK-9154
f512c5f [Tarek Auel] [SPARK-9154][SQL] build fix
a943d3e [Tarek Auel] [SPARK-9154] implicit input cast, added tests for null, support for null primitives
10b4de8 [Tarek Auel] [SPARK-9154][SQL] codegen removed fallback trait
cd8322b [Tarek Auel] [SPARK-9154][SQL] codegen string format
086caba [Tarek Auel] [SPARK-9154][SQL] codegen string format
2015-07-21 15:47:40 -07:00
Michael Armbrust 87d890cc10 Revert "[SPARK-9154] [SQL] codegen StringFormat"
This reverts commit 7f072c3d5e.

Revert #7546

Author: Michael Armbrust <michael@databricks.com>

Closes #7570 from marmbrus/revert9154 and squashes the following commits:

ed2c32a [Michael Armbrust] Revert "[SPARK-9154] [SQL] codegen StringFormat"
2015-07-21 11:18:39 -07:00
Tarek Auel 7f072c3d5e [SPARK-9154] [SQL] codegen StringFormat
Jira: https://issues.apache.org/jira/browse/SPARK-9154

Author: Tarek Auel <tarek.auel@googlemail.com>

Closes #7546 from tarekauel/SPARK-9154 and squashes the following commits:

a943d3e [Tarek Auel] [SPARK-9154] implicit input cast, added tests for null, support for null primitives
10b4de8 [Tarek Auel] [SPARK-9154][SQL] codegen removed fallback trait
cd8322b [Tarek Auel] [SPARK-9154][SQL] codegen string format
086caba [Tarek Auel] [SPARK-9154][SQL] codegen string format
2015-07-21 09:58:16 -07:00
Yijie Shen be5c5d3741 [SPARK-9081] [SPARK-9168] [SQL] nanvl & dropna/fillna supporting nan as well
JIRA:
https://issues.apache.org/jira/browse/SPARK-9081
https://issues.apache.org/jira/browse/SPARK-9168

This PR targets three modifications (usage sketch of `nanvl` below):
1.  Change `isNaN` to return `false` on `null` input
2.  Make `dropna` and `fillna` fill/drop NaN values as well
3.  Implement `nanvl`
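
A usage sketch in spark-shell (1.5-era `sqlContext` assumed):

```
// nanvl(a, b) returns a unless it is NaN, in which case it returns b.
sqlContext.sql("SELECT nanvl(CAST('NaN' AS DOUBLE), 0.0)").show()  // 0.0
sqlContext.sql("SELECT nanvl(1.0, 0.0)").show()                    // 1.0
```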

Author: Yijie Shen <henry.yijieshen@gmail.com>

Closes #7523 from yjshen/fillna_dropna and squashes the following commits:

f0a51db [Yijie Shen] make coalesce untouched and implement nanvl
1d3e35f [Yijie Shen] make Coalesce aware of NaN in order to support fillna
2760cbc [Yijie Shen] change isNaN(null) to false as well as implement dropna
2015-07-21 08:25:50 -07:00
Yijie Shen ae230596b8 [SPARK-9173][SQL]UnionPushDown should also support Intersect and Except
JIRA: https://issues.apache.org/jira/browse/SPARK-9173

Author: Yijie Shen <henry.yijieshen@gmail.com>

Closes #7540 from yjshen/union_pushdown and squashes the following commits:

278510a [Yijie Shen] rename UnionPushDown to SetOperationPushDown
91741c1 [Yijie Shen] Add UnionPushDown support for intersect and except
2015-07-21 00:56:57 -07:00
Pedro Rodriguez 560c658a74 [SPARK-8230][SQL] Add array/map size method
Pull Request for: https://issues.apache.org/jira/browse/SPARK-8230

The primary issue resolved is implementing array/map size for Spark SQL (usage sketch after the list below). The code is ready for review by a committer. Chen Hao is on the JIRA ticket, but I don't know his username on GitHub; rxin is also on the JIRA ticket.

Things to review:
1. Where to put the added functions namespace-wise; they seem to be part of a few operations on collections, which include `sort_array` and `array_contains`. Hence the names `collectionOperations.scala` and `_collection_functions` in Python.
2. In the Python code, should it be in a `1.5.0` function array or in a collections array?
3. Are there any missing methods on the `Size` case class? Many of these functions have generated Java code; is that also needed in this case?
4. Something else?
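
A usage sketch in spark-shell (1.5-era `sqlContext` assumed):

```
sqlContext.sql("SELECT size(array(1, 2, 3))").show()  // 3; also works on map columns
```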

Author: Pedro Rodriguez <ski.rodriguez@gmail.com>
Author: Pedro Rodriguez <prodriguez@trulia.com>

Closes #7462 from EntilZha/SPARK-8230 and squashes the following commits:

9a442ae [Pedro Rodriguez] fixed functions and sorted __all__
9aea3bb [Pedro Rodriguez] removed imports from python docs
15d4bf1 [Pedro Rodriguez] Added null test case and changed to nullSafeCodeGen
d88247c [Pedro Rodriguez] removed python code
bd5f0e4 [Pedro Rodriguez] removed duplicate function from rebase/merge
59931b4 [Pedro Rodriguez] fixed compile bug introduced when merging
c187175 [Pedro Rodriguez] updated code to add size to __all__ directly and removed redundant pretty print
130839f [Pedro Rodriguez] fixed failing test
aa9bade [Pedro Rodriguez] fix style
e093473 [Pedro Rodriguez] updated python code with docs, switched classes/traits implemented, added (failing) expression tests
0449377 [Pedro Rodriguez] refactored code to use better abstract classes/traits and implementations
9a1a2ff [Pedro Rodriguez] added unit tests for map size
2bfbcb6 [Pedro Rodriguez] added unit test for size
20df2b4 [Pedro Rodriguez] Finished working version of size function and added it to python
b503e75 [Pedro Rodriguez] First attempt at implementing size for maps and arrays
99a6a5c [Pedro Rodriguez] fixed failing test
cac75ac [Pedro Rodriguez] fix style
933d843 [Pedro Rodriguez] updated python code with docs, switched classes/traits implemented, added (failing) expression tests
42bb7d4 [Pedro Rodriguez] refactored code to use better abstract classes/traits and implementations
f9c3b8a [Pedro Rodriguez] added unit tests for map size
2515d9f [Pedro Rodriguez] added documentation
0e60541 [Pedro Rodriguez] added unit test for size
acf9853 [Pedro Rodriguez] Finished working version of size function and added it to python
84a5d38 [Pedro Rodriguez] First attempt at implementing size for maps and arrays
2015-07-21 00:53:20 -07:00
Cheng Hao 8c8f0ef59e [SPARK-8255] [SPARK-8256] [SQL] Add regex_extract/regex_replace
Add expressions `regexp_extract` & `regexp_replace`
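
A usage sketch in spark-shell (1.5-era `sqlContext` assumed); a character class avoids backslash-escaping noise in the SQL string:

```
sqlContext.sql("SELECT regexp_extract('foo123bar', '([0-9]+)', 1)").show()  // 123
sqlContext.sql("SELECT regexp_replace('foo123bar', '[0-9]+', '#')").show()  // foo#bar
```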

Author: Cheng Hao <hao.cheng@intel.com>

Closes #7468 from chenghao-intel/regexp and squashes the following commits:

e5ea476 [Cheng Hao] minor update for documentation
ef96fd6 [Cheng Hao] update the code gen
72cf28f [Cheng Hao] Add more log for compilation error
4e11381 [Cheng Hao] Add regexp_replace / regexp_extract support
2015-07-21 00:48:07 -07:00
Josh Rosen c032b0bf92 [SPARK-8797] [SPARK-9146] [SPARK-9145] [SPARK-9147] Support NaN ordering and equality comparisons in Spark SQL
This patch addresses an issue where queries that sorted float or double columns containing NaN values could fail with "Comparison method violates its general contract!" errors from TimSort.  The root of this problem is that `NaN > anything`, `NaN == anything`, and `NaN < anything` all return `false`.

Per the design specified in SPARK-9079, we have decided that `NaN = NaN` should return true and that NaN should appear last when sorting in ascending order (i.e. it is larger than any other numeric value).

In addition to implementing these semantics, this patch also adds canonicalization of NaN values in UnsafeRow, which is necessary in order to be able to do binary equality comparisons on equal NaNs that might have different bit representations (see SPARK-9147).
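
A sketch of the agreed semantics using Java's total order for doubles, which already treats NaN as equal to itself and larger than +Infinity, plus the canonicalization idea (not Spark's exact code):

```
// Total order: NaN == NaN, and NaN sorts above everything else.
assert(java.lang.Double.compare(Double.NaN, Double.NaN) == 0)
assert(java.lang.Double.compare(Double.NaN, Double.PositiveInfinity) > 0)

// Canonicalization: collapse every NaN bit pattern to the one canonical NaN
// so byte-level equality comparisons in UnsafeRow agree with value equality.
def canonicalize(d: Double): Double = if (d.isNaN) Double.NaN else d
```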

Author: Josh Rosen <joshrosen@databricks.com>

Closes #7194 from JoshRosen/nan and squashes the following commits:

983d4fc [Josh Rosen] Merge remote-tracking branch 'origin/master' into nan
88bd73c [Josh Rosen] Fix Row.equals()
a702e2e [Josh Rosen] normalization -> canonicalization
a7267cf [Josh Rosen] Normalize NaNs in UnsafeRow
fe629ae [Josh Rosen] Merge remote-tracking branch 'origin/master' into nan
fbb2a29 [Josh Rosen] Fix NaN comparisons in BinaryComparison expressions
c1fd4fe [Josh Rosen] Fold NaN test into existing test framework
b31eb19 [Josh Rosen] Uncomment failing tests
7fe67af [Josh Rosen] Support NaN == NaN (SPARK-9145)
58bad2c [Josh Rosen] Revert "Compare rows' string representations to work around NaN incomparability."
fc6b4d2 [Josh Rosen] Update CodeGenerator
3998ef2 [Josh Rosen] Remove unused code
a2ba2e7 [Josh Rosen] Fix prefix comparison for NaNs
a30d371 [Josh Rosen] Compare rows' string representations to work around NaN incomparability.
6f03f85 [Josh Rosen] Fix bug in Double / Float ordering
42a1ad5 [Josh Rosen] Stop filtering NaNs in UnsafeExternalSortSuite
bfca524 [Josh Rosen] Change ordering so that NaN is maximum value.
8d7be61 [Josh Rosen] Update randomized test to use ScalaTest's assume()
b20837b [Josh Rosen] Add failing test for new NaN comparison ordering
5b88b2b [Josh Rosen] Fix compilation of CodeGenerationSuite
d907b5b [Josh Rosen] Merge remote-tracking branch 'origin/master' into nan
630ebc5 [Josh Rosen] Specify an ordering for NaN values.
9bf195a [Josh Rosen] Re-enable NaNs in CodeGenerationSuite to produce more regression tests
13fc06a [Josh Rosen] Add regression test for NaN sorting issue
f9efbb5 [Josh Rosen] Fix ORDER BY NULL
e7dc4fb [Josh Rosen] Add very generic test for ordering
7d5c13e [Josh Rosen] Add regression test for SPARK-8782 (ORDER BY NULL)
b55875a [Josh Rosen] Generate doubles and floats over entire possible range.
5acdd5c [Josh Rosen] Infinity and NaN are interesting.
ab76cbd [Josh Rosen] Move code to Catalyst package.
d2b4a4a [Josh Rosen] Add random data generator test utilities to Spark SQL.
2015-07-20 22:38:05 -07:00
Tarek Auel a3c7a3ce32 [SPARK-9132][SPARK-9163][SQL] codegen conv
Jira: https://issues.apache.org/jira/browse/SPARK-9132
https://issues.apache.org/jira/browse/SPARK-9163

rxin, as you proposed in the Jira ticket, I just moved the logic to a separate object. I haven't changed any of the logic of `NumberConverter`.

Author: Tarek Auel <tarek.auel@googlemail.com>

Closes #7552 from tarekauel/SPARK-9163 and squashes the following commits:

40dcde9 [Tarek Auel] [SPARK-9132][SPARK-9163][SQL] style fix
fa985bd [Tarek Auel] [SPARK-9132][SPARK-9163][SQL] codegen conv
2015-07-20 22:08:12 -07:00
Tarek Auel c9db8eaa42 [SPARK-9159][SQL] codegen ascii, base64, unbase64
Jira: https://issues.apache.org/jira/browse/SPARK-9159

Author: Tarek Auel <tarek.auel@googlemail.com>

Closes #7542 from tarekauel/SPARK-9159 and squashes the following commits:

772e6bc [Tarek Auel] [SPARK-9159][SQL] codegen ascii, base64, unbase64
2015-07-20 15:32:46 -07:00
Tarek Auel 5112b7f58b [SPARK-9153][SQL] codegen StringLPad/StringRPad
Jira: https://issues.apache.org/jira/browse/SPARK-9153

Author: Tarek Auel <tarek.auel@googlemail.com>

Closes #7527 from tarekauel/SPARK-9153 and squashes the following commits:

3840c6b [Tarek Auel] [SPARK-9153] removed codegen fallback
92b6a5d [Tarek Auel] [SPARK-9153] codegen lpad/rpad
2015-07-20 09:35:45 -07:00
Liang-Chi Hsieh d743bec645 [SPARK-9172][SQL] Make DecimalPrecision support for Intersect and Except
JIRA: https://issues.apache.org/jira/browse/SPARK-9172

Simply make `DecimalPrecision` support for `Intersect` and `Except` in addition to `Union`.

Besides, add unit test for `DecimalPrecision` as well.

Author: Liang-Chi Hsieh <viirya@appier.com>

Closes #7511 from viirya/more_decimalprecieion and squashes the following commits:

4d29d10 [Liang-Chi Hsieh] Fix code comment.
9fb0d49 [Liang-Chi Hsieh] Make DecimalPrecision support for Intersect and Except.
2015-07-19 20:53:18 -07:00
Reynold Xin 163e3f1df9 [SPARK-8241][SQL] string function: concat_ws.
I also changed the semantics of concat w.r.t. null back to the same behavior as Hive.
That is to say, concat now returns null if any input is null.
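
A usage sketch in spark-shell (1.5-era `sqlContext` assumed), contrasting the two null behaviors:

```
sqlContext.sql("SELECT concat('a', NULL, 'b')").show()         // null: any null input nulls the result
sqlContext.sql("SELECT concat_ws('-', 'a', NULL, 'b')").show() // a-b: the separator variant skips nulls
```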

Author: Reynold Xin <rxin@databricks.com>

Closes #7504 from rxin/concat_ws and squashes the following commits:

83fd950 [Reynold Xin] Fixed type casting.
3ae85f7 [Reynold Xin] Write null better.
cdc7be6 [Reynold Xin] Added code generation for pure string mode.
a61c4e4 [Reynold Xin] Updated comments.
2d51406 [Reynold Xin] [SPARK-8241][SQL] string function: concat_ws.
2015-07-19 16:48:47 -07:00
Cheng Lian 34ed82bb44 [HOTFIX] [SQL] Fixes compilation error introduced by PR #7506
PR #7506 breaks master build because of compilation error. Note that #7506 itself looks good, but it seems that `git merge` did something stupid.

Author: Cheng Lian <lian@databricks.com>

Closes #7510 from liancheng/hotfix-for-pr-7506 and squashes the following commits:

7ea7e89 [Cheng Lian] Fixes compilation error
2015-07-19 18:58:19 +08:00
Reynold Xin 3427937ea2 [SQL] Make date/time functions more consistent with other database systems.
This pull request fixes some of the problems in #6981.

- Added date functions to `__all__` so they get exposed
- Rename day_of_month -> dayofmonth
- Rename day_in_year -> dayofyear
- Rename week_of_year -> weekofyear
- Removed "day" from Scala/Python API since it is ambiguous. Only leaving the alias in SQL.

Author: Reynold Xin <rxin@databricks.com>

This patch had conflicts when merged, resolved by
Committer: Reynold Xin <rxin@databricks.com>

Closes #7506 from rxin/datetime and squashes the following commits:

0cb24d9 [Reynold Xin] Export all functions in Python.
e44a4a0 [Reynold Xin] Removed day function from Scala and Python.
9c08fdc [Reynold Xin] [SQL] Make date/time functions more consistent with other database systems.
2015-07-19 01:17:22 -07:00
Tarek Auel a53d13f7aa [SPARK-8199][SQL] follow up; revert change in test
rxin / davies

Sorry for that unnecessary change. And thanks again for all your support!

Author: Tarek Auel <tarek.auel@googlemail.com>

Closes #7505 from tarekauel/SPARK-8199-FollowUp and squashes the following commits:

d09321c [Tarek Auel] [SPARK-8199] follow up; revert change in test
c17397f [Tarek Auel] [SPARK-8199] follow up; revert change in test
67acfe6 [Tarek Auel] [SPARK-8199] follow up; revert change in test
2015-07-19 01:16:01 -07:00
Reynold Xin 04c1b49f5e Fixed test cases. 2015-07-18 22:50:34 -07:00
Tarek Auel 83b682beec [SPARK-8199][SPARK-8184][SPARK-8183][SPARK-8182][SPARK-8181][SPARK-8180][SPARK-8179][SPARK-8177][SPARK-8178][SPARK-9115][SQL] date functions
Jira:
https://issues.apache.org/jira/browse/SPARK-8199
https://issues.apache.org/jira/browse/SPARK-8184
https://issues.apache.org/jira/browse/SPARK-8183
https://issues.apache.org/jira/browse/SPARK-8182
https://issues.apache.org/jira/browse/SPARK-8181
https://issues.apache.org/jira/browse/SPARK-8180
https://issues.apache.org/jira/browse/SPARK-8179
https://issues.apache.org/jira/browse/SPARK-8177
https://issues.apache.org/jira/browse/SPARK-8179
https://issues.apache.org/jira/browse/SPARK-9115

Regarding `day` and `dayofmonth`: are both necessary?

~~I am going to add `Quarter` to this PR as well.~~ Done.

~~As soon as the Scala coding is reviewed and discussed, I'll add the python api.~~ Done

Author: Tarek Auel <tarek.auel@googlemail.com>
Author: Tarek Auel <tarek.auel@gmail.com>

Closes #6981 from tarekauel/SPARK-8199 and squashes the following commits:

f7b4c8c [Tarek Auel] [SPARK-8199] fixed bug in tests
bb567b6 [Tarek Auel] [SPARK-8199] fixed test
3e095ba [Tarek Auel] [SPARK-8199] style and timezone fix
256c357 [Tarek Auel] [SPARK-8199] code cleanup
5983dcc [Tarek Auel] [SPARK-8199] whitespace fix
6e0c78f [Tarek Auel] [SPARK-8199] removed setTimeZone in tests, according to cloud-fans comment in #7488
4afc09c [Tarek Auel] [SPARK-8199] concise leap year handling
ea6c110 [Tarek Auel] [SPARK-8199] fix after merging master
70238e0 [Tarek Auel] Merge branch 'master' into SPARK-8199
3c6ae2e [Tarek Auel] [SPARK-8199] removed binary search
fb98ba0 [Tarek Auel] [SPARK-8199] python docstring fix
cdfae27 [Tarek Auel] [SPARK-8199] cleanup & python docstring fix
746b80a [Tarek Auel] [SPARK-8199] build fix
0ad6db8 [Tarek Auel] [SPARK-8199] minor fix
523542d [Tarek Auel] [SPARK-8199] address comments
2259299 [Tarek Auel] [SPARK-8199] day_of_month alias
d01b977 [Tarek Auel] [SPARK-8199] python underscore
56c4a92 [Tarek Auel] [SPARK-8199] update python docu
e223bc0 [Tarek Auel] [SPARK-8199] refactoring
d6aa14e [Tarek Auel] [SPARK-8199] fixed Hive compatibility
b382267 [Tarek Auel] [SPARK-8199] fixed bug in day calculation; removed set TimeZone in HiveCompatibilitySuite for test purposes; removed Hive tests for second and minute, because we can cast '2015-03-18' to a timestamp and extract a minute/second from it
1b2e540 [Tarek Auel] [SPARK-8119] style fix
0852655 [Tarek Auel] [SPARK-8119] changed from ExpectsInputTypes to implicit casts
ec87c69 [Tarek Auel] [SPARK-8119] bug fixing and refactoring
1358cdc [Tarek Auel] Merge remote-tracking branch 'origin/master' into SPARK-8199
740af0e [Tarek Auel] implement date function using a calculation based on days
4fb66da [Tarek Auel] WIP: date functions on calculation only
1a436c9 [Tarek Auel] wip
f775f39 [Tarek Auel] fixed return type
ad17e96 [Tarek Auel] improved implementation
c42b444 [Tarek Auel] Removed merge conflict file
ccb723c [Tarek Auel] [SPARK-8199] style and fixed merge issues
10e4ad1 [Tarek Auel] Merge branch 'master' into date-functions-fast
7d9f0eb [Tarek Auel] [SPARK-8199] git renaming issue
f3e7a9f [Tarek Auel] [SPARK-8199] revert change in DataFrameFunctionsSuite
6f5d95c [Tarek Auel] [SPARK-8199] fixed year interval
d9f8ac3 [Tarek Auel] [SPARK-8199] implement fast track
7bc9d93 [Tarek Auel] Merge branch 'master' into SPARK-8199
5a105d9 [Tarek Auel] [SPARK-8199] rebase after #6985 got merged
eb6760d [Tarek Auel] Merge branch 'master' into SPARK-8199
f120415 [Tarek Auel] improved runtime
a8edebd [Tarek Auel] use Calendar instead of SimpleDateFormat
5fe74e1 [Tarek Auel] fixed python style
3bfac90 [Tarek Auel] fixed style
356df78 [Tarek Auel] rely on cast mechanism of Spark. Simplified implementation
02efc5d [Tarek Auel] removed doubled code
a5ea120 [Tarek Auel] added python api; changed test to be more meaningful
b680db6 [Tarek Auel] added codegeneration to all functions
c739788 [Tarek Auel] added support for quarter SPARK-8178
849fb41 [Tarek Auel] fixed stupid test
638596f [Tarek Auel] improved codegen
4d8049b [Tarek Auel] fixed tests and added type check
5ebb235 [Tarek Auel] resolved naming conflict
d0e2f99 [Tarek Auel] date functions
2015-07-18 22:48:05 -07:00
Forest Fang 6cb6096c01 [SPARK-8443][SQL] Split GenerateMutableProjection Codegen due to JVM Code Size Limits
By grouping projection calls into multiple apply functions, we are able to push the number of projections codegen can handle from ~1k to ~60k. I have set the unit test to test against 5k, as 60k took 15s for the unit test to complete.
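
A hedged sketch of the splitting idea, as string templating over generated Java source (illustrative only, not the real GenerateMutableProjection):

```scala
// Emit projection statements in fixed-size groups, each wrapped in its own
// helper method, so no single generated method approaches the JVM's 64KB
// per-method bytecode limit; apply() then just calls the helpers in order.
def splitProjections(stmts: Seq[String], groupSize: Int = 100): String = {
  val helpers = stmts.grouped(groupSize).zipWithIndex.map { case (group, i) =>
    s"private void apply$i(InternalRow i) {\n  ${group.mkString("\n  ")}\n}"
  }.toSeq
  val calls = helpers.indices.map(i => s"apply$i(i);").mkString("\n  ")
  s"${helpers.mkString("\n")}\npublic void apply(InternalRow i) {\n  $calls\n}"
}
```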

Author: Forest Fang <forest.fang@outlook.com>

Closes #7076 from saurfang/codegen_size_limit and squashes the following commits:

b7a7635 [Forest Fang] [SPARK-8443][SQL] Execute and verify split projections in test
adef95a [Forest Fang] [SPARK-8443][SQL] Use safer factor and rewrite splitting code
1b5aa7e [Forest Fang] [SPARK-8443][SQL] inline execution if one block only
9405680 [Forest Fang] [SPARK-8443][SQL] split projection code by size limit
2015-07-18 21:05:44 -07:00
Reynold Xin 9914b1b2c5 [SPARK-9150][SQL] Create CodegenFallback and Unevaluable trait
It is very hard to track which expressions have code gen implemented or not. This patch removes the default fallback gencode implementation from Expression, and moves that into a new trait called CodegenFallback. Each concrete expression needs to either implement code generation, or mix in CodegenFallback. This makes it very easy to track which expressions have code generation implemented already.

Additionally, this patch creates an Unevaluable trait that can be used to track expressions that don't support evaluation (e.g. Star).
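
A minimal sketch of the two traits with simplified signatures (not Spark's actual API):

```scala
abstract class Expression {
  def eval(input: Any): Any
  def genCode(): String   // no default: implement codegen or mix in a trait
}

// Opt-in fallback: the "generated" code just calls back into interpreted eval.
trait CodegenFallback extends Expression {
  override def genCode(): String = s"/* fallback to interpreted */ expr.eval(input)"
}

// Expressions like Star can never be evaluated or code-generated.
trait Unevaluable extends Expression {
  final override def eval(input: Any): Any =
    throw new UnsupportedOperationException(s"Cannot evaluate $this")
  final override def genCode(): String =
    throw new UnsupportedOperationException(s"Cannot generate code for $this")
}
```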

Author: Reynold Xin <rxin@databricks.com>

Closes #7487 from rxin/codegenfallback and squashes the following commits:

14ebf38 [Reynold Xin] Fixed Conv
6c1c882 [Reynold Xin] Fixed Alias.
b42611b [Reynold Xin] [SPARK-9150][SQL] Create a trait to track code generation for expressions.
cb5c066 [Reynold Xin] Removed extra import.
39cbe40 [Reynold Xin] [SPARK-8240][SQL] string function: concat
2015-07-18 18:18:19 -07:00
Reynold Xin 6e1e2eba69 [SPARK-8240][SQL] string function: concat
Author: Reynold Xin <rxin@databricks.com>

Closes #7486 from rxin/concat and squashes the following commits:

5217d6e [Reynold Xin] Removed Hive's concat test.
f5cb7a3 [Reynold Xin] Concat is never nullable.
ae4e61f [Reynold Xin] Removed extra import.
fddcbbd [Reynold Xin] Fixed NPE.
22e831c [Reynold Xin] Added missing file.
57a2352 [Reynold Xin] [SPARK-8240][SQL] string function: concat
2015-07-18 14:07:56 -07:00
Yijie Shen 3d2134fc0d [SPARK-9055][SQL] WidenTypes should also support Intersect and Except
JIRA: https://issues.apache.org/jira/browse/SPARK-9055

cc rxin

Author: Yijie Shen <henry.yijieshen@gmail.com>

Closes #7491 from yijieshen/widen and squashes the following commits:

079fa52 [Yijie Shen] widenType support for intersect and expect
2015-07-18 12:57:53 -07:00
Wenchen Fan 86c50bf72c [SPARK-9171][SQL] add and improve tests for nondeterministic expressions
Author: Wenchen Fan <cloud0fan@outlook.com>

Closes #7496 from cloud-fan/tests and squashes the following commits:

0958f90 [Wenchen Fan] improve test for nondeterministic expressions
2015-07-18 11:58:53 -07:00
Wenchen Fan 692378c01d [SPARK-9167][SQL] use UTC Calendar in stringToDate
fix 2 bugs introduced in https://github.com/apache/spark/pull/7353

1. We should use a UTC Calendar when casting a string to a date. Before #7353, we used `DateTimeUtils.fromJavaDate(Date.valueOf(s.toString))` to cast a string to a date, and `fromJavaDate` calls `millisToDays` to avoid the time zone issue. Now that we use `DateTimeUtils.stringToDate(s)`, we should create a Calendar with UTC at the beginning (see the sketch below).
2. We should not change the default time zone in test cases. The `threadLocalLocalTimeZone` and `threadLocalTimestampFormat` in `DateTimeUtils` are only evaluated once per thread, so we can't set the default time zone back afterwards.
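
A sketch of fix 1 (a hypothetical helper, simplified from the real stringToDate):

```scala
import java.util.{Calendar, TimeZone}

// Build the Calendar in UTC so the computed epoch day does not shift with
// the JVM's default time zone.
def stringToDays(year: Int, month: Int, day: Int): Int = {
  val c = Calendar.getInstance(TimeZone.getTimeZone("UTC"))
  c.clear()                    // zero out time-of-day and milliseconds
  c.set(year, month - 1, day)  // Calendar months are 0-based
  (c.getTimeInMillis / (24L * 60 * 60 * 1000)).toInt  // ignoring pre-1970 flooring
}
```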

Author: Wenchen Fan <cloud0fan@outlook.com>

Closes #7488 from cloud-fan/datetime and squashes the following commits:

9cd6005 [Wenchen Fan] address comments
21ef293 [Wenchen Fan] fix 2 bugs in datetime
2015-07-18 11:25:16 -07:00
Reynold Xin fba3f5ba85 [SPARK-9169][SQL] Improve unit test coverage for null expressions.
Author: Reynold Xin <rxin@databricks.com>

Closes #7490 from rxin/unit-test-null-funcs and squashes the following commits:

7b276f0 [Reynold Xin] Move isNaN.
8307287 [Reynold Xin] [SPARK-9169][SQL] Improve unit test coverage for null expressions.
2015-07-18 11:06:46 -07:00
Yijie Shen 529a2c2d92 [SPARK-8280][SPARK-8281][SQL]Handle NaN, null and Infinity in math
JIRA:
https://issues.apache.org/jira/browse/SPARK-8280
https://issues.apache.org/jira/browse/SPARK-8281

Author: Yijie Shen <henry.yijieshen@gmail.com>

Closes #7451 from yijieshen/nan_null2 and squashes the following commits:

47a529d [Yijie Shen] style fix
63dee44 [Yijie Shen] handle log expressions similar to Hive
188be51 [Yijie Shen] null to nan in Math Expression
2015-07-17 17:33:19 -07:00
Wenchen Fan bd903ee89f [SPARK-9117] [SQL] fix BooleanSimplification in case-insensitive
Author: Wenchen Fan <cloud0fan@outlook.com>

Closes #7452 from cloud-fan/boolean-simplify and squashes the following commits:

2a6e692 [Wenchen Fan] fix style
d3cfd26 [Wenchen Fan] fix BooleanSimplification in case-insensitive
2015-07-17 16:28:24 -07:00
Wenchen Fan fd6b3101fb [SPARK-9113] [SQL] enable analysis check code for self join
The check was unreachable before, as `case operator: LogicalPlan` catches everything already.

Author: Wenchen Fan <cloud0fan@outlook.com>

Closes #7449 from cloud-fan/tmp and squashes the following commits:

2bb6637 [Wenchen Fan] add test
5493aea [Wenchen Fan] add the check back
27221a7 [Wenchen Fan] remove unnecessary analysis check code for self join
2015-07-17 16:03:33 -07:00
Yijie Shen 15fc2ffe55 [SPARK-9080][SQL] add isNaN predicate expression
JIRA: https://issues.apache.org/jira/browse/SPARK-9080

cc rxin

Author: Yijie Shen <henry.yijieshen@gmail.com>

Closes #7464 from yijieshen/isNaN and squashes the following commits:

11ae039 [Yijie Shen] add isNaN in functions
666718e [Yijie Shen] add isNaN predicate expression
2015-07-17 15:49:31 -07:00
Wenchen Fan 074085d678 [SPARK-9136] [SQL] fix several bugs in DateTimeUtils.stringToTimestamp
a follow up of https://github.com/apache/spark/pull/7353

1. We should use `Calendar.HOUR_OF_DAY` instead of `Calendar.HOUR` (the latter is the 12-hour field used with AM/PM).
2. We should call `c.set(Calendar.MILLISECOND, 0)` after `Calendar.getInstance`, as sketched below.

I'm not sure why the tests didn't fail on Jenkins, but I ran the latest Spark master branch locally and `DateTimeUtilsSuite` failed.
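
A sketch of the two fixes (illustrative statements only):

```scala
import java.util.Calendar

val c = Calendar.getInstance()
c.set(Calendar.MILLISECOND, 0)   // fix 2: a fresh Calendar carries "now"'s millis
c.set(Calendar.HOUR_OF_DAY, 18)  // fix 1: the 24-hour field; Calendar.HOUR is
                                 // the 12-hour field and needs Calendar.AM_PM
```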

Author: Wenchen Fan <cloud0fan@outlook.com>

Closes #7473 from cloud-fan/datetime and squashes the following commits:

66cdaf2 [Wenchen Fan] fix several bugs in DateTimeUtils.stringToTimestamp
2015-07-17 13:57:31 -07:00
Liang-Chi Hsieh eba6a1af4c [SPARK-8945][SQL] Add add and subtract expressions for IntervalType
JIRA: https://issues.apache.org/jira/browse/SPARK-8945

Add add and subtract expressions for IntervalType.

Author: Liang-Chi Hsieh <viirya@appier.com>

This patch had conflicts when merged, resolved by
Committer: Reynold Xin <rxin@databricks.com>

Closes #7398 from viirya/interval_add_subtract and squashes the following commits:

acd1f1e [Liang-Chi Hsieh] Merge remote-tracking branch 'upstream/master' into interval_add_subtract
5abae28 [Liang-Chi Hsieh] For comments.
6f5b72e [Liang-Chi Hsieh] Merge remote-tracking branch 'upstream/master' into interval_add_subtract
dbe3906 [Liang-Chi Hsieh] For comments.
13a2fc5 [Liang-Chi Hsieh] Merge remote-tracking branch 'upstream/master' into interval_add_subtract
83ec129 [Liang-Chi Hsieh] Remove intervalMethod.
acfe1ab [Liang-Chi Hsieh] Fix scala style.
d3e9d0e [Liang-Chi Hsieh] Add add and subtract expressions for IntervalType.
2015-07-17 09:38:08 -07:00
zhichao.li 305e77cd83 [SPARK-8209[SQL]Add function conv
cc chenghao-intel  adrian-wang

Author: zhichao.li <zhichao.li@intel.com>

Closes #6872 from zhichao-li/conv and squashes the following commits:

6ef3b37 [zhichao.li] add unittest and comments
78d9836 [zhichao.li] polish dataframe api and add unittest
e2bace3 [zhichao.li] update to use ImplicitCastInputTypes
cbcad3f [zhichao.li] add function conv
2015-07-17 09:32:27 -07:00
Wenchen Fan 59d24c226a [SPARK-9130][SQL] throw exception when check equality between external and internal row
Instead of returning false, it is better to throw an exception when checking equality between an external and an internal row.

Author: Wenchen Fan <cloud0fan@outlook.com>

Closes #7460 from cloud-fan/row-compare and squashes the following commits:

8a20911 [Wenchen Fan] improve equals
402daa8 [Wenchen Fan] throw exception when check equality between external and internal row
2015-07-17 09:31:13 -07:00
Davies Liu ec8973d124 [SPARK-9022] [SQL] Generated projections for UnsafeRow
Added two projections: GenerateUnsafeProjection and FromUnsafeProjection, which could be used to convert UnsafeRow from/to GenericInternalRow.

They will re-use the buffer during projection, similar to MutableProjection (without all the interface MutableProjection has).

cc rxin JoshRosen

Author: Davies Liu <davies@databricks.com>

Closes #7437 from davies/unsafe_proj2 and squashes the following commits:

dbf538e [Davies Liu] test with all the expression (only for supported types)
dc737b2 [Davies Liu] address comment
e424520 [Davies Liu] fix scala style
70e231c [Davies Liu] address comments
729138d [Davies Liu] Merge branch 'master' of github.com:apache/spark into unsafe_proj2
5a26373 [Davies Liu] unsafe projections
2015-07-17 01:27:14 -07:00
Wenchen Fan 3f6d28a5ca [SPARK-9102] [SQL] Improve project collapse with nondeterministic expressions
Currently we stop project collapse when the lower projection has nondeterministic expressions. However, this is sometimes overly conservative; we should be able to optimize `df.select(Rand(10)).select('a)` to `df.select('a)`.

Author: Wenchen Fan <cloud0fan@outlook.com>

Closes #7445 from cloud-fan/non-deterministic and squashes the following commits:

0deaef6 [Wenchen Fan] Improve project collapse with nondeterministic expressions
2015-07-17 00:59:15 -07:00
Tarek Auel 4ea6480a3b [SPARK-8995] [SQL] cast date strings like '2015-01-01 12:15:31' to date
Jira https://issues.apache.org/jira/browse/SPARK-8995

In PR #6981 we noticed that we cannot cast date strings that contain a time, like '2015-03-18 12:39:40', to date. It's also not possible to cast a string like '18:03:20' to a timestamp.

If a time is passed without a date, today is inferred as the date.
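
A hedged sketch of these casting rules, written with java.time instead of Spark's hand-rolled parser:

```scala
import java.time.{LocalDate, LocalDateTime, LocalTime}
import java.time.format.DateTimeFormatter
import scala.util.Try

def castToDate(s: String): Option[LocalDate] = {
  val ts = DateTimeFormatter.ofPattern("yyyy-MM-dd HH:mm:ss")
  Try(LocalDateTime.parse(s, ts).toLocalDate)           // "2015-01-01 12:15:31"
    .orElse(Try(LocalDate.parse(s)))                    // "2015-01-01"
    .orElse(Try { LocalTime.parse(s); LocalDate.now })  // "18:03:20" -> today
    .toOption
}
```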

Author: Tarek Auel <tarek.auel@googlemail.com>
Author: Tarek Auel <tarek.auel@gmail.com>

Closes #7353 from tarekauel/SPARK-8995 and squashes the following commits:

14f333b [Tarek Auel] [SPARK-8995] added tests for daylight saving time
ca1ae69 [Tarek Auel] [SPARK-8995] style fix
d20b8b4 [Tarek Auel] [SPARK-8995] bug fix: distinguish between 0 and null
ef05753 [Tarek Auel] [SPARK-8995] added check for year >= 1000
01c9ff3 [Tarek Auel] [SPARK-8995] support for time strings
34ec573 [Tarek Auel] fixed style
71622c0 [Tarek Auel] improved timestamp and date parsing
0e30c0a [Tarek Auel] Hive compatibility
cfbaed7 [Tarek Auel] fixed wrong checks
71f89c1 [Tarek Auel] [SPARK-8995] minor style fix
f7452fa [Tarek Auel] [SPARK-8995] removed old timestamp parsing
30e5aec [Tarek Auel] [SPARK-8995] date and timestamp cast
c1083fb [Tarek Auel] [SPARK-8995] cast date strings like '2015-01-01 12:15:31' to date or timestamp
2015-07-16 08:26:39 -07:00
Wenchen Fan ba33096846 [SPARK-9068][SQL] refactor the implicit type cast code
based on https://github.com/apache/spark/pull/7348

Author: Wenchen Fan <cloud0fan@outlook.com>

Closes #7420 from cloud-fan/type-check and squashes the following commits:

7633fa9 [Wenchen Fan] revert
fe169b0 [Wenchen Fan] improve test
03b70da [Wenchen Fan] enhance implicit type cast
2015-07-15 22:27:39 -07:00
Cheng Hao 42dea3acf9 [SPARK-8245][SQL] FormatNumber/Length Support for Expression
- `BinaryType` for `Length`
- `FormatNumber`

Author: Cheng Hao <hao.cheng@intel.com>

Closes #7034 from chenghao-intel/expression and squashes the following commits:

e534b87 [Cheng Hao] python api style issue
601bbf5 [Cheng Hao] add python API support
3ebe288 [Cheng Hao] update as feedback
52274f7 [Cheng Hao] add support for udf_format_number and length for binary
2015-07-15 21:47:21 -07:00
Yin Huai 9c64a75bfc [SPARK-9060] [SQL] Revert SPARK-8359, SPARK-8800, and SPARK-8677
JIRA: https://issues.apache.org/jira/browse/SPARK-9060

This PR reverts:
* 31bd30687b (SPARK-8359)
* 24fda73811 (SPARK-8677)
* 4b5cfc988f (SPARK-8800)

Author: Yin Huai <yhuai@databricks.com>

Closes #7426 from yhuai/SPARK-9060 and squashes the following commits:

651264d [Yin Huai] Revert "[SPARK-8359] [SQL] Fix incorrect decimal precision after multiplication"
cfda7e4 [Yin Huai] Revert "[SPARK-8677] [SQL] Fix non-terminating decimal expansion for decimal divide operation"
2de9afe [Yin Huai] Revert "[SPARK-8800] [SQL] Fix inaccurate precision/scale of Decimal division operation"
2015-07-15 21:08:30 -07:00
zhichao.li a9385271a9 [SPARK-8221][SQL]Add pmod function
https://issues.apache.org/jira/browse/SPARK-8221

One concern is that the result would be negative if the divisor is not positive (e.g. pmod(7, -3)), but the behavior is the same as Hive's.
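
A minimal sketch of the semantics (not Spark's generated code): the result takes the sign of the divisor.

```scala
// pmod(7, 3) == 1, pmod(-7, 3) == 2, pmod(7, -3) == -2 (negative divisor).
def pmod(a: Int, n: Int): Int = ((a % n) + n) % n
```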

Author: zhichao.li <zhichao.li@intel.com>

Closes #6783 from zhichao-li/pmod2 and squashes the following commits:

7083eb9 [zhichao.li] update to the latest type checking
d26dba7 [zhichao.li] add pmod
2015-07-15 10:43:38 -07:00
Reynold Xin 14935d846a [HOTFIX][SQL] Unit test breaking. 2015-07-15 00:12:21 -07:00
Yijie Shen f0e129740d [SPARK-8279][SQL]Add math function round
JIRA: https://issues.apache.org/jira/browse/SPARK-8279

Author: Yijie Shen <henry.yijieshen@gmail.com>

Closes #6938 from yijieshen/udf_round_3 and squashes the following commits:

07a124c [Yijie Shen] remove useless def children
392b65b [Yijie Shen] add negative scale test in DecimalSuite
61760ee [Yijie Shen] address reviews
302a78a [Yijie Shen] Add dataframe function test
31dfe7c [Yijie Shen] refactor round to make it readable
8c7a949 [Yijie Shen] rebase & inputTypes update
9555e35 [Yijie Shen] tiny style fix
d10be4a [Yijie Shen] use TypeCollection to specify wanted input and implicit cast
c3b9839 [Yijie Shen] rely on implict cast to handle string input
b0bff79 [Yijie Shen] make round's inner method's name more meaningful
9bd6930 [Yijie Shen] revert accidental change
e6f44c4 [Yijie Shen] refactor eval and genCode
1b87540 [Yijie Shen] modify checkInputDataTypes using foldable
5486b2d [Yijie Shen] DataFrame API modification
2077888 [Yijie Shen] codegen versioned eval
6cd9a64 [Yijie Shen] refactor Round's constructor
9be894e [Yijie Shen] add round functions in o.a.s.sql.functions
7c83e13 [Yijie Shen] more tests on round
56db4bb [Yijie Shen] Add decimal support to Round
7e163ae [Yijie Shen] style fix
653d047 [Yijie Shen] Add math function round
2015-07-14 23:30:41 -07:00
Reynold Xin f23a721c10 [SPARK-8993][SQL] More comprehensive type checking in expressions.
This patch makes the following changes:

1. ExpectsInputTypes only defines expected input types, but does not perform any implicit type casting.
2. ImplicitCastInputTypes is a new trait that both defines expected input types and performs implicit type casting.
3. BinaryOperator has a new abstract function "inputType", which defines the expected input type for both left/right. Concrete BinaryOperator expressions no longer perform any implicit type casting.
4. For BinaryOperators, convert NullType (i.e. null literals) into some accepted type so BinaryOperators don't need to handle NullTypes.

TODOs needed: fix unit tests for error reporting.

I'm intentionally not changing anything in aggregate expressions because yhuai is doing a big refactoring on that right now.
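
A simplified sketch of the split described in points 1-3 above (illustrative names and types only):

```scala
sealed trait DataType
case object IntegerType extends DataType
case object StringType extends DataType

// Declares expectations only; no casting is performed.
trait ExpectsInputTypes { def inputTypes: Seq[DataType] }

// Same declaration, but the analyzer may insert implicit casts to satisfy it.
trait ImplicitCastInputTypes extends ExpectsInputTypes

// Both children share one expected type; only here does the analyzer apply
// its type-casting rule, and null literals are converted to an accepted type.
abstract class BinaryOperator { def inputType: DataType }
```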

Author: Reynold Xin <rxin@databricks.com>

Closes #7348 from rxin/typecheck and squashes the following commits:

8fcf814 [Reynold Xin] Fixed ordering of cases.
3bb63e7 [Reynold Xin] Style fix.
f45408f [Reynold Xin] Comment update.
aa7790e [Reynold Xin] Moved RemoveNullTypes into ImplicitTypeCasts.
438ea07 [Reynold Xin] space
d55c9e5 [Reynold Xin] Removes NullTypes.
360d124 [Reynold Xin] Fixed the rule.
fb66657 [Reynold Xin] Convert NullType into some accepted type for BinaryOperators.
2e22330 [Reynold Xin] Fixed unit tests.
4932d57 [Reynold Xin] Style fix.
d061691 [Reynold Xin] Rename existing ExpectsInputTypes -> ImplicitCastInputTypes.
e4727cc [Reynold Xin] BinaryOperator should not be doing implicit cast.
d017861 [Reynold Xin] Improve expression type checking.
2015-07-14 22:52:53 -07:00
Liang-Chi Hsieh 4b5cfc988f [SPARK-8800] [SQL] Fix inaccurate precision/scale of Decimal division operation
JIRA: https://issues.apache.org/jira/browse/SPARK-8800

Previously, we turned to Java BigDecimal's divide with a specified ROUNDING_MODE to avoid the non-terminating decimal expansion problem. However, as JihongMA reported, the division operation on some specific values yields inaccurate results.

Author: Liang-Chi Hsieh <viirya@gmail.com>

Closes #7212 from viirya/fix_decimal4 and squashes the following commits:

4205a0a [Liang-Chi Hsieh] Fix inaccuracy precision/scale of Decimal division operation.
2015-07-14 14:19:27 -07:00
Daoyuan Wang 257236c3e1 [SPARK-6851] [SQL] function least/greatest follow up
This is a follow-up to the remaining comments from #6851.

Author: Daoyuan Wang <daoyuan.wang@intel.com>

Closes #7387 from adrian-wang/udflgfollow and squashes the following commits:

6163e62 [Daoyuan Wang] add skipping null values
e8c2e09 [Daoyuan Wang] use seq
8362966 [Daoyuan Wang] pr6851 follow up
2015-07-14 01:09:33 -07:00
Vinod K C 4c797f2b09 [SPARK-8636] [SQL] Fix equalNullSafe comparison
Author: Vinod K C <vinod.kc@huawei.com>

Closes #7040 from vinodkc/fix_CaseKeyWhen_equalNullSafe and squashes the following commits:

be5e641 [Vinod K C] Renamed equalNullSafe to threeValueEquals
aac9f67 [Vinod K C] Updated test suite and genCode method
f2d0b53 [Vinod K C]  Fix equalNullSafe comparison
2015-07-13 12:51:33 -07:00
Wenchen Fan 6b89943834 [SPARK-8944][SQL] Support casting between IntervalType and StringType
Author: Wenchen Fan <cloud0fan@outlook.com>

Closes #7355 from cloud-fan/fromString and squashes the following commits:

3bbb9d6 [Wenchen Fan] fix code gen
7dab957 [Wenchen Fan] naming fix
0fbbe19 [Wenchen Fan] address comments
ac1f3d1 [Wenchen Fan] Support casting between IntervalType and StringType
2015-07-13 00:49:39 -07:00
Daoyuan Wang 92540d22e4 [SPARK-8203] [SPARK-8204] [SQL] conditional function: least/greatest
chenghao-intel zhichao-li qiansl127

Author: Daoyuan Wang <daoyuan.wang@intel.com>

Closes #6851 from adrian-wang/udflg and squashes the following commits:

0f1bff2 [Daoyuan Wang] address comments from davis
7a6bdbb [Daoyuan Wang] add '.' for hex()
c1f6824 [Daoyuan Wang] add codegen, test for all types
ec625b0 [Daoyuan Wang] conditional function: least/greatest
2015-07-13 00:14:32 -07:00
Cheng Hao 0b0b9ceaf7 [SPARK-8247] [SPARK-8249] [SPARK-8252] [SPARK-8254] [SPARK-8257] [SPARK-8258] [SPARK-8259] [SPARK-8261] [SPARK-8262] [SPARK-8253] [SPARK-8260] [SPARK-8267] [SQL] Add String Expressions
Author: Cheng Hao <hao.cheng@intel.com>

Closes #6762 from chenghao-intel/str_funcs and squashes the following commits:

b09a909 [Cheng Hao] update the code as feedback
7ebbf4c [Cheng Hao] Add more string expressions
2015-07-09 11:11:34 -07:00
Reynold Xin a870a82fb6 [SPARK-8926][SQL] Code review followup.
I merged https://github.com/apache/spark/pull/7303 so it unblocks another PR. This addresses my own code review comment for that PR.

Author: Reynold Xin <rxin@databricks.com>

Closes #7313 from rxin/adt and squashes the following commits:

7ade82b [Reynold Xin] Fixed unit tests.
f8d5533 [Reynold Xin] [SPARK-8926][SQL] Code review followup.
2015-07-09 10:01:33 -07:00
Tarek Auel a1964e9d90 [SPARK-8830] [SQL] native levenshtein distance
Jira: https://issues.apache.org/jira/browse/SPARK-8830

rxin and HuJiayin, can you have a look at it?

Author: Tarek Auel <tarek.auel@googlemail.com>

Closes #7236 from tarekauel/native-levenshtein-distance and squashes the following commits:

ee4c4de [Tarek Auel] [SPARK-8830] implemented improvement proposals
c252e71 [Tarek Auel] [SPARK-8830] removed chartAt; use unsafe method for byte array comparison
ddf2222 [Tarek Auel] Merge branch 'master' into native-levenshtein-distance
179920a [Tarek Auel] [SPARK-8830] added description
5e9ed54 [Tarek Auel] [SPARK-8830] removed StringUtils import
dce4308 [Tarek Auel] [SPARK-8830] native levenshtein distance
2015-07-09 09:23:35 -07:00
Michael Armbrust 768907eb7b [SPARK-8926][SQL] Good errors for ExpectsInputType expressions
For example: `cannot resolve 'testfunction(null)' due to data type mismatch: argument 1 is expected to be of type int, however, null is of type datetype.`

Author: Michael Armbrust <michael@databricks.com>

Closes #7303 from marmbrus/expectsTypeErrors and squashes the following commits:

c654a0e [Michael Armbrust] fix udts and make errors pretty
137160d [Michael Armbrust] style
5428fda [Michael Armbrust] style
10fac82 [Michael Armbrust] [SPARK-8926][SQL] Good errors for ExpectsInputType expressions
2015-07-08 22:05:58 -07:00
Josh Rosen b55499a44a [SPARK-8932] Support copy() for UnsafeRows that do not use ObjectPools
We call Row.copy() in many places throughout SQL but UnsafeRow currently throws UnsupportedOperationException when copy() is called.

Supporting copying when ObjectPool is used may be difficult, since we may need to handle deep-copying of objects in the pool. In addition, this copy() method needs to produce a self-contained row object which may be passed around / buffered by downstream code which does not understand the UnsafeRow format.

In the long run, we'll need to figure out how to handle the ObjectPool corner cases, but this may be unnecessary if other changes are made. Therefore, in order to unblock my sort patch (#6444) I propose that we support copy() for the cases where UnsafeRow does not use an ObjectPool and continue to throw UnsupportedOperationException when an ObjectPool is used.

This patch accomplishes this by modifying UnsafeRow so that it knows the size of the row's backing data in order to be able to copy it into a byte array.
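
A minimal sketch of the approach (simplified to a byte-array backing; the real UnsafeRow can point at arbitrary memory):

```scala
class SimpleUnsafeRow(val base: Array[Byte], val offset: Int, val sizeInBytes: Int) {
  // Knowing sizeInBytes, copy() can materialize a self-contained row backed
  // by a fresh buffer that downstream code may hold onto freely.
  def copy(): SimpleUnsafeRow = {
    val buf = new Array[Byte](sizeInBytes)
    System.arraycopy(base, offset, buf, 0, sizeInBytes)
    new SimpleUnsafeRow(buf, 0, sizeInBytes)
  }
}
```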

Author: Josh Rosen <joshrosen@databricks.com>

Closes #7306 from JoshRosen/SPARK-8932 and squashes the following commits:

338e6bf [Josh Rosen] Support copy for UnsafeRows that do not use ObjectPools.
2015-07-08 20:28:05 -07:00
Yijie Shen a290814877 [SPARK-8866][SQL] use 1us precision for timestamp type
JIRA: https://issues.apache.org/jira/browse/SPARK-8866

Author: Yijie Shen <henry.yijieshen@gmail.com>

Closes #7283 from yijieshen/micro_timestamp and squashes the following commits:

dc735df [Yijie Shen] update CastSuite to avoid round error
714eaea [Yijie Shen] add timestamp_udf into blacklist due to precision lose
c3ca2f4 [Yijie Shen] fix unhandled case in CurrentTimestamp
8d4aa6b [Yijie Shen] use 1us precision for timestamp type
2015-07-08 20:20:17 -07:00
Reynold Xin 5d603dfe49 [SPARK-8878][SQL] Improve unit test coverage for bitwise expressions.
Author: Reynold Xin <rxin@databricks.com>

Closes #7273 from rxin/bitwise-unittest and squashes the following commits:

60c5667 [Reynold Xin] [SPARK-8878][SQL] Improve unit test coverage for bitwise expressions.
2015-07-07 19:12:40 -07:00
Reynold Xin 770ff1025e [SPARK-8876][SQL] Remove InternalRow type alias in expressions package.
The type alias was there because, when I initially moved Row around, I didn't want to make massive changes to the expression code. But now it should be pretty easy to just remove it. One less concept to worry about.

Author: Reynold Xin <rxin@databricks.com>

Closes #7270 from rxin/internalrow and squashes the following commits:

72fc842 [Reynold Xin] [SPARK-8876][SQL] Remove InternalRow type alias in expressions package.
2015-07-07 17:40:14 -07:00
Liang-Chi Hsieh da56c4e728 [SPARK-8794] [SQL] Make PrunedScan work for Sample
JIRA: https://issues.apache.org/jira/browse/SPARK-8794

Currently `PrunedScan` works only when followed by project or filter operations. However, even if there is a `Sample` between these operations and `PrunedScan`, `PrunedScan` should work too.

Author: Liang-Chi Hsieh <viirya@appier.com>
Author: Liang-Chi Hsieh <viirya@gmail.com>

Closes #7228 from viirya/sample_prunedscan and squashes the following commits:

ede7cd8 [Liang-Chi Hsieh] Keep PrunedScanSuite untouched.
6f05d30 [Liang-Chi Hsieh] Move unit test to FilterPushdownSuite.
5f32473 [Liang-Chi Hsieh] Merge remote-tracking branch 'upstream/master' into sample_prunedscan
7e4ba76 [Liang-Chi Hsieh] Use Optimzier for push down projection and filter.
0686830 [Liang-Chi Hsieh] Merge remote-tracking branch 'upstream/master' into sample_prunedscan
df82785 [Liang-Chi Hsieh] Make PrunedScan work on Sample.
2015-07-07 15:49:22 -07:00
Yin Huai 7b467cc934 [SPARK-8588] [SQL] Regression test
This PR adds regression test for https://issues.apache.org/jira/browse/SPARK-8588 (fixed by 457d07eaa0).

Author: Yin Huai <yhuai@databricks.com>

This patch had conflicts when merged, resolved by
Committer: Michael Armbrust <michael@databricks.com>

Closes #7103 from yhuai/SPARK-8588-test and squashes the following commits:

eb5f418 [Yin Huai] Add a query test.
c61a173 [Yin Huai] Regression test for SPARK-8588.
2015-07-06 16:28:47 -07:00
Davies Liu 37e4d92142 [SPARK-8784] [SQL] Add Python API for hex and unhex
Add a Python API for hex/unhex, and also clean up Hex/Unhex.

Author: Davies Liu <davies@databricks.com>

Closes #7223 from davies/hex and squashes the following commits:

6f1249d [Davies Liu] no explicit rule to cast string into binary
711a6ed [Davies Liu] fix test
f9fe5a3 [Davies Liu] Merge branch 'master' of github.com:apache/spark into hex
f032fbb [Davies Liu] Merge branch 'hex' of github.com:davies/spark into hex
49e325f [Davies Liu] Merge branch 'master' of github.com:apache/spark into hex
b31fc9a [Davies Liu] Update math.scala
25156b7 [Davies Liu] address comments and fix test
c3af78c [Davies Liu] address commments
1a24082 [Davies Liu] Add Python API for hex and unhex
2015-07-06 13:31:31 -07:00
Reynold Xin 86768b7b3b [SPARK-8831][SQL] Support AbstractDataType in TypeCollection.
Otherwise it is impossible to declare an expression supporting DecimalType.

Author: Reynold Xin <rxin@databricks.com>

Closes #7232 from rxin/typecollection-adt and squashes the following commits:

934d3d1 [Reynold Xin] [SPARK-8831][SQL] Support AbstractDataType in TypeCollection.
2015-07-05 23:54:25 -07:00
Reynold Xin c991ef5abb [SPARK-8822][SQL] clean up type checking in math.scala.
Author: Reynold Xin <rxin@databricks.com>

Closes #7220 from rxin/SPARK-8822 and squashes the following commits:

0cda076 [Reynold Xin] Test cases.
22d0463 [Reynold Xin] Fixed type precedence.
beb2a97 [Reynold Xin] [SPARK-8822][SQL] clean up type checking in math.scala.
2015-07-04 11:55:20 -07:00
Reynold Xin 347cab85cd [SQL] More unit tests for implicit type cast & add simpleString to AbstractDataType.
Author: Reynold Xin <rxin@databricks.com>

Closes #7221 from rxin/implicit-cast-tests and squashes the following commits:

64b13bd [Reynold Xin] Fixed a bug ..
489b732 [Reynold Xin] [SQL] More unit tests for implicit type cast & add simpleString to AbstractDataType.
2015-07-04 11:55:04 -07:00
Tarek Auel 6b3574e687 [SPARK-8270][SQL] levenshtein distance
Jira: https://issues.apache.org/jira/browse/SPARK-8270

Info: I cannot build the latest master; it gets stuck during the build process: `[INFO] Dependency-reduced POM written at: /Users/tarek/test/spark/bagel/dependency-reduced-pom.xml`

Author: Tarek Auel <tarek.auel@googlemail.com>

Closes #7214 from tarekauel/SPARK-8270 and squashes the following commits:

ab348b9 [Tarek Auel] Merge branch 'master' into SPARK-8270
a2ad318 [Tarek Auel] [SPARK-8270] changed order of fields
d91b12c [Tarek Auel] [SPARK-8270] python fix
adbd075 [Tarek Auel] [SPARK-8270] fixed typo
23185c9 [Tarek Auel] [SPARK-8270] levenshtein distance
2015-07-04 01:10:52 -07:00
Cheng Hao f35b0c3436 [SPARK-8238][SPARK-8239][SPARK-8242][SPARK-8243][SPARK-8268][SQL]Add ascii/base64/unbase64/encode/decode functions
Add `ascii`,`base64`,`unbase64`,`encode` and `decode` expressions.

Author: Cheng Hao <hao.cheng@intel.com>

Closes #6843 from chenghao-intel/str_funcs2 and squashes the following commits:

78dee7d [Cheng Hao] base 64 -> base64
9d6f9f4 [Cheng Hao] remove the toString method for expressions
ed5c19c [Cheng Hao] update code as comments
96170fc [Cheng Hao] scalastyle issues
e2df768 [Cheng Hao] remove the unused import
491ce7b [Cheng Hao] add ascii/base64/unbase64/encode/decode functions
2015-07-03 23:45:21 -07:00
Josh Rosen f32487b7ca [SPARK-8777] [SQL] Add random data generator test utilities to Spark SQL
This commit adds a set of random data generation utilities to Spark SQL, for use in its own unit tests.

- `RandomDataGenerator.forType(DataType)` returns an `Option[() => Any]` that, if defined, contains a function for generating random values for the given DataType.  The random values use the external representations for the given DataType (for example, for DateType we return `java.sql.Date` instances instead of longs).
- `DateTypeTestUtilities` defines some convenience fields for looping over instances of data types.  For example, `numericTypes` holds `DataType` instances for all supported numeric types.  These constants will help us to raise the level of abstraction in our tests.  For example, it's now very easy to write a test which is parameterized by all common data types.
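
A hedged sketch of the described shape, written against Spark's public type objects (the real RandomDataGenerator is test-only; this stand-in is illustrative):

```scala
import org.apache.spark.sql.types._
import scala.util.Random

def forType(dt: DataType): Option[() => Any] = dt match {
  case IntegerType => Some(() => Random.nextInt())
  case DoubleType  => Some(() => Random.nextDouble())
  // External representation: java.sql.Date rather than an internal Int.
  case DateType    => Some(() => new java.sql.Date(Random.nextInt(1 << 30).toLong * 1000))
  case _           => None   // no generator defined for this type
}
```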

Author: Josh Rosen <joshrosen@databricks.com>

Closes #7176 from JoshRosen/sql-random-data-generators and squashes the following commits:

f71634d [Josh Rosen] Roll back ScalaCheck usage
e0d7d49 [Josh Rosen] Bump ScalaCheck version in LICENSE
89d86b1 [Josh Rosen] Bump ScalaCheck version.
0c20905 [Josh Rosen] Initial attempt at using ScalaCheck.
b55875a [Josh Rosen] Generate doubles and floats over entire possible range.
5acdd5c [Josh Rosen] Infinity and NaN are interesting.
ab76cbd [Josh Rosen] Move code to Catalyst package.
d2b4a4a [Josh Rosen] Add random data generator test utilities to Spark SQL.
2015-07-03 23:05:17 -07:00
Daoyuan Wang 9fb6b832bc [SPARK-8192] [SPARK-8193] [SQL] udf current_date, current_timestamp
Author: Daoyuan Wang <daoyuan.wang@intel.com>

Closes #6985 from adrian-wang/udfcurrent and squashes the following commits:

6a20b64 [Daoyuan Wang] remove codegen and add lazy in testsuite
27c9f95 [Daoyuan Wang] refine tests..
e11ae75 [Daoyuan Wang] refine tests
61ed3d5 [Daoyuan Wang] add in functions
98e8550 [Daoyuan Wang] fix sytle
427d9dc [Daoyuan Wang] add tests and codegen
0b69a1f [Daoyuan Wang] udf current
2015-07-03 22:19:43 -07:00
zhichao.li ab535b9a1d [SPARK-8226] [SQL] Add function shiftrightunsigned
Author: zhichao.li <zhichao.li@intel.com>

Closes #7035 from zhichao-li/shiftRightUnsigned and squashes the following commits:

6bcca5a [zhichao.li] change coding style
3e9f5ae [zhichao.li] python style
d85ae0b [zhichao.li] add shiftrightunsigned
2015-07-03 15:39:16 -07:00
Reynold Xin a59d14f623 [SPARK-8801][SQL] Support TypeCollection in ExpectsInputTypes
This patch adds a new TypeCollection AbstractDataType that can be used by expressions to specify more than one expected input type.
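
A minimal sketch of the idea (simplified; not Spark's exact hierarchy):

```scala
sealed trait DataType
case object IntegerType extends DataType
case object DoubleType  extends DataType
case object StringType  extends DataType

trait AbstractDataType { def acceptsType(dt: DataType): Boolean }
case class TypeCollection(types: DataType*) extends AbstractDataType {
  override def acceptsType(dt: DataType): Boolean = types.contains(dt)
}

// e.g. an expression whose input may be either numeric type:
val numeric = TypeCollection(IntegerType, DoubleType)
assert(numeric.acceptsType(DoubleType) && !numeric.acceptsType(StringType))
```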

Author: Reynold Xin <rxin@databricks.com>

Closes #7202 from rxin/type-collection and squashes the following commits:

c714ca1 [Reynold Xin] Fixed style.
a0c0d12 [Reynold Xin] Fixed bugs and unit tests.
d8b8ae7 [Reynold Xin] Added TypeCollection.
2015-07-02 21:45:25 -07:00
zhichao.li 1a7a7d7d57 [SPARK-8213][SQL]Add function factorial
Author: zhichao.li <zhichao.li@intel.com>

Closes #6822 from zhichao-li/factorial and squashes the following commits:

26edf4f [zhichao.li] add factorial
2015-07-02 20:37:31 -07:00
Reynold Xin e589e71a29 Revert "[SPARK-8784] [SQL] Add Python API for hex and unhex"
This reverts commit fc7aebd94a.
2015-07-02 16:25:10 -07:00
Davies Liu fc7aebd94a [SPARK-8784] [SQL] Add Python API for hex and unhex
Also improve the performance of hex/unhex

Author: Davies Liu <davies@databricks.com>

Closes #7181 from davies/hex and squashes the following commits:

f032fbb [Davies Liu] Merge branch 'hex' of github.com:davies/spark into hex
49e325f [Davies Liu] Merge branch 'master' of github.com:apache/spark into hex
b31fc9a [Davies Liu] Update math.scala
25156b7 [Davies Liu] address comments and fix test
c3af78c [Davies Liu] address commments
1a24082 [Davies Liu] Add Python API for hex and unhex
2015-07-02 15:43:02 -07:00
Reynold Xin 52508beb65 [SPARK-8772][SQL] Implement implicit type cast for expressions that define input types.
Author: Reynold Xin <rxin@databricks.com>

Closes #7175 from rxin/implicitCast and squashes the following commits:

88080a2 [Reynold Xin] Clearer definition of implicit type cast.
f0ff97f [Reynold Xin] Added missing file.
c65e532 [Reynold Xin] [SPARK-8772][SQL] Implement implicit type cast for expressions that defines input types.
2015-07-02 14:16:14 -07:00
Yijie Shen 52302a8039 [SPARK-8407] [SQL] complex type constructors: struct and named_struct
This is a follow up of [SPARK-8283](https://issues.apache.org/jira/browse/SPARK-8283) ([PR-6828](https://github.com/apache/spark/pull/6828)), to support both `struct` and `named_struct` in Spark SQL.

After [#6725](https://github.com/apache/spark/pull/6828), the semantics of the [`CreateStruct`](https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/complexTypes.scala#L56) method have changed a little: it is no longer limited to cols of `NamedExpressions`, and it names non-NamedExpression fields following the Hive convention (col1, col2, ...).

This PR both loosens [`struct`](https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/functions.scala#L723) to take children of `Expression` type and adds `named_struct` support.
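
Hypothetical usage under the semantics described above (assumes a live SparkSession named `spark`, i.e. the post-2.0 entry point):

```scala
import org.apache.spark.sql.functions._

val df = spark.range(1)
df.select(struct(lit(1), lit("a")))                 // fields auto-named col1, col2
spark.sql("SELECT named_struct('x', 1, 'y', 'a')")  // explicit field names
```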

Author: Yijie Shen <henry.yijieshen@gmail.com>

Closes #6874 from yijieshen/SPARK-8283 and squashes the following commits:

4cd3375ac [Yijie Shen] change struct documentation
d599d0b [Yijie Shen] rebase code
9a7039e [Yijie Shen] fix reviews and regenerate golden answers
b487354 [Yijie Shen] replace assert using checkAnswer
f07e114 [Yijie Shen] tiny fix
9613be9 [Yijie Shen] review fix
7fef712 [Yijie Shen] Fix checkInputTypes' implementation using foldable and nullable
60812a7 [Yijie Shen] Fix type check
828d694 [Yijie Shen] remove unnecessary resolved assertion inside dataType method
fd3cd8e [Yijie Shen] remove type check from eval
7a71255 [Yijie Shen] tiny fix
ccbbd86 [Yijie Shen] Fix reviews
47da332 [Yijie Shen] remove nameStruct API from DataFrame
917e680 [Yijie Shen] Fix reviews
4bd75ad [Yijie Shen] loosen struct method in functions.scala to take Expression children
0acb7be [Yijie Shen] Add CreateNamedStruct in both DataFrame function API and FunctionRegistery
2015-07-02 10:12:25 -07:00
Wenchen Fan afa021e03f [SPARK-8747] [SQL] fix EqualNullSafe for binary type
also improve tests for binary comparison.

Author: Wenchen Fan <cloud0fan@outlook.com>

Closes #7143 from cloud-fan/binary and squashes the following commits:

28a5b76 [Wenchen Fan] improve test
04ef4b0 [Wenchen Fan] fix equalNullSafe
2015-07-02 10:06:38 -07:00
Tarek Auel 5b3338130d [SPARK-8223] [SPARK-8224] [SQL] shift left and shift right
Jira:
https://issues.apache.org/jira/browse/SPARK-8223
https://issues.apache.org/jira/browse/SPARK-8224

~~I am aware of #7174 and will update this pr, if it's merged.~~ Done
I don't know if #7034 can simplify this, but we can have a look at it if it gets merged.

rxin In the Jira ticket the function has no second argument. I added a `numBits` argument that allows specifying the number of bits; I guess this improves usability. I wanted to add `shiftleft(value)` as well, but the `selectExpr` DataFrame tests crash if I have both. In order to do this, I added the following to functions.scala: `def shiftRight(e: Column): Column = ShiftRight(e.expr, lit(1).expr)`, but as I mentioned this doesn't pass tests like `df.selectExpr("shiftRight(a)", ...` (not-enough-arguments exception).

If we need the bitwise shift in order to be Hive compatible, I suggest adding `shiftLeft` and something like `shiftLeftX`.
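
For reference, the three shifts at the JVM-primitive level (a sketch; the SQL functions essentially wrap these operators with a user-supplied numBits):

```scala
def shiftLeft(v: Int, numBits: Int): Int          = v << numBits
def shiftRight(v: Int, numBits: Int): Int         = v >> numBits   // sign-extending
def shiftRightUnsigned(v: Int, numBits: Int): Int = v >>> numBits  // zero-filling
```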

Author: Tarek Auel <tarek.auel@googlemail.com>

Closes #7178 from tarekauel/8223 and squashes the following commits:

8023bb5 [Tarek Auel] [SPARK-8223][SPARK-8224] fixed test
f3f64e6 [Tarek Auel] [SPARK-8223][SPARK-8224] Integer -> Int
f628706 [Tarek Auel] [SPARK-8223][SPARK-8224] removed toString; updated function description
3b56f2a [Tarek Auel] Merge remote-tracking branch 'origin/master' into 8223
5189690 [Tarek Auel] [SPARK-8223][SPARK-8224] minor fix and style fix
9434a28 [Tarek Auel] Merge remote-tracking branch 'origin/master' into 8223
44ee324 [Tarek Auel] [SPARK-8223][SPARK-8224] docu fix
ac7fe9d [Tarek Auel] [SPARK-8223][SPARK-8224] right and left bit shift
2015-07-02 10:02:19 -07:00
zhichao.li b285ac5ba8 [SPARK-8227] [SQL] Add function unhex
cc chenghao-intel  adrian-wang

Author: zhichao.li <zhichao.li@intel.com>

Closes #7113 from zhichao-li/unhex and squashes the following commits:

379356e [zhichao.li] remove exception checking
a4ae6dc [zhichao.li] add udf_unhex to whitelist
fe5c14a [zhichao.li] add todigit
607d7a3 [zhichao.li] use checkInputTypes
bffd37f [zhichao.li] change to use Hex in apache common package
cde73f5 [zhichao.li] update to use AutoCastInputTypes
11945c7 [zhichao.li] style
c852d46 [zhichao.li] Add function unhex
2015-07-01 22:19:51 -07:00
Reynold Xin 9fd13d5613 [SPARK-8770][SQL] Create BinaryOperator abstract class.
Our current BinaryExpression abstract class is not for generic binary expressions, i.e. it requires left/right children to have the same type. However, due to its name, contributors build new binary expressions that don't have that assumption (e.g. Sha) and still extend BinaryExpression.

This patch creates a new BinaryOperator abstract class and updates the analyzer to only apply the type casting rule there. This patch also adds the notion of "prettyName" to expressions, which defines the user-facing name for the expression.
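
A simplified sketch of the distinction (illustrative signatures, not the actual classes):

```scala
sealed trait AbstractDataType

abstract class Expression {
  def prettyName: String = getClass.getSimpleName.toLowerCase  // user-facing name
}

// Generic: the two children may have different types (e.g. the Sha expressions).
abstract class BinaryExpression extends Expression

// Operator: one expected type shared by left and right; the analyzer applies
// its type-casting rule only to subclasses of this.
abstract class BinaryOperator extends BinaryExpression {
  def inputType: AbstractDataType
  def symbol: String
}
```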

Author: Reynold Xin <rxin@databricks.com>

Closes #7174 from rxin/binary-opterator and squashes the following commits:

f31900d [Reynold Xin] [SPARK-8770][SQL] Create BinaryOperator abstract class.
fceb216 [Reynold Xin] Merge branch 'master' of github.com:apache/spark into binary-opterator
d8518cf [Reynold Xin] Updated Python tests.
2015-07-01 21:14:13 -07:00
Reynold Xin 3a342dedc0 Revert "[SPARK-8770][SQL] Create BinaryOperator abstract class."
This reverts commit 2727789998.
2015-07-01 16:59:39 -07:00
Reynold Xin 2727789998 [SPARK-8770][SQL] Create BinaryOperator abstract class.
Our current BinaryExpression abstract class is not for generic binary expressions, i.e. it requires left/right children to have the same type. However, due to its name, contributors build new binary expressions that don't have that assumption (e.g. Sha) and still extend BinaryExpression.

This patch creates a new BinaryOperator abstract class and updates the analyzer to only apply the type casting rule there. This patch also adds the notion of "prettyName" to expressions, which defines the user-facing name for the expression.

Author: Reynold Xin <rxin@databricks.com>

Closes #7170 from rxin/binaryoperator and squashes the following commits:

51264a5 [Reynold Xin] [SPARK-8770][SQL] Create BinaryOperator abstract class.
2015-07-01 16:56:48 -07:00
Davies Liu 3083e17645 [QUICKFIX] [SQL] fix copy of generated row
copy() of generated Row doesn't check nullability of columns

Author: Davies Liu <davies@databricks.com>

Closes #7163 from davies/fix_copy and squashes the following commits:

661a206 [Davies Liu] fix copy of generated row
2015-07-01 12:39:57 -07:00
Reynold Xin fc3a6fe67f [SPARK-8749][SQL] Remove HiveTypeCoercion trait.
Moved all the rules into the companion object.

Author: Reynold Xin <rxin@databricks.com>

Closes #7147 from rxin/SPARK-8749 and squashes the following commits:

c1c6dc0 [Reynold Xin] [SPARK-8749][SQL] Remove HiveTypeCoercion trait.
2015-07-01 00:08:16 -07:00
Christian Kadner 1e1f339976 [SPARK-6785] [SQL] fix DateTimeUtils for dates before 1970
Hi Michael,
this Pull-Request is a follow-up to [PR-6242](https://github.com/apache/spark/pull/6242). I removed the two obsolete test cases from the HiveQuerySuite and deleted the corresponding golden answer files.
Thanks for your review!

Author: Christian Kadner <ckadner@us.ibm.com>

Closes #6983 from ckadner/SPARK-6785 and squashes the following commits:

ab1e79b [Christian Kadner] Merge remote-tracking branch 'origin/SPARK-6785' into SPARK-6785
1fed877 [Christian Kadner] [SPARK-6785][SQL] failed Scala style test, remove spaces on empty line DateTimeUtils.scala:61
9d8021d [Christian Kadner] [SPARK-6785][SQL] merge recent changes in DateTimeUtils & MiscFunctionsSuite
b97c3fb [Christian Kadner] [SPARK-6785][SQL] move test case for DateTimeUtils to DateTimeUtilsSuite
a451184 [Christian Kadner] [SPARK-6785][SQL] fix DateTimeUtils.fromJavaDate(java.util.Date) for Dates before 1970
2015-06-30 12:22:34 -07:00
Shilei 722aa5f48e [SPARK-8236] [SQL] misc functions: crc32
https://issues.apache.org/jira/browse/SPARK-8236

Author: Shilei <shilei.qian@intel.com>

Closes #7108 from qiansl127/Crc32 and squashes the following commits:

5477352 [Shilei] Change to AutoCastInputTypes
5f16e5d [Shilei] Add misc function crc32
2015-06-30 09:49:58 -07:00
Wenchen Fan 08fab48438 [SPARK-8590] [SQL] add code gen for ExtractValue
TODO:  use array instead of Seq as internal representation for `ArrayType`

Author: Wenchen Fan <cloud0fan@outlook.com>

Closes #6982 from cloud-fan/extract-value and squashes the following commits:

e203bc1 [Wenchen Fan] address comments
4da0f0b [Wenchen Fan] some clean up
f679969 [Wenchen Fan] fix bug
e64f942 [Wenchen Fan] remove generic
e3f8427 [Wenchen Fan] fix style and address comments
fc694e8 [Wenchen Fan] add code gen for extract value
2015-06-30 07:58:49 -07:00
Davies Liu ed359de595 [SPARK-8579] [SQL] support arbitrary object in UnsafeRow
This PR brings arbitrary object support to UnsafeRow (both in the grouping key and the aggregation buffer).

Two object pools will be created to hold those non-primitive objects, and their indices are put into the UnsafeRow. In order to compare the grouping key as bytes, the objects in the key will be stored in a unique object pool, to make sure the same objects get the same index (used as the hashCode).

For StringType and BinaryType, we still put them as var-length in UnsafeRow when initializing, for better performance. But on update, they become objects inside the object pools (leaving some garbage behind in the buffer).
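
A minimal sketch of the two pools described above: the row stores an Int index instead of the object, and the unique pool dedupes so equal grouping keys get equal indices (and thus compare equal as bytes).

```scala
import scala.collection.mutable

class ObjectPool {
  protected val objs = mutable.ArrayBuffer.empty[AnyRef]
  def put(o: AnyRef): Int = { objs += o; objs.size - 1 }
  def get(idx: Int): AnyRef = objs(idx)
}

class UniqueObjectPool extends ObjectPool {
  private val indexOf = mutable.HashMap.empty[AnyRef, Int]
  override def put(o: AnyRef): Int = indexOf.getOrElseUpdate(o, super.put(o))
}
```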

BTW: Will create a JIRA once issue.apache.org is available.

cc JoshRosen rxin

Author: Davies Liu <davies@databricks.com>

Closes #6959 from davies/unsafe_obj and squashes the following commits:

5ce39da [Davies Liu] fix comment
5e797bf [Davies Liu] Merge branch 'master' of github.com:apache/spark into unsafe_obj
5803d64 [Davies Liu] fix conflict
461d304 [Davies Liu] Merge branch 'master' of github.com:apache/spark into unsafe_obj
2f41c90 [Davies Liu] Merge branch 'master' of github.com:apache/spark into unsafe_obj
b04d69c [Davies Liu] address comments
4859b80 [Davies Liu] fix comments
f38011c [Davies Liu] add a test for grouping by decimal
d2cf7ab [Davies Liu] add more tests for null checking
71983c5 [Davies Liu] add test for timestamp
e8a1649 [Davies Liu] reuse buffer for string
39f09ca [Davies Liu] Merge branch 'master' of github.com:apache/spark into unsafe_obj
035501e [Davies Liu] fix style
236d6de [Davies Liu] support arbitrary object in UnsafeRow
2015-06-29 15:59:20 -07:00
Ilya Ganelin f6fc254ec4 [SPARK-8056][SQL] Design an easier way to construct schema for both Scala and Python
I've added functionality to create a new StructType in a builder style, similar to how we add parameters to a new SparkContext.

I've also added tests for this type of creation.
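
For illustration, the builder style looks roughly like this (the DataType-based add overloads are the core of the change; the string-typed form is mentioned in the commit log and assumed here):

```scala
import org.apache.spark.sql.types._

val schema = new StructType()
  .add("id", LongType, nullable = false)
  .add("name", StringType)
  .add("score", "double")   // string form, parsed by the DataType parser
```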

Author: Ilya Ganelin <ilya.ganelin@capitalone.com>

Closes #6686 from ilganeli/SPARK-8056B and squashes the following commits:

27c1de1 [Ilya Ganelin] Rename
467d836 [Ilya Ganelin] Removed from_string in favor of _parse_Datatype_json_value
5fef5a4 [Ilya Ganelin] Updates for type parsing
4085489 [Ilya Ganelin] Style errors
3670cf5 [Ilya Ganelin] added string to DataType conversion
8109e00 [Ilya Ganelin] Fixed error in tests
41ab686 [Ilya Ganelin] Fixed style errors
e7ba7e0 [Ilya Ganelin] Moved some python tests to tests.py. Added cleaner handling of null data type and added test for correctness of input format
15868fa [Ilya Ganelin] Fixed python errors
b79b992 [Ilya Ganelin] Merge remote-tracking branch 'upstream/master' into SPARK-8056B
a3369fc [Ilya Ganelin] Fixing space errors
e240040 [Ilya Ganelin] Style
bab7823 [Ilya Ganelin] Constructor error
73d4677 [Ilya Ganelin] Style
4ed00d9 [Ilya Ganelin] Fixed default arg
67df57a [Ilya Ganelin] Removed Foo
04cbf0c [Ilya Ganelin] Added comments for single object
0484d7a [Ilya Ganelin] Restored second method
6aeb740 [Ilya Ganelin] Style
689e54d [Ilya Ganelin] Style
f497e9e [Ilya Ganelin] Got rid of old code
e3c7a88 [Ilya Ganelin] Fixed doctest failure
a62ccde [Ilya Ganelin] Style
966ac06 [Ilya Ganelin] style checks
dabb7e6 [Ilya Ganelin] Added Python tests
a3f4152 [Ilya Ganelin] added python bindings and better comments
e6e536c [Ilya Ganelin] Added extra space
7529a2e [Ilya Ganelin] Fixed formatting
d388f86 [Ilya Ganelin] Fixed small bug
c4e3bf5 [Ilya Ganelin] Reverted to using parse. Updated parse to support long
d7634b6 [Ilya Ganelin] Reverted to fromString to properly support types
22c39d5 [Ilya Ganelin] replaced FromString with DataTypeParser.parse. Replaced empty constructor initializing a null to have it instead create a new array to allow appends to it.
faca398 [Ilya Ganelin] [SPARK-8056] Replaced default argument usage. Updated usage and code for DataType.fromString
1acf76e [Ilya Ganelin] Scala style
e31c674 [Ilya Ganelin] Fixed bug in test
8dc0795 [Ilya Ganelin] Added tests for creation of StructType object with new methods
fdf7e9f [Ilya Ganelin] [SPARK-8056] Created add methods to facilitate building new StructType objects.
2015-06-29 14:15:15 -07:00
zhichao.li 637b4eedad [SPARK-8214] [SQL] Add function hex
cc chenghao-intel  adrian-wang

Author: zhichao.li <zhichao.li@intel.com>

Closes #6976 from zhichao-li/hex and squashes the following commits:

e218d1b [zhichao.li] turn off scalastyle for non-ascii
de3f5ea [zhichao.li] non-ascii char
cf9c936 [zhichao.li] give separated buffer for each hex method
967ec90 [zhichao.li] Make 'value' as a feild of Hex
3b2fa13 [zhichao.li] tiny fix
a647641 [zhichao.li] remove duplicate null check
7cab020 [zhichao.li] tiny refactoring
35ecfe5 [zhichao.li] add function hex
2015-06-29 12:25:16 -07:00
Tarek Auel a5c2961caa [SPARK-8235] [SQL] misc function sha / sha1
Jira: https://issues.apache.org/jira/browse/SPARK-8235

I added the support for sha1. If I understood rxin correctly, sha and sha1 should execute the same algorithm, shouldn't they?

Please take a close look at the Python part. This is adapted from #6934.

Author: Tarek Auel <tarek.auel@gmail.com>
Author: Tarek Auel <tarek.auel@googlemail.com>

Closes #6963 from tarekauel/SPARK-8235 and squashes the following commits:

f064563 [Tarek Auel] change to shaHex
7ce3cdc [Tarek Auel] rely on automatic cast
a1251d6 [Tarek Auel] Merge remote-tracking branch 'upstream/master' into SPARK-8235
68eb043 [Tarek Auel] added docstring
be5aff1 [Tarek Auel] improved error message
7336c96 [Tarek Auel] added type check
cf23a80 [Tarek Auel] simplified example
ebf75ef [Tarek Auel] [SPARK-8301] updated the python documentation. Removed sha in python and scala
6d6ff0d [Tarek Auel] [SPARK-8233] added docstring
ea191a9 [Tarek Auel] [SPARK-8233] fixed signatureof python function. Added expected type to misc
e3fd7c3 [Tarek Auel] SPARK[8235] added sha to the list of __all__
e5dad4e [Tarek Auel] SPARK[8235] sha / sha1
2015-06-29 11:57:19 -07:00
Liang-Chi Hsieh 24fda73811 [SPARK-8677] [SQL] Fix non-terminating decimal expansion for decimal divide operation
JIRA: https://issues.apache.org/jira/browse/SPARK-8677

Author: Liang-Chi Hsieh <viirya@gmail.com>

Closes #7056 from viirya/fix_decimal3 and squashes the following commits:

34d7419 [Liang-Chi Hsieh] Fix Non-terminating decimal expansion for decimal divide operation.
2015-06-28 14:48:44 -07:00
Davies Liu 77da5be6f1 [SPARK-8610] [SQL] Separate Row and InternalRow (part 2)
Currently, we use GenericRow for both Row and InternalRow, which is confusing because it could contain Scala types as well as Catalyst types.

This PR changes to use GenericInternalRow for InternalRow (contains catalyst types), GenericRow for Row (contains Scala types).

Also fixes some incorrect use of InternalRow or Row.

Author: Davies Liu <davies@databricks.com>

Closes #7003 from davies/internalrow and squashes the following commits:

d05866c [Davies Liu] fix test: rollback changes for pyspark
72878dd [Davies Liu] Merge branch 'master' of github.com:apache/spark into internalrow
efd0b25 [Davies Liu] fix copy of MutableRow
87b13cf [Davies Liu] fix test
d2ebd72 [Davies Liu] fix style
eb4b473 [Davies Liu] mark expensive API as final
bd4e99c [Davies Liu] Merge branch 'master' of github.com:apache/spark into internalrow
bdfb78f [Davies Liu] remove BaseMutableRow
6f99a97 [Davies Liu] fix catalyst test
defe931 [Davies Liu] remove BaseRow
288b31f [Davies Liu] Merge branch 'master' of github.com:apache/spark into internalrow
9d24350 [Davies Liu] separate Row and InternalRow (part 2)
2015-06-28 08:03:58 -07:00
Liang-Chi Hsieh 47c874babe [SPARK-8237] [SQL] Add misc function sha2
JIRA: https://issues.apache.org/jira/browse/SPARK-8237

Author: Liang-Chi Hsieh <viirya@gmail.com>

Closes #6934 from viirya/expr_sha2 and squashes the following commits:

35e0bb3 [Liang-Chi Hsieh] For comments.
68b5284 [Liang-Chi Hsieh] Merge remote-tracking branch 'upstream/master' into expr_sha2
8573aff [Liang-Chi Hsieh] Remove unnecessary Product.
ee61e06 [Liang-Chi Hsieh] Merge remote-tracking branch 'upstream/master' into expr_sha2
59e41aa [Liang-Chi Hsieh] Add misc function: sha2.
2015-06-25 22:07:37 -07:00
Wenchen Fan b71d3254e5 [SPARK-8075] [SQL] apply type check interface to more expressions
a follow up of https://github.com/apache/spark/pull/6405.
Note: it's not a big change; much of the diff comes from swapping some code in `aggregates.scala` so that each aggregate function sits right below its corresponding aggregate expression.

Author: Wenchen Fan <cloud0fan@outlook.com>

Closes #6723 from cloud-fan/type-check and squashes the following commits:

2124301 [Wenchen Fan] fix tests
5a658bb [Wenchen Fan] add tests
287d3bb [Wenchen Fan] apply type check interface to more expressions
2015-06-24 16:26:00 -07:00
Wenchen Fan f04b5672c5 [SPARK-7289] handle project -> limit -> sort efficiently
Make the `TakeOrdered` strategy and operator more general, so that they can optionally handle a projection when necessary.
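
A minimal model of the generalized operator (the real implementation works per partition with a bounded top-K rather than sorting everything):

```scala
// Top-K by the sort order, with a projection applied only to the K survivors.
def takeOrderedAndProject[T, U](rows: Iterator[T], k: Int, project: T => U)
                               (implicit ord: Ordering[T]): Seq[U] =
  rows.toSeq.sorted.take(k).map(project)
```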

Author: Wenchen Fan <cloud0fan@outlook.com>

Closes #6780 from cloud-fan/limit and squashes the following commits:

34aa07b [Wenchen Fan] revert
07d5456 [Wenchen Fan] clean closure
20821ec [Wenchen Fan] fix
3676a82 [Wenchen Fan] address comments
b558549 [Wenchen Fan] address comments
214842b [Wenchen Fan] fix style
2d8be83 [Wenchen Fan] add LimitPushDown
948f740 [Wenchen Fan] fix existing
2015-06-24 13:28:50 -07:00
Wenchen Fan 09fcf96b8f [SPARK-8371] [SQL] improve unit test for MaxOf and MinOf and fix bugs
a follow up of https://github.com/apache/spark/pull/6813

Author: Wenchen Fan <cloud0fan@outlook.com>

Closes #6825 from cloud-fan/cg and squashes the following commits:

43170cc [Wenchen Fan] fix bugs in code gen
2015-06-23 23:11:42 -07:00
Davies Liu 6f4cadf5ee [SPARK-8432] [SQL] fix hashCode() and equals() of BinaryType in Row
Also added more tests in LiteralExpressionSuite
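
A sketch of the underlying issue: on the JVM, Array[Byte] uses reference equality and identity hashCode, so binary values in a Row need content-based comparison.

```scala
import java.util.Arrays

def binaryEquals(a: Array[Byte], b: Array[Byte]): Boolean = Arrays.equals(a, b)
def binaryHashCode(a: Array[Byte]): Int = Arrays.hashCode(a)
```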

Author: Davies Liu <davies@databricks.com>

Closes #6876 from davies/fix_hashcode and squashes the following commits:

429c2c0 [Davies Liu] Merge branch 'master' of github.com:apache/spark into fix_hashcode
32d9811 [Davies Liu] fix test
a0626ed [Davies Liu] Merge branch 'master' of github.com:apache/spark into fix_hashcode
89c2432 [Davies Liu] fix style
bd20780 [Davies Liu] check with catalyst types
41caec6 [Davies Liu] change for to while
d96929b [Davies Liu] address comment
6ad2a90 [Davies Liu] fix style
5819d33 [Davies Liu] unify equals() and hashCode()
0fff25d [Davies Liu] fix style
53c38b1 [Davies Liu] fix hashCode() and equals() of BinaryType in Row
2015-06-23 11:55:47 -07:00
Liang-Chi Hsieh 31bd30687b [SPARK-8359] [SQL] Fix incorrect decimal precision after multiplication
JIRA: https://issues.apache.org/jira/browse/SPARK-8359

Author: Liang-Chi Hsieh <viirya@gmail.com>

Closes #6814 from viirya/fix_decimal2 and squashes the following commits:

071a757 [Liang-Chi Hsieh] Remove maximum precision and use MathContext.UNLIMITED.
df217d4 [Liang-Chi Hsieh] Merge remote-tracking branch 'upstream/master' into fix_decimal2
a43bfc3 [Liang-Chi Hsieh] Add MathContext with maximum supported precision.
72eeb3f [Liang-Chi Hsieh] Merge remote-tracking branch 'upstream/master' into fix_decimal2
44c9348 [Liang-Chi Hsieh] Fix incorrect decimal precision after multiplication.
2015-06-22 23:11:56 -07:00
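A sketch of the failure mode and the fix direction, using plain java.math.BigDecimal (the numbers are illustrative):
```
import java.math.{BigDecimal => JBigDecimal, MathContext}

val x = new JBigDecimal("1.2345678901234567890")

// Multiplying under a bounded MathContext rounds the product away:
val rounded = x.multiply(x, new MathContext(10))
// MathContext.UNLIMITED keeps every digit of the exact product:
val exact = x.multiply(x, MathContext.UNLIMITED)
```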
Davies Liu 6b7f2ceafd [SPARK-8307] [SQL] improve timestamp from parquet
This PR changes the code to convert a Julian day to a Unix timestamp directly (without going through Calendar and Timestamp).

cc adrian-wang rxin

Author: Davies Liu <davies@databricks.com>

Closes #6759 from davies/improve_ts and squashes the following commits:

849e301 [Davies Liu] Merge branch 'master' of github.com:apache/spark into improve_ts
b0e4cad [Davies Liu] Merge branch 'master' of github.com:apache/spark into improve_ts
8e2d56f [Davies Liu] address comments
634b9f5 [Davies Liu] fix mima
4891efb [Davies Liu] address comment
bfc437c [Davies Liu] fix build
ae5979c [Davies Liu] Merge branch 'master' of github.com:apache/spark into improve_ts
602b969 [Davies Liu] remove jodd
2f2e48c [Davies Liu] fix test
8ace611 [Davies Liu] fix mima
212143b [Davies Liu] fix mina
c834108 [Davies Liu] Merge branch 'master' of github.com:apache/spark into improve_ts
a3171b8 [Davies Liu] Merge branch 'master' of github.com:apache/spark into improve_ts
5233974 [Davies Liu] fix scala style
361fd62 [Davies Liu] address comments
ea196d4 [Davies Liu] improve timestamp from parquet
2015-06-22 18:03:59 -07:00
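The direct conversion is plain arithmetic. A sketch, assuming the Parquet INT96 layout of nanoseconds-within-day plus Julian day (2440588 is the Julian day of the Unix epoch, 1970-01-01):
```
val JULIAN_DAY_OF_EPOCH = 2440588L
val MICROS_PER_DAY = 24L * 60 * 60 * 1000 * 1000

// Microseconds since the Unix epoch, computed without Calendar/Timestamp.
def julianToMicros(julianDay: Int, nanosInDay: Long): Long =
  (julianDay - JULIAN_DAY_OF_EPOCH) * MICROS_PER_DAY + nanosInDay / 1000
```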
Wenchen Fan 860a49ef20 [SPARK-7153] [SQL] support all integral type ordinal in GetArrayItem
first convert `ordinal` to `Number`, then convert it to int.

Author: Wenchen Fan <cloud0fan@outlook.com>

Closes #5706 from cloud-fan/7153 and squashes the following commits:

915db79 [Wenchen Fan] fix 7153
2015-06-22 17:37:35 -07:00
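The two-step conversion, sketched in one line (assuming `ordinal: Any` holds the evaluated ordinal; widening through `java.lang.Number` is what lets Byte, Short, Int, and Long ordinals all work):
```
// Any integral ordinal is first treated as a Number, then narrowed to Int.
val index = ordinal.asInstanceOf[Number].intValue()
```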
Davies Liu 96aa01378e [SPARK-8492] [SQL] support binaryType in UnsafeRow
Support BinaryType in UnsafeRow, just like StringType.

Also change the layout of StringType and BinaryType in UnsafeRow by combining offset and size into a single Long, which limits the size of a Row to under 2G (given the fact that no single buffer can be bigger than 2G in the JVM).

Author: Davies Liu <davies@databricks.com>

Closes #6911 from davies/unsafe_bin and squashes the following commits:

d68706f [Davies Liu] update comment
519f698 [Davies Liu] address comment
98a964b [Davies Liu] Merge branch 'master' of github.com:apache/spark into unsafe_bin
180b49d [Davies Liu] fix zero-out
22e4c0a [Davies Liu] zero-out padding bytes
6abfe93 [Davies Liu] fix style
447dea0 [Davies Liu] support binaryType in UnsafeRow
2015-06-22 15:22:17 -07:00
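A sketch of the combined word (offset in the high 32 bits, size in the low 32), which is where the sub-2G row limit comes from:
```
def pack(offset: Int, size: Int): Long =
  (offset.toLong << 32) | (size.toLong & 0xFFFFFFFFL)

def unpackOffset(word: Long): Int = (word >>> 32).toInt
def unpackSize(word: Long): Int   = word.toInt
```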
Shilei 0c32fc125c [SPARK-8234][SQL] misc function: md5
Author: Shilei <shilei.qian@intel.com>

Closes #6779 from qiansl127/MD5 and squashes the following commits:

11fcdb2 [Shilei] Fix the indent
04bd27b [Shilei] Add codegen
da60eb3 [Shilei] Remove checkInputDataTypes function
9509ad0 [Shilei] Format code
12c61f4 [Shilei] Accept only BinaryType for Md5
1df0b5b [Shilei] format to scala type
60ccde1 [Shilei] Add more test case
b8c73b4 [Shilei] Rewrite the type check for Md5
c166167 [Shilei] Add md5 function
2015-06-19 10:49:27 -07:00
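A minimal usage sketch, assuming the expression is exposed as `org.apache.spark.sql.functions.md5`; since the expression accepts only BinaryType, a string column is cast first (`df` and `payload` are illustrative):
```
import org.apache.spark.sql.functions._

// md5 returns the 32-character hex digest of the binary input.
val withDigest = df.select(md5(col("payload").cast("binary")).as("digest"))
```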
Liang-Chi Hsieh 2c59d5c12a [SPARK-8207] [SQL] Add math function bin
JIRA: https://issues.apache.org/jira/browse/SPARK-8207

Author: Liang-Chi Hsieh <viirya@gmail.com>

Closes #6721 from viirya/expr_bin and squashes the following commits:

07e1c8f [Liang-Chi Hsieh] Remove AbstractUnaryMathExpression and let BIN inherit UnaryExpression.
0677f1a [Liang-Chi Hsieh] For comments.
cf62b95 [Liang-Chi Hsieh] Merge remote-tracking branch 'upstream/master' into expr_bin
0cf20f2 [Liang-Chi Hsieh] Merge remote-tracking branch 'upstream/master' into expr_bin
dea9c12 [Liang-Chi Hsieh] Merge remote-tracking branch 'upstream/master' into expr_bin
d4f4774 [Liang-Chi Hsieh] Add @ignore_unicode_prefix.
7a0196f [Liang-Chi Hsieh] Fix python style.
ac2bacd [Liang-Chi Hsieh] Merge remote-tracking branch 'upstream/master' into expr_bin
a0a2d0f [Liang-Chi Hsieh] Merge remote-tracking branch 'upstream/master' into expr_bin
4cb764d [Liang-Chi Hsieh] For comments.
0f78682 [Liang-Chi Hsieh] Merge remote-tracking branch 'upstream/master' into expr_bin
c0c3197 [Liang-Chi Hsieh] Add bin to FunctionRegistry.
824f761 [Liang-Chi Hsieh] Merge remote-tracking branch 'upstream/master' into expr_bin
50e0c3b [Liang-Chi Hsieh] Add math function bin(a: long): string.
2015-06-19 10:09:31 -07:00
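What `bin(a: long): string` computes, in one line (this mirrors Hive's bin; the call below is plain JDK, not the Spark API):
```
// bin(13) = "1101": the binary string form of a long.
java.lang.Long.toBinaryString(13L)
```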
Liang-Chi Hsieh 31641128b3 [SPARK-8363][SQL] Move sqrt to math and extend UnaryMathExpression
JIRA: https://issues.apache.org/jira/browse/SPARK-8363

Author: Liang-Chi Hsieh <viirya@gmail.com>

Closes #6823 from viirya/move_sqrt and squashes the following commits:

8977e11 [Liang-Chi Hsieh] Remove unnecessary old tests.
d23e79e [Liang-Chi Hsieh] Explicitly indicate sqrt value sequence.
699f48b [Liang-Chi Hsieh] Use correct @since tag.
8dff6d1 [Liang-Chi Hsieh] Merge remote-tracking branch 'upstream/master' into move_sqrt
bc2ed77 [Liang-Chi Hsieh] Remove/move arithmetic expression test and expression type checking test. Remove unnecessary Sqrt type rule.
d38492f [Liang-Chi Hsieh] Now sqrt accepts boolean because type casting is handled by HiveTypeCoercion.
297cc90 [Liang-Chi Hsieh] Sqrt only accepts double input.
ef4a21a [Liang-Chi Hsieh] Move sqrt to math.
2015-06-18 13:00:31 -07:00
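A hedged sketch of the resulting shape: `UnaryMathExpression` pairs a `Double => Double` function with a SQL name, so moving sqrt into it reduces the definition to roughly one line (the exact base-class signature may differ):
```
case class Sqrt(child: Expression) extends UnaryMathExpression(math.sqrt, "SQRT")
```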
Liang-Chi Hsieh fee3438a32 [SPARK-8218][SQL] Add binary log math function
JIRA: https://issues.apache.org/jira/browse/SPARK-8218

Because a unary `log` function is already defined, the binary log function is called `logarithm` for now.

Author: Liang-Chi Hsieh <viirya@gmail.com>

Closes #6725 from viirya/expr_binary_log and squashes the following commits:

bf96bd9 [Liang-Chi Hsieh] Compare log result in string.
102070d [Liang-Chi Hsieh] Round log result to better comparing in python test.
fd01863 [Liang-Chi Hsieh] For comments.
beed631 [Liang-Chi Hsieh] Merge remote-tracking branch 'upstream/master' into expr_binary_log
6089d11 [Liang-Chi Hsieh] Remove unnecessary override.
8cf37b7 [Liang-Chi Hsieh] For comments.
bc89597 [Liang-Chi Hsieh] For comments.
db7dc38 [Liang-Chi Hsieh] Use ctor instead of companion object.
0634ef7 [Liang-Chi Hsieh] Merge remote-tracking branch 'upstream/master' into expr_binary_log
1750034 [Liang-Chi Hsieh] Merge remote-tracking branch 'upstream/master' into expr_binary_log
3d75bfc [Liang-Chi Hsieh] Fix scala style.
5b39c02 [Liang-Chi Hsieh] Merge remote-tracking branch 'upstream/master' into expr_binary_log
23c54a3 [Liang-Chi Hsieh] Fix scala style.
ebc9929 [Liang-Chi Hsieh] Let Logarithm accept one parameter too.
605574d [Liang-Chi Hsieh] Merge remote-tracking branch 'upstream/master' into expr_binary_log
21c3bfd [Liang-Chi Hsieh] Fix scala style.
c6c187f [Liang-Chi Hsieh] For comments.
c795342 [Liang-Chi Hsieh] Merge remote-tracking branch 'upstream/master' into expr_binary_log
f373bac [Liang-Chi Hsieh] Add binary log expression.
2015-06-17 23:31:30 -07:00
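The semantics of the two-argument form, written out directly (the one-parameter form falls back to the natural log):
```
// logarithm(base, x) = ln(x) / ln(base); e.g. logarithm(2.0, 8.0) == 3.0
def logarithm(base: Double, x: Double): Double = math.log(x) / math.log(base)
```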
Lianhui Wang 9db73ec124 [SPARK-8381][SQL]reuse typeConvert when convert Seq[Row] to catalyst type
Reuse the type converter when converting Seq[Row] to a Catalyst type.

Author: Lianhui Wang <lianhuiwang09@gmail.com>

Closes #6831 from lianhuiwang/reuse-typeConvert and squashes the following commits:

1fec395 [Lianhui Wang] remove CatalystTypeConverters.convertToCatalyst
714462d [Lianhui Wang] add package[sql]
9d1fbf3 [Lianhui Wang] address JoshRosen's comments
768956f [Lianhui Wang] update scala style
4498c62 [Lianhui Wang] reuse typeConvert
2015-06-17 22:52:47 -07:00
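A sketch of the reuse pattern, assuming `schema` and `rows: Seq[Row]` are in scope: build the converter once per data type instead of re-dispatching for every row.
```
import org.apache.spark.sql.catalyst.CatalystTypeConverters

// One converter per schema; applied per row without repeated type dispatch.
val toCatalyst = CatalystTypeConverters.createToCatalystConverter(schema)
val converted = rows.map(toCatalyst)
```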
Liang-Chi Hsieh 104f30c36f [SPARK-7199] [SQL] Add date and timestamp support to UnsafeRow
JIRA: https://issues.apache.org/jira/browse/SPARK-7199

Author: Liang-Chi Hsieh <viirya@gmail.com>

Closes #5984 from viirya/add_date_timestamp and squashes the following commits:

7f21ce9 [Liang-Chi Hsieh] For comment.
0b89698 [Liang-Chi Hsieh] Add timestamp to settableFieldTypes.
c30d490 [Liang-Chi Hsieh] Use default IntUnsafeColumnWriter and LongUnsafeColumnWriter.
672ef17 [Liang-Chi Hsieh] Remove getter/setter for Date and Timestamp and use Int and Long for them.
9f3e577 [Liang-Chi Hsieh] Merge remote-tracking branch 'upstream/master' into add_date_timestamp
281e844 [Liang-Chi Hsieh] Fix scala style.
fb532b5 [Liang-Chi Hsieh] Merge remote-tracking branch 'upstream/master' into add_date_timestamp
80af342 [Liang-Chi Hsieh] Fix compiling error.
f4f5de6 [Liang-Chi Hsieh] Fix scala style.
a463e83 [Liang-Chi Hsieh] Use Long to store timestamp for rows.
635388a [Liang-Chi Hsieh] Merge remote-tracking branch 'upstream/master' into add_date_timestamp
46946c6 [Liang-Chi Hsieh] Adapt for moved DateUtils.
b16994e [Liang-Chi Hsieh] Merge remote-tracking branch 'upstream/master' into add_date_timestamp
752251f [Liang-Chi Hsieh] Support setDate. Fix failed test.
fcf8db9 [Liang-Chi Hsieh] Add functions for Date and Timestamp to SpecificRow.
e42a809 [Liang-Chi Hsieh] Fix style.
4c07b57 [Liang-Chi Hsieh] Add date and timestamp support to UnsafeRow.
2015-06-17 09:00:37 -07:00
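The representation this lands on, sketched: no dedicated Date/Timestamp accessors; DateType rides on the Int writer (days since the Unix epoch) and TimestampType on the Long writer (`row`, the ordinals, and the value names below are illustrative):
```
row.setInt(dateOrdinal, daysSinceEpoch)  // DateType stored as Int
row.setLong(tsOrdinal, timestampAsLong)  // TimestampType stored as Long
```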
Davies Liu bc76a0f750 [SPARK-7184] [SQL] enable codegen by default
In order to have better performance out of the box, this PR turns on codegen by default; codegen can then be tested by sql/test and hive/test.

This PR also fixes some corner cases for codegen.

Before the 1.5 release, we should revisit this and turn it off if it's not stable or is causing regressions.

cc rxin JoshRosen

Author: Davies Liu <davies@databricks.com>

Closes #6726 from davies/enable_codegen and squashes the following commits:

f3b25a5 [Davies Liu] fix warning
73750ea [Davies Liu] fix long overflow when compare
3017a47 [Davies Liu] Merge branch 'master' of github.com:apache/spark into enable_codegen
a7d75da [Davies Liu] Merge branch 'master' of github.com:apache/spark into enable_codegen
ff5b75a [Davies Liu] Merge branch 'master' of github.com:apache/spark into enable_codegen
f4cf2c2 [Davies Liu] fix style
99fc139 [Davies Liu] Merge branch 'enable_codegen' of github.com:davies/spark into enable_codegen
91fc7a2 [Davies Liu] disable codegen for ScalaUDF
207e339 [Davies Liu] Update CodeGenerator.scala
44573a3 [Davies Liu] check thread safety of expression
f3886fa [Davies Liu] don't inline primitiveTerm for null literal
c8e7cd2 [Davies Liu] address comment
a8618c9 [Davies Liu] enable codegen by default
2015-06-15 23:03:14 -07:00
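The opt-out this leaves in place, sketched (the flag name `spark.sql.codegen` is the one used in this era; double-check it against the running version):
```
// Codegen is now on by default; it can still be disabled per session.
sqlContext.setConf("spark.sql.codegen", "false")
```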
tedyu 1a62d61696 SPARK-8336 Fix NullPointerException with functions.rand()
This PR fixes the problem reported by Justin Yip in the thread 'NullPointerException with functions.rand()'

Tested using spark-shell and verified that the following works:
sqlContext.createDataFrame(Seq((1,2), (3, 100))).withColumn("index", rand(30)).show()

Author: tedyu <yuzhihong@gmail.com>

Closes #6793 from tedyu/master and squashes the following commits:

62fd97b [tedyu] Create RandomSuite
750f92c [tedyu] Add test for Rand() with seed
a1d66c5 [tedyu] Fix NullPointerException with functions.rand()
2015-06-15 17:00:38 -07:00
Reynold Xin 53c16b92a5 [SPARK-8362] [SQL] Add unit tests for +, -, *, /, %
Added unit tests for all supported data types for:
- Add
- Subtract
- Multiply
- Divide
- UnaryMinus
- Remainder

Fixed bugs caught by the unit tests.

Author: Reynold Xin <rxin@databricks.com>

Closes #6813 from rxin/SPARK-8362 and squashes the following commits:

fb3fe62 [Reynold Xin] Added Remainder.
3b266ba [Reynold Xin] [SPARK-8362] Add unit tests for +, -, *, /.
2015-06-14 11:23:23 -07:00
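The test pattern in these suites, sketched with Catalyst's `ExpressionEvalHelper` (literal values are illustrative):
```
// Evaluate each expression and compare against the expected result.
checkEvaluation(Add(Literal(1), Literal(2)), 3)
checkEvaluation(Remainder(Literal(7), Literal(3)), 1)
checkEvaluation(UnaryMinus(Literal(5)), -5)
```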
Liang-Chi Hsieh cb7ada1196 [SPARK-8342][SQL] Fix Decimal setOrNull
JIRA: https://issues.apache.org/jira/browse/SPARK-8342

Author: Liang-Chi Hsieh <viirya@gmail.com>

Closes #6797 from viirya/fix_decimal and squashes the following commits:

8a447b1 [Liang-Chi Hsieh] Add unit test.
d67a5ea [Liang-Chi Hsieh] Merge remote-tracking branch 'upstream/master' into fix_decimal
ab6d8af [Liang-Chi Hsieh] Fix setOrNull.
2015-06-13 22:42:28 -07:00
Reynold Xin a138953391 [SPARK-8347][SQL] Add unit tests for abs.
Also addressed code review feedback from #6754

Author: Reynold Xin <rxin@databricks.com>

Closes #6803 from rxin/abs and squashes the following commits:

d07beba [Reynold Xin] [SPARK-8347] Add unit tests for abs.
2015-06-13 17:10:13 -07:00
Davies Liu d46f8e5d4b [SPARK-7186] [SQL] Decouple internal Row from external Row
Currently, we use o.a.s.sql.Row both internally and externally. The external interface is wider than what the internals need, because it is designed to facilitate end-user programming. This design has proven to be very error prone and cumbersome for internal Row implementations.

As a first step, we create an InternalRow interface in the catalyst module, which is identical to the current Row interface, and we switch all internal operators/expressions to use this InternalRow instead. When we need to expose Row, we convert the InternalRow implementation into Row for users.

For all public APIs, we use Row (for example, the data source APIs), which is converted to/from InternalRow by CatalystTypeConverters.

For all internal data sources (JSON, Parquet, JDBC, Hive), we use InternalRow for better performance, cast into Row in buildScan() (without changing the public API). When creating a PhysicalRDD, we cast them back to InternalRow.

cc rxin marmbrus JoshRosen

Author: Davies Liu <davies@databricks.com>

Closes #6792 from davies/internal_row and squashes the following commits:

f2abd13 [Davies Liu] fix scalastyle
a7e025c [Davies Liu] move InternalRow into catalyst
30db8ba [Davies Liu] Merge branch 'master' of github.com:apache/spark into internal_row
7cbced8 [Davies Liu] separate Row and InternalRow
2015-06-12 23:06:31 -07:00
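A sketch of the boundary conversion described above, assuming `schema`, an `externalRow: Row`, and an `internalRow: InternalRow` are in scope:
```
import org.apache.spark.sql.catalyst.CatalystTypeConverters

// Internal operators traffic in InternalRow; conversion happens at the edge.
val toCatalyst = CatalystTypeConverters.createToCatalystConverter(schema)
val toScala = CatalystTypeConverters.createToScalaConverter(schema)

val internal = toCatalyst(externalRow) // Row -> InternalRow at the boundary
val external = toScala(internalRow)    // InternalRow -> Row for users
```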
Reynold Xin 337c16d57e [SQL] Miscellaneous SQL/DF expression changes.
SPARK-8201 conditional function: if
SPARK-8205 conditional function: nvl
SPARK-8208 math function: ceiling
SPARK-8210 math function: degrees
SPARK-8211 math function: radians
SPARK-8219 math function: negative
SPARK-8216 math function: rename log -> ln
SPARK-8222 math function: alias power / pow
SPARK-8225 math function: alias sign / signum
SPARK-8228 conditional function: isnull
SPARK-8229 conditional function: isnotnull
SPARK-8250 string function: alias lower/lcase
SPARK-8251 string function: alias upper / ucase

Author: Reynold Xin <rxin@databricks.com>

Closes #6754 from rxin/expressions-misc and squashes the following commits:

35fce15 [Reynold Xin] Removed println.
2647067 [Reynold Xin] Promote to string type.
3c32bbc [Reynold Xin] Fixed if.
de827ac [Reynold Xin] Fixed style
b201cd4 [Reynold Xin] Removed if.
6b21a9b [Reynold Xin] [SQL] Miscellaneous SQL/DF expression changes.
2015-06-11 17:06:21 -07:00
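A few of these, as they would appear in SQL once registered (queries, table, and column names are illustrative):
```
sqlContext.sql("SELECT if(x > 0, 'pos', 'non-pos'), nvl(name, 'unknown') FROM t")
sqlContext.sql("SELECT ceiling(price), degrees(angle), power(2, 10) FROM t")
sqlContext.sql("SELECT isnull(name), ucase(name) FROM t")
```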