Commit graph

501 commits

Author SHA1 Message Date
navis.ryu 778f3ca81f [SPARK-7792] [SQL] HiveContext registerTempTable not thread safe
Just replaced mutable.HashMap to ConcurrentHashMap

Author: navis.ryu <navis@apache.org>

Closes #6699 from navis/SPARK-7792 and squashes the following commits:

f03654a [navis.ryu] [SPARK-7792] [SQL] HiveContext registerTempTable not thread safe
2015-06-09 19:33:00 -07:00
Reynold Xin 1b499993ad [SPARK-7886] Add built-in expressions to FunctionRegistry.
This patch switches to using FunctionRegistry for built-in expressions. It is based on #6463, but with some work to simplify it along with unit tests.

TODOs for future pull requests:
- Use static registration so we don't need to register all functions every time we start a new SQLContext
- Switch to using this in HiveContext

Author: Reynold Xin <rxin@databricks.com>
Author: Santiago M. Mola <santi@mola.io>

Closes #6710 from rxin/udf-registry and squashes the following commits:

6930822 [Reynold Xin] Fixed Python test.
b802c9a [Reynold Xin] Made UDF case insensitive.
e60d815 [Reynold Xin] Made UDF case insensitive.
852f9c0 [Reynold Xin] Fixed style violation.
e76a3c1 [Reynold Xin] Fixed parser.
52ddaba [Reynold Xin] Fixed compilation.
ee7854f [Reynold Xin] Improved error reporting.
ff906f2 [Reynold Xin] More robust constructor calling.
77b46f1 [Reynold Xin] Simplified the code.
2a2a149 [Reynold Xin] Merge pull request #6463 from smola/SPARK-7886
8616924 [Santiago M. Mola] [SPARK-7886] Add built-in expressions to FunctionRegistry.
2015-06-09 16:24:38 +08:00
Wenchen Fan fe7669d307 [SQL][minor] remove duplicated cases in DecimalPrecision
We already have a rule to do type coercion for fixed decimal and unlimited decimal in `WidenTypes`, so we don't need to handle them in `DecimalPrecision`.

Author: Wenchen Fan <cloud0fan@outlook.com>

Closes #6698 from cloud-fan/fix and squashes the following commits:

413ad4a [Wenchen Fan] remove duplicated cases
2015-06-08 11:52:02 -07:00
Reynold Xin 72ba0fc4fd [SPARK-8154][SQL] Remove Term/Code type aliases in code generation.
From my perspective as a code reviewer, I find them more confusing than using String directly.

Author: Reynold Xin <rxin@databricks.com>

Closes #6694 from rxin/SPARK-8154 and squashes the following commits:

4e5056c [Reynold Xin] [SPARK-8154][SQL] Remove Term/Code type aliases in code generation.
2015-06-07 23:16:19 -07:00
Reynold Xin f74be744d4 [SPARK-8149][SQL] Break ExpressionEvaluationSuite down to multiple files
Also moved a few files in expressions package around to match test suites.

Author: Reynold Xin <rxin@databricks.com>

Closes #6693 from rxin/expr-refactoring and squashes the following commits:

857599f [Reynold Xin] Fixed style violation.
c0eb74b [Reynold Xin] Fixed compilation.
b3a40f8 [Reynold Xin] Refactored expression test suites.
2015-06-07 18:45:24 -07:00
Davies Liu 5e7b6b67be [SPARK-8117] [SQL] Push codegen implementation into each Expression
This PR move codegen implementation of expressions into Expression class itself, make it easy to manage.

It introduces two APIs in Expression:
```
def gen(ctx: CodeGenContext): GeneratedExpressionCode
def genCode(ctx: CodeGenContext, ev: GeneratedExpressionCode): Code
```

gen(ctx) will call genSource(ctx, ev) to generate Java source code for the current expression. A expression needs to override genSource().

Here are the types:
```
type Term String
type Code String

/**
 * Java source for evaluating an [[Expression]] given a [[Row]] of input.
 */
case class GeneratedExpressionCode(var code: Code,
                               nullTerm: Term,
                               primitiveTerm: Term,
                               objectTerm: Term)
/**
 * A context for codegen, which is used to bookkeeping the expressions those are not supported
 * by codegen, then they are evaluated directly. The unsupported expression is appended at the
 * end of `references`, the position of it is kept in the code, used to access and evaluate it.
 */
class CodeGenContext {
  /**
   * Holding all the expressions those do not support codegen, will be evaluated directly.
   */
  val references: Seq[Expression] = new mutable.ArrayBuffer[Expression]()
}
```

This is basically #6660, but fixed style violation and compilation failure.

Author: Davies Liu <davies@databricks.com>
Author: Reynold Xin <rxin@databricks.com>

Closes #6690 from rxin/codegen and squashes the following commits:

e1368c2 [Reynold Xin] Fixed tests.
73db80e [Reynold Xin] Fixed compilation failure.
19d6435 [Reynold Xin] Fixed style violation.
9adaeaf [Davies Liu] address comments
f42c732 [Davies Liu] improve coverage and tests
bad6828 [Davies Liu] address comments
e03edaa [Davies Liu] consts fold
86fac2c [Davies Liu] fix style
02262c9 [Davies Liu] address comments
b5d3617 [Davies Liu] Merge pull request #5 from rxin/codegen
48c454f [Reynold Xin] Some code gen update.
2344bc0 [Davies Liu] fix test
12ff88a [Davies Liu] fix build
c5fb514 [Davies Liu] rename
8c6d82d [Davies Liu] update docs
b145047 [Davies Liu] fix style
e57959d [Davies Liu] add type alias
3ff25f8 [Davies Liu] refactor
593d617 [Davies Liu] pushing codegen into Expression
2015-06-07 14:11:20 -07:00
Wenchen Fan db81b9d89f [SPARK-7952][SQL] use internal Decimal instead of java.math.BigDecimal
This PR fixes a bug introduced in https://github.com/apache/spark/pull/6505.
Decimal literal's value is not `java.math.BigDecimal`, but Spark SQL internal type: `Decimal`.

Author: Wenchen Fan <cloud0fan@outlook.com>

Closes #6574 from cloud-fan/fix and squashes the following commits:

b0e3549 [Wenchen Fan] rename to BooleanEquality
1987b37 [Wenchen Fan] use Decimal instead of java.math.BigDecimal
f93c420 [Wenchen Fan] compare literal
2015-06-07 11:07:19 -07:00
Cheng Lian bc0d76a246 [SQL] Simplifies binary node pattern matching
This PR is a simpler version of #2764, and adds `unapply` methods to the following binary nodes for simpler pattern matching:

- `BinaryExpression`
- `BinaryComparison`
- `BinaryArithmetics`

This enables nested pattern matching for binary nodes. For example, the following pattern matching

```scala
case p: BinaryComparison if p.left.dataType == StringType &&
                            p.right.dataType == DateType =>
  p.makeCopy(Array(p.left, Cast(p.right, StringType)))
```

can be simplified to

```scala
case p  BinaryComparison(l  StringType(), r  DateType()) =>
  p.makeCopy(Array(l, Cast(r, StringType)))
```

Author: Cheng Lian <lian@databricks.com>

Closes #6537 from liancheng/binary-node-patmat and squashes the following commits:

a3bf5fe [Cheng Lian] Fixes compilation error introduced while rebasing
b738986 [Cheng Lian] Renames `l`/`r` to `left`/`right` or `lhs`/`rhs`
14900ae [Cheng Lian] Simplifies binary node pattern matching
2015-06-05 23:06:19 +08:00
Reynold Xin 8f16b94afb [SPARK-8114][SQL] Remove some wildcard import on TestSQLContext._
I kept some of the sql import there to avoid changing too many lines.

Author: Reynold Xin <rxin@databricks.com>

Closes #6661 from rxin/remove-wildcard-import-sqlcontext and squashes the following commits:

c265347 [Reynold Xin] Fixed ListTablesSuite failure.
de9d491 [Reynold Xin] Fixed tests.
73b5365 [Reynold Xin] Mima.
8f6b642 [Reynold Xin] Fixed style violation.
443f6e8 [Reynold Xin] [SPARK-8113][SQL] Remove some wildcard import on TestSQLContext._
2015-06-04 22:15:58 -07:00
Reynold Xin 2bcdf8c239 [SPARK-7440][SQL] Remove physical Distinct operator in favor of Aggregate
This patch replaces Distinct with Aggregate in the optimizer, so Distinct will become
more efficient over time as we optimize Aggregate (via Tungsten).

Author: Reynold Xin <rxin@databricks.com>

Closes #6637 from rxin/replace-distinct and squashes the following commits:

b3cc50e [Reynold Xin] Mima excludes.
93d6117 [Reynold Xin] Code review feedback.
87e4741 [Reynold Xin] [SPARK-7440][SQL] Remove physical Distinct operator in favor of Aggregate.
2015-06-04 13:52:53 -07:00
Davies Liu c8709dcfd1 [SPARK-7956] [SQL] Use Janino to compile SQL expressions into bytecode
In order to reduce the overhead of codegen, this PR switch to use Janino to compile SQL expressions into bytecode.

After this, the time used to compile a SQL expression is decreased from 100ms to 5ms, which is necessary to turn on codegen for general workload, also tests.

cc rxin

Author: Davies Liu <davies@databricks.com>

Closes #6479 from davies/janino and squashes the following commits:

cc689f5 [Davies Liu] remove globalLock
262d848 [Davies Liu] Merge branch 'master' of github.com:apache/spark into janino
eec3a33 [Davies Liu] address comments from Josh
f37c8c3 [Davies Liu] fix DecimalType and cast to String
202298b [Davies Liu] Merge branch 'master' of github.com:apache/spark into janino
a21e968 [Davies Liu] fix style
0ed3dc6 [Davies Liu] Merge branch 'master' of github.com:apache/spark into janino
551a851 [Davies Liu] fix tests
c3bdffa [Davies Liu] remove print
6089ce5 [Davies Liu] change logging level
7e46ac3 [Davies Liu] fix style
d8f0f6c [Davies Liu] Merge branch 'master' of github.com:apache/spark into janino
da4926a [Davies Liu] fix tests
03660f3 [Davies Liu] WIP: use Janino to compile Java source
f2629cd [Davies Liu] Merge branch 'master' of github.com:apache/spark into janino
f7d66cf [Davies Liu] use template based string for codegen
2015-06-04 10:28:59 -07:00
Wenchen Fan d38cf217e0 [SPARK-7562][SPARK-6444][SQL] Improve error reporting for expression data type mismatch
It seems hard to find a common pattern of checking types in `Expression`. Sometimes we know what input types we need(like `And`, we know we need two booleans), sometimes we just have some rules(like `Add`, we need 2 numeric types which are equal). So I defined a general interface `checkInputDataTypes` in `Expression` which returns a `TypeCheckResult`. `TypeCheckResult` can tell whether this expression passes the type checking or what the type mismatch is.

This PR mainly works on apply input types checking for arithmetic and predicate expressions.

TODO: apply type checking interface to more expressions.

Author: Wenchen Fan <cloud0fan@outlook.com>

Closes #6405 from cloud-fan/6444 and squashes the following commits:

b5ff31b [Wenchen Fan] address comments
b917275 [Wenchen Fan] rebase
39929d9 [Wenchen Fan] add todo
0808fd2 [Wenchen Fan] make constrcutor of TypeCheckResult private
3bee157 [Wenchen Fan] and decimal type coercion rule for binary comparison
8883025 [Wenchen Fan] apply type check interface to CaseWhen
cffb67c [Wenchen Fan] to have resolved call the data type check function
6eaadff [Wenchen Fan] add equal type constraint to EqualTo
3affbd8 [Wenchen Fan] more fixes
654d46a [Wenchen Fan] improve tests
e0a3628 [Wenchen Fan] improve error message
1524ff6 [Wenchen Fan] fix style
69ca3fe [Wenchen Fan] add error message and tests
c71d02c [Wenchen Fan] fix hive tests
6491721 [Wenchen Fan] use value class TypeCheckResult
7ae76b9 [Wenchen Fan] address comments
cb77e4f [Wenchen Fan] Improve error reporting for expression data type mismatch
2015-06-03 00:47:52 -07:00
Josh Rosen cafd5056e1 [SPARK-7691] [SQL] Refactor CatalystTypeConverter to use type-specific row accessors
This patch significantly refactors CatalystTypeConverters to both clean up the code and enable these conversions to work with future Project Tungsten features.

At a high level, I've reorganized the code so that all functions dealing with the same type are grouped together into type-specific subclasses of `CatalystTypeConveter`.  In addition, I've added new methods that allow the Catalyst Row -> Scala Row conversions to access the Catalyst row's fields through type-specific `getTYPE()` methods rather than the generic `get()` / `Row.apply` methods.  This refactoring is a blocker to being able to unit test new operators that I'm developing as part of Project Tungsten, since those operators may output `UnsafeRow` instances which don't support the generic `get()`.

The stricter type usage of types here has uncovered some bugs in other parts of Spark SQL:

- #6217: DescribeCommand is assigned wrong output attributes in SparkStrategies
- #6218: DataFrame.describe() should cast all aggregates to String
- #6400: Use output schema, not relation schema, for data source input conversion

Spark SQL current has undefined behavior for what happens when you try to create a DataFrame from user-specified rows whose values don't match the declared schema.  According to the `createDataFrame()` Scaladoc:

>  It is important to make sure that the structure of every [[Row]] of the provided RDD matches the provided schema. Otherwise, there will be runtime exception.

Given this, it sounds like it's technically not a break of our API contract to fail-fast when the data types don't match. However, there appear to be many cases where we don't fail even though the types don't match. For example, `JavaHashingTFSuite.hasingTF` passes a column of integers values for a "label" column which is supposed to contain floats.  This column isn't actually read or modified as part of query processing, so its actual concrete type doesn't seem to matter. In other cases, there could be situations where we have generic numeric aggregates that tolerate being called with different numeric types than the schema specified, but this can be okay due to numeric conversions.

In the long run, we will probably want to come up with precise semantics for implicit type conversions / widening when converting Java / Scala rows to Catalyst rows.  Until then, though, I think that failing fast with a ClassCastException is a reasonable behavior; this is the approach taken in this patch.  Note that certain optimizations in the inbound conversion functions for primitive types mean that we'll probably preserve the old undefined behavior in a majority of cases.

Author: Josh Rosen <joshrosen@databricks.com>

Closes #6222 from JoshRosen/catalyst-converters-refactoring and squashes the following commits:

740341b [Josh Rosen] Optimize method dispatch for primitive type conversions
befc613 [Josh Rosen] Add tests to document Option-handling behavior.
5989593 [Josh Rosen] Use new SparkFunSuite base in CatalystTypeConvertersSuite
6edf7f8 [Josh Rosen] Re-add convertToScala(), since a Hive test still needs it
3f7b2d8 [Josh Rosen] Initialize converters lazily so that the attributes are resolved first
6ad0ebb [Josh Rosen] Fix JavaHashingTFSuite ClassCastException
677ff27 [Josh Rosen] Fix null handling bug; add tests.
8033d4c [Josh Rosen] Fix serialization error in UserDefinedGenerator.
85bba9d [Josh Rosen] Fix wrong input data in InMemoryColumnarQuerySuite
9c0e4e1 [Josh Rosen] Remove last use of convertToScala().
ae3278d [Josh Rosen] Throw ClassCastException errors during inbound conversions.
7ca7fcb [Josh Rosen] Comments and cleanup
1e87a45 [Josh Rosen] WIP refactoring of CatalystTypeConverters
2015-06-02 22:11:03 -07:00
Yin Huai 0f80990bfa [SPARK-8023][SQL] Add "deterministic" attribute to Expression to avoid collapsing nondeterministic projects.
This closes #6570.

Author: Yin Huai <yhuai@databricks.com>
Author: Reynold Xin <rxin@databricks.com>

Closes #6573 from rxin/deterministic and squashes the following commits:

356cd22 [Reynold Xin] Added unit test for the optimizer.
da3fde1 [Reynold Xin] Merge pull request #6570 from yhuai/SPARK-8023
da56200 [Yin Huai] Comments.
e38f264 [Yin Huai] Comment.
f9d6a73 [Yin Huai] Add a deterministic method to Expression.
2015-06-02 00:20:52 -07:00
Reynold Xin b53a011647 Fixed typo in the previous commit. 2015-06-01 21:41:53 -07:00
Yin Huai e797dba58e [SPARK-7965] [SPARK-7972] [SQL] Handle expressions containing multiple window expressions and make parser match window frames in case insensitive way
JIRAs:
https://issues.apache.org/jira/browse/SPARK-7965
https://issues.apache.org/jira/browse/SPARK-7972

Author: Yin Huai <yhuai@databricks.com>

Closes #6524 from yhuai/7965-7972 and squashes the following commits:

c12c79c [Yin Huai] Add doc for returned value.
de64328 [Yin Huai] Address rxin's comments.
fc9b1ad [Yin Huai] wip
2996da4 [Yin Huai] scala style
20b65b7 [Yin Huai] Handle expressions containing multiple window expressions.
9568b21 [Yin Huai] case insensitive matches
41f633d [Yin Huai] Failed test case.
2015-06-01 21:40:17 -07:00
Wenchen Fan a0e46a0d2a [SPARK-7952][SPARK-7984][SQL] equality check between boolean type and numeric type is broken.
The origin code has several problems:
* `true <=> 1` will return false as we didn't set a rule to handle it.
* `true = a` where `a` is not `Literal` and its value is 1, will return false as we only handle literal values.

Author: Wenchen Fan <cloud0fan@outlook.com>

Closes #6505 from cloud-fan/tmp1 and squashes the following commits:

77f0f39 [Wenchen Fan] minor fix
b6401ba [Wenchen Fan] add type coercion for CaseKeyWhen and address comments
ebc8c61 [Wenchen Fan] use SQLTestUtils and If
625973c [Wenchen Fan] improve
9ba2130 [Wenchen Fan] address comments
fc0d741 [Wenchen Fan] fix style
2846a04 [Wenchen Fan] fix 7952
2015-05-31 21:01:46 -07:00
Reynold Xin 63a50be13d [SPARK-3850] Trim trailing spaces for SQL.
Author: Reynold Xin <rxin@databricks.com>

Closes #6535 from rxin/whitespace-sql and squashes the following commits:

de50316 [Reynold Xin] [SPARK-3850] Trim trailing spaces for SQL.
2015-05-31 00:48:49 -07:00
Reynold Xin c63e1a742b [SPARK-7971] Add JavaDoc style deprecation for deprecated DataFrame methods
Scala deprecated annotation actually doesn't show up in JavaDoc.

Author: Reynold Xin <rxin@databricks.com>

Closes #6523 from rxin/df-deprecated-javadoc and squashes the following commits:

26da2b2 [Reynold Xin] [SPARK-7971] Add JavaDoc style deprecation for deprecated DataFrame methods.
2015-05-30 19:51:53 -07:00
Reynold Xin 14b314dc2c [SQL] Tighten up visibility for JavaDoc.
I went through all the JavaDocs and tightened up visibility.

Author: Reynold Xin <rxin@databricks.com>

Closes #6526 from rxin/sql-1.4-visibility-for-docs and squashes the following commits:

bc37d1e [Reynold Xin] Tighten up visibility for JavaDoc.
2015-05-30 19:50:52 -07:00
Wenchen Fan 0978aec9cd [SPARK-7964][SQL] remove unnecessary type coercion rule
We have defined these logics in `Cast` already, I think we should remove this rule.

Author: Wenchen Fan <cloud0fan@outlook.com>

Closes #6516 from cloud-fan/tmp2 and squashes the following commits:

d5035a4 [Wenchen Fan] remove useless rule
2015-05-30 00:26:46 -07:00
Andrew Or 9eb222c139 [SPARK-7558] Demarcate tests in unit-tests.log
Right now `unit-tests.log` are not of much value because we can't tell where the test boundaries are easily. This patch adds log statements before and after each test to outline the test boundaries, e.g.:

```
===== TEST OUTPUT FOR o.a.s.serializer.KryoSerializerSuite: 'kryo with parallelize for primitive arrays' =====

15/05/27 12:36:39.596 pool-1-thread-1-ScalaTest-running-KryoSerializerSuite INFO SparkContext: Starting job: count at KryoSerializerSuite.scala:230
15/05/27 12:36:39.596 dag-scheduler-event-loop INFO DAGScheduler: Got job 3 (count at KryoSerializerSuite.scala:230) with 4 output partitions (allowLocal=false)
15/05/27 12:36:39.596 dag-scheduler-event-loop INFO DAGScheduler: Final stage: ResultStage 3(count at KryoSerializerSuite.scala:230)
15/05/27 12:36:39.596 dag-scheduler-event-loop INFO DAGScheduler: Parents of final stage: List()
15/05/27 12:36:39.597 dag-scheduler-event-loop INFO DAGScheduler: Missing parents: List()
15/05/27 12:36:39.597 dag-scheduler-event-loop INFO DAGScheduler: Submitting ResultStage 3 (ParallelCollectionRDD[5] at parallelize at KryoSerializerSuite.scala:230), which has no missing parents

...

15/05/27 12:36:39.624 pool-1-thread-1-ScalaTest-running-KryoSerializerSuite INFO DAGScheduler: Job 3 finished: count at KryoSerializerSuite.scala:230, took 0.028563 s
15/05/27 12:36:39.625 pool-1-thread-1-ScalaTest-running-KryoSerializerSuite INFO KryoSerializerSuite:

***** FINISHED o.a.s.serializer.KryoSerializerSuite: 'kryo with parallelize for primitive arrays' *****

...
```

Author: Andrew Or <andrew@databricks.com>

Closes #6441 from andrewor14/demarcate-tests and squashes the following commits:

879b060 [Andrew Or] Fix compile after rebase
d622af7 [Andrew Or] Merge branch 'master' of github.com:apache/spark into demarcate-tests
017c8ba [Andrew Or] Merge branch 'master' of github.com:apache/spark into demarcate-tests
7790b6c [Andrew Or] Fix tests after logical merge conflict
c7460c0 [Andrew Or] Merge branch 'master' of github.com:apache/spark into demarcate-tests
c43ffc4 [Andrew Or] Fix tests?
8882581 [Andrew Or] Fix tests
ee22cda [Andrew Or] Fix log message
fa9450e [Andrew Or] Merge branch 'master' of github.com:apache/spark into demarcate-tests
12d1e1b [Andrew Or] Various whitespace changes (minor)
69cbb24 [Andrew Or] Make all test suites extend SparkFunSuite instead of FunSuite
bbce12e [Andrew Or] Fix manual things that cannot be covered through automation
da0b12f [Andrew Or] Add core tests as dependencies in all modules
f7d29ce [Andrew Or] Introduce base abstract class for all test suites
2015-05-29 14:03:12 -07:00
Reynold Xin 94f62a4979 [SPARK-7940] Enforce whitespace checking for DO, TRY, CATCH, FINALLY, MATCH, LARROW, RARROW in style checker.
…

Author: Reynold Xin <rxin@databricks.com>

Closes #6491 from rxin/more-whitespace and squashes the following commits:

f6e63dc [Reynold Xin] [SPARK-7940] Enforce whitespace checking for DO, TRY, CATCH, FINALLY, MATCH, LARROW, RARROW in style checker.
2015-05-29 13:38:37 -07:00
Reynold Xin 8da560d7de [SPARK-7927] whitespace fixes for Catalyst module.
So we can enable a whitespace enforcement rule in the style checker to save code review time.

Author: Reynold Xin <rxin@databricks.com>

Closes #6476 from rxin/whitespace-catalyst and squashes the following commits:

650409d [Reynold Xin] Fixed tests.
51a9e5d [Reynold Xin] [SPARK-7927] whitespace fixes for Catalyst module.
2015-05-28 20:11:57 -07:00
Reynold Xin 6fec1a9409 Removed Guava dependency from JavaTypeInference's type signature.
This should also close #6243.

Author: Reynold Xin <rxin@databricks.com>

Closes #6431 from rxin/JavaTypeInference-guava and squashes the following commits:

e58df3c [Reynold Xin] Removed Gauva dependency from JavaTypeInference's type signature.
2015-05-27 11:54:35 -07:00
Reynold Xin 3e7d7d6b3d [SQL] Rename MathematicalExpression UnaryMathExpression, and specify BinaryMathExpression's output data type as DoubleType.
Two minor changes.

cc brkyvz

Author: Reynold Xin <rxin@databricks.com>

Closes #6428 from rxin/math-func-cleanup and squashes the following commits:

5910df5 [Reynold Xin] [SQL] Rename MathematicalExpression UnaryMathExpression, and specify BinaryMathExpression's output data type as DoubleType.
2015-05-27 01:13:57 -07:00
Reynold Xin 9f48bf6b37 [SPARK-7887][SQL] Remove EvaluatedType from SQL Expression.
This type is not really used. Might as well remove it.

Author: Reynold Xin <rxin@databricks.com>

Closes #6427 from rxin/evalutedType and squashes the following commits:

51a319a [Reynold Xin] [SPARK-7887][SQL] Remove EvaluatedType from SQL Expression.
2015-05-27 01:12:59 -07:00
rowan 03668348e2 [SPARK-7637] [SQL] O(N) merge implementation for StructType merge
Contribution is my original work and I license the work to the project under the projects open source license.

Author: rowan <rowan.chattaway@googlemail.com>

Closes #6259 from rowan000/SPARK-7637 and squashes the following commits:

c479df4 [rowan] SPARK-7637: rename mapFields to fieldsMap as per comments on github.
8d2e419 [rowan] SPARK-7637: fix up whitespace changes
0e9d662 [rowan] SPARK-7637: O(N) merge implementatio for StructType merge
2015-05-26 18:17:16 -07:00
Reynold Xin c9adcad81a [SQL][minor] Removed unused Catalyst logical plan DSL.
The Catalyst DSL is no longer used as a public facing API. This pull request removes the UDF and writeToFile feature from it since they are not used in unit tests.

Author: Reynold Xin <rxin@databricks.com>

Closes #6350 from rxin/unused-logical-dsl and squashes the following commits:

90b3de6 [Reynold Xin] [SQL][minor] Removed unused Catalyst logical plan DSL.
2015-05-25 23:09:22 -07:00
Michael Armbrust 3c1305107a [SPARK-7834] [SQL] Better window error messages
Author: Michael Armbrust <michael@databricks.com>

Closes #6363 from marmbrus/windowErrors and squashes the following commits:

516b02d [Michael Armbrust] [SPARK-7834] [SQL] Better window error messages
2015-05-22 17:23:12 -07:00
Santiago M. Mola e4aef91fe7 [SPARK-7724] [SQL] Support Intersect/Except in Catalyst DSL.
Author: Santiago M. Mola <santi@mola.io>

Closes #6327 from smola/feature/catalyst-dsl-set-ops and squashes the following commits:

11db778 [Santiago M. Mola] [SPARK-7724] [SQL] Support Intersect/Except in Catalyst DSL.
2015-05-22 15:10:27 -07:00
Michael Armbrust 3b68cb0430 [SPARK-6743] [SQL] Fix empty projections of cached data
Author: Michael Armbrust <michael@databricks.com>

Closes #6165 from marmbrus/wrongColumn and squashes the following commits:

4fad158 [Michael Armbrust] Merge remote-tracking branch 'origin/master' into wrongColumn
aad7eab [Michael Armbrust] rxins comments
f1e8df1 [Michael Armbrust] [SPARK-6743][SQL] Fix empty projections of cached data
2015-05-22 09:43:46 -07:00
scwf 60336e3bc0 [SPARK-7656] [SQL] use CatalystConf in FunctionRegistry
follow up for #5806

Author: scwf <wangfei1@huawei.com>

Closes #6164 from scwf/FunctionRegistry and squashes the following commits:

15e6697 [scwf] use catalogconf in FunctionRegistry
2015-05-19 17:36:00 -07:00
Cheng Hao bcb1ff8146 [SPARK-7662] [SQL] Resolve correct names for generator in projection
```
select explode(map(value, key)) from src;
```
Throws exception
```
org.apache.spark.sql.AnalysisException: The number of aliases supplied in the AS clause does not match the number of columns output by the UDTF expected 2 aliases but got _c0 ;
at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class.failAnalysis(CheckAnalysis.scala:38)
at org.apache.spark.sql.catalyst.analysis.Analyzer.failAnalysis(Analyzer.scala:43)
at org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveGenerate$.org$apache$spark$sql$catalyst$analysis$Analyzer$ResolveGenerate$$makeGeneratorOutput(Analyzer.scala:605)
at org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveGenerate$$anonfun$apply$16$$anonfun$22.apply(Analyzer.scala:562)
at org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveGenerate$$anonfun$apply$16$$anonfun$22.apply(Analyzer.scala:548)
at scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:251)
at scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:251)
at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
at scala.collection.TraversableLike$class.flatMap(TraversableLike.scala:251)
at scala.collection.AbstractTraversable.flatMap(Traversable.scala:105)
at org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveGenerate$$anonfun$apply$16.applyOrElse(Analyzer.scala:548)
at org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveGenerate$$anonfun$apply$16.applyOrElse(Analyzer.scala:538)
at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:222)
```

Author: Cheng Hao <hao.cheng@intel.com>

Closes #6178 from chenghao-intel/explode and squashes the following commits:

916fbe9 [Cheng Hao] add more strict rules for TGF alias
5c3f2c5 [Cheng Hao] fix bug in unit test
e1d93ab [Cheng Hao] Add more unit test
19db09e [Cheng Hao] resolve names for generator in projection
2015-05-19 15:20:46 -07:00
Wenchen Fan 103c863c2e [SPARK-7269] [SQL] Incorrect analysis for aggregation(use semanticEquals)
A modified version of https://github.com/apache/spark/pull/6110, use `semanticEquals` to make it more efficient.

Author: Wenchen Fan <cloud0fan@outlook.com>

Closes #6173 from cloud-fan/7269 and squashes the following commits:

e4a3cc7 [Wenchen Fan] address comments
cc02045 [Wenchen Fan] consider elements length equal
d7ff8f4 [Wenchen Fan] fix 7269
2015-05-18 12:12:55 -07:00
scwf fc2480ed13 [SPARK-7631] [SQL] treenode argString should not print children
spark-sql>
> explain extended
> select * from (
> select key from src union all
> select key from src) t;

now the spark plan will print children in argString
```
== Physical Plan ==
Union[ HiveTableScan key#1, (MetastoreRelation default, src, None), None,
HiveTableScan key#3, (MetastoreRelation default, src, None), None]
HiveTableScan key#1, (MetastoreRelation default, src, None), None
HiveTableScan key#3, (MetastoreRelation default, src, None), None
```

after this patch:
```
== Physical Plan ==
Union
 HiveTableScan [key#1], (MetastoreRelation default, src, None), None
 HiveTableScan [key#3], (MetastoreRelation default, src, None), None
```

I have tested this locally

Author: scwf <wangfei1@huawei.com>

Closes #6144 from scwf/fix-argString and squashes the following commits:

1a642e0 [scwf] fix treenode argString
2015-05-18 12:05:14 -07:00
Wenchen Fan 56ede88485 [SQL] [MINOR] [THIS] use private for internal field in ScalaUdf
Author: Wenchen Fan <cloud0fan@outlook.com>

Closes #6235 from cloud-fan/tmp and squashes the following commits:

8f16367 [Wenchen Fan] use private[this]
2015-05-18 12:01:30 -07:00
Wenchen Fan 2f22424e9f [SQL] [MINOR] use catalyst type converter in ScalaUdf
It's a follow-up of https://github.com/apache/spark/pull/5154, we can speed up scala udf evaluation by create type converter in advance.

Author: Wenchen Fan <cloud0fan@outlook.com>

Closes #6182 from cloud-fan/tmp and squashes the following commits:

241cfe9 [Wenchen Fan] use converter in ScalaUdf
2015-05-17 16:51:57 -07:00
scwf edf09ea1bd [SQL] [MINOR] Skip unresolved expression for InConversion
Author: scwf <wangfei1@huawei.com>

Closes #6145 from scwf/InConversion and squashes the following commits:

5c8ac6b [scwf] minir fix for InConversion
2015-05-17 15:17:11 +08:00
Liang-Chi Hsieh f9705d4613 [SPARK-7098][SQL] Make the WHERE clause with timestamp show consistent result
JIRA: https://issues.apache.org/jira/browse/SPARK-7098

The WHERE clause with timstamp shows inconsistent results. This pr fixes it.

Author: Liang-Chi Hsieh <viirya@gmail.com>

Closes #5682 from viirya/consistent_timestamp and squashes the following commits:

171445a [Liang-Chi Hsieh] Merge remote-tracking branch 'upstream/master' into consistent_timestamp
4e98520 [Liang-Chi Hsieh] Make the WHERE clause with timestamp show consistent result.
2015-05-14 20:49:21 -07:00
Michael Armbrust 6d0633e3ec [SPARK-7548] [SQL] Add explode function for DataFrames
Add an `explode` function for dataframes and modify the analyzer so that single table generating functions can be present in a select clause along with other expressions.   There are currently the following restrictions:
 - only top level TGFs are allowed (i.e. no `select(explode('list) + 1)`)
 - only one may be present in a single select to avoid potentially confusing implicit Cartesian products.

TODO:
 - [ ] Python

Author: Michael Armbrust <michael@databricks.com>

Closes #6107 from marmbrus/explodeFunction and squashes the following commits:

7ee2c87 [Michael Armbrust] whitespace
6f80ba3 [Michael Armbrust] Update dataframe.py
c176c89 [Michael Armbrust] Merge remote-tracking branch 'origin/master' into explodeFunction
81b5da3 [Michael Armbrust] style
d3faa05 [Michael Armbrust] fix self join case
f9e1e3e [Michael Armbrust] fix python, add since
4f0d0a9 [Michael Armbrust] Merge remote-tracking branch 'origin/master' into explodeFunction
e710fe4 [Michael Armbrust] add java and python
52ca0dc [Michael Armbrust] [SPARK-7548][SQL] Add explode function for dataframes.
2015-05-14 19:49:44 -07:00
Wenchen Fan f2cd00be35 [SQL][minor] rename apply for QueryPlanner
A follow-up of https://github.com/apache/spark/pull/5624

Author: Wenchen Fan <cloud0fan@outlook.com>

Closes #6142 from cloud-fan/tmp and squashes the following commits:

971a92b [Wenchen Fan] use plan instead of execute
24c5ffe [Wenchen Fan] rename apply
2015-05-14 10:25:18 -07:00
linweizhong 13e652b61a [SPARK-7595] [SQL] Window will cause resolve failed with self join
for example:
table: src(key string, value string)
sql: with v1 as(select key, count(value) over (partition by key) cnt_val from src), v2 as(select v1.key, v1_lag.cnt_val from v1, v1 v1_lag where v1.key = v1_lag.key) select * from v2 limit 5;
then will analyze fail when resolving conflicting references in Join:
'Limit 5
 'Project [*]
  'Subquery v2
   'Project ['v1.key,'v1_lag.cnt_val]
    'Filter ('v1.key = 'v1_lag.key)
     'Join Inner, None
      Subquery v1
       Project [key#95,cnt_val#94L]
        Window [key#95,value#96], [HiveWindowFunction#org.apache.hadoop.hive.ql.udf.generic.GenericUDAFCount(value#96) WindowSpecDefinition [key#95], [], ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING AS cnt_val#94L], WindowSpecDefinition [key#95], [], ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING
         Project [key#95,value#96]
          MetastoreRelation default, src, None
      Subquery v1_lag
       Subquery v1
        Project [key#97,cnt_val#94L]
         Window [key#97,value#98], [HiveWindowFunction#org.apache.hadoop.hive.ql.udf.generic.GenericUDAFCount(value#98) WindowSpecDefinition [key#97], [], ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING AS cnt_val#94L], WindowSpecDefinition [key#97], [], ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING
          Project [key#97,value#98]
           MetastoreRelation default, src, None

Conflicting attributes: cnt_val#94L

Author: linweizhong <linweizhong@huawei.com>

Closes #6114 from Sephiroth-Lin/spark-7595 and squashes the following commits:

f8f2637 [linweizhong] Add unit test
dfe9169 [linweizhong] Handle windowExpression with self join
2015-05-14 00:23:27 -07:00
Reynold Xin e683182c3e [SQL] Move some classes into packages that are more appropriate.
JavaTypeInference into catalyst
types.DateUtils into catalyst
CacheManager into execution
DefaultParserDialect into catalyst

Author: Reynold Xin <rxin@databricks.com>

Closes #6108 from rxin/sql-rename and squashes the following commits:

3fc9613 [Reynold Xin] Fixed import ordering.
83d9ff4 [Reynold Xin] Fixed codegen tests.
e271e86 [Reynold Xin] mima
f4e24a6 [Reynold Xin] [SQL] Move some classes into packages that are more appropriate.
2015-05-13 16:15:31 -07:00
scwf 59250fe514 [SPARK-7303] [SQL] push down project if possible when the child is sort
Optimize the case of `project(_, sort)` , a example is:

`select key from (select * from testData order by key) t`

before this PR:
```
== Parsed Logical Plan ==
'Project ['key]
 'Subquery t
  'Sort ['key ASC], true
   'Project [*]
    'UnresolvedRelation [testData], None

== Analyzed Logical Plan ==
Project [key#0]
 Subquery t
  Sort [key#0 ASC], true
   Project [key#0,value#1]
    Subquery testData
     LogicalRDD [key#0,value#1], MapPartitionsRDD[1]

== Optimized Logical Plan ==
Project [key#0]
 Sort [key#0 ASC], true
  LogicalRDD [key#0,value#1], MapPartitionsRDD[1]

== Physical Plan ==
Project [key#0]
 Sort [key#0 ASC], true
  Exchange (RangePartitioning [key#0 ASC], 5), []
   PhysicalRDD [key#0,value#1], MapPartitionsRDD[1]
```

after this PR
```
== Parsed Logical Plan ==
'Project ['key]
 'Subquery t
  'Sort ['key ASC], true
   'Project [*]
    'UnresolvedRelation [testData], None

== Analyzed Logical Plan ==
Project [key#0]
 Subquery t
  Sort [key#0 ASC], true
   Project [key#0,value#1]
    Subquery testData
     LogicalRDD [key#0,value#1], MapPartitionsRDD[1]

== Optimized Logical Plan ==
Sort [key#0 ASC], true
 Project [key#0]
  LogicalRDD [key#0,value#1], MapPartitionsRDD[1]

== Physical Plan ==
Sort [key#0 ASC], true
 Exchange (RangePartitioning [key#0 ASC], 5), []
  Project [key#0]
   PhysicalRDD [key#0,value#1], MapPartitionsRDD[1]
```

with this rule we will first do column pruning on the table and then do sorting.

Author: scwf <wangfei1@huawei.com>

This patch had conflicts when merged, resolved by
Committer: Michael Armbrust <michael@databricks.com>

Closes #5838 from scwf/pruning and squashes the following commits:

b00d833 [scwf] address michael's comment
e230155 [scwf] fix tests failure
b09b895 [scwf] improve column pruning
2015-05-13 16:13:48 -07:00
Wenchen Fan 213a6f30fe [SPARK-7551][DataFrame] support backticks for DataFrame attribute resolution
Author: Wenchen Fan <cloud0fan@outlook.com>

Closes #6074 from cloud-fan/7551 and squashes the following commits:

e6f579e [Wenchen Fan] allow space
2b86699 [Wenchen Fan] handle blank
e218d99 [Wenchen Fan] address comments
54c4209 [Wenchen Fan] fix 7551
2015-05-13 12:47:48 -07:00
Cheng Hao 0da254fb29 [SPARK-6734] [SQL] Add UDTF.close support in Generate
Some third-party UDTF extensions generate additional rows in the "GenericUDTF.close()" method, which is supported / documented by Hive.
https://cwiki.apache.org/confluence/display/Hive/DeveloperGuide+UDTF
However, Spark SQL ignores the "GenericUDTF.close()", and it causes bug while porting job from Hive to Spark SQL.

Author: Cheng Hao <hao.cheng@intel.com>

Closes #5383 from chenghao-intel/udtf_close and squashes the following commits:

98b4e4b [Cheng Hao] Support UDTF.close
2015-05-14 00:14:59 +08:00
Michael Armbrust 2a41c0d71a [SPARK-7569][SQL] Better error for invalid binary expressions
`scala> Seq((1,1)).toDF("a", "b").select(lit(1) + new java.sql.Date(1)) `

Before:

```
org.apache.spark.sql.AnalysisException: invalid expression (1 + 0) between Literal 1, IntegerType and Literal 0, DateType;
```

After:
```
org.apache.spark.sql.AnalysisException: invalid expression (1 + 0) between int and date;
```

Author: Michael Armbrust <michael@databricks.com>

Closes #6089 from marmbrus/betterBinaryError and squashes the following commits:

23b68ad [Michael Armbrust] [SPARK-7569][SQL] Better error for invalid binary expressions
2015-05-12 13:36:55 -07:00
Wenchen Fan 4e290522c2 [SPARK-7276] [DATAFRAME] speed up DataFrame.select by collapsing Project
Author: Wenchen Fan <cloud0fan@outlook.com>

Closes #5831 from cloud-fan/7276 and squashes the following commits:

ee4a1e1 [Wenchen Fan] fix rebase mistake
a3b565d [Wenchen Fan] refactor
99deb5d [Wenchen Fan] add test
f1f67ad [Wenchen Fan] fix 7276
2015-05-12 11:51:55 -07:00
Wenchen Fan 831504cf6b [DataFrame][minor] cleanup unapply methods in DataTypes
Author: Wenchen Fan <cloud0fan@outlook.com>

Closes #6079 from cloud-fan/unapply and squashes the following commits:

40da442 [Wenchen Fan] one more
7d90a05 [Wenchen Fan] cleanup unapply in DataTypes
2015-05-12 10:28:40 -07:00