Commit graph

2002 commits

Author SHA1 Message Date
Michał Kiełbowicz 9dc3e602d7 Fixed typo
## What changes were proposed in this pull request?

Fixed small typo - "value ... ~~in~~ is null"

## How was this patch tested?

Still compiles!

Author: Michał Kiełbowicz <jupblb@users.noreply.github.com>

Closes #14569 from jupblb/typo-fix.
2016-08-09 23:01:50 -07:00
Davies Liu 92da22878b [SPARK-16905] SQL DDL: MSCK REPAIR TABLE
## What changes were proposed in this pull request?

MSCK REPAIR TABLE could be used to recover the partitions in external catalog based on partitions in file system.

Another syntax is: ALTER TABLE table RECOVER PARTITIONS

The implementation in this PR will only list partitions (not the files with a partition) in driver (in parallel if needed).

## How was this patch tested?

Added unit tests for it and Hive compatibility test suite.

Author: Davies Liu <davies@databricks.com>

Closes #14500 from davies/repair_table.
2016-08-09 10:04:36 -07:00
Sean Zhong bca43cd635 [SPARK-16898][SQL] Adds argument type information for typed logical plan like MapElements, TypedFilter, and AppendColumn
## What changes were proposed in this pull request?

This PR adds argument type information for typed logical plan like MapElements, TypedFilter, and AppendColumn, so that we can use these info in customized optimizer rule.

## How was this patch tested?

Existing test.

Author: Sean Zhong <seanzhong@databricks.com>

Closes #14494 from clockfly/add_more_info_for_typed_operator.
2016-08-09 08:36:50 +08:00
Holden Karau 9216901d52 [SPARK-16779][TRIVIAL] Avoid using postfix operators where they do not add much and remove whitelisting
## What changes were proposed in this pull request?

Avoid using postfix operation for command execution in SQLQuerySuite where it wasn't whitelisted and audit existing whitelistings removing postfix operators from most places. Some notable places where postfix operation remains is in the XML parsing & time units (seconds, millis, etc.) where it arguably can improve readability.

## How was this patch tested?

Existing tests.

Author: Holden Karau <holden@us.ibm.com>

Closes #14407 from holdenk/SPARK-16779.
2016-08-08 15:54:03 -07:00
gatorsmile 5959df217d [SPARK-16936][SQL] Case Sensitivity Support for Refresh Temp Table
### What changes were proposed in this pull request?
Currently, the `refreshTable` API is always case sensitive.

When users use the view name without the exact case match, the API silently ignores the call. Users might expect the command has been successfully completed. However, when users run the subsequent SQL commands, they might still get the exception, like
```
Job aborted due to stage failure:
Task 1 in stage 4.0 failed 1 times, most recent failure: Lost task 1.0 in stage 4.0 (TID 7, localhost):
java.io.FileNotFoundException:
File file:/private/var/folders/4b/sgmfldk15js406vk7lw5llzw0000gn/T/spark-bd4b9ea6-9aec-49c5-8f05-01cff426211e/part-r-00000-0c84b915-c032-4f2e-abf5-1d48fdbddf38.snappy.parquet does not exist
```

This PR is to fix the issue.

### How was this patch tested?
Added a test case.

Author: gatorsmile <gatorsmile@gmail.com>

Closes #14523 from gatorsmile/refreshTempTable.
2016-08-08 22:34:28 +08:00
Nattavut Sutyanyong 06f5dc8415 [SPARK-16804][SQL] Correlated subqueries containing non-deterministic operations return incorrect results
## What changes were proposed in this pull request?

This patch fixes the incorrect results in the rule ResolveSubquery in Catalyst's Analysis phase by returning an error message when the LIMIT is found in the path from the parent table to the correlated predicate in the subquery.

## How was this patch tested?

./dev/run-tests
a new unit test on the problematic pattern.

Author: Nattavut Sutyanyong <nsy.can@gmail.com>

Closes #14411 from nsyca/master.
2016-08-08 12:14:11 +02:00
Weiqing Yang e10ca8de49 [SPARK-16945] Fix Java Lint errors
## What changes were proposed in this pull request?
This PR is to fix the minor Java linter errors as following:
[ERROR] src/main/java/org/apache/spark/sql/catalyst/expressions/VariableLengthRowBasedKeyValueBatch.java:[42,10] (modifier) RedundantModifier: Redundant 'final' modifier.
[ERROR] src/main/java/org/apache/spark/sql/catalyst/expressions/VariableLengthRowBasedKeyValueBatch.java:[97,10] (modifier) RedundantModifier: Redundant 'final' modifier.

## How was this patch tested?
Manual test.
dev/lint-java
Using `mvn` from path: /usr/local/bin/mvn
Checkstyle checks passed.

Author: Weiqing Yang <yangweiqing001@gmail.com>

Closes #14532 from Sherry302/master.
2016-08-08 09:24:37 +01:00
Sean Owen 8d87252087 [SPARK-16409][SQL] regexp_extract with optional groups causes NPE
## What changes were proposed in this pull request?

regexp_extract actually returns null when it shouldn't when a regex matches but the requested optional group did not. This makes it return an empty string, as apparently designed.

## How was this patch tested?

Additional unit test

Author: Sean Owen <sowen@cloudera.com>

Closes #14504 from srowen/SPARK-16409.
2016-08-07 12:20:07 +01:00
Sylvain Zimmer 2460f03ffe [SPARK-16826][SQL] Switch to java.net.URI for parse_url()
## What changes were proposed in this pull request?
The java.net.URL class has a globally synchronized Hashtable, which limits the throughput of any single executor doing lots of calls to parse_url(). Tests have shown that a 36-core machine can only get to 10% CPU use because the threads are locked most of the time.

This patch switches to java.net.URI which has less features than java.net.URL but focuses on URI parsing, which is enough for parse_url().

New tests were added to make sure a few common edge cases didn't change behaviour.
https://issues.apache.org/jira/browse/SPARK-16826

## How was this patch tested?
I've kept the old URL code commented for now, so that people can verify that the new unit tests do pass with java.net.URL.

Thanks to srowen for the help!

Author: Sylvain Zimmer <sylvain@sylvainzimmer.com>

Closes #14488 from sylvinus/master.
2016-08-05 20:55:58 +01:00
Wenchen Fan 5effc016c8 [SPARK-16879][SQL] unify logical plans for CREATE TABLE and CTAS
## What changes were proposed in this pull request?

we have various logical plans for CREATE TABLE and CTAS: `CreateTableUsing`, `CreateTableUsingAsSelect`, `CreateHiveTableAsSelectLogicalPlan`. This PR unifies them to reduce the complexity and centralize the error handling.

## How was this patch tested?

existing tests

Author: Wenchen Fan <wenchen@databricks.com>

Closes #14482 from cloud-fan/table.
2016-08-05 10:50:26 +02:00
Sean Zhong 9d7a47406e [SPARK-16853][SQL] fixes encoder error in DataSet typed select
## What changes were proposed in this pull request?

For DataSet typed select:
```
def select[U1: Encoder](c1: TypedColumn[T, U1]): Dataset[U1]
```
If type T is a case class or a tuple class that is not atomic, the resulting logical plan's schema will mismatch with `Dataset[T]` encoder's schema, which will cause encoder error and throw AnalysisException.

### Before change:
```
scala> case class A(a: Int, b: Int)
scala> Seq((0, A(1,2))).toDS.select($"_2".as[A])
org.apache.spark.sql.AnalysisException: cannot resolve '`a`' given input columns: [_2];
..
```

### After change:
```
scala> case class A(a: Int, b: Int)
scala> Seq((0, A(1,2))).toDS.select($"_2".as[A]).show
+---+---+
|  a|  b|
+---+---+
|  1|  2|
+---+---+
```

## How was this patch tested?

Unit test.

Author: Sean Zhong <seanzhong@databricks.com>

Closes #14474 from clockfly/SPARK-16853.
2016-08-04 19:45:47 +08:00
Wenchen Fan 43f4fd6f9b [SPARK-16867][SQL] createTable and alterTable in ExternalCatalog should not take db
## What changes were proposed in this pull request?

These 2 methods take `CatalogTable` as parameter, which already have the database information.

## How was this patch tested?

existing test

Author: Wenchen Fan <wenchen@databricks.com>

Closes #14476 from cloud-fan/minor5.
2016-08-04 16:48:30 +08:00
Sean Zhong 27e815c31d [SPARK-16888][SQL] Implements eval method for expression AssertNotNull
## What changes were proposed in this pull request?

Implements `eval()` method for expression `AssertNotNull` so that we can convert local projection on LocalRelation to another LocalRelation.

### Before change:
```
scala> import org.apache.spark.sql.catalyst.dsl.expressions._
scala> import org.apache.spark.sql.catalyst.expressions.objects.AssertNotNull
scala> import org.apache.spark.sql.Column
scala> case class A(a: Int)
scala> Seq((A(1),2)).toDS().select(new Column(AssertNotNull("_1".attr, Nil))).explain

java.lang.UnsupportedOperationException: Only code-generated evaluation is supported.
  at org.apache.spark.sql.catalyst.expressions.objects.AssertNotNull.eval(objects.scala:850)
  ...
```

### After the change:
```
scala> Seq((A(1),2)).toDS().select(new Column(AssertNotNull("_1".attr, Nil))).explain(true)

== Parsed Logical Plan ==
'Project [assertnotnull('_1) AS assertnotnull(_1)#5]
+- LocalRelation [_1#2, _2#3]

== Analyzed Logical Plan ==
assertnotnull(_1): struct<a:int>
Project [assertnotnull(_1#2) AS assertnotnull(_1)#5]
+- LocalRelation [_1#2, _2#3]

== Optimized Logical Plan ==
LocalRelation [assertnotnull(_1)#5]

== Physical Plan ==
LocalTableScan [assertnotnull(_1)#5]
```

## How was this patch tested?

Unit test.

Author: Sean Zhong <seanzhong@databricks.com>

Closes #14486 from clockfly/assertnotnull_eval.
2016-08-04 13:43:25 +08:00
Eric Liang e6f226c567 [SPARK-16596] [SQL] Refactor DataSourceScanExec to do partition discovery at execution instead of planning time
## What changes were proposed in this pull request?

Partition discovery is rather expensive, so we should do it at execution time instead of during physical planning. Right now there is not much benefit since ListingFileCatalog will read scan for all partitions at planning time anyways, but this can be optimized in the future. Also, there might be more information for partition pruning not available at planning time.

This PR moves a lot of the file scan logic from planning to execution time. All file scan operations are handled by `FileSourceScanExec`, which handles both batched and non-batched file scans. This requires some duplication with `RowDataSourceScanExec`, but is probably worth it so that `FileSourceScanExec` does not need to depend on an input RDD.

TODO: In another pr, move DataSourceScanExec to it's own file.

## How was this patch tested?

Existing tests (it might be worth adding a test that catalog.listFiles() is delayed until execution, but this can be delayed until there is an actual benefit to doing so).

Author: Eric Liang <ekl@databricks.com>

Closes #14241 from ericl/refactor.
2016-08-03 11:19:55 -07:00
Wenchen Fan b55f34370f [SPARK-16714][SPARK-16735][SPARK-16646] array, map, greatest, least's type coercion should handle decimal type
## What changes were proposed in this pull request?

Here is a table about the behaviours of `array`/`map` and `greatest`/`least` in Hive, MySQL and Postgres:

|    |Hive|MySQL|Postgres|
|---|---|---|---|---|
|`array`/`map`|can find a wider type with decimal type arguments, and will truncate the wider decimal type if necessary|can find a wider type with decimal type arguments, no truncation problem|can find a wider type with decimal type arguments, no truncation problem|
|`greatest`/`least`|can find a wider type with decimal type arguments, and truncate if necessary, but can't do string promotion|can find a wider type with decimal type arguments, no truncation problem, but can't do string promotion|can find a wider type with decimal type arguments, no truncation problem, but can't do string promotion|

I think these behaviours makes sense and Spark SQL should follow them.

This PR fixes `array` and `map` by using `findWiderCommonType` to get the wider type.
This PR fixes `greatest` and `least` by add a `findWiderTypeWithoutStringPromotion`, which provides similar semantic of `findWiderCommonType`, but without string promotion.

## How was this patch tested?

new tests in `TypeCoersionSuite`

Author: Wenchen Fan <wenchen@databricks.com>
Author: Yin Huai <yhuai@databricks.com>

Closes #14439 from cloud-fan/bug.
2016-08-03 11:15:09 -07:00
Wenchen Fan a9beeaaaeb [SPARK-16855][SQL] move Greatest and Least from conditionalExpressions.scala to arithmetic.scala
## What changes were proposed in this pull request?

`Greatest` and `Least` are not conditional expressions, but arithmetic expressions.

## How was this patch tested?

N/A

Author: Wenchen Fan <wenchen@databricks.com>

Closes #14460 from cloud-fan/move.
2016-08-02 11:08:32 -07:00
Herman van Hovell 2330f3ecbb [SPARK-16836][SQL] Add support for CURRENT_DATE/CURRENT_TIMESTAMP literals
## What changes were proposed in this pull request?
In Spark 1.6 (with Hive support) we could use `CURRENT_DATE` and `CURRENT_TIMESTAMP` functions as literals (without adding braces), for example:
```SQL
select /* Spark 1.6: */ current_date, /* Spark 1.6  & Spark 2.0: */ current_date()
```
This was accidentally dropped in Spark 2.0. This PR reinstates this functionality.

## How was this patch tested?
Added a case to ExpressionParserSuite.

Author: Herman van Hovell <hvanhovell@databricks.com>

Closes #14442 from hvanhovell/SPARK-16836.
2016-08-02 10:09:47 -07:00
Liang-Chi Hsieh 146001a9ff [SPARK-16062] [SPARK-15989] [SQL] Fix two bugs of Python-only UDTs
## What changes were proposed in this pull request?

There are two related bugs of Python-only UDTs. Because the test case of second one needs the first fix too. I put them into one PR. If it is not appropriate, please let me know.

### First bug: When MapObjects works on Python-only UDTs

`RowEncoder` will use `PythonUserDefinedType.sqlType` for its deserializer expression. If the sql type is `ArrayType`, we will have `MapObjects` working on it. But `MapObjects` doesn't consider `PythonUserDefinedType` as its input data type. It causes error like:

    import pyspark.sql.group
    from pyspark.sql.tests import PythonOnlyPoint, PythonOnlyUDT
    from pyspark.sql.types import *

    schema = StructType().add("key", LongType()).add("val", PythonOnlyUDT())
    df = spark.createDataFrame([(i % 3, PythonOnlyPoint(float(i), float(i))) for i in range(10)], schema=schema)
    df.show()

    File "/home/spark/python/lib/py4j-0.10.1-src.zip/py4j/protocol.py", line 312, in get_return_value py4j.protocol.Py4JJavaError: An error occurred while calling o36.showString.
    : java.lang.RuntimeException: Error while decoding: scala.MatchError: org.apache.spark.sql.types.PythonUserDefinedTypef4ceede8 (of class org.apache.spark.sql.types.PythonUserDefinedType)
    ...

### Second bug: When Python-only UDTs is the element type of ArrayType

    import pyspark.sql.group
    from pyspark.sql.tests import PythonOnlyPoint, PythonOnlyUDT
    from pyspark.sql.types import *

    schema = StructType().add("key", LongType()).add("val", ArrayType(PythonOnlyUDT()))
    df = spark.createDataFrame([(i % 3, [PythonOnlyPoint(float(i), float(i))]) for i in range(10)], schema=schema)
    df.show()

## How was this patch tested?
PySpark's sql tests.

Author: Liang-Chi Hsieh <simonh@tw.ibm.com>

Closes #13778 from viirya/fix-pyudt.
2016-08-02 10:08:18 -07:00
Tom Magrino 1dab63d8d3 [SPARK-16837][SQL] TimeWindow incorrectly drops slideDuration in constructors
## What changes were proposed in this pull request?

Fix of incorrect arguments (dropping slideDuration and using windowDuration) in constructors for TimeWindow.

The JIRA this addresses is here: https://issues.apache.org/jira/browse/SPARK-16837

## How was this patch tested?

Added a test to TimeWindowSuite to check that the results of TimeWindow object apply and TimeWindow class constructor are equivalent.

Author: Tom Magrino <tmagrino@fb.com>

Closes #14441 from tmagrino/windowing-fix.
2016-08-02 09:16:44 -07:00
petermaxlee a1ff72e1cc [SPARK-16850][SQL] Improve type checking error message for greatest/least
## What changes were proposed in this pull request?
Greatest/least function does not have the most friendly error message for data types. This patch improves the error message to not show the Seq type, and use more human readable data types.

Before:
```
org.apache.spark.sql.AnalysisException: cannot resolve 'greatest(CAST(1.0 AS DECIMAL(2,1)), "1.0")' due to data type mismatch: The expressions should all have the same type, got GREATEST (ArrayBuffer(DecimalType(2,1), StringType)).; line 1 pos 7
```

After:
```
org.apache.spark.sql.AnalysisException: cannot resolve 'greatest(CAST(1.0 AS DECIMAL(2,1)), "1.0")' due to data type mismatch: The expressions should all have the same type, got GREATEST(decimal(2,1), string).; line 1 pos 7
```

## How was this patch tested?
Manually verified the output and also added unit tests to ConditionalExpressionSuite.

Author: petermaxlee <petermaxlee@gmail.com>

Closes #14453 from petermaxlee/SPARK-16850.
2016-08-02 19:32:35 +08:00
Wenchen Fan 2eedc00b04 [SPARK-16828][SQL] remove MaxOf and MinOf
## What changes were proposed in this pull request?

These 2 expressions are not needed anymore after we have `Greatest` and `Least`. This PR removes them and related tests.

## How was this patch tested?

N/A

Author: Wenchen Fan <wenchen@databricks.com>

Closes #14434 from cloud-fan/minor1.
2016-08-01 17:54:41 -07:00
Holden Karau ab1e761f96 [SPARK-16774][SQL] Fix use of deprecated timestamp constructor & improve timezone handling
## What changes were proposed in this pull request?

Removes the deprecated timestamp constructor and incidentally fixes the use which was using system timezone rather than the one specified when working near DST.

This change also causes the roundtrip tests to fail since it now actually uses all the timezones near DST boundaries where it didn't before.

Note: this is only a partial the solution, longer term we should follow up with https://issues.apache.org/jira/browse/SPARK-16788 to avoid this problem & simplify our timezone handling code.

## How was this patch tested?

New tests for two timezones added so even if user timezone happens to coincided with one, the other tests should still fail. Important note: this (temporarily) disables the round trip tests until we can fix the issue more thoroughly.

Author: Holden Karau <holden@us.ibm.com>

Closes #14398 from holdenk/SPARK-16774-fix-use-of-deprecated-timestamp-constructor.
2016-08-01 13:57:05 -07:00
eyal farago 338a98d65c [SPARK-16791][SQL] cast struct with timestamp field fails
## What changes were proposed in this pull request?
a failing test case + fix to SPARK-16791 (https://issues.apache.org/jira/browse/SPARK-16791)

## How was this patch tested?
added a failing test case to CastSuit, then fixed the Cast code and rerun the entire CastSuit

Author: eyal farago <eyal farago>
Author: Eyal Farago <eyal.farago@actimize.com>

Closes #14400 from eyalfa/SPARK-16791_cast_struct_with_timestamp_field_fails.
2016-08-01 22:43:32 +08:00
Dongjoon Hyun 64d8f37c71 [SPARK-16726][SQL] Improve Union/Intersect/Except error messages on incompatible types
## What changes were proposed in this pull request?

Currently, `UNION` queries on incompatible types show misleading error messages, i.e., `unresolved operator Union`. We had better show a more correct message. This will help users in the situation of [SPARK-16704](https://issues.apache.org/jira/browse/SPARK-16704).

**Before**
```scala
scala> sql("select 1,2,3 union (select 1,array(2),3)")
org.apache.spark.sql.AnalysisException: unresolved operator 'Union;
scala> sql("select 1,2,3 intersect (select 1,array(2),3)")
org.apache.spark.sql.AnalysisException: unresolved operator 'Intersect;
scala> sql("select 1,2,3 except (select 1,array(2),3)")
org.apache.spark.sql.AnalysisException: unresolved operator 'Except;
```

**After**
```scala
scala> sql("select 1,2,3 union (select 1,array(2),3)")
org.apache.spark.sql.AnalysisException: Union can only be performed on tables with the compatible column types. ArrayType(IntegerType,false) <> IntegerType at the second column of the second table;
scala> sql("select 1,2,3 intersect (select 1,array(2),3)")
org.apache.spark.sql.AnalysisException: Intersect can only be performed on tables with the compatible column types. ArrayType(IntegerType,false) <> IntegerType at the second column of the second table;
scala> sql("select 1,2,3 except (select array(1),array(2),3)")
org.apache.spark.sql.AnalysisException: Except can only be performed on tables with the compatible column types. ArrayType(IntegerType,false) <> IntegerType at the first column of the second table;
```

## How was this patch tested?

Pass the Jenkins test with a new test case.

Author: Dongjoon Hyun <dongjoon@apache.org>

Closes #14355 from dongjoon-hyun/SPARK-16726.
2016-08-01 11:12:58 +02:00
Wenchen Fan 301fb0d723 [SPARK-16731][SQL] use StructType in CatalogTable and remove CatalogColumn
## What changes were proposed in this pull request?

`StructField` has very similar semantic with `CatalogColumn`, except that `CatalogColumn` use string to express data type. I think it's reasonable to use `StructType` as the `CatalogTable.schema` and remove `CatalogColumn`.

## How was this patch tested?

existing tests.

Author: Wenchen Fan <wenchen@databricks.com>

Closes #14363 from cloud-fan/column.
2016-07-31 18:18:53 -07:00
Reynold Xin 064d91ff73 [SPARK-16813][SQL] Remove private[sql] and private[spark] from catalyst package
## What changes were proposed in this pull request?
The catalyst package is meant to be internal, and as a result it does not make sense to mark things as private[sql] or private[spark]. It simply makes debugging harder when Spark developers need to inspect the plans at runtime.

This patch removes all private[sql] and private[spark] visibility modifiers in org.apache.spark.sql.catalyst.

## How was this patch tested?
N/A - just visibility changes.

Author: Reynold Xin <rxin@databricks.com>

Closes #14418 from rxin/SPARK-16813.
2016-07-31 16:31:06 +08:00
Sean Owen 0dc4310b47 [SPARK-16694][CORE] Use for/foreach rather than map for Unit expressions whose side effects are required
## What changes were proposed in this pull request?

Use foreach/for instead of map where operation requires execution of body, not actually defining a transformation

## How was this patch tested?

Jenkins

Author: Sean Owen <sowen@cloudera.com>

Closes #14332 from srowen/SPARK-16694.
2016-07-30 04:42:38 -07:00
Tathagata Das bbc247548a [SPARK-16748][SQL] SparkExceptions during planning should not wrapped in TreeNodeException
## What changes were proposed in this pull request?
We do not want SparkExceptions from job failures in the planning phase to create TreeNodeException. Hence do not wrap SparkException in TreeNodeException.

## How was this patch tested?
New unit test

Author: Tathagata Das <tathagata.das1565@gmail.com>

Closes #14395 from tdas/SPARK-16748.
2016-07-29 19:59:35 -07:00
Liang-Chi Hsieh 9ade77c3fa [SPARK-16639][SQL] The query with having condition that contains grouping by column should work
## What changes were proposed in this pull request?

The query with having condition that contains grouping by column will be failed during analysis. E.g.,

    create table tbl(a int, b string);
    select count(b) from tbl group by a + 1 having a + 1 = 2;

Having condition should be able to use grouping by column.

## How was this patch tested?

Jenkins tests.

Author: Liang-Chi Hsieh <simonh@tw.ibm.com>

Closes #14296 from viirya/having-contains-grouping-column.
2016-07-28 22:33:33 +08:00
petermaxlee 11d427c924 [SPARK-16730][SQL] Implement function aliases for type casts
## What changes were proposed in this pull request?
Spark 1.x supports using the Hive type name as function names for doing casts, e.g.
```sql
SELECT int(1.0);
SELECT string(2.0);
```

The above query would work in Spark 1.x because Spark 1.x fail back to Hive for unimplemented functions, and break in Spark 2.0 because the fall back was removed.

This patch implements function aliases using an analyzer rule for the following cast functions:
- boolean
- tinyint
- smallint
- int
- bigint
- float
- double
- decimal
- date
- timestamp
- binary
- string

## How was this patch tested?
Added end-to-end tests in SQLCompatibilityFunctionSuite.

Author: petermaxlee <petermaxlee@gmail.com>

Closes #14364 from petermaxlee/SPARK-16730-2.
2016-07-28 13:13:17 +08:00
petermaxlee ef0ccbcb07 [SPARK-16729][SQL] Throw analysis exception for invalid date casts
## What changes were proposed in this pull request?
Spark currently throws exceptions for invalid casts for all other data types except date type. Somehow date type returns null. It should be consistent and throws analysis exception as well.

## How was this patch tested?
Added a unit test case in CastSuite.

Author: petermaxlee <petermaxlee@gmail.com>

Closes #14358 from petermaxlee/SPARK-16729.
2016-07-27 16:04:43 +08:00
Qifan Pu 738b4cc548 [SPARK-16524][SQL] Add RowBatch and RowBasedHashMapGenerator
## What changes were proposed in this pull request?

This PR is the first step for the following feature:

For hash aggregation in Spark SQL, we use a fast aggregation hashmap to act as a "cache" in order to boost aggregation performance. Previously, the hashmap is backed by a `ColumnarBatch`. This has performance issues when we have wide schema for the aggregation table (large number of key fields or value fields).
In this JIRA, we support another implementation of fast hashmap, which is backed by a `RowBasedKeyValueBatch`. We then automatically pick between the two implementations based on certain knobs.

In this first-step PR, implementations for `RowBasedKeyValueBatch` and `RowBasedHashMapGenerator` are added.

## How was this patch tested?

Unit tests: `RowBasedKeyValueBatchSuite`

Author: Qifan Pu <qifan.pu@gmail.com>

Closes #14349 from ooq/SPARK-16524.
2016-07-26 18:08:07 -07:00
Wenchen Fan 6959061f02 [SPARK-16706][SQL] support java map in encoder
## What changes were proposed in this pull request?

finish the TODO, create a new expression `ExternalMapToCatalyst` to iterate the map directly.

## How was this patch tested?

new test in `JavaDatasetSuite`

Author: Wenchen Fan <wenchen@databricks.com>

Closes #14344 from cloud-fan/java-map.
2016-07-26 15:33:05 +08:00
Liang-Chi Hsieh 7b06a8948f [SPARK-16686][SQL] Remove PushProjectThroughSample since it is handled by ColumnPruning
## What changes were proposed in this pull request?

We push down `Project` through `Sample` in `Optimizer` by the rule `PushProjectThroughSample`. However, if the projected columns produce new output, they will encounter whole data instead of sampled data. It will bring some inconsistency between original plan (Sample then Project) and optimized plan (Project then Sample). In the extreme case such as attached in the JIRA, if the projected column is an UDF which is supposed to not see the sampled out data, the result of UDF will be incorrect.

Since the rule `ColumnPruning` already handles general `Project` pushdown. We don't need  `PushProjectThroughSample` anymore. The rule `ColumnPruning` also avoids the described issue.

## How was this patch tested?

Jenkins tests.

Author: Liang-Chi Hsieh <simonh@tw.ibm.com>

Closes #14327 from viirya/fix-sample-pushdown.
2016-07-26 12:00:01 +08:00
Yin Huai 815f3eece5 [SPARK-16633][SPARK-16642][SPARK-16721][SQL] Fixes three issues related to lead and lag functions
## What changes were proposed in this pull request?
This PR contains three changes.

First, this PR changes the behavior of lead/lag back to Spark 1.6's behavior, which is described as below:
1. lead/lag respect null input values, which means that if the offset row exists and the input value is null, the result will be null instead of the default value.
2. If the offset row does not exist, the default value will be used.
3. OffsetWindowFunction's nullable setting also considers the nullability of its input (because of the first change).

Second, this PR fixes the evaluation of lead/lag when the input expression is a literal. This fix is a result of the first change. In current master, if a literal is used as the input expression of a lead or lag function, the result will be this literal even if the offset row does not exist.

Third, this PR makes ResolveWindowFrame not fire if a window function is not resolved.

## How was this patch tested?
New tests in SQLWindowFunctionSuite

Author: Yin Huai <yhuai@databricks.com>

Closes #14284 from yhuai/lead-lag.
2016-07-25 20:58:07 -07:00
Michael Armbrust f99e34e8e5 [SPARK-16724] Expose DefinedByConstructorParams
We don't generally make things in catalyst/execution private.  Instead they are just undocumented due to their lack of stability guarantees.

Author: Michael Armbrust <michael@databricks.com>

Closes #14356 from marmbrus/patch-1.
2016-07-25 20:41:24 -07:00
gatorsmile 3fc4566941 [SPARK-16678][SPARK-16677][SQL] Fix two View-related bugs
## What changes were proposed in this pull request?
**Issue 1: Disallow Creating/Altering a View when the same-name Table Exists (without IF NOT EXISTS)**
When we create OR alter a view, we check whether the view already exists. In the current implementation, if a table with the same name exists, we treat it as a view. However, this is not the right behavior. We should follow what Hive does. For example,
```
hive> CREATE TABLE tab1 (id int);
OK
Time taken: 0.196 seconds
hive> CREATE OR REPLACE VIEW tab1 AS SELECT * FROM t1;
FAILED: SemanticException [Error 10218]: Existing table is not a view
 The following is an existing table, not a view: default.tab1
hive> ALTER VIEW tab1 AS SELECT * FROM t1;
FAILED: SemanticException [Error 10218]: Existing table is not a view
 The following is an existing table, not a view: default.tab1
hive> CREATE VIEW IF NOT EXISTS tab1 AS SELECT * FROM t1;
OK
Time taken: 0.678 seconds
```

**Issue 2: Strange Error when Issuing Load Table Against A View**
Users should not be allowed to issue LOAD DATA against a view. Currently, when users doing it, we got a very strange runtime error. For example,
```SQL
LOAD DATA LOCAL INPATH "$testData" INTO TABLE $viewName
```
```
java.lang.reflect.InvocationTargetException was thrown.
java.lang.reflect.InvocationTargetException
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:606)
	at org.apache.spark.sql.hive.client.Shim_v0_14.loadTable(HiveShim.scala:680)
```
## How was this patch tested?
Added test cases

Author: gatorsmile <gatorsmile@gmail.com>

Closes #14314 from gatorsmile/tableDDLAgainstView.
2016-07-26 09:32:29 +08:00
Shixiong Zhu 12f490b5c8 [SPARK-16715][TESTS] Fix a potential ExprId conflict for SubexpressionEliminationSuite."Semantic equals and hash"
## What changes were proposed in this pull request?

SubexpressionEliminationSuite."Semantic equals and hash" assumes the default AttributeReference's exprId wont' be "ExprId(1)". However, that depends on when this test runs. It may happen to use "ExprId(1)".

This PR detects the conflict and makes sure we create a different ExprId when the conflict happens.

## How was this patch tested?

Jenkins unit tests.

Author: Shixiong Zhu <shixiong@databricks.com>

Closes #14350 from zsxwing/SPARK-16715.
2016-07-25 16:08:29 -07:00
Cheng Lian 7ea6d282b9 [SPARK-16703][SQL] Remove extra whitespace in SQL generation for window functions
## What changes were proposed in this pull request?

This PR fixes a minor formatting issue of `WindowSpecDefinition.sql` when no partitioning expressions are present.

Before:

```sql
( ORDER BY `a` ASC ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW)
```

After:

```sql
(ORDER BY `a` ASC ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW)
```

## How was this patch tested?

New test case added in `ExpressionSQLBuilderSuite`.

Author: Cheng Lian <lian@databricks.com>

Closes #14334 from liancheng/window-spec-sql-format.
2016-07-25 09:42:39 -07:00
hyukjinkwon 79826f3c79 [SPARK-16698][SQL] Field names having dots should be allowed for datasources based on FileFormat
## What changes were proposed in this pull request?

It seems this is a regression assuming from https://issues.apache.org/jira/browse/SPARK-16698.

Field name having dots throws an exception. For example the codes below:

```scala
val path = "/tmp/path"
val json =""" {"a.b":"data"}"""
spark.sparkContext
  .parallelize(json :: Nil)
  .saveAsTextFile(path)
spark.read.json(path).collect()
```

throws an exception as below:

```
Unable to resolve a.b given [a.b];
org.apache.spark.sql.AnalysisException: Unable to resolve a.b given [a.b];
	at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan$$anonfun$resolve$1$$anonfun$apply$5.apply(LogicalPlan.scala:134)
	at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan$$anonfun$resolve$1$$anonfun$apply$5.apply(LogicalPlan.scala:134)
	at scala.Option.getOrElse(Option.scala:121)
```

This problem was introduced in 17eec0a71b (diff-27c76f96a7b2733ecfd6f46a1716e153R121)

When extracting the data columns, it does not count that it can contains dots in field names. Actually, it seems the fields name are not expected as quoted when defining schema. So, It not have to consider whether this is wrapped with quotes because the actual schema (inferred or user-given schema) would not have the quotes for fields.

For example, this throws an exception. (**Loading JSON from RDD is fine**)

```scala
val json =""" {"a.b":"data"}"""
val rdd = spark.sparkContext.parallelize(json :: Nil)
spark.read.schema(StructType(Seq(StructField("`a.b`", StringType, true))))
  .json(rdd).select("`a.b`").printSchema()
```

as below:

```
cannot resolve '```a.b```' given input columns: [`a.b`];
org.apache.spark.sql.AnalysisException: cannot resolve '```a.b```' given input columns: [`a.b`];
	at org.apache.spark.sql.catalyst.analysis.package$AnalysisErrorAt.failAnalysis(package.scala:42)
```

## How was this patch tested?

Unit tests in `FileSourceStrategySuite`.

Author: hyukjinkwon <gurwls223@gmail.com>

Closes #14339 from HyukjinKwon/SPARK-16698-regression.
2016-07-25 22:51:30 +08:00
Wenchen Fan 64529b186a [SPARK-16691][SQL] move BucketSpec to catalyst module and use it in CatalogTable
## What changes were proposed in this pull request?

It's weird that we have `BucketSpec` to abstract bucket info, but don't use it in `CatalogTable`. This PR moves `BucketSpec` into catalyst module.

## How was this patch tested?

existing tests.

Author: Wenchen Fan <wenchen@databricks.com>

Closes #14331 from cloud-fan/check.
2016-07-25 22:05:48 +08:00
Wenchen Fan d27d362eba [SPARK-16660][SQL] CreateViewCommand should not take CatalogTable
## What changes were proposed in this pull request?

`CreateViewCommand` only needs some information of a `CatalogTable`, but not all of them. We have some tricks(e.g. we need to check the table type is `VIEW`, we need to make `CatalogColumn.dataType` nullable) to allow it to take a `CatalogTable`.
This PR cleans it up and only pass in necessary information to `CreateViewCommand`.

## How was this patch tested?

existing tests.

Author: Wenchen Fan <wenchen@databricks.com>

Closes #14297 from cloud-fan/minor2.
2016-07-25 22:02:00 +08:00
Cheng Lian 68b4020d0c [SPARK-16648][SQL] Make ignoreNullsExpr a child expression of First and Last
## What changes were proposed in this pull request?

Default `TreeNode.withNewChildren` implementation doesn't work for `Last` and when both constructor arguments are the same, e.g.:

```sql
LAST_VALUE(FALSE) -- The 2nd argument defaults to FALSE
LAST_VALUE(FALSE, FALSE)
LAST_VALUE(TRUE, TRUE)
```

This is because although `Last` is a unary expression, both of its constructor arguments, `child` and `ignoreNullsExpr`, are `Expression`s. When they have the same value, `TreeNode.withNewChildren` treats both of them as child nodes by mistake. `First` is also affected by this issue in exactly the same way.

This PR fixes this issue by making `ignoreNullsExpr` a child expression of `First` and `Last`.

## How was this patch tested?

New test case added in `WindowQuerySuite`.

Author: Cheng Lian <lian@databricks.com>

Closes #14295 from liancheng/spark-16648-last-value.
2016-07-25 17:22:29 +08:00
Wenchen Fan 1221ce0402 [SPARK-16645][SQL] rename CatalogStorageFormat.serdeProperties to properties
## What changes were proposed in this pull request?

we also store data source table options in this field, it's unreasonable to call it `serdeProperties`.

## How was this patch tested?

N/A

Author: Wenchen Fan <wenchen@databricks.com>

Closes #14283 from cloud-fan/minor1.
2016-07-25 09:28:56 +08:00
Liang-Chi Hsieh e10b8741d8 [SPARK-16622][SQL] Fix NullPointerException when the returned value of the called method in Invoke is null
## What changes were proposed in this pull request?

Currently we don't check the value returned by called method in `Invoke`. When the returned value is null and is assigned to a variable of primitive type, `NullPointerException` will be thrown.

## How was this patch tested?

Jenkins tests.

Author: Liang-Chi Hsieh <simonh@tw.ibm.com>

Closes #14259 from viirya/agg-empty-ds.
2016-07-23 10:27:16 +08:00
Jacek Laskowski e1bd70f44b [SPARK-16287][HOTFIX][BUILD][SQL] Fix annotation argument needs to be a constant
## What changes were proposed in this pull request?

Build fix for [SPARK-16287][SQL] Implement str_to_map SQL function that has introduced this compilation error:

```
/Users/jacek/dev/oss/spark/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/complexTypeCreator.scala:402: error: annotation argument needs to be a constant; found: "_FUNC_(text[, pairDelim, keyValueDelim]) - Creates a map after splitting the text ".+("into key/value pairs using delimiters. ").+("Default delimiters are \',\' for pairDelim and \':\' for keyValueDelim.")
    "into key/value pairs using delimiters. " +
                                              ^
```

## How was this patch tested?

Local build

Author: Jacek Laskowski <jacek@japila.pl>

Closes #14315 from jaceklaskowski/build-fix-complexTypeCreator.
2016-07-22 12:37:30 +01:00
Sandeep Singh df2c6d59d0 [SPARK-16287][SQL] Implement str_to_map SQL function
## What changes were proposed in this pull request?
This PR adds `str_to_map` SQL function in order to remove Hive fallback.

## How was this patch tested?
Pass the Jenkins tests with newly added.

Author: Sandeep Singh <sandeep@techaddict.me>

Closes #13990 from techaddict/SPARK-16287.
2016-07-22 10:05:21 +08:00
Liang-Chi Hsieh 6203668d50 [SPARK-16640][SQL] Add codegen for Elt function
## What changes were proposed in this pull request?

Elt function doesn't support codegen execution now. We should add the support.

## How was this patch tested?

Jenkins tests.

Author: Liang-Chi Hsieh <simonh@tw.ibm.com>

Closes #14277 from viirya/elt-codegen.
2016-07-21 20:54:17 +08:00
Wenchen Fan cfa5ae84ed [SPARK-16644][SQL] Aggregate should not propagate constraints containing aggregate expressions
## What changes were proposed in this pull request?

aggregate expressions can only be executed inside `Aggregate`, if we propagate it up with constraints, the parent operator can not execute it and will fail at runtime.

## How was this patch tested?

new test in SQLQuerySuite

Author: Wenchen Fan <wenchen@databricks.com>
Author: Yin Huai <yhuai@databricks.com>

Closes #14281 from cloud-fan/bug.
2016-07-20 18:37:15 -07:00
Marcelo Vanzin e3cd5b3050 [SPARK-16634][SQL] Workaround JVM bug by moving some code out of ctor.
Some 1.7 JVMs have a bug that is triggered by certain Scala-generated
bytecode. GenericArrayData suffers from that and fails to load in certain
JVMs.

Moving the offending code out of the constructor and into a helper method
avoids the issue.

Author: Marcelo Vanzin <vanzin@cloudera.com>

Closes #14271 from vanzin/SPARK-16634.
2016-07-20 10:38:44 -07:00
Dongjoon Hyun 162d04a30e [SPARK-16602][SQL] Nvl function should support numeric-string cases
## What changes were proposed in this pull request?

`Nvl` function should support numeric-straing cases like Hive/Spark1.6. Currently, `Nvl` finds the tightest common types among numeric types. This PR extends that to consider `String` type, too.

```scala
- TypeCoercion.findTightestCommonTypeOfTwo(left.dataType, right.dataType).map { dtype =>
+ TypeCoercion.findTightestCommonTypeToString(left.dataType, right.dataType).map { dtype =>
```

**Before**
```scala
scala> sql("select nvl('0', 1)").collect()
org.apache.spark.sql.AnalysisException: cannot resolve `nvl("0", 1)` due to data type mismatch:
input to function coalesce should all be the same type, but it's [string, int]; line 1 pos 7
```

**After**
```scala
scala> sql("select nvl('0', 1)").collect()
res0: Array[org.apache.spark.sql.Row] = Array([0])
```

## How was this patch tested?

Pass the Jenkins tests.

Author: Dongjoon Hyun <dongjoon@apache.org>

Closes #14251 from dongjoon-hyun/SPARK-16602.
2016-07-19 10:28:17 -07:00
Xin Ren 21a6dd2aef [SPARK-16535][BUILD] In pom.xml, remove groupId which is redundant definition and inherited from the parent
https://issues.apache.org/jira/browse/SPARK-16535

## What changes were proposed in this pull request?

When I scan through the pom.xml of sub projects, I found this warning as below and attached screenshot
```
Definition of groupId is redundant, because it's inherited from the parent
```
![screen shot 2016-07-13 at 3 13 11 pm](https://cloud.githubusercontent.com/assets/3925641/16823121/744f893e-4916-11e6-8a52-042f83b9db4e.png)

I've tried to remove some of the lines with groupId definition, and the build on my local machine is still ok.
```
<groupId>org.apache.spark</groupId>
```
As I just find now `<maven.version>3.3.9</maven.version>` is being used in Spark 2.x, and Maven-3 supports versionless parent elements: Maven 3 will remove the need to specify the parent version in sub modules. THIS is great (in Maven 3.1).

ref: http://stackoverflow.com/questions/3157240/maven-3-worth-it/3166762#3166762

## How was this patch tested?

I've tested by re-building the project, and build succeeded.

Author: Xin Ren <iamshrek@126.com>

Closes #14189 from keypointt/SPARK-16535.
2016-07-19 11:59:46 +01:00
Reynold Xin 7b84758034 [SPARK-16584][SQL] Move regexp unit tests to RegexpExpressionsSuite
## What changes were proposed in this pull request?
This patch moves regexp related unit tests from StringExpressionsSuite to RegexpExpressionsSuite to match the file name for regexp expressions.

## How was this patch tested?
This is a test only change.

Author: Reynold Xin <rxin@databricks.com>

Closes #14230 from rxin/SPARK-16584.
2016-07-16 23:42:28 -07:00
Sameer Agarwal a1ffbada8a [SPARK-16582][SQL] Explicitly define isNull = false for non-nullable expressions
## What changes were proposed in this pull request?

This patch is just a slightly safer way to fix the issue we encountered in https://github.com/apache/spark/pull/14168 should this pattern re-occur at other places in the code.

## How was this patch tested?

Existing tests. Also, I manually tested that it fixes the problem in SPARK-16514 without having the proposed change in https://github.com/apache/spark/pull/14168

Author: Sameer Agarwal <sameerag@cs.berkeley.edu>

Closes #14227 from sameeragarwal/codegen.
2016-07-16 13:24:00 -07:00
gatorsmile 1b5c9e52a7 [SPARK-16530][SQL][TRIVIAL] Wrong Parser Keyword in ALTER TABLE CHANGE COLUMN
#### What changes were proposed in this pull request?
Based on the [Hive SQL syntax](https://cwiki.apache.org/confluence/display/Hive/LanguageManual+DDL#LanguageManualDDL-ChangeColumnName/Type/Position/Comment), the command to change column name/type/position/comment is `ALTER TABLE CHANGE COLUMN`. However, in our .g4 file, it is `ALTER TABLE CHANGE COLUMNS`. Because it is the last optional keyword, it does not take any effect. Thus, I put the issue as a Trivial level.

cc hvanhovell

#### How was this patch tested?
Existing test cases

Author: gatorsmile <gatorsmile@gmail.com>

Closes #14186 from gatorsmile/changeColumns.
2016-07-14 17:15:51 +02:00
Wenchen Fan db7317ac3c [SPARK-16448] RemoveAliasOnlyProject should not remove alias with metadata
## What changes were proposed in this pull request?

`Alias` with metadata is not a no-op and we should not strip it in `RemoveAliasOnlyProject` rule.
This PR also did some improvement for this rule:

1. extend the semantic of `alias-only`. Now we allow the project list to be partially aliased.
2. add unit test for this rule.

## How was this patch tested?

new `RemoveAliasOnlyProjectSuite`

Author: Wenchen Fan <wenchen@databricks.com>

Closes #14106 from cloud-fan/bug.
2016-07-14 15:48:22 +08:00
蒋星博 f376c37268 [SPARK-16343][SQL] Improve the PushDownPredicate rule to pushdown predicates correctly in non-deterministic condition.
## What changes were proposed in this pull request?

Currently our Optimizer may reorder the predicates to run them more efficient, but in non-deterministic condition, change the order between deterministic parts and non-deterministic parts may change the number of input rows. For example:
```SELECT a FROM t WHERE rand() < 0.1 AND a = 1```
And
```SELECT a FROM t WHERE a = 1 AND rand() < 0.1```
may call rand() for different times and therefore the output rows differ.

This PR improved this condition by checking whether the predicate is placed before any non-deterministic predicates.

## How was this patch tested?

Expanded related testcases in FilterPushdownSuite.

Author: 蒋星博 <jiangxingbo@meituan.com>

Closes #14012 from jiangxb1987/ppd.
2016-07-14 00:21:27 +08:00
Eric Liang 1c58fa905b [SPARK-16514][SQL] Fix various regex codegen bugs
## What changes were proposed in this pull request?

RegexExtract and RegexReplace currently crash on non-nullable input due use of a hard-coded local variable name (e.g. compiles fail with `java.lang.Exception: failed to compile: org.codehaus.commons.compiler.CompileException: File 'generated.java', Line 85, Column 26: Redefinition of local variable "m" `).

This changes those variables to use fresh names, and also in a few other places.

## How was this patch tested?

Unit tests. rxin

Author: Eric Liang <ekl@databricks.com>

Closes #14168 from ericl/sc-3906.
2016-07-12 23:09:02 -07:00
petermaxlee 56bd399a86 [SPARK-16284][SQL] Implement reflect SQL function
## What changes were proposed in this pull request?
This patch implements reflect SQL function, which can be used to invoke a Java method in SQL. Slightly different from Hive, this implementation requires the class name and the method name to be literals. This implementation also supports only a smaller number of data types, and requires the function to be static, as suggested by rxin in #13969.

java_method is an alias for reflect, so this should also resolve SPARK-16277.

## How was this patch tested?
Added expression unit tests and an end-to-end test.

Author: petermaxlee <petermaxlee@gmail.com>

Closes #14138 from petermaxlee/reflect-static.
2016-07-13 08:05:20 +08:00
Marcelo Vanzin 7f968867ff [SPARK-16119][SQL] Support PURGE option to drop table / partition.
This option is used by Hive to directly delete the files instead of
moving them to the trash. This is needed in certain configurations
where moving the files does not work. For non-Hive tables and partitions,
Spark already behaves as if the PURGE option was set, so there's no
need to do anything.

Hive support for PURGE was added in 0.14 (for tables) and 1.2 (for
partitions), so the code reflects that: trying to use the option with
older versions of Hive will cause an exception to be thrown.

The change is a little noisier than I would like, because of the code
to propagate the new flag through all the interfaces and implementations;
the main changes are in the parser and in HiveShim, aside from the tests
(DDLCommandSuite, VersionsSuite).

Tested by running sql and catalyst unit tests, plus VersionsSuite which
has been updated to test the version-specific behavior. I also ran an
internal test suite that uses PURGE and would not pass previously.

Author: Marcelo Vanzin <vanzin@cloudera.com>

Closes #13831 from vanzin/SPARK-16119.
2016-07-12 12:47:46 -07:00
Reynold Xin c377e49e38 [SPARK-16489][SQL] Guard against variable reuse mistakes in expression code generation
## What changes were proposed in this pull request?
In code generation, it is incorrect for expressions to reuse variable names across different instances of itself. As an example, SPARK-16488 reports a bug in which pmod expression reuses variable name "r".

This patch updates ExpressionEvalHelper test harness to always project two instances of the same expression, which will help us catch variable reuse problems in expression unit tests. This patch also fixes the bug in crc32 expression.

## How was this patch tested?
This is a test harness change, but I also created a new test suite for testing the test harness.

Author: Reynold Xin <rxin@databricks.com>

Closes #14146 from rxin/SPARK-16489.
2016-07-12 10:07:23 -07:00
Sameer Agarwal 9cc74f95ed [SPARK-16488] Fix codegen variable namespace collision in pmod and partitionBy
## What changes were proposed in this pull request?

This patch fixes a variable namespace collision bug in pmod and partitionBy

## How was this patch tested?

Regression test for one possible occurrence. A more general fix in `ExpressionEvalHelper.checkEvaluation` will be in a subsequent PR.

Author: Sameer Agarwal <sameer@databricks.com>

Closes #14144 from sameeragarwal/codegen-bug.
2016-07-11 20:26:01 -07:00
Dongjoon Hyun 840853ed06 [SPARK-16458][SQL] SessionCatalog should support listColumns for temporary tables
## What changes were proposed in this pull request?

Temporary tables are used frequently, but `spark.catalog.listColumns` does not support those tables. This PR make `SessionCatalog` supports temporary table column listing.

**Before**
```scala
scala> spark.range(10).createOrReplaceTempView("t1")

scala> spark.catalog.listTables().collect()
res1: Array[org.apache.spark.sql.catalog.Table] = Array(Table[name=`t1`, tableType=`TEMPORARY`, isTemporary=`true`])

scala> spark.catalog.listColumns("t1").collect()
org.apache.spark.sql.AnalysisException: Table `t1` does not exist in database `default`.;
```

**After**
```
scala> spark.catalog.listColumns("t1").collect()
res2: Array[org.apache.spark.sql.catalog.Column] = Array(Column[name='id', description='id', dataType='bigint', nullable='false', isPartition='false', isBucket='false'])
```
## How was this patch tested?

Pass the Jenkins tests including a new testcase.

Author: Dongjoon Hyun <dongjoon@apache.org>

Closes #14114 from dongjoon-hyun/SPARK-16458.
2016-07-11 22:45:22 +02:00
Reynold Xin ffcb6e055a [SPARK-16477] Bump master version to 2.1.0-SNAPSHOT
## What changes were proposed in this pull request?
After SPARK-16476 (committed earlier today as #14128), we can finally bump the version number.

## How was this patch tested?
N/A

Author: Reynold Xin <rxin@databricks.com>

Closes #14130 from rxin/SPARK-16477.
2016-07-11 09:42:56 -07:00
Dongjoon Hyun 7ac79da0e4 [SPARK-16459][SQL] Prevent dropping current database
## What changes were proposed in this pull request?

This PR prevents dropping the current database to avoid errors like the followings.

```scala
scala> sql("create database delete_db")
scala> sql("use delete_db")
scala> sql("drop database delete_db")
scala> sql("create table t as select 1")
org.apache.spark.sql.catalyst.analysis.NoSuchDatabaseException: Database `delete_db` not found;
```

## How was this patch tested?

Pass the Jenkins tests including an updated testcase.

Author: Dongjoon Hyun <dongjoon@apache.org>

Closes #14115 from dongjoon-hyun/SPARK-16459.
2016-07-11 15:15:47 +02:00
gatorsmile e226278941 [SPARK-16355][SPARK-16354][SQL] Fix Bugs When LIMIT/TABLESAMPLE is Non-foldable, Zero or Negative
#### What changes were proposed in this pull request?
**Issue 1:** When a query containing LIMIT/TABLESAMPLE 0, the statistics could be zero. Results are correct but it could cause a huge performance regression. For example,
```Scala
Seq(("one", 1), ("two", 2), ("three", 3), ("four", 4)).toDF("k", "v")
  .createOrReplaceTempView("test")
val df1 = spark.table("test")
val df2 = spark.table("test").limit(0)
val df = df1.join(df2, Seq("k"), "left")
```
The statistics of both `df` and `df2` are zero. The statistics values should never be zero; otherwise `sizeInBytes` of `BinaryNode` will also be zero (product of children). This PR is to increase it to `1` when the num of rows is equal to 0.

**Issue 2:** When a query containing negative LIMIT/TABLESAMPLE, we should issue exceptions. Negative values could break the implementation assumption of multiple parts. For example, statistics calculation.  Below is the example query.
```SQL
SELECT * FROM testData TABLESAMPLE (-1 rows)
SELECT * FROM testData LIMIT -1
```
This PR is to issue an appropriate exception in this case.

**Issue 3:** Spark SQL follows the restriction of LIMIT clause in Hive. The argument to the LIMIT clause must evaluate to a constant value. It can be a numeric literal, or another kind of numeric expression involving operators, casts, and function return values. You cannot refer to a column or use a subquery. Currently, we do not detect whether the expression in LIMIT clause is foldable or not. If non-foldable, we might issue a strange error message. For example,
```SQL
SELECT * FROM testData LIMIT rand() > 0.2
```
Then, a misleading error message is issued, like
```
assertion failed: No plan for GlobalLimit (_nondeterministic#203 > 0.2)
+- Project [key#11, value#12, rand(-1441968339187861415) AS _nondeterministic#203]
   +- LocalLimit (_nondeterministic#202 > 0.2)
      +- Project [key#11, value#12, rand(-1308350387169017676) AS _nondeterministic#202]
         +- LogicalRDD [key#11, value#12]

java.lang.AssertionError: assertion failed: No plan for GlobalLimit (_nondeterministic#203 > 0.2)
+- Project [key#11, value#12, rand(-1441968339187861415) AS _nondeterministic#203]
   +- LocalLimit (_nondeterministic#202 > 0.2)
      +- Project [key#11, value#12, rand(-1308350387169017676) AS _nondeterministic#202]
         +- LogicalRDD [key#11, value#12]
```
This PR detects it and then issues a meaningful error message.

#### How was this patch tested?
Added test cases.

Author: gatorsmile <gatorsmile@gmail.com>

Closes #14034 from gatorsmile/limit.
2016-07-11 16:21:13 +08:00
petermaxlee 82f0874453 [SPARK-16318][SQL] Implement all remaining xpath functions
## What changes were proposed in this pull request?
This patch implements all remaining xpath functions that Hive supports and not natively supported in Spark: xpath_int, xpath_short, xpath_long, xpath_float, xpath_double, xpath_string, and xpath.

## How was this patch tested?
Added unit tests and end-to-end tests.

Author: petermaxlee <petermaxlee@gmail.com>

Closes #13991 from petermaxlee/SPARK-16318.
2016-07-11 13:28:34 +08:00
wujian f5fef69143 [SPARK-16281][SQL] Implement parse_url SQL function
## What changes were proposed in this pull request?

This PR adds parse_url SQL functions in order to remove Hive fallback.

A new implementation of #13999

## How was this patch tested?

Pass the exist tests including new testcases.

Author: wujian <jan.chou.wu@gmail.com>

Closes #14008 from janplus/SPARK-16281.
2016-07-08 14:38:05 -07:00
Dongjoon Hyun a54438cb23 [SPARK-16285][SQL] Implement sentences SQL functions
## What changes were proposed in this pull request?

This PR implements `sentences` SQL function.

## How was this patch tested?

Pass the Jenkins tests with a new testcase.

Author: Dongjoon Hyun <dongjoon@apache.org>

Closes #14004 from dongjoon-hyun/SPARK_16285.
2016-07-08 17:05:24 +08:00
petermaxlee 8228b06303 [SPARK-16436][SQL] checkEvaluation should support NaN
## What changes were proposed in this pull request?
This small patch modifies ExpressionEvalHelper. checkEvaluation to support comparing NaN values for floating point comparisons.

## How was this patch tested?
This is a test harness change.

Author: petermaxlee <petermaxlee@gmail.com>

Closes #14103 from petermaxlee/SPARK-16436.
2016-07-08 16:49:02 +08:00
Dongjoon Hyun dff73bfa5e [SPARK-16052][SQL] Improve CollapseRepartition optimizer for Repartition/RepartitionBy
## What changes were proposed in this pull request?

This PR improves `CollapseRepartition` to optimize the adjacent combinations of **Repartition** and **RepartitionBy**. Also, this PR adds a testsuite for this optimizer.

**Target Scenario**
```scala
scala> val dsView1 = spark.range(8).repartition(8, $"id")
scala> dsView1.createOrReplaceTempView("dsView1")
scala> sql("select id from dsView1 distribute by id").explain(true)
```

**Before**
```scala
scala> sql("select id from dsView1 distribute by id").explain(true)
== Parsed Logical Plan ==
'RepartitionByExpression ['id]
+- 'Project ['id]
   +- 'UnresolvedRelation `dsView1`

== Analyzed Logical Plan ==
id: bigint
RepartitionByExpression [id#0L]
+- Project [id#0L]
   +- SubqueryAlias dsview1
      +- RepartitionByExpression [id#0L], 8
         +- Range (0, 8, splits=8)

== Optimized Logical Plan ==
RepartitionByExpression [id#0L]
+- RepartitionByExpression [id#0L], 8
   +- Range (0, 8, splits=8)

== Physical Plan ==
Exchange hashpartitioning(id#0L, 200)
+- Exchange hashpartitioning(id#0L, 8)
   +- *Range (0, 8, splits=8)
```

**After**
```scala
scala> sql("select id from dsView1 distribute by id").explain(true)
== Parsed Logical Plan ==
'RepartitionByExpression ['id]
+- 'Project ['id]
   +- 'UnresolvedRelation `dsView1`

== Analyzed Logical Plan ==
id: bigint
RepartitionByExpression [id#0L]
+- Project [id#0L]
   +- SubqueryAlias dsview1
      +- RepartitionByExpression [id#0L], 8
         +- Range (0, 8, splits=8)

== Optimized Logical Plan ==
RepartitionByExpression [id#0L]
+- Range (0, 8, splits=8)

== Physical Plan ==
Exchange hashpartitioning(id#0L, 200)
+- *Range (0, 8, splits=8)
```

## How was this patch tested?

Pass the Jenkins tests (including a new testsuite).

Author: Dongjoon Hyun <dongjoon@apache.org>

Closes #13765 from dongjoon-hyun/SPARK-16052.
2016-07-08 16:44:53 +08:00
Daoyuan Wang 28710b42b0 [SPARK-16415][SQL] fix catalog string error
## What changes were proposed in this pull request?

In #13537 we truncate `simpleString` if it is a long `StructType`. But sometimes we need `catalogString` to reconstruct `TypeInfo`, for example in description of [SPARK-16415 ](https://issues.apache.org/jira/browse/SPARK-16415). So we need to keep the implementation of `catalogString` not affected by our truncate.

## How was this patch tested?

added a test case.

Author: Daoyuan Wang <daoyuan.wang@intel.com>

Closes #14089 from adrian-wang/catalogstring.
2016-07-07 11:08:06 -07:00
Dongjoon Hyun a04cab8f17 [SPARK-16174][SQL] Improve OptimizeIn optimizer to remove literal repetitions
## What changes were proposed in this pull request?

This PR improves `OptimizeIn` optimizer to remove the literal repetitions from SQL `IN` predicates. This optimizer prevents user mistakes and also can optimize some queries like [TPCDS-36](https://github.com/apache/spark/blob/master/sql/core/src/test/resources/tpcds/q36.sql#L19).

**Before**
```scala
scala> sql("select state from (select explode(array('CA','TN')) state) where state in ('TN','TN','TN','TN','TN','TN','TN')").explain
== Physical Plan ==
*Filter state#6 IN (TN,TN,TN,TN,TN,TN,TN)
+- Generate explode([CA,TN]), false, false, [state#6]
   +- Scan OneRowRelation[]
```

**After**
```scala
scala> sql("select state from (select explode(array('CA','TN')) state) where state in ('TN','TN','TN','TN','TN','TN','TN')").explain
== Physical Plan ==
*Filter state#6 IN (TN)
+- Generate explode([CA,TN]), false, false, [state#6]
   +- Scan OneRowRelation[]
```

## How was this patch tested?

Pass the Jenkins tests (including a new testcase).

Author: Dongjoon Hyun <dongjoon@apache.org>

Closes #13876 from dongjoon-hyun/SPARK-16174.
2016-07-07 19:45:43 +08:00
Reynold Xin 986b251401 [SPARK-16400][SQL] Remove InSet filter pushdown from Parquet
## What changes were proposed in this pull request?
This patch removes InSet filter pushdown from Parquet data source, since row-based pushdown is not beneficial to Spark and brings extra complexity to the code base.

## How was this patch tested?
N/A

Author: Reynold Xin <rxin@databricks.com>

Closes #14076 from rxin/SPARK-16400.
2016-07-07 18:09:18 +08:00
gatorsmile 42279bff68 [SPARK-16374][SQL] Remove Alias from MetastoreRelation and SimpleCatalogRelation
#### What changes were proposed in this pull request?
Different from the other leaf nodes, `MetastoreRelation` and `SimpleCatalogRelation` have a pre-defined `alias`, which is used to change the qualifier of the node. However, based on the existing alias handling, alias should be put in `SubqueryAlias`.

This PR is to separate alias handling from `MetastoreRelation` and `SimpleCatalogRelation` to make it consistent with the other nodes. It simplifies the signature and conversion to a `BaseRelation`.

For example, below is an example query for `MetastoreRelation`,  which is converted to a `LogicalRelation`:
```SQL
SELECT tmp.a + 1 FROM test_parquet_ctas tmp WHERE tmp.a > 2
```

Before changes, the analyzed plan is
```
== Analyzed Logical Plan ==
(a + 1): int
Project [(a#951 + 1) AS (a + 1)#952]
+- Filter (a#951 > 2)
   +- SubqueryAlias tmp
      +- Relation[a#951] parquet
```
After changes, the analyzed plan becomes
```
== Analyzed Logical Plan ==
(a + 1): int
Project [(a#951 + 1) AS (a + 1)#952]
+- Filter (a#951 > 2)
   +- SubqueryAlias tmp
      +- SubqueryAlias test_parquet_ctas
         +- Relation[a#951] parquet
```

**Note: the optimized plans are the same.**

For `SimpleCatalogRelation`, the existing code always generates two Subqueries. Thus, no change is needed.

#### How was this patch tested?
Added test cases.

Author: gatorsmile <gatorsmile@gmail.com>

Closes #14053 from gatorsmile/removeAliasFromMetastoreRelation.
2016-07-07 12:07:19 +08:00
hyukjinkwon 34283de160 [SPARK-14839][SQL] Support for other types for tableProperty rule in SQL syntax
## What changes were proposed in this pull request?

Currently, Scala API supports to take options with the types, `String`, `Long`, `Double` and `Boolean` and Python API also supports other types.

This PR corrects `tableProperty` rule to support other types (string, boolean, double and integer) so that support the options for data sources in a consistent way. This will affect other rules such as DBPROPERTIES and TBLPROPERTIES (allowing other types as values).

Also, `TODO add bucketing and partitioning.` was removed because it was resolved in 24bea00047

## How was this patch tested?

Unit test in `MetastoreDataSourcesSuite.scala`.

Author: hyukjinkwon <gurwls223@gmail.com>

Closes #13517 from HyukjinKwon/SPARK-14839.
2016-07-06 23:57:18 -04:00
Reynold Xin 8e3e4ed6c0 [SPARK-16371][SQL] Two follow-up tasks
## What changes were proposed in this pull request?
This is a small follow-up for SPARK-16371:

1. Hide removeMetadata from public API.
2. Add JIRA ticket number to test case name.

## How was this patch tested?
Updated a test comment.

Author: Reynold Xin <rxin@databricks.com>

Closes #14074 from rxin/parquet-filter.
2016-07-06 15:04:37 -07:00
Dongjoon Hyun d0d28507ca [SPARK-16286][SQL] Implement stack table generating function
## What changes were proposed in this pull request?

This PR implements `stack` table generating function.

## How was this patch tested?

Pass the Jenkins tests including new testcases.

Author: Dongjoon Hyun <dongjoon@apache.org>

Closes #14033 from dongjoon-hyun/SPARK-16286.
2016-07-06 10:54:43 +08:00
Reynold Xin 16a2a7d714 [SPARK-16311][SQL] Metadata refresh should work on temporary views
## What changes were proposed in this pull request?
This patch fixes the bug that the refresh command does not work on temporary views. This patch is based on https://github.com/apache/spark/pull/13989, but removes the public Dataset.refresh() API as well as improved test coverage.

Note that I actually think the public refresh() API is very useful. We can in the future implement it by also invalidating the lazy vals in QueryExecution (or alternatively just create a new QueryExecution).

## How was this patch tested?
Re-enabled a previously ignored test, and added a new test suite for Hive testing behavior of temporary views against MetastoreRelation.

Author: Reynold Xin <rxin@databricks.com>
Author: petermaxlee <petermaxlee@gmail.com>

Closes #14009 from rxin/SPARK-16311.
2016-07-05 11:36:05 -07:00
Dongjoon Hyun 88134e7368 [SPARK-16288][SQL] Implement inline table generating function
## What changes were proposed in this pull request?

This PR implements `inline` table generating function.

## How was this patch tested?

Pass the Jenkins tests with new testcase.

Author: Dongjoon Hyun <dongjoon@apache.org>

Closes #13976 from dongjoon-hyun/SPARK-16288.
2016-07-04 01:57:45 +08:00
Dongjoon Hyun 54b27c1797 [SPARK-16278][SPARK-16279][SQL] Implement map_keys/map_values SQL functions
## What changes were proposed in this pull request?

This PR adds `map_keys` and `map_values` SQL functions in order to remove Hive fallback.

## How was this patch tested?

Pass the Jenkins tests including new testcases.

Author: Dongjoon Hyun <dongjoon@apache.org>

Closes #13967 from dongjoon-hyun/SPARK-16278.
2016-07-03 16:59:40 +08:00
gatorsmile ea990f9693 [SPARK-16329][SQL] Star Expansion over Table Containing No Column
#### What changes were proposed in this pull request?
Star expansion over a table containing zero column does not work since 1.6. However, it works in Spark 1.5.1. This PR is to fix the issue in the master branch.

For example,
```scala
val rddNoCols = sqlContext.sparkContext.parallelize(1 to 10).map(_ => Row.empty)
val dfNoCols = sqlContext.createDataFrame(rddNoCols, StructType(Seq.empty))
dfNoCols.registerTempTable("temp_table_no_cols")
sqlContext.sql("select * from temp_table_no_cols").show
```

Without the fix, users will get the following the exception:
```
java.lang.IllegalArgumentException: requirement failed
        at scala.Predef$.require(Predef.scala:221)
        at org.apache.spark.sql.catalyst.analysis.UnresolvedStar.expand(unresolved.scala:199)
```

#### How was this patch tested?
Tests are added

Author: gatorsmile <gatorsmile@gmail.com>

Closes #14007 from gatorsmile/starExpansionTableWithZeroColumn.
2016-07-03 16:48:04 +08:00
Dongjoon Hyun 3000b4b29f [MINOR][BUILD] Fix Java linter errors
## What changes were proposed in this pull request?

This PR fixes the minor Java linter errors like the following.
```
-    public int read(char cbuf[], int off, int len) throws IOException {
+    public int read(char[] cbuf, int off, int len) throws IOException {
```

## How was this patch tested?

Manual.
```
$ build/mvn -T 4 -q -DskipTests -Pyarn -Phadoop-2.3 -Pkinesis-asl -Phive -Phive-thriftserver install
$ dev/lint-java
Using `mvn` from path: /usr/local/bin/mvn
Checkstyle checks passed.
```

Author: Dongjoon Hyun <dongjoon@apache.org>

Closes #14017 from dongjoon-hyun/minor_build_java_linter_error.
2016-07-02 16:31:06 +01:00
Dongjoon Hyun c55397652a [SPARK-16208][SQL] Add PropagateEmptyRelation optimizer
## What changes were proposed in this pull request?

This PR adds a new logical optimizer, `PropagateEmptyRelation`, to collapse a logical plans consisting of only empty LocalRelations.

**Optimizer Targets**

1. Binary(or Higher)-node Logical Plans
   - Union with all empty children.
   - Join with one or two empty children (including Intersect/Except).
2. Unary-node Logical Plans
   - Project/Filter/Sample/Join/Limit/Repartition with all empty children.
   - Aggregate with all empty children and without AggregateFunction expressions, COUNT.
   - Generate with Explode because other UserDefinedGenerators like Hive UDTF returns results.

**Sample Query**
```sql
WITH t1 AS (SELECT a FROM VALUES 1 t(a)),
     t2 AS (SELECT b FROM VALUES 1 t(b) WHERE 1=2)
SELECT a,b
FROM t1, t2
WHERE a=b
GROUP BY a,b
HAVING a>1
ORDER BY a,b
```

**Before**
```scala
scala> sql("with t1 as (select a from values 1 t(a)), t2 as (select b from values 1 t(b) where 1=2) select a,b from t1, t2 where a=b group by a,b having a>1 order by a,b").explain
== Physical Plan ==
*Sort [a#0 ASC, b#1 ASC], true, 0
+- Exchange rangepartitioning(a#0 ASC, b#1 ASC, 200)
   +- *HashAggregate(keys=[a#0, b#1], functions=[])
      +- Exchange hashpartitioning(a#0, b#1, 200)
         +- *HashAggregate(keys=[a#0, b#1], functions=[])
            +- *BroadcastHashJoin [a#0], [b#1], Inner, BuildRight
               :- *Filter (isnotnull(a#0) && (a#0 > 1))
               :  +- LocalTableScan [a#0]
               +- BroadcastExchange HashedRelationBroadcastMode(List(cast(input[0, int, false] as bigint)))
                  +- *Filter (isnotnull(b#1) && (b#1 > 1))
                     +- LocalTableScan <empty>, [b#1]
```

**After**
```scala
scala> sql("with t1 as (select a from values 1 t(a)), t2 as (select b from values 1 t(b) where 1=2) select a,b from t1, t2 where a=b group by a,b having a>1 order by a,b").explain
== Physical Plan ==
LocalTableScan <empty>, [a#0, b#1]
```

## How was this patch tested?

Pass the Jenkins tests (including a new testsuite).

Author: Dongjoon Hyun <dongjoon@apache.org>

Closes #13906 from dongjoon-hyun/SPARK-16208.
2016-07-01 22:13:56 +08:00
Hiroshi Inoue 14cf61e909 [SPARK-16331][SQL] Reduce code generation time
## What changes were proposed in this pull request?
During the code generation, a `LocalRelation` often has a huge `Vector` object as `data`. In the simple example below, a `LocalRelation` has a Vector with 1000000 elements of `UnsafeRow`.

```
val numRows = 1000000
val ds = (1 to numRows).toDS().persist()
benchmark.addCase("filter+reduce") { iter =>
  ds.filter(a => (a & 1) == 0).reduce(_ + _)
}
```

At `TreeNode.transformChildren`, all elements of the vector is unnecessarily iterated to check whether any children exist in the vector since `Vector` is Traversable. This part significantly increases code generation time.

This patch avoids this overhead by checking the number of children before iterating all elements; `LocalRelation` does not have children since it extends `LeafNode`.

The performance of the above example
```
without this patch
Java HotSpot(TM) 64-Bit Server VM 1.8.0_91-b14 on Mac OS X 10.11.5
Intel(R) Core(TM) i5-5257U CPU  2.70GHz
compilationTime:                         Best/Avg Time(ms)    Rate(M/s)   Per Row(ns)   Relative
------------------------------------------------------------------------------------------------
filter+reduce                                 4426 / 4533          0.2        4426.0       1.0X

with this patch
compilationTime:                         Best/Avg Time(ms)    Rate(M/s)   Per Row(ns)   Relative
------------------------------------------------------------------------------------------------
filter+reduce                                 3117 / 3391          0.3        3116.6       1.0X
```

## How was this patch tested?

using existing unit tests

Author: Hiroshi Inoue <inouehrs@jp.ibm.com>

Closes #14000 from inouehrs/compilation-time-reduction.
2016-06-30 21:47:44 -07:00
petermaxlee 85f2303eca [SPARK-16276][SQL] Implement elt SQL function
## What changes were proposed in this pull request?
This patch implements the elt function, as it is implemented in Hive.

## How was this patch tested?
Added expression unit test in StringExpressionsSuite and end-to-end test in StringFunctionsSuite.

Author: petermaxlee <petermaxlee@gmail.com>

Closes #13966 from petermaxlee/SPARK-16276.
2016-07-01 07:57:48 +08:00
Dongjoon Hyun 46395db80e [SPARK-16289][SQL] Implement posexplode table generating function
## What changes were proposed in this pull request?

This PR implements `posexplode` table generating function. Currently, master branch raises the following exception for `map` argument. It's different from Hive.

**Before**
```scala
scala> sql("select posexplode(map('a', 1, 'b', 2))").show
org.apache.spark.sql.AnalysisException: No handler for Hive UDF ... posexplode() takes an array as a parameter; line 1 pos 7
```

**After**
```scala
scala> sql("select posexplode(map('a', 1, 'b', 2))").show
+---+---+-----+
|pos|key|value|
+---+---+-----+
|  0|  a|    1|
|  1|  b|    2|
+---+---+-----+
```

For `array` argument, `after` is the same with `before`.
```
scala> sql("select posexplode(array(1, 2, 3))").show
+---+---+
|pos|col|
+---+---+
|  0|  1|
|  1|  2|
|  2|  3|
+---+---+
```

## How was this patch tested?

Pass the Jenkins tests with newly added testcases.

Author: Dongjoon Hyun <dongjoon@apache.org>

Closes #13971 from dongjoon-hyun/SPARK-16289.
2016-06-30 12:03:54 -07:00
Sital Kedia 07f46afc73 [SPARK-13850] Force the sorter to Spill when number of elements in th…
## What changes were proposed in this pull request?

Force the sorter to Spill when number of elements in the pointer array reach a certain size. This is to workaround the issue of timSort failing on large buffer size.

## How was this patch tested?

Tested by running a job which was failing without this change due to TimSort bug.

Author: Sital Kedia <skedia@fb.com>

Closes #13107 from sitalkedia/fix_TimSort.
2016-06-30 10:53:18 -07:00
Sean Zhong 5320adc863 [SPARK-16071][SQL] Checks size limit when doubling the array size in BufferHolder
## What changes were proposed in this pull request?

This PR Checks the size limit when doubling the array size in BufferHolder to avoid integer overflow.

## How was this patch tested?

Manual test.

Author: Sean Zhong <seanzhong@databricks.com>

Closes #13829 from clockfly/SPARK-16071_2.
2016-06-30 21:56:34 +08:00
petermaxlee d3af6731fa [SPARK-16274][SQL] Implement xpath_boolean
## What changes were proposed in this pull request?
This patch implements xpath_boolean expression for Spark SQL, a xpath function that returns true or false. The implementation is modelled after Hive's xpath_boolean, except that how the expression handles null inputs. Hive throws a NullPointerException at runtime if either of the input is null. This implementation returns null if either of the input is null.

## How was this patch tested?
Created two new test suites. One for unit tests covering the expression, and the other for end-to-end test in SQL.

Author: petermaxlee <petermaxlee@gmail.com>

Closes #13964 from petermaxlee/SPARK-16274.
2016-06-30 09:27:48 +08:00
Wenchen Fan d063898beb [SPARK-16134][SQL] optimizer rules for typed filter
## What changes were proposed in this pull request?

This PR adds 3 optimizer rules for typed filter:

1. push typed filter down through `SerializeFromObject` and eliminate the deserialization in filter condition.
2. pull typed filter up through `SerializeFromObject` and eliminate the deserialization in filter condition.
3. combine adjacent typed filters and share the deserialized object among all the condition expressions.

This PR also adds `TypedFilter` logical plan, to separate it from normal filter, so that the concept is more clear and it's easier to write optimizer rules.

## How was this patch tested?

`TypedFilterOptimizationSuite`

Author: Wenchen Fan <wenchen@databricks.com>

Closes #13846 from cloud-fan/filter.
2016-06-30 08:15:08 +08:00
Eric Liang 23c58653f9 [SPARK-16238] Metrics for generated method and class bytecode size
## What changes were proposed in this pull request?

This extends SPARK-15860 to include metrics for the actual bytecode size of janino-generated methods. They can be accessed in the same way as any other codahale metric, e.g.

```
scala> org.apache.spark.metrics.source.CodegenMetrics.METRIC_GENERATED_CLASS_BYTECODE_SIZE.getSnapshot().getValues()
res7: Array[Long] = Array(532, 532, 532, 542, 1479, 2670, 3585, 3585)

scala> org.apache.spark.metrics.source.CodegenMetrics.METRIC_GENERATED_METHOD_BYTECODE_SIZE.getSnapshot().getValues()
res8: Array[Long] = Array(5, 5, 5, 5, 10, 10, 10, 10, 15, 15, 15, 38, 63, 79, 88, 94, 94, 94, 132, 132, 165, 165, 220, 220)
```

## How was this patch tested?

Small unit test, also verified manually that the performance impact is minimal (<10%). hvanhovell

Author: Eric Liang <ekl@databricks.com>

Closes #13934 from ericl/spark-16238.
2016-06-29 15:07:32 -07:00
Yin Huai 8b5a8b25b9 [SPARK-16301] [SQL] The analyzer rule for resolving using joins should respect the case sensitivity setting.
## What changes were proposed in this pull request?
The analyzer rule for resolving using joins should respect the case sensitivity setting.

## How was this patch tested?
New tests in ResolveNaturalJoinSuite

Author: Yin Huai <yhuai@databricks.com>

Closes #13977 from yhuai/SPARK-16301.
2016-06-29 14:42:58 -07:00
gatorsmile 7ee9e39cb4 [SPARK-16157][SQL] Add New Methods for comments in StructField and StructType
#### What changes were proposed in this pull request?
Based on the previous discussion with cloud-fan hvanhovell in another related PR https://github.com/apache/spark/pull/13764#discussion_r67994276, it looks reasonable to add convenience methods for users to add `comment` when defining `StructField`.

Currently, the column-related `comment` attribute is stored in `Metadata` of `StructField`. For example, users can add the `comment` attribute using the following way:
```Scala
StructType(
  StructField(
    "cl1",
    IntegerType,
    nullable = false,
    new MetadataBuilder().putString("comment", "test").build()) :: Nil)
```
This PR is to add more user friendly methods for the `comment` attribute when defining a `StructField`. After the changes, users are provided three different ways to do it:
```Scala
val struct = (new StructType)
  .add("a", "int", true, "test1")

val struct = (new StructType)
  .add("c", StringType, true, "test3")

val struct = (new StructType)
  .add(StructField("d", StringType).withComment("test4"))
```

#### How was this patch tested?
Added test cases:
- `DataTypeSuite` is for testing three types of API changes,
- `DataFrameReaderWriterSuite` is for parquet, json and csv formats - using in-memory catalog
- `OrcQuerySuite.scala` is for orc format using Hive-metastore

Author: gatorsmile <gatorsmile@gmail.com>

Closes #13860 from gatorsmile/newMethodForComment.
2016-06-29 19:36:21 +08:00
Cheng Lian d1e8108854 [SPARK-16291][SQL] CheckAnalysis should capture nested aggregate functions that reference no input attributes
## What changes were proposed in this pull request?

`MAX(COUNT(*))` is invalid since aggregate expression can't be nested within another aggregate expression. This case should be captured at analysis phase, but somehow sneaks off to runtime.

The reason is that when checking aggregate expressions in `CheckAnalysis`, a checking branch treats all expressions that reference no input attributes as valid ones. However, `MAX(COUNT(*))` is translated into `MAX(COUNT(1))` at analysis phase and also references no input attribute.

This PR fixes this issue by removing the aforementioned branch.

## How was this patch tested?

New test case added in `AnalysisErrorSuite`.

Author: Cheng Lian <lian@databricks.com>

Closes #13968 from liancheng/spark-16291-nested-agg-functions.
2016-06-29 19:08:36 +08:00
petermaxlee 153c2f9ac1 [SPARK-16271][SQL] Implement Hive's UDFXPathUtil
## What changes were proposed in this pull request?
This patch ports Hive's UDFXPathUtil over to Spark, which can be used to implement xpath functionality in Spark in the near future.

## How was this patch tested?
Added two new test suites UDFXPathUtilSuite and ReusableStringReaderSuite. They have been ported over from Hive (but rewritten in Scala in order to leverage ScalaTest).

Author: petermaxlee <petermaxlee@gmail.com>

Closes #13961 from petermaxlee/xpath.
2016-06-28 21:07:52 -07:00
Reynold Xin 363bcedeea [SPARK-16248][SQL] Whitelist the list of Hive fallback functions
## What changes were proposed in this pull request?
This patch removes the blind fallback into Hive for functions. Instead, it creates a whitelist and adds only a small number of functions to the whitelist, i.e. the ones we intend to support in the long run in Spark.

## How was this patch tested?
Updated tests to reflect the change.

Author: Reynold Xin <rxin@databricks.com>

Closes #13939 from rxin/hive-whitelist.
2016-06-28 19:36:53 -07:00
Burak Yavuz 5545b79109 [MINOR][DOCS][STRUCTURED STREAMING] Minor doc fixes around DataFrameWriter and DataStreamWriter
## What changes were proposed in this pull request?

Fixes a couple old references to `DataFrameWriter.startStream` to `DataStreamWriter.start

Author: Burak Yavuz <brkyvz@gmail.com>

Closes #13952 from brkyvz/minor-doc-fix.
2016-06-28 17:02:16 -07:00
Wenchen Fan 8a977b0654 [SPARK-16100][SQL] fix bug when use Map as the buffer type of Aggregator
## What changes were proposed in this pull request?

The root cause is in `MapObjects`. Its parameter `loopVar` is not declared as child, but sometimes can be same with `lambdaFunction`(e.g. the function that takes `loopVar` and produces `lambdaFunction` may be `identity`), which is a child. This brings trouble when call `withNewChildren`, it may mistakenly treat `loopVar` as a child and cause `IndexOutOfBoundsException: 0` later.

This PR fixes this bug by simply pulling out the paremters from `LambdaVariable` and pass them to `MapObjects` directly.

## How was this patch tested?

new test in `DatasetAggregatorSuite`

Author: Wenchen Fan <wenchen@databricks.com>

Closes #13835 from cloud-fan/map-objects.
2016-06-29 06:39:28 +08:00
Wenchen Fan 1f2776df6e [SPARK-16181][SQL] outer join with isNull filter may return wrong result
## What changes were proposed in this pull request?

The root cause is: the output attributes of outer join are derived from its children, while they are actually different attributes(outer join can return null).

We have already added some special logic to handle it, e.g. `PushPredicateThroughJoin` won't push down predicates through outer join side, `FixNullability`.

This PR adds one more special logic in `FoldablePropagation`.

## How was this patch tested?

new test in `DataFrameSuite`

Author: Wenchen Fan <wenchen@databricks.com>

Closes #13884 from cloud-fan/bug.
2016-06-28 10:26:01 -07:00
Herman van Hovell 02a029df43 [SPARK-16220][SQL] Add scope to show functions
## What changes were proposed in this pull request?
Spark currently shows all functions when issue a `SHOW FUNCTIONS` command. This PR refines the `SHOW FUNCTIONS` command by allowing users to select all functions, user defined function or system functions. The following syntax can be used:

**ALL** (default)
```SHOW FUNCTIONS```
```SHOW ALL FUNCTIONS```

**SYSTEM**
```SHOW SYSTEM FUNCTIONS```

**USER**
```SHOW USER FUNCTIONS```
## How was this patch tested?
Updated tests and added tests to the DDLSuite

Author: Herman van Hovell <hvanhovell@databricks.com>

Closes #13929 from hvanhovell/SPARK-16220.
2016-06-27 16:57:34 -07:00
Bill Chambers c48c8ebc0a [SPARK-16220][SQL] Revert Change to Bring Back SHOW FUNCTIONS Functionality
## What changes were proposed in this pull request?

- Fix tests regarding show functions functionality
- Revert `catalog.ListFunctions` and `SHOW FUNCTIONS` to return to `Spark 1.X` functionality.

Cherry picked changes from this PR: https://github.com/apache/spark/pull/13413/files

## How was this patch tested?

Unit tests.

Author: Bill Chambers <bill@databricks.com>
Author: Bill Chambers <wchambers@ischool.berkeley.edu>

Closes #13916 from anabranch/master.
2016-06-27 11:50:34 -07:00
Takeshi YAMAMURO 3e4e868c85 [SPARK-16135][SQL] Remove hashCode and euqals in ArrayBasedMapData
## What changes were proposed in this pull request?
This pr is to remove `hashCode` and `equals` in `ArrayBasedMapData` because the type cannot be used as join keys, grouping keys, or in equality tests.

## How was this patch tested?
Add a new test suite `MapDataSuite` for comparison tests.

Author: Takeshi YAMAMURO <linguin.m.s@gmail.com>

Closes #13847 from maropu/UnsafeMapTest.
2016-06-27 21:45:22 +08:00
Sital Kedia bf665a9586 [SPARK-15958] Make initial buffer size for the Sorter configurable
## What changes were proposed in this pull request?

Currently the initial buffer size in the sorter is hard coded inside the code and is too small for large workload. As a result, the sorter spends significant time expanding the buffer size and copying the data. It would be useful to have it configurable.

## How was this patch tested?

Tested by running a job on the cluster.

Author: Sital Kedia <skedia@fb.com>

Closes #13699 from sitalkedia/config_sort_buffer_upstream.
2016-06-25 09:13:39 +01:00
Takeshi YAMAMURO d2e44d7db8 [SPARK-16192][SQL] Add type checks in CollectSet
## What changes were proposed in this pull request?
`CollectSet` cannot have map-typed data because MapTypeData does not implement `equals`.
So, this pr is to add type checks in `CheckAnalysis`.

## How was this patch tested?
Added tests to check failures when we found map-typed data in `CollectSet`.

Author: Takeshi YAMAMURO <linguin.m.s@gmail.com>

Closes #13892 from maropu/SPARK-16192.
2016-06-24 21:07:03 -07:00
Sean Owen 158af162ea [SPARK-16129][CORE][SQL] Eliminate direct use of commons-lang classes in favor of commons-lang3
## What changes were proposed in this pull request?

Replace use of `commons-lang` in favor of `commons-lang3` and forbid the former via scalastyle; remove `NotImplementedException` from `comons-lang` in favor of JDK `UnsupportedOperationException`

## How was this patch tested?

Jenkins tests

Author: Sean Owen <sowen@cloudera.com>

Closes #13843 from srowen/SPARK-16129.
2016-06-24 10:35:54 +01:00
Wenchen Fan 6a3c6276f5 [SQL][MINOR] ParserUtils.operationNotAllowed should throw exception directly
## What changes were proposed in this pull request?

It's weird that `ParserUtils.operationNotAllowed` returns an exception and the caller throw it.

## How was this patch tested?

N/A

Author: Wenchen Fan <wenchen@databricks.com>

Closes #13874 from cloud-fan/style.
2016-06-23 20:20:55 -07:00
Dongjoon Hyun 91b1ef28d1 [SPARK-16164][SQL] Update CombineFilters to try to construct predicates with child predicate first
## What changes were proposed in this pull request?

This PR changes `CombineFilters` to compose the final predicate condition by using (`child predicate` AND `parent predicate`) instead of (`parent predicate` AND `child predicate`). This is a best effort approach. Some other optimization rules may destroy this order by reorganizing conjunctive predicates.

**Reported Error Scenario**
Chris McCubbin reported a bug when he used StringIndexer in an ML pipeline with additional filters. It seems that during filter pushdown, we changed the ordering in the logical plan.
```scala
import org.apache.spark.ml.feature._
val df1 = (0 until 3).map(_.toString).toDF
val indexer = new StringIndexer()
  .setInputCol("value")
  .setOutputCol("idx")
  .setHandleInvalid("skip")
  .fit(df1)
val df2 = (0 until 5).map(_.toString).toDF
val predictions = indexer.transform(df2)
predictions.show() // this is okay
predictions.where('idx > 2).show() // this will throw an exception
```

Please see the notebook at https://databricks-prod-cloudfront.cloud.databricks.com/public/4027ec902e239c93eaaa8714f173bcfc/1233855/2159162931615821/588180/latest.html for error messages.

## How was this patch tested?

Pass the Jenkins tests (including a new testcase).

Author: Dongjoon Hyun <dongjoon@apache.org>

Closes #13872 from dongjoon-hyun/SPARK-16164.
2016-06-23 15:27:43 -07:00
Davies Liu 10396d9505 [SPARK-16163] [SQL] Cache the statistics for logical plans
## What changes were proposed in this pull request?

This calculation of statistics is not trivial anymore, it could be very slow on large query (for example, TPC-DS Q64 took several minutes to plan).

During the planning of a query, the statistics of any logical plan should not change (even InMemoryRelation), so we should use `lazy val` to cache the statistics.

For InMemoryRelation, the statistics could be updated after materialization, it's only useful when used in another query (before planning), because once we finished the planning, the statistics will not be used anymore.

## How was this patch tested?

Testsed with TPC-DS Q64, it could be planned in a second after the patch.

Author: Davies Liu <davies@databricks.com>

Closes #13871 from davies/fix_statistics.
2016-06-23 11:48:48 -07:00
Davies Liu 20d411bc5d [SPARK-16078][SQL] from_utc_timestamp/to_utc_timestamp should not depends on local timezone
## What changes were proposed in this pull request?

Currently, we use local timezone to parse or format a timestamp (TimestampType), then use Long as the microseconds since epoch UTC.

In from_utc_timestamp() and to_utc_timestamp(), we did not consider the local timezone, they could return different results with different local timezone.

This PR will do the conversion based on human time (in local timezone), it should return same result in whatever timezone. But because the mapping from absolute timestamp to human time is not exactly one-to-one mapping, it will still return wrong result in some timezone (also in the begging or ending of DST).

This PR is kind of the best effort fix. In long term, we should make the TimestampType be timezone aware to fix this totally.

## How was this patch tested?

Tested these function in all timezone.

Author: Davies Liu <davies@databricks.com>

Closes #13784 from davies/convert_tz.
2016-06-22 13:40:24 -07:00
Wenchen Fan 01277d4b25 [SPARK-16097][SQL] Encoders.tuple should handle null object correctly
## What changes were proposed in this pull request?

Although the top level input object can not be null, but when we use `Encoders.tuple` to combine 2 encoders, their input objects are not top level anymore and can be null. We should handle this case.

## How was this patch tested?

new test in DatasetSuite

Author: Wenchen Fan <wenchen@databricks.com>

Closes #13807 from cloud-fan/bug.
2016-06-22 18:32:14 +08:00
Yin Huai 905f774b71 [SPARK-16030][SQL] Allow specifying static partitions when inserting to data source tables
## What changes were proposed in this pull request?
This PR adds the static partition support to INSERT statement when the target table is a data source table.

## How was this patch tested?
New tests in InsertIntoHiveTableSuite and DataSourceAnalysisSuite.

**Note: This PR is based on https://github.com/apache/spark/pull/13766. The last commit is the actual change.**

Author: Yin Huai <yhuai@databricks.com>

Closes #13769 from yhuai/SPARK-16030-1.
2016-06-20 20:17:47 +08:00
Yin Huai 6d0f921aed [SPARK-16036][SPARK-16037][SPARK-16034][SQL] Follow up code clean up and improvement
## What changes were proposed in this pull request?
This PR is the follow-up PR for https://github.com/apache/spark/pull/13754/files and https://github.com/apache/spark/pull/13749. I will comment inline to explain my changes.

## How was this patch tested?
Existing tests.

Author: Yin Huai <yhuai@databricks.com>

Closes #13766 from yhuai/caseSensitivity.
2016-06-19 21:45:53 -07:00
Davies Liu 001a589603 [SPARK-15613] [SQL] Fix incorrect days to millis conversion due to Daylight Saving Time
## What changes were proposed in this pull request?

Internally, we use Int to represent a date (the days since 1970-01-01), when we convert that into unix timestamp (milli-seconds since epoch in UTC), we get the offset of a timezone using local millis (the milli-seconds since 1970-01-01 in a timezone), but TimeZone.getOffset() expect unix timestamp, the result could be off by one hour (in Daylight Saving Time (DST) or not).

This PR change to use best effort approximate of posix timestamp to lookup the offset. In the event of changing of DST, Some time is not defined (for example, 2016-03-13 02:00:00 PST), or could lead to multiple valid result in UTC (for example, 2016-11-06 01:00:00), this best effort approximate should be enough in practice.

## How was this patch tested?

Added regression tests.

Author: Davies Liu <davies@databricks.com>

Closes #13652 from davies/fix_timezone.
2016-06-19 00:34:52 -07:00
Wenchen Fan 3d010c8375 [SPARK-16036][SPARK-16037][SQL] fix various table insertion problems
## What changes were proposed in this pull request?

The current table insertion has some weird behaviours:

1. inserting into a partitioned table with mismatch columns has confusing error message for hive table, and wrong result for datasource table
2. inserting into a partitioned table without partition list has wrong result for hive table.

This PR fixes these 2 problems.

## How was this patch tested?

new test in hive `SQLQuerySuite`

Author: Wenchen Fan <wenchen@databricks.com>

Closes #13754 from cloud-fan/insert2.
2016-06-18 10:32:27 -07:00
Reynold Xin 1a65e62a7f [SPARK-16014][SQL] Rename optimizer rules to be more consistent
## What changes were proposed in this pull request?
This small patch renames a few optimizer rules to make the naming more consistent, e.g. class name start with a verb. The main important "fix" is probably SamplePushDown -> PushProjectThroughSample. SamplePushDown is actually the wrong name, since the rule is not about pushing Sample down.

## How was this patch tested?
Updated test cases.

Author: Reynold Xin <rxin@databricks.com>

Closes #13732 from rxin/SPARK-16014.
2016-06-17 15:51:20 -07:00
gatorsmile e5d703bca8 [SPARK-15706][SQL] Fix Wrong Answer when using IF NOT EXISTS in INSERT OVERWRITE for DYNAMIC PARTITION
#### What changes were proposed in this pull request?
`IF NOT EXISTS` in `INSERT OVERWRITE` should not support dynamic partitions. If we specify `IF NOT EXISTS`, the inserted statement is not shown in the table.

This PR is to issue an exception in this case, just like what Hive does. Also issue an exception if users specify `IF NOT EXISTS` if users do not specify any `PARTITION` specification.

#### How was this patch tested?
Added test cases into `PlanParserSuite` and `InsertIntoHiveTableSuite`

Author: gatorsmile <gatorsmile@gmail.com>

Closes #13447 from gatorsmile/insertIfNotExist.
2016-06-16 22:54:02 -07:00
Pete Robbins 5ada606144 [SPARK-15822] [SQL] Prevent byte array backed classes from referencing freed memory
## What changes were proposed in this pull request?
`UTF8String` and all `Unsafe*` classes are backed by either on-heap or off-heap byte arrays. The code generated version `SortMergeJoin` buffers the left hand side join keys during iteration. This was actually problematic in off-heap mode when one of the keys is a `UTF8String` (or any other 'Unsafe*` object) and the left hand side iterator was exhausted (and released its memory); the buffered keys would reference freed memory. This causes Seg-faults and all kinds of other undefined behavior when we would use one these buffered keys.

This PR fixes this problem by creating copies of the buffered variables. I have added a general method to the `CodeGenerator` for this. I have checked all places in which this could happen, and only `SortMergeJoin` had this problem.

This PR is largely based on the work of robbinspg and he should be credited for this.

closes https://github.com/apache/spark/pull/13707

## How was this patch tested?
Manually tested on problematic workloads.

Author: Pete Robbins <robbinspg@gmail.com>
Author: Herman van Hovell <hvanhovell@databricks.com>

Closes #13723 from hvanhovell/SPARK-15822-2.
2016-06-16 22:27:32 -07:00
Dongjoon Hyun 2d27eb1e75 [MINOR][DOCS][SQL] Fix some comments about types(TypeCoercion,Partition) and exceptions.
## What changes were proposed in this pull request?

This PR contains a few changes on code comments.
- `HiveTypeCoercion` is renamed into `TypeCoercion`.
- `NoSuchDatabaseException` is only used for the absence of database.
- For partition type inference, only `DoubleType` is considered.

## How was this patch tested?

N/A

Author: Dongjoon Hyun <dongjoon@apache.org>

Closes #13674 from dongjoon-hyun/minor_doc_types.
2016-06-16 14:27:09 -07:00
bomeng bbad4cb48d [SPARK-15978][SQL] improve 'show tables' command related codes
## What changes were proposed in this pull request?

I've found some minor issues in "show tables" command:

1. In the `SessionCatalog.scala`, `listTables(db: String)` method will call `listTables(formatDatabaseName(db), "*")` to list all the tables for certain db, but in the method `listTables(db: String, pattern: String)`, this db name is formatted once more. So I think we should remove
`formatDatabaseName()` in the caller.

2. I suggest to add sort to listTables(db: String) in InMemoryCatalog.scala, just like listDatabases().

## How was this patch tested?

The existing test cases should cover it.

Author: bomeng <bmeng@us.ibm.com>

Closes #13695 from bomeng/SPARK-15978.
2016-06-16 14:18:02 -07:00
gatorsmile 6451cf9270 [SPARK-15862][SQL] Better Error Message When Having Database Name in CACHE TABLE AS SELECT
#### What changes were proposed in this pull request?
~~If the temp table already exists, we should not silently replace it when doing `CACHE TABLE AS SELECT`. This is inconsistent with the behavior of `CREAT VIEW` or `CREATE TABLE`. This PR is to fix this silent drop.~~

~~Maybe, we also can introduce new syntax for replacing the existing one. For example, in Hive, to replace a view, the syntax should be like `ALTER VIEW AS SELECT` or `CREATE OR REPLACE VIEW AS SELECT`~~

The table name in `CACHE TABLE AS SELECT` should NOT contain database prefix like "database.table". Thus, this PR captures this in Parser and outputs a better error message, instead of reporting the view already exists.

In addition, refactoring the `Parser` to generate table identifiers instead of returning the table name string.

#### How was this patch tested?
- Added a test case for caching and uncaching qualified table names
- Fixed a few test cases that do not drop temp table at the end
- Added the related test case for the issue resolved in this PR

Author: gatorsmile <gatorsmile@gmail.com>
Author: xiaoli <lixiao1983@gmail.com>
Author: Xiao Li <xiaoli@Xiaos-MacBook-Pro.local>

Closes #13572 from gatorsmile/cacheTableAsSelect.
2016-06-16 10:01:59 -07:00
Narine Kokhlikyan 7c6c692637 [SPARK-12922][SPARKR][WIP] Implement gapply() on DataFrame in SparkR
## What changes were proposed in this pull request?

gapply() applies an R function on groups grouped by one or more columns of a DataFrame, and returns a DataFrame. It is like GroupedDataSet.flatMapGroups() in the Dataset API.

Please, let me know what do you think and if you have any ideas to improve it.

Thank you!

## How was this patch tested?
Unit tests.
1. Primitive test with different column types
2. Add a boolean column
3. Compute average by a group

Author: Narine Kokhlikyan <narine.kokhlikyan@gmail.com>
Author: NarineK <narine.kokhlikyan@us.ibm.com>

Closes #12836 from NarineK/gapply2.
2016-06-15 21:42:05 -07:00
Sean Zhong 9bd80ad6bd [SPARK-15776][SQL] Divide Expression inside Aggregation function is casted to wrong type
## What changes were proposed in this pull request?

This PR fixes the problem that Divide Expression inside Aggregation function is casted to wrong type, which cause `select 1/2` and `select sum(1/2)`returning different result.

**Before the change:**

```
scala> sql("select 1/2 as a").show()
+---+
|  a|
+---+
|0.5|
+---+

scala> sql("select sum(1/2) as a").show()
+---+
|  a|
+---+
|0  |
+---+

scala> sql("select sum(1 / 2) as a").schema
res4: org.apache.spark.sql.types.StructType = StructType(StructField(a,LongType,true))
```

**After the change:**

```
scala> sql("select 1/2 as a").show()
+---+
|  a|
+---+
|0.5|
+---+

scala> sql("select sum(1/2) as a").show()
+---+
|  a|
+---+
|0.5|
+---+

scala> sql("select sum(1/2) as a").schema
res4: org.apache.spark.sql.types.StructType = StructType(StructField(a,DoubleType,true))
```

## How was this patch tested?

Unit test.

This PR is based on https://github.com/apache/spark/pull/13524 by Sephiroth-Lin

Author: Sean Zhong <seanzhong@databricks.com>

Closes #13651 from clockfly/SPARK-15776.
2016-06-15 14:34:15 -07:00
bomeng 42a28caf10 [SPARK-15952][SQL] fix "show databases" ordering issue
## What changes were proposed in this pull request?

Two issues I've found for "show databases" command:

1. The returned database name list was not sorted, it only works when "like" was used together; (HIVE will always return a sorted list)

2. When it is used as sql("show databases").show, it will output a table with column named as "result", but for sql("show tables").show, it will output the column name as "tableName", so I think we should be consistent and use "databaseName" at least.

## How was this patch tested?

Updated existing test case to test its ordering as well.

Author: bomeng <bmeng@us.ibm.com>

Closes #13671 from bomeng/SPARK-15952.
2016-06-14 18:35:29 -07:00
Takuya UESHIN c5b7355819 [SPARK-15915][SQL] Logical plans should use canonicalized plan when override sameResult.
## What changes were proposed in this pull request?

`DataFrame` with plan overriding `sameResult` but not using canonicalized plan to compare can't cacheTable.

The example is like:

```
    val localRelation = Seq(1, 2, 3).toDF()
    localRelation.createOrReplaceTempView("localRelation")

    spark.catalog.cacheTable("localRelation")
    assert(
      localRelation.queryExecution.withCachedData.collect {
        case i: InMemoryRelation => i
      }.size == 1)
```

and this will fail as:

```
ArrayBuffer() had size 0 instead of expected size 1
```

The reason is that when do `spark.catalog.cacheTable("localRelation")`, `CacheManager` tries to cache for the plan wrapped by `SubqueryAlias` but when planning for the DataFrame `localRelation`, `CacheManager` tries to find cached table for the not-wrapped plan because the plan for DataFrame `localRelation` is not wrapped.
Some plans like `LocalRelation`, `LogicalRDD`, etc. override `sameResult` method, but not use canonicalized plan to compare so the `CacheManager` can't detect the plans are the same.

This pr modifies them to use canonicalized plan when override `sameResult` method.

## How was this patch tested?

Added a test to check if DataFrame with plan overriding sameResult but not using canonicalized plan to compare can cacheTable.

Author: Takuya UESHIN <ueshin@happy-camper.st>

Closes #13638 from ueshin/issues/SPARK-15915.
2016-06-14 10:52:13 -07:00
Wenchen Fan 688b6ef9dc [SPARK-15932][SQL][DOC] document the contract of encoder serializer expressions
## What changes were proposed in this pull request?

In our encoder framework, we imply that serializer expressions should use `BoundReference` to refer to the input object, and a lot of codes depend on this contract(e.g. ExpressionEncoder.tuple).  This PR adds some document and assert in `ExpressionEncoder` to make it clearer.

## How was this patch tested?

existing tests

Author: Wenchen Fan <wenchen@databricks.com>

Closes #13648 from cloud-fan/comment.
2016-06-13 22:02:23 -07:00
Sandeep Singh 1842cdd4ee [SPARK-15663][SQL] SparkSession.catalog.listFunctions shouldn't include the list of built-in functions
## What changes were proposed in this pull request?
SparkSession.catalog.listFunctions currently returns all functions, including the list of built-in functions. This makes the method not as useful because anytime it is run the result set contains over 100 built-in functions.

## How was this patch tested?
CatalogSuite

Author: Sandeep Singh <sandeep@techaddict.me>

Closes #13413 from techaddict/SPARK-15663.
2016-06-13 21:58:52 -07:00
Sean Zhong 7b9071eeaa [SPARK-15910][SQL] Check schema consistency when using Kryo encoder to convert DataFrame to Dataset
## What changes were proposed in this pull request?

This PR enforces schema check when converting DataFrame to Dataset using Kryo encoder. For example.

**Before the change:**

Schema is NOT checked when converting DataFrame to Dataset using kryo encoder.
```
scala> case class B(b: Int)
scala> implicit val encoder = Encoders.kryo[B]
scala> val df = Seq((1)).toDF("b")
scala> val ds = df.as[B] // Schema compatibility is NOT checked
```

**After the change:**
Report AnalysisException since the schema is NOT compatible.
```
scala> val ds = Seq((1)).toDF("b").as[B]
org.apache.spark.sql.AnalysisException: cannot resolve 'CAST(`b` AS BINARY)' due to data type mismatch: cannot cast IntegerType to BinaryType;
...
```

## How was this patch tested?

Unit test.

Author: Sean Zhong <seanzhong@databricks.com>

Closes #13632 from clockfly/spark-15910.
2016-06-13 17:43:55 -07:00
Herman van Hövell tot Westerflier 1f8f2b5c2a [SPARK-15370][SQL] Fix count bug
# What changes were proposed in this pull request?
This pull request fixes the COUNT bug in the `RewriteCorrelatedScalarSubquery` rule.

After this change, the rule tests the expression at the root of the correlated subquery to determine whether the expression returns `NULL` on empty input. If the expression does not return `NULL`, the rule generates additional logic in the `Project` operator above the rewritten subquery. This additional logic intercepts `NULL` values coming from the outer join and replaces them with the value that the subquery's expression would return on empty input.

This PR takes over https://github.com/apache/spark/pull/13155. It only fixes an issue with `Literal` construction and style issues.  All credits should go frreiss.

# How was this patch tested?
Added regression tests to cover all branches of the updated rule (see changes to `SubquerySuite`).
Ran all existing automated regression tests after merging with latest trunk.

Author: frreiss <frreiss@us.ibm.com>
Author: Herman van Hovell <hvanhovell@databricks.com>

Closes #13629 from hvanhovell/SPARK-15370-cleanup.
2016-06-12 21:30:32 -07:00
Wenchen Fan f5d38c3925 Revert "[SPARK-15753][SQL] Move Analyzer stuff to Analyzer from DataFrameWriter"
This reverts commit 0ec279ffdf.
2016-06-12 16:52:15 -07:00
Herman van Hovell 20b8f2c32a [SPARK-15370][SQL] Revert PR "Update RewriteCorrelatedSuquery rule"
This reverts commit 9770f6ee60.

Author: Herman van Hovell <hvanhovell@databricks.com>

Closes #13626 from hvanhovell/SPARK-15370-revert.
2016-06-12 15:06:37 -07:00
Ioana Delaney 0ff8a68b9f [SPARK-15832][SQL] Embedded IN/EXISTS predicate subquery throws TreeNodeException
## What changes were proposed in this pull request?
Queries with embedded existential sub-query predicates throws exception when building the physical plan.

Example failing query:
```SQL
scala> Seq((1, 1), (2, 2)).toDF("c1", "c2").createOrReplaceTempView("t1")
scala> Seq((1, 1), (2, 2)).toDF("c1", "c2").createOrReplaceTempView("t2")
scala> sql("select c1 from t1 where (case when c2 in (select c2 from t2) then 2 else 3 end) IN (select c2 from t1)").show()

Binding attribute, tree: c2#239
org.apache.spark.sql.catalyst.errors.package$TreeNodeException: Binding attribute, tree: c2#239
  at org.apache.spark.sql.catalyst.errors.package$.attachTree(package.scala:50)
  at org.apache.spark.sql.catalyst.expressions.BindReferences$$anonfun$bindReference$1.applyOrElse(BoundAttribute.scala:88)

  ...
  at org.apache.spark.sql.catalyst.expressions.BindReferences$.bindReference(BoundAttribute.scala:87)
  at org.apache.spark.sql.execution.joins.HashJoin$$anonfun$4.apply(HashJoin.scala:66)
  at org.apache.spark.sql.execution.joins.HashJoin$$anonfun$4.apply(HashJoin.scala:66)
  at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
  at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
  at scala.collection.immutable.List.foreach(List.scala:381)
  at scala.collection.TraversableLike$class.map(TraversableLike.scala:234)
  at scala.collection.immutable.List.map(List.scala:285)
  at org.apache.spark.sql.execution.joins.HashJoin$class.org$apache$spark$sql$execution$joins$HashJoin$$x$8(HashJoin.scala:66)
  at org.apache.spark.sql.execution.joins.BroadcastHashJoinExec.org$apache$spark$sql$execution$joins$HashJoin$$x$8$lzycompute(BroadcastHashJoinExec.scala:38)
  at org.apache.spark.sql.execution.joins.BroadcastHashJoinExec.org$apache$spark$sql$execution$joins$HashJoin$$x$8(BroadcastHashJoinExec.scala:38)
  at org.apache.spark.sql.execution.joins.HashJoin$class.buildKeys(HashJoin.scala:63)
  at org.apache.spark.sql.execution.joins.BroadcastHashJoinExec.buildKeys$lzycompute(BroadcastHashJoinExec.scala:38)
  at org.apache.spark.sql.execution.joins.BroadcastHashJoinExec.buildKeys(BroadcastHashJoinExec.scala:38)
  at org.apache.spark.sql.execution.joins.BroadcastHashJoinExec.requiredChildDistribution(BroadcastHashJoinExec.scala:52)
```

**Problem description:**
When the left hand side expression of an existential sub-query predicate contains another embedded sub-query predicate, the RewritePredicateSubquery optimizer rule does not resolve the embedded sub-query expressions into existential joins.For example, the above query has the following optimized plan, which fails during physical plan build.

```SQL
== Optimized Logical Plan ==
Project [_1#224 AS c1#227]
+- Join LeftSemi, (CASE WHEN predicate-subquery#255 [(_2#225 = c2#239)] THEN 2 ELSE 3 END = c2#228#262)
   :  +- SubqueryAlias predicate-subquery#255 [(_2#225 = c2#239)]
   :     +- LocalRelation [c2#239]
   :- LocalRelation [_1#224, _2#225]
   +- LocalRelation [c2#228#262]

== Physical Plan ==
org.apache.spark.sql.catalyst.errors.package$TreeNodeException: Binding attribute, tree: c2#239
```

**Solution:**
In RewritePredicateSubquery, before rewriting the outermost predicate sub-query, resolve any embedded existential sub-queries. The Optimized plan for the above query after the changes looks like below.

```SQL
== Optimized Logical Plan ==
Project [_1#224 AS c1#227]
+- Join LeftSemi, (CASE WHEN exists#285 THEN 2 ELSE 3 END = c2#228#284)
   :- Join ExistenceJoin(exists#285), (_2#225 = c2#239)
   :  :- LocalRelation [_1#224, _2#225]
   :  +- LocalRelation [c2#239]
   +- LocalRelation [c2#228#284]

== Physical Plan ==
*Project [_1#224 AS c1#227]
+- *BroadcastHashJoin [CASE WHEN exists#285 THEN 2 ELSE 3 END], [c2#228#284], LeftSemi, BuildRight
   :- *BroadcastHashJoin [_2#225], [c2#239], ExistenceJoin(exists#285), BuildRight
   :  :- LocalTableScan [_1#224, _2#225]
   :  +- BroadcastExchange HashedRelationBroadcastMode(List(cast(input[0, int, false] as bigint)))
   :     +- LocalTableScan [c2#239]
   +- BroadcastExchange HashedRelationBroadcastMode(List(cast(input[0, int, false] as bigint)))
      +- LocalTableScan [c2#228#284]
      +- LocalTableScan [c222#36], [[111],[222]]
```

## How was this patch tested?
Added new test cases in SubquerySuite.scala

Author: Ioana Delaney <ioanamdelaney@gmail.com>

Closes #13570 from ioana-delaney/fixEmbedSubPredV1.
2016-06-12 14:26:29 -07:00
frreiss 9770f6ee60 [SPARK-15370][SQL] Update RewriteCorrelatedScalarSubquery rule to fix COUNT bug
## What changes were proposed in this pull request?
This pull request fixes the COUNT bug in the `RewriteCorrelatedScalarSubquery` rule.

After this change, the rule tests the expression at the root of the correlated subquery to determine whether the expression returns NULL on empty input. If the expression does not return NULL, the rule generates additional logic in the Project operator above the rewritten subquery.  This additional logic intercepts NULL values coming from the outer join and replaces them with the value that the subquery's expression would return on empty input.

## How was this patch tested?
Added regression tests to cover all branches of the updated rule (see changes to `SubquerySuite.scala`).
Ran all existing automated regression tests after merging with latest trunk.

Author: frreiss <frreiss@us.ibm.com>

Closes #13155 from frreiss/master.
2016-06-12 14:21:10 -07:00
Eric Liang e1f986c7a3 [SPARK-15860] Metrics for codegen size and perf
## What changes were proposed in this pull request?

Adds codahale metrics for the codegen source text size and how long it takes to compile. The size is particularly interesting, since the JVM does have hard limits on how large methods can get.

To simplify, I added the metrics under a statically-initialized source that is always registered with SparkEnv.

## How was this patch tested?

Unit tests

Author: Eric Liang <ekl@databricks.com>

Closes #13586 from ericl/spark-15860.
2016-06-11 23:16:21 -07:00
Eric Liang c06c58bbbb [SPARK-14851][CORE] Support radix sort with nullable longs
## What changes were proposed in this pull request?

This adds support for radix sort of nullable long fields. When a sort field is null and radix sort is enabled, we keep nulls in a separate region of the sort buffer so that radix sort does not need to deal with them. This also has performance benefits when sorting smaller integer types, since the current representation of nulls in two's complement (Long.MIN_VALUE) otherwise forces a full-width radix sort.

This strategy for nulls does mean the sort is no longer stable. cc davies

## How was this patch tested?

Existing randomized sort tests for correctness. I also tested some TPCDS queries and there does not seem to be any significant regression for non-null sorts.

Some test queries (best of 5 runs each).
Before change:
scala> val start = System.nanoTime; spark.range(5000000).selectExpr("if(id > 5, cast(hash(id) as long), NULL) as h").coalesce(1).orderBy("h").collect(); (System.nanoTime - start) / 1e6
start: Long = 3190437233227987
res3: Double = 4716.471091

After change:
scala> val start = System.nanoTime; spark.range(5000000).selectExpr("if(id > 5, cast(hash(id) as long), NULL) as h").coalesce(1).orderBy("h").collect(); (System.nanoTime - start) / 1e6
start: Long = 3190367870952791
res4: Double = 2981.143045

Author: Eric Liang <ekl@databricks.com>

Closes #13161 from ericl/sc-2998.
2016-06-11 15:42:58 -07:00
Sameer Agarwal 468da03e23 [SPARK-15678] Add support to REFRESH data source paths
## What changes were proposed in this pull request?

Spark currently incorrectly continues to use cached data even if the underlying data is overwritten.

Current behavior:
```scala
val dir = "/tmp/test"
sqlContext.range(1000).write.mode("overwrite").parquet(dir)
val df = sqlContext.read.parquet(dir).cache()
df.count() // outputs 1000
sqlContext.range(10).write.mode("overwrite").parquet(dir)
sqlContext.read.parquet(dir).count() // outputs 1000 <---- We are still using the cached dataset
```

This patch fixes this bug by adding support for `REFRESH path` that invalidates and refreshes all the cached data (and the associated metadata) for any dataframe that contains the given data source path.

Expected behavior:
```scala
val dir = "/tmp/test"
sqlContext.range(1000).write.mode("overwrite").parquet(dir)
val df = sqlContext.read.parquet(dir).cache()
df.count() // outputs 1000
sqlContext.range(10).write.mode("overwrite").parquet(dir)
spark.catalog.refreshResource(dir)
sqlContext.read.parquet(dir).count() // outputs 10 <---- We are not using the cached dataset
```

## How was this patch tested?

Unit tests for overwrites and appends in `ParquetQuerySuite` and `CachedTableSuite`.

Author: Sameer Agarwal <sameer@databricks.com>

Closes #13566 from sameeragarwal/refresh-path-2.
2016-06-10 20:43:18 -07:00
Cheng Lian 8e7b56f3d4 Revert "[SPARK-15639][SQL] Try to push down filter at RowGroups level for parquet reader"
This reverts commit bba5d7999f.
2016-06-10 20:41:48 -07:00
Liang-Chi Hsieh bba5d7999f [SPARK-15639][SQL] Try to push down filter at RowGroups level for parquet reader
## What changes were proposed in this pull request?

The base class `SpecificParquetRecordReaderBase` used for vectorized parquet reader will try to get pushed-down filters from the given configuration. This pushed-down filters are used for RowGroups-level filtering. However, we don't set up the filters to push down into the configuration. In other words, the filters are not actually pushed down to do RowGroups-level filtering. This patch is to fix this and tries to set up the filters for pushing down to configuration for the reader.

## How was this patch tested?
Existing tests should be passed.

Author: Liang-Chi Hsieh <simonh@tw.ibm.com>

Closes #13371 from viirya/vectorized-reader-push-down-filter.
2016-06-10 18:23:59 -07:00
Narine Kokhlikyan 54f758b5fc [SPARK-15884][SPARKR][SQL] Overriding stringArgs in MapPartitionsInR
## What changes were proposed in this pull request?

As discussed in https://github.com/apache/spark/pull/12836
we need to override stringArgs method in MapPartitionsInR in order to avoid too large strings generated by "stringArgs" method based on the input arguments.

In this case exclude some of the input arguments: serialized R objects.

## How was this patch tested?
Existing test cases

Author: Narine Kokhlikyan <narine.kokhlikyan@gmail.com>

Closes #13610 from NarineK/dapply_MapPartitionsInR_stringArgs.
2016-06-10 17:17:47 -07:00
Sela 127a6678d7 [SPARK-15489][SQL] Dataset kryo encoder won't load custom user settings
## What changes were proposed in this pull request?

Serializer instantiation will consider existing SparkConf

## How was this patch tested?
manual test with `ImmutableList` (Guava) and `kryo-serializers`'s `Immutable*Serializer` implementations.

Added Test Suite.

(If this patch involves UI changes, please attach a screenshot; otherwise, remove this)

Author: Sela <ansela@paypal.com>

Closes #13424 from amitsela/SPARK-15489.
2016-06-10 14:36:51 -07:00
Herman van Hovell e05a2feebe [SPARK-15825] [SQL] Fix SMJ invalid results
## What changes were proposed in this pull request?
Code generated `SortMergeJoin` failed with wrong results when using structs as keys. This could (eventually) be traced back to the use of a wrong row reference when comparing structs.

## How was this patch tested?
TBD

Author: Herman van Hovell <hvanhovell@databricks.com>

Closes #13589 from hvanhovell/SPARK-15822.
2016-06-10 14:29:05 -07:00
wangyang 026eb90644 [SPARK-15875] Try to use Seq.isEmpty and Seq.nonEmpty instead of Seq.length == 0 and Seq.length > 0
## What changes were proposed in this pull request?

In scala, immutable.List.length is an expensive operation so we should
avoid using Seq.length == 0 or Seq.lenth > 0, and use Seq.isEmpty and Seq.nonEmpty instead.

## How was this patch tested?
existing tests

Author: wangyang <wangyang@haizhi.com>

Closes #13601 from yangw1234/isEmpty.
2016-06-10 13:10:03 -07:00
Sandeep Singh 865ec32dd9 [MINOR][X][X] Replace all occurrences of None: Option with Option.empty
## What changes were proposed in this pull request?
Replace all occurrences of `None: Option[X]` with `Option.empty[X]`

## How was this patch tested?
Exisiting Tests

Author: Sandeep Singh <sandeep@techaddict.me>

Closes #13591 from techaddict/minor-7.
2016-06-10 13:06:51 -07:00
Takuya UESHIN 667d4ea7b3 [SPARK-6320][SQL] Move planLater method into GenericStrategy.
## What changes were proposed in this pull request?

This PR moves `QueryPlanner.planLater()` method into `GenericStrategy` for extra strategies to be able to use `planLater` in its strategy.

## How was this patch tested?

Existing tests.

Author: Takuya UESHIN <ueshin@happy-camper.st>

Closes #13147 from ueshin/issues/SPARK-6320.
2016-06-10 13:06:18 -07:00
Liang-Chi Hsieh 0ec279ffdf [SPARK-15753][SQL] Move Analyzer stuff to Analyzer from DataFrameWriter
## What changes were proposed in this pull request?

This patch moves some codes in `DataFrameWriter.insertInto` that belongs to `Analyzer`.

## How was this patch tested?
Existing tests.

Author: Liang-Chi Hsieh <simonh@tw.ibm.com>

Closes #13496 from viirya/move-analyzer-stuff.
2016-06-10 11:05:04 -07:00
Tathagata Das abdb5d42c5 [SPARK-15812][SQ][STREAMING] Added support for sorting after streaming aggregation with complete mode
## What changes were proposed in this pull request?

When the output mode is complete, then the output of a streaming aggregation essentially will contain the complete aggregates every time. So this is not different from a batch dataset within an incremental execution. Other non-streaming operations should be supported on this dataset. In this PR, I am just adding support for sorting, as it is a common useful functionality. Support for other operations will come later.

## How was this patch tested?
Additional unit tests.

Author: Tathagata Das <tathagata.das1565@gmail.com>

Closes #13549 from tdas/SPARK-15812.
2016-06-10 10:48:28 -07:00
Eric Liang b914e1930f [SPARK-15794] Should truncate toString() of very wide plans
## What changes were proposed in this pull request?

With very wide tables, e.g. thousands of fields, the plan output is unreadable and often causes OOMs due to inefficient string processing. This truncates all struct and operator field lists to a user configurable threshold to limit performance impact.

It would also be nice to optimize string generation to avoid these sort of O(n^2) slowdowns entirely (i.e. use StringBuilder everywhere including expressions), but this is probably too large of a change for 2.0 at this point, and truncation has other benefits for usability.

## How was this patch tested?

Added a microbenchmark that covers this case particularly well. I also ran the microbenchmark while varying the truncation threshold.

```
numFields = 5
wide shallowly nested struct field r/w:  Best/Avg Time(ms)    Rate(M/s)   Per Row(ns)   Relative
------------------------------------------------------------------------------------------------
2000 wide x 50 rows (write in-mem)            2336 / 2558          0.0       23364.4       0.1X

numFields = 25
wide shallowly nested struct field r/w:  Best/Avg Time(ms)    Rate(M/s)   Per Row(ns)   Relative
------------------------------------------------------------------------------------------------
2000 wide x 50 rows (write in-mem)            4237 / 4465          0.0       42367.9       0.1X

numFields = 100
wide shallowly nested struct field r/w:  Best/Avg Time(ms)    Rate(M/s)   Per Row(ns)   Relative
------------------------------------------------------------------------------------------------
2000 wide x 50 rows (write in-mem)          10458 / 11223          0.0      104582.0       0.0X

numFields = Infinity
wide shallowly nested struct field r/w:  Best/Avg Time(ms)    Rate(M/s)   Per Row(ns)   Relative
------------------------------------------------------------------------------------------------
[info]   java.lang.OutOfMemoryError: Java heap space
```

Author: Eric Liang <ekl@databricks.com>
Author: Eric Liang <ekhliang@gmail.com>

Closes #13537 from ericl/truncated-string.
2016-06-09 18:05:16 -07:00
Herman van Hovell b0768538e5 [SPARK-14321][SQL] Reduce date format cost and string-to-date cost in date functions
## What changes were proposed in this pull request?
The current implementations of `UnixTime` and `FromUnixTime` do not cache their parser/formatter as much as they could. This PR resolved this issue.

This PR is a take over from https://github.com/apache/spark/pull/13522 and further optimizes the re-use of the parser/formatter. It also fixes the improves handling (catching the actual exception instead of `Throwable`). All credits for this work should go to rajeshbalamohan.

This PR closes https://github.com/apache/spark/pull/13522

## How was this patch tested?
Current tests.

Author: Herman van Hovell <hvanhovell@databricks.com>
Author: Rajesh Balamohan <rbalamohan@apache.org>

Closes #13581 from hvanhovell/SPARK-14321.
2016-06-09 16:37:18 -07:00
Kevin Yu 99386fe398 [SPARK-15804][SQL] Include metadata in the toStructType
## What changes were proposed in this pull request?
The help function 'toStructType' in the AttributeSeq class doesn't include the metadata when it builds the StructField, so it causes this reported problem https://issues.apache.org/jira/browse/SPARK-15804?jql=project%20%3D%20SPARK when spark writes the the dataframe with the metadata to the parquet datasource.

The code path is when spark writes the dataframe to the parquet datasource through the InsertIntoHadoopFsRelationCommand, spark will build the WriteRelation container, and it will call the help function 'toStructType' to create StructType which contains StructField, it should include the metadata there, otherwise, we will lost the user provide metadata.

## How was this patch tested?

added test case in ParquetQuerySuite.scala

(If this patch involves UI changes, please attach a screenshot; otherwise, remove this)

Author: Kevin Yu <qyu@us.ibm.com>

Closes #13555 from kevinyu98/spark-15804.
2016-06-09 09:50:09 -07:00
Herman van Hovell 91fbc880b6 [SPARK-15789][SQL] Allow reserved keywords in most places
## What changes were proposed in this pull request?
The parser currently does not allow the use of some SQL keywords as table or field names. This PR adds supports for all keywords as identifier. The exception to this are table aliases, in this case most keywords are allowed except for join keywords (```anti, full, inner, left, semi, right, natural, on, join, cross```) and set-operator keywords (```union, intersect, except```).

## How was this patch tested?
I have added/move/renamed test in the catalyst `*ParserSuite`s.

Author: Herman van Hovell <hvanhovell@databricks.com>

Closes #13534 from hvanhovell/SPARK-15789.
2016-06-07 17:01:11 -07:00
Sean Zhong 890baaca50 [SPARK-15674][SQL] Deprecates "CREATE TEMPORARY TABLE USING...", uses "CREAT TEMPORARY VIEW USING..." instead
## What changes were proposed in this pull request?

The current implementation of "CREATE TEMPORARY TABLE USING datasource..." is NOT creating any intermediate temporary data directory like temporary HDFS folder, instead, it only stores a SQL string in memory. Probably we should use "TEMPORARY VIEW" instead.

This PR assumes a temporary table has to link with some temporary intermediate data. It follows the definition of temporary table like this (from [hortonworks doc](https://docs.hortonworks.com/HDPDocuments/HDP2/HDP-2.3.0/bk_dataintegration/content/temp-tables.html)):
> A temporary table is a convenient way for an application to automatically manage intermediate data generated during a complex query

**Example**:

```
scala> spark.sql("CREATE temporary view  my_tab7 (c1: String, c2: String)  USING org.apache.spark.sql.execution.datasources.csv.CSVFileFormat  OPTIONS (PATH '/Users/seanzhong/csv/cars.csv')")
scala> spark.sql("select c1, c2 from my_tab7").show()
+----+-----+
|  c1|   c2|
+----+-----+
|year| make|
|2012|Tesla|
...
```

It NOW prints a **deprecation warning** if "CREATE TEMPORARY TABLE USING..." is used.

```
scala> spark.sql("CREATE temporary table  my_tab7 (c1: String, c2: String)  USING org.apache.spark.sql.execution.datasources.csv.CSVFileFormat  OPTIONS (PATH '/Users/seanzhong/csv/cars.csv')")
16/05/31 10:39:27 WARN SparkStrategies$DDLStrategy: CREATE TEMPORARY TABLE tableName USING... is deprecated, please use CREATE TEMPORARY VIEW viewName USING... instead
```

## How was this patch tested?

Unit test.

Author: Sean Zhong <seanzhong@databricks.com>

Closes #13414 from clockfly/create_temp_view_using.
2016-06-07 15:21:55 -07:00
Sean Zhong 5f731d6859 [SPARK-15792][SQL] Allows operator to change the verbosity in explain output
## What changes were proposed in this pull request?

This PR allows customization of verbosity in explain output. After change, `dataframe.explain()` and `dataframe.explain(true)` has different verbosity output for physical plan.

Currently, this PR only enables verbosity string for operator `HashAggregateExec` and `SortAggregateExec`. We will gradually enable verbosity string for more operators in future.

**Less verbose mode:** dataframe.explain(extended = false)

`output=[count(a)#85L]` is **NOT** displayed for HashAggregate.

```
scala> Seq((1,2,3)).toDF("a", "b", "c").createTempView("df2")
scala> spark.sql("select count(a) from df2").explain()
== Physical Plan ==
*HashAggregate(key=[], functions=[count(1)])
+- Exchange SinglePartition
   +- *HashAggregate(key=[], functions=[partial_count(1)])
      +- LocalTableScan
```

**Verbose mode:** dataframe.explain(extended = true)

`output=[count(a)#85L]` is displayed for HashAggregate.

```
scala> spark.sql("select count(a) from df2").explain(true)  // "output=[count(a)#85L]" is added
...
== Physical Plan ==
*HashAggregate(key=[], functions=[count(1)], output=[count(a)#85L])
+- Exchange SinglePartition
   +- *HashAggregate(key=[], functions=[partial_count(1)], output=[count#87L])
      +- LocalTableScan
```

## How was this patch tested?

Manual test.

Author: Sean Zhong <seanzhong@databricks.com>

Closes #13535 from clockfly/verbose_breakdown_2.
2016-06-06 22:59:25 -07:00
Sean Zhong 0e0904a2fc [SPARK-15632][SQL] Typed Filter should NOT change the Dataset schema
## What changes were proposed in this pull request?

This PR makes sure the typed Filter doesn't change the Dataset schema.

**Before the change:**

```
scala> val df = spark.range(0,9)
scala> df.schema
res12: org.apache.spark.sql.types.StructType = StructType(StructField(id,LongType,false))
scala> val afterFilter = df.filter(_=>true)
scala> afterFilter.schema   // !!! schema is CHANGED!!! Column name is changed from id to value, nullable is changed from false to true.
res13: org.apache.spark.sql.types.StructType = StructType(StructField(value,LongType,true))

```

SerializeFromObject and DeserializeToObject are inserted to wrap the Filter, and these two can possibly change the schema of Dataset.

**After the change:**

```
scala> afterFilter.schema   // schema is NOT changed.
res47: org.apache.spark.sql.types.StructType = StructType(StructField(id,LongType,false))
```

## How was this patch tested?

Unit test.

Author: Sean Zhong <seanzhong@databricks.com>

Closes #13529 from clockfly/spark-15632.
2016-06-06 22:40:21 -07:00
Josh Rosen 0b8d694999 [SPARK-15764][SQL] Replace N^2 loop in BindReferences
BindReferences contains a n^2 loop which causes performance issues when operating over large schemas: to determine the ordinal of an attribute reference, we perform a linear scan over the `input` array. Because input can sometimes be a `List`, the call to `input(ordinal).nullable` can also be O(n).

Instead of performing a linear scan, we can convert the input into an array and build a hash map to map from expression ids to ordinals. The greater up-front cost of the map construction is offset by the fact that an expression can contain multiple attribute references, so the cost of the map construction is amortized across a number of lookups.

Perf. benchmarks to follow. /cc ericl

Author: Josh Rosen <joshrosen@databricks.com>

Closes #13505 from JoshRosen/bind-references-improvement.
2016-06-06 11:44:51 -07:00
Zheng RuiFeng fd8af39713 [MINOR] Fix Typos 'an -> a'
## What changes were proposed in this pull request?

`an -> a`

Use cmds like `find . -name '*.R' | xargs -i sh -c "grep -in ' an [^aeiou]' {} && echo {}"` to generate candidates, and review them one by one.

## How was this patch tested?
manual tests

Author: Zheng RuiFeng <ruifengz@foxmail.com>

Closes #13515 from zhengruifeng/an_a.
2016-06-06 09:35:47 +01:00
Wenchen Fan 30c4774f33 [SPARK-15657][SQL] RowEncoder should validate the data type of input object
## What changes were proposed in this pull request?

This PR improves the error handling of `RowEncoder`. When we create a `RowEncoder` with a given schema, we should validate the data type of input object. e.g. we should throw an exception when a field is boolean but is declared as a string column.

This PR also removes the support to use `Product` as a valid external type of struct type.  This support is added at https://github.com/apache/spark/pull/9712, but is incomplete, e.g. nested product, product in array are both not working.  However, we never officially support this feature and I think it's ok to ban it.

## How was this patch tested?

new tests in `RowEncoderSuite`.

Author: Wenchen Fan <wenchen@databricks.com>

Closes #13401 from cloud-fan/bug.
2016-06-05 15:59:52 -07:00
Weiqing Yang 0f307db5e1 [SPARK-15707][SQL] Make Code Neat - Use map instead of if check.
## What changes were proposed in this pull request?
In forType function of object RandomDataGenerator, the code following:
if (maybeSqlTypeGenerator.isDefined){
  ....
  Some(generator)
} else{
 None
}
will be changed. Instead, maybeSqlTypeGenerator.map will be used.

## How was this patch tested?
All of the current unit tests passed.

Author: Weiqing Yang <yangweiqing001@gmail.com>

Closes #13448 from Sherry302/master.
2016-06-04 22:44:03 +01:00
Josh Rosen 091f81e1f7 [SPARK-15762][SQL] Cache Metadata & StructType hashCodes; use singleton Metadata.empty
We should cache `Metadata.hashCode` and use a singleton for `Metadata.empty` because calculating metadata hashCodes appears to be a bottleneck for certain workloads.

We should also cache `StructType.hashCode`.

In an optimizer stress-test benchmark run by ericl, these `hashCode` calls accounted for roughly 40% of the total CPU time and this bottleneck was completely eliminated by the caching added by this patch.

Author: Josh Rosen <joshrosen@databricks.com>

Closes #13504 from JoshRosen/metadata-fix.
2016-06-04 14:14:50 -07:00
Wenchen Fan 11c83f83d5 [SPARK-15140][SQL] make the semantics of null input object for encoder clear
## What changes were proposed in this pull request?

For input object of non-flat type, we can't encode it to row if it's null, as Spark SQL doesn't allow row to be null, only its columns can be null.

This PR explicitly add this constraint and throw exception if users break it.

## How was this patch tested?

several new tests

Author: Wenchen Fan <wenchen@databricks.com>

Closes #13469 from cloud-fan/null-object.
2016-06-03 14:28:19 -07:00
Wenchen Fan 61b80d552a [SPARK-15547][SQL] nested case class in encoder can have different number of fields from the real schema
## What changes were proposed in this pull request?

There are 2 kinds of `GetStructField`:

1. resolved from `UnresolvedExtractValue`, and it will have a `name` property.
2. created when we build deserializer expression for nested tuple, no `name` property.

When we want to validate the ordinals of nested tuple, we should only catch `GetStructField` without the name property.

## How was this patch tested?

new test in `EncoderResolutionSuite`

Author: Wenchen Fan <wenchen@databricks.com>

Closes #13474 from cloud-fan/ordinal-check.
2016-06-03 14:26:24 -07:00
gatorsmile eb10b481ca [SPARK-15286][SQL] Make the output readable for EXPLAIN CREATE TABLE and DESC EXTENDED
#### What changes were proposed in this pull request?
Before this PR, the output of EXPLAIN of following SQL is like

```SQL
CREATE EXTERNAL TABLE extTable_with_partitions (key INT, value STRING)
PARTITIONED BY (ds STRING, hr STRING)
LOCATION '/private/var/folders/4b/sgmfldk15js406vk7lw5llzw0000gn/T/spark-b39a6185-8981-403b-a4aa-36fb2f4ca8a9'
```
``ExecutedCommand CreateTableCommand CatalogTable(`extTable_with_partitions`,CatalogTableType(EXTERNAL),CatalogStorageFormat(Some(/private/var/folders/4b/sgmfldk15js406vk7lw5llzw0000gn/T/spark-dd234718-e85d-4c5a-8353-8f1834ac0323),Some(org.apache.hadoop.mapred.TextInputFormat),Some(org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat),None,false,Map()),List(CatalogColumn(key,int,true,None), CatalogColumn(value,string,true,None), CatalogColumn(ds,string,true,None), CatalogColumn(hr,string,true,None)),List(ds, hr),List(),List(),-1,,1463026413544,-1,Map(),None,None,None), false``

After this PR, the output is like

```
ExecutedCommand
:  +- CreateTableCommand CatalogTable(
	Table:`extTable_with_partitions`
	Created:Thu Jun 02 21:30:54 PDT 2016
	Last Access:Wed Dec 31 15:59:59 PST 1969
	Type:EXTERNAL
	Schema:[`key` int, `value` string, `ds` string, `hr` string]
	Partition Columns:[`ds`, `hr`]
	Storage(Location:/private/var/folders/4b/sgmfldk15js406vk7lw5llzw0000gn/T/spark-a06083b8-8e88-4d07-9ff0-d6bd8d943ad3, InputFormat:org.apache.hadoop.mapred.TextInputFormat, OutputFormat:org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat)), false
```

This is also applicable to `DESC EXTENDED`. However, this does not have special handling for Data Source Tables. If needed, we need to move the logics of `DDLUtil`. Let me know if we should do it in this PR. Thanks! rxin liancheng

#### How was this patch tested?
Manual testing

Author: gatorsmile <gatorsmile@gmail.com>

Closes #13070 from gatorsmile/betterExplainCatalogTable.
2016-06-03 13:56:22 -07:00
Josh Rosen e526913989 [SPARK-15742][SQL] Reduce temp collections allocations in TreeNode transform methods
In Catalyst's TreeNode transform methods we end up calling `productIterator.map(...).toArray` in a number of places, which is slightly inefficient because it needs to allocate an `ArrayBuilder` and grow a temporary array. Since we already know the size of the final output (`productArity`), we can simply allocate an array up-front and use a while loop to consume the iterator and populate the array.

For most workloads, this performance difference is negligible but it does make a measurable difference in optimizer performance for queries that operate over very wide schemas (such as the benchmark queries in #13456).

### Perf results (from #13456 benchmarks)

**Before**

```
Java HotSpot(TM) 64-Bit Server VM 1.8.0_66-b17 on Mac OS X 10.10.5
Intel(R) Core(TM) i7-4960HQ CPU  2.60GHz

parsing large select:                    Best/Avg Time(ms)    Rate(M/s)   Per Row(ns)   Relative
------------------------------------------------------------------------------------------------
1 select expressions                            19 /   22          0.0    19119858.0       1.0X
10 select expressions                           23 /   25          0.0    23208774.0       0.8X
100 select expressions                          55 /   73          0.0    54768402.0       0.3X
1000 select expressions                        229 /  259          0.0   228606373.0       0.1X
2500 select expressions                        530 /  554          0.0   529938178.0       0.0X
```

**After**

```
parsing large select:                    Best/Avg Time(ms)    Rate(M/s)   Per Row(ns)   Relative
------------------------------------------------------------------------------------------------
1 select expressions                            15 /   21          0.0    14978203.0       1.0X
10 select expressions                           22 /   27          0.0    22492262.0       0.7X
100 select expressions                          48 /   64          0.0    48449834.0       0.3X
1000 select expressions                        189 /  208          0.0   189346428.0       0.1X
2500 select expressions                        429 /  449          0.0   428943897.0       0.0X
```

###

Author: Josh Rosen <joshrosen@databricks.com>

Closes #13484 from JoshRosen/treenode-productiterator-map.
2016-06-03 13:53:02 -07:00
Ioana Delaney 9e2eb13ca5 [SPARK-15677][SQL] Query with scalar sub-query in the SELECT list throws UnsupportedOperationException
## What changes were proposed in this pull request?
Queries with scalar sub-query in the SELECT list run against a local, in-memory relation throw
UnsupportedOperationException exception.

Problem repro:
```SQL
scala> Seq((1, 1), (2, 2)).toDF("c1", "c2").createOrReplaceTempView("t1")
scala> Seq((1, 1), (2, 2)).toDF("c1", "c2").createOrReplaceTempView("t2")
scala> sql("select (select min(c1) from t2) from t1").show()

java.lang.UnsupportedOperationException: Cannot evaluate expression: scalar-subquery#62 []
  at org.apache.spark.sql.catalyst.expressions.Unevaluable$class.eval(Expression.scala:215)
  at org.apache.spark.sql.catalyst.expressions.ScalarSubquery.eval(subquery.scala:62)
  at org.apache.spark.sql.catalyst.expressions.Alias.eval(namedExpressions.scala:142)
  at org.apache.spark.sql.catalyst.expressions.InterpretedProjection.apply(Projection.scala:45)
  at org.apache.spark.sql.catalyst.expressions.InterpretedProjection.apply(Projection.scala:29)
  at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
  at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
  at scala.collection.immutable.List.foreach(List.scala:381)
  at scala.collection.TraversableLike$class.map(TraversableLike.scala:234)
  at scala.collection.immutable.List.map(List.scala:285)
  at org.apache.spark.sql.catalyst.optimizer.ConvertToLocalRelation$$anonfun$apply$37.applyOrElse(Optimizer.scala:1473)
```
The problem is specific to local, in memory relations. It is caused by rule ConvertToLocalRelation, which attempts to push down
a scalar-subquery expression to the local tables.

The solution prevents the rule to apply if Project references scalar subqueries.

## How was this patch tested?
Added regression tests to SubquerySuite.scala

Author: Ioana Delaney <ioanamdelaney@gmail.com>

Closes #13418 from ioana-delaney/scalarSubV2.
2016-06-03 12:04:27 -07:00
Wenchen Fan 190ff274fd [SPARK-15494][SQL] encoder code cleanup
## What changes were proposed in this pull request?

Our encoder framework has been evolved a lot, this PR tries to clean up the code to make it more readable and emphasise the concept that encoder should be used as a container of serde expressions.

1. move validation logic to analyzer instead of encoder
2. only have a `resolveAndBind` method in encoder instead of `resolve` and `bind`, as we don't have the encoder life cycle concept anymore.
3. `Dataset` don't need to keep a resolved encoder, as there is no such concept anymore. bound encoder is still needed to do serialization outside of query framework.
4. Using `BoundReference` to represent an unresolved field in deserializer expression is kind of weird, this PR adds a `GetColumnByOrdinal` for this purpose. (serializer expression still use `BoundReference`, we can replace it with `GetColumnByOrdinal` in follow-ups)

## How was this patch tested?

existing test

Author: Wenchen Fan <wenchen@databricks.com>
Author: Cheng Lian <lian@databricks.com>

Closes #13269 from cloud-fan/clean-encoder.
2016-06-03 00:43:02 -07:00
Sean Zhong 6dde27404c [SPARK-15733][SQL] Makes the explain output less verbose by hiding some verbose output like None, null, empty List, and etc.
## What changes were proposed in this pull request?

This PR makes the explain output less verbose by hiding some verbose output like `None`, `null`, empty List `[]`, empty set `{}`, and etc.

**Before change**:

```
== Physical Plan ==
ExecutedCommand
:  +- ShowTablesCommand None, None
```

**After change**:

```
== Physical Plan ==
ExecutedCommand
:  +- ShowTablesCommand
```

## How was this patch tested?

Manual test.

Author: Sean Zhong <seanzhong@databricks.com>

Closes #13470 from clockfly/verbose_breakdown_4.
2016-06-02 22:45:37 -07:00
Wenchen Fan 6323e4bd76 [SPARK-15732][SQL] better error message when use java reserved keyword as field name
## What changes were proposed in this pull request?

When users create a case class and use java reserved keyword as field name, spark sql will generate illegal java code and throw exception at runtime.

This PR checks the field names when building the encoder, and if illegal field names are used, throw exception immediately with a good error message.

## How was this patch tested?

new test in DatasetSuite

Author: Wenchen Fan <wenchen@databricks.com>

Closes #13485 from cloud-fan/java.
2016-06-02 18:13:04 -07:00
Andrew Or d1c1fbc345 [SPARK-15715][SQL] Fix alter partition with storage information in Hive
## What changes were proposed in this pull request?

This command didn't work for Hive tables. Now it does:
```
ALTER TABLE boxes PARTITION (width=3)
    SET SERDE 'com.sparkbricks.serde.ColumnarSerDe'
    WITH SERDEPROPERTIES ('compress'='true')
```

## How was this patch tested?

`HiveExternalCatalogSuite`

Author: Andrew Or <andrew@databricks.com>

Closes #13453 from andrewor14/alter-partition-storage.
2016-06-02 17:44:48 -07:00
Sean Zhong 985d532812 [SPARK-15734][SQL] Avoids printing internal row in explain output
## What changes were proposed in this pull request?

This PR avoids printing internal rows in explain output for some operators.

**Before change:**

```
scala> (1 to 10).toSeq.map(_ => (1,2,3)).toDF().createTempView("df3")
scala> spark.sql("select * from df3 where 1=2").explain(true)
...
== Analyzed Logical Plan ==
_1: int, _2: int, _3: int
Project [_1#37,_2#38,_3#39]
+- Filter (1 = 2)
   +- SubqueryAlias df3
      +- LocalRelation [_1#37,_2#38,_3#39], [[0,1,2,3],[0,1,2,3],[0,1,2,3],[0,1,2,3],[0,1,2,3],[0,1,2,3],[0,1,2,3],[0,1,2,3],[0,1,2,3],[0,1,2,3]]
...
== Physical Plan ==
LocalTableScan [_1#37,_2#38,_3#39]
```

**After change:**

```
scala> spark.sql("select * from df3 where 1=2").explain(true)
...
== Analyzed Logical Plan ==
_1: int, _2: int, _3: int
Project [_1#58,_2#59,_3#60]
+- Filter (1 = 2)
   +- SubqueryAlias df3
      +- LocalRelation [_1#58,_2#59,_3#60]
...
== Physical Plan ==
LocalTableScan <empty>, [_1#58,_2#59,_3#60]
```

## How was this patch tested?
Manual test.

Author: Sean Zhong <seanzhong@databricks.com>

Closes #13471 from clockfly/verbose_breakdown_5.
2016-06-02 16:21:33 -07:00
Sameer Agarwal 09b3c56c91 [SPARK-14752][SQL] Explicitly implement KryoSerialization for LazilyGenerateOrdering
## What changes were proposed in this pull request?

This patch fixes a number of `com.esotericsoftware.kryo.KryoException: java.lang.NullPointerException` exceptions reported in [SPARK-15604], [SPARK-14752] etc. (while executing sparkSQL queries with the kryo serializer) by explicitly implementing `KryoSerialization` for `LazilyGenerateOrdering`.

## How was this patch tested?

1. Modified `OrderingSuite` so that all tests in the suite also test kryo serialization (for both interpreted and generated ordering).
2. Manually verified TPC-DS q1.

Author: Sameer Agarwal <sameer@databricks.com>

Closes #13466 from sameeragarwal/kryo.
2016-06-02 10:58:00 -07:00
Dongjoon Hyun 63b7f127ca [SPARK-15076][SQL] Add ReorderAssociativeOperator optimizer
## What changes were proposed in this pull request?

This issue add a new optimizer `ReorderAssociativeOperator` by taking advantage of integral associative property. Currently, Spark works like the following.

1) Can optimize `1 + 2 + 3 + 4 + 5 + 6 + 7 + 8 + 9 + a` into `45 + a`.
2) Cannot optimize `a + 1 + 2 + 3 + 4 + 5 + 6 + 7 + 8 + 9`.

This PR can handle Case 2 for **Add/Multiply** expression whose data types are `ByteType`, `ShortType`, `IntegerType`, and `LongType`. The followings are the plan comparison between `before` and `after` this issue.

**Before**
```scala
scala> sql("select a+1+2+3+4+5+6+7+8+9 from (select explode(array(1)) a)").explain
== Physical Plan ==
WholeStageCodegen
:  +- Project [(((((((((a#7 + 1) + 2) + 3) + 4) + 5) + 6) + 7) + 8) + 9) AS (((((((((a + 1) + 2) + 3) + 4) + 5) + 6) + 7) + 8) + 9)#8]
:     +- INPUT
+- Generate explode([1]), false, false, [a#7]
   +- Scan OneRowRelation[]
scala> sql("select a*1*2*3*4*5*6*7*8*9 from (select explode(array(1)) a)").explain
== Physical Plan ==
*Project [(((((((((a#18 * 1) * 2) * 3) * 4) * 5) * 6) * 7) * 8) * 9) AS (((((((((a * 1) * 2) * 3) * 4) * 5) * 6) * 7) * 8) * 9)#19]
+- Generate explode([1]), false, false, [a#18]
   +- Scan OneRowRelation[]
```

**After**
```scala
scala> sql("select a+1+2+3+4+5+6+7+8+9 from (select explode(array(1)) a)").explain
== Physical Plan ==
WholeStageCodegen
:  +- Project [(a#7 + 45) AS (((((((((a + 1) + 2) + 3) + 4) + 5) + 6) + 7) + 8) + 9)#8]
:     +- INPUT
+- Generate explode([1]), false, false, [a#7]
   +- Scan OneRowRelation[]
scala> sql("select a*1*2*3*4*5*6*7*8*9 from (select explode(array(1)) a)").explain
== Physical Plan ==
*Project [(a#18 * 362880) AS (((((((((a * 1) * 2) * 3) * 4) * 5) * 6) * 7) * 8) * 9)#19]
+- Generate explode([1]), false, false, [a#18]
   +- Scan OneRowRelation[]
```

This PR is greatly generalized by cloud-fan 's key ideas; he should be credited for the work he did.

## How was this patch tested?

Pass the Jenkins tests including new testsuite.

Author: Dongjoon Hyun <dongjoon@apache.org>

Closes #12850 from dongjoon-hyun/SPARK-15076.
2016-06-02 09:48:58 -07:00
Takeshi YAMAMURO 5eea332307 [SPARK-13484][SQL] Prevent illegal NULL propagation when filtering outer-join results
## What changes were proposed in this pull request?
This PR add a rule at the end of analyzer to correct nullable fields of attributes in a logical plan by using nullable fields of the corresponding attributes in its children logical plans (these plans generate the input rows).

This is another approach for addressing SPARK-13484 (the first approach is https://github.com/apache/spark/pull/11371).

Close #113711

Author: Takeshi YAMAMURO <linguin.m.s@gmail.com>
Author: Yin Huai <yhuai@databricks.com>

Closes #13290 from yhuai/SPARK-13484.
2016-06-01 22:23:00 -07:00
jerryshao 8288e16a5a [SPARK-15620][SQL] Fix transformed dataset attributes revolve failure
## What changes were proposed in this pull request?

Join on transformed dataset has attributes conflicts, which make query execution failure, for example:

```
val dataset = Seq(1, 2, 3).toDs
val mappedDs = dataset.map(_ + 1)

mappedDs.as("t1").joinWith(mappedDs.as("t2"), $"t1.value" === $"t2.value").show()
```

will throw exception:

```
org.apache.spark.sql.AnalysisException: cannot resolve '`t1.value`' given input columns: [value];
  at org.apache.spark.sql.catalyst.analysis.package$AnalysisErrorAt.failAnalysis(package.scala:42)
  at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1$$anonfun$apply$2.applyOrElse(CheckAnalysis.scala:62)
  at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1$$anonfun$apply$2.applyOrElse(CheckAnalysis.scala:59)
  at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:287)
  at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:287)
```

## How was this patch tested?

Unit test.

Author: jerryshao <sshao@hortonworks.com>

Closes #13399 from jerryshao/SPARK-15620.
2016-06-01 21:58:05 -07:00
Sean Zhong c8fb776d4a [SPARK-15692][SQL] Improves the explain output of several physical plans by displaying embedded logical plan in tree style
## What changes were proposed in this pull request?

Improves the explain output of several physical plans by displaying embedded logical plan in tree style

Some physical plan contains a embedded logical plan, for example, `cache tableName query` maps to:

```
case class CacheTableCommand(
    tableName: String,
    plan: Option[LogicalPlan],
    isLazy: Boolean)
  extends RunnableCommand
```

It is easier to read the explain output if we can display the `plan` in tree style.

**Before change:**

Everything is messed in one line.

```
scala> Seq((1,2)).toDF().createOrReplaceTempView("testView")
scala> spark.sql("cache table testView2 select * from testView").explain()
== Physical Plan ==
ExecutedCommand CacheTableCommand testView2, Some('Project [*]
+- 'UnresolvedRelation `testView`, None
), false
```

**After change:**

```
scala> spark.sql("cache table testView2 select * from testView").explain()
== Physical Plan ==
ExecutedCommand
:  +- CacheTableCommand testView2, false
:     :  +- 'Project [*]
:     :     +- 'UnresolvedRelation `testView`, None
```

## How was this patch tested?

Manual test.

Author: Sean Zhong <seanzhong@databricks.com>

Closes #13433 from clockfly/verbose_breakdown_3_2.
2016-06-01 17:03:39 -07:00
Wenchen Fan 8640cdb836 [SPARK-15441][SQL] support null object in Dataset outer-join
## What changes were proposed in this pull request?

Currently we can't encode top level null object into internal row, as Spark SQL doesn't allow row to be null, only its columns can be null.

This is not a problem before, as we assume the input object is never null. However, for outer join, we do need the semantics of null object.

This PR fixes this problem by making both join sides produce a single column, i.e. nest the logical plan output(by `CreateStruct`), so that we have an extra level to represent top level null obejct.

## How was this patch tested?

new test in `DatasetSuite`

Author: Wenchen Fan <wenchen@databricks.com>

Closes #13425 from cloud-fan/outer-join2.
2016-06-01 16:16:54 -07:00
Cheng Lian 7bb64aae27 [SPARK-15269][SQL] Removes unexpected empty table directories created while creating external Spark SQL data sourcet tables.
This PR is an alternative to #13120 authored by xwu0226.

## What changes were proposed in this pull request?

When creating an external Spark SQL data source table and persisting its metadata to Hive metastore, we don't use the standard Hive `Table.dataLocation` field because Hive only allows directory paths as data locations while Spark SQL also allows file paths. However, if we don't set `Table.dataLocation`, Hive always creates an unexpected empty table directory under database location, but doesn't remove it while dropping the table (because the table is external).

This PR works around this issue by explicitly setting `Table.dataLocation` and then manullay removing the created directory after creating the external table.

Please refer to [this JIRA comment][1] for more details about why we chose this approach as a workaround.

[1]: https://issues.apache.org/jira/browse/SPARK-15269?focusedCommentId=15297408&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-15297408

## How was this patch tested?

1. A new test case is added in `HiveQuerySuite` for this case
2. Updated `ShowCreateTableSuite` to use the same table name in all test cases. (This is how I hit this issue at the first place.)

Author: Cheng Lian <lian@databricks.com>

Closes #13270 from liancheng/spark-15269-unpleasant-fix.
2016-06-01 16:02:27 -07:00
Reynold Xin a71d1364ae [SPARK-15686][SQL] Move user-facing streaming classes into sql.streaming
## What changes were proposed in this pull request?
This patch moves all user-facing structured streaming classes into sql.streaming. As part of this, I also added some since version annotation to methods and classes that don't have them.

## How was this patch tested?
Updated tests to reflect the moves.

Author: Reynold Xin <rxin@databricks.com>

Closes #13429 from rxin/SPARK-15686.
2016-06-01 10:14:40 -07:00
Sean Zhong d5012c2740 [SPARK-15495][SQL] Improve the explain output for Aggregation operator
## What changes were proposed in this pull request?

This PR improves the explain output of Aggregator operator.

SQL:

```
Seq((1,2,3)).toDF("a", "b", "c").createTempView("df1")
spark.sql("cache table df1")
spark.sql("select count(a), count(c), b from df1 group by b").explain()
```

**Before change:**

```
*TungstenAggregate(key=[b#8], functions=[count(1),count(1)], output=[count(a)#79L,count(c)#80L,b#8])
+- Exchange hashpartitioning(b#8, 200), None
   +- *TungstenAggregate(key=[b#8], functions=[partial_count(1),partial_count(1)], output=[b#8,count#98L,count#99L])
      +- InMemoryTableScan [b#8], InMemoryRelation [a#7,b#8,c#9], true, 10000, StorageLevel(disk=true, memory=true, offheap=false, deserialized=true, replication=1), LocalTableScan [a#7,b#8,c#9], [[1,2,3]], Some(df1)
``````

**After change:**

```
*Aggregate(key=[b#8], functions=[count(1),count(1)], output=[count(a)#79L,count(c)#80L,b#8])
+- Exchange hashpartitioning(b#8, 200), None
   +- *Aggregate(key=[b#8], functions=[partial_count(1),partial_count(1)], output=[b#8,count#98L,count#99L])
      +- InMemoryTableScan [b#8], InMemoryRelation [a#7,b#8,c#9], true, 10000, StorageLevel(disk, memory, deserialized, 1 replicas), LocalTableScan [a#7,b#8,c#9], [[1,2,3]], Some(df1)
```

## How was this patch tested?

Manual test and existing UT.

Author: Sean Zhong <seanzhong@databricks.com>

Closes #13363 from clockfly/verbose3.
2016-06-01 09:58:01 -07:00
jerryshao e4ce1bc4f3 [SPARK-15659][SQL] Ensure FileSystem is gotten from path
## What changes were proposed in this pull request?

Currently `spark.sql.warehouse.dir` is pointed to local dir by default, which will throw exception when HADOOP_CONF_DIR is configured and default FS is hdfs.

```
java.lang.IllegalArgumentException: Wrong FS: file:/Users/sshao/projects/apache-spark/spark-warehouse, expected: hdfs://localhost:8020
```

So we should always get the `FileSystem` from `Path` to avoid wrong FS problem.

## How was this patch tested?

Local test.

Author: jerryshao <sshao@hortonworks.com>

Closes #13405 from jerryshao/SPARK-15659.
2016-06-01 08:28:19 -05:00
Eric Liang 93e97147eb [MINOR] Slightly better error message when attempting to query hive tables w/in-mem catalog
andrewor14

Author: Eric Liang <ekl@databricks.com>

Closes #13427 from ericl/better-error-msg.
2016-05-31 17:39:03 -07:00
Josh Rosen 8ca01a6feb [SPARK-15680][SQL] Disable comments in generated code in order to avoid perf. issues
## What changes were proposed in this pull request?

In benchmarks involving tables with very wide and complex schemas (thousands of columns, deep nesting), I noticed that significant amounts of time (order of tens of seconds per task) were being spent generating comments during the code generation phase.

The root cause of the performance problem stems from the fact that calling toString() on a complex expression can involve thousands of string concatenations, resulting in huge amounts (tens of gigabytes) of character array allocation and copying.

In the long term, we can avoid this problem by passing StringBuilders down the tree and using them to accumulate output. As a short-term workaround, this patch guards comment generation behind a flag and disables comments by default (for wide tables / complex queries, these comments were being truncated prior to display and thus were not very useful).

## How was this patch tested?

This was tested manually by running a Spark SQL query over an empty table with a very wide schema obtained from a real workload. Disabling comments brought the per-task time down from about 16 seconds to 600 milliseconds.

Author: Josh Rosen <joshrosen@databricks.com>

Closes #13421 from JoshRosen/disable-line-comments-in-codegen.
2016-05-31 17:30:03 -07:00
Tathagata Das 90b11439b3 [SPARK-15517][SQL][STREAMING] Add support for complete output mode in Structure Streaming
## What changes were proposed in this pull request?
Currently structured streaming only supports append output mode.  This PR adds the following.

- Added support for Complete output mode in the internal state store, analyzer and planner.
- Added public API in Scala and Python for users to specify output mode
- Added checks for unsupported combinations of output mode and DF operations
  - Plans with no aggregation should support only Append mode
  - Plans with aggregation should support only Update and Complete modes
  - Default output mode is Append mode (**Question: should we change this to automatically set to Complete mode when there is aggregation?**)
- Added support for Complete output mode in Memory Sink. So Memory Sink internally supports append and complete, update. But from public API only Complete and Append output modes are supported.

## How was this patch tested?
Unit tests in various test suites
- StreamingAggregationSuite: tests for complete mode
- MemorySinkSuite: tests for checking behavior in Append and Complete modes.
- UnsupportedOperationSuite: tests for checking unsupported combinations of DF ops and output modes
- DataFrameReaderWriterSuite: tests for checking that output mode cannot be called on static DFs
- Python doc test and existing unit tests modified to call write.outputMode.

Author: Tathagata Das <tathagata.das1565@gmail.com>

Closes #13286 from tdas/complete-mode.
2016-05-31 15:57:01 -07:00
Dilip Biswal dfe2cbeb43 [SPARK-15557] [SQL] cast the string into DoubleType when it's used together with decimal
In this case, the result type of the expression becomes DECIMAL(38, 36) as we promote the individual string literals to DECIMAL(38, 18) when we handle string promotions for `BinaryArthmaticExpression`.

I think we need to cast the string literals to Double type instead. I looked at the history and found that  this was changed to use decimal instead of double to avoid potential loss of precision when we cast decimal to double.

To double check i ran the query against hive, mysql. This query returns non NULL result for both the databases and both promote the expression to use double.
Here is the output.

- Hive
```SQL
hive> create table l2 as select (cast(99 as decimal(19,6)) + '2') from l1;
OK
hive> describe l2;
OK
_c0                 	double
```
- MySQL
```SQL
mysql> create table foo2 as select (cast(99 as decimal(19,6)) + '2') from test;
Query OK, 1 row affected (0.01 sec)
Records: 1  Duplicates: 0  Warnings: 0

mysql> describe foo2;
+-----------------------------------+--------+------+-----+---------+-------+
| Field                             | Type   | Null | Key | Default | Extra |
+-----------------------------------+--------+------+-----+---------+-------+
| (cast(99 as decimal(19,6)) + '2') | double | NO   |     | 0       |       |
+-----------------------------------+--------+------+-----+---------+-------+
```

## How was this patch tested?
Added a new test in SQLQuerySuite

Author: Dilip Biswal <dbiswal@us.ibm.com>

Closes #13368 from dilipbiswal/spark-15557.
2016-05-31 15:49:45 -07:00
Davies Liu 2df6ca848e [SPARK-15327] [SQL] fix split expression in whole stage codegen
## What changes were proposed in this pull request?

Right now, we will split the code for expressions into multiple functions when it exceed 64k, which requires that the the expressions are using Row object, but this is not true for whole-state codegen, it will fail to compile after splitted.

This PR will not split the code in whole-stage codegen.

## How was this patch tested?

Added regression tests.

Author: Davies Liu <davies@databricks.com>

Closes #13235 from davies/fix_nested_codegen.
2016-05-31 15:36:02 -07:00
Yin Huai c6de5832bf [SPARK-15622][SQL] Wrap the parent classloader of Janino's classloader in the ParentClassLoader.
## What changes were proposed in this pull request?
At https://github.com/aunkrig/janino/blob/janino_2.7.8/janino/src/org/codehaus/janino/ClassLoaderIClassLoader.java#L80-L85, Janino's classloader throws the exception when its parent throws a ClassNotFoundException with a cause set. However, it does not throw the exception when there is no cause set. Seems we need to use a special ClassLoader to wrap the actual parent classloader set to Janino handle this behavior.

## How was this patch tested?
I have reverted the workaround made by https://issues.apache.org/jira/browse/SPARK-11636 ( https://github.com/apache/spark/compare/master...yhuai:SPARK-15622?expand=1#diff-bb538fda94224dd0af01d0fd7e1b4ea0R81) and `test-only *ReplSuite -- -z "SPARK-2576 importing implicits"` still passes the test (without the change in `CodeGenerator`, this test does not pass with the change in `ExecutorClassLoader `).

Author: Yin Huai <yhuai@databricks.com>

Closes #13366 from yhuai/SPARK-15622.
2016-05-31 12:30:34 -07:00
Wenchen Fan 2bfed1a0c5 [SPARK-15658][SQL] UDT serializer should declare its data type as udt instead of udt.sqlType
## What changes were proposed in this pull request?

When we build serializer for UDT object, we should declare its data type as udt instead of udt.sqlType, or if we deserialize it again, we lose the information that it's a udt object and throw analysis exception.

## How was this patch tested?

new test in `UserDefiendTypeSuite`

Author: Wenchen Fan <wenchen@databricks.com>

Closes #13402 from cloud-fan/udt.
2016-05-31 11:00:38 -07:00
gatorsmile d67c82e4b6 [SPARK-15647][SQL] Fix Boundary Cases in OptimizeCodegen Rule
#### What changes were proposed in this pull request?

The following condition in the Optimizer rule `OptimizeCodegen` is not right.
```Scala
branches.size < conf.maxCaseBranchesForCodegen
```

- The number of branches in case when clause should be `branches.size + elseBranch.size`.
- `maxCaseBranchesForCodegen` is the maximum boundary for enabling codegen. Thus, we should use `<=` instead of `<`.

This PR is to fix this boundary case and also add missing test cases for verifying the conf `MAX_CASES_BRANCHES`.

#### How was this patch tested?
Added test cases in `SQLConfSuite`

Author: gatorsmile <gatorsmile@gmail.com>

Closes #13392 from gatorsmile/maxCaseWhen.
2016-05-31 10:08:00 -07:00
Takeshi YAMAMURO 95db8a44f3 [SPARK-15528][SQL] Fix race condition in NumberConverter
## What changes were proposed in this pull request?
A local variable in NumberConverter is wrongly shared between threads.
This pr fixes the race condition.

## How was this patch tested?
Manually checked.

Author: Takeshi YAMAMURO <linguin.m.s@gmail.com>

Closes #13391 from maropu/SPARK-15528.
2016-05-31 07:25:16 -05:00
Reynold Xin 675921040e [SPARK-15638][SQL] Audit Dataset, SparkSession, and SQLContext
## What changes were proposed in this pull request?
This patch contains a list of changes as a result of my auditing Dataset, SparkSession, and SQLContext. The patch audits the categorization of experimental APIs, function groups, and deprecations. For the detailed list of changes, please see the diff.

## How was this patch tested?
N/A

Author: Reynold Xin <rxin@databricks.com>

Closes #13370 from rxin/SPARK-15638.
2016-05-30 22:47:58 -07:00
Cheng Lian 1360a6d636 [SPARK-15112][SQL] Disables EmbedSerializerInFilter for plan fragments that change schema
## What changes were proposed in this pull request?

`EmbedSerializerInFilter` implicitly assumes that the plan fragment being optimized doesn't change plan schema, which is reasonable because `Dataset.filter` should never change the schema.

However, due to another issue involving `DeserializeToObject` and `SerializeFromObject`, typed filter *does* change plan schema (see [SPARK-15632][1]). This breaks `EmbedSerializerInFilter` and causes corrupted data.

This PR disables `EmbedSerializerInFilter` when there's a schema change to avoid data corruption. The schema change issue should be addressed in follow-up PRs.

## How was this patch tested?

New test case added in `DatasetSuite`.

[1]: https://issues.apache.org/jira/browse/SPARK-15632

Author: Cheng Lian <lian@databricks.com>

Closes #13362 from liancheng/spark-15112-corrupted-filter.
2016-05-29 23:19:12 -07:00
Sean Owen ce1572d16f [MINOR] Resolve a number of miscellaneous build warnings
## What changes were proposed in this pull request?

This change resolves a number of build warnings that have accumulated, before 2.x. It does not address a large number of deprecation warnings, especially related to the Accumulator API. That will happen separately.

## How was this patch tested?

Jenkins

Author: Sean Owen <sowen@cloudera.com>

Closes #13377 from srowen/BuildWarnings.
2016-05-29 16:48:14 -05:00
Reynold Xin 472f16181d [SPARK-15636][SQL] Make aggregate expressions more concise in explain
## What changes were proposed in this pull request?
This patch reduces the verbosity of aggregate expressions in explain (but does not actually remove any information). As an example, for the following command:
```
spark.range(10).selectExpr("sum(id) + 1", "count(distinct id)").explain(true)
```

Output before this patch:
```
== Physical Plan ==
*TungstenAggregate(key=[], functions=[(sum(id#0L),mode=Final,isDistinct=false),(count(id#0L),mode=Final,isDistinct=true)], output=[(sum(id) + 1)#3L,count(DISTINCT id)#16L])
+- Exchange SinglePartition, None
   +- *TungstenAggregate(key=[], functions=[(sum(id#0L),mode=PartialMerge,isDistinct=false),(count(id#0L),mode=Partial,isDistinct=true)], output=[sum#18L,count#21L])
      +- *TungstenAggregate(key=[id#0L], functions=[(sum(id#0L),mode=PartialMerge,isDistinct=false)], output=[id#0L,sum#18L])
         +- Exchange hashpartitioning(id#0L, 5), None
            +- *TungstenAggregate(key=[id#0L], functions=[(sum(id#0L),mode=Partial,isDistinct=false)], output=[id#0L,sum#18L])
               +- *Range (0, 10, splits=2)
```

Output after this patch:
```
== Physical Plan ==
*TungstenAggregate(key=[], functions=[sum(id#0L),count(distinct id#0L)], output=[(sum(id) + 1)#3L,count(DISTINCT id)#16L])
+- Exchange SinglePartition, None
   +- *TungstenAggregate(key=[], functions=[merge_sum(id#0L),partial_count(distinct id#0L)], output=[sum#18L,count#21L])
      +- *TungstenAggregate(key=[id#0L], functions=[merge_sum(id#0L)], output=[id#0L,sum#18L])
         +- Exchange hashpartitioning(id#0L, 5), None
            +- *TungstenAggregate(key=[id#0L], functions=[partial_sum(id#0L)], output=[id#0L,sum#18L])
               +- *Range (0, 10, splits=2)
```

Note the change from `(sum(id#0L),mode=PartialMerge,isDistinct=false)` to `merge_sum(id#0L)`.

In general aggregate explain is still very verbose, but further work will be done as follow-up pull requests.

## How was this patch tested?
Tested manually.

Author: Reynold Xin <rxin@databricks.com>

Closes #13367 from rxin/SPARK-15636.
2016-05-28 14:14:36 -07:00
Liang-Chi Hsieh f1b220eeee [SPARK-15553][SQL] Dataset.createTempView should use CreateViewCommand
## What changes were proposed in this pull request?

Let `Dataset.createTempView` and `Dataset.createOrReplaceTempView` use `CreateViewCommand`, rather than calling `SparkSession.createTempView`. Besides, this patch also removes `SparkSession.createTempView`.

## How was this patch tested?
Existing tests.

Author: Liang-Chi Hsieh <simonh@tw.ibm.com>

Closes #13327 from viirya/dataset-createtempview.
2016-05-27 21:24:08 -07:00
Zheng RuiFeng 6b1a6180e7 [MINOR] Fix Typos 'a -> an'
## What changes were proposed in this pull request?

`a` -> `an`

I use regex to generate potential error lines:
`grep -in ' a [aeiou]' mllib/src/main/scala/org/apache/spark/ml/*/*scala`
and review them line by line.

## How was this patch tested?

local build
`lint-java` checking

Author: Zheng RuiFeng <ruifengz@foxmail.com>

Closes #13317 from zhengruifeng/a_an.
2016-05-26 22:39:14 -07:00
Sean Zhong b5859e0bb8 [SPARK-13445][SQL] Improves error message and add test coverage for Window function
## What changes were proposed in this pull request?

Add more verbose error message when order by clause is missed when using Window function.

## How was this patch tested?

Unit test.

Author: Sean Zhong <seanzhong@databricks.com>

Closes #13333 from clockfly/spark-13445.
2016-05-26 14:50:00 -07:00
Andrew Or ee682fe293 [SPARK-15534][SPARK-15535][SQL] Truncate table fixes
## What changes were proposed in this pull request?

Two changes:
- When things fail, `TRUNCATE TABLE` just returns nothing. Instead, we should throw exceptions.
- Remove `TRUNCATE TABLE ... COLUMN`, which was never supported by either Spark or Hive.

## How was this patch tested?
Jenkins.

Author: Andrew Or <andrew@databricks.com>

Closes #13302 from andrewor14/truncate-table.
2016-05-25 15:08:39 -07:00
lfzCarlosC 02c8072eea [MINOR][MLLIB][STREAMING][SQL] Fix typos
fixed typos for source code for components [mllib] [streaming] and [SQL]

None and obvious.

Author: lfzCarlosC <lfz.carlos@gmail.com>

Closes #13298 from lfzCarlosC/master.
2016-05-25 10:53:57 -07:00
Reynold Xin 4f27b8dd58 [SPARK-15436][SQL] Remove DescribeFunction and ShowFunctions
## What changes were proposed in this pull request?
This patch removes the last two commands defined in the catalyst module: DescribeFunction and ShowFunctions. They were unnecessary since the parser could just generate DescribeFunctionCommand and ShowFunctionsCommand directly.

## How was this patch tested?
Created a new SparkSqlParserSuite.

Author: Reynold Xin <rxin@databricks.com>

Closes #13292 from rxin/SPARK-15436.
2016-05-25 19:17:53 +02:00
Wenchen Fan 50b660d725 [SPARK-15498][TESTS] fix slow tests
## What changes were proposed in this pull request?

This PR fixes 3 slow tests:

1. `ParquetQuerySuite.read/write wide table`: This is not a good unit test as it runs more than 5 minutes. This PR removes it and add a new regression test in `CodeGenerationSuite`, which is more "unit".
2. `ParquetQuerySuite.returning batch for wide table`: reduce the threshold and use smaller data size.
3. `DatasetSuite.SPARK-14554: Dataset.map may generate wrong java code for wide table`: Improve `CodeFormatter.format`(introduced at https://github.com/apache/spark/pull/12979) can dramatically speed this it up.

## How was this patch tested?

N/A

Author: Wenchen Fan <wenchen@databricks.com>

Closes #13273 from cloud-fan/test.
2016-05-24 21:23:39 -07:00
Dongjoon Hyun f08bf587b1 [SPARK-15512][CORE] repartition(0) should raise IllegalArgumentException
## What changes were proposed in this pull request?

Previously, SPARK-8893 added the constraints on positive number of partitions for repartition/coalesce operations in general. This PR adds one missing part for that and adds explicit two testcases.

**Before**
```scala
scala> sc.parallelize(1 to 5).coalesce(0)
java.lang.IllegalArgumentException: requirement failed: Number of partitions (0) must be positive.
...
scala> sc.parallelize(1 to 5).repartition(0).collect()
res1: Array[Int] = Array()   // empty
scala> spark.sql("select 1").coalesce(0)
res2: org.apache.spark.sql.Dataset[org.apache.spark.sql.Row] = [1: int]
scala> spark.sql("select 1").coalesce(0).collect()
java.lang.IllegalArgumentException: requirement failed: Number of partitions (0) must be positive.
scala> spark.sql("select 1").repartition(0)
res3: org.apache.spark.sql.Dataset[org.apache.spark.sql.Row] = [1: int]
scala> spark.sql("select 1").repartition(0).collect()
res4: Array[org.apache.spark.sql.Row] = Array()  // empty
```

**After**
```scala
scala> sc.parallelize(1 to 5).coalesce(0)
java.lang.IllegalArgumentException: requirement failed: Number of partitions (0) must be positive.
...
scala> sc.parallelize(1 to 5).repartition(0)
java.lang.IllegalArgumentException: requirement failed: Number of partitions (0) must be positive.
...
scala> spark.sql("select 1").coalesce(0)
java.lang.IllegalArgumentException: requirement failed: Number of partitions (0) must be positive.
...
scala> spark.sql("select 1").repartition(0)
java.lang.IllegalArgumentException: requirement failed: Number of partitions (0) must be positive.
...
```

## How was this patch tested?

Pass the Jenkins tests with new testcases.

Author: Dongjoon Hyun <dongjoon@apache.org>

Closes #13282 from dongjoon-hyun/SPARK-15512.
2016-05-24 18:55:23 -07:00
Dongjoon Hyun f8763b80ec [SPARK-13135] [SQL] Don't print expressions recursively in generated code
## What changes were proposed in this pull request?

This PR is an up-to-date and a little bit improved version of #11019 of rxin for
- (1) preventing recursive printing of expressions in generated code.

Since the major function of this PR is indeed the above,  he should be credited for the work he did. In addition to #11019, this PR improves the followings in code generation.
- (2) Improve multiline comment indentation.
- (3) Reduce the number of empty lines (mainly consecutive empty lines).
- (4) Remove all space characters on empty lines.

**Example**
```scala
spark.range(1, 1000).select('id+1+2+3, 'id+4+5+6)
```

**Before**
```
Generated code:
/* 001 */ public Object generate(Object[] references) {
...
/* 005 */ /**
/* 006 */ * Codegend pipeline for
/* 007 */ * Project [(((id#0L + 1) + 2) + 3) AS (((id + 1) + 2) + 3)#3L,(((id#0L + 4) + 5) + 6) AS (((id + 4) + 5) + 6)#4L]
/* 008 */ * +- Range 1, 1, 8, 999, [id#0L]
/* 009 */ */
...
/* 075 */     // PRODUCE: Project [(((id#0L + 1) + 2) + 3) AS (((id + 1) + 2) + 3)#3L,(((id#0L + 4) + 5) + 6) AS (((id + 4) + 5) + 6)#4L]
/* 076 */
/* 077 */     // PRODUCE: Range 1, 1, 8, 999, [id#0L]
/* 078 */
/* 079 */     // initialize Range
...
/* 092 */       // CONSUME: Project [(((id#0L + 1) + 2) + 3) AS (((id + 1) + 2) + 3)#3L,(((id#0L + 4) + 5) + 6) AS (((id + 4) + 5) + 6)#4L]
/* 093 */
/* 094 */       // CONSUME: WholeStageCodegen
/* 095 */
/* 096 */       // (((input[0, bigint, false] + 1) + 2) + 3)
/* 097 */       // ((input[0, bigint, false] + 1) + 2)
/* 098 */       // (input[0, bigint, false] + 1)
...
/* 107 */       // (((input[0, bigint, false] + 4) + 5) + 6)
/* 108 */       // ((input[0, bigint, false] + 4) + 5)
/* 109 */       // (input[0, bigint, false] + 4)
...
/* 126 */ }
```

**After**
```
Generated code:
/* 001 */ public Object generate(Object[] references) {
...
/* 005 */ /**
/* 006 */  * Codegend pipeline for
/* 007 */  * Project [(((id#0L + 1) + 2) + 3) AS (((id + 1) + 2) + 3)#3L,(((id#0L + 4) + 5) + 6) AS (((id + 4) + 5) + 6)#4L]
/* 008 */  * +- Range 1, 1, 8, 999, [id#0L]
/* 009 */  */
...
/* 075 */     // PRODUCE: Project [(((id#0L + 1) + 2) + 3) AS (((id + 1) + 2) + 3)#3L,(((id#0L + 4) + 5) + 6) AS (((id + 4) + 5) + 6)#4L]
/* 076 */     // PRODUCE: Range 1, 1, 8, 999, [id#0L]
/* 077 */     // initialize Range
...
/* 090 */       // CONSUME: Project [(((id#0L + 1) + 2) + 3) AS (((id + 1) + 2) + 3)#3L,(((id#0L + 4) + 5) + 6) AS (((id + 4) + 5) + 6)#4L]
/* 091 */       // CONSUME: WholeStageCodegen
/* 092 */       // (((input[0, bigint, false] + 1) + 2) + 3)
...
/* 101 */       // (((input[0, bigint, false] + 4) + 5) + 6)
...
/* 118 */ }
```

## How was this patch tested?

Pass the Jenkins tests and see the result of the following command manually.
```scala
scala> spark.range(1, 1000).select('id+1+2+3, 'id+4+5+6).queryExecution.debug.codegen()
```

Author: Dongjoon Hyun <dongjoonapache.org>
Author: Reynold Xin <rxindatabricks.com>

Author: Dongjoon Hyun <dongjoon@apache.org>

Closes #13192 from dongjoon-hyun/SPARK-13135.
2016-05-24 10:08:14 -07:00
Daoyuan Wang d642b27354 [SPARK-15397][SQL] fix string udf locate as hive
## What changes were proposed in this pull request?

in hive, `locate("aa", "aaa", 0)` would yield 0, `locate("aa", "aaa", 1)` would yield 1 and `locate("aa", "aaa", 2)` would yield 2, while in Spark, `locate("aa", "aaa", 0)` would yield 1,  `locate("aa", "aaa", 1)` would yield 2 and  `locate("aa", "aaa", 2)` would yield 0. This results from the different understanding of the third parameter in udf `locate`. It means the starting index and starts from 1, so when we use 0, the return would always be 0.

## How was this patch tested?

tested with modified `StringExpressionsSuite` and `StringFunctionsSuite`

Author: Daoyuan Wang <daoyuan.wang@intel.com>

Closes #13186 from adrian-wang/locate.
2016-05-23 23:29:15 -07:00
Andrew Or de726b0d53 Revert "[SPARK-15285][SQL] Generated SpecificSafeProjection.apply method grows beyond 64 KB"
This reverts commit fa244e5a90.
2016-05-23 21:43:11 -07:00
Kazuaki Ishizaki fa244e5a90 [SPARK-15285][SQL] Generated SpecificSafeProjection.apply method grows beyond 64 KB
## What changes were proposed in this pull request?

This PR splits the generated code for ```SafeProjection.apply``` by using ```ctx.splitExpressions()```. This is because the large code body for ```NewInstance``` may grow beyond 64KB bytecode size for ```apply()``` method.

## How was this patch tested?

Added new tests

Author: Kazuaki Ishizaki <ishizaki@jp.ibm.com>

Closes #13243 from kiszk/SPARK-15285.
2016-05-23 21:12:34 -07:00
gatorsmile 5afd927a47 [SPARK-15311][SQL] Disallow DML on Regular Tables when Using In-Memory Catalog
#### What changes were proposed in this pull request?
So far, when using In-Memory Catalog, we allow DDL operations for the tables. However, the corresponding DML operations are not supported for the tables that are neither temporary nor data source tables. For example,
```SQL
CREATE TABLE tabName(i INT, j STRING)
SELECT * FROM tabName
INSERT OVERWRITE TABLE tabName SELECT 1, 'a'
```
In the above example, before this PR fix, we will get very confusing exception messages for either `SELECT` or `INSERT`
```
org.apache.spark.sql.AnalysisException: unresolved operator 'SimpleCatalogRelation default, CatalogTable(`default`.`tbl`,CatalogTableType(MANAGED),CatalogStorageFormat(None,Some(org.apache.hadoop.mapred.TextInputFormat),Some(org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat),None,false,Map()),List(CatalogColumn(i,int,true,None), CatalogColumn(j,string,true,None)),List(),List(),List(),-1,,1463928681802,-1,Map(),None,None,None,List()), None;
```

This PR is to issue appropriate exceptions in this case. The message will be like
```
org.apache.spark.sql.AnalysisException: Please enable Hive support when operating non-temporary tables: `tbl`;
```
#### How was this patch tested?
Added a test case in `DDLSuite`.

Author: gatorsmile <gatorsmile@gmail.com>
Author: xiaoli <lixiao1983@gmail.com>
Author: Xiao Li <xiaoli@Xiaos-MacBook-Pro.local>

Closes #13093 from gatorsmile/selectAfterCreate.
2016-05-23 18:03:45 -07:00
Xin Wu 01659bc50c [SPARK-15431][SQL] Support LIST FILE(s)|JAR(s) command natively
## What changes were proposed in this pull request?
Currently command `ADD FILE|JAR <filepath | jarpath>` is supported natively in SparkSQL. However, when this command is run, the file/jar is added to the resources that can not be looked up by `LIST FILE(s)|JAR(s)` command because the `LIST` command is passed to Hive command processor in Spark-SQL or simply not supported in Spark-shell. There is no way users can find out what files/jars are added to the spark context.
Refer to [Hive commands](https://cwiki.apache.org/confluence/display/Hive/LanguageManual+Cli)

This PR is to support following commands:
`LIST (FILE[s] [filepath ...] | JAR[s] [jarfile ...])`

### For example:
##### LIST FILE(s)
```
scala> spark.sql("add file hdfs://bdavm009.svl.ibm.com:8020/tmp/test.txt")
res1: org.apache.spark.sql.DataFrame = []
scala> spark.sql("add file hdfs://bdavm009.svl.ibm.com:8020/tmp/test1.txt")
res2: org.apache.spark.sql.DataFrame = []

scala> spark.sql("list file hdfs://bdavm009.svl.ibm.com:8020/tmp/test1.txt").show(false)
+----------------------------------------------+
|result                                        |
+----------------------------------------------+
|hdfs://bdavm009.svl.ibm.com:8020/tmp/test1.txt|
+----------------------------------------------+

scala> spark.sql("list files").show(false)
+----------------------------------------------+
|result                                        |
+----------------------------------------------+
|hdfs://bdavm009.svl.ibm.com:8020/tmp/test1.txt|
|hdfs://bdavm009.svl.ibm.com:8020/tmp/test.txt |
+----------------------------------------------+
```

##### LIST JAR(s)
```
scala> spark.sql("add jar /Users/xinwu/spark/core/src/test/resources/TestUDTF.jar")
res9: org.apache.spark.sql.DataFrame = [result: int]

scala> spark.sql("list jar TestUDTF.jar").show(false)
+---------------------------------------------+
|result                                       |
+---------------------------------------------+
|spark://192.168.1.234:50131/jars/TestUDTF.jar|
+---------------------------------------------+

scala> spark.sql("list jars").show(false)
+---------------------------------------------+
|result                                       |
+---------------------------------------------+
|spark://192.168.1.234:50131/jars/TestUDTF.jar|
+---------------------------------------------+
```
## How was this patch tested?
New test cases are added for Spark-SQL, Spark-Shell and SparkContext API code path.

Author: Xin Wu <xinwu@us.ibm.com>
Author: xin Wu <xinwu@us.ibm.com>

Closes #13212 from xwu0226/list_command.
2016-05-23 17:32:01 -07:00
Dongjoon Hyun 37c617e4f5 [MINOR][SQL][DOCS] Add notes of the deterministic assumption on UDF functions
## What changes were proposed in this pull request?

Spark assumes that UDF functions are deterministic. This PR adds explicit notes about that.

## How was this patch tested?

It's only about docs.

Author: Dongjoon Hyun <dongjoon@apache.org>

Closes #13087 from dongjoon-hyun/SPARK-15282.
2016-05-23 14:19:25 -07:00
Andrew Or 2585d2b322 [SPARK-15279][SQL] Catch conflicting SerDe when creating table
## What changes were proposed in this pull request?

The user may do something like:
```
CREATE TABLE my_tab ROW FORMAT SERDE 'anything' STORED AS PARQUET
CREATE TABLE my_tab ROW FORMAT SERDE 'anything' STORED AS ... SERDE 'myserde'
CREATE TABLE my_tab ROW FORMAT DELIMITED ... STORED AS ORC
CREATE TABLE my_tab ROW FORMAT DELIMITED ... STORED AS ... SERDE 'myserde'
```
None of these should be allowed because the SerDe's conflict. As of this patch:
- `ROW FORMAT DELIMITED` is only compatible with `TEXTFILE`
- `ROW FORMAT SERDE` is only compatible with `TEXTFILE`, `RCFILE` and `SEQUENCEFILE`

## How was this patch tested?

New tests in `DDLCommandSuite`.

Author: Andrew Or <andrew@databricks.com>

Closes #13068 from andrewor14/row-format-conflict.
2016-05-23 11:55:03 -07:00
Wenchen Fan 07c36a2f07 [SPARK-15471][SQL] ScalaReflection cleanup
## What changes were proposed in this pull request?

1. simplify the logic of deserializing option type.
2. simplify the logic of serializing array type, and remove silentSchemaFor
3. remove some unnecessary code.

## How was this patch tested?

existing tests

Author: Wenchen Fan <wenchen@databricks.com>

Closes #13250 from cloud-fan/encoder.
2016-05-23 11:13:27 -07:00
wangyang fc44b694bf [SPARK-15379][SQL] check special invalid date
## What changes were proposed in this pull request?

When invalid date string like "2015-02-29 00:00:00" are cast as date or timestamp using spark sql, it used to not return null but another valid date (2015-03-01 in this case).
In this pr, invalid date string like "2016-02-29" and "2016-04-31" are returned as null when cast as date or timestamp.

## How was this patch tested?

Unit tests are added.

(If this patch involves UI changes, please attach a screenshot; otherwise, remove this)

Author: wangyang <wangyang@haizhi.com>

Closes #13169 from wangyang1992/invalid_date.
2016-05-22 19:30:14 -07:00
Bo Meng 72288fd67e [SPARK-15468][SQL] fix some typos
## What changes were proposed in this pull request?

Fix some typos while browsing the codes.

## How was this patch tested?

None and obvious.

Author: Bo Meng <mengbo@hotmail.com>
Author: bomeng <bmeng@us.ibm.com>

Closes #13246 from bomeng/typo.
2016-05-22 08:10:54 -05:00
Tathagata Das 1ffa608ba5 [SPARK-15428][SQL] Disable multiple streaming aggregations
## What changes were proposed in this pull request?

Incrementalizing plans of with multiple streaming aggregation is tricky and we dont have the necessary support for "delta" to implement correctly. So disabling the support for multiple streaming aggregations.

## How was this patch tested?
Additional unit tests

Author: Tathagata Das <tathagata.das1565@gmail.com>

Closes #13210 from tdas/SPARK-15428.
2016-05-22 02:08:18 -07:00
Reynold Xin 845e447fa0 [SPARK-15459][SQL] Make Range logical and physical explain consistent
## What changes were proposed in this pull request?
This patch simplifies the implementation of Range operator and make the explain string consistent between logical plan and physical plan. To do this, I changed RangeExec to embed a Range logical plan in it.

Before this patch (note that the logical Range and physical Range actually output different information):
```
== Optimized Logical Plan ==
Range 0, 100, 2, 2, [id#8L]

== Physical Plan ==
*Range 0, 2, 2, 50, [id#8L]
```

After this patch:
If step size is 1:
```
== Optimized Logical Plan ==
Range(0, 100, splits=2)

== Physical Plan ==
*Range(0, 100, splits=2)
```

If step size is not 1:
```
== Optimized Logical Plan ==
Range (0, 100, step=2, splits=2)

== Physical Plan ==
*Range (0, 100, step=2, splits=2)
```

## How was this patch tested?
N/A

Author: Reynold Xin <rxin@databricks.com>

Closes #13239 from rxin/SPARK-15459.
2016-05-22 00:03:37 -07:00
gatorsmile a11175eeca [SPARK-15312][SQL] Detect Duplicate Key in Partition Spec and Table Properties
#### What changes were proposed in this pull request?
When there are duplicate keys in the partition specs or table properties, we always use the last value and ignore all the previous values. This is caused by the function call `toMap`.

partition specs or table properties are widely used in multiple DDL statements.

This PR is to detect the duplicates and issue an exception if found.

#### How was this patch tested?
Added test cases in DDLSuite

Author: gatorsmile <gatorsmile@gmail.com>

Closes #13095 from gatorsmile/detectDuplicate.
2016-05-21 23:56:10 -07:00
Reynold Xin 6d0bfb9601 Small documentation and style fix. 2016-05-21 23:12:56 -07:00
Jurriaan Pruis 223f633908 [SPARK-15415][SQL] Fix BroadcastHint when autoBroadcastJoinThreshold is 0 or -1
## What changes were proposed in this pull request?

This PR makes BroadcastHint more deterministic by using a special isBroadcastable property
instead of setting the sizeInBytes to 1.

See https://issues.apache.org/jira/browse/SPARK-15415

## How was this patch tested?

Added testcases to test if the broadcast hash join is included in the plan when the BroadcastHint is supplied and also tests for propagation of the joins.

Author: Jurriaan Pruis <email@jurriaanpruis.nl>

Closes #13244 from jurriaan/broadcast-hint.
2016-05-21 23:01:14 -07:00
gatorsmile 8f0a3d5bcb [SPARK-15330][SQL] Implement Reset Command
#### What changes were proposed in this pull request?
Like `Set` Command in Hive, `Reset` is also supported by Hive. See the link: https://cwiki.apache.org/confluence/display/Hive/LanguageManual+Cli

Below is the related Hive JIRA: https://issues.apache.org/jira/browse/HIVE-3202

This PR is to implement such a command for resetting the SQL-related configuration to the default values. One of the use case shown in HIVE-3202 is listed below:

> For the purpose of optimization we set various configs per query. It's worthy but all those configs should be reset every time for next query.

#### How was this patch tested?
Added a test case.

Author: gatorsmile <gatorsmile@gmail.com>
Author: xiaoli <lixiao1983@gmail.com>
Author: Xiao Li <xiaoli@Xiaos-MacBook-Pro.local>

Closes #13121 from gatorsmile/resetCommand.
2016-05-21 20:07:34 -07:00
Dilip Biswal 5e1ee28984 [SPARK-15114][SQL] Column name generated by typed aggregate is super verbose
## What changes were proposed in this pull request?

Generate a shorter default alias for `AggregateExpression `, In this PR, aggregate function name along with a index is used for generating the alias name.

```SQL
val ds = Seq(1, 3, 2, 5).toDS()
ds.select(typed.sum((i: Int) => i), typed.avg((i: Int) => i)).show()
```

Output before change.
```SQL
+-----------------------------------------------------------------------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------------+
|typedsumdouble(unresolveddeserializer(upcast(input[0, int], IntegerType, - root class: "scala.Int"), value#1), upcast(value))|typedaverage(unresolveddeserializer(upcast(input[0, int], IntegerType, - root class: "scala.Int"), value#1), newInstance(class scala.Tuple2))|
+-----------------------------------------------------------------------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------------+
|                                                                                                                         11.0|                                                                                                                                         2.75|
+-----------------------------------------------------------------------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------------+
```
Output after change:
```SQL
+-----------------+---------------+
|typedsumdouble_c1|typedaverage_c2|
+-----------------+---------------+
|             11.0|           2.75|
+-----------------+---------------+
```

Note: There is one test in ParquetSuites.scala which shows that that the system picked alias
name is not usable and is rejected.  [test](https://github.com/apache/spark/blob/master/sql/hive/src/test/scala/org/apache/spark/sql/hive/parquetSuites.scala#L672-#L687)
## How was this patch tested?

A new test was added in DataSetAggregatorSuite.

Author: Dilip Biswal <dbiswal@us.ibm.com>

Closes #13045 from dilipbiswal/spark-15114.
2016-05-21 08:36:08 -07:00
Dongjoon Hyun f39621c998 [SPARK-15462][SQL][TEST] unresolved === false` is enough in testcases.
## What changes were proposed in this pull request?

In only `catalyst` module, there exists 8 evaluation test cases on unresolved expressions. But, in real-world situation, those cases doesn't happen since they occurs exceptions before evaluations.
```scala
scala> sql("select format_number(null, 3)")
res0: org.apache.spark.sql.DataFrame = [format_number(CAST(NULL AS DOUBLE), 3): string]
scala> sql("select format_number(cast(null as NULL), 3)")
org.apache.spark.sql.catalyst.parser.ParseException:
DataType null() is not supported.(line 1, pos 34)
```

This PR makes those testcases more realistic.
```scala
-    checkEvaluation(FormatNumber(Literal.create(null, NullType), Literal(3)), null)
+    assert(FormatNumber(Literal.create(null, NullType), Literal(3)).resolved === false)
```
Also, this PR also removes redundant `resolved` checking in `FoldablePropagation` optimizer.

## How was this patch tested?

Pass the modified Jenkins tests.

Author: Dongjoon Hyun <dongjoon@apache.org>

Closes #13241 from dongjoon-hyun/SPARK-15462.
2016-05-21 08:11:14 -07:00
Sandeep Singh 666bf2e835 [SPARK-15445][SQL] Build fails for java 1.7 after adding java.mathBigInteger support
## What changes were proposed in this pull request?
Using longValue() and then checking whether the value is in the range for a long manually.

## How was this patch tested?
Existing tests

Author: Sandeep Singh <sandeep@techaddict.me>

Closes #13223 from techaddict/SPARK-15445.
2016-05-21 06:39:47 -05:00
Davies Liu 0e70fd61b4 [SPARK-15438][SQL] improve explain of whole stage codegen
## What changes were proposed in this pull request?

Currently, the explain of a query with whole-stage codegen looks like this
```
>>> df = sqlCtx.range(1000);df2 = sqlCtx.range(1000);df.join(pyspark.sql.functions.broadcast(df2), 'id').explain()
== Physical Plan ==
WholeStageCodegen
:  +- Project [id#1L]
:     +- BroadcastHashJoin [id#1L], [id#4L], Inner, BuildRight, None
:        :- Range 0, 1, 4, 1000, [id#1L]
:        +- INPUT
+- BroadcastExchange HashedRelationBroadcastMode(List(input[0, bigint]))
   +- WholeStageCodegen
      :  +- Range 0, 1, 4, 1000, [id#4L]
```

The problem is that the plan looks much different than logical plan, make us hard to understand the plan (especially when the logical plan is not showed together).

This PR will change it to:

```
>>> df = sqlCtx.range(1000);df2 = sqlCtx.range(1000);df.join(pyspark.sql.functions.broadcast(df2), 'id').explain()
== Physical Plan ==
*Project [id#0L]
+- *BroadcastHashJoin [id#0L], [id#3L], Inner, BuildRight, None
   :- *Range 0, 1, 4, 1000, [id#0L]
   +- BroadcastExchange HashedRelationBroadcastMode(List(input[0, bigint, false]))
      +- *Range 0, 1, 4, 1000, [id#3L]
```

The `*`before the plan means that it's part of whole-stage codegen, it's easy to understand.

## How was this patch tested?

Manually ran some queries and check the explain.

Author: Davies Liu <davies@databricks.com>

Closes #13204 from davies/explain_codegen.
2016-05-20 13:21:53 -07:00
Shixiong Zhu dfa61f7b13 [SPARK-15190][SQL] Support using SQLUserDefinedType for case classes
## What changes were proposed in this pull request?

Right now inferring the schema for case classes happens before searching the SQLUserDefinedType annotation, so the SQLUserDefinedType annotation for case classes doesn't work.

This PR simply changes the inferring order to resolve it. I also reenabled the java.math.BigDecimal test and added two tests for `List`.

## How was this patch tested?

`encodeDecodeTest(UDTCaseClass(new java.net.URI("http://spark.apache.org/")), "udt with case class")`

Author: Shixiong Zhu <shixiong@databricks.com>

Closes #12965 from zsxwing/SPARK-15190.
2016-05-20 12:38:46 -07:00
Kousuke Saruta 22947cd021 [SPARK-15165] [SPARK-15205] [SQL] Introduce place holder for comments in generated code
## What changes were proposed in this pull request?

This PR introduce place holder for comment in generated code and the purpose  is same for #12939 but much safer.

Generated code to be compiled doesn't include actual comments but includes place holder instead.

Place holders in generated code will be replaced with actual comments only at the time of  logging.

Also, this PR can resolve SPARK-15205.

## How was this patch tested?

Existing tests.

Author: Kousuke Saruta <sarutak@oss.nttdata.co.jp>

Closes #12979 from sarutak/SPARK-15205.
2016-05-20 10:56:35 -07:00
Takuya UESHIN 2cbe96e64d [SPARK-15400][SQL] CreateNamedStruct and CreateNamedStructUnsafe should preserve metadata of value expressions if it is NamedExpression.
## What changes were proposed in this pull request?

`CreateNamedStruct` and `CreateNamedStructUnsafe` should preserve metadata of value expressions if it is `NamedExpression` like `CreateStruct` or `CreateStructUnsafe` are doing.

## How was this patch tested?

Existing tests.

Author: Takuya UESHIN <ueshin@happy-camper.st>

Closes #13193 from ueshin/issues/SPARK-15400.
2016-05-20 09:38:34 -07:00
Takuya UESHIN d2e1aa97ef [SPARK-15308][SQL] RowEncoder should preserve nested column name.
## What changes were proposed in this pull request?

The following code generates wrong schema:

```
val schema = new StructType().add(
  "struct",
  new StructType()
    .add("i", IntegerType, nullable = false)
    .add(
      "s",
      new StructType().add("int", IntegerType, nullable = false),
      nullable = false),
  nullable = false)
val ds = sqlContext.range(10).map(l => Row(l, Row(l)))(RowEncoder(schema))
ds.printSchema()
```

This should print as follows:

```
root
 |-- struct: struct (nullable = false)
 |    |-- i: integer (nullable = false)
 |    |-- s: struct (nullable = false)
 |    |    |-- int: integer (nullable = false)
```

but the result is:

```
root
 |-- struct: struct (nullable = false)
 |    |-- col1: integer (nullable = false)
 |    |-- col2: struct (nullable = false)
 |    |    |-- col1: integer (nullable = false)
```

This PR fixes `RowEncoder` to preserve nested column name.

## How was this patch tested?

Existing tests and I added a test to check if `RowEncoder` preserves nested column name.

Author: Takuya UESHIN <ueshin@happy-camper.st>

Closes #13090 from ueshin/issues/SPARK-15308.
2016-05-20 09:34:55 -07:00
Takuya UESHIN d5e1c5acde [SPARK-15313][SQL] EmbedSerializerInFilter rule should keep exprIds of output of surrounded SerializeFromObject.
## What changes were proposed in this pull request?

The following code:

```
val ds = Seq(("a", 1), ("b", 2), ("c", 3)).toDS()
ds.filter(_._1 == "b").select(expr("_1").as[String]).foreach(println(_))
```

throws an Exception:

```
org.apache.spark.sql.catalyst.errors.package$TreeNodeException: Binding attribute, tree: _1#420
 at org.apache.spark.sql.catalyst.errors.package$.attachTree(package.scala:50)
 at org.apache.spark.sql.catalyst.expressions.BindReferences$$anonfun$bindReference$1.applyOrElse(BoundAttribute.scala:88)
 at org.apache.spark.sql.catalyst.expressions.BindReferences$$anonfun$bindReference$1.applyOrElse(BoundAttribute.scala:87)

...
 Cause: java.lang.RuntimeException: Couldn't find _1#420 in [_1#416,_2#417]
 at scala.sys.package$.error(package.scala:27)
 at org.apache.spark.sql.catalyst.expressions.BindReferences$$anonfun$bindReference$1$$anonfun$applyOrElse$1.apply(BoundAttribute.scala:94)
 at org.apache.spark.sql.catalyst.expressions.BindReferences$$anonfun$bindReference$1$$anonfun$applyOrElse$1.apply(BoundAttribute.scala:88)
 at org.apache.spark.sql.catalyst.errors.package$.attachTree(package.scala:49)
 at org.apache.spark.sql.catalyst.expressions.BindReferences$$anonfun$bindReference$1.applyOrElse(BoundAttribute.scala:88)
 at org.apache.spark.sql.catalyst.expressions.BindReferences$$anonfun$bindReference$1.applyOrElse(BoundAttribute.scala:87)
...
```

This is because `EmbedSerializerInFilter` rule drops the `exprId`s of output of surrounded `SerializeFromObject`.

The analyzed and optimized plans of the above example are as follows:

```
== Analyzed Logical Plan ==
_1: string
Project [_1#420]
+- SerializeFromObject [staticinvoke(class org.apache.spark.unsafe.types.UTF8String, StringType, fromString, input[0, scala.Tuple2]._1, true) AS _1#420,input[0, scala.Tuple2]._2 AS _2#421]
   +- Filter <function1>.apply
      +- DeserializeToObject newInstance(class scala.Tuple2), obj#419: scala.Tuple2
         +- LocalRelation [_1#416,_2#417], [[0,1800000001,1,61],[0,1800000001,2,62],[0,1800000001,3,63]]

== Optimized Logical Plan ==
!Project [_1#420]
+- Filter <function1>.apply
   +- LocalRelation [_1#416,_2#417], [[0,1800000001,1,61],[0,1800000001,2,62],[0,1800000001,3,63]]
```

This PR fixes `EmbedSerializerInFilter` rule to keep `exprId`s of output of surrounded `SerializeFromObject`.

The plans after this patch are as follows:

```
== Analyzed Logical Plan ==
_1: string
Project [_1#420]
+- SerializeFromObject [staticinvoke(class org.apache.spark.unsafe.types.UTF8String, StringType, fromString, input[0, scala.Tuple2]._1, true) AS _1#420,input[0, scala.Tuple2]._2 AS _2#421]
   +- Filter <function1>.apply
      +- DeserializeToObject newInstance(class scala.Tuple2), obj#419: scala.Tuple2
         +- LocalRelation [_1#416,_2#417], [[0,1800000001,1,61],[0,1800000001,2,62],[0,1800000001,3,63]]

== Optimized Logical Plan ==
Project [_1#416]
+- Filter <function1>.apply
   +- LocalRelation [_1#416,_2#417], [[0,1800000001,1,61],[0,1800000001,2,62],[0,1800000001,3,63]]
```

## How was this patch tested?

Existing tests and I added a test to check if `filter and then select` works.

Author: Takuya UESHIN <ueshin@happy-camper.st>

Closes #13096 from ueshin/issues/SPARK-15313.
2016-05-19 22:55:44 -07:00
Reynold Xin 3ba34d435c [SPARK-14990][SQL] Fix checkForSameTypeInputExpr (ignore nullability)
## What changes were proposed in this pull request?
This patch fixes a bug in TypeUtils.checkForSameTypeInputExpr. Previously the code was testing on strict equality, which does not taking nullability into account.

This is based on https://github.com/apache/spark/pull/12768. This patch fixed a bug there (with empty expression) and added a test case.

## How was this patch tested?
Added a new test suite and test case.

Closes #12768.

Author: Reynold Xin <rxin@databricks.com>
Author: Oleg Danilov <oleg.danilov@wandisco.com>

Closes #13208 from rxin/SPARK-14990.
2016-05-19 22:14:10 -07:00
Kevin Yu 17591d90e6 [SPARK-11827][SQL] Adding java.math.BigInteger support in Java type inference for POJOs and Java collections
Hello : Can you help check this PR? I am adding support for the java.math.BigInteger for java bean code path. I saw internally spark is converting the BigInteger to BigDecimal in ColumnType.scala and CatalystRowConverter.scala. I use the similar way and convert the BigInteger to the BigDecimal. .

Author: Kevin Yu <qyu@us.ibm.com>

Closes #10125 from kevinyu98/working_on_spark-11827.
2016-05-20 12:41:14 +08:00
Sumedh Mungee d5c47f8ff8 [SPARK-15321] Fix bug where Array[Timestamp] cannot be encoded/decoded correctly
## What changes were proposed in this pull request?

Fix `MapObjects.itemAccessorMethod` to handle `TimestampType`. Without this fix, `Array[Timestamp]` cannot be properly encoded or decoded. To reproduce this, in `ExpressionEncoderSuite`, if you add the following test case:

`encodeDecodeTest(Array(Timestamp.valueOf("2016-01-29 10:00:00")), "array of timestamp")
`
... you will see that (without this fix) it fails with the following output:

```
- encode/decode for array of timestamp: [Ljava.sql.Timestamp;fd9ebde *** FAILED ***
  Exception thrown while decoding
  Converted: [0,1000000010,800000001,52a7ccdc36800]
  Schema: value#61615
  root
  -- value: array (nullable = true)
      |-- element: timestamp (containsNull = true)
  Encoder:
  class[value[0]: array<timestamp>] (ExpressionEncoderSuite.scala:312)
```

## How was this patch tested?

Existing tests

Author: Sumedh Mungee <smungee@gmail.com>

Closes #13108 from smungee/fix-itemAccessorMethod.
2016-05-20 12:30:04 +08:00
Cheng Lian 6ac1c3a040 [SPARK-14346][SQL] Lists unsupported Hive features in SHOW CREATE TABLE output
## What changes were proposed in this pull request?

This PR is a follow-up of #13079. It replaces `hasUnsupportedFeatures: Boolean` in `CatalogTable` with `unsupportedFeatures: Seq[String]`, which contains unsupported Hive features of the underlying Hive table. In this way, we can accurately report all unsupported Hive features in the exception message.

## How was this patch tested?

Updated existing test case to check exception message.

Author: Cheng Lian <lian@databricks.com>

Closes #13173 from liancheng/spark-14346-follow-up.
2016-05-19 12:02:41 -07:00
Kousuke Saruta faafd1e9db [SPARK-15387][SQL] SessionCatalog in SimpleAnalyzer does not need to make database directory.
## What changes were proposed in this pull request?

After #12871 is fixed, we are forced to make `/user/hive/warehouse` when SimpleAnalyzer is used but SimpleAnalyzer may not need the directory.

## How was this patch tested?

Manual test.

Author: Kousuke Saruta <sarutak@oss.nttdata.co.jp>

Closes #13175 from sarutak/SPARK-15387.
2016-05-19 11:51:59 -07:00
gatorsmile ef7a5e0bca [SPARK-14603][SQL][FOLLOWUP] Verification of Metadata Operations by Session Catalog
#### What changes were proposed in this pull request?
This follow-up PR is to address the remaining comments in https://github.com/apache/spark/pull/12385

The major change in this PR is to issue better error messages in PySpark by using the mechanism that was proposed by davies in https://github.com/apache/spark/pull/7135

For example, in PySpark, if we input the following statement:
```python
>>> l = [('Alice', 1)]
>>> df = sqlContext.createDataFrame(l)
>>> df.createTempView("people")
>>> df.createTempView("people")
```
Before this PR, the exception we will get is like
```
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/Users/xiaoli/IdeaProjects/sparkDelivery/python/pyspark/sql/dataframe.py", line 152, in createTempView
    self._jdf.createTempView(name)
  File "/Users/xiaoli/IdeaProjects/sparkDelivery/python/lib/py4j-0.10.1-src.zip/py4j/java_gateway.py", line 933, in __call__
  File "/Users/xiaoli/IdeaProjects/sparkDelivery/python/pyspark/sql/utils.py", line 63, in deco
    return f(*a, **kw)
  File "/Users/xiaoli/IdeaProjects/sparkDelivery/python/lib/py4j-0.10.1-src.zip/py4j/protocol.py", line 312, in get_return_value
py4j.protocol.Py4JJavaError: An error occurred while calling o35.createTempView.
: org.apache.spark.sql.catalyst.analysis.TempTableAlreadyExistsException: Temporary table 'people' already exists;
    at org.apache.spark.sql.catalyst.catalog.SessionCatalog.createTempView(SessionCatalog.scala:324)
    at org.apache.spark.sql.SparkSession.createTempView(SparkSession.scala:523)
    at org.apache.spark.sql.Dataset.createTempView(Dataset.scala:2328)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:606)
    at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:237)
    at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
    at py4j.Gateway.invoke(Gateway.java:280)
    at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:128)
    at py4j.commands.CallCommand.execute(CallCommand.java:79)
    at py4j.GatewayConnection.run(GatewayConnection.java:211)
    at java.lang.Thread.run(Thread.java:745)
```
After this PR, the exception we will get become cleaner:
```
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/Users/xiaoli/IdeaProjects/sparkDelivery/python/pyspark/sql/dataframe.py", line 152, in createTempView
    self._jdf.createTempView(name)
  File "/Users/xiaoli/IdeaProjects/sparkDelivery/python/lib/py4j-0.10.1-src.zip/py4j/java_gateway.py", line 933, in __call__
  File "/Users/xiaoli/IdeaProjects/sparkDelivery/python/pyspark/sql/utils.py", line 75, in deco
    raise AnalysisException(s.split(': ', 1)[1], stackTrace)
pyspark.sql.utils.AnalysisException: u"Temporary table 'people' already exists;"
```

#### How was this patch tested?
Fixed an existing PySpark test case

Author: gatorsmile <gatorsmile@gmail.com>

Closes #13126 from gatorsmile/followup-14684.
2016-05-19 11:46:11 -07:00
Dongjoon Hyun 5907ebfc11 [SPARK-14939][SQL] Add FoldablePropagation optimizer
## What changes were proposed in this pull request?

This PR aims to add new **FoldablePropagation** optimizer that propagates foldable expressions by replacing all attributes with the aliases of original foldable expression. Other optimizations will take advantage of the propagated foldable expressions: e.g. `EliminateSorts` optimizer now can handle the following Case 2 and 3. (Case 1 is the previous implementation.)

1. Literals and foldable expression, e.g. "ORDER BY 1.0, 'abc', Now()"
2. Foldable ordinals, e.g. "SELECT 1.0, 'abc', Now() ORDER BY 1, 2, 3"
3. Foldable aliases, e.g. "SELECT 1.0 x, 'abc' y, Now() z ORDER BY x, y, z"

This PR has been generalized based on cloud-fan 's key ideas many times; he should be credited for the work he did.

**Before**
```
scala> sql("SELECT 1.0, Now() x ORDER BY 1, x").explain
== Physical Plan ==
WholeStageCodegen
:  +- Sort [1.0#5 ASC,x#0 ASC], true, 0
:     +- INPUT
+- Exchange rangepartitioning(1.0#5 ASC, x#0 ASC, 200), None
   +- WholeStageCodegen
      :  +- Project [1.0 AS 1.0#5,1461873043577000 AS x#0]
      :     +- INPUT
      +- Scan OneRowRelation[]
```

**After**
```
scala> sql("SELECT 1.0, Now() x ORDER BY 1, x").explain
== Physical Plan ==
WholeStageCodegen
:  +- Project [1.0 AS 1.0#5,1461873079484000 AS x#0]
:     +- INPUT
+- Scan OneRowRelation[]
```

## How was this patch tested?

Pass the Jenkins tests including a new test case.

Author: Dongjoon Hyun <dongjoon@apache.org>

Closes #12719 from dongjoon-hyun/SPARK-14939.
2016-05-19 15:57:44 +08:00
Wenchen Fan 661c21049b [SPARK-15381] [SQL] physical object operator should define reference correctly
## What changes were proposed in this pull request?

Whole Stage Codegen depends on `SparkPlan.reference` to do some optimization. For physical object operators, they should be consistent with their logical version and set the `reference` correctly.

## How was this patch tested?

new test in DatasetSuite

Author: Wenchen Fan <wenchen@databricks.com>

Closes #13167 from cloud-fan/bug.
2016-05-18 21:43:07 -07:00
Wenchen Fan ebfe3a1f2c [SPARK-15192][SQL] null check for SparkSession.createDataFrame
## What changes were proposed in this pull request?

This PR adds null check in `SparkSession.createDataFrame`, so that we can make sure the passed in rows matches the given schema.

## How was this patch tested?

new tests in `DatasetSuite`

Author: Wenchen Fan <wenchen@databricks.com>

Closes #13008 from cloud-fan/row-encoder.
2016-05-18 18:06:38 -07:00
Dongjoon Hyun d2f81df1ba [MINOR][SQL] Remove unused pattern matching variables in Optimizers.
## What changes were proposed in this pull request?

This PR removes unused pattern matching variable in Optimizers in order to improve readability.

## How was this patch tested?

Pass the existing Jenkins tests.

Author: Dongjoon Hyun <dongjoon@apache.org>

Closes #13145 from dongjoon-hyun/remove_unused_pattern_matching_variables.
2016-05-18 11:51:50 +01:00
Yin Huai 2a5db9c140 [SPARK-14346] Fix scala-2.10 build
## What changes were proposed in this pull request?
Scala 2.10 build was broken by #13079. I am reverting the change of that line.

Author: Yin Huai <yhuai@databricks.com>

Closes #13157 from yhuai/SPARK-14346-fix-scala2.10.
2016-05-17 18:02:31 -07:00
Cheng Lian b674e67c22 [SPARK-14346][SQL] Native SHOW CREATE TABLE for Hive tables/views
## What changes were proposed in this pull request?

This is a follow-up of #12781. It adds native `SHOW CREATE TABLE` support for Hive tables and views. A new field `hasUnsupportedFeatures` is added to `CatalogTable` to indicate whether all table metadata retrieved from the concrete underlying external catalog (i.e. Hive metastore in this case) can be mapped to fields in `CatalogTable`. This flag is useful when the target Hive table contains structures that can't be handled by Spark SQL, e.g., skewed columns and storage handler, etc..

## How was this patch tested?

New test cases are added in `ShowCreateTableSuite` to do round-trip tests.

Author: Cheng Lian <lian@databricks.com>

Closes #13079 from liancheng/spark-14346-show-create-table-for-hive-tables.
2016-05-17 15:56:44 -07:00
Kousuke Saruta c0c3ec3547 [SPARK-15165] [SQL] Codegen can break because toCommentSafeString is not actually safe
## What changes were proposed in this pull request?

toCommentSafeString method replaces "\u" with "\\\\u" to avoid codegen breaking.
But if the even number of "\" is put before "u", like "\\\\u", in the string literal in the query, codegen can break.

Following code causes compilation error.

```
val df = Seq(...).toDF
df.select("'\\\\\\\\u002A/'").show
```

The reason of the compilation error is because "\\\\\\\\\\\\\\\\u002A/" is translated into "*/" (the end of comment).

Due to this unsafety, arbitrary code can be injected like as follows.

```
val df = Seq(...).toDF
// Inject "System.exit(1)"
df.select("'\\\\\\\\u002A/{System.exit(1);}/*'").show
```

## How was this patch tested?

Added new test cases.

Author: Kousuke Saruta <sarutak@oss.nttdata.co.jp>
Author: sarutak <sarutak@oss.nttdata.co.jp>

Closes #12939 from sarutak/SPARK-15165.
2016-05-17 10:07:01 -07:00
Wenchen Fan c36ca651f9 [SPARK-15351][SQL] RowEncoder should support array as the external type for ArrayType
## What changes were proposed in this pull request?

This PR improves `RowEncoder` and `MapObjects`, to support array as the external type for `ArrayType`. The idea is straightforward, we use `Object` as the external input type for `ArrayType`, and determine its type at runtime in `MapObjects`.

## How was this patch tested?

new test in `RowEncoderSuite`

Author: Wenchen Fan <wenchen@databricks.com>

Closes #13138 from cloud-fan/map-object.
2016-05-17 17:02:52 +08:00
Sean Owen 122302cbf5 [SPARK-15290][BUILD] Move annotations, like @Since / @DeveloperApi, into spark-tags
## What changes were proposed in this pull request?

(See https://github.com/apache/spark/pull/12416 where most of this was already reviewed and committed; this is just the module structure and move part. This change does not move the annotations into test scope, which was the apparently problem last time.)

Rename `spark-test-tags` -> `spark-tags`; move common annotations like `Since` to `spark-tags`

## How was this patch tested?

Jenkins tests.

Author: Sean Owen <sowen@cloudera.com>

Closes #13074 from srowen/SPARK-15290.
2016-05-17 09:55:53 +01:00
Reynold Xin e1dc853737 [SPARK-15310][SQL] Rename HiveTypeCoercion -> TypeCoercion
## What changes were proposed in this pull request?
We originally designed the type coercion rules to match Hive, but over time we have diverged. It does not make sense to call it HiveTypeCoercion anymore. This patch renames it TypeCoercion.

## How was this patch tested?
Updated unit tests to reflect the rename.

Author: Reynold Xin <rxin@databricks.com>

Closes #13091 from rxin/SPARK-15310.
2016-05-13 00:15:39 -07:00
Reynold Xin eda2800d44 [SPARK-14541][SQL] Support IFNULL, NULLIF, NVL and NVL2
## What changes were proposed in this pull request?
This patch adds support for a few SQL functions to improve compatibility with other databases: IFNULL, NULLIF, NVL and NVL2. In order to do this, this patch introduced a RuntimeReplaceable expression trait that allows replacing an unevaluable expression in the optimizer before evaluation.

Note that the semantics are not completely identical to other databases in esoteric cases.

## How was this patch tested?
Added a new test suite SQLCompatibilityFunctionSuite.

Closes #12373.

Author: Reynold Xin <rxin@databricks.com>

Closes #13084 from rxin/SPARK-14541.
2016-05-12 22:18:39 -07:00
Reynold Xin ba169c3230 [SPARK-15306][SQL] Move object expressions into expressions.objects package
## What changes were proposed in this pull request?
This patch moves all the object related expressions into expressions.objects package, for better code organization.

## How was this patch tested?
N/A

Author: Reynold Xin <rxin@databricks.com>

Closes #13085 from rxin/SPARK-15306.
2016-05-12 21:35:14 -07:00
Herman van Hovell bb1362eb3b [SPARK-10605][SQL] Create native collect_list/collect_set aggregates
## What changes were proposed in this pull request?
We currently use the Hive implementations for the collect_list/collect_set aggregate functions. This has a few major drawbacks: the use of HiveUDAF (which has quite a bit of overhead) and the lack of support for struct datatypes. This PR adds native implementation of these functions to Spark.

The size of the collected list/set may vary, this means we cannot use the fast, Tungsten, aggregation path to perform the aggregation, and that we fallback to the slower sort based path. Another big issue with these operators is that when the size of the collected list/set grows too large, we can start experiencing large GC pauzes and OOMEs.

This `collect*` aggregates implemented in this PR rely on the sort based aggregate path for correctness. They maintain their own internal buffer which holds the rows for one group at a time. The sortbased aggregation path is triggered by disabling `partialAggregation` for these aggregates (which is kinda funny); this technique is also employed in `org.apache.spark.sql.hiveHiveUDAFFunction`.

I have done some performance testing:
```scala
import org.apache.spark.sql.{Dataset, Row}

sql("create function collect_list2 as 'org.apache.hadoop.hive.ql.udf.generic.GenericUDAFCollectList'")

val df = range(0, 10000000).select($"id", (rand(213123L) * 100000).cast("int").as("grp"))
df.select(countDistinct($"grp")).show

def benchmark(name: String, plan: Dataset[Row], maxItr: Int = 5): Unit = {
   // Do not measure planning.
   plan1.queryExecution.executedPlan

   // Execute the plan a number of times and average the result.
   val start = System.nanoTime
   var i = 0
   while (i < maxItr) {
     plan.rdd.foreach(row => Unit)
     i += 1
   }
   val time = (System.nanoTime - start) / (maxItr * 1000000L)
   println(s"[$name] $maxItr iterations completed in an average time of $time ms.")
}

val plan1 = df.groupBy($"grp").agg(collect_list($"id"))
val plan2 = df.groupBy($"grp").agg(callUDF("collect_list2", $"id"))

benchmark("Spark collect_list", plan1)
...
> [Spark collect_list] 5 iterations completed in an average time of 3371 ms.

benchmark("Hive collect_list", plan2)
...
> [Hive collect_list] 5 iterations completed in an average time of 9109 ms.
```
Performance is improved by a factor 2-3.

## How was this patch tested?
Added tests to `DataFrameAggregateSuite`.

Author: Herman van Hovell <hvanhovell@questtec.nl>

Closes #12874 from hvanhovell/implode.
2016-05-12 13:56:00 -07:00
gatorsmile be617f3d06 [SPARK-14684][SPARK-15277][SQL] Partition Spec Validation in SessionCatalog and Checking Partition Spec Existence Before Dropping
#### What changes were proposed in this pull request?
~~Currently, multiple partitions are allowed to drop by using a single DDL command: Alter Table Drop Partition. However, the internal implementation could break atomicity. That means, we could just drop a subset of qualified partitions, if hitting an exception when dropping one of qualified partitions~~

~~This PR contains the following behavior changes:~~
~~- disallow dropping multiple partitions by a single command ~~
~~- allow users to input predicates in partition specification and issue a nicer error message if the predicate's comparison operator is not `=`.~~
~~- verify the partition spec in SessionCatalog. This can ensure each partition spec in `Drop Partition` does not correspond to multiple partitions.~~

This PR has two major parts:
- Verify the partition spec in SessionCatalog for fixing the following issue:
  ```scala
  sql(s"ALTER TABLE $externalTab DROP PARTITION (ds='2008-04-09', unknownCol='12')")
  ```
  Above example uses an invalid partition spec. Without this PR, we will drop all the partitions. The reason is Hive megastores getPartitions API returns all the partitions if we provide an invalid spec.

- Re-implemented the `dropPartitions` in `HiveClientImpl`. Now, we always check if all the user-specified partition specs exist before attempting to drop the partitions. Previously, we start drop the partition before completing checking the existence of all the partition specs. If any failure happened after we start to drop the partitions, we will log an error message to indicate which partitions have been dropped and which partitions have not been dropped.

#### How was this patch tested?
Modified the existing test cases and added new test cases.

Author: gatorsmile <gatorsmile@gmail.com>
Author: xiaoli <lixiao1983@gmail.com>
Author: Xiao Li <xiaoli@Xiaos-MacBook-Pro.local>

Closes #12801 from gatorsmile/banDropMultiPart.
2016-05-12 11:14:40 -07:00
Liang-Chi Hsieh 470de743ec [SPARK-15094][SPARK-14803][SQL] Remove extra Project added in EliminateSerialization
## What changes were proposed in this pull request?

We will eliminate the pair of `DeserializeToObject` and `SerializeFromObject` in `Optimizer` and add extra `Project`. However, when DeserializeToObject's outputObjectType is ObjectType and its cls can't be processed by unsafe project, it will be failed.

To fix it, we can simply remove the extra `Project` and replace the output attribute of `DeserializeToObject` in another rule.

## How was this patch tested?
`DatasetSuite`.

Author: Liang-Chi Hsieh <simonh@tw.ibm.com>

Closes #12926 from viirya/fix-eliminate-serialization-projection.
2016-05-12 10:11:12 -07:00
Sean Zhong 33c6eb5218 [SPARK-15171][SQL] Deprecate registerTempTable and add dataset.createTempView
## What changes were proposed in this pull request?

Deprecates registerTempTable and add dataset.createTempView, dataset.createOrReplaceTempView.

## How was this patch tested?

Unit tests.

Author: Sean Zhong <seanzhong@databricks.com>

Closes #12945 from clockfly/spark-15171.
2016-05-12 15:51:53 +08:00
Wenchen Fan 46991448aa [SPARK-15160][SQL] support data source table in InMemoryCatalog
## What changes were proposed in this pull request?

This PR adds a new rule to convert `SimpleCatalogRelation` to data source table if its table property contains data source information.

## How was this patch tested?

new test in SQLQuerySuite

Author: Wenchen Fan <wenchen@databricks.com>

Closes #12935 from cloud-fan/ds-table.
2016-05-11 23:55:42 -07:00
Cheng Lian f036dd7ce7 [SPARK-14346] SHOW CREATE TABLE for data source tables
## What changes were proposed in this pull request?

This PR adds native `SHOW CREATE TABLE` DDL command for data source tables. Support for Hive tables will be added in follow-up PR(s).

To show table creation DDL for data source tables created by CTAS statements, this PR also added partitioning and bucketing support for normal `CREATE TABLE ... USING ...` syntax.

## How was this patch tested?

(Please explain how this patch was tested. E.g. unit tests, integration tests, manual tests)

A new test suite `ShowCreateTableSuite` is added in sql/hive package to test the new feature.

Author: Cheng Lian <lian@databricks.com>

Closes #12781 from liancheng/spark-14346-show-create-table.
2016-05-11 20:44:04 -07:00
Eric Liang 6d0368ab8d [SPARK-15259] Sort time metric should not include spill and record insertion time
## What changes were proposed in this pull request?

After SPARK-14669 it seems the sort time metric includes both spill and record insertion time. This makes it not very useful since the metric becomes close to the total execution time of the node.

We should track just the time spent for in-memory sort, as before.

## How was this patch tested?

Verified metric in the UI, also unit test on UnsafeExternalRowSorter.

cc davies

Author: Eric Liang <ekl@databricks.com>
Author: Eric Liang <ekhliang@gmail.com>

Closes #13035 from ericl/fix-metrics.
2016-05-11 11:25:46 -07:00