Commit graph

4080 commits

Author SHA1 Message Date
Kevin McHale 685b08e261 [SPARK-14204][SQL] register driverClass rather than user-specified class
This is a pull request that was originally merged against branch-1.6 as #12000, now being merged into master as well.  srowen zzcclp JoshRosen

This pull request fixes an issue in which cluster-mode executors fail to properly register a JDBC driver when the driver is provided in a jar by the user, but the driver class name is derived from a JDBC URL (rather than specified by the user). The consequence of this is that all JDBC accesses under the described circumstances fail with an IllegalStateException. I reported the issue here: https://issues.apache.org/jira/browse/SPARK-14204

My proposed solution is to have the executors register the JDBC driver class under all circumstances, not only when the driver is specified by the user.

This patch was tested manually. I built an assembly jar, deployed it to a cluster, and confirmed that the problem was fixed.

Author: Kevin McHale <kevin@premise.com>

Closes #14420 from mchalek/mchalek-jdbc_driver_registration.
2016-08-03 13:15:13 -07:00
Eric Liang e6f226c567 [SPARK-16596] [SQL] Refactor DataSourceScanExec to do partition discovery at execution instead of planning time
## What changes were proposed in this pull request?

Partition discovery is rather expensive, so we should do it at execution time instead of during physical planning. Right now there is not much benefit since ListingFileCatalog will read scan for all partitions at planning time anyways, but this can be optimized in the future. Also, there might be more information for partition pruning not available at planning time.

This PR moves a lot of the file scan logic from planning to execution time. All file scan operations are handled by `FileSourceScanExec`, which handles both batched and non-batched file scans. This requires some duplication with `RowDataSourceScanExec`, but is probably worth it so that `FileSourceScanExec` does not need to depend on an input RDD.

TODO: In another pr, move DataSourceScanExec to it's own file.

## How was this patch tested?

Existing tests (it might be worth adding a test that catalog.listFiles() is delayed until execution, but this can be delayed until there is an actual benefit to doing so).

Author: Eric Liang <ekl@databricks.com>

Closes #14241 from ericl/refactor.
2016-08-03 11:19:55 -07:00
Wenchen Fan b55f34370f [SPARK-16714][SPARK-16735][SPARK-16646] array, map, greatest, least's type coercion should handle decimal type
## What changes were proposed in this pull request?

Here is a table about the behaviours of `array`/`map` and `greatest`/`least` in Hive, MySQL and Postgres:

|    |Hive|MySQL|Postgres|
|---|---|---|---|---|
|`array`/`map`|can find a wider type with decimal type arguments, and will truncate the wider decimal type if necessary|can find a wider type with decimal type arguments, no truncation problem|can find a wider type with decimal type arguments, no truncation problem|
|`greatest`/`least`|can find a wider type with decimal type arguments, and truncate if necessary, but can't do string promotion|can find a wider type with decimal type arguments, no truncation problem, but can't do string promotion|can find a wider type with decimal type arguments, no truncation problem, but can't do string promotion|

I think these behaviours makes sense and Spark SQL should follow them.

This PR fixes `array` and `map` by using `findWiderCommonType` to get the wider type.
This PR fixes `greatest` and `least` by add a `findWiderTypeWithoutStringPromotion`, which provides similar semantic of `findWiderCommonType`, but without string promotion.

## How was this patch tested?

new tests in `TypeCoersionSuite`

Author: Wenchen Fan <wenchen@databricks.com>
Author: Yin Huai <yhuai@databricks.com>

Closes #14439 from cloud-fan/bug.
2016-08-03 11:15:09 -07:00
Wenchen Fan ae226283e1 [SQL][MINOR] use stricter type parameter to make it clear that parquet reader returns UnsafeRow
## What changes were proposed in this pull request?

a small code style change, it's better to make the type parameter more accurate.

## How was this patch tested?

N/A

Author: Wenchen Fan <wenchen@databricks.com>

Closes #14458 from cloud-fan/parquet.
2016-08-03 08:23:26 +08:00
gatorsmile b73a570603 [SPARK-16858][SQL][TEST] Removal of TestHiveSharedState
### What changes were proposed in this pull request?
This PR is to remove `TestHiveSharedState`.

Also, this is also associated with the Hive refractoring for removing `HiveSharedState`.

### How was this patch tested?
The existing test cases

Author: gatorsmile <gatorsmile@gmail.com>

Closes #14463 from gatorsmile/removeTestHiveSharedState.
2016-08-02 14:17:45 -07:00
Wenchen Fan a9beeaaaeb [SPARK-16855][SQL] move Greatest and Least from conditionalExpressions.scala to arithmetic.scala
## What changes were proposed in this pull request?

`Greatest` and `Least` are not conditional expressions, but arithmetic expressions.

## How was this patch tested?

N/A

Author: Wenchen Fan <wenchen@databricks.com>

Closes #14460 from cloud-fan/move.
2016-08-02 11:08:32 -07:00
Herman van Hovell 2330f3ecbb [SPARK-16836][SQL] Add support for CURRENT_DATE/CURRENT_TIMESTAMP literals
## What changes were proposed in this pull request?
In Spark 1.6 (with Hive support) we could use `CURRENT_DATE` and `CURRENT_TIMESTAMP` functions as literals (without adding braces), for example:
```SQL
select /* Spark 1.6: */ current_date, /* Spark 1.6  & Spark 2.0: */ current_date()
```
This was accidentally dropped in Spark 2.0. This PR reinstates this functionality.

## How was this patch tested?
Added a case to ExpressionParserSuite.

Author: Herman van Hovell <hvanhovell@databricks.com>

Closes #14442 from hvanhovell/SPARK-16836.
2016-08-02 10:09:47 -07:00
Liang-Chi Hsieh 146001a9ff [SPARK-16062] [SPARK-15989] [SQL] Fix two bugs of Python-only UDTs
## What changes were proposed in this pull request?

There are two related bugs of Python-only UDTs. Because the test case of second one needs the first fix too. I put them into one PR. If it is not appropriate, please let me know.

### First bug: When MapObjects works on Python-only UDTs

`RowEncoder` will use `PythonUserDefinedType.sqlType` for its deserializer expression. If the sql type is `ArrayType`, we will have `MapObjects` working on it. But `MapObjects` doesn't consider `PythonUserDefinedType` as its input data type. It causes error like:

    import pyspark.sql.group
    from pyspark.sql.tests import PythonOnlyPoint, PythonOnlyUDT
    from pyspark.sql.types import *

    schema = StructType().add("key", LongType()).add("val", PythonOnlyUDT())
    df = spark.createDataFrame([(i % 3, PythonOnlyPoint(float(i), float(i))) for i in range(10)], schema=schema)
    df.show()

    File "/home/spark/python/lib/py4j-0.10.1-src.zip/py4j/protocol.py", line 312, in get_return_value py4j.protocol.Py4JJavaError: An error occurred while calling o36.showString.
    : java.lang.RuntimeException: Error while decoding: scala.MatchError: org.apache.spark.sql.types.PythonUserDefinedTypef4ceede8 (of class org.apache.spark.sql.types.PythonUserDefinedType)
    ...

### Second bug: When Python-only UDTs is the element type of ArrayType

    import pyspark.sql.group
    from pyspark.sql.tests import PythonOnlyPoint, PythonOnlyUDT
    from pyspark.sql.types import *

    schema = StructType().add("key", LongType()).add("val", ArrayType(PythonOnlyUDT()))
    df = spark.createDataFrame([(i % 3, [PythonOnlyPoint(float(i), float(i))]) for i in range(10)], schema=schema)
    df.show()

## How was this patch tested?
PySpark's sql tests.

Author: Liang-Chi Hsieh <simonh@tw.ibm.com>

Closes #13778 from viirya/fix-pyudt.
2016-08-02 10:08:18 -07:00
Tom Magrino 1dab63d8d3 [SPARK-16837][SQL] TimeWindow incorrectly drops slideDuration in constructors
## What changes were proposed in this pull request?

Fix of incorrect arguments (dropping slideDuration and using windowDuration) in constructors for TimeWindow.

The JIRA this addresses is here: https://issues.apache.org/jira/browse/SPARK-16837

## How was this patch tested?

Added a test to TimeWindowSuite to check that the results of TimeWindow object apply and TimeWindow class constructor are equivalent.

Author: Tom Magrino <tmagrino@fb.com>

Closes #14441 from tmagrino/windowing-fix.
2016-08-02 09:16:44 -07:00
petermaxlee a1ff72e1cc [SPARK-16850][SQL] Improve type checking error message for greatest/least
## What changes were proposed in this pull request?
Greatest/least function does not have the most friendly error message for data types. This patch improves the error message to not show the Seq type, and use more human readable data types.

Before:
```
org.apache.spark.sql.AnalysisException: cannot resolve 'greatest(CAST(1.0 AS DECIMAL(2,1)), "1.0")' due to data type mismatch: The expressions should all have the same type, got GREATEST (ArrayBuffer(DecimalType(2,1), StringType)).; line 1 pos 7
```

After:
```
org.apache.spark.sql.AnalysisException: cannot resolve 'greatest(CAST(1.0 AS DECIMAL(2,1)), "1.0")' due to data type mismatch: The expressions should all have the same type, got GREATEST(decimal(2,1), string).; line 1 pos 7
```

## How was this patch tested?
Manually verified the output and also added unit tests to ConditionalExpressionSuite.

Author: petermaxlee <petermaxlee@gmail.com>

Closes #14453 from petermaxlee/SPARK-16850.
2016-08-02 19:32:35 +08:00
jiangxingbo 5184df06b3 [SPARK-16793][SQL] Set the temporary warehouse path to sc'conf in TestHive.
## What changes were proposed in this pull request?

With SPARK-15034, we could use the value of spark.sql.warehouse.dir to set the warehouse location. In TestHive, we can now simply set the temporary warehouse path in sc's conf, and thus, param "warehousePath" could be removed.

## How was this patch tested?

exsiting testsuites.

Author: jiangxingbo <jiangxingbo@meituan.com>

Closes #14401 from jiangxb1987/warehousePath.
2016-08-01 23:08:06 -07:00
Wenchen Fan 2eedc00b04 [SPARK-16828][SQL] remove MaxOf and MinOf
## What changes were proposed in this pull request?

These 2 expressions are not needed anymore after we have `Greatest` and `Least`. This PR removes them and related tests.

## How was this patch tested?

N/A

Author: Wenchen Fan <wenchen@databricks.com>

Closes #14434 from cloud-fan/minor1.
2016-08-01 17:54:41 -07:00
Holden Karau ab1e761f96 [SPARK-16774][SQL] Fix use of deprecated timestamp constructor & improve timezone handling
## What changes were proposed in this pull request?

Removes the deprecated timestamp constructor and incidentally fixes the use which was using system timezone rather than the one specified when working near DST.

This change also causes the roundtrip tests to fail since it now actually uses all the timezones near DST boundaries where it didn't before.

Note: this is only a partial the solution, longer term we should follow up with https://issues.apache.org/jira/browse/SPARK-16788 to avoid this problem & simplify our timezone handling code.

## How was this patch tested?

New tests for two timezones added so even if user timezone happens to coincided with one, the other tests should still fail. Important note: this (temporarily) disables the round trip tests until we can fix the issue more thoroughly.

Author: Holden Karau <holden@us.ibm.com>

Closes #14398 from holdenk/SPARK-16774-fix-use-of-deprecated-timestamp-constructor.
2016-08-01 13:57:05 -07:00
eyal farago 338a98d65c [SPARK-16791][SQL] cast struct with timestamp field fails
## What changes were proposed in this pull request?
a failing test case + fix to SPARK-16791 (https://issues.apache.org/jira/browse/SPARK-16791)

## How was this patch tested?
added a failing test case to CastSuit, then fixed the Cast code and rerun the entire CastSuit

Author: eyal farago <eyal farago>
Author: Eyal Farago <eyal.farago@actimize.com>

Closes #14400 from eyalfa/SPARK-16791_cast_struct_with_timestamp_field_fails.
2016-08-01 22:43:32 +08:00
Holden Karau 1e9b59b73b [SPARK-16778][SQL][TRIVIAL] Fix deprecation warning with SQLContext
## What changes were proposed in this pull request?

Change to non-deprecated constructor for SQLContext.

## How was this patch tested?

Existing tests

Author: Holden Karau <holden@us.ibm.com>

Closes #14406 from holdenk/SPARK-16778-fix-use-of-deprecated-SQLContext-constructor.
2016-08-01 06:55:31 -07:00
Dongjoon Hyun 64d8f37c71 [SPARK-16726][SQL] Improve Union/Intersect/Except error messages on incompatible types
## What changes were proposed in this pull request?

Currently, `UNION` queries on incompatible types show misleading error messages, i.e., `unresolved operator Union`. We had better show a more correct message. This will help users in the situation of [SPARK-16704](https://issues.apache.org/jira/browse/SPARK-16704).

**Before**
```scala
scala> sql("select 1,2,3 union (select 1,array(2),3)")
org.apache.spark.sql.AnalysisException: unresolved operator 'Union;
scala> sql("select 1,2,3 intersect (select 1,array(2),3)")
org.apache.spark.sql.AnalysisException: unresolved operator 'Intersect;
scala> sql("select 1,2,3 except (select 1,array(2),3)")
org.apache.spark.sql.AnalysisException: unresolved operator 'Except;
```

**After**
```scala
scala> sql("select 1,2,3 union (select 1,array(2),3)")
org.apache.spark.sql.AnalysisException: Union can only be performed on tables with the compatible column types. ArrayType(IntegerType,false) <> IntegerType at the second column of the second table;
scala> sql("select 1,2,3 intersect (select 1,array(2),3)")
org.apache.spark.sql.AnalysisException: Intersect can only be performed on tables with the compatible column types. ArrayType(IntegerType,false) <> IntegerType at the second column of the second table;
scala> sql("select 1,2,3 except (select array(1),array(2),3)")
org.apache.spark.sql.AnalysisException: Except can only be performed on tables with the compatible column types. ArrayType(IntegerType,false) <> IntegerType at the first column of the second table;
```

## How was this patch tested?

Pass the Jenkins test with a new test case.

Author: Dongjoon Hyun <dongjoon@apache.org>

Closes #14355 from dongjoon-hyun/SPARK-16726.
2016-08-01 11:12:58 +02:00
Reynold Xin 579fbcf3bd [SPARK-16805][SQL] Log timezone when query result does not match
## What changes were proposed in this pull request?
It is useful to log the timezone when query result does not match, especially on build machines that have different timezone from AMPLab Jenkins.

## How was this patch tested?
This is a test-only change.

Author: Reynold Xin <rxin@databricks.com>

Closes #14413 from rxin/SPARK-16805.
2016-07-31 18:21:06 -07:00
Wenchen Fan 301fb0d723 [SPARK-16731][SQL] use StructType in CatalogTable and remove CatalogColumn
## What changes were proposed in this pull request?

`StructField` has very similar semantic with `CatalogColumn`, except that `CatalogColumn` use string to express data type. I think it's reasonable to use `StructType` as the `CatalogTable.schema` and remove `CatalogColumn`.

## How was this patch tested?

existing tests.

Author: Wenchen Fan <wenchen@databricks.com>

Closes #14363 from cloud-fan/column.
2016-07-31 18:18:53 -07:00
Reynold Xin 064d91ff73 [SPARK-16813][SQL] Remove private[sql] and private[spark] from catalyst package
## What changes were proposed in this pull request?
The catalyst package is meant to be internal, and as a result it does not make sense to mark things as private[sql] or private[spark]. It simply makes debugging harder when Spark developers need to inspect the plans at runtime.

This patch removes all private[sql] and private[spark] visibility modifiers in org.apache.spark.sql.catalyst.

## How was this patch tested?
N/A - just visibility changes.

Author: Reynold Xin <rxin@databricks.com>

Closes #14418 from rxin/SPARK-16813.
2016-07-31 16:31:06 +08:00
Eric Liang 957a8ab374 [SPARK-16818] Exchange reuse incorrectly reuses scans over different sets of partitions
## What changes were proposed in this pull request?

This fixes a bug wherethe file scan operator does not take into account partition pruning in its implementation of `sameResult()`. As a result, executions may be incorrect on self-joins over the same base file relation.

The patch here is minimal, but we should reconsider relying on `metadata` for implementing sameResult() in the future, as string representations may not be uniquely identifying.

cc rxin

## How was this patch tested?

Unit tests.

Author: Eric Liang <ekl@databricks.com>

Closes #14425 from ericl/spark-16818.
2016-07-30 22:48:09 -07:00
Sean Owen 0dc4310b47 [SPARK-16694][CORE] Use for/foreach rather than map for Unit expressions whose side effects are required
## What changes were proposed in this pull request?

Use foreach/for instead of map where operation requires execution of body, not actually defining a transformation

## How was this patch tested?

Jenkins

Author: Sean Owen <sowen@cloudera.com>

Closes #14332 from srowen/SPARK-16694.
2016-07-30 04:42:38 -07:00
Tathagata Das bbc247548a [SPARK-16748][SQL] SparkExceptions during planning should not wrapped in TreeNodeException
## What changes were proposed in this pull request?
We do not want SparkExceptions from job failures in the planning phase to create TreeNodeException. Hence do not wrap SparkException in TreeNodeException.

## How was this patch tested?
New unit test

Author: Tathagata Das <tathagata.das1565@gmail.com>

Closes #14395 from tdas/SPARK-16748.
2016-07-29 19:59:35 -07:00
Wesley Tang d1d5069aa3 [SPARK-16664][SQL] Fix persist call on Data frames with more than 200…
## What changes were proposed in this pull request?

f12f11e578 introduced this bug, missed foreach as map

## How was this patch tested?

Test added

Author: Wesley Tang <tangmingjun@mininglamp.com>

Closes #14324 from breakdawn/master.
2016-07-29 04:26:05 -07:00
Sameer Agarwal 3fd39b87bd [SPARK-16764][SQL] Recommend disabling vectorized parquet reader on OutOfMemoryError
## What changes were proposed in this pull request?

We currently don't bound or manage the data array size used by column vectors in the vectorized reader (they're just bound by INT.MAX) which may lead to OOMs while reading data. As a short term fix, this patch intercepts the OutOfMemoryError exception and suggest the user to disable the vectorized parquet reader.

## How was this patch tested?

Existing Tests

Author: Sameer Agarwal <sameerag@cs.berkeley.edu>

Closes #14387 from sameeragarwal/oom.
2016-07-28 13:04:19 -07:00
Sylvain Zimmer 1178d61ede [SPARK-16740][SQL] Fix Long overflow in LongToUnsafeRowMap
## What changes were proposed in this pull request?

Avoid overflow of Long type causing a NegativeArraySizeException a few lines later.

## How was this patch tested?

Unit tests for HashedRelationSuite still pass.

I can confirm the python script I included in https://issues.apache.org/jira/browse/SPARK-16740 works fine with this patch. Unfortunately I don't have the knowledge/time to write a Scala test case for HashedRelationSuite right now. As the patch is pretty obvious I hope it can be included without this.

Thanks!

Author: Sylvain Zimmer <sylvain@sylvainzimmer.com>

Closes #14373 from sylvinus/master.
2016-07-28 09:51:45 -07:00
Liang-Chi Hsieh 9ade77c3fa [SPARK-16639][SQL] The query with having condition that contains grouping by column should work
## What changes were proposed in this pull request?

The query with having condition that contains grouping by column will be failed during analysis. E.g.,

    create table tbl(a int, b string);
    select count(b) from tbl group by a + 1 having a + 1 = 2;

Having condition should be able to use grouping by column.

## How was this patch tested?

Jenkins tests.

Author: Liang-Chi Hsieh <simonh@tw.ibm.com>

Closes #14296 from viirya/having-contains-grouping-column.
2016-07-28 22:33:33 +08:00
gatorsmile 762366fd87 [SPARK-16552][SQL] Store the Inferred Schemas into External Catalog Tables when Creating Tables
#### What changes were proposed in this pull request?

Currently, in Spark SQL, the initial creation of schema can be classified into two groups. It is applicable to both Hive tables and Data Source tables:

**Group A. Users specify the schema.**

_Case 1 CREATE TABLE AS SELECT_: the schema is determined by the result schema of the SELECT clause. For example,
```SQL
CREATE TABLE tab STORED AS TEXTFILE
AS SELECT * from input
```

_Case 2 CREATE TABLE_: users explicitly specify the schema. For example,
```SQL
CREATE TABLE jsonTable (_1 string, _2 string)
USING org.apache.spark.sql.json
```

**Group B. Spark SQL infers the schema at runtime.**

_Case 3 CREATE TABLE_. Users do not specify the schema but the path to the file location. For example,
```SQL
CREATE TABLE jsonTable
USING org.apache.spark.sql.json
OPTIONS (path '${tempDir.getCanonicalPath}')
```

Before this PR, Spark SQL does not store the inferred schema in the external catalog for the cases in Group B. When users refreshing the metadata cache, accessing the table at the first time after (re-)starting Spark, Spark SQL will infer the schema and store the info in the metadata cache for improving the performance of subsequent metadata requests. However, the runtime schema inference could cause undesirable schema changes after each reboot of Spark.

This PR is to store the inferred schema in the external catalog when creating the table. When users intend to refresh the schema after possible changes on external files (table location), they issue `REFRESH TABLE`. Spark SQL will infer the schema again based on the previously specified table location and update/refresh the schema in the external catalog and metadata cache.

In this PR, we do not use the inferred schema to replace the user specified schema for avoiding external behavior changes . Based on the design, user-specified schemas (as described in Group A) can be changed by ALTER TABLE commands, although we do not support them now.

#### How was this patch tested?
TODO: add more cases to cover the changes.

Author: gatorsmile <gatorsmile@gmail.com>

Closes #14207 from gatorsmile/userSpecifiedSchema.
2016-07-28 17:29:26 +08:00
Dongjoon Hyun 5c2ae79bfc [SPARK-15232][SQL] Add subquery SQL building tests to LogicalPlanToSQLSuite
## What changes were proposed in this pull request?

We currently test subquery SQL building using the `HiveCompatibilitySuite`. The is not desired since SQL building is actually a part of `sql/core` and because we are slowly reducing our dependency on Hive. This PR adds the same tests from the whitelist of `HiveCompatibilitySuite` into `LogicalPlanToSQLSuite`.

## How was this patch tested?

This adds more testcases. Pass the Jenkins tests.

Author: Dongjoon Hyun <dongjoon@apache.org>

Closes #14383 from dongjoon-hyun/SPARK-15232.
2016-07-27 23:29:26 -07:00
petermaxlee 11d427c924 [SPARK-16730][SQL] Implement function aliases for type casts
## What changes were proposed in this pull request?
Spark 1.x supports using the Hive type name as function names for doing casts, e.g.
```sql
SELECT int(1.0);
SELECT string(2.0);
```

The above query would work in Spark 1.x because Spark 1.x fail back to Hive for unimplemented functions, and break in Spark 2.0 because the fall back was removed.

This patch implements function aliases using an analyzer rule for the following cast functions:
- boolean
- tinyint
- smallint
- int
- bigint
- float
- double
- decimal
- date
- timestamp
- binary
- string

## How was this patch tested?
Added end-to-end tests in SQLCompatibilityFunctionSuite.

Author: petermaxlee <petermaxlee@gmail.com>

Closes #14364 from petermaxlee/SPARK-16730-2.
2016-07-28 13:13:17 +08:00
Liang-Chi Hsieh 045fc36066 [MINOR][DOC][SQL] Fix two documents regarding size in bytes
## What changes were proposed in this pull request?

Fix two places in SQLConf documents regarding size in bytes and statistics.

## How was this patch tested?
No. Just change document.

Author: Liang-Chi Hsieh <simonh@tw.ibm.com>

Closes #14341 from viirya/fix-doc-size-in-bytes.
2016-07-27 21:14:20 +08:00
petermaxlee ef0ccbcb07 [SPARK-16729][SQL] Throw analysis exception for invalid date casts
## What changes were proposed in this pull request?
Spark currently throws exceptions for invalid casts for all other data types except date type. Somehow date type returns null. It should be consistent and throws analysis exception as well.

## How was this patch tested?
Added a unit test case in CastSuite.

Author: petermaxlee <petermaxlee@gmail.com>

Closes #14358 from petermaxlee/SPARK-16729.
2016-07-27 16:04:43 +08:00
Dongjoon Hyun 5b8e848bbf [SPARK-16621][SQL] Generate stable SQLs in SQLBuilder
## What changes were proposed in this pull request?

Currently, the generated SQLs have not-stable IDs for generated attributes.
The stable generated SQL will give more benefit for understanding or testing the queries.
This PR provides stable SQL generation by the followings.

 - Provide unique ids for generated subqueries, `gen_subquery_xxx`.
 - Provide unique and stable ids for generated attributes, `gen_attr_xxx`.

**Before**
```scala
scala> new org.apache.spark.sql.catalyst.SQLBuilder(sql("select 1")).toSQL
res0: String = SELECT `gen_attr_0` AS `1` FROM (SELECT 1 AS `gen_attr_0`) AS gen_subquery_0
scala> new org.apache.spark.sql.catalyst.SQLBuilder(sql("select 1")).toSQL
res1: String = SELECT `gen_attr_4` AS `1` FROM (SELECT 1 AS `gen_attr_4`) AS gen_subquery_0
```

**After**
```scala
scala> new org.apache.spark.sql.catalyst.SQLBuilder(sql("select 1")).toSQL
res1: String = SELECT `gen_attr_0` AS `1` FROM (SELECT 1 AS `gen_attr_0`) AS gen_subquery_0
scala> new org.apache.spark.sql.catalyst.SQLBuilder(sql("select 1")).toSQL
res2: String = SELECT `gen_attr_0` AS `1` FROM (SELECT 1 AS `gen_attr_0`) AS gen_subquery_0
```

## How was this patch tested?

Pass the existing Jenkins tests.

Author: Dongjoon Hyun <dongjoon@apache.org>

Closes #14257 from dongjoon-hyun/SPARK-16621.
2016-07-27 13:23:59 +08:00
Qifan Pu 738b4cc548 [SPARK-16524][SQL] Add RowBatch and RowBasedHashMapGenerator
## What changes were proposed in this pull request?

This PR is the first step for the following feature:

For hash aggregation in Spark SQL, we use a fast aggregation hashmap to act as a "cache" in order to boost aggregation performance. Previously, the hashmap is backed by a `ColumnarBatch`. This has performance issues when we have wide schema for the aggregation table (large number of key fields or value fields).
In this JIRA, we support another implementation of fast hashmap, which is backed by a `RowBasedKeyValueBatch`. We then automatically pick between the two implementations based on certain knobs.

In this first-step PR, implementations for `RowBasedKeyValueBatch` and `RowBasedHashMapGenerator` are added.

## How was this patch tested?

Unit tests: `RowBasedKeyValueBatchSuite`

Author: Qifan Pu <qifan.pu@gmail.com>

Closes #14349 from ooq/SPARK-16524.
2016-07-26 18:08:07 -07:00
Wenchen Fan a2abb583ca [SPARK-16663][SQL] desc table should be consistent between data source and hive serde tables
## What changes were proposed in this pull request?

Currently there are 2 inconsistence:

1. for data source table, we only print partition names, for hive table, we also print partition schema. After this PR, we will always print schema
2. if column doesn't have comment, data source table will print empty string, hive table will print null. After this PR, we will always print null

## How was this patch tested?

new test in `HiveDDLSuite`

Author: Wenchen Fan <wenchen@databricks.com>

Closes #14302 from cloud-fan/minor3.
2016-07-26 18:46:12 +08:00
hyukjinkwon 3b2b785ece [SPARK-16675][SQL] Avoid per-record type dispatch in JDBC when writing
## What changes were proposed in this pull request?

Currently, `JdbcUtils.savePartition` is doing type-based dispatch for each row to write appropriate values.

So, appropriate setters for `PreparedStatement` can be created first according to the schema, and then apply them to each row. This approach is similar with `CatalystWriteSupport`.

This PR simply make the setters to avoid this.

## How was this patch tested?

Existing tests should cover this.

Author: hyukjinkwon <gurwls223@gmail.com>

Closes #14323 from HyukjinKwon/SPARK-16675.
2016-07-26 17:14:58 +08:00
Wenchen Fan 6959061f02 [SPARK-16706][SQL] support java map in encoder
## What changes were proposed in this pull request?

finish the TODO, create a new expression `ExternalMapToCatalyst` to iterate the map directly.

## How was this patch tested?

new test in `JavaDatasetSuite`

Author: Wenchen Fan <wenchen@databricks.com>

Closes #14344 from cloud-fan/java-map.
2016-07-26 15:33:05 +08:00
Liang-Chi Hsieh 7b06a8948f [SPARK-16686][SQL] Remove PushProjectThroughSample since it is handled by ColumnPruning
## What changes were proposed in this pull request?

We push down `Project` through `Sample` in `Optimizer` by the rule `PushProjectThroughSample`. However, if the projected columns produce new output, they will encounter whole data instead of sampled data. It will bring some inconsistency between original plan (Sample then Project) and optimized plan (Project then Sample). In the extreme case such as attached in the JIRA, if the projected column is an UDF which is supposed to not see the sampled out data, the result of UDF will be incorrect.

Since the rule `ColumnPruning` already handles general `Project` pushdown. We don't need  `PushProjectThroughSample` anymore. The rule `ColumnPruning` also avoids the described issue.

## How was this patch tested?

Jenkins tests.

Author: Liang-Chi Hsieh <simonh@tw.ibm.com>

Closes #14327 from viirya/fix-sample-pushdown.
2016-07-26 12:00:01 +08:00
Yin Huai 815f3eece5 [SPARK-16633][SPARK-16642][SPARK-16721][SQL] Fixes three issues related to lead and lag functions
## What changes were proposed in this pull request?
This PR contains three changes.

First, this PR changes the behavior of lead/lag back to Spark 1.6's behavior, which is described as below:
1. lead/lag respect null input values, which means that if the offset row exists and the input value is null, the result will be null instead of the default value.
2. If the offset row does not exist, the default value will be used.
3. OffsetWindowFunction's nullable setting also considers the nullability of its input (because of the first change).

Second, this PR fixes the evaluation of lead/lag when the input expression is a literal. This fix is a result of the first change. In current master, if a literal is used as the input expression of a lead or lag function, the result will be this literal even if the offset row does not exist.

Third, this PR makes ResolveWindowFrame not fire if a window function is not resolved.

## How was this patch tested?
New tests in SQLWindowFunctionSuite

Author: Yin Huai <yhuai@databricks.com>

Closes #14284 from yhuai/lead-lag.
2016-07-25 20:58:07 -07:00
Michael Armbrust f99e34e8e5 [SPARK-16724] Expose DefinedByConstructorParams
We don't generally make things in catalyst/execution private.  Instead they are just undocumented due to their lack of stability guarantees.

Author: Michael Armbrust <michael@databricks.com>

Closes #14356 from marmbrus/patch-1.
2016-07-25 20:41:24 -07:00
Dongjoon Hyun 8a8d26f1e2 [SPARK-16672][SQL] SQLBuilder should not raise exceptions on EXISTS queries
## What changes were proposed in this pull request?

Currently, `SQLBuilder` raises `empty.reduceLeft` exceptions on *unoptimized* `EXISTS` queries. We had better prevent this.
```scala
scala> sql("CREATE TABLE t1(a int)")
scala> val df = sql("select * from t1 b where exists (select * from t1 a)")
scala> new org.apache.spark.sql.catalyst.SQLBuilder(df).toSQL
java.lang.UnsupportedOperationException: empty.reduceLeft
```

## How was this patch tested?

Pass the Jenkins tests with a new test suite.

Author: Dongjoon Hyun <dongjoon@apache.org>

Closes #14307 from dongjoon-hyun/SPARK-16672.
2016-07-25 19:52:17 -07:00
gatorsmile 3fc4566941 [SPARK-16678][SPARK-16677][SQL] Fix two View-related bugs
## What changes were proposed in this pull request?
**Issue 1: Disallow Creating/Altering a View when the same-name Table Exists (without IF NOT EXISTS)**
When we create OR alter a view, we check whether the view already exists. In the current implementation, if a table with the same name exists, we treat it as a view. However, this is not the right behavior. We should follow what Hive does. For example,
```
hive> CREATE TABLE tab1 (id int);
OK
Time taken: 0.196 seconds
hive> CREATE OR REPLACE VIEW tab1 AS SELECT * FROM t1;
FAILED: SemanticException [Error 10218]: Existing table is not a view
 The following is an existing table, not a view: default.tab1
hive> ALTER VIEW tab1 AS SELECT * FROM t1;
FAILED: SemanticException [Error 10218]: Existing table is not a view
 The following is an existing table, not a view: default.tab1
hive> CREATE VIEW IF NOT EXISTS tab1 AS SELECT * FROM t1;
OK
Time taken: 0.678 seconds
```

**Issue 2: Strange Error when Issuing Load Table Against A View**
Users should not be allowed to issue LOAD DATA against a view. Currently, when users doing it, we got a very strange runtime error. For example,
```SQL
LOAD DATA LOCAL INPATH "$testData" INTO TABLE $viewName
```
```
java.lang.reflect.InvocationTargetException was thrown.
java.lang.reflect.InvocationTargetException
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:606)
	at org.apache.spark.sql.hive.client.Shim_v0_14.loadTable(HiveShim.scala:680)
```
## How was this patch tested?
Added test cases

Author: gatorsmile <gatorsmile@gmail.com>

Closes #14314 from gatorsmile/tableDDLAgainstView.
2016-07-26 09:32:29 +08:00
Tathagata Das c979c8bba0 [SPARK-14131][STREAMING] SQL Improved fix for avoiding potential deadlocks in HDFSMetadataLog
## What changes were proposed in this pull request?
Current fix for deadlock disables interrupts in the StreamExecution which getting offsets for all sources, and when writing to any metadata log, to avoid potential deadlocks in HDFSMetadataLog(see JIRA for more details). However, disabling interrupts can have unintended consequences in other sources. So I am making the fix more narrow, by disabling interrupt it only in the HDFSMetadataLog. This is a narrower fix for something risky like disabling interrupt.

## How was this patch tested?
Existing tests.

Author: Tathagata Das <tathagata.das1565@gmail.com>

Closes #14292 from tdas/SPARK-14131.
2016-07-25 16:09:22 -07:00
Shixiong Zhu 12f490b5c8 [SPARK-16715][TESTS] Fix a potential ExprId conflict for SubexpressionEliminationSuite."Semantic equals and hash"
## What changes were proposed in this pull request?

SubexpressionEliminationSuite."Semantic equals and hash" assumes the default AttributeReference's exprId wont' be "ExprId(1)". However, that depends on when this test runs. It may happen to use "ExprId(1)".

This PR detects the conflict and makes sure we create a different ExprId when the conflict happens.

## How was this patch tested?

Jenkins unit tests.

Author: Shixiong Zhu <shixiong@databricks.com>

Closes #14350 from zsxwing/SPARK-16715.
2016-07-25 16:08:29 -07:00
Cheng Lian 7ea6d282b9 [SPARK-16703][SQL] Remove extra whitespace in SQL generation for window functions
## What changes were proposed in this pull request?

This PR fixes a minor formatting issue of `WindowSpecDefinition.sql` when no partitioning expressions are present.

Before:

```sql
( ORDER BY `a` ASC ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW)
```

After:

```sql
(ORDER BY `a` ASC ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW)
```

## How was this patch tested?

New test case added in `ExpressionSQLBuilderSuite`.

Author: Cheng Lian <lian@databricks.com>

Closes #14334 from liancheng/window-spec-sql-format.
2016-07-25 09:42:39 -07:00
hyukjinkwon 79826f3c79 [SPARK-16698][SQL] Field names having dots should be allowed for datasources based on FileFormat
## What changes were proposed in this pull request?

It seems this is a regression assuming from https://issues.apache.org/jira/browse/SPARK-16698.

Field name having dots throws an exception. For example the codes below:

```scala
val path = "/tmp/path"
val json =""" {"a.b":"data"}"""
spark.sparkContext
  .parallelize(json :: Nil)
  .saveAsTextFile(path)
spark.read.json(path).collect()
```

throws an exception as below:

```
Unable to resolve a.b given [a.b];
org.apache.spark.sql.AnalysisException: Unable to resolve a.b given [a.b];
	at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan$$anonfun$resolve$1$$anonfun$apply$5.apply(LogicalPlan.scala:134)
	at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan$$anonfun$resolve$1$$anonfun$apply$5.apply(LogicalPlan.scala:134)
	at scala.Option.getOrElse(Option.scala:121)
```

This problem was introduced in 17eec0a71b (diff-27c76f96a7b2733ecfd6f46a1716e153R121)

When extracting the data columns, it does not count that it can contains dots in field names. Actually, it seems the fields name are not expected as quoted when defining schema. So, It not have to consider whether this is wrapped with quotes because the actual schema (inferred or user-given schema) would not have the quotes for fields.

For example, this throws an exception. (**Loading JSON from RDD is fine**)

```scala
val json =""" {"a.b":"data"}"""
val rdd = spark.sparkContext.parallelize(json :: Nil)
spark.read.schema(StructType(Seq(StructField("`a.b`", StringType, true))))
  .json(rdd).select("`a.b`").printSchema()
```

as below:

```
cannot resolve '```a.b```' given input columns: [`a.b`];
org.apache.spark.sql.AnalysisException: cannot resolve '```a.b```' given input columns: [`a.b`];
	at org.apache.spark.sql.catalyst.analysis.package$AnalysisErrorAt.failAnalysis(package.scala:42)
```

## How was this patch tested?

Unit tests in `FileSourceStrategySuite`.

Author: hyukjinkwon <gurwls223@gmail.com>

Closes #14339 from HyukjinKwon/SPARK-16698-regression.
2016-07-25 22:51:30 +08:00
Sameer Agarwal d6a52176ad [SPARK-16668][TEST] Test parquet reader for row groups containing both dictionary and plain encoded pages
## What changes were proposed in this pull request?

This patch adds an explicit test for [SPARK-14217] by setting the parquet dictionary and page size the generated parquet file spans across 3 pages (within a single row group) where the first page is dictionary encoded and the remaining two are plain encoded.

## How was this patch tested?

1. ParquetEncodingSuite
2. Also manually tested that this test fails without https://github.com/apache/spark/pull/12279

Author: Sameer Agarwal <sameerag@cs.berkeley.edu>

Closes #14304 from sameeragarwal/hybrid-encoding-test.
2016-07-25 22:31:01 +08:00
Wenchen Fan 64529b186a [SPARK-16691][SQL] move BucketSpec to catalyst module and use it in CatalogTable
## What changes were proposed in this pull request?

It's weird that we have `BucketSpec` to abstract bucket info, but don't use it in `CatalogTable`. This PR moves `BucketSpec` into catalyst module.

## How was this patch tested?

existing tests.

Author: Wenchen Fan <wenchen@databricks.com>

Closes #14331 from cloud-fan/check.
2016-07-25 22:05:48 +08:00
Wenchen Fan d27d362eba [SPARK-16660][SQL] CreateViewCommand should not take CatalogTable
## What changes were proposed in this pull request?

`CreateViewCommand` only needs some information of a `CatalogTable`, but not all of them. We have some tricks(e.g. we need to check the table type is `VIEW`, we need to make `CatalogColumn.dataType` nullable) to allow it to take a `CatalogTable`.
This PR cleans it up and only pass in necessary information to `CreateViewCommand`.

## How was this patch tested?

existing tests.

Author: Wenchen Fan <wenchen@databricks.com>

Closes #14297 from cloud-fan/minor2.
2016-07-25 22:02:00 +08:00
hyukjinkwon 7ffd99ec5f [SPARK-16674][SQL] Avoid per-record type dispatch in JDBC when reading
## What changes were proposed in this pull request?

Currently, `JDBCRDD.compute` is doing type dispatch for each row to read appropriate values.
It might not have to be done like this because the schema is already kept in `JDBCRDD`.

So, appropriate converters can be created first according to the schema, and then apply them to each row.

## How was this patch tested?

Existing tests should cover this.

Author: hyukjinkwon <gurwls223@gmail.com>

Closes #14313 from HyukjinKwon/SPARK-16674.
2016-07-25 19:57:47 +08:00
Cheng Lian 68b4020d0c [SPARK-16648][SQL] Make ignoreNullsExpr a child expression of First and Last
## What changes were proposed in this pull request?

Default `TreeNode.withNewChildren` implementation doesn't work for `Last` and when both constructor arguments are the same, e.g.:

```sql
LAST_VALUE(FALSE) -- The 2nd argument defaults to FALSE
LAST_VALUE(FALSE, FALSE)
LAST_VALUE(TRUE, TRUE)
```

This is because although `Last` is a unary expression, both of its constructor arguments, `child` and `ignoreNullsExpr`, are `Expression`s. When they have the same value, `TreeNode.withNewChildren` treats both of them as child nodes by mistake. `First` is also affected by this issue in exactly the same way.

This PR fixes this issue by making `ignoreNullsExpr` a child expression of `First` and `Last`.

## How was this patch tested?

New test case added in `WindowQuerySuite`.

Author: Cheng Lian <lian@databricks.com>

Closes #14295 from liancheng/spark-16648-last-value.
2016-07-25 17:22:29 +08:00