Commit graph

2783 commits

Author SHA1 Message Date
Davies Liu 9d4e6212fa [SPARK-16802] [SQL] fix overflow in LongToUnsafeRowMap
## What changes were proposed in this pull request?

This patch fix the overflow in LongToUnsafeRowMap when the range of key is very wide (the key is much much smaller then minKey, for example, key is Long.MinValue, minKey is > 0).

## How was this patch tested?

Added regression test (also for SPARK-16740)

Author: Davies Liu <davies@databricks.com>

Closes #14464 from davies/fix_overflow.
2016-08-04 11:20:17 -07:00
Sean Zhong 9d7a47406e [SPARK-16853][SQL] fixes encoder error in DataSet typed select
## What changes were proposed in this pull request?

For DataSet typed select:
```
def select[U1: Encoder](c1: TypedColumn[T, U1]): Dataset[U1]
```
If type T is a case class or a tuple class that is not atomic, the resulting logical plan's schema will mismatch with `Dataset[T]` encoder's schema, which will cause encoder error and throw AnalysisException.

### Before change:
```
scala> case class A(a: Int, b: Int)
scala> Seq((0, A(1,2))).toDS.select($"_2".as[A])
org.apache.spark.sql.AnalysisException: cannot resolve '`a`' given input columns: [_2];
..
```

### After change:
```
scala> case class A(a: Int, b: Int)
scala> Seq((0, A(1,2))).toDS.select($"_2".as[A]).show
+---+---+
|  a|  b|
+---+---+
|  1|  2|
+---+---+
```

## How was this patch tested?

Unit test.

Author: Sean Zhong <seanzhong@databricks.com>

Closes #14474 from clockfly/SPARK-16853.
2016-08-04 19:45:47 +08:00
Cheng Lian 780c7224a5 [MINOR][SQL] Fix minor formatting issue of SortAggregateExec.toString
## What changes were proposed in this pull request?

This PR fixes a minor formatting issue (missing space after comma) of `SorgAggregateExec.toString`.

Before:

```
SortAggregate(key=[a#76,b#77], functions=[max(c#78),min(c#78)], output=[a#76,b#77,max(c)#89,min(c)#90])
+- *Sort [a#76 ASC, b#77 ASC], false, 0
   +- Exchange hashpartitioning(a#76, b#77, 200)
      +- SortAggregate(key=[a#76,b#77], functions=[partial_max(c#78),partial_min(c#78)], output=[a#76,b#77,max#99,min#100])
         +- *Sort [a#76 ASC, b#77 ASC], false, 0
            +- LocalTableScan <empty>, [a#76, b#77, c#78]
```

After:

```
SortAggregate(key=[a#76, b#77], functions=[max(c#78), min(c#78)], output=[a#76, b#77, max(c)#89, min(c)#90])
+- *Sort [a#76 ASC, b#77 ASC], false, 0
   +- Exchange hashpartitioning(a#76, b#77, 200)
      +- SortAggregate(key=[a#76, b#77], functions=[partial_max(c#78), partial_min(c#78)], output=[a#76, b#77, max#99, min#100])
         +- *Sort [a#76 ASC, b#77 ASC], false, 0
            +- LocalTableScan <empty>, [a#76, b#77, c#78]
```

## How was this patch tested?

Manually tested.

Author: Cheng Lian <lian@databricks.com>

Closes #14480 from liancheng/fix-sort-based-agg-string-format.
2016-08-04 13:32:43 +08:00
Holden Karau c5eb1df72f [SPARK-16814][SQL] Fix deprecated parquet constructor usage
## What changes were proposed in this pull request?

Replace deprecated ParquetWriter with the new builders

## How was this patch tested?

Existing tests

Author: Holden Karau <holden@us.ibm.com>

Closes #14419 from holdenk/SPARK-16814-fix-deprecated-parquet-constructor-usage.
2016-08-03 17:08:51 -07:00
Kevin McHale 685b08e261 [SPARK-14204][SQL] register driverClass rather than user-specified class
This is a pull request that was originally merged against branch-1.6 as #12000, now being merged into master as well.  srowen zzcclp JoshRosen

This pull request fixes an issue in which cluster-mode executors fail to properly register a JDBC driver when the driver is provided in a jar by the user, but the driver class name is derived from a JDBC URL (rather than specified by the user). The consequence of this is that all JDBC accesses under the described circumstances fail with an IllegalStateException. I reported the issue here: https://issues.apache.org/jira/browse/SPARK-14204

My proposed solution is to have the executors register the JDBC driver class under all circumstances, not only when the driver is specified by the user.

This patch was tested manually. I built an assembly jar, deployed it to a cluster, and confirmed that the problem was fixed.

Author: Kevin McHale <kevin@premise.com>

Closes #14420 from mchalek/mchalek-jdbc_driver_registration.
2016-08-03 13:15:13 -07:00
Eric Liang e6f226c567 [SPARK-16596] [SQL] Refactor DataSourceScanExec to do partition discovery at execution instead of planning time
## What changes were proposed in this pull request?

Partition discovery is rather expensive, so we should do it at execution time instead of during physical planning. Right now there is not much benefit since ListingFileCatalog will read scan for all partitions at planning time anyways, but this can be optimized in the future. Also, there might be more information for partition pruning not available at planning time.

This PR moves a lot of the file scan logic from planning to execution time. All file scan operations are handled by `FileSourceScanExec`, which handles both batched and non-batched file scans. This requires some duplication with `RowDataSourceScanExec`, but is probably worth it so that `FileSourceScanExec` does not need to depend on an input RDD.

TODO: In another pr, move DataSourceScanExec to it's own file.

## How was this patch tested?

Existing tests (it might be worth adding a test that catalog.listFiles() is delayed until execution, but this can be delayed until there is an actual benefit to doing so).

Author: Eric Liang <ekl@databricks.com>

Closes #14241 from ericl/refactor.
2016-08-03 11:19:55 -07:00
Wenchen Fan ae226283e1 [SQL][MINOR] use stricter type parameter to make it clear that parquet reader returns UnsafeRow
## What changes were proposed in this pull request?

a small code style change, it's better to make the type parameter more accurate.

## How was this patch tested?

N/A

Author: Wenchen Fan <wenchen@databricks.com>

Closes #14458 from cloud-fan/parquet.
2016-08-03 08:23:26 +08:00
Herman van Hovell 2330f3ecbb [SPARK-16836][SQL] Add support for CURRENT_DATE/CURRENT_TIMESTAMP literals
## What changes were proposed in this pull request?
In Spark 1.6 (with Hive support) we could use `CURRENT_DATE` and `CURRENT_TIMESTAMP` functions as literals (without adding braces), for example:
```SQL
select /* Spark 1.6: */ current_date, /* Spark 1.6  & Spark 2.0: */ current_date()
```
This was accidentally dropped in Spark 2.0. This PR reinstates this functionality.

## How was this patch tested?
Added a case to ExpressionParserSuite.

Author: Herman van Hovell <hvanhovell@databricks.com>

Closes #14442 from hvanhovell/SPARK-16836.
2016-08-02 10:09:47 -07:00
Holden Karau 1e9b59b73b [SPARK-16778][SQL][TRIVIAL] Fix deprecation warning with SQLContext
## What changes were proposed in this pull request?

Change to non-deprecated constructor for SQLContext.

## How was this patch tested?

Existing tests

Author: Holden Karau <holden@us.ibm.com>

Closes #14406 from holdenk/SPARK-16778-fix-use-of-deprecated-SQLContext-constructor.
2016-08-01 06:55:31 -07:00
Reynold Xin 579fbcf3bd [SPARK-16805][SQL] Log timezone when query result does not match
## What changes were proposed in this pull request?
It is useful to log the timezone when query result does not match, especially on build machines that have different timezone from AMPLab Jenkins.

## How was this patch tested?
This is a test-only change.

Author: Reynold Xin <rxin@databricks.com>

Closes #14413 from rxin/SPARK-16805.
2016-07-31 18:21:06 -07:00
Wenchen Fan 301fb0d723 [SPARK-16731][SQL] use StructType in CatalogTable and remove CatalogColumn
## What changes were proposed in this pull request?

`StructField` has very similar semantic with `CatalogColumn`, except that `CatalogColumn` use string to express data type. I think it's reasonable to use `StructType` as the `CatalogTable.schema` and remove `CatalogColumn`.

## How was this patch tested?

existing tests.

Author: Wenchen Fan <wenchen@databricks.com>

Closes #14363 from cloud-fan/column.
2016-07-31 18:18:53 -07:00
Eric Liang 957a8ab374 [SPARK-16818] Exchange reuse incorrectly reuses scans over different sets of partitions
## What changes were proposed in this pull request?

This fixes a bug wherethe file scan operator does not take into account partition pruning in its implementation of `sameResult()`. As a result, executions may be incorrect on self-joins over the same base file relation.

The patch here is minimal, but we should reconsider relying on `metadata` for implementing sameResult() in the future, as string representations may not be uniquely identifying.

cc rxin

## How was this patch tested?

Unit tests.

Author: Eric Liang <ekl@databricks.com>

Closes #14425 from ericl/spark-16818.
2016-07-30 22:48:09 -07:00
Sean Owen 0dc4310b47 [SPARK-16694][CORE] Use for/foreach rather than map for Unit expressions whose side effects are required
## What changes were proposed in this pull request?

Use foreach/for instead of map where operation requires execution of body, not actually defining a transformation

## How was this patch tested?

Jenkins

Author: Sean Owen <sowen@cloudera.com>

Closes #14332 from srowen/SPARK-16694.
2016-07-30 04:42:38 -07:00
Tathagata Das bbc247548a [SPARK-16748][SQL] SparkExceptions during planning should not wrapped in TreeNodeException
## What changes were proposed in this pull request?
We do not want SparkExceptions from job failures in the planning phase to create TreeNodeException. Hence do not wrap SparkException in TreeNodeException.

## How was this patch tested?
New unit test

Author: Tathagata Das <tathagata.das1565@gmail.com>

Closes #14395 from tdas/SPARK-16748.
2016-07-29 19:59:35 -07:00
Wesley Tang d1d5069aa3 [SPARK-16664][SQL] Fix persist call on Data frames with more than 200…
## What changes were proposed in this pull request?

f12f11e578 introduced this bug, missed foreach as map

## How was this patch tested?

Test added

Author: Wesley Tang <tangmingjun@mininglamp.com>

Closes #14324 from breakdawn/master.
2016-07-29 04:26:05 -07:00
Sameer Agarwal 3fd39b87bd [SPARK-16764][SQL] Recommend disabling vectorized parquet reader on OutOfMemoryError
## What changes were proposed in this pull request?

We currently don't bound or manage the data array size used by column vectors in the vectorized reader (they're just bound by INT.MAX) which may lead to OOMs while reading data. As a short term fix, this patch intercepts the OutOfMemoryError exception and suggest the user to disable the vectorized parquet reader.

## How was this patch tested?

Existing Tests

Author: Sameer Agarwal <sameerag@cs.berkeley.edu>

Closes #14387 from sameeragarwal/oom.
2016-07-28 13:04:19 -07:00
Sylvain Zimmer 1178d61ede [SPARK-16740][SQL] Fix Long overflow in LongToUnsafeRowMap
## What changes were proposed in this pull request?

Avoid overflow of Long type causing a NegativeArraySizeException a few lines later.

## How was this patch tested?

Unit tests for HashedRelationSuite still pass.

I can confirm the python script I included in https://issues.apache.org/jira/browse/SPARK-16740 works fine with this patch. Unfortunately I don't have the knowledge/time to write a Scala test case for HashedRelationSuite right now. As the patch is pretty obvious I hope it can be included without this.

Thanks!

Author: Sylvain Zimmer <sylvain@sylvainzimmer.com>

Closes #14373 from sylvinus/master.
2016-07-28 09:51:45 -07:00
Liang-Chi Hsieh 9ade77c3fa [SPARK-16639][SQL] The query with having condition that contains grouping by column should work
## What changes were proposed in this pull request?

The query with having condition that contains grouping by column will be failed during analysis. E.g.,

    create table tbl(a int, b string);
    select count(b) from tbl group by a + 1 having a + 1 = 2;

Having condition should be able to use grouping by column.

## How was this patch tested?

Jenkins tests.

Author: Liang-Chi Hsieh <simonh@tw.ibm.com>

Closes #14296 from viirya/having-contains-grouping-column.
2016-07-28 22:33:33 +08:00
gatorsmile 762366fd87 [SPARK-16552][SQL] Store the Inferred Schemas into External Catalog Tables when Creating Tables
#### What changes were proposed in this pull request?

Currently, in Spark SQL, the initial creation of schema can be classified into two groups. It is applicable to both Hive tables and Data Source tables:

**Group A. Users specify the schema.**

_Case 1 CREATE TABLE AS SELECT_: the schema is determined by the result schema of the SELECT clause. For example,
```SQL
CREATE TABLE tab STORED AS TEXTFILE
AS SELECT * from input
```

_Case 2 CREATE TABLE_: users explicitly specify the schema. For example,
```SQL
CREATE TABLE jsonTable (_1 string, _2 string)
USING org.apache.spark.sql.json
```

**Group B. Spark SQL infers the schema at runtime.**

_Case 3 CREATE TABLE_. Users do not specify the schema but the path to the file location. For example,
```SQL
CREATE TABLE jsonTable
USING org.apache.spark.sql.json
OPTIONS (path '${tempDir.getCanonicalPath}')
```

Before this PR, Spark SQL does not store the inferred schema in the external catalog for the cases in Group B. When users refreshing the metadata cache, accessing the table at the first time after (re-)starting Spark, Spark SQL will infer the schema and store the info in the metadata cache for improving the performance of subsequent metadata requests. However, the runtime schema inference could cause undesirable schema changes after each reboot of Spark.

This PR is to store the inferred schema in the external catalog when creating the table. When users intend to refresh the schema after possible changes on external files (table location), they issue `REFRESH TABLE`. Spark SQL will infer the schema again based on the previously specified table location and update/refresh the schema in the external catalog and metadata cache.

In this PR, we do not use the inferred schema to replace the user specified schema for avoiding external behavior changes . Based on the design, user-specified schemas (as described in Group A) can be changed by ALTER TABLE commands, although we do not support them now.

#### How was this patch tested?
TODO: add more cases to cover the changes.

Author: gatorsmile <gatorsmile@gmail.com>

Closes #14207 from gatorsmile/userSpecifiedSchema.
2016-07-28 17:29:26 +08:00
petermaxlee 11d427c924 [SPARK-16730][SQL] Implement function aliases for type casts
## What changes were proposed in this pull request?
Spark 1.x supports using the Hive type name as function names for doing casts, e.g.
```sql
SELECT int(1.0);
SELECT string(2.0);
```

The above query would work in Spark 1.x because Spark 1.x fail back to Hive for unimplemented functions, and break in Spark 2.0 because the fall back was removed.

This patch implements function aliases using an analyzer rule for the following cast functions:
- boolean
- tinyint
- smallint
- int
- bigint
- float
- double
- decimal
- date
- timestamp
- binary
- string

## How was this patch tested?
Added end-to-end tests in SQLCompatibilityFunctionSuite.

Author: petermaxlee <petermaxlee@gmail.com>

Closes #14364 from petermaxlee/SPARK-16730-2.
2016-07-28 13:13:17 +08:00
Liang-Chi Hsieh 045fc36066 [MINOR][DOC][SQL] Fix two documents regarding size in bytes
## What changes were proposed in this pull request?

Fix two places in SQLConf documents regarding size in bytes and statistics.

## How was this patch tested?
No. Just change document.

Author: Liang-Chi Hsieh <simonh@tw.ibm.com>

Closes #14341 from viirya/fix-doc-size-in-bytes.
2016-07-27 21:14:20 +08:00
Dongjoon Hyun 5b8e848bbf [SPARK-16621][SQL] Generate stable SQLs in SQLBuilder
## What changes were proposed in this pull request?

Currently, the generated SQLs have not-stable IDs for generated attributes.
The stable generated SQL will give more benefit for understanding or testing the queries.
This PR provides stable SQL generation by the followings.

 - Provide unique ids for generated subqueries, `gen_subquery_xxx`.
 - Provide unique and stable ids for generated attributes, `gen_attr_xxx`.

**Before**
```scala
scala> new org.apache.spark.sql.catalyst.SQLBuilder(sql("select 1")).toSQL
res0: String = SELECT `gen_attr_0` AS `1` FROM (SELECT 1 AS `gen_attr_0`) AS gen_subquery_0
scala> new org.apache.spark.sql.catalyst.SQLBuilder(sql("select 1")).toSQL
res1: String = SELECT `gen_attr_4` AS `1` FROM (SELECT 1 AS `gen_attr_4`) AS gen_subquery_0
```

**After**
```scala
scala> new org.apache.spark.sql.catalyst.SQLBuilder(sql("select 1")).toSQL
res1: String = SELECT `gen_attr_0` AS `1` FROM (SELECT 1 AS `gen_attr_0`) AS gen_subquery_0
scala> new org.apache.spark.sql.catalyst.SQLBuilder(sql("select 1")).toSQL
res2: String = SELECT `gen_attr_0` AS `1` FROM (SELECT 1 AS `gen_attr_0`) AS gen_subquery_0
```

## How was this patch tested?

Pass the existing Jenkins tests.

Author: Dongjoon Hyun <dongjoon@apache.org>

Closes #14257 from dongjoon-hyun/SPARK-16621.
2016-07-27 13:23:59 +08:00
Qifan Pu 738b4cc548 [SPARK-16524][SQL] Add RowBatch and RowBasedHashMapGenerator
## What changes were proposed in this pull request?

This PR is the first step for the following feature:

For hash aggregation in Spark SQL, we use a fast aggregation hashmap to act as a "cache" in order to boost aggregation performance. Previously, the hashmap is backed by a `ColumnarBatch`. This has performance issues when we have wide schema for the aggregation table (large number of key fields or value fields).
In this JIRA, we support another implementation of fast hashmap, which is backed by a `RowBasedKeyValueBatch`. We then automatically pick between the two implementations based on certain knobs.

In this first-step PR, implementations for `RowBasedKeyValueBatch` and `RowBasedHashMapGenerator` are added.

## How was this patch tested?

Unit tests: `RowBasedKeyValueBatchSuite`

Author: Qifan Pu <qifan.pu@gmail.com>

Closes #14349 from ooq/SPARK-16524.
2016-07-26 18:08:07 -07:00
Wenchen Fan a2abb583ca [SPARK-16663][SQL] desc table should be consistent between data source and hive serde tables
## What changes were proposed in this pull request?

Currently there are 2 inconsistence:

1. for data source table, we only print partition names, for hive table, we also print partition schema. After this PR, we will always print schema
2. if column doesn't have comment, data source table will print empty string, hive table will print null. After this PR, we will always print null

## How was this patch tested?

new test in `HiveDDLSuite`

Author: Wenchen Fan <wenchen@databricks.com>

Closes #14302 from cloud-fan/minor3.
2016-07-26 18:46:12 +08:00
hyukjinkwon 3b2b785ece [SPARK-16675][SQL] Avoid per-record type dispatch in JDBC when writing
## What changes were proposed in this pull request?

Currently, `JdbcUtils.savePartition` is doing type-based dispatch for each row to write appropriate values.

So, appropriate setters for `PreparedStatement` can be created first according to the schema, and then apply them to each row. This approach is similar with `CatalystWriteSupport`.

This PR simply make the setters to avoid this.

## How was this patch tested?

Existing tests should cover this.

Author: hyukjinkwon <gurwls223@gmail.com>

Closes #14323 from HyukjinKwon/SPARK-16675.
2016-07-26 17:14:58 +08:00
Wenchen Fan 6959061f02 [SPARK-16706][SQL] support java map in encoder
## What changes were proposed in this pull request?

finish the TODO, create a new expression `ExternalMapToCatalyst` to iterate the map directly.

## How was this patch tested?

new test in `JavaDatasetSuite`

Author: Wenchen Fan <wenchen@databricks.com>

Closes #14344 from cloud-fan/java-map.
2016-07-26 15:33:05 +08:00
Liang-Chi Hsieh 7b06a8948f [SPARK-16686][SQL] Remove PushProjectThroughSample since it is handled by ColumnPruning
## What changes were proposed in this pull request?

We push down `Project` through `Sample` in `Optimizer` by the rule `PushProjectThroughSample`. However, if the projected columns produce new output, they will encounter whole data instead of sampled data. It will bring some inconsistency between original plan (Sample then Project) and optimized plan (Project then Sample). In the extreme case such as attached in the JIRA, if the projected column is an UDF which is supposed to not see the sampled out data, the result of UDF will be incorrect.

Since the rule `ColumnPruning` already handles general `Project` pushdown. We don't need  `PushProjectThroughSample` anymore. The rule `ColumnPruning` also avoids the described issue.

## How was this patch tested?

Jenkins tests.

Author: Liang-Chi Hsieh <simonh@tw.ibm.com>

Closes #14327 from viirya/fix-sample-pushdown.
2016-07-26 12:00:01 +08:00
Yin Huai 815f3eece5 [SPARK-16633][SPARK-16642][SPARK-16721][SQL] Fixes three issues related to lead and lag functions
## What changes were proposed in this pull request?
This PR contains three changes.

First, this PR changes the behavior of lead/lag back to Spark 1.6's behavior, which is described as below:
1. lead/lag respect null input values, which means that if the offset row exists and the input value is null, the result will be null instead of the default value.
2. If the offset row does not exist, the default value will be used.
3. OffsetWindowFunction's nullable setting also considers the nullability of its input (because of the first change).

Second, this PR fixes the evaluation of lead/lag when the input expression is a literal. This fix is a result of the first change. In current master, if a literal is used as the input expression of a lead or lag function, the result will be this literal even if the offset row does not exist.

Third, this PR makes ResolveWindowFrame not fire if a window function is not resolved.

## How was this patch tested?
New tests in SQLWindowFunctionSuite

Author: Yin Huai <yhuai@databricks.com>

Closes #14284 from yhuai/lead-lag.
2016-07-25 20:58:07 -07:00
Dongjoon Hyun 8a8d26f1e2 [SPARK-16672][SQL] SQLBuilder should not raise exceptions on EXISTS queries
## What changes were proposed in this pull request?

Currently, `SQLBuilder` raises `empty.reduceLeft` exceptions on *unoptimized* `EXISTS` queries. We had better prevent this.
```scala
scala> sql("CREATE TABLE t1(a int)")
scala> val df = sql("select * from t1 b where exists (select * from t1 a)")
scala> new org.apache.spark.sql.catalyst.SQLBuilder(df).toSQL
java.lang.UnsupportedOperationException: empty.reduceLeft
```

## How was this patch tested?

Pass the Jenkins tests with a new test suite.

Author: Dongjoon Hyun <dongjoon@apache.org>

Closes #14307 from dongjoon-hyun/SPARK-16672.
2016-07-25 19:52:17 -07:00
gatorsmile 3fc4566941 [SPARK-16678][SPARK-16677][SQL] Fix two View-related bugs
## What changes were proposed in this pull request?
**Issue 1: Disallow Creating/Altering a View when the same-name Table Exists (without IF NOT EXISTS)**
When we create OR alter a view, we check whether the view already exists. In the current implementation, if a table with the same name exists, we treat it as a view. However, this is not the right behavior. We should follow what Hive does. For example,
```
hive> CREATE TABLE tab1 (id int);
OK
Time taken: 0.196 seconds
hive> CREATE OR REPLACE VIEW tab1 AS SELECT * FROM t1;
FAILED: SemanticException [Error 10218]: Existing table is not a view
 The following is an existing table, not a view: default.tab1
hive> ALTER VIEW tab1 AS SELECT * FROM t1;
FAILED: SemanticException [Error 10218]: Existing table is not a view
 The following is an existing table, not a view: default.tab1
hive> CREATE VIEW IF NOT EXISTS tab1 AS SELECT * FROM t1;
OK
Time taken: 0.678 seconds
```

**Issue 2: Strange Error when Issuing Load Table Against A View**
Users should not be allowed to issue LOAD DATA against a view. Currently, when users doing it, we got a very strange runtime error. For example,
```SQL
LOAD DATA LOCAL INPATH "$testData" INTO TABLE $viewName
```
```
java.lang.reflect.InvocationTargetException was thrown.
java.lang.reflect.InvocationTargetException
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:606)
	at org.apache.spark.sql.hive.client.Shim_v0_14.loadTable(HiveShim.scala:680)
```
## How was this patch tested?
Added test cases

Author: gatorsmile <gatorsmile@gmail.com>

Closes #14314 from gatorsmile/tableDDLAgainstView.
2016-07-26 09:32:29 +08:00
Tathagata Das c979c8bba0 [SPARK-14131][STREAMING] SQL Improved fix for avoiding potential deadlocks in HDFSMetadataLog
## What changes were proposed in this pull request?
Current fix for deadlock disables interrupts in the StreamExecution which getting offsets for all sources, and when writing to any metadata log, to avoid potential deadlocks in HDFSMetadataLog(see JIRA for more details). However, disabling interrupts can have unintended consequences in other sources. So I am making the fix more narrow, by disabling interrupt it only in the HDFSMetadataLog. This is a narrower fix for something risky like disabling interrupt.

## How was this patch tested?
Existing tests.

Author: Tathagata Das <tathagata.das1565@gmail.com>

Closes #14292 from tdas/SPARK-14131.
2016-07-25 16:09:22 -07:00
hyukjinkwon 79826f3c79 [SPARK-16698][SQL] Field names having dots should be allowed for datasources based on FileFormat
## What changes were proposed in this pull request?

It seems this is a regression assuming from https://issues.apache.org/jira/browse/SPARK-16698.

Field name having dots throws an exception. For example the codes below:

```scala
val path = "/tmp/path"
val json =""" {"a.b":"data"}"""
spark.sparkContext
  .parallelize(json :: Nil)
  .saveAsTextFile(path)
spark.read.json(path).collect()
```

throws an exception as below:

```
Unable to resolve a.b given [a.b];
org.apache.spark.sql.AnalysisException: Unable to resolve a.b given [a.b];
	at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan$$anonfun$resolve$1$$anonfun$apply$5.apply(LogicalPlan.scala:134)
	at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan$$anonfun$resolve$1$$anonfun$apply$5.apply(LogicalPlan.scala:134)
	at scala.Option.getOrElse(Option.scala:121)
```

This problem was introduced in 17eec0a71b (diff-27c76f96a7b2733ecfd6f46a1716e153R121)

When extracting the data columns, it does not count that it can contains dots in field names. Actually, it seems the fields name are not expected as quoted when defining schema. So, It not have to consider whether this is wrapped with quotes because the actual schema (inferred or user-given schema) would not have the quotes for fields.

For example, this throws an exception. (**Loading JSON from RDD is fine**)

```scala
val json =""" {"a.b":"data"}"""
val rdd = spark.sparkContext.parallelize(json :: Nil)
spark.read.schema(StructType(Seq(StructField("`a.b`", StringType, true))))
  .json(rdd).select("`a.b`").printSchema()
```

as below:

```
cannot resolve '```a.b```' given input columns: [`a.b`];
org.apache.spark.sql.AnalysisException: cannot resolve '```a.b```' given input columns: [`a.b`];
	at org.apache.spark.sql.catalyst.analysis.package$AnalysisErrorAt.failAnalysis(package.scala:42)
```

## How was this patch tested?

Unit tests in `FileSourceStrategySuite`.

Author: hyukjinkwon <gurwls223@gmail.com>

Closes #14339 from HyukjinKwon/SPARK-16698-regression.
2016-07-25 22:51:30 +08:00
Sameer Agarwal d6a52176ad [SPARK-16668][TEST] Test parquet reader for row groups containing both dictionary and plain encoded pages
## What changes were proposed in this pull request?

This patch adds an explicit test for [SPARK-14217] by setting the parquet dictionary and page size the generated parquet file spans across 3 pages (within a single row group) where the first page is dictionary encoded and the remaining two are plain encoded.

## How was this patch tested?

1. ParquetEncodingSuite
2. Also manually tested that this test fails without https://github.com/apache/spark/pull/12279

Author: Sameer Agarwal <sameerag@cs.berkeley.edu>

Closes #14304 from sameeragarwal/hybrid-encoding-test.
2016-07-25 22:31:01 +08:00
Wenchen Fan 64529b186a [SPARK-16691][SQL] move BucketSpec to catalyst module and use it in CatalogTable
## What changes were proposed in this pull request?

It's weird that we have `BucketSpec` to abstract bucket info, but don't use it in `CatalogTable`. This PR moves `BucketSpec` into catalyst module.

## How was this patch tested?

existing tests.

Author: Wenchen Fan <wenchen@databricks.com>

Closes #14331 from cloud-fan/check.
2016-07-25 22:05:48 +08:00
Wenchen Fan d27d362eba [SPARK-16660][SQL] CreateViewCommand should not take CatalogTable
## What changes were proposed in this pull request?

`CreateViewCommand` only needs some information of a `CatalogTable`, but not all of them. We have some tricks(e.g. we need to check the table type is `VIEW`, we need to make `CatalogColumn.dataType` nullable) to allow it to take a `CatalogTable`.
This PR cleans it up and only pass in necessary information to `CreateViewCommand`.

## How was this patch tested?

existing tests.

Author: Wenchen Fan <wenchen@databricks.com>

Closes #14297 from cloud-fan/minor2.
2016-07-25 22:02:00 +08:00
hyukjinkwon 7ffd99ec5f [SPARK-16674][SQL] Avoid per-record type dispatch in JDBC when reading
## What changes were proposed in this pull request?

Currently, `JDBCRDD.compute` is doing type dispatch for each row to read appropriate values.
It might not have to be done like this because the schema is already kept in `JDBCRDD`.

So, appropriate converters can be created first according to the schema, and then apply them to each row.

## How was this patch tested?

Existing tests should cover this.

Author: hyukjinkwon <gurwls223@gmail.com>

Closes #14313 from HyukjinKwon/SPARK-16674.
2016-07-25 19:57:47 +08:00
Qifan Pu 468a3c3ac5 [SPARK-16699][SQL] Fix performance bug in hash aggregate on long string keys
In the following code in `VectorizedHashMapGenerator.scala`:
```
    def hashBytes(b: String): String = {
      val hash = ctx.freshName("hash")
      s"""
         |int $result = 0;
         |for (int i = 0; i < $b.length; i++) {
         |  ${genComputeHash(ctx, s"$b[i]", ByteType, hash)}
         |  $result = ($result ^ (0x9e3779b9)) + $hash + ($result << 6) + ($result >>> 2);
         |}
       """.stripMargin
    }

```
when b=input.getBytes(), the current 2.0 code results in getBytes() being called n times, n being length of input. getBytes() involves memory copy is thus expensive and causes a performance degradation.
Fix is to evaluate getBytes() before the for loop.

Performance bug, no additional test added.

Author: Qifan Pu <qifan.pu@gmail.com>

Closes #14337 from ooq/SPARK-16699.

(cherry picked from commit d226dce12b)
Signed-off-by: Reynold Xin <rxin@databricks.com>
2016-07-24 21:54:42 -07:00
Wenchen Fan 1221ce0402 [SPARK-16645][SQL] rename CatalogStorageFormat.serdeProperties to properties
## What changes were proposed in this pull request?

we also store data source table options in this field, it's unreasonable to call it `serdeProperties`.

## How was this patch tested?

N/A

Author: Wenchen Fan <wenchen@databricks.com>

Closes #14283 from cloud-fan/minor1.
2016-07-25 09:28:56 +08:00
Dongjoon Hyun cc1d2dcb61 [SPARK-16463][SQL] Support truncate option in Overwrite mode for JDBC DataFrameWriter
## What changes were proposed in this pull request?

This PR adds a boolean option, `truncate`, for `SaveMode.Overwrite` of JDBC DataFrameWriter. If this option is `true`, it try to take advantage of `TRUNCATE TABLE` instead of `DROP TABLE`. This is a trivial option, but will provide great **convenience** for BI tool users based on RDBMS tables generated by Spark.

**Goal**
- Without `CREATE/DROP` privilege, we can save dataframe to database. Sometime these are not allowed for security.
- It will preserve the existing table information, so users can add and keep some additional `INDEX` and `CONSTRAINT`s for the table.
- Sometime, `TRUNCATE` is faster than the combination of `DROP/CREATE`.

**Supported DBMS**
The following is `truncate`-option support table. Due to the different behavior of `TRUNCATE TABLE` among DBMSs, it's not always safe to use `TRUNCATE TABLE`. Spark will ignore the `truncate` option for **unknown** and **some** DBMS with **default CASCADING** behavior. Newly added JDBCDialect should implement corresponding function to support `truncate` option additionally.

Spark Dialects | `truncate` OPTION SUPPORT
---------------|-------------------------------
MySQLDialect | O
PostgresDialect | X
DB2Dialect | O
MsSqlServerDialect | O
DerbyDialect | O
OracleDialect | O

**Before (TABLE with INDEX case)**: SparkShell & MySQL CLI are interleaved intentionally.
```scala
scala> val (url, prop)=("jdbc:mysql://localhost:3306/temp?useSSL=false", new java.util.Properties)
scala> prop.setProperty("user","root")
scala> df.write.mode("overwrite").jdbc(url, "table_with_index", prop)
scala> spark.range(10).write.mode("overwrite").jdbc(url, "table_with_index", prop)
mysql> DESC table_with_index;
+-------+------------+------+-----+---------+-------+
| Field | Type       | Null | Key | Default | Extra |
+-------+------------+------+-----+---------+-------+
| id    | bigint(20) | NO   |     | NULL    |       |
+-------+------------+------+-----+---------+-------+
mysql> CREATE UNIQUE INDEX idx_id ON table_with_index(id);
mysql> DESC table_with_index;
+-------+------------+------+-----+---------+-------+
| Field | Type       | Null | Key | Default | Extra |
+-------+------------+------+-----+---------+-------+
| id    | bigint(20) | NO   | PRI | NULL    |       |
+-------+------------+------+-----+---------+-------+
scala> spark.range(10).write.mode("overwrite").jdbc(url, "table_with_index", prop)
mysql> DESC table_with_index;
+-------+------------+------+-----+---------+-------+
| Field | Type       | Null | Key | Default | Extra |
+-------+------------+------+-----+---------+-------+
| id    | bigint(20) | NO   |     | NULL    |       |
+-------+------------+------+-----+---------+-------+
```

**After (TABLE with INDEX case)**
```scala
scala> spark.range(10).write.mode("overwrite").option("truncate", true).jdbc(url, "table_with_index", prop)
mysql> DESC table_with_index;
+-------+------------+------+-----+---------+-------+
| Field | Type       | Null | Key | Default | Extra |
+-------+------------+------+-----+---------+-------+
| id    | bigint(20) | NO   | PRI | NULL    |       |
+-------+------------+------+-----+---------+-------+
```

**Error Handling**
- In case of exceptions, Spark will not retry. Users should turn off the `truncate` option.
- In case of schema change:
  - If one of the column names changes, this will raise exceptions intuitively.
  - If there exists only type difference, this will work like Append mode.

## How was this patch tested?

Pass the Jenkins tests with a updated testcase.

Author: Dongjoon Hyun <dongjoon@apache.org>

Closes #14086 from dongjoon-hyun/SPARK-16410.
2016-07-24 09:25:02 +01:00
Wenchen Fan 86c2752066 [SPARK-16690][TEST] rename SQLTestUtils.withTempTable to withTempView
## What changes were proposed in this pull request?

after https://github.com/apache/spark/pull/12945, we renamed the `registerTempTable` to `createTempView`, as we do create a view actually. This PR renames `SQLTestUtils.withTempTable` to reflect this change.

## How was this patch tested?

N/A

Author: Wenchen Fan <wenchen@databricks.com>

Closes #14318 from cloud-fan/minor4.
2016-07-23 11:39:48 -07:00
gatorsmile 94f14b52a6 [SPARK-16556][SPARK-16559][SQL] Fix Two Bugs in Bucket Specification
### What changes were proposed in this pull request?

**Issue 1: Silent Ignorance of Bucket Specification When Creating Table Using Schema Inference**

When creating a data source table without explicit specification of schema or SELECT clause, we silently ignore the bucket specification (CLUSTERED BY... SORTED BY...) in [the code](ce3b98bae2/sql/core/src/main/scala/org/apache/spark/sql/execution/command/createDataSourceTables.scala (L339-L354)).

For example,
```SQL
CREATE TABLE jsonTable
USING org.apache.spark.sql.json
OPTIONS (
  path '${tempDir.getCanonicalPath}'
)
CLUSTERED BY (inexistentColumnA) SORTED BY (inexistentColumnB) INTO 2 BUCKETS
```

This PR captures it and issues an error message.

**Issue 2: Got a run-time `java.lang.ArithmeticException` when num of buckets is set to zero.**

For example,
```SQL
CREATE TABLE t USING PARQUET
OPTIONS (PATH '${path.toString}')
CLUSTERED BY (a) SORTED BY (b) INTO 0 BUCKETS
AS SELECT 1 AS a, 2 AS b
```
The exception we got is
```
ERROR org.apache.spark.executor.Executor: Exception in task 0.0 in stage 1.0 (TID 2)
java.lang.ArithmeticException: / by zero
```

This PR captures the misuse and issues an appropriate error message.

### How was this patch tested?
Added a test case in DDLSuite

Author: gatorsmile <gatorsmile@gmail.com>

Closes #14210 from gatorsmile/createTableWithoutSchema.
2016-07-22 13:27:17 +08:00
Sandeep Singh df2c6d59d0 [SPARK-16287][SQL] Implement str_to_map SQL function
## What changes were proposed in this pull request?
This PR adds `str_to_map` SQL function in order to remove Hive fallback.

## How was this patch tested?
Pass the Jenkins tests with newly added.

Author: Sandeep Singh <sandeep@techaddict.me>

Closes #13990 from techaddict/SPARK-16287.
2016-07-22 10:05:21 +08:00
Sameer Agarwal 46f80a3073 [SPARK-16334] Maintain single dictionary per row-batch in vectorized parquet reader
## What changes were proposed in this pull request?

As part of the bugfix in https://github.com/apache/spark/pull/12279, if a row batch consist of both dictionary encoded and non-dictionary encoded pages, we explicitly decode the dictionary for the values that are already dictionary encoded. Currently we reset the dictionary while reading every page that can potentially cause ` java.lang.ArrayIndexOutOfBoundsException` while decoding older pages. This patch fixes the problem by maintaining a single dictionary per row-batch in vectorized parquet reader.

## How was this patch tested?

Manual Tests against a number of hand-generated parquet files.

Author: Sameer Agarwal <sameerag@cs.berkeley.edu>

Closes #14225 from sameeragarwal/vectorized.
2016-07-21 15:34:32 -07:00
Yin Huai 9abd99b3c3 [SPARK-16656][SQL] Try to make CreateTableAsSelectSuite more stable
## What changes were proposed in this pull request?
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/62593/testReport/junit/org.apache.spark.sql.sources/CreateTableAsSelectSuite/create_a_table__drop_it_and_create_another_one_with_the_same_name/ shows that `create a table, drop it and create another one with the same name` failed. But other runs were good. Seems it is a flaky test. This PR tries to make this test more stable.

Author: Yin Huai <yhuai@databricks.com>

Closes #14289 from yhuai/SPARK-16656.
2016-07-21 12:10:26 -07:00
Cheng Lian 69626adddc [SPARK-16632][SQL] Revert PR #14272: Respect Hive schema when merging parquet schema
## What changes were proposed in this pull request?

PR #14278 is a more general and simpler fix for SPARK-16632 than PR #14272. After merging #14278, we no longer need changes made in #14272. So here I revert them.

This PR targets both master and branch-2.0.

## How was this patch tested?

Existing tests.

Author: Cheng Lian <lian@databricks.com>

Closes #14300 from liancheng/revert-pr-14272.
2016-07-21 22:08:34 +08:00
Cheng Lian 8674054d34 [SPARK-16632][SQL] Use Spark requested schema to guide vectorized Parquet reader initialization
## What changes were proposed in this pull request?

In `SpecificParquetRecordReaderBase`, which is used by the vectorized Parquet reader, we convert the Parquet requested schema into a Spark schema to guide column reader initialization. However, the Parquet requested schema is tailored from the schema of the physical file being scanned, and may have inaccurate type information due to bugs of other systems (e.g. HIVE-14294).

On the other hand, we already set the real Spark requested schema into Hadoop configuration in [`ParquetFileFormat`][1]. This PR simply reads out this schema to replace the converted one.

## How was this patch tested?

New test case added in `ParquetQuerySuite`.

[1]: https://github.com/apache/spark/blob/v2.0.0-rc5/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetFileFormat.scala#L292-L294

Author: Cheng Lian <lian@databricks.com>

Closes #14278 from liancheng/spark-16632-simpler-fix.
2016-07-21 17:15:07 +08:00
Sean Owen 864b764eaf [SPARK-16226][SQL] Weaken JDBC isolation level to avoid locking when writing partitions
## What changes were proposed in this pull request?

Saving partitions to JDBC in transaction can use a weaker transaction isolation level to reduce locking. Use better method to check if transactions are supported.

## How was this patch tested?

Existing Jenkins tests.

Author: Sean Owen <sowen@cloudera.com>

Closes #14054 from srowen/SPARK-16226.
2016-07-21 09:23:41 +01:00
Wenchen Fan cfa5ae84ed [SPARK-16644][SQL] Aggregate should not propagate constraints containing aggregate expressions
## What changes were proposed in this pull request?

aggregate expressions can only be executed inside `Aggregate`, if we propagate it up with constraints, the parent operator can not execute it and will fail at runtime.

## How was this patch tested?

new test in SQLQuerySuite

Author: Wenchen Fan <wenchen@databricks.com>
Author: Yin Huai <yhuai@databricks.com>

Closes #14281 from cloud-fan/bug.
2016-07-20 18:37:15 -07:00
Marcelo Vanzin 75a06aa256 [SPARK-16272][CORE] Allow config values to reference conf, env, system props.
This allows configuration to be more flexible, for example, when the cluster does
not have a homogeneous configuration (e.g. packages are installed on different
paths in different nodes). By allowing one to reference the environment from
the conf, it becomes possible to work around those in certain cases.

As part of the implementation, ConfigEntry now keeps track of all "known" configs
(i.e. those created through the use of ConfigBuilder), since that list is used
by the resolution code. This duplicates some code in SQLConf, which could potentially
be merged with this now. It will also make it simpler to implement some missing
features such as filtering which configs show up in the UI or in event logs - which
are not part of this change.

Another change is in the way ConfigEntry reads config data; it now takes a string
map and a function that reads env variables, so that it can be called both from
SparkConf and SQLConf. This makes it so both places follow the same read path,
instead of having to replicate certain logic in SQLConf. There are still a
couple of methods in SQLConf that peek into fields of ConfigEntry directly,
though.

Tested via unit tests, and by using the new variable expansion functionality
in a shell session with a custom spark.sql.hive.metastore.jars value.

Author: Marcelo Vanzin <vanzin@cloudera.com>

Closes #14022 from vanzin/SPARK-16272.
2016-07-20 18:24:35 -07:00
Cheng Lian e651900bd5 [SPARK-16344][SQL] Decoding Parquet array of struct with a single field named "element"
## What changes were proposed in this pull request?

Due to backward-compatibility reasons, the following Parquet schema is ambiguous:

```
optional group f (LIST) {
  repeated group list {
    optional group element {
      optional int32 element;
    }
  }
}
```

According to the parquet-format spec, when interpreted as a standard 3-level layout, this type is equivalent to the following SQL type:

```
ARRAY<STRUCT<element: INT>>
```

However, when interpreted as a legacy 2-level layout, it's equivalent to

```
ARRAY<STRUCT<element: STRUCT<element: INT>>>
```

Historically, to disambiguate these cases, we employed two methods:

- `ParquetSchemaConverter.isElementType()`

  Used to disambiguate the above cases while converting Parquet types to Spark types.

- `ParquetRowConverter.isElementType()`

  Used to disambiguate the above cases while instantiating row converters that convert Parquet records to Spark rows.

Unfortunately, these two methods make different decision about the above problematic Parquet type, and caused SPARK-16344.

`ParquetRowConverter.isElementType()` is necessary for Spark 1.4 and earlier versions because Parquet requested schemata are directly converted from Spark schemata in these versions. The converted Parquet schemata may be incompatible with actual schemata of the underlying physical files when the files are written by a system/library that uses a schema conversion scheme that is different from Spark when writing Parquet LIST and MAP fields.

In Spark 1.5, Parquet requested schemata are always properly tailored from schemata of physical files to be read. Thus `ParquetRowConverter.isElementType()` is no longer necessary. This PR replaces this method with a simply yet accurate scheme: whenever an ambiguous Parquet type is hit, convert the type in question back to a Spark type using `ParquetSchemaConverter` and check whether it matches the corresponding Spark type.

## How was this patch tested?

New test cases added in `ParquetHiveCompatibilitySuite` and `ParquetQuerySuite`.

Author: Cheng Lian <lian@databricks.com>

Closes #14014 from liancheng/spark-16344-for-master-and-2.0.
2016-07-20 16:49:46 -07:00