773 commits

Author SHA1 Message Date
Reynold Xin 054f991c43 [SPARK-14994][SQL] Remove execution hive from HiveSessionState
## What changes were proposed in this pull request?
This patch removes executionHive from HiveSessionState and HiveSharedState.

## How was this patch tested?
Updated test cases.

Author: Reynold Xin <rxin@databricks.com>
Author: Yin Huai <yhuai@databricks.com>

Closes #12770 from rxin/SPARK-14994.
2016-04-29 01:14:02 -07:00
Reynold Xin e249e6f8b5 [HOTFIX] Disable flaky test StatisticsSuite.analyze MetastoreRelations 2016-04-29 00:23:59 -07:00
Reynold Xin 4607f6e7f7 [SPARK-14991][SQL] Remove HiveNativeCommand
## What changes were proposed in this pull request?
This patch removes HiveNativeCommand, so we can continue to remove the dependency on Hive. This pull request also removes the ability to generate golden result file using Hive.

## How was this patch tested?
Updated tests to reflect this.

Author: Reynold Xin <rxin@databricks.com>

Closes #12769 from rxin/SPARK-14991.
2016-04-28 21:58:48 -07:00
Cheng Lian 24bea00047 [SPARK-14954] [SQL] Add PARTITION BY and BUCKET BY clause for data source CTAS syntax
Currently, we can only create persisted partitioned and/or bucketed data source tables using the Dataset API but not using SQL DDL. This PR implements the following syntax to add partitioning and bucketing support to the SQL DDL:

```
CREATE TABLE <table-name>
USING <provider> [OPTIONS (<key1> <value1>, <key2> <value2>, ...)]
[PARTITIONED BY (col1, col2, ...)]
[CLUSTERED BY (col1, col2, ...) [SORTED BY (col1, col2, ...)] INTO <n> BUCKETS]
AS SELECT ...
```
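
As a rough illustration of how the optional clauses in the grammar above compose, here is a minimal sketch of a helper that assembles such a statement as a string. The function `build_ctas`, its parameters, and the single-quoting of option values are hypothetical and are not part of this PR; it only mirrors the clause order shown above.

```python
def build_ctas(table, provider, select_sql, options=None,
               partition_cols=None, bucket_cols=None,
               sort_cols=None, num_buckets=None):
    """Assemble a data source CTAS statement in the clause order shown above.

    Hypothetical helper for illustration only; it does not validate
    identifiers or escape values.
    """
    if sort_cols and not bucket_cols:
        raise ValueError("SORTED BY is only valid inside a CLUSTERED BY clause")
    if bucket_cols and not num_buckets:
        raise ValueError("CLUSTERED BY requires INTO <n> BUCKETS")

    lines = [f"CREATE TABLE {table}"]

    using = f"USING {provider}"
    if options:
        # Assumption: option values are rendered as single-quoted strings.
        rendered = ", ".join(f"{k} '{v}'" for k, v in options.items())
        using += f" OPTIONS ({rendered})"
    lines.append(using)

    if partition_cols:
        lines.append(f"PARTITIONED BY ({', '.join(partition_cols)})")

    if bucket_cols:
        clause = f"CLUSTERED BY ({', '.join(bucket_cols)})"
        if sort_cols:
            clause += f" SORTED BY ({', '.join(sort_cols)})"
        clause += f" INTO {num_buckets} BUCKETS"
        lines.append(clause)

    lines.append(f"AS {select_sql}")
    return "\n".join(lines)
```

For example, `build_ctas("t", "parquet", "SELECT a, b, c FROM src", partition_cols=["a"], bucket_cols=["b"], sort_cols=["c"], num_buckets=4)` yields a statement with both the `PARTITIONED BY` and `CLUSTERED BY ... SORTED BY ... INTO 4 BUCKETS` clauses (the table and column names here are made up).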

Test cases are added in `MetastoreDataSourcesSuite` to check the newly added syntax.

Author: Cheng Lian <lian@databricks.com>
Author: Yin Huai <yhuai@databricks.com>

Closes #12734 from liancheng/spark-14954.
2016-04-27 13:55:13 -07:00
Yin Huai 54a3eb8312 [SPARK-14130][SQL] Throw exceptions for ALTER TABLE ADD/REPLACE/CHANGE COLUMN, ALTER TABLE SET FILEFORMAT, DFS, and transaction related commands
## What changes were proposed in this pull request?
This PR will make Spark SQL not allow ALTER TABLE ADD/REPLACE/CHANGE COLUMN, ALTER TABLE SET FILEFORMAT, DFS, and transaction related commands.

## How was this patch tested?
Existing tests. For those tests that I put in the blacklist, I am adding the useful parts back to SQLQuerySuite.

Author: Yin Huai <yhuai@databricks.com>

Closes #12714 from yhuai/banNativeCommand.
2016-04-27 00:30:54 -07:00
Reynold Xin d73d67f623 [SPARK-14944][SPARK-14943][SQL] Remove HiveConf from HiveTableScanExec, HiveTableReader, and ScriptTransformation
## What changes were proposed in this pull request?
This patch removes HiveConf from HiveTableScanExec and HiveTableReader and instead just uses our own configuration system. I'm splitting the large change of removing HiveConf into multiple independent pull requests because it is very difficult to debug test failures when they are all combined in one giant one.

## How was this patch tested?
Should be covered by existing tests.

Author: Reynold Xin <rxin@databricks.com>

Closes #12727 from rxin/SPARK-14944.
2016-04-26 23:42:42 -07:00
Reynold Xin 8fda5a73dc [SPARK-14913][SQL] Simplify configuration API
## What changes were proposed in this pull request?
We currently expose both Hadoop configuration and Spark SQL configuration in RuntimeConfig. I think we can remove the Hadoop configuration part, and simply generate Hadoop Configuration on the fly by passing all the SQL configurations into it. This way, there is a single interface (in Java/Scala/Python/SQL) for end-users.

As part of this patch, I also removed some config options deprecated in Spark 1.x.

## How was this patch tested?
Updated relevant tests.

Author: Reynold Xin <rxin@databricks.com>

Closes #12689 from rxin/SPARK-14913.
2016-04-26 22:02:28 -07:00
Andrew Or d8a83a564f [SPARK-13477][SQL] Expose new user-facing Catalog interface
## What changes were proposed in this pull request?

#12625 exposed a new user-facing conf interface in `SparkSession`. This patch adds a catalog interface.

## How was this patch tested?

See `CatalogSuite`.

Author: Andrew Or <andrew@databricks.com>

Closes #12713 from andrewor14/user-facing-catalog.
2016-04-26 21:29:25 -07:00
Dilip Biswal d93976d866 [SPARK-14445][SQL] Support native execution of SHOW COLUMNS and SHOW PARTITIONS
## What changes were proposed in this pull request?
This PR adds native execution of the SHOW COLUMNS and SHOW PARTITIONS commands.

Command Syntax:
``` SQL
SHOW COLUMNS (FROM | IN) table_identifier [(FROM | IN) database]
```
``` SQL
SHOW PARTITIONS [db_name.]table_name [PARTITION(partition_spec)]
```

## How was this patch tested?

Added test cases in HiveCommandSuite to verify execution and DDLCommandSuite
to verify plans.

Author: Dilip Biswal <dbiswal@us.ibm.com>

Closes #12222 from dilipbiswal/dkb_show_columns.
2016-04-27 09:28:24 +08:00
Sameer Agarwal 9797cc20c0 [SPARK-14929] [SQL] Disable vectorized map for wide schemas & high-precision decimals
## What changes were proposed in this pull request?

While the vectorized hash map in `TungstenAggregate` is currently supported for all primitive data types during partial aggregation, this patch only enables the hash map for a subset of cases that have been verified to show performance improvements on our benchmarks, subject to an internal conf that sets an upper limit on the maximum length of the aggregate key/value schema. This list of supported use cases should be expanded over time.

## How was this patch tested?

This patch introduces no change in functionality, so existing tests should suffice. Performance tests were run on TPC-DS benchmarks.

Author: Sameer Agarwal <sameer@databricks.com>

Closes #12710 from sameeragarwal/vectorized-enable.
2016-04-26 14:51:14 -07:00
Davies Liu 7131b03bcf [SPARK-14853] [SQL] Support LeftSemi/LeftAnti in SortMergeJoinExec
## What changes were proposed in this pull request?

This PR updates SortMergeJoinExec to support LeftSemi/LeftAnti, so that it supports all the join types, the same as the other three join implementations: BroadcastHashJoinExec, ShuffledHashJoinExec, and BroadcastNestedLoopJoinExec.

This PR also simplifies the join selection in SparkStrategy.

## How was this patch tested?

Added new tests.

Author: Davies Liu <davies@databricks.com>

Closes #12668 from davies/smj_semi.
2016-04-26 12:43:47 -07:00
Reynold Xin 5cb03220a0 [SPARK-14912][SQL] Propagate data source options to Hadoop configuration
## What changes were proposed in this pull request?
We currently have no way for users to propagate options to underlying libraries that rely on Hadoop configurations to work. For example, there are various options in parquet-mr that users might want to set, but the data source API does not expose a per-job way to set them. This patch also propagates the user-specified options into the Hadoop Configuration.
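
The idea of overlaying per-job options onto a shared configuration can be sketched as follows. This is only a model using plain dictionaries, not Spark's or Hadoop's actual classes; the function name `hadoop_conf_for_job` is hypothetical.

```python
def hadoop_conf_for_job(base_conf, options):
    """Return a per-job copy of base_conf overlaid with user-specified options.

    Both arguments are plain dicts standing in for a Hadoop Configuration
    and the data source options; this is an illustrative model only.
    """
    conf = dict(base_conf)       # copy, so the shared configuration is untouched
    for key, value in options.items():
        conf[key] = str(value)   # configuration values are stored as strings
    return conf
```

With this shape, an option such as `parquet.block.size` set for one read or write affects only the configuration copy used by that job, while the shared base configuration stays unchanged.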

## How was this patch tested?
Used a mock data source implementation to test both the read path and the write path.

Author: Reynold Xin <rxin@databricks.com>

Closes #12688 from rxin/SPARK-14912.
2016-04-26 10:58:56 -07:00
Andrew Or 18c2c92580 [SPARK-14861][SQL] Replace internal usages of SQLContext with SparkSession
## What changes were proposed in this pull request?

In Spark 2.0, `SparkSession` is the new entry point. Internally we should stop using `SQLContext` everywhere, since it is no longer supposed to be the main user-facing API.

In this patch I took care to not break any public APIs. The one place that's suspect is `o.a.s.ml.source.libsvm.DefaultSource`, but according to mengxr it's not supposed to be public so it's OK to change the underlying `FileFormat` trait.

**Reviewers**: This is a big patch that may be difficult to review but the changes are actually really straightforward. If you prefer I can break it up into a few smaller patches, but it will delay the progress of this issue a little.

## How was this patch tested?

No change in functionality intended.

Author: Andrew Or <andrew@databricks.com>

Closes #12625 from andrewor14/spark-session-refactor.
2016-04-25 20:54:31 -07:00
Andrew Or cfa64882fc [SPARK-14902][SQL] Expose RuntimeConfig in SparkSession
## What changes were proposed in this pull request?

`RuntimeConfig` is the new user-facing API in 2.0 added in #11378. Until now, however, it's been dead code. This patch uses `RuntimeConfig` in `SessionState` and exposes that through the `SparkSession`.

## How was this patch tested?

New test in `SQLContextSuite`.

Author: Andrew Or <andrew@databricks.com>

Closes #12669 from andrewor14/use-runtime-conf.
2016-04-25 17:52:25 -07:00
Andrew Or 3c5e65c339 [SPARK-14721][SQL] Remove HiveContext (part 2)
## What changes were proposed in this pull request?

This removes the class `HiveContext` itself along with all code usages associated with it. The bulk of the work was already done in #12485. This is mainly just code cleanup and actually removing the class.

Note: A couple of things will break after this patch. These will be fixed separately.
- the python HiveContext
- all the documentation / comments referencing HiveContext
- there will be no more HiveContext in the REPL (fixed by #12589)

## How was this patch tested?

No change in functionality.

Author: Andrew Or <andrew@databricks.com>

Closes #12585 from andrewor14/delete-hive-context.
2016-04-25 13:23:05 -07:00
Cheng Lian e66afd5c66 [SPARK-14875][SQL] Makes OutputWriterFactory.newInstance public
## What changes were proposed in this pull request?

This method was accidentally made `private[sql]` in Spark 2.0. This PR makes it public again, since 3rd party data sources like spark-avro depend on it.

## How was this patch tested?

N/A

Author: Cheng Lian <lian@databricks.com>

Closes #12652 from liancheng/spark-14875.
2016-04-25 20:42:49 +08:00
Reynold Xin 0c8e5332ff Disable flaky script transformation test 2016-04-24 12:54:56 -07:00
gatorsmile 337289d712 [SPARK-14691][SQL] Simplify and Unify Error Generation for Unsupported Alter Table DDL
#### What changes were proposed in this pull request?
So far, we are capturing each unsupported Alter Table in separate visit functions. They should be unified and issue the same ParseException instead.

This PR refactors the existing implementation and makes the error messages consistent for Alter Table DDL.

#### How was this patch tested?
Updated the existing test cases and also added new test cases to ensure all the unsupported statements are covered.

Author: gatorsmile <gatorsmile@gmail.com>
Author: xiaoli <lixiao1983@gmail.com>
Author: Xiao Li <xiaoli@Xiaos-MacBook-Pro.local>

Closes #12459 from gatorsmile/cleanAlterTable.
2016-04-24 18:53:27 +02:00
Yin Huai 1672149c26 [SPARK-14879][SQL] Move CreateMetastoreDataSource and CreateMetastoreDataSourceAsSelect to sql/core
## What changes were proposed in this pull request?

CreateMetastoreDataSource and CreateMetastoreDataSourceAsSelect are not Hive-specific. So, this PR moves them from sql/hive to sql/core. Also, I am adding `Command` suffix to these two classes.

## How was this patch tested?
Existing tests.

Author: Yin Huai <yhuai@databricks.com>

Closes #12645 from yhuai/moveCreateDataSource.
2016-04-23 22:29:31 -07:00
Reynold Xin 162e12b085 [SPARK-14877][SQL] Remove HiveMetastoreTypes class
## What changes were proposed in this pull request?
It is unnecessary as DataType.catalogString largely replaces the need for this class.

## How was this patch tested?
Mostly removing dead code and should be covered by existing tests.

Author: Reynold Xin <rxin@databricks.com>

Closes #12644 from rxin/SPARK-14877.
2016-04-23 15:41:17 -07:00
Reynold Xin e3c1366bbc [SPARK-14865][SQL] Better error handling for view creation.
## What changes were proposed in this pull request?
This patch improves error handling in view creation. CreateViewCommand itself will analyze the view SQL query first, and if it cannot successfully analyze it, throw an AnalysisException.

In addition, I also added the following two conservative guards for easier identification of Spark bugs:

1. If there is a bug and the generated view SQL cannot be analyzed, throw an exception at runtime. Note that this is not an AnalysisException because it is not caused by the user and more likely indicates a bug in Spark.
2. When SQLBuilder gets an unresolved plan, it will also show the plan in the error message.

I also took the chance to simplify the internal implementation of CreateViewCommand, and *removed* a fallback path that would've masked an exception from before.

## How was this patch tested?
1. Added a unit test for the user facing error handling.
2. Manually introduced some bugs in Spark to test the internal defensive error handling.
3. Also added a test case to test nested views (not super relevant).

Author: Reynold Xin <rxin@databricks.com>

Closes #12633 from rxin/SPARK-14865.
2016-04-23 13:19:57 -07:00
Reynold Xin f0bba7447f Turn script transformation back on.
## What changes were proposed in this pull request?

(Please fill in changes proposed in this fix)

## How was this patch tested?

(Please explain how this patch was tested. E.g. unit tests, integration tests, manual tests)

(If this patch involves UI changes, please attach a screenshot; otherwise, remove this)

Author: Reynold Xin <rxin@databricks.com>

Closes #12565 from rxin/test-flaky.
2016-04-23 11:11:48 -07:00
Reynold Xin 95faa731c1 [SPARK-14866][SQL] Break SQLQuerySuite out into smaller test suites
## What changes were proposed in this pull request?
This patch breaks SQLQuerySuite out into smaller test suites. It was a little bit too large for debugging.

## How was this patch tested?
This is a test only change.

Author: Reynold Xin <rxin@databricks.com>

Closes #12630 from rxin/SPARK-14866.
2016-04-22 22:50:32 -07:00
Reynold Xin c06110187b [SPARK-14842][SQL] Implement view creation in sql/core
## What changes were proposed in this pull request?
This patch re-implements view creation command in sql/core, based on the pre-existing view creation command in the Hive module. This consolidates the view creation logical command and physical command into a single one, called CreateViewCommand.

## How was this patch tested?
All the code should've been tested by existing tests.

Author: Reynold Xin <rxin@databricks.com>

Closes #12615 from rxin/SPARK-14842-2.
2016-04-22 20:30:51 -07:00
Reynold Xin d7d0cad0ad [SPARK-14855][SQL] Add "Exec" suffix to physical operators
## What changes were proposed in this pull request?
This patch adds an "Exec" suffix to all physical operators. Before this patch, Spark's physical operators and logical operators were named the same (e.g. Project could be logical.Project or execution.Project), which caused small issues in code review and bigger issues in code refactoring.

## How was this patch tested?
N/A

Author: Reynold Xin <rxin@databricks.com>

Closes #12617 from rxin/exec-node.
2016-04-22 17:43:56 -07:00
Reynold Xin aeb52bea56 [SPARK-14841][SQL] Move SQLBuilder into sql/core
## What changes were proposed in this pull request?
This patch moves SQLBuilder into sql/core so we can in the future move view generation also into sql/core.

## How was this patch tested?
Also moved unit tests.

Author: Reynold Xin <rxin@databricks.com>
Author: Wenchen Fan <wenchen@databricks.com>

Closes #12602 from rxin/SPARK-14841.
2016-04-22 11:10:31 -07:00
Liang-Chi Hsieh e09ab5da8b [SPARK-14609][SQL] Native support for LOAD DATA DDL command
## What changes were proposed in this pull request?

Add the native support for LOAD DATA DDL command that loads data into Hive table/partition.

## How was this patch tested?

`HiveDDLCommandSuite` and `HiveQuerySuite`. In addition, a few Hive tests (`WindowQuerySuite`, `HiveTableScanSuite` and `HiveSerDeSuite`) also use the `LOAD DATA` command.

Author: Liang-Chi Hsieh <simonh@tw.ibm.com>

Closes #12412 from viirya/ddl-load-data.
2016-04-22 18:26:28 +08:00
Reynold Xin 284b15d2fb [SPARK-14826][SQL] Remove HiveQueryExecution
## What changes were proposed in this pull request?
This patch removes HiveQueryExecution. As part of this, I consolidated all the describe commands into DescribeTableCommand.

## How was this patch tested?
Should be covered by existing tests.

Author: Reynold Xin <rxin@databricks.com>

Closes #12588 from rxin/SPARK-14826.
2016-04-22 01:31:13 -07:00
Cheng Lian 145433f1aa [SPARK-14369] [SQL] Locality support for FileScanRDD
(This PR is a rebased version of PR #12153.)

## What changes were proposed in this pull request?

This PR adds preliminary locality support for `FileFormat` data sources by overriding `FileScanRDD.preferredLocations()`. The strategy can be divided into two parts:

1.  Block location lookup

    Unlike `HadoopRDD` or `NewHadoopRDD`, `FileScanRDD` doesn't have access to the underlying `InputFormat` or `InputSplit`, and thus can't rely on `InputSplit.getLocations()` to gather locality information. Instead, this PR queries block locations using `FileSystem.getBlockLocations()` after listing all `FileStatus`es in `HDFSFileCatalog` and converts all `FileStatus`es into `LocatedFileStatus`es.

    Note that although S3/S3A/S3N file systems don't provide valid locality information, their `getLocatedStatus()` implementations don't actually issue remote calls either. So there's no need to special case these file systems.

2.  Selecting preferred locations

    For each `FilePartition`, we pick the top 3 locations that contain the most data to be retrieved. This isn't necessarily the best algorithm out there. Further improvements may be brought up in follow-up PRs.
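
The selection step described in part 2 can be sketched as below. This is a simplified model, not the code in this PR: the input shape (a list of `(num_bytes, hosts)` pairs per partition), the function name, and the alphabetical tie-breaking are assumptions made for illustration.

```python
from collections import Counter


def preferred_locations(file_blocks, top_n=3):
    """Pick the hosts that hold the most bytes of a partition.

    file_blocks: iterable of (num_bytes, hosts) pairs, one per block that
    the partition reads. Hypothetical input shape for illustration.
    """
    bytes_per_host = Counter()
    for num_bytes, hosts in file_blocks:
        for host in hosts:
            bytes_per_host[host] += num_bytes

    # Most bytes first; break ties by host name so the result is deterministic.
    ranked = sorted(bytes_per_host.items(), key=lambda kv: (-kv[1], kv[0]))
    return [host for host, _ in ranked[:top_n]]
```

A host that stores replicas of several blocks accumulates the sizes of all of them, so it ranks ahead of a host that stores a single large block only if the sum is larger.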

## How was this patch tested?

Tested by overriding default `FileSystem` implementation for `file:///` with a mocked one, which returns mocked block locations.

Author: Cheng Lian <lian@databricks.com>

Closes #12527 from liancheng/spark-14369-locality-rebased.
2016-04-21 21:48:09 -07:00
Andrew Or df1953f0df [SPARK-14824][SQL] Rename HiveContext object to HiveUtils
## What changes were proposed in this pull request?

Just a rename so we can get rid of `HiveContext.scala`. Note that this will conflict with #12585.

## How was this patch tested?

No change in functionality.

Author: Andrew Or <andrew@databricks.com>

Closes #12586 from andrewor14/rename-hc-object.
2016-04-21 17:57:59 -07:00
Reynold Xin f181aee07c [SPARK-14821][SQL] Implement AnalyzeTable in sql/core and remove HiveSqlAstBuilder
## What changes were proposed in this pull request?
This patch moves analyze table parsing into SparkSqlAstBuilder and removes HiveSqlAstBuilder.

In order to avoid extensive refactoring, I created a common trait for CatalogRelation and MetastoreRelation, and match on that. In the future we should probably just consolidate the two into a single thing so we don't need this common trait.

## How was this patch tested?
Updated unit tests.

Author: Reynold Xin <rxin@databricks.com>

Closes #12584 from rxin/SPARK-14821.
2016-04-21 17:41:29 -07:00
Reynold Xin 1a95397bb6 [SPARK-14798][SQL] Move native command and script transformation parsing into SparkSqlAstBuilder
## What changes were proposed in this pull request?
This patch moves native command and script transformation into SparkSqlAstBuilder. This builds on #12561. See the last commit for diff.

## How was this patch tested?
Updated test cases to reflect this.

Author: Reynold Xin <rxin@databricks.com>

Closes #12564 from rxin/SPARK-14798.
2016-04-21 15:59:37 -07:00
Reynold Xin 3a21e8d5ed [SPARK-14795][SQL] Remove the use of Hive's variable substitution
## What changes were proposed in this pull request?
This patch builds on #12556 and completely removes the use of Hive's variable substitution.

## How was this patch tested?
Covered by existing tests.

Author: Reynold Xin <rxin@databricks.com>

Closes #12561 from rxin/SPARK-14795.
2016-04-21 11:42:25 -07:00
Reynold Xin 228128ce25 [SPARK-14794][SQL] Don't pass analyze command into Hive
## What changes were proposed in this pull request?
We shouldn't pass analyze command to Hive because some of those would require running MapReduce jobs. For now, let's just always run the no scan analyze.

## How was this patch tested?
Updated test case to reflect this change.

Author: Reynold Xin <rxin@databricks.com>

Closes #12558 from rxin/parser-analyze.
2016-04-21 00:31:06 -07:00
Reynold Xin 3b9fd51739 [HOTFIX] Disable flaky tests 2016-04-21 00:25:28 -07:00
Reynold Xin 77d847ddb2 [SPARK-14792][SQL] Move as many parsing rules as possible into SQL parser
## What changes were proposed in this pull request?
This patch moves as many parsing rules as possible into SQL parser. There are only three more left after this patch: (1) run native command, (2) analyze, and (3) script IO. These 3 will be dealt with in a follow-up PR.

## How was this patch tested?
No test change. This simply moves code around.

Author: Reynold Xin <rxin@databricks.com>

Closes #12556 from rxin/SPARK-14792.
2016-04-21 00:24:24 -07:00
Reynold Xin 24f338ba7b [SPARK-14775][SQL] Remove TestHiveSparkSession.rewritePaths
## What changes were proposed in this pull request?
The path rewrite in TestHiveSparkSession is pretty hacky. I think we can remove that complexity and just do a string replacement when we read the query files in. This would remove the overloading of runNativeSql in TestHive, which will simplify the removal of Hive specific variable substitution.

## How was this patch tested?
This is a small test refactoring to simplify test infrastructure.

Author: Reynold Xin <rxin@databricks.com>

Closes #12543 from rxin/SPARK-14775.
2016-04-20 17:56:31 -07:00
Reynold Xin b28fe448d9 [SPARK-14770][SQL] Remove unused queries in hive module test resources
## What changes were proposed in this pull request?
We currently have five folders in queries: clientcompare, clientnegative, clientpositive, negative, and positive. Only clientpositive is used. We can remove the rest.

## How was this patch tested?
N/A - removing unused test resources.

Author: Reynold Xin <rxin@databricks.com>

Closes #12540 from rxin/SPARK-14770.
2016-04-20 16:29:26 -07:00
Andrew Or 8fc267ab33 [SPARK-14720][SPARK-13643] Move Hive-specific methods into HiveSessionState and Create a SparkSession class
## What changes were proposed in this pull request?
This PR has two main changes.
1. Move Hive-specific methods from HiveContext to HiveSessionState, which helps the work of removing HiveContext.
2. Create a SparkSession class, which will later be the entry point for Spark SQL users.

## How was this patch tested?
Existing tests

This PR is trying to fix test failures of https://github.com/apache/spark/pull/12485.

Author: Andrew Or <andrew@databricks.com>
Author: Yin Huai <yhuai@databricks.com>

Closes #12522 from yhuai/spark-session.
2016-04-20 12:58:48 -07:00
Dongjoon Hyun 6f1ec1f267 [MINOR] [SQL] Re-enable explode() and json_tuple() testcases in ExpressionToSQLSuite
## What changes were proposed in this pull request?

Since [SPARK-12719: SQL Generation supports for generators](https://issues.apache.org/jira/browse/SPARK-12719) was resolved, this PR enables the related testcases: `explode()` and `json_tuple()`.

## How was this patch tested?

Pass the Jenkins tests (with re-enabled test cases).

Author: Dongjoon Hyun <dongjoon@apache.org>

Closes #12329 from dongjoon-hyun/minor_enable_testcases.
2016-04-19 21:55:29 -07:00
Joan 3ae25f244b [SPARK-13929] Use Scala reflection for UDTs
## What changes were proposed in this pull request?

Enable ScalaReflection and User Defined Types for plain Scala classes.

This involves moving `schemaFor` from the `ScalaReflection` trait (which covers both runtime and compile-time (macro) reflection) to the `ScalaReflection` object (runtime reflection only), as I believe this code wouldn't work at compile time anyway because it manipulates `Class`es that are not compiled yet.

## How was this patch tested?

Unit test

Author: Joan <joan@goyeau.com>

Closes #12149 from joan38/SPARK-13929-Scala-reflection.
2016-04-19 17:36:31 -07:00
Cheng Lian 10f273d8db [SPARK-14407][SQL] Hides HadoopFsRelation related data source API into execution/datasources package #12178
## What changes were proposed in this pull request?

This PR moves `HadoopFsRelation` related data source API into `execution/datasources` package.

Note that to avoid conflicts, this PR is based on #12153. Effective changes for this PR only consist of the last three commits. Will rebase after merging #12153.

## How was this patch tested?

Existing tests.

Author: Yin Huai <yhuai@databricks.com>
Author: Cheng Lian <lian@databricks.com>

Closes #12361 from liancheng/spark-14407-hide-hadoop-fs-relation.
2016-04-19 17:32:23 -07:00
Cheng Lian 5e360c93be [SPARK-13681][SPARK-14458][SPARK-14566][SQL] Add back once removed CommitFailureTestRelationSuite and SimpleTextHadoopFsRelationSuite
## What changes were proposed in this pull request?

These test suites were removed while refactoring `HadoopFsRelation` related API. This PR brings them back.

This PR also fixes two regressions:

- SPARK-14458, which causes a runtime error when saving partitioned tables using `FileFormat` data sources that are not able to infer their own schemata. This bug wasn't detected by any built-in data sources because all of them happen to support schema inference.

- SPARK-14566, which happens to be covered by SPARK-14458 and causes wrong query results or a runtime error when
  - appending a Dataset `ds` to a persisted partitioned data source relation `t`, and
  - partition columns in `ds` don't all appear after data columns
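
The condition in SPARK-14566 implies a layout in which data columns come first and partition columns come last. The sketch below shows that ordering on a plain list of column names; it is an illustration of the layout only, not the fix applied in this PR, and the function name is hypothetical.

```python
def move_partition_columns_last(columns, partition_columns):
    """Reorder column names so that all partition columns follow data columns.

    Illustrative only: operates on plain lists of names, not on a Dataset.
    """
    part = set(partition_columns)
    missing = part - set(columns)
    if missing:
        raise ValueError(f"unknown partition columns: {sorted(missing)}")

    data = [c for c in columns if c not in part]
    # Keep partition columns in their declared partitioning order.
    return data + list(partition_columns)
```

For a hypothetical input of `["year", "id", "month", "name"]` partitioned by `["year", "month"]`, the data columns `id` and `name` move to the front and the partition columns follow.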

## How was this patch tested?

`CommitFailureTestRelationSuite` uses a testing relation that always fails when committing write tasks to test write job cleanup.

`SimpleTextHadoopFsRelationSuite` uses a testing relation to test general `HadoopFsRelation` and `FileFormat` interfaces.

The two regressions are both covered by existing test cases.

Author: Cheng Lian <lian@databricks.com>

Closes #12179 from liancheng/spark-13681-commit-failure-test.
2016-04-19 09:37:00 -07:00
Andrew Or f1a11976db [SPARK-14674][SQL] Move HiveContext.hiveconf to HiveSessionState
## What changes were proposed in this pull request?

This is just cleanup. This allows us to remove HiveContext later without inflating the diff too much. This PR fixes the conflicts of https://github.com/apache/spark/pull/12431. It also removes the `def hiveConf` from `HiveSqlParser`. So, we will pass the HiveConf associated with a session explicitly instead of relying on Hive's `SessionState` to pass `HiveConf`.

## How was this patch tested?
Existing tests.

Closes #12431

Author: Andrew Or <andrew@databricks.com>
Author: Yin Huai <yhuai@databricks.com>

Closes #12449 from yhuai/hiveconf.
2016-04-18 14:28:47 -07:00
Andrew Or 28ee15702d [SPARK-14647][SQL] Group SQLContext/HiveContext state into SharedState
## What changes were proposed in this pull request?

This patch adds a SharedState that groups state shared across multiple SQLContexts. This is analogous to the SessionState added in SPARK-13526 that groups session-specific state. This cleanup makes the constructors of the contexts simpler and ultimately allows us to remove HiveContext in the near future.

## How was this patch tested?
Existing tests.

Author: Yin Huai <yhuai@databricks.com>

Closes #12463 from yhuai/sharedState.
2016-04-18 13:15:23 -07:00
Andrew Or 7de06a646d Revert "[SPARK-14647][SQL] Group SQLContext/HiveContext state into SharedState"
This reverts commit 5cefecc95a.
2016-04-17 17:35:41 -07:00
Andrew Or 5cefecc95a [SPARK-14647][SQL] Group SQLContext/HiveContext state into SharedState
## What changes were proposed in this pull request?

This patch adds a SharedState that groups state shared across multiple SQLContexts. This is analogous to the SessionState added in SPARK-13526 that groups session-specific state. This cleanup makes the constructors of the contexts simpler and ultimately allows us to remove HiveContext in the near future.

## How was this patch tested?
Existing tests.

Closes #12405

Author: Andrew Or <andrew@databricks.com>
Author: Yin Huai <yhuai@databricks.com>

Closes #12447 from yhuai/sharedState.
2016-04-16 14:00:53 -07:00
Sameer Agarwal b5c60bcdca [SPARK-14447][SQL] Speed up TungstenAggregate w/ keys using VectorizedHashMap
## What changes were proposed in this pull request?

This patch speeds up group-by aggregates by around 3-5x by leveraging an in-memory `AggregateHashMap` (please see https://github.com/apache/spark/pull/12161), an append-only aggregate hash map that can act as a 'cache' for extremely fast key-value lookups while evaluating aggregates (and fall back to the `BytesToBytesMap` if a given key isn't found).

Architecturally, it is backed by a power-of-2-sized array for index lookups and a columnar batch that stores the key-value pairs. The index lookups in the array rely on linear probing (with a small number of maximum tries) and use an inexpensive hash function which makes it really efficient for a majority of lookups. However, using linear probing and an inexpensive hash function also makes it less robust as compared to the `BytesToBytesMap` (especially for a large number of keys or even for certain distributions of keys) and requires us to fall back on the latter for correctness.
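
The structure described above can be modelled in a few lines. The following is a minimal sketch, not the generated code from this patch: the class and function names are hypothetical, Python's built-in `hash` stands in for the inexpensive hash function, two parallel lists stand in for the columnar batch, and a plain dict stands in for the `BytesToBytesMap` fallback.

```python
class ProbingAggregateMap:
    """Append-only map: power-of-2 index array plus bounded linear probing."""

    def __init__(self, capacity=8, max_tries=2):
        if capacity <= 0 or capacity & (capacity - 1) != 0:
            raise ValueError("capacity must be a power of 2")
        self.mask = capacity - 1
        self.max_tries = max_tries
        self.buckets = [-1] * capacity  # bucket -> row id, -1 means empty
        self.keys = []                  # stand-in for the columnar batch
        self.values = []

    def find_or_insert(self, key):
        """Return the row id for key, or -1 if probing gave up."""
        idx = hash(key) & self.mask
        for _ in range(self.max_tries):
            row = self.buckets[idx]
            if row == -1:
                # Empty slot: append a new row and point the bucket at it.
                self.keys.append(key)
                self.values.append(0)
                self.buckets[idx] = len(self.keys) - 1
                return self.buckets[idx]
            if self.keys[row] == key:
                return row
            idx = (idx + 1) & self.mask  # linear probing
        return -1


def sum_by_key(rows, fast_map):
    """Sum values per key, using fast_map first and a dict as the fallback."""
    fallback = {}  # stand-in for the slower but robust map
    for key, value in rows:
        row = fast_map.find_or_insert(key)
        if row >= 0:
            fast_map.values[row] += value
        else:
            fallback[key] = fallback.get(key, 0) + value

    result = dict(zip(fast_map.keys, fast_map.values))
    for key, value in fallback.items():
        result[key] = result.get(key, 0) + value
    return result
```

Because the fast map is append-only, a key that fails probing once keeps failing, so each key lives in exactly one of the two maps and the final merge is a simple union.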

## How was this patch tested?

    Java HotSpot(TM) 64-Bit Server VM 1.8.0_73-b02 on Mac OS X 10.11.4
    Intel(R) Core(TM) i7-4960HQ CPU  2.60GHz
    Aggregate w keys:                   Best/Avg Time(ms)    Rate(M/s)   Per Row(ns)   Relative
    -------------------------------------------------------------------------------------------
    codegen = F                              2124 / 2204          9.9         101.3       1.0X
    codegen = T hashmap = F                  1198 / 1364         17.5          57.1       1.8X
    codegen = T hashmap = T                   369 /  600         56.8          17.6       5.8X

Author: Sameer Agarwal <sameer@databricks.com>

Closes #12345 from sameeragarwal/tungsten-aggregate-integration.
2016-04-14 20:57:03 -07:00
Liang-Chi Hsieh 28efdd3fd7 [SPARK-14592][SQL] Native support for CREATE TABLE LIKE DDL command
## What changes were proposed in this pull request?
JIRA: https://issues.apache.org/jira/browse/SPARK-14592

This patch adds native support for DDL command `CREATE TABLE LIKE`.

The SQL syntax is like:

    CREATE TABLE table_name LIKE existing_table
    CREATE TABLE IF NOT EXISTS table_name LIKE existing_table

## How was this patch tested?
`HiveDDLCommandSuite`. `HiveQuerySuite` already tests `CREATE TABLE LIKE`.

Author: Liang-Chi Hsieh <simonh@tw.ibm.com>

This patch had conflicts when merged, resolved by
Committer: Andrew Or <andrew@databricks.com>

Closes #12362 from viirya/create-table-like.
2016-04-14 11:08:08 -07:00
gatorsmile c971aee40d [SPARK-14499][SQL][TEST] Drop Partition Does Not Delete Data of External Tables
#### What changes were proposed in this pull request?
This PR adds a test to ensure that dropping partitions of an external table does not delete data.

cc yhuai andrewor14

#### How was this patch tested?
N/A

Author: gatorsmile <gatorsmile@gmail.com>

This patch had conflicts when merged, resolved by
Committer: Andrew Or <andrew@databricks.com>

Closes #12350 from gatorsmile/testDropPartition.
2016-04-14 11:03:19 -07:00