## Problem
The current `sed` in `test_script.sh` is missing a `$`, leading to the failure of `script` test on OS X:
```
== Results ==
!== Correct Answer - 2 == == Spark Answer - 2 ==
![x1_y1] [x1]
![x2_y2] [x2]
```
In addition, this `script` test would also fail on systems like Windows, where we cannot invoke `bash` or `echo | sed`.
## What changes were proposed in this pull request?
This patch
- fixes `sed` in `test_script.sh`
- adds command guards so that the `script` test would pass on systems like Windows
## How was this patch tested?
- Jenkins
- Manually verified tests pass on OS X
Author: Liwei Lin <lwlin7@gmail.com>
Closes#14280 from lw-lin/osx-sed.
## What changes were proposed in this pull request?
After https://github.com/apache/spark/pull/12945, we renamed `registerTempTable` to `createTempView`, since we actually create a view. This PR renames `SQLTestUtils.withTempTable` to reflect this change.
## How was this patch tested?
N/A
Author: Wenchen Fan <wenchen@databricks.com>
Closes#14318 from cloud-fan/minor4.
## What changes were proposed in this pull request?
Due to backward-compatibility reasons, the following Parquet schema is ambiguous:
```
optional group f (LIST) {
  repeated group list {
    optional group element {
      optional int32 element;
    }
  }
}
```
According to the parquet-format spec, when interpreted as a standard 3-level layout, this type is equivalent to the following SQL type:
```
ARRAY<STRUCT<element: INT>>
```
However, when interpreted as a legacy 2-level layout, it's equivalent to
```
ARRAY<STRUCT<element: STRUCT<element: INT>>>
```
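To make the ambiguity concrete, here is a small, purely illustrative sketch of the two candidate Catalyst types, built with Spark's public type API:
```scala
import org.apache.spark.sql.types._

// Standard 3-level interpretation: ARRAY<STRUCT<element: INT>>
val standardType = ArrayType(
  StructType(Seq(StructField("element", IntegerType))))

// Legacy 2-level interpretation: ARRAY<STRUCT<element: STRUCT<element: INT>>>
val legacyType = ArrayType(
  StructType(Seq(StructField("element",
    StructType(Seq(StructField("element", IntegerType)))))))
```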
Historically, to disambiguate these cases, we employed two methods:
- `ParquetSchemaConverter.isElementType()`
Used to disambiguate the above cases while converting Parquet types to Spark types.
- `ParquetRowConverter.isElementType()`
Used to disambiguate the above cases while instantiating row converters that convert Parquet records to Spark rows.
Unfortunately, these two methods make different decisions about the above problematic Parquet type, which caused SPARK-16344.
`ParquetRowConverter.isElementType()` is necessary for Spark 1.4 and earlier versions because Parquet requested schemata are directly converted from Spark schemata in these versions. The converted Parquet schemata may be incompatible with actual schemata of the underlying physical files when the files are written by a system/library that uses a schema conversion scheme that is different from Spark when writing Parquet LIST and MAP fields.
In Spark 1.5, Parquet requested schemata are always properly tailored from the schemata of the physical files to be read, so `ParquetRowConverter.isElementType()` is no longer necessary. This PR replaces this method with a simple yet accurate scheme: whenever an ambiguous Parquet type is hit, convert the type in question back to a Spark type using `ParquetSchemaConverter` and check whether it matches the corresponding Spark type.
## How was this patch tested?
New test cases added in `ParquetHiveCompatibilitySuite` and `ParquetQuerySuite`.
Author: Cheng Lian <lian@databricks.com>
Closes#14014 from liancheng/spark-16344-for-master-and-2.0.
## What changes were proposed in this pull request?
This PR improves `LogicalPlanToSQLSuite` to check the generated SQL directly by **structure**. So far, `LogicalPlanToSQLSuite` relies on `checkHiveQl` to ensure **successful SQL generation** and **answer equality**. However, it does not guarantee that the generated SQL is the same or will not change unnoticed.
## How was this patch tested?
Pass the Jenkins. This is only a testsuite change.
Author: Dongjoon Hyun <dongjoon@apache.org>
Closes#14235 from dongjoon-hyun/SPARK-16590.
## What changes were proposed in this pull request?
In ScriptInputOutputSchema, we read the default RecordReader and RecordWriter from the conf. Since Spark 2.0 has deleted those config keys from the Hive conf, we have to set the default reader/writer class names ourselves. Otherwise we get None for LazySimpleSerde, and the data written cannot be read back by the script. The test case added worked fine with previous versions of Spark, but would fail now.
## How was this patch tested?
added a test case in SQLQuerySuite.
Closes#14169
Author: Daoyuan Wang <daoyuan.wang@intel.com>
Author: Yin Huai <yhuai@databricks.com>
Closes#14249 from yhuai/scriptTransformation.
#### What changes were proposed in this pull request?
If we create a table pointing to a parquet/json dataset without specifying the schema, the describe table command does not show the schema at all. It only shows `# Schema of this table is inferred at runtime`. In 1.6, describe table does show the schema of such a table.
~~For data source tables, to infer the schema, we need to load the data source tables at runtime. Thus, this PR calls the function `lookupRelation`.~~
For data source tables, we infer the schema before table creation. Thus, this PR sets the inferred schema as the table schema at table creation time.
#### How was this patch tested?
Added test cases
Author: gatorsmile <gatorsmile@gmail.com>
Closes#14148 from gatorsmile/describeSchema.
## What changes were proposed in this pull request?
This patch implements reflect SQL function, which can be used to invoke a Java method in SQL. Slightly different from Hive, this implementation requires the class name and the method name to be literals. This implementation also supports only a smaller number of data types, and requires the function to be static, as suggested by rxin in #13969.
java_method is an alias for reflect, so this should also resolve SPARK-16277.
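For illustration, a hedged usage sketch from a `spark-shell` session (the class/method literals below are just canonical examples of public static methods; they are not taken from this PR's test cases):
```scala
// reflect invokes a static Java method; the class and method names must be literals.
spark.sql("SELECT reflect('java.util.UUID', 'randomUUID')").show()

// java_method is an alias for reflect.
spark.sql("SELECT java_method('java.lang.Math', 'max', 2, 7)").show()
```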
## How was this patch tested?
Added expression unit tests and an end-to-end test.
Author: petermaxlee <petermaxlee@gmail.com>
Closes#14138 from petermaxlee/reflect-static.
This option is used by Hive to directly delete the files instead of
moving them to the trash. This is needed in certain configurations
where moving the files does not work. For non-Hive tables and partitions,
Spark already behaves as if the PURGE option was set, so there's no
need to do anything.
Hive support for PURGE was added in 0.14 (for tables) and 1.2 (for
partitions), so the code reflects that: trying to use the option with
older versions of Hive will cause an exception to be thrown.
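For illustration, the new syntax looks roughly like the following (table, partition, and column names are made up); it is only accepted when the underlying Hive version is new enough:
```scala
// DROP TABLE ... PURGE requires Hive 0.14+; dropping a partition with PURGE requires Hive 1.2+.
spark.sql("DROP TABLE logs PURGE")
spark.sql("ALTER TABLE events DROP PARTITION (dt = '2016-06-01') PURGE")
```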
The change is a little noisier than I would like, because of the code
to propagate the new flag through all the interfaces and implementations;
the main changes are in the parser and in HiveShim, aside from the tests
(DDLCommandSuite, VersionsSuite).
Tested by running sql and catalyst unit tests, plus VersionsSuite which
has been updated to test the version-specific behavior. I also ran an
internal test suite that uses PURGE and would not pass previously.
Author: Marcelo Vanzin <vanzin@cloudera.com>
Closes#13831 from vanzin/SPARK-16119.
## What changes were proposed in this pull request?
When a query only uses metadata (for example, a partition key), it can return results based on the metadata without scanning any files. Hive did this in HIVE-1003.
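As a hedged illustration (table and column names are hypothetical), queries that touch only partition columns can be answered from catalog metadata once this optimization applies:
```scala
// `logs` is assumed to be partitioned by `dt`; both queries need only partition metadata.
spark.sql("SELECT DISTINCT dt FROM logs").show()
spark.sql("SELECT MAX(dt) FROM logs").show()
```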
## How was this patch tested?
add unit tests
Author: Lianhui Wang <lianhuiwang09@gmail.com>
Author: Wenchen Fan <wenchen@databricks.com>
Author: Lianhui Wang <lianhuiwang@users.noreply.github.com>
Closes#13494 from lianhuiwang/metadata-only.
## What changes were proposed in this pull request?
In order to make it clear which filters are fully handled by the underlying data source, we will mark them with an *. This gives users a clear visual cue that the filter is being treated differently by Catalyst than filters which are just presented to the underlying DataSource.
Examples from the FilteredScanSuite, in this example `c IN (...)` is handled by the source, `b < ...` is not
### Before
```
//SELECT a FROM oneToTenFiltered WHERE a + b > 9 AND b < 16 AND c IN ('bbbbbBBBBB', 'cccccCCCCC', 'dddddDDDDD', 'foo')
== Physical Plan ==
Project [a#0]
+- Filter (((a#0 + b#1) > 9) && (b#1 < 16))
+- Scan SimpleFilteredScan(1,10)[a#0,b#1] PushedFilters: [LessThan(b,16), In(c, [bbbbbBBBBB,cccccCCCCC,dddddDDDDD,foo]]
```
### After
```
== Physical Plan ==
Project [a#0]
+- Filter (((a#0 + b#1) > 9) && (b#1 < 16))
+- Scan SimpleFilteredScan(1,10)[a#0,b#1] PushedFilters: [LessThan(b,16), *In(c, [bbbbbBBBBB,cccccCCCCC,dddddDDDDD,foo]]
```
## How was this patch tested?
Manually tested with the Spark Cassandra Connector, a source which fully handles underlying filters. Now fully handled filters appear with an * next to their names. I can add an automated test as well if requested, post 1.6.1.
Tested by modifying the FilteredScanSuite to run explains.
Author: Russell Spitzer <Russell.Spitzer@gmail.com>
Closes#11317 from RussellSpitzer/SPARK-12639-Star.
## What changes were proposed in this pull request?
This PR prevents dropping the current database to avoid errors like the following.
```scala
scala> sql("create database delete_db")
scala> sql("use delete_db")
scala> sql("drop database delete_db")
scala> sql("create table t as select 1")
org.apache.spark.sql.catalyst.analysis.NoSuchDatabaseException: Database `delete_db` not found;
```
## How was this patch tested?
Pass the Jenkins tests including an updated testcase.
Author: Dongjoon Hyun <dongjoon@apache.org>
Closes#14115 from dongjoon-hyun/SPARK-16459.
## What changes were proposed in this pull request?
In #13537 we truncate `simpleString` if it is a long `StructType`. But sometimes we need `catalogString` to reconstruct `TypeInfo`, for example in the description of [SPARK-16415](https://issues.apache.org/jira/browse/SPARK-16415). So we need to keep the implementation of `catalogString` unaffected by this truncation.
## How was this patch tested?
added a test case.
Author: Daoyuan Wang <daoyuan.wang@intel.com>
Closes#14089 from adrian-wang/catalogstring.
#### What changes were proposed in this pull request?
When creating a view, a common user error is that the number of columns produced by the `SELECT` clause does not match the number of column names specified in `CREATE VIEW`.
For example, given that table `t1` has only 3 columns:
```SQL
create view v1(col2, col4, col3, col5) as select * from t1
```
Currently, Spark SQL reports the following error:
```
requirement failed
java.lang.IllegalArgumentException: requirement failed
at scala.Predef$.require(Predef.scala:212)
at org.apache.spark.sql.execution.command.CreateViewCommand.run(views.scala:90)
```
This error message is very confusing. This PR is to detect the error and issue a meaningful error message.
#### How was this patch tested?
Added test cases
Author: gatorsmile <gatorsmile@gmail.com>
Closes#14047 from gatorsmile/viewMismatchedColumns.
#### What changes were proposed in this pull request?
In `CREATE TABLE AS SELECT`, if the `SELECT` query fails, the table should not exist. For example,
```SQL
CREATE TABLE tab
STORED AS TEXTFILE
SELECT 1 AS a, (SELECT a FROM (SELECT 1 AS a UNION ALL SELECT 2 AS a) t) AS b
```
The above query fails as expected, but an empty table `tab` is created.
This PR is to drop the created table when hitting any non-fatal exception.
#### How was this patch tested?
Added a test case to verify the behavior
Author: gatorsmile <gatorsmile@gmail.com>
Closes#13926 from gatorsmile/dropTableAfterException.
## What changes were proposed in this pull request?
These two configs should always be true after Spark 2.0. This patch removes them from the config list. Note that ideally this should've gone into branch-2.0, but due to the timing of the release we should only merge this in master for Spark 2.1.
## How was this patch tested?
Updated test cases.
Author: Reynold Xin <rxin@databricks.com>
Closes#14061 from rxin/SPARK-16388.
## What changes were proposed in this pull request?
This PR removes `SessionState.executeSql` in favor of `SparkSession.sql`. We can remove this safely since the visibility `SessionState` is `private[sql]` and `executeSql` is only used in one **ignored** test, `test("Multiple Hive Instances")`.
## How was this patch tested?
Pass the Jenkins tests.
Author: Dongjoon Hyun <dongjoon@apache.org>
Closes#14055 from dongjoon-hyun/SPARK-16383.
## What changes were proposed in this pull request?
This patch fixes the bug that the refresh command does not work on temporary views. This patch is based on https://github.com/apache/spark/pull/13989, but removes the public Dataset.refresh() API and improves test coverage.
Note that I actually think the public refresh() API is very useful. We can in the future implement it by also invalidating the lazy vals in QueryExecution (or alternatively just create a new QueryExecution).
## How was this patch tested?
Re-enabled a previously ignored test, and added a new test suite for Hive testing behavior of temporary views against MetastoreRelation.
Author: Reynold Xin <rxin@databricks.com>
Author: petermaxlee <petermaxlee@gmail.com>
Closes#14009 from rxin/SPARK-16311.
## What changes were proposed in this pull request?
It seems ORC supports all the types in [`PredicateLeaf.Type`](e085b7e9bd/storage-api/src/java/org/apache/hadoop/hive/ql/io/sarg/PredicateLeaf.java (L50-L56)), which include boolean types. So, this was tested first.
This PR adds the support for pushing filters down for `BooleanType` in ORC data source.
This PR also removes `OrcTableScan` class and the companion object, which is not used anymore.
## How was this patch tested?
Unittest in `OrcFilterSuite` and `OrcQuerySuite`.
Author: hyukjinkwon <gurwls223@gmail.com>
Closes#12972 from HyukjinKwon/SPARK-15198.
(Please note this is a revision of PR #13686, which has been closed in favor of this PR.)
This PR addresses [SPARK-15968](https://issues.apache.org/jira/browse/SPARK-15968).
## What changes were proposed in this pull request?
The `getCached` method of [HiveMetastoreCatalog](https://github.com/apache/spark/blob/master/sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveMetastoreCatalog.scala) computes `pathsInMetastore` from the metastore relation's catalog table. This only returns the table base path, which is incomplete/inaccurate for a nonempty partitioned table. As a result, cached lookups on nonempty partitioned tables always miss.
Rather than get `pathsInMetastore` from `metastoreRelation.catalogTable.storage.locationUri.toSeq`, I modified the `getCached` method to take a `pathsInMetastore` argument. Calls to this method pass in the paths computed from calls to the Hive metastore. This is how `getCached` was implemented in Spark 1.5:
e0c3212a9b/sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveMetastoreCatalog.scala (L444).
I also added a call in `InsertIntoHiveTable.scala` to invalidate the table from the SQL session catalog.
## How was this patch tested?
I've added a new unit test to `parquetSuites.scala`:
SPARK-15968: nonempty partitioned metastore Parquet table lookup should use cached relation
Note that the only difference between this new test and the one above it in the file is that the new test populates its partitioned table with a single value, while the existing test leaves the table empty. This reveals a subtle, unexpected hole in test coverage present before this patch.
Note I also modified a different but related unit test in `parquetSuites.scala`:
SPARK-15248: explicitly added partitions should be readable
This unit test asserts that Spark SQL should return data from a table partition which has been placed there outside a metastore query immediately after it is added. I changed the test so that, instead of adding the data as a parquet file saved in the partition's location, the data is added through a SQL `INSERT` query. I made this change because I could find no way to efficiently support partitioned table caching without failing that test.
In addition to my primary motivation, I can offer a few reasons I believe this is an acceptable weakening of that test. First, it still validates a fix for [SPARK-15248](https://issues.apache.org/jira/browse/SPARK-15248), the issue for which it was written. Second, the assertion made is stronger than that required for non-partitioned tables. If you write data to the storage location of a non-partitioned metastore table without using a proper SQL DML query, a subsequent call to show that data will not return it. I believe this is an intentional limitation put in place to make table caching feasible, but I'm only speculating.
Building a large `HadoopFsRelation` requires `stat`-ing all of its data files. In our environment, where we have tables with tens of thousands of partitions, the difference between using a cached relation and a new one is a matter of seconds versus minutes. Caching partitioned table metadata vastly improves the usability of Spark SQL for these cases.
Thanks.
Author: Michael Allman <michael@videoamp.com>
Closes#13818 from mallman/spark-15968.
#### What changes were proposed in this pull request?
LogicalPlan `InsertIntoHiveTable` is useless. Thus, we can remove it from the code base.
#### How was this patch tested?
The existing test cases
Author: gatorsmile <gatorsmile@gmail.com>
Closes#14037 from gatorsmile/InsertIntoHiveTable.
## What changes were proposed in this pull request?
After SPARK-15674, `DDLStrategy` prints out the following deprecation messages in the testsuites.
```
12:10:53.284 WARN org.apache.spark.sql.execution.SparkStrategies$DDLStrategy:
CREATE TEMPORARY TABLE normal_orc_source USING... is deprecated,
please use CREATE TEMPORARY VIEW viewName USING... instead
```
Total: 40
- JDBCWriteSuite: 14
- DDLSuite: 6
- TableScanSuite: 6
- ParquetSourceSuite: 5
- OrcSourceSuite: 2
- SQLQuerySuite: 2
- HiveCommandSuite: 2
- JsonSuite: 1
- PrunedScanSuite: 1
- FilteredScanSuite: 1
This PR replaces `CREATE TEMPORARY TABLE` with `CREATE TEMPORARY VIEW` in order to remove the deprecation messages in the above testsuites except `DDLSuite`, `SQLQuerySuite`, `HiveCommandSuite`.
The Jenkins results show only 10 remaining messages.
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/61422/consoleFull
## How was this patch tested?
This is a testsuite-only change.
Author: Dongjoon Hyun <dongjoon@apache.org>
Closes#13956 from dongjoon-hyun/SPARK-16267.
## What changes were proposed in this pull request?
This PR supports a fallback lookup by casting `DecimalType` into `DoubleType` for the external functions with `double`-type parameter.
**Reported Error Scenarios**
```scala
scala> sql("select percentile(value, 0.5) from values 1,2,3 T(value)")
org.apache.spark.sql.AnalysisException: ... No matching method for class org.apache.hadoop.hive.ql.udf.UDAFPercentile with (int, decimal(38,18)). Possible choices: _FUNC_(bigint, array<double>) _FUNC_(bigint, double) ; line 1 pos 7
scala> sql("select percentile_approx(value, 0.5) from values 1.0,2.0,3.0 T(value)")
org.apache.spark.sql.AnalysisException: ... Only a float/double or float/double array argument is accepted as parameter 2, but decimal(38,18) was passed instead.; line 1 pos 7
```
## How was this patch tested?
Pass the Jenkins tests (including a new testcase).
Author: Dongjoon Hyun <dongjoon@apache.org>
Closes#13930 from dongjoon-hyun/SPARK-16228.
#### What changes were proposed in this pull request?
Based on the previous discussion with cloud-fan hvanhovell in another related PR https://github.com/apache/spark/pull/13764#discussion_r67994276, it looks reasonable to add convenience methods for users to add `comment` when defining `StructField`.
Currently, the column-related `comment` attribute is stored in `Metadata` of `StructField`. For example, users can add the `comment` attribute using the following way:
```Scala
StructType(
  StructField(
    "cl1",
    IntegerType,
    nullable = false,
    new MetadataBuilder().putString("comment", "test").build()) :: Nil)
```
This PR is to add more user friendly methods for the `comment` attribute when defining a `StructField`. After the changes, users are provided three different ways to do it:
```Scala
val struct = (new StructType)
  .add("a", "int", true, "test1")
val struct = (new StructType)
  .add("c", StringType, true, "test3")
val struct = (new StructType)
  .add(StructField("d", StringType).withComment("test4"))
```
#### How was this patch tested?
Added test cases:
- `DataTypeSuite` is for testing three types of API changes,
- `DataFrameReaderWriterSuite` is for parquet, json and csv formats - using in-memory catalog
- `OrcQuerySuite.scala` is for orc format using Hive-metastore
Author: gatorsmile <gatorsmile@gmail.com>
Closes#13860 from gatorsmile/newMethodForComment.
## What changes were proposed in this pull request?
- Fix tests regarding show functions functionality
- Revert `catalog.ListFunctions` and `SHOW FUNCTIONS` to return to `Spark 1.X` functionality.
Cherry-picked changes from this PR: https://github.com/apache/spark/pull/13413/files
## How was this patch tested?
Unit tests.
Author: Bill Chambers <bill@databricks.com>
Author: Bill Chambers <wchambers@ischool.berkeley.edu>
Closes#13916 from anabranch/master.
## What changes were proposed in this pull request?
When reading partitions of a partitioned Hive SerDe table, we only initializes the deserializer using partition properties. However, for SerDes like `AvroSerDe`, essential properties (e.g. Avro schema information) may be defined in table properties. We should merge both table properties and partition properties before initializing the deserializer.
Note that an individual partition may have different properties than the one defined in the table properties (e.g. partitions within a table can have different SerDes). Thus, for any property key defined in both partition and table properties, the value set in partition properties wins.
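A minimal sketch of the intended precedence, assuming plain string maps for the properties (names and values are illustrative, not the actual Hive reader code):
```scala
// Table-level properties, e.g. an Avro schema shared by the whole table.
val tableProps: Map[String, String] =
  Map("avro.schema.literal" -> "<table-level schema>", "serialization.format" -> "1")

// Partition-level properties; for keys defined in both maps, the partition value wins.
val partitionProps: Map[String, String] =
  Map("avro.schema.literal" -> "<partition-level schema>")

// Properties used to initialize the deserializer.
val effectiveProps = tableProps ++ partitionProps
```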
## How was this patch tested?
New test case added in `QueryPartitionSuite`.
Author: Cheng Lian <lian@databricks.com>
Closes#13865 from liancheng/spark-13709-partitioned-avro-table.
#### What changes were proposed in this pull request?
This PR is to improve test coverage. It verifies whether `Comment` of `Column` can be appropriate handled.
The test cases verify the related parts in Parser, both SQL and DataFrameWriter interface, and both Hive Metastore catalog and In-memory catalog.
#### How was this patch tested?
N/A
Author: gatorsmile <gatorsmile@gmail.com>
Closes#13764 from gatorsmile/dataSourceComment.
## What changes were proposed in this pull request?
Extend the approach of returning unwrapper functions from primitive types to all types.
This PR is based on https://github.com/apache/spark/pull/13676. It only fixes a bug with scala-2.10 compilation. All credit should go to dafrista.
## How was this patch tested?
The patch should pass all unit tests. Reading ORC files with non-primitive types with this change reduced the read time by ~15%.
Author: Brian Cho <bcho@fb.com>
Author: Herman van Hovell <hvanhovell@databricks.com>
Closes#13854 from hvanhovell/SPARK-15956-scala210.
This reverts commit 0a9c027595. It breaks the 2.10 build; I'll fix this in a different PR.
Author: Herman van Hovell <hvanhovell@databricks.com>
Closes#13853 from hvanhovell/SPARK-15956-revert.
## What changes were proposed in this pull request?
This PR migrates some test cases introduced in #12313 as a follow-up of #13754 and #13766. These test cases cover `DataFrameWriter.insertInto()`, while the former two only cover SQL `INSERT` statements.
Note that the `testPartitionedTable` utility method tests both Hive SerDe tables and data source tables.
## How was this patch tested?
N/A
Author: Cheng Lian <lian@databricks.com>
Closes#13810 from liancheng/spark-16037-follow-up-tests.
## What changes were proposed in this pull request?
This PR adds the static partition support to INSERT statement when the target table is a data source table.
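A hedged example of the newly supported statement (table and column names are made up); the static partition value is given in the `PARTITION` clause and therefore omitted from the `SELECT` list:
```scala
spark.sql("""
  INSERT INTO TABLE sales
  PARTITION (dt = '2016-06-01')
  SELECT id, amount FROM staging
""")
```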
## How was this patch tested?
New tests in InsertIntoHiveTableSuite and DataSourceAnalysisSuite.
**Note: This PR is based on https://github.com/apache/spark/pull/13766. The last commit is the actual change.**
Author: Yin Huai <yhuai@databricks.com>
Closes#13769 from yhuai/SPARK-16030-1.
## What changes were proposed in this pull request?
The current table insertion has some weird behaviours:
1. Inserting into a partitioned table with mismatched columns gives a confusing error message for Hive tables, and a wrong result for data source tables.
2. Inserting into a partitioned table without a partition list gives a wrong result for Hive tables.
This PR fixes these 2 problems.
## How was this patch tested?
new test in hive `SQLQuerySuite`
Author: Wenchen Fan <wenchen@databricks.com>
Closes#13754 from cloud-fan/insert2.
## What changes were proposed in this pull request?
When inserting into an existing partitioned table, partitioning columns should always be determined by catalog metadata of the existing table to be inserted. Extra `partitionBy()` calls don't make sense, and mess up existing data because newly inserted data may have wrong partitioning directory layout.
## How was this patch tested?
New test case added in `InsertIntoHiveTableSuite`.
Author: Cheng Lian <lian@databricks.com>
Closes#13747 from liancheng/spark-16033-insert-into-without-partition-by.
#### What changes were proposed in this pull request?
`IF NOT EXISTS` in `INSERT OVERWRITE` should not support dynamic partitions. If we specify `IF NOT EXISTS`, the data from the insert statement does not appear in the table.
This PR is to issue an exception in this case, just like what Hive does. It also issues an exception if users specify `IF NOT EXISTS` without any `PARTITION` specification.
#### How was this patch tested?
Added test cases into `PlanParserSuite` and `InsertIntoHiveTableSuite`
Author: gatorsmile <gatorsmile@gmail.com>
Closes#13447 from gatorsmile/insertIfNotExist.
#### What changes were proposed in this pull request?
`HIVE_METASTORE_PARTITION_PRUNING` is a public `SQLConf`. When `true`, some predicates will be pushed down into the Hive metastore so that non-matching partitions can be eliminated earlier. The current default value is `false`. For performance improvement, users might turn this parameter on.
So far, the code base does not have such a test case to verify whether this `SQLConf` properly works. This PR is to improve the test case coverage for avoiding future regression.
#### How was this patch tested?
N/A
Author: gatorsmile <gatorsmile@gmail.com>
Closes#13716 from gatorsmile/addTestMetastorePartitionPruning.
#### What changes were proposed in this pull request?
~~If the temp table already exists, we should not silently replace it when doing `CACHE TABLE AS SELECT`. This is inconsistent with the behavior of `CREAT VIEW` or `CREATE TABLE`. This PR is to fix this silent drop.~~
~~Maybe, we also can introduce new syntax for replacing the existing one. For example, in Hive, to replace a view, the syntax should be like `ALTER VIEW AS SELECT` or `CREATE OR REPLACE VIEW AS SELECT`~~
The table name in `CACHE TABLE AS SELECT` should NOT contain database prefix like "database.table". Thus, this PR captures this in Parser and outputs a better error message, instead of reporting the view already exists.
In addition, this PR refactors the `Parser` to generate table identifiers instead of returning the table name as a string.
#### How was this patch tested?
- Added a test case for caching and uncaching qualified table names
- Fixed a few test cases that do not drop temp table at the end
- Added the related test case for the issue resolved in this PR
Author: gatorsmile <gatorsmile@gmail.com>
Author: xiaoli <lixiao1983@gmail.com>
Author: Xiao Li <xiaoli@Xiaos-MacBook-Pro.local>
Closes#13572 from gatorsmile/cacheTableAsSelect.
#### What changes were proposed in this pull request?
So far, we do not have test cases verifying whether the external parameters `HiveUtils.CONVERT_METASTORE_ORC` and `HiveUtils.CONVERT_METASTORE_PARQUET` properly work when users use non-default values. This PR is to add such test cases for avoiding potential regression.
#### How was this patch tested?
N/A
Author: gatorsmile <gatorsmile@gmail.com>
Closes#13622 from gatorsmile/addTestCase4parquetOrcConversion.
## What changes were proposed in this pull request?
This PR adds the support of conf `hive.metastore.warehouse.dir` back. With this patch, the way of setting the warehouse dir is described as follows:
* If `spark.sql.warehouse.dir` is set, `hive.metastore.warehouse.dir` will be automatically set to the value of `spark.sql.warehouse.dir`. The warehouse dir is effectively set to the value of `spark.sql.warehouse.dir`.
* If `spark.sql.warehouse.dir` is not set but `hive.metastore.warehouse.dir` is set, `spark.sql.warehouse.dir` will be automatically set to the value of `hive.metastore.warehouse.dir`. The warehouse dir is effectively set to the value of `hive.metastore.warehouse.dir`.
* If neither `spark.sql.warehouse.dir` nor `hive.metastore.warehouse.dir` is set, `hive.metastore.warehouse.dir` will be automatically set to the default value of `spark.sql.warehouse.dir`. The warehouse dir is effectively set to the default value of `spark.sql.warehouse.dir`.
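A hedged sketch of the first rule above (paths are illustrative): when both properties are supplied, `spark.sql.warehouse.dir` wins.
```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .config("spark.sql.warehouse.dir", "/user/hive/warehouse")      // effective warehouse dir
  .config("hive.metastore.warehouse.dir", "/some/other/location") // overridden by the value above
  .enableHiveSupport()
  .getOrCreate()
```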
## How was this patch tested?
`set hive.metastore.warehouse.dir` in `HiveSparkSubmitSuite`.
JIRA: https://issues.apache.org/jira/browse/SPARK-15959
Author: Yin Huai <yhuai@databricks.com>
Closes#13679 from yhuai/hiveWarehouseDir.
## What changes were proposed in this pull request?
This PR re-enables the `analyze MetastoreRelations` test in `org.apache.spark.sql.hive.StatisticsSuite`.
The flakiness of this test was traced back to a shared configuration option, `hive.exec.compress.output`, in `TestHive`. This property was set to `true` by the `HiveCompatibilitySuite`. I have added configuration resetting logic to `HiveComparisonTest`, in order to prevent such a thing from happening again.
## How was this patch tested?
Is a test.
Author: Herman van Hovell <hvanhovell@databricks.com>
Author: Herman van Hovell <hvanhovell@questtec.nl>
Closes#13498 from hvanhovell/SPARK-15011.
#### What changes were proposed in this pull request?
To uncache a table, we have three different ways:
- _SQL interface_: `UNCACHE TABLE`
- _DataSet API_: `sparkSession.catalog.uncacheTable`
- _DataSet API_: `sparkSession.table(tableName).unpersist()`
When the table is not cached,
- _SQL interface_: `UNCACHE TABLE non-cachedTable` -> **no error message**
- _Dataset API_: `sparkSession.catalog.uncacheTable("non-cachedTable")` -> **report a strange error message:**
```requirement failed: Table [a: int] is not cached```
- _Dataset API_: `sparkSession.table("non-cachedTable").unpersist()` -> **no error message**
This PR will make them consistent. No operation if the table has already been uncached.
In addition, this PR also removes `uncacheQuery`, renames `tryUncacheQuery` to `uncacheQuery`, and documents that it is a no-op if the table has already been uncached.
#### How was this patch tested?
Improved the existing test case for verifying the cases when the table has not been cached.
Also added test cases for verifying the cases when the table does not exist
Author: gatorsmile <gatorsmile@gmail.com>
Author: xiaoli <lixiao1983@gmail.com>
Author: Xiao Li <xiaoli@Xiaos-MacBook-Pro.local>
Closes#13593 from gatorsmile/uncacheNonCachedTable.
#### What changes were proposed in this pull request?
When fetching a partitioned table, the output contains wrong results: the order of partition key values does not match the order of partition key columns in the output schema. For example,
```SQL
CREATE TABLE table_with_partition(c1 string) PARTITIONED BY (p1 string,p2 string,p3 string,p4 string,p5 string)
INSERT OVERWRITE TABLE table_with_partition PARTITION (p1='a',p2='b',p3='c',p4='d',p5='e') SELECT 'blarr'
SELECT p1, p2, p3, p4, p5, c1 FROM table_with_partition
```
```
+---+---+---+---+---+-----+
| p1| p2| p3| p4| p5| c1|
+---+---+---+---+---+-----+
| d| e| c| b| a|blarr|
+---+---+---+---+---+-----+
```
The expected result should be
```
+---+---+---+---+---+-----+
| p1| p2| p3| p4| p5| c1|
+---+---+---+---+---+-----+
| a| b| c| d| e|blarr|
+---+---+---+---+---+-----+
```
This PR is to fix this by enforcing the order matches the table partition definition.
#### How was this patch tested?
Added a test case into `SQLQuerySuite`
Author: gatorsmile <gatorsmile@gmail.com>
Closes#13400 from gatorsmile/partitionedTableFetch.
## What changes were proposed in this pull request?
SparkSession.catalog.listFunctions currently returns all functions, including the list of built-in functions. This makes the method less useful, because any time it is run the result set contains over 100 built-in functions.
## How was this patch tested?
CatalogSuite
Author: Sandeep Singh <sandeep@techaddict.me>
Closes#13413 from techaddict/SPARK-15663.
#### What changes were proposed in this pull request?
**Issue:** Got wrong results or strange errors when appending data to a table with a mismatched file format.
_Example 1: PARQUET -> ORC_
```Scala
createDF(0, 9).write.format("parquet").saveAsTable("appendParquetToOrc")
createDF(10, 19).write.mode(SaveMode.Append).format("orc").saveAsTable("appendParquetToOrc")
```
Error we got:
```
Job aborted due to stage failure: Task 0 in stage 2.0 failed 1 times, most recent failure: Lost task 0.0 in stage 2.0 (TID 2, localhost): java.lang.RuntimeException: file:/private/var/folders/4b/sgmfldk15js406vk7lw5llzw0000gn/T/warehouse-bc8fedf2-aa6a-4002-a18b-524c6ac859d4/appendorctoparquet/part-r-00000-c0e3f365-1d46-4df5-a82c-b47d7af9feb9.snappy.orc is not a Parquet file. expected magic number at tail [80, 65, 82, 49] but found [79, 82, 67, 23]
```
_Example 2: JSON -> PARQUET_
```Scala
createDF(0, 9).write.format("json").saveAsTable("appendJsonToCSV")
createDF(10, 19).write.mode(SaveMode.Append).format("parquet").saveAsTable("appendJsonToCSV")
```
No exception, but wrong results:
```
+----+----+
| c1| c2|
+----+----+
|null|null|
|null|null|
|null|null|
|null|null|
| 0|str0|
| 1|str1|
| 2|str2|
| 3|str3|
| 4|str4|
| 5|str5|
| 6|str6|
| 7|str7|
| 8|str8|
| 9|str9|
+----+----+
```
_Example 3: Json -> Text_
```Scala
createDF(0, 9).write.format("json").saveAsTable("appendJsonToText")
createDF(10, 19).write.mode(SaveMode.Append).format("text").saveAsTable("appendJsonToText")
```
Error we got:
```
Text data source supports only a single column, and you have 2 columns.
```
This PR is to issue an exception with appropriate error messages.
#### How was this patch tested?
Added test cases.
Author: gatorsmile <gatorsmile@gmail.com>
Closes#13546 from gatorsmile/fileFormatCheck.
## What changes were proposed in this pull request?
Right now, Spark 2.0 does not load hive-site.xml. Based on users' feedback, it seems to make sense to still load this conf file.
This PR adds a `hadoopConf` API in `SharedState`, which is `sparkContext.hadoopConfiguration` by default. When users are under hive context, `SharedState.hadoopConf` will load hive-site.xml and append its configs to `sparkContext.hadoopConfiguration`.
When we need to read hadoop config in spark sql, we should call `SessionState.newHadoopConf`, which contains `sparkContext.hadoopConfiguration`, hive-site.xml and sql configs.
## How was this patch tested?
new test in `HiveDataFrameSuite`
Author: Wenchen Fan <wenchen@databricks.com>
Closes#13611 from cloud-fan/hive-site.
## What changes were proposed in this pull request?
Spark currently incorrectly continues to use cached data even if the underlying data is overwritten.
Current behavior:
```scala
val dir = "/tmp/test"
sqlContext.range(1000).write.mode("overwrite").parquet(dir)
val df = sqlContext.read.parquet(dir).cache()
df.count() // outputs 1000
sqlContext.range(10).write.mode("overwrite").parquet(dir)
sqlContext.read.parquet(dir).count() // outputs 1000 <---- We are still using the cached dataset
```
This patch fixes this bug by adding support for `REFRESH path` that invalidates and refreshes all the cached data (and the associated metadata) for any dataframe that contains the given data source path.
Expected behavior:
```scala
val dir = "/tmp/test"
sqlContext.range(1000).write.mode("overwrite").parquet(dir)
val df = sqlContext.read.parquet(dir).cache()
df.count() // outputs 1000
sqlContext.range(10).write.mode("overwrite").parquet(dir)
spark.catalog.refreshResource(dir)
sqlContext.read.parquet(dir).count() // outputs 10 <---- We are not using the cached dataset
```
## How was this patch tested?
Unit tests for overwrites and appends in `ParquetQuerySuite` and `CachedTableSuite`.
Author: Sameer Agarwal <sameer@databricks.com>
Closes#13566 from sameeragarwal/refresh-path-2.
## What changes were proposed in this pull request?
Currently, we always split a file when it is bigger than maxSplitBytes, but Hadoop's LineRecordReader does not respect the splits for compressed files correctly. We should have an API in FileFormat to check whether a file can be split or not.
This PR is based on #13442, and closes #13442.
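A rough sketch of the idea, not the actual `FileFormat` API: a format reports whether a given file can be split, and compressed text-based files (e.g. gzip) report false so that each such file is read as a single split.
```scala
import org.apache.hadoop.fs.Path

// Hypothetical helper: gzip streams have no block index, so a .gz file cannot be split.
def isSplitable(path: Path): Boolean = !path.getName.endsWith(".gz")
```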
## How was this patch tested?
add regression tests.
Author: Davies Liu <davies@databricks.com>
Closes#13531 from davies/fix_split.
## What changes were proposed in this pull request?
Replace all occurrences of `None: Option[X]` with `Option.empty[X]`
## How was this patch tested?
Existing tests.
Author: Sandeep Singh <sandeep@techaddict.me>
Closes#13591 from techaddict/minor-7.
## What changes were proposed in this pull request?
This patch moves some code in `DataFrameWriter.insertInto` that belongs in the `Analyzer`.
## How was this patch tested?
Existing tests.
Author: Liang-Chi Hsieh <simonh@tw.ibm.com>
Closes#13496 from viirya/move-analyzer-stuff.
## What changes were proposed in this pull request?
* Add DataFrameWriter.foreach to allow the user consuming data in ContinuousQuery
* ForeachWriter is the interface for the user to consume partitions of data
* Add a type parameter T to DataFrameWriter
Usage
```Scala
val ds = spark.read....stream().as[String]
ds.....write
  .queryName(...)
  .option("checkpointLocation", ...)
  .foreach(new ForeachWriter[String] {
    def open(partitionId: Long, version: Long): Boolean = {
      // prepare some resources for a partition
      // check `version` if possible and return `false` if this is duplicated data, to skip the data processing
    }
    override def process(value: String): Unit = {
      // process data
    }
    def close(errorOrNull: Throwable): Unit = {
      // release resources for a partition
      // check `errorOrNull` and handle the error if necessary
    }
  })
```
## How was this patch tested?
New unit tests.
Author: Shixiong Zhu <shixiong@databricks.com>
Closes#13342 from zsxwing/foreach.
## What changes were proposed in this pull request?
`an -> a`
Use cmds like `find . -name '*.R' | xargs -i sh -c "grep -in ' an [^aeiou]' {} && echo {}"` to generate candidates, and review them one by one.
## How was this patch tested?
manual tests
Author: Zheng RuiFeng <ruifengz@foxmail.com>
Closes#13515 from zhengruifeng/an_a.
## What changes were proposed in this pull request?
As of this patch, the following throws an exception because the schemas may not match:
```
CREATE TABLE students (age INT, name STRING) AS SELECT * FROM boxes
```
but this is OK:
```
CREATE TABLE students AS SELECT * FROM boxes
```
## How was this patch tested?
SQLQuerySuite, HiveDDLCommandSuite
Author: Andrew Or <andrew@databricks.com>
Closes#13490 from andrewor14/ctas-no-column.
## What changes were proposed in this pull request?
For consistency, this PR updates some remaining `TungstenAggregation/SortBasedAggregate` after SPARK-15728.
- Update a comment in codegen in `VectorizedHashMapGenerator.scala`.
- `TungstenAggregationQuerySuite` --> `HashAggregationQuerySuite`
- `TungstenAggregationQueryWithControlledFallbackSuite` --> `HashAggregationQueryWithControlledFallbackSuite`
- Update two error messages in `SQLQuerySuite.scala` and `AggregationQuerySuite.scala`.
- Update several comments.
## How was this patch tested?
Manual (Only comment changes and test suite renamings).
Author: Dongjoon Hyun <dongjoon@apache.org>
Closes#13487 from dongjoon-hyun/SPARK-15744.
## What changes were proposed in this pull request?
This command didn't work for Hive tables. Now it does:
```
ALTER TABLE boxes PARTITION (width=3)
SET SERDE 'com.sparkbricks.serde.ColumnarSerDe'
WITH SERDEPROPERTIES ('compress'='true')
```
## How was this patch tested?
`HiveExternalCatalogSuite`
Author: Andrew Or <andrew@databricks.com>
Closes#13453 from andrewor14/alter-partition-storage.
## What changes were proposed in this pull request?
Currently we don't support bucketing for `save` and `insertInto`.
For `save`, we just write the data out into a directory the user specified; it is not a table, and we don't keep its metadata. When we read it back, we have no idea whether the data is bucketed or not, so it doesn't make sense to use `save` to write bucketed data, as we can't use the bucket information anyway.
We can support it in the future, once we have features like bucket discovery, or we save bucket information in the data directory too, so that we don't need to rely on a metastore.
For `insertInto`, it inserts data into an existing table, so it doesn't make sense to specify bucket information, as we should get the bucket information from the existing table.
This PR improves the error message for the above 2 cases.
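Hedged illustrations of the two rejected calls (the DataFrame `df` and the names below are made up); both are now expected to fail with a clearer error message instead of silently misbehaving:
```scala
// Case 1: `save` writes to a plain directory, so any bucketing metadata would be lost.
df.write.bucketBy(4, "id").parquet("/tmp/bucketed_output")

// Case 2: `insertInto` targets an existing table, whose own bucketing spec should be used.
df.write.bucketBy(4, "id").insertInto("existing_table")
```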
## How was this patch tested?
new test in `BucketedWriteSuite`
Author: Wenchen Fan <wenchen@databricks.com>
Closes#13452 from cloud-fan/error-msg.
## What changes were proposed in this pull request?
This PR disables writing Parquet summary files by default (i.e., when Hadoop configuration "parquet.enable.summary-metadata" is not set).
Please refer to [SPARK-15719][1] for more details.
## How was this patch tested?
New test case added in `ParquetQuerySuite` to check no summary files are written by default.
[1]: https://issues.apache.org/jira/browse/SPARK-15719
Author: Cheng Lian <lian@databricks.com>
Closes#13455 from liancheng/spark-15719-disable-parquet-summary-files.
## What changes were proposed in this pull request?
This PR bans syntax like `CREATE TEMPORARY TABLE USING AS SELECT`
`CREATE TEMPORARY TABLE ... USING ... AS ...` is not properly implemented: the temporary data is not cleaned up when the session exits. Before a full fix, we probably should ban this syntax.
This PR only impact syntax like `CREATE TEMPORARY TABLE ... USING ... AS ...`.
Other syntax like `CREATE TEMPORARY TABLE .. USING ...` and `CREATE TABLE ... USING ...` are not impacted.
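Hedged examples of the affected and unaffected forms (paths and table names are illustrative):
```scala
// Now banned: a temporary table defined by a query through a data source.
spark.sql("CREATE TEMPORARY TABLE t USING parquet OPTIONS (path '/tmp/t') AS SELECT 1 AS id")

// Still allowed: a temporary table over existing data, and a regular data source CTAS.
spark.sql("CREATE TEMPORARY TABLE t2 USING parquet OPTIONS (path '/tmp/existing')")
spark.sql("CREATE TABLE t3 USING parquet AS SELECT 1 AS id")
```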
## How was this patch tested?
Unit test.
Author: Sean Zhong <seanzhong@databricks.com>
Closes#13451 from clockfly/ban_create_temp_table_using_as.
#### What changes were proposed in this pull request?
This PR is to address the following issues:
- **ISSUE 1:** For the ORC source format, we report a strange error message when Hive support is not enabled:
```SQL
SQL Example:
select id from `org.apache.spark.sql.hive.orc`.`file_path`
Error Message:
Table or view not found: `org.apache.spark.sql.hive.orc`.`file_path`
```
Instead, we should issue the error message like:
```
Expected Error Message:
The ORC data source must be used with Hive support enabled
```
- **ISSUE 2:** For the Avro format, we report a strange error message. The example query is like:
```SQL
SQL Example:
select id from `avro`.`file_path`
select id from `com.databricks.spark.avro`.`file_path`
Error Message:
Table or view not found: `com.databricks.spark.avro`.`file_path`
```
The desired message should be like:
```
Expected Error Message:
Failed to find data source: avro. Please use Spark package http://spark-packages.org/package/databricks/spark-avro"
```
- ~~**ISSUE 3:** Unable to detect incompatibility libraries for Spark 2.0 in Data Source Resolution. We report a strange error message:~~
**Update**: The latest code changes contains
- For the JDBC format, we added an extra check in the rule `ResolveRelations` of the `Analyzer`. Without the PR, Spark would return an error message like: `Option 'url' not specified`. Now, we report `Unsupported data source type for direct query on files: jdbc`
- Make the data source format name case insensitive so that error handling behaves consistently with the normal cases.
- Added the test cases for all the supported formats.
#### How was this patch tested?
Added test cases to cover all the above issues
Author: gatorsmile <gatorsmile@gmail.com>
Author: xiaoli <lixiao1983@gmail.com>
Author: Xiao Li <xiaoli@Xiaos-MacBook-Pro.local>
Closes#13283 from gatorsmile/runSQLAgainstFile.
## What changes were proposed in this pull request?
When `spark.sql.hive.convertCTAS` is true, for a CTAS statement, we will create a data source table using the default source (i.e. parquet) if the CTAS does not specify any Hive storage format. However, there are two issues with this conversion logic.
1. First, we determine whether a CTAS statement defines a storage format by checking the serde. However, TEXTFILE/SEQUENCEFILE do not have a default serde. When we do the check, we have not yet set the default serde. So, a query like `CREATE TABLE abc STORED AS TEXTFILE AS SELECT ...` actually creates a data source parquet table.
2. In the conversion logic, we are ignoring the user-specified location.
This PR fixes the above two issues.
Also, this PR makes the parser throw an exception when a CTAS statement has a PARTITIONED BY clause. This change is made because Hive's syntax does not allow it and our current implementation actually does not work for this case (the insert operation always throws an exception because the insertion does not pick up the partitioning info).
## How was this patch tested?
I am adding new tests in SQLQuerySuite and HiveDDLCommandSuite.
Author: Yin Huai <yhuai@databricks.com>
Closes#13386 from yhuai/SPARK-14507.
This PR is an alternative to #13120 authored by xwu0226.
## What changes were proposed in this pull request?
When creating an external Spark SQL data source table and persisting its metadata to Hive metastore, we don't use the standard Hive `Table.dataLocation` field because Hive only allows directory paths as data locations while Spark SQL also allows file paths. However, if we don't set `Table.dataLocation`, Hive always creates an unexpected empty table directory under database location, but doesn't remove it while dropping the table (because the table is external).
This PR works around this issue by explicitly setting `Table.dataLocation` and then manually removing the created directory after creating the external table.
Please refer to [this JIRA comment][1] for more details about why we chose this approach as a workaround.
[1]: https://issues.apache.org/jira/browse/SPARK-15269?focusedCommentId=15297408&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-15297408
## How was this patch tested?
1. A new test case is added in `HiveQuerySuite` for this case
2. Updated `ShowCreateTableSuite` to use the same table name in all test cases. (This is how I hit this issue at the first place.)
Author: Cheng Lian <lian@databricks.com>
Closes#13270 from liancheng/spark-15269-unpleasant-fix.
## What changes were proposed in this pull request?
This PR changes function `SparkSession.builder.sparkContext(..)` from **private[sql]** into **private[spark]**, and uses it if applicable like the followings.
```
- val spark = SparkSession.builder().config(sc.getConf).getOrCreate()
+ val spark = SparkSession.builder().sparkContext(sc).getOrCreate()
```
## How was this patch tested?
Pass the existing Jenkins tests.
Author: Dongjoon Hyun <dongjoon@apache.org>
Closes#13365 from dongjoon-hyun/SPARK-15618.
In this case, the result type of the expression becomes DECIMAL(38, 36), as we promote the individual string literals to DECIMAL(38, 18) when we handle string promotions for `BinaryArithmetic` expressions.
I think we need to cast the string literals to Double type instead. I looked at the history and found that this was changed to use decimal instead of double to avoid potential loss of precision when we cast decimal to double.
To double check, I ran the query against Hive and MySQL. The query returns a non-NULL result for both databases, and both promote the expression to double.
Here is the output.
- Hive
```SQL
hive> create table l2 as select (cast(99 as decimal(19,6)) + '2') from l1;
OK
hive> describe l2;
OK
_c0 double
```
- MySQL
```SQL
mysql> create table foo2 as select (cast(99 as decimal(19,6)) + '2') from test;
Query OK, 1 row affected (0.01 sec)
Records: 1 Duplicates: 0 Warnings: 0
mysql> describe foo2;
+-----------------------------------+--------+------+-----+---------+-------+
| Field | Type | Null | Key | Default | Extra |
+-----------------------------------+--------+------+-----+---------+-------+
| (cast(99 as decimal(19,6)) + '2') | double | NO | | 0 | |
+-----------------------------------+--------+------+-----+---------+-------+
```
## How was this patch tested?
Added a new test in SQLQuerySuite
Author: Dilip Biswal <dbiswal@us.ibm.com>
Closes#13368 from dilipbiswal/spark-15557.
## What changes were proposed in this pull request?
This patch contains a list of changes as a result of my auditing Dataset, SparkSession, and SQLContext. The patch audits the categorization of experimental APIs, function groups, and deprecations. For the detailed list of changes, please see the diff.
## How was this patch tested?
N/A
Author: Reynold Xin <rxin@databricks.com>
Closes#13370 from rxin/SPARK-15638.
## What changes were proposed in this pull request?
I created a bucketed table `bucketed_table` with bucket column `i`:
```scala
case class Data(i: Int, j: Int, k: Int)
sc.makeRDD(Array((1, 2, 3))).map(x => Data(x._1, x._2, x._3)).toDF.write.bucketBy(2, "i").saveAsTable("bucketed_table")
```
and I run the following SQLs:
```sql
SELECT j FROM bucketed_table;
Error in query: bucket column i not found in existing columns (j);
SELECT j, MAX(k) FROM bucketed_table GROUP BY j;
Error in query: bucket column i not found in existing columns (j, k);
```
I think we should add a check so that we only enable bucketing when all of the conditions below are satisfied:
1. the conf is enabled
2. the relation is bucketed
3. the output contains all bucketing columns
## How was this patch tested?
Updated test cases to reflect the changes.
Author: Yadong Qi <qiyadong2010@gmail.com>
Closes#13321 from watermen/SPARK-15549.
## What changes were proposed in this pull request?
- Refer to the JIRA for the problem: https://issues.apache.org/jira/browse/SPARK-14400
- The fix is to check, in `hasNext()`, whether the process has exited with a non-zero exit code. I have moved this, together with the check of the writer thread exception, into a separate method.
## How was this patch tested?
- Ran a job which had incorrect transform script command and saw that the job fails
- Existing unit tests for `ScriptTransformationSuite`. Added a new unit test
Author: Tejas Patil <tejasp@fb.com>
Closes#12194 from tejasapatil/script_transform.
## What changes were proposed in this pull request?
This PR replaces `spark.sql.sources.` strings with `CreateDataSourceTableUtils.*` constant variables.
## How was this patch tested?
Pass the existing Jenkins tests.
Author: Dongjoon Hyun <dongjoon@apache.org>
Closes#13349 from dongjoon-hyun/SPARK-15584.
#### What changes were proposed in this pull request?
This PR is to use the new entry point `SparkSession` to replace the existing `SQLContext` and `HiveContext` in SQL test suites.
No change is made in the following suites:
- `ListTablesSuite` is to test the APIs of `SQLContext`.
- `SQLContextSuite` is to test `SQLContext`
- `HiveContextCompatibilitySuite` is to test `HiveContext`
**Update**: Move tests in `ListTableSuite` to `SQLContextSuite`
#### How was this patch tested?
N/A
Author: gatorsmile <gatorsmile@gmail.com>
Author: xiaoli <lixiao1983@gmail.com>
Author: Xiao Li <xiaoli@Xiaos-MacBook-Pro.local>
Closes#13337 from gatorsmile/sparkSessionTest.
## What changes were proposed in this pull request?
Certain table properties (and SerDe properties) are in the protected namespace `spark.sql.sources.`, which we use internally for datasource tables. The user should not be allowed to
(1) Create a Hive table setting these properties
(2) Alter these properties in an existing table
Previously, we threw an exception if the user tried to alter the properties of an existing datasource table. However, this is overly restrictive for datasource tables and does not do anything for Hive tables.
## How was this patch tested?
DDLSuite
Author: Andrew Or <andrew@databricks.com>
Closes#13341 from andrewor14/alter-table-props.
## What changes were proposed in this pull request?
SparkSession has a list of unnecessary private[sql] methods. These methods cause some trouble because private[sql] doesn't apply in Java. In the cases that they are easy to remove, we can simply remove them. This patch does that.
As part of this pull request, I also replaced a bunch of protected[sql] with private[sql], to tighten up visibility.
## How was this patch tested?
Updated test cases to reflect the changes.
Author: Reynold Xin <rxin@databricks.com>
Closes#13319 from rxin/SPARK-15552.
## What changes were proposed in this pull request?
Same as #13302, but for DROP TABLE.
## How was this patch tested?
`DDLSuite`
Author: Andrew Or <andrew@databricks.com>
Closes#13307 from andrewor14/drop-table.
## What changes were proposed in this pull request?
For some of the test cases, e.g. `OrcSourceSuite`, we create temp folders and temp files inside them. But after the tests finish, the folders are not removed. This leaves lots of temp files around and occupies space if we keep running the test cases.
The reason is that `dir.delete()` won't work if `dir` is not empty. We need to recursively delete the contents before deleting the folder.
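A minimal sketch of the fix (not necessarily the exact utility used by the patch): delete a directory's contents before the directory itself.
```scala
import java.io.File

def deleteRecursively(f: File): Unit = {
  if (f.isDirectory) {
    // listFiles() can return null on I/O errors, so guard against it.
    Option(f.listFiles()).getOrElse(Array.empty[File]).foreach(deleteRecursively)
  }
  f.delete()  // succeeds now that the directory is empty (or f is a plain file)
}
```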
## How was this patch tested?
Manually checked the temp folder to make sure the temp files were deleted.
Author: Bo Meng <mengbo@hotmail.com>
Closes#13304 from bomeng/SPARK-15537.
## What changes were proposed in this pull request?
This patch renames various DefaultSources to make their names more self-describing. The choice of "DefaultSource" was from the days when we did not have a good way to specify short names.
They are now named:
- LibSVMFileFormat
- CSVFileFormat
- JdbcRelationProvider
- JsonFileFormat
- ParquetFileFormat
- TextFileFormat
Backward compatibility is maintained through aliasing.
## How was this patch tested?
Updated relevant test cases too.
Author: Reynold Xin <rxin@databricks.com>
Closes#13311 from rxin/SPARK-15543.
## What changes were proposed in this pull request?
Two changes:
- When things fail, `TRUNCATE TABLE` just returns nothing. Instead, we should throw exceptions.
- Remove `TRUNCATE TABLE ... COLUMN`, which was never supported by either Spark or Hive.
## How was this patch tested?
Jenkins.
Author: Andrew Or <andrew@databricks.com>
Closes#13302 from andrewor14/truncate-table.
## What changes were proposed in this pull request?
This patch removes the last two commands defined in the catalyst module: DescribeFunction and ShowFunctions. They were unnecessary since the parser could just generate DescribeFunctionCommand and ShowFunctionsCommand directly.
## How was this patch tested?
Created a new SparkSqlParserSuite.
Author: Reynold Xin <rxin@databricks.com>
Closes#13292 from rxin/SPARK-15436.
## What changes were proposed in this pull request?
Currently, if a table is used in a join operation we rely on the metastore-reported size to decide whether we can convert the operation to a broadcast join. This optimization only kicks in for tables that have statistics available in the metastore. Hive generally rolls over to HDFS if the statistics are not available directly from the metastore, and this seems like a reasonable choice to adopt given the optimization benefit of using broadcast joins.
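A hedged sketch of the fallback (not the actual `HiveMetastoreCatalog` code): when the metastore has no size statistics for a table, estimate its size from the files under its location and feed that into the broadcast-join decision.
```scala
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.Path

// Hypothetical helper: total length in bytes of everything under the table's location.
def estimateSizeFromFileSystem(location: Path, hadoopConf: Configuration): Long = {
  val fs = location.getFileSystem(hadoopConf)
  fs.getContentSummary(location).getLength
}
```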
## How was this patch tested?
I have executed queries locally to test.
Author: Parth Brahmbhatt <pbrahmbhatt@netflix.com>
Closes#13150 from Parth-Brahmbhatt/SPARK-15365.
## What changes were proposed in this pull request?
`spark.sql("CREATE FUNCTION myfunc AS 'com.haizhi.bdp.udf.UDFGetGeoCode'")` throws "org.apache.hadoop.hive.ql.metadata.HiveException:MetaException(message:NoSuchObjectException(message:Function default.myfunc does not exist))" with Hive 1.2.1.
I think it was introduced by PR #12853. This PR fixes it by catching Exception (not NoSuchObjectException) and string matching.
## How was this patch tested?
added a unit test and also tested it manually
Author: wangyang <wangyang@haizhi.com>
Closes#13177 from wangyang1992/fixCreateFunc2.
## What changes were proposed in this pull request?
Currently the command `ADD FILE|JAR <filepath | jarpath>` is supported natively in Spark SQL. However, when this command is run, the file/jar is added to the resources but cannot be looked up by the `LIST FILE(s)|JAR(s)` command, because the `LIST` command is passed to the Hive command processor in Spark-SQL or is simply not supported in Spark-shell. There is no way users can find out what files/jars have been added to the Spark context.
Refer to [Hive commands](https://cwiki.apache.org/confluence/display/Hive/LanguageManual+Cli)
This PR is to support following commands:
`LIST (FILE[s] [filepath ...] | JAR[s] [jarfile ...])`
### For example:
##### LIST FILE(s)
```
scala> spark.sql("add file hdfs://bdavm009.svl.ibm.com:8020/tmp/test.txt")
res1: org.apache.spark.sql.DataFrame = []
scala> spark.sql("add file hdfs://bdavm009.svl.ibm.com:8020/tmp/test1.txt")
res2: org.apache.spark.sql.DataFrame = []
scala> spark.sql("list file hdfs://bdavm009.svl.ibm.com:8020/tmp/test1.txt").show(false)
+----------------------------------------------+
|result |
+----------------------------------------------+
|hdfs://bdavm009.svl.ibm.com:8020/tmp/test1.txt|
+----------------------------------------------+
scala> spark.sql("list files").show(false)
+----------------------------------------------+
|result |
+----------------------------------------------+
|hdfs://bdavm009.svl.ibm.com:8020/tmp/test1.txt|
|hdfs://bdavm009.svl.ibm.com:8020/tmp/test.txt |
+----------------------------------------------+
```
##### LIST JAR(s)
```
scala> spark.sql("add jar /Users/xinwu/spark/core/src/test/resources/TestUDTF.jar")
res9: org.apache.spark.sql.DataFrame = [result: int]
scala> spark.sql("list jar TestUDTF.jar").show(false)
+---------------------------------------------+
|result |
+---------------------------------------------+
|spark://192.168.1.234:50131/jars/TestUDTF.jar|
+---------------------------------------------+
scala> spark.sql("list jars").show(false)
+---------------------------------------------+
|result |
+---------------------------------------------+
|spark://192.168.1.234:50131/jars/TestUDTF.jar|
+---------------------------------------------+
```
## How was this patch tested?
New test cases are added for Spark-SQL, Spark-Shell and SparkContext API code path.
Author: Xin Wu <xinwu@us.ibm.com>
Author: xin Wu <xinwu@us.ibm.com>
Closes#13212 from xwu0226/list_command.
## What changes were proposed in this pull request?
The user may do something like:
```
CREATE TABLE my_tab ROW FORMAT SERDE 'anything' STORED AS PARQUET
CREATE TABLE my_tab ROW FORMAT SERDE 'anything' STORED AS ... SERDE 'myserde'
CREATE TABLE my_tab ROW FORMAT DELIMITED ... STORED AS ORC
CREATE TABLE my_tab ROW FORMAT DELIMITED ... STORED AS ... SERDE 'myserde'
```
None of these should be allowed because the SerDes conflict. As of this patch:
- `ROW FORMAT DELIMITED` is only compatible with `TEXTFILE`
- `ROW FORMAT SERDE` is only compatible with `TEXTFILE`, `RCFILE` and `SEQUENCEFILE`
## How was this patch tested?
New tests in `DDLCommandSuite`.
Author: Andrew Or <andrew@databricks.com>
Closes#13068 from andrewor14/row-format-conflict.
## What changes were proposed in this pull request?
In order to prevent users from inadvertently writing queries with cartesian joins, this patch introduces a new conf `spark.sql.crossJoin.enabled` (set to `false` by default). If it is not enabled, a `SparkException` is thrown when the query contains one or more cartesian products.
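A minimal usage sketch of the new conf, assuming a `SparkSession` named `spark` (data and values are illustrative):
```scala
val left = spark.range(3).toDF("a")
val right = spark.range(3).toDF("b")

// With spark.sql.crossJoin.enabled at its default (false), a join with no join
// condition contains a cartesian product and is rejected with an exception.
// left.join(right).count()

// Explicitly opting in allows the cartesian product.
spark.conf.set("spark.sql.crossJoin.enabled", "true")
left.join(right).count() // 9
```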
## How was this patch tested?
Added a test to verify the new behavior in `JoinSuite`. Additionally, `SQLQuerySuite` and `SQLMetricsSuite` were modified to explicitly enable cartesian products.
Author: Sameer Agarwal <sameer@databricks.com>
Closes#13209 from sameeragarwal/disallow-cartesian.
## What changes were proposed in this pull request?
Add new test cases for including distinct aggregate in having clause in 2.0 branch.
This is a followup PR for [#12974](https://github.com/apache/spark/pull/12974), which is for 1.6 branch.
Author: xin Wu <xinwu@us.ibm.com>
Closes#12984 from xwu0226/SPARK-15206.
## What changes were proposed in this pull request?
I initially asked to create a hivecontext-compatibility module to put HiveContext there. But we are so close to the Spark 2.0 release, and there is only a single class in it. It seems overkill, and more inconvenient, to have an entire module for a single class.
## How was this patch tested?
Tests were moved.
Author: Reynold Xin <rxin@databricks.com>
Closes#13207 from rxin/SPARK-15424.
This reverts commit 8d05a7a from #12855, which seems to have caused regressions when working with empty DataFrames.
Author: Michael Armbrust <michael@databricks.com>
Closes#13181 from marmbrus/revert12855.
## What changes were proposed in this pull request?
We started this convention to append Command suffix to all SQL commands. However, not all commands follow that convention. This patch adds Command suffix to all RunnableCommands.
## How was this patch tested?
Updated test cases to reflect the renames.
Author: Reynold Xin <rxin@databricks.com>
Closes#13215 from rxin/SPARK-15435.
#### What changes were proposed in this pull request?
`refreshTable` was a method in `HiveContext`. It was deleted accidentally while we were migrating the APIs. This PR is to add it back to `HiveContext`.
In addition, in `SparkSession`, we put it under the catalog namespace (`SparkSession.catalog.refreshTable`).
#### How was this patch tested?
Changed the existing test cases to use the function `refreshTable`. Also added a test case for refreshTable in `hivecontext-compatibility`
Author: gatorsmile <gatorsmile@gmail.com>
Closes#13156 from gatorsmile/refreshTable.
## What changes were proposed in this pull request?
The `TRUNCATE TABLE` command is supported by Hive. See the link: https://cwiki.apache.org/confluence/display/Hive/LanguageManual+DDL
Below is the related Hive JIRA: https://issues.apache.org/jira/browse/HIVE-446
This PR implements such a command for Spark SQL, excluding column truncation (HIVE-4005).
## How was this patch tested?
Added a test case.
Author: Lianhui Wang <lianhuiwang09@gmail.com>
Closes#13170 from lianhuiwang/truncate.
## What changes were proposed in this pull request?
Currently SparkSession.Builder uses SQLContext.getOrCreate. It should probably be the other way around, i.e. all the core logic goes in SparkSession, and SQLContext just calls that. This patch does that.
This patch also makes sure config options specified in the builder are propagated to the existing (and of course the new) SparkSession.
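A hedged example of the builder behavior described above; the config key and values are illustrative:
```scala
import org.apache.spark.sql.SparkSession

// Options set on the builder are propagated even when an existing
// SparkSession is returned by getOrCreate().
val spark = SparkSession.builder()
  .master("local[2]")
  .config("spark.sql.shuffle.partitions", "4")
  .getOrCreate()

spark.conf.get("spark.sql.shuffle.partitions") // "4"
```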
## How was this patch tested?
Updated tests to reflect the change, and also introduced a new SparkSessionBuilderSuite that should cover all the branches.
Author: Reynold Xin <rxin@databricks.com>
Closes#13200 from rxin/SPARK-15075.
## What changes were proposed in this pull request?
This PR is a follow-up of #13079. It replaces `hasUnsupportedFeatures: Boolean` in `CatalogTable` with `unsupportedFeatures: Seq[String]`, which contains unsupported Hive features of the underlying Hive table. In this way, we can accurately report all unsupported Hive features in the exception message.
## How was this patch tested?
Updated existing test case to check exception message.
Author: Cheng Lian <lian@databricks.com>
Closes#13173 from liancheng/spark-14346-follow-up.
## What changes were proposed in this pull request?
This PR fixes a Scala 2.10 compilation failure introduced in PR #13127.
## How was this patch tested?
Jenkins build.
Author: Cheng Lian <lian@databricks.com>
Closes#13166 from liancheng/hotfix-for-scala-2.10.
## What changes were proposed in this pull request?
HiveClient facade is not compatible with Hive 0.12.
This PR fixes the following compatibility issues:
1. `org.apache.spark.sql.hive.client.HiveClientImpl` use `AddPartitionDesc(db, table, ignoreIfExists)` to create partitions, however, Hive 0.12 doesn't have this constructor for `AddPartitionDesc`.
2. `HiveClientImpl` uses `PartitionDropOptions` when dropping partition, however, class `PartitionDropOptions` doesn't exist in Hive 0.12.
3. Hive 0.12 doesn't support adding permanent functions. It is not valid to call `org.apache.hadoop.hive.ql.metadata.Hive.createFunction` and `org.apache.hadoop.hive.ql.metadata.Hive.alterFunction`.
4. `org.apache.spark.sql.hive.client.VersionsSuite` doesn't have enough test coverage for different hive versions 0.12, 0.13, 0.14, 1.0.0, 1.1.0, 1.2.0.
## How was this patch tested?
Unit test.
Author: Sean Zhong <seanzhong@databricks.com>
Closes#13127 from clockfly/versionSuite.
## What changes were proposed in this pull request?
Update the unit test code, examples, and documents to remove calls to deprecated method `dataset.registerTempTable`.
## How was this patch tested?
This PR only changes the unit test code, examples, and comments. It should be safe.
This is a follow up of PR https://github.com/apache/spark/pull/12945 which was merged.
Author: Sean Zhong <seanzhong@databricks.com>
Closes#13098 from clockfly/spark-15171-remove-deprecation.
## What changes were proposed in this pull request?
This is a follow-up of #12781. It adds native `SHOW CREATE TABLE` support for Hive tables and views. A new field `hasUnsupportedFeatures` is added to `CatalogTable` to indicate whether all table metadata retrieved from the concrete underlying external catalog (i.e. Hive metastore in this case) can be mapped to fields in `CatalogTable`. This flag is useful when the target Hive table contains structures that can't be handled by Spark SQL, e.g., skewed columns, storage handlers, etc.
## How was this patch tested?
New test cases are added in `ShowCreateTableSuite` to do round-trip tests.
Author: Cheng Lian <lian@databricks.com>
Closes#13079 from liancheng/spark-14346-show-create-table-for-hive-tables.
## What changes were proposed in this pull request?
Currently, an `INSERT INTO` with a `GROUP BY` query tries to make at least 200 files (the default value of `spark.sql.shuffle.partitions`), which results in lots of empty files.
This PR avoids creating empty files when overwriting into a Hive table, and in internal data sources, for group-by queries.
It checks whether the given partition has data in it or not, and creates/writes a file only when it actually has data.
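A minimal sketch of that check, assuming a per-partition iterator of rows and a hypothetical `newWriter()` factory:
```scala
// Only open an output file when the partition actually contains rows;
// empty partitions produce no file at all.
def writePartition[T](rows: Iterator[T], newWriter: () => java.io.Writer): Unit = {
  if (rows.hasNext) {
    val writer = newWriter()
    try rows.foreach(r => writer.write(r.toString + "\n"))
    finally writer.close()
  }
}
```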
## How was this patch tested?
Unittests in `InsertIntoHiveTableSuite` and `HadoopFsRelationTest`.
Closes#8411
Author: hyukjinkwon <gurwls223@gmail.com>
Author: Keuntae Park <sirpkt@apache.org>
Closes#12855 from HyukjinKwon/pr/8411.
## What changes were proposed in this pull request?
"DESCRIBE table" is broken when table schema is stored at key "spark.sql.sources.schema".
Originally, we used spark.sql.sources.schema to store the schema of a data source table.
After SPARK-6024, we removed this flag. Although we are not using spark.sql.sources.schema any more, we still need to support it.
## How was this patch tested?
Unit test.
When using Spark 2.0 to load a table generated by Spark 1.2:
Before the change:
`DESCRIBE table` => "Schema of this table is inferred at runtime."
After the change:
`DESCRIBE table` => correct output.
Author: Sean Zhong <seanzhong@databricks.com>
Closes#13073 from clockfly/spark-15253.
## What changes were proposed in this pull request?
Currently, the Parquet, JSON and CSV data sources each have a class for their options (`ParquetOptions`, `JSONOptions` and `CSVOptions`). Gathering options into such a class makes them convenient to manage. The `JDBC`, `Text`, `libsvm` and `ORC` data sources do not have such a class yet; it would be nicer if their options were in the same unified format so that options can be added and managed consistently.
This PR refactors the options in Spark's internal data sources, adding the new classes `OrcOptions`, `TextOptions`, `JDBCOptions` and `LibSVMOptions`.
Also, this PR changes the default compression codec for ORC from `NONE` to `SNAPPY`.
## How was this patch tested?
Existing tests should cover this for refactoring and unittests in `OrcHadoopFsRelationSuite` for changing the default compression codec for ORC.
Author: hyukjinkwon <gurwls223@gmail.com>
Closes#13048 from HyukjinKwon/SPARK-15267.
## What changes were proposed in this pull request?
This patch adds support for a few SQL functions to improve compatibility with other databases: IFNULL, NULLIF, NVL and NVL2. In order to do this, this patch introduced a RuntimeReplaceable expression trait that allows replacing an unevaluable expression in the optimizer before evaluation.
Note that the semantics are not completely identical to other databases in esoteric cases.
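A hedged usage example of the new compatibility functions, assuming a `SparkSession` named `spark`:
```scala
// IFNULL/NVL return the second argument when the first is NULL;
// NULLIF returns NULL when both arguments are equal;
// NVL2 picks the second or third argument depending on whether the first is NULL.
spark.sql(
  "SELECT ifnull(NULL, 1), nullif(2, 2), nvl(NULL, 'x'), nvl2(NULL, 'a', 'b')"
).show()
```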
## How was this patch tested?
Added a new test suite SQLCompatibilityFunctionSuite.
Closes#12373.
Author: Reynold Xin <rxin@databricks.com>
Closes#13084 from rxin/SPARK-14541.
## What changes were proposed in this pull request?
We currently use the Hive implementations for the collect_list/collect_set aggregate functions. This has a few major drawbacks: the use of HiveUDAF (which has quite a bit of overhead) and the lack of support for struct datatypes. This PR adds native implementation of these functions to Spark.
The size of the collected list/set may vary; this means we cannot use the fast Tungsten aggregation path to perform the aggregation, and that we fall back to the slower sort-based path. Another big issue with these operators is that when the size of the collected list/set grows too large, we can start experiencing large GC pauses and OOMEs.
The `collect*` aggregates implemented in this PR rely on the sort-based aggregate path for correctness. They maintain their own internal buffer which holds the rows for one group at a time. The sort-based aggregation path is triggered by disabling `partialAggregation` for these aggregates (which is kinda funny); this technique is also employed in `org.apache.spark.sql.hive.HiveUDAFFunction`.
I have done some performance testing:
```scala
import org.apache.spark.sql.{Dataset, Row}
sql("create function collect_list2 as 'org.apache.hadoop.hive.ql.udf.generic.GenericUDAFCollectList'")
val df = range(0, 10000000).select($"id", (rand(213123L) * 100000).cast("int").as("grp"))
df.select(countDistinct($"grp")).show
def benchmark(name: String, plan: Dataset[Row], maxItr: Int = 5): Unit = {
  // Do not measure planning.
  plan.queryExecution.executedPlan
  // Execute the plan a number of times and average the result.
  val start = System.nanoTime
  var i = 0
  while (i < maxItr) {
    plan.rdd.foreach(_ => ())
    i += 1
  }
  val time = (System.nanoTime - start) / (maxItr * 1000000L)
  println(s"[$name] $maxItr iterations completed in an average time of $time ms.")
}
val plan1 = df.groupBy($"grp").agg(collect_list($"id"))
val plan2 = df.groupBy($"grp").agg(callUDF("collect_list2", $"id"))
benchmark("Spark collect_list", plan1)
...
> [Spark collect_list] 5 iterations completed in an average time of 3371 ms.
benchmark("Hive collect_list", plan2)
...
> [Hive collect_list] 5 iterations completed in an average time of 9109 ms.
```
Performance is improved by a factor of 2-3.
## How was this patch tested?
Added tests to `DataFrameAggregateSuite`.
Author: Herman van Hovell <hvanhovell@questtec.nl>
Closes#12874 from hvanhovell/implode.
#### What changes were proposed in this pull request?
~~Currently, multiple partitions are allowed to drop by using a single DDL command: Alter Table Drop Partition. However, the internal implementation could break atomicity. That means, we could just drop a subset of qualified partitions, if hitting an exception when dropping one of qualified partitions~~
~~This PR contains the following behavior changes:~~
~~- disallow dropping multiple partitions by a single command ~~
~~- allow users to input predicates in partition specification and issue a nicer error message if the predicate's comparison operator is not `=`.~~
~~- verify the partition spec in SessionCatalog. This can ensure each partition spec in `Drop Partition` does not correspond to multiple partitions.~~
This PR has two major parts:
- Verify the partition spec in SessionCatalog for fixing the following issue:
```scala
sql(s"ALTER TABLE $externalTab DROP PARTITION (ds='2008-04-09', unknownCol='12')")
```
The above example uses an invalid partition spec. Without this PR, we would drop all the partitions. The reason is that the Hive metastore's getPartitions API returns all the partitions if we provide an invalid spec.
- Re-implemented the `dropPartitions` in `HiveClientImpl`. Now, we always check whether all the user-specified partition specs exist before attempting to drop the partitions. Previously, we started dropping partitions before finishing the existence check of all the partition specs. If any failure happens after we start dropping partitions, we log an error message indicating which partitions have been dropped and which have not.
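A minimal sketch of the check-then-drop behavior described above; the `exists` and `drop` callbacks are hypothetical stand-ins for the metastore calls:
```scala
// Verify that every requested partition spec exists before dropping anything,
// so a missing spec cannot leave the table partially truncated.
def dropPartitions(
    specs: Seq[Map[String, String]],
    exists: Map[String, String] => Boolean,
    drop: Map[String, String] => Unit): Unit = {
  val missing = specs.filterNot(exists)
  require(missing.isEmpty, s"No such partition(s): ${missing.mkString(", ")}")
  specs.foreach(drop)
}
```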
#### How was this patch tested?
Modified the existing test cases and added new test cases.
Author: gatorsmile <gatorsmile@gmail.com>
Author: xiaoli <lixiao1983@gmail.com>
Author: Xiao Li <xiaoli@Xiaos-MacBook-Pro.local>
Closes#12801 from gatorsmile/banDropMultiPart.
## What changes were proposed in this pull request?
Deprecates registerTempTable and adds dataset.createTempView and dataset.createOrReplaceTempView.
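A hedged migration example, assuming a DataFrame `df` and a `SparkSession` named `spark`:
```scala
// Old, now deprecated:
// df.registerTempTable("people")

// New APIs added in this patch:
df.createTempView("people")            // fails if the view already exists
df.createOrReplaceTempView("people")   // replaces an existing view of the same name
spark.sql("SELECT count(*) FROM people").show()
```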
## How was this patch tested?
Unit tests.
Author: Sean Zhong <seanzhong@databricks.com>
Closes#12945 from clockfly/spark-15171.
## What changes were proposed in this pull request?
This PR adds native `SHOW CREATE TABLE` DDL command for data source tables. Support for Hive tables will be added in follow-up PR(s).
To show table creation DDL for data source tables created by CTAS statements, this PR also added partitioning and bucketing support for normal `CREATE TABLE ... USING ...` syntax.
## How was this patch tested?
A new test suite `ShowCreateTableSuite` is added in sql/hive package to test the new feature.
Author: Cheng Lian <lian@databricks.com>
Closes#12781 from liancheng/spark-14346-show-create-table.
## What changes were proposed in this pull request?
Before:
```sql
-- uses that location but issues a warning
CREATE TABLE my_tab LOCATION /some/path
-- deletes any existing data in the specified location
DROP TABLE my_tab
```
After:
```sql
-- uses that location but creates an EXTERNAL table instead
CREATE TABLE my_tab LOCATION /some/path
-- does not delete the data at /some/path
DROP TABLE my_tab
```
This patch essentially makes the `EXTERNAL` field optional. This is related to #13032.
## How was this patch tested?
New test in `DDLCommandSuite`.
Author: Andrew Or <andrew@databricks.com>
Closes#13060 from andrewor14/location-implies-external.
## What changes were proposed in this pull request?
Before:
```sql
-- uses warehouse dir anyway
CREATE EXTERNAL TABLE my_tab
-- doesn't actually delete the data
DROP TABLE my_tab
```
After:
```sql
-- no location is provided, throws exception
CREATE EXTERNAL TABLE my_tab
-- creates an external table using that location
CREATE EXTERNAL TABLE my_tab LOCATION '/path/to/something'
-- doesn't delete the data, which is expected
DROP TABLE my_tab
```
## How was this patch tested?
New test in `DDLCommandSuite`
Author: Andrew Or <andrew@databricks.com>
Closes#13032 from andrewor14/create-external-table-location.
Table partitions can be added with locations different from the default warehouse location of a Hive table.
`CREATE TABLE parquetTable (a int) PARTITIONED BY (b int) STORED AS parquet `
`ALTER TABLE parquetTable ADD PARTITION (b=1) LOCATION '/partition'`
Querying such a table throws an error because MetastoreFileCatalog does not list the added partition directory; it only lists the default base location.
```
[info] - SPARK-15248: explicitly added partitions should be readable *** FAILED *** (1 second, 8 milliseconds)
[info] java.util.NoSuchElementException: key not found: file:/Users/tdas/Projects/Spark/spark2/target/tmp/spark-b39ad224-c5d1-4966-8981-fb45a2066d61/partition
[info] at scala.collection.MapLike$class.default(MapLike.scala:228)
[info] at scala.collection.AbstractMap.default(Map.scala:59)
[info] at scala.collection.MapLike$class.apply(MapLike.scala:141)
[info] at scala.collection.AbstractMap.apply(Map.scala:59)
[info] at org.apache.spark.sql.execution.datasources.PartitioningAwareFileCatalog$$anonfun$listFiles$1.apply(PartitioningAwareFileCatalog.scala:59)
[info] at org.apache.spark.sql.execution.datasources.PartitioningAwareFileCatalog$$anonfun$listFiles$1.apply(PartitioningAwareFileCatalog.scala:55)
[info] at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
[info] at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
[info] at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
[info] at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
[info] at scala.collection.TraversableLike$class.map(TraversableLike.scala:234)
[info] at scala.collection.AbstractTraversable.map(Traversable.scala:104)
[info] at org.apache.spark.sql.execution.datasources.PartitioningAwareFileCatalog.listFiles(PartitioningAwareFileCatalog.scala:55)
[info] at org.apache.spark.sql.execution.datasources.FileSourceStrategy$.apply(FileSourceStrategy.scala:93)
[info] at org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun$1.apply(QueryPlanner.scala:59)
[info] at org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun$1.apply(QueryPlanner.scala:59)
[info] at scala.collection.Iterator$$anon$12.nextCur(Iterator.scala:434)
[info] at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:440)
[info] at org.apache.spark.sql.catalyst.planning.QueryPlanner.plan(QueryPlanner.scala:60)
[info] at org.apache.spark.sql.catalyst.planning.QueryPlanner.planLater(QueryPlanner.scala:55)
[info] at org.apache.spark.sql.execution.SparkStrategies$SpecialLimits$.apply(SparkStrategies.scala:55)
[info] at org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun$1.apply(QueryPlanner.scala:59)
[info] at org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun$1.apply(QueryPlanner.scala:59)
[info] at scala.collection.Iterator$$anon$12.nextCur(Iterator.scala:434)
[info] at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:440)
[info] at org.apache.spark.sql.catalyst.planning.QueryPlanner.plan(QueryPlanner.scala:60)
[info] at org.apache.spark.sql.execution.QueryExecution.sparkPlan$lzycompute(QueryExecution.scala:77)
[info] at org.apache.spark.sql.execution.QueryExecution.sparkPlan(QueryExecution.scala:75)
[info] at org.apache.spark.sql.execution.QueryExecution.executedPlan$lzycompute(QueryExecution.scala:82)
[info] at org.apache.spark.sql.execution.QueryExecution.executedPlan(QueryExecution.scala:82)
[info] at org.apache.spark.sql.QueryTest.assertEmptyMissingInput(QueryTest.scala:330)
[info] at org.apache.spark.sql.QueryTest.checkAnswer(QueryTest.scala:146)
[info] at org.apache.spark.sql.QueryTest.checkAnswer(QueryTest.scala:159)
[info] at org.apache.spark.sql.hive.ParquetMetastoreSuite$$anonfun$12$$anonfun$apply$mcV$sp$7$$anonfun$apply$mcV$sp$25.apply(parquetSuites.scala:554)
[info] at org.apache.spark.sql.hive.ParquetMetastoreSuite$$anonfun$12$$anonfun$apply$mcV$sp$7$$anonfun$apply$mcV$sp$25.apply(parquetSuites.scala:535)
[info] at org.apache.spark.sql.test.SQLTestUtils$class.withTempDir(SQLTestUtils.scala:125)
[info] at org.apache.spark.sql.hive.ParquetPartitioningTest.withTempDir(parquetSuites.scala:726)
[info] at org.apache.spark.sql.hive.ParquetMetastoreSuite$$anonfun$12$$anonfun$apply$mcV$sp$7.apply$mcV$sp(parquetSuites.scala:535)
[info] at org.apache.spark.sql.test.SQLTestUtils$class.withTable(SQLTestUtils.scala:166)
[info] at org.apache.spark.sql.hive.ParquetPartitioningTest.withTable(parquetSuites.scala:726)
[info] at org.apache.spark.sql.hive.ParquetMetastoreSuite$$anonfun$12.apply$mcV$sp(parquetSuites.scala:534)
[info] at org.apache.spark.sql.hive.ParquetMetastoreSuite$$anonfun$12.apply(parquetSuites.scala:534)
[info] at org.apache.spark.sql.hive.ParquetMetastoreSuite$$anonfun$12.apply(parquetSuites.scala:534)
```
The solution in this PR is to get the paths to list from the partition spec and not rely on the default table path alone.
Unit tests.
Author: Tathagata Das <tathagata.das1565@gmail.com>
Closes#13022 from tdas/SPARK-15248.
## What changes were proposed in this pull request?
This fixes compile errors.
## How was this patch tested?
Pass the Jenkins tests.
Author: Dongjoon Hyun <dongjoon@apache.org>
Closes#13053 from dongjoon-hyun/hotfix_sqlquerysuite.
## What changes were proposed in this pull request?
#### Symptom
If a table is created as a Parquet or ORC table with Hive-syntax DDL, such as
```SQL
create table t1 (c1 int, c2 string) stored as parquet
```
The following command will fail
```SQL
create view v1 as select * from t1
```
#### Root Cause
Currently, `HiveMetaStoreCatalog` converts Parquet/ORC tables to `LogicalRelation` without giving any `tableIdentifier`. `SQLBuilder` expects the `LogicalRelation` to have an associated `tableIdentifier`. However, the `LogicalRelation` created earlier does not have such a `tableIdentifier`. Thus, `SQLBuilder.toSQL` cannot recognize this logical plan and issues an exception.
This PR is to assign a `TableIdentifier` to the `LogicalRelation` when resolving parquet or orc tables in `HiveMetaStoreCatalog`.
## How was this patch tested?
Test cases were created and dev/run-tests was run.
Author: xin Wu <xinwu@us.ibm.com>
Closes#12716 from xwu0226/SPARK_14933.
## What changes were proposed in this pull request?
This PR adds documentation about the behavioral differences between `insertInto` and `saveAsTable`, and throws an exception when the user tries to add too many columns using `saveAsTable` with append mode.
## How was this patch tested?
Unit tests added in this PR.
Author: Shixiong Zhu <shixiong@databricks.com>
Closes#13013 from zsxwing/SPARK-15231.
This replaces `sparkSession` with `spark` in CatalogSuite.scala.
Pass the Jenkins tests.
Author: Dongjoon Hyun <dongjoon@apache.org>
Closes#13030 from dongjoon-hyun/hotfix_sparkSession.
Since we cannot really trust the underlying external catalog to throw exceptions when there is an invalid metadata operation, let's do it in SessionCatalog.
- [X] The first step is to unify the error messages issued in Hive-specific Session Catalog and general Session Catalog.
- [X] The second step is to verify the inputs of metadata operations for partitioning-related operations. This is moved to a separate PR: https://github.com/apache/spark/pull/12801
- [X] The third step is to add database existence verification in `SessionCatalog`
- [X] The fourth step is to add table existence verification in `SessionCatalog`
- [X] The fifth step is to add function existence verification in `SessionCatalog`
Added test cases and verified the error messages we issue.
Author: gatorsmile <gatorsmile@gmail.com>
Author: xiaoli <lixiao1983@gmail.com>
Author: Xiao Li <xiaoli@Xiaos-MacBook-Pro.local>
Closes#12385 from gatorsmile/verifySessionAPIs.
## What changes were proposed in this pull request?
Use SparkSession instead of SQLContext in Scala/Java test suites.
As this PR is already very big, the Python test suites will be handled in a different PR.
## How was this patch tested?
Existing tests
Author: Sandeep Singh <sandeep@techaddict.me>
Closes#12907 from techaddict/SPARK-15037.
## What changes were proposed in this pull request?
This PR fixes SQL building for predicate subqueries and correlated scalar subqueries. It also enables most Hive subquery tests.
## How was this patch tested?
Enabled new tests in HiveComparisionSuite.
Author: Herman van Hovell <hvanhovell@questtec.nl>
Closes#12988 from hvanhovell/SPARK-14773.
#### What changes were proposed in this pull request?
This PR is to address a few existing issues in `EXPLAIN`:
- The `EXPLAIN` options `LOGICAL | FORMATTED | EXTENDED | CODEGEN` should not match zero or more times; they should match zero or one time. The parser does not allow users to use more than one option in a single command.
- The option `LOGICAL` is not supported. Issue an exception when users specify this option in the command.
- The output of `EXPLAIN` contains a weird empty line when the output of the analyzed plan is empty. We should remove it. For example:
```
== Parsed Logical Plan ==
CreateTable CatalogTable(`t`,CatalogTableType(MANAGED),CatalogStorageFormat(None,Some(org.apache.hadoop.mapred.TextInputFormat),Some(org.apache.hadoop.hive.ql.io. HiveIgnoreKeyTextOutputFormat),None,false,Map()),List(CatalogColumn(col,int,true,None)),List(),List(),List(),-1,,1462725171656,-1,Map(),None,None,None), false
== Analyzed Logical Plan ==
CreateTable CatalogTable(`t`,CatalogTableType(MANAGED),CatalogStorageFormat(None,Some(org.apache.hadoop.mapred.TextInputFormat),Some(org.apache.hadoop.hive.ql.io. HiveIgnoreKeyTextOutputFormat),None,false,Map()),List(CatalogColumn(col,int,true,None)),List(),List(),List(),-1,,1462725171656,-1,Map(),None,None,None), false
== Optimized Logical Plan ==
CreateTable CatalogTable(`t`,CatalogTableType(MANAGED),CatalogStorageFormat(None,Some(org.apache.hadoop.mapred.TextInputFormat),Some(org.apache.hadoop.hive.ql.io. HiveIgnoreKeyTextOutputFormat),None,false,Map()),List(CatalogColumn(col,int,true,None)),List(),List(),List(),-1,,1462725171656,-1,Map(),None,None,None), false
...
```
#### How was this patch tested?
Added and modified a few test cases
Author: gatorsmile <gatorsmile@gmail.com>
Closes#12991 from gatorsmile/explainCreateTable.
#### What changes were proposed in this pull request?
In Hive Metastore, dropping default database is not allowed. However, in `InMemoryCatalog`, this is allowed.
This PR is to disallow users to drop default database.
#### How was this patch tested?
Previously, we already have a test case in HiveDDLSuite. Now, we also add the same one in DDLSuite
Author: gatorsmile <gatorsmile@gmail.com>
Closes#12962 from gatorsmile/dropDefaultDB.
## What changes were proposed in this pull request?
The issue is that when the user provides the path option with an uppercase "PATH" key, `options` contains the `PATH` key and falls into the non-external case in the following code in `createDataSourceTables.scala`, where a new "path" key is created with a default path.
```
val optionsWithPath =
  if (!options.contains("path")) {
    isExternal = false
    options + ("path" -> sessionState.catalog.defaultTablePath(tableIdent))
  } else {
    options
  }
```
So before creating the Hive table, serdeInfo.parameters will contain both the "PATH" and "path" keys, pointing to different directories, and the Hive table's dataLocation contains the value of "path".
The fix in this PR is to convert `options` in the code above to a `CaseInsensitiveMap` before checking whether it contains the "path" key.
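A hedged sketch of the idea behind the fix, using a plain case-insensitive lookup rather than Spark's internal `CaseInsensitiveMap`; `options` and `defaultPath` are illustrative placeholders:
```scala
// Normalize option keys so that "PATH", "Path" and "path" are all recognized
// before deciding whether to fall back to the default table path.
def withPathOption(options: Map[String, String], defaultPath: String): Map[String, String] = {
  val hasPath = options.keys.exists(_.equalsIgnoreCase("path"))
  if (hasPath) options else options + ("path" -> defaultPath)
}
```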
## How was this patch tested?
A testcase is added
Author: xin Wu <xinwu@us.ibm.com>
Closes#12804 from xwu0226/SPARK-15025.
When we parse `CREATE TABLE USING`, we should build a `CreateTableUsing` plan with `managedIfNoPath` set to true. Then we add the default table path to the options when writing it to Hive.
New test in `SQLQuerySuite`.
Author: Wenchen Fan <wenchen@databricks.com>
Closes#12949 from cloud-fan/bug.
## What changes were proposed in this pull request?
This is a follow-up of PR #12844. It makes the newly updated `DescribeTableCommand` to support data sources tables.
## How was this patch tested?
A test case is added to check `DESC [EXTENDED | FORMATTED] <table>` output.
Author: Cheng Lian <lian@databricks.com>
Closes#12934 from liancheng/spark-14127-desc-table-follow-up.
## What changes were proposed in this pull request?
This detects a relation's partitioning and adds checks to the analyzer.
If an InsertIntoTable node has no partitioning, it is replaced by the
relation's partition scheme and input columns are correctly adjusted,
placing the partition columns at the end in partition order. If an
InsertIntoTable node has partitioning, it is checked against the table's
reported partitions.
These changes required adding a PartitionedRelation trait to the catalog
interface because Hive's MetastoreRelation doesn't extend
CatalogRelation.
This commit also includes a fix to InsertIntoTable's resolved logic,
which now detects that all expected columns are present, including
dynamic partition columns. Previously, the number of expected columns
was not checked and resolved was true if there were missing columns.
## How was this patch tested?
This adds new tests to the InsertIntoTableSuite that are fixed by this PR.
Author: Ryan Blue <blue@apache.org>
Closes#12239 from rdblue/SPARK-14459-detect-hive-partitioning.
## What changes were proposed in this pull request?
Lets says there are json files in the following directories structure
```
xyz/file0.json
xyz/subdir1/file1.json
xyz/subdir2/file2.json
xyz/subdir1/subsubdir1/file3.json
```
`sqlContext.read.json("xyz")` should read only file0.json, according to the behavior in Spark 1.6.1. However, in current master, all 4 files are read.
The fix is to make FileCatalog return only the direct children files of the given path when no partitioning is detected (instead of the full recursive list of files).
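A hedged sketch of listing only direct children with the Hadoop `FileSystem` API (not Spark's actual `FileCatalog` code):
```scala
import org.apache.hadoop.fs.{FileStatus, FileSystem, Path}

// When no partitioning is detected, list only the files that are direct
// children of the input path; do not recurse into subdirectories.
def directChildFiles(fs: FileSystem, dir: Path): Seq[FileStatus] =
  fs.listStatus(dir).filter(_.isFile).toSeq
```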
Closes#12774
## How was this patch tested?
unit tests
Author: Tathagata Das <tathagata.das1565@gmail.com>
Closes#12856 from tdas/SPARK-14997.
#### What changes were proposed in this pull request?
When describing a UDTF, the command returns a wrong result. The command is unable to find the function, which has been created and cataloged in the catalog but not in the functionRegistry.
This PR is to correct it. If the function is not in the functionRegistry, we will check the catalog for collecting the information of the UDTF function.
#### How was this patch tested?
Added test cases to verify the results
Author: gatorsmile <gatorsmile@gmail.com>
Closes#12885 from gatorsmile/showFunction.
## What changes were proposed in this pull request?
Enable the test that was disabled when HiveContext was removed.
## How was this patch tested?
Made sure the enabled test passes with the new jar.
Author: Dilip Biswal <dbiswal@us.ibm.com>
Closes#12924 from dilipbiswal/spark-14893.
## What changes were proposed in this pull request?
Removing the `withHiveSupport` method of `SparkSession`, instead use `enableHiveSupport`
## How was this patch tested?
ran tests locally
Author: Sandeep Singh <sandeep@techaddict.me>
Closes#12851 from techaddict/SPARK-15072.
#### What changes were proposed in this pull request?
First, a few test cases failed on Mac OS X because the property value of `java.io.tmpdir` does not include a trailing slash on some platforms. Hive always removes the trailing slash. For example, here is what I found on the web:
```
Win NT --> C:\TEMP\
Win XP --> C:\TEMP
Solaris --> /var/tmp/
Linux --> /var/tmp
```
Second, a couple of test cases are added to verify if the commands work properly.
#### How was this patch tested?
Added a test case for it and corrected the previous test cases.
Author: gatorsmile <gatorsmile@gmail.com>
Author: xiaoli <lixiao1983@gmail.com>
Author: Xiao Li <xiaoli@Xiaos-MacBook-Pro.local>
Closes#12081 from gatorsmile/mkdir.
## What changes were proposed in this pull request?
This PR support new SQL syntax CREATE TEMPORARY VIEW.
Like:
```
CREATE TEMPORARY VIEW viewName AS SELECT * from xx
CREATE OR REPLACE TEMPORARY VIEW viewName AS SELECT * from xx
CREATE TEMPORARY VIEW viewName (c1 COMMENT 'blabla', c2 COMMENT 'blabla') AS SELECT * FROM xx
```
## How was this patch tested?
Unit tests.
Author: Sean Zhong <clockfly@gmail.com>
Closes#12872 from clockfly/spark-6399.
## What changes were proposed in this pull request?
Just a bunch of small tweaks on DDL exception messages.
## How was this patch tested?
`DDLCommandSuite` et al.
Author: Andrew Or <andrew@databricks.com>
Closes#12853 from andrewor14/make-exceptions-consistent.
## What changes were proposed in this pull request?
This PR addresses a few minor issues in SQL parser:
- Removes some unused rules and keywords in the grammar.
- Removes code path for fallback SQL parsing (was needed for Hive native parsing).
- Use `UnresolvedGenerator` instead of hard-coding `Explode` & `JsonTuple`.
- Adds a more generic way of creating error messages for unsupported Hive features.
- Use `visitFunctionName` as much as possible.
- Interpret a `CatalogColumn`'s `DataType` directly instead of parsing it again.
## How was this patch tested?
Existing tests.
Author: Herman van Hovell <hvanhovell@questtec.nl>
Closes#12826 from hvanhovell/SPARK-15047.
This PR contains three changes:
1. We will use spark.sql.warehouse.dir to set the warehouse location. We will not use hive.metastore.warehouse.dir.
2. SessionCatalog needs to set the location of the default db. Otherwise, when creating a table in a SparkSession without Hive support, the default db's path will be an empty string.
3. When we create a database, we need to make the path qualified.
Existing tests and new tests.
Author: Yin Huai <yhuai@databricks.com>
Closes#12812 from yhuai/warehouse.
## What changes were proposed in this pull request?
CatalystSqlParser can parse data types. So, we do not need to have an individual DataTypeParser.
## How was this patch tested?
Existing tests
Author: Yin Huai <yhuai@databricks.com>
Closes#12796 from yhuai/removeDataTypeParser.
## What changes were proposed in this pull request?
https://issues.apache.org/jira/browse/SPARK-14917
As described in the JIRA, it seems Hive 1.2.1, which Spark uses now, supports snappy and none.
So, this PR enables some tests for writing ORC files with the compression codecs `SNAPPY` and `NONE`.
## How was this patch tested?
Unittests in `OrcQuerySuite` and `sbt scalastyle`.
Author: hyukjinkwon <gurwls223@gmail.com>
Closes#12699 from HyukjinKwon/SPARK-14917.
## What changes were proposed in this pull request?
This PR makes two changes:
1. We will propagate Spark Confs to HiveConf created in HiveClientImpl. So, users can also use spark conf to set warehouse location and metastore url.
2. In sql/hive, HiveClientImpl will be the only place where we create a new HiveConf.
## How was this patch tested?
Existing tests.
Author: Yin Huai <yhuai@databricks.com>
Closes#12791 from yhuai/onlyUseHiveConfInHiveClientImpl.
## What changes were proposed in this pull request?
Currently Spark SQL doesn't support sorting columns in descending order. However, the parser accepts the syntax and silently drops sorting directions. This PR fixes this by throwing an exception if `DESC` is specified as sorting direction of a sorting column.
## How was this patch tested?
A test case is added to test the invalid sorting order by checking exception message.
Author: Cheng Lian <lian@databricks.com>
Closes#12759 from liancheng/spark-14981.
This test always fails with sbt's Hadoop 2.3 and 2.4 tests. Let's disable it for now and investigate the problem.
Author: Yin Huai <yhuai@databricks.com>
Closes#12783 from yhuai/SPARK-15011-ignore.
## What changes were proposed in this pull request?
This patch removes executionHive from HiveSessionState and HiveSharedState.
## How was this patch tested?
Updated test cases.
Author: Reynold Xin <rxin@databricks.com>
Author: Yin Huai <yhuai@databricks.com>
Closes#12770 from rxin/SPARK-14994.
## What changes were proposed in this pull request?
This patch removes HiveNativeCommand, so we can continue to remove the dependency on Hive. This pull request also removes the ability to generate golden result file using Hive.
## How was this patch tested?
Updated tests to reflect this.
Author: Reynold Xin <rxin@databricks.com>
Closes#12769 from rxin/SPARK-14991.
Currently, we can only create persisted partitioned and/or bucketed data source tables using the Dataset API but not using SQL DDL. This PR implements the following syntax to add partitioning and bucketing support to the SQL DDL:
```
CREATE TABLE <table-name>
USING <provider> [OPTIONS (<key1> <value1>, <key2> <value2>, ...)]
[PARTITIONED BY (col1, col2, ...)]
[CLUSTERED BY (col1, col2, ...) [SORTED BY (col1, col2, ...)] INTO <n> BUCKETS]
AS SELECT ...
```
Test cases are added in `MetastoreDataSourcesSuite` to check the newly added syntax.
Author: Cheng Lian <lian@databricks.com>
Author: Yin Huai <yhuai@databricks.com>
Closes#12734 from liancheng/spark-14954.
## What changes were proposed in this pull request?
This PR will make Spark SQL not allow ALTER TABLE ADD/REPLACE/CHANGE COLUMN, ALTER TABLE SET FILEFORMAT, DFS, and transaction related commands.
## How was this patch tested?
Existing tests. For those tests that I put in the blacklist, I am adding the useful parts back to SQLQuerySuite.
Author: Yin Huai <yhuai@databricks.com>
Closes#12714 from yhuai/banNativeCommand.
## What changes were proposed in this pull request?
This patch removes HiveConf from HiveTableScanExec and HiveTableReader and instead just uses our own configuration system. I'm splitting the large change of removing HiveConf into multiple independent pull requests because it is very difficult to debug test failures when they are all combined in one giant one.
## How was this patch tested?
Should be covered by existing tests.
Author: Reynold Xin <rxin@databricks.com>
Closes#12727 from rxin/SPARK-14944.
## What changes were proposed in this pull request?
We currently expose both the Hadoop configuration and the Spark SQL configuration in RuntimeConfig. I think we can remove the Hadoop configuration part and simply generate a Hadoop Configuration on the fly by passing all the SQL configurations into it. This way, there is a single interface (in Java/Scala/Python/SQL) for end-users.
As part of this patch, I also removed some config options deprecated in Spark 1.x.
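A minimal sketch of generating a Hadoop `Configuration` on the fly from SQL conf entries, as described above (the helper name is illustrative):
```scala
import org.apache.hadoop.conf.Configuration

// Copy all SQL configuration entries into a freshly created Hadoop Configuration,
// so Hadoop-facing code sees the same settings without a second config interface.
def hadoopConfFrom(sqlSettings: Map[String, String]): Configuration = {
  val conf = new Configuration()
  sqlSettings.foreach { case (k, v) => conf.set(k, v) }
  conf
}
```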
## How was this patch tested?
Updated relevant tests.
Author: Reynold Xin <rxin@databricks.com>
Closes#12689 from rxin/SPARK-14913.
## What changes were proposed in this pull request?
#12625 exposed a new user-facing conf interface in `SparkSession`. This patch adds a catalog interface.
## How was this patch tested?
See `CatalogSuite`.
Author: Andrew Or <andrew@databricks.com>
Closes#12713 from andrewor14/user-facing-catalog.
## What changes were proposed in this pull request?
This PR adds Native execution of SHOW COLUMNS and SHOW PARTITION commands.
Command Syntax:
``` SQL
SHOW COLUMNS (FROM | IN) table_identifier [(FROM | IN) database]
```
``` SQL
SHOW PARTITIONS [db_name.]table_name [PARTITION(partition_spec)]
```
## How was this patch tested?
Added test cases in HiveCommandSuite to verify execution and DDLCommandSuite
to verify plans.
Author: Dilip Biswal <dbiswal@us.ibm.com>
Closes#12222 from dilipbiswal/dkb_show_columns.
## What changes were proposed in this pull request?
While the vectorized hash map in `TungstenAggregate` currently supports all primitive data types during partial aggregation, this patch only enables the hash map for a subset of cases that have been verified to show performance improvements on our benchmarks, subject to an internal conf that sets an upper limit on the maximum length of the aggregate key/value schema. This list of supported use-cases should be expanded over time.
## How was this patch tested?
There is no new change in functionality, so existing tests should suffice. Performance tests were done on TPCDS benchmarks.
Author: Sameer Agarwal <sameer@databricks.com>
Closes#12710 from sameeragarwal/vectorized-enable.
## What changes were proposed in this pull request?
This PR updates SortMergeJoinExec to support LeftSemi/LeftAnti, so it can support all the join types, like the other three join implementations: BroadcastHashJoinExec, ShuffledHashJoinExec, and BroadcastNestedLoopJoinExec.
This PR also simplifies the join selection in SparkStrategy.
## How was this patch tested?
Added new tests.
Author: Davies Liu <davies@databricks.com>
Closes#12668 from davies/smj_semi.
## What changes were proposed in this pull request?
We currently have no way for users to propagate options to underlying libraries that rely on Hadoop configurations to work. For example, there are various options in parquet-mr that users might want to set, but the data source API does not expose a per-job way to set them. This patch also propagates the user-specified options into the Hadoop Configuration.
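A hedged example of what this enables, assuming a `SparkSession` named `spark`; `parquet.block.size` is shown only as an example of a parquet-mr setting that can now flow through:
```scala
// Options passed to the data source are now also copied into the per-job
// Hadoop Configuration, so Hadoop-level / parquet-mr knobs can be set per write.
val df = spark.range(10).toDF("id")
df.write
  .option("parquet.block.size", "134217728") // a parquet-mr setting, illustrative only
  .parquet("/tmp/example-output")
```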
## How was this patch tested?
Used a mock data source implementation to test both the read path and the write path.
Author: Reynold Xin <rxin@databricks.com>
Closes#12688 from rxin/SPARK-14912.
## What changes were proposed in this pull request?
In Spark 2.0, `SparkSession` is the new thing. Internally we should stop using `SQLContext` everywhere since that's supposed to be not the main user-facing API anymore.
In this patch I took care to not break any public APIs. The one place that's suspect is `o.a.s.ml.source.libsvm.DefaultSource`, but according to mengxr it's not supposed to be public so it's OK to change the underlying `FileFormat` trait.
**Reviewers**: This is a big patch that may be difficult to review but the changes are actually really straightforward. If you prefer I can break it up into a few smaller patches, but it will delay the progress of this issue a little.
## How was this patch tested?
No change in functionality intended.
Author: Andrew Or <andrew@databricks.com>
Closes#12625 from andrewor14/spark-session-refactor.
## What changes were proposed in this pull request?
`RuntimeConfig` is the new user-facing API in 2.0 added in #11378. Until now, however, it's been dead code. This patch uses `RuntimeConfig` in `SessionState` and exposes that through the `SparkSession`.
## How was this patch tested?
New test in `SQLContextSuite`.
Author: Andrew Or <andrew@databricks.com>
Closes#12669 from andrewor14/use-runtime-conf.
## What changes were proposed in this pull request?
This removes the class `HiveContext` itself along with all code usages associated with it. The bulk of the work was already done in #12485. This is mainly just code cleanup and actually removing the class.
Note: A couple of things will break after this patch. These will be fixed separately.
- the python HiveContext
- all the documentation / comments referencing HiveContext
- there will be no more HiveContext in the REPL (fixed by #12589)
## How was this patch tested?
No change in functionality.
Author: Andrew Or <andrew@databricks.com>
Closes#12585 from andrewor14/delete-hive-context.
## What changes were proposed in this pull request?
This method was accidentally made `private[sql]` in Spark 2.0. This PR makes it public again, since 3rd party data sources like spark-avro depend on it.
## How was this patch tested?
N/A
Author: Cheng Lian <lian@databricks.com>
Closes#12652 from liancheng/spark-14875.
#### What changes were proposed in this pull request?
So far, we have been capturing each unsupported Alter Table statement in a separate visit function. They should be unified and issue the same ParseException instead.
This PR refactors the existing implementation and makes the error messages consistent for Alter Table DDL.
#### How was this patch tested?
Updated the existing test cases and also added new test cases to ensure all the unsupported statements are covered.
Author: gatorsmile <gatorsmile@gmail.com>
Author: xiaoli <lixiao1983@gmail.com>
Author: Xiao Li <xiaoli@Xiaos-MacBook-Pro.local>
Closes#12459 from gatorsmile/cleanAlterTable.
## What changes were proposed in this pull request?
CreateMetastoreDataSource and CreateMetastoreDataSourceAsSelect are not Hive-specific. So, this PR moves them from sql/hive to sql/core. Also, I am adding `Command` suffix to these two classes.
## How was this patch tested?
Existing tests.
Author: Yin Huai <yhuai@databricks.com>
Closes#12645 from yhuai/moveCreateDataSource.
## What changes were proposed in this pull request?
It is unnecessary as DataType.catalogString largely replaces the need for this class.
## How was this patch tested?
Mostly removing dead code and should be covered by existing tests.
Author: Reynold Xin <rxin@databricks.com>
Closes#12644 from rxin/SPARK-14877.
## What changes were proposed in this pull request?
This patch improves error handling in view creation. CreateViewCommand itself will analyze the view SQL query first, and if it cannot successfully analyze it, throw an AnalysisException.
In addition, I also added the following two conservative guards for easier identification of Spark bugs:
1. If there is a bug and the generated view SQL cannot be analyzed, throw an exception at runtime. Note that this is not an AnalysisException because it is not caused by the user and more likely indicate a bug in Spark.
2. When SQLBuilder gets an unresolved plan, it will also show the plan in the error message.
I also took the chance to simplify the internal implementation of CreateViewCommand, and *removed* a fallback path that would've masked an exception from before.
## How was this patch tested?
1. Added a unit test for the user facing error handling.
2. Manually introduced some bugs in Spark to test the internal defensive error handling.
3. Also added a test case to test nested views (not super relevant).
Author: Reynold Xin <rxin@databricks.com>
Closes#12633 from rxin/SPARK-14865.
## What changes were proposed in this pull request?
## How was this patch tested?
Author: Reynold Xin <rxin@databricks.com>
Closes#12565 from rxin/test-flaky.
## What changes were proposed in this pull request?
This patch breaks SQLQuerySuite out into smaller test suites. It was a little bit too large for debugging.
## How was this patch tested?
This is a test only change.
Author: Reynold Xin <rxin@databricks.com>
Closes#12630 from rxin/SPARK-14866.
## What changes were proposed in this pull request?
This patch re-implements view creation command in sql/core, based on the pre-existing view creation command in the Hive module. This consolidates the view creation logical command and physical command into a single one, called CreateViewCommand.
## How was this patch tested?
All the code should've been tested by existing tests.
Author: Reynold Xin <rxin@databricks.com>
Closes#12615 from rxin/SPARK-14842-2.
## What changes were proposed in this pull request?
This patch adds "Exec" suffix to all physical operators. Before this patch, Spark's physical operators and logical operators are named the same (e.g. Project could be logical.Project or execution.Project), which caused small issues in code review and bigger issues in code refactoring.
## How was this patch tested?
N/A
Author: Reynold Xin <rxin@databricks.com>
Closes#12617 from rxin/exec-node.
## What changes were proposed in this pull request?
This patch moves SQLBuilder into sql/core so we can in the future move view generation also into sql/core.
## How was this patch tested?
Also moved unit tests.
Author: Reynold Xin <rxin@databricks.com>
Author: Wenchen Fan <wenchen@databricks.com>
Closes#12602 from rxin/SPARK-14841.
## What changes were proposed in this pull request?
Add the native support for LOAD DATA DDL command that loads data into Hive table/partition.
## How was this patch tested?
`HiveDDLCommandSuite` and `HiveQuerySuite`. Besides, few Hive tests (`WindowQuerySuite`, `HiveTableScanSuite` and `HiveSerDeSuite`) also use `LOAD DATA` command.
Author: Liang-Chi Hsieh <simonh@tw.ibm.com>
Closes#12412 from viirya/ddl-load-data.
## What changes were proposed in this pull request?
This patch removes HiveQueryExecution. As part of this, I consolidated all the describe commands into DescribeTableCommand.
## How was this patch tested?
Should be covered by existing tests.
Author: Reynold Xin <rxin@databricks.com>
Closes#12588 from rxin/SPARK-14826.
(This PR is a rebased version of PR #12153.)
## What changes were proposed in this pull request?
This PR adds preliminary locality support for `FileFormat` data sources by overriding `FileScanRDD.preferredLocations()`. The strategy can be divided into two parts:
1. Block location lookup
Unlike `HadoopRDD` or `NewHadoopRDD`, `FileScanRDD` doesn't have access to the underlying `InputFormat` or `InputSplit`, and thus can't rely on `InputSplit.getLocations()` to gather locality information. Instead, this PR queries block locations using `FileSystem.getFileBlockLocations()` after listing all `FileStatus`es in `HDFSFileCatalog` and converts all `FileStatus`es into `LocatedFileStatus`es. A hedged sketch of this lookup is shown after this list.
Note that although S3/S3A/S3N file systems don't provide valid locality information, their `getLocatedStatus()` implementations don't actually issue remote calls either. So there's no need to special case these file systems.
2. Selecting preferred locations
For each `FilePartition`, we pick the top 3 locations containing the most data to be retrieved. This isn't necessarily the best algorithm out there. Further improvements may be brought up in follow-up PRs.
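A hedged sketch of the block-location lookup referenced above, using the Hadoop `FileSystem` API (not Spark's actual `FileScanRDD` code):
```scala
import org.apache.hadoop.fs.{FileStatus, FileSystem}

// Gather block locations for one file and keep the three hosts that hold the
// most bytes as the partition's preferred locations.
def preferredHosts(fs: FileSystem, file: FileStatus): Seq[String] = {
  val blocks = fs.getFileBlockLocations(file, 0, file.getLen)
  blocks
    .flatMap(b => b.getHosts.map(host => host -> b.getLength))
    .groupBy(_._1)                // host -> all (host, bytes) pairs for that host
    .mapValues(_.map(_._2).sum)   // host -> total bytes on that host
    .toSeq
    .sortBy(-_._2)
    .take(3)
    .map(_._1)
}
```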
## How was this patch tested?
Tested by overriding default `FileSystem` implementation for `file:///` with a mocked one, which returns mocked block locations.
Author: Cheng Lian <lian@databricks.com>
Closes#12527 from liancheng/spark-14369-locality-rebased.
## What changes were proposed in this pull request?
Just a rename so we can get rid of `HiveContext.scala`. Note that this will conflict with #12585.
## How was this patch tested?
No change in functionality.
Author: Andrew Or <andrew@databricks.com>
Closes#12586 from andrewor14/rename-hc-object.
## What changes were proposed in this pull request?
This patch moves analyze table parsing into SparkSqlAstBuilder and removes HiveSqlAstBuilder.
In order to avoid extensive refactoring, I created a common trait for CatalogRelation and MetastoreRelation, and match on that. In the future we should probably just consolidate the two into a single thing so we don't need this common trait.
## How was this patch tested?
Updated unit tests.
Author: Reynold Xin <rxin@databricks.com>
Closes#12584 from rxin/SPARK-14821.
## What changes were proposed in this pull request?
This patch moves native command and script transformation into SparkSqlAstBuilder. This builds on #12561. See the last commit for diff.
## How was this patch tested?
Updated test cases to reflect this.
Author: Reynold Xin <rxin@databricks.com>
Closes#12564 from rxin/SPARK-14798.
## What changes were proposed in this pull request?
This patch builds on #12556 and completely removes the use of Hive's variable substitution.
## How was this patch tested?
Covered by existing tests.
Author: Reynold Xin <rxin@databricks.com>
Closes#12561 from rxin/SPARK-14795.
## What changes were proposed in this pull request?
We shouldn't pass the analyze command to Hive because some of those commands would require running MapReduce jobs. For now, let's just always run the no-scan analyze.
## How was this patch tested?
Updated test case to reflect this change.
Author: Reynold Xin <rxin@databricks.com>
Closes#12558 from rxin/parser-analyze.
## What changes were proposed in this pull request?
This patch moves as many parsing rules as possible into SQL parser. There are only three more left after this patch: (1) run native command, (2) analyze, and (3) script IO. These 3 will be dealt with in a follow-up PR.
## How was this patch tested?
No test change. This simply moves code around.
Author: Reynold Xin <rxin@databricks.com>
Closes#12556 from rxin/SPARK-14792.
## What changes were proposed in this pull request?
The path rewrite in TestHiveSparkSession is pretty hacky. I think we can remove that complexity and just do a string replacement when we read the query files in. This would remove the overloading of runNativeSql in TestHive, which will simplify the removal of Hive-specific variable substitution.
## How was this patch tested?
This is a small test refactoring to simplify test infrastructure.
Author: Reynold Xin <rxin@databricks.com>
Closes#12543 from rxin/SPARK-14775.
## What changes were proposed in this pull request?
We currently have five folders in queries: clientcompare, clientnegative, clientpositive, negative, and positive. Only clientpositive is used. We can remove the rest.
## How was this patch tested?
N/A - removing unused test resources.
Author: Reynold Xin <rxin@databricks.com>
Closes#12540 from rxin/SPARK-14770.
## What changes were proposed in this pull request?
This PR has two main changes.
1. Move Hive-specific methods from HiveContext to HiveSessionState, which helps the work of removing HiveContext.
2. Create a SparkSession class, which will later be the entry point for Spark SQL users.
## How was this patch tested?
Existing tests
This PR is trying to fix test failures of https://github.com/apache/spark/pull/12485.
Author: Andrew Or <andrew@databricks.com>
Author: Yin Huai <yhuai@databricks.com>
Closes#12522 from yhuai/spark-session.
## What changes were proposed in this pull request?
Since [SPARK-12719: SQL Generation supports for generators](https://issues.apache.org/jira/browse/SPARK-12719) was resolved, this PR enables the related testcases: `explode()` and `json_tuple()`.
## How was this patch tested?
Pass the Jenkins tests (with re-enabled test cases).
Author: Dongjoon Hyun <dongjoon@apache.org>
Closes#12329 from dongjoon-hyun/minor_enable_testcases.
## What changes were proposed in this pull request?
Enable ScalaReflection and User Defined Types for plain Scala classes.
This involves the move of `schemaFor` from the `ScalaReflection` trait (which covers runtime and compile-time (macro) reflection) to the `ScalaReflection` object (runtime reflection only), as I believe this code wouldn't work at compile time anyway, since it manipulates `Class`es that are not compiled yet.
## How was this patch tested?
Unit test
Author: Joan <joan@goyeau.com>
Closes#12149 from joan38/SPARK-13929-Scala-reflection.
## What changes were proposed in this pull request?
This PR moves `HadoopFsRelation` related data source API into `execution/datasources` package.
Note that to avoid conflicts, this PR is based on #12153. Effective changes for this PR only consist of the last three commits. Will rebase after merging #12153.
## How was this patch tested?
Existing tests.
Author: Yin Huai <yhuai@databricks.com>
Author: Cheng Lian <lian@databricks.com>
Closes#12361 from liancheng/spark-14407-hide-hadoop-fs-relation.
## What changes were proposed in this pull request?
These test suites were removed while refactoring `HadoopFsRelation` related API. This PR brings them back.
This PR also fixes two regressions:
- SPARK-14458, which causes a runtime error when saving partitioned tables using `FileFormat` data sources that are not able to infer their own schemata. This bug wasn't detected by any built-in data sources because all of them happen to have the schema inference feature.
- SPARK-14566, which happens to be covered by SPARK-14458 and causes wrong query results or a runtime error when
  - appending a Dataset `ds` to a persisted partitioned data source relation `t`, and
  - partition columns in `ds` don't all appear after data columns (see the sketch after this list)
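Here is a minimal sketch of the second scenario, assuming a spark-shell session with `sqlContext` in scope; the table, column, and partition names are made up:
```scala
import sqlContext.implicits._

// A persisted partitioned data source table `t`, partitioned by `part`.
Seq((1, "a", "p1")).toDF("id", "data", "part")
  .write.partitionBy("part").saveAsTable("t")

// An appended Dataset whose partition column does not come after the data columns;
// before this fix, such an append could return wrong results or fail at runtime.
Seq(("p2", 2, "b")).toDF("part", "id", "data")
  .write.mode("append").saveAsTable("t")
```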
## How was this patch tested?
`CommitFailureTestRelationSuite` uses a testing relation that always fails when committing write tasks to test write job cleanup.
`SimpleTextHadoopFsRelationSuite` uses a testing relation to test general `HadoopFsRelation` and `FileFormat` interfaces.
The two regressions are both covered by existing test cases.
Author: Cheng Lian <lian@databricks.com>
Closes#12179 from liancheng/spark-13681-commit-failure-test.
## What changes were proposed in this pull request?
This is just cleanup. This allows us to remove HiveContext later without inflating the diff too much. This PR fixes the conflicts of https://github.com/apache/spark/pull/12431. It also removes the `def hiveConf` from `HiveSqlParser`. So, we will pass the HiveConf associated with a session explicitly instead of relying on Hive's `SessionState` to pass `HiveConf`.
## How was this patch tested?
Existing tests.
Closes#12431
Author: Andrew Or <andrew@databricks.com>
Author: Yin Huai <yhuai@databricks.com>
Closes#12449 from yhuai/hiveconf.
## What changes were proposed in this pull request?
This patch adds a SharedState that groups state shared across multiple SQLContexts. This is analogous to the SessionState added in SPARK-13526 that groups session-specific state. This cleanup makes the constructors of the contexts simpler and ultimately allows us to remove HiveContext in the near future.
## How was this patch tested?
Existing tests.
Author: Yin Huai <yhuai@databricks.com>
Closes#12463 from yhuai/sharedState.
## What changes were proposed in this pull request?
This patch adds a SharedState that groups state shared across multiple SQLContexts. This is analogous to the SessionState added in SPARK-13526 that groups session-specific state. This cleanup makes the constructors of the contexts simpler and ultimately allows us to remove HiveContext in the near future.
## How was this patch tested?
Existing tests.
Closes#12405
Author: Andrew Or <andrew@databricks.com>
Author: Yin Huai <yhuai@databricks.com>
Closes#12447 from yhuai/sharedState.
## What changes were proposed in this pull request?
This patch speeds up group-by aggregates by around 3-5x by leveraging an in-memory `AggregateHashMap` (please see https://github.com/apache/spark/pull/12161), an append-only aggregate hash map that can act as a 'cache' for extremely fast key-value lookups while evaluating aggregates (and fall back to the `BytesToBytesMap` if a given key isn't found).
Architecturally, it is backed by a power-of-2-sized array for index lookups and a columnar batch that stores the key-value pairs. The index lookups in the array rely on linear probing (with a small maximum number of tries) and use an inexpensive hash function, which makes it really efficient for the majority of lookups. However, using linear probing and an inexpensive hash function also makes it less robust than the `BytesToBytesMap` (especially for a large number of keys or certain distributions of keys) and requires us to fall back on the latter for correctness.
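The following is a self-contained toy sketch of that lookup scheme, not Spark's `AggregateHashMap` itself: a power-of-2 array probed linearly for a bounded number of steps, with a general-purpose map standing in for the `BytesToBytesMap` fallback.
```scala
// Toy model of the fast-path hash map described above (all names are made up).
object LinearProbingSketch {
  val capacity = 1 << 4                        // power of 2, so `hash & (capacity - 1)` is a valid index
  val maxTries = 3                             // bounded linear probing before giving up
  val keys     = new Array[java.lang.Long](capacity)
  val sums     = new Array[Long](capacity)
  val fallback = scala.collection.mutable.Map[Long, Long]()  // stands in for BytesToBytesMap

  def add(key: Long, delta: Long): Unit = {
    var idx   = key.hashCode & (capacity - 1)  // cheap hash + mask
    var tries = 0
    while (tries < maxTries) {
      if (keys(idx) == null) { keys(idx) = key; sums(idx) = delta; return }
      if (keys(idx) == key)  { sums(idx) += delta; return }
      idx = (idx + 1) & (capacity - 1)         // linear probing
      tries += 1
    }
    fallback(key) = fallback.getOrElse(key, 0L) + delta      // fall back for correctness
  }
}
```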
## How was this patch tested?
```
Java HotSpot(TM) 64-Bit Server VM 1.8.0_73-b02 on Mac OS X 10.11.4
Intel(R) Core(TM) i7-4960HQ CPU 2.60GHz
Aggregate w keys:           Best/Avg Time(ms)    Rate(M/s)   Per Row(ns)   Relative
-------------------------------------------------------------------------------------------
codegen = F                       2124 / 2204          9.9         101.3       1.0X
codegen = T hashmap = F           1198 / 1364         17.5          57.1       1.8X
codegen = T hashmap = T            369 /  600         56.8          17.6       5.8X
```
Author: Sameer Agarwal <sameer@databricks.com>
Closes#12345 from sameeragarwal/tungsten-aggregate-integration.
## What changes were proposed in this pull request?
JIRA: https://issues.apache.org/jira/browse/SPARK-14592
This patch adds native support for DDL command `CREATE TABLE LIKE`.
The SQL syntax is like:
```SQL
CREATE TABLE table_name LIKE existing_table
CREATE TABLE IF NOT EXISTS table_name LIKE existing_table
```
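A hedged usage sketch from Scala (the table names are made up; the source table must already exist):
```scala
sqlContext.sql("CREATE TABLE src_copy LIKE src")
sqlContext.sql("CREATE TABLE IF NOT EXISTS src_copy LIKE src")
```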
## How was this patch tested?
`HiveDDLCommandSuite`. `HiveQuerySuite` already tests `CREATE TABLE LIKE`.
Author: Liang-Chi Hsieh <simonh@tw.ibm.com>
This patch had conflicts when merged, resolved by
Committer: Andrew Or <andrew@databricks.com>
Closes#12362 from viirya/create-table-like.
#### What changes were proposed in this pull request?
This PR is to add a test to ensure drop partitions of an external table will not delete data.
cc yhuai andrewor14
#### How was this patch tested?
N/A
Author: gatorsmile <gatorsmile@gmail.com>
This patch had conflicts when merged, resolved by
Committer: Andrew Or <andrew@databricks.com>
Closes#12350 from gatorsmile/testDropPartition.
#### What changes were proposed in this pull request?
This PR is to provide native DDL support for the following three ALTER VIEW commands:
Based on the Hive DDL document:
https://cwiki.apache.org/confluence/display/Hive/LanguageManual+DDL
##### 1. ALTER VIEW RENAME
**Syntax:**
```SQL
ALTER VIEW view_name RENAME TO new_view_name
```
- to change the name of a view to a different name
- not allowed to rename a view's name by ALTER TABLE
##### 2. ALTER VIEW SET TBLPROPERTIES
**Syntax:**
```SQL
ALTER VIEW view_name SET TBLPROPERTIES ('comment' = new_comment);
```
- to add metadata to a view
- not allowed to set views' properties by ALTER TABLE
- ignore it if trying to set a view's existing property key when the value is the same
- overwrite the value if trying to set a view's existing key to a different value
##### 3. ALTER VIEW UNSET TBLPROPERTIES
**Syntax:**
```SQL
ALTER VIEW view_name UNSET TBLPROPERTIES [IF EXISTS] ('comment', 'key')
```
- to remove metadata from a view
- not allowed to unset views' properties by ALTER TABLE
- issue an exception if trying to unset a view's non-existent key
#### How was this patch tested?
Added test cases to verify if it works properly.
Author: gatorsmile <gatorsmile@gmail.com>
Author: xiaoli <lixiao1983@gmail.com>
Author: Xiao Li <xiaoli@Xiaos-MacBook-Pro.local>
Closes#12324 from gatorsmile/alterView.
#### What changes were proposed in this pull request?
**HQL Syntax**: [Create View](https://cwiki.apache.org/confluence/display/Hive/LanguageManual+DDL#LanguageManualDDL-Create/Drop/AlterView)
```SQL
CREATE VIEW [IF NOT EXISTS] [db_name.]view_name [(column_name [COMMENT column_comment], ...) ]
[COMMENT view_comment]
[TBLPROPERTIES (property_name = property_value, ...)]
AS SELECT ...;
```
Add a support for the `[COMMENT view_comment]` clause
#### How was this patch tested?
Modified the existing test cases to verify the correctness.
Author: gatorsmile <gatorsmile@gmail.com>
Author: xiaoli <lixiao1983@gmail.com>
Author: Xiao Li <xiaoli@Xiaos-MacBook-Pro.local>
Closes#12288 from gatorsmile/addCommentInCreateView.
## What changes were proposed in this pull request?
This patch implements the `CREATE TABLE` command using the `SessionCatalog`. Previously we handled only `CTAS` and `CREATE TABLE ... USING`. This requires us to refactor `CatalogTable` to accept various fields (e.g. bucket and skew columns) and pass them to Hive.
WIP: Note that I haven't verified whether this actually works yet! But I believe it does.
## How was this patch tested?
Tests will come in a future commit.
Author: Andrew Or <andrew@databricks.com>
Author: Yin Huai <yhuai@databricks.com>
Closes#12271 from andrewor14/create-table-ddl.
## What changes were proposed in this pull request?
Previously we used `AnalysisException`, `ParseException`, `NoSuchFunctionException`, etc. when a parsing error was encountered. I am trying to make this consistent, with **minimum** code impact on the current implementation, by changing the class hierarchy.
1. `NoSuchItemException` is removed, since it is an abstract class and it just simply takes a message string.
2. `NoSuchDatabaseException`, `NoSuchTableException`, `NoSuchPartitionException` and `NoSuchFunctionException` now extend `AnalysisException`; together with `ParseException`, they are all under the `AnalysisException` umbrella, but you can still handle them in a granular way (see the sketch after this list).
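A small, hedged sketch of what the unified hierarchy buys callers; whether the specific subclass or the generic `AnalysisException` is raised depends on the code path:
```scala
import org.apache.spark.sql.AnalysisException
import org.apache.spark.sql.catalyst.analysis.NoSuchTableException

// Either branch works, because the specific exceptions now extend AnalysisException.
try {
  sqlContext.table("table_that_does_not_exist")
} catch {
  case e: NoSuchTableException => println(s"specific: ${e.getMessage}")
  case e: AnalysisException    => println(s"generic: ${e.getMessage}")
}
```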
## How was this patch tested?
The existing test cases should cover this patch.
Author: bomeng <bmeng@us.ibm.com>
Closes#12314 from bomeng/SPARK-14414.
## What changes were proposed in this pull request?
When planning the logical plan node `CreateTableUsingAsSelect`, we neglected its `temporary` field and always generated a `CreateMetastoreDataSourceAsSelect`. This PR fixes this issue by generating a `CreateTempTableUsingAsSelect` when `temporary` is true.
This PR also fixes SPARK-14493, since the root cause of SPARK-14493 is that `CreateMetastoreDataSourceAsSelect` uses the default Hive warehouse location when the `PATH` data source option is absent.
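A hedged sketch of the target syntax (the table name and path are made up):
```scala
// Temporary CTAS with an explicit data source; with this fix the result is a
// temporary table rather than a metastore table in the default warehouse location.
sqlContext.sql(
  """CREATE TEMPORARY TABLE tmp_ctas
    |USING parquet
    |OPTIONS (path '/tmp/tmp_ctas_data')
    |AS SELECT 1 AS id
  """.stripMargin)
```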
## How was this patch tested?
Added a test case to create a temporary table using the target syntax and check whether it's indeed a temporary table.
Author: Cheng Lian <lian@databricks.com>
Closes#12303 from liancheng/spark-14488-fix-ctas-using.
## What changes were proposed in this pull request?
This implements a few alter table partition commands using the `SessionCatalog`. In particular:
```
ALTER TABLE ... ADD PARTITION ...
ALTER TABLE ... DROP PARTITION ...
ALTER TABLE ... RENAME PARTITION ... TO ...
```
The following operations are not supported, and an `AnalysisException` with a helpful error message will be thrown if the user tries to use them:
```
ALTER TABLE ... EXCHANGE PARTITION ...
ALTER TABLE ... ARCHIVE PARTITION ...
ALTER TABLE ... UNARCHIVE PARTITION ...
ALTER TABLE ... TOUCH ...
ALTER TABLE ... COMPACT ...
ALTER TABLE ... CONCATENATE
MSCK REPAIR TABLE ...
```
## How was this patch tested?
`DDLSuite`, `DDLCommandSuite` and `HiveDDLCommandSuite`
Author: Andrew Or <andrew@databricks.com>
Closes#12220 from andrewor14/alter-partition-ddl.
#### What changes were proposed in this pull request?
This PR is to address the comment: https://github.com/apache/spark/pull/12146#discussion-diff-59092238. It removes the function `isViewSupported` from `SessionCatalog`. After the removal, we can still capture user errors if users try to drop a table using `DROP VIEW`.
#### How was this patch tested?
Modified the existing test cases
Author: gatorsmile <gatorsmile@gmail.com>
Closes#12284 from gatorsmile/followupDropTable.
## What changes were proposed in this pull request?
Currently, many functions do not show usages, as in the following.
```
scala> sql("desc function extended `sin`").collect().foreach(println)
[Function: sin]
[Class: org.apache.spark.sql.catalyst.expressions.Sin]
[Usage: To be added.]
[Extended Usage:
To be added.]
```
This PR adds descriptions for functions and adds a testcase to prevent adding a function without a usage.
```
scala> sql("desc function extended `sin`").collect().foreach(println);
[Function: sin]
[Class: org.apache.spark.sql.catalyst.expressions.Sin]
[Usage: sin(x) - Returns the sine of x.]
[Extended Usage:
> SELECT sin(0);
0.0]
```
The only exceptions are `cube`, `grouping`, `grouping_id`, `rollup`, `window`.
## How was this patch tested?
Pass the Jenkins tests (including new testcases.)
Author: Dongjoon Hyun <dongjoon@apache.org>
Closes#12185 from dongjoon-hyun/SPARK-14415.
#### What changes were proposed in this pull request?
This PR is to provide native support for the DDL commands `DROP VIEW` and `DROP TABLE`. The PR includes native parsing and native analysis.
Based on the Hive DDL document for [DROP_VIEW_WEB_LINK](https://cwiki.apache.org/confluence/display/Hive/LanguageManual+DDL#LanguageManualDDL-DropView), `DROP VIEW` is defined as,
**Syntax:**
```SQL
DROP VIEW [IF EXISTS] [db_name.]view_name;
```
- to remove metadata for the specified view.
- illegal to use DROP TABLE on a view.
- illegal to use DROP VIEW on a table.
- this command only works in `HiveContext`. In `SQLContext`, we will get an exception.
This PR also handles `DROP TABLE`.
**Syntax:**
```SQL
DROP TABLE [IF EXISTS] table_name [PURGE];
```
- Previously, the `DROP TABLE` command could only drop Hive tables in `HiveContext`. Now, after this PR, this command can also drop temporary tables, external tables, and external data source tables in `SQLContext`.
- In `HiveContext`, we will not issue an exception if the to-be-dropped table does not exist and users did not specify `IF EXISTS`. Instead, we just log an error message. If `IF EXISTS` is specified, we will not issue any error message/exception.
- In `SQLContext`, we will issue an exception if the to-be-dropped table does not exist, unless `IF EXISTS` is specified.
- Data will not be deleted if the tables are `external`; data is only deleted when the table type is `managed_table`.
#### How was this patch tested?
For verifying command parsing, added test cases in `spark/sql/hive/HiveDDLCommandSuite.scala`
For verifying command analysis, added test cases in `spark/sql/hive/execution/HiveDDLSuite.scala`
Author: gatorsmile <gatorsmile@gmail.com>
Author: xiaoli <lixiao1983@gmail.com>
Author: Xiao Li <xiaoli@Xiaos-MacBook-Pro.local>
Closes#12146 from gatorsmile/dropView.
#### What changes were proposed in this pull request?
"Not good to slightly ignore all the un-supported options/clauses. We should either support it or throw an exception." A comment from yhuai in another PR https://github.com/apache/spark/pull/12146
- Can `Explain` be an exception? The `Formatted` clause is used in `HiveCompatibilitySuite`.
- Two unsupported clauses in `Drop Table` are handled in a separate PR: https://github.com/apache/spark/pull/12146
#### How was this patch tested?
Test cases are added to verify all the cases.
Author: gatorsmile <gatorsmile@gmail.com>
Closes#12255 from gatorsmile/warningToException.
## What changes were proposed in this pull request?
…because some of the built-in functions are not in the function registry.
This fix addresses issues in the `describe function` command where some of the outputs still show Hive's functions, because some built-in functions are not in FunctionRegistry.
The following built-in functions have been added to FunctionRegistry:
```
-
!
*
/
&
%
^
+
<
<=
<=>
=
==
>
>=
|
~
and
in
like
not
or
rlike
when
```
The following listed functions are not added but are hard-coded in `commands.scala` (hvanhovell):
```
!=
<>
between
case
```
Below are the existing result of the above functions that have not been added:
```
spark-sql> describe function `!=`;
Function: <>
Class: org.apache.hadoop.hive.ql.udf.generic.GenericUDFOPNotEqual
Usage: a <> b - Returns TRUE if a is not equal to b
```
```
spark-sql> describe function `<>`;
Function: <>
Class: org.apache.hadoop.hive.ql.udf.generic.GenericUDFOPNotEqual
Usage: a <> b - Returns TRUE if a is not equal to b
```
```
spark-sql> describe function `between`;
Function: between
Class: org.apache.hadoop.hive.ql.udf.generic.GenericUDFBetween
Usage: between a [NOT] BETWEEN b AND c - evaluate if a is [not] in between b and c
```
```
spark-sql> describe function `case`;
Function: case
Class: org.apache.hadoop.hive.ql.udf.generic.GenericUDFCase
Usage: CASE a WHEN b THEN c [WHEN d THEN e]* [ELSE f] END - When a = b, returns c; when a = d, return e; else return f
```
## How was this patch tested?
Existing tests passed. Additional test cases added.
Author: Yong Tang <yong.tang.github@outlook.com>
Closes#12128 from yongtang/SPARK-14335.
## What changes were proposed in this pull request?
This is a followup to #12117 and addresses some of the TODOs introduced there. In particular, the resolution of database is now pushed into session catalog, which knows about the current database. Further, the logic for checking whether a function exists is pushed into the external catalog.
No change in functionality is expected.
## How was this patch tested?
`SessionCatalogSuite`, `DDLSuite`
Author: Andrew Or <andrew@databricks.com>
Closes#12198 from andrewor14/function-exists.
## What changes were proposed in this pull request?
This patch removes DirectParquetOutputCommitter. This was initially created by Databricks as a faster way to write Parquet data to S3. However, given how the underlying S3 Hadoop implementation works, this committer only works when there are no failures. If there are multiple attempts of the same task (e.g. speculation or task failures or node failures), the output data can be corrupted. I don't think this performance optimization outweighs the correctness issue.
## How was this patch tested?
Removed the related tests also.
Author: Reynold Xin <rxin@databricks.com>
Closes#12229 from rxin/SPARK-10063.
#### What changes were proposed in this pull request?
This PR is to fix the compilation errors in Scala 2.10 build, as shown in the link:
https://amplab.cs.berkeley.edu/jenkins/job/spark-master-compile-maven-scala-2.10/735/console
```
[error] /home/jenkins/workspace/spark-master-compile-maven-scala-2.10/sql/hive/src/test/scala/org/apache/spark/sql/hive/HiveDDLCommandSuite.scala:266: value contains is not a member of Option[String]
[error] assert(desc.viewText.contains("SELECT * FROM tab1"))
[error] ^
[error] /home/jenkins/workspace/spark-master-compile-maven-scala-2.10/sql/hive/src/test/scala/org/apache/spark/sql/hive/HiveDDLCommandSuite.scala:267: value contains is not a member of Option[String]
[error] assert(desc.viewOriginalText.contains("SELECT * FROM tab1"))
[error] ^
[error] /home/jenkins/workspace/spark-master-compile-maven-scala-2.10/sql/hive/src/test/scala/org/apache/spark/sql/hive/HiveDDLCommandSuite.scala:293: value contains is not a member of Option[String]
[error] assert(desc.viewText.contains("SELECT * FROM tab1"))
[error] ^
[error] /home/jenkins/workspace/spark-master-compile-maven-scala-2.10/sql/hive/src/test/scala/org/apache/spark/sql/hive/HiveDDLCommandSuite.scala:294: value contains is not a member of Option[String]
[error] assert(desc.viewOriginalText.contains("SELECT * FROM tab1"))
[error] ^
[error] four errors found
[error] Compile failed at Apr 5, 2016 10:59:09 PM [10.502s]
```
#### How was this patch tested?
Not sure how to trigger Scala 2.10 compilation in the test environment.
Author: gatorsmile <gatorsmile@gmail.com>
Closes#12201 from gatorsmile/buildBreak2.10.
#### What changes were proposed in this pull request?
Because the concept of partitioning is associated with physical tables, we disable all support for partitioned views, which are defined by the following three commands in the [Hive DDL Manual](https://cwiki.apache.org/confluence/display/Hive/LanguageManual+DDL#LanguageManualDDL-Create/Drop/AlterView):
```
ALTER VIEW view DROP [IF EXISTS] PARTITION spec1[, PARTITION spec2, ...];
ALTER VIEW view ADD [IF NOT EXISTS] PARTITION spec;
CREATE VIEW [IF NOT EXISTS] [db_name.]view_name [(column_name [COMMENT column_comment], ...) ]
[COMMENT view_comment]
[TBLPROPERTIES (property_name = property_value, ...)]
AS SELECT ...;
```
An exception is thrown when users issue any of these three DDL commands.
#### How was this patch tested?
Added test cases for parsing create view and changed the existing test cases to verify if the exceptions are thrown.
Author: gatorsmile <gatorsmile@gmail.com>
Author: xiaoli <lixiao1983@gmail.com>
Author: Xiao Li <xiaoli@Xiaos-MacBook-Pro.local>
Closes#12169 from gatorsmile/viewPartition.
## What changes were proposed in this pull request?
This PR implements CreateFunction and DropFunction commands. Besides implementing these two commands, we also change how to manage functions. Here are the main changes.
* `FunctionRegistry` will be a container to store all functions builders and it will not actively load any functions. Because of this change, we do not need to maintain a separate registry for HiveContext. So, `HiveFunctionRegistry` is deleted.
* SessionCatalog takes care of loading a function if the function is not in the `FunctionRegistry` but its metadata is stored in the external catalog. In this case, SessionCatalog will (1) load the metadata from the external catalog, (2) load all needed resources (i.e. jars and files), (3) create a function builder based on the function definition, and (4) register the function builder in the `FunctionRegistry` (see the sketch after this list).
* An `UnresolvedGenerator` is created, so the parser will not need to call `FunctionRegistry` directly during parsing, which is not a good time to create a Hive UDTF. In the analysis phase, we will resolve `UnresolvedGenerator`.
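Below is a self-contained toy model of that lazy-loading flow; this is not the Spark API, and every name in it is made up.
```scala
object FunctionLoadingSketch {
  type FunctionBuilder = Seq[Any] => Any

  // Builders live in an in-memory registry; only metadata lives in the "external catalog".
  val functionRegistry = scala.collection.mutable.Map[String, FunctionBuilder]()
  val externalCatalog  = Map("my_upper" -> "com.example.MyUpperUDF")   // (1) metadata only

  def makeBuilder(className: String): FunctionBuilder =                // (3) stand-in builder
    args => s"$className(${args.mkString(", ")})"

  def lookupFunction(name: String): FunctionBuilder =
    functionRegistry.getOrElseUpdate(name, {                           // (4) cache in the registry
      val className = externalCatalog.getOrElse(name, sys.error(s"No such function: $name"))
      // (2) real code would also load the function's jars and files here
      makeBuilder(className)
    })

  def main(args: Array[String]): Unit =
    println(lookupFunction("my_upper")(Seq("hello")))
}
```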
This PR is based on viirya's https://github.com/apache/spark/pull/12036/
## How was this patch tested?
Existing tests and new tests.
## TODOs
- [x] Self-review
- [x] Cleanup
- [x] More tests for create/drop functions (we need more tests for permanent functions).
- [ ] File JIRAs for all TODOs
- [x] Standardize the error message when a function does not exist.
Author: Yin Huai <yhuai@databricks.com>
Author: Liang-Chi Hsieh <simonh@tw.ibm.com>
Closes#12117 from yhuai/function.
#### What changes were proposed in this pull request?
Currently, weird error messages are issued if we use HiveContext-only operations in SQLContext.
For example,
- When calling `Drop Table` in SQL Context, we got the following message:
```
Expected exception org.apache.spark.sql.catalyst.parser.ParseException to be thrown, but java.lang.ClassCastException was thrown.
```
- When calling `Script Transform` in SQL Context, we got the message:
```
assertion failed: No plan for ScriptTransformation [key#9,value#10], cat, [tKey#155,tValue#156], null
+- LogicalRDD [key#9,value#10], MapPartitionsRDD[3] at beforeAll at BeforeAndAfterAll.scala:187
```
Updates:
Based on the investigation from hvanhovell, the root cause is `visitChildren`, which is the default implementation. It always returns the result of the last defined context child. After merging the code changes from hvanhovell, it works! Thank you hvanhovell!
#### How was this patch tested?
A few test cases are added.
Not sure if the same issue exists for the other operators/DDL/DML. hvanhovell
Author: gatorsmile <gatorsmile@gmail.com>
Author: xiaoli <lixiao1983@gmail.com>
Author: Herman van Hovell <hvanhovell@questtec.nl>
Author: Xiao Li <xiaoli@Xiaos-MacBook-Pro.local>
Closes#12134 from gatorsmile/hiveParserCommand.
## What changes were proposed in this pull request?
This PR adds Native execution of SHOW TBLPROPERTIES command.
Command Syntax:
``` SQL
SHOW TBLPROPERTIES table_name[(property_key_literal)]
```
## How was this patch tested?
Tests added in HiveCommandSuite and DDLCommandSuite
Author: Dilip Biswal <dbiswal@us.ibm.com>
Closes#12133 from dilipbiswal/dkb_show_tblproperties.
## What changes were proposed in this pull request?
This PR contains the following 5 types of maintenance fixes over 59 files (+94 lines, -93 lines).
- Fix typos (exception/log strings, testcase names, comments) in 44 lines.
- Fix lint-java errors (MaxLineLength) in 6 lines. (New codes after SPARK-14011)
- Use diamond operators in 40 lines. (New codes after SPARK-13702)
- Fix redundant semicolon in 5 lines.
- Rename class `InferSchemaSuite` to `CSVInferSchemaSuite` in CSVInferSchemaSuite.scala.
## How was this patch tested?
Manual and pass the Jenkins tests.
Author: Dongjoon Hyun <dongjoon@apache.org>
Closes#12139 from dongjoon-hyun/SPARK-14355.
## What changes were proposed in this pull request?
We throw an AnalysisException that looks like this:
```
scala> sqlContext.sql("CREATE TEMPORARY MACRO SIGMOID (x DOUBLE) 1.0 / (1.0 + EXP(-x))")
org.apache.spark.sql.catalyst.parser.ParseException:
Unsupported SQL statement
== SQL ==
CREATE TEMPORARY MACRO SIGMOID (x DOUBLE) 1.0 / (1.0 + EXP(-x))
at org.apache.spark.sql.catalyst.parser.AbstractSqlParser.nativeCommand(ParseDriver.scala:66)
at org.apache.spark.sql.catalyst.parser.AbstractSqlParser$$anonfun$parsePlan$1.apply(ParseDriver.scala:56)
at org.apache.spark.sql.catalyst.parser.AbstractSqlParser$$anonfun$parsePlan$1.apply(ParseDriver.scala:53)
at org.apache.spark.sql.catalyst.parser.AbstractSqlParser.parse(ParseDriver.scala:86)
at org.apache.spark.sql.catalyst.parser.AbstractSqlParser.parsePlan(ParseDriver.scala:53)
at org.apache.spark.sql.SQLContext.parseSql(SQLContext.scala:198)
at org.apache.spark.sql.SQLContext.sql(SQLContext.scala:749)
... 48 elided
```
## How was this patch tested?
Add test cases in HiveQuerySuite.scala
Author: bomeng <bmeng@us.ibm.com>
Closes#12125 from bomeng/SPARK-14341.
## What changes were proposed in this pull request?
EXPLAIN output should be in a single cell.
**Before**
```
scala> sql("explain select 1").collect()
res0: Array[org.apache.spark.sql.Row] = Array([== Physical Plan ==], [WholeStageCodegen], [: +- Project [1 AS 1#1]], [: +- INPUT], [+- Scan OneRowRelation[]])
```
**After**
```
scala> sql("explain select 1").collect()
res1: Array[org.apache.spark.sql.Row] =
Array([== Physical Plan ==
WholeStageCodegen
: +- Project [1 AS 1#4]
: +- INPUT
+- Scan OneRowRelation[]])
```
Or,
```
scala> sql("explain select 1").head
res1: org.apache.spark.sql.Row =
[== Physical Plan ==
WholeStageCodegen
: +- Project [1 AS 1#5]
: +- INPUT
+- Scan OneRowRelation[]]
```
Please note that `spark-shell` (the Scala shell) trims long string output. So, you may need to use `println` to get full strings.
```
scala> println(sql("explain codegen select 'a' as a group by 1").head)
[Found 2 WholeStageCodegen subtrees.
== Subtree 1 / 2 ==
WholeStageCodegen
...
/* 059 */ }
/* 060 */ }
]
```
## How was this patch tested?
Pass the Jenkins tests. (Testcases are updated.)
Author: Dongjoon Hyun <dongjoon@apache.org>
Closes#12137 from dongjoon-hyun/SPARK-14350.
## What changes were proposed in this pull request?
This PR implements `EXPLAIN CODEGEN` SQL command which returns generated codes like `debugCodegen`. In `spark-shell`, we don't need to `import debug` module. In `spark-sql`, we can use this SQL command now.
**Before**
```
scala> import org.apache.spark.sql.execution.debug._
scala> sql("select 'a' as a group by 1").debugCodegen()
Found 2 WholeStageCodegen subtrees.
== Subtree 1 / 2 ==
...
Generated code:
...
== Subtree 2 / 2 ==
...
Generated code:
...
```
**After**
```
scala> sql("explain extended codegen select 'a' as a group by 1").collect().foreach(println)
[Found 2 WholeStageCodegen subtrees.]
[== Subtree 1 / 2 ==]
...
[]
[Generated code:]
...
[]
[== Subtree 2 / 2 ==]
...
[]
[Generated code:]
...
```
## How was this patch tested?
Pass the Jenkins tests (including new testcases)
Author: Dongjoon Hyun <dongjoon@apache.org>
Closes#12099 from dongjoon-hyun/SPARK-14251.
## What changes were proposed in this pull request?
`SizeBasedWindowFunction.n` is a global singleton attribute created for evaluating size based aggregate window functions like `CUME_DIST`. However, this attribute gets different expression IDs when created on both driver side and executor side. This PR adds `withPartitionSize` method to `SizeBasedWindowFunction` so that we can easily rewrite `SizeBasedWindowFunction.n` on executor side.
## How was this patch tested?
A test case is added in `HiveSparkSubmitSuite`, which supports launching multi-process clusters.
Author: Cheng Lian <lian@databricks.com>
Closes#12040 from liancheng/spark-14244-fix-sized-window-function.
## What changes were proposed in this pull request?
This patch enables use of OrcRelation for SQL queries which read data from Hive tables. Changes in this patch:
- Added a new rule `OrcConversions` which would alter the plan to use `OrcRelation`. In this diff, the conversion is done only for reads.
- Added a new config `spark.sql.hive.convertMetastoreOrc` to control the conversion
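A hedged usage sketch, reusing the `hqlContext` handle from the plans below; only the config key comes from this patch:
```scala
// Enable the metastore ORC conversion introduced here, then read the table.
hqlContext.setConf("spark.sql.hive.convertMetastoreOrc", "true")
hqlContext.sql("SELECT * FROM orc_table").explain(true)
```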
BEFORE
```
scala> hqlContext.sql("SELECT * FROM orc_table").explain(true)
== Parsed Logical Plan ==
'Project [unresolvedalias(*, None)]
+- 'UnresolvedRelation `orc_table`, None
== Analyzed Logical Plan ==
key: string, value: string
Project [key#171,value#172]
+- MetastoreRelation default, orc_table, None
== Optimized Logical Plan ==
MetastoreRelation default, orc_table, None
== Physical Plan ==
HiveTableScan [key#171,value#172], MetastoreRelation default, orc_table, None
```
AFTER
```
scala> hqlContext.sql("SELECT * FROM orc_table").explain(true)
== Parsed Logical Plan ==
'Project [unresolvedalias(*, None)]
+- 'UnresolvedRelation `orc_table`, None
== Analyzed Logical Plan ==
key: string, value: string
Project [key#76,value#77]
+- SubqueryAlias orc_table
+- Relation[key#76,value#77] ORC part: struct<>, data: struct<key:string,value:string>
== Optimized Logical Plan ==
Relation[key#76,value#77] ORC part: struct<>, data: struct<key:string,value:string>
== Physical Plan ==
WholeStageCodegen
: +- Scan ORC part: struct<>, data: struct<key:string,value:string>[key#76,value#77] InputPaths: file:/user/hive/warehouse/orc_table
```
## How was this patch tested?
- Added a new unit test. Ran existing unit tests
- Ran with production like data
## Performance gains
Ran on a production table in Facebook (note that the data was in DWRF file format which is similar to ORC)
Best case : when there was no matching rows for the predicate in the query (everything is filtered out)
```
CPU time Wall time Total wall time across all tasks
================================================================
Without the change 541_515 sec 25.0 mins 165.8 hours
With change 407 sec 1.5 mins 15 mins
```
Average case: A subset of rows in the data match the query predicate
```
CPU time Wall time Total wall time across all tasks
================================================================
Without the change 624_630 sec 31.0 mins 199.0 h
With change 14_769 sec 5.3 mins 7.7 h
```
Author: Tejas Patil <tejasp@fb.com>
Closes#11891 from tejasapatil/orc_ppd.
## What changes were proposed in this pull request?
This PR throws an Unsupported Operation exception for create index, drop index, alter index, lock table, lock database, unlock table, and unlock database operations that are not supported in Spark SQL. Currently these operations are passed through to and executed by Hive.
Error:
```
spark-sql> drop index my_index on my_table;
Error in query:
Unsupported operation: drop index(line 1, pos 0)
```
## How was this patch tested?
Added test cases to HiveQuerySuite
yhuai hvanhovell andrewor14
Author: sureshthalamati <suresh.thalamati@gmail.com>
Closes#12069 from sureshthalamati/unsupported_ddl_spark-14133.
## What changes were proposed in this pull request?
This PR addresses the following
1. Supports native execution of SHOW DATABASES command
2. Fixes SHOW TABLES to apply the identifier_with_wildcards pattern if supplied.
SHOW TABLE syntax
```
SHOW TABLES [IN database_name] ['identifier_with_wildcards'];
```
SHOW DATABASES syntax
```
SHOW (DATABASES|SCHEMAS) [LIKE 'identifier_with_wildcards'];
```
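A hedged usage sketch of the two commands (the database name and patterns are made up):
```scala
sqlContext.sql("SHOW DATABASES LIKE 'test*'").show()
sqlContext.sql("SHOW TABLES IN testdb 'sam*'").show()
```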
## How was this patch tested?
Tests added in SQLQuerySuite (both hive and sql contexts) and DDLCommandSuite
Note: Since the table name pattern was not working, tests are added in both SQLQuerySuites (Hive and SQL contexts) to verify the application of the table pattern.
Author: Dilip Biswal <dbiswal@us.ibm.com>
Closes#11991 from dilipbiswal/dkb_show_database.
### What changes were proposed in this pull request?
This PR removes the ANTLR3-based parser, and moves the new ANTLR4-based parser into the `org.apache.spark.sql.catalyst.parser` package.
### How was this patch tested?
Existing unit tests.
cc rxin andrewor14 yhuai
Author: Herman van Hovell <hvanhovell@questtec.nl>
Closes#12071 from hvanhovell/SPARK-14211.
## What changes were proposed in this pull request?
Before: We just pass all role commands to Hive even though it doesn't work.
After: We throw an `AnalysisException` that looks like this:
```
scala> sql("CREATE ROLE x")
org.apache.spark.sql.AnalysisException: Unsupported Hive operation: CREATE ROLE;
at org.apache.spark.sql.hive.HiveQl$$anonfun$parsePlan$1.apply(HiveQl.scala:213)
at org.apache.spark.sql.hive.HiveQl$$anonfun$parsePlan$1.apply(HiveQl.scala:208)
at org.apache.spark.sql.catalyst.parser.CatalystQl.safeParse(CatalystQl.scala:49)
at org.apache.spark.sql.hive.HiveQl.parsePlan(HiveQl.scala:208)
at org.apache.spark.sql.SQLContext.parseSql(SQLContext.scala:198)
```
## How was this patch tested?
`HiveQuerySuite`
Author: Andrew Or <andrew@databricks.com>
Closes#11948 from andrewor14/ddl-role-management.
## What changes were proposed in this pull request?
This patch addresses the remaining comments left in #11750 and #11918 after they are merged. For a full list of changes in this patch, just trace the commits.
## How was this patch tested?
`SessionCatalogSuite` and `CatalogTestCases`
Author: Andrew Or <andrew@databricks.com>
Closes#12006 from andrewor14/session-catalog-followup.
## What changes were proposed in this pull request?
JIRA: https://issues.apache.org/jira/browse/SPARK-14156
In HiveComparisonTest, when catalyst results are different from Hive results, we collect the messages for computed tables during the test. When creating the message, we use sparkPlan. But we actually run the query with executedPlan. So the error message is sometimes confusing.
For example, as wholestage codegen is enabled by default now, the spark plan shown for computed tables is the plan before wholestage codegen.
A concrete example is the following error message shown before this patch. It is the error shown when running `HiveCompatibilityTest` `auto_join26`.
auto_join26 has one SQL to create table:
INSERT OVERWRITE TABLE dest_j1
SELECT x.key, count(1) FROM src1 x JOIN src y ON (x.key = y.key) group by x.key; (1)
Then a SQL to retrieve the result:
select * from dest_j1 x order by x.key; (2)
When the above SQL (2) to retrieve the result fails, in `HiveComparisonTest` we will try to collect and show the generated data from table `dest_j1` using SQL (1)'s spark plan. Then you will see this error:
```
TungstenAggregate(key=[key#8804], functions=[(count(1),mode=Partial,isDistinct=false)], output=[key#8804,count#8834L])
+- Project [key#8804]
+- BroadcastHashJoin [key#8804], [key#8806], Inner, BuildRight, None
:- Filter isnotnull(key#8804)
: +- InMemoryColumnarTableScan [key#8804], [isnotnull(key#8804)], InMemoryRelation [key#8804,value#8805], true, 5, StorageLevel(true, true, false, true, 1), HiveTableScan [key#8717,value#8718], MetastoreRelation default, src1, None, Some(src1)
+- Filter isnotnull(key#8806)
+- InMemoryColumnarTableScan [key#8806], [isnotnull(key#8806)], InMemoryRelation [key#8806,value#8807], true, 5, StorageLevel(true, true, false, true, 1), HiveTableScan [key#8760,value#8761], MetastoreRelation default, src, None, Some(src)
at org.apache.spark.sql.catalyst.errors.package$.attachTree(package.scala:47)
at org.apache.spark.sql.execution.aggregate.TungstenAggregate.doExecute(TungstenAggregate.scala:82)
at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:121)
at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:121)
at org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:140)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:137)
at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:120)
at org.apache.spark.sql.execution.aggregate.TungstenAggregate$$anonfun$doExecute$1.apply(TungstenAggregate.scala:87)
at org.apache.spark.sql.execution.aggregate.TungstenAggregate$$anonfun$doExecute$1.apply(TungstenAggregate.scala:82)
at org.apache.spark.sql.catalyst.errors.package$.attachTree(package.scala:46)
... 70 more
Caused by: java.lang.UnsupportedOperationException: Filter does not implement doExecuteBroadcast
at org.apache.spark.sql.execution.SparkPlan.doExecuteBroadcast(SparkPlan.scala:221)
```
The message is confusing because it is not the plan actually run by the Spark SQL engine to create the generated table; the plan actually run is fine. But before this patch, we ran `e.sparkPlan.collect` to retrieve and show the generated data, and that spark plan is not the plan we can actually run, so the above error is shown.
After this patch, we won't see the error because the executed plan works correctly.
## How was this patch tested?
Existing tests.
Author: Liang-Chi Hsieh <simonh@tw.ibm.com>
Closes#11957 from viirya/use-executedplan.
## What changes were proposed in this pull request?
This PR fixes the following two testcases in order to test the correct usages.
```
checkSqlGeneration("SELECT substr('This is a test', 'is')")
checkSqlGeneration("SELECT substring('This is a test', 'is')")
```
Actually, the testcases pass, but they test exceptional cases.
```
scala> sql("SELECT substr('This is a test', 'is')")
res0: org.apache.spark.sql.DataFrame = [substring(This is a test, CAST(is AS INT), 2147483647): string]
scala> sql("SELECT substr('This is a test', 'is')").collect()
res1: Array[org.apache.spark.sql.Row] = Array([null])
```
## How was this patch tested?
Pass the modified unit tests.
Author: Dongjoon Hyun <dongjoon@apache.org>
Closes#11963 from dongjoon-hyun/fix_substr_testcase.
## What changes were proposed in this pull request?
As we have `CreateArray` and `CreateStruct`, we should also have `CreateMap`. This PR adds the `CreateMap` expression, the DataFrame API, and the Python API.
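A hedged sketch of the DataFrame side, assuming the Scala function is exposed as `functions.map` and takes alternating key/value columns:
```scala
import org.apache.spark.sql.functions.{col, lit, map}

// Build a map<string,bigint> column from a literal key and an existing column.
val df = sqlContext.range(3).toDF("id")
df.select(map(lit("id"), col("id")).as("m")).show()
```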
## How was this patch tested?
various new tests.
Author: Wenchen Fan <wenchen@databricks.com>
Closes#11879 from cloud-fan/create_map.
## What changes were proposed in this pull request?
This reopens#11836, which was merged but promptly reverted because it introduced flaky Hive tests.
## How was this patch tested?
See `CatalogTestCases`, `SessionCatalogSuite` and `HiveContextSuite`.
Author: Andrew Or <andrew@databricks.com>
Closes#11938 from andrewor14/session-catalog-again.
## What changes were proposed in this pull request?
unionAll has been deprecated in SPARK-14088.
## How was this patch tested?
Should be covered by all existing tests.
Author: Reynold Xin <rxin@databricks.com>
Closes#11946 from rxin/SPARK-14142.
## What changes were proposed in this pull request?
`SessionCatalog`, introduced in #11750, is a catalog that keeps track of temporary functions and tables, and delegates metastore operations to `ExternalCatalog`. This functionality overlaps a lot with the existing `analysis.Catalog`.
As of this commit, `SessionCatalog` and `ExternalCatalog` will no longer be dead code. There are still things that need to be done after this patch, namely:
- SPARK-14013: Properly implement temporary functions in `SessionCatalog`
- SPARK-13879: Decide which DDL/DML commands to support natively in Spark
- SPARK-?????: Implement the ones we do want to support through `SessionCatalog`.
- SPARK-?????: Merge SQL/HiveContext
## How was this patch tested?
This is largely a refactoring task so there are no new tests introduced. The particularly relevant tests are `SessionCatalogSuite` and `ExternalCatalogSuite`.
Author: Andrew Or <andrew@databricks.com>
Author: Yin Huai <yhuai@databricks.com>
Closes#11836 from andrewor14/use-session-catalog.
## What changes were proposed in this pull request?
This PR adds support for TimestampType in the vectorized parquet reader
## How was this patch tested?
1. `VectorizedColumnReader` initially had a gating condition on `primitiveType.getPrimitiveTypeName() == PrimitiveType.PrimitiveTypeName.INT96)` that made us fall back on parquet-mr for handling timestamps. This condition is now removed.
2. The `ParquetHadoopFsRelationSuite` (that tests for all supported hive types -- including `TimestampType`) fails when the gating condition is removed (https://github.com/apache/spark/pull/11808) and should now pass with this change. Similarly, the `ParquetHiveCompatibilitySuite.SPARK-10177 timestamp` test that fails when the gating condition is removed, should now pass as well.
3. Added tests in `HadoopFsRelationTest` that test both the dictionary encoded and non-encoded versions across all supported datatypes.
Author: Sameer Agarwal <sameer@databricks.com>
Closes#11882 from sameeragarwal/timestamp-parquet.
## What changes were proposed in this pull request?
This PR does the renaming as suggested by marmbrus in [this comment][1].
## How was this patch tested?
Existing tests.
[1]: 6d37e1eb90 (commitcomment-16654694)
Author: Cheng Lian <lian@databricks.com>
Closes#11889 from liancheng/spark-13817-follow-up.
SPARK-13774: IllegalArgumentException: Can not create a Path from an empty string for incorrect file path
**Overview:**
- If a non-existent path is given in this call
```
scala> sqlContext.read.format("csv").load("file-path-is-incorrect.csv")
```
it throws the following error:
`java.lang.IllegalArgumentException: Can not create a Path from an empty string` …..
`It gets called from inferSchema call in org.apache.spark.sql.execution.datasources.DataSource.resolveRelation`
- The purpose of this JIRA is to throw a better error message.
- With the fix, you will now get a _Path does not exist_ error message.
```
scala> sqlContext.read.format("csv").load("file-path-is-incorrect.csv")
org.apache.spark.sql.AnalysisException: Path does not exist: file:/Users/ksunitha/trunk/spark/file-path-is-incorrect.csv;
at org.apache.spark.sql.execution.datasources.DataSource$$anonfun$12.apply(DataSource.scala:215)
at org.apache.spark.sql.execution.datasources.DataSource$$anonfun$12.apply(DataSource.scala:204)
...
at org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:204)
at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:131)
at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:141)
... 49 elided
```
**Details**
_Changes include:_
- Check if path exists or not in resolveRelation in DataSource, and throw an AnalysisException with message like “Path does not exist: $path”
- AnalysisException is thrown similar to the exceptions thrown in resolveRelation.
- The glob path and the non-glob path are checked with minimal calls to path exists. If the globPath is empty, then it is a nonexistent glob pattern and an error will be thrown. In the non-glob scenario, it is only necessary to check whether the first element in the Seq is valid.
_Test modifications:_
- Changes went in for 3 tests to account for this error checking.
- SQLQuerySuite:test("run sql directly on files") – Error message needed to be updated.
- 2 tests failed in MetastoreDataSourcesSuite because they used a dummy path, so the tests are modified to use a tempdir, allowing them to move past the check and continue to exercise the codepath they were meant to test
_New Tests:_
2 new tests are added to DataFrameSuite to validate that glob and non-glob path will throw the new error message.
_Testing:_
Unit tests were run with the fix.
**Notes/Questions to reviewers:**
- There is some code duplication in DataSource.scala in resolveRelation method and also createSource with respect to getting the paths. I have not made any changes to the createSource codepath. Should we make the change there as well ?
- From other JIRAs, I know there is restructuring and changes going on in this area, not sure how that will affect these changes, but since this seemed like a starter issue, I looked into it. If we prefer not to add the overhead of the checks, or if there is a better place to do so, let me know.
I would appreciate your review. Thanks for your time and comments.
Author: Sunitha Kambhampati <skambha@us.ibm.com>
Closes#11775 from skambha/improve_errmsg.
This PR implements the new `buildReader` interface for the Parquet `FileFormat`. A simple implementation of `FileScanRDD` is also included.
This code should be tested by the many existing tests for parquet.
Author: Michael Armbrust <michael@databricks.com>
Author: Sameer Agarwal <sameer@databricks.com>
Author: Nong Li <nong@databricks.com>
Closes#11709 from marmbrus/parquetReader.
This PR resolves two issues:
First, it supports expanding `*` inside aggregate functions over structs when using the DataFrame/Dataset APIs. For example,
```scala
structDf.groupBy($"a").agg(min(struct($"record.*")))
```
Second, it improves the error messages for invalid star usage in the DataFrame/Dataset APIs. For example,
```scala
pagecounts4PartitionsDS
.map(line => (line._1, line._3))
.toDF()
.groupBy($"_1")
.agg(sum("*") as "sumOccurances")
```
Before the fix, the invalid usage will issue a confusing error message, like:
```
org.apache.spark.sql.AnalysisException: cannot resolve '_1' given input columns _1, _2;
```
After the fix, the message is like:
```
org.apache.spark.sql.AnalysisException: Invalid usage of '*' in function 'sum'
```
cc: rxin nongli cloud-fan
Author: gatorsmile <gatorsmile@gmail.com>
Closes#11208 from gatorsmile/sumDataSetResolution.
## What changes were proposed in this pull request?
This patch merges DatasetHolder and DataFrameHolder. This makes more sense because DataFrame/Dataset are now one class.
In addition, fixed some minor issues with pull request #11732.
## How was this patch tested?
Updated existing unit tests that test these implicits.
Author: Reynold Xin <rxin@databricks.com>
Closes#11737 from rxin/SPARK-13898.
## What changes were proposed in this pull request?
[Spark Coding Style Guide](https://cwiki.apache.org/confluence/display/SPARK/Spark+Code+Style+Guide) has 100-character limit on lines, but it's disabled for Java since 11/09/15. This PR enables **LineLength** checkstyle again. To help that, this also introduces **RedundantImport** and **RedundantModifier**, too. The following is the diff on `checkstyle.xml`.
```xml
- <!-- TODO: 11/09/15 disabled - the lengths are currently > 100 in many places -->
- <!--
<module name="LineLength">
<property name="max" value="100"/>
<property name="ignorePattern" value="^package.*|^import.*|a href|href|http://|https://|ftp://"/>
</module>
- -->
<module name="NoLineWrap"/>
<module name="EmptyBlock">
<property name="option" value="TEXT"/>
-167,5 +164,7
</module>
<module name="CommentsIndentation"/>
<module name="UnusedImports"/>
+ <module name="RedundantImport"/>
+ <module name="RedundantModifier"/>
```
## How was this patch tested?
Currently, `lint-java` is disabled in Jenkins. It needs a manual test.
After passing the Jenkins tests, `dev/lint-java` should passes locally.
Author: Dongjoon Hyun <dongjoon@apache.org>
Closes#11831 from dongjoon-hyun/SPARK-14011.
## What changes were proposed in this pull request?
As part of testing SQL generation from an analyzed SQL plan, we run the generated SQL for tests in HiveComparisonTest. This PR makes the generated SQL get eagerly analyzed, so when a generated SQL has any analysis error, we can see the error message created by
```
case NonFatal(e) => fail(
s"""Failed to analyze the converted SQL string:
|
|# Original HiveQL query string:
|$queryString
|
|# Resolved query plan:
|${originalQuery.analyzed.treeString}
|
|# Converted SQL query string:
|$convertedSQL
""".stripMargin, e)
```
Right now, if we can parse a generated SQL but fail to analyze it, we will see error message generated by the following code (it only mentions that we cannot execute the original query, i.e. `queryString`).
```
case e: Throwable =>
val errorMessage =
s"""
|Failed to execute query using catalyst:
|Error: ${e.getMessage}
|${stackTraceToString(e)}
|$queryString
|$query
|== HIVE - ${hive.size} row(s) ==
|${hive.mkString("\n")}
""".stripMargin
```
## How was this patch tested?
Existing tests.
Author: Yin Huai <yhuai@databricks.com>
Closes#11825 from yhuai/SPARK-13972-follow-up.
## What changes were proposed in this pull request?
ShuffledHashJoin (also outer join) was removed in 1.6 in favor of SortMergeJoin, which is more robust and also fast.
ShuffledHashJoin is still useful in this case: 1) one table is much smaller than the other one, so the cost of building a hash table on the smaller table is lower than sorting the larger table; 2) any partition of the small table can fit in memory.
This PR brings back ShuffledHashJoin, basically reverting #9645 and fixing the conflicts. It also merges outer join and left-semi join into the same class. This PR does not implement full outer join, because it cannot be implemented efficiently (it would require building hash tables on both sides).
A simple benchmark (one table is 5x smaller than the other) shows that ShuffledHashJoin can be 2X faster than SortMergeJoin.
## How was this patch tested?
Added new unit tests for ShuffledHashJoin.
Author: Davies Liu <davies@databricks.com>
Closes#11788 from davies/shuffle_join.
## What changes were proposed in this pull request?
Now we should be able to convert all logical plans to SQL strings, if they are parsed from Hive queries. This PR changes the error handling to throw exceptions instead of just logging.
We will send new PRs for spotted bugs, and merge this one after all bugs are fixed.
## How was this patch tested?
existing tests.
Author: Wenchen Fan <wenchen@databricks.com>
Closes#11782 from cloud-fan/test.
## What changes were proposed in this pull request?
The fix is simple: use the existing `CombineUnions` rule to combine adjacent Unions before building the SQL string.
## How was this patch tested?
The re-enabled test
Author: Wenchen Fan <wenchen@databricks.com>
Closes#11818 from cloud-fan/bug-fix.
## What changes were proposed in this pull request?
We haven't figured out the correct logic to add sub-queries yet, so we should not clear all sub-queries before generating SQL. This PR changes the logic to only remove sub-queries directly above table relations.
An example of this bug, with the original SQL: `SELECT a FROM (SELECT a FROM tbl) t WHERE a = 1`
before this PR, we will generate:
```
SELECT attr_1 AS a FROM
SELECT attr_1 FROM (
SELECT a AS attr_1 FROM tbl
) AS sub_q0
WHERE attr_1 = 1
```
We missed a sub-query and this SQL string is illegal.
After this PR, we will generate:
```
SELECT attr_1 AS a FROM (
SELECT attr_1 FROM (
SELECT a AS attr_1 FROM tbl
) AS sub_q0
WHERE attr_1 = 1
) AS t
```
TODO: for long term, we should find a way to add sub-queries correctly, so that arbitrary logical plans can be converted to SQL string.
## How was this patch tested?
`LogicalPlanToSQLSuite`
Author: Wenchen Fan <wenchen@databricks.com>
Closes#11786 from cloud-fan/bug-fix.
## What changes were proposed in this pull request?
This PR adds SQL generation support for the `Generate` operator. It always converts the `Generate` operator into the `LATERAL VIEW` format, as there are many limitations on putting UDTFs in the project list.
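For reference, a hedged example of the `LATERAL VIEW` form that a generator is rendered back into (the table and column names are made up):
```scala
// The generator appears as LATERAL VIEW rather than as a UDTF in the project list.
sqlContext.sql(
  "SELECT id, v FROM src_with_array LATERAL VIEW explode(arr) exploded AS v")
```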
This PR is based on https://github.com/apache/spark/pull/11658, please see the last commit to review the real changes.
Thanks dilipbiswal for his initial work! Takes over https://github.com/apache/spark/pull/11596
## How was this patch tested?
new tests in `LogicalPlanToSQLSuite`
Author: Wenchen Fan <wenchen@databricks.com>
Closes#11696 from cloud-fan/generate.
## What changes were proposed in this pull request?
Logging was made private in Spark 2.0. If we move it, then users would be able to create a Logging trait themselves to avoid changing their own code.
## How was this patch tested?
existing tests.
Author: Wenchen Fan <wenchen@databricks.com>
Closes#11764 from cloud-fan/logger.
This commit updates the HiveContext so that sc.hadoopConfiguration is used to instantiate its internal instances of HiveConf.
I tested this by overriding the S3 FileSystem implementation from spark-defaults.conf as "spark.hadoop.fs.s3.impl" (to avoid [HADOOP-12810](https://issues.apache.org/jira/browse/HADOOP-12810)).
Author: Ryan Blue <blue@apache.org>
Closes#11273 from rdblue/SPARK-13403-new-hive-conf-from-hadoop-conf.
## What changes were proposed in this pull request?
Since developer API of plug-able parser has been removed in #10801 , docs should be updated accordingly.
## How was this patch tested?
This patch will not affect the real code path.
Author: Daoyuan Wang <daoyuan.wang@intel.com>
Closes#11758 from adrian-wang/spark12855.
## What changes were proposed in this pull request?
As part of the effort to merge `SQLContext` and `HiveContext`, this patch implements an internal catalog called `SessionCatalog` that handles temporary functions and tables and delegates metastore operations to `ExternalCatalog`. Currently, this is still dead code, but in the future it will be part of `SessionState` and will replace `o.a.s.sql.catalyst.analysis.Catalog`.
A recent patch, #11573, parses Hive commands in Spark itself, but still passes the entire query text to Hive. In a future patch, we will use `SessionCatalog` to implement the parsed commands.
## How was this patch tested?
800+ lines of tests in `SessionCatalogSuite`.
Author: Andrew Or <andrew@databricks.com>
Closes#11750 from andrewor14/temp-catalog.
#### What changes were proposed in this pull request?
This PR is to convert to SQL from analyzed logical plans containing operator `ScriptTransformation`.
For example, below is the SQL containing `Transform`
```
SELECT TRANSFORM (a, b, c, d) USING 'cat' FROM parquet_t2
```
Its logical plan is like
```
ScriptTransformation [a#210L,b#211L,c#212L,d#213L], cat, [key#208,value#209], HiveScriptIOSchema(List(),List(),Some(org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe),Some(org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe),List((field.delim, )),List((field.delim, )),Some(org.apache.hadoop.hive.ql.exec.TextRecordReader),Some(org.apache.hadoop.hive.ql.exec.TextRecordWriter),true)
+- SubqueryAlias parquet_t2
+- Relation[a#210L,b#211L,c#212L,d#213L] ParquetRelation
```
The generated SQL will be like
```
SELECT TRANSFORM (`parquet_t2`.`a`, `parquet_t2`.`b`, `parquet_t2`.`c`, `parquet_t2`.`d`) USING 'cat' AS (`key` string, `value` string) FROM `default`.`parquet_t2`
```
#### How was this patch tested?
Seven test cases are added to `LogicalPlanToSQLSuite`.
Author: gatorsmile <gatorsmile@gmail.com>
Author: xiaoli <lixiao1983@gmail.com>
Author: Xiao Li <xiaoli@Xiaos-MacBook-Pro.local>
Closes#11503 from gatorsmile/transformToSQL.
## What changes were proposed in this pull request?
This PR tries to solve a fundamental issue in the `SQLBuilder`. When we want to turn a logical plan into a SQL string and put it after the FROM clause, we need to wrap it with a sub-query. However, a logical plan is allowed to have same-name outputs with different qualifiers (e.g. the `Join` operator), and this kind of plan can't be put under a subquery as we will erase and assign a new qualifier to all outputs and make it impossible to distinguish same-name outputs.
To solve this problem, this PR renames all attributes with globally unique names(using exprId), so that we don't need qualifiers to resolve ambiguity anymore.
For example, `SELECT x.key, MAX(y.key) OVER () FROM t x JOIN t y`, we will parse this SQL to a Window operator and a Project operator, and add a sub-query between them. The generated SQL looks like:
```
SELECT sq_1.key, sq_1.max
FROM (
SELECT sq_0.key, sq_0.key, MAX(sq_0.key) OVER () AS max
FROM (
SELECT x.key, y.key FROM t1 AS x JOIN t2 AS y
) AS sq_0
) AS sq_1
```
You can see, the `key` columns become ambiguous after `sq_0`.
After this PR, it will generate something like:
```
SELECT attr_30 AS key, attr_37 AS max
FROM (
SELECT attr_30, attr_37
FROM (
SELECT attr_30, attr_35, MAX(attr_35) AS attr_37
FROM (
SELECT attr_30, attr_35 FROM
(SELECT key AS attr_30 FROM t1) AS sq_0
INNER JOIN
(SELECT key AS attr_35 FROM t1) AS sq_1
) AS sq_2
) AS sq_3
) AS sq_4
```
The outermost SELECT is used to turn the generated names back into the real names, and the innermost SELECT is used to alias the real columns to our generated names. Between them, there is no name ambiguity anymore.
## How was this patch tested?
existing tests and new tests in LogicalPlanToSQLSuite.
Author: Wenchen Fan <wenchen@databricks.com>
Closes#11658 from cloud-fan/gensql.
## What changes were proposed in this pull request?
https://issues.apache.org/jira/browse/SPARK-13894
Change the return type of the `SQLContext.range` API from `DataFrame` to `Dataset`.
## How was this patch tested?
No additional unit test required.
Author: Cheng Hao <hao.cheng@intel.com>
Closes#11730 from chenghao-intel/range.
## What changes were proposed in this pull request?
There is a feature of Hive SQL called multi-insert. For example:
```
FROM src
INSERT OVERWRITE TABLE dest1
SELECT key + 1
INSERT OVERWRITE TABLE dest2
SELECT key WHERE key > 2
INSERT OVERWRITE TABLE dest3
SELECT col EXPLODE(arr) exp AS col
...
```
We currently support it only partially, with two limitations:
1. The WHERE clause can't reference columns produced by LATERAL VIEW.
2. It's not executed eagerly, i.e. `sql("...multi-insert clause...")` doesn't take effect right away like other commands do, e.g. CREATE TABLE.
This PR removes these limitations and makes us fully support multi-insert.
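With this change, a multi-insert statement submitted through `sql(...)` runs eagerly. A sketch of the resulting behaviour, assuming a context named `sqlContext` and pre-existing tables `src`, `dest1`, and `dest2`:
```
// The multi-insert statement now executes as soon as sql() is called,
// just like other commands such as CREATE TABLE.
sqlContext.sql(
  """FROM src
    |INSERT OVERWRITE TABLE dest1 SELECT key + 1
    |INSERT OVERWRITE TABLE dest2 SELECT key WHERE key > 2
  """.stripMargin)
// Both dest1 and dest2 are populated at this point, without any further action.
```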
## How was this patch tested?
new tests in `SQLQuerySuite`
Author: Wenchen Fan <wenchen@databricks.com>
Closes#11754 from cloud-fan/lateral-view.
## What changes were proposed in this pull request?
Follow up to https://github.com/apache/spark/pull/11657
- Also update `String.getBytes("UTF-8")` to use `StandardCharsets.UTF_8` (see the short snippet after this list)
- And fix one last new Coverity warning that turned up (use of unguarded `wait()` replaced by simpler/more robust `java.util.concurrent` classes in tests)
- And while we're here cleaning up Coverity warnings, just fix about 15 more build warnings
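An illustrative snippet of the change the first bullet describes (not code from the patch itself):
```
import java.nio.charset.StandardCharsets

// The Charset overload avoids the runtime charset-name lookup and the
// UnsupportedEncodingException path of the String overload.
val before = "spark".getBytes("UTF-8")
val after  = "spark".getBytes(StandardCharsets.UTF_8)
```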
## How was this patch tested?
Jenkins tests
Author: Sean Owen <sowen@cloudera.com>
Closes#11725 from srowen/SPARK-13823.2.
## What changes were proposed in this pull request?
The purpose of [SPARK-12653](https://issues.apache.org/jira/browse/SPARK-12653) is to re-enable a regression test.
Historically, the target regression test was added by [SPARK-8498](093c34838d), but was temporarily disabled by [SPARK-12615](8ce645d4ee) due to a binary compatibility error.
The following is the current error message when submitting a Spark job with the pre-built `test.jar` file in the target regression test.
```
Exception in thread "main" java.lang.NoSuchMethodError: org.apache.spark.SparkContext$.$lessinit$greater$default$6()Lscala/collection/Map;
```
Simply rebuilding `test.jar` cannot restore the purpose of the test case, since we need to support both Scala 2.10 and 2.11 for a while. For example, we will face the following Scala 2.11 error if we use a `test.jar` built with Scala 2.10.
```
Exception in thread "main" java.lang.NoSuchMethodError: scala.reflect.api.JavaUniverse.runtimeMirror(Ljava/lang/ClassLoader;)Lscala/reflect/api/JavaMirrors$JavaMirror;
```
This PR replaces the existing `test.jar` with `test-2.10.jar` and `test-2.11.jar`, and improves the regression test to use the jar file that matches the running Scala version.
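A minimal sketch (not the actual test code) of how the matching jar could be picked; only the jar names come from the description above, the selection logic is an assumption:
```
// Derive the Scala binary version (e.g. "2.10" or "2.11") from the runtime.
val scalaBinaryVersion =
  scala.util.Properties.versionNumberString.split('.').take(2).mkString(".")
val testJar = s"test-$scalaBinaryVersion.jar"  // "test-2.10.jar" or "test-2.11.jar"
```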
## How was this patch tested?
Pass the existing Jenkins test.
Author: Dongjoon Hyun <dongjoon@apache.org>
Closes#11744 from dongjoon-hyun/SPARK-12653.
## What changes were proposed in this pull request?
This PR brings codegen support for broadcast left-semi join.
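As a usage sketch (the `orders` and `customers` DataFrames are hypothetical), this is the shape of query whose generated code path this PR covers:
```
import org.apache.spark.sql.functions.broadcast

// A left-semi join against an explicitly broadcast relation.
val active = orders.join(broadcast(customers), Seq("customer_id"), "leftsemi")
active.explain()  // the physical plan should show a broadcast hash join with LeftSemi
```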
## How was this patch tested?
Existing tests. Added a benchmark; the results show a 7X speedup.
Author: Davies Liu <davies@databricks.com>
Closes#11742 from davies/gen_semi.
## What changes were proposed in this pull request?
Change the return type of `toJSON` in the `Dataset` class
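A usage sketch, assuming the new return type is `Dataset[String]` and that a Dataset named `people` is in scope:
```
// toJSON now yields a typed Dataset of JSON strings (assumed signature).
val json: org.apache.spark.sql.Dataset[String] = people.toJSON
json.show()
```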
## How was this patch tested?
No additional unit test required.
Author: Stavros Kontopoulos <stavros.kontopoulos@typesafe.com>
Closes#11732 from skonto/fix_toJson.
## What changes were proposed in this pull request?
Our internal code can go through `SessionState.catalog` and `SessionState.analyzer`. This brings two small benefits:
1. Reduces internal dependency on SQLContext.
2. Removes 2 public methods in Java (Java does not obey package private visibility).
More importantly, according to the design in SPARK-13485, we'd need to claim this catalog function for the user-facing public functions, rather than having an internal field.
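A sketch of the access pattern described above (Spark-internal code only, since `sessionState` is not part of the public API); `sqlContext` is assumed to be in scope:
```
// Internal code reaches the catalog and analyzer through SessionState
// rather than through extra methods exposed on SQLContext.
val catalog  = sqlContext.sessionState.catalog
val analyzer = sqlContext.sessionState.analyzer
```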
## How was this patch tested?
Existing unit/integration test code.
Author: Reynold Xin <rxin@databricks.com>
Closes#11716 from rxin/SPARK-13893.
## What changes were proposed in this pull request?
In general, it is better for internal classes not to depend on the external class (in this case `SQLContext`), to reduce coupling between user-facing APIs and the internal implementation. This patch removes the `SQLContext` dependency from some internal classes such as `SparkPlanner` and `SparkOptimizer`.
As part of this patch, I also removed the following internal methods from SQLContext:
```
protected[sql] def functionRegistry: FunctionRegistry
protected[sql] def optimizer: Optimizer
protected[sql] def sqlParser: ParserInterface
protected[sql] def planner: SparkPlanner
protected[sql] def continuousQueryManager
protected[sql] def prepareForExecution: RuleExecutor[SparkPlan]
```
## How was this patch tested?
Existing unit/integration tests.
Author: Reynold Xin <rxin@databricks.com>
Closes#11712 from rxin/sqlContext-planner.