Commit graph

8339 commits

Author SHA1 Message Date
angerszhu 9f478a6832 [SPARK-28901][SQL] SparkThriftServer's Cancel SQL Operation show it in JDBC Tab UI
### What changes were proposed in this pull request?
Current Spark Thirft Server can't support cancel SQL job,  when we use Hue to query throgh Spark Thrift Server, when we run a sql and then click cancel button to cancel this sql, we will it won't work in backend and in the spark JDBC UI tab, we can see the SQL's status is always COMPILED, then the duration of SQL is always increasing, this may make people confused.

![image](https://user-images.githubusercontent.com/46485123/63869830-60338f00-c9eb-11e9-8776-cee965adcb0a.png)

### Why are the changes needed?

If sql status can't reflect sql's true status, it will make user confused.

### Does this PR introduce any user-facing change?

SparkthriftServer's UI tab will show SQL's status in CANCELED when we cancel a SQL .

### How was this patch tested?
Manuel tested

UI TAB Status
![image](https://user-images.githubusercontent.com/46485123/63915010-80a12f00-ca67-11e9-9342-830dfa9c719f.png)

![image](https://user-images.githubusercontent.com/46485123/63915084-a9292900-ca67-11e9-8e26-375bf8ce0963.png)

backend log
![image](https://user-images.githubusercontent.com/46485123/63914864-1092a900-ca67-11e9-93f2-08690ed9abf4.png)

Closes #25611 from AngersZhuuuu/SPARK-28901.

Authored-by: angerszhu <angers.zhu@gmail.com>
Signed-off-by: Xiao Li <gatorsmile@gmail.com>
2019-09-04 09:20:51 -07:00
Ryan Blue 5ea134c354 [SPARK-28628][SQL] Implement SupportsNamespaces in V2SessionCatalog
## What changes were proposed in this pull request?

This adds namespace support to V2SessionCatalog.

## How was this patch tested?

WIP: will add tests for v2 session catalog namespace methods.

Closes #25363 from rdblue/SPARK-28628-support-namespaces-in-v2-session-catalog.

Authored-by: Ryan Blue <blue@apache.org>
Signed-off-by: Burak Yavuz <brkyvz@gmail.com>
2019-09-03 13:13:27 -07:00
Xianjin YE d5688dc732 [SPARK-28573][SQL] Convert InsertIntoTable(HiveTableRelation) to DataSource inserting for partitioned table
## What changes were proposed in this pull request?
Datasource table now supports partition tables long ago. This commit adds the ability to translate
the InsertIntoTable(HiveTableRelation) to datasource table insertion.

## How was this patch tested?
Existing tests with some modification

Closes #25306 from advancedxy/SPARK-28573.

Authored-by: Xianjin YE <advancedxy@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2019-09-03 13:40:06 +08:00
sandeep katta e1946a598b [SPARK-28705][SQL][TEST] Drop tables after being used in AnalysisExternalCatalogSuite
## What changes were proposed in this pull request?

drop the table after the test `query builtin functions don't call the external catalog`  executed

This is required for [SPARK-25464](https://github.com/apache/spark/pull/22466)

## How was this patch tested?

existing UT

Closes #25427 from sandeep-katta/cleanuptable.

Authored-by: sandeep katta <sandeep.katta2007@gmail.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2019-09-02 20:32:32 +09:00
HyukjinKwon bd3915e356 Revert "[SPARK-28612][SQL] Add DataFrameWriterV2 API"
This reverts commit 3821d75b83.
2019-09-02 12:47:14 +09:00
Sean Owen eb037a8180 [SPARK-28855][CORE][ML][SQL][STREAMING] Remove outdated usages of Experimental, Evolving annotations
### What changes were proposed in this pull request?

The Experimental and Evolving annotations are both (like Unstable) used to express that a an API may change. However there are many things in the code that have been marked that way since even Spark 1.x. Per the dev thread, anything introduced at or before Spark 2.3.0 is pretty much 'stable' in that it would not change without a deprecation cycle. Therefore I'd like to remove most of these annotations. And, remove the `:: Experimental ::` scaladoc tag too. And likewise for Python, R.

The changes below can be summarized as:
- Generally, anything introduced at or before Spark 2.3.0 has been unmarked as neither Evolving nor Experimental
- Obviously experimental items like DSv2, Barrier mode, ExperimentalMethods are untouched
- I _did_ unmark a few MLlib classes introduced in 2.4, as I am quite confident they're not going to change (e.g. KolmogorovSmirnovTest, PowerIterationClustering)

It's a big change to review, so I'd suggest scanning the list of _files_ changed to see if any area seems like it should remain partly experimental and examine those.

### Why are the changes needed?

Many of these annotations are incorrect; the APIs are de facto stable. Leaving them also makes legitimate usages of the annotations less meaningful.

### Does this PR introduce any user-facing change?

No.

### How was this patch tested?

Existing tests.

Closes #25558 from srowen/SPARK-28855.

Authored-by: Sean Owen <sean.owen@databricks.com>
Signed-off-by: Sean Owen <sean.owen@databricks.com>
2019-09-01 10:15:00 -05:00
Ryan Blue 3821d75b83 [SPARK-28612][SQL] Add DataFrameWriterV2 API
## What changes were proposed in this pull request?

This adds a new write API as proposed in the [SPIP to standardize logical plans](https://issues.apache.org/jira/browse/SPARK-23521). This new API:

* Uses clear verbs to execute writes, like `append`, `overwrite`, `create`, and `replace` that correspond to the new logical plans.
* Only creates v2 logical plans so the behavior is always consistent.
* Does not allow table configuration options for operations that cannot change table configuration. For example, `partitionedBy` can only be called when the writer executes `create` or `replace`.

Here are a few example uses of the new API:

```scala
df.writeTo("catalog.db.table").append()
df.writeTo("catalog.db.table").overwrite($"date" === "2019-06-01")
df.writeTo("catalog.db.table").overwritePartitions()
df.writeTo("catalog.db.table").asParquet.create()
df.writeTo("catalog.db.table").partitionedBy(days($"ts")).createOrReplace()
df.writeTo("catalog.db.table").using("abc").replace()
```

## How was this patch tested?

Added `DataFrameWriterV2Suite` that tests the new write API. Existing tests for v2 plans.

Closes #25354 from rdblue/SPARK-28612-add-data-frame-writer-v2.

Authored-by: Ryan Blue <blue@apache.org>
Signed-off-by: Burak Yavuz <brkyvz@gmail.com>
2019-08-31 21:28:20 -07:00
HyukjinKwon 7cc0f0e9a7 [SPARK-28894][SQL][TESTS] Add a clue to make it easier to debug via Jenkins's test results
### What changes were proposed in this pull request?

See https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/109834/testReport/junit/org.apache.spark.sql/SQLQueryTestSuite/

![Screen Shot 2019-08-28 at 4 08 58 PM](https://user-images.githubusercontent.com/6477701/63833484-2a23ea00-c9ae-11e9-91a1-0859cb183fea.png)

```xml
<?xml version="1.0" encoding="UTF-8"?>
<testsuite hostname="C02Y52ZLJGH5" name="org.apache.spark.sql.SQLQueryTestSuite" tests="3" errors="0" failures="0" skipped="0" time="14.475">
    ...
    <testcase classname="org.apache.spark.sql.SQLQueryTestSuite" name="sql - Scala UDF" time="6.703">
    </testcase>
    <testcase classname="org.apache.spark.sql.SQLQueryTestSuite" name="sql - Regular Python UDF" time="4.442">
    </testcase>
    <testcase classname="org.apache.spark.sql.SQLQueryTestSuite" name="sql - Scalar Pandas UDF" time="3.33">
    </testcase>
    <system-out/>
    <system-err/>
</testsuite>
```

Root cause seems a bug in SBT - it truncates the test name based on the last dot.

https://github.com/sbt/sbt/issues/2949
https://github.com/sbt/sbt/blob/v0.13.18/testing/src/main/scala/sbt/JUnitXmlTestsListener.scala#L71-L79

I tried to find a better way but couldn't find. Therefore, this PR proposes a workaround by appending the test file name into the assert log:

```diff
  [info] - inner-join.sql *** FAILED *** (4 seconds, 306 milliseconds)
+ [info]   inner-join.sql
  [info]   Expected "1	a
  [info]   1	a
  [info]   1	b
  [info]   1[]", but got "1	a
  [info]   1	a
  [info]   1	b
  [info]   1[	b]" Result did not match for query #6
  [info]   SELECT tb.* FROM ta INNER JOIN tb ON ta.a = tb.a AND ta.tag = tb.tag (SQLQueryTestSuite.scala:377)
  [info]   org.scalatest.exceptions.TestFailedException:
  [info]   at org.scalatest.Assertions.newAssertionFailedException(Assertions.scala:528)
```

It will at least prevent us to search full logs to identify which test file is failed by clicking filed test.

Note that this PR does not fully fix the issue but only fix the logs on its failed tests.

### Why are the changes needed?
To debug Jenkins logs easier. Otherwise, we should open full logs and search which test was failed.

### Does this PR introduce any user-facing change?
It will print out the file name of failed tests in Jenkins' test reports.

### How was this patch tested?
Manually tested but Jenkins tests are required in this PR.

Now it at least shows which file it is:

![Screen Shot 2019-08-30 at 10 16 32 PM](https://user-images.githubusercontent.com/6477701/64023705-de22a200-cb73-11e9-8806-2e98ad35adef.png)

Closes #25630 from HyukjinKwon/SPARK-28894-1.

Authored-by: HyukjinKwon <gurwls223@apache.org>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2019-08-30 15:10:40 -07:00
younggyu chun 3b07a4eb28 [SPARK-27931][SQL] Accept "true", "yes", "1", "false", "no", "0", and unique prefixes as input and trim input for the boolean data type
## What changes were proposed in this pull request?
This PR aims to add "true", "yes", "1", "false", "no", "0", and unique prefixes as input for the boolean data type and ignore input whitespace. Please see the following what string representations are using for the boolean type in other databases.

https://www.postgresql.org/docs/devel/datatype-boolean.html
https://docs.aws.amazon.com/redshift/latest/dg/r_Boolean_type.html

## How was this patch tested?
Added new tests to CastSuite.

Closes #25458 from younggyuchun/SPARK-27931.

Authored-by: younggyu chun <younggyuchun@gmail.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2019-08-30 14:18:13 -07:00
Burak Yavuz 827969399b [SPARK-28668][SQL] Support V2SessionCatalog for ALTER TABLE
### What changes were proposed in this pull request?

Adds support for the V2SessionCatalog for ALTER TABLE statements.
Implementation changes are ~50 loc. The rest is just test refactoring.

### Why are the changes needed?
To allow V2 DataSources to plug in through a configurable plugin interface without requiring the explicit use of catalog identifiers, and leverage ALTER TABLE statements.

### How was this patch tested?

By re-using existing tests in DataSourceV2SQLSuite.

Closes #25502 from brkyvz/alterV3.

Authored-by: Burak Yavuz <brkyvz@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2019-08-30 14:16:47 +08:00
Wenchen Fan f8f7c52f12 [SPARK-28899][SQL][TEST] merge the testing in-memory v2 catalogs from catalyst and core
### What changes were proposed in this pull request?

There are 2 in-memory `TableCatalog` and `Table` implementations for testing, in sql/catalyst and sql/core. This PR merges them.

After merging, there are 3 classes:
1. `InMemoryTable`
2. `InMemoryTableCatalog`
3. `StagingInMemoryTableCatalog`

For better maintainability, these 3 classes are put in 3 different files.

### Why are the changes needed?

reduce duplicated code

### Does this PR introduce any user-facing change?

no
### How was this patch tested?

N/A

Closes #25610 from cloud-fan/dsv2-test.

Authored-by: Wenchen Fan <wenchen@databricks.com>
Signed-off-by: Ryan Blue <blue@apache.org>
2019-08-29 12:56:19 -07:00
Gengliang Wang 24655583f1 [SPARK-28495][SQL][FOLLOW-UP] Disallow conversions between timestamp and long in ASNI mode
### What changes were proposed in this pull request?

Disallow conversions between `timestamp` type and `long` type in table insertion with ANSI store assignment policy.

### Why are the changes needed?

In the PR https://github.com/apache/spark/pull/25581, timestamp type is allowed to be converted to long type, since timestamp type is represented by long type internally, and both legacy mode and strict mode allows the conversion.

After reconsideration, I think we should disallow it. As per ANSI SQL section "4.4.2 Characteristics of numbers":
> A number is assignable only to sites of numeric type.

In PostgreSQL, the conversion between timestamp and long is also disallowed.

### Does this PR introduce any user-facing change?

Conversion between timestamp and long is disallowed in table insertion with ANSI store assignment policy.

### How was this patch tested?

Unit test

Closes #25615 from gengliangwang/disallowTimeStampToLong.

Authored-by: Gengliang Wang <gengliang.wang@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2019-08-29 19:59:24 +08:00
Matt Hawes 137b20b964 [SPARK-28818][SQL] Respect source column nullability in the arrays created by freqItems()
### What changes were proposed in this pull request?
This PR replaces the hard-coded non-nullability of the array elements returned by `freqItems()` with a nullability that reflects the original schema. Essentially [the functional change](https://github.com/apache/spark/pull/25575/files#diff-bf59bb9f3dc351f5bf6624e5edd2dcf4R122) to the schema generation is:
```
StructField(name + "_freqItems", ArrayType(dataType, false))
```
Becomes:
```
StructField(name + "_freqItems", ArrayType(dataType, originalField.nullable))
```

Respecting the original nullability prevents issues when Spark depends on `ArrayType`'s `containsNull` being accurate. The example that uncovered this is calling `collect()` on the dataframe (see [ticket](https://issues.apache.org/jira/browse/SPARK-28818) for full repro). Though it's likely that there a several places where this could cause a problem.

I've also refactored a small amount of the surrounding code to remove some unnecessary steps and group together related operations.

### Why are the changes needed?
I think it's pretty clear why this change is needed. It fixes a bug that currently prevents users from calling `df.freqItems.collect()` along with potentially causing other, as yet unknown, issues.

### Does this PR introduce any user-facing change?
Nullability of columns when calling freqItems on them is now respected after the change.

### How was this patch tested?
I added a test that specifically tests the carry-through of the nullability as well as explicitly calling `collect()` to catch the exact regression that was observed. I also ran the test against the old version of the code and it fails as expected.

Closes #25575 from MGHawes/mhawes/SPARK-28818.

Authored-by: Matt Hawes <mhawes@palantir.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2019-08-29 10:49:10 +09:00
Yuming Wang 1b404b9b99 [SPARK-28890][SQL] Upgrade Hive Metastore Client to the 3.1.2 for Hive 3.1
### What changes were proposed in this pull request?

Hive 3.1.2 has been released. This PR upgrades the Hive Metastore Client to 3.1.2 for Hive 3.1.

Hive 3.1.2 release notes:
https://issues.apache.org/jira/secure/ReleaseNote.jspa?version=12344397&styleName=Html&projectId=12310843

### Why are the changes needed?

This is an improvement to support a newly release 3.1.2. Otherwise, it will throws `UnsupportedOperationException` if user `set spark.sql.hive.metastore.version=3.1.2`:
```scala
Exception in thread "main" java.lang.UnsupportedOperationException: Unsupported Hive Metastore version (3.1.2). Please set spark.sql.hive.metastore.version with a valid version.
	at org.apache.spark.sql.hive.client.IsolatedClientLoader$.hiveVersion(IsolatedClientLoader.scala:109)
```

### Does this PR introduce any user-facing change?
No.

### How was this patch tested?
Existing UT

Closes #25604 from wangyum/SPARK-28890.

Authored-by: Yuming Wang <yumwang@ebay.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2019-08-28 09:16:54 -07:00
Gengliang Wang 9d6bec183c [SPARK-28730][SPARK-28495][SQL][FOLLOW-UP] Revise the doc of option spark.sql.storeAssignmentPolicy
### What changes were proposed in this pull request?

Revise the documentation of SQL option `spark.sql.storeAssignmentPolicy`.

### Why are the changes needed?

1. Need to point out the ANSI mode is mostly the same with PostgreSQL
2. Need to point out Legacy mode allows type coercion as long as it is valid casting
3. Better examples.

### Does this PR introduce any user-facing change?

No

### How was this patch tested?

Uni test

Closes #25605 from gengliangwang/reviseDoc.

Authored-by: Gengliang Wang <gengliang.wang@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2019-08-28 19:59:53 +08:00
Yuming Wang e3b32da027 [SPARK-25474][SQL][DOCS] Update the docs for spark.sql.statistics.fallBackToHdfs
## What changes were proposed in this pull request?

This PR update `spark.sql.statistics.fallBackToHdfs`'s doc:
1. This flag is effective only if it is Hive table.
2. For non-partitioned data source table, it will be automatically recalculated if table statistics are not available
3. For partitioned data source table, It is 'spark.sql.defaultSizeInBytes' if table statistics are not available.

Related code:
- Non-partitioned data source table:
[SizeInBytesOnlyStatsPlanVisitor.default()](98be8953c7/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/statsEstimation/SizeInBytesOnlyStatsPlanVisitor.scala (L54-L57)) -> [LogicalRelation.computeStats()](a1c1dd3484/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/LogicalRelation.scala (L42-L46)) -> [HadoopFsRelation.sizeInBytes()](c0632cec04/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/HadoopFsRelation.scala (L72-L75)) -> [PartitioningAwareFileIndex.sizeInBytes()](b276788d57/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/PartitioningAwareFileIndex.scala (L103))
`PartitioningAwareFileIndex.sizeInBytes()` is calculated by [`allFiles().map(_.getLen).sum`](b276788d57/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/PartitioningAwareFileIndex.scala (L103)) if table statistics are not available.

- Partitioned data source table:
[SizeInBytesOnlyStatsPlanVisitor.default()](98be8953c7/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/statsEstimation/SizeInBytesOnlyStatsPlanVisitor.scala (L54-L57)) -> [LogicalRelation.computeStats()](a1c1dd3484/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/LogicalRelation.scala (L42-L46)) -> [CatalogFileIndex.sizeInBytes](5d672b7f3e/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/CatalogFileIndex.scala (L41))
`CatalogFileIndex.sizeInBytes` is [spark.sql.defaultSizeInBytes](c30b5297bc/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/DataSource.scala (L387)) if table statistics are not available.

## How was this patch tested?

N/A

Closes #24715 from wangyum/SPARK-25474.

Authored-by: Yuming Wang <yumwang@ebay.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2019-08-28 19:15:26 +08:00
hemanth meka 6252c54e39 [SPARK-23519][SQL] create view should work from query with duplicate output columns
**What changes were proposed in this pull request?**

Moving the call for checkColumnNameDuplication out of generateViewProperties. This way we can choose ifcheckColumnNameDuplication will be performed on analyzed or aliased plan without having to pass an additional argument(aliasedPlan) to generateViewProperties.

Before the pr column name duplication was performed on the query output of below sql(c1, c1) and the pr makes it perform check on the user provided schema of view definition(c1, c2)

**Why are the changes needed?**

Changes are to fix SPARK-23519 bug. Below queries would cause an exception. This pr fixes them and also added a test case.

`CREATE TABLE t23519 AS SELECT 1 AS c1
CREATE VIEW v23519 (c1, c2) AS SELECT c1, c1 FROM t23519`

**Does this PR introduce any user-facing change?**
No

**How was this patch tested?**
new unit test added in SQLViewSuite

Closes #25570 from hem1891/SPARK-23519.

Lead-authored-by: hemanth meka <hmeka@tibco.com>
Co-authored-by: hem1891 <hem1891@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2019-08-28 12:11:10 +08:00
Wenchen Fan 90b10b4f7a [HOT-FIX] fix compilation
This is caused by 2 PRs that were merged at the same time:
cb06209fc9
2b24a71fec

Closes #25597 from cloud-fan/hot-fix.

Authored-by: Wenchen Fan <wenchen@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2019-08-27 23:30:44 +08:00
Gengliang Wang 2b24a71fec [SPARK-28495][SQL] Introduce ANSI store assignment policy for table insertion
### What changes were proposed in this pull request?
 Introduce ANSI store assignment policy for table insertion.
With ANSI policy, Spark performs the type coercion of table insertion as per ANSI SQL.

### Why are the changes needed?
In Spark version 2.4 and earlier, when inserting into a table, Spark will cast the data type of input query to the data type of target table by coercion. This can be super confusing, e.g. users make a mistake and write string values to an int column.

In data source V2, by default, only upcasting is allowed when inserting data into a table. E.g. int -> long and int -> string are allowed, while decimal -> double or long -> int are not allowed. The rules of UpCast was originally created for Dataset type coercion. They are quite strict and different from the behavior of all existing popular DBMS. This is breaking change. It is possible that existing queries are broken after 3.0 releases.

Following ANSI SQL standard makes Spark consistent with the table insertion behaviors of popular DBMS like PostgreSQL/Oracle/Mysql.

### Does this PR introduce any user-facing change?
A new optional mode for table insertion.

### How was this patch tested?
Unit test

Closes #25581 from gengliangwang/ANSImode.

Authored-by: Gengliang Wang <gengliang.wang@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2019-08-27 22:13:23 +08:00
WeichenXu 7f605f5559 [SPARK-28621][SQL] Make spark.sql.crossJoin.enabled default value true
### What changes were proposed in this pull request?

Make `spark.sql.crossJoin.enabled` default value true

### Why are the changes needed?

For implicit cross join, we can set up a watchdog to cancel it if running for a long time.
When "spark.sql.crossJoin.enabled" is false, because `CheckCartesianProducts` is implemented in logical plan stage, it may generate some mismatching error which may confuse end user:
* it's done in logical phase, so we may fail queries that can be executed via broadcast join, which is very fast.
* if we move the check to the physical phase, then a query may success at the beginning, and begin to fail when the table size gets larger (other people insert data to the table). This can be quite confusing.
* the CROSS JOIN syntax doesn't work well if join reorder happens.
* some non-equi-join will generate plan using cartesian product, but `CheckCartesianProducts` do not detect it and raise error.

So that in order to address this in simpler way, we can turn off showing this cross-join error by default.

For reference, I list some cases raising mismatching error here:
Providing:
```
spark.range(2).createOrReplaceTempView("sm1") // can be broadcast
spark.range(50000000).createOrReplaceTempView("bg1") // cannot be broadcast
spark.range(60000000).createOrReplaceTempView("bg2") // cannot be broadcast
```
1) Some join could be convert to broadcast nested loop join, but CheckCartesianProducts raise error. e.g.
```
select sm1.id, bg1.id from bg1 join sm1 where sm1.id < bg1.id
```
2) Some join will run by CartesianJoin but CheckCartesianProducts DO NOT raise error. e.g.
```
select bg1.id, bg2.id from bg1 join bg2 where bg1.id < bg2.id
```

### Does this PR introduce any user-facing change?

### How was this patch tested?

Closes #25520 from WeichenXu123/SPARK-28621.

Authored-by: WeichenXu <weichen.xu@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2019-08-27 21:53:37 +08:00
Yuming Wang e12da8b957 [SPARK-28876][SQL] fallBackToHdfs should not support Hive partitioned table
### What changes were proposed in this pull request?

This PR makes `spark.sql.statistics.fallBackToHdfs` not support Hive partitioned tables.

### Why are the changes needed?

The current implementation is incorrect for external partitions and it is expensive to support partitioned table with external partitions.

### Does this PR introduce any user-facing change?
Yes.  But I think it will not change the join strategy because partitioned table usually very large.

### How was this patch tested?
unit test

Closes #25584 from wangyum/SPARK-28876.

Authored-by: Yuming Wang <yumwang@ebay.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2019-08-27 21:37:18 +08:00
Yuming Wang 96179732aa [SPARK-27592][SQL][TEST][FOLLOW-UP] Test set the partitioned bucketed data source table SerDe correctly
### What changes were proposed in this pull request?
This PR add test for set the partitioned bucketed data source table SerDe correctly.

### Why are the changes needed?
Improve test.

### Does this PR introduce any user-facing change?
No.

### How was this patch tested?
N/A

Closes #25591 from wangyum/SPARK-27592-f1.

Authored-by: Yuming Wang <yumwang@ebay.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2019-08-27 21:10:58 +08:00
Wenchen Fan cb06209fc9 [SPARK-28747][SQL] merge the two data source v2 fallback configs
## What changes were proposed in this pull request?

Currently we have 2 configs to specify which v2 sources should fallback to v1 code path. One config for read path, and one config for write path.

However, I found it's awkward to work with these 2 configs:
1. for `CREATE TABLE USING format`, should this be read path or write path?
2. for `V2SessionCatalog.loadTable`,  we need to return `UnresolvedTable` if it's a DS v1 or we need to fallback to v1 code path. However, at that time, we don't know if the returned table will be used for read or write.

We don't have any new features or perf improvement in file source v2. The fallback API is just a safeguard if we have bugs in v2 implementations. There are not many benefits to support falling back to v1 for read and write path separately.

This PR proposes to merge these 2 configs into one.

## How was this patch tested?

existing tests

Closes #25465 from cloud-fan/merge-conf.

Authored-by: Wenchen Fan <wenchen@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2019-08-27 20:47:24 +08:00
Yuming Wang ab1819d38a [SPARK-28527][SQL][TEST][FOLLOW-UP] Ignores Thrift server ThriftServerQueryTestSuite
### What changes were proposed in this pull request?

This PR ignores Thrift server `ThriftServerQueryTestSuite`.

### Why are the changes needed?

This ThriftServerQueryTestSuite test case led to frequent Jenkins build failure.

### Does this PR introduce any user-facing change?

Yes.

### How was this patch tested?
N/A

Closes #25592 from wangyum/SPARK-28527-f1.

Authored-by: Yuming Wang <yumwang@ebay.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2019-08-27 15:41:22 +09:00
Burak Yavuz e31aec9be4 [SPARK-28667][SQL] Support InsertInto through the V2SessionCatalog
### What changes were proposed in this pull request?

This PR adds support for INSERT INTO through both the SQL and DataFrameWriter APIs through the V2SessionCatalog.

### Why are the changes needed?

This will allow V2 tables to be plugged in through the V2SessionCatalog, and be used seamlessly with existing APIs.

### Does this PR introduce any user-facing change?

No behavior changes.

### How was this patch tested?

Pulled out a lot of tests so that they can be shared across the DataFrameWriter and SQL code paths.

Closes #25507 from brkyvz/insertSesh.

Lead-authored-by: Burak Yavuz <brkyvz@gmail.com>
Co-authored-by: Burak Yavuz <burak@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2019-08-27 12:59:53 +08:00
Yuming Wang 6e12b585a9 [SPARK-28527][SQL][TEST] Re-run all the tests in SQLQueryTestSuite via Thrift Server
### What changes were proposed in this pull request?
This PR build a test framework that directly re-run all the tests in `SQLQueryTestSuite` via Thrift Server. But it's a little different from `SQLQueryTestSuite`:
1. Can not support [UDF testing](44e607e921/sql/core/src/test/scala/org/apache/spark/sql/SQLQueryTestSuite.scala (L293-L297)).
2. Can not support `DESC` command and `SHOW` command because `SQLQueryTestSuite` [formatted the output](1882912cca/sql/core/src/main/scala/org/apache/spark/sql/execution/HiveResult.scala (L38-L50).).

When building this framework, found two bug:
[SPARK-28624](https://issues.apache.org/jira/browse/SPARK-28624): `make_date` is inconsistent when reading from table
[SPARK-28611](https://issues.apache.org/jira/browse/SPARK-28611): Histogram's height is different

found two features that ThriftServer can not support:
[SPARK-28636](https://issues.apache.org/jira/browse/SPARK-28636): ThriftServer can not support decimal type with negative scale
[SPARK-28637](https://issues.apache.org/jira/browse/SPARK-28637): ThriftServer can not support interval type

Also, found two inconsistent behavior:
[SPARK-28620](https://issues.apache.org/jira/browse/SPARK-28620): Double type returned for float type in Beeline/JDBC
[SPARK-28619](https://issues.apache.org/jira/browse/SPARK-28619):  The golden result file is different when tested by `bin/spark-sql`

### Why are the changes needed?

Improve the overall test coverage for Thrift Server.

### Does this PR introduce any user-facing change?
No

### How was this patch tested?
N/A

Closes #25567 from wangyum/SPARK-28527.

Lead-authored-by: Yuming Wang <yumwang@ebay.com>
Co-authored-by: Hyukjin Kwon <gurwls223@gmail.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2019-08-26 22:39:57 +09:00
Dilip Biswal c61270fd74 [SPARK-27395][SQL] Improve EXPLAIN command
## What changes were proposed in this pull request?
This PR aims at improving the way physical plans are explained in spark.

Currently, the explain output for physical plan may look very cluttered and each operator's
string representation can be very wide and wraps around in the display making it little
hard to follow. This especially happens when explaining a query 1) Operating on wide tables
2) Has complex expressions etc.

This PR attempts to split the output into two sections. In the header section, we display
the basic operator tree with a number associated with each operator. In this section, we strictly
control what we output for each operator. In the footer section, each operator is verbosely
displayed. Based on the feedback from Maryann, the uncorrelated subqueries (SubqueryExecs) are not included in the main plan. They are printed separately after the main plan and can be
correlated by the originating expression id from its parent plan.

To illustrate, here is a simple plan displayed in old vs new way.

Example query1 :
```
EXPLAIN SELECT key, Max(val) FROM explain_temp1 WHERE key > 0 GROUP BY key HAVING max(val) > 0
```

Old :
```
*(2) Project [key#2, max(val)#15]
+- *(2) Filter (isnotnull(max(val#3)#18) AND (max(val#3)#18 > 0))
   +- *(2) HashAggregate(keys=[key#2], functions=[max(val#3)], output=[key#2, max(val)#15, max(val#3)#18])
      +- Exchange hashpartitioning(key#2, 200)
         +- *(1) HashAggregate(keys=[key#2], functions=[partial_max(val#3)], output=[key#2, max#21])
            +- *(1) Project [key#2, val#3]
               +- *(1) Filter (isnotnull(key#2) AND (key#2 > 0))
                  +- *(1) FileScan parquet default.explain_temp1[key#2,val#3] Batched: true, DataFilters: [isnotnull(key#2), (key#2 > 0)], Format: Parquet, Location: InMemoryFileIndex[file:/user/hive/warehouse/explain_temp1], PartitionFilters: [], PushedFilters: [IsNotNull(key), GreaterThan(key,0)], ReadSchema: struct<key:int,val:int>
```
New :
```
Project (8)
+- Filter (7)
   +- HashAggregate (6)
      +- Exchange (5)
         +- HashAggregate (4)
            +- Project (3)
               +- Filter (2)
                  +- Scan parquet default.explain_temp1 (1)

(1) Scan parquet default.explain_temp1 [codegen id : 1]
Output: [key#2, val#3]

(2) Filter [codegen id : 1]
Input     : [key#2, val#3]
Condition : (isnotnull(key#2) AND (key#2 > 0))

(3) Project [codegen id : 1]
Output    : [key#2, val#3]
Input     : [key#2, val#3]

(4) HashAggregate [codegen id : 1]
Input: [key#2, val#3]

(5) Exchange
Input: [key#2, max#11]

(6) HashAggregate [codegen id : 2]
Input: [key#2, max#11]

(7) Filter [codegen id : 2]
Input     : [key#2, max(val)#5, max(val#3)#8]
Condition : (isnotnull(max(val#3)#8) AND (max(val#3)#8 > 0))

(8) Project [codegen id : 2]
Output    : [key#2, max(val)#5]
Input     : [key#2, max(val)#5, max(val#3)#8]
```

Example Query2 (subquery):
```
SELECT * FROM   explain_temp1 WHERE  KEY = (SELECT Max(KEY) FROM   explain_temp2 WHERE  KEY = (SELECT Max(KEY) FROM   explain_temp3 WHERE  val > 0) AND val = 2) AND val > 3
```
Old:
```
*(1) Project [key#2, val#3]
+- *(1) Filter (((isnotnull(KEY#2) AND isnotnull(val#3)) AND (KEY#2 = Subquery scalar-subquery#39)) AND (val#3 > 3))
   :  +- Subquery scalar-subquery#39
   :     +- *(2) HashAggregate(keys=[], functions=[max(KEY#26)], output=[max(KEY)#45])
   :        +- Exchange SinglePartition
   :           +- *(1) HashAggregate(keys=[], functions=[partial_max(KEY#26)], output=[max#47])
   :              +- *(1) Project [key#26]
   :                 +- *(1) Filter (((isnotnull(KEY#26) AND isnotnull(val#27)) AND (KEY#26 = Subquery scalar-subquery#38)) AND (val#27 = 2))
   :                    :  +- Subquery scalar-subquery#38
   :                    :     +- *(2) HashAggregate(keys=[], functions=[max(KEY#28)], output=[max(KEY)#43])
   :                    :        +- Exchange SinglePartition
   :                    :           +- *(1) HashAggregate(keys=[], functions=[partial_max(KEY#28)], output=[max#49])
   :                    :              +- *(1) Project [key#28]
   :                    :                 +- *(1) Filter (isnotnull(val#29) AND (val#29 > 0))
   :                    :                    +- *(1) FileScan parquet default.explain_temp3[key#28,val#29] Batched: true, DataFilters: [isnotnull(val#29), (val#29 > 0)], Format: Parquet, Location: InMemoryFileIndex[file:/user/hive/warehouse/explain_temp3], PartitionFilters: [], PushedFilters: [IsNotNull(val), GreaterThan(val,0)], ReadSchema: struct<key:int,val:int>
   :                    +- *(1) FileScan parquet default.explain_temp2[key#26,val#27] Batched: true, DataFilters: [isnotnull(key#26), isnotnull(val#27), (val#27 = 2)], Format: Parquet, Location: InMemoryFileIndex[file:/user/hive/warehouse/explain_temp2], PartitionFilters: [], PushedFilters: [IsNotNull(key), IsNotNull(val), EqualTo(val,2)], ReadSchema: struct<key:int,val:int>
   +- *(1) FileScan parquet default.explain_temp1[key#2,val#3] Batched: true, DataFilters: [isnotnull(key#2), isnotnull(val#3), (val#3 > 3)], Format: Parquet, Location: InMemoryFileIndex[file:/user/hive/warehouse/explain_temp1], PartitionFilters: [], PushedFilters: [IsNotNull(key), IsNotNull(val), GreaterThan(val,3)], ReadSchema: struct<key:int,val:int>
```
New:
```
Project (3)
+- Filter (2)
   +- Scan parquet default.explain_temp1 (1)

(1) Scan parquet default.explain_temp1 [codegen id : 1]
Output: [key#2, val#3]

(2) Filter [codegen id : 1]
Input     : [key#2, val#3]
Condition : (((isnotnull(KEY#2) AND isnotnull(val#3)) AND (KEY#2 = Subquery scalar-subquery#23)) AND (val#3 > 3))

(3) Project [codegen id : 1]
Output    : [key#2, val#3]
Input     : [key#2, val#3]
===== Subqueries =====

Subquery:1 Hosting operator id = 2 Hosting Expression = Subquery scalar-subquery#23
HashAggregate (9)
+- Exchange (8)
   +- HashAggregate (7)
      +- Project (6)
         +- Filter (5)
            +- Scan parquet default.explain_temp2 (4)

(4) Scan parquet default.explain_temp2 [codegen id : 1]
Output: [key#26, val#27]

(5) Filter [codegen id : 1]
Input     : [key#26, val#27]
Condition : (((isnotnull(KEY#26) AND isnotnull(val#27)) AND (KEY#26 = Subquery scalar-subquery#22)) AND (val#27 = 2))

(6) Project [codegen id : 1]
Output    : [key#26]
Input     : [key#26, val#27]

(7) HashAggregate [codegen id : 1]
Input: [key#26]

(8) Exchange
Input: [max#35]

(9) HashAggregate [codegen id : 2]
Input: [max#35]

Subquery:2 Hosting operator id = 5 Hosting Expression = Subquery scalar-subquery#22
HashAggregate (15)
+- Exchange (14)
   +- HashAggregate (13)
      +- Project (12)
         +- Filter (11)
            +- Scan parquet default.explain_temp3 (10)

(10) Scan parquet default.explain_temp3 [codegen id : 1]
Output: [key#28, val#29]

(11) Filter [codegen id : 1]
Input     : [key#28, val#29]
Condition : (isnotnull(val#29) AND (val#29 > 0))

(12) Project [codegen id : 1]
Output    : [key#28]
Input     : [key#28, val#29]

(13) HashAggregate [codegen id : 1]
Input: [key#28]

(14) Exchange
Input: [max#37]

(15) HashAggregate [codegen id : 2]
Input: [max#37]
```

Note:
I opened this PR as a WIP to start getting feedback. I will be on vacation starting tomorrow
would not be able to immediately incorporate the feedback. I will start to
work on them as soon as i can. Also, currently this PR provides a basic infrastructure
for explain enhancement. The details about individual operators will be implemented
in follow-up prs
## How was this patch tested?
Added a new test `explain.sql` that tests basic scenarios. Need to add more tests.

Closes #24759 from dilipbiswal/explain_feature.

Authored-by: Dilip Biswal <dbiswal@us.ibm.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2019-08-26 20:37:13 +08:00
Yuming Wang c353a84d1a [SPARK-28642][SQL][TEST][FOLLOW-UP] Test spark.sql.redaction.options.regex with and without default values
### What changes were proposed in this pull request?

Test `spark.sql.redaction.options.regex` with and without  default values.

### Why are the changes needed?

Normally, we do not rely on the default value of `spark.sql.redaction.options.regex`.

### Does this PR introduce any user-facing change?
No

### How was this patch tested?
N/A

Closes #25579 from wangyum/SPARK-28642-f1.

Authored-by: Yuming Wang <yumwang@ebay.com>
Signed-off-by: Xiao Li <gatorsmile@gmail.com>
2019-08-25 23:12:16 -07:00
Yuming Wang adb506afd7 [SPARK-28852][SQL] Implement SparkGetCatalogsOperation for Thrift Server
### What changes were proposed in this pull request?
This PR implements `SparkGetCatalogsOperation` for Thrift Server metadata completeness.

### Why are the changes needed?
Thrift Server metadata completeness.

### Does this PR introduce any user-facing change?
No

### How was this patch tested?
Unit test

Closes #25555 from wangyum/SPARK-28852.

Authored-by: Yuming Wang <yumwang@ebay.com>
Signed-off-by: Xiao Li <gatorsmile@gmail.com>
2019-08-25 22:42:50 -07:00
Terry Kim a3328cdc0a [SPARK-28238][SQL][FOLLOW-UP] Clean up attributes for Datasource v2 DESCRIBE TABLE
### What changes were proposed in this pull request?
1. Fix the physical plan (`DescribeTableExec`) to have the same output attributes as the corresponding logical plan.
2. Remove `output` in statements since they are unresolved plans.

### Why are the changes needed?
Correctness of how output attributes should work.

### Does this PR introduce any user-facing change?
NO

### How was this patch tested?
Existing tests

Closes #25568 from imback82/describe_table.

Authored-by: Terry Kim <yuminkim@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2019-08-26 13:39:36 +08:00
Yuming Wang 4b16cf11b3 [SPARK-27988][SQL][TEST] Port AGGREGATES.sql [Part 3]
## What changes were proposed in this pull request?

This PR is to port AGGREGATES.sql from PostgreSQL regression tests. https://github.com/postgres/postgres/blob/REL_12_BETA2/src/test/regress/sql/aggregates.sql#L352-L605

The expected results can be found in the link: https://github.com/postgres/postgres/blob/REL_12_BETA2/src/test/regress/expected/aggregates.out#L986-L1613

When porting the test cases, found seven PostgreSQL specific features that do not exist in Spark SQL:

[SPARK-27974](https://issues.apache.org/jira/browse/SPARK-27974): Add built-in Aggregate Function: array_agg
[SPARK-27978](https://issues.apache.org/jira/browse/SPARK-27978): Add built-in Aggregate Functions: string_agg
[SPARK-27986](https://issues.apache.org/jira/browse/SPARK-27986): Support Aggregate Expressions with filter
[SPARK-27987](https://issues.apache.org/jira/browse/SPARK-27987): Support POSIX Regular Expressions
[SPARK-28682](https://issues.apache.org/jira/browse/SPARK-28682): ANSI SQL: Collation Support
[SPARK-28768](https://issues.apache.org/jira/browse/SPARK-28768): Implement more text pattern operators
[SPARK-28865](https://issues.apache.org/jira/browse/SPARK-28865): Table inheritance

## How was this patch tested?

N/A

Closes #24829 from wangyum/SPARK-27988.

Authored-by: Yuming Wang <yumwang@ebay.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2019-08-25 23:34:59 +09:00
Yuming Wang 02a0cdea13 [SPARK-28723][SQL] Upgrade to Hive 2.3.6 for HiveMetastore Client and Hadoop-3.2 profile
### What changes were proposed in this pull request?

This PR upgrade the built-in Hive to 2.3.6 for `hadoop-3.2`.

Hive 2.3.6 release notes:
- [HIVE-22096](https://issues.apache.org/jira/browse/HIVE-22096): Backport [HIVE-21584](https://issues.apache.org/jira/browse/HIVE-21584) (Java 11 preparation: system class loader is not URLClassLoader)
- [HIVE-21859](https://issues.apache.org/jira/browse/HIVE-21859): Backport [HIVE-17466](https://issues.apache.org/jira/browse/HIVE-17466) (Metastore API to list unique partition-key-value combinations)
- [HIVE-21786](https://issues.apache.org/jira/browse/HIVE-21786): Update repo URLs in poms branch 2.3 version

### Why are the changes needed?
Make Spark support JDK 11.

### Does this PR introduce any user-facing change?
Yes. Please see [SPARK-28684](https://issues.apache.org/jira/browse/SPARK-28684) and [SPARK-24417](https://issues.apache.org/jira/browse/SPARK-24417) for more details.

### How was this patch tested?
Existing unit test and manual test.

Closes #25443 from wangyum/test-on-jenkins.

Lead-authored-by: Yuming Wang <yumwang@ebay.com>
Co-authored-by: HyukjinKwon <gurwls223@apache.org>
Co-authored-by: Hyukjin Kwon <gurwls223@gmail.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2019-08-23 21:34:30 -07:00
Xiao Li 07c4b9bd1f Revert "[SPARK-25474][SQL] Support spark.sql.statistics.fallBackToHdfs in data source tables"
This reverts commit 485ae6d181.

Closes #25563 from gatorsmile/revert.

Authored-by: Xiao Li <gatorsmile@gmail.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2019-08-23 07:41:39 -07:00
Gengliang Wang 8258660f67 [SPARK-28741][SQL] Optional mode: throw exceptions when casting to integers causes overflow
## What changes were proposed in this pull request?

To follow ANSI SQL, we should support a configurable mode that throws exceptions when casting to integers causes overflow.
The behavior is similar to https://issues.apache.org/jira/browse/SPARK-26218, which throws exceptions on arithmetical operation overflow.
To unify it, the configuration is renamed from "spark.sql.arithmeticOperations.failOnOverFlow" to "spark.sql.failOnIntegerOverFlow"
## How was this patch tested?

Unit test

Closes #25461 from gengliangwang/AnsiCastIntegral.

Authored-by: Gengliang Wang <gengliang.wang@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2019-08-23 21:49:45 +08:00
Ali Afroozeh 1472e664ba [SPARK-28716][SQL] Add id to Exchange and Subquery's stringArgs method for easier identifying their reuses in query plans
## What changes were proposed in this pull request?

Add id to Exchange and Subquery's stringArgs method for easier identifying their reuses in query plans, for example:
```
ReusedExchange d_date_sk#827, BroadcastExchange HashedRelationBroadcastMode(List(cast(input[0, int, true] as bigint))) [id=#2710]
```
Where `2710` is the id of the reused exchange.

## How was this patch tested?

Passes existing tests

Closes #25434 from dbaliafroozeh/ImplementStringArgsExchangeSubqueryExec.

Authored-by: Ali Afroozeh <ali.afroozeh@databricks.com>
Signed-off-by: herman <herman@databricks.com>
2019-08-23 13:29:32 +02:00
Ali Afroozeh aef7ca1f0b [SPARK-28836][SQL] Remove the canonicalize(attributes) method from PlanExpression
### What changes were proposed in this pull request?
This PR removes the `canonicalize(attrs: AttributeSeq)` from `PlanExpression` and taking care of normalizing expressions in `QueryPlan`.

### Why are the changes needed?
`Expression` has already a `canonicalized` method and having the `canonicalize` method in `PlanExpression` is confusing.

### Does this PR introduce any user-facing change?
Removes the `canonicalize` plan from `PlanExpression`. Also renames the `normalizeExprId` to `normalizeExpressions` in query plan.

### How was this patch tested?
This PR is a refactoring and passes the existing tests

Closes #25534 from dbaliafroozeh/ImproveCanonicalizeAPI.

Authored-by: Ali Afroozeh <ali.afroozeh@databricks.com>
Signed-off-by: herman <herman@databricks.com>
2019-08-23 13:26:58 +02:00
terryk 98e1a4cea4 [SPARK-28319][SQL] Implement SHOW TABLES for Data Source V2 Tables
## What changes were proposed in this pull request?

Implements the SHOW TABLES logical and physical plans for data source v2 tables.

## How was this patch tested?

Added unit tests to `DataSourceV2SQLSuite`.

Closes #25247 from imback82/dsv2_show_tables.

Lead-authored-by: terryk <yuminkim@gmail.com>
Co-authored-by: Terry Kim <yuminkim@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2019-08-23 14:20:25 +08:00
Ali Afroozeh 9976b876f1 [SPARK-28835][SQL][TEST] Add TPCDSSchema trait
### What changes were proposed in this pull request?
This PR extracts the schema information of TPCDS tables into a separate class called `TPCDSSchema` which can be reused for other testing purposes

### How was this patch tested?
This PR is only a refactoring for tests and passes existing tests

Closes #25535 from dbaliafroozeh/IntroduceTPCDSSchema.

Authored-by: Ali Afroozeh <ali.afroozeh@databricks.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2019-08-22 23:18:46 -07:00
Jungtaek Lim (HeartSaVioR) 406c5331ff
[SPARK-28025][SS] Fix FileContextBasedCheckpointFileManager leaking crc files
### What changes were proposed in this pull request?

This PR fixes the leak of crc files from CheckpointFileManager when FileContextBasedCheckpointFileManager is being used.

Spark hits the Hadoop bug, [HADOOP-16255](https://issues.apache.org/jira/browse/HADOOP-16255) which seems to be a long-standing issue.

This is there're two `renameInternal` methods:

```
public void renameInternal(Path src, Path dst)
public void renameInternal(final Path src, final Path dst, boolean overwrite)
```

which should be overridden to handle all cases but ChecksumFs only overrides method with 2 params, so when latter is called FilterFs.renameInternal(...) is called instead, and it will do rename with RawLocalFs as underlying filesystem.

The bug is related to FileContext, so FileSystemBasedCheckpointFileManager is not affected.

[SPARK-17475](https://issues.apache.org/jira/browse/SPARK-17475) took a workaround for this bug, but [SPARK-23966](https://issues.apache.org/jira/browse/SPARK-23966) seemed to bring regression.

This PR deletes crc file as "best-effort" when renaming, as failing to delete crc file is not that critical to fail the task.

### Why are the changes needed?

This PR prevents crc files not being cleaned up even purging batches. Too many files in same directory often hurts performance, as well as each crc file occupies more space than its own size so possible to occupy nontrivial amount of space when batches go up to 100000+.

### Does this PR introduce any user-facing change?

No.

### How was this patch tested?

Some unit tests are modified to check leakage of crc files.

Closes #25488 from HeartSaVioR/SPARK-28025.

Authored-by: Jungtaek Lim (HeartSaVioR) <kabhwan@gmail.com>
Signed-off-by: Shixiong Zhu <zsxwing@gmail.com>
2019-08-22 23:10:16 -07:00
Gengliang Wang 895c90b582 [SPARK-28730][SQL] Configurable type coercion policy for table insertion
## What changes were proposed in this pull request?

After all the discussions in the dev list: http://apache-spark-developers-list.1001551.n3.nabble.com/Discuss-Follow-ANSI-SQL-on-table-insertion-td27531.html#a27562.
Here I propose that we can make the store assignment rules in the analyzer configurable, and the behavior of V1 and V2 should be consistent.
When inserting a value into a column with a different data type, Spark will perform type coercion. After this PR, we support 2 policies for the type coercion rules:
legacy and strict.
1. With legacy policy, Spark allows casting any value to any data type. The legacy policy is the only behavior in Spark 2.x and it is compatible with Hive.
2. With strict policy, Spark doesn't allow any possible precision loss or data truncation in type coercion, e.g. `int` and `long`, `float` -> `double` are not allowed.

Eventually, the "legacy" mode will be removed, so it is disallowed in data source V2.
To ensure backward compatibility with existing queries, the default store assignment policy for data source V1 is "legacy".
## How was this patch tested?

Unit test

Closes #25453 from gengliangwang/tableInsertRule.

Authored-by: Gengliang Wang <gengliang.wang@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2019-08-23 13:50:26 +08:00
shivusondur 23bed0d3c0 [SPARK-28702][SQL] Display useful error message (instead of NPE) for invalid Dataset operations
### What changes were proposed in this pull request?
Added proper message instead of NPE for invalid Dataset operations (e.g. calling actions inside of transformations) similar to SPARK-5063 for RDD

### Why are the changes needed?
To report the user about the exact issue instead of NPE

### Does this PR introduce any user-facing change?
No

### How was this patch tested?

Manually tested

```scala
test code snap
"import spark.implicits._
    val ds1 = spark.sparkContext.parallelize(1 to 100, 100).toDS()
    val ds2 = spark.sparkContext.parallelize(1 to 100, 100).toDS()
    ds1.map(x => {
      // scalastyle:off
      println(ds2.count + x)
      x
    }).collect()"
```

Closes #25503 from shivusondur/jira28702.

Authored-by: shivusondur <shivusondur@gmail.com>
Signed-off-by: Josh Rosen <rosenville@gmail.com>
2019-08-22 22:15:37 -07:00
Dongjoon Hyun 36da2e3384 [SPARK-28847][TEST] Annotate HiveExternalCatalogVersionsSuite with ExtendedHiveTest
### What changes were proposed in this pull request?

This PR aims to annotate `HiveExternalCatalogVersionsSuite` with `ExtendedHiveTest`.

### Why are the changes needed?

`HiveExternalCatalogVersionsSuite` is an outstanding test in terms of testing time. This PR aims to allow skipping this test suite when we use `ExtendedHiveTest`.
![time](https://user-images.githubusercontent.com/9700541/63489184-4c75af00-c466-11e9-9e12-d250d4a23292.png)

### Does this PR introduce any user-facing change?

No

### How was this patch tested?

Since Jenkins doesn't exclude `ExtendedHiveTest`, there is no difference in Jenkins testing.
This PR should be tested by manually by the following.

**BEFORE**
```
$ cd sql/hive
$ mvn package -Dtest=none -DwildcardSuites=org.apache.spark.sql.hive.HiveExternalCatalogVersionsSuite -Dtest.exclude.tags=org.apache.spark.tags.ExtendedHiveTest
...
Run starting. Expected test count is: 1
HiveExternalCatalogVersionsSuite:
22:32:16.218 WARN org.apache.hadoop.util.NativeCodeLoader: Unable to load ...
```

**AFTER**
```
$ cd sql/hive
$ mvn package -Dtest=none -DwildcardSuites=org.apache.spark.sql.hive.HiveExternalCatalogVersionsSuite -Dtest.exclude.tags=org.apache.spark.tags.ExtendedHiveTest
...
Run starting. Expected test count is: 0
HiveExternalCatalogVersionsSuite:
Run completed in 772 milliseconds.
Total number of tests run: 0
Suites: completed 2, aborted 0
Tests: succeeded 0, failed 0, canceled 0, ignored 0, pending 0
No tests were executed.
...
```

Closes #25550 from dongjoon-hyun/SPARK-28847.

Authored-by: Dongjoon Hyun <dhyun@apple.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2019-08-22 00:25:56 -07:00
triplesheep 48578a41b5 [SPARK-28844][SQL] Fix typo in SQLConf FILE_COMRESSION_FACTOR
### What changes were proposed in this pull request?
Fix minor typo in SQLConf.
`FILE_COMRESSION_FACTOR` -> `FILE_COMPRESSION_FACTOR`

### Why are the changes needed?
Make conf more understandable.

### Does this PR introduce any user-facing change?
No. (`spark.sql.sources.fileCompressionFactor` is unchanged.)

### How was this patch tested?
Pass the Jenkins with the existing tests.

Closes #25538 from triplesheep/TYPO-FIX.

Authored-by: triplesheep <triplesheep0419@gmail.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2019-08-22 00:07:40 -07:00
maryannxue aefb2e70e7 [SPARK-28739][SQL] Add a simple cost check for Adaptive Query Execution
### What changes were proposed in this pull request?

This PR adds a simple cost model and a mechanism to compare the costs of the before and after plans of each re-optimization in Adaptive Query Execution. Now the workflow of AQE re-optimization is changed to: If the cost of the plan after re-optimization is lower than or equal to that of the plan before re-optimization and the plan has been changed after re-optimization (if equal), the current physical plan will be updated to the plan after re-optimization, otherwise it will remain unchanged until the next re-optimization.

### Why are the changes needed?
This new mechanism is to prevent regressions in Adaptive Query Execution caused by change of the plan introducing extra cost, in this PR specifically, change of SMJ to BHJ leading to extra `ShuffleExchangeExec`s.

### Does this PR introduce any user-facing change?
No.

### How was this patch tested?
Added UT.

Closes #25456 from maryannxue/aqe-cost.

Authored-by: maryannxue <maryannxue@apache.org>
Signed-off-by: gatorsmile <gatorsmile@gmail.com>
2019-08-21 19:33:56 -07:00
Wenchen Fan ed3ea6734c [SPARK-28837][SQL] CTAS/RTAS should use nullable schema
<!--
Thanks for sending a pull request!  Here are some tips for you:
  1. If this is your first time, please read our contributor guidelines: https://spark.apache.org/contributing.html
  2. Ensure you have added or run the appropriate tests for your PR: https://spark.apache.org/developer-tools.html
  3. If the PR is unfinished, add '[WIP]' in your PR title, e.g., '[WIP][SPARK-XXXX] Your PR title ...'.
  4. Be sure to keep the PR description updated to reflect all changes.
  5. Please write your PR title to summarize what this PR proposes.
  6. If possible, provide a concise example to reproduce the issue for a faster review.
-->

### What changes were proposed in this pull request?
<!--
Please clarify what changes you are proposing. The purpose of this section is to outline the changes and how this PR fixes the issue.
If possible, please consider writing useful notes for better and faster reviews in your PR. See the examples below.
  1. If you refactor some codes with changing classes, showing the class hierarchy will help reviewers.
  2. If you fix some SQL features, you can provide some references of other DBMSes.
  3. If there is design documentation, please add the link.
  4. If there is a discussion in the mailing list, please add the link.
-->
When running CTAS/RTAS, use the nullable schema of the input query to create the table.

### Why are the changes needed?
<!--
Please clarify why the changes are needed. For instance,
  1. If you propose a new API, clarify the use case for a new API.
  2. If you fix a bug, you can clarify why it is a bug.
-->
It's very likely to run CTAS/RTAS with non-nullable input query, e.g. `CREATE TABLE t AS SELECT 1`. However, it's surprising to users if they can't write null to this table later. Non-nullable is kind of a constraint of the column and should be specified by users explicitly.

For reference, Postgres also use nullable schema for CTAS:
```
> create table t1(i int not null);

> insert into t1 values (1);

> create table t2 as select i from t1;

> \d+ t1;
 Column |  Type   | Collation | Nullable | Default | Storage | Stats target | Description
--------+---------+-----------+----------+---------+---------+--------------+-------------
 i      | integer |           | not null |         | plain   |              |

> \d+ t2;
 Column |  Type   | Collation | Nullable | Default | Storage | Stats target | Description
--------+---------+-----------+----------+---------+---------+--------------+-------------
 i      | integer |           |          |         | plain   |              |

```

File source V1 has the same behavior.

### Does this PR introduce any user-facing change?
<!--
If yes, please clarify the previous behavior and the change this PR proposes - provide the console output, description and/or an example to show the behavior difference if possible.
If no, write 'No'.
-->
Yes, after this PR CTAS/RTAS creates tables with nullable schema, then users can insert null values later.

### How was this patch tested?
<!--
If tests were added, say they were added here. Please make sure to add some test cases that check the changes thoroughly including negative and positive cases if possible.
If it was tested in a way different from regular unit tests, please clarify how you tested step by step, ideally copy and paste-able, so that other reviewers can test and check, and descendants can verify in the future.
If tests were not added, please describe why they were not added and/or why it was difficult to add.
-->
new test

Closes #25536 from cloud-fan/ctas.

Authored-by: Wenchen Fan <wenchen@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2019-08-22 09:49:18 +08:00
Wenchen Fan 97b046f06f [SPARK-28635][SQL][FOLLOWUP] CatalogManager should reflect the changes of default catalog
<!--
Thanks for sending a pull request!  Here are some tips for you:
  1. If this is your first time, please read our contributor guidelines: https://spark.apache.org/contributing.html
  2. Ensure you have added or run the appropriate tests for your PR: https://spark.apache.org/developer-tools.html
  3. If the PR is unfinished, add '[WIP]' in your PR title, e.g., '[WIP][SPARK-XXXX] Your PR title ...'.
  4. Be sure to keep the PR description updated to reflect all changes.
  5. Please write your PR title to summarize what this PR proposes.
  6. If possible, provide a concise example to reproduce the issue for a faster review.
-->

### What changes were proposed in this pull request?
<!--
Please clarify what changes you are proposing. The purpose of this section is to outline the changes and how this PR fixes the issue.
If possible, please consider writing useful notes for better and faster reviews in your PR. See the examples below.
  1. If you refactor some codes with changing classes, showing the class hierarchy will help reviewers.
  2. If you fix some SQL features, you can provide some references of other DBMSes.
  3. If there is design documentation, please add the link.
  4. If there is a discussion in the mailing list, please add the link.
-->
The current namespace/catalog should be set to None at the beginning, so that we can read the new configs when reporting currennt namespace/catalog later.

### Why are the changes needed?
<!--
Please clarify why the changes are needed. For instance,
  1. If you propose a new API, clarify the use case for a new API.
  2. If you fix a bug, you can clarify why it is a bug.
-->
Fix a bug in CatalogManager, to reflect the change of default catalog config when reporting current catalog.

### Does this PR introduce any user-facing change?
<!--
If yes, please clarify the previous behavior and the change this PR proposes - provide the console output, description and/or an example to show the behavior difference if possible.
If no, write 'No'.
-->
No. The current namespace/catalog stuff is still internal right now.

### How was this patch tested?
<!--
If tests were added, say they were added here. Please make sure to add some test cases that check the changes thoroughly including negative and positive cases if possible.
If it was tested in a way different from regular unit tests, please clarify how you tested step by step, ideally copy and paste-able, so that other reviewers can test and check, and descendants can verify in the future.
If tests were not added, please describe why they were not added and/or why it was difficult to add.
-->
a new test suite

Closes #25521 from cloud-fan/fix.

Authored-by: Wenchen Fan <wenchen@databricks.com>
Signed-off-by: Burak Yavuz <brkyvz@gmail.com>
2019-08-21 12:23:42 -07:00
Yuanjian Li 2d9cc42aa8 [SPARK-28699][SQL] Disable using radix sort for ShuffleExchangeExec in repartition case
## What changes were proposed in this pull request?

Disable using radix sort in ShuffleExchangeExec when we do repartition.
In #20393, we fixed the indeterministic result in the shuffle repartition case by performing a local sort before repartitioning.
But for the newly added sort operation, we use radix sort which is wrong because binary data can't be compared by only the prefix. This makes the sort unstable and fails to solve the indeterminate shuffle output problem.

### Why are the changes needed?
Fix the correctness bug caused by repartition after a shuffle.

### Does this PR introduce any user-facing change?
Yes, user will get the right result in the case of repartition stage rerun.

## How was this patch tested?

Test with `local-cluster[5, 2, 5120]`, use the integrated test below, it can return a right answer 100000000.
```
import scala.sys.process._
import org.apache.spark.TaskContext

val res = spark.range(0, 10000 * 10000, 1).map{ x => (x % 1000, x)}
// kill an executor in the stage that performs repartition(239)
val df = res.repartition(113).map{ x => (x._1 + 1, x._2)}.repartition(239).map { x =>
  if (TaskContext.get.attemptNumber == 0 && TaskContext.get.partitionId < 1 && TaskContext.get.stageAttemptNumber == 0) {
    throw new Exception("pkill -f -n java".!!)
  }
  x
}
val r2 = df.distinct.count()
```

Closes #25491 from xuanyuanking/SPARK-28699-fix.

Authored-by: Yuanjian Li <xyliyuanjian@gmail.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2019-08-21 10:56:50 -07:00
Ali Afroozeh 4dc3093513 [SPARK-28715][SQL] Introduce collectInPlanAndSubqueries and subqueriesAll in QueryPlan
## What changes were proposed in this pull request?

Introduces the collectInPlanAndSubqueries and subqueriesAll methods in QueryPlan that consider all the plans in the query plan, including the ones in nested subqueries.

## How was this patch tested?

Unit test added

Closes #25433 from dbaliafroozeh/IntroduceCollectInPlanAndSubqueries.

Authored-by: Ali Afroozeh <ali.afroozeh@databricks.com>
Signed-off-by: herman <herman@databricks.com>
2019-08-21 18:05:18 +02:00
Robert (Bobby) Evans fac469e2e0 [SPARK-28774][SQL] Fix exchange reuse for columnar data
### What changes were proposed in this pull request?
The rule ReuseExchange optimization rule will look for instances of Exchange that have the same plan and convert dedupe them to them to a ReuseExchangeExec instance. In the current Spark codebase all Exchange instances are row based, but if we use the spark.sql.extensions config to put in our own columnar based exchange implementation reuse will throw an exception saying that there was a columnar mismatch.

### Why are the changes needed?
Without it Reused Columnar Exchanges throw an exception

### Does this PR introduce any user-facing change?
No

### How was this patch tested?

I tested this patch by running it against a query that was showing this exact issue and it fixed it.

I also added a very simple unit test that shows the issue.

Closes #25499 from revans2/reused-columnar-exchange.

Authored-by: Robert (Bobby) Evans <bobby@apache.org>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2019-08-21 18:10:26 +08:00
Burak Yavuz 4855bfe16b [SPARK-28554][SQL] Adds a v1 fallback writer implementation for v2 data source codepaths
## What changes were proposed in this pull request?

This PR adds a V1 fallback interface for writing to V2 Tables using V1 Writer interfaces. The only supported SaveMode that will be called on the target table will be an Append. The target table must use V2 interfaces such as `SupportsOverwrite` or `SupportsTruncate` to support Overwrite operations. It is up to the target DataSource implementation if this operation can be atomic or not.

We do not support dynamicPartitionOverwrite, as we cannot call a `commit` method that actually cleans up the data in the partitions that were touched through this fallback.

## How was this patch tested?

Will add tests and example implementation after comments + feedback. This is a proposal at this point.

Closes #25348 from brkyvz/v1WriteFallback.

Lead-authored-by: Burak Yavuz <brkyvz@gmail.com>
Co-authored-by: Burak Yavuz <burak@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2019-08-21 17:25:25 +08:00
Marco Gaido 0bfcf9c210 [SPARK-28322][SQL] Add support to Decimal type for integral divide
## What changes were proposed in this pull request?

The expression `IntegralDivide`, which corresponds to the `div` operator, support only integral type. Postgres, though, allows it to work also with decimals.

The PR adds the support to decimal operands for this operation in order to have feature parity with postgres.

## How was this patch tested?

added UTs

Closes #25136 from mgaido91/SPARK-28322.

Authored-by: Marco Gaido <marcogaido91@gmail.com>
Signed-off-by: Takeshi Yamamuro <yamamuro@apache.org>
2019-08-21 08:43:00 +09:00
maryannxue 39c11273e0 [SPARK-28753][SQL] Dynamically reuse subqueries in AQE
### What changes were proposed in this pull request?
This PR changes subquery reuse in Adaptive Query Execution from compile-time static reuse to execution-time dynamic reuse. This PR adds a `ReuseAdaptiveSubquery` rule that applies to a query stage after it is created and before it is executed. The new dynamic reuse enables subqueries to be reused across all different subquery levels.

### Why are the changes needed?
This is an improvement to the current subquery reuse in Adaptive Query Execution, which allows subquery reuse to happen in a lazy fashion as well as at different subquery levels.

### Does this PR introduce any user-facing change?
No.

### How was this patch tested?
Passed existing tests.

Closes #25471 from maryannxue/aqe-dynamic-sub-reuse.

Authored-by: maryannxue <maryannxue@apache.org>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2019-08-20 19:58:29 +08:00
Wenchen Fan d04522187a [SPARK-28635][SQL] create CatalogManager to track registered v2 catalogs
## What changes were proposed in this pull request?

This is a pure refactor PR, which creates a new class `CatalogManager` to track the registered v2 catalogs, and provide the catalog up functionality.

`CatalogManager` also tracks the current catalog/namespace. We will implement corresponding commands in other PRs, like `USE CATALOG my_catalog`

## How was this patch tested?

existing tests

Closes #25368 from cloud-fan/refactor.

Authored-by: Wenchen Fan <wenchen@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2019-08-20 19:40:21 +08:00
Jungtaek Lim (HeartSaVioR) b37c8d5cea
[SPARK-28650][SS][DOC] Correct explanation of guarantee for ForeachWriter
#  What changes were proposed in this pull request?

This patch modifies the explanation of guarantee for ForeachWriter as it doesn't guarantee same output for `(partitionId, epochId)`. Refer the description of [SPARK-28650](https://issues.apache.org/jira/browse/SPARK-28650) for more details.

Spark itself still guarantees same output for same epochId (batch) if the preconditions are met, 1) source is always providing the same input records for same offset request. 2) the query is idempotent in overall (indeterministic calculation like now(), random() can break this).

Assuming breaking preconditions as an exceptional case (the preconditions are implicitly required even before), we still can describe the guarantee with `epochId`, though it will be  harder to leverage the guarantee: 1) ForeachWriter should implement a feature to track whether all the partitions are written successfully for given `epochId` 2) There's pretty less chance to leverage the fact, as the chance for Spark to successfully write all partitions and fail to checkpoint the batch is small.

Credit to zsxwing on discovering the broken guarantee.

## How was this patch tested?

This is just a documentation change, both on javadoc and guide doc.

Closes #25407 from HeartSaVioR/SPARK-28650.

Authored-by: Jungtaek Lim (HeartSaVioR) <kabhwan@gmail.com>
Signed-off-by: Shixiong Zhu <zsxwing@gmail.com>
2019-08-20 00:56:53 -07:00
lihao 79464bed2f [SPARK-28662][SQL] Create Hive Partitioned Table DDL should fail when partition column type missed
## What changes were proposed in this pull request?
Create Hive Partitioned Table without specifying data type for partition column will success unexpectedly.
```HiveQL
// create a hive table partition by b, but the data type of b isn't specified.
CREATE TABLE tbl(a int) PARTITIONED BY (b) STORED AS parquet
```
In https://issues.apache.org/jira/browse/SPARK-26435 ,  PARTITIONED BY clause  are extended to support Hive CTAS as following:
```ANTLR
// Before
(PARTITIONED BY '(' partitionColumns=colTypeList ')'

 // After
(PARTITIONED BY '(' partitionColumns=colTypeList ')'|
PARTITIONED BY partitionColumnNames=identifierList) |
```

Create Table Statement like above case will pass the syntax check,  and recognized as (PARTITIONED BY partitionColumnNames=identifierList) 。

This PR  will check this case in visitCreateHiveTable and throw a exception which contains  explicit error message to user.

## How was this patch tested?

Added tests.

Closes #25390 from lidinghao/hive-ddl-fix.

Authored-by: lihao <lihaowhu@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2019-08-20 14:37:04 +08:00
Sean Owen 3b4e345fa1 [SPARK-28775][CORE][TESTS] Skip date 8633 in Kwajalein due to changes in tzdata2018i that only some JDK 8s use
### What changes were proposed in this pull request?

Some newer JDKs use the tzdata2018i database, which changes how certain (obscure) historical dates and timezones are handled. As previously, we can pretty much safely ignore these in tests, as the value may vary by JDK.

### Why are the changes needed?

Test otherwise fails using, for example, JDK 1.8.0_222. https://bugs.openjdk.java.net/browse/JDK-8215982 has a full list of JDKs which has this.

### Does this PR introduce any user-facing change?

No.

### How was this patch tested?

Existing tests

Closes #25504 from srowen/SPARK-28775.

Authored-by: Sean Owen <sean.owen@databricks.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2019-08-19 17:54:25 -07:00
Mick Jermsurawong b79cf0d143 [SPARK-28224][SQL] Check overflow in decimal Sum aggregate
## What changes were proposed in this pull request?
- Currently `sum` in aggregates for decimal type can overflow and return null.
  - `Sum` expression codegens arithmetic on `sql.Decimal` and the output which preserves scale and precision goes into `UnsafeRowWriter`. Here overflowing will be converted to null when writing out.
  - It also does not go through this branch in `DecimalAggregates` because it's expecting precision of the sum (not the elements to be summed) to be less than 5.
4ebff5b6d6/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/Optimizer.scala (L1400-L1403)

- This PR adds the check at the final result of the sum operator itself.
4ebff5b6d6/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/aggregate/interfaces.scala (L372-L376)

https://issues.apache.org/jira/browse/SPARK-28224

## How was this patch tested?

- Added an integration test on dataframe suite

cc mgaido91 JoshRosen

Closes #25033 from mickjermsurawong-stripe/SPARK-28224.

Authored-by: Mick Jermsurawong <mickjermsurawong@stripe.com>
Signed-off-by: Takeshi Yamamuro <yamamuro@apache.org>
2019-08-20 09:47:04 +09:00
Takuya UESHIN 26f344354b [SPARK-27905][SQL][FOLLOW-UP] Add prettyNames
### What changes were proposed in this pull request?

This is a follow-up of #24761 which added a higher-order function `ArrayForAll`.
The PR mistakenly removed the `prettyName` from `ArrayExists` and forgot to add it to `ArrayForAll`.

### Why are the changes needed?

This reverts the `prettyName` back to `ArrayExists` not to affect explained plans, and adds it to `ArrayForAll` to clarify the `prettyName` as the same as the expressions around.

### Does this PR introduce any user-facing change?

No.

### How was this patch tested?

Existing tests.

Closes #25501 from ueshin/issues/SPARK-27905/pretty_names.

Authored-by: Takuya UESHIN <ueshin@databricks.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2019-08-19 15:15:50 -07:00
Huaxin Gao ec14b6eb65 [SPARK-28393][SQL][PYTHON][TESTS] Convert and port 'pgSQL/join.sql' into UDF test base
## What changes were proposed in this pull request?

This PR adds some tests converted from ```pgSQL/join.sql``` to test UDFs. Please see contribution guide of this umbrella ticket - [SPARK-27921](https://issues.apache.org/jira/browse/SPARK-27921).
<details><summary>Diff comparing to 'join.sql'</summary>
<p>

```diff
diff --git a/sql/core/src/test/resources/sql-tests/results/pgSQL/join.sql.out b/sql/core/src/test/resources/sql-tests/results/udf/pgSQL/udf-join.sql.out
index f75fe05196..ad2b5dd0db 100644
--- a/sql/core/src/test/resources/sql-tests/results/pgSQL/join.sql.out
+++ b/sql/core/src/test/resources/sql-tests/results/udf/pgSQL/udf-join.sql.out
 -240,10 +240,10  struct<>

 -- !query 27
-SELECT '' AS `xxx`, *
+SELECT udf('') AS `xxx`, udf(i), udf(j), udf(t)
   FROM J1_TBL AS tx
 -- !query 27 schema
-struct<xxx:string,i:int,j:int,t:string>
+struct<xxx:string,CAST(udf(cast(i as string)) AS INT):int,CAST(udf(cast(j as string)) AS INT):int,CAST(udf(cast(t as string)) AS STRING):string>
 -- !query 27 output
        0       NULL    zero
        1       4       one
 -259,10 +259,10  struct<xxx:string,i:int,j:int,t:string>

 -- !query 28
-SELECT '' AS `xxx`, *
+SELECT udf(udf('')) AS `xxx`, udf(udf(i)), udf(j), udf(t)
   FROM J1_TBL tx
 -- !query 28 schema
-struct<xxx:string,i:int,j:int,t:string>
+struct<xxx:string,CAST(udf(cast(cast(udf(cast(i as string)) as int) as string)) AS INT):int,CAST(udf(cast(j as string)) AS INT):int,CAST(udf(cast(t as string)) AS STRING):string>
 -- !query 28 output
        0       NULL    zero
        1       4       one
 -278,10 +278,10  struct<xxx:string,i:int,j:int,t:string>

 -- !query 29
-SELECT '' AS `xxx`, *
+SELECT udf('') AS `xxx`, a, udf(udf(b)), c
   FROM J1_TBL AS t1 (a, b, c)
 -- !query 29 schema
-struct<xxx:string,a:int,b:int,c:string>
+struct<xxx:string,a:int,CAST(udf(cast(cast(udf(cast(b as string)) as int) as string)) AS INT):int,c:string>
 -- !query 29 output
        0       NULL    zero
        1       4       one
 -297,10 +297,10  struct<xxx:string,a:int,b:int,c:string>

 -- !query 30
-SELECT '' AS `xxx`, *
+SELECT udf('') AS `xxx`, udf(a), udf(b), udf(udf(c))
   FROM J1_TBL t1 (a, b, c)
 -- !query 30 schema
-struct<xxx:string,a:int,b:int,c:string>
+struct<xxx:string,CAST(udf(cast(a as string)) AS INT):int,CAST(udf(cast(b as string)) AS INT):int,CAST(udf(cast(cast(udf(cast(c as string)) as string) as string)) AS STRING):string>
 -- !query 30 output
        0       NULL    zero
        1       4       one
 -316,10 +316,10  struct<xxx:string,a:int,b:int,c:string>

 -- !query 31
-SELECT '' AS `xxx`, *
+SELECT udf('') AS `xxx`, udf(a), b, udf(c), udf(d), e
   FROM J1_TBL t1 (a, b, c), J2_TBL t2 (d, e)
 -- !query 31 schema
-struct<xxx:string,a:int,b:int,c:string,d:int,e:int>
+struct<xxx:string,CAST(udf(cast(a as string)) AS INT):int,b:int,CAST(udf(cast(c as string)) AS STRING):string,CAST(udf(cast(d as string)) AS INT):int,e:int>
 -- !query 31 output
        0       NULL    zero    0       NULL
        0       NULL    zero    1       -1
 -423,7 +423,7  struct<xxx:string,a:int,b:int,c:string,d:int,e:int>

 -- !query 32
-SELECT '' AS `xxx`, *
+SELECT udf('') AS `xxx`, *
   FROM J1_TBL CROSS JOIN J2_TBL
 -- !query 32 schema
 struct<xxx:string,i:int,j:int,t:string,i:int,k:int>
 -530,20 +530,20  struct<xxx:string,i:int,j:int,t:string,i:int,k:int>

 -- !query 33
-SELECT '' AS `xxx`, i, k, t
+SELECT udf('') AS `xxx`, udf(i) AS i, udf(k), udf(t) AS t
   FROM J1_TBL CROSS JOIN J2_TBL
 -- !query 33 schema
 struct<>
 -- !query 33 output
 org.apache.spark.sql.AnalysisException
-Reference 'i' is ambiguous, could be: default.j1_tbl.i, default.j2_tbl.i.; line 1 pos 20
+Reference 'i' is ambiguous, could be: default.j1_tbl.i, default.j2_tbl.i.; line 1 pos 29

 -- !query 34
-SELECT '' AS `xxx`, t1.i, k, t
+SELECT udf('') AS `xxx`, udf(t1.i) AS i, udf(k), udf(t)
   FROM J1_TBL t1 CROSS JOIN J2_TBL t2
 -- !query 34 schema
-struct<xxx:string,i:int,k:int,t:string>
+struct<xxx:string,i:int,CAST(udf(cast(k as string)) AS INT):int,CAST(udf(cast(t as string)) AS STRING):string>
 -- !query 34 output
        0       -1      zero
        0       -3      zero
 -647,11 +647,11  struct<xxx:string,i:int,k:int,t:string>

 -- !query 35
-SELECT '' AS `xxx`, ii, tt, kk
+SELECT udf(udf('')) AS `xxx`, udf(udf(ii)) AS ii, udf(udf(tt)) AS tt, udf(udf(kk))
   FROM (J1_TBL CROSS JOIN J2_TBL)
     AS tx (ii, jj, tt, ii2, kk)
 -- !query 35 schema
-struct<xxx:string,ii:int,tt:string,kk:int>
+struct<xxx:string,ii:int,tt:string,CAST(udf(cast(cast(udf(cast(kk as string)) as int) as string)) AS INT):int>
 -- !query 35 output
        0       zero    -1
        0       zero    -3
 -755,10 +755,10  struct<xxx:string,ii:int,tt:string,kk:int>

 -- !query 36
-SELECT '' AS `xxx`, *
+SELECT udf('') AS `xxx`, udf(udf(j1_tbl.i)), udf(j), udf(t), udf(a.i), udf(a.k), udf(b.i),  udf(b.k)
   FROM J1_TBL CROSS JOIN J2_TBL a CROSS JOIN J2_TBL b
 -- !query 36 schema
-struct<xxx:string,i:int,j:int,t:string,i:int,k:int,i:int,k:int>
+struct<xxx:string,CAST(udf(cast(cast(udf(cast(i as string)) as int) as string)) AS INT):int,CAST(udf(cast(j as string)) AS INT):int,CAST(udf(cast(t as string)) AS STRING):string,CAST(udf(cast(i as string)) AS INT):int,CAST(udf(cast(k as string)) AS INT):int,CAST(udf(cast(i as string)) AS INT):int,CAST(udf(cast(k as string)) AS INT):int>
 -- !query 36 output
        0       NULL    zero    0       NULL    0       NULL
        0       NULL    zero    0       NULL    1       -1
 -1654,10 +1654,10  struct<xxx:string,i:int,j:int,t:string,i:int,k:int,i:int,k:int>

 -- !query 37
-SELECT '' AS `xxx`, *
+SELECT udf('') AS `xxx`, udf(i) AS i, udf(j), udf(t) AS t, udf(k)
   FROM J1_TBL INNER JOIN J2_TBL USING (i)
 -- !query 37 schema
-struct<xxx:string,i:int,j:int,t:string,k:int>
+struct<xxx:string,i:int,CAST(udf(cast(j as string)) AS INT):int,t:string,CAST(udf(cast(k as string)) AS INT):int>
 -- !query 37 output
        0       NULL    zero    NULL
        1       4       one     -1
 -1669,10 +1669,10  struct<xxx:string,i:int,j:int,t:string,k:int>

 -- !query 38
-SELECT '' AS `xxx`, *
+SELECT udf(udf('')) AS `xxx`, udf(i), udf(j) AS j, udf(t), udf(k) AS k
   FROM J1_TBL JOIN J2_TBL USING (i)
 -- !query 38 schema
-struct<xxx:string,i:int,j:int,t:string,k:int>
+struct<xxx:string,CAST(udf(cast(i as string)) AS INT):int,j:int,CAST(udf(cast(t as string)) AS STRING):string,k:int>
 -- !query 38 output
        0       NULL    zero    NULL
        1       4       one     -1
 -1684,9 +1684,9  struct<xxx:string,i:int,j:int,t:string,k:int>

 -- !query 39
-SELECT '' AS `xxx`, *
+SELECT udf('') AS `xxx`, *
   FROM J1_TBL t1 (a, b, c) JOIN J2_TBL t2 (a, d) USING (a)
-  ORDER BY a, d
+  ORDER BY udf(udf(a)), udf(d)
 -- !query 39 schema
 struct<xxx:string,a:int,b:int,c:string,d:int>
 -- !query 39 output
 -1700,10 +1700,10  struct<xxx:string,a:int,b:int,c:string,d:int>

 -- !query 40
-SELECT '' AS `xxx`, *
+SELECT udf(udf('')) AS `xxx`, udf(i), udf(j), udf(t), udf(k)
   FROM J1_TBL NATURAL JOIN J2_TBL
 -- !query 40 schema
-struct<xxx:string,i:int,j:int,t:string,k:int>
+struct<xxx:string,CAST(udf(cast(i as string)) AS INT):int,CAST(udf(cast(j as string)) AS INT):int,CAST(udf(cast(t as string)) AS STRING):string,CAST(udf(cast(k as string)) AS INT):int>
 -- !query 40 output
        0       NULL    zero    NULL
        1       4       one     -1
 -1715,10 +1715,10  struct<xxx:string,i:int,j:int,t:string,k:int>

 -- !query 41
-SELECT '' AS `xxx`, *
+SELECT udf('') AS `xxx`, udf(udf(udf(a))) AS a, udf(b), udf(c), udf(d)
   FROM J1_TBL t1 (a, b, c) NATURAL JOIN J2_TBL t2 (a, d)
 -- !query 41 schema
-struct<xxx:string,a:int,b:int,c:string,d:int>
+struct<xxx:string,a:int,CAST(udf(cast(b as string)) AS INT):int,CAST(udf(cast(c as string)) AS STRING):string,CAST(udf(cast(d as string)) AS INT):int>
 -- !query 41 output
        0       NULL    zero    NULL
        1       4       one     -1
 -1730,10 +1730,10  struct<xxx:string,a:int,b:int,c:string,d:int>

 -- !query 42
-SELECT '' AS `xxx`, *
+SELECT udf('') AS `xxx`, udf(udf(a)), udf(udf(b)), udf(udf(c)) AS c, udf(udf(udf(d))) AS d
   FROM J1_TBL t1 (a, b, c) NATURAL JOIN J2_TBL t2 (d, a)
 -- !query 42 schema
-struct<xxx:string,a:int,b:int,c:string,d:int>
+struct<xxx:string,CAST(udf(cast(cast(udf(cast(a as string)) as int) as string)) AS INT):int,CAST(udf(cast(cast(udf(cast(b as string)) as int) as string)) AS INT):int,c:string,d:int>
 -- !query 42 output
        0       NULL    zero    NULL
        2       3       two     2
 -1741,10 +1741,10  struct<xxx:string,a:int,b:int,c:string,d:int>

 -- !query 43
-SELECT '' AS `xxx`, *
-  FROM J1_TBL JOIN J2_TBL ON (J1_TBL.i = J2_TBL.i)
+SELECT udf('') AS `xxx`, udf(J1_TBL.i), udf(udf(J1_TBL.j)), udf(J1_TBL.t), udf(J2_TBL.i), udf(J2_TBL.k)
+  FROM J1_TBL JOIN J2_TBL ON (udf(J1_TBL.i) = J2_TBL.i)
 -- !query 43 schema
-struct<xxx:string,i:int,j:int,t:string,i:int,k:int>
+struct<xxx:string,CAST(udf(cast(i as string)) AS INT):int,CAST(udf(cast(cast(udf(cast(j as string)) as int) as string)) AS INT):int,CAST(udf(cast(t as string)) AS STRING):string,CAST(udf(cast(i as string)) AS INT):int,CAST(udf(cast(k as string)) AS INT):int>
 -- !query 43 output
        0       NULL    zero    0       NULL
        1       4       one     1       -1
 -1756,10 +1756,10  struct<xxx:string,i:int,j:int,t:string,i:int,k:int>

 -- !query 44
-SELECT '' AS `xxx`, *
-  FROM J1_TBL JOIN J2_TBL ON (J1_TBL.i = J2_TBL.k)
+SELECT udf('') AS `xxx`, udf(udf(J1_TBL.i)), udf(udf(J1_TBL.j)), udf(udf(J1_TBL.t)), J2_TBL.i, J2_TBL.k
+  FROM J1_TBL JOIN J2_TBL ON (J1_TBL.i = udf(J2_TBL.k))
 -- !query 44 schema
-struct<xxx:string,i:int,j:int,t:string,i:int,k:int>
+struct<xxx:string,CAST(udf(cast(cast(udf(cast(i as string)) as int) as string)) AS INT):int,CAST(udf(cast(cast(udf(cast(j as string)) as int) as string)) AS INT):int,CAST(udf(cast(cast(udf(cast(t as string)) as string) as string)) AS STRING):string,i:int,k:int>
 -- !query 44 output
        0       NULL    zero    NULL    0
        2       3       two     2       2
 -1767,10 +1767,10  struct<xxx:string,i:int,j:int,t:string,i:int,k:int>

 -- !query 45
-SELECT '' AS `xxx`, *
-  FROM J1_TBL JOIN J2_TBL ON (J1_TBL.i <= J2_TBL.k)
+SELECT udf('') AS `xxx`, udf(J1_TBL.i), udf(J1_TBL.j), udf(J1_TBL.t), udf(J2_TBL.i), udf(J2_TBL.k)
+  FROM J1_TBL JOIN J2_TBL ON (udf(J1_TBL.i) <= udf(udf(J2_TBL.k)))
 -- !query 45 schema
-struct<xxx:string,i:int,j:int,t:string,i:int,k:int>
+struct<xxx:string,CAST(udf(cast(i as string)) AS INT):int,CAST(udf(cast(j as string)) AS INT):int,CAST(udf(cast(t as string)) AS STRING):string,CAST(udf(cast(i as string)) AS INT):int,CAST(udf(cast(k as string)) AS INT):int>
 -- !query 45 output
        0       NULL    zero    2       2
        0       NULL    zero    2       4
 -1784,11 +1784,11  struct<xxx:string,i:int,j:int,t:string,i:int,k:int>

 -- !query 46
-SELECT '' AS `xxx`, *
+SELECT udf(udf('')) AS `xxx`, udf(i), udf(j), udf(t), udf(k)
   FROM J1_TBL LEFT OUTER JOIN J2_TBL USING (i)
-  ORDER BY i, k, t
+  ORDER BY udf(udf(i)), udf(k), udf(t)
 -- !query 46 schema
-struct<xxx:string,i:int,j:int,t:string,k:int>
+struct<xxx:string,CAST(udf(cast(i as string)) AS INT):int,CAST(udf(cast(j as string)) AS INT):int,CAST(udf(cast(t as string)) AS STRING):string,CAST(udf(cast(k as string)) AS INT):int>
 -- !query 46 output
        NULL    NULL    null    NULL
        NULL    0       zero    NULL
 -1806,11 +1806,11  struct<xxx:string,i:int,j:int,t:string,k:int>

 -- !query 47
-SELECT '' AS `xxx`, *
+SELECT udf('') AS `xxx`, udf(i), udf(j), udf(t), udf(k)
   FROM J1_TBL LEFT JOIN J2_TBL USING (i)
-  ORDER BY i, k, t
+  ORDER BY udf(i), udf(udf(k)), udf(t)
 -- !query 47 schema
-struct<xxx:string,i:int,j:int,t:string,k:int>
+struct<xxx:string,CAST(udf(cast(i as string)) AS INT):int,CAST(udf(cast(j as string)) AS INT):int,CAST(udf(cast(t as string)) AS STRING):string,CAST(udf(cast(k as string)) AS INT):int>
 -- !query 47 output
        NULL    NULL    null    NULL
        NULL    0       zero    NULL
 -1828,10 +1828,10  struct<xxx:string,i:int,j:int,t:string,k:int>

 -- !query 48
-SELECT '' AS `xxx`, *
+SELECT udf('') AS `xxx`, udf(udf(i)), udf(j), udf(t), udf(k)
   FROM J1_TBL RIGHT OUTER JOIN J2_TBL USING (i)
 -- !query 48 schema
-struct<xxx:string,i:int,j:int,t:string,k:int>
+struct<xxx:string,CAST(udf(cast(cast(udf(cast(i as string)) as int) as string)) AS INT):int,CAST(udf(cast(j as string)) AS INT):int,CAST(udf(cast(t as string)) AS STRING):string,CAST(udf(cast(k as string)) AS INT):int>
 -- !query 48 output
        0       NULL    zero    NULL
        1       4       one     -1
 -1845,10 +1845,10  struct<xxx:string,i:int,j:int,t:string,k:int>

 -- !query 49
-SELECT '' AS `xxx`, *
+SELECT udf('') AS `xxx`, udf(i), udf(udf(j)), udf(t), udf(k)
   FROM J1_TBL RIGHT JOIN J2_TBL USING (i)
 -- !query 49 schema
-struct<xxx:string,i:int,j:int,t:string,k:int>
+struct<xxx:string,CAST(udf(cast(i as string)) AS INT):int,CAST(udf(cast(cast(udf(cast(j as string)) as int) as string)) AS INT):int,CAST(udf(cast(t as string)) AS STRING):string,CAST(udf(cast(k as string)) AS INT):int>
 -- !query 49 output
        0       NULL    zero    NULL
        1       4       one     -1
 -1862,11 +1862,11  struct<xxx:string,i:int,j:int,t:string,k:int>

 -- !query 50
-SELECT '' AS `xxx`, *
+SELECT udf('') AS `xxx`, udf(i), udf(j), udf(udf(t)), udf(k)
   FROM J1_TBL FULL OUTER JOIN J2_TBL USING (i)
-  ORDER BY i, k, t
+  ORDER BY udf(udf(i)), udf(k), udf(t)
 -- !query 50 schema
-struct<xxx:string,i:int,j:int,t:string,k:int>
+struct<xxx:string,CAST(udf(cast(i as string)) AS INT):int,CAST(udf(cast(j as string)) AS INT):int,CAST(udf(cast(cast(udf(cast(t as string)) as string) as string)) AS STRING):string,CAST(udf(cast(k as string)) AS INT):int>
 -- !query 50 output
        NULL    NULL    NULL    NULL
        NULL    NULL    null    NULL
 -1886,11 +1886,11  struct<xxx:string,i:int,j:int,t:string,k:int>

 -- !query 51
-SELECT '' AS `xxx`, *
+SELECT udf('') AS `xxx`, udf(i), udf(j), t, udf(udf(k))
   FROM J1_TBL FULL JOIN J2_TBL USING (i)
-  ORDER BY i, k, t
+  ORDER BY udf(udf(i)), udf(k), udf(udf(t))
 -- !query 51 schema
-struct<xxx:string,i:int,j:int,t:string,k:int>
+struct<xxx:string,CAST(udf(cast(i as string)) AS INT):int,CAST(udf(cast(j as string)) AS INT):int,t:string,CAST(udf(cast(cast(udf(cast(k as string)) as int) as string)) AS INT):int>
 -- !query 51 output
        NULL    NULL    NULL    NULL
        NULL    NULL    null    NULL
 -1910,19 +1910,19  struct<xxx:string,i:int,j:int,t:string,k:int>

 -- !query 52
-SELECT '' AS `xxx`, *
-  FROM J1_TBL LEFT JOIN J2_TBL USING (i) WHERE (k = 1)
+SELECT udf('') AS `xxx`, udf(i), udf(j), udf(t), udf(udf(k))
+  FROM J1_TBL LEFT JOIN J2_TBL USING (i) WHERE (udf(k) = 1)
 -- !query 52 schema
-struct<xxx:string,i:int,j:int,t:string,k:int>
+struct<xxx:string,CAST(udf(cast(i as string)) AS INT):int,CAST(udf(cast(j as string)) AS INT):int,CAST(udf(cast(t as string)) AS STRING):string,CAST(udf(cast(cast(udf(cast(k as string)) as int) as string)) AS INT):int>
 -- !query 52 output

 -- !query 53
-SELECT '' AS `xxx`, *
-  FROM J1_TBL LEFT JOIN J2_TBL USING (i) WHERE (i = 1)
+SELECT udf('') AS `xxx`, udf(i), udf(j), udf(t), udf(k)
+  FROM J1_TBL LEFT JOIN J2_TBL USING (i) WHERE (udf(udf(i)) = udf(1))
 -- !query 53 schema
-struct<xxx:string,i:int,j:int,t:string,k:int>
+struct<xxx:string,CAST(udf(cast(i as string)) AS INT):int,CAST(udf(cast(j as string)) AS INT):int,CAST(udf(cast(t as string)) AS STRING):string,CAST(udf(cast(k as string)) AS INT):int>
 -- !query 53 output
        1       4       one     -1

 -2020,9 +2020,9  ee        NULL    42      NULL

 -- !query 65
 SELECT * FROM
-(SELECT * FROM t2) as s2
+(SELECT udf(name) as name, t2.n FROM t2) as s2
 INNER JOIN
-(SELECT * FROM t3) s3
+(SELECT udf(udf(name)) as name, t3.n FROM t3) s3
 USING (name)
 -- !query 65 schema
 struct<name:string,n:int,n:int>
 -2033,9 +2033,9  cc        22      23

 -- !query 66
 SELECT * FROM
-(SELECT * FROM t2) as s2
+(SELECT udf(udf(name)) as name, t2.n FROM t2) as s2
 LEFT JOIN
-(SELECT * FROM t3) s3
+(SELECT udf(name) as name, t3.n FROM t3) s3
 USING (name)
 -- !query 66 schema
 struct<name:string,n:int,n:int>
 -2046,13 +2046,13  ee      42      NULL

 -- !query 67
-SELECT * FROM
+SELECT udf(name), udf(udf(s2.n)), udf(s3.n) FROM
 (SELECT * FROM t2) as s2
 FULL JOIN
 (SELECT * FROM t3) s3
 USING (name)
 -- !query 67 schema
-struct<name:string,n:int,n:int>
+struct<CAST(udf(cast(name as string)) AS STRING):string,CAST(udf(cast(cast(udf(cast(n as string)) as int) as string)) AS INT):int,CAST(udf(cast(n as string)) AS INT):int>
 -- !query 67 output
 bb     12      13
 cc     22      23
 -2062,9 +2062,9  ee        42      NULL

 -- !query 68
 SELECT * FROM
-(SELECT name, n as s2_n, 2 as s2_2 FROM t2) as s2
+(SELECT udf(udf(name)) as name, udf(n) as s2_n, udf(2) as s2_2 FROM t2) as s2
 NATURAL INNER JOIN
-(SELECT name, n as s3_n, 3 as s3_2 FROM t3) s3
+(SELECT udf(name) as name, udf(udf(n)) as s3_n, udf(3) as s3_2 FROM t3) s3
 -- !query 68 schema
 struct<name:string,s2_n:int,s2_2:int,s3_n:int,s3_2:int>
 -- !query 68 output
 -2074,9 +2074,9  cc        22      2       23      3

 -- !query 69
 SELECT * FROM
-(SELECT name, n as s2_n, 2 as s2_2 FROM t2) as s2
+(SELECT udf(name) as name, udf(udf(n)) as s2_n, 2 as s2_2 FROM t2) as s2
 NATURAL LEFT JOIN
-(SELECT name, n as s3_n, 3 as s3_2 FROM t3) s3
+(SELECT udf(udf(name)) as name, udf(n) as s3_n, 3 as s3_2 FROM t3) s3
 -- !query 69 schema
 struct<name:string,s2_n:int,s2_2:int,s3_n:int,s3_2:int>
 -- !query 69 output
 -2087,9 +2087,9  ee        42      2       NULL    NULL

 -- !query 70
 SELECT * FROM
-(SELECT name, n as s2_n, 2 as s2_2 FROM t2) as s2
+(SELECT udf(name) as name, udf(n) as s2_n, 2 as s2_2 FROM t2) as s2
 NATURAL FULL JOIN
-(SELECT name, n as s3_n, 3 as s3_2 FROM t3) s3
+(SELECT udf(udf(name)) as name, udf(udf(n)) as s3_n, 3 as s3_2 FROM t3) s3
 -- !query 70 schema
 struct<name:string,s2_n:int,s2_2:int,s3_n:int,s3_2:int>
 -- !query 70 output
 -2101,11 +2101,11  ee      42      2       NULL    NULL

 -- !query 71
 SELECT * FROM
-(SELECT name, n as s1_n, 1 as s1_1 FROM t1) as s1
+(SELECT udf(udf(name)) as name, udf(n) as s1_n, 1 as s1_1 FROM t1) as s1
 NATURAL INNER JOIN
-(SELECT name, n as s2_n, 2 as s2_2 FROM t2) as s2
+(SELECT udf(name) as name, udf(n) as s2_n, 2 as s2_2 FROM t2) as s2
 NATURAL INNER JOIN
-(SELECT name, n as s3_n, 3 as s3_2 FROM t3) s3
+(SELECT udf(udf(udf(name))) as name, udf(n) as s3_n, 3 as s3_2 FROM t3) s3
 -- !query 71 schema
 struct<name:string,s1_n:int,s1_1:int,s2_n:int,s2_2:int,s3_n:int,s3_2:int>
 -- !query 71 output
 -2114,11 +2114,11  bb      11      1       12      2       13      3

 -- !query 72
 SELECT * FROM
-(SELECT name, n as s1_n, 1 as s1_1 FROM t1) as s1
+(SELECT udf(name) as name, udf(n) as s1_n, udf(udf(1)) as s1_1 FROM t1) as s1
 NATURAL FULL JOIN
-(SELECT name, n as s2_n, 2 as s2_2 FROM t2) as s2
+(SELECT udf(name) as name, udf(udf(n)) as s2_n, udf(2) as s2_2 FROM t2) as s2
 NATURAL FULL JOIN
-(SELECT name, n as s3_n, 3 as s3_2 FROM t3) s3
+(SELECT udf(udf(name)) as name, udf(n) as s3_n, udf(3) as s3_2 FROM t3) s3
 -- !query 72 schema
 struct<name:string,s1_n:int,s1_1:int,s2_n:int,s2_2:int,s3_n:int,s3_2:int>
 -- !query 72 output
 -2129,16 +2129,16  ee      NULL    NULL    42      2       NULL    NULL

 -- !query 73
-SELECT * FROM
-(SELECT name, n as s1_n FROM t1) as s1
+SELECT name, udf(udf(s1_n)), udf(s2_n), udf(s3_n) FROM
+(SELECT name, udf(udf(n)) as s1_n FROM t1) as s1
 NATURAL FULL JOIN
   (SELECT * FROM
-    (SELECT name, n as s2_n FROM t2) as s2
+    (SELECT name, udf(n) as s2_n FROM t2) as s2
     NATURAL FULL JOIN
-    (SELECT name, n as s3_n FROM t3) as s3
+    (SELECT name, udf(udf(n)) as s3_n FROM t3) as s3
   ) ss2
 -- !query 73 schema
-struct<name:string,s1_n:int,s2_n:int,s3_n:int>
+struct<name:string,CAST(udf(cast(cast(udf(cast(s1_n as string)) as int) as string)) AS INT):int,CAST(udf(cast(s2_n as string)) AS INT):int,CAST(udf(cast(s3_n as string)) AS INT):int>
 -- !query 73 output
 bb     11      12      13
 cc     NULL    22      23
 -2151,9 +2151,9  SELECT * FROM
 (SELECT name, n as s1_n FROM t1) as s1
 NATURAL FULL JOIN
   (SELECT * FROM
-    (SELECT name, n as s2_n, 2 as s2_2 FROM t2) as s2
+    (SELECT name, udf(udf(n)) as s2_n, 2 as s2_2 FROM t2) as s2
     NATURAL FULL JOIN
-    (SELECT name, n as s3_n FROM t3) as s3
+    (SELECT name, udf(n) as s3_n FROM t3) as s3
   ) ss2
 -- !query 74 schema
 struct<name:string,s1_n:int,s2_n:int,s2_2:int,s3_n:int>
 -2165,13 +2165,13  ee      NULL    42      2       NULL

 -- !query 75
-SELECT * FROM
-  (SELECT name, n as s1_n FROM t1) as s1
+SELECT s1.name, udf(s1_n), s2.name, udf(udf(s2_n)) FROM
+  (SELECT name, udf(n) as s1_n FROM t1) as s1
 FULL JOIN
   (SELECT name, 2 as s2_n FROM t2) as s2
-ON (s1_n = s2_n)
+ON (udf(udf(s1_n)) = udf(s2_n))
 -- !query 75 schema
-struct<name:string,s1_n:int,name:string,s2_n:int>
+struct<name:string,CAST(udf(cast(s1_n as string)) AS INT):int,name:string,CAST(udf(cast(cast(udf(cast(s2_n as string)) as int) as string)) AS INT):int>
 -- !query 75 output
 NULL   NULL    bb      2
 NULL   NULL    cc      2
 -2200,9 +2200,9  struct<>

 -- !query 78
-select * from x
+select udf(udf(x1)), udf(x2) from x
 -- !query 78 schema
-struct<x1:int,x2:int>
+struct<CAST(udf(cast(cast(udf(cast(x1 as string)) as int) as string)) AS INT):int,CAST(udf(cast(x2 as string)) AS INT):int>
 -- !query 78 output
 1      11
 2      22
 -2212,9 +2212,9  struct<x1:int,x2:int>

 -- !query 79
-select * from y
+select udf(y1), udf(udf(y2)) from y
 -- !query 79 schema
-struct<y1:int,y2:int>
+struct<CAST(udf(cast(y1 as string)) AS INT):int,CAST(udf(cast(cast(udf(cast(y2 as string)) as int) as string)) AS INT):int>
 -- !query 79 output
 1      111
 2      222
 -2223,7 +2223,7  struct<y1:int,y2:int>

 -- !query 80
-select * from x left join y on (x1 = y1 and x2 is not null)
+select * from x left join y on (udf(x1) = udf(udf(y1)) and udf(x2) is not null)
 -- !query 80 schema
 struct<x1:int,x2:int,y1:int,y2:int>
 -- !query 80 output
 -2235,7 +2235,7  struct<x1:int,x2:int,y1:int,y2:int>

 -- !query 81
-select * from x left join y on (x1 = y1 and y2 is not null)
+select * from x left join y on (udf(udf(x1)) = udf(y1) and udf(y2) is not null)
 -- !query 81 schema
 struct<x1:int,x2:int,y1:int,y2:int>
 -- !query 81 output
 -2247,8 +2247,8  struct<x1:int,x2:int,y1:int,y2:int>

 -- !query 82
-select * from (x left join y on (x1 = y1)) left join x xx(xx1,xx2)
-on (x1 = xx1)
+select * from (x left join y on (udf(x1) = udf(udf(y1)))) left join x xx(xx1,xx2)
+on (udf(udf(x1)) = udf(xx1))
 -- !query 82 schema
 struct<x1:int,x2:int,y1:int,y2:int,xx1:int,xx2:int>
 -- !query 82 output
 -2260,8 +2260,8  struct<x1:int,x2:int,y1:int,y2:int,xx1:int,xx2:int>

 -- !query 83
-select * from (x left join y on (x1 = y1)) left join x xx(xx1,xx2)
-on (x1 = xx1 and x2 is not null)
+select * from (x left join y on (udf(x1) = udf(y1))) left join x xx(xx1,xx2)
+on (udf(x1) = xx1 and udf(x2) is not null)
 -- !query 83 schema
 struct<x1:int,x2:int,y1:int,y2:int,xx1:int,xx2:int>
 -- !query 83 output
 -2273,8 +2273,8  struct<x1:int,x2:int,y1:int,y2:int,xx1:int,xx2:int>

 -- !query 84
-select * from (x left join y on (x1 = y1)) left join x xx(xx1,xx2)
-on (x1 = xx1 and y2 is not null)
+select * from (x left join y on (x1 = udf(y1))) left join x xx(xx1,xx2)
+on (udf(x1) = udf(udf(xx1)) and udf(y2) is not null)
 -- !query 84 schema
 struct<x1:int,x2:int,y1:int,y2:int,xx1:int,xx2:int>
 -- !query 84 output
 -2286,8 +2286,8  struct<x1:int,x2:int,y1:int,y2:int,xx1:int,xx2:int>

 -- !query 85
-select * from (x left join y on (x1 = y1)) left join x xx(xx1,xx2)
-on (x1 = xx1 and xx2 is not null)
+select * from (x left join y on (udf(x1) = y1)) left join x xx(xx1,xx2)
+on (udf(udf(x1)) = udf(xx1) and udf(udf(xx2)) is not null)
 -- !query 85 schema
 struct<x1:int,x2:int,y1:int,y2:int,xx1:int,xx2:int>
 -- !query 85 output
 -2299,8 +2299,8  struct<x1:int,x2:int,y1:int,y2:int,xx1:int,xx2:int>

 -- !query 86
-select * from (x left join y on (x1 = y1)) left join x xx(xx1,xx2)
-on (x1 = xx1) where (x2 is not null)
+select * from (x left join y on (udf(udf(x1)) = udf(udf(y1)))) left join x xx(xx1,xx2)
+on (udf(x1) = udf(xx1)) where (udf(x2) is not null)
 -- !query 86 schema
 struct<x1:int,x2:int,y1:int,y2:int,xx1:int,xx2:int>
 -- !query 86 output
 -2310,8 +2310,8  struct<x1:int,x2:int,y1:int,y2:int,xx1:int,xx2:int>

 -- !query 87
-select * from (x left join y on (x1 = y1)) left join x xx(xx1,xx2)
-on (x1 = xx1) where (y2 is not null)
+select * from (x left join y on (udf(x1) = udf(y1))) left join x xx(xx1,xx2)
+on (udf(x1) = xx1) where (udf(y2) is not null)
 -- !query 87 schema
 struct<x1:int,x2:int,y1:int,y2:int,xx1:int,xx2:int>
 -- !query 87 output
 -2321,8 +2321,8  struct<x1:int,x2:int,y1:int,y2:int,xx1:int,xx2:int>

 -- !query 88
-select * from (x left join y on (x1 = y1)) left join x xx(xx1,xx2)
-on (x1 = xx1) where (xx2 is not null)
+select * from (x left join y on (udf(x1) = udf(y1))) left join x xx(xx1,xx2)
+on (x1 = udf(xx1)) where (xx2 is not null)
 -- !query 88 schema
 struct<x1:int,x2:int,y1:int,y2:int,xx1:int,xx2:int>
 -- !query 88 output
 -2332,75 +2332,75  struct<x1:int,x2:int,y1:int,y2:int,xx1:int,xx2:int>

 -- !query 89
-select count(*) from tenk1 a where unique1 in
-  (select unique1 from tenk1 b join tenk1 c using (unique1)
-   where b.unique2 = 42)
+select udf(udf(count(*))) from tenk1 a where udf(udf(unique1)) in
+  (select udf(unique1) from tenk1 b join tenk1 c using (unique1)
+   where udf(udf(b.unique2)) = udf(42))
 -- !query 89 schema
-struct<count(1):bigint>
+struct<CAST(udf(cast(cast(udf(cast(count(1) as string)) as bigint) as string)) AS BIGINT):bigint>
 -- !query 89 output
 1

 -- !query 90
-select count(*) from tenk1 x where
-  x.unique1 in (select a.f1 from int4_tbl a,float8_tbl b where a.f1=b.f1) and
-  x.unique1 = 0 and
-  x.unique1 in (select aa.f1 from int4_tbl aa,float8_tbl bb where aa.f1=bb.f1)
+select udf(count(*)) from tenk1 x where
+  udf(x.unique1) in (select udf(a.f1) from int4_tbl a,float8_tbl b where udf(udf(a.f1))=b.f1) and
+  udf(x.unique1) = 0 and
+  udf(x.unique1) in (select aa.f1 from int4_tbl aa,float8_tbl bb where aa.f1=udf(udf(bb.f1)))
 -- !query 90 schema
-struct<count(1):bigint>
+struct<CAST(udf(cast(count(1) as string)) AS BIGINT):bigint>
 -- !query 90 output
 1

 -- !query 91
-select count(*) from tenk1 x where
-  x.unique1 in (select a.f1 from int4_tbl a,float8_tbl b where a.f1=b.f1) and
-  x.unique1 = 0 and
-  x.unique1 in (select aa.f1 from int4_tbl aa,float8_tbl bb where aa.f1=bb.f1)
+select udf(udf(count(*))) from tenk1 x where
+  udf(x.unique1) in (select udf(a.f1) from int4_tbl a,float8_tbl b where udf(udf(a.f1))=b.f1) and
+  udf(x.unique1) = 0 and
+  udf(udf(x.unique1)) in (select udf(aa.f1) from int4_tbl aa,float8_tbl bb where udf(aa.f1)=udf(udf(bb.f1)))
 -- !query 91 schema
-struct<count(1):bigint>
+struct<CAST(udf(cast(cast(udf(cast(count(1) as string)) as bigint) as string)) AS BIGINT):bigint>
 -- !query 91 output
 1

 -- !query 92
 select * from int8_tbl i1 left join (int8_tbl i2 join
-  (select 123 as x) ss on i2.q1 = x) on i1.q2 = i2.q2
-order by 1, 2
+  (select udf(123) as x) ss on udf(udf(i2.q1)) = udf(x)) on udf(udf(i1.q2)) = udf(udf(i2.q2))
+order by udf(udf(1)), 2
 -- !query 92 schema
 struct<q1:bigint,q2:bigint,q1:bigint,q2:bigint,x:int>
 -- !query 92 output
-123    456     123     456     123
-123    4567890123456789        123     4567890123456789        123
 4567890123456789       -4567890123456789       NULL    NULL    NULL
 4567890123456789       123     NULL    NULL    NULL
+123    456     123     456     123
+123    4567890123456789        123     4567890123456789        123
 4567890123456789       4567890123456789        123     4567890123456789        123

 -- !query 93
-select count(*)
+select udf(count(*))
 from
-  (select t3.tenthous as x1, coalesce(t1.stringu1, t2.stringu1) as x2
+  (select udf(t3.tenthous) as x1, udf(coalesce(udf(t1.stringu1), udf(t2.stringu1))) as x2
    from tenk1 t1
-   left join tenk1 t2 on t1.unique1 = t2.unique1
-   join tenk1 t3 on t1.unique2 = t3.unique2) ss,
+   left join tenk1 t2 on udf(t1.unique1) = udf(t2.unique1)
+   join tenk1 t3 on t1.unique2 = udf(t3.unique2)) ss,
   tenk1 t4,
   tenk1 t5
-where t4.thousand = t5.unique1 and ss.x1 = t4.tenthous and ss.x2 = t5.stringu1
+where udf(t4.thousand) = udf(t5.unique1) and udf(udf(ss.x1)) = t4.tenthous and udf(ss.x2) = udf(udf(t5.stringu1))
 -- !query 93 schema
-struct<count(1):bigint>
+struct<CAST(udf(cast(count(1) as string)) AS BIGINT):bigint>
 -- !query 93 output
 1000

 -- !query 94
-select a.f1, b.f1, t.thousand, t.tenthous from
+select udf(a.f1), udf(b.f1), udf(t.thousand), udf(t.tenthous) from
   tenk1 t,
-  (select sum(f1)+1 as f1 from int4_tbl i4a) a,
-  (select sum(f1) as f1 from int4_tbl i4b) b
-where b.f1 = t.thousand and a.f1 = b.f1 and (a.f1+b.f1+999) = t.tenthous
+  (select udf(udf(sum(udf(f1))+1)) as f1 from int4_tbl i4a) a,
+  (select udf(sum(udf(f1))) as f1 from int4_tbl i4b) b
+where b.f1 = udf(t.thousand) and udf(a.f1) = udf(b.f1) and udf((udf(a.f1)+udf(b.f1)+999)) = udf(udf(t.tenthous))
 -- !query 94 schema
-struct<f1:bigint,f1:bigint,thousand:int,tenthous:int>
+struct<CAST(udf(cast(f1 as string)) AS BIGINT):bigint,CAST(udf(cast(f1 as string)) AS BIGINT):bigint,CAST(udf(cast(thousand as string)) AS INT):int,CAST(udf(cast(tenthous as string)) AS INT):int>
 -- !query 94 output

 -2408,8 +2408,8  struct<f1:bigint,f1:bigint,thousand:int,tenthous:int>
 -- !query 95
 select * from
   j1_tbl full join
-  (select * from j2_tbl order by j2_tbl.i desc, j2_tbl.k asc) j2_tbl
-  on j1_tbl.i = j2_tbl.i and j1_tbl.i = j2_tbl.k
+  (select * from j2_tbl order by udf(udf(j2_tbl.i)) desc, udf(j2_tbl.k) asc) j2_tbl
+  on udf(j1_tbl.i) = udf(j2_tbl.i) and udf(j1_tbl.i) = udf(j2_tbl.k)
 -- !query 95 schema
 struct<i:int,j:int,t:string,i:int,k:int>
 -- !query 95 output
 -2435,13 +2435,13  NULL    NULL    null    NULL    NULL

 -- !query 96
-select count(*) from
-  (select * from tenk1 x order by x.thousand, x.twothousand, x.fivethous) x
+select udf(count(*)) from
+  (select * from tenk1 x order by udf(x.thousand), udf(udf(x.twothousand)), x.fivethous) x
   left join
-  (select * from tenk1 y order by y.unique2) y
-  on x.thousand = y.unique2 and x.twothousand = y.hundred and x.fivethous = y.unique2
+  (select * from tenk1 y order by udf(y.unique2)) y
+  on udf(x.thousand) = y.unique2 and x.twothousand = udf(y.hundred) and x.fivethous = y.unique2
 -- !query 96 schema
-struct<count(1):bigint>
+struct<CAST(udf(cast(count(1) as string)) AS BIGINT):bigint>
 -- !query 96 output
 10000

 -2507,7 +2507,7  struct<>

 -- !query 104
-select tt1.*, tt2.* from tt1 left join tt2 on tt1.joincol = tt2.joincol
+select tt1.*, tt2.* from tt1 left join tt2 on udf(udf(tt1.joincol)) = udf(tt2.joincol)
 -- !query 104 schema
 struct<tt1_id:int,joincol:int,tt2_id:int,joincol:int>
 -- !query 104 output
 -2517,7 +2517,7  struct<tt1_id:int,joincol:int,tt2_id:int,joincol:int>

 -- !query 105
-select tt1.*, tt2.* from tt2 right join tt1 on tt1.joincol = tt2.joincol
+select tt1.*, tt2.* from tt2 right join tt1 on udf(udf(tt1.joincol)) = udf(udf(tt2.joincol))
 -- !query 105 schema
 struct<tt1_id:int,joincol:int,tt2_id:int,joincol:int>
 -- !query 105 output
 -2527,10 +2527,10  struct<tt1_id:int,joincol:int,tt2_id:int,joincol:int>

 -- !query 106
-select count(*) from tenk1 a, tenk1 b
-  where a.hundred = b.thousand and (b.fivethous % 10) < 10
+select udf(count(*)) from tenk1 a, tenk1 b
+  where udf(a.hundred) = b.thousand and udf(udf((b.fivethous % 10)) < 10)
 -- !query 106 schema
-struct<count(1):bigint>
+struct<CAST(udf(cast(count(1) as string)) AS BIGINT):bigint>
 -- !query 106 output
 100000

 -2584,14 +2584,14  struct<>

 -- !query 113
-SELECT a.f1
+SELECT udf(udf(a.f1)) as f1
 FROM tt4 a
 LEFT JOIN (
         SELECT b.f1
-        FROM tt3 b LEFT JOIN tt3 c ON (b.f1 = c.f1)
-        WHERE c.f1 IS NULL
-) AS d ON (a.f1 = d.f1)
-WHERE d.f1 IS NULL
+        FROM tt3 b LEFT JOIN tt3 c ON udf(b.f1) = udf(c.f1)
+        WHERE udf(c.f1) IS NULL
+) AS d ON udf(a.f1) = d.f1
+WHERE udf(udf(d.f1)) IS NULL
 -- !query 113 schema
 struct<f1:int>
 -- !query 113 output
 -2621,7 +2621,7  struct<>

 -- !query 116
-select * from tt5,tt6 where tt5.f1 = tt6.f1 and tt5.f1 = tt5.f2 - tt6.f2
+select * from tt5,tt6 where udf(tt5.f1) = udf(tt6.f1) and udf(tt5.f1) = udf(udf(tt5.f2) - udf(tt6.f2))
 -- !query 116 schema
 struct<f1:int,f2:int,f1:int,f2:int>
 -- !query 116 output
 -2649,12 +2649,12  struct<>

 -- !query 119
-select yy.pkyy as yy_pkyy, yy.pkxx as yy_pkxx, yya.pkyy as yya_pkyy,
-       xxa.pkxx as xxa_pkxx, xxb.pkxx as xxb_pkxx
+select udf(udf(yy.pkyy)) as yy_pkyy, udf(yy.pkxx) as yy_pkxx, udf(yya.pkyy) as yya_pkyy,
+       udf(xxa.pkxx) as xxa_pkxx, udf(xxb.pkxx) as xxb_pkxx
 from yy
-     left join (SELECT * FROM yy where pkyy = 101) as yya ON yy.pkyy = yya.pkyy
-     left join xx xxa on yya.pkxx = xxa.pkxx
-     left join xx xxb on coalesce (xxa.pkxx, 1) = xxb.pkxx
+     left join (SELECT * FROM yy where pkyy = 101) as yya ON udf(yy.pkyy) = udf(yya.pkyy)
+     left join xx xxa on udf(yya.pkxx) = udf(udf(xxa.pkxx))
+     left join xx xxb on udf(udf(coalesce (xxa.pkxx, 1))) = udf(xxb.pkxx)
 -- !query 119 schema
 struct<yy_pkyy:int,yy_pkxx:int,yya_pkyy:int,xxa_pkxx:int,xxb_pkxx:int>
 -- !query 119 output
 -2693,9 +2693,9  struct<>

 -- !query 123
 select * from
-  zt2 left join zt3 on (f2 = f3)
-      left join zt1 on (f3 = f1)
-where f2 = 53
+  zt2 left join zt3 on (udf(f2) = udf(udf(f3)))
+      left join zt1 on (udf(udf(f3)) = udf(f1))
+where udf(f2) = 53
 -- !query 123 schema
 struct<f2:int,f3:int,f1:int>
 -- !query 123 output
 -2712,9 +2712,9  struct<>

 -- !query 125
 select * from
-  zt2 left join zt3 on (f2 = f3)
-      left join zv1 on (f3 = f1)
-where f2 = 53
+  zt2 left join zt3 on (f2 = udf(f3))
+      left join zv1 on (udf(f3) = f1)
+where udf(udf(f2)) = 53
 -- !query 125 schema
 struct<f2:int,f3:int,f1:int,junk:string>
 -- !query 125 output
 -2722,12 +2722,12  struct<f2:int,f3:int,f1:int,junk:string>

 -- !query 126
-select a.unique2, a.ten, b.tenthous, b.unique2, b.hundred
-from tenk1 a left join tenk1 b on a.unique2 = b.tenthous
-where a.unique1 = 42 and
-      ((b.unique2 is null and a.ten = 2) or b.hundred = 3)
+select udf(a.unique2), udf(a.ten), udf(b.tenthous), udf(b.unique2), udf(b.hundred)
+from tenk1 a left join tenk1 b on a.unique2 = udf(b.tenthous)
+where udf(a.unique1) = 42 and
+      ((udf(b.unique2) is null and udf(a.ten) = 2) or udf(udf(b.hundred)) = udf(udf(3)))
 -- !query 126 schema
-struct<unique2:int,ten:int,tenthous:int,unique2:int,hundred:int>
+struct<CAST(udf(cast(unique2 as string)) AS INT):int,CAST(udf(cast(ten as string)) AS INT):int,CAST(udf(cast(tenthous as string)) AS INT):int,CAST(udf(cast(unique2 as string)) AS INT):int,CAST(udf(cast(hundred as string)) AS INT):int>
 -- !query 126 output

 -2749,7 +2749,7  struct<>

 -- !query 129
-select * from a left join b on i = x and i = y and x = i
+select * from a left join b on udf(i) = x and i = udf(y) and udf(x) = udf(i)
 -- !query 129 schema
 struct<i:int,x:int,y:int>
 -- !query 129 output
 -2757,11 +2757,11  struct<i:int,x:int,y:int>

 -- !query 130
-select t1.q2, count(t2.*)
-from int8_tbl t1 left join int8_tbl t2 on (t1.q2 = t2.q1)
-group by t1.q2 order by 1
+select udf(t1.q2), udf(count(t2.*))
+from int8_tbl t1 left join int8_tbl t2 on (udf(udf(t1.q2)) = t2.q1)
+group by udf(t1.q2) order by 1
 -- !query 130 schema
-struct<q2:bigint,count(q1, q2):bigint>
+struct<CAST(udf(cast(q2 as string)) AS BIGINT):bigint,CAST(udf(cast(count(q1, q2) as string)) AS BIGINT):bigint>
 -- !query 130 output
 -4567890123456789      0
 123    2
 -2770,11 +2770,11  struct<q2:bigint,count(q1, q2):bigint>

 -- !query 131
-select t1.q2, count(t2.*)
-from int8_tbl t1 left join (select * from int8_tbl) t2 on (t1.q2 = t2.q1)
-group by t1.q2 order by 1
+select udf(udf(t1.q2)), udf(count(t2.*))
+from int8_tbl t1 left join (select * from int8_tbl) t2 on (udf(udf(t1.q2)) = udf(t2.q1))
+group by udf(udf(t1.q2)) order by 1
 -- !query 131 schema
-struct<q2:bigint,count(q1, q2):bigint>
+struct<CAST(udf(cast(cast(udf(cast(q2 as string)) as bigint) as string)) AS BIGINT):bigint,CAST(udf(cast(count(q1, q2) as string)) AS BIGINT):bigint>
 -- !query 131 output
 -4567890123456789      0
 123    2
 -2783,13 +2783,13  struct<q2:bigint,count(q1, q2):bigint>

 -- !query 132
-select t1.q2, count(t2.*)
+select udf(t1.q2) as q2, udf(udf(count(t2.*)))
 from int8_tbl t1 left join
-  (select q1, case when q2=1 then 1 else q2 end as q2 from int8_tbl) t2
-  on (t1.q2 = t2.q1)
+  (select udf(q1) as q1, case when q2=1 then 1 else q2 end as q2 from int8_tbl) t2
+  on (udf(t1.q2) = udf(t2.q1))
 group by t1.q2 order by 1
 -- !query 132 schema
-struct<q2:bigint,count(q1, q2):bigint>
+struct<q2:bigint,CAST(udf(cast(cast(udf(cast(count(q1, q2) as string)) as bigint) as string)) AS BIGINT):bigint>
 -- !query 132 output
 -4567890123456789      0
 123    2
 -2828,17 +2828,17  struct<>

 -- !query 136
-select c.name, ss.code, ss.b_cnt, ss.const
+select udf(c.name), udf(ss.code), udf(ss.b_cnt), udf(ss.const)
 from c left join
   (select a.code, coalesce(b_grp.cnt, 0) as b_cnt, -1 as const
    from a left join
-     (select count(1) as cnt, b.a from b group by b.a) as b_grp
-     on a.code = b_grp.a
+     (select udf(count(1)) as cnt, b.a as a from b group by b.a) as b_grp
+     on udf(a.code) = udf(udf(b_grp.a))
   ) as ss
-  on (c.a = ss.code)
+  on (udf(udf(c.a)) = udf(ss.code))
 order by c.name
 -- !query 136 schema
-struct<name:string,code:string,b_cnt:bigint,const:int>
+struct<CAST(udf(cast(name as string)) AS STRING):string,CAST(udf(cast(code as string)) AS STRING):string,CAST(udf(cast(b_cnt as string)) AS BIGINT):bigint,CAST(udf(cast(const as string)) AS INT):int>
 -- !query 136 output
 A      p       2       -1
 B      q       0       -1
 -2852,15 +2852,15  LEFT JOIN
 ( SELECT sub3.key3, sub4.value2, COALESCE(sub4.value2, 66) as value3 FROM
     ( SELECT 1 as key3 ) sub3
     LEFT JOIN
-    ( SELECT sub5.key5, COALESCE(sub6.value1, 1) as value2 FROM
+    ( SELECT udf(sub5.key5) as key5, udf(udf(COALESCE(sub6.value1, 1))) as value2 FROM
         ( SELECT 1 as key5 ) sub5
         LEFT JOIN
         ( SELECT 2 as key6, 42 as value1 ) sub6
-        ON sub5.key5 = sub6.key6
+        ON sub5.key5 = udf(sub6.key6)
     ) sub4
-    ON sub4.key5 = sub3.key3
+    ON udf(sub4.key5) = sub3.key3
 ) sub2
-ON sub1.key1 = sub2.key3
+ON udf(udf(sub1.key1)) = udf(udf(sub2.key3))
 -- !query 137 schema
 struct<key1:int,key3:int,value2:int,value3:int>
 -- !query 137 output
 -2871,34 +2871,34  struct<key1:int,key3:int,value2:int,value3:int>
 SELECT * FROM
 ( SELECT 1 as key1 ) sub1
 LEFT JOIN
-( SELECT sub3.key3, value2, COALESCE(value2, 66) as value3 FROM
+( SELECT udf(sub3.key3) as key3, udf(value2), udf(COALESCE(value2, 66)) as value3 FROM
     ( SELECT 1 as key3 ) sub3
     LEFT JOIN
     ( SELECT sub5.key5, COALESCE(sub6.value1, 1) as value2 FROM
         ( SELECT 1 as key5 ) sub5
         LEFT JOIN
         ( SELECT 2 as key6, 42 as value1 ) sub6
-        ON sub5.key5 = sub6.key6
+        ON udf(udf(sub5.key5)) = sub6.key6
     ) sub4
     ON sub4.key5 = sub3.key3
 ) sub2
-ON sub1.key1 = sub2.key3
+ON sub1.key1 = udf(udf(sub2.key3))
 -- !query 138 schema
-struct<key1:int,key3:int,value2:int,value3:int>
+struct<key1:int,key3:int,CAST(udf(cast(value2 as string)) AS INT):int,value3:int>
 -- !query 138 output
 1      1       1       1

 -- !query 139
-SELECT qq, unique1
+SELECT udf(qq), udf(udf(unique1))
   FROM
-  ( SELECT COALESCE(q1, 0) AS qq FROM int8_tbl a ) AS ss1
+  ( SELECT udf(COALESCE(q1, 0)) AS qq FROM int8_tbl a ) AS ss1
   FULL OUTER JOIN
-  ( SELECT COALESCE(q2, -1) AS qq FROM int8_tbl b ) AS ss2
+  ( SELECT udf(udf(COALESCE(q2, -1))) AS qq FROM int8_tbl b ) AS ss2
   USING (qq)
-  INNER JOIN tenk1 c ON qq = unique2
+  INNER JOIN tenk1 c ON udf(qq) = udf(unique2)
 -- !query 139 schema
-struct<qq:bigint,unique1:int>
+struct<CAST(udf(cast(qq as string)) AS BIGINT):bigint,CAST(udf(cast(cast(udf(cast(unique1 as string)) as int) as string)) AS INT):int>
 -- !query 139 output
 123    4596
 123    4596
 -2936,19 +2936,19  struct<>

 -- !query 143
-select nt3.id
+select udf(nt3.id)
 from nt3 as nt3
   left join
-    (select nt2.*, (nt2.b1 and ss1.a3) AS b3
+    (select nt2.*, (udf(nt2.b1) and udf(ss1.a3)) AS b3
      from nt2 as nt2
        left join
-         (select nt1.*, (nt1.id is not null) as a3 from nt1) as ss1
-         on ss1.id = nt2.nt1_id
+         (select nt1.*, (udf(nt1.id) is not null) as a3 from nt1) as ss1
+         on ss1.id = udf(udf(nt2.nt1_id))
     ) as ss2
-    on ss2.id = nt3.nt2_id
-where nt3.id = 1 and ss2.b3
+    on udf(ss2.id) = nt3.nt2_id
+where udf(nt3.id) = 1 and udf(ss2.b3)
 -- !query 143 schema
-struct<id:int>
+struct<CAST(udf(cast(id as string)) AS INT):int>
 -- !query 143 output
 1

 -3003,73 +3003,73  NULL    2147483647

 -- !query 146
-select count(*) from
-  tenk1 a join tenk1 b on a.unique1 = b.unique2
-  left join tenk1 c on a.unique2 = b.unique1 and c.thousand = a.thousand
-  join int4_tbl on b.thousand = f1
+select udf(count(*)) from
+  tenk1 a join tenk1 b on udf(a.unique1) = udf(b.unique2)
+  left join tenk1 c on udf(a.unique2) = udf(b.unique1) and udf(c.thousand) = udf(udf(a.thousand))
+  join int4_tbl on udf(b.thousand) = f1
 -- !query 146 schema
-struct<count(1):bigint>
+struct<CAST(udf(cast(count(1) as string)) AS BIGINT):bigint>
 -- !query 146 output
 10

 -- !query 147
-select b.unique1 from
-  tenk1 a join tenk1 b on a.unique1 = b.unique2
-  left join tenk1 c on b.unique1 = 42 and c.thousand = a.thousand
-  join int4_tbl i1 on b.thousand = f1
-  right join int4_tbl i2 on i2.f1 = b.tenthous
-  order by 1
+select udf(b.unique1) from
+  tenk1 a join tenk1 b on udf(a.unique1) = udf(b.unique2)
+  left join tenk1 c on udf(b.unique1) = 42 and c.thousand = udf(a.thousand)
+  join int4_tbl i1 on udf(b.thousand) = udf(udf(f1))
+  right join int4_tbl i2 on udf(udf(i2.f1)) = udf(b.tenthous)
+  order by udf(1)
 -- !query 147 schema
-struct<unique1:int>
+struct<CAST(udf(cast(unique1 as string)) AS INT):int>
 -- !query 147 output
 NULL
 NULL
+0
 NULL
 NULL
-0

 -- !query 148
 select * from
 (
-  select unique1, q1, coalesce(unique1, -1) + q1 as fault
-  from int8_tbl left join tenk1 on (q2 = unique2)
+  select udf(unique1), udf(q1), udf(udf(coalesce(unique1, -1)) + udf(q1)) as fault
+  from int8_tbl left join tenk1 on (udf(q2) = udf(unique2))
 ) ss
-where fault = 122
-order by fault
+where udf(fault) = udf(122)
+order by udf(fault)
 -- !query 148 schema
-struct<unique1:int,q1:bigint,fault:bigint>
+struct<CAST(udf(cast(unique1 as string)) AS INT):int,CAST(udf(cast(q1 as string)) AS BIGINT):bigint,fault:bigint>
 -- !query 148 output
 NULL   123     122

 -- !query 149
-select q1, unique2, thousand, hundred
-  from int8_tbl a left join tenk1 b on q1 = unique2
-  where coalesce(thousand,123) = q1 and q1 = coalesce(hundred,123)
+select udf(q1), udf(unique2), udf(thousand), udf(hundred)
+  from int8_tbl a left join tenk1 b on udf(q1) = udf(unique2)
+  where udf(coalesce(thousand,123)) = udf(q1) and udf(q1) = udf(udf(coalesce(hundred,123)))
 -- !query 149 schema
-struct<q1:bigint,unique2:int,thousand:int,hundred:int>
+struct<CAST(udf(cast(q1 as string)) AS BIGINT):bigint,CAST(udf(cast(unique2 as string)) AS INT):int,CAST(udf(cast(thousand as string)) AS INT):int,CAST(udf(cast(hundred as string)) AS INT):int>
 -- !query 149 output

 -- !query 150
-select f1, unique2, case when unique2 is null then f1 else 0 end
-  from int4_tbl a left join tenk1 b on f1 = unique2
-  where (case when unique2 is null then f1 else 0 end) = 0
+select udf(f1), udf(unique2), case when udf(udf(unique2)) is null then udf(f1) else 0 end
+  from int4_tbl a left join tenk1 b on udf(f1) = udf(udf(unique2))
+  where (case when udf(unique2) is null then udf(f1) else 0 end) = 0
 -- !query 150 schema
-struct<f1:int,unique2:int,CASE WHEN (unique2 IS NULL) THEN f1 ELSE 0 END:int>
+struct<CAST(udf(cast(f1 as string)) AS INT):int,CAST(udf(cast(unique2 as string)) AS INT):int,CASE WHEN (CAST(udf(cast(cast(udf(cast(unique2 as string)) as int) as string)) AS INT) IS NULL) THEN CAST(udf(cast(f1 as string)) AS INT) ELSE 0 END:int>
 -- !query 150 output
 0      0       0

 -- !query 151
-select a.unique1, b.unique1, c.unique1, coalesce(b.twothousand, a.twothousand)
-  from tenk1 a left join tenk1 b on b.thousand = a.unique1                        left join tenk1 c on c.unique2 = coalesce(b.twothousand, a.twothousand)
-  where a.unique2 < 10 and coalesce(b.twothousand, a.twothousand) = 44
+select udf(a.unique1), udf(b.unique1), udf(c.unique1), udf(coalesce(b.twothousand, a.twothousand))
+  from tenk1 a left join tenk1 b on udf(b.thousand) = a.unique1                       left join tenk1 c on udf(c.unique2) = udf(coalesce(b.twothousand, a.twothousand))
+  where a.unique2 < udf(10) and udf(udf(coalesce(b.twothousand, a.twothousand))) = udf(44)
 -- !query 151 schema
-struct<unique1:int,unique1:int,unique1:int,coalesce(twothousand, twothousand):int>
+struct<CAST(udf(cast(unique1 as string)) AS INT):int,CAST(udf(cast(unique1 as string)) AS INT):int,CAST(udf(cast(unique1 as string)) AS INT):int,CAST(udf(cast(coalesce(twothousand, twothousand) as string)) AS INT):int>
 -- !query 151 output

 -3078,11 +3078,11  struct<unique1:int,unique1:int,unique1:int,coalesce(twothousand, twothousand):in
 select * from
   text_tbl t1
   inner join int8_tbl i8
-  on i8.q2 = 456
+  on udf(i8.q2) = udf(udf(456))
   right join text_tbl t2
-  on t1.f1 = 'doh!'
+  on udf(t1.f1) = udf(udf('doh!'))
   left join int4_tbl i4
-  on i8.q1 = i4.f1
+  on udf(udf(i8.q1)) = i4.f1
 -- !query 152 schema
 struct<f1:string,q1:bigint,q2:bigint,f1:string,f1:int>
 -- !query 152 output
 -3092,10 +3092,10  doh!    123     456     hi de ho neighbor       NULL

 -- !query 153
 select * from
-  (select 1 as id) as xx
+  (select udf(udf(1)) as id) as xx
   left join
-    (tenk1 as a1 full join (select 1 as id) as yy on (a1.unique1 = yy.id))
-  on (xx.id = coalesce(yy.id))
+    (tenk1 as a1 full join (select udf(1) as id) as yy on (udf(a1.unique1) = udf(yy.id)))
+  on (xx.id = udf(udf(coalesce(yy.id))))
 -- !query 153 schema
 struct<id:int,unique1:int,unique2:int,two:int,four:int,ten:int,twenty:int,hundred:int,thousand:int,twothousand:int,fivethous:int,tenthous:int,odd:int,even:int,stringu1:string,stringu2:string,string4:string,id:int>
 -- !query 153 output
 -3103,11 +3103,11  struct<id:int,unique1:int,unique2:int,two:int,four:int,ten:int,twenty:int,hundre

 -- !query 154
-select a.q2, b.q1
-  from int8_tbl a left join int8_tbl b on a.q2 = coalesce(b.q1, 1)
-  where coalesce(b.q1, 1) > 0
+select udf(a.q2), udf(b.q1)
+  from int8_tbl a left join int8_tbl b on udf(a.q2) = coalesce(b.q1, 1)
+  where udf(udf(coalesce(b.q1, 1)) > 0)
 -- !query 154 schema
-struct<q2:bigint,q1:bigint>
+struct<CAST(udf(cast(q2 as string)) AS BIGINT):bigint,CAST(udf(cast(q1 as string)) AS BIGINT):bigint>
 -- !query 154 output
 -4567890123456789      NULL
 123    123
 -3142,7 +3142,7  struct<>

 -- !query 157
-select p.* from parent p left join child c on (p.k = c.k)
+select p.* from parent p left join child c on (udf(p.k) = udf(c.k))
 -- !query 157 schema
 struct<k:int,pd:int>
 -- !query 157 output
 -3153,8 +3153,8  struct<k:int,pd:int>

 -- !query 158
 select p.*, linked from parent p
-  left join (select c.*, true as linked from child c) as ss
-  on (p.k = ss.k)
+  left join (select c.*, udf(udf(true)) as linked from child c) as ss
+  on (udf(p.k) = udf(udf(ss.k)))
 -- !query 158 schema
 struct<k:int,pd:int,linked:boolean>
 -- !query 158 output
 -3165,8 +3165,8  struct<k:int,pd:int,linked:boolean>

 -- !query 159
 select p.* from
-  parent p left join child c on (p.k = c.k)
-  where p.k = 1 and p.k = 2
+  parent p left join child c on (udf(p.k) = c.k)
+  where p.k = udf(1) and udf(udf(p.k)) = udf(udf(2))
 -- !query 159 schema
 struct<k:int,pd:int>
 -- !query 159 output
 -3175,8 +3175,8  struct<k:int,pd:int>

 -- !query 160
 select p.* from
-  (parent p left join child c on (p.k = c.k)) join parent x on p.k = x.k
-  where p.k = 1 and p.k = 2
+  (parent p left join child c on (udf(p.k) = c.k)) join parent x on p.k = udf(x.k)
+  where udf(p.k) = udf(1) and udf(udf(p.k)) = udf(udf(2))
 -- !query 160 schema
 struct<k:int,pd:int>
 -- !query 160 output
 -3204,7 +3204,7  struct<>

 -- !query 163
-SELECT * FROM b LEFT JOIN a ON (b.a_id = a.id) WHERE (a.id IS NULL OR a.id > 0)
+SELECT * FROM b LEFT JOIN a ON (udf(b.a_id) = udf(a.id)) WHERE (udf(udf(a.id)) IS NULL OR udf(a.id) > 0)
 -- !query 163 schema
 struct<id:int,a_id:int,id:int>
 -- !query 163 output
 -3212,7 +3212,7  struct<id:int,a_id:int,id:int>

 -- !query 164
-SELECT b.* FROM b LEFT JOIN a ON (b.a_id = a.id) WHERE (a.id IS NULL OR a.id > 0)
+SELECT b.* FROM b LEFT JOIN a ON (udf(b.a_id) = udf(a.id)) WHERE (udf(a.id) IS NULL OR udf(udf(a.id)) > 0)
 -- !query 164 schema
 struct<id:int,a_id:int>
 -- !query 164 output
 -3231,13 +3231,13  struct<>

 -- !query 166
 SELECT * FROM
-    (SELECT 1 AS x) ss1
+    (SELECT udf(1) AS x) ss1
   LEFT JOIN
-    (SELECT q1, q2, COALESCE(dat1, q1) AS y
-     FROM int8_tbl LEFT JOIN innertab ON q2 = id) ss2
+    (SELECT udf(q1), udf(q2), udf(COALESCE(dat1, q1)) AS y
+     FROM int8_tbl LEFT JOIN innertab ON udf(udf(q2)) = id) ss2
   ON true
 -- !query 166 schema
-struct<x:int,q1:bigint,q2:bigint,y:bigint>
+struct<x:int,CAST(udf(cast(q1 as string)) AS BIGINT):bigint,CAST(udf(cast(q2 as string)) AS BIGINT):bigint,y:bigint>
 -- !query 166 output
 1      123     456     123
 1      123     4567890123456789        123
 -3248,27 +3248,27  struct<x:int,q1:bigint,q2:bigint,y:bigint>

 -- !query 167
 select * from
-  int8_tbl x join (int4_tbl x cross join int4_tbl y) j on q1 = f1
+  int8_tbl x join (int4_tbl x cross join int4_tbl y) j on udf(q1) = udf(f1)
 -- !query 167 schema
 struct<>
 -- !query 167 output
 org.apache.spark.sql.AnalysisException
-Reference 'f1' is ambiguous, could be: j.f1, j.f1.; line 2 pos 63
+Reference 'f1' is ambiguous, could be: j.f1, j.f1.; line 2 pos 72

 -- !query 168
 select * from
-  int8_tbl x join (int4_tbl x cross join int4_tbl y) j on q1 = y.f1
+  int8_tbl x join (int4_tbl x cross join int4_tbl y) j on udf(q1) = udf(y.f1)
 -- !query 168 schema
 struct<>
 -- !query 168 output
 org.apache.spark.sql.AnalysisException
-cannot resolve '`y.f1`' given input columns: [j.f1, j.f1, x.q1, x.q2]; line 2 pos 63
+cannot resolve '`y.f1`' given input columns: [j.f1, j.f1, x.q1, x.q2]; line 2 pos 72

 -- !query 169
 select * from
-  int8_tbl x join (int4_tbl x cross join int4_tbl y(ff)) j on q1 = f1
+  int8_tbl x join (int4_tbl x cross join int4_tbl y(ff)) j on udf(q1) = udf(udf(f1))
 -- !query 169 schema
 struct<q1:bigint,q2:bigint,f1:int,ff:int>
 -- !query 169 output
 -3276,69 +3276,69  struct<q1:bigint,q2:bigint,f1:int,ff:int>

 -- !query 170
-select t1.uunique1 from
-  tenk1 t1 join tenk2 t2 on t1.two = t2.two
+select udf(t1.uunique1) from
+  tenk1 t1 join tenk2 t2 on t1.two = udf(t2.two)
 -- !query 170 schema
 struct<>
 -- !query 170 output
 org.apache.spark.sql.AnalysisException
-cannot resolve '`t1.uunique1`' given input columns: [t1.even, t2.even, t1.fivethous, t2.fivethous, t1.four, t2.four, t1.hundred, t2.hundred, t1.odd, t2.odd, t1.string4, t2.string4, t1.stringu1, t2.stringu1, t1.stringu2, t2.stringu2, t1.ten, t2.ten, t1.tenthous, t2.tenthous, t1.thousand, t2.thousand, t1.twenty, t2.twenty, t1.two, t2.two, t1.twothousand, t2.twothousand, t1.unique1, t2.unique1, t1.unique2, t2.unique2]; line 1 pos 7
+cannot resolve '`t1.uunique1`' given input columns: [t1.even, t2.even, t1.fivethous, t2.fivethous, t1.four, t2.four, t1.hundred, t2.hundred, t1.odd, t2.odd, t1.string4, t2.string4, t1.stringu1, t2.stringu1, t1.stringu2, t2.stringu2, t1.ten, t2.ten, t1.tenthous, t2.tenthous, t1.thousand, t2.thousand, t1.twenty, t2.twenty, t1.two, t2.two, t1.twothousand, t2.twothousand, t1.unique1, t2.unique1, t1.unique2, t2.unique2]; line 1 pos 11

 -- !query 171
-select t2.uunique1 from
-  tenk1 t1 join tenk2 t2 on t1.two = t2.two
+select udf(udf(t2.uunique1)) from
+  tenk1 t1 join tenk2 t2 on udf(t1.two) = t2.two
 -- !query 171 schema
 struct<>
 -- !query 171 output
 org.apache.spark.sql.AnalysisException
-cannot resolve '`t2.uunique1`' given input columns: [t1.even, t2.even, t1.fivethous, t2.fivethous, t1.four, t2.four, t1.hundred, t2.hundred, t1.odd, t2.odd, t1.string4, t2.string4, t1.stringu1, t2.stringu1, t1.stringu2, t2.stringu2, t1.ten, t2.ten, t1.tenthous, t2.tenthous, t1.thousand, t2.thousand, t1.twenty, t2.twenty, t1.two, t2.two, t1.twothousand, t2.twothousand, t1.unique1, t2.unique1, t1.unique2, t2.unique2]; line 1 pos 7
+cannot resolve '`t2.uunique1`' given input columns: [t1.even, t2.even, t1.fivethous, t2.fivethous, t1.four, t2.four, t1.hundred, t2.hundred, t1.odd, t2.odd, t1.string4, t2.string4, t1.stringu1, t2.stringu1, t1.stringu2, t2.stringu2, t1.ten, t2.ten, t1.tenthous, t2.tenthous, t1.thousand, t2.thousand, t1.twenty, t2.twenty, t1.two, t2.two, t1.twothousand, t2.twothousand, t1.unique1, t2.unique1, t1.unique2, t2.unique2]; line 1 pos 15

 -- !query 172
-select uunique1 from
-  tenk1 t1 join tenk2 t2 on t1.two = t2.two
+select udf(uunique1) from
+  tenk1 t1 join tenk2 t2 on udf(t1.two) = udf(t2.two)
 -- !query 172 schema
 struct<>
 -- !query 172 output
 org.apache.spark.sql.AnalysisException
-cannot resolve '`uunique1`' given input columns: [t1.even, t2.even, t1.fivethous, t2.fivethous, t1.four, t2.four, t1.hundred, t2.hundred, t1.odd, t2.odd, t1.string4, t2.string4, t1.stringu1, t2.stringu1, t1.stringu2, t2.stringu2, t1.ten, t2.ten, t1.tenthous, t2.tenthous, t1.thousand, t2.thousand, t1.twenty, t2.twenty, t1.two, t2.two, t1.twothousand, t2.twothousand, t1.unique1, t2.unique1, t1.unique2, t2.unique2]; line 1 pos 7
+cannot resolve '`uunique1`' given input columns: [t1.even, t2.even, t1.fivethous, t2.fivethous, t1.four, t2.four, t1.hundred, t2.hundred, t1.odd, t2.odd, t1.string4, t2.string4, t1.stringu1, t2.stringu1, t1.stringu2, t2.stringu2, t1.ten, t2.ten, t1.tenthous, t2.tenthous, t1.thousand, t2.thousand, t1.twenty, t2.twenty, t1.two, t2.two, t1.twothousand, t2.twothousand, t1.unique1, t2.unique1, t1.unique2, t2.unique2]; line 1 pos 11

 -- !query 173
-select f1,g from int4_tbl a, (select f1 as g) ss
+select udf(udf(f1,g)) from int4_tbl a, (select udf(udf(f1)) as g) ss
 -- !query 173 schema
 struct<>
 -- !query 173 output
 org.apache.spark.sql.AnalysisException
-cannot resolve '`f1`' given input columns: []; line 1 pos 37
+cannot resolve '`f1`' given input columns: []; line 1 pos 55

 -- !query 174
-select f1,g from int4_tbl a, (select a.f1 as g) ss
+select udf(f1,g) from int4_tbl a, (select a.f1 as g) ss
 -- !query 174 schema
 struct<>
 -- !query 174 output
 org.apache.spark.sql.AnalysisException
-cannot resolve '`a.f1`' given input columns: []; line 1 pos 37
+cannot resolve '`a.f1`' given input columns: []; line 1 pos 42

 -- !query 175
-select f1,g from int4_tbl a cross join (select f1 as g) ss
+select udf(udf(f1,g)) from int4_tbl a cross join (select udf(f1) as g) ss
 -- !query 175 schema
 struct<>
 -- !query 175 output
 org.apache.spark.sql.AnalysisException
-cannot resolve '`f1`' given input columns: []; line 1 pos 47
+cannot resolve '`f1`' given input columns: []; line 1 pos 61

 -- !query 176
-select f1,g from int4_tbl a cross join (select a.f1 as g) ss
+select udf(f1,g) from int4_tbl a cross join (select udf(udf(a.f1)) as g) ss
 -- !query 176 schema
 struct<>
 -- !query 176 output
 org.apache.spark.sql.AnalysisException
-cannot resolve '`a.f1`' given input columns: []; line 1 pos 47
+cannot resolve '`a.f1`' given input columns: []; line 1 pos 60

 -- !query 177
 -3383,8 +3383,8  struct<>

 -- !query 182
 select * from j1
-inner join j2 on j1.id1 = j2.id1 and j1.id2 = j2.id2
-where j1.id1 % 1000 = 1 and j2.id1 % 1000 = 1
+inner join j2 on udf(j1.id1) = udf(j2.id1) and udf(udf(j1.id2)) = udf(j2.id2)
+where udf(j1.id1) % 1000 = 1 and udf(udf(j2.id1) % 1000) = 1
 -- !query 182 schema
 struct<id1:int,id2:int,id1:int,id2:int>
 -- !query 182 output
```

</p>
</details>

## How was this patch tested?

Tested as guided in [SPARK-27921](https://issues.apache.org/jira/browse/SPARK-27921).

Closes #25371 from huaxingao/spark-28393.

Authored-by: Huaxin Gao <huaxing@us.ibm.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2019-08-19 20:10:56 +09:00
Wenchen Fan 97dc4c0bfc [SPARK-28744][SQL][TEST] rename SharedSQLContext to SharedSparkSession
## What changes were proposed in this pull request?

The Spark SQL test framework needs to support 2 kinds of tests:
1. tests inside Spark to test Spark itself (extends `SparkFunSuite`)
2. test outside of Spark to test Spark applications (introduced at b57ed2245c)

The class hierarchy of the major testing traits:
![image](https://user-images.githubusercontent.com/3182036/63088526-c0f0af80-bf87-11e9-9bed-c144c2486da9.png)

`PlanTestBase`, `SQLTestUtilsBase` and `SharedSparkSession` intentionally don't extend `SparkFunSuite`, so that they can be used for tests outside of Spark. Tests in Spark should extends `QueryTest` and/or `SharedSQLContext` in most cases.

However, the name is a little confusing. As a result, some test suites extend `SharedSparkSession` instead of `SharedSQLContext`. `SharedSparkSession` doesn't work well with `SparkFunSuite` as it doesn't have the special handling of thread auditing in `SharedSQLContext`. For example, you will see a warning starting with `===== POSSIBLE THREAD LEAK IN SUITE` when you run `DataFrameSelfJoinSuite`.

This PR proposes to rename `SharedSparkSession` to `SharedSparkSessionBase`, and rename `SharedSQLContext` to `SharedSparkSession`.

## How was this patch tested?

(Please explain how this patch was tested. E.g. unit tests, integration tests, manual tests)
(If this patch involves UI changes, please attach a screenshot; otherwise, remove this)

Please review https://spark.apache.org/contributing.html before opening a pull request.

Closes #25463 from cloud-fan/minor.

Authored-by: Wenchen Fan <wenchen@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2019-08-19 19:01:56 +08:00
Peter Toth f999e00e9f [SPARK-28356][SHUFFLE][FOLLOWUP] Fix case with different pre-shuffle partition numbers
### What changes were proposed in this pull request?

This PR reverts some of the latest changes in `ReduceNumShufflePartitions` to fix the case when there are different pre-shuffle partition numbers in the plan. Please see the new UT for an example.

### Why are the changes needed?
Eliminate a bug.

### Does this PR introduce any user-facing change?
Yes, some queries that failed will succeed now.

### How was this patch tested?
Added new UT.

Closes #25479 from peter-toth/SPARK-28356-followup.

Authored-by: Peter Toth <peter.toth@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2019-08-19 15:53:43 +08:00
Eyal Zituny d75a11d059 [SPARK-27330][SS] support task abort in foreach writer
## What changes were proposed in this pull request?
in order to address cases where foreach writer task is failing without calling the close() method, (for example when a task is interrupted) added the option to implement an abort() method that will be called when the task is aborted. users should handle resource cleanup (such as connections) in the abort() method

## How was this patch tested?
update existing unit tests.

Closes #24382 from eyalzit/SPARK-27330-foreach-writer-abort.

Lead-authored-by: Eyal Zituny <eyal.zituny@equalum.io>
Co-authored-by: Jungtaek Lim (HeartSaVioR) <kabhwan@gmail.com>
Co-authored-by: eyalzit <eyal.zituny@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2019-08-19 14:12:48 +08:00
shivusondur c96b6154b7 [SPARK-28390][SQL][PYTHON][TESTS][FOLLOW-UP] Update the TODO with actual blocking JIRA IDs
## What changes were proposed in this pull request?
 only todo message updated. Need to add udf() for GroupBy Tests, after resolving following jira
[SPARK-28386] and [SPARK-26741]

## How was this patch tested?
NA, only TODO message updated.

Closes #25415 from shivusondur/jiraFollowup.

Authored-by: shivusondur <shivusondur@gmail.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2019-08-19 13:01:39 +09:00
WeichenXu 4ddad79060 [SPARK-28598][SQL] Few date time manipulation functions does not provide versions supporting Column as input through the Dataframe API
## What changes were proposed in this pull request?

Add following functions:
```
def add_months(startDate: Column, numMonths: Column): Column
def date_add(start: Column, days: Column): Column
def date_sub(start: Column, days: Column): Column
```

## How was this patch tested?

UT.

Please review https://spark.apache.org/contributing.html before opening a pull request.

Closes #25334 from WeichenXu123/datefunc_impr.

Authored-by: WeichenXu <weichen.xu@databricks.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2019-08-19 11:41:13 +09:00
Dongjoon Hyun f0834d3a7f Revert "[SPARK-28527][SQL][TEST] Re-run all the tests in SQLQueryTestSuite via Thrift Server"
This reverts commit efbb035902.
2019-08-18 16:54:24 -07:00
Yuming Wang c308ab5a29 [MINOR][SQL] Make analysis error msg more meaningful on DISTINCT queries
## What changes were proposed in this pull request?

This PR makes analysis error messages more meaningful when the function does not support the modifier DISTINCT:
```sql
postgres=# select upper(distinct a) from (values('a'), ('b')) v(a);
ERROR:  DISTINCT specified, but upper is not an aggregate function
LINE 1: select upper(distinct a) from (values('a'), ('b')) v(a);

spark-sql> select upper(distinct a) from (values('a'), ('b')) v(a);
Error in query: upper does not support the modifier DISTINCT; line 1 pos 7
spark-sql>
```

After this pr:
```sql
spark-sql> select upper(distinct a) from (values('a'), ('b')) v(a);
Error in query: DISTINCT specified, but upper is not an aggregate function; line 1 pos 7
spark-sql>

```

## How was this patch tested?

Unit test

Closes #25486 from wangyum/DISTINCT.

Authored-by: Yuming Wang <yumwang@ebay.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2019-08-18 08:36:01 -07:00
Yuming Wang efbb035902 [SPARK-28527][SQL][TEST] Re-run all the tests in SQLQueryTestSuite via Thrift Server
## What changes were proposed in this pull request?

This PR build a test framework that directly re-run all the tests in `SQLQueryTestSuite` via Thrift Server. But it's a little different from `SQLQueryTestSuite`:
1. Can not support [UDF testing](44e607e921/sql/core/src/test/scala/org/apache/spark/sql/SQLQueryTestSuite.scala (L293-L297)).
2. Can not support `DESC` command and `SHOW` command because `SQLQueryTestSuite` [formatted the output](1882912cca/sql/core/src/main/scala/org/apache/spark/sql/execution/HiveResult.scala (L38-L50).).

When building this framework, found two bug:
[SPARK-28624](https://issues.apache.org/jira/browse/SPARK-28624): `make_date` is inconsistent when reading from table
[SPARK-28611](https://issues.apache.org/jira/browse/SPARK-28611): Histogram's height is different

found two features that ThriftServer can not support:
[SPARK-28636](https://issues.apache.org/jira/browse/SPARK-28636): ThriftServer can not support decimal type with negative scale
[SPARK-28637](https://issues.apache.org/jira/browse/SPARK-28637): ThriftServer can not support interval type

Also, found two inconsistent behavior:
[SPARK-28620](https://issues.apache.org/jira/browse/SPARK-28620): Double type returned for float type in Beeline/JDBC
[SPARK-28619](https://issues.apache.org/jira/browse/SPARK-28619):  The golden result file is different when tested by `bin/spark-sql`

## How was this patch tested?

N/A

Closes #25373 from wangyum/SPARK-28527.

Authored-by: Yuming Wang <yumwang@ebay.com>
Signed-off-by: gatorsmile <gatorsmile@gmail.com>
2019-08-17 19:12:50 -07:00
Gengliang Wang 92bfd9a317 [SPARK-28757][SQL] File table location should include both values of option path and paths
### What changes were proposed in this pull request?
If both options `path` and `paths` are passed to file data source v2, both values of the options should be included as the target paths.

### Why are the changes needed?
In V1 implementation, file table location includes both values of option `path` and `paths`.
In the refactoring of https://github.com/apache/spark/pull/24025, the value of option `path` is ignored if "paths" are specified. We should make it consistent with V1.

### Does this PR introduce any user-facing change?
No

### How was this patch tested?
Unit test

Closes #25473 from gengliangwang/fixPathOption.

Authored-by: Gengliang Wang <gengliang.wang@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2019-08-16 22:27:27 +08:00
pavithra c48e381214 [SPARK-28671][SQL] Throw NoSuchPermanentFunctionException for a non-exsistent permanent function in dropFunction/alterFunction
## What changes were proposed in this pull request?
**Before Fix**
When a non existent permanent function is dropped, generic NoSuchFunctionException was thrown.- which printed "This function is neither a registered temporary function nor a permanent function registered in the database" .
This creates a ambiguity when a temp function in the same name exist.

**After Fix**
 NoSuchPermanentFunctionException will be thrown, which will print
"NoSuchPermanentFunctionException:Function not found in database "

## How was this patch tested?
Unit test was run and corrected the UT.

Closes #25394 from PavithraRamachandran/funcIssue.

Lead-authored-by: pavithra <pavi.rams@gmail.com>
Co-authored-by: pavithraramachandran <pavi.rams@gmail.com>
Signed-off-by: Takeshi Yamamuro <yamamuro@apache.org>
2019-08-16 22:46:04 +09:00
Maxim Gekk 96ca734fb7 [SPARK-28745][SQL][TEST] Add benchmarks for extract()
## What changes were proposed in this pull request?

Added new benchmark `ExtractBenchmark` for the `EXTRACT(field FROM source)` function. It was executed on all currently supported values of the `field` argument:  `MILLENNIUM`, `CENTURY`, `DECADE`, `YEAR`, `ISOYEAR`, `QUARTER`, `MONTH`, `WEEK`, `DAY`, `DAYOFWEEK`, `HOUR`, `MINUTE`, `SECOND`, `MILLISECONDS`, `MICROSECONDS`, `EPOCH`. The `cast(id as timestamp)` was taken as the `source` argument.

## How was this patch tested?

By running the benchmark via:
```
$ SPARK_GENERATE_BENCHMARK_FILES=1 build/sbt "sql/test:runMain org.apache.spark.sql.execution.benchmark.ExtractBenchmark"
```

Closes #25462 from MaxGekk/extract-benchmark.

Lead-authored-by: Maxim Gekk <max.gekk@gmail.com>
Co-authored-by: Dongjoon Hyun <dhyun@apple.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2019-08-15 12:44:36 -07:00
Yuming Wang 1b416a0c77 [SPARK-27592][SQL] Set the bucketed data source table SerDe correctly
## What changes were proposed in this pull request?

Hive using incorrect **InputFormat**(`org.apache.hadoop.mapred.SequenceFileInputFormat`) to read Spark's **Parquet** bucketed data source table.
Spark side:
```sql
spark-sql> CREATE TABLE t (c1 INT, c2 INT) USING parquet CLUSTERED BY (c1) SORTED BY (c1) INTO 2 BUCKETS;
2019-04-29 17:52:05 WARN  HiveExternalCatalog:66 - Persisting bucketed data source table `default`.`t` into Hive metastore in Spark SQL specific format, which is NOT compatible with Hive.
spark-sql> DESC FORMATTED t;
c1	int	NULL
c2	int	NULL

# Detailed Table Information
Database	default
Table	t
Owner	yumwang
Created Time	Mon Apr 29 17:52:05 CST 2019
Last Access	Thu Jan 01 08:00:00 CST 1970
Created By	Spark 2.4.0
Type	MANAGED
Provider	parquet
Num Buckets	2
Bucket Columns	[`c1`]
Sort Columns	[`c1`]
Table Properties	[transient_lastDdlTime=1556531525]
Location	file:/user/hive/warehouse/t
Serde Library	org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
InputFormat	org.apache.hadoop.mapred.SequenceFileInputFormat
OutputFormat	org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat
Storage Properties	[serialization.format=1]
```
Hive side:
```sql
hive> DESC FORMATTED t;
OK
# col_name            	data_type           	comment

c1                  	int
c2                  	int

# Detailed Table Information
Database:           	default
Owner:              	root
CreateTime:         	Wed May 08 03:38:46 GMT-07:00 2019
LastAccessTime:     	UNKNOWN
Retention:          	0
Location:           	file:/user/hive/warehouse/t
Table Type:         	MANAGED_TABLE
Table Parameters:
	bucketing_version   	spark
	spark.sql.create.version	3.0.0-SNAPSHOT
	spark.sql.sources.provider	parquet
	spark.sql.sources.schema.bucketCol.0	c1
	spark.sql.sources.schema.numBucketCols	1
	spark.sql.sources.schema.numBuckets	2
	spark.sql.sources.schema.numParts	1
	spark.sql.sources.schema.numSortCols	1
	spark.sql.sources.schema.part.0	{\"type\":\"struct\",\"fields\":[{\"name\":\"c1\",\"type\":\"integer\",\"nullable\":true,\"metadata\":{}},{\"name\":\"c2\",\"type\":\"integer\",\"nullable\":true,\"metadata\":{}}]}
	spark.sql.sources.schema.sortCol.0	c1
	transient_lastDdlTime	1557311926

# Storage Information
SerDe Library:      	org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe
InputFormat:        	org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat
OutputFormat:       	org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat
Compressed:         	No
Num Buckets:        	-1
Bucket Columns:     	[]
Sort Columns:       	[]
Storage Desc Params:
	path                	file:/user/hive/warehouse/t
	serialization.format	1
```

So it's non-bucketed table at Hive side. This pr set the `SerDe` correctly so Hive can read these tables.

Related code:
33f3c48cac/sql/hive/src/main/scala/org/apache/spark/sql/hive/client/HiveClientImpl.scala (L976-L990)
f9776e3892/sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveExternalCatalog.scala (L444-L459)

## How was this patch tested?

unit tests

Closes #24486 from wangyum/SPARK-27592.

Authored-by: Yuming Wang <yumwang@ebay.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2019-08-15 17:21:13 +08:00
Burak Yavuz 0526529b31 [SPARK-28666] Support saveAsTable for V2 tables through Session Catalog
## What changes were proposed in this pull request?

We add support for the V2SessionCatalog for saveAsTable, such that V2 tables can plug in and leverage existing DataFrameWriter.saveAsTable APIs to write and create tables through the session catalog.

## How was this patch tested?

Unit tests. A lot of tests broke under hive when things were not working properly under `ResolveTables`, therefore I believe the current set of tests should be sufficient in testing the table resolution and read code paths.

Closes #25402 from brkyvz/saveAsV2.

Lead-authored-by: Burak Yavuz <brkyvz@gmail.com>
Co-authored-by: Burak Yavuz <burak@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2019-08-15 12:29:34 +08:00
Maxim Gekk 3a4afce96c [SPARK-28687][SQL] Support epoch, isoyear, milliseconds and microseconds at extract()
## What changes were proposed in this pull request?

In the PR, I propose new expressions `Epoch`, `IsoYear`, `Milliseconds` and `Microseconds`, and support additional parameters of `extract()` for feature parity with PostgreSQL (https://www.postgresql.org/docs/11/functions-datetime.html#FUNCTIONS-DATETIME-EXTRACT):

1. `epoch` - the number of seconds since 1970-01-01 00:00:00 local time in microsecond precision.
2. `isoyear` - the ISO 8601 week-numbering year that the date falls in. Each ISO 8601 week-numbering year begins with the Monday of the week containing the 4th of January.
3. `milliseconds` - the seconds field including fractional parts multiplied by 1,000.
4. `microseconds` - the seconds field including fractional parts multiplied by 1,000,000.

Here are examples:
```sql
spark-sql> SELECT EXTRACT(EPOCH FROM TIMESTAMP '2019-08-11 19:07:30.123456');
1565550450.123456
spark-sql> SELECT EXTRACT(ISOYEAR FROM DATE '2006-01-01');
2005
spark-sql> SELECT EXTRACT(MILLISECONDS FROM TIMESTAMP '2019-08-11 19:07:30.123456');
30123.456
spark-sql> SELECT EXTRACT(MICROSECONDS FROM TIMESTAMP '2019-08-11 19:07:30.123456');
30123456
```

## How was this patch tested?

Added new tests to `DateExpressionsSuite`, and uncommented existing tests in `extract.sql` and `pgSQL/date.sql`.

Closes #25408 from MaxGekk/extract-ext3.

Authored-by: Maxim Gekk <max.gekk@gmail.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2019-08-14 08:44:44 -07:00
xy_xin 2eeb25e52d [SPARK-28351][SQL] Support DELETE in DataSource V2
## What changes were proposed in this pull request?

This pr adds DELETE support for V2 datasources. As a first step, this pr only support delete by source filters:
```scala
void delete(Filter[] filters);
```
which could not deal with complicated cases like subqueries.

Since it's uncomfortable to embed the implementation of DELETE in the current V2 APIs, a new mix-in of datasource is added, which is called `SupportsMaintenance`, similar to `SupportsRead` and `SupportsWrite`.  A datasource which can be maintained means we can perform DELETE/UPDATE/MERGE/OPTIMIZE on the datasource, as long as the datasource implements the necessary mix-ins.

## How was this patch tested?

new test case.

Please review https://spark.apache.org/contributing.html before opening a pull request.

Closes #25115 from xianyinxin/SPARK-28351.

Authored-by: xy_xin <xianyin.xxy@alibaba-inc.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2019-08-14 23:38:45 +08:00
John Zhuge 391c7e8f2e [SPARK-27739][SQL] df.persist should save stats from optimized plan
## What changes were proposed in this pull request?

CacheManager.cacheQuery saves the stats from the optimized plan to cache.

## How was this patch tested?

Existing testss.

Closes #24623 from jzhuge/SPARK-27739.

Authored-by: John Zhuge <jzhuge@apache.org>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2019-08-14 19:49:53 +08:00
Edgar Rodriguez 598fcbe5ed [SPARK-28265][SQL] Add renameTable to TableCatalog API
## What changes were proposed in this pull request?

This PR adds the `renameTable` call to the `TableCatalog` API, as described in the [Table Metadata API SPIP](https://docs.google.com/document/d/1zLFiA1VuaWeVxeTDXNg8bL6GP3BVoOZBkewFtEnjEoo/edit#heading=h.m45webtwxf2d).

This PR is related to: https://github.com/apache/spark/pull/24246

## How was this patch tested?

Added  unit tests and contract tests.

Closes #25206 from edgarRd/SPARK-28265-add-rename-table-catalog-api.

Authored-by: Edgar Rodriguez <edgar.rd@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2019-08-14 14:24:13 +08:00
Dilip Biswal 331f2657d9 [SPARK-27768][SQL] Support Infinity/NaN-related float/double literals case-insensitively
## What changes were proposed in this pull request?
Here is the problem description from the JIRA.
```
When the inputs contain the constant 'infinity', Spark SQL does not generate the expected results.

SELECT avg(CAST(x AS DOUBLE)), var_pop(CAST(x AS DOUBLE))
FROM (VALUES ('1'), (CAST('infinity' AS DOUBLE))) v(x);
SELECT avg(CAST(x AS DOUBLE)), var_pop(CAST(x AS DOUBLE))
FROM (VALUES ('infinity'), ('1')) v(x);
SELECT avg(CAST(x AS DOUBLE)), var_pop(CAST(x AS DOUBLE))
FROM (VALUES ('infinity'), ('infinity')) v(x);
SELECT avg(CAST(x AS DOUBLE)), var_pop(CAST(x AS DOUBLE))
FROM (VALUES ('-infinity'), ('infinity')) v(x);
 The root cause: Spark SQL does not recognize the special constants in a case insensitive way. In PostgreSQL, they are recognized in a case insensitive way.

Link: https://www.postgresql.org/docs/9.3/datatype-numeric.html
```

In this PR, the casting code is enhanced to handle these `special` string literals in case insensitive manner.

## How was this patch tested?
Added tests in CastSuite and modified existing test suites.

Closes #25331 from dilipbiswal/double_infinity.

Authored-by: Dilip Biswal <dbiswal@us.ibm.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2019-08-13 16:48:30 -07:00
Maxim Gekk 3d85c54895 [SPARK-28700][SQL] Use DECIMAL type for sec in make_timestamp()
## What changes were proposed in this pull request?

Changed type of `sec` argument in the `make_timestamp()` function from `DOUBLE` to `DECIMAL(8, 6)`. The scale is set to 6 to cover microsecond fractions, and the precision is 2 digits for seconds + 6 digits for microsecond fraction. New type prevents losing precision in some cases, for example:

Before:
```sql
spark-sql> select make_timestamp(2019, 8, 12, 0, 0, 58.000001);
2019-08-12 00:00:58
```

After:
```sql
spark-sql> select make_timestamp(2019, 8, 12, 0, 0, 58.000001);
2019-08-12 00:00:58.000001
```

Also switching to `DECIMAL` fixes rounding `sec` towards "nearest neighbor" unless both neighbors are equidistant, in which case round up. For example:

Before:
```sql
spark-sql> select make_timestamp(2019, 8, 12, 0, 0, 0.1234567);
2019-08-12 00:00:00.123456
```

After:
```sql
spark-sql> select make_timestamp(2019, 8, 12, 0, 0, 0.1234567);
2019-08-12 00:00:00.123457
```

## How was this patch tested?

This was tested by `DateExpressionsSuite` and `pgSQL/timestamp.sql`.

Closes #25421 from MaxGekk/make_timestamp-decimal.

Authored-by: Maxim Gekk <max.gekk@gmail.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2019-08-13 15:51:50 -07:00
Maxim Gekk f04a766946 [SPARK-28718][SQL] Support field synonyms at extract
## What changes were proposed in this pull request?

In the PR, I propose additional synonyms for the `field` argument of `extract` supported by PostgreSQL. The `extract.sql` is updated to check all supported values of the `field` argument. The list of synonyms was taken from https://github.com/postgres/postgres/blob/master/src/backend/utils/adt/datetime.c .

## How was this patch tested?

By running `extract.sql` via:
```
$ build/sbt "sql/test-only *SQLQueryTestSuite -- -z extract.sql"
```

Closes #25438 from MaxGekk/extract-field-synonyms.

Authored-by: Maxim Gekk <max.gekk@gmail.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2019-08-13 15:36:28 -07:00
Yuming Wang 13b62f31cd [SPARK-28708][SQL] IsolatedClientLoader will not load hive classes from application jars on JDK9+
## What changes were proposed in this pull request?

We have 8 test cases in `HiveSparkSubmitSuite` still fail with `java.lang.ClassNotFoundException` when running on JDK9+:
```
[info] - SPARK-18989: DESC TABLE should not fail with format class not found *** FAILED *** (9 seconds, 927 milliseconds)
[info]   spark-submit returned with exit code 1.
[info]   Command line: './bin/spark-submit' '--class' 'org.apache.spark.sql.hive.SPARK_18989_CREATE_TABLE' '--name' 'SPARK-18947' '--master' 'local-cluster[2,1,1024]' '--conf' 'spark.ui.enabled=false' '--conf' 'spark.master.rest.enabled=false' '--jars' '/root/.m2/repository/org/apache/hive/hive-contrib/2.3.6-SNAPSHOT/hive-contrib-2.3.6-SNAPSHOT.jar' 'file:/root/opensource/spark/target/tmp/spark-36d27542-7b82-4962-a362-bb51ef3e457d/testJar-1565682620744.jar'
[info]
[info]   2019-08-13 00:50:22.073 - stderr> WARNING: An illegal reflective access operation has occurred
[info]   2019-08-13 00:50:22.073 - stderr> WARNING: Illegal reflective access by org.apache.spark.unsafe.Platform (file:/root/opensource/spark/common/unsafe/target/scala-2.12/classes/) to constructor java.nio.DirectByteBuffer(long,int)
[info]   2019-08-13 00:50:22.073 - stderr> WARNING: Please consider reporting this to the maintainers of org.apache.spark.unsafe.Platform
[info]   2019-08-13 00:50:22.073 - stderr> WARNING: Use --illegal-access=warn to enable warnings of further illegal reflective access operations
[info]   2019-08-13 00:50:22.073 - stderr> WARNING: All illegal access operations will be denied in a future release
[info]   2019-08-13 00:50:28.31 - stderr> Exception in thread "main" java.lang.NoClassDefFoundError: org/apache/hadoop/hive/ql/metadata/HiveException
[info]   2019-08-13 00:50:28.31 - stderr> 	at java.base/java.lang.Class.getDeclaredConstructors0(Native Method)
[info]   2019-08-13 00:50:28.31 - stderr> 	at java.base/java.lang.Class.privateGetDeclaredConstructors(Class.java:3138)
[info]   2019-08-13 00:50:28.31 - stderr> 	at java.base/java.lang.Class.getConstructors(Class.java:1944)
[info]   2019-08-13 00:50:28.31 - stderr> 	at org.apache.spark.sql.hive.client.IsolatedClientLoader.createClient(IsolatedClientLoader.scala:294)
[info]   2019-08-13 00:50:28.31 - stderr> 	at org.apache.spark.sql.hive.HiveUtils$.newClientForMetadata(HiveUtils.scala:410)
[info]   2019-08-13 00:50:28.31 - stderr> 	at org.apache.spark.sql.hive.HiveUtils$.newClientForMetadata(HiveUtils.scala:305)
[info]   2019-08-13 00:50:28.31 - stderr> 	at org.apache.spark.sql.hive.HiveExternalCatalog.client$lzycompute(HiveExternalCatalog.scala:68)
[info]   2019-08-13 00:50:28.31 - stderr> 	at org.apache.spark.sql.hive.HiveExternalCatalog.client(HiveExternalCatalog.scala:67)
[info]   2019-08-13 00:50:28.31 - stderr> 	at org.apache.spark.sql.hive.HiveExternalCatalog.$anonfun$databaseExists$1(HiveExternalCatalog.scala:221)
[info]   2019-08-13 00:50:28.31 - stderr> 	at scala.runtime.java8.JFunction0$mcZ$sp.apply(JFunction0$mcZ$sp.java:23)
[info]   2019-08-13 00:50:28.31 - stderr> 	at org.apache.spark.sql.hive.HiveExternalCatalog.withClient(HiveExternalCatalog.scala:99)
[info]   2019-08-13 00:50:28.31 - stderr> 	at org.apache.spark.sql.hive.HiveExternalCatalog.databaseExists(HiveExternalCatalog.scala:221)
[info]   2019-08-13 00:50:28.31 - stderr> 	at org.apache.spark.sql.internal.SharedState.externalCatalog$lzycompute(SharedState.scala:139)
[info]   2019-08-13 00:50:28.31 - stderr> 	at org.apache.spark.sql.internal.SharedState.externalCatalog(SharedState.scala:129)
[info]   2019-08-13 00:50:28.31 - stderr> 	at org.apache.spark.sql.hive.HiveSessionStateBuilder.externalCatalog(HiveSessionStateBuilder.scala:42)
[info]   2019-08-13 00:50:28.311 - stderr> 	at org.apache.spark.sql.hive.HiveSessionStateBuilder.$anonfun$catalog$1(HiveSessionStateBuilder.scala:57)
[info]   2019-08-13 00:50:28.311 - stderr> 	at org.apache.spark.sql.catalyst.catalog.SessionCatalog.externalCatalog$lzycompute(SessionCatalog.scala:91)
[info]   2019-08-13 00:50:28.311 - stderr> 	at org.apache.spark.sql.catalyst.catalog.SessionCatalog.externalCatalog(SessionCatalog.scala:91)
[info]   2019-08-13 00:50:28.311 - stderr> 	at org.apache.spark.sql.catalyst.catalog.SessionCatalog.databaseExists(SessionCatalog.scala:244)
[info]   2019-08-13 00:50:28.311 - stderr> 	at org.apache.spark.sql.catalyst.catalog.SessionCatalog.requireDbExists(SessionCatalog.scala:178)
[info]   2019-08-13 00:50:28.311 - stderr> 	at org.apache.spark.sql.catalyst.catalog.SessionCatalog.createTable(SessionCatalog.scala:317)
[info]   2019-08-13 00:50:28.311 - stderr> 	at org.apache.spark.sql.execution.command.CreateTableCommand.run(tables.scala:132)
[info]   2019-08-13 00:50:28.311 - stderr> 	at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:70)
[info]   2019-08-13 00:50:28.311 - stderr> 	at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:68)
[info]   2019-08-13 00:50:28.311 - stderr> 	at org.apache.spark.sql.execution.command.ExecutedCommandExec.executeCollect(commands.scala:79)
[info]   2019-08-13 00:50:28.311 - stderr> 	at org.apache.spark.sql.Dataset.$anonfun$logicalPlan$1(Dataset.scala:213)
[info]   2019-08-13 00:50:28.311 - stderr> 	at org.apache.spark.sql.Dataset.$anonfun$withAction$1(Dataset.scala:3431)
[info]   2019-08-13 00:50:28.311 - stderr> 	at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$4(SQLExecution.scala:100)
[info]   2019-08-13 00:50:28.311 - stderr> 	at org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:160)
[info]   2019-08-13 00:50:28.311 - stderr> 	at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:87)
[info]   2019-08-13 00:50:28.311 - stderr> 	at org.apache.spark.sql.Dataset.withAction(Dataset.scala:3427)
[info]   2019-08-13 00:50:28.311 - stderr> 	at org.apache.spark.sql.Dataset.<init>(Dataset.scala:213)
[info]   2019-08-13 00:50:28.311 - stderr> 	at org.apache.spark.sql.Dataset$.ofRows(Dataset.scala:95)
[info]   2019-08-13 00:50:28.311 - stderr> 	at org.apache.spark.sql.SparkSession.sql(SparkSession.scala:653)
[info]   2019-08-13 00:50:28.311 - stderr> 	at org.apache.spark.sql.hive.SPARK_18989_CREATE_TABLE$.main(HiveSparkSubmitSuite.scala:829)
[info]   2019-08-13 00:50:28.311 - stderr> 	at org.apache.spark.sql.hive.SPARK_18989_CREATE_TABLE.main(HiveSparkSubmitSuite.scala)
[info]   2019-08-13 00:50:28.311 - stderr> 	at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
[info]   2019-08-13 00:50:28.311 - stderr> 	at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
[info]   2019-08-13 00:50:28.311 - stderr> 	at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
[info]   2019-08-13 00:50:28.311 - stderr> 	at java.base/java.lang.reflect.Method.invoke(Method.java:566)
[info]   2019-08-13 00:50:28.311 - stderr> 	at org.apache.spark.deploy.JavaMainApplication.start(SparkApplication.scala:52)
[info]   2019-08-13 00:50:28.311 - stderr> 	at org.apache.spark.deploy.SparkSubmit.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:920)
[info]   2019-08-13 00:50:28.311 - stderr> 	at org.apache.spark.deploy.SparkSubmit.doRunMain$1(SparkSubmit.scala:179)
[info]   2019-08-13 00:50:28.311 - stderr> 	at org.apache.spark.deploy.SparkSubmit.submit(SparkSubmit.scala:202)
[info]   2019-08-13 00:50:28.311 - stderr> 	at org.apache.spark.deploy.SparkSubmit.doSubmit(SparkSubmit.scala:89)
[info]   2019-08-13 00:50:28.311 - stderr> 	at org.apache.spark.deploy.SparkSubmit$$anon$2.doSubmit(SparkSubmit.scala:999)
[info]   2019-08-13 00:50:28.311 - stderr> 	at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:1008)
[info]   2019-08-13 00:50:28.311 - stderr> 	at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
[info]   2019-08-13 00:50:28.311 - stderr> Caused by: java.lang.ClassNotFoundException: org.apache.hadoop.hive.ql.metadata.HiveException
[info]   2019-08-13 00:50:28.311 - stderr> 	at java.base/java.net.URLClassLoader.findClass(URLClassLoader.java:471)
[info]   2019-08-13 00:50:28.311 - stderr> 	at java.base/java.lang.ClassLoader.loadClass(ClassLoader.java:588)
[info]   2019-08-13 00:50:28.311 - stderr> 	at org.apache.spark.sql.hive.client.IsolatedClientLoader$$anon$1.doLoadClass(IsolatedClientLoader.scala:250)
[info]   2019-08-13 00:50:28.311 - stderr> 	at org.apache.spark.sql.hive.client.IsolatedClientLoader$$anon$1.loadClass(IsolatedClientLoader.scala:239)
[info]   2019-08-13 00:50:28.311 - stderr> 	at java.base/java.lang.ClassLoader.loadClass(ClassLoader.java:521)
[info]   2019-08-13 00:50:28.311 - stderr> 	... 48 more
```

Note that this pr fixes `java.lang.ClassNotFoundException`, but the test will fail again with a different reason, the Hive-side `java.lang.ClassCastException` which will be resolved in the official Hive 2.3.6 release.
```
[info] - SPARK-18989: DESC TABLE should not fail with format class not found *** FAILED *** (7 seconds, 649 milliseconds)
[info]   spark-submit returned with exit code 1.
[info]   Command line: './bin/spark-submit' '--class' 'org.apache.spark.sql.hive.SPARK_18989_CREATE_TABLE' '--name' 'SPARK-18947' '--master' 'local-cluster[2,1,1024]' '--conf' 'spark.ui.enabled=false' '--conf' 'spark.master.rest.enabled=false' '--jars' '/Users/dongjoon/.ivy2/cache/org.apache.hive/hive-contrib/jars/hive-contrib-2.3.5.jar' 'file:/Users/dongjoon/PRS/PR-25429/target/tmp/spark-48b7c936-0ec2-4311-9fb5-0de4bf86a0eb/testJar-1565710418275.jar'
[info]
[info]   2019-08-13 08:33:39.221 - stderr> WARNING: An illegal reflective access operation has occurred
[info]   2019-08-13 08:33:39.221 - stderr> WARNING: Illegal reflective access by org.apache.spark.unsafe.Platform (file:/Users/dongjoon/PRS/PR-25429/common/unsafe/target/scala-2.12/classes/) to constructor java.nio.DirectByteBuffer(long,int)
[info]   2019-08-13 08:33:39.221 - stderr> WARNING: Please consider reporting this to the maintainers of org.apache.spark.unsafe.Platform
[info]   2019-08-13 08:33:39.221 - stderr> WARNING: Use --illegal-access=warn to enable warnings of further illegal reflective access operations
[info]   2019-08-13 08:33:39.221 - stderr> WARNING: All illegal access operations will be denied in a future release
[info]   2019-08-13 08:33:43.59 - stderr> Exception in thread "main" org.apache.spark.sql.AnalysisException: java.lang.ClassCastException: class jdk.internal.loader.ClassLoaders$AppClassLoader cannot be cast to class java.net.URLClassLoader (jdk.internal.loader.ClassLoaders$AppClassLoader and java.net.URLClassLoader are in module java.base of loader 'bootstrap');
[info]   2019-08-13 08:33:43.59 - stderr> 	at org.apache.spark.sql.hive.HiveExternalCatalog.withClient(HiveExternalCatalog.scala:109)
```

## How was this patch tested?

manual tests:

1. Install [Hive 2.3.6-SNAPSHOT](https://github.com/wangyum/hive/tree/HIVE-21584-branch-2.3) to local maven repository:
```
mvn clean install -DskipTests=true
```
2. Upgrade our built-in Hive to 2.3.6-SNAPSHOT, you can checkout [this branch](https://github.com/wangyum/spark/tree/SPARK-28708-Hive-2.3.6) to test.
3. Test with hadoop-3.2:
```
build/sbt "hive/test-only *. HiveSparkSubmitSuite" -Phive -Phadoop-3.2 -Phive-thriftserver
...
[info] Run completed in 3 minutes, 8 seconds.
[info] Total number of tests run: 11
[info] Suites: completed 1, aborted 0
[info] Tests: succeeded 11, failed 0, canceled 3, ignored 0, pending 0
[info] All tests passed.
```

Closes #25429 from wangyum/SPARK-28708.

Authored-by: Yuming Wang <yumwang@ebay.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2019-08-13 11:21:19 -07:00
Yuming Wang c81da276ba [SPARK-28714][SQL][TEST] Add hive.aux.jars.path test for spark-sql shell
## What changes were proposed in this pull request?

`Utilities.addToClassPath` has been changed since [HIVE-22096](https://issues.apache.org/jira/browse/HIVE-22096), but we use it to add plugin jars:
128ea37bda/sql/hive-thriftserver/src/main/scala/org/apache/spark/sql/hive/thriftserver/SparkSQLCLIDriver.scala (L144-L147)

This PR add test for `spark-sql` adding plugin jars.

## How was this patch tested?

N/A

Closes #25435 from wangyum/SPARK-28714.

Authored-by: Yuming Wang <yumwang@ebay.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2019-08-13 09:19:58 -07:00
Liang-Chi Hsieh e6a0385289 [SPARK-28422][SQL][PYTHON] GROUPED_AGG pandas_udf should work without group by clause
## What changes were proposed in this pull request?

A GROUPED_AGG pandas python udf can't work, if without group by clause, like `select udf(id) from table`.

This doesn't match with aggregate function like sum, count..., and also dataset API like `df.agg(udf(df['id']))`.

When we parse a udf (or an aggregate function) like that from SQL syntax, it is known as a function in a project. `GlobalAggregates` rule in analysis makes such project as aggregate, by looking for aggregate expressions. At the moment, we should also look for GROUPED_AGG pandas python udf.

## How was this patch tested?

Added tests.

Closes #25352 from viirya/SPARK-28422.

Authored-by: Liang-Chi Hsieh <viirya@gmail.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2019-08-14 00:32:33 +09:00
Xingbo Jiang 3249c7ab49 [SPARK-28706][SQL] Allow cast null type to any types
## What changes were proposed in this pull request?

#25242 proposed to disallow upcasting complex data types to string type, however, upcasting from null type to any types should still be safe.

## How was this patch tested?

Add corresponding case in `CastSuite`.

Closes #25425 from jiangxb1987/nullToString.

Authored-by: Xingbo Jiang <xingbo.jiang@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2019-08-13 19:02:04 +08:00
Yuming Wang 9a7f29023e [SPARK-28383][SQL] SHOW CREATE TABLE is not supported on a temporary view
## What changes were proposed in this pull request?
It throws `Table or view not found` when showing  temporary views:
```sql
spark-sql> CREATE TEMPORARY VIEW temp_view AS SELECT 1 AS a;
spark-sql> show create table temp_view;
Error in query: Table or view 'temp_view' not found in database 'default';
```
It's not easy to support temporary views. This pr changed it to throws `SHOW CREATE TABLE is not supported on a temporary view`:
```sql
spark-sql> CREATE TEMPORARY VIEW temp_view AS SELECT 1 AS a;
spark-sql> show create table temp_view;
Error in query: SHOW CREATE TABLE is not supported on a temporary view: temp_view;
```

## How was this patch tested?

unit tests

Closes #25149 from wangyum/SPARK-28383.

Authored-by: Yuming Wang <yumwang@ebay.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2019-08-12 21:01:19 -07:00
Yuming Wang 016e1b491c [SPARK-28703][SQL][TEST] Skip HiveExternalCatalogVersionsSuite and 3 tests in HiveSparkSubmitSuite at JDK9+
## What changes were proposed in this pull request?
This PR skip more test when testing with `JAVA_9` or later:
1. Skip `HiveExternalCatalogVersionsSuite` when testing with `JAVA_9` or later because our previous version does not support `JAVA_9` or later.

2. Skip 3 tests in `HiveSparkSubmitSuite` because the `spark.sql.hive.metastore.version` of these tests is lower than `2.0`, however Datanucleus 3.x seem does not support `JAVA_9` or later. Hive upgrade Datanucleus to 4.x from Hive 2.0([HIVE-6113](https://issues.apache.org/jira/browse/HIVE-6113)):

```
[info]   Cause: org.datanucleus.exceptions.NucleusException: The java type java.lang.Long (jdbc-type="", sql-type="") cant be mapped for this datastore. No mapping is available.
[info]   at org.datanucleus.store.rdbms.mapping.RDBMSMappingManager.getDatastoreMappingClass(RDBMSMappingManager.java:1215)
[info]   at org.datanucleus.store.rdbms.mapping.RDBMSMappingManager.createDatastoreMapping(RDBMSMappingManager.java:1378)
[info]   at org.datanucleus.store.rdbms.table.AbstractClassTable.addDatastoreId(AbstractClassTable.java:392)
[info]   at org.datanucleus.store.rdbms.table.ClassTable.initializePK(ClassTable.java:1087)
[info]   at org.datanucleus.store.rdbms.table.ClassTable.preInitialize(ClassTable.java:247)
```

Please note that this exclude only the tests related to the old metastore library, some other tests of `HiveSparkSubmitSuite` still fail on JDK9+.

## How was this patch tested?

manual tests:

Test with JDK 11:
```
[info] HiveExternalCatalogVersionsSuite:
[info] - backward compatibility !!! CANCELED !!! (37 milliseconds)

[info] HiveSparkSubmitSuite:
...
[info] - SPARK-8020: set sql conf in spark conf !!! CANCELED !!! (30 milliseconds)
[info]   org.apache.commons.lang3.SystemUtils.isJavaVersionAtLeast(JAVA_9) was true (HiveSparkSubmitSuite.scala:130)
...
[info] - SPARK-9757 Persist Parquet relation with decimal column !!! CANCELED !!! (1 millisecond)
[info]   org.apache.commons.lang3.SystemUtils.isJavaVersionAtLeast(JAVA_9) was true (HiveSparkSubmitSuite.scala:168)
...
[info] - SPARK-16901: set javax.jdo.option.ConnectionURL !!! CANCELED !!! (1 millisecond)
[info]   org.apache.commons.lang3.SystemUtils.isJavaVersionAtLeast(JAVA_9) was true (HiveSparkSubmitSuite.scala:260)
...
```

Closes #25426 from wangyum/SPARK-28703.

Authored-by: Yuming Wang <yumwang@ebay.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2019-08-12 20:42:06 -07:00
Stavros Kontopoulos ec84415358 [SPARK-28280][PYTHON][SQL][TESTS][FOLLOW-UP] Add UDF cases into group by clause in 'udf-group-by.sql'
## What changes were proposed in this pull request?
This PR is a followup of a fix as described in here: https://github.com/apache/spark/pull/25215#issuecomment-517659981

<details><summary>Diff comparing to 'group-by.sql'</summary>
<p>

```diff
diff --git a/sql/core/src/test/resources/sql-tests/results/udf/udf-group-by.sql.out b/sql/core/src/test/resources/sql-tests/results/udf/udf-group-by.sql.out
index 3a5df254f2..febe47b5ba 100644
--- a/sql/core/src/test/resources/sql-tests/results/udf/udf-group-by.sql.out
+++ b/sql/core/src/test/resources/sql-tests/results/udf/udf-group-by.sql.out
 -13,26 +13,26  struct<>

 -- !query 1
-SELECT a, COUNT(b) FROM testData
+SELECT udf(a), udf(COUNT(b)) FROM testData
 -- !query 1 schema
 struct<>
 -- !query 1 output
 org.apache.spark.sql.AnalysisException
-grouping expressions sequence is empty, and 'testdata.`a`' is not an aggregate function. Wrap '(count(testdata.`b`) AS `count(b)`)' in windowing function(s) or wrap 'testdata.`a`' in first() (or first_value) if you don't care which value you get.;
+grouping expressions sequence is empty, and 'testdata.`a`' is not an aggregate function. Wrap '(CAST(udf(cast(count(b) as string)) AS BIGINT) AS `CAST(udf(cast(count(b) as string)) AS BIGINT)`)' in windowing function(s) or wrap 'testdata.`a`' in first() (or first_value) if you don't care which value you get.;

 -- !query 2
-SELECT COUNT(a), COUNT(b) FROM testData
+SELECT COUNT(udf(a)), udf(COUNT(b)) FROM testData
 -- !query 2 schema
-struct<count(a):bigint,count(b):bigint>
+struct<count(CAST(udf(cast(a as string)) AS INT)):bigint,CAST(udf(cast(count(b) as string)) AS BIGINT):bigint>
 -- !query 2 output
 7	7

 -- !query 3
-SELECT a, COUNT(b) FROM testData GROUP BY a
+SELECT udf(a), COUNT(udf(b)) FROM testData GROUP BY a
 -- !query 3 schema
-struct<a:int,count(b):bigint>
+struct<CAST(udf(cast(a as string)) AS INT):int,count(CAST(udf(cast(b as string)) AS INT)):bigint>
 -- !query 3 output
 1	2
 2	2
 -41,7 +41,7  NULL	1

 -- !query 4
-SELECT a, COUNT(b) FROM testData GROUP BY b
+SELECT udf(a), udf(COUNT(udf(b))) FROM testData GROUP BY b
 -- !query 4 schema
 struct<>
 -- !query 4 output
 -50,9 +50,9  expression 'testdata.`a`' is neither present in the group by, nor is it an aggre

 -- !query 5
-SELECT COUNT(a), COUNT(b) FROM testData GROUP BY a
+SELECT COUNT(udf(a)), COUNT(udf(b)) FROM testData GROUP BY udf(a)
 -- !query 5 schema
-struct<count(a):bigint,count(b):bigint>
+struct<count(CAST(udf(cast(a as string)) AS INT)):bigint,count(CAST(udf(cast(b as string)) AS INT)):bigint>
 -- !query 5 output
 0	1
 2	2
 -61,15 +61,15  struct<count(a):bigint,count(b):bigint>

 -- !query 6
-SELECT 'foo', COUNT(a) FROM testData GROUP BY 1
+SELECT 'foo', COUNT(udf(a)) FROM testData GROUP BY 1
 -- !query 6 schema
-struct<foo:string,count(a):bigint>
+struct<foo:string,count(CAST(udf(cast(a as string)) AS INT)):bigint>
 -- !query 6 output
 foo	7

 -- !query 7
-SELECT 'foo' FROM testData WHERE a = 0 GROUP BY 1
+SELECT 'foo' FROM testData WHERE a = 0 GROUP BY udf(1)
 -- !query 7 schema
 struct<foo:string>
 -- !query 7 output
 -77,25 +77,25  struct<foo:string>

 -- !query 8
-SELECT 'foo', APPROX_COUNT_DISTINCT(a) FROM testData WHERE a = 0 GROUP BY 1
+SELECT 'foo', udf(APPROX_COUNT_DISTINCT(udf(a))) FROM testData WHERE a = 0 GROUP BY udf(1)
 -- !query 8 schema
-struct<foo:string,approx_count_distinct(a):bigint>
+struct<foo:string,CAST(udf(cast(approx_count_distinct(cast(udf(cast(a as string)) as int), 0.05, 0, 0) as string)) AS BIGINT):bigint>
 -- !query 8 output

 -- !query 9
-SELECT 'foo', MAX(STRUCT(a)) FROM testData WHERE a = 0 GROUP BY 1
+SELECT 'foo', MAX(STRUCT(udf(a))) FROM testData WHERE a = 0 GROUP BY udf(1)
 -- !query 9 schema
-struct<foo:string,max(named_struct(a, a)):struct<a:int>>
+struct<foo:string,max(named_struct(col1, CAST(udf(cast(a as string)) AS INT))):struct<col1:int>>
 -- !query 9 output

 -- !query 10
-SELECT a + b, COUNT(b) FROM testData GROUP BY a + b
+SELECT udf(a + b), udf(COUNT(b)) FROM testData GROUP BY a + b
 -- !query 10 schema
-struct<(a + b):int,count(b):bigint>
+struct<CAST(udf(cast((a + b) as string)) AS INT):int,CAST(udf(cast(count(b) as string)) AS BIGINT):bigint>
 -- !query 10 output
 2	1
 3	2
 -105,7 +105,7  NULL	1

 -- !query 11
-SELECT a + 2, COUNT(b) FROM testData GROUP BY a + 1
+SELECT udf(a + 2), udf(COUNT(b)) FROM testData GROUP BY a + 1
 -- !query 11 schema
 struct<>
 -- !query 11 output
 -114,9 +114,9  expression 'testdata.`a`' is neither present in the group by, nor is it an aggre

 -- !query 12
-SELECT a + 1 + 1, COUNT(b) FROM testData GROUP BY a + 1
+SELECT udf(a + 1) + 1, udf(COUNT(b)) FROM testData GROUP BY udf(a + 1)
 -- !query 12 schema
-struct<((a + 1) + 1):int,count(b):bigint>
+struct<(CAST(udf(cast((a + 1) as string)) AS INT) + 1):int,CAST(udf(cast(count(b) as string)) AS BIGINT):bigint>
 -- !query 12 output
 3	2
 4	2
 -125,26 +125,26  NULL	1

 -- !query 13
-SELECT SKEWNESS(a), KURTOSIS(a), MIN(a), MAX(a), AVG(a), VARIANCE(a), STDDEV(a), SUM(a), COUNT(a)
+SELECT SKEWNESS(udf(a)), udf(KURTOSIS(a)), udf(MIN(a)), MAX(udf(a)), udf(AVG(udf(a))), udf(VARIANCE(a)), STDDEV(udf(a)), udf(SUM(a)), udf(COUNT(a))
 FROM testData
 -- !query 13 schema
-struct<skewness(CAST(a AS DOUBLE)):double,kurtosis(CAST(a AS DOUBLE)):double,min(a):int,max(a):int,avg(a):double,var_samp(CAST(a AS DOUBLE)):double,stddev_samp(CAST(a AS DOUBLE)):double,sum(a):bigint,count(a):bigint>
+struct<skewness(CAST(CAST(udf(cast(a as string)) AS INT) AS DOUBLE)):double,CAST(udf(cast(kurtosis(cast(a as double)) as string)) AS DOUBLE):double,CAST(udf(cast(min(a) as string)) AS INT):int,max(CAST(udf(cast(a as string)) AS INT)):int,CAST(udf(cast(avg(cast(cast(udf(cast(a as string)) as int) as bigint)) as string)) AS DOUBLE):double,CAST(udf(cast(var_samp(cast(a as double)) as string)) AS DOUBLE):double,stddev_samp(CAST(CAST(udf(cast(a as string)) AS INT) AS DOUBLE)):double,CAST(udf(cast(sum(cast(a as bigint)) as string)) AS BIGINT):bigint,CAST(udf(cast(count(a) as string)) AS BIGINT):bigint>
 -- !query 13 output
 -0.2723801058145729	-1.5069204152249134	1	3	2.142857142857143	0.8095238095238094	0.8997354108424372	15	7

 -- !query 14
-SELECT COUNT(DISTINCT b), COUNT(DISTINCT b, c) FROM (SELECT 1 AS a, 2 AS b, 3 AS c) GROUP BY a
+SELECT COUNT(DISTINCT udf(b)), udf(COUNT(DISTINCT b, c)) FROM (SELECT 1 AS a, 2 AS b, 3 AS c) GROUP BY udf(a)
 -- !query 14 schema
-struct<count(DISTINCT b):bigint,count(DISTINCT b, c):bigint>
+struct<count(DISTINCT CAST(udf(cast(b as string)) AS INT)):bigint,CAST(udf(cast(count(distinct b, c) as string)) AS BIGINT):bigint>
 -- !query 14 output
 1	1

 -- !query 15
-SELECT a AS k, COUNT(b) FROM testData GROUP BY k
+SELECT udf(a) AS k, COUNT(udf(b)) FROM testData GROUP BY k
 -- !query 15 schema
-struct<k:int,count(b):bigint>
+struct<k:int,count(CAST(udf(cast(b as string)) AS INT)):bigint>
 -- !query 15 output
 1	2
 2	2
 -153,21 +153,21  NULL	1

 -- !query 16
-SELECT a AS k, COUNT(b) FROM testData GROUP BY k HAVING k > 1
+SELECT a AS k, udf(COUNT(b)) FROM testData GROUP BY k HAVING k > 1
 -- !query 16 schema
-struct<k:int,count(b):bigint>
+struct<k:int,CAST(udf(cast(count(b) as string)) AS BIGINT):bigint>
 -- !query 16 output
 2	2
 3	2

 -- !query 17
-SELECT COUNT(b) AS k FROM testData GROUP BY k
+SELECT udf(COUNT(b)) AS k FROM testData GROUP BY k
 -- !query 17 schema
 struct<>
 -- !query 17 output
 org.apache.spark.sql.AnalysisException
-aggregate functions are not allowed in GROUP BY, but found count(testdata.`b`);
+aggregate functions are not allowed in GROUP BY, but found CAST(udf(cast(count(b) as string)) AS BIGINT);

 -- !query 18
 -180,7 +180,7  struct<>

 -- !query 19
-SELECT k AS a, COUNT(v) FROM testDataHasSameNameWithAlias GROUP BY a
+SELECT k AS a, udf(COUNT(udf(v))) FROM testDataHasSameNameWithAlias GROUP BY udf(a)
 -- !query 19 schema
 struct<>
 -- !query 19 output
 -197,32 +197,32  spark.sql.groupByAliases	false

 -- !query 21
-SELECT a AS k, COUNT(b) FROM testData GROUP BY k
+SELECT a AS k, udf(COUNT(udf(b))) FROM testData GROUP BY k
 -- !query 21 schema
 struct<>
 -- !query 21 output
 org.apache.spark.sql.AnalysisException
-cannot resolve '`k`' given input columns: [testdata.a, testdata.b]; line 1 pos 47
+cannot resolve '`k`' given input columns: [testdata.a, testdata.b]; line 1 pos 57

 -- !query 22
-SELECT a, COUNT(1) FROM testData WHERE false GROUP BY a
+SELECT udf(a), COUNT(udf(1)) FROM testData WHERE false GROUP BY udf(a)
 -- !query 22 schema
-struct<a:int,count(1):bigint>
+struct<CAST(udf(cast(a as string)) AS INT):int,count(CAST(udf(cast(1 as string)) AS INT)):bigint>
 -- !query 22 output

 -- !query 23
-SELECT COUNT(1) FROM testData WHERE false
+SELECT udf(COUNT(1)) FROM testData WHERE false
 -- !query 23 schema
-struct<count(1):bigint>
+struct<CAST(udf(cast(count(1) as string)) AS BIGINT):bigint>
 -- !query 23 output
 0

 -- !query 24
-SELECT 1 FROM (SELECT COUNT(1) FROM testData WHERE false) t
+SELECT 1 FROM (SELECT udf(COUNT(1)) FROM testData WHERE false) t
 -- !query 24 schema
 struct<1:int>
 -- !query 24 output
 -232,7 +232,7  struct<1:int>
 -- !query 25
 SELECT 1 from (
   SELECT 1 AS z,
-  MIN(a.x)
+  udf(MIN(a.x))
   FROM (select 1 as x) a
   WHERE false
 ) b
 -244,32 +244,32  struct<1:int>

 -- !query 26
-SELECT corr(DISTINCT x, y), corr(DISTINCT y, x), count(*)
+SELECT corr(DISTINCT x, y), udf(corr(DISTINCT y, x)), count(*)
   FROM (VALUES (1, 1), (2, 2), (2, 2)) t(x, y)
 -- !query 26 schema
-struct<corr(DISTINCT CAST(x AS DOUBLE), CAST(y AS DOUBLE)):double,corr(DISTINCT CAST(y AS DOUBLE), CAST(x AS DOUBLE)):double,count(1):bigint>
+struct<corr(DISTINCT CAST(x AS DOUBLE), CAST(y AS DOUBLE)):double,CAST(udf(cast(corr(distinct cast(y as double), cast(x as double)) as string)) AS DOUBLE):double,count(1):bigint>
 -- !query 26 output
 1.0	1.0	3

 -- !query 27
-SELECT 1 FROM range(10) HAVING true
+SELECT udf(1) FROM range(10) HAVING true
 -- !query 27 schema
-struct<1:int>
+struct<CAST(udf(cast(1 as string)) AS INT):int>
 -- !query 27 output
 1

 -- !query 28
-SELECT 1 FROM range(10) HAVING MAX(id) > 0
+SELECT udf(udf(1)) FROM range(10) HAVING MAX(id) > 0
 -- !query 28 schema
-struct<1:int>
+struct<CAST(udf(cast(cast(udf(cast(1 as string)) as int) as string)) AS INT):int>
 -- !query 28 output
 1

 -- !query 29
-SELECT id FROM range(10) HAVING id > 0
+SELECT udf(id) FROM range(10) HAVING id > 0
 -- !query 29 schema
 struct<>
 -- !query 29 output
 -291,33 +291,33  struct<>

 -- !query 31
-SELECT every(v), some(v), any(v) FROM test_agg WHERE 1 = 0
+SELECT udf(every(v)), udf(some(v)), any(v) FROM test_agg WHERE 1 = 0
 -- !query 31 schema
-struct<every(v):boolean,some(v):boolean,any(v):boolean>
+struct<CAST(udf(cast(every(v) as string)) AS BOOLEAN):boolean,CAST(udf(cast(some(v) as string)) AS BOOLEAN):boolean,any(v):boolean>
 -- !query 31 output
 NULL	NULL	NULL

 -- !query 32
-SELECT every(v), some(v), any(v) FROM test_agg WHERE k = 4
+SELECT udf(every(udf(v))), some(v), any(v) FROM test_agg WHERE k = 4
 -- !query 32 schema
-struct<every(v):boolean,some(v):boolean,any(v):boolean>
+struct<CAST(udf(cast(every(cast(udf(cast(v as string)) as boolean)) as string)) AS BOOLEAN):boolean,some(v):boolean,any(v):boolean>
 -- !query 32 output
 NULL	NULL	NULL

 -- !query 33
-SELECT every(v), some(v), any(v) FROM test_agg WHERE k = 5
+SELECT every(v), udf(some(v)), any(v) FROM test_agg WHERE k = 5
 -- !query 33 schema
-struct<every(v):boolean,some(v):boolean,any(v):boolean>
+struct<every(v):boolean,CAST(udf(cast(some(v) as string)) AS BOOLEAN):boolean,any(v):boolean>
 -- !query 33 output
 false	true	true

 -- !query 34
-SELECT k, every(v), some(v), any(v) FROM test_agg GROUP BY k
+SELECT udf(k), every(v), udf(some(v)), any(v) FROM test_agg GROUP BY udf(k)
 -- !query 34 schema
-struct<k:int,every(v):boolean,some(v):boolean,any(v):boolean>
+struct<CAST(udf(cast(k as string)) AS INT):int,every(v):boolean,CAST(udf(cast(some(v) as string)) AS BOOLEAN):boolean,any(v):boolean>
 -- !query 34 output
 1	false	true	true
 2	true	true	true
 -327,9 +327,9  struct<k:int,every(v):boolean,some(v):boolean,any(v):boolean>

 -- !query 35
-SELECT k, every(v) FROM test_agg GROUP BY k HAVING every(v) = false
+SELECT udf(k), every(v) FROM test_agg GROUP BY k HAVING every(v) = false
 -- !query 35 schema
-struct<k:int,every(v):boolean>
+struct<CAST(udf(cast(k as string)) AS INT):int,every(v):boolean>
 -- !query 35 output
 1	false
 3	false
 -337,77 +337,77  struct<k:int,every(v):boolean>

 -- !query 36
-SELECT k, every(v) FROM test_agg GROUP BY k HAVING every(v) IS NULL
+SELECT udf(k), udf(every(v)) FROM test_agg GROUP BY udf(k) HAVING every(v) IS NULL
 -- !query 36 schema
-struct<k:int,every(v):boolean>
+struct<CAST(udf(cast(k as string)) AS INT):int,CAST(udf(cast(every(v) as string)) AS BOOLEAN):boolean>
 -- !query 36 output
 4	NULL

 -- !query 37
-SELECT k,
-       Every(v) AS every
+SELECT udf(k),
+       udf(Every(v)) AS every
 FROM   test_agg
 WHERE  k = 2
        AND v IN (SELECT Any(v)
                  FROM   test_agg
                  WHERE  k = 1)
-GROUP  BY k
+GROUP  BY udf(k)
 -- !query 37 schema
-struct<k:int,every:boolean>
+struct<CAST(udf(cast(k as string)) AS INT):int,every:boolean>
 -- !query 37 output
 2	true

 -- !query 38
-SELECT k,
+SELECT udf(udf(k)),
        Every(v) AS every
 FROM   test_agg
 WHERE  k = 2
        AND v IN (SELECT Every(v)
                  FROM   test_agg
                  WHERE  k = 1)
-GROUP  BY k
+GROUP  BY udf(udf(k))
 -- !query 38 schema
-struct<k:int,every:boolean>
+struct<CAST(udf(cast(cast(udf(cast(k as string)) as int) as string)) AS INT):int,every:boolean>
 -- !query 38 output

 -- !query 39
-SELECT every(1)
+SELECT every(udf(1))
 -- !query 39 schema
 struct<>
 -- !query 39 output
 org.apache.spark.sql.AnalysisException
-cannot resolve 'every(1)' due to data type mismatch: Input to function 'every' should have been boolean, but it's [int].; line 1 pos 7
+cannot resolve 'every(CAST(udf(cast(1 as string)) AS INT))' due to data type mismatch: Input to function 'every' should have been boolean, but it's [int].; line 1 pos 7

 -- !query 40
-SELECT some(1S)
+SELECT some(udf(1S))
 -- !query 40 schema
 struct<>
 -- !query 40 output
 org.apache.spark.sql.AnalysisException
-cannot resolve 'some(1S)' due to data type mismatch: Input to function 'some' should have been boolean, but it's [smallint].; line 1 pos 7
+cannot resolve 'some(CAST(udf(cast(1 as string)) AS SMALLINT))' due to data type mismatch: Input to function 'some' should have been boolean, but it's [smallint].; line 1 pos 7

 -- !query 41
-SELECT any(1L)
+SELECT any(udf(1L))
 -- !query 41 schema
 struct<>
 -- !query 41 output
 org.apache.spark.sql.AnalysisException
-cannot resolve 'any(1L)' due to data type mismatch: Input to function 'any' should have been boolean, but it's [bigint].; line 1 pos 7
+cannot resolve 'any(CAST(udf(cast(1 as string)) AS BIGINT))' due to data type mismatch: Input to function 'any' should have been boolean, but it's [bigint].; line 1 pos 7

 -- !query 42
-SELECT every("true")
+SELECT udf(every("true"))
 -- !query 42 schema
 struct<>
 -- !query 42 output
 org.apache.spark.sql.AnalysisException
-cannot resolve 'every('true')' due to data type mismatch: Input to function 'every' should have been boolean, but it's [string].; line 1 pos 7
+cannot resolve 'every('true')' due to data type mismatch: Input to function 'every' should have been boolean, but it's [string].; line 1 pos 11

 -- !query 43
 -428,9 +428,9  struct<k:int,v:boolean,every(v) OVER (PARTITION BY k ORDER BY v ASC NULLS FIRST

 -- !query 44
-SELECT k, v, some(v) OVER (PARTITION BY k ORDER BY v) FROM test_agg
+SELECT k, udf(udf(v)), some(v) OVER (PARTITION BY k ORDER BY v) FROM test_agg
 -- !query 44 schema
-struct<k:int,v:boolean,some(v) OVER (PARTITION BY k ORDER BY v ASC NULLS FIRST RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW):boolean>
+struct<k:int,CAST(udf(cast(cast(udf(cast(v as string)) as boolean) as string)) AS BOOLEAN):boolean,some(v) OVER (PARTITION BY k ORDER BY v ASC NULLS FIRST RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW):boolean>
 -- !query 44 output
 1	false	false
 1	true	true
 -445,9 +445,9  struct<k:int,v:boolean,some(v) OVER (PARTITION BY k ORDER BY v ASC NULLS FIRST R

 -- !query 45
-SELECT k, v, any(v) OVER (PARTITION BY k ORDER BY v) FROM test_agg
+SELECT udf(udf(k)), v, any(v) OVER (PARTITION BY k ORDER BY v) FROM test_agg
 -- !query 45 schema
-struct<k:int,v:boolean,any(v) OVER (PARTITION BY k ORDER BY v ASC NULLS FIRST RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW):boolean>
+struct<CAST(udf(cast(cast(udf(cast(k as string)) as int) as string)) AS INT):int,v:boolean,any(v) OVER (PARTITION BY k ORDER BY v ASC NULLS FIRST RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW):boolean>
 -- !query 45 output
 1	false	false
 1	true	true
 -462,17 +462,17  struct<k:int,v:boolean,any(v) OVER (PARTITION BY k ORDER BY v ASC NULLS FIRST RA

 -- !query 46
-SELECT count(*) FROM test_agg HAVING count(*) > 1L
+SELECT udf(count(*)) FROM test_agg HAVING count(*) > 1L
 -- !query 46 schema
-struct<count(1):bigint>
+struct<CAST(udf(cast(count(1) as string)) AS BIGINT):bigint>
 -- !query 46 output
 10

 -- !query 47
-SELECT k, max(v) FROM test_agg GROUP BY k HAVING max(v) = true
+SELECT k, udf(max(v)) FROM test_agg GROUP BY k HAVING max(v) = true
 -- !query 47 schema
-struct<k:int,max(v):boolean>
+struct<k:int,CAST(udf(cast(max(v) as string)) AS BOOLEAN):boolean>
 -- !query 47 output
 1	true
 2	true
 -480,7 +480,7  struct<k:int,max(v):boolean>

 -- !query 48
-SELECT * FROM (SELECT COUNT(*) AS cnt FROM test_agg) WHERE cnt > 1L
+SELECT * FROM (SELECT udf(COUNT(*)) AS cnt FROM test_agg) WHERE cnt > 1L
 -- !query 48 schema
 struct<cnt:bigint>
 -- !query 48 output
 -488,7 +488,7  struct<cnt:bigint>

 -- !query 49
-SELECT count(*) FROM test_agg WHERE count(*) > 1L
+SELECT udf(count(*)) FROM test_agg WHERE count(*) > 1L
 -- !query 49 schema
 struct<>
 -- !query 49 output
 -500,7 +500,7  Invalid expressions: [count(1)];

 -- !query 50
-SELECT count(*) FROM test_agg WHERE count(*) + 1L > 1L
+SELECT udf(count(*)) FROM test_agg WHERE count(*) + 1L > 1L
 -- !query 50 schema
 struct<>
 -- !query 50 output
 -512,7 +512,7  Invalid expressions: [count(1)];

 -- !query 51
-SELECT count(*) FROM test_agg WHERE k = 1 or k = 2 or count(*) + 1L > 1L or max(k) > 1
+SELECT udf(count(*)) FROM test_agg WHERE k = 1 or k = 2 or count(*) + 1L > 1L or max(k) > 1
 -- !query 51 schema
 struct<>
 -- !query 51 output

```

</p>
</details>

## How was this patch tested?
Tested as instructed in SPARK-27921.

Closes #25360 from skonto/group-by-followup.

Authored-by: Stavros Kontopoulos <st.kontopoulos@gmail.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2019-08-13 10:06:32 +09:00
s71955 163f4a45df [SPARK-26969][SQL] Using ODBC client not able to see the query data when column datatype is decimal
## What changes were proposed in this pull request?
While processing the Rowdata in the server side ColumnValue BigDecimal type value processed by server has to converted to the HiveDecmal data type for successful processing of query using Hive ODBC client.As per current logic  corresponding to the Decimal column datatype, the Spark server uses BigDecimal, and the ODBC client uses HiveDecimal. If the data type does not match, the client fail to parse

Since this handing was missing the query executed in Hive ODBC client wont return or provides result to the user even though the decimal type column value data present.

## How was this patch tested?

Manual test report and impact assessment is done using existing test-cases

Before fix
![decimal_odbc](https://user-images.githubusercontent.com/12999161/53440179-e74a7f00-3a29-11e9-93db-83f2ae37ef16.PNG)

After Fix
![hive_odbc](https://user-images.githubusercontent.com/12999161/53679519-70e0a200-3cf3-11e9-9437-9c27d2e5056d.PNG)

Closes #23899 from sujith71955/master_decimalissue.

Authored-by: s71955 <sujithchacko.2010@gmail.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2019-08-12 15:47:59 -07:00
Maxim Gekk 6964128e25 [SPARK-28017][SPARK-28656][SQL][FOLLOW-UP] Restore comments in date.sql
## What changes were proposed in this pull request?

Restored comments in `date.sql` removed by 924d794a6f and 997d153e54 . The comments was introduced by 51379b731d .

## How was this patch tested?

By re-running `date.sql` via:
```shell
$ build/sbt "sql/test-only *SQLQueryTestSuite -- -z date.sql"
```

Closes #25422 from MaxGekk/sql-comments-followup.

Authored-by: Maxim Gekk <max.gekk@gmail.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2019-08-12 11:19:19 -07:00
Yuming Wang e5f4a106db [SPARK-28688][SQL][TEST] Skip VersionsSuite.read hive materialized view test for HMS 3.0+ on JDK9+
## What changes were proposed in this pull request?

This PR makes it skip test `read hive materialized view` since Hive 3.0 in `VersionsSuite.scala` on JDK 11 because [HIVE-19383](https://issues.apache.org/jira/browse/HIVE-19383) added [ArrayList$SubList](ae4df62795/ql/src/java/org/apache/hadoop/hive/ql/exec/SerializationUtilities.java (L383)) which is incompatible with JDK 11:
```java
java.lang.RuntimeException: java.lang.NoSuchFieldException: parentOffset
	at org.apache.hadoop.hive.ql.exec.SerializationUtilities$ArrayListSubListSerializer.<init>(SerializationUtilities.java:389)
	at org.apache.hadoop.hive.ql.exec.SerializationUtilities$1.create(SerializationUtilities.java:235)
...
```
![image](https://issues.apache.org/jira/secure/attachment/12977250/12977250_screenshot-2.png)
![image](https://issues.apache.org/jira/secure/attachment/12977249/12977249_screenshot-1.png)

## How was this patch tested?

manual tests
**Test on JDK 11**:
```
...
[info] - 2.3: sql read hive materialized view (1 second, 253 milliseconds)
...
[info] - 3.0: sql read hive materialized view !!! CANCELED !!! (31 milliseconds)
[info]   "[3.0]" did not equal "[2.3]", and org.apache.commons.lang3.SystemUtils.isJavaVersionAtLeast(JAVA_9) was true (VersionsSuite.scala:624)
...
[info] - 3.1: sql read hive materialized view !!! CANCELED !!! (0 milliseconds)
[info]   "[3.1]" did not equal "[2.3]", and org.apache.commons.lang3.SystemUtils.isJavaVersionAtLeast(JAVA_9) was true (VersionsSuite.scala:624)
...
```

**Test on JDK 1.8**:
```
...
[info] - 2.3: sql read hive materialized view (1 second, 444 milliseconds)
...
[info] - 3.0: sql read hive materialized view (3 seconds, 100 milliseconds)
...
[info] - 3.1: sql read hive materialized view (2 seconds, 941 milliseconds)
...
```

Closes #25414 from wangyum/SPARK-28688.

Authored-by: Yuming Wang <yumwang@ebay.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2019-08-12 03:37:10 -07:00
Yuming Wang 6c06eea411 [SPARK-28686][SQL][TEST] Move udf_radians from HiveCompatibilitySuite to HiveQuerySuite
## What changes were proposed in this pull request?

This PR moves `udf_radians` from `HiveCompatibilitySuite` to `HiveQuerySuite` to make it easy to test with JDK 11 because it returns different value from JDK 9:
```java
public class TestRadians {
  public static void main(String[] args) {
    System.out.println(java.lang.Math.toRadians(57.2958));
  }
}
```
```sh
[rootspark-3267648 ~]# javac TestRadians.java
[rootspark-3267648 ~]# /usr/lib/jdk-9.0.4+11/bin/java TestRadians
1.0000003575641672
[rootspark-3267648 ~]# /usr/lib/jdk-11.0.3/bin/java TestRadians
1.0000003575641672
[rootspark-3267648 ~]# /usr/lib/jdk8u222-b10/bin/java TestRadians
1.000000357564167
```

## How was this patch tested?

manual tests

Closes #25417 from wangyum/SPARK-28686.

Authored-by: Yuming Wang <yumwang@ebay.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2019-08-12 02:24:48 -07:00
Yuming Wang 58cc0df59e [SPARK-28685][SQL][TEST] Test HMS 2.0.0+ in VersionsSuite/HiveClientSuites on JDK 11
## What changes were proposed in this pull request?

It seems Datanucleus 3.x can not support JDK 11:
```java
[info]   Cause: org.datanucleus.exceptions.NucleusException: The java type java.lang.Long (jdbc-type="", sql-type="") cant be mapped for this datastore. No mapping is available.
[info]   at org.datanucleus.store.rdbms.mapping.RDBMSMappingManager.getDatastoreMappingClass(RDBMSMappingManager.java:1215)
[info]   at org.datanucleus.store.rdbms.mapping.RDBMSMappingManager.createDatastoreMapping(RDBMSMappingManager.java:1378)
[info]   at org.datanucleus.store.rdbms.table.AbstractClassTable.addDatastoreId(AbstractClassTable.java:392)
[info]   at org.datanucleus.store.rdbms.table.ClassTable.initializePK(ClassTable.java:1087)
[info]   at org.datanucleus.store.rdbms.table.ClassTable.preInitialize(ClassTable.java:247)
```

Hive upgrade Datanucleus to 4.x from Hive 2.0([HIVE-6113](https://issues.apache.org/jira/browse/HIVE-6113)). This PR makes it skip `0.12`, `0.13`, `0.14`, `1.0`, `1.1` and `1.2` when testing with JDK 11.

Note that, this pr will not fix sql read hive materialized view. It's another issue:
```
3.0: sql read hive materialized view *** FAILED *** (1 second, 521 milliseconds)
3.1: sql read hive materialized view *** FAILED *** (1 second, 536 milliseconds)
```

## How was this patch tested?

manual tests:
```shell
export JAVA_HOME="/usr/lib/jdk-11.0.3"
build/sbt "hive/test-only *.VersionsSuite *.HiveClientSuites" -Phive -Phadoop-3.2
```

Closes #25405 from wangyum/SPARK-28685.

Authored-by: Yuming Wang <yumwang@ebay.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2019-08-10 17:01:15 -07:00
Yuming Wang 47af8925b6 [SPARK-28675][SQL] Remove maskCredentials and use redactOptions
## What changes were proposed in this pull request?

This PR replaces `CatalogUtils.maskCredentials` with `SQLConf.get.redactOptions` to match other redacts.

## How was this patch tested?

unit test and manual tests:
Before this PR:
```sql
spark-sql> DESC EXTENDED test_spark_28675;
id	int	NULL

# Detailed Table Information
Database	default
Table	test_spark_28675
Owner	root
Created Time	Fri Aug 09 08:23:17 GMT-07:00 2019
Last Access	Wed Dec 31 17:00:00 GMT-07:00 1969
Created By	Spark 3.0.0-SNAPSHOT
Type	MANAGED
Provider	org.apache.spark.sql.jdbc
Location	file:/user/hive/warehouse/test_spark_28675
Serde Library	org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
InputFormat	org.apache.hadoop.mapred.SequenceFileInputFormat
OutputFormat	org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat
Storage Properties	[url=###, driver=com.mysql.jdbc.Driver, dbtable=test_spark_28675]

spark-sql> SHOW TABLE EXTENDED LIKE 'test_spark_28675';
default	test_spark_28675	false	Database: default
Table: test_spark_28675
Owner: root
Created Time: Fri Aug 09 08:23:17 GMT-07:00 2019
Last Access: Wed Dec 31 17:00:00 GMT-07:00 1969
Created By: Spark 3.0.0-SNAPSHOT
Type: MANAGED
Provider: org.apache.spark.sql.jdbc
Location: file:/user/hive/warehouse/test_spark_28675
Serde Library: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
InputFormat: org.apache.hadoop.mapred.SequenceFileInputFormat
OutputFormat: org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat
Storage Properties: [url=###, driver=com.mysql.jdbc.Driver, dbtable=test_spark_28675]
Schema: root
 |-- id: integer (nullable = true)

```

After this PR:
```sql
spark-sql> DESC EXTENDED test_spark_28675;
id	int	NULL

# Detailed Table Information
Database	default
Table	test_spark_28675
Owner	root
Created Time	Fri Aug 09 08:19:49 GMT-07:00 2019
Last Access	Wed Dec 31 17:00:00 GMT-07:00 1969
Created By	Spark 3.0.0-SNAPSHOT
Type	MANAGED
Provider	org.apache.spark.sql.jdbc
Location	file:/user/hive/warehouse/test_spark_28675
Serde Library	org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
InputFormat	org.apache.hadoop.mapred.SequenceFileInputFormat
OutputFormat	org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat
Storage Properties	[url=*********(redacted), driver=com.mysql.jdbc.Driver, dbtable=test_spark_28675]

spark-sql> SHOW TABLE EXTENDED LIKE 'test_spark_28675';
default	test_spark_28675	false	Database: default
Table: test_spark_28675
Owner: root
Created Time: Fri Aug 09 08:19:49 GMT-07:00 2019
Last Access: Wed Dec 31 17:00:00 GMT-07:00 1969
Created By: Spark 3.0.0-SNAPSHOT
Type: MANAGED
Provider: org.apache.spark.sql.jdbc
Location: file:/user/hive/warehouse/test_spark_28675
Serde Library: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
InputFormat: org.apache.hadoop.mapred.SequenceFileInputFormat
OutputFormat: org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat
Storage Properties: [url=*********(redacted), driver=com.mysql.jdbc.Driver, dbtable=test_spark_28675]
Schema: root
 |-- id: integer (nullable = true)
```

Closes #25395 from wangyum/SPARK-28675.

Authored-by: Yuming Wang <yumwang@ebay.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2019-08-10 16:45:59 -07:00
younggyu chun 8535df7261 [MINOR] Fix typos in comments and replace an explicit type with <>
## What changes were proposed in this pull request?
This PR fixed typos in comments and replace the explicit type with '<>' for Java 8+.

## How was this patch tested?
Manually tested.

Closes #25338 from younggyuchun/younggyu.

Authored-by: younggyu chun <younggyuchun@gmail.com>
Signed-off-by: Sean Owen <sean.owen@databricks.com>
2019-08-10 16:47:11 -05:00
Maxim Gekk 924d794a6f [SPARK-28656][SQL] Support millennium, century and decade at extract()
## What changes were proposed in this pull request?

In the PR, I propose new expressions `Millennium`, `Century` and `Decade`, and support additional parameters of `extract()` for feature parity with PostgreSQL (https://www.postgresql.org/docs/11/functions-datetime.html#FUNCTIONS-DATETIME-EXTRACT):

1. `millennium` - the current millennium for given date (or a timestamp implicitly casted to a date). For example, years in the 1900s are in the second millennium. The third millennium started _January 1, 2001_.
2. `century` - the current millennium for given date (or timestamp). The first century starts at 0001-01-01 AD.
3. `decade` - the current decade for given date (or timestamp). Actually, this is the year field divided by 10.

Here are examples:
```sql
spark-sql> SELECT EXTRACT(MILLENNIUM FROM DATE '1981-01-19');
2
spark-sql> SELECT EXTRACT(CENTURY FROM DATE '1981-01-19');
20
spark-sql> SELECT EXTRACT(DECADE FROM DATE '1981-01-19');
198
```

## How was this patch tested?

Added new tests to `DateExpressionsSuite` and uncommented existing tests in `pgSQL/date.sql`.

Closes #25388 from MaxGekk/extract-ext2.

Authored-by: Maxim Gekk <max.gekk@gmail.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2019-08-09 11:18:50 -07:00
gengjiaan 5159876415 [SPARK-28077][SQL][TEST][FOLLOW-UP] Enable Overlay function tests
## What changes were proposed in this pull request?

This PR is a follow-up to https://github.com/apache/spark/pull/24918

## How was this patch tested?

Pass the Jenkins with the newly update test files.

Closes #25393 from beliefer/enable-overlay-tests.

Authored-by: gengjiaan <gengjiaan@360.cn>
Signed-off-by: Takeshi Yamamuro <yamamuro@apache.org>
2019-08-09 19:05:41 +09:00
Shixiong Zhu 5bb69945e4 [SPARK-28651][SS] Force the schema of Streaming file source to be nullable
## What changes were proposed in this pull request?

Right now, batch DataFrame always changes the schema to nullable automatically (See this line: 325bc8e9c6/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/DataSource.scala (L399)). But streaming file source is missing this.

This PR updates the streaming file source schema to force it be nullable. I also added a flag `spark.sql.streaming.fileSource.schema.forceNullable` to disable this change since some users may rely on the old behavior.

## How was this patch tested?

The new unit test.

Closes #25382 from zsxwing/SPARK-28651.

Authored-by: Shixiong Zhu <zsxwing@gmail.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2019-08-09 18:54:55 +09:00
Burak Yavuz 5368eaa2fc [SPARK-28565][SQL] DataFrameWriter saveAsTable support for V2 catalogs
## What changes were proposed in this pull request?

Adds support for V2 catalogs and the V2SessionCatalog for V2 tables for saveAsTable.
If the table can resolve through the V2SessionCatalog, we use SaveMode for datasource v1 for backwards compatibility to select the code path we're going to hit.

Depending on the SaveMode:
 - SaveMode.Append:
     a) If table exists: Use AppendData.byName
     b) If table doesn't exist, use CTAS (ignoreIfExists = false)
 - SaveMode.Overwrite: Use RTAS (orCreate = true)
 - SaveMode.Ignore: Use CTAS (ignoreIfExists = true)
 - SaveMode.ErrorIfExists: Use CTAS (ignoreIfExists = false)

## How was this patch tested?

Unit tests in DataSourceV2DataFrameSuite

Closes #25330 from brkyvz/saveAsTable.

Lead-authored-by: Burak Yavuz <brkyvz@gmail.com>
Co-authored-by: Burak Yavuz <burak@databricks.com>
Signed-off-by: Burak Yavuz <brkyvz@gmail.com>
2019-08-08 22:30:00 -07:00
Maxim Gekk 997d153e54 [SPARK-28017][SQL] Support additional levels of truncations by DATE_TRUNC/TRUNC
## What changes were proposed in this pull request?

I propose new levels of truncations for the `date_trunc()` and `trunc()` functions:
1. `MICROSECOND` and `MILLISECOND` truncate values of the `TIMESTAMP` type to microsecond and millisecond precision.
2. `DECADE`, `CENTURY` and `MILLENNIUM` truncate dates/timestamps to lowest date of current decade/century/millennium.

Also the `WEEK` and `QUARTER` levels have been supported by the `trunc()` function.

The function is implemented similarly to `date_trunc` in PostgreSQL: https://www.postgresql.org/docs/11/functions-datetime.html#FUNCTIONS-DATETIME-TRUNC to maintain feature parity with it.

Here are examples of `TRUNC`:
```sql
spark-sql> SELECT TRUNC('2015-10-27', 'DECADE');
2010-01-01
spark-sql> set spark.sql.datetime.java8API.enabled=true;
spark.sql.datetime.java8API.enabled	true
spark-sql> SELECT TRUNC('1999-10-27', 'millennium');
1001-01-01
```
Examples of `DATE_TRUNC`:
```sql
spark-sql> SELECT DATE_TRUNC('CENTURY', '2015-03-05T09:32:05.123456');
2001-01-01T00:00:00Z
```

## How was this patch tested?

Added new tests to `DateTimeUtilsSuite`, `DateExpressionsSuite` and `DateFunctionsSuite`, and uncommented existing tests in `pgSQL/date.sql`.

Closes #25336 from MaxGekk/date_truct-ext.

Authored-by: Maxim Gekk <max.gekk@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2019-08-09 12:29:44 +08:00
Burak Yavuz c80430f5c9 [SPARK-28572][SQL] Simple analyzer checks for v2 table creation code paths
## What changes were proposed in this pull request?

Adds checks around:
 - The existence of transforms in the table schema (even in nested fields)
 - Duplications of transforms
 - Case sensitivity checks around column names
in the V2 table creation code paths.

## How was this patch tested?

Unit tests.

Closes #25305 from brkyvz/v2CreateTable.

Authored-by: Burak Yavuz <brkyvz@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2019-08-09 12:04:28 +08:00
Yuming Wang 2580c1bfe2 [SPARK-28660][SQL][TEST] Port AGGREGATES.sql [Part 4]
## What changes were proposed in this pull request?

This PR is to port AGGREGATES.sql from PostgreSQL regression tests. https://github.com/postgres/postgres/blob/REL_12_BETA2/src/test/regress/sql/aggregates.sql#L607-L997

The expected results can be found in the link: https://github.com/postgres/postgres/blob/REL_12_BETA2/src/test/regress/expected/aggregates.out#L1615-L2289

When porting the test cases, found five PostgreSQL specific features that do not exist in Spark SQL:

[SPARK-27980](https://issues.apache.org/jira/browse/SPARK-27980): Ordered-Set Aggregate Functions
[SPARK-28661](https://issues.apache.org/jira/browse/SPARK-28661): Hypothetical-Set Aggregate Functions
[SPARK-28382](https://issues.apache.org/jira/browse/SPARK-28382): Array Functions: unnest
[SPARK-28663](https://issues.apache.org/jira/browse/SPARK-28663): Aggregate Functions for Statistics
[SPARK-28664](https://issues.apache.org/jira/browse/SPARK-28664): ORDER BY in aggregate function
[SPARK-28669](https://issues.apache.org/jira/browse/SPARK-28669): System Information Functions

## How was this patch tested?

N/A

Closes #25392 from wangyum/SPARK-28660.

Authored-by: Yuming Wang <yumwang@ebay.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2019-08-08 16:39:32 -07:00