## What changes were proposed in this pull request?
This PR adds the ability to parse SQL (hexadecimal) binary literals (AKA bit strings). They follow the syntax `X'[Hexadecimal Characters]+'`; for example, `X'01AB'` would create the binary array `0x01AB`.
If an odd number of hexadecimal characters is passed, the upper 4 bits of the initial byte are kept empty and the lower 4 bits are filled using the first character. For example, `X'1C7'` would create the binary array `0x01C7`.
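As an illustration of this padding rule (a Python sketch for clarity, not the Scala parser code):

```python
def parse_hex_literal(literal: str) -> bytes:
    """Parse a SQL binary literal of the form X'...' into a byte array.

    An odd number of hexadecimal characters is left-padded with a zero
    nibble, so the upper 4 bits of the initial byte stay empty.
    """
    assert literal.upper().startswith("X'") and literal.endswith("'")
    digits = literal[2:-1]
    if len(digits) % 2 == 1:
        digits = "0" + digits
    return bytes.fromhex(digits)

parse_hex_literal("X'01AB'")  # -> b'\x01\xab'
parse_hex_literal("X'1C7'")   # -> b'\x01\xc7'
```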
Binary data (`Array[Byte]`) does not have proper `hashCode` and `equals` implementations. This made comparing `Literal`s containing binary data a pain. I have updated `Literal.hashCode` and `Literal.equals` to deal properly with binary data.
## How was this patch tested?
Added tests to the `ExpressionParserSuite`, `SQLQueryTestSuite` and `ExpressionSQLBuilderSuite`.
Author: Herman van Hovell <hvanhovell@databricks.com>
Closes#14832 from hvanhovell/SPARK-17263.
## What changes were proposed in this pull request?
This PR implements aggregation function `percentile_approx`. Function `percentile_approx` returns the approximate percentile(s) of a column at the given percentage(s). A percentile is a watermark value below which a given percentage of the column values fall. For example, the percentile of column `col` at percentage 50% is the median value of column `col`.
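To pin down that definition (the semantics only, not the approximation algorithm), an exact percentile can be sketched in Python as the value at the given rank of the sorted column:

```python
import math

def exact_percentile(values, percentage):
    """Exact percentile: the smallest value whose rank covers the given
    fraction of the column. percentile_approx trades this exactness for
    a bounded-memory summary."""
    ordered = sorted(values)
    rank = max(int(math.ceil(percentage * len(ordered))) - 1, 0)
    return ordered[rank]

exact_percentile([1, 2, 3, 4, 5], 0.5)  # -> 3 (the median)
```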
### Syntax:
```
# Returns percentile at a given percentage value. The approximation error can be reduced by increasing parameter accuracy, at the cost of memory.
percentile_approx(col, percentage [, accuracy])
# Returns percentile value array at given percentage value array
percentile_approx(col, array(percentage1 [, percentage2]...) [, accuracy])
```
### Features:
1. This function supports partial aggregation.
2. The memory consumption is bounded. The larger the `accuracy` parameter we choose, the smaller the error we get. The default accuracy value is 10000, to match the Hive default setting. Choose a smaller value for a smaller memory footprint.
3. This function supports window function aggregation.
### Example usages:
```
## Returns the 25th percentile value, with default accuracy
SELECT percentile_approx(col, 0.25) FROM table
## Returns an array of percentile values (25th, 50th, 75th), with default accuracy
SELECT percentile_approx(col, array(0.25, 0.5, 0.75)) FROM table
## Returns 25th percentile value, with custom accuracy value 100, larger accuracy parameter yields smaller approximation error
SELECT percentile_approx(col, 0.25, 100) FROM table
## Returns the 25th, and 50th percentile values, with custom accuracy value 100
SELECT percentile_approx(col, array(0.25, 0.5), 100) FROM table
```
### NOTE:
1. The `percentile_approx` implementation is different from Hive's, so the result returned for the same query may be slightly different from Hive's. This implementation uses `QuantileSummaries` as the underlying probabilistic data structure, and mainly follows the paper *Space-efficient Online Computation of Quantile Summaries* by Greenwald, Michael and Khanna, Sanjeev (http://dx.doi.org/10.1145/375663.375670).
2. The current implementation of `QuantileSummaries` doesn't support automatic compression. This PR has a rule to do compression automatically at the caller side, but it may not be optimal.
## How was this patch tested?
Unit tests and SQL query tests.
## Acknowledgement
1. This PR's work is based on lw-lin's PR https://github.com/apache/spark/pull/14298, with improvements such as supporting partial aggregation and fixing an out-of-memory issue.
Author: Sean Zhong <seanzhong@databricks.com>
Closes#14868 from clockfly/appro_percentile_try_2.
## What changes were proposed in this pull request?
This PR eliminates redundant casts from an `ArrayType` with `containsNull = false` or a `MapType` with `valueContainsNull = false`.
For example, in the `ArrayType` case, the current implementation leaves a cast `cast(value#63 as array<double>).toDoubleArray`. However, we can eliminate `cast(value#63 as array<double>)` if we know `value#63` does not include `null`. This PR applies this elimination for `ArrayType` and `MapType` in `SimplifyCasts` at the plan optimization phase.
In summary, we got 1.2-1.3x performance improvements over the code before applying this PR.
Here are performance results of benchmark programs:
```
test("Read array in Dataset") {
import sparkSession.implicits._
val iters = 5
val n = 1024 * 1024
val rows = 15
val benchmark = new Benchmark("Read primitive array", n)
val rand = new Random(511)
val intDS = sparkSession.sparkContext.parallelize(0 until rows, 1)
.map(i => Array.tabulate(n)(i => i)).toDS()
intDS.count() // force to create ds
val lastElement = n - 1
val randElement = rand.nextInt(lastElement)
benchmark.addCase(s"Read int array in Dataset", numIters = iters)(iter => {
val idx0 = randElement
val idx1 = lastElement
intDS.map(a => a(0) + a(idx0) + a(idx1)).collect
})
val doubleDS = sparkSession.sparkContext.parallelize(0 until rows, 1)
.map(i => Array.tabulate(n)(i => i.toDouble)).toDS()
doubleDS.count() // force to create ds
benchmark.addCase(s"Read double array in Dataset", numIters = iters)(iter => {
val idx0 = randElement
val idx1 = lastElement
doubleDS.map(a => a(0) + a(idx0) + a(idx1)).collect
})
benchmark.run()
}
Java HotSpot(TM) 64-Bit Server VM 1.8.0_92-b14 on Mac OS X 10.10.4
Intel(R) Core(TM) i5-5257U CPU 2.70GHz
without this PR
Read primitive array: Best/Avg Time(ms) Rate(M/s) Per Row(ns) Relative
------------------------------------------------------------------------------------------------
Read int array in Dataset 525 / 690 2.0 500.9 1.0X
Read double array in Dataset 947 / 1209 1.1 902.7 0.6X
with this PR
Read primitive array: Best/Avg Time(ms) Rate(M/s) Per Row(ns) Relative
------------------------------------------------------------------------------------------------
Read int array in Dataset 400 / 492 2.6 381.5 1.0X
Read double array in Dataset 788 / 870 1.3 751.4 0.5X
```
An example program that originally caused this performance issue.
```
val ds = Seq(Array(1.0, 2.0, 3.0), Array(4.0, 5.0, 6.0)).toDS()
val ds2 = ds.map(p => {
var s = 0.0
for (i <- 0 to 2) { s += p(i) }
s
})
ds2.show
ds2.explain(true)
```
Plans before this PR
```
== Parsed Logical Plan ==
'SerializeFromObject [input[0, double, true] AS value#68]
+- 'MapElements <function1>, obj#67: double
+- 'DeserializeToObject unresolveddeserializer(upcast(getcolumnbyordinal(0, ArrayType(DoubleType,false)), ArrayType(DoubleType,false), - root class: "scala.Array").toDoubleArray), obj#66: [D
+- LocalRelation [value#63]
== Analyzed Logical Plan ==
value: double
SerializeFromObject [input[0, double, true] AS value#68]
+- MapElements <function1>, obj#67: double
+- DeserializeToObject cast(value#63 as array<double>).toDoubleArray, obj#66: [D
+- LocalRelation [value#63]
== Optimized Logical Plan ==
SerializeFromObject [input[0, double, true] AS value#68]
+- MapElements <function1>, obj#67: double
+- DeserializeToObject cast(value#63 as array<double>).toDoubleArray, obj#66: [D
+- LocalRelation [value#63]
== Physical Plan ==
*SerializeFromObject [input[0, double, true] AS value#68]
+- *MapElements <function1>, obj#67: double
+- *DeserializeToObject cast(value#63 as array<double>).toDoubleArray, obj#66: [D
+- LocalTableScan [value#63]
```
Plans after this PR
```
== Parsed Logical Plan ==
'SerializeFromObject [input[0, double, true] AS value#6]
+- 'MapElements <function1>, obj#5: double
+- 'DeserializeToObject unresolveddeserializer(upcast(getcolumnbyordinal(0, ArrayType(DoubleType,false)), ArrayType(DoubleType,false), - root class: "scala.Array").toDoubleArray), obj#4: [D
+- LocalRelation [value#1]
== Analyzed Logical Plan ==
value: double
SerializeFromObject [input[0, double, true] AS value#6]
+- MapElements <function1>, obj#5: double
+- DeserializeToObject cast(value#1 as array<double>).toDoubleArray, obj#4: [D
+- LocalRelation [value#1]
== Optimized Logical Plan ==
SerializeFromObject [input[0, double, true] AS value#6]
+- MapElements <function1>, obj#5: double
+- DeserializeToObject value#1.toDoubleArray, obj#4: [D
+- LocalRelation [value#1]
== Physical Plan ==
*SerializeFromObject [input[0, double, true] AS value#6]
+- *MapElements <function1>, obj#5: double
+- *DeserializeToObject value#1.toDoubleArray, obj#4: [D
+- LocalTableScan [value#1]
```
## How was this patch tested?
Tested by new test cases in `SimplifyCastsSuite`
Author: Kazuaki Ishizaki <ishizaki@jp.ibm.com>
Closes#13704 from kiszk/SPARK-15985.
### What changes were proposed in this pull request?
Hive Index tables are not supported by Spark SQL. Thus, we issue an exception when users try to access Hive Index tables. When the internal function `tableExists` tries to access Hive Index tables, it always gets the same error message: `Hive index table is not supported`. This message could be confusing to users, since their SQL operations could be completely unrelated to Hive Index tables. For example, when users try to alter a table to a new name and there exists an index table with the same name, the expected exception should be a `TableAlreadyExistsException`.
This PR made the following changes:
- Introduced a new `AnalysisException` type: `SQLFeatureNotSupportedException`. When users try to access an `Index Table`, we will issue a `SQLFeatureNotSupportedException`.
- `tableExists` returns `true` when hitting a `SQLFeatureNotSupportedException` and the feature is `Hive index table`.
- Added a check `requireTableNotExists` to `SessionCatalog`'s `createTable` API; otherwise, the current implementation relies on Hive's internal checking.
### How was this patch tested?
Added a test case
Author: gatorsmile <gatorsmile@gmail.com>
Closes#14801 from gatorsmile/tableExists.
## What changes were proposed in this pull request?
Given that non-deterministic expressions can be stateful, pushing them down the query plan during the optimization phase can cause incorrect behavior. This patch fixes that issue by explicitly disabling that.
## How was this patch tested?
A new test in `FilterPushdownSuite` that checks catalyst behavior for both deterministic and non-deterministic join conditions.
Author: Sameer Agarwal <sameerag@cs.berkeley.edu>
Closes#14815 from sameeragarwal/constraint-inputfile.
## What changes were proposed in this pull request?
This PR adds parser support for `BigDecimal` literals. If you append the suffix `BD` to a valid number, it will be interpreted as a `BigDecimal`; for example, `12.0E10BD` will be interpreted as a `BigDecimal` with scale -9 and precision 3. This is useful in situations where you need exact values.
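The scale/precision decomposition can be checked with any unscaled-value/exponent decimal representation; for instance, in Python's `decimal` module, where `java.math.BigDecimal`'s scale corresponds to the negated exponent:

```python
from decimal import Decimal

d = Decimal("12.0E10")           # the value denoted by the literal 12.0E10BD
sign, digits, exponent = d.as_tuple()

precision = len(digits)          # 3 significant digits: "120"
scale = -exponent                # -9, matching BigDecimal's scale
```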
## How was this patch tested?
Added tests to `ExpressionParserSuite`, `ExpressionSQLBuilderSuite` and `SQLQueryTestSuite`.
Author: Herman van Hovell <hvanhovell@databricks.com>
Closes#14819 from hvanhovell/SPARK-17246.
## What changes were proposed in this pull request?
Currently, type-widening does not work between `TimestampType` and `DateType`.
This applies to `SetOperation`, `Union`, `In`, `CaseWhen`, `Greatest`, `Least`, `CreateArray`, `CreateMap`, `Coalesce`, `NullIf`, `IfNull`, `Nvl` and `Nvl2`.
This PR adds the support for widening `DateType` to `TimestampType` for them.
For a simple example,
**Before**
```scala
Seq(Tuple2(new Timestamp(0), new Date(0))).toDF("a", "b").selectExpr("greatest(a, b)").show()
```
shows below:
```
cannot resolve 'greatest(`a`, `b`)' due to data type mismatch: The expressions should all have the same type, got GREATEST(timestamp, date)
```
or union as below:
```scala
val a = Seq(Tuple1(new Timestamp(0))).toDF()
val b = Seq(Tuple1(new Date(0))).toDF()
a.union(b).show()
```
shows below:
```
Union can only be performed on tables with the compatible column types. DateType <> TimestampType at the first column of the second table;
```
**After**
```scala
Seq(Tuple2(new Timestamp(0), new Date(0))).toDF("a", "b").selectExpr("greatest(a, b)").show()
```
shows below:
```
+----------------------------------------------------+
|greatest(CAST(a AS TIMESTAMP), CAST(b AS TIMESTAMP))|
+----------------------------------------------------+
| 1969-12-31 16:00:...|
+----------------------------------------------------+
```
or union as below:
```scala
val a = Seq(Tuple1(new Timestamp(0))).toDF()
val b = Seq(Tuple1(new Date(0))).toDF()
a.union(b).show()
```
shows below:
```
+--------------------+
| _1|
+--------------------+
|1969-12-31 16:00:...|
|1969-12-31 00:00:...|
+--------------------+
```
## How was this patch tested?
Unit tests in `TypeCoercionSuite`.
Author: hyukjinkwon <gurwls223@gmail.com>
Author: HyukjinKwon <gurwls223@gmail.com>
Closes#14786 from HyukjinKwon/SPARK-17212.
### What changes were proposed in this pull request?
This PR is to fix an incorrect outer join elimination when a filter's `isNotNull` constraint is unable to filter out all null-supplying rows, for example, `isnotnull(coalesce(b#227, c#238))`.
Users can hit this error when they try to use `using/natural outer join`, which is converted to a normal outer join with a `coalesce` expression on the `using columns`. For example,
```Scala
val a = Seq((1, 2), (2, 3)).toDF("a", "b")
val b = Seq((2, 5), (3, 4)).toDF("a", "c")
val c = Seq((3, 1)).toDF("a", "d")
val ab = a.join(b, Seq("a"), "fullouter")
ab.join(c, "a").explain(true)
```
The dataframe `ab` performs a `USING` full outer join, which is converted to a normal outer join with a `coalesce` expression. Constraint inference generates a `Filter` with the constraint `isnotnull(coalesce(b#227, c#238))`. This then triggers an incorrect outer join elimination and produces a wrong result.
```
Project [a#251, b#227, c#237, d#247]
+- Join Inner, (a#251 = a#246)
:- Project [coalesce(a#226, a#236) AS a#251, b#227, c#237]
: +- Join FullOuter, (a#226 = a#236)
: :- Project [_1#223 AS a#226, _2#224 AS b#227]
: : +- LocalRelation [_1#223, _2#224]
: +- Project [_1#233 AS a#236, _2#234 AS c#237]
: +- LocalRelation [_1#233, _2#234]
+- Project [_1#243 AS a#246, _2#244 AS d#247]
+- LocalRelation [_1#243, _2#244]
== Optimized Logical Plan ==
Project [a#251, b#227, c#237, d#247]
+- Join Inner, (a#251 = a#246)
:- Project [coalesce(a#226, a#236) AS a#251, b#227, c#237]
: +- Filter isnotnull(coalesce(a#226, a#236))
: +- Join FullOuter, (a#226 = a#236)
: :- LocalRelation [a#226, b#227]
: +- LocalRelation [a#236, c#237]
+- LocalRelation [a#246, d#247]
```
**A note to the `Committer`**, please also give the credit to dongjoon-hyun who submitted another PR for fixing this issue. https://github.com/apache/spark/pull/14580
### How was this patch tested?
Added test cases
Author: gatorsmile <gatorsmile@gmail.com>
Closes#14661 from gatorsmile/fixOuterJoinElimination.
## What changes were proposed in this pull request?
Currently `MapObjects` does not make copies of unsafe-backed data, leading to problems like [SPARK-17061](https://issues.apache.org/jira/browse/SPARK-17061) and [SPARK-17093](https://issues.apache.org/jira/browse/SPARK-17093).
This patch makes `MapObjects` make copies of unsafe-backed data.
Generated code - prior to this patch:
```java
...
/* 295 */ if (isNull12) {
/* 296 */ convertedArray1[loopIndex1] = null;
/* 297 */ } else {
/* 298 */ convertedArray1[loopIndex1] = value12;
/* 299 */ }
...
```
Generated code - after this patch:
```java
...
/* 295 */ if (isNull12) {
/* 296 */ convertedArray1[loopIndex1] = null;
/* 297 */ } else {
/* 298 */ convertedArray1[loopIndex1] = value12 instanceof UnsafeRow? value12.copy() : value12;
/* 299 */ }
...
```
## How was this patch tested?
Add a new test case which would fail without this patch.
Author: Liwei Lin <lwlin7@gmail.com>
Closes#14698 from lw-lin/mapobjects-copy.
## What changes were proposed in this pull request?
Given that filters based on non-deterministic constraints shouldn't be pushed down in the query plan, unnecessarily inferring them is confusing and a source of potential bugs. This patch simplifies the inferring logic by simply ignoring them.
## How was this patch tested?
Added a new test in `ConstraintPropagationSuite`.
Author: Sameer Agarwal <sameerag@cs.berkeley.edu>
Closes#14795 from sameeragarwal/deterministic-constraints.
## What changes were proposed in this pull request?
This is a sub-task of [SPARK-16283](https://issues.apache.org/jira/browse/SPARK-16283) (Implement percentile_approx SQL function), which moves class QuantileSummaries to project catalyst so that it can be reused when implementing aggregation function `percentile_approx`.
## How was this patch tested?
This PR only relocates the class; the class implementation is not changed.
Author: Sean Zhong <seanzhong@databricks.com>
Closes#14754 from clockfly/move_QuantileSummaries_to_catalyst.
## What changes were proposed in this pull request?
Modifies the error message for numeric literals to:
```
Numeric literal <literal> does not fit in range [min, max] for type <T>
```
## How was this patch tested?
Fixed up the error messages for `literals.sql` in `SQLQueryTestSuite` and re-ran via sbt. Also fixed up the error messages in `ExpressionParserSuite`.
Author: Srinath Shankar <srinath@databricks.com>
Closes#14721 from srinathshankar/sc4296.
## What changes were proposed in this pull request?
This patch changes predicate pushdown optimization rule (PushDownPredicate) from using a blacklist to a whitelist. That is to say, operators must be explicitly allowed. This approach is more future-proof: previously it was possible for us to introduce a new operator and then render the optimization rule incorrect.
This also fixes the bug that previously we allowed pushing filter beneath limit, which was incorrect. That is to say, before this patch, the optimizer would rewrite
```
select * from (select * from range(10) limit 5) where id > 3
```
to
```
select * from range(10) where id > 3 limit 5
```
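The two forms are not equivalent, which is easy to confirm with plain collections; a minimal sketch:

```python
data = list(range(10))

# Original query: apply the limit first, then the filter.
original = [x for x in data[:5] if x > 3]        # -> [4]

# The (incorrect) rewrite: filter first, then limit.
rewritten = [x for x in data if x > 3][:5]       # -> [4, 5, 6, 7, 8]

assert original != rewritten
```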
## How was this patch tested?
- a unit test case in FilterPushdownSuite
- an end-to-end test in limit.sql
Author: Reynold Xin <rxin@databricks.com>
Closes#14713 from rxin/SPARK-16994.
## What changes were proposed in this pull request?
This patch improves inline table support with the following:
1. Support type coercion.
2. Support using foldable expressions. Previously only literals were supported.
3. Improve error message handling.
4. Improve test coverage.
## How was this patch tested?
Added a new unit test suite ResolveInlineTablesSuite and a new file-based end-to-end test inline-table.sql.
Author: petermaxlee <petermaxlee@gmail.com>
Closes#14676 from petermaxlee/SPARK-16947.
## What changes were proposed in this pull request?
This patch fixes the problem described in SPARK-17117, i.e. "SELECT 1 / NULL" throws an analysis exception:
```
org.apache.spark.sql.AnalysisException: cannot resolve '(1 / NULL)' due to data type mismatch: differing types in '(1 / NULL)' (int and null).
```
The problem is that division type coercion did not take null type into account.
## How was this patch tested?
A unit test for the type coercion, and a few end-to-end test cases using SQLQueryTestSuite.
Author: petermaxlee <petermaxlee@gmail.com>
Closes#14695 from petermaxlee/SPARK-17117.
## What changes were proposed in this pull request?
This adds analyzer rules for resolving table-valued functions, and adds one builtin implementation for range(). The arguments for range() are the same as those of `spark.range()`.
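A toy model of such a table-valued function (the argument handling below is an assumption based on `spark.range`; the `numPartitions` argument is omitted):

```python
def range_tvf(*args):
    """Table-valued range(): yields one-column rows, mirroring the
    call forms range(end) and range(start, end[, step])."""
    if len(args) == 1:
        start, end, step = 0, args[0], 1
    else:
        start, end = args[0], args[1]
        step = args[2] if len(args) > 2 else 1
    return [(i,) for i in range(start, end, step)]

range_tvf(3)        # -> [(0,), (1,), (2,)]
range_tvf(1, 7, 2)  # -> [(1,), (3,), (5,)]
```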
## How was this patch tested?
Unit tests.
cc hvanhovell
Author: Eric Liang <ekl@databricks.com>
Closes#14656 from ericl/sc-4309.
## What changes were proposed in this pull request?
The `Optimizer` rules `PushThroughSetOperations` and `PushDownPredicate` have a redundant rule to push down `Filter` through `Union`. We should remove it.
## How was this patch tested?
Jenkins tests.
Author: Liang-Chi Hsieh <simonh@tw.ibm.com>
Closes#14687 from viirya/remove-extra-pushdown.
## What changes were proposed in this pull request?
I was looking at the code for `UnresolvedOrdinal` and made a few small changes to make it slightly clearer:
1. Renamed the rule to `SubstituteUnresolvedOrdinals`, which is more consistent with other rules that start with verbs. Note that this is still inconsistent with `CTESubstitution` and `WindowsSubstitution`.
2. Broke the test suite down from a single test case into three test cases.
## How was this patch tested?
This is a minor cleanup.
Author: petermaxlee <petermaxlee@gmail.com>
Closes#14672 from petermaxlee/SPARK-17034.
## What changes were proposed in this pull request?
Currently methods in `ParserUtils` are tested indirectly, we should add test cases in `ParserUtilsSuite` to verify their integrity directly.
## How was this patch tested?
New test cases in `ParserUtilsSuite`
Author: jiangxingbo <jiangxb1987@gmail.com>
Closes#14620 from jiangxb1987/parserUtils.
## What changes were proposed in this pull request?
This PR adds a field to subquery alias in order to make the usage of views in a resolved `LogicalPlan` more visible (and more understandable).
For example, the following view and query:
```sql
create view constants as select 1 as id union all select 1 union all select 42
select * from constants;
```
...now yields the following analyzed plan:
```
Project [id#39]
+- SubqueryAlias c, `default`.`constants`
+- Project [gen_attr_0#36 AS id#39]
+- SubqueryAlias gen_subquery_0
+- Union
:- Union
: :- Project [1 AS gen_attr_0#36]
: : +- OneRowRelation$
: +- Project [1 AS gen_attr_1#37]
: +- OneRowRelation$
+- Project [42 AS gen_attr_2#38]
+- OneRowRelation$
```
## How was this patch tested?
Added tests for the two code paths in `SessionCatalogSuite` (sql/core) and `HiveMetastoreCatalogSuite` (sql/hive)
Author: Herman van Hovell <hvanhovell@databricks.com>
Closes#14657 from hvanhovell/SPARK-17068.
## What changes were proposed in this pull request?
This PR adds expression `UnresolvedOrdinal` to represent the ordinal in GROUP BY or ORDER BY, and fixes the rules when resolving ordinals.
Ordinals in GROUP BY or ORDER BY, like `1` in `order by 1` or `group by 1`, should be considered unresolved before analysis. But in the current code, a `Literal` expression is used to store the ordinal. This is inappropriate as `Literal` itself is a resolved expression; it gives the user the wrong message that the ordinal has already been resolved.
### Before this change
Ordinal is stored as `Literal` expression
```
scala> sc.setLogLevel("TRACE")
scala> sql("select a from t group by 1 order by 1")
...
'Sort [1 ASC], true
+- 'Aggregate [1], ['a]
+- 'UnresolvedRelation `t`
```
For query:
```
scala> Seq(1).toDF("a").createOrReplaceTempView("t")
scala> sql("select count(a), a from t group by 2 having a > 0").show
```
During analysis, the intermediate plan before applying rule `ResolveAggregateFunctions` is:
```
'Filter ('a > 0)
+- Aggregate [2], [count(1) AS count(1)#83L, a#81]
+- LocalRelation [value#7 AS a#9]
```
Before this PR, rule `ResolveAggregateFunctions` believes all expressions of `Aggregate` have already been resolved, and tries to resolve the expressions in `Filter` directly. But this is wrong, as ordinal `2` in Aggregate is not really resolved!
### After this change
Ordinals are stored as `UnresolvedOrdinal`.
```
scala> sc.setLogLevel("TRACE")
scala> sql("select a from t group by 1 order by 1")
...
'Sort [unresolvedordinal(1) ASC], true
+- 'Aggregate [unresolvedordinal(1)], ['a]
+- 'UnresolvedRelation `t`
```
## How was this patch tested?
Unit tests.
Author: Sean Zhong <seanzhong@databricks.com>
Closes#14616 from clockfly/spark-16955.
## What changes were proposed in this pull request?
This PR changes the CTE resolving rule to use only **forward-declared** tables in order to prevent infinite loops. More specifically, new logic is like the following.
* Resolve CTEs in `WITH` clauses first before replacing the main SQL body.
* When resolving CTEs, only forward-declared CTEs or base tables are referenced.
- Self-referencing is not allowed any more.
- Cross-referencing is not allowed any more.
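The new resolution order can be sketched as a single pass over the `WITH` clause that only ever looks backwards (the structure and names here are illustrative, not Spark's actual rule):

```python
def resolve_ctes(ctes, base_tables):
    """ctes: ordered (name, referenced_names) pairs from a WITH clause.
    Each definition may reference only base tables or CTEs declared
    before it, so self- and cross-references fail fast instead of
    recursing forever."""
    known = set(base_tables)
    for name, refs in ctes:
        for ref in refs:
            if ref not in known:
                raise ValueError("Table or view not found: " + ref)
        known.add(name)
    return known
```

With this scheme, `WITH t AS (SELECT 1 FROM t) ...` fails with a resolution error rather than an infinite loop.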
**Reported Error Scenarios**
```scala
scala> sql("WITH t AS (SELECT 1 FROM t) SELECT * FROM t")
java.lang.StackOverflowError
...
scala> sql("WITH t1 AS (SELECT * FROM t2), t2 AS (SELECT 2 FROM t1) SELECT * FROM t1, t2")
java.lang.StackOverflowError
...
```
Note that `t`, `t1`, and `t2` are not declared in the database. Spark falls into an infinite loop before resolving the table names.
## How was this patch tested?
Passes the Jenkins tests with two new test cases.
Author: Dongjoon Hyun <dongjoon@apache.org>
Closes#14397 from dongjoon-hyun/SPARK-16771-TREENODE.
#### What changes were proposed in this pull request?
So far, the test cases of `TableIdentifierParserSuite` do not cover quoted identifiers. We should add one to avoid regressions.
#### How was this patch tested?
N/A
Author: gatorsmile <gatorsmile@gmail.com>
Closes#14244 from gatorsmile/quotedIdentifiers.
## What changes were proposed in this pull request?
This PR adds the `MINUS` set operator, which is equivalent to `EXCEPT DISTINCT`. This will slightly improve compatibility with Oracle.
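Semantically, `MINUS` (like `EXCEPT DISTINCT`) returns the distinct rows of the left side that do not appear on the right; a small sketch:

```python
def minus(left, right):
    """EXCEPT DISTINCT semantics: distinct left rows absent from right."""
    right_rows = set(right)
    seen, out = set(), []
    for row in left:
        if row not in right_rows and row not in seen:
            seen.add(row)
            out.append(row)
    return out

minus([1, 1, 2, 3], [2])  # -> [1, 3]
```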
## How was this patch tested?
Passes Jenkins with newly added test cases.
Author: Dongjoon Hyun <dongjoon@apache.org>
Closes#14570 from dongjoon-hyun/SPARK-10601.
## What changes were proposed in this pull request?
Avoid using postfix operators for command execution in `SQLQuerySuite`, where they weren't whitelisted, and audit the existing whitelistings, removing postfix operators from most places. Notable places where postfix operators remain are XML parsing and time units (seconds, millis, etc.), where they arguably improve readability.
## How was this patch tested?
Existing tests.
Author: Holden Karau <holden@us.ibm.com>
Closes#14407 from holdenk/SPARK-16779.
## What changes were proposed in this pull request?
This patch fixes incorrect results produced by the rule `ResolveSubquery` in Catalyst's analysis phase by returning an error message when a LIMIT is found on the path from the parent table to the correlated predicate in the subquery.
## How was this patch tested?
`./dev/run-tests` and a new unit test on the problematic pattern.
Author: Nattavut Sutyanyong <nsy.can@gmail.com>
Closes#14411 from nsyca/master.
## What changes were proposed in this pull request?
We have various logical plans for CREATE TABLE and CTAS: `CreateTableUsing`, `CreateTableUsingAsSelect`, and `CreateHiveTableAsSelectLogicalPlan`. This PR unifies them to reduce complexity and centralize the error handling.
## How was this patch tested?
existing tests
Author: Wenchen Fan <wenchen@databricks.com>
Closes#14482 from cloud-fan/table.
## What changes were proposed in this pull request?
These 2 methods take a `CatalogTable` parameter, which already contains the database information.
## How was this patch tested?
existing test
Author: Wenchen Fan <wenchen@databricks.com>
Closes#14476 from cloud-fan/minor5.
## What changes were proposed in this pull request?
Implements the `eval()` method for the expression `AssertNotNull`, so that we can convert a local projection on a `LocalRelation` to another `LocalRelation`.
### Before change:
```
scala> import org.apache.spark.sql.catalyst.dsl.expressions._
scala> import org.apache.spark.sql.catalyst.expressions.objects.AssertNotNull
scala> import org.apache.spark.sql.Column
scala> case class A(a: Int)
scala> Seq((A(1),2)).toDS().select(new Column(AssertNotNull("_1".attr, Nil))).explain
java.lang.UnsupportedOperationException: Only code-generated evaluation is supported.
at org.apache.spark.sql.catalyst.expressions.objects.AssertNotNull.eval(objects.scala:850)
...
```
### After the change:
```
scala> Seq((A(1),2)).toDS().select(new Column(AssertNotNull("_1".attr, Nil))).explain(true)
== Parsed Logical Plan ==
'Project [assertnotnull('_1) AS assertnotnull(_1)#5]
+- LocalRelation [_1#2, _2#3]
== Analyzed Logical Plan ==
assertnotnull(_1): struct<a:int>
Project [assertnotnull(_1#2) AS assertnotnull(_1)#5]
+- LocalRelation [_1#2, _2#3]
== Optimized Logical Plan ==
LocalRelation [assertnotnull(_1)#5]
== Physical Plan ==
LocalTableScan [assertnotnull(_1)#5]
```
## How was this patch tested?
Unit test.
Author: Sean Zhong <seanzhong@databricks.com>
Closes#14486 from clockfly/assertnotnull_eval.
## What changes were proposed in this pull request?
Here is a table about the behaviours of `array`/`map` and `greatest`/`least` in Hive, MySQL and Postgres:
| |Hive|MySQL|Postgres|
|---|---|---|---|
|`array`/`map`|can find a wider type with decimal type arguments, and will truncate the wider decimal type if necessary|can find a wider type with decimal type arguments, no truncation problem|can find a wider type with decimal type arguments, no truncation problem|
|`greatest`/`least`|can find a wider type with decimal type arguments, and truncate if necessary, but can't do string promotion|can find a wider type with decimal type arguments, no truncation problem, but can't do string promotion|can find a wider type with decimal type arguments, no truncation problem, but can't do string promotion|
I think these behaviours make sense and Spark SQL should follow them.
This PR fixes `array` and `map` by using `findWiderCommonType` to get the wider type.
This PR fixes `greatest` and `least` by adding a `findWiderTypeWithoutStringPromotion`, which provides similar semantics to `findWiderCommonType`, but without string promotion.
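The difference between the two lookups can be sketched with a simplified type lattice (the ordering below is illustrative only; Spark's actual widening rules, especially for decimals, are more involved):

```python
# Simplified widening order, for illustration only.
WIDTH = {"int": 1, "bigint": 2, "double": 3}

def find_wider_common_type(t1, t2):
    """Widen two types, promoting to string if either side is a string."""
    if "string" in (t1, t2):
        return "string"
    return t1 if WIDTH[t1] >= WIDTH[t2] else t2

def find_wider_type_without_string_promotion(t1, t2):
    """Same widening, but greatest/least must not promote to string."""
    if "string" in (t1, t2):
        return None  # no common type; analysis should fail
    return find_wider_common_type(t1, t2)

find_wider_common_type("int", "string")                    # -> "string"
find_wider_type_without_string_promotion("int", "string")  # -> None
```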
## How was this patch tested?
new tests in `TypeCoercionSuite`
Author: Wenchen Fan <wenchen@databricks.com>
Author: Yin Huai <yhuai@databricks.com>
Closes#14439 from cloud-fan/bug.
## What changes were proposed in this pull request?
`Greatest` and `Least` are not conditional expressions, but arithmetic expressions.
## How was this patch tested?
N/A
Author: Wenchen Fan <wenchen@databricks.com>
Closes#14460 from cloud-fan/move.
## What changes were proposed in this pull request?
In Spark 1.6 (with Hive support) we could use `CURRENT_DATE` and `CURRENT_TIMESTAMP` functions as literals (without adding braces), for example:
```SQL
select /* Spark 1.6: */ current_date, /* Spark 1.6 & Spark 2.0: */ current_date()
```
This was accidentally dropped in Spark 2.0. This PR reinstates this functionality.
## How was this patch tested?
Added a case to ExpressionParserSuite.
Author: Herman van Hovell <hvanhovell@databricks.com>
Closes#14442 from hvanhovell/SPARK-16836.
## What changes were proposed in this pull request?
Fixes incorrect arguments (dropping `slideDuration` and using `windowDuration`) in the constructors for `TimeWindow`.
The JIRA this addresses is here: https://issues.apache.org/jira/browse/SPARK-16837
## How was this patch tested?
Added a test to TimeWindowSuite to check that the results of TimeWindow object apply and TimeWindow class constructor are equivalent.
Author: Tom Magrino <tmagrino@fb.com>
Closes#14441 from tmagrino/windowing-fix.
## What changes were proposed in this pull request?
The greatest/least functions do not have the most friendly error message for data types. This patch improves the error message to not show the `Seq` type, and to use more human-readable data type names.
Before:
```
org.apache.spark.sql.AnalysisException: cannot resolve 'greatest(CAST(1.0 AS DECIMAL(2,1)), "1.0")' due to data type mismatch: The expressions should all have the same type, got GREATEST (ArrayBuffer(DecimalType(2,1), StringType)).; line 1 pos 7
```
After:
```
org.apache.spark.sql.AnalysisException: cannot resolve 'greatest(CAST(1.0 AS DECIMAL(2,1)), "1.0")' due to data type mismatch: The expressions should all have the same type, got GREATEST(decimal(2,1), string).; line 1 pos 7
```
## How was this patch tested?
Manually verified the output and also added unit tests to ConditionalExpressionSuite.
Author: petermaxlee <petermaxlee@gmail.com>
Closes#14453 from petermaxlee/SPARK-16850.
## What changes were proposed in this pull request?
These 2 expressions are not needed anymore after we have `Greatest` and `Least`. This PR removes them and related tests.
## How was this patch tested?
N/A
Author: Wenchen Fan <wenchen@databricks.com>
Closes#14434 from cloud-fan/minor1.
## What changes were proposed in this pull request?
Removes the deprecated timestamp constructor and incidentally fixes the use which was using system timezone rather than the one specified when working near DST.
This change also causes the roundtrip tests to fail since it now actually uses all the timezones near DST boundaries where it didn't before.
Note: this is only a partial solution; longer term we should follow up with https://issues.apache.org/jira/browse/SPARK-16788 to avoid this problem and simplify our timezone handling code.
## How was this patch tested?
New tests for two timezones added, so even if the user timezone happens to coincide with one, the other tests should still fail. Important note: this (temporarily) disables the round trip tests until we can fix the issue more thoroughly.
Author: Holden Karau <holden@us.ibm.com>
Closes#14398 from holdenk/SPARK-16774-fix-use-of-deprecated-timestamp-constructor.
## What changes were proposed in this pull request?
A failing test case plus a fix for SPARK-16791 (https://issues.apache.org/jira/browse/SPARK-16791).
## How was this patch tested?
Added a failing test case to CastSuite, then fixed the Cast code and reran the entire CastSuite.
Author: eyal farago <eyal farago>
Author: Eyal Farago <eyal.farago@actimize.com>
Closes#14400 from eyalfa/SPARK-16791_cast_struct_with_timestamp_field_fails.
## What changes were proposed in this pull request?
Currently, `UNION` queries on incompatible types show misleading error messages, i.e., `unresolved operator Union`. We had better show a more correct message. This will help users in the situation of [SPARK-16704](https://issues.apache.org/jira/browse/SPARK-16704).
**Before**
```scala
scala> sql("select 1,2,3 union (select 1,array(2),3)")
org.apache.spark.sql.AnalysisException: unresolved operator 'Union;
scala> sql("select 1,2,3 intersect (select 1,array(2),3)")
org.apache.spark.sql.AnalysisException: unresolved operator 'Intersect;
scala> sql("select 1,2,3 except (select 1,array(2),3)")
org.apache.spark.sql.AnalysisException: unresolved operator 'Except;
```
**After**
```scala
scala> sql("select 1,2,3 union (select 1,array(2),3)")
org.apache.spark.sql.AnalysisException: Union can only be performed on tables with the compatible column types. ArrayType(IntegerType,false) <> IntegerType at the second column of the second table;
scala> sql("select 1,2,3 intersect (select 1,array(2),3)")
org.apache.spark.sql.AnalysisException: Intersect can only be performed on tables with the compatible column types. ArrayType(IntegerType,false) <> IntegerType at the second column of the second table;
scala> sql("select 1,2,3 except (select array(1),array(2),3)")
org.apache.spark.sql.AnalysisException: Except can only be performed on tables with the compatible column types. ArrayType(IntegerType,false) <> IntegerType at the first column of the second table;
```
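A sketch of the per-column compatibility check that produces the improved message. Column types are plain strings here rather than Catalyst DataTypes, and the helper name is invented for illustration:

```scala
// Compare the column types of the first and second table pairwise and report the
// first mismatch with its ordinal position, in the style of the new error message.
def checkSetOp(op: String, first: Seq[String], second: Seq[String]): Option[String] =
  first.zip(second).zipWithIndex.collectFirst {
    case ((t1, t2), i) if t1 != t2 =>
      val ordinal = (i + 1) match {
        case 1 => "first"
        case 2 => "second"
        case 3 => "third"
        case n => s"${n}th"
      }
      s"$op can only be performed on tables with the compatible column types. " +
        s"$t2 <> $t1 at the $ordinal column of the second table"
  }
```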
## How was this patch tested?
Pass the Jenkins test with a new test case.
Author: Dongjoon Hyun <dongjoon@apache.org>
Closes#14355 from dongjoon-hyun/SPARK-16726.
## What changes were proposed in this pull request?
`StructField` has very similar semantics to `CatalogColumn`, except that `CatalogColumn` uses a string to express the data type. I think it's reasonable to use `StructType` as the `CatalogTable.schema` and remove `CatalogColumn`.
## How was this patch tested?
existing tests.
Author: Wenchen Fan <wenchen@databricks.com>
Closes#14363 from cloud-fan/column.
## What changes were proposed in this pull request?
Use foreach/for instead of map where the operation requires execution of the body for its side effects, rather than actually defining a transformation.
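The distinction matters because `map` on a lazy collection never runs its body unless the result is forced; a small sketch:

```scala
// Using map purely for its side effect is fragile: on a lazy view the body
// is never evaluated, so nothing is appended.
val buf = scala.collection.mutable.ArrayBuffer.empty[Int]

val unforced = (1 to 3).view.map(buf += _) // lazy: body not executed
val afterMap = buf.size                    // still 0

// foreach executes the body eagerly, which is what side-effecting code wants.
(1 to 3).foreach(buf += _)
val afterForeach = buf.size                // 3
```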
## How was this patch tested?
Jenkins
Author: Sean Owen <sowen@cloudera.com>
Closes#14332 from srowen/SPARK-16694.
## What changes were proposed in this pull request?
Spark currently throws exceptions for invalid casts for all data types except date type, which somehow returns null instead. It should be consistent and throw an analysis exception as well.
## How was this patch tested?
Added a unit test case in CastSuite.
Author: petermaxlee <petermaxlee@gmail.com>
Closes#14358 from petermaxlee/SPARK-16729.
## What changes were proposed in this pull request?
This PR is the first step for the following feature:
For hash aggregation in Spark SQL, we use a fast aggregation hashmap to act as a "cache" in order to boost aggregation performance. Previously, the hashmap is backed by a `ColumnarBatch`. This has performance issues when we have wide schema for the aggregation table (large number of key fields or value fields).
In this JIRA, we support another implementation of fast hashmap, which is backed by a `RowBasedKeyValueBatch`. We then automatically pick between the two implementations based on certain knobs.
In this first-step PR, implementations for `RowBasedKeyValueBatch` and `RowBasedHashMapGenerator` are added.
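A toy sketch of the layout idea behind a row-based key-value batch: key and value rows are appended side by side in insertion order and looked up by a linear scan. Spark's `RowBasedKeyValueBatch` stores variable-length UnsafeRows in a single memory page; the fixed-size `Long` slots and class name here are simplifications for illustration:

```scala
// Rows are appended in order; find() scans linearly. This is acceptable because
// the structure acts as a small fast cache in front of the fallback hashmap.
class RowBasedKVBatchSketch(capacity: Int) {
  private val keys = new Array[Long](capacity)
  private val values = new Array[Long](capacity)
  private var numRows = 0

  def appendRow(key: Long, value: Long): Int = {
    keys(numRows) = key
    values(numRows) = value
    numRows += 1
    numRows - 1 // index of the appended row
  }

  def find(key: Long): Option[Long] = {
    var i = 0
    while (i < numRows) {
      if (keys(i) == key) return Some(values(i))
      i += 1
    }
    None
  }
}
```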
## How was this patch tested?
Unit tests: `RowBasedKeyValueBatchSuite`
Author: Qifan Pu <qifan.pu@gmail.com>
Closes#14349 from ooq/SPARK-16524.
## What changes were proposed in this pull request?
Finish the TODO: create a new expression `ExternalMapToCatalyst` to iterate the map directly.
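A sketch of the idea: iterate the external map once, building the parallel key and value arrays that Catalyst's map representation expects, instead of materializing intermediate collections. The helper name is invented; Spark's version generates Java code for this loop:

```scala
// Walk the map's iterator directly and fill two parallel arrays — the shape
// a Catalyst ArrayBasedMapData-style representation is built from.
def mapToCatalystArrays[K, V](m: Map[K, V]): (Array[Any], Array[Any]) = {
  val keys = new Array[Any](m.size)
  val values = new Array[Any](m.size)
  var i = 0
  val it = m.iterator
  while (it.hasNext) {
    val (k, v) = it.next()
    keys(i) = k
    values(i) = v
    i += 1
  }
  (keys, values)
}
```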
## How was this patch tested?
new test in `JavaDatasetSuite`
Author: Wenchen Fan <wenchen@databricks.com>
Closes#14344 from cloud-fan/java-map.
## What changes were proposed in this pull request?
We push down `Project` through `Sample` in `Optimizer` by the rule `PushProjectThroughSample`. However, if the projected columns produce new output, they will be computed over the whole data instead of the sampled data. It will bring some inconsistency between the original plan (Sample then Project) and the optimized plan (Project then Sample). In the extreme case such as attached in the JIRA, if the projected column is a UDF which is supposed to not see the sampled-out data, the result of the UDF will be incorrect.
Since the rule `ColumnPruning` already handles general `Project` pushdown. We don't need `PushProjectThroughSample` anymore. The rule `ColumnPruning` also avoids the described issue.
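The inconsistency can be seen with a counting "UDF" and a deterministic stand-in for Sample; applied before sampling it sees every row, not just the sampled ones:

```scala
// A side-effecting function standing in for a user UDF: count how many rows it sees.
var udfCalls = 0
def udf(x: Int): Int = { udfCalls += 1; x * 2 }

val data = (1 to 100).toList
// Deterministic stand-in for Sample so the row counts are exact.
def sample(rows: List[Int]): List[Int] = rows.take(10)

// Original plan: Sample then Project — the UDF sees only the sampled rows.
udfCalls = 0
sample(data).map(udf)
val callsOriginal = udfCalls

// Pushed-down plan: Project then Sample — the UDF sees the whole input.
udfCalls = 0
sample(data.map(udf))
val callsPushedDown = udfCalls
```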
## How was this patch tested?
Jenkins tests.
Author: Liang-Chi Hsieh <simonh@tw.ibm.com>
Closes#14327 from viirya/fix-sample-pushdown.
## What changes were proposed in this pull request?
SubexpressionEliminationSuite."Semantic equals and hash" assumes the default AttributeReference's exprId won't be "ExprId(1)". However, that depends on when this test runs. It may happen to use "ExprId(1)".
This PR detects the conflict and makes sure we create a different ExprId when the conflict happens.
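The conflict-avoidance idea in miniature, with ids as plain Longs rather than Spark's ExprId and an invented helper name:

```scala
// Draw a fresh id, and if it happens to equal the one the test hard-codes,
// draw again until the ids differ.
def freshId(next: () => Long, taken: Long): Long = {
  var id = next()
  while (id == taken) { id = next() }
  id
}
```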
## How was this patch tested?
Jenkins unit tests.
Author: Shixiong Zhu <shixiong@databricks.com>
Closes#14350 from zsxwing/SPARK-16715.
## What changes were proposed in this pull request?
It's weird that we have `BucketSpec` to abstract bucket info, but don't use it in `CatalogTable`. This PR moves `BucketSpec` into catalyst module.
## How was this patch tested?
existing tests.
Author: Wenchen Fan <wenchen@databricks.com>
Closes#14331 from cloud-fan/check.
## What changes were proposed in this pull request?
We also store data source table options in this field, so it's unreasonable to call it `serdeProperties`.
## How was this patch tested?
N/A
Author: Wenchen Fan <wenchen@databricks.com>
Closes#14283 from cloud-fan/minor1.
## What changes were proposed in this pull request?
Currently we don't check the value returned by the called method in `Invoke`. When the returned value is null and is assigned to a variable of primitive type, a `NullPointerException` will be thrown.
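The shape of the fix, sketched outside of code generation: check the boxed return value for null before assigning to a primitive slot, and carry an isNull flag instead of unboxing blindly. The helper name and default value are illustrative only:

```scala
// Mirrors the generated-code pattern (isNull flag + primitive value slot):
// unboxing a null java.lang.Integer directly would throw NullPointerException.
def safeInvoke(result: java.lang.Integer): (Boolean, Int) =
  if (result == null) (true, -1)      // isNull set; the value slot is unused
  else (false, result.intValue())     // safe to unbox
```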
## How was this patch tested?
Jenkins tests.
Author: Liang-Chi Hsieh <simonh@tw.ibm.com>
Closes#14259 from viirya/agg-empty-ds.
## What changes were proposed in this pull request?
This PR adds `str_to_map` SQL function in order to remove Hive fallback.
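A sketch of the function's semantics with the Hive-compatible defaults ("," splits pairs, ":" splits key from value); this is a plain-Scala model, not Spark's expression implementation:

```scala
// str_to_map("a:1,b:2") => Map("a" -> "1", "b" -> "2"); a pair without the
// key-value delimiter maps its key to null, following Hive's behavior.
def strToMap(s: String,
             pairDelim: String = ",",
             kvDelim: String = ":"): Map[String, String] =
  s.split(pairDelim).map { pair =>
    val kv = pair.split(kvDelim, 2)
    if (kv.length == 2) kv(0) -> kv(1) else kv(0) -> null
  }.toMap
```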
## How was this patch tested?
Pass the Jenkins tests with the newly added tests.
Author: Sandeep Singh <sandeep@techaddict.me>
Closes#13990 from techaddict/SPARK-16287.
## What changes were proposed in this pull request?
Aggregate expressions can only be executed inside `Aggregate`; if we propagate them up with constraints, the parent operator cannot execute them and will fail at runtime.
## How was this patch tested?
new test in SQLQuerySuite
Author: Wenchen Fan <wenchen@databricks.com>
Author: Yin Huai <yhuai@databricks.com>
Closes#14281 from cloud-fan/bug.