## What changes were proposed in this pull request?
`Nvl` function should support numeric-straing cases like Hive/Spark1.6. Currently, `Nvl` finds the tightest common types among numeric types. This PR extends that to consider `String` type, too.
```scala
- TypeCoercion.findTightestCommonTypeOfTwo(left.dataType, right.dataType).map { dtype =>
+ TypeCoercion.findTightestCommonTypeToString(left.dataType, right.dataType).map { dtype =>
```
**Before**
```scala
scala> sql("select nvl('0', 1)").collect()
org.apache.spark.sql.AnalysisException: cannot resolve `nvl("0", 1)` due to data type mismatch:
input to function coalesce should all be the same type, but it's [string, int]; line 1 pos 7
```
**After**
```scala
scala> sql("select nvl('0', 1)").collect()
res0: Array[org.apache.spark.sql.Row] = Array([0])
```
## How was this patch tested?
Pass the Jenkins tests.
Author: Dongjoon Hyun <dongjoon@apache.org>
Closes#14251 from dongjoon-hyun/SPARK-16602.
## What changes were proposed in this pull request?
This patch moves regexp related unit tests from StringExpressionsSuite to RegexpExpressionsSuite to match the file name for regexp expressions.
## How was this patch tested?
This is a test only change.
Author: Reynold Xin <rxin@databricks.com>
Closes#14230 from rxin/SPARK-16584.
## What changes were proposed in this pull request?
`Alias` with metadata is not a no-op and we should not strip it in `RemoveAliasOnlyProject` rule.
This PR also did some improvement for this rule:
1. extend the semantic of `alias-only`. Now we allow the project list to be partially aliased.
2. add unit test for this rule.
## How was this patch tested?
new `RemoveAliasOnlyProjectSuite`
Author: Wenchen Fan <wenchen@databricks.com>
Closes#14106 from cloud-fan/bug.
## What changes were proposed in this pull request?
Currently our Optimizer may reorder the predicates to run them more efficient, but in non-deterministic condition, change the order between deterministic parts and non-deterministic parts may change the number of input rows. For example:
```SELECT a FROM t WHERE rand() < 0.1 AND a = 1```
And
```SELECT a FROM t WHERE a = 1 AND rand() < 0.1```
may call rand() for different times and therefore the output rows differ.
This PR improved this condition by checking whether the predicate is placed before any non-deterministic predicates.
## How was this patch tested?
Expanded related testcases in FilterPushdownSuite.
Author: 蒋星博 <jiangxingbo@meituan.com>
Closes#14012 from jiangxb1987/ppd.
## What changes were proposed in this pull request?
RegexExtract and RegexReplace currently crash on non-nullable input due use of a hard-coded local variable name (e.g. compiles fail with `java.lang.Exception: failed to compile: org.codehaus.commons.compiler.CompileException: File 'generated.java', Line 85, Column 26: Redefinition of local variable "m" `).
This changes those variables to use fresh names, and also in a few other places.
## How was this patch tested?
Unit tests. rxin
Author: Eric Liang <ekl@databricks.com>
Closes#14168 from ericl/sc-3906.
## What changes were proposed in this pull request?
This patch implements reflect SQL function, which can be used to invoke a Java method in SQL. Slightly different from Hive, this implementation requires the class name and the method name to be literals. This implementation also supports only a smaller number of data types, and requires the function to be static, as suggested by rxin in #13969.
java_method is an alias for reflect, so this should also resolve SPARK-16277.
## How was this patch tested?
Added expression unit tests and an end-to-end test.
Author: petermaxlee <petermaxlee@gmail.com>
Closes#14138 from petermaxlee/reflect-static.
This option is used by Hive to directly delete the files instead of
moving them to the trash. This is needed in certain configurations
where moving the files does not work. For non-Hive tables and partitions,
Spark already behaves as if the PURGE option was set, so there's no
need to do anything.
Hive support for PURGE was added in 0.14 (for tables) and 1.2 (for
partitions), so the code reflects that: trying to use the option with
older versions of Hive will cause an exception to be thrown.
The change is a little noisier than I would like, because of the code
to propagate the new flag through all the interfaces and implementations;
the main changes are in the parser and in HiveShim, aside from the tests
(DDLCommandSuite, VersionsSuite).
Tested by running sql and catalyst unit tests, plus VersionsSuite which
has been updated to test the version-specific behavior. I also ran an
internal test suite that uses PURGE and would not pass previously.
Author: Marcelo Vanzin <vanzin@cloudera.com>
Closes#13831 from vanzin/SPARK-16119.
## What changes were proposed in this pull request?
In code generation, it is incorrect for expressions to reuse variable names across different instances of itself. As an example, SPARK-16488 reports a bug in which pmod expression reuses variable name "r".
This patch updates ExpressionEvalHelper test harness to always project two instances of the same expression, which will help us catch variable reuse problems in expression unit tests. This patch also fixes the bug in crc32 expression.
## How was this patch tested?
This is a test harness change, but I also created a new test suite for testing the test harness.
Author: Reynold Xin <rxin@databricks.com>
Closes#14146 from rxin/SPARK-16489.
## What changes were proposed in this pull request?
Temporary tables are used frequently, but `spark.catalog.listColumns` does not support those tables. This PR make `SessionCatalog` supports temporary table column listing.
**Before**
```scala
scala> spark.range(10).createOrReplaceTempView("t1")
scala> spark.catalog.listTables().collect()
res1: Array[org.apache.spark.sql.catalog.Table] = Array(Table[name=`t1`, tableType=`TEMPORARY`, isTemporary=`true`])
scala> spark.catalog.listColumns("t1").collect()
org.apache.spark.sql.AnalysisException: Table `t1` does not exist in database `default`.;
```
**After**
```
scala> spark.catalog.listColumns("t1").collect()
res2: Array[org.apache.spark.sql.catalog.Column] = Array(Column[name='id', description='id', dataType='bigint', nullable='false', isPartition='false', isBucket='false'])
```
## How was this patch tested?
Pass the Jenkins tests including a new testcase.
Author: Dongjoon Hyun <dongjoon@apache.org>
Closes#14114 from dongjoon-hyun/SPARK-16458.
#### What changes were proposed in this pull request?
**Issue 1:** When a query containing LIMIT/TABLESAMPLE 0, the statistics could be zero. Results are correct but it could cause a huge performance regression. For example,
```Scala
Seq(("one", 1), ("two", 2), ("three", 3), ("four", 4)).toDF("k", "v")
.createOrReplaceTempView("test")
val df1 = spark.table("test")
val df2 = spark.table("test").limit(0)
val df = df1.join(df2, Seq("k"), "left")
```
The statistics of both `df` and `df2` are zero. The statistics values should never be zero; otherwise `sizeInBytes` of `BinaryNode` will also be zero (product of children). This PR is to increase it to `1` when the num of rows is equal to 0.
**Issue 2:** When a query containing negative LIMIT/TABLESAMPLE, we should issue exceptions. Negative values could break the implementation assumption of multiple parts. For example, statistics calculation. Below is the example query.
```SQL
SELECT * FROM testData TABLESAMPLE (-1 rows)
SELECT * FROM testData LIMIT -1
```
This PR is to issue an appropriate exception in this case.
**Issue 3:** Spark SQL follows the restriction of LIMIT clause in Hive. The argument to the LIMIT clause must evaluate to a constant value. It can be a numeric literal, or another kind of numeric expression involving operators, casts, and function return values. You cannot refer to a column or use a subquery. Currently, we do not detect whether the expression in LIMIT clause is foldable or not. If non-foldable, we might issue a strange error message. For example,
```SQL
SELECT * FROM testData LIMIT rand() > 0.2
```
Then, a misleading error message is issued, like
```
assertion failed: No plan for GlobalLimit (_nondeterministic#203 > 0.2)
+- Project [key#11, value#12, rand(-1441968339187861415) AS _nondeterministic#203]
+- LocalLimit (_nondeterministic#202 > 0.2)
+- Project [key#11, value#12, rand(-1308350387169017676) AS _nondeterministic#202]
+- LogicalRDD [key#11, value#12]
java.lang.AssertionError: assertion failed: No plan for GlobalLimit (_nondeterministic#203 > 0.2)
+- Project [key#11, value#12, rand(-1441968339187861415) AS _nondeterministic#203]
+- LocalLimit (_nondeterministic#202 > 0.2)
+- Project [key#11, value#12, rand(-1308350387169017676) AS _nondeterministic#202]
+- LogicalRDD [key#11, value#12]
```
This PR detects it and then issues a meaningful error message.
#### How was this patch tested?
Added test cases.
Author: gatorsmile <gatorsmile@gmail.com>
Closes#14034 from gatorsmile/limit.
## What changes were proposed in this pull request?
This patch implements all remaining xpath functions that Hive supports and not natively supported in Spark: xpath_int, xpath_short, xpath_long, xpath_float, xpath_double, xpath_string, and xpath.
## How was this patch tested?
Added unit tests and end-to-end tests.
Author: petermaxlee <petermaxlee@gmail.com>
Closes#13991 from petermaxlee/SPARK-16318.
## What changes were proposed in this pull request?
This PR adds parse_url SQL functions in order to remove Hive fallback.
A new implementation of #13999
## How was this patch tested?
Pass the exist tests including new testcases.
Author: wujian <jan.chou.wu@gmail.com>
Closes#14008 from janplus/SPARK-16281.
## What changes were proposed in this pull request?
This PR implements `sentences` SQL function.
## How was this patch tested?
Pass the Jenkins tests with a new testcase.
Author: Dongjoon Hyun <dongjoon@apache.org>
Closes#14004 from dongjoon-hyun/SPARK_16285.
## What changes were proposed in this pull request?
This small patch modifies ExpressionEvalHelper. checkEvaluation to support comparing NaN values for floating point comparisons.
## How was this patch tested?
This is a test harness change.
Author: petermaxlee <petermaxlee@gmail.com>
Closes#14103 from petermaxlee/SPARK-16436.
## What changes were proposed in this pull request?
This PR improves `OptimizeIn` optimizer to remove the literal repetitions from SQL `IN` predicates. This optimizer prevents user mistakes and also can optimize some queries like [TPCDS-36](https://github.com/apache/spark/blob/master/sql/core/src/test/resources/tpcds/q36.sql#L19).
**Before**
```scala
scala> sql("select state from (select explode(array('CA','TN')) state) where state in ('TN','TN','TN','TN','TN','TN','TN')").explain
== Physical Plan ==
*Filter state#6 IN (TN,TN,TN,TN,TN,TN,TN)
+- Generate explode([CA,TN]), false, false, [state#6]
+- Scan OneRowRelation[]
```
**After**
```scala
scala> sql("select state from (select explode(array('CA','TN')) state) where state in ('TN','TN','TN','TN','TN','TN','TN')").explain
== Physical Plan ==
*Filter state#6 IN (TN)
+- Generate explode([CA,TN]), false, false, [state#6]
+- Scan OneRowRelation[]
```
## How was this patch tested?
Pass the Jenkins tests (including a new testcase).
Author: Dongjoon Hyun <dongjoon@apache.org>
Closes#13876 from dongjoon-hyun/SPARK-16174.
#### What changes were proposed in this pull request?
Different from the other leaf nodes, `MetastoreRelation` and `SimpleCatalogRelation` have a pre-defined `alias`, which is used to change the qualifier of the node. However, based on the existing alias handling, alias should be put in `SubqueryAlias`.
This PR is to separate alias handling from `MetastoreRelation` and `SimpleCatalogRelation` to make it consistent with the other nodes. It simplifies the signature and conversion to a `BaseRelation`.
For example, below is an example query for `MetastoreRelation`, which is converted to a `LogicalRelation`:
```SQL
SELECT tmp.a + 1 FROM test_parquet_ctas tmp WHERE tmp.a > 2
```
Before changes, the analyzed plan is
```
== Analyzed Logical Plan ==
(a + 1): int
Project [(a#951 + 1) AS (a + 1)#952]
+- Filter (a#951 > 2)
+- SubqueryAlias tmp
+- Relation[a#951] parquet
```
After changes, the analyzed plan becomes
```
== Analyzed Logical Plan ==
(a + 1): int
Project [(a#951 + 1) AS (a + 1)#952]
+- Filter (a#951 > 2)
+- SubqueryAlias tmp
+- SubqueryAlias test_parquet_ctas
+- Relation[a#951] parquet
```
**Note: the optimized plans are the same.**
For `SimpleCatalogRelation`, the existing code always generates two Subqueries. Thus, no change is needed.
#### How was this patch tested?
Added test cases.
Author: gatorsmile <gatorsmile@gmail.com>
Closes#14053 from gatorsmile/removeAliasFromMetastoreRelation.
## What changes were proposed in this pull request?
This PR implements `stack` table generating function.
## How was this patch tested?
Pass the Jenkins tests including new testcases.
Author: Dongjoon Hyun <dongjoon@apache.org>
Closes#14033 from dongjoon-hyun/SPARK-16286.
## What changes were proposed in this pull request?
This PR implements `inline` table generating function.
## How was this patch tested?
Pass the Jenkins tests with new testcase.
Author: Dongjoon Hyun <dongjoon@apache.org>
Closes#13976 from dongjoon-hyun/SPARK-16288.
## What changes were proposed in this pull request?
This PR adds `map_keys` and `map_values` SQL functions in order to remove Hive fallback.
## How was this patch tested?
Pass the Jenkins tests including new testcases.
Author: Dongjoon Hyun <dongjoon@apache.org>
Closes#13967 from dongjoon-hyun/SPARK-16278.
## What changes were proposed in this pull request?
This PR adds a new logical optimizer, `PropagateEmptyRelation`, to collapse a logical plans consisting of only empty LocalRelations.
**Optimizer Targets**
1. Binary(or Higher)-node Logical Plans
- Union with all empty children.
- Join with one or two empty children (including Intersect/Except).
2. Unary-node Logical Plans
- Project/Filter/Sample/Join/Limit/Repartition with all empty children.
- Aggregate with all empty children and without AggregateFunction expressions, COUNT.
- Generate with Explode because other UserDefinedGenerators like Hive UDTF returns results.
**Sample Query**
```sql
WITH t1 AS (SELECT a FROM VALUES 1 t(a)),
t2 AS (SELECT b FROM VALUES 1 t(b) WHERE 1=2)
SELECT a,b
FROM t1, t2
WHERE a=b
GROUP BY a,b
HAVING a>1
ORDER BY a,b
```
**Before**
```scala
scala> sql("with t1 as (select a from values 1 t(a)), t2 as (select b from values 1 t(b) where 1=2) select a,b from t1, t2 where a=b group by a,b having a>1 order by a,b").explain
== Physical Plan ==
*Sort [a#0 ASC, b#1 ASC], true, 0
+- Exchange rangepartitioning(a#0 ASC, b#1 ASC, 200)
+- *HashAggregate(keys=[a#0, b#1], functions=[])
+- Exchange hashpartitioning(a#0, b#1, 200)
+- *HashAggregate(keys=[a#0, b#1], functions=[])
+- *BroadcastHashJoin [a#0], [b#1], Inner, BuildRight
:- *Filter (isnotnull(a#0) && (a#0 > 1))
: +- LocalTableScan [a#0]
+- BroadcastExchange HashedRelationBroadcastMode(List(cast(input[0, int, false] as bigint)))
+- *Filter (isnotnull(b#1) && (b#1 > 1))
+- LocalTableScan <empty>, [b#1]
```
**After**
```scala
scala> sql("with t1 as (select a from values 1 t(a)), t2 as (select b from values 1 t(b) where 1=2) select a,b from t1, t2 where a=b group by a,b having a>1 order by a,b").explain
== Physical Plan ==
LocalTableScan <empty>, [a#0, b#1]
```
## How was this patch tested?
Pass the Jenkins tests (including a new testsuite).
Author: Dongjoon Hyun <dongjoon@apache.org>
Closes#13906 from dongjoon-hyun/SPARK-16208.
## What changes were proposed in this pull request?
This patch implements the elt function, as it is implemented in Hive.
## How was this patch tested?
Added expression unit test in StringExpressionsSuite and end-to-end test in StringFunctionsSuite.
Author: petermaxlee <petermaxlee@gmail.com>
Closes#13966 from petermaxlee/SPARK-16276.
## What changes were proposed in this pull request?
This PR implements `posexplode` table generating function. Currently, master branch raises the following exception for `map` argument. It's different from Hive.
**Before**
```scala
scala> sql("select posexplode(map('a', 1, 'b', 2))").show
org.apache.spark.sql.AnalysisException: No handler for Hive UDF ... posexplode() takes an array as a parameter; line 1 pos 7
```
**After**
```scala
scala> sql("select posexplode(map('a', 1, 'b', 2))").show
+---+---+-----+
|pos|key|value|
+---+---+-----+
| 0| a| 1|
| 1| b| 2|
+---+---+-----+
```
For `array` argument, `after` is the same with `before`.
```
scala> sql("select posexplode(array(1, 2, 3))").show
+---+---+
|pos|col|
+---+---+
| 0| 1|
| 1| 2|
| 2| 3|
+---+---+
```
## How was this patch tested?
Pass the Jenkins tests with newly added testcases.
Author: Dongjoon Hyun <dongjoon@apache.org>
Closes#13971 from dongjoon-hyun/SPARK-16289.
## What changes were proposed in this pull request?
This PR Checks the size limit when doubling the array size in BufferHolder to avoid integer overflow.
## How was this patch tested?
Manual test.
Author: Sean Zhong <seanzhong@databricks.com>
Closes#13829 from clockfly/SPARK-16071_2.
## What changes were proposed in this pull request?
This patch implements xpath_boolean expression for Spark SQL, a xpath function that returns true or false. The implementation is modelled after Hive's xpath_boolean, except that how the expression handles null inputs. Hive throws a NullPointerException at runtime if either of the input is null. This implementation returns null if either of the input is null.
## How was this patch tested?
Created two new test suites. One for unit tests covering the expression, and the other for end-to-end test in SQL.
Author: petermaxlee <petermaxlee@gmail.com>
Closes#13964 from petermaxlee/SPARK-16274.
## What changes were proposed in this pull request?
This PR adds 3 optimizer rules for typed filter:
1. push typed filter down through `SerializeFromObject` and eliminate the deserialization in filter condition.
2. pull typed filter up through `SerializeFromObject` and eliminate the deserialization in filter condition.
3. combine adjacent typed filters and share the deserialized object among all the condition expressions.
This PR also adds `TypedFilter` logical plan, to separate it from normal filter, so that the concept is more clear and it's easier to write optimizer rules.
## How was this patch tested?
`TypedFilterOptimizationSuite`
Author: Wenchen Fan <wenchen@databricks.com>
Closes#13846 from cloud-fan/filter.
## What changes were proposed in this pull request?
This extends SPARK-15860 to include metrics for the actual bytecode size of janino-generated methods. They can be accessed in the same way as any other codahale metric, e.g.
```
scala> org.apache.spark.metrics.source.CodegenMetrics.METRIC_GENERATED_CLASS_BYTECODE_SIZE.getSnapshot().getValues()
res7: Array[Long] = Array(532, 532, 532, 542, 1479, 2670, 3585, 3585)
scala> org.apache.spark.metrics.source.CodegenMetrics.METRIC_GENERATED_METHOD_BYTECODE_SIZE.getSnapshot().getValues()
res8: Array[Long] = Array(5, 5, 5, 5, 10, 10, 10, 10, 15, 15, 15, 38, 63, 79, 88, 94, 94, 94, 132, 132, 165, 165, 220, 220)
```
## How was this patch tested?
Small unit test, also verified manually that the performance impact is minimal (<10%). hvanhovell
Author: Eric Liang <ekl@databricks.com>
Closes#13934 from ericl/spark-16238.
## What changes were proposed in this pull request?
The analyzer rule for resolving using joins should respect the case sensitivity setting.
## How was this patch tested?
New tests in ResolveNaturalJoinSuite
Author: Yin Huai <yhuai@databricks.com>
Closes#13977 from yhuai/SPARK-16301.
#### What changes were proposed in this pull request?
Based on the previous discussion with cloud-fan hvanhovell in another related PR https://github.com/apache/spark/pull/13764#discussion_r67994276, it looks reasonable to add convenience methods for users to add `comment` when defining `StructField`.
Currently, the column-related `comment` attribute is stored in `Metadata` of `StructField`. For example, users can add the `comment` attribute using the following way:
```Scala
StructType(
StructField(
"cl1",
IntegerType,
nullable = false,
new MetadataBuilder().putString("comment", "test").build()) :: Nil)
```
This PR is to add more user friendly methods for the `comment` attribute when defining a `StructField`. After the changes, users are provided three different ways to do it:
```Scala
val struct = (new StructType)
.add("a", "int", true, "test1")
val struct = (new StructType)
.add("c", StringType, true, "test3")
val struct = (new StructType)
.add(StructField("d", StringType).withComment("test4"))
```
#### How was this patch tested?
Added test cases:
- `DataTypeSuite` is for testing three types of API changes,
- `DataFrameReaderWriterSuite` is for parquet, json and csv formats - using in-memory catalog
- `OrcQuerySuite.scala` is for orc format using Hive-metastore
Author: gatorsmile <gatorsmile@gmail.com>
Closes#13860 from gatorsmile/newMethodForComment.
## What changes were proposed in this pull request?
`MAX(COUNT(*))` is invalid since aggregate expression can't be nested within another aggregate expression. This case should be captured at analysis phase, but somehow sneaks off to runtime.
The reason is that when checking aggregate expressions in `CheckAnalysis`, a checking branch treats all expressions that reference no input attributes as valid ones. However, `MAX(COUNT(*))` is translated into `MAX(COUNT(1))` at analysis phase and also references no input attribute.
This PR fixes this issue by removing the aforementioned branch.
## How was this patch tested?
New test case added in `AnalysisErrorSuite`.
Author: Cheng Lian <lian@databricks.com>
Closes#13968 from liancheng/spark-16291-nested-agg-functions.
## What changes were proposed in this pull request?
This patch ports Hive's UDFXPathUtil over to Spark, which can be used to implement xpath functionality in Spark in the near future.
## How was this patch tested?
Added two new test suites UDFXPathUtilSuite and ReusableStringReaderSuite. They have been ported over from Hive (but rewritten in Scala in order to leverage ScalaTest).
Author: petermaxlee <petermaxlee@gmail.com>
Closes#13961 from petermaxlee/xpath.
## What changes were proposed in this pull request?
Fixes a couple old references to `DataFrameWriter.startStream` to `DataStreamWriter.start
Author: Burak Yavuz <brkyvz@gmail.com>
Closes#13952 from brkyvz/minor-doc-fix.
## What changes were proposed in this pull request?
Spark currently shows all functions when issue a `SHOW FUNCTIONS` command. This PR refines the `SHOW FUNCTIONS` command by allowing users to select all functions, user defined function or system functions. The following syntax can be used:
**ALL** (default)
```SHOW FUNCTIONS```
```SHOW ALL FUNCTIONS```
**SYSTEM**
```SHOW SYSTEM FUNCTIONS```
**USER**
```SHOW USER FUNCTIONS```
## How was this patch tested?
Updated tests and added tests to the DDLSuite
Author: Herman van Hovell <hvanhovell@databricks.com>
Closes#13929 from hvanhovell/SPARK-16220.
## What changes were proposed in this pull request?
This pr is to remove `hashCode` and `equals` in `ArrayBasedMapData` because the type cannot be used as join keys, grouping keys, or in equality tests.
## How was this patch tested?
Add a new test suite `MapDataSuite` for comparison tests.
Author: Takeshi YAMAMURO <linguin.m.s@gmail.com>
Closes#13847 from maropu/UnsafeMapTest.
## What changes were proposed in this pull request?
This PR changes `CombineFilters` to compose the final predicate condition by using (`child predicate` AND `parent predicate`) instead of (`parent predicate` AND `child predicate`). This is a best effort approach. Some other optimization rules may destroy this order by reorganizing conjunctive predicates.
**Reported Error Scenario**
Chris McCubbin reported a bug when he used StringIndexer in an ML pipeline with additional filters. It seems that during filter pushdown, we changed the ordering in the logical plan.
```scala
import org.apache.spark.ml.feature._
val df1 = (0 until 3).map(_.toString).toDF
val indexer = new StringIndexer()
.setInputCol("value")
.setOutputCol("idx")
.setHandleInvalid("skip")
.fit(df1)
val df2 = (0 until 5).map(_.toString).toDF
val predictions = indexer.transform(df2)
predictions.show() // this is okay
predictions.where('idx > 2).show() // this will throw an exception
```
Please see the notebook at https://databricks-prod-cloudfront.cloud.databricks.com/public/4027ec902e239c93eaaa8714f173bcfc/1233855/2159162931615821/588180/latest.html for error messages.
## How was this patch tested?
Pass the Jenkins tests (including a new testcase).
Author: Dongjoon Hyun <dongjoon@apache.org>
Closes#13872 from dongjoon-hyun/SPARK-16164.
## What changes were proposed in this pull request?
Currently, we use local timezone to parse or format a timestamp (TimestampType), then use Long as the microseconds since epoch UTC.
In from_utc_timestamp() and to_utc_timestamp(), we did not consider the local timezone, they could return different results with different local timezone.
This PR will do the conversion based on human time (in local timezone), it should return same result in whatever timezone. But because the mapping from absolute timestamp to human time is not exactly one-to-one mapping, it will still return wrong result in some timezone (also in the begging or ending of DST).
This PR is kind of the best effort fix. In long term, we should make the TimestampType be timezone aware to fix this totally.
## How was this patch tested?
Tested these function in all timezone.
Author: Davies Liu <davies@databricks.com>
Closes#13784 from davies/convert_tz.
## What changes were proposed in this pull request?
Internally, we use Int to represent a date (the days since 1970-01-01), when we convert that into unix timestamp (milli-seconds since epoch in UTC), we get the offset of a timezone using local millis (the milli-seconds since 1970-01-01 in a timezone), but TimeZone.getOffset() expect unix timestamp, the result could be off by one hour (in Daylight Saving Time (DST) or not).
This PR change to use best effort approximate of posix timestamp to lookup the offset. In the event of changing of DST, Some time is not defined (for example, 2016-03-13 02:00:00 PST), or could lead to multiple valid result in UTC (for example, 2016-11-06 01:00:00), this best effort approximate should be enough in practice.
## How was this patch tested?
Added regression tests.
Author: Davies Liu <davies@databricks.com>
Closes#13652 from davies/fix_timezone.
## What changes were proposed in this pull request?
This small patch renames a few optimizer rules to make the naming more consistent, e.g. class name start with a verb. The main important "fix" is probably SamplePushDown -> PushProjectThroughSample. SamplePushDown is actually the wrong name, since the rule is not about pushing Sample down.
## How was this patch tested?
Updated test cases.
Author: Reynold Xin <rxin@databricks.com>
Closes#13732 from rxin/SPARK-16014.
#### What changes were proposed in this pull request?
`IF NOT EXISTS` in `INSERT OVERWRITE` should not support dynamic partitions. If we specify `IF NOT EXISTS`, the inserted statement is not shown in the table.
This PR is to issue an exception in this case, just like what Hive does. Also issue an exception if users specify `IF NOT EXISTS` if users do not specify any `PARTITION` specification.
#### How was this patch tested?
Added test cases into `PlanParserSuite` and `InsertIntoHiveTableSuite`
Author: gatorsmile <gatorsmile@gmail.com>
Closes#13447 from gatorsmile/insertIfNotExist.
## What changes were proposed in this pull request?
This PR fixes the problem that Divide Expression inside Aggregation function is casted to wrong type, which cause `select 1/2` and `select sum(1/2)`returning different result.
**Before the change:**
```
scala> sql("select 1/2 as a").show()
+---+
| a|
+---+
|0.5|
+---+
scala> sql("select sum(1/2) as a").show()
+---+
| a|
+---+
|0 |
+---+
scala> sql("select sum(1 / 2) as a").schema
res4: org.apache.spark.sql.types.StructType = StructType(StructField(a,LongType,true))
```
**After the change:**
```
scala> sql("select 1/2 as a").show()
+---+
| a|
+---+
|0.5|
+---+
scala> sql("select sum(1/2) as a").show()
+---+
| a|
+---+
|0.5|
+---+
scala> sql("select sum(1/2) as a").schema
res4: org.apache.spark.sql.types.StructType = StructType(StructField(a,DoubleType,true))
```
## How was this patch tested?
Unit test.
This PR is based on https://github.com/apache/spark/pull/13524 by Sephiroth-Lin
Author: Sean Zhong <seanzhong@databricks.com>
Closes#13651 from clockfly/SPARK-15776.
## What changes were proposed in this pull request?
Adds codahale metrics for the codegen source text size and how long it takes to compile. The size is particularly interesting, since the JVM does have hard limits on how large methods can get.
To simplify, I added the metrics under a statically-initialized source that is always registered with SparkEnv.
## How was this patch tested?
Unit tests
Author: Eric Liang <ekl@databricks.com>
Closes#13586 from ericl/spark-15860.
## What changes were proposed in this pull request?
When the output mode is complete, then the output of a streaming aggregation essentially will contain the complete aggregates every time. So this is not different from a batch dataset within an incremental execution. Other non-streaming operations should be supported on this dataset. In this PR, I am just adding support for sorting, as it is a common useful functionality. Support for other operations will come later.
## How was this patch tested?
Additional unit tests.
Author: Tathagata Das <tathagata.das1565@gmail.com>
Closes#13549 from tdas/SPARK-15812.
## What changes were proposed in this pull request?
The parser currently does not allow the use of some SQL keywords as table or field names. This PR adds supports for all keywords as identifier. The exception to this are table aliases, in this case most keywords are allowed except for join keywords (```anti, full, inner, left, semi, right, natural, on, join, cross```) and set-operator keywords (```union, intersect, except```).
## How was this patch tested?
I have added/move/renamed test in the catalyst `*ParserSuite`s.
Author: Herman van Hovell <hvanhovell@databricks.com>
Closes#13534 from hvanhovell/SPARK-15789.
## What changes were proposed in this pull request?
This PR makes sure the typed Filter doesn't change the Dataset schema.
**Before the change:**
```
scala> val df = spark.range(0,9)
scala> df.schema
res12: org.apache.spark.sql.types.StructType = StructType(StructField(id,LongType,false))
scala> val afterFilter = df.filter(_=>true)
scala> afterFilter.schema // !!! schema is CHANGED!!! Column name is changed from id to value, nullable is changed from false to true.
res13: org.apache.spark.sql.types.StructType = StructType(StructField(value,LongType,true))
```
SerializeFromObject and DeserializeToObject are inserted to wrap the Filter, and these two can possibly change the schema of Dataset.
**After the change:**
```
scala> afterFilter.schema // schema is NOT changed.
res47: org.apache.spark.sql.types.StructType = StructType(StructField(id,LongType,false))
```
## How was this patch tested?
Unit test.
Author: Sean Zhong <seanzhong@databricks.com>
Closes#13529 from clockfly/spark-15632.
## What changes were proposed in this pull request?
This PR improves the error handling of `RowEncoder`. When we create a `RowEncoder` with a given schema, we should validate the data type of input object. e.g. we should throw an exception when a field is boolean but is declared as a string column.
This PR also removes the support to use `Product` as a valid external type of struct type. This support is added at https://github.com/apache/spark/pull/9712, but is incomplete, e.g. nested product, product in array are both not working. However, we never officially support this feature and I think it's ok to ban it.
## How was this patch tested?
new tests in `RowEncoderSuite`.
Author: Wenchen Fan <wenchen@databricks.com>
Closes#13401 from cloud-fan/bug.
## What changes were proposed in this pull request?
In forType function of object RandomDataGenerator, the code following:
if (maybeSqlTypeGenerator.isDefined){
....
Some(generator)
} else{
None
}
will be changed. Instead, maybeSqlTypeGenerator.map will be used.
## How was this patch tested?
All of the current unit tests passed.
Author: Weiqing Yang <yangweiqing001@gmail.com>
Closes#13448 from Sherry302/master.
## What changes were proposed in this pull request?
For input object of non-flat type, we can't encode it to row if it's null, as Spark SQL doesn't allow row to be null, only its columns can be null.
This PR explicitly add this constraint and throw exception if users break it.
## How was this patch tested?
several new tests
Author: Wenchen Fan <wenchen@databricks.com>
Closes#13469 from cloud-fan/null-object.
## What changes were proposed in this pull request?
There are 2 kinds of `GetStructField`:
1. resolved from `UnresolvedExtractValue`, and it will have a `name` property.
2. created when we build deserializer expression for nested tuple, no `name` property.
When we want to validate the ordinals of nested tuple, we should only catch `GetStructField` without the name property.
## How was this patch tested?
new test in `EncoderResolutionSuite`
Author: Wenchen Fan <wenchen@databricks.com>
Closes#13474 from cloud-fan/ordinal-check.
## What changes were proposed in this pull request?
Our encoder framework has been evolved a lot, this PR tries to clean up the code to make it more readable and emphasise the concept that encoder should be used as a container of serde expressions.
1. move validation logic to analyzer instead of encoder
2. only have a `resolveAndBind` method in encoder instead of `resolve` and `bind`, as we don't have the encoder life cycle concept anymore.
3. `Dataset` don't need to keep a resolved encoder, as there is no such concept anymore. bound encoder is still needed to do serialization outside of query framework.
4. Using `BoundReference` to represent an unresolved field in deserializer expression is kind of weird, this PR adds a `GetColumnByOrdinal` for this purpose. (serializer expression still use `BoundReference`, we can replace it with `GetColumnByOrdinal` in follow-ups)
## How was this patch tested?
existing test
Author: Wenchen Fan <wenchen@databricks.com>
Author: Cheng Lian <lian@databricks.com>
Closes#13269 from cloud-fan/clean-encoder.
## What changes were proposed in this pull request?
This command didn't work for Hive tables. Now it does:
```
ALTER TABLE boxes PARTITION (width=3)
SET SERDE 'com.sparkbricks.serde.ColumnarSerDe'
WITH SERDEPROPERTIES ('compress'='true')
```
## How was this patch tested?
`HiveExternalCatalogSuite`
Author: Andrew Or <andrew@databricks.com>
Closes#13453 from andrewor14/alter-partition-storage.
## What changes were proposed in this pull request?
This patch fixes a number of `com.esotericsoftware.kryo.KryoException: java.lang.NullPointerException` exceptions reported in [SPARK-15604], [SPARK-14752] etc. (while executing sparkSQL queries with the kryo serializer) by explicitly implementing `KryoSerialization` for `LazilyGenerateOrdering`.
## How was this patch tested?
1. Modified `OrderingSuite` so that all tests in the suite also test kryo serialization (for both interpreted and generated ordering).
2. Manually verified TPC-DS q1.
Author: Sameer Agarwal <sameer@databricks.com>
Closes#13466 from sameeragarwal/kryo.
## What changes were proposed in this pull request?
This PR add a rule at the end of analyzer to correct nullable fields of attributes in a logical plan by using nullable fields of the corresponding attributes in its children logical plans (these plans generate the input rows).
This is another approach for addressing SPARK-13484 (the first approach is https://github.com/apache/spark/pull/11371).
Close#113711
Author: Takeshi YAMAMURO <linguin.m.s@gmail.com>
Author: Yin Huai <yhuai@databricks.com>
Closes#13290 from yhuai/SPARK-13484.
## What changes were proposed in this pull request?
This patch moves all user-facing structured streaming classes into sql.streaming. As part of this, I also added some since version annotation to methods and classes that don't have them.
## How was this patch tested?
Updated tests to reflect the moves.
Author: Reynold Xin <rxin@databricks.com>
Closes#13429 from rxin/SPARK-15686.
## What changes were proposed in this pull request?
Currently structured streaming only supports append output mode. This PR adds the following.
- Added support for Complete output mode in the internal state store, analyzer and planner.
- Added public API in Scala and Python for users to specify output mode
- Added checks for unsupported combinations of output mode and DF operations
- Plans with no aggregation should support only Append mode
- Plans with aggregation should support only Update and Complete modes
- Default output mode is Append mode (**Question: should we change this to automatically set to Complete mode when there is aggregation?**)
- Added support for Complete output mode in Memory Sink. So Memory Sink internally supports append and complete, update. But from public API only Complete and Append output modes are supported.
## How was this patch tested?
Unit tests in various test suites
- StreamingAggregationSuite: tests for complete mode
- MemorySinkSuite: tests for checking behavior in Append and Complete modes.
- UnsupportedOperationSuite: tests for checking unsupported combinations of DF ops and output modes
- DataFrameReaderWriterSuite: tests for checking that output mode cannot be called on static DFs
- Python doc test and existing unit tests modified to call write.outputMode.
Author: Tathagata Das <tathagata.das1565@gmail.com>
Closes#13286 from tdas/complete-mode.
In this case, the result type of the expression becomes DECIMAL(38, 36) as we promote the individual string literals to DECIMAL(38, 18) when we handle string promotions for `BinaryArthmaticExpression`.
I think we need to cast the string literals to Double type instead. I looked at the history and found that this was changed to use decimal instead of double to avoid potential loss of precision when we cast decimal to double.
To double check i ran the query against hive, mysql. This query returns non NULL result for both the databases and both promote the expression to use double.
Here is the output.
- Hive
```SQL
hive> create table l2 as select (cast(99 as decimal(19,6)) + '2') from l1;
OK
hive> describe l2;
OK
_c0 double
```
- MySQL
```SQL
mysql> create table foo2 as select (cast(99 as decimal(19,6)) + '2') from test;
Query OK, 1 row affected (0.01 sec)
Records: 1 Duplicates: 0 Warnings: 0
mysql> describe foo2;
+-----------------------------------+--------+------+-----+---------+-------+
| Field | Type | Null | Key | Default | Extra |
+-----------------------------------+--------+------+-----+---------+-------+
| (cast(99 as decimal(19,6)) + '2') | double | NO | | 0 | |
+-----------------------------------+--------+------+-----+---------+-------+
```
## How was this patch tested?
Added a new test in SQLQuerySuite
Author: Dilip Biswal <dbiswal@us.ibm.com>
Closes#13368 from dilipbiswal/spark-15557.
## What changes were proposed in this pull request?
This change resolves a number of build warnings that have accumulated, before 2.x. It does not address a large number of deprecation warnings, especially related to the Accumulator API. That will happen separately.
## How was this patch tested?
Jenkins
Author: Sean Owen <sowen@cloudera.com>
Closes#13377 from srowen/BuildWarnings.
## What changes were proposed in this pull request?
This patch removes the last two commands defined in the catalyst module: DescribeFunction and ShowFunctions. They were unnecessary since the parser could just generate DescribeFunctionCommand and ShowFunctionsCommand directly.
## How was this patch tested?
Created a new SparkSqlParserSuite.
Author: Reynold Xin <rxin@databricks.com>
Closes#13292 from rxin/SPARK-15436.
## What changes were proposed in this pull request?
This PR fixes 3 slow tests:
1. `ParquetQuerySuite.read/write wide table`: This is not a good unit test as it runs more than 5 minutes. This PR removes it and add a new regression test in `CodeGenerationSuite`, which is more "unit".
2. `ParquetQuerySuite.returning batch for wide table`: reduce the threshold and use smaller data size.
3. `DatasetSuite.SPARK-14554: Dataset.map may generate wrong java code for wide table`: Improve `CodeFormatter.format`(introduced at https://github.com/apache/spark/pull/12979) can dramatically speed this it up.
## How was this patch tested?
N/A
Author: Wenchen Fan <wenchen@databricks.com>
Closes#13273 from cloud-fan/test.
## What changes were proposed in this pull request?
in hive, `locate("aa", "aaa", 0)` would yield 0, `locate("aa", "aaa", 1)` would yield 1 and `locate("aa", "aaa", 2)` would yield 2, while in Spark, `locate("aa", "aaa", 0)` would yield 1, `locate("aa", "aaa", 1)` would yield 2 and `locate("aa", "aaa", 2)` would yield 0. This results from the different understanding of the third parameter in udf `locate`. It means the starting index and starts from 1, so when we use 0, the return would always be 0.
## How was this patch tested?
tested with modified `StringExpressionsSuite` and `StringFunctionsSuite`
Author: Daoyuan Wang <daoyuan.wang@intel.com>
Closes#13186 from adrian-wang/locate.
## What changes were proposed in this pull request?
When invalid date string like "2015-02-29 00:00:00" are cast as date or timestamp using spark sql, it used to not return null but another valid date (2015-03-01 in this case).
In this pr, invalid date string like "2016-02-29" and "2016-04-31" are returned as null when cast as date or timestamp.
## How was this patch tested?
Unit tests are added.
(If this patch involves UI changes, please attach a screenshot; otherwise, remove this)
Author: wangyang <wangyang@haizhi.com>
Closes#13169 from wangyang1992/invalid_date.
## What changes were proposed in this pull request?
Incrementalizing plans of with multiple streaming aggregation is tricky and we dont have the necessary support for "delta" to implement correctly. So disabling the support for multiple streaming aggregations.
## How was this patch tested?
Additional unit tests
Author: Tathagata Das <tathagata.das1565@gmail.com>
Closes#13210 from tdas/SPARK-15428.
## What changes were proposed in this pull request?
This patch simplifies the implementation of Range operator and make the explain string consistent between logical plan and physical plan. To do this, I changed RangeExec to embed a Range logical plan in it.
Before this patch (note that the logical Range and physical Range actually output different information):
```
== Optimized Logical Plan ==
Range 0, 100, 2, 2, [id#8L]
== Physical Plan ==
*Range 0, 2, 2, 50, [id#8L]
```
After this patch:
If step size is 1:
```
== Optimized Logical Plan ==
Range(0, 100, splits=2)
== Physical Plan ==
*Range(0, 100, splits=2)
```
If step size is not 1:
```
== Optimized Logical Plan ==
Range (0, 100, step=2, splits=2)
== Physical Plan ==
*Range (0, 100, step=2, splits=2)
```
## How was this patch tested?
N/A
Author: Reynold Xin <rxin@databricks.com>
Closes#13239 from rxin/SPARK-15459.
#### What changes were proposed in this pull request?
When there are duplicate keys in the partition specs or table properties, we always use the last value and ignore all the previous values. This is caused by the function call `toMap`.
partition specs or table properties are widely used in multiple DDL statements.
This PR is to detect the duplicates and issue an exception if found.
#### How was this patch tested?
Added test cases in DDLSuite
Author: gatorsmile <gatorsmile@gmail.com>
Closes#13095 from gatorsmile/detectDuplicate.
## What changes were proposed in this pull request?
In only `catalyst` module, there exists 8 evaluation test cases on unresolved expressions. But, in real-world situation, those cases doesn't happen since they occurs exceptions before evaluations.
```scala
scala> sql("select format_number(null, 3)")
res0: org.apache.spark.sql.DataFrame = [format_number(CAST(NULL AS DOUBLE), 3): string]
scala> sql("select format_number(cast(null as NULL), 3)")
org.apache.spark.sql.catalyst.parser.ParseException:
DataType null() is not supported.(line 1, pos 34)
```
This PR makes those testcases more realistic.
```scala
- checkEvaluation(FormatNumber(Literal.create(null, NullType), Literal(3)), null)
+ assert(FormatNumber(Literal.create(null, NullType), Literal(3)).resolved === false)
```
Also, this PR also removes redundant `resolved` checking in `FoldablePropagation` optimizer.
## How was this patch tested?
Pass the modified Jenkins tests.
Author: Dongjoon Hyun <dongjoon@apache.org>
Closes#13241 from dongjoon-hyun/SPARK-15462.
## What changes were proposed in this pull request?
Right now inferring the schema for case classes happens before searching the SQLUserDefinedType annotation, so the SQLUserDefinedType annotation for case classes doesn't work.
This PR simply changes the inferring order to resolve it. I also reenabled the java.math.BigDecimal test and added two tests for `List`.
## How was this patch tested?
`encodeDecodeTest(UDTCaseClass(new java.net.URI("http://spark.apache.org/")), "udt with case class")`
Author: Shixiong Zhu <shixiong@databricks.com>
Closes#12965 from zsxwing/SPARK-15190.
## What changes were proposed in this pull request?
This PR introduce place holder for comment in generated code and the purpose is same for #12939 but much safer.
Generated code to be compiled doesn't include actual comments but includes place holder instead.
Place holders in generated code will be replaced with actual comments only at the time of logging.
Also, this PR can resolve SPARK-15205.
## How was this patch tested?
Existing tests.
Author: Kousuke Saruta <sarutak@oss.nttdata.co.jp>
Closes#12979 from sarutak/SPARK-15205.
## What changes were proposed in this pull request?
`CreateNamedStruct` and `CreateNamedStructUnsafe` should preserve metadata of value expressions if it is `NamedExpression` like `CreateStruct` or `CreateStructUnsafe` are doing.
## How was this patch tested?
Existing tests.
Author: Takuya UESHIN <ueshin@happy-camper.st>
Closes#13193 from ueshin/issues/SPARK-15400.
## What changes were proposed in this pull request?
The following code:
```
val ds = Seq(("a", 1), ("b", 2), ("c", 3)).toDS()
ds.filter(_._1 == "b").select(expr("_1").as[String]).foreach(println(_))
```
throws an Exception:
```
org.apache.spark.sql.catalyst.errors.package$TreeNodeException: Binding attribute, tree: _1#420
at org.apache.spark.sql.catalyst.errors.package$.attachTree(package.scala:50)
at org.apache.spark.sql.catalyst.expressions.BindReferences$$anonfun$bindReference$1.applyOrElse(BoundAttribute.scala:88)
at org.apache.spark.sql.catalyst.expressions.BindReferences$$anonfun$bindReference$1.applyOrElse(BoundAttribute.scala:87)
...
Cause: java.lang.RuntimeException: Couldn't find _1#420 in [_1#416,_2#417]
at scala.sys.package$.error(package.scala:27)
at org.apache.spark.sql.catalyst.expressions.BindReferences$$anonfun$bindReference$1$$anonfun$applyOrElse$1.apply(BoundAttribute.scala:94)
at org.apache.spark.sql.catalyst.expressions.BindReferences$$anonfun$bindReference$1$$anonfun$applyOrElse$1.apply(BoundAttribute.scala:88)
at org.apache.spark.sql.catalyst.errors.package$.attachTree(package.scala:49)
at org.apache.spark.sql.catalyst.expressions.BindReferences$$anonfun$bindReference$1.applyOrElse(BoundAttribute.scala:88)
at org.apache.spark.sql.catalyst.expressions.BindReferences$$anonfun$bindReference$1.applyOrElse(BoundAttribute.scala:87)
...
```
This is because `EmbedSerializerInFilter` rule drops the `exprId`s of output of surrounded `SerializeFromObject`.
The analyzed and optimized plans of the above example are as follows:
```
== Analyzed Logical Plan ==
_1: string
Project [_1#420]
+- SerializeFromObject [staticinvoke(class org.apache.spark.unsafe.types.UTF8String, StringType, fromString, input[0, scala.Tuple2]._1, true) AS _1#420,input[0, scala.Tuple2]._2 AS _2#421]
+- Filter <function1>.apply
+- DeserializeToObject newInstance(class scala.Tuple2), obj#419: scala.Tuple2
+- LocalRelation [_1#416,_2#417], [[0,1800000001,1,61],[0,1800000001,2,62],[0,1800000001,3,63]]
== Optimized Logical Plan ==
!Project [_1#420]
+- Filter <function1>.apply
+- LocalRelation [_1#416,_2#417], [[0,1800000001,1,61],[0,1800000001,2,62],[0,1800000001,3,63]]
```
This PR fixes `EmbedSerializerInFilter` rule to keep `exprId`s of output of surrounded `SerializeFromObject`.
The plans after this patch are as follows:
```
== Analyzed Logical Plan ==
_1: string
Project [_1#420]
+- SerializeFromObject [staticinvoke(class org.apache.spark.unsafe.types.UTF8String, StringType, fromString, input[0, scala.Tuple2]._1, true) AS _1#420,input[0, scala.Tuple2]._2 AS _2#421]
+- Filter <function1>.apply
+- DeserializeToObject newInstance(class scala.Tuple2), obj#419: scala.Tuple2
+- LocalRelation [_1#416,_2#417], [[0,1800000001,1,61],[0,1800000001,2,62],[0,1800000001,3,63]]
== Optimized Logical Plan ==
Project [_1#416]
+- Filter <function1>.apply
+- LocalRelation [_1#416,_2#417], [[0,1800000001,1,61],[0,1800000001,2,62],[0,1800000001,3,63]]
```
## How was this patch tested?
Existing tests and I added a test to check if `filter and then select` works.
Author: Takuya UESHIN <ueshin@happy-camper.st>
Closes#13096 from ueshin/issues/SPARK-15313.
## What changes were proposed in this pull request?
This patch fixes a bug in TypeUtils.checkForSameTypeInputExpr. Previously the code was testing on strict equality, which does not taking nullability into account.
This is based on https://github.com/apache/spark/pull/12768. This patch fixed a bug there (with empty expression) and added a test case.
## How was this patch tested?
Added a new test suite and test case.
Closes#12768.
Author: Reynold Xin <rxin@databricks.com>
Author: Oleg Danilov <oleg.danilov@wandisco.com>
Closes#13208 from rxin/SPARK-14990.
Hello : Can you help check this PR? I am adding support for the java.math.BigInteger for java bean code path. I saw internally spark is converting the BigInteger to BigDecimal in ColumnType.scala and CatalystRowConverter.scala. I use the similar way and convert the BigInteger to the BigDecimal. .
Author: Kevin Yu <qyu@us.ibm.com>
Closes#10125 from kevinyu98/working_on_spark-11827.
## What changes were proposed in this pull request?
Fix `MapObjects.itemAccessorMethod` to handle `TimestampType`. Without this fix, `Array[Timestamp]` cannot be properly encoded or decoded. To reproduce this, in `ExpressionEncoderSuite`, if you add the following test case:
`encodeDecodeTest(Array(Timestamp.valueOf("2016-01-29 10:00:00")), "array of timestamp")
`
... you will see that (without this fix) it fails with the following output:
```
- encode/decode for array of timestamp: [Ljava.sql.Timestamp;fd9ebde *** FAILED ***
Exception thrown while decoding
Converted: [0,1000000010,800000001,52a7ccdc36800]
Schema: value#61615
root
-- value: array (nullable = true)
|-- element: timestamp (containsNull = true)
Encoder:
class[value[0]: array<timestamp>] (ExpressionEncoderSuite.scala:312)
```
## How was this patch tested?
Existing tests
Author: Sumedh Mungee <smungee@gmail.com>
Closes#13108 from smungee/fix-itemAccessorMethod.
## What changes were proposed in this pull request?
This PR aims to add new **FoldablePropagation** optimizer that propagates foldable expressions by replacing all attributes with the aliases of original foldable expression. Other optimizations will take advantage of the propagated foldable expressions: e.g. `EliminateSorts` optimizer now can handle the following Case 2 and 3. (Case 1 is the previous implementation.)
1. Literals and foldable expression, e.g. "ORDER BY 1.0, 'abc', Now()"
2. Foldable ordinals, e.g. "SELECT 1.0, 'abc', Now() ORDER BY 1, 2, 3"
3. Foldable aliases, e.g. "SELECT 1.0 x, 'abc' y, Now() z ORDER BY x, y, z"
This PR has been generalized based on cloud-fan 's key ideas many times; he should be credited for the work he did.
**Before**
```
scala> sql("SELECT 1.0, Now() x ORDER BY 1, x").explain
== Physical Plan ==
WholeStageCodegen
: +- Sort [1.0#5 ASC,x#0 ASC], true, 0
: +- INPUT
+- Exchange rangepartitioning(1.0#5 ASC, x#0 ASC, 200), None
+- WholeStageCodegen
: +- Project [1.0 AS 1.0#5,1461873043577000 AS x#0]
: +- INPUT
+- Scan OneRowRelation[]
```
**After**
```
scala> sql("SELECT 1.0, Now() x ORDER BY 1, x").explain
== Physical Plan ==
WholeStageCodegen
: +- Project [1.0 AS 1.0#5,1461873079484000 AS x#0]
: +- INPUT
+- Scan OneRowRelation[]
```
## How was this patch tested?
Pass the Jenkins tests including a new test case.
Author: Dongjoon Hyun <dongjoon@apache.org>
Closes#12719 from dongjoon-hyun/SPARK-14939.
## What changes were proposed in this pull request?
toCommentSafeString method replaces "\u" with "\\\\u" to avoid codegen breaking.
But if the even number of "\" is put before "u", like "\\\\u", in the string literal in the query, codegen can break.
Following code causes compilation error.
```
val df = Seq(...).toDF
df.select("'\\\\\\\\u002A/'").show
```
The reason of the compilation error is because "\\\\\\\\\\\\\\\\u002A/" is translated into "*/" (the end of comment).
Due to this unsafety, arbitrary code can be injected like as follows.
```
val df = Seq(...).toDF
// Inject "System.exit(1)"
df.select("'\\\\\\\\u002A/{System.exit(1);}/*'").show
```
## How was this patch tested?
Added new test cases.
Author: Kousuke Saruta <sarutak@oss.nttdata.co.jp>
Author: sarutak <sarutak@oss.nttdata.co.jp>
Closes#12939 from sarutak/SPARK-15165.
## What changes were proposed in this pull request?
This PR improves `RowEncoder` and `MapObjects`, to support array as the external type for `ArrayType`. The idea is straightforward, we use `Object` as the external input type for `ArrayType`, and determine its type at runtime in `MapObjects`.
## How was this patch tested?
new test in `RowEncoderSuite`
Author: Wenchen Fan <wenchen@databricks.com>
Closes#13138 from cloud-fan/map-object.
## What changes were proposed in this pull request?
We originally designed the type coercion rules to match Hive, but over time we have diverged. It does not make sense to call it HiveTypeCoercion anymore. This patch renames it TypeCoercion.
## How was this patch tested?
Updated unit tests to reflect the rename.
Author: Reynold Xin <rxin@databricks.com>
Closes#13091 from rxin/SPARK-15310.
## What changes were proposed in this pull request?
This patch moves all the object related expressions into expressions.objects package, for better code organization.
## How was this patch tested?
N/A
Author: Reynold Xin <rxin@databricks.com>
Closes#13085 from rxin/SPARK-15306.
#### What changes were proposed in this pull request?
~~Currently, multiple partitions are allowed to drop by using a single DDL command: Alter Table Drop Partition. However, the internal implementation could break atomicity. That means, we could just drop a subset of qualified partitions, if hitting an exception when dropping one of qualified partitions~~
~~This PR contains the following behavior changes:~~
~~- disallow dropping multiple partitions by a single command ~~
~~- allow users to input predicates in partition specification and issue a nicer error message if the predicate's comparison operator is not `=`.~~
~~- verify the partition spec in SessionCatalog. This can ensure each partition spec in `Drop Partition` does not correspond to multiple partitions.~~
This PR has two major parts:
- Verify the partition spec in SessionCatalog for fixing the following issue:
```scala
sql(s"ALTER TABLE $externalTab DROP PARTITION (ds='2008-04-09', unknownCol='12')")
```
Above example uses an invalid partition spec. Without this PR, we will drop all the partitions. The reason is Hive megastores getPartitions API returns all the partitions if we provide an invalid spec.
- Re-implemented the `dropPartitions` in `HiveClientImpl`. Now, we always check if all the user-specified partition specs exist before attempting to drop the partitions. Previously, we start drop the partition before completing checking the existence of all the partition specs. If any failure happened after we start to drop the partitions, we will log an error message to indicate which partitions have been dropped and which partitions have not been dropped.
#### How was this patch tested?
Modified the existing test cases and added new test cases.
Author: gatorsmile <gatorsmile@gmail.com>
Author: xiaoli <lixiao1983@gmail.com>
Author: Xiao Li <xiaoli@Xiaos-MacBook-Pro.local>
Closes#12801 from gatorsmile/banDropMultiPart.
## What changes were proposed in this pull request?
Deprecates registerTempTable and add dataset.createTempView, dataset.createOrReplaceTempView.
## How was this patch tested?
Unit tests.
Author: Sean Zhong <seanzhong@databricks.com>
Closes#12945 from clockfly/spark-15171.
## What changes were proposed in this pull request?
SPARK-15241: We now support java decimal and catalyst decimal in external row, it makes sense to also support scala decimal.
SPARK-15242: This is a long-standing bug, and is exposed after https://github.com/apache/spark/pull/12364, which eliminate the `If` expression if the field is not nullable:
```
val fieldValue = serializerFor(
GetExternalRowField(inputObject, i, externalDataTypeForInput(f.dataType)),
f.dataType)
if (f.nullable) {
If(
Invoke(inputObject, "isNullAt", BooleanType, Literal(i) :: Nil),
Literal.create(null, f.dataType),
fieldValue)
} else {
fieldValue
}
```
Previously, we always use `DecimalType.SYSTEM_DEFAULT` as the output type of converted decimal field, which is wrong as it doesn't match the real decimal type. However, it works well because we always put converted field into `If` expression to do the null check, and `If` use its `trueValue`'s data type as its output type.
Now if we have a not nullable decimal field, then the converted field's output type will be `DecimalType.SYSTEM_DEFAULT`, and we will write wrong data into unsafe row.
The fix is simple, just use the given decimal type as the output type of converted decimal field.
These 2 issues was found at https://github.com/apache/spark/pull/13008
## How was this patch tested?
new tests in RowEncoderSuite
Author: Wenchen Fan <wenchen@databricks.com>
Closes#13019 from cloud-fan/encoder-decimal.
Since we cannot really trust if the underlying external catalog can throw exceptions when there is an invalid metadata operation, let's do it in SessionCatalog.
- [X] The first step is to unify the error messages issued in Hive-specific Session Catalog and general Session Catalog.
- [X] The second step is to verify the inputs of metadata operations for partitioning-related operations. This is moved to a separate PR: https://github.com/apache/spark/pull/12801
- [X] The third step is to add database existence verification in `SessionCatalog`
- [X] The fourth step is to add table existence verification in `SessionCatalog`
- [X] The fifth step is to add function existence verification in `SessionCatalog`
Add test cases and verify the error messages we issued
Author: gatorsmile <gatorsmile@gmail.com>
Author: xiaoli <lixiao1983@gmail.com>
Author: Xiao Li <xiaoli@Xiaos-MacBook-Pro.local>
Closes#12385 from gatorsmile/verifySessionAPIs.
#### What changes were proposed in this pull request?
This PR is to address a few existing issues in `EXPLAIN`:
- The `EXPLAIN` options `LOGICAL | FORMATTED | EXTENDED | CODEGEN` should not be 0 or more match. It should 0 or one match. Parser does not allow users to use more than one option in a single command.
- The option `LOGICAL` is not supported. Issue an exception when users specify this option in the command.
- The output of `EXPLAIN ` contains a weird empty line when the output of analyzed plan is empty. We should remove it. For example:
```
== Parsed Logical Plan ==
CreateTable CatalogTable(`t`,CatalogTableType(MANAGED),CatalogStorageFormat(None,Some(org.apache.hadoop.mapred.TextInputFormat),Some(org.apache.hadoop.hive.ql.io. HiveIgnoreKeyTextOutputFormat),None,false,Map()),List(CatalogColumn(col,int,true,None)),List(),List(),List(),-1,,1462725171656,-1,Map(),None,None,None), false
== Analyzed Logical Plan ==
CreateTable CatalogTable(`t`,CatalogTableType(MANAGED),CatalogStorageFormat(None,Some(org.apache.hadoop.mapred.TextInputFormat),Some(org.apache.hadoop.hive.ql.io. HiveIgnoreKeyTextOutputFormat),None,false,Map()),List(CatalogColumn(col,int,true,None)),List(),List(),List(),-1,,1462725171656,-1,Map(),None,None,None), false
== Optimized Logical Plan ==
CreateTable CatalogTable(`t`,CatalogTableType(MANAGED),CatalogStorageFormat(None,Some(org.apache.hadoop.mapred.TextInputFormat),Some(org.apache.hadoop.hive.ql.io. HiveIgnoreKeyTextOutputFormat),None,false,Map()),List(CatalogColumn(col,int,true,None)),List(),List(),List(),-1,,1462725171656,-1,Map(),None,None,None), false
...
```
#### How was this patch tested?
Added and modified a few test cases
Author: gatorsmile <gatorsmile@gmail.com>
Closes#12991 from gatorsmile/explainCreateTable.
## What changes were proposed in this pull request?
following operations have file system operation now:
1. CREATE DATABASE: create a dir
2. DROP DATABASE: delete the dir
3. CREATE TABLE: create a dir
4. DROP TABLE: delete the dir
5. RENAME TABLE: rename the dir
6. CREATE PARTITIONS: create a dir
7. RENAME PARTITIONS: rename the dir
8. DROP PARTITIONS: drop the dir
## How was this patch tested?
new tests in `ExternalCatalogSuite`
Author: Wenchen Fan <wenchen@databricks.com>
Closes#12871 from cloud-fan/catalog.
#### What changes were proposed in this pull request?
So far, in the implementation of InMemoryCatalog, we do not check if the new/destination table/function/partition exists or not. Thus, we just silently remove the existent table/function/partition.
This PR is to detect them and issue an appropriate exception.
#### How was this patch tested?
Added the related test cases. They also verify if HiveExternalCatalog also detects these errors.
Author: gatorsmile <gatorsmile@gmail.com>
Closes#12960 from gatorsmile/renameInMemoryCatalog.
#### What changes were proposed in this pull request?
When Describe a UDTF, the command returns a wrong result. The command is unable to find the function, which has been created and cataloged in the catalog but not in the functionRegistry.
This PR is to correct it. If the function is not in the functionRegistry, we will check the catalog for collecting the information of the UDTF function.
#### How was this patch tested?
Added test cases to verify the results
Author: gatorsmile <gatorsmile@gmail.com>
Closes#12885 from gatorsmile/showFunction.
## What changes were proposed in this pull request?
The problem is: In `RowEncoder`, we use `Invoke` to get the field of an external row, which lose the nullability information. This PR creates a `GetExternalRowField` expression, so that we can preserve the nullability info.
TODO: simplify the null handling logic in `RowEncoder`, to remove so many if branches, in follow-up PR.
## How was this patch tested?
new tests in `RowEncoderSuite`
Note that, This PR takes over https://github.com/apache/spark/pull/11980, with a little simplification, so all credits should go to koertkuipers
Author: Wenchen Fan <wenchen@databricks.com>
Author: Koert Kuipers <koert@tresata.com>
Closes#12364 from cloud-fan/nullable.
## What changes were proposed in this pull request?
This PR improve the error message for `Generate` in 3 cases:
1. generator is nested in expressions, e.g. `SELECT explode(list) + 1 FROM tbl`
2. generator appears more than one time in SELECT, e.g. `SELECT explode(list), explode(list) FROM tbl`
3. generator appears in other operator which is not project, e.g. `SELECT * FROM tbl SORT BY explode(list)`
## How was this patch tested?
new tests in `AnalysisErrorSuite`
Author: Wenchen Fan <wenchen@databricks.com>
Closes#12810 from cloud-fan/bug.
## What changes were proposed in this pull request?
Just a bunch of small tweaks on DDL exception messages.
## How was this patch tested?
`DDLCommandSuite` et al.
Author: Andrew Or <andrew@databricks.com>
Closes#12853 from andrewor14/make-exceptions-consistent.
#### What changes were proposed in this pull request?
Compared with the current Spark parser, there are two extra syntax are supported in Hive for sampling
- In `On` clauses, `rand()` is used for indicating sampling on the entire row instead of an individual column. For example,
```SQL
SELECT * FROM source TABLESAMPLE(BUCKET 3 OUT OF 32 ON rand()) s;
```
- Users can specify the total length to be read. For example,
```SQL
SELECT * FROM source TABLESAMPLE(100M) s;
```
Below is the link for references:
https://cwiki.apache.org/confluence/display/Hive/LanguageManual+Sampling
This PR is to parse and capture these two extra syntax, and issue a better error message.
#### How was this patch tested?
Added test cases to verify the thrown exceptions
Author: gatorsmile <gatorsmile@gmail.com>
Closes#12838 from gatorsmile/bucketOnRand.
## What changes were proposed in this pull request?
Make serializer correctly inferred if the input type is `List[_]`, since `List[_]` is type of `Seq[_]`, before it was matched to different case (`case t if definedByConstructorParams(t)`).
## How was this patch tested?
New test case was added.
Author: bomeng <bmeng@us.ibm.com>
Closes#12849 from bomeng/SPARK-15062.
## What changes were proposed in this pull request?
This PR addresses a few minor issues in SQL parser:
- Removes some unused rules and keywords in the grammar.
- Removes code path for fallback SQL parsing (was needed for Hive native parsing).
- Use `UnresolvedGenerator` instead of hard-coding `Explode` & `JsonTuple`.
- Adds a more generic way of creating error messages for unsupported Hive features.
- Use `visitFunctionName` as much as possible.
- Interpret a `CatalogColumn`'s `DataType` directly instead of parsing it again.
## How was this patch tested?
Existing tests.
Author: Herman van Hovell <hvanhovell@questtec.nl>
Closes#12826 from hvanhovell/SPARK-15047.
## What changes were proposed in this pull request?
In this PR we add support for correlated scalar subqueries. An example of such a query is:
```SQL
select * from tbl1 a where a.value > (select max(value) from tbl2 b where b.key = a.key)
```
The implementation adds the `RewriteCorrelatedScalarSubquery` rule to the Optimizer. This rule plans these subqueries using `LEFT OUTER` joins. It currently supports rewrites for `Project`, `Aggregate` & `Filter` logical plans.
I could not find a well defined semantics for the use of scalar subqueries in an `Aggregate`. The current implementation currently evaluates the scalar subquery *before* aggregation. This means that you either have to make scalar subquery part of the grouping expression, or that you have to aggregate it further on. I am open to suggestions on this.
The implementation currently forces the uniqueness of a scalar subquery by enforcing that it is aggregated and that the resulting column is wrapped in an `AggregateExpression`.
## How was this patch tested?
Added tests to `SubquerySuite`.
Author: Herman van Hovell <hvanhovell@questtec.nl>
Closes#12822 from hvanhovell/SPARK-14785.
## What changes were proposed in this pull request?
In order to support nested predicate subquery, this PR introduce an internal join type ExistenceJoin, which will emit all the rows from left, plus an additional column, which presents there are any rows matched from right or not (it's not null-aware right now). This additional column could be used to replace the subquery in Filter.
In theory, all the predicate subquery could use this join type, but it's slower than LeftSemi and LeftAnti, so it's only used for nested subquery (subquery inside OR).
For example, the following SQL:
```sql
SELECT a FROM t WHERE EXISTS (select 0) OR EXISTS (select 1)
```
This PR also fix a bug in predicate subquery push down through join (they should not).
Nested null-aware subquery is still not supported. For example, `a > 3 OR b NOT IN (select bb from t)`
After this, we could run TPCDS query Q10, Q35, Q45
## How was this patch tested?
Added unit tests.
Author: Davies Liu <davies@databricks.com>
Closes#12820 from davies/or_exists.
## What changes were proposed in this pull request?
This PR adds `fromPrimitiveArray` and `toPrimitiveArray` in `UnsafeArrayData`, so that we can do the conversion much faster in VectorUDT/MatrixUDT.
## How was this patch tested?
existing tests and new test suite `UnsafeArraySuite`
Author: Wenchen Fan <wenchen@databricks.com>
Closes#12640 from cloud-fan/ml.
## What changes were proposed in this pull request?
CatalystSqlParser can parse data types. So, we do not need to have an individual DataTypeParser.
## How was this patch tested?
Existing tests
Author: Yin Huai <yhuai@databricks.com>
Closes#12796 from yhuai/removeDataTypeParser.