## What changes were proposed in this pull request?
Add comments for specifying the position of batch "Check Cartesian Products", as rxin suggested in https://github.com/apache/spark/pull/19362 .
## How was this patch tested?
Unit test
Author: Wang Gengliang <ltnwgl@gmail.com>
Closes#19379 from gengliangwang/SPARK-22141-followup.
## What changes were proposed in this pull request?
Spark's RangePartitioner hard codes the number of sampling points per partition to be 20. This is sometimes too low. This ticket makes it configurable, via spark.sql.execution.rangeExchange.sampleSizePerPartition, and raises the default in Spark SQL to be 100.
## How was this patch tested?
Added a pretty sophisticated test based on chi square test ...
Author: Reynold Xin <rxin@databricks.com>
Closes#19387 from rxin/SPARK-22160.
## What changes were proposed in this pull request?
spark.sql.execution.arrow.enable and spark.sql.codegen.aggregate.map.twolevel.enable -> enabled
## How was this patch tested?
N/A
Author: Reynold Xin <rxin@databricks.com>
Closes#19384 from rxin/SPARK-22159.
## What changes were proposed in this pull request?
When inferring constraints from children, Join's condition can be simplified as None.
For example,
```
val testRelation = LocalRelation('a.int)
val x = testRelation.as("x")
val y = testRelation.where($"a" === 2 && !($"a" === 2)).as("y")
x.join.where($"x.a" === $"y.a")
```
The plan will become
```
Join Inner
:- LocalRelation <empty>, [a#23]
+- LocalRelation <empty>, [a#224]
```
And the Cartesian products check will throw exception for above plan.
Propagate empty relation before checking Cartesian products, and the issue is resolved.
## How was this patch tested?
Unit test
Author: Wang Gengliang <ltnwgl@gmail.com>
Closes#19362 from gengliangwang/MoveCheckCartesianProducts.
## What changes were proposed in this pull request?
Address PR comments that appeared post-merge, to rename `addExtraCode` to `addInnerClass`,
and not count the size of the inner class to the size of the outer class.
## How was this patch tested?
YOLO.
Author: Juliusz Sompolski <julek@databricks.com>
Closes#19353 from juliuszsompolski/SPARK-22103followup.
## What changes were proposed in this pull request?
HashAggregateExec codegen uses two paths for fast hash table and a generic one.
It generates code paths for iterating over both, and both code paths generate the consume code of the parent operator, resulting in that code being expanded twice.
This leads to a long generated function that might be an issue for the compiler (see e.g. SPARK-21603).
I propose to remove the double expansion by generating the consume code in a helper function that can just be called from both iterating loops.
An issue with separating the `consume` code to a helper function was that a number of places relied and assumed on being in the scope of an outside `produce` loop and e.g. use `continue` to jump out.
I replaced such code flows with nested scopes. It is code that should be handled the same by compiler, while getting rid of depending on assumptions that are outside of the `consume`'s own scope.
## How was this patch tested?
Existing test coverage.
Author: Juliusz Sompolski <julek@databricks.com>
Closes#19324 from juliuszsompolski/aggrconsumecodegen.
## What changes were proposed in this pull request?
The `percentile_approx` function previously accepted numeric type input and output double type results.
But since all numeric types, date and timestamp types are represented as numerics internally, `percentile_approx` can support them easily.
After this PR, it supports date type, timestamp type and numeric types as input types. The result type is also changed to be the same as the input type, which is more reasonable for percentiles.
This change is also required when we generate equi-height histograms for these types.
## How was this patch tested?
Added a new test and modified some existing tests.
Author: Zhenhua Wang <wangzhenhua@huawei.com>
Closes#19321 from wzhfy/approx_percentile_support_types.
## What changes were proposed in this pull request?
Enable Scala 2.12 REPL. Fix most remaining issues with 2.12 compilation and warnings, including:
- Selecting Kafka 0.10.1+ for Scala 2.12 and patching over a minor API difference
- Fixing lots of "eta expansion of zero arg method deprecated" warnings
- Resolving the SparkContext.sequenceFile implicits compile problem
- Fixing an odd but valid jetty-server missing dependency in hive-thriftserver
## How was this patch tested?
Existing tests
Author: Sean Owen <sowen@cloudera.com>
Closes#19307 from srowen/Scala212.
## What changes were proposed in this pull request?
This PR proposes to enhance the documentation for `trim` functions in the function description session.
- Add more `usage`, `arguments` and `examples` for the trim function
- Adjust space in the `usage` session
After the changes, the trim function documentation will look like this:
- `trim`
```trim(str) - Removes the leading and trailing space characters from str.
trim(BOTH trimStr FROM str) - Remove the leading and trailing trimStr characters from str
trim(LEADING trimStr FROM str) - Remove the leading trimStr characters from str
trim(TRAILING trimStr FROM str) - Remove the trailing trimStr characters from str
Arguments:
str - a string expression
trimStr - the trim string characters to trim, the default value is a single space
BOTH, FROM - these are keywords to specify trimming string characters from both ends of the string
LEADING, FROM - these are keywords to specify trimming string characters from the left end of the string
TRAILING, FROM - these are keywords to specify trimming string characters from the right end of the string
Examples:
> SELECT trim(' SparkSQL ');
SparkSQL
> SELECT trim('SL', 'SSparkSQLS');
parkSQ
> SELECT trim(BOTH 'SL' FROM 'SSparkSQLS');
parkSQ
> SELECT trim(LEADING 'SL' FROM 'SSparkSQLS');
parkSQLS
> SELECT trim(TRAILING 'SL' FROM 'SSparkSQLS');
SSparkSQ
```
- `ltrim`
```ltrim
ltrim(str) - Removes the leading space characters from str.
ltrim(trimStr, str) - Removes the leading string contains the characters from the trim string
Arguments:
str - a string expression
trimStr - the trim string characters to trim, the default value is a single space
Examples:
> SELECT ltrim(' SparkSQL ');
SparkSQL
> SELECT ltrim('Sp', 'SSparkSQLS');
arkSQLS
```
- `rtrim`
```rtrim
rtrim(str) - Removes the trailing space characters from str.
rtrim(trimStr, str) - Removes the trailing string which contains the characters from the trim string from the str
Arguments:
str - a string expression
trimStr - the trim string characters to trim, the default value is a single space
Examples:
> SELECT rtrim(' SparkSQL ');
SparkSQL
> SELECT rtrim('LQSa', 'SSparkSQLS');
SSpark
```
This is the trim characters function jira: [trim function](https://issues.apache.org/jira/browse/SPARK-14878)
## How was this patch tested?
Manually tested
```
spark-sql> describe function extended trim;
17/09/22 17:03:04 INFO CodeGenerator: Code generated in 153.026533 ms
Function: trim
Class: org.apache.spark.sql.catalyst.expressions.StringTrim
Usage:
trim(str) - Removes the leading and trailing space characters from `str`.
trim(BOTH trimStr FROM str) - Remove the leading and trailing `trimStr` characters from `str`
trim(LEADING trimStr FROM str) - Remove the leading `trimStr` characters from `str`
trim(TRAILING trimStr FROM str) - Remove the trailing `trimStr` characters from `str`
Extended Usage:
Arguments:
* str - a string expression
* trimStr - the trim string characters to trim, the default value is a single space
* BOTH, FROM - these are keywords to specify trimming string characters from both ends of
the string
* LEADING, FROM - these are keywords to specify trimming string characters from the left
end of the string
* TRAILING, FROM - these are keywords to specify trimming string characters from the right
end of the string
Examples:
> SELECT trim(' SparkSQL ');
SparkSQL
> SELECT trim('SL', 'SSparkSQLS');
parkSQ
> SELECT trim(BOTH 'SL' FROM 'SSparkSQLS');
parkSQ
> SELECT trim(LEADING 'SL' FROM 'SSparkSQLS');
parkSQLS
> SELECT trim(TRAILING 'SL' FROM 'SSparkSQLS');
SSparkSQ
```
```
spark-sql> describe function extended ltrim;
Function: ltrim
Class: org.apache.spark.sql.catalyst.expressions.StringTrimLeft
Usage:
ltrim(str) - Removes the leading space characters from `str`.
ltrim(trimStr, str) - Removes the leading string contains the characters from the trim string
Extended Usage:
Arguments:
* str - a string expression
* trimStr - the trim string characters to trim, the default value is a single space
Examples:
> SELECT ltrim(' SparkSQL ');
SparkSQL
> SELECT ltrim('Sp', 'SSparkSQLS');
arkSQLS
```
```
spark-sql> describe function extended rtrim;
Function: rtrim
Class: org.apache.spark.sql.catalyst.expressions.StringTrimRight
Usage:
rtrim(str) - Removes the trailing space characters from `str`.
rtrim(trimStr, str) - Removes the trailing string which contains the characters from the trim string from the `str`
Extended Usage:
Arguments:
* str - a string expression
* trimStr - the trim string characters to trim, the default value is a single space
Examples:
> SELECT rtrim(' SparkSQL ');
SparkSQL
> SELECT rtrim('LQSa', 'SSparkSQLS');
SSpark
```
Author: Kevin Yu <qyu@us.ibm.com>
Closes#19329 from kevinyu98/spark-14878-5.
## What changes were proposed in this pull request?
Try to avoid allocating an array bigger than Integer.MAX_VALUE - 8, which is the actual max size on some JVMs, in several places
## How was this patch tested?
Existing tests
Author: Sean Owen <sowen@cloudera.com>
Closes#19266 from srowen/SPARK-22033.
## What changes were proposed in this pull request?
Right now the calculation of SortMergeJoinExec's outputOrdering relies on the fact that its children have already been sorted on the join keys, while this is often not true until EnsureRequirements has been applied. So we ended up not getting the correct outputOrdering during physical planning stage before Sort nodes are added to the children.
For example, J = {A join B on key1 = key2}
1. if A is NOT ordered on key1 ASC, J's outputOrdering should include "key1 ASC"
2. if A is ordered on key1 ASC, J's outputOrdering should include "key1 ASC"
3. if A is ordered on key1 ASC, with sameOrderExp=c1, J's outputOrdering should include "key1 ASC, sameOrderExp=c1"
So to fix this I changed the behavior of <code>getKeyOrdering(keys, childOutputOrdering)</code> to:
1. If the childOutputOrdering satisfies (is a superset of) the required child ordering => childOutputOrdering
2. Otherwise => required child ordering
In addition, I organized the logic for deciding the relationship between two orderings into SparkPlan, so that it can be reused by EnsureRequirements and SortMergeJoinExec, and potentially other classes.
## How was this patch tested?
Added new test cases.
Passed all integration tests.
Author: maryannxue <maryann.xue@gmail.com>
Closes#19281 from maryannxue/spark-21998.
## What changes were proposed in this pull request?
#### Architecture
This PR implements stream-stream inner join using a two-way symmetric hash join. At a high level, we want to do the following.
1. For each stream, we maintain the past rows as state in State Store.
- For each joining key, there can be multiple rows that have been received.
- So, we have to effectively maintain a key-to-list-of-values multimap as state for each stream.
2. In each batch, for each input row in each stream
- Look up the other streams state to see if there are matching rows, and output them if they satisfy the joining condition
- Add the input row to corresponding stream’s state.
- If the data has a timestamp/window column with watermark, then we will use that to calculate the threshold for keys that are required to buffered for future matches and drop the rest from the state.
Cleaning up old unnecessary state rows depends completely on whether watermark has been defined and what are join conditions. We definitely want to support state clean up two types of queries that are likely to be common.
- Queries to time range conditions - E.g. `SELECT * FROM leftTable, rightTable ON leftKey = rightKey AND leftTime > rightTime - INTERVAL 8 MINUTES AND leftTime < rightTime + INTERVAL 1 HOUR`
- Queries with windows as the matching key - E.g. `SELECT * FROM leftTable, rightTable ON leftKey = rightKey AND window(leftTime, "1 hour") = window(rightTime, "1 hour")` (pseudo-SQL)
#### Implementation
The stream-stream join is primarily implemented in three classes
- `StreamingSymmetricHashJoinExec` implements the above symmetric join algorithm.
- `SymmetricsHashJoinStateManagers` manages the streaming state for the join. This essentially is a fault-tolerant key-to-list-of-values multimap built on the StateStore APIs. `StreamingSymmetricHashJoinExec` instantiates two such managers, one for each join side.
- `StreamingSymmetricHashJoinExecHelper` is a helper class to extract threshold for the state based on the join conditions and the event watermark.
Refer to the scaladocs class for more implementation details.
Besides the implementation of stream-stream inner join SparkPlan. Some additional changes are
- Allowed inner join in append mode in UnsupportedOperationChecker
- Prevented stream-stream join on an empty batch dataframe to be collapsed by the optimizer
## How was this patch tested?
- New tests in StreamingJoinSuite
- Updated tests UnsupportedOperationSuite
Author: Tathagata Das <tathagata.das1565@gmail.com>
Closes#19271 from tdas/SPARK-22053.
## What changes were proposed in this pull request?
There is an incorrect `scalastyle:on` comment in `stringExpressions.scala` and causes the line size limit check ineffective in the file. There are many lines of code and comment which are more than 100 chars.
## How was this patch tested?
Code style change only.
Author: Liang-Chi Hsieh <viirya@gmail.com>
Closes#19305 from viirya/fix-wrong-style.
## What changes were proposed in this pull request?
In SQL conditional expressions, only CASE WHEN lacks for expression description. This patch fills the gap.
## How was this patch tested?
Only documentation change.
Author: Liang-Chi Hsieh <viirya@gmail.com>
Closes#19304 from viirya/casewhen-doc.
## What changes were proposed in this pull request?
This work is a part of [SPARK-17074](https://issues.apache.org/jira/browse/SPARK-17074) to compute equi-height histograms. Equi-height histogram is an array of bins. A bin consists of two endpoints which form an interval of values and the ndv in that interval.
This PR creates a new aggregate function, given an array of endpoints, counting distinct values (ndv) in intervals among those endpoints.
This PR also refactors `HyperLogLogPlusPlus` by extracting a helper class `HyperLogLogPlusPlusHelper`, where the underlying HLLPP algorithm locates.
## How was this patch tested?
Add new test cases.
Author: Zhenhua Wang <wangzhenhua@huawei.com>
Closes#15544 from wzhfy/countIntervals.
## What changes were proposed in this pull request?
This a follow-up of https://github.com/apache/spark/pull/19289 , we missed another place: `rollup`. `Seq.init.toSeq` also returns a `Stream`, we should fix it too.
## How was this patch tested?
manually
Author: Wenchen Fan <wenchen@databricks.com>
Closes#19298 from cloud-fan/bug.
## What changes were proposed in this pull request?
Spark with Scala 2.10 fails with a group by cube:
```
spark.range(1).select($"id" as "a", $"id" as "b").write.partitionBy("a").mode("overwrite").saveAsTable("rollup_bug")
spark.sql("select 1 from rollup_bug group by rollup ()").show
```
It can be traced back to https://github.com/apache/spark/pull/15484 , which made `Expand.projections` a lazy `Stream` for group by cube.
In scala 2.10 `Stream` captures a lot of stuff, and in this case it captures the entire query plan which has some un-serializable parts.
This change is also good for master branch, to reduce the serialized size of `Expand.projections`.
## How was this patch tested?
manually verified with Spark with Scala 2.10.
Author: Wenchen Fan <wenchen@databricks.com>
Closes#19289 from cloud-fan/bug.
## What changes were proposed in this pull request?
Clarify behavior of to_utc_timestamp/from_utc_timestamp with an example
## How was this patch tested?
Doc only change / existing tests
Author: Sean Owen <sowen@cloudera.com>
Closes#19276 from srowen/SPARK-22049.
## What changes were proposed in this pull request?
Tables in the catalog cache are not invalidated once their statistics are updated. As a consequence, existing sessions will use the cached information even though it is not valid anymore. Consider and an example below.
```
// step 1
spark.range(100).write.saveAsTable("tab1")
// step 2
spark.sql("analyze table tab1 compute statistics")
// step 3
spark.sql("explain cost select distinct * from tab1").show(false)
// step 4
spark.range(100).write.mode("append").saveAsTable("tab1")
// step 5
spark.sql("explain cost select distinct * from tab1").show(false)
```
After step 3, the table will be present in the catalog relation cache. Step 4 will correctly update the metadata inside the catalog but will NOT invalidate the cache.
By the way, ``spark.sql("analyze table tab1 compute statistics")`` between step 3 and step 4 would also solve the problem.
## How was this patch tested?
Current and additional unit tests.
Author: aokolnychyi <anton.okolnychyi@sap.com>
Closes#19252 from aokolnychyi/spark-21969.
## What changes were proposed in this pull request?
* Removed the method `org.apache.spark.sql.catalyst.expressions.codegen.UnsafeRowWriter#alignToWords`.
It became unused as a result of 85b0a15754
(SPARK-15962) introducing word alignment for unsafe arrays.
* Cleaned up duplicate code in memory management and unsafe sorters
* The change extracting the exception paths is more than just cosmetics since it def. reduces the size the affected methods compile to
## How was this patch tested?
* Build still passes after removing the method, grepping the codebase for `alignToWords` shows no reference to it anywhere either.
* Dried up code is covered by existing tests.
Author: Armin <me@obrown.io>
Closes#19254 from original-brownbear/cleanup-mem-consumer.
#### What changes were proposed in this pull request?
This PR enhances the TRIM function support in Spark SQL by allowing the specification
of trim characters set. Below is the SQL syntax :
``` SQL
<trim function> ::= TRIM <left paren> <trim operands> <right paren>
<trim operands> ::= [ [ <trim specification> ] [ <trim character set> ] FROM ] <trim source>
<trim source> ::= <character value expression>
<trim specification> ::=
LEADING
| TRAILING
| BOTH
<trim character set> ::= <characters value expression>
```
or
``` SQL
LTRIM (source-exp [, trim-exp])
RTRIM (source-exp [, trim-exp])
```
Here are the documentation link of support of this feature by other mainstream databases.
- **Oracle:** [TRIM function](http://docs.oracle.com/cd/B28359_01/olap.111/b28126/dml_functions_2126.htm#OLADM704)
- **DB2:** [TRIM scalar function](https://www.ibm.com/support/knowledgecenter/en/SSMKHH_10.0.0/com.ibm.etools.mft.doc/ak05270_.htm)
- **MySQL:** [Trim function](http://dev.mysql.com/doc/refman/5.7/en/string-functions.html#function_trim)
- **Oracle:** [ltrim](https://docs.oracle.com/cd/B28359_01/olap.111/b28126/dml_functions_2018.htm#OLADM594)
- **DB2:** [ltrim](https://www.ibm.com/support/knowledgecenter/en/SSEPEK_11.0.0/sqlref/src/tpc/db2z_bif_ltrim.html)
This PR is to implement the above enhancement. In the implementation, the design principle is to keep the changes to the minimum. Also, the exiting trim functions (which handles a special case, i.e., trimming space characters) are kept unchanged for performane reasons.
#### How was this patch tested?
The unit test cases are added in the following files:
- UTF8StringSuite.java
- StringExpressionsSuite.scala
- sql/SQLQuerySuite.scala
- StringFunctionsSuite.scala
Author: Kevin Yu <qyu@us.ibm.com>
Closes#12646 from kevinyu98/spark-14878.
## What changes were proposed in this pull request?
If there are two projects like as follows.
```
Project [a_with_metadata#27 AS b#26]
+- Project [a#0 AS a_with_metadata#27]
+- LocalRelation <empty>, [a#0, b#1]
```
Child Project has an output column with a metadata in it, and the parent Project has an alias that implicitly forwards the metadata. So this metadata is visible for higher operators. Upon applying CollapseProject optimizer rule, the metadata is not preserved.
```
Project [a#0 AS b#26]
+- LocalRelation <empty>, [a#0, b#1]
```
This is incorrect, as downstream operators that expect certain metadata (e.g. watermark in structured streaming) to identify certain fields will fail to do so. This PR fixes it by preserving the metadata of top-level aliases.
## How was this patch tested?
New unit test
Author: Tathagata Das <tathagata.das1565@gmail.com>
Closes#19240 from tdas/SPARK-22018.
## What changes were proposed in this pull request?
In previous work SPARK-21513, we has allowed `MapType` and `ArrayType` of `MapType`s convert to a json string but only for Scala API. In this follow-up PR, we will make SparkSQL support it for PySpark and SparkR, too. We also fix some little bugs and comments of the previous work in this follow-up PR.
### For PySpark
```
>>> data = [(1, {"name": "Alice"})]
>>> df = spark.createDataFrame(data, ("key", "value"))
>>> df.select(to_json(df.value).alias("json")).collect()
[Row(json=u'{"name":"Alice")']
>>> data = [(1, [{"name": "Alice"}, {"name": "Bob"}])]
>>> df = spark.createDataFrame(data, ("key", "value"))
>>> df.select(to_json(df.value).alias("json")).collect()
[Row(json=u'[{"name":"Alice"},{"name":"Bob"}]')]
```
### For SparkR
```
# Converts a map into a JSON object
df2 <- sql("SELECT map('name', 'Bob')) as people")
df2 <- mutate(df2, people_json = to_json(df2$people))
# Converts an array of maps into a JSON array
df2 <- sql("SELECT array(map('name', 'Bob'), map('name', 'Alice')) as people")
df2 <- mutate(df2, people_json = to_json(df2$people))
```
## How was this patch tested?
Add unit test cases.
cc viirya HyukjinKwon
Author: goldmedal <liugs963@gmail.com>
Closes#19223 from goldmedal/SPARK-21513-fp-PySaprkAndSparkR.
## What changes were proposed in this pull request?
https://issues.apache.org/jira/browse/SPARK-21980
This PR fixes the issue in ResolveGroupingAnalytics rule, which indexes the column references in grouping functions without considering case sensitive configurations.
The problem can be reproduced by:
`val df = spark.createDataFrame(Seq((1, 1), (2, 1), (2, 2))).toDF("a", "b")
df.cube("a").agg(grouping("A")).show()`
## How was this patch tested?
unit tests
Author: donnyzone <wellfengzhu@gmail.com>
Closes#19202 from DonnyZone/ResolveGroupingAnalytics.
# What changes were proposed in this pull request?
UDF to_json only supports converting `StructType` or `ArrayType` of `StructType`s to a json output string now.
According to the discussion of JIRA SPARK-21513, I allow to `to_json` support converting `MapType` and `ArrayType` of `MapType`s to a json output string.
This PR is for SQL and Scala API only.
# How was this patch tested?
Adding unit test case.
cc viirya HyukjinKwon
Author: goldmedal <liugs963@gmail.com>
Author: Jia-Xuan Liu <liugs963@gmail.com>
Closes#18875 from goldmedal/SPARK-21513.
## What changes were proposed in this pull request?
Improve QueryPlanConstraints framework, make it robust and simple.
In https://github.com/apache/spark/pull/15319, constraints for expressions like `a = f(b, c)` is resolved.
However, for expressions like
```scala
a = f(b, c) && c = g(a, b)
```
The current QueryPlanConstraints framework will produce non-converging constraints.
Essentially, the problem is caused by having both the name and child of aliases in the same constraint set. We infer constraints, and push down constraints as predicates in filters, later on these predicates are propagated as constraints, etc..
Simply using the alias names only can resolve these problems. The size of constraints is reduced without losing any information. We can always get these inferred constraints on child of aliases when pushing down filters.
Also, the EqualNullSafe between name and child in propagating alias is meaningless
```scala
allConstraints += EqualNullSafe(e, a.toAttribute)
```
It just produces redundant constraints.
## How was this patch tested?
Unit test
Author: Wang Gengliang <ltnwgl@gmail.com>
Closes#19201 from gengliangwang/QueryPlanConstraints.
## What changes were proposed in this pull request?
Support DESC (EXTENDED | FORMATTED) ? TABLE COLUMN command.
Support DESC EXTENDED | FORMATTED TABLE COLUMN command to show column-level statistics.
Do NOT support describe nested columns.
## How was this patch tested?
Added test cases.
Author: Zhenhua Wang <wzh_zju@163.com>
Author: Zhenhua Wang <wangzhenhua@huawei.com>
Author: wangzhenhua <wangzhenhua@huawei.com>
Closes#16422 from wzhfy/descColumn.
## What changes were proposed in this pull request?
This PR implements the sql feature:
INSERT OVERWRITE [LOCAL] DIRECTORY directory1
[ROW FORMAT row_format] [STORED AS file_format]
SELECT ... FROM ...
## How was this patch tested?
Added new unittests and also pulled the code to fb-spark so that we could test writing to hdfs directory.
Author: Jane Wang <janewang@fb.com>
Closes#18975 from janewangfb/port_local_directory.
## What changes were proposed in this pull request?
`JacksonUtils.verifySchema` verifies if a data type can be converted to JSON. For `MapType`, it now verifies the key type. However, in `JacksonGenerator`, when converting a map to JSON, we only care about its values and create a writer for the values. The keys in a map are treated as strings by calling `toString` on the keys.
Thus, we should change `JacksonUtils.verifySchema` to verify the value type of `MapType`.
## How was this patch tested?
Added tests.
Author: Liang-Chi Hsieh <viirya@gmail.com>
Closes#19167 from viirya/test-jacksonutils.
## What changes were proposed in this pull request?
The condition in `Optimizer.isPlanIntegral` is wrong. We should always return `true` if not in test mode.
## How was this patch tested?
Manually test.
Author: Liang-Chi Hsieh <viirya@gmail.com>
Closes#19161 from viirya/SPARK-21726-followup.
## What changes were proposed in this pull request?
We have many optimization rules now in `Optimzer`. Right now we don't have any checks in the optimizer to check for the structural integrity of the plan (e.g. resolved). When debugging, it is difficult to identify which rules return invalid plans.
It would be great if in test mode, we can check whether a plan is still resolved after the execution of each rule, so we can catch rules that return invalid plans.
## How was this patch tested?
Added tests.
Author: Liang-Chi Hsieh <viirya@gmail.com>
Closes#18956 from viirya/SPARK-21726.
## What changes were proposed in this pull request?
Since [SPARK-15639](https://github.com/apache/spark/pull/13701), `spark.sql.parquet.cacheMetadata` and `PARQUET_CACHE_METADATA` is not used. This PR removes from SQLConf and docs.
## How was this patch tested?
Pass the existing Jenkins.
Author: Dongjoon Hyun <dongjoon@apache.org>
Closes#19129 from dongjoon-hyun/SPARK-13656.
## What changes were proposed in this pull request?
This is a follow-up of #19050 to deal with `ExistenceJoin` case.
## How was this patch tested?
Added test.
Author: Liang-Chi Hsieh <viirya@gmail.com>
Closes#19151 from viirya/SPARK-21835-followup.
## What changes were proposed in this pull request?
Add an assert in logical plan optimization that the isStreaming bit stays the same, and fix empty relation rules where that wasn't happening.
## How was this patch tested?
new and existing unit tests
Author: Jose Torres <joseph.torres@databricks.com>
Author: Jose Torres <joseph-torres@databricks.com>
Closes#19056 from joseph-torres/SPARK-21765-followup.
## What changes were proposed in this pull request?
Correlated predicate subqueries are rewritten into `Join` by the rule `RewritePredicateSubquery` during optimization.
It is possibly that the two sides of the `Join` have conflicting attributes. The query plans produced by `RewritePredicateSubquery` become unresolved and break structural integrity.
We should check if there are conflicting attributes in the `Join` and de-duplicate them by adding a `Project`.
## How was this patch tested?
Added tests.
Author: Liang-Chi Hsieh <viirya@gmail.com>
Closes#19050 from viirya/SPARK-21835.
## What changes were proposed in this pull request?
For the given example below, the predicate added by `InferFiltersFromConstraints` is folded by `ConstantPropagation` later, this leads to unconverged optimize iteration:
```
Seq((1, 1)).toDF("col1", "col2").createOrReplaceTempView("t1")
Seq(1, 2).toDF("col").createOrReplaceTempView("t2")
sql("SELECT * FROM t1, t2 WHERE t1.col1 = 1 AND 1 = t1.col2 AND t1.col1 = t2.col AND t1.col2 = t2.col")
```
We can fix this by adjusting the indent of the optimize rules.
## How was this patch tested?
Add test case that would have failed in `SQLQuerySuite`.
Author: Xingbo Jiang <xingbo.jiang@databricks.com>
Closes#19099 from jiangxb1987/unconverge-optimization.
## What changes were proposed in this pull request?
We should make codegen fallback of expressions configurable. So far, it is always on. We might hide it when our codegen have compilation bugs. Thus, we should also disable the codegen fallback when running test cases.
## How was this patch tested?
Added test cases
Author: gatorsmile <gatorsmile@gmail.com>
Closes#19119 from gatorsmile/fallbackCodegen.
## What changes were proposed in this pull request?
SQL predicates don't have complete expression description. This patch goes to complement the description by adding arguments, examples.
This change also adds related test cases for the SQL predicate expressions.
## How was this patch tested?
Existing tests. And added predicate test.
Author: Liang-Chi Hsieh <viirya@gmail.com>
Closes#18869 from viirya/SPARK-21654.
## What changes were proposed in this pull request?
Add `TBLPROPERTIES` to the DDL statement `CREATE TABLE USING`.
After this change, the DDL becomes
```
CREATE [TEMPORARY] TABLE [IF NOT EXISTS] [db_name.]table_name
USING table_provider
[OPTIONS table_property_list]
[PARTITIONED BY (col_name, col_name, ...)]
[CLUSTERED BY (col_name, col_name, ...)
[SORTED BY (col_name [ASC|DESC], ...)]
INTO num_buckets BUCKETS
]
[LOCATION path]
[COMMENT table_comment]
[TBLPROPERTIES (property_name=property_value, ...)]
[[AS] select_statement];
```
## How was this patch tested?
Add a few tests
Author: gatorsmile <gatorsmile@gmail.com>
Closes#19100 from gatorsmile/addTablePropsToCreateTableUsing.
…build; fix some things that will be warnings or errors in 2.12; restore Scala 2.12 profile infrastructure
## What changes were proposed in this pull request?
This change adds back the infrastructure for a Scala 2.12 build, but does not enable it in the release or Python test scripts.
In order to make that meaningful, it also resolves compile errors that the code hits in 2.12 only, in a way that still works with 2.11.
It also updates dependencies to the earliest minor release of dependencies whose current version does not yet support Scala 2.12. This is in a sense covered by other JIRAs under the main umbrella, but implemented here. The versions below still work with 2.11, and are the _latest_ maintenance release in the _earliest_ viable minor release.
- Scalatest 2.x -> 3.0.3
- Chill 0.8.0 -> 0.8.4
- Clapper 1.0.x -> 1.1.2
- json4s 3.2.x -> 3.4.2
- Jackson 2.6.x -> 2.7.9 (required by json4s)
This change does _not_ fully enable a Scala 2.12 build:
- It will also require dropping support for Kafka before 0.10. Easy enough, just didn't do it yet here
- It will require recreating `SparkILoop` and `Main` for REPL 2.12, which is SPARK-14650. Possible to do here too.
What it does do is make changes that resolve much of the remaining gap without affecting the current 2.11 build.
## How was this patch tested?
Existing tests and build. Manually tested with `./dev/change-scala-version.sh 2.12` to verify it compiles, modulo the exceptions above.
Author: Sean Owen <sowen@cloudera.com>
Closes#18645 from srowen/SPARK-14280.
## What changes were proposed in this pull request?
Allows `BinaryComparison` operators to work on any data type that actually supports ordering as verified by `TypeUtils.checkForOrderingExpr` instead of relying on the incomplete list `TypeCollection.Ordered` (which is removed by this PR).
## How was this patch tested?
Updated unit tests to cover structs and arrays.
Author: Andrew Ray <ray.andrew@gmail.com>
Closes#18818 from aray/SPARK-21110.
## What changes were proposed in this pull request?
Also remove useless function `partitionByDeterministic` after the changes of https://github.com/apache/spark/pull/14687
## How was this patch tested?
N/A
Author: gatorsmile <gatorsmile@gmail.com>
Closes#19097 from gatorsmile/followupSPARK-17107.
## What changes were proposed in this pull request?
This PR aims to support `spark.sql.orc.compression.codec` like Parquet's `spark.sql.parquet.compression.codec`. Users can use SQLConf to control ORC compression, too.
## How was this patch tested?
Pass the Jenkins with new and updated test cases.
Author: Dongjoon Hyun <dongjoon@apache.org>
Closes#19055 from dongjoon-hyun/SPARK-21839.
## What changes were proposed in this pull request?
We should make codegen fallback of expressions configurable. So far, it is always on. We might hide it when our codegen have compilation bugs. Thus, we should also disable the codegen fallback when running test cases.
## How was this patch tested?
Added test cases
Author: gatorsmile <gatorsmile@gmail.com>
Closes#19062 from gatorsmile/fallbackCodegen.
## What changes were proposed in this pull request?
This is a follow-up for https://github.com/apache/spark/pull/18488, to simplify the code.
The major change is, we should map java enum to string type, instead of a struct type with a single string field.
## How was this patch tested?
existing tests
Author: Wenchen Fan <wenchen@databricks.com>
Closes#19066 from cloud-fan/fix.
## What changes were proposed in this pull request?
Add trait UserDefinedExpression to identify user-defined functions.
UDF can be expensive. In optimizer we may need to avoid executing UDF multiple times.
E.g.
```scala
table.select(UDF as 'a).select('a, ('a + 1) as 'b)
```
If UDF is expensive in this case, optimizer should not collapse the project to
```scala
table.select(UDF as 'a, (UDF+1) as 'b)
```
Currently UDF classes like PythonUDF, HiveGenericUDF are not defined in catalyst.
This PR is to add a new trait to make it easier to identify user-defined functions.
## How was this patch tested?
Unit test
Author: Wang Gengliang <ltnwgl@gmail.com>
Closes#19064 from gengliangwang/UDFType.
## What changes were proposed in this pull request?
This patch adds allowUnquotedControlChars option in JSON data source to allow JSON Strings to contain unquoted control characters (ASCII characters with value less than 32, including tab and line feed characters)
## How was this patch tested?
Add new test cases
Author: vinodkc <vinod.kc.in@gmail.com>
Closes#19008 from vinodkc/br_fix_SPARK-21756.
## What changes were proposed in this pull request?
Fixed NPE when creating encoder for enum.
When you try to create an encoder for Enum type (or bean with enum property) via Encoders.bean(...), it fails with NullPointerException at TypeToken:495.
I did a little research and it turns out, that in JavaTypeInference following code
```
def getJavaBeanReadableProperties(beanClass: Class[_]): Array[PropertyDescriptor] = {
val beanInfo = Introspector.getBeanInfo(beanClass)
beanInfo.getPropertyDescriptors.filterNot(_.getName == "class")
.filter(_.getReadMethod != null)
}
```
filters out properties named "class", because we wouldn't want to serialize that. But enum types have another property of type Class named "declaringClass", which we are trying to inspect recursively. Eventually we try to inspect ClassLoader class, which has property "defaultAssertionStatus" with no read method, which leads to NPE at TypeToken:495.
I added property name "declaringClass" to filtering to resolve this.
## How was this patch tested?
Unit test in JavaDatasetSuite which creates an encoder for enum
Author: mike <mike0sv@gmail.com>
Author: Mikhail Sveshnikov <mike0sv@gmail.com>
Closes#18488 from mike0sv/enum-support.
## What changes were proposed in this pull request?
This PR bumps the ANTLR version to 4.7, and fixes a number of small parser related issues uncovered by the bump.
The main reason for upgrading is that in some cases the current version of ANTLR (4.5) can exhibit exponential slowdowns if it needs to parse boolean predicates. For example the following query will take forever to parse:
```sql
SELECT *
FROM RANGE(1000)
WHERE
TRUE
AND NOT upper(DESCRIPTION) LIKE '%FOO%'
AND NOT upper(DESCRIPTION) LIKE '%FOO%'
AND NOT upper(DESCRIPTION) LIKE '%FOO%'
AND NOT upper(DESCRIPTION) LIKE '%FOO%'
AND NOT upper(DESCRIPTION) LIKE '%FOO%'
AND NOT upper(DESCRIPTION) LIKE '%FOO%'
AND NOT upper(DESCRIPTION) LIKE '%FOO%'
AND NOT upper(DESCRIPTION) LIKE '%FOO%'
AND NOT upper(DESCRIPTION) LIKE '%FOO%'
AND NOT upper(DESCRIPTION) LIKE '%FOO%'
AND NOT upper(DESCRIPTION) LIKE '%FOO%'
AND NOT upper(DESCRIPTION) LIKE '%FOO%'
AND NOT upper(DESCRIPTION) LIKE '%FOO%'
AND NOT upper(DESCRIPTION) LIKE '%FOO%'
AND NOT upper(DESCRIPTION) LIKE '%FOO%'
AND NOT upper(DESCRIPTION) LIKE '%FOO%'
AND NOT upper(DESCRIPTION) LIKE '%FOO%'
AND NOT upper(DESCRIPTION) LIKE '%FOO%'
```
This is caused by a know bug in ANTLR (https://github.com/antlr/antlr4/issues/994), which was fixed in version 4.6.
## How was this patch tested?
Existing tests.
Author: Herman van Hovell <hvanhovell@databricks.com>
Closes#19042 from hvanhovell/SPARK-21830.
## What changes were proposed in this pull request?
With the check for structural integrity proposed in SPARK-21726, it is found that the optimization rule `PullupCorrelatedPredicates` can produce unresolved plans.
For a correlated IN query looks like:
SELECT t1.a FROM t1
WHERE
t1.a IN (SELECT t2.c
FROM t2
WHERE t1.b < t2.d);
The query plan might look like:
Project [a#0]
+- Filter a#0 IN (list#4 [b#1])
: +- Project [c#2]
: +- Filter (outer(b#1) < d#3)
: +- LocalRelation <empty>, [c#2, d#3]
+- LocalRelation <empty>, [a#0, b#1]
After `PullupCorrelatedPredicates`, it produces query plan like:
'Project [a#0]
+- 'Filter a#0 IN (list#4 [(b#1 < d#3)])
: +- Project [c#2, d#3]
: +- LocalRelation <empty>, [c#2, d#3]
+- LocalRelation <empty>, [a#0, b#1]
Because the correlated predicate involves another attribute `d#3` in subquery, it has been pulled out and added into the `Project` on the top of the subquery.
When `list` in `In` contains just one `ListQuery`, `In.checkInputDataTypes` checks if the size of `value` expressions matches the output size of subquery. In the above example, there is only `value` expression and the subquery output has two attributes `c#2, d#3`, so it fails the check and `In.resolved` returns `false`.
We should not let `In.checkInputDataTypes` wrongly report unresolved plans to fail the structural integrity check.
## How was this patch tested?
Added test.
Author: Liang-Chi Hsieh <viirya@gmail.com>
Closes#18968 from viirya/SPARK-21759.
## What changes were proposed in this pull request?
This is a refactoring of `ColumnVector` hierarchy and related classes.
1. make `ColumnVector` read-only
2. introduce `WritableColumnVector` with write interface
3. remove `ReadOnlyColumnVector`
## How was this patch tested?
Existing tests.
Author: Takuya UESHIN <ueshin@databricks.com>
Closes#18958 from ueshin/issues/SPARK-21745.
## What changes were proposed in this pull request?
When json_tuple in extracting values from JSON it returns null values within repeated columns except the first one as below:
``` scala
scala> spark.sql("""SELECT json_tuple('{"a":1, "b":2}', 'a', 'b', 'a')""").show()
+---+---+----+
| c0| c1| c2|
+---+---+----+
| 1| 2|null|
+---+---+----+
```
I think this should be consistent with Hive's implementation:
```
hive> SELECT json_tuple('{"a": 1, "b": 2}', 'a', 'a');
...
1 1
```
In this PR, we located all the matched indices in `fieldNames` instead of returning the first matched index, i.e., indexOf.
## How was this patch tested?
Added test in JsonExpressionsSuite.
Author: Jen-Ming Chung <jenmingisme@gmail.com>
Closes#19017 from jmchung/SPARK-21804.
## What changes were proposed in this pull request?
The getAliasedConstraints fuction in LogicalPlan.scala will clone the expression set when an element added,
and it will take a long time. This PR add a function to add multiple elements at once to reduce the clone time.
Before modified, the cost of getAliasedConstraints is:
100 expressions: 41 seconds
150 expressions: 466 seconds
After modified, the cost of getAliasedConstraints is:
100 expressions: 1.8 seconds
150 expressions: 6.5 seconds
The test is like this:
test("getAliasedConstraints") {
val expressionNum = 150
val aggExpression = (1 to expressionNum).map(i => Alias(Count(Literal(1)), s"cnt$i")())
val aggPlan = Aggregate(Nil, aggExpression, LocalRelation())
val beginTime = System.currentTimeMillis()
val expressions = aggPlan.validConstraints
println(s"validConstraints cost: ${System.currentTimeMillis() - beginTime}ms")
// The size of Aliased expression is n * (n - 1) / 2 + n
assert( expressions.size === expressionNum * (expressionNum - 1) / 2 + expressionNum)
}
(Please fill in changes proposed in this fix)
## How was this patch tested?
(Please explain how this patch was tested. E.g. unit tests, integration tests, manual tests)
(If this patch involves UI changes, please attach a screenshot; otherwise, remove this)
Run new added test.
Please review http://spark.apache.org/contributing.html before opening a pull request.
Author: 10129659 <chen.yanshan@zte.com.cn>
Closes#19022 from eatoncys/getAliasedConstraints.
## What changes were proposed in this pull request?
This pr changed the default value of `maxLinesPerFunction` into `4000`. In #18810, we had this new option to disable code generation for too long functions and I found this option only affected `Q17` and `Q66` in TPC-DS. But, `Q66` had some performance regression:
```
Q17 w/o #18810, 3224ms --> q17 w/#18810, 2627ms (improvement)
Q66 w/o #18810, 1712ms --> q66 w/#18810, 3032ms (regression)
```
To keep the previous performance in TPC-DS, we better set higher value at `maxLinesPerFunction` by default.
## How was this patch tested?
Existing tests.
Author: Takeshi Yamamuro <yamamuro@apache.org>
Closes#19021 from maropu/SPARK-21603-FOLLOWUP-1.
## What changes were proposed in this pull request?
All streaming logical plans will now have isStreaming set. This involved adding isStreaming as a case class arg in a few cases, since a node might be logically streaming depending on where it came from.
## How was this patch tested?
Existing unit tests - no functional change is intended in this PR.
Author: Jose Torres <joseph-torres@databricks.com>
Author: Tathagata Das <tathagata.das1565@gmail.com>
Closes#18973 from joseph-torres/SPARK-21765.
## What changes were proposed in this pull request?
This PR is to enable users to create persistent Scala UDAF (that extends UserDefinedAggregateFunction).
```SQL
CREATE FUNCTION myDoubleAvg AS 'test.org.apache.spark.sql.MyDoubleAvg'
```
Before this PR, Spark UDAF only can be registered through the API `spark.udf.register(...)`
## How was this patch tested?
Added test cases
Author: gatorsmile <gatorsmile@gmail.com>
Closes#18700 from gatorsmile/javaUDFinScala.
## What changes were proposed in this pull request?
This is a follow-up of https://github.com/apache/spark/pull/18955 , to fix a bug that we break whole stage codegen for `Limit`.
## How was this patch tested?
existing tests.
Author: Wenchen Fan <wenchen@databricks.com>
Closes#18993 from cloud-fan/bug.
## What changes were proposed in this pull request?
Added support for ANALYZE TABLE [db_name].tablename PARTITION (partcol1[=val1], partcol2[=val2], ...) COMPUTE STATISTICS [NOSCAN] SQL command to calculate total number of rows and size in bytes for a subset of partitions. Calculated statistics are stored in Hive Metastore as user-defined properties attached to partition objects. Property names are the same as the ones used to store table-level statistics: spark.sql.statistics.totalSize and spark.sql.statistics.numRows.
When partition specification contains all partition columns with values, the command collects statistics for a single partition that matches the specification. When some partition columns are missing or listed without their values, the command collects statistics for all partitions which match a subset of partition column values specified.
For example, table t has 4 partitions with the following specs:
* Partition1: (ds='2008-04-08', hr=11)
* Partition2: (ds='2008-04-08', hr=12)
* Partition3: (ds='2008-04-09', hr=11)
* Partition4: (ds='2008-04-09', hr=12)
'ANALYZE TABLE t PARTITION (ds='2008-04-09', hr=11)' command will collect statistics only for partition 3.
'ANALYZE TABLE t PARTITION (ds='2008-04-09')' command will collect statistics for partitions 3 and 4.
'ANALYZE TABLE t PARTITION (ds, hr)' command will collect statistics for all four partitions.
When the optional parameter NOSCAN is specified, the command doesn't count number of rows and only gathers size in bytes.
The statistics gathered by ANALYZE TABLE command can be fetched using DESC EXTENDED [db_name.]tablename PARTITION command.
## How was this patch tested?
Added tests.
Author: Masha Basmanova <mbasmanova@fb.com>
Closes#18421 from mbasmanova/mbasmanova-analyze-partition.
## What changes were proposed in this pull request?
``` scala
scala> Seq(("""{"Hyukjin": 224, "John": 1225}""")).toDS.selectExpr("json_tuple(value, trim(null))").show()
...
java.lang.NullPointerException
at ...
```
Currently the `null` field name will throw NullPointException. As a given field name null can't be matched with any field names in json, we just output null as its column value. This PR achieves it by returning a very unlikely column name `__NullFieldName` in evaluation of the field names.
## How was this patch tested?
Added unit test.
Author: Jen-Ming Chung <jenmingisme@gmail.com>
Closes#18930 from jmchung/SPARK-21677.
## What changes were proposed in this pull request?
This pr sorted output attributes on their name and exprId in `AttributeSet.toSeq` to make the order consistent. If the order is different, spark possibly generates different code and then misses cache in `CodeGenerator`, e.g., `GenerateColumnAccessor` generates code depending on an input attribute order.
## How was this patch tested?
Added tests in `AttributeSetSuite` and manually checked if the cache worked well in the given query of the JIRA.
Author: Takeshi Yamamuro <yamamuro@apache.org>
Closes#18959 from maropu/SPARK-18394.
## What changes were proposed in this pull request?
Close the whole stage codegen when the function lines is longer than the maxlines which will be setted by
spark.sql.codegen.MaxFunctionLength parameter, because when the function is too long , it will not get the JIT optimizing.
A benchmark test result is 10x slower when the generated function is too long :
ignore("max function length of wholestagecodegen") {
val N = 20 << 15
val benchmark = new Benchmark("max function length of wholestagecodegen", N)
def f(): Unit = sparkSession.range(N)
.selectExpr(
"id",
"(id & 1023) as k1",
"cast(id & 1023 as double) as k2",
"cast(id & 1023 as int) as k3",
"case when id > 100 and id <= 200 then 1 else 0 end as v1",
"case when id > 200 and id <= 300 then 1 else 0 end as v2",
"case when id > 300 and id <= 400 then 1 else 0 end as v3",
"case when id > 400 and id <= 500 then 1 else 0 end as v4",
"case when id > 500 and id <= 600 then 1 else 0 end as v5",
"case when id > 600 and id <= 700 then 1 else 0 end as v6",
"case when id > 700 and id <= 800 then 1 else 0 end as v7",
"case when id > 800 and id <= 900 then 1 else 0 end as v8",
"case when id > 900 and id <= 1000 then 1 else 0 end as v9",
"case when id > 1000 and id <= 1100 then 1 else 0 end as v10",
"case when id > 1100 and id <= 1200 then 1 else 0 end as v11",
"case when id > 1200 and id <= 1300 then 1 else 0 end as v12",
"case when id > 1300 and id <= 1400 then 1 else 0 end as v13",
"case when id > 1400 and id <= 1500 then 1 else 0 end as v14",
"case when id > 1500 and id <= 1600 then 1 else 0 end as v15",
"case when id > 1600 and id <= 1700 then 1 else 0 end as v16",
"case when id > 1700 and id <= 1800 then 1 else 0 end as v17",
"case when id > 1800 and id <= 1900 then 1 else 0 end as v18")
.groupBy("k1", "k2", "k3")
.sum()
.collect()
benchmark.addCase(s"codegen = F") { iter =>
sparkSession.conf.set("spark.sql.codegen.wholeStage", "false")
f()
}
benchmark.addCase(s"codegen = T") { iter =>
sparkSession.conf.set("spark.sql.codegen.wholeStage", "true")
sparkSession.conf.set("spark.sql.codegen.MaxFunctionLength", "10000")
f()
}
benchmark.run()
/*
Java HotSpot(TM) 64-Bit Server VM 1.8.0_111-b14 on Windows 7 6.1
Intel64 Family 6 Model 58 Stepping 9, GenuineIntel
max function length of wholestagecodegen: Best/Avg Time(ms) Rate(M/s) Per Row(ns) Relative
------------------------------------------------------------------------------------------------
codegen = F 443 / 507 1.5 676.0 1.0X
codegen = T 3279 / 3283 0.2 5002.6 0.1X
*/
}
## How was this patch tested?
Run the unit test
Author: 10129659 <chen.yanshan@zte.com.cn>
Closes#18810 from eatoncys/codegen.
## What changes were proposed in this pull request?
This patch adds the DataFrames API to the multivariate summarizer (mean, variance, etc.). In addition to all the features of MultivariateOnlineSummarizer, it also allows the user to select a subset of the metrics.
## How was this patch tested?
Testcases added.
## Performance
Resolve several performance issues in #17419, further optimization pending on SQL team's work. One of the SQL layer performance issue related to these feature has been resolved in #18712, thanks liancheng and cloud-fan
### Performance data
(test on my laptop, use 2 partitions. tries out = 20, warm up = 10)
The unit of test results is records/milliseconds (higher is better)
Vector size/records number | 1/10000000 | 10/1000000 | 100/1000000 | 1000/100000 | 10000/10000
----|------|----|---|----|----
Dataframe | 15149 | 7441 | 2118 | 224 | 21
RDD from Dataframe | 4992 | 4440 | 2328 | 320 | 33
raw RDD | 53931 | 20683 | 3966 | 528 | 53
Author: WeichenXu <WeichenXu123@outlook.com>
Closes#18798 from WeichenXu123/SPARK-19634-dataframe-summarizer.
This version fixes a few issues in the import order checker; it provides
better error messages, and detects more improper ordering (thus the need
to change a lot of files in this patch). The main fix is that it correctly
complains about the order of packages vs. classes.
As part of the above, I moved some "SparkSession" import in ML examples
inside the "$example on$" blocks; that didn't seem consistent across
different source files to start with, and avoids having to add more on/off blocks
around specific imports.
The new scalastyle also seems to have a better header detector, so a few
license headers had to be updated to match the expected indentation.
Author: Marcelo Vanzin <vanzin@cloudera.com>
Closes#18943 from vanzin/SPARK-21731.
## What changes were proposed in this pull request?
This is a follow-up of https://github.com/apache/spark/pull/15900 , to fix one more bug:
When table schema is empty and need to be inferred at runtime, we should not resolve parent plans before the schema has been inferred, or the parent plans will be resolved against an empty schema and may get wrong result for something like `select *`
The fix logic is: introduce `UnresolvedCatalogRelation` as a placeholder. Then we replace it with `LogicalRelation` or `HiveTableRelation` during analysis, so that it's guaranteed that we won't resolve parent plans until the schema has been inferred.
## How was this patch tested?
regression test
Author: Wenchen Fan <wenchen@databricks.com>
Closes#18907 from cloud-fan/bug.
## What changes were proposed in this pull request?
This PR adds `since` annotation in documentation so that this can be rendered as below:
<img width="290" alt="2017-08-14 6 54 26" src="https://user-images.githubusercontent.com/6477701/29267050-034c1f64-8122-11e7-862b-7dfc38e292bf.png">
## How was this patch tested?
Manually checked the documentation by `cd sql && ./create-docs.sh`.
Also, Jenkins tests are required.
Author: hyukjinkwon <gurwls223@gmail.com>
Closes#18939 from HyukjinKwon/add-sinces-date-functions.
## What changes were proposed in this pull request?
While discovering optimization rules and their test coverage, I did not find any tests for `CheckCartesianProducts` in the Catalyst folder. So, I decided to create a new test suite. Once I finished, I found a test in `JoinSuite` for this functionality so feel free to discard this change if it does not make much sense. The proposed test suite covers a few additional use cases.
Author: aokolnychyi <anton.okolnychyi@sap.com>
Closes#18909 from aokolnychyi/check-cartesian-join-tests.
## What changes were proposed in this pull request?
[SPARK-21595](https://issues.apache.org/jira/browse/SPARK-21595) reported that there is excessive spilling to disk due to default spill threshold for `ExternalAppendOnlyUnsafeRowArray` being quite small for WINDOW operator. Old behaviour of WINDOW operator (pre https://github.com/apache/spark/pull/16909) would hold data in an array for first 4096 records post which it would switch to `UnsafeExternalSorter` and start spilling to disk after reaching `spark.shuffle.spill.numElementsForceSpillThreshold` (or earlier if there was paucity of memory due to excessive consumers).
Currently the (switch from in-memory to `UnsafeExternalSorter`) and (`UnsafeExternalSorter` spilling to disk) for `ExternalAppendOnlyUnsafeRowArray` is controlled by a single threshold. This PR aims to separate that to have more granular control.
## How was this patch tested?
Added unit tests
Author: Tejas Patil <tejasp@fb.com>
Closes#18843 from tejasapatil/SPARK-21595.
## What changes were proposed in this pull request?
This patch removes the unused SessionCatalog.getTableMetadataOption and ExternalCatalog. getTableOption.
## How was this patch tested?
Removed the test case.
Author: Reynold Xin <rxin@databricks.com>
Closes#18912 from rxin/remove-getTableOption.
## What changes were proposed in this pull request?
Push filter predicates through EventTimeWatermark if they're deterministic and do not reference the watermarked attribute. (This is similar but not identical to the logic for pushing through UnaryNode.)
## How was this patch tested?
unit tests
Author: Jose Torres <joseph-torres@databricks.com>
Closes#18790 from joseph-torres/SPARK-21587.
## What changes were proposed in this pull request?
This PR is to add the spark version info in the table metadata. When creating the table, this value is assigned. It can help users find which version of Spark was used to create the table.
## How was this patch tested?
N/A
Author: gatorsmile <gatorsmile@gmail.com>
Closes#18709 from gatorsmile/addVersion.
## What changes were proposed in this pull request?
Window rangeBetween() API should allow literal boundary, that means, the window range frame can calculate frame of double/date/timestamp.
Example of the use case can be:
```
SELECT
val_timestamp,
cate,
avg(val_timestamp) OVER(PARTITION BY cate ORDER BY val_timestamp RANGE BETWEEN CURRENT ROW AND interval 23 days 4 hours FOLLOWING)
FROM testData
```
This PR refactors the Window `rangeBetween` and `rowsBetween` API, while the legacy user code should still be valid.
## How was this patch tested?
Add new test cases both in `DataFrameWindowFunctionsSuite` and in `window.sql`.
Author: Xingbo Jiang <xingbo.jiang@databricks.com>
Closes#18814 from jiangxb1987/literal-boundary.
## What changes were proposed in this pull request?
If we create a type alias for a type workable with Dataset, the type alias doesn't work with Dataset.
A reproducible case looks like:
object C {
type TwoInt = (Int, Int)
def tupleTypeAlias: TwoInt = (1, 1)
}
Seq(1).toDS().map(_ => ("", C.tupleTypeAlias))
It throws an exception like:
type T1 is not a class
scala.ScalaReflectionException: type T1 is not a class
at scala.reflect.api.Symbols$SymbolApi$class.asClass(Symbols.scala:275)
...
This patch accesses the dealias of type in many places in `ScalaReflection` to fix it.
## How was this patch tested?
Added test case.
Author: Liang-Chi Hsieh <viirya@gmail.com>
Closes#18813 from viirya/SPARK-21567.
I have discovered that "full_outer" name option is working in Spark 2.0, but it is not printed in exception. Please verify.
## What changes were proposed in this pull request?
(Please fill in changes proposed in this fix)
## How was this patch tested?
(Please explain how this patch was tested. E.g. unit tests, integration tests, manual tests)
(If this patch involves UI changes, please attach a screenshot; otherwise, remove this)
Please review http://spark.apache.org/contributing.html before opening a pull request.
Author: BartekH <bartekhamielec@gmail.com>
Closes#17985 from BartekH/patch-1.
## What changes were proposed in this pull request?
This pr (follow-up of #18772) used `UnresolvedSubqueryColumnAliases` for `visitTableName` in `AstBuilder`, which is a new unresolved `LogicalPlan` implemented in #18185.
## How was this patch tested?
Existing tests
Author: Takeshi Yamamuro <yamamuro@apache.org>
Closes#18857 from maropu/SPARK-20963-FOLLOWUP.
## What changes were proposed in this pull request?
In SQLContext.get(key,null) for a key that is not defined in the conf, and doesn't have a default value defined, throws a NPE. Int happens only when conf has a value converter
Added null check on defaultValue inside SQLConf.getConfString to avoid calling entry.valueConverter(defaultValue)
## How was this patch tested?
Added unit test
Author: vinodkc <vinod.kc.in@gmail.com>
Closes#18852 from vinodkc/br_Fix_SPARK-21588.
## What changes were proposed in this pull request?
This pr added parsing rules to support column aliases for join relations in FROM clause.
This pr is a sub-task of #18079.
## How was this patch tested?
Added tests in `AnalysisSuite`, `PlanParserSuite,` and `SQLQueryTestSuite`.
Author: Takeshi Yamamuro <yamamuro@apache.org>
Closes#18772 from maropu/SPARK-20963-2.
## What changes were proposed in this pull request?
This PR proposes to separate `extended` into `examples` and `arguments` internally so that both can be separately documented and add `since` and `note` for additional information.
For `since`, it looks users sometimes get confused by, up to my knowledge, missing version information. For example, see https://www.mail-archive.com/userspark.apache.org/msg64798.html
For few good examples to check the built documentation, please see both:
`from_json` - https://spark-test.github.io/sparksqldoc/#from_json
`like` - https://spark-test.github.io/sparksqldoc/#like
For `DESCRIBE FUNCTION`, `note` and `since` are added as below:
```
> DESCRIBE FUNCTION EXTENDED rlike;
...
Extended Usage:
Arguments:
...
Examples:
...
Note:
Use LIKE to match with simple string pattern
```
```
> DESCRIBE FUNCTION EXTENDED to_json;
...
Examples:
...
Since: 2.2.0
```
For the complete documentation, see https://spark-test.github.io/sparksqldoc/
## How was this patch tested?
Manual tests and existing tests. Please see https://spark-test.github.io/sparksqldoc
Jenkins tests are needed to double check
Author: hyukjinkwon <gurwls223@gmail.com>
Closes#18749 from HyukjinKwon/followup-sql-doc-gen.
## What changes were proposed in this pull request?
create temporary view data as select * from values
(1, 1),
(1, 2),
(2, 1),
(2, 2),
(3, 1),
(3, 2)
as data(a, b);
`select 3, 4, sum(b) from data group by 1, 2;`
`select 3 as c, 4 as d, sum(b) from data group by c, d;`
When running these two cases, the following exception occurred:
`Error in query: GROUP BY position 4 is not in select list (valid range is [1, 3]); line 1 pos 10`
The cause of this failure:
If an aggregateExpression is integer, after replaced with this aggregateExpression, the
groupExpression still considered as an ordinal.
The solution:
This bug is due to re-entrance of an analyzed plan. We can solve it by using `resolveOperators` in `SubstituteUnresolvedOrdinals`.
## How was this patch tested?
Added unit test case
Author: liuxian <liu.xian3@zte.com.cn>
Closes#18779 from 10110346/groupby.
## What changes were proposed in this pull request?
OneRowRelation is the only plan that is a case object, which causes some issues with makeCopy using a 0-arg constructor. This patch changes it from a case object to a case class.
This blocks SPARK-21619.
## How was this patch tested?
Should be covered by existing test cases.
Author: Reynold Xin <rxin@databricks.com>
Closes#18839 from rxin/SPARK-21634.
## What changes were proposed in this pull request?
Hive `pmod(3.13, 0)`:
```:sql
hive> select pmod(3.13, 0);
OK
NULL
Time taken: 2.514 seconds, Fetched: 1 row(s)
hive>
```
Spark `mod(3.13, 0)`:
```:sql
spark-sql> select mod(3.13, 0);
NULL
spark-sql>
```
But the Spark `pmod(3.13, 0)`:
```:sql
spark-sql> select pmod(3.13, 0);
17/06/25 09:35:58 ERROR SparkSQLDriver: Failed in [select pmod(3.13, 0)]
java.lang.NullPointerException
at org.apache.spark.sql.catalyst.expressions.Pmod.pmod(arithmetic.scala:504)
at org.apache.spark.sql.catalyst.expressions.Pmod.nullSafeEval(arithmetic.scala:432)
at org.apache.spark.sql.catalyst.expressions.BinaryExpression.eval(Expression.scala:419)
at org.apache.spark.sql.catalyst.expressions.UnaryExpression.eval(Expression.scala:323)
...
```
This PR make `pmod(number, 0)` to null.
## How was this patch tested?
unit tests
Author: Yuming Wang <wgyumg@gmail.com>
Closes#18413 from wangyum/SPARK-21205.
## What changes were proposed in this pull request?
Currently, StructType.merge() only reports data types of conflicting fields when merging two incompatible schemas. It would be nice to also report the field names for easier debugging.
## How was this patch tested?
Unit test in DataTypeSuite.
Print exception message when conflict is triggered.
Author: bravo-zhang <mzhang1230@gmail.com>
Closes#16365 from bravo-zhang/spark-18950.
## What changes were proposed in this pull request?
This pr added parsing rules to support subquery column aliases in FROM clause.
This pr is a sub-task of #18079.
## How was this patch tested?
Added tests in `PlanParserSuite` and `SQLQueryTestSuite`.
Author: Takeshi Yamamuro <yamamuro@apache.org>
Closes#18185 from maropu/SPARK-20962.
## What changes were proposed in this pull request?
Long values can be passed to `rangeBetween` as range frame boundaries, but we silently convert it to Int values, this can cause wrong results and we should fix this.
Further more, we should accept any legal literal values as range frame boundaries. In this PR, we make it possible for Long values, and make accepting other DataTypes really easy to add.
This PR is mostly based on Herman's previous amazing work: 596f53c339
After this been merged, we can close#16818 .
## How was this patch tested?
Add new tests in `DataFrameWindowFunctionsSuite` and `TypeCoercionSuite`.
Author: Xingbo Jiang <xingbo.jiang@databricks.com>
Closes#18540 from jiangxb1987/rangeFrame.
## What changes were proposed in this pull request?
When there are aliases (these aliases were added for nested fields) as parameters in `RuntimeReplaceable`, as they are not in the children expression, those aliases can't be cleaned up in analyzer rule `CleanupAliases`.
An expression `nvl(foo.foo1, "value")` can be resolved to two semantically different expressions in a group by query because they contain different aliases.
Because those aliases are not children of `RuntimeReplaceable` which is an `UnaryExpression`. So we can't trim the aliases out by simple transforming the expressions in `CleanupAliases`.
If we want to replace the non-children aliases in `RuntimeReplaceable`, we need to add more codes to `RuntimeReplaceable` and modify all expressions of `RuntimeReplaceable`. It makes the interface ugly IMO.
Consider those aliases will be replaced later at optimization and so they're no harm, this patch chooses to simply override `canonicalized` of `RuntimeReplaceable`.
One concern is about `CleanupAliases`. Because it actually cannot clean up ALL aliases inside a plan. To make caller of this rule notice that, this patch adds a comment to `CleanupAliases`.
## How was this patch tested?
Added test.
Author: Liang-Chi Hsieh <viirya@gmail.com>
Closes#18761 from viirya/SPARK-21555.
## What changes were proposed in this pull request?
`UnsafeExternalSorter.recordComparator` can be either `KVComparator` or `RowComparator`, and both of them will keep the reference to the input rows they compared last time.
After sorting, we return the sorted iterator to upstream operators. However, the upstream operators may take a while to consume up the sorted iterator, and `UnsafeExternalSorter` is registered to `TaskContext` at [here](https://github.com/apache/spark/blob/v2.2.0/core/src/main/java/org/apache/spark/util/collection/unsafe/sort/UnsafeExternalSorter.java#L159-L161), which means we will keep the `UnsafeExternalSorter` instance and keep the last compared input rows in memory until the sorted iterator is consumed up.
Things get worse if we sort within partitions of a dataset and coalesce all partitions into one, as we will keep a lot of input rows in memory and the time to consume up all the sorted iterators is long.
This PR takes over https://github.com/apache/spark/pull/18543 , the idea is that, we do not keep the record comparator instance in `UnsafeExternalSorter`, but a generator of record comparator.
close#18543
## How was this patch tested?
N/A
Author: Wenchen Fan <wenchen@databricks.com>
Closes#18679 from cloud-fan/memory-leak.
## What changes were proposed in this pull request?
This PR ensures that `Unsafe.sizeInBytes` must be a multiple of 8. It it is not satisfied. `Unsafe.hashCode` causes the assertion violation.
## How was this patch tested?
Will add test cases
Author: Kazuaki Ishizaki <ishizaki@jp.ibm.com>
Closes#18503 from kiszk/SPARK-21271.
### What changes were proposed in this pull request?
Like [Hive UDFType](https://hive.apache.org/javadocs/r2.0.1/api/org/apache/hadoop/hive/ql/udf/UDFType.html), we should allow users to add the extra flags for ScalaUDF and JavaUDF too. _stateful_/_impliesOrder_ are not applicable to our Scala UDF. Thus, we only add the following two flags.
- deterministic: Certain optimizations should not be applied if UDF is not deterministic. Deterministic UDF returns same result each time it is invoked with a particular input. This determinism just needs to hold within the context of a query.
When the deterministic flag is not correctly set, the results could be wrong.
For ScalaUDF in Dataset APIs, users can call the following extra APIs for `UserDefinedFunction` to make the corresponding changes.
- `nonDeterministic`: Updates UserDefinedFunction to non-deterministic.
Also fixed the Java UDF name loss issue.
Will submit a separate PR for `distinctLike` for UDAF
### How was this patch tested?
Added test cases for both ScalaUDF
Author: gatorsmile <gatorsmile@gmail.com>
Author: Wenchen Fan <cloud0fan@gmail.com>
Closes#17848 from gatorsmile/udfRegister.
## What changes were proposed in this pull request?
When the code that is generated is greater than 64k, then Janino compile will fail and CodeGenerator.scala will log the entire code at Error level.
SPARK-20871 suggests only logging the code at Debug level.
Since, the code is already logged at debug level, this Pull Request proposes not including the formatted code in the Error logging and exception message at all.
When an exception occurs, the code will be logged at Info level but truncated if it is more than 1000 lines long.
## How was this patch tested?
Existing tests were run.
An extra test test case was added to CodeFormatterSuite to test the new maxLines parameter,
Author: pj.fanning <pj.fanning@workday.com>
Closes#18658 from pjfanning/SPARK-20871.
## What changes were proposed in this pull request?
DirectParquetOutputCommitter was removed from Spark as it was deemed unsafe to use. We however still have some code to generate warning. This patch removes those code as well.
This is kind of a follow-up of https://github.com/apache/spark/pull/16796
## How was this patch tested?
existing tests
Author: Wenchen Fan <wenchen@databricks.com>
Closes#18689 from cloud-fan/minor.
## What changes were proposed in this pull request?
When we list partitions from hive metastore with a partial partition spec, we are expecting exact matching according to the partition values. However, hive treats dot specially and match any single character for dot. We should do an extra filter to drop unexpected partitions.
## How was this patch tested?
new regression test.
Author: Wenchen Fan <wenchen@databricks.com>
Closes#18671 from cloud-fan/hive.
## What changes were proposed in this pull request?
Address scapegoat warnings for:
- BigDecimal double constructor
- Catching NPE
- Finalizer without super
- List.size is O(n)
- Prefer Seq.empty
- Prefer Set.empty
- reverse.map instead of reverseMap
- Type shadowing
- Unnecessary if condition.
- Use .log1p
- Var could be val
In some instances like Seq.empty, I avoided making the change even where valid in test code to keep the scope of the change smaller. Those issues are concerned with performance and it won't matter for tests.
## How was this patch tested?
Existing tests
Author: Sean Owen <sowen@cloudera.com>
Closes#18635 from srowen/Scapegoat1.
## What changes were proposed in this pull request?
This PR changes the direction of expression transformation in the DecimalPrecision rule. Previously, the expressions were transformed down, which led to incorrect result types when decimal expressions had other decimal expressions as their operands. The root cause of this issue was in visiting outer nodes before their children. Consider the example below:
```
val inputSchema = StructType(StructField("col", DecimalType(26, 6)) :: Nil)
val sc = spark.sparkContext
val rdd = sc.parallelize(1 to 2).map(_ => Row(BigDecimal(12)))
val df = spark.createDataFrame(rdd, inputSchema)
// Works correctly since no nested decimal expression is involved
// Expected result type: (26, 6) * (26, 6) = (38, 12)
df.select($"col" * $"col").explain(true)
df.select($"col" * $"col").printSchema()
// Gives a wrong result since there is a nested decimal expression that should be visited first
// Expected result type: ((26, 6) * (26, 6)) * (26, 6) = (38, 12) * (26, 6) = (38, 18)
df.select($"col" * $"col" * $"col").explain(true)
df.select($"col" * $"col" * $"col").printSchema()
```
The example above gives the following output:
```
// Correct result without sub-expressions
== Parsed Logical Plan ==
'Project [('col * 'col) AS (col * col)#4]
+- LogicalRDD [col#1]
== Analyzed Logical Plan ==
(col * col): decimal(38,12)
Project [CheckOverflow((promote_precision(cast(col#1 as decimal(26,6))) * promote_precision(cast(col#1 as decimal(26,6)))), DecimalType(38,12)) AS (col * col)#4]
+- LogicalRDD [col#1]
== Optimized Logical Plan ==
Project [CheckOverflow((col#1 * col#1), DecimalType(38,12)) AS (col * col)#4]
+- LogicalRDD [col#1]
== Physical Plan ==
*Project [CheckOverflow((col#1 * col#1), DecimalType(38,12)) AS (col * col)#4]
+- Scan ExistingRDD[col#1]
// Schema
root
|-- (col * col): decimal(38,12) (nullable = true)
// Incorrect result with sub-expressions
== Parsed Logical Plan ==
'Project [(('col * 'col) * 'col) AS ((col * col) * col)#11]
+- LogicalRDD [col#1]
== Analyzed Logical Plan ==
((col * col) * col): decimal(38,12)
Project [CheckOverflow((promote_precision(cast(CheckOverflow((promote_precision(cast(col#1 as decimal(26,6))) * promote_precision(cast(col#1 as decimal(26,6)))), DecimalType(38,12)) as decimal(26,6))) * promote_precision(cast(col#1 as decimal(26,6)))), DecimalType(38,12)) AS ((col * col) * col)#11]
+- LogicalRDD [col#1]
== Optimized Logical Plan ==
Project [CheckOverflow((cast(CheckOverflow((col#1 * col#1), DecimalType(38,12)) as decimal(26,6)) * col#1), DecimalType(38,12)) AS ((col * col) * col)#11]
+- LogicalRDD [col#1]
== Physical Plan ==
*Project [CheckOverflow((cast(CheckOverflow((col#1 * col#1), DecimalType(38,12)) as decimal(26,6)) * col#1), DecimalType(38,12)) AS ((col * col) * col)#11]
+- Scan ExistingRDD[col#1]
// Schema
root
|-- ((col * col) * col): decimal(38,12) (nullable = true)
```
## How was this patch tested?
This PR was tested with available unit tests. Moreover, there are tests to cover previously failing scenarios.
Author: aokolnychyi <anton.okolnychyi@sap.com>
Closes#18583 from aokolnychyi/spark-21332.
## What changes were proposed in this pull request?
Follow up to a few comments on https://github.com/apache/spark/pull/17150#issuecomment-315020196 that couldn't be addressed before it was merged.
## How was this patch tested?
Existing tests.
Author: Sean Owen <sowen@cloudera.com>
Closes#18646 from srowen/SPARK-19810.2.
## What changes were proposed in this pull request?
This PR fixes a wrong comparison for `BinaryType`. This PR enables unsigned comparison and unsigned prefix generation for an array for `BinaryType`. Previous implementations uses signed operations.
## How was this patch tested?
Added a test suite in `OrderingSuite`.
Author: Kazuaki Ishizaki <ishizaki@jp.ibm.com>
Closes#18571 from kiszk/SPARK-21344.
## What changes were proposed in this pull request?
- Remove Scala 2.10 build profiles and support
- Replace some 2.10 support in scripts with commented placeholders for 2.12 later
- Remove deprecated API calls from 2.10 support
- Remove usages of deprecated context bounds where possible
- Remove Scala 2.10 workarounds like ScalaReflectionLock
- Other minor Scala warning fixes
## How was this patch tested?
Existing tests
Author: Sean Owen <sowen@cloudera.com>
Closes#17150 from srowen/SPARK-19810.
## What changes were proposed in this pull request?
Hive interprets regular expression, e.g., `(a)?+.+` in query specification. This PR enables spark to support this feature when hive.support.quoted.identifiers is set to true.
## How was this patch tested?
- Add unittests in SQLQuerySuite.scala
- Run spark-shell tested the original failed query:
scala> hc.sql("SELECT `(a|b)?+.+` from test1").collect.foreach(println)
Author: Jane Wang <janewang@fb.com>
Closes#18023 from janewangfb/support_select_regex.
## What changes were proposed in this pull request?
Integrate Apache Arrow with Spark to increase performance of `DataFrame.toPandas`. This has been done by using Arrow to convert data partitions on the executor JVM to Arrow payload byte arrays where they are then served to the Python process. The Python DataFrame can then collect the Arrow payloads where they are combined and converted to a Pandas DataFrame. Data types except complex, date, timestamp, and decimal are currently supported, otherwise an `UnsupportedOperation` exception is thrown.
Additions to Spark include a Scala package private method `Dataset.toArrowPayload` that will convert data partitions in the executor JVM to `ArrowPayload`s as byte arrays so they can be easily served. A package private class/object `ArrowConverters` that provide data type mappings and conversion routines. In Python, a private method `DataFrame._collectAsArrow` is added to collect Arrow payloads and a SQLConf "spark.sql.execution.arrow.enable" can be used in `toPandas()` to enable using Arrow (uses the old conversion by default).
## How was this patch tested?
Added a new test suite `ArrowConvertersSuite` that will run tests on conversion of Datasets to Arrow payloads for supported types. The suite will generate a Dataset and matching Arrow JSON data, then the dataset is converted to an Arrow payload and finally validated against the JSON data. This will ensure that the schema and data has been converted correctly.
Added PySpark tests to verify the `toPandas` method is producing equal DataFrames with and without pyarrow. A roundtrip test to ensure the pandas DataFrame produced by pyspark is equal to a one made directly with pandas.
Author: Bryan Cutler <cutlerb@gmail.com>
Author: Li Jin <ice.xelloss@gmail.com>
Author: Li Jin <li.jin@twosigma.com>
Author: Wes McKinney <wes.mckinney@twosigma.com>
Closes#18459 from BryanCutler/toPandas_with_arrow-SPARK-13534.
## What changes were proposed in this pull request?
This pr made it more consistent to handle column name duplication. In the current master, error handling is different when hitting column name duplication:
```
// json
scala> val schema = StructType(StructField("a", IntegerType) :: StructField("a", IntegerType) :: Nil)
scala> Seq("""{"a":1, "a":1}"""""").toDF().coalesce(1).write.mode("overwrite").text("/tmp/data")
scala> spark.read.format("json").schema(schema).load("/tmp/data").show
org.apache.spark.sql.AnalysisException: Reference 'a' is ambiguous, could be: a#12, a#13.;
at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.resolve(LogicalPlan.scala:287)
at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.resolve(LogicalPlan.scala:181)
at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan$$anonfun$resolve$1.apply(LogicalPlan.scala:153)
scala> spark.read.format("json").load("/tmp/data").show
org.apache.spark.sql.AnalysisException: Duplicate column(s) : "a" found, cannot save to JSON format;
at org.apache.spark.sql.execution.datasources.json.JsonDataSource.checkConstraints(JsonDataSource.scala:81)
at org.apache.spark.sql.execution.datasources.json.JsonDataSource.inferSchema(JsonDataSource.scala:63)
at org.apache.spark.sql.execution.datasources.json.JsonFileFormat.inferSchema(JsonFileFormat.scala:57)
at org.apache.spark.sql.execution.datasources.DataSource$$anonfun$7.apply(DataSource.scala:176)
at org.apache.spark.sql.execution.datasources.DataSource$$anonfun$7.apply(DataSource.scala:176)
// csv
scala> val schema = StructType(StructField("a", IntegerType) :: StructField("a", IntegerType) :: Nil)
scala> Seq("a,a", "1,1").toDF().coalesce(1).write.mode("overwrite").text("/tmp/data")
scala> spark.read.format("csv").schema(schema).option("header", false).load("/tmp/data").show
org.apache.spark.sql.AnalysisException: Reference 'a' is ambiguous, could be: a#41, a#42.;
at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.resolve(LogicalPlan.scala:287)
at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.resolve(LogicalPlan.scala:181)
at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan$$anonfun$resolve$1.apply(LogicalPlan.scala:153)
at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan$$anonfun$resolve$1.apply(LogicalPlan.scala:152)
// If `inferSchema` is true, a CSV format is duplicate-safe (See SPARK-16896)
scala> spark.read.format("csv").option("header", true).load("/tmp/data").show
+---+---+
| a0| a1|
+---+---+
| 1| 1|
+---+---+
// parquet
scala> val schema = StructType(StructField("a", IntegerType) :: StructField("a", IntegerType) :: Nil)
scala> Seq((1, 1)).toDF("a", "b").coalesce(1).write.mode("overwrite").parquet("/tmp/data")
scala> spark.read.format("parquet").schema(schema).option("header", false).load("/tmp/data").show
org.apache.spark.sql.AnalysisException: Reference 'a' is ambiguous, could be: a#110, a#111.;
at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.resolve(LogicalPlan.scala:287)
at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.resolve(LogicalPlan.scala:181)
at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan$$anonfun$resolve$1.apply(LogicalPlan.scala:153)
at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan$$anonfun$resolve$1.apply(LogicalPlan.scala:152)
at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
```
When this patch applied, the results change to;
```
// json
scala> val schema = StructType(StructField("a", IntegerType) :: StructField("a", IntegerType) :: Nil)
scala> Seq("""{"a":1, "a":1}"""""").toDF().coalesce(1).write.mode("overwrite").text("/tmp/data")
scala> spark.read.format("json").schema(schema).load("/tmp/data").show
org.apache.spark.sql.AnalysisException: Found duplicate column(s) in datasource: "a";
at org.apache.spark.sql.util.SchemaUtils$.checkColumnNameDuplication(SchemaUtil.scala:47)
at org.apache.spark.sql.util.SchemaUtils$.checkSchemaColumnNameDuplication(SchemaUtil.scala:33)
at org.apache.spark.sql.execution.datasources.DataSource.getOrInferFileFormatSchema(DataSource.scala:186)
at org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:368)
scala> spark.read.format("json").load("/tmp/data").show
org.apache.spark.sql.AnalysisException: Found duplicate column(s) in datasource: "a";
at org.apache.spark.sql.util.SchemaUtils$.checkColumnNameDuplication(SchemaUtil.scala:47)
at org.apache.spark.sql.util.SchemaUtils$.checkSchemaColumnNameDuplication(SchemaUtil.scala:33)
at org.apache.spark.sql.execution.datasources.DataSource.getOrInferFileFormatSchema(DataSource.scala:186)
at org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:368)
at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:178)
at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:156)
// csv
scala> val schema = StructType(StructField("a", IntegerType) :: StructField("a", IntegerType) :: Nil)
scala> Seq("a,a", "1,1").toDF().coalesce(1).write.mode("overwrite").text("/tmp/data")
scala> spark.read.format("csv").schema(schema).option("header", false).load("/tmp/data").show
org.apache.spark.sql.AnalysisException: Found duplicate column(s) in datasource: "a";
at org.apache.spark.sql.util.SchemaUtils$.checkColumnNameDuplication(SchemaUtil.scala:47)
at org.apache.spark.sql.util.SchemaUtils$.checkSchemaColumnNameDuplication(SchemaUtil.scala:33)
at org.apache.spark.sql.execution.datasources.DataSource.getOrInferFileFormatSchema(DataSource.scala:186)
at org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:368)
at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:178)
scala> spark.read.format("csv").option("header", true).load("/tmp/data").show
+---+---+
| a0| a1|
+---+---+
| 1| 1|
+---+---+
// parquet
scala> val schema = StructType(StructField("a", IntegerType) :: StructField("a", IntegerType) :: Nil)
scala> Seq((1, 1)).toDF("a", "b").coalesce(1).write.mode("overwrite").parquet("/tmp/data")
scala> spark.read.format("parquet").schema(schema).option("header", false).load("/tmp/data").show
org.apache.spark.sql.AnalysisException: Found duplicate column(s) in datasource: "a";
at org.apache.spark.sql.util.SchemaUtils$.checkColumnNameDuplication(SchemaUtil.scala:47)
at org.apache.spark.sql.util.SchemaUtils$.checkSchemaColumnNameDuplication(SchemaUtil.scala:33)
at org.apache.spark.sql.execution.datasources.DataSource.getOrInferFileFormatSchema(DataSource.scala:186)
at org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:368)
```
## How was this patch tested?
Added tests in `DataFrameReaderWriterSuite` and `SQLQueryTestSuite`.
Author: Takeshi Yamamuro <yamamuro@apache.org>
Closes#17758 from maropu/SPARK-20460.
## What changes were proposed in this pull request?
These 3 methods have to be used together, so it makes more sense to merge them into one method and then the caller side only need to call one method.
## How was this patch tested?
existing tests.
Author: Wenchen Fan <wenchen@databricks.com>
Closes#18579 from cloud-fan/minor.
## What changes were proposed in this pull request?
Since we do not set active sessions when parsing the plan, we are unable to correctly use SQLConf.get to find the correct active session. Since https://github.com/apache/spark/pull/18531 breaks the build, I plan to revert it at first.
## How was this patch tested?
The existing test cases
Author: Xiao Li <gatorsmile@gmail.com>
Closes#18568 from gatorsmile/revert18531.
## What changes were proposed in this pull request?
This pr modified code to use string types by default if `array` and `map` in functions have no argument. This behaviour is the same with Hive one;
```
hive> CREATE TEMPORARY TABLE t1 AS SELECT map();
hive> DESCRIBE t1;
_c0 map<string,string>
hive> CREATE TEMPORARY TABLE t2 AS SELECT array();
hive> DESCRIBE t2;
_c0 array<string>
```
## How was this patch tested?
Added tests in `DataFrameFunctionsSuite`.
Author: Takeshi Yamamuro <yamamuro@apache.org>
Closes#18516 from maropu/SPARK-21281.
## What changes were proposed in this pull request?
un-aliased subquery is supported by Spark SQL for a long time. Its semantic was not well defined and had confusing behaviors, and it's not a standard SQL syntax, so we disallowed it in https://issues.apache.org/jira/browse/SPARK-20690 .
However, this is a breaking change, and we do have existing queries using un-aliased subquery. We should add the support back and fix its semantic.
This PR fixes the un-aliased subquery by assigning a default alias name.
After this PR, there is no syntax change from branch 2.2 to master, but we invalid a weird use case:
`SELECT v.i from (SELECT i FROM v)`. Now this query will throw analysis exception because users should not be able to use the qualifier inside a subquery.
## How was this patch tested?
new regression test
Author: Wenchen Fan <wenchen@databricks.com>
Closes#18559 from cloud-fan/sub-query.
## What changes were proposed in this pull request?
Rename org.apache.spark.sql.catalyst.plans.logical.statsEstimation.Range to ValueInterval.
The current naming is identical to logical operator "range".
Refactoring it to ValueInterval is more accurate.
## How was this patch tested?
unit test
Please review http://spark.apache.org/contributing.html before opening a pull request.
Author: Wang Gengliang <ltnwgl@gmail.com>
Closes#18549 from gengliangwang/ValueInterval.
## What changes were proposed in this pull request?
Currently we can't produce a `Dataset` containing `Set` in SparkSQL. This PR tries to support serialization/deserialization of `Set`.
Because there's no corresponding internal data type in SparkSQL for a `Set`, the most proper choice for serializing a set should be an array.
## How was this patch tested?
Added unit tests.
Author: Liang-Chi Hsieh <viirya@gmail.com>
Closes#18416 from viirya/SPARK-21204.
## What changes were proposed in this pull request?
When data type is struct, InSet now uses TypeUtils.getInterpretedOrdering (similar to EqualTo) to build a TreeSet. In other cases it will use a HashSet as before (which should be faster). Similarly, In.eval uses Ordering.equiv instead of equals.
## How was this patch tested?
New test in SQLQuerySuite.
Author: Bogdan Raducanu <bogdan@databricks.com>
Closes#18455 from bogdanrdc/SPARK-21228.
## What changes were proposed in this pull request?
Add missing test cases back and revise code style
Follow up the previous PR: https://github.com/apache/spark/pull/18479
## How was this patch tested?
Unit test
Please review http://spark.apache.org/contributing.html before opening a pull request.
Author: Wang Gengliang <ltnwgl@gmail.com>
Closes#18548 from gengliangwang/stat_propagation_revise.
## What changes were proposed in this pull request?
Corrects offsetInBytes calculation in UnsafeRow.writeToStream. Known failures include writes to some DataSources that have own SparkPlan implementations and cause EXCHANGE in writes.
## How was this patch tested?
Extended UnsafeRowSuite.writeToStream to include an UnsafeRow over byte array having non-zero offset.
Author: Sumedh Wale <swale@snappydata.io>
Closes#18535 from sumwale/SPARK-21312.
### What changes were proposed in this pull request?
This PR removes SQLConf parameters from the optimizer rules
### How was this patch tested?
The existing test cases
Author: gatorsmile <gatorsmile@gmail.com>
Closes#18533 from gatorsmile/rmSQLConfOptimizer.
### What changes were proposed in this pull request?
This PR is to remove SQLConf parameters from the parser-related classes.
### How was this patch tested?
The existing test cases.
Author: gatorsmile <gatorsmile@gmail.com>
Closes#18531 from gatorsmile/rmSQLConfParser.
## What changes were proposed in this pull request?
support to create [temporary] function with the keyword 'OR REPLACE' and 'IF NOT EXISTS'
## How was this patch tested?
manual test and added test cases
Please review http://spark.apache.org/contributing.html before opening a pull request.
Author: ouyangxiaochen <ou.yangxiaochen@zte.com.cn>
Closes#17681 from ouyangxiaochen/spark-419.
## What changes were proposed in this pull request?
Currently `RowEncoder` doesn't preserve nullability of `ArrayType` or `MapType`.
It returns always `containsNull = true` for `ArrayType`, `valueContainsNull = true` for `MapType` and also the nullability of itself is always `true`.
This pr fixes the nullability of them.
## How was this patch tested?
Add tests to check if `RowEncoder` preserves array/map nullability.
Author: Takuya UESHIN <ueshin@happy-camper.st>
Author: Takuya UESHIN <ueshin@databricks.com>
Closes#13873 from ueshin/issues/SPARK-16167.
## What changes were proposed in this pull request?
Add `returnNullable` to `StaticInvoke` the same as #15780 is trying to add to `Invoke` and modify to handle properly.
## How was this patch tested?
Existing tests.
Author: Takuya UESHIN <ueshin@happy-camper.st>
Author: Takuya UESHIN <ueshin@databricks.com>
Closes#16056 from ueshin/issues/SPARK-18623.
## What changes were proposed in this pull request?
For these collection-related encoder expressions, we don't need to create `isNull` variable if the loop element is not nullable.
## How was this patch tested?
existing tests.
Author: Wenchen Fan <wenchen@databricks.com>
Closes#18529 from cloud-fan/minor.
## What changes were proposed in this pull request?
`ExternalMapToCatalyst` should null-check map key prior to converting to internal value to throw an appropriate Exception instead of something like NPE.
## How was this patch tested?
Added a test and existing tests.
Author: Takuya UESHIN <ueshin@databricks.com>
Closes#18524 from ueshin/issues/SPARK-21300.
### What changes were proposed in this pull request?
It is strange to see the following error message. Actually, the column is from another table.
```
cannot resolve '`right.a`' given input columns: [a, c, d];
```
After the PR, the error message looks like
```
cannot resolve '`right.a`' given input columns: [left.a, right.c, right.d];
```
### How was this patch tested?
Added a test case
Author: gatorsmile <gatorsmile@gmail.com>
Closes#18520 from gatorsmile/removeSQLConf.
### What changes were proposed in this pull request?
SQLConf is moved to Catalyst. We are adding more and more test cases for verifying the conf-specific behaviors. It is nice to add a helper function to simplify the test cases.
### How was this patch tested?
N/A
Author: gatorsmile <gatorsmile@gmail.com>
Closes#18469 from gatorsmile/withSQLConf.
## What changes were proposed in this pull request?
Looking at the code in `SessionCatalog.registerFunction`, the parameter `ignoreIfExists` is a wrong name. When `ignoreIfExists` is true, we will override the function if it already exists. So `overrideIfExists` should be the corrected name.
## How was this patch tested?
N/A
Author: Wenchen Fan <wenchen@databricks.com>
Closes#18510 from cloud-fan/minor.
### Idea
This PR adds validation to REFRESH sql statements. Currently, users can specify whatever they want as resource path. For example, spark.sql("REFRESH ! $ !") will be executed without any exceptions.
### Implementation
I am not sure that my current implementation is the most optimal, so any feedback is appreciated. My first idea was to make the grammar as strict as possible. Unfortunately, there were some problems. I tried the approach below:
SqlBase.g4
```
...
| REFRESH TABLE tableIdentifier #refreshTable
| REFRESH resourcePath #refreshResource
...
resourcePath
: STRING
| (IDENTIFIER | number | nonReserved | '/' | '-')+ // other symbols can be added if needed
;
```
It is not flexible enough and requires to explicitly mention all possible symbols. Therefore, I came up with the current approach that is implemented in the code.
Let me know your opinion on which one is better.
Author: aokolnychyi <anton.okolnychyi@sap.com>
Closes#18368 from aokolnychyi/spark-21102.
## What changes were proposed in this pull request?
We currently implement statistics propagation directly in logical plan. Given we already have two different implementations, it'd make sense to actually decouple the two and add stats propagation using mixin. This would reduce the coupling between logical plan and statistics handling.
This can also be a powerful pattern in the future to add additional properties (e.g. constraints).
## How was this patch tested?
Should be covered by existing test cases.
Author: Reynold Xin <rxin@databricks.com>
Closes#18479 from rxin/stats-trait.
## What changes were proposed in this pull request?
Update stats after the following data changing commands:
- InsertIntoHadoopFsRelationCommand
- InsertIntoHiveTable
- LoadDataCommand
- TruncateTableCommand
- AlterTableSetLocationCommand
- AlterTableDropPartitionCommand
## How was this patch tested?
Added new test cases.
Author: wangzhenhua <wangzhenhua@huawei.com>
Author: Zhenhua Wang <wzh_zju@163.com>
Closes#18334 from wzhfy/changeStatsForOperation.
## What changes were proposed in this pull request?
For performance reasons, `UnsafeRow.getString`, `getStruct`, etc. return a "pointer" that points to a memory region of this unsafe row. This makes the unsafe projection a little dangerous, because all of its output rows share one instance.
When we implement SQL operators, we should be careful to not cache the input rows because they may be produced by unsafe projection from child operator and thus its content may change overtime.
However, when we updating values of InternalRow(e.g. in mutable projection and safe projection), we only copy UTF8String, we should also copy InternalRow, ArrayData and MapData. This PR fixes this, and also fixes the copy of vairous InternalRow, ArrayData and MapData implementations.
## How was this patch tested?
new regression tests
Author: Wenchen Fan <wenchen@databricks.com>
Closes#18483 from cloud-fan/fix-copy.
### What changes were proposed in this pull request?
Function argument should not be named expressions. It could cause two issues:
- Misleading error message
- Unexpected query results when the column name is `distinct`, which is not a reserved word in our parser.
```
spark-sql> select count(distinct c1, distinct c2) from t1;
Error in query: cannot resolve '`distinct`' given input columns: [c1, c2]; line 1 pos 26;
'Project [unresolvedalias('count(c1#30, 'distinct), None)]
+- SubqueryAlias t1
+- CatalogRelation `default`.`t1`, org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, [c1#30, c2#31]
```
After the fix, the error message becomes
```
spark-sql> select count(distinct c1, distinct c2) from t1;
Error in query:
extraneous input 'c2' expecting {')', ',', '.', '[', 'OR', 'AND', 'IN', NOT, 'BETWEEN', 'LIKE', RLIKE, 'IS', EQ, '<=>', '<>', '!=', '<', LTE, '>', GTE, '+', '-', '*', '/', '%', 'DIV', '&', '|', '||', '^'}(line 1, pos 35)
== SQL ==
select count(distinct c1, distinct c2) from t1
-----------------------------------^^^
```
### How was this patch tested?
Added a test case to parser suite.
Author: Xiao Li <gatorsmile@gmail.com>
Author: gatorsmile <gatorsmile@gmail.com>
Closes#18338 from gatorsmile/parserDistinctAggFunc.
## What changes were proposed in this pull request?
Invalidate spark's stats after data changing commands:
- InsertIntoHadoopFsRelationCommand
- InsertIntoHiveTable
- LoadDataCommand
- TruncateTableCommand
- AlterTableSetLocationCommand
- AlterTableDropPartitionCommand
## How was this patch tested?
Added test cases.
Author: wangzhenhua <wangzhenhua@huawei.com>
Closes#18449 from wzhfy/removeStats.
## What changes were proposed in this pull request?
`QueryPlan.preCanonicalized` is only overridden in a few places, and it does introduce an extra concept to `QueryPlan` which may confuse people.
This PR removes it and override `canonicalized` in these places
## How was this patch tested?
existing tests
Author: Wenchen Fan <wenchen@databricks.com>
Closes#18440 from cloud-fan/minor.
## What changes were proposed in this pull request?
Move elimination of Distinct clause from analyzer to optimizer
Distinct clause is useless after MAX/MIN clause. For example,
"Select MAX(distinct a) FROM src from"
is equivalent of
"Select MAX(a) FROM src from"
However, this optimization is implemented in analyzer. It should be in optimizer.
## How was this patch tested?
Unit test
gatorsmile cloud-fan
Please review http://spark.apache.org/contributing.html before opening a pull request.
Author: Wang Gengliang <ltnwgl@gmail.com>
Closes#18429 from gengliangwang/distinct_opt.
## What changes were proposed in this pull request?
The issue happens in `ExternalMapToCatalyst`. For example, the following codes create `ExternalMapToCatalyst` to convert Scala Map to catalyst map format.
val data = Seq.tabulate(10)(i => NestedData(1, Map("key" -> InnerData("name", i + 100))))
val ds = spark.createDataset(data)
The `valueConverter` in `ExternalMapToCatalyst` looks like:
if (isnull(lambdavariable(ExternalMapToCatalyst_value52, ExternalMapToCatalyst_value_isNull52, ObjectType(class org.apache.spark.sql.InnerData), true))) null else named_struct(name, staticinvoke(class org.apache.spark.unsafe.types.UTF8String, StringType, fromString, assertnotnull(lambdavariable(ExternalMapToCatalyst_value52, ExternalMapToCatalyst_value_isNull52, ObjectType(class org.apache.spark.sql.InnerData), true)).name, true), value, assertnotnull(lambdavariable(ExternalMapToCatalyst_value52, ExternalMapToCatalyst_value_isNull52, ObjectType(class org.apache.spark.sql.InnerData), true)).value)
There is a `CreateNamedStruct` expression (`named_struct`) to create a row of `InnerData.name` and `InnerData.value` that are referred by `ExternalMapToCatalyst_value52`.
Because `ExternalMapToCatalyst_value52` are local variable, when `CreateNamedStruct` splits expressions to individual functions, the local variable can't be accessed anymore.
## How was this patch tested?
Jenkins tests.
Author: Liang-Chi Hsieh <viirya@gmail.com>
Closes#18418 from viirya/SPARK-19104.
## What changes were proposed in this pull request?
Time windowing in Spark currently performs an Expand + Filter, because there is no way to guarantee the amount of windows a timestamp will fall in, in the general case. However, for tumbling windows, a record is guaranteed to fall into a single bucket. In this case, doubling the number of records with Expand is wasteful, and can be improved by using a simple Projection instead.
Benchmarks show that we get an order of magnitude performance improvement after this patch.
## How was this patch tested?
Existing unit tests. Benchmarked using the following code:
```scala
import org.apache.spark.sql.functions._
spark.time {
spark.range(numRecords)
.select(from_unixtime((current_timestamp().cast("long") * 1000 + 'id / 1000) / 1000) as 'time)
.select(window('time, "10 seconds"))
.count()
}
```
Setup:
- 1 c3.2xlarge worker (8 cores)
![image](https://user-images.githubusercontent.com/5243515/27348748-ed991b84-55a9-11e7-8f8b-6e7abc524417.png)
1 B rows ran in 287 seconds after this optimization. I didn't wait for it to finish without the optimization. Shows about 5x improvement for large number of records.
Author: Burak Yavuz <brkyvz@gmail.com>
Closes#18364 from brkyvz/opt-tumble.
### What changes were proposed in this pull request?
```SQL
CREATE TABLE `tab1`
(`custom_fields` ARRAY<STRUCT<`id`: BIGINT, `value`: STRING>>)
USING parquet
INSERT INTO `tab1`
SELECT ARRAY(named_struct('id', 1, 'value', 'a'), named_struct('id', 2, 'value', 'b'))
SELECT custom_fields.id, custom_fields.value FROM tab1
```
The above query always return the last struct of the array, because the rule `SimplifyCasts` incorrectly rewrites the query. The underlying cause is we always use the same `GenericInternalRow` object when doing the cast.
### How was this patch tested?
Author: gatorsmile <gatorsmile@gmail.com>
Closes#18412 from gatorsmile/castStruct.
## What changes were proposed in this pull request?
`isTableSample` and `isGenerated ` were introduced for SQL Generation respectively by https://github.com/apache/spark/pull/11148 and https://github.com/apache/spark/pull/11050
Since SQL Generation is removed, we do not need to keep `isTableSample`.
## How was this patch tested?
The existing test cases
Author: Xiao Li <gatorsmile@gmail.com>
Closes#18379 from gatorsmile/CleanSample.
## What changes were proposed in this pull request?
Currently we do a lot of validations for subquery in the Analyzer. We should move them to CheckAnalysis which is the framework to catch and report Analysis errors. This was mentioned as a review comment in SPARK-18874.
## How was this patch tested?
Exists tests + A few tests added to SQLQueryTestSuite.
Author: Dilip Biswal <dbiswal@us.ibm.com>
Closes#17713 from dilipbiswal/subquery_checkanalysis.
## What changes were proposed in this pull request?
If the SQL conf for StateStore provider class is changed between restarts (i.e. query started with providerClass1 and attempted to restart using providerClass2), then the query will fail in a unpredictable way as files saved by one provider class cannot be used by the newer one.
Ideally, the provider class used to start the query should be used to restart the query, and the configuration in the session where it is being restarted should be ignored.
This PR saves the provider class config to OffsetSeqLog, in the same way # shuffle partitions is saved and recovered.
## How was this patch tested?
new unit tests
Author: Tathagata Das <tathagata.das1565@gmail.com>
Closes#18402 from tdas/SPARK-21192.
## What changes were proposed in this pull request?
After wiring `SQLConf` in logical plan ([PR 18299](https://github.com/apache/spark/pull/18299)), we can remove the need of passing `conf` into `def stats` and `def computeStats`.
## How was this patch tested?
Covered by existing tests, plus some modified existing tests.
Author: wangzhenhua <wangzhenhua@huawei.com>
Author: Zhenhua Wang <wzh_zju@163.com>
Closes#18391 from wzhfy/removeConf.
## What changes were proposed in this pull request?
The current master outputs unexpected results when the data schema and partition schema have the duplicate columns:
```
withTempPath { dir =>
val basePath = dir.getCanonicalPath
spark.range(0, 3).toDF("foo").write.parquet(new Path(basePath, "foo=1").toString)
spark.range(0, 3).toDF("foo").write.parquet(new Path(basePath, "foo=a").toString)
spark.read.parquet(basePath).show()
}
+---+
|foo|
+---+
| 1|
| 1|
| a|
| a|
| 1|
| a|
+---+
```
This patch added code to print a warning when the duplication found.
## How was this patch tested?
Manually checked.
Author: Takeshi Yamamuro <yamamuro@apache.org>
Closes#18375 from maropu/SPARK-21144-3.
## What changes were proposed in this pull request?
Currently the validation of sampling fraction in dataset is incomplete.
As an improvement, validate sampling fraction in logical operator level:
1) if with replacement: fraction should be nonnegative
2) else: fraction should be on interval [0, 1]
Also add test cases for the validation.
## How was this patch tested?
integration tests
gatorsmile cloud-fan
Please review http://spark.apache.org/contributing.html before opening a pull request.
Author: Wang Gengliang <ltnwgl@gmail.com>
Closes#18387 from gengliangwang/sample_ratio_validate.
## What changes were proposed in this pull request?
Integrate Apache Arrow with Spark to increase performance of `DataFrame.toPandas`. This has been done by using Arrow to convert data partitions on the executor JVM to Arrow payload byte arrays where they are then served to the Python process. The Python DataFrame can then collect the Arrow payloads where they are combined and converted to a Pandas DataFrame. All non-complex data types are currently supported, otherwise an `UnsupportedOperation` exception is thrown.
Additions to Spark include a Scala package private method `Dataset.toArrowPayloadBytes` that will convert data partitions in the executor JVM to `ArrowPayload`s as byte arrays so they can be easily served. A package private class/object `ArrowConverters` that provide data type mappings and conversion routines. In Python, a public method `DataFrame.collectAsArrow` is added to collect Arrow payloads and an optional flag in `toPandas(useArrow=False)` to enable using Arrow (uses the old conversion by default).
## How was this patch tested?
Added a new test suite `ArrowConvertersSuite` that will run tests on conversion of Datasets to Arrow payloads for supported types. The suite will generate a Dataset and matching Arrow JSON data, then the dataset is converted to an Arrow payload and finally validated against the JSON data. This will ensure that the schema and data has been converted correctly.
Added PySpark tests to verify the `toPandas` method is producing equal DataFrames with and without pyarrow. A roundtrip test to ensure the pandas DataFrame produced by pyspark is equal to a one made directly with pandas.
Author: Bryan Cutler <cutlerb@gmail.com>
Author: Li Jin <ice.xelloss@gmail.com>
Author: Li Jin <li.jin@twosigma.com>
Author: Wes McKinney <wes.mckinney@twosigma.com>
Closes#15821 from BryanCutler/wip-toPandas_with_arrow-SPARK-13534.
## What changes were proposed in this pull request?
Currently we have several tens of test sqls in catalyst will fail at `SimpleAnalyzer.checkAnalysis`, we should make sure they are valid.
This PR makes the following changes:
1. Apply `checkAnalysis` on plans that tests `Optimizer` rules, but don't require the testcases for `Parser`/`Analyzer` pass `checkAnalysis`;
2. Fix testcases for `Optimizer` that would have fall.
## How was this patch tested?
Apply `SimpleAnalyzer.checkAnalysis` on plans in `PlanTest.comparePlans`, update invalid test cases.
Author: Xingbo Jiang <xingbo.jiang@databricks.com>
Author: jiangxingbo <jiangxb1987@gmail.com>
Closes#15417 from jiangxb1987/cptest.
## What changes were proposed in this pull request?
QueryPlanConstraints should be part of LogicalPlan, rather than QueryPlan, since the constraint framework is only used for query plan rewriting and not for physical planning.
## How was this patch tested?
Should be covered by existing tests, since it is a simple refactoring.
Author: Reynold Xin <rxin@databricks.com>
Closes#18310 from rxin/SPARK-21103.
## What changes were proposed in this pull request?
Fix some typo of the document.
## How was this patch tested?
Existing tests.
Please review http://spark.apache.org/contributing.html before opening a pull request.
Author: Xianyang Liu <xianyang.liu@intel.com>
Closes#18350 from ConeyLiu/fixtypo.
## What changes were proposed in this pull request?
This PR cleans up a few Java linter errors for Apache Spark 2.2 release.
## How was this patch tested?
```bash
$ dev/lint-java
Using `mvn` from path: /usr/local/bin/mvn
Checkstyle checks passed.
```
We can check the result at Travis CI, [here](https://travis-ci.org/dongjoon-hyun/spark/builds/244297894).
Author: Dongjoon Hyun <dongjoon@apache.org>
Closes#18345 from dongjoon-hyun/fix_lint_java_2.
### What changes were proposed in this pull request?
We should not silently ignore `DISTINCT` when they are not supported in the function arguments. This PR is to block these cases and issue the error messages.
### How was this patch tested?
Added test cases for both regular functions and window functions
Author: Xiao Li <gatorsmile@gmail.com>
Closes#18340 from gatorsmile/firstCount.
## What changes were proposed in this pull request?
Built-in SQL Function UnaryMinus/UnaryPositive support string type, if it's string type, convert it to double type, after this PR:
```sql
spark-sql> select positive('-1.11'), negative('-1.11');
-1.11 1.11
spark-sql>
```
## How was this patch tested?
unit tests
Author: Yuming Wang <wgyumg@gmail.com>
Closes#18173 from wangyum/SPARK-20948.
## What changes were proposed in this pull request?
This PR adds built-in SQL function `BIT_LENGTH()`, `CHAR_LENGTH()`, and `OCTET_LENGTH()` functions.
`BIT_LENGTH()` returns the bit length of the given string or binary expression.
`CHAR_LENGTH()` returns the length of the given string or binary expression. (i.e. equal to `LENGTH()`)
`OCTET_LENGTH()` returns the byte length of the given string or binary expression.
## How was this patch tested?
Added new test suites for these three functions
Author: Kazuaki Ishizaki <ishizaki@jp.ibm.com>
Closes#18046 from kiszk/SPARK-20749.
## What changes were proposed in this pull request?
This pull-request exclusively includes the class splitting feature described in #16648. When code for a given class would grow beyond 1600k bytes, a private, nested sub-class is generated into which subsequent functions are inlined. Additional sub-classes are generated as the code threshold is met subsequent times. This code includes 3 changes:
1. Includes helper maps, lists, and functions for keeping track of sub-classes during code generation (included in the `CodeGenerator` class). These helper functions allow nested classes and split functions to be initialized/declared/inlined to the appropriate locations in the various projection classes.
2. Changes `addNewFunction` to return a string to support instances where a split function is inlined to a nested class and not the outer class (and so must be invoked using the class-qualified name). Uses of `addNewFunction` throughout the codebase are modified so that the returned name is properly used.
3. Removes instances of the `this` keyword when used on data inside generated classes. All state declared in the outer class is by default global and accessible to the nested classes. However, if a reference to global state in a nested class is prepended with the `this` keyword, it would attempt to reference state belonging to the nested class (which would not exist), rather than the correct variable belonging to the outer class.
## How was this patch tested?
Added a test case to the `GeneratedProjectionSuite` that increases the number of columns tested in various projections to a threshold that would previously have triggered a `JaninoRuntimeException` for the Constant Pool.
Note: This PR does not address the second Constant Pool issue with code generation (also mentioned in #16648): excess global mutable state. A second PR may be opened to resolve that issue.
Author: ALeksander Eskilson <alek.eskilson@cerner.com>
Closes#18075 from bdrillard/class_splitting_only.
### What changes were proposed in this pull request?
The current option name `wholeFile` is misleading for CSV users. Currently, it is not representing a record per file. Actually, one file could have multiple records. Thus, we should rename it. Now, the proposal is `multiLine`.
### How was this patch tested?
N/A
Author: Xiao Li <gatorsmile@gmail.com>
Closes#18202 from gatorsmile/renameCVSOption.
## What changes were proposed in this pull request?
It is really painful to not have configs in logical plan and expressions. We had to add all sorts of hacks (e.g. pass SQLConf explicitly in functions). This patch exposes SQLConf in logical plan, using a thread local variable and a getter closure that's set once there is an active SparkSession.
The implementation is a bit of a hack, since we didn't anticipate this need in the beginning (config was only exposed in physical plan). The implementation is described in `SQLConf.get`.
In terms of future work, we should follow up to clean up CBO (remove the need for passing in config).
## How was this patch tested?
Updated relevant tests for constraint propagation.
Author: Reynold Xin <rxin@databricks.com>
Closes#18299 from rxin/SPARK-21092.
## What changes were proposed in this pull request?
This patch moves constraint related code into a separate trait QueryPlanConstraints, so we don't litter QueryPlan with a lot of constraint private functions.
## How was this patch tested?
This is a simple move refactoring and should be covered by existing tests.
Author: Reynold Xin <rxin@databricks.com>
Closes#18298 from rxin/SPARK-21091.
### What changes were proposed in this pull request?
Since both table properties and storage properties share the same key values, table properties are not shown in the output of DESC EXTENDED/FORMATTED when the storage properties are not empty.
This PR is to fix the above issue by renaming them to different keys.
### How was this patch tested?
Added test cases.
Author: Xiao Li <gatorsmile@gmail.com>
Closes#18294 from gatorsmile/tableProperties.
## What changes were proposed in this pull request?
Since `stack` function generates a table with nullable columns, it should allow mixed null values.
```scala
scala> sql("select stack(3, 1, 2, 3)").printSchema
root
|-- col0: integer (nullable = true)
scala> sql("select stack(3, 1, 2, null)").printSchema
org.apache.spark.sql.AnalysisException: cannot resolve 'stack(3, 1, 2, NULL)' due to data type mismatch: Argument 1 (IntegerType) != Argument 3 (NullType); line 1 pos 7;
```
## How was this patch tested?
Pass the Jenkins with a new test case.
Author: Dongjoon Hyun <dongjoon@apache.org>
Closes#17251 from dongjoon-hyun/SPARK-19910.
## What changes were proposed in this pull request?
This patch fixes a bug that can cause NullPointerException in LikeSimplification, when the pattern for like is null.
## How was this patch tested?
Added a new unit test case in LikeSimplificationSuite.
Author: Reynold Xin <rxin@databricks.com>
Closes#18273 from rxin/SPARK-21059.
The PR contains a tiny change to fix the way Spark parses string literals into timestamps. Currently, some timestamps that contain nanoseconds are corrupted during the conversion from internal UTF8Strings into the internal representation of timestamps.
Consider the following example:
```
spark.sql("SELECT cast('2015-01-02 00:00:00.000000001' as TIMESTAMP)").show(false)
+------------------------------------------------+
|CAST(2015-01-02 00:00:00.000000001 AS TIMESTAMP)|
+------------------------------------------------+
|2015-01-02 00:00:00.000001 |
+------------------------------------------------+
```
The fix was tested with existing tests. Also, there is a new test to cover cases that did not work previously.
Author: aokolnychyi <anton.okolnychyi@sap.com>
Closes#18252 from aokolnychyi/spark-17914.
## What changes were proposed in this pull request?
add test case to MathExpressionsSuite as #17906
## How was this patch tested?
unit test cases
Author: liuxian <liu.xian3@zte.com.cn>
Closes#18082 from 10110346/wip-lx-0524.
## What changes were proposed in this pull request?
Add support for specific Java `List` subtypes in deserialization as well as a generic implicit encoder.
All `List` subtypes are supported by using either the size-specifying constructor (one `int` parameter) or the default constructor.
Interfaces/abstract classes use the following implementations:
* `java.util.List`, `java.util.AbstractList` or `java.util.AbstractSequentialList` => `java.util.ArrayList`
## How was this patch tested?
```bash
build/mvn -DskipTests clean package && dev/run-tests
```
Additionally in Spark shell:
```
scala> val jlist = new java.util.LinkedList[Int]; jlist.add(1)
jlist: java.util.LinkedList[Int] = [1]
res0: Boolean = true
scala> Seq(jlist).toDS().map(_.element()).collect()
res1: Array[Int] = Array(1)
```
Author: Michal Senkyr <mike.senkyr@gmail.com>
Closes#18009 from michalsenkyr/dataset-java-lists.
## What changes were proposed in this pull request?
Currently, hive's stats are read into `CatalogStatistics`, while spark's stats are also persisted through `CatalogStatistics`. As a result, hive's stats can be unexpectedly propagated into spark' stats.
For example, for a catalog table, we read stats from hive, e.g. "totalSize" and put it into `CatalogStatistics`. Then, by using "ALTER TABLE" command, we will store the stats in `CatalogStatistics` into metastore as spark's stats (because we don't know whether it's from spark or not). But spark's stats should be only generated by "ANALYZE" command. This is unexpected from this command.
Secondly, now that we have spark's stats in metastore, after inserting new data, although hive updated "totalSize" in metastore, we still cannot get the right `sizeInBytes` in `CatalogStatistics`, because we respect spark's stats (should not exist) over hive's stats.
A running example is shown in [JIRA](https://issues.apache.org/jira/browse/SPARK-21031).
To fix this, we add a new method `alterTableStats` to store spark's stats, and let `alterTable` keep existing stats.
## How was this patch tested?
Added new tests.
Author: Zhenhua Wang <wzh_zju@163.com>
Closes#18248 from wzhfy/separateHiveStats.
## What changes were proposed in this pull request?
add more datatype for some unit tests
## How was this patch tested?
unit tests
Author: liuxian <liu.xian3@zte.com.cn>
Closes#17880 from 10110346/wip_lx_0506.
### What changes were proposed in this pull request?
The precision and scale of decimal values are wrong when the input is BigDecimal between -1.0 and 1.0.
The BigDecimal's precision is the digit count starts from the leftmost nonzero digit based on the [JAVA's BigDecimal definition](https://docs.oracle.com/javase/7/docs/api/java/math/BigDecimal.html). However, our Decimal decision follows the database decimal standard, which is the total number of digits, including both to the left and the right of the decimal point. Thus, this PR is to fix the issue by doing the conversion.
Before this PR, the following queries failed:
```SQL
select 1 > 0.0001
select floor(0.0001)
select ceil(0.0001)
```
### How was this patch tested?
Added test cases.
Author: Xiao Li <gatorsmile@gmail.com>
Closes#18244 from gatorsmile/bigdecimal.
### What changes were proposed in this pull request?
Currently, the unquoted string of a function identifier is being used as the function identifier in the function registry. This could cause the incorrect the behavior when users use `.` in the function names. This PR is to take the `FunctionIdentifier` as the identifier in the function registry.
- Add one new function `createOrReplaceTempFunction` to `FunctionRegistry`
```Scala
final def createOrReplaceTempFunction(name: String, builder: FunctionBuilder): Unit
```
### How was this patch tested?
Add extra test cases to verify the inclusive bug fixes.
Author: Xiao Li <gatorsmile@gmail.com>
Author: gatorsmile <gatorsmile@gmail.com>
Closes#18142 from gatorsmile/fuctionRegistry.
### What changes were proposed in this pull request?
Before 2.2, we indicate the job was terminated because of `FAILFAST` mode.
```
Malformed line in FAILFAST mode: {"a":{, b:3}
```
If possible, we should keep it. This PR is to unify the error messages.
### How was this patch tested?
Modified the existing messages.
Author: Xiao Li <gatorsmile@gmail.com>
Closes#18196 from gatorsmile/messFailFast.
## What changes were proposed in this pull request?
Removed a duplicate case in "SPARK-20854: select hint syntax with expressions"
## How was this patch tested?
Existing tests.
Author: Bogdan Raducanu <bogdan@databricks.com>
Closes#18217 from bogdanrdc/SPARK-20854-2.
## What changes were proposed in this pull request?
`HintInfo.isBroadcastable` is actually not an accurate name, it's used to force the planner to broadcast a plan no matter what the data size is, via the hint mechanism. I think `forceBroadcast` is a better name.
And `isBroadcastable` only have 2 possible values: `Some(true)` and `None`, so we can just use boolean type for it.
## How was this patch tested?
existing tests.
Author: Wenchen Fan <wenchen@databricks.com>
Closes#18189 from cloud-fan/stats.
There could be test failures because DataStorageStrategy, HiveMetastoreCatalog and also HiveSchemaInferenceSuite were exposed to guava library by directly accessing SessionCatalog's tableRelationCacheg. These failures occur when guava shading is in place.
## What changes were proposed in this pull request?
This change removes those guava exposures by introducing new methods in SessionCatalog and also changing DataStorageStrategy, HiveMetastoreCatalog and HiveSchemaInferenceSuite so that they use those proxy methods.
## How was this patch tested?
Unit tests passed after applying these changes.
Author: Reza Safi <rezasafi@cloudera.com>
Closes#18148 from rezasafi/branch-2.2.
(cherry picked from commit 1388fdd707)
## What changes were proposed in this pull request?
The construction of BROADCAST_TIMEOUT conf should take the TimeUnit argument as a TimeoutConf.
Author: Feng Liu <fengliu@databricks.com>
Closes#18208 from liufengdb/fix_timeout.
## What changes were proposed in this pull request?
Fixes a typo: `and` -> `an`
## How was this patch tested?
Not at all.
Author: Wieland Hoffmann <mineo@users.noreply.github.com>
Closes#17759 from mineo/patch-1.
### What changes were proposed in this pull request?
1. The description of `spark.sql.files.ignoreCorruptFiles` is not accurate. When the file does not exist, we will issue the error message.
```
org.apache.spark.sql.AnalysisException: Path does not exist: file:/nonexist/path;
```
2. `spark.sql.columnNameOfCorruptRecord` also affects the CSV format. The current description only mentions JSON format.
### How was this patch tested?
N/A
Author: Xiao Li <gatorsmile@gmail.com>
Closes#18184 from gatorsmile/updateMessage.
## What changes were proposed in this pull request?
SQL hint syntax:
* support expressions such as strings, numbers, etc. instead of only identifiers as it is currently.
* support multiple hints, which was missing compared to the DataFrame syntax.
DataFrame API:
* support any parameters in DataFrame.hint instead of just strings
## How was this patch tested?
Existing tests. New tests in PlanParserSuite. New suite DataFrameHintSuite.
Author: Bogdan Raducanu <bogdan@databricks.com>
Closes#18086 from bogdanrdc/SPARK-20854.
### What changes were proposed in this pull request?
Before this PR, Subquery reuse does not work. Below are three issues:
- Subquery reuse does not work.
- It is sharing the same `SQLConf` (`spark.sql.exchange.reuse`) with the one for Exchange Reuse.
- No test case covers the rule Subquery reuse.
This PR is to fix the above three issues.
- Ignored the physical operator `SubqueryExec` when comparing two plans.
- Added a dedicated conf `spark.sql.subqueries.reuse` for controlling Subquery Reuse
- Added a test case for verifying the behavior
### How was this patch tested?
N/A
Author: Xiao Li <gatorsmile@gmail.com>
Closes#18169 from gatorsmile/subqueryReuse.
## What changes were proposed in this pull request?
Add build-int SQL function - UUID.
## How was this patch tested?
unit tests
Author: Yuming Wang <wgyumg@gmail.com>
Closes#18136 from wangyum/SPARK-20910.
## What changes were proposed in this pull request?
Minor changes to scaladoc
## How was this patch tested?
Local build
Author: Jacek Laskowski <jacek@japila.pl>
Closes#18074 from jaceklaskowski/scaladoc-fixes.
## What changes were proposed in this pull request?
Currently the `DataFrameWriter` operations have several problems:
1. non-file-format data source writing action doesn't show up in the SQL tab in Spark UI
2. file-format data source writing action shows a scan node in the SQL tab, without saying anything about writing. (streaming also have this issue, but not fixed in this PR)
3. Spark SQL CLI actions don't show up in the SQL tab.
This PR fixes all of them, by refactoring the `ExecuteCommandExec` to make it have children.
close https://github.com/apache/spark/pull/17540
## How was this patch tested?
existing tests.
Also test the UI manually. For a simple command: `Seq(1 -> "a").toDF("i", "j").write.parquet("/tmp/qwe")`
before this PR:
<img width="266" alt="qq20170523-035840 2x" src="https://cloud.githubusercontent.com/assets/3182036/26326050/24e18ba2-3f6c-11e7-8817-6dd275bf6ac5.png">
after this PR:
<img width="287" alt="qq20170523-035708 2x" src="https://cloud.githubusercontent.com/assets/3182036/26326054/2ad7f460-3f6c-11e7-8053-d68325beb28f.png">
Author: Wenchen Fan <wenchen@databricks.com>
Closes#18064 from cloud-fan/execution.
## What changes were proposed in this pull request?
A bunch of changes to the StateStore APIs and implementation.
Current state store API has a bunch of problems that causes too many transient objects causing memory pressure.
- `StateStore.get(): Option` forces creation of Some/None objects for every get. Changed this to return the row or null.
- `StateStore.iterator(): (UnsafeRow, UnsafeRow)` forces creation of new tuple for each record returned. Changed this to return a UnsafeRowTuple which can be reused across records.
- `StateStore.updates()` requires the implementation to keep track of updates, while this is used minimally (only by Append mode in streaming aggregations). Removed updates() and updated StateStoreSaveExec accordingly.
- `StateStore.filter(condition)` and `StateStore.remove(condition)` has been merge into a single API `getRange(start, end)` which allows a state store to do optimized range queries (i.e. avoid full scans). Stateful operators have been updated accordingly.
- Removed a lot of unnecessary row copies Each operator copied rows before calling StateStore.put() even if the implementation does not require it to be copied. It is left up to the implementation on whether to copy the row or not.
Additionally,
- Added a name to the StateStoreId so that each operator+partition can use multiple state stores (different names)
- Added a configuration that allows the user to specify which implementation to use.
- Added new metrics to understand the time taken to update keys, remove keys and commit all changes to the state store. These metrics will be visible on the plan diagram in the SQL tab of the UI.
- Refactored unit tests such that they can be reused to test any implementation of StateStore.
## How was this patch tested?
Old and new unit tests
Author: Tathagata Das <tathagata.das1565@gmail.com>
Closes#18107 from tdas/SPARK-20376.
### What changes were proposed in this pull request?
We are unable to call the function registered in the not-current database.
```Scala
sql("CREATE DATABASE dAtABaSe1")
sql(s"CREATE FUNCTION dAtABaSe1.test_avg AS '${classOf[GenericUDAFAverage].getName}'")
sql("SELECT dAtABaSe1.test_avg(1)")
```
The above code returns an error:
```
Undefined function: 'dAtABaSe1.test_avg'. This function is neither a registered temporary function nor a permanent function registered in the database 'default'.; line 1 pos 7
```
This PR is to fix the above issue.
### How was this patch tested?
Added test cases.
Author: Xiao Li <gatorsmile@gmail.com>
Closes#18146 from gatorsmile/qualifiedFunction.
## What changes were proposed in this pull request?
We changed the parser to reject unaliased subqueries in the FROM clause in SPARK-20690. However, the error message that we now give isn't very helpful:
scala> sql("""SELECT x FROM (SELECT 1 AS x)""")
org.apache.spark.sql.catalyst.parser.ParseException:
mismatched input 'FROM' expecting {<EOF>, 'WHERE', 'GROUP', 'ORDER', 'HAVING', 'LIMIT', 'LATERAL', 'WINDOW', 'UNION', 'EXCEPT', 'MINUS', 'INTERSECT', 'SORT', 'CLUSTER', 'DISTRIBUTE'}(line 1, pos 9)
We should modify the parser to throw a more clear error for such queries:
scala> sql("""SELECT x FROM (SELECT 1 AS x)""")
org.apache.spark.sql.catalyst.parser.ParseException:
The unaliased subqueries in the FROM clause are not supported.(line 1, pos 14)
## How was this patch tested?
Modified existing tests to reflect this change.
Author: Liang-Chi Hsieh <viirya@gmail.com>
Closes#18141 from viirya/SPARK-20916.
## What changes were proposed in this pull request?
Fix some indent issues.
## How was this patch tested?
existing tests.
Author: Yuming Wang <wgyumg@gmail.com>
Closes#18133 from wangyum/IndentIssues.
## What changes were proposed in this pull request?
Add build-int SQL function - DAYOFWEEK
## How was this patch tested?
unit tests
Author: Yuming Wang <wgyumg@gmail.com>
Closes#18134 from wangyum/SPARK-20909.
## What changes were proposed in this pull request?
This PR adds built-in SQL function `(REPLACE(<string_expression>, <search_string> [, <replacement_string>])`
`REPLACE()` return that string that is replaced all occurrences with given string.
## How was this patch tested?
added new test suites
Author: Kazuaki Ishizaki <ishizaki@jp.ibm.com>
Closes#18047 from kiszk/SPARK-20750.
## What changes were proposed in this pull request?
See class doc of `ConstantPropagation` for the approach used.
## How was this patch tested?
- Added unit tests
Author: Tejas Patil <tejasp@fb.com>
Closes#17993 from tejasapatil/SPARK-20758_const_propagation.
## What changes were proposed in this pull request?
This pr added parsing rules to support table column aliases in FROM clause.
## How was this patch tested?
Added tests in `PlanParserSuite`, `SQLQueryTestSuite`, and `PlanParserSuite`.
Author: Takeshi Yamamuro <yamamuro@apache.org>
Closes#18079 from maropu/SPARK-20841.
### What changes were proposed in this pull request?
In Cache manager, the plan matching should ignore Hint.
```Scala
val df1 = spark.range(10).join(broadcast(spark.range(10)))
df1.cache()
spark.range(10).join(spark.range(10)).explain()
```
The output plan of the above query shows that the second query is not using the cached data of the first query.
```
BroadcastNestedLoopJoin BuildRight, Inner
:- *Range (0, 10, step=1, splits=2)
+- BroadcastExchange IdentityBroadcastMode
+- *Range (0, 10, step=1, splits=2)
```
After the fix, the plan becomes
```
InMemoryTableScan [id#20L, id#23L]
+- InMemoryRelation [id#20L, id#23L], true, 10000, StorageLevel(disk, memory, deserialized, 1 replicas)
+- BroadcastNestedLoopJoin BuildRight, Inner
:- *Range (0, 10, step=1, splits=2)
+- BroadcastExchange IdentityBroadcastMode
+- *Range (0, 10, step=1, splits=2)
```
### How was this patch tested?
Added a test.
Author: Xiao Li <gatorsmile@gmail.com>
Closes#18131 from gatorsmile/HintCache.
## What changes were proposed in this pull request?
spark-sql>SELECT ceil(cast(12345.1233 as float));
spark-sql>12345
For this case, the result we expected is `12346`
spark-sql>SELECT floor(cast(-12345.1233 as float));
spark-sql>-12345
For this case, the result we expected is `-12346`
Because in `Ceil` or `Floor`, `inputTypes` has no FloatType, so it is converted to LongType.
## How was this patch tested?
After the modification:
spark-sql>SELECT ceil(cast(12345.1233 as float));
spark-sql>12346
spark-sql>SELECT floor(cast(-12345.1233 as float));
spark-sql>-12346
Author: liuxian <liu.xian3@zte.com.cn>
Closes#18103 from 10110346/wip-lx-0525-1.
## What changes were proposed in this pull request?
Add built-in SQL function `CH[A]R`:
For `CHR(bigint|double n)`, returns the ASCII character having the binary equivalent to `n`. If n is larger than 256 the result is equivalent to CHR(n % 256)
## How was this patch tested?
unit tests
Author: Yuming Wang <wgyumg@gmail.com>
Closes#18019 from wangyum/SPARK-20748.
Now that Structured Streaming has been out for several Spark release and has large production use cases, the `Experimental` label is no longer appropriate. I've left `InterfaceStability.Evolving` however, as I think we may make a few changes to the pluggable Source & Sink API in Spark 2.3.
Author: Michael Armbrust <michael@databricks.com>
Closes#18065 from marmbrus/streamingGA.
## What changes were proposed in this pull request?
It is reported that there is performance downgrade when applying ML pipeline for dataset with many columns but few rows.
A big part of the performance downgrade comes from some operations (e.g., `select`) on DataFrame/Dataset which re-create new DataFrame/Dataset with a new `LogicalPlan`. The cost can be ignored in the usage of SQL, normally.
However, it's not rare to chain dozens of pipeline stages in ML. When the query plan grows incrementally during running those stages, the total cost spent on re-creation of DataFrame grows too. In particular, the `Analyzer` will go through the big query plan even most part of it is analyzed.
By eliminating part of the cost, the time to run the example code locally is reduced from about 1min to about 30 secs.
In particular, the time applying the pipeline locally is mostly spent on calling transform of the 137 `Bucketizer`s. Before the change, each call of `Bucketizer`'s transform can cost about 0.4 sec. So the total time spent on all `Bucketizer`s' transform is about 50 secs. After the change, each call only costs about 0.1 sec.
<del>We also make `boundEnc` as lazy variable to reduce unnecessary running time.</del>
### Performance improvement
The codes and datasets provided by Barry Becker to re-produce this issue and benchmark can be found on the JIRA.
Before this patch: about 1 min
After this patch: about 20 secs
## How was this patch tested?
Existing tests.
Please review http://spark.apache.org/contributing.html before opening a pull request.
Author: Liang-Chi Hsieh <viirya@gmail.com>
Closes#17770 from viirya/SPARK-20392.
## What changes were proposed in this pull request?
1. add instructions of 'cast' function When using 'show functions' and 'desc function cast'
command in spark-sql
2. Modify the instructions of functions,such as
boolean,tinyint,smallint,int,bigint,float,double,decimal,date,timestamp,binary,string
## How was this patch tested?
Before modification:
spark-sql>desc function boolean;
Function: boolean
Class: org.apache.spark.sql.catalyst.expressions.Cast
Usage: boolean(expr AS type) - Casts the value `expr` to the target data type `type`.
After modification:
spark-sql> desc function boolean;
Function: boolean
Class: org.apache.spark.sql.catalyst.expressions.Cast
Usage: boolean(expr) - Casts the value `expr` to the target data type `boolean`.
spark-sql> desc function cast
Function: cast
Class: org.apache.spark.sql.catalyst.expressions.Cast
Usage: cast(expr AS type) - Casts the value `expr` to the target data type `type`.
Author: liuxian <liu.xian3@zte.com.cn>
Closes#17698 from 10110346/wip_lx_0418.
## What changes were proposed in this pull request?
This is a follow-up to SPARK-20857 to move the broadcast hint from Statistics into a new HintInfo class, so we can be more flexible in adding new hints in the future.
## How was this patch tested?
Updated test cases to reflect the change.
Author: Reynold Xin <rxin@databricks.com>
Closes#18087 from rxin/SPARK-20867.
## What changes were proposed in this pull request?
This patch renames BroadcastHint to ResolvedHint (and Hint to UnresolvedHint) so the hint framework is more generic and would allow us to introduce other hint types in the future without introducing new hint nodes.
## How was this patch tested?
Updated test cases.
Author: Reynold Xin <rxin@databricks.com>
Closes#18072 from rxin/SPARK-20857.
## What changes were proposed in this pull request?
As srowen pointed in 609ba5f2b9 (commitcomment-22221259), the previous tests are not proper.
This follow-up is going to fix the tests.
## How was this patch tested?
Jenkins tests.
Please review http://spark.apache.org/contributing.html before opening a pull request.
Author: Liang-Chi Hsieh <viirya@gmail.com>
Closes#18048 from viirya/SPARK-20399-follow-up.
### What changes were proposed in this pull request?
After we adding a new field `stats` into `CatalogTable`, we should not expose Hive-specific Stats metadata to `MetastoreRelation`. It complicates all the related codes. It also introduces a bug in `SHOW CREATE TABLE`. The statistics-related table properties should be skipped by `SHOW CREATE TABLE`, since it could be incorrect in the newly created table. See the Hive JIRA: https://issues.apache.org/jira/browse/HIVE-13792
Also fix the issue to fill Hive-generated RowCounts to our stats.
This PR is to handle Hive-specific Stats metadata in `HiveClientImpl`.
### How was this patch tested?
Added a few test cases.
Author: Xiao Li <gatorsmile@gmail.com>
Closes#14971 from gatorsmile/showCreateTableNew.
### What changes were proposed in this pull request?
Currently, we have a bug when we specify `IF NOT EXISTS` in `INSERT OVERWRITE` data source tables. For example, given a query:
```SQL
INSERT OVERWRITE TABLE $tableName partition (b=2, c=3) IF NOT EXISTS SELECT 9, 10
```
we will get the following error:
```
unresolved operator 'InsertIntoTable Relation[a#425,d#426,b#427,c#428] parquet, Map(b -> Some(2), c -> Some(3)), true, true;;
'InsertIntoTable Relation[a#425,d#426,b#427,c#428] parquet, Map(b -> Some(2), c -> Some(3)), true, true
+- Project [cast(9#423 as int) AS a#429, cast(10#424 as int) AS d#430]
+- Project [9 AS 9#423, 10 AS 10#424]
+- OneRowRelation$
```
This PR is to fix the issue to follow the behavior of Hive serde tables
> INSERT OVERWRITE will overwrite any existing data in the table or partition unless IF NOT EXISTS is provided for a partition
### How was this patch tested?
Modified an existing test case
Author: gatorsmile <gatorsmile@gmail.com>
Closes#18050 from gatorsmile/insertPartitionIfNotExists.
## What changes were proposed in this pull request?
spark-sql>SELECT ceil(1234567890123456);
1234567890123456
spark-sql>SELECT ceil(12345678901234567);
12345678901234568
spark-sql>SELECT ceil(123456789012345678);
123456789012345680
when the length of the getText is greater than 16. long to double will be precision loss.
but mysql handle the value is ok.
mysql> SELECT ceil(1234567890123456);
+------------------------+
| ceil(1234567890123456) |
+------------------------+
| 1234567890123456 |
+------------------------+
1 row in set (0.00 sec)
mysql> SELECT ceil(12345678901234567);
+-------------------------+
| ceil(12345678901234567) |
+-------------------------+
| 12345678901234567 |
+-------------------------+
1 row in set (0.00 sec)
mysql> SELECT ceil(123456789012345678);
+--------------------------+
| ceil(123456789012345678) |
+--------------------------+
| 123456789012345678 |
+--------------------------+
1 row in set (0.00 sec)
## How was this patch tested?
Supplement the unit test.
Author: caoxuewen <cao.xuewen@zte.com.cn>
Closes#18016 from heary-cao/ceil_long.
## What changes were proposed in this pull request?
spark-sql>select month("1582-09-28");
spark-sql>10
For this case, the expected result is 9, but it is 10.
spark-sql>select day("1582-04-18");
spark-sql>28
For this case, the expected result is 18, but it is 28.
when the date before "1582-10-04", the function of `month` and `day` return the value which is not we expected.
## How was this patch tested?
unit tests
Author: liuxian <liu.xian3@zte.com.cn>
Closes#17997 from 10110346/wip_lx_0516.
## What changes were proposed in this pull request?
Add built-in SQL Function - COT.
## How was this patch tested?
unit tests
Author: Yuming Wang <wgyumg@gmail.com>
Closes#17999 from wangyum/SPARK-20751.
## What changes were proposed in this pull request?
GenerateUnsafeProjection.writeStructToBuffer() did not honor the assumption that the caller must make sure that a value is not null before using the getter. This could lead to various errors. This change fixes that behavior.
Example of code generated before:
```scala
/* 059 */ final UTF8String fieldName = value.getUTF8String(0);
/* 060 */ if (value.isNullAt(0)) {
/* 061 */ rowWriter1.setNullAt(0);
/* 062 */ } else {
/* 063 */ rowWriter1.write(0, fieldName);
/* 064 */ }
```
Example of code generated now:
```scala
/* 060 */ boolean isNull1 = value.isNullAt(0);
/* 061 */ UTF8String value1 = isNull1 ? null : value.getUTF8String(0);
/* 062 */ if (isNull1) {
/* 063 */ rowWriter1.setNullAt(0);
/* 064 */ } else {
/* 065 */ rowWriter1.write(0, value1);
/* 066 */ }
```
## How was this patch tested?
Adds GenerateUnsafeProjectionSuite.
Author: Ala Luszczak <ala@databricks.com>
Closes#18030 from ala/fix-generate-unsafe-projection.
## What changes were proposed in this pull request?
In the previous approach we used `aliasMap` to link an `Attribute` to the expression with potentially the form `f(a, b)`, but we only searched the `expressions` and `children.expressions` for this, which is not enough when an `Alias` may lies deep in the logical plan. In that case, we can't generate the valid equivalent constraint classes and thus we fail at preventing the recursive deductions.
We fix this problem by collecting all `Alias`s from the logical plan.
## How was this patch tested?
No additional test case is added, but do modified one test case to cover this situation.
Author: Xingbo Jiang <xingbo.jiang@databricks.com>
Closes#18020 from jiangxb1987/inferConstrants.
## What changes were proposed in this pull request?
We add missing attributes into Filter in Analyzer. But we shouldn't do it through subqueries like this:
select 1 from (select 1 from onerow t1 LIMIT 1) where t1.c1=1
This query works in current codebase. However, the outside where clause shouldn't be able to refer `t1.c1` attribute.
The root cause is we allow subqueries in FROM have no alias names previously, it is confusing and isn't supported by various databases such as MySQL, Postgres, Oracle. We shouldn't support it too.
## How was this patch tested?
Jenkins tests.
Please review http://spark.apache.org/contributing.html before opening a pull request.
Author: Liang-Chi Hsieh <viirya@gmail.com>
Closes#17935 from viirya/SPARK-20690.
## What changes were proposed in this pull request?
Currently the parser logs the query it is parsing at `info` level. This is too high, this PR lowers the log level to `debug`.
## How was this patch tested?
Existing tests.
Author: Herman van Hovell <hvanhovell@databricks.com>
Closes#18006 from hvanhovell/lower_parser_log_level.
## What changes were proposed in this pull request?
When an expression for `df.filter()` has many nodes (e.g. 400), the size of Java bytecode for the generated Java code is more than 64KB. It produces an Java exception. As a result, the execution fails.
This PR continues to execute by calling `Expression.eval()` disabling code generation if an exception has been caught.
## How was this patch tested?
Add a test suite into `DataFrameSuite`
Author: Kazuaki Ishizaki <ishizaki@jp.ibm.com>
Closes#17087 from kiszk/SPARK-19372.
## What changes were proposed in this pull request?
Because the method `TimeZone.getTimeZone(String ID)` is synchronized on the TimeZone class, concurrent call of this method will become a bottleneck.
This especially happens when casting from string value containing timezone info to timestamp value, which uses `DateTimeUtils.stringToTimestamp()` and gets TimeZone instance on the site.
This pr makes a cache of the generated TimeZone instances to avoid the synchronization.
## How was this patch tested?
Existing tests.
Author: Takuya UESHIN <ueshin@databricks.com>
Closes#17933 from ueshin/issues/SPARK-20588.
## What changes were proposed in this pull request?
Hive allows inserting data to bucketed table without guaranteeing bucketed and sorted-ness based on these two configs : `hive.enforce.bucketing` and `hive.enforce.sorting`.
What does this PR achieve ?
- Spark will disallow users from writing outputs to hive bucketed tables by default (given that output won't adhere with Hive's semantics).
- IF user still wants to write to hive bucketed table, the only resort is to use `hive.enforce.bucketing=false` and `hive.enforce.sorting=false` which means user does NOT care about bucketing guarantees.
Changes done in this PR:
- Extract table's bucketing information in `HiveClientImpl`
- While writing table info to metastore, `HiveClientImpl` now populates the bucketing information in the hive `Table` object
- `InsertIntoHiveTable` allows inserts to bucketed table only if both `hive.enforce.bucketing` and `hive.enforce.sorting` are `false`
Ability to create bucketed tables will enable adding test cases to Spark while I add more changes related to hive bucketing support. Design doc for hive hive bucketing support : https://docs.google.com/document/d/1a8IDh23RAkrkg9YYAeO51F4aGO8-xAlupKwdshve2fc/edit#
## How was this patch tested?
- Added test for creating bucketed and sorted table.
- Added test to ensure that INSERTs fail if strict bucket / sort is enforced
- Added test to ensure that INSERTs can go through if strict bucket / sort is NOT enforced
- Added test to validate that bucketing information shows up in output of DESC FORMATTED
- Added test to ensure that `SHOW CREATE TABLE` works for hive bucketed tables
Author: Tejas Patil <tejasp@fb.com>
Closes#17644 from tejasapatil/SPARK-17729_create_bucketed_table.
## What changes were proposed in this pull request?
This pr added a new Optimizer rule to combine nested Concat. The master supports a pipeline operator '||' to concatenate strings in #17711 (This pr is follow-up). Since the parser currently generates nested Concat expressions, the optimizer needs to combine the nested expressions.
## How was this patch tested?
Added tests in `CombineConcatSuite` and `SQLQueryTestSuite`.
Author: Takeshi Yamamuro <yamamuro@apache.org>
Closes#17970 from maropu/SPARK-20730.
## What changes were proposed in this pull request?
For aggregate function with `PartialMerge` or `Final` mode, the input is aggregate buffers instead of the actual children expressions. So the actual children expressions won't affect the result, we should normalize the expr id for them.
## How was this patch tested?
a new regression test
Author: Wenchen Fan <wenchen@databricks.com>
Closes#17964 from cloud-fan/tmp.
## What changes were proposed in this pull request?
This PR is based on https://github.com/apache/spark/pull/16199 and extracts the valid change from https://github.com/apache/spark/pull/9759 to resolve SPARK-18772
This avoids additional conversion try with `toFloat` and `toDouble`.
For avoiding additional conversions, please refer the codes below:
**Before**
```scala
scala> import org.apache.spark.sql.types._
import org.apache.spark.sql.types._
scala> spark.read.schema(StructType(Seq(StructField("a", DoubleType)))).option("mode", "FAILFAST").json(Seq("""{"a": "nan"}""").toDS).show()
17/05/12 11:30:41 ERROR Executor: Exception in task 0.0 in stage 2.0 (TID 2)
java.lang.NumberFormatException: For input string: "nan"
...
```
**After**
```scala
scala> import org.apache.spark.sql.types._
import org.apache.spark.sql.types._
scala> spark.read.schema(StructType(Seq(StructField("a", DoubleType)))).option("mode", "FAILFAST").json(Seq("""{"a": "nan"}""").toDS).show()
17/05/12 11:44:30 ERROR Executor: Exception in task 0.0 in stage 0.0 (TID 0)
java.lang.RuntimeException: Cannot parse nan as DoubleType.
...
```
## How was this patch tested?
Unit tests added in `JsonSuite`.
Closes#16199
Author: hyukjinkwon <gurwls223@gmail.com>
Author: Nathan Howell <nhowell@godaddy.com>
Closes#17956 from HyukjinKwon/SPARK-18772.
### What changes were proposed in this pull request?
`LIMIT ALL` is the same as omitting the `LIMIT` clause. It is supported by both PrestgreSQL and Presto. This PR is to support it by adding it in the parser.
### How was this patch tested?
Added a test case
Author: Xiao Li <gatorsmile@gmail.com>
Closes#17960 from gatorsmile/LimitAll.
## What changes were proposed in this pull request?
This pr added code to support `||` for string concatenation. This string operation is supported in PostgreSQL and MySQL.
## How was this patch tested?
Added tests in `SparkSqlParserSuite`
Author: Takeshi Yamamuro <yamamuro@apache.org>
Closes#17711 from maropu/SPARK-19951.
## What changes were proposed in this pull request?
This pr added `Analyzer` code for supporting aliases in CUBE/ROLLUP/GROUPING SETS (This is follow-up of #17191).
## How was this patch tested?
Added tests in `SQLQueryTestSuite`.
Author: Takeshi Yamamuro <yamamuro@apache.org>
Closes#17948 from maropu/SPARK-20710.
## What changes were proposed in this pull request?
Fix canonicalization for different filter orders in `HiveTableScanExec`.
## How was this patch tested?
Added a new test case.
Author: wangzhenhua <wangzhenhua@huawei.com>
Closes#17962 from wzhfy/canonicalizeHiveTableScanExec.
## What changes were proposed in this pull request?
This method gets a type's primary constructor and fills in type parameters with concrete types. For example, `MapPartitions[T, U] -> MapPartitions[Int, String]`. This Substitution fails when the actual type args are empty because they are still unknown. Instead, when there are no resolved types to subsitute, this returns the original args with unresolved type parameters.
## How was this patch tested?
This doesn't affect substitutions where the type args are determined. This fixes our case where the actual type args are empty and our job runs successfully.
Author: Ryan Blue <blue@apache.org>
Closes#15062 from rdblue/SPARK-17424-fix-unsound-reflect-substitution.
## What changes were proposed in this pull request?
This PR proposes three things as below:
- Use casting rules to a timestamp in `to_timestamp` by default (it was `yyyy-MM-dd HH:mm:ss`).
- Support single argument for `to_timestamp` similarly with APIs in other languages.
For example, the one below works
```
import org.apache.spark.sql.functions._
Seq("2016-12-31 00:12:00.00").toDF("a").select(to_timestamp(col("a"))).show()
```
prints
```
+----------------------------------------+
|to_timestamp(`a`, 'yyyy-MM-dd HH:mm:ss')|
+----------------------------------------+
| 2016-12-31 00:12:00|
+----------------------------------------+
```
whereas this does not work in SQL.
**Before**
```
spark-sql> SELECT to_timestamp('2016-12-31 00:12:00');
Error in query: Invalid number of arguments for function to_timestamp; line 1 pos 7
```
**After**
```
spark-sql> SELECT to_timestamp('2016-12-31 00:12:00');
2016-12-31 00:12:00
```
- Related document improvement for SQL function descriptions and other API descriptions accordingly.
**Before**
```
spark-sql> DESCRIBE FUNCTION extended to_date;
...
Usage: to_date(date_str, fmt) - Parses the `left` expression with the `fmt` expression. Returns null with invalid input.
Extended Usage:
Examples:
> SELECT to_date('2016-12-31', 'yyyy-MM-dd');
2016-12-31
```
```
spark-sql> DESCRIBE FUNCTION extended to_timestamp;
...
Usage: to_timestamp(timestamp, fmt) - Parses the `left` expression with the `format` expression to a timestamp. Returns null with invalid input.
Extended Usage:
Examples:
> SELECT to_timestamp('2016-12-31', 'yyyy-MM-dd');
2016-12-31 00:00:00.0
```
**After**
```
spark-sql> DESCRIBE FUNCTION extended to_date;
...
Usage:
to_date(date_str[, fmt]) - Parses the `date_str` expression with the `fmt` expression to
a date. Returns null with invalid input. By default, it follows casting rules to a date if
the `fmt` is omitted.
Extended Usage:
Examples:
> SELECT to_date('2009-07-30 04:17:52');
2009-07-30
> SELECT to_date('2016-12-31', 'yyyy-MM-dd');
2016-12-31
```
```
spark-sql> DESCRIBE FUNCTION extended to_timestamp;
...
Usage:
to_timestamp(timestamp[, fmt]) - Parses the `timestamp` expression with the `fmt` expression to
a timestamp. Returns null with invalid input. By default, it follows casting rules to
a timestamp if the `fmt` is omitted.
Extended Usage:
Examples:
> SELECT to_timestamp('2016-12-31 00:12:00');
2016-12-31 00:12:00
> SELECT to_timestamp('2016-12-31', 'yyyy-MM-dd');
2016-12-31 00:00:00
```
## How was this patch tested?
Added tests in `datetime.sql`.
Author: hyukjinkwon <gurwls223@gmail.com>
Closes#17901 from HyukjinKwon/to_timestamp_arg.
## What changes were proposed in this pull request?
spark-sql>select bround(12.3, 2);
spark-sql>NULL
For this case, the expected result is 12.3, but it is null.
So ,when the second parameter is bigger than "decimal.scala", the result is not we expected.
"round" function has the same problem. This PR can solve the problem for both of them.
## How was this patch tested?
unit test cases in MathExpressionsSuite and MathFunctionsSuite
Author: liuxian <liu.xian3@zte.com.cn>
Closes#17906 from 10110346/wip_lx_0509.
## What changes were proposed in this pull request?
The new SQL parser is introduced into Spark 2.0. All string literals are unescaped in parser. Seems it bring an issue regarding the regex pattern string.
The following codes can reproduce it:
val data = Seq("\u0020\u0021\u0023", "abc")
val df = data.toDF()
// 1st usage: works in 1.6
// Let parser parse pattern string
val rlike1 = df.filter("value rlike '^\\x20[\\x20-\\x23]+$'")
// 2nd usage: works in 1.6, 2.x
// Call Column.rlike so the pattern string is a literal which doesn't go through parser
val rlike2 = df.filter($"value".rlike("^\\x20[\\x20-\\x23]+$"))
// In 2.x, we need add backslashes to make regex pattern parsed correctly
val rlike3 = df.filter("value rlike '^\\\\x20[\\\\x20-\\\\x23]+$'")
Follow the discussion in #17736, this patch adds a config to fallback to 1.6 string literal parsing and mitigate migration issue.
## How was this patch tested?
Jenkins tests.
Please review http://spark.apache.org/contributing.html before opening a pull request.
Author: Liang-Chi Hsieh <viirya@gmail.com>
Closes#17887 from viirya/add-config-fallback-string-parsing.
## What changes were proposed in this pull request?
This pr added parsing rules to support aliases in table value functions.
The previous pr (#17666) has been reverted because of the regression. This new pr fixed the regression and add tests in `SQLQueryTestSuite`.
## How was this patch tested?
Added tests in `PlanParserSuite` and `SQLQueryTestSuite`.
Author: Takeshi Yamamuro <yamamuro@apache.org>
Closes#17928 from maropu/SPARK-20311-3.
## What changes were proposed in this pull request?
`RuntimeReplaceable` always has a constructor with the expression to replace with, and this constructor should not be the function builder.
## How was this patch tested?
new regression test
Author: Wenchen Fan <wenchen@databricks.com>
Closes#17876 from cloud-fan/minor.
## What changes were proposed in this pull request?
In filter estimation, we update column stats for those columns in filter condition. However, if the number of rows decreases after the filter (i.e. the overall selectivity is less than 1), we need to update (scale down) the number of distinct values (NDV) for all columns, no matter they are in filter conditions or not.
This pr also fixes the inconsistency of rounding mode for ndv and rowCount.
## How was this patch tested?
Added new tests.
Author: wangzhenhua <wangzhenhua@huawei.com>
Closes#17918 from wzhfy/scaleDownNdvAfterFilter.
## What changes were proposed in this pull request?
In `CheckAnalysis`, we should call `checkAnalysis` for `ScalarSubquery` at the beginning, as later we will call `plan.output` which is invalid if `plan` is not resolved.
## How was this patch tested?
new regression test
Author: Wenchen Fan <wenchen@databricks.com>
Closes#17930 from cloud-fan/tmp.
## What changes were proposed in this pull request?
When registering Scala UDF, we can know if the udf will return nullable value or not. `ScalaUDF` and related classes should handle the nullability.
## How was this patch tested?
Existing tests.
Author: Takuya UESHIN <ueshin@databricks.com>
Closes#17911 from ueshin/issues/SPARK-20668.
## What changes were proposed in this pull request?
The query
```
SELECT 1 FROM (SELECT COUNT(*) WHERE FALSE) t1
```
should return a single row of output because the subquery is an aggregate without a group-by and thus should return a single row. However, Spark incorrectly returns zero rows.
This is caused by SPARK-16208 / #13906, a patch which added an optimizer rule to propagate EmptyRelation through operators. The logic for handling aggregates is wrong: it checks whether aggregate expressions are non-empty for deciding whether the output should be empty, whereas it should be checking grouping expressions instead:
An aggregate with non-empty grouping expression will return one output row per group. If the input to the grouped aggregate is empty then all groups will be empty and thus the output will be empty. It doesn't matter whether the aggregation output columns include aggregate expressions since that won't affect the number of output rows.
If the grouping expressions are empty, however, then the aggregate will always produce a single output row and thus we cannot propagate the EmptyRelation.
The current implementation is incorrect and also misses an optimization opportunity by not propagating EmptyRelation in the case where a grouped aggregate has aggregate expressions (in other words, `SELECT COUNT(*) from emptyRelation GROUP BY x` would _not_ be optimized to `EmptyRelation` in the old code, even though it safely could be).
This patch resolves this issue by modifying `PropagateEmptyRelation` to consider only the presence/absence of grouping expressions, not the aggregate functions themselves, when deciding whether to propagate EmptyRelation.
## How was this patch tested?
- Added end-to-end regression tests in `SQLQueryTest`'s `group-by.sql` file.
- Updated unit tests in `PropagateEmptyRelationSuite`.
Author: Josh Rosen <joshrosen@databricks.com>
Closes#17929 from JoshRosen/fix-PropagateEmptyRelation.
## What changes were proposed in this pull request?
Any Dataset/DataFrame batch query with the operation `withWatermark` does not execute because the batch planner does not have any rule to explicitly handle the EventTimeWatermark logical plan.
The right solution is to simply remove the plan node, as the watermark should not affect any batch query in any way.
Changes:
- In this PR, we add a new rule `EliminateEventTimeWatermark` to check if we need to ignore the event time watermark. We will ignore watermark in any batch query.
Depends upon:
- [SPARK-20672](https://issues.apache.org/jira/browse/SPARK-20672). We can not add this rule into analyzer directly, because streaming query will be copied to `triggerLogicalPlan ` in every trigger, and the rule will be applied to `triggerLogicalPlan` mistakenly.
Others:
- A typo fix in example.
## How was this patch tested?
add new unit test.
Author: uncleGen <hustyugm@gmail.com>
Closes#17896 from uncleGen/SPARK-20373.
## What changes were proposed in this pull request?
This pr added parsing rules to support aliases in table value functions.
## How was this patch tested?
Added tests in `PlanParserSuite`.
Author: Takeshi Yamamuro <yamamuro@apache.org>
Closes#17666 from maropu/SPARK-20311.
## What changes were proposed in this pull request?
So far, we do not drop all the cataloged objects after each package. Sometimes, we might hit strange test case errors because the previous test suite did not drop the cataloged/temporary objects (tables/functions/database). At least, we can first clean up the environment when completing the package of `sql/core` and `sql/hive`.
## How was this patch tested?
N/A
Author: Xiao Li <gatorsmile@gmail.com>
Closes#17908 from gatorsmile/reset.
### What changes were proposed in this pull request?
Table comment was not getting set/unset using **ALTER TABLE SET/UNSET TBLPROPERTIES** query
eg: ALTER TABLE table_with_comment SET TBLPROPERTIES("comment"= "modified comment)
when user alter the table properties and adds/updates table comment,table comment which is a field of **CatalogTable** instance is not getting updated and old table comment if exists was shown to user, inorder to handle this issue, update the comment field value in **CatalogTable** with the newly added/modified comment along with other table level properties when user executes **ALTER TABLE SET TBLPROPERTIES** query.
This pr has also taken care of unsetting the table comment when user executes query **ALTER TABLE UNSET TBLPROPERTIES** inorder to unset or remove table comment.
eg: ALTER TABLE table_comment UNSET TBLPROPERTIES IF EXISTS ('comment')
### How was this patch tested?
Added test cases as part of **SQLQueryTestSuite** for verifying table comment using desc formatted table query after adding/modifying table comment as part of **AlterTableSetPropertiesCommand** and unsetting the table comment using **AlterTableUnsetPropertiesCommand**.
Author: sujith71955 <sujithchacko.2010@gmail.com>
Closes#17649 from sujith71955/alter_table_comment.
## What changes were proposed in this pull request?
This change allows timestamps in parquet-based hive table to behave as a "floating time", without a timezone, as timestamps are for other file formats. If the storage timezone is the same as the session timezone, this conversion is a no-op. When data is read from a hive table, the table property is *always* respected. This allows spark to not change behavior when reading old data, but read newly written data correctly (whatever the source of the data is).
Spark inherited the original behavior from Hive, but Hive is also updating behavior to use the same scheme in HIVE-12767 / HIVE-16231.
The default for Spark remains unchanged; created tables do not include the new table property.
This will only apply to hive tables; nothing is added to parquet metadata to indicate the timezone, so data that is read or written directly from parquet files will never have any conversions applied.
## How was this patch tested?
Added a unit test which creates tables, reads and writes data, under a variety of permutations (different storage timezones, different session timezones, vectorized reading on and off).
Author: Imran Rashid <irashid@cloudera.com>
Closes#16781 from squito/SPARK-12297.
## What changes were proposed in this pull request?
* Docs are consistent (across different `unix_timestamp` variants and their internal expressions)
* typo hunting
## How was this patch tested?
local build
Author: Jacek Laskowski <jacek@japila.pl>
Closes#17801 from jaceklaskowski/unix_timestamp.
## What changes were proposed in this pull request?
Due to a likely typo, the logDebug msg printing the diff of query plans shows a diff to the initial plan, not diff to the start of batch.
## How was this patch tested?
Now the debug message prints the diff between start and end of batch.
Author: Juliusz Sompolski <julek@databricks.com>
Closes#17875 from juliuszsompolski/SPARK-20616.
## What changes were proposed in this pull request?
We allow users to specify hints (currently only "broadcast" is supported) in SQL and DataFrame. However, while SQL has a standard hint format (/*+ ... */), DataFrame doesn't have one and sometimes users are confused that they can't find how to apply a broadcast hint. This ticket adds a generic hint function on DataFrame that allows using the same hint on DataFrames as well as SQL.
As an example, after this patch, the following will apply a broadcast hint on a DataFrame using the new hint function:
```
df1.join(df2.hint("broadcast"))
```
## How was this patch tested?
Added a test case in DataFrameJoinSuite.
Author: Reynold Xin <rxin@databricks.com>
Closes#17839 from rxin/SPARK-20576.
## What changes were proposed in this pull request?
Fix build warnings primarily related to Breeze 0.13 operator changes, Java style problems
## How was this patch tested?
Existing tests
Author: Sean Owen <sowen@cloudera.com>
Closes#17803 from srowen/SPARK-20523.
## What changes were proposed in this pull request?
A fix for the same problem was made in #17693 but ignored `JsonToStructs`. This PR uses the same fix for `JsonToStructs`.
## How was this patch tested?
Regression test
Author: Burak Yavuz <brkyvz@gmail.com>
Closes#17826 from brkyvz/SPARK-20549.
## What changes were proposed in this pull request?
Add support for the SQL standard distinct predicate to SPARK SQL.
```
<expression> IS [NOT] DISTINCT FROM <expression>
```
## How was this patch tested?
Tested using unit tests, integration tests, manual tests.
Author: ptkool <michael.styles@shopify.com>
Closes#17764 from ptkool/is_not_distinct_from.
## What changes were proposed in this pull request?
Generate exec does not produce `null` values if the generator for the input row is empty and the generate operates in outer mode without join. This is caused by the fact that the `join=false` code path is different from the `join=true` code path, and that the `join=false` code path did deal with outer properly. This PR addresses this issue.
## How was this patch tested?
Updated `outer*` tests in `GeneratorFunctionSuite`.
Author: Herman van Hovell <hvanhovell@databricks.com>
Closes#17810 from hvanhovell/SPARK-20534.
## What changes were proposed in this pull request?
Currently, when the type string is invalid, it looks printing empty parentheses. This PR proposes a small improvement in an error message by removing it in the parse as below:
```scala
spark.range(1).select($"col".cast("aa"))
```
**Before**
```
org.apache.spark.sql.catalyst.parser.ParseException:
DataType aa() is not supported.(line 1, pos 0)
== SQL ==
aa
^^^
```
**After**
```
org.apache.spark.sql.catalyst.parser.ParseException:
DataType aa is not supported.(line 1, pos 0)
== SQL ==
aa
^^^
```
## How was this patch tested?
Unit tests in `DataTypeParserSuite`.
Author: hyukjinkwon <gurwls223@gmail.com>
Closes#17784 from HyukjinKwon/SPARK-20492.
## What changes were proposed in this pull request?
This PR proposes to fill up the documentation with examples for `bitwiseOR`, `bitwiseAND`, `bitwiseXOR`. `contains`, `asc` and `desc` in `Column` API.
Also, this PR fixes minor typos in the documentation and matches some of the contents between Scala doc and Python doc.
Lastly, this PR suggests to use `spark` rather than `sc` in doc tests in `Column` for Python documentation.
## How was this patch tested?
Doc tests were added and manually tested with the commands below:
`./python/run-tests.py --module pyspark-sql`
`./python/run-tests.py --module pyspark-sql --python-executable python3`
`./dev/lint-python`
Output was checked via `make html` under `./python/docs`. The snapshots will be left on the codes with comments.
Author: hyukjinkwon <gurwls223@gmail.com>
Closes#17737 from HyukjinKwon/SPARK-20442.
## What changes were proposed in this pull request?
This pr added a new rule in `Analyzer` to resolve aliases in `GROUP BY`.
The current master throws an exception if `GROUP BY` clauses have aliases in `SELECT`;
```
scala> spark.sql("select a a1, a1 + 1 as b, count(1) from t group by a1")
org.apache.spark.sql.AnalysisException: cannot resolve '`a1`' given input columns: [a]; line 1 pos 51;
'Aggregate ['a1], [a#83L AS a1#87L, ('a1 + 1) AS b#88, count(1) AS count(1)#90L]
+- SubqueryAlias t
+- Project [id#80L AS a#83L]
+- Range (0, 10, step=1, splits=Some(8))
at org.apache.spark.sql.catalyst.analysis.package$AnalysisErrorAt.failAnalysis(package.scala:42)
at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1$$anonfun$apply$2.applyOrElse(CheckAnalysis.scala:77)
at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1$$anonfun$apply$2.applyOrElse(CheckAnalysis.scala:74)
at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:289)
```
## How was this patch tested?
Added tests in `SQLQuerySuite` and `SQLQueryTestSuite`.
Author: Takeshi Yamamuro <yamamuro@apache.org>
Closes#17191 from maropu/SPARK-14471.
## What changes were proposed in this pull request?
Relax the requirement that a `TimeZoneAwareExpression` has to have its `timeZoneId` set to be considered resolved.
With this change, a `Cast` (which is a `TimeZoneAwareExpression`) can be considered resolved if the `(fromType, toType)` combination doesn't require time zone information.
Also de-relaxed test cases in `CastSuite` so Casts in that test suite don't get a default`timeZoneId = Option("GMT")`.
## How was this patch tested?
Ran the de-relaxed`CastSuite` and it's passing. Also ran the SQL unit tests and they're passing too.
Author: Kris Mok <kris.mok@databricks.com>
Closes#17777 from rednaxelafx/fix-catalyst-cast-timezone.
## What changes were proposed in this pull request?
change to using Jackson's `com.fasterxml.jackson.core.JsonFactory`
public JsonParser createParser(String content)
## How was this patch tested?
existing unit tests
Please review http://spark.apache.org/contributing.html before opening a pull request.
Author: Eric Wasserman <ericw@sgn.com>
Closes#17693 from ewasserman/SPARK-20314.
## What changes were proposed in this pull request?
This patch adds support for customizing the spark session by injecting user-defined custom extensions. This allows a user to add custom analyzer rules/checks, optimizer rules, planning strategies or even a customized parser.
## How was this patch tested?
Unit Tests in SparkSessionExtensionSuite
Author: Sameer Agarwal <sameerag@cs.berkeley.edu>
Closes#17724 from sameeragarwal/session-extensions.
## What changes were proposed in this pull request?
This PR avoids an exception in the case where `scala.math.BigInt` has a value that does not fit into long value range (e.g. `Long.MAX_VALUE+1`). When we run the following code by using the current Spark, the following exception is thrown.
This PR keeps the value using `BigDecimal` if we detect such an overflow case by catching `ArithmeticException`.
Sample program:
```
case class BigIntWrapper(value:scala.math.BigInt)```
spark.createDataset(BigIntWrapper(scala.math.BigInt("10000000000000000002"))::Nil).show
```
Exception:
```
Error while encoding: java.lang.ArithmeticException: BigInteger out of long range
staticinvoke(class org.apache.spark.sql.types.Decimal$, DecimalType(38,0), apply, assertnotnull(assertnotnull(input[0, org.apache.spark.sql.BigIntWrapper, true])).value, true) AS value#0
java.lang.RuntimeException: Error while encoding: java.lang.ArithmeticException: BigInteger out of long range
staticinvoke(class org.apache.spark.sql.types.Decimal$, DecimalType(38,0), apply, assertnotnull(assertnotnull(input[0, org.apache.spark.sql.BigIntWrapper, true])).value, true) AS value#0
at org.apache.spark.sql.catalyst.encoders.ExpressionEncoder.toRow(ExpressionEncoder.scala:290)
at org.apache.spark.sql.SparkSession$$anonfun$2.apply(SparkSession.scala:454)
at org.apache.spark.sql.SparkSession$$anonfun$2.apply(SparkSession.scala:454)
at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
at scala.collection.immutable.List.foreach(List.scala:381)
at scala.collection.TraversableLike$class.map(TraversableLike.scala:234)
at scala.collection.immutable.List.map(List.scala:285)
at org.apache.spark.sql.SparkSession.createDataset(SparkSession.scala:454)
at org.apache.spark.sql.Agg$$anonfun$18.apply$mcV$sp(MySuite.scala:192)
at org.apache.spark.sql.Agg$$anonfun$18.apply(MySuite.scala:192)
at org.apache.spark.sql.Agg$$anonfun$18.apply(MySuite.scala:192)
at org.scalatest.Transformer$$anonfun$apply$1.apply$mcV$sp(Transformer.scala:22)
at org.scalatest.OutcomeOf$class.outcomeOf(OutcomeOf.scala:85)
at org.scalatest.OutcomeOf$.outcomeOf(OutcomeOf.scala:104)
at org.scalatest.Transformer.apply(Transformer.scala:22)
at org.scalatest.Transformer.apply(Transformer.scala:20)
at org.scalatest.FunSuiteLike$$anon$1.apply(FunSuiteLike.scala:166)
at org.apache.spark.SparkFunSuite.withFixture(SparkFunSuite.scala:68)
at org.scalatest.FunSuiteLike$class.invokeWithFixture$1(FunSuiteLike.scala:163)
at org.scalatest.FunSuiteLike$$anonfun$runTest$1.apply(FunSuiteLike.scala:175)
at org.scalatest.FunSuiteLike$$anonfun$runTest$1.apply(FunSuiteLike.scala:175)
at org.scalatest.SuperEngine.runTestImpl(Engine.scala:306)
at org.scalatest.FunSuiteLike$class.runTest(FunSuiteLike.scala:175)
...
Caused by: java.lang.ArithmeticException: BigInteger out of long range
at java.math.BigInteger.longValueExact(BigInteger.java:4531)
at org.apache.spark.sql.types.Decimal.set(Decimal.scala:140)
at org.apache.spark.sql.types.Decimal$.apply(Decimal.scala:434)
at org.apache.spark.sql.types.Decimal.apply(Decimal.scala)
at org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection.apply(Unknown Source)
at org.apache.spark.sql.catalyst.encoders.ExpressionEncoder.toRow(ExpressionEncoder.scala:287)
... 59 more
```
## How was this patch tested?
Add new test suite into `DecimalSuite`
Author: Kazuaki Ishizaki <ishizaki@jp.ibm.com>
Closes#17684 from kiszk/SPARK-20341.
## What changes were proposed in this pull request?
If a partitionSpec is supposed to not contain optional values, a ParseException should be thrown, and not nulls returned.
The nulls can later cause NullPointerExceptions in places not expecting them.
## How was this patch tested?
A query like "SHOW PARTITIONS tbl PARTITION(col1='val1', col2)" used to throw a NullPointerException.
Now it throws a ParseException.
Author: Juliusz Sompolski <julek@databricks.com>
Closes#17707 from juliuszsompolski/SPARK-20412.
## What changes were proposed in this pull request?
It is often useful to be able to track changes to the `ExternalCatalog`. This PR makes the `ExternalCatalog` emit events when a catalog object is changed. Events are fired before and after the change.
The following events are fired per object:
- Database
- CreateDatabasePreEvent: event fired before the database is created.
- CreateDatabaseEvent: event fired after the database has been created.
- DropDatabasePreEvent: event fired before the database is dropped.
- DropDatabaseEvent: event fired after the database has been dropped.
- Table
- CreateTablePreEvent: event fired before the table is created.
- CreateTableEvent: event fired after the table has been created.
- RenameTablePreEvent: event fired before the table is renamed.
- RenameTableEvent: event fired after the table has been renamed.
- DropTablePreEvent: event fired before the table is dropped.
- DropTableEvent: event fired after the table has been dropped.
- Function
- CreateFunctionPreEvent: event fired before the function is created.
- CreateFunctionEvent: event fired after the function has been created.
- RenameFunctionPreEvent: event fired before the function is renamed.
- RenameFunctionEvent: event fired after the function has been renamed.
- DropFunctionPreEvent: event fired before the function is dropped.
- DropFunctionPreEvent: event fired after the function has been dropped.
The current events currently only contain the names of the object modified. We add more events, and more details at a later point.
A user can monitor changes to the external catalog by adding a listener to the Spark listener bus checking for `ExternalCatalogEvent`s using the `SparkListener.onOtherEvent` hook. A more direct approach is add listener directly to the `ExternalCatalog`.
## How was this patch tested?
Added the `ExternalCatalogEventSuite`.
Author: Herman van Hovell <hvanhovell@databricks.com>
Closes#17710 from hvanhovell/SPARK-20420.
## What changes were proposed in this pull request?
A cast expression with a resolved time zone is not equal to a cast expression without a resolved time zone. The `ResolveAggregateFunction` assumed that these expression were the same, and would fail to resolve `HAVING` clauses which contain a `Cast` expression.
This is in essence caused by the fact that a `TimeZoneAwareExpression` can be resolved without a set time zone. This PR fixes this, and makes a `TimeZoneAwareExpression` unresolved as long as it has no TimeZone set.
## How was this patch tested?
Added a regression test to the `SQLQueryTestSuite.having` file.
Author: Herman van Hovell <hvanhovell@databricks.com>
Closes#17641 from hvanhovell/SPARK-20329.
## What changes were proposed in this pull request?
Address a follow up in [comment](https://github.com/apache/spark/pull/16954#discussion_r105718880)
Currently subqueries with correlated predicates containing aggregate expression having mixture of outer references and local references generate a codegen error like following :
```SQL
SELECT t1a
FROM t1
GROUP BY 1
HAVING EXISTS (SELECT 1
FROM t2
WHERE t2a < min(t1a + t2a));
```
Exception snippet.
```
Cannot evaluate expression: min((input[0, int, false] + input[4, int, false]))
at org.apache.spark.sql.catalyst.expressions.Unevaluable$class.doGenCode(Expression.scala:226)
at org.apache.spark.sql.catalyst.expressions.aggregate.AggregateExpression.doGenCode(interfaces.scala:87)
at org.apache.spark.sql.catalyst.expressions.Expression$$anonfun$genCode$2.apply(Expression.scala:106)
at org.apache.spark.sql.catalyst.expressions.Expression$$anonfun$genCode$2.apply(Expression.scala:103)
at scala.Option.getOrElse(Option.scala:121)
at org.apache.spark.sql.catalyst.expressions.Expression.genCode(Expression.scala:103)
```
After this PR, a better error message is issued.
```
org.apache.spark.sql.AnalysisException
Error in query: Found an aggregate expression in a correlated
predicate that has both outer and local references, which is not supported yet.
Aggregate expression: min((t1.`t1a` + t2.`t2a`)),
Outer references: t1.`t1a`,
Local references: t2.`t2a`.;
```
## How was this patch tested?
Added tests in SQLQueryTestSuite.
Author: Dilip Biswal <dbiswal@us.ibm.com>
Closes#17636 from dilipbiswal/subquery_followup1.
## What changes were proposed in this pull request?
It's illegal to have aggregate function in GROUP BY, and we should fail at analysis phase, if this happens.
## How was this patch tested?
new regression test
Author: Wenchen Fan <wenchen@databricks.com>
Closes#17704 from cloud-fan/minor.
### What changes were proposed in this pull request?
Database and Table names conform the Hive standard ("[a-zA-z_0-9]+"), i.e. if this name only contains characters, numbers, and _.
When calling `toLowerCase` on the names, we should add `Locale.ROOT` to the `toLowerCase`for avoiding inadvertent locale-sensitive variation in behavior (aka the "Turkish locale problem").
### How was this patch tested?
Added a test case
Author: Xiao Li <gatorsmile@gmail.com>
Closes#17655 from gatorsmile/locale.
## What changes were proposed in this pull request?
Also went through the same file to ensure other string concatenation are correct.
## How was this patch tested?
Jenkins
Author: Shixiong Zhu <shixiong@databricks.com>
Closes#17691 from zsxwing/fix-error-message.
## What changes were proposed in this pull request?
Apply Complementation Laws during boolean expression simplification.
## How was this patch tested?
Tested using unit tests, integration tests, and manual tests.
Author: ptkool <michael.styles@shopify.com>
Author: Michael Styles <michael.styles@shopify.com>
Closes#17650 from ptkool/apply_complementation_laws.
Avoid necessary execution that can lead to NPE in EliminateOuterJoin and add test in DataFrameSuite to confirm NPE is no longer thrown
## What changes were proposed in this pull request?
Change leftHasNonNullPredicate and rightHasNonNullPredicate to lazy so they are only executed when needed.
## How was this patch tested?
Added test in DataFrameSuite that failed before this fix and now succeeds. Note that a test in catalyst project would be better but i am unsure how to do this.
Please review http://spark.apache.org/contributing.html before opening a pull request.
Author: Koert Kuipers <koert@tresata.com>
Closes#17660 from koertkuipers/feat-catch-npe-in-eliminate-outer-join.
## What changes were proposed in this pull request?
If a plan has multi-level successive joins, e.g.:
```
Join
/ \
Union t5
/ \
Join t4
/ \
Join t3
/ \
t1 t2
```
Currently we fail to reorder the inside joins, i.e. t1, t2, t3.
In join reorder, we use `OrderedJoin` to indicate a join has been ordered, such that when transforming down the plan, these joins don't need to be rerodered again.
But there's a problem in the definition of `OrderedJoin`:
The real join node is a parameter, but not a child. This breaks the transform procedure because `mapChildren` applies transform function on parameters which should be children.
In this patch, we change `OrderedJoin` to a class having the same structure as a join node.
## How was this patch tested?
Add a corresponding test case.
Author: wangzhenhua <wangzhenhua@huawei.com>
Closes#17668 from wzhfy/recursiveReorder.
## What changes were proposed in this pull request?
fix typo
## How was this patch tested?
manual
Author: Felix Cheung <felixcheung_m@hotmail.com>
Closes#17663 from felixcheung/likedoctypo.
## What changes were proposed in this pull request?
Replace non-existent `repartitionBy` with `distribute` in `CollapseRepartitionSuite`.
## How was this patch tested?
local build and `catalyst/testOnly *CollapseRepartitionSuite`
Author: Jacek Laskowski <jacek@japila.pl>
Closes#17657 from jaceklaskowski/CollapseRepartitionSuite.
## What changes were proposed in this pull request?
This patch fixes a bug in the way LIKE patterns are translated to Java regexes. The bug causes any character following an escaped backslash to be escaped, i.e. there is double-escaping.
A concrete example is the following pattern:`'%\\%'`. The expected Java regex that this pattern should correspond to (according to the behavior described below) is `'.*\\.*'`, however the current situation leads to `'.*\\%'` instead.
---
Update: in light of the discussion that ensued, we should explicitly define the expected behaviour of LIKE expressions, especially in certain edge cases. With the help of gatorsmile, we put together a list of different RDBMS and their variations wrt to certain standard features.
| RDBMS\Features | Wildcards | Default escape [1] | Case sensitivity |
| --- | --- | --- | --- |
| [MS SQL Server](https://msdn.microsoft.com/en-us/library/ms179859.aspx) | _, %, [], [^] | none | no |
| [Oracle](https://docs.oracle.com/cd/B12037_01/server.101/b10759/conditions016.htm) | _, % | none | yes |
| [DB2 z/OS](http://www.ibm.com/support/knowledgecenter/SSEPEK_11.0.0/sqlref/src/tpc/db2z_likepredicate.html) | _, % | none | yes |
| [MySQL](http://dev.mysql.com/doc/refman/5.7/en/string-comparison-functions.html) | _, % | none | no |
| [PostreSQL](https://www.postgresql.org/docs/9.0/static/functions-matching.html) | _, % | \ | yes |
| [Hive](https://cwiki.apache.org/confluence/display/Hive/LanguageManual+UDF) | _, % | none | yes |
| Current Spark | _, % | \ | yes |
[1] Default escape character: most systems do not have a default escape character, instead the user can specify one by calling a like expression with an escape argument [A] LIKE [B] ESCAPE [C]. This syntax is currently not supported by Spark, however I would volunteer to implement this feature in a separate ticket.
The specifications are often quite terse and certain scenarios are undocumented, so here is a list of scenarios that I am uncertain about and would appreciate any input. Specifically I am looking for feedback on whether or not Spark's current behavior should be changed.
1. [x] Ending a pattern with the escape sequence, e.g. `like 'a\'`.
PostreSQL gives an error: 'LIKE pattern must not end with escape character', which I personally find logical. Currently, Spark allows "non-terminated" escapes and simply ignores them as part of the pattern.
According to [DB2's documentation](http://www.ibm.com/support/knowledgecenter/SSEPGG_9.7.0/com.ibm.db2.luw.messages.sql.doc/doc/msql00130n.html), ending a pattern in an escape character is invalid.
_Proposed new behaviour in Spark: throw AnalysisException_
2. [x] Empty input, e.g. `'' like ''`
Postgres and DB2 will match empty input only if the pattern is empty as well, any other combination of empty input will not match. Spark currently follows this rule.
3. [x] Escape before a non-special character, e.g. `'a' like '\a'`.
Escaping a non-wildcard character is not really documented but PostgreSQL just treats it verbatim, which I also find the least surprising behavior. Spark does the same.
According to [DB2's documentation](http://www.ibm.com/support/knowledgecenter/SSEPGG_9.7.0/com.ibm.db2.luw.messages.sql.doc/doc/msql00130n.html), it is invalid to follow an escape character with anything other than an escape character, an underscore or a percent sign.
_Proposed new behaviour in Spark: throw AnalysisException_
The current specification is also described in the operator's source code in this patch.
## How was this patch tested?
Extra case in regex unit tests.
Author: Jakob Odersky <jakob@odersky.com>
This patch had conflicts when merged, resolved by
Committer: Reynold Xin <rxin@databricks.com>
Closes#15398 from jodersky/SPARK-17647.
### What changes were proposed in this pull request?
The session catalog caches some persistent functions in the `FunctionRegistry`, so there can be duplicates. Our Catalog API `listFunctions` does not handle it.
It would be better if `SessionCatalog` API can de-duplciate the records, instead of doing it by each API caller. In `FunctionRegistry`, our functions are identified by the unquoted string. Thus, this PR is try to parse it using our parser interface and then de-duplicate the names.
### How was this patch tested?
Added test cases.
Author: Xiao Li <gatorsmile@gmail.com>
Closes#17646 from gatorsmile/showFunctions.
## What changes were proposed in this pull request?
In https://github.com/apache/spark/pull/17398 we introduced `UnresolvedMapObjects` as a placeholder of `MapObjects`. Unfortunately `UnresolvedMapObjects` is not serializable as its `function` may reference Scala `Type` which is not serializable.
Ideally this is fine, as we will never serialize and send unresolved expressions to executors. However users may accidentally do this, e.g. mistakenly reference an encoder instance when implementing `Aggregator`, we should fix it so that it's just a performance issue(more network traffic) and should not fail the query.
## How was this patch tested?
N/A
Author: Wenchen Fan <wenchen@databricks.com>
Closes#17639 from cloud-fan/minor.
## What changes were proposed in this pull request?
Currently when estimating predicates like col > literal or col = literal, we will update min or max in column stats based on literal value. However, literal value is of Catalyst type (internal type), while min/max is of external type. Then for the next predicate, we again need to do type conversion to compare and update column stats. This is awkward and causes many unnecessary conversions in estimation.
To solve this, we use Catalyst type for min/max in `ColumnStat`. Note that the persistent format in metastore is still of external type, so there's no inconsistency for statistics in metastore.
This pr also fixes a bug for boolean type in `IN` condition.
## How was this patch tested?
The changes for ColumnStat are covered by existing tests.
For bug fix, a new test for boolean type in IN condition is added
Author: wangzhenhua <wangzhenhua@huawei.com>
Closes#17630 from wzhfy/refactorColumnStat.
## What changes were proposed in this pull request?
AssertNotNull's toString/simpleString dumps the entire walkedTypePath. walkedTypePath is used for error message reporting and shouldn't be part of the output.
## How was this patch tested?
Manually tested.
Author: Reynold Xin <rxin@databricks.com>
Closes#17616 from rxin/SPARK-20304.
### What changes were proposed in this pull request?
Session catalog API `createTempFunction` is being used by Hive build-in functions, persistent functions, and temporary functions. Thus, the name is confusing. This PR is to rename it by `registerFunction`. Also we can move construction of `FunctionBuilder` and `ExpressionInfo` into the new `registerFunction`, instead of duplicating the logics everywhere.
In the next PRs, the remaining Function-related APIs also need cleanups.
### How was this patch tested?
Existing test cases.
Author: Xiao Li <gatorsmile@gmail.com>
Closes#17615 from gatorsmile/cleanupCreateTempFunction.
## What changes were proposed in this pull request?
This PR proposes to run Spark unidoc to test Javadoc 8 build as Javadoc 8 is easily re-breakable.
There are several problems with it:
- It introduces little extra bit of time to run the tests. In my case, it took 1.5 mins more (`Elapsed :[94.8746569157]`). How it was tested is described in "How was this patch tested?".
- > One problem that I noticed was that Unidoc appeared to be processing test sources: if we can find a way to exclude those from being processed in the first place then that might significantly speed things up.
(see joshrosen's [comment](https://issues.apache.org/jira/browse/SPARK-18692?focusedCommentId=15947627&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-15947627))
To complete this automated build, It also suggests to fix existing Javadoc breaks / ones introduced by test codes as described above.
There fixes are similar instances that previously fixed. Please refer https://github.com/apache/spark/pull/15999 and https://github.com/apache/spark/pull/16013
Note that this only fixes **errors** not **warnings**. Please see my observation https://github.com/apache/spark/pull/17389#issuecomment-288438704 for spurious errors by warnings.
## How was this patch tested?
Manually via `jekyll build` for building tests. Also, tested via running `./dev/run-tests`.
This was tested via manually adding `time.time()` as below:
```diff
profiles_and_goals = build_profiles + sbt_goals
print("[info] Building Spark unidoc (w/Hive 1.2.1) using SBT with these arguments: ",
" ".join(profiles_and_goals))
+ import time
+ st = time.time()
exec_sbt(profiles_and_goals)
+ print("Elapsed :[%s]" % str(time.time() - st))
```
produces
```
...
========================================================================
Building Unidoc API Documentation
========================================================================
...
[info] Main Java API documentation successful.
...
Elapsed :[94.8746569157]
...
Author: hyukjinkwon <gurwls223@gmail.com>
Closes#17477 from HyukjinKwon/SPARK-18692.
## What changes were proposed in this pull request?
Update count distinct error message for streaming datasets/dataframes to match current behavior. These aggregations are not yet supported, regardless of whether the dataset/dataframe is aggregated.
Author: jtoka <jason.tokayer@gmail.com>
Closes#17609 from jtoka/master.
## What changes were proposed in this pull request?
When we perform a cast expression and the from and to types are structurally the same (having the same structure but different field names), we should be able to skip the actual cast.
## How was this patch tested?
Added unit tests for the newly introduced functions.
Author: Reynold Xin <rxin@databricks.com>
Closes#17614 from rxin/SPARK-20302.
## What changes were proposed in this pull request?
`NaNvl(float value, null)` will be converted into `NaNvl(float value, Cast(null, DoubleType))` and finally `NaNvl(Cast(float value, DoubleType), Cast(null, DoubleType))`.
This will cause mismatching in the output type when the input type is float.
By adding extra rule in TypeCoercion can resolve this issue.
## How was this patch tested?
unite tests.
Please review http://spark.apache.org/contributing.html before opening a pull request.
Author: DB Tsai <dbt@netflix.com>
Closes#17606 from dbtsai/fixNaNvl.
## What changes were proposed in this pull request?
Dataset typed API currently uses NewInstance to box primitive types (i.e. calling the constructor). Instead, it'd be slightly more idiomatic in Java to use PrimitiveType.valueOf, which can be invoked using StaticInvoke expression.
## How was this patch tested?
The change should be covered by existing tests for Dataset encoders.
Author: Reynold Xin <rxin@databricks.com>
Closes#17604 from rxin/SPARK-20289.
## What changes were proposed in this pull request?
Similar to `ListQuery`, `Exists` should not be evaluated in `Join` operator too.
## How was this patch tested?
Jenkins tests.
Please review http://spark.apache.org/contributing.html before opening a pull request.
Author: Liang-Chi Hsieh <viirya@gmail.com>
Closes#17491 from viirya/dont-push-exists-to-join.
## What changes were proposed in this pull request?
This is a regression caused by SPARK-19716.
Before SPARK-19716, we will cast an array field to the expected array type. However, after SPARK-19716, the cast is removed, but we forgot to push the cast to the element level.
## How was this patch tested?
new regression tests
Author: Wenchen Fan <wenchen@databricks.com>
Closes#17587 from cloud-fan/array.
## What changes were proposed in this pull request?
Add Locale.ROOT to internal calls to String `toLowerCase`, `toUpperCase`, to avoid inadvertent locale-sensitive variation in behavior (aka the "Turkish locale problem").
The change looks large but it is just adding `Locale.ROOT` (the locale with no country or language specified) to every call to these methods.
## How was this patch tested?
Existing tests.
Author: Sean Owen <sowen@cloudera.com>
Closes#17527 from srowen/SPARK-20156.
## What changes were proposed in this pull request?
```
sql("SELECT t1.b, rand(0) as r FROM cachedData, cachedData t1 GROUP BY t1.b having r > 0.5").show()
```
We will get the following error:
```
Job aborted due to stage failure: Task 1 in stage 4.0 failed 1 times, most recent failure: Lost task 1.0 in stage 4.0 (TID 8, localhost, executor driver): java.lang.NullPointerException
at org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificPredicate.eval(Unknown Source)
at org.apache.spark.sql.execution.joins.BroadcastNestedLoopJoinExec$$anonfun$org$apache$spark$sql$execution$joins$BroadcastNestedLoopJoinExec$$boundCondition$1.apply(BroadcastNestedLoopJoinExec.scala:87)
at org.apache.spark.sql.execution.joins.BroadcastNestedLoopJoinExec$$anonfun$org$apache$spark$sql$execution$joins$BroadcastNestedLoopJoinExec$$boundCondition$1.apply(BroadcastNestedLoopJoinExec.scala:87)
at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:463)
```
Filters could be pushed down to the join conditions by the optimizer rule `PushPredicateThroughJoin`. However, Analyzer [blocks users to add non-deterministics conditions](https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/CheckAnalysis.scala#L386-L395) (For details, see the PR https://github.com/apache/spark/pull/7535).
We should not push down non-deterministic conditions; otherwise, we need to explicitly initialize the non-deterministic expressions. This PR is to simply block it.
### How was this patch tested?
Added a test case
Author: Xiao Li <gatorsmile@gmail.com>
Closes#17585 from gatorsmile/joinRandCondition.
## What changes were proposed in this pull request?
This PR proposes to add `IGNORE NULLS` keyword in `first`/`last` in Spark's parser likewise http://docs.oracle.com/cd/B19306_01/server.102/b14200/functions057.htm. This simply maps the keywords to existing `ignoreNullsExpr`.
**Before**
```scala
scala> sql("select first('a' IGNORE NULLS)").show()
```
```
org.apache.spark.sql.catalyst.parser.ParseException:
extraneous input 'NULLS' expecting {')', ','}(line 1, pos 24)
== SQL ==
select first('a' IGNORE NULLS)
------------------------^^^
at org.apache.spark.sql.catalyst.parser.ParseException.withCommand(ParseDriver.scala:210)
at org.apache.spark.sql.catalyst.parser.AbstractSqlParser.parse(ParseDriver.scala:112)
at org.apache.spark.sql.execution.SparkSqlParser.parse(SparkSqlParser.scala:46)
at org.apache.spark.sql.catalyst.parser.AbstractSqlParser.parsePlan(ParseDriver.scala:66)
at org.apache.spark.sql.SparkSession.sql(SparkSession.scala:622)
... 48 elided
```
**After**
```scala
scala> sql("select first('a' IGNORE NULLS)").show()
```
```
+--------------+
|first(a, true)|
+--------------+
| a|
+--------------+
```
## How was this patch tested?
Unit tests in `ExpressionParserSuite`.
Author: hyukjinkwon <gurwls223@gmail.com>
Closes#17566 from HyukjinKwon/SPARK-19518.
## What changes were proposed in this pull request?
Like `Expression`, `QueryPlan` should also have a `semanticHash` method, then we can put plans to a hash map and look it up fast. This PR refactors `QueryPlan` to follow `Expression` and put all the normalization logic in `QueryPlan.canonicalized`, so that it's very natural to implement `semanticHash`.
follow-up: improve `CacheManager` to leverage this `semanticHash` and speed up plan lookup, instead of iterating all cached plans.
## How was this patch tested?
existing tests. Note that we don't need to test the `semanticHash` method, once the existing tests prove `sameResult` is correct, we are good.
Author: Wenchen Fan <wenchen@databricks.com>
Closes#17541 from cloud-fan/plan-semantic.
## What changes were proposed in this pull request?
AssertNotNull currently throws RuntimeException. It should throw NullPointerException, which is more specific.
## How was this patch tested?
N/A
Author: Reynold Xin <rxin@databricks.com>
Closes#17573 from rxin/SPARK-20262.
## What changes were proposed in this pull request?
Similar to `Project`, when `Aggregate` has non-deterministic expressions, we should not push predicate down through it, as it will change the number of input rows and thus change the evaluation result of non-deterministic expressions in `Aggregate`.
## How was this patch tested?
new regression test
Author: Wenchen Fan <wenchen@databricks.com>
Closes#17562 from cloud-fan/filter.
## What changes were proposed in this pull request?
Currently `LogicalRelation` has a `expectedOutputAttributes` parameter, which makes it hard to reason about what the actual output is. Like other leaf nodes, `LogicalRelation` should also take `output` as a parameter, to simplify the logic
## How was this patch tested?
existing tests
Author: Wenchen Fan <wenchen@databricks.com>
Closes#17552 from cloud-fan/minor.
## What changes were proposed in this pull request?
This is a tiny addendum to SPARK-19495 to remove the private visibility for copy, which is the only package private method in the entire file.
## How was this patch tested?
N/A - no semantic change.
Author: Reynold Xin <rxin@databricks.com>
Closes#17555 from rxin/SPARK-19495-2.
## What changes were proposed in this pull request?
This commit moves star schema code from ```join.scala``` to ```StarSchemaDetection.scala```. It also applies some minor fixes in ```StarJoinReorderSuite.scala```.
## How was this patch tested?
Run existing ```StarJoinReorderSuite.scala```.
Author: Ioana Delaney <ioanamdelaney@gmail.com>
Closes#17544 from ioana-delaney/starSchemaCBOv2.
## What changes were proposed in this pull request?
Make sure SESSION_LOCAL_TIMEZONE reflects the change in JVM's default timezone setting. Currently several timezone related tests fail as the change to default timezone is not picked up by SQLConf.
## How was this patch tested?
Added an unit test in ConfigEntrySuite
Author: Dilip Biswal <dbiswal@us.ibm.com>
Closes#17537 from dilipbiswal/timezone_debug.
## What changes were proposed in this pull request?
Previously when we construct deserializer expression for array type, we will first cast the corresponding field to expected array type and then apply `MapObjects`.
However, by doing that, we lose the opportunity to do by-name resolution for struct type inside array type. In this PR, I introduce a `UnresolvedMapObjects` to hold the lambda function and the input array expression. Then during analysis, after the input array expression is resolved, we get the actual array element type and apply by-name resolution. Then we don't need to add `Cast` for array type when constructing the deserializer expression, as the element type is determined later at analyzer.
## How was this patch tested?
new regression test
Author: Wenchen Fan <wenchen@databricks.com>
Closes#17398 from cloud-fan/dataset.
## What changes were proposed in this pull request?
This is a follow-up of https://github.com/apache/spark/pull/17285 .
## How was this patch tested?
existing tests
Author: Wenchen Fan <wenchen@databricks.com>
Closes#17521 from cloud-fan/conf.
### What changes were proposed in this pull request?
Observed by felixcheung , in `SparkSession`.`Catalog` APIs, we have different conventions/rules for table/function identifiers/names. Most APIs accept the qualified name (i.e., `databaseName`.`tableName` or `databaseName`.`functionName`). However, the following five APIs do not accept it.
- def listColumns(tableName: String): Dataset[Column]
- def getTable(tableName: String): Table
- def getFunction(functionName: String): Function
- def tableExists(tableName: String): Boolean
- def functionExists(functionName: String): Boolean
To make them consistent with the other Catalog APIs, this PR does the changes, updates the function/API comments and adds the `params` to clarify the inputs we allow.
### How was this patch tested?
Added the test cases .
Author: Xiao Li <gatorsmile@gmail.com>
Closes#17518 from gatorsmile/tableIdentifier.
### What changes were proposed in this pull request?
This PR is to unify and clean up the outputs of `DESC EXTENDED/FORMATTED` and `SHOW TABLE EXTENDED` by moving the logics into the Catalog interface. The output formats are improved. We also add the missing attributes. It impacts the DDL commands like `SHOW TABLE EXTENDED`, `DESC EXTENDED` and `DESC FORMATTED`.
In addition, by following what we did in Dataset API `printSchema`, we can use `treeString` to show the schema in the more readable way.
Below is the current way:
```
Schema: STRUCT<`a`: STRING (nullable = true), `b`: INT (nullable = true), `c`: STRING (nullable = true), `d`: STRING (nullable = true)>
```
After the change, it should look like
```
Schema: root
|-- a: string (nullable = true)
|-- b: integer (nullable = true)
|-- c: string (nullable = true)
|-- d: string (nullable = true)
```
### How was this patch tested?
`describe.sql` and `show-tables.sql`
Author: Xiao Li <gatorsmile@gmail.com>
Closes#17394 from gatorsmile/descFollowUp.
## What changes were proposed in this pull request?
**Description** from JIRA
The TimestampType in Spark SQL is of microsecond precision. Ideally, we should convert Spark SQL timestamp values into Parquet TIMESTAMP_MICROS. But unfortunately parquet-mr hasn't supported it yet.
For the read path, we should be able to read TIMESTAMP_MILLIS Parquet values and pad a 0 microsecond part to read values.
For the write path, currently we are writing timestamps as INT96, similar to Impala and Hive. One alternative is that, we can have a separate SQL option to let users be able to write Spark SQL timestamp values as TIMESTAMP_MILLIS. Of course, in this way the microsecond part will be truncated.
## How was this patch tested?
Added new tests in ParquetQuerySuite and ParquetIOSuite
Author: Dilip Biswal <dbiswal@us.ibm.com>
Closes#15332 from dilipbiswal/parquet-time-millis.
## What changes were proposed in this pull request?
In SQL queries, we also see predicate expressions involving two columns such as "column-1 (op) column-2" where column-1 and column-2 belong to same table. Note that, if column-1 and column-2 belong to different tables, then it is a join operator's work, NOT a filter operator's work.
This PR estimates filter selectivity on two columns of same table. For example, multiple tpc-h queries have this predicate "WHERE l_commitdate < l_receiptdate"
## How was this patch tested?
We added 6 new test cases to test various logical predicates involving two columns of same table.
Please review http://spark.apache.org/contributing.html before opening a pull request.
Author: Ron Hu <ron.hu@huawei.com>
Author: U-CHINA\r00754707 <r00754707@R00754707-SC04.china.huawei.com>
Closes#17415 from ron8hu/filterTwoColumns.
## What changes were proposed in this pull request?
Range in SQL should be case insensitive
## How was this patch tested?
unit test
Author: samelamin <hussam.elamin@gmail.com>
Author: samelamin <sam_elamin@discovery.com>
Closes#17487 from samelamin/SPARK-20145.
## What changes were proposed in this pull request?
This patch implements `listPartitionsByFilter()` for `InMemoryCatalog` and thus resolves an outstanding TODO causing the `PruneFileSourcePartitions` optimizer rule not to apply when "spark.sql.catalogImplementation" is set to "in-memory" (which is the default).
The change is straightforward: it extracts the code for further filtering of the list of partitions returned by the metastore's `getPartitionsByFilter()` out from `HiveExternalCatalog` into `ExternalCatalogUtils` and calls this new function from `InMemoryCatalog` on the whole list of partitions.
Now that this method is implemented we can always pass the `CatalogTable` to the `DataSource` in `FindDataSourceTable`, so that the latter is resolved to a relation with a `CatalogFileIndex`, which is what the `PruneFileSourcePartitions` rule matches for.
## How was this patch tested?
Ran existing tests and added new test for `listPartitionsByFilter` in `ExternalCatalogSuite`, which is subclassed by both `InMemoryCatalogSuite` and `HiveExternalCatalogSuite`.
Author: Adrian Ionescu <adrian@databricks.com>
Closes#17510 from adrian-ionescu/InMemoryCatalog.
## What changes were proposed in this pull request?
This PR proposes to use `XXX` format instead of `ZZ`. `ZZ` seems a `FastDateFormat` specific.
`ZZ` supports "ISO 8601 extended format time zones" but it seems `FastDateFormat` specific option.
I misunderstood this is compatible format with `SimpleDateFormat` when this change is introduced.
Please see [SimpleDateFormat documentation]( https://docs.oracle.com/javase/7/docs/api/java/text/SimpleDateFormat.html#iso8601timezone) and [FastDateFormat documentation](https://commons.apache.org/proper/commons-lang/apidocs/org/apache/commons/lang3/time/FastDateFormat.html).
It seems we better replace `ZZ` to `XXX` because they look using the same strategy - [FastDateParser.java#L930](8767cd4f1a/src/main/java/org/apache/commons/lang3/time/FastDateParser.java (L930)), [FastDateParser.java#L932-L951 ](8767cd4f1a/src/main/java/org/apache/commons/lang3/time/FastDateParser.java (L932-L951)) and [FastDateParser.java#L596-L601](8767cd4f1a/src/main/java/org/apache/commons/lang3/time/FastDateParser.java (L596-L601)).
I also checked the codes and manually debugged it for sure. It seems both cases use the same pattern `( Z|(?:[+-]\\d{2}(?::)\\d{2}))`.
_Note that this should be rather a fix about documentation and not the behaviour change because `ZZ` seems invalid date format in `SimpleDateFormat` as documented in `DataFrameReader` and etc, and both `ZZ` and `XXX` look identically working with `FastDateFormat`_
Current documentation is as below:
```
* <li>`timestampFormat` (default `yyyy-MM-dd'T'HH:mm:ss.SSSZZ`): sets the string that
* indicates a timestamp format. Custom date formats follow the formats at
* `java.text.SimpleDateFormat`. This applies to timestamp type.</li>
```
## How was this patch tested?
Existing tests should cover this. Also, manually tested as below (BTW, I don't think these are worth being added as tests within Spark):
**Parse**
```scala
scala> new java.text.SimpleDateFormat("yyyy-MM-dd'T'HH:mm:ss.SSSXXX").parse("2017-03-21T00:00:00.000-11:00")
res4: java.util.Date = Tue Mar 21 20:00:00 KST 2017
scala> new java.text.SimpleDateFormat("yyyy-MM-dd'T'HH:mm:ss.SSSXXX").parse("2017-03-21T00:00:00.000Z")
res10: java.util.Date = Tue Mar 21 09:00:00 KST 2017
scala> new java.text.SimpleDateFormat("yyyy-MM-dd'T'HH:mm:ss.SSSZZ").parse("2017-03-21T00:00:00.000-11:00")
java.text.ParseException: Unparseable date: "2017-03-21T00:00:00.000-11:00"
at java.text.DateFormat.parse(DateFormat.java:366)
... 48 elided
scala> new java.text.SimpleDateFormat("yyyy-MM-dd'T'HH:mm:ss.SSSZZ").parse("2017-03-21T00:00:00.000Z")
java.text.ParseException: Unparseable date: "2017-03-21T00:00:00.000Z"
at java.text.DateFormat.parse(DateFormat.java:366)
... 48 elided
```
```scala
scala> org.apache.commons.lang3.time.FastDateFormat.getInstance("yyyy-MM-dd'T'HH:mm:ss.SSSXXX").parse("2017-03-21T00:00:00.000-11:00")
res7: java.util.Date = Tue Mar 21 20:00:00 KST 2017
scala> org.apache.commons.lang3.time.FastDateFormat.getInstance("yyyy-MM-dd'T'HH:mm:ss.SSSXXX").parse("2017-03-21T00:00:00.000Z")
res1: java.util.Date = Tue Mar 21 09:00:00 KST 2017
scala> org.apache.commons.lang3.time.FastDateFormat.getInstance("yyyy-MM-dd'T'HH:mm:ss.SSSZZ").parse("2017-03-21T00:00:00.000-11:00")
res8: java.util.Date = Tue Mar 21 20:00:00 KST 2017
scala> org.apache.commons.lang3.time.FastDateFormat.getInstance("yyyy-MM-dd'T'HH:mm:ss.SSSZZ").parse("2017-03-21T00:00:00.000Z")
res2: java.util.Date = Tue Mar 21 09:00:00 KST 2017
```
**Format**
```scala
scala> new java.text.SimpleDateFormat("yyyy-MM-dd'T'HH:mm:ss.SSSXXX").format(new java.text.SimpleDateFormat("yyyy-MM-dd'T'HH:mm:ss.SSSXXX").parse("2017-03-21T00:00:00.000-11:00"))
res6: String = 2017-03-21T20:00:00.000+09:00
```
```scala
scala> val fd = org.apache.commons.lang3.time.FastDateFormat.getInstance("yyyy-MM-dd'T'HH:mm:ss.SSSZZ")
fd: org.apache.commons.lang3.time.FastDateFormat = FastDateFormat[yyyy-MM-dd'T'HH:mm:ss.SSSZZ,ko_KR,Asia/Seoul]
scala> fd.format(fd.parse("2017-03-21T00:00:00.000-11:00"))
res1: String = 2017-03-21T20:00:00.000+09:00
scala> val fd = org.apache.commons.lang3.time.FastDateFormat.getInstance("yyyy-MM-dd'T'HH:mm:ss.SSSXXX")
fd: org.apache.commons.lang3.time.FastDateFormat = FastDateFormat[yyyy-MM-dd'T'HH:mm:ss.SSSXXX,ko_KR,Asia/Seoul]
scala> fd.format(fd.parse("2017-03-21T00:00:00.000-11:00"))
res2: String = 2017-03-21T20:00:00.000+09:00
```
Author: hyukjinkwon <gurwls223@gmail.com>
Closes#17489 from HyukjinKwon/SPARK-20166.
## What changes were proposed in this pull request?
Currently, `DataType.fromJson` throws `scala.MatchError` or `java.util.NoSuchElementException` in some cases when the JSON input is invalid as below:
```scala
DataType.fromJson(""""abcd"""")
```
```
java.util.NoSuchElementException: key not found: abcd
at ...
```
```scala
DataType.fromJson("""{"abcd":"a"}""")
```
```
scala.MatchError: JObject(List((abcd,JString(a)))) (of class org.json4s.JsonAST$JObject)
at ...
```
```scala
DataType.fromJson("""{"fields": [{"a":123}], "type": "struct"}""")
```
```
scala.MatchError: JObject(List((a,JInt(123)))) (of class org.json4s.JsonAST$JObject)
at ...
```
After this PR,
```scala
DataType.fromJson(""""abcd"""")
```
```
java.lang.IllegalArgumentException: Failed to convert the JSON string 'abcd' to a data type.
at ...
```
```scala
DataType.fromJson("""{"abcd":"a"}""")
```
```
java.lang.IllegalArgumentException: Failed to convert the JSON string '{"abcd":"a"}' to a data type.
at ...
```
```scala
DataType.fromJson("""{"fields": [{"a":123}], "type": "struct"}""")
at ...
```
```
java.lang.IllegalArgumentException: Failed to convert the JSON string '{"a":123}' to a field.
```
## How was this patch tested?
Unit test added in `DataTypeSuite`.
Author: hyukjinkwon <gurwls223@gmail.com>
Closes#17468 from HyukjinKwon/fromjson_exception.