## What changes were proposed in this pull request?
Add a new function from_avro for parsing a binary column of avro format and converting it into its corresponding catalyst value.
Add a new function to_avro for converting a column into binary of avro format with the specified schema.
This PR is a work in progress; test cases will be added.
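A rough usage sketch of the two functions (hedged: the import location, exact signatures, and schema handling here are assumptions for illustration, not necessarily the final API):
```scala
// Illustrative only; the package and signatures are assumed, not confirmed by this PR text.
// Assumes a DataFrame `df` with a long column `id` and `spark.implicits._` imported.
import org.apache.spark.sql.avro._  // assumed home of from_avro / to_avro

val avroSchema = """{"type": "long"}"""  // Avro schema as a JSON string
val roundTripped = df
  .select(to_avro($"id").as("avro_bytes"))                // column -> Avro-encoded binary
  .select(from_avro($"avro_bytes", avroSchema).as("id"))  // binary -> Catalyst value
```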
## How was this patch tested?
Author: Gengliang Wang <gengliang.wang@databricks.com>
Closes#21774 from gengliangwang/from_and_to_avro.
## What changes were proposed in this pull request?
Refactor `Concat` and `MapConcat` to:
- avoid creating concatenator object for each row.
- make `Concat` handle `containsNull` properly.
- make `Concat` shortcut if `null` child is found.
## How was this patch tested?
Added some tests and existing tests.
Author: Takuya UESHIN <ueshin@databricks.com>
Closes#21824 from ueshin/issues/SPARK-24871/refactor_concat_mapconcat.
## What changes were proposed in this pull request?
Enhances the parser and analyzer to support ANSI compliant syntax for GROUPING SET. As part of this change we derive the grouping expressions from user supplied groupings in the grouping sets clause.
```SQL
SELECT c1, c2, max(c3)
FROM t1
GROUP BY GROUPING SETS ((c1), (c1, c2))
```
## How was this patch tested?
Added tests in SQLQueryTestSuite and ResolveGroupingAnalyticsSuite.
Author: Dilip Biswal <dbiswal@us.ibm.com>
Closes#21813 from dilipbiswal/spark-24424.
## What changes were proposed in this pull request?
As stated in https://github.com/apache/spark/pull/21321, in the error messages we should use `catalogString`. This is not the case, as SPARK-22893 used `simpleString` in order to have the same representation everywhere and it missed some places.
The PR unifies the messages by always using the `catalogString` representation of the dataTypes in the messages.
## How was this patch tested?
existing/modified UTs
Author: Marco Gaido <marcogaido91@gmail.com>
Closes#21804 from mgaido91/SPARK-24268_catalog.
## What changes were proposed in this pull request?
Made ExprId's hashCode independent of jvmId to make canonicalization independent of the JVM, by overriding hashCode (and necessarily also equals) to depend on id only.
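A minimal sketch of the idea (an illustrative stand-in, not the actual `ExprId` source): equality and hashing consider only the numeric `id` and ignore the JVM-specific `jvmId`, so canonicalization no longer depends on which JVM produced the id.
```scala
import java.util.UUID

// Illustrative stand-in for Catalyst's ExprId.
case class ExprIdLike(id: Long, jvmId: UUID) {
  // Depend only on `id`, not on the per-JVM UUID.
  override def hashCode(): Int = id.hashCode()
  override def equals(other: Any): Boolean = other match {
    case ExprIdLike(otherId, _) => id == otherId
    case _ => false
  }
}
```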
## How was this patch tested?
Created a unit test ExprIdSuite
Ran all unit tests of sql/catalyst
Author: Ger van Rossum <gvr@users.noreply.github.com>
Closes#21806 from gvr/spark24846-canonicalization.
## What changes were proposed in this pull request?
Two new rules in the logical plan optimizers are added.
1. When there is only one element in the **`Collection`**, the
physical plan will be optimized to **`EqualTo`**, so predicate
pushdown can be used.
```scala
profileDF.filter( $"profileID".isInCollection(Set(6))).explain(true)
"""
|== Physical Plan ==
|*(1) Project [profileID#0]
|+- *(1) Filter (isnotnull(profileID#0) && (profileID#0 = 6))
| +- *(1) FileScan parquet [profileID#0] Batched: true, Format: Parquet,
| PartitionFilters: [],
| PushedFilters: [IsNotNull(profileID), EqualTo(profileID,6)],
| ReadSchema: struct<profileID:int>
""".stripMargin
```
2. When the **`Collection`** is empty, and the input is nullable, the
logical plan will be simplified to
```scala
profileDF.filter( $"profileID".isInCollection(Set())).explain(true)
"""
|== Optimized Logical Plan ==
|Filter if (isnull(profileID#0)) null else false
|+- Relation[profileID#0] parquet
""".stripMargin
```
TODO:
1. For multiple conditions whose number is below a certain threshold,
we should still allow predicate pushdown.
2. Optimize **`In`** using **`tableswitch`** or **`lookupswitch`**
when the number of categories is low and they are **`Int`** or
**`Long`**.
3. The default immutable hash set is slow for lookups, and we
should benchmark different set implementations for faster
queries.
4. **`filter(if (condition) null else false)`** can be optimized to `false`.
## How was this patch tested?
A couple of new tests are added.
Author: DB Tsai <d_tsai@apple.com>
Closes#21797 from dbtsai/optimize-in.
## What changes were proposed in this pull request?
Remove the non-negative check on window start time so that windows support negative start times, and add a check to guarantee that the absolute value of the start time is less than the slide duration.
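A small sketch of what this enables (assuming a DataFrame `events` with a timestamp column `time` and `spark.implicits._` imported; the validity check is as described above):
```scala
import org.apache.spark.sql.functions.window

// 10-minute tumbling windows whose boundaries are shifted 5 minutes earlier;
// abs(-5 minutes) < 10 minutes, so the new check passes.
val counts = events
  .groupBy(window($"time", "10 minutes", "10 minutes", "-5 minutes"))
  .count()
```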
## How was this patch tested?
New unit tests.
Author: HanShuliang <kevinzwx1992@gmail.com>
Closes#18903 from KevinZwx/dev.
## What changes were proposed in this pull request?
Two new rules in the logical plan optimizers are added.
1. When there is only one element in the **`Collection`**, the
physical plan will be optimized to **`EqualTo`**, so predicate
pushdown can be used.
```scala
profileDF.filter( $"profileID".isInCollection(Set(6))).explain(true)
"""
|== Physical Plan ==
|*(1) Project [profileID#0]
|+- *(1) Filter (isnotnull(profileID#0) && (profileID#0 = 6))
| +- *(1) FileScan parquet [profileID#0] Batched: true, Format: Parquet,
| PartitionFilters: [],
| PushedFilters: [IsNotNull(profileID), EqualTo(profileID,6)],
| ReadSchema: struct<profileID:int>
""".stripMargin
```
2. When the **`Collection`** is empty, and the input is nullable, the
logical plan will be simplified to
```scala
profileDF.filter( $"profileID".isInCollection(Set())).explain(true)
"""
|== Optimized Logical Plan ==
|Filter if (isnull(profileID#0)) null else false
|+- Relation[profileID#0] parquet
""".stripMargin
```
TODO:
1. For multiple conditions whose number is below a certain threshold,
we should still allow predicate pushdown.
2. Optimize **`In`** using **`tableswitch`** or **`lookupswitch`**
when the number of categories is low and they are **`Int`** or
**`Long`**.
3. The default immutable hash set is slow for lookups, and we
should benchmark different set implementations for faster
queries.
4. **`filter(if (condition) null else false)`** can be optimized to `false`.
## How was this patch tested?
A couple of new tests are added.
Author: DB Tsai <d_tsai@apple.com>
Closes#21442 from dbtsai/optimize-in.
## What changes were proposed in this pull request?
We have some functions which need to be aware of the nullabilities of all children, such as `CreateArray`, `CreateMap`, `Concat`, and so on. Currently we add casts to fix the nullabilities, but the casts might be removed during the optimization phase.
After the discussion, we decided to not add extra casts for just fixing the nullabilities of the nested types, but handle them by functions themselves.
## How was this patch tested?
Modified and added some tests.
Author: Takuya UESHIN <ueshin@databricks.com>
Closes#21704 from ueshin/issues/SPARK-24734/concat_containsnull.
## What changes were proposed in this pull request?
This PR caches the function names from the external catalog. The cache is used by lookupFunctions in the analyzer and is built for each query plan. The original problem is reported in [SPARK-19737](https://issues.apache.org/jira/browse/SPARK-19737).
## How was this patch tested?
Created a new test file, LookupFunctionsSuite, and added a test case in SessionCatalogSuite.
Author: Kevin Yu <qyu@us.ibm.com>
Closes#20795 from kevinyu98/spark-23486.
## What changes were proposed in this pull request?
The PR is a followup to move the test cases introduced by the original PR in their proper location.
## How was this patch tested?
moved UTs
Author: Marco Gaido <marcogaido91@gmail.com>
Closes#21751 from mgaido91/SPARK-24208_followup.
## What changes were proposed in this pull request?
The PR adds the SQL function `array_union`. The behavior of the function is based on Presto's one.
This function returns an array of the elements in the union of array1 and array2.
Note: The order of elements in the result is not defined.
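A quick spark-shell sketch (assumes a `SparkSession` named `spark`; the expected contents follow the semantics above, with element order undefined):
```scala
spark.sql("SELECT array_union(array(1, 2, 3), array(1, 3, 5))").show(false)
// expected: an array containing 1, 2, 3, 5 (order not defined)
```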
## How was this patch tested?
Added UTs
Author: Kazuaki Ishizaki <ishizaki@jp.ibm.com>
Closes#21061 from kiszk/SPARK-23914.
## What changes were proposed in this pull request?
The PR simplifies the retrieval of config in `size`, as we can access them from tasks too thanks to SPARK-24250.
## How was this patch tested?
existing UTs
Author: Marco Gaido <marcogaido91@gmail.com>
Closes#21736 from mgaido91/SPARK-24605_followup.
## What changes were proposed in this pull request?
This PR proposes a fix for the output data type of ```If``` and ```CaseWhen``` expressions. Up until now, the implementation of these expressions has ignored the nullability of nested types from different execution branches and returned the type of the first branch.
This could lead to an unwanted ```NullPointerException``` from other expressions depending on a ```If```/```CaseWhen``` expression.
Example:
```
val rows = new util.ArrayList[Row]()
rows.add(Row(true, ("a", 1)))
rows.add(Row(false, (null, 2)))

val schema = StructType(Seq(
  StructField("cond", BooleanType, false),
  StructField("s", StructType(Seq(
    StructField("val1", StringType, true),
    StructField("val2", IntegerType, false)
  )), false)
))

val df = spark.createDataFrame(rows, schema)

df
  .select(when('cond, struct(lit("x").as("val1"), lit(10).as("val2"))).otherwise('s) as "res")
  .select('res.getField("val1"))
  .show()
```
Exception:
```
Exception in thread "main" java.lang.NullPointerException
at org.apache.spark.sql.catalyst.expressions.codegen.UnsafeWriter.write(UnsafeWriter.java:109)
at org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection.apply(Unknown Source)
at org.apache.spark.sql.execution.LocalTableScanExec$$anonfun$unsafeRows$1.apply(LocalTableScanExec.scala:44)
at org.apache.spark.sql.execution.LocalTableScanExec$$anonfun$unsafeRows$1.apply(LocalTableScanExec.scala:44)
...
```
Output schema:
```
root
|-- res.val1: string (nullable = false)
```
## How was this patch tested?
New test cases added into
- DataFrameSuite.scala
- conditionalExpressions.scala
Author: Marek Novotny <mn.mikke@gmail.com>
Closes#21687 from mn-mikke/SPARK-24165.
## What changes were proposed in this pull request?
SPARK-22893 tried to unify error messages about dataTypes. Unfortunately, many places were still missing the `simpleString` method needed in order to have the same representation everywhere.
The PR unified the messages by always using the simpleString representation of the dataTypes in the messages.
## How was this patch tested?
existing/modified UTs
Author: Marco Gaido <marcogaido91@gmail.com>
Closes#21321 from mgaido91/SPARK-24268.
## What changes were proposed in this pull request?
Implement map_concat high order function.
This implementation does not pick a winner when the specified maps have overlapping keys. Therefore, this implementation preserves existing duplicate keys in the maps and potentially introduces new duplicates (After discussion with ueshin, we settled on option 1 from [here](https://issues.apache.org/jira/browse/SPARK-23936?focusedCommentId=16464245&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-16464245)).
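A small spark-shell sketch (assumes a `SparkSession` named `spark`; expected output follows the semantics described above):
```scala
spark.sql("SELECT map_concat(map(1, 'a', 2, 'b'), map(3, 'c'))").show(false)
// expected: one map with the entries 1 -> a, 2 -> b, 3 -> c
// (per this PR, overlapping keys would be kept as duplicates rather than resolved)
```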
## How was this patch tested?
New tests
Manual tests
Run all sbt SQL tests
Run all pyspark sql tests
Author: Bruce Robbins <bersprockets@gmail.com>
Closes#21073 from bersprockets/SPARK-23936.
## What changes were proposed in this pull request?
We should use `DataType.sameType` to compare element type in `ArrayContains`, otherwise nullability affects comparison result.
## How was this patch tested?
Added test.
Author: Liang-Chi Hsieh <viirya@gmail.com>
Closes#21724 from viirya/SPARK-24749.
## What changes were proposed in this pull request?
We can support type coercion between `StructType`s where all the internal types are compatible.
## How was this patch tested?
Added tests.
Author: Takuya UESHIN <ueshin@databricks.com>
Closes#21713 from ueshin/issues/SPARK-24737/structtypecoercion.
## What changes were proposed in this pull request?
Current code block manipulation API is immature and hacky. We need a formal API to manipulate code blocks.
The basic idea is making `JavaCode` as `TreeNode`. So we can use familiar `transform` API to manipulate code blocks and expressions in code blocks.
For example, we can replace `SimpleExprValue` in a code block like this:
```scala
code.transformExprValues {
case SimpleExprValue("1 + 1", _) => aliasedParam
}
```
The example use case is splitting code to methods.
For example, we have an `ExprCode` containing generated code, but it is too long and we need to split it into a method. Because statement-based expressions can't be passed into a method directly, we need to transform them into variables first:
```scala
def getExprValues(block: Block): Set[ExprValue] = block match {
  case c: CodeBlock =>
    c.blockInputs.collect {
      case e: ExprValue => e
    }.toSet
  case _ => Set.empty
}

def currentCodegenInputs(ctx: CodegenContext): Set[ExprValue] = {
  // Collects current variables in ctx.currentVars and ctx.INPUT_ROW.
  // It looks roughly like...
  ctx.currentVars.flatMap { v =>
    getExprValues(v.code) ++ Set(v.value, v.isNull)
  }.toSet + ctx.INPUT_ROW
}

// The code block of an expression is too long, so split it out into a method.
if (eval.code.length > 1024) {
  val setIsNull = if (!eval.isNull.isInstanceOf[LiteralValue]) {
    ...
  } else {
    ""
  }

  // Pick up variables and statements necessary to pass in.
  val currentVars = currentCodegenInputs(ctx)
  val varsPassIn = getExprValues(eval.code).intersect(currentVars)
  val aliasedExprs = HashMap.empty[SimpleExprValue, VariableValue]

  // Replace statement-based expressions which can't be directly passed into the method.
  val newCode = eval.code.transform {
    case block =>
      block.transformExprValues {
        case s @ SimpleExprValue(_, javaType) if varsPassIn.contains(s) =>
          if (aliasedExprs.contains(s)) {
            aliasedExprs(s)
          } else {
            val aliasedVariable = JavaCode.variable(ctx.freshName("aliasedVar"), javaType)
            aliasedExprs += s -> aliasedVariable
            varsPassIn += aliasedVariable
            aliasedVariable
          }
      }
  }

  val params = varsPassIn.filter(!_.isInstanceOf[SimpleExprValue]).map { variable =>
    s"${variable.javaType.getName} ${variable.variableName}"
  }.mkString(", ")

  val funcName = ctx.freshName("nodeName")
  val javaType = CodeGenerator.javaType(dataType)
  val newValue = JavaCode.variable(ctx.freshName("value"), dataType)
  val funcFullName = ctx.addNewFunction(funcName,
    s"""
       |private $javaType $funcName($params) {
       |  $newCode
       |  $setIsNull
       |  return ${eval.value};
       |}
     """.stripMargin)

  eval.value = newValue
  val args = varsPassIn.filter(!_.isInstanceOf[SimpleExprValue]).map { variable =>
    s"${variable.variableName}"
  }

  // Create a code block to assign statements to aliased variables.
  val createVariables = aliasedExprs.foldLeft(EmptyBlock: Block) {
    case (block, (statement, variable)) =>
      block + code"${statement.javaType.getName} $variable = $statement;"
  }

  eval.code = createVariables + code"$javaType $newValue = $funcFullName($args);"
}
```
## How was this patch tested?
Added unit tests.
Author: Liang-Chi Hsieh <viirya@gmail.com>
Closes#21405 from viirya/codeblock-api.
## What changes were proposed in this pull request?
Currently we don't allow type coercion between maps.
We can support type coercion between MapTypes where both the key types and the value types are compatible.
## How was this patch tested?
Added tests.
Author: Takuya UESHIN <ueshin@databricks.com>
Closes#21703 from ueshin/issues/SPARK-24732/maptypecoercion.
## What changes were proposed in this pull request?
In the PR, I propose to add new function - *schema_of_json()* which infers schema of JSON string literal. The result of the function is a string containing a schema in DDL format.
One of the use cases is using *schema_of_json()* in combination with *from_json()*. Currently, _from_json()_ requires a schema as a mandatory argument. The *schema_of_json()* function allows pointing to a JSON string as an example which has the same schema as the first argument of _from_json()_. For instance:
```sql
select from_json(json_column, schema_of_json('{"c1": [0], "c2": [{"c3":0}]}'))
from json_table;
```
## How was this patch tested?
Added new test to `JsonFunctionsSuite`, `JsonExpressionsSuite` and SQL tests to `json-functions.sql`
Author: Maxim Gekk <maxim.gekk@databricks.com>
Closes#21686 from MaxGekk/infer_schema_json.
## What changes were proposed in this pull request?
The ColumnPruning rule tries adding an extra Project if an input node produces more fields than needed, but as a post-processing step it needs to remove the lower Project in the form of "Project - Filter - Project"; otherwise it would conflict with PushPredicatesThroughProject and thus cause an infinite optimization loop. The current post-processing method is defined as:
```
private def removeProjectBeforeFilter(plan: LogicalPlan): LogicalPlan = plan transform {
  case p1 @ Project(_, f @ Filter(_, p2 @ Project(_, child)))
      if p2.outputSet.subsetOf(child.outputSet) =>
    p1.copy(child = f.copy(child = child))
}
```
This method works well when there is only one Filter, but not if there are two or more Filters. In this case, there is a deterministic filter and a non-deterministic filter, so they stay as separate filter nodes and cannot be combined together.
A simplified illustration of the optimization process that forms the infinite loop is shown below (F1 stands for the 1st filter, F2 for the 2nd filter, P for project, S for scan of relation, PredicatePushDown as abbrev. of PushPredicatesThroughProject):
```
F1 - F2 - P - S
PredicatePushDown => F1 - P - F2 - S
ColumnPruning => F1 - P - F2 - P - S
=> F1 - P - F2 - S (Project removed)
PredicatePushDown => P - F1 - F2 - S
ColumnPruning => P - F1 - P - F2 - S
=> P - F1 - P - F2 - P - S
=> P - F1 - F2 - P - S (only one Project removed)
RemoveRedundantProject => F1 - F2 - P - S (goes back to the loop start)
```
So the problem is that the ColumnPruning rule adds a Project under a Filter (and fails to remove it in the end), and that new Project triggers PushPredicateThroughProject. Once the filters have been pushed through the Project, a new Project will be added by the ColumnPruning rule, and this goes on and on.
The fix is: when adding Projects, the rule applies top-down, but later, when removing extra Projects, the process should go bottom-up to ensure all extra Projects can be matched.
## How was this patch tested?
Added an optimization rule test in ColumnPruningSuite and an end-to-end test in SQLQuerySuite.
Author: maryannxue <maryannxue@apache.org>
Closes#21674 from maryannxue/spark-24696.
## What changes were proposed in this pull request?
Set createTime for every Hive partition created in Spark SQL, which could be used to manage the data lifecycle in a Hive warehouse. We found that almost every partition modified by Spark SQL has no createTime set.
```
mysql> select * from partitions where create_time=0 limit 1\G;
*************************** 1. row ***************************
PART_ID: 1028584
CREATE_TIME: 0
LAST_ACCESS_TIME: 1502203611
PART_NAME: date=20170130
SD_ID: 1543605
TBL_ID: 211605
LINK_TARGET_ID: NULL
1 row in set (0.27 sec)
```
## How was this patch tested?
N/A
Author: debugger87 <yangchaozhong.2009@gmail.com>
Author: Chaozhong Yang <yangchaozhong.2009@gmail.com>
Closes#18900 from debugger87/fix/set-create-time-for-hive-partition.
## What changes were proposed in this pull request?
The PR adds the SQL function ```sequence```.
https://issues.apache.org/jira/browse/SPARK-23927
The behavior of the function is based on Presto's one.
Ref: https://prestodb.io/docs/current/functions/array.html
- ```sequence(start, stop) → array<bigint>```
Generate a sequence of integers from ```start``` to ```stop```, incrementing by ```1``` if ```start``` is less than or equal to ```stop```, otherwise ```-1```.
- ```sequence(start, stop, step) → array<bigint>```
Generate a sequence of integers from ```start``` to ```stop```, incrementing by ```step```.
- ```sequence(start_date, stop_date) → array<date>```
Generate a sequence of dates from ```start_date``` to ```stop_date```, incrementing by ```interval 1 day``` if ```start_date``` is less than or equal to ```stop_date```, otherwise ```- interval 1 day```.
- ```sequence(start_date, stop_date, step_interval) → array<date>```
Generate a sequence of dates from ```start_date``` to ```stop_date```, incrementing by ```step_interval```. The type of ```step_interval``` is ```CalendarInterval```.
- ```sequence(start_timestamp, stop_timestamp) → array<timestamp>```
Generate a sequence of timestamps from ```start_timestamp``` to ```stop_timestamp```, incrementing by ```interval 1 day``` if ```start_timestamp``` is less than or equal to ```stop_timestamp```, otherwise ```- interval 1 day```.
- ```sequence(start_timestamp, stop_timestamp, step_interval) → array<timestamp>```
Generate a sequence of timestamps from ```start_timestamp``` to ```stop_timestamp```, incrementing by ```step_interval```. The type of ```step_interval``` is ```CalendarInterval```.
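A few spark-shell examples of the signatures above (assumes a `SparkSession` named `spark`; expected results follow the Presto-style semantics described in this PR):
```scala
spark.sql("SELECT sequence(1, 5)").show(false)   // expected: [1, 2, 3, 4, 5]
spark.sql("SELECT sequence(5, 1)").show(false)   // expected: [5, 4, 3, 2, 1]
spark.sql(
  "SELECT sequence(to_date('2018-01-01'), to_date('2018-03-01'), interval 1 month)"
).show(false)
// expected: [2018-01-01, 2018-02-01, 2018-03-01]
```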
## How was this patch tested?
Added unit tests.
Author: Vayda, Oleksandr: IT (PRG) <Oleksandr.Vayda@barclayscapital.com>
Closes#21155 from wajda/feature/array-api-sequence.
## What changes were proposed in this pull request?
In the PR, I propose a new behavior of `size(null)` under the config flag `spark.sql.legacy.sizeOfNull`. If the flag is disabled, the `size()` function returns `null` for `null` input. By default `spark.sql.legacy.sizeOfNull` is enabled to keep backward compatibility with previous versions; in that case, `size(null)` returns `-1`.
## How was this patch tested?
Modified existing tests for the `size()` function to check new behavior (`null`) and old one (`-1`).
Author: Maxim Gekk <maxim.gekk@databricks.com>
Closes#21598 from MaxGekk/legacy-size-of-null.
## What changes were proposed in this pull request?
Fix `GenericArrayData.equals`, so that it respects the actual types of the elements.
e.g. an instance that represents an `array<int>` and another instance that represents an `array<long>` should be considered incompatible, and thus should return false for `equals`.
`GenericArrayData` doesn't keep any schema information by itself, and rather relies on the Java objects referenced by its `array` field's elements to keep track of their own object types. So, the most straightforward way to respect their types is to call `equals` on the elements, instead of using Scala's `==` operator, which can have semantics that are not always desirable:
```
new java.lang.Integer(123) == new java.lang.Long(123L) // true in Scala
new java.lang.Integer(123).equals(new java.lang.Long(123L)) // false in Scala
```
## How was this patch tested?
Added unit test in `ComplexDataSuite`
Author: Kris Mok <kris.mok@databricks.com>
Closes#21643 from rednaxelafx/fix-genericarraydata-equals.
## What changes were proposed in this pull request?
In https://github.com/apache/spark/pull/19080 we simplified the distribution/partitioning framework and made all the join-like operators require `HashClusteredDistribution` from children. Unfortunately the streaming join operator was missed.
This can cause wrong results. Think about
```
val input1 = MemoryStream[Int]
val input2 = MemoryStream[Int]
val df1 = input1.toDF.select('value as 'a, 'value * 2 as 'b)
val df2 = input2.toDF.select('value as 'a, 'value * 2 as 'b).repartition('b)
val joined = df1.join(df2, Seq("a", "b")).select('a)
```
The physical plan is
```
*(3) Project [a#5]
+- StreamingSymmetricHashJoin [a#5, b#6], [a#10, b#11], Inner, condition = [ leftOnly = null, rightOnly = null, both = null, full = null ], state info [ checkpoint = <unknown>, runId = 54e31fce-f055-4686-b75d-fcd2b076f8d8, opId = 0, ver = 0, numPartitions = 5], 0, state cleanup [ left = null, right = null ]
:- Exchange hashpartitioning(a#5, b#6, 5)
: +- *(1) Project [value#1 AS a#5, (value#1 * 2) AS b#6]
: +- StreamingRelation MemoryStream[value#1], [value#1]
+- Exchange hashpartitioning(b#11, 5)
+- *(2) Project [value#3 AS a#10, (value#3 * 2) AS b#11]
+- StreamingRelation MemoryStream[value#3], [value#3]
```
The left table is hash partitioned by `a, b`, while the right table is hash partitioned by `b`. This means we may have matching records that end up in different partitions; they should be in the output but are not.
## How was this patch tested?
N/A
Author: Wenchen Fan <wenchen@databricks.com>
Closes#21587 from cloud-fan/join.
## What changes were proposed in this pull request?
In the PR, I propose to automatically convert a `Literal` with `Char` type to a `Literal` of `String` type. Currently, the following code:
```scala
val df = Seq("Amsterdam", "San Francisco", "London").toDF("city")
df.where($"city".contains('o')).show(false)
```
fails with the exception:
```
Unsupported literal type class java.lang.Character o
java.lang.RuntimeException: Unsupported literal type class java.lang.Character o
at org.apache.spark.sql.catalyst.expressions.Literal$.apply(literals.scala:78)
```
The PR fixes this issue by converting `char` to a `string` of length `1`. I believe it makes sense not to differentiate `char` and `string(1)` in _a unified, multi-language data platform_ like Spark, which supports languages like Python/R.
Author: Maxim Gekk <maxim.gekk@databricks.com>
Author: Maxim Gekk <max.gekk@gmail.com>
Closes#21578 from MaxGekk/support-char-literals.
## What changes were proposed in this pull request?
Add array_distinct to remove duplicate values from an array.
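A one-line spark-shell sketch (assumes a `SparkSession` named `spark`):
```scala
spark.sql("SELECT array_distinct(array(1, 2, 3, null, 3))").show(false)
// expected: [1, 2, 3, null]
```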
## How was this patch tested?
Add unit tests
Author: Huaxin Gao <huaxing@us.ibm.com>
Closes#21050 from huaxingao/spark-23912.
## What changes were proposed in this pull request?
The UDFXPathXXXX UDF series allows users to pass carefully crafted XML to access arbitrary files. Spark does not have built-in access control, and even when users rely on an external access control library, they might bypass it and access the file contents.
This PR basically patches the Hive fix to Apache Spark. https://issues.apache.org/jira/browse/HIVE-18879
## How was this patch tested?
A unit test case
Author: Xiao Li <gatorsmile@gmail.com>
Closes#21549 from gatorsmile/xpathSecurity.
## What changes were proposed in this pull request?
If you construct catalyst trees using `scala.collection.immutable.Stream` you can run into situations where valid transformations do not seem to have any effect. There are two causes for this behavior:
- `Stream` is evaluated lazily. Note that the default implementation will generally only evaluate a function for the first element (this makes testing a bit tricky).
- `TreeNode` and `QueryPlan` use side effects to detect if a tree has changed. Mapping over a stream is lazy and will not necessarily trigger this side effect. If this happens the node will incorrectly assume that it did not change and return itself instead of the newly created node (this is for GC reasons).
This PR fixes this issue by forcing materialization on streams in `TreeNode` and `QueryPlan`.
## How was this patch tested?
Unit tests were added to `TreeNodeSuite` and `LogicalPlanSuite`. An integration test was added to the `PlannerSuite`
Author: Herman van Hovell <hvanhovell@databricks.com>
Closes#21539 from hvanhovell/SPARK-24500.
## What changes were proposed in this pull request?
The PR adds the SQL function `map_from_arrays`. The behavior of the function is based on Presto's `map`. Since SparkSQL already had a `map` function, we prepared the different name for this behavior.
This function returns a map from a pair of arrays for keys and values.
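A quick spark-shell sketch (assumes a `SparkSession` named `spark`):
```scala
spark.sql("SELECT map_from_arrays(array(1.0, 3.0), array('2', '4'))").show(false)
// expected: a map with the entries 1.0 -> "2" and 3.0 -> "4"
```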
## How was this patch tested?
Added UTs
Author: Kazuaki Ishizaki <ishizaki@jp.ibm.com>
Closes#21258 from kiszk/SPARK-23933.
Signed-off-by: DylanGuedes <djmgguedesgmail.com>
## What changes were proposed in this pull request?
Addition of arrays_zip function to spark sql functions.
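For illustration, in spark-shell (with the usual `spark` session):
```scala
spark.sql("SELECT arrays_zip(array(1, 2, 3), array('a', 'b', 'c'))").show(false)
// expected: an array of structs pairing elements by position: (1, a), (2, b), (3, c)
```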
## How was this patch tested?
Unit tests that checks if the results are correct.
Author: DylanGuedes <djmgguedes@gmail.com>
Closes#21045 from DylanGuedes/SPARK-23931.
## What changes were proposed in this pull request?
In SPARK-22036 we introduced the possibility to allow precision loss in arithmetic operations (according to the SQL standard). The implementation was drawn from Hive's one, where Decimals with a negative scale are not allowed in the operations.
The PR handles the case when the scale is negative, removing the assertion that it is not.
## How was this patch tested?
added UTs
Author: Marco Gaido <marcogaido91@gmail.com>
Closes#21499 from mgaido91/SPARK-24468.
## What changes were proposed in this pull request?
Implemented eval in SortPrefix expression.
## How was this patch tested?
- ran existing sbt SQL tests
- added unit test
- ran existing Python SQL tests
- manual tests: disabling codegen -- patching code to disable beyond what spark.sql.codegen.wholeStage=false can do -- and running sbt SQL tests
Author: Bruce Robbins <bersprockets@gmail.com>
Closes#21231 from bersprockets/sortprefixeval.
## What changes were proposed in this pull request?
Add support for date `extract` function:
```sql
spark-sql> SELECT EXTRACT(YEAR FROM TIMESTAMP '2000-12-16 12:21:13');
2000
```
Supported field same as [Hive](https://github.com/apache/hive/blob/rel/release-2.3.3/ql/src/java/org/apache/hadoop/hive/ql/parse/IdentifiersParser.g#L308-L316): `YEAR`, `QUARTER`, `MONTH`, `WEEK`, `DAY`, `DAYOFWEEK`, `HOUR`, `MINUTE`, `SECOND`.
## How was this patch tested?
unit tests
Author: Yuming Wang <yumwang@ebay.com>
Closes#21479 from wangyum/SPARK-23903.
## What changes were proposed in this pull request?
Add `array_remove` to remove all elements equal to a given element from an array.
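A one-line spark-shell example (usual `spark` session):
```scala
spark.sql("SELECT array_remove(array(1, 2, 3, null, 3), 3)").show(false)
// expected: [1, 2, null]
```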
## How was this patch tested?
add unit tests
Author: Huaxin Gao <huaxing@us.ibm.com>
Closes#21069 from huaxingao/spark-23920.
## What changes were proposed in this pull request?
`format_number` now supports a user-specified format string as its second argument. For example:
```sql
spark-sql> SELECT format_number(12332.123456, '##################.###');
12332.123
```
## How was this patch tested?
unit test
Author: Yuming Wang <yumwang@ebay.com>
Closes#21010 from wangyum/SPARK-23900.
## What changes were proposed in this pull request?
When two `In` operators are created with the same list of values, but different order, we are considering them as semantically different. This is wrong, since they have the same semantic meaning.
The PR adds a canonicalization rule which orders the literals in the `In` operator so the semantic equality works properly.
## How was this patch tested?
added UT
Author: Marco Gaido <marcogaido91@gmail.com>
Closes#21331 from mgaido91/SPARK-24276.
## What changes were proposed in this pull request?
The PR adds the masking function as they are described in Hive's documentation: https://cwiki.apache.org/confluence/display/Hive/LanguageManual+UDF#LanguageManualUDF-DataMaskingFunctions.
This means that only `string`s are accepted as parameter for the masking functions.
## How was this patch tested?
added UTs
Author: Marco Gaido <marcogaido91@gmail.com>
Closes#21246 from mgaido91/SPARK-23901.
## What changes were proposed in this pull request?
Currently, users are getting the following error messages on type conversions:
```
scala.MatchError: test (of class java.lang.String)
```
The message doesn't give any clues to the users where in the schema the error happened. In this PR, I would like to improve the error message like:
```
The value (test) of the type (java.lang.String) cannot be converted to struct<f1:int>
```
## How was this patch tested?
Added tests for converting of wrong values to `struct`, `map`, `array`, `string` and `decimal`.
Author: Maxim Gekk <maxim.gekk@databricks.com>
Closes#21410 from MaxGekk/type-conv-error.
## What changes were proposed in this pull request?
Add fallback logic for `UnsafeProjection`. In production we can try to create the unsafe projection using the codegen implementation; if any compile error happens, it falls back to the interpreted implementation.
## How was this patch tested?
Added test.
Author: Liang-Chi Hsieh <viirya@gmail.com>
Closes#21106 from viirya/SPARK-23711.
## What changes were proposed in this pull request?
This patch tries to implement this [proposal](https://github.com/apache/spark/pull/19813#issuecomment-354045400) to add an API for handling expression code generation. It should allow us to manipulate how code is generated for expressions.
In detail, this adds a new abstraction `CodeBlock` to `JavaCode`. `CodeBlock` holds the code snippet and inputs for generating the actual java code.
For example, in following java code:
```java
int ${variable} = 1;
boolean ${isNull} = ${CodeGenerator.defaultValue(BooleanType)};
```
`variable`, `isNull` are two `VariableValue` and `CodeGenerator.defaultValue(BooleanType)` is a string. They are all inputs to this code block and held by `CodeBlock` representing this code.
For codegen, we provide a specified string interpolator `code`, so you can define a code like this:
```scala
val codeBlock =
  code"""
       |int ${variable} = 1;
       |boolean ${isNull} = ${CodeGenerator.defaultValue(BooleanType)};
      """.stripMargin
// Generates actual java code.
codeBlock.toString
```
Because those inputs are held separately in `CodeBlock` before generating code, we can safely manipulate them, e.g., replacing statements with aliased variables.
## How was this patch tested?
Added tests.
Author: Liang-Chi Hsieh <viirya@gmail.com>
Closes#21193 from viirya/SPARK-24121.
## What changes were proposed in this pull request?
The interpreted evaluation of several collection operations works only for simple data types. For complex data types, `array_contains`, for instance, always returns `false`. The list of the affected functions is `array_contains`, `array_position`, `element_at` and `GetMapValue`.
The PR fixes the behavior for all the datatypes.
## How was this patch tested?
added UT
Author: Marco Gaido <marcogaido91@gmail.com>
Closes#21361 from mgaido91/SPARK-24313.
## What changes were proposed in this pull request?
The PR adds the function `arrays_overlap`. This function returns `true` if the input arrays contain a non-null common element; if not, it returns `null` if any of the arrays contains a `null` element, `false` otherwise.
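A short spark-shell sketch of the outcomes described above (usual `spark` session; results stated per those semantics):
```scala
spark.sql("SELECT arrays_overlap(array(1, 2, 3), array(3, 4, 5))").show(false)
// expected: true (3 is a common non-null element)
spark.sql("SELECT arrays_overlap(array(1, null), array(2, 3))").show(false)
// expected: null (no common element, but one array contains a null)
```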
## How was this patch tested?
added UTs
Author: Marco Gaido <marcogaido91@gmail.com>
Closes#21028 from mgaido91/SPARK-23922.
## What changes were proposed in this pull request?
The PR adds a new collection function, array_repeat. As there was already a function repeat with the same signature, the only difference being the expected return type (String instead of Array), the new function is called array_repeat to distinguish the two.
The behaviour of the function is based on Presto's one.
The function creates an array containing a given element repeated the requested number of times.
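A one-line spark-shell example (usual `spark` session):
```scala
spark.sql("SELECT array_repeat('123', 2)").show(false)
// expected: ["123", "123"]
```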
## How was this patch tested?
New unit tests added into:
- CollectionExpressionsSuite
- DataFrameFunctionsSuite
Author: Florent Pépin <florentpepin.92@gmail.com>
Author: Florent Pépin <florent.pepin14@imperial.ac.uk>
Closes#21208 from pepinoflo/SPARK-23925.
## What changes were proposed in this pull request?
1. Change antlr rule to fix the warning.
2. Add PIVOT/LATERAL check in AstBuilder with a more meaningful error message.
## How was this patch tested?
1. Add a counter case in `PlanParserSuite.test("lateral view")`
Author: maryannxue <maryann.xue@gmail.com>
Closes#21324 from maryannxue/spark-24035-fix.
## What changes were proposed in this pull request?
This is a followup of https://github.com/apache/spark/pull/20136 . #20136 didn't really work because in the test, we are using local backend, which shares the driver side `SparkEnv`, so `SparkEnv.get.executorId == SparkContext.DRIVER_IDENTIFIER` doesn't work.
This PR changes the check to `TaskContext.get != null`, and move the check to `SQLConf.get`, and fix all the places that violate this check:
* `InMemoryTableScanExec#createAndDecompressColumn` is executed inside `rdd.map`, we can't access `conf.offHeapColumnVectorEnabled` there. https://github.com/apache/spark/pull/21223 merged
* `DataType#sameType` may be executed in executor side, for things like json schema inference, so we can't call `conf.caseSensitiveAnalysis` there. This contributes to most of the code changes, as we need to add `caseSensitive` parameter to a lot of methods.
* `ParquetFilters` is used in the file scan function, which is executed on the executor side, so we can't call `conf.parquetFilterPushDownDate` there. https://github.com/apache/spark/pull/21224 merged
* `WindowExec#createBoundOrdering` is called on executor side, so we can't use `conf.sessionLocalTimezone` there. https://github.com/apache/spark/pull/21225 merged
* `JsonToStructs` can be serialized to executors and evaluate, we should not call `SQLConf.get.getConf(SQLConf.FROM_JSON_FORCE_NULLABLE_SCHEMA)` in the body. https://github.com/apache/spark/pull/21226 merged
## How was this patch tested?
existing test
Author: Wenchen Fan <wenchen@databricks.com>
Closes#21190 from cloud-fan/minor.
## What changes were proposed in this pull request?
Mention `spark.sql.crossJoin.enabled` in error message when an implicit `CROSS JOIN` is detected.
## How was this patch tested?
`CartesianProductSuite` and `JoinSuite`.
Author: Henry Robinson <henry@apache.org>
Closes#21201 from henryr/spark-24128.
## What changes were proposed in this pull request?
When creating an InterpretedPredicate instance, initialize any Nondeterministic expressions in the expression tree to avoid java.lang.IllegalArgumentException on later call to eval().
## How was this patch tested?
- sbt SQL tests
- python SQL tests
- new unit test
Author: Bruce Robbins <bersprockets@gmail.com>
Closes#21144 from bersprockets/interpretedpredicate.
## What changes were proposed in this pull request?
The PR adds the `slice` function. The behavior of the function is based on Presto's one.
The function slices an array according to the requested start index and length.
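For illustration (spark-shell, usual `spark` session):
```scala
spark.sql("SELECT slice(array(1, 2, 3, 4), 2, 2)").show(false)
// expected: [2, 3] (the start index is 1-based)
```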
## How was this patch tested?
added UTs
Author: Marco Gaido <marcogaido91@gmail.com>
Closes#21040 from mgaido91/SPARK-23930.
## What changes were proposed in this pull request?
The PR adds the SQL function `array_sort`. The behavior of the function is based on Presto's one.
The function sorts the input array in ascending order. The elements of the input array must be orderable. Null elements will be placed at the end of the returned array.
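A one-line spark-shell sketch (usual `spark` session; result per the semantics above):
```scala
spark.sql("SELECT array_sort(array('b', 'd', null, 'c', 'a'))").show(false)
// expected: ["a", "b", "c", "d", null] (nulls placed last)
```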
## How was this patch tested?
Added UTs
Author: Kazuaki Ishizaki <ishizaki@jp.ibm.com>
Closes#21021 from kiszk/SPARK-23921.
## What changes were proposed in this pull request?
This refactors the external catalog to be an interface, which will make future work on catalog federation easier. After the refactoring, `ExternalCatalog` is much cleaner without mixing in the listener event generation logic.
## How was this patch tested?
The existing tests
Author: gatorsmile <gatorsmile@gmail.com>
Closes#21122 from gatorsmile/refactorExternalCatalog.
## What changes were proposed in this pull request?
Add SQL support for Pivot according to Pivot grammar defined by Oracle (https://docs.oracle.com/database/121/SQLRF/img_text/pivot_clause.htm) with some simplifications, based on our existing functionality and limitations for Pivot at the backend:
1. For pivot_for_clause (https://docs.oracle.com/database/121/SQLRF/img_text/pivot_for_clause.htm), the column list form is not supported, which means the pivot column can only be one single column.
2. For pivot_in_clause (https://docs.oracle.com/database/121/SQLRF/img_text/pivot_in_clause.htm), the sub-query form and "ANY" are not supported (these are only supported by Oracle for XML anyway).
3. For pivot_in_clause, aliases for the constant values are not supported.
The code changes are:
1. Add parser support for Pivot. Note that according to https://docs.oracle.com/database/121/SQLRF/statements_10002.htm#i2076542, Pivot cannot be used together with lateral views in the from clause. This restriction has been implemented in the Parser rule.
2. Infer group-by expressions: group-by expressions are not explicitly specified in SQL Pivot clause and need to be deduced based on this rule: https://docs.oracle.com/database/121/SQLRF/statements_10002.htm#CHDFAFIE, so we have to post-fix it at query analysis stage.
3. Override Pivot.resolved as "false": for the reason mentioned in [2] and the fact that output attributes change after Pivot being replaced by Project or Aggregate, we avoid resolving parent references until after Pivot has been resolved and replaced.
4. Verify aggregate expressions: only aggregate expressions with or without aliases can appear in the first part of the Pivot clause, and this check is performed at the analysis stage.
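An illustrative query using the new syntax (hedged sketch: `courseSales` is a made-up table with columns `year`, `course`, `earnings`; run via spark-shell's `spark` session):
```scala
spark.sql("""
  SELECT *
  FROM (SELECT year, course, earnings FROM courseSales)
  PIVOT (
    sum(earnings)
    FOR course IN ('dotNET', 'Java')
  )
""").show(false)
// expected: one row per year, with a summed earnings column for 'dotNET' and one for 'Java'
```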
## How was this patch tested?
A new test suite PivotSuite is added.
Author: maryannxue <maryann.xue@gmail.com>
Closes#21187 from maryannxue/spark-24035.
## What changes were proposed in this pull request?
This PR is extracted from #21190 , to make it easier to backport.
`JsonToStructs` can be serialized to executors and evaluate, we should not call `SQLConf.get.getConf(SQLConf.FROM_JSON_FORCE_NULLABLE_SCHEMA)` in the body.
## How was this patch tested?
tested in #21190
Author: Wenchen Fan <wenchen@databricks.com>
Closes#21226 from cloud-fan/minor4.
## What changes were proposed in this pull request?
`from_utc_timestamp` assumes its input is in the UTC timezone and shifts it to the specified timezone. When the timestamp contains a timezone (e.g. `2018-03-13T06:18:23+00:00`), Spark breaks the semantics and respects the timezone in the string. This is not what users expect and the result is different from Hive/Impala. `to_utc_timestamp` has the same problem.
More details please refer to the JIRA ticket.
This PR fixes this by returning null if the input timestamp contains timezone.
## How was this patch tested?
new tests
Author: Wenchen Fan <wenchen@databricks.com>
Closes#21169 from cloud-fan/from_utc_timezone.
## What changes were proposed in this pull request?
Add TypedFilter support for continuous processing application.
## How was this patch tested?
unit tests
Author: wangyanlin01 <wangyanlin01@baidu.com>
Closes#21136 from yanlin-Lynn/SPARK-24061.
## What changes were proposed in this pull request?
`colStat.min` and `colStat.max` are empty for string type. Thus, `evaluateInSet` should not return zero when either `colStat.min` or `colStat.max` is empty.
## How was this patch tested?
Added a test case.
Author: gatorsmile <gatorsmile@gmail.com>
Closes#21147 from gatorsmile/cached.
## What changes were proposed in this pull request?
The PR adds the SQL function `array_join`. The behavior of the function is based on Presto's one.
The function accepts an `array` of `string` which is to be joined, a `string` which is the delimiter to use between the items of the first argument and optionally a `string` which is used to replace `null` values.
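Two spark-shell examples of the signatures above (usual `spark` session; results stated per the described semantics):
```scala
spark.sql("SELECT array_join(array('hello', null, 'world'), ' ')").show(false)
// expected: "hello world" (nulls are skipped when no replacement is given)
spark.sql("SELECT array_join(array('hello', null, 'world'), ' ', ',')").show(false)
// expected: "hello , world" (null replaced by ",")
```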
## How was this patch tested?
added UTs
Author: Marco Gaido <marcogaido91@gmail.com>
Closes#21011 from mgaido91/SPARK-23916.
## What changes were proposed in this pull request?
HIVE-15511 introduced the `roundOff` flag in order to disable the rounding to 8 digits which is performed in `months_between`. Since this can be a computational intensive operation, skipping it may improve performances when the rounding is not needed.
## How was this patch tested?
modified existing UT
Author: Marco Gaido <marcogaido91@gmail.com>
Closes#21008 from mgaido91/SPARK-23902.
## What changes were proposed in this pull request?
a followup of https://github.com/apache/spark/pull/21100
## How was this patch tested?
N/A
Author: Wenchen Fan <wenchen@databricks.com>
Closes#21154 from cloud-fan/test.
## What changes were proposed in this pull request?
In SPARK-23375 we introduced the ability of removing `Sort` operation during query optimization if the data is already sorted. In this follow-up we remove also a `Sort` which is followed by another `Sort`: in this case the first sort is not needed and can be safely removed.
The PR starts from henryr's comment: https://github.com/apache/spark/pull/20560#discussion_r180601594. So credit should be given to him.
## How was this patch tested?
added UT
Author: Marco Gaido <marcogaido91@gmail.com>
Closes#21072 from mgaido91/SPARK-23973.
## What changes were proposed in this pull request?
This PR supports interpreted mode for `ExternalMapToCatalyst`.
## How was this patch tested?
Added tests in `ObjectExpressionsSuite`.
Author: Takeshi Yamamuro <yamamuro@apache.org>
Closes#20980 from maropu/SPARK-23589.
## What changes were proposed in this pull request?
The existing query constraints framework has 2 steps:
1. propagate constraints bottom up.
2. use constraints to infer additional filters for better data pruning.
For step 2, it mostly helps with Join, because we can connect the constraints from children to the join condition and infer powerful filters to prune the data of the join sides. e.g., the left side has constraints `a = 1`, the join condition is `left.a = right.a`, then we can infer `right.a = 1` to the right side and prune the right side a lot.
However, the current logic of inferring filters from constraints for Join is pretty weak. It infers the filters from Join's constraints. Some joins like left semi/anti exclude output from right side and the right side constraints will be lost here.
This PR proposes to check the left and right constraints individually, expand the constraints with the join condition, and add filters to the children of the join directly, instead of adding them to the join condition.
This reverts https://github.com/apache/spark/pull/20670 , covers https://github.com/apache/spark/pull/20717 and https://github.com/apache/spark/pull/20816
This is inspired by the original PRs and the tests are all from these PRs. Thanks to the authors mgaido91 maryannxue KaiXinXiaoLei !
## How was this patch tested?
new tests
Author: Wenchen Fan <wenchen@databricks.com>
Closes#21083 from cloud-fan/join.
## What changes were proposed in this pull request?
During evaluation of IN conditions, if the source data frame is represented by a plan that uses a Hive table with previously analysed columns, and the plan has conditions on these fields that cannot be satisfied (which leads to an empty data frame), the FilterEstimation.evaluateInSet method produces NumberFormatException and ClassCastException.
In order to fix this bug, the method FilterEstimation.evaluateInSet first checks whether the distinct count is non-zero and whether colStat.min and colStat.max are defined, and only in this case proceeds with the calculation. If at least one of the conditions is not satisfied, zero is returned.
## How was this patch tested?
In order to test the PR, two tests were implemented: one in FilterEstimationSuite, which tests a plan with statistics that violate the conditions mentioned above, and another one in StatisticsCollectionSuite, which tests the whole process of analysis/optimisation of a query that leads to the problems mentioned in the first section.
Author: Mykhailo Shtelma <mykhailo.shtelma@bearingpoint.com>
Author: smikesh <mshtelma@gmail.com>
Closes#21052 from mshtelma/filter_estimation_evaluateInSet_Bugs.
## What changes were proposed in this pull request?
This PR supports interpreted mode for `ValidateExternalType`.
## How was this patch tested?
Added tests in `ObjectExpressionsSuite`.
Author: Takeshi Yamamuro <yamamuro@apache.org>
Closes#20757 from maropu/SPARK-23595.
## What changes were proposed in this pull request?
Currently we find the wider common type by comparing the two types from left to right; this can be a problem when you have two data types which don't have a common type but each can be promoted to StringType.
For instance, if you have a table with the schema `[c1: date, c2: string, c3: int]`, the following succeeds:
`SELECT coalesce(c1, c2, c3) FROM table`
while the following produces an exception:
`SELECT coalesce(c1, c3, c2) FROM table`
This is only an issue when the seq of dataTypes contains `StringType` and all the types can do string promotion.
close#19033
## How was this patch tested?
Add test in `TypeCoercionSuite`
Author: Xingbo Jiang <xingbo.jiang@databricks.com>
Closes#21074 from jiangxb1987/typeCoercion.
## What changes were proposed in this pull request?
This PR supports interpreted mode for `CatalystToExternalMap`.
## How was this patch tested?
Added tests in `ObjectExpressionsSuite`.
Author: Takeshi Yamamuro <yamamuro@apache.org>
Closes#20979 from maropu/SPARK-23588.
## What changes were proposed in this pull request?
This PR supports interpreted mode for `NewInstance`.
## How was this patch tested?
Added tests in `ObjectExpressionsSuite`.
Author: Takeshi Yamamuro <yamamuro@apache.org>
Closes#20778 from maropu/SPARK-23584.
## What changes were proposed in this pull request?
The PR adds the SQL function `element_at`. The behavior of the function is based on Presto's one.
This function returns the element of the array at the given index if the column is an array, or the value for the given key if the column is a map.
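Two spark-shell examples, one per input type (usual `spark` session):
```scala
spark.sql("SELECT element_at(array(1, 2, 3), 2)").show(false)
// expected: 2 (arrays use a 1-based index)
spark.sql("SELECT element_at(map(1, 'a', 2, 'b'), 2)").show(false)
// expected: b
```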
## How was this patch tested?
Added UTs
Author: Kazuaki Ishizaki <ishizaki@jp.ibm.com>
Closes#21053 from kiszk/SPARK-23924.
## What changes were proposed in this pull request?
The PR adds the SQL function `array_position`. The behavior of the function is based on Presto's one.
The function returns the position of the first occurrence of the element in array x (or 0 if not found), using a 1-based index, as a BigInt.
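For illustration, in spark-shell (usual `spark` session):
```scala
spark.sql("SELECT array_position(array(3, 2, 1), 1)").show(false)
// expected: 3 (1-based position of the first occurrence; 0 if not found)
```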
## How was this patch tested?
Added UTs
Author: Kazuaki Ishizaki <ishizaki@jp.ibm.com>
Closes#21037 from kiszk/SPARK-23919.
## What changes were proposed in this pull request?
`EqualNullSafe` for `FloatType` and `DoubleType` might generate a wrong result by codegen.
```scala
scala> val df = Seq((Some(-1.0d), None), (None, Some(-1.0d))).toDF()
df: org.apache.spark.sql.DataFrame = [_1: double, _2: double]
scala> df.show()
+----+----+
| _1| _2|
+----+----+
|-1.0|null|
|null|-1.0|
+----+----+
scala> df.filter("_1 <=> _2").show()
+----+----+
| _1| _2|
+----+----+
|-1.0|null|
|null|-1.0|
+----+----+
```
The result should be empty, but two rows remain.
## How was this patch tested?
Added a test.
Author: Takuya UESHIN <ueshin@databricks.com>
Closes#21094 from ueshin/issues/SPARK-24007/equalnullsafe.
## What changes were proposed in this pull request?
Added `TransitPredicateInOuterJoin` optimization rule that transits constraints from the preserved side of an outer join to the null-supplying side. The constraints of the join operator will remain unchanged.
## How was this patch tested?
Added 3 tests in `InferFiltersFromConstraintsSuite`.
Author: maryannxue <maryann.xue@gmail.com>
Closes#20816 from maryannxue/spark-21479.
## What changes were proposed in this pull request?
We are using `CodegenContext.freshName` to get a unique name for any new variable we are adding. Unfortunately, this method currently fails to create a unique name when we request more than one instance of variables with starting name `name1` and an instance with starting name `name11`.
The PR changes the way a new name is generated by `CodegenContext.freshName` so that we generate unique names in this scenario too.
## How was this patch tested?
added UT
Author: Marco Gaido <marcogaido91@gmail.com>
Closes#21080 from mgaido91/SPARK-23986.
## What changes were proposed in this pull request?
There was no check on nullability for arguments of `Tuple`s. This could lead to weird behavior when a null value had to be deserialized into a non-nullable Scala object: in those cases, the `null` got silently transformed into a valid value (like `-1` for `Int`), corresponding to the default value we are using in the SQL codebase. This situation was very likely to happen when deserializing to a Tuple of primitive Scala types (like Double, Int, ...).
The PR adds the `AssertNotNull` to arguments of tuples which have been asked to be converted to non-nullable types.
## How was this patch tested?
added UT
Author: Marco Gaido <marcogaido91@gmail.com>
Closes#20976 from mgaido91/SPARK-23835.
## What changes were proposed in this pull request?
We don't have a good way to sequentially access `UnsafeArrayData` with a common interface such as `Seq`. An example is `MapObject` where we need to access several sequence collection types together. But `UnsafeArrayData` doesn't implement `ArrayData.array`. Calling `toArray` will copy the entire array. We can provide an `IndexedSeq` wrapper for `ArrayData`, so we can avoid copying the entire array.
## How was this patch tested?
Added test.
Author: Liang-Chi Hsieh <viirya@gmail.com>
Closes#20984 from viirya/SPARK-23875.
## What changes were proposed in this pull request?
The PR adds the SQL function `array_min`. It takes an array as argument and returns the minimum value in it.
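For illustration (spark-shell, usual `spark` session):
```scala
spark.sql("SELECT array_min(array(1, 20, null, 3))").show(false)
// expected: 1 (null entries are skipped)
```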
## How was this patch tested?
added UTs
Author: Marco Gaido <marcogaido91@gmail.com>
Closes#21025 from mgaido91/SPARK-23918.
## What changes were proposed in this pull request?
Currently, interpreted execution of `LambdaVariable` just uses `InternalRow.get` to access element. We should use specified accessors if possible.
## How was this patch tested?
Added test.
Author: Liang-Chi Hsieh <viirya@gmail.com>
Closes#20981 from viirya/SPARK-23873.
## What changes were proposed in this pull request?
The PR adds the SQL function `array_max`. It takes an array as argument and returns the maximum value in it.
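For illustration (spark-shell, usual `spark` session):
```scala
spark.sql("SELECT array_max(array(1, 20, null, 3))").show(false)
// expected: 20 (null entries are skipped)
```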
## How was this patch tested?
added UTs
Author: Marco Gaido <marcogaido91@gmail.com>
Closes#21024 from mgaido91/SPARK-23917.
## What changes were proposed in this pull request?
Add UDF weekday
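A quick spark-shell example (usual `spark` session; assumes the Hive-compatible convention where Monday is 0):
```scala
spark.sql("SELECT weekday('2009-07-30')").show(false)
// expected: 3 (0 = Monday ... 6 = Sunday; 2009-07-30 was a Thursday)
```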
## How was this patch tested?
A new test
Author: yucai <yyu1@ebay.com>
Closes#21009 from yucai/SPARK-23905.
## What changes were proposed in this pull request?
This PR tries to use `MemoryBlock` in `UTF8StringBuffer`. In general, there are two advantages to using `MemoryBlock`.
1. It has cleaner API calls than using a Java array or `PlatformMemory`.
2. It improves the runtime performance of memory access compared to using `Object`.
## How was this patch tested?
Added `UTF8StringBufferSuite`
Author: Kazuaki Ishizaki <ishizaki@jp.ibm.com>
Closes#20871 from kiszk/SPARK-23762.
## What changes were proposed in this pull request?
This PR refactors the newly added `ExprValue` API quite a bit. The following changes are introduced:
1. `ExprValue` now uses the actual class instead of the class name as its type. This should give some more flexibility with generating code in the future.
2. Renamed `StatementValue` to `SimpleExprValue`. The statement concept is broader than an expression (untyped, and it cannot be on the right hand side of an assignment), and this was not really what we were using it for. I have added a top level `JavaCode` trait that can be used in the future to reinstate (no pun intended) a statement-like code fragment.
3. Added factory methods to the `JavaCode` companion object to make it slightly less verbose to create `JavaCode`/`ExprValue` objects. This is also what makes the diff quite large.
4. Added one more factory method to `ExprCode` to make it easier to create code-less expressions.
## How was this patch tested?
Existing tests.
Author: Herman van Hovell <hvanhovell@databricks.com>
Closes#21026 from hvanhovell/SPARK-23951.
## What changes were proposed in this pull request?
The codegen output of `Expression`, aka `ExprCode`, now encapsulates only strings of the output value (`value`) and nullability (`isNull`). This makes it difficult for us to know what the output really is. I think it is better if we can add wrappers for the value and nullability that let us easily know that.
## How was this patch tested?
Existing tests.
Author: Liang-Chi Hsieh <viirya@gmail.com>
Closes#20043 from viirya/SPARK-22856.
## What changes were proposed in this pull request?
This PR avoids a possible overflow in the operation `long = (long)(int * int)`. The multiplication of large positive integer values may set the MSB to one, which leads to a negative value in the long while we expected a positive value (e.g. `0111_0000_0000_0000 * 0000_0000_0000_0010`).
This PR performs long cast before the multiplication to avoid this situation.
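A self-contained sketch of the issue (illustrative only, not the patched Spark code):
```scala
// 2^30 is a large positive Int; multiplying by 4 overflows Int before any widening happens.
val a = 0x40000000
val b = 4
val overflowed: Long = a * b        // Int multiplication wraps around first, then widens: 0
val correct: Long = a.toLong * b    // widen to Long before multiplying: 4294967296
```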
## How was this patch tested?
Existing UTs
Author: Kazuaki Ishizaki <ishizaki@jp.ibm.com>
Closes#21002 from kiszk/SPARK-23893.
Fixes https://issues.apache.org/jira/browse/SPARK-23823
Keep origin for all the methods using transformExpression
## What changes were proposed in this pull request?
Keep origin in transformExpression
## How was this patch tested?
Manually tested that this fixes https://issues.apache.org/jira/browse/SPARK-23823 and columns have correct origins after Analyzer.analyze
Author: JiahuiJiang <jjiang@palantir.com>
Author: Jiahui Jiang <jjiang@palantir.com>
Closes#20961 from JiahuiJiang/jj/keep-origin.
## What changes were proposed in this pull request?
This PR allows us to use one of several types of `MemoryBlock`, such as a byte array, int array, long array, or `java.nio.DirectByteBuffer`. Using `java.nio.DirectByteBuffer` allows having off-heap memory which is automatically deallocated by the JVM. The `MemoryBlock` class has primitive accessors like `Platform.getInt()`, `Platform.putInt()`, or `Platform.copyMemory()`.
This PR uses `MemoryBlock` for `OffHeapColumnVector`, `UTF8String`, and other places. This PR can improve performance of operations involving memory accesses (e.g. `UTF8String.trim`) by 1.8x.
For now, this PR does not use `MemoryBlock` for `BufferHolder` based on cloud-fan's [suggestion](https://github.com/apache/spark/pull/11494#issuecomment-309694290).
Since this PR is a successor of #11494, it closes #11494. Much of the code was ported from #11494, and a lot of effort went into it. **I think this PR should be credited to yzotov.**
This PR can achieve **1.1-1.4x performance improvements** for operations in `UTF8String` or `Murmur3_x86_32`. Other operations show roughly comparable performance.
Without this PR
```
OpenJDK 64-Bit Server VM 1.8.0_121-8u121-b13-0ubuntu1.16.04.2-b13 on Linux 4.4.0-22-generic
Intel(R) Xeon(R) CPU E5-2667 v3 3.20GHz
Hash byte arrays with length 268435487: Best/Avg Time(ms) Rate(M/s) Per Row(ns) Relative
------------------------------------------------------------------------------------------------
Murmur3_x86_32 526 / 536 0.0 131399881.5 1.0X
UTF8String benchmark: Best/Avg Time(ms) Rate(M/s) Per Row(ns) Relative
------------------------------------------------------------------------------------------------
hashCode 525 / 552 1022.6 1.0 1.0X
substring 414 / 423 1298.0 0.8 1.3X
```
With this PR
```
OpenJDK 64-Bit Server VM 1.8.0_121-8u121-b13-0ubuntu1.16.04.2-b13 on Linux 4.4.0-22-generic
Intel(R) Xeon(R) CPU E5-2667 v3 3.20GHz
Hash byte arrays with length 268435487: Best/Avg Time(ms) Rate(M/s) Per Row(ns) Relative
------------------------------------------------------------------------------------------------
Murmur3_x86_32 474 / 488 0.0 118552232.0 1.0X
UTF8String benchmark: Best/Avg Time(ms) Rate(M/s) Per Row(ns) Relative
------------------------------------------------------------------------------------------------
hashCode 476 / 480 1127.3 0.9 1.0X
substring 287 / 291 1869.9 0.5 1.7X
```
Benchmark program
```scala
// Assumes the surrounding suite's imports (Benchmark, MemoryBlock,
// ByteArrayMemoryBlock, HiveHasher, Platform, UTF8String, Random, ...).
test("benchmark Murmur3_x86_32") {
  val length = 8192 * 32768 + 31
  val seed = 42L
  val iters = 1 << 2
  val numArrays = 1 << 3  // assumed value; not defined in the original snippet
  val random = new Random(seed)
  val arrays = Array.fill[MemoryBlock](numArrays) {
    val bytes = new Array[Byte](length)
    random.nextBytes(bytes)
    new ByteArrayMemoryBlock(bytes, Platform.BYTE_ARRAY_OFFSET, length)
  }
  val benchmark = new Benchmark("Hash byte arrays with length " + length,
    iters * numArrays, minNumIters = 20)
  benchmark.addCase("HiveHasher") { _: Int =>
    var sum = 0L
    // Hash every array on each iteration (the loop index was undefined in the original snippet).
    for (_ <- 0L until iters; i <- 0 until numArrays) {
      sum += HiveHasher.hashUnsafeBytesBlock(
        arrays(i), Platform.BYTE_ARRAY_OFFSET, length)
    }
  }
  benchmark.run()
}

test("benchmark UTF8String") {
  val N = 512 * 1024 * 1024
  val iters = 2
  val benchmark = new Benchmark("UTF8String benchmark", N, minNumIters = 20)
  // Build an N-character string of spaces, then wrap it as a UTF8String.
  val str0 = new java.io.StringWriter() { { for (i <- 0 until N) { write(" ") } } }.toString
  val s0 = UTF8String.fromString(str0)
  benchmark.addCase("hashCode") { _: Int =>
    var h: Int = 0
    for (_ <- 0L until iters) { h += s0.hashCode }
  }
  benchmark.addCase("substring") { _: Int =>
    var s: UTF8String = null
    for (_ <- 0L until iters) { s = s0.substring(N / 2 - 5, N / 2 + 5) }
  }
  benchmark.run()
}
```
I ran [this benchmark program](https://gist.github.com/kiszk/94f75b506c93a663bbbc372ffe8f05de) using [the commit](ee5a79861c) and got the following results:
```
OpenJDK 64-Bit Server VM 1.8.0_151-8u151-b12-0ubuntu0.16.04.2-b12 on Linux 4.4.0-66-generic
Intel(R) Xeon(R) CPU E5-2667 v3 3.20GHz
Memory access benchmarks: Best/Avg Time(ms) Rate(M/s) Per Row(ns) Relative
------------------------------------------------------------------------------------------------
ByteArrayMemoryBlock get/putInt() 220 / 221 609.3 1.6 1.0X
Platform get/putInt(byte[]) 220 / 236 610.9 1.6 1.0X
Platform get/putInt(Object) 492 / 494 272.8 3.7 0.4X
OnHeapMemoryBlock get/putLong() 322 / 323 416.5 2.4 0.7X
long[] 221 / 221 608.0 1.6 1.0X
Platform get/putLong(long[]) 321 / 321 418.7 2.4 0.7X
Platform get/putLong(Object) 561 / 563 239.2 4.2 0.4X
```
I also ran [this benchmark program](https://gist.github.com/kiszk/5fdb4e03733a5d110421177e289d1fb5) to compare the performance of `Platform.copyMemory()`.
```
OpenJDK 64-Bit Server VM 1.8.0_151-8u151-b12-0ubuntu0.16.04.2-b12 on Linux 4.4.0-66-generic
Intel(R) Xeon(R) CPU E5-2667 v3 3.20GHz
Platform copyMemory: Best/Avg Time(ms) Rate(M/s) Per Row(ns) Relative
------------------------------------------------------------------------------------------------
Object to Object 1961 / 1967 8.6 116.9 1.0X
System.arraycopy Object to Object 1917 / 1921 8.8 114.3 1.0X
byte array to byte array 1961 / 1968 8.6 116.9 1.0X
System.arraycopy byte array to byte array 1909 / 1937 8.8 113.8 1.0X
int array to int array 1921 / 1990 8.7 114.5 1.0X
double array to double array 1918 / 1923 8.7 114.3 1.0X
Object to byte array 1961 / 1967 8.6 116.9 1.0X
Object to short array 1965 / 1972 8.5 117.1 1.0X
Object to int array 1910 / 1915 8.8 113.9 1.0X
Object to float array 1971 / 1978 8.5 117.5 1.0X
Object to double array 1919 / 1944 8.7 114.4 1.0X
byte array to Object 1959 / 1967 8.6 116.8 1.0X
int array to Object 1961 / 1970 8.6 116.9 1.0X
double array to Object 1917 / 1924 8.8 114.3 1.0X
```
These results show the following facts:
1. According to the second/third and sixth/seventh results in the first experiment, `Platform.get/putInt(Object)` is more than 2x slower than `Platform.get/putInt(byte[])` with a concrete type (i.e. `byte[]`).
2. According to the second/third and fourth/fifth/sixth results in the first experiment, the fastest way to access an array element on the Java heap is `array[]`. **The drawback of `array[]` is that it cannot support unaligned 8-byte access.**
3. According to the first/second/third and fourth/sixth/seventh results in the first experiment, `getInt()/putInt()` or `getLong()/putLong()` in subclasses of `MemoryBlock` achieve performance comparable to `Platform.get/putInt()` or `Platform.get/putLong()` with a concrete type (second or sixth result). There is no overhead from the virtual call.
4. According to the results in the second experiment, passing `Object` to `Platform.copy()` achieves the same performance as passing any type of primitive array as the source or destination.
5. According to the second/fourth results in the second experiment, `Platform.copy()` achieves the same performance as `System.arraycopy`. **It would be good to use `Platform.copy()` since it can take any type for src and dst.**
We will incrementally replace `Platform.get/putXXX` with `MemoryBlock.get/putXXX`, because it has two advantages (a sketch of the abstraction is below):
1) Better performance, thanks to having a concrete type for the backing array.
2) A simple object-oriented design instead of passing `Object`.
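To illustrate the idea, here is a minimal sketch in which each block subclass owns its backing storage and exposes typed accessors, so callers never pass around a bare `Object` plus offset. `MemBlock`, `ByteArrayBlock`, and `DirectBlock` are simplified stand-ins, not Spark's actual `MemoryBlock` hierarchy.
```scala
// Simplified stand-in for a MemoryBlock-style abstraction; not Spark's API.
import java.nio.ByteBuffer

trait MemBlock {
  def getInt(offset: Int): Int
  def putInt(offset: Int, value: Int): Unit
}

// On-heap block backed by a concrete byte[], so the JIT sees the exact type.
final class ByteArrayBlock(bytes: Array[Byte]) extends MemBlock {
  private val buf = ByteBuffer.wrap(bytes)
  def getInt(offset: Int): Int = buf.getInt(offset)
  def putInt(offset: Int, value: Int): Unit = { buf.putInt(offset, value); () }
}

// Off-heap block backed by a direct buffer, deallocated by the JVM.
final class DirectBlock(size: Int) extends MemBlock {
  private val buf = ByteBuffer.allocateDirect(size)
  def getInt(offset: Int): Int = buf.getInt(offset)
  def putInt(offset: Int, value: Int): Unit = { buf.putInt(offset, value); () }
}

object MemBlockDemo extends App {
  // Callers work against the typed accessors instead of Object + offset.
  val blocks: Seq[MemBlock] = Seq(new ByteArrayBlock(new Array[Byte](64)), new DirectBlock(64))
  blocks.foreach { b => b.putInt(0, 42); println(b.getInt(0)) }
}
```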
It is easy to use `MemoryBlock` in `InternalRow`, `BufferHolder`, `TaskMemoryManager`, and other classes that are already abstracted, but it is not easy in utility classes related to hashing and others.
Other candidates are
- UnsafeRow, UnsafeArrayData, UnsafeMapData, SpecificUnsafeRowJoiner
- UTF8StringBuffer
- BufferHolder
- TaskMemoryManager
- OnHeapColumnVector
- BytesToBytesMap
- CachedBatch
- classes for hash
- others.
## How was this patch tested?
Added `UnsafeMemoryAllocator`
Author: Kazuaki Ishizaki <ishizaki@jp.ibm.com>
Closes#19222 from kiszk/SPARK-10399.
## What changes were proposed in this pull request?
Add interpreted execution for `InitializeJavaBean` expression.
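For background, interpreted execution here means evaluating the expression directly rather than through generated Java code. Below is a minimal, self-contained sketch of the general mechanism (reflectively calling bean setters on an instance); `Person` and the setter map are illustrative only and this is not Spark's actual implementation.
```scala
// Illustrative only: a tiny reflection-based "initialize java bean" step.
class Person {
  private var name: String = _
  private var age: Int = _
  def setName(n: String): Unit = { name = n }
  def setAge(a: Int): Unit = { age = a }
  override def toString: String = s"Person($name, $age)"
}

object InitializeBeanDemo extends App {
  // setterName -> (parameterClass, argumentValue)
  val setters: Map[String, (Class[_], AnyRef)] = Map(
    "setName" -> (classOf[String], "Alice"),
    "setAge"  -> (classOf[Int], Int.box(30))
  )

  val bean = new Person
  // Look up each setter by name and parameter type, then invoke it.
  setters.foreach { case (methodName, (paramClass, arg)) =>
    val method = bean.getClass.getMethod(methodName, paramClass)
    method.invoke(bean, arg)
  }
  println(bean)  // Person(Alice, 30)
}
```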
## How was this patch tested?
Added unit test.
Author: Liang-Chi Hsieh <viirya@gmail.com>
Closes#20985 from viirya/SPARK-23593-2.
## What changes were proposed in this pull request?
This PR adds interpreted execution for `StaticInvoke`.
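As with the other object expressions, interpreted execution here evaluates the expression without code generation. A minimal illustrative sketch of the underlying mechanism, reflectively invoking a static method, is below; it is not Spark's actual implementation.
```scala
// Illustrative only: reflectively invoking a static method, the general
// mechanism an interpreted StaticInvoke-style expression relies on.
object StaticInvokeDemo extends App {
  // Resolve the static method by declaring class, name, and parameter types.
  val method = classOf[java.lang.Integer].getMethod("parseInt", classOf[String])

  // Static methods are invoked with a null receiver.
  val result = method.invoke(null, "123")
  println(result)  // 123
}
```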
## How was this patch tested?
Added tests in `ObjectExpressionsSuite`.
Author: Kazuaki Ishizaki <ishizaki@jp.ibm.com>
Closes#20753 from kiszk/SPARK-23582.