## What changes were proposed in this pull request?
Both `Coalesce` and `AtLeastNNonNulls` can cause the 64KB limit exception when used with a lot of arguments and/or complex expressions.
This PR splits their expressions in order to avoid the issue.
## How was this patch tested?
Added UTs
Author: Marco Gaido <marcogaido91@gmail.com>
Author: Marco Gaido <mgaido@hortonworks.com>
Closes#19720 from mgaido91/SPARK-22494.
## What changes were proposed in this pull request?
This PR changes `least` and `greatest` code generation to place generated code for expression for arguments into separated methods if these size could be large.
This PR resolved two cases:
* `least` with a lot of argument
* `greatest` with a lot of argument
## How was this patch tested?
Added a new test case into `ArithmeticExpressionsSuite`
Author: Kazuaki Ishizaki <ishizaki@jp.ibm.com>
Closes#19729 from kiszk/SPARK-22499.
## What changes were proposed in this pull request?
This fixes a problem caused by #15880
`select '1.5' > 0.5; // Result is NULL in Spark but is true in Hive.
`
When compare string and numeric, cast them as double like Hive.
Author: liutang123 <liutang123@yeah.net>
Closes#19692 from liutang123/SPARK-22469.
## What changes were proposed in this pull request?
Equi-height histogram is effective in cardinality estimation, and more accurate than basic column stats (min, max, ndv, etc) especially in skew distribution. So we need to support it.
For equi-height histogram, all buckets (intervals) have the same height (frequency).
In this PR, we use a two-step method to generate an equi-height histogram:
1. use `ApproximatePercentile` to get percentiles `p(0), p(1/n), p(2/n) ... p((n-1)/n), p(1)`;
2. construct range values of buckets, e.g. `[p(0), p(1/n)], [p(1/n), p(2/n)] ... [p((n-1)/n), p(1)]`, and use `ApproxCountDistinctForIntervals` to count ndv in each bucket. Each bucket is of the form: `(lowerBound, higherBound, ndv)`.
## How was this patch tested?
Added new test cases and modified some existing test cases.
Author: Zhenhua Wang <wangzhenhua@huawei.com>
Author: Zhenhua Wang <wzh_zju@163.com>
Closes#19479 from wzhfy/generate_histogram.
## What changes were proposed in this pull request?
There is a concern that Spark-side codegen row-by-row filtering might be faster than Parquet's one in general due to type-boxing and additional fuction calls which Spark's one tries to avoid.
So, this PR adds an option to disable/enable record-by-record filtering in Parquet side.
It sets the default to `false` to take the advantage of the improvement.
This was also discussed in https://github.com/apache/spark/pull/14671.
## How was this patch tested?
Manually benchmarks were performed. I generated a billion (1,000,000,000) records and tested equality comparison concatenated with `OR`. This filter combinations were made from 5 to 30.
It seem indeed Spark-filtering is faster in the test case and the gap increased as the filter tree becomes larger.
The details are as below:
**Code**
``` scala
test("Parquet-side filter vs Spark-side filter - record by record") {
withTempPath { path =>
val N = 1000 * 1000 * 1000
val df = spark.range(N).toDF("a")
df.write.parquet(path.getAbsolutePath)
val benchmark = new Benchmark("Parquet-side vs Spark-side", N)
Seq(5, 10, 20, 30).foreach { num =>
val filterExpr = (0 to num).map(i => s"a = $i").mkString(" OR ")
benchmark.addCase(s"Parquet-side filter - number of filters [$num]", 3) { _ =>
withSQLConf(SQLConf.PARQUET_VECTORIZED_READER_ENABLED.key -> false.toString,
SQLConf.PARQUET_RECORD_FILTER_ENABLED.key -> true.toString) {
// We should strip Spark-side filter to compare correctly.
stripSparkFilter(
spark.read.parquet(path.getAbsolutePath).filter(filterExpr)).count()
}
}
benchmark.addCase(s"Spark-side filter - number of filters [$num]", 3) { _ =>
withSQLConf(SQLConf.PARQUET_VECTORIZED_READER_ENABLED.key -> false.toString,
SQLConf.PARQUET_RECORD_FILTER_ENABLED.key -> false.toString) {
spark.read.parquet(path.getAbsolutePath).filter(filterExpr).count()
}
}
}
benchmark.run()
}
}
```
**Result**
```
Parquet-side vs Spark-side: Best/Avg Time(ms) Rate(M/s) Per Row(ns) Relative
------------------------------------------------------------------------------------------------
Parquet-side filter - number of filters [5] 4268 / 4367 234.3 4.3 0.8X
Spark-side filter - number of filters [5] 3709 / 3741 269.6 3.7 0.9X
Parquet-side filter - number of filters [10] 5673 / 5727 176.3 5.7 0.6X
Spark-side filter - number of filters [10] 3588 / 3632 278.7 3.6 0.9X
Parquet-side filter - number of filters [20] 8024 / 8440 124.6 8.0 0.4X
Spark-side filter - number of filters [20] 3912 / 3946 255.6 3.9 0.8X
Parquet-side filter - number of filters [30] 11936 / 12041 83.8 11.9 0.3X
Spark-side filter - number of filters [30] 3929 / 3978 254.5 3.9 0.8X
```
Author: hyukjinkwon <gurwls223@gmail.com>
Closes#15049 from HyukjinKwon/SPARK-17310.
## What changes were proposed in this pull request?
This PR changes `AND` or `OR` code generation to place condition and then expressions' generated code into separated methods if these size could be large. When the method is newly generated, variables for `isNull` and `value` are declared as an instance variable to pass these values (e.g. `isNull1409` and `value1409`) to the callers of the generated method.
This PR resolved two cases:
* large code size of left expression
* large code size of right expression
## How was this patch tested?
Added a new test case into `CodeGenerationSuite`
Author: Kazuaki Ishizaki <ishizaki@jp.ibm.com>
Closes#18972 from kiszk/SPARK-21720.
## What changes were proposed in this pull request?
This PR makes Spark to be able to read Parquet TIMESTAMP_MICROS values, and add a new config to allow Spark to write timestamp values to parquet as TIMESTAMP_MICROS type.
## How was this patch tested?
new test
Author: Wenchen Fan <wenchen@databricks.com>
Closes#19702 from cloud-fan/parquet.
## What changes were proposed in this pull request?
Because of the memory leak issue in `scala.reflect.api.Types.TypeApi.<:<` (https://github.com/scala/bug/issues/8302), creating an encoder may leak memory.
This PR adds `cleanUpReflectionObjects` to clean up these leaking objects for methods calling `scala.reflect.api.Types.TypeApi.<:<`.
## How was this patch tested?
The updated unit tests.
Author: Shixiong Zhu <zsxwing@gmail.com>
Closes#19687 from zsxwing/SPARK-19644.
## What changes were proposed in this pull request?
One powerful feature of `Dataset` is, we can easily map SQL rows to Scala/Java objects and do runtime null check automatically.
For example, let's say we have a parquet file with schema `<a: int, b: string>`, and we have a `case class Data(a: Int, b: String)`. Users can easily read this parquet file into `Data` objects, and Spark will throw NPE if column `a` has null values.
However the null checking is left behind for top-level primitive values. For example, let's say we have a parquet file with schema `<a: Int>`, and we read it into Scala `Int`. If column `a` has null values, we will get some weird results.
```
scala> val ds = spark.read.parquet(...).as[Int]
scala> ds.show()
+----+
|v |
+----+
|null|
|1 |
+----+
scala> ds.collect
res0: Array[Long] = Array(0, 1)
scala> ds.map(_ * 2).show
+-----+
|value|
+-----+
|-2 |
|2 |
+-----+
```
This is because internally Spark use some special default values for primitive types, but never expect users to see/operate these default value directly.
This PR adds null check for top-level primitive values
## How was this patch tested?
new test
Author: Wenchen Fan <wenchen@databricks.com>
Closes#19707 from cloud-fan/bug.
## What changes were proposed in this pull request?
We're building a data lineage tool in which we need to monitor the metadata changes in ExternalCatalog, current ExternalCatalog already provides several useful events like "CreateDatabaseEvent" for custom SparkListener to use. But still there's some event missing, like alter database event and alter table event. So here propose to and new ExternalCatalogEvent.
## How was this patch tested?
Enrich the current UT and tested on local cluster.
CC hvanhovell please let me know your comments about current proposal, thanks.
Author: jerryshao <sshao@hortonworks.com>
Closes#19649 from jerryshao/SPARK-22405.
## What changes were proposed in this pull request?
For a class with field name of special characters, e.g.:
```scala
case class MyType(`field.1`: String, `field 2`: String)
```
Although we can manipulate DataFrame/Dataset, the field names are encoded:
```scala
scala> val df = Seq(MyType("a", "b"), MyType("c", "d")).toDF
df: org.apache.spark.sql.DataFrame = [field$u002E1: string, field$u00202: string]
scala> df.as[MyType].collect
res7: Array[MyType] = Array(MyType(a,b), MyType(c,d))
```
It causes resolving problem when we try to convert the data with non-encoded field names:
```scala
spark.read.json(path).as[MyType]
...
[info] org.apache.spark.sql.AnalysisException: cannot resolve '`field$u002E1`' given input columns: [field 2, fie
ld.1];
[info] at org.apache.spark.sql.catalyst.analysis.package$AnalysisErrorAt.failAnalysis(package.scala:42)
...
```
We should use decoded field name in Dataset schema.
## How was this patch tested?
Added tests.
Author: Liang-Chi Hsieh <viirya@gmail.com>
Closes#19664 from viirya/SPARK-22442.
## What changes were proposed in this pull request?
UDFs that can cause runtime exception on invalid data are not safe to pushdown, because its behavior depends on its position in the query plan. Pushdown of it will risk to change its original behavior.
The example reported in the JIRA and taken as test case shows this issue. We should declare UDFs that can cause runtime exception on invalid data as non-determinstic.
This updates the document of `deterministic` property in `Expression` and states clearly an UDF that can cause runtime exception on some specific input, should be declared as non-determinstic.
## How was this patch tested?
Added test. Manually test.
Author: Liang-Chi Hsieh <viirya@gmail.com>
Closes#19662 from viirya/SPARK-22446.
## What changes were proposed in this pull request?
`spark.sql.statistics.autoUpdate.size` should be `spark.sql.statistics.size.autoUpdate.enabled`. The previous name is confusing as users may treat it as a size config.
This config is in master branch only, no backward compatibility issue.
## How was this patch tested?
N/A
Author: Wenchen Fan <wenchen@databricks.com>
Closes#19667 from cloud-fan/minor.
## What changes were proposed in this pull request?
`CodegenContext.copyResult` is kind of a global status for whole stage codegen. But the tricky part is, it is only used to transfer an information from child to parent when calling the `consume` chain. We have to be super careful in `produce`/`consume`, to set it to true when producing multiple result rows, and set it to false in operators that start new pipeline(like sort).
This PR moves the `copyResult` to `CodegenSupport`, and call it at `WholeStageCodegenExec`. This is much easier to reason about.
## How was this patch tested?
existing tests
Author: Wenchen Fan <wenchen@databricks.com>
Closes#19656 from cloud-fan/whole-sage.
## What changes were proposed in this pull request?
It's not safe in all cases to push down a LIMIT below a FULL OUTER
JOIN. If the limit is pushed to one side of the FOJ, the physical
join operator can not tell if a row in the non-limited side would have a
match in the other side.
*If* the join operator guarantees that unmatched tuples from the limited
side are emitted before any unmatched tuples from the other side,
pushing down the limit is safe. But this is impractical for some join
implementations, e.g. SortMergeJoin.
For now, disable limit pushdown through a FULL OUTER JOIN, and we can
evaluate whether a more complicated solution is necessary in the future.
## How was this patch tested?
Ran org.apache.spark.sql.* tests. Altered full outer join tests in
LimitPushdownSuite.
Author: Henry Robinson <henry@cloudera.com>
Closes#19647 from henryr/spark-22211.
forward-port https://github.com/apache/spark/pull/19622 to master branch.
This bug doesn't exist in master because we've added hive bucketing support and the hive bucketing metadata can be recognized by Spark, but we should still port it to master: 1) there may be other unsupported hive metadata removed by Spark. 2) reduce code difference between master and 2.2 to ease the backport in the feature.
***
When we alter table schema, we set the new schema to spark `CatalogTable`, convert it to hive table, and finally call `hive.alterTable`. This causes a problem in Spark 2.2, because hive bucketing metedata is not recognized by Spark, which means a Spark `CatalogTable` representing a hive table is always non-bucketed, and when we convert it to hive table and call `hive.alterTable`, the original hive bucketing metadata will be removed.
To fix this bug, we should read out the raw hive table metadata, update its schema, and call `hive.alterTable`. By doing this we can guarantee only the schema is changed, and nothing else.
Author: Wenchen Fan <wenchen@databricks.com>
Closes#19644 from cloud-fan/infer.
## What changes were proposed in this pull request?
The current join estimation logic is only based on basic column statistics (such as ndv, etc). If we want to add estimation for other kinds of statistics (such as histograms), it's not easy to incorporate into the current algorithm:
1. When we have multiple pairs of join keys, the current algorithm computes cardinality in a single formula. But if different join keys have different kinds of stats, the computation logic for each pair of join keys become different, so the previous formula does not apply.
2. Currently it computes cardinality and updates join keys' column stats separately. It's better to do these two steps together, since both computation and update logic are different for different kinds of stats.
## How was this patch tested?
Only refactor, covered by existing tests.
Author: Zhenhua Wang <wangzhenhua@huawei.com>
Closes#19531 from wzhfy/join_est_refactor.
## What changes were proposed in this pull request?
In `UnsafeInMemorySorter`, one record may take 32 bytes: 1 `long` for pointer, 1 `long` for key-prefix, and another 2 `long`s as the temporary buffer for radix sort.
In `UnsafeExternalSorter`, we set the `DEFAULT_NUM_ELEMENTS_FOR_SPILL_THRESHOLD` to be `1024 * 1024 * 1024 / 2`, and hoping the max size of point array to be 8 GB. However this is wrong, `1024 * 1024 * 1024 / 2 * 32` is actually 16 GB, and if we grow the point array before reach this limitation, we may hit the max-page-size error.
Users may see exception like this on large dataset:
```
Caused by: java.lang.IllegalArgumentException: Cannot allocate a page with more than 17179869176 bytes
at org.apache.spark.memory.TaskMemoryManager.allocatePage(TaskMemoryManager.java:241)
at org.apache.spark.memory.MemoryConsumer.allocatePage(MemoryConsumer.java:121)
at org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter.acquireNewPageIfNecessary(UnsafeExternalSorter.java:374)
at org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter.insertRecord(UnsafeExternalSorter.java:396)
at org.apache.spark.sql.execution.UnsafeExternalRowSorter.insertRow(UnsafeExternalRowSorter.java:94)
...
```
Setting `DEFAULT_NUM_ELEMENTS_FOR_SPILL_THRESHOLD` to a smaller number is not enough, users can still set the config to a big number and trigger the too large page size issue. This PR fixes it by explicitly handling the too large page size exception in the sorter and spill.
This PR also change the type of `spark.shuffle.spill.numElementsForceSpillThreshold` to int, because it's only compared with `numRecords`, which is an int. This is an internal conf so we don't have a serious compatibility issue.
## How was this patch tested?
TODO
Author: Wenchen Fan <wenchen@databricks.com>
Closes#18251 from cloud-fan/sort.
## What changes were proposed in this pull request?
This issue was discovered and investigated by Ohad Raviv and Sean Owen in https://issues.apache.org/jira/browse/SPARK-21657. The input data of `MapObjects` may be a `List` which has O(n) complexity for accessing by index. When converting input data to catalyst array, `MapObjects` gets element by index in each loop, and results to bad performance.
This PR fixes this issue by accessing elements via Iterator.
## How was this patch tested?
using the test script in https://issues.apache.org/jira/browse/SPARK-21657
```
val BASE = 100000000
val N = 100000
val df = sc.parallelize(List(("1234567890", (BASE to (BASE+N)).map(x => (x.toString, (x+1).toString, (x+2).toString, (x+3).toString)).toList ))).toDF("c1", "c_arr")
spark.time(df.queryExecution.toRdd.foreach(_ => ()))
```
We can see 50x speed up.
Author: Wenchen Fan <wenchen@databricks.com>
Closes#19603 from cloud-fan/map-objects.
## What changes were proposed in this pull request?
Fix three deprecation warnings introduced by move to ANTLR 4.7:
* Use ParserRuleContext.addChild(TerminalNode) in preference to
deprecated ParserRuleContext.addChild(Token) interface.
* TokenStream.reset() is deprecated in favour of seek(0)
* Replace use of deprecated ANTLRInputStream with stream returned by
CharStreams.fromString()
The last item changed the way we construct ANTLR's input stream (from
direct instantiation to factory construction), so necessitated a change
to how we override the LA() method to always return an upper-case
char. The ANTLR object is now wrapped, rather than inherited-from.
* Also fix incorrect usage of CharStream.getText() which expects the rhs
of the supplied interval to be the last char to be returned, i.e. the
interval is inclusive, and work around bug in ANTLR 4.7 where empty
streams or intervals may cause getText() to throw an error.
## How was this patch tested?
Ran all the sql tests. Confirmed that LA() override has coverage by
breaking it, and noting that tests failed.
Author: Henry Robinson <henry@apache.org>
Closes#19578 from henryr/spark-21983.
## What changes were proposed in this pull request?
This is a followup of https://github.com/apache/spark/pull/17075 , to fix the bug in codegen path.
## How was this patch tested?
new regression test
Author: Wenchen Fan <wenchen@databricks.com>
Closes#19576 from cloud-fan/bug.
## What changes were proposed in this pull request?
`ArrowEvalPythonExec` and `FlatMapGroupsInPandasExec` are refering config values of `SQLConf` in function for `mapPartitions`/`mapPartitionsInternal`, but we should capture them in Driver.
## How was this patch tested?
Added a test and existing tests.
Author: Takuya UESHIN <ueshin@databricks.com>
Closes#19587 from ueshin/issues/SPARK-22370.
## What changes were proposed in this pull request?
Canonicalized plans are not supposed to be executed. I ran into a case in which there's some code that accidentally calls execute on a canonicalized plan. This patch throws a more explicit exception when that happens.
## How was this patch tested?
Added a test case in SparkPlanSuite.
Author: Reynold Xin <rxin@databricks.com>
Closes#18828 from rxin/SPARK-21619.
## What changes were proposed in this pull request?
https://issues.apache.org/jira/browse/SPARK-22333
In current version, users can use CURRENT_DATE() and CURRENT_TIMESTAMP() without specifying braces.
However, when a table has columns named as "current_date" or "current_timestamp", it will still be parsed as function call.
There are many such cases in our production cluster. We get the wrong answer due to this inappropriate behevior. In general, ColumnReference should get higher priority than timeFunctionCall.
## How was this patch tested?
unit test
manul test
Author: donnyzone <wellfengzhu@gmail.com>
Closes#19559 from DonnyZone/master.
## What changes were proposed in this pull request?
Adds a new optimisation rule 'ReplaceExceptWithNotFilter' that replaces Except logical with Filter operator and schedule it before applying 'ReplaceExceptWithAntiJoin' rule. This way we can avoid expensive join operation if one or both of the datasets of the Except operation are fully derived out of Filters from a same parent.
## How was this patch tested?
The patch is tested locally using spark-shell + unit test.
Author: Sathiya <sathiya.kumar@polytechnique.edu>
Closes#19451 from sathiyapk/SPARK-22181-optimize-exceptWithFilter.
## What changes were proposed in this pull request?
SPARK-18016 introduced `NestedClass` to avoid that the many methods generated by `splitExpressions` contribute to the outer class' constant pool, making it growing too much. Unfortunately, despite their definition is stored in the `NestedClass`, they all are invoked in the outer class and for each method invocation, there are two entries added to the constant pool: a `Methodref` and a `Utf8` entry (you can easily check this compiling a simple sample class with `janinoc` and looking at its Constant Pool). This limits the scalability of the solution with very large methods which are split in a lot of small ones. This means that currently we are generating classes like this one:
```
class SpecificUnsafeProjection extends org.apache.spark.sql.catalyst.expressions.UnsafeProjection {
...
public UnsafeRow apply(InternalRow i) {
rowWriter.zeroOutNullBytes();
apply_0(i);
apply_1(i);
...
nestedClassInstance.apply_862(i);
nestedClassInstance.apply_863(i);
...
nestedClassInstance1.apply_1612(i);
nestedClassInstance1.apply_1613(i);
...
}
...
private class NestedClass {
private void apply_862(InternalRow i) { ... }
private void apply_863(InternalRow i) { ... }
...
}
private class NestedClass1 {
private void apply_1612(InternalRow i) { ... }
private void apply_1613(InternalRow i) { ... }
...
}
}
```
This PR reduce the Constant Pool size of the outer class by adding a new method to each nested class: in this method we invoke all the small methods generated by `splitExpression` in that nested class. In this way, in the outer class there is only one method invocation per nested class, reducing by orders of magnitude the entries in its constant pool because of method invocations. This means that after the patch the generated code becomes:
```
class SpecificUnsafeProjection extends org.apache.spark.sql.catalyst.expressions.UnsafeProjection {
...
public UnsafeRow apply(InternalRow i) {
rowWriter.zeroOutNullBytes();
apply_0(i);
apply_1(i);
...
nestedClassInstance.apply(i);
nestedClassInstance1.apply(i);
...
}
...
private class NestedClass {
private void apply_862(InternalRow i) { ... }
private void apply_863(InternalRow i) { ... }
...
private void apply(InternalRow i) {
apply_862(i);
apply_863(i);
...
}
}
private class NestedClass1 {
private void apply_1612(InternalRow i) { ... }
private void apply_1613(InternalRow i) { ... }
...
private void apply(InternalRow i) {
apply_1612(i);
apply_1613(i);
...
}
}
}
```
## How was this patch tested?
Added UT and existing UTs
Author: Marco Gaido <mgaido@hortonworks.com>
Author: Marco Gaido <marcogaido91@gmail.com>
Closes#19480 from mgaido91/SPARK-22226.
## What changes were proposed in this pull request?
This PR is to clean the related codes majorly based on the today's code review on https://github.com/apache/spark/pull/19559
## How was this patch tested?
N/A
Author: gatorsmile <gatorsmile@gmail.com>
Closes#19585 from gatorsmile/trivialFixes.
## What changes were proposed in this pull request?
Add a flag "spark.sql.files.ignoreMissingFiles" to parallel the existing flag "spark.sql.files.ignoreCorruptFiles".
## How was this patch tested?
new unit test
Author: Jose Torres <jose@databricks.com>
Closes#19581 from joseph-torres/SPARK-22366.
## What changes were proposed in this pull request?
Rewritten error message for clarity. Added extra information in case of attribute name collision, hinting the user to double-check referencing two different tables
## How was this patch tested?
No functional changes, only final message has changed. It has been tested manually against the situation proposed in the JIRA ticket. Automated tests in repository pass.
This PR is original work from me and I license this work to the Spark project
Author: Ruben Berenguel Montoro <ruben@mostlymaths.net>
Author: Ruben Berenguel Montoro <ruben@dreamattic.com>
Author: Ruben Berenguel <ruben@mostlymaths.net>
Closes#17100 from rberenguel/SPARK-13947-error-message.
## What changes were proposed in this pull request?
For performance reason, we should resolve in operation on an empty list as false in the optimizations phase, ad discussed in #19522.
## How was this patch tested?
Added UT
cc gatorsmile
Author: Marco Gaido <marcogaido91@gmail.com>
Author: Marco Gaido <mgaido@hortonworks.com>
Closes#19523 from mgaido91/SPARK-22301.
## What changes were proposed in this pull request?
The current implementation of `ApproxCountDistinctForIntervals` is `ImperativeAggregate`. The number of `aggBufferAttributes` is the number of total words in the hllppHelper array. Each hllppHelper has 52 words by default relativeSD.
Since this aggregate function is used in equi-height histogram generation, and the number of buckets in histogram is usually hundreds, the number of `aggBufferAttributes` can easily reach tens of thousands or even more.
This leads to a huge method in codegen and causes error:
```
org.codehaus.janino.JaninoRuntimeException: Code of method "apply(Lorg/apache/spark/sql/catalyst/InternalRow;)Lorg/apache/spark/sql/catalyst/expressions/UnsafeRow;" of class "org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection" grows beyond 64 KB.
```
Besides, huge generated methods also result in performance regression.
In this PR, we change its implementation to `TypedImperativeAggregate`. After the fix, `ApproxCountDistinctForIntervals` can deal with more than thousands endpoints without throwing codegen error, and improve performance from `20 sec` to `2 sec` in a test case of 500 endpoints.
## How was this patch tested?
Test by an added test case and existing tests.
Author: Zhenhua Wang <wangzhenhua@huawei.com>
Closes#19506 from wzhfy/change_forIntervals_typedAgg.
## What changes were proposed in this pull request?
This is a follow-up PR of https://github.com/apache/spark/pull/17633.
This PR is to add a conf `spark.sql.hive.advancedPartitionPredicatePushdown.enabled`, which can be used to turn the enhancement off.
## How was this patch tested?
Add a test case
Author: gatorsmile <gatorsmile@gmail.com>
Closes#19547 from gatorsmile/Spark20331FollowUp.
## What changes were proposed in this pull request?
Plan equality should be computed by `canonicalized`, so we can remove unnecessary `hashCode` and `equals` methods.
## How was this patch tested?
Existing tests.
Author: Zhenhua Wang <wangzhenhua@huawei.com>
Closes#19539 from wzhfy/remove_equals.
## What changes were proposed in this pull request?
This is a follow-up of #18732.
This pr modifies `GroupedData.apply()` method to convert pandas udf to grouped udf implicitly.
## How was this patch tested?
Exisiting tests.
Author: Takuya UESHIN <ueshin@databricks.com>
Closes#19517 from ueshin/issues/SPARK-20396/fup2.
## What changes were proposed in this pull request?
spark does not support grouping__id, it has grouping_id() instead.
But it is not convenient for hive user to change to spark-sql
so this pr is to replace grouping__id with grouping_id()
hive user need not to alter their scripts
## How was this patch tested?
test with SQLQuerySuite.scala
Author: CenYuhai <yuhai.cen@ele.me>
Closes#18270 from cenyuhai/SPARK-21055.
## What changes were proposed in this pull request?
To let the same aggregate function that appear multiple times in an Aggregate be evaluated only once, we need to deduplicate the aggregate expressions. The original code was trying to use a "distinct" call to get a set of aggregate expressions, but did not work, since the "distinct" did not compare semantic equality. And even if it did, further work should be done in result expression rewriting.
In this PR, I changed the "set" to a map mapping the semantic identity of a aggregate expression to itself. Thus, later on, when rewriting result expressions (i.e., output expressions), the aggregate expression reference can be fixed.
## How was this patch tested?
Added a new test in SQLQuerySuite
Author: maryannxue <maryann.xue@gmail.com>
Closes#19488 from maryannxue/spark-22266.
## What changes were proposed in this pull request?
In Average.scala, it has
```
override lazy val evaluateExpression = child.dataType match {
case DecimalType.Fixed(p, s) =>
// increase the precision and scale to prevent precision loss
val dt = DecimalType.bounded(p + 14, s + 4)
Cast(Cast(sum, dt) / Cast(count, dt), resultType)
case _ =>
Cast(sum, resultType) / Cast(count, resultType)
}
def setChild (newchild: Expression) = {
child = newchild
}
```
It is possible that Cast(count, dt), resultType) will make the precision of the decimal number bigger than 38, and this causes over flow. Since count is an integer and doesn't need a scale, I will cast it using DecimalType.bounded(38,0)
## How was this patch tested?
In DataFrameSuite, I will add a test case.
Please review http://spark.apache.org/contributing.html before opening a pull request.
Author: Huaxin Gao <huaxing@us.ibm.com>
Closes#19496 from huaxingao/spark-22271.
## What changes were proposed in this pull request?
In EnsureStatefulOpPartitioning, we check that the inputRDD to a SparkPlan has the expected partitioning for Streaming Stateful Operators. The problem is that we are not allowed to access this information during planning.
The reason we added that check was because CoalesceExec could actually create RDDs with 0 partitions. We should fix it such that when CoalesceExec says that there is a SinglePartition, there is in fact an inputRDD of 1 partition instead of 0 partitions.
## How was this patch tested?
Regression test in StreamingQuerySuite
Author: Burak Yavuz <brkyvz@gmail.com>
Closes#19467 from brkyvz/stateful-op.
## What changes were proposed in this pull request?
This is a minor folllowup of #19474 .
#19474 partially reverted #18064 but accidentally introduced a behavior change. `Command` extended `LogicalPlan` before #18064 , but #19474 made it extend `LeafNode`. This is an internal behavior change as now all `Command` subclasses can't define children, and they have to implement `computeStatistic` method.
This PR fixes this by making `Command` extend `LogicalPlan`
## How was this patch tested?
N/A
Author: Wenchen Fan <wenchen@databricks.com>
Closes#19493 from cloud-fan/minor.
## What changes were proposed in this pull request?
This is an effort to reduce the difference between Hive and Spark. Spark supports case-sensitivity in columns. Especially, for Struct types, with `spark.sql.caseSensitive=true`, the following is supported.
```scala
scala> sql("select named_struct('a', 1, 'A', 2).a").show
+--------------------------+
|named_struct(a, 1, A, 2).a|
+--------------------------+
| 1|
+--------------------------+
scala> sql("select named_struct('a', 1, 'A', 2).A").show
+--------------------------+
|named_struct(a, 1, A, 2).A|
+--------------------------+
| 2|
+--------------------------+
```
And vice versa, with `spark.sql.caseSensitive=false`, the following is supported.
```scala
scala> sql("select named_struct('a', 1).A, named_struct('A', 1).a").show
+--------------------+--------------------+
|named_struct(a, 1).A|named_struct(A, 1).a|
+--------------------+--------------------+
| 1| 1|
+--------------------+--------------------+
```
However, types are considered different. For example, SET operations fail.
```scala
scala> sql("SELECT named_struct('a',1) union all (select named_struct('A',2))").show
org.apache.spark.sql.AnalysisException: Union can only be performed on tables with the compatible column types. struct<A:int> <> struct<a:int> at the first column of the second table;;
'Union
:- Project [named_struct(a, 1) AS named_struct(a, 1)#57]
: +- OneRowRelation$
+- Project [named_struct(A, 2) AS named_struct(A, 2)#58]
+- OneRowRelation$
```
This PR aims to support case-insensitive type equality. For example, in Set operation, the above operation succeed when `spark.sql.caseSensitive=false`.
```scala
scala> sql("SELECT named_struct('a',1) union all (select named_struct('A',2))").show
+------------------+
|named_struct(a, 1)|
+------------------+
| [1]|
| [2]|
+------------------+
```
## How was this patch tested?
Pass the Jenkins with a newly add test case.
Author: Dongjoon Hyun <dongjoon@apache.org>
Closes#18460 from dongjoon-hyun/SPARK-21247.
## What changes were proposed in this pull request?
For non-deterministic expressions, they should be considered as not contained in the [[ExpressionSet]].
This is consistent with how we define `semanticEquals` between two expressions.
Otherwise, combining expressions will remove non-deterministic expressions which should be reserved.
E.g.
Combine filters of
```scala
testRelation.where(Rand(0) > 0.1).where(Rand(0) > 0.1)
```
should result in
```scala
testRelation.where(Rand(0) > 0.1 && Rand(0) > 0.1)
```
## How was this patch tested?
Unit test
Author: Wang Gengliang <ltnwgl@gmail.com>
Closes#19475 from gengliangwang/non-deterministic-expressionSet.
## What changes were proposed in this pull request?
The method `deterministic` is frequently called in optimizer.
Refactor `deterministic` as lazy value, in order to avoid redundant computations.
## How was this patch tested?
Simple benchmark test over TPC-DS queries, run time from query string to optimized plan(continuous 20 runs, and get the average of last 5 results):
Before changes: 12601 ms
After changes: 11993ms
This is 4.8% performance improvement.
Also run test with Unit test.
Author: Wang Gengliang <ltnwgl@gmail.com>
Closes#19478 from gengliangwang/deterministicAsLazyVal.
## What changes were proposed in this pull request?
`ParquetFileFormat` to relax its requirement of output committer class from `org.apache.parquet.hadoop.ParquetOutputCommitter` or subclass thereof (and so implicitly Hadoop `FileOutputCommitter`) to any committer implementing `org.apache.hadoop.mapreduce.OutputCommitter`
This enables output committers which don't write to the filesystem the way `FileOutputCommitter` does to save parquet data from a dataframe: at present you cannot do this.
Before a committer which isn't a subclass of `ParquetOutputCommitter`, it checks to see if the context has requested summary metadata by setting `parquet.enable.summary-metadata`. If true, and the committer class isn't a parquet committer, it raises a RuntimeException with an error message.
(It could downgrade, of course, but raising an exception makes it clear there won't be an summary. It also makes the behaviour testable.)
Note that `SQLConf` already states that any `OutputCommitter` can be used, but that typically it's a subclass of ParquetOutputCommitter. That's not currently true. This patch will make the code consistent with the docs, adding tests to verify,
## How was this patch tested?
The patch includes a test suite, `ParquetCommitterSuite`, with a new committer, `MarkingFileOutputCommitter` which extends `FileOutputCommitter` and writes a marker file in the destination directory. The presence of the marker file can be used to verify the new committer was used. The tests then try the combinations of Parquet committer summary/no-summary and marking committer summary/no-summary.
| committer | summary | outcome |
|-----------|---------|---------|
| parquet | true | success |
| parquet | false | success |
| marking | false | success with marker |
| marking | true | exception |
All tests are happy.
Author: Steve Loughran <stevel@hortonworks.com>
Closes#19448 from steveloughran/cloud/SPARK-22217-committer.
## What changes were proposed in this pull request?
In https://github.com/apache/spark/pull/18064, we allowed `RunnableCommand` to have children in order to fix some UI issues. Then we made `InsertIntoXXX` commands take the input `query` as a child, when we do the actual writing, we just pass the physical plan to the writer(`FileFormatWriter.write`).
However this is problematic. In Spark SQL, optimizer and planner are allowed to change the schema names a little bit. e.g. `ColumnPruning` rule will remove no-op `Project`s, like `Project("A", Scan("a"))`, and thus change the output schema from "<A: int>" to `<a: int>`. When it comes to writing, especially for self-description data format like parquet, we may write the wrong schema to the file and cause null values at the read path.
Fortunately, in https://github.com/apache/spark/pull/18450 , we decided to allow nested execution and one query can map to multiple executions in the UI. This releases the major restriction in #18604 , and now we don't have to take the input `query` as child of `InsertIntoXXX` commands.
So the fix is simple, this PR partially revert #18064 and make `InsertIntoXXX` commands leaf nodes again.
## How was this patch tested?
new regression test
Author: Wenchen Fan <wenchen@databricks.com>
Closes#19474 from cloud-fan/bug.
## What changes were proposed in this pull request?
Currently percentile_approx never returns the first element when percentile is in (relativeError, 1/N], where relativeError default 1/10000, and N is the total number of elements. But ideally, percentiles in [0, 1/N] should all return the first element as the answer.
For example, given input data 1 to 10, if a user queries 10% (or even less) percentile, it should return 1, because the first value 1 already reaches 10%. Currently it returns 2.
Based on the paper, targetError is not rounded up, and searching index should start from 0 instead of 1. By following the paper, we should be able to fix the cases mentioned above.
## How was this patch tested?
Added a new test case and fix existing test cases.
Author: Zhenhua Wang <wzh_zju@163.com>
Closes#19438 from wzhfy/improve_percentile_approx.
## What changes were proposed in this pull request?
Current `CodeGeneraor.splitExpressions` splits statements into methods if the total length of statements is more than 1024 characters. The length may include comments or empty line.
This PR excludes comment or empty line from the length to reduce the number of generated methods in a class, by using `CodeFormatter.stripExtraNewLinesAndComments()` method.
## How was this patch tested?
Existing tests
Author: Kazuaki Ishizaki <ishizaki@jp.ibm.com>
Closes#18966 from kiszk/SPARK-21751.
This change adds a new SQL config key that is equivalent to SparkContext's
"spark.extraListeners", allowing users to register QueryExecutionListener
instances through the Spark configuration system instead of having to
explicitly do it in code.
The code used by SparkContext to implement the feature was refactored into
a helper method in the Utils class, and SQL's ExecutionListenerManager was
modified to use it to initialize listener declared in the configuration.
Unit tests were added to verify all the new functionality.
Author: Marcelo Vanzin <vanzin@cloudera.com>
Closes#19309 from vanzin/SPARK-19558.
## What changes were proposed in this pull request?
This PR adds an apply() function on df.groupby(). apply() takes a pandas udf that is a transformation on `pandas.DataFrame` -> `pandas.DataFrame`.
Static schema
-------------------
```
schema = df.schema
pandas_udf(schema)
def normalize(df):
df = df.assign(v1 = (df.v1 - df.v1.mean()) / df.v1.std()
return df
df.groupBy('id').apply(normalize)
```
Dynamic schema
-----------------------
**This use case is removed from the PR and we will discuss this as a follow up. See discussion https://github.com/apache/spark/pull/18732#pullrequestreview-66583248**
Another example to use pd.DataFrame dtypes as output schema of the udf:
```
sample_df = df.filter(df.id == 1).toPandas()
def foo(df):
ret = # Some transformation on the input pd.DataFrame
return ret
foo_udf = pandas_udf(foo, foo(sample_df).dtypes)
df.groupBy('id').apply(foo_udf)
```
In interactive use case, user usually have a sample pd.DataFrame to test function `foo` in their notebook. Having been able to use `foo(sample_df).dtypes` frees user from specifying the output schema of `foo`.
Design doc: https://github.com/icexelloss/spark/blob/pandas-udf-doc/docs/pyspark-pandas-udf.md
## How was this patch tested?
* Added GroupbyApplyTest
Author: Li Jin <ice.xelloss@gmail.com>
Author: Takuya UESHIN <ueshin@databricks.com>
Author: Bryan Cutler <cutlerb@gmail.com>
Closes#18732 from icexelloss/groupby-apply-SPARK-20396.
## What changes were proposed in this pull request?
We should not break the assumption that the length of the allocated byte array is word rounded:
https://github.com/apache/spark/blob/master/sql/catalyst/src/main/java/org/apache/spark/sql/catalyst/expressions/UnsafeRow.java#L170
So we want to use `Integer.MAX_VALUE - 15` instead of `Integer.MAX_VALUE - 8` as the upper bound of an allocated byte array.
cc: srowen gatorsmile
## How was this patch tested?
Since the Spark unit test JVM has less than 1GB heap, here we run the test code as a submit job, so it can run on a JVM has 4GB memory.
Please review http://spark.apache.org/contributing.html before opening a pull request.
Author: Feng Liu <fengliu@databricks.com>
Closes#19460 from liufengdb/fix_array_max.
## What changes were proposed in this pull request?
This updates the broadcast join code path to lazily decompress pages and
iterate through UnsafeRows to prevent all rows from being held in memory
while the broadcast table is being built.
## How was this patch tested?
Existing tests.
Author: Ryan Blue <blue@apache.org>
Closes#19394 from rdblue/broadcast-driver-memory.
## What changes were proposed in this pull request?
`monotonically_increasing_id` doesn't work in Structured Streaming. We should throw an exception if a streaming query uses it.
## How was this patch tested?
Added test.
Author: Liang-Chi Hsieh <viirya@gmail.com>
Closes#19336 from viirya/SPARK-21947.
## What changes were proposed in this pull request?
In this PR we make a few changes to the list hive partitions code, to make the code more extensible.
The following changes are made:
1. In `HiveClientImpl.getPartitions()`, call `client.getPartitions` instead of `shim.getAllPartitions` when `spec` is empty;
2. In `HiveTableScanExec`, previously we always call `listPartitionsByFilter` if the config `metastorePartitionPruning` is enabled, but actually, we'd better call `listPartitions` if `partitionPruningPred` is empty;
3. We should use sessionCatalog instead of SharedState.externalCatalog in `HiveTableScanExec`.
## How was this patch tested?
Tested by existing test cases since this is code refactor, no regression or behavior change is expected.
Author: Xingbo Jiang <xingbo.jiang@databricks.com>
Closes#19444 from jiangxb1987/hivePartitions.
## What changes were proposed in this pull request?
By definition the table name in Spark can be something like `123x`, `25a`, etc., with exceptions for literals like `12L`, `23BD`, etc. However, Spark SQL has a special byte length literal, which stops users to use digits followed by `b`, `k`, `m`, `g` as identifiers.
byte length literal is not a standard sql literal and is only used in the `tableSample` parser rule. This PR move the parsing of byte length literal from lexer to parser, so that users can use it as identifiers.
## How was this patch tested?
regression test
Author: Wenchen Fan <wenchen@databricks.com>
Closes#19392 from cloud-fan/parser-bug.
## What changes were proposed in this pull request?
This pr added code to check actual bytecode size when compiling generated code. In #18810, we added code to give up code compilation and use interpreter execution in `SparkPlan` if the line number of generated functions goes over `maxLinesPerFunction`. But, we already have code to collect metrics for compiled bytecode size in `CodeGenerator` object. So,we could easily reuse the code for this purpose.
## How was this patch tested?
Added tests in `WholeStageCodegenSuite`.
Author: Takeshi Yamamuro <yamamuro@apache.org>
Closes#19083 from maropu/SPARK-21871.
## What changes were proposed in this pull request?
Allow one-sided outer joins between two streams when a watermark is defined.
## How was this patch tested?
new unit tests
Author: Jose Torres <jose@databricks.com>
Closes#19327 from joseph-torres/outerjoin.
## What changes were proposed in this pull request?
Users could hit `java.lang.NullPointerException` when the tables were created by Hive and the table's owner is `null` that are got from Hive metastore. `DESC EXTENDED` failed with the error:
> SQLExecutionException: java.lang.NullPointerException at scala.collection.immutable.StringOps$.length$extension(StringOps.scala:47) at scala.collection.immutable.StringOps.length(StringOps.scala:47) at scala.collection.IndexedSeqOptimized$class.isEmpty(IndexedSeqOptimized.scala:27) at scala.collection.immutable.StringOps.isEmpty(StringOps.scala:29) at scala.collection.TraversableOnce$class.nonEmpty(TraversableOnce.scala:111) at scala.collection.immutable.StringOps.nonEmpty(StringOps.scala:29) at org.apache.spark.sql.catalyst.catalog.CatalogTable.toLinkedHashMap(interface.scala:300) at org.apache.spark.sql.execution.command.DescribeTableCommand.describeFormattedTableInfo(tables.scala:565) at org.apache.spark.sql.execution.command.DescribeTableCommand.run(tables.scala:543) at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:66) at
## How was this patch tested?
Added a unit test case
Author: gatorsmile <gatorsmile@gmail.com>
Closes#19395 from gatorsmile/desc.
## What changes were proposed in this pull request?
The definition of `maxRows` in `LocalLimit` operator was simply wrong. This patch introduces a new `maxRowsPerPartition` method and uses that in pruning. The patch also adds more documentation on why we need local limit vs global limit.
Note that this previously has never been a bug because the way the code is structured, but future use of the maxRows could lead to bugs.
## How was this patch tested?
Should be covered by existing test cases.
Closes#18851
Author: gatorsmile <gatorsmile@gmail.com>
Author: Reynold Xin <rxin@databricks.com>
Closes#19393 from gatorsmile/pr-18851.
### What changes were proposed in this pull request?
`tempTables` is not right. To be consistent, we need to rename the internal variable names/comments to tempViews in SessionCatalog too.
### How was this patch tested?
N/A
Author: gatorsmile <gatorsmile@gmail.com>
Closes#19117 from gatorsmile/renameTempTablesToTempViews.
## What changes were proposed in this pull request?
Add comments for specifying the position of batch "Check Cartesian Products", as rxin suggested in https://github.com/apache/spark/pull/19362 .
## How was this patch tested?
Unit test
Author: Wang Gengliang <ltnwgl@gmail.com>
Closes#19379 from gengliangwang/SPARK-22141-followup.
## What changes were proposed in this pull request?
Spark's RangePartitioner hard codes the number of sampling points per partition to be 20. This is sometimes too low. This ticket makes it configurable, via spark.sql.execution.rangeExchange.sampleSizePerPartition, and raises the default in Spark SQL to be 100.
## How was this patch tested?
Added a pretty sophisticated test based on chi square test ...
Author: Reynold Xin <rxin@databricks.com>
Closes#19387 from rxin/SPARK-22160.
## What changes were proposed in this pull request?
spark.sql.execution.arrow.enable and spark.sql.codegen.aggregate.map.twolevel.enable -> enabled
## How was this patch tested?
N/A
Author: Reynold Xin <rxin@databricks.com>
Closes#19384 from rxin/SPARK-22159.
## What changes were proposed in this pull request?
When inferring constraints from children, Join's condition can be simplified as None.
For example,
```
val testRelation = LocalRelation('a.int)
val x = testRelation.as("x")
val y = testRelation.where($"a" === 2 && !($"a" === 2)).as("y")
x.join.where($"x.a" === $"y.a")
```
The plan will become
```
Join Inner
:- LocalRelation <empty>, [a#23]
+- LocalRelation <empty>, [a#224]
```
And the Cartesian products check will throw exception for above plan.
Propagate empty relation before checking Cartesian products, and the issue is resolved.
## How was this patch tested?
Unit test
Author: Wang Gengliang <ltnwgl@gmail.com>
Closes#19362 from gengliangwang/MoveCheckCartesianProducts.
## What changes were proposed in this pull request?
Address PR comments that appeared post-merge, to rename `addExtraCode` to `addInnerClass`,
and not count the size of the inner class to the size of the outer class.
## How was this patch tested?
YOLO.
Author: Juliusz Sompolski <julek@databricks.com>
Closes#19353 from juliuszsompolski/SPARK-22103followup.
## What changes were proposed in this pull request?
HashAggregateExec codegen uses two paths for fast hash table and a generic one.
It generates code paths for iterating over both, and both code paths generate the consume code of the parent operator, resulting in that code being expanded twice.
This leads to a long generated function that might be an issue for the compiler (see e.g. SPARK-21603).
I propose to remove the double expansion by generating the consume code in a helper function that can just be called from both iterating loops.
An issue with separating the `consume` code to a helper function was that a number of places relied and assumed on being in the scope of an outside `produce` loop and e.g. use `continue` to jump out.
I replaced such code flows with nested scopes. It is code that should be handled the same by compiler, while getting rid of depending on assumptions that are outside of the `consume`'s own scope.
## How was this patch tested?
Existing test coverage.
Author: Juliusz Sompolski <julek@databricks.com>
Closes#19324 from juliuszsompolski/aggrconsumecodegen.
## What changes were proposed in this pull request?
The `percentile_approx` function previously accepted numeric type input and output double type results.
But since all numeric types, date and timestamp types are represented as numerics internally, `percentile_approx` can support them easily.
After this PR, it supports date type, timestamp type and numeric types as input types. The result type is also changed to be the same as the input type, which is more reasonable for percentiles.
This change is also required when we generate equi-height histograms for these types.
## How was this patch tested?
Added a new test and modified some existing tests.
Author: Zhenhua Wang <wangzhenhua@huawei.com>
Closes#19321 from wzhfy/approx_percentile_support_types.
## What changes were proposed in this pull request?
Enable Scala 2.12 REPL. Fix most remaining issues with 2.12 compilation and warnings, including:
- Selecting Kafka 0.10.1+ for Scala 2.12 and patching over a minor API difference
- Fixing lots of "eta expansion of zero arg method deprecated" warnings
- Resolving the SparkContext.sequenceFile implicits compile problem
- Fixing an odd but valid jetty-server missing dependency in hive-thriftserver
## How was this patch tested?
Existing tests
Author: Sean Owen <sowen@cloudera.com>
Closes#19307 from srowen/Scala212.
## What changes were proposed in this pull request?
This PR proposes to enhance the documentation for `trim` functions in the function description session.
- Add more `usage`, `arguments` and `examples` for the trim function
- Adjust space in the `usage` session
After the changes, the trim function documentation will look like this:
- `trim`
```trim(str) - Removes the leading and trailing space characters from str.
trim(BOTH trimStr FROM str) - Remove the leading and trailing trimStr characters from str
trim(LEADING trimStr FROM str) - Remove the leading trimStr characters from str
trim(TRAILING trimStr FROM str) - Remove the trailing trimStr characters from str
Arguments:
str - a string expression
trimStr - the trim string characters to trim, the default value is a single space
BOTH, FROM - these are keywords to specify trimming string characters from both ends of the string
LEADING, FROM - these are keywords to specify trimming string characters from the left end of the string
TRAILING, FROM - these are keywords to specify trimming string characters from the right end of the string
Examples:
> SELECT trim(' SparkSQL ');
SparkSQL
> SELECT trim('SL', 'SSparkSQLS');
parkSQ
> SELECT trim(BOTH 'SL' FROM 'SSparkSQLS');
parkSQ
> SELECT trim(LEADING 'SL' FROM 'SSparkSQLS');
parkSQLS
> SELECT trim(TRAILING 'SL' FROM 'SSparkSQLS');
SSparkSQ
```
- `ltrim`
```ltrim
ltrim(str) - Removes the leading space characters from str.
ltrim(trimStr, str) - Removes the leading string contains the characters from the trim string
Arguments:
str - a string expression
trimStr - the trim string characters to trim, the default value is a single space
Examples:
> SELECT ltrim(' SparkSQL ');
SparkSQL
> SELECT ltrim('Sp', 'SSparkSQLS');
arkSQLS
```
- `rtrim`
```rtrim
rtrim(str) - Removes the trailing space characters from str.
rtrim(trimStr, str) - Removes the trailing string which contains the characters from the trim string from the str
Arguments:
str - a string expression
trimStr - the trim string characters to trim, the default value is a single space
Examples:
> SELECT rtrim(' SparkSQL ');
SparkSQL
> SELECT rtrim('LQSa', 'SSparkSQLS');
SSpark
```
This is the trim characters function jira: [trim function](https://issues.apache.org/jira/browse/SPARK-14878)
## How was this patch tested?
Manually tested
```
spark-sql> describe function extended trim;
17/09/22 17:03:04 INFO CodeGenerator: Code generated in 153.026533 ms
Function: trim
Class: org.apache.spark.sql.catalyst.expressions.StringTrim
Usage:
trim(str) - Removes the leading and trailing space characters from `str`.
trim(BOTH trimStr FROM str) - Remove the leading and trailing `trimStr` characters from `str`
trim(LEADING trimStr FROM str) - Remove the leading `trimStr` characters from `str`
trim(TRAILING trimStr FROM str) - Remove the trailing `trimStr` characters from `str`
Extended Usage:
Arguments:
* str - a string expression
* trimStr - the trim string characters to trim, the default value is a single space
* BOTH, FROM - these are keywords to specify trimming string characters from both ends of
the string
* LEADING, FROM - these are keywords to specify trimming string characters from the left
end of the string
* TRAILING, FROM - these are keywords to specify trimming string characters from the right
end of the string
Examples:
> SELECT trim(' SparkSQL ');
SparkSQL
> SELECT trim('SL', 'SSparkSQLS');
parkSQ
> SELECT trim(BOTH 'SL' FROM 'SSparkSQLS');
parkSQ
> SELECT trim(LEADING 'SL' FROM 'SSparkSQLS');
parkSQLS
> SELECT trim(TRAILING 'SL' FROM 'SSparkSQLS');
SSparkSQ
```
```
spark-sql> describe function extended ltrim;
Function: ltrim
Class: org.apache.spark.sql.catalyst.expressions.StringTrimLeft
Usage:
ltrim(str) - Removes the leading space characters from `str`.
ltrim(trimStr, str) - Removes the leading string contains the characters from the trim string
Extended Usage:
Arguments:
* str - a string expression
* trimStr - the trim string characters to trim, the default value is a single space
Examples:
> SELECT ltrim(' SparkSQL ');
SparkSQL
> SELECT ltrim('Sp', 'SSparkSQLS');
arkSQLS
```
```
spark-sql> describe function extended rtrim;
Function: rtrim
Class: org.apache.spark.sql.catalyst.expressions.StringTrimRight
Usage:
rtrim(str) - Removes the trailing space characters from `str`.
rtrim(trimStr, str) - Removes the trailing string which contains the characters from the trim string from the `str`
Extended Usage:
Arguments:
* str - a string expression
* trimStr - the trim string characters to trim, the default value is a single space
Examples:
> SELECT rtrim(' SparkSQL ');
SparkSQL
> SELECT rtrim('LQSa', 'SSparkSQLS');
SSpark
```
Author: Kevin Yu <qyu@us.ibm.com>
Closes#19329 from kevinyu98/spark-14878-5.
## What changes were proposed in this pull request?
Try to avoid allocating an array bigger than Integer.MAX_VALUE - 8, which is the actual max size on some JVMs, in several places
## How was this patch tested?
Existing tests
Author: Sean Owen <sowen@cloudera.com>
Closes#19266 from srowen/SPARK-22033.
## What changes were proposed in this pull request?
Right now the calculation of SortMergeJoinExec's outputOrdering relies on the fact that its children have already been sorted on the join keys, while this is often not true until EnsureRequirements has been applied. So we ended up not getting the correct outputOrdering during physical planning stage before Sort nodes are added to the children.
For example, J = {A join B on key1 = key2}
1. if A is NOT ordered on key1 ASC, J's outputOrdering should include "key1 ASC"
2. if A is ordered on key1 ASC, J's outputOrdering should include "key1 ASC"
3. if A is ordered on key1 ASC, with sameOrderExp=c1, J's outputOrdering should include "key1 ASC, sameOrderExp=c1"
So to fix this I changed the behavior of <code>getKeyOrdering(keys, childOutputOrdering)</code> to:
1. If the childOutputOrdering satisfies (is a superset of) the required child ordering => childOutputOrdering
2. Otherwise => required child ordering
In addition, I organized the logic for deciding the relationship between two orderings into SparkPlan, so that it can be reused by EnsureRequirements and SortMergeJoinExec, and potentially other classes.
## How was this patch tested?
Added new test cases.
Passed all integration tests.
Author: maryannxue <maryann.xue@gmail.com>
Closes#19281 from maryannxue/spark-21998.
## What changes were proposed in this pull request?
#### Architecture
This PR implements stream-stream inner join using a two-way symmetric hash join. At a high level, we want to do the following.
1. For each stream, we maintain the past rows as state in State Store.
- For each joining key, there can be multiple rows that have been received.
- So, we have to effectively maintain a key-to-list-of-values multimap as state for each stream.
2. In each batch, for each input row in each stream
- Look up the other streams state to see if there are matching rows, and output them if they satisfy the joining condition
- Add the input row to corresponding stream’s state.
- If the data has a timestamp/window column with watermark, then we will use that to calculate the threshold for keys that are required to buffered for future matches and drop the rest from the state.
Cleaning up old unnecessary state rows depends completely on whether watermark has been defined and what are join conditions. We definitely want to support state clean up two types of queries that are likely to be common.
- Queries to time range conditions - E.g. `SELECT * FROM leftTable, rightTable ON leftKey = rightKey AND leftTime > rightTime - INTERVAL 8 MINUTES AND leftTime < rightTime + INTERVAL 1 HOUR`
- Queries with windows as the matching key - E.g. `SELECT * FROM leftTable, rightTable ON leftKey = rightKey AND window(leftTime, "1 hour") = window(rightTime, "1 hour")` (pseudo-SQL)
#### Implementation
The stream-stream join is primarily implemented in three classes
- `StreamingSymmetricHashJoinExec` implements the above symmetric join algorithm.
- `SymmetricsHashJoinStateManagers` manages the streaming state for the join. This essentially is a fault-tolerant key-to-list-of-values multimap built on the StateStore APIs. `StreamingSymmetricHashJoinExec` instantiates two such managers, one for each join side.
- `StreamingSymmetricHashJoinExecHelper` is a helper class to extract threshold for the state based on the join conditions and the event watermark.
Refer to the scaladocs class for more implementation details.
Besides the implementation of stream-stream inner join SparkPlan. Some additional changes are
- Allowed inner join in append mode in UnsupportedOperationChecker
- Prevented stream-stream join on an empty batch dataframe to be collapsed by the optimizer
## How was this patch tested?
- New tests in StreamingJoinSuite
- Updated tests UnsupportedOperationSuite
Author: Tathagata Das <tathagata.das1565@gmail.com>
Closes#19271 from tdas/SPARK-22053.
## What changes were proposed in this pull request?
There is an incorrect `scalastyle:on` comment in `stringExpressions.scala` and causes the line size limit check ineffective in the file. There are many lines of code and comment which are more than 100 chars.
## How was this patch tested?
Code style change only.
Author: Liang-Chi Hsieh <viirya@gmail.com>
Closes#19305 from viirya/fix-wrong-style.
## What changes were proposed in this pull request?
In SQL conditional expressions, only CASE WHEN lacks for expression description. This patch fills the gap.
## How was this patch tested?
Only documentation change.
Author: Liang-Chi Hsieh <viirya@gmail.com>
Closes#19304 from viirya/casewhen-doc.
## What changes were proposed in this pull request?
This work is a part of [SPARK-17074](https://issues.apache.org/jira/browse/SPARK-17074) to compute equi-height histograms. Equi-height histogram is an array of bins. A bin consists of two endpoints which form an interval of values and the ndv in that interval.
This PR creates a new aggregate function, given an array of endpoints, counting distinct values (ndv) in intervals among those endpoints.
This PR also refactors `HyperLogLogPlusPlus` by extracting a helper class `HyperLogLogPlusPlusHelper`, where the underlying HLLPP algorithm locates.
## How was this patch tested?
Add new test cases.
Author: Zhenhua Wang <wangzhenhua@huawei.com>
Closes#15544 from wzhfy/countIntervals.
## What changes were proposed in this pull request?
This a follow-up of https://github.com/apache/spark/pull/19289 , we missed another place: `rollup`. `Seq.init.toSeq` also returns a `Stream`, we should fix it too.
## How was this patch tested?
manually
Author: Wenchen Fan <wenchen@databricks.com>
Closes#19298 from cloud-fan/bug.
## What changes were proposed in this pull request?
Spark with Scala 2.10 fails with a group by cube:
```
spark.range(1).select($"id" as "a", $"id" as "b").write.partitionBy("a").mode("overwrite").saveAsTable("rollup_bug")
spark.sql("select 1 from rollup_bug group by rollup ()").show
```
It can be traced back to https://github.com/apache/spark/pull/15484 , which made `Expand.projections` a lazy `Stream` for group by cube.
In scala 2.10 `Stream` captures a lot of stuff, and in this case it captures the entire query plan which has some un-serializable parts.
This change is also good for master branch, to reduce the serialized size of `Expand.projections`.
## How was this patch tested?
manually verified with Spark with Scala 2.10.
Author: Wenchen Fan <wenchen@databricks.com>
Closes#19289 from cloud-fan/bug.
## What changes were proposed in this pull request?
Clarify behavior of to_utc_timestamp/from_utc_timestamp with an example
## How was this patch tested?
Doc only change / existing tests
Author: Sean Owen <sowen@cloudera.com>
Closes#19276 from srowen/SPARK-22049.
## What changes were proposed in this pull request?
Tables in the catalog cache are not invalidated once their statistics are updated. As a consequence, existing sessions will use the cached information even though it is not valid anymore. Consider and an example below.
```
// step 1
spark.range(100).write.saveAsTable("tab1")
// step 2
spark.sql("analyze table tab1 compute statistics")
// step 3
spark.sql("explain cost select distinct * from tab1").show(false)
// step 4
spark.range(100).write.mode("append").saveAsTable("tab1")
// step 5
spark.sql("explain cost select distinct * from tab1").show(false)
```
After step 3, the table will be present in the catalog relation cache. Step 4 will correctly update the metadata inside the catalog but will NOT invalidate the cache.
By the way, ``spark.sql("analyze table tab1 compute statistics")`` between step 3 and step 4 would also solve the problem.
## How was this patch tested?
Current and additional unit tests.
Author: aokolnychyi <anton.okolnychyi@sap.com>
Closes#19252 from aokolnychyi/spark-21969.
## What changes were proposed in this pull request?
* Removed the method `org.apache.spark.sql.catalyst.expressions.codegen.UnsafeRowWriter#alignToWords`.
It became unused as a result of 85b0a15754
(SPARK-15962) introducing word alignment for unsafe arrays.
* Cleaned up duplicate code in memory management and unsafe sorters
* The change extracting the exception paths is more than just cosmetics since it def. reduces the size the affected methods compile to
## How was this patch tested?
* Build still passes after removing the method, grepping the codebase for `alignToWords` shows no reference to it anywhere either.
* Dried up code is covered by existing tests.
Author: Armin <me@obrown.io>
Closes#19254 from original-brownbear/cleanup-mem-consumer.
#### What changes were proposed in this pull request?
This PR enhances the TRIM function support in Spark SQL by allowing the specification
of trim characters set. Below is the SQL syntax :
``` SQL
<trim function> ::= TRIM <left paren> <trim operands> <right paren>
<trim operands> ::= [ [ <trim specification> ] [ <trim character set> ] FROM ] <trim source>
<trim source> ::= <character value expression>
<trim specification> ::=
LEADING
| TRAILING
| BOTH
<trim character set> ::= <characters value expression>
```
or
``` SQL
LTRIM (source-exp [, trim-exp])
RTRIM (source-exp [, trim-exp])
```
Here are the documentation link of support of this feature by other mainstream databases.
- **Oracle:** [TRIM function](http://docs.oracle.com/cd/B28359_01/olap.111/b28126/dml_functions_2126.htm#OLADM704)
- **DB2:** [TRIM scalar function](https://www.ibm.com/support/knowledgecenter/en/SSMKHH_10.0.0/com.ibm.etools.mft.doc/ak05270_.htm)
- **MySQL:** [Trim function](http://dev.mysql.com/doc/refman/5.7/en/string-functions.html#function_trim)
- **Oracle:** [ltrim](https://docs.oracle.com/cd/B28359_01/olap.111/b28126/dml_functions_2018.htm#OLADM594)
- **DB2:** [ltrim](https://www.ibm.com/support/knowledgecenter/en/SSEPEK_11.0.0/sqlref/src/tpc/db2z_bif_ltrim.html)
This PR is to implement the above enhancement. In the implementation, the design principle is to keep the changes to the minimum. Also, the exiting trim functions (which handles a special case, i.e., trimming space characters) are kept unchanged for performane reasons.
#### How was this patch tested?
The unit test cases are added in the following files:
- UTF8StringSuite.java
- StringExpressionsSuite.scala
- sql/SQLQuerySuite.scala
- StringFunctionsSuite.scala
Author: Kevin Yu <qyu@us.ibm.com>
Closes#12646 from kevinyu98/spark-14878.
## What changes were proposed in this pull request?
If there are two projects like as follows.
```
Project [a_with_metadata#27 AS b#26]
+- Project [a#0 AS a_with_metadata#27]
+- LocalRelation <empty>, [a#0, b#1]
```
Child Project has an output column with a metadata in it, and the parent Project has an alias that implicitly forwards the metadata. So this metadata is visible for higher operators. Upon applying CollapseProject optimizer rule, the metadata is not preserved.
```
Project [a#0 AS b#26]
+- LocalRelation <empty>, [a#0, b#1]
```
This is incorrect, as downstream operators that expect certain metadata (e.g. watermark in structured streaming) to identify certain fields will fail to do so. This PR fixes it by preserving the metadata of top-level aliases.
## How was this patch tested?
New unit test
Author: Tathagata Das <tathagata.das1565@gmail.com>
Closes#19240 from tdas/SPARK-22018.
## What changes were proposed in this pull request?
In previous work SPARK-21513, we has allowed `MapType` and `ArrayType` of `MapType`s convert to a json string but only for Scala API. In this follow-up PR, we will make SparkSQL support it for PySpark and SparkR, too. We also fix some little bugs and comments of the previous work in this follow-up PR.
### For PySpark
```
>>> data = [(1, {"name": "Alice"})]
>>> df = spark.createDataFrame(data, ("key", "value"))
>>> df.select(to_json(df.value).alias("json")).collect()
[Row(json=u'{"name":"Alice")']
>>> data = [(1, [{"name": "Alice"}, {"name": "Bob"}])]
>>> df = spark.createDataFrame(data, ("key", "value"))
>>> df.select(to_json(df.value).alias("json")).collect()
[Row(json=u'[{"name":"Alice"},{"name":"Bob"}]')]
```
### For SparkR
```
# Converts a map into a JSON object
df2 <- sql("SELECT map('name', 'Bob')) as people")
df2 <- mutate(df2, people_json = to_json(df2$people))
# Converts an array of maps into a JSON array
df2 <- sql("SELECT array(map('name', 'Bob'), map('name', 'Alice')) as people")
df2 <- mutate(df2, people_json = to_json(df2$people))
```
## How was this patch tested?
Add unit test cases.
cc viirya HyukjinKwon
Author: goldmedal <liugs963@gmail.com>
Closes#19223 from goldmedal/SPARK-21513-fp-PySaprkAndSparkR.
## What changes were proposed in this pull request?
https://issues.apache.org/jira/browse/SPARK-21980
This PR fixes the issue in ResolveGroupingAnalytics rule, which indexes the column references in grouping functions without considering case sensitive configurations.
The problem can be reproduced by:
`val df = spark.createDataFrame(Seq((1, 1), (2, 1), (2, 2))).toDF("a", "b")
df.cube("a").agg(grouping("A")).show()`
## How was this patch tested?
unit tests
Author: donnyzone <wellfengzhu@gmail.com>
Closes#19202 from DonnyZone/ResolveGroupingAnalytics.
# What changes were proposed in this pull request?
UDF to_json only supports converting `StructType` or `ArrayType` of `StructType`s to a json output string now.
According to the discussion of JIRA SPARK-21513, I allow to `to_json` support converting `MapType` and `ArrayType` of `MapType`s to a json output string.
This PR is for SQL and Scala API only.
# How was this patch tested?
Adding unit test case.
cc viirya HyukjinKwon
Author: goldmedal <liugs963@gmail.com>
Author: Jia-Xuan Liu <liugs963@gmail.com>
Closes#18875 from goldmedal/SPARK-21513.
## What changes were proposed in this pull request?
Improve QueryPlanConstraints framework, make it robust and simple.
In https://github.com/apache/spark/pull/15319, constraints for expressions like `a = f(b, c)` is resolved.
However, for expressions like
```scala
a = f(b, c) && c = g(a, b)
```
The current QueryPlanConstraints framework will produce non-converging constraints.
Essentially, the problem is caused by having both the name and child of aliases in the same constraint set. We infer constraints, and push down constraints as predicates in filters, later on these predicates are propagated as constraints, etc..
Simply using the alias names only can resolve these problems. The size of constraints is reduced without losing any information. We can always get these inferred constraints on child of aliases when pushing down filters.
Also, the EqualNullSafe between name and child in propagating alias is meaningless
```scala
allConstraints += EqualNullSafe(e, a.toAttribute)
```
It just produces redundant constraints.
## How was this patch tested?
Unit test
Author: Wang Gengliang <ltnwgl@gmail.com>
Closes#19201 from gengliangwang/QueryPlanConstraints.
## What changes were proposed in this pull request?
Support DESC (EXTENDED | FORMATTED) ? TABLE COLUMN command.
Support DESC EXTENDED | FORMATTED TABLE COLUMN command to show column-level statistics.
Do NOT support describe nested columns.
## How was this patch tested?
Added test cases.
Author: Zhenhua Wang <wzh_zju@163.com>
Author: Zhenhua Wang <wangzhenhua@huawei.com>
Author: wangzhenhua <wangzhenhua@huawei.com>
Closes#16422 from wzhfy/descColumn.
## What changes were proposed in this pull request?
This PR implements the sql feature:
INSERT OVERWRITE [LOCAL] DIRECTORY directory1
[ROW FORMAT row_format] [STORED AS file_format]
SELECT ... FROM ...
## How was this patch tested?
Added new unittests and also pulled the code to fb-spark so that we could test writing to hdfs directory.
Author: Jane Wang <janewang@fb.com>
Closes#18975 from janewangfb/port_local_directory.
## What changes were proposed in this pull request?
`JacksonUtils.verifySchema` verifies if a data type can be converted to JSON. For `MapType`, it now verifies the key type. However, in `JacksonGenerator`, when converting a map to JSON, we only care about its values and create a writer for the values. The keys in a map are treated as strings by calling `toString` on the keys.
Thus, we should change `JacksonUtils.verifySchema` to verify the value type of `MapType`.
## How was this patch tested?
Added tests.
Author: Liang-Chi Hsieh <viirya@gmail.com>
Closes#19167 from viirya/test-jacksonutils.
## What changes were proposed in this pull request?
The condition in `Optimizer.isPlanIntegral` is wrong. We should always return `true` if not in test mode.
## How was this patch tested?
Manually test.
Author: Liang-Chi Hsieh <viirya@gmail.com>
Closes#19161 from viirya/SPARK-21726-followup.
## What changes were proposed in this pull request?
We have many optimization rules now in `Optimzer`. Right now we don't have any checks in the optimizer to check for the structural integrity of the plan (e.g. resolved). When debugging, it is difficult to identify which rules return invalid plans.
It would be great if in test mode, we can check whether a plan is still resolved after the execution of each rule, so we can catch rules that return invalid plans.
## How was this patch tested?
Added tests.
Author: Liang-Chi Hsieh <viirya@gmail.com>
Closes#18956 from viirya/SPARK-21726.
## What changes were proposed in this pull request?
Since [SPARK-15639](https://github.com/apache/spark/pull/13701), `spark.sql.parquet.cacheMetadata` and `PARQUET_CACHE_METADATA` is not used. This PR removes from SQLConf and docs.
## How was this patch tested?
Pass the existing Jenkins.
Author: Dongjoon Hyun <dongjoon@apache.org>
Closes#19129 from dongjoon-hyun/SPARK-13656.
## What changes were proposed in this pull request?
This is a follow-up of #19050 to deal with `ExistenceJoin` case.
## How was this patch tested?
Added test.
Author: Liang-Chi Hsieh <viirya@gmail.com>
Closes#19151 from viirya/SPARK-21835-followup.
## What changes were proposed in this pull request?
Add an assert in logical plan optimization that the isStreaming bit stays the same, and fix empty relation rules where that wasn't happening.
## How was this patch tested?
new and existing unit tests
Author: Jose Torres <joseph.torres@databricks.com>
Author: Jose Torres <joseph-torres@databricks.com>
Closes#19056 from joseph-torres/SPARK-21765-followup.
## What changes were proposed in this pull request?
Correlated predicate subqueries are rewritten into `Join` by the rule `RewritePredicateSubquery` during optimization.
It is possibly that the two sides of the `Join` have conflicting attributes. The query plans produced by `RewritePredicateSubquery` become unresolved and break structural integrity.
We should check if there are conflicting attributes in the `Join` and de-duplicate them by adding a `Project`.
## How was this patch tested?
Added tests.
Author: Liang-Chi Hsieh <viirya@gmail.com>
Closes#19050 from viirya/SPARK-21835.
## What changes were proposed in this pull request?
For the given example below, the predicate added by `InferFiltersFromConstraints` is folded by `ConstantPropagation` later, this leads to unconverged optimize iteration:
```
Seq((1, 1)).toDF("col1", "col2").createOrReplaceTempView("t1")
Seq(1, 2).toDF("col").createOrReplaceTempView("t2")
sql("SELECT * FROM t1, t2 WHERE t1.col1 = 1 AND 1 = t1.col2 AND t1.col1 = t2.col AND t1.col2 = t2.col")
```
We can fix this by adjusting the indent of the optimize rules.
## How was this patch tested?
Add test case that would have failed in `SQLQuerySuite`.
Author: Xingbo Jiang <xingbo.jiang@databricks.com>
Closes#19099 from jiangxb1987/unconverge-optimization.
## What changes were proposed in this pull request?
We should make codegen fallback of expressions configurable. So far, it is always on. We might hide it when our codegen have compilation bugs. Thus, we should also disable the codegen fallback when running test cases.
## How was this patch tested?
Added test cases
Author: gatorsmile <gatorsmile@gmail.com>
Closes#19119 from gatorsmile/fallbackCodegen.
## What changes were proposed in this pull request?
SQL predicates don't have complete expression description. This patch goes to complement the description by adding arguments, examples.
This change also adds related test cases for the SQL predicate expressions.
## How was this patch tested?
Existing tests. And added predicate test.
Author: Liang-Chi Hsieh <viirya@gmail.com>
Closes#18869 from viirya/SPARK-21654.
## What changes were proposed in this pull request?
Add `TBLPROPERTIES` to the DDL statement `CREATE TABLE USING`.
After this change, the DDL becomes
```
CREATE [TEMPORARY] TABLE [IF NOT EXISTS] [db_name.]table_name
USING table_provider
[OPTIONS table_property_list]
[PARTITIONED BY (col_name, col_name, ...)]
[CLUSTERED BY (col_name, col_name, ...)
[SORTED BY (col_name [ASC|DESC], ...)]
INTO num_buckets BUCKETS
]
[LOCATION path]
[COMMENT table_comment]
[TBLPROPERTIES (property_name=property_value, ...)]
[[AS] select_statement];
```
## How was this patch tested?
Add a few tests
Author: gatorsmile <gatorsmile@gmail.com>
Closes#19100 from gatorsmile/addTablePropsToCreateTableUsing.
## What changes were proposed in this pull request?
Allows `BinaryComparison` operators to work on any data type that actually supports ordering as verified by `TypeUtils.checkForOrderingExpr` instead of relying on the incomplete list `TypeCollection.Ordered` (which is removed by this PR).
## How was this patch tested?
Updated unit tests to cover structs and arrays.
Author: Andrew Ray <ray.andrew@gmail.com>
Closes#18818 from aray/SPARK-21110.