Commit graph

11319 commits

Author SHA1 Message Date
Kent Yao 51815430b2 [SPARK-35380][SQL] Loading SparkSessionExtensions from ServiceLoader
### What changes were proposed in this pull request?

In https://github.com/yaooqinn/itachi/issues/8, we had a discussion about the current extension injection for the spark session.  We've agreed that the current way is not that convenient for both third-party developers and end-users.

It's much simple if third-party developers can provide a resource file that contains default extensions for Spark to  load ahead

### Why are the changes needed?

better use experience

### Does this PR introduce _any_ user-facing change?

no, dev change

### How was this patch tested?

new tests

Closes #32515 from yaooqinn/SPARK-35380.

Authored-by: Kent Yao <yao@apache.org>
Signed-off-by: Kent Yao <yao@apache.org>
2021-05-13 16:34:13 +08:00
Chao Sun 0ab9bd79b3 [SPARK-35384][SQL] Improve performance for InvokeLike.invoke
### What changes were proposed in this pull request?

Change `map` in `InvokeLike.invoke` to a while loop to improve performance, following Spark [style guide](https://github.com/databricks/scala-style-guide#traversal-and-zipwithindex).

### Why are the changes needed?

`InvokeLike.invoke`, which is used in non-codegen path for `Invoke` and `StaticInvoke`, currently uses `map` to evaluate arguments:
```scala
val args = arguments.map(e => e.eval(input).asInstanceOf[Object])
if (needNullCheck && args.exists(_ == null)) {
  // return null if one of arguments is null
  null
} else {
  ...
```
which is pretty expensive if the method itself is trivial. We can change it to a plain while loop.

<img width="871" alt="Screen Shot 2021-05-12 at 12 19 59 AM" src="https://user-images.githubusercontent.com/506679/118055719-7f985a00-b33d-11eb-943b-cf85eab35f44.png">

Benchmark results show this can improve as much as 3x from `V2FunctionBenchmark`:

Before
```
 OpenJDK 64-Bit Server VM 1.8.0_292-b10 on Linux 5.4.0-1046-azure
 Intel(R) Xeon(R) CPU E5-2673 v3  2.40GHz
 scalar function (long + long) -> long, result_nullable = false codegen = false:  Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
 --------------------------------------------------------------------------------------------------------------------------------------------------------------
 native_long_add                                                                         36506          36656         251         13.7          73.0       1.0X
 java_long_add_default                                                                   47151          47540         370         10.6          94.3       0.8X
 java_long_add_magic                                                                    178691         182457        1327          2.8         357.4       0.2X
 java_long_add_static_magic                                                             177151         178258        1151          2.8         354.3       0.2X
```

After
```
 OpenJDK 64-Bit Server VM 1.8.0_292-b10 on Linux 5.4.0-1046-azure
 Intel(R) Xeon(R) CPU E5-2673 v3  2.40GHz
 scalar function (long + long) -> long, result_nullable = false codegen = false:  Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
 --------------------------------------------------------------------------------------------------------------------------------------------------------------
 native_long_add                                                                         29897          30342         568         16.7          59.8       1.0X
 java_long_add_default                                                                   40628          41075         664         12.3          81.3       0.7X
 java_long_add_magic                                                                     54553          54755         182          9.2         109.1       0.5X
 java_long_add_static_magic                                                              55410          55532         127          9.0         110.8       0.5X
```

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

Existing tests.

Closes #32527 from sunchao/SPARK-35384.

Authored-by: Chao Sun <sunchao@apple.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2021-05-12 20:57:21 -07:00
Takeshi Yamamuro 3241aeb7f4 [SPARK-35385][SQL][TESTS] Skip duplicate queries in the TPCDS-related tests
### What changes were proposed in this pull request?

This PR proposes to skip the "q6", "q34", "q64", "q74", "q75", "q78" queries in the TPCDS-related tests because the TPCDS v2.7 queries have almost the same ones; the only differences in these queries are ORDER BY columns.

### Why are the changes needed?

To improve test performance.

### Does this PR introduce _any_ user-facing change?

No, dev only.

### How was this patch tested?

Existing tests.

Closes #32520 from maropu/SkipDupQueries.

Authored-by: Takeshi Yamamuro <yamamuro@apache.org>
Signed-off-by: Takeshi Yamamuro <yamamuro@apache.org>
2021-05-13 09:46:25 +09:00
Chao Sun bc95c3a69b [SPARK-35361][SQL][FOLLOWUP] Switch to use while loop
### What changes were proposed in this pull request?

Switch to plain `while` loop following Spark [style guide](https://github.com/databricks/scala-style-guide#traversal-and-zipwithindex).

### Why are the changes needed?

`while` loop may yield better performance comparing to `foreach`.

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

N/A

Closes #32522 from sunchao/SPARK-35361-follow-up.

Authored-by: Chao Sun <sunchao@apple.com>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
2021-05-12 12:41:12 -07:00
Liang-Chi Hsieh f156a95641 [SPARK-35347][SQL][FOLLOWUP] Throw exception with an explicit exception type when cannot find the method instead of sys.error
### What changes were proposed in this pull request?

A simple follow-up of #32474 to throw exception instead of sys.error.

### Why are the changes needed?

An exception only fails the query, instead of sys.error.

### Does this PR introduce _any_ user-facing change?

Yes, if `Invoke` or `StaticInvoke` cannot find the method, instead of original `sys.error` now we only throw an exception.

### How was this patch tested?

Existing tests.

Closes #32519 from viirya/SPARK-35347-followup.

Authored-by: Liang-Chi Hsieh <viirya@gmail.com>
Signed-off-by: Liang-Chi Hsieh <viirya@gmail.com>
2021-05-12 09:56:08 -07:00
Cheng Su 7bcadedbd2 [SPARK-35349][SQL] Add code-gen for left/right outer sort merge join
### What changes were proposed in this pull request?

This PR is to add code-gen support for LEFT OUTER / RIGHT OUTER sort merge join. Currently sort merge join only supports inner join type (https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/joins/SortMergeJoinExec.scala#L374 ). There's no fundamental reason why we cannot support code-gen for other join types. Here we add code-gen for LEFT OUTER / RIGHT OUTER join. Will submit followup PRs to add LEFT SEMI, LEFT ANTI and FULL OUTER code-gen separately.

The change is to extend current sort merge join logic to work with LEFT OUTER and RIGHT OUTER (should work with LEFT SEMI/ANTI as well, but FULL OUTER join needs some other more code change). Replace left/right with streamed/buffered to make code extendable to other join types besides inner join.

Example query:

```
val df1 = spark.range(10).select($"id".as("k1"), $"id".as("k3"))
val df2 = spark.range(4).select($"id".as("k2"), $"id".as("k4"))
df1.join(df2.hint("SHUFFLE_MERGE"), $"k1" === $"k2" && $"k3" + 1 < $"k4", "left_outer").explain("codegen")
```

Example generated code:

```
== Subtree 5 / 5 (maxMethodCodeSize:396; maxConstantPoolSize:159(0.24% used); numInnerClasses:0) ==
*(5) SortMergeJoin [k1#2L], [k2#8L], LeftOuter, ((k3#3L + 1) < k4#9L)
:- *(2) Sort [k1#2L ASC NULLS FIRST], false, 0
:  +- Exchange hashpartitioning(k1#2L, 5), ENSURE_REQUIREMENTS, [id=#26]
:     +- *(1) Project [id#0L AS k1#2L, id#0L AS k3#3L]
:        +- *(1) Range (0, 10, step=1, splits=2)
+- *(4) Sort [k2#8L ASC NULLS FIRST], false, 0
   +- Exchange hashpartitioning(k2#8L, 5), ENSURE_REQUIREMENTS, [id=#32]
      +- *(3) Project [id#6L AS k2#8L, id#6L AS k4#9L]
         +- *(3) Range (0, 4, step=1, splits=2)

Generated code:
/* 001 */ public Object generate(Object[] references) {
/* 002 */   return new GeneratedIteratorForCodegenStage5(references);
/* 003 */ }
/* 004 */
/* 005 */ // codegenStageId=5
/* 006 */ final class GeneratedIteratorForCodegenStage5 extends org.apache.spark.sql.execution.BufferedRowIterator {
/* 007 */   private Object[] references;
/* 008 */   private scala.collection.Iterator[] inputs;
/* 009 */   private scala.collection.Iterator smj_streamedInput_0;
/* 010 */   private scala.collection.Iterator smj_bufferedInput_0;
/* 011 */   private InternalRow smj_streamedRow_0;
/* 012 */   private InternalRow smj_bufferedRow_0;
/* 013 */   private long smj_value_2;
/* 014 */   private org.apache.spark.sql.execution.ExternalAppendOnlyUnsafeRowArray smj_matches_0;
/* 015 */   private long smj_value_3;
/* 016 */   private org.apache.spark.sql.catalyst.expressions.codegen.UnsafeRowWriter[] smj_mutableStateArray_0 = new org.apache.spark.sql.catalyst.expressions.codegen.UnsafeRowWriter[1];
/* 017 */
/* 018 */   public GeneratedIteratorForCodegenStage5(Object[] references) {
/* 019 */     this.references = references;
/* 020 */   }
/* 021 */
/* 022 */   public void init(int index, scala.collection.Iterator[] inputs) {
/* 023 */     partitionIndex = index;
/* 024 */     this.inputs = inputs;
/* 025 */     smj_streamedInput_0 = inputs[0];
/* 026 */     smj_bufferedInput_0 = inputs[1];
/* 027 */
/* 028 */     smj_matches_0 = new org.apache.spark.sql.execution.ExternalAppendOnlyUnsafeRowArray(2147483632, 2147483647);
/* 029 */     smj_mutableStateArray_0[0] = new org.apache.spark.sql.catalyst.expressions.codegen.UnsafeRowWriter(4, 0);
/* 030 */
/* 031 */   }
/* 032 */
/* 033 */   private boolean findNextJoinRows(
/* 034 */     scala.collection.Iterator streamedIter,
/* 035 */     scala.collection.Iterator bufferedIter) {
/* 036 */     smj_streamedRow_0 = null;
/* 037 */     int comp = 0;
/* 038 */     while (smj_streamedRow_0 == null) {
/* 039 */       if (!streamedIter.hasNext()) return false;
/* 040 */       smj_streamedRow_0 = (InternalRow) streamedIter.next();
/* 041 */       long smj_value_0 = smj_streamedRow_0.getLong(0);
/* 042 */       if (false) {
/* 043 */         if (!smj_matches_0.isEmpty()) {
/* 044 */           smj_matches_0.clear();
/* 045 */         }
/* 046 */         return false;
/* 047 */
/* 048 */       }
/* 049 */       if (!smj_matches_0.isEmpty()) {
/* 050 */         comp = 0;
/* 051 */         if (comp == 0) {
/* 052 */           comp = (smj_value_0 > smj_value_3 ? 1 : smj_value_0 < smj_value_3 ? -1 : 0);
/* 053 */         }
/* 054 */
/* 055 */         if (comp == 0) {
/* 056 */           return true;
/* 057 */         }
/* 058 */         smj_matches_0.clear();
/* 059 */       }
/* 060 */
/* 061 */       do {
/* 062 */         if (smj_bufferedRow_0 == null) {
/* 063 */           if (!bufferedIter.hasNext()) {
/* 064 */             smj_value_3 = smj_value_0;
/* 065 */             return !smj_matches_0.isEmpty();
/* 066 */           }
/* 067 */           smj_bufferedRow_0 = (InternalRow) bufferedIter.next();
/* 068 */           long smj_value_1 = smj_bufferedRow_0.getLong(0);
/* 069 */           if (false) {
/* 070 */             smj_bufferedRow_0 = null;
/* 071 */             continue;
/* 072 */           }
/* 073 */           smj_value_2 = smj_value_1;
/* 074 */         }
/* 075 */
/* 076 */         comp = 0;
/* 077 */         if (comp == 0) {
/* 078 */           comp = (smj_value_0 > smj_value_2 ? 1 : smj_value_0 < smj_value_2 ? -1 : 0);
/* 079 */         }
/* 080 */
/* 081 */         if (comp > 0) {
/* 082 */           smj_bufferedRow_0 = null;
/* 083 */         } else if (comp < 0) {
/* 084 */           if (!smj_matches_0.isEmpty()) {
/* 085 */             smj_value_3 = smj_value_0;
/* 086 */             return true;
/* 087 */           } else {
/* 088 */             return false;
/* 089 */           }
/* 090 */         } else {
/* 091 */           smj_matches_0.add((UnsafeRow) smj_bufferedRow_0);
/* 092 */           smj_bufferedRow_0 = null;
/* 093 */         }
/* 094 */       } while (smj_streamedRow_0 != null);
/* 095 */     }
/* 096 */     return false; // unreachable
/* 097 */   }
/* 098 */
/* 099 */   protected void processNext() throws java.io.IOException {
/* 100 */     while (smj_streamedInput_0.hasNext()) {
/* 101 */       findNextJoinRows(smj_streamedInput_0, smj_bufferedInput_0);
/* 102 */       long smj_value_4 = -1L;
/* 103 */       long smj_value_5 = -1L;
/* 104 */       boolean smj_loaded_0 = false;
/* 105 */       smj_value_5 = smj_streamedRow_0.getLong(1);
/* 106 */       scala.collection.Iterator<UnsafeRow> smj_iterator_0 = smj_matches_0.generateIterator();
/* 107 */       boolean smj_foundMatch_0 = false;
/* 108 */
/* 109 */       // the last iteration of this loop is to emit an empty row if there is no matched rows.
/* 110 */       while (smj_iterator_0.hasNext() || !smj_foundMatch_0) {
/* 111 */         InternalRow smj_bufferedRow_1 = smj_iterator_0.hasNext() ?
/* 112 */         (InternalRow) smj_iterator_0.next() : null;
/* 113 */         boolean smj_isNull_5 = true;
/* 114 */         long smj_value_9 = -1L;
/* 115 */         if (smj_bufferedRow_1 != null) {
/* 116 */           long smj_value_8 = smj_bufferedRow_1.getLong(1);
/* 117 */           smj_isNull_5 = false;
/* 118 */           smj_value_9 = smj_value_8;
/* 119 */         }
/* 120 */         if (smj_bufferedRow_1 != null) {
/* 121 */           boolean smj_isNull_6 = true;
/* 122 */           boolean smj_value_10 = false;
/* 123 */           long smj_value_11 = -1L;
/* 124 */
/* 125 */           smj_value_11 = smj_value_5 + 1L;
/* 126 */
/* 127 */           if (!smj_isNull_5) {
/* 128 */             smj_isNull_6 = false; // resultCode could change nullability.
/* 129 */             smj_value_10 = smj_value_11 < smj_value_9;
/* 130 */
/* 131 */           }
/* 132 */           if (smj_isNull_6 || !smj_value_10) {
/* 133 */             continue;
/* 134 */           }
/* 135 */         }
/* 136 */         if (!smj_loaded_0) {
/* 137 */           smj_loaded_0 = true;
/* 138 */           smj_value_4 = smj_streamedRow_0.getLong(0);
/* 139 */         }
/* 140 */         boolean smj_isNull_3 = true;
/* 141 */         long smj_value_7 = -1L;
/* 142 */         if (smj_bufferedRow_1 != null) {
/* 143 */           long smj_value_6 = smj_bufferedRow_1.getLong(0);
/* 144 */           smj_isNull_3 = false;
/* 145 */           smj_value_7 = smj_value_6;
/* 146 */         }
/* 147 */         smj_foundMatch_0 = true;
/* 148 */         ((org.apache.spark.sql.execution.metric.SQLMetric) references[0] /* numOutputRows */).add(1);
/* 149 */
/* 150 */         smj_mutableStateArray_0[0].reset();
/* 151 */
/* 152 */         smj_mutableStateArray_0[0].zeroOutNullBytes();
/* 153 */
/* 154 */         smj_mutableStateArray_0[0].write(0, smj_value_4);
/* 155 */
/* 156 */         smj_mutableStateArray_0[0].write(1, smj_value_5);
/* 157 */
/* 158 */         if (smj_isNull_3) {
/* 159 */           smj_mutableStateArray_0[0].setNullAt(2);
/* 160 */         } else {
/* 161 */           smj_mutableStateArray_0[0].write(2, smj_value_7);
/* 162 */         }
/* 163 */
/* 164 */         if (smj_isNull_5) {
/* 165 */           smj_mutableStateArray_0[0].setNullAt(3);
/* 166 */         } else {
/* 167 */           smj_mutableStateArray_0[0].write(3, smj_value_9);
/* 168 */         }
/* 169 */         append((smj_mutableStateArray_0[0].getRow()).copy());
/* 170 */
/* 171 */       }
/* 172 */       if (shouldStop()) return;
/* 173 */     }
/* 174 */     ((org.apache.spark.sql.execution.joins.SortMergeJoinExec) references[1] /* plan */).cleanupResources();
/* 175 */   }
/* 176 */
/* 177 */ }
```

### Why are the changes needed?

Improve query CPU performance. Example micro benchmark below showed 10% run-time improvement.

```
def sortMergeJoinWithDuplicates(): Unit = {
    val N = 2 << 20
    codegenBenchmark("sort merge join with duplicates", N) {
      val df1 = spark.range(N)
        .selectExpr(s"(id * 15485863) % ${N*10} as k1", "id as k3")
      val df2 = spark.range(N)
        .selectExpr(s"(id * 15485867) % ${N*10} as k2", "id as k4")
      val df = df1.join(df2, col("k1") === col("k2") && col("k3") * 3 < col("k4"), "left_outer")
      assert(df.queryExecution.sparkPlan.find(_.isInstanceOf[SortMergeJoinExec]).isDefined)
      df.noop()
    }
 }
```

```
Running benchmark: sort merge join with duplicates
  Running case: sort merge join with duplicates outer-smj-codegen off
  Stopped after 2 iterations, 2696 ms
  Running case: sort merge join with duplicates outer-smj-codegen on
  Stopped after 5 iterations, 6058 ms

Java HotSpot(TM) 64-Bit Server VM 1.8.0_181-b13 on Mac OS X 10.16
Intel(R) Core(TM) i9-9980HK CPU  2.40GHz
sort merge join with duplicates:                       Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
-------------------------------------------------------------------------------------------------------------------------------------
sort merge join with duplicates outer-smj-codegen off           1333           1348          21          1.6         635.7       1.0X
sort merge join with duplicates outer-smj-codegen on            1169           1212          47          1.8         557.4       1.1X
```

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Added unit test in `WholeStageCodegenSuite.scala` and `WholeStageCodegenSuite.scala`.

Closes #32476 from c21/smj-outer-codegen.

Authored-by: Cheng Su <chengsu@fb.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2021-05-12 14:10:15 +00:00
Takeshi Yamamuro 101b0cc313 [SPARK-35253][SQL][BUILD] Bump up the janino version to v3.1.4
### What changes were proposed in this pull request?

This PR proposes to bump up the janino version from 3.0.16 to v3.1.4.
The major changes of this upgrade are as follows:
 - Fixed issue #131: Janino 3.1.2 is 10x slower than 3.0.11: The Compiler's IClassLoader was initialized way too eagerly, thus lots of classes were loaded from the class path, which is very slow.
 - Improved the encoding of stack map frames according to JVMS11 4.7.4: Previously, only "full_frame"s were generated.
 - Fixed issue #107: Janino requires "org.codehaus.commons.compiler.io", but commons-compiler does not export this package
 - Fixed the promotion of the array access index expression (see JLS7 15.13 Array Access Expressions).

For all the changes, please see the change log: http://janino-compiler.github.io/janino/changelog.html

NOTE1: I've checked that there is no obvious performance regression. For all the data, see a link: https://docs.google.com/spreadsheets/d/1srxT9CioGQg1fLKM3Uo8z1sTzgCsMj4pg6JzpdcG6VU/edit?usp=sharing

NOTE2: We upgraded janino to 3.1.2 (#27860) once before, but the commit had been reverted in #29495 because of the correctness issue. Recently, #32374 had checked if Spark could land on v3.1.3 or not, but a new bug was found there. These known issues has been fixed in v3.1.4 by following PRs:
 - janino-compiler/janino#145
 - janino-compiler/janino#146

### Why are the changes needed?

janino v3.0.X  is no longer maintained.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

GA passed.

Closes #32455 from maropu/janino_v3.1.4.

Authored-by: Takeshi Yamamuro <yamamuro@apache.org>
Signed-off-by: Sean Owen <srowen@gmail.com>
2021-05-12 08:57:57 -05:00
Angerszhuuuu ed059541eb [SPARK-29145][SQL][FOLLOWUP] Clean up code about support sub-queries in join conditions
### What changes were proposed in this pull request?
According to discuss https://github.com/apache/spark/pull/25854#discussion_r629451135

### Why are the changes needed?
Clean code

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
Existed UT

Closes #32499 from AngersZhuuuu/SPARK-29145-fix.

Authored-by: Angerszhuuuu <angers.zhu@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2021-05-12 13:45:53 +00:00
Yingyi Bu d92018ee35 [SPARK-35298][SQL] Migrate to transformWithPruning for rules in Optimizer.scala
### What changes were proposed in this pull request?

Added the following TreePattern enums:
- ALIAS
- AND_OR
- AVERAGE
- GENERATE
- INTERSECT
- SORT
- SUM
- DISTINCT_LIKE
- PROJECT
- REPARTITION_OPERATION
- UNION

Added tree traversal pruning to the following rules in Optimizer.scala:
- EliminateAggregateFilter
- RemoveRedundantAggregates
- RemoveNoopOperators
- RemoveNoopUnion
- LimitPushDown
- ColumnPruning
- CollapseRepartition
- OptimizeRepartition
- OptimizeWindowFunctions
- CollapseWindow
- TransposeWindow
- InferFiltersFromGenerate
- InferFiltersFromConstraints
- CombineUnions
- CombineFilters
- EliminateSorts
- PruneFilters
- EliminateLimits
- DecimalAggregates
- ConvertToLocalRelation
- ReplaceDistinctWithAggregate
- ReplaceIntersectWithSemiJoin
- ReplaceExceptWithAntiJoin
- RewriteExceptAll
- RewriteIntersectAll
- RemoveLiteralFromGroupExpressions
- RemoveRepetitionFromGroupExpressions
- OptimizeLimitZero

### Why are the changes needed?

Reduce the number of tree traversals and hence improve the query compilation latency.

perf diff:
Rule name | Total Time (baseline) | Total Time (experiment) | experiment/baseline
RemoveRedundantAggregates | 51290766 | 67070477 | 1.31
RemoveNoopOperators | 192371141 | 196631275 | 1.02
RemoveNoopUnion | 49222561 | 43266681 | 0.88
LimitPushDown | 40885185 | 21672646 | 0.53
ColumnPruning | 2003406120 | 1285562149 | 0.64
CollapseRepartition | 40648048 | 72646515 | 1.79
OptimizeRepartition | 37813850 | 20600803 | 0.54
OptimizeWindowFunctions | 174426904 | 46741409 | 0.27
CollapseWindow | 38959957 | 24542426 | 0.63
TransposeWindow | 33533191 | 20414930 | 0.61
InferFiltersFromGenerate | 21758688 | 15597344 | 0.72
InferFiltersFromConstraints | 518009794 | 493282321 | 0.95
CombineUnions | 67694022 | 70550382 | 1.04
CombineFilters | 35265060 | 29005424 | 0.82
EliminateSorts | 57025509 | 19795776 | 0.35
PruneFilters | 433964815 | 465579200 | 1.07
EliminateLimits | 44275393 | 24476859 | 0.55
DecimalAggregates | 83143172 | 28816090 | 0.35
ReplaceDistinctWithAggregate | 21783760 | 18287489 | 0.84
ReplaceIntersectWithSemiJoin | 22311271 | 16566393 | 0.74
ReplaceExceptWithAntiJoin | 23838520 | 16588808 | 0.70
RewriteExceptAll | 32750296 | 29421957 | 0.90
RewriteIntersectAll | 29760454 | 21243599 | 0.71
RemoveLiteralFromGroupExpressions | 28151861 | 25270947 | 0.90
RemoveRepetitionFromGroupExpressions | 29587030 | 23447041 | 0.79
OptimizeLimitZero | 18081943 | 15597344 | 0.86
**Accumulated | 4129959311 | 3112676285 | 0.75**

### How was this patch tested?

Existing tests.

Closes #32439 from sigmod/optimizer.

Authored-by: Yingyi Bu <yingyi.bu@databricks.com>
Signed-off-by: Gengliang Wang <ltnwgl@gmail.com>
2021-05-12 20:42:47 +08:00
PengLei 82c520a3e2 [SPARK-35243][SQL] Support columnar execution on ANSI interval types
### What changes were proposed in this pull request?
Columnar execution support for ANSI interval types include YearMonthIntervalType and DayTimeIntervalType

### Why are the changes needed?
support cache tables with ANSI interval types.

### Does this PR introduce _any_ user-facing change?
### How was this patch tested?
run ./dev/lint-java
run ./dev/scalastyle
run test: CachedTableSuite
run test: ColumnTypeSuite

Closes #32452 from Peng-Lei/SPARK-35243.

Lead-authored-by: PengLei <18066542445@189.cn>
Co-authored-by: Lei Peng <peng.8lei@gmail.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
2021-05-12 20:11:34 +09:00
Chao Sun 78221bda95 [SPARK-35361][SQL] Improve performance for ApplyFunctionExpression
### What changes were proposed in this pull request?

In `ApplyFunctionExpression`, move `zipWithIndex` out of the loop for each input row.

### Why are the changes needed?

When the `ScalarFunction` is trivial, `zipWithIndex` could incur significant costs, as shown below:

<img width="899" alt="Screen Shot 2021-05-11 at 10 03 42 AM" src="https://user-images.githubusercontent.com/506679/117866421-fb19de80-b24b-11eb-8c94-d5e8c8b1eda9.png">

By removing it out of the loop, I'm seeing sometimes 2x speedup from `V2FunctionBenchmark`. For instance:

Before:
```
scalar function (long + long) -> long, result_nullable = false codegen = false:  Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
native_long_add                                                                         32437          32896         434         15.4          64.9       1.0X
java_long_add_default                                                                   85675          97045         NaN          5.8         171.3       0.4X
```

After:
```
scalar function (long + long) -> long, result_nullable = false codegen = false:  Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
native_long_add                                                                         30182          30387         279         16.6          60.4       1.0X
java_long_add_default                                                                   42862          43009         209         11.7          85.7       0.7X
```

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

Existing tests

Closes #32507 from sunchao/SPARK-35361.

Authored-by: Chao Sun <sunchao@apple.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
2021-05-12 10:16:35 +09:00
Yingyi Bu 7c9a9ec04f [SPARK-35146][SQL] Migrate to transformWithPruning or resolveWithPruning for rules in finishAnalysis.scala
### What changes were proposed in this pull request?

Added the following TreePattern enums:
- BOOL_AGG
- COUNT_IF
- CURRENT_LIKE
- RUNTIME_REPLACEABLE

Added tree traversal pruning to the following rules:
- ReplaceExpressions
- RewriteNonCorrelatedExists
- ComputeCurrentTime
- GetCurrentDatabaseAndCatalog

### Why are the changes needed?

Reduce the number of tree traversals and hence improve the query compilation latency.

Performance improvement (org.apache.spark.sql.TPCDSQuerySuite):
Rule name | Total Time (baseline) | Total Time (experiment) | experiment/baseline
ReplaceExpressions | 27546369 | 19753804 | 0.72
RewriteNonCorrelatedExists | 17304883 | 2086194 | 0.12
ComputeCurrentTime | 35751301 | 19984477 | 0.56
GetCurrentDatabaseAndCatalog | 37230787 | 18874013 | 0.51

### How was this patch tested?

Existing tests.

Closes #32461 from sigmod/finish_analysis.

Authored-by: Yingyi Bu <yingyi.bu@databricks.com>
Signed-off-by: Gengliang Wang <ltnwgl@gmail.com>
2021-05-11 17:11:38 +08:00
Cheng Su c4ca23207b [SPARK-35363][SQL] Refactor sort merge join code-gen be agnostic to join type
### What changes were proposed in this pull request?

This is a pre-requisite of https://github.com/apache/spark/pull/32476, in discussion of https://github.com/apache/spark/pull/32476#issuecomment-836469779 . This is to refactor sort merge join code-gen to depend on streamed/buffered terminology, which makes the code-gen agnostic to different join types and can be extended to support other join types than inner join.

### Why are the changes needed?

Pre-requisite of https://github.com/apache/spark/pull/32476.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Existing unit test in `InnerJoinSuite.scala` for inner join code-gen.

Closes #32495 from c21/smj-refactor.

Authored-by: Cheng Su <chengsu@fb.com>
Signed-off-by: Takeshi Yamamuro <yamamuro@apache.org>
2021-05-11 11:21:59 +09:00
gengjiaan 44bd0a8bd3 [SPARK-35088][SQL][FOLLOWUP] Improve the error message for Sequence expression
### What changes were proposed in this pull request?
Sequence expression output a message looks confused.
This PR will fix the issue.

### Why are the changes needed?
Improve the error message for Sequence expression

### Does this PR introduce _any_ user-facing change?
Yes. this PR updates the error message of Sequence expression.

### How was this patch tested?
Tests updated.

Closes #32492 from beliefer/SPARK-35088-followup.

Authored-by: gengjiaan <gengjiaan@360.cn>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
2021-05-11 09:45:09 +09:00
Gengliang Wang d2a535f85b [SPARK-34246][FOLLOWUP] Change the definition of findTightestCommonType for backward compatibility
### What changes were proposed in this pull request?

Change the definition of `findTightestCommonType` from
```
def findTightestCommonType(t1: DataType, t2: DataType): Option[DataType]
```
to
```
val findTightestCommonType: (DataType, DataType) => Option[DataType]
```

### Why are the changes needed?

For backward compatibility.
When running a MongoDB connector (built with Spark 3.1.1) with the latest master, there is such an error
```
java.lang.NoSuchMethodError: org.apache.spark.sql.catalyst.analysis.TypeCoercion$.findTightestCommonType()Lscala/Function2
```
from https://github.com/mongodb/mongo-spark/blob/master/src/main/scala/com/mongodb/spark/sql/MongoInferSchema.scala#L150

In the previous release, the function was
```
static public  scala.Function2<org.apache.spark.sql.types.DataType, org.apache.spark.sql.types.DataType, scala.Option<org.apache.spark.sql.types.DataType>> findTightestCommonType ()
```
After https://github.com/apache/spark/pull/31349, the function becomes:
```
static public  scala.Option<org.apache.spark.sql.types.DataType> findTightestCommonType (org.apache.spark.sql.types.DataType t1, org.apache.spark.sql.types.DataType t2)
```

This PR is to reduce the unnecessary API change.
### Does this PR introduce _any_ user-facing change?

Yes, the definition of `TypeCoercion.findTightestCommonType`  is consistent with previous release again.

### How was this patch tested?

Existing unit tests

Closes #32493 from gengliangwang/typecoercion.

Authored-by: Gengliang Wang <ltnwgl@gmail.com>
Signed-off-by: Gengliang Wang <ltnwgl@gmail.com>
2021-05-10 23:26:39 +08:00
Angerszhuuuu 7182f8cece [SPARK-35360][SQL] RepairTableCommand respects spark.sql.addPartitionInBatch.size too
### What changes were proposed in this pull request?
RepairTableCommand respects `spark.sql.addPartitionInBatch.size` too

### Why are the changes needed?
Make RepairTableCommand add partition batch size configurable.

### Does this PR introduce _any_ user-facing change?
User can use `spark.sql.addPartitionInBatch.size` to change batch size when repair table.

### How was this patch tested?
Not need

Closes #32489 from AngersZhuuuu/SPARK-35360.

Authored-by: Angerszhuuuu <angers.zhu@gmail.com>
Signed-off-by: Max Gekk <max.gekk@gmail.com>
2021-05-10 14:53:31 +05:00
Chao Sun 245dce1ea1 [SPARK-35261][SQL][TESTS][FOLLOW-UP] Change failOnError to false for NativeAdd in V2FunctionBenchmark
### What changes were proposed in this pull request?

Change `failOnError` to false for `NativeAdd` in `V2FunctionBenchmark`.

### Why are the changes needed?

Since `NativeAdd` is simply doing addition on long it's better to set `failOnError` to false so it will use native long addition instead of `Math.addExact`.

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

N/A

Closes #32481 from sunchao/SPARK-35261-follow-up.

Authored-by: Chao Sun <sunchao@apache.org>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2021-05-10 07:20:05 +00:00
Angerszhuuuu 2c8ced9590 [SPARK-35111][SPARK-35112][SQL][FOLLOWUP] Rename ANSI interval patterns and regexps
### What changes were proposed in this pull request?
Rename pattern strings and regexps of year-month and day-time intervals.

### Why are the changes needed?
To improve code maintainability.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
By existing test suites.

Closes #32444 from AngersZhuuuu/SPARK-35111-followup.

Authored-by: Angerszhuuuu <angers.zhu@gmail.com>
Signed-off-by: Max Gekk <max.gekk@gmail.com>
2021-05-10 11:33:27 +05:00
Cheng Su 38eb5a6936 [SPARK-35354][SQL] Replace BaseJoinExec with ShuffledJoin in CoalesceBucketsInJoin
### What changes were proposed in this pull request?

As title. We should use a more restrictive interface `ShuffledJoin` other than `BaseJoinExec` in `CoalesceBucketsInJoin`, as the rule only applies to sort merge join and shuffled hash join (i.e. `ShuffledJoin`).

### Why are the changes needed?

Code cleanup.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Existing unit test in `CoalesceBucketsInJoinSuite`.

Closes #32480 from c21/minor-cleanup.

Authored-by: Cheng Su <chengsu@fb.com>
Signed-off-by: Takeshi Yamamuro <yamamuro@apache.org>
2021-05-10 10:04:49 +09:00
Ruifeng Zheng 620f0727e3 [SPARK-35231][SQL] logical.Range override maxRowsPerPartition
### What changes were proposed in this pull request?
when `numSlices` is avaiable, `logical.Range` should compute a exact `maxRowsPerPartition`

### Why are the changes needed?
`maxRowsPerPartition` is used in optimizer, we should provide an exact value if possible

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
existing testsuites

Closes #32350 from zhengruifeng/range_maxRowsPerPartition.

Authored-by: Ruifeng Zheng <ruifengz@foxmail.com>
Signed-off-by: Takeshi Yamamuro <yamamuro@apache.org>
2021-05-09 21:44:49 +09:00
Liang-Chi Hsieh 5b65d8a129 [SPARK-35347][SQL] Use MethodUtils for looking up methods in Invoke and StaticInvoke
### What changes were proposed in this pull request?

This patch proposes to use `MethodUtils` for looking up methods `Invoke` and `StaticInvoke` expressions.

### Why are the changes needed?

Currently we wrote our logic in `Invoke` and `StaticInvoke` expressions for looking up methods. It is tricky to consider all the cases and there is already existing utility package for this purpose. We should reuse the utility package.

### Does this PR introduce _any_ user-facing change?

No, internal change only.

### How was this patch tested?

Existing tests.

Closes #32474 from viirya/invoke-util.

Authored-by: Liang-Chi Hsieh <viirya@gmail.com>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
2021-05-08 15:17:30 -07:00
Dongjoon Hyun e31bef1ed4 Revert "[SPARK-35321][SQL] Don't register Hive permanent functions when creating Hive client"
This reverts commit b4ec9e2304.
2021-05-08 13:01:17 -07:00
Takeshi Yamamuro 06c40091a6 [SPARK-35327][SQL][TESTS] Filters out the TPC-DS queries that can cause flaky test results
### What changes were proposed in this pull request?

This PR proposes to filter out TPCDS v1.4 q6 and q75 in `TPCDSQueryTestSuite`.

I saw`TPCDSQueryTestSuite` failed nondeterministically because output row orders were different with those in the golden files. For example, the failure in the GA job, https://github.com/linhongliu-db/spark/runs/2507928605?check_suite_focus=true, happened because the `tpcds/q6.sql` query output rows were only sorted by `cnt`:

a0c76a8755/sql/core/src/test/resources/tpcds/q6.sql (L20)
Actually, `tpcds/q6.sql`  and `tpcds-v2.7.0/q6.sql` are almost the same and the only difference is that `tpcds-v2.7.0/q6.sql` sorts both `cnt` and `a.ca_state`:
a0c76a8755/sql/core/src/test/resources/tpcds-v2.7.0/q6.sql (L22)
So, I think it's okay just to test `tpcds-v2.7.0/q6.sql` in this case (q75 has the same issue).

### Why are the changes needed?

For stable testing.

### Does this PR introduce _any_ user-facing change?

No, dev-only.

### How was this patch tested?

GA passed.

Closes #32454 from maropu/CleanUpTpcdsQueries.

Authored-by: Takeshi Yamamuro <yamamuro@apache.org>
Signed-off-by: Takeshi Yamamuro <yamamuro@apache.org>
2021-05-08 21:43:39 +09:00
Kent Yao b0257801d5 [SPARK-35331][SQL] Support resolving missing attrs for distribute/cluster by/repartition hint
### What changes were proposed in this pull request?

This PR makes the below case work well.

```sql
select a b from values(1) t(a) distribute by a;
```

```logtalk
== Parsed Logical Plan ==
'RepartitionByExpression ['a]
+- 'Project ['a AS b#42]
   +- 'SubqueryAlias t
      +- 'UnresolvedInlineTable [a], [List(1)]

== Analyzed Logical Plan ==
org.apache.spark.sql.AnalysisException: cannot resolve 'a' given input columns: [b]; line 1 pos 62;
'RepartitionByExpression ['a]
+- Project [a#48 AS b#42]
   +- SubqueryAlias t
      +- LocalRelation [a#48]
```
### Why are the changes needed?

bugfix

### Does this PR introduce _any_ user-facing change?

yes, the original attributes can be used in `distribute by` / `cluster by` and hints like `/*+ REPARTITION(3, c) */`

### How was this patch tested?

new tests

Closes #32465 from yaooqinn/SPARK-35331.

Authored-by: Kent Yao <yao@apache.org>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
2021-05-08 05:00:51 -07:00
Chao Sun 323a6e848e [SPARK-35232][SQL] Nested column pruning should retain column metadata
### What changes were proposed in this pull request?

Retain column metadata during the process of nested column pruning, when constructing `StructField`.

To test the above change, this also added the logic of column projection in `InMemoryTable`. Without the fix `DSV2CharVarcharDDLTestSuite` will fail.

### Why are the changes needed?

The column metadata is used in a few places such as re-constructing CHAR/VARCHAR information such as in [SPARK-33901](https://issues.apache.org/jira/browse/SPARK-33901). Therefore, we should retain the info during nested column pruning.

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

Existing tests.

Closes #32354 from sunchao/SPARK-35232.

Authored-by: Chao Sun <sunchao@apache.org>
Signed-off-by: Liang-Chi Hsieh <viirya@gmail.com>
2021-05-07 22:37:54 -07:00
Chao Sun f47e0f8379 [SPARK-35261][SQL] Support static magic method for stateless Java ScalarFunction
### What changes were proposed in this pull request?

This allows `ScalarFunction` implemented in Java to optionally specify the magic method `invoke` to be static, which can be used if the UDF is stateless. Comparing to the non-static method, it can potentially give better performance due to elimination of dynamic dispatch, etc.

Also added a benchmark to measure performance of: the default `produceResult`, non-static magic method and static magic method.

### Why are the changes needed?

For UDFs that are stateless (e.g., no need to maintain intermediate state between each function call), it's better to allow users to implement the UDF function as static method which could potentially give better performance.

### Does this PR introduce _any_ user-facing change?

Yes. Spark users can now have the choice to define static magic method for `ScalarFunction` when it is written in Java and when the UDF is stateless.

### How was this patch tested?

Added new UT.

Closes #32407 from sunchao/SPARK-35261.

Authored-by: Chao Sun <sunchao@apache.org>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2021-05-07 20:34:51 -07:00
Chao Sun b4ec9e2304 [SPARK-35321][SQL] Don't register Hive permanent functions when creating Hive client
### What changes were proposed in this pull request?

Instantiate a new Hive client through `Hive.getWithFastCheck(conf, false)` instead of `Hive.get(conf)`.

### Why are the changes needed?

[HIVE-10319](https://issues.apache.org/jira/browse/HIVE-10319) introduced a new API `get_all_functions` which is only supported in Hive 1.3.0/2.0.0 and up. As result, when Spark 3.x talks to a HMS service of version 1.2 or lower, the following error will occur:
```
Caused by: org.apache.hadoop.hive.ql.metadata.HiveException: org.apache.thrift.TApplicationException: Invalid method name: 'get_all_functions'
        at org.apache.hadoop.hive.ql.metadata.Hive.getAllFunctions(Hive.java:3897)
        at org.apache.hadoop.hive.ql.metadata.Hive.reloadFunctions(Hive.java:248)
        at org.apache.hadoop.hive.ql.metadata.Hive.registerAllFunctionsOnce(Hive.java:231)
        ... 96 more
Caused by: org.apache.thrift.TApplicationException: Invalid method name: 'get_all_functions'
        at org.apache.thrift.TServiceClient.receiveBase(TServiceClient.java:79)
        at org.apache.hadoop.hive.metastore.api.ThriftHiveMetastore$Client.recv_get_all_functions(ThriftHiveMetastore.java:3845)
        at org.apache.hadoop.hive.metastore.api.ThriftHiveMetastore$Client.get_all_functions(ThriftHiveMetastore.java:3833)
```

The `get_all_functions` is called only when `doRegisterAllFns` is set to true:
```java
  private Hive(HiveConf c, boolean doRegisterAllFns) throws HiveException {
    conf = c;
    if (doRegisterAllFns) {
      registerAllFunctionsOnce();
    }
  }
```

what this does is to register all Hive permanent functions defined in HMS in Hive's `FunctionRegistry` class, via iterating through results from `get_all_functions`. To Spark, this seems unnecessary as it loads Hive permanent (not built-in) UDF via directly calling the HMS API, i.e., `get_function`. The `FunctionRegistry` is only used in loading Hive's built-in function that is not supported by Spark. At this time, it only applies to `histogram_numeric`.

### Does this PR introduce _any_ user-facing change?

Yes with this fix Spark now should be able to talk to HMS server with Hive 1.2.x and lower (with HIVE-24608 too)

### How was this patch tested?

Manually started a HMS server of Hive version 1.2.2, with patched Hive 2.3.8 using HIVE-24608. Without the PR it failed with the above exception. With the PR the error disappeared and I can successfully perform common operations such as create table, create database, list tables, etc.

Closes #32446 from sunchao/SPARK-35321.

Authored-by: Chao Sun <sunchao@apache.org>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2021-05-07 15:06:04 -07:00
Liang-Chi Hsieh 33fbf5647b [SPARK-35288][SQL] StaticInvoke should find the method without exact argument classes match
### What changes were proposed in this pull request?

This patch proposes to make StaticInvoke able to find method with given method name even the parameter types do not exactly match to argument classes.

### Why are the changes needed?

Unlike `Invoke`, `StaticInvoke` only tries to get the method with exact argument classes. If the calling method's parameter types are not exactly matched with the argument classes, `StaticInvoke` cannot find the method.

`StaticInvoke` should be able to find the method under the cases too.

### Does this PR introduce _any_ user-facing change?

Yes. `StaticInvoke` can find a method even the argument classes are not exactly matched.

### How was this patch tested?

Unit test.

Closes #32413 from viirya/static-invoke.

Authored-by: Liang-Chi Hsieh <viirya@gmail.com>
Signed-off-by: Liang-Chi Hsieh <viirya@gmail.com>
2021-05-07 09:07:57 -07:00
beliefer d3b92eec45 [SPARK-35021][SQL] Group exception messages in connector/catalog
### What changes were proposed in this pull request?
This PR group exception messages in `sql/catalyst/src/main/scala/org/apache/spark/sql/connector/catalog`.

### Why are the changes needed?
It will largely help with standardization of error messages and its maintenance.

### Does this PR introduce _any_ user-facing change?
No. Error messages remain unchanged.

### How was this patch tested?
No new tests - pass all original tests to make sure it doesn't break any existing behavior.

Closes #32377 from beliefer/SPARK-35021.

Lead-authored-by: beliefer <beliefer@163.com>
Co-authored-by: gengjiaan <gengjiaan@360.cn>
Co-authored-by: Jiaan Geng <beliefer@163.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2021-05-07 10:54:43 +00:00
Yingyi Bu 72d32662d4 [SPARK-35144][SQL] Migrate to transformWithPruning for object rules
### What changes were proposed in this pull request?

Added the following TreePattern enums:
- APPEND_COLUMNS
- DESERIALIZE_TO_OBJECT
- LAMBDA_VARIABLE
- MAP_OBJECTS
- SERIALIZE_FROM_OBJECT
- PROJECT
- TYPED_FILTER

Added tree traversal pruning to the following rules dealing with objects:
- EliminateSerialization
- CombineTypedFilters
- EliminateMapObjects
- ObjectSerializerPruning

### Why are the changes needed?

Reduce the number of tree traversals and hence improve the query compilation latency.

### How was this patch tested?

Existing tests.

Closes #32451 from sigmod/object.

Authored-by: Yingyi Bu <yingyi.bu@databricks.com>
Signed-off-by: Gengliang Wang <ltnwgl@gmail.com>
2021-05-07 18:36:28 +08:00
Wenchen Fan 9aa18dfe19 [SPARK-35333][SQL] Skip object null check in Invoke if possible
### What changes were proposed in this pull request?

If `targetObject` is not nullable, we don't need the object null check in `Invoke`.

### Why are the changes needed?

small perf improvement

### Does this PR introduce _any_ user-facing change?

no

### How was this patch tested?

existing tests

Closes #32466 from cloud-fan/invoke.

Authored-by: Wenchen Fan <wenchen@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2021-05-07 10:27:28 +00:00
gengjiaan cf2c4ba584 [SPARK-35020][SQL] Group exception messages in catalyst/util
### What changes were proposed in this pull request?
This PR group exception messages in `sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util`.

### Why are the changes needed?
It will largely help with standardization of error messages and its maintenance.

### Does this PR introduce _any_ user-facing change?
No. Error messages remain unchanged.

### How was this patch tested?
No new tests - pass all original tests to make sure it doesn't break any existing behavior.

Closes #32367 from beliefer/SPARK-35020.

Lead-authored-by: gengjiaan <gengjiaan@360.cn>
Co-authored-by: beliefer <beliefer@163.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2021-05-07 08:30:30 +00:00
Wenchen Fan e83910f1f8 [SPARK-26164][SQL][FOLLOWUP] WriteTaskStatsTracker should know which file the row is written to
### What changes were proposed in this pull request?

This is a follow-up of https://github.com/apache/spark/pull/32198

Before https://github.com/apache/spark/pull/32198, in `WriteTaskStatsTracker.newRow`, we know that the row is written to the current file. After https://github.com/apache/spark/pull/32198 , we no longer know this connection.

This PR adds the file path parameter in `WriteTaskStatsTracker.newRow` to bring back the connection.

### Why are the changes needed?

To not break some custom `WriteTaskStatsTracker` implementations.

### Does this PR introduce _any_ user-facing change?

no

### How was this patch tested?

N/A

Closes #32459 from cloud-fan/minor.

Authored-by: Wenchen Fan <wenchen@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2021-05-07 08:28:42 +00:00
Terry Kim 33c1034315 [SPARK-34701][SQL][FOLLOW-UP] Children/innerChildren should be mutually exclusive for AnalysisOnlyCommand
### What changes were proposed in this pull request?

This is a follow up to https://github.com/apache/spark/pull/32032#discussion_r620928086. Basically, `children`/`innerChildren` should be mutually exclusive for `AlterViewAsCommand` and `CreateViewCommand`, which extend `AnalysisOnlyCommand`. Otherwise, there could be an issue in the `EXPLAIN` command. Currently, this is not an issue, because these commands will be analyzed (children will always be empty) when the `EXPLAIN` command is run.

### Why are the changes needed?

To be future-proof where these commands are directly used.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Added new tsts

Closes #32447 from imback82/SPARK-34701-followup.

Authored-by: Terry Kim <yuminkim@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2021-05-07 06:07:53 +00:00
Cheng Su 42f59caf73 [SPARK-35133][SQL] Explain codegen works with AQE
### What changes were proposed in this pull request?

`EXPLAIN CODEGEN <query>` (and Dataset.explain("codegen")) prints out the generated code for each stage of plan. The current implementation is to match `WholeStageCodegenExec` operator in query plan and prints out generated code (https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/debug/package.scala#L111-L118 ). This does not work with AQE as we wrap the whole query plan inside `AdaptiveSparkPlanExec` and do not run whole stage code-gen physical plan rule eagerly (`CollapseCodegenStages`). This introduces unexpected behavior change for EXPLAIN query (and Dataset.explain), as we enable AQE by default now.

The change is to explain code-gen for the current executed plan of AQE.

### Why are the changes needed?

Make `EXPLAIN CODEGEN` work same as before.

### Does this PR introduce _any_ user-facing change?

No (when comparing with latest Spark release 3.1.1).

### How was this patch tested?

Added unit test in `ExplainSuite.scala`.

Closes #32430 from c21/explain-aqe.

Authored-by: Cheng Su <chengsu@fb.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2021-05-06 20:44:31 -07:00
Yuanjian Li dfb3343423 [SPARK-34526][SS] Ignore the error when checking the path in FileStreamSink.hasMetadata
### What changes were proposed in this pull request?
When checking the path in `FileStreamSink.hasMetadata`, we should ignore the error and assume the user wants to read a batch output.

### Why are the changes needed?
Keep the original behavior of ignoring the error.

### Does this PR introduce _any_ user-facing change?
Yes.
The path checking will not throw an exception when checking file sink format

### How was this patch tested?
New UT added.

Closes #31638 from xuanyuanking/SPARK-34526.

Authored-by: Yuanjian Li <yuanjian.li@databricks.com>
Signed-off-by: Jungtaek Lim <kabhwan.opensource@gmail.com>
2021-05-06 22:48:53 +09:00
Liang-Chi Hsieh 6cd5cf5722 [SPARK-35215][SQL] Update custom metric per certain rows and at the end of the task
### What changes were proposed in this pull request?

This patch changes custom metric updating to update per certain rows (currently 100), instead of per row.

### Why are the changes needed?

Based on previous discussion https://github.com/apache/spark/pull/31451#discussion_r605413557, we should only update custom metrics per certain (e.g. 100) rows and also at the end of the task. Updating per row doesn't make too much benefit.

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

Existing unit test.

Closes #32330 from viirya/metric-update.

Authored-by: Liang-Chi Hsieh <viirya@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2021-05-06 13:21:08 +00:00
Liang-Chi Hsieh c6d3f3778f [SPARK-35240][SS] Use CheckpointFileManager for checkpoint file manipulation
### What changes were proposed in this pull request?

This patch changes a few places using `FileSystem` API to manipulate checkpoint file to `CheckpointFileManager`.

### Why are the changes needed?

`CheckpointFileManager` is designed to handle checkpoint file manipulation. However, there are a few places exposing `FileSystem` from checkpoint files/paths. We should use `CheckpointFileManager` to manipulate checkpoint files. For example, we may want to have one storage system for checkpoint file. If all checkpoint file manipulation is performed through `CheckpointFileManager`, we can only implement `CheckpointFileManager` for the storage system, and don't need to implement `FileSystem` API for it.

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

Existing unit tests.

Closes #32361 from viirya/checkpoint-manager.

Authored-by: Liang-Chi Hsieh <viirya@gmail.com>
Signed-off-by: Liang-Chi Hsieh <viirya@gmail.com>
2021-05-06 00:49:37 -07:00
Linhong Liu 3f5a20919c [SPARK-35318][SQL] Hide internal view properties for describe table cmd
### What changes were proposed in this pull request?
Hide internal view properties for describe table command, because those
properties are generated by spark and should be transparent to the end-user.

### Why are the changes needed?
Avoid internal properties confusing the users.

### Does this PR introduce _any_ user-facing change?
Yes
Before this change, the user will see below output for `describe formatted test_view`
```
....
Table Properties       [view.catalogAndNamespace.numParts=2, view.catalogAndNamespace.part.0=spark_catalog, view.catalogAndNamespace.part.1=default, view.query.out.col.0=c, view.query.out.col.1=v, view.query.out.numCols=2, view.referredTempFunctionsNames=[], view.referredTempViewNames=[]]
...
```
After this change, the internal properties will be hidden for `describe formatted test_view`
```
...
Table Properties        []
...
```

### How was this patch tested?
existing UT

Closes #32441 from linhongliu-db/hide-properties.

Authored-by: Linhong Liu <linhong.liu@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2021-05-06 07:31:34 +00:00
Takeshi Yamamuro 5c67d0c8f7 [SPARK-35293][SQL][TESTS] Use the newer dsdgen for TPCDSQueryTestSuite
### What changes were proposed in this pull request?

This PR intends to replace `maropu/spark-tpcds-datagen` with `databricks/tpcds-kit` for using a newer dsdgen and update the golden files in `tpcds-query-results`.

### Why are the changes needed?

For better testing.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

GA passed.

Closes #32420 from maropu/UseTpcdsKit.

Authored-by: Takeshi Yamamuro <yamamuro@apache.org>
Signed-off-by: Takeshi Yamamuro <yamamuro@apache.org>
2021-05-06 15:25:46 +09:00
Dongjoon Hyun 19661f6ae2 [SPARK-35325][SQL][TESTS] Add nested column ORC encryption test case
### What changes were proposed in this pull request?

This PR aims to enrich ORC encryption test coverage for nested columns.

### Why are the changes needed?

This will provide a test coverage for this feature.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Pass the CIs with the newly added test case.

Closes #32449 from dongjoon-hyun/SPARK-35325.

Authored-by: Dongjoon Hyun <dhyun@apple.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2021-05-05 22:29:54 -07:00
Yingyi Bu 7970318296 [SPARK-35155][SQL] Add rule id pruning to Analyzer rules
### What changes were proposed in this pull request?

Added rule id based pruning to Analyzer rules in fixed point batches:

- org.apache.spark.sql.catalyst.analysis.Analyzer$AddMetadataColumns
- org.apache.spark.sql.catalyst.analysis.Analyzer$ExtractGenerator
- org.apache.spark.sql.catalyst.analysis.Analyzer$ExtractWindowExpressions
- org.apache.spark.sql.catalyst.analysis.Analyzer$GlobalAggregates
- org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveAggAliasInGroupBy
- org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveAggregateFunctions
- org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveAliases
- org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveBinaryArithmetic
- org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveDeserializer
- org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveEncodersInUDF
- org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveFunctions
- org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveGenerate
- org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveGroupingAnalytics
- org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveInsertInto
- org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveMissingReferences
- org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveNewInstance
- org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveOrdinalInOrderByAndGroupBy
- org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveOutputRelation
- org.apache.spark.sql.catalyst.analysis.Analyzer$ResolvePivot
- org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveRandomSeed
- org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveReferences
- org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveRelations
- org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveSubqueryColumnAliases
- org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveTables
- org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveTempViews
- org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveUpCast
- org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveUserSpecifiedColumns
- org.apache.spark.sql.catalyst.analysis.Analyzer$WindowsSubstitution
- org.apache.spark.sql.catalyst.analysis.DeduplicateRelations
- org.apache.spark.sql.catalyst.analysis.EliminateSubqueryAliases
- org.apache.spark.sql.catalyst.analysis.EliminateUnions
- org.apache.spark.sql.catalyst.analysis.ResolveCreateNamedStruct
- org.apache.spark.sql.catalyst.analysis.ResolveHints$ResolveCoalesceHints
- org.apache.spark.sql.catalyst.analysis.ResolveHints$ResolveJoinStrategyHints
- org.apache.spark.sql.catalyst.analysis.ResolveInlineTables
- org.apache.spark.sql.catalyst.analysis.ResolveLambdaVariables
- org.apache.spark.sql.catalyst.analysis.ResolveTimeZone
- org.apache.spark.sql.catalyst.analysis.ResolveUnion
- org.apache.spark.sql.catalyst.analysis.SubstituteUnresolvedOrdinals
- org.apache.spark.sql.catalyst.analysis.TimeWindowing

Subsequent PRs will add tree bits based pruning to those rules. Split a big PR to reduce review load.

### Why are the changes needed?

Reduce the number of tree traversals and hence improve the query compilation latency.

### How was this patch tested?

Existing tests.

Closes #32425 from sigmod/analyzer.

Authored-by: Yingyi Bu <yingyi.bu@databricks.com>
Signed-off-by: Gengliang Wang <ltnwgl@gmail.com>
2021-05-06 08:55:29 +08:00
Yijia Cui bbdbe0f734 [SPARK-34854][SQL][SS] Expose source metrics via progress report and add Kafka use-case to report delay
### What changes were proposed in this pull request?
This pull request proposes a new API for streaming sources to signal that they can report metrics, and adds a use case to support Kafka micro batch stream to report the stats of # of offsets for the current offset falling behind the latest.

A public interface is added.

`metrics`: returns the metrics reported by the streaming source with given offset.

### Why are the changes needed?
The new API can expose any custom metrics for the "current" offset for streaming sources. Different from #31398, this PR makes metrics available to user through progress report, not through spark UI. A use case is that people want to know how the current offset falls behind the latest offset.

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
Unit test for Kafka micro batch source v2 are added to test the Kafka use case.

Closes #31944 from yijiacui-db/SPARK-34297.

Authored-by: Yijia Cui <yijia.cui@databricks.com>
Signed-off-by: Jungtaek Lim <kabhwan.opensource@gmail.com>
2021-05-05 17:26:07 +09:00
dsolow f550e03b96 [SPARK-34794][SQL] Fix lambda variable name issues in nested DataFrame functions
### What changes were proposed in this pull request?

To fix lambda variable name issues in nested DataFrame functions, this PR modifies code to use a global counter for `LambdaVariables` names created by higher order functions.

This is the rework of #31887. Closes #31887.

### Why are the changes needed?

 This moves away from the current hard-coded variable names which break on nested function calls. There is currently a bug where nested transforms in particular fail (the inner variable shadows the outer variable)

For this query:
```
val df = Seq(
    (Seq(1,2,3), Seq("a", "b", "c"))
).toDF("numbers", "letters")

df.select(
    f.flatten(
        f.transform(
            $"numbers",
            (number: Column) => { f.transform(
                $"letters",
                (letter: Column) => { f.struct(
                    number.as("number"),
                    letter.as("letter")
                ) }
            ) }
        )
    ).as("zipped")
).show(10, false)
```
This is the current (incorrect) output:
```
+------------------------------------------------------------------------+
|zipped                                                                  |
+------------------------------------------------------------------------+
|[{a, a}, {b, b}, {c, c}, {a, a}, {b, b}, {c, c}, {a, a}, {b, b}, {c, c}]|
+------------------------------------------------------------------------+
```
And this is the correct output after fix:
```
+------------------------------------------------------------------------+
|zipped                                                                  |
+------------------------------------------------------------------------+
|[{1, a}, {1, b}, {1, c}, {2, a}, {2, b}, {2, c}, {3, a}, {3, b}, {3, c}]|
+------------------------------------------------------------------------+
```

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

Added the new test in `DataFrameFunctionsSuite`.

Closes #32424 from maropu/pr31887.

Lead-authored-by: dsolow <dsolow@sayari.com>
Co-authored-by: Takeshi Yamamuro <yamamuro@apache.org>
Co-authored-by: dmsolow <dsolow@sayarianalytics.com>
Signed-off-by: Takeshi Yamamuro <yamamuro@apache.org>
2021-05-05 12:46:13 +09:00
Yingyi Bu 7fd3f8f9ec [SPARK-35294][SQL] Add tree traversal pruning in rules with dedicated files under optimizer
### What changes were proposed in this pull request?

Added the following TreePattern enums:
- CREATE_NAMED_STRUCT
- EXTRACT_VALUE
- JSON_TO_STRUCT
- OUTER_REFERENCE
- AGGREGATE
- LOCAL_RELATION
- EXCEPT
- LIMIT
- WINDOW

Used them in the following rules:
- DecorrelateInnerQuery
- LimitPushDownThroughWindow
- OptimizeCsvJsonExprs
- PropagateEmptyRelation
- PullOutGroupingExpressions
- PushLeftSemiLeftAntiThroughJoin
- ReplaceExceptWithFilter
- RewriteDistinctAggregates
- SimplifyConditionalsInPredicate
- UnwrapCastInBinaryComparison

### Why are the changes needed?

Reduce the number of tree traversals and hence improve the query compilation latency.

### How was this patch tested?

Existing tests.

Closes #32421 from sigmod/opt.

Authored-by: Yingyi Bu <yingyi.bu@databricks.com>
Signed-off-by: Gengliang Wang <ltnwgl@gmail.com>
2021-05-04 19:17:22 +08:00
HyukjinKwon 8aaa9e890a [SPARK-35250][SQL][DOCS] Fix duplicated STOP_AT_DELIMITER to SKIP_VALUE at CSV's unescapedQuoteHandling option documentation
### What changes were proposed in this pull request?

This is rather a followup of https://github.com/apache/spark/pull/30518 that should be ported back to `branch-3.1` too.
`STOP_AT_DELIMITER` was mistakenly used twice. The duplicated `STOP_AT_DELIMITER` should be `SKIP_VALUE` in the documentation.

### Why are the changes needed?

To correctly document.

### Does this PR introduce _any_ user-facing change?

Yes, it fixes the user-facing documentation.

### How was this patch tested?

I checked them via running linters.

Closes #32423 from HyukjinKwon/SPARK-35250.

Authored-by: HyukjinKwon <gurwls223@apache.org>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2021-05-04 08:44:18 +09:00
Tobias Hermann 54e0aa10c8 [MINOR][SS][DOCS] Fix a typo in the documentation of GroupState
### What changes were proposed in this pull request?

Fixing some typos in the documenting comments.

### Why are the changes needed?

To make reading the docs more pleasant.

### Does this PR introduce _any_ user-facing change?

Yes, since the user sees the docs.

### How was this patch tested?

It was not tested, because no code was changed.

Closes #32400 from Dobiasd/patch-1.

Authored-by: Tobias Hermann <editgym@gmail.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2021-05-03 19:35:38 +09:00
Chao Sun 2a8d7ed4bf [SPARK-35281][SQL] StaticInvoke should not apply boxing if return type is primitive
### What changes were proposed in this pull request?

In `StaticInvoke`, when result is nullable, don't box the return value if its type is primitive.

### Why are the changes needed?

It is unnecessary to apply boxing when the method return value is of primitive type, and it would hurt performance a lot if the method is simple. The check is done in `Invoke` but not in `StaticInvoke`.

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

Added a UT.

Closes #32416 from sunchao/SPARK-35281.

Authored-by: Chao Sun <sunchao@apache.org>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2021-05-03 14:55:35 +09:00
Max Gekk 335f00b19b [SPARK-35285][SQL] Parse ANSI interval types in SQL schema
### What changes were proposed in this pull request?
1. Extend Spark SQL parser to support parsing of:
    - `INTERVAL YEAR TO MONTH` to `YearMonthIntervalType`
    - `INTERVAL DAY TO SECOND` to `DayTimeIntervalType`
2. Assign new names to the ANSI interval types according to the SQL standard to be able to parse the names back by Spark SQL parser. Override the `typeName()` name of `YearMonthIntervalType`/`DayTimeIntervalType`.

### Why are the changes needed?
To be able to use new ANSI interval types in SQL. The SQL standard requires the types to be defined according to the rules:
```
<interval type> ::= INTERVAL <interval qualifier>
<interval qualifier> ::= <start field> TO <end field> | <single datetime field>
<start field> ::= <non-second primary datetime field> [ <left paren> <interval leading field precision> <right paren> ]
<end field> ::= <non-second primary datetime field> | SECOND [ <left paren> <interval fractional seconds precision> <right paren> ]
<primary datetime field> ::= <non-second primary datetime field | SECOND
<non-second primary datetime field> ::= YEAR | MONTH | DAY | HOUR | MINUTE
<interval fractional seconds precision> ::= <unsigned integer>
<interval leading field precision> ::= <unsigned integer>
```
Currently, Spark SQL supports only `YEAR TO MONTH` and `DAY TO SECOND` as `<interval qualifier>`.

### Does this PR introduce _any_ user-facing change?
Should not since the types has not been released yet.

### How was this patch tested?
By running the affected tests such as:
```
$ build/sbt "sql/testOnly *SQLQueryTestSuite -- -z interval.sql"
$ build/sbt "sql/testOnly *SQLQueryTestSuite -- -z datetime.sql"
$ build/sbt "test:testOnly *ExpressionTypeCheckingSuite"
$ build/sbt "sql/testOnly *SQLQueryTestSuite -- -z windowFrameCoercion.sql"
$ build/sbt "sql/testOnly *SQLQueryTestSuite -- -z literals.sql"
```

Closes #32409 from MaxGekk/parse-ansi-interval-types.

Authored-by: Max Gekk <max.gekk@gmail.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2021-05-03 13:50:35 +09:00
Takeshi Yamamuro cd689c942c [SPARK-35192][SQL][TESTS] Port minimal TPC-DS datagen code from databricks/spark-sql-perf
### What changes were proposed in this pull request?

This PR proposes to port minimal code to generate TPC-DS data from [databricks/spark-sql-perf](https://github.com/databricks/spark-sql-perf). The classes in a new class file `tpcdsDatagen.scala` are basically copied from the `databricks/spark-sql-perf` codebase.
Note that I've modified them a bit to follow the Spark code style and removed unnecessary parts from them.

The code authors of these classes are:
juliuszsompolski
npoggi
wangyum

### Why are the changes needed?

We frequently use TPCDS data now for benchmarks/tests, but the classes for the TPCDS schemas of datagen and benchmarks/tests are managed separately, e.g.,
 - https://github.com/apache/spark/blob/master/sql/core/src/test/scala/org/apache/spark/sql/TPCDSBase.scala
 - https://github.com/databricks/spark-sql-perf/blob/master/src/main/scala/com/databricks/spark/sql/perf/tpcds/TPCDSTables.scala

 I think this causes some inconveniences, e.g., we need to update both files in the separate repositories if we update the TPCDS schema #32037. So, it would be useful for the Spark codebase to generate them by referring to the same schema definition.

### Does this PR introduce _any_ user-facing change?

dev only.

### How was this patch tested?

Manually checked and GA passed.

Closes #32243 from maropu/tpcdsDatagen.

Authored-by: Takeshi Yamamuro <yamamuro@apache.org>
Signed-off-by: Takeshi Yamamuro <yamamuro@apache.org>
2021-05-03 12:04:42 +09:00
Angerszhuuuu caa46ce0b6 [SPARK-35112][SQL] Support Cast string to day-second interval
### What changes were proposed in this pull request?
Support Cast string to day-seconds interval

### Why are the changes needed?
Users can cast day-second interval string to DayTimeIntervalType.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
Added UT

Closes #32271 from AngersZhuuuu/SPARK-35112.

Lead-authored-by: Angerszhuuuu <angers.zhu@gmail.com>
Co-authored-by: AngersZhuuuu <angers.zhu@gmail.com>
Signed-off-by: Max Gekk <max.gekk@gmail.com>
2021-05-02 09:28:51 +03:00
Peter Toth cfc0495f9c [SPARK-34581][SQL] Don't optimize out grouping expressions from aggregate expressions without aggregate function
### What changes were proposed in this pull request?
This PR adds a new rule `PullOutGroupingExpressions` to pull out complex grouping expressions to a `Project` node under an `Aggregate`. These expressions are then referenced in both grouping expressions and aggregate expressions without aggregate functions to ensure that optimization rules don't change the aggregate expressions to invalid ones that no longer refer to any grouping expressions.

### Why are the changes needed?
If aggregate expressions (without aggregate functions) in an `Aggregate` node are complex then the `Optimizer` can optimize out grouping expressions from them and so making aggregate expressions invalid.

Here is a simple example:
```
SELECT not(t.id IS NULL) , count(*)
FROM t
GROUP BY t.id IS NULL
```
In this case the `BooleanSimplification` rule does this:
```
=== Applying Rule org.apache.spark.sql.catalyst.optimizer.BooleanSimplification ===
!Aggregate [isnull(id#222)], [NOT isnull(id#222) AS (NOT (id IS NULL))#226, count(1) AS c#224L]   Aggregate [isnull(id#222)], [isnotnull(id#222) AS (NOT (id IS NULL))#226, count(1) AS c#224L]
 +- Project [value#219 AS id#222]                                                                 +- Project [value#219 AS id#222]
    +- LocalRelation [value#219]                                                                     +- LocalRelation [value#219]
```
where `NOT isnull(id#222)` is optimized to `isnotnull(id#222)` and so it no longer refers to any grouping expression.

Before this PR:
```
== Optimized Logical Plan ==
Aggregate [isnull(id#222)], [isnotnull(id#222) AS (NOT (id IS NULL))#234, count(1) AS c#232L]
+- Project [value#219 AS id#222]
   +- LocalRelation [value#219]
```
and running the query throws an error:
```
Couldn't find id#222 in [isnull(id#222)#230,count(1)#226L]
java.lang.IllegalStateException: Couldn't find id#222 in [isnull(id#222)#230,count(1)#226L]
```

After this PR:
```
== Optimized Logical Plan ==
Aggregate [_groupingexpression#233], [NOT _groupingexpression#233 AS (NOT (id IS NULL))#230, count(1) AS c#228L]
+- Project [isnull(value#219) AS _groupingexpression#233]
   +- LocalRelation [value#219]
```
and the query works.

### Does this PR introduce _any_ user-facing change?
Yes, the query works.

### How was this patch tested?
Added new UT.

Closes #32396 from peter-toth/SPARK-34581-keep-grouping-expressions-2.

Authored-by: Peter Toth <peter.toth@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2021-05-02 05:53:09 +00:00
Liang-Chi Hsieh 6ce1b161e9 [SPARK-35278][SQL] Invoke should find the method with correct number of parameters
### What changes were proposed in this pull request?

This patch fixes `Invoke` expression when the target object has more than one method with the given method name.

### Why are the changes needed?

`Invoke` will find out the method on the target object with given method name. If there are more than one method with the name, currently it is undeterministic which method will be used. We should add the condition of parameter number when finding the method.

### Does this PR introduce _any_ user-facing change?

Yes, fixed a bug when using `Invoke` on a object where more than one method with the given method name.

### How was this patch tested?

Unit test.

Closes #32404 from viirya/verify-invoke-param-len.

Authored-by: Liang-Chi Hsieh <viirya@gmail.com>
Signed-off-by: Liang-Chi Hsieh <viirya@gmail.com>
2021-05-01 10:20:46 -07:00
Yuming Wang 72e238a790 [SPARK-35273][SQL] CombineFilters support non-deterministic expressions
### What changes were proposed in this pull request?

This pr makes `CombineFilters` support non-deterministic expressions. For example:
```sql
spark.sql("CREATE TABLE t1(id INT, dt STRING) using parquet PARTITIONED BY (dt)")
spark.sql("CREATE VIEW v1 AS SELECT * FROM t1 WHERE dt NOT IN ('2020-01-01', '2021-01-01')")
spark.sql("SELECT * FROM v1 WHERE dt = '2021-05-01' AND rand() <= 0.01").explain()
```

Before this pr:
```
== Physical Plan ==
*(1) Filter (isnotnull(dt#1) AND ((dt#1 = 2021-05-01) AND (rand(-6723800298719475098) <= 0.01)))
+- *(1) ColumnarToRow
   +- FileScan parquet default.t1[id#0,dt#1] Batched: true, DataFilters: [], Format: Parquet, Location: InMemoryFileIndex(0 paths)[], PartitionFilters: [NOT dt#1 IN (2020-01-01,2021-01-01)], PushedFilters: [], ReadSchema: struct<id:int>
```

After this pr:
```
== Physical Plan ==
*(1) Filter (rand(-2400509328955813273) <= 0.01)
+- *(1) ColumnarToRow
   +- FileScan parquet default.t1[id#0,dt#1] Batched: true, DataFilters: [], Format: Parquet, Location: InMemoryFileIndex(0 paths)[], PartitionFilters: [isnotnull(dt#1), NOT dt#1 IN (2020-01-01,2021-01-01), (dt#1 = 2021-05-01)], PushedFilters: [], ReadSchema: struct<id:int>
```

### Why are the changes needed?

Improve query performance.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Unit test.

Closes #32405 from wangyum/SPARK-35273.

Authored-by: Yuming Wang <yumwang@ebay.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2021-05-01 06:02:11 +00:00
ulysses-you 39889df32a [SPARK-35264][SQL] Support AQE side broadcastJoin threshold
### What changes were proposed in this pull request?

~~This PR aims to add a new AQE optimizer rule `DynamicJoinSelection`. Like other AQE partition number configs, this rule add a new broadcast threshold config `spark.sql.adaptive.autoBroadcastJoinThreshold`.~~
This PR amis to add a flag in `Statistics` to distinguish AQE stats or normal stats, so that we can make some sql configs isolation between AQE and normal.

### Why are the changes needed?

The main idea here is that make join config isolation between normal planner and aqe planner which shared the same code path.

Actually we do not very trust using the static stats to consider if it can build broadcast hash join. In our experience it's very common that Spark throw broadcast timeout or driver side OOM exception when execute a bit large plan. And due to braodcast join is not reversed which means if we covert join to braodcast hash join at first time, we(AQE) can not optimize it again, so it should make sense to decide if we can do broadcast at aqe side using different sql config.

### Does this PR introduce _any_ user-facing change?

Yes, a new config `spark.sql.adaptive.autoBroadcastJoinThreshold` added.

### How was this patch tested?

Add new test.

Closes #32391 from ulysses-you/SPARK-35264.

Authored-by: ulysses-you <ulyssesyou18@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2021-04-30 09:16:21 +00:00
Angerszhuuuu 11ea255283 [SPARK-35111][SQL] Support Cast string to year-month interval
### What changes were proposed in this pull request?
Support Cast string to year-month interval
Supported format as below
```
ANSI_STYLE, like
INTERVAL -'-10-1' YEAR TO MONTH
HIVE_STYLE like
10-1 or -10-1

Rules from the SQL standard about ANSI_STYLE:

<interval literal> ::=
  INTERVAL [ <sign> ] <interval string> <interval qualifier>
<interval string> ::=
  <quote> <unquoted interval string> <quote>
<unquoted interval string> ::=
  [ <sign> ] { <year-month literal> | <day-time literal> }
<year-month literal> ::=
  <years value> [ <minus sign> <months value> ]
  | <months value>
<years value> ::=
  <datetime value>
<months value> ::=
  <datetime value>
<datetime value> ::=
  <unsigned integer>
<unsigned integer> ::= <digit>...
```
### Why are the changes needed?
Support Cast string to year-month interval

### Does this PR introduce _any_ user-facing change?
User can cast year month interval string to YearMonthIntervalType

### How was this patch tested?
Added UT

Closes #32266 from AngersZhuuuu/SPARK-SPARK-35111.

Authored-by: Angerszhuuuu <angers.zhu@gmail.com>
Signed-off-by: Max Gekk <max.gekk@gmail.com>
2021-04-30 08:03:07 +03:00
Kousuke Saruta e8bf8fe213 [SPARK-35047][SQL] Allow Json datasources to write non-ascii characters as codepoints
### What changes were proposed in this pull request?

This PR proposes to enable the JSON datasources to write non-ascii characters as codepoints.
To enable/disable this feature, I introduce a new option `writeNonAsciiCharacterAsCodePoint` for JSON datasources.

### Why are the changes needed?

JSON specification allows codepoints as literal but Spark SQL's JSON datasources don't support the way to do it.
It's great if we can write non-ascii characters as codepoints, which is a platform neutral representation.

### Does this PR introduce _any_ user-facing change?

Yes. Users can write non-ascii characters as codepoints with JSON datasources.

### How was this patch tested?

New test.

Closes #32147 from sarutak/json-unicode-write.

Authored-by: Kousuke Saruta <sarutak@oss.nttdata.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2021-04-29 09:50:15 -07:00
Kousuke Saruta 132cbf0c8c [SPARK-35105][SQL] Support multiple paths for ADD FILE/JAR/ARCHIVE commands
### What changes were proposed in this pull request?

This PR extends `ADD FILE/JAR/ARCHIVE` commands to be able to take multiple path arguments like Hive.

### Why are the changes needed?

To make those commands more useful.

### Does this PR introduce _any_ user-facing change?

Yes. In the current implementation, those commands can take a path which contains whitespaces without enclose it by neither `'` nor `"` but after this change, users need to enclose such paths.
I've note this incompatibility in the migration guide.

### How was this patch tested?

New tests.

Closes #32205 from sarutak/add-multiple-files.

Authored-by: Kousuke Saruta <sarutak@oss.nttdata.com>
Signed-off-by: Kousuke Saruta <sarutak@oss.nttdata.com>
2021-04-29 13:58:51 +09:00
Kousuke Saruta 529b875901 [SPARK-35226][SQL] Support refreshKrb5Config option in JDBC datasources
### What changes were proposed in this pull request?

This PR proposes to introduce a new JDBC option `refreshKrb5Config` which allows to reflect the change of `krb5.conf`.

### Why are the changes needed?

In the current master, JDBC datasources can't accept `refreshKrb5Config` which is defined in `Krb5LoginModule`.
So even if we change the `krb5.conf` after establishing a connection, the change will not be reflected.

The similar issue happens when we run multiple `*KrbIntegrationSuites` at the same time.
`MiniKDC` starts and stops every KerberosIntegrationSuite and different port number is recorded to `krb5.conf`.
Due to `SecureConnectionProvider.JDBCConfiguration` doesn't take `refreshKrb5Config`, KerberosIntegrationSuites except the first running one see the wrong port so those suites fail.
You can easily confirm with the following command.
```
build/sbt -Phive Phive-thriftserver -Pdocker-integration-tests "testOnly org.apache.spark.sql.jdbc.*KrbIntegrationSuite"
```
### Does this PR introduce _any_ user-facing change?

Yes. Users can set `refreshKrb5Config` to refresh krb5 relevant configuration.

### How was this patch tested?

New test.

Closes #32344 from sarutak/kerberos-refresh-issue.

Authored-by: Kousuke Saruta <sarutak@oss.nttdata.com>
Signed-off-by: Kousuke Saruta <sarutak@oss.nttdata.com>
2021-04-29 13:55:53 +09:00
Kent Yao 771356555c [SPARK-34786][SQL][FOLLOWUP] Explicitly declare DecimalType(20, 0) for Parquet UINT_64
### What changes were proposed in this pull request?

Explicitly declare DecimalType(20, 0) for Parquet UINT_64, avoid use DecimalType.LongDecimal which only happens to have 20 as precision.

https://github.com/apache/spark/pull/31960#discussion_r622691560

### Why are the changes needed?

fix ambiguity

### Does this PR introduce _any_ user-facing change?

no

### How was this patch tested?

not needed, just current CI pass

Closes #32390 from yaooqinn/SPARK-34786-F.

Authored-by: Kent Yao <yao@apache.org>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2021-04-29 04:51:27 +00:00
Wenchen Fan 403e4795e9 [SPARK-35244][SQL][FOLLOWUP] Add null check for the exception cause
### What changes were proposed in this pull request?

Make sure we re-throw an exception that is not null.

### Why are the changes needed?

to be super safe

### Does this PR introduce _any_ user-facing change?

no

### How was this patch tested?

N/A

Closes #32387 from cloud-fan/minor.

Authored-by: Wenchen Fan <wenchen@databricks.com>
Signed-off-by: Takeshi Yamamuro <yamamuro@apache.org>
2021-04-29 09:21:32 +09:00
Chao Sun 86d3bb5f7d [SPARK-34981][SQL] Implement V2 function resolution and evaluation
Co-Authored-By: Chao Sun <sunchaoapple.com>
Co-Authored-By: Ryan Blue <rbluenetflix.com>

### What changes were proposed in this pull request?

This implements function resolution and evaluation for functions registered through V2 FunctionCatalog [SPARK-27658](https://issues.apache.org/jira/browse/SPARK-27658). In particular:
- Added documentation for how to define the "magic method" in `ScalarFunction`.
- Added a new expression `ApplyFunctionExpression` which evaluates input by delegating to `ScalarFunction.produceResult` method.
- added a new expression `V2Aggregator` which is a type of `TypedImperativeAggregate`. It's a wrapper of V2 `AggregateFunction` and mostly delegate methods to the implementation of the latter. It also uses plain Java serde for intermediate state.
- Added function resolution logic for `ScalarFunction` and `AggregateFunction` in `Analyzer`.
  + For `ScalarFunction` this checks if the magic method is implemented through Java reflection, and create a `Invoke` expression if so. Otherwise, it checks if the default `produceResult` is overridden. If so, it creates a `ApplyFunctionExpression` which evaluates through `InternalRow`. Otherwise an analysis exception is thrown.
 + For `AggregateFunction`, this checks if the `update` method is overridden. If so, it converts it to `V2Aggregator`. Otherwise an analysis exception is thrown similar to the case of `ScalarFunction`.
- Extended existing `InMemoryTableCatalog` to add the function catalog capability. Also renamed it to `InMemoryCatalog` since it no longer only covers tables.

**Note**: this currently can successfully detect whether a subclass overrides the default `produceResult` or `update` method from the parent interface **only for Java implementations**. It seems in Scala it's hard to differentiate whether a subclass overrides a default method from its parent interface. In this case, it will be a runtime error instead of analysis error.

A few TODOs:
- Extend `V2SessionCatalog` with function catalog. This seems a little tricky since API such V2 `FunctionCatalog`'s `loadFunction` is different from V1 `SessionCatalog`'s `lookupFunction`.
- Add magic method for `AggregateFunction`.
- Type coercion when looking up functions

### Why are the changes needed?

As V2 FunctionCatalog APIs are finalized, we should integrate it with function resolution and evaluation process so that they are actually useful.

### Does this PR introduce _any_ user-facing change?

Yes, now a function exposed through V2 FunctionCatalog can be analyzed and evaluated.

### How was this patch tested?

Added new unit tests.

Closes #32082 from sunchao/resolve-func-v2.

Lead-authored-by: Chao Sun <sunchao@apple.com>
Co-authored-by: Chao Sun <sunchao@apache.org>
Co-authored-by: Chao Sun <sunchao@uber.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2021-04-28 17:21:49 +00:00
ulysses-you 0bcf348438 [SPARK-34781][SQL][FOLLOWUP] Adjust the order of AQE optimizer rules
### What changes were proposed in this pull request?

Reorder  `DemoteBroadcastHashJoin` and `EliminateUnnecessaryJoin`.

### Why are the changes needed?

Skip unnecessary check in `DemoteBroadcastHashJoin` if `EliminateUnnecessaryJoin` affects.

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

No result affect.

Closes #32380 from ulysses-you/SPARK-34781-FOLLOWUP.

Authored-by: ulysses-you <ulyssesyou18@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2021-04-28 13:59:24 +00:00
ulysses-you 8b62c2964d [SPARK-35214][SQL] OptimizeSkewedJoin support ShuffledHashJoinExec
### What changes were proposed in this pull request?

Add `ShuffledHashJoin` pattern check in `OptimizeSkewedJoin` so that we can optimize it.

### Why are the changes needed?

Currently, we have already supported all type of join through hint that make it easy to choose the join implementation.

We would choose `ShuffledHashJoin` if one table is not big but over the broadcast threshold. It's better that we can support optimize it in `OptimizeSkewedJoin`.

### Does this PR introduce _any_ user-facing change?

Probably yes, the execute plan in AQE mode may be changed.

### How was this patch tested?

Improve exists test in `AdaptiveQueryExecSuite`

Closes #32328 from ulysses-you/SPARK-35214.

Authored-by: ulysses-you <ulyssesyou18@gmail.com>
Signed-off-by: Takeshi Yamamuro <yamamuro@apache.org>
2021-04-28 16:57:57 +09:00
gengjiaan 56bb8155c5 [SPARK-35085][SQL] Get columns operation should handle ANSI interval column properly
### What changes were proposed in this pull request?
This PR let JDBC clients identify ANSI interval columns properly.

### Why are the changes needed?
This PR is similar to https://github.com/apache/spark/pull/29539.
JDBC users can query interval values through thrift server, create views with ansi interval columns, e.g.
`CREATE global temp view view1 as select interval '1-1' year to month as I;`
but when they want to get the details of the columns of view1, the will fail with `Unrecognized type name: YEAR-MONTH INTERVAL`
```
Caused by: java.lang.IllegalArgumentException: Unrecognized type name: YEAR-MONTH INTERVAL
	at org.apache.spark.sql.hive.thriftserver.SparkGetColumnsOperation.toJavaSQLType(SparkGetColumnsOperation.scala:190)
	at org.apache.spark.sql.hive.thriftserver.SparkGetColumnsOperation.$anonfun$addToRowSet$1(SparkGetColumnsOperation.scala:206)
	at scala.collection.immutable.List.foreach(List.scala:392)
	at org.apache.spark.sql.hive.thriftserver.SparkGetColumnsOperation.addToRowSet(SparkGetColumnsOperation.scala:198)
	at org.apache.spark.sql.hive.thriftserver.SparkGetColumnsOperation.$anonfun$runInternal$7(SparkGetColumnsOperation.scala:109)
	at org.apache.spark.sql.hive.thriftserver.SparkGetColumnsOperation.$anonfun$runInternal$7$adapted(SparkGetColumnsOperation.scala:109)
	at scala.Option.foreach(Option.scala:407)
	at org.apache.spark.sql.hive.thriftserver.SparkGetColumnsOperation.$anonfun$runInternal$5(SparkGetColumnsOperation.scala:109)
	at org.apache.spark.sql.hive.thriftserver.SparkGetColumnsOperation.$anonfun$runInternal$5$adapted(SparkGetColumnsOperation.scala:107)
	at scala.collection.mutable.ResizableArray.foreach(ResizableArray.scala:62)
	at scala.collection.mutable.ResizableArray.foreach$(ResizableArray.scala:55)
	at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:49)
	at org.apache.spark.sql.hive.thriftserver.SparkGetColumnsOperation.runInternal(SparkGetColumnsOperation.scala:107)
	... 34 more
```

### Does this PR introduce _any_ user-facing change?
Yes. Let hive JDBC recognize ANSI interval.

### How was this patch tested?
Jenkins test.

Closes #32345 from beliefer/SPARK-35085.

Lead-authored-by: gengjiaan <gengjiaan@360.cn>
Co-authored-by: beliefer <beliefer@163.com>
Signed-off-by: Max Gekk <max.gekk@gmail.com>
2021-04-28 08:58:43 +03:00
PengLei 046c8c3dd6 [SPARK-34878][SQL][TESTS] Check actual sizes of year-month and day-time intervals
### What changes were proposed in this pull request?
As we have suport the year-month and day-time intervals.  Add the test actual size of year-month and day-time intervals type

### Why are the changes needed?
Just add test

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
./dev/scalastyle
run test for "ColumnTypeSuite"

Closes #32366 from Peng-Lei/SPARK-34878.

Authored-by: PengLei <18066542445@189.cn>
Signed-off-by: Max Gekk <max.gekk@gmail.com>
2021-04-28 07:48:49 +03:00
Jose Torres 253a1aee46 [SPARK-35246][SS] Don't allow streaming-batch intersects
### What changes were proposed in this pull request?
The UnsupportedOperationChecker shouldn't allow streaming-batch intersects. As described in the ticket, they can't actually be planned correctly, and even simple cases like the below will fail:

```
  test("intersect") {
    val input = MemoryStream[Long]
    val df = input.toDS().intersect(spark.range(10).as[Long])
    testStream(df) (
      AddData(input, 1L),
      CheckAnswer(1)
    )
  }
```

### Why are the changes needed?
Users will be confused by the cryptic errors produced from trying to run an invalid query plan.

### Does this PR introduce _any_ user-facing change?
Some queries which previously failed with a poor error will now fail with a better one.

### How was this patch tested?
modified unit test

Closes #32371 from jose-torres/ossthing.

Authored-by: Jose Torres <joseph.torres@databricks.com>
Signed-off-by: hyukjinkwon <gurwls223@apache.org>
2021-04-28 10:47:11 +09:00
Wenchen Fan 10c2b68d24 [SPARK-35244][SQL] Invoke should throw the original exception
### What changes were proposed in this pull request?

This PR updates the interpreted code path of invoke expressions, to unwrap the `InvocationTargetException`

### Why are the changes needed?

Make interpreted and codegen path consistent for invoke expressions.

### Does this PR introduce _any_ user-facing change?

no

### How was this patch tested?

new UT

Closes #32370 from cloud-fan/minor.

Authored-by: Wenchen Fan <wenchen@databricks.com>
Signed-off-by: hyukjinkwon <gurwls223@apache.org>
2021-04-28 10:45:04 +09:00
Kousuke Saruta abb1f0c5d7 [SPARK-35236][SQL] Support archive files as resources for CREATE FUNCTION USING syntax
### What changes were proposed in this pull request?

This PR proposes to make `CREATE FUNCTION USING` syntax can take archives as resources.

### Why are the changes needed?

It would be useful.
`CREATE FUNCTION USING` syntax doesn't support archives as resources because archives were not supported in Spark SQL.
Now Spark SQL supports archives so I think we can support them for the syntax.

### Does this PR introduce _any_ user-facing change?

Yes. Users can specify archives for `CREATE FUNCTION USING` syntax.

### How was this patch tested?

New test.

Closes #32359 from sarutak/load-function-using-archive.

Authored-by: Kousuke Saruta <sarutak@oss.nttdata.com>
Signed-off-by: hyukjinkwon <gurwls223@apache.org>
2021-04-28 10:15:21 +09:00
Kent Yao 16d223efee [SPARK-35091][SPARK-35090][SQL] Support extract from ANSI Intervals
### What changes were proposed in this pull request?

In this PR, we add extract/date_part support for ANSI Intervals

The `extract` is an ANSI expression and `date_part` is NON-ANSI but exists as an equivalence for `extract`

#### expression

```
<extract expression> ::=
  EXTRACT <left paren> <extract field> FROM <extract source> <right paren>
```

#### <extract field> for interval source

```

<primary datetime field> ::=
    <non-second primary datetime field>
| SECOND
<non-second primary datetime field> ::=
    YEAR
  | MONTH
  | DAY
  | HOUR
  | MINUTE
```

#### dataType

```
If <extract field> is a <primary datetime field> that does not specify SECOND or <extract field> is not a <primary datetime field>, then the declared type of the result is an implementation-defined exact numeric type with scale 0 (zero)

Otherwise, the declared type of the result is an implementation-defined exact numeric type with scale not less than the specified or implied <time fractional seconds precision> or <interval fractional seconds precision>, as appropriate, of the SECOND <primary datetime field> of the <extract source>.
```

### Why are the changes needed?

Subtask of ANSI Intervals Support

### Does this PR introduce _any_ user-facing change?

Yes
1. extract/date_part support ANSI intervals
2. for non-ansi intervals, the return type is changed from long to byte when extracting hours

### How was this patch tested?

new added tests

Closes #32351 from yaooqinn/SPARK-35091.

Authored-by: Kent Yao <yao@apache.org>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2021-04-27 13:06:54 +00:00
ulysses-you 4ff9f1fe3b [SPARK-35239][SQL] Coalesce shuffle partition should handle empty input RDD
### What changes were proposed in this pull request?

Create empty partition for custom shuffle reader if input RDD is empty.

### Why are the changes needed?

If input RDD partition is empty then the map output statistics will be null. And if all shuffle stage's input RDD partition is empty, we will skip it and lose the chance to coalesce partition.

We can simply create a empty partition for these custom shuffle reader to reduce the partition number.

### Does this PR introduce _any_ user-facing change?

Yes, the shuffle partition might be changed in AQE.

### How was this patch tested?

add new test.

Closes #32362 from ulysses-you/SPARK-35239.

Authored-by: ulysses-you <ulyssesyou18@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2021-04-27 13:05:57 +00:00
gengjiaan 55dea2d937 [SPARK-34837][SQL][FOLLOWUP] Fix division by zero in the avg function over ANSI intervals
### What changes were proposed in this pull request?
https://github.com/apache/spark/pull/32229 support ANSI SQL intervals by the aggregate function `avg`.
But have not treat that the input zero rows. so this will lead to:
```
Caused by: java.lang.ArithmeticException: / by zero
	at com.google.common.math.LongMath.divide(LongMath.java:367)
	at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage2.processNext(Unknown Source)
	at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
	at org.apache.spark.sql.execution.WholeStageCodegenExec$$anon$1.hasNext(WholeStageCodegenExec.scala:759)
	at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:458)
	at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:458)
	at org.apache.spark.util.Utils$.getIteratorSize(Utils.scala:1864)
	at org.apache.spark.rdd.RDD.$anonfun$count$1(RDD.scala:1253)
	at org.apache.spark.rdd.RDD.$anonfun$count$1$adapted(RDD.scala:1253)
	at org.apache.spark.SparkContext.$anonfun$runJob$5(SparkContext.scala:2248)
	at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
	at org.apache.spark.scheduler.Task.run(Task.scala:131)
	at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:498)
	at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1437)
	at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:501)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
	at java.lang.Thread.run(Thread.java:748)
```

### Why are the changes needed?
Fix a bug.

### Does this PR introduce _any_ user-facing change?
No. Just new feature.

### How was this patch tested?
new tests.

Closes #32358 from beliefer/SPARK-34837-followup.

Authored-by: gengjiaan <gengjiaan@360.cn>
Signed-off-by: Max Gekk <max.gekk@gmail.com>
2021-04-27 10:52:12 +03:00
Angerszhuuuu 2d2f467831 [SPARK-35169][SQL] Fix wrong result of min ANSI interval division by -1
### What changes were proposed in this pull request?
Before this patch
```
scala> Seq(java.time.Period.ofMonths(Int.MinValue)).toDF("i").select($"i" / -1).show(false)
+-------------------------------------+
|(i / -1)                             |
+-------------------------------------+
|INTERVAL '-178956970-8' YEAR TO MONTH|
+-------------------------------------+
scala> Seq(java.time.Duration.of(Long.MinValue, java.time.temporal.ChronoUnit.MICROS)).toDF("i").select($"i" / -1).show(false)
+---------------------------------------------------+
|(i / -1)                                           |
+---------------------------------------------------+
|INTERVAL '-106751991 04:00:54.775808' DAY TO SECOND|
+---------------------------------------------------+
```

Wrong result of min ANSI interval division by -1, this pr fix this

### Why are the changes needed?
Fix bug

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
Added UT

Closes #32314 from AngersZhuuuu/SPARK-35169.

Authored-by: Angerszhuuuu <angers.zhu@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2021-04-27 07:05:50 +00:00
Cheng Su c4ad86f311 [SPARK-35235][SQL][TEST] Add row-based hash map into aggregate benchmark
### What changes were proposed in this pull request?

`AggregateBenchmark` is only testing the performance for vectorized fast hash map, but not row-based hash map (which is used by default). We should add the row-based hash map into the benchmark.

java 8 benchmark run - https://github.com/c21/spark/actions/runs/787731549
java 11 benchmark run - https://github.com/c21/spark/actions/runs/787742858

### Why are the changes needed?

To have and track a basic sense of benchmarking different fast hash map used in hash aggregate.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Existing unit test, as this only touches benchmark code.

Closes #32357 from c21/agg-benchmark.

Authored-by: Cheng Su <chengsu@fb.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2021-04-27 06:53:42 +00:00
PengLei eb08b9010a [SPARK-35139][SQL] Support ANSI intervals as Arrow Column vectors
### What changes were proposed in this pull request?
 Support YearMonthIntervalType and DayTimeIntervalType to extend ArrowColumnVector

### Why are the changes needed?
https://issues.apache.org/jira/browse/SPARK-35139

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
1. By checking coding style via:
    $ ./dev/scalastyle
    $ ./dev/lint-java
2. Run the test "ArrowWriterSuite"

Closes #32340 from Peng-Lei/SPARK-35139.

Authored-by: PengLei <18066542445@189.cn>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2021-04-27 06:08:17 +00:00
Cheng Su 7f51106c0d [SPARK-26164][SQL] Allow concurrent writers for writing dynamic partitions and bucket table
### What changes were proposed in this pull request?

This is a re-proposal of https://github.com/apache/spark/pull/23163. Currently spark always requires a [local sort](https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/FileFormatWriter.scala#L188) before writing to output table with dynamic partition/bucket columns. The sort can be unnecessary if cardinality of partition/bucket values is small, and can be avoided by keeping multiple output writers concurrently.

This PR introduces a config `spark.sql.maxConcurrentOutputFileWriters` (which disables this feature by default), where user can tune the maximal number of concurrent writers. The config is needed here as we cannot keep arbitrary number of writers in task memory which can cause OOM (especially for Parquet/ORC vectorization writer).

The feature is to first use concurrent writers to write rows. If the number of writers exceeds the above config specified limit. Sort rest of rows and write rows one by one (See `DynamicPartitionDataConcurrentWriter.writeWithIterator()`).

In addition, interface `WriteTaskStatsTracker` and its implementation `BasicWriteTaskStatsTracker` are also changed because previously they are relying on the assumption that only one writer is active for writing dynamic partitions and bucketed table.

### Why are the changes needed?

Avoid the sort before writing output for dynamic partitioned query and bucketed table.
Help improve CPU and IO performance for these queries.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Added unit test in `DataFrameReaderWriterSuite.scala`.

Closes #32198 from c21/writer.

Authored-by: Cheng Su <chengsu@fb.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2021-04-27 05:37:08 +00:00
Terry Kim 7779fce79a [SPARK-35225][SQL] EXPLAIN command should handle empty output of analyzed plan
### What changes were proposed in this pull request?

EXPLAIN command puts an empty line if there is no output for an analyzed plan. For example,

`sql("CREATE VIEW test AS SELECT 1").explain(true)` produces:
```
== Parsed Logical Plan ==
'CreateViewStatement [test], SELECT 1, false, false, PersistedView
+- 'Project [unresolvedalias(1, None)]
   +- OneRowRelation

== Analyzed Logical Plan ==

CreateViewCommand `default`.`test`, SELECT 1, false, false, PersistedView, true
   +- Project [1 AS 1#7]
      +- OneRowRelation

== Optimized Logical Plan ==
CreateViewCommand `default`.`test`, SELECT 1, false, false, PersistedView, true
   +- Project [1 AS 1#7]
      +- OneRowRelation

== Physical Plan ==
Execute CreateViewCommand
   +- CreateViewCommand `default`.`test`, SELECT 1, false, false, PersistedView, true
         +- Project [1 AS 1#7]
            +- OneRowRelation
```

### Why are the changes needed?

To handle empty output of analyzed plan and remove the unneeded empty line.

### Does this PR introduce _any_ user-facing change?

Yes, now the EXPLAIN command for the analyzed plan produces the following without the empty line:
```
== Analyzed Logical Plan ==
CreateViewCommand `default`.`test`, SELECT 1, false, false, PersistedView, true
   +- Project [1 AS 1#7]
      +- OneRowRelation
```

### How was this patch tested?

Added a test.

Closes #32342 from imback82/analyzed_plan_blank_line.

Authored-by: Terry Kim <yuminkim@gmail.com>
Signed-off-by: hyukjinkwon <gurwls223@apache.org>
2021-04-27 11:10:16 +09:00
Shixiong Zhu 0df3b501ae [SPARK-28247][SS][TEST] Fix flaky test "query without test harness" on ContinuousSuite
### What changes were proposed in this pull request?

This is another attempt to fix the flaky test "query without test harness" on ContinuousSuite.

`query without test harness` is flaky because it starts a continuous query with two partitions but assumes they will run at the same speed.

In this test, 0 and 2 will be written to partition 0, 1 and 3 will be written to partition 1. It assumes when we see 3, 2 should be written to the memory sink. But this is not guaranteed. We can add `if (currentValue == 2) Thread.sleep(5000)` at this line b2a2b5d820/sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/continuous/ContinuousRateStreamSource.scala (L135) to reproduce the failure: `Result set Set([0], [1], [3]) are not a superset of Set(0, 1, 2, 3)!`

The fix is changing `waitForRateSourceCommittedValue` to wait until all partitions reach the desired values before stopping the query.

### Why are the changes needed?

Fix a flaky test.

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

Existing tests. Manually verify the reproduction I mentioned above doesn't fail after this change.

Closes #32316 from zsxwing/SPARK-28247-fix.

Authored-by: Shixiong Zhu <zsxwing@gmail.com>
Signed-off-by: Jungtaek Lim <kabhwan.opensource@gmail.com>
2021-04-27 08:07:09 +09:00
beliefer 1b609c7dcf [SPARK-35060][SQL] Group exception messages in sql/types
### What changes were proposed in this pull request?
This PR group exception messages in `sql/catalyst/src/main/scala/org/apache/spark/sql/types`.

### Why are the changes needed?
It will largely help with standardization of error messages and its maintenance.

### Does this PR introduce _any_ user-facing change?
No. Error messages remain unchanged.

### How was this patch tested?
No new tests - pass all original tests to make sure it doesn't break any existing behavior.

Closes #32244 from beliefer/SPARK-35060.

Lead-authored-by: beliefer <beliefer@163.com>
Co-authored-by: gengjiaan <gengjiaan@360.cn>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2021-04-26 16:44:51 +00:00
Angerszhuuuu f0090463a8 [SPARK-33985][SQL][TESTS] Add query test of combine usage of TRANSFORM and CLUSTER BY/ORDER BY
### What changes were proposed in this pull request?
Under hive's document  https://cwiki.apache.org/confluence/display/Hive/LanguageManual+Transform there are many usage about  TRANSFORM and CLUSTER BY/ORDER BY, in this pr add some test about this cases.

### Why are the changes needed?
Add UT

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
Added UT

Closes #32333 from AngersZhuuuu/SPARK-33985.

Authored-by: Angerszhuuuu <angers.zhu@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2021-04-26 16:42:07 +00:00
Liang-Chi Hsieh c59988aa79 [SPARK-34638][SQL] Single field nested column prune on generator output
### What changes were proposed in this pull request?

This patch proposes an improvement on nested column pruning if the pruning target is generator's output. Previously we disallow such case. This patch allows to prune on it if there is only one single nested column is accessed after `Generate`.

E.g., `df.select(explode($"items").as('item)).select($"item.itemId")`. As we only need `itemId` from `item`, we can prune other fields out and only keep `itemId`.

In this patch, we only address explode-like generators. We will address other generators in followups.

### Why are the changes needed?

This helps to extend the availability of nested column pruning.

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

Unit test

Closes #31966 from viirya/SPARK-34638.

Authored-by: Liang-Chi Hsieh <viirya@gmail.com>
Signed-off-by: Liang-Chi Hsieh <viirya@gmail.com>
2021-04-26 09:32:22 -07:00
Liang-Chi Hsieh bdac19184a [SPARK-35230][SQL] Move custom metric classes to proper package
### What changes were proposed in this pull request?

This patch moves DS v2 custom metric classes to `org.apache.spark.sql.connector.metric` package. Moving `CustomAvgMetric` and `CustomSumMetric` to above package and make them as public java abstract class too.

### Why are the changes needed?

`CustomAvgMetric` and `CustomSumMetric`  should be public APIs for developers to extend. As there are a few metric classes, we should put them together in one package.

### Does this PR introduce _any_ user-facing change?

No, dev only and they are not released yet.

### How was this patch tested?

Unit tests.

Closes #32348 from viirya/move-custom-metric-classes.

Authored-by: Liang-Chi Hsieh <viirya@gmail.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2021-04-26 07:19:36 -07:00
beliefer c0a3c0cbbe [SPARK-35088][SQL] Accept ANSI intervals by the Sequence expression
### What changes were proposed in this pull request?
This PR makes `Sequence` expression supports ANSI intervals as step expression.
If the start and stop expression is `TimestampType,` then the step expression could select year-month or day-time interval.
If the start and stop expression is `DateType,` then the step expression must be year-month.

### Why are the changes needed?
Extends the function of `Sequence` expression.

### Does this PR introduce _any_ user-facing change?
'Yes'. Users could use ANSI intervals as step expression for `Sequence` expression.

### How was this patch tested?
New tests.

Closes #32311 from beliefer/SPARK-35088.

Lead-authored-by: beliefer <beliefer@163.com>
Co-authored-by: gengjiaan <gengjiaan@360.cn>
Signed-off-by: Max Gekk <max.gekk@gmail.com>
2021-04-26 10:05:57 +03:00
Adam Binford 74afc68e21 [SPARK-35213][SQL] Keep the correct ordering of nested structs in chained withField operations
### What changes were proposed in this pull request?

Modifies the UpdateFields optimizer to fix correctness issues with certain nested and chained withField operations. Examples for recreating the issue are in the new unit tests as well as the JIRA issue.

### Why are the changes needed?

Certain withField patterns can cause Exceptions or even incorrect results. It appears to be a result of the additional UpdateFields optimization added in https://github.com/apache/spark/pull/29812. It traverses fieldOps in reverse order to take the last one per field, but this can cause nested structs to change order which leads to mismatches between the schema and the actual data. This updates the optimization to maintain the initial ordering of nested structs to match the generated schema.

### Does this PR introduce _any_ user-facing change?

It fixes exceptions and incorrect results for valid uses in the latest Spark release.

### How was this patch tested?

Added new unit tests for these edge cases.

Closes #32338 from Kimahriman/bug/optimize-with-fields.

Authored-by: Adam Binford <adamq43@gmail.com>
Signed-off-by: Liang-Chi Hsieh <viirya@gmail.com>
2021-04-25 23:39:56 -07:00
Max Gekk d572a85989 [SPARK-35224][SQL][TESTS] Fix buffer overflow in MutableProjectionSuite
### What changes were proposed in this pull request?
In the test `"unsafe buffer with NO_CODEGEN"` of `MutableProjectionSuite`, fix unsafe buffer size calculation to be able to place all input fields without buffer overflow + meta-data.

### Why are the changes needed?
To make the test suite `MutableProjectionSuite` more stable.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
By running the affected test suite:
```
$ build/sbt "test:testOnly *MutableProjectionSuite"
```

Closes #32339 from MaxGekk/fix-buffer-overflow-MutableProjectionSuite.

Authored-by: Max Gekk <max.gekk@gmail.com>
Signed-off-by: Max Gekk <max.gekk@gmail.com>
2021-04-26 09:35:24 +03:00
Angerszhuuuu 6f782efb04 [SPARK-35220][SQL] DayTimeIntervalType/YearMonthIntervalType show different between Hive SerDe and row format delimited
### What changes were proposed in this pull request?
DayTimeIntervalType/YearMonthIntervalString show different between Hive SerDe and row format delimited.
Create this pr to add a test and  have disscuss.

For this problem I think we have two direction:

1. leave it as current and add a item t explain this  in migration guide docs.
2. Since we should not change hive serde's behavior, so we can cast spark row format delimited's behavior to use cast  DayTimeIntervalType/YearMonthIntervalType as HIVE_STYLE

### Why are the changes needed?
Add UT

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
added ut

Closes #32335 from AngersZhuuuu/SPARK-35220.

Authored-by: Angerszhuuuu <angers.zhu@gmail.com>
Signed-off-by: hyukjinkwon <gurwls223@apache.org>
2021-04-26 11:26:32 +09:00
Kent Yao 5b1353f690 [SPARK-35168][SQL] mapred.reduce.tasks should be shuffle.partitions not adaptive.coalescePartitions.initialPartitionNum
### What changes were proposed in this pull request?

```sql
spark-sql> set spark.sql.adaptive.coalescePartitions.initialPartitionNum=1;
spark.sql.adaptive.coalescePartitions.initialPartitionNum	1
Time taken: 2.18 seconds, Fetched 1 row(s)
spark-sql> set mapred.reduce.tasks;
21/04/21 14:27:11 WARN SetCommand: Property mapred.reduce.tasks is deprecated, showing spark.sql.shuffle.partitions instead.
spark.sql.shuffle.partitions	1
Time taken: 0.03 seconds, Fetched 1 row(s)
spark-sql> set spark.sql.shuffle.partitions;
spark.sql.shuffle.partitions	200
Time taken: 0.024 seconds, Fetched 1 row(s)
spark-sql> set mapred.reduce.tasks=2;
21/04/21 14:31:52 WARN SetCommand: Property mapred.reduce.tasks is deprecated, automatically converted to spark.sql.shuffle.partitions instead.
spark.sql.shuffle.partitions	2
Time taken: 0.017 seconds, Fetched 1 row(s)
spark-sql> set mapred.reduce.tasks;
21/04/21 14:31:55 WARN SetCommand: Property mapred.reduce.tasks is deprecated, showing spark.sql.shuffle.partitions instead.
spark.sql.shuffle.partitions	1
Time taken: 0.017 seconds, Fetched 1 row(s)
spark-sql>
```

`mapred.reduce.tasks` is mapping to `spark.sql.shuffle.partitions` at write-side, but `spark.sql.adaptive.coalescePartitions.initialPartitionNum` might take precede of `spark.sql.shuffle.partitions`

### Why are the changes needed?

roundtrip for `mapred.reduce.tasks`

### Does this PR introduce _any_ user-facing change?

yes, `mapred.reduce.tasks` will always report `spark.sql.shuffle.partitions` whether `spark.sql.adaptive.coalescePartitions.initialPartitionNum` exists or not.

### How was this patch tested?

a new test

Closes #32265 from yaooqinn/SPARK-35168.

Authored-by: Kent Yao <yao@apache.org>
Signed-off-by: Kent Yao <yao@apache.org>
2021-04-25 20:27:12 +08:00
Maya Anderson 166cc6204c [SPARK-34990][SQL][TESTS] Add ParquetEncryptionSuite
### What changes were proposed in this pull request?

A simple test that writes and reads an encrypted parquet and verifies that it's encrypted by checking its magic string (in encrypted footer mode).

### Why are the changes needed?

To provide a test coverage for Parquet encryption.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

- [x] [SBT / Hadoop 3.2 / Java8 (the default)](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/137785/testReport)
- [ ] ~SBT / Hadoop 3.2 / Java11 by adding [test-java11] to the PR title.~ (Jenkins Java11 build is broken due to missing JDK11 installation)
- [x] [SBT / Hadoop 2.7 / Java8 by adding [test-hadoop2.7] to the PR title.](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/137836/testReport)
- [x] Maven / Hadoop 3.2 / Java8 by adding [test-maven] to the PR title.
- [x] Maven / Hadoop 2.7 / Java8 by adding [test-maven][test-hadoop2.7] to the PR title.

Closes #32146 from andersonm-ibm/pme_testing.

Authored-by: Maya Anderson <mayaa@il.ibm.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2021-04-24 14:28:00 -07:00
Liang-Chi Hsieh b2a2b5d820 [SPARK-34297][SQL][SS] Add metrics for data loss and offset out range for KafkaMicroBatchStream
### What changes were proposed in this pull request?

This patch proposes to add a couple of metrics in scan node for Kafka batch streaming query.

### Why are the changes needed?

When testing SS, I found it is hard to track data loss of SS reading from Kafka. The micro batch scan node has only one metric, number of output rows. Users have no idea how many offsets to fetch are out of Kafka, how many times data loss happens. These metrics are important for users to know the quality of SS query running.

### Does this PR introduce _any_ user-facing change?

Yes, adding two metrics to micro batch scan node for Kafka batch streaming.

### How was this patch tested?

Currently I tested on internal cluster with Kafka:

<img width="1193" alt="Screen Shot 2021-04-22 at 7 16 29 PM" src="https://user-images.githubusercontent.com/68855/115808460-61bf8100-a39f-11eb-99a9-65d22c3f5fb0.png">

I was trying to add unit test. But as our batch streaming query disallows to specify ending offsets. If I only specify an out-of-range starting offset, when we get offset range in `getRanges`,  any negative size range will be filtered out. So it cannot actually test the case of fetched non-existing offset.

Closes #31398 from viirya/micro-batch-metrics.

Authored-by: Liang-Chi Hsieh <viirya@gmail.com>
Signed-off-by: Liang-Chi Hsieh <viirya@gmail.com>
2021-04-23 13:56:53 -07:00
Wenchen Fan a9345a0591 [SPARK-35204][SQL] CatalystTypeConverters of date/timestamp should accept both the old and new Java time classes
### What changes were proposed in this pull request?

`CatalystTypeConverters` is useful when the type of the input data classes are not known statically (otherwise we can use `ExpressionEncoder`). However, the current `CatalystTypeConverters` requires you to know the datetime data class statically, which makes it hard to use.

This PR improves the `CatalystTypeConverters` for date/timestamp, to support the old and new Java time classes at the same time.

### Why are the changes needed?

Make `CatalystTypeConverters` easier to use.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

new test

Closes #32312 from cloud-fan/minor.

Authored-by: Wenchen Fan <wenchen@databricks.com>
Signed-off-by: Max Gekk <max.gekk@gmail.com>
2021-04-23 19:21:43 +03:00
Angerszhuuuu e503b9c462 [SPARK-35201][SQL] Format empty grouping set exception in CUBE/ROLLUP
### What changes were proposed in this pull request?
Format empty grouping set exception in CUBE/ROLLUP

### Why are the changes needed?
Format empty grouping set exception in CUBE/ROLLUP

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
Not need

Closes #32307 from AngersZhuuuu/SPARK-35201.

Authored-by: Angerszhuuuu <angers.zhu@gmail.com>
Signed-off-by: Takeshi Yamamuro <yamamuro@apache.org>
2021-04-23 21:23:28 +09:00
Yingyi Bu 9af338cd68 [SPARK-35078][SQL] Add tree traversal pruning in expression rules
### What changes were proposed in this pull request?

Added the following TreePattern enums:
- AND_OR
- BINARY_ARITHMETIC
- BINARY_COMPARISON
- CASE_WHEN
- CAST
- CONCAT
- COUNT
- IF
- LIKE_FAMLIY
- NOT
- NULL_CHECK
- UNARY_POSITIVE
- UPPER_OR_LOWER

Used them in the following rules:
- ConstantPropagation
- ReorderAssociativeOperator
- BooleanSimplification
- SimplifyBinaryComparison
- SimplifyCaseConversionExpressions
- SimplifyConditionals
- PushFoldableIntoBranches
- LikeSimplification
- NullPropagation
- SimplifyCasts
- RemoveDispensableExpressions
- CombineConcats

### Why are the changes needed?

Reduce the number of tree traversals and hence improve the query compilation latency.

### How was this patch tested?

Existing tests.

Closes #32280 from sigmod/expression.

Authored-by: Yingyi Bu <yingyi.bu@databricks.com>
Signed-off-by: Gengliang Wang <ltnwgl@gmail.com>
2021-04-23 16:33:58 +08:00
Wenchen Fan fdccd88c2a Revert "[SPARK-34581][SQL] Don't optimize out grouping expressions from aggregate expressions without aggregate function"
This reverts commit c8d78a70b4.
2021-04-23 15:55:30 +08:00
Cheng Su cab205e9e4 [SPARK-35141][SQL] Support two level of hash maps for final hash aggregation
### What changes were proposed in this pull request?

For partial hash aggregation (code-gen path), we have two level of hash map for aggregation. First level is from `RowBasedHashMapGenerator`, which is computation faster compared to the second level from `UnsafeFixedWidthAggregationMap`. The introducing of two level hash map can help improve CPU performance of query as the first level hash map normally fits in hardware cache and has cheaper hash function for key lookup.

For final hash aggregation, we can also support two level of hash map, to improve query performance further.
The original two level of hash map code works for final aggregation mostly out of box. The major change here is to support testing fall back of final aggregation (see change related to `bitMaxCapacity` and `checkFallbackForGeneratedHashMap`).

Example:

An aggregation query:

```
spark.sql(
  """
    |SELECT key, avg(value)
    |FROM agg1
    |GROUP BY key
  """.stripMargin)
```

The generated code for final aggregation is [here](https://gist.github.com/c21/20c10cc8e2c7e561aafbe9b8da055242).

An aggregation query with testing fallback:
```
withSQLConf("spark.sql.TungstenAggregate.testFallbackStartsAt" -> "2, 3") {
  spark.sql(
    """
      |SELECT key, avg(value)
      |FROM agg1
      |GROUP BY key
    """.stripMargin)
}
```
The generated code for final aggregation is [here](https://gist.github.com/c21/dabf176cbc18a5e2138bc0a29e81c878). Note the no more counter condition for first level fast map.

### Why are the changes needed?

Improve the CPU performance of hash aggregation query in general.

For `AggregateBenchmark."Aggregate w multiple keys"`, seeing query performance improved by 10%.
`codegen = T` means whole stage code-gen is enabled.
`hashmap = T` means two level maps is enabled for partial aggregation.
`finalhashmap = T` means two level maps is enabled for final aggregation.

```
Running benchmark: Aggregate w multiple keys
  Running case: codegen = F
  Stopped after 2 iterations, 8284 ms
  Running case: codegen = T hashmap = F
  Stopped after 2 iterations, 5424 ms
  Running case: codegen = T hashmap = T finalhashmap = F
  Stopped after 2 iterations, 4753 ms
  Running case: codegen = T hashmap = T finalhashmap = T
  Stopped after 2 iterations, 4508 ms

Java HotSpot(TM) 64-Bit Server VM 1.8.0_181-b13 on Mac OS X 10.15.7
Intel(R) Core(TM) i9-9980HK CPU  2.40GHz
Aggregate w multiple keys:                Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
------------------------------------------------------------------------------------------------------------------------
codegen = F                                        3881           4142         370          5.4         185.1       1.0X
codegen = T hashmap = F                            2701           2712          16          7.8         128.8       1.4X
codegen = T hashmap = T finalhashmap = F           2363           2377          19          8.9         112.7       1.6X
codegen = T hashmap = T finalhashmap = T           2252           2254           3          9.3         107.4       1.7X
```

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Existing unit test in `HashAggregationQuerySuite` and `HashAggregationQueryWithControlledFallbackSuite` already cover the test.

Closes #32242 from c21/agg.

Authored-by: Cheng Su <chengsu@fb.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2021-04-23 05:24:19 +00:00
Chao Sun 86238d0e88 [SPARK-35195][SQL][TEST] Move InMemoryTable etc to org.apache.spark.sql.connector.catalog
### What changes were proposed in this pull request?

Move the following classes:
- `InMemoryAtomicPartitionTable`
- `InMemoryPartitionTable`
- `InMemoryPartitionTableCatalog`
- `InMemoryTable`
- `InMemoryTableCatalog`
- `StagingInMemoryTableCatalog`

from `org.apache.spark.sql.connector` to `org.apache.spark.sql.connector.catalog`.

### Why are the changes needed?

These classes implement catalog related interfaces but reside in `org.apache.spark.sql.connector`. A more suitable place should be `org.apache.spark.sql.connector.catalog`.

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

N/A

Closes #32302 from sunchao/SPARK-35195.

Authored-by: Chao Sun <sunchao@apple.com>
Signed-off-by: Liang-Chi Hsieh <viirya@gmail.com>
2021-04-22 21:56:45 -07:00
Yingyi Bu 47f86875f7 [SPARK-35075][SQL] Add traversal pruning for subquery related rules
### What changes were proposed in this pull request?

Added the following TreePattern enums:
- DYNAMIC_PRUNING_SUBQUERY
- EXISTS_SUBQUERY
- IN_SUBQUERY
- LIST_SUBQUERY
- PLAN_EXPRESSION
- SCALAR_SUBQUERY
- FILTER

Used them in the following rules:
- ResolveSubquery
- UpdateOuterReferences
- OptimizeSubqueries
- RewritePredicateSubquery
- PullupCorrelatedPredicates
- RewriteCorrelatedScalarSubquery (not the rule itself but an internal transform call, the full support is in SPARK-35148)
- InsertAdaptiveSparkPlan
- PlanAdaptiveSubqueries

### Why are the changes needed?

Reduce the number of tree traversals and hence improve the query compilation latency.

### How was this patch tested?

Existing tests.

Closes #32247 from sigmod/subquery.

Authored-by: Yingyi Bu <yingyi.bu@databricks.com>
Signed-off-by: Gengliang Wang <ltnwgl@gmail.com>
2021-04-23 12:42:55 +08:00
Angerszhuuuu 04e2305a9b [SPARK-35187][SQL] Fix failure on the minimal interval literal
### What changes were proposed in this pull request?
If the sign '-' inside of interval string, everything is fine after bb5459fb26:
```
spark-sql> SELECT INTERVAL '-178956970-8' YEAR TO MONTH;
-178956970-8
```
but the sign outside of interval string is not handled properly:
```
spark-sql> SELECT INTERVAL -'178956970-8' YEAR TO MONTH;
Error in query:
Error parsing interval year-month string: integer overflow(line 1, pos 16)

== SQL ==
SELECT INTERVAL -'178956970-8' YEAR TO MONTH
----------------^^^
```
This pr fix this issue

### Why are the changes needed?
Fix bug

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
Added UT

Closes #32296 from AngersZhuuuu/SPARK-35187.

Authored-by: Angerszhuuuu <angers.zhu@gmail.com>
Signed-off-by: Max Gekk <max.gekk@gmail.com>
2021-04-22 21:58:57 +03:00
beliefer 6c587d2627 [SPARK-35110][SQL] Handle ANSI intervals in WindowExecBase
### What changes were proposed in this pull request?
This PR makes window frame could support `YearMonthIntervalType` and `DayTimeIntervalType`.

### Why are the changes needed?
Extend the function of window frame

### Does this PR introduce _any_ user-facing change?
Yes. Users could use `YearMonthIntervalType` or `DayTimeIntervalType` as the sort expression for window frame.

### How was this patch tested?
New tests

Closes #32294 from beliefer/SPARK-35110.

Authored-by: beliefer <beliefer@163.com>
Signed-off-by: Max Gekk <max.gekk@gmail.com>
2021-04-22 17:14:11 +03:00
Yingyi Bu 7f7a3d80a6 [SPARK-35183][SQL] Use transformAllExpressions in CombineConcats
### What changes were proposed in this pull request?

Use transformAllExpressions instead of transformExpressionsDown in CombineConcats. The latter only transforms the root plan node.

### Why are the changes needed?

It allows CombineConcats to cover more cases where `concat` are not in the root plan node.

### How was this patch tested?

Unit test. The updated tests would fail without the code change.

Closes #32290 from sigmod/concat.

Authored-by: Yingyi Bu <yingyi.bu@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2021-04-22 13:10:39 +00:00
Angerszhuuuu b22d54a58a [SPARK-35026][SQL] Support nested CUBE/ROLLUP/GROUPING SETS in GROUPING SETS
### What changes were proposed in this pull request?
PG and Oracle both support use CUBE/ROLLUP/GROUPING SETS in GROUPING SETS's grouping set as a sugar syntax.
![image](https://user-images.githubusercontent.com/46485123/114975588-139a1180-9eb7-11eb-8f53-498c1db934e0.png)

In this PR, we support it in Spark SQL too

### Why are the changes needed?
Keep consistent with PG and oracle

### Does this PR introduce _any_ user-facing change?
User can write grouping analytics like
```
SELECT a, b, count(1) FROM testData GROUP BY a, GROUPING SETS(ROLLUP(a, b));
SELECT a, b, count(1) FROM testData GROUP BY a, GROUPING SETS((a, b), (a), ());
SELECT a, b, count(1) FROM testData GROUP BY a, GROUPING SETS(GROUPING SETS((a, b), (a), ()));
```

### How was this patch tested?
Added Test

Closes #32201 from AngersZhuuuu/SPARK-35026.

Authored-by: Angerszhuuuu <angers.zhu@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2021-04-22 13:08:22 +00:00
Angerszhuuuu bb5459fb26 [SPARK-35177][SQL] Fix arithmetic overflow in parsing the minimal interval by IntervalUtils.fromYearMonthString
### What changes were proposed in this pull request?
IntervalUtils.fromYearMonthString should handle Int.MinValue months correctly.
In current logic, just use `Math.addExact(Math.multiplyExact(years, 12), months)` to calculate  negative total months will overflow when actual total months is Int.MinValue, this pr fixes this bug.

### Why are the changes needed?
IntervalUtils.fromYearMonthString should handle Int.MinValue months correctly

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
Added UT

Closes #32281 from AngersZhuuuu/SPARK-35177.

Authored-by: Angerszhuuuu <angers.zhu@gmail.com>
Signed-off-by: Max Gekk <max.gekk@gmail.com>
2021-04-22 11:59:38 +03:00
Liang-Chi Hsieh 548e66c98a [SPARK-34692][SQL][FOLLOWUP] Add INSET to ReplaceNullWithFalseInPredicate's pattern
### What changes were proposed in this pull request?

The test added by #31797 has the [failure](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/137741/testReport/org.apache.spark.sql.catalyst.optimizer/ReplaceNullWithFalseInPredicateSuite/SPARK_34692__Support_Not_Int__and_Not_InSet__propagate_null/). This is a followup to fix it.

### Why are the changes needed?

Due to #32157, the rule `ReplaceNullWithFalseInPredicate` will check tree pattern before actually doing transformation. As `null` in `INSET` is not `NULL_LITERAL` pattern, we miss it and fail the newly added `not inset ...` check in `replaceNullWithFalse`.

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

Existing unit tests.

Closes #32278 from viirya/SPARK-34692-followup.

Authored-by: Liang-Chi Hsieh <viirya@gmail.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2021-04-21 19:06:16 -07:00
Yuming Wang e609395913 [SPARK-34897][SQL] Support reconcile schemas based on index after nested column pruning
### What changes were proposed in this pull request?

It will remove `StructField` when [pruning nested columns](0f2c0b53e8/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/SchemaPruning.scala (L28-L42)). For example:
```scala
spark.sql(
  """
    |CREATE TABLE t1 (
    |  _col0 INT,
    |  _col1 STRING,
    |  _col2 STRUCT<c1: STRING, c2: STRING, c3: STRING, c4: BIGINT>)
    |USING ORC
    |""".stripMargin)

spark.sql("INSERT INTO t1 values(1, '2', struct('a', 'b', 'c', 10L))")

spark.sql("SELECT _col0, _col2.c1 FROM t1").show
```

Before this pr. The returned schema is: ``` `_col0` INT,`_col2` STRUCT<`c1`: STRING> ``` add it will throw exception:
```
java.lang.AssertionError: assertion failed: The given data schema struct<_col0:int,_col2:struct<c1:string>> has less fields than the actual ORC physical schema, no idea which columns were dropped, fail to read.
	at scala.Predef$.assert(Predef.scala:223)
	at org.apache.spark.sql.execution.datasources.orc.OrcUtils$.requestedColumnIds(OrcUtils.scala:160)
```

After this pr. The returned schema is: ``` `_col0` INT,`_col1` STRING,`_col2` STRUCT<`c1`: STRING> ```.

The finally schema is ``` `_col0` INT,`_col2` STRUCT<`c1`: STRING> ``` after the complete column pruning:
7a5647a93a/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/FileSourceStrategy.scala (L208-L213)

e64eb75aed/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/PushDownUtils.scala (L96-L97)

### Why are the changes needed?

Fix bug.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Unit test.

Closes #31993 from wangyum/SPARK-34897.

Authored-by: Yuming Wang <yumwang@ebay.com>
Signed-off-by: Liang-Chi Hsieh <viirya@gmail.com>
2021-04-21 10:23:38 -07:00
ulysses-you 81dbaedede [SPARK-34692][SQL] Support Not(Int) and Not(InSet) propagate null in predicate
### What changes were proposed in this pull request?

* Add `Not(In)` and `Not(InSet)` check in `NullPropagation` rule.
* Add more test for `In` and `Not(In)` in `Project` level.

### Why are the changes needed?

The semantics of `Not(In)` could be seen like `And(a != b, a != c)` that match the `NullIntolerant`.

As we already simplify the `NullIntolerant` expression to null if it's children have null. E.g. `a != null` => `null`. It's safe to do this with `Not(In)`/`Not(InSet)`.

Note that, we can only do the simplify in predicate which `ReplaceNullWithFalseInPredicate`  rule do.

Let's say we have two sqls:
```
select 1 not in (2, null);
select 1 where 1 not in (2, null);
```
The first sql we cannot optimize since it would return `NULL` instead of `false`. The second is postive.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Add test.

Closes #31797 from ulysses-you/SPARK-34692.

Authored-by: ulysses-you <ulyssesyou18@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2021-04-21 15:37:02 +00:00
Gengliang Wang 43ad939a7e [SPARK-35152][SQL] ANSI mode: IntegralDivide throws exception on overflow
### What changes were proposed in this pull request?

IntegralDivide should throw an exception on overflow in ANSI mode.
There is only one case that can cause that:
```
Long.MinValue div -1
```

### Why are the changes needed?

ANSI compliance

### Does this PR introduce _any_ user-facing change?

Yes, IntegralDivide throws an exception on overflow in ANSI mode

### How was this patch tested?

Unit test

Closes #32260 from gengliangwang/integralDiv.

Authored-by: Gengliang Wang <ltnwgl@gmail.com>
Signed-off-by: Gengliang Wang <ltnwgl@gmail.com>
2021-04-21 16:08:20 +08:00
sandeep.katta 4f309cec07 [SPARK-35096][SQL] SchemaPruning should adhere spark.sql.caseSensitive config
### What changes were proposed in this pull request?

As a part of the SPARK-26837 pruning of nested fields from object serializers are supported. But it is missed to handle case insensitivity nature of spark

In this PR I have resolved the column names to be pruned based on `spark.sql.caseSensitive ` config
**Exception Before Fix**

```
Caused by: java.lang.ArrayIndexOutOfBoundsException: 0
  at org.apache.spark.sql.types.StructType.apply(StructType.scala:414)
  at org.apache.spark.sql.catalyst.optimizer.ObjectSerializerPruning$$anonfun$apply$4.$anonfun$applyOrElse$3(objects.scala:216)
  at scala.collection.TraversableLike.$anonfun$map$1(TraversableLike.scala:238)
  at scala.collection.immutable.List.foreach(List.scala:392)
  at scala.collection.TraversableLike.map(TraversableLike.scala:238)
  at scala.collection.TraversableLike.map$(TraversableLike.scala:231)
  at scala.collection.immutable.List.map(List.scala:298)
  at org.apache.spark.sql.catalyst.optimizer.ObjectSerializerPruning$$anonfun$apply$4.applyOrElse(objects.scala:215)
  at org.apache.spark.sql.catalyst.optimizer.ObjectSerializerPruning$$anonfun$apply$4.applyOrElse(objects.scala:203)
  at org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$transformDown$1(TreeNode.scala:309)
  at org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:72)
  at org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:309)
  at
```

### Why are the changes needed?
After Upgrade to Spark 3 `foreachBatch` API throws` java.lang.ArrayIndexOutOfBoundsException`. This issue will be fixed using this PR

### Does this PR introduce _any_ user-facing change?
No, Infact fixes the regression

### How was this patch tested?
Added tests and also tested verified manually

Closes #32194 from sandeep-katta/SPARK-35096.

Authored-by: sandeep.katta <sandeep.katta2007@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2021-04-21 15:16:17 +08:00
Angerszhuuuu d259f9301a [SPARK-35113][SQL] Support ANSI intervals in the Hash expression
### What changes were proposed in this pull request?
Support ANSI interval in HashExpression and add UT

### Why are the changes needed?
Support ANSI interval in HashExpression

### Does this PR introduce _any_ user-facing change?
User can pass ANSI interval in HashExpression function

### How was this patch tested?
Added UT

Closes #32259 from AngersZhuuuu/SPARK-35113.

Authored-by: Angerszhuuuu <angers.zhu@gmail.com>
Signed-off-by: Max Gekk <max.gekk@gmail.com>
2021-04-21 07:44:44 +03:00
Kent Yao 81c3cc2312 [SPARK-35044][SQL][FOLLOWUP][TEST-HADOOP2.7] Fix hadoop 2.7 test due to diff between hadoop 2.7 and hadoop 3
### What changes were proposed in this pull request?

dfs.replication is inconsistent from hadoop 2.x to 3.x, so in this PR we use `dfs.hosts` to verify per https://github.com/apache/spark/pull/32144#discussion_r616833099

```
== Results ==
!== Correct Answer - 1 ==        == Spark Answer - 1 ==
!struct<>                        struct<key:string,value:string>
![dfs.replication,<undefined>]   [dfs.replication,3]
```

### Why are the changes needed?

fix Jenkins job with Hadoop 2.7

### Does this PR introduce _any_ user-facing change?

test only change
### How was this patch tested?

test only change

Closes #32263 from yaooqinn/SPARK-35044-F.

Authored-by: Kent Yao <yao@apache.org>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2021-04-21 09:10:05 +09:00
Max Gekk e8d6992b66 [SPARK-35153][SQL] Make textual representation of ANSI interval operators more readable
### What changes were proposed in this pull request?
In the PR, I propose to override the `sql` and `toString` methods of the expressions that implement operators over ANSI intervals (`YearMonthIntervalType`/`DayTimeIntervalType`), and replace internal expression class names by operators like `*`, `/` and `-`.

### Why are the changes needed?
Proposed methods should make the textual representation of such operators more readable, and potentially parsable by Spark SQL parser.

### Does this PR introduce _any_ user-facing change?
Yes. This can influence on column names.

### How was this patch tested?
By running existing test suites for interval and datetime expressions, and re-generating the `*.sql` tests:
```
$ build/sbt "sql/testOnly *SQLQueryTestSuite -- -z interval.sql"
$ build/sbt "sql/testOnly *SQLQueryTestSuite -- -z datetime.sql"
```

Closes #32262 from MaxGekk/interval-operator-sql.

Authored-by: Max Gekk <max.gekk@gmail.com>
Signed-off-by: Max Gekk <max.gekk@gmail.com>
2021-04-20 23:13:40 +03:00
Wenchen Fan e08c40fa3f [SPARK-35145][SQL] CurrentOrigin should support nested invoking
### What changes were proposed in this pull request?

`CurrentOrigin` is a thread-local variable to track the original SQL line position in plan/expression. Usually, we set `CurrentOrigin`, create `TreeNode` instances, and reset `CurrentOrigin`.

This PR updates the last step to set `CurrentOrigin` to its previous value, instead of resetting it. This is necessary when we invoke `CurrentOrigin` in a nested way, like with subqueries.

### Why are the changes needed?

To keep the original SQL line position in the error message in more cases.

### Does this PR introduce _any_ user-facing change?

No, only minor error message changes.

### How was this patch tested?

existing tests

Closes #32249 from cloud-fan/origin.

Authored-by: Wenchen Fan <wenchen@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2021-04-20 15:23:41 +00:00
Liang-Chi Hsieh eb9a4390da [SPARK-34338][SQL] Report metrics from Datasource v2 scan
### What changes were proposed in this pull request?

This patch proposes to leverage `CustomMetric`, `CustomTaskMetric` API to report custom metrics from DS v2 scan to Spark.

### Why are the changes needed?

This is related to #31398. In SPARK-34297, we want to add a couple of metrics when reading from Kafka in SS. We need some public API change in DS v2 to make it possible. This extracts only DS v2 change and make it general for DS v2 instead of micro-batch DS v2 API.

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

Unit test.

Implement a simple test DS v2 class locally and run it:

```scala
scala> import org.apache.spark.sql.execution.datasources.v2._
import org.apache.spark.sql.execution.datasources.v2._

scala> classOf[CustomMetricDataSourceV2].getName
res0: String = org.apache.spark.sql.execution.datasources.v2.CustomMetricDataSourceV2

scala> val df = spark.read.format(res0).load()
df: org.apache.spark.sql.DataFrame = [i: int, j: int]

scala> df.collect
```

<img width="703" alt="Screen Shot 2021-03-30 at 11 07 13 PM" src="https://user-images.githubusercontent.com/68855/113098080-d8a49800-91ac-11eb-8681-be408a0f2e69.png">

Closes #31451 from viirya/dsv2-metrics.

Authored-by: Liang-Chi Hsieh <viirya@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2021-04-20 15:01:44 +00:00
Angerszhuuuu 361444890e [SPARK-34035][SQL] Refactor ScriptTransformation to remove input parameter and replace it by child.output
### What changes were proposed in this pull request?
Refactor ScriptTransformation to remove input parameter and replace it by child.output

### Why are the changes needed?
refactor code

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
Existed UT

Closes #32228 from AngersZhuuuu/SPARK-34035.

Lead-authored-by: Angerszhuuuu <angers.zhu@gmail.com>
Co-authored-by: AngersZhuuuu <angers.zhu@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2021-04-20 14:52:21 +00:00
Angerszhuuuu b219e37af3 [SPARK-35068][SQL] Add tests for ANSI intervals to HiveThriftBinaryServerSuite
### What changes were proposed in this pull request?
After the PR https://github.com/apache/spark/pull/32209, this should be possible now.
We can add test case for ANSI intervals to HiveThriftBinaryServerSuite

### Why are the changes needed?
Add more test case

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
Added UT

Closes #32250 from AngersZhuuuu/SPARK-35068.

Authored-by: Angerszhuuuu <angers.zhu@gmail.com>
Signed-off-by: Max Gekk <max.gekk@gmail.com>
2021-04-20 13:17:59 +03:00
allisonwang-db b6bb24ca1b [SPARK-34974][SQL] Improve subquery decorrelation framework
### What changes were proposed in this pull request?
This PR implements the decorrelation technique in the paper "Unnesting Arbitrary Queries" by T. Neumann; A. Kemper
(http://www.btw-2015.de/res/proceedings/Hauptband/Wiss/Neumann-Unnesting_Arbitrary_Querie.pdf). It currently supports Filter, Project, Aggregate, Join, and UnaryNode that passes CheckAnalysis.

This feature can be controlled by the config `spark.sql.optimizer.decorrelateInnerQuery.enabled` (default: true).

A few notes:
1. This PR does not relax any constraints in CheckAnalysis for correlated subqueries, even though some cases can be supported by this new framework, such as aggregate with correlated non-equality predicates. This PR focuses on adding the new framework and making sure all existing cases can be supported. Constraints can be relaxed gradually in the future via separate PRs.
2. The new framework is only enabled for correlated scalar subqueries, as the first step. EXISTS/IN subqueries can be supported in the future.

### Why are the changes needed?
Currently, Spark has limited support for correlated subqueries. It only allows `Filter` to reference outer query columns and does not support non-equality predicates when the subquery is aggregated. This new framework will allow more operators to host outer column references and support correlated non-equality predicates and more types of operators in correlated subqueries.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
Existing unit and SQL query tests and new optimizer plan tests.

Closes #32072 from allisonwang-db/spark-34974-decorrelation.

Authored-by: allisonwang-db <66282705+allisonwang-db@users.noreply.github.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2021-04-20 09:22:22 +00:00
Max Gekk aa0d00de5e [SPARK-35018][SQL][TESTS] Check transferring of year-month intervals via Hive Thrift server
### What changes were proposed in this pull request?
1. Add a test to check that Thrift server is able to collect year-month intervals and transfer them via thrift protocol.
2. Improve similar test for day-time intervals. After the changes, the test doesn't depend on the result of date subtractions. In the future, the type of date subtract can be changed. So, current PR should make the test tolerant to the changes.

### Why are the changes needed?
To improve test coverage.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
By running the modified test suite:
```
$ ./build/sbt -Phive -Phive-thriftserver "test:testOnly *SparkThriftServerProtocolVersionsSuite"
```

Closes #32240 from MaxGekk/year-month-interval-thrift-protocol.

Authored-by: Max Gekk <max.gekk@gmail.com>
Signed-off-by: Max Gekk <max.gekk@gmail.com>
2021-04-20 08:52:37 +03:00
Yingyi Bu f4926d1c8b [SPARK-35052][SQL] Use static bits for AttributeReference and Literal
### What changes were proposed in this pull request?

- Share a static ImmutableBitSet for `treePatternBits` in all object instances of AttributeReference.
- Share three static ImmutableBitSets for  `treePatternBits` in three kinds of Literals.
- Add an ImmutableBitSet as a subclass of BitSet.

### Why are the changes needed?

Reduce the additional memory usage caused by `treePatternBits`.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Existing tests.

Closes #32157 from sigmod/leaf.

Authored-by: Yingyi Bu <yingyi.bu@databricks.com>
Signed-off-by: Gengliang Wang <ltnwgl@gmail.com>
2021-04-20 13:13:16 +08:00
allisonwang-db bad4b6f025 [SPARK-35080][SQL] Only allow a subset of correlated equality predicates when a subquery is aggregated
### What changes were proposed in this pull request?
This PR updated the `foundNonEqualCorrelatedPred` logic for correlated subqueries in `CheckAnalysis` to only allow correlated equality predicates that guarantee one-to-one mapping between inner and outer attributes, instead of all equality predicates.

### Why are the changes needed?
To fix correctness bugs. Before this fix Spark can give wrong results for certain correlated subqueries that pass CheckAnalysis:
Example 1:
```sql
create or replace view t1(c) as values ('a'), ('b')
create or replace view t2(c) as values ('ab'), ('abc'), ('bc')

select c, (select count(*) from t2 where t1.c = substring(t2.c, 1, 1)) from t1
```
Correct results: [(a, 2), (b, 1)]
Spark results:
```
+---+-----------------+
|c  |scalarsubquery(c)|
+---+-----------------+
|a  |1                |
|a  |1                |
|b  |1                |
+---+-----------------+
```
Example 2:
```sql
create or replace view t1(a, b) as values (0, 6), (1, 5), (2, 4), (3, 3);
create or replace view t2(c) as values (6);

select c, (select count(*) from t1 where a + b = c) from t2;
```
Correct results: [(6, 4)]
Spark results:
```
+---+-----------------+
|c  |scalarsubquery(c)|
+---+-----------------+
|6  |1                |
|6  |1                |
|6  |1                |
|6  |1                |
+---+-----------------+
```
### Does this PR introduce _any_ user-facing change?
Yes. Users will not be able to run queries that contain unsupported correlated equality predicates.

### How was this patch tested?
Added unit tests.

Closes #32179 from allisonwang-db/spark-35080-subquery-bug.

Lead-authored-by: allisonwang-db <66282705+allisonwang-db@users.noreply.github.com>
Co-authored-by: Wenchen Fan <cloud0fan@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2021-04-20 11:11:40 +08:00
Yingyi Bu 9a6d7730f5 [SPARK-35103][SQL] Make TypeCoercion rules more efficient
## What changes were proposed in this pull request?
This PR fixes a couple of things in TypeCoercion rules:
- Only run the propagate types step if the children of a node have output attributes with changed dataTypes and/or nullability. This is implemented as custom tree transformation. The TypeCoercion rules now only implement a partial function.
- Combine multiple type coercion rules into a single rule. Multiple rules are applied in single tree traversal.
- Reduce calls to conf.get in DecimalPrecision. This now happens once per tree traversal, instead of once per matched expression.
- Reduce the use of withNewChildren.

This brings down the number of CPU cycles spend in analysis by ~28% (benchmark: 10 iterations of all TPC-DS queries on SF10).

## How was this patch tested?
Existing tests.

Closes #32208 from sigmod/coercion.

Authored-by: Yingyi Bu <yingyi.bu@databricks.com>
Signed-off-by: herman <herman@databricks.com>
2021-04-19 21:25:58 +02:00
Cheng Pan 0c2e9b99aa [SPARK-35138][SQL] Remove Antlr4 workaround
### What changes were proposed in this pull request?

Remove Antlr 4.7 workaround.

### Why are the changes needed?

The https://github.com/antlr/antlr4/commit/ac9f7530 has been fixed in upstream, so remove the workaround to simplify code.

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

Existed UTs.

Closes #32238 from pan3793/antlr-minor.

Authored-by: Cheng Pan <379377944@qq.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2021-04-19 10:38:53 -07:00
Kent Yao 2d161cb3a1 [SPARK-35102][SQL] Make spark.sql.hive.version read-only, not deprecated and meaningful
### What changes were proposed in this pull request?

Firstly let's take a look at the definition and comment.

```
// A fake config which is only here for backward compatibility reasons. This config has no effect
// to Spark, just for reporting the builtin Hive version of Spark to existing applications that
// already rely on this config.
val FAKE_HIVE_VERSION = buildConf("spark.sql.hive.version")
  .doc(s"deprecated, please use ${HIVE_METASTORE_VERSION.key} to get the Hive version in Spark.")
  .version("1.1.1")
  .fallbackConf(HIVE_METASTORE_VERSION)
```
It is used for reporting the built-in Hive version but the current status is unsatisfactory, as it is could be changed in many ways e.g. --conf/SET syntax.

It is marked as deprecated but kept a long way until now. I guess it is hard for us to remove it and not even necessary.

On second thought, it's actually good for us to keep it to work with the `spark.sql.hive.metastore.version`. As when `spark.sql.hive.metastore.version` is changed, it could be used to report the compiled hive version statically, it's useful when an error occurs in this case. So this parameter should be fixed to compiled hive version.

### Why are the changes needed?

`spark.sql.hive.version` is useful in certain cases and should be read-only

### Does this PR introduce _any_ user-facing change?

`spark.sql.hive.version` now is read-only

### How was this patch tested?

new test cases

Closes #32200 from yaooqinn/SPARK-35102.

Authored-by: Kent Yao <yao@apache.org>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2021-04-19 14:40:21 +00:00
gengjiaan 7f3403583f [SPARK-34715][SQL][TESTS] Add round trip tests for period <-> month and duration <-> micros
### What changes were proposed in this pull request?
Similarly to the test from the PR https://github.com/apache/spark/pull/31799, add tests:
1. Months -> Period -> Months
2. Period -> Months -> Period
3. Duration -> micros -> Duration

### Why are the changes needed?
Add round trip tests for period <-> month and duration <-> micros

### Does this PR introduce _any_ user-facing change?
'No'. Just test cases.

### How was this patch tested?
Jenkins test

Closes #32234 from beliefer/SPARK-34715.

Authored-by: gengjiaan <gengjiaan@360.cn>
Signed-off-by: Max Gekk <max.gekk@gmail.com>
2021-04-19 16:13:13 +03:00
Max Gekk 1d1ed3eb25 [SPARK-35107][SQL] Parse unit-to-unit interval literals to ANSI intervals
### What changes were proposed in this pull request?
Parse the year-month interval literals like `INTERVAL '1-1' YEAR TO MONTH` to values of `YearMonthIntervalType`, and day-time interval literals to `DayTimeIntervalType` values. Currently, Spark SQL supports:
- DAY TO HOUR
- DAY TO MINUTE
- DAY TO SECOND
- HOUR TO MINUTE
- HOUR TO SECOND
- MINUTE TO SECOND

All such interval literals are converted to `DayTimeIntervalType`, and `YEAR TO MONTH` to `YearMonthIntervalType` while loosing info about `from` and `to` units.

**Note**: new behavior is under the SQL config `spark.sql.legacy.interval.enabled` which is `false` by default. When the config is set to `true`, the interval literals are parsed to `CaledarIntervalType` values.

Closes #32176

### Why are the changes needed?
To conform the ANSI SQL standard which assumes conversions of interval literals to year-month or day-time interval but not to mixed interval type like Catalyst's `CalendarIntervalType`.

### Does this PR introduce _any_ user-facing change?
Yes.

Before:
```sql
spark-sql> SELECT INTERVAL '1 01:02:03.123' DAY TO SECOND;
1 days 1 hours 2 minutes 3.123 seconds
spark-sql> SELECT typeof(INTERVAL '1 01:02:03.123' DAY TO SECOND);
interval
```

After:
```sql
spark-sql> SELECT INTERVAL '1 01:02:03.123' DAY TO SECOND;
1 01:02:03.123000000
spark-sql> SELECT typeof(INTERVAL '1 01:02:03.123' DAY TO SECOND);
day-time interval
```

### How was this patch tested?
1. By running the affected test suites:
```
$ ./build/sbt "test:testOnly *.ExpressionParserSuite"
$ SPARK_GENERATE_GOLDEN_FILES=1 build/sbt "sql/testOnly *SQLQueryTestSuite -- -z interval.sql"
$ SPARK_GENERATE_GOLDEN_FILES=1 build/sbt "sql/testOnly *SQLQueryTestSuite -- -z create_view.sql"
$ SPARK_GENERATE_GOLDEN_FILES=1 build/sbt "sql/testOnly *SQLQueryTestSuite -- -z date.sql"
$ SPARK_GENERATE_GOLDEN_FILES=1 build/sbt "sql/testOnly *SQLQueryTestSuite -- -z timestamp.sql"
```
2. PostgresSQL tests are executed with `spark.sql.legacy.interval.enabled` is set to `true` to keep compatibility with PostgreSQL output:
```sql
> SELECT interval '999' second;
0 years 0 mons 0 days 0 hours 16 mins 39.00 secs
```

Closes #32209 from MaxGekk/parse-ansi-interval-literals.

Authored-by: Max Gekk <max.gekk@gmail.com>
Signed-off-by: Max Gekk <max.gekk@gmail.com>
2021-04-19 16:00:59 +03:00
gengjiaan 8dc455bba8 [SPARK-34837][SQL] Support ANSI SQL intervals by the aggregate function avg
### What changes were proposed in this pull request?
Extend the `Average` expression to support `DayTimeIntervalType` and `YearMonthIntervalType` added by #31614.

Note: the expressions can throw the overflow exception independently from the SQL config `spark.sql.ansi.enabled`. In this way, the modified expressions always behave in the ANSI mode for the intervals.

### Why are the changes needed?
Extend `org.apache.spark.sql.catalyst.expressions.aggregate.Average` to support `DayTimeIntervalType` and `YearMonthIntervalType`.

### Does this PR introduce _any_ user-facing change?
'No'.
Should not since new types have not been released yet.

### How was this patch tested?
Jenkins test

Closes #32229 from beliefer/SPARK-34837.

Authored-by: gengjiaan <gengjiaan@360.cn>
Signed-off-by: Max Gekk <max.gekk@gmail.com>
2021-04-19 15:56:56 +03:00
HyukjinKwon 70b606ffdd [SPARK-35045][SQL][FOLLOW-UP] Add a configuration for CSV input buffer size
### What changes were proposed in this pull request?

This PR makes the input buffer configurable (as an internal configuration). This is mainly to work around the regression in uniVocity/univocity-parsers#449.

This is particularly useful for SQL workloads that requires to rewrite the `CREATE TABLE` with options.

### Why are the changes needed?

To work around uniVocity/univocity-parsers#449.

### Does this PR introduce _any_ user-facing change?

No, it's only internal option.

### How was this patch tested?

Manually tested by modifying the unittest added in https://github.com/apache/spark/pull/31858 as below:

```diff
diff --git a/sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/csv/CSVSuite.scala b/sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/csv/CSVSuite.scala
index fd25a79619d..705f38dbfbd 100644
--- a/sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/csv/CSVSuite.scala
+++ b/sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/csv/CSVSuite.scala
 -2456,6 +2456,7  abstract class CSVSuite
   test("SPARK-34768: counting a long record with ignoreTrailingWhiteSpace set to true") {
     val bufSize = 128
     val line = "X" * (bufSize - 1) + "| |"
+    spark.conf.set("spark.sql.csv.parser.inputBufferSize", 128)
     withTempPath { path =>
       Seq(line).toDF.write.text(path.getAbsolutePath)
       assert(spark.read.format("csv")
```

Closes #32231 from HyukjinKwon/SPARK-35045-followup.

Authored-by: HyukjinKwon <gurwls223@apache.org>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2021-04-19 19:52:06 +09:00
Angerszhuuuu a74f601040 [SPARK-31937][SQL] Support processing ArrayType/MapType/StructType data using no-serde mode script transform
### What changes were proposed in this pull request?
Support no-serde mode script transform use ArrayType/MapType/StructStpe data.

### Why are the changes needed?
Make user can process array/map/struct data

### Does this PR introduce _any_ user-facing change?
Yes, user can process array/map/struct data in script transform `no-serde` mode

### How was this patch tested?
Added UT

Closes #30957 from AngersZhuuuu/SPARK-31937.

Lead-authored-by: Angerszhuuuu <angers.zhu@gmail.com>
Co-authored-by: angerszhu <angers.zhu@gmail.com>
Co-authored-by: AngersZhuuuu <angers.zhu@gmail.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2021-04-19 17:02:32 +09:00
Terry Kim 7a06cdd53b [SPARK-35122][SQL] Migrate CACHE/UNCACHE TABLE to use AnalysisOnlyCommand
### What changes were proposed in this pull request?

Now that `AnalysisOnlyCommand` in introduced in #32032, `CacheTable` and `UncacheTable` can extend `AnalysisOnlyCommand` to simplify the code base. For example, the logic to handle these commands such that the tables are only analyzed is scattered across different places.

### Why are the changes needed?

To simplify the code base to handle these two commands.

### Does this PR introduce _any_ user-facing change?

No, just internal refactoring.

### How was this patch tested?

The existing tests (e.g., `CachedTableSuite`) cover the changes in this PR. For example, if I make `CacheTable`/`UncacheTable` extend `LeafCommand`, there are few failures in `CachedTableSuite`.

Closes #32220 from imback82/cache_cmd_analysis_only.

Authored-by: Terry Kim <yuminkim@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2021-04-19 06:00:23 +00:00
Peter Toth c8d78a70b4 [SPARK-34581][SQL] Don't optimize out grouping expressions from aggregate expressions without aggregate function
### What changes were proposed in this pull request?
This PR:
- Adds a new expression `GroupingExprRef` that can be used in aggregate expressions of `Aggregate` nodes to refer grouping expressions by index. These expressions capture the data type and nullability of the referred grouping expression.
- Adds a new rule `EnforceGroupingReferencesInAggregates` that inserts the references in the beginning of the optimization phase.
- Adds a new rule `UpdateGroupingExprRefNullability` to update nullability of `GroupingExprRef` expressions as nullability of referred grouping expression can change during optimization.

### Why are the changes needed?
If aggregate expressions (without aggregate functions) in an `Aggregate` node are complex then the `Optimizer` can optimize out grouping expressions from them and so making aggregate expressions invalid.

Here is a simple example:
```
SELECT not(t.id IS NULL) , count(*)
FROM t
GROUP BY t.id IS NULL
```
In this case the `BooleanSimplification` rule does this:
```
=== Applying Rule org.apache.spark.sql.catalyst.optimizer.BooleanSimplification ===
!Aggregate [isnull(id#222)], [NOT isnull(id#222) AS (NOT (id IS NULL))#226, count(1) AS c#224L]   Aggregate [isnull(id#222)], [isnotnull(id#222) AS (NOT (id IS NULL))#226, count(1) AS c#224L]
 +- Project [value#219 AS id#222]                                                                 +- Project [value#219 AS id#222]
    +- LocalRelation [value#219]                                                                     +- LocalRelation [value#219]
```
where `NOT isnull(id#222)` is optimized to `isnotnull(id#222)` and so it no longer refers to any grouping expression.

Before this PR:
```
== Optimized Logical Plan ==
Aggregate [isnull(id#222)], [isnotnull(id#222) AS (NOT (id IS NULL))#234, count(1) AS c#232L]
+- Project [value#219 AS id#222]
   +- LocalRelation [value#219]
```
and running the query throws an error:
```
Couldn't find id#222 in [isnull(id#222)#230,count(1)#226L]
java.lang.IllegalStateException: Couldn't find id#222 in [isnull(id#222)#230,count(1)#226L]
```

After this PR:
```
== Optimized Logical Plan ==
Aggregate [isnull(id#222)], [NOT groupingexprref(0) AS (NOT (id IS NULL))#234, count(1) AS c#232L]
+- Project [value#219 AS id#222]
   +- LocalRelation [value#219]
```
and the query works.

### Does this PR introduce _any_ user-facing change?
Yes, the query works.

### How was this patch tested?
Added new UT.

Closes #31913 from peter-toth/SPARK-34581-keep-grouping-expressions.

Authored-by: Peter Toth <peter.toth@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2021-04-19 04:58:41 +00:00
Cheng Su fd08c93151 [SPARK-35109][SQL] Fix minor exception messages of HashedRelation and HashJoin
### What changes were proposed in this pull request?

It seems that we miss classifying one `SparkOutOfMemoryError` in `HashedRelation`. Add the error classification for it. In addition, clean up two errors definition of `HashJoin` as they are not used.

### Why are the changes needed?

Better error classification.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Existing tests.

Closes #32211 from c21/error-message.

Authored-by: Cheng Su <chengsu@fb.com>
Signed-off-by: Takeshi Yamamuro <yamamuro@apache.org>
2021-04-19 12:43:43 +09:00
Max Gekk 074f770137 [SPARK-35115][SQL][TESTS] Check ANSI intervals in MutableProjectionSuite
### What changes were proposed in this pull request?
Add checks for `YearMonthIntervalType` and `DayTimeIntervalType` to `MutableProjectionSuite`.

### Why are the changes needed?
To improve test coverage, and the same checks as for `CalendarIntervalType`.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
By running the modified test suite:
```
$ build/sbt "test:testOnly *MutableProjectionSuite"
```

Closes #32225 from MaxGekk/test-ansi-intervals-in-MutableProjectionSuite.

Authored-by: Max Gekk <max.gekk@gmail.com>
Signed-off-by: Takeshi Yamamuro <yamamuro@apache.org>
2021-04-19 08:50:19 +09:00
gengjiaan 12abfe7917 [SPARK-34716][SQL] Support ANSI SQL intervals by the aggregate function sum
### What changes were proposed in this pull request?
Extend the `Sum` expression to  to support `DayTimeIntervalType` and `YearMonthIntervalType` added by #31614.

Note: the expressions can throw the overflow exception independently from the SQL config `spark.sql.ansi.enabled`. In this way, the modified expressions always behave in the ANSI mode for the intervals.

### Why are the changes needed?
Extend `org.apache.spark.sql.catalyst.expressions.aggregate.Sum` to support `DayTimeIntervalType` and `YearMonthIntervalType`.

### Does this PR introduce _any_ user-facing change?
'No'.
Should not since new types have not been released yet.

### How was this patch tested?
Jenkins test

Closes #32107 from beliefer/SPARK-34716.

Lead-authored-by: gengjiaan <gengjiaan@360.cn>
Co-authored-by: beliefer <beliefer@163.com>
Co-authored-by: Hyukjin Kwon <gurwls223@gmail.com>
Signed-off-by: Max Gekk <max.gekk@gmail.com>
2021-04-18 18:03:50 +03:00
Max Gekk d04b467690 [SPARK-35114][SQL][TESTS] Add checks for ANSI intervals to LiteralExpressionSuite
### What changes were proposed in this pull request?
In the PR, I propose to add additional checks for ANSI interval types `YearMonthIntervalType` and `DayTimeIntervalType` to `LiteralExpressionSuite`.

Also, I replaced some long literal values by `CalendarInterval` to check `CalendarIntervalType` that the tests were supposed to check.

### Why are the changes needed?
To improve test coverage and have the same checks for ANSI types as for `CalendarIntervalType`.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
By running the modified test suite:
```
$ build/sbt "test:testOnly *LiteralExpressionSuite"
```

Closes #32213 from MaxGekk/interval-literal-tests.

Authored-by: Max Gekk <max.gekk@gmail.com>
Signed-off-by: Max Gekk <max.gekk@gmail.com>
2021-04-18 11:35:00 +03:00
beliefer 03191e8d8f [SPARK-35116][SQL][TESTS] The generated data fits the precision of DayTimeIntervalType in spark
### What changes were proposed in this pull request?
The precision of `java.time.Duration` is nanosecond, but when it is used as `DayTimeIntervalType` in Spark, it is microsecond.
At present, the `DayTimeIntervalType` data generated in the implementation of `RandomDataGenerator` is accurate to nanosecond, which will cause the `DayTimeIntervalType` to be converted to long, and then back to `DayTimeIntervalType` to lose the accuracy, which will cause the test to fail. For example: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/137390/testReport/org.apache.spark.sql.hive.execution/HashAggregationQueryWithControlledFallbackSuite/udaf_with_all_data_types/

### Why are the changes needed?
Improve `RandomDataGenerator` so that the generated data fits the precision of DayTimeIntervalType in spark.

### Does this PR introduce _any_ user-facing change?
'No'. Just change the test class.

### How was this patch tested?
Jenkins test.

Closes #32212 from beliefer/SPARK-35116.

Authored-by: beliefer <beliefer@163.com>
Signed-off-by: Max Gekk <max.gekk@gmail.com>
2021-04-18 09:04:11 +03:00
Kousuke Saruta 95db7e6459 [SPARK-35104][SQL] Fix ugly indentation of multiple JSON records in a single split file generated by JacksonGenerator when pretty option is true
### What changes were proposed in this pull request?

This issue fixes an issue that indentation of multiple output JSON records in a single split file are broken except for the first record in the split when `pretty` option is `true`.
```
// Run in the Spark Shell.
// Set spark.sql.leafNodeDefaultParallelism to 1 for the current master.
// Or set spark.default.parallelism for the previous releases.
spark.conf.set("spark.sql.leafNodeDefaultParallelism", 1)
val df = Seq("a", "b", "c").toDF
df.write.option("pretty", "true").json("/path/to/output")

# Run in a Shell
$ cat /path/to/output/*.json
{
  "value" : "a"
}
 {
  "value" : "b"
}
 {
  "value" : "c"
}
```

### Why are the changes needed?

It's not pretty even though `pretty` option is true.

### Does this PR introduce _any_ user-facing change?

I think "No". Indentation style is changed but JSON format is not changed.

### How was this patch tested?

New test.

Closes #32203 from sarutak/fix-ugly-indentation.

Authored-by: Kousuke Saruta <sarutak@oss.nttdata.com>
Signed-off-by: Max Gekk <max.gekk@gmail.com>
2021-04-16 11:00:52 +03:00
Max Gekk 3f4c32b3ca [SPARK-35099][SQL] Convert ANSI interval literals to SQL string in ANSI style
### What changes were proposed in this pull request?
Handle `YearMonthIntervalType` and `DayTimeIntervalType` in the `sql()` and `toString()` method of `Literal`, and format the ANSI interval in the ANSI style.

### Why are the changes needed?
To improve readability and UX with Spark SQL. For example, a test output before the changes:
```
-- !query
select timestamp'2011-11-11 11:11:11' - interval '2' day
-- !query schema
struct<TIMESTAMP '2011-11-11 11:11:11' - 172800000000:timestamp>
-- !query output
2011-11-09 11:11:11
```

### Does this PR introduce _any_ user-facing change?
Should not since the new intervals haven't been released yet.

### How was this patch tested?
By running new tests:
```
$ ./build/sbt "test:testOnly *LiteralExpressionSuite"
```

Closes #32196 from MaxGekk/literal-ansi-interval-sql.

Authored-by: Max Gekk <max.gekk@gmail.com>
Signed-off-by: Max Gekk <max.gekk@gmail.com>
2021-04-16 08:22:25 +03:00
Angerszhuuuu 71133e1c2a [SPARK-35070][SQL] TRANSFORM not support alias in inputs
### What changes were proposed in this pull request?
Normal function parameters should not support alias, hive not support too
![image](https://user-images.githubusercontent.com/46485123/114645556-4a7ff400-9d0c-11eb-91eb-bc679ea0039a.png)
In this pr we forbid use alias in `TRANSFORM`'s inputs

### Why are the changes needed?
Fix bug

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
Added UT

Closes #32165 from AngersZhuuuu/SPARK-35070.

Authored-by: Angerszhuuuu <angers.zhu@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2021-04-15 06:51:07 +00:00
Angerszhuuuu 9b2e0d6191 [SPARK-35086][SQL][CORE] --verbose should be passed to Spark SQL CLI too
### What changes were proposed in this pull request?
In current code, if we run spark sql with
```
./bin/spark-sql --verbose
```
It won't be passed to end SparkSQLCliDriver, then the SessionState won't call `setIsVerbose`

In the CLI option, it shows
```
CLI options:
 -v,--verbose                     Verbose mode (echo executed SQL to the
                                  console)
```

It's not consistent. This pr fix this issue
### Why are the changes needed?
Fix bug

### Does this PR introduce _any_ user-facing change?
when user call `-v` when run spark sql, sql will be echoed to console.

### How was this patch tested?
Added UT

Closes #32163 from AngersZhuuuu/SPARK-35086.

Authored-by: Angerszhuuuu <angers.zhu@gmail.com>
Signed-off-by: Yuming Wang <yumwang@ebay.com>
2021-04-15 12:59:20 +08:00
Kousuke Saruta 271aa331b3 [MINOR][SQL] Refactor the comments in HiveClientImpl.withHiveState
### What changes were proposed in this pull request?

This PR refactors three parts of the comments in `HiveClientImpl.withHiveState`

One is about the following comment.
```
// The classloader in clientLoader could be changed after addJar, always use the latest
// classloader.
```
The comment was added in SPARK-10810 (#8909) because `IsolatedClientLoader.classLoader` was declared as `var`.
But the field is now `val` and cannot be changed after instanciation.
So, the comment can confuse developers.

One is about the following code and comment.
```
// classloader. We explicitly set the context class loader since "conf.setClassLoader" does
// not do that, and the Hive client libraries may need to load classes defined by the client's
// class loader.
Thread.currentThread().setContextClassLoader(clientLoader.classLoader)
```
It's not trivial why this part is necessary and it's difficult when we can remove this code in the future.
So, I revised the comment by adding the reference of the related JIRA.

And the last one is about the following code and comment.
```
// Replace conf in the thread local Hive with current conf
Hive.get(conf)
```
It's also not trivial why this part is necessary.
I revised the comment by adding the reference of the related discussion.

### Why are the changes needed?

To make code more readable.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

It's just a comment refactoring so I add no new test.

Closes #32162 from sarutak/refactor-HiveClientImpl.

Authored-by: Kousuke Saruta <sarutak@oss.nttdata.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2021-04-14 21:42:35 -07:00
Kent Yao f32114d17e [SPARK-35044][SQL] SET propertyKey shall also lookup sparkSession.sharedState.hadoopConf to display the effective default hive/hadoop configs
### What changes were proposed in this pull request?

Currently, pure SQL users are short of ways to see the Hadoop configurations which may affect their jobs a lot, they are only able to get the Hadoop configs that exist in `SQLConf` while other defaults in `SharedState.hadoopConf` display wrongly and confusingly with `<undefined>`.

The pre-loaded ones from `core-site.xml, hive-site.xml` etc., will only stay in `sparkSession.sharedState.hadoopConf` or `sc._hadoopConfiguation` not `SQLConf`. Some of them that related the Hive Metastore connection(never change it spark runtime), e.g. `hive.metastore.uris`, are clearly global static and unchangeable but displayable I guess. Some of the ones that might be related to, for example, the output codec/compression, preset in Hadoop/hive config files like core-site.xml shall be still changeable from case to case, table to table, file to file, etc. It' meaningfully to show the defaults for users to change based on that.

In this PR, I propose to support get a Hadoop configuration by SET syntax, for example
```
SET mapreduce.map.output.compress.codec;
```

### Why are the changes needed?

better user experience for pure SQL users

### Does this PR introduce _any_ user-facing change?

yes, where retrieving a conf only existing in sessionState.hadoopConf, before is `undefined` and now you see it

### How was this patch tested?

new test

Closes #32144 from yaooqinn/SPARK-35044.

Authored-by: Kent Yao <yao@apache.org>
Signed-off-by: Kent Yao <yao@apache.org>
2021-04-15 01:44:10 +08:00
Max Gekk de9e8b6c94 [SPARK-35051][SQL] Support add/subtract of a day-time interval to/from a date
### What changes were proposed in this pull request?
Support `date +/- day-time interval`. In the PR, I propose to update the binary arithmetic rules, and cast an input date to a timestamp at the session time zone, and then add a day-time interval to it.

### Why are the changes needed?
1. To conform the ANSI SQL standard which requires to support such operation over dates and intervals:
<img width="811" alt="Screenshot 2021-03-12 at 11 36 14" src="https://user-images.githubusercontent.com/1580697/111081674-865d4900-8515-11eb-86c8-3538ecaf4804.png">
2. To fix the regression comparing to the recent Spark release 3.1 with default settings.

Before the changes:
```sql
spark-sql> select date'now' + (timestamp'now' - timestamp'yesterday');
Error in query: cannot resolve 'DATE '2021-04-14' + subtracttimestamps(TIMESTAMP '2021-04-14 18:14:56.497', TIMESTAMP '2021-04-13 00:00:00')' due to data type mismatch: argument 1 requires timestamp type, however, 'DATE '2021-04-14'' is of date type.; line 1 pos 7;
'Project [unresolvedalias(cast(2021-04-14 + subtracttimestamps(2021-04-14 18:14:56.497, 2021-04-13 00:00:00, false, Some(Europe/Moscow)) as date), None)]
+- OneRowRelation
```

Spark 3.1:
```sql
spark-sql> select date'now' + (timestamp'now' - timestamp'yesterday');
2021-04-15
```

Hive:
```sql
0: jdbc:hive2://localhost:10000/default> select date'2021-04-14' + (timestamp'2020-04-14 18:15:30' - timestamp'2020-04-13 00:00:00');
+------------------------+
|          _c0           |
+------------------------+
| 2021-04-15 18:15:30.0  |
+------------------------+
```

### Does this PR introduce _any_ user-facing change?
Should not since new intervals have not been released yet.

After the changes:
```sql
spark-sql> select date'now' + (timestamp'now' - timestamp'yesterday');
2021-04-15 18:13:16.555
```

### How was this patch tested?
By running new tests:
```
$ build/sbt "test:testOnly *ColumnExpressionSuite"
```

Closes #32170 from MaxGekk/date-add-day-time-interval.

Authored-by: Max Gekk <max.gekk@gmail.com>
Signed-off-by: Max Gekk <max.gekk@gmail.com>
2021-04-14 19:28:26 +03:00
Angerszhuuuu 4ca9958270 [SPARK-35069][SQL] TRANSFORM forbidden DISTICNT and ALL, also make the error clear
### What changes were proposed in this pull request?
According to https://github.com/apache/spark/pull/29087#discussion_r612267050,  add UT in `transform.sql`

It seems that distinct is not recognized as a reserved word here

```
-- !query
explain extended SELECT TRANSFORM(distinct b, a, c)
                   USING 'cat' AS (a, b, c)
                 FROM script_trans
                 WHERE a <= 4
-- !query schema
struct<plan:string>
-- !query output
== Parsed Logical Plan ==
'ScriptTransformation [*], cat, [a#x, b#x, c#x], ScriptInputOutputSchema(List(),List(),None,None,List(),List(),None,None,false)
+- 'Project ['distinct AS b#x, 'a, 'c]
   +- 'Filter ('a <= 4)
      +- 'UnresolvedRelation [script_trans], [], false

== Analyzed Logical Plan ==
org.apache.spark.sql.AnalysisException: cannot resolve 'distinct' given input columns: [script_trans.a, script_trans.b, script_trans.c]; line 1 pos 34;
'ScriptTransformation [*], cat, [a#x, b#x, c#x], ScriptInputOutputSchema(List(),List(),None,None,List(),List(),None,None,false)
+- 'Project ['distinct AS b#x, a#x, c#x]
   +- Filter (a#x <= 4)
      +- SubqueryAlias script_trans
         +- View (`script_trans`, [a#x,b#x,c#x])
            +- Project [cast(a#x as int) AS a#x, cast(b#x as int) AS b#x, cast(c#x as int) AS c#x]
               +- Project [a#x, b#x, c#x]
                  +- SubqueryAlias script_trans
                     +- LocalRelation [a#x, b#x, c#x]
```

Hive's error
![image](https://user-images.githubusercontent.com/46485123/114533170-355d8380-9c80-11eb-992f-982f0b296759.png)

### Why are the changes needed?

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
Added Ut

Closes #32149 from AngersZhuuuu/SPARK-28227-new-followup.

Authored-by: Angerszhuuuu <angers.zhu@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2021-04-14 15:03:29 +00:00
Terry Kim b5241c97b1 [SPARK-34701][SQL] Introduce AnalysisOnlyCommand that allows its children to be removed once the command is marked as analyzed
### What changes were proposed in this pull request?

This PR proposes to introduce the `AnalysisOnlyCommand` trait such that a command that extends this trait can have its children only analyzed, but not optimized. There is a corresponding analysis rule `HandleAnalysisOnlyCommand` that marks the command as analyzed after all other analysis rules are run.

This can be useful if a logical plan has children where they need to be only analyzed, but not optimized - e.g., `CREATE VIEW` or `CACHE TABLE AS`. This also addresses the issue found in #31933.

This PR also updates `CreateViewCommand`, `CacheTableAsSelect`, and `AlterViewAsCommand` to use the new trait / rule such that their children are only analyzed.

### Why are the changes needed?

To address the issue where the plan is unnecessarily re-analyzed in `CreateViewCommand`.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Existing tests should cover the changes.

Closes #32032 from imback82/skip_transform.

Authored-by: Terry Kim <yuminkim@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2021-04-14 08:24:25 +00:00
Karen Feng 816f6dd13e [SPARK-34527][SQL] Resolve duplicated common columns from USING/NATURAL JOIN
### What changes were proposed in this pull request?

Adds the duplicated common columns as hidden columns to the Projection used to rewrite NATURAL/USING JOINs.

### Why are the changes needed?

Allows users to resolve either side of the NATURAL/USING JOIN's common keys.
Previously, the user could only resolve the following columns:

| Join type | Left key columns | Right key columns |
| --- | --- | --- |
| Inner | Yes | No |
| Left | Yes | No |
| Right | No | Yes |
| Outer | No | No |

### Does this PR introduce _any_ user-facing change?

Yes. The user can now symmetrically resolve the common columns from a NATURAL/USING JOIN.

### How was this patch tested?

SQL-side tests. The behavior matches PostgreSQL and MySQL.

Closes #31666 from karenfeng/spark-34527.

Authored-by: Karen Feng <karen.feng@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2021-04-14 07:01:40 +00:00
Kousuke Saruta ef05e89ee5 [SPARK-34977][SQL] LIST FILES/JARS/ARCHIVES cannot handle multiple arguments properly when at least one path is quoted
### What changes were proposed in this pull request?

This PR fixes an issue that `LIST FILES/JARS/ARCHIVES path1 path2 ...` cannot list all paths if at least one path is quoted.
An example here.
```
ADD FILE /tmp/test1;
ADD FILE /tmp/test2;

LIST FILES /tmp/test1 /tmp/test2;
file:/tmp/test1
file:/tmp/test2

LIST FILES /tmp/test1 "/tmp/test2";
file:/tmp/test2
```

In this example, the second `LIST FILES` doesn't show `file:/tmp/test1`.

To resolve this issue, I modified the syntax rule to be able to handle this case.
I also changed `SparkSQLParser` to be able to handle paths which contains white spaces.

### Why are the changes needed?

This is a bug.
I also have a plan which extends `ADD FILE/JAR/ARCHIVE` to take multiple paths like Hive and the syntax rule change is necessary for that.

### Does this PR introduce _any_ user-facing change?

Yes. Users can pass quoted paths when using `ADD FILE/JAR/ARCHIVE`.

### How was this patch tested?

New test.

Closes #32074 from sarutak/fix-list-files-bug.

Authored-by: Kousuke Saruta <sarutak@oss.nttdata.com>
Signed-off-by: Kousuke Saruta <sarutak@oss.nttdata.com>
2021-04-14 10:33:45 +09:00
gengjiaan 27bec91bc9 [SPARK-33604][SQL] Group exception messages in sql/execution
### What changes were proposed in this pull request?
This PR group exception messages in `/core/src/main/scala/org/apache/spark/sql/execution`.

### Why are the changes needed?
It will largely help with standardization of error messages and its maintenance.

### Does this PR introduce _any_ user-facing change?
No. Error messages remain unchanged.

### How was this patch tested?
No new tests - pass all original tests to make sure it doesn't break any existing behavior.

Closes #31920 from beliefer/SPARK-33604.

Lead-authored-by: gengjiaan <gengjiaan@360.cn>
Co-authored-by: Jiaan Geng <beliefer@163.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2021-04-13 14:03:36 +00:00
Chao Sun 1a6708918b [SPARK-34947][SQL] Streaming write to a V2 table should invalidate its associated cache
### What changes were proposed in this pull request?

Populate table catalog and identifier from `DataStreamWriter` to `WriteToMicroBatchDataSource` so that we can invalidate cache for tables that are updated by a streaming write.

This is somewhat related [SPARK-27484](https://issues.apache.org/jira/browse/SPARK-27484) and [SPARK-34183](https://issues.apache.org/jira/browse/SPARK-34183) (#31700), as ideally we may want to replace `WriteToMicroBatchDataSource` and `WriteToDataSourceV2` with logical write nodes and feed them to analyzer. That will potentially change the code path involved in this PR.

### Why are the changes needed?

Currently `WriteToDataSourceV2` doesn't have cache invalidation logic, and therefore, when the target table for a micro batch streaming job is cached, the cache entry won't be removed when the table is updated.

### Does this PR introduce _any_ user-facing change?

Yes now when a DSv2 table which supports streaming write is updated by a streaming job, its cache will also be invalidated.

### How was this patch tested?

Added a new UT.

Closes #32039 from sunchao/streaming-cache.

Authored-by: Chao Sun <sunchao@apple.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2021-04-13 13:31:09 +00:00
Gengliang Wang ade3a1df82 [SPARK-34916][SQL][FOLLOWUP] Remove duplicate code in TreeNode.treePatternBits
### What changes were proposed in this pull request?

Remove duplicate code in `TreeNode.treePatternBits`

### Why are the changes needed?

Code clean up. Make it easier for maintainence.

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

Existing tests.

Closes #32143 from gengliangwang/getBits.

Authored-by: Gengliang Wang <ltnwgl@gmail.com>
Signed-off-by: Gengliang Wang <ltnwgl@gmail.com>
2021-04-13 20:25:35 +08:00
Hyukjin Kwon 1f562159bf [SPARK-35045][SQL] Add an internal option to control input buffer in univocity
### What changes were proposed in this pull request?

This PR makes the input buffer configurable (as an internal option). This is mainly to work around uniVocity/univocity-parsers#449.

### Why are the changes needed?

To work around uniVocity/univocity-parsers#449.

### Does this PR introduce _any_ user-facing change?

No, it's only internal option.

### How was this patch tested?

Manually tested by modifying the unittest added in https://github.com/apache/spark/pull/31858 as below:

```diff
diff --git a/sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/csv/CSVSuite.scala b/sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/csv/CSVSuite.scala
index fd25a79619d..b58f0bd3661 100644
--- a/sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/csv/CSVSuite.scala
+++ b/sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/csv/CSVSuite.scala
 -2460,6 +2460,7  abstract class CSVSuite
       Seq(line).toDF.write.text(path.getAbsolutePath)
       assert(spark.read.format("csv")
         .option("delimiter", "|")
+        .option("inputBufferSize", "128")
         .option("ignoreTrailingWhiteSpace", "true").load(path.getAbsolutePath).count() == 1)
     }
   }
```

Closes #32145 from HyukjinKwon/SPARK-35045.

Lead-authored-by: Hyukjin Kwon <gurwls223@apache.org>
Co-authored-by: HyukjinKwon <gurwls223@apache.org>
Signed-off-by: Max Gekk <max.gekk@gmail.com>
2021-04-13 15:08:01 +03:00
Yingyi Bu 9cd25b46b9 [SPARK-35014] Fix the PhysicalAggregation pattern to not rewrite foldable expressions
### What changes were proposed in this pull request?

Fix PhysicalAggregation to not transform a foldable expression.

### Why are the changes needed?

It can potentially break certain queries like the added unit test shows.

### Does this PR introduce _any_ user-facing change?

Yes, it fixes undesirable errors caused by a returned TypeCheckFailure from places like RegExpReplace.checkInputDataTypes.

Closes #32113 from sigmod/foldable.

Authored-by: Yingyi Bu <yingyi.bu@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2021-04-13 19:57:13 +08:00
Yingyi Bu 49618c9543 [SPARK-35043][SQL] Add condition lambda and rule id to the resolve function family
### What changes were proposed in this pull request?

This PR contains:
- AnalysisHelper changes to allow the resolve function family to stop earlier without traversing the entire tree;
- Example changes in a few rules to support such pruning, e.g., ResolveRandomSeed, ResolveWindowFrame, ResolveWindowOrder, and ResolveNaturalAndUsingJoin.

### Why are the changes needed?

It's a framework-level change for reducing the query compilation time.
In particular, if we update existing analysis rules' call sites as per the examples in this PR, the analysis time can be reduced as described in the [doc](https://docs.google.com/document/d/1SEUhkbo8X-0cYAJFYFDQhxUnKJBz4lLn3u4xR2qfWqk).

### Does this PR introduce any user-facing change?

No.

### How was this patch tested?

It is tested by existing tests.

Closes #32135 from sigmod/resolver.

Authored-by: Yingyi Bu <yingyi.bu@databricks.com>
Signed-off-by: Gengliang Wang <ltnwgl@gmail.com>
2021-04-13 19:39:11 +08:00
Yuming Wang b34a84e21e [SPARK-34212][SQL][FOLLOWUP] Move the added test to ParquetQuerySuite
### What changes were proposed in this pull request?

This pr moves the added test from `SQLQuerySuite` to `ParquetQuerySuite`.

### Why are the changes needed?
1. It can be tested by `ParquetV1QuerySuite` and `ParquetV2QuerySuite`.
2. Reduce the testing time of `SQLQuerySuite`(SQLQuerySuite ~ 3 min 17 sec, ParquetV1QuerySuite ~ 27 sec).

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Unit test.

Closes #32090 from wangyum/SPARK-34212.

Authored-by: Yuming Wang <yumwang@ebay.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2021-04-13 09:04:47 +00:00
Gengliang Wang 5d126537d3 [MINOR][TESTS] Enhance the test instruction of ThriftServerQueryTestSuite
### What changes were proposed in this pull request?

Enhance the test instruction of ThriftServerQueryTestSuite:
1. how to run a single test case
2. how to regenerate golden file for a single test

### Why are the changes needed?

Better documentation.

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

No, just enhance the comments.

Closes #32141 from gengliangwang/updateComment.

Authored-by: Gengliang Wang <ltnwgl@gmail.com>
Signed-off-by: Gengliang Wang <ltnwgl@gmail.com>
2021-04-13 16:49:20 +08:00
allisonwang-db 6b8405b574 [SPARK-28379][SQL] Allow non-aggregated single row correlated scalar subquery
### What changes were proposed in this pull request?
This PR allows non-aggregated correlated scalar subquery if the max output row is less than 2. Correlated scalar subqueries need to be aggregated because they are going to be decorrelated and rewritten as LEFT OUTER joins. If the correlated scalar subquery produces more than one output row, the rewrite will yield wrong results.

But this constraint can be relaxed when the subquery plan's the max number of output rows is less than or equal to 1.

### Why are the changes needed?
To relax a constraint in CheckAnalysis for the correlated scalar subquery.

### Does this PR introduce _any_ user-facing change?
Yes

### How was this patch tested?
Unit tests

Closes #32111 from allisonwang-db/spark-28379-aggregated.

Authored-by: allisonwang-db <66282705+allisonwang-db@users.noreply.github.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2021-04-13 07:27:17 +00:00
ayushi agarwal caf33be274 [SPARK-33411][SQL] Cardinality estimation of union, sort and range operator
### What changes were proposed in this pull request?
Supports cardinality estimation of union, sort and range operator.

1. **Union**: number of rows in output will be the sum of number of rows in the output for each child of union, min and max for each column in the output will be the min and max of that particular column coming from its children.
Example:
Table 1
a   b
1   6
2   3
Table 2
a   b
1   3
 4   1
stats for table1 union table2 would be number of rows = 4, columnStats = (a: {min: 1, max: 4}, b: {min: 1, max: 6})

2. **Sort**: row and columns stats would be same as its children.

3. **Range**: number of output rows and distinct count will be equal to number of elements, min and max is calculated from start, end and step param.

### Why are the changes needed?
The change will enhance the feature https://issues.apache.org/jira/browse/SPARK-16026 and will help in other stats based optimizations.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
New unit tests added.

Closes #30334 from ayushi-agarwal/SPARK-33411.

Lead-authored-by: ayushi agarwal <ayaga@microsoft.com>
Co-authored-by: ayushi-agarwal <36420535+ayushi-agarwal@users.noreply.github.com>
Signed-off-by: Takeshi Yamamuro <yamamuro@apache.org>
2021-04-13 15:17:17 +09:00
Max Gekk 26f312e95f [SPARK-35037][SQL] Recognize sign before the interval string in literals
### What changes were proposed in this pull request?
1. Extend SQL syntax rules to support a sign before the interval strings of ANSI year-month and day-time intervals.
2. Recognize `-` in `AstBuilder` and negate parsed intervals.

### Why are the changes needed?
To conform to the SQL standard which allows a sign before the string interval, see `"5.3 <literal>"`:
```
<interval literal> ::=
  INTERVAL [ <sign> ] <interval string> <interval qualifier>
<interval string> ::=
  <quote> <unquoted interval string> <quote>
<unquoted interval string> ::=
  [ <sign> ] { <year-month literal> | <day-time literal> }
<sign> ::=
    <plus sign>
  | <minus sign>
```

### Does this PR introduce _any_ user-facing change?
Should not because it just extends supported intervals syntax.

### How was this patch tested?
By running new tests in `interval.sql`:
```
$ build/sbt "sql/testOnly *SQLQueryTestSuite -- -z interval.sql"
```

Closes #32134 from MaxGekk/negative-parsed-intervals.

Authored-by: Max Gekk <max.gekk@gmail.com>
Signed-off-by: Max Gekk <max.gekk@gmail.com>
2021-04-13 08:55:00 +03:00
Kent Yao 16e2faadac [SPARK-34944][SQL][TESTS] Replace bigint with int for web_returns and store_returns in TPCDS tests to employ correct data type
### What changes were proposed in this pull request?

According to  http://www.tpc.org/tpc_documents_current_versions/pdf/tpc-ds_v2.9.0.pdf

```
 2.2.2 Datatype
2.2.2.1 Each column employs one of the following datatypes:
a) Identifier means that the column shall be able to hold any key value generated for that column.
b) Integer means that the column shall be able to exactly represent integer values (i.e., values in increments of
1) in the range of at least ( − 2n − 1) to (2n − 1 − 1), where n is 64.
c) Decimal(d, f) means that the column shall be able to represent decimal values up to and including d digits,
of which f shall occur to the right of the decimal place; the values can be either represented exactly or
interpreted to be in this range.
d) Char(N) means that the column shall be able to hold any string of characters of a fixed length of N.
Comment: If the string that a column of datatype char(N) holds is shorter than N characters, then trailing
spaces shall be stored in the database or the database shall automatically pad with spaces upon retrieval such
that a CHAR_LENGTH() function will return N.
e) Varchar(N) means that the column shall be able to hold any string of characters of a variable length with a
maximum length of N. Columns defined as "varchar(N)" may optionally be implemented as "char(N)".
f) Date means that the column shall be able to express any calendar day between January 1, 1900 and
December 31, 2199.
2.2.2.2 The datatypes do not correspond to any specific SQL-standard datatype. The definitions are provided to
highlight the properties that are required for a particular column. The benchmark implementer may employ any internal representation or SQL datatype that meets those requirements.
```

This PR proposes that we use int for identifiers instead of bigint to reach a compromise with TPC-DS Standard Specification.

After this PR, the field schemas are now consistent with those DDLs in the `tpcds.sql` from tpc-ds tool kit, see https://gist.github.com/yaooqinn/b9978a77bbf4f871a95d6a9103019907

### Why are the changes needed?

reach a compromise with TPC-DS Standard Specification

### Does this PR introduce _any_ user-facing change?

no test only

### How was this patch tested?

test only

Closes #32037 from yaooqinn/SPARK-34944.

Authored-by: Kent Yao <yao@apache.org>
Signed-off-by: Kent Yao <yao@apache.org>
2021-04-13 11:28:35 +08:00
Gengliang Wang 79e55b44f7 [SPARK-35028][SQL] ANSI mode: disallow group by aliases
### What changes were proposed in this pull request?

Disallow group by aliases under ANSI mode.

### Why are the changes needed?

As per the ANSI SQL standard secion 7.12 <group by clause>:

>Each `grouping column reference` shall unambiguously reference a column of the table resulting from the `from clause`. A column referenced in a `group by clause` is a grouping column.

By forbidding it, we can avoid ambiguous SQL queries like:
```
SELECT col + 1 as col FROM t GROUP BY col
```

### Does this PR introduce _any_ user-facing change?

Yes, group by aliases is not allowed under ANSI mode.

### How was this patch tested?

Unit tests

Closes #32129 from gengliangwang/disallowGroupByAlias.

Authored-by: Gengliang Wang <ltnwgl@gmail.com>
Signed-off-by: Gengliang Wang <ltnwgl@gmail.com>
2021-04-13 10:42:57 +08:00
angerszhu 278203d969 [SPARK-28227][SQL] Support projection, aggregate/window functions, and lateral view in the TRANSFORM clause
### What changes were proposed in this pull request?
For Spark SQL, it can't support script transform SQL with aggregationClause/windowClause/LateralView.
This case we can't directly migration Hive SQL to Spark SQL.

In this PR, we treat all script transform statement's query part (exclude transform about part)  as a  separate query block and solve it as ScriptTransformation's child and pass a UnresolvedStart as ScriptTransform's input. Then in analyzer level, we pass child's output as ScriptTransform's input. Then we can support all kind of normal SELECT query combine with script transformation.

Such as transform with aggregation:
```
SELECT TRANSFORM ( d2, max(d1) as max_d1, sum(d3))
USING 'cat' AS (a,b,c)
FROM script_trans
WHERE d1 <= 100
GROUP BY d2
 HAVING max_d1 > 0
```
When we build AST, we treat it as
```
SELECT TRANSFORM (*)
USING 'cat' AS (a,b,c)
FROM (
     SELECT  d2, max(d1) as max_d1, sum(d3)
     FROM script_trans
    WHERE d1 <= 100
    GROUP BY d2
    HAVING max_d1 > 0
) tmp
```
then in Analyzer's `ResolveReferences`, resolve `* (UnresolvedStar)`, then sql behavior like
```
SELECT TRANSFORM ( d2, max(d1) as max_d1, sum(d3))
USING 'cat' AS (a,b,c)
FROM script_trans
WHERE d1 <= 100
GROUP BY d2
HAVING max_d1 > 0
```

About UT, in this pr we add a lot of different SQL to check we can support all kind of such SQL and  each kind of expressions can work well, such as alias, case when, binary compute etc...

### Why are the changes needed?
Support transform with aggregateClause/windowClause/LateralView etc , make sql migration more smoothly

### Does this PR introduce _any_ user-facing change?
User can write transform with  aggregateClause/windowClause/LateralView.

### How was this patch tested?
Added UT

Closes #29087 from AngersZhuuuu/SPARK-28227-NEW.

Lead-authored-by: angerszhu <angers.zhu@gmail.com>
Co-authored-by: Angerszhuuuu <angers.zhu@gmail.com>
Co-authored-by: AngersZhuuuu <angers.zhu@gmail.com>
Signed-off-by: Takeshi Yamamuro <yamamuro@apache.org>
2021-04-13 11:34:45 +09:00
Wenchen Fan 8627cab39d [SPARK-34593][SQL][FOLLOWUP] Fix BroadcastNestedLoopJoinExec.outputPartition with full outer join
### What changes were proposed in this pull request?

This is a follow-up of https://github.com/apache/spark/pull/31708 . For full outer join, the final result RDD is created from
```
sparkContext.union(
  matchedStreamRows,
  sparkContext.makeRDD(notMatchedBroadcastRows)
)
```

It's incorrect to say that the final output partitioning is `UnknownPartitioning(left.outputPartitioning.numPartitions)`

### Why are the changes needed?

Fix a correctness bug

### Does this PR introduce _any_ user-facing change?

Yes, see the added test. Fortunately, this bug is not released yet.

### How was this patch tested?

new test

Closes #32132 from cloud-fan/bug.

Authored-by: Wenchen Fan <wenchen@databricks.com>
Signed-off-by: Liang-Chi Hsieh <viirya@gmail.com>
2021-04-12 17:29:21 -07:00
Yuming Wang e40fce919a [SPARK-34562][SQL] Add test and doc for Parquet Bloom filter push down
### What changes were proposed in this pull request?

This pr add test and document for Parquet Bloom filter push down.

### Why are the changes needed?

Improve document.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Generating docs:
![image](https://user-images.githubusercontent.com/5399861/114327472-c131bb80-9b6b-11eb-87a0-6f9a74eb1097.png)

Closes #32123 from wangyum/SPARK-34562.

Authored-by: Yuming Wang <yumwang@ebay.com>
Signed-off-by: Max Gekk <max.gekk@gmail.com>
2021-04-12 17:07:35 +03:00
Max Gekk 8f8bac6435 [SPARK-34905][SQL][TESTS] Enable ANSI intervals in SQLQueryTestSuite/ThriftServerQueryTestSuite
### What changes were proposed in this pull request?
Remove `spark.sql.legacy.interval.enabled` settings from `SQLQueryTestSuite`/`ThriftServerQueryTestSuite` that enables new ANSI intervals by default.

### Why are the changes needed?
To use default settings for intervals, and test new ANSI intervals - year-month and day-time interval introduced by SPARK-27793.

### Does this PR introduce _any_ user-facing change?
Should not because this affects tests only.

### How was this patch tested?
By running the affected tests, for instance:
```
$ build/sbt "sql/testOnly *SQLQueryTestSuite -- -z datetime.sql"
$ build/sbt "sql/testOnly *SQLQueryTestSuite -- -z date.sql"
$ build/sbt "sql/testOnly *SQLQueryTestSuite -- -z timestamp.sql"
$ build/sbt "sql/testOnly *SQLQueryTestSuite -- -z interval.sql"
```

Closes #32099 from MaxGekk/enable-ansi-intervals-sql-tests.

Authored-by: Max Gekk <max.gekk@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2021-04-12 09:25:51 +00:00
Angerszhuuuu 21232377ba [SPARK-33229][SQL] Support partial grouping analytics and concatenated grouping analytics
### What changes were proposed in this pull request?
Support GROUP BY use Separate columns and CUBE/ROLLUP

In postgres sql, it support
```
select a, b, c, count(1) from t group by a, b, cube (a, b, c);
select a, b, c, count(1) from t group by a, b, rollup(a, b, c);
select a, b, c, count(1) from t group by cube(a, b), rollup (a, b, c);
select a, b, c, count(1) from t group by a, b, grouping sets((a, b), (a), ());
```
In this pr, we have done two things as below:

1. Support partial grouping analytics such as `group by a, cube(a, b)`
2. Support mixed grouping analytics such as `group by cube(a, b), rollup(b,c)`

*Partial Groupings*

    Partial Groupings means there are both `group_expression` and `CUBE|ROLLUP|GROUPING SETS`
    in GROUP BY clause. For example:
    `GROUP BY warehouse, CUBE(product, location)` is equivalent to
    `GROUP BY GROUPING SETS((warehouse, product, location), (warehouse, product), (warehouse, location), (warehouse))`.
    `GROUP BY warehouse, ROLLUP(product, location)` is equivalent to
    `GROUP BY GROUPING SETS((warehouse, product, location), (warehouse, product), (warehouse))`.
    `GROUP BY warehouse, GROUPING SETS((product, location), (producet), ())` is equivalent to
    `GROUP BY GROUPING SETS((warehouse, product, location), (warehouse, location), (warehouse))`.

*Concatenated Groupings*

    Concatenated groupings offer a concise way to generate useful combinations of groupings. Groupings specified
    with concatenated groupings yield the cross-product of groupings from each grouping set. The cross-product
    operation enables even a small number of concatenated groupings to generate a large number of final groups.
    The concatenated groupings are specified simply by listing multiple `GROUPING SETS`, `CUBES`, and `ROLLUP`,
    and separating them with commas. For example:
    `GROUP BY GROUPING SETS((warehouse), (producet)), GROUPING SETS((location), (size))` is equivalent to
    `GROUP BY GROUPING SETS((warehouse, location), (warehouse, size), (product, location), (product, size))`.
    `GROUP BY CUBE((warehouse), (producet)), ROLLUP((location), (size))` is equivalent to
    `GROUP BY GROUPING SETS((warehouse, product), (warehouse), (producet), ()), GROUPING SETS((location, size), (location), ())`
    `GROUP BY GROUPING SETS(
        (warehouse, product, location, size), (warehouse, product, location), (warehouse, product),
        (warehouse, location, size), (warehouse, location), (warehouse),
        (product, location, size), (product, location), (product),
        (location, size), (location), ())`.
    `GROUP BY order, CUBE((warehouse), (producet)), ROLLUP((location), (size))` is equivalent to
    `GROUP BY order, GROUPING SETS((warehouse, product), (warehouse), (producet), ()), GROUPING SETS((location, size), (location), ())`
    `GROUP BY GROUPING SETS(
        (order, warehouse, product, location, size), (order, warehouse, product, location), (order, warehouse, product),
        (order, warehouse, location, size), (order, warehouse, location), (order, warehouse),
        (order, product, location, size), (order, product, location), (order, product),
        (order, location, size), (order, location), (order))`.

### Why are the changes needed?
Support more flexible grouping analytics

### Does this PR introduce _any_ user-facing change?
User can use sql like
```
select a, b, c, agg_expr() from table group by a, cube(b, c)
```

### How was this patch tested?
Added UT

Closes #30144 from AngersZhuuuu/SPARK-33229.

Lead-authored-by: Angerszhuuuu <angers.zhu@gmail.com>
Co-authored-by: angerszhu <angers.zhu@gmail.com>
Co-authored-by: Wenchen Fan <cloud0fan@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2021-04-12 08:23:52 +00:00
Yingyi Bu 3db8ec258c [SPARK-34916][SQL] Add condition lambda and rule id to the transform family for early stopping
### What changes were proposed in this pull request?

This PR contains:
- TreeNode, QueryPlan, AnalysisHelper changes to allow the transform function family to stop earlier without traversing the entire tree;
- Example changes in a few rules to support such pruning, e.g., ReorderJoin and OptimizeIn.

Here is a [design doc](https://docs.google.com/document/d/1SEUhkbo8X-0cYAJFYFDQhxUnKJBz4lLn3u4xR2qfWqk) that elaborates the ideas and benchmark numbers.

### Why are the changes needed?

It's a framework-level change for reducing the query compilation time.
In particular, if we update existing rules and transform call sites as per the examples in this PR, the analysis time and query optimization time can be reduced as described in this [doc](https://docs.google.com/document/d/1SEUhkbo8X-0cYAJFYFDQhxUnKJBz4lLn3u4xR2qfWqk) .

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

It is tested by existing tests.

Closes #32060 from sigmod/bits.

Authored-by: Yingyi Bu <yingyi.bu@databricks.com>
Signed-off-by: Gengliang Wang <ltnwgl@gmail.com>
2021-04-12 11:21:16 +08:00
Max Gekk 90820b3ec3 [SPARK-35017][SQL] Transfer ANSI intervals via Hive Thrift server
### What changes were proposed in this pull request?
1. Map Catalyst's interval types to Hive's types:
    - YearMonthIntervalType -> `interval_year_month`
    - DayTimeIntervalType -> `interval_day_time`
2. Invoke `HiveResult.toHiveString()` to convert external intervals types ` java.time.Period`/`java.time.Duration` to strings.

### Why are the changes needed?
1. To be able to retrieve ANSI intervals via Hive Thrift server.
2. This fixes the issue:
```sql
 $ ./sbin/start-thriftserver.sh
 $ ./bin/beeline
Beeline version 2.3.8 by Apache Hive
beeline> !connect jdbc:hive2://localhost:10000/default "" "" ""
Connecting to jdbc:hive2://localhost:10000/default
Connected to: Spark SQL (version 3.2.0-SNAPSHOT)
0: jdbc:hive2://localhost:10000/default> select timestamp'2021-01-01 01:02:03.000001' - date'2020-12-31';
Error: java.lang.IllegalArgumentException: Unrecognized type name: day-time interval (state=,code=0)
```
3. It should unblock https://github.com/apache/spark/pull/32099 which enables `*.sql` tests in `ThriftServerQueryTestSuite`.

### Does this PR introduce _any_ user-facing change?
Yes. After the changes:
```sql
0: jdbc:hive2://localhost:10000/default> select timestamp'2021-01-01 01:02:03.000001' - date'2020-12-31';
+----------------------------------------------------+
| subtracttimestamps(TIMESTAMP '2021-01-01 01:02:03.000001', DATE '2020-12-31') |
+----------------------------------------------------+
| 1 01:02:03.000001000                               |
+----------------------------------------------------+
1 row selected (1.637 seconds)
```

### How was this patch tested?
By running new test:
```
$ ./build/sbt -Phive -Phive-thriftserver "test:testOnly *SparkThriftServerProtocolVersionsSuite"
$ ./build/sbt -Phive -Phive-thriftserver "test:testOnly *SparkMetadataOperationSuite"
```
Also checked an array of an interval:
```sql
0: jdbc:hive2://localhost:10000/default> select array(timestamp'2021-01-01 01:02:03.000001' - date'2020-12-31');
+----------------------------------------------------+
| array(subtracttimestamps(TIMESTAMP '2021-01-01 01:02:03.000001', DATE '2020-12-31')) |
+----------------------------------------------------+
| [1 01:02:03.000001000]                             |
+----------------------------------------------------+
```

Closes #32121 from MaxGekk/ansi-intervals-thrift-protocol.

Authored-by: Max Gekk <max.gekk@gmail.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2021-04-12 11:56:10 +09:00
Angerszhuuuu 03431d40eb [SPARK-34986][SQL] Make an error msg clearer when ordinal numbers in group-by refer to agg funcs
### What changes were proposed in this pull request?
before when we use aggregate ordinal in group by expression and index position is a aggregate function, it will show error as
```
– !query
select a, b, sum(b) from data group by 3
– !query schema
struct<>
– !query output
org.apache.spark.sql.AnalysisException
aggregate functions are not allowed in GROUP BY, but found sum(data.b)
```

It't not clear enough refactor this error message in this pr

### Why are the changes needed?
refactor  error message

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
Existed UT

Closes #32089 from AngersZhuuuu/SPARK-34986.

Lead-authored-by: Angerszhuuuu <angers.zhu@gmail.com>
Co-authored-by: AngersZhuuuu <angers.zhu@gmail.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2021-04-12 11:45:08 +09:00
Max Gekk 0e761c7307 [SPARK-35016][SQL] Format ANSI intervals in Hive style
### What changes were proposed in this pull request?
1. Extend `IntervalUtils` methods: `toYearMonthIntervalString` and `toDayTimeIntervalString` to support formatting of year-month/day-time intervals in Hive style. The methods get new parameter style which can have to values; `HIVE_STYLE` and `ANSI_STYLE`.
2. Invoke `toYearMonthIntervalString` and `toDayTimeIntervalString` from the `Cast` expression with the `style` parameter is set to `ANSI_STYLE`.
3. Invoke `toYearMonthIntervalString` and `toDayTimeIntervalString` from `HiveResult` with `style` is set to `HIVE_STYLE`.

### Why are the changes needed?
The `spark-sql` shell formats its output in Hive style by using `HiveResult.hiveResultString()`. The changes are needed to match Hive behavior. For instance,

Hive:
```sql
0: jdbc:hive2://localhost:10000/default> select timestamp'2021-01-01 01:02:03.000001' - date'2020-12-31';
+-----------------------+
|          _c0          |
+-----------------------+
| 1 01:02:03.000001000  |
+-----------------------+
```

Spark before the changes:
```sql
spark-sql> select timestamp'2021-01-01 01:02:03.000001' - date'2020-12-31';
INTERVAL '1 01:02:03.000001' DAY TO SECOND
```

Also this should unblock #32099 which enables *.sql tests in `SQLQueryTestSuite`.

### Does this PR introduce _any_ user-facing change?
Yes. After the changes:
```sql
spark-sql> select timestamp'2021-01-01 01:02:03.000001' - date'2020-12-31';
1 01:02:03.000001000
```

### How was this patch tested?
1. Added new tests to `IntervalUtilsSuite`:
```
$  build/sbt "test:testOnly *IntervalUtilsSuite"
```
2. Modified existing tests in `HiveResultSuite`:
```
$  build/sbt -Phive-2.3 -Phive-thriftserver "testOnly *HiveResultSuite"
```
3. By running cast tests:
```
$ build/sbt "testOnly *CastSuite*"
```

Closes #32120 from MaxGekk/ansi-intervals-hive-thrift-server.

Authored-by: Max Gekk <max.gekk@gmail.com>
Signed-off-by: Max Gekk <max.gekk@gmail.com>
2021-04-11 10:13:19 +03:00
Liang-Chi Hsieh 364d1eaf10 [SPARK-34963][SQL] Fix nested column pruning for extracting case-insensitive struct field from array of struct
### What changes were proposed in this pull request?

This patch proposes a fix of nested column pruning for extracting case-insensitive struct field from array of struct.

### Why are the changes needed?

Under case-insensitive mode, nested column pruning rule cannot correctly push down extractor of a struct field of an array of struct, e.g.,

```scala
val query = spark.table("contacts").select("friends.First", "friends.MiDDle")
```

Error stack:
```
[info]   java.lang.IllegalArgumentException: Field "First" does not exist.
[info] Available fields:
[info]   at org.apache.spark.sql.types.StructType$$anonfun$apply$1.apply(StructType.scala:274)
[info]   at org.apache.spark.sql.types.StructType$$anonfun$apply$1.apply(StructType.scala:274)
[info]   at scala.collection.MapLike$class.getOrElse(MapLike.scala:128)
[info]   at scala.collection.AbstractMap.getOrElse(Map.scala:59)
[info]   at org.apache.spark.sql.types.StructType.apply(StructType.scala:273)
[info]   at org.apache.spark.sql.execution.ProjectionOverSchema$$anonfun$getProjection$3.apply(ProjectionOverSchema.scala:44)
[info]   at org.apache.spark.sql.execution.ProjectionOverSchema$$anonfun$getProjection$3.apply(ProjectionOverSchema.scala:41)
```

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

Unit test

Closes #32059 from viirya/fix-array-nested-pruning.

Authored-by: Liang-Chi Hsieh <viirya@gmail.com>
Signed-off-by: Liang-Chi Hsieh <viirya@gmail.com>
2021-04-09 11:52:55 -07:00
Chao Sun ee7bf7d962 [SPARK-35003][SQL] Improve performance for reading smallint in vectorized Parquet reader
### What changes were proposed in this pull request?

Implements `readShorts` in `VectorizedPlainValuesReader`, which decodes `total` shorts in the input buffer at one time, similar to other types.

### Why are the changes needed?

Currently `VectorizedRleValuesReader` reads short integer in the following way:

```java
for (int i = 0; i < n; i++) {
  c.putShort(rowId + i, (short)data.readInteger());
}
```
For PLAIN encoding `data.readInteger` is done via:

```java
public final int readInteger() {
  return getBuffer(4).getInt();
}
```
which means it needs to repeatedly call `slice` buffer for the batch size number of times. This is more expensive than calling it once in a big chunk and then reading the ints out.

Micro benchmark via `DataSourceReadBenchmark` showed ~35% perf improvement.

Before:
```
[info] OpenJDK 64-Bit Server VM 11.0.8+10-LTS on Mac OS X 10.16
[info] Intel(R) Core(TM) i9-9880H CPU  2.30GHz
[info] SQL Single SMALLINT Column Scan:          Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
[info] ------------------------------------------------------------------------------------------------------------------------
[info] SQL CSV                                           10249          10271          32          1.5         651.6       1.0X
[info] SQL Json                                           5963           5982          28          2.6         379.1       1.7X
[info] SQL Parquet Vectorized                              141            151          15        111.9           8.9      72.9X
[info] SQL Parquet MR                                     1454           1491          52         10.8          92.4       7.0X
[info] SQL ORC Vectorized                                  160            164           3         98.3          10.2      64.1X
[info] SQL ORC MR                                         1133           1164          44         13.9          72.0       9.0X
```

After:
```
[info] OpenJDK 64-Bit Server VM 11.0.8+10-LTS on Mac OS X 10.16
[info] Intel(R) Core(TM) i9-9880H CPU  2.30GHz
[info] SQL Single SMALLINT Column Scan:          Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
[info] ------------------------------------------------------------------------------------------------------------------------
[info] SQL CSV                                           10489          10535          65          1.5         666.8       1.0X
[info] SQL Json                                           5864           5888          34          2.7         372.8       1.8X
[info] SQL Parquet Vectorized                              104            111           8        151.0           6.6     100.7X
[info] SQL Parquet MR                                     1458           1472          20         10.8          92.7       7.2X
[info] SQL ORC Vectorized                                  157            166           7        100.0          10.0      66.7X
[info] SQL ORC MR                                         1121           1147          37         14.0          71.2       9.4X
```

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

Existing tests

Closes #32104 from sunchao/smallint.

Authored-by: Chao Sun <sunchao@apple.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2021-04-09 08:12:47 -07:00
Ali Afroozeh 0945baf906 [SPARK-34989] Improve the performance of mapChildren and withNewChildren methods
### What changes were proposed in this pull request?
One of the main performance bottlenecks in query compilation is overly-generic tree transformation methods, namely `mapChildren` and `withNewChildren` (defined in `TreeNode`). These methods have an overly-generic implementation to iterate over the children and rely on reflection to create new instances. We have observed that, especially for queries with large query plans, a significant amount of CPU cycles are wasted in these methods. In this PR we make these methods more efficient, by delegating the iteration and instantiation to concrete node types. The benchmarks show that we can expect significant performance improvement in total query compilation time in queries with large query plans (from 30-80%) and about 20% on average.

#### Problem detail
The `mapChildren` method in `TreeNode` is overly generic and costly. To be more specific, this method:
- iterates over all the fields of a node using Scala’s product iterator. While the iteration is not reflection-based, thanks to the Scala compiler generating code for `Product`, we create many anonymous functions and visit many nested structures (recursive calls).
The anonymous functions (presumably compiled to Java anonymous inner classes) also show up quite high on the list in the object allocation profiles, so we are putting unnecessary pressure on GC here.
- does a lot of comparisons. Basically for each element returned from the product iterator, we check if it is a child (contained in the list of children) and then transform it. We can avoid that by just iterating over children, but in the current implementation, we need to gather all the fields (only transform the children) so that we can instantiate the object using the reflection.
- creates objects using reflection, by delegating to the `makeCopy` method, which is several orders of magnitude slower than using the constructor.

#### Solution
The proposed solution in this PR is rather straightforward: we rewrite the `mapChildren` method using the `children` and `withNewChildren` methods. The default `withNewChildren` method suffers from the same problems as `mapChildren` and we need to make it more efficient by specializing it in concrete classes.  Similar to how each concrete query plan node already defines its children, it should also define how they can be constructed given a new list of children. Actually, the implementation is quite simple in most cases and is a one-liner thanks to the copy method present in Scala case classes. Note that we cannot abstract over the copy method, it’s generated by the compiler for case classes if no other type higher in the hierarchy defines it. For most concrete nodes, the implementation of `withNewChildren` looks like this:
```
override def withNewChildren(newChildren: Seq[LogicalPlan]): LogicalPlan = copy(children = newChildren)
```
The current `withNewChildren` method has two properties that we should preserve:

- It returns the same instance if the provided children are the same as its children, i.e., it preserves referential equality.
- It copies tags and maintains the origin links when a new copy is created.

These properties are hard to enforce in the concrete node type implementation. Therefore, we propose a template method `withNewChildrenInternal` that should be rewritten by the concrete classes and let the `withNewChildren` method take care of referential equality and copying:
```
override def withNewChildren(newChildren: Seq[LogicalPlan]): LogicalPlan = {
 if (childrenFastEquals(children, newChildren)) {
   this
 } else {
   CurrentOrigin.withOrigin(origin) {
     val res = withNewChildrenInternal(newChildren)
     res.copyTagsFrom(this)
     res
   }
 }
}
```

With the refactoring done in a previous PR (https://github.com/apache/spark/pull/31932) most tree node types fall in one of the categories of `Leaf`, `Unary`, `Binary` or `Ternary`. These traits have a more efficient implementation for `mapChildren` and define a more specialized version of `withNewChildrenInternal` that avoids creating unnecessary lists. For example, the `mapChildren` method in `UnaryLike` is defined as follows:
```
  override final def mapChildren(f: T => T): T = {
    val newChild = f(child)
    if (newChild fastEquals child) {
      this.asInstanceOf[T]
    } else {
      CurrentOrigin.withOrigin(origin) {
        val res = withNewChildInternal(newChild)
        res.copyTagsFrom(this.asInstanceOf[T])
        res
      }
    }
  }
```

#### Results
With this PR, we have observed significant performance improvements in query compilation time, more specifically in the analysis and optimization phases. The table below shows the TPC-DS queries that had more than 25% speedup in compilation times. Biggest speedups are observed in queries with large query plans.
| Query  | Speedup |
| ------------- | ------------- |
|q4    |29%|
|q9    |81%|
|q14a  |31%|
|q14b  |28%|
|q22   |33%|
|q33   |29%|
|q34   |25%|
|q39   |27%|
|q41   |27%|
|q44   |26%|
|q47   |28%|
|q48   |76%|
|q49   |46%|
|q56   |26%|
|q58   |43%|
|q59   |46%|
|q60   |50%|
|q65   |59%|
|q66   |46%|
|q67   |52%|
|q69   |31%|
|q70   |30%|
|q96   |26%|
|q98   |32%|

#### Binary incompatibility
Changing the `withNewChildren` in `TreeNode` breaks the binary compatibility of the code compiled against older versions of Spark because now it is expected that concrete `TreeNode` subclasses all implement the `withNewChildrenInternal` method. This is a problem, for example, when users write custom expressions. This change is the right choice, since it forces all newly added expressions to Catalyst implement it in an efficient manner and will prevent future regressions.
Please note that we have not completely removed the old implementation and renamed it to `legacyWithNewChildren`. This method will be removed in the future and for now helps the transition. There are expressions such as `UpdateFields` that have a complex way of defining children. Writing `withNewChildren` for them requires refactoring the expression. For now, these expressions use the old, slow method. In a future PR we address these expressions.

### Does this PR introduce _any_ user-facing change?

This PR does not introduce user facing changes but my break binary compatibility of the code compiled against older versions. See the binary compatibility section.

### How was this patch tested?

This PR is mainly a refactoring and passes existing tests.

Closes #32030 from dbaliafroozeh/ImprovedMapChildren.

Authored-by: Ali Afroozeh <ali.afroozeh@databricks.com>
Signed-off-by: herman <herman@databricks.com>
2021-04-09 15:06:26 +02:00
Gengliang Wang bfba7fadd2 [SPARK-34881][SQL][FOLLOWUP] Implement toString() and sql() methods for TRY_CAST
### What changes were proposed in this pull request?

Implement toString() and sql() methods for TRY_CAST

### Why are the changes needed?

The new expression should have a different name from `CAST` in SQL/String representation.

### Does this PR introduce _any_ user-facing change?

Yes, in the result of `explain()`, users can see try_cast if the new expression is used.

### How was this patch tested?

Unit tests.

Closes #32098 from gengliangwang/tryCastString.

Authored-by: Gengliang Wang <ltnwgl@gmail.com>
Signed-off-by: Gengliang Wang <ltnwgl@gmail.com>
2021-04-09 15:39:25 +08:00
Chao Sun 5013171fd3 [SPARK-34973][SQL] Cleanup unused fields and methods in vectorized Parquet reader
### What changes were proposed in this pull request?

Remove some unused fields and methods in `SpecificParquetRecordReaderBase` and `VectorizedColumnReader`.

### Why are the changes needed?

Some fields and methods in these classes are no longer used since years ago. It's better to clean them up to make the code easier to maintain and read.

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

Existing tests

Closes #32071 from sunchao/cleanup-parquet.

Authored-by: Chao Sun <sunchao@apple.com>
Signed-off-by: Liang-Chi Hsieh <viirya@gmail.com>
2021-04-08 11:21:07 -07:00
Max Gekk 96a3533de8 [SPARK-34984][SQL] ANSI intervals formatting in hive results
### What changes were proposed in this pull request?
Extend `HiveResult.toHiveString()` to support new interval types `YearMonthIntervalType` and `DayTimeIntervalType`.

### Why are the changes needed?
To fix failures while formatting ANSI intervals as Hive strings. For example:
```sql
spark-sql> select timestamp'now' - date'2021-01-01';
21/04/08 09:42:49 ERROR SparkSQLDriver: Failed in [select timestamp'now' - date'2021-01-01']
scala.MatchError: (PT2337H42M46.649S,DayTimeIntervalType) (of class scala.Tuple2)
	at org.apache.spark.sql.execution.HiveResult$.toHiveString(HiveResult.scala:97)
```

### Does this PR introduce _any_ user-facing change?
Yes. After the changes:
```sql
spark-sql> select timestamp'now' - date'2021-01-01';
INTERVAL '97 09:37:52.171' DAY TO SECOND
```

### How was this patch tested?
By running new tests:
```
$ build/sbt -Phive-2.3 -Phive-thriftserver "testOnly *HiveResultSuite"
```

Closes #32087 from MaxGekk/ansi-interval-hiveResultString.

Authored-by: Max Gekk <max.gekk@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2021-04-08 16:18:15 +00:00
Tathagata Das c1c9a318c2 [SPARK-34962][SQL] Explicit representation of * in UpdateAction and InsertAction in MergeIntoTable
### What changes were proposed in this pull request?
Change UpdateAction and InsertAction of MergeIntoTable to explicitly represent star,

### Why are the changes needed?
Currently, UpdateAction and InsertAction in the MergeIntoTable implicitly represent `update set *` and `insert *` with empty assignments. That means there is no way to differentiate between the representations of "update all columns" and "update no columns". For SQL MERGE queries, this inability does not matter because the SQL MERGE grammar that generated the MergeIntoTable plan does not allow "update no columns". However, other ways of generating the MergeIntoTable plan may not have that limitation, and may want to allow specifying "update no columns". For example, in the Delta Lake project we provide a type-safe Scala API for Merge, where it is perfectly valid to produce a Merge query with an update clause but no update assignments. Currently, we cannot use MergeIntoTable to represent this plan, thus complicating the generation, and resolution of merge query from scala API.

Side note: fixed another bug where a merge plan with star and no other expressions with unresolved attributes (e.g. all non-optional predicates are `literal(true)`), then resolution will be skipped and star wont expanded. added test for that.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
Existing unit tests

Closes #32067 from tdas/SPARK-34962-2.

Authored-by: Tathagata Das <tathagata.das1565@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2021-04-08 14:21:31 +00:00
Angerszhuuuu 90613df652 [SPARK-33233][SQL] CUBE/ROLLUP/GROUPING SETS support GROUP BY ordinal
### What changes were proposed in this pull request?
Currently, we can't support use ordinal in CUBE/ROLLUP/GROUPING SETS,
this pr make CUBE/ROLLUP/GROUPING SETS support GROUP BY ordinal

### Why are the changes needed?
Make CUBE/ROLLUP/GROUPING SETS support GROUP BY ordinal.
Postgres SQL and TeraData support this use case.

### Does this PR introduce _any_ user-facing change?
User can use ordinal in CUBE/ROLLUP/GROUPING SETS, such as
```
-- can use ordinal in CUBE
select a, b, count(1) from data group by cube(1, 2);

-- mixed cases: can use ordinal in CUBE
select a, b, count(1) from data group by cube(1, b);

-- can use ordinal with cube
select a, b, count(1) from data group by 1, 2 with cube;

-- can use ordinal in ROLLUP
select a, b, count(1) from data group by rollup(1, 2);

-- mixed cases: can use ordinal in ROLLUP
select a, b, count(1) from data group by rollup(1, b);

-- can use ordinal with rollup
select a, b, count(1) from data group by 1, 2 with rollup;

-- can use ordinal in GROUPING SETS
select a, b, count(1) from data group by grouping sets((1), (2), (1, 2));

-- mixed cases: can use ordinal in GROUPING SETS
select a, b, count(1) from data group by grouping sets((1), (b), (a, 2));

select a, b, count(1) from data group by a, 2 grouping sets((1), (b), (a, 2));

```

### How was this patch tested?
Added UT

Closes #30145 from AngersZhuuuu/SPARK-33233.

Lead-authored-by: Angerszhuuuu <angers.zhu@gmail.com>
Co-authored-by: angerszhu <angers.zhu@gmail.com>
Co-authored-by: AngersZhuuuu <angers.zhu@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2021-04-08 14:19:41 +00:00
allisonwang-db ac01070a77 [SPARK-34946][SQL] Block unsupported correlated scalar subquery in Aggregate
### What changes were proposed in this pull request?
This PR adds two additional checks in `CheckAnalysis` for correlated scalar subquery in Aggregate. It blocks the cases that Spark do not currently support based on the rewrite logic in `RewriteCorrelatedScalarSubquery`:
aff6c0febb/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/subquery.scala (L618-L624)

### Why are the changes needed?
It can be confusing to users when their queries pass the check analysis but cannot be executed. Also, the error messages are confusing:

#### Case 1: correlated scalar subquery in the grouping expressions but not in aggregate expressions

```sql
SELECT SUM(c2) FROM t t1 GROUP BY (SELECT SUM(c2) FROM t t2 WHERE t1.c1 = t2.c1)
```
We get this error:
```
java.lang.AssertionError: assertion failed: Expects 1 field, but got 2; something went wrong in analysis
```
because the correlated scalar subquery is not rewritten properly:
```scala
== Optimized Logical Plan ==
Aggregate [scalar-subquery#5 [(c1#6 = c1#6#93)]], [sum(c2#7) AS sum(c2)#11L]
:  +- Aggregate [c1#6], [sum(c2#7) AS sum(c2)#15L, c1#6 AS c1#6#93]
:     +- LocalRelation [c1#6, c2#7]
+- LocalRelation [c1#6, c2#7]
```

#### Case 2: correlated scalar subquery in the aggregate expressions but not in the grouping expressions

```sql
SELECT (SELECT SUM(c2) FROM t t2 WHERE t1.c1 = t2.c1), SUM(c2) FROM t t1 GROUP BY c1
```
We get this error:
```
java.lang.IllegalStateException: Couldn't find sum(c2)#69L in [c1#60,sum(c2#61)#64L]
```
because the transformed correlated scalar subquery output is not present in the grouping expression of the Aggregate:
```scala
== Optimized Logical Plan ==
Aggregate [c1#60], [sum(c2)#69L AS scalarsubquery(c1)#70L, sum(c2#61) AS sum(c2)#65L]
+- Project [c1#60, c2#61, sum(c2)#69L]
   +- Join LeftOuter, (c1#60 = c1#60#95)
      :- LocalRelation [c1#60, c2#61]
      +- Aggregate [c1#60], [sum(c2#61) AS sum(c2)#69L, c1#60 AS c1#60#95]
         +- LocalRelation [c1#60, c2#61]
```

### Does this PR introduce _any_ user-facing change?
Yes

### How was this patch tested?
New unit tests

Closes #32054 from allisonwang-db/spark-34946-scalar-subquery-agg.

Authored-by: allisonwang-db <66282705+allisonwang-db@users.noreply.github.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2021-04-08 13:03:08 +00:00
Kousuke Saruta e5d972e84e [SPARK-34955][SQL] ADD JAR command cannot add jar files which contains whitespaces in the path
### What changes were proposed in this pull request?

This PR fixes an issue that `ADD JAR` command can't add jar files which contain whitespaces in the path though `ADD FILE` and `ADD ARCHIVE` work with such files.

If we have `/some/path/test file.jar` and execute the following command:

```
ADD JAR "/some/path/test file.jar";
```

The following exception is thrown.

```
21/04/05 10:40:38 ERROR SparkSQLDriver: Failed in [add jar "/some/path/test file.jar"]
java.lang.IllegalArgumentException: Illegal character in path at index 9: /some/path/test file.jar
	at java.net.URI.create(URI.java:852)
	at org.apache.spark.sql.hive.HiveSessionResourceLoader.addJar(HiveSessionStateBuilder.scala:129)
	at org.apache.spark.sql.execution.command.AddJarCommand.run(resources.scala:34)
	at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:70)
	at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:68)
	at org.apache.spark.sql.execution.command.ExecutedCommandExec.executeCollect(commands.scala:79)
```

This is because `HiveSessionStateBuilder` and `SessionStateBuilder` don't check whether the form of the path is URI or plain path and it always regards the path as URI form.
Whitespces should be encoded to `%20` so `/some/path/test file.jar` is rejected.
We can resolve this part by checking whether the given path is URI form or not.

Unfortunatelly, if we fix this part, another problem occurs.
When we execute `ADD JAR` command, Hive's `ADD JAR` command is executed in `HiveClientImpl.addJar` and `AddResourceProcessor.run` is transitively invoked.
In `AddResourceProcessor.run`, the command line is just split by `
s+` and the path is also split into `/some/path/test` and `file.jar` and passed to `ss.add_resources`.
f1e8713703/ql/src/java/org/apache/hadoop/hive/ql/processors/AddResourceProcessor.java (L56-L75)
So, the command still fails.

Even if we convert the form of the path to URI like `file:/some/path/test%20file.jar` and execute the following command:

```
ADD JAR "file:/some/path/test%20file";
```

The following exception is thrown.

```
21/04/05 10:40:53 ERROR SessionState: file:/some/path/test%20file.jar does not exist
java.lang.IllegalArgumentException: file:/some/path/test%20file.jar does not exist
	at org.apache.hadoop.hive.ql.session.SessionState.validateFiles(SessionState.java:1168)
	at org.apache.hadoop.hive.ql.session.SessionState$ResourceType.preHook(SessionState.java:1289)
	at org.apache.hadoop.hive.ql.session.SessionState$ResourceType$1.preHook(SessionState.java:1278)
	at org.apache.hadoop.hive.ql.session.SessionState.add_resources(SessionState.java:1378)
	at org.apache.hadoop.hive.ql.session.SessionState.add_resources(SessionState.java:1336)
	at org.apache.hadoop.hive.ql.processors.AddResourceProcessor.run(AddResourceProcessor.java:74)
```

The reason is `Utilities.realFile` invoked in `SessionState.validateFiles` returns `null` as the result of `fs.exists(path)` is `false`.
f1e8713703/ql/src/java/org/apache/hadoop/hive/ql/exec/Utilities.java (L1052-L1064)

`fs.exists` checks the existence of the given path by comparing the string representation of Hadoop's `Path`.
The string representation of `Path` is similar to URI but it's actually different.
`Path` doesn't encode the given path.
For example, the URI form of `/some/path/jar file.jar` is `file:/some/path/jar%20file.jar` but the `Path` form of it is `file:/some/path/jar file.jar`. So `fs.exists` returns false.

So the solution I come up with is removing Hive's `ADD JAR` from `HiveClientimpl.addJar`.
I think Hive's `ADD JAR` was used to add jar files to the class loader for metadata and isolate the class loader from the one for execution.
https://github.com/apache/spark/pull/6758/files#diff-cdb07de713c84779a5308f65be47964af865e15f00eb9897ccf8a74908d581bbR94-R103

But, as of SPARK-10810 and SPARK-10902 (#8909) are resolved, the class loaders for metadata and execution seem to be isolated with different way.
https://github.com/apache/spark/pull/8909/files#diff-8ef7cabf145d3fe7081da799fa415189d9708892ed76d4d13dd20fa27021d149R635-R641

In the current implementation, such class loaders seem to be isolated by `SharedState.jarClassLoader` and `IsolatedClientLoader.classLoader`.

https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/internal/SessionState.scala#L173-L188
https://github.com/apache/spark/blob/master/sql/hive/src/main/scala/org/apache/spark/sql/hive/client/HiveClientImpl.scala#L956-L967

So I wonder we can remove Hive's `ADD JAR` from `HiveClientImpl.addJar`.
### Why are the changes needed?

This is a bug.

### Does this PR introduce _any_ user-facing change?

### How was this patch tested?

Closes #32052 from sarutak/add-jar-whitespace.

Authored-by: Kousuke Saruta <sarutak@oss.nttdata.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2021-04-07 11:43:03 -07:00
Max Gekk 3dfd456b2c [SPARK-34668][SQL] Support casting of day-time intervals to strings
### What changes were proposed in this pull request?
1. Added new method `toDayTimeIntervalString()` to `IntervalUtils` which converts a day-time interval as a number of microseconds to a string in the form **"INTERVAL '[sign]days hours:minutes:secondsWithFraction' DAY TO SECOND"**.
2. Extended the `Cast` expression to support casting of `DayTimeIntervalType` to `StringType`.

### Why are the changes needed?
To conform the ANSI SQL standard which requires to support such casting.

### Does this PR introduce _any_ user-facing change?
Should not because new day-time interval has not been released yet.

### How was this patch tested?
Added new tests for casting:
```
$ build/sbt "testOnly *CastSuite*"
```

Closes #32070 from MaxGekk/cast-dt-interval-to-string.

Authored-by: Max Gekk <max.gekk@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2021-04-07 13:28:55 +00:00
Angerszhuuuu 5a3f41a017 [SPARK-34976][SQL] Rename GroupingSet to BaseGroupingSets
### What changes were proposed in this pull request?
Current trait `GroupingSet` is ambiguous, since `grouping set` in parser level means one set of a group.
Rename this to `BaseGroupingSets` since cube/rollup is syntax sugar for grouping sets.`

### Why are the changes needed?
Refactor class name

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
Not need

Closes #32073 from AngersZhuuuu/SPARK-34976.

Authored-by: Angerszhuuuu <angers.zhu@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2021-04-07 13:27:21 +00:00
Gengliang Wang f208d80881 [SPARK-34970][SQL][SERCURITY] Redact map-type options in the output of explain()
### What changes were proposed in this pull request?

The `explain()` method prints the arguments of tree nodes in logical/physical plans. The arguments could contain a map-type option that contains sensitive data.
We should map-type options in the output of `explain()`. Otherwise, we will see sensitive data in explain output or Spark UI.
![image](https://user-images.githubusercontent.com/1097932/113719178-326ffb00-96a2-11eb-8a2c-28fca3e72941.png)

### Why are the changes needed?

Data security.

### Does this PR introduce _any_ user-facing change?

Yes, redact the map-type options in the output of `explain()`

### How was this patch tested?

Unit tests

Closes #32066 from gengliangwang/redactOptions.

Authored-by: Gengliang Wang <ltnwgl@gmail.com>
Signed-off-by: Gengliang Wang <ltnwgl@gmail.com>
2021-04-07 18:19:01 +08:00
Ryan Blue 3c7d6c38e8 [SPARK-27658][SQL] Add FunctionCatalog API
## What changes were proposed in this pull request?

This adds a new API for catalog plugins that exposes functions to Spark. The API can list and load functions. This does not include create, delete, or alter operations.
- [Design Document](https://docs.google.com/document/d/1PLBieHIlxZjmoUB0ERF-VozCRJ0xw2j3qKvUNWpWA2U/edit?usp=sharing)

There are 3 types of functions defined:
* A `ScalarFunction` that produces a value for every call
* An `AggregateFunction` that produces a value after updates for a group of rows

Functions loaded from the catalog by name as `UnboundFunction`. Once input arguments are determined `bind` is called on the unbound function to get a `BoundFunction` implementation that is one of the 3 types above. Binding can fail if the function doesn't support the input type. `BoundFunction` returns the result type produced by the function.

## How was this patch tested?

This includes a test that demonstrates the new API.

Closes #24559 from rdblue/SPARK-27658-add-function-catalog-api.

Authored-by: Ryan Blue <blue@apache.org>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2021-04-07 09:19:20 +00:00
Ali Afroozeh 06c09a79b3 [SPARK-34969][SPARK-34906][SQL] Followup for Refactor TreeNode's children handling methods into specialized traits
### What changes were proposed in this pull request?

This is a followup for https://github.com/apache/spark/pull/31932.
In this PR we:
- Introduce the `QuaternaryLike` trait for node types with 4 children.
- Specialize more node types
- Fix a number of style errors that were introduced in the original PR.

### Why are the changes needed?

### Does this PR introduce _any_ user-facing change?

### How was this patch tested?

This is a refactoring, passes existing tests.

Closes #32065 from dbaliafroozeh/FollowupSPARK-34906.

Authored-by: Ali Afroozeh <ali.afroozeh@databricks.com>
Signed-off-by: herman <herman@databricks.com>
2021-04-07 09:50:30 +02:00
allisonwang-db 0aa2c284e4 [SPARK-34678][SQL] Add table function registry
### What changes were proposed in this pull request?
This PR extends the current function registry and catalog to support table-valued functions by adding a table function registry. It also refactors `range` to be a built-in function in the table function registry.

### Why are the changes needed?
Currently, Spark resolves table-valued functions very differently from the other functions. This change is to make the behavior for table and non-table functions consistent. It also allows Spark to display information about built-in table-valued functions:
Before:
```scala
scala> sql("describe function range").show(false)
+--------------------------+
|function_desc             |
+--------------------------+
|Function: range not found.|
+--------------------------+
```
After:
```scala
Function: range
Class: org.apache.spark.sql.catalyst.plans.logical.Range
Usage:
  range(start: Long, end: Long, step: Long, numPartitions: Int)
  range(start: Long, end: Long, step: Long)
  range(start: Long, end: Long)
  range(end: Long)

// Extended
Function: range
Class: org.apache.spark.sql.catalyst.plans.logical.Range
Usage:
  range(start: Long, end: Long, step: Long, numPartitions: Int)
  range(start: Long, end: Long, step: Long)
  range(start: Long, end: Long)
  range(end: Long)

Extended Usage:
  Examples:
    > SELECT * FROM range(1);
      +---+
      | id|
      +---+
      |  0|
      +---+
    > SELECT * FROM range(0, 2);
      +---+
      |id |
      +---+
      |0  |
      |1  |
      +---+
    > SELECT range(0, 4, 2);
      +---+
      |id |
      +---+
      |0  |
      |2  |
      +---+

    Since: 2.0.0
```

### Does this PR introduce _any_ user-facing change?
Yes. User will not be able to create a function with name `range` in the default database:
Before:
```scala
scala> sql("create function range as 'range'")
res3: org.apache.spark.sql.DataFrame = []
```
After:
```
scala> sql("create function range as 'range'")
org.apache.spark.sql.catalyst.analysis.FunctionAlreadyExistsException: Function 'default.range' already exists in database 'default'
```

### How was this patch tested?
Unit test

Closes #31791 from allisonwang-db/spark-34678-table-func-registry.

Authored-by: allisonwang-db <66282705+allisonwang-db@users.noreply.github.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2021-04-07 05:49:36 +00:00
Tanel Kiis 7c8dc5e0b5 [SPARK-34922][SQL] Use a relative cost comparison function in the CBO
### What changes were proposed in this pull request?

Changed the cost comparison function of the CBO to use the ratios of row counts and sizes in bytes.

### Why are the changes needed?

In #30965 we changed to CBO cost comparison function so it would be "symetric": `A.betterThan(B)` now implies, that `!B.betterThan(A)`.
With that we caused a performance regressions in some queries - TPCDS q19 for example.

The original cost comparison function used the ratios `relativeRows = A.rowCount / B.rowCount` and `relativeSize = A.size / B.size`. The changed function compared "absolute" cost values `costA = w*A.rowCount + (1-w)*A.size` and `costB = w*B.rowCount + (1-w)*B.size`.

Given the input from wzhfy we decided to go back to the relative values, because otherwise one (size) may overwhelm the other (rowCount). But this time we avoid adding up the ratios.

Originally `A.betterThan(B) => w*relativeRows + (1-w)*relativeSize < 1` was used. Besides being "non-symteric", this also can exhibit one overwhelming other.
For `w=0.5` If `A` size (bytes) is at least 2x larger than `B`, then no matter how many times more rows does the `B` plan have, `B` will allways be considered to be better - `0.5*2 + 0.5*0.00000000000001 > 1`.

When working with ratios, then it would be better to multiply them.
The proposed cost comparison function is: `A.betterThan(B) => relativeRows^w  * relativeSize^(1-w) < 1`.

### Does this PR introduce _any_ user-facing change?

Comparison of the changed TPCDS v1.4 query execution times at sf=10:

  | absolute | multiplicative |   | additive |  
-- | -- | -- | -- | -- | --
q12 | 145 | 137 | -5.52% | 141 | -2.76%
q13 | 264 | 271 | 2.65% | 271 | 2.65%
q17 | 4521 | 4243 | -6.15% | 4348 | -3.83%
q18 | 758 | 466 | -38.52% | 480 | -36.68%
q19 | 38503 | 2167 | -94.37% | 2176 | -94.35%
q20 | 119 | 120 | 0.84% | 126 | 5.88%
q24a | 16429 | 16838 | 2.49% | 17103 | 4.10%
q24b | 16592 | 16999 | 2.45% | 17268 | 4.07%
q25 | 3558 | 3556 | -0.06% | 3675 | 3.29%
q33 | 362 | 361 | -0.28% | 380 | 4.97%
q52 | 1020 | 1032 | 1.18% | 1052 | 3.14%
q55 | 927 | 938 | 1.19% | 961 | 3.67%
q72 | 24169 | 13377 | -44.65% | 24306 | 0.57%
q81 | 1285 | 1185 | -7.78% | 1168 | -9.11%
q91 | 324 | 336 | 3.70% | 337 | 4.01%
q98 | 126 | 129 | 2.38% | 131 | 3.97%

All times are in ms, the change is compared to the situation in the master branch (absolute).
The proposed cost function (multiplicative) significantlly improves the performance on q18, q19 and q72. The original cost function (additive) has similar improvements at q18 and q19. All other chagnes are within the error bars and I would ignore them - perhaps q81 has also improved.

### How was this patch tested?

PlanStabilitySuite

Closes #32014 from tanelk/SPARK-34922_cbo_better_cost_function.

Lead-authored-by: Tanel Kiis <tanel.kiis@gmail.com>
Co-authored-by: tanel.kiis@gmail.com <tanel.kiis@gmail.com>
Signed-off-by: Takeshi Yamamuro <yamamuro@apache.org>
2021-04-07 11:31:10 +09:00
Max Gekk 4b5fc1da75 [SPARK-34667][SQL] Support casting of year-month intervals to strings
### What changes were proposed in this pull request?
1. Added new method `toYearMonthIntervalString()` to `IntervalUtils` which converts an year-month interval as a number of month to a string in the form **"INTERVAL '[sign]yearField-monthField' YEAR TO MONTH"**.
2. Extended the `Cast` expression to support casting of `YearMonthIntervalType` to `StringType`.

### Why are the changes needed?
To conform the ANSI SQL standard which requires to support such casting.

### Does this PR introduce _any_ user-facing change?
Should not because new year-month interval has not been released yet.

### How was this patch tested?
Added new tests for casting:
```
$ build/sbt "testOnly *CastSuite*"
```

Closes #32056 from MaxGekk/cast-ym-interval-to-string.

Authored-by: Max Gekk <max.gekk@gmail.com>
Signed-off-by: Max Gekk <max.gekk@gmail.com>
2021-04-06 17:59:50 +03:00
Wenchen Fan 19c7d2f3d8 Revert "[SPARK-34884][SQL] Improve DPP evaluation to make filtering side must can broadcast by size or broadcast by hint"
This reverts commit de66fa63f9.
2021-04-06 22:58:41 +08:00
Karen Feng 3b634f66c3 [SPARK-34923][SQL] Metadata output should be empty for more plans
### What changes were proposed in this pull request?

Changes the metadata propagation framework.

Previously, most `LogicalPlan`'s propagated their `children`'s `metadataOutput`. This did not make sense in cases where the `LogicalPlan` did not even propagate their `children`'s `output`.

I set the metadata output for plans that do not propagate their `children`'s `output` to be `Nil`. Notably, `Project` and `View` no longer have metadata output.

### Why are the changes needed?

Previously, `SELECT m from (SELECT a from tb)` would output `m` if it were metadata. This did not make sense.

### Does this PR introduce _any_ user-facing change?

Yes. Now, `SELECT m from (SELECT a from tb)` will encounter an `AnalysisException`.

### How was this patch tested?

Added unit tests. I did not cover all cases, as they are fairly extensive. However, the new tests cover major cases (and an existing test already covers Join).

Closes #32017 from karenfeng/spark-34923.

Authored-by: Karen Feng <karen.feng@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2021-04-06 16:04:30 +08:00
Kent Yao 7cffacef18 [SPARK-34935][SQL] CREATE TABLE LIKE should respect the reserved table properties
### What changes were proposed in this pull request?

CREATE TABLE LIKE should respect the reserved properties of tables and fail if specified, using `spark.sql.legacy.notReserveProperties` to restore.

### Why are the changes needed?

Make DDLs consistently treat reserved properties

### Does this PR introduce _any_ user-facing change?

YES, this is a breaking change as using `create table like` w/ reserved properties will fail.

### How was this patch tested?

new test

Closes #32025 from yaooqinn/SPARK-34935.

Authored-by: Kent Yao <yao@apache.org>
Signed-off-by: Takeshi Yamamuro <yamamuro@apache.org>
2021-04-06 08:52:48 +09:00
Wenchen Fan 39d5677ee3 [SPARK-34932][SQL] deprecate GROUP BY ... GROUPING SETS (...) and promote GROUP BY GROUPING SETS (...)
### What changes were proposed in this pull request?

GROUP BY ... GROUPING SETS (...) is a weird SQL syntax we copied from Hive. It's not in the SQL standard or any other mainstream databases. This syntax requires users to repeat the expressions inside `GROUPING SETS (...)` after `GROUP BY`, and has a weird null semantic if `GROUP BY` contains extra expressions than `GROUPING SETS (...)`.

This PR deprecates this syntax:
1. Do not promote it in the document and only mention it as a Hive compatible sytax.
2. Simplify the code to only keep it for Hive compatibility.

### Why are the changes needed?

Deprecate a weird grammar.

### Does this PR introduce _any_ user-facing change?

No breaking change, but it removes a check to simplify the code: `GROUP BY a GROUPING SETS(a, b)` fails before and forces users to also put `b` after `GROUP BY`. Now this works just as `GROUP BY GROUPING SETS(a, b)`.

### How was this patch tested?

existing tests

Closes #32022 from cloud-fan/followup.

Authored-by: Wenchen Fan <wenchen@databricks.com>
Signed-off-by: Takeshi Yamamuro <yamamuro@apache.org>
2021-04-06 08:49:08 +09:00
Dongjoon Hyun 748f05fca9 [SPARK-34954][SQL] Use zstd codec name in ORC file names
### What changes were proposed in this pull request?

This PR aims to add `zstd` codec names in the Spark generated ORC file names for consistency.

### Why are the changes needed?

Like the other ORC supported codecs, we had better have `zstd` in the Spark generated ORC file names. Please note that there is no problem at reading/writing ORC zstd files currently. This PR only aims to revise the file name format for consistency.

**SNAPPY**
```
scala> spark.range(10).repartition(1).write.option("compression", "snappy").orc("/tmp/snappy")

$ ls -al /tmp/snappy
total 24
drwxr-xr-x   6 dongjoon  wheel  192 Apr  4 12:17 .
drwxrwxrwt  14 root      wheel  448 Apr  4 12:17 ..
-rw-r--r--   1 dongjoon  wheel    8 Apr  4 12:17 ._SUCCESS.crc
-rw-r--r--   1 dongjoon  wheel   12 Apr  4 12:17 .part-00000-833bb7ad-d1e1-48cc-9719-07b2d594aa4c-c000.snappy.orc.crc
-rw-r--r--   1 dongjoon  wheel    0 Apr  4 12:17 _SUCCESS
-rw-r--r--   1 dongjoon  wheel  231 Apr  4 12:17 part-00000-833bb7ad-d1e1-48cc-9719-07b2d594aa4c-c000.snappy.orc
```

**ZSTD (AS-IS)**
```
scala> spark.range(10).repartition(1).write.option("compression", "zstd").orc("/tmp/zstd")

$ ls -al /tmp/zstd
total 24
drwxr-xr-x   6 dongjoon  wheel  192 Apr  4 12:17 .
drwxrwxrwt  14 root      wheel  448 Apr  4 12:17 ..
-rw-r--r--   1 dongjoon  wheel    8 Apr  4 12:17 ._SUCCESS.crc
-rw-r--r--   1 dongjoon  wheel   12 Apr  4 12:17 .part-00000-2f403ce9-7314-4db5-bca3-b1c1dd83335f-c000.orc.crc
-rw-r--r--   1 dongjoon  wheel    0 Apr  4 12:17 _SUCCESS
-rw-r--r--   1 dongjoon  wheel  231 Apr  4 12:17 part-00000-2f403ce9-7314-4db5-bca3-b1c1dd83335f-c000.orc
```

**ZSTD (After this PR)**
```
scala> spark.range(10).repartition(1).write.option("compression", "zstd").orc("/tmp/zstd_new")

$ ls -al /tmp/zstd_new
total 24
drwxr-xr-x   6 dongjoon  wheel  192 Apr  4 12:28 .
drwxrwxrwt  15 root      wheel  480 Apr  4 12:28 ..
-rw-r--r--   1 dongjoon  wheel    8 Apr  4 12:28 ._SUCCESS.crc
-rw-r--r--   1 dongjoon  wheel   12 Apr  4 12:28 .part-00000-49d57329-7196-4caf-839c-4251c876e26b-c000.zstd.orc.crc
-rw-r--r--   1 dongjoon  wheel    0 Apr  4 12:28 _SUCCESS
-rw-r--r--   1 dongjoon  wheel  231 Apr  4 12:28 part-00000-49d57329-7196-4caf-839c-4251c876e26b-c000.zstd.orc
```

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Pass the CIs with the updated UT.

Closes #32051 from dongjoon-hyun/SPARK-34954.

Authored-by: Dongjoon Hyun <dhyun@apple.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2021-04-04 17:11:56 -07:00
HyukjinKwon ebf01ec3c1 [SPARK-34950][TESTS] Update benchmark results to the ones created by GitHub Actions machines
### What changes were proposed in this pull request?

https://github.com/apache/spark/pull/32015 added a way to run benchmarks much more easily in the same GitHub Actions build. This PR updates the benchmark results by using the way.

**NOTE** that looks like GitHub Actions use four types of CPU given my observations:

- Intel(R) Xeon(R) Platinum 8171M CPU  2.60GHz
- Intel(R) Xeon(R) CPU E5-2673 v4  2.30GHz
- Intel(R) Xeon(R) CPU E5-2673 v3  2.40GHz
- Intel(R) Xeon(R) Platinum 8272CL CPU  2.60GHz

Given my quick research, seems like they perform roughly similarly:

![Screen Shot 2021-04-03 at 9 31 23 PM](https://user-images.githubusercontent.com/6477701/113478478-f4b57b80-94c3-11eb-9047-f81ca8c59672.png)

I couldn't find enough information about Intel(R) Xeon(R) Platinum 8272CL CPU  2.60GHz but the performance seems roughly similar given the numbers.

So shouldn't be a big deal especially given that this way is much easier, encourages contributors to run more and guarantee the same number of cores and same memory with the same softwares.

### Why are the changes needed?

To have a base line of the benchmarks accordingly.

### Does this PR introduce _any_ user-facing change?

No, dev-only.

### How was this patch tested?

It was generated from:

- [Run benchmarks: * (JDK 11)](https://github.com/HyukjinKwon/spark/actions/runs/713575465)
- [Run benchmarks: * (JDK 8)](https://github.com/HyukjinKwon/spark/actions/runs/713154337)

Closes #32044 from HyukjinKwon/SPARK-34950.

Authored-by: HyukjinKwon <gurwls223@apache.org>
Signed-off-by: Max Gekk <max.gekk@gmail.com>
2021-04-03 23:02:56 +03:00
HyukjinKwon 71effba5f2 [SPARK-34821][INFRA] Set up a workflow for developers to run benchmark in their fork
### What changes were proposed in this pull request?

This PR proposes to add a workflow that allows developers to run benchmarks and download the results files.  After this PR, developers can run benchmarks in GitHub Actions in their fork.

### Why are the changes needed?

1. Very easy to use.
2. We can use the (almost) same environment to run the benchmarks. Given my few experiments and observation, the CPU, cores, and memory are same.
3. Does not burden ASF's resource at GitHub Actions.

### Does this PR introduce _any_ user-facing change?

No, dev-only.

### How was this patch tested?

Manually tested in https://github.com/HyukjinKwon/spark/pull/31.

Entire benchmarks are being run as below:
- [Run benchmarks: * (JDK 11)](https://github.com/HyukjinKwon/spark/actions/runs/713575465)
- [Run benchmarks: * (JDK 8)](https://github.com/HyukjinKwon/spark/actions/runs/713154337)

### How do developers use it in their fork?

1. **Go to Actions in your fork, and click "Run benchmarks"**

    ![Screen Shot 2021-03-31 at 10 15 13 PM](https://user-images.githubusercontent.com/6477701/113150018-99d71680-926e-11eb-8647-4ecf062c55f2.png)

2. **Run the benchmarks with JDK 8 or 11 with benchmark classes to run. Glob pattern is supported just like `testOnly` in SBT**

    ![Screen Shot 2021-04-02 at 8 35 02 PM](https://user-images.githubusercontent.com/6477701/113412599-ab95f680-93f3-11eb-9a15-c6ed54587b9d.png)

3. **After finishing the jobs, the benchmark results are available on the top in the underlying workflow:**

    ![Screen Shot 2021-03-31 at 10 17 21 PM](https://user-images.githubusercontent.com/6477701/113150332-ede1fb00-926e-11eb-9c0e-97d195070508.png)

4. **After downloading it, unzip and untar at Spark git root directory:**

    ```bash
    cd .../spark
    mv ~/Downloads/benchmark-results-8.zip .
    unzip benchmark-results-8.zip
    tar -xvf benchmark-results-8.tar
    ```

5. **Check the results:**

    ```bash
    git status
    ```

    ```
    ...
        modified:   core/benchmarks/MapStatusesSerDeserBenchmark-results.txt
    ```

Closes #32015 from HyukjinKwon/SPARK-34821-pr.

Authored-by: HyukjinKwon <gurwls223@apache.org>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2021-04-03 20:55:54 +09:00
Chao Sun f1d42bb68d [SPARK-34945][DOC] Fix Javadoc for classes in catalyst module
### What changes were proposed in this pull request?

Use proper Java doc format for Java classes within `catalyst` module

### Why are the changes needed?

Many Java classes in `catalyst`, especially those for DataSource V2, do not have proper Java doc format. By fixing the format it helps to improve the doc's readability.

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

N/A

Closes #32038 from sunchao/javadoc.

Authored-by: Chao Sun <sunchao@apple.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2021-04-02 23:00:19 -07:00
Angerszhuuuu 65da9287bc [SPARK-34926][SQL] PartitioningUtils.getPathFragment() should respect partition value is null
### What changes were proposed in this pull request?

When we insert data into a partition table partition with empty DataFrame. We will call `PartitioningUtils.getPathFragment()`
then to update this partition's metadata too.
When we insert to a partition when partition value is `null`, it will throw exception like
```
[info]   java.lang.NullPointerException:
[info]   at scala.collection.immutable.StringOps$.length$extension(StringOps.scala:51)
[info]   at scala.collection.immutable.StringOps.length(StringOps.scala:51)
[info]   at scala.collection.IndexedSeqOptimized.foreach(IndexedSeqOptimized.scala:35)
[info]   at scala.collection.IndexedSeqOptimized.foreach$(IndexedSeqOptimized.scala:33)
[info]   at scala.collection.immutable.StringOps.foreach(StringOps.scala:33)
[info]   at org.apache.spark.sql.catalyst.catalog.ExternalCatalogUtils$.escapePathName(ExternalCatalogUtils.scala:69)
[info]   at org.apache.spark.sql.catalyst.catalog.ExternalCatalogUtils$.getPartitionValueString(ExternalCatalogUtils.scala:126)
[info]   at org.apache.spark.sql.execution.datasources.PartitioningUtils$.$anonfun$getPathFragment$1(PartitioningUtils.scala:354)
[info]   at scala.collection.TraversableLike.$anonfun$map$1(TraversableLike.scala:238)
[info]   at scala.collection.Iterator.foreach(Iterator.scala:941)
[info]   at scala.collection.Iterator.foreach$(Iterator.scala:941)
[info]   at scala.collection.AbstractIterator.foreach(Iterator.scala:1429)
[info]   at scala.collection.IterableLike.foreach(IterableLike.scala:74)
[info]   at scala.collection.IterableLike.foreach$(IterableLike.scala:73)
```
`PartitioningUtils.getPathFragment()`  should support `null` value too

### Why are the changes needed?
Fix bug

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
Added UT

Closes #32018 from AngersZhuuuu/SPARK-34926.

Authored-by: Angerszhuuuu <angers.zhu@gmail.com>
Signed-off-by: Max Gekk <max.gekk@gmail.com>
2021-04-02 10:26:14 +03:00
Cheng Su 280a2f359c [SPARK-34940][SQL][TEST] Fix test of BasicWriteTaskStatsTrackerSuite
### What changes were proposed in this pull request?

This is to fix the minor typo in unit test of BasicWriteTaskStatsTrackerSuite (https://github.com/apache/spark/blob/master/sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/BasicWriteTaskStatsTrackerSuite.scala#L152 ), where it should be a new file name, e.g. `f-3-3`, because the unit test expects 3 files in statistics (https://github.com/apache/spark/blob/master/sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/BasicWriteTaskStatsTrackerSuite.scala#L160 ).

### Why are the changes needed?

Fix minor bug.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Changed unit test `"Three files, last one empty"` itself.

Closes #32034 from c21/tracker-fix.

Authored-by: Cheng Su <chengsu@fb.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2021-04-02 15:51:22 +09:00
Max Gekk 1d084513b9 [SPARK-34938][SQL][TESTS] Benchmark only legacy interval in ExtractBenchmark
### What changes were proposed in this pull request?
In the PR, I propose to disable ANSI intervals as the result of dates/timestamp subtraction in `ExtractBenchmark` and benchmark only legacy intervals because `EXTRACT( .. FROM ..)` doesn't support ANSI intervals so far.

### Why are the changes needed?
This fixes the benchmark failure:
```
[info]   Running case: YEAR of interval
[error] Exception in thread "main" org.apache.spark.sql.AnalysisException: cannot resolve 'year((subtractdates(CAST(timestamp_seconds(id) AS DATE), DATE '0001-01-01') + subtracttimestamps(timestamp_seconds(id), TIMESTAMP '1000-01-01 01:02:03.123456')))' due to data type mismatch: argument 1 requires date type, however, '(subtractdates(CAST(timestamp_seconds(id) AS DATE), DATE '0001-01-01') + subtracttimestamps(timestamp_seconds(id), TIMESTAMP '1000-01-01 01:02:03.123456'))' is of day-time interval type.; line 1 pos 0;
[error] 'Project [extract(YEAR, (subtractdates(cast(timestamp_seconds(id#1456L) as date), 0001-01-01, false) + subtracttimestamps(timestamp_seconds(id#1456L), 1000-01-01 01:02:03.123456, false, Some(Europe/Moscow)))) AS YEAR#1458]
[error] +- Range (1262304000, 1272304000, step=1, splits=Some(1))
[error] 	at org.apache.spark.sql.catalyst.analysis.package$AnalysisErrorAt.failAnalysis(package.scala:42)
[error] 	at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$$nestedInanonfun$checkAnalysis$1$2.applyOrElse(CheckAnalysis.scala:194)
```

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
By running the `ExtractBenchmark` benchmark via:
```
$ build/sbt "sql/test:runMain org.apache.spark.sql.execution.benchmark.ExtractBenchmark"
```

Closes #32035 from MaxGekk/fix-ExtractBenchmark.

Authored-by: Max Gekk <max.gekk@gmail.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2021-04-02 15:45:32 +09:00
yi.wu f897cc2374 [SPARK-34354][SQL] Fix failure when apply CostBasedJoinReorder on self-join
### What changes were proposed in this pull request?

This PR introduces a new analysis rule `DeduplicateRelations`, which deduplicates any duplicate relations in a plan first and then deduplicates conflicting attributes(which resued the `dedupRight` of `ResolveReferences`).

### Why are the changes needed?

`CostBasedJoinReorder` could fail when applying on self-join, e.g.,

```scala
// test in JoinReorderSuite
test("join reorder with self-join") {
  val plan = t2.join(t1, Inner, Some(nameToAttr("t1.k-1-2") === nameToAttr("t2.k-1-5")))
      .select(nameToAttr("t1.v-1-10"))
      .join(t2, Inner, Some(nameToAttr("t1.v-1-10") === nameToAttr("t2.k-1-5")))

    // this can fail
    Optimize.execute(plan.analyze)
}
```
Besides, with the new rule `DeduplicateRelations`, we'd be able to enable some optimizations, e.g., LeftSemiAnti pushdown, redundant project removal, as reflects in updated unit tests.

### Does this PR introduce _any_ user-facing change?

### How was this patch tested?

Added and updated unit tests.

Closes #32027 from Ngone51/join-reorder-3.

Lead-authored-by: yi.wu <yi.wu@databricks.com>
Co-authored-by: Wenchen Fan <wenchen@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2021-04-02 06:22:57 +00:00
Cheng Su 1fc66f6870 [SPARK-34862][SQL] Support nested column in ORC vectorized reader
### What changes were proposed in this pull request?

This PR is to support nested column type in Spark ORC vectorized reader. Currently ORC vectorized reader [does not support nested column type (struct, array and map)](https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/orc/OrcFileFormat.scala#L138). We implemented nested column vectorized reader for FB-ORC in our internal fork of Spark. We are seeing performance improvement compared to non-vectorized reader when reading nested columns. In addition, this can also help improve the non-nested column performance when reading non-nested and nested columns together in one query.

Before this PR:

* `OrcColumnVector` is the implementation class for Spark's `ColumnVector` to wrap Hive's/ORC's `ColumnVector` to read `AtomicType` data.

After this PR:

* `OrcColumnVector` is an abstract class to keep interface being shared between multiple implementation class of orc column vectors, namely `OrcAtomicColumnVector` (for `AtomicType`), `OrcArrayColumnVector` (for `ArrayType`), `OrcMapColumnVector` (for `MapType`), `OrcStructColumnVector` (for `StructType`). So the original logic to read `AtomicType` data is moved from `OrcColumnVector` to `OrcAtomicColumnVector`. The abstract class of `OrcColumnVector` is needed here because of supporting nested column (i.e. nested column vectors).
* A utility method `OrcColumnVectorUtils.toOrcColumnVector` is added to create Spark's `OrcColumnVector` from Hive's/ORC's `ColumnVector`.
* A new user-facing config `spark.sql.orc.enableNestedColumnVectorizedReader` is added to control enabling/disabling vectorized reader for nested columns. The default value is false (i.e. disabling by default). For certain tables having deep nested columns, vectorized reader might take too much memory for each sub-column vectors, compared to non-vectorized reader. So providing a config here to work around OOM for query reading wide and deep nested columns if any. We plan to enable it by default on 3.3. Leave it disable in 3.2 in case for any unknown bugs.

### Why are the changes needed?

Improve query performance when reading nested columns from ORC file format.
Tested with locally adding a small benchmark in `OrcReadBenchmark.scala`. Seeing more than 1x run time improvement.

```
Running benchmark: SQL Nested Column Scan
  Running case: Native ORC MR
  Stopped after 2 iterations, 37850 ms
  Running case: Native ORC Vectorized (Enabled Nested Column)
  Stopped after 2 iterations, 15892 ms
  Running case: Native ORC Vectorized (Disabled Nested Column)
  Stopped after 2 iterations, 37954 ms
  Running case: Hive built-in ORC
  Stopped after 2 iterations, 35118 ms

Java HotSpot(TM) 64-Bit Server VM 1.8.0_181-b13 on Mac OS X 10.15.7
Intel(R) Core(TM) i9-9980HK CPU  2.40GHz
SQL Nested Column Scan:                         Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
------------------------------------------------------------------------------------------------------------------------------
Native ORC MR                                           18706          18925         310          0.1       17839.6       1.0X
Native ORC Vectorized (Enabled Nested Column)            7625           7946         455          0.1        7271.6       2.5X
Native ORC Vectorized (Disabled Nested Column)          18415          18977         796          0.1       17561.5       1.0X
Hive built-in ORC                                       17469          17559         127          0.1       16660.1       1.1X
```

Benchmark:

```
nestedColumnScanBenchmark(1024 * 1024)
def nestedColumnScanBenchmark(values: Int): Unit = {
    val benchmark = new Benchmark(s"SQL Nested Column Scan", values, output = output)

    withTempPath { dir =>
      withTempTable("t1", "nativeOrcTable", "hiveOrcTable") {
        import spark.implicits._
        spark.range(values).map(_ => Random.nextLong).map { x =>
          val arrayOfStructColumn = (0 until 5).map(i => (x + i, s"$x" * 5))
          val mapOfStructColumn = Map(
            s"$x" -> (x * 0.1, (x, s"$x" * 100)),
            (s"$x" * 2) -> (x * 0.2, (x, s"$x" * 200)),
            (s"$x" * 3) -> (x * 0.3, (x, s"$x" * 300)))
          (arrayOfStructColumn, mapOfStructColumn)
        }.toDF("col1", "col2")
          .createOrReplaceTempView("t1")

        prepareTable(dir, spark.sql(s"SELECT * FROM t1"))

        benchmark.addCase("Native ORC MR") { _ =>
          withSQLConf(SQLConf.ORC_VECTORIZED_READER_ENABLED.key -> "false") {
            spark.sql("SELECT SUM(SIZE(col1)), SUM(SIZE(col2)) FROM nativeOrcTable").noop()
          }
        }

        benchmark.addCase("Native ORC Vectorized (Enabled Nested Column)") { _ =>
          spark.sql("SELECT SUM(SIZE(col1)), SUM(SIZE(col2)) FROM nativeOrcTable").noop()
        }

        benchmark.addCase("Native ORC Vectorized (Disabled Nested Column)") { _ =>
          withSQLConf(SQLConf.ORC_VECTORIZED_READER_NESTED_COLUMN_ENABLED.key -> "false") {
            spark.sql("SELECT SUM(SIZE(col1)), SUM(SIZE(col2)) FROM nativeOrcTable").noop()
          }
        }

        benchmark.addCase("Hive built-in ORC") { _ =>
          spark.sql("SELECT SUM(SIZE(col1)), SUM(SIZE(col2)) FROM hiveOrcTable").noop()
        }

        benchmark.run()
      }
    }
  }
```

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Added one simple test in `OrcSourceSuite.scala` to verify correctness.
Definitely need more unit tests and add benchmark here, but I want to first collect feedback before crafting more tests.

Closes #31958 from c21/orc-vector.

Authored-by: Cheng Su <chengsu@fb.com>
Signed-off-by: Liang-Chi Hsieh <viirya@gmail.com>
2021-04-01 23:10:34 -07:00
Kent Yao 1b553da2a1 [SPARK-34908][SQL][TESTS] Add test cases for char and varchar with functions
### What changes were proposed in this pull request?

Using char and varchar with the string functions and some other expressions might be confusing and ambiguous. In this PR we add test cases for char and varchar with these operations to reveal these behavior and see if we can come up with a general pattern for them.

### Why are the changes needed?

test coverage

### Does this PR introduce _any_ user-facing change?

no

### How was this patch tested?

new tests

Closes #32010 from yaooqinn/SPARK-34908.

Authored-by: Kent Yao <yao@apache.org>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2021-04-01 16:33:30 +09:00
Max Gekk 5911faa0d4 [SPARK-34903][SQL] Return day-time interval from timestamps subtraction
### What changes were proposed in this pull request?
Modify the `SubtractTimestamps` expression to return values of `DayTimeIntervalType` when `spark.sql.legacy.interval.enabled` is set to `false` (which is the default).

### Why are the changes needed?
To conform to the ANSI SQL standard which requires ANSI intervals as the result of timestamps subtraction, see
<img width="656" alt="Screenshot 2021-03-29 at 19 09 34" src="https://user-images.githubusercontent.com/1580697/112866455-7e2f0d00-90c2-11eb-96e6-3feb7eea7e09.png">

### Does this PR introduce _any_ user-facing change?
Yes.

### How was this patch tested?
By running new tests:
```
$ build/sbt "test:testOnly *DateTimeUtilsSuite"
$ build/sbt "test:testOnly *DateExpressionsSuite"
$ build/sbt "test:testOnly *ColumnExpressionSuite"
```
and some tests from `SQLQueryTestSuite`:
```
$ build/sbt "sql/testOnly *SQLQueryTestSuite -- -z timestamp.sql"
$ build/sbt "sql/testOnly *SQLQueryTestSuite -- -z datetime.sql"
$ build/sbt "sql/testOnly *SQLQueryTestSuite -- -z interval.sql"
```

Closes #32016 from MaxGekk/subtract-timestamps-to-intervals.

Authored-by: Max Gekk <max.gekk@gmail.com>
Signed-off-by: Max Gekk <max.gekk@gmail.com>
2021-04-01 10:27:58 +03:00
ulysses-you 89ae83d19b [SPARK-34919][SQL] Change partitioning to SinglePartition if partition number is 1
### What changes were proposed in this pull request?

Change partitioning to `SinglePartition`.

### Why are the changes needed?

For node `Repartition` and `RepartitionByExpression`, if partition number is 1 we can use `SinglePartition` instead of other `Partitioning`.

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

Add test

Closes #32012 from ulysses-you/SPARK-34919.

Authored-by: ulysses-you <ulyssesyou18@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2021-04-01 06:59:31 +00:00
Hyukjin Kwon 8a2138d09f [SPARK-34881][SQL][FOLLOW-UP] Use multiline string for TryCast' expression description
### What changes were proposed in this pull request?

This PR fixes JDK 11 compilation failed:

```
/home/runner/work/spark/spark/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/TryCast.scala:35: error: annotation argument needs to be a constant; found: "_FUNC_(expr AS type) - Casts the value `expr` to the target data type `type`. ".+("This expression is identical to CAST with configuration `spark.sql.ansi.enabled` as ").+("true, except it returns NULL instead of raising an error. Note that the behavior of this ").+("expression doesn\'t depend on configuration `spark.sql.ansi.enabled`.")
    "true, except it returns NULL instead of raising an error. Note that the behavior of this " +
```

For whatever reason, it doesn't know that the string is actually a constant. This PR simply switches it to multi-line style (which is actually more correct).

Reference:

bd0990e3e8/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/aggregate/ApproximatePercentile.scala (L53-L57)

### Why are the changes needed?

To recover the build.

### Does this PR introduce _any_ user-facing change?

No, dev-only.

### How was this patch tested?

 CI in this PR

Closes #32019 from HyukjinKwon/SPARK-34881.

Lead-authored-by: Hyukjin Kwon <gurwls223@apache.org>
Co-authored-by: HyukjinKwon <gurwls223@apache.org>
Signed-off-by: Gengliang Wang <ltnwgl@gmail.com>
2021-04-01 14:50:05 +08:00