Commit graph

11223 commits

Author SHA1 Message Date
Gengliang Wang abfd9b23cd [SPARK-34769][SQL] AnsiTypeCoercion: return closest convertible type among TypeCollection
### What changes were proposed in this pull request?

Currently, when implicitly casting a data type to a `TypeCollection`, Spark returns the first convertible data type in the `TypeCollection`.
In ANSI mode, we can make the behavior more reasonable by returning the closest convertible data type in the `TypeCollection`.

In detail, we first find all the expected types to which the input can be implicitly cast:
1. if there are no convertible data types, return None;
2. if there is only one convertible data type, cast the input to it;
3. otherwise, if there are multiple convertible data types, find the closest data
type among them. If there is no such closest data type, return None.

Note that if the closest type is Float and the convertible types contain Double, simply return Double as the closest type, to avoid potential
precision loss when converting an integral type to Float.
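
For illustration, here is a minimal, self-contained Scala sketch of the selection logic above. It is not the real `AnsiTypeCoercion` code; the type names, the precedence list, and the helper are simplified assumptions.
```scala
object ClosestTypeSketch {
  sealed trait DType
  case object ByteT extends DType
  case object ShortT extends DType
  case object IntT extends DType
  case object LongT extends DType
  case object FloatT extends DType
  case object DoubleT extends DType

  // Simplified widening precedence: a type can be cast to anything on its right.
  private val precedence: Seq[DType] = Seq(ByteT, ShortT, IntT, LongT, FloatT, DoubleT)

  /** Return the closest convertible type for `in` among `candidates`, if any. */
  def closestConvertible(in: DType, candidates: Seq[DType]): Option[DType] = {
    val wider = precedence.dropWhile(_ != in)            // types `in` can widen to, closest first
    val convertible = wider.filter(candidates.contains)  // keep only the expected candidates
    convertible.headOption.map {
      // Special case: prefer Double over Float to avoid precision loss on integral inputs.
      case FloatT if convertible.contains(DoubleT) => DoubleT
      case other                                   => other
    }
  }

  def main(args: Array[String]): Unit = {
    println(closestConvertible(IntT, Seq(DoubleT, LongT)))  // Some(LongT): Long is closer than Double
    println(closestConvertible(IntT, Seq(FloatT, DoubleT))) // Some(DoubleT): avoid Float precision loss
  }
}
```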

### Why are the changes needed?

Make the type coercion rule for TypeCollection more reasonable and ANSI compatible.
E.g. returning Long instead of Double for `implicitCast(int, TypeCollection(Double, Long))`.

From ANSI SQL Spec section 4.33 "SQL-invoked routines"
![Screen Shot 2021-03-17 at 4 05 06 PM](https://user-images.githubusercontent.com/1097932/111434916-5e104e80-86bd-11eb-8b3b-33090a68067d.png)

Section 9.6 "Subject routine determination"
![Screen Shot 2021-03-17 at 1 36 55 PM](https://user-images.githubusercontent.com/1097932/111420336-48445e80-86a8-11eb-9d50-34b325043bdb.png)

Section 10.4 "routine invocation"
![Screen Shot 2021-03-17 at 4 08 41 PM](https://user-images.githubusercontent.com/1097932/111434926-610b3f00-86bd-11eb-8c32-8c7935e055da.png)

### Does this PR introduce _any_ user-facing change?

Yes, in ANSI mode, implicit casting to a `TypeCollection` returns the closest convertible data type instead of the first convertible one.

### How was this patch tested?

Unit tests.

Closes #31859 from gengliangwang/implicitCastTypeCollection.

Lead-authored-by: Gengliang Wang <gengliang.wang@databricks.com>
Co-authored-by: Gengliang Wang <ltnwgl@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2021-03-24 15:04:03 +00:00
Tanel Kiis 84df54b495 [SPARK-34822][SQL] Update the plan stability golden files even if only the explain.txt changes
### What changes were proposed in this pull request?

Update the plan stability golden files even if only the `explain.txt` changes.

### Why are the changes needed?

Currently only a change in `simplified.txt` is checked. There are some PRs that update `explain.txt` without changing `simplified.txt`, so their golden files are not regenerated.

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

The updated golden files.

Closes #31927 from tanelk/SPARK-34822_update_plan_stability.

Lead-authored-by: Tanel Kiis <tanel.kiis@gmail.com>
Co-authored-by: tanel.kiis@gmail.com <tanel.kiis@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2021-03-24 14:36:51 +00:00
Cheng Su 35c70e417d [SPARK-34853][SQL] Remove duplicated definition of output partitioning/ordering for limit operator
### What changes were proposed in this pull request?

Both local limit and global limit define the output partitioning and output ordering in the same way, and this is duplicated (https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/limit.scala#L159-L175). We can move the output partitioning and ordering into their parent trait, `BaseLimitExec`. This is doable as `BaseLimitExec` has no other child classes. This is a minor code refactoring.
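
A sketch of what moving the definitions into the parent trait looks like (not the exact diff; the `limit` member and the imports are assumed from the standard Spark 3.x operator APIs):
```scala
import org.apache.spark.sql.catalyst.expressions.SortOrder
import org.apache.spark.sql.catalyst.plans.physical.Partitioning
import org.apache.spark.sql.execution.UnaryExecNode

// Limit does not change the distribution or ordering of its child's output, so both
// LocalLimitExec and GlobalLimitExec can inherit these overrides from the shared trait.
trait BaseLimitExec extends UnaryExecNode {
  def limit: Int
  override def outputPartitioning: Partitioning = child.outputPartitioning
  override def outputOrdering: Seq[SortOrder] = child.outputOrdering
}
```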

### Why are the changes needed?

Clean up the code a little bit. Better readability.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Pure refactoring. Rely on existing unit tests.

Closes #31950 from c21/limit-cleanup.

Authored-by: Cheng Su <chengsu@fb.com>
Signed-off-by: Takeshi Yamamuro <yamamuro@apache.org>
2021-03-24 23:06:35 +09:00
yangjie01 712a62ca82 [SPARK-34832][SQL][TEST] Set EXECUTOR_ALLOW_SPARK_CONTEXT to true to ensure ExternalAppendOnlyUnsafeRowArrayBenchmark run successfully
### What changes were proposed in this pull request?
SPARK-32160 added a config (`EXECUTOR_ALLOW_SPARK_CONTEXT`) that controls whether a `SparkContext` may be created in executors; the default value of the config is `false`.

`ExternalAppendOnlyUnsafeRowArrayBenchmark` fails when `EXECUTOR_ALLOW_SPARK_CONTEXT` uses the default value, because the `ExternalAppendOnlyUnsafeRowArrayBenchmark#withFakeTaskContext` method tries to create a `SparkContext` manually on the executor side.

So the main change of this PR is to set `EXECUTOR_ALLOW_SPARK_CONTEXT` to `true` to ensure `ExternalAppendOnlyUnsafeRowArrayBenchmark` runs successfully.
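
As a hedged sketch of the idea, the string key below is assumed to be the one backing `EXECUTOR_ALLOW_SPARK_CONTEXT` (`spark.executor.allowSparkContext`); the benchmark sets it before building the `SparkContext` from the fake executor-side context:
```scala
import org.apache.spark.{SparkConf, SparkContext}

// Allow the benchmark to create a SparkContext even though it fakes an executor-side
// TaskContext; the default has been false since SPARK-32160.
val conf = new SparkConf(false)
  .setMaster("local")
  .setAppName("ExternalAppendOnlyUnsafeRowArrayBenchmark")
  .set("spark.executor.allowSparkContext", "true")
val sc = new SparkContext(conf)
```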

### Why are the changes needed?
Bug fix.

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
Manual test:
```
bin/spark-submit --class org.apache.spark.sql.execution.ExternalAppendOnlyUnsafeRowArrayBenchmark --jars spark-core_2.12-3.2.0-SNAPSHOT-tests.jar spark-sql_2.12-3.2.0-SNAPSHOT-tests.jar
```

**Before**
```
Exception in thread "main" java.lang.IllegalStateException: SparkContext should only be created and accessed on the driver.
	at org.apache.spark.SparkContext$.org$apache$spark$SparkContext$$assertOnDriver(SparkContext.scala:2679)
	at org.apache.spark.SparkContext.<init>(SparkContext.scala:89)
	at org.apache.spark.SparkContext.<init>(SparkContext.scala:137)
	at org.apache.spark.sql.execution.ExternalAppendOnlyUnsafeRowArrayBenchmark$.withFakeTaskContext(ExternalAppendOnlyUnsafeRowArrayBenchmark.scala:52)
	at org.apache.spark.sql.execution.ExternalAppendOnlyUnsafeRowArrayBenchmark$.testAgainstRawArrayBuffer(ExternalAppendOnlyUnsafeRowArrayBenchmark.scala:119)
	at org.apache.spark.sql.execution.ExternalAppendOnlyUnsafeRowArrayBenchmark$.$anonfun$runBenchmarkSuite$1(ExternalAppendOnlyUnsafeRowArrayBenchmark.scala:189)
	at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23)
	at org.apache.spark.benchmark.BenchmarkBase.runBenchmark(BenchmarkBase.scala:40)
	at org.apache.spark.sql.execution.ExternalAppendOnlyUnsafeRowArrayBenchmark$.runBenchmarkSuite(ExternalAppendOnlyUnsafeRowArrayBenchmark.scala:186)
	at org.apache.spark.benchmark.BenchmarkBase.main(BenchmarkBase.scala:58)
	at org.apache.spark.sql.execution.ExternalAppendOnlyUnsafeRowArrayBenchmark.main(ExternalAppendOnlyUnsafeRowArrayBenchmark.scala)
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:498)
	at org.apache.spark.deploy.JavaMainApplication.start(SparkApplication.scala:52)
	at org.apache.spark.deploy.SparkSubmit.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:951)
	at org.apache.spark.deploy.SparkSubmit.doRunMain$1(SparkSubmit.scala:180)
	at org.apache.spark.deploy.SparkSubmit.submit(SparkSubmit.scala:203)
	at org.apache.spark.deploy.SparkSubmit.doSubmit(SparkSubmit.scala:90)
	at org.apache.spark.deploy.SparkSubmit$$anon$2.doSubmit(SparkSubmit.scala:1030)
	at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:1039)
	at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
```

**After**

`ExternalAppendOnlyUnsafeRowArrayBenchmark` run successfully.

Closes #31939 from LuciferYang/SPARK-34832.

Authored-by: yangjie01 <yangjie01@baidu.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2021-03-24 14:59:31 +09:00
Kousuke Saruta f7e9b6efc7 [SPARK-34763][SQL] col(), $"<name>" and df("name") should handle quoted column names properly
### What changes were proposed in this pull request?

This PR fixes an issue where `col()`, `$"<name>"` and `df("name")` don't handle quoted column names like ``` `a``b.c` ``` properly.

For example, if we have a following DataFrame.
```
val df1 = spark.sql("SELECT 'col1' AS `a``b.c`")
```

For the DataFrame, this query is successfully executed.
```
scala> df1.selectExpr("`a``b.c`").show
+-----+
|a`b.c|
+-----+
| col1|
+-----+
```

But the following query will fail because ``` df1("`a``b.c`") ``` throws an exception.
```
scala> df1.select(df1("`a``b.c`")).show
org.apache.spark.sql.AnalysisException: syntax error in attribute name: `a``b.c`;
  at org.apache.spark.sql.catalyst.analysis.UnresolvedAttribute$.e$1(unresolved.scala:152)
  at org.apache.spark.sql.catalyst.analysis.UnresolvedAttribute$.parseAttributeName(unresolved.scala:162)
  at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.resolveQuoted(LogicalPlan.scala:121)
  at org.apache.spark.sql.Dataset.resolve(Dataset.scala:221)
  at org.apache.spark.sql.Dataset.col(Dataset.scala:1274)
  at org.apache.spark.sql.Dataset.apply(Dataset.scala:1241)
  ... 49 elided
```
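
After this fix, the same lookup via `df1("...")` is expected to resolve the quoted name just like `selectExpr` does above (expected behavior for illustration, not an excerpt from the PR):
```
scala> df1.select(df1("`a``b.c`")).show
+-----+
|a`b.c|
+-----+
| col1|
+-----+
```
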
### Why are the changes needed?

It's a bug.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

New tests.

Closes #31854 from sarutak/fix-parseAttributeName.

Authored-by: Kousuke Saruta <sarutak@oss.nttdata.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2021-03-24 13:34:10 +08:00
Takeshi Yamamuro 0494dc90af [SPARK-34842][SQL][TESTS] Corrects the type of date_dim.d_quarter_name in the TPCDS schema
### What changes were proposed in this pull request?

SPARK-34083 (#31012) introduced a typo in the type of `date_dim.d_quarter_name` in the TPCDS schema (`TPCDSBase`). This PR replaces `CHAR(1)` with `CHAR(6)`. The fix comes from p. 28 of [the TPCDS official doc](http://www.tpc.org/tpc_documents_current_versions/pdf/tpc-ds_v2.9.0.pdf).

### Why are the changes needed?

Bugfix.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

N/A

Closes #31943 from maropu/SPARK-34083-FOLLOWUP.

Authored-by: Takeshi Yamamuro <yamamuro@apache.org>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2021-03-23 10:22:13 -07:00
Max Gekk 760556a42f [SPARK-34824][SQL] Support multiply an year-month interval by a numeric
### What changes were proposed in this pull request?
1. Add new expression `MultiplyYMInterval` which multiplies a `YearMonthIntervalType` expression by a `NumericType` expression including ByteType, ShortType, IntegerType, LongType, FloatType, DoubleType, DecimalType.
2. Extend binary arithmetic rules to support `numeric * year-month interval` and `year-month interval * numeric`.
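
For example, after this change the following is expected to work (illustrative usage under the ANSI year-month interval literal syntax, not taken from the PR's tests):
```scala
// Multiply an ANSI year-month interval by a numeric value.
// Expected result (assumption based on the description): 1 year 2 months * 3 = 3 years 6 months.
spark.sql("SELECT INTERVAL '1-2' YEAR TO MONTH * 3").show(truncate = false)
```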

### Why are the changes needed?
To conform to the ANSI SQL standard, which requires such an operation over year-month intervals:
<img width="667" alt="Screenshot 2021-03-22 at 16 33 16" src="https://user-images.githubusercontent.com/1580697/111997810-77d1eb80-8b2c-11eb-951d-e43911d9c5db.png">

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
By running new tests:
```
$ build/sbt "test:testOnly *IntervalExpressionsSuite"
$ build/sbt "test:testOnly *ColumnExpressionSuite"
```

Closes #31929 from MaxGekk/interval-mul-div.

Authored-by: Max Gekk <max.gekk@gmail.com>
Signed-off-by: Max Gekk <max.gekk@gmail.com>
2021-03-23 19:40:15 +03:00
Wenchen Fan 3b70829b5b [SPARK-34719][SQL] Correctly resolve the view query with duplicated column names
forward-port https://github.com/apache/spark/pull/31811 to master

### What changes were proposed in this pull request?

For permanent views (and the new SQL temp view in Spark 3.1), we store the view SQL text and re-parse/analyze the view SQL text when reading the view. In the case of `SELECT * FROM ...`, we want to avoid view schema change (e.g. the referenced table changes its schema) and will record the view query output column names when creating the view, so that when reading the view we can add a `SELECT recorded_column_names FROM ...` to retain the original view query schema.

In Spark 3.1 and before, the final SELECT is added after the analysis phase: https://github.com/apache/spark/blob/branch-3.1/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/view.scala#L67

If the view query has duplicated output column names, we always pick the first column when reading a view. A simple repro:
```
scala> sql("create view c(x, y) as select 1 a, 2 a")
res0: org.apache.spark.sql.DataFrame = []

scala> sql("select * from c").show
+---+---+
|  x|  y|
+---+---+
|  1|  1|
+---+---+
```

In the master branch, we will fail at view reading time due to b891862fb6, which adds the final SELECT during analysis, so the query fails with `Reference 'a' is ambiguous`.

This PR proposes to resolve the view query output column names from the matching attributes by ordinal.

For example, for `create view c(x, y) as select 1 a, 2 a`, the view query output column names are `[a, a]`. When we read the view, there are 2 matching attributes (e.g. `[a#1, a#2]`) and we can simply match them by ordinal.

A negative example is
```
create table t(a int)
create view v as select *, 1 as col from t
replace table t(a int, col int)
```
When reading the view, the view query output column names are `[a, col]`, but there are two matching attributes for `col`, so we should fail the query. See the tests for details.

### Why are the changes needed?

bug fix

### Does this PR introduce _any_ user-facing change?

yes

### How was this patch tested?

new test

Closes #31930 from cloud-fan/view2.

Authored-by: Wenchen Fan <wenchen@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2021-03-23 14:34:51 +00:00
Liang-Chi Hsieh 115ed89a3c [SPARK-34366][SQL] Add interface for DS v2 metrics
### What changes were proposed in this pull request?

This patch proposes to add a few public API changes to DS v2, so that DS v2 scans can report metrics to Spark.

Two public interfaces are added.

* `CustomMetric`: metric interface at the driver side. It basically defines how Spark aggregates task metrics with the same metric name.
* `CustomTaskMetric`: task metric reported at executors. It includes a name and long value. Spark will collect these metric values and update internal metrics.

There are two public methods added to existing public interfaces. They are optional to DS v2 implementations.

* `PartitionReader.currentMetricsValues()`: returns an array of `CustomTaskMetric`. This is where the actual metric values are collected. Empty array by default.
* `Scan.supportedCustomMetrics()`: returns an array of supported custom metrics `CustomMetric`. Empty array by default.
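
A rough Scala sketch of how an implementation might plug into these interfaces. The traits below are simplified stand-ins, not the actual Spark classes; the aggregation method's exact signature and the example metric are assumptions based on the description above.
```scala
// Simplified stand-ins mirroring the described interfaces.
trait CustomTaskMetric {            // reported from executors
  def name: String
  def value: Long
}
trait CustomMetric {                // driver side: defines how task values are aggregated
  def name: String
  def aggregateTaskMetrics(taskValues: Array[Long]): String
}

// Example: a scan that reports how many rows each task skipped.
final case class SkippedRowsTaskMetric(value: Long) extends CustomTaskMetric {
  override def name: String = "skippedRows"
}
object SkippedRowsMetric extends CustomMetric {
  override def name: String = "skippedRows"
  override def aggregateTaskMetrics(taskValues: Array[Long]): String =
    s"total skipped rows: ${taskValues.sum}"
}
// A PartitionReader would return Array(SkippedRowsTaskMetric(n)) from
// currentMetricsValues(), and a Scan would return Array(SkippedRowsMetric)
// from supportedCustomMetrics().
```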

### Why are the changes needed?

In order to report custom metrics, we need some public API change in DS v2 to make it possible.

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

This only adds interfaces. Tests will be added in the follow-up PRs that add the implementation. See #31451 and #31398 for some details and manual tests there.

Closes #31476 from viirya/SPARK-34366.

Authored-by: Liang-Chi Hsieh <viirya@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2021-03-23 13:22:37 +00:00
Peter Toth 93a5d34f84 [SPARK-33482][SPARK-34756][SQL] Fix FileScan equality check
### What changes were proposed in this pull request?

This bug was introduced by SPARK-30428 at Apache Spark 3.0.0.
This PR fixes `FileScan.equals()`.

### Why are the changes needed?
- Without this fix, `FileScan.equals` doesn't take `fileIndex` and `readSchema` into account.
- Partition filters and data filters added to `FileScan` (in #27112 and #27157) caused the canonicalized forms of some `BatchScanExec` nodes not to match, which prevents some reuse opportunities.
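
A self-contained sketch of the idea behind the fix, with simplified stand-in types (not the actual `FileScan` code): equality used for canonicalization and reuse must cover `fileIndex` and `readSchema` as well as both filter sets.
```scala
object FileScanEqualitySketch {
  // Stand-ins for the real FileIndex/StructType/Expression types.
  final case class Scan(
      fileIndex: String,
      readSchema: Seq[String],
      partitionFilters: Set[String],
      dataFilters: Set[String])

  // Two scans may only be considered the same (and thus reused) when all four match.
  def sameScan(a: Scan, b: Scan): Boolean =
    a.fileIndex == b.fileIndex &&
      a.readSchema == b.readSchema &&
      a.partitionFilters == b.partitionFilters &&
      a.dataFilters == b.dataFilters

  def main(args: Array[String]): Unit = {
    val s1 = Scan("/data/t1", Seq("a"), Set("p = 1"), Set("a > 0"))
    val s2 = Scan("/data/t2", Seq("a"), Set("p = 1"), Set("a > 0"))
    println(sameScan(s1, s2)) // false: a different fileIndex must prevent reuse
  }
}
```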

### Does this PR introduce _any_ user-facing change?
Yes. Before this fix, incorrect reuse of `FileScan`, and thus `BatchScanExec`, could have happened, causing correctness issues.

### How was this patch tested?
Added new UTs.

Closes #31848 from peter-toth/SPARK-34756-fix-filescan-equality-check.

Authored-by: Peter Toth <peter.toth@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2021-03-23 17:01:16 +08:00
yi.wu e00afd31a7 [SPARK-34087][FOLLOW-UP][SQL] Manage ExecutionListenerBus register inside itself
### What changes were proposed in this pull request?

Move the `ExecutionListenerBus` registration (both the `ListenerBus` and the `ContextCleaner` registration) into `ExecutionListenerBus` itself.

Also includes a minor change that puts `registerSparkListenerForCleanup` in a better place.

### Why are the changes needed?

improve code

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Pass existing tests.

Closes #31919 from Ngone51/SPARK-34087-followup.

Authored-by: yi.wu <yi.wu@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2021-03-23 07:38:43 +00:00
linzebing e768eaa908 [SPARK-34707][SQL] Code-gen broadcast nested loop join (left outer/right outer)
### What changes were proposed in this pull request?

This PR is to add code-gen support for left outer (build right) and right outer (build left). Reference: `BroadcastNestedLoopJoinExec.codegenInner()` and `BroadcastNestedLoopJoinExec.outerJoin()`

### Why are the changes needed?

Improve query CPU performance.
Tested with a simple query:
```scala
val N = 20 << 20
val M = 1 << 4

val dim = broadcast(spark.range(M).selectExpr("id as k2"))
codegenBenchmark("left outer broadcast nested loop join", N) {
   val df = spark.range(N).selectExpr(s"id as k1").join(
     dim, col("k1") + 1 <= col("k2"), "left_outer")
   assert(df.queryExecution.sparkPlan.find(
     _.isInstanceOf[BroadcastNestedLoopJoinExec]).isDefined)
   df.noop()
}
```
Seeing 2x run time improvement:
```
Java HotSpot(TM) 64-Bit Server VM 1.8.0_181-b13 on Mac OS X 10.15.7
Intel(R) Core(TM) i9-9980HK CPU  2.40GHz
left outer broadcast nested loop join:                Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
------------------------------------------------------------------------------------------------------------------------------------
left outer broadcast nested loop join wholestage off           3024           3698         953          6.9         144.2       1.0X
left outer broadcast nested loop join wholestage on            1512           1659         172         13.9          72.1       2.0X
```

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

Changed existing unit tests in `OuterJoinSuite` to cover codegen use cases.
Added unit test in WholeStageCodegenSuite.scala to make sure code-gen for broadcast nested loop join is taking effect, and test for multiple join case as well.

Example query:
```scala
val df1 = spark.range(4).select($"id".as("k1"))
val df2 = spark.range(3).select($"id".as("k2"))
df1.join(df2, $"k1" + 1 <= $"k2", "left_outer").explain("codegen")
```
Example generated code (`bnlj_doConsume_0` method):
```java
== Subtree 2 / 2 (maxMethodCodeSize:282; maxConstantPoolSize:210(0.32% used); numInnerClasses:0) ==
*(2) BroadcastNestedLoopJoin BuildRight, LeftOuter, ((k1#2L + 1) <= k2#6L)
:- *(2) Project [id#0L AS k1#2L]
:  +- *(2) Range (0, 4, step=1, splits=16)
+- BroadcastExchange IdentityBroadcastMode, [id=#22]
   +- *(1) Project [id#4L AS k2#6L]
      +- *(1) Range (0, 3, step=1, splits=16)

Generated code:
/* 001 */ public Object generate(Object[] references) {
/* 002 */   return new GeneratedIteratorForCodegenStage2(references);
/* 003 */ }
/* 004 */
/* 005 */ // codegenStageId=2
/* 006 */ final class GeneratedIteratorForCodegenStage2 extends org.apache.spark.sql.execution.BufferedRowIterator {
/* 007 */   private Object[] references;
/* 008 */   private scala.collection.Iterator[] inputs;
/* 009 */   private boolean range_initRange_0;
/* 010 */   private long range_nextIndex_0;
/* 011 */   private TaskContext range_taskContext_0;
/* 012 */   private InputMetrics range_inputMetrics_0;
/* 013 */   private long range_batchEnd_0;
/* 014 */   private long range_numElementsTodo_0;
/* 015 */   private InternalRow[] bnlj_buildRowArray_0;
/* 016 */   private org.apache.spark.sql.catalyst.expressions.codegen.UnsafeRowWriter[] range_mutableStateArray_0 = new org.apache.spark.sql.catalyst.expressions.codegen.UnsafeRowWriter[4];
/* 017 */
/* 018 */   public GeneratedIteratorForCodegenStage2(Object[] references) {
/* 019 */     this.references = references;
/* 020 */   }
/* 021 */
/* 022 */   public void init(int index, scala.collection.Iterator[] inputs) {
/* 023 */     partitionIndex = index;
/* 024 */     this.inputs = inputs;
/* 025 */
/* 026 */     range_taskContext_0 = TaskContext.get();
/* 027 */     range_inputMetrics_0 = range_taskContext_0.taskMetrics().inputMetrics();
/* 028 */     range_mutableStateArray_0[0] = new org.apache.spark.sql.catalyst.expressions.codegen.UnsafeRowWriter(1, 0);
/* 029 */     range_mutableStateArray_0[1] = new org.apache.spark.sql.catalyst.expressions.codegen.UnsafeRowWriter(1, 0);
/* 030 */     range_mutableStateArray_0[2] = new org.apache.spark.sql.catalyst.expressions.codegen.UnsafeRowWriter(1, 0);
/* 031 */     bnlj_buildRowArray_0 = (InternalRow[]) ((org.apache.spark.broadcast.TorrentBroadcast) references[1] /* broadcastTerm */).value();
/* 032 */     range_mutableStateArray_0[3] = new org.apache.spark.sql.catalyst.expressions.codegen.UnsafeRowWriter(2, 0);
/* 033 */
/* 034 */   }
/* 035 */
/* 036 */   private void bnlj_doConsume_0(long bnlj_expr_0_0) throws java.io.IOException {
/* 037 */     boolean bnlj_foundMatch_0 = false;
/* 038 */     for (int bnlj_arrayIndex_0 = 0; bnlj_arrayIndex_0 < bnlj_buildRowArray_0.length; bnlj_arrayIndex_0++) {
/* 039 */       UnsafeRow bnlj_buildRow_0 = (UnsafeRow) bnlj_buildRowArray_0[bnlj_arrayIndex_0];
/* 040 */       boolean bnlj_shouldOutputRow_0 = false;
/* 041 */
/* 042 */       boolean bnlj_isNull_2 = true;
/* 043 */       long bnlj_value_2 = -1L;
/* 044 */       if (bnlj_buildRow_0 != null) {
/* 045 */         long bnlj_value_1 = bnlj_buildRow_0.getLong(0);
/* 046 */         bnlj_isNull_2 = false;
/* 047 */         bnlj_value_2 = bnlj_value_1;
/* 048 */       }
/* 049 */
/* 050 */       long bnlj_value_4 = -1L;
/* 051 */
/* 052 */       bnlj_value_4 = bnlj_expr_0_0 + 1L;
/* 053 */
/* 054 */       boolean bnlj_value_3 = false;
/* 055 */       bnlj_value_3 = bnlj_value_4 <= bnlj_value_2;
/* 056 */       if (!(false || !bnlj_value_3))
/* 057 */       {
/* 058 */         bnlj_shouldOutputRow_0 = true;
/* 059 */         bnlj_foundMatch_0 = true;
/* 060 */       }
/* 061 */       if (bnlj_arrayIndex_0 == bnlj_buildRowArray_0.length - 1 && !bnlj_foundMatch_0) {
/* 062 */         bnlj_buildRow_0 = null;
/* 063 */         bnlj_shouldOutputRow_0 = true;
/* 064 */       }
/* 065 */       if (bnlj_shouldOutputRow_0) {
/* 066 */         ((org.apache.spark.sql.execution.metric.SQLMetric) references[2] /* numOutputRows */).add(1);
/* 067 */
/* 068 */         boolean bnlj_isNull_9 = true;
/* 069 */         long bnlj_value_9 = -1L;
/* 070 */         if (bnlj_buildRow_0 != null) {
/* 071 */           long bnlj_value_8 = bnlj_buildRow_0.getLong(0);
/* 072 */           bnlj_isNull_9 = false;
/* 073 */           bnlj_value_9 = bnlj_value_8;
/* 074 */         }
/* 075 */         range_mutableStateArray_0[3].reset();
/* 076 */
/* 077 */         range_mutableStateArray_0[3].zeroOutNullBytes();
/* 078 */
/* 079 */         range_mutableStateArray_0[3].write(0, bnlj_expr_0_0);
/* 080 */
/* 081 */         if (bnlj_isNull_9) {
/* 082 */           range_mutableStateArray_0[3].setNullAt(1);
/* 083 */         } else {
/* 084 */           range_mutableStateArray_0[3].write(1, bnlj_value_9);
/* 085 */         }
/* 086 */         append((range_mutableStateArray_0[3].getRow()).copy());
/* 087 */
/* 088 */       }
/* 089 */     }
/* 090 */
/* 091 */   }
/* 092 */
/* 093 */   private void initRange(int idx) {
/* 094 */     java.math.BigInteger index = java.math.BigInteger.valueOf(idx);
/* 095 */     java.math.BigInteger numSlice = java.math.BigInteger.valueOf(16L);
/* 096 */     java.math.BigInteger numElement = java.math.BigInteger.valueOf(4L);
/* 097 */     java.math.BigInteger step = java.math.BigInteger.valueOf(1L);
/* 098 */     java.math.BigInteger start = java.math.BigInteger.valueOf(0L);
/* 099 */     long partitionEnd;
/* 100 */
/* 101 */     java.math.BigInteger st = index.multiply(numElement).divide(numSlice).multiply(step).add(start);
/* 102 */     if (st.compareTo(java.math.BigInteger.valueOf(Long.MAX_VALUE)) > 0) {
/* 103 */       range_nextIndex_0 = Long.MAX_VALUE;
/* 104 */     } else if (st.compareTo(java.math.BigInteger.valueOf(Long.MIN_VALUE)) < 0) {
/* 105 */       range_nextIndex_0 = Long.MIN_VALUE;
/* 106 */     } else {
/* 107 */       range_nextIndex_0 = st.longValue();
/* 108 */     }
/* 109 */     range_batchEnd_0 = range_nextIndex_0;
/* 110 */
/* 111 */     java.math.BigInteger end = index.add(java.math.BigInteger.ONE).multiply(numElement).divide(numSlice)
/* 112 */     .multiply(step).add(start);
/* 113 */     if (end.compareTo(java.math.BigInteger.valueOf(Long.MAX_VALUE)) > 0) {
/* 114 */       partitionEnd = Long.MAX_VALUE;
/* 115 */     } else if (end.compareTo(java.math.BigInteger.valueOf(Long.MIN_VALUE)) < 0) {
/* 116 */       partitionEnd = Long.MIN_VALUE;
/* 117 */     } else {
/* 118 */       partitionEnd = end.longValue();
/* 119 */     }
/* 120 */
/* 121 */     java.math.BigInteger startToEnd = java.math.BigInteger.valueOf(partitionEnd).subtract(
/* 122 */       java.math.BigInteger.valueOf(range_nextIndex_0));
/* 123 */     range_numElementsTodo_0  = startToEnd.divide(step).longValue();
/* 124 */     if (range_numElementsTodo_0 < 0) {
/* 125 */       range_numElementsTodo_0 = 0;
/* 126 */     } else if (startToEnd.remainder(step).compareTo(java.math.BigInteger.valueOf(0L)) != 0) {
/* 127 */       range_numElementsTodo_0++;
/* 128 */     }
/* 129 */   }
/* 130 */
/* 131 */   protected void processNext() throws java.io.IOException {
/* 132 */     // initialize Range
/* 133 */     if (!range_initRange_0) {
/* 134 */       range_initRange_0 = true;
/* 135 */       initRange(partitionIndex);
/* 136 */     }
/* 137 */
/* 138 */     while (true) {
/* 139 */       if (range_nextIndex_0 == range_batchEnd_0) {
/* 140 */         long range_nextBatchTodo_0;
/* 141 */         if (range_numElementsTodo_0 > 1000L) {
/* 142 */           range_nextBatchTodo_0 = 1000L;
/* 143 */           range_numElementsTodo_0 -= 1000L;
/* 144 */         } else {
/* 145 */           range_nextBatchTodo_0 = range_numElementsTodo_0;
/* 146 */           range_numElementsTodo_0 = 0;
/* 147 */           if (range_nextBatchTodo_0 == 0) break;
/* 148 */         }
/* 149 */         range_batchEnd_0 += range_nextBatchTodo_0 * 1L;
/* 150 */       }
/* 151 */
/* 152 */       int range_localEnd_0 = (int)((range_batchEnd_0 - range_nextIndex_0) / 1L);
/* 153 */       for (int range_localIdx_0 = 0; range_localIdx_0 < range_localEnd_0; range_localIdx_0++) {
/* 154 */         long range_value_0 = ((long)range_localIdx_0 * 1L) + range_nextIndex_0;
/* 155 */
/* 156 */         // common sub-expressions
/* 157 */
/* 158 */         bnlj_doConsume_0(range_value_0);
/* 159 */
/* 160 */         if (shouldStop()) {
/* 161 */           range_nextIndex_0 = range_value_0 + 1L;
/* 162 */           ((org.apache.spark.sql.execution.metric.SQLMetric) references[0] /* numOutputRows */).add(range_localIdx_0 + 1);
/* 163 */           range_inputMetrics_0.incRecordsRead(range_localIdx_0 + 1);
/* 164 */           return;
/* 165 */         }
/* 166 */
/* 167 */       }
/* 168 */       range_nextIndex_0 = range_batchEnd_0;
/* 169 */       ((org.apache.spark.sql.execution.metric.SQLMetric) references[0] /* numOutputRows */).add(range_localEnd_0);
/* 170 */       range_inputMetrics_0.incRecordsRead(range_localEnd_0);
/* 171 */       range_taskContext_0.killTaskIfInterrupted();
/* 172 */     }
/* 173 */   }
/* 174 */
/* 175 */ }
```

Closes #31931 from linzebing/code-left-right-outer.

Authored-by: linzebing <linzebing1995@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2021-03-23 07:11:57 +00:00
hezuojiao 39542bb81f [SPARK-34790][CORE] Disable fetching shuffle blocks in batch when io encryption is enabled
### What changes were proposed in this pull request?

This patch proposes to disable fetching shuffle blocks in batch when IO encryption is enabled. Adaptive Query Execution fetches contiguous shuffle blocks for the same map task in batch to reduce IO and improve performance. However, we found that batch fetching is incompatible with IO encryption.
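
A hedged sketch of the condition (not the actual shuffle fetch code; the helper and the `otherPreconditionsHold` parameter are illustrative): batch fetching must be switched off whenever IO encryption is on, because the encrypted per-block streams cannot be read back as one concatenated block.
```scala
import org.apache.spark.SparkConf

// Batch fetching of contiguous shuffle blocks is only safe when the other
// preconditions hold AND spark.io.encryption.enabled is false.
def shouldFetchBlocksInBatch(conf: SparkConf, otherPreconditionsHold: Boolean): Boolean =
  otherPreconditionsHold && !conf.getBoolean("spark.io.encryption.enabled", defaultValue = false)
```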

### Why are the changes needed?
Before this patch, if we set `spark.io.encryption.enabled` to true and then ran some queries whose partitions were coalesced by AQE, we could get the following error message:
```14:05:52.638 WARN org.apache.spark.scheduler.TaskSetManager: Lost task 1.0 in stage 2.0 (TID 3) (11.240.37.88 executor driver): FetchFailed(BlockManagerId(driver, 11.240.37.88, 63574, None), shuffleId=0, mapIndex=0, mapId=0, reduceId=2, message=
org.apache.spark.shuffle.FetchFailedException: Stream is corrupted
	at org.apache.spark.storage.ShuffleBlockFetcherIterator.throwFetchFailedException(ShuffleBlockFetcherIterator.scala:772)
	at org.apache.spark.storage.BufferReleasingInputStream.read(ShuffleBlockFetcherIterator.scala:845)
	at java.io.BufferedInputStream.fill(BufferedInputStream.java:246)
	at java.io.BufferedInputStream.read(BufferedInputStream.java:265)
	at java.io.DataInputStream.readInt(DataInputStream.java:387)
	at org.apache.spark.sql.execution.UnsafeRowSerializerInstance$$anon$2$$anon$3.readSize(UnsafeRowSerializer.scala:113)
	at org.apache.spark.sql.execution.UnsafeRowSerializerInstance$$anon$2$$anon$3.next(UnsafeRowSerializer.scala:129)
	at org.apache.spark.sql.execution.UnsafeRowSerializerInstance$$anon$2$$anon$3.next(UnsafeRowSerializer.scala:110)
	at scala.collection.Iterator$$anon$11.next(Iterator.scala:494)
	at scala.collection.Iterator$$anon$10.next(Iterator.scala:459)
	at org.apache.spark.util.CompletionIterator.next(CompletionIterator.scala:29)
	at org.apache.spark.InterruptibleIterator.next(InterruptibleIterator.scala:40)
	at scala.collection.Iterator$$anon$10.next(Iterator.scala:459)
	at org.apache.spark.sql.execution.SparkPlan.$anonfun$getByteArrayRdd$1(SparkPlan.scala:345)
	at org.apache.spark.rdd.RDD.$anonfun$mapPartitionsInternal$2(RDD.scala:898)
	at org.apache.spark.rdd.RDD.$anonfun$mapPartitionsInternal$2$adapted(RDD.scala:898)
	at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
	at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:373)
	at org.apache.spark.rdd.RDD.iterator(RDD.scala:337)
	at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
	at org.apache.spark.scheduler.Task.run(Task.scala:131)
	at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:498)
	at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1437)
	at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:501)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
	at java.lang.Thread.run(Thread.java:748)
Caused by: java.io.IOException: Stream is corrupted
	at net.jpountz.lz4.LZ4BlockInputStream.refill(LZ4BlockInputStream.java:200)
	at net.jpountz.lz4.LZ4BlockInputStream.refill(LZ4BlockInputStream.java:226)
	at net.jpountz.lz4.LZ4BlockInputStream.read(LZ4BlockInputStream.java:157)
	at org.apache.spark.storage.BufferReleasingInputStream.read(ShuffleBlockFetcherIterator.scala:841)
	... 25 more

)
```

### Does this PR introduce any user-facing change?

No

### How was this patch tested?

New tests.

Closes #31898 from hezuojiao/fetch_shuffle_in_batch.

Authored-by: hezuojiao <hezuojiao@gmail.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2021-03-22 13:06:12 -07:00
tanel.kiis@gmail.com 51cf0cadea [SPARK-34812][SQL] RowNumberLike and RankLike should not be nullable
### What changes were proposed in this pull request?

Marked `RowNumberLike` and `RankLike` as not-nullable.

### Why are the changes needed?

`RowNumberLike` and `RankLike` SQL expressions never return a null value. Marking them as non-nullable can have some performance benefits, because some optimizer rules apply only to non-nullable expressions.
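
The change itself is essentially a one-line override per expression trait; a hedged sketch of the idea on simplified types (not the actual Catalyst classes):
```scala
// Rank-like window functions always produce a value, so they can declare themselves
// non-nullable and let null-aware optimizer rules apply to them.
trait SimpleExpression { def nullable: Boolean }
trait RowNumberLike extends SimpleExpression {
  override def nullable: Boolean = false   // row_number() never returns null
}
trait RankLike extends SimpleExpression {
  override def nullable: Boolean = false   // rank()/dense_rank() never return null
}
```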

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

Did not find any existing tests on the nullability of aggregate functions.
Plan stability suite partially covers this.

Closes #31924 from tanelk/SPARK-34812_nullability.

Authored-by: tanel.kiis@gmail.com <tanel.kiis@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2021-03-22 14:55:43 +00:00
woyumen4597 f44608a8c0 [SPARK-34800][SQL] Use fine-grained lock in SessionCatalog.tableExists
### What changes were proposed in this pull request?
Use a fine-grained lock in `SessionCatalog.tableExists`, in order to lock only the `currentDb` variable rather than the whole `tableExists` method, which would block the inner external catalog's behaviour.
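
A hedged, self-contained sketch of the shape of the change (simplified types, not the actual `SessionCatalog` code): the lock is held only while reading the mutable current-database field, and the possibly slow external catalog call runs without it.
```scala
class CatalogSketch(externalTableExists: (String, String) => Boolean) {
  private var currentDb: String = "default"

  def setCurrentDatabase(db: String): Unit = synchronized { currentDb = db }

  def tableExists(db: Option[String], table: String): Boolean = {
    val database = db.getOrElse(synchronized { currentDb }) // fine-grained: lock covers just this read
    externalTableExists(database, table)                    // no lock held during the metastore call
  }
}
```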

### Why are the changes needed?
We have modified the underlying Hive metastore so that each Hive database is placed in its own shard for performance. However, we found that the synchronized lock limits concurrency.

### How was this patch tested?
Existing tests.

Closes #31891 from woyumen4597/SPARK-34800.

Authored-by: woyumen4597 <woyumen4597@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2021-03-22 09:03:46 +00:00
Terry Kim 7953fcdb56 [SPARK-34700][SQL] SessionCatalog's temporary view related APIs should take/return more concrete types
### What changes were proposed in this pull request?

Now that all the temporary views are wrapped with `TemporaryViewRelation`(#31273, #31652, and #31825), this PR proposes to update `SessionCatalog`'s APIs for temporary views to take or return more concrete types.

APIs that will take `TemporaryViewRelation` instead of `LogicalPlan`:
```
createTempView, createGlobalTempView, alterTempViewDefinition
```

APIs that will return `TemporaryViewRelation` instead of `LogicalPlan`:
```
getRawTempView, getRawGlobalTempView
```

APIs that will return `View` instead of `LogicalPlan`:
```
getTempView, getGlobalTempView, lookupTempView
```

### Why are the changes needed?

Internal refactoring to work with more concrete types.

### Does this PR introduce _any_ user-facing change?

No, this is internal refactoring.

### How was this patch tested?

Updated existing tests affected by the refactoring.

Closes #31906 from imback82/use_temporary_view_relation.

Authored-by: Terry Kim <yuminkim@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2021-03-22 08:17:54 +00:00
yi.wu e4bb97526c [SPARK-34089][CORE] HybridRowQueue should respect the configured memory mode
### What changes were proposed in this pull request?

This PR fixes `HybridRowQueue` to respect the configured memory mode.

Besides, this PR also refactored the constructor of `MemoryConsumer` to accept the memory mode explicitly.

### Why are the changes needed?

`HybridRowQueue` supports both onHeap and offHeap manipulation. But it inherited the wrong `MemoryConsumer` constructor, which hard-coded the memory mode to `onHeap`.
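
A hedged, simplified sketch of the constructor change (the real `MemoryConsumer` is a Java class with more parameters; names here are illustrative): the memory mode is passed explicitly instead of being hard-coded to on-heap.
```scala
sealed trait MemoryMode
case object OnHeap extends MemoryMode
case object OffHeap extends MemoryMode

// The mode is now an explicit constructor argument instead of a hard-coded default,
// so consumers like HybridRowQueue spill correctly when Spark uses off-heap memory.
abstract class MemoryConsumerSketch(val pageSize: Long, val mode: MemoryMode) {
  def spill(size: Long): Long
}

class HybridRowQueueSketch(pageSize: Long, mode: MemoryMode)
    extends MemoryConsumerSketch(pageSize, mode) {   // respects the configured mode
  override def spill(size: Long): Long = 0L          // spilling logic elided in this sketch
}
```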

### Does this PR introduce _any_ user-facing change?

No. (Though in some cases, jobs that previously couldn't complete may complete successfully after the fix, because `HybridRowQueue` is now able to spill under offHeap mode.)

### How was this patch tested?

Updated the existing test to make it test both offHeap and onHeap modes.

Closes #31152 from Ngone51/fix-MemoryConsumer-memorymode.

Authored-by: yi.wu <yi.wu@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2021-03-22 08:12:08 +00:00
HyukjinKwon ec70467d4d [SPARK-34815][SQL] Update CSVBenchmark
### What changes were proposed in this pull request?

This PR updates CSVBenchmark, especially since we have a fix like https://github.com/apache/spark/pull/31858 that could potentially improve the performance.

### Why are the changes needed?

To have the updated benchmark results.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Manually ran the benchmark

Closes #31917 from HyukjinKwon/SPARK-34815.

Authored-by: HyukjinKwon <gurwls223@apache.org>
Signed-off-by: Max Gekk <max.gekk@gmail.com>
2021-03-22 10:49:53 +03:00
Jungtaek Lim (HeartSaVioR) 121883b1a5 [SPARK-34383][SS] Optimize WAL commit phase via reducing cost of filesystem operations
### What changes were proposed in this pull request?

This PR proposes to optimize WAL commit phase via following changes:

* cache offset log to avoid FS get operation per batch
* just directly delete instead of employing FS list operation on purge

### Why are the changes needed?

There are inefficiencies in the WAL commit phase which can be easily optimized by using a small amount of driver memory.

1. To provide the offset metadata to the source side (via `source.commit()`), we read the offset metadata for the previous batch from the file system, even though it was probably written by this driver in a previous batch. Caching it in driver memory removes that get operation.
2. Spark calls purge against the offset log & commit log per batch, which calls a list operation. If the previous batch succeeded to purge, the current batch only needs to check one batch, which can simply be done via a direct delete operation instead of calling a list operation.
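
A hedged, self-contained sketch of the two optimizations (simplified; not the actual `OffsetSeqLog`/`HDFSMetadataLog` code — `deleteBatchFile` and `minBatchesToRetain` are illustrative stand-ins for the underlying filesystem call and retention setting):
```scala
import scala.collection.mutable

class WalCommitSketch(deleteBatchFile: Long => Unit) {
  // 1. Cache the most recent offsets in driver memory so source.commit() does not
  //    need a filesystem "get" of the previous batch's offset file.
  private val offsetCache = mutable.Map.empty[Long, String]

  def add(batchId: Long, offsets: String): Unit = {
    offsetCache(batchId) = offsets
    offsetCache.remove(batchId - 2)      // keep only the last two batches cached
  }

  def getCached(batchId: Long): Option[String] = offsetCache.get(batchId)

  // 2. Purge by deleting the single known stale batch file directly instead of
  //    listing the log directory every batch.
  def purgeAfterCommit(batchId: Long, minBatchesToRetain: Int): Unit = {
    val staleBatch = batchId - minBatchesToRetain
    if (staleBatch >= 0) deleteBatchFile(staleBatch)
  }
}
```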

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Manually tested with additional debug logs. (Verified that the cache is used, the cache keeps its size at 2, and only one delete call is used instead of a list call.)

Did some experiments with a simple rate-to-console query. (NOTE: this wasn't done with the master branch - tested against Spark 2.4.x, but the WAL commit phase hasn't been changed AFAIK during these versions.)

AWS S3 + S3 guard:

> before the patch

<img width="1075" alt="aws-before" src="https://user-images.githubusercontent.com/1317309/107108721-6cc54380-687d-11eb-8f10-b906b9d58397.png">

> after the patch

<img width="1071" alt="aws-after" src="https://user-images.githubusercontent.com/1317309/107108724-7189f780-687d-11eb-88da-26912ac15c85.png">

Azure:

> before the patch

<img width="1074" alt="azure-before" src="https://user-images.githubusercontent.com/1317309/107108726-75b61500-687d-11eb-8c06-9048fa10ff9a.png">

> after the patch

<img width="1069" alt="azure-after" src="https://user-images.githubusercontent.com/1317309/107108729-79e23280-687d-11eb-8d97-e7f3aeec51be.png">

Closes #31495 from HeartSaVioR/SPARK-34383.

Authored-by: Jungtaek Lim (HeartSaVioR) <kabhwan.opensource@gmail.com>
Signed-off-by: Gabor Somogyi <gabor.g.somogyi@gmail.com>
2021-03-22 08:47:07 +01:00
Cheng Su f8838fe82b [SPARK-34708][SQL] Code-gen for left semi/anti broadcast nested loop join (build right side)
### What changes were proposed in this pull request?

This PR is to add code-gen support for left semi / left anti BroadcastNestedLoopJoin (build side is right side). The execution code path for the build-left side cannot fit into the whole-stage code-gen framework, so only the code-gen for the build-right side is added here.

Reference: the iterator (non-code-gen) code path is `BroadcastNestedLoopJoinExec.leftExistenceJoin()` with `BuildRight`.

### Why are the changes needed?

Improve query CPU performance.
Tested with a simple query:

```
val N = 20 << 20
val M = 1 << 4

val dim = broadcast(spark.range(M).selectExpr("id as k2"))
codegenBenchmark("left semi broadcast nested loop join", N) {
  spark.range(N).selectExpr(s"id as k1").join(
    dim, col("k1") + 1 <= col("k2"), "left_semi")
}
```

Seeing 5x run time improvement:

```
Running benchmark: left semi broadcast nested loop join
  Running case: left semi broadcast nested loop join codegen off
  Stopped after 2 iterations, 6958 ms
  Running case: left semi broadcast nested loop join codegen on
  Stopped after 5 iterations, 3383 ms

Java HotSpot(TM) 64-Bit Server VM 1.8.0_181-b13 on Mac OS X 10.15.7
Intel(R) Core(TM) i9-9980HK CPU  2.40GHz
left semi broadcast nested loop join:             Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
--------------------------------------------------------------------------------------------------------------------------------
left semi broadcast nested loop join codegen off           3434           3479          65          6.1         163.7       1.0X
left semi broadcast nested loop join codegen on             672            677           5         31.2          32.1       5.1X
```

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Changed existing unit test in `ExistenceJoinSuite.scala` to cover all code paths:
* left semi/anti + empty right side + empty condition
* left semi/anti + non-empty right side + empty condition
* left semi/anti + right side + non-empty condition

Added unit test in `WholeStageCodegenSuite.scala` to make sure code-gen for broadcast nested loop join is taking effect, and test for multiple join case as well.

Example query:

```
val df1 = spark.range(4).select($"id".as("k1"))
val df2 = spark.range(3).select($"id".as("k2"))
df1.join(df2, $"k1" + 1 <= $"k2", "left_semi").explain("codegen")
```

Example generated code (`bnlj_doConsume_0` method):
This is for left semi join. The generated code for left anti join is mostly the same as here, except that line 55 becomes `if (bnlj_findMatchedRow_0 == false) {`.
```
== Subtree 2 / 2 (maxMethodCodeSize:282; maxConstantPoolSize:203(0.31% used); numInnerClasses:0) ==
*(2) Project [id#0L AS k1#2L]
+- *(2) BroadcastNestedLoopJoin BuildRight, LeftSemi, ((id#0L + 1) <= k2#6L)
   :- *(2) Range (0, 4, step=1, splits=2)
   +- BroadcastExchange IdentityBroadcastMode, [id=#23]
      +- *(1) Project [id#4L AS k2#6L]
         +- *(1) Range (0, 3, step=1, splits=2)

Generated code:
/* 001 */ public Object generate(Object[] references) {
/* 002 */   return new GeneratedIteratorForCodegenStage2(references);
/* 003 */ }
/* 004 */
/* 005 */ // codegenStageId=2
/* 006 */ final class GeneratedIteratorForCodegenStage2 extends org.apache.spark.sql.execution.BufferedRowIterator {
/* 007 */   private Object[] references;
/* 008 */   private scala.collection.Iterator[] inputs;
/* 009 */   private boolean range_initRange_0;
/* 010 */   private long range_nextIndex_0;
/* 011 */   private TaskContext range_taskContext_0;
/* 012 */   private InputMetrics range_inputMetrics_0;
/* 013 */   private long range_batchEnd_0;
/* 014 */   private long range_numElementsTodo_0;
/* 015 */   private InternalRow[] bnlj_buildRowArray_0;
/* 016 */   private org.apache.spark.sql.catalyst.expressions.codegen.UnsafeRowWriter[] range_mutableStateArray_0 = new org.apache.spark.sql.catalyst.expressions.codegen.UnsafeRowWriter[4];
/* 017 */
/* 018 */   public GeneratedIteratorForCodegenStage2(Object[] references) {
/* 019 */     this.references = references;
/* 020 */   }
/* 021 */
/* 022 */   public void init(int index, scala.collection.Iterator[] inputs) {
/* 023 */     partitionIndex = index;
/* 024 */     this.inputs = inputs;
/* 025 */
/* 026 */     range_taskContext_0 = TaskContext.get();
/* 027 */     range_inputMetrics_0 = range_taskContext_0.taskMetrics().inputMetrics();
/* 028 */     range_mutableStateArray_0[0] = new org.apache.spark.sql.catalyst.expressions.codegen.UnsafeRowWriter(1, 0);
/* 029 */     range_mutableStateArray_0[1] = new org.apache.spark.sql.catalyst.expressions.codegen.UnsafeRowWriter(1, 0);
/* 030 */     bnlj_buildRowArray_0 = (InternalRow[]) ((org.apache.spark.broadcast.TorrentBroadcast) references[1] /* broadcastTerm */).value();
/* 031 */     range_mutableStateArray_0[2] = new org.apache.spark.sql.catalyst.expressions.codegen.UnsafeRowWriter(1, 0);
/* 032 */     range_mutableStateArray_0[3] = new org.apache.spark.sql.catalyst.expressions.codegen.UnsafeRowWriter(1, 0);
/* 033 */
/* 034 */   }
/* 035 */
/* 036 */   private void bnlj_doConsume_0(long bnlj_expr_0_0) throws java.io.IOException {
/* 037 */     boolean bnlj_findMatchedRow_0 = false;
/* 038 */     for (int bnlj_arrayIndex_0 = 0; bnlj_arrayIndex_0 < bnlj_buildRowArray_0.length; bnlj_arrayIndex_0++) {
/* 039 */       UnsafeRow bnlj_buildRow_0 = (UnsafeRow) bnlj_buildRowArray_0[bnlj_arrayIndex_0];
/* 040 */
/* 041 */       long bnlj_value_1 = bnlj_buildRow_0.getLong(0);
/* 042 */
/* 043 */       long bnlj_value_3 = -1L;
/* 044 */
/* 045 */       bnlj_value_3 = bnlj_expr_0_0 + 1L;
/* 046 */
/* 047 */       boolean bnlj_value_2 = false;
/* 048 */       bnlj_value_2 = bnlj_value_3 <= bnlj_value_1;
/* 049 */       if (!(false || !bnlj_value_2))
/* 050 */       {
/* 051 */         bnlj_findMatchedRow_0 = true;
/* 052 */         break;
/* 053 */       }
/* 054 */     }
/* 055 */     if (bnlj_findMatchedRow_0 == true) {
/* 056 */       ((org.apache.spark.sql.execution.metric.SQLMetric) references[2] /* numOutputRows */).add(1);
/* 057 */
/* 058 */       // common sub-expressions
/* 059 */
/* 060 */       range_mutableStateArray_0[3].reset();
/* 061 */
/* 062 */       range_mutableStateArray_0[3].write(0, bnlj_expr_0_0);
/* 063 */       append((range_mutableStateArray_0[3].getRow()).copy());
/* 064 */
/* 065 */     }
/* 066 */
/* 067 */   }
/* 068 */
/* 069 */   private void initRange(int idx) {
/* 070 */     java.math.BigInteger index = java.math.BigInteger.valueOf(idx);
/* 071 */     java.math.BigInteger numSlice = java.math.BigInteger.valueOf(2L);
/* 072 */     java.math.BigInteger numElement = java.math.BigInteger.valueOf(4L);
/* 073 */     java.math.BigInteger step = java.math.BigInteger.valueOf(1L);
/* 074 */     java.math.BigInteger start = java.math.BigInteger.valueOf(0L);
/* 075 */     long partitionEnd;
/* 076 */
/* 077 */     java.math.BigInteger st = index.multiply(numElement).divide(numSlice).multiply(step).add(start);
/* 078 */     if (st.compareTo(java.math.BigInteger.valueOf(Long.MAX_VALUE)) > 0) {
/* 079 */       range_nextIndex_0 = Long.MAX_VALUE;
/* 080 */     } else if (st.compareTo(java.math.BigInteger.valueOf(Long.MIN_VALUE)) < 0) {
/* 081 */       range_nextIndex_0 = Long.MIN_VALUE;
/* 082 */     } else {
/* 083 */       range_nextIndex_0 = st.longValue();
/* 084 */     }
/* 085 */     range_batchEnd_0 = range_nextIndex_0;
/* 086 */
/* 087 */     java.math.BigInteger end = index.add(java.math.BigInteger.ONE).multiply(numElement).divide(numSlice)
/* 088 */     .multiply(step).add(start);
/* 089 */     if (end.compareTo(java.math.BigInteger.valueOf(Long.MAX_VALUE)) > 0) {
/* 090 */       partitionEnd = Long.MAX_VALUE;
/* 091 */     } else if (end.compareTo(java.math.BigInteger.valueOf(Long.MIN_VALUE)) < 0) {
/* 092 */       partitionEnd = Long.MIN_VALUE;
/* 093 */     } else {
/* 094 */       partitionEnd = end.longValue();
/* 095 */     }
/* 096 */
/* 097 */     java.math.BigInteger startToEnd = java.math.BigInteger.valueOf(partitionEnd).subtract(
/* 098 */       java.math.BigInteger.valueOf(range_nextIndex_0));
/* 099 */     range_numElementsTodo_0  = startToEnd.divide(step).longValue();
/* 100 */     if (range_numElementsTodo_0 < 0) {
/* 101 */       range_numElementsTodo_0 = 0;
/* 102 */     } else if (startToEnd.remainder(step).compareTo(java.math.BigInteger.valueOf(0L)) != 0) {
/* 103 */       range_numElementsTodo_0++;
/* 104 */     }
/* 105 */   }
/* 106 */
/* 107 */   protected void processNext() throws java.io.IOException {
/* 108 */     // initialize Range
/* 109 */     if (!range_initRange_0) {
/* 110 */       range_initRange_0 = true;
/* 111 */       initRange(partitionIndex);
/* 112 */     }
/* 113 */
/* 114 */     while (true) {
/* 115 */       if (range_nextIndex_0 == range_batchEnd_0) {
/* 116 */         long range_nextBatchTodo_0;
/* 117 */         if (range_numElementsTodo_0 > 1000L) {
/* 118 */           range_nextBatchTodo_0 = 1000L;
/* 119 */           range_numElementsTodo_0 -= 1000L;
/* 120 */         } else {
/* 121 */           range_nextBatchTodo_0 = range_numElementsTodo_0;
/* 122 */           range_numElementsTodo_0 = 0;
/* 123 */           if (range_nextBatchTodo_0 == 0) break;
/* 124 */         }
/* 125 */         range_batchEnd_0 += range_nextBatchTodo_0 * 1L;
/* 126 */       }
/* 127 */
/* 128 */       int range_localEnd_0 = (int)((range_batchEnd_0 - range_nextIndex_0) / 1L);
/* 129 */       for (int range_localIdx_0 = 0; range_localIdx_0 < range_localEnd_0; range_localIdx_0++) {
/* 130 */         long range_value_0 = ((long)range_localIdx_0 * 1L) + range_nextIndex_0;
/* 131 */
/* 132 */         bnlj_doConsume_0(range_value_0);
/* 133 */
/* 134 */         if (shouldStop()) {
/* 135 */           range_nextIndex_0 = range_value_0 + 1L;
/* 136 */           ((org.apache.spark.sql.execution.metric.SQLMetric) references[0] /* numOutputRows */).add(range_localIdx_0 + 1);
/* 137 */           range_inputMetrics_0.incRecordsRead(range_localIdx_0 + 1);
/* 138 */           return;
/* 139 */         }
/* 140 */
/* 141 */       }
/* 142 */       range_nextIndex_0 = range_batchEnd_0;
/* 143 */       ((org.apache.spark.sql.execution.metric.SQLMetric) references[0] /* numOutputRows */).add(range_localEnd_0);
/* 144 */       range_inputMetrics_0.incRecordsRead(range_localEnd_0);
/* 145 */       range_taskContext_0.killTaskIfInterrupted();
/* 146 */     }
/* 147 */   }
/* 148 */
/* 149 */ }
```

Closes #31874 from c21/code-semi-anti.

Authored-by: Cheng Su <chengsu@fb.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2021-03-22 07:31:16 +00:00
Yuanjian Li 45235ac4bc [SPARK-34748][SS] Create a rule of the analysis logic for streaming write
### What changes were proposed in this pull request?
- Create a new rule `ResolveStreamWrite` for all analysis logic for streaming write.
- Add corresponding logical plans `WriteToStreamStatement` and `WriteToStream`.

### Why are the changes needed?
Currently, the analysis logic for streaming write is mixed into StreamingQueryManager. Creating a specific analyzer rule and separate logical plans should be helpful for further extension.

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
Existing tests.

Closes #31842 from xuanyuanking/SPARK-34748.

Authored-by: Yuanjian Li <yuanjian.li@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2021-03-22 06:39:39 +00:00
Josh Soref f4de93efb0 [MINOR][SQL] Spelling: filters - PushedFilers
### What changes were proposed in this pull request?
Consistently correct the spelling of `PushedFilters`

### Why are the changes needed?
bersprockets noted that it's wrong

### Does this PR introduce _any_ user-facing change?

Technically, I think it does. Practically, neither Google nor GitHub show anyone using `pushedFilers` outside of forks (or the discussion about fixing it started at https://github.com/apache/spark/pull/30323#issuecomment-725568719)

### How was this patch tested?
None beyond CI in the previous PR

Closes #30678 from jsoref/spelling-filters.

Authored-by: Josh Soref <jsoref@users.noreply.github.com>
Signed-off-by: Max Gekk <max.gekk@gmail.com>
2021-03-22 08:00:12 +03:00
Dongjoon Hyun c5fd94f119 [SPARK-34772][TESTS][FOLLOWUP] Disable a test case using Hive 1.2.1 in Java9+ environment
### What changes were proposed in this pull request?

This PR aims to disable a new test case that uses Hive 1.2.1 in Java 9+ test environments.

### Why are the changes needed?

[HIVE-6113](https://issues.apache.org/jira/browse/HIVE-6113) upgraded Datanucleus to 4.x at Hive 2.0. Datanucleus 3.x doesn't support Java9+.

**Java 9+ Environment**
```
$ build/sbt "hive/testOnly *.HiveSparkSubmitSuite -- -z SPARK-34772" -Phive
...
[info] *** 1 TEST FAILED ***
[error] Failed: Total 1, Failed 1, Errors 0, Passed 0
[error] Failed tests:
[error] 	org.apache.spark.sql.hive.HiveSparkSubmitSuite
[error] (hive / Test / testOnly) sbt.TestsFailedException: Tests unsuccessful
[error] Total time: 328 s (05:28), completed Mar 21, 2021, 5:32:39 PM
```

### Does this PR introduce _any_ user-facing change?

Fix the UT in Java9+ environment.

### How was this patch tested?

Manually.

```
$ build/sbt "hive/testOnly *.HiveSparkSubmitSuite -- -z SPARK-34772" -Phive
...
[info] HiveSparkSubmitSuite:
[info] - SPARK-34772: RebaseDateTime loadRebaseRecords should use Spark classloader instead of context !!! CANCELED !!! (26 milliseconds)
[info]   org.apache.commons.lang3.SystemUtils.isJavaVersionAtLeast(JAVA_9) was true (HiveSparkSubmitSuite.scala:344)
```

Closes #31916 from dongjoon-hyun/SPARK-HiveSparkSubmitSuite.

Authored-by: Dongjoon Hyun <dhyun@apple.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2021-03-21 17:59:55 -07:00
Kousuke Saruta 94fd6cb0ce [SPARK-34636][FOLLOWUP][SQL] Fix an incompatible behavior of UnresolvedAttribute.sql
### What changes were proposed in this pull request?

This PR fixes an incompatible behavior introduced by #31754.
The problem is that when quoted name parts represented as a string are given to the constructor of `UnresolvedAttribute` which takes a single string parameter, a `sql` method invocation against the `UnresolvedAttribute` returns a different result than before.

One example is ``` UnresolvedAttribute("`a.b`").sql ```. This returned `a.b` before but it doesn't now.

See [this duscussion](https://github.com/apache/spark/pull/31754/files#r597181927) for more details.

### Why are the changes needed?

For compatibility.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

New assertion.

Closes #31885 from sarutak/followup-SPARK-34636.

Authored-by: Kousuke Saruta <sarutak@oss.nttdata.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2021-03-20 14:44:36 -07:00
Yuming Wang 908318f30d [SPARK-28220][SQL] Improve PropagateEmptyRelation to support join with false condition
### What changes were proposed in this pull request?

Improve `PropagateEmptyRelation` to support join with false condition. For example:
```sql
SELECT * FROM t1 LEFT JOIN t2 ON false
```

Before this pr:
```
== Physical Plan ==
AdaptiveSparkPlan isFinalPlan=false
+- BroadcastNestedLoopJoin BuildRight, LeftOuter, false
   :- FileScan parquet default.t1[a#4L]
   +- BroadcastExchange IdentityBroadcastMode, [id=#40]
      +- FileScan parquet default.t2[b#5L]
```

After this pr:
```
== Physical Plan ==
*(1) Project [a#4L, null AS b#5L]
+- *(1) ColumnarToRow
   +- FileScan parquet default.t1[a#4L]
```
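
A hedged, self-contained sketch of the idea (simplified plan algebra, not the actual `PropagateEmptyRelation` rule): a left outer join whose condition is literally false never matches, so every left row gets null-padded right columns, and the join can be rewritten to a projection over the left side only.
```scala
object PropagateFalseJoinSketch {
  sealed trait Plan
  final case class Relation(name: String, output: Seq[String]) extends Plan
  final case class LeftOuterJoin(left: Plan, right: Plan, conditionIsFalse: Boolean) extends Plan
  final case class ProjectWithNulls(child: Plan, nullColumns: Seq[String]) extends Plan

  // LEFT JOIN ... ON false: replace the join with a projection that appends nulls
  // for the right side's columns, avoiding a BroadcastNestedLoopJoin entirely.
  def rewrite(plan: Plan): Plan = plan match {
    case LeftOuterJoin(left, Relation(_, rightOutput), true) => ProjectWithNulls(left, rightOutput)
    case other                                               => other
  }

  def main(args: Array[String]): Unit = {
    val t1 = Relation("t1", Seq("a"))
    val t2 = Relation("t2", Seq("b"))
    println(rewrite(LeftOuterJoin(t1, t2, conditionIsFalse = true)))
    // ProjectWithNulls(Relation(t1,List(a)),List(b)) -- mirrors Project [a, null AS b] above
  }
}
```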

### Why are the changes needed?

Avoid `BroadcastNestedLoopJoin` to improve query performance.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Unit test.

Closes #31857 from wangyum/SPARK-28220.

Authored-by: Yuming Wang <yumwang@ebay.com>
Signed-off-by: Yuming Wang <yumwang@ebay.com>
2021-03-20 22:57:02 +08:00
Kent Yao 2cdedef2a0 [SPARK-34128][SQL] Suppress undesirable TTransportException warnings involved in THRIFT-4805
### What changes were proposed in this pull request?

Since Spark 3.0, the `libthrift` has been bumped up from 0.9.3 to 0.12.0.

Due to THRIFT-4805, the Spark Thrift Server will print annoying TTransportExceptions. For example, the current thrift-server module test in the GitHub Actions workflow outputs more than 200MB of data for this error alone:
```java
org.apache.thrift.transport.TTransportException
	at org.apache.thrift.transport.TIOStreamTransport.read(TIOStreamTransport.java:132)
	at org.apache.thrift.transport.TTransport.readAll(TTransport.java:86)
	at org.apache.thrift.transport.TSaslTransport.readLength(TSaslTransport.java:374)
	at org.apache.thrift.transport.TSaslTransport.readFrame(TSaslTransport.java:451)
	at org.apache.thrift.transport.TSaslTransport.read(TSaslTransport.java:433)
	at org.apache.thrift.transport.TSaslServerTransport.read(TSaslServerTransport.java:43)
	at org.apache.thrift.transport.TTransport.readAll(TTransport.java:86)
	at org.apache.thrift.protocol.TBinaryProtocol.readAll(TBinaryProtocol.java:425)
	at org.apache.thrift.protocol.TBinaryProtocol.readI32(TBinaryProtocol.java:321)
	at org.apache.thrift.protocol.TBinaryProtocol.readMessageBegin(TBinaryProtocol.java:225)
	at org.apache.thrift.TBaseProcessor.process(TBaseProcessor.java:27)
	at org.apache.hive.service.auth.TSetIpAddressProcessor.process(TSetIpAddressProcessor.java:53)
	at org.apache.thrift.server.TThreadPoolServer$WorkerProcess.run(TThreadPoolServer.java:310)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
	at java.lang.Thread.run(Thread.java:748)
```

I checked the latest `hive-service-rpc` module in Maven Central, https://mvnrepository.com/artifact/org.apache.hive/hive-service-rpc/3.1.2. It still uses the 0.9.3 version.

Unfortunately, when I tried the newly released `libthrift 0.14.1` (without shading it), it broke the metastore client side.

```scala
java.lang.NoSuchMethodError: org.apache.thrift.transport.TSocket.<init>(Ljava/lang/String;II)V
```
On the Thrift side, they just muted it; see https://issues.apache.org/jira/browse/THRIFT-4805.

So in this PR, I add a log filter to suppress the warning.
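
For illustration only, a filter of this kind could look roughly like the sketch below (assuming the log4j 1.x API in use at the time; the class name and matching logic are placeholders, not the exact filter added by this PR):
```scala
import org.apache.log4j.spi.{Filter, LoggingEvent}

// Drop log events whose attached throwable is the bare TTransportException from
// THRIFT-4805; let every other event pass through unchanged.
class ThriftNoiseFilter extends Filter {
  override def decide(event: LoggingEvent): Int = {
    val info = event.getThrowableInformation
    val isNoise = info != null &&
      info.getThrowable.isInstanceOf[org.apache.thrift.transport.TTransportException] &&
      info.getThrowable.getMessage == null
    if (isNoise) Filter.DENY else Filter.NEUTRAL
  }
}
```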

### Why are the changes needed?

If the log is too large, the GitHub Actions run might truncate it. We need to reduce the useless output.

### Does this PR introduce _any_ user-facing change?

no

### How was this patch tested?

```build/sbt "hive-thriftserver/testOnly *ThriftServerQueryTestSuite" -Phive-thriftserver``` locally

#### before

```java
[info] - count.sql (1 second, 537 milliseconds)
[info] - decimalArithmeticOperations.sql !!! IGNORED !!!
14:09:53.233 ERROR org.apache.thrift.server.TThreadPoolServer: Thrift error occurred during processing of message.
org.apache.thrift.transport.TTransportException
	at org.apache.thrift.transport.TIOStreamTransport.read(TIOStreamTransport.java:132)
	at org.apache.thrift.transport.TTransport.readAll(TTransport.java:86)
	at org.apache.thrift.transport.TSaslTransport.readLength(TSaslTransport.java:374)
	at org.apache.thrift.transport.TSaslTransport.readFrame(TSaslTransport.java:451)
	at org.apache.thrift.transport.TSaslTransport.read(TSaslTransport.java:433)
	at org.apache.thrift.transport.TSaslServerTransport.read(TSaslServerTransport.java:43)
	at org.apache.thrift.transport.TTransport.readAll(TTransport.java:86)
	at org.apache.thrift.protocol.TBinaryProtocol.readAll(TBinaryProtocol.java:425)
	at org.apache.thrift.protocol.TBinaryProtocol.readI32(TBinaryProtocol.java:321)
	at org.apache.thrift.protocol.TBinaryProtocol.readMessageBegin(TBinaryProtocol.java:225)
	at org.apache.thrift.TBaseProcessor.process(TBaseProcessor.java:27)
	at org.apache.hive.service.auth.TSetIpAddressProcessor.process(TSetIpAddressProcessor.java:53)
	at org.apache.thrift.server.TThreadPoolServer$WorkerProcess.run(TThreadPoolServer.java:310)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
	at java.lang.Thread.run(Thread.java:748)
[info] - group-analytics.sql (4 seconds, 282 milliseconds)

[info] - csv-functions.sql (400 milliseconds)
14:09:24.234 ERROR org.apache.thrift.server.TThreadPoolServer: Thrift error occurred during processing of message.
org.apache.thrift.transport.TTransportException
	at org.apache.thrift.transport.TIOStreamTransport.read(TIOStreamTransport.java:132)
	at org.apache.thrift.transport.TTransport.readAll(TTransport.java:86)
	at org.apache.thrift.transport.TSaslTransport.readLength(TSaslTransport.java:374)
	at org.apache.thrift.transport.TSaslTransport.readFrame(TSaslTransport.java:451)
	at org.apache.thrift.transport.TSaslTransport.read(TSaslTransport.java:433)
	at org.apache.thrift.transport.TSaslServerTransport.read(TSaslServerTransport.java:43)
	at org.apache.thrift.transport.TTransport.readAll(TTransport.java:86)
	at org.apache.thrift.protocol.TBinaryProtocol.readAll(TBinaryProtocol.java:425)
	at org.apache.thrift.protocol.TBinaryProtocol.readI32(TBinaryProtocol.java:321)
	at org.apache.thrift.protocol.TBinaryProtocol.readMessageBegin(TBinaryProtocol.java:225)
	at org.apache.thrift.TBaseProcessor.process(TBaseProcessor.java:27)
	at org.apache.hive.service.auth.TSetIpAddressProcessor.process(TSetIpAddressProcessor.java:53)
	at org.apache.thrift.server.TThreadPoolServer$WorkerProcess.run(TThreadPoolServer.java:310)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
	at java.lang.Thread.run(Thread.java:748)
[info] - datetime-formatting-invalid.sql (349 milliseconds)
14:09:26.544 ERROR org.apache.thrift.server.TThreadPoolServer: Thrift error occurred during processing of message.
org.apache.thrift.transport.TTransportException
	at org.apache.thrift.transport.TIOStreamTransport.read(TIOStreamTransport.java:132)
	at org.apache.thrift.transport.TTransport.readAll(TTransport.java:86)
	at org.apache.thrift.transport.TSaslTransport.readLength(TSaslTransport.java:374)
	at org.apache.thrift.transport.TSaslTransport.readFrame(TSaslTransport.java:451)
	at org.apache.thrift.transport.TSaslTransport.read(TSaslTransport.java:433)
	at org.apache.thrift.transport.TSaslServerTransport.read(TSaslServerTransport.java:43)
	at org.apache.thrift.transport.TTransport.readAll(TTransport.java:86)
	at org.apache.thrift.protocol.TBinaryProtocol.readAll(TBinaryProtocol.java:425)
	at org.apache.thrift.protocol.TBinaryProtocol.readI32(TBinaryProtocol.java:321)
	at org.apache.thrift.protocol.TBinaryProtocol.readMessageBegin(TBinaryProtocol.java:225)
	at org.apache.thrift.TBaseProcessor.process(TBaseProcessor.java:27)
	at org.apache.hive.service.auth.TSetIpAddressProcessor.process(TSetIpAddressProcessor.java:53)
	at org.apache.thrift.server.TThreadPoolServer$WorkerProcess.run(TThreadPoolServer.java:310)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
	at java.lang.Thread.run(Thread.java:748)
[info] - except.sql (2 seconds, 309 milliseconds)
14:09:27.782 ERROR org.apache.thrift.server.TThreadPoolServer: Thrift error occurred during processing of message.
org.apache.thrift.transport.TTransportException
	at org.apache.thrift.transport.TIOStreamTransport.read(TIOStreamTransport.java:132)
	at org.apache.thrift.transport.TTransport.readAll(TTransport.java:86)
	at org.apache.thrift.transport.TSaslTransport.readLength(TSaslTransport.java:374)
	at org.apache.thrift.transport.TSaslTransport.readFrame(TSaslTransport.java:451)
	at org.apache.thrift.transport.TSaslTransport.read(TSaslTransport.java:433)
	at org.apache.thrift.transport.TSaslServerTransport.read(TSaslServerTransport.java:43)
	at org.apache.thrift.transport.TTransport.readAll(TTransport.java:86)
	at org.apache.thrift.protocol.TBinaryProtocol.readAll(TBinaryProtocol.java:425)
	at org.apache.thrift.protocol.TBinaryProtocol.readI32(TBinaryProtocol.java:321)
	at org.apache.thrift.protocol.TBinaryProtocol.readMessageBegin(TBinaryProtocol.java:225)
	at org.apache.thrift.TBaseProcessor.process(TBaseProcessor.java:27)
	at org.apache.hive.service.auth.TSetIpAddressProcessor.process(TSetIpAddressProcessor.java:53)
	at org.apache.thrift.server.TThreadPoolServer$WorkerProcess.run(TThreadPoolServer.java:310)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
	at java.lang.Thread.run(Thread.java:748)
[info] - string-functions.sql (1 second, 237 milliseconds)
14:09:27.835 WARN org.apache.spark.sql.execution.datasources.DataSource: All paths were ignored:

14:09:29.266 ERROR org.apache.thrift.server.TThreadPoolServer: Thrift error occurred during processing of message.
org.apache.thrift.transport.TTransportException
	at org.apache.thrift.transport.TIOStreamTransport.read(TIOStreamTransport.java:132)
	at org.apache.thrift.transport.TTransport.readAll(TTransport.java:86)
	at org.apache.thrift.transport.TSaslTransport.readLength(TSaslTransport.java:374)
	at org.apache.thrift.transport.TSaslTransport.readFrame(TSaslTransport.java:451)
	at org.apache.thrift.transport.TSaslTransport.read(TSaslTransport.java:433)
	at org.apache.thrift.transport.TSaslServerTransport.read(TSaslServerTransport.java:43)
	at org.apache.thrift.transport.TTransport.readAll(TTransport.java:86)
	at org.apache.thrift.protocol.TBinaryProtocol.readAll(TBinaryProtocol.java:425)
	at org.apache.thrift.protocol.TBinaryProtocol.readI32(TBinaryProtocol.java:321)
	at org.apache.thrift.protocol.TBinaryProtocol.readMessageBegin(TBinaryProtocol.java:225)
	at org.apache.thrift.TBaseProcessor.process(TBaseProcessor.java:27)
	at org.apache.hive.service.auth.TSetIpAddressProcessor.process(TSetIpAddressProcessor.java:53)
	at org.apache.thrift.server.TThreadPoolServer$WorkerProcess.run(TThreadPoolServer.java:310)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
	at java.lang.Thread.run(Thread.java:748)

```

#### after

```java

[info] - null-propagation.sql (181 milliseconds)
[info] - operators.sql (1 second, 772 milliseconds)
[info] - change-column.sql (241 milliseconds)
[info] - count.sql (1 second, 665 milliseconds)
[info] - decimalArithmeticOperations.sql !!! IGNORED !!!
[info] - group-analytics.sql (3 seconds, 926 milliseconds)
[info] - inline-table.sql (247 milliseconds)
[info] - comparator.sql (223 milliseconds)
[info] - show-tblproperties.sql (148 milliseconds)
[info] - timezone.sql (105 milliseconds)
[info] - parse-schema-string.sql (193 milliseconds)
```

Closes #31895 from yaooqinn/SPARK-34128-2.

Authored-by: Kent Yao <yao@apache.org>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2021-03-19 21:15:28 -07:00
Cheng Su 2ff0032e01 [SPARK-34796][SQL] Initialize counter variable for LIMIT code-gen in doProduce()
### What changes were proposed in this pull request?

This PR fixes the LIMIT code-gen bug in https://issues.apache.org/jira/browse/SPARK-34796, where the counter variable from `BaseLimitExec` is used in code-gen but never initialized. This happens because the limit counter variable is used by upstream operators (LIMIT's child plan, e.g. the `ColumnarToRowExec` operator for early termination), but in the same stage there can be operators that take a shortcut and never call `BaseLimitExec`'s `doConsume()`, e.g. [HashJoin.codegenInner](https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/joins/HashJoin.scala#L402). So if we have a query with `LocalLimit - BroadcastHashJoin - FileScan` in the same stage, the whole-stage code-gen compilation will fail.

Here is an example:

```
  test("failed limit query") {
    withTable("left_table", "empty_right_table", "output_table") {
      spark.range(5).toDF("k").write.saveAsTable("left_table")
      spark.range(0).toDF("k").write.saveAsTable("empty_right_table")

      withSQLConf(SQLConf.ADAPTIVE_EXECUTION_ENABLED.key -> "false") {
        spark.sql("CREATE TABLE output_table (k INT) USING parquet")
        spark.sql(
          s"""
             |INSERT INTO TABLE output_table
             |SELECT t1.k FROM left_table t1
             |JOIN empty_right_table t2
             |ON t1.k = t2.k
             |LIMIT 3
             |""".stripMargin)
      }
    }
  }
```

Query plan:

```
Execute InsertIntoHadoopFsRelationCommand file:/Users/chengsu/spark/sql/core/spark-warehouse/org.apache.spark.sql.SQLQuerySuite/output_table, false, Parquet, Map(path -> file:/Users/chengsu/spark/sql/core/spark-warehouse/org.apache.spark.sql.SQLQuerySuite/output_table), Append, CatalogTable(
Database: default
Table: output_table
Created Time: Thu Mar 18 21:46:26 PDT 2021
Last Access: UNKNOWN
Created By: Spark 3.2.0-SNAPSHOT
Type: MANAGED
Provider: parquet
Location: file:/Users/chengsu/spark/sql/core/spark-warehouse/org.apache.spark.sql.SQLQuerySuite/output_table
Schema: root
 |-- k: integer (nullable = true)
), org.apache.spark.sql.execution.datasources.InMemoryFileIndexb25d08b, [k]
+- *(3) Project [ansi_cast(k#228L as int) AS k#231]
   +- *(3) GlobalLimit 3
      +- Exchange SinglePartition, ENSURE_REQUIREMENTS, [id=#179]
         +- *(2) LocalLimit 3
            +- *(2) Project [k#228L]
               +- *(2) BroadcastHashJoin [k#228L], [k#229L], Inner, BuildRight, false
                  :- *(2) Filter isnotnull(k#228L)
                  :  +- *(2) ColumnarToRow
                  :     +- FileScan parquet default.left_table[k#228L] Batched: true, DataFilters: [isnotnull(k#228L)], Format: Parquet, Location: InMemoryFileIndex(1 paths)[file:/Users/chengsu/spark/sql/core/spark-warehouse/org.apache.spark.sq..., PartitionFilters: [], PushedFilters: [IsNotNull(k)], ReadSchema: struct<k:bigint>
                  +- BroadcastExchange HashedRelationBroadcastMode(List(input[0, bigint, false]),false), [id=#173]
                     +- *(1) Filter isnotnull(k#229L)
                        +- *(1) ColumnarToRow
                           +- FileScan parquet default.empty_right_table[k#229L] Batched: true, DataFilters: [isnotnull(k#229L)], Format: Parquet, Location: InMemoryFileIndex(1 paths)[file:/Users/chengsu/spark/sql/core/spark-warehouse/org.apache.spark.sq..., PartitionFilters: [], PushedFilters: [IsNotNull(k)], ReadSchema: struct<k:bigint>
```

Codegen failure - https://gist.github.com/c21/ea760c75b546d903247582be656d9d66 .

The uninitialized variable `_limit_counter_1` from `LocalLimitExec` is referenced in `ColumnarToRowExec`, but `BroadcastHashJoinExec` does not call `LocalLimitExec.doConsume()` to initialize the counter variable.

The fix is to move the counter variable initialization to `doProduce()`: in the whole-stage code-gen framework, `doProduce()` is guaranteed to be called whenever an upstream operator's `doProduce()`/`doConsume()` is called.
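
A hedged sketch of that idea (not the exact patch; `countTerm` is assumed to be the counter name already defined on the operator):
```scala
// Register the limit counter as whole-stage-codegen mutable state from doProduce(),
// which is always invoked for the stage, rather than from doConsume(), which a
// shortcutting operator such as BroadcastHashJoinExec may never call.
protected override def doProduce(ctx: CodegenContext): String = {
  // Keep the pre-chosen name so the downstream limit-not-reached checks still reference it.
  ctx.addMutableState(CodeGenerator.JAVA_INT, countTerm, forceInline = true, useFreshName = false)
  child.asInstanceOf[CodegenSupport].produce(ctx, this)
}
```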

Note: this only happens when AQE is disabled, because with AQE enabled the optimization rule [EliminateUnnecessaryJoin](https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/adaptive/EliminateUnnecessaryJoin.scala#L69) changes the whole query to an empty `LocalRelation` if the inner join's broadcast side is empty.

### Why are the changes needed?

Fix query failure.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Added unit test in `SQLQuerySuite.scala`.

Closes #31892 from c21/limit-fix.

Authored-by: Cheng Su <chengsu@fb.com>
Signed-off-by: Takeshi Yamamuro <yamamuro@apache.org>
2021-03-20 11:20:52 +09:00
tanel.kiis@gmail.com 620cae098c [SPARK-33122][SQL] Remove redundant aggregates in the Optimizer
### What changes were proposed in this pull request?

Added optimizer rule `RemoveRedundantAggregates`. It removes redundant aggregates from a query plan. A redundant aggregate is an aggregate whose only goal is to keep distinct values, while its parent aggregate would ignore duplicate values.
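
A toy illustration of such a redundancy (a hedged example, not from the PR): the inner `DISTINCT` only deduplicates, and the outer `GROUP BY` would deduplicate anyway, so the inner aggregate can be removed without changing the result.
```scala
// With the new rule, the optimized plan for this query should no longer contain the
// extra aggregate introduced by the inner DISTINCT.
spark.range(10).selectExpr("id % 3 AS c1").createOrReplaceTempView("t")
spark.sql("SELECT c1 FROM (SELECT DISTINCT c1 FROM t) GROUP BY c1").explain()
```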

The affected part of the query plan for TPCDS q87:

Before:
```
== Physical Plan ==
*(26) HashAggregate(keys=[], functions=[count(1)])
+- Exchange SinglePartition, true, [id=#785]
   +- *(25) HashAggregate(keys=[], functions=[partial_count(1)])
      +- *(25) HashAggregate(keys=[c_last_name#61, c_first_name#60, d_date#26], functions=[])
         +- *(25) HashAggregate(keys=[c_last_name#61, c_first_name#60, d_date#26], functions=[])
            +- *(25) HashAggregate(keys=[c_last_name#61, c_first_name#60, d_date#26], functions=[])
               +- *(25) HashAggregate(keys=[c_last_name#61, c_first_name#60, d_date#26], functions=[])
                  +- *(25) HashAggregate(keys=[c_last_name#61, c_first_name#60, d_date#26], functions=[])
                     +- Exchange hashpartitioning(c_last_name#61, c_first_name#60, d_date#26, 5), true, [id=#724]
                        +- *(24) HashAggregate(keys=[c_last_name#61, c_first_name#60, d_date#26], functions=[])
                           +- SortMergeJoin [coalesce(c_last_name#61, ), isnull(c_last_name#61), coalesce(c_first_name#60, ), isnull(c_first_name#60), coalesce(d_date#26, 0), isnull(d_date#26)], [coalesce(c_last_name#221, ), isnull(c_last_name#221), coalesce(c_first_name#220, ), isnull(c_first_name#220), coalesce(d_date#186, 0), isnull(d_date#186)], LeftAnti
                              :- ...
```

After:
```
== Physical Plan ==
*(26) HashAggregate(keys=[], functions=[count(1)])
+- Exchange SinglePartition, true, [id=#751]
   +- *(25) HashAggregate(keys=[], functions=[partial_count(1)])
      +- *(25) HashAggregate(keys=[c_last_name#61, c_first_name#60, d_date#26], functions=[])
         +- Exchange hashpartitioning(c_last_name#61, c_first_name#60, d_date#26, 5), true, [id=#694]
            +- *(24) HashAggregate(keys=[c_last_name#61, c_first_name#60, d_date#26], functions=[])
               +- SortMergeJoin [coalesce(c_last_name#61, ), isnull(c_last_name#61), coalesce(c_first_name#60, ), isnull(c_first_name#60), coalesce(d_date#26, 0), isnull(d_date#26)], [coalesce(c_last_name#221, ), isnull(c_last_name#221), coalesce(c_first_name#220, ), isnull(c_first_name#220), coalesce(d_date#186, 0), isnull(d_date#186)], LeftAnti
                  :- ...
```

### Why are the changes needed?

Performance improvements - few TPCDS queries have these kinds of duplicate aggregates.

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

UT

Benchmarks (sf=5):

OpenJDK 64-Bit Server VM 1.8.0_265-b01 on Linux 5.8.13-arch1-1
Intel(R) Core(TM) i5-6500 CPU @ 3.20GHz

| Query | Before  | After | Speedup |
| ------| ------- | ------| ------- |
| q14a | 44s | 44s | 1x |
| q14b | 41s | 41s | 1x |
| q38  | 6.5s | 5.9s | 1.1x |
| q87  | 7.2s | 6.8s | 1.1x |
| q14a-v2.7 | 55s | 53s | 1x |

Closes #30018 from tanelk/SPARK-33122.

Lead-authored-by: tanel.kiis@gmail.com <tanel.kiis@gmail.com>
Co-authored-by: Tanel Kiis <tanel.kiis@reach-u.com>
Signed-off-by: Takeshi Yamamuro <yamamuro@apache.org>
2021-03-20 11:16:39 +09:00
Liang-Chi Hsieh 7a8a600995 [SPARK-34776][SQL] Nested column pruning should not prune Window produced attributes
### What changes were proposed in this pull request?

This patch proposes to fix a bug related to `NestedColumnAliasing`. The root cause is that `Window` doesn't override `producedAttributes`, so the `NestedColumnAliasing` rule wrongly prunes attributes produced by `Window`.

The master and branch-3.1 both have this issue.

### Why are the changes needed?

It fixes a bug in nested column pruning.

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

Unit test.

Closes #31897 from viirya/SPARK-34776.

Authored-by: Liang-Chi Hsieh <viirya@gmail.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2021-03-19 11:44:02 -07:00
Max Gekk 089c3b77e1 [SPARK-34793][SQL] Prohibit saving of day-time and year-month intervals
### What changes were proposed in this pull request?
For all built-in datasources, prohibit saving of year-month and day-time intervals that were introduced by SPARK-27793. We plan to support saving of such types at the milestone 2, see SPARK-27790.

### Why are the changes needed?
To improve user experience with Spark SQL, and print nicer error message. Current error message might confuse users:
```
scala> Seq(java.time.Period.ofMonths(1)).toDF.write.mode("overwrite").json("/Users/maximgekk/tmp/123")
21/03/18 22:44:35 ERROR FileFormatWriter: Aborting job 8de402d7-ab69-4dc0-aa8e-14ef06bd2d6b.
org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 1.0 failed 1 times, most recent failure: Lost task 0.0 in stage 1.0 (TID 1) (192.168.1.66 executor driver): org.apache.spark.SparkException: Task failed while writing rows.
	at org.apache.spark.sql.errors.QueryExecutionErrors$.taskFailedWhileWritingRowsError(QueryExecutionErrors.scala:418)
	at org.apache.spark.sql.execution.datasources.FileFormatWriter$.executeTask(FileFormatWriter.scala:298)
	at org.apache.spark.sql.execution.datasources.FileFormatWriter$.$anonfun$write$15(FileFormatWriter.scala:211)
	at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
	at org.apache.spark.scheduler.Task.run(Task.scala:131)
	at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:498)
	at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1437)
	at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:501)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
	at java.lang.Thread.run(Thread.java:748)
Caused by: java.lang.RuntimeException: Failed to convert value 1 (class of class java.lang.Integer}) with the type of YearMonthIntervalType to JSON.
	at scala.sys.package$.error(package.scala:30)
	at org.apache.spark.sql.catalyst.json.JacksonGenerator.$anonfun$makeWriter$23(JacksonGenerator.scala:179)
	at org.apache.spark.sql.catalyst.json.JacksonGenerator.$anonfun$makeWriter$23$adapted(JacksonGenerator.scala:176)
```

### Does this PR introduce _any_ user-facing change?
Yes. After the changes, the example above:
```
scala> Seq(java.time.Period.ofMonths(1)).toDF.write.mode("overwrite").json("/Users/maximgekk/tmp/123")
org.apache.spark.sql.AnalysisException: Cannot save interval data type into external storage.
```

### How was this patch tested?
1. Checked nested intervals:
```
scala> spark.range(1).selectExpr("""struct(timestamp'2021-01-02 00:01:02' - timestamp'2021-01-01 00:00:00')""").write.mode("overwrite").parquet("/Users/maximgekk/tmp/123")
org.apache.spark.sql.AnalysisException: Cannot save interval data type into external storage.
scala> Seq(Seq(java.time.Period.ofMonths(1))).toDF.write.mode("overwrite").json("/Users/maximgekk/tmp/123")
org.apache.spark.sql.AnalysisException: Cannot save interval data type into external storage.
```
2. By running existing test suites:
```
$ build/sbt -Phive-2.3 -Phive-thriftserver "test:testOnly *DataSourceV2DataFrameSuite"
$ build/sbt -Phive-2.3 -Phive-thriftserver "test:testOnly *DataSourceV2SQLSuite"
```

Closes #31884 from MaxGekk/ban-save-intervals.

Authored-by: Max Gekk <max.gekk@gmail.com>
Signed-off-by: Max Gekk <max.gekk@gmail.com>
2021-03-19 18:47:53 +03:00
Hongyi Zhang 6f89cdfb0c [SPARK-34798][SQL][TESTS] Fix incorrect join condition
### What changes were proposed in this pull request?

The join condition `'a.attr == 'c.attr` compares the two attribute objects with plain Scala equality, which always evaluates to `false` here. We need to use `===` instead.
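
For illustration (a hedged sketch using the Catalyst test DSL, not code from this PR):
```scala
import org.apache.spark.sql.catalyst.dsl.expressions._

// `==` is plain Scala equality on the two attribute objects and yields a Boolean
// (false here), so the test silently gets a constant join condition.
val wrong = 'a.attr == 'c.attr
// `===` builds an EqualTo expression, which is what the join condition should be.
val right = 'a.attr === 'c.attr
```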

### Why are the changes needed?

Although the always-false join condition doesn't break the test, it is not what we expected. We should fix it to avoid future confusion.

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

UT

Closes #31890 from opensky142857/SPARK-34798.

Authored-by: Hongyi Zhang <hongyzhang@ebay.com>
Signed-off-by: Yuming Wang <yumwang@ebay.com>
2021-03-19 23:35:15 +08:00
Wenchen Fan 4b4f8e2a25 [SPARK-34558][SQL][FOLLOWUP] Use final Hadoop conf to instantiate FileSystem in SharedState
### What changes were proposed in this pull request?

This is a follow-up of https://github.com/apache/spark/pull/31671

https://github.com/apache/spark/pull/31671 introduced an unexpected behavior change: it uses a different Hadoop conf (`sparkContext.hadoopConfiguration`) to instantiate the `FileSystem` that is used to qualify the warehouse path. Before https://github.com/apache/spark/pull/31671, the Hadoop conf used to instantiate the `FileSystem` was `session.sessionState.newHadoopConf()`.

More specifically, `session.sessionState.newHadoopConf()` has more conf entries:
1. it includes configs from `SharedState.initialConfigs`
2. it includes configs from `sparkContext.conf`

This PR updates `SharedState` to use the final Hadoop conf to instantiate `FileSystem`.
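
A hedged sketch of the difference (not the exact `SharedState` code; `warehousePath` and `session` are placeholders):
```scala
import org.apache.hadoop.fs.Path

// Qualify the warehouse path with a FileSystem built from the fully merged Hadoop conf
// (session.sessionState.newHadoopConf()), not from sparkContext.hadoopConfiguration.
val hadoopConf = session.sessionState.newHadoopConf()
val warehouse = new Path(warehousePath)
val qualifiedWarehouse = warehouse.getFileSystem(hadoopConf).makeQualified(warehouse)
```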

### Why are the changes needed?

fix behavior change

### Does this PR introduce _any_ user-facing change?

yes, the behavior will be the same as before https://github.com/apache/spark/pull/31671

### How was this patch tested?

Manually checked the log of `FileSystem` and verified the passed-in configs.

Closes #31868 from cloud-fan/followup.

Lead-authored-by: Wenchen Fan <wenchen@databricks.com>
Co-authored-by: Wenchen Fan <cloud0fan@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2021-03-19 22:02:15 +08:00
ulysses-you 58509565f8 [SPARK-34772][SQL] RebaseDateTime loadRebaseRecords should use Spark classloader instead of context
### What changes were proposed in this pull request?

Change context classloader to Spark classloader at `RebaseDateTime.loadRebaseRecords`

### Why are the changes needed?

With custom `spark.sql.hive.metastore.version` and `spark.sql.hive.metastore.jars`:

Spark uses the date formatter in `HiveShim` that converts a `date` to a `string`. If we set `spark.sql.legacy.timeParserPolicy=LEGACY` and the partition type is `date`, the `RebaseDateTime` code is invoked. At that moment, if `RebaseDateTime` is being initialized for the first time, the context class loader is `IsolatedClientLoader`, and errors like the following are thrown:

```
java.lang.IllegalArgumentException: argument "src" is null
  at com.fasterxml.jackson.databind.ObjectMapper._assertNotNull(ObjectMapper.java:4413)
  at com.fasterxml.jackson.databind.ObjectMapper.readValue(ObjectMapper.java:3157)
  at com.fasterxml.jackson.module.scala.ScalaObjectMapper.readValue(ScalaObjectMapper.scala:187)
  at com.fasterxml.jackson.module.scala.ScalaObjectMapper.readValue$(ScalaObjectMapper.scala:186)
  at org.apache.spark.sql.catalyst.util.RebaseDateTime$$anon$1.readValue(RebaseDateTime.scala:267)
  at org.apache.spark.sql.catalyst.util.RebaseDateTime$.loadRebaseRecords(RebaseDateTime.scala:269)
  at org.apache.spark.sql.catalyst.util.RebaseDateTime$.<init>(RebaseDateTime.scala:291)
  at org.apache.spark.sql.catalyst.util.RebaseDateTime$.<clinit>(RebaseDateTime.scala)
  at org.apache.spark.sql.catalyst.util.DateTimeUtils$.toJavaDate(DateTimeUtils.scala:109)
  at org.apache.spark.sql.catalyst.util.LegacyDateFormatter.format(DateFormatter.scala:95)
  at org.apache.spark.sql.catalyst.util.LegacyDateFormatter.format$(DateFormatter.scala:94)
  at org.apache.spark.sql.catalyst.util.LegacySimpleDateFormatter.format(DateFormatter.scala:138)
  at org.apache.spark.sql.hive.client.Shim_v0_13$ExtractableLiteral$1$.unapply(HiveShim.scala:661)
  at org.apache.spark.sql.hive.client.Shim_v0_13.convert$1(HiveShim.scala:785)
  at org.apache.spark.sql.hive.client.Shim_v0_13.$anonfun$convertFilters$4(HiveShim.scala:826)
```

```
java.lang.NoClassDefFoundError: Could not initialize class org.apache.spark.sql.catalyst.util.RebaseDateTime$
  at org.apache.spark.sql.catalyst.util.DateTimeUtils$.toJavaDate(DateTimeUtils.scala:109)
  at org.apache.spark.sql.catalyst.util.LegacyDateFormatter.format(DateFormatter.scala:95)
  at org.apache.spark.sql.catalyst.util.LegacyDateFormatter.format$(DateFormatter.scala:94)
  at org.apache.spark.sql.catalyst.util.LegacySimpleDateFormatter.format(DateFormatter.scala:138)
  at org.apache.spark.sql.hive.client.Shim_v0_13$ExtractableLiteral$1$.unapply(HiveShim.scala:661)
  at org.apache.spark.sql.hive.client.Shim_v0_13.convert$1(HiveShim.scala:785)
  at org.apache.spark.sql.hive.client.Shim_v0_13.$anonfun$convertFilters$4(HiveShim.scala:826)
  at scala.collection.immutable.Stream.flatMap(Stream.scala:493)
  at org.apache.spark.sql.hive.client.Shim_v0_13.convertFilters(HiveShim.scala:826)
  at org.apache.spark.sql.hive.client.Shim_v0_13.getPartitionsByFilter(HiveShim.scala:848)
  at org.apache.spark.sql.hive.client.HiveClientImpl.$anonfun$getPartitionsByFilter$1(HiveClientImpl.scala:749)
  at org.apache.spark.sql.hive.client.HiveClientImpl.$anonfun$withHiveState$1(HiveClientImpl.scala:291)
  at org.apache.spark.sql.hive.client.HiveClientImpl.liftedTree1$1(HiveClientImpl.scala:224)
  at org.apache.spark.sql.hive.client.HiveClientImpl.retryLocked(HiveClientImpl.scala:223)
  at org.apache.spark.sql.hive.client.HiveClientImpl.withHiveState(HiveClientImpl.scala:273)
  at org.apache.spark.sql.hive.client.HiveClientImpl.getPartitionsByFilter(HiveClientImpl.scala:747)
  at org.apache.spark.sql.hive.HiveExternalCatalog.$anonfun$listPartitionsByFilter$1(HiveExternalCatalog.scala:1273)
```

The reproduce steps:
1. Set custom `spark.sql.hive.metastore.version` and `spark.sql.hive.metastore.jars`.
2. `CREATE TABLE t (c int) PARTITIONED BY (p date)`
3. `SET spark.sql.legacy.timeParserPolicy=LEGACY`
4. `SELECT * FROM t WHERE p='2021-01-01'`

### Does this PR introduce _any_ user-facing change?

Yes, bug fix.

### How was this patch tested?

pass `org.apache.spark.sql.catalyst.util.RebaseDateTimeSuite` and add new unit test to `HiveSparkSubmitSuite.scala`.

Closes #31864 from ulysses-you/SPARK-34772.

Authored-by: ulysses-you <ulyssesyou18@gmail.com>
Signed-off-by: Yuming Wang <yumwang@ebay.com>
2021-03-19 12:51:43 +08:00
Max Gekk a48b2086dd [SPARK-34761][SQL] Support add/subtract of a day-time interval to/from a timestamp
### What changes were proposed in this pull request?
Support `timestamp +/- day-time interval`. In the PR, I propose to extend the `TimeAdd` expression and support `DayTimeIntervalType` as the `interval` parameter. The expression invokes the new method `DateTimeUtils.timestampAddDayTime()`, which splits the input day-time interval into `days` and a `microsecond adjustment` of a day, and adds the `days` (and the microseconds) to a local timestamp derived from the given timestamp at the given time zone. The resulting local timestamp is converted back to an offset in microseconds since the epoch.

I also updated the rules that handle `CalendarIntervalType` and produce `TimeAdd` to take into account the new type `DayTimeIntervalType` for the `interval` parameter of `TimeAdd`.
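
An example of the operation this enables (a hedged illustration; the literal follows the ANSI day-time interval syntax):
```scala
// Add one day, two hours, three minutes and four seconds to a timestamp literal.
spark.sql(
  "SELECT timestamp'2021-03-19 10:00:00' + INTERVAL '1 02:03:04' DAY TO SECOND AS ts"
).show(truncate = false)
```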

### Why are the changes needed?
To conform the ANSI SQL standard which requires to support such operation over timestamps and intervals:
<img width="811" alt="Screenshot 2021-03-12 at 11 36 14" src="https://user-images.githubusercontent.com/1580697/111081674-865d4900-8515-11eb-86c8-3538ecaf4804.png">

### Does this PR introduce _any_ user-facing change?
Should not since new intervals have not been released yet.

### How was this patch tested?
By running new tests:
```
$ build/sbt "test:testOnly *DateTimeUtilsSuite"
$ build/sbt "test:testOnly *DateExpressionsSuite"
$ build/sbt "test:testOnly *ColumnExpressionSuite"
```

Closes #31855 from MaxGekk/timestamp-add-day-time-interval.

Authored-by: Max Gekk <max.gekk@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2021-03-19 04:02:34 +00:00
Karuppayya Rajendran 0a58029d52 [SPARK-31897][SQL] Enable codegen for GenerateExec
### What changes were proposed in this pull request?
Enabling codegen for GenerateExec

### Why are the changes needed?
To leverage code generation for Generators

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
- UT tests added

### Benchmark
```
case class Data(value1: Float, value2: Map[String, String], value3: String)
val path = "<path>"

val numRecords = Seq(10000000, 100000000)
numRecords.map {
  recordCount =>
    import java.util.concurrent.TimeUnit.NANOSECONDS

    val srcDF = spark.range(recordCount).map {
      x => Data(x.toFloat, Map(x.toString -> x.toString ), s"value3$x")
    }.select($"value1", explode($"value2"), $"value3")
    val start = System.nanoTime()
    srcDF
      .write
      .mode("overwrite")
      .parquet(s"$path/$recordCount")
    val end = System.nanoTime()
    val diff = end - start
    (recordCount, NANOSECONDS.toMillis(diff))
}
```
**With codegen**:
```
res0: Seq[(Int, Long)] = List((10000000,13989), (100000000,129625))
```
**Without codegen**:
```
res0: Seq[(Int, Long)] = List((10000000,15736), (100000000,150399))
```

Closes #28715 from karuppayya/SPARK-31897.

Lead-authored-by: Karuppayya Rajendran <karuppayya1990@gmail.com>
Co-authored-by: Karuppayya Rajendran <karuppayya.rajendran@apple.com>
Signed-off-by: Liang-Chi Hsieh <viirya@gmail.com>
2021-03-18 20:50:28 -07:00
Kousuke Saruta 07ee73234f [SPARK-34747][SQL][DOCS] Add virtual operators to the built-in function document
### What changes were proposed in this pull request?

This PR fixes an issue where virtual operators (`||`, `!=`, `<>`, `between` and `case`) are absent from the Spark SQL built-in functions document.

### Why are the changes needed?

The document should explain all the supported built-in operators.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Built the document with `SKIP_SCALADOC=1 SKIP_RDOC=1 SKIP_PYTHONDOC=1 bundler exec jekyll build` and then, confirmed the document.

![neq1](https://user-images.githubusercontent.com/4736016/111192859-e2e76380-85fc-11eb-89c9-75916a5e856a.png)
![neq2](https://user-images.githubusercontent.com/4736016/111192874-e7ac1780-85fc-11eb-9a9b-c504265b373f.png)
![between](https://user-images.githubusercontent.com/4736016/111192898-eda1f880-85fc-11eb-992d-cf80c544ec27.png)
![case](https://user-images.githubusercontent.com/4736016/111192918-f266ac80-85fc-11eb-9306-5dbc413a0cdb.png)
![double_pipe](https://user-images.githubusercontent.com/4736016/111192952-fb577e00-85fc-11eb-932e-385e5c2a5205.png)

Closes #31841 from sarutak/builtin-op-doc.

Authored-by: Kousuke Saruta <sarutak@oss.nttdata.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2021-03-19 10:19:26 +09:00
Cheng Su 8207e2f65c [SPARK-34781][SQL] Eliminate LEFT SEMI/ANTI joins to its left child side in AQE
### What changes were proposed in this pull request?

In `EliminateJoinToEmptyRelation.scala`, we can extend the rule to cover more cases for LEFT SEMI and LEFT ANTI joins:

* Join is left semi join, join right side is non-empty and condition is empty. Eliminate join to its left side.
* Join is left anti join, join right side is empty. Eliminate join to its left side.

Given that we eliminate the join to its left side here, rename the current optimization rule to `EliminateUnnecessaryJoin`.
In addition, change to use `checkRowCount()` to check the runtime row count instead of relying on `EmptyHashedRelation`, so the rule also covers `BroadcastNestedLoopJoin` (whose broadcast side is an `Array[InternalRow]`, not a `HashedRelation`).
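
In pseudocode, the extended cases correspond to a pattern like the sketch below (hedged; the helper predicates stand in for the `checkRowCount()`-based checks and are not the actual rule code):
```scala
import org.apache.spark.sql.catalyst.plans.{LeftAnti, LeftSemi}
import org.apache.spark.sql.catalyst.plans.logical.{Join, LogicalPlan}

// Placeholders for the runtime row-count checks on the materialized right side.
def rightSideKnownNonEmpty(j: Join): Boolean = ???
def rightSideKnownEmpty(j: Join): Boolean = ???

def eliminateUnnecessaryJoin(plan: LogicalPlan): LogicalPlan = plan transform {
  // A LEFT SEMI join with no condition and a non-empty right side returns exactly its left child.
  case j @ Join(left, _, LeftSemi, None, _) if rightSideKnownNonEmpty(j) => left
  // A LEFT ANTI join with an empty right side returns exactly its left child.
  case j @ Join(left, _, LeftAnti, _, _) if rightSideKnownEmpty(j) => left
}
```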

### Why are the changes needed?

Cover more join cases, and improve query performance for affected queries.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Added unit tests in `AdaptiveQueryExecSuite.scala`.

Closes #31873 from c21/aqe-join.

Authored-by: Cheng Su <chengsu@fb.com>
Signed-off-by: Takeshi Yamamuro <yamamuro@apache.org>
2021-03-19 09:41:52 +09:00
yi.wu d99135b66a [SPARK-34741][SQL] MergeIntoTable should avoid ambiguous reference in UpdateAction
### What changes were proposed in this pull request?

This PR proposes to deduplicate the source table when there're conflicting attributes between the target table and the source table.

### Why are the changes needed?

When resolving the `UpdateAction`, which could reference attributes from both target and source tables,  Spark should know clearly where the attribute comes from when there're conflicting attributes instead of picking up a random one.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Added a unit test and updated existing tests.

Closes #31835 from Ngone51/dedup-MergeIntoTable.

Authored-by: yi.wu <yi.wu@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2021-03-18 15:54:41 +08:00
Luan 25e7d1ceee [SPARK-34728][SQL] Remove all SQLConf.get if extends from SQLConfHelper
### What changes were proposed in this pull request?

Replace all `SQLConf.get` references with `conf` in classes that extend `SQLConfHelper`.

### Why are the changes needed?

Clean up code.

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

Existing unit tests.

Closes #31822 from leoluan2009/SPARK-34728.

Authored-by: Luan <luanxuedong2009@gmail.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2021-03-18 15:04:41 +09:00
yi.wu 4d90c5dc0e [SPARK-34087][SQL] Fix memory leak of ExecutionListenerBus
### What changes were proposed in this pull request?

This PR proposes an alternative way to fix the memory leak of `ExecutionListenerBus`, which automatically cleans it up.

Basically, the idea is to add `registerSparkListenerForCleanup` to `ContextCleaner`, so we can remove the `ExecutionListenerBus` from `LiveListenerBus` when the `SparkSession` is GC'ed.
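
A hedged sketch of that wiring (the exact signature of `registerSparkListenerForCleanup` is an assumption inferred from this description, not the real API):
```scala
// When the session wires up its ExecutionListenerBus, also hand the bus to the
// ContextCleaner, keyed by the session, so it is removed from the shared LiveListenerBus
// once the session object is garbage collected.
sparkContext.cleaner.foreach { cleaner =>
  cleaner.registerSparkListenerForCleanup(sparkSession, executionListenerBus)
}
```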

On the other hand, to make the `SparkSession` GC-able, we need to get rid of the reference of `SparkSession` in `ExecutionListenerBus`. Therefore, we introduced the `sessionUUID`, which is a unique identifier for SparkSession, to replace the  `SparkSession` object.

Note that, the proposal wouldn't take effect when `spark.cleaner.referenceTracking=false` since it depends on `ContextCleaner`.

### Why are the changes needed?

Fix the memory leak caused by `ExecutionListenerBus` mentioned in SPARK-34087.

### Does this PR introduce _any_ user-facing change?

Yes, save memory for users.

### How was this patch tested?

Added unit test.

Closes #31839 from Ngone51/fix-mem-leak-of-ExecutionListenerBus.

Authored-by: yi.wu <yi.wu@databricks.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2021-03-18 13:27:03 +09:00
Kousuke Saruta c5cadfefdf [SPARK-34762][BUILD] Fix the build failure with Scala 2.13 which is related to commons-cli
### What changes were proposed in this pull request?

This PR fixes the build failure with Scala 2.13 which is related to `commons-cli`.
For the last few days, the build with Scala 2.13 on GitHub Actions has kept failing with an error message like the following.
```
[error] /home/runner/work/spark/spark/sql/hive-thriftserver/src/main/java/org/apache/hive/service/server/HiveServer2.java:26:1:  error: package org.apache.commons.cli does not exist
[error] import org.apache.commons.cli.GnuParser;
```
The reason is that `mvn help` in `change-scala-version.sh` downloads the POM file of `commons-cli` but doesn't download the JAR file, leading to the build failure.

This PR also adds `commons-cli` to the dependencies explicitly because HiveThriftServer depends on it.

### Why are the changes needed?

Expect to fix the build failure with Scala 2.13.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

I confirmed that build successfully finishes with Scala 2.13 on my laptop.
```
find ~/.m2 -name commons-cli -exec rm -rf {} \;
find ~/.ivy2 -name commons-cli -exec rm -rf {} \;
find ~/.cache/ -name commons-cli -exec rm -rf {} \; // For Linux
find ~/Library/Caches -name commons-cli -exec rm -rf {} \; // For macOS

dev/change-scala-version 2.13
./build/sbt -Pyarn -Pmesos -Pkubernetes -Phive -Phive-thriftserver -Phadoop-cloud -Pkinesis-asl -Pdocker-integration-tests -Pkubernetes-integration-tests -Pspark-ganglia-lgpl -Pscala-2.13 clean compile test:compile
```

Closes #31862 from sarutak/commons-cli.

Authored-by: Kousuke Saruta <sarutak@oss.nttdata.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2021-03-18 12:31:50 +09:00
gengjiaan 569fb133d0 [SPARK-33602][SQL] Group exception messages in execution/datasources
### What changes were proposed in this pull request?
This PR groups exception messages in `/core/src/main/scala/org/apache/spark/sql/execution/datasources`.

### Why are the changes needed?
It will largely help with the standardization of error messages and their maintenance.

### Does this PR introduce _any_ user-facing change?
No. Error messages remain unchanged.

### How was this patch tested?
No new tests - pass all original tests to make sure it doesn't break any existing behavior.

Closes #31757 from beliefer/SPARK-33602.

Authored-by: gengjiaan <gengjiaan@360.cn>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2021-03-17 14:04:02 +00:00
Wenchen Fan 9f7b0a035b [SPARK-34758][SQL] Simplify Analyzer.resolveLiteralFunction
### What changes were proposed in this pull request?

This PR simplifies `Analyzer.resolveLiteralFunction` to always create the `Alias`. The caller side will remove the `Alias` if it's not necessary.

### Why are the changes needed?

code simplification.

### Does this PR introduce _any_ user-facing change?

no

### How was this patch tested?

existing tests

Closes #31844 from cloud-fan/minor.

Authored-by: Wenchen Fan <wenchen@databricks.com>
Signed-off-by: Takeshi Yamamuro <yamamuro@apache.org>
2021-03-17 21:26:44 +09:00
Wenchen Fan bf4570b43d [SPARK-34749][SQL] Simplify ResolveCreateNamedStruct
### What changes were proposed in this pull request?

This is a follow-up of https://github.com/apache/spark/pull/31808 and simplifies its fix to one line (excluding comments).

### Why are the changes needed?

code simplification

### Does this PR introduce _any_ user-facing change?

no

### How was this patch tested?

N/A

Closes #31843 from cloud-fan/simplify.

Authored-by: Wenchen Fan <wenchen@databricks.com>
Signed-off-by: Takeshi Yamamuro <yamamuro@apache.org>
2021-03-17 21:21:54 +09:00
ulysses-you 48637a9d43 [SPARK-34766][SQL] Do not capture maven config for views
### What changes were proposed in this pull request?

Skip capturing the Maven repo config for views.

### Why are the changes needed?

Due to a bad network, we always use a third-party Maven repo to run tests, e.g.:
```
build/sbt "test:testOnly *SQLQueryTestSuite" -Dspark.sql.maven.additionalRemoteRepositories=xxxxx
```

It fails with an error message like this:
```
[info] - show-tblproperties.sql *** FAILED *** (128 milliseconds)
[info] show-tblproperties.sql
[info] Expected "...rredTempViewNames [][]", but got "...rredTempViewNames [][
[info] view.sqlConfig.spark.sql.maven.additionalRemoteRepositories xxxxx]" Result did not match for query #6
[info] SHOW TBLPROPERTIES view (SQLQueryTestSuite.scala:464)
```

It's not necessary to capture the Maven config in the view since it's a session-level config.
 

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Manual test passes:
```
build/sbt "test:testOnly *SQLQueryTestSuite" -Dspark.sql.maven.additionalRemoteRepositories=xxx
```

Closes #31856 from ulysses-you/skip-maven-config.

Authored-by: ulysses-you <ulyssesyou18@gmail.com>
Signed-off-by: Kent Yao <yao@apache.org>
2021-03-17 20:12:18 +08:00
HyukjinKwon 385f1e8f5d [SPARK-34768][SQL] Respect the default input buffer size in Univocity
### What changes were proposed in this pull request?

This PR proposes to follow Univocity's default input buffer size.

### Why are the changes needed?

- Firstly, it's best to trust their judgement on the default values. Also 128 is too low.
- Default values arguably have more test coverage in Univocity.
- It will also fix https://github.com/uniVocity/univocity-parsers/issues/449
- ^ is a regression compared to Spark 2.4

### Does this PR introduce _any_ user-facing change?

No. In addition, It fixes a regression.

### How was this patch tested?

Manually tested, and added a unit test.

Closes #31858 from HyukjinKwon/SPARK-34768.

Authored-by: HyukjinKwon <gurwls223@apache.org>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2021-03-17 19:55:49 +09:00
Wenchen Fan 1a4971d8a1 [SPARK-34770][SQL] InMemoryCatalog.tableExists should not fail if database doesn't exist
### What changes were proposed in this pull request?

This PR updates `InMemoryCatalog.tableExists` to return false if database doesn't exist, instead of failing. The new behavior is consistent with `HiveExternalCatalog` which is used in production, so this bug mostly only affects tests.

### Why are the changes needed?

bug fix

### Does this PR introduce _any_ user-facing change?

no

### How was this patch tested?

a new test

Closes #31860 from cloud-fan/catalog.

Authored-by: Wenchen Fan <wenchen@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2021-03-17 16:36:50 +08:00
Kent Yao 115f777cb0 [SPARK-21449][SQL][FOLLOWUP] Avoid log undesirable IllegalStateException when state close
### What changes were proposed in this pull request?

`TmpOutputFile` and `TmpErrOutputFile` are registered in `o.a.h.u.ShutdownHookManager` during creation. `state.close()` deletes them if they are not null and tries to remove them from `o.a.h.u.ShutdownHookManager`, which causes an `IllegalStateException` when we call it from our own ShutdownHookManager too.
In this PR, we delete them ahead of time with a high-priority hook in Spark and set them to null to bypass the deletion and cancellation in `state.close()`.

### Why are the changes needed?

With or without this PR, the deletion of these files is not affected; we just mute an undesirable error log here.

### Does this PR introduce _any_ user-facing change?

no, this is a follow-up

### How was this patch tested?

#### The undesirable error is gone
```scala
spark-sql> 21/03/16 18:41:31 ERROR Utils: Uncaught exception in thread shutdown-hook-0
java.lang.IllegalStateException: Shutdown in progress, cannot cancel a deleteOnExit
	at org.apache.hive.common.util.ShutdownHookManager.cancelDeleteOnExit(ShutdownHookManager.java:106)
	at org.apache.hadoop.hive.common.FileUtils.deleteTmpFile(FileUtils.java:861)
	at org.apache.hadoop.hive.ql.session.SessionState.deleteTmpErrOutputFile(SessionState.java:325)
	at org.apache.hadoop.hive.ql.session.SessionState.dropSessionPaths(SessionState.java:829)
	at org.apache.hadoop.hive.ql.session.SessionState.close(SessionState.java:1585)
	at org.apache.hadoop.hive.cli.CliSessionState.close(CliSessionState.java:66)
	at org.apache.spark.sql.hive.client.HiveClientImpl.closeState(HiveClientImpl.scala:172)
	at org.apache.spark.sql.hive.client.HiveClientImpl.$anonfun$new$1(HiveClientImpl.scala:175)
	at org.apache.spark.util.SparkShutdownHook.run(ShutdownHookManager.scala:214)
	at org.apache.spark.util.SparkShutdownHookManager.$anonfun$runAll$2(ShutdownHookManager.scala:188)
	at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23)
	at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1994)
	at org.apache.spark.util.SparkShutdownHookManager.$anonfun$runAll$1(ShutdownHookManager.scala:188)
	at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23)
	at scala.util.Try$.apply(Try.scala:213)
	at org.apache.spark.util.SparkShutdownHookManager.runAll(ShutdownHookManager.scala:188)
	at org.apache.spark.util.SparkShutdownHookManager$$anon$2.run(ShutdownHookManager.scala:178)
	at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
	at java.util.concurrent.FutureTask.run(FutureTask.java:266)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
	at java.lang.Thread.run(Thread.java:748)
(python)  ✘ kentyaohulk  ~/Downloads/spark/spark-3.2.0-SNAPSHOT-bin-20210316  cd ..
(python)  kentyaohulk  ~/Downloads/spark  tar zxf spark-3.2.0-SNAPSHOT-bin-20210316.tgz
(python)  kentyaohulk  ~/Downloads/spark  cd -
~/Downloads/spark/spark-3.2.0-SNAPSHOT-bin-20210316
(python)  kentyaohulk  ~/Downloads/spark/spark-3.2.0-SNAPSHOT-bin-20210316  bin/spark-sql --conf spark.local.dir=./local --conf spark.hive.exec.local.scratchdir=./local
21/03/16 18:42:15 WARN Utils: Your hostname, hulk.local resolves to a loopback address: 127.0.0.1; using 10.242.189.214 instead (on interface en0)
21/03/16 18:42:15 WARN Utils: Set SPARK_LOCAL_IP if you need to bind to another address
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
21/03/16 18:42:15 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
21/03/16 18:42:16 WARN SparkConf: Note that spark.local.dir will be overridden by the value set by the cluster manager (via SPARK_LOCAL_DIRS in mesos/standalone/kubernetes and LOCAL_DIRS in YARN).
21/03/16 18:42:18 WARN HiveConf: HiveConf of name hive.stats.jdbc.timeout does not exist
21/03/16 18:42:18 WARN HiveConf: HiveConf of name hive.stats.retries.wait does not exist
21/03/16 18:42:19 WARN ObjectStore: Version information not found in metastore. hive.metastore.schema.verification is not enabled so recording the schema version 2.3.0
21/03/16 18:42:19 WARN ObjectStore: setMetaStoreSchemaVersion called but recording version is disabled: version = 2.3.0, comment = Set by MetaStore kentyao127.0.0.1
Spark master: local[*], Application Id: local-1615891336877
spark-sql> %
```

#### and the deletion is still fine

```shell
kentyaohulk  ~/Downloads/spark/spark-3.2.0-SNAPSHOT-bin-20210316 
ls -al local
total 0
drwxr-xr-x   7 kentyao  staff  224  3 16 18:42 .
drwxr-xr-x  19 kentyao  staff  608  3 16 18:42 ..
drwx------   2 kentyao  staff   64  3 16 18:42 16cc5238-e25e-4c0f-96ef-0c4bdecc7e51
-rw-r--r--   1 kentyao  staff    0  3 16 18:42 16cc5238-e25e-4c0f-96ef-0c4bdecc7e51219959790473242539.pipeout
-rw-r--r--   1 kentyao  staff    0  3 16 18:42 16cc5238-e25e-4c0f-96ef-0c4bdecc7e518816377057377724129.pipeout
drwxr-xr-x   2 kentyao  staff   64  3 16 18:42 blockmgr-37a52ad2-eb56-43a5-8803-8f58d08fe9ad
drwx------   3 kentyao  staff   96  3 16 18:42 spark-101971df-f754-47c2-8764-58c45586be7e
 kentyaohulk  ~/Downloads/spark/spark-3.2.0-SNAPSHOT-bin-20210316  ls -al local
total 0
drwxr-xr-x   2 kentyao  staff   64  3 16 19:22 .
drwxr-xr-x  19 kentyao  staff  608  3 16 18:42 ..
 kentyaohulk  ~/Downloads/spark/spark-3.2.0-SNAPSHOT-bin-20210316 
```

Closes #31850 from yaooqinn/followup.

Authored-by: Kent Yao <yao@apache.org>
Signed-off-by: Kent Yao <yao@apache.org>
2021-03-17 15:21:23 +08:00
Yuming Wang c234c5b5f1 [SPARK-34575][SQL] Push down limit through window when partitionSpec is empty
### What changes were proposed in this pull request?

Push down limit through `Window` when the partitionSpec of all window functions is empty and the same order is used. This is a real case from production:

![image](https://user-images.githubusercontent.com/5399861/109457143-3900c680-7a95-11eb-9078-806b041175c2.png)

This PR supports 2 cases:
1. All window functions have same orderSpec:
   ```sql
   SELECT *, ROW_NUMBER() OVER(ORDER BY a) AS rn, RANK() OVER(ORDER BY a) AS rk FROM t1 LIMIT 5;
   == Optimized Logical Plan ==
   Window [row_number() windowspecdefinition(a#9L ASC NULLS FIRST, specifiedwindowframe(RowFrame,          unboundedpreceding$(), currentrow$())) AS rn#4, rank(a#9L) windowspecdefinition(a#9L ASC NULLS FIRST, specifiedwindowframe(RowFrame, unboundedpreceding$(), currentrow$())) AS rk#5], [a#9L ASC NULLS FIRST]
   +- GlobalLimit 5
      +- LocalLimit 5
         +- Sort [a#9L ASC NULLS FIRST], true
            +- Relation default.t1[A#9L,B#10L,C#11L] parquet
   ```
2. There is a window function with a different orderSpec:
   ```sql
   SELECT a, ROW_NUMBER() OVER(ORDER BY a) AS rn, RANK() OVER(ORDER BY b DESC) AS rk FROM t1 LIMIT 5;
   == Optimized Logical Plan ==
   Project [a#9L, rn#4, rk#5]
   +- Window [rank(b#10L) windowspecdefinition(b#10L DESC NULLS LAST, specifiedwindowframe(RowFrame, unboundedpreceding$(), currentrow$())) AS rk#5], [b#10L DESC NULLS LAST]
      +- GlobalLimit 5
         +- LocalLimit 5
            +- Sort [b#10L DESC NULLS LAST], true
               +- Window [row_number() windowspecdefinition(a#9L ASC NULLS FIRST, specifiedwindowframe(RowFrame, unboundedpreceding$(), currentrow$())) AS rn#4], [a#9L ASC NULLS FIRST]
                  +- Project [a#9L, b#10L]
                     +- Relation default.t1[A#9L,B#10L,C#11L] parquet
   ```

### Why are the changes needed?

Improve query performance.

```scala
spark.range(500000000L).selectExpr("id AS a", "id AS b").write.saveAsTable("t1")
spark.sql("SELECT *, ROW_NUMBER() OVER(ORDER BY a) AS rowId FROM t1 LIMIT 5").show
```

Before this pr | After this pr
-- | --
![image](https://user-images.githubusercontent.com/5399861/109456919-c68fe680-7a94-11eb-89ca-67ec03267158.png) | ![image](https://user-images.githubusercontent.com/5399861/109456927-cd1e5e00-7a94-11eb-9866-d76b2665caea.png)

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Unit test.

Closes #31691 from wangyum/SPARK-34575.

Authored-by: Yuming Wang <yumwang@ebay.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2021-03-17 07:16:10 +00:00
Gengliang Wang 143303147b [SPARK-34742][SQL] ANSI mode: Abs throws exception if input is out of range
### What changes were proposed in this pull request?

For the following cases, ABS should throw exceptions since the results are out of the range of the result data types in ANSI mode.
```
SELECT abs(${Int.MinValue});
SELECT abs(${Long.MinValue});
```
### Why are the changes needed?

Better ANSI compliance

### Does this PR introduce _any_ user-facing change?

Yes, Abs throws an exception if input is out of range in ANSI mode

### How was this patch tested?

Unit test

Closes #31836 from gengliangwang/ansiAbs.

Authored-by: Gengliang Wang <gengliang.wang@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2021-03-17 06:57:25 +00:00
Terry Kim 387d866244 [SPARK-34699][SQL] 'CREATE OR REPLACE TEMP VIEW USING' should uncache correctly
### What changes were proposed in this pull request?

This PR proposes:
  1. `CREATE OR REPLACE TEMP VIEW USING` should use `TemporaryViewRelation` to store temp views.
  2. By doing #1, it fixes the issue where the temp view being replaced is not uncached.

### Why are the changes needed?

This is a part of an ongoing work to wrap all the temporary views with `TemporaryViewRelation`: [SPARK-34698](https://issues.apache.org/jira/browse/SPARK-34698).

This also fixes a bug where the temp view being replaced is not uncached.
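
An illustrative sequence (hedged; the Parquet paths are placeholders, not from the PR):
```scala
// Write two small Parquet datasets to serve as the view's source.
spark.range(3).write.mode("overwrite").parquet("/tmp/p1")
spark.range(5).write.mode("overwrite").parquet("/tmp/p2")

spark.sql("CREATE OR REPLACE TEMPORARY VIEW v USING parquet OPTIONS (path '/tmp/p1')")
spark.sql("CACHE TABLE v")
// Before this fix, the cached data for the old definition of `v` was left behind here;
// with the fix, replacing the view also uncaches it.
spark.sql("CREATE OR REPLACE TEMPORARY VIEW v USING parquet OPTIONS (path '/tmp/p2')")
```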

### Does this PR introduce _any_ user-facing change?

Yes, the temp view being replaced with `CREATE OR REPLACE TEMP VIEW USING` is correctly uncached if the temp view is cached.

### How was this patch tested?

Added new tests.

Closes #31825 from imback82/create_temp_view_using.

Authored-by: Terry Kim <yuminkim@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2021-03-17 06:04:07 +00:00
Wenchen Fan af553735b1 [SPARK-34504][SQL] Avoid unnecessary resolving of SQL temp views for DDL commands
### What changes were proposed in this pull request?

DDL commands like DROP VIEW don't really need to resolve the view (i.e. parse and analyze the view SQL text); they just need to get the view metadata.

This PR fixes the rule `ResolveTempViews` to only resolve the temp view for `UnresolvedRelation`. This also fixes a bug for DROP VIEW, as previously it tried to resolve the view and failed to drop invalid views.

### Why are the changes needed?

bug fix

### Does this PR introduce _any_ user-facing change?

no

### How was this patch tested?

new test

Closes #31853 from cloud-fan/view-resolve.

Authored-by: Wenchen Fan <wenchen@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2021-03-17 11:16:51 +08:00
Wenchen Fan cef6650048 Revert "[SPARK-33428][SQL] Conv UDF use BigInt to avoid Long value overflow"
This reverts commit 5f9a7fea06.
2021-03-16 13:56:50 +08:00
Cheng Su bb05dc91f0 [SPARK-34729][SQL][FOLLOWUP] Broadcast nested loop join to use executeTake instead of execute
### What changes were proposed in this pull request?

This is a followup minor change from https://github.com/apache/spark/pull/31821#discussion_r594110622 , where we change from using `execute()` to `executeTake()`. Performance-wise there's no difference. We are just using a different API to be aligned with code path of `Dataset`.

### Why are the changes needed?

To align with other code paths in SQL/Dataset.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Existing unit tests same as https://github.com/apache/spark/pull/31821 .

Closes #31845 from c21/join-followup.

Authored-by: Cheng Su <chengsu@fb.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2021-03-15 19:48:45 -07:00
Kent Yao 202529ef23 [SPARK-21449][SPARK-23745][SQL] add ShutdownHook to cloes HiveClient's SessionState to delete residual dirs
### What changes were proposed in this pull request?

We initialized a Hive `SessionState` to interact with the external hive metastore server but left it behind after we finished.

We should close the metastore client explicitly to avoid connection leaks with the HMS,
and we should trigger the `SessionState` to close itself to clean up the residual dirs, fixing the issues reported by SPARK-21449 and SPARK-23745.

`hive.downloaded.resources.dir` contains transient files, such as UDF jars; it will not be used anymore after the Spark application exits.

### Why are the changes needed?

1. prevent potential metastore client leak

2. clean `hive.downloaded.resources.dir`

```
    DOWNLOADED_RESOURCES_DIR("hive.downloaded.resources.dir", "${system:java.io.tmpdir}" + File.separator + "${hive.session.id}_resources", "Temporary local directory for added resources in the remote file system."),

```
### Does this PR introduce _any_ user-facing change?

no

### How was this patch tested?

Passing Jenkins and verified locally.

Closes #31833 from yaooqinn/SPARK-21449-2.

Authored-by: Kent Yao <yao@apache.org>
Signed-off-by: Yuming Wang <yumwang@ebay.com>
2021-03-16 10:37:40 +08:00
Dongjoon Hyun 0a70dff066 [MINOR][SQL] Remove unused variable in NewInstance.constructor
### What changes were proposed in this pull request?

This PR removes one unused variable in `NewInstance.constructor`.

### Why are the changes needed?

This looks like a variable for debugging at the initial commit of SPARK-23584 .
- 1b08c4393c (diff-2a36e31684505fd22e2d12a864ce89fd350656d716a3f2d7789d2cdbe38e15fbR461)

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Pass the CIs.

Closes #31838 from dongjoon-hyun/minor-object.

Authored-by: Dongjoon Hyun <dhyun@apple.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2021-03-15 18:49:54 -07:00
Max Gekk 9809a2f1c5 [SPARK-34739][SQL] Support add/subtract of a year-month interval to/from a timestamp
### What changes were proposed in this pull request?
Support `timestamp +/- year-month interval`. In the PR, I propose to introduce a new binary expression `TimestampAddYMInterval`, similar to `DateAddYMInterval`. It invokes the new method `timestampAddMonths` from `DateTimeUtils`, passing the timestamp as an offset in microseconds since the epoch, the number of months from the given year-month interval, and the time zone ID in which the operation is performed. The `timestampAddMonths()` method converts the input microseconds to a local timestamp, adds the months to it, and converts the result back to an instant in microseconds at the given time zone.
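
A minimal sketch of that conversion using plain `java.time` (assumed helper name; not the actual `DateTimeUtils` code):

```scala
import java.time.{Instant, ZoneId}
import java.time.temporal.ChronoUnit

// micros since epoch -> local timestamp in the given zone -> plus months -> back to micros
def timestampAddMonthsSketch(micros: Long, months: Int, zoneId: ZoneId): Long = {
  val local   = Instant.EPOCH.plus(micros, ChronoUnit.MICROS).atZone(zoneId).toLocalDateTime
  val shifted = local.plusMonths(months)
  ChronoUnit.MICROS.between(Instant.EPOCH, shifted.atZone(zoneId).toInstant)
}

timestampAddMonthsSketch(0L, 14, ZoneId.of("UTC"))  // 1971-03-01 00:00:00 UTC, in microseconds
```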

### Why are the changes needed?
To conform the ANSI SQL standard which requires to support such operation over timestamps and intervals:
<img width="811" alt="Screenshot 2021-03-12 at 11 36 14" src="https://user-images.githubusercontent.com/1580697/111081674-865d4900-8515-11eb-86c8-3538ecaf4804.png">

### Does this PR introduce _any_ user-facing change?
Should not since new intervals have not been released yet.

### How was this patch tested?
By running new tests:
```
$ build/sbt "test:testOnly *DateTimeUtilsSuite"
$ build/sbt "test:testOnly *DateExpressionsSuite"
$ build/sbt "test:testOnly *ColumnExpressionSuite"
```

Closes #31832 from MaxGekk/timestamp-add-year-month-interval.

Authored-by: Max Gekk <max.gekk@gmail.com>
Signed-off-by: Max Gekk <max.gekk@gmail.com>
2021-03-15 14:36:12 +03:00
Dongjoon Hyun 363a7f0722 [SPARK-34743][SQL][TESTS] ExpressionEncoderSuite should use deepEquals when we expect array of array
### What changes were proposed in this pull request?

This PR aims to make `ExpressionEncoderSuite` use `deepEquals` instead of `equals` when `input` is an `array of array`.

This comparison code itself was added by SPARK-11727 at Apache Spark 1.6.0.

### Why are the changes needed?

Currently, the interpreted mode fails for `array of array` because the following line is used.
```
Arrays.equals(b1.asInstanceOf[Array[AnyRef]], b2.asInstanceOf[Array[AnyRef]])
```
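
A standalone illustration of why that check is too shallow for nested arrays (not the suite code):

```scala
import java.util.Arrays

// Arrays.equals compares elements with equals(); for nested arrays that is reference equality
val a = Array(Array(1, 2)).asInstanceOf[Array[AnyRef]]
val b = Array(Array(1, 2)).asInstanceOf[Array[AnyRef]]
Arrays.equals(a, b)      // false: the inner Array(1, 2) instances are different objects
Arrays.deepEquals(a, b)  // true: compares nested array contents element by element
```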

### Does this PR introduce _any_ user-facing change?

No. This is a test-only PR.

### How was this patch tested?

Pass the existing CIs.

Closes #31837 from dongjoon-hyun/SPARK-34743.

Authored-by: Dongjoon Hyun <dhyun@apple.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2021-03-15 02:30:54 -07:00
Wenchen Fan be888b27ed [SPARK-34639][SQL] Always remove unnecessary Alias in Analyzer.resolveExpression
### What changes were proposed in this pull request?

In `Analyzer.resolveExpression`, we have a parameter to decide whether we should remove unnecessary `Alias` or not. This is overcomplicated, and we can always remove unnecessary `Alias`.

This PR simplifies this part and removes the parameter.

### Why are the changes needed?

code cleanup

### Does this PR introduce _any_ user-facing change?

no

### How was this patch tested?

existing tests

Closes #31758 from cloud-fan/resolve.

Authored-by: Wenchen Fan <wenchen@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2021-03-15 09:22:36 +00:00
Cheng Su a0f3b72e1c [SPARK-34729][SQL] Faster execution for broadcast nested loop join (left semi/anti with no condition)
### What changes were proposed in this pull request?

This is for `BroadcastNestedLoopJoinExec` left semi and left anti join without a condition, when we broadcast the left side. Currently we check whether every row from the broadcast side has a match by [iterating over the broadcast side many times](https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/joins/BroadcastNestedLoopJoinExec.scala#L256-L275). This is unnecessary and very inefficient when there is no condition, as we only need to check whether the stream side is empty or not. This PR adds that optimization, which can boost the affected query execution performance a lot.
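
A conceptual, simplified model of the shortcut (made-up function, not the Spark implementation): with no join condition, every broadcast-side row matches exactly when the stream side is non-empty.

```scala
// simplified model of the no-condition fast path for left semi / left anti (build = left)
def leftExistenceJoinNoCondition[T](
    broadcastRows: Seq[T],
    streamIsEmpty: Boolean,
    anti: Boolean): Seq[T] = {
  val everyRowMatches = !streamIsEmpty
  // semi keeps all rows when they match; anti keeps all rows when nothing matches
  if (everyRowMatches != anti) broadcastRows else Seq.empty
}

leftExistenceJoinNoCondition(Seq("a", "b"), streamIsEmpty = false, anti = false)  // Seq(a, b)
leftExistenceJoinNoCondition(Seq("a", "b"), streamIsEmpty = false, anti = true)   // Seq()
```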

In addition, create a common method `getMatchedBroadcastRowsBitSet()` shared by several methods.
Refactor `defaultJoin()` to move
* left semi and left anti join related logic to `leftExistenceJoin`
* existence join related logic to `existenceJoin`.

After this, `defaultJoin()` holds logic only for outer joins (left outer, right outer and full outer), which, in my opinion, is much easier to read.

### Why are the changes needed?

Improve the affected query performance a lot.
Test with a simple query by modifying `JoinBenchmark.scala` locally:

```
val N = 20 << 20
val M = 1 << 4
val dim = broadcast(spark.range(M).selectExpr("id as k"))
val df = dim.join(spark.range(N), Seq.empty, "left_semi")
df.noop()
```

See a >30x run time improvement. Note the stream side is only `spark.range(N)`. For a complicated query with a non-trivial stream side, the savings would be much larger.

```
Running benchmark: broadcast nested loop left semi join
  Running case: broadcast nested loop left semi join optimization off
  Stopped after 2 iterations, 3163 ms
  Running case: broadcast nested loop left semi join optimization on
  Stopped after 5 iterations, 366 ms

Java HotSpot(TM) 64-Bit Server VM 1.8.0_181-b13 on Mac OS X 10.15.7
Intel(R) Core(TM) i9-9980HK CPU  2.40GHz
broadcast nested loop left semi join:                  Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
-------------------------------------------------------------------------------------------------------------------------------------
broadcast nested loop left semi join optimization off           1568           1582          19         13.4          74.8       1.0X
broadcast nested loop left semi join optimization on              46             73          18        456.0           2.2      34.1X
```

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Added unit test in `ExistenceJoinSuite.scala`.

Closes #31821 from c21/nested-join.

Authored-by: Cheng Su <chengsu@fb.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2021-03-14 23:51:36 -07:00
yangjie01 e757091820 [SPARK-34722][CORE][SQL][TEST] Clean up deprecated API usage related to JUnit4
### What changes were proposed in this pull request?
The main change of this pr as follows:

- Use `org.junit.Assert.assertThrows(String, Class, ThrowingRunnable)` method instead of  `ExpectedException.none()`
- Use `org.hamcrest.MatcherAssert.assertThat()` method instead of   `org.junit.Assert.assertThat(T, org.hamcrest.Matcher<? super T>)`
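
An illustrative shape of the migration (hypothetical assertion, not taken from the changed suites):

```scala
import org.junit.Assert.assertThrows
import org.hamcrest.MatcherAssert.assertThat
import org.hamcrest.CoreMatchers.containsString

// instead of declaring an ExpectedException rule, assert on the throwing call directly
val e = assertThrows(classOf[IllegalArgumentException],
  () => { throw new IllegalArgumentException("boom") })
assertThat(e.getMessage, containsString("boom"))
```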

### Why are the changes needed?
Clean up deprecated API usage

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
Pass the Jenkins or GitHub Action

Closes #31815 from LuciferYang/SPARK-34722.

Authored-by: yangjie01 <yangjie01@baidu.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2021-03-14 23:33:03 -07:00
Max Gekk 7aaed76125 [SPARK-34737][SQL] Cast input float to double in TIMESTAMP_SECONDS
### What changes were proposed in this pull request?
In the PR, I propose to cast the input float to double in the `SecondsToTimestamp` expression in the same way as in the `Cast` expression.

### Why are the changes needed?
To have the same results from `CAST(<float> AS TIMESTAMP)` and from `TIMESTAMP_SECONDS`:
```sql
spark-sql> SELECT CAST(16777215.0f AS TIMESTAMP);
1970-07-14 07:20:15
spark-sql> SELECT TIMESTAMP_SECONDS(16777215.0f);
1970-07-14 07:20:14.951424
```

### Does this PR introduce _any_ user-facing change?
Yes. After the changes:
```sql
spark-sql> SELECT TIMESTAMP_SECONDS(16777215.0f);
1970-07-14 07:20:15
```

### How was this patch tested?
By running new test:
```
$ build/sbt "test:testOnly *DateExpressionsSuite"
```

Closes #31831 from MaxGekk/adjust-SecondsToTimestamp.

Authored-by: Max Gekk <max.gekk@gmail.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2021-03-15 10:05:59 +09:00
Max Gekk e0a1399bd7 [SPARK-34727][SQL] Fix discrepancy in casting float to timestamp
### What changes were proposed in this pull request?
In non-ANSI mode, casting float to timestamp has different implementations with codegen on and off.

Codegen on:
1. Multiply float input by MICROS_PER_SECOND
2. Cast resulting float value to long

Codegen off:
1. Cast float input to double
2. Multiply double input by MICROS_PER_SECOND
3. Cast resulting double value to long

In the PR, I propose to align with the non-codegen code, and cast the input float to double in codegen.
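
An illustrative reason for the discrepancy (plain Scala arithmetic mirroring the two code paths): multiplying in float precision loses the low digits, while widening to double first does not.

```scala
// codegen path before the fix: multiply in float precision, then truncate to long
(16777215.0f * 1000000L).toLong           // 16777214951424  -> 1970-07-14 07:20:14.951424
// interpreted path (and codegen after the fix): widen to double first
(16777215.0f.toDouble * 1000000L).toLong  // 16777215000000  -> 1970-07-14 07:20:15
```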

### Why are the changes needed?
This fixes the issue which is demonstrated by the code:
```sql
spark-sql> CREATE TEMP VIEW v1 AS SELECT 16777215.0f AS f;
spark-sql> SELECT * FROM v1;
1.6777215E7
spark-sql> SELECT CAST(f AS TIMESTAMP) FROM v1;
1970-07-14 07:20:15
spark-sql> CACHE TABLE v1;
spark-sql> SELECT * FROM v1;
1.6777215E7
spark-sql> SELECT CAST(f AS TIMESTAMP) FROM v1;
1970-07-14 07:20:14.951424
```
The result from the cached view **1970-07-14 07:20:14.951424** is different from un-cached view **1970-07-14 07:20:15**.

### Does this PR introduce _any_ user-facing change?
Yes. After the changes, the example above outputs the same timestamp for the cached view:
```sql
spark-sql> CACHE TABLE v1;
spark-sql> SELECT * FROM v1;
1.6777215E7
spark-sql> SELECT CAST(f AS TIMESTAMP) FROM v1;
1970-07-14 07:20:15
```

### How was this patch tested?
By running new test:
```
$ build/sbt "test:testOnly *CastSuite"
```

Closes #31819 from MaxGekk/fix-float-to-timestamp.

Authored-by: Max Gekk <max.gekk@gmail.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2021-03-14 11:29:54 +09:00
Liang-Chi Hsieh 86baa36eeb [SPARK-34723][SQL] Correct parameter type for subexpression elimination under whole-stage
### What changes were proposed in this pull request?

This patch proposes to fix an incorrect parameter type for subexpression elimination under whole-stage codegen.

### Why are the changes needed?

If the parameter is a byte array, the subexpression elimination under whole-stage codegen will use an incorrect parameter type and cause a compile error. Although Spark can automatically fall back to interpreted mode, we should fix it.

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

Manually tested with a customer application, plus a unit test.

Closes #31814 from viirya/SPARK-34723.

Authored-by: Liang-Chi Hsieh <viirya@gmail.com>
Signed-off-by: Liang-Chi Hsieh <viirya@gmail.com>
2021-03-13 00:05:41 -08:00
Max Gekk 4f1e434ec5 [SPARK-34721][SQL] Support add/subtract of a year-month interval to/from a date
### What changes were proposed in this pull request?
Support `date +/- year-month interval`. In the PR, I propose to re-use existing code from the `AddMonths` expression, and extract it to the common base class `AddMonthsBase`. That base class is used in the new expression `DateAddYMInterval` and in the existing `AddMonths` (the `add_months` function).
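
The month-arithmetic semantics being reused, illustrated with plain `java.time` rather than the Spark internals:

```scala
import java.time.LocalDate

// adding a year-month interval is plain month arithmetic; the day-of-month is clamped if needed
LocalDate.of(2021, 1, 31).plusMonths(1)   // 2021-02-28
LocalDate.of(2021, 1, 15).plusMonths(14)  // 2022-03-15
```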

### Why are the changes needed?
To conform the ANSI SQL standard which requires to support such operation over dates and intervals:
<img width="811" alt="Screenshot 2021-03-12 at 11 36 14" src="https://user-images.githubusercontent.com/1580697/110914390-5f412480-8327-11eb-9f8b-e92e73c0b9cd.png">

### Does this PR introduce _any_ user-facing change?
Should not since new intervals have not been released yet.

### How was this patch tested?
By running new tests:
```
$ build/sbt "test:testOnly *ColumnExpressionSuite"
$ build/sbt "test:testOnly *DateExpressionsSuite"
```

Closes #31812 from MaxGekk/date-add-year-month-interval.

Authored-by: Max Gekk <max.gekk@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2021-03-12 14:35:56 +00:00
Dongjoon Hyun 9a7977933f [SPARK-34724][SQL] Fix Interpreted evaluation by using getMethod instead of getDeclaredMethod
### What changes were proposed in this pull request?

This bug was introduced by SPARK-23583 at Apache Spark 2.4.0.

This PR aims to use `getMethod` instead of `getDeclaredMethod`.
```scala
- obj.getClass.getDeclaredMethod(functionName, argClasses: _*)
+ obj.getClass.getMethod(functionName, argClasses: _*)
```

### Why are the changes needed?

`getDeclaredMethod` does not search the super class's methods. To invoke `GenericArrayData.toIntArray`, we need to use `getMethod` because it's declared in the super class `ArrayData`.
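
A standalone illustration of the difference (hypothetical classes mirroring `ArrayData`/`GenericArrayData`):

```scala
// getMethod walks up the class hierarchy for public methods; getDeclaredMethod does not
class Base { def toIntArray(): Array[Int] = Array(1, 2, 3) }
class Child extends Base

classOf[Child].getMethod("toIntArray")          // ok: finds the inherited public method
classOf[Child].getDeclaredMethod("toIntArray")  // throws NoSuchMethodException
```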

```
[info] - encode/decode for array of int: [I74655d03 (interpreted path) *** FAILED *** (14 milliseconds)
[info]   Exception thrown while decoding
[info]   Converted: [0,1000000020,3,0,ffffff850000001f,4]
[info]   Schema: value#680
[info]   root
[info]   -- value: array (nullable = true)
[info]       |-- element: integer (containsNull = false)
[info]
[info]
[info]   Encoder:
[info]   class[value[0]: array<int>] (ExpressionEncoderSuite.scala:578)
[info]   org.scalatest.exceptions.TestFailedException:
[info]   at org.scalatest.Assertions.newAssertionFailedException(Assertions.scala:472)
[info]   at org.scalatest.Assertions.newAssertionFailedException$(Assertions.scala:471)
[info]   at org.scalatest.funsuite.AnyFunSuite.newAssertionFailedException(AnyFunSuite.scala:1563)
[info]   at org.scalatest.Assertions.fail(Assertions.scala:949)
[info]   at org.scalatest.Assertions.fail$(Assertions.scala:945)
[info]   at org.scalatest.funsuite.AnyFunSuite.fail(AnyFunSuite.scala:1563)
[info]   at org.apache.spark.sql.catalyst.encoders.ExpressionEncoderSuite.$anonfun$encodeDecodeTest$1(ExpressionEncoderSuite.scala:578)
[info]   at org.apache.spark.sql.catalyst.encoders.ExpressionEncoderSuite.verifyNotLeakingReflectionObjects(ExpressionEncoderSuite.scala:656)
[info]   at org.apache.spark.sql.catalyst.encoders.ExpressionEncoderSuite.$anonfun$testAndVerifyNotLeakingReflectionObjects$2(ExpressionEncoderSuite.scala:669)
[info]   at org.apache.spark.sql.catalyst.plans.CodegenInterpretedPlanTest.$anonfun$test$4(PlanTest.scala:50)
[info]   at org.apache.spark.sql.catalyst.plans.SQLHelper.withSQLConf(SQLHelper.scala:54)
[info]   at org.apache.spark.sql.catalyst.plans.SQLHelper.withSQLConf$(SQLHelper.scala:38)
[info]   at org.apache.spark.sql.catalyst.encoders.ExpressionEncoderSuite.withSQLConf(ExpressionEncoderSuite.scala:118)
[info]   at org.apache.spark.sql.catalyst.plans.CodegenInterpretedPlanTest.$anonfun$test$3(PlanTest.scala:50)
...
[info]   Cause: java.lang.RuntimeException: Error while decoding: java.lang.NoSuchMethodException: org.apache.spark.sql.catalyst.util.GenericArrayData.toIntArray()
[info] mapobjects(lambdavariable(MapObject, IntegerType, false, -1), assertnotnull(lambdavariable(MapObject, IntegerType, false, -1)), input[0, array<int>, true], None).toIntArray
[info]   at org.apache.spark.sql.catalyst.encoders.ExpressionEncoder$Deserializer.apply(ExpressionEncoder.scala:186)
[info]   at org.apache.spark.sql.catalyst.encoders.ExpressionEncoderSuite.$anonfun$encodeDecodeTest$1(ExpressionEncoderSuite.scala:576)
[info]   at org.apache.spark.sql.catalyst.encoders.ExpressionEncoderSuite.verifyNotLeakingReflectionObjects(ExpressionEncoderSuite.scala:656)
[info]   at org.apache.spark.sql.catalyst.encoders.ExpressionEncoderSuite.$anonfun$testAndVerifyNotLeakingReflectionObjects$2(ExpressionEncoderSuite.scala:669)
```

### Does this PR introduce _any_ user-facing change?

This causes a runtime exception when we use the interpreted mode.

### How was this patch tested?

Pass the modified unit test case.

Closes #31816 from dongjoon-hyun/SPARK-34724.

Authored-by: Dongjoon Hyun <dhyun@apple.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2021-03-12 21:30:46 +09:00
Kousuke Saruta 03dd33cc98 [SPARK-25769][SPARK-34636][SPARK-34626][SQL] sql method in UnresolvedAttribute, AttributeReference and Alias don't quote qualified names properly
### What changes were proposed in this pull request?

This PR fixes an issue where the `sql` method in the following classes, which take qualified names, doesn't quote the qualified names properly.

* UnresolvedAttribute
* AttributeReference
* Alias

One instance caused by this issue is reported in SPARK-34626.
```
UnresolvedAttribute("a" :: "b" :: Nil).sql
`a.b` // expected: `a`.`b`
```
And other instances are like as follows.
```
UnresolvedAttribute("a`b"::"c.d"::Nil).sql
a`b.`c.d` // expected: `a``b`.`c.d`

AttributeReference("a.b", IntegerType)(qualifier = "c.d"::Nil).sql
c.d.`a.b` // expected: `c.d`.`a.b`

Alias(AttributeReference("a", IntegerType)(), "b.c")(qualifier = "d.e"::Nil).sql
`a` AS d.e.`b.c` // expected: `a` AS `d.e`.`b.c`
```

### Why are the changes needed?

This is a bug.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

New test.

Closes #31754 from sarutak/fix-qualified-names.

Authored-by: Kousuke Saruta <sarutak@oss.nttdata.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2021-03-12 02:58:46 +00:00
Max Gekk cebe2be221 [SPARK-34718][SQL] Assign pretty names to YearMonthIntervalType and DayTimeIntervalType
### What changes were proposed in this pull request?
In the PR, I propose to override the `typeName()` method in `YearMonthIntervalType` and `DayTimeIntervalType`, and assign them names according to the ANSI SQL standard:
<img width="836" alt="Screenshot 2021-03-11 at 17 29 04" src="https://user-images.githubusercontent.com/1580697/110802854-a54aa980-828f-11eb-956d-dd4fbf14aa72.png">
but keep the type names singular, according to the existing naming convention for other types.

### Why are the changes needed?
To improve Spark SQL user experience, and have readable types in error messages.

### Does this PR introduce _any_ user-facing change?
Should not, since the types have not been released yet.

### How was this patch tested?
By running the modified tests:
```
$ build/sbt "test:testOnly *ExpressionTypeCheckingSuite"
$ build/sbt "sql/testOnly *SQLQueryTestSuite -- -z windowFrameCoercion.sql"
$ build/sbt "sql/testOnly *SQLQueryTestSuite -- -z literals.sql"
```

Closes #31810 from MaxGekk/interval-types-name.

Authored-by: Max Gekk <max.gekk@gmail.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2021-03-11 12:55:12 -08:00
Angerszhuuuu badca975af [SPARK-34712][SQL][TESTS] Refactor UTs about the Hive built-in version to avoid changing them every time the Hive version is upgraded
### What changes were proposed in this pull request?
Use `HiveUtils.builtinHiveVersion` in the corresponding UTs instead of a hard-coded Hive version.

### Why are the changes needed?
Refactor the UTs about the Hive built-in version so they don't need to change every time the Hive version is upgraded.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
Not needed.

Closes #31807 from AngersZhuuuu/SPARK-34712.

Authored-by: Angerszhuuuu <angers.zhu@gmail.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2021-03-11 12:52:29 -08:00
Wenchen Fan 6a42b633bf [SPARK-34713][SQL] Fix group by CreateStruct with ExtractValue
### What changes were proposed in this pull request?

This is a bug caused by https://issues.apache.org/jira/browse/SPARK-31670. We remove the `Alias` when resolving column references in grouping expressions, which breaks `ResolveCreateNamedStruct`.

### Why are the changes needed?

bug fix

### Does this PR introduce _any_ user-facing change?

no

### How was this patch tested?

new tests

Closes #31808 from cloud-fan/bug.

Authored-by: Wenchen Fan <wenchen@databricks.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2021-03-11 09:21:58 -08:00
Max Gekk d7bb327aee [SPARK-34695][SQL] Fix long overflow in conversion of minimum duration to microseconds
### What changes were proposed in this pull request?
In the PR, I propose to specially handle the seconds amount `-9223372036855` in `IntervalUtils.durationToMicros()`. Starting from that amount (any duration whose seconds field is < `-9223372036855`), input durations cannot fit into `Long` in the conversion to microseconds. For example, the amount of microseconds `Long.MinValue = -9223372036854775808` can be represented in two forms:
1. seconds = -9223372036854, microsecond adjustment = -775808, or
2. seconds = -9223372036855, microsecond adjustment = +224192

The method `Duration.ofSeconds()` produces the second form, but that form causes an overflow while converting `-9223372036855` seconds to microseconds.

In the PR, I propose to convert the second form to the first one if the seconds field of the input duration is equal to `-9223372036855`.
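
An illustrative look at the normalization that triggers the issue, using plain `java.time` (the adjustments below are in nanoseconds):

```scala
import java.time.Duration

// Long.MinValue microseconds, built from the first decomposition
val d = Duration.ofSeconds(-9223372036854L, -775808L * 1000)
(d.getSeconds, d.getNano)  // (-9223372036855, 224192000): ofSeconds normalizes to the second form
// a naive seconds * 1_000_000 then overflows:
// -9223372036855 * 1000000 = -9223372036855000000 < Long.MinValue (-9223372036854775808)
```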

### Why are the changes needed?
The changes fix the issue demonstrated by the code:
```scala
scala> durationToMicros(microsToDuration(Long.MinValue))
java.lang.ArithmeticException: long overflow
  at java.lang.Math.multiplyExact(Math.java:892)
  at org.apache.spark.sql.catalyst.util.IntervalUtils$.durationToMicros(IntervalUtils.scala:782)
  ... 49 elided
```
The `durationToMicros()` method cannot handle valid output of `microsToDuration()`.

### Does this PR introduce _any_ user-facing change?
Should not, since the new interval types have not been released yet.

### How was this patch tested?
By running new UT from `IntervalUtilsSuite`.

Closes #31799 from MaxGekk/fix-min-duration.

Authored-by: Max Gekk <max.gekk@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2021-03-11 15:21:15 +00:00
ulysses-you 744a73df9e [SPARK-34538][SQL] Hive Metastore support filter by not-in
### What changes were proposed in this pull request?

Add `Not(In)` and `Not(InSet)` pattern when convert filter to metastore.

### Why are the changes needed?

`NOT IN` is a useful condition for pruning partitions, so it would be better to support it.

Technically, we can convert `c not in(x,y)` to `c != x and c != y`, then push it to metastore.

To avoid metastore overflow and respect the config `spark.sql.hive.metastorePartitionPruningInSetThreshold`, `Not(InSet)` won't be pushed to the metastore if its number of values exceeds the threshold.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Add test.

Closes #31646 from ulysses-you/SPARK-34538.

Authored-by: ulysses-you <ulyssesyou18@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2021-03-11 15:19:47 +00:00
Cheng Su da086a8ea3 [SPARK-34702][SQL] Avoid unnecessary code generation in JoinCodegenSupport.genBuildSideVars
### What changes were proposed in this pull request?

As a followup from code review in https://github.com/apache/spark/pull/31736#discussion_r588134104 , for `JoinCodegenSupport.genBuildSideVars`, we only need to generate build side variables with default values for LEFT OUTER and RIGHT OUTER join, but not for other join types (i.e. LEFT SEMI and LEFT ANTI). Create this PR to clean up the code.

In addition, change `BroadcastNestedLoopJoinExec` unit test to cover both whole stage code-gen enabled and disabled. Harden the unit tests to exercise all code paths.

### Why are the changes needed?

Avoid unnecessary code generation.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Existing unit tests.
* BHJ and SHJ inner join is covered in `InnerJoinSuite.scala`
* BHJ and SHJ left outer and right outer join are covered in `OuterJoinSuite.scala`
* BHJ and SHJ left semi, left anti and existence join are covered in `ExistenceJoinSuite.scala`

Closes #31802 from c21/join-codegen-fix.

Authored-by: Cheng Su <chengsu@fb.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2021-03-11 13:29:45 +00:00
Kousuke Saruta fa1cf5c207 [SPARK-34697][SQL] Allow DESCRIBE FUNCTION and SHOW FUNCTIONS explain about || (string concatenation operator)
### What changes were proposed in this pull request?

This PR fixes the behavior of `SHOW FUNCTIONS` and `DESCRIBE FUNCTION` for the `||` operator.
The result of `SHOW FUNCTIONS` doesn't contain `||`, and `DESCRIBE FUNCTION ||` says `Function: || not found.` even though `||` is supported.

### Why are the changes needed?

It's a bug.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Confirmed manually with the following commands.
```
spark-sql> DESCRIBE FUNCTION ||;
Function: ||
Usage: expr1 || expr2 - Returns the concatenation of `expr1` and `expr2`.

spark-sql> SHOW FUNCTIONS;
!
!=
%

...

|
||
~
```

Closes #31800 from sarutak/fix-describe-concat-pipe.

Authored-by: Kousuke Saruta <sarutak@oss.nttdata.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2021-03-11 22:11:26 +09:00
Max Gekk 9d3d25bca4 [SPARK-34677][SQL] Support the +/- operators over ANSI SQL intervals
### What changes were proposed in this pull request?
Extend the `Add`, `Subtract` and `UnaryMinus` expression to support `DayTimeIntervalType` and `YearMonthIntervalType` added by #31614.

Note: the expressions can throw the `overflow` exception independently of the SQL config `spark.sql.ansi.enabled`. In this way, the modified expressions always behave in the ANSI mode for the intervals.
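
A hedged sketch of the new behavior (assuming `java.time.Period` columns from SPARK-34615 and `spark.implicits._` in scope, as in the shell snippets elsewhere in this log; approximate, not copied from the added tests):

```scala
import java.time.Period
import spark.implicits._

// +, - and unary minus over a year-month interval column
val df = Seq(Period.ofYears(1).plusMonths(2)).toDF("i")
df.selectExpr("i + i", "i - i", "-i").show()
// overflow always throws, regardless of spark.sql.ansi.enabled
```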

### Why are the changes needed?
To conform to the ANSI SQL standard which defines `-/+` over intervals:
<img width="822" alt="Screenshot 2021-03-09 at 21 59 22" src="https://user-images.githubusercontent.com/1580697/110523128-bd50ea80-8122-11eb-9982-782da0088d27.png">

### Does this PR introduce _any_ user-facing change?
Should not since new types have not been released yet.

### How was this patch tested?
By running new tests in the test suites:
```
$ build/sbt "test:testOnly *ArithmeticExpressionSuite"
$ build/sbt "test:testOnly *ColumnExpressionSuite"
```

Closes #31789 from MaxGekk/add-subtruct-intervals.

Authored-by: Max Gekk <max.gekk@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2021-03-11 10:08:43 +00:00
Dongjoon Hyun 5c4d8f9538 [SPARK-34696][SQL][TESTS] Fix CodegenInterpretedPlanTest to generate correct test cases
### What changes were proposed in this pull request?

SPARK-23596 added `CodegenInterpretedPlanTest` at Apache Spark 2.4.0 in a wrong way, because `withSQLConf` depends on `SQLConf.get` at execution time, not at `test`-function declaration time. So, the following code executes the test twice without controlling the `CodegenObjectFactoryMode`. This PR aims to fix it correctly and introduces a new function `testFallback`.

```scala
trait CodegenInterpretedPlanTest extends PlanTest {

   override protected def test(
       testName: String,
       testTags: Tag*)(testFun: => Any)(implicit pos: source.Position): Unit = {
     val codegenMode = CodegenObjectFactoryMode.CODEGEN_ONLY.toString
     val interpretedMode = CodegenObjectFactoryMode.NO_CODEGEN.toString

     withSQLConf(SQLConf.CODEGEN_FACTORY_MODE.key -> codegenMode) {
       super.test(testName + " (codegen path)", testTags: _*)(testFun)(pos)
     }
     withSQLConf(SQLConf.CODEGEN_FACTORY_MODE.key -> interpretedMode) {
       super.test(testName + " (interpreted path)", testTags: _*)(testFun)(pos)
     }
   }
 }
```

### Why are the changes needed?

1. We need to use like the following.
```scala
super.test(testName + " (codegen path)", testTags: _*)(
   withSQLConf(SQLConf.CODEGEN_FACTORY_MODE.key -> codegenMode) { testFun })(pos)
super.test(testName + " (interpreted path)", testTags: _*)(
   withSQLConf(SQLConf.CODEGEN_FACTORY_MODE.key -> interpretedMode) { testFun })(pos)
```

2. After we fix this behavior with the above code, several test cases including SPARK-34596 and SPARK-34607 fail because they didn't work at both `CODEGEN` and `INTERPRETED` mode. Those test cases only work at `FALLBACK` mode. So, inevitably, we need to introduce `testFallback`.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Pass the CIs.

Closes #31766 from dongjoon-hyun/SPARK-34596-SPARK-34607.

Lead-authored-by: Dongjoon Hyun <dhyun@apple.com>
Co-authored-by: Dongjoon Hyun <dongjoon@apache.org>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2021-03-10 23:41:49 -08:00
Cheng Su 14ad7afa1a [MINOR][SQL] Remove unnecessary extend from BroadcastHashJoinExec
### What changes were proposed in this pull request?

This is just a minor fix. `HashJoin` already extends `JoinCodegenSupport`. So we don't need `CodegenSupport` here for `BroadcastHashJoinExec`. Submitted separately as a PR here per https://github.com/apache/spark/pull/31802#discussion_r592066686 .

### Why are the changes needed?

Clean up code.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Existing unit tests.

Closes #31805 from c21/bhj-minor.

Authored-by: Cheng Su <chengsu@fb.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2021-03-10 23:38:53 -08:00
Cheng Su 9aa8f06313 [SPARK-34711][SQL][TESTS] Exercise code-gen enable/disable code paths for SHJ in join test suites
### What changes were proposed in this pull request?

Per comment in https://github.com/apache/spark/pull/31802#discussion_r592068440 , we would like to exercise whole stage code-gen enabled and disabled code paths in join unit test suites. This is for better test coverage of shuffled hash join.

### Why are the changes needed?

Better test coverage.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Existing and added unit tests here.

Closes #31806 from c21/test-minor.

Authored-by: Cheng Su <chengsu@fb.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2021-03-10 23:34:09 -08:00
Anton Okolnychyi 72263797bc [SPARK-34457][SQL] DataSource V2: Add default null ordering to SortDirection
### What changes were proposed in this pull request?

This PR adds a default null ordering to public `SortDirection` to match the Catalyst behavior.

### Why are the changes needed?

The SQL standard does not define the default null ordering for a sort direction. That's why it is up to a query engine to assign one. We need to standardize this in our public connector expressions to avoid ambiguity. That's why I propose to match the behavior in our Catalyst expressions.
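
For reference, the Catalyst defaults being matched, as exposed through the DataFrame API (illustrative):

```scala
import org.apache.spark.sql.functions.col

// in Catalyst, ascending defaults to NULLS FIRST and descending to NULLS LAST
col("x").asc   // same ordering as col("x").asc_nulls_first
col("x").desc  // same ordering as col("x").desc_nulls_last
```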

### Does this PR introduce _any_ user-facing change?

Yes, it affects unreleased connector expression API.

### How was this patch tested?

Existing tests.

Closes #31580 from aokolnychyi/spark-34457.

Authored-by: Anton Okolnychyi <aokolnychyi@apple.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2021-03-11 05:47:31 +00:00
Terry Kim 2a6e68e1f7 [SPARK-34546][SQL] AlterViewAs.query should be analyzed during the analysis phase, and AlterViewAs should invalidate the cache
### What changes were proposed in this pull request?

This PR proposes the following:
   * `AlterViewAs.query` is currently analyzed in the physical operator `AlterViewAsCommand`, but it should be analyzed during the analysis phase.
   *  When `spark.sql.legacy.storeAnalyzedPlanForView` is set to true, store `TemporaryViewRelation`, which wraps the analyzed plan, similar to #31273.
   *  Try to uncache the view you are altering.

### Why are the changes needed?

Analyzing a plan should be done in the analysis phase if possible.

Not uncaching the view (existing behavior) seems like a bug since the cache may not be used again.

### Does this PR introduce _any_ user-facing change?

Yes, now the view can be uncached if it's already cached.

### How was this patch tested?

Added new tests around uncaching.

The existing tests such as `SQLViewSuite` should cover the analysis changes.

Closes #31652 from imback82/alter_view_child.

Authored-by: Terry Kim <yuminkim@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2021-03-11 05:31:40 +00:00
Andy Grove fc182f7e7f [SPARK-34682][SQL] Use PrivateMethodTester instead of reflection
### Why are the changes needed?
SPARK-34682 was merged prematurely. This PR implements feedback from the review. I wasn't sure whether I should create a new JIRA or not.

### Does this PR introduce _any_ user-facing change?
No. Just improves the test.

### How was this patch tested?
Updated test.

Closes #31798 from andygrove/SPARK-34682-follow-up.

Authored-by: Andy Grove <andygrove73@gmail.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2021-03-10 12:08:31 -08:00
Andy Grove fd4843803c [SPARK-34682][SQL] Fix regression in canonicalization error check in CustomShuffleReaderExec
### What changes were proposed in this pull request?
There is a regression in 3.1.1 compared to 3.0.2 when checking for a canonicalized plan when executing CustomShuffleReaderExec.

The regression was caused by the call to `sendDriverMetrics` which happens before the check and will always fail if the plan is canonicalized.

### Why are the changes needed?
This is a regression in a useful error check.

### Does this PR introduce _any_ user-facing change?
No. This is not an error that a user would typically see, as far as I know.

### How was this patch tested?
I tested this change locally by making a distribution from this PR branch. Before fixing the regression I saw:

```
java.util.NoSuchElementException: key not found: numPartitions
```

After fixing this regression I saw:

```
java.lang.IllegalStateException: operating on canonicalized plan
```

Closes #31793 from andygrove/SPARK-34682.

Lead-authored-by: Andy Grove <andygrove73@gmail.com>
Co-authored-by: Andy Grove <andygrove@nvidia.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2021-03-10 20:48:00 +09:00
Cheng Su a916690dd9 [SPARK-34681][SQL] Fix bug for full outer shuffled hash join when building left side with non-equal condition
### What changes were proposed in this pull request?

For full outer shuffled hash join with building hash map on left side, and having non-equal condition, the join can produce wrong result.

The root cause is that `boundCondition` in `HashJoin.scala` always assumes the left side row comes from `streamedPlan` and the right side row from `buildPlan` ([streamedPlan.output ++ buildPlan.output](https://github.com/apache/spark/blob/branch-3.1/sql/core/src/main/scala/org/apache/spark/sql/execution/joins/HashJoin.scala#L141)). This is a valid assumption, except for the full outer + build left case.

The fix is to correct `boundCondition` in `HashJoin.scala` to handle full outer + build left case properly. See reproduce in https://issues.apache.org/jira/browse/SPARK-32399?focusedCommentId=17298414&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-17298414 .

### Why are the changes needed?

Fix data correctness bug.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Changed the test in `OuterJoinSuite.scala` to cover full outer shuffled hash join.
Before this change, the unit test `basic full outer join using ShuffledHashJoin` in `OuterJoinSuite.scala` is failed.

Closes #31792 from c21/join-bugfix.

Authored-by: Cheng Su <chengsu@fb.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2021-03-09 22:55:27 -08:00
Wenchen Fan 48377d5bd9 [SPARK-34676][SQL][TEST] TableCapabilityCheckSuite should not inherit all tests from AnalysisSuite
### What changes were proposed in this pull request?

Fixes a mistake in `TableCapabilityCheckSuite`, which runs some tests repeatedly.

### Why are the changes needed?

code cleanup

### Does this PR introduce _any_ user-facing change?

no

### How was this patch tested?

N/A

Closes #31788 from cloud-fan/minor.

Authored-by: Wenchen Fan <wenchen@databricks.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2021-03-09 09:02:31 -08:00
Kousuke Saruta 2fd85174e9 [SPARK-34603][SQL] Support ADD ARCHIVE and LIST ARCHIVES command
### What changes were proposed in this pull request?

This PR adds `ADD ARCHIVE` and `LIST ARCHIVES` commands to SQL and updates relevant documents.
SPARK-33530 added `addArchive` and `listArchives` to `SparkContext`, but adding/listing archives with SQL is not supported yet.
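
Illustrative usage of the new commands (hypothetical archive path):

```scala
// register an archive and list the registered archives from SQL
spark.sql("ADD ARCHIVE /tmp/deps.zip")
spark.sql("LIST ARCHIVES").show(truncate = false)
```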

### Why are the changes needed?

To complement features.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Added new test and confirmed the generated HTML from the updated documents.

Closes #31721 from sarutak/sql-archive.

Authored-by: Kousuke Saruta <sarutak@oss.nttdata.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2021-03-09 21:28:35 +09:00
Terry Kim cf8bbc184b [SPARK-34152][SQL][FOLLOW-UP] Global temp view's identifier should be correctly stored
### What changes were proposed in this pull request?

This PR proposes to fix a bug introduced in #31273 (https://github.com/apache/spark/pull/31273/files#r589494855).

### Why are the changes needed?

This fixes a bug where global temp view's database name was not passed correctly.

### Does this PR introduce _any_ user-facing change?

Yes, now the global temp view's database is correctly stored.

### How was this patch tested?

Added a new test that catches the bug.

Closes #31783 from imback82/SPARK-34152-bug-fix.

Authored-by: Terry Kim <yuminkim@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2021-03-09 12:08:08 +00:00
Amandeep Sharma a9c11896a5 [SPARK-34649][SQL][DOCS] org.apache.spark.sql.DataFrameNaFunctions.replace() fails for column name having a dot
### What changes were proposed in this pull request?

Use resolved attributes instead of data-frame fields for replacing values.

### Why are the changes needed?

`dataframe.na.replace()` does not work for a column having a dot in its name.

### Does this PR introduce _any_ user-facing change?

None

### How was this patch tested?

Added unit tests for the same

Closes #31769 from amandeep-sharma/master.

Authored-by: Amandeep Sharma <happyaman91@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2021-03-09 11:47:01 +00:00
Cheng Su b5b198516c [SPARK-34620][SQL] Code-gen broadcast nested loop join (inner/cross)
### What changes were proposed in this pull request?

`BroadcastNestedLoopJoinExec` does not have code-gen, and we can potentially boost the CPU performance of this operator if we add code-gen for it. https://databricks.com/blog/2017/02/16/processing-trillion-rows-per-second-single-machine-can-nested-loop-joins-fast.html also showed evidence of this in one fork.

The codegen for `BroadcastNestedLoopJoinExec` shared some code with `HashJoin`, and the interface `JoinCodegenSupport` is created to hold those common logic. This PR is only supporting inner and cross join. Other join types will be added later in followup PRs.

Example query and generated code:

```
val df1 = spark.range(4).select($"id".as("k1"))
val df2 = spark.range(3).select($"id".as("k2"))
df1.join(df2, $"k1" + 1 =!= $"k2").explain("codegen")
```

```
== Subtree 2 / 2 (maxMethodCodeSize:282; maxConstantPoolSize:203(0.31% used); numInnerClasses:0) ==
*(2) BroadcastNestedLoopJoin BuildRight, Inner, NOT ((k1#2L + 1) = k2#6L)
:- *(2) Project [id#0L AS k1#2L]
:  +- *(2) Range (0, 4, step=1, splits=2)
+- BroadcastExchange IdentityBroadcastMode, [id=#22]
   +- *(1) Project [id#4L AS k2#6L]
      +- *(1) Range (0, 3, step=1, splits=2)

Generated code:
/* 001 */ public Object generate(Object[] references) {
/* 002 */   return new GeneratedIteratorForCodegenStage2(references);
/* 003 */ }
/* 004 */
/* 005 */ // codegenStageId=2
/* 006 */ final class GeneratedIteratorForCodegenStage2 extends org.apache.spark.sql.execution.BufferedRowIterator {
/* 007 */   private Object[] references;
/* 008 */   private scala.collection.Iterator[] inputs;
/* 009 */   private boolean range_initRange_0;
/* 010 */   private long range_nextIndex_0;
/* 011 */   private TaskContext range_taskContext_0;
/* 012 */   private InputMetrics range_inputMetrics_0;
/* 013 */   private long range_batchEnd_0;
/* 014 */   private long range_numElementsTodo_0;
/* 015 */   private InternalRow[] bnlj_buildRowArray_0;
/* 016 */   private org.apache.spark.sql.catalyst.expressions.codegen.UnsafeRowWriter[] range_mutableStateArray_0 = new org.apache.spark.sql.catalyst.expressions.codegen.UnsafeRowWriter[4];
/* 017 */
/* 018 */   public GeneratedIteratorForCodegenStage2(Object[] references) {
/* 019 */     this.references = references;
/* 020 */   }
/* 021 */
/* 022 */   public void init(int index, scala.collection.Iterator[] inputs) {
/* 023 */     partitionIndex = index;
/* 024 */     this.inputs = inputs;
/* 025 */
/* 026 */     range_taskContext_0 = TaskContext.get();
/* 027 */     range_inputMetrics_0 = range_taskContext_0.taskMetrics().inputMetrics();
/* 028 */     range_mutableStateArray_0[0] = new org.apache.spark.sql.catalyst.expressions.codegen.UnsafeRowWriter(1, 0);
/* 029 */     range_mutableStateArray_0[1] = new org.apache.spark.sql.catalyst.expressions.codegen.UnsafeRowWriter(1, 0);
/* 030 */     range_mutableStateArray_0[2] = new org.apache.spark.sql.catalyst.expressions.codegen.UnsafeRowWriter(1, 0);
/* 031 */     bnlj_buildRowArray_0 = (InternalRow[]) ((org.apache.spark.broadcast.TorrentBroadcast) references[1] /* broadcastTerm */).value();
/* 032 */     range_mutableStateArray_0[3] = new org.apache.spark.sql.catalyst.expressions.codegen.UnsafeRowWriter(2, 0);
/* 033 */
/* 034 */   }
/* 035 */
/* 036 */   private void bnlj_doConsume_0(long bnlj_expr_0_0) throws java.io.IOException {
/* 037 */     for (int bnlj_arrayIndex_0 = 0; bnlj_arrayIndex_0 < bnlj_buildRowArray_0.length; bnlj_arrayIndex_0++) {
/* 038 */       UnsafeRow bnlj_buildRow_0 = (UnsafeRow) bnlj_buildRowArray_0[bnlj_arrayIndex_0];
/* 039 */
/* 040 */       long bnlj_value_1 = bnlj_buildRow_0.getLong(0);
/* 041 */
/* 042 */       long bnlj_value_4 = -1L;
/* 043 */
/* 044 */       bnlj_value_4 = bnlj_expr_0_0 + 1L;
/* 045 */
/* 046 */       boolean bnlj_value_3 = false;
/* 047 */       bnlj_value_3 = bnlj_value_4 == bnlj_value_1;
/* 048 */       boolean bnlj_value_2 = false;
/* 049 */       bnlj_value_2 = !(bnlj_value_3);
/* 050 */       if (!(false || !bnlj_value_2))
/* 051 */       {
/* 052 */         ((org.apache.spark.sql.execution.metric.SQLMetric) references[2] /* numOutputRows */).add(1);
/* 053 */
/* 054 */         range_mutableStateArray_0[3].reset();
/* 055 */
/* 056 */         range_mutableStateArray_0[3].write(0, bnlj_expr_0_0);
/* 057 */
/* 058 */         range_mutableStateArray_0[3].write(1, bnlj_value_1);
/* 059 */         append((range_mutableStateArray_0[3].getRow()).copy());
/* 060 */
/* 061 */       }
/* 062 */     }
/* 063 */
/* 064 */   }
/* 065 */
/* 066 */   private void initRange(int idx) {
/* 067 */     java.math.BigInteger index = java.math.BigInteger.valueOf(idx);
/* 068 */     java.math.BigInteger numSlice = java.math.BigInteger.valueOf(2L);
/* 069 */     java.math.BigInteger numElement = java.math.BigInteger.valueOf(4L);
/* 070 */     java.math.BigInteger step = java.math.BigInteger.valueOf(1L);
/* 071 */     java.math.BigInteger start = java.math.BigInteger.valueOf(0L);
/* 072 */     long partitionEnd;
/* 073 */
/* 074 */     java.math.BigInteger st = index.multiply(numElement).divide(numSlice).multiply(step).add(start);
/* 075 */     if (st.compareTo(java.math.BigInteger.valueOf(Long.MAX_VALUE)) > 0) {
/* 076 */       range_nextIndex_0 = Long.MAX_VALUE;
/* 077 */     } else if (st.compareTo(java.math.BigInteger.valueOf(Long.MIN_VALUE)) < 0) {
/* 078 */       range_nextIndex_0 = Long.MIN_VALUE;
/* 079 */     } else {
/* 080 */       range_nextIndex_0 = st.longValue();
/* 081 */     }
/* 082 */     range_batchEnd_0 = range_nextIndex_0;
/* 083 */
/* 084 */     java.math.BigInteger end = index.add(java.math.BigInteger.ONE).multiply(numElement).divide(numSlice)
/* 085 */     .multiply(step).add(start);
/* 086 */     if (end.compareTo(java.math.BigInteger.valueOf(Long.MAX_VALUE)) > 0) {
/* 087 */       partitionEnd = Long.MAX_VALUE;
/* 088 */     } else if (end.compareTo(java.math.BigInteger.valueOf(Long.MIN_VALUE)) < 0) {
/* 089 */       partitionEnd = Long.MIN_VALUE;
/* 090 */     } else {
/* 091 */       partitionEnd = end.longValue();
/* 092 */     }
/* 093 */
/* 094 */     java.math.BigInteger startToEnd = java.math.BigInteger.valueOf(partitionEnd).subtract(
/* 095 */       java.math.BigInteger.valueOf(range_nextIndex_0));
/* 096 */     range_numElementsTodo_0  = startToEnd.divide(step).longValue();
/* 097 */     if (range_numElementsTodo_0 < 0) {
/* 098 */       range_numElementsTodo_0 = 0;
/* 099 */     } else if (startToEnd.remainder(step).compareTo(java.math.BigInteger.valueOf(0L)) != 0) {
/* 100 */       range_numElementsTodo_0++;
/* 101 */     }
/* 102 */   }
/* 103 */
/* 104 */   protected void processNext() throws java.io.IOException {
/* 105 */     // initialize Range
/* 106 */     if (!range_initRange_0) {
/* 107 */       range_initRange_0 = true;
/* 108 */       initRange(partitionIndex);
/* 109 */     }
/* 110 */
/* 111 */     while (true) {
/* 112 */       if (range_nextIndex_0 == range_batchEnd_0) {
/* 113 */         long range_nextBatchTodo_0;
/* 114 */         if (range_numElementsTodo_0 > 1000L) {
/* 115 */           range_nextBatchTodo_0 = 1000L;
/* 116 */           range_numElementsTodo_0 -= 1000L;
/* 117 */         } else {
/* 118 */           range_nextBatchTodo_0 = range_numElementsTodo_0;
/* 119 */           range_numElementsTodo_0 = 0;
/* 120 */           if (range_nextBatchTodo_0 == 0) break;
/* 121 */         }
/* 122 */         range_batchEnd_0 += range_nextBatchTodo_0 * 1L;
/* 123 */       }
/* 124 */
/* 125 */       int range_localEnd_0 = (int)((range_batchEnd_0 - range_nextIndex_0) / 1L);
/* 126 */       for (int range_localIdx_0 = 0; range_localIdx_0 < range_localEnd_0; range_localIdx_0++) {
/* 127 */         long range_value_0 = ((long)range_localIdx_0 * 1L) + range_nextIndex_0;
/* 128 */
/* 129 */         // common sub-expressions
/* 130 */
/* 131 */         bnlj_doConsume_0(range_value_0);
/* 132 */
/* 133 */         if (shouldStop()) {
/* 134 */           range_nextIndex_0 = range_value_0 + 1L;
/* 135 */           ((org.apache.spark.sql.execution.metric.SQLMetric) references[0] /* numOutputRows */).add(range_localIdx_0 + 1);
/* 136 */           range_inputMetrics_0.incRecordsRead(range_localIdx_0 + 1);
/* 137 */           return;
/* 138 */         }
/* 139 */
/* 140 */       }
/* 141 */       range_nextIndex_0 = range_batchEnd_0;
/* 142 */       ((org.apache.spark.sql.execution.metric.SQLMetric) references[0] /* numOutputRows */).add(range_localEnd_0);
/* 143 */       range_inputMetrics_0.incRecordsRead(range_localEnd_0);
/* 144 */       range_taskContext_0.killTaskIfInterrupted();
/* 145 */     }
/* 146 */   }
/* 147 */
/* 148 */ }
```

### Why are the changes needed?

Improve query CPU performance. Added a micro benchmark query in `JoinBenchmark.scala`.
Saw 1x of run time improvement:

```
OpenJDK 64-Bit Server VM 11.0.9+11-LTS on Linux 4.14.219-161.340.amzn2.x86_64
Intel(R) Xeon(R) CPU E5-2670 v2  2.50GHz
broadcast nested loop join:                Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
-------------------------------------------------------------------------------------------------------------------------
broadcast nested loop join wholestage off          62922          63052         184          0.3        3000.3       1.0X
broadcast nested loop join wholestage on           30946          30972          26          0.7        1475.6       2.0X
```

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

* Added unit test in `WholeStageCodegenSuite.scala`, and existing unit tests for `BroadcastNestedLoopJoinExec`.
* Updated golden files for several TCPDS query plans, as whole stage code-gen for `BroadcastNestedLoopJoinExec` is triggered.
* Updated `JoinBenchmark-jdk11-results.txt ` and `JoinBenchmark-results.txt` with new benchmark result. Followed previous benchmark PRs - https://github.com/apache/spark/pull/27078 and https://github.com/apache/spark/pull/26003 to use same type of machine:

```
Amazon AWS EC2
type: r3.xlarge
region: us-west-2 (Oregon)
OS: Linux
```

Closes #31736 from c21/nested-join-exec.

Authored-by: Cheng Su <chengsu@fb.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2021-03-09 11:45:43 +00:00
Takeshi Yamamuro 43b23fd132 [SPARK-33498][SQL][TESTS][FOLLOWUP] Remove SQLConf.withExistingConf in CastSuite
### What changes were proposed in this pull request?

This PR intends to remove the unnecessary `SQLConf.withExistingConf` in `CastSuite`; since we removed `ParVector` in #31775, we no longer need to copy SQL configs into each thread env.

### Why are the changes needed?

Clean up the code.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Run the existing tests.

Closes #31785 from maropu/UpdateCastSuite.

Authored-by: Takeshi Yamamuro <yamamuro@apache.org>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2021-03-08 23:52:43 -08:00
Max Gekk 6a7701a5be [SPARK-34663][SQL][TESTS] Test year-month and day-time intervals in UDF
### What changes were proposed in this pull request?
Added new tests to `UDFSuite` to check `java.time.Period`/`java.time.Duration` in UDF as input parameters as well as UDF results.
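
An approximate shape of what the new tests exercise (assumed snippet with `spark.implicits._` in scope; not copied from the suite):

```scala
import java.time.Period
import org.apache.spark.sql.functions.udf
import spark.implicits._

// a Period goes in as a UDF parameter and comes back out as a UDF result
val plusYear = udf((p: Period) => p.plusYears(1))
Seq(Period.ofMonths(14)).toDF("p").select(plusYear($"p")).show()
```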

### Why are the changes needed?
To improve test coverage.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
By running new tests:
```
$ build/sbt "test:testOnly *UDFSuite"
```

Closes #31779 from MaxGekk/interval-udf.

Authored-by: Max Gekk <max.gekk@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2021-03-09 06:58:14 +00:00
Max Gekk 4ea27787bf [SPARK-34666][SQL][TESTS] Test DayTimeIntervalType and YearMonthIntervalType as ordered and atomic types
### What changes were proposed in this pull request?
Add `DayTimeIntervalType` and `YearMonthIntervalType` to `DataTypeTestUtils.ordered`/`atomicTypes`, and implement values generation of those types in `LiteralGenerator`/`RandomDataGenerator`. In this way, the types will be tested automatically in:
1. ArithmeticExpressionSuite:
    - "function least"
    - "function greatest"
2. PredicateSuite
    - "BinaryComparison consistency check"
    - "AND, OR, EqualTo, EqualNullSafe consistency check"
3. ConditionalExpressionSuite
    - "if"
4. RandomDataGeneratorSuite
    - "Basic types"
5. CastSuite
    - "null cast"
    - "up-cast"
    - "SPARK-27671: cast from nested null type in struct"
6. OrderingSuite
    - "GenerateOrdering with DayTimeIntervalType"
    - "GenerateOrdering with YearMonthIntervalType"
7. PredicateSuite
    - "IN with different types"
8. UnsafeRowSuite
    - "calling get(ordinal, datatype) on null columns"
9. SortSuite
    - "sorting on YearMonthIntervalType ..."
    - "sorting on DayTimeIntervalType ..."

### Why are the changes needed?
To improve test coverage.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
By running the affected test suites.

Closes #31782 from MaxGekk/test-interval-as-atomic.

Authored-by: Max Gekk <max.gekk@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2021-03-09 06:42:59 +00:00
allisonwang-db c32cac4cd6 [SPARK-34627][SQL] Use FunctionIdentifier in UnresolvedTableValuedFunction
### What changes were proposed in this pull request?
This PR updates UnresolvedTableValuedFunction's name to be a FunctionIdentifier instead of a string.

### Why are the changes needed?
To make UnresolvedTableValuedFunction consistent with UnresolvedFunction that uses FunctionIdentifier as the function name.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
Unit test.

Closes #31749 from allisonwang-db/spark-34627.

Authored-by: allisonwang-db <66282705+allisonwang-db@users.noreply.github.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2021-03-09 05:27:02 +00:00
Swinky 02e74b298a [SPARK-34598][SQL] RewritePredicateSubquery Rule must not update Filters without subqueries
### What changes were proposed in this pull request?
RewritePredicateSubquery Optimizer Rule must not update Filters without subqueries.

Following is one such example.

```
=== Applying Rule org.apache.spark.sql.catalyst.optimizer.RewritePredicateSubquery ===
 Project [a#0]                                                        Project [a#0]
!+- Filter (((a#0 > 1) OR (b#1 > 2)) AND ((c#2 > 1) AND (d#3 > 2)))   +- Filter ((((a#0 > 1) OR (b#1 > 2)) AND (c#2 > 1)) AND (d#3 > 2))
    +- LocalRelation <empty>, [a#0, b#1, c#2, d#3]                       +- LocalRelation <empty>, [a#0, b#1, c#2, d#3]
```

### Why are the changes needed?
minor change.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
Existing UTs pass.

Closes #31712 from Swinky/rewritePredicateFix.

Authored-by: Swinky <mannswinky@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2021-03-08 09:01:49 +00:00
Max Gekk e10bf64769 [SPARK-34615][SQL] Support java.time.Period as an external type of the year-month interval type
### What changes were proposed in this pull request?
In the PR, I propose to extend Spark SQL API to accept [`java.time.Period`](https://docs.oracle.com/javase/8/docs/api/java/time/Period.html) as an external type of recently added new Catalyst type - `YearMonthIntervalType` (see #31614). The Java class `java.time.Period` has similar semantic to ANSI SQL year-month interval type, and it is the most suitable to be an external type for `YearMonthIntervalType`. In more details:
1. Added `PeriodConverter` which converts `java.time.Period` instances to/from internal representation of the Catalyst type `YearMonthIntervalType` (to `Int` type). The `PeriodConverter` object uses new methods of `IntervalUtils`:
    - `periodToMonths()` converts the input period to the total length in months. If this period is too large to fit `Int`, the method throws the exception `ArithmeticException`. **Note:** _the input period has "days" precision, the method just ignores the days unit._
    - `monthToPeriod()` obtains a `java.time.Period` representing a number of months.
2. Support new type `YearMonthIntervalType` in `RowEncoder` via the methods `createDeserializerForPeriod()` and `createSerializerForJavaPeriod()`.
3. Extended the Literal API to construct literals from `java.time.Period` instances.

### Why are the changes needed?
1. To allow users parallelization of `java.time.Period` collections, and construct year-month interval columns. Also to collect such columns back to the driver side.
2. This will allow to write tests in other sub-tasks of SPARK-27790.

### Does this PR introduce _any_ user-facing change?
The PR extends existing functionality. So, users can parallelize instances of the `java.time.Period` class and collect them back:

```scala
scala> val ds = Seq(java.time.Period.ofYears(10).withMonths(2)).toDS
ds: org.apache.spark.sql.Dataset[java.time.Period] = [value: yearmonthinterval]

scala> ds.collect
res0: Array[java.time.Period] = Array(P10Y2M)
```

### How was this patch tested?
- Added a few tests to `CatalystTypeConvertersSuite` to check conversion from/to `java.time.Period`.
- Checking row encoding by new tests in `RowEncoderSuite`.
- Making literals of `YearMonthIntervalType` are tested in `LiteralExpressionSuite`.
- Check collecting by `DatasetSuite` and `JavaDatasetSuite`.
- New tests in `IntervalUtilsSuites` to check conversions `java.time.Period` <-> months.

Closes #31765 from MaxGekk/java-time-period.

Authored-by: Max Gekk <max.gekk@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2021-03-08 08:33:09 +00:00
yangjie01 43f355b5f2 [SPARK-34597][SQL] Replaces ParquetFileReader.readFooter with ParquetFileReader.open and getFooter
### What changes were proposed in this pull request?
The `ParquetFileReader.readFooter`-related methods have been marked as deprecated, and `Apache Parquet` suggests replacing them with the combination of the `ParquetFileReader.open()` and `getFooter()` methods.

This PR introduces the `ParquetFooterReader` utility class, due to some repetitive code patterns when reading parquet file footers.
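
A minimal sketch of the replacement pattern the message describes (hypothetical file path; assuming the parquet-hadoop `InputFile` API):

```scala
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.Path
import org.apache.parquet.hadoop.ParquetFileReader
import org.apache.parquet.hadoop.util.HadoopInputFile

// open the file, read its footer, and close the reader
val inputFile = HadoopInputFile.fromPath(new Path("/tmp/data.parquet"), new Configuration())
val reader = ParquetFileReader.open(inputFile)
val footer = try reader.getFooter finally reader.close()
```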

### Why are the changes needed?
Cleanup deprecated API usage.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
Pass the Jenkins or GitHub Action

Closes #31711 from LuciferYang/parquet-read-footer.

Authored-by: yangjie01 <yangjie01@baidu.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2021-03-07 23:38:40 -08:00
Peter Toth ab8a9a0ceb [SPARK-34545][SQL] Fix issues with valueCompare feature of pyrolite
### What changes were proposed in this pull request?

pyrolite 4.21 introduced and enabled value comparison by default (`valueCompare=true`) during object memoization and serialization: https://github.com/irmen/Pyrolite/blob/pyrolite-4.21/java/src/main/java/net/razorvine/pickle/Pickler.java#L112-L122
This change has an undesired effect when we serialize a row (actually `GenericRowWithSchema`) to be passed to python: https://github.com/apache/spark/blob/branch-3.0/sql/core/src/main/scala/org/apache/spark/sql/execution/python/EvaluatePython.scala#L60. A simple example is that
```
new GenericRowWithSchema(Array(1.0, 1.0), StructType(Seq(StructField("_1", DoubleType), StructField("_2", DoubleType))))
```
and
```
new GenericRowWithSchema(Array(1, 1), StructType(Seq(StructField("_1", IntegerType), StructField("_2", IntegerType))))
```
are currently considered equal, and the second instance is replaced with a memo reference to the first one during serialization.

### Why are the changes needed?
The above can cause nasty issues like the one in https://issues.apache.org/jira/browse/SPARK-34545 description:

```
>>> from pyspark.sql.functions import udf
>>> from pyspark.sql.types import *
>>>
>>> def udf1(data_type):
        def u1(e):
            return e[0]
        return udf(u1, data_type)
>>>
>>> df = spark.createDataFrame([((1.0, 1.0), (1, 1))], ['c1', 'c2'])
>>>
>>> df = df.withColumn("c3", udf1(DoubleType())("c1"))
>>> df = df.withColumn("c4", udf1(IntegerType())("c2"))
>>>
>>> df.select("c3").show()
+---+
| c3|
+---+
|1.0|
+---+

>>> df.select("c4").show()
+---+
| c4|
+---+
|  1|
+---+

>>> df.select("c3", "c4").show()
+---+----+
| c3|  c4|
+---+----+
|1.0|null|
+---+----+
```
This is because during serialization from the JVM to Python, `GenericRowWithSchema(1.0, 1.0)` (`c1`) is memoized first, and when `GenericRowWithSchema(1, 1)` (`c2`) comes next, it is replaced with a short memo code pointing to `c1` (instead of serializing `c2` itself) because the two rows are `equal()`. The Python function then runs, but the return type of `c4` is expected to be `IntegerType`, and if a different type (`DoubleType`) comes back from Python it is discarded: https://github.com/apache/spark/blob/branch-3.0/sql/core/src/main/scala/org/apache/spark/sql/execution/python/EvaluatePython.scala#L108-L113

After this PR:
```
>>> df.select("c3", "c4").show()
+---+---+
| c3| c4|
+---+---+
|1.0|  1|
+---+---+
```
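For reference, a hedged sketch of how the flag can be turned off on pyrolite's `Pickler` (based on the constructor referenced above; the exact call site in Spark's serializers may differ):

```scala
import net.razorvine.pickle.Pickler

// Keep memoization, but compare memoized objects by reference instead of equals(),
// so two rows with equal values but different schemas are serialized separately.
val pickler = new Pickler(/* useMemo = */ true, /* valueCompare = */ false)
val bytes: Array[Byte] = pickler.dumps(Array(1.0, 1.0))
```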

### Does this PR introduce _any_ user-facing change?
Yes, fixes a correctness issue.

### How was this patch tested?
Added new UT + manual tests.

Closes #31682 from peter-toth/SPARK-34545-fix-row-comparison.

Authored-by: Peter Toth <peter.toth@gmail.com>
Signed-off-by: Sean Owen <srowen@gmail.com>
2021-03-07 19:12:42 -06:00
wangguangxin.cn 9ec8696f11 [SPARK-34634][SQL] ResolveReferences.dedupRight should handle ScriptTransformation
### What changes were proposed in this pull request?
When we do a self join with a transform in a CTE, Spark throws an AnalysisException.

A simple way to reproduce is

```
create temporary view t as select * from values 0, 1, 2 as t(a);

WITH temp AS (
  SELECT TRANSFORM(a) USING 'cat' AS (b string) FROM t
)
SELECT t1.b FROM temp t1 JOIN temp t2 ON t1.b = t2.b
```

before this patch, it throws

```
org.apache.spark.sql.AnalysisException: cannot resolve '`t1.b`' given input columns: [t1.b]; line 6 pos 41;
'Project ['t1.b]
+- 'Join Inner, ('t1.b = 't2.b)
   :- SubqueryAlias t1
   :  +- SubqueryAlias temp
   :     +- ScriptTransformation [a#1], cat, [b#2], ScriptInputOutputSchema(List(),List(),Some(org.apache.hadoop.hive.serde2.DelimitedJSONSerDe),Some(org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe),List((field.delim,	)),List((field.delim,	)),Some(org.apache.hadoop.hive.ql.exec.TextRecordReader),Some(org.apache.hadoop.hive.ql.exec.TextRecordWriter),false)
   :        +- SubqueryAlias t
   :           +- Project [a#1]
   :              +- SubqueryAlias t
   :                 +- LocalRelation [a#1]
   +- SubqueryAlias t2
      +- SubqueryAlias temp
         +- ScriptTransformation [a#1], cat, [b#2], ScriptInputOutputSchema(List(),List(),Some(org.apache.hadoop.hive.serde2.DelimitedJSONSerDe),Some(org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe),List((field.delim,	)),List((field.delim,	)),Some(org.apache.hadoop.hive.ql.exec.TextRecordReader),Some(org.apache.hadoop.hive.ql.exec.TextRecordWriter),false)
            +- SubqueryAlias t
               +- Project [a#1]
                  +- SubqueryAlias t
                     +- LocalRelation [a#1]
```

### Does this PR introduce _any_ user-facing change?
NO

### How was this patch tested?
Add a UT

Closes #31752 from WangGuangxin/selfjoin-with-transform.

Authored-by: wangguangxin.cn <wangguangxin.cn@bytedance.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2021-03-07 15:53:52 +09:00
Yuming Wang 616c818e7c [SPARK-34628][SQL] Remove GlobalLimit operator if its child max rows not larger than limit number
### What changes were proposed in this pull request?

This PR removes the `GlobalLimit` operator when its child's max rows are not larger than the limit number. For example:
```
val testRelation = LocalRelation.fromExternalRows(Seq("a".attr.int, "b".attr.int, "c".attr.int), 1.to(10).map(_ => Row(1, 2, 3)) )
val query = GlobalLimit(100, testRelation)
```
We can remove this `GlobalLimit`.
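A hedged sketch of the optimizer rule (not necessarily the exact rule added by this PR):

```scala
import org.apache.spark.sql.catalyst.expressions.IntegerLiteral
import org.apache.spark.sql.catalyst.plans.logical.{GlobalLimit, LogicalPlan}
import org.apache.spark.sql.catalyst.rules.Rule

// Drop a GlobalLimit whose child is statically known to produce
// no more rows than the limit itself.
object EliminateRedundantGlobalLimit extends Rule[LogicalPlan] {
  def apply(plan: LogicalPlan): LogicalPlan = plan transform {
    case GlobalLimit(IntegerLiteral(limit), child) if child.maxRows.exists(_ <= limit) =>
      child
  }
}
```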

### Why are the changes needed?

Further optimize the query.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Unit test.

Closes #31750 from wangyum/SPARK-34628.

Authored-by: Yuming Wang <yumwang@ebay.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2021-03-06 09:54:15 -08:00
helloman 1a9722420e [SPARK-34595][SQL] DPP support RLIKE
### What changes were proposed in this pull request?
This PR makes DPP (Dynamic Partition Pruning) support the RLIKE expression:

```sql
SELECT date_id, product_id FROM fact_sk f
JOIN dim_store s
ON f.store_id = s.store_id WHERE s.country RLIKE  '[DE|US]'
```
### Why are the changes needed?
Improve query performance.

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
Unit test.

Closes #31722 from chaojun-zhang/SPARK-34595.

Authored-by: helloman <zcj23085@gmail.com>
Signed-off-by: Takeshi Yamamuro <yamamuro@apache.org>
2021-03-06 20:30:20 +09:00
Angerszhuuuu 654f19dfd7 [SPARK-34621][SQL] Unify output of ShowCreateTableAsSerdeCommand and ShowCreateTableCommand
### What changes were proposed in this pull request?
Unify the output of `ShowCreateTableAsSerdeCommand` and `ShowCreateTableCommand`.

### Why are the changes needed?
To keep the output of `ShowCreateTableAsSerdeCommand` and `ShowCreateTableCommand` consistent.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?

Closes #31737 from AngersZhuuuu/SPARK-34621.

Authored-by: Angerszhuuuu <angers.zhu@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2021-03-05 07:44:07 +00:00
suqilong ca326c4bb3 [SPARK-22748][SQL] Analyze __grouping__id as a literal function
### What changes were proposed in this pull request?

This PR intends to refactor the logic to resolve `__grouping_id` in the `Analyzer`; it moves the logic from `ResolveFunctions` to `ResolveReferences` (`resolveLiteralFunction`).

The original author of this PR is sqlwindspeaker (#30781).

Closes #30781.

### Why are the changes needed?

Code refactoring.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Added tests in `AnalysisSuite`.

Closes #31751 from maropu/SPARK-22748.

Authored-by: suqilong <suqilong@qiyi.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2021-03-05 07:40:58 +00:00
Wenchen Fan dc78f337cb [SPARK-34609][SQL] Unify resolveExpressionBottomUp and resolveExpressionTopDown
### What changes were proposed in this pull request?

It's a bit confusing to see `resolveExpressionBottomUp` and `resolveExpressionTopDown`, which provide similar functionalities but with different tree traverse order. It turns out that the real difference between these 2 methods is: which attributes should the columns be resolved to? `resolveExpressionTopDown` resolves columns using output attributes of the plan children, `resolveExpressionBottomUp` resolves columns using output attributes of the plan itself.

This PR unifies `resolveExpressionBottomUp` and `resolveExpressionTopDown` and put the common logic in a new method, and let `resolveExpressionBottomUp` and `resolveExpressionTopDown` just call the new method. This PR also renames `resolveExpressionBottomUp` and `resolveExpressionTopDown` to make the difference clear.

### Why are the changes needed?

code cleanup

### Does this PR introduce _any_ user-facing change?

no

### How was this patch tested?

existing tests

Closes #31728 from cloud-fan/resolve.

Authored-by: Wenchen Fan <wenchen@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2021-03-05 05:59:15 +00:00
Angerszhuuuu 979b9bcf5d [SPARK-34608][SQL] Remove unused output of AddJarCommand
### What changes were proposed in this pull request?
Remove the unused output of `AddJarCommand` to keep the command consistent and the code clean.

### Why are the changes needed?
To keep the command consistent and the code clean.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
Not needed.

Closes #31725 from AngersZhuuuu/SPARK-34608.

Authored-by: Angerszhuuuu <angers.zhu@gmail.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2021-03-04 21:19:07 -08:00
ulysses-you 43aacd5069 [SPARK-34613][SQL] Fix view does not capture disable hint config
### What changes were proposed in this pull request?

Add an allow list of SQL configs that should be captured for views.

### Why are the changes needed?

Spark stores the original SQL text of a view, then captures and stores SQL configs in the view metadata.

Config capture skips configs with certain prefixes, e.g. `spark.sql.optimizer.`, but unfortunately `spark.sql.optimizer.disableHints` starts with `spark.sql.optimizer.`.

We need an allow list to make sure such configs are still captured.
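A hedged sketch of the allow-list idea (the names and prefixes below are illustrative, not Spark's actual internals):

```scala
// Configs under excluded prefixes are normally skipped when capturing view configs,
// unless they are explicitly allow-listed.
val excludedPrefixes = Seq("spark.sql.optimizer.", "spark.sql.analyzer.")
val allowList = Set("spark.sql.optimizer.disableHints")

def shouldCaptureConfig(key: String): Boolean =
  allowList.contains(key) || !excludedPrefixes.exists(key.startsWith)
```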

### Does this PR introduce _any_ user-facing change?

Yes, this is a bug fix.

### How was this patch tested?

Add test.

Closes #31732 from ulysses-you/SPARK-34613.

Authored-by: ulysses-you <ulyssesyou18@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2021-03-05 12:19:30 +08:00
Kent Yao 814d81c1e5 [SPARK-34376][SQL] Support regexp as a SQL function
### What changes were proposed in this pull request?

We have the rule `RLIKE: 'RLIKE' | 'REGEXP';` in `SqlBase.g4`, but we seem to have missed adding `REGEXP` as a SQL function, just like `RLIKE`.

### Why are the changes needed?

symmetry and beauty
This is also a built-in function in Hive, so we can reduce the migration pain for those users.

### Does this PR introduce _any_ user-facing change?

Yes, a new `regexp` function as an alias of `rlike`.
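A hedged usage example (assuming the new alias behaves exactly like `rlike`):

```sql
-- Both should return true after this change.
SELECT rlike('Spark', '^Sp');
SELECT regexp('Spark', '^Sp');
```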

### How was this patch tested?

new tests

Closes #31488 from yaooqinn/SPARK-34376.

Authored-by: Kent Yao <yao@apache.org>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2021-03-05 12:09:28 +09:00
Takeshi Yamamuro dbce74d39d [SPARK-34607][SQL] Add Utils.isMemberClass to fix a malformed class name error on jdk8u
### What changes were proposed in this pull request?

This PR intends to fix a bug of `objects.NewInstance` when a user runs Spark on jdk8u and a given `cls` in `NewInstance` is a deeply-nested inner class, e.g.:
```
  object OuterLevelWithVeryVeryVeryLongClassName1 {
    object OuterLevelWithVeryVeryVeryLongClassName2 {
      object OuterLevelWithVeryVeryVeryLongClassName3 {
        object OuterLevelWithVeryVeryVeryLongClassName4 {
          object OuterLevelWithVeryVeryVeryLongClassName5 {
            object OuterLevelWithVeryVeryVeryLongClassName6 {
              object OuterLevelWithVeryVeryVeryLongClassName7 {
                object OuterLevelWithVeryVeryVeryLongClassName8 {
                  object OuterLevelWithVeryVeryVeryLongClassName9 {
                    object OuterLevelWithVeryVeryVeryLongClassName10 {
                      object OuterLevelWithVeryVeryVeryLongClassName11 {
                        object OuterLevelWithVeryVeryVeryLongClassName12 {
                          object OuterLevelWithVeryVeryVeryLongClassName13 {
                            object OuterLevelWithVeryVeryVeryLongClassName14 {
                              object OuterLevelWithVeryVeryVeryLongClassName15 {
                                object OuterLevelWithVeryVeryVeryLongClassName16 {
                                  object OuterLevelWithVeryVeryVeryLongClassName17 {
                                    object OuterLevelWithVeryVeryVeryLongClassName18 {
                                      object OuterLevelWithVeryVeryVeryLongClassName19 {
                                        object OuterLevelWithVeryVeryVeryLongClassName20 {
                                          case class MalformedNameExample2(x: Int)
                                        }}}}}}}}}}}}}}}}}}}}
```

The root cause that Kris (rednaxelafx) investigated is as follows (kudos to Kris):

The reason why the test case above is so convoluted is in the way Scala generates the class name for nested classes. In general, Scala generates a class name for a nested class by inserting the dollar-sign ( `$` ) in between each level of class nesting. The problem is that this format can concatenate into a very long string that goes beyond certain limits, so Scala will change the class name format beyond certain length threshold.

For the example above, we can see that the first two levels of class nesting have class names that look like this:
```
org.apache.spark.sql.catalyst.encoders.ExpressionEncoderSuite$OuterLevelWithVeryVeryVeryLongClassName1$
org.apache.spark.sql.catalyst.encoders.ExpressionEncoderSuite$OuterLevelWithVeryVeryVeryLongClassName1$OuterLevelWithVeryVeryVeryLongClassName2$
```
If we leave out the fact that Scala uses a dollar-sign ( `$` ) suffix for the class name of the companion object, `OuterLevelWithVeryVeryVeryLongClassName1`'s full name is a prefix (substring) of `OuterLevelWithVeryVeryVeryLongClassName2`.

But if we keep going deeper into the levels of nesting, you'll find names that look like:
```
org.apache.spark.sql.catalyst.encoders.ExpressionEncoderSuite$OuterLevelWithVeryVeryVeryLongClassNam$$$$2a1321b953c615695d7442b2adb1$$$$ryVeryLongClassName8$OuterLevelWithVeryVeryVeryLongClassName9$OuterLevelWithVeryVeryVeryLongClassName10$
org.apache.spark.sql.catalyst.encoders.ExpressionEncoderSuite$OuterLevelWithVeryVeryVeryLongClassNam$$$$2a1321b953c615695d7442b2adb1$$$$ryVeryLongClassName8$OuterLevelWithVeryVeryVeryLongClassName9$OuterLevelWithVeryVeryVeryLongClassName10$OuterLevelWithVeryVeryVeryLongClassName11$
org.apache.spark.sql.catalyst.encoders.ExpressionEncoderSuite$OuterLevelWithVeryVeryVeryLongClassNam$$$$85f068777e7ecf112afcbe997d461b$$$$VeryLongClassName11$OuterLevelWithVeryVeryVeryLongClassName12$
org.apache.spark.sql.catalyst.encoders.ExpressionEncoderSuite$OuterLevelWithVeryVeryVeryLongClassNam$$$$85f068777e7ecf112afcbe997d461b$$$$VeryLongClassName11$OuterLevelWithVeryVeryVeryLongClassName12$OuterLevelWithVeryVeryVeryLongClassName13$
org.apache.spark.sql.catalyst.encoders.ExpressionEncoderSuite$OuterLevelWithVeryVeryVeryLongClassNam$$$$85f068777e7ecf112afcbe997d461b$$$$VeryLongClassName11$OuterLevelWithVeryVeryVeryLongClassName12$OuterLevelWithVeryVeryVeryLongClassName13$OuterLevelWithVeryVeryVeryLongClassName14$
org.apache.spark.sql.catalyst.encoders.ExpressionEncoderSuite$OuterLevelWithVeryVeryVeryLongClassNam$$$$5f7ad51804cb1be53938ea804699fa$$$$VeryLongClassName14$OuterLevelWithVeryVeryVeryLongClassName15$
org.apache.spark.sql.catalyst.encoders.ExpressionEncoderSuite$OuterLevelWithVeryVeryVeryLongClassNam$$$$5f7ad51804cb1be53938ea804699fa$$$$VeryLongClassName14$OuterLevelWithVeryVeryVeryLongClassName15$OuterLevelWithVeryVeryVeryLongClassName16$
org.apache.spark.sql.catalyst.encoders.ExpressionEncoderSuite$OuterLevelWithVeryVeryVeryLongClassNam$$$$5f7ad51804cb1be53938ea804699fa$$$$VeryLongClassName14$OuterLevelWithVeryVeryVeryLongClassName15$OuterLevelWithVeryVeryVeryLongClassName16$OuterLevelWithVeryVeryVeryLongClassName17$
org.apache.spark.sql.catalyst.encoders.ExpressionEncoderSuite$OuterLevelWithVeryVeryVeryLongClassNam$$$$69b54f16b1965a31e88968df1a58d8$$$$VeryLongClassName17$OuterLevelWithVeryVeryVeryLongClassName18$
org.apache.spark.sql.catalyst.encoders.ExpressionEncoderSuite$OuterLevelWithVeryVeryVeryLongClassNam$$$$69b54f16b1965a31e88968df1a58d8$$$$VeryLongClassName17$OuterLevelWithVeryVeryVeryLongClassName18$OuterLevelWithVeryVeryVeryLongClassName19$
org.apache.spark.sql.catalyst.encoders.ExpressionEncoderSuite$OuterLevelWithVeryVeryVeryLongClassNam$$$$69b54f16b1965a31e88968df1a58d8$$$$VeryLongClassName17$OuterLevelWithVeryVeryVeryLongClassName18$OuterLevelWithVeryVeryVeryLongClassName19$OuterLevelWithVeryVeryVeryLongClassName20$
```
with a hash code in the middle and various levels of nesting omitted.

The `java.lang.Class.isMemberClass` method is implemented in JDK8u as:
http://hg.openjdk.java.net/jdk8u/jdk8u/jdk/file/tip/src/share/classes/java/lang/Class.java#l1425
```
    /**
     * Returns {@code true} if and only if the underlying class
     * is a member class.
     *
     * @return {@code true} if and only if this class is a member class.
     * @since 1.5
     */
    public boolean isMemberClass() {
        return getSimpleBinaryName() != null && !isLocalOrAnonymousClass();
    }

    /**
     * Returns the "simple binary name" of the underlying class, i.e.,
     * the binary name without the leading enclosing class name.
     * Returns {@code null} if the underlying class is a top level
     * class.
     */
    private String getSimpleBinaryName() {
        Class<?> enclosingClass = getEnclosingClass();
        if (enclosingClass == null) // top level class
            return null;
        // Otherwise, strip the enclosing class' name
        try {
            return getName().substring(enclosingClass.getName().length());
        } catch (IndexOutOfBoundsException ex) {
            throw new InternalError("Malformed class name", ex);
        }
    }
```
and the problematic code is `getName().substring(enclosingClass.getName().length())` -- if a class's enclosing class's full name is *longer* than the nested class's full name, this logic would end up going out of bounds.

The bug has been fixed in JDK9 by https://bugs.java.com/bugdatabase/view_bug.do?bug_id=8057919 , but still exists in the latest JDK8u release. So from the Spark side we'd need to do something to avoid hitting this problem.
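A hedged sketch of the defensive pattern (not Spark's actual `Utils.isMemberClass` implementation; the fallback value is an assumption for illustration):

```scala
// Guard the JDK call, which can throw InternalError("Malformed class name")
// on JDK8u for deeply nested Scala classes.
def safeIsMemberClass(cls: Class[_]): Boolean =
  try cls.isMemberClass
  catch { case _: InternalError => false } // conservative fallback (assumption)
```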

### Why are the changes needed?

Bugfix on jdk8u.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Added tests.

Closes #31733 from maropu/SPARK-34607.

Authored-by: Takeshi Yamamuro <yamamuro@apache.org>
Signed-off-by: Takeshi Yamamuro <yamamuro@apache.org>
2021-03-05 08:59:30 +09:00
Max Gekk 17601e014c [SPARK-34605][SQL] Support java.time.Duration as an external type of the day-time interval type
### What changes were proposed in this pull request?
In the PR, I propose to extend Spark SQL API to accept [`java.time.Duration`](https://docs.oracle.com/javase/8/docs/api/java/time/Duration.html) as an external type of recently added new Catalyst type - `DayTimeIntervalType` (see #31614). The Java class `java.time.Duration` has similar semantic to ANSI SQL day-time interval type, and it is the most suitable to be an external type for `DayTimeIntervalType`. In more details:
1. Added `DurationConverter` which converts `java.time.Duration` instances to/from internal representation of the Catalyst type `DayTimeIntervalType` (to `Long` type). The `DurationConverter` object uses new methods of `IntervalUtils`:
    - `durationToMicros()` converts the input duration to its total length in microseconds. If the duration is too large to fit in a `Long`, the method throws an `ArithmeticException`. **Note:** _the input duration has nanosecond precision; the method truncates the nanos part to microseconds by dividing by 1000._
    - `microsToDuration()` obtains a `java.time.Duration` representing a number of microseconds (see the sketch after this list).
2. Support new type `DayTimeIntervalType` in `RowEncoder` via the methods `createDeserializerForDuration()` and `createSerializerForJavaDuration()`.
3. Extended the Literal API to construct literals from `java.time.Duration` instances.
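A hedged sketch of the two conversions described in item 1 (the use of `Math.multiplyExact`/`addExact` is an assumption about the approach, not a quote of the actual code):

```scala
import java.time.Duration
import java.time.temporal.ChronoUnit
import java.util.concurrent.TimeUnit

// Total length in microseconds; nanos are truncated to micros by dividing by 1000.
// Math.*Exact throws ArithmeticException if the result does not fit in a Long.
def durationToMicros(duration: Duration): Long =
  Math.addExact(
    Math.multiplyExact(duration.getSeconds, TimeUnit.SECONDS.toMicros(1)),
    duration.getNano / 1000L)

// The inverse direction: a Duration representing a given number of microseconds.
def microsToDuration(micros: Long): Duration =
  Duration.of(micros, ChronoUnit.MICROS)
```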

### Why are the changes needed?
1. To allow users to parallelize `java.time.Duration` collections and construct day-time interval columns, and also to collect such columns back to the driver side.
2. This will allow writing tests in other sub-tasks of SPARK-27790.

### Does this PR introduce _any_ user-facing change?
The PR extends existing functionality. So, users can parallelize instances of the `java.time.Duration` class and collect them back:

```Scala
scala> val ds = Seq(java.time.Duration.ofDays(10)).toDS
ds: org.apache.spark.sql.Dataset[java.time.Duration] = [value: daytimeinterval]

scala> ds.collect
res0: Array[java.time.Duration] = Array(PT240H)
```

### How was this patch tested?
- Added a few tests to `CatalystTypeConvertersSuite` to check conversion from/to `java.time.Duration`.
- Checking row encoding by new tests in `RowEncoderSuite`.
- Making literals of `DayTimeIntervalType` is tested in `LiteralExpressionSuite`.
- Check collecting by `DatasetSuite` and `JavaDatasetSuite`.

Closes #31729 from MaxGekk/java-time-duration.

Authored-by: Max Gekk <max.gekk@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2021-03-04 16:58:33 +00:00
yi.wu e7e016192f [SPARK-34482][SS] Correct the active SparkSession for StreamExecution.logicalPlan
### What changes were proposed in this pull request?

Set the active SparkSession to `sparkSessionForStream` and disable AQE & CBO before initializing `StreamExecution.logicalPlan`.
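A hedged sketch of the pattern (the session below is only a stand-in for `StreamExecution`'s internal `sparkSessionForStream`):

```scala
import org.apache.spark.sql.SparkSession

val sparkSessionForStream = SparkSession.builder().master("local[1]").getOrCreate()

// Disable AQE and CBO for the streaming plan, and make this session the active one
// so that rules reading SQLConf.get during logical-plan initialization see these values.
sparkSessionForStream.conf.set("spark.sql.adaptive.enabled", "false")
sparkSessionForStream.conf.set("spark.sql.cbo.enabled", "false")
SparkSession.setActiveSession(sparkSessionForStream)
// ... StreamExecution.logicalPlan would be initialized here ...
```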

### Why are the changes needed?

The active session should be `sparkSessionForStream`. Otherwise, settings like

6b34745cb9/sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/StreamExecution.scala (L332-L335)

wouldn't take effect if callers access them from the active SQLConf, e.g., the rule of `InsertAdaptiveSparkPlan`. Besides, unlike `InsertAdaptiveSparkPlan` (which skips streaming plan), `CostBasedJoinReorder` seems to have the chance to take effect theoretically.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Tested manually. Before the fix, `InsertAdaptiveSparkPlan` would try to apply AQE on the plan (though it wouldn't take effect). After this fix, the rule returns directly.

Closes #31600 from Ngone51/active-session-for-stream.

Authored-by: yi.wu <yi.wu@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2021-03-04 22:41:11 +08:00
Angerszhuuuu 401e270c17 [SPARK-34567][SQL] CreateTableAsSelect should update metrics too
### What changes were proposed in this pull request?
For the `CreateTableAsSelect` command we use `InsertIntoHiveTable` or `InsertIntoHadoopFsRelationCommand` to insert data.
We update the metrics of `InsertIntoHiveTable` and `InsertIntoHadoopFsRelationCommand` in `FileFormatWriter.write()`, but we only show `CreateTableAsSelectCommand` in the WebUI SQL tab.
We need to update `CreateTableAsSelectCommand`'s metrics too.

Before this PR:
![image](https://user-images.githubusercontent.com/46485123/109411226-81f44480-79db-11eb-99cb-b9686b15bf61.png)

After this PR:
![image](https://user-images.githubusercontent.com/46485123/109411232-8ae51600-79db-11eb-9111-3bea0bc2d475.png)

![image](https://user-images.githubusercontent.com/46485123/109905192-62aa2f80-7cd9-11eb-91f9-04b16c9238ae.png)

### Why are the changes needed?
Complete SQL Metrics

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
Manual test (MT); see the screenshots above.

Closes #31679 from AngersZhuuuu/SPARK-34567.

Authored-by: Angerszhuuuu <angers.zhu@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2021-03-04 20:42:47 +08:00
Gengliang Wang 2b1c170016 [SPARK-34614][SQL] ANSI mode: Casting String to Boolean should throw exception on parse error
### What changes were proposed in this pull request?

In ANSI mode, casting String to Boolean should throw an exception on parse error, instead of returning null

### Why are the changes needed?

For better ANSI compliance

### Does this PR introduce _any_ user-facing change?

Yes, in ANSI mode there will be an exception on parse failure of casting String value to Boolean type.
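A hedged illustration (assuming `spark.sql.ansi.enabled=true`):

```sql
SET spark.sql.ansi.enabled=true;
SELECT CAST('true' AS BOOLEAN);    -- returns true
SELECT CAST('invalid' AS BOOLEAN); -- now throws an exception instead of returning NULL
```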

### How was this patch tested?

Unit tests.

Closes #31734 from gengliangwang/ansiCastToBoolean.

Authored-by: Gengliang Wang <gengliang.wang@databricks.com>
Signed-off-by: Gengliang Wang <gengliang.wang@databricks.com>
2021-03-04 19:04:16 +08:00
Shixiong Zhu 53e4dba7c4 [SPARK-34599][SQL] Fix the issue that INSERT INTO OVERWRITE doesn't support partition columns containing dot for DSv2
### What changes were proposed in this pull request?

`ResolveInsertInto.staticDeleteExpression` should use `UnresolvedAttribute.quoted` to create the delete expression so that we will treat the entire `attr.name` as a column name.
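A hedged illustration of the difference between the two constructors:

```scala
import org.apache.spark.sql.catalyst.analysis.UnresolvedAttribute

// apply() splits on dots, treating "a.b" as a nested reference,
// while quoted() keeps the whole string as a single column name.
UnresolvedAttribute("a.b").nameParts        // Seq("a", "b")
UnresolvedAttribute.quoted("a.b").nameParts // Seq("a.b")
```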

### Why are the changes needed?

When users use a dot in a partition column name, queries like ```INSERT OVERWRITE $t1 PARTITION (`a.b` = 'a') (`c.d`) VALUES('b')``` do not work.

### Does this PR introduce _any_ user-facing change?

Without this fix, the above query will throw
```
[info]   org.apache.spark.sql.AnalysisException: cannot resolve '`a.b`' given input columns: [a.b, c.d];
[info] 'OverwriteByExpression RelationV2[a.b#17, c.d#18] default.tbl, ('a.b <=> cast(a as string)), false
[info] +- Project [a.b#19, ansi_cast(col1#16 as string) AS c.d#20]
[info]    +- Project [cast(a as string) AS a.b#19, col1#16]
[info]       +- LocalRelation [col1#16]
```

With the fix, the query will run correctly.

### How was this patch tested?

The new added test.

Closes #31713 from zsxwing/SPARK-34599.

Authored-by: Shixiong Zhu <zsxwing@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2021-03-04 15:12:53 +08:00
Angerszhuuuu db627107b7 [SPARK-34577][SQL] Fix drop/add columns to a dataset of DESCRIBE NAMESPACE
### What changes were proposed in this pull request?
In the PR, I propose to generate "stable" output attributes per the logical node of the DESCRIBE NAMESPACE command.

### Why are the changes needed?
This fixes the issue demonstrated by the example:

```
sql(s"CREATE NAMESPACE ns")
val description = sql(s"DESCRIBE NAMESPACE ns")
description.drop("name")
```

```
[info]   org.apache.spark.sql.AnalysisException: Resolved attribute(s) name#74 missing from name#25,value#26 in operator !Project [name#74]. Attribute(s) with the same name appear in the operation: name. Please check if the right attribute(s) are used.;
[info] !Project [name#74]
[info] +- LocalRelation [name#25, value#26]
```

### Does this PR introduce _any_ user-facing change?
After this change, `drop()`/`add()` work as expected.

### How was this patch tested?
Added UT

Closes #31705 from AngersZhuuuu/SPARK-34577.

Authored-by: Angerszhuuuu <angers.zhu@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2021-03-04 13:22:10 +08:00
Wenchen Fan 8f1eec4d13 [SPARK-34584][SQL] Static partition should also follow StoreAssignmentPolicy when insert into v2 tables
### What changes were proposed in this pull request?

This is a followup of https://github.com/apache/spark/pull/27597 and simply apply the fix in the v2 table insertion code path.

### Why are the changes needed?

bug fix

### Does this PR introduce _any_ user-facing change?

Yes, now v2 table insertions with static partitions also follow StoreAssignmentPolicy.

### How was this patch tested?

moved the test from https://github.com/apache/spark/pull/27597 to the general test suite `SQLInsertTestSuite`, which covers DS v2, file source, and hive tables.

Closes #31726 from cloud-fan/insert.

Authored-by: Wenchen Fan <wenchen@databricks.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2021-03-04 11:29:34 +09:00
Erik Krogen 9d2d620e98 [SPARK-33084][SQL][TEST][FOLLOWUP] Add ResetSystemProperties trait to SQLQuerySuite to avoid clearing ivy.home
### What changes were proposed in this pull request?
Add the `ResetSystemProperties` trait to `SQLQuerySuite` so that system property changes made by any of the tests will not affect other suites/tests. Specifically, the system property changes made by `SPARK-33084: Add jar support Ivy URI in SQL -- jar contains udf class` are targeted here (which sets and then clears `ivy.home`).

### Why are the changes needed?
PR #29966 added a new test case that adjusts the `ivy.home` system property to force Ivy to resolve an artifact from a custom location. At the end of the test, the value is cleared. Clearing the value meant that, if a custom value of `ivy.home` was configured externally, it would not apply for tests run after this test case.

### Does this PR introduce _any_ user-facing change?
No, this is only in tests.

### How was this patch tested?
Existing unit tests continue to pass, whether or not `spark.jars.ivySettings` is configured (which adjusts the behavior of Ivy w.r.t. handling of `ivy.home` and `ivy.default.ivy.user.dir` properties).

Closes #31694 from xkrogen/xkrogen-SPARK-33084-ivyhome-sysprop-followon.

Authored-by: Erik Krogen <xkrogen@apache.org>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2021-03-04 08:52:53 +09:00
Gengliang Wang 5aaab19685 [SPARK-34222][SQL][FOLLOWUP] Non-recursive implementation of buildBalancedPredicate
### What changes were proposed in this pull request?

Use a non-recursive implementation for the function `buildBalancedPredicate`.

### Why are the changes needed?

For better performance.
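A hedged sketch of the iterative idea (generic, not the exact Spark helper): combine adjacent pairs in rounds until one expression remains, which yields a balanced tree without recursion.

```scala
def buildBalanced[T](exprs: Seq[T], combine: (T, T) => T): T = {
  require(exprs.nonEmpty)
  var current = exprs
  while (current.length > 1) {
    // Pair up neighbours; an odd element is carried over to the next round.
    current = current.grouped(2).map {
      case Seq(a, b) => combine(a, b)
      case Seq(a)    => a
    }.toSeq
  }
  current.head
}

// Example: buildBalanced(Seq("a", "b", "c", "d"), (l: String, r: String) => s"($l AND $r)")
// yields "((a AND b) AND (c AND d))".
```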

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

Existing unit tests.
Also, a quick benchmark:
```
  test("buildBalancedPredicate") {
    val expressions = (1 to 1000).map(_ => Literal(true))
    val start = System.currentTimeMillis()
    buildBalancedPredicate(expressions, And)
    println(System.currentTimeMillis() - start)
  }
```
Before: 47ms
After: 4ms

Closes #31724 from gengliangwang/nonrecursive.

Authored-by: Gengliang Wang <gengliang.wang@databricks.com>
Signed-off-by: Gengliang Wang <gengliang.wang@databricks.com>
2021-03-04 01:01:28 +08:00
Karen Feng b01dd12805 [SPARK-34555][SQL] Resolve metadata output from DataFrame
### What changes were proposed in this pull request?

Add metadataOutput as a fallback to resolution.
Builds off https://github.com/apache/spark/pull/31654.

### Why are the changes needed?

The metadata columns could not be resolved via `df.col("metadataColName")` from the DataFrame API.

### Does this PR introduce _any_ user-facing change?

Yes, the metadata columns can now be resolved as described above.

### How was this patch tested?

Scala unit test.

Closes #31668 from karenfeng/spark-34555.

Authored-by: Karen Feng <karen.feng@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2021-03-03 22:07:41 +08:00
angerszhu 56edb8156f [SPARK-33474][SQL] Support TypeConstructed partition spec value
### What changes were proposed in this pull request?
Hive supports type-constructed values (typed literals) as partition spec values; Spark should support them too.

### Why are the changes needed?
Support type-constructed partition spec values to keep behavior consistent with Hive.

### Does this PR introduce _any_ user-facing change?
Yes, users can use type-constructed values as partition spec values, such as
```
CREATE TABLE t1(name STRING) PARTITIONED BY (part DATE)
INSERT INTO t1 PARTITION(part = date'2019-01-02') VALUES('a')

CREATE TABLE t2(name STRING) PARTITIONED BY (part TIMESTAMP)
INSERT INTO t2 PARTITION(part = timestamp'2019-01-02 11:11:11') VALUES('a')

CREATE TABLE t4(name STRING) PARTITIONED BY (part BINARY)
INSERT INTO t4 PARTITION(part = X'537061726B2053514C') VALUES('a')
```

### How was this patch tested?
Added UT

Closes #30421 from AngersZhuuuu/SPARK-33474.

Lead-authored-by: angerszhu <angers.zhu@gmail.com>
Co-authored-by: Angerszhuuuu <angers.zhu@gmail.com>
Co-authored-by: AngersZhuuuu <angers.zhu@gmail.com>
Signed-off-by: Takeshi Yamamuro <yamamuro@apache.org>
2021-03-03 16:48:50 +09:00
Kent Yao 499f620037 [MINOR][SQL][DOCS] Fix some wrong default values in SQL tuning guide's AQE section
### What changes were proposed in this pull request?

- `spark.sql.adaptive.coalescePartitions.initialPartitionNum`: 200 -> (none)
- `spark.sql.adaptive.skewJoin.skewedPartitionFactor`: 10 -> 5

### Why are the changes needed?

The wrong docs mislead people.

### Does this PR introduce _any_ user-facing change?

no

### How was this patch tested?

Passing doc build.

Closes #31717 from yaooqinn/minordoc0.

Authored-by: Kent Yao <yao@apache.org>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2021-03-03 15:00:09 +09:00
Swinky 229d2e0554 [SPARK-34222][SQL] Enhance boolean simplification rule
### What changes were proposed in this pull request?
Enhance boolean simplification rule by handling following scenarios:
(((a && b) && a && (a && c))) => a && b && c
(((a || b) || a || (a || c))) => a || b || c

### Why are the changes needed?
Minor improvement

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
Added UTs

Closes #31318 from Swinky/booleansimplification.

Authored-by: Swinky <mannswinky@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2021-03-03 05:25:28 +00:00
Angerszhuuuu 17f0e70fa0 [SPARK-34576][SQL] Fix drop/add columns to a dataset of DESCRIBE COLUMN
### What changes were proposed in this pull request?
In the PR, I propose to generate "stable" output attributes per the logical node of the DESCRIBE COLUMN command.

### Why are the changes needed?
This fixes the issue demonstrated by the example:

```
val tbl = "testcat.ns1.ns2.tbl"
sql(s"CREATE TABLE $tbl (c0 INT) USING _")
val description = sql(s"DESCRIBE TABLE $tbl c0")
description.drop("info_name")
```

```
[info]   org.apache.spark.sql.AnalysisException: Resolved attribute(s) info_name#74 missing from info_name#25,info_value#26 in operator !Project [info_name#74]. Attribute(s) with the same name appear in the operation: info_name. Please check if the right attribute(s) are used.;
[info] !Project [info_name#74]
[info] +- LocalRelation [info_name#25, info_value#26]
```

### Does this PR introduce _any_ user-facing change?
After this change, `drop()`/`add()` work as expected.

### How was this patch tested?
Added UT

Closes #31696 from AngersZhuuuu/SPARK-34576.

Authored-by: Angerszhuuuu <angers.zhu@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2021-03-03 12:51:30 +08:00
Max Gekk cd649e7aef [SPARK-27793][SQL] Add ANSI SQL day-time and year-month interval types
### What changes were proposed in this pull request?
In the PR, I propose to extend Catalyst's type system by two new types that conform to the SQL standard (see SQL:2016, section 4.6.3):
- `DayTimeIntervalType` represents the day-time interval type,
- `YearMonthIntervalType` for SQL year-month interval type.

This PR only adds the two new DataType implementations, and there will be more PRs as sub-tasks of SPARK-27790 to completely support the new ANSI interval types.

### Why are the changes needed?
Spark as it is today supports an INTERVAL datatype. However this type is of very limited use. Existing interval values cannot be compared with any other interval values, or persisted to storage. Spark users request to either implement new or expand existing built-in functions which produce some sort of measures for elapsed time, such as `DATEDIFF()`. Rather than work around the edges to fill the potholes of the existing INTERVAL data type, I would like to propose to deliver a proper ANSI compliant INTERVAL type that can be introduced with minimal incompatibility, is comparable and thus sortable, and can be persisted in tables.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
1. By checking coding style via:
```
$ ./dev/scalastyle
$ ./dev/lint-java
```
2. Run the test for the default sizes:
```
$ build/sbt "test:testOnly *DataTypeSuite"
```

Closes #31614 from MaxGekk/day-time-interval-type.

Authored-by: Max Gekk <max.gekk@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2021-03-03 04:44:23 +00:00
Cheng Su 5362f08125 [SPARK-34593][SQL] Preserve broadcast nested loop join partitioning and ordering
### What changes were proposed in this pull request?

`BroadcastNestedLoopJoinExec` does not preserve `outputPartitioning` and `outputOrdering` right now. But it can preserve the streamed side partitioning and ordering when possible. This can help avoid shuffle and sort in later stage, if there's join and aggregation in the query. See example queries in added unit test in `JoinSuite.scala`.

In addition, fix a bunch of minor places in `BroadcastNestedLoopJoinExec.scala` for better style and readability.

### Why are the changes needed?

Avoid shuffle and sort for certain complicated query shape. Better query performance can be achieved.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Added unit test in `JoinSuite.scala`.

Closes #31708 from c21/nested-join.

Authored-by: Cheng Su <chengsu@fb.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2021-03-03 04:32:28 +00:00
Kris Mok ecf4811764 [SPARK-34596][SQL] Use Utils.getSimpleName to avoid hitting Malformed class name in NewInstance.doGenCode
### What changes were proposed in this pull request?

Use `Utils.getSimpleName` to avoid hitting `Malformed class name` error in `NewInstance.doGenCode`.

### Why are the changes needed?

On older JDK versions (e.g. JDK8u), nested Scala classes may trigger `java.lang.Class.getSimpleName` to throw an `java.lang.InternalError: Malformed class name` error.
In this particular case, creating an `ExpressionEncoder` on such a nested Scala class would create a `NewInstance` expression under the hood, which will trigger the problem during codegen.

Similar to https://github.com/apache/spark/pull/29050, we should use Spark's `Utils.getSimpleName` utility function in place of `Class.getSimpleName` to avoid hitting the issue.

There are two other occurrences of `java.lang.Class.getSimpleName` in the same file, but they're safe because they're only guaranteed to be only used on Java classes, which don't have this problem, e.g.:
```scala
    // Make a copy of the data if it's unsafe-backed
    def makeCopyIfInstanceOf(clazz: Class[_ <: Any], value: String) =
      s"$value instanceof ${clazz.getSimpleName}? ${value}.copy() : $value"
    val genFunctionValue: String = lambdaFunction.dataType match {
      case StructType(_) => makeCopyIfInstanceOf(classOf[UnsafeRow], genFunction.value)
      case ArrayType(_, _) => makeCopyIfInstanceOf(classOf[UnsafeArrayData], genFunction.value)
      case MapType(_, _, _) => makeCopyIfInstanceOf(classOf[UnsafeMapData], genFunction.value)
      case _ => genFunction.value
    }
```
The Unsafe-* family of types are all Java types, so they're okay.

### Does this PR introduce _any_ user-facing change?

Fixes a bug that throws an error when using `ExpressionEncoder` on some nested Scala types, otherwise no changes.

### How was this patch tested?

Added a test case to `org.apache.spark.sql.catalyst.encoders.ExpressionEncoderSuite`. It'll fail on JDK8u before the fix, and pass after the fix.

Closes #31709 from rednaxelafx/spark-34596-master.

Authored-by: Kris Mok <kris.mok@databricks.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2021-03-03 12:22:51 +09:00
Liang-Chi Hsieh 107766661a [SPARK-34548][SQL][FOLLOW-UP] Call toSeq to recover Scala 2.13 build in RemoveNoopUnion
### What changes were proposed in this pull request?

Call `toSeq` to fix Scala 2.13 build error.

### Why are the changes needed?

It is needed to fix 2.13 build error.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Existing tests.

Closes #31716 from viirya/SPARK-34548-followup.

Authored-by: Liang-Chi Hsieh <viirya@gmail.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2021-03-03 11:12:58 +09:00
Liang-Chi Hsieh bab9531134 [SPARK-34548][SQL] Remove unnecessary children from Union under Distinct and Deduplicate
### What changes were proposed in this pull request?

This patch proposes to remove unnecessary children from Union under Distinct and Deduplicate.

### Why are the changes needed?

If there are any duplicate children of `Union` under `Distinct` or `Deduplicate`, they can be removed to simplify the query plan.
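A hedged illustration: because of the outer `DISTINCT`, the duplicate branch of the `UNION ALL` below is redundant, so the plan can be reduced to a distinct over a single scan of `t`.

```sql
SELECT DISTINCT * FROM (
  SELECT * FROM t
  UNION ALL
  SELECT * FROM t
);
```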

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

Unit test

Closes #31656 from viirya/SPARK-34548.

Authored-by: Liang-Chi Hsieh <viirya@gmail.com>
Signed-off-by: Liang-Chi Hsieh <viirya@gmail.com>
2021-03-02 17:09:08 -08:00
Kent Yao 6093a78dbd [SPARK-34558][SQL] warehouse path should be qualified ahead of populating and use
### What changes were proposed in this pull request?

Currently, the warehouse path gets fully qualified on the caller side when creating a database, table, partition, etc. An unqualified path is populated into the Spark and Hadoop confs, which leads to inconsistent API behaviors. We should qualify it ahead of time.

When the value is a relative path `spark.sql.warehouse.dir=lakehouse`, some behaviors become inconsistent, for example.

If the default database is absent at runtime, the app fails with

```java
Caused by: java.lang.IllegalArgumentException: java.net.URISyntaxException: Relative path in absolute URI: file:./lakehouse
	at org.apache.hadoop.fs.Path.initialize(Path.java:263)
	at org.apache.hadoop.fs.Path.<init>(Path.java:254)
	at org.apache.hadoop.hive.metastore.Warehouse.getDnsPath(Warehouse.java:133)
	at org.apache.hadoop.hive.metastore.Warehouse.getDnsPath(Warehouse.java:137)
	at org.apache.hadoop.hive.metastore.Warehouse.getWhRoot(Warehouse.java:150)
	at org.apache.hadoop.hive.metastore.Warehouse.getDefaultDatabasePath(Warehouse.java:163)
	at org.apache.hadoop.hive.metastore.HiveMetaStore$HMSHandler.createDefaultDB_core(HiveMetaStore.java:636)
	at org.apache.hadoop.hive.metastore.HiveMetaStore$HMSHandler.createDefaultDB(HiveMetaStore.java:655)
	at org.apache.hadoop.hive.metastore.HiveMetaStore$HMSHandler.init(HiveMetaStore.java:431)
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:498)
	at org.apache.hadoop.hive.metastore.RetryingHMSHandler.invokeInternal(RetryingHMSHandler.java:148)
	at org.apache.hadoop.hive.metastore.RetryingHMSHandler.invoke(RetryingHMSHandler.java:107)
	at org.apache.hadoop.hive.metastore.RetryingHMSHandler.<init>(RetryingHMSHandler.java:79)
	... 73 more
```

If the default database is present at runtime, the app can work with it, and if we create a database, it gets fully qualified, for example

```sql
spark-sql> create database test;
Time taken: 0.052 seconds
spark-sql> desc database test;
Database Name	test
Comment
Location	file:/Users/kentyao/Downloads/spark/spark-3.2.0-SNAPSHOT-bin-20210226/lakehouse/test.db
Owner	kentyao
Time taken: 0.023 seconds, Fetched 4 row(s)
```

Another thing is that the log becomes ambiguous, for example:

```logtalk
21/02/27 13:54:17 INFO SharedState: Setting hive.metastore.warehouse.dir ('null') to the value of spark.sql.warehouse.dir ('datalake').
21/02/27 13:54:17 INFO SharedState: Warehouse path is 'lakehouse'.
```
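A hedged sketch of qualifying a raw warehouse path up front, using Hadoop's FileSystem API (not the exact `SharedState` code):

```scala
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.Path

// Resolve a possibly relative path (e.g. "lakehouse") into a fully qualified URI
// such as "file:/abs/working/dir/lakehouse" before it is populated into any conf.
def qualifyWarehousePath(raw: String, hadoopConf: Configuration): String = {
  val path = new Path(raw)
  val fs = path.getFileSystem(hadoopConf)
  fs.makeQualified(path).toString
}
```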

### Why are the changes needed?

Fix the bug and the ambiguity.

### Does this PR introduce _any_ user-facing change?

Yes, the path is now resolved in the proper order: `warehouse -> database -> table -> partition`.

### How was this patch tested?

With a new unit test added.

Closes #31671 from yaooqinn/SPARK-34558.

Authored-by: Kent Yao <yao@apache.org>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2021-03-02 15:14:19 +00:00
kevincmchen 9e8547ca43 [SPARK-34498][SQL][TESTS] fix the remaining problems in #31560
### What changes were proposed in this pull request?

This is a followup of #31560,
In #31560, we added `JavaSimpleWritableDataSource` and left some small problems, like the unused interface `SessionConfigSupport` and an inconsistent schema between `JavaSimpleWritableDataSource` and `SimpleWritableDataSource`.
This PR fixes the remaining problems in #31560.

### Why are the changes needed?

1. `SessionConfigSupport` in `JavaSimpleWritableDataSource` and `SimpleWritableDataSource` is never used, so we don't need to implement it.
2. Change the schema of `SimpleWritableDataSource` to match `TestingV2Source`.

### Does this PR introduce _any_ user-facing change?

NO

### How was this patch tested?

existing testsuites

Closes #31621 from kevincmchen/SPARK-34498.

Authored-by: kevincmchen <kevincmchen@tencent.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2021-03-02 15:08:23 +00:00
Karen Feng 2e54d68eb9 [SPARK-34547][SQL] Only use metadata columns for resolution as last resort
### What changes were proposed in this pull request?

Today, child expressions may be resolved based on "real" or metadata output attributes. We should prefer the real attribute during resolution if one exists.

### Why are the changes needed?

Today, attempting to resolve an expression when there is a "real" output attribute and a metadata attribute with the same name results in resolution failure. This is likely unexpected, as the user may not know about the metadata attribute.

### Does this PR introduce _any_ user-facing change?

Yes. Previously, the user would see an error message when resolving a column with the same name as a "real" output attribute and a metadata attribute as below:
```
org.apache.spark.sql.AnalysisException: Reference 'index' is ambiguous, could be: testcat.ns1.ns2.tableTwo.index, testcat.ns1.ns2.tableOne.index.; line 1 pos 71
at org.apache.spark.sql.catalyst.expressions.package$AttributeSeq.resolve(package.scala:363)
at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.resolveChildren(LogicalPlan.scala:107)
```

Now, resolution succeeds and provides the "real" output attribute.

### How was this patch tested?

Added a unit test.

Closes #31654 from karenfeng/fallback-resolve-metadata.

Authored-by: Karen Feng <karen.feng@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2021-03-02 17:27:13 +08:00
Amandeep Sharma 4bda3c0f02 [SPARK-34417][SQL] org.apache.spark.sql.DataFrameNaFunctions.fillMap fails for column name having a dot
**What changes were proposed in this pull request?**

This PR fixes dataframe.na.fillMap() for column having a dot in the name as mentioned in [SPARK-34417](https://issues.apache.org/jira/browse/SPARK-34417).

Use resolved attributes of a column for replacing null values.

**Why are the changes needed?**
dataframe.na.fillMap() does not work for a column having a dot in its name.

**Does this PR introduce any user-facing change?**
None

**How was this patch tested?**
Added unit test for the same

Closes #31545 from amandeep-sharma/master.

Lead-authored-by: Amandeep Sharma <happyaman91@gmail.com>
Co-authored-by: Amandeep Sharma <amandeep.sharma@oracle.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2021-03-02 17:14:15 +08:00
Anton Okolnychyi 08a125761d [SPARK-34585][SQL] Remove no longer needed BatchWriteHelper
### What changes were proposed in this pull request?

As a follow-up to SPARK-34456, this PR removes `BatchWriteHelper` completely.

### Why are the changes needed?

These changes remove no longer used code.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Existing tests.

Closes #31699 from aokolnychyi/spark-34585.

Authored-by: Anton Okolnychyi <aokolnychyi@apple.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2021-03-02 16:58:18 +09:00
Dongjoon Hyun 4818847e87 [SPARK-34578][SQL][TESTS][TEST-MAVEN] Refactor ORC encryption tests and ignore ORC shim loaded by old Hadoop library
### What changes were proposed in this pull request?

1. This PR aims to ignore the ORC encryption tests when the ORC shim is loaded with old Hadoop libraries by some other tests. The test coverage is preserved by Jenkins SBT runs and GitHub Action jobs. This PR only aims to recover Maven Jenkins jobs.
2. In addition, this PR simplifies SBT testing by refactoring the test config into `SparkBuild.scala`/`pom.xml` and removing `DedicatedJVMTest`. This will remove one GitHub Action job which was recently added for the `DedicatedJVMTest` tag.

### Why are the changes needed?

Currently, Maven test fails when it runs in a batch mode because `HadoopShimsPre2_3$NullKeyProvider` is loaded.

**MVN COMMAND**
```
$ mvn test -pl sql/core --am -Dtest=none -DwildcardSuites=org.apache.spark.sql.execution.datasources.orc.OrcV1QuerySuite,org.apache.spark.sql.execution.datasources.orc.OrcEncryptionSuite
```

**BEFORE**
```
- Write and read an encrypted table *** FAILED ***
...
  Cause: org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 1.0 failed 1 times, most recent failure: Lost task 0.0 in stage 1.0 (TID 1) (localhost executor driver): java.lang.IllegalArgumentException: Unknown key pii
	at org.apache.orc.impl.HadoopShimsPre2_3$NullKeyProvider.getCurrentKeyVersion(HadoopShimsPre2_3.java:71)
	at org.apache.orc.impl.WriterImpl.getKey(WriterImpl.java:871)
```

**AFTER**
```
OrcV1QuerySuite
...
OrcEncryptionSuite:
- Write and read an encrypted file !!! CANCELED !!!
  [] was empty org.apache.orc.impl.HadoopShimsPre2_3$NullKeyProvider1b705f65 doesn't has the test keys. ORC shim is created with old Hadoop libraries (OrcEncryptionSuite.scala:39)
- Write and read an encrypted table !!! CANCELED !!!
  [] was empty org.apache.orc.impl.HadoopShimsPre2_3$NullKeyProvider22adeee1 doesn't has the test keys. ORC shim is created with old Hadoop libraries (OrcEncryptionSuite.scala:67)
```

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Pass the Jenkins Maven tests.

For SBT command,
- the test suite required a dedicated JVM (Before)
- the test suite doesn't require a dedicated JVM (After)
```
$ build/sbt "sql/testOnly *.OrcV1QuerySuite *.OrcEncryptionSuite"
...
[info] OrcV1QuerySuite
...
[info] - SPARK-20728 Make ORCFileFormat configurable between sql/hive and sql/core (26 milliseconds)
[info] OrcEncryptionSuite:
[info] - Write and read an encrypted file (431 milliseconds)
[info] - Write and read an encrypted table (359 milliseconds)
[info] All tests passed.
[info] Passed: Total 35, Failed 0, Errors 0, Passed 35
```

Closes #31697 from dongjoon-hyun/SPARK-34578-TEST.

Lead-authored-by: Dongjoon Hyun <dhyun@apple.com>
Co-authored-by: Dongjoon Hyun <dongjoon@apache.org>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2021-03-02 16:52:27 +09:00
Chao Sun ce13dcc689 [SPARK-32703][SQL] Replace deprecated API calls from SpecificParquetRecordReaderBase
### What changes were proposed in this pull request?

Currently in `SpecificParquetRecordReaderBase` we use deprecated Parquet APIs in a few places, such as `readFooter`, `ParquetInputSplit`, `new ParquetFileReader`, `filterRowGroups`, etc. This replaces them with the newer APIs. Specifically, this:
- Replaces `ParquetInputSplit` with `FileSplit`. We never use specific things in the former such as `rowGroupOffsets` so the swap is pretty simple.
- Removes `readFooter` calls by using `ParquetFileReader.open`
- Replace deprecated `ParquetFileReader` ctor with the newer API which takes `ParquetReadOptions`.
- Removes the unnecessary handling of case when `rowGroupOffsets` is not null. It seems this never happens.

### Why are the changes needed?

The aforementioned APIs were deprecated and is going to be removed at some point in future. This is to ensure better supportability.

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

This is a cleanup and relies on existing tests on the relevant code paths.

Closes #31667 from sunchao/SPARK-32703.

Lead-authored-by: Chao Sun <sunchao@apache.org>
Co-authored-by: Chao Sun <sunchao@apple.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2021-03-02 16:51:41 +09:00
Richard Penney 7d0743b493 [SPARK-33678][SQL] Product aggregation function
### Why is this change being proposed?
This patch adds support for a new "product" aggregation function in `sql.functions`, which multiplies together all values in an aggregation group.

This is likely to be useful in statistical applications which involve combining probabilities, or financial applications that involve combining cumulative interest rates, but is also a versatile mathematical operation of similar status to `sum` or `stddev`. Other users [have noted](https://stackoverflow.com/questions/52991640/cumulative-product-in-spark) the absence of such a function in current releases of Spark.

This function is both much more concise than an expression of the form `exp(sum(log(...)))`, and avoids awkward edge-cases associated with some values being zero or negative, as well as being less computationally costly.

### Does this PR introduce _any_ user-facing change?
No - only adds new function.

### How was this patch tested?
Built-in tests have been added for the new `catalyst.expressions.aggregate.Product` class and its invocation via the (scala) `sql.functions.product` function. The latter, and the PySpark wrapper have also been manually tested in spark-shell and pyspark sessions. The SparkR wrapper is currently untested, and may need separate validation (I'm not an "R" user myself).

An illustration of the new functionality, within PySpark is as follows:
```
import pyspark.sql.functions as pf, pyspark.sql.window as pw

df = sqlContext.range(1, 17).toDF("x")
win = pw.Window.partitionBy(pf.lit(1)).orderBy(pf.col("x"))

df.withColumn("factorial", pf.product("x").over(win)).show(20, False)
+---+---------------+
|x  |factorial      |
+---+---------------+
|1  |1.0            |
|2  |2.0            |
|3  |6.0            |
|4  |24.0           |
|5  |120.0          |
|6  |720.0          |
|7  |5040.0         |
|8  |40320.0        |
|9  |362880.0       |
|10 |3628800.0      |
|11 |3.99168E7      |
|12 |4.790016E8     |
|13 |6.2270208E9    |
|14 |8.71782912E10  |
|15 |1.307674368E12 |
|16 |2.0922789888E13|
+---+---------------+
```

Closes #30745 from rwpenney/feature/agg-product.

Lead-authored-by: Richard Penney <rwp@rwpenney.uk>
Co-authored-by: Richard Penney <rwpenney@users.noreply.github.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2021-03-02 16:51:07 +09:00
Gabriele Nizzoli b13a4b85d4 [SPARK-34573][SQL] Avoid global locking in SQLConf object for sqlConfEntries map
### What changes were proposed in this pull request?
In the `SQLConf` object, the `sqlConfEntries` map is globally synchronized (it is a Java `Collections.synchronizedMap`): any operation, including a get, will need to acquire the lock.

An example of this is calling the `DatatType.sameType` method. This will trigger a check on `SQLConf.get.caseSensitiveAnalysis`. So every time we compare two datatypes with sameType, we hit a lock.

To avoid having multiple tasks locking on this, a better approach is to use a map that does not lock on read, like a `ConcurrentHashMap`. That implementation performs lock-free reads and only locks a portion of the map on writes; contention only happens on writes to the same key.
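A hedged illustration of the difference (illustrative only, not the `SQLConf` code):

```scala
import java.util.{Collections, HashMap => JHashMap}
import java.util.concurrent.ConcurrentHashMap

// Every access to a synchronizedMap, including get(), takes a single global lock.
val locked = Collections.synchronizedMap(new JHashMap[String, String]())

// A ConcurrentHashMap performs lock-free reads and only locks a portion of the
// map on writes, so concurrent tasks reading config entries do not contend.
val lockFree = new ConcurrentHashMap[String, String]()
lockFree.put("spark.sql.caseSensitive", "false")
val value = lockFree.get("spark.sql.caseSensitive")
```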

### Why are the changes needed?
Multiple tasks performing any operation that directly or indirectly trigger a query to the `SQLConf.sqlConfEntries` map, will require acquiring a global lock on that map. Something as easy as calling `DataType.sameType(...)` would be locking on the global `sqlConfEntries` lock of the `Collections.synchronizedMap`.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
No functionality change. Existing unit tests run normally.

Closes #31689 from gabrielenizzoli/SPARK-34573.

Authored-by: Gabriele Nizzoli <1545350+gabrielenizzoli@users.noreply.github.com>
Signed-off-by: Takeshi Yamamuro <yamamuro@apache.org>
2021-03-02 15:36:51 +09:00
Max Gekk 70f6267de6 [SPARK-34560][SQL] Generate unique output attributes in the SHOW TABLES logical node
### What changes were proposed in this pull request?
In the PR, I propose to generate unique attributes in the logical nodes of the `SHOW TABLES` command.

Also, this PR fixes similar issues in other logical nodes:
- ShowTableExtended
- ShowViews
- ShowTableProperties
- ShowFunctions
- ShowColumns
- ShowPartitions
- ShowNamespaces

### Why are the changes needed?
This fixes the issue which is demonstrated by the example below:
```scala
scala> val show1 = sql("SHOW TABLES IN ns1")
show1: org.apache.spark.sql.DataFrame = [namespace: string, tableName: string ... 1 more field]

scala> val show2 = sql("SHOW TABLES IN ns2")
show2: org.apache.spark.sql.DataFrame = [namespace: string, tableName: string ... 1 more field]

scala> show1.show
+---------+---------+-----------+
|namespace|tableName|isTemporary|
+---------+---------+-----------+
|      ns1|     tbl1|      false|
+---------+---------+-----------+

scala> show2.show
+---------+---------+-----------+
|namespace|tableName|isTemporary|
+---------+---------+-----------+
|      ns2|     tbl2|      false|
+---------+---------+-----------+

scala> show1.join(show2).where(show1("tableName") =!= show2("tableName")).show
org.apache.spark.sql.AnalysisException: Column tableName#17 are ambiguous. It's probably because you joined several Datasets together, and some of these Datasets are the same. This column points to one of the Datasets but Spark is unable to figure out which one. Please alias the Datasets with different names via `Dataset.as` before joining them, and specify the column using qualified name, e.g. `df.as("a").join(df.as("b"), $"a.id" > $"b.id")`. You can also set spark.sql.analyzer.failAmbiguousSelfJoin to false to disable this check.
  at org.apache.spark.sql.execution.analysis.DetectAmbiguousSelfJoin$.apply(DetectAmbiguousSelfJoin.scala:157)
```

### Does this PR introduce _any_ user-facing change?
Yes. After the changes, the example above works as expected:
```scala
scala> show1.join(show2).where(show1("tableName") =!= show2("tableName")).show
+---------+---------+-----------+---------+---------+-----------+
|namespace|tableName|isTemporary|namespace|tableName|isTemporary|
+---------+---------+-----------+---------+---------+-----------+
|      ns1|     tbl1|      false|      ns2|     tbl2|      false|
+---------+---------+-----------+---------+---------+-----------+
```

### How was this patch tested?
By running the new test:
```
$  build/sbt -Phive-2.3 -Phive-thriftserver "test:testOnly *ShowTablesSuite"
```

Closes #31675 from MaxGekk/fix-output-attrs.

Authored-by: Max Gekk <max.gekk@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2021-03-01 18:32:32 +00:00
Max Gekk 984ff396a2 [SPARK-34561][SQL] Fix drop/add columns from/to a dataset of v2 DESCRIBE TABLE
### What changes were proposed in this pull request?
In the PR, I propose to generate "stable" output attributes per logical node of the `DESCRIBE TABLE` command.

### Why are the changes needed?
This fixes the issue demonstrated by the example:
```scala
val tbl = "testcat.ns1.ns2.tbl"
sql(s"CREATE TABLE $tbl (c0 INT) USING _")
val description = sql(s"DESCRIBE TABLE $tbl")
description.drop("comment")
```
The `drop()` method fails with the error:
```
org.apache.spark.sql.AnalysisException: Resolved attribute(s) col_name#102,data_type#103 missing from col_name#29,data_type#30,comment#31 in operator !Project [col_name#102, data_type#103]. Attribute(s) with the same name appear in the operation: col_name,data_type. Please check if the right attribute(s) are used.;
!Project [col_name#102, data_type#103]
+- LocalRelation [col_name#29, data_type#30, comment#31]

	at org.apache.spark.sql.catalyst.analysis.CheckAnalysis.failAnalysis(CheckAnalysis.scala:51)
	at org.apache.spark.sql.catalyst.analysis.CheckAnalysis.failAnalysis$(CheckAnalysis.scala:50)
```

### Does this PR introduce _any_ user-facing change?
Yes. After the changes, `drop()`/`add()` work as expected:
```scala
description.drop("comment").show()
+---------------+---------+
|       col_name|data_type|
+---------------+---------+
|             c0|      int|
|               |         |
| # Partitioning|         |
|Not partitioned|         |
+---------------+---------+
```

### How was this patch tested?
1. Run new test:
```
$ build/sbt -Phive-2.3 -Phive-thriftserver "test:testOnly *DataSourceV2SQLSuite"
```
2. Run existing test suite:
```
$ build/sbt -Phive-2.3 -Phive-thriftserver "test:testOnly *CatalogedDDLSuite"
```

Closes #31676 from MaxGekk/describe-table-drop-column.

Authored-by: Max Gekk <max.gekk@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2021-03-01 22:20:28 +08:00
Shixiong Zhu 62737e140c [SPARK-34556][SQL] Checking duplicate static partition columns should respect case sensitive conf
### What changes were proposed in this pull request?

This PR makes partition spec parsing respect case sensitive conf.

### Why are the changes needed?

When parsing the partition spec, Spark calls `org.apache.spark.sql.catalyst.parser.ParserUtils.checkDuplicateKeys` to check whether there are duplicate partition column names in the list. But this method is always case sensitive and doesn't detect duplicate partition column names that differ only in case.
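
A minimal sketch of the intended check (the helper name is made up; the real fix lives in the partition spec parsing):

```scala
// Detect duplicate partition column names, honoring the case-sensitivity setting.
def findDuplicateKeys(keys: Seq[String], caseSensitive: Boolean): Seq[String] = {
  val normalized = if (caseSensitive) keys else keys.map(_.toLowerCase)
  normalized.groupBy(identity).collect { case (k, vs) if vs.size > 1 => k }.toSeq
}

// findDuplicateKeys(Seq("c", "C"), caseSensitive = false) => Seq("c")
// findDuplicateKeys(Seq("c", "C"), caseSensitive = true)  => Seq()
```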

### Does this PR introduce _any_ user-facing change?

Yes. This prevents users from writing incorrect queries such as `INSERT OVERWRITE t PARTITION (c='2', C='3') VALUES (1)` when the case sensitive conf is disabled.

### How was this patch tested?

The new added test will fail without this change.

Closes #31669 from zsxwing/SPARK-34556.

Authored-by: Shixiong Zhu <zsxwing@gmail.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2021-03-01 13:55:35 +09:00
Kent Yao 1afe284ed8 [SPARK-34570][SQL] Remove dead code from constructors of [Hive]SessionStateBuilder
### What changes were proposed in this pull request?

The parameter `options` is never used. The change here was part of https://github.com/apache/spark/pull/30642; it was reverted by dad24543aa as a hotfix to ease backporting #30642. This PR brings it back to master.

### Why are the changes needed?

Remove useless dead code.

### Does this PR introduce _any_ user-facing change?

no

### How was this patch tested?

Passing CI is enough.

Closes #31683 from yaooqinn/SPARK-34570.

Authored-by: Kent Yao <yao@apache.org>
Signed-off-by: Takeshi Yamamuro <yamamuro@apache.org>
2021-03-01 09:30:18 +09:00
Angerszhuuuu d574308864 [SPARK-34579][SQL][TEST] Fix wrong UT in SQLQuerySuite
### What changes were proposed in this pull request?
Some UTs in SQLQuerySuite are incorrect: they use the wrong table name in `withTable`. This PR corrects them.

### Why are the changes needed?
Fix UT

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
Existing UTs

Closes #31681 from AngersZhuuuu/SPARK-34569.

Authored-by: Angerszhuuuu <angers.zhu@gmail.com>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
2021-02-28 16:21:42 -08:00
Shardul Mahadik 0216051aca [SPARK-34506][CORE] ADD JAR with ivy coordinates should be compatible with Hive transitive behavior
### What changes were proposed in this pull request?
SPARK-33084 added the ability to use ivy coordinates with `SparkContext.addJar`. PR #29966 claims to mimic Hive behavior, although I found a few cases where it doesn't:

1) The default value of the transitive parameter is false, both when the parameter is not specified in the coordinate and when its value is invalid. The Hive behavior is that transitive is [true if not specified](cb2ac3dcc6/ql/src/java/org/apache/hadoop/hive/ql/util/DependencyResolver.java (L169)) in the coordinate and [false for invalid values](cb2ac3dcc6/ql/src/java/org/apache/hadoop/hive/ql/util/DependencyResolver.java (L124)). Also, regardless of Hive, I think a default of true for the transitive parameter also matches [ivy's own defaults](https://ant.apache.org/ivy/history/2.5.0/ivyfile/dependency.html#_attributes).

2) The parameter value for the transitive parameter is regarded as case-sensitive [based on the understanding](https://github.com/apache/spark/pull/29966#discussion_r547752259) that Hive behavior is case-sensitive. However, this is not correct: Hive [treats the parameter value case-insensitively](cb2ac3dcc6/ql/src/java/org/apache/hadoop/hive/ql/util/DependencyResolver.java (L122)).

I propose that we be compatible with Hive for these behaviors.
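
A minimal sketch of the proposed parameter handling (illustrative, not the exact Spark code): default `transitive` to true when it is absent, and parse the value case-insensitively so that only an explicit non-`true` value disables it.

```scala
// `coordinateParams` stands for the parsed "?key=value&..." part of an ivy coordinate.
def isTransitive(coordinateParams: Map[String, String]): Boolean =
  coordinateParams.get("transitive") match {
    case None        => true                           // not specified: transitive, like Hive and ivy
    case Some(value) => value.equalsIgnoreCase("true")  // "TRUE"/"True" accepted; invalid values => false
  }
```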

### Why are the changes needed?
To make `ADD JAR` with ivy coordinates compatible with Hive's transitive behavior

### Does this PR introduce _any_ user-facing change?

The user-facing changes here are within master as the feature introduced in SPARK-33084 has not been released yet
1. Previously, an ivy coordinate without the `transitive` parameter specified did not resolve transitive dependencies; now it does.
2. Previously, a `transitive` parameter value was treated case-sensitively, e.g. `transitive=TRUE` would be treated as false because it did not exactly match `true`. Now it is treated case-insensitively.

### How was this patch tested?

Modified existing unit tests to test new behavior
Add new unit test to cover usage of `exclude` with unspecified `transitive`

Closes #31623 from shardulm94/spark-34506.

Authored-by: Shardul Mahadik <smahadik@linkedin.com>
Signed-off-by: Takeshi Yamamuro <yamamuro@apache.org>
2021-03-01 09:10:20 +09:00
Yuming Wang d07fc3076b [SPARK-33687][SQL] Support analyze all tables in a specific database
### What changes were proposed in this pull request?

This PR adds support for analyzing all tables in a specific database:
```g4
 ANALYZE TABLES ((FROM | IN) multipartIdentifier)? COMPUTE STATISTICS (identifier)?
```
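
A quick usage sketch, assuming a `spark` session and a made-up `sales` database (the trailing identifier in the grammar corresponds to the usual NOSCAN option):

```scala
// Collect statistics for every table in the `sales` database.
spark.sql("ANALYZE TABLES IN sales COMPUTE STATISTICS")

// Size-only statistics, without scanning the rows.
spark.sql("ANALYZE TABLES IN sales COMPUTE STATISTICS NOSCAN")
```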

### Why are the changes needed?

1. Make it easy to analyze all tables in a specific database.
2. PostgreSQL has a similar implementation: https://www.postgresql.org/docs/12/sql-analyze.html.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

The feature was tested by unit tests.
The documentation was tested by regenerating the documentation:

menu-sql.yaml |  sql-ref-syntax-aux-analyze-tables.md
-- | --
![image](https://user-images.githubusercontent.com/5399861/109098769-dc33a200-775c-11eb-86b1-55531e5425e0.png) | ![image](https://user-images.githubusercontent.com/5399861/109098841-02594200-775d-11eb-8588-de8da97ec94a.png)

Closes #30648 from wangyum/SPARK-33687.

Authored-by: Yuming Wang <yumwang@ebay.com>
Signed-off-by: Takeshi Yamamuro <yamamuro@apache.org>
2021-03-01 09:06:47 +09:00
Yuming Wang 54c053afb0 [SPARK-34479][SQL] Add zstandard codec to Avro compression codec list
### What changes were proposed in this pull request?

Avro has supported the zstandard codec since AVRO-2195. This PR adds zstandard to the Avro compression codec list.
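
A usage sketch, assuming a `spark` session, the spark-avro module on the classpath, and a made-up output path:

```scala
// Select zstandard for Avro output via the existing codec config.
spark.conf.set("spark.sql.avro.compression.codec", "zstandard")

spark.range(10).write.format("avro").save("/tmp/zstd-avro")
```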

### Why are the changes needed?

To make the Avro data source support the zstandard codec.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Unit test.

Closes #31673 from wangyum/SPARK-34479.

Authored-by: Yuming Wang <yumwang@ebay.com>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
2021-02-27 10:31:42 -08:00
Ruifeng Zheng 05069ff4ce [SPARK-34353][SQL] CollectLimitExec avoid shuffle if input rdd has 0/1 partition
### What changes were proposed in this pull request?
If the child RDD has zero or one partition, skip the shuffle.
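
A minimal sketch of the condition (illustrative, not the exact planner code):

```scala
import org.apache.spark.rdd.RDD

// A single-partition shuffle buys nothing when the child already has
// zero or one partition, so the shuffle can be skipped in that case.
def needsShuffleForLimit(child: RDD[_]): Boolean =
  child.getNumPartitions > 1
```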

### Why are the changes needed?
Skip the shuffle when possible.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
Existing test suites.

Closes #31468 from zhengruifeng/collect_limit_single_partition.

Authored-by: Ruifeng Zheng <ruifengz@foxmail.com>
Signed-off-by: Takeshi Yamamuro <yamamuro@apache.org>
2021-02-27 16:48:20 +09:00
ShiKai Wang 56e664c717 [SPARK-34392][SQL] Support ZoneOffset +h:mm in DateTimeUtils.getZoneId
### What changes were proposed in this pull request?
To support the `+8:00` offset format in Spark 3 when executing SQL such as
`select to_utc_timestamp("2020-02-07 16:00:00", "GMT+8:00")`

### Why are the changes needed?
The `+8:00` format is supported in PostgreSQL, Hive, and Presto, but not in Spark 3.
https://issues.apache.org/jira/browse/SPARK-34392
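
A minimal sketch of the normalization (illustrative; the actual change is in `DateTimeUtils.getZoneId`): `java.time` rejects a single-digit hour followed by minutes, so the offset can be padded before parsing.

```scala
import java.time.ZoneId

// ZoneId.of rejects "GMT+8:00" but accepts "GMT+08:00", so pad a
// single-digit hour that is followed by minutes.
def parseZone(id: String): ZoneId = {
  val padded = "(?<=[+-])(\\d)(?=:)".r.replaceFirstIn(id, "0$1")
  ZoneId.of(padded)
}

// parseZone("GMT+8:00") == ZoneId.of("GMT+08:00")
```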

### Does this PR introduce _any_ user-facing change?
no

### How was this patch tested?
unit test

Closes #31624 from Karl-WangSK/zone.

Lead-authored-by: ShiKai Wang <wskqing@gmail.com>
Co-authored-by: Karl-WangSK <shikai.wang@linkflowtech.com>
Signed-off-by: Sean Owen <srowen@gmail.com>
2021-02-26 11:03:20 -06:00
tanel.kiis@gmail.com 67ec4f7f67 [SPARK-33971][SQL] Eliminate distinct from more aggregates
### What changes were proposed in this pull request?

Add more aggregate expressions to `EliminateDistinct` rule.

### Why are the changes needed?

Distinct aggregation can add a significant overhead. It's better to remove distinct whenever possible.
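
For example (hypothetical table and column), DISTINCT is a no-op under MAX, so the extra de-duplication step can be dropped:

```scala
// MAX over distinct values equals MAX over all values.
spark.sql("SELECT max(DISTINCT amount) FROM sales")
// is optimized to the equivalent of:
spark.sql("SELECT max(amount) FROM sales")
```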

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

UT

Closes #30999 from tanelk/SPARK-33971_eliminate_distinct.

Authored-by: tanel.kiis@gmail.com <tanel.kiis@gmail.com>
Signed-off-by: Takeshi Yamamuro <yamamuro@apache.org>
2021-02-26 21:59:02 +09:00
Max Gekk c1beb16cc8 [SPARK-34554][SQL] Implement the copy() method in ColumnarMap
### What changes were proposed in this pull request?
Implement `ColumnarMap.copy()` by using the `copy()` method of `ColumnarArray`.

### Why are the changes needed?
To eliminate `java.lang.UnsupportedOperationException` while using `ColumnarMap`.

### Does this PR introduce _any_ user-facing change?
Yes

### How was this patch tested?
By running new tests in `ColumnarBatchSuite`.

Closes #31663 from MaxGekk/columnar-map-copy.

Authored-by: Max Gekk <max.gekk@gmail.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2021-02-26 21:33:14 +09:00
ulysses-you 82267acfe8 [SPARK-34550][SQL] Skip InSet null value during push filter to Hive metastore
### What changes were proposed in this pull request?

Skip `InSet` null value during push filter to Hive metastore.

### Why are the changes needed?

If `InSet` contains a null value, we should skip it and push the other values to the metastore, to keep the same behavior as `In`.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Add test.

Closes #31659 from ulysses-you/SPARK-34550.

Authored-by: ulysses-you <ulyssesyou18@gmail.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2021-02-26 21:29:14 +09:00
Cheng Su 7d5021f5ee [SPARK-34533][SQL] Eliminate LEFT ANTI join to empty relation in AQE
### What changes were proposed in this pull request?

I discovered from the review discussion https://github.com/apache/spark/pull/31630#discussion_r581774000 that we can eliminate a LEFT ANTI join (with no join condition) to an empty relation if the right side is known to be non-empty. With AQE, this is doable similar to https://github.com/apache/spark/pull/29484.
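
A small illustration of the semantics (hypothetical data, assuming a spark-shell session):

```scala
import spark.implicits._

Seq(1, 2, 3).toDF("id").createOrReplaceTempView("left_t")
Seq("a").toDF("v").createOrReplaceTempView("right_t")

// With no join condition, every left row "matches" as soon as the right
// side has at least one row, so LEFT ANTI returns zero rows.
spark.sql("SELECT * FROM left_t LEFT ANTI JOIN right_t").count()  // 0

// With AQE, once the right side is observed to be non-empty at runtime,
// the join itself can be replaced by an empty relation.
```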

### Why are the changes needed?

This can help eliminate the join operator during logical plan optimization.
Before this PR, [the left side's physical plan `execute()` would be called](https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/joins/BroadcastNestedLoopJoinExec.scala#L192), so if the left side is complicated (e.g. contains a broadcast exchange operator), some computation would happen. After this PR, the join operator is removed during logical plan optimization and nothing is computed on the left side. Potentially this can save resources for these kinds of queries.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Added unit tests for positive and negative queries in `AdaptiveQueryExecSuite.scala`.

Closes #31641 from c21/left-anti-aqe.

Authored-by: Cheng Su <chengsu@fb.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2021-02-26 11:46:27 +00:00
Wenchen Fan 73857cdd87 [SPARK-34524][SQL] Simplify v2 partition commands resolution
### What changes were proposed in this pull request?

This PR simplifies the resolution of v2 partition commands:
1. Add a common trait for v2 partition commands, so that we don't need to match them one by one in the rules.
2. Make partition spec an expression, so that it's easier to resolve them via tree node transformation.
3. Add `TruncatePartition` so that `TruncateTable` doesn't need to be a v2 partition command.
4. Simplify `CheckAnalysis` to only check if the table is partitioned. For partitioned tables, the partition spec is always resolved, so we don't need to check it. The `SupportsAtomicPartitionManagement` check is also done at runtime. Since Spark eagerly executes commands, a runtime exception will still be thrown at analysis time.

### Why are the changes needed?

code cleanup

### Does this PR introduce _any_ user-facing change?

no

### How was this patch tested?

existing tests

Closes #31637 from cloud-fan/simplify.

Authored-by: Wenchen Fan <wenchen@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2021-02-26 11:44:42 +00:00
Max Gekk 5c7d019b60 [SPARK-34543][SQL] Respect the spark.sql.caseSensitive config while resolving partition spec in v1 SET LOCATION
### What changes were proposed in this pull request?
Preprocess the partition spec passed to the v1 `ALTER TABLE .. SET LOCATION` implementation `AlterTableSetLocationCommand`, and normalize the passed spec according to the partition columns w.r.t. the case sensitivity flag **spark.sql.caseSensitive**.

### Why are the changes needed?
The v1 `ALTER TABLE .. SET LOCATION` is in fact case sensitive and doesn't respect the SQL config **spark.sql.caseSensitive**, which is false by default. For instance:
```sql
spark-sql> CREATE TABLE tbl (id INT, part INT) PARTITIONED BY (part);
spark-sql> INSERT INTO tbl PARTITION (part=0) SELECT 0;
spark-sql> SHOW TABLE EXTENDED LIKE 'tbl' PARTITION (part=0);
Location: file:/Users/maximgekk/proj/set-location-case-sense/spark-warehouse/tbl/part=0
spark-sql> ALTER TABLE tbl ADD PARTITION (part=1);
spark-sql> SELECT * FROM tbl;
0	0
```
Create new partition folder in the file system:
```
$ cp -r /Users/maximgekk/proj/set-location-case-sense/spark-warehouse/tbl/part=0 /Users/maximgekk/proj/set-location-case-sense/spark-warehouse/tbl/aaa
```
Set new location for the partition part=1:
```sql
spark-sql> ALTER TABLE tbl PARTITION (part=1) SET LOCATION '/Users/maximgekk/proj/set-location-case-sense/spark-warehouse/tbl/aaa';
spark-sql> SELECT * FROM tbl;
0	0
0	1
spark-sql> ALTER TABLE tbl ADD PARTITION (PART=2);
spark-sql> SELECT * FROM tbl;
0	0
0	1
```
Set location for a partition in the upper case:
```
$ cp -r /Users/maximgekk/proj/set-location-case-sense/spark-warehouse/tbl/part=0 /Users/maximgekk/proj/set-location-case-sense/spark-warehouse/tbl/bbb
```
```sql
spark-sql> ALTER TABLE tbl PARTITION (PART=2) SET LOCATION '/Users/maximgekk/proj/set-location-case-sense/spark-warehouse/tbl/bbb';
Error in query: Partition spec is invalid. The spec (PART) must match the partition spec (part) defined in table '`default`.`tbl`'
```

### Does this PR introduce _any_ user-facing change?
Yes. After the changes, the command above works as expected:
```sql
spark-sql> ALTER TABLE tbl PARTITION (PART=2) SET LOCATION '/Users/maximgekk/proj/set-location-case-sense/spark-warehouse/tbl/bbb';
spark-sql> SELECT * FROM tbl;
0	0
0	1
0	2
```

### How was this patch tested?
By running the modified test suite:
```
$ build/sbt -Phive-2.3 -Phive-thriftserver "test:testOnly *CatalogedDDLSuite"
```

Closes #31651 from MaxGekk/set-location-case-sense.

Authored-by: Max Gekk <max.gekk@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2021-02-26 15:28:57 +08:00
yangjie01 0d3a9cd3c9 [SPARK-34535][SQL] Cleanup unused symbol in Orc related code
### What changes were proposed in this pull request?
Cleanup unused symbol in Orc related code as follows:

- `OrcDeserializer`: parameter `dataSchema` in constructor
- `OrcFilters`: parameter `schema` in method `convertibleFilters`
- `OrcPartitionReaderFactory`: ignore return value of `OrcUtils.orcResultSchemaString` in method `buildReader(file: PartitionedFile)`

### Why are the changes needed?
Cleanup code.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
Pass the Jenkins or GitHub Action

Closes #31644 from LuciferYang/cleanup-orc-unused-symbol.

Authored-by: yangjie01 <yangjie01@baidu.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2021-02-26 09:20:40 +09:00
Wenchen Fan dffb01f28a [SPARK-34152][SQL][FOLLOWUP] Do not uncache the temp view if it doesn't exist
### What changes were proposed in this pull request?

This PR fixes a mistake in https://github.com/apache/spark/pull/31273. When running CREATE OR REPLACE on a temp view, we need to uncache the to-be-replaced existing temp view. However, we shouldn't uncache if there is no existing temp view.

This doesn't cause real issues because the uncache action is failure-safe. But it produces a lot of warning messages.

### Why are the changes needed?

Avoid unnecessary warning logs.

### Does this PR introduce _any_ user-facing change?

no

### How was this patch tested?

manually run tests and check the warning messages.

Closes #31650 from cloud-fan/warnning.

Authored-by: Wenchen Fan <wenchen@databricks.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2021-02-25 15:25:41 -08:00
Liang-Chi Hsieh f7ac2d655c [SPARK-34474][SQL] Remove unnecessary Union under Distinct/Deduplicate
### What changes were proposed in this pull request?

This patch proposes to let the optimizer remove an unnecessary `Union` under `Distinct`/`Deduplicate`.

### Why are the changes needed?

For a `Union` under `Distinct`/`Deduplicate`, if its children are all the same, we can keep just one of them and remove the `Union`, as illustrated below.
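
A tiny illustration (hypothetical table `t`, assuming a `spark` session): since `Distinct` de-duplicates anyway, a `Union` of identical children adds nothing.

```scala
// Before: Distinct(Union(t, t)); after the rule: Distinct(t).
spark.sql("SELECT DISTINCT * FROM (SELECT * FROM t UNION ALL SELECT * FROM t)")
// produces the same result (and, after optimization, the same plan) as:
spark.sql("SELECT DISTINCT * FROM t")
```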

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

Unit tests.

Closes #31595 from viirya/remove-union.

Authored-by: Liang-Chi Hsieh <viirya@gmail.com>
Signed-off-by: Liang-Chi Hsieh <viirya@gmail.com>
2021-02-25 12:41:07 -08:00
Yuming Wang 4a3200b08a [SPARK-34436][SQL] DPP support LIKE ANY/ALL expression
### What changes were proposed in this pull request?

This PR makes DPP support LIKE ANY/ALL expressions:
```sql
SELECT date_id, product_id FROM fact_sk f
JOIN dim_store s
ON f.store_id = s.store_id WHERE s.country LIKE ANY ('%D%E%', '%A%B%')
```

### Why are the changes needed?

Improve query performance.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Unit test.

Closes #31563 from wangyum/SPARK-34436.

Lead-authored-by: Yuming Wang <yumwang@apache.org>
Co-authored-by: Yuming Wang <yumwang@ebay.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2021-02-25 18:07:39 +08:00
Max Gekk c56af69cdf [SPARK-34518][SQL] Rename AlterTableRecoverPartitionsCommand to RepairTableCommand
### What changes were proposed in this pull request?
Rename the execution node `AlterTableRecoverPartitionsCommand` for the commands:
- `MSCK REPAIR TABLE table [{ADD|DROP|SYNC} PARTITIONS]`
- `ALTER TABLE table RECOVER PARTITIONS`

to `RepairTableCommand`.

### Why are the changes needed?
1. After the PR https://github.com/apache/spark/pull/31499, `ALTER TABLE table RECOVER PARTITIONS` is equal to `MSCK REPAIR TABLE table ADD PARTITIONS`. And mapping of the generic command `MSCK REPAIR TABLE` to the more specific execution node `AlterTableRecoverPartitionsCommand` can confuse devs in the future.
2. `ALTER TABLE table RECOVER PARTITIONS` does not support any options/extensions. So, additional parameters `enableAddPartitions` and `enableDropPartitions` in `AlterTableRecoverPartitionsCommand` confuse as well.

### Does this PR introduce _any_ user-facing change?
No because this is internal API.

### How was this patch tested?
By running the existing test suites:
```
$ build/sbt -Phive-2.3 -Phive-thriftserver "test:testOnly *AlterTableRecoverPartitionsSuite"
$ build/sbt "test:testOnly *AlterTableRecoverPartitionsParserSuite"
$ build/sbt -Phive-2.3 -Phive-thriftserver "test:testOnly *MsckRepairTableSuite"
$ build/sbt "test:testOnly *MsckRepairTableParserSuite"
```

Closes #31635 from MaxGekk/rename-recover-partitions.

Authored-by: Max Gekk <max.gekk@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2021-02-25 09:32:41 +00:00
Gabor Somogyi 44eadb943b [SPARK-34497][SQL] Fix built-in JDBC connection providers to restore JVM security context changes
### What changes were proposed in this pull request?
Some of the built-in JDBC connection providers change the JVM security context to do authentication, which is fine. The problematic part is that executors can be reused by another query. The following situation leads to incorrect behaviour:
* Query1 opens JDBC connection and changes JVM security context in Executor1
* Query2 tries to open JDBC connection but it realizes there is already an entry for that DB type in Executor1
* Query2 is not changing JVM security context and uses Query1 keytab and principal
* Query2 fails with authentication error

In this PR I've changed the code so that the JVM security context is still changed every time, but only temporarily: it is modified while the connection is being built up and then rolled back. Since `getConnection` is synchronised on `SecurityConfigurationLock`, this ends up with correct behaviour without any race.
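
A minimal sketch of the temporary-change pattern using the plain JVM `Configuration` API (illustrative; the actual change is inside the built-in connection providers and is guarded by `SecurityConfigurationLock`):

```scala
import java.sql.{Connection, DriverManager}
import javax.security.auth.login.Configuration

// Swap in the JAAS configuration needed for this connection, open the
// connection, then always restore the previous configuration.
def withTemporarySecurityContext(newConf: Configuration, url: String): Connection = {
  val previous = Configuration.getConfiguration
  Configuration.setConfiguration(newConf)
  try {
    DriverManager.getConnection(url)
  } finally {
    Configuration.setConfiguration(previous)
  }
}
```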

### Why are the changes needed?
Incorrect JVM security context handling.

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
Existing unit + integration tests.

Closes #31622 from gaborgsomogyi/SPARK-34497.

Authored-by: Gabor Somogyi <gabor.g.somogyi@gmail.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2021-02-25 09:25:17 +09:00
ulysses-you 999d3b89b6 [SPARK-34515][SQL] Fix NPE if InSet contains null value during getPartitionsByFilter
### What changes were proposed in this pull request?

Skip null values when rewriting `InSet` to `>= and <=` in `getPartitionsByFilter`.

### Why are the changes needed?

Spark converts `InSet` to `>= and <=` if its number of values exceeds `spark.sql.hive.metastorePartitionPruningInSetThreshold` during partition pruning. In this case, if the values contain a null, we get an exception such as:
```
java.lang.NullPointerException
 at org.apache.spark.unsafe.types.UTF8String.compareTo(UTF8String.java:1389)
 at org.apache.spark.unsafe.types.UTF8String.compareTo(UTF8String.java:50)
 at scala.math.LowPriorityOrderingImplicits$$anon$3.compare(Ordering.scala:153)
 at java.util.TimSort.countRunAndMakeAscending(TimSort.java:355)
 at java.util.TimSort.sort(TimSort.java:220)
 at java.util.Arrays.sort(Arrays.java:1438)
 at scala.collection.SeqLike.sorted(SeqLike.scala:659)
 at scala.collection.SeqLike.sorted$(SeqLike.scala:647)
 at scala.collection.AbstractSeq.sorted(Seq.scala:45)
 at org.apache.spark.sql.hive.client.Shim_v0_13.convert$1(HiveShim.scala:772)
 at org.apache.spark.sql.hive.client.Shim_v0_13.$anonfun$convertFilters$4(HiveShim.scala:826)
 at scala.collection.immutable.Stream.flatMap(Stream.scala:489)
 at org.apache.spark.sql.hive.client.Shim_v0_13.convertFilters(HiveShim.scala:826)
 at org.apache.spark.sql.hive.client.Shim_v0_13.getPartitionsByFilter(HiveShim.scala:848)
 at org.apache.spark.sql.hive.client.HiveClientImpl.$anonfun$getPartitionsByFilter$1(HiveClientImpl.scala:750)
```
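
A minimal sketch of the guard (illustrative; the actual fix is in `HiveShim`'s filter conversion): drop nulls from the value set before sorting and building the `>= and <=` range.

```scala
// Sorting a set that contains null is what triggers the NPE above, so
// filter nulls out first; null never matches a partition value anyway.
def prunableValues(values: Set[Any]): Seq[String] =
  values.filter(_ != null).map(_.toString).toSeq.sorted

// prunableValues(Set("a", null, "c")) == Seq("a", "c")
```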

### Does this PR introduce _any_ user-facing change?

Yes, bug fix.

### How was this patch tested?

Add test.

Closes #31632 from ulysses-you/SPARK-34515.

Authored-by: ulysses-you <ulyssesyou18@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2021-02-24 21:32:19 +08:00
Cheng Su 6ef57d31cd [SPARK-34514][SQL] Push down limit for LEFT SEMI and LEFT ANTI join
### What changes were proposed in this pull request?

I found out during the code review of https://github.com/apache/spark/pull/31567#discussion_r577379572 that we can push down the limit to the left side of LEFT SEMI and LEFT ANTI joins if the join condition is empty.

Why it's safe to push down limit:

The semantics of LEFT SEMI join without condition:
(1). if right side is non-empty, output all rows from left side.
(2). if right side is empty, output nothing.

The semantics of LEFT ANTI join without condition:
(1). if right side is non-empty, output nothing.
(2). if right side is empty, output all rows from left side.

Given these all-or-nothing semantics (output all rows from the left side, or nothing), it's safe to push down the limit to the left side.
NOTE: LEFT SEMI / LEFT ANTI join with non-empty condition is not safe for limit push down, because output can be a portion of left side rows.

Reference: physical operator implementation for LEFT SEMI / LEFT ANTI join without condition - https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/joins/BroadcastNestedLoopJoinExec.scala#L200-L204 .
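
A small illustration with hypothetical tables (assuming a `spark` session): because the join is all-or-nothing on the left side, the limit can also be applied to the left input before the join.

```scala
// Conceptually, for a condition-less LEFT SEMI join:
//   SELECT * FROM big LEFT SEMI JOIN other LIMIT 10
// can be executed as
//   SELECT * FROM (SELECT * FROM big LIMIT 10) LEFT SEMI JOIN other LIMIT 10
// since the join either passes all left rows through or none of them.
spark.sql("SELECT * FROM big LEFT SEMI JOIN other LIMIT 10").explain()
// After this change, a LocalLimit should appear below the join on the left side.
```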

### Why are the changes needed?

Better performance. Save CPU and IO for these joins, as limit being pushed down before join.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Added unit test in `LimitPushdownSuite.scala` and `SQLQuerySuite.scala`.

Closes #31630 from c21/limit-pushdown.

Authored-by: Cheng Su <chengsu@fb.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2021-02-24 10:23:01 +00:00
beliefer 14934f42d0 [SPARK-33599][SQL][FOLLOWUP] Group exception messages in catalyst/analysis
### What changes were proposed in this pull request?
This PR follows up https://github.com/apache/spark/pull/30717
Some contributors may not be aware of this effort and have added exceptions in the old way.

### Why are the changes needed?
It will largely help with standardization of error messages and its maintenance.

### Does this PR introduce _any_ user-facing change?
No. Error messages remain unchanged.

### How was this patch tested?
No new tests - pass all original tests to make sure it doesn't break any existing behavior.

Closes #31316 from beliefer/SPARK-33599-followup.

Lead-authored-by: beliefer <beliefer@163.com>
Co-authored-by: gengjiaan <gengjiaan@360.cn>
Co-authored-by: Jiaan Geng <beliefer@163.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2021-02-24 07:28:44 +00:00
Terry Kim 714ff73d4a [SPARK-34152][SQL] Make CreateViewStatement.child to be LogicalPlan's children so that it's resolved in analyze phase
### What changes were proposed in this pull request?

This PR proposes to make `CreateViewStatement.child` to be `LogicalPlan`'s `children` so that it's resolved in the analyze phase.

### Why are the changes needed?

Currently, the `CreateViewStatement.child` is resolved when the create view command runs, which is inconsistent with other plan resolutions. For example, you may see the following in the physical plan:
```
== Physical Plan ==
Execute CreateViewCommand (1)
   +- CreateViewCommand (2)
         +- Project (4)
            +- UnresolvedRelation (3)
```

### Does this PR introduce _any_ user-facing change?

Yes. For the example, you will now see the resolved plan:
```
== Physical Plan ==
Execute CreateViewCommand (1)
   +- CreateViewCommand (2)
         +- Project (5)
            +- SubqueryAlias (4)
               +- LogicalRelation (3)
```

### How was this patch tested?

Updated existing tests.

Closes #31273 from imback82/spark-34152.

Authored-by: Terry Kim <yuminkim@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2021-02-24 06:50:11 +00:00
Gengliang Wang 5d9cfd727c [SPARK-34246][SQL] New type coercion syntax rules in ANSI mode
### What changes were proposed in this pull request?

In Spark ANSI mode, the type coercion rules are based on the type precedence lists of the input data types.
As per the section "Type precedence list determination" of "ISO/IEC 9075-2:2011
Information technology — Database languages - SQL — Part 2: Foundation (SQL/Foundation)", the type precedence lists of primitive data types are as follows:

- Byte: Byte, Short, Int, Long, Decimal, Float, Double
- Short: Short, Int, Long, Decimal, Float, Double
- Int: Int, Long, Decimal, Float, Double
- Long: Long, Decimal, Float, Double
- Decimal: Any wider Numeric type
- Float: Float, Double
- Double: Double
- String: String
- Date: Date, Timestamp
- Timestamp: Timestamp
- Binary: Binary
- Boolean: Boolean
- Interval: Interval

As for complex data types, Spark will determine the precedence list recursively based on their sub-types.

With the definition of the type precedence list, the general type coercion rules are as follows:
- Data type S is allowed to be implicitly cast to type T iff T is in the precedence list of S
- Comparison is allowed iff the data type precedence lists of both sides have at least one common element. When evaluating the comparison, Spark casts both sides to the tightest common data type of their precedence lists.
- There should be at least one common data type among all the children's precedence lists for the following operators. The data type of the operator is the tightest common precedence data type.
```
 In, Except(odd), Intersect, Greatest, Least, Union, If, CaseWhen, CreateArray, Array Concat, Sequence, MapConcat, CreateMap
```

- For complex types (struct, array, map), Spark recursively looks into the element type and applies the rules above. If the element nullability is converted from true to false, add runtime null check to the elements.

Note: this new type coercion system will still allow implicitly converting String type literals to other primitive types, to avoid breaking too many existing Spark SQL queries. This is a special rule and it is not from the ANSI SQL standard.
### Why are the changes needed?

The current type coercion rules are complex. Also, they are very hard to describe and understand. For details please refer to the attached documentation "Default Type coercion rules of Spark"
[Default Type coercion rules of Spark.pdf](https://github.com/apache/spark/files/5874362/Default.Type.coercion.rules.of.Spark.pdf)

This PR is to create a new and strict type coercion system under ANSI mode. The rules are simple and clean, so that users can follow them easily.

### Does this PR introduce _any_ user-facing change?

Yes,  new implicit cast syntax rules in ANSI mode. All the details are in the first section of this description.

### How was this patch tested?

Unit tests

Closes #31349 from gengliangwang/ansiImplicitConversion.

Authored-by: Gengliang Wang <gengliang.wang@databricks.com>
Signed-off-by: Gengliang Wang <gengliang.wang@databricks.com>
2021-02-24 13:40:58 +08:00
Max Gekk f64fc22466 [SPARK-34290][SQL] Support v2 TRUNCATE TABLE
### What changes were proposed in this pull request?
Implement the v2 execution node for the `TRUNCATE TABLE` command.

### Why are the changes needed?
To have feature parity with DS v1, and support truncation of v2 tables.

### Does this PR introduce _any_ user-facing change?
Yes

### How was this patch tested?
By running the unified tests for v1 and v2 tables:
```
$ build/sbt -Phive -Phive-thriftserver "test:testOnly *TruncateTableSuite"
```

Closes #31605 from MaxGekk/truncate-table-v2.

Authored-by: Max Gekk <max.gekk@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2021-02-24 05:21:11 +00:00
HyukjinKwon 80bad086c8 Revert "[SPARK-32703][SQL] Replace deprecated API calls from SpecificParquetRecordReaderBase"
This reverts commit 27873280ff.
2021-02-24 11:36:54 +09:00
Yuming Wang b5afff59fa [SPARK-26138][SQL] Pushdown limit through InnerLike when condition is empty
### What changes were proposed in this pull request?

This PR pushes down the limit through InnerLike joins when the join condition is empty (original PR: #23104). For example:
```sql
CREATE TABLE t1 using parquet AS SELECT id AS a, id AS b FROM range(2);
CREATE TABLE t2 using parquet AS SELECT id AS d FROM range(2);
SELECT * FROM t1 CROSS JOIN t2 LIMIT 10;
```
Before this pr:
```
== Physical Plan ==
AdaptiveSparkPlan isFinalPlan=false
+- CollectLimit 10
   +- BroadcastNestedLoopJoin BuildRight, Cross
      :- FileScan parquet default.t1[a#5L,b#6L] Batched: true, DataFilters: [], Format: Parquet, Location: InMemoryFileIndex(1 paths)[file:/private/var/folders/tg/f5mz46090wg7swzgdc69f8q03965_0/T/warehous..., PartitionFilters: [], PushedFilters: [], ReadSchema: struct<a:bigint,b:bigint>
      +- BroadcastExchange IdentityBroadcastMode, [id=#43]
         +- FileScan parquet default.t2[d#7L] Batched: true, DataFilters: [], Format: Parquet, Location: InMemoryFileIndex(1 paths)[file:/private/var/folders/tg/f5mz46090wg7swzgdc69f8q03965_0/T/warehous..., PartitionFilters: [], PushedFilters: [], ReadSchema: struct<d:bigint>
```
After this pr:
```
== Physical Plan ==
AdaptiveSparkPlan isFinalPlan=false
+- CollectLimit 10
   +- BroadcastNestedLoopJoin BuildRight, Cross
      :- LocalLimit 10
      :  +- FileScan parquet default.t1[a#5L,b#6L] Batched: true, DataFilters: [], Format: Parquet, Location: InMemoryFileIndex(1 paths)[file:/private/var/folders/tg/f5mz46090wg7swzgdc69f8q03965_0/T/warehous..., PartitionFilters: [], PushedFilters: [], ReadSchema: struct<a:bigint,b:bigint>
      +- BroadcastExchange IdentityBroadcastMode, [id=#51]
         +- LocalLimit 10
            +- FileScan parquet default.t2[d#7L] Batched: true, DataFilters: [], Format: Parquet, Location: InMemoryFileIndex(1 paths)[file:/private/var/folders/tg/f5mz46090wg7swzgdc69f8q03965_0/T/warehous..., PartitionFilters: [], PushedFilters: [], ReadSchema: struct<d:bigint>
```

### Why are the changes needed?

Improve query performance.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Unit test.

Closes #31567 from wangyum/SPARK-26138.

Authored-by: Yuming Wang <yumwang@ebay.com>
Signed-off-by: Yuming Wang <yumwang@ebay.com>
2021-02-24 09:50:13 +08:00
Max Gekk 7f27d33a3c [SPARK-31891][SQL] Support MSCK REPAIR TABLE .. [{ADD|DROP|SYNC} PARTITIONS]
### What changes were proposed in this pull request?

In the PR, I propose to extend the `MSCK REPAIR TABLE` command, and support new options `{ADD|DROP|SYNC} PARTITIONS`. In particular:

1. Extend the logical node `RepairTable`, and add two new flags `enableAddPartitions` and `enableDropPartitions`.
2. Add similar flags to the v1 execution node `AlterTableRecoverPartitionsCommand`
3. Add new method `dropPartitions()` to `AlterTableRecoverPartitionsCommand` which drops partitions from the catalog if their locations in the file system don't exist.
4. Updated public docs about the `MSCK REPAIR TABLE` command:
<img width="1037" alt="Screenshot 2021-02-16 at 13 46 39" src="https://user-images.githubusercontent.com/1580697/108052607-7446d280-705d-11eb-8e25-7398254787a4.png">

Closes #31097

### Why are the changes needed?
- The changes allow recovering tables with removed partitions. The example below portrays the problem:
```sql
spark-sql> create table tbl2 (col int, part int) partitioned by (part);
spark-sql> insert into tbl2 partition (part=1) select 1;
spark-sql> insert into tbl2 partition (part=0) select 0;
spark-sql> show table extended like 'tbl2' partition (part = 0);
default	tbl2	false	Partition Values: [part=0]
Location: file:/Users/maximgekk/proj/apache-spark/spark-warehouse/tbl2/part=0
...
```
Remove the partition (part = 0) from the filesystem:
```
$ rm -rf /Users/maximgekk/proj/apache-spark/spark-warehouse/tbl2/part=0
```
Even after recovering, we cannot query the table:
```sql
spark-sql> msck repair table tbl2;
spark-sql> select * from tbl2;
21/01/08 22:49:13 ERROR SparkSQLDriver: Failed in [select * from tbl2]
org.apache.hadoop.mapred.InvalidInputException: Input path does not exist: file:/Users/maximgekk/proj/apache-spark/spark-warehouse/tbl2/part=0
```

- To have feature parity with Hive: https://cwiki.apache.org/confluence/display/Hive/LanguageManual+DDL#LanguageManualDDL-RecoverPartitions(MSCKREPAIRTABLE)

### Does this PR introduce _any_ user-facing change?
Yes. After the changes, we can query recovered table:
```sql
spark-sql> msck repair table tbl2 sync partitions;
spark-sql> select * from tbl2;
1	1
spark-sql> show partitions tbl2;
part=1
```

### How was this patch tested?
- By running the modified test suite:
```
$ build/sbt -Phive-2.3 -Phive-thriftserver "test:testOnly *MsckRepairTableParserSuite"
$ build/sbt -Phive-2.3 -Phive-thriftserver "test:testOnly *PlanResolutionSuite"
$ build/sbt -Phive-2.3 -Phive-thriftserver "test:testOnly *AlterTableRecoverPartitionsSuite"
$ build/sbt -Phive-2.3 -Phive-thriftserver "test:testOnly *AlterTableRecoverPartitionsParallelSuite"
```
- Added unified v1 and v2 tests for `MSCK REPAIR TABLE`:
```
$ build/sbt -Phive-2.3 -Phive-thriftserver "test:testOnly *MsckRepairTableSuite"
```

Closes #31499 from MaxGekk/repair-table-drop-partitions.

Authored-by: Max Gekk <max.gekk@gmail.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2021-02-23 13:45:15 -08:00
Wenchen Fan 95e45c6257 [SPARK-34168][SQL][FOLLOWUP] Improve DynamicPartitionPruningSuiteBase
### What changes were proposed in this pull request?

A few minor improvements for `DynamicPartitionPruningSuiteBase`.

### Why are the changes needed?

code cleanup

### Does this PR introduce _any_ user-facing change?

no

### How was this patch tested?

existing tests

Closes #31625 from cloud-fan/followup.

Authored-by: Wenchen Fan <wenchen@databricks.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2021-02-23 13:41:24 -08:00
Wenchen Fan 0d5d248bdc [SPARK-34508][SQL][TEST] Skip HiveExternalCatalogVersionsSuite if network is down
### What changes were proposed in this pull request?

It's possible that the network is down when running Spark tests, and it's annoying to see `HiveExternalCatalogVersionsSuite` keep failing.

This PR proposes to skip this test suite if we can't get the latest Spark version from the Apache website.

### Why are the changes needed?

Make the Spark tests more robust.

### Does this PR introduce _any_ user-facing change?

no

### How was this patch tested?

N/A

Closes #31627 from cloud-fan/test.

Authored-by: Wenchen Fan <wenchen@databricks.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2021-02-23 13:35:29 -08:00
Huaxin Gao 443139b601 [SPARK-34502][SQL] Remove unused parameters in join methods
### What changes were proposed in this pull request?

Remove unused parameters in `CoalesceBucketsInJoin`, `UnsafeCartesianRDD` and `ShuffledHashJoinExec`.

### Why are the changes needed?
Clean up

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
Existing tests

Closes #31617 from huaxingao/join-minor.

Authored-by: Huaxin Gao <huaxing@us.ibm.com>
Signed-off-by: Liang-Chi Hsieh <viirya@gmail.com>
2021-02-23 12:18:43 -08:00
Wenchen Fan 429f8af9b6 Revert "[SPARK-34380][SQL] Support ifExists for ALTER TABLE ... UNSET TBLPROPERTIES for v2 command"
This reverts commit 9a566f83a0.
2021-02-24 02:38:22 +08:00
Max Gekk 8f994cbb4a [SPARK-34475][SQL] Rename logical nodes of v2 ALTER commands
### What changes were proposed in this pull request?
In the PR, I propose to rename logical nodes of v2 commands in the form: `<verb> + <object>` like:
- AlterTableAddPartition -> AddPartition
- AlterTableSetLocation -> SetTableLocation

### Why are the changes needed?
1. For simplicity and readability of logical plans
2. For consistency with other logical nodes. For example, the logical node `RenameTable` for `ALTER TABLE .. RENAME TO` was added before `AlterTableRenamePartition`.

### Does this PR introduce _any_ user-facing change?
Should not since this is non-public APIs.

### How was this patch tested?
1. Check scala style: `./dev/scalastyle`
2. Affected test suites:
```
$ build/sbt -Phive-2.3 -Phive-thriftserver "test:testOnly *AlterTableRenamePartitionSuite"
```

Closes #31596 from MaxGekk/rename-alter-table-logic-nodes.

Authored-by: Max Gekk <max.gekk@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2021-02-23 12:04:31 +00:00
Linhong Liu be675a052c [SPARK-34490][SQL] Analysis should fail if the view refers a dropped table
### What changes were proposed in this pull request?
When resolving a view, we use the captured view name in `AnalysisContext` to distinguish whether a relation name is a view or a table. But if that resolution fails, other rules (e.g. `ResolveTables`) will try to resolve the relation again, without `AnalysisContext`. In that case the resolution may be incorrect. For example, if the view refers to a dropped table while a view with the same name exists, the dropped table will be resolved as a view instead of raising an unresolved-relation error.
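
A sketch of the scenario (hypothetical names, assuming a `spark` session):

```scala
spark.sql("CREATE TABLE base_tbl (id INT) USING parquet")
spark.sql("CREATE VIEW v AS SELECT * FROM base_tbl")
spark.sql("DROP TABLE base_tbl")
// A temp view that happens to share the dropped table's name.
spark.sql("CREATE TEMP VIEW base_tbl AS SELECT 1 AS id")

// Before the fix, resolving `v` could silently pick up the temp view;
// after the fix, analysis fails because the table behind `v` no longer exists.
spark.sql("SELECT * FROM v")
```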

### Why are the changes needed?
bugfix

### Does this PR introduce _any_ user-facing change?
no

### How was this patch tested?
newly added test cases

Closes #31606 from linhongliu-db/fix-temp-view-master.

Lead-authored-by: Linhong Liu <linhong.liu@databricks.com>
Co-authored-by: Linhong Liu <67896261+linhongliu-db@users.noreply.github.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2021-02-23 15:51:02 +08:00
Kousuke Saruta 612d52315b [SPARK-34500][DOCS][EXAMPLES] Replace symbol literals with $"" in examples and documents
### What changes were proposed in this pull request?

This PR replaces all the occurrences of symbol literals (`'name`) with string interpolation (`$"name"`) in examples and documents.
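
For example, assuming a spark-shell session and a DataFrame `df` with `name` and `age` columns:

```scala
import spark.implicits._

// Before: Scala symbol literals used as column references.
df.select('name, 'age + 1)

// After: string interpolation, which does not depend on scala.Symbol.
df.select($"name", $"age" + 1)
```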

### Why are the changes needed?

Symbol literals are used to represent columns in Spark SQL, but the Scala community seems to be removing `Symbol` entirely.
As we discussed in #31569, we should first replace symbol literals with `$"name"` in user-facing examples and documents.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Build docs.

Closes #31615 from sarutak/replace-symbol-literals-in-doc-and-examples.

Authored-by: Kousuke Saruta <sarutak@oss.nttdata.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2021-02-23 11:22:02 +09:00
Max Gekk 7df4fed420 [MINOR][SQL] Fix the comment for CalendarIntervalType about comparability
### What changes were proposed in this pull request?
In the PR, I propose to partially revert https://github.com/apache/spark/pull/26659 regarding the comparability of interval values. The comment became incorrect after https://github.com/apache/spark/pull/27262.

### Why are the changes needed?
The comment is incorrect, and it might confuse Spark's devs/users.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
By checking scala coding style `./dev/scalastyle`.

Closes #31610 from MaxGekk/doc-interval-not-comparable.

Authored-by: Max Gekk <max.gekk@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2021-02-22 21:29:14 +08:00
Wenchen Fan 02c784ca68 [SPARK-34473][SQL] Avoid NPE in DataFrameReader.schema(StructType)
### What changes were proposed in this pull request?

This fixes a regression in `DataFrameReader.schema(StructType)`, to avoid an NPE if the given `StructType` is null. Note that passing null to Spark public APIs leads to undefined behavior. There is no document mentioning the null behavior, and it's just an accident that `DataFrameReader.schema(StructType)` worked before. So I think this is not a 3.1 blocker.

### Why are the changes needed?

It fixes a 3.1 regression

### Does this PR introduce _any_ user-facing change?

Yes: `df.read.schema(null: StructType)` is a no-op again, as before, while in the current branch-3.1 it throws an NPE.

### How was this patch tested?

It's undefined behavior and the fix is very obvious, so I didn't add a test. We should add tests when we clearly define and fix the null behavior for all public APIs.

Closes #31593 from cloud-fan/minor.

Authored-by: Wenchen Fan <wenchen@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2021-02-22 21:11:21 +08:00
kevincmchen 9767041153 [SPARK-34432][SQL][TESTS] Add JavaSimpleWritableDataSource
### What changes were proposed in this pull request?

This is a followup of https://github.com/apache/spark/pull/19269

In #19269, there is only a Scala implementation of a simple writable data source in `DataSourceV2Suite`.

This PR adds a Java implementation of it.

### Why are the changes needed?

To improve test coverage.

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

existing testsuites

Closes #31560 from kevincmchen/SPARK-34432.

Lead-authored-by: kevincmchen <kevincmchen@tencent.com>
Co-authored-by: Kevin Pis <68981916+kevincmchen@users.noreply.github.com>
Co-authored-by: Kevin Pis <kc4163568@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2021-02-22 09:38:13 +00:00
Max Gekk 23a5996a46 [SPARK-34450][SQL][TESTS] Unify v1 and v2 ALTER TABLE .. RENAME tests
### What changes were proposed in this pull request?
1. Move parser tests from `DDLParserSuite` to `AlterTableRenameParserSuite`.
2. Port DS v1 tests from `DDLSuite` and other test suites to `v1.AlterTableRenameBase` and to `v1.AlterTableRenameSuite`.
3. Add a test for DSv2 `ALTER TABLE .. RENAME` to `v2.AlterTableRenameSuite`.

### Why are the changes needed?
To improve test coverage.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
By running new test suites:
```
$ build/sbt -Phive-2.3 -Phive-thriftserver "test:testOnly *AlterTableRenameSuite"
$ build/sbt -Phive-2.3 -Phive-thriftserver "test:testOnly *AlterTableRenameParserSuite"
```

Closes #31575 from MaxGekk/unify-rename-table-tests.

Authored-by: Max Gekk <max.gekk@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2021-02-22 08:36:16 +00:00
Dongjoon Hyun 2fb5f21b1e [SPARK-34495][TESTS] Add DedicatedJVMTest test tag
### What changes were proposed in this pull request?

This PR aims to add a test tag, `DedicatedJVMTest`, and replace `SecurityTest` with this.

### Why are the changes needed?

To have a reusable general test tag.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Pass the CIs.

Closes #31607 from dongjoon-hyun/SPARK-34495.

Lead-authored-by: Dongjoon Hyun <dhyun@apple.com>
Co-authored-by: Hyukjin Kwon <gurwls223@gmail.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2021-02-22 16:00:48 +09:00
Raza Jafri 38fbe560fd [SPARK-34167][SQL] Reading parquet with IntDecimal written as a LongDecimal blows up
### What changes were proposed in this pull request?
If an IntDecimal type was written as a LongDecimal in a Parquet file, Spark should read it as a long from `VectorizedValuesReader` but write it to the `WritableColumnVector` as an int by down-casting it and calling the appropriate method. `readLongs` has been modified to take a boolean flag that tells it whether the number fits in a 32-bit decimal and should be downsized accordingly.

### Why are the changes needed?
If a Parquet file writes an IntDecimal as a LongDecimal, which is supported by the Parquet spec, Spark will not be able to read it and will throw an exception. The reason is that the `readLong` method tries to write the long to a `WritableColumnVector` that has been initialized to accept only ints, which leads to a `NullPointerException`.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
Manually tested and added a unit test.

Closes #31284 from razajafri/decimal_fix.

Authored-by: Raza Jafri <rjafri@nvidia.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2021-02-22 04:48:56 +00:00
Max Gekk a22d20a6ca [SPARK-34468][SQL] Rename v2 table in place if new name has single part
### What changes were proposed in this pull request?
If the new table name consists of a single part (no namespaces), the v2 `ALTER TABLE .. RENAME TO` command renames the table while keeping it in the same namespace. For example:
```sql
ALTER TABLE catalog_name.ns1.ns2.ns3.ns4.ns5.tbl RENAME TO new_table
```
the command should rename the source table to `catalog_name.ns1.ns2.ns3.ns4.ns5.new_table`. Before the changes, the command moved the table to the "root" namespace, i.e. `catalog_name.new_table`.

### Why are the changes needed?
To have the same behavior as v1 implementation of `ALTER TABLE .. RENAME TO`, and other DBMSs.

### Does this PR introduce _any_ user-facing change?
Yes

### How was this patch tested?
By running new test:
```
$ build/sbt "sql/test:testOnly *DataSourceV2SQLSuite"
```

Closes #31594 from MaxGekk/rename-table-single-part.

Authored-by: Max Gekk <max.gekk@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2021-02-22 04:43:19 +00:00
Max Gekk 6ea4b5fda7 [SPARK-34401][SQL][DOCS] Update docs about altering cached tables/views
### What changes were proposed in this pull request?
Update public docs of SQL commands about altering cached tables/views. For instance:
<img width="869" alt="Screenshot 2021-02-08 at 15 11 48" src="https://user-images.githubusercontent.com/1580697/107217940-fd3b8980-6a1f-11eb-98b9-9b2e3fe7f4ef.png">

### Why are the changes needed?
To inform users about commands behavior in altering cached tables or views.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
By running the command below and manually checking the docs:
```
$ SKIP_API=1 SKIP_SCALADOC=1 SKIP_PYTHONDOC=1 SKIP_RDOC=1 jekyll serve --watch
```

Closes #31524 from MaxGekk/doc-cmd-caching.

Authored-by: Max Gekk <max.gekk@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2021-02-22 04:32:09 +00:00
Dongjoon Hyun 03f4cf5845 [SPARK-34029][SQL][TESTS] Add OrcEncryptionSuite and FakeKeyProvider
### What changes were proposed in this pull request?

This is a retry of #31065. Last time, the newly added test cases passed in Jenkins and individually, but they were reverted because they fail when `GitHub Action` runs with `SERIAL_SBT_TESTS=1`.

In this PR, `SecurityTest` tag is used to isolate `KeyProvider`.

This PR aims to add a basis for a columnar encryption test framework by adding `OrcEncryptionSuite` and `FakeKeyProvider`.

Please note that we will improve more in both Apache Spark and Apache ORC in Apache Spark 3.2.0 timeframe.

### Why are the changes needed?

Apache ORC 1.6 supports columnar encryption.

### Does this PR introduce _any_ user-facing change?

No. This is for a test case.

### How was this patch tested?

Pass the newly added test suite.

Closes #31603 from dongjoon-hyun/SPARK-34486-RETRY.

Authored-by: Dongjoon Hyun <dongjoon@apache.org>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2021-02-21 15:05:29 -08:00
Yuming Wang 94f9617cb4 [SPARK-34129][SQL] Add table name to LogicalRelation.simpleString
### What changes were proposed in this pull request?

This PR adds the table name to `LogicalRelation.simpleString`.

### Why are the changes needed?

Make optimized logical plan more readable.

Before this pr:
```
== Optimized Logical Plan ==
Project [i_item_sk#7 AS ss_item_sk#162], Statistics(sizeInBytes=8.07E+27 B)
+- Join Inner, (((i_brand_id#14 = brand_id#159) AND (i_class_id#16 = class_id#160)) AND (i_category_id#18 = category_id#161)), Statistics(sizeInBytes=2.42E+28 B)
   :- Project [i_item_sk#7, i_brand_id#14, i_class_id#16, i_category_id#18], Statistics(sizeInBytes=8.5 MiB, rowCount=3.69E+5)
   :  +- Filter ((isnotnull(i_brand_id#14) AND isnotnull(i_class_id#16)) AND isnotnull(i_category_id#18)), Statistics(sizeInBytes=150.0 MiB, rowCount=3.69E+5)
   :     +- Relation[i_item_sk#7,i_item_id#8,i_rec_start_date#9,i_rec_end_date#10,i_item_desc#11,i_current_price#12,i_wholesale_cost#13,i_brand_id#14,i_brand#15,i_class_id#16,i_class#17,i_category_id#18,i_category#19,i_manufact_id#20,i_manufact#21,i_size#22,i_formulation#23,i_color#24,i_units#25,i_container#26,i_manager_id#27,i_product_name#28] parquet, Statistics(sizeInBytes=151.1 MiB, rowCount=3.72E+5)
   +- Aggregate [brand_id#159, class_id#160, category_id#161], [brand_id#159, class_id#160, category_id#161], Statistics(sizeInBytes=2.73E+21 B)
      +- Aggregate [brand_id#159, class_id#160, category_id#161], [brand_id#159, class_id#160, category_id#161], Statistics(sizeInBytes=2.73E+21 B)
         +- Join LeftSemi, (((brand_id#159 <=> i_brand_id#14) AND (class_id#160 <=> i_class_id#16)) AND (category_id#161 <=> i_category_id#18)), Statistics(sizeInBytes=2.73E+21 B)
            :- Join LeftSemi, (((brand_id#159 <=> i_brand_id#14) AND (class_id#160 <=> i_class_id#16)) AND (category_id#161 <=> i_category_id#18)), Statistics(sizeInBytes=2.73E+21 B)
            :  :- Project [i_brand_id#14 AS brand_id#159, i_class_id#16 AS class_id#160, i_category_id#18 AS category_id#161], Statistics(sizeInBytes=2.73E+21 B)
            :  :  +- Join Inner, (ss_sold_date_sk#51 = d_date_sk#52), Statistics(sizeInBytes=3.83E+21 B)
            :  :     :- Project [ss_sold_date_sk#51, i_brand_id#14, i_class_id#16, i_category_id#18], Statistics(sizeInBytes=387.3 PiB)
            :  :     :  +- Join Inner, (ss_item_sk#30 = i_item_sk#7), Statistics(sizeInBytes=516.5 PiB)
            :  :     :     :- Project [ss_item_sk#30, ss_sold_date_sk#51], Statistics(sizeInBytes=61.1 GiB)
            :  :     :     :  +- Filter ((isnotnull(ss_item_sk#30) AND isnotnull(ss_sold_date_sk#51)) AND dynamicpruning#168 [ss_sold_date_sk#51]), Statistics(sizeInBytes=580.6 GiB)
            :  :     :     :     :  +- Project [d_date_sk#52], Statistics(sizeInBytes=8.6 KiB, rowCount=731)
            :  :     :     :     :     +- Filter ((((d_year#58 >= 1999) AND (d_year#58 <= 2001)) AND isnotnull(d_year#58)) AND isnotnull(d_date_sk#52)), Statistics(sizeInBytes=175.6 KiB, rowCount=731)
            :  :     :     :     :        +- Relation[d_date_sk#52,d_date_id#53,d_date#54,d_month_seq#55,d_week_seq#56,d_quarter_seq#57,d_year#58,d_dow#59,d_moy#60,d_dom#61,d_qoy#62,d_fy_year#63,d_fy_quarter_seq#64,d_fy_week_seq#65,d_day_name#66,d_quarter_name#67,d_holiday#68,d_weekend#69,d_following_holiday#70,d_first_dom#71,d_last_dom#72,d_same_day_ly#73,d_same_day_lq#74,d_current_day#75,... 4 more fields] parquet, Statistics(sizeInBytes=17.1 MiB, rowCount=7.30E+4)
            :  :     :     :     +- Relation[ss_sold_time_sk#29,ss_item_sk#30,ss_customer_sk#31,ss_cdemo_sk#32,ss_hdemo_sk#33,ss_addr_sk#34,ss_store_sk#35,ss_promo_sk#36,ss_ticket_number#37L,ss_quantity#38,ss_wholesale_cost#39,ss_list_price#40,ss_sales_price#41,ss_ext_discount_amt#42,ss_ext_sales_price#43,ss_ext_wholesale_cost#44,ss_ext_list_price#45,ss_ext_tax#46,ss_coupon_amt#47,ss_net_paid#48,ss_net_paid_inc_tax#49,ss_net_profit#50,ss_sold_date_sk#51] parquet, Statistics(sizeInBytes=580.6 GiB)
            :  :     :     +- Project [i_item_sk#7, i_brand_id#14, i_class_id#16, i_category_id#18], Statistics(sizeInBytes=8.5 MiB, rowCount=3.69E+5)
            :  :     :        +- Filter (((isnotnull(i_brand_id#14) AND isnotnull(i_class_id#16)) AND isnotnull(i_category_id#18)) AND isnotnull(i_item_sk#7)), Statistics(sizeInBytes=150.0 MiB, rowCount=3.69E+5)
            :  :     :           +- Relation[i_item_sk#7,i_item_id#8,i_rec_start_date#9,i_rec_end_date#10,i_item_desc#11,i_current_price#12,i_wholesale_cost#13,i_brand_id#14,i_brand#15,i_class_id#16,i_class#17,i_category_id#18,i_category#19,i_manufact_id#20,i_manufact#21,i_size#22,i_formulation#23,i_color#24,i_units#25,i_container#26,i_manager_id#27,i_product_name#28] parquet, Statistics(sizeInBytes=151.1 MiB, rowCount=3.72E+5)
            :  :     +- Project [d_date_sk#52], Statistics(sizeInBytes=8.6 KiB, rowCount=731)
            :  :        +- Filter ((((d_year#58 >= 1999) AND (d_year#58 <= 2001)) AND isnotnull(d_year#58)) AND isnotnull(d_date_sk#52)), Statistics(sizeInBytes=175.6 KiB, rowCount=731)
            :  :           +- Relation[d_date_sk#52,d_date_id#53,d_date#54,d_month_seq#55,d_week_seq#56,d_quarter_seq#57,d_year#58,d_dow#59,d_moy#60,d_dom#61,d_qoy#62,d_fy_year#63,d_fy_quarter_seq#64,d_fy_week_seq#65,d_day_name#66,d_quarter_name#67,d_holiday#68,d_weekend#69,d_following_holiday#70,d_first_dom#71,d_last_dom#72,d_same_day_ly#73,d_same_day_lq#74,d_current_day#75,... 4 more fields] parquet, Statistics(sizeInBytes=17.1 MiB, rowCount=7.30E+4)
            :  +- Aggregate [i_brand_id#14, i_class_id#16, i_category_id#18], [i_brand_id#14, i_class_id#16, i_category_id#18], Statistics(sizeInBytes=1414.2 EiB)
            :     +- Project [i_brand_id#14, i_class_id#16, i_category_id#18], Statistics(sizeInBytes=1414.2 EiB)
            :        +- Join Inner, (cs_sold_date_sk#113 = d_date_sk#52), Statistics(sizeInBytes=1979.9 EiB)
            :           :- Project [cs_sold_date_sk#113, i_brand_id#14, i_class_id#16, i_category_id#18], Statistics(sizeInBytes=231.1 PiB)
            :           :  +- Join Inner, (cs_item_sk#94 = i_item_sk#7), Statistics(sizeInBytes=308.2 PiB)
            :           :     :- Project [cs_item_sk#94, cs_sold_date_sk#113], Statistics(sizeInBytes=36.2 GiB)
            :           :     :  +- Filter ((isnotnull(cs_item_sk#94) AND isnotnull(cs_sold_date_sk#113)) AND dynamicpruning#169 [cs_sold_date_sk#113]), Statistics(sizeInBytes=470.5 GiB)
            :           :     :     :  +- Project [d_date_sk#52], Statistics(sizeInBytes=8.6 KiB, rowCount=731)
            :           :     :     :     +- Filter ((((d_year#58 >= 1999) AND (d_year#58 <= 2001)) AND isnotnull(d_year#58)) AND isnotnull(d_date_sk#52)), Statistics(sizeInBytes=175.6 KiB, rowCount=731)
            :           :     :     :        +- Relation[d_date_sk#52,d_date_id#53,d_date#54,d_month_seq#55,d_week_seq#56,d_quarter_seq#57,d_year#58,d_dow#59,d_moy#60,d_dom#61,d_qoy#62,d_fy_year#63,d_fy_quarter_seq#64,d_fy_week_seq#65,d_day_name#66,d_quarter_name#67,d_holiday#68,d_weekend#69,d_following_holiday#70,d_first_dom#71,d_last_dom#72,d_same_day_ly#73,d_same_day_lq#74,d_current_day#75,... 4 more fields] parquet, Statistics(sizeInBytes=17.1 MiB, rowCount=7.30E+4)
            :           :     :     +- Relation[cs_sold_time_sk#80,cs_ship_date_sk#81,cs_bill_customer_sk#82,cs_bill_cdemo_sk#83,cs_bill_hdemo_sk#84,cs_bill_addr_sk#85,cs_ship_customer_sk#86,cs_ship_cdemo_sk#87,cs_ship_hdemo_sk#88,cs_ship_addr_sk#89,cs_call_center_sk#90,cs_catalog_page_sk#91,cs_ship_mode_sk#92,cs_warehouse_sk#93,cs_item_sk#94,cs_promo_sk#95,cs_order_number#96L,cs_quantity#97,cs_wholesale_cost#98,cs_list_price#99,cs_sales_price#100,cs_ext_discount_amt#101,cs_ext_sales_price#102,cs_ext_wholesale_cost#103,... 10 more fields] parquet, Statistics(sizeInBytes=470.5 GiB)
            :           :     +- Project [i_item_sk#7, i_brand_id#14, i_class_id#16, i_category_id#18], Statistics(sizeInBytes=8.5 MiB, rowCount=3.72E+5)
            :           :        +- Filter isnotnull(i_item_sk#7), Statistics(sizeInBytes=151.1 MiB, rowCount=3.72E+5)
            :           :           +- Relation[i_item_sk#7,i_item_id#8,i_rec_start_date#9,i_rec_end_date#10,i_item_desc#11,i_current_price#12,i_wholesale_cost#13,i_brand_id#14,i_brand#15,i_class_id#16,i_class#17,i_category_id#18,i_category#19,i_manufact_id#20,i_manufact#21,i_size#22,i_formulation#23,i_color#24,i_units#25,i_container#26,i_manager_id#27,i_product_name#28] parquet, Statistics(sizeInBytes=151.1 MiB, rowCount=3.72E+5)
            :           +- Project [d_date_sk#52], Statistics(sizeInBytes=8.6 KiB, rowCount=731)
            :              +- Filter ((((d_year#58 >= 1999) AND (d_year#58 <= 2001)) AND isnotnull(d_year#58)) AND isnotnull(d_date_sk#52)), Statistics(sizeInBytes=175.6 KiB, rowCount=731)
            :                 +- Relation[d_date_sk#52,d_date_id#53,d_date#54,d_month_seq#55,d_week_seq#56,d_quarter_seq#57,d_year#58,d_dow#59,d_moy#60,d_dom#61,d_qoy#62,d_fy_year#63,d_fy_quarter_seq#64,d_fy_week_seq#65,d_day_name#66,d_quarter_name#67,d_holiday#68,d_weekend#69,d_following_holiday#70,d_first_dom#71,d_last_dom#72,d_same_day_ly#73,d_same_day_lq#74,d_current_day#75,... 4 more fields] parquet, Statistics(sizeInBytes=17.1 MiB, rowCount=7.30E+4)
            +- Aggregate [i_brand_id#14, i_class_id#16, i_category_id#18], [i_brand_id#14, i_class_id#16, i_category_id#18], Statistics(sizeInBytes=650.5 EiB)
               +- Project [i_brand_id#14, i_class_id#16, i_category_id#18], Statistics(sizeInBytes=650.5 EiB)
                  +- Join Inner, (ws_sold_date_sk#147 = d_date_sk#52), Statistics(sizeInBytes=910.6 EiB)
                     :- Project [ws_sold_date_sk#147, i_brand_id#14, i_class_id#16, i_category_id#18], Statistics(sizeInBytes=106.3 PiB)
                     :  +- Join Inner, (ws_item_sk#116 = i_item_sk#7), Statistics(sizeInBytes=141.7 PiB)
                     :     :- Project [ws_item_sk#116, ws_sold_date_sk#147], Statistics(sizeInBytes=16.6 GiB)
                     :     :  +- Filter ((isnotnull(ws_item_sk#116) AND isnotnull(ws_sold_date_sk#147)) AND dynamicpruning#170 [ws_sold_date_sk#147]), Statistics(sizeInBytes=216.4 GiB)
                     :     :     :  +- Project [d_date_sk#52], Statistics(sizeInBytes=8.6 KiB, rowCount=731)
                     :     :     :     +- Filter ((((d_year#58 >= 1999) AND (d_year#58 <= 2001)) AND isnotnull(d_year#58)) AND isnotnull(d_date_sk#52)), Statistics(sizeInBytes=175.6 KiB, rowCount=731)
                     :     :     :        +- Relation[d_date_sk#52,d_date_id#53,d_date#54,d_month_seq#55,d_week_seq#56,d_quarter_seq#57,d_year#58,d_dow#59,d_moy#60,d_dom#61,d_qoy#62,d_fy_year#63,d_fy_quarter_seq#64,d_fy_week_seq#65,d_day_name#66,d_quarter_name#67,d_holiday#68,d_weekend#69,d_following_holiday#70,d_first_dom#71,d_last_dom#72,d_same_day_ly#73,d_same_day_lq#74,d_current_day#75,... 4 more fields] parquet, Statistics(sizeInBytes=17.1 MiB, rowCount=7.30E+4)
                     :     :     +- Relation[ws_sold_time_sk#114,ws_ship_date_sk#115,ws_item_sk#116,ws_bill_customer_sk#117,ws_bill_cdemo_sk#118,ws_bill_hdemo_sk#119,ws_bill_addr_sk#120,ws_ship_customer_sk#121,ws_ship_cdemo_sk#122,ws_ship_hdemo_sk#123,ws_ship_addr_sk#124,ws_web_page_sk#125,ws_web_site_sk#126,ws_ship_mode_sk#127,ws_warehouse_sk#128,ws_promo_sk#129,ws_order_number#130L,ws_quantity#131,ws_wholesale_cost#132,ws_list_price#133,ws_sales_price#134,ws_ext_discount_amt#135,ws_ext_sales_price#136,ws_ext_wholesale_cost#137,... 10 more fields] parquet, Statistics(sizeInBytes=216.4 GiB)
                     :     +- Project [i_item_sk#7, i_brand_id#14, i_class_id#16, i_category_id#18], Statistics(sizeInBytes=8.5 MiB, rowCount=3.72E+5)
                     :        +- Filter isnotnull(i_item_sk#7), Statistics(sizeInBytes=151.1 MiB, rowCount=3.72E+5)
                     :           +- Relation[i_item_sk#7,i_item_id#8,i_rec_start_date#9,i_rec_end_date#10,i_item_desc#11,i_current_price#12,i_wholesale_cost#13,i_brand_id#14,i_brand#15,i_class_id#16,i_class#17,i_category_id#18,i_category#19,i_manufact_id#20,i_manufact#21,i_size#22,i_formulation#23,i_color#24,i_units#25,i_container#26,i_manager_id#27,i_product_name#28] parquet, Statistics(sizeInBytes=151.1 MiB, rowCount=3.72E+5)
                     +- Project [d_date_sk#52], Statistics(sizeInBytes=8.6 KiB, rowCount=731)
                        +- Filter ((((d_year#58 >= 1999) AND (d_year#58 <= 2001)) AND isnotnull(d_year#58)) AND isnotnull(d_date_sk#52)), Statistics(sizeInBytes=175.6 KiB, rowCount=731)
                           +- Relation[d_date_sk#52,d_date_id#53,d_date#54,d_month_seq#55,d_week_seq#56,d_quarter_seq#57,d_year#58,d_dow#59,d_moy#60,d_dom#61,d_qoy#62,d_fy_year#63,d_fy_quarter_seq#64,d_fy_week_seq#65,d_day_name#66,d_quarter_name#67,d_holiday#68,d_weekend#69,d_following_holiday#70,d_first_dom#71,d_last_dom#72,d_same_day_ly#73,d_same_day_lq#74,d_current_day#75,... 4 more fields] parquet, Statistics(sizeInBytes=17.1 MiB, rowCount=7.30E+4)
```

After this pr:
```
== Optimized Logical Plan ==
Project [i_item_sk#9 AS ss_item_sk#3], Statistics(sizeInBytes=8.07E+27 B)
+- Join Inner, (((i_brand_id#16 = brand_id#0) AND (i_class_id#18 = class_id#1)) AND (i_category_id#20 = category_id#2)), Statistics(sizeInBytes=2.42E+28 B)
   :- Project [i_item_sk#9, i_brand_id#16, i_class_id#18, i_category_id#20], Statistics(sizeInBytes=8.5 MiB, rowCount=3.69E+5)
   :  +- Filter ((isnotnull(i_brand_id#16) AND isnotnull(i_class_id#18)) AND isnotnull(i_category_id#20)), Statistics(sizeInBytes=150.0 MiB, rowCount=3.69E+5)
   :     +- Relation tpcds5t.item[i_item_sk#9,i_item_id#10,i_rec_start_date#11,i_rec_end_date#12,i_item_desc#13,i_current_price#14,i_wholesale_cost#15,i_brand_id#16,i_brand#17,i_class_id#18,i_class#19,i_category_id#20,i_category#21,i_manufact_id#22,i_manufact#23,i_size#24,i_formulation#25,i_color#26,i_units#27,i_container#28,i_manager_id#29,i_product_name#30] parquet, Statistics(sizeInBytes=151.1 MiB, rowCount=3.72E+5)
   +- Aggregate [brand_id#0, class_id#1, category_id#2], [brand_id#0, class_id#1, category_id#2], Statistics(sizeInBytes=2.73E+21 B)
      +- Aggregate [brand_id#0, class_id#1, category_id#2], [brand_id#0, class_id#1, category_id#2], Statistics(sizeInBytes=2.73E+21 B)
         +- Join LeftSemi, (((brand_id#0 <=> i_brand_id#16) AND (class_id#1 <=> i_class_id#18)) AND (category_id#2 <=> i_category_id#20)), Statistics(sizeInBytes=2.73E+21 B)
            :- Join LeftSemi, (((brand_id#0 <=> i_brand_id#16) AND (class_id#1 <=> i_class_id#18)) AND (category_id#2 <=> i_category_id#20)), Statistics(sizeInBytes=2.73E+21 B)
            :  :- Project [i_brand_id#16 AS brand_id#0, i_class_id#18 AS class_id#1, i_category_id#20 AS category_id#2], Statistics(sizeInBytes=2.73E+21 B)
            :  :  +- Join Inner, (ss_sold_date_sk#53 = d_date_sk#54), Statistics(sizeInBytes=3.83E+21 B)
            :  :     :- Project [ss_sold_date_sk#53, i_brand_id#16, i_class_id#18, i_category_id#20], Statistics(sizeInBytes=387.3 PiB)
            :  :     :  +- Join Inner, (ss_item_sk#32 = i_item_sk#9), Statistics(sizeInBytes=516.5 PiB)
            :  :     :     :- Project [ss_item_sk#32, ss_sold_date_sk#53], Statistics(sizeInBytes=61.1 GiB)
            :  :     :     :  +- Filter ((isnotnull(ss_item_sk#32) AND isnotnull(ss_sold_date_sk#53)) AND dynamicpruning#150 [ss_sold_date_sk#53]), Statistics(sizeInBytes=580.6 GiB)
            :  :     :     :     :  +- Project [d_date_sk#54], Statistics(sizeInBytes=8.6 KiB, rowCount=731)
            :  :     :     :     :     +- Filter ((((d_year#60 >= 1999) AND (d_year#60 <= 2001)) AND isnotnull(d_year#60)) AND isnotnull(d_date_sk#54)), Statistics(sizeInBytes=175.6 KiB, rowCount=731)
            :  :     :     :     :        +- Relation tpcds5t.date_dim[d_date_sk#54,d_date_id#55,d_date#56,d_month_seq#57,d_week_seq#58,d_quarter_seq#59,d_year#60,d_dow#61,d_moy#62,d_dom#63,d_qoy#64,d_fy_year#65,d_fy_quarter_seq#66,d_fy_week_seq#67,d_day_name#68,d_quarter_name#69,d_holiday#70,d_weekend#71,d_following_holiday#72,d_first_dom#73,d_last_dom#74,d_same_day_ly#75,d_same_day_lq#76,d_current_day#77,... 4 more fields] parquet, Statistics(sizeInBytes=17.1 MiB, rowCount=7.30E+4)
            :  :     :     :     +- Relation tpcds5t.store_sales[ss_sold_time_sk#31,ss_item_sk#32,ss_customer_sk#33,ss_cdemo_sk#34,ss_hdemo_sk#35,ss_addr_sk#36,ss_store_sk#37,ss_promo_sk#38,ss_ticket_number#39L,ss_quantity#40,ss_wholesale_cost#41,ss_list_price#42,ss_sales_price#43,ss_ext_discount_amt#44,ss_ext_sales_price#45,ss_ext_wholesale_cost#46,ss_ext_list_price#47,ss_ext_tax#48,ss_coupon_amt#49,ss_net_paid#50,ss_net_paid_inc_tax#51,ss_net_profit#52,ss_sold_date_sk#53] parquet, Statistics(sizeInBytes=580.6 GiB)
            :  :     :     +- Project [i_item_sk#9, i_brand_id#16, i_class_id#18, i_category_id#20], Statistics(sizeInBytes=8.5 MiB, rowCount=3.69E+5)
            :  :     :        +- Filter (((isnotnull(i_brand_id#16) AND isnotnull(i_class_id#18)) AND isnotnull(i_category_id#20)) AND isnotnull(i_item_sk#9)), Statistics(sizeInBytes=150.0 MiB, rowCount=3.69E+5)
            :  :     :           +- Relation tpcds5t.item[i_item_sk#9,i_item_id#10,i_rec_start_date#11,i_rec_end_date#12,i_item_desc#13,i_current_price#14,i_wholesale_cost#15,i_brand_id#16,i_brand#17,i_class_id#18,i_class#19,i_category_id#20,i_category#21,i_manufact_id#22,i_manufact#23,i_size#24,i_formulation#25,i_color#26,i_units#27,i_container#28,i_manager_id#29,i_product_name#30] parquet, Statistics(sizeInBytes=151.1 MiB, rowCount=3.72E+5)
            :  :     +- Project [d_date_sk#54], Statistics(sizeInBytes=8.6 KiB, rowCount=731)
            :  :        +- Filter ((((d_year#60 >= 1999) AND (d_year#60 <= 2001)) AND isnotnull(d_year#60)) AND isnotnull(d_date_sk#54)), Statistics(sizeInBytes=175.6 KiB, rowCount=731)
            :  :           +- Relation tpcds5t.date_dim[d_date_sk#54,d_date_id#55,d_date#56,d_month_seq#57,d_week_seq#58,d_quarter_seq#59,d_year#60,d_dow#61,d_moy#62,d_dom#63,d_qoy#64,d_fy_year#65,d_fy_quarter_seq#66,d_fy_week_seq#67,d_day_name#68,d_quarter_name#69,d_holiday#70,d_weekend#71,d_following_holiday#72,d_first_dom#73,d_last_dom#74,d_same_day_ly#75,d_same_day_lq#76,d_current_day#77,... 4 more fields] parquet, Statistics(sizeInBytes=17.1 MiB, rowCount=7.30E+4)
            :  +- Aggregate [i_brand_id#16, i_class_id#18, i_category_id#20], [i_brand_id#16, i_class_id#18, i_category_id#20], Statistics(sizeInBytes=1414.2 EiB)
            :     +- Project [i_brand_id#16, i_class_id#18, i_category_id#20], Statistics(sizeInBytes=1414.2 EiB)
            :        +- Join Inner, (cs_sold_date_sk#115 = d_date_sk#54), Statistics(sizeInBytes=1979.9 EiB)
            :           :- Project [cs_sold_date_sk#115, i_brand_id#16, i_class_id#18, i_category_id#20], Statistics(sizeInBytes=231.1 PiB)
            :           :  +- Join Inner, (cs_item_sk#96 = i_item_sk#9), Statistics(sizeInBytes=308.2 PiB)
            :           :     :- Project [cs_item_sk#96, cs_sold_date_sk#115], Statistics(sizeInBytes=36.2 GiB)
            :           :     :  +- Filter ((isnotnull(cs_item_sk#96) AND isnotnull(cs_sold_date_sk#115)) AND dynamicpruning#151 [cs_sold_date_sk#115]), Statistics(sizeInBytes=470.5 GiB)
            :           :     :     :  +- Project [d_date_sk#54], Statistics(sizeInBytes=8.6 KiB, rowCount=731)
            :           :     :     :     +- Filter ((((d_year#60 >= 1999) AND (d_year#60 <= 2001)) AND isnotnull(d_year#60)) AND isnotnull(d_date_sk#54)), Statistics(sizeInBytes=175.6 KiB, rowCount=731)
            :           :     :     :        +- Relation tpcds5t.date_dim[d_date_sk#54,d_date_id#55,d_date#56,d_month_seq#57,d_week_seq#58,d_quarter_seq#59,d_year#60,d_dow#61,d_moy#62,d_dom#63,d_qoy#64,d_fy_year#65,d_fy_quarter_seq#66,d_fy_week_seq#67,d_day_name#68,d_quarter_name#69,d_holiday#70,d_weekend#71,d_following_holiday#72,d_first_dom#73,d_last_dom#74,d_same_day_ly#75,d_same_day_lq#76,d_current_day#77,... 4 more fields] parquet, Statistics(sizeInBytes=17.1 MiB, rowCount=7.30E+4)
            :           :     :     +- Relation tpcds5t.catalog_sales[cs_sold_time_sk#82,cs_ship_date_sk#83,cs_bill_customer_sk#84,cs_bill_cdemo_sk#85,cs_bill_hdemo_sk#86,cs_bill_addr_sk#87,cs_ship_customer_sk#88,cs_ship_cdemo_sk#89,cs_ship_hdemo_sk#90,cs_ship_addr_sk#91,cs_call_center_sk#92,cs_catalog_page_sk#93,cs_ship_mode_sk#94,cs_warehouse_sk#95,cs_item_sk#96,cs_promo_sk#97,cs_order_number#98L,cs_quantity#99,cs_wholesale_cost#100,cs_list_price#101,cs_sales_price#102,cs_ext_discount_amt#103,cs_ext_sales_price#104,cs_ext_wholesale_cost#105,... 10 more fields] parquet, Statistics(sizeInBytes=470.5 GiB)
            :           :     +- Project [i_item_sk#9, i_brand_id#16, i_class_id#18, i_category_id#20], Statistics(sizeInBytes=8.5 MiB, rowCount=3.72E+5)
            :           :        +- Filter isnotnull(i_item_sk#9), Statistics(sizeInBytes=151.1 MiB, rowCount=3.72E+5)
            :           :           +- Relation tpcds5t.item[i_item_sk#9,i_item_id#10,i_rec_start_date#11,i_rec_end_date#12,i_item_desc#13,i_current_price#14,i_wholesale_cost#15,i_brand_id#16,i_brand#17,i_class_id#18,i_class#19,i_category_id#20,i_category#21,i_manufact_id#22,i_manufact#23,i_size#24,i_formulation#25,i_color#26,i_units#27,i_container#28,i_manager_id#29,i_product_name#30] parquet, Statistics(sizeInBytes=151.1 MiB, rowCount=3.72E+5)
            :           +- Project [d_date_sk#54], Statistics(sizeInBytes=8.6 KiB, rowCount=731)
            :              +- Filter ((((d_year#60 >= 1999) AND (d_year#60 <= 2001)) AND isnotnull(d_year#60)) AND isnotnull(d_date_sk#54)), Statistics(sizeInBytes=175.6 KiB, rowCount=731)
            :                 +- Relation tpcds5t.date_dim[d_date_sk#54,d_date_id#55,d_date#56,d_month_seq#57,d_week_seq#58,d_quarter_seq#59,d_year#60,d_dow#61,d_moy#62,d_dom#63,d_qoy#64,d_fy_year#65,d_fy_quarter_seq#66,d_fy_week_seq#67,d_day_name#68,d_quarter_name#69,d_holiday#70,d_weekend#71,d_following_holiday#72,d_first_dom#73,d_last_dom#74,d_same_day_ly#75,d_same_day_lq#76,d_current_day#77,... 4 more fields] parquet, Statistics(sizeInBytes=17.1 MiB, rowCount=7.30E+4)
            +- Aggregate [i_brand_id#16, i_class_id#18, i_category_id#20], [i_brand_id#16, i_class_id#18, i_category_id#20], Statistics(sizeInBytes=650.5 EiB)
               +- Project [i_brand_id#16, i_class_id#18, i_category_id#20], Statistics(sizeInBytes=650.5 EiB)
                  +- Join Inner, (ws_sold_date_sk#149 = d_date_sk#54), Statistics(sizeInBytes=910.6 EiB)
                     :- Project [ws_sold_date_sk#149, i_brand_id#16, i_class_id#18, i_category_id#20], Statistics(sizeInBytes=106.3 PiB)
                     :  +- Join Inner, (ws_item_sk#118 = i_item_sk#9), Statistics(sizeInBytes=141.7 PiB)
                     :     :- Project [ws_item_sk#118, ws_sold_date_sk#149], Statistics(sizeInBytes=16.6 GiB)
                     :     :  +- Filter ((isnotnull(ws_item_sk#118) AND isnotnull(ws_sold_date_sk#149)) AND dynamicpruning#152 [ws_sold_date_sk#149]), Statistics(sizeInBytes=216.4 GiB)
                     :     :     :  +- Project [d_date_sk#54], Statistics(sizeInBytes=8.6 KiB, rowCount=731)
                     :     :     :     +- Filter ((((d_year#60 >= 1999) AND (d_year#60 <= 2001)) AND isnotnull(d_year#60)) AND isnotnull(d_date_sk#54)), Statistics(sizeInBytes=175.6 KiB, rowCount=731)
                     :     :     :        +- Relation tpcds5t.date_dim[d_date_sk#54,d_date_id#55,d_date#56,d_month_seq#57,d_week_seq#58,d_quarter_seq#59,d_year#60,d_dow#61,d_moy#62,d_dom#63,d_qoy#64,d_fy_year#65,d_fy_quarter_seq#66,d_fy_week_seq#67,d_day_name#68,d_quarter_name#69,d_holiday#70,d_weekend#71,d_following_holiday#72,d_first_dom#73,d_last_dom#74,d_same_day_ly#75,d_same_day_lq#76,d_current_day#77,... 4 more fields] parquet, Statistics(sizeInBytes=17.1 MiB, rowCount=7.30E+4)
                     :     :     +- Relation tpcds5t.web_sales[ws_sold_time_sk#116,ws_ship_date_sk#117,ws_item_sk#118,ws_bill_customer_sk#119,ws_bill_cdemo_sk#120,ws_bill_hdemo_sk#121,ws_bill_addr_sk#122,ws_ship_customer_sk#123,ws_ship_cdemo_sk#124,ws_ship_hdemo_sk#125,ws_ship_addr_sk#126,ws_web_page_sk#127,ws_web_site_sk#128,ws_ship_mode_sk#129,ws_warehouse_sk#130,ws_promo_sk#131,ws_order_number#132L,ws_quantity#133,ws_wholesale_cost#134,ws_list_price#135,ws_sales_price#136,ws_ext_discount_amt#137,ws_ext_sales_price#138,ws_ext_wholesale_cost#139,... 10 more fields] parquet, Statistics(sizeInBytes=216.4 GiB)
                     :     +- Project [i_item_sk#9, i_brand_id#16, i_class_id#18, i_category_id#20], Statistics(sizeInBytes=8.5 MiB, rowCount=3.72E+5)
                     :        +- Filter isnotnull(i_item_sk#9), Statistics(sizeInBytes=151.1 MiB, rowCount=3.72E+5)
                     :           +- Relation tpcds5t.item[i_item_sk#9,i_item_id#10,i_rec_start_date#11,i_rec_end_date#12,i_item_desc#13,i_current_price#14,i_wholesale_cost#15,i_brand_id#16,i_brand#17,i_class_id#18,i_class#19,i_category_id#20,i_category#21,i_manufact_id#22,i_manufact#23,i_size#24,i_formulation#25,i_color#26,i_units#27,i_container#28,i_manager_id#29,i_product_name#30] parquet, Statistics(sizeInBytes=151.1 MiB, rowCount=3.72E+5)
                     +- Project [d_date_sk#54], Statistics(sizeInBytes=8.6 KiB, rowCount=731)
                        +- Filter ((((d_year#60 >= 1999) AND (d_year#60 <= 2001)) AND isnotnull(d_year#60)) AND isnotnull(d_date_sk#54)), Statistics(sizeInBytes=175.6 KiB, rowCount=731)
                           +- Relation tpcds5t.date_dim[d_date_sk#54,d_date_id#55,d_date#56,d_month_seq#57,d_week_seq#58,d_quarter_seq#59,d_year#60,d_dow#61,d_moy#62,d_dom#63,d_qoy#64,d_fy_year#65,d_fy_quarter_seq#66,d_fy_week_seq#67,d_day_name#68,d_quarter_name#69,d_holiday#70,d_weekend#71,d_following_holiday#72,d_first_dom#73,d_last_dom#74,d_same_day_ly#75,d_same_day_lq#76,d_current_day#77,... 4 more fields] parquet, Statistics(sizeInBytes=17.1 MiB, rowCount=7.30E+4)
```

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Unit test.

Closes #31196 from wangyum/SPARK-34129.

Authored-by: Yuming Wang <yumwang@ebay.com>
Signed-off-by: Sean Owen <srowen@gmail.com>
2021-02-21 12:04:49 -06:00
Max Gekk 04c3125dcf [SPARK-34360][SQL] Support truncation of v2 tables
### What changes were proposed in this pull request?
1. Add new interface `TruncatableTable` which represents tables that allow atomic truncation.
2. Implement the new method in `InMemoryTable` and in `InMemoryPartitionTable`.

### Why are the changes needed?
To support `TRUNCATE TABLE` for v2 tables.

### Does this PR introduce _any_ user-facing change?
Should not.

### How was this patch tested?
Added new tests to `TableCatalogSuite` that check truncation of non-partitioned and partitioned tables:
```
$ build/sbt "test:testOnly *TableCatalogSuite"
```

Closes #31475 from MaxGekk/dsv2-truncate-table.

Authored-by: Max Gekk <max.gekk@gmail.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2021-02-21 17:50:38 +09:00
Kent Yao 1fac706db5 [SPARK-34373][SQL] HiveThriftServer2 startWithContext may hang with a race issue
### What changes were proposed in this pull request?

fix a race issue by interrupting the thread

### Why are the changes needed?

```
21:43:26.809 WARN org.apache.thrift.server.TThreadPoolServer: Transport error occurred during acceptance of message.
org.apache.thrift.transport.TTransportException: No underlying server socket.
at org.apache.thrift.transport.TServerSocket.acceptImpl(TServerSocket.java:126)
at org.apache.thrift.transport.TServerSocket.acceptImpl(TServerSocket.java:35)
at org.apache.thrift.transport.TServerTransport.acceException in thread "Thread-15" java.io.IOException: Stream closed
at java.io.BufferedInputStream.getBufIfOpen(BufferedInputStream.java:170)
at java.io.BufferedInputStream.read(BufferedInputStream.java:336)
at java.io.FilterInputStream.read(FilterInputStream.java:107)
at scala.sys.process.BasicIO$.loop$1(BasicIO.scala:238)
at scala.sys.process.BasicIO$.transferFullyImpl(BasicIO.scala:246)
at scala.sys.process.BasicIO$.transferFully(BasicIO.scala:227)
at scala.sys.process.BasicIO$.$anonfun$toStdOut$1(BasicIO.scala:221)
```
When the TServer tries to `serve` after `stop`, it hangs forever with the log above.

### Does this PR introduce _any_ user-facing change?

no
### How was this patch tested?

passing ci

Closes #31479 from yaooqinn/SPARK-34373.

Authored-by: Kent Yao <yao@apache.org>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2021-02-21 17:37:12 +09:00
Yuchen Huo 7de49a8fc0 [SPARK-34481][SQL] Refactor dataframe reader/writer optionsWithPath logic
### What changes were proposed in this pull request?

Extract the optionsWithPath logic into its own function.

### Why are the changes needed?

Reduce the code duplication and improve modularity.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
Just some refactoring. Existing tests.

Closes #31599 from yuchenhuo/SPARK-34481.

Authored-by: Yuchen Huo <yuchen.huo@databricks.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2021-02-20 17:57:43 -08:00
Kousuke Saruta 82b33a3041 [SPARK-34379][SQL] Map JDBC RowID to StringType rather than LongType
### What changes were proposed in this pull request?

This PR fixes an issue where `java.sql.RowId` is mapped to `LongType`; `StringType` should be preferred.

In the current implementation, the JDBC RowID type is mapped to `LongType` except for `OracleDialect`, but there is no guarantee that a RowID can be converted to long.
`java.sql.RowId` declares `toString` and the specification of `java.sql.RowId` says

> _all methods on the RowId interface must be fully implemented if the JDBC driver supports the data type_
(https://docs.oracle.com/javase/8/docs/api/java/sql/RowId.html)

So, we should prefer StringType to LongType.
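For illustration only, here is a hedged sketch of how a custom dialect could express the same preference through the public `JdbcDialect` API; the dialect and its URL prefix are made up and this is not the PR's actual change:

```scala
import java.sql.Types

import org.apache.spark.sql.jdbc.{JdbcDialect, JdbcDialects}
import org.apache.spark.sql.types.{DataType, MetadataBuilder, StringType}

// Hypothetical dialect, only to illustrate mapping JDBC ROWID to StringType.
// The "jdbc:somedb" URL prefix is a placeholder, not a real driver.
object RowIdAsStringDialect extends JdbcDialect {
  override def canHandle(url: String): Boolean = url.startsWith("jdbc:somedb")

  override def getCatalystType(
      sqlType: Int, typeName: String, size: Int, md: MetadataBuilder): Option[DataType] = {
    if (sqlType == Types.ROWID) Some(StringType) else None
  }
}

// Once registered, Spark uses the dialect for matching JDBC URLs.
JdbcDialects.registerDialect(RowIdAsStringDialect)
```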

### Why are the changes needed?

This seems to be a potential bug.

### Does this PR introduce _any_ user-facing change?

Yes. RowID is mapped to StringType rather than LongType.

### How was this patch tested?

A new test was added, and the existing test case `SPARK-32992: map Oracle's ROWID type to StringType` in `OracleIntegrationSuite` passes.

Closes #31491 from sarutak/rowid-type.

Authored-by: Kousuke Saruta <sarutak@oss.nttdata.com>
Signed-off-by: Kousuke Saruta <sarutak@oss.nttdata.com>
2021-02-20 23:45:56 +09:00
Sean Owen f78466dca6 [SPARK-7768][CORE][SQL] Open UserDefinedType as a Developer API
### What changes were proposed in this pull request?

UserDefinedType and UDTRegistration become public Developer APIs, not package-private to Spark.

### Why are the changes needed?

This proposes to simply open up the UserDefinedType class as a developer API. It was public in 1.x, but closed in 2.x for some possible redesign that does not seem to have happened.

Other libraries have managed to define UDTs anyway by inserting shims into the Spark namespace, and this evidently has worked OK. But package isolation in Java 9+ breaks this.

The logic here is mostly: this is de facto a stable API, so can at least be open to developers with the usual caveats about developer APIs.

Open questions:

- Is there in fact some important redesign that's needed before opening it? The comment to this effect is from 2016
- Is this all that needs to be opened up? Like PythonUserDefinedType?
- Should any of this be kept package-private?

This was first proposed in https://github.com/apache/spark/pull/16478, though that was a larger change; the other API issues it was fixing seem to have been addressed already (e.g. no need to return internal Spark types). It was never really reviewed.

My hunch is that there isn't much downside, and some upside, to just opening this as-is now.
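For context, a minimal sketch of the kind of code a downstream library could write once the class is public; the `Point` type, its array encoding, and the UDT itself are invented for illustration and are not part of this PR:

```scala
import org.apache.spark.sql.catalyst.util.{ArrayData, GenericArrayData}
import org.apache.spark.sql.types._

// An invented user type, purely for illustration.
case class Point(x: Double, y: Double)

// Sketch of a UDT subclass: a Point is stored as an array of two doubles.
// This assumes serialize still returns the catalyst representation (ArrayData).
class PointUDT extends UserDefinedType[Point] {
  override def sqlType: DataType = ArrayType(DoubleType, containsNull = false)

  override def serialize(p: Point): Any =
    new GenericArrayData(Array[Any](p.x, p.y))

  override def deserialize(datum: Any): Point = datum match {
    case data: ArrayData => Point(data.getDouble(0), data.getDouble(1))
  }

  override def userClass: Class[Point] = classOf[Point]
}
```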

### Does this PR introduce _any_ user-facing change?

UserDefinedType becomes visible to developers to subclass.

### How was this patch tested?

Existing tests; there is no change to the existing logic.

Closes #31461 from srowen/SPARK-7768.

Authored-by: Sean Owen <srowen@gmail.com>
Signed-off-by: Sean Owen <srowen@gmail.com>
2021-02-20 07:32:06 -06:00
Zhichao Zhang 96bcb4bbe4 [SPARK-34283][SQL] Combines all adjacent 'Union' operators into a single 'Union' when using 'Dataset.union.distinct.union.distinct'
### What changes were proposed in this pull request?

Handled the 'Deduplicate(Keys, Union)' operation in the rule 'CombineUnions' to combine adjacent 'Union' operators into a single 'Union' if necessary when using 'Dataset.union.distinct.union.distinct'.
Currently it only handles the distinct-like 'Deduplicate', where the keys == output, for example:
```
val df1 = Seq((1, 2, 3)).toDF("a", "b", "c")
val df2 = Seq((6, 2, 5)).toDF("a", "b", "c")
val df3 = Seq((2, 4, 3)).toDF("c", "a", "b")
val df4 = Seq((1, 4, 5)).toDF("b", "a", "c")
val unionDF1 = df1.unionByName(df2).dropDuplicates(Seq("b", "a", "c"))
      .unionByName(df3).dropDuplicates().unionByName(df4)
      .dropDuplicates("a")
```
In this case, **all Union operators will be combined**.
but,
```
val df1 = Seq((1, 2, 3)).toDF("a", "b", "c")
val df2 = Seq((6, 2, 5)).toDF("a", "b", "c")
val df3 = Seq((2, 4, 3)).toDF("c", "a", "b")
val df4 = Seq((1, 4, 5)).toDF("b", "a", "c")
val unionDF = df1.unionByName(df2).dropDuplicates(Seq("a"))
      .unionByName(df3).dropDuplicates("c").unionByName(df4)
      .dropDuplicates("b")
```
In this case, **none of the unions will be combined, because Deduplicate.keys does not equal Union.output**.

### Why are the changes needed?

When using 'Dataset.union.distinct.union.distinct', the resulting operator is 'Deduplicate(Keys, Union)'. AstBuilder transforms a SQL-style 'Union' into the operator 'Distinct(Union)', and the rule 'CombineUnions' in the Optimizer only handles 'Distinct(Union)', not 'Deduplicate(Keys, Union)'.
Please see the detailed description in [SPARK-34283](https://issues.apache.org/jira/browse/SPARK-34283).

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

Unit tests.

Closes #31404 from zzcclp/SPARK-34283.

Authored-by: Zhichao Zhang <441586683@qq.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2021-02-19 15:19:13 +00:00
gengjiaan 06df1210d4 [SPARK-28123][SQL] String Functions: support btrim
### What changes were proposed in this pull request?
Spark supports `trim`/`ltrim`/`rtrim` now. The function `btrim` is an alternate form of `TRIM(BOTH <chars> FROM <expr>)`.
`btrim` removes the longest string consisting only of specified characters from the start and end of a string.

Mainstream databases that support this feature are shown below:

**Postgresql**
https://www.postgresql.org/docs/11/functions-binarystring.html

**Vertica**
https://www.vertica.com/docs/9.2.x/HTML/Content/Authoring/SQLReferenceManual/Functions/String/BTRIM.htm?tocpath=SQL%20Reference%20Manual%7CSQL%20Functions%7CString%20Functions%7C_____5

**Redshift**
https://docs.aws.amazon.com/redshift/latest/dg/r_BTRIM.html

**Druid**
https://druid.apache.org/docs/latest/querying/sql.html#string-functions

**Greenplum**
http://docs.greenplum.org/6-8/ref_guide/function-summary.html
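
A minimal spark-shell usage sketch, assuming the one- and two-argument forms described above (the sample strings are illustrative):

```scala
// Two-argument form: strip any of the characters 'S' and 'L' from both ends.
spark.sql("SELECT btrim('SSparkSQLS', 'SL')").show()   // expected result: parkSQ

// One-argument form: trims spaces from both ends.
spark.sql("SELECT btrim('   SparkSQL   ')").show()     // expected result: SparkSQL
```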

### Why are the changes needed?
`btrim` is widely supported in other databases and convenient for stripping arbitrary characters from both ends of a string.

### Does this PR introduce _any_ user-facing change?
Yes. `btrim` is a new function.

### How was this patch tested?
Jenkins test.

Closes #31390 from beliefer/SPARK-28123-support-btrim.

Authored-by: gengjiaan <gengjiaan@360.cn>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2021-02-19 13:28:49 +00:00
Peter Toth 27abb6ab56 [SPARK-34421][SQL] Resolve temporary functions and views in views with CTEs
### What changes were proposed in this pull request?
This PR:
- Fixes a bug that prevents analysis of:
  ```
  CREATE TEMPORARY VIEW temp_view AS WITH cte AS (SELECT temp_func(0)) SELECT * FROM cte;
  SELECT * FROM temp_view
  ```
  by throwing:
  ```
  Undefined function: 'temp_func'. This function is neither a registered temporary function nor a permanent function registered in the database 'default'.
  ```
- and doesn't report analysis error when it should:
  ```
  CREATE TEMPORARY VIEW temp_view AS SELECT 0;
  CREATE VIEW view_on_temp_view AS WITH cte AS (SELECT * FROM temp_view) SELECT * FROM cte
  ```
  by properly collecting temporary objects from VIEW definitions with CTEs.

- Minor refactor to make the affected code more readable.

### Why are the changes needed?
To fix a bug introduced with https://github.com/apache/spark/pull/30567

### Does this PR introduce _any_ user-facing change?
Yes, the query works again.

### How was this patch tested?
Added new UT + existing ones.

Closes #31550 from peter-toth/SPARK-34421-temp-functions-in-views-with-cte.

Authored-by: Peter Toth <peter.toth@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2021-02-19 18:14:49 +08:00
Max Gekk b26e7b510b [SPARK-34314][SQL] Fix partitions schema inference
### What changes were proposed in this pull request?
Infer the partition schema by:
1. inferring the common type over all partition part values, and
2. casting those values to the common type

Before the changes:
1. Spark creates a literal with the most appropriate type for each concrete partition value, i.e. `part0=-0` -> `Literal(0, IntegerType)`, `part0=abc` -> `Literal(UTF8String.fromString("abc"), StringType)`.
2. Finds the common type for all literals of a partition column. For the example above, it is `StringType`.
3. Casts those literals to the desired type:
  - `Cast(Literal(0, IntegerType), StringType)` -> `UTF8String.fromString("0")`
  - `Cast(Literal(UTF8String.fromString("abc"), StringType), StringType)` -> `UTF8String.fromString("abc")`

In the example, we get a partition part value "0" which is different from the original one, "-0". Spark shouldn't modify partition part values of the string type because that can influence query results.
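
For reference, a spark-shell sketch that reproduces the layout above (the path and column names are illustrative):

```scala
// Write two partitions, "AA" and "-0", matching the directory listing below.
val path = "/tmp/spark-34314-demo"
Seq((0, "AA"), (1, "-0")).toDF("id", "part")
  .write.mode("overwrite").partitionBy("part").parquet(path)

// Read it back without a user-specified schema, so partition values are inferred.
spark.read.parquet(path).orderBy("id").show(false)
// Before the fix, the "-0" partition value came back as "0";
// with the fix, the original string "-0" is preserved.
```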

Closes #31423

### Why are the changes needed?
The changes fix the bug demonstrated by the example:
1. There are partitioned parquet files (file format doesn't matter):
```
/private/var/folders/p3/dfs6mf655d7fnjrsjvldh0tc0000gn/T/spark-e09eae99-7ecf-4ab2-b99b-f63f8dea658d
├── _SUCCESS
├── part=-0
│   └── part-00001-02144398-2896-4d21-9628-a8743d098cb4.c000.snappy.parquet
└── part=AA
    └── part-00000-02144398-2896-4d21-9628-a8743d098cb4.c000.snappy.parquet
```
placed into two partitions, "AA" and **"-0"**.

2. When reading them w/o specified schema:
```
val df = spark.read.parquet(path)
df.printSchema()
root
 |-- id: integer (nullable = true)
 |-- part: string (nullable = true)
```
the inferred type of the partition column `part` is the **string** type.
3. The expected values in the column `part` are "AA" and "-0" but we get:
```
df.show(false)
+---+----+
|id |part|
+---+----+
|0  |AA  |
|1  |0   |
+---+----+
```
So, Spark returns **"0"** instead of **"-0"**.

### Does this PR introduce _any_ user-facing change?
This PR can change query results.

### How was this patch tested?
By running new test and existing test suites:
```
$ build/sbt "test:testOnly *FileIndexSuite"
$ build/sbt "test:testOnly *ParquetV1PartitionDiscoverySuite"
$ build/sbt "test:testOnly *ParquetV2PartitionDiscoverySuite"
```

Closes #31549 from MaxGekk/fix-partition-file-index-2.

Authored-by: Max Gekk <max.gekk@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2021-02-19 08:36:13 +00:00
yzjg 26548edfa2 [MINOR][SQL][DOCS] Fix the comments in the example at window function
### What changes were proposed in this pull request?

The window function in `functions.scala` has a comment error in the field name. The column should be `time` per `timestamp: TimestampType`.

### Why are the changes needed?

To deliver the correct documentation and examples.

### Does this PR introduce _any_ user-facing change?

Yes, it fixes the user-facing docs.

### How was this patch tested?

CI builds in this PR should test the documentation build.

Closes #31582 from yzjg/yzjg-patch-1.

Authored-by: yzjg <785246661@qq.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2021-02-19 10:45:21 +09:00
Max Gekk cad469d47a [SPARK-34465][SQL] Rename v2 alter table exec nodes
### What changes were proposed in this pull request?
Rename the following v2 exec nodes:
- AlterTableAddPartitionExec -> AddPartitionExec
- AlterTableRenamePartitionExec -> RenamePartitionExec
- AlterTableDropPartitionExec -> DropPartitionExec

### Why are the changes needed?
- To be consistent with the v2 exec node added before: `ALTER TABLE .. RENAME TO` -> `RenameTableExec`.
- For simplicity and readability of the execution plans.

### Does this PR introduce _any_ user-facing change?
Should not since this is internal API.

### How was this patch tested?
By running the existing test suites:
```
$ build/sbt -Phive-2.3 -Phive-thriftserver "test:testOnly *AlterTableAddPartitionSuite"
$ build/sbt -Phive-2.3 -Phive-thriftserver "test:testOnly *AlterTableDropPartitionSuite"
$ build/sbt -Phive-2.3 -Phive-thriftserver "test:testOnly *AlterTableRenamePartitionSuite"
```

Closes #31584 from MaxGekk/rename-alter-table-exec-nodes.

Authored-by: Max Gekk <max.gekk@gmail.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2021-02-18 14:33:26 -08:00
Max Gekk 8f7ec4b28e [SPARK-34454][SQL] Mark legacy SQL configs as internal
### What changes were proposed in this pull request?
1. Make the following SQL configs as internal:
    - spark.sql.legacy.allowHashOnMapType
    - spark.sql.legacy.sessionInitWithConfigDefaults
2. Add a test to check that all SQL configs from the `legacy` namespace are marked as internal configs.

### Why are the changes needed?
Assuming that legacy SQL configs shouldn't be set by users in common cases. The purpose of such configs is to allow switching to the old behavior in corner cases. So, the configs should be marked as internal.

### Does this PR introduce _any_ user-facing change?
Should not.

### How was this patch tested?
By running new test:
```
$ build/sbt "test:testOnly *SQLConfSuite"
```

Closes #31577 from MaxGekk/mark-legacy-configs-as-internal.

Authored-by: Max Gekk <max.gekk@gmail.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2021-02-18 10:39:51 -08:00
Chao Sun 27873280ff [SPARK-32703][SQL] Replace deprecated API calls from SpecificParquetRecordReaderBase
### What changes were proposed in this pull request?

Currently in `SpecificParquetRecordReaderBase` we use deprecated Parquet APIs in a few places, such as `readFooter`, `ParquetInputSplit`, `new ParquetFileReader`, `filterRowGroups`, etc. This PR replaces them with the newer APIs (see the sketch after this list). Specifically, it:
- Replaces `ParquetInputSplit` with `FileSplit`. We never use specific things in the former such as `rowGroupOffsets`, so the swap is pretty simple.
- Removes `readFooter` calls by using `ParquetFileReader.open`
- Replaces the deprecated `ParquetFileReader` ctor with the newer API which takes `ParquetReadOptions`.
- Removes the unnecessary handling of the case when `rowGroupOffsets` is not null. It seems this never happens.
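
For reference, a rough sketch of the newer Parquet API shape (the file path and configuration wiring are placeholders; this is not the code in `SpecificParquetRecordReaderBase` itself):

```scala
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.Path
import org.apache.parquet.HadoopReadOptions
import org.apache.parquet.hadoop.ParquetFileReader
import org.apache.parquet.hadoop.util.HadoopInputFile

val conf = new Configuration()
val inputFile = HadoopInputFile.fromPath(new Path("/tmp/data.parquet"), conf)
val options = HadoopReadOptions.builder(conf).build()

// ParquetFileReader.open with ParquetReadOptions replaces the deprecated
// readFooter call and the deprecated ParquetFileReader constructor.
val reader = ParquetFileReader.open(inputFile, options)
try {
  val footer = reader.getFooter
  println(s"row groups: ${footer.getBlocks.size()}")
} finally {
  reader.close()
}
```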

### Why are the changes needed?

The aforementioned APIs are deprecated and are going to be removed at some point in the future. This change ensures better supportability.

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

This is a cleanup and relies on existing tests on the relevant code paths.

Closes #29542 from sunchao/SPARK-32703.

Authored-by: Chao Sun <sunchao@apache.org>
Signed-off-by: Sean Owen <srowen@gmail.com>
2021-02-18 10:18:14 -06:00
Steve Loughran ff5115c3ac [SPARK-33739][SQL] Jobs committed through the S3A Magic committer don't track bytes
This changes BasicWriteStatsTracker to probe for a custom XAttr if the size of
the generated file is 0 bytes; if the attribute is found and parseable, its value
is used as the declared length of the output.

The matching Hadoop patch in HADOOP-17414:

* Returns all S3 object headers as XAttr attributes prefixed "header."
* Sets the custom header x-hadoop-s3a-magic-data-length to the length of
  the data in the marker file.

As a result, Spark job tracking will correctly report the amount of data uploaded
and yet to materialize.

### Why are the changes needed?

Now that S3 is consistent, it's a lot easier to use the S3A "magic" committer
which redirects a file written to `dest/__magic/job_0011/task_1245/__base/year=2020/output.avro`
to its final destination `dest/year=2020/output.avro` , adding a zero byte marker file at
the end and a json file `dest/__magic/job_0011/task_1245/__base/year=2020/output.avro.pending`
containing all the information for the job committer to complete the upload.

But: the write tracker statistics don't show progress, as they measure the length of the
created file, find the zero-byte marker file, and report 0 bytes.
By probing for a specific HTTP header in the marker file and parsing that if
retrieved, the real progress can be reported.

There's a matching change in Hadoop [https://github.com/apache/hadoop/pull/2530](https://github.com/apache/hadoop/pull/2530)
which adds getXAttr API support to the S3A connector and returns the headers; the magic
committer adds the relevant attributes.

If the FS being probed doesn't support the XAttr API, the header is missing,
or the value is not a positive long, then a size of 0 is returned.
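
For illustration, a hedged Scala sketch of that probing-and-fallback idea (not the tracker's actual code; the attribute name assumes the "header." prefix mentioned above):

```scala
import java.nio.charset.StandardCharsets

import scala.util.Try

import org.apache.hadoop.fs.{FileSystem, Path}

// Return the declared length recorded on the zero-byte marker file, or 0.
def declaredLength(fs: FileSystem, marker: Path): Long = {
  val attr = "header.x-hadoop-s3a-magic-data-length"
  Try(Option(fs.getXAttr(marker, attr))).toOption.flatten   // no XAttr support / header missing
    .map(bytes => new String(bytes, StandardCharsets.UTF_8))
    .flatMap(s => Try(s.trim.toLong).toOption)               // value not parseable
    .filter(_ > 0)                                            // value not a positive long
    .getOrElse(0L)
}
```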

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

New tests in BasicWriteTaskStatsTrackerSuite which use a filter FS to
implement getXAttr on top of LocalFS; this is used to explore the set of
options:
* no XAttr API implementation (existing tests; what callers would see with
  most filesystems)
* no attribute found (HDFS, ABFS without the attribute)
* invalid data of different forms

All of these return Some(0) as file length.

The Hadoop PR verifies XAttr implementation in S3A and that
the commit protocol attaches the header to the files.

External downstream testing has done the full hadoop+spark end
to end operation, with manual review of logs to verify that the
data was successfully collected from the attribute.

Closes #30714 from steveloughran/cdpd/SPARK-33739-magic-commit-tracking-master.

Authored-by: Steve Loughran <stevel@cloudera.com>
Signed-off-by: Thomas Graves <tgraves@apache.org>
2021-02-18 08:43:18 -06:00
gengjiaan edccf96cad [SPARK-34394][SQL] Unify output of SHOW FUNCTIONS and pass output attributes properly
### What changes were proposed in this pull request?
The current implementation of some DDL commands does not unify the output, and does not pass the output properly to the physical command.
For example, the output attributes of `ShowFunctions` are not passed to `ShowFunctionsCommand` properly.

Following the query plan, this PR passes the output attributes from `ShowFunctions` to `ShowFunctionsCommand`.

### Why are the changes needed?
Passing the output attributes keeps the expr IDs unchanged, which avoids bugs when we apply more operators on top of the command output DataFrame.

### Does this PR introduce _any_ user-facing change?
'No'.

### How was this patch tested?
Jenkins test.

Closes #31519 from beliefer/SPARK-34394.

Authored-by: gengjiaan <gengjiaan@360.cn>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2021-02-18 12:50:50 +00:00
gengjiaan c925e4d0fd [SPARK-34393][SQL] Unify output of SHOW VIEWS and pass output attributes properly
### What changes were proposed in this pull request?
The current implementation of some DDL commands does not unify the output, and does not pass the output properly to the physical command.
For example, the output attributes of `ShowViews` are not passed to `ShowViewsCommand` properly.

Following the query plan, this PR passes the output attributes from `ShowViews` to `ShowViewsCommand`.

### Why are the changes needed?
Passing the output attributes keeps the expr IDs unchanged, which avoids bugs when we apply more operators on top of the command output DataFrame.

### Does this PR introduce _any_ user-facing change?
'No'.

### How was this patch tested?
Jenkins test.

Closes #31508 from beliefer/SPARK-34393.

Authored-by: gengjiaan <gengjiaan@360.cn>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2021-02-18 12:48:39 +00:00
Max Gekk 7b549c3e53 [SPARK-34455][SQL] Deprecate spark.sql.legacy.replaceDatabricksSparkAvro.enabled
### What changes were proposed in this pull request?
1. Put the SQL config `spark.sql.legacy.replaceDatabricksSparkAvro.enabled` into the list of deprecated configs, `deprecatedSQLConfigs`
2. Update docs for the Avro datasource
<img width="982" alt="Screenshot 2021-02-17 at 21 04 26" src="https://user-images.githubusercontent.com/1580697/108249890-abed7180-7166-11eb-8cb7-0c246d2a34fc.png">

### Why are the changes needed?
The config has existed for long enough. We can deprecate it and recommend that users use `.format("avro")` instead.

### Does this PR introduce _any_ user-facing change?
Should not, except for the warning recommending the use of the `avro` format.

### How was this patch tested?
1. By generating docs via:
```
$ SKIP_API=1 SKIP_SCALADOC=1 SKIP_PYTHONDOC=1 SKIP_RDOC=1 jekyll serve --watch
```
2. Manually checking the warning:
```
scala> spark.conf.set("spark.sql.legacy.replaceDatabricksSparkAvro.enabled", false)
21/02/17 21:20:18 WARN SQLConf: The SQL config 'spark.sql.legacy.replaceDatabricksSparkAvro.enabled' has been deprecated in Spark v3.2 and may be removed in the future. Use `.format("avro")` in `DataFrameWriter` or `DataFrameReader` instead.
```

Closes #31578 from MaxGekk/deprecate-replaceDatabricksSparkAvro.

Authored-by: Max Gekk <max.gekk@gmail.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2021-02-17 21:54:20 -08:00
Anton Okolnychyi 1ad343238c [SPARK-33736][SQL] Handle MERGE in ReplaceNullWithFalseInPredicate
### What changes were proposed in this pull request?

This PR handles merge operations in `ReplaceNullWithFalseInPredicate`.

### Why are the changes needed?

These changes are needed to match what we already do for delete and update operations.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

This PR extends existing tests to cover merge operations.

Closes #31579 from aokolnychyi/spark-33736.

Authored-by: Anton Okolnychyi <aokolnychyi@apple.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2021-02-17 17:27:21 -08:00
Anton Okolnychyi 44a9aed0d7 [SPARK-34456][SQL] Remove unused write options from BatchWriteHelper
### What changes were proposed in this pull request?

This PR removes dead code from `BatchWriteHelper` after SPARK-33808.

### Why are the changes needed?

These changes simplify `BatchWriteHelper` by removing write options that are no longer needed as we build `Write` earlier.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Existing tests.

Closes #31581 from aokolnychyi/simplify-batch-write-helper.

Authored-by: Anton Okolnychyi <aokolnychyi@apple.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2021-02-17 17:25:26 -08:00
Max Gekk 5957bc18a1 [SPARK-34451][SQL] Add alternatives for datetime rebasing SQL configs and deprecate legacy configs
### What changes were proposed in this pull request?
Move the datetime rebase SQL configs from the `legacy` namespace by:
1. Renaming the existing rebase configs like `spark.sql.legacy.parquet.datetimeRebaseModeInRead` -> `spark.sql.parquet.datetimeRebaseModeInRead`.
2. Adding the legacy configs as alternatives.
3. Deprecating the legacy rebase configs.

### Why are the changes needed?
The rebasing SQL configs like `spark.sql.legacy.parquet.datetimeRebaseModeInRead` can be used not only for migration from previous Spark versions but also to read/write datetime columns saved by other systems/frameworks/libs. So, the configs shouldn't be considered as legacy configs.

### Does this PR introduce _any_ user-facing change?
Should not. Users will see a warning if they still use one of the legacy configs.

### How was this patch tested?
1. Manually checking new configs:
```scala
scala> spark.conf.get("spark.sql.parquet.datetimeRebaseModeInRead")
res0: String = EXCEPTION

scala> spark.conf.set("spark.sql.legacy.parquet.datetimeRebaseModeInRead", "LEGACY")
21/02/17 14:57:10 WARN SQLConf: The SQL config 'spark.sql.legacy.parquet.datetimeRebaseModeInRead' has been deprecated in Spark v3.2 and may be removed in the future. Use 'spark.sql.parquet.datetimeRebaseModeInRead' instead.

scala> spark.conf.get("spark.sql.parquet.datetimeRebaseModeInRead")
res2: String = LEGACY
```
2. By running a datetime rebasing test suite:
```
$ build/sbt "test:testOnly *ParquetRebaseDatetimeV1Suite"
```

Closes #31576 from MaxGekk/rebase-confs-alternatives.

Authored-by: Max Gekk <max.gekk@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2021-02-17 14:04:47 +00:00
Kousuke Saruta dd6383f0a3 [SPARK-34333][SQL] Fix PostgresDialect to handle money types properly
### What changes were proposed in this pull request?

This PR changes the type mapping for `money` and `money[]`  types for PostgreSQL.
Currently, Spark tries to convert those types to `DoubleType` and `ArrayType` of `double`, respectively.
But the JDBC driver does not seem able to handle those types properly.

https://github.com/pgjdbc/pgjdbc/issues/100
https://github.com/pgjdbc/pgjdbc/issues/1405

Due to these issues, we can get errors like the following.

money type.
```
[info]   org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 0.0 failed 1 times, most recent failure: Lost task 0.0 in stage 0.0 (TID 0) (192.168.1.204 executor driver): org.postgresql.util.PSQLException: Bad value for type double : 1,000.00
[info] 	at org.postgresql.jdbc.PgResultSet.toDouble(PgResultSet.java:3104)
[info] 	at org.postgresql.jdbc.PgResultSet.getDouble(PgResultSet.java:2432)
[info] 	at org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$.$anonfun$makeGetter$5(JdbcUtils.scala:418)
```

money[] type.
```
[info]   org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 0.0 failed 1 times, most recent failure: Lost task 0.0 in stage 0.0 (TID 0) (192.168.1.204 executor driver): org.postgresql.util.PSQLException: Bad value for type double : $2,000.00
[info] 	at org.postgresql.jdbc.PgResultSet.toDouble(PgResultSet.java:3104)
[info] 	at org.postgresql.jdbc.ArrayDecoding$5.parseValue(ArrayDecoding.java:235)
[info] 	at org.postgresql.jdbc.ArrayDecoding$AbstractObjectStringArrayDecoder.populateFromString(ArrayDecoding.java:122)
[info] 	at org.postgresql.jdbc.ArrayDecoding.readStringArray(ArrayDecoding.java:764)
[info] 	at org.postgresql.jdbc.PgArray.buildArray(PgArray.java:310)
[info] 	at org.postgresql.jdbc.PgArray.getArrayImpl(PgArray.java:171)
[info] 	at org.postgresql.jdbc.PgArray.getArray(PgArray.java:111)
```

For the money type, a known workaround is to treat it as a string, so this PR does that.
For money[], however, there is no reasonable workaround, so this PR removes the support.
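
For illustration, a hedged spark-shell sketch of reading such a table after this change (the connection settings and the `amount` column are placeholders):

```scala
import org.apache.spark.sql.functions.{col, regexp_replace}

// Hypothetical PostgreSQL connection settings and an assumed `amount` money column.
val payments = spark.read.format("jdbc")
  .option("url", "jdbc:postgresql://localhost:5432/testdb")
  .option("dbtable", "payments")
  .option("user", "postgres")
  .option("password", "secret")
  .load()

payments.printSchema()   // the money column is now reported as string

// If a numeric value is needed, strip the currency formatting explicitly.
val parsed = payments.withColumn(
  "amount_numeric",
  regexp_replace(col("amount"), "[$,]", "").cast("decimal(18,2)"))
```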

### Why are the changes needed?

This is a bug.

### Does this PR introduce _any_ user-facing change?

Yes. Once this PR is merged, the money type is mapped to `StringType` rather than `DoubleType`, and the support for money[] is dropped.
For the money type, a value less than one thousand, `$100.00` for instance, works without this change, so I also updated the migration guide because it is a behavior change for such small values.
On the other hand, money[] seems not to work with any value, but it is mentioned in the migration guide just in case.

### How was this patch tested?

New test.

Closes #31442 from sarutak/fix-for-money-type.

Authored-by: Kousuke Saruta <sarutak@oss.nttdata.com>
Signed-off-by: Kousuke Saruta <sarutak@oss.nttdata.com>
2021-02-17 10:50:06 +09:00
Max Gekk 1a11fe5501 [SPARK-33210][SQL][DOCS][FOLLOWUP] Fix descriptions of the SQL configs for the parquet INT96 rebase modes
### What changes were proposed in this pull request?
Fix descriptions of the SQL configs `spark.sql.legacy.parquet.int96RebaseModeInRead` and `spark.sql.legacy.parquet.int96RebaseModeInWrite`, and mention `EXCEPTION` as the default value.

### Why are the changes needed?
This fixes incorrect descriptions that can mislead users.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
By running `./dev/scalastyle`.

Closes #31557 from MaxGekk/int96-exception-by-default-followup.

Authored-by: Max Gekk <max.gekk@gmail.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2021-02-16 11:55:53 +09:00
Max Gekk 03161055de [SPARK-34424][SQL][TESTS] Fix failures of HiveOrcHadoopFsRelationSuite
### What changes were proposed in this pull request?
Modify `RandomDataGenerator.forType()` to allow generation of dates/timestamps that are valid in both the Julian and Proleptic Gregorian calendars. Currently, the function can produce a date (for example `1582-10-06`) which is valid in the Proleptic Gregorian calendar but cannot be saved to ORC files as-is, since the ORC format (the ORC libs, in fact) assumes the Julian calendar. So, Spark shifts `1582-10-06` to the next valid date, `1582-10-15`, while saving it to ORC files. As a consequence, the test fails because it compares the original date `1582-10-06` with the date `1582-10-15` loaded back from the ORC files.

In this PR, I propose to generate dates/timestamps that are valid in both calendars for the ORC datasource until SPARK-34440 is resolved.
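
For intuition, a hedged sketch of the constraint (the actual generator change may differ): a date survives the ORC round-trip unchanged only if it does not fall into the ten days that exist in the Proleptic Gregorian calendar but not in the hybrid calendar.

```scala
import java.time.LocalDate

// 1582-10-05 through 1582-10-14 exist only in the Proleptic Gregorian calendar;
// the hybrid Julian/Gregorian calendar jumps from 1582-10-04 to 1582-10-15.
def validInBothCalendars(d: LocalDate): Boolean = {
  val gapStart = LocalDate.of(1582, 10, 5)
  val gapEnd = LocalDate.of(1582, 10, 14)
  d.isBefore(gapStart) || d.isAfter(gapEnd)
}

validInBothCalendars(LocalDate.of(1582, 10, 6))   // false: ORC shifts it on write
validInBothCalendars(LocalDate.of(1582, 10, 15))  // true
```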

### Why are the changes needed?
The changes fix failures of `HiveOrcHadoopFsRelationSuite`. For instance, the test "test all data types" fails with the seed **610710213676**:
```
== Results ==
!== Correct Answer - 20 ==    == Spark Answer - 20 ==
 struct<index:int,col:date>   struct<index:int,col:date>
...
![9,1582-10-06]               [9,1582-10-15]
```

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
By running the modified test suite:
```
$ build/sbt -Phive -Phive-thriftserver "test:testOnly *HiveOrcHadoopFsRelationSuite"
```

Closes #31552 from MaxGekk/fix-HiveOrcHadoopFsRelationSuite.

Authored-by: Max Gekk <max.gekk@gmail.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2021-02-16 11:53:26 +09:00
Max Gekk aca6db1868 [SPARK-34434][SQL] Mention DS rebase options in SparkUpgradeException
### What changes were proposed in this pull request?
Mention the DS options introduced by https://github.com/apache/spark/pull/31529 and by https://github.com/apache/spark/pull/31489 in `SparkUpgradeException`.

### Why are the changes needed?
To improve the user experience with Spark SQL. Before the changes, the error message recommends setting SQL configs, but the configs cannot help in some situations (see the PRs for more details).

### Does this PR introduce _any_ user-facing change?
Yes. After the changes, the error message is:

_org.apache.spark.SparkUpgradeException: You may get a different result due to the upgrading of Spark 3.0: reading dates before 1582-10-15 or timestamps before 1900-01-01T00:00:00Z from Parquet files can be ambiguous, as the files may be written by Spark 2.x or legacy versions of Hive, which uses a legacy hybrid calendar that is different from Spark 3.0+'s Proleptic Gregorian calendar. See more details in SPARK-31404. You can set the SQL config 'spark.sql.legacy.parquet.datetimeRebaseModeInRead' or the datasource option 'datetimeRebaseMode' to 'LEGACY' to rebase the datetime values w.r.t. the calendar difference during reading. To read the datetime values as it is, set the SQL config 'spark.sql.legacy.parquet.datetimeRebaseModeInRead' or the datasource option 'datetimeRebaseMode' to 'CORRECTED'._
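
For reference, a short spark-shell sketch of the datasource option route mentioned in the message (the path is a placeholder):

```scala
// Rebase datetime values w.r.t. the calendar difference while reading.
val legacyRead = spark.read
  .option("datetimeRebaseMode", "LEGACY")
  .parquet("/path/to/old-parquet")

// Or read the stored datetime values as-is.
val correctedRead = spark.read
  .option("datetimeRebaseMode", "CORRECTED")
  .parquet("/path/to/old-parquet")
```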

### How was this patch tested?
1. By checking coding style: `./dev/scalastyle`
2. By running the related test suite:
```
$ build/sbt -Phive-2.3 -Phive-thriftserver "test:testOnly *ParquetRebaseDatetimeV1Suite"
```

Closes #31562 from MaxGekk/rebase-upgrade-exception.

Authored-by: Max Gekk <max.gekk@gmail.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2021-02-14 17:42:15 -08:00
Terry Kim 9a566f83a0 [SPARK-34380][SQL] Support ifExists for ALTER TABLE ... UNSET TBLPROPERTIES for v2 command
### What changes were proposed in this pull request?

This PR proposes to support `ifExists` flag for v2 `ALTER TABLE ... UNSET TBLPROPERTIES` command. Currently, the flag is not respected and the command behaves as `ifExists = true` where the command always succeeds when the properties do not exist.

### Why are the changes needed?

To support `ifExists` flag and align with v1 command behavior.

### Does this PR introduce _any_ user-facing change?

Yes, now if the property does not exist and `IF EXISTS` is not specified, the command will fail:
```
ALTER TABLE t UNSET TBLPROPERTIES ('unknown') // Fails with "Attempted to unset non-existent property 'unknown'"
ALTER TABLE t UNSET TBLPROPERTIES IF EXISTS ('unknown') // OK
```

### How was this patch tested?

Added new test

Closes #31494 from imback82/AlterTableUnsetPropertiesIfExists.

Authored-by: Terry Kim <yuminkim@gmail.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2021-02-12 17:42:43 -08:00
Max Gekk 91be583fb8 [SPARK-34418][SQL][TESTS] Check partitions existence after v1 TRUNCATE TABLE
### What changes were proposed in this pull request?
Add a test and modify an existing one to check that partitions still exist after v1 `TRUNCATE TABLE`.

### Why are the changes needed?
To improve test coverage.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
By running new test:
```
$ build/sbt -Phive -Phive-thriftserver "test:testOnly *TruncateTableSuite"
```

Closes #31544 from MaxGekk/test-truncate-partitioned-table.

Authored-by: Max Gekk <max.gekk@gmail.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2021-02-12 15:57:42 -08:00
Liang-Chi Hsieh e0053853c9 [SPARK-34420][SQL] Throw exception if non-streaming Deduplicate is not replaced by aggregate
### What changes were proposed in this pull request?

This patch proposes to throw exception if non-streaming `Deduplicate` is not replaced by aggregate in query planner.

### Why are the changes needed?

We replace some operations in the query optimizer. For those, we throw exceptions in the query planner if the corresponding logical nodes are not replaced. But `Deduplicate` is missing, which opens a possible hole. For code consistency and to prevent a possible unexpected query planning error, we should add a similar exception case to the query planner.

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

Unit test.

Closes #31547 from viirya/minor-deduplicate.

Authored-by: Liang-Chi Hsieh <viirya@gmail.com>
Signed-off-by: Liang-Chi Hsieh <viirya@gmail.com>
2021-02-10 22:40:51 -08:00
Chao Sun cd38287ce2 [SPARK-34419][SQL] Move PartitionTransforms.scala to scala directory
### What changes were proposed in this pull request?

Move `PartitionTransforms.scala` from `sql/catalyst/src/main/java/org/apache/spark/sql/catalyst/expressions` to `sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions`.

### Why are the changes needed?

We should put java/scala files to their corresponding directories.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

N/A

Closes #31546 from sunchao/SPARK-34419.

Authored-by: Chao Sun <sunchao@apple.com>
Signed-off-by: Liang-Chi Hsieh <viirya@gmail.com>
2021-02-10 17:08:50 -08:00
David Li 9b875ceada [SPARK-32953][PYTHON][SQL] Add Arrow self_destruct support to toPandas
### What changes were proposed in this pull request?

Creating a Pandas dataframe via Apache Arrow currently can use twice as much memory as the final result, because during the conversion, both Pandas and Arrow retain a copy of the data. Arrow has a "self-destruct" mode now (Arrow >= 0.16) to avoid this, by freeing each column after conversion. This PR integrates support for this in toPandas, handling a couple of edge cases:

self_destruct has no effect unless the memory is allocated appropriately, which is handled in the Arrow serializer here. Essentially, the issue is that self_destruct frees memory column-wise, but Arrow record batches are oriented row-wise:

```
Record batch 0: allocation 0: column 0 chunk 0, column 1 chunk 0, ...
Record batch 1: allocation 1: column 0 chunk 1, column 1 chunk 1, ...
```

In this scenario, Arrow will drop references to all of column 0's chunks, but no memory will actually be freed, as the chunks were just slices of an underlying allocation. The PR copies each column into its own allocation so that memory is instead arranged as so:

```
Record batch 0: allocation 0 column 0 chunk 0, allocation 1 column 1 chunk 0, ...
Record batch 1: allocation 2 column 0 chunk 1, allocation 3 column 1 chunk 1, ...
```

The optimization is disabled by default, and can be enabled with the Spark SQL conf "spark.sql.execution.arrow.pyspark.selfDestruct.enabled" set to "true". We can't always apply this optimization because it's more likely to generate a dataframe with immutable buffers, which Pandas doesn't always handle well, and because it is slower overall (since it only converts one column at a time instead of in parallel).

### Why are the changes needed?

This lets us load larger datasets - in particular, with N bytes of memory, before we could never load a dataset bigger than N/2 bytes; now the overhead is more like N/1.25 or so.

### Does this PR introduce _any_ user-facing change?

Yes - it adds a new SQL conf "spark.sql.execution.arrow.pyspark.selfDestruct.enabled"

### How was this patch tested?

See the [mailing list](http://apache-spark-developers-list.1001551.n3.nabble.com/DISCUSS-Reducing-memory-usage-of-toPandas-with-Arrow-quot-self-destruct-quot-option-td30149.html) - it was tested with Python memory_profiler. Unit tests added to check memory within certain bounds and correctness with the option enabled.

Closes #29818 from lidavidm/spark-32953.

Authored-by: David Li <li.davidm96@gmail.com>
Signed-off-by: Bryan Cutler <cutlerb@gmail.com>
2021-02-10 09:58:46 -08:00
gengjiaan 32a523b56f [SPARK-34234][SQL] Remove TreeNodeException that didn't work
### What changes were proposed in this pull request?
`TreeNodeException` makes the error message unclear and doesn't work well.
Because `TreeNodeException` is redundant, we can remove it.

Here is a case that shows the problem:
```
val df = Seq(("1", 1), ("1", 2), ("2", 3), ("2", 4)).toDF("x", "y")
// `c` is a column expression defined in the surrounding test code
val hashAggDF = df.groupBy("x").agg(c, sum("y"))
```
The above code will use `HashAggregateExec`. In order to ensure that an exception will be thrown when executing `HashAggregateExec`, I added `throw new RuntimeException("calculate error")` into 72b7f8abfb/sql/core/src/main/scala/org/apache/spark/sql/execution/aggregate/HashAggregateExec.scala (L85)

So, if the above code is executed, `RuntimeException("calculate error")` will be thrown.
Before this PR, the error is:
```
execute, tree:
ShuffleQueryStage 0
+- Exchange hashpartitioning(x#105, 5), ENSURE_REQUIREMENTS, [id=#168]
   +- HashAggregate(keys=[x#105], functions=[partial_sum(y#106)], output=[x#105, sum#118L])
      +- Project [_1#100 AS x#105, _2#101 AS y#106]
         +- LocalTableScan [_1#100, _2#101]

org.apache.spark.sql.catalyst.errors.package$TreeNodeException: execute, tree:
ShuffleQueryStage 0
+- Exchange hashpartitioning(x#105, 5), ENSURE_REQUIREMENTS, [id=#168]
   +- HashAggregate(keys=[x#105], functions=[partial_sum(y#106)], output=[x#105, sum#118L])
      +- Project [_1#100 AS x#105, _2#101 AS y#106]
         +- LocalTableScan [_1#100, _2#101]

	at org.apache.spark.sql.catalyst.errors.package$.attachTree(package.scala:56)
	at org.apache.spark.sql.execution.adaptive.ShuffleQueryStageExec.doMaterialize(QueryStageExec.scala:163)
	at org.apache.spark.sql.execution.adaptive.QueryStageExec.$anonfun$materialize$1(QueryStageExec.scala:81)
	at org.apache.spark.sql.execution.SparkPlan.$anonfun$executeQuery$1(SparkPlan.scala:218)
	at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
	at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:215)
	at org.apache.spark.sql.execution.adaptive.QueryStageExec.materialize(QueryStageExec.scala:79)
	at org.apache.spark.sql.execution.adaptive.AdaptiveSparkPlanExec.$anonfun$getFinalPhysicalPlan$5(AdaptiveSparkPlanExec.scala:207)
	at org.apache.spark.sql.execution.adaptive.AdaptiveSparkPlanExec.$anonfun$getFinalPhysicalPlan$5$adapted(AdaptiveSparkPlanExec.scala:205)
	at scala.collection.immutable.List.foreach(List.scala:392)
	at org.apache.spark.sql.execution.adaptive.AdaptiveSparkPlanExec.$anonfun$getFinalPhysicalPlan$1(AdaptiveSparkPlanExec.scala:205)
	at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:772)
	at org.apache.spark.sql.execution.adaptive.AdaptiveSparkPlanExec.getFinalPhysicalPlan(AdaptiveSparkPlanExec.scala:179)
	at org.apache.spark.sql.execution.adaptive.AdaptiveSparkPlanExec.executeCollect(AdaptiveSparkPlanExec.scala:289)
	at org.apache.spark.sql.Dataset.collectFromPlan(Dataset.scala:3708)
	at org.apache.spark.sql.Dataset.$anonfun$collect$1(Dataset.scala:2977)
	at org.apache.spark.sql.Dataset.$anonfun$withAction$1(Dataset.scala:3699)
	at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$5(SQLExecution.scala:103)
	at org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:163)
	at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$1(SQLExecution.scala:90)
	at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:772)
	at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:64)
	at org.apache.spark.sql.Dataset.withAction(Dataset.scala:3697)
	at org.apache.spark.sql.Dataset.collect(Dataset.scala:2977)
	at org.apache.spark.sql.DataFrameAggregateSuite.$anonfun$assertNoExceptions$3(DataFrameAggregateSuite.scala:665)
	at org.apache.spark.sql.catalyst.plans.SQLHelper.withSQLConf(SQLHelper.scala:54)
	at org.apache.spark.sql.catalyst.plans.SQLHelper.withSQLConf$(SQLHelper.scala:38)
	at org.apache.spark.sql.DataFrameAggregateSuite.org$apache$spark$sql$test$SQLTestUtilsBase$$super$withSQLConf(DataFrameAggregateSuite.scala:37)
	at org.apache.spark.sql.test.SQLTestUtilsBase.withSQLConf(SQLTestUtils.scala:246)
	at org.apache.spark.sql.test.SQLTestUtilsBase.withSQLConf$(SQLTestUtils.scala:244)
	at org.apache.spark.sql.DataFrameAggregateSuite.withSQLConf(DataFrameAggregateSuite.scala:37)
	at org.apache.spark.sql.DataFrameAggregateSuite.$anonfun$assertNoExceptions$2(DataFrameAggregateSuite.scala:659)
	at org.apache.spark.sql.DataFrameAggregateSuite.$anonfun$assertNoExceptions$2$adapted(DataFrameAggregateSuite.scala:655)
	at scala.collection.TraversableLike$WithFilter.$anonfun$foreach$1(TraversableLike.scala:877)
	at scala.collection.immutable.List.foreach(List.scala:392)
	at scala.collection.TraversableLike$WithFilter.foreach(TraversableLike.scala:876)
	at org.apache.spark.sql.DataFrameAggregateSuite.assertNoExceptions(DataFrameAggregateSuite.scala:655)
	at org.apache.spark.sql.DataFrameAggregateSuite.$anonfun$new$126(DataFrameAggregateSuite.scala:695)
	at org.apache.spark.sql.DataFrameAggregateSuite.$anonfun$new$126$adapted(DataFrameAggregateSuite.scala:695)
	at scala.collection.immutable.List.foreach(List.scala:392)
	at org.apache.spark.sql.DataFrameAggregateSuite.$anonfun$new$125(DataFrameAggregateSuite.scala:695)
	at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23)
	at org.scalatest.OutcomeOf.outcomeOf(OutcomeOf.scala:85)
	at org.scalatest.OutcomeOf.outcomeOf$(OutcomeOf.scala:83)
	at org.scalatest.OutcomeOf$.outcomeOf(OutcomeOf.scala:104)
	at org.scalatest.Transformer.apply(Transformer.scala:22)
	at org.scalatest.Transformer.apply(Transformer.scala:20)
	at org.scalatest.funsuite.AnyFunSuiteLike$$anon$1.apply(AnyFunSuiteLike.scala:190)
	at org.apache.spark.SparkFunSuite.withFixture(SparkFunSuite.scala:176)
	at org.scalatest.funsuite.AnyFunSuiteLike.invokeWithFixture$1(AnyFunSuiteLike.scala:188)
	at org.scalatest.funsuite.AnyFunSuiteLike.$anonfun$runTest$1(AnyFunSuiteLike.scala:200)
	at org.scalatest.SuperEngine.runTestImpl(Engine.scala:306)
	at org.scalatest.funsuite.AnyFunSuiteLike.runTest(AnyFunSuiteLike.scala:200)
	at org.scalatest.funsuite.AnyFunSuiteLike.runTest$(AnyFunSuiteLike.scala:182)
	at org.apache.spark.SparkFunSuite.org$scalatest$BeforeAndAfterEach$$super$runTest(SparkFunSuite.scala:61)
	at org.scalatest.BeforeAndAfterEach.runTest(BeforeAndAfterEach.scala:234)
	at org.scalatest.BeforeAndAfterEach.runTest$(BeforeAndAfterEach.scala:227)
	at org.apache.spark.SparkFunSuite.runTest(SparkFunSuite.scala:61)
	at org.scalatest.funsuite.AnyFunSuiteLike.$anonfun$runTests$1(AnyFunSuiteLike.scala:233)
	at org.scalatest.SuperEngine.$anonfun$runTestsInBranch$1(Engine.scala:413)
	at scala.collection.immutable.List.foreach(List.scala:392)
	at org.scalatest.SuperEngine.traverseSubNodes$1(Engine.scala:401)
	at org.scalatest.SuperEngine.runTestsInBranch(Engine.scala:396)
	at org.scalatest.SuperEngine.runTestsImpl(Engine.scala:475)
	at org.scalatest.funsuite.AnyFunSuiteLike.runTests(AnyFunSuiteLike.scala:233)
	at org.scalatest.funsuite.AnyFunSuiteLike.runTests$(AnyFunSuiteLike.scala:232)
	at org.scalatest.funsuite.AnyFunSuite.runTests(AnyFunSuite.scala:1563)
	at org.scalatest.Suite.run(Suite.scala:1112)
	at org.scalatest.Suite.run$(Suite.scala:1094)
	at org.scalatest.funsuite.AnyFunSuite.org$scalatest$funsuite$AnyFunSuiteLike$$super$run(AnyFunSuite.scala:1563)
	at org.scalatest.funsuite.AnyFunSuiteLike.$anonfun$run$1(AnyFunSuiteLike.scala:237)
	at org.scalatest.SuperEngine.runImpl(Engine.scala:535)
	at org.scalatest.funsuite.AnyFunSuiteLike.run(AnyFunSuiteLike.scala:237)
	at org.scalatest.funsuite.AnyFunSuiteLike.run$(AnyFunSuiteLike.scala:236)
	at org.apache.spark.SparkFunSuite.org$scalatest$BeforeAndAfterAll$$super$run(SparkFunSuite.scala:61)
	at org.scalatest.BeforeAndAfterAll.liftedTree1$1(BeforeAndAfterAll.scala:213)
	at org.scalatest.BeforeAndAfterAll.run(BeforeAndAfterAll.scala:210)
	at org.scalatest.BeforeAndAfterAll.run$(BeforeAndAfterAll.scala:208)
	at org.apache.spark.SparkFunSuite.run(SparkFunSuite.scala:61)
	at org.scalatest.tools.SuiteRunner.run(SuiteRunner.scala:45)
	at org.scalatest.tools.Runner$.$anonfun$doRunRunRunDaDoRunRun$13(Runner.scala:1320)
	at org.scalatest.tools.Runner$.$anonfun$doRunRunRunDaDoRunRun$13$adapted(Runner.scala:1314)
	at scala.collection.immutable.List.foreach(List.scala:392)
	at org.scalatest.tools.Runner$.doRunRunRunDaDoRunRun(Runner.scala:1314)
	at org.scalatest.tools.Runner$.$anonfun$runOptionallyWithPassFailReporter$24(Runner.scala:993)
	at org.scalatest.tools.Runner$.$anonfun$runOptionallyWithPassFailReporter$24$adapted(Runner.scala:971)
	at org.scalatest.tools.Runner$.withClassLoaderAndDispatchReporter(Runner.scala:1480)
	at org.scalatest.tools.Runner$.runOptionallyWithPassFailReporter(Runner.scala:971)
	at org.scalatest.tools.Runner$.run(Runner.scala:798)
	at org.scalatest.tools.Runner.run(Runner.scala)
	at org.jetbrains.plugins.scala.testingSupport.scalaTest.ScalaTestRunner.runScalaTest2(ScalaTestRunner.java:131)
	at org.jetbrains.plugins.scala.testingSupport.scalaTest.ScalaTestRunner.main(ScalaTestRunner.java:28)
Caused by: org.apache.spark.sql.catalyst.errors.package$TreeNodeException: execute, tree:
HashAggregate(keys=[x#105], functions=[partial_sum(y#106)], output=[x#105, sum#118L])
+- Project [_1#100 AS x#105, _2#101 AS y#106]
   +- LocalTableScan [_1#100, _2#101]

	at org.apache.spark.sql.catalyst.errors.package$.attachTree(package.scala:56)
	at org.apache.spark.sql.execution.aggregate.HashAggregateExec.doExecute(HashAggregateExec.scala:84)
	at org.apache.spark.sql.execution.SparkPlan.$anonfun$execute$1(SparkPlan.scala:180)
	at org.apache.spark.sql.execution.SparkPlan.$anonfun$executeQuery$1(SparkPlan.scala:218)
	at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
	at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:215)
	at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:176)
	at org.apache.spark.sql.execution.exchange.ShuffleExchangeExec.inputRDD$lzycompute(ShuffleExchangeExec.scala:118)
	at org.apache.spark.sql.execution.exchange.ShuffleExchangeExec.inputRDD(ShuffleExchangeExec.scala:118)
	at org.apache.spark.sql.execution.exchange.ShuffleExchangeExec.mapOutputStatisticsFuture$lzycompute(ShuffleExchangeExec.scala:122)
	at org.apache.spark.sql.execution.exchange.ShuffleExchangeExec.mapOutputStatisticsFuture(ShuffleExchangeExec.scala:121)
	at org.apache.spark.sql.execution.adaptive.ShuffleQueryStageExec.$anonfun$doMaterialize$1(QueryStageExec.scala:163)
	at org.apache.spark.sql.catalyst.errors.package$.attachTree(package.scala:52)
	... 91 more
Caused by: java.lang.RuntimeException: calculate error
	at org.apache.spark.sql.execution.aggregate.HashAggregateExec.$anonfun$doExecute$1(HashAggregateExec.scala:85)
	at org.apache.spark.sql.catalyst.errors.package$.attachTree(package.scala:52)
	... 103 more
```

After this PR, the error is:
```
calculate error
java.lang.RuntimeException: calculate error
	at org.apache.spark.sql.execution.aggregate.HashAggregateExec.doExecute(HashAggregateExec.scala:84)
	at org.apache.spark.sql.execution.SparkPlan.$anonfun$execute$1(SparkPlan.scala:180)
	at org.apache.spark.sql.execution.SparkPlan.$anonfun$executeQuery$1(SparkPlan.scala:218)
	at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
	at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:215)
	at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:176)
	at org.apache.spark.sql.execution.exchange.ShuffleExchangeExec.inputRDD$lzycompute(ShuffleExchangeExec.scala:117)
	at org.apache.spark.sql.execution.exchange.ShuffleExchangeExec.inputRDD(ShuffleExchangeExec.scala:117)
	at org.apache.spark.sql.execution.exchange.ShuffleExchangeExec.mapOutputStatisticsFuture$lzycompute(ShuffleExchangeExec.scala:121)
	at org.apache.spark.sql.execution.exchange.ShuffleExchangeExec.mapOutputStatisticsFuture(ShuffleExchangeExec.scala:120)
	at org.apache.spark.sql.execution.adaptive.ShuffleQueryStageExec.doMaterialize(QueryStageExec.scala:161)
	at org.apache.spark.sql.execution.adaptive.QueryStageExec.$anonfun$materialize$1(QueryStageExec.scala:80)
	at org.apache.spark.sql.execution.SparkPlan.$anonfun$executeQuery$1(SparkPlan.scala:218)
	at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
	at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:215)
	at org.apache.spark.sql.execution.adaptive.QueryStageExec.materialize(QueryStageExec.scala:78)
	at org.apache.spark.sql.execution.adaptive.AdaptiveSparkPlanExec.$anonfun$getFinalPhysicalPlan$5(AdaptiveSparkPlanExec.scala:207)
	at org.apache.spark.sql.execution.adaptive.AdaptiveSparkPlanExec.$anonfun$getFinalPhysicalPlan$5$adapted(AdaptiveSparkPlanExec.scala:205)
	at scala.collection.immutable.List.foreach(List.scala:392)
	at org.apache.spark.sql.execution.adaptive.AdaptiveSparkPlanExec.$anonfun$getFinalPhysicalPlan$1(AdaptiveSparkPlanExec.scala:205)
	at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:772)
	at org.apache.spark.sql.execution.adaptive.AdaptiveSparkPlanExec.getFinalPhysicalPlan(AdaptiveSparkPlanExec.scala:179)
	at org.apache.spark.sql.execution.adaptive.AdaptiveSparkPlanExec.executeCollect(AdaptiveSparkPlanExec.scala:289)
	at org.apache.spark.sql.Dataset.collectFromPlan(Dataset.scala:3708)
	at org.apache.spark.sql.Dataset.$anonfun$collect$1(Dataset.scala:2977)
	at org.apache.spark.sql.Dataset.$anonfun$withAction$1(Dataset.scala:3699)
	at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$5(SQLExecution.scala:103)
	at org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:163)
	at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$1(SQLExecution.scala:90)
	at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:772)
	at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:64)
	at org.apache.spark.sql.Dataset.withAction(Dataset.scala:3697)
	at org.apache.spark.sql.Dataset.collect(Dataset.scala:2977)
	at org.apache.spark.sql.DataFrameAggregateSuite.$anonfun$assertNoExceptions$3(DataFrameAggregateSuite.scala:665)
	at org.apache.spark.sql.catalyst.plans.SQLHelper.withSQLConf(SQLHelper.scala:54)
	at org.apache.spark.sql.catalyst.plans.SQLHelper.withSQLConf$(SQLHelper.scala:38)
	at org.apache.spark.sql.DataFrameAggregateSuite.org$apache$spark$sql$test$SQLTestUtilsBase$$super$withSQLConf(DataFrameAggregateSuite.scala:37)
	at org.apache.spark.sql.test.SQLTestUtilsBase.withSQLConf(SQLTestUtils.scala:246)
	at org.apache.spark.sql.test.SQLTestUtilsBase.withSQLConf$(SQLTestUtils.scala:244)
	at org.apache.spark.sql.DataFrameAggregateSuite.withSQLConf(DataFrameAggregateSuite.scala:37)
	at org.apache.spark.sql.DataFrameAggregateSuite.$anonfun$assertNoExceptions$2(DataFrameAggregateSuite.scala:659)
	at org.apache.spark.sql.DataFrameAggregateSuite.$anonfun$assertNoExceptions$2$adapted(DataFrameAggregateSuite.scala:655)
	at scala.collection.TraversableLike$WithFilter.$anonfun$foreach$1(TraversableLike.scala:877)
	at scala.collection.immutable.List.foreach(List.scala:392)
	at scala.collection.TraversableLike$WithFilter.foreach(TraversableLike.scala:876)
	at org.apache.spark.sql.DataFrameAggregateSuite.assertNoExceptions(DataFrameAggregateSuite.scala:655)
	at org.apache.spark.sql.DataFrameAggregateSuite.$anonfun$new$126(DataFrameAggregateSuite.scala:695)
	at org.apache.spark.sql.DataFrameAggregateSuite.$anonfun$new$126$adapted(DataFrameAggregateSuite.scala:695)
	at scala.collection.immutable.List.foreach(List.scala:392)
	at org.apache.spark.sql.DataFrameAggregateSuite.$anonfun$new$125(DataFrameAggregateSuite.scala:695)
	at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23)
	at org.scalatest.OutcomeOf.outcomeOf(OutcomeOf.scala:85)
	at org.scalatest.OutcomeOf.outcomeOf$(OutcomeOf.scala:83)
	at org.scalatest.OutcomeOf$.outcomeOf(OutcomeOf.scala:104)
	at org.scalatest.Transformer.apply(Transformer.scala:22)
	at org.scalatest.Transformer.apply(Transformer.scala:20)
	at org.scalatest.funsuite.AnyFunSuiteLike$$anon$1.apply(AnyFunSuiteLike.scala:190)
	at org.apache.spark.SparkFunSuite.withFixture(SparkFunSuite.scala:176)
	at org.scalatest.funsuite.AnyFunSuiteLike.invokeWithFixture$1(AnyFunSuiteLike.scala:188)
	at org.scalatest.funsuite.AnyFunSuiteLike.$anonfun$runTest$1(AnyFunSuiteLike.scala:200)
	at org.scalatest.SuperEngine.runTestImpl(Engine.scala:306)
	at org.scalatest.funsuite.AnyFunSuiteLike.runTest(AnyFunSuiteLike.scala:200)
	at org.scalatest.funsuite.AnyFunSuiteLike.runTest$(AnyFunSuiteLike.scala:182)
	at org.apache.spark.SparkFunSuite.org$scalatest$BeforeAndAfterEach$$super$runTest(SparkFunSuite.scala:61)
	at org.scalatest.BeforeAndAfterEach.runTest(BeforeAndAfterEach.scala:234)
	at org.scalatest.BeforeAndAfterEach.runTest$(BeforeAndAfterEach.scala:227)
	at org.apache.spark.SparkFunSuite.runTest(SparkFunSuite.scala:61)
	at org.scalatest.funsuite.AnyFunSuiteLike.$anonfun$runTests$1(AnyFunSuiteLike.scala:233)
	at org.scalatest.SuperEngine.$anonfun$runTestsInBranch$1(Engine.scala:413)
	at scala.collection.immutable.List.foreach(List.scala:392)
	at org.scalatest.SuperEngine.traverseSubNodes$1(Engine.scala:401)
	at org.scalatest.SuperEngine.runTestsInBranch(Engine.scala:396)
	at org.scalatest.SuperEngine.runTestsImpl(Engine.scala:475)
	at org.scalatest.funsuite.AnyFunSuiteLike.runTests(AnyFunSuiteLike.scala:233)
	at org.scalatest.funsuite.AnyFunSuiteLike.runTests$(AnyFunSuiteLike.scala:232)
	at org.scalatest.funsuite.AnyFunSuite.runTests(AnyFunSuite.scala:1563)
	at org.scalatest.Suite.run(Suite.scala:1112)
	at org.scalatest.Suite.run$(Suite.scala:1094)
	at org.scalatest.funsuite.AnyFunSuite.org$scalatest$funsuite$AnyFunSuiteLike$$super$run(AnyFunSuite.scala:1563)
	at org.scalatest.funsuite.AnyFunSuiteLike.$anonfun$run$1(AnyFunSuiteLike.scala:237)
	at org.scalatest.SuperEngine.runImpl(Engine.scala:535)
	at org.scalatest.funsuite.AnyFunSuiteLike.run(AnyFunSuiteLike.scala:237)
	at org.scalatest.funsuite.AnyFunSuiteLike.run$(AnyFunSuiteLike.scala:236)
	at org.apache.spark.SparkFunSuite.org$scalatest$BeforeAndAfterAll$$super$run(SparkFunSuite.scala:61)
	at org.scalatest.BeforeAndAfterAll.liftedTree1$1(BeforeAndAfterAll.scala:213)
	at org.scalatest.BeforeAndAfterAll.run(BeforeAndAfterAll.scala:210)
	at org.scalatest.BeforeAndAfterAll.run$(BeforeAndAfterAll.scala:208)
	at org.apache.spark.SparkFunSuite.run(SparkFunSuite.scala:61)
	at org.scalatest.tools.SuiteRunner.run(SuiteRunner.scala:45)
	at org.scalatest.tools.Runner$.$anonfun$doRunRunRunDaDoRunRun$13(Runner.scala:1320)
	at org.scalatest.tools.Runner$.$anonfun$doRunRunRunDaDoRunRun$13$adapted(Runner.scala:1314)
	at scala.collection.immutable.List.foreach(List.scala:392)
	at org.scalatest.tools.Runner$.doRunRunRunDaDoRunRun(Runner.scala:1314)
	at org.scalatest.tools.Runner$.$anonfun$runOptionallyWithPassFailReporter$24(Runner.scala:993)
	at org.scalatest.tools.Runner$.$anonfun$runOptionallyWithPassFailReporter$24$adapted(Runner.scala:971)
	at org.scalatest.tools.Runner$.withClassLoaderAndDispatchReporter(Runner.scala:1480)
	at org.scalatest.tools.Runner$.runOptionallyWithPassFailReporter(Runner.scala:971)
	at org.scalatest.tools.Runner$.run(Runner.scala:798)
	at org.scalatest.tools.Runner.run(Runner.scala)
	at org.jetbrains.plugins.scala.testingSupport.scalaTest.ScalaTestRunner.runScalaTest2(ScalaTestRunner.java:131)
	at org.jetbrains.plugins.scala.testingSupport.scalaTest.ScalaTestRunner.main(ScalaTestRunner.java:28)
```

### Why are the changes needed?
`TreeNodeException` didn't work well.

### Does this PR introduce _any_ user-facing change?
'No'.

### How was this patch tested?
Jenkins test.

Closes #31337 from beliefer/SPARK-34234.

Lead-authored-by: gengjiaan <gengjiaan@360.cn>
Co-authored-by: beliefer <beliefer@163.com>
Co-authored-by: Jiaan Geng <beliefer@163.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2021-02-10 06:25:33 +00:00
Chao Sun 0986f16c8d [SPARK-34347][SQL] CatalogImpl.uncacheTable should invalidate in cascade for temp views
### What changes were proposed in this pull request?

This PR includes the following changes:
1. In `CatalogImpl.uncacheTable`, invalidate caches in cascade when the target table is a temp view and `spark.sql.legacy.storeAnalyzedPlanForView` is false (the default value); see the sketch after this list.
2. make `SessionCatalog.lookupTempView` public and return processed temp view plan (i.e., with `View` op).
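
A minimal sketch of change (1), with illustrative view names: uncaching a temporary view now also invalidates caches that were built on top of it.
```scala
spark.range(10).createOrReplaceTempView("base_view")
spark.sql("CACHE TABLE base_view")
spark.sql("CREATE TEMPORARY VIEW derived_view AS SELECT id * 2 AS id2 FROM base_view")
spark.sql("CACHE TABLE derived_view")
// With this change, uncaching base_view also drops the cache entry of derived_view.
spark.catalog.uncacheTable("base_view")
```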

### Why are the changes needed?

Following [SPARK-34052](https://issues.apache.org/jira/browse/SPARK-34052) (#31107), we should invalidate in cascade for `CatalogImpl.uncacheTable` when the table is a temp view, so that the behavior is consistent.

### Does this PR introduce _any_ user-facing change?

Yes, now `SQLContext.uncacheTable` will drop temp view in cascade by default.

### How was this patch tested?

Added a UT

Closes #31462 from sunchao/SPARK-34347.

Authored-by: Chao Sun <sunchao@apple.com>
Signed-off-by: Liang-Chi Hsieh <viirya@gmail.com>
2021-02-09 20:48:58 -08:00
Gabor Somogyi 0a37a95224 [SPARK-31816][SQL][DOCS] Added high level description about JDBC connection providers for users/developers
### What changes were proposed in this pull request?
The JDBC connection provider API and the embedded connection providers were already added to the code, but there was no in-depth description of the internals. In this PR I've added both user and developer documentation, plus an example custom JDBC connection provider.

### Why are the changes needed?
No documentation and example custom JDBC provider.

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
```
cd docs/
SKIP_API=1 jekyll build
```
<img width="793" alt="Screenshot 2021-02-02 at 16 35 43" src="https://user-images.githubusercontent.com/18561820/106623428-e48d2880-6574-11eb-8d14-e5c2aa7c37f1.png">

Closes #31384 from gaborgsomogyi/SPARK-31816.

Authored-by: Gabor Somogyi <gabor.g.somogyi@gmail.com>
Signed-off-by: Takeshi Yamamuro <yamamuro@apache.org>
2021-02-10 12:28:28 +09:00
MrPowers e6753c9402 [SPARK-33995][SQL] Expose make_interval as a Scala function
### What changes were proposed in this pull request?

This pull request exposes the `make_interval` function, [as suggested here](https://github.com/apache/spark/pull/31000#pullrequestreview-560812433), and as agreed to [here](https://github.com/apache/spark/pull/31000#issuecomment-754856820) and [here](https://github.com/apache/spark/pull/31000#issuecomment-755040234).

This powerful little function allows for idiomatic datetime arithmetic via the Scala API:

```scala
// add two hours
df.withColumn("plus_2_hours", col("first_datetime") + make_interval(hours = lit(2)))

// subtract one week and 30 seconds
col("d") - make_interval(weeks = lit(1), secs = lit(30))
```

The `make_interval` [SQL function](https://github.com/apache/spark/pull/26446) already exists.

Here is [the JIRA ticket](https://issues.apache.org/jira/browse/SPARK-33995) for this PR.

### Why are the changes needed?

The Spark API makes it easy to perform datetime addition / subtraction with months (`add_months`) and days (`date_add`).  Users need to write code like this to perform datetime addition with years, weeks, hours, minutes, or seconds:

```scala
df.withColumn("plus_2_hours", expr("first_datetime + INTERVAL 2 hours"))
```

We don't want to force users to manipulate SQL strings when they're using the Scala API.

### Does this PR introduce _any_ user-facing change?

Yes, this PR adds `make_interval` to the `org.apache.spark.sql.functions` API.

This single function will benefit a lot of users.  It's a small increase in the surface of the API for a big gain.

### How was this patch tested?

This was tested via unit tests.

cc: MaxGekk

Closes #31073 from MrPowers/SPARK-33995.

Authored-by: MrPowers <matthewkevinpowers@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2021-02-10 03:27:41 +00:00
Angerszhuuuu 2f387b41e8 [SPARK-34137][SQL] Update subquery's stats when building LogicalPlan's stats
### What changes were proposed in this pull request?
When explaining SQL with cost mode, the treeString of a subquery doesn't show its statistics:

How to reproduce:
```
spark.sql("create table t1 using parquet as select id as a, id as b from range(1000)")
spark.sql("create table t2 using parquet as select id as c, id as d from range(2000)")

spark.sql("ANALYZE TABLE t1 COMPUTE STATISTICS FOR ALL COLUMNS")
spark.sql("ANALYZE TABLE t2 COMPUTE STATISTICS FOR ALL COLUMNS")
spark.sql("set spark.sql.cbo.enabled=true")

spark.sql(
  """
    |WITH max_store_sales AS
    |  (SELECT max(csales) tpcds_cmax
    |  FROM (SELECT
    |    sum(b) csales
    |  FROM t1 WHERE a < 100 ) x),
    |best_ss_customer AS
    |  (SELECT
    |    c
    |  FROM t2
    |  WHERE d > (SELECT * FROM max_store_sales))
    |
    |SELECT c FROM best_ss_customer
    |""".stripMargin).explain("cost")
```
Before this PR's output:
```
== Optimized Logical Plan ==
Project [c#4263L], Statistics(sizeInBytes=31.3 KiB, rowCount=2.00E+3)
+- Filter (isnotnull(d#4264L) AND (d#4264L > scalar-subquery#4262 [])), Statistics(sizeInBytes=46.9 KiB, rowCount=2.00E+3)
   :  +- Aggregate [max(csales#4260L) AS tpcds_cmax#4261L]
   :     +- Aggregate [sum(b#4266L) AS csales#4260L]
   :        +- Project [b#4266L]
   :           +- Filter ((a#4265L < 100) AND isnotnull(a#4265L))
   :              +- Relation default.t1[a#4265L,b#4266L] parquet, Statistics(sizeInBytes=23.4 KiB, rowCount=1.00E+3)
   +- Relation default.t2[c#4263L,d#4264L] parquet, Statistics(sizeInBytes=46.9 KiB, rowCount=2.00E+3)
```

After this pr:
```
== Optimized Logical Plan ==
Project [c#4481L], Statistics(sizeInBytes=31.3 KiB, rowCount=2.00E+3)
+- Filter (isnotnull(d#4482L) AND (d#4482L > scalar-subquery#4480 [])), Statistics(sizeInBytes=46.9 KiB, rowCount=2.00E+3)
   :  +- Aggregate [max(csales#4478L) AS tpcds_cmax#4479L], Statistics(sizeInBytes=16.0 B, rowCount=1)
   :     +- Aggregate [sum(b#4484L) AS csales#4478L], Statistics(sizeInBytes=16.0 B, rowCount=1)
   :        +- Project [b#4484L], Statistics(sizeInBytes=1616.0 B, rowCount=101)
   :           +- Filter (isnotnull(a#4483L) AND (a#4483L < 100)), Statistics(sizeInBytes=2.4 KiB, rowCount=101)
   :              +- Relation[a#4483L,b#4484L] parquet, Statistics(sizeInBytes=23.4 KiB, rowCount=1.00E+3)
   +- Relation[c#4481L,d#4482L] parquet, Statistics(sizeInBytes=46.9 KiB, rowCount=2.00E+3)

```

### Why are the changes needed?
Complete explain treeString's statistics

### Does this PR introduce _any_ user-facing change?
When users run explain with cost mode, they can see the subquery's statistics too.

### How was this patch tested?
Added UT

Closes #31485 from AngersZhuuuu/SPARK-34137.

Authored-by: Angerszhuuuu <angers.zhu@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2021-02-10 03:21:45 +00:00
Angerszhuuuu 123365e05c [SPARK-34240][SQL] Unify output of SHOW TBLPROPERTIES clause's output attribute's schema and ExprID
### What changes were proposed in this pull request?
Passing around the output attributes should have more benefits like keeping the exprID unchanged to avoid bugs when we apply more operators above the command output DataFrame.

This PR does 2 things:

1. After this PR, the output of a `SHOW TBLPROPERTIES` clause shows both `key` and `value` columns whether or not you specify a table property `key`. Before this PR, the output only showed a `value` column when you specified a table property `key`.
2. Keep the `SHOW TBLPROPERTIES` command's output attribute exprId unchanged.

### Why are the changes needed?
 1. Keep the `SHOW TBLPROPERTIES` output schema consistent
 2. Keep `SHOW TBLPROPERTIES` command's output attribute exprId unchanged.

### Does this PR introduce _any_ user-facing change?
Yes. After this PR, the output of a `SHOW TBLPROPERTIES` clause shows both `key` and `value` columns whether or not you specify a table property `key`. Before this PR, the output only showed a `value` column when you specified a table property `key`.

Before this PR:
```
sql > SHOW TBLPROPERTIES table_name('key')
value
value_of_key
```

After this PR
```
sql > SHOW TBLPROPERTIES table_name('key')
key value
key value_of_key
```

### How was this patch tested?
Added UT

Closes #31378 from AngersZhuuuu/SPARK-34240.

Lead-authored-by: Angerszhuuuu <angers.zhu@gmail.com>
Co-authored-by: AngersZhuuuu <angers.zhu@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2021-02-10 03:19:52 +00:00
Kousuke Saruta f79305a402 [SPARK-34311][SQL] PostgresDialect can't treat arrays of some types
### What changes were proposed in this pull request?

This PR fixes the issue that `PostgresDialect` can't treat arrays of some types.
Though PostgreSQL supports a wide range of types (https://www.postgresql.org/docs/13/datatype.html), the current `PostgresDialect` can't handle arrays of the following types (see the sketch after the list):

* xml
* tsvector
* tsquery
* macaddr
* macaddr8
* txid_snapshot
* pg_snapshot
* point
* line
* lseg
* box
* path
* polygon
* circle
* pg_lsn
* bit varying
* interval

NOTE: PostgreSQL doesn't implement arrays of serial types so this PR doesn't care about them.
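
As a hedged illustration (connection URL, table, and credentials are placeholders, and the PostgreSQL JDBC driver must be on the classpath), reading a table that contains an array column of one of the listed types now works through the JDBC data source:
```scala
val df = spark.read
  .format("jdbc")
  .option("url", "jdbc:postgresql://localhost:5432/testdb")
  .option("dbtable", "geo_events") // e.g. a table with a column of type point[]
  .option("user", "postgres")
  .option("password", "secret")
  .load()
// The array column is mapped to a Catalyst ArrayType instead of failing schema resolution.
df.printSchema()
```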

### Why are the changes needed?

To provide better support with PostgreSQL.

### Does this PR introduce _any_ user-facing change?

Yes. PostgresDialect can handle arrays of types shown above.

### How was this patch tested?

New test.

Closes #31419 from sarutak/postgres-array-types.

Authored-by: Kousuke Saruta <sarutak@oss.nttdata.com>
Signed-off-by: Takeshi Yamamuro <yamamuro@apache.org>
2021-02-10 11:29:14 +09:00
Angerszhuuuu 3e12e9d2ee [SPARK-34238][SQL][FOLLOW_UP] SHOW PARTITIONS Keep consistence with other SHOW command
### What changes were proposed in this pull request?
Keep consistence with other `SHOW` command according to  https://github.com/apache/spark/pull/31341#issuecomment-774613080

### Why are the changes needed?
Keep consistency

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
Not needed

Closes #31516 from AngersZhuuuu/SPARK-34238-follow-up.

Authored-by: Angerszhuuuu <angers.zhu@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2021-02-10 02:28:05 +00:00
Holden Karau cf7a13c363 [SPARK-34209][SQL] Delegate table name validation to the session catalog
### What changes were proposed in this pull request?

Delegate table name validation to the session catalog

### Why are the changes needed?

Querying of tables with nested namespaces.

### Does this PR introduce _any_ user-facing change?

Yes: SQL queries can now target tables in nested namespaces.

### How was this patch tested?

Unit tests updated.

Closes #31427 from holdenk/SPARK-34209-delegate-table-name-validation-to-the-catalog.

Authored-by: Holden Karau <hkarau@apple.com>
Signed-off-by: Holden Karau <hkarau@apple.com>
2021-02-09 10:15:16 -08:00
Angerszhuuuu 7ea3a336b9 [SPARK-34355][CORE][SQL][FOLLOWUP] Log commit time in all File Writer
### What changes were proposed in this pull request?
While working on https://issues.apache.org/jira/browse/SPARK-34399, based on https://github.com/apache/spark/pull/31471,
I found that `FileBatchWrite` uses `FileFormatWrite.processStates()` too. We need to log the commit duration in the other writers too.
In this PR:

1. Extract a commit job method in SparkHadoopWriter
2. Address the other commit writers

### Why are the changes needed?

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
No

Closes #31520 from AngersZhuuuu/SPARK-34355-followup.

Authored-by: Angerszhuuuu <angers.zhu@gmail.com>
Signed-off-by: Jungtaek Lim (HeartSaVioR) <kabhwan.opensource@gmail.com>
2021-02-09 16:05:39 +09:00
yikf 37fe8c6d3c [SPARK-34395][SQL] Clean up unused code for code simplifications
### What changes were proposed in this pull request?
Currently, we pass the value `EmptyRow` to the `checkEvaluation` method in `StringExpressionsSuite`, but `EmptyRow` is already the default value of that parameter in `checkEvaluation`.

We can drop the argument to simplify the code.

### Why are the changes needed?
For code simplification.

**before**:
```
def testConcat(inputs: String*): Unit = {
  val expected = if (inputs.contains(null)) null else inputs.mkString
  checkEvaluation(Concat(inputs.map(Literal.create(_, StringType))), expected, EmptyRow)
}
```
**after**:
```
def testConcat(inputs: String*): Unit = {
  val expected = if (inputs.contains(null)) null else inputs.mkString
  checkEvaluation(Concat(inputs.map(Literal.create(_, StringType))), expected)
}
```

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
Pass the Jenkins or Github action.

Closes #31510 from yikf/master.

Authored-by: yikf <13468507104@163.com>
Signed-off-by: Sean Owen <srowen@gmail.com>
2021-02-08 20:37:23 -06:00
gengjiaan e65b28cf7d [SPARK-34352][SQL] Improve SQLQueryTestSuite so as could run on windows system
### What changes were proposed in this pull request?
The current implementation of `SQLQueryTestSuite` cannot run on Windows,
because the code below fails there:
`assume(TestUtils.testCommandAvailable("/bin/bash"))`

For operating systems that do not provide `/bin/bash`, we just skip those tests.

### Why are the changes needed?
`SQLQueryTestSuite` has a bug on Windows.

### Does this PR introduce _any_ user-facing change?
'No'.

### How was this patch tested?
Jenkins test

Closes #31466 from beliefer/SPARK-34352.

Authored-by: gengjiaan <gengjiaan@360.cn>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2021-02-09 10:58:58 +09:00
yangjie01 777d51e7e3 [SPARK-34374][SQL][DSTREAM] Use standard methods to extract keys or values from a Map
### What changes were proposed in this pull request?
Use standard methods to extract `keys` or `values` from a `Map`; this is semantically consistent and uses `DefaultKeySet` and `DefaultValuesIterable` instead of a manual loop.

**Before**
```
map.map(_._1)
map.map(_._2)
```

**After**
```
map.keys
map.values
```

### Why are the changes needed?
Code simplifications.

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
Pass the Jenkins or GitHub Action

Closes #31484 from LuciferYang/keys-and-values.

Authored-by: yangjie01 <yangjie01@baidu.com>
Signed-off-by: Sean Owen <srowen@gmail.com>
2021-02-08 15:42:55 -06:00
jiake 3b26bc2536 [SPARK-34168][SQL] Support DPP in AQE when the join is Broadcast hash join at the beginning
### What changes were proposed in this pull request?
This PR enables AQE and DPP together when the join is a broadcast hash join from the beginning, so that a query can benefit from both DPP and AQE at the same time. It reuses the result of the build side and then inserts the DPP filter into the probe side.
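
A minimal sketch of the setup this targets (table and column names are illustrative, and `fact_table` is assumed to be partitioned by `part`): with both AQE and DPP enabled and the dimension side small enough to broadcast, the fact-table scan can carry a DPP subquery filter.
```scala
spark.conf.set("spark.sql.adaptive.enabled", "true")
spark.conf.set("spark.sql.optimizer.dynamicPartitionPruning.enabled", "true")
spark.sql(
  """SELECT /*+ BROADCAST(d) */ f.id, f.part
    |FROM fact_table f JOIN dim_table d ON f.part = d.part
    |WHERE d.flag = 1""".stripMargin).explain()
```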

### Why are the changes needed?
Improve performance

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
Added a new unit test.

Closes #31258 from JkSelf/supportDPP1.

Authored-by: jiake <ke.a.jia@intel.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2021-02-08 16:42:52 +00:00
Terry Kim c92e408aa1 [SPARK-34388][SQL] Propagate the registered UDF name to ScalaUDF, ScalaUDAF and ScalaAggregator
### What changes were proposed in this pull request?

This PR proposes to propagate the name used for registering UDFs to `ScalaUDF`, `ScalaUDAF` and `ScalaAggregator`.

Note that `PythonUDF` gets the name correctly: 466c045bfa/python/pyspark/sql/udf.py (L358-L359)
, and same for Hive UDFs:
466c045bfa/sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveSessionCatalog.scala (L67)
### Why are the changes needed?

This PR can help in the following scenarios:
1) Better EXPLAIN output
2) By adding `def name: String` to `UserDefinedExpression`, we can match an expression as a `UserDefinedExpression` and look it up in the catalog, a use case needed for #31273.

### Does this PR introduce _any_ user-facing change?

The EXPLAIN output involving udfs will be changed to use the name used for UDF registration.

For example, for the following:
```
sql("CREATE TEMPORARY FUNCTION test_udf AS 'org.apache.spark.examples.sql.Spark33084'")
sql("SELECT test_udf(col1) FROM VALUES (1), (2), (3)").explain(true)
```
The output of the optimized plan will change from:
```
Aggregate [spark33084(cast(col1#223 as bigint), org.apache.spark.examples.sql.Spark330846906be0f, 1, 1) AS spark33084(col1)#237]
+- LocalRelation [col1#223]
```
to
```
Aggregate [test_udf(cast(col1#223 as bigint), org.apache.spark.examples.sql.Spark330847a62d697, 1, 1, Some(test_udf)) AS test_udf(col1)#237]
+- LocalRelation [col1#223]
```

### How was this patch tested?

Added new tests.

Closes #31500 from imback82/udaf_name.

Authored-by: Terry Kim <yuminkim@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2021-02-08 16:02:07 +00:00
yliou d1131bc850 [MINOR][SQL][FOLLOW-UP] Add assertion to FixedLengthRowBasedKeyValueBatch
### What changes were proposed in this pull request?
Adds an assert to the `FixedLengthRowBasedKeyValueBatch#appendRow` method that checks the incoming `vlen` and `klen` by comparing them with the lengths stored as member variables, as a follow-up to https://github.com/apache/spark/pull/30788.

### Why are the changes needed?
Adds an assert statement to catch similar bugs in the future.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
Ran some tests locally, though not easy to test.

Closes #31447 from yliou/SPARK-33726-Assert.

Authored-by: yliou <yliou@berkeley.edu>
Signed-off-by: Sean Owen <srowen@gmail.com>
2021-02-08 08:46:01 -06:00
Linhong Liu 037bfb2dbc [SPARK-33438][SQL] Eagerly init objects with defined SQL Confs for command set -v
### What changes were proposed in this pull request?
In Spark, `set -v` is defined as "Queries all properties that are defined in the SQLConf of the sparkSession".
But other external modules also define properties and register them to SQLConf. In that case,
they can't be displayed by `set -v` until the conf object is initialized (i.e. the object has been referenced at least once).

In this PR, I propose to eagerly initialize all the objects registered to SQLConf, so that `set -v` will always output
the complete set of properties.

### Why are the changes needed?
Improve the `set -v` command to produce complete and deterministic results

### Does this PR introduce _any_ user-facing change?
`set -v` command will dump more configs

### How was this patch tested?
existing tests

Closes #30363 from linhongliu-db/set-v.

Authored-by: Linhong Liu <linhong.liu@databricks.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2021-02-08 22:48:28 +09:00
Max Gekk a85490659f [SPARK-34377][SQL] Add new parquet datasource options to control datetime rebasing in read
### What changes were proposed in this pull request?
In the PR, I propose new options for the Parquet datasource:
1. `datetimeRebaseMode`
2. `int96RebaseMode`

Both options affect how ancient date and timestamp column values are loaded from parquet files. The `datetimeRebaseMode` option impacts loading of the `DATE`, `TIMESTAMP_MICROS` and `TIMESTAMP_MILLIS` types, while `int96RebaseMode` impacts loading of `INT96` timestamps.

The options support the same values as the SQL configs `spark.sql.legacy.parquet.datetimeRebaseModeInRead` and `spark.sql.legacy.parquet.int96RebaseModeInRead`, namely:
- `"LEGACY"`, when an option is set to this value, Spark rebases dates/timestamps from the legacy hybrid calendar (Julian + Gregorian) to the Proleptic Gregorian calendar.
- `"CORRECTED"`, dates/timestamps are read AS IS from parquet files.
- `"EXCEPTION"`, when it is set as an option value, Spark will fail the reading if it sees ancient dates/timestamps that are ambiguous between the two calendars.

### Why are the changes needed?
1. The new options allow loading parquet files from two or more sources with different rebasing modes in the same query. For instance:
```scala
val df1 = spark.read.option("datetimeRebaseMode", "legacy").parquet(folder1)
val df2 = spark.read.option("datetimeRebaseMode", "corrected").parquet(folder2)
df1.join(df2, ...)
```
Before the changes, this is impossible because the SQL config `spark.sql.legacy.parquet.datetimeRebaseModeInRead` affects both reads.

2. Mixing of Dataset/DataFrame and RDD APIs should become possible. Since SQL configs are not propagated through RDDs, the following code fails on ancient timestamps:
```scala
spark.conf.set("spark.sql.legacy.parquet.datetimeRebaseModeInRead", "legacy")
spark.read.parquet(folder).distinct.rdd.collect()
```

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
By running the modified test suites:
```
$ build/sbt "sql/test:testOnly *ParquetRebaseDatetimeV1Suite"
$ build/sbt "sql/test:testOnly *ParquetRebaseDatetimeV2Suite"
```

Closes #31489 from MaxGekk/parquet-rebase-options.

Authored-by: Max Gekk <max.gekk@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2021-02-08 13:28:40 +00:00
HyukjinKwon 70ef196d59 [SPARK-34157][BUILD][FOLLOW-UP] Fix Scala 2.13 compilation error via using Array.deep
### What changes were proposed in this pull request?

This PR is a followup of https://github.com/apache/spark/pull/31245:

```
[error] /home/runner/work/spark/spark/sql/core/src/test/scala/org/apache/spark/sql/execution/command/v1/ShowTablesSuite.scala:112:53: value deep is not a member of Array[String]
[error]         assert(sql("show tables").schema.fieldNames.deep ==
[error]                                                     ^
[error] /home/runner/work/spark/spark/sql/core/src/test/scala/org/apache/spark/sql/execution/command/v1/ShowTablesSuite.scala:115:72: value deep is not a member of Array[String]
[error]         assert(sql("show table extended like 'tbl'").schema.fieldNames.deep ==
[error]                                                                        ^
[error] /home/runner/work/spark/spark/sql/core/src/test/scala/org/apache/spark/sql/execution/command/v1/ShowTablesSuite.scala:121:55: value deep is not a member of Array[String]
[error]           assert(sql("show tables").schema.fieldNames.deep ==
[error]                                                       ^
[error] /home/runner/work/spark/spark/sql/core/src/test/scala/org/apache/spark/sql/execution/command/v1/ShowTablesSuite.scala:124:74: value deep is not a member of Array[String]
[error]           assert(sql("show table extended like 'tbl'").schema.fieldNames.deep ==
[error]                                                                          ^
```

It broke the Scala 2.13 build. This PR works around the issue by using ScalaTest's `===`, which can compare `Array`s safely.
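
A minimal sketch of the workaround in an illustrative ScalaTest suite: ScalaTest's `===` compares arrays element-wise, so `.deep` (which no longer exists in Scala 2.13) is not needed.
```scala
import org.scalatest.funsuite.AnyFunSuite

class ArrayEqualitySuite extends AnyFunSuite {
  test("compare arrays element-wise") {
    val fieldNames = Array("namespace", "tableName", "isTemporary")
    // === performs an element-wise comparison for arrays, unlike ==
    assert(fieldNames === Array("namespace", "tableName", "isTemporary"))
  }
}
```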

### Why are the changes needed?

To fix the build.

### Does this PR introduce _any_ user-facing change?

No, dev-only.

### How was this patch tested?

CI in this PR should test it out.

Closes #31526 from HyukjinKwon/SPARK-34157.

Authored-by: HyukjinKwon <gurwls223@apache.org>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2021-02-08 22:25:59 +09:00
Angerszhuuuu 70a79e920a [SPARK-34239][SQL][FOLLOW_UP] SHOW COLUMNS Keep consistence with other SHOW command
### What changes were proposed in this pull request?
Keep consistency with other `SHOW` commands, according to https://github.com/apache/spark/pull/31341#issuecomment-774613080

### Why are the changes needed?
Keep consistency

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
Not needed

Closes #31518 from AngersZhuuuu/SPARK-34239-followup.

Authored-by: Angerszhuuuu <angers.zhu@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2021-02-08 11:39:59 +00:00
gengjiaan 2c243c93d9 [SPARK-34157][SQL] Unify output of SHOW TABLES and pass output attributes properly
### What changes were proposed in this pull request?
The current implementation of some DDL commands does not unify the output and does not pass the output properly to the physical command.
For example, `ShowTables` outputs the attribute `namespace`, but `ShowTablesCommand` outputs the attribute `database`.

In the query plan, this PR passes the output attributes from `ShowTables` to `ShowTablesCommand`, and from `ShowTableExtended` to `ShowTablesCommand`.

Take `show tables` and `show table extended like 'tbl'` as example.
The output before this PR:
`show tables`
| database | tableName | isTemporary |
| -- | -- | -- |
| default | tbl | false |

If catalog is v2 session catalog, the output before this PR:
| namespace | tableName |
| -- | -- |
| default | tbl |

`show table extended like 'tbl'`
| database | tableName | isTemporary | information |
| -- | -- | -- | -- |
| default | tbl | false | Database: default... |

The output after this PR:
`show tables`
| namespace | tableName | isTemporary |
| -- | -- | -- |
| default | tbl | false |

`show table extended like 'tbl'`
| namespace | tableName | isTemporary | information |
| -- | -- | -- | -- |
| default | tbl | false | Database: default... |

### Why are the changes needed?
This PR has the following benefits:
First, it unifies the schema of the SHOW TABLES output.
Second, passing the output attributes keeps the expr IDs unchanged, which avoids bugs when we apply more operators on top of the command output DataFrame (see the sketch below).
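
As a small sketch of the second benefit, operators applied on top of the command's output DataFrame now resolve against the passed-through attributes:
```scala
spark.sql("SHOW TABLES").filter("isTemporary = false").select("namespace", "tableName").show()
```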

### Does this PR introduce _any_ user-facing change?
Yes.
The output schema of `SHOW TABLES` replaces `database` with `namespace`.

### How was this patch tested?
Jenkins test.

Closes #31245 from beliefer/SPARK-34157.

Lead-authored-by: gengjiaan <gengjiaan@360.cn>
Co-authored-by: beliefer <beliefer@163.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2021-02-08 08:39:58 +00:00
ulysses-you 9270238473 [SPARK-34355][SQL] Add log and time cost for commit job
### What changes were proposed in this pull request?

Add some info logs around the commit job.

### Why are the changes needed?

The commit job is a heavy operation, and we have seen Spark block at this point many times due to slow RPCs to the NameNode or other causes.

It's better to record the time that the commit job costs.

### Does this PR introduce _any_ user-facing change?

Yes, more info log.

### How was this patch tested?

Not needed.

Closes #31471 from ulysses-you/add-commit-log.

Authored-by: ulysses-you <ulyssesyou18@gmail.com>
Signed-off-by: Jungtaek Lim (HeartSaVioR) <kabhwan.opensource@gmail.com>
2021-02-08 16:44:59 +09:00
Yuming Wang 6e05e99143 [SPARK-34342][SQL] Format DateLiteral and TimestampLiteral toString
### What changes were proposed in this pull request?

This PR formats the `toString` output of `DateLiteral` and `TimestampLiteral`. For example:
```sql
SELECT * FROM date_dim WHERE d_date BETWEEN (cast('2000-03-11' AS DATE) - INTERVAL 30 days) AND (cast('2000-03-11' AS DATE) + INTERVAL 30 days)
```
Before this pr:
```
Condition : (((isnotnull(d_date#18) AND (d_date#18 >= 10997)) AND (d_date#18 <= 11057))
```
After this pr:
```
Condition : (((isnotnull(d_date#14) AND (d_date#14 >= 2000-02-10)) AND (d_date#14 <= 2000-04-10))
```

### Why are the changes needed?

Make the plan more readable.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Unit test.

Closes #31455 from wangyum/SPARK-34342.

Authored-by: Yuming Wang <yumwang@ebay.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2021-02-06 19:49:38 -08:00
“attilapiros” cc508d17c7 [SPARK-34370][SQL] Support Avro schema evolution for partitioned Hive tables using "avro.schema.url"
### What changes were proposed in this pull request?

With https://github.com/apache/spark/pull/31133, Avro schema evolution was introduced for partitioned Hive tables where the schema is given by `avro.schema.literal`.
Here that functionality is extended to support schema evolution where the schema is defined via `avro.schema.url`.

### Why are the changes needed?

Without this PR the problem described in https://github.com/apache/spark/pull/31133 can be reproduced with tables where `avro.schema.url` is used, because in this case the property value given at the partition level is always used for `avro.schema.url`.

So for example, when a new column (with a default value) is added to the table, one of the following problems happens:
- when the new field is added after the last one, the cell values will be null instead of the default value
- when the schema is extended somewhere before the last field, values will be listed in the wrong column positions

A similar error happens when one of the fields is removed from the schema.

For details please check the attached unit tests where both cases are checked.

### Does this PR introduce _any_ user-facing change?

Fixes the potential value error.

### How was this patch tested?

The existing unit tests for schema evolution is generalized and reused.
New tests:
- `SPARK-34370: support Avro schema evolution (add column with avro.schema.url)`
- `SPARK-34370: support Avro schema evolution (remove column with avro.schema.url)`

Closes #31501 from attilapiros/SPARK-34370.

Authored-by: “attilapiros” <piros.attila.zsolt@gmail.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2021-02-06 17:25:39 -08:00
Max Gekk 6845d26057 [SPARK-34385][SQL] Unwrap SparkUpgradeException in v2 Parquet datasource
### What changes were proposed in this pull request?
Unwrap `SparkUpgradeException` from `ParquetDecodingException` in v2 `FilePartitionReader` in the same way as v1 implementation does: 3a299aa648/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/FileScanRDD.scala (L180-L183)

### Why are the changes needed?
1. To be compatible with v1 implementation of the Parquet datasource.
2. To improve UX with Spark SQL by making `SparkUpgradeException` more visible.

### Does this PR introduce _any_ user-facing change?
Yes, it can.

### How was this patch tested?
By running the affected test suites:
```
$ build/sbt "sql/test:testOnly *ParquetRebaseDatetimeV1Suite"
$ build/sbt "sql/test:testOnly *ParquetRebaseDatetimeV2Suite"
```

Closes #31497 from MaxGekk/parquet-spark-upgrade-exception.

Authored-by: Max Gekk <max.gekk@gmail.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2021-02-06 16:49:15 -08:00
tanel.kiis@gmail.com c73f70bb0d [SPARK-34141][SQL] Remove side effect from ExtractGenerator
### What changes were proposed in this pull request?

Rewrote one `ExtractGenerator` case such that it would not rely on a side effect of the flatmap function.

### Why are the changes needed?

With the dataframe api it is possible to have a lazy sequence as the `output` of a `LogicalPlan`. When exploding a column on this dataframe using the `withColumn("newName", explode(col("name")))` method, the `ExtractGenerator` does not extract the generator and `CheckAnalysis` would throw an exception.

### Does this PR introduce _any_ user-facing change?

Bugfix
Before this, the work around was to put `.select("*")` before the explode.

### How was this patch tested?

UT

Closes #31213 from tanelk/SPARK-34141_extract_generator.

Authored-by: tanel.kiis@gmail.com <tanel.kiis@gmail.com>
Signed-off-by: Sean Owen <srowen@gmail.com>
2021-02-06 13:27:07 -06:00
“attilapiros” e614f34c7a [SPARK-26836][SQL] Supporting Avro schema evolution for partitioned Hive tables with "avro.schema.literal"
### What changes were proposed in this pull request?

Before this PR, for a partitioned Avro Hive table, when the SerDe is configured to read the partition data,
the table-level properties were overwritten by the partition-level properties.

This PR changes this ordering by giving table-level properties higher precedence, so when a new evolved schema
is set for the table, this new schema is used to read the partition data instead of the original schema that was used for writing the data.

This new behavior is consistent with Apache Hive.
See the example used in the unit test `SPARK-26836: support Avro schema evolution`, in Hive this results in:

```
0: jdbc:hive2://<IP>:10000> select * from t;
INFO  : Compiling command(queryId=hive_20210111141102_7a6349d0-f9ed-4aad-ac07-b94b44de2394): select * from t
INFO  : Semantic Analysis Completed
INFO  : Returning Hive schema: Schema(fieldSchemas:[FieldSchema(name:t.col1, type:string, comment:null), FieldSchema(name:t.col2, type:string, comment:null), FieldSchema(name:t.ds, type:string, comment:null)], properties:null)
INFO  : Completed compiling command(queryId=hive_20210111141102_7a6349d0-f9ed-4aad-ac07-b94b44de2394); Time taken: 0.098 seconds
INFO  : Executing command(queryId=hive_20210111141102_7a6349d0-f9ed-4aad-ac07-b94b44de2394): select * from t
INFO  : Completed executing command(queryId=hive_20210111141102_7a6349d0-f9ed-4aad-ac07-b94b44de2394); Time taken: 0.013 seconds
INFO  : OK
+---------------+-------------+-------------+
|    t.col1     |   t.col2    |    t.ds     |
+---------------+-------------+-------------+
| col1_default  | col2_value  | 1981-01-07  |
| col1_value    | col2_value  | 1983-04-27  |
+---------------+-------------+-------------+
2 rows selected (0.159 seconds)
```

### Why are the changes needed?

Without this change the old schema would be used. This can cause a correctness issue when the new schema introduces
a new field with a default value (following the rules of schema evolution) before an existing field.
In this case the rows coming from the partition where the old schema was used will **contain values in wrong column positions**.

For example check the attached unit test `SPARK-26836: support Avro schema evolution`

Without this fix the result of the select on the table would be:

```
+----------+----------+----------+
|      col1|      col2|        ds|
+----------+----------+----------+
|col2_value|      null|1981-01-07|
|col1_value|col2_value|1983-04-27|
+----------+----------+----------+

```

With this fix:

```
+------------+----------+----------+
|        col1|      col2|        ds|
+------------+----------+----------+
|col1_default|col2_value|1981-01-07|
|  col1_value|col2_value|1983-04-27|
+------------+----------+----------+
```

### Does this PR introduce _any_ user-facing change?

It just fixes the value errors.
When a new column is introduced, even at the last position, the given default will be used instead of 'null'.

### How was this patch tested?

This was tested with the unit test included in the PR,
and manually on Apache Spark / Hive.

Closes #31133 from attilapiros/SPARK-26836.

Authored-by: “attilapiros” <piros.attila.zsolt@gmail.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2021-02-05 10:56:25 -08:00
Cheng Su 76baaf7465 [SPARK-32985][SQL] Decouple bucket scan and bucket filter pruning for data source v1
### What changes were proposed in this pull request?

As a followup from the discussion in https://github.com/apache/spark/pull/29804#discussion_r493100510 . Currently in the data source v1 file scan `FileSourceScanExec`, [bucket filter pruning will only take effect with bucketed table scan](https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/DataSourceScanExec.scala#L542 ). However this is unnecessary, as bucket filter pruning can also happen if we disable bucketed table scan. Reading files with bucket hash partitioning and bucket filter pruning are two orthogonal features and do not need to be coupled together.

### Why are the changes needed?

This helps queries leverage bucket filter pruning to save CPU/IO by not reading unnecessary bucket files, without being bound to bucketed table scan when task parallelism is a concern.

In addition, this also resolves the issue of reducing the number of tasks launched for a simple query with a bucket column filter - SPARK-33207 - because with bucketed scan we launch a number of tasks equal to the number of buckets, which is unnecessary.

### Does this PR introduce _any_ user-facing change?

Users will notice that queries start pruning irrelevant files when reading a bucketed table with bucketing disabled, as sketched below. If the input data does not follow the Spark data source bucketing convention, an exception is thrown by default and the query fails. The exception can be bypassed by setting the config `spark.sql.files.ignoreCorruptFiles` to true.
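A rough usage sketch (illustrative only; it assumes a SparkSession named `spark`, a table bucketed by `id`, and that `spark.sql.sources.bucketing.enabled` is the switch used to disable bucketed table scan):

```scala
import org.apache.spark.sql.functions.col

// With bucketed table scan disabled, a filter on the bucket column can still prune bucket files.
spark.conf.set("spark.sql.sources.bucketing.enabled", "false")
spark.table("bucketed_tbl").filter(col("id") === 1).show()
```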

### How was this patch tested?

Added unit test in `BucketedReadSuite.scala` to make all existing unit tests for bucket filter work with this PR.

Closes #31413 from c21/bucket-pruning.

Authored-by: Cheng Su <chengsu@fb.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2021-02-05 13:00:06 +00:00
Wenchen Fan 989eb6884d [SPARK-34331][SQL] Speed up DS v2 metadata col resolution
### What changes were proposed in this pull request?

This is a follow-up of https://github.com/apache/spark/pull/28027

https://github.com/apache/spark/pull/28027 added a DS v2 API that allows data sources to produce metadata/hidden columns that can only be seen when it's explicitly selected. The way we integrate this API into Spark is:
1. The v2 relation gets normal output and metadata output from the data source, and the metadata output is excluded from the plan output by default.
2. column resolution can resolve `UnresolvedAttribute` with metadata columns, even if the child plan doesn't output metadata columns.
3. An analyzer rule searches the query plan, trying to find a node that has missing inputs. If such a node is found, transform the sub-plan of this node, and update the v2 relation to include the metadata output.

The analyzer rule in step 3 brings a perf regression for queries that do not read v2 tables at all. This rule calculates `QueryPlan.inputSet` (which builds an `AttributeSet` from the outputs of all children) and `QueryPlan.missingInput` (which does a set exclusion and creates a new `AttributeSet`) for every plan node in the query plan. In our benchmark, the TPCDS query compilation time increases by more than 10%.

This PR proposes a simple way to improve it: we add a special metadata entry to the metadata attribute, which allows us to quickly check whether a plan needs metadata columns to be added. We just check all the references of the plan and see if any attribute contains the special metadata entry, instead of calculating `QueryPlan.missingInput`, as sketched below.
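A rough sketch of the fast path (the metadata key name and the exact shape of the check are assumptions for illustration, not the actual rule implementation):

```scala
import org.apache.spark.sql.catalyst.plans.logical.LogicalPlan

// Instead of computing QueryPlan.missingInput for every node, only look for a marker
// entry in the metadata of the attributes that the plan references.
def referencesMetadataCol(plan: LogicalPlan): Boolean =
  plan.expressions.exists(_.references.exists(_.metadata.contains("__metadata_col")))
```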

This PR also fixes one bug: we should not change the final output schema of the plan, if we only use metadata columns in operators like filter, sort, etc.

### Why are the changes needed?

Fix perf regression in SQL query compilation, and fix a bug.

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

Run `org.apache.spark.sql.TPCDSQuerySuite`, before this PR, `AddMetadataColumns` is the top 4 rule ranked by running time
```
=== Metrics of Analyzer/Optimizer Rules ===
Total number of runs: 407641
Total time: 47.257239779 seconds

Rule                                  Effective Time / Total Time                     Effective Runs / Total Runs

OptimizeSubqueries                      4157690003 / 8485444626                         49 / 2778
Analyzer$ResolveAggregateFunctions      1238968711 / 3369351761                         49 / 2141
ColumnPruning                           660038236 / 2924755292                          338 / 6391
Analyzer$AddMetadataColumns             0 / 2918352992                                  0 / 2151
```
after this PR:
```
Analyzer$AddMetadataColumns             0 / 122885629                                   0 / 2151
```
This rule is now about 20 times faster and its cost is negligible relative to the total compilation time.

This PR also adds new tests to verify the bug fix.

Closes #31440 from cloud-fan/metadata-col.

Authored-by: Wenchen Fan <wenchen@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2021-02-05 16:37:29 +08:00
Max Gekk ee11a8f407 [SPARK-34371][SQL][TESTS] Run the datetime rebasing tests for Parquet datasource v1 and v2
### What changes were proposed in this pull request?
Extract the date/timestamp rebasing tests from `ParquetIOSuite` to `ParquetRebaseDatetimeSuite` to run them for both the DSv1 and DSv2 implementations of the Parquet datasource.

### Why are the changes needed?
To improve test coverage.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
By running new test suites:
```
$ build/sbt "sql/test:testOnly *ParquetRebaseDatetimeV2Suite"
$ build/sbt "sql/test:testOnly *ParquetRebaseDatetimeV1Suite"
$ build/sbt "sql/test:testOnly *ParquetIOSuite"
```

Closes #31478 from MaxGekk/rebase-tests-dsv1-and-dsv2.

Authored-by: Max Gekk <max.gekk@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2021-02-05 07:10:37 +00:00
Wenchen Fan 361d702f8d [SPARK-34359][SQL] Add a legacy config to restore the output schema of SHOW DATABASES
### What changes were proposed in this pull request?

This is a followup of https://github.com/apache/spark/pull/26006

In #26006 , we merged the v1 and v2 SHOW DATABASES/NAMESPACES commands, but we missed a behavior change: the output schema of SHOW DATABASES became different.

This PR adds a legacy config to restore the old schema, with a migration guide item to mention this behavior change.

### Why are the changes needed?

Improve backward compatibility

### Does this PR introduce _any_ user-facing change?

No (the legacy config is false by default)

### How was this patch tested?

a new test

Closes #31474 from cloud-fan/command-schema.

Lead-authored-by: Wenchen Fan <cloud0fan@gmail.com>
Co-authored-by: Wenchen Fan <wenchen@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2021-02-05 04:57:51 +00:00
Kent Yao 961c85166a [SPARK-34346][CORE][SQL] io.file.buffer.size set by spark.buffer.size will override by loading hive-site.xml accidentally may cause perf regression
### What changes were proposed in this pull request?

In many real-world cases, when interacting with hive catalog through Spark SQL, users may just share the `hive-site.xml` for their hive jobs and make a copy to `SPARK_HOME`/conf w/o modification. In Spark, when we generate Hadoop configurations, we will use `spark.buffer.size(65536)` to reset `io.file.buffer.size(4096)`. But when we load the hive-site.xml, we may ignore this behavior and reset `io.file.buffer.size` again according to `hive-site.xml`.

1. The configuration priority for setting the Hadoop and Hive configs here is not right; the order should be `spark > spark.hive > spark.hadoop > hive > hadoop`

2. This breaks the `spark.buffer.size` config's behavior for tuning the IO performance w/ HDFS if there is an existing `io.file.buffer.size` in hive-site.xml

### Why are the changes needed?

Bugfix for the configuration behavior; it also fixes the performance regression caused by that behavior change.

### Does this PR introduce _any_ user-facing change?

This PR restores the user-facing behavior that was silently changed.

### How was this patch tested?

new tests

Closes #31460 from yaooqinn/SPARK-34346.

Authored-by: Kent Yao <yao@apache.org>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2021-02-05 10:13:19 +09:00
Jungtaek Lim (HeartSaVioR) fbe726f5b1 [SPARK-34339][CORE][SQL] Expose the number of total paths in Utils.buildLocationMetadata()
### What changes were proposed in this pull request?

This PR proposes to expose the total number of paths in `Utils.buildLocationMetadata()`, relaxing the space usage a bit (around 10+ extra characters).

Suppose only the first 2 of 5 paths fit within the threshold; the outputs before and after the change are shown below (see also the sketch after this list):

* before the change: `[path1, path2]`
* after the change: `(5 paths)[path1, path2, ...]`
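A minimal sketch of the described output format (an illustrative helper, not the actual `Utils` implementation; the threshold parameter name is assumed):

```scala
def buildLocationMetadata(paths: Seq[String], stopAppendingThreshold: Int): String = {
  var length = 0
  // Keep appending paths while the accumulated length stays within the threshold.
  val shown = paths.takeWhile { p =>
    length += p.length + 2 // rough accounting for the ", " separator
    length <= stopAppendingThreshold
  }
  val ellipsis = if (shown.size < paths.size) ", ..." else ""
  s"(${paths.size} paths)[${shown.mkString(", ")}$ellipsis]"
}
```

With this sketch, `buildLocationMetadata(Seq("path1", "path2", "path3", "path4", "path5"), 15)` yields `(5 paths)[path1, path2, ...]`.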

### Why are the changes needed?

SPARK-31793 silently truncates the paths, hence end users cannot tell how many paths were truncated, or even whether any paths were truncated at all.

### Does this PR introduce _any_ user-facing change?

Yes, the location metadata will also show how many paths there are in total (including the truncated ones that are not shown), instead of truncating silently.

### How was this patch tested?

Modified UTs

Closes #31464 from HeartSaVioR/SPARK-34339.

Authored-by: Jungtaek Lim (HeartSaVioR) <kabhwan.opensource@gmail.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2021-02-05 09:37:38 +09:00
Hoa 7675582dab [SPARK-34357][SQL] Map JDBC SQL TIME type to TimestampType with time portion fixed regardless of timezone
### What changes were proposed in this pull request?

For user-experience reasons (it is confusing to Spark users that java.sql.Time uses milliseconds while Spark uses microseconds, and users lose useful functions like hour(), minute(), etc. on the column), we have decided to revert back to using TimestampType, but this time we enforce that the time-of-day is consistent across system timezones (via offset manipulation) and that the date part is fixed to the zero epoch.
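An illustrative conversion only (not the actual JdbcUtils code): keep the time-of-day, pin the date part to 1970-01-01, and interpret the result in the default time zone.

```scala
import java.time.{LocalDate, ZoneId}

def timeToMicros(t: java.sql.Time): Long = {
  // Attach the epoch date to the time-of-day and convert it in the system default zone.
  val instant = t.toLocalTime
    .atDate(LocalDate.of(1970, 1, 1))
    .atZone(ZoneId.systemDefault())
    .toInstant
  Math.multiplyExact(instant.getEpochSecond, 1000000L) + instant.getNano / 1000L
}
```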

The full discussion with Wenchen Fan regarding this ticket is here: https://github.com/apache/spark/pull/30902#discussion_r569186823

### Why are the changes needed?

Revert and improve the java.sql.Time handling.

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

Unit tests and integration tests

Closes #31473 from saikocat/SPARK-34357.

Authored-by: Hoa <hoameomu@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2021-02-04 17:16:39 +00:00
Jungtaek Lim (HeartSaVioR) 44dcf0062c [SPARK-34326][CORE][SQL] Fix UTs added in SPARK-31793 depending on the length of temp path
### What changes were proposed in this pull request?

This PR proposes to fix the UTs being added in SPARK-31793, so that all things contributing the length limit are properly accounted.

### Why are the changes needed?

The test `DataSourceScanExecRedactionSuite.SPARK-31793: FileSourceScanExec metadata should contain limited file paths` is failing conditionally, depending on the length of the temp directory.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Modified UTs explain the missing points, which also do the test.

Closes #31449 from HeartSaVioR/SPARK-34326-v2.

Authored-by: Jungtaek Lim (HeartSaVioR) <kabhwan.opensource@gmail.com>
Signed-off-by: Jungtaek Lim <kabhwan.opensource@gmail.com>
2021-02-04 08:46:11 +09:00
Terry Kim 3d7e1397d6 [SPARK-34317][SQL][FOLLOW-UP] Use relationTypeMismatchHint when UnresolvedTable is resolved to a temp view
### What changes were proposed in this pull request?

This is a follow up to #31424, and proposes to use `UnresolvedTable.relationTypeMismatchHint` when `UnresolvedTable` is resolved to a temp view.

### Why are the changes needed?

This change utilizes the type mismatch hint when a relation is resolved to a temp view while a table is expected.

For example, `ALTER TABLE tmpView SET TBLPROPERTIES ('p' = 'an')` will now include `Please use ALTER VIEW instead.` in the exception message: `tmpView is a temp view. 'ALTER TABLE ... SET TBLPROPERTIES' expects a table. Please use ALTER VIEW instead.`

### Does this PR introduce _any_ user-facing change?

Yes, adds the hint in the exception message.

### How was this patch tested?

Update existing tests to include the hint.

Closes #31452 from imback82/followup_SPARK-34317.

Authored-by: Terry Kim <yuminkim@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2021-02-03 16:12:27 +00:00
Max Gekk 7bfb4a4642 [SPARK-34304][SQL] Remove view checks in v1 alter table commands
### What changes were proposed in this pull request?
Remove the check `verifyAlterTableType()` from the following v1 commands:
- AlterTableAddPartitionCommand
- AlterTableDropPartitionCommand
- AlterTableRenamePartitionCommand
- AlterTableRecoverPartitionsCommand
- AlterTableSerDePropertiesCommand
- AlterTableSetLocationCommand

The check is not needed any more after the migration to the new resolution framework; see SPARK-29900.

Also new tests were added to:
- AlterTableAddPartitionSuiteBase
- AlterTableDropPartitionSuiteBase
- AlterTableRenamePartitionSuiteBase
- v1/AlterTableRecoverPartitionsSuite

and removed duplicate tests from `SQLViewSuite` and `HiveDDLSuite`.

The tests for `AlterTableSerDePropertiesCommand`/`AlterTableSetLocationCommand` exist in `SQLViewSuite` and `HiveDDLSuite`, and they can be ported to unified tests after SPARK-34305 and SPARK-34332.

The `ALTER TABLE .. CHANGE COLUMN` command accepts only tables too, so the check can be removed after the migration to the new resolution framework (SPARK-34302).

### Why are the changes needed?
To improve code maintenance by removing dead code.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
1. Added new tests to unified test suites:
```
$ build/sbt -Phive-2.3 -Phive-thriftserver "test:testOnly *AlterTableAddPartitionSuite"
$ build/sbt -Phive-2.3 -Phive-thriftserver "test:testOnly *AlterTableDropPartitionSuite"
$ build/sbt -Phive-2.3 -Phive-thriftserver "test:testOnly *AlterTableRenamePartitionSuite"
$ build/sbt -Phive-2.3 -Phive-thriftserver "test:testOnly *AlterTableRecoverPartitionsSuite"
```
2. Run the modified test suites:
```
$ build/sbt -Phive-2.3 -Phive-thriftserver "test:testOnly *SQLViewSuite"
$ build/sbt -Phive-2.3 -Phive-thriftserver "test:testOnly *HiveDDLSuite"
```

Closes #31405 from MaxGekk/remove-view-check-in-alter-table.

Authored-by: Max Gekk <max.gekk@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2021-02-03 16:11:36 +00:00
allisonwang-db 76a7fca4e1 [SPARK-34335][SQL] Support referencing subquery with column aliases by table alias
### What changes were proposed in this pull request?
This PR adds support for referencing subquery with column aliases by its table alias.

Before
```sql
-- AnalysisException: cannot resolve '`t.c1`' given input columns: [c1, c2];
SELECT t.c1, t.c2 FROM (SELECT 1 AS a, 1 AS b) t(c1, c2)
```

After:
```sql
-- [(1, 1)]
SELECT t.c1, t.c2 FROM (SELECT 1 AS a, 1 AS b) t(c1, c2)
```

### Why are the changes needed?
To allow users to reference subquery with column aliases by its table alias.

### Does this PR introduce _any_ user-facing change?
Yes

### How was this patch tested?
Added parser tests and SQL query tests.

Closes #31444 from allisonwang-db/spark-34335.

Authored-by: allisonwang-db <66282705+allisonwang-db@users.noreply.github.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2021-02-03 08:51:28 +00:00
Terry Kim a1d4bb3300 [SPARK-34313][SQL] Migrate ALTER TABLE SET/UNSET TBLPROPERTIES commands to use UnresolvedTable to resolve the identifier
### What changes were proposed in this pull request?

This PR proposes to migrate `ALTER TABLE ... SET/UNSET TBLPROPERTIES` to use `UnresolvedTable` to resolve the table identifier. This allows consistent resolution rules (temp view first, etc.) to be applied for both v1/v2 commands. More info about the consistent resolution rule proposal can be found in [JIRA](https://issues.apache.org/jira/browse/SPARK-29900) or [proposal doc](https://docs.google.com/document/d/1hvLjGA8y_W_hhilpngXVub1Ebv8RsMap986nENCFnrg/edit?usp=sharing).

### Why are the changes needed?

This is a part of effort to make the relation lookup behavior consistent: [SPARK-29900](https://issues.apache.org/jira/browse/SPARK-29900).

### Does this PR introduce _any_ user-facing change?

After this PR, `ALTER TABLE SET/UNSET TBLPROPERTIES` will have a consistent resolution behavior.

### How was this patch tested?

Updated existing tests / added new tests.

Closes #31422 from imback82/v2_alter_table_set_unset_properties.

Authored-by: Terry Kim <yuminkim@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2021-02-03 05:44:58 +00:00
Ruifeng Zheng fc80a5b877 [SPARK-34307][SQL] TakeOrderedAndProjectExec avoid shuffle if input rdd has single partition
### What changes were proposed in this pull request?
When the child RDD has only one partition, skip the shuffle, as sketched below.
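A minimal RDD-level sketch of the idea (illustrative only, not the operator's actual implementation):

```scala
import org.apache.spark.rdd.RDD

// Take the smallest k elements; only shuffle down to a single partition
// when the input actually has more than one partition.
def takeOrderedNoExtraShuffle(rdd: RDD[Int], k: Int): Array[Int] = {
  val locallyLimited = rdd.mapPartitions(_.toArray.sorted.take(k).iterator)
  val single =
    if (locallyLimited.getNumPartitions > 1) locallyLimited.repartition(1)
    else locallyLimited
  single.mapPartitions(_.toArray.sorted.take(k).iterator).collect()
}
```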

### Why are the changes needed?
avoid unnecessary shuffle

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
existing testsuites

Closes #31409 from zhengruifeng/limit_with_single_partition.

Authored-by: Ruifeng Zheng <ruifengz@foxmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2021-02-03 04:49:08 +00:00
HyukjinKwon e927bf90e0 Revert "[SPARK-34326][CORE][SQL] Fix UTs added in SPARK-31793 depending on the length of temp path"
This reverts commit 63866025d2.
2021-02-03 12:32:39 +09:00
Kousuke Saruta 603a7fd7b6 [SPARK-34308][SQL] Escape meta-characters in printSchema
### What changes were proposed in this pull request?

Similar to SPARK-33690, this PR improves the output layout of `printSchema` for the case column names contain meta characters.
Here is an example.

Before:
```
scala> val df1 = spark.sql("SELECT 'aaa\nbbb\tccc\rddd\feee\bfff\u000Bggg\u0007hhh'")
scala> df1.printSchema
root
 |-- aaa
ddd	ccc
   eefff
        ggghhh: string (nullable = false)
```

After:
```
scala> df1.printSchema
root
 |-- aaa\nbbb\tccc\rddd\feee\bfff\vggg\ahhh: string (nullable = false)
```

### Why are the changes needed?

To avoid breaking the layout of `Dataset#printSchema`

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

New test.

Closes #31412 from sarutak/escape-meta-printSchema.

Authored-by: Kousuke Saruta <sarutak@oss.nttdata.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2021-02-03 11:06:41 +09:00
Wenchen Fan 00120ea537 [SPARK-34212][SQL][FOLLOWUP] Parquet vectorized reader can read decimal fields with a larger precision
### What changes were proposed in this pull request?

This is a followup of https://github.com/apache/spark/pull/31357

#31357 added a very strong restriction to the vectorized Parquet reader: the Spark data type must exactly match the physical Parquet type when reading decimal fields. This restriction is actually not necessary, as we can safely read Parquet decimals with a larger precision. This PR relaxes this restriction a little bit, as illustrated below.
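An illustrative version of the relaxed check (assumed logic for the sketch, not the reader's actual code): a Parquet decimal can be read into a Spark `DecimalType` when the scale matches and the Spark precision is at least as large.

```scala
def canReadParquetDecimal(
    parquetPrecision: Int, parquetScale: Int,
    sparkPrecision: Int, sparkScale: Int): Boolean =
  sparkScale == parquetScale && sparkPrecision >= parquetPrecision
```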

### Why are the changes needed?

To not fail queries unnecessarily.

### Does this PR introduce _any_ user-facing change?

Yes, now users can read parquet decimals with mismatched `DecimalType` as long as the scale is the same and precision is larger.

### How was this patch tested?

updated test.

Closes #31443 from cloud-fan/improve.

Authored-by: Wenchen Fan <wenchen@databricks.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2021-02-03 09:26:36 +09:00
Jungtaek Lim (HeartSaVioR) 63866025d2 [SPARK-34326][CORE][SQL] Fix UTs added in SPARK-31793 depending on the length of temp path
### What changes were proposed in this pull request?

This PR proposes to fix the UTs being added in SPARK-31793, so that all things contributing the length limit are properly accounted.

### Why are the changes needed?

The test `DataSourceScanExecRedactionSuite.SPARK-31793: FileSourceScanExec metadata should contain limited file paths` is failing conditionally, depending on the length of the temp directory.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Modified UTs explain the missing points, which also do the test.

Closes #31435 from HeartSaVioR/SPARK-34326.

Authored-by: Jungtaek Lim (HeartSaVioR) <kabhwan.opensource@gmail.com>
Signed-off-by: Jungtaek Lim <kabhwan.opensource@gmail.com>
2021-02-03 07:35:22 +09:00
Liang-Chi Hsieh cadca8d352 [SPARK-34324][SQL] FileTable should not list TRUNCATE in capabilities by default
### What changes were proposed in this pull request?

This patch proposes to remove `TRUNCATE` from the default `capabilities` list from `FileTable`.

### Why are the changes needed?

The abstract class `FileTable` currently lists `TRUNCATE` in its `capabilities`, but `FileTable` does not know whether an implementation really supports truncation or not. Specifically, checking the existing `FileTable` implementations including `AvroTable`, `CSVTable`, and `JsonTable`, none of them actually implements `SupportsTruncate` in its writer builder.

### Does this PR introduce _any_ user-facing change?

No, because it seems to me that `FileTable` is not a user-facing public API.

### How was this patch tested?

Existing unit tests.

Closes #31432 from viirya/SPARK-34324.

Authored-by: Liang-Chi Hsieh <viirya@gmail.com>
Signed-off-by: Liang-Chi Hsieh <viirya@gmail.com>
2021-02-02 11:41:05 -08:00
Kousuke Saruta d308794adb [SPARK-34263][SQL] Simplify the code for treating unicode/octal/escaped characters in string literals
### What changes were proposed in this pull request?

In the current master, the code for treating unicode/octal/escaped characters in string literals is a little bit complex so let's simplify it.

### Why are the changes needed?

To keep it easy to maintain.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

`ParserUtilsSuite` passes.

Closes #31362 from sarutak/refactor-unicode-escapes.

Authored-by: Kousuke Saruta <sarutak@oss.nttdata.com>
Signed-off-by: Kousuke Saruta <sarutak@oss.nttdata.com>
2021-02-03 01:07:12 +09:00
Max Gekk 79515b82f1 [SPARK-34282][SQL][TESTS] Unify v1 and v2 TRUNCATE TABLE tests
### What changes were proposed in this pull request?
1. Move parser tests from `DDLParserSuite` to `TruncateTableParserSuite`.
2. Port DS v1 tests from `DDLSuite` and other test suites to `v1.TruncateTableSuiteBase` and to `v1.TruncateTableSuite`.
3. Add a test for DSv2 `TRUNCATE TABLE` to `v2.TruncateTableSuite`.

### Why are the changes needed?
To improve test coverage.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
By running new test suites:
```
$ build/sbt -Phive-2.3 -Phive-thriftserver "test:testOnly *TruncateTableSuite"
$ build/sbt -Phive-2.3 -Phive-thriftserver "test:testOnly *CatalogedDDLSuite"
```

Closes #31387 from MaxGekk/unify-truncate-table-tests.

Authored-by: Max Gekk <max.gekk@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2021-02-02 14:32:35 +00:00
Gengliang Wang ff1b6ecc37 [SPARK-33591][SQL][FOLLOW-UP] Revise the version and doc of spark.sql.legacy.parseNullPartitionSpecAsStringLiteral
### What changes were proposed in this pull request?

Correct the version of SQL configuration `spark.sql.legacy.parseNullPartitionSpecAsStringLiteral` from 3.2.0 to 3.0.2.
Also, revise the documentation and test case.

### Why are the changes needed?

The release version in https://github.com/apache/spark/pull/31421 was wrong.

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

Unit tests

Closes #31434 from gengliangwang/reviseVersion.

Authored-by: Gengliang Wang <gengliang.wang@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2021-02-02 13:51:20 +00:00
gengjiaan 5b2ad59f64 [SPARK-33599][SQL] Restore the assert-like in catalyst/analysis
### What changes were proposed in this pull request?
Some of the exceptions here are in fact assert-like, such as:
`throw new IllegalStateException("[BUG] unexpected plan returned by `lookupV2Relation`: " + other)`

This kind of `Exception` should not be put into the single dedicated error files.

### Why are the changes needed?
Reduce the workload of auditing.

### Does this PR introduce _any_ user-facing change?
'No'.

### How was this patch tested?
Jenkins test.

Closes #31395 from beliefer/SPARK-33599-restore-assert.

Authored-by: gengjiaan <gengjiaan@360.cn>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2021-02-02 13:28:28 +00:00
Kousuke Saruta 66f3480f2b [SPARK-34318][SQL] Dataset.colRegex should work with column names and qualifiers which contain newlines
### What changes were proposed in this pull request?

This PR fixes an issue that `Dataset.colRegex` doesn't work with column names or qualifiers which contain newlines.
In the current master, if column names or qualifiers passed to `colRegex` contain newlines, it throws an exception.
```
val df = Seq(1, 2, 3).toDF("test\n_column").as("test\n_table")
val col1 = df.colRegex("`tes.*\n.*mn`")
org.apache.spark.sql.AnalysisException: Cannot resolve column name "`tes.*
.*mn`" among (test
_column)
  at org.apache.spark.sql.Dataset.org$apache$spark$sql$Dataset$$resolveException(Dataset.scala:272)
  at org.apache.spark.sql.Dataset.$anonfun$resolve$1(Dataset.scala:263)
  at scala.Option.getOrElse(Option.scala:189)
  at org.apache.spark.sql.Dataset.resolve(Dataset.scala:263)
  at org.apache.spark.sql.Dataset.colRegex(Dataset.scala:1407)
  ... 47 elided

val col2 = df.colRegex("test\n_table.`tes.*\n.*mn`")
org.apache.spark.sql.AnalysisException: Cannot resolve column name "test
_table.`tes.*
.*mn`" among (test
_column)
  at org.apache.spark.sql.Dataset.org$apache$spark$sql$Dataset$$resolveException(Dataset.scala:272)
  at org.apache.spark.sql.Dataset.$anonfun$resolve$1(Dataset.scala:263)
  at scala.Option.getOrElse(Option.scala:189)
  at org.apache.spark.sql.Dataset.resolve(Dataset.scala:263)
  at org.apache.spark.sql.Dataset.colRegex(Dataset.scala:1407)
  ... 47 elided
```

### Why are the changes needed?

Column names and qualifiers can contain newlines but `colRegex` can't work with them, so it's a bug.

### Does this PR introduce _any_ user-facing change?

Yes. users can pass column names and qualifiers even though they contain newlines.

### How was this patch tested?

New test.

Closes #31426 from sarutak/fix-colRegex.

Authored-by: Kousuke Saruta <sarutak@oss.nttdata.com>
Signed-off-by: Takeshi Yamamuro <yamamuro@apache.org>
2021-02-02 21:47:11 +09:00
Max Gekk 6d3674bb62 [SPARK-34312][SQL] Support partition(s) truncation by Supports(Atomic)PartitionManagement
### What changes were proposed in this pull request?
1. Add new method `truncatePartition()` to the `SupportsPartitionManagement` interface.
2. Add new method `truncatePartitions()` to the `SupportsAtomicPartitionManagement` interface.
3. Default implementation of new methods in `InMemoryPartitionTable`/`InMemoryAtomicPartitionTable`.

### Why are the changes needed?
This is the first step in supporting of v2 `TRUNCATE TABLE .. PARTITION`.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
By running new tests:
```
$ build/sbt "test:testOnly *SupportsPartitionManagementSuite"
$ build/sbt "test:testOnly *SupportsAtomicPartitionManagementSuite"
```

Closes #31420 from MaxGekk/dsv2-truncate-table-partitions.

Authored-by: Max Gekk <max.gekk@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2021-02-02 08:25:59 +00:00
Terry Kim f024d3051c [SPARK-34317][SQL] Introduce relationTypeMismatchHint to UnresolvedTable for a better error message
### What changes were proposed in this pull request?

This PR proposes to add `relationTypeMismatchHint` to `UnresolvedTable` so that if a relation is resolved to a view when a table is expected, a hint message can be included as a part of the analysis exception message. Note that the same feature is already introduced to `UnresolvedView` in #30636.

This mostly affects `ALTER TABLE` commands, where the analysis exception message will now contain `Please use ALTER VIEW instead`.

### Why are the changes needed?

To give a better error message. (The hint used to exist but got removed for commands that migrated to the new resolution framework)

### Does this PR introduce _any_ user-facing change?

Yes, now `ALTER TABLE` commands include a hint to use `ALTER VIEW` instead.
```
sql("ALTER TABLE v SET SERDE 'whatever'")
```
Before:
```
"v is a view. 'ALTER TABLE ... SET [SERDE|SERDEPROPERTIES]' expects a table.
```
After this PR:
```
"v is a view. 'ALTER TABLE ... SET [SERDE|SERDEPROPERTIES]' expects a table. Please use ALTER VIEW instead.
```

### How was this patch tested?

Updated existing test cases to include the hint.

Closes #31424 from imback82/better_error.

Authored-by: Terry Kim <yuminkim@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2021-02-02 08:24:44 +00:00
Linhong Liu bb9bf66bb6 [SPARK-34199][SQL] Block table.* inside function to follow ANSI standard and other SQL engines
### What changes were proposed in this pull request?
In Spark, `count(table.*)` may produce a very weird result, for example:
```
select count(*) from (select 1 as a, null as b) t;
output: 1
select count(t.*) from (select 1 as a, null as b) t;
output: 0
```
This is because Spark expands `t.*` but converts `*` to `count(1)`, which confuses users. According to the ANSI standard, `count(*)` should always be `count(1)`, while `count(t.*)`
is not allowed. What's more, this is also not allowed by common databases, e.g. MySQL and Oracle.

So, this PR proposes to block the ambiguous behavior and print a clear error message for users.

### Why are the changes needed?
to avoid ambiguous behavior and follow ANSI standard and other SQL engines

### Does this PR introduce _any_ user-facing change?
Yes, `count(table.*)` behavior will be blocked and output an error message.

### How was this patch tested?
newly added and existing tests

Closes #31286 from linhongliu-db/fix-table-star.

Authored-by: Linhong Liu <linhong.liu@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2021-02-02 07:49:50 +00:00
yi.wu e9362c2571 [SPARK-34319][SQL] Resolve duplicate attributes for FlatMapCoGroupsInPandas/MapInPandas
### What changes were proposed in this pull request?

Resolve duplicate attributes for `FlatMapCoGroupsInPandas`.

### Why are the changes needed?

When performing self-join on top of `FlatMapCoGroupsInPandas`, analysis can fail because of conflicting attributes. For example,

```scala
df = spark.createDataFrame([(1, 1)], ("column", "value"))
row = df.groupby("ColUmn").cogroup(
    df.groupby("COLUMN")
).applyInPandas(lambda r, l: r + l, "column long, value long")
row.join(row).show()
```
error:

```scala
...
Conflicting attributes: column#163321L,value#163322L
;;
’Join Inner
:- FlatMapCoGroupsInPandas [ColUmn#163312L], [COLUMN#163312L], <lambda>(column#163312L, value#163313L, column#163312L, value#163313L), [column#163321L, value#163322L]
:  :- Project [ColUmn#163312L, column#163312L, value#163313L]
:  :  +- LogicalRDD [column#163312L, value#163313L], false
:  +- Project [COLUMN#163312L, column#163312L, value#163313L]
:     +- LogicalRDD [column#163312L, value#163313L], false
+- FlatMapCoGroupsInPandas [ColUmn#163312L], [COLUMN#163312L], <lambda>(column#163312L, value#163313L, column#163312L, value#163313L), [column#163321L, value#163322L]
   :- Project [ColUmn#163312L, column#163312L, value#163313L]
   :  +- LogicalRDD [column#163312L, value#163313L], false
   +- Project [COLUMN#163312L, column#163312L, value#163313L]
      +- LogicalRDD [column#163312L, value#163313L], false
...
```

### Does this PR introduce _any_ user-facing change?

Yes, queries like the above example won't fail.

### How was this patch tested?

Added unit tests.

Closes #31429 from Ngone51/fix-conflcting-attrs-of-FlatMapCoGroupsInPandas.

Lead-authored-by: yi.wu <yi.wu@databricks.com>
Co-authored-by: wuyi <yi.wu@databricks.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2021-02-02 16:25:32 +09:00
Gengliang Wang 521397f2f9 [SPARK-33591][SQL][FOLLOWUP] Add legacy config for recognizing null partition spec values
### What changes were proposed in this pull request?

This is a follow up for https://github.com/apache/spark/pull/30538.
It adds a legacy conf `spark.sql.legacy.parseNullPartitionSpecAsStringLiteral` in case users wants the legacy behavior.
It also adds document for the behavior change.

### Why are the changes needed?

In case users want the legacy behavior, they can set `spark.sql.legacy.parseNullPartitionSpecAsStringLiteral` as true.

### Does this PR introduce _any_ user-facing change?

Yes, adding a legacy configuration to restore the old behavior.

### How was this patch tested?

Unit test.

Closes #31421 from gengliangwang/legacyNullStringConstant.

Authored-by: Gengliang Wang <gengliang.wang@databricks.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2021-02-02 16:13:40 +09:00
HyukjinKwon 30468a9015 [SPARK-34306][SQL][PYTHON][R] Use Snake naming rule across the function APIs
### What changes were proposed in this pull request?

This PR completes snake_case rule at functions APIs across the languages, see also SPARK-10621.

In more details, this PR:
- Adds `count_distinct` in Scala, Python, and R, and documents that `count_distinct` is encouraged. This was not deprecated because `countDistinct` is pretty commonly used. We could deprecate it in future releases.
- (Scala-specific) adds `typedlit` but doesn't deprecate `typedLit` which is arguably commonly used. Likewise, we could deprecate in the future releases.
- Deprecates and renames the following (see the usage sketch after this list):
  - `sumDistinct` -> `sum_distinct`
  - `bitwiseNOT` -> `bitwise_not`
  - `shiftLeft` -> `shiftleft` (matched with SQL name in `FunctionRegistry`)
  - `shiftRight` -> `shiftright` (matched with SQL name in `FunctionRegistry`)
  - `shiftRightUnsigned` -> `shiftrightunsigned` (matched with SQL name in `FunctionRegistry`)
  - (Scala-specific) `callUDF` -> `call_udf`
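A tiny usage sketch of the new names (illustrative; it assumes a DataFrame `df` with columns `a` and `b`):

```scala
import org.apache.spark.sql.functions.{bitwise_not, col, count_distinct, sum_distinct}

df.agg(count_distinct(col("a")), sum_distinct(col("b")))
df.select(bitwise_not(col("a")))
```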

### Why are the changes needed?

To keep the consistent naming in APIs.

### Does this PR introduce _any_ user-facing change?

Yes, it deprecates some APIs and add new renamed APIs as described above.

### How was this patch tested?

Unittests were added.

Closes #31408 from HyukjinKwon/SPARK-34306.

Authored-by: HyukjinKwon <gurwls223@apache.org>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2021-02-02 09:29:40 +09:00
yangjie01 9db566a882 [SPARK-34310][CORE][SQL] Replaces map and flatten with flatMap
### What changes were proposed in this pull request?
Replaces `collection.map(f1).flatten(f2)` with `collection.flatMap` where possible. It's semantically consistent but looks simpler.
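For instance (a trivial self-contained illustration):

```scala
val xs = Seq(1, 2, 3)
val before = xs.map(x => Seq(x, -x)).flatten
val after  = xs.flatMap(x => Seq(x, -x))
assert(before == after) // Seq(1, -1, 2, -2, 3, -3)
```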

### Why are the changes needed?
Code simplifications.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
Pass the Jenkins or GitHub Action

Closes #31416 from LuciferYang/SPARK-34310.

Authored-by: yangjie01 <yangjie01@baidu.com>
Signed-off-by: Sean Owen <srowen@gmail.com>
2021-02-01 08:21:35 -06:00
Angerszhuuuu 74116b6b25 [SPARK-34239][SQL] Unify output of SHOW COLUMNS pass output attributes properly
### What changes were proposed in this pull request?
Passing around the output attributes has more benefits, like keeping the expr IDs unchanged, which avoids bugs when more operators are applied on top of the command's output DataFrame.

This PR keeps the SHOW COLUMNS command's output attribute exprIds unchanged.

### Why are the changes needed?
Keep the SHOW COLUMNS command's output attribute exprIds unchanged.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
Added UT

Closes #31377 from AngersZhuuuu/SPARK-34239.

Authored-by: Angerszhuuuu <angers.zhu@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2021-02-01 14:16:03 +00:00
Max Gekk 2b76e6d15c [SPARK-34301][SQL] Use logical plan of alter table in CatalogImpl.recoverPartitions()
### What changes were proposed in this pull request?
Replace v1 exec node `AlterTableRecoverPartitionsCommand` by the logical node `AlterTableRecoverPartitions` in `CatalogImpl.recoverPartitions()`.

### Why are the changes needed?
1. Print user friendly error message for views:
```
my_temp_table is a temp view. 'recoverPartitions()' expects a table
```
Before the changes:
```
Table or view 'my_temp_table' not found in database 'default'
```

2. To not bind to v1 `ALTER TABLE .. RECOVER PARTITIONS`, and to support v2 tables potentially as well.

### Does this PR introduce _any_ user-facing change?
Yes, it can.

### How was this patch tested?
By running new test in `CatalogSuite`:
```
$ build/sbt -Phive -Phive-thriftserver "test:testOnly org.apache.spark.sql.internal.CatalogSuite"
```

Closes #31403 from MaxGekk/catalogimpl-recoverPartitions.

Authored-by: Max Gekk <max.gekk@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2021-02-01 14:09:40 +00:00
Max Gekk 0837c1aa3d [SPARK-34303][SQL] Migrate ALTER TABLE .. SET LOCATION to new resolution framework
### What changes were proposed in this pull request?
1. Remove old statement `AlterTableSetLocationStatement`
2. Introduce new command `AlterTableSetLocation` for  `ALTER TABLE .. SET LOCATION`.

### Why are the changes needed?
This is a part of effort to make the relation lookup behavior consistent: SPARK-29900.

### Does this PR introduce _any_ user-facing change?
It can change the error message for views.

### How was this patch tested?
By running `ALTER TABLE .. SET LOCATION` tests:
```
$ build/sbt -Phive -Phive-thriftserver "test:testOnly *DataSourceV2SQLSuite"
$ build/sbt -Phive -Phive-thriftserver "test:testOnly *CatalogedDDLSuite"
```

Closes #31414 from MaxGekk/migrate-set-location-resolv-table.

Authored-by: Max Gekk <max.gekk@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2021-02-01 13:41:15 +00:00
Max Gekk 95302756f1 [SPARK-34266][SQL][DOCS] Update comments for SessionCatalog.refreshTable() and CatalogImpl.refreshTable()
### What changes were proposed in this pull request?
Describe `SessionCatalog.refreshTable()` and `CatalogImpl.refreshTable()`. what they do and when they are supposed to be used.

### Why are the changes needed?
To improve code maintenance.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
By running `./dev/scalastyle`

Closes #31364 from MaxGekk/doc-refreshTable.

Authored-by: Max Gekk <max.gekk@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2021-02-01 13:07:05 +00:00
HyukjinKwon 4e7e7ee6e5 [SPARK-33245][SQL][FOLLOW-UP] Remove bitwiseGet in Scala functions API
### What changes were proposed in this pull request?

This PR is a followup that removes `bitwiseGet` in functions API. This is mainly for SQL compliance, and arguably not very much commonly used.

### Why are the changes needed?

See https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/functions.scala#L41-L59

### Does this PR introduce _any_ user-facing change?

No, it's a change in unreleased branches.

### How was this patch tested?

Existing tests should cover.

Closes #31410 from HyukjinKwon/SPARK-33245.

Authored-by: HyukjinKwon <gurwls223@apache.org>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2021-02-01 18:21:27 +09:00
Terry Kim a8eb443bf8 [SPARK-34299][SQL] Clean up ResolveSessionCatalog's isTempView and isTempFunction
### What changes were proposed in this pull request?

`ResolveSessionCatalog`'s `isTempView` and `isTempFunction` are not being used anymore since the resolution of temp view/function has moved to `Analyzer`.

This PR proposes to remove `isTempView` and `isTempFunction` from `ResolveSessionCatalog`.

### Why are the changes needed?

To clean up unused variables.

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

Existing tests should cover as this PR just removes the unused variables.

Closes #31400 from imback82/cleanup_resolve_session_catalog.

Authored-by: Terry Kim <yuminkim@gmail.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2021-01-31 13:03:30 +09:00
Terry Kim bec88a66bd [SPARK-34269][SQL][TESTS][FOLLOWUP] Test a subquery with view in aggregate's grouping expression
### What changes were proposed in this pull request?

This PR is a follow-up to #31368 to add a test case that has a subquery with "view" in aggregate's grouping expression. The existing test tests a subquery with dataframe's temp view, so it doesn't contain a `View` node.

### Why are the changes needed?

To increase the test coverage.

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

Added a new test.

Closes #31352 from imback82/grouping_expr.

Authored-by: Terry Kim <yuminkim@gmail.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2021-01-30 17:07:40 -08:00
Chao Sun 463d4ec350 [SPARK-34269][SQL][TESTS][FOLLOWUP] Add test cases for cache lookup and project removal
### What changes were proposed in this pull request?

This adds a few test cases for looking up cached temporary/permanent view created using clauses such as `ORDER BY` or `LIMIT`.

### Why are the changes needed?

Due to `EliminateView` and how canonicalization is done for `View`, which inserts an extra project operator, cache lookup could fail in the following simple example:
```sql
> CREATE TABLE t (key bigint, value string) USING parquet
> CACHE TABLE v1 AS SELECT * FROM t ORDER BY key
> SELECT * FROM t ORDER BY key
```

#31368 addresses this issue by removing the project operator if `canRemoveProject` check is successful. This PR adds a few tests.

### Does this PR introduce _any_ user-facing change?

NO

### How was this patch tested?

This PR just adds unit tests.

Closes #31182 from sunchao/SPARK-34108.

Authored-by: Chao Sun <sunchao@apple.com>
Signed-off-by: Liang-Chi Hsieh <viirya@gmail.com>
2021-01-30 12:31:57 -08:00
Liang-Chi Hsieh 50d14c98c3 [SPARK-34270][SS] Combine StateStoreMetrics should not override StateStoreCustomMetric
### What changes were proposed in this pull request?

This patch proposes to sum up custom metric values instead of taking an arbitrary one when combining `StateStoreMetrics`.

### Why are the changes needed?

For stateful joins in Structured Streaming, we need to combine the `StateStoreMetrics` from both the left and right sides. Currently we simply take an arbitrary one of the custom metrics with the same name from the left and right, so we lose half of the metric value, as sketched below.
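A minimal sketch of the intended combination (an illustrative helper over plain maps, not the actual `StateStoreMetrics` code):

```scala
// Sum the values of custom metrics that share a name instead of picking one side's value.
def combineCustomMetrics(
    left: Map[String, Long],
    right: Map[String, Long]): Map[String, Long] =
  (left.keySet ++ right.keySet).map { name =>
    name -> (left.getOrElse(name, 0L) + right.getOrElse(name, 0L))
  }.toMap
```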

### Does this PR introduce _any_ user-facing change?

Yes, this corrects metrics collected for stateful join.

### How was this patch tested?

Unit test.

Closes #31369 from viirya/SPARK-34270.

Authored-by: Liang-Chi Hsieh <viirya@gmail.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2021-01-29 20:50:39 -08:00
Yuming Wang f2b22d1487 [SPARK-34289][SQL] Parquet vectorized reader support column index
### What changes were proposed in this pull request?

This pr make parquet vectorized reader support [column index](https://issues.apache.org/jira/browse/PARQUET-1201).

### Why are the changes needed?

Improve filter performance. for example: `id = 1`, we only need to read `page-0` in `block 1`:

```
block 1:
                     null count  min                                       max
page-0                         0  0                                         99
page-1                         0  100                                       199
page-2                         0  200                                       299
page-3                         0  300                                       399
page-4                         0  400                                       449

block 2:
                     null count  min                                       max
page-0                         0  450                                       549
page-1                         0  550                                       649
page-2                         0  650                                       749
page-3                         0  750                                       849
page-4                         0  850                                       899
```

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Unit test and benchmark: https://github.com/apache/spark/pull/31393#issuecomment-769767724

Closes #31393 from wangyum/SPARK-34289.

Authored-by: Yuming Wang <yumwang@ebay.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2021-01-29 09:53:46 -08:00
Max Gekk 588ddcdf22 [SPARK-33163][SQL][TESTS][FOLLOWUP] Fix the test for the parquet metadata key 'org.apache.spark.legacyDateTime'
### What changes were proposed in this pull request?
1. Test both date and timestamp column types
2. Write the timestamp as the `TIMESTAMP_MICROS` logical type
3. Change the timestamp value to `'1000-01-01 01:02:03'` to check exception throwing.

### Why are the changes needed?
To improve test coverage.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
By running the modified test suite:
```
$ build/sbt "testOnly org.apache.spark.sql.execution.datasources.parquet.ParquetIOSuite"
```

Closes #31396 from MaxGekk/parquet-test-metakey-followup.

Authored-by: Max Gekk <max.gekk@gmail.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2021-01-29 22:25:01 +09:00
beliefer 0f7a4977c9 [SPARK-33601][SQL] Group exception messages in catalyst/parser
### What changes were proposed in this pull request?
This PR group exception messages in `/catalyst/src/main/scala/org/apache/spark/sql/catalyst/parser`.

### Why are the changes needed?
It will largely help with standardization of error messages and its maintenance.

### Does this PR introduce _any_ user-facing change?
No. Error messages remain unchanged.

### How was this patch tested?
No new tests - pass all original tests to make sure it doesn't break any existing behavior.

Closes #31293 from beliefer/SPARK-33601.

Authored-by: beliefer <beliefer@163.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2021-01-29 08:57:58 +00:00
Chircu 520e5d2ab8 [SPARK-34144][SQL] Exception thrown when trying to write LocalDate and Instant values to a JDBC relation
### What changes were proposed in this pull request?

When writing rows to a table, only the old date-time API types are handled in `org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils#makeSetter`. If the new API is used (`spark.sql.datetime.java8API.enabled=true`), casting Instant and LocalDate to Timestamp and Date respectively fails. The proposed change is to handle Instant and LocalDate values and transform them to Timestamp and Date, as sketched below.
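A hedged sketch of the described handling (illustrative only, not the exact `makeSetter` change): convert the java.time values to the java.sql types the JDBC setter already knows how to write.

```scala
import java.time.{Instant, LocalDate}

def toJdbcValue(v: Any): Any = v match {
  case i: Instant   => java.sql.Timestamp.from(i) // Instant -> TIMESTAMP
  case d: LocalDate => java.sql.Date.valueOf(d)   // LocalDate -> DATE
  case other        => other
}
```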

### Why are the changes needed?

In the current state writing Instant or LocalDate values to a table fails with something like:
Caused by: java.lang.ClassCastException: class java.time.LocalDate cannot be cast to class java.sql.Date (java.time.LocalDate is in module java.base of loader 'bootstrap'; java.sql.Date is in module java.sql of loader 'platform')
  at org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$.$anonfun$makeSetter$11(JdbcUtils.scala:573)
  at org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$.$anonfun$makeSetter$11$adapted(JdbcUtils.scala:572)
  at org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$.savePartition(JdbcUtils.scala:678)
  at org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$.$anonfun$saveTable$1(JdbcUtils.scala:858)
  at org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$.$anonfun$saveTable$1$adapted(JdbcUtils.scala:856)
  at org.apache.spark.rdd.RDD.$anonfun$foreachPartition$2(RDD.scala:994)
  at org.apache.spark.rdd.RDD.$anonfun$foreachPartition$2$adapted(RDD.scala:994)
  at org.apache.spark.SparkContext.$anonfun$runJob$5(SparkContext.scala:2139)
  at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
  at org.apache.spark.scheduler.Task.run(Task.scala:127)
  at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:446)
  at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1377)
  at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:449)
  ... 3 more

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

Added tests

Closes #31264 from cristichircu/SPARK-34144.

Lead-authored-by: Chircu <chircu@arezzosky.com>
Co-authored-by: Cristi Chircu <cristian.chircu@gmail.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2021-01-29 17:48:13 +09:00
Bo Zhang 3f350dbd78 [SPARK-33212][FOLLOW-UP][BUILD] Fix test "built-in Hadoop version should support shaded client" for hadoop-2.7
### What changes were proposed in this pull request?
We added test "built-in Hadoop version should support shaded client" in https://github.com/apache/spark/pull/31203, but it fails when profile hadoop-2.7 is activated. This change fixes the test by skipping the assertion when Hadoop version is 2.

### Why are the changes needed?
The test fails in master branch when profile hadoop-2.7 is activated.

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
Ran the test with hadoop-2.7 profile.

Closes #31391 from bozhang2820/fix-hadoop-2-version-test.

Authored-by: Bo Zhang <bo.zhang@databricks.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2021-01-29 15:47:02 +09:00
Wenchen Fan b891862fb6 [SPARK-34269][SQL] Simplify SQL view resolution
### What changes were proposed in this pull request?

The currently SQL (temp or permanent) view resolution is done in 2 steps:
1. In `SessionCatalog`, we get the view metadata, parse the view SQL string, and wrap it with `View`.
2. At the beginning of the optimizer, we run `EliminateView`, which drops the wrapper `View`, and apply some special logic to match the view schema.

Step 2 is tricky, as we need to retain the output attributes' expr IDs while also adding an extra `Project` for casts and aliases. This PR simplifies view resolution by building a complete plan (with casts and aliases added) in `SessionCatalog`, so that we only have one step.

### Why are the changes needed?

Code simplification. It also fixes issues like https://github.com/apache/spark/pull/31352

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

existing tests

Closes #31368 from cloud-fan/try.

Authored-by: Wenchen Fan <wenchen@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2021-01-29 06:46:01 +00:00
Cheng Su d871b54a4e [SPARK-34237][SQL] Add more metrics (fallback, spill) to object hash aggregate
### What changes were proposed in this pull request?

This PR is to add two more metrics for `ObjectHashAggregateExec`, i.e. the spill size, and number of fallback to sort-based aggregation.

### Why are the changes needed?

The object hash aggregate fallback mechanism is special: it falls back to sort-based aggregation based on the number of keys seen so far [0]. This fallback logic is sometimes sub-optimal and leads to an unnecessary sort and runtime performance degradation. The first step to help users/developers debug is to add more related metrics on the UI, e.g. spill size and the number of fallbacks to sort-based aggregation. (A spill size metric was already added for hash aggregate [1].)

[0]: https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/aggregate/ObjectAggregationIterator.scala#L161

[1]: https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/aggregate/HashAggregateExec.scala#L68

### Does this PR introduce _any_ user-facing change?

Added two more metrics on Spark UI for operator `ObjectHashAggregateExec`. Screenshot is attached below.

### How was this patch tested?

* Added unit test in `SQLMetricsSuite.scala`.
* Tested on spark shell locally and verified the metrics shown up on UI.

<img width="399" alt="Screen Shot 2021-01-28 at 1 44 40 PM" src="https://user-images.githubusercontent.com/4629931/106204224-7a8a1300-6171-11eb-9814-c3432abadc29.png">

Closes #31340 from c21/object-hash.

Authored-by: Cheng Su <chengsu@fb.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2021-01-29 04:35:58 +00:00
ulysses-you 72b7f8abfb [SPARK-34261][SQL] Avoid side effect if create exists temporary function
### What changes were proposed in this pull request?

Add function exists check before load resource.

### Why are the changes needed?

We should not add the jar to the classpath if the temporary function to be created already exists.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Add test.

Closes #31358 from ulysses-you/SPARK-34261.

Authored-by: ulysses-you <ulyssesyou18@gmail.com>
Signed-off-by: Takeshi Yamamuro <yamamuro@apache.org>
2021-01-29 10:39:02 +09:00
Yuming Wang a7683afdf4 [SPARK-26346][BUILD][SQL] Upgrade Parquet to 1.11.1
### What changes were proposed in this pull request?

This PR upgrade Parquet to 1.11.1.

Parquet 1.11.1 new features:

- [PARQUET-1201](https://issues.apache.org/jira/browse/PARQUET-1201) - Column indexes
- [PARQUET-1253](https://issues.apache.org/jira/browse/PARQUET-1253) - Support for new logical type representation
- [PARQUET-1388](https://issues.apache.org/jira/browse/PARQUET-1388) - Nanosecond precision time and timestamp - parquet-mr

More details:
https://github.com/apache/parquet-mr/blob/apache-parquet-1.11.1/CHANGES.md

### Why are the changes needed?
Support column indexes to improve query performance.

### Does this PR introduce any user-facing change?
No.

### How was this patch tested?
Existing test.

Closes #26804 from wangyum/SPARK-26346.

Authored-by: Yuming Wang <yumwang@ebay.com>
Signed-off-by: Yuming Wang <yumwang@ebay.com>
2021-01-29 08:07:49 +08:00
Cheng Su 3a361cd837 [SPARK-34253][SQL] Object hash aggregate should not fallback if no more input rows
### What changes were proposed in this pull request?

Object hash aggregate falls back to sort-based aggregation based on the number of keys seen so far [0]. The default config threshold is 128 (`spark.sql.objectHashAggregate.sortBased.fallbackThreshold` in [1]). There's an edge case where we can do better: do not fall back if there are no more input rows. Suppose the task has only 128 group-by keys in the hash map; we don't need to fall back in this case, and we can save the extra sort. This is a rare edge case in production, but it can happen.

[0]: https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/aggregate/ObjectAggregationIterator.scala#L161

[1]: https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala#L1615

### Why are the changes needed?

To avoid unnecessary sort in query. Save resource.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Adding a unit test to verify whether a task falls back or not is challenging. Given the change is pretty minor, besides relying on the existing tests in `ObjectHashAggregateSuite.scala`, I manually ran the following query and verified in debug mode that the code path for fallback was not executed with this change, and that it was executed without it.

```
withSQLConf(
  SQLConf.USE_OBJECT_HASH_AGG.key -> "true",
  SQLConf.OBJECT_AGG_SORT_BASED_FALLBACK_THRESHOLD.key -> "1") {
  Seq.fill(1)(Tuple1(Array.empty[Int]))
    .toDF("c0")
    .groupBy(lit(1))
    .agg(typed_count($"c0"), max($"c0")).collect()
}
```

Closes #31353 from c21/object-hash-fallback.

Authored-by: Cheng Su <chengsu@fb.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2021-01-28 15:18:54 +00:00
MrPowers 9ed0e3cebf [SPARK-34165][SQL] Add count_distinct as an option to Dataset#summary
### What changes were proposed in this pull request?

Add `count_distinct` as an option argument to Dataset#summary (the method already supports count, min, max, etc.)

### Why are the changes needed?

The `summary()` method is used for lightweight exploratory data analysis.  A distinct count of all the columns is one of the most common exploratory data analysis queries.

Distinct counts can be expensive, so this shouldn't be enabled by default.  The proposed implementation is completely backwards compatible.

### Does this PR introduce _any_ user-facing change?

Yes, users can now call `df.summary("count_distinct")`, which wasn't an option before.  Users can still call `df.summary()` without any arguments and the output is the same.  `count_distinct` was not added as one of the `defaultStatistics`.
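
A usage sketch, assuming a `SparkSession` named `spark`:

```scala
import spark.implicits._

val df = Seq(("a", 1), ("a", 2), ("b", 2)).toDF("k", "v")

// count_distinct is computed only when explicitly requested; df.summary() is unchanged.
df.summary("count", "count_distinct").show()
```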

### How was this patch tested?

Unit tests.

### Additional comments

If this idea is accepted, we should add a PySpark implementation in this PR, as suggested by zero323.

Closes #31254 from MrPowers/SPARK-34165.

Authored-by: MrPowers <matthewkevinpowers@gmail.com>
Signed-off-by: Sean Owen <srowen@gmail.com>
2021-01-28 08:38:01 -06:00
yangjie01 15445a8d9e [SPARK-34275][CORE][SQL][MLLIB] Replaces filter and size with count
### What changes were proposed in this pull request?
Use `count` to simplify the `filter + size` (or `length`) pattern; it is semantically equivalent but simpler.

**Before**
```
seq.filter(p).size
```

**After**
```
seq.count(p)
```

### Why are the changes needed?
Code simplification.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
Pass the Jenkins or GitHub Action

Closes #31374 from LuciferYang/SPARK-34275.

Authored-by: yangjie01 <yangjie01@baidu.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2021-01-28 15:27:07 +09:00
Max Gekk d242166b8f [SPARK-34262][SQL] Refresh cached data of v1 table in ALTER TABLE .. SET LOCATION
### What changes were proposed in this pull request?
Invoke `CatalogImpl.refreshTable()` in the v1 implementation of the `ALTER TABLE .. SET LOCATION` command to refresh cached table data.
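
A hedged sketch of the shape of the fix (the helper below is illustrative; the actual change is inside the v1 command implementation):

```scala
import org.apache.spark.sql.SparkSession

// After altering the partition/table location, refresh both metadata and cached data.
// CatalogImpl.refreshTable is reachable through the public spark.catalog API.
def refreshAfterSetLocation(spark: SparkSession, tableName: String): Unit = {
  spark.catalog.refreshTable(tableName)
}
```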

### Why are the changes needed?
The example below illustrates the issue:

- Create a source table:
```sql
spark-sql> CREATE TABLE src_tbl (c0 int, part int) USING hive PARTITIONED BY (part);
spark-sql> INSERT INTO src_tbl PARTITION (part=0) SELECT 0;
spark-sql> SHOW TABLE EXTENDED LIKE 'src_tbl' PARTITION (part=0);
default	src_tbl	false	Partition Values: [part=0]
Location: file:/Users/maximgekk/proj/refresh-cache-set-location/spark-warehouse/src_tbl/part=0
...
```
- Set new location for the empty partition (part=0):
```sql
spark-sql> CREATE TABLE dst_tbl (c0 int, part int) USING hive PARTITIONED BY (part);
spark-sql> ALTER TABLE dst_tbl ADD PARTITION (part=0);
spark-sql> INSERT INTO dst_tbl PARTITION (part=1) SELECT 1;
spark-sql> CACHE TABLE dst_tbl;
spark-sql> SELECT * FROM dst_tbl;
1	1
spark-sql> ALTER TABLE dst_tbl PARTITION (part=0) SET LOCATION '/Users/maximgekk/proj/refresh-cache-set-location/spark-warehouse/src_tbl/part=0';
spark-sql> SELECT * FROM dst_tbl;
1	1
```
The last query does not return the newly loaded data.

### Does this PR introduce _any_ user-facing change?
Yes. After the changes, the example above works correctly:
```sql
spark-sql> ALTER TABLE dst_tbl PARTITION (part=0) SET LOCATION '/Users/maximgekk/proj/refresh-cache-set-location/spark-warehouse/src_tbl/part=0';
spark-sql> SELECT * FROM dst_tbl;
0	0
1	1
```

### How was this patch tested?
Added new test to `org.apache.spark.sql.hive.CachedTableSuite`:
```
$ build/sbt -Phive -Phive-thriftserver "test:testOnly *CachedTableSuite"
```

Closes #31361 from MaxGekk/refresh-cache-set-location.

Authored-by: Max Gekk <max.gekk@gmail.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2021-01-28 15:05:22 +09:00
beliefer b12e9a4ea6 [SPARK-33542][SQL][FOLLOWUP] Group exception messages in catalyst/catalog
### What changes were proposed in this pull request?
This PR follows up https://github.com/apache/spark/pull/30870.
Some contributors may not be aware of that effort and have added exceptions in the old way.

### Why are the changes needed?
It will largely help with standardization of error messages and its maintenance.

### Does this PR introduce _any_ user-facing change?
No. Error messages remain unchanged.

### How was this patch tested?
No new tests - pass all original tests to make sure it doesn't break any existing behavior.

Closes #31312 from beliefer/SPARK-33542-followup.

Authored-by: beliefer <beliefer@163.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2021-01-28 05:15:57 +00:00
Angerszhuuuu 850990f40e [SPARK-34238][SQL] Unify output of SHOW PARTITIONS and pass output attributes properly
### What changes were proposed in this pull request?
Passing around the output attributes has additional benefits, such as keeping the expr ID unchanged, which avoids bugs when we apply more operators on top of the command output dataframe.

This PR keeps the SHOW PARTITIONS command's output attribute exprId unchanged.
This also benefits https://issues.apache.org/jira/browse/SPARK-34238
### Why are the changes needed?
Keep the SHOW PARTITIONS command's output attribute exprId unchanged.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
Added UT

Closes #31341 from AngersZhuuuu/SPARK-34238.

Authored-by: Angerszhuuuu <angers.zhu@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2021-01-28 05:13:19 +00:00
Yuming Wang 01d11da84e [SPARK-34268][SQL][DOCS] Correct the documentation of the concat_ws function
### What changes were proposed in this pull request?

This PR corrects the documentation of the `concat_ws` function.

### Why are the changes needed?

`concat_ws` doesn't need any str or array(str) arguments:
```
scala> sql("""select concat_ws("s")""").show
+------------+
|concat_ws(s)|
+------------+
|            |
+------------+
```

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

```
 build/sbt  "sql/testOnly *.ExpressionInfoSuite"
```

Closes #31370 from wangyum/SPARK-34268.

Authored-by: Yuming Wang <yumwang@ebay.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2021-01-28 14:06:36 +09:00
Chao Sun 6ec3cf6219 [SPARK-34271][SQL] Use majorMinorPatchVersion for Hive version parsing
### What changes were proposed in this pull request?

Use `majorMinorPatchVersion` to check major & minor version in `IsolatedClientLoader.hiveVersion`.

### Why are the changes needed?

Currently `IsolatedClientLoader.hiveVersion` needs to enumerate all Hive patch versions. Therefore, whenever we upgrade the Hive version, we'd need to remember to update the method as well. It would be better to just check the major & minor version.
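
A self-contained toy version of the idea (the real `majorMinorPatchVersion` utility lives in Spark's version utils; this standalone parser is just for illustration):

```scala
import scala.util.Try

// Parse "x.y.z" (or "x.y") into (major, minor, patch) so callers can dispatch on
// (major, minor) instead of enumerating every patch release.
def majorMinorPatch(version: String): Option[(Int, Int, Int)] =
  version.split("\\.").toList match {
    case a :: b :: c :: _ => Try((a.toInt, b.toInt, c.toInt)).toOption
    case a :: b :: Nil    => Try((a.toInt, b.toInt, 0)).toOption
    case _                => None
  }

majorMinorPatch("2.3.8").map { case (major, minor, _) => s"hive-$major.$minor" }
// => Some("hive-2.3"), regardless of the patch number
```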

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

This is a refactoring and relies on existing tests.

Closes #31371 from sunchao/replace-hive-version.

Authored-by: Chao Sun <sunchao@apple.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2021-01-28 14:00:10 +09:00
Linhong Liu cf1400c8dd [SPARK-34260][SQL] Fix UnresolvedException when creating temp view twice
### What changes were proposed in this pull request?
PR #30140 compares the new and old plans when replacing a view, and uncaches data
if the view has changed. However, the new plan being compared is not analyzed, which causes
an `UnresolvedException` when calling `sameResult`. So in this PR, we use the analyzed
plan for the comparison to fix this problem.
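
A hedged sketch of the comparison, with assumed access to the session (the actual fix is in the temp-view creation path):

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.catalyst.plans.logical.LogicalPlan

// Analyze the new plan before comparing it with the previously stored one, so that
// sameResult never runs against unresolved nodes.
def viewChanged(spark: SparkSession, oldPlan: LogicalPlan, newPlan: LogicalPlan): Boolean = {
  val analyzed = spark.sessionState.executePlan(newPlan).analyzed
  !analyzed.sameResult(oldPlan)
}
```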

### Why are the changes needed?
bug fix

### Does this PR introduce _any_ user-facing change?
no

### How was this patch tested?
newly added tests

Closes #31360 from linhongliu-db/SPARK-34260.

Authored-by: Linhong Liu <linhong.liu@databricks.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2021-01-27 20:59:23 -08:00
Chircu 829f118f98 [SPARK-33867][SQL] Instant and LocalDate values aren't handled when generating SQL queries
### What changes were proposed in this pull request?

When generating SQL queries, only the old date-time API types are handled for values in org.apache.spark.sql.jdbc.JdbcDialect#compileValue. If the new API is used (spark.sql.datetime.java8API.enabled=true), Instant and LocalDate values are not quoted and errors are thrown. The proposed change is to handle Instant and LocalDate values the same way as Timestamp and Date.
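
A hedged sketch of the kind of pattern match the dialect's value compiler would gain; the quoting/formatting details here are assumptions, not the exact Spark implementation:

```scala
import java.sql.{Date, Timestamp}
import java.time.{Instant, LocalDate}

// Quote java.time values the same way as the legacy java.sql types (sketch).
def compileValue(value: Any): Any = value match {
  case stringValue: String       => s"'$stringValue'"
  case timestampValue: Timestamp => "'" + timestampValue + "'"
  case dateValue: Date           => "'" + dateValue + "'"
  case instantValue: Instant     => "'" + Timestamp.from(instantValue) + "'"   // new
  case localDateValue: LocalDate => "'" + Date.valueOf(localDateValue) + "'"   // new
  case other                     => other
}
```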

### Why are the changes needed?

In the current state, if an Instant is used in a filter, an exception is thrown.
Example (dataset read from PostgreSQL): `dataset.filter(current_timestamp().gt(col(VALID_FROM)))`
Stacktrace (the `T11` comes from an instant formatted like yyyy-MM-dd'T'HH:mm:ss.SSSSSS'Z'):

    Caused by: org.postgresql.util.PSQLException: ERROR: syntax error at or near "T11"  Position: 285
      at org.postgresql.core.v3.QueryExecutorImpl.receiveErrorResponse(QueryExecutorImpl.java:2103)
      at org.postgresql.core.v3.QueryExecutorImpl.processResults(QueryExecutorImpl.java:1836)
      at org.postgresql.core.v3.QueryExecutorImpl.execute(QueryExecutorImpl.java:257)
      at org.postgresql.jdbc2.AbstractJdbc2Statement.execute(AbstractJdbc2Statement.java:512)
      at org.postgresql.jdbc2.AbstractJdbc2Statement.executeWithFlags(AbstractJdbc2Statement.java:388)
      at org.postgresql.jdbc2.AbstractJdbc2Statement.executeQuery(AbstractJdbc2Statement.java:273)
      at org.apache.spark.sql.execution.datasources.jdbc.JDBCRDD.compute(JDBCRDD.scala:304)
      at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:349)
      at org.apache.spark.rdd.RDD.iterator(RDD.scala:313)
      at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
      at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:349)
      at org.apache.spark.rdd.RDD.iterator(RDD.scala:313)
      at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
      at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:349)
      at org.apache.spark.rdd.RDD.iterator(RDD.scala:313)
      at org.apache.spark.shuffle.ShuffleWriteProcessor.write(ShuffleWriteProcessor.scala:59)
      at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:99)
      at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:52)
      at org.apache.spark.scheduler.Task.run(Task.scala:127)
      at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:446)
      at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1377)
      at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:449)
      at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
      at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
      at java.base/java.lang.Thread.run(Thread.java:834)

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

Test added

Closes #31148 from cristichircu/SPARK-33867.

Lead-authored-by: Chircu <chircu@arezzosky.com>
Co-authored-by: Cristi Chircu <chircu@arezzosky.com>
Signed-off-by: Takeshi Yamamuro <yamamuro@apache.org>
2021-01-28 11:58:20 +09:00
Max Gekk 1318be7ee9 [SPARK-34267][SQL] Remove refreshTable() from SessionState
### What changes were proposed in this pull request?
Remove `SessionState.refreshTable()` and modify the tests where the method is used.

### Why are the changes needed?
There are already 2 methods with the same name in:
- `SessionCatalog`
- `CatalogImpl`

One more method in `SessionState` does not give any benefits. By removing it, we can improve code maintenance.

### Does this PR introduce _any_ user-facing change?
Should not because `SessionState` is an internal class.

### How was this patch tested?
By running the modified test suites:
```
$ build/sbt -Phive -Phive-thriftserver "test:testOnly *MetastoreDataSourcesSuite"
$ build/sbt -Phive -Phive-thriftserver "test:testOnly *HiveOrcQuerySuite"
$ build/sbt -Phive -Phive-thriftserver "test:testOnly *HiveParquetMetastoreSuite"
```

Closes #31366 from MaxGekk/remove-refreshTable-from-SessionState.

Authored-by: Max Gekk <max.gekk@gmail.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2021-01-27 09:43:59 -08:00
Wenchen Fan 2dbb7d5af8 [SPARK-34212][SQL][FOLLOWUP] Refine the behavior of reading parquet non-decimal fields as decimal
### What changes were proposed in this pull request?

This is a followup of https://github.com/apache/spark/pull/31319 .

When reading parquet int/long as decimal, the behavior should be the same as reading int/long and then casting to the decimal type. This PR changes to the expected behavior.

When reading parquet binary as decimal, we don't really know how to interpret the binary (it may come from a string), and we should fail. This PR changes to the expected behavior.

### Why are the changes needed?

To make the behavior more sane.

### Does this PR introduce _any_ user-facing change?

Yes, but it's a followup.

### How was this patch tested?

updated test

Closes #31357 from cloud-fan/bug.

Authored-by: Wenchen Fan <wenchen@databricks.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2021-01-27 09:34:31 -08:00
Kent Yao 5718d64f31 [SPARK-34083][SQL] Using TPCDS original definitions for char/varchar columns
### What changes were proposed in this pull request?

This PR changes the column types in the table definitions of `TPCDSBase` from string to char and varchar, with respect to the original definitions for char/varchar columns in the official doc - [TPC-DS_v2.9.0](http://www.tpc.org/tpc_documents_current_versions/pdf/tpc-ds_v2.9.0.pdf).

### Why are the changes needed?

To comply with both the TPCDS standard and ANSI; also, using string columns gives wrong results for those TPCDS queries.

### Does this PR introduce _any_ user-facing change?

no

### How was this patch tested?

plan stability check

Closes #31012 from yaooqinn/tpcds.

Authored-by: Kent Yao <yao@apache.org>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2021-01-27 17:51:49 +08:00
Max Gekk 0d08e22bc7 [SPARK-34251][SQL] Fix table stats calculation by TRUNCATE TABLE
### What changes were proposed in this pull request?
1. Take into account the SQL config `spark.sql.statistics.size.autoUpdate.enabled` in the `TRUNCATE TABLE` command as other commands do.
2. Re-calculate the actual table size in the file system (see the sketch below). Before the changes, `TRUNCATE TABLE` always set the table size to 0 in the stats.
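
A sketch of point 2 with assumed internal helpers (`CommandUtils.calculateTotalSize`, the config flag, and the catalog call are treated as assumptions here, not the exact change):

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.catalyst.catalog.{CatalogStatistics, CatalogTable}
import org.apache.spark.sql.execution.command.CommandUtils

// After truncation, recompute the size from the file system when auto size update is
// enabled, instead of unconditionally writing 0 into the stats.
def updateStatsAfterTruncate(spark: SparkSession, table: CatalogTable): Unit = {
  if (spark.sessionState.conf.autoSizeUpdateEnabled) {
    val newSize = CommandUtils.calculateTotalSize(spark, table)
    spark.sessionState.catalog.alterTableStats(
      table.identifier, Some(CatalogStatistics(sizeInBytes = newSize)))
  }
}
```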

### Why are the changes needed?
This fixes the bug that is demonstrated by the example:
1. Create a partitioned table with 2 non-empty partitions:
```sql
spark-sql> CREATE TABLE tbl (c0 int, part int) PARTITIONED BY (part);
spark-sql> INSERT INTO tbl PARTITION (part=0) SELECT 0;
spark-sql> INSERT INTO tbl PARTITION (part=1) SELECT 1;
spark-sql> ANALYZE TABLE tbl COMPUTE STATISTICS;
spark-sql> DESCRIBE TABLE EXTENDED tbl;
...
Statistics	4 bytes, 2 rows
...
```
2. Truncate only one partition:
```sql
spark-sql> TRUNCATE TABLE tbl PARTITION (part=1);
spark-sql> SELECT * FROM tbl;
0	0
```
3. The table is still non-empty but `TRUNCATE TABLE` reset the stats:
```
spark-sql> DESCRIBE TABLE EXTENDED tbl;
...
Statistics	0 bytes, 0 rows
...
```

### Does this PR introduce _any_ user-facing change?
It could impact the performance of subsequent queries.

### How was this patch tested?
Added new test to `StatisticsCollectionSuite`:
```
$ build/sbt -Phive -Phive-thriftserver "test:testOnly *StatisticsCollectionSuite"
$ build/sbt -Phive -Phive-thriftserver "test:testOnly *StatisticsSuite"
```

Closes #31350 from MaxGekk/fix-stats-in-trunc-table.

Authored-by: Max Gekk <max.gekk@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2021-01-27 07:02:04 +00:00
Kent Yao 764582c07a [SPARK-34233][SQL] FIX NPE for char padding in binary comparison
### What changes were proposed in this pull request?

We need to check whether the `lit` is null before calling `numChars`.
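
A self-contained toy model of the guard (assumed shape; Spark's padding logic lives in the char/varchar handling utilities):

```scala
import org.apache.spark.unsafe.types.UTF8String

// Guard against null literals before asking for the character count.
def paddedLength(lit: UTF8String, targetLength: Int): Option[Int] =
  Option(lit).map(v => math.max(v.numChars(), targetLength))

assert(paddedLength(null, 5).isEmpty)                            // no NPE anymore
assert(paddedLength(UTF8String.fromString("ab"), 5) == Some(5))
```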

### Why are the changes needed?

fix an obvious NPE bug

### Does this PR introduce _any_ user-facing change?

no

### How was this patch tested?

new tests

Closes #31336 from yaooqinn/SPARK-34233.

Authored-by: Kent Yao <yao@apache.org>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2021-01-27 14:59:53 +08:00
Kent Yao 91ca21d700 [SPARK-34236][SQL] Fix v2 Overwrite w/ null static partition raise Cannot translate expression to source filter: null
### What changes were proposed in this pull request?

For v2 static partition overwriting, we use `EqualTo` to generate the `deleteExpr`.

This is not correct for null partition values, and causes the problem below because `ConstantFolding` converts the expression to lit(null):

```scala
SPARK-34223: static partition with null raise NPE *** FAILED *** (19 milliseconds)
[info]   org.apache.spark.sql.AnalysisException: Cannot translate expression to source filter: null
[info]   at org.apache.spark.sql.execution.datasources.v2.V2Writes$$anonfun$apply$1.$anonfun$applyOrElse$1(V2Writes.scala:50)
[info]   at scala.collection.immutable.List.flatMap(List.scala:366)
[info]   at org.apache.spark.sql.execution.datasources.v2.V2Writes$$anonfun$apply$1.applyOrElse(V2Writes.scala:47)
[info]   at org.apache.spark.sql.execution.datasources.v2.V2Writes$$anonfun$apply$1.applyOrElse(V2Writes.scala:39)
[info]   at org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$transformDown$1(TreeNode.scala:317)
[info]   at org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:73)
```

The right fix is to use EqualNullSafe instead, so that the null partitions are actually deleted.
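
A hedged sketch of building the delete condition with null-safe equality (the inputs and the reduction are assumptions about the surrounding code, not the exact change):

```scala
import org.apache.spark.sql.catalyst.expressions.{And, Attribute, EqualNullSafe, Expression, Literal}

// EqualNullSafe(attr, Literal(null)) evaluates to true for null partition values,
// so the null static partition is actually matched and deleted.
def deleteExpr(staticPartitions: Seq[(Attribute, Any)]): Expression =
  staticPartitions
    .map { case (attr, value) => EqualNullSafe(attr, Literal(value)) }
    .reduce[Expression](And(_, _))
```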

### Why are the changes needed?

bugfix

### Does this PR introduce _any_ user-facing change?

no
### How was this patch tested?

Moved an existing test to a new place.

Closes #31339 from yaooqinn/SPARK-34236.

Authored-by: Kent Yao <yao@apache.org>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2021-01-27 12:05:50 +08:00
Chao Sun abf7e81712 [SPARK-33212][FOLLOW-UP][BUILD] Bring back duplicate dependency check and add more strict Hadoop version check
### What changes were proposed in this pull request?

1. Add back Maven enforcer for duplicate dependencies check
2. More strict checks on the Hadoop versions which support the shaded client in `IsolatedClientLoader`. To do a proper version check, this adds a util function `majorMinorPatchVersion` to extract the major/minor/patch version from a string.
3. Cleanup unnecessary code

### Why are the changes needed?

The Maven enforcer was removed as part of #30556. This proposes to add it back.

Also, Hadoop shaded client doesn't work in certain cases (see [these comments](https://github.com/apache/spark/pull/30701#discussion_r558522227) for details). This strictly checks that the current Hadoop version (i.e., 3.2.2 at the moment) has good support of shaded client or otherwise fallback to old unshaded ones.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Existing tests.

Closes #31203 from sunchao/SPARK-33212-followup.

Lead-authored-by: Chao Sun <sunchao@apple.com>
Co-authored-by: Chao Sun <sunchao@apache.org>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2021-01-26 15:34:55 -08:00
Dongjoon Hyun dbf051c50a [SPARK-34212][SQL] Fix incorrect decimal reading from Parquet files
### What changes were proposed in this pull request?

This PR aims to the correctness issues during reading decimal values from Parquet files.
- For **MR** code path, `ParquetRowConverter` can read Parquet's decimal values with the original precision and scale written in the corresponding footer.
- For **Vectorized** code path, `VectorizedColumnReader` throws `SchemaColumnConvertNotSupportedException`.

### Why are the changes needed?

Currently, Spark returns incorrect results when the Parquet file's decimal precision and scale are different from Spark's schema. This happens when there are multiple files with different decimal schemas or the Hive metastore has a new schema.

**BEFORE (Simplified example for correctness)**

```scala
scala> sql("SELECT 1.0 a").write.parquet("/tmp/decimal")
scala> spark.read.schema("a DECIMAL(3,2)").parquet("/tmp/decimal").show
+----+
|   a|
+----+
|0.10|
+----+
```

This works correctly in the other data sources, `ORC/JSON/CSV`, like the following.
```scala
scala> sql("SELECT 1.0 a").write.orc("/tmp/decimal_orc")
scala> spark.read.schema("a DECIMAL(3,2)").orc("/tmp/decimal_orc").show
+----+
|   a|
+----+
|1.00|
+----+
```

**AFTER**
1. **Vectorized** path: Instead of incorrect result, we will raise an explicit exception.
```scala
scala> spark.read.schema("a DECIMAL(3,2)").parquet("/tmp/decimal").show
java.lang.UnsupportedOperationException: Schema evolution not supported.
```

2. **MR** path (complex schema or explicit configuration): Spark returns correct results.
```scala
scala> spark.read.schema("a DECIMAL(3,2), b DECIMAL(18, 3), c MAP<INT,INT>").parquet("/tmp/decimal").show
+----+-------+--------+
|   a|      b|       c|
+----+-------+--------+
|1.00|100.000|{1 -> 2}|
+----+-------+--------+

scala> spark.read.schema("a DECIMAL(3,2), b DECIMAL(18, 3), c MAP<INT,INT>").parquet("/tmp/decimal").printSchema
root
 |-- a: decimal(3,2) (nullable = true)
 |-- b: decimal(18,3) (nullable = true)
 |-- c: map (nullable = true)
 |    |-- key: integer
 |    |-- value: integer (valueContainsNull = true)
```

### Does this PR introduce _any_ user-facing change?

Yes. This fixes the correctness issue.

### How was this patch tested?

Pass with the newly added test case.

Closes #31319 from dongjoon-hyun/SPARK-34212.

Lead-authored-by: Dongjoon Hyun <dhyun@apple.com>
Co-authored-by: Wenchen Fan <wenchen@databricks.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2021-01-26 15:13:39 -08:00
beliefer 99b6af2dd2 [SPARK-34244][SQL] Remove the Scala function version of regexp_extract_all
### What changes were proposed in this pull request?
https://github.com/apache/spark/pull/27507 implemented `regexp_extract_all` and added the Scala function version of it.
According to the guideline at https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/functions.scala#L41-L59, it seems good to remove the Scala function version. Although I think `regexp_extract_all` is very useful, we should follow that guideline.

### Why are the changes needed?
`regexp_extract_all` is less common.

### Does this PR introduce _any_ user-facing change?
'No'. `regexp_extract_all` was added in Spark 3.1.0 which isn't released yet.

### How was this patch tested?
Jenkins test.

Closes #31346 from beliefer/SPARK-24884-followup.

Authored-by: beliefer <beliefer@163.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2021-01-26 13:52:51 -08:00
Max Gekk ac8307d75c [SPARK-34215][SQL] Keep tables cached after truncation
### What changes were proposed in this pull request?
Invoke `CatalogImpl.refreshTable()` instead of the combination of `SessionCatalog.refreshTable()` + `uncacheQuery()`. This allows clearing the cached table data while keeping the table cached.

### Why are the changes needed?
1. To improve user experience with Spark SQL
2. To be consistent to other commands, see https://github.com/apache/spark/pull/31206

### Does this PR introduce _any_ user-facing change?
Yes.

Before:
```scala
scala> sql("CREATE TABLE tbl (c0 int)")
res1: org.apache.spark.sql.DataFrame = []
scala> sql("INSERT INTO tbl SELECT 0")
res2: org.apache.spark.sql.DataFrame = []
scala> sql("CACHE TABLE tbl")
res3: org.apache.spark.sql.DataFrame = []
scala> sql("SELECT * FROM tbl").show(false)
+---+
|c0 |
+---+
|0  |
+---+
scala> spark.catalog.isCached("tbl")
res5: Boolean = true
scala> sql("TRUNCATE TABLE tbl")
res6: org.apache.spark.sql.DataFrame = []
scala> spark.catalog.isCached("tbl")
res7: Boolean = false
```

After:
```scala
scala> sql("TRUNCATE TABLE tbl")
res6: org.apache.spark.sql.DataFrame = []
scala> spark.catalog.isCached("tbl")
res7: Boolean = true
```

### How was this patch tested?
Added new test to `CachedTableSuite`:
```
$ build/sbt -Phive -Phive-thriftserver "test:testOnly *CachedTableSuite"
$ build/sbt -Phive -Phive-thriftserver "test:testOnly *CatalogedDDLSuite"
```

Closes #31308 from MaxGekk/truncate-table-cached.

Authored-by: Max Gekk <max.gekk@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2021-01-26 15:36:44 +00:00
Angerszhuuuu dd88eff820 [SPARK-34241][SQL] For DDL command plan, we should define producedAttributes as it's outputSet
### What changes were proposed in this pull request?

When writing tests for commands and calling `checkAnswer`,
we always got an error like the one below:
```
[info]   AttributeSet(partition#607) was not empty The analyzed logical plan has missing inputs:
[info]   ShowPartitionsCommand `ns`.`tbl`, [partition#607] (QueryTest.scala:224)
[info]   org.scalatest.exceptions.TestFailedException:
[info]   at org.scalatest.Assertions.newAssertionFailedException(Assertions.scala:472)
[info]   at org.scalatest.Assertions.newAssertionFailedException$(Assertions.scala:471)
```

For DDL command plans, we can reasonably define `producedAttributes` as their `outputSet`.
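
A minimal sketch of the default, using an illustrative trait name rather than Spark's actual `Command` trait:

```scala
import org.apache.spark.sql.catalyst.expressions.AttributeSet
import org.apache.spark.sql.catalyst.plans.logical.LogicalPlan

// Every attribute a DDL command outputs is produced by the command itself, so the
// "missing inputs" check no longer flags the command's own output attributes.
trait CommandLikePlan extends LogicalPlan {
  override def producedAttributes: AttributeSet = outputSet
}
```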

### Why are the changes needed?
Add a default `producedAttributes` for command LogicalPlans.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
Not needed.

Closes #31342 from AngersZhuuuu/SPARK-34241.

Authored-by: Angerszhuuuu <angers.zhu@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2021-01-26 15:14:10 +00:00
Anton Okolnychyi 08679646fe [SPARK-34026][SQL] Inject repartition and sort nodes to satisfy required distribution and ordering
### What changes were proposed in this pull request?

This PR adds repartition and sort nodes to satisfy the required distribution and ordering introduced in SPARK-33779.

Note: This PR contains the final part of changes discussed in PR #29066.

### Why are the changes needed?

These changes are the next step as discussed in the [design doc](https://docs.google.com/document/d/1X0NsQSryvNmXBY9kcvfINeYyKC-AahZarUqg3nS1GQs/edit#) for SPARK-23889.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

This PR comes with a new test suite.

Closes #31083 from aokolnychyi/spark-34026.

Authored-by: Anton Okolnychyi <aokolnychyi@apple.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2021-01-26 15:09:30 +00:00
yangjie01 8999e8805d [SPARK-34224][CORE][SQL][SS][DSTREAM][YARN][TEST][EXAMPLES] Ensure all resource opened by Source.fromXXX are closed
### What changes were proposed in this pull request?
Using a function like `.mkString` or `.getLines` directly on a `scala.io.Source` opened by `fromFile`, `fromURL`, or `fromURI` will leak the underlying file handle. This PR uses the `Utils.tryWithResource` method to wrap the `BufferedSource` instances and ensure they are closed.
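
A sketch of the before/after pattern; the file path is just an example, and this compiles only inside Spark's own source tree since `Utils` is `private[spark]`:

```scala
import scala.io.Source
import org.apache.spark.util.Utils

// Before (leaks the underlying file handle):
// val lines = Source.fromFile("conf/example.conf").getLines().toList

// After: Utils.tryWithResource closes the BufferedSource even if the body throws.
val lines = Utils.tryWithResource(Source.fromFile("conf/example.conf")) { source =>
  source.getLines().toList
}
```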

### Why are the changes needed?
Avoid file handle leak.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
Pass the Jenkins or GitHub Action

Closes #31323 from LuciferYang/source-not-closed.

Authored-by: yangjie01 <yangjie01@baidu.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2021-01-26 19:06:37 +09:00
Wenchen Fan 8dee8a9b7c [SPARK-34227][SQL] WindowFunctionFrame should clear its states during preparation
### What changes were proposed in this pull request?

This PR fixes all `OffsetWindowFunctionFrameBase#prepare` implementations to reset their state, and adds more comments to the `WindowFunctionFrame` classdoc to explain why states must be reset during preparation: `WindowFunctionFrame` instances are reused to process multiple partitions.
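
A self-contained toy model of the pattern being fixed (not Spark's actual frame classes): a frame object reused across partitions must reset all per-partition state in `prepare()`:

```scala
final class OffsetFrameModel {
  private var input: IndexedSeq[Int] = IndexedSeq.empty
  private var inputIndex: Int = 0

  // prepare() is called once per partition; forgetting to reset inputIndex here would
  // carry state from the previous partition into the next one.
  def prepare(rows: IndexedSeq[Int]): Unit = {
    input = rows
    inputIndex = 0
  }

  def next(): Option[Int] = {
    val result = input.lift(inputIndex)
    inputIndex += 1
    result
  }
}
```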

### Why are the changes needed?

To fix a correctness bug caused by the new feature "window function with ignore nulls" in the master branch.

### Does this PR introduce _any_ user-facing change?

yes

### How was this patch tested?

new test

Closes #31325 from cloud-fan/bug.

Authored-by: Wenchen Fan <wenchen@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2021-01-26 08:50:14 +00:00
Yuanjian Li 0a1a029622 [SPARK-34235][SS] Make spark.sql.hive as a private package
### What changes were proposed in this pull request?
Follow the comment https://github.com/apache/spark/pull/31271#discussion_r562598983:

- Remove the API tag `Unstable` for `HiveSessionStateBuilder`
- Add documentation for the spark.sql.hive package to emphasize that it is a private package

### Why are the changes needed?
Follow the rule for a private package.

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
Doc change only.

Closes #31321 from xuanyuanking/SPARK-34185-follow.

Authored-by: Yuanjian Li <yuanjian.li@databricks.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2021-01-26 17:13:11 +09:00
Angerszhuuuu 7bd4165c11 [SPARK-32852][SQL][FOLLOW_UP] Add notice about keeping hive version consistent when configuring hive jars location
### What changes were proposed in this pull request?
Add a notice about keeping the Hive version consistent when configuring the Hive jars location.

With PR #29881, if we don't keep the Hive version consistent, we get the error below.
```
Builtin jars can only be used when hive execution version == hive metastore version. Execution: 2.3.8 != Metastore: 1.2.1. Specify a valid path to the correct hive jars using spark.sql.hive.metastore.jars or change spark.sql.hive.metastore.version to 2.3.8.
```

![image](https://user-images.githubusercontent.com/46485123/105795169-512d8380-5fc7-11eb-97c3-0259a0d2aa58.png)

### Why are the changes needed?
Make the config documentation more detailed.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
Not needed.

Closes #31317 from AngersZhuuuu/SPARK-32852-followup.

Authored-by: Angerszhuuuu <angers.zhu@gmail.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2021-01-26 13:40:20 +09:00
Kent Yao b3915ddd91 [SPARK-34223][SQL] FIX NPE for static partition with null in InsertIntoHadoopFsRelationCommand
### What changes were proposed in this pull request?

In a simple case like the one below, the null is passed to InsertIntoHadoopFsRelationCommand blindly; we should avoid the NPE.
```scala
 test("NPE") {
    withTable("t") {
      sql(s"CREATE TABLE t(i STRING, c string) USING $format PARTITIONED BY (c)")
      sql("INSERT OVERWRITE t PARTITION (c=null) VALUES ('1')")
      checkAnswer(spark.table("t"), Row("1", null))
    }
  }
```
```
java.lang.NullPointerException
	at scala.collection.immutable.StringOps$.length(StringOps.scala:51)
	at scala.collection.immutable.StringOps.length(StringOps.scala:51)
	at scala.collection.IndexedSeqOptimized.foreach(IndexedSeqOptimized.scala:35)
	at scala.collection.IndexedSeqOptimized.foreach
	at scala.collection.immutable.StringOps.foreach(StringOps.scala:33)
	at org.apache.spark.sql.catalyst.catalog.ExternalCatalogUtils$.escapePathName(ExternalCatalogUtils.scala:69)
	at org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelationCommand.orig-s0.0000030000-r30676-expand-or-complete(InsertIntoHadoopFsRelationCommand.scala:231)
```

### Why are the changes needed?

a bug fix
### Does this PR introduce _any_ user-facing change?

no
### How was this patch tested?

new tests

Closes #31320 from yaooqinn/SPARK-34223.

Authored-by: Kent Yao <yao@apache.org>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2021-01-26 12:05:58 +08:00
Kent Yao d1177b5230 [SPARK-34192][SQL] Move char padding to write side and remove length check on read side too
### What changes were proposed in this pull request?

On the read side, the char length check and padding cause issues for CBO and predicate pushdown, and other issues in the catalyst.

This PR reverts 6da5cdf1db (which added the read-side length check), so that we only do the length check on the write side, and data sources/vendors are responsible for enforcing the char/varchar constraints for data import operations like ADD PARTITION. It doesn't make sense for Spark to report errors on the read side if the data is already dirty.

This PR also moves the char padding to the write side, so that 1) it avoids read-side issues such as the CBO and filter pushdown problems, and 2) the data source can better preserve char type semantics even when it is read by systems other than Spark.

### Why are the changes needed?

fix perf regression when tables have char/varchar type columns

closes #31278
### Does this PR introduce _any_ user-facing change?

Yes, Spark will no longer raise an error for oversized char/varchar values on the read side.
### How was this patch tested?

modified ut

the dropped read side benchmark
```
================================================================================================
Char Varchar Read Side Perf w/o Tailing Spaces
================================================================================================

Java HotSpot(TM) 64-Bit Server VM 1.8.0_251-b08 on Mac OS X 10.16
Intel(R) Core(TM) i9-9980HK CPU  2.40GHz
Read with length 20:                      Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
------------------------------------------------------------------------------------------------------------------------
read string with length 20                         1564           1573           9         63.9          15.6       1.0X
read char with length 20                           1532           1551          18         65.3          15.3       1.0X
read varchar with length 20                        1520           1531          13         65.8          15.2       1.0X

Java HotSpot(TM) 64-Bit Server VM 1.8.0_251-b08 on Mac OS X 10.16
Intel(R) Core(TM) i9-9980HK CPU  2.40GHz
Read with length 40:                      Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
------------------------------------------------------------------------------------------------------------------------
read string with length 40                         1573           1613          41         63.6          15.7       1.0X
read char with length 40                           1575           1577           2         63.5          15.7       1.0X
read varchar with length 40                        1568           1576          11         63.8          15.7       1.0X

Java HotSpot(TM) 64-Bit Server VM 1.8.0_251-b08 on Mac OS X 10.16
Intel(R) Core(TM) i9-9980HK CPU  2.40GHz
Read with length 60:                      Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
------------------------------------------------------------------------------------------------------------------------
read string with length 60                         1526           1540          23         65.5          15.3       1.0X
read char with length 60                           1514           1539          23         66.0          15.1       1.0X
read varchar with length 60                        1486           1497          10         67.3          14.9       1.0X

Java HotSpot(TM) 64-Bit Server VM 1.8.0_251-b08 on Mac OS X 10.16
Intel(R) Core(TM) i9-9980HK CPU  2.40GHz
Read with length 80:                      Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
------------------------------------------------------------------------------------------------------------------------
read string with length 80                         1531           1542          19         65.3          15.3       1.0X
read char with length 80                           1514           1529          15         66.0          15.1       1.0X
read varchar with length 80                        1524           1565          42         65.6          15.2       1.0X

Java HotSpot(TM) 64-Bit Server VM 1.8.0_251-b08 on Mac OS X 10.16
Intel(R) Core(TM) i9-9980HK CPU  2.40GHz
Read with length 100:                     Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
------------------------------------------------------------------------------------------------------------------------
read string with length 100                        1597           1623          25         62.6          16.0       1.0X
read char with length 100                          1499           1512          16         66.7          15.0       1.1X
read varchar with length 100                       1517           1524           8         65.9          15.2       1.1X

================================================================================================
Char Varchar Read Side Perf w/ Tailing Spaces
================================================================================================

Java HotSpot(TM) 64-Bit Server VM 1.8.0_251-b08 on Mac OS X 10.16
Intel(R) Core(TM) i9-9980HK CPU  2.40GHz
Read with length 20:                      Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
------------------------------------------------------------------------------------------------------------------------
read string with length 20                         1524           1526           1         65.6          15.2       1.0X
read char with length 20                           1532           1537           9         65.3          15.3       1.0X
read varchar with length 20                        1520           1532          15         65.8          15.2       1.0X

Java HotSpot(TM) 64-Bit Server VM 1.8.0_251-b08 on Mac OS X 10.16
Intel(R) Core(TM) i9-9980HK CPU  2.40GHz
Read with length 40:                      Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
------------------------------------------------------------------------------------------------------------------------
read string with length 40                         1556           1580          32         64.3          15.6       1.0X
read char with length 40                           1600           1611          17         62.5          16.0       1.0X
read varchar with length 40                        1648           1716          88         60.7          16.5       0.9X

Java HotSpot(TM) 64-Bit Server VM 1.8.0_251-b08 on Mac OS X 10.16
Intel(R) Core(TM) i9-9980HK CPU  2.40GHz
Read with length 60:                      Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
------------------------------------------------------------------------------------------------------------------------
read string with length 60                         1504           1524          20         66.5          15.0       1.0X
read char with length 60                           1509           1512           3         66.2          15.1       1.0X
read varchar with length 60                        1519           1535          21         65.8          15.2       1.0X

Java HotSpot(TM) 64-Bit Server VM 1.8.0_251-b08 on Mac OS X 10.16
Intel(R) Core(TM) i9-9980HK CPU  2.40GHz
Read with length 80:                      Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
------------------------------------------------------------------------------------------------------------------------
read string with length 80                         1640           1652          17         61.0          16.4       1.0X
read char with length 80                           1625           1666          35         61.5          16.3       1.0X
read varchar with length 80                        1590           1605          13         62.9          15.9       1.0X

Java HotSpot(TM) 64-Bit Server VM 1.8.0_251-b08 on Mac OS X 10.16
Intel(R) Core(TM) i9-9980HK CPU  2.40GHz
Read with length 100:                     Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
------------------------------------------------------------------------------------------------------------------------
read string with length 100                        1622           1628           5         61.6          16.2       1.0X
read char with length 100                          1614           1646          30         62.0          16.1       1.0X
read varchar with length 100                       1594           1606          11         62.7          15.9       1.0X
```

Closes #31281 from yaooqinn/SPARK-34192.

Authored-by: Kent Yao <yao@apache.org>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2021-01-26 02:08:35 +08:00
Max Gekk bfc0235013 [SPARK-34203][SQL] Convert null partition values to __HIVE_DEFAULT_PARTITION__ in v1 In-Memory catalog
### What changes were proposed in this pull request?
In the PR, I propose to convert `null` partition values to `"__HIVE_DEFAULT_PARTITION__"` before storing them in the `In-Memory` catalog internally. Currently, the `In-Memory` catalog maintains null partitions as `"__HIVE_DEFAULT_PARTITION__"` in the file system but as `null` values in memory, which can cause issues like SPARK-34203.

### Why are the changes needed?
`InMemoryCatalog` stores partitions in the file system in the Hive compatible form, for instance, it converts the `null` partition value to `"__HIVE_DEFAULT_PARTITION__"` but at the same time it keeps null as is internally. That causes an issue demonstrated by the example below:
```
$ ./bin/spark-shell -c spark.sql.catalogImplementation=in-memory
```
```scala
scala> spark.conf.get("spark.sql.catalogImplementation")
res0: String = in-memory

scala> sql("CREATE TABLE tbl (col1 INT, p1 STRING) USING parquet PARTITIONED BY (p1)")
res1: org.apache.spark.sql.DataFrame = []

scala> sql("INSERT OVERWRITE TABLE tbl VALUES (0, null)")
res2: org.apache.spark.sql.DataFrame = []

scala> sql("ALTER TABLE tbl DROP PARTITION (p1 = null)")
org.apache.spark.sql.catalyst.analysis.NoSuchPartitionsException: The following partitions not found in table 'tbl' database 'default':
Map(p1 -> null)
  at org.apache.spark.sql.catalyst.catalog.InMemoryCatalog.dropPartitions(InMemoryCatalog.scala:440)
```

### Does this PR introduce _any_ user-facing change?
Yes. After the changes, `ALTER TABLE .. DROP PARTITION` can drop the `null` partition in `In-Memory` catalog:
```scala
scala> spark.table("tbl").show(false)
+----+----+
|col1|p1  |
+----+----+
|0   |null|
+----+----+

scala> sql("ALTER TABLE tbl DROP PARTITION (p1 = null)")
res4: org.apache.spark.sql.DataFrame = []

scala> spark.table("tbl").show(false)
+----+---+
|col1|p1 |
+----+---+
+----+---+
```

### How was this patch tested?
Added new test to `AlterTableDropPartitionSuiteBase`:
```
$ build/sbt -Phive-2.3 -Phive-thriftserver "test:testOnly *AlterTableDropPartitionSuite"
```

Closes #31322 from MaxGekk/insert-overwrite-null-part.

Authored-by: Max Gekk <max.gekk@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2021-01-25 15:27:20 +00:00
Dereck Li 096b15fa12 [SPARK-34607][SQL][FOLLOWUP] Change Option[LogicalRelation] to LogicalRelation
### What changes were proposed in this pull request?
optimize: change Option[LogicalRelation] to LogicalRelation

### Why are the changes needed?
simplify code

### Does this PR introduce _any_ user-facing change?
NO

### How was this patch tested?
Existing unit tests.

Closes #31315 from monkeyboy123/spark-34067-follow-up.

Authored-by: Dereck Li <monkeyboy.ljh@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2021-01-25 13:34:07 +00:00
Max Gekk 6fe5a8a2ae [SPARK-34197][SQL] SessionCatalog.refreshTable() should not invalidate the relation cache for temporary views
### What changes were proposed in this pull request?
Check the name passed to `SessionCatalog.refreshTable`, and if it belongs to a temporary view, do not invalidate the relation cache.

### Why are the changes needed?
When `SessionCatalog.refreshTable` refreshes a temporary or global temporary view, it should not invalidate an entry in the relation cache associated to a table with the same name.

### Does this PR introduce _any_ user-facing change?
Should not. The change might improve performance slightly.

### How was this patch tested?
By running new UT:
```
$ build/sbt -Phive-2.3 -Phive-thriftserver "test:testOnly *SessionCatalogSuite"
```

Closes #31265 from MaxGekk/fix-session-catalog-refresh-table.

Authored-by: Max Gekk <max.gekk@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2021-01-25 07:37:24 +00:00
yliou 512cacf7c6 [SPARK-33726][SQL] Fix for Duplicate field names during Aggregation
### What changes were proposed in this pull request?
The `RowBasedKeyValueBatch` has two different implementations depending on whether the aggregation key and value use only fixed-length data types (`FixedLengthRowBasedKeyValueBatch`) or not (`VariableLengthRowBasedKeyValueBatch`).

Before this PR, the decision about which implementation to use was made by looking up the schema fields by name.
But if two fields have the same name, one with a variable-length type and the other with a fixed-length type (and all the other fields have fixed-length types), a wrong decision could be made.

When `FixedLengthRowBasedKeyValueBatch` is chosen but there is a variable-length field, an aggregation function can compute with invalid values. This case is illustrated by the example used in the unit test:

    with T as (select id as a, -id as x from range(3)),
         U as (select id as b, cast(id as string) as x from range(3))
    select T.x, U.x, min(a) as ma, min(b) as mb from T join U on a=b group by U.x, T.x
where the 'x' column in the left side of the join is a Long but on the right side is a String.
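
A hedged sketch of the safer decision: look at the data types positionally instead of resolving fields by a possibly duplicated name (`UnsafeRow.isFixedLength` is treated as an assumed helper here):

```scala
import org.apache.spark.sql.catalyst.expressions.UnsafeRow
import org.apache.spark.sql.types.{LongType, StringType, StructField, StructType}

// Fixed-length batching is only safe when every key and value field is fixed length.
def allFixedLength(keySchema: StructType, valueSchema: StructType): Boolean =
  (keySchema.fields ++ valueSchema.fields).forall(f => UnsafeRow.isFixedLength(f.dataType))

// The duplicate "x" hides a String behind a Long when fields are looked up by name.
allFixedLength(
  StructType(Seq(StructField("x", LongType), StructField("x", StringType))),
  StructType(Seq(StructField("ma", LongType))))   // false
```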

### Why are the changes needed?
Fixes the issue where duplicate field name aggregation has null values in the dataframe.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
Added UT, tested manually on spark shell.

Closes #30788 from yliou/SPARK-33726.

Authored-by: yliou <yliou@berkeley.edu>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2021-01-25 06:53:26 +00:00
Peter Toth 98ec6c27e3 [SPARK-34147][SQL][TEST] Keep table partitioning in TPCDSQueryBenchmark when CBO is enabled
### What changes were proposed in this pull request?
This PR keeps the partitioning of the input tables in TPCDSQueryBenchmark when the `--cbo` option is enabled.

https://github.com/apache/spark/pull/31011 introduced the `--cbo` option but unfortunately, in that mode, the table partitioning of the input data is lost. This means that the results in CBO mode are very different from non-CBO mode; one example is that Dynamic Partition Pruning doesn't kick in in CBO mode.

### Why are the changes needed?
To monitor performance changes with CBO enabled.

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
Manually checked.

Closes #31218 from peter-toth/SPARK-34147-keep-partitioning-in-tpcdsquerybenchmark.

Authored-by: Peter Toth <peter.toth@gmail.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2021-01-25 11:42:47 +09:00
Yuanjian Li 59cbacaddf [SPARK-34185][DOCS] Review and fix issues in API docs
### What changes were proposed in this pull request?
Compare the 3.1.1 API doc with the latest release version 3.0.1. Fix the following issues:
- Add missing `Since` annotation for new APIs
- Remove the leaking class/object in API doc

### Why are the changes needed?
Fix the issues in the Spark 3.1.1 release API docs.

### Does this PR introduce _any_ user-facing change?
Yes, API doc changes.

### How was this patch tested?
Manually test.

Closes #31271 from xuanyuanking/SPARK-34185.

Lead-authored-by: Yuanjian Li <yuanjian.li@databricks.com>
Co-authored-by: Hyukjin Kwon <gurwls223@gmail.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2021-01-25 11:38:20 +09:00
Takuya UESHIN 43fdd1271e [SPARK-33489][PYSPARK] Add NullType support for Arrow executions
### What changes were proposed in this pull request?

Adds `NullType` support for Arrow executions.

### Why are the changes needed?

As Arrow supports null type, we can convert `NullType` between PySpark and pandas with Arrow enabled.

### Does this PR introduce _any_ user-facing change?

Yes, if a user has a DataFrame including `NullType`, it will be able to convert with Arrow enabled.

### How was this patch tested?

Added tests.

Closes #31285 from ueshin/issues/SPARK-33489/arrow_nulltype.

Authored-by: Takuya UESHIN <ueshin@databricks.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2021-01-25 11:34:47 +09:00
Yuming Wang 4fce05d93f [SPARK-34155][SQL][TEST] Add partition columns for TPCDS tables
### What changes were proposed in this pull request?

This PR adds partition columns for the TPCDS tables. The partition columns are consistent with the [TPCDSTables](https://github.com/databricks/spark-sql-perf/blob/master/src/main/scala/com/databricks/spark/sql/perf/tpcds/TPCDSTables.scala).

### Why are the changes needed?

Better track plan changes. For example, [this is the change](3fe1a93a40) after SPARK-34119.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

N/A

Closes #31243 from wangyum/SPARK-34155.

Authored-by: Yuming Wang <yumwang@ebay.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2021-01-24 13:08:26 +09:00
Max Gekk f8bf72ed5d [SPARK-34213][SQL] Refresh cached data of v1 table in LOAD DATA
### What changes were proposed in this pull request?
Invoke `CatalogImpl.refreshTable()` instead of `SessionCatalog.refreshTable` in the v1 implementation of the `LOAD DATA` command. `SessionCatalog.refreshTable` only refreshes metadata, whereas `CatalogImpl.refreshTable()` refreshes cached table data as well.

### Why are the changes needed?
The example below illustrates the issue:

- Create a source table:
```sql
spark-sql> CREATE TABLE src_tbl (c0 int, part int) USING hive PARTITIONED BY (part);
spark-sql> INSERT INTO src_tbl PARTITION (part=0) SELECT 0;
spark-sql> SHOW TABLE EXTENDED LIKE 'src_tbl' PARTITION (part=0);
default	src_tbl	false	Partition Values: [part=0]
Location: file:/Users/maximgekk/proj/load-data-refresh-cache/spark-warehouse/src_tbl/part=0
...
```
- Load data from the source table to a cached destination table:
```sql
spark-sql> CREATE TABLE dst_tbl (c0 int, part int) USING hive PARTITIONED BY (part);
spark-sql> INSERT INTO dst_tbl PARTITION (part=1) SELECT 1;
spark-sql> CACHE TABLE dst_tbl;
spark-sql> SELECT * FROM dst_tbl;
1	1
spark-sql> LOAD DATA LOCAL INPATH '/Users/maximgekk/proj/load-data-refresh-cache/spark-warehouse/src_tbl/part=0' INTO TABLE dst_tbl PARTITION (part=0);
spark-sql> SELECT * FROM dst_tbl;
1	1
```
The last query does not return the newly loaded data.

### Does this PR introduce _any_ user-facing change?
Yes. After the changes, the example above works correctly:
```sql
spark-sql> LOAD DATA LOCAL INPATH '/Users/maximgekk/proj/load-data-refresh-cache/spark-warehouse/src_tbl/part=0' INTO TABLE dst_tbl PARTITION (part=0);
spark-sql> SELECT * FROM dst_tbl;
0	0
1	1
```

### How was this patch tested?
Added new test to `org.apache.spark.sql.hive.CachedTableSuite`:
```
$ build/sbt -Phive -Phive-thriftserver "test:testOnly *CachedTableSuite"
```

Closes #31304 from MaxGekk/load-data-refresh-cache.

Authored-by: Max Gekk <max.gekk@gmail.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2021-01-23 15:49:10 -08:00
Max Gekk 0592503669 [SPARK-34207][SQL] Rename isTemporaryTable to isTempView in SessionCatalog
### What changes were proposed in this pull request?
Rename `SessionCatalog.isTemporaryTable()` to `SessionCatalog.isTempView()`.

### Why are the changes needed?
To improve code maintenance. Currently, there are two methods that do the same but have different names:
```scala
def isTempView(nameParts: Seq[String]): Boolean
```
and
```scala
def isTemporaryTable(name: TableIdentifier): Boolean
```

### Does this PR introduce _any_ user-facing change?
Should not since `SessionCatalog` is not public API.

### How was this patch tested?
By running the existing tests:
```
$ build/sbt -Phive-2.3 -Phive-thriftserver "test:testOnly *SessionCatalogSuite"
```

Closes #31295 from MaxGekk/replace-isTemporaryTable-by-isTempView.

Authored-by: Max Gekk <max.gekk@gmail.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2021-01-23 08:16:11 -08:00
yangjie01 e48a8ad1a2 [SPARK-34202][SQL][TEST] Add ability to fetch spark release package from internal environment in HiveExternalCatalogVersionsSuite
### What changes were proposed in this pull request?
`HiveExternalCatalogVersionsSuite` can't run in an organization's internal environment where outside internet access is not allowed, because it downloads the Spark release package from the internet.

Similar to SPARK-32998, this PR adds one environment variable, `SPARK_RELEASE_MIRROR`, to let users specify an accessible download address for the Spark release package and run `HiveExternalCatalogVersionsSuite` in an internal environment.

### Why are the changes needed?
Let `HiveExternalCatalogVersionsSuite` run in an internal environment without relying on an external Spark release download address.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?

- Pass the Jenkins or GitHub Action

- Manual test with and without the env variable set, in an internal environment that can't access the internet.

execute
```
mvn clean install -Dhadoop-3.2 -Phive-2.3 -Phadoop-cloud -Pmesos -Pyarn -Pkinesis-asl -Phive-thriftserver -Pspark-ganglia-lgpl -Pkubernetes -PhPhive -pl  sql/hive -am -DskipTests

mvn clean install -Dhadoop-3.2 -Phive-2.3 -Phadoop-cloud -Pmesos -Pyarn -Pkinesis-asl -Phive-thriftserver -Pspark-ganglia-lgpl -Pkubernetes -PhPhive -pl  sql/hive -DwildcardSuites=org.apache.spark.sql.hive.HiveExternalCatalogVersionsSuite -Dtest=none
```

**Without env**

```
HiveExternalCatalogVersionsSuite:
19:50:35.123 WARN org.apache.spark.sql.hive.HiveExternalCatalogVersionsSuite: Failed to download Spark 3.0.1 from https://archive.apache.org/dist/spark/spark-3.0.1/spark-3.0.1-bin-hadoop3.2.tgz: Network is unreachable (connect failed)
19:50:35.126 WARN org.apache.spark.sql.hive.HiveExternalCatalogVersionsSuite: Failed to download Spark 3.0.1 from https://dist.apache.org/repos/dist/release/spark/spark-3.0.1/spark-3.0.1-bin-hadoop3.2.tgz: Network is unreachable (connect failed)
org.apache.spark.sql.hive.HiveExternalCatalogVersionsSuite *** ABORTED ***
  Exception encountered when invoking run on a nested suite - Unable to download Spark 3.0.1 (HiveExternalCatalogVersionsSuite.scala:125)
Run completed in 2 seconds, 669 milliseconds.
Total number of tests run: 0
Suites: completed 1, aborted 1
Tests: succeeded 0, failed 0, canceled 0, ignored 0, pending 0
```

**With env**

```
export SPARK_RELEASE_MIRROR=${spark-release.internal.com}/dist/release/
```

```
HiveExternalCatalogVersionsSuite
- backward compatibility
Run completed in 1 minute, 32 seconds.
Total number of tests run: 1
Suites: completed 2, aborted 0
Tests: succeeded 1, failed 0, canceled 0, ignored 0, pending 0
All tests passed.
```

Closes #31294 from LuciferYang/SPARK-34202.

Authored-by: yangjie01 <yangjie01@baidu.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2021-01-23 08:02:52 -08:00
Wenchen Fan b8a6906627 [SPARK-34200][SQL] Ambiguous column reference should consider attribute availability
### What changes were proposed in this pull request?

This is a long-standing bug that has existed since we added the ambiguous self-join check. A column reference is not ambiguous if it can only come from one join side (e.g. the other side has a project that only picks a few columns). An example is
```
Join(b#1 = 3)
  TableScan(t, [a#0, b#1])
  Project(a#2)
    TableScan(t, [a#2, b#3])
```
It's a self-join, but `b#1` is not ambiguous because it can't come from the right side, which only has column `a`.

### Why are the changes needed?

to not fail valid self-join queries.

### Does this PR introduce _any_ user-facing change?

yea as a bug fix

### How was this patch tested?

a new test

Closes #31287 from cloud-fan/self-join.

Authored-by: Wenchen Fan <wenchen@databricks.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2021-01-22 20:11:53 +09:00
Yu Zhong 2db0a954e3 [SPARK-33933][SQL] Materialize BroadcastQueryStage first to try to avoid broadcast timeout in AQE
### What changes were proposed in this pull request?
This PR is the same as https://github.com/apache/spark/pull/30998, but with a better UT.
In AdaptiveSparkPlanExec.getFinalPhysicalPlan, when newStages are generated, sort the new stages by class type to make sure BroadcastQueryStage precedes the others.
This partial fix only guarantees that the materialization of BroadcastQueryStage starts before the others; because the submission of the collect job for broadcasting runs in another thread, the issue is not completely solved.

### Why are the changes needed?
When AQE is enabled, in getFinalPhysicalPlan, Spark traverses the physical plan bottom-up, creates query stages for the materialized parts via createQueryStages, and materializes those newly created query stages to submit map stages or broadcasts. When a ShuffleQueryStage is materialized before a BroadcastQueryStage, the map stage (job) and the broadcast job are submitted almost at the same time, but the map stage will hold all the computing resources. If the map stage runs slowly (when there is a lot of data to process and resources are limited), the broadcast job cannot be started (and finished) before spark.sql.broadcastTimeout, which causes the whole job to fail (introduced in SPARK-31475).
The workaround of increasing spark.sql.broadcastTimeout is neither sensible nor graceful, because the data to broadcast is very small.

The order of calling materialize determines the order in which tasks are scheduled under normal circumstances, but the guarantee is not strict since the broadcast job and the shuffle map job are submitted from different threads.
1. For the broadcast job, doPrepare() is called in the main thread, and then the real materialization starts in the "broadcast-exchange-0" thread pool: getByteArrayRdd().collect() is called to submit the collect job.
2. For the shuffle map job, ShuffleExchangeExec.mapOutputStatisticsFuture() calls sparkContext.submitMapStage() directly in the main thread to submit the map stage.

Step 1 is triggered before step 2, so in normal cases the broadcast job is submitted first.
However, we cannot control how fast the two threads run, so the "broadcast-exchange-0" thread could run slightly slower than the main thread, resulting in the map stage being submitted first. So there is still a risk that the shuffle map job is scheduled before the broadcast job.

Since completely fixing the issue is complex and might introduce major changes, we need more time to follow up. This partial fix is better than doing nothing; it resolves most of the cases in SPARK-33933.

### Does this PR introduce _any_ user-facing change?
NO

### How was this patch tested?
Add UT

Closes #31269 from zhongyu09/aqe-broadcast-partial-fix.

Authored-by: Yu Zhong <zhongyu8@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2021-01-22 07:22:53 +00:00
beliefer fec82c9504 [SPARK-33245][SQL] Add built-in UDF - GETBIT
### What changes were proposed in this pull request?
`GETBIT` is a bitwise expression function that, given an INTEGER value, returns the value of the bit at a specified position.
`GETBIT( <integer_expr>, <bit_position> )`

Examples:
select getbit(11, 100), getbit(11, 3), getbit(11, 2), getbit(11, 1), getbit(11, 0);

GETBIT(11, 3) | GETBIT(11, 2) | GETBIT(11, 1) | GETBIT(11, 0)
-- | -- | -- | --
1 | 0 | 1 | 1
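
A minimal sketch of the semantics in plain Scala (bit position 0 is the least significant bit); this is only an illustration, not Spark's implementation:

```scala
// Illustrative only: the bit at position `pos` (0 = least significant) of an integer value.
def getbit(value: Long, pos: Int): Int = ((value >> pos) & 1L).toInt

assert(getbit(11, 3) == 1) // 11 = 0b1011
assert(getbit(11, 2) == 0)
assert(getbit(11, 1) == 1)
assert(getbit(11, 0) == 1)
```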

The mainstream databases supporting this feature are shown below:

**Teradata**
https://docs.teradata.com/reader/kmuOwjp1zEYg98JsB8fu_A/PK1oV1b2jqvG~ohRnOro9w

**Impala**
https://docs.cloudera.com/runtime/7.2.0/impala-sql-reference/topics/impala-bit-functions.html#bit_functions__getbit

**Snowflake**
https://docs.snowflake.com/en/sql-reference/functions/getbit.html

**Yellowbrick**
https://www.yellowbrick.com/docs/2.2/ybd_sqlref/getbit.html

### Why are the changes needed?
GETBIT is very useful.

### Does this PR introduce _any_ user-facing change?
Yes. GETBIT is a new bitwise function.

### How was this patch tested?
Jenkins test

Closes #31198 from beliefer/SPARK-33245.

Authored-by: beliefer <beliefer@163.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2021-01-22 04:57:39 +00:00
beliefer cde697a479 [SPARK-33541][SQL] Group exception messages in catalyst/expressions
### What changes were proposed in this pull request?
This PR groups exception messages in `/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions`.

### Why are the changes needed?
It will largely help with the standardization of error messages and their maintenance.

### Does this PR introduce _any_ user-facing change?
No. Error messages remain unchanged.

### How was this patch tested?
No new tests; all original tests pass to make sure it doesn't break any existing behavior.

Closes #31228 from beliefer/SPARK-33541.

Authored-by: beliefer <beliefer@163.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2021-01-22 04:52:05 +00:00
Kousuke Saruta 45f076336b [SPARK-33813][SQL] Fix the issue that JDBC source can't treat MS SQL Server's spatial types
### What changes were proposed in this pull request?

This PR fixes the issue that reading tables which contain spatial datatypes from MS SQL Server fails.
MS SQL Server supports two non-standard spatial JDBC types, `geometry` and `geography`, but Spark SQL can't handle them:

```
java.sql.SQLException: Unrecognized SQL type -157
 at org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$.getCatalystType(JdbcUtils.scala:251)
 at org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$.$anonfun$getSchema$1(JdbcUtils.scala:321)
 at scala.Option.getOrElse(Option.scala:189)
 at org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$.getSchema(JdbcUtils.scala:321)
 at org.apache.spark.sql.execution.datasources.jdbc.JDBCRDD$.resolveTable(JDBCRDD.scala:63)
 at org.apache.spark.sql.execution.datasources.jdbc.JDBCRelation$.getSchema(JDBCRelation.scala:226)
 at org.apache.spark.sql.execution.datasources.jdbc.JdbcRelationProvider.createRelation(JdbcRelationProvider.scala:35)
 at org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:364)
 at org.apache.spark.sql.DataFrameReader.loadV1Source(DataFrameReader.scala:366)
 at org.apache.spark.sql.DataFrameReader.$anonfun$load$2(DataFrameReader.scala:355)
 at scala.Option.getOrElse(Option.scala:189)
 at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:355)
 at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:240)
 at org.apache.spark.sql.DataFrameReader.jdbc(DataFrameReader.scala:381)
```

Considering what the [data type mapping](https://docs.microsoft.com/ja-jp/sql/connect/jdbc/using-basic-data-types?view=sql-server-ver15) says, I think those spatial types can be mapped to Catalyst's `BinaryType`.
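
A hedged sketch of how such a mapping could be expressed with Spark's public `JdbcDialect` API (the dialect object below is illustrative; the actual patch presumably adjusts the built-in MsSqlServerDialect instead):

```scala
import org.apache.spark.sql.jdbc.{JdbcDialect, JdbcDialects}
import org.apache.spark.sql.types.{BinaryType, DataType, MetadataBuilder}

// Illustrative custom dialect: map SQL Server's non-standard spatial types to BinaryType.
object SpatialAwareMsSqlDialect extends JdbcDialect {
  override def canHandle(url: String): Boolean = url.startsWith("jdbc:sqlserver")

  override def getCatalystType(
      sqlType: Int, typeName: String, size: Int, md: MetadataBuilder): Option[DataType] = {
    // geometry/geography are reported with vendor-specific JDBC type codes (e.g. -157).
    if (typeName.equalsIgnoreCase("geometry") || typeName.equalsIgnoreCase("geography")) {
      Some(BinaryType)
    } else {
      None // fall back to the default mapping
    }
  }
}

JdbcDialects.registerDialect(SpatialAwareMsSqlDialect)
```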

### Why are the changes needed?

To provide better support.

### Does this PR introduce _any_ user-facing change?

Yes. MS SQL Server users can use `geometry` and `geography` types in datasource tables.

### How was this patch tested?

New test case added to `MsSqlServerIntegrationSuite`.

Closes #31283 from sarutak/mssql-spatial-types.

Authored-by: Kousuke Saruta <sarutak@oss.nttdata.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2021-01-22 04:28:22 +00:00
Kousuke Saruta 842902154a [SPARK-34180][SQL] Fix the regression brought by SPARK-33888 for PostgresDialect
### What changes were proposed in this pull request?

This PR fixes the regression bug brought by SPARK-33888 (#30902).
After that PR was merged, `PostgresDialect#getCatalystType` throws an exception for array types.
```
[info] - Type mapping for various types *** FAILED *** (551 milliseconds)
[info]   java.util.NoSuchElementException: key not found: scale
[info]   at scala.collection.immutable.Map$EmptyMap$.apply(Map.scala:106)
[info]   at scala.collection.immutable.Map$EmptyMap$.apply(Map.scala:104)
[info]   at org.apache.spark.sql.types.Metadata.get(Metadata.scala:111)
[info]   at org.apache.spark.sql.types.Metadata.getLong(Metadata.scala:51)
[info]   at org.apache.spark.sql.jdbc.PostgresDialect$.getCatalystType(PostgresDialect.scala:43)
[info]   at org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$.getSchema(JdbcUtils.scala:321)
```
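
A hypothetical sketch of the defensive pattern behind the fix (the helper name is made up): read the "scale" metadata key only when it is present, instead of assuming every column carries it.

```scala
import org.apache.spark.sql.types.{Metadata, MetadataBuilder}

// Hypothetical helper: default the scale to 0 when the key is absent (e.g. for array types).
def scaleOf(md: Metadata): Long =
  if (md.contains("scale")) md.getLong("scale") else 0L

assert(scaleOf(Metadata.empty) == 0L)
assert(scaleOf(new MetadataBuilder().putLong("scale", 2L).build()) == 2L)
```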

### Why are the changes needed?

To fix the regression bug.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

I confirmed the test case `SPARK-22291: Conversion error when transforming array types of uuid, inet and cidr to StingType in PostgreSQL` in `PostgresIntegrationSuite` passed.

I also confirmed that all the `v2.*IntegrationSuite` suites pass, since this PR changed them; they all passed.

Closes #31262 from sarutak/fix-postgres-dialect-regression.

Authored-by: Kousuke Saruta <sarutak@oss.nttdata.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2021-01-22 13:03:02 +09:00
Kousuke Saruta 116f4cab6b [SPARK-34094][SQL] Extends StringTranslate to support unicode characters whose code point >= U+10000
### What changes were proposed in this pull request?

This PR extends `StringTranslate` to support unicode characters whose code point >= `U+10000`.

### Why are the changes needed?

To make it work with a wide variety of characters.

### Does this PR introduce _any_ user-facing change?

Yes. Users can use `StringTranslate` with unicode characters whose code point >= `U+10000`.
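
A hedged usage sketch (the session setup and data are illustrative): `translate` applied to a supplementary character, i.e. one whose code point is at least U+10000.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.translate

// "𝔸" (U+1D538) lies outside the BMP and is encoded as a surrogate pair in Java strings.
val spark = SparkSession.builder().master("local[1]").appName("translate-demo").getOrCreate()
import spark.implicits._

Seq("a𝔸b").toDF("s")
  .select(translate($"s", "𝔸", "x").as("t"))
  .show() // expected: "axb" once supplementary characters are handled correctly
spark.stop()
```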

### How was this patch tested?

New assertion added to the existing test.

Closes #31164 from sarutak/extends-translate.

Authored-by: Kousuke Saruta <sarutak@oss.nttdata.com>
Signed-off-by: Sean Owen <srowen@gmail.com>
2021-01-21 08:15:55 -06:00
Max Gekk e79c1cde1b [SPARK-34138][SQL] Keep dependants cached while refreshing v1 tables
### What changes were proposed in this pull request?
This PR changes cache refreshing of v1 tables in v1 commands. In particular, v1 table dependents are not removed from the cache after this PR. Compared to the current implementation, we just clear the cached data of all dependents and keep them in the cache. So, the next actions will fill in the cached data of the original v1 table and its dependents. In more detail:
1. Modified the `CatalogImpl.refreshTable()` method to use `recacheByPlan()` instead of `lookupCachedData()`, `uncacheQuery()` and `cacheQuery()`. Users can call this method via a public API like `spark.catalog.refreshTable()`.
2. Rewrote the part in `CatalogImpl.refreshTable()` which was responsible for table metadata refreshing, because this code stopped working properly after the removal of the second `sparkSession.table(tableIdent)`.
3. Added a new private method `invalidateCachedTable()` to `SessionCatalog`. Compared to the existing `SessionCatalog.refreshTable`, it invalidates only the relation cache. If we called `SessionCatalog.refreshTable` from `CatalogImpl.refreshTable()`, we would refresh temporary and global temporary views twice (which could lead to refreshing the file index twice).

### Why are the changes needed?
1. This should improve user experience with table/view caching. For example, let's imagine that a user has a cached v1 table and a cached view based on that table, and the user passes the table to an external library which drops/renames/adds partitions in the v1 table. Unfortunately, the view then becomes uncached even though the user hasn't uncached it explicitly.
2. To improve code maintenance.
3. To reduce the amount of calls to Hive external catalog.
4. Also this should speed up table recaching.
5. To have the same behavior as for v2 tables supported by https://github.com/apache/spark/pull/31172

### Does this PR introduce _any_ user-facing change?
From the viewpoint of query result correctness, there are no behavior changes, but the changes might influence memory consumption and query execution time. For example:

Before:
```scala
scala> sql("CREATE TABLE tbl (c int)")
scala> sql("CACHE TABLE tbl")
scala> sql("CREATE VIEW v AS SELECT * FROM tbl")
scala> sql("CACHE TABLE v")

scala> spark.catalog.isCached("v")
res6: Boolean = true
scala> spark.catalog.refreshTable("tbl")

scala> spark.catalog.isCached("v")
res8: Boolean = false
```

After:
```scala
scala> spark.catalog.refreshTable("tbl")

scala> spark.catalog.isCached("v")
res8: Boolean = true
```

### How was this patch tested?
1. Added new unit tests that create a view, a temporary view and a global temporary view on top of v1/v2 tables, and refresh the base table via `ALTER TABLE .. ADD/DROP/RENAME PARTITION`.
2. By running the unified test suites:
```
$ build/sbt -Phive-2.3 -Phive-thriftserver "test:testOnly *AlterTableAddPartitionSuite"
$ build/sbt -Phive-2.3 -Phive-thriftserver "test:testOnly *AlterTableDropPartitionSuite"
$ build/sbt -Phive-2.3 -Phive-thriftserver "test:testOnly *AlterTableRenamePartitionSuite"
```

Closes #31206 from MaxGekk/refreshTable-recache-by-plan.

Authored-by: Max Gekk <max.gekk@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2021-01-21 13:03:24 +00:00
ulysses-you da4b50f8e2 [SPARK-33901][SQL][FOLLOWUP] Add drop table in charvarchar test
### What changes were proposed in this pull request?

Add `drop table` in charvarchar sql test.

### Why are the changes needed?

1. `drop table` is also a test case, for better coverage.
2. It's cleaner to drop tables that were created in the current test.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Add test.

Closes #31277 from ulysses-you/SPARK-33901-FOLLOWUP.

Authored-by: ulysses-you <ulyssesyou18@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2021-01-21 12:41:52 +00:00
Max Gekk 31d828379c [SPARK-34099][SQL][TESTS] Check re-caching of v2 table dependents after table altering
### What changes were proposed in this pull request?
Add tests to check that v2 table dependents are re-cached after altering the table via the commands:
- `ALTER TABLE .. ADD PARTITION`
- `ALTER TABLE .. DROP PARTITION`
- `ALTER TABLE .. RENAME PARTITION`

### Why are the changes needed?
To improve test coverage and prevent regressions.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
By running the modified test suites:
```
$ build/sbt -Phive-2.3 -Phive-thriftserver "test:testOnly *.AlterTableAddPartitionSuite"
$ build/sbt -Phive-2.3 -Phive-thriftserver "test:testOnly *.AlterTableDropPartitionSuite"
$ build/sbt -Phive-2.3 -Phive-thriftserver "test:testOnly *.AlterTableRenamePartitionSuite"
```

Closes #31250 from MaxGekk/check-v2-dependents-recached.

Authored-by: Max Gekk <max.gekk@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2021-01-21 08:42:17 +00:00
beliefer 140538ea5b [SPARK-34096][SQL] Improve performance for nth_value ignore nulls over offset window
### What changes were proposed in this pull request?
The current implementation of `UnboundedOffsetWindowFunctionFrame` and `UnboundedPrecedingOffsetWindowFunctionFrame` only supports `nth_value` that respects nulls. So `nth_value` with ignore nulls executes with `UnboundedWindowFunctionFrame` and `UnboundedPrecedingWindowFunctionFrame` instead.
`UnboundedWindowFunctionFrame` and `UnboundedPrecedingWindowFunctionFrame` will call `updateExpressions` of `nth_value` multiple times.
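
A hedged usage sketch of the affected pattern, assuming an active SparkSession `spark` and a table `t(grp, id, v)` (both assumptions for illustration):

```scala
// nth_value with IGNORE NULLS over an unbounded-preceding window frame.
spark.sql(
  """
    |SELECT id,
    |       nth_value(v, 2) IGNORE NULLS OVER (
    |         PARTITION BY grp
    |         ORDER BY id
    |         ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS second_non_null_so_far
    |FROM t
    |""".stripMargin).show()
```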

### Why are the changes needed?
Improve performance for nth_value ignore nulls over offset window

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
Jenkins test

Closes #31178 from beliefer/SPARK-34096.

Authored-by: beliefer <beliefer@163.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2021-01-21 07:31:36 +00:00
Kent Yao d640631e36 [SPARK-34164][SQL] Improve write side varchar check to visit only last few trailing spaces
### What changes were proposed in this pull request?

For varchar(N), we currently trim all trailing spaces first to check whether the remaining length exceeds N. It is not necessary to visit all of them, but at most those after position N.
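
A minimal sketch of the idea, assuming the semantics described above (trailing spaces beyond N do not count against the limit); this is not the actual write-side check code:

```scala
// Only the characters after position n need to be inspected; they must all be spaces.
def fitsVarchar(s: String, n: Int): Boolean =
  s.length <= n || s.substring(n).forall(_ == ' ')

assert(fitsVarchar("abc  ", 3))  // trailing spaces beyond n are allowed
assert(!fitsVarchar("abcd ", 3)) // a non-space character beyond n exceeds the limit
```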

### Why are the changes needed?

Improve varchar performance on the write side.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Benchmark and existing UT.

Closes #31253 from yaooqinn/SPARK-34164.

Authored-by: Kent Yao <yao@apache.org>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2021-01-21 05:30:57 +00:00
Ismaël Mejía e9e81f798f [SPARK-27733][CORE] Upgrade Avro to version 1.10.1
### What changes were proposed in this pull request?

Update Avro dependency to version 1.10.1

### Why are the changes needed?

To catch up multiple improvements of Avro as well as fix security issues on transitive dependencies.

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

Since no API changes were required, we just ran the tests.

Closes #31232 from iemejia/SPARK-27733-avro-upgrade.

Authored-by: Ismaël Mejía <iemejia@gmail.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2021-01-20 15:42:27 -08:00
yangjie01 d68612a008 [SPARK-34176][BUILD] Restore the independent mvn test ability of sql/hive module in Scala 2.13
### What changes were proposed in this pull request?
There is one Java UT error when testing the sql/hive module independently in Scala 2.13 after SPARK-33212; the error message is as follows:

```
[ERROR] Tests run: 2, Failures: 0, Errors: 1, Skipped: 0, Time elapsed: 20.353 s <<< FAILURE! - in org.apache.spark.sql.hive.JavaDataFrameSuite
[ERROR] org.apache.spark.sql.hive.JavaDataFrameSuite.testUDAF  Time elapsed: 18.548 s  <<< ERROR!
java.lang.NoClassDefFoundError: scala/collection/parallel/TaskSupport
	at org.apache.spark.sql.hive.JavaDataFrameSuite.checkAnswer(JavaDataFrameSuite.java:41)
	at org.apache.spark.sql.hive.JavaDataFrameSuite.testUDAF(JavaDataFrameSuite.java:92)
Caused by: java.lang.ClassNotFoundException: scala.collection.parallel.TaskSupport
	at org.apache.spark.sql.hive.JavaDataFrameSuite.checkAnswer(JavaDataFrameSuite.java:41)
	at org.apache.spark.sql.hive.JavaDataFrameSuite.testUDAF(JavaDataFrameSuite.java:92)
```

This PR adds a Scala 2.13 profile with a dependency on `scala-parallel-collections_` to the `sql/hive` module to fix the Java UT in Scala 2.13.

### Why are the changes needed?
Recover the independent mvn test ability of sql/hive module in Scala 2.13.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?

- Pass the Jenkins or GitHub Action
- Manual test

```
dev/change-scala-version.sh 2.13

mvn clean install -Dhadoop-3.2 -Phive-2.3 -Phadoop-cloud -Pmesos -Pyarn -Pkinesis-asl -Phive-thriftserver -Pspark-ganglia-lgpl -Pkubernetes -Phive -Pscala-2.13 -pl  sql/hive -am -DskipTests

mvn test -Dhadoop-3.2 -Phive-2.3 -Phadoop-cloud -Pmesos -Pyarn -Pkinesis-asl -Phive-thriftserver -Pspark-ganglia-lgpl -Pkubernetes -Phive -Pscala-2.13 -pl  sql/hive
```

**Before**

```
[ERROR] Tests run: 2, Failures: 0, Errors: 1, Skipped: 0, Time elapsed: 18.725 s <<< FAILURE! - in org.apache.spark.sql.hive.JavaDataFrameSuite
[ERROR] org.apache.spark.sql.hive.JavaDataFrameSuite.testUDAF  Time elapsed: 16.853 s  <<< ERROR!
java.lang.NoClassDefFoundError: scala/collection/parallel/TaskSupport
	at org.apache.spark.sql.hive.JavaDataFrameSuite.checkAnswer(JavaDataFrameSuite.java:41)
	at org.apache.spark.sql.hive.JavaDataFrameSuite.testUDAF(JavaDataFrameSuite.java:92)
Caused by: java.lang.ClassNotFoundException: scala.collection.parallel.TaskSupport
	at org.apache.spark.sql.hive.JavaDataFrameSuite.checkAnswer(JavaDataFrameSuite.java:41)
	at org.apache.spark.sql.hive.JavaDataFrameSuite.testUDAF(JavaDataFrameSuite.java:92)

[INFO] Running org.apache.spark.sql.hive.JavaMetastoreDataSourcesSuite
16:15:36.186 WARN org.apache.spark.sql.hive.test.TestHiveExternalCatalog: Couldn't find corresponding Hive SerDe for data source provider org.apache.spark.sql.json. Persisting data source table `default`.`javasavedtable` into Hive metastore in Spark SQL specific format, which is NOT compatible with Hive.
16:15:36.288 WARN org.apache.hadoop.hive.ql.session.SessionState: METASTORE_FILTER_HOOK will be ignored, since hive.security.authorization.manager is set to instance of HiveAuthorizerFactory.
16:15:36.396 WARN org.apache.hadoop.hive.conf.HiveConf: HiveConf of name hive.internal.ss.authz.settings.applied.marker does not exist
16:15:36.397 WARN org.apache.hadoop.hive.conf.HiveConf: HiveConf of name hive.stats.jdbc.timeout does not exist
16:15:36.397 WARN org.apache.hadoop.hive.conf.HiveConf: HiveConf of name hive.stats.retries.wait does not exist
[INFO] Tests run: 1, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 3.481 s - in org.apache.spark.sql.hive.JavaMetastoreDataSourcesSuite
[INFO]
[INFO] Results:
[INFO]
[ERROR] Errors:
[ERROR]   JavaDataFrameSuite.testUDAF:92->checkAnswer:41 » NoClassDefFound scala/collect...
[INFO]
[ERROR] Tests run: 3, Failures: 0, Errors: 1, Skipped: 0
```

**After**

```
[INFO] Tests run: 2, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 19.287 s - in org.apache.spark.sql.hive.JavaDataFrameSuite
[INFO] Running org.apache.spark.sql.hive.JavaMetastoreDataSourcesSuite
16:12:16.697 WARN org.apache.spark.sql.hive.test.TestHiveExternalCatalog: Couldn't find corresponding Hive SerDe for data source provider org.apache.spark.sql.json. Persisting data source table `default`.`javasavedtable` into Hive metastore in Spark SQL specific format, which is NOT compatible with Hive.
16:12:17.540 WARN org.apache.hadoop.hive.ql.session.SessionState: METASTORE_FILTER_HOOK will be ignored, since hive.security.authorization.manager is set to instance of HiveAuthorizerFactory.
16:12:17.653 WARN org.apache.hadoop.hive.conf.HiveConf: HiveConf of name hive.internal.ss.authz.settings.applied.marker does not exist
16:12:17.653 WARN org.apache.hadoop.hive.conf.HiveConf: HiveConf of name hive.stats.jdbc.timeout does not exist
16:12:17.654 WARN org.apache.hadoop.hive.conf.HiveConf: HiveConf of name hive.stats.retries.wait does not exist
[INFO] Tests run: 1, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 3.58 s - in org.apache.spark.sql.hive.JavaMetastoreDataSourcesSuite
[INFO]
[INFO] Results:
[INFO]
[INFO] Tests run: 3, Failures: 0, Errors: 0, Skipped: 0
```

Closes #31259 from LuciferYang/SPARK-34176.

Authored-by: yangjie01 <yangjie01@baidu.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2021-01-20 15:33:31 -08:00
yi.wu f498977222 [SPARK-34178][SQL] Copy tags for the new node created by MultiInstanceRelation.newInstance
### What changes were proposed in this pull request?

Call `copyTagsFrom` for the new node created by `MultiInstanceRelation.newInstance()`.

### Why are the changes needed?

```scala
val df = spark.range(2)
df.join(df, df("id") <=> df("id")).show()
```

For this query, it's supposed to be a non-ambiguous join according to the rule `DetectAmbiguousSelfJoin`, because of the same attribute reference in the condition:

537a49fc09/sql/core/src/main/scala/org/apache/spark/sql/execution/analysis/DetectAmbiguousSelfJoin.scala (L125)

However, `DetectAmbiguousSelfJoin` cannot apply this prediction because the right-side plan doesn't contain the dataset_id TreeNodeTag, which is lost after `MultiInstanceRelation.newInstance`. That's why we should preserve the tag info for the copied node.

Fortunately, the query is still considered a non-ambiguous join because `DetectAmbiguousSelfJoin` only checks the left-side plan and the reference matches the left-side plan. However, this is not the expected behavior but only a coincidence.
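
A hedged sketch of the behavior using catalyst internals (API details may differ between versions; this is not the actual diff): tags set on a relation are not carried over by `newInstance()`, so the fix copies them explicitly.

```scala
import org.apache.spark.sql.catalyst.expressions.AttributeReference
import org.apache.spark.sql.catalyst.plans.logical.LocalRelation
import org.apache.spark.sql.catalyst.trees.TreeNodeTag
import org.apache.spark.sql.types.LongType

val tag = TreeNodeTag[Long]("dataset_id")
val rel = LocalRelation(Seq(AttributeReference("id", LongType)()))
rel.setTagValue(tag, 42L)

val fresh = rel.newInstance() // new instance with fresh expression IDs, but without the tags
fresh.copyTagsFrom(rel)       // what the fix does for nodes created by newInstance()
assert(fresh.getTagValue(tag).contains(42L))
```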

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Updated a unit test

Closes #31260 from Ngone51/fix-missing-tags.

Authored-by: yi.wu <yi.wu@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2021-01-20 13:36:14 +00:00
Chao Sun 902a08b9e6 [SPARK-34052][SQL] store SQL text for a temp view created using "CACHE TABLE .. AS SELECT"
### What changes were proposed in this pull request?

This passes the original SQL text to the `CacheTableAsSelect` command in DSv1 and v2 so that it is stored instead of the analyzed logical plan, similar to the `CREATE VIEW` command.

In addition, this changes the behavior of dropping temporary view to also invalidate dependent caches in a cascade, when the config `SQLConf.STORE_ANALYZED_PLAN_FOR_VIEW` is false (which is the default value).

### Why are the changes needed?

Currently, after creating a temporary view with the `CACHE TABLE ... AS SELECT` command, the view can still be queried even after the source table is dropped or replaced (in v2). This can cause correctness issues.

For instance, in the following:
```sql
> CREATE TABLE t ...;
> CACHE TABLE v AS SELECT * FROM t;
> DROP TABLE t;
> SELECT * FROM v;
```
The last select query still returns the old (and stale) result instead of failing. Note that the cache is already invalidated as part of dropping table `t`, but the temporary view `v` still exists.

On the other hand, the following:
```sql
> CREATE TABLE t ...;
> CREATE TEMPORARY VIEW v AS SELECT * FROM t;
> CACHE TABLE v;
> DROP TABLE t;
> SELECT * FROM v;
```
will throw a "Table or view not found" error in the last select query.

This is related to #30567, which aligns the behavior of temporary views and global views by storing the original SQL text for a temporary view, as opposed to the analyzed logical plan. However, that PR only handles the `CreateView` case but not the `CacheTableAsSelect` case.

This also changes the uncache logic to use cascading invalidation for the temporary views created above. This aligns the behavior with how a permanent view is handled today, and also avoids potential issues where a dependent view becomes invalid while its data is still kept in the cache.

### Does this PR introduce _any_ user-facing change?

Yes, now when `SQLConf.STORE_ANALYZED_PLAN_FOR_VIEW` is set to false (the default value), whenever a table/permanent view/temp view that a cached view depends on is dropped, the cached view itself will become invalid during analysis, i.e., the user will get a "Table or view not found" error. In addition, when the dependent is a temp view in the previous case, the cache itself will also be invalidated.

### How was this patch tested?

Modified/Enhanced some existing tests.

Closes #31107 from sunchao/SPARK-34052.

Lead-authored-by: Chao Sun <sunchao@apple.com>
Co-authored-by: Chao Sun <sunchao@apache.org>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2021-01-20 02:09:39 +00:00
Max Gekk 00b444d5ed [SPARK-34056][SQL][TESTS] Unify v1 and v2 ALTER TABLE .. RECOVER PARTITIONS tests
### What changes were proposed in this pull request?
1. Port DS V2 tests from `AlterTablePartitionV2SQLSuite ` to the test suite `v2.AlterTableRecoverPartitionsSuite`.
2. Port DS v1 tests from `DDLSuite` to `v1.AlterTableRecoverPartitionsSuiteBase`.

### Why are the changes needed?
To improve test coverage.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
By running new test suites:
```
$ build/sbt -Phive-2.3 -Phive-thriftserver "test:testOnly *AlterTableRecoverPartitionsParserSuite"
$ build/sbt -Phive-2.3 -Phive-thriftserver "test:testOnly *AlterTableRecoverPartitionsSuite"
$ build/sbt -Phive-2.3 -Phive-thriftserver "test:testOnly *CatalogedDDLSuite"
```

Closes #31105 from MaxGekk/unify-recover-partitions-tests.

Authored-by: Max Gekk <max.gekk@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2021-01-20 01:49:31 +00:00
Angerszhuuuu f6338a3e0b [SPARK-34121][SQL] Intersect operator missing rowCount when CBO enabled
### What changes were proposed in this pull request?

This PR adds the row count to the `Intersect` operator when CBO is enabled.

### Why are the changes needed?
Improve query performance; [JoinEstimation.estimateInnerOuterJoin](d6a68e0b67/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/statsEstimation/JoinEstimation.scala (L55-L156)) needs the row count.
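
One plausible estimate, sketched below, is to bound the intersection's row count by the smaller child's row count; this is an assumption for illustration, not necessarily the exact formula used in the patch.

```scala
// Assumed estimate: |A intersect B| <= min(|A|, |B|); fall back to whichever side has stats.
def intersectRowCount(leftRows: Option[BigInt], rightRows: Option[BigInt]): Option[BigInt] =
  (leftRows, rightRows) match {
    case (Some(l), Some(r)) => Some(l.min(r))
    case (Some(l), None)    => Some(l)
    case (None, Some(r))    => Some(r)
    case _                  => None
  }

assert(intersectRowCount(Some(BigInt(100)), Some(BigInt(40))).contains(BigInt(40)))
```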

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
Added

Closes #31240 from AngersZhuuuu/SPARK-34121.

Authored-by: Angerszhuuuu <angers.zhu@gmail.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2021-01-20 10:00:44 +09:00
Max Gekk 32dad1d5a6 [SPARK-34149][SQL] Refresh cache in v2 ALTER TABLE .. ADD PARTITION
### What changes were proposed in this pull request?
Clear the table cache after adding partitions to a v2 table in `AlterTableAddPartitionExec`.

### Why are the changes needed?
This PR fixes correctness issue. Without the fix, queries on cached tables modified via `ALTER TABLE .. ADD PARTITION` return incorrect results.

### Does this PR introduce _any_ user-facing change?
Yes.

### How was this patch tested?
Added new UT to `org.apache.spark.sql.execution.command.v2.AlterTableAddPartitionSuite`:
```
$ build/sbt -Phive-2.3 -Phive-thriftserver "test:testOnly *.AlterTableAddPartitionSuite"
```

Closes #31229 from MaxGekk/v2-add-partition-recache.

Authored-by: Max Gekk <max.gekk@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2021-01-19 09:42:07 +00:00
ulysses-you addbbe8339 [SPARK-33939][SQL] Make Column.named UnresolvedExtractValue use UnresolvedAlias to assign name
### What changes were proposed in this pull request?

Change the `Column.named` code to check whether the expression contains an `UnresolvedExtractValue` and, if so, use `UnresolvedAlias` to assign the name.

### Why are the changes needed?

It's more reasonable to treat a user-specified expression as an unresolved expression, so we should assign the name after analysis.

Let's say we have this code
```
spark.range(1).selectExpr("id as id1", "id as id2")
  .selectExpr("cast(struct(id1, id2).id1 as int)")
```

before this PR, the field name is `CAST(struct(id1, id2)[id1] AS INT)`.

After, the field name is `CAST(struct(id1, id2).id1 AS INT)`.

### Does this PR introduce _any_ user-facing change?

Yes, the default field name may be changed.

### How was this patch tested?

Add test.

Closes #30974 from ulysses-you/SPARK-33939-0.

Authored-by: ulysses-you <ulyssesyou18@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2021-01-19 09:35:56 +00:00
Kent Yao 6fa2fb9eb5 [SPARK-34130][SQL] Improve performance for char varchar padding and length check with StaticInvoke
### What changes were proposed in this pull request?

This could reduce the size of the generated Java code (`generate.java`) to prevent codegen fallback, which causes a performance regression.

here is a case from tpcds that could be fixed by this improvement
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/133964/testReport/org.apache.spark.sql.execution/LogicalPlanTagInSparkPlanSuite/q41/

The original case generates 20K bytes; we are trying to reduce it to less than 8K.
### Why are the changes needed?

Performance improvement: as shown in the PR benchmark test, the performance w/ codegen is 2~3x better than w/o codegen.

### Does this PR introduce _any_ user-facing change?

no

### How was this patch tested?

Yes, it's a code refactoring, so the existing UTs should be enough.

Cross-checked with https://github.com/apache/spark/pull/31012, where all the TPC-DS queries should pass.

benchmark compared with master

```logtalk
================================================================================================
Char Varchar Read Side Perf
================================================================================================

Java HotSpot(TM) 64-Bit Server VM 1.8.0_251-b08 on Mac OS X 10.16
Intel(R) Core(TM) i9-9980HK CPU  2.40GHz
Read with length 20, hasSpaces: false:    Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
------------------------------------------------------------------------------------------------------------------------
read string with length 20                         1571           1667          83         63.6          15.7       1.0X
read char with length 20                           1710           1764          58         58.5          17.1       0.9X
read varchar with length 20                        1774           1792          16         56.4          17.7       0.9X

Java HotSpot(TM) 64-Bit Server VM 1.8.0_251-b08 on Mac OS X 10.16
Intel(R) Core(TM) i9-9980HK CPU  2.40GHz
Read with length 40, hasSpaces: false:    Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
------------------------------------------------------------------------------------------------------------------------
read string with length 40                         1824           1927          91         54.8          18.2       1.0X
read char with length 40                           1788           1928         137         55.9          17.9       1.0X
read varchar with length 40                        1676           1700          40         59.7          16.8       1.1X

Java HotSpot(TM) 64-Bit Server VM 1.8.0_251-b08 on Mac OS X 10.16
Intel(R) Core(TM) i9-9980HK CPU  2.40GHz
Read with length 60, hasSpaces: false:    Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
------------------------------------------------------------------------------------------------------------------------
read string with length 60                         1727           1762          30         57.9          17.3       1.0X
read char with length 60                           1628           1674          43         61.4          16.3       1.1X
read varchar with length 60                        1651           1665          13         60.6          16.5       1.0X

Java HotSpot(TM) 64-Bit Server VM 1.8.0_251-b08 on Mac OS X 10.16
Intel(R) Core(TM) i9-9980HK CPU  2.40GHz
Read with length 80, hasSpaces: true:     Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
------------------------------------------------------------------------------------------------------------------------
read string with length 80                         1748           1778          28         57.2          17.5       1.0X
read char with length 80                           1673           1678           9         59.8          16.7       1.0X
read varchar with length 80                        1667           1684          27         60.0          16.7       1.0X

Java HotSpot(TM) 64-Bit Server VM 1.8.0_251-b08 on Mac OS X 10.16
Intel(R) Core(TM) i9-9980HK CPU  2.40GHz
Read with length 100, hasSpaces: true:    Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
------------------------------------------------------------------------------------------------------------------------
read string with length 100                        1709           1743          48         58.5          17.1       1.0X
read char with length 100                          1610           1664          67         62.1          16.1       1.1X
read varchar with length 100                       1614           1673          53         61.9          16.1       1.1X

================================================================================================
Char Varchar Write Side Perf
================================================================================================

Java HotSpot(TM) 64-Bit Server VM 1.8.0_251-b08 on Mac OS X 10.16
Intel(R) Core(TM) i9-9980HK CPU  2.40GHz
Write with length 20, hasSpaces: false:   Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
------------------------------------------------------------------------------------------------------------------------
write string with length 20                        2277           2327          67          4.4         227.7       1.0X
write char with length 20                          2421           2443          19          4.1         242.1       0.9X
write varchar with length 20                       2393           2419          27          4.2         239.3       1.0X

Java HotSpot(TM) 64-Bit Server VM 1.8.0_251-b08 on Mac OS X 10.16
Intel(R) Core(TM) i9-9980HK CPU  2.40GHz
Write with length 40, hasSpaces: false:   Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
------------------------------------------------------------------------------------------------------------------------
write string with length 40                        2249           2290          38          4.4         224.9       1.0X
write char with length 40                          2386           2444          57          4.2         238.6       0.9X
write varchar with length 40                       2397           2405          12          4.2         239.7       0.9X

Java HotSpot(TM) 64-Bit Server VM 1.8.0_251-b08 on Mac OS X 10.16
Intel(R) Core(TM) i9-9980HK CPU  2.40GHz
Write with length 60, hasSpaces: false:   Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
------------------------------------------------------------------------------------------------------------------------
write string with length 60                        2326           2367          41          4.3         232.6       1.0X
write char with length 60                          2478           2501          37          4.0         247.8       0.9X
write varchar with length 60                       2475           2503          24          4.0         247.5       0.9X

Java HotSpot(TM) 64-Bit Server VM 1.8.0_251-b08 on Mac OS X 10.16
Intel(R) Core(TM) i9-9980HK CPU  2.40GHz
Write with length 80, hasSpaces: true:    Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
------------------------------------------------------------------------------------------------------------------------
write string with length 80                        9367           9773         354          1.1         936.7       1.0X
write char with length 80                         10454          10621         238          1.0        1045.4       0.9X
write varchar with length 80                      18943          19503         571          0.5        1894.3       0.5X

Java HotSpot(TM) 64-Bit Server VM 1.8.0_251-b08 on Mac OS X 10.16
Intel(R) Core(TM) i9-9980HK CPU  2.40GHz
Write with length 100, hasSpaces: true:   Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
------------------------------------------------------------------------------------------------------------------------
write string with length 100                      11055          11104          59          0.9        1105.5       1.0X
write char with length 100                        12204          12275          63          0.8        1220.4       0.9X
write varchar with length 100                     21737          22275         574          0.5        2173.7       0.5X

```

Closes #31199 from yaooqinn/SPARK-34130.

Authored-by: Kent Yao <yao@apache.org>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2021-01-19 09:03:06 +00:00
Yuming Wang 030639f456 [SPARK-34119][SQL] Keep necessary stats after partition pruning
### What changes were proposed in this pull request?

This PR keeps the necessary stats after partition pruning.

### Why are the changes needed?

Improve query performance. Since SPARK-34081, the aggregate can be pushed down because the join can be planned as a BroadcastHashJoin. But the plan lacks column statistics after [`PruneFileSourcePartitions`](d0c83f372b/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/PruneFileSourcePartitions.scala (L102-L103)). Therefore, it will eventually be planned as a SortMergeJoin.

Please see the log:
```
join.right.stats: org.apache.spark.sql.catalyst.optimizer.PushDownPredicates: Statistics(sizeInBytes=348.8 KiB, rowCount=1.79E+4)
join.right.stats: org.apache.spark.sql.execution.datasources.PruneFileSourcePartitions: Statistics(sizeInBytes=1414.2 EiB)
```

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Unit test and benchmark test

SQL | Before this PR(Seconds) | After this PR(Seconds)
-- | -- | --
q14a | 594 | 384
q14b | 600 | 402

This change will not affect the results of `PlanStabilitySuite`, because it does not have a partition column.

Closes #31205 from wangyum/SPARK-34119.

Authored-by: Yuming Wang <yumwang@ebay.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2021-01-19 06:09:16 +00:00
Max Gekk a98e77c113 [SPARK-34143][SQL][TESTS] Fix adding partitions to fully partitioned v2 tables
### What changes were proposed in this pull request?
While adding a new partition to the v2 `InMemoryAtomicPartitionTable`/`InMemoryPartitionTable`, add a single row to the table content when the table is fully partitioned.

### Why are the changes needed?
The `ALTER TABLE .. ADD PARTITION` command does not change the content of a fully partitioned v2 table. For instance, `INSERT INTO` changes the table content:
```scala
      sql(s"CREATE TABLE t (p0 INT, p1 STRING) USING _ PARTITIONED BY (p0, p1)")
      sql(s"INSERT INTO t SELECT 1, 'def'")
      sql(s"SELECT * FROM t").show(false)

+---+---+
|p0 |p1 |
+---+---+
|1  |def|
+---+---+
```
but `ALTER TABLE .. ADD PARTITION` doesn't change v2 table content:
```scala
      sql(s"ALTER TABLE t ADD PARTITION (p0 = 0, p1 = 'abc')")
      sql(s"SELECT * FROM t").show(false)

+---+---+
|p0 |p1 |
+---+---+
+---+---+
```

### Does this PR introduce _any_ user-facing change?
No, the changes only impact tests, but for the example above, in tests:
```scala
      sql(s"ALTER TABLE t ADD PARTITION (p0 = 0, p1 = 'abc')")
      sql(s"SELECT * FROM t").show(false)

+---+---+
|p0 |p1 |
+---+---+
|0  |abc|
+---+---+
```

### How was this patch tested?
By running the unified tests for `ALTER TABLE .. ADD PARTITION`.

Closes #31216 from MaxGekk/add-partition-by-all-columns.

Authored-by: Max Gekk <max.gekk@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2021-01-19 05:40:15 +00:00
ulysses-you 055124a048 [SPARK-34150][SQL] Strip Null literal.sql in resolve alias
### What changes were proposed in this pull request?

Change null Literal to PrettyAttribute during ResolveAlias.

### Why are the changes needed?

We will convert `Literal(null)` to the target data type during analysis. Then the generated alias name will include something like `CAST(NULL AS String)` instead of `NULL`.
```
spark.sql("SELECT RAND(null)").columns

-- before
rand(CAST(NULL AS INT))

-- after
rand(NULL)
```

### Does this PR introduce _any_ user-facing change?

Yes, the default column name may be changed.

### How was this patch tested?

Added a test and passed existing tests.

Closes #31233 from ulysses-you/SPARK-34150.

Authored-by: ulysses-you <ulyssesyou18@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2021-01-19 03:35:08 +00:00
Max Gekk bea10a6274 [SPARK-34153][SQL] Remove unused getRawTable() from HiveExternalCatalog.alterPartitions()
### What changes were proposed in this pull request?
Remove unused call of `getRawTable()` from `HiveExternalCatalog.alterPartitions()`.

### Why are the changes needed?
It reduces the number of calls to Hive External catalog.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
By running the modified test suite:
```
$ build/sbt -Phive-2.3 -Phive-thriftserver "test:testOnly *AlterTableRenamePartitionSuite"
```

Closes #31234 from MaxGekk/remove-getRawTable-from-alterPartitions.

Authored-by: Max Gekk <max.gekk@gmail.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2021-01-19 11:42:33 +09:00