### What changes were proposed in this pull request?
Correct the resolution of the HAVING clause.
### Why are the changes needed?
Constructing the new aggregate for grouping sets loses the qualified names of grouping expressions. Here is an example:
```
-- Works resolved by `ResolveReferences`
select c1 from values (1) as t1(c1) group by grouping sets(t1.c1) having c1 = 1
-- Works because of the extra expression c1
select c1 as c2 from values (1) as t1(c1) group by grouping sets(t1.c1) having t1.c1 = 1
-- Failed
select c1 from values (1) as t1(c1) group by grouping sets(t1.c1) having t1.c1 = 1
```
It works for a plain `Aggregate` (without grouping sets) through `ResolveReferences`, but it does not work with grouping sets because the exprId has been changed.
### Does this PR introduce _any_ user-facing change?
Yes, bug fix.
### How was this patch tested?
add test.
Closes#30029 from ulysses-you/SPARK-33131.
Authored-by: ulysses <youxiduo@weidian.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request?
The current `Lead`/`Lag` extends `OffsetWindowFunction`. `OffsetWindowFunction` contains a `direction` field and uses it to calculate the `boundary`.
We can use a single literal expression to unify the two properties.
For example:
- 3 means `direction` is Asc and `boundary` is 3.
- -3 means `direction` is Desc and `boundary` is -3.
### Why are the changes needed?
Improve the current implementation of `Lead`/`Lag`.
### Does this PR introduce _any_ user-facing change?
'No'.
### How was this patch tested?
Jenkins test.
Closes#30023 from beliefer/SPARK-33126.
Lead-authored-by: gengjiaan <gengjiaan@360.cn>
Co-authored-by: beliefer <beliefer@163.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request?
This PR is a sub-task of [SPARK-33138](https://issues.apache.org/jira/browse/SPARK-33138). In order to make SQLConf.get reliable and stable, we need to make sure users can't pollute the SQLConf and SparkSession context by calling setActiveSession and clearActiveSession.
Changes in this PR:
* add a legacy config spark.sql.legacy.allowModifyActiveSession to fall back to the old behavior if users do need to call these two APIs (see the usage sketch after this list)
* by default, calling these two APIs throws an exception
* add two extra internal/private APIs, setActiveSessionInternal and clearActiveSessionInternal, for current internal usage
* change all internal references to the new internal APIs, except for SQLContext.setActive and SQLContext.clearActive
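A minimal usage sketch, assuming the legacy flag named above; the builder options other than the flag are illustrative:
```scala
import org.apache.spark.sql.SparkSession

// Sketch only: with the legacy flag unset (the default after this PR), calling
// setActiveSession/clearActiveSession directly throws an exception.
val spark = SparkSession.builder()
  .master("local[*]")  // illustrative
  .config("spark.sql.legacy.allowModifyActiveSession", "true")
  .getOrCreate()

SparkSession.setActiveSession(spark)  // allowed only because the legacy flag is enabled
SparkSession.clearActiveSession()
```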
### Why are the changes needed?
Make SQLConf.get reliable and stable.
### Does this PR introduce any user-facing change?
No.
### How was this patch tested?
* Add UT in SparkSessionBuilderSuite to test the legacy config
* Existing test
Closes#30042 from leanken/leanken-SPARK-33139.
Authored-by: xuewei.linxuewei <xuewei.linxuewei@alibaba-inc.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request?
The current approach in the build file for failing the build on compilation warnings (excluding deprecation warnings) is not portable past SBT 1.3.13 (the build import fails with a compilation error on SBT 1.4). It can be replaced with the more robust and maintainable built-in functionality available since Scala 2.13.2.
Additionally, warnings were fixed to make the build pass, with as few changes as possible:
- warnings in the 2.12 compilation were fixed in code,
- warnings in the 2.13 compilation are covered by configuration, to be addressed separately.
### Why are the changes needed?
Unblocks upgrade to SBT after 1.3.13.
Enhances build file maintainability.
Allows fine-tuning of the warnings configuration for the Scala 2.13 compilation.
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
`build/sbt`'s `compile` and `Test/compile` for both Scala 2.12 and 2.13 profiles.
Closes#29995 from gemelen/feature/warnings-reporter.
Authored-by: Denis Pyshev <git@gemelen.net>
Signed-off-by: Sean Owen <srowen@gmail.com>
### What changes were proposed in this pull request?
This proposes to add a config for json expression optimization.
### Why are the changes needed?
For the new JSON expression optimization rules, it is safer if we can disable them using a SQL config.
### Does this PR introduce _any_ user-facing change?
Yes, users can disable the JSON expression optimization rules.
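A hedged sketch of how the rules could be disabled at runtime; the config key here is an assumption based on this PR and should be verified against `SQLConf`:
```scala
// Assumed config key (not confirmed by this description): setting it to false
// turns off the JSON expression optimization rules for the current session.
spark.conf.set("spark.sql.optimizer.enableJsonExpressionOptimization", "false")
```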
### How was this patch tested?
Unit test
Closes#30047 from viirya/SPARK-33078.
Authored-by: Liang-Chi Hsieh <viirya@gmail.com>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
### What changes were proposed in this pull request?
This PR proposes to fix a bug where `DataType.equalsIgnoreCompatibleNullability` is called with mistakenly swapped parameters in `V2WriteCommand.outputResolved`. The parameters of `DataType.equalsIgnoreCompatibleNullability` are `from` and `to`, in that order, which means the right order of the matching variables is `inAttr` then `outAttr`.
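A hedged sketch of the corrected internal check; the real `outputResolved` contains additional conditions (name matching, resolution checks, etc.), and the point here is only the argument order:
```scala
import org.apache.spark.sql.catalyst.analysis.NamedRelation
import org.apache.spark.sql.catalyst.plans.logical.LogicalPlan
import org.apache.spark.sql.types.DataType

// Sketch only: `from` is the incoming attribute, `to` is the table attribute.
def outputResolvedSketch(query: LogicalPlan, table: NamedRelation): Boolean =
  query.output.size == table.output.size &&
    query.output.zip(table.output).forall { case (inAttr, outAttr) =>
      DataType.equalsIgnoreCompatibleNullability(inAttr.dataType, outAttr.dataType)
    }
```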
### Why are the changes needed?
Spark throws an AnalysisException due to an unresolved operator in a v2 write; the operator is unresolved because the parameters passed to `DataType.equalsIgnoreCompatibleNullability` in `outputResolved` were swapped.
### Does this PR introduce _any_ user-facing change?
Yes, end users no longer hit an unresolved-operator error in a v2 write when writing a DataFrame containing non-nullable complex types to a table whose matching complex types are nullable.
### How was this patch tested?
New UT added.
Closes#30033 from HeartSaVioR/SPARK-33136.
Authored-by: Jungtaek Lim (HeartSaVioR) <kabhwan.opensource@gmail.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
### What changes were proposed in this pull request?
In the PR, I propose to restrict the partial-result feature to root JSON objects only. The JSON datasource as well as `from_json()` will return `null` for malformed nested JSON objects.
### Why are the changes needed?
1. To not raise exceptions to users in the PERMISSIVE mode
2. To fix a regression and to have the same behavior as Spark 2.4.x has
3. The current implementation of partial results is supposed to work only for root (top-level) JSON objects, and it is not tested for bad nested complex JSON fields.
### Does this PR introduce _any_ user-facing change?
Yes. Before the changes, the code below:
```scala
val pokerhand_raw = Seq("""[{"cards": [19], "playerId": 123456}]""").toDF("events")
val event = new StructType().add("playerId", LongType).add("cards", ArrayType(new StructType().add("id", LongType).add("rank", StringType)))
val pokerhand_events = pokerhand_raw.select(from_json($"events", ArrayType(event)).as("event"))
pokerhand_events.show
```
throws the exception even in the default **PERMISSIVE** mode:
```java
java.lang.ClassCastException: java.lang.Long cannot be cast to org.apache.spark.sql.catalyst.util.ArrayData
at org.apache.spark.sql.catalyst.expressions.BaseGenericInternalRow.getArray(rows.scala:48)
at org.apache.spark.sql.catalyst.expressions.BaseGenericInternalRow.getArray$(rows.scala:48)
at org.apache.spark.sql.catalyst.expressions.GenericInternalRow.getArray(rows.scala:195)
```
After the changes:
```
+-----+
|event|
+-----+
| null|
+-----+
```
### How was this patch tested?
Added a test to `JsonFunctionsSuite`.
Closes#30031 from MaxGekk/json-skip-row-wrong-schema.
Authored-by: Max Gekk <max.gekk@gmail.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
### What changes were proposed in this pull request?
As [SPARK-13860](https://issues.apache.org/jira/browse/SPARK-13860) stated, TPCDS Query 39 returns wrong results using SparkSQL. The root cause is that when stddev_samp is applied to a single-element set, the TPCDS answer expects null, whereas SparkSQL returns Double.NaN, which causes the wrong result.
This PR adds an extra legacy config to fall back to the NaN behavior, and returns null by default to align with the TPCDS standard.
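A small example of the behavior difference, using the legacy config named in the migration-guide note below:
```scala
// stddev_samp over a single-element group: NULL with the new default, Double.NaN with the legacy flag.
spark.conf.set("spark.sql.legacy.statisticalAggregate", "false")  // default after this PR
spark.sql("SELECT stddev_samp(col) FROM VALUES (1.0) AS t(col)").show()  // NULL

spark.conf.set("spark.sql.legacy.statisticalAggregate", "true")   // restore the old behavior
spark.sql("SELECT stddev_samp(col) FROM VALUES (1.0) AS t(col)").show()  // NaN
```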
### Why are the changes needed?
SQL correctness issue.
### Does this PR introduce any user-facing change?
Yes, see the SQL migration guide:
In Spark 3.1, the statistical aggregation functions, including `std`, `stddev`, `stddev_samp`, `variance`, `var_samp`, `skewness`, `kurtosis`, `covar_samp`, and `corr`, return `NULL` instead of `Double.NaN` when a `DivideByZero` occurs during expression evaluation, for example when `stddev_samp` is applied to a single-element set. In Spark 3.0 and earlier, they return `Double.NaN` in such cases. To restore the behavior before Spark 3.1, set `spark.sql.legacy.statisticalAggregate` to `true`.
### How was this patch tested?
Updated DataFrameAggregateSuite/DataFrameWindowFunctionsSuite to test both default and legacy behavior.
Adjusted DataFrameWindowFunctionsSuite/SQLQueryTestSuite and some R test cases to the new default return-null behavior.
Closes#29983 from leanken/leanken-SPARK-13860.
Authored-by: xuewei.linxuewei <xuewei.linxuewei@alibaba-inc.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request?
Except for PostgreSQL, other databases (for example: Vertica, Oracle, Redshift, MySQL, Presto) do not allow specifying a window frame for the Lead and Lag functions.
But the current error message is not clear enough:
`Window Frame $f must match the required frame`
This PR uses the following error message instead:
`Cannot specify window frame for lead function`
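A hedged example of a query that hits this error (an explicit frame on `lead`); after this PR the analysis error reads as above:
```scala
// Specifying a window frame for lead()/lag() is rejected during analysis.
spark.sql("""
  SELECT lead(v, 1) OVER (ORDER BY v ROWS BETWEEN 1 PRECEDING AND CURRENT ROW)
  FROM VALUES (1), (2), (3) AS t(v)
""").show()
```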
### Why are the changes needed?
Make the error message clearer.
### Does this PR introduce _any_ user-facing change?
Yes, users will see a clearer error message.
### How was this patch tested?
Jenkins test.
Closes#30021 from beliefer/SPARK-33125.
Lead-authored-by: gengjiaan <gengjiaan@360.cn>
Co-authored-by: beliefer <beliefer@163.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request?
- Override the default SQL strings in the DB2 dialect (a hedged sketch follows this list) for:
* ALTER TABLE UPDATE COLUMN TYPE
* ALTER TABLE UPDATE COLUMN NULLABILITY
- Add a new Docker integration test suite jdbc/v2/DB2IntegrationSuite.scala
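A hedged sketch of what the DB2 overrides could look like; the method names follow the JDBC dialect API of the time and may not match the merged code exactly:
```scala
// Sketch only: DB2 uses "SET DATA TYPE" and "SET/DROP NOT NULL" for column changes.
override def getUpdateColumnTypeQuery(
    tableName: String, columnName: String, newDataType: String): String =
  s"ALTER TABLE $tableName ALTER COLUMN $columnName SET DATA TYPE $newDataType"

override def getUpdateColumnNullabilityQuery(
    tableName: String, columnName: String, isNullable: Boolean): String = {
  val nullability = if (isNullable) "DROP NOT NULL" else "SET NOT NULL"
  s"ALTER TABLE $tableName ALTER COLUMN $columnName $nullability"
}
```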
### Why are the changes needed?
In SPARK-24907, we implemented the JDBC v2 Table Catalog, but it doesn't support some ALTER TABLE operations at the moment. This PR adds support for DB2-specific ALTER TABLE.
### Does this PR introduce _any_ user-facing change?
Yes
### How was this patch tested?
By running the new integration test suite:
```
$ ./build/sbt -Pdocker-integration-tests "test-only *.DB2IntegrationSuite"
```
Closes#29972 from huaxingao/db2_docker.
Authored-by: Huaxin Gao <huaxing@us.ibm.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request?
In SPARK-24994 we implemented unwrapping cast for **integral types**. This extends it to support **numeric types** such as float/double/decimal, so that filters involving these types can be better pushed down to data sources.
Unlike the integral-type cases, conversions between numeric types can round up or down. Consider the following case:
```sql
cast(e as double) < 1.9
```
Assume the type of `e` is short. Since 1.9 is not representable in that type, the cast will either truncate or round. Now suppose the literal is truncated; we cannot convert the expression to:
```sql
e < cast(1.9 as short)
```
as the previous implementation did, since if `e` is 1, the original expression evaluates to true, but the converted expression evaluates to false.
To resolve the above, this PR first finds out whether casting from the wider type to the narrower type results in truncation or rounding, by comparing a _roundtrip value_, derived from **converting the literal first to the narrower type and then back to the wider type**, against the original literal value. For instance, in the case above, we first obtain a roundtrip value via the conversion (double) 1.9 -> (short) 1 -> (double) 1.0, and then compare it against 1.9.
<img width="1153" alt="Screen Shot 2020-09-28 at 3 30 27 PM" src="https://user-images.githubusercontent.com/506679/94492719-bd29e780-019f-11eb-9111-71d6e3d157f7.png">
Now in the case of truncate, we'd convert the original expression to:
```sql
e <= cast(1.9 as short)
```
instead, so that the conversion also is valid when `e` is 1.
For more details, please check [this blog post](https://prestosql.io/blog/2019/05/21/optimizing-the-casts-away.html) by Presto which offers a very good explanation on how it works.
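A minimal sketch of the roundtrip comparison in plain Scala (not the actual rule implementation; the rewritten predicates are shown as strings for illustration):
```scala
// Decide whether the down-cast truncated or rounded the literal, then pick the comparison operator.
val lit: Double = 1.9
val narrowed: Short = lit.toShort           // (double) 1.9 -> (short) 1
val roundtrip: Double = narrowed.toDouble   // (short) 1   -> (double) 1.0

val rewritten =
  if (roundtrip < lit) s"e <= CAST($lit AS SHORT)"      // truncated down: relax < to <=
  else if (roundtrip > lit) s"e < CAST($lit AS SHORT)"  // rounded up: keep <
  else s"e < CAST($lit AS SHORT)"                       // exact representation: cast is lossless
```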
### Why are the changes needed?
For queries such as:
```sql
SELECT * FROM tbl WHERE short_col < 100.5
```
The predicate `short_col < 100.5` can't be pushed down to data sources because it involves casts. This eliminates the cast so these queries can run more efficiently.
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
Unit tests
Closes#29792 from sunchao/SPARK-32858.
Lead-authored-by: Chao Sun <sunchao@apple.com>
Co-authored-by: Chao Sun <sunchao@apache.org>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request?
Add an `And(IsNotNull(e), GreaterThan(Size(e), Literal(0)))` filter before Explode, PosExplode and Inline, when `outer = false` (an illustration follows).
Removed the unused `InferFiltersFromConstraints` from `operatorOptimizationRuleSet` to avoid confusion that came up during the review process.
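A hedged illustration of the effect using the DataFrame API; the table and column names are made up:
```scala
import org.apache.spark.sql.functions.{col, explode, size}

// A query like this (explode with outer = false) ...
val exploded = spark.table("events").select(explode(col("items")).as("item"))

// ... behaves, after the new rule, like the query below, where the inferred filter
// can be pushed through joins and down to the data source.
val equivalent = spark.table("events")
  .where(col("items").isNotNull && size(col("items")) > 0)
  .select(explode(col("items")).as("item"))
```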
### Why are the changes needed?
Predicate pushdown will be able to move this new filter down through joins and into data sources for performance improvement.
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
Unit test
Closes#29092 from tanelk/SPARK-32295.
Lead-authored-by: tanel.kiis@gmail.com <tanel.kiis@gmail.com>
Co-authored-by: Tanel Kiis <tanel.kiis@reach-u.com>
Signed-off-by: Takeshi Yamamuro <yamamuro@apache.org>
### What changes were proposed in this pull request?
This PR intends to correct version values (`3.0.0` -> `3.1.0`) of three configs below in `SQLConf`:
- spark.sql.planChangeLog.level
- spark.sql.planChangeLog.rules
- spark.sql.planChangeLog.batches
This PR comes from https://github.com/apache/spark/pull/29544#discussion_r503049350.
### Why are the changes needed?
Bugfix.
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
N/A
Closes#30015 from maropu/pr29544-FOLLOWUP.
Authored-by: Takeshi Yamamuro <yamamuro@apache.org>
Signed-off-by: Takeshi Yamamuro <yamamuro@apache.org>
### What changes were proposed in this pull request?
At the moment there is no way to turn off JDBC authentication providers that exist on the classpath. This can be problematic because service providers are loaded with the service loader. In this PR I've added the `spark.sql.sources.disabledJdbcConnProviderList` configuration (default: empty).
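A hedged usage sketch with the new configuration; the provider names listed are illustrative:
```scala
import org.apache.spark.sql.SparkSession

// Disable specific JDBC connection providers by name (the names here are illustrative).
val spark = SparkSession.builder()
  .config("spark.sql.sources.disabledJdbcConnProviderList", "db2,mssql")
  .getOrCreate()
```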
### Why are the changes needed?
There is currently no way to turn off JDBC authentication providers.
### Does this PR introduce _any_ user-facing change?
Yes, it introduces a new configuration option.
### How was this patch tested?
* Existing + newly added unit tests.
* Existing integration tests.
Closes#29964 from gaborgsomogyi/SPARK-32047.
Authored-by: Gabor Somogyi <gabor.g.somogyi@gmail.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
### What changes were proposed in this pull request?
This PR proposes to add `DataStreamWriter.table` to specify the output "table" to write from the streaming query.
### Why are the changes needed?
For now, there's no way to write to a table (especially a catalog table) even if the table is capable of handling streaming writes, so even with Spark 3, writing to a catalog table via Structured Streaming has to go through `DataStreamWriter.format(provider)` and hope the provider handles it as a catalog table would.
With the new API, we can directly point to a catalog table which supports streaming writes. Some usages are covered with tests; simply put, end users can do the following:
```scala
// assuming `testcat` is a custom catalog, and `ns` is a namespace in the catalog
spark.sql("CREATE TABLE testcat.ns.table1 (id bigint, data string) USING foo")
val query = inputDF
.writeStream
.table("testcat.ns.table1")
.option(...)
.start()
```
### Does this PR introduce _any_ user-facing change?
Yes, as this adds a new public API in DataStreamWriter. This doesn't bring backward incompatible change.
### How was this patch tested?
New unit tests.
Closes#29767 from HeartSaVioR/SPARK-32896.
Authored-by: Jungtaek Lim (HeartSaVioR) <kabhwan.opensource@gmail.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
### What changes were proposed in this pull request?
Add distinct info at `UnresolvedFunction.toString`.
### Why are the changes needed?
Make `UnresolvedFunction` info complete.
```
create table test (c1 int, c2 int);
explain extended select sum(distinct c1) from test;
-- before this pr
== Parsed Logical Plan ==
'Project [unresolvedalias('sum('c1), None)]
+- 'UnresolvedRelation [test]
-- after this pr
== Parsed Logical Plan ==
'Project [unresolvedalias('sum(distinct 'c1), None)]
+- 'UnresolvedRelation [test]
```
### Does this PR introduce _any_ user-facing change?
Yes, get distinct info during sql parse.
### How was this patch tested?
manual test.
Closes#29586 from ulysses-you/SPARK-32743.
Authored-by: ulysses <youxiduo@weidian.com>
Signed-off-by: Takeshi Yamamuro <yamamuro@apache.org>
### What changes were proposed in this pull request?
1. Add a new method to the `JdbcDialect` class - `classifyException()`. It converts dialect-specific exceptions to Spark's `AnalysisException` or its sub-classes (a hedged sketch follows this list).
2. Replace the H2 exception `org.h2.jdbc.JdbcSQLException` in `JDBCTableCatalogSuite` with `AnalysisException`.
3. Add `H2Dialect`
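A hedged sketch of what an in-tree dialect's method body could look like; the signature follows the description above, and the SQL-state handling is purely illustrative:
```scala
// Sketch only: map a vendor-specific SQLException onto Spark's AnalysisException.
override def classifyException(message: String, e: Throwable): AnalysisException = e match {
  case s: java.sql.SQLException if s.getSQLState == "42S01" =>  // illustrative "table exists" state
    new AnalysisException(s"$message: ${s.getMessage}")
  case _ =>
    new AnalysisException(message)
}
```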
### Why are the changes needed?
Currently the JDBC v2 Table Catalog implementation throws dialect-specific exceptions and ignores the exceptions defined in the `TableCatalog` interface. This PR adds a new method for converting dialect-specific exceptions, and assumes that follow-up PRs will implement `classifyException()` for the individual dialects.
### Does this PR introduce _any_ user-facing change?
Yes.
### How was this patch tested?
By running existing test suites `JDBCTableCatalogSuite` and `JDBCV2Suite`.
Closes#29952 from MaxGekk/jdbcv2-classify-exception.
Authored-by: Max Gekk <max.gekk@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request?
Adds a SQL function `raise_error` which underlies the refactored `assert_true` function. `assert_true` now also (optionally) accepts a custom error message field.
`raise_error` is exposed in SQL, Python, Scala, and R.
`assert_true` was previously only exposed in SQL; it is now also exposed in Python, Scala, and R.
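Hedged usage examples through the SQL API:
```scala
// raise_error always fails the query with the given message (when the result is evaluated).
spark.sql("SELECT raise_error('something went wrong')").show()

// assert_true returns NULL when the condition holds and fails otherwise;
// the optional second argument is the new custom error message.
spark.sql("SELECT assert_true(1 < 2)").show()
spark.sql("SELECT assert_true(1 > 2, 'expected the first operand to be greater')").show()
```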
### Why are the changes needed?
Improves usability of `assert_true` by clarifying error messaging, and adds the useful helper function `raise_error`.
### Does this PR introduce _any_ user-facing change?
Yes:
- Adds `raise_error` function to the SQL, Python, Scala, and R APIs.
- Adds `assert_true` function to the SQL, Python and R APIs.
### How was this patch tested?
Adds unit tests in SQL, Python, Scala, and R for `assert_true` and `raise_error`.
Closes#29947 from karenfeng/spark-32793.
Lead-authored-by: Karen Feng <karen.feng@databricks.com>
Co-authored-by: Hyukjin Kwon <gurwls223@gmail.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
### What changes were proposed in this pull request?
This PR intends to refactor code in `RewriteCorrelatedScalarSubquery` for replacing `ExprId`s in a bottom-up manner instead of doing in a top-down one.
This PR comes from the talk with cloud-fan in https://github.com/apache/spark/pull/29585#discussion_r490371252.
### Why are the changes needed?
To improve code.
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
Existing tests.
Closes#29913 from maropu/RefactorRewriteCorrelatedScalarSubquery.
Authored-by: Takeshi Yamamuro <yamamuro@apache.org>
Signed-off-by: Takeshi Yamamuro <yamamuro@apache.org>
### What changes were proposed in this pull request?
This PR proposes to migrate `DESCRIBE tbl colname` to use `UnresolvedTableOrView` to resolve the table/view identifier. This allows consistent resolution rules (temp view first, etc.) to be applied for both v1/v2 commands. More info about the consistent resolution rule proposal can be found in [JIRA](https://issues.apache.org/jira/browse/SPARK-29900) or [proposal doc](https://docs.google.com/document/d/1hvLjGA8y_W_hhilpngXVub1Ebv8RsMap986nENCFnrg/edit?usp=sharing).
### Why are the changes needed?
The current behavior is not consistent between v1 and v2 commands when resolving a temp view.
In v2, the `t` in the following example is resolved to a table:
```scala
sql("CREATE TABLE testcat.ns.t (id bigint) USING foo")
sql("CREATE TEMPORARY VIEW t AS SELECT 2 as i")
sql("USE testcat.ns")
sql("DESCRIBE t i") // 't' is resolved to testcat.ns.t
Describing columns is not supported for v2 tables.;
org.apache.spark.sql.AnalysisException: Describing columns is not supported for v2 tables.;
```
whereas in v1, the `t` is resolved to a temp view:
```scala
sql("CREATE DATABASE test")
sql("CREATE TABLE spark_catalog.test.t (id bigint) USING csv")
sql("CREATE TEMPORARY VIEW t AS SELECT 2 as i")
sql("USE spark_catalog.test")
sql("DESCRIBE t i").show // 't' is resolved to a temp view
+---------+----------+
|info_name|info_value|
+---------+----------+
| col_name| i|
|data_type| int|
| comment| NULL|
+---------+----------+
```
### Does this PR introduce _any_ user-facing change?
After this PR, `DESCRIBE t i` is resolved to a temp view `t` instead of `testcat.ns.t`.
### How was this patch tested?
Added a new test
Closes#29880 from imback82/describe_column_consistent.
Authored-by: Terry Kim <yuminkim@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request?
This proposes to simplify named_struct + get struct field + from_json expression chain from `struct(from_json.col1, from_json.col2, from_json.col3...)` to `struct(from_json)`.
### Why are the changes needed?
Simplify complex expression tree that could be produced by query optimization or user.
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
Unit test.
Closes#29942 from viirya/SPARK-33007.
Authored-by: Liang-Chi Hsieh <viirya@gmail.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
### What changes were proposed in this pull request?
This PR fixes the broken build for Scala 2.13 with Maven.
https://github.com/apache/spark/pull/29913/checks?check_run_id=1187826966#29795 was merged though it doesn't successfully finish the build for Scala 2.13
### Why are the changes needed?
To fix the build.
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
`build/mvn -Pscala-2.13 -Phive -Phive-thriftserver -DskipTests package`
Closes#29954 from sarutak/hotfix-seq.
Authored-by: Kousuke Saruta <sarutak@oss.nttdata.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
### What changes were proposed in this pull request?
When we create a UDAF using a class that extends `UserDefinedAggregateFunction` and then call the function in Hive support mode, `HiveSessionCatalog` calls super.makeFunctionExpression,
but if that call fails (for example, the function needs 2 parameters and we only give 1), the thrown exception only shows
```
No handler for UDF/UDAF/UDTF xxxxxxxx
```
This is confusing for developers; we should show the error thrown by the super method too.
For this PR's UT:
Before the change, the thrown exception looks like
```
No handler for UDF/UDAF/UDTF 'org.apache.spark.sql.hive.execution.LongProductSum'; line 1 pos 7
```
After this PR, the thrown exception looks like
```
Spark UDAF Error: Invalid number of arguments for function longProductSum. Expected: 2; Found: 1;
Hive UDF/UDAF/UDTF Error: No handler for UDF/UDAF/UDTF 'org.apache.spark.sql.hive.execution.LongProductSum'; line 1 pos 7
```
### Why are the changes needed?
Show a more detailed error message when defining a UDAF.
### Does this PR introduce _any_ user-facing change?
Users will see a more detailed error message when using Spark SQL's UDAF in Hive support mode.
### How was this patch tested?
Added UT
Closes#29054 from AngersZhuuuu/SPARK-32243.
Authored-by: angerszhu <angers.zhu@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request?
1. Refactored `WithFields` Expression to make it more extensible (now `UpdateFields`).
2. Added a new `dropFields` method to the `Column` class. This method should allow users to drop a `StructField` in a `StructType` column (with similar semantics to the `drop` method on `Dataset`).
### Why are the changes needed?
Often Spark users have to work with deeply nested data e.g. to fix a data quality issue with an existing `StructField`. To do this with the existing Spark APIs, users have to rebuild the entire struct column.
For example, let's say you have the following deeply nested data structure which has a data quality issue (`5` is missing):
```
import org.apache.spark.sql._
import org.apache.spark.sql.functions._
import org.apache.spark.sql.types._
val data = spark.createDataFrame(sc.parallelize(
Seq(Row(Row(Row(1, 2, 3), Row(Row(4, null, 6), Row(7, 8, 9), Row(10, 11, 12)), Row(13, 14, 15))))),
StructType(Seq(
StructField("a", StructType(Seq(
StructField("a", StructType(Seq(
StructField("a", IntegerType),
StructField("b", IntegerType),
StructField("c", IntegerType)))),
StructField("b", StructType(Seq(
StructField("a", StructType(Seq(
StructField("a", IntegerType),
StructField("b", IntegerType),
StructField("c", IntegerType)))),
StructField("b", StructType(Seq(
StructField("a", IntegerType),
StructField("b", IntegerType),
StructField("c", IntegerType)))),
StructField("c", StructType(Seq(
StructField("a", IntegerType),
StructField("b", IntegerType),
StructField("c", IntegerType))))
))),
StructField("c", StructType(Seq(
StructField("a", IntegerType),
StructField("b", IntegerType),
StructField("c", IntegerType))))
)))))).cache
data.show(false)
+---------------------------------+
|a |
+---------------------------------+
|[[1, 2, 3], [[4,, 6], [7, 8, 9]]]|
+---------------------------------+
```
Currently, to drop the missing value users would have to do something like this:
```
val result = data.withColumn("a",
struct(
$"a.a",
struct(
struct(
$"a.b.a.a",
$"a.b.a.c"
).as("a"),
$"a.b.b",
$"a.b.c"
).as("b"),
$"a.c"
))
result.show(false)
+---------------------------------------------------------------+
|a |
+---------------------------------------------------------------+
|[[1, 2, 3], [[4, 6], [7, 8, 9], [10, 11, 12]], [13, 14, 15]]|
+---------------------------------------------------------------+
```
As you can see above, with the existing methods users must call the `struct` function and list all fields, including fields they don't want to change. This is not ideal as:
>this leads to complex, fragile code that cannot survive schema evolution.
[SPARK-16483](https://issues.apache.org/jira/browse/SPARK-16483)
In contrast, with the method added in this PR, a user could simply do something like this to get the same result:
```
val result = data.withColumn("a", 'a.dropFields("b.a.b"))
result.show(false)
+---------------------------------------------------------------+
|a |
+---------------------------------------------------------------+
|[[1, 2, 3], [[4, 6], [7, 8, 9], [10, 11, 12]], [13, 14, 15]]|
+---------------------------------------------------------------+
```
This is the second of maybe 3 methods that could be added to the `Column` class to make it easier to manipulate nested data.
Other methods under discussion in [SPARK-22231](https://issues.apache.org/jira/browse/SPARK-22231) include `withFieldRenamed`.
However, this should be added in a separate PR.
### Does this PR introduce _any_ user-facing change?
The documentation for the `Column.withField` method has changed to include an additional note about how to write optimized queries when adding multiple nested columns directly.
### How was this patch tested?
New unit tests were added. Jenkins must pass them.
### Related JIRAs:
More discussion on this topic can be found here:
- https://issues.apache.org/jira/browse/SPARK-22231
- https://issues.apache.org/jira/browse/SPARK-16483
Closes#29795 from fqaiser94/SPARK-32511-dropFields-second-try.
Authored-by: fqaiser94@gmail.com <fqaiser94@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request?
This PR intends to fix corner-case bugs in the `QueryPlan#transformUpWithNewOutput` that is used to propagate updated `ExprId`s in a bottom-up way. Let's say we have a rule to simply assign new `ExprId`s in a projection list like this;
```
case class TestRule() extends Rule[LogicalPlan] {
  override def apply(plan: LogicalPlan): LogicalPlan = plan.transformUpWithNewOutput {
    case p @ Project(projList, _) =>
      val newPlan = p.copy(projectList = projList.map { _.transform {
        // Assigns a new `ExprId` for references
        case a: AttributeReference => Alias(a, a.name)()
      }}.asInstanceOf[Seq[NamedExpression]])
      val attrMapping = p.output.zip(newPlan.output)
      newPlan -> attrMapping
  }
}
```
Then, this rule is applied into a plan below;
```
(3) Project [a#5, b#6]
+- (2) Project [a#5, b#6]
+- (1) Project [a#5, b#6]
+- LocalRelation <empty>, [a#5, b#6]
```
In the first transformation, the rule assigns new `ExprId`s in `(1) Project` (e.g., a#5 AS a#7, b#6 AS b#8). In the second transformation, the rule corrects the input references of `(2) Project` first by using attribute mapping given from `(1) Project` (a#5->a#7 and b#6->b#8) and then assigns new `ExprId`s (e.g., a#7 AS a#9, b#8 AS b#10). But, in the third transformation, the rule fails because it tries to correct the references of `(3) Project` by using incorrect attribute mapping (a#7->a#9 and b#8->b#10) even though the correct one is a#5->a#9 and b#6->b#10. To fix this issue, this PR modified the code to update the attribute mapping entries that are obsoleted by generated entries in a given rule.
### Why are the changes needed?
bugfix.
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
Added tests in `QueryPlanSuite`.
Closes#29911 from maropu/QueryPlanBug.
Authored-by: Takeshi Yamamuro <yamamuro@apache.org>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request?
Add a test case to ensure changes to `spark.sql.optimizer.maxIterations` take effect at runtime.
### Why are the changes needed?
Currently, there is only one related test case: https://github.com/apache/spark/blob/master/sql/core/src/test/scala/org/apache/spark/sql/internal/SQLConfSuite.scala#L156
However, this test case only checks the value of the conf can be changed at runtime. It does not check the updated value is actually used by the Optimizer.
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
unit test
Closes#29919 from yuningzh-db/add_optimizer_test.
Authored-by: Yuning Zhang <yuning.zhang@databricks.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
### What changes were proposed in this pull request?
This patch proposes to do column pruning for `JsonToStructs` expression if we only require some fields from it.
### Why are the changes needed?
`JsonToStructs` takes a schema parameter that tells `JacksonParser` which fields need to be parsed. If `JsonToStructs` is followed by `GetStructField`, we can prune the schema to only parse certain fields (an illustration follows).
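A hedged illustration of the pattern that benefits; the schema and field names are made up:
```scala
import org.apache.spark.sql.functions.{col, from_json}
import org.apache.spark.sql.types.{BooleanType, IntegerType, StringType, StructType}
import spark.implicits._

val schema = new StructType()
  .add("a", IntegerType).add("b", StringType).add("c", BooleanType)
val df = Seq("""{"a": 1, "b": "x", "c": true}""").toDF("json")

// Only field `a` is accessed, so the parsing schema can be pruned to just `a`,
// letting JacksonParser skip the other fields.
df.select(from_json(col("json"), schema).getField("a").as("a"))
```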
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
Unit test
Closes#29900 from viirya/SPARK-32958.
Authored-by: Liang-Chi Hsieh <viirya@gmail.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
### What changes were proposed in this pull request?
(This is a followup PR of #29585) The PR modified `RuleExecutor#isPlanIntegral` code for checking if a plan has globally-unique attribute IDs, but this check made Jenkins maven test jobs much longer (See [the Dongjoon comment](https://github.com/apache/spark/pull/29585#issuecomment-702461314) and thanks, dongjoon-hyun !). To recover running time for the Jenkins tests, this PR intends to update the code to run plan integrity check only for effective plans.
### Why are the changes needed?
To recover running time for Jenkins tests.
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
Existing tests.
Closes#29928 from maropu/PR29585-FOLLOWUP.
Authored-by: Takeshi Yamamuro <yamamuro@apache.org>
Signed-off-by: Takeshi Yamamuro <yamamuro@apache.org>
### What changes were proposed in this pull request?
This PR adds support to decide bucketed table scan dynamically based on the actual query plan. Currently bucketing is enabled by default (`spark.sql.sources.bucketing.enabled`=true), so for all bucketed tables in the query plan, we will use bucketed table scan (all input files for a bucket are read by the same task). This has the drawback that if the bucketed table scan is not beneficial at all (no join/group-by/etc. in the query), we don't need to use it, as it restricts the # of tasks to the # of buckets and might hurt parallelism.
The feature is to add a physical plan rule right after `EnsureRequirements`:
The rule goes through plan nodes. For all operators which have an "interesting partition" (i.e., require `ClusteredDistribution` or `HashClusteredDistribution`), check if the sub-plan for the operator has an `Exchange` and a bucketed table scan (and only allow certain operators in the plan (i.e. `Scan/Filter/Project/Sort/PartialAgg/etc`.), see details in `DisableUnnecessaryBucketedScan.disableBucketWithInterestingPartition`). If yes, disable the bucketed table scan in the sub-plan. In addition, disable the bucketed table scan if there's no operator with an interesting partition along the sub-plan.
The reason the algorithm works is that if there's a shuffle between the bucketed table scan and the operator with an interesting partition, the bucketed table scan's partitioning will be destroyed by the shuffle operator in the middle, and we don't need the bucketed table scan for sure.
The idea of "interesting partition" is inspired from "interesting order" in "Access Path Selection in a Relational Database Management System"(http://www.inf.ed.ac.uk/teaching/courses/adbs/AccessPath.pdf), after discussion with cloud-fan .
### Why are the changes needed?
To avoid unnecessary bucketed scan in the query, and this is prerequisite for https://github.com/apache/spark/pull/29625 (decide bucketed sorted scan dynamically will be added later in that PR).
### Does this PR introduce _any_ user-facing change?
A new config `spark.sql.sources.bucketing.autoBucketedScan.enabled` is introduced, set to false by default (the rule is disabled by default as it can regress cached bucketed table queries, see discussion in https://github.com/apache/spark/pull/29804#issuecomment-701151447). Users can opt in or out by enabling or disabling the config. As we found in production, some users rely on the assumption of # of tasks == # of buckets when reading a bucketed table to precisely control the # of tasks. This is a bad assumption but it does happen on our side, so we leave a config here to allow them to opt out of the feature.
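A small example of toggling the new config (the key is taken from this description):
```scala
// Opt in: let Spark disable bucketed scans when they provide no benefit for the plan.
spark.conf.set("spark.sql.sources.bucketing.autoBucketedScan.enabled", "true")

// Opt out (the default per this PR), preserving the "# of tasks == # of buckets" behavior.
spark.conf.set("spark.sql.sources.bucketing.autoBucketedScan.enabled", "false")
```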
### How was this patch tested?
Added unit tests in `DisableUnnecessaryBucketedScanSuite.scala`
Closes#29804 from c21/bucket-rule.
Authored-by: Cheng Su <chengsu@fb.com>
Signed-off-by: Takeshi Yamamuro <yamamuro@apache.org>
### What changes were proposed in this pull request?
Add code in `ScalaReflection` to support Scala enumerations, mapping the enumeration type to string type in Spark.
### Why are the changes needed?
We support Java enums but fail with Scala enumerations; it's better to keep the behavior consistent.
Here is an example.
```
package test
object TestEnum extends Enumeration {
type TestEnum = Value
val E1, E2, E3 = Value
}
import TestEnum._
case class TestClass(i: Int, e: TestEnum) {
}
import test._
Seq(TestClass(1, TestEnum.E1)).toDS
```
Before this PR
```
Exception in thread "main" java.lang.UnsupportedOperationException: No Encoder found for test.TestEnum.TestEnum
- field (class: "scala.Enumeration.Value", name: "e")
- root class: "test.TestClass"
at org.apache.spark.sql.catalyst.ScalaReflection$.$anonfun$serializerFor$1(ScalaReflection.scala:567)
at scala.reflect.internal.tpe.TypeConstraints$UndoLog.undo(TypeConstraints.scala:69)
at org.apache.spark.sql.catalyst.ScalaReflection.cleanUpReflectionObjects(ScalaReflection.scala:882)
at org.apache.spark.sql.catalyst.ScalaReflection.cleanUpReflectionObjects$(ScalaReflection.scala:881)
```
After this PR
`org.apache.spark.sql.Dataset[test.TestClass] = [i: int, e: string]`
### Does this PR introduce _any_ user-facing change?
Yes, users can create a Dataset from a case class that includes a Scala enumeration field.
### How was this patch tested?
Add test.
Closes#29403 from ulysses-you/SPARK-32585.
Authored-by: ulysses <youxiduo@weidian.com>
Signed-off-by: Tathagata Das <tathagata.das1565@gmail.com>
### What changes were proposed in this pull request?
After `SPARK-32851` set `CODEGEN_FACTORY_MODE` to `CODEGEN_ONLY` in the `sparkConf` used by `SharedSparkSessionBase` to construct the `SparkSession` in tests, the test `SPARK-32459: UDF should not fail on WrappedArray` in sql.UDFSuite exposed a codegen fallback issue in Scala 2.13 as follows:
```
- SPARK-32459: UDF should not fail on WrappedArray *** FAILED ***
Caused by: org.codehaus.commons.compiler.CompileException: File 'generated.java', Line 47, Column 99: failed to compile: org.codehaus.commons.compiler.CompileException: File 'generated.java', Line 47, Column 99: No applicable constructor/method found for zero actual parameters; candidates are: "public scala.collection.mutable.Builder scala.collection.mutable.ArraySeq$.newBuilder(java.lang.Object)", "public scala.collection.mutable.Builder scala.collection.mutable.ArraySeq$.newBuilder(scala.reflect.ClassTag)", "public abstract scala.collection.mutable.Builder scala.collection.EvidenceIterableFactory.newBuilder(java.lang.Object)"
```
The root cause is that `WrappedArray` is an alias of `mutable.ArraySeq` in Scala 2.13, which has a different signature for the `newBuilder` method.
The main change of this PR is to add a Scala 2.13-only code path to handle the `WrappedArray` case in Scala 2.13.
### Why are the changes needed?
We need to support a Scala 2.13 build
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
- Scala 2.12: Pass the Jenkins or GitHub Action
- Scala 2.13: all tests passed. Run the following:
```
dev/change-scala-version.sh 2.13
mvn clean install -DskipTests -pl sql/core -Pscala-2.13 -am
mvn test -pl sql/core -Pscala-2.13
```
**Before**
```
Tests: succeeded 8540, failed 1, canceled 1, ignored 52, pending 0
*** 1 TEST FAILED ***
```
**After**
```
Tests: succeeded 8541, failed 0, canceled 1, ignored 52, pending 0
All tests passed.
```
Closes#29903 from LuciferYang/fix-udfsuite.
Authored-by: yangjie01 <yangjie01@baidu.com>
Signed-off-by: Sean Owen <srowen@gmail.com>
### What changes were proposed in this pull request?
Some plan transformations (e.g., `RemoveNoopOperators`) implicitly assume the same `ExprId` refers to the unique attribute. But, `RuleExecutor` does not check this integrity between logical plan transformations. So, this PR intends to add this check in `isPlanIntegral` of `Analyzer`/`Optimizer`.
This PR comes from the talk with cloud-fan viirya in https://github.com/apache/spark/pull/29485#discussion_r475346278
### Why are the changes needed?
For better logical plan integrity checking.
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
Existing tests.
Closes#29585 from maropu/PlanIntegrityTest.
Authored-by: Takeshi Yamamuro <yamamuro@apache.org>
Signed-off-by: Takeshi Yamamuro <yamamuro@apache.org>
### What changes were proposed in this pull request?
This PR fixes a statistics estimation issue when a child has 0 bytes.
### Why are the changes needed?
The `sizeInBytes` can be `0` when AQE and CBO are enabled (`spark.sql.adaptive.enabled`=true, `spark.sql.cbo.enabled`=true and `spark.sql.cbo.planStats.enabled`=true). This generates an incorrect BroadcastJoin, resulting in driver OOM. For example:
![SPARK-33018](https://user-images.githubusercontent.com/5399861/94457606-647e3d00-01e7-11eb-85ee-812ae6efe7bb.jpg)
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
Manual test.
Closes#29894 from wangyum/SPARK-33018.
Authored-by: Yuming Wang <yumwang@ebay.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request?
This patch proposes to optimize from_json + to_json expression chain.
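A hedged illustration of a chain this targets; the schema and data are made up:
```scala
import org.apache.spark.sql.functions.{col, from_json, to_json}
import org.apache.spark.sql.types.{IntegerType, StructType}
import spark.implicits._

val schema = new StructType().add("a", IntegerType)
val df = Seq("""{"a": 1}""").toDF("json")

// to_json(from_json(...)) parses and immediately re-serializes the same payload;
// with this optimization the round trip can be elided when the schemas are compatible.
df.select(to_json(from_json(col("json"), schema)).as("json"))
```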
### Why are the changes needed?
To optimize JSON expression chains that could be written manually or generated automatically during query optimization.
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
Unit test.
Closes#29828 from viirya/SPARK-32948.
Authored-by: Liang-Chi Hsieh <viirya@gmail.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
### What changes were proposed in this pull request?
Explicitly document that `current_date` and `current_timestamp` are executed at the start of query evaluation, and that all calls of `current_date`/`current_timestamp` within the same query return the same value.
### Why are the changes needed?
Users could expect that `current_date` and `current_timestamp` return the current date/timestamp at the moment of query execution but in fact the functions are folded by the optimizer at the start of query evaluation:
0df8dd6073/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/finishAnalysis.scala (L71-L91)
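A small example of the documented behavior:
```scala
// Both calls are folded to the same literal at the start of query evaluation,
// so the comparison is always true within a single query.
spark.sql("SELECT current_timestamp() = current_timestamp() AS same").show()
```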
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
by running `./dev/scalastyle`.
Closes#29892 from MaxGekk/doc-current_date.
Authored-by: Max Gekk <max.gekk@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request?
Compute the current date in the specified time zone using a timestamp taken at the start of query evaluation.
### Why are the changes needed?
According to the doc for [current_date()](http://spark.apache.org/docs/latest/api/sql/#current_date), the current date should be computed at the start of query evaluation but it can be computed multiple times. As a consequence of that, the function can return different values if the query is executed at the border of two dates.
### Does this PR introduce _any_ user-facing change?
Yes
### How was this patch tested?
By existing test suites `ComputeCurrentTimeSuite` and `DateExpressionsSuite`.
Closes#29889 from MaxGekk/fix-current_date.
Authored-by: Max Gekk <max.gekk@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request?
https://github.com/apache/spark/pull/29604 supports the ANSI SQL NTH_VALUE.
We should override the `prettyName` and `sql`.
### Why are the changes needed?
Make the name of `nth_value` correct and show the `ignoreNulls` parameter correctly.
### Does this PR introduce _any_ user-facing change?
'No'.
### How was this patch tested?
Jenkins test.
Closes#29886 from beliefer/improve-nth_value.
Lead-authored-by: gengjiaan <gengjiaan@360.cn>
Co-authored-by: beliefer <beliefer@163.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
### What changes were proposed in this pull request?
Add canonicalization rules for commutative bitwise operations.
### Why are the changes needed?
The canonical form is used in many other optimization rules. This reduces the number of cases where plans with identical results are considered distinct.
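A hedged illustration of what the rules enable, comparing canonical forms directly (a sketch, not the actual test code):
```scala
import org.apache.spark.sql.functions.col

// With canonicalization rules for commutative bitwise operations, operand order no longer
// matters when comparing the canonical forms of semantically equal expressions.
val e1 = col("a").bitwiseAND(col("b")).expr.canonicalized
val e2 = col("b").bitwiseAND(col("a")).expr.canonicalized
assert(e1 == e2)
```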
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
UT
Closes#29794 from tanelk/SPARK-32927.
Lead-authored-by: tanel.kiis@gmail.com <tanel.kiis@gmail.com>
Co-authored-by: Tanel Kiis <tanel.kiis@reach-u.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
### What changes were proposed in this pull request?
Use `Utils.getSimpleName` to avoid hitting `Malformed class name` error in `TreeNode`.
### Why are the changes needed?
On older JDK versions (e.g. JDK8u), nested Scala classes may trigger `java.lang.Class.getSimpleName` to throw an `java.lang.InternalError: Malformed class name` error.
Similar to https://github.com/apache/spark/pull/29050, we should use Spark's `Utils.getSimpleName` utility function in place of `Class.getSimpleName` to avoid hitting the issue.
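The substitution in a nutshell, as a sketch of the internal change (`Utils` is a Spark-internal helper):
```scala
import org.apache.spark.util.Utils

// Instead of getClass.getSimpleName, which can throw "Malformed class name"
// for nested Scala classes on older JDKs:
val nodeName = Utils.getSimpleName(getClass)
```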
### Does this PR introduce _any_ user-facing change?
Fixes a bug that throws an error when invoking `TreeNode.nodeName`, otherwise no changes.
### How was this patch tested?
Added new unit test case in `TreeNodeSuite`. Note that the test case assumes the test code can trigger the expected error, otherwise it'll skip the test safely, for compatibility with newer JDKs.
Manually tested on JDK8u and JDK11u and observed expected behavior:
- JDK8u: the test case triggers the "Malformed class name" issue and the fix works;
- JDK11u: the test case does not trigger the "Malformed class name" issue, and the test case is safely skipped.
Closes#29875 from rednaxelafx/spark-32999-getsimplename.
Authored-by: Kris Mok <kris.mok@databricks.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
### What changes were proposed in this pull request?
Unevaluable expressions are not foldable because we don't have an eval for them. This PR cleans up the code and enforces that.
### Why are the changes needed?
Ensure that we will not hit the weird cases that trigger ConstantFolding.
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
The existing tests.
Closes#29798 from gatorsmile/refactorUneval.
Lead-authored-by: gatorsmile <gatorsmile@gmail.com>
Co-authored-by: Xiao Li <gatorsmile@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request?
This pr aims to add a new `table` API in DataStreamReader, which is similar to the table API in DataFrameReader.
### Why are the changes needed?
Users can directly use this API to get a Streaming DataFrame on a table. Below is a simple example:
Application 1 for initializing and starting the streaming job:
```
val path = "/home/yuanjian.li/runtime/to_be_deleted"
val tblName = "my_table"
// Write some data to `my_table`
spark.range(3).write.format("parquet").option("path", path).saveAsTable(tblName)
// Read the table as a streaming source, write result to destination directory
val table = spark.readStream.table(tblName)
table.writeStream.format("parquet").option("checkpointLocation", "/home/yuanjian.li/runtime/to_be_deleted_ck").start("/home/yuanjian.li/runtime/to_be_deleted_2")
```
Application 2 for appending new data:
```
// Append new data into the path
spark.range(5).write.format("parquet").option("path", "/home/yuanjian.li/runtime/to_be_deleted").mode("append").save()
```
Check result:
```
// The destination directory should contain all written data
spark.read.parquet("/home/yuanjian.li/runtime/to_be_deleted_2").show()
```
### Does this PR introduce _any_ user-facing change?
Yes, a new API added.
### How was this patch tested?
New UT added and integrated testing.
Closes#29756 from xuanyuanking/SPARK-32885.
Authored-by: Yuanjian Li <yuanjian.li@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request?
This PR proposes to migrate `REFRESH TABLE` to use `UnresolvedTableOrView` to resolve the table/view identifier. This allows consistent resolution rules (temp view first, etc.) to be applied for both v1/v2 commands. More info about the consistent resolution rule proposal can be found in [JIRA](https://issues.apache.org/jira/browse/SPARK-29900) or [proposal doc](https://docs.google.com/document/d/1hvLjGA8y_W_hhilpngXVub1Ebv8RsMap986nENCFnrg/edit?usp=sharing).
### Why are the changes needed?
The current behavior is not consistent between v1 and v2 commands when resolving a temp view.
In v2, the `t` in the following example is resolved to a table:
```scala
sql("CREATE TABLE testcat.ns.t (id bigint) USING foo")
sql("CREATE TEMPORARY VIEW t AS SELECT 2")
sql("USE testcat.ns")
sql("REFRESH TABLE t") // 't' is resolved to testcat.ns.t
```
whereas in v1, the `t` is resolved to a temp view:
```scala
sql("CREATE DATABASE test")
sql("CREATE TABLE spark_catalog.test.t (id bigint) USING csv")
sql("CREATE TEMPORARY VIEW t AS SELECT 2")
sql("USE spark_catalog.test")
sql("REFRESH TABLE t") // 't' is resolved to a temp view
```
### Does this PR introduce _any_ user-facing change?
After this PR, `REFRESH TABLE t` is resolved to a temp view `t` instead of `testcat.ns.t`.
### How was this patch tested?
Added a new test
Closes#29866 from imback82/refresh_table_consistent.
Authored-by: Terry Kim <yuminkim@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
MurmurHash3 and xxHash64 interpret sequences of bytes as integers
encoded in little-endian byte order. This requires a byte reversal
on big endian platforms.
I've left the hashInt and hashLong functions as-is for now. My
interpretation of these functions is that they perform the hash on
the integer value as if it were serialized in little-endian byte
order. Therefore no byte reversal is necessary.
### What changes were proposed in this pull request?
Modify hash functions to produce correct results on big-endian platforms.
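A hedged sketch of the idea for the 4-byte case; the actual change is inside Spark's internal hash implementations:
```scala
import java.nio.ByteOrder
import org.apache.spark.unsafe.Platform

// Read 4 bytes as a little-endian int regardless of the platform's native byte order.
def getLittleEndianInt(base: AnyRef, offset: Long): Int = {
  val bits = Platform.getInt(base, offset)
  if (ByteOrder.nativeOrder() == ByteOrder.LITTLE_ENDIAN) bits else Integer.reverseBytes(bits)
}
```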
### Why are the changes needed?
Hash functions produce incorrect results on big-endian platforms which, amongst other potential issues, cause test failures.
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
Existing tests run on the IBM Z (s390x) platform which uses a big-endian byte order.
Closes#29762 from mundaym/fix-hashes.
Authored-by: Michael Munday <mike.munday@ibm.com>
Signed-off-by: Sean Owen <srowen@gmail.com>
### What changes were proposed in this pull request?
In the PR, I propose to replace the current examples for `percentile_approx()`, which use **only one** input value, with an example that uses **multiple values** in the input column.
### Why are the changes needed?
The current examples are pretty trivial and don't demonstrate the function's behaviour on a sequence of values.
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
- by running `ExpressionInfoSuite`
- `./dev/scalastyle`
Closes#29841 from MaxGekk/example-percentile_approx.
Authored-by: Max Gekk <max.gekk@gmail.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
### What changes were proposed in this pull request?
More precise description of the result of the `percentile_approx()` function and its synonym `approx_percentile()`. The proposed sentence clarifies that the function returns **one of elements** (or array of elements) from the input column.
### Why are the changes needed?
To improve Spark docs and avoid misunderstanding of the function behavior.
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
`./dev/scalastyle`
Closes#29835 from MaxGekk/doc-percentile_approx.
Authored-by: Max Gekk <max.gekk@gmail.com>
Signed-off-by: Liang-Chi Hsieh <viirya@gmail.com>
### What changes were proposed in this pull request?
This PR adds foldable propagation from `Aggregate` as per: https://github.com/apache/spark/pull/29771#discussion_r490412031
### Why are the changes needed?
This is an improvement as `Aggregate`'s `aggregateExpressions` can contain foldables that can be propagated up.
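A hedged illustration (the table and literal are made up) of an aggregate output that is foldable and can now be propagated:
```scala
// `tag` is a literal alias produced by the Aggregate; FoldablePropagation can replace
// references to it above the Aggregate with the literal itself.
spark.sql("""
  SELECT grp, cnt, tag
  FROM (
    SELECT k AS grp, count(*) AS cnt, 'fixed' AS tag
    FROM VALUES (1), (1), (2) AS t(k)
    GROUP BY k
  )
  WHERE tag = 'fixed'
""").show()
```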
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
New UT.
Closes#29816 from peter-toth/SPARK-32951-foldable-propagation-from-aggregate.
Authored-by: Peter Toth <peter.toth@gmail.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
### What changes were proposed in this pull request?
Currently, when explaining a SQL plan with a HiveTableRelation, it shows a lot of information about the HiveTableRelation's prunedPartitions, which makes the plan hard to read. This PR makes this information simpler.
Before:
![image](https://user-images.githubusercontent.com/46485123/93012078-aeeca080-f5cf-11ea-9286-f5c15eadbee3.png)
For UT
```
test("Make HiveTableScanExec message simple") {
withSQLConf("hive.exec.dynamic.partition.mode" -> "nonstrict") {
withTable("df") {
spark.range(30)
.select(col("id"), col("id").as("k"))
.write
.partitionBy("k")
.format("hive")
.mode("overwrite")
.saveAsTable("df")
val df = sql("SELECT df.id, df.k FROM df WHERE df.k < 2")
df.explain(true)
}
}
}
```
After this PR it will show
```
== Parsed Logical Plan ==
'Project ['df.id, 'df.k]
+- 'Filter ('df.k < 2)
+- 'UnresolvedRelation [df], []
== Analyzed Logical Plan ==
id: bigint, k: bigint
Project [id#11L, k#12L]
+- Filter (k#12L < cast(2 as bigint))
+- SubqueryAlias spark_catalog.default.df
+- HiveTableRelation [`default`.`df`, org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, Data Cols: [id#11L], Partition Cols: [k#12L]]
== Optimized Logical Plan ==
Filter (isnotnull(k#12L) AND (k#12L < 2))
+- HiveTableRelation [`default`.`df`, org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, Data Cols: [id#11L], Partition Cols: [k#12L], Pruned Partitions: [(k=0), (k=1)]]
== Physical Plan ==
Scan hive default.df [id#11L, k#12L], HiveTableRelation [`default`.`df`, org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, Data Cols: [id#11L], Partition Cols: [k#12L], Pruned Partitions: [(k=0), (k=1)]], [isnotnull(k#12L), (k#12L < 2)]
```
In my PR, I rework `HiveTableRelation`'s `simpleString` method to avoid showing too much unnecessary info in the explain plan. Compared to what we had before, I drop the detailed metadata of each partition and only retain the partSpec to show which partitions were pruned. For detailed information, we don't look at the plan anyway but use the DESC EXTENDED statement.
### Why are the changes needed?
Make plan about HiveTableRelation more readable
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
No
Closes#29739 from AngersZhuuuu/HiveTableScan-meta-location-info.
Authored-by: angerszhu <angers.zhu@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>