ODIn/spark-instrumented-optimizer

Author	SHA1	Message	Date
ulysses-you	1dd63dccd8	[SPARK-33860][SQL] Make CatalystTypeConverters.convertToCatalyst match special Array value ### What changes were proposed in this pull request? Add cases to match an Array whose element type is primitive. ### Why are the changes needed? We get an exception when using `Literal.create(Array(1, 2, 3), ArrayType(IntegerType))`. ``` Exception in thread "main" java.lang.IllegalArgumentException: requirement failed: Literal must have a corresponding value to array<int>, but class int[] found. at scala.Predef$.require(Predef.scala:281) at org.apache.spark.sql.catalyst.expressions.Literal$.validateLiteralValue(literals.scala:215) at org.apache.spark.sql.catalyst.expressions.Literal.<init>(literals.scala:292) at org.apache.spark.sql.catalyst.expressions.Literal$.create(literals.scala:140) ``` The same problem occurs with other arrays whose element type is primitive. ### Does this PR introduce _any_ user-facing change? Yes. ### How was this patch tested? Added a test. Closes #30868 from ulysses-you/SPARK-33860. Authored-by: ulysses-you <ulyssesyou18@gmail.com> Signed-off-by: HyukjinKwon <gurwls223@apache.org>	2020-12-22 15:10:46 +09:00
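The type-dispatch idea behind SPARK-33860 can be illustrated with a small, self-contained Python model. This is only a sketch under stated assumptions, not Spark's code: `GenericArrayData` and `convert_to_catalyst` are hypothetical stand-ins named after the Catalyst concepts, and Python's `array.array` plays the role of a JVM primitive array such as `int[]`.

```python
from array import array


class GenericArrayData:
    """Hypothetical stand-in for Catalyst's internal array value."""

    def __init__(self, values):
        self.values = list(values)

    def __eq__(self, other):
        return isinstance(other, GenericArrayData) and self.values == other.values

    def __repr__(self):
        return f"GenericArrayData({self.values!r})"


def convert_to_catalyst(value):
    """Toy converter that dispatches on the runtime type of `value`."""
    # Primitive-element arrays: array('i', ...) plays the role of a JVM int[].
    # This is the kind of case the commit message says was missing.
    if isinstance(value, array):
        return GenericArrayData(value)
    # Boxed sequences: convert each element recursively.
    if isinstance(value, (list, tuple)):
        return GenericArrayData(convert_to_catalyst(v) for v in value)
    # Anything else passes through unchanged in this sketch.
    return value


print(convert_to_catalyst(array("i", [1, 2, 3])))
```

Without the `array` branch, a primitive array would fall through unchanged, which is loosely analogous to the `class int[] found` failure quoted in the commit message.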
yangjie01	b88745565b	[SPARK-33700][SQL] Avoid file meta reading when enableFilterPushDown is true and filters is empty for Orc ### What changes were proposed in this pull request? ORC supports the filter pushdown optimization, but this optimization reads file metadata from external storage even if filters is empty. This PR adds an extra `filters.nonEmpty` check when `spark.sql.orc.filterPushdown` is true. ### Why are the changes needed? The ORC filter pushdown operation should only be triggered when `filters.nonEmpty` is true. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? Pass the Jenkins or GitHub Action Closes #30663 from LuciferYang/pushdownfilter-when-filter-nonempty. Authored-by: yangjie01 <yangjie01@baidu.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com>	2020-12-21 20:24:23 -08:00
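The guard described in SPARK-33700 can be sketched as follows. This is a toy model, not the ORC reader: `build_readers`, `fake_read_file_meta`, and the tuple encoding of filters are assumptions made for illustration. The only point carried over from the commit is that the metadata read happens only when pushdown is enabled and the filter list is non-empty.

```python
def build_readers(files, filters, pushdown_enabled, read_file_meta):
    """Return (path, pushed_filters_or_None) pairs for each file."""
    readers = []
    for path in files:
        pushed = None
        # Models the added `filters.nonEmpty` check: skip the metadata
        # read entirely when there is nothing to push down.
        if pushdown_enabled and filters:
            meta = read_file_meta(path)
            pushed = [f for f in filters if f[0] in meta["columns"]]
        readers.append((path, pushed))
    return readers


meta_reads = []


def fake_read_file_meta(path):
    """Hypothetical metadata reader that records every call."""
    meta_reads.append(path)
    return {"columns": {"id", "name"}}


files = ["part-0.orc", "part-1.orc"]
build_readers(files, [], True, fake_read_file_meta)
print(len(meta_reads))  # no metadata reads for an empty filter list
build_readers(files, [("id", ">", 3)], True, fake_read_file_meta)
print(len(meta_reads))  # one metadata read per file once a filter exists
```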
Kent Yao	f5fd10b1bc	[SPARK-33834][SQL] Verify ALTER TABLE CHANGE COLUMN with Char and Varchar ### What changes were proposed in this pull request? Verify ALTER TABLE CHANGE COLUMN with Char and Varchar and avoid unexpected changes. For v1 tables, changing the type is not allowed; this fixes a regression that used the replaced string type instead of the original char/varchar type when altering char/varchar columns. For v2 tables, char/varchar to string, char(x) to char(x), and char(x)/varchar(x) to varchar(y) if x <= y are valid cases; other changes are invalid. ### Why are the changes needed? Verify ALTER TABLE CHANGE COLUMN with Char and Varchar and avoid unexpected changes. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? New test. Closes #30833 from yaooqinn/SPARK-33834. Authored-by: Kent Yao <yaooqinn@hotmail.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2020-12-22 03:07:26 +00:00
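The v2 rules listed in SPARK-33834 can be written down as a small predicate. The sketch below only encodes the cases stated in the commit message; `ColType` and `can_alter_v2` are hypothetical names, and source types other than char/varchar are deliberately out of scope.

```python
from typing import NamedTuple, Optional


class ColType(NamedTuple):
    """Hypothetical column type: name is "char", "varchar", or "string"."""
    name: str
    length: Optional[int] = None


def can_alter_v2(old: ColType, new: ColType) -> bool:
    """Check a char/varchar column change against the rules in the commit message."""
    if old.name not in ("char", "varchar"):
        raise ValueError("this sketch only covers char/varchar source types")
    # char/varchar -> string is always valid.
    if new.name == "string":
        return True
    # char(x) -> char(x): only the identical length is valid.
    if old.name == "char" and new.name == "char":
        return old.length == new.length
    # char(x)/varchar(x) -> varchar(y) is valid if x <= y.
    if new.name == "varchar":
        return old.length <= new.length
    # Everything else (for example varchar -> char) is invalid.
    return False


print(can_alter_v2(ColType("char", 4), ColType("varchar", 8)))
print(can_alter_v2(ColType("varchar", 8), ColType("char", 8)))
```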
angerszhu	7466031632	[SPARK-32106][SQL] Implement script transform in sql/core ### What changes were proposed in this pull request? * Implement `SparkScriptTransformationExec` based on `BaseScriptTransformationExec` * Implement `SparkScriptTransformationWriterThread` based on `BaseScriptTransformationWriterThread` for writing data * Add rule `SparkScripts` to convert a script LogicalPlan to a SparkPlan in Spark SQL (without Hive mode) * Add `SparkScriptTransformationSuite` to test Spark-specific cases * Add tests in `SQLQueryTestSuite` This will also close #29085. ### Why are the changes needed? Let users use Script Transform without Hive. ### Does this PR introduce _any_ user-facing change? Users can use Script Transformation without Hive in no-serde mode. For example, default no-serde: ``` SELECT TRANSFORM(a, b, c) USING 'cat' AS (a int, b string, c long) FROM testData ``` No-serde with a specified ROW FORMAT DELIMITED: ``` SELECT TRANSFORM(a, b, c) ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t' COLLECTION ITEMS TERMINATED BY '\u0002' MAP KEYS TERMINATED BY '\u0003' LINES TERMINATED BY '\n' NULL DEFINED AS 'null' USING 'cat' AS (a, b, c) ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t' COLLECTION ITEMS TERMINATED BY '\u0004' MAP KEYS TERMINATED BY '\u0005' LINES TERMINATED BY '\n' NULL DEFINED AS 'NULL' FROM testData ``` ### How was this patch tested? Added UT Closes #29414 from AngersZhuuuu/SPARK-32106-MINOR. Authored-by: angerszhu <angers.zhu@gmail.com> Signed-off-by: Takeshi Yamamuro <yamamuro@apache.org>	2020-12-22 11:37:59 +09:00
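The data flow of a delimited, no-serde script transform (SPARK-32106) can be approximated outside Spark: rows are written as delimited text to a child process, and its output is split back into fields. The sketch below is an assumption-laden model, not Spark's implementation. The `transform` function and its parameters are hypothetical, a tiny Python child process stands in for `cat` so the example stays portable, and all returned fields are left as strings rather than cast to declared output types.

```python
import subprocess
import sys

# Portable stand-in for the `cat` script used in the commit's examples.
CAT = [sys.executable, "-c", "import sys; sys.stdout.write(sys.stdin.read())"]


def transform(rows, command=CAT, field_sep="\t", line_sep="\n", null_str="null"):
    """Pipe delimited rows through `command` and parse its output back into rows."""
    # Serialize: one line per row, fields joined by field_sep, None as null_str.
    payload = "".join(
        field_sep.join(null_str if v is None else str(v) for v in row) + line_sep
        for row in rows
    )
    out = subprocess.run(
        command, input=payload, capture_output=True, text=True, check=True
    ).stdout
    # Deserialize: split lines and fields, mapping null_str back to None.
    return [
        [None if field == null_str else field for field in line.split(field_sep)]
        for line in out.split(line_sep)
        if line
    ]


print(transform([(1, "a", 10), (2, None, 20)]))
```

One visible limitation of this encoding is that a real string value equal to `null_str` cannot be distinguished from a null after the round trip.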
Yuming Wang	1c77605682	[SPARK-33848][SQL] Push the UnaryExpression into (if / case) branches ### What changes were proposed in this pull request? This PR pushes the `UnaryExpression` into (if / case) branches. The use case is: ```sql create table t1 using parquet as select id from range(10); explain select id from t1 where (CASE WHEN id = 1 THEN '1' WHEN id = 3 THEN '2' end) > 3; ``` Before this PR: ``` == Physical Plan == (1) Filter (cast(CASE WHEN (id#1L = 1) THEN 1 WHEN (id#1L = 3) THEN 2 END as int) > 3) +- (1) ColumnarToRow +- FileScan parquet default.t1[id#1L] Batched: true, DataFilters: [(cast(CASE WHEN (id#1L = 1) THEN 1 WHEN (id#1L = 3) THEN 2 END as int) > 3)], Format: Parquet, Location: InMemoryFileIndex[file:/Users/yumwang/opensource/spark/spark-warehouse/org.apache.spark.sql.DataF..., PartitionFilters: [], PushedFilters: [], ReadSchema: struct<id:bigint> ``` After this PR: ``` == Physical Plan == LocalTableScan <empty>, [id#1L] ``` This change can also improve this case: `a78d6ce376/sql/core/src/test/resources/tpcds/q62.sql (L5-L22)` ### Why are the changes needed? Improve query performance. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Unit test. Closes #30853 from wangyum/SPARK-33848. Authored-by: Yuming Wang <yumwang@ebay.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com>	2020-12-21 10:25:23 -08:00
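The rewrite in SPARK-33848 can be pictured on a miniature expression tree. The following is a simplified sketch, not the Catalyst rule: the node classes (`Lit`, `Col`, `Eq`, `CastInt`, `CaseWhen`) and the functions `fold` and `push_into_branches` are hypothetical, the push is applied unconditionally, and the later steps that reduce the plan to an empty `LocalTableScan` are not modeled. It shows only how moving a cast into each branch exposes constants that can be folded.

```python
from dataclasses import dataclass
from typing import Any, Optional, Tuple


@dataclass(frozen=True)
class Lit:
    value: Any


@dataclass(frozen=True)
class Col:
    name: str


@dataclass(frozen=True)
class Eq:
    left: Any
    right: Any


@dataclass(frozen=True)
class CastInt:
    """Unary expression: cast the child to an integer."""
    child: Any


@dataclass(frozen=True)
class CaseWhen:
    branches: Tuple[Tuple[Any, Any], ...]
    else_value: Optional[Any] = None


def fold(expr):
    """Constant-fold a cast over a literal; leave anything else untouched."""
    if isinstance(expr, CastInt) and isinstance(expr.child, Lit):
        v = expr.child.value
        return Lit(None if v is None else int(v))
    return expr


def push_into_branches(expr):
    """Rewrite cast(CASE ... END) into CASE with the cast applied per branch."""
    if isinstance(expr, CastInt) and isinstance(expr.child, CaseWhen):
        case = expr.child
        return CaseWhen(
            tuple((cond, fold(CastInt(value))) for cond, value in case.branches),
            None if case.else_value is None else fold(CastInt(case.else_value)),
        )
    return expr


# cast(CASE WHEN id = 1 THEN '1' WHEN id = 3 THEN '2' END as int)
expr = CastInt(CaseWhen((
    (Eq(Col("id"), Lit(1)), Lit("1")),
    (Eq(Col("id"), Lit(3)), Lit("2")),
)))
print(push_into_branches(expr))
```

After the rewrite every branch value is an integer literal, which is what lets a comparison such as `> 3` be evaluated per branch by later optimizations.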
Max Gekk	661ac10901	[SPARK-33838][SQL][DOCS] Comment the `PURGE` option in the DropTable and in AlterTableDropPartition commands ### What changes were proposed in this pull request? Add comments for the `PURGE` option to the logical nodes `DropTable` and `AlterTableDropPartition`. ### Why are the changes needed? To improve code maintenance. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? By running `./dev/scalastyle` Closes #30837 from MaxGekk/comment-purge-logical-node. Authored-by: Max Gekk <max.gekk@gmail.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2020-12-21 14:06:31 +00:00
Takeshi Yamamuro	69aa727ff4	[SPARK-33124][SQL] Fills missing group tags and re-categorizes all the group tags for built-in functions ### What changes were proposed in this pull request? This PR proposes to fill missing group tags and re-categorize all the group tags for built-in functions. New groups below are added in this PR: - binary_funcs - bitwise_funcs - collection_funcs - predicate_funcs - conditional_funcs - conversion_funcs - csv_funcs - generator_funcs - hash_funcs - lambda_funcs - math_funcs - misc_funcs - string_funcs - struct_funcs - xml_funcs A basic policy to re-categorize functions is that functions in the same file are categorized into the same group. For example, all the functions in `hash.scala` are categorized into `hash_funcs`. But, there are some exceptional/ambiguous cases when categorizing them. Here are some special notes: - All the aggregate functions are categorized into `agg_funcs`. - `array_funcs` and `map_funcs` are sub-groups of `collection_funcs`. For example, `array_contains` is used only for arrays, so it is assigned to `array_funcs`. On the other hand, `reverse` is used for both arrays and strings, so it is assigned to `collection_funcs`. - Some functions logically belong to multiple groups. In this case, these functions are categorized based on the file that they belong to. For example, `schema_of_csv` can be grouped into both `csv_funcs` and `struct_funcs` in terms of input types, but it is assigned to `csv_funcs` because it belongs to the `csvExpressions.scala` file that holds the other CSV-related functions. - Functions in `nullExpressions.scala`, `complexTypeCreator.scala`, `randomExpressions.scala`, and `regexExpressions.scala` are categorized based on their functionalities. For example: - `isnull` in `nullExpressions` is assigned to `predicate_funcs` because this is a predicate function. 
- `array` in `complexTypeCreator.scala` is assigned to `array_funcs`based on its output type (The other functions in `array_funcs` are categorized based on their input types though). A category list (after this PR) is as follows (the list below includes the exprs that already have a group tag in the current master): \|group\|name\|class\| \|-----\|----\|-----\| \|agg_funcs\|any\|org.apache.spark.sql.catalyst.expressions.aggregate.BoolOr\| \|agg_funcs\|approx_count_distinct\|org.apache.spark.sql.catalyst.expressions.aggregate.HyperLogLogPlusPlus\| \|agg_funcs\|approx_percentile\|org.apache.spark.sql.catalyst.expressions.aggregate.ApproximatePercentile\| \|agg_funcs\|avg\|org.apache.spark.sql.catalyst.expressions.aggregate.Average\| \|agg_funcs\|bit_and\|org.apache.spark.sql.catalyst.expressions.aggregate.BitAndAgg\| \|agg_funcs\|bit_or\|org.apache.spark.sql.catalyst.expressions.aggregate.BitOrAgg\| \|agg_funcs\|bit_xor\|org.apache.spark.sql.catalyst.expressions.aggregate.BitXorAgg\| \|agg_funcs\|bool_and\|org.apache.spark.sql.catalyst.expressions.aggregate.BoolAnd\| \|agg_funcs\|bool_or\|org.apache.spark.sql.catalyst.expressions.aggregate.BoolOr\| \|agg_funcs\|collect_list\|org.apache.spark.sql.catalyst.expressions.aggregate.CollectList\| \|agg_funcs\|collect_set\|org.apache.spark.sql.catalyst.expressions.aggregate.CollectSet\| \|agg_funcs\|corr\|org.apache.spark.sql.catalyst.expressions.aggregate.Corr\| \|agg_funcs\|count_if\|org.apache.spark.sql.catalyst.expressions.aggregate.CountIf\| \|agg_funcs\|count_min_sketch\|org.apache.spark.sql.catalyst.expressions.aggregate.CountMinSketchAgg\| \|agg_funcs\|count\|org.apache.spark.sql.catalyst.expressions.aggregate.Count\| \|agg_funcs\|covar_pop\|org.apache.spark.sql.catalyst.expressions.aggregate.CovPopulation\| \|agg_funcs\|covar_samp\|org.apache.spark.sql.catalyst.expressions.aggregate.CovSample\| \|agg_funcs\|cube\|org.apache.spark.sql.catalyst.expressions.Cube\| 
\|agg_funcs\|every\|org.apache.spark.sql.catalyst.expressions.aggregate.BoolAnd\| \|agg_funcs\|first_value\|org.apache.spark.sql.catalyst.expressions.aggregate.First\| \|agg_funcs\|first\|org.apache.spark.sql.catalyst.expressions.aggregate.First\| \|agg_funcs\|grouping_id\|org.apache.spark.sql.catalyst.expressions.GroupingID\| \|agg_funcs\|grouping\|org.apache.spark.sql.catalyst.expressions.Grouping\| \|agg_funcs\|kurtosis\|org.apache.spark.sql.catalyst.expressions.aggregate.Kurtosis\| \|agg_funcs\|last_value\|org.apache.spark.sql.catalyst.expressions.aggregate.Last\| \|agg_funcs\|last\|org.apache.spark.sql.catalyst.expressions.aggregate.Last\| \|agg_funcs\|max_by\|org.apache.spark.sql.catalyst.expressions.aggregate.MaxBy\| \|agg_funcs\|max\|org.apache.spark.sql.catalyst.expressions.aggregate.Max\| \|agg_funcs\|mean\|org.apache.spark.sql.catalyst.expressions.aggregate.Average\| \|agg_funcs\|min_by\|org.apache.spark.sql.catalyst.expressions.aggregate.MinBy\| \|agg_funcs\|min\|org.apache.spark.sql.catalyst.expressions.aggregate.Min\| \|agg_funcs\|percentile_approx\|org.apache.spark.sql.catalyst.expressions.aggregate.ApproximatePercentile\| \|agg_funcs\|percentile\|org.apache.spark.sql.catalyst.expressions.aggregate.Percentile\| \|agg_funcs\|rollup\|org.apache.spark.sql.catalyst.expressions.Rollup\| \|agg_funcs\|skewness\|org.apache.spark.sql.catalyst.expressions.aggregate.Skewness\| \|agg_funcs\|some\|org.apache.spark.sql.catalyst.expressions.aggregate.BoolOr\| \|agg_funcs\|stddev_pop\|org.apache.spark.sql.catalyst.expressions.aggregate.StddevPop\| \|agg_funcs\|stddev_samp\|org.apache.spark.sql.catalyst.expressions.aggregate.StddevSamp\| \|agg_funcs\|stddev\|org.apache.spark.sql.catalyst.expressions.aggregate.StddevSamp\| \|agg_funcs\|std\|org.apache.spark.sql.catalyst.expressions.aggregate.StddevSamp\| \|agg_funcs\|sum\|org.apache.spark.sql.catalyst.expressions.aggregate.Sum\| \|agg_funcs\|var_pop\|org.apache.spark.sql.catalyst.expressions.aggregate.VariancePop\| 
\|agg_funcs\|var_samp\|org.apache.spark.sql.catalyst.expressions.aggregate.VarianceSamp\| \|agg_funcs\|variance\|org.apache.spark.sql.catalyst.expressions.aggregate.VarianceSamp\| \|array_funcs\|array_contains\|org.apache.spark.sql.catalyst.expressions.ArrayContains\| \|array_funcs\|array_distinct\|org.apache.spark.sql.catalyst.expressions.ArrayDistinct\| \|array_funcs\|array_except\|org.apache.spark.sql.catalyst.expressions.ArrayExcept\| \|array_funcs\|array_intersect\|org.apache.spark.sql.catalyst.expressions.ArrayIntersect\| \|array_funcs\|array_join\|org.apache.spark.sql.catalyst.expressions.ArrayJoin\| \|array_funcs\|array_max\|org.apache.spark.sql.catalyst.expressions.ArrayMax\| \|array_funcs\|array_min\|org.apache.spark.sql.catalyst.expressions.ArrayMin\| \|array_funcs\|array_position\|org.apache.spark.sql.catalyst.expressions.ArrayPosition\| \|array_funcs\|array_remove\|org.apache.spark.sql.catalyst.expressions.ArrayRemove\| \|array_funcs\|array_repeat\|org.apache.spark.sql.catalyst.expressions.ArrayRepeat\| \|array_funcs\|array_union\|org.apache.spark.sql.catalyst.expressions.ArrayUnion\| \|array_funcs\|arrays_overlap\|org.apache.spark.sql.catalyst.expressions.ArraysOverlap\| \|array_funcs\|arrays_zip\|org.apache.spark.sql.catalyst.expressions.ArraysZip\| \|array_funcs\|array\|org.apache.spark.sql.catalyst.expressions.CreateArray\| \|array_funcs\|flatten\|org.apache.spark.sql.catalyst.expressions.Flatten\| \|array_funcs\|sequence\|org.apache.spark.sql.catalyst.expressions.Sequence\| \|array_funcs\|shuffle\|org.apache.spark.sql.catalyst.expressions.Shuffle\| \|array_funcs\|slice\|org.apache.spark.sql.catalyst.expressions.Slice\| \|array_funcs\|sort_array\|org.apache.spark.sql.catalyst.expressions.SortArray\| \|bitwise_funcs\|&\|org.apache.spark.sql.catalyst.expressions.BitwiseAnd\| \|bitwise_funcs\|^\|org.apache.spark.sql.catalyst.expressions.BitwiseXor\| \|bitwise_funcs\|bit_count\|org.apache.spark.sql.catalyst.expressions.BitwiseCount\| 
\|bitwise_funcs\|shiftrightunsigned\|org.apache.spark.sql.catalyst.expressions.ShiftRightUnsigned\| \|bitwise_funcs\|shiftright\|org.apache.spark.sql.catalyst.expressions.ShiftRight\| \|bitwise_funcs\|~\|org.apache.spark.sql.catalyst.expressions.BitwiseNot\| \|collection_funcs\|cardinality\|org.apache.spark.sql.catalyst.expressions.Size\| \|collection_funcs\|concat\|org.apache.spark.sql.catalyst.expressions.Concat\| \|collection_funcs\|reverse\|org.apache.spark.sql.catalyst.expressions.Reverse\| \|collection_funcs\|size\|org.apache.spark.sql.catalyst.expressions.Size\| \|conditional_funcs\|coalesce\|org.apache.spark.sql.catalyst.expressions.Coalesce\| \|conditional_funcs\|ifnull\|org.apache.spark.sql.catalyst.expressions.IfNull\| \|conditional_funcs\|if\|org.apache.spark.sql.catalyst.expressions.If\| \|conditional_funcs\|nanvl\|org.apache.spark.sql.catalyst.expressions.NaNvl\| \|conditional_funcs\|nullif\|org.apache.spark.sql.catalyst.expressions.NullIf\| \|conditional_funcs\|nvl2\|org.apache.spark.sql.catalyst.expressions.Nvl2\| \|conditional_funcs\|nvl\|org.apache.spark.sql.catalyst.expressions.Nvl\| \|conditional_funcs\|when\|org.apache.spark.sql.catalyst.expressions.CaseWhen\| \|conversion_funcs\|bigint\|org.apache.spark.sql.catalyst.expressions.Cast\| \|conversion_funcs\|binary\|org.apache.spark.sql.catalyst.expressions.Cast\| \|conversion_funcs\|boolean\|org.apache.spark.sql.catalyst.expressions.Cast\| \|conversion_funcs\|cast\|org.apache.spark.sql.catalyst.expressions.Cast\| \|conversion_funcs\|date\|org.apache.spark.sql.catalyst.expressions.Cast\| \|conversion_funcs\|decimal\|org.apache.spark.sql.catalyst.expressions.Cast\| \|conversion_funcs\|double\|org.apache.spark.sql.catalyst.expressions.Cast\| \|conversion_funcs\|float\|org.apache.spark.sql.catalyst.expressions.Cast\| \|conversion_funcs\|int\|org.apache.spark.sql.catalyst.expressions.Cast\| \|conversion_funcs\|smallint\|org.apache.spark.sql.catalyst.expressions.Cast\| 
\|conversion_funcs\|string\|org.apache.spark.sql.catalyst.expressions.Cast\| \|conversion_funcs\|timestamp\|org.apache.spark.sql.catalyst.expressions.Cast\| \|conversion_funcs\|tinyint\|org.apache.spark.sql.catalyst.expressions.Cast\| \|csv_funcs\|from_csv\|org.apache.spark.sql.catalyst.expressions.CsvToStructs\| \|csv_funcs\|schema_of_csv\|org.apache.spark.sql.catalyst.expressions.SchemaOfCsv\| \|csv_funcs\|to_csv\|org.apache.spark.sql.catalyst.expressions.StructsToCsv\| \|datetime_funcs\|add_months\|org.apache.spark.sql.catalyst.expressions.AddMonths\| \|datetime_funcs\|current_date\|org.apache.spark.sql.catalyst.expressions.CurrentDate\| \|datetime_funcs\|current_timestamp\|org.apache.spark.sql.catalyst.expressions.CurrentTimestamp\| \|datetime_funcs\|current_timezone\|org.apache.spark.sql.catalyst.expressions.CurrentTimeZone\| \|datetime_funcs\|date_add\|org.apache.spark.sql.catalyst.expressions.DateAdd\| \|datetime_funcs\|date_format\|org.apache.spark.sql.catalyst.expressions.DateFormatClass\| \|datetime_funcs\|date_from_unix_date\|org.apache.spark.sql.catalyst.expressions.DateFromUnixDate\| \|datetime_funcs\|date_part\|org.apache.spark.sql.catalyst.expressions.DatePart\| \|datetime_funcs\|date_sub\|org.apache.spark.sql.catalyst.expressions.DateSub\| \|datetime_funcs\|date_trunc\|org.apache.spark.sql.catalyst.expressions.TruncTimestamp\| \|datetime_funcs\|datediff\|org.apache.spark.sql.catalyst.expressions.DateDiff\| \|datetime_funcs\|dayofmonth\|org.apache.spark.sql.catalyst.expressions.DayOfMonth\| \|datetime_funcs\|dayofweek\|org.apache.spark.sql.catalyst.expressions.DayOfWeek\| \|datetime_funcs\|dayofyear\|org.apache.spark.sql.catalyst.expressions.DayOfYear\| \|datetime_funcs\|day\|org.apache.spark.sql.catalyst.expressions.DayOfMonth\| \|datetime_funcs\|extract\|org.apache.spark.sql.catalyst.expressions.Extract\| \|datetime_funcs\|from_unixtime\|org.apache.spark.sql.catalyst.expressions.FromUnixTime\| 
\|datetime_funcs\|from_utc_timestamp\|org.apache.spark.sql.catalyst.expressions.FromUTCTimestamp\| \|datetime_funcs\|hour\|org.apache.spark.sql.catalyst.expressions.Hour\| \|datetime_funcs\|last_day\|org.apache.spark.sql.catalyst.expressions.LastDay\| \|datetime_funcs\|make_date\|org.apache.spark.sql.catalyst.expressions.MakeDate\| \|datetime_funcs\|make_interval\|org.apache.spark.sql.catalyst.expressions.MakeInterval\| \|datetime_funcs\|make_timestamp\|org.apache.spark.sql.catalyst.expressions.MakeTimestamp\| \|datetime_funcs\|minute\|org.apache.spark.sql.catalyst.expressions.Minute\| \|datetime_funcs\|months_between\|org.apache.spark.sql.catalyst.expressions.MonthsBetween\| \|datetime_funcs\|month\|org.apache.spark.sql.catalyst.expressions.Month\| \|datetime_funcs\|next_day\|org.apache.spark.sql.catalyst.expressions.NextDay\| \|datetime_funcs\|now\|org.apache.spark.sql.catalyst.expressions.Now\| \|datetime_funcs\|quarter\|org.apache.spark.sql.catalyst.expressions.Quarter\| \|datetime_funcs\|second\|org.apache.spark.sql.catalyst.expressions.Second\| \|datetime_funcs\|timestamp_micros\|org.apache.spark.sql.catalyst.expressions.MicrosToTimestamp\| \|datetime_funcs\|timestamp_millis\|org.apache.spark.sql.catalyst.expressions.MillisToTimestamp\| \|datetime_funcs\|timestamp_seconds\|org.apache.spark.sql.catalyst.expressions.SecondsToTimestamp\| \|datetime_funcs\|to_date\|org.apache.spark.sql.catalyst.expressions.ParseToDate\| \|datetime_funcs\|to_timestamp\|org.apache.spark.sql.catalyst.expressions.ParseToTimestamp\| \|datetime_funcs\|to_unix_timestamp\|org.apache.spark.sql.catalyst.expressions.ToUnixTimestamp\| \|datetime_funcs\|to_utc_timestamp\|org.apache.spark.sql.catalyst.expressions.ToUTCTimestamp\| \|datetime_funcs\|trunc\|org.apache.spark.sql.catalyst.expressions.TruncDate\| \|datetime_funcs\|unix_date\|org.apache.spark.sql.catalyst.expressions.UnixDate\| \|datetime_funcs\|unix_micros\|org.apache.spark.sql.catalyst.expressions.UnixMicros\| 
\|datetime_funcs\|unix_millis\|org.apache.spark.sql.catalyst.expressions.UnixMillis\| \|datetime_funcs\|unix_seconds\|org.apache.spark.sql.catalyst.expressions.UnixSeconds\| \|datetime_funcs\|unix_timestamp\|org.apache.spark.sql.catalyst.expressions.UnixTimestamp\| \|datetime_funcs\|weekday\|org.apache.spark.sql.catalyst.expressions.WeekDay\| \|datetime_funcs\|weekofyear\|org.apache.spark.sql.catalyst.expressions.WeekOfYear\| \|datetime_funcs\|year\|org.apache.spark.sql.catalyst.expressions.Year\| \|generator_funcs\|explode_outer\|org.apache.spark.sql.catalyst.expressions.Explode\| \|generator_funcs\|explode\|org.apache.spark.sql.catalyst.expressions.Explode\| \|generator_funcs\|inline_outer\|org.apache.spark.sql.catalyst.expressions.Inline\| \|generator_funcs\|inline\|org.apache.spark.sql.catalyst.expressions.Inline\| \|generator_funcs\|posexplode_outer\|org.apache.spark.sql.catalyst.expressions.PosExplode\| \|generator_funcs\|posexplode\|org.apache.spark.sql.catalyst.expressions.PosExplode\| \|generator_funcs\|stack\|org.apache.spark.sql.catalyst.expressions.Stack\| \|hash_funcs\|crc32\|org.apache.spark.sql.catalyst.expressions.Crc32\| \|hash_funcs\|hash\|org.apache.spark.sql.catalyst.expressions.Murmur3Hash\| \|hash_funcs\|md5\|org.apache.spark.sql.catalyst.expressions.Md5\| \|hash_funcs\|sha1\|org.apache.spark.sql.catalyst.expressions.Sha1\| \|hash_funcs\|sha2\|org.apache.spark.sql.catalyst.expressions.Sha2\| \|hash_funcs\|sha\|org.apache.spark.sql.catalyst.expressions.Sha1\| \|hash_funcs\|xxhash64\|org.apache.spark.sql.catalyst.expressions.XxHash64\| \|json_funcs\|from_json\|org.apache.spark.sql.catalyst.expressions.JsonToStructs\| \|json_funcs\|get_json_object\|org.apache.spark.sql.catalyst.expressions.GetJsonObject\| \|json_funcs\|json_array_length\|org.apache.spark.sql.catalyst.expressions.LengthOfJsonArray\| \|json_funcs\|json_object_keys\|org.apache.spark.sql.catalyst.expressions.JsonObjectKeys\| 
\|json_funcs\|json_tuple\|org.apache.spark.sql.catalyst.expressions.JsonTuple\| \|json_funcs\|schema_of_json\|org.apache.spark.sql.catalyst.expressions.SchemaOfJson\| \|json_funcs\|to_json\|org.apache.spark.sql.catalyst.expressions.StructsToJson\| \|lambda_funcs\|aggregate\|org.apache.spark.sql.catalyst.expressions.ArrayAggregate\| \|lambda_funcs\|array_sort\|org.apache.spark.sql.catalyst.expressions.ArraySort\| \|lambda_funcs\|exists\|org.apache.spark.sql.catalyst.expressions.ArrayExists\| \|lambda_funcs\|filter\|org.apache.spark.sql.catalyst.expressions.ArrayFilter\| \|lambda_funcs\|forall\|org.apache.spark.sql.catalyst.expressions.ArrayForAll\| \|lambda_funcs\|map_filter\|org.apache.spark.sql.catalyst.expressions.MapFilter\| \|lambda_funcs\|map_zip_with\|org.apache.spark.sql.catalyst.expressions.MapZipWith\| \|lambda_funcs\|transform_keys\|org.apache.spark.sql.catalyst.expressions.TransformKeys\| \|lambda_funcs\|transform_values\|org.apache.spark.sql.catalyst.expressions.TransformValues\| \|lambda_funcs\|transform\|org.apache.spark.sql.catalyst.expressions.ArrayTransform\| \|lambda_funcs\|zip_with\|org.apache.spark.sql.catalyst.expressions.ZipWith\| \|map_funcs\|element_at\|org.apache.spark.sql.catalyst.expressions.ElementAt\| \|map_funcs\|map_concat\|org.apache.spark.sql.catalyst.expressions.MapConcat\| \|map_funcs\|map_entries\|org.apache.spark.sql.catalyst.expressions.MapEntries\| \|map_funcs\|map_from_arrays\|org.apache.spark.sql.catalyst.expressions.MapFromArrays\| \|map_funcs\|map_from_entries\|org.apache.spark.sql.catalyst.expressions.MapFromEntries\| \|map_funcs\|map_keys\|org.apache.spark.sql.catalyst.expressions.MapKeys\| \|map_funcs\|map_values\|org.apache.spark.sql.catalyst.expressions.MapValues\| \|map_funcs\|map\|org.apache.spark.sql.catalyst.expressions.CreateMap\| \|map_funcs\|str_to_map\|org.apache.spark.sql.catalyst.expressions.StringToMap\| \|math_funcs\|%\|org.apache.spark.sql.catalyst.expressions.Remainder\| 
\|math_funcs\|*\|org.apache.spark.sql.catalyst.expressions.Multiply\| \|math_funcs\|+\|org.apache.spark.sql.catalyst.expressions.Add\| \|math_funcs\|-\|org.apache.spark.sql.catalyst.expressions.Subtract\| \|math_funcs\|/\|org.apache.spark.sql.catalyst.expressions.Divide\| \|math_funcs\|abs\|org.apache.spark.sql.catalyst.expressions.Abs\| \|math_funcs\|acosh\|org.apache.spark.sql.catalyst.expressions.Acosh\| \|math_funcs\|acos\|org.apache.spark.sql.catalyst.expressions.Acos\| \|math_funcs\|asinh\|org.apache.spark.sql.catalyst.expressions.Asinh\| \|math_funcs\|asin\|org.apache.spark.sql.catalyst.expressions.Asin\| \|math_funcs\|atan2\|org.apache.spark.sql.catalyst.expressions.Atan2\| \|math_funcs\|atanh\|org.apache.spark.sql.catalyst.expressions.Atanh\| \|math_funcs\|atan\|org.apache.spark.sql.catalyst.expressions.Atan\| \|math_funcs\|bin\|org.apache.spark.sql.catalyst.expressions.Bin\| \|math_funcs\|bround\|org.apache.spark.sql.catalyst.expressions.BRound\| \|math_funcs\|cbrt\|org.apache.spark.sql.catalyst.expressions.Cbrt\| \|math_funcs\|ceiling\|org.apache.spark.sql.catalyst.expressions.Ceil\| \|math_funcs\|ceil\|org.apache.spark.sql.catalyst.expressions.Ceil\| \|math_funcs\|conv\|org.apache.spark.sql.catalyst.expressions.Conv\| \|math_funcs\|cosh\|org.apache.spark.sql.catalyst.expressions.Cosh\| \|math_funcs\|cos\|org.apache.spark.sql.catalyst.expressions.Cos\| \|math_funcs\|cot\|org.apache.spark.sql.catalyst.expressions.Cot\| \|math_funcs\|degrees\|org.apache.spark.sql.catalyst.expressions.ToDegrees\| \|math_funcs\|div\|org.apache.spark.sql.catalyst.expressions.IntegralDivide\| \|math_funcs\|expm1\|org.apache.spark.sql.catalyst.expressions.Expm1\| \|math_funcs\|exp\|org.apache.spark.sql.catalyst.expressions.Exp\| \|math_funcs\|e\|org.apache.spark.sql.catalyst.expressions.EulerNumber\| \|math_funcs\|factorial\|org.apache.spark.sql.catalyst.expressions.Factorial\| \|math_funcs\|floor\|org.apache.spark.sql.catalyst.expressions.Floor\| 
\|math_funcs\|greatest\|org.apache.spark.sql.catalyst.expressions.Greatest\| \|math_funcs\|hex\|org.apache.spark.sql.catalyst.expressions.Hex\| \|math_funcs\|hypot\|org.apache.spark.sql.catalyst.expressions.Hypot\| \|math_funcs\|least\|org.apache.spark.sql.catalyst.expressions.Least\| \|math_funcs\|ln\|org.apache.spark.sql.catalyst.expressions.Log\| \|math_funcs\|log10\|org.apache.spark.sql.catalyst.expressions.Log10\| \|math_funcs\|log1p\|org.apache.spark.sql.catalyst.expressions.Log1p\| \|math_funcs\|log2\|org.apache.spark.sql.catalyst.expressions.Log2\| \|math_funcs\|log\|org.apache.spark.sql.catalyst.expressions.Logarithm\| \|math_funcs\|mod\|org.apache.spark.sql.catalyst.expressions.Remainder\| \|math_funcs\|negative\|org.apache.spark.sql.catalyst.expressions.UnaryMinus\| \|math_funcs\|pi\|org.apache.spark.sql.catalyst.expressions.Pi\| \|math_funcs\|pmod\|org.apache.spark.sql.catalyst.expressions.Pmod\| \|math_funcs\|positive\|org.apache.spark.sql.catalyst.expressions.UnaryPositive\| \|math_funcs\|power\|org.apache.spark.sql.catalyst.expressions.Pow\| \|math_funcs\|pow\|org.apache.spark.sql.catalyst.expressions.Pow\| \|math_funcs\|radians\|org.apache.spark.sql.catalyst.expressions.ToRadians\| \|math_funcs\|randn\|org.apache.spark.sql.catalyst.expressions.Randn\| \|math_funcs\|random\|org.apache.spark.sql.catalyst.expressions.Rand\| \|math_funcs\|rand\|org.apache.spark.sql.catalyst.expressions.Rand\| \|math_funcs\|rint\|org.apache.spark.sql.catalyst.expressions.Rint\| \|math_funcs\|round\|org.apache.spark.sql.catalyst.expressions.Round\| \|math_funcs\|shiftleft\|org.apache.spark.sql.catalyst.expressions.ShiftLeft\| \|math_funcs\|signum\|org.apache.spark.sql.catalyst.expressions.Signum\| \|math_funcs\|sign\|org.apache.spark.sql.catalyst.expressions.Signum\| \|math_funcs\|sinh\|org.apache.spark.sql.catalyst.expressions.Sinh\| \|math_funcs\|sin\|org.apache.spark.sql.catalyst.expressions.Sin\| \|math_funcs\|sqrt\|org.apache.spark.sql.catalyst.expressions.Sqrt\| 
\|math_funcs\|tanh\|org.apache.spark.sql.catalyst.expressions.Tanh\| \|math_funcs\|tan\|org.apache.spark.sql.catalyst.expressions.Tan\| \|math_funcs\|unhex\|org.apache.spark.sql.catalyst.expressions.Unhex\| \|math_funcs\|width_bucket\|org.apache.spark.sql.catalyst.expressions.WidthBucket\| \|misc_funcs\|assert_true\|org.apache.spark.sql.catalyst.expressions.AssertTrue\| \|misc_funcs\|current_catalog\|org.apache.spark.sql.catalyst.expressions.CurrentCatalog\| \|misc_funcs\|current_database\|org.apache.spark.sql.catalyst.expressions.CurrentDatabase\| \|misc_funcs\|input_file_block_length\|org.apache.spark.sql.catalyst.expressions.InputFileBlockLength\| \|misc_funcs\|input_file_block_start\|org.apache.spark.sql.catalyst.expressions.InputFileBlockStart\| \|misc_funcs\|input_file_name\|org.apache.spark.sql.catalyst.expressions.InputFileName\| \|misc_funcs\|java_method\|org.apache.spark.sql.catalyst.expressions.CallMethodViaReflection\| \|misc_funcs\|monotonically_increasing_id\|org.apache.spark.sql.catalyst.expressions.MonotonicallyIncreasingID\| \|misc_funcs\|raise_error\|org.apache.spark.sql.catalyst.expressions.RaiseError\| \|misc_funcs\|reflect\|org.apache.spark.sql.catalyst.expressions.CallMethodViaReflection\| \|misc_funcs\|spark_partition_id\|org.apache.spark.sql.catalyst.expressions.SparkPartitionID\| \|misc_funcs\|typeof\|org.apache.spark.sql.catalyst.expressions.TypeOf\| \|misc_funcs\|uuid\|org.apache.spark.sql.catalyst.expressions.Uuid\| \|misc_funcs\|version\|org.apache.spark.sql.catalyst.expressions.SparkVersion\| \|predicate_funcs\|!\|org.apache.spark.sql.catalyst.expressions.Not\| \|predicate_funcs\|<=>\|org.apache.spark.sql.catalyst.expressions.EqualNullSafe\| \|predicate_funcs\|<=\|org.apache.spark.sql.catalyst.expressions.LessThanOrEqual\| \|predicate_funcs\|<\|org.apache.spark.sql.catalyst.expressions.LessThan\| \|predicate_funcs\|==\|org.apache.spark.sql.catalyst.expressions.EqualTo\| 
\|predicate_funcs\|=\|org.apache.spark.sql.catalyst.expressions.EqualTo\| \|predicate_funcs\|>=\|org.apache.spark.sql.catalyst.expressions.GreaterThanOrEqual\| \|predicate_funcs\|>\|org.apache.spark.sql.catalyst.expressions.GreaterThan\| \|predicate_funcs\|and\|org.apache.spark.sql.catalyst.expressions.And\| \|predicate_funcs\|in\|org.apache.spark.sql.catalyst.expressions.In\| \|predicate_funcs\|isnan\|org.apache.spark.sql.catalyst.expressions.IsNaN\| \|predicate_funcs\|isnotnull\|org.apache.spark.sql.catalyst.expressions.IsNotNull\| \|predicate_funcs\|isnull\|org.apache.spark.sql.catalyst.expressions.IsNull\| \|predicate_funcs\|like\|org.apache.spark.sql.catalyst.expressions.Like\| \|predicate_funcs\|not\|org.apache.spark.sql.catalyst.expressions.Not\| \|predicate_funcs\|or\|org.apache.spark.sql.catalyst.expressions.Or\| \|predicate_funcs\|regexp_like\|org.apache.spark.sql.catalyst.expressions.RLike\| \|predicate_funcs\|rlike\|org.apache.spark.sql.catalyst.expressions.RLike\| \|string_funcs\|ascii\|org.apache.spark.sql.catalyst.expressions.Ascii\| \|string_funcs\|base64\|org.apache.spark.sql.catalyst.expressions.Base64\| \|string_funcs\|bit_length\|org.apache.spark.sql.catalyst.expressions.BitLength\| \|string_funcs\|char_length\|org.apache.spark.sql.catalyst.expressions.Length\| \|string_funcs\|character_length\|org.apache.spark.sql.catalyst.expressions.Length\| \|string_funcs\|char\|org.apache.spark.sql.catalyst.expressions.Chr\| \|string_funcs\|chr\|org.apache.spark.sql.catalyst.expressions.Chr\| \|string_funcs\|concat_ws\|org.apache.spark.sql.catalyst.expressions.ConcatWs\| \|string_funcs\|decode\|org.apache.spark.sql.catalyst.expressions.Decode\| \|string_funcs\|elt\|org.apache.spark.sql.catalyst.expressions.Elt\| \|string_funcs\|encode\|org.apache.spark.sql.catalyst.expressions.Encode\| \|string_funcs\|find_in_set\|org.apache.spark.sql.catalyst.expressions.FindInSet\| \|string_funcs\|format_number\|org.apache.spark.sql.catalyst.expressions.FormatNumber\| 
\|string_funcs\|format_string\|org.apache.spark.sql.catalyst.expressions.FormatString\| \|string_funcs\|initcap\|org.apache.spark.sql.catalyst.expressions.InitCap\| \|string_funcs\|instr\|org.apache.spark.sql.catalyst.expressions.StringInstr\| \|string_funcs\|lcase\|org.apache.spark.sql.catalyst.expressions.Lower\| \|string_funcs\|left\|org.apache.spark.sql.catalyst.expressions.Left\| \|string_funcs\|length\|org.apache.spark.sql.catalyst.expressions.Length\| \|string_funcs\|levenshtein\|org.apache.spark.sql.catalyst.expressions.Levenshtein\| \|string_funcs\|locate\|org.apache.spark.sql.catalyst.expressions.StringLocate\| \|string_funcs\|lower\|org.apache.spark.sql.catalyst.expressions.Lower\| \|string_funcs\|lpad\|org.apache.spark.sql.catalyst.expressions.StringLPad\| \|string_funcs\|ltrim\|org.apache.spark.sql.catalyst.expressions.StringTrimLeft\| \|string_funcs\|octet_length\|org.apache.spark.sql.catalyst.expressions.OctetLength\| \|string_funcs\|overlay\|org.apache.spark.sql.catalyst.expressions.Overlay\| \|string_funcs\|parse_url\|org.apache.spark.sql.catalyst.expressions.ParseUrl\| \|string_funcs\|position\|org.apache.spark.sql.catalyst.expressions.StringLocate\| \|string_funcs\|printf\|org.apache.spark.sql.catalyst.expressions.FormatString\| \|string_funcs\|regexp_extract_all\|org.apache.spark.sql.catalyst.expressions.RegExpExtractAll\| \|string_funcs\|regexp_extract\|org.apache.spark.sql.catalyst.expressions.RegExpExtract\| \|string_funcs\|regexp_replace\|org.apache.spark.sql.catalyst.expressions.RegExpReplace\| \|string_funcs\|repeat\|org.apache.spark.sql.catalyst.expressions.StringRepeat\| \|string_funcs\|replace\|org.apache.spark.sql.catalyst.expressions.StringReplace\| \|string_funcs\|right\|org.apache.spark.sql.catalyst.expressions.Right\| \|string_funcs\|rpad\|org.apache.spark.sql.catalyst.expressions.StringRPad\| \|string_funcs\|rtrim\|org.apache.spark.sql.catalyst.expressions.StringTrimRight\| 
\|string_funcs\|sentences\|org.apache.spark.sql.catalyst.expressions.Sentences\| \|string_funcs\|soundex\|org.apache.spark.sql.catalyst.expressions.SoundEx\| \|string_funcs\|space\|org.apache.spark.sql.catalyst.expressions.StringSpace\| \|string_funcs\|split\|org.apache.spark.sql.catalyst.expressions.StringSplit\| \|string_funcs\|substring_index\|org.apache.spark.sql.catalyst.expressions.SubstringIndex\| \|string_funcs\|substring\|org.apache.spark.sql.catalyst.expressions.Substring\| \|string_funcs\|substr\|org.apache.spark.sql.catalyst.expressions.Substring\| \|string_funcs\|translate\|org.apache.spark.sql.catalyst.expressions.StringTranslate\| \|string_funcs\|trim\|org.apache.spark.sql.catalyst.expressions.StringTrim\| \|string_funcs\|ucase\|org.apache.spark.sql.catalyst.expressions.Upper\| \|string_funcs\|unbase64\|org.apache.spark.sql.catalyst.expressions.UnBase64\| \|string_funcs\|upper\|org.apache.spark.sql.catalyst.expressions.Upper\| \|struct_funcs\|named_struct\|org.apache.spark.sql.catalyst.expressions.CreateNamedStruct\| \|struct_funcs\|struct\|org.apache.spark.sql.catalyst.expressions.CreateNamedStruct\| \|window_funcs\|cume_dist\|org.apache.spark.sql.catalyst.expressions.CumeDist\| \|window_funcs\|dense_rank\|org.apache.spark.sql.catalyst.expressions.DenseRank\| \|window_funcs\|lag\|org.apache.spark.sql.catalyst.expressions.Lag\| \|window_funcs\|lead\|org.apache.spark.sql.catalyst.expressions.Lead\| \|window_funcs\|nth_value\|org.apache.spark.sql.catalyst.expressions.NthValue\| \|window_funcs\|ntile\|org.apache.spark.sql.catalyst.expressions.NTile\| \|window_funcs\|percent_rank\|org.apache.spark.sql.catalyst.expressions.PercentRank\| \|window_funcs\|rank\|org.apache.spark.sql.catalyst.expressions.Rank\| \|window_funcs\|row_number\|org.apache.spark.sql.catalyst.expressions.RowNumber\| \|xml_funcs\|xpath_boolean\|org.apache.spark.sql.catalyst.expressions.xml.XPathBoolean\| 
\|xml_funcs\|xpath_double\|org.apache.spark.sql.catalyst.expressions.xml.XPathDouble\| \|xml_funcs\|xpath_float\|org.apache.spark.sql.catalyst.expressions.xml.XPathFloat\| \|xml_funcs\|xpath_int\|org.apache.spark.sql.catalyst.expressions.xml.XPathInt\| \|xml_funcs\|xpath_long\|org.apache.spark.sql.catalyst.expressions.xml.XPathLong\| \|xml_funcs\|xpath_number\|org.apache.spark.sql.catalyst.expressions.xml.XPathDouble\| \|xml_funcs\|xpath_short\|org.apache.spark.sql.catalyst.expressions.xml.XPathShort\| \|xml_funcs\|xpath_string\|org.apache.spark.sql.catalyst.expressions.xml.XPathString\| \|xml_funcs\|xpath\|org.apache.spark.sql.catalyst.expressions.xml.XPathList\| Closes #30040 NOTE: An original author of this PR is tanelk, so the credit should be given to tanelk. ### Why are the changes needed? For better documents. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Add a test to check if exprs have a group tag in `ExpressionInfoSuite`. Closes #30867 from maropu/pr30040. Lead-authored-by: Takeshi Yamamuro <yamamuro@apache.org> Co-authored-by: tanel.kiis@gmail.com <tanel.kiis@gmail.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com>	2020-12-21 04:24:04 -08:00
Yuming Wang	4b19f49dd0	[SPARK-33845][SQL] Remove unnecessary if when trueValue and falseValue are foldable boolean types ### What changes were proposed in this pull request? Improve `SimplifyConditionals`. Simplify `If(cond, TrueLiteral, FalseLiteral)` to `cond`. Simplify `If(cond, FalseLiteral, TrueLiteral)` to `Not(cond)`. The use case is: ```sql create table t1 using parquet as select id from range(10); select if (id > 2, false, true) from t1; ``` Before this pr: ``` == Physical Plan == (1) Project [if ((id#1L > 2)) false else true AS (IF((id > CAST(2 AS BIGINT)), false, true))#2] +- (1) ColumnarToRow +- FileScan parquet default.t1[id#1L] Batched: true, DataFilters: [], Format: Parquet, Location: InMemoryFileIndex[file:/Users/yumwang/opensource/spark/spark-warehouse/org.apache.spark.sql.DataF..., PartitionFilters: [], PushedFilters: [], ReadSchema: struct<id:bigint> ``` After this pr: ``` == Physical Plan == (1) Project [(id#1L <= 2) AS (IF((id > CAST(2 AS BIGINT)), false, true))#2] +- (1) ColumnarToRow +- FileScan parquet default.t1[id#1L] Batched: true, DataFilters: [], Format: Parquet, Location: InMemoryFileIndex[file:/Users/yumwang/opensource/spark/spark-warehouse/org.apache.spark.sql.DataF..., PartitionFilters: [], PushedFilters: [], ReadSchema: struct<id:bigint> ``` ### Why are the changes needed? Improve query performance. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Unit test. Closes #30849 from wangyum/SPARK-33798-2. Authored-by: Yuming Wang <yumwang@ebay.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com>	2020-12-21 04:15:29 -08:00
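The two rewrites described in SPARK-33845 above can be sketched on a toy expression tree. This is only an illustration: `SimplifySketch` and its case classes are hypothetical stand-ins rather than Catalyst's own classes, and the sketch ignores SQL NULL handling, which a real optimizer rule has to consider.

```scala
object SimplifySketch {
  // Toy expression tree; illustrative names, not Catalyst's classes.
  sealed trait Expr
  case class BoolLit(value: Boolean) extends Expr
  case class Col(name: String) extends Expr
  case class GreaterThan(left: Expr, right: Int) extends Expr
  case class Not(child: Expr) extends Expr
  case class If(cond: Expr, trueValue: Expr, falseValue: Expr) extends Expr

  // If(cond, true, false) becomes cond; If(cond, false, true) becomes Not(cond).
  // Every other If is kept, with its children simplified recursively.
  def simplify(e: Expr): Expr = e match {
    case If(cond, BoolLit(true), BoolLit(false)) => simplify(cond)
    case If(cond, BoolLit(false), BoolLit(true)) => Not(simplify(cond))
    case If(c, t, f) => If(simplify(c), simplify(t), simplify(f))
    case Not(c) => Not(simplify(c))
    case other => other
  }
}
```

For the query in the commit message, `if (id > 2, false, true)` corresponds to `If(GreaterThan(Col("id"), 2), BoolLit(false), BoolLit(true))`, which the sketch rewrites to `Not(GreaterThan(Col("id"), 2))`.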
Wenchen Fan	b4bea1aa89	[SPARK-28863][SQL][FOLLOWUP] Make sure optimized plan will not be re-analyzed ### What changes were proposed in this pull request? It's a known issue that re-analyzing an optimized plan can lead to various issues. We made several attempts to prevent it from happening, but the current solution `AlreadyOptimized` is still not 100% safe, as people can inject catalyst rules that call the analyzer directly. This PR proposes a simpler and safer idea: we set the `analyzed` flag to true after optimization, and the analyzer will skip processing plans whose `analyzed` flag is true. ### Why are the changes needed? To make the code simpler and safer. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Existing tests. Closes #30777 from cloud-fan/ds. Authored-by: Wenchen Fan <wenchen@databricks.com> Signed-off-by: HyukjinKwon <gurwls223@apache.org>	2020-12-21 20:59:33 +09:00
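The flag-based idea in SPARK-28863 above can be sketched as follows. Everything here (`AnalyzedFlagSketch`, `Plan`, the run counter) is a hypothetical model used only to show the control flow, not Spark's actual `LogicalPlan` or analyzer API.

```scala
object AnalyzedFlagSketch {
  // Hypothetical plan node carrying a mutable `analyzed` flag.
  final class Plan(val name: String) { var analyzed: Boolean = false }

  // Counts how many times the (toy) analyzer actually did work.
  var analyzerRuns: Int = 0

  // The analyzer skips any plan whose flag is already set.
  def analyze(plan: Plan): Plan = {
    if (!plan.analyzed) {
      analyzerRuns += 1
      plan.analyzed = true
    }
    plan
  }

  // The optimizer marks its output as analyzed, so a later analyze() is a no-op.
  def optimize(plan: Plan): Plan = {
    val out = new Plan(plan.name + "_optimized")
    out.analyzed = true
    out
  }
}
```

In this model, a rule that calls `analyze` on an optimized plan gets the plan back unchanged, which is the safety property the commit message describes.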
Max Gekk	cdd1752ad1	[SPARK-33862][SQL] Throw `PartitionAlreadyExistsException` if the target partition exists while renaming ### What changes were proposed in this pull request? Throw `PartitionAlreadyExistsException` from `ALTER TABLE .. RENAME TO PARTITION` for a table from Hive V1 External Catalog in the case when the target partition already exists. ### Why are the changes needed? 1. To have the same behavior of V1 In-Memory and Hive External Catalog. 2. To not propagate internal Hive's exceptions to users. ### Does this PR introduce _any_ user-facing change? Yes. After the changes, the partition renaming command throws `PartitionAlreadyExistsException` for tables from the Hive catalog. ### How was this patch tested? Added new UT: ``` $ build/sbt -Phive-2.3 -Phive-thriftserver "test:testOnly *HiveCatalogedDDLSuite" ``` Closes #30866 from MaxGekk/throw-PartitionAlreadyExistsException. Authored-by: Max Gekk <max.gekk@gmail.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com>	2020-12-21 03:37:30 -08:00
Kousuke Saruta	f4e1069bb8	[SPARK-33853][SQL] EXPLAIN CODEGEN and BenchmarkQueryTest don't show subquery code ### What changes were proposed in this pull request? This PR fixes an issue that `EXPLAIN CODEGEN` and `BenchmarkQueryTest` don't show the corresponding code for subqueries. The following example is about `EXPLAIN CODEGEN`.
```
spark.conf.set("spark.sql.adaptive.enabled", "false")
val df = spark.range(1, 100)
df.createTempView("df")
spark.sql("SELECT (SELECT min(id) AS v FROM df)").explain("CODEGEN")

scala> spark.sql("SELECT (SELECT min(id) AS v FROM df)").explain("CODEGEN")
Found 1 WholeStageCodegen subtrees.
== Subtree 1 / 1 (maxMethodCodeSize:55; maxConstantPoolSize:97(0.15% used); numInnerClasses:0) ==
(1) Project [Subquery scalar-subquery#3, [id=#24] AS scalarsubquery()#5L]
:  +- Subquery scalar-subquery#3, [id=#24]
:     +- (2) HashAggregate(keys=[], functions=[min(id#0L)], output=[v#2L])
:        +- Exchange SinglePartition, ENSURE_REQUIREMENTS, [id=#20]
:           +- (1) HashAggregate(keys=[], functions=[partial_min(id#0L)], output=[min#8L])
:              +- (1) Range (1, 100, step=1, splits=12)
+- (1) Scan OneRowRelation[]

Generated code:
public Object generate(Object[] references) {
  return new GeneratedIteratorForCodegenStage1(references);
}

// codegenStageId=1
final class GeneratedIteratorForCodegenStage1 extends org.apache.spark.sql.execution.BufferedRowIterator {
  private Object[] references;
  private scala.collection.Iterator[] inputs;
  private scala.collection.Iterator rdd_input_0;
  private org.apache.spark.sql.catalyst.expressions.codegen.UnsafeRowWriter[] project_mutableStateArray_0 = new org.apache.spark.sql.catalyst.expressions.codegen.UnsafeRowWriter[1];

  public GeneratedIteratorForCodegenStage1(Object[] references) {
    this.references = references;
  }

  public void init(int index, scala.collection.Iterator[] inputs) {
    partitionIndex = index;
    this.inputs = inputs;
    rdd_input_0 = inputs[0];
    project_mutableStateArray_0[0] = new org.apache.spark.sql.catalyst.expressions.codegen.UnsafeRowWriter(1, 0);
  }

  private void project_doConsume_0() throws java.io.IOException {
    // common sub-expressions

    project_mutableStateArray_0[0].reset();

    if (false) {
      project_mutableStateArray_0[0].setNullAt(0);
    } else {
      project_mutableStateArray_0[0].write(0, 1L);
    }
    append((project_mutableStateArray_0[0].getRow()));
  }

  protected void processNext() throws java.io.IOException {
    while ( rdd_input_0.hasNext()) {
      InternalRow rdd_row_0 = (InternalRow) rdd_input_0.next();
      ((org.apache.spark.sql.execution.metric.SQLMetric) references[0] /* numOutputRows */).add(1);
      project_doConsume_0();
      if (shouldStop()) return;
    }
  }

}
```
After this change, the corresponding code for subqueries is shown.
```
Found 3 WholeStageCodegen subtrees.
== Subtree 1 / 3 (maxMethodCodeSize:282; maxConstantPoolSize:206(0.31% used); numInnerClasses:0) ==
(1) HashAggregate(keys=[], functions=[partial_min(id#0L)], output=[min#8L])
+- (1) Range (1, 100, step=1, splits=12)

Generated code:
public Object generate(Object[] references) {
  return new GeneratedIteratorForCodegenStage1(references);
}

// codegenStageId=1
final class GeneratedIteratorForCodegenStage1 extends org.apache.spark.sql.execution.BufferedRowIterator {
  private Object[] references;
  private scala.collection.Iterator[] inputs;
  private boolean agg_initAgg_0;
  private boolean agg_bufIsNull_0;
  private long agg_bufValue_0;
  private boolean range_initRange_0;
  private long range_nextIndex_0;
  private TaskContext range_taskContext_0;
  private InputMetrics range_inputMetrics_0;
  private long range_batchEnd_0;
  private long range_numElementsTodo_0;
  private boolean agg_agg_isNull_2_0;
  private org.apache.spark.sql.catalyst.expressions.codegen.UnsafeRowWriter[] range_mutableStateArray_0 = new org.apache.spark.sql.catalyst.expressions.codegen.UnsafeRowWriter[3];

  public GeneratedIteratorForCodegenStage1(Object[] references) {
    this.references = references;
  }

  public void init(int index, scala.collection.Iterator[] inputs) {
    partitionIndex = index;
    this.inputs = inputs;

    range_taskContext_0 = TaskContext.get();
    range_inputMetrics_0 = range_taskContext_0.taskMetrics().inputMetrics();
    range_mutableStateArray_0[0] = new org.apache.spark.sql.catalyst.expressions.codegen.UnsafeRowWriter(1, 0);
    range_mutableStateArray_0[1] = new org.apache.spark.sql.catalyst.expressions.codegen.UnsafeRowWriter(1, 0);
    range_mutableStateArray_0[2] = new org.apache.spark.sql.catalyst.expressions.codegen.UnsafeRowWriter(1, 0);
  }

  private void agg_doAggregateWithoutKey_0() throws java.io.IOException {
    // initialize aggregation buffer
    agg_bufIsNull_0 = true;
    agg_bufValue_0 = -1L;

    // initialize Range
    if (!range_initRange_0) {
      range_initRange_0 = true;
      initRange(partitionIndex);
    }

    while (true) {
      if (range_nextIndex_0 == range_batchEnd_0) {
        long range_nextBatchTodo_0;
        if (range_numElementsTodo_0 > 1000L) {
          range_nextBatchTodo_0 = 1000L;
          range_numElementsTodo_0 -= 1000L;
        } else {
          range_nextBatchTodo_0 = range_numElementsTodo_0;
          range_numElementsTodo_0 = 0;
          if (range_nextBatchTodo_0 == 0) break;
        }
        range_batchEnd_0 += range_nextBatchTodo_0 * 1L;
      }

      int range_localEnd_0 = (int)((range_batchEnd_0 - range_nextIndex_0) / 1L);
      for (int range_localIdx_0 = 0; range_localIdx_0 < range_localEnd_0; range_localIdx_0++) {
        long range_value_0 = ((long)range_localIdx_0 * 1L) + range_nextIndex_0;

        agg_doConsume_0(range_value_0);

        // shouldStop check is eliminated
      }
      range_nextIndex_0 = range_batchEnd_0;
      ((org.apache.spark.sql.execution.metric.SQLMetric) references[0] /* numOutputRows */).add(range_localEnd_0);
      range_inputMetrics_0.incRecordsRead(range_localEnd_0);
      range_taskContext_0.killTaskIfInterrupted();
    }
  }

  private void initRange(int idx) {
    java.math.BigInteger index = java.math.BigInteger.valueOf(idx);
    java.math.BigInteger numSlice = java.math.BigInteger.valueOf(12L);
    java.math.BigInteger numElement = java.math.BigInteger.valueOf(99L);
    java.math.BigInteger step = java.math.BigInteger.valueOf(1L);
    java.math.BigInteger start = java.math.BigInteger.valueOf(1L);
    long partitionEnd;

    java.math.BigInteger st = index.multiply(numElement).divide(numSlice).multiply(step).add(start);
    if (st.compareTo(java.math.BigInteger.valueOf(Long.MAX_VALUE)) > 0) {
      range_nextIndex_0 = Long.MAX_VALUE;
    } else if (st.compareTo(java.math.BigInteger.valueOf(Long.MIN_VALUE)) < 0) {
      range_nextIndex_0 = Long.MIN_VALUE;
    } else {
      range_nextIndex_0 = st.longValue();
    }
    range_batchEnd_0 = range_nextIndex_0;

    java.math.BigInteger end = index.add(java.math.BigInteger.ONE).multiply(numElement).divide(numSlice)
      .multiply(step).add(start);
    if (end.compareTo(java.math.BigInteger.valueOf(Long.MAX_VALUE)) > 0) {
      partitionEnd = Long.MAX_VALUE;
    } else if (end.compareTo(java.math.BigInteger.valueOf(Long.MIN_VALUE)) < 0) {
      partitionEnd = Long.MIN_VALUE;
    } else {
      partitionEnd = end.longValue();
    }

    java.math.BigInteger startToEnd = java.math.BigInteger.valueOf(partitionEnd).subtract(
      java.math.BigInteger.valueOf(range_nextIndex_0));
    range_numElementsTodo_0 = startToEnd.divide(step).longValue();
    if (range_numElementsTodo_0 < 0) {
      range_numElementsTodo_0 = 0;
    } else if (startToEnd.remainder(step).compareTo(java.math.BigInteger.valueOf(0L)) != 0) {
      range_numElementsTodo_0++;
    }
  }

  private void agg_doConsume_0(long agg_expr_0_0) throws java.io.IOException {
    // do aggregate
    // common sub-expressions

    // evaluate aggregate functions and update aggregation buffers

    agg_agg_isNull_2_0 = true;
    long agg_value_2 = -1L;

    if (!agg_bufIsNull_0 && (agg_agg_isNull_2_0 ||
      agg_value_2 > agg_bufValue_0)) {
      agg_agg_isNull_2_0 = false;
      agg_value_2 = agg_bufValue_0;
    }

    if (!false && (agg_agg_isNull_2_0 ||
      agg_value_2 > agg_expr_0_0)) {
      agg_agg_isNull_2_0 = false;
      agg_value_2 = agg_expr_0_0;
    }

    agg_bufIsNull_0 = agg_agg_isNull_2_0;
    agg_bufValue_0 = agg_value_2;
  }

  protected void processNext() throws java.io.IOException {
    while (!agg_initAgg_0) {
      agg_initAgg_0 = true;
      long agg_beforeAgg_0 = System.nanoTime();
      agg_doAggregateWithoutKey_0();
      ((org.apache.spark.sql.execution.metric.SQLMetric) references[2] /* aggTime */).add((System.nanoTime() - agg_beforeAgg_0) / 1000000);

      // output the result

      ((org.apache.spark.sql.execution.metric.SQLMetric) references[1] /* numOutputRows */).add(1);
      range_mutableStateArray_0[2].reset();

      range_mutableStateArray_0[2].zeroOutNullBytes();

      if (agg_bufIsNull_0) {
        range_mutableStateArray_0[2].setNullAt(0);
      } else {
        range_mutableStateArray_0[2].write(0, agg_bufValue_0);
      }
      append((range_mutableStateArray_0[2].getRow()));
    }
  }

}
```
### Why are the changes needed? For better debuggability. ### Does this PR introduce _any_ user-facing change? Yes. After this change, users can see subquery code by `EXPLAIN CODEGEN`. ### How was this patch tested? New test. Closes #30859 from sarutak/explain-codegen-subqueries. Authored-by: Kousuke Saruta <sarutak@oss.nttdata.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com>	2020-12-21 03:29:00 -08:00
Max Gekk	b313a1e9e6	[SPARK-33849][SQL][TESTS] Unify v1 and v2 DROP TABLE tests ### What changes were proposed in this pull request? 1. Move the `DROP TABLE` parsing tests to `DropTableParserSuite` 2. Place the v1 tests for `DROP TABLE` from `DDLSuite` and v2 tests from `DataSourceV2SQLSuite` in the common trait `DropTableSuiteBase`, so that the tests run for V1, Hive V1 and V2 DS. ### Why are the changes needed? - The unification allows running common `DROP TABLE` tests for DSv1, Hive DSv1 and DSv2 - We can detect missing features and differences between DSv1 and DSv2 implementations. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? By running new test suites: ``` $ build/sbt -Phive-2.3 -Phive-thriftserver "test:testOnly DropTableParserSuite" $ build/sbt -Phive-2.3 -Phive-thriftserver "test:testOnly DropTableSuite" ``` Closes #30854 from MaxGekk/unify-drop-table-tests. Authored-by: Max Gekk <max.gekk@gmail.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2020-12-21 08:34:12 +00:00
Terry Kim	1c7b79c057	[SPARK-33856][SQL] Migrate ALTER TABLE ... RENAME TO PARTITION to use UnresolvedTable to resolve the identifier ### What changes were proposed in this pull request? This PR proposes to migrate `ALTER TABLE ... RENAME TO PARTITION` to use `UnresolvedTable` to resolve the table identifier. This allows consistent resolution rules (temp view first, etc.) to be applied for both v1/v2 commands. More info about the consistent resolution rule proposal can be found in [JIRA](https://issues.apache.org/jira/browse/SPARK-29900) or [proposal doc](https://docs.google.com/document/d/1hvLjGA8y_W_hhilpngXVub1Ebv8RsMap986nENCFnrg/edit?usp=sharing). Note that `ALTER TABLE ... RENAME TO PARTITION` is not supported for v2 tables. ### Why are the changes needed? This PR makes the resolution behavior consistent. For example, before this PR: ``` sql("CREATE DATABASE test") sql("CREATE TABLE spark_catalog.test.t (id bigint, val string) USING csv PARTITIONED BY (id)") sql("CREATE TEMPORARY VIEW t AS SELECT 2") sql("USE spark_catalog.test") sql("ALTER TABLE t PARTITION (id=1) RENAME TO PARTITION (id=2)") // works fine assuming id=1 exists. ``` But after this PR: ``` sql("ALTER TABLE t PARTITION (id=1) RENAME TO PARTITION (id=2)") org.apache.spark.sql.AnalysisException: t is a temp view. 'ALTER TABLE ... RENAME TO PARTITION' expects a table; line 1 pos 0 ``` which is consistent with the behavior of other commands. ### Does this PR introduce _any_ user-facing change? After this PR, `ALTER TABLE` in the above example is resolved to a temp view `t` first instead of `spark_catalog.test.t`. ### How was this patch tested? Updated existing tests. Closes #30862 from imback82/alter_table_rename_partition_v2. Authored-by: Terry Kim <yuminkim@gmail.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2020-12-21 04:58:56 +00:00
Kousuke Saruta	3c8be3983c	[SPARK-33850][SQL][FOLLOWUP] Improve and cleanup the test code ### What changes were proposed in this pull request? This PR mainly improves and cleans up the test code introduced in #30855, based on the review comment. The test code was taken from another test, `explain formatted - check presence of subquery in case of DPP`, so this PR cleans up that code too (it removes an unnecessary `withTable`). ### Why are the changes needed? To keep the test code clean. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? `ExplainSuite` passes. Closes #30861 from sarutak/followup-SPARK-33850. Authored-by: Kousuke Saruta <sarutak@oss.nttdata.com> Signed-off-by: Takeshi Yamamuro <yamamuro@apache.org>	2020-12-21 09:40:42 +09:00
Terry Kim	df2314b63a	[SPARK-33852][SQL][TESTS] Use assertAnalysisError in HiveDDLSuite.scala ### What changes were proposed in this pull request? `HiveDDLSuite` has many instances of the following pattern: ```scala val e = intercept[AnalysisException] { sql(sqlString) } assert(e.message.contains(exceptionMessage)) ``` However, there already exists an `assertAnalysisError` helper function that does exactly the same thing. ### Why are the changes needed? To simplify the code through refactoring. ### Does this PR introduce _any_ user-facing change? No, just refactoring the test code. ### How was this patch tested? Existing tests Closes #30857 from imback82/hive_ddl_suite_use_assertAnalysisError. Authored-by: Terry Kim <yuminkim@gmail.com> Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>	2020-12-19 14:37:15 -08:00
Kousuke Saruta	70da86a085	[SPARK-33850][SQL] EXPLAIN FORMATTED doesn't show the plan for subqueries if AQE is enabled ### What changes were proposed in this pull request? This PR fixes an issue where, when AQE is enabled, `EXPLAIN FORMATTED` doesn't show the plan for subqueries. ```scala val df = spark.range(1, 100) df.createTempView("df") spark.sql("SELECT (SELECT min(id) AS v FROM df)").explain("FORMATTED") == Physical Plan == AdaptiveSparkPlan (3) +- Project (2) +- Scan OneRowRelation (1) (1) Scan OneRowRelation Output: [] Arguments: ParallelCollectionRDD[0] at explain at <console>:24, OneRowRelation, UnknownPartitioning(0) (2) Project Output [1]: [Subquery subquery#3, [id=#20] AS scalarsubquery()#5L] Input: [] (3) AdaptiveSparkPlan Output [1]: [scalarsubquery()#5L] Arguments: isFinalPlan=false ``` After this change, the plan for the subquery is shown. ```scala == Physical Plan == * Project (2) +- * Scan OneRowRelation (1) (1) Scan OneRowRelation [codegen id : 1] Output: [] Arguments: ParallelCollectionRDD[0] at explain at <console>:24, OneRowRelation, UnknownPartitioning(0) (2) Project [codegen id : 1] Output [1]: [Subquery scalar-subquery#3, [id=#24] AS scalarsubquery()#5L] Input: [] ===== Subqueries ===== Subquery:1 Hosting operator id = 2 Hosting Expression = Subquery scalar-subquery#3, [id=#24] * HashAggregate (6) +- Exchange (5) +- * HashAggregate (4) +- * Range (3) (3) Range [codegen id : 1] Output [1]: [id#0L] Arguments: Range (1, 100, step=1, splits=Some(12)) (4) HashAggregate [codegen id : 1] Input [1]: [id#0L] Keys: [] Functions [1]: [partial_min(id#0L)] Aggregate Attributes [1]: [min#7L] Results [1]: [min#8L] (5) Exchange Input [1]: [min#8L] Arguments: SinglePartition, ENSURE_REQUIREMENTS, [id=#20] (6) HashAggregate [codegen id : 2] Input [1]: [min#8L] Keys: [] Functions [1]: [min(id#0L)] Aggregate Attributes [1]: [min(id#0L)#4L] Results [1]: [min(id#0L)#4L AS v#2L] ``` ### Why are the changes needed? For better debuggability. 
### Does this PR introduce _any_ user-facing change? Yes. Users can see the formatted plan for subqueries. ### How was this patch tested? New test. Closes #30855 from sarutak/fix-aqe-explain. Authored-by: Kousuke Saruta <sarutak@oss.nttdata.com> Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>	2020-12-19 14:10:20 -08:00
Ammar Al-Batool	37c4cd8f05	[MINOR][DOCS] Fix typos in ScalaDocs for DataStreamWriter#foreachBatch The title is pretty self-explanatory. ### What changes were proposed in this pull request? Fixing typos in the docs for `foreachBatch` functions. ### Why are the changes needed? To fix typos in JavaDoc/ScalaDoc. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Yes. Closes #30782 from ammar1x/patch-1. Lead-authored-by: Ammar Al-Batool <ammar.albatool@gmail.com> Co-authored-by: Ammar Al-Batool <ammar.al-batool@disneystreaming.com> Signed-off-by: Sean Owen <srowen@gmail.com>	2020-12-19 14:53:40 -06:00
Terry Kim	06075d849e	[SPARK-33829][SQL] Renaming v2 tables should recreate the cache ### What changes were proposed in this pull request? Currently, renaming v2 tables does not invalidate/recreate the cache, leading to incorrect behavior (the cache not being used) when v2 tables are renamed. This PR fixes the behavior. ### Why are the changes needed? To fix a bug: the cache associated with the renamed table is not being cleaned up/recreated. ### Does this PR introduce _any_ user-facing change? Yes, now when a v2 table is renamed, the cache is correctly updated. ### How was this patch tested? Added a new test Closes #30825 from imback82/rename_recreate_cache_v2. Authored-by: Terry Kim <yuminkim@gmail.com> Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>	2020-12-19 08:32:58 -08:00
Kent Yao	dd44ba5460	[SPARK-32976][SQL][FOLLOWUP] SET and RESTORE hive.exec.dynamic.partition.mode for HiveSQLInsertTestSuite to avoid flakiness ### What changes were proposed in this pull request? As https://github.com/apache/spark/pull/29893#discussion_r545303780 mentioned: > We need to set spark.conf.set("hive.exec.dynamic.partition.mode", "nonstrict") before executing this suite; otherwise, test("insert with column list - follow table output order + partitioned table") will fail. The reason why it does not fail because some test cases [running before this suite] do not change the default value of hive.exec.dynamic.partition.mode back to strict. However, the order of test suite execution is not deterministic. ### Why are the changes needed? avoid flakiness in tests ### Does this PR introduce _any_ user-facing change? no ### How was this patch tested? existing tests Closes #30843 from yaooqinn/SPARK-32976-F. Authored-by: Kent Yao <yaooqinn@hotmail.com> Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>	2020-12-19 08:00:09 -08:00
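The set-and-restore pattern that SPARK-32976 above relies on can be sketched with a plain mutable map standing in for the session configuration. `ConfRestoreSketch` and `withConf` are hypothetical names for this illustration; they are not the helpers the test suite actually uses.

```scala
import scala.collection.mutable

object ConfRestoreSketch {
  // Sets `key` to `value` while `body` runs, then restores the previous state,
  // even if `body` throws. A key that was unset before is removed again.
  def withConf[T](conf: mutable.Map[String, String], key: String, value: String)(body: => T): T = {
    val previous = conf.get(key)
    conf.update(key, value)
    try body
    finally {
      previous match {
        case Some(old) => conf.update(key, old)
        case None => conf.remove(key)
      }
    }
  }
}
```

Because the previous value is restored in `finally`, a suite that needs `nonstrict` mode no longer depends on which suite happened to run before it, and it does not leak the setting to suites that run after it.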
Wenchen Fan	de234eec8f	[SPARK-33812][SQL] Split the histogram column stats when saving to hive metastore as table property ### What changes were proposed in this pull request? The Hive metastore limits the length of a table property. To work around this, Spark splits the schema JSON string into several parts when saving it to the Hive metastore as table properties. We need to do the same for histogram column stats, as they can grow very large. This PR refactors the table property splitting code so that we can share it between the schema JSON string and histogram column stats. ### Why are the changes needed? To be able to analyze a table when the histogram data is large. ### Does this PR introduce _any_ user-facing change? no ### How was this patch tested? existing test and new tests Closes #30809 from cloud-fan/cbo. Authored-by: Wenchen Fan <wenchen@databricks.com> Signed-off-by: HyukjinKwon <gurwls223@apache.org>	2020-12-19 14:35:28 +09:00
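The splitting approach described in SPARK-33812 above can be sketched as follows. The key layout (`<key>.numParts`, `<key>.part.<i>`), the object name and the threshold parameter are assumptions made for this illustration and may differ from what Spark actually writes to the metastore.

```scala
object PropSplitSketch {
  // Splits `value` into chunks of at most `threshold` characters and stores
  // them under hypothetical keys, plus one key recording the chunk count.
  def split(key: String, value: String, threshold: Int): Map[String, String] = {
    require(threshold > 0, "threshold must be positive")
    val parts = value.grouped(threshold).toSeq
    val indexed = parts.zipWithIndex.map { case (p, i) => s"$key.part.$i" -> p }
    (indexed :+ (s"$key.numParts" -> parts.size.toString)).toMap
  }

  // Reassembles the original value, or returns None if the count key is absent.
  def read(key: String, props: Map[String, String]): Option[String] =
    props.get(s"$key.numParts").map { n =>
      (0 until n.toInt).map(i => props(s"$key.part.$i")).mkString
    }
}
```

Keeping the split and read logic in one place is what lets the same code serve both the schema string and the histogram stats, as the commit message describes.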
Kent Yao	c17c76dd16	[SPARK-33599][SQL][FOLLOWUP] FIX Github Action with unidoc ### What changes were proposed in this pull request? FIX Github Action with unidoc ### Why are the changes needed? FIX Github Action with unidoc ### Does this PR introduce _any_ user-facing change? no ### How was this patch tested? Pass GA Closes #30846 from yaooqinn/SPARK-33599. Authored-by: Kent Yao <yaooqinn@hotmail.com> Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>	2020-12-18 11:23:38 -08:00
gengjiaan	6dca2e5d35	[SPARK-33599][SQL] Group exception messages in catalyst/analysis ### What changes were proposed in this pull request? This PR groups exception messages in `/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis`. ### Why are the changes needed? It will greatly help with the standardization of error messages and their maintenance. ### Does this PR introduce _any_ user-facing change? No. Error messages remain unchanged. ### How was this patch tested? No new tests - pass all original tests to make sure it doesn't break any existing behavior. Closes #30717 from beliefer/SPARK-33599. Lead-authored-by: gengjiaan <gengjiaan@360.cn> Co-authored-by: beliefer <beliefer@163.com> Co-authored-by: Jiaan Geng <beliefer@163.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2020-12-18 14:12:35 +00:00
gengjiaan	f239128802	[SPARK-33597][SQL] Support REGEXP_LIKE for consistent with mainstream databases ### What changes were proposed in this pull request? Many mainstream databases support the regex function `REGEXP_LIKE`. Currently, Spark supports `RLike`, so we just need to add a new alias `REGEXP_LIKE` for it. Oracle https://docs.oracle.com/en/database/oracle/oracle-database/19/sqlrf/Pattern-matching-Conditions.html#GUID-D2124F3A-C6E4-4CCA-A40E-2FFCABFD8E19 Presto https://prestodb.io/docs/current/functions/regexp.html Vertica https://www.vertica.com/docs/9.2.x/HTML/Content/Authoring/SQLReferenceManual/Functions/RegularExpressions/REGEXP_LIKE.htm?tocpath=SQL%20Reference%20Manual%7CSQL%20Functions%7CRegular%20Expression%20Functions%7C_____5 Snowflake https://docs.snowflake.com/en/sql-reference/functions/regexp_like.html Additional modifications 1. Because the test case named `check outputs of expression examples` in ExpressionInfoSuite executes the example SQL of built-in functions, the SQL below would be executed: `SELECT '%SystemDrive%\Users\John' regexp_like '%SystemDrive%\\Users.'` But Spark SQL does not support this syntax yet. 2. Another reason: `SELECT '%SystemDrive%\Users\John' _FUNC_ '%SystemDrive%\\Users.';` is SQL syntax, not a use case for the function `RLike`. For the above reasons, this PR changes the example SQL of `RLike`. ### Why are the changes needed? No ### Does this PR introduce _any_ user-facing change? Make the behavior of Spark SQL consistent with mainstream databases. ### How was this patch tested? Jenkins test Closes #30543 from beliefer/SPARK-33597. Lead-authored-by: gengjiaan <gengjiaan@360.cn> Co-authored-by: beliefer <beliefer@163.com> Co-authored-by: Jiaan Geng <beliefer@163.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2020-12-18 13:47:31 +00:00
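Registering `REGEXP_LIKE` as an alias, as in SPARK-33597 above, amounts to mapping two names to one implementation. The sketch below models that with a plain map; `RegexpAliasSketch` and `registry` are hypothetical, and the use of `Matcher.find()` (a match anywhere in the string) is an assumption about the matching semantics rather than a statement of Spark's implementation.

```scala
import java.util.regex.Pattern

object RegexpAliasSketch {
  // One implementation: true if `regex` matches anywhere inside `str`.
  val rlike: (String, String) => Boolean =
    (str, regex) => Pattern.compile(regex).matcher(str).find()

  // Two SQL-visible names that share the same function value.
  val registry: Map[String, (String, String) => Boolean] =
    Map("rlike" -> rlike, "regexp_like" -> rlike)
}
```

Since both names resolve to the same function value, the alias cannot drift from the original behavior.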
Yuming Wang	06b1bbbbab	[SPARK-33798][SQL] Add new rule to push down the foldable expressions through CaseWhen/If ### What changes were proposed in this pull request? This PR adds a new rule (`PushFoldableIntoBranches`) to push down the foldable expressions through `CaseWhen/If`. This is a real case from production: ```sql create table t1 using parquet as select * from range(100); create table t2 using parquet as select * from range(200); create temp view v1 as select 'a' as event_type, * from t1 union all select CASE WHEN id = 1 THEN 'b' WHEN id = 3 THEN 'c' end as event_type, * from t2; explain select * from v1 where event_type = 'a'; ``` Before this PR: ``` == Physical Plan == Union :- (1) Project [a AS event_type#30533, id#30535L] : +- (1) ColumnarToRow : +- FileScan parquet default.t1[id#30535L] Batched: true, DataFilters: [], Format: Parquet +- (2) Project [CASE WHEN (id#30536L = 1) THEN b WHEN (id#30536L = 3) THEN c END AS event_type#30534, id#30536L] +- (2) Filter (CASE WHEN (id#30536L = 1) THEN b WHEN (id#30536L = 3) THEN c END = a) +- (2) ColumnarToRow +- FileScan parquet default.t2[id#30536L] Batched: true, DataFilters: [(CASE WHEN (id#30536L = 1) THEN b WHEN (id#30536L = 3) THEN c END = a)], Format: Parquet ``` After this PR: ``` == Physical Plan == *(1) Project [a AS event_type#8, id#4L] +- *(1) ColumnarToRow +- FileScan parquet default.t1[id#4L] Batched: true, DataFilters: [], Format: Parquet ``` ### Why are the changes needed? Improve query performance. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Unit test. Closes #30790 from wangyum/SPARK-33798. Authored-by: Yuming Wang <yumwang@ebay.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2020-12-18 13:20:58 +00:00
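Why the `t2` branch disappears in the SPARK-33798 plan above can be seen in a toy model of the rewrite. The classes below are hypothetical stand-ins for Catalyst expressions, `None` models SQL NULL, and the sketch covers only an equality against a literal when every branch value is itself a literal.

```scala
object PushFoldableSketch {
  sealed trait Expr
  case class Lit(value: Option[String]) extends Expr        // None models SQL NULL
  case class BoolLit(value: Option[Boolean]) extends Expr
  case class Cond(description: String) extends Expr          // opaque branch condition
  case class CaseWhen(branches: Seq[(Expr, Expr)], elseValue: Expr) extends Expr
  case class EqualTo(left: Expr, right: Expr) extends Expr

  // Constant-folds literal = literal; any comparison with NULL yields NULL.
  def fold(e: Expr): Expr = e match {
    case EqualTo(Lit(Some(a)), Lit(Some(b))) => BoolLit(Some(a == b))
    case EqualTo(Lit(_), Lit(_)) => BoolLit(None)
    case other => other
  }

  // Pushes `= literal` into each branch value and the else value, then folds.
  def pushDown(e: Expr): Expr = e match {
    case EqualTo(CaseWhen(branches, elseValue), lit: Lit) =>
      CaseWhen(
        branches.map { case (cond, value) => (cond, fold(EqualTo(value, lit))) },
        fold(EqualTo(elseValue, lit)))
    case other => other
  }
}
```

Applied to the filter from the commit message, every branch folds to false or NULL, so no row of `t2` can satisfy `event_type = 'a'`. That is consistent with the "After this PR" plan, in which only the `t1` scan remains.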
angerszhu	0603913c66	[SPARK-33593][SQL] Vector reader got incorrect data with binary partition value ### What changes were proposed in this pull request? Currently, when the Parquet vectorized reader is enabled, using a binary type as the partition column returns an incorrect value, as shown in the UT below ```scala test("Parquet vector reader incorrect with binary partition value") { Seq(false, true).foreach(tag => { withSQLConf("spark.sql.parquet.enableVectorizedReader" -> tag.toString) { withTable("t1") { sql( """CREATE TABLE t1(name STRING, id BINARY, part BINARY) \| USING PARQUET PARTITIONED BY (part)""".stripMargin) sql(s"INSERT INTO t1 PARTITION(part = 'Spark SQL') VALUES('a', X'537061726B2053514C')") if (tag) { checkAnswer(sql("SELECT name, cast(id as string), cast(part as string) FROM t1"), Row("a", "Spark SQL", "")) } else { checkAnswer(sql("SELECT name, cast(id as string), cast(part as string) FROM t1"), Row("a", "Spark SQL", "Spark SQL")) } } } }) } ``` ### Why are the changes needed? To fix an incorrect-data issue ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? Added UT Closes #30824 from AngersZhuuuu/SPARK-33593. Authored-by: angerszhu <angers.zhu@gmail.com> Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>	2020-12-18 00:01:13 -08:00
Terry Kim	0f1a18370a	[SPARK-33817][SQL] CACHE TABLE uses a logical plan when caching a query to avoid creating a dataframe ### What changes were proposed in this pull request? This PR proposes to update `CACHE TABLE` to use a `LogicalPlan` when caching a query to avoid creating a `DataFrame` as suggested here: https://github.com/apache/spark/pull/30743#discussion_r543123190 For reference, `UNCACHE TABLE` also uses `LogicalPlan`: `0c12900120/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/CacheTableExec.scala (L91-L98)` ### Why are the changes needed? To avoid creating an unnecessary dataframe and make it consistent with `uncacheQuery` used in `UNCACHE TABLE`. ### Does this PR introduce _any_ user-facing change? No, just internal changes. ### How was this patch tested? Existing tests since this is an internal refactoring change. Closes #30815 from imback82/cache_with_logical_plan. Authored-by: Terry Kim <yuminkim@gmail.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2020-12-18 04:30:15 +00:00
Takeshi Yamamuro	51ef4430dc	[SPARK-33822][SQL] Use the `CastSupport.cast` method in HashJoin ### What changes were proposed in this pull request? This PR intends to fix the bug that throws a unsupported exception when running [the TPCDS q5](https://github.com/apache/spark/blob/master/sql/core/src/test/resources/tpcds/q5.sql) with AQE enabled ([this option is enabled by default now via SPARK-33679](`031c5ef280`)): ``` java.lang.UnsupportedOperationException: BroadcastExchange does not support the execute() code path. at org.apache.spark.sql.execution.exchange.BroadcastExchangeExec.doExecute(BroadcastExchangeExec.scala:189) at org.apache.spark.sql.execution.SparkPlan.$anonfun$execute$1(SparkPlan.scala:180) at org.apache.spark.sql.execution.SparkPlan.$anonfun$executeQuery$1(SparkPlan.scala:218) at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151) at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:215) at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:176) at org.apache.spark.sql.execution.exchange.ReusedExchangeExec.doExecute(Exchange.scala:60) at org.apache.spark.sql.execution.SparkPlan.$anonfun$execute$1(SparkPlan.scala:180) at org.apache.spark.sql.execution.SparkPlan.$anonfun$executeQuery$1(SparkPlan.scala:218) at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151) at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:215) at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:176) at org.apache.spark.sql.execution.adaptive.QueryStageExec.doExecute(QueryStageExec.scala:115) at org.apache.spark.sql.execution.SparkPlan.$anonfun$execute$1(SparkPlan.scala:180) at org.apache.spark.sql.execution.SparkPlan.$anonfun$executeQuery$1(SparkPlan.scala:218) at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151) at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:215) at 
org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:176) at org.apache.spark.sql.execution.SparkPlan.getByteArrayRdd(SparkPlan.scala:321) at org.apache.spark.sql.execution.SparkPlan.executeCollectIterator(SparkPlan.scala:397) at org.apache.spark.sql.execution.exchange.BroadcastExchangeExec.$anonfun$relationFuture$1(BroadcastExchangeExec.scala:118) at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withThreadLocalCaptured$1(SQLExecution.scala:185) at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264) ... ``` I've checked the AQE code and I found `EnsureRequirements` wrongly puts `BroadcastExchange` on a top of `BroadcastQueryStage` in the `reOptimize` phase as follows: ``` +- BroadcastExchange HashedRelationBroadcastMode(List(cast(input[0, int, true] as bigint)),false), [id=#2183] +- BroadcastQueryStage 2 +- ReusedExchange [d_date_sk#1086], BroadcastExchange HashedRelationBroadcastMode(List(cast(input[0, int, true] as bigint)),false), [id=#1963] ``` A root cause is that a `Cast` class in a required child's distribution does not have a `timeZoneId` field (`timeZoneId=None`), and a `Cast` class in `child.outputPartitioning` has it. So, this difference can make the distribution requirement check fail in `EnsureRequirements`: `1e85707738/sql/core/src/main/scala/org/apache/spark/sql/execution/exchange/EnsureRequirements.scala (L47-L50)` The `Cast` class that does not have a `timeZoneId` field is generated in the `HashJoin` object. To fix this issue, this PR proposes to use the `CastSupport.cast` method there. ### Why are the changes needed? Bugfix. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Manually checked that q5 passed. Closes #30818 from maropu/BugfixInAQE. Authored-by: Takeshi Yamamuro <yamamuro@apache.org> Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>	2020-12-17 16:16:05 -08:00
allisonwang-db	1e85707738	[SPARK-33697][SQL] RemoveRedundantProjects should require column ordering by default ### What changes were proposed in this pull request? This PR changes the rule `RemoveRedundantProjects` from by default passing column ordering requirements from parent nodes to always require column orders regardless of the requirements from parent nodes unless otherwise specified. More specifically, instead of excluding a few nodes like GenerateExec, UnionExec that are known to require children columns to be ordered, the rule now includes a whitelist of nodes that allow passing through the ordering requirements from their parents. ### Why are the changes needed? Currently, this rule passes through ordering requirements from parents directly to children except for a few excluded nodes. This incorrectly removes the necessary project nodes below a UnionExec since it is not excluded. An earlier PR also fixed a similar issue for GenerateExec (SPARK-32861). In order to prevent similar issues, the rule should be changed to always require column ordering except for a few specific nodes that we know for sure can pass through the requirements. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? Unit tests Closes #30659 from allisonwang-db/spark-33697-remove-project-union. Authored-by: allisonwang-db <66282705+allisonwang-db@users.noreply.github.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2020-12-17 05:47:44 +00:00
Terry Kim	0c19497222	[SPARK-33815][SQL] Migrate ALTER TABLE ... SET [SERDE\|SERDEPROPERTIES] to use UnresolvedTable to resolve the identifier ### What changes were proposed in this pull request? This PR proposes to migrate `ALTER TABLE ... SET [SERDE\|SERDEPROPERTIES]` to use `UnresolvedTable` to resolve the table identifier. This allows consistent resolution rules (temp view first, etc.) to be applied for both v1/v2 commands. More info about the consistent resolution rule proposal can be found in [JIRA](https://issues.apache.org/jira/browse/SPARK-29900) or [proposal doc](https://docs.google.com/document/d/1hvLjGA8y_W_hhilpngXVub1Ebv8RsMap986nENCFnrg/edit?usp=sharing). Note that `ALTER TABLE ... SET [SERDE\|SERDEPROPERTIES]` is not supported for v2 tables. ### Why are the changes needed? The PR makes the resolution behavior consistent. For example, before this PR: ```scala sql("CREATE DATABASE test") sql("CREATE TABLE spark_catalog.test.t (id bigint, val string) USING csv PARTITIONED BY (id)") sql("CREATE TEMPORARY VIEW t AS SELECT 2") sql("USE spark_catalog.test") sql("ALTER TABLE t SET SERDE 'serdename'") // works fine ``` , but after this PR: ``` sql("ALTER TABLE t SET SERDE 'serdename'") org.apache.spark.sql.AnalysisException: t is a temp view. 'ALTER TABLE ... SET [SERDE\|SERDEPROPERTIES\' expects a table; line 1 pos 0 ``` , which is consistent with the behavior of other commands. ### Does this PR introduce _any_ user-facing change? After this PR, `t` in the above example is resolved to a temp view first instead of `spark_catalog.test.t`. ### How was this patch tested? Updated existing tests. Closes #30813 from imback82/alter_table_serde_v2. Authored-by: Terry Kim <yuminkim@gmail.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2020-12-17 05:25:51 +00:00
Terry Kim	e7e29fd0af	[SPARK-33514][SQL][FOLLOW-UP] Remove unused TruncateTableStatement case class ### What changes were proposed in this pull request? This PR removes unused `TruncateTableStatement`: https://github.com/apache/spark/pull/30457#discussion_r544433820 ### Why are the changes needed? To remove unused `TruncateTableStatement` from #30457. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? Not needed. Closes #30811 from imback82/remove_truncate_table_stmt. Authored-by: Terry Kim <yuminkim@gmail.com> Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>	2020-12-16 14:13:02 -08:00
Kent Yao	728a1298af	[SPARK-33806][SQL] limit partition num to 1 when distributing by foldable expressions ### What changes were proposed in this pull request? It seems a very popular way that people use DISTRIBUTE BY clause with a literal to coalesce partition in the pure SQL data processing. For example ``` insert into table src select * from values (1), (2), (3) t(a) distribute by 1 ``` Users may want the final output to be one single data file, but if the reality is not always true. Spark will always create a file for partition 0 whether it contains data or not, so when the data all goes to a partition(IDX >0), there will be always 2 files there and the part-00000 is empty. On the other hand, a lot of empty tasks will be launched too, this is unnecessary. When users repeat the insert statement daily, hourly, or minutely, it causes small file issues. ``` spark-sql> set spark.sql.shuffle.partitions=3;drop table if exists test2;create table test2 using parquet as select * from values (1), (2), (3) t(a) distribute by 1; kentyaohulk  ~/spark   SPARK-33806  tree /Users/kentyao/Downloads/spark/spark-3.1.0-SNAPSHOT-bin-20201202/spark-warehouse/test2/ -s /Users/kentyao/Downloads/spark/spark-3.1.0-SNAPSHOT-bin-20201202/spark-warehouse/test2/ ├── [ 0] _SUCCESS ├── [ 298] part-00000-5dc19733-9405-414b-9681-d25c4d3e9ee6-c000.snappy.parquet └── [ 426] part-00001-5dc19733-9405-414b-9681-d25c4d3e9ee6-c000.snappy.parquet ``` To avoid this, there are some options you can take. 1. use `distribute by null`, let the data go to the partition 0 2. set spark.sql.adaptive.enabled to true for Spark to automatically coalesce 3. using hints instead of `distribute by` 4. set spark.sql.shuffle.partitions to 1 In this PR, we set the partition number to 1 in this particular case. ### Why are the changes needed? 1. avoid small file issues 2. avoid unnecessary empty tasks when no adaptive execution ### Does this PR introduce _any_ user-facing change? no ### How was this patch tested? 
new test Closes #30800 from yaooqinn/SPARK-33806. Authored-by: Kent Yao <yaooqinn@hotmail.com> Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>	2020-12-16 14:09:28 -08:00
Terry Kim	8666d1c39c	[SPARK-33800][SQL] Remove command name in AnalysisException message when a relation is not resolved ### What changes were proposed in this pull request? Based on the discussion https://github.com/apache/spark/pull/30743#discussion_r543124594, this PR proposes to remove the command name in AnalysisException message when a relation is not resolved. For some of the commands that use `UnresolvedTable`, `UnresolvedView`, and `UnresolvedTableOrView` to resolve an identifier, when the identifier cannot be resolved, the exception will be something like `Table or view not found for 'SHOW TBLPROPERTIES': badtable`. The command name (`SHOW TBLPROPERTIES` in this case) should be dropped to be consistent with other existing commands. ### Why are the changes needed? To make the exception message consistent. ### Does this PR introduce _any_ user-facing change? Yes, the exception message will be changed from ``` Table or view not found for 'SHOW TBLPROPERTIES': badtable ``` to ``` Table or view not found: badtable ``` for commands that use `UnresolvedTable`, `UnresolvedView`, and `UnresolvedTableOrView` to resolve an identifier. ### How was this patch tested? Updated existing tests. Closes #30794 from imback82/remove_cmd_from_exception_msg. Authored-by: Terry Kim <yuminkim@gmail.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2020-12-16 15:56:50 +00:00
Kent Yao	205d8e40bc	[SPARK-32991][SQL] [FOLLOWUP] Reset command relies on session initials first ### What changes were proposed in this pull request? As a follow-up of https://github.com/apache/spark/pull/30045, we modify the RESET command here to respect the session's initial configs first and then fall back to the `SharedState` conf, which lets each session maintain a different copy of initial configs for resetting. ### Why are the changes needed? To make the RESET command saner. ### Does this PR introduce _any_ user-facing change? Yes, RESET will respect the session initials first instead of always going to the system defaults. ### How was this patch tested? Added new tests. Closes #30642 from yaooqinn/SPARK-32991-F. Authored-by: Kent Yao <yaooqinn@hotmail.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2020-12-16 14:36:38 +00:00
Max Gekk	9d9d4a8e12	[SPARK-33789][SQL][TESTS] Refactor unified V1 and V2 datasource tests ### What changes were proposed in this pull request? 1. Move common utility functions such as `test()`, `withNsTable()` and `checkPartitions()` to `DDLCommandTestUtils`. 2. Place common settings such as `version`, `catalog`, `defaultUsing`, `sparkConf` to `CommandSuiteBase`. ### Why are the changes needed? To improve code maintenance of the unified tests. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? By running the affected test suites: ``` $ build/sbt -Phive-2.3 -Phive-thriftserver "test:testOnly ShowPartitionsSuite" $ build/sbt -Phive-2.3 -Phive-thriftserver "test:testOnly ShowTablesSuite" $ build/sbt -Phive-2.3 -Phive-thriftserver "test:testOnly AlterTableAddPartitionSuite" $ build/sbt -Phive-2.3 -Phive-thriftserver "test:testOnly AlterTableDropPartitionSuite" ``` Closes #30779 from MaxGekk/refactor-unified-tests. Lead-authored-by: Max Gekk <max.gekk@gmail.com> Co-authored-by: Maxim Gekk <max.gekk@gmail.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2020-12-16 13:49:49 +00:00
HyukjinKwon	7845865b8d	[SPARK-33803][SQL] Sort table properties by key in DESCRIBE TABLE command ### What changes were proposed in this pull request? This PR proposes to sort table properties in DESCRIBE TABLE command. This is consistent with DSv2 command as well: `e3058ba17c/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/DescribeTableExec.scala (L63)` This PR fixes the test case in Scala 2.13 build as well where the table properties have different order in the map. ### Why are the changes needed? To keep the deterministic and pretty output, and fix the tests in Scala 2.13 build. See https://amplab.cs.berkeley.edu/jenkins/job/spark-master-test-maven-hadoop-3.2-scala-2.13/49/testReport/junit/org.apache.spark.sql/SQLQueryTestSuite/describe_sql/ ``` describe.sql Expected "...spark_catalog, view.[query.out.col.2=c, view.referredTempFunctionsNames=[], view.catalogAndNamespace.part.1=default]]", but got "...spark_catalog, view.[catalogAndNamespace.part.1=default, view.query.out.col.2=c, view.referredTempFunctionsNames=[]]]" Result did not match for query #29 DESC FORMATTED v ``` ### Does this PR introduce _any_ user-facing change? Yes, it will change the text output from `DESCRIBE [EXTENDED\|FORMATTED] table_name`. Now the table properties are sorted by its key. ### How was this patch tested? Related unittests were fixed accordingly. Closes #30799 from HyukjinKwon/SPARK-33803. Authored-by: HyukjinKwon <gurwls223@apache.org> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2020-12-16 13:42:30 +00:00
Terry Kim	ef7f6903b4	[SPARK-33786][SQL] The storage level for a cache should be respected when a table name is altered ### What changes were proposed in this pull request? This PR proposes to retain the cache's storage level when a table name is altered by `ALTER TABLE ... RENAME TO ...`. ### Why are the changes needed? Currently, when a table name is altered, the table's cache is refreshed (if exists), but the storage level is not retained. For example: ```scala def getStorageLevel(tableName: String): StorageLevel = { val table = spark.table(tableName) val cachedData = spark.sharedState.cacheManager.lookupCachedData(table).get cachedData.cachedRepresentation.cacheBuilder.storageLevel } Seq(1 -> "a").toDF("i", "j").write.parquet(path.getCanonicalPath) sql(s"CREATE TABLE old USING parquet LOCATION '${path.toURI}'") sql("CACHE TABLE old OPTIONS('storageLevel' 'MEMORY_ONLY')") val oldStorageLevel = getStorageLevel("old") sql("ALTER TABLE old RENAME TO new") val newStorageLevel = getStorageLevel("new") ``` `oldStorageLevel` will be `StorageLevel(memory, deserialized, 1 replicas)` whereas `newStorageLevel` will be `StorageLevel(disk, memory, deserialized, 1 replicas)`, which is the default storage level. ### Does this PR introduce _any_ user-facing change? Yes, now the storage level for the cache will be retained. ### How was this patch tested? Added a unit test. Closes #30774 from imback82/alter_table_rename_cache_fix. Authored-by: Terry Kim <yuminkim@gmail.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2020-12-16 05:45:44 +00:00
Terry Kim	62be2483d7	[SPARK-33765][SQL] Migrate UNCACHE TABLE to use UnresolvedRelation to resolve identifier ### What changes were proposed in this pull request? This PR proposes to migrate `UNCACHE TABLE` to use `UnresolvedRelation` to resolve the table/view identifier in Analyzer as discussed https://github.com/apache/spark/pull/30403/files#r532360022. ### Why are the changes needed? To resolve the table/view in the analyzer. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? Updated existing tests Closes #30743 from imback82/uncache_v2. Authored-by: Terry Kim <yuminkim@gmail.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2020-12-16 05:37:56 +00:00
Max Gekk	3dfdcf4f92	[SPARK-33788][SQL] Throw NoSuchPartitionsException from HiveExternalCatalog.dropPartitions() ### What changes were proposed in this pull request? Throw `NoSuchPartitionsException` from `ALTER TABLE .. DROP PARTITION` for non-existent partitions of a table in the V1 Hive external catalog. ### Why are the changes needed? The behaviour of the Hive external catalog deviates from the V1/V2 in-memory catalogs, which throw `NoSuchPartitionsException`. To improve user experience with Spark SQL, it would be better to throw the same exception. ### Does this PR introduce _any_ user-facing change? Yes, the command throws `NoSuchPartitionsException` instead of the general exception `AnalysisException`. ### How was this patch tested? By running tests for `ALTER TABLE .. DROP PARTITION`: ``` $ build/sbt -Phive-2.3 -Phive-thriftserver "test:testOnly *AlterTableDropPartitionSuite" ``` Closes #30778 from MaxGekk/hive-drop-partition-exception. Authored-by: Max Gekk <max.gekk@gmail.com> Signed-off-by: HyukjinKwon <gurwls223@apache.org>	2020-12-16 10:03:48 +09:00
Anton Okolnychyi	4d56d43838	[SPARK-33735][SQL] Handle UPDATE in ReplaceNullWithFalseInPredicate ### What changes were proposed in this pull request? This PR adds `UpdateTable` to supported plans in `ReplaceNullWithFalseInPredicate`. ### Why are the changes needed? This change allows Spark to optimize update conditions like we optimize filters. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? This PR extends the existing test cases to also cover `UpdateTable`. Closes #30787 from aokolnychyi/spark-33735. Authored-by: Anton Okolnychyi <aokolnychyi@apple.com> Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>	2020-12-15 13:50:58 -08:00
Wenchen Fan	40c37d69fd	[SPARK-33617][SQL][FOLLOWUP] refine the default parallelism SQL config ### What changes were proposed in this pull request? This is a followup of https://github.com/apache/spark/pull/30559 . The default parallelism config in Spark core is not good, as it's unclear where it applies. To not inherit this problem in Spark SQL, this PR refines the default parallelism SQL config, to make it clear that it only applies to leaf nodes. ### Why are the changes needed? Make the config clearer. ### Does this PR introduce _any_ user-facing change? It changes an unreleased config. ### How was this patch tested? existing tests Closes #30736 from cloud-fan/follow. Authored-by: Wenchen Fan <wenchen@databricks.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2020-12-15 14:16:43 +00:00
Prakhar Jain	23083aa594	[SPARK-33758][SQL] Prune unrequired partitionings from AliasAwareOutputPartitionings when some columns are dropped from projection ### What changes were proposed in this pull request? This PR tries to prune the unrequired output partitionings in cases where columns are dropped from Project/Aggregates etc. ### Why are the changes needed? Consider this query: select t1.id from t1 JOIN t2 on t1.id = t2.id This query will have a top-level Project node that just projects t1.id. But the outputPartitioning of this Project node will be: PartitioningCollection(HashPartitioning(t1.id), HashPartitioning(t2.id)). Since we are not propagating the t2.id column, we can drop HashPartitioning(t2.id) from the output partitioning of the Project node. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? Added UTs. Closes #30762 from prakharjain09/SPARK-33758-prune-partitioning. Authored-by: Prakhar Jain <prakharjain09@gmail.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2020-12-15 13:46:58 +00:00
gengjiaan	58cb2bae74	[SPARK-33752][SQL] Avoid the getSimpleMessage of AnalysisException adds semicolon repeatedly ### What changes were proposed in this pull request? The current `getSimpleMessage` of `AnalysisException` may add the semicolon repeatedly. An example is shown below: `select decode()` The output will be: ``` org.apache.spark.sql.AnalysisException Invalid number of arguments for function decode. Expected: 2; Found: 0;; line 1 pos 7 ``` ### Why are the changes needed? To fix a bug that adds the semicolon repeatedly. ### Does this PR introduce _any_ user-facing change? Yes, the message of AnalysisException will be correct. ### How was this patch tested? Jenkins test. Closes #30724 from beliefer/SPARK-33752. Lead-authored-by: gengjiaan <gengjiaan@360.cn> Co-authored-by: beliefer <beliefer@163.com> Signed-off-by: HyukjinKwon <gurwls223@apache.org>	2020-12-15 19:20:01 +09:00
Chongguang LIU	20f6d63bc1	[SPARK-33769][SQL] Improve the next-day function of the sql component to deal with Column type ### What changes were proposed in this pull request? The proposal of this pull request is described in this JIRA ticket: [https://issues.apache.org/jira/browse/SPARK-33769](url) It proposes to improve the next-day function of the sql component to deal with Column type for the parameter dayOfWeek. ### Why are the changes needed? It makes this functionality easier to use. Currently the signature of this function is: > def next_day(date: Column, dayOfWeek: String): Column. It accepts the dayOfWeek parameter as a String. However, in some cases the dayOfWeek is in a Column, so it has a different value for each row of the dataframe. A current workaround is to use the NextDay function like this: > NextDay(dateCol.expr, dayOfWeekCol.expr). The proposal is to add another signature for this function: > def next_day(date: Column, dayOfWeek: Column): Column In fact, this is already the case for some other functions in this Scala object, for example: > def date_sub(start: Column, days: Int): Column = date_sub(start, lit(days)) > def date_sub(start: Column, days: Column): Column = withExpr { DateSub(start.expr, days.expr) } or > def add_months(startDate: Column, numMonths: Int): Column = add_months(startDate, lit(numMonths)) > def add_months(startDate: Column, numMonths: Column): Column = withExpr { > AddMonths(startDate.expr, numMonths.expr) > } This pull request applies the same idea to the function next_day. ### Does this PR introduce _any_ user-facing change? Yes. With this pull request, users of Spark will have a new signature of the function: > def next_day(date: Column, dayOfWeek: Column): Column But the existing function signature should still work: > def next_day(date: Column, dayOfWeek: String): Column So this change should be backward compatible. ### How was this patch tested? The unit tests of the next_day function have been enhanced. They test the dayOfWeek parameter both as String and Column. I also added a test case for the existing signature where the dayOfWeek is an invalid String. This should return null. Closes #30761 from chongguang/SPARK-33769. Authored-by: Chongguang LIU <chongguang.liu@laposte.fr> Signed-off-by: HyukjinKwon <gurwls223@apache.org>	2020-12-15 18:55:48 +09:00
Wenchen Fan	03042529e3	[SPARK-33273][SQL] Fix a race condition in subquery execution ### What changes were proposed in this pull request? If we call `SubqueryExec.executeTake`, it will call `SubqueryExec.execute` which will trigger the codegen of the query plan and create an RDD. However, `SubqueryExec` already has a thread (`SubqueryExec.relationFuture`) to execute the query plan, which means we have 2 threads triggering codegen of the same query plan at the same time. Spark codegen is not thread-safe, as we have places like `HashAggregateExec.bufferVars` that is a shared variable. The bug in `SubqueryExec` may lead to correctness bugs. Since https://issues.apache.org/jira/browse/SPARK-33119, `ScalarSubquery` will call `SubqueryExec.executeTake`, so flaky tests start to appear. This PR fixes the bug by reimplementing https://github.com/apache/spark/pull/30016 . We should pass the number of rows we want to collect to `SubqueryExec` at planning time, so that we can use `executeTake` inside `SubqueryExec.relationFuture`, and the caller side should always call `SubqueryExec.executeCollect`. This PR also adds checks so that we can make sure only `SubqueryExec.executeCollect` is called. ### Why are the changes needed? fix correctness bug. ### Does this PR introduce _any_ user-facing change? no ### How was this patch tested? run `build/sbt "sql/testOnly *SQLQueryTestSuite -- -z scalar-subquery-select"` more than 10 times. Previously it fails, now it passes. Closes #30765 from cloud-fan/bug. Authored-by: Wenchen Fan <wenchen@databricks.com> Signed-off-by: HyukjinKwon <gurwls223@apache.org>	2020-12-15 18:29:28 +09:00
Max Gekk	141e26d65b	[SPARK-33767][SQL][TESTS] Unify v1 and v2 ALTER TABLE .. DROP PARTITION tests ### What changes were proposed in this pull request? 1. Move the `ALTER TABLE .. DROP PARTITION` parsing tests to `AlterTableDropPartitionParserSuite` 2. Place v1 tests for `ALTER TABLE .. DROP PARTITION` from `DDLSuite` and v2 tests from `AlterTablePartitionV2SQLSuite` in the common trait `AlterTableDropPartitionSuiteBase`, so that the tests will run for V1, Hive V1 and V2 DS. ### Why are the changes needed? - The unification will allow running common `ALTER TABLE .. DROP PARTITION` tests for DSv1, Hive DSv1 and DSv2 - We can detect missing features and differences between DSv1 and DSv2 implementations. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? By running new test suites: ``` $ build/sbt -Phive -Phive-thriftserver "test:testOnly AlterTableDropPartitionParserSuite" $ build/sbt -Phive -Phive-thriftserver "test:testOnly AlterTableDropPartitionSuite" ``` Closes #30747 from MaxGekk/unify-alter-table-drop-partition-tests. Authored-by: Max Gekk <max.gekk@gmail.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2020-12-15 05:36:57 +00:00
Terry Kim	366beda54a	[SPARK-33785][SQL] Migrate ALTER TABLE ... RECOVER PARTITIONS to use UnresolvedTable to resolve the identifier ### What changes were proposed in this pull request? This PR proposes to migrate `ALTER TABLE ... RECOVER PARTITIONS` to use `UnresolvedTable` to resolve the table identifier. This allows consistent resolution rules (temp view first, etc.) to be applied for both v1/v2 commands. More info about the consistent resolution rule proposal can be found in [JIRA](https://issues.apache.org/jira/browse/SPARK-29900) or [proposal doc](https://docs.google.com/document/d/1hvLjGA8y_W_hhilpngXVub1Ebv8RsMap986nENCFnrg/edit?usp=sharing). Note that `ALTER TABLE ... RECOVER PARTITIONS` is not supported for v2 tables. ### Why are the changes needed? The PR makes the resolution behavior consistent. For example, before this PR: ```scala sql("CREATE DATABASE test") sql("CREATE TABLE spark_catalog.test.t (id bigint, val string) USING csv PARTITIONED BY (id)") sql("CREATE TEMPORARY VIEW t AS SELECT 2") sql("USE spark_catalog.test") sql("ALTER TABLE t RECOVER PARTITIONS") // works fine ``` , but after this PR: ``` sql("ALTER TABLE t RECOVER PARTITIONS") org.apache.spark.sql.AnalysisException: t is a temp view. 'ALTER TABLE ... RECOVER PARTITIONS' expects a table; line 1 pos 0 ``` , which is consistent with the behavior of other commands. ### Does this PR introduce _any_ user-facing change? After this PR, `ALTER TABLE t RECOVER PARTITIONS` in the above example is resolved to a temp view `t` first instead of `spark_catalog.test.t`. ### How was this patch tested? Updated existing tests. Closes #30773 from imback82/alter_table_recover_part_v2. Authored-by: Terry Kim <yuminkim@gmail.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2020-12-15 05:23:39 +00:00
Chao Sun	49d3256497	[SPARK-33653][SQL] DSv2: REFRESH TABLE should recache the table itself ### What changes were proposed in this pull request? This changes DSv2 refresh table semantics to also recache the target table itself. ### Why are the changes needed? Currently "REFRESH TABLE" in DSv2 only invalidates all caches referencing the table. With #30403 merged, which adds support for caching a DSv2 table, we should also recache the target table itself to make the behavior consistent with DSv1. ### Does this PR introduce _any_ user-facing change? Yes, now refreshing a table in DSv2 also recaches the target table itself. ### How was this patch tested? Added coverage of this new behavior in the existing UT for the v2 refresh table command. Closes #30742 from sunchao/SPARK-33653. Authored-by: Chao Sun <sunchao@apple.com> Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>	2020-12-14 15:18:50 -08:00
Max Gekk	f156718587	[SPARK-33777][SQL] Sort output of V2 SHOW PARTITIONS ### What changes were proposed in this pull request? List partitions returned by the V2 `SHOW PARTITIONS` command in alphabetical order. ### Why are the changes needed? To have the same behavior as: 1. V1 in-memory catalog, see `a28ed86a38/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/catalog/InMemoryCatalog.scala (L546)` 2. V1 Hive catalogs, see `fab2995972/sql/hive/src/main/scala/org/apache/spark/sql/hive/client/HiveClientImpl.scala (L715)` ### Does this PR introduce _any_ user-facing change? Yes, after the changes, V2 SHOW PARTITIONS sorts its output. ### How was this patch tested? Added new UT to the base trait `ShowPartitionsSuiteBase` which contains tests for V1 and V2. Closes #30764 from MaxGekk/sort-show-partitions. Authored-by: Max Gekk <max.gekk@gmail.com> Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>	2020-12-14 14:28:47 -08:00
Yuming Wang	412d86e711	[SPARK-33771][SQL][TESTS] Fix Invalid value for HourOfAmPm when testing on JDK 14 ### What changes were proposed in this pull request? This PR fixes the invalid value for HourOfAmPm when testing on JDK 14. ### Why are the changes needed? To run tests on JDK 14. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? N/A Closes #30754 from wangyum/SPARK-33771. Authored-by: Yuming Wang <yumwang@ebay.com> Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>	2020-12-14 13:34:23 -08:00
Anton Okolnychyi	bb60fb1bbd	[SPARK-33779][SQL][FOLLOW-UP] Fix Java Linter error ### What changes were proposed in this pull request? This PR removes unused imports. ### Why are the changes needed? These changes are required to fix the build. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Via `dev/lint-java`. Closes #30767 from aokolnychyi/fix-linter. Authored-by: Anton Okolnychyi <aokolnychyi@apple.com> Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>	2020-12-14 11:39:42 -08:00
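
As an illustration of the SPARK-33803 entry above (sorting table properties by key so that the `DESCRIBE TABLE` output is deterministic), here is a minimal sketch in plain Scala. It is not Spark's implementation: the object `SortedProps` and the method `render` are hypothetical names, and the `[key=value, ...]` rendering is an assumption based on the expected-output snippet quoted in that commit message.

```scala
// Hypothetical helper, not part of Spark: renders a property map in a
// deterministic order by sorting on the key before formatting.
object SortedProps {
  def render(props: Map[String, String]): String =
    props.toSeq
      .sortBy(_._1)                     // order entries by property key
      .map { case (k, v) => s"$k=$v" }  // render each entry as key=value
      .mkString("[", ", ", "]")         // assumed bracketed, comma-separated layout

  def main(args: Array[String]): Unit = {
    // Sample keys taken from the test failure quoted in the commit message.
    val props = Map(
      "view.query.out.col.2" -> "c",
      "view.referredTempFunctionsNames" -> "[]",
      "view.catalogAndNamespace.part.1" -> "default"
    )
    println(render(props))
  }
}
```

Sorting before rendering makes the text independent of the map's iteration order, which is the property the commit message says differed in the Scala 2.13 build.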

10416 commits