ODIn/spark-instrumented-optimizer

Author	SHA1	Message	Date
yi.wu	da076153aa	[SPARK-30433][SQL] Make conflict attributes resolution more scalable in ResolveReferences ### What changes were proposed in this pull request? This PR tries to make conflicting-attribute resolution in `ResolveReferences` more scalable by resolving conflicts in batches. ### Why are the changes needed? Currently, the `ResolveReferences` rule resolves the conflicting attributes of only one conflicting plan pair per iteration, which can be inefficient when there are many conflicts. ### Does this PR introduce any user-facing change? No. ### How was this patch tested? Covered by existing tests. Closes #27105 from Ngone51/resolve-conflict-columns-in-batch. Authored-by: yi.wu <yi.wu@databricks.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2020-01-07 12:05:27 +08:00
Yuming Wang	17881a467a	[SPARK-19784][SPARK-25403][SQL] Refresh the table even table stats is empty ## What changes were proposed in this pull request? Since [SPARK-21237](https://issues.apache.org/jira/browse/SPARK-21237), the table relation is invalidated once the table data changes. However, one case is still not invalidated (`spark.sql.statistics.size.autoUpdate.enabled=false` and `table.stats.isEmpty`): `07c4b9bd1f/sql/core/src/main/scala/org/apache/spark/sql/execution/command/CommandUtils.scala (L44-L54)` This causes several issues, e.g. [SPARK-19784](https://issues.apache.org/jira/browse/SPARK-19784), [SPARK-19845](https://issues.apache.org/jira/browse/SPARK-19845), [SPARK-25403](https://issues.apache.org/jira/browse/SPARK-25403), [SPARK-25332](https://issues.apache.org/jira/browse/SPARK-25332) and [SPARK-28413](https://issues.apache.org/jira/browse/SPARK-28413). The following example reproduces [SPARK-19784](https://issues.apache.org/jira/browse/SPARK-19784): ```scala val path = "/tmp/spark/parquet" spark.sql("CREATE TABLE t (a INT) USING parquet") spark.sql("INSERT INTO TABLE t VALUES (1)") spark.range(5).toDF("a").write.parquet(path) spark.sql(s"ALTER TABLE t SET LOCATION '${path}'") spark.table("t").count() // return 1 spark.sql("refresh table t") spark.table("t").count() // return 5 ``` This PR invalidates the table relation in this case (`spark.sql.statistics.size.autoUpdate.enabled=false` and `table.stats.isEmpty`) to fix this issue. ## How was this patch tested? Unit tests. Closes #22721 from wangyum/SPARK-25403. Authored-by: Yuming Wang <yumwang@ebay.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2020-01-07 11:41:34 +08:00
Ximo Guanter	604d6799df	[SPARK-30226][SQL] Remove withXXX functions in WriteBuilder ### What changes were proposed in this pull request? Adding a `LogicalWriteInfo` interface as suggested by cloud-fan in https://github.com/apache/spark/pull/25990#issuecomment-555132991 ### Why are the changes needed? It provides compile-time guarantees where we previously had none, which will make it harder to introduce bugs in the future. ### Does this PR introduce any user-facing change? No ### How was this patch tested? Compiles and passes tests Closes #26678 from edrevo/add-logical-write-info. Lead-authored-by: Ximo Guanter <joaquin.guantergonzalbez@telefonica.com> Co-authored-by: Ximo Guanter Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2020-01-06 23:53:45 +08:00
angerszhu	3eade744f8	[SPARK-29800][SQL] Rewrite non-correlated EXISTS subquery use ScalaSubquery to optimize perf ### What changes were proposed in this pull request? Currently, Catalyst rewrites a non-correlated EXISTS subquery to a BroadcastNestedLoopJoin, which performs poorly. This PR rewrites a non-correlated EXISTS subquery to a ScalarSubquery instead to improve performance. We rewrite ``` WHERE EXISTS (SELECT A FROM TABLE B WHERE COL1 > 10) ``` to ``` WHERE (SELECT 1 FROM (SELECT A FROM TABLE B WHERE COL1 > 10) LIMIT 1) IS NOT NULL ``` to avoid building a join to evaluate the EXISTS expression. ### Why are the changes needed? Optimize EXISTS performance. ### Does this PR introduce any user-facing change? No. ### How was this patch tested? Manually tested. Closes #26437 from AngersZhuuuu/SPARK-29800. Lead-authored-by: angerszhu <angers.zhu@gmail.com> Co-authored-by: AngersZhuuuu <angers.zhu@gmail.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2020-01-06 22:54:37 +08:00
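The rewrite described in SPARK-29800 can be sanity-checked outside Spark. The sketch below is not Spark code: it uses Python's built-in `sqlite3` module as a stand-in engine, and the tables `a` and `b` and their contents are hypothetical. It only checks that the EXISTS predicate and the scalar-subquery predicate select the same rows, both when the subquery returns rows and when it returns none.

```python
import sqlite3

# In-memory database with two small, hypothetical tables.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE a (id INTEGER);
    CREATE TABLE b (col1 INTEGER);
    INSERT INTO a VALUES (1), (2), (3);
    INSERT INTO b VALUES (5), (20);
""")

# Original form: non-correlated EXISTS subquery.
EXISTS_FORM = (
    "SELECT id FROM a "
    "WHERE EXISTS (SELECT col1 FROM b WHERE col1 > ?) "
    "ORDER BY id"
)

# Rewritten form: a scalar subquery that yields 1 if any row exists and
# NULL otherwise, compared with IS NOT NULL.
SCALAR_FORM = (
    "SELECT id FROM a "
    "WHERE (SELECT 1 FROM (SELECT col1 FROM b WHERE col1 > ?) LIMIT 1) IS NOT NULL "
    "ORDER BY id"
)


def run(query, threshold):
    """Run one of the two query forms and return the selected ids."""
    return [row[0] for row in conn.execute(query, (threshold,))]
```

Calling `run(EXISTS_FORM, 10)` and `run(SCALAR_FORM, 10)` should return the same ids, and both should return an empty list for a threshold that no row in `b` exceeds.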
root1	ebd2fd7e02	[SPARK-30415][SQL] Improve Readability of SQLConf Doc ### What changes were proposed in this pull request? The SQLConf doc was updated. ### Why are the changes needed? Some doc comments were not written properly, and spaces were missing in many places. This patch updates the doc. ### Does this PR introduce any user-facing change? No. ### How was this patch tested? Documentation update. Closes #27091 from iRakson/SQLConfDoc. Authored-by: root1 <raksonrakesh@gmail.com> Signed-off-by: Sean Owen <srowen@gmail.com>	2020-01-04 15:49:11 -06:00
Wenchen Fan	be4faafee4	Revert "[SPARK-23264][SQL] Make INTERVAL keyword optional when ANSI enabled" ### What changes were proposed in this pull request? Revert https://github.com/apache/spark/pull/20433 . ### Why are the changes needed? According to the SQL standard, the INTERVAL prefix is required: ``` <interval literal> ::= INTERVAL [ <sign> ] <interval string> <interval qualifier> <interval string> ::= <quote> <unquoted interval string> <quote> ``` ### Does this PR introduce any user-facing change? yes, but omitting the INTERVAL prefix is a new feature in 3.0 ### How was this patch tested? existing tests Closes #27080 from cloud-fan/interval. Authored-by: Wenchen Fan <wenchen@databricks.com> Signed-off-by: Xiao Li <gatorsmile@gmail.com>	2020-01-03 12:51:10 -08:00
Yuming Wang	568ad4e77a	[SPARK-29947][SQL] Improve ResolveRelations performance ### What changes were proposed in this pull request? It is very common for a SQL query to query a table more than once. For example: ``` == Physical Plan == (12) HashAggregate(keys=[cmn_mtrc_summ_dt#21, rev_rollup#1279, CASE WHEN (rev_rollup#1319 = rev_rollup#1279) THEN 0 ELSE 1 END#1366, CASE WHEN cast(sap_category_id#24 as decimal(10,0)) IN (5,7,23,41) THEN 0 ELSE 1 END#1367], functions=[sum(coalesce(bid_count#34, 0)), sum(coalesce(ck_trans_count#35, 0)), sum(coalesce(ended_bid_count#36, 0)), sum(coalesce(ended_lstg_count#37, 0)), sum(coalesce(ended_success_lstg_count#38, 0)), sum(coalesce(item_sold_count#39, 0)), sum(coalesce(new_lstg_count#40, 0)), sum(coalesce(gmv_us_amt#41, 0.00)), sum(coalesce(gmv_slr_lc_amt#42, 0.00)), sum(CheckOverflow((promote_precision(cast(coalesce(rvnu_insrtn_fee_us_amt#46, 0.000000) as decimal(19,6))) + promote_precision(cast(coalesce(rvnu_insrtn_crd_us_amt#50, 0.000000) as decimal(19,6)))), DecimalType(19,6), true)), sum(CheckOverflow((promote_precision(cast(coalesce(rvnu_fetr_fee_us_amt#54, 0.000000) as decimal(19,6))) + promote_precision(cast(coalesce(rvnu_fetr_crd_us_amt#58, 0.000000) as decimal(19,6)))), DecimalType(19,6), true)), sum(CheckOverflow((promote_precision(cast(coalesce(rvnu_fv_fee_us_amt#62, 0.000000) as decimal(19,6))) + promote_precision(cast(coalesce(rvnu_fv_crd_us_amt#67, 0.000000) as decimal(19,6)))), DecimalType(19,6), true)), sum(CheckOverflow((promote_precision(cast(coalesce(rvnu_othr_l_fee_us_amt#72, 0.000000) as decimal(19,6))) + promote_precision(cast(coalesce(rvnu_othr_l_crd_us_amt#76, 0.000000) as decimal(19,6)))), DecimalType(19,6), true)), sum(CheckOverflow((promote_precision(cast(coalesce(rvnu_othr_nl_fee_us_amt#80, 0.000000) as decimal(19,6))) + promote_precision(cast(coalesce(rvnu_othr_nl_crd_us_amt#84, 0.000000) as decimal(19,6)))), DecimalType(19,6), true)), 
sum(CheckOverflow((promote_precision(cast(coalesce(rvnu_slr_tools_fee_us_amt#88, 0.000000) as decimal(19,6))) + promote_precision(cast(coalesce(rvnu_slr_tools_crd_us_amt#92, 0.000000) as decimal(19,6)))), DecimalType(19,6), true)), sum(coalesce(rvnu_unasgnd_us_amt#96, 0.000000)), sum((coalesce(rvnu_transaction_us_amt#112, 0.0) + coalesce(rvnu_transaction_crd_us_amt#115, 0.0))), sum((coalesce(rvnu_total_us_amt#118, 0.0) + coalesce(rvnu_total_crd_us_amt#121, 0.0)))]) +- Exchange hashpartitioning(cmn_mtrc_summ_dt#21, rev_rollup#1279, CASE WHEN (rev_rollup#1319 = rev_rollup#1279) THEN 0 ELSE 1 END#1366, CASE WHEN cast(sap_category_id#24 as decimal(10,0)) IN (5,7,23,41) THEN 0 ELSE 1 END#1367, 200), true, [id=#403] +- (11) HashAggregate(keys=[cmn_mtrc_summ_dt#21, rev_rollup#1279, CASE WHEN (rev_rollup#1319 = rev_rollup#1279) THEN 0 ELSE 1 END AS CASE WHEN (rev_rollup#1319 = rev_rollup#1279) THEN 0 ELSE 1 END#1366, CASE WHEN cast(sap_category_id#24 as decimal(10,0)) IN (5,7,23,41) THEN 0 ELSE 1 END AS CASE WHEN cast(sap_category_id#24 as decimal(10,0)) IN (5,7,23,41) THEN 0 ELSE 1 END#1367], functions=[partial_sum(coalesce(bid_count#34, 0)), partial_sum(coalesce(ck_trans_count#35, 0)), partial_sum(coalesce(ended_bid_count#36, 0)), partial_sum(coalesce(ended_lstg_count#37, 0)), partial_sum(coalesce(ended_success_lstg_count#38, 0)), partial_sum(coalesce(item_sold_count#39, 0)), partial_sum(coalesce(new_lstg_count#40, 0)), partial_sum(coalesce(gmv_us_amt#41, 0.00)), partial_sum(coalesce(gmv_slr_lc_amt#42, 0.00)), partial_sum(CheckOverflow((promote_precision(cast(coalesce(rvnu_insrtn_fee_us_amt#46, 0.000000) as decimal(19,6))) + promote_precision(cast(coalesce(rvnu_insrtn_crd_us_amt#50, 0.000000) as decimal(19,6)))), DecimalType(19,6), true)), partial_sum(CheckOverflow((promote_precision(cast(coalesce(rvnu_fetr_fee_us_amt#54, 0.000000) as decimal(19,6))) + promote_precision(cast(coalesce(rvnu_fetr_crd_us_amt#58, 0.000000) as decimal(19,6)))), DecimalType(19,6), true)), 
partial_sum(CheckOverflow((promote_precision(cast(coalesce(rvnu_fv_fee_us_amt#62, 0.000000) as decimal(19,6))) + promote_precision(cast(coalesce(rvnu_fv_crd_us_amt#67, 0.000000) as decimal(19,6)))), DecimalType(19,6), true)), partial_sum(CheckOverflow((promote_precision(cast(coalesce(rvnu_othr_l_fee_us_amt#72, 0.000000) as decimal(19,6))) + promote_precision(cast(coalesce(rvnu_othr_l_crd_us_amt#76, 0.000000) as decimal(19,6)))), DecimalType(19,6), true)), partial_sum(CheckOverflow((promote_precision(cast(coalesce(rvnu_othr_nl_fee_us_amt#80, 0.000000) as decimal(19,6))) + promote_precision(cast(coalesce(rvnu_othr_nl_crd_us_amt#84, 0.000000) as decimal(19,6)))), DecimalType(19,6), true)), partial_sum(CheckOverflow((promote_precision(cast(coalesce(rvnu_slr_tools_fee_us_amt#88, 0.000000) as decimal(19,6))) + promote_precision(cast(coalesce(rvnu_slr_tools_crd_us_amt#92, 0.000000) as decimal(19,6)))), DecimalType(19,6), true)), partial_sum(coalesce(rvnu_unasgnd_us_amt#96, 0.000000)), partial_sum((coalesce(rvnu_transaction_us_amt#112, 0.0) + coalesce(rvnu_transaction_crd_us_amt#115, 0.0))), partial_sum((coalesce(rvnu_total_us_amt#118, 0.0) + coalesce(rvnu_total_crd_us_amt#121, 0.0)))]) +- (11) Project [cmn_mtrc_summ_dt#21, sap_category_id#24, bid_count#34, ck_trans_count#35, ended_bid_count#36, ended_lstg_count#37, ended_success_lstg_count#38, item_sold_count#39, new_lstg_count#40, gmv_us_amt#41, gmv_slr_lc_amt#42, rvnu_insrtn_fee_us_amt#46, rvnu_insrtn_crd_us_amt#50, rvnu_fetr_fee_us_amt#54, rvnu_fetr_crd_us_amt#58, rvnu_fv_fee_us_amt#62, rvnu_fv_crd_us_amt#67, rvnu_othr_l_fee_us_amt#72, rvnu_othr_l_crd_us_amt#76, rvnu_othr_nl_fee_us_amt#80, rvnu_othr_nl_crd_us_amt#84, rvnu_slr_tools_fee_us_amt#88, rvnu_slr_tools_crd_us_amt#92, rvnu_unasgnd_us_amt#96, ... 
6 more fields] +- (11) BroadcastHashJoin [byr_cntry_id#23], [cntry_id#1309], LeftOuter, BuildRight :- (11) Project [cmn_mtrc_summ_dt#21, byr_cntry_id#23, sap_category_id#24, bid_count#34, ck_trans_count#35, ended_bid_count#36, ended_lstg_count#37, ended_success_lstg_count#38, item_sold_count#39, new_lstg_count#40, gmv_us_amt#41, gmv_slr_lc_amt#42, rvnu_insrtn_fee_us_amt#46, rvnu_insrtn_crd_us_amt#50, rvnu_fetr_fee_us_amt#54, rvnu_fetr_crd_us_amt#58, rvnu_fv_fee_us_amt#62, rvnu_fv_crd_us_amt#67, rvnu_othr_l_fee_us_amt#72, rvnu_othr_l_crd_us_amt#76, rvnu_othr_nl_fee_us_amt#80, rvnu_othr_nl_crd_us_amt#84, rvnu_slr_tools_fee_us_amt#88, rvnu_slr_tools_crd_us_amt#92, ... 6 more fields] : +- (11) BroadcastHashJoin [slr_cntry_id#28], [cntry_id#1269], LeftOuter, BuildRight : :- (11) Project [gen_attr_1#360 AS cmn_mtrc_summ_dt#21, gen_attr_5#267 AS byr_cntry_id#23, gen_attr_7#268 AS sap_category_id#24, gen_attr_15#272 AS slr_cntry_id#28, gen_attr_27#278 AS bid_count#34, gen_attr_29#279 AS ck_trans_count#35, gen_attr_31#280 AS ended_bid_count#36, gen_attr_33#282 AS ended_lstg_count#37, gen_attr_35#283 AS ended_success_lstg_count#38, gen_attr_37#284 AS item_sold_count#39, gen_attr_39#281 AS new_lstg_count#40, gen_attr_41#285 AS gmv_us_amt#41, gen_attr_43#287 AS gmv_slr_lc_amt#42, gen_attr_51#290 AS rvnu_insrtn_fee_us_amt#46, gen_attr_59#294 AS rvnu_insrtn_crd_us_amt#50, gen_attr_67#298 AS rvnu_fetr_fee_us_amt#54, gen_attr_75#302 AS rvnu_fetr_crd_us_amt#58, gen_attr_83#306 AS rvnu_fv_fee_us_amt#62, gen_attr_93#311 AS rvnu_fv_crd_us_amt#67, gen_attr_103#316 AS rvnu_othr_l_fee_us_amt#72, gen_attr_111#320 AS rvnu_othr_l_crd_us_amt#76, gen_attr_119#324 AS rvnu_othr_nl_fee_us_amt#80, gen_attr_127#328 AS rvnu_othr_nl_crd_us_amt#84, gen_attr_135#332 AS rvnu_slr_tools_fee_us_amt#88, ... 
6 more fields] : : +- (11) BroadcastHashJoin [cast(gen_attr_308#777 as decimal(20,0))], [cast(gen_attr_309#803 as decimal(20,0))], LeftOuter, BuildRight : : :- (11) Project [gen_attr_5#267, gen_attr_7#268, gen_attr_15#272, gen_attr_27#278, gen_attr_29#279, gen_attr_31#280, gen_attr_39#281, gen_attr_33#282, gen_attr_35#283, gen_attr_37#284, gen_attr_41#285, gen_attr_43#287, gen_attr_51#290, gen_attr_59#294, gen_attr_67#298, gen_attr_75#302, gen_attr_83#306, gen_attr_93#311, gen_attr_103#316, gen_attr_111#320, gen_attr_119#324, gen_attr_127#328, gen_attr_135#332, gen_attr_143#336, ... 6 more fields] : : : +- (11) BroadcastHashJoin [cast(gen_attr_310#674 as int)], [cast(gen_attr_311#774 as int)], LeftOuter, BuildRight : : : :- (11) Project [gen_attr_5#267, gen_attr_7#268, gen_attr_15#272, gen_attr_27#278, gen_attr_29#279, gen_attr_31#280, gen_attr_39#281, gen_attr_33#282, gen_attr_35#283, gen_attr_37#284, gen_attr_41#285, gen_attr_43#287, gen_attr_51#290, gen_attr_59#294, gen_attr_67#298, gen_attr_75#302, gen_attr_83#306, gen_attr_93#311, gen_attr_103#316, gen_attr_111#320, gen_attr_119#324, gen_attr_127#328, gen_attr_135#332, gen_attr_143#336, ... 6 more fields] : : : : +- (11) BroadcastHashJoin [cast(gen_attr_5#267 as decimal(20,0))], [cast(gen_attr_312#665 as decimal(20,0))], LeftOuter, BuildRight : : : : :- (11) Project [gen_attr_5#267, gen_attr_7#268, gen_attr_15#272, gen_attr_27#278, gen_attr_29#279, gen_attr_31#280, gen_attr_39#281, gen_attr_33#282, gen_attr_35#283, gen_attr_37#284, gen_attr_41#285, gen_attr_43#287, gen_attr_51#290, gen_attr_59#294, gen_attr_67#298, gen_attr_75#302, gen_attr_83#306, gen_attr_93#311, gen_attr_103#316, gen_attr_111#320, gen_attr_119#324, gen_attr_127#328, gen_attr_135#332, gen_attr_143#336, ... 
5 more fields] : : : : : +- (11) BroadcastHashJoin [cast(gen_attr_313#565 as decimal(20,0))], [cast(gen_attr_314#591 as decimal(20,0))], LeftOuter, BuildRight : : : : : :- (11) Project [gen_attr_5#267, gen_attr_7#268, gen_attr_15#272, gen_attr_27#278, gen_attr_29#279, gen_attr_31#280, gen_attr_39#281, gen_attr_33#282, gen_attr_35#283, gen_attr_37#284, gen_attr_41#285, gen_attr_43#287, gen_attr_51#290, gen_attr_59#294, gen_attr_67#298, gen_attr_75#302, gen_attr_83#306, gen_attr_93#311, gen_attr_103#316, gen_attr_111#320, gen_attr_119#324, gen_attr_127#328, gen_attr_135#332, gen_attr_143#336, ... 6 more fields] : : : : : : +- (11) BroadcastHashJoin [cast(gen_attr_315#462 as int)], [cast(gen_attr_316#562 as int)], LeftOuter, BuildRight : : : : : : :- (11) Project [gen_attr_5#267, gen_attr_7#268, gen_attr_15#272, gen_attr_27#278, gen_attr_29#279, gen_attr_31#280, gen_attr_39#281, gen_attr_33#282, gen_attr_35#283, gen_attr_37#284, gen_attr_41#285, gen_attr_43#287, gen_attr_51#290, gen_attr_59#294, gen_attr_67#298, gen_attr_75#302, gen_attr_83#306, gen_attr_93#311, gen_attr_103#316, gen_attr_111#320, gen_attr_119#324, gen_attr_127#328, gen_attr_135#332, gen_attr_143#336, ... 6 more fields] : : : : : : : +- (11) BroadcastHashJoin [cast(gen_attr_15#272 as decimal(20,0))], [cast(gen_attr_317#453 as decimal(20,0))], LeftOuter, BuildRight : : : : : : : :- (11) Project [gen_attr_5#267, gen_attr_7#268, gen_attr_15#272, gen_attr_27#278, gen_attr_29#279, gen_attr_31#280, gen_attr_39#281, gen_attr_33#282, gen_attr_35#283, gen_attr_37#284, gen_attr_41#285, gen_attr_43#287, gen_attr_51#290, gen_attr_59#294, gen_attr_67#298, gen_attr_75#302, gen_attr_83#306, gen_attr_93#311, gen_attr_103#316, gen_attr_111#320, gen_attr_119#324, gen_attr_127#328, gen_attr_135#332, gen_attr_143#336, ... 
5 more fields] : : : : : : : : +- (11) BroadcastHashJoin [cast(gen_attr_25#277 as decimal(20,0))], [cast(gen_attr_318#379 as decimal(20,0))], LeftOuter, BuildRight : : : : : : : : :- (11) Project [gen_attr_5#267, gen_attr_7#268, gen_attr_15#272, gen_attr_25#277, gen_attr_27#278, gen_attr_29#279, gen_attr_31#280, gen_attr_39#281, gen_attr_33#282, gen_attr_35#283, gen_attr_37#284, gen_attr_41#285, gen_attr_43#287, gen_attr_51#290, gen_attr_59#294, gen_attr_67#298, gen_attr_75#302, gen_attr_83#306, gen_attr_93#311, gen_attr_103#316, gen_attr_111#320, gen_attr_119#324, gen_attr_127#328, gen_attr_135#332, ... 6 more fields] : : : : : : : : : +- (11) BroadcastHashJoin [cast(gen_attr_23#276 as decimal(20,0))], [cast(gen_attr_319#367 as decimal(20,0))], LeftOuter, BuildRight : : : : : : : : : :- (11) Project [byr_cntry_id#1169 AS gen_attr_5#267, sap_category_id#1170 AS gen_attr_7#268, slr_cntry_id#1174 AS gen_attr_15#272, lstg_curncy_id#1178 AS gen_attr_23#276, blng_curncy_id#1179 AS gen_attr_25#277, bid_count#1180 AS gen_attr_27#278, ck_trans_count#1181 AS gen_attr_29#279, ended_bid_count#1182 AS gen_attr_31#280, new_lstg_count#1183 AS gen_attr_39#281, ended_lstg_count#1184 AS gen_attr_33#282, ended_success_lstg_count#1185 AS gen_attr_35#283, item_sold_count#1186 AS gen_attr_37#284, gmv_us_amt#1187 AS gen_attr_41#285, gmv_slr_lc_amt#1189 AS gen_attr_43#287, rvnu_insrtn_fee_us_amt#1192 AS gen_attr_51#290, rvnu_insrtn_crd_us_amt#1196 AS gen_attr_59#294, rvnu_fetr_fee_us_amt#1200 AS gen_attr_67#298, rvnu_fetr_crd_us_amt#1204 AS gen_attr_75#302, rvnu_fv_fee_us_amt#1208 AS gen_attr_83#306, rvnu_fv_crd_us_amt#1213 AS gen_attr_93#311, rvnu_othr_l_fee_us_amt#1218 AS gen_attr_103#316, rvnu_othr_l_crd_us_amt#1222 AS gen_attr_111#320, rvnu_othr_nl_fee_us_amt#1226 AS gen_attr_119#324, rvnu_othr_nl_crd_us_amt#1230 AS gen_attr_127#328, ... 
7 more fields] : : : : : : : : : : +- (11) ColumnarToRow : : : : : : : : : : +- FileScan parquet default.big_table1[byr_cntry_id#1169,sap_category_id#1170,slr_cntry_id#1174,lstg_curncy_id#1178,blng_curncy_id#1179,bid_count#1180,ck_trans_count#1181,ended_bid_count#1182,new_lstg_count#1183,ended_lstg_count#1184,ended_success_lstg_count#1185,item_sold_count#1186,gmv_us_amt#1187,gmv_slr_lc_amt#1189,rvnu_insrtn_fee_us_amt#1192,rvnu_insrtn_crd_us_amt#1196,rvnu_fetr_fee_us_amt#1200,rvnu_fetr_crd_us_amt#1204,rvnu_fv_fee_us_amt#1208,rvnu_fv_crd_us_amt#1213,rvnu_othr_l_fee_us_amt#1218,rvnu_othr_l_crd_us_amt#1222,rvnu_othr_nl_fee_us_amt#1226,rvnu_othr_nl_crd_us_amt#1230,... 7 more fields] Batched: true, DataFilters: [], Format: Parquet, Location: PrunedInMemoryFileIndex[], PartitionFilters: [isnotnull(cmn_mtrc_summ_dt#1262), (cmn_mtrc_summ_dt#1262 >= 18078), (cmn_mtrc_summ_dt#1262 <= 18..., PushedFilters: [], ReadSchema: struct<byr_cntry_id:decimal(4,0),sap_category_id:decimal(9,0),slr_cntry_id:decimal(4,0),lstg_curn... 
: : : : : : : : : +- BroadcastExchange HashedRelationBroadcastMode(List(cast(input[0, decimal(9,0), true] as decimal(20,0)))), [id=#288] : : : : : : : : : +- (1) Project [CURNCY_ID#1263 AS gen_attr_319#367] : : : : : : : : : +- (1) Filter isnotnull(CURNCY_ID#1263) : : : : : : : : : +- (1) ColumnarToRow : : : : : : : : : +- FileScan parquet default.small_table1[CURNCY_ID#1263] Batched: true, DataFilters: [isnotnull(CURNCY_ID#1263)], Format: Parquet, Location: InMemoryFileIndex[file:/user/hive/warehouse/small_table1], PartitionFilters: [], PushedFilters: [IsNotNull(CURNCY_ID)], ReadSchema: struct<CURNCY_ID:decimal(9,0)>, SelectedBucketsCount: 1 out of 1 : : : : : : : : +- BroadcastExchange HashedRelationBroadcastMode(List(cast(input[0, decimal(9,0), true] as decimal(20,0)))), [id=#297] : : : : : : : : +- (2) Project [CURNCY_ID#1263 AS gen_attr_318#379] : : : : : : : : +- (2) Filter isnotnull(CURNCY_ID#1263) : : : : : : : : +- (2) ColumnarToRow : : : : : : : : +- FileScan parquet default.small_table1[CURNCY_ID#1263] Batched: true, DataFilters: [isnotnull(CURNCY_ID#1263)], Format: Parquet, Location: InMemoryFileIndex[file:/user/hive/warehouse/small_table1], PartitionFilters: [], PushedFilters: [IsNotNull(CURNCY_ID)], ReadSchema: struct<CURNCY_ID:decimal(9,0)>, SelectedBucketsCount: 1 out of 1 : : : : : : : +- BroadcastExchange HashedRelationBroadcastMode(List(cast(input[0, decimal(4,0), true] as decimal(20,0)))), [id=#306] : : : : : : : +- (3) Project [cntry_id#1269 AS gen_attr_317#453, rev_rollup_id#1278 AS gen_attr_315#462] : : : : : : : +- (3) Filter isnotnull(cntry_id#1269) : : : : : : : +- (3) ColumnarToRow : : : : : : : +- FileScan parquet default.small_table2[cntry_id#1269,rev_rollup_id#1278] Batched: true, DataFilters: [isnotnull(cntry_id#1269)], Format: Parquet, Location: InMemoryFileIndex[file:/user/hive/warehouse/small_table2], PartitionFilters: [], PushedFilters: [IsNotNull(cntry_id)], ReadSchema: struct<cntry_id:decimal(4,0),rev_rollup_id:smallint> : : : : 
: : +- BroadcastExchange HashedRelationBroadcastMode(List(cast(cast(input[0, smallint, true] as int) as bigint))), [id=#315] : : : : : : +- (4) Project [rev_rollup_id#1286 AS gen_attr_316#562, curncy_id#1289 AS gen_attr_313#565] : : : : : : +- (4) Filter isnotnull(rev_rollup_id#1286) : : : : : : +- (4) ColumnarToRow : : : : : : +- FileScan parquet default.small_table3[rev_rollup_id#1286,curncy_id#1289] Batched: true, DataFilters: [isnotnull(rev_rollup_id#1286)], Format: Parquet, Location: InMemoryFileIndex[file:/user/hive/warehouse/small_table3], PartitionFilters: [], PushedFilters: [IsNotNull(rev_rollup_id)], ReadSchema: struct<rev_rollup_id:smallint,curncy_id:decimal(4,0)> : : : : : +- BroadcastExchange HashedRelationBroadcastMode(List(cast(input[0, decimal(9,0), true] as decimal(20,0)))), [id=#324] : : : : : +- (5) Project [CURNCY_ID#1263 AS gen_attr_314#591] : : : : : +- (5) Filter isnotnull(CURNCY_ID#1263) : : : : : +- (5) ColumnarToRow : : : : : +- FileScan parquet default.small_table1[CURNCY_ID#1263] Batched: true, DataFilters: [isnotnull(CURNCY_ID#1263)], Format: Parquet, Location: InMemoryFileIndex[file:/user/hive/warehouse/small_table1], PartitionFilters: [], PushedFilters: [IsNotNull(CURNCY_ID)], ReadSchema: struct<CURNCY_ID:decimal(9,0)>, SelectedBucketsCount: 1 out of 1 : : : : +- BroadcastExchange HashedRelationBroadcastMode(List(cast(input[0, decimal(4,0), true] as decimal(20,0)))), [id=#333] : : : : +- (6) Project [cntry_id#1269 AS gen_attr_312#665, rev_rollup_id#1278 AS gen_attr_310#674] : : : : +- (6) Filter isnotnull(cntry_id#1269) : : : : +- (6) ColumnarToRow : : : : +- FileScan parquet default.small_table2[cntry_id#1269,rev_rollup_id#1278] Batched: true, DataFilters: [isnotnull(cntry_id#1269)], Format: Parquet, Location: InMemoryFileIndex[file:/user/hive/warehouse/small_table2], PartitionFilters: [], PushedFilters: [IsNotNull(cntry_id)], ReadSchema: struct<cntry_id:decimal(4,0),rev_rollup_id:smallint> : : : +- BroadcastExchange 
HashedRelationBroadcastMode(List(cast(cast(input[0, smallint, true] as int) as bigint))), [id=#342] : : : +- (7) Project [rev_rollup_id#1286 AS gen_attr_311#774, curncy_id#1289 AS gen_attr_308#777] : : : +- (7) Filter isnotnull(rev_rollup_id#1286) : : : +- (7) ColumnarToRow : : : +- FileScan parquet default.small_table3[rev_rollup_id#1286,curncy_id#1289] Batched: true, DataFilters: [isnotnull(rev_rollup_id#1286)], Format: Parquet, Location: InMemoryFileIndex[file:/user/hive/warehouse/small_table3], PartitionFilters: [], PushedFilters: [IsNotNull(rev_rollup_id)], ReadSchema: struct<rev_rollup_id:smallint,curncy_id:decimal(4,0)> : : +- BroadcastExchange HashedRelationBroadcastMode(List(cast(input[0, decimal(9,0), true] as decimal(20,0)))), [id=#351] : : +- (8) Project [CURNCY_ID#1263 AS gen_attr_309#803] : : +- (8) Filter isnotnull(CURNCY_ID#1263) : : +- (8) ColumnarToRow : : +- FileScan parquet default.small_table1[CURNCY_ID#1263] Batched: true, DataFilters: [isnotnull(CURNCY_ID#1263)], Format: Parquet, Location: InMemoryFileIndex[file:/user/hive/warehouse/small_table1], PartitionFilters: [], PushedFilters: [IsNotNull(CURNCY_ID)], ReadSchema: struct<CURNCY_ID:decimal(9,0)>, SelectedBucketsCount: 1 out of 1 : +- BroadcastExchange HashedRelationBroadcastMode(List(input[0, decimal(4,0), true])), [id=#360] : +- (9) Project [cntry_id#1269, rev_rollup#1279] : +- (9) Filter isnotnull(cntry_id#1269) : +- *(9) ColumnarToRow : +- FileScan parquet default.small_table2[cntry_id#1269,rev_rollup#1279] Batched: true, DataFilters: [isnotnull(cntry_id#1269)], Format: Parquet, Location: InMemoryFileIndex[file:/user/hive/warehouse/small_table2], PartitionFilters: [], PushedFilters: [IsNotNull(cntry_id)], ReadSchema: struct<cntry_id:decimal(4,0),rev_rollup:string> +- ReusedExchange [cntry_id#1309, rev_rollup#1319], BroadcastExchange HashedRelationBroadcastMode(List(input[0, decimal(4,0), true])), [id=#360] ``` This PR try to improve `ResolveTables` and `ResolveRelations` performance by 
reducing the number of connections to the Hive Metastore Server in such cases. ### Why are the changes needed? 1. Reduce the number of connections to the Hive Metastore Server. 2. Improve `ResolveTables` and `ResolveRelations` performance. ### Does this PR introduce any user-facing change? No. ### How was this patch tested? Manual test. After [SPARK-29606](https://issues.apache.org/jira/browse/SPARK-29606) and before this PR: ``` === Metrics of Analyzer/Optimizer Rules === Total number of runs: 9323 Total time: 2.687441263 seconds Rule Effective Time / Total Time Effective Runs / Total Runs org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveRelations 929173767 / 930133504 2 / 18 org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveTables 0 / 383363402 0 / 18 org.apache.spark.sql.catalyst.optimizer.EliminateOuterJoin 0 / 99433540 0 / 4 org.apache.spark.sql.catalyst.analysis.DecimalPrecision 41809394 / 83727901 2 / 18 org.apache.spark.sql.execution.datasources.PruneFileSourcePartitions 71372977 / 71372977 1 / 1 org.apache.spark.sql.catalyst.analysis.TypeCoercion$ImplicitTypeCasts 0 / 59071933 0 / 18 org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveReferences 37858325 / 58471776 5 / 18 org.apache.spark.sql.catalyst.analysis.TypeCoercion$PromoteStrings 20889892 / 53229016 1 / 18 org.apache.spark.sql.catalyst.analysis.TypeCoercion$FunctionArgumentConversion 23428968 / 50890815 1 / 18 org.apache.spark.sql.catalyst.analysis.TypeCoercion$InConversion 23230666 / 49182607 1 / 18 org.apache.spark.sql.catalyst.analysis.Analyzer$ExtractGenerator 0 / 43638350 0 / 18 org.apache.spark.sql.catalyst.optimizer.ColumnPruning 17194844 / 42530885 1 / 6 ``` After [SPARK-29606](https://issues.apache.org/jira/browse/SPARK-29606) and after this PR: ``` === Metrics of Analyzer/Optimizer Rules === Total number of runs: 9323 Total time: 2.163765869 seconds Rule Effective Time / Total Time Effective Runs / Total Runs org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveRelations 658905353 /
659829383 2 / 18 org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveTables 0 / 220708715 0 / 18 org.apache.spark.sql.catalyst.optimizer.EliminateOuterJoin 0 / 99606816 0 / 4 org.apache.spark.sql.catalyst.analysis.DecimalPrecision 39616060 / 78215752 2 / 18 org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveReferences 36706549 / 54917789 5 / 18 org.apache.spark.sql.execution.datasources.PruneFileSourcePartitions 53561921 / 53561921 1 / 1 org.apache.spark.sql.catalyst.analysis.TypeCoercion$ImplicitTypeCasts 0 / 52329678 0 / 18 org.apache.spark.sql.catalyst.analysis.TypeCoercion$PromoteStrings 20945755 / 49695998 1 / 18 org.apache.spark.sql.catalyst.analysis.TypeCoercion$FunctionArgumentConversion 20872241 / 46740145 1 / 18 org.apache.spark.sql.catalyst.analysis.TypeCoercion$InConversion 19780298 / 44327227 1 / 18 org.apache.spark.sql.catalyst.analysis.Analyzer$ExtractGenerator 0 / 42312023 0 / 18 org.apache.spark.sql.catalyst.optimizer.ColumnPruning 17197393 / 39501424 1 / 6 ``` Closes #26589 from wangyum/SPARK-29947. Authored-by: Yuming Wang <yumwang@ebay.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2020-01-03 22:36:31 +08:00
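The commit message for SPARK-29947 does not spell out how the number of metastore connections is reduced. One plausible reading, shown here purely as an assumption, is that a table referenced several times in one query is looked up once and the result reused. The sketch below is a plain Python model, not Spark code; `CountingMetastore` and `resolve_relations` are hypothetical names introduced for illustration.

```python
from typing import Callable, Dict, List


class CountingMetastore:
    """Hypothetical stand-in for a metastore client that counts lookups."""

    def __init__(self, tables: Dict[str, str]):
        self._tables = dict(tables)
        self.calls = 0

    def get_table(self, name: str) -> str:
        self.calls += 1  # each call models one round trip to the metastore
        return self._tables[name]


def resolve_relations(names: List[str], lookup: Callable[[str], str]) -> List[str]:
    """Resolve every table reference, looking each distinct name up only once."""
    cache: Dict[str, str] = {}  # assumed per-query cache, discarded afterwards
    resolved = []
    for name in names:
        if name not in cache:
            cache[name] = lookup(name)
        resolved.append(cache[name])
    return resolved
```

With a query that references the same small table several times, as in the physical plan quoted in the commit message, the number of lookups in this model equals the number of distinct tables rather than the number of references.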
yi.wu	e38964c442	[SPARK-29768][SQL][FOLLOW-UP] Improve handling non-deterministic filter of ScanOperation ### What changes were proposed in this pull request? 1. For `ScanOperation`, if it collects more than one filter, all of the filters must be deterministic. A filter can be non-deterministic only if it is the only collected filter. 2. `FileSourceStrategy` should filter out non-deterministic filters, because a partition-related non-deterministic filter would otherwise hit a "not initialized" exception. ### Why are the changes needed? To strictly follow the behavior of `CombineFilters`, which does not combine two filters when non-deterministic predicates exist, and to avoid hitting an exception for file sources. ### Does this PR introduce any user-facing change? No. ### How was this patch tested? Existing tests. Closes #27073 from Ngone51/SPARK-29768-FOLLOWUP. Authored-by: yi.wu <yi.wu@databricks.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2020-01-03 21:48:14 +08:00
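The two rules stated in SPARK-29768's follow-up can be modelled in a few lines. The sketch below is a plain Python model, not the `ScanOperation` or `FileSourceStrategy` implementation: the `Filter` class and all function names are hypothetical, and stopping collection at the first filter that would break the rule is an assumption made for illustration.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Filter:
    """Hypothetical filter: a condition string plus a determinism flag."""
    condition: str
    deterministic: bool = True


def can_collect(collected: List[Filter], candidate: Filter) -> bool:
    """Rule 1: several filters may be collected only if all are deterministic."""
    if not collected:
        return True  # a single collected filter may be non-deterministic
    return candidate.deterministic and all(f.deterministic for f in collected)


def collect_filters(filters: List[Filter]) -> List[Filter]:
    """Collect filters in order, stopping before the first one that breaks rule 1."""
    collected: List[Filter] = []
    for f in filters:
        if not can_collect(collected, f):
            break
        collected.append(f)
    return collected


def deterministic_only(filters: List[Filter]) -> List[Filter]:
    """Rule 2: keep only deterministic filters, as a file source would."""
    return [f for f in filters if f.deterministic]
```

In this model a lone `rand()`-style filter is collected on its own, while a non-deterministic filter that follows a deterministic one ends the collection.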
Kent Yao	c49388a484	[SPARK-30214][SQL] A new framework to resolve v2 commands ### What changes were proposed in this pull request? Currently, we have a v2 adapter for the v1 catalog (`V2SessionCatalog`), so all table/namespace commands can be implemented via v2 APIs. Usually, a command needs to know which catalog it operates on, but different commands have different requirements about what to resolve. A few examples: - `DROP NAMESPACE`: only needs to know the name of the namespace. - `DESC NAMESPACE`: needs to look up the namespace and get metadata, but this is done during execution. - `DROP TABLE`: needs to do the lookup and make sure it's a table, not a (temp) view. - `DESC TABLE`: needs to look up the table and get metadata. For namespaces, the analyzer only needs to find the catalog and the namespace name. The command can do the lookup during execution if needed. For tables, most commands need the analyzer to do the lookup. Note that tables and namespaces differ: `DESC NAMESPACE testcat` works and describes the root namespace under `testcat`, while `DESC TABLE testcat` fails if there is no table `testcat` under the current catalog. This is because namespaces can be named [], but tables can't. A command should explicitly specify whether it operates on a namespace or a table. In this pull request, we introduce a new framework to resolve v2 commands: 1. The parser creates logical plans or commands with `UnresolvedNamespace`/`UnresolvedTable`/`UnresolvedView`/`UnresolvedRelation`. (CREATE TABLE still keeps Seq[String], as it doesn't need to look up relations) 2. The analyzer converts 2.1 `UnresolvedNamespace` to `ResolvedNamespace` (contains catalog and namespace identifier) 2.2 `UnresolvedTable` to `ResolvedTable` (contains catalog, identifier and `Table`) 2.3 `UnresolvedView` to `ResolvedView` (will be added later when we migrate view commands) 2.4 `UnresolvedRelation` to relation. 3. An extra analyzer rule matches commands with `V1Table` and converts them to the corresponding v1 commands. 
This will be added later when we migrate existing commands. 4. The planner matches commands and converts them to the corresponding physical nodes. We also introduce brand new v2 commands, the `comment` syntaxes, to illustrate how to work with the newly added framework. ```sql COMMENT ON (DATABASE|SCHEMA|NAMESPACE) ... IS ... COMMENT ON TABLE ... IS ... ``` Details about the `comment` syntaxes: With the new design of catalog v2, some properties become reserved, e.g. `location`, `comment`. We are going to disallow setting reserved properties through dbproperties or tblproperties directly, to avoid conflicts with their related sub-clauses or specific commands. This follows the practices of PostgreSQL and Presto. https://www.postgresql.org/docs/12/sql-comment.html https://prestosql.io/docs/current/sql/comment.html The basic ideas of the new framework mostly came from the discussion with cloud-fan at https://github.com/apache/spark/pull/26847#issuecomment-564510061. ### Why are the changes needed? To make it easier to add new v2 commands, and easier to unify the table relation behavior. ### Does this PR introduce any user-facing change? Yes, it adds new syntax. ### How was this patch tested? Added unit tests. Closes #26847 from yaooqinn/SPARK-30214. Authored-by: Kent Yao <yaooqinn@hotmail.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2020-01-03 16:09:06 +08:00
Maxim Gekk	51373467cc	[SPARK-30412][SQL][TESTS] Eliminate warnings in Java tests regarding to deprecated Spark SQL API ### What changes were proposed in this pull request? In the PR, I propose to add the `SuppressWarnings("deprecation")` annotation to Java tests for deprecated Spark SQL APIs. ### Why are the changes needed? This eliminates the following warnings: ``` sql/core/src/test/java/test/org/apache/spark/sql/JavaDatasetAggregatorSuite.java Warning:Warning:line (32)java: org.apache.spark.sql.expressions.javalang.typed in org.apache.spark.sql.expressions.javalang has been deprecated Warning:Warning:line (91)java: org.apache.spark.sql.expressions.javalang.typed in org.apache.spark.sql.expressions.javalang has been deprecated Warning:Warning:line (100)java: org.apache.spark.sql.expressions.javalang.typed in org.apache.spark.sql.expressions.javalang has been deprecated Warning:Warning:line (109)java: org.apache.spark.sql.expressions.javalang.typed in org.apache.spark.sql.expressions.javalang has been deprecated Warning:Warning:line (118)java: org.apache.spark.sql.expressions.javalang.typed in org.apache.spark.sql.expressions.javalang has been deprecated sql/core/src/test/java/test/org/apache/spark/sql/Java8DatasetAggregatorSuite.java Warning:Warning:line (28)java: org.apache.spark.sql.expressions.javalang.typed in org.apache.spark.sql.expressions.javalang has been deprecated Warning:Warning:line (37)java: org.apache.spark.sql.expressions.javalang.typed in org.apache.spark.sql.expressions.javalang has been deprecated Warning:Warning:line (46)java: org.apache.spark.sql.expressions.javalang.typed in org.apache.spark.sql.expressions.javalang has been deprecated Warning:Warning:line (55)java: org.apache.spark.sql.expressions.javalang.typed in org.apache.spark.sql.expressions.javalang has been deprecated Warning:Warning:line (64)java: org.apache.spark.sql.expressions.javalang.typed in org.apache.spark.sql.expressions.javalang has been deprecated 
sql/core/src/test/java/test/org/apache/spark/sql/JavaDataFrameSuite.java Warning:Warning:line (478)java: json(org.apache.spark.api.java.JavaRDD<java.lang.String>) in org.apache.spark.sql.DataFrameReader has been deprecated ``` and highlights warnings about real problems. ### Does this PR introduce any user-facing change? No ### How was this patch tested? By existing test suites `Java8DatasetAggregatorSuite.java`, `JavaDataFrameSuite.java` and `JavaDatasetAggregatorSuite.java`. Closes #27081 from MaxGekk/eliminate-warnings-part2. Authored-by: Maxim Gekk <max.gekk@gmail.com> Signed-off-by: HyukjinKwon <gurwls223@apache.org>	2020-01-03 13:26:48 +09:00
Maxim Gekk	a469976e6e	[SPARK-29930][SQL][FOLLOW-UP] Allow only default value to be set for removed SQL configs ### What changes were proposed in this pull request? In the PR, I propose to throw `AnalysisException` when a removed SQL config is set to non-default value. The following SQL configs removed by #26559 are marked as removed: 1. `spark.sql.fromJsonForceNullableSchema` 2. `spark.sql.legacy.compareDateTimestampInTimestamp` 3. `spark.sql.legacy.allowCreatingManagedTableUsingNonemptyLocation` ### Why are the changes needed? To improve user experience with Spark SQL by notifying of removed SQL configs used by users. ### Does this PR introduce any user-facing change? Yes, before the `set` command was silently ignored: ```sql spark-sql> set spark.sql.fromJsonForceNullableSchema=false; spark.sql.fromJsonForceNullableSchema false ``` after the exception should be raised: ```sql spark-sql> set spark.sql.fromJsonForceNullableSchema=false; Error in query: The SQL config 'spark.sql.fromJsonForceNullableSchema' was removed in the version 3.0.0. It was removed to prevent errors like SPARK-23173 for non-default value.; ``` ### How was this patch tested? Added new tests into `SQLConfSuite` for both cases when removed SQL configs are set to default and non-default values. Closes #27057 from MaxGekk/remove-sql-configs-followup. Authored-by: Maxim Gekk <max.gekk@gmail.com> Signed-off-by: HyukjinKwon <gurwls223@apache.org>	2020-01-03 10:41:30 +09:00
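The check described in SPARK-29930 above could look roughly like the sketch below. It is only an illustration: `RemovedConfig`, the `removed` table and the in-memory `settings` map are hypothetical, `IllegalArgumentException` stands in for Spark's `AnalysisException`, and the default value `"true"` is an assumption inferred from the example in the commit message (setting `false` is rejected).

```scala
import scala.collection.mutable

object RemovedConfigSketch {
  // Version in which the key was removed and the only value still accepted.
  final case class RemovedConfig(version: String, defaultValue: String)

  // Assumed default; see the lead-in.
  val removed: Map[String, RemovedConfig] = Map(
    "spark.sql.fromJsonForceNullableSchema" -> RemovedConfig("3.0.0", "true"))

  private val settings = mutable.Map.empty[String, String]

  def set(key: String, value: String): Unit = {
    // Reject a removed key unless it is being set to its old default value.
    removed.get(key).foreach { cfg =>
      if (value != cfg.defaultValue) {
        throw new IllegalArgumentException(
          s"The SQL config '$key' was removed in the version ${cfg.version}.")
      }
    }
    settings(key) = value
  }

  def get(key: String): Option[String] = settings.get(key)
}
```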
Kent Yao	e04309cb1f	[SPARK-30341][SQL] Overflow check for interval arithmetic operations ### What changes were proposed in this pull request? 1. For the interval arithmetic functions, e.g. `add`/`subtract`/`negative`/`multiply`/`divide`, enable overflow check when `ANSI` is on. 2. For `multiply`/`divide`, throw an exception when an overflow happens, regardless of whether `ANSI` is on or off. 3. `add`/`subtract`/`negative` stay the same for backward compatibility. 4. `divide` by 0 throws `ArithmeticException` whether `ANSI` is on or not, the same as numerics. 5. These behaviors fit the numeric type operations fully when ANSI is on. 6. These behaviors fit the numeric type operations fully when ANSI is off, except 2 and 4. ### Why are the changes needed? 1. bug fix 2. `ANSI` support ### Does this PR introduce any user-facing change? When `ANSI` is on, interval `add`/`subtract`/`negative`/`multiply`/`divide` will overflow if any field overflows. ### How was this patch tested? Added unit tests. Closes #26995 from yaooqinn/SPARK-30341. Authored-by: Kent Yao <yaooqinn@hotmail.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2020-01-03 02:04:20 +08:00
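The rules listed in SPARK-30341 above can be illustrated with a small sketch built on `java.lang.Math`'s exact-arithmetic methods. The `Interval` class here is a hypothetical two-field stand-in (Spark's interval type has more fields), and the integer multiplier and divisor are a simplification assumed for the example.

```scala
object IntervalOverflowSketch {
  // Hypothetical simplified interval: months and microseconds only.
  final case class Interval(months: Int, micros: Long)

  // add: overflow is checked only when ANSI mode is on; otherwise it wraps.
  def add(a: Interval, b: Interval, ansi: Boolean): Interval =
    if (ansi) Interval(Math.addExact(a.months, b.months), Math.addExact(a.micros, b.micros))
    else Interval(a.months + b.months, a.micros + b.micros)

  // multiply: overflow always throws ArithmeticException, ANSI or not.
  def multiply(a: Interval, n: Int): Interval =
    Interval(Math.multiplyExact(a.months, n), Math.multiplyExact(a.micros, n.toLong))

  // divide: division by zero always throws, ANSI or not.
  def divide(a: Interval, n: Int): Interval = {
    if (n == 0) throw new ArithmeticException("divide by zero")
    // MinValue / -1 is the only overflowing integer division; negateExact catches it.
    if (n == -1) Interval(Math.negateExact(a.months), Math.negateExact(a.micros))
    else Interval(a.months / n, a.micros / n)
  }
}
```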
Wenchen Fan	68260f5297	revert [SPARK-29680][SQL] Remove ALTER TABLE CHANGE COLUMN syntax ### What changes were proposed in this pull request? Revert https://github.com/apache/spark/pull/26338 , as the syntax is actually the [hive style ALTER COLUMN](https://cwiki.apache.org/confluence/display/Hive/LanguageManual+DDL#LanguageManualDDL-ChangeColumnName/Type/Position/Comment). This PR brings it back, and make it support multi-catalog: 1. renaming is not allowed as `AlterTableAlterColumnStatement` can't do renaming. 2. column name should be multi-part ### Why are the changes needed? to not break hive compatibility. ### Does this PR introduce any user-facing change? no, as the removal was merged in 3.0. ### How was this patch tested? new parser tests Closes #27076 from cloud-fan/alter. Authored-by: Wenchen Fan <wenchen@databricks.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2020-01-03 01:55:38 +08:00
Wenchen Fan	1743d5be7f	[SPARK-30284][SQL] CREATE VIEW should keep the current catalog and namespace ### What changes were proposed in this pull request? Update CREATE VIEW command to store the current catalog and namespace instead of current database in view metadata. Also update analyzer to leverage the catalog and namespace in view metastore to resolve relations inside views. Note that, this PR still keeps the way we resolve views, by recursively calling Analyzer. This is necessary because view text may contain CTE, window spec, etc. which needs rules outside of the main resolution batch (e.g. `CTESubstitution`) ### Why are the changes needed? To resolve relations inside view correctly. ### Does this PR introduce any user-facing change? Yes, fix a bug. Now tables referred by a view can be resolved correctly even if the current catalog/namespace has been updated. ### How was this patch tested? a new test Closes #26923 from cloud-fan/view. Authored-by: Wenchen Fan <wenchen@databricks.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2020-01-03 01:41:32 +08:00
jiake	6bd5494f34	[SPARK-30403][SQL] fix the NoSuchElementException when enable AQE with InSubquery expression ### What changes were proposed in this pull request? This PR aims to fix the `NoSuchElementException` thrown when AQE is enabled with an `InSubquery` expression. ### Why are the changes needed? To fix the exception. ### Does this PR introduce any user-facing change? No ### How was this patch tested? Added a new UT. Closes #27068 from JkSelf/fixSubqueryIssue. Authored-by: jiake <ke.a.jia@intel.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2020-01-02 23:11:56 +08:00
jiake	05f7b57ddc	[SPARK-30407][SQL] fix the reset metric issue when enable AQE ### What changes were proposed in this pull request? When working on [PR#26813](https://github.com/apache/spark/pull/26813), we encountered the exception [here (the number of metrics(1) is 2, not 1)](`5d870ef0bc/sql/core/src/test/scala/org/apache/spark/sql/util/DataFrameCallbackSuite.scala (L120)`). This PR fixes that exception. ### Why are the changes needed? To fix the exception. ### Does this PR introduce any user-facing change? No ### How was this patch tested? [This test with AQE enabled](`5d870ef0bc/sql/core/src/test/scala/org/apache/spark/sql/util/DataFrameCallbackSuite.scala (L120)`). Closes #27074 from JkSelf/resetMetricsIssue. Authored-by: jiake <ke.a.jia@intel.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2020-01-02 21:55:36 +08:00
Jungtaek Lim (HeartSaVioR)	5d870ef0bc	[SPARK-26560][SQL] Spark should be able to run Hive UDF using jar regardless of current thread context classloader ### What changes were proposed in this pull request? This patch is based on #23921 but revised to be simpler, and adds a UT to test the behavior. (This patch contains the commit from #23921 to retain credit.) Spark loads new JARs for `ADD JAR` and `CREATE FUNCTION ... USING JAR` into the jar classloader in shared state, and changes the current thread's context classloader to the jar classloader, as many parts of the remaining code rely on the current thread's context classloader. This works if further queries run in the same thread and the thread's context classloader does not change, but once the context classloader of the current thread is switched back for various reasons, Spark fails to create an instance of the class for the function. This bug mostly affects spark-shell, as spark-shell rolls back the current thread's context classloader at every prompt. But it may also affect job servers, where queries may be running in multiple threads. This patch fixes the issue by switching the context classloader to the classloader which loads the class. Fortunately, the FunctionBuilder created by `makeFunctionBuilder` has the Class information as part of its closure, hence the Class itself can be provided regardless of the current thread's context classloader. ### Why are the changes needed? Without this patch, end users cannot execute a Hive UDF using JAR twice in spark-shell. ### Does this PR introduce any user-facing change? No. ### How was this patch tested? New UT. Closes #27025 from HeartSaVioR/SPARK-26560-revised. Lead-authored-by: Jungtaek Lim (HeartSaVioR) <kabhwan.opensource@gmail.com> Co-authored-by: nivo091 <nivedeeta.singh@gmail.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2020-01-02 15:44:45 +08:00
yi.wu	83d289eef4	[SPARK-27638][SQL][FOLLOW-UP] Format config name to follow the other boolean conf naming convention ### What changes were proposed in this pull request? Change config name from `spark.sql.legacy.typeCoercion.datetimeToString` to `spark.sql.legacy.typeCoercion.datetimeToString.enabled`. ### Why are the changes needed? To follow the other boolean conf naming convention. ### Does this PR introduce any user-facing change? No, it's newly added in Spark 3.0. ### How was this patch tested? Pass Jenkins Closes #27065 from Ngone51/SPARK-27638-FOLLOWUP. Authored-by: yi.wu <yi.wu@databricks.com> Signed-off-by: HyukjinKwon <gurwls223@apache.org>	2020-01-02 15:35:33 +09:00
yi.wu	90794b617c	[SPARK-27871][SQL][FOLLOW-UP] Format config name to follow the other boolean conf naming convention ### What changes were proposed in this pull request? Change config name from `spark.sql.optimizer.reassignLambdaVariableID` to `spark.sql.optimizer.reassignLambdaVariableID.enabled`. ### Why are the changes needed? To follow the other boolean conf naming convention. ### Does this PR introduce any user-facing change? No, it's newly added in Spark 3.0. ### How was this patch tested? Pass Jenkins. Closes #27063 from Ngone51/SPARK-27871-FOLLOWUP. Authored-by: yi.wu <yi.wu@databricks.com> Signed-off-by: HyukjinKwon <gurwls223@apache.org>	2020-01-02 09:59:15 +09:00
Maxim Gekk	b316d37365	[SPARK-30401][SQL] Call `requireNonStaticConf()` only once in `set()` ### What changes were proposed in this pull request? Calls of `requireNonStaticConf()` are removed from the `set()` methods in RuntimeConfig because those methods invoke `def set(key: String, value: String): Unit` where `requireNonStaticConf()` is called as well. ### Why are the changes needed? To avoid unnecessary calls of `requireNonStaticConf()`. ### Does this PR introduce any user-facing change? No ### How was this patch tested? By existing tests from `SQLConfSuite` Closes #27062 from MaxGekk/call-requireNonStaticConf-once. Authored-by: Maxim Gekk <max.gekk@gmail.com> Signed-off-by: HyukjinKwon <gurwls223@apache.org>	2020-01-02 09:56:50 +09:00
Zhenhua Wang	a8bf5d823b	[SPARK-30339][SQL] Avoid to fail twice in function lookup ### What changes were proposed in this pull request? Currently, if function lookup fails, Spark gives it a second chance by casting decimal type to double type. But for cases where no decimal type is involved, looking up again is meaningless and incurs extra cost such as unnecessary metastore access. We should throw exceptions directly in these cases. ### Does this PR introduce any user-facing change? No. ### How was this patch tested? Covered by existing tests. Closes #26994 from wzhfy/avoid_udf_fail_twice. Authored-by: Zhenhua Wang <wzh_zju@163.com> Signed-off-by: HyukjinKwon <gurwls223@apache.org>	2019-12-31 01:09:51 +09:00
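The idea in SPARK-30339 above, retrying a failed lookup only when a decimal argument is actually present, can be sketched as follows. The `DataType` objects, the one-function registry and the `lookups` counter (standing in for metastore accesses) are hypothetical and exist only for this illustration.

```scala
object FunctionLookupSketch {
  sealed trait DataType
  case object DecimalType extends DataType
  case object DoubleType extends DataType
  case object StringType extends DataType

  // Counts simulated metastore accesses.
  var lookups = 0

  // Hypothetical registry: only a function `f` over doubles exists.
  private def lookupOnce(name: String, args: Seq[DataType]): Option[String] = {
    lookups += 1
    if (name == "f" && args.forall(_ == DoubleType)) Some("f(double...)") else None
  }

  def lookup(name: String, args: Seq[DataType]): String =
    lookupOnce(name, args).getOrElse {
      if (args.contains(DecimalType)) {
        // Second chance: cast decimal arguments to double and look up again.
        val casted: Seq[DataType] = args.map {
          case DecimalType => DoubleType
          case other => other
        }
        lookupOnce(name, casted).getOrElse(throw new NoSuchElementException(name))
      } else {
        // No decimal involved, so a retry could not succeed: fail right away.
        throw new NoSuchElementException(name)
      }
    }
}
```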
Jungtaek Lim (HeartSaVioR)	e054a0af6f	[SPARK-29348][SQL][FOLLOWUP] Fix slight bug on streaming example for Dataset.observe ### What changes were proposed in this pull request? This patch fixes a small bug in the example of streaming query, as the type of observable metrics is Java Map instead of Scala Map, so to use foreach it should be converted first. ### Why are the changes needed? Described above. ### Does this PR introduce any user-facing change? No. ### How was this patch tested? Ran below query via `spark-shell`: Streaming ```scala import scala.collection.JavaConverters._ import scala.util.Random import org.apache.spark.sql.streaming.StreamingQueryListener import org.apache.spark.sql.streaming.StreamingQueryListener._ spark.streams.addListener(new StreamingQueryListener() { override def onQueryProgress(event: QueryProgressEvent): Unit = { event.progress.observedMetrics.asScala.get("my_event").foreach { row => // Trigger if the number of errors exceeds 5 percent val num_rows = row.getAs[Long]("rc") val num_error_rows = row.getAs[Long]("erc") val ratio = num_error_rows.toDouble / num_rows if (ratio > 0.05) { // Trigger alert println(s"alert! error ratio: $ratio") } } } def onQueryStarted(event: QueryStartedEvent): Unit = {} def onQueryTerminated(event: QueryTerminatedEvent): Unit = {} }) val rates = spark .readStream .format("rate") .option("rowsPerSecond", 10) .load val rand = new Random() val df = rates.map { row => (row.getLong(1), if (row.getLong(1) % 2 == 0) "error" else null) }.toDF val ds = df.selectExpr("_1 AS id", "_2 AS error") // Observe row count (rc) and error row count (erc) in the batch Dataset val observed_ds = ds.observe("my_event", count(lit(1)).as("rc"), count($"error").as("erc")) observed_ds.writeStream.format("console").start() ``` Closes #27046 from HeartSaVioR/SPARK-29348-FOLLOWUP. Authored-by: Jungtaek Lim (HeartSaVioR) <kabhwan.opensource@gmail.com> Signed-off-by: HyukjinKwon <gurwls223@apache.org>	2019-12-31 01:08:25 +09:00
HyukjinKwon	7079e871a7	[SPARK-30185][SQL] Implement Dataset.tail API ### What changes were proposed in this pull request? This PR proposes a `tail` API. Namely, as below: ```scala scala> spark.range(10).head(5) res1: Array[Long] = Array(0, 1, 2, 3, 4) scala> spark.range(10).tail(5) res2: Array[Long] = Array(5, 6, 7, 8, 9) ``` Implementation details will be similar to `head` but reversed: 1. Run the job against the last partition and collect rows. If this is enough, return as is. 2. If this is not enough, calculate the number of partitions to select more based upon `spark.sql.limit.scaleUpFactor` 3. Run more jobs against more partitions (in a reversed order compared to head) as many as the number calculated from 2. 4. Go to 2. Note that we don't guarantee the natural order in DataFrame in general - there are cases when it's deterministic and when it's not. We probably should write this down as a caveat separately. ### Why are the changes needed? Many other systems support a way to take data from the end, for instance, pandas[1] and Python[2][3]. Scala collections APIs also have `head` and `tail`. On the other hand, in Spark, we only provide a way to take data from the start (e.g., DataFrame.head). This has been requested multiple times here and there in the Spark user mailing list[4], StackOverflow[5][6], JIRA[7] and other third party projects such as Koalas[8]. In addition, this missing API seems to be explicitly mentioned in comparison to another system[9] from time to time. It seems we're missing a non-trivial use case in Spark and this motivated me to propose this API. 
[1] https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.tail.html?highlight=tail#pandas.DataFrame.tail [2] https://stackoverflow.com/questions/10532473/head-and-tail-in-one-line [3] https://stackoverflow.com/questions/646644/how-to-get-last-items-of-a-list-in-python [4] http://apache-spark-user-list.1001560.n3.nabble.com/RDD-tail-td4217.html [5] https://stackoverflow.com/questions/39544796/how-to-select-last-row-and-also-how-to-access-pyspark-dataframe-by-index [6] https://stackoverflow.com/questions/45406762/how-to-get-the-last-row-from-dataframe [7] https://issues.apache.org/jira/browse/SPARK-26433 [8] https://github.com/databricks/koalas/issues/343 [9] https://medium.com/chris_bour/6-differences-between-pandas-and-spark-dataframes-1380cec394d2 ### Does this PR introduce any user-facing change? No, (new API) ### How was this patch tested? Unit tests were added and manually tested. Closes #26809 from HyukjinKwon/wip-tail. Authored-by: HyukjinKwon <gurwls223@apache.org> Signed-off-by: HyukjinKwon <gurwls223@apache.org>	2019-12-31 01:07:09 +09:00
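The four steps listed for `tail` in SPARK-30185 above can be modelled on plain collections. This is a deliberately simplified sketch, not the Spark implementation: partitions are an in-memory `IndexedSeq`, each loop iteration stands in for one job, and the batch size simply grows by a hypothetical `scaleUpFactor` parameter (assumed here to play the role of `spark.sql.limit.scaleUpFactor`).

```scala
object TailSketch {
  def tail[T](partitions: IndexedSeq[Seq[T]], n: Int, scaleUpFactor: Int = 4): Seq[T] = {
    var collected = Vector.empty[T] // rows gathered so far, in original order
    var scanned = 0                 // partitions already read, counted from the end
    var batch = 1                   // partitions the next simulated job reads
    while (collected.size < n && scanned < partitions.size) {
      val until = partitions.size - scanned
      val from = math.max(0, until - batch)
      // Read the next batch of partitions, moving backwards from the end.
      val rows = partitions.slice(from, until).flatten
      collected = (rows ++ collected).toVector
      scanned += until - from
      batch *= scaleUpFactor
    }
    collected.takeRight(n)
  }
}
```

With four partitions holding `0..9`, asking for the last five rows reads the last partition first and then, because two rows are not enough, the remaining partitions in one larger batch.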
Terry Kim	a90ad5bf2a	[SPARK-30370][SQL] Update SqlBase.g4 to combine namespace and database tokens ### What changes were proposed in this pull request? In `SqlBase.g4`, `database` is defined as ``` database : DATABASE | SCHEMA; ``` and it is being used as `(database | NAMESPACE)` in many places. This PR proposes to define the following and use it as discussed in https://github.com/apache/spark/pull/26847/files#r359754778: ``` namespace : NAMESPACE | DATABASE | SCHEMA; ``` ### Why are the changes needed? To simplify the grammar. ### Does this PR introduce any user-facing change? No. ### How was this patch tested? There is no change in the actual grammar, so the existing tests should be sufficient. Closes #27027 from imback82/sqlbase_namespace. Authored-by: Terry Kim <yuminkim@gmail.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2019-12-30 14:58:02 +08:00
Xiao Li	919d551ddb	Revert "[SPARK-29390][SQL] Add the justify_days(), justify_hours() and justif_interval() functions" This reverts commit `f926809a1f`. Closes #27032 from gatorsmile/revertSPARK-29390. Authored-by: Xiao Li <gatorsmile@gmail.com> Signed-off-by: Xiao Li <gatorsmile@gmail.com>	2019-12-29 15:25:14 -08:00
sandeep katta	16e5e79877	[SPARK-28670][SQL] create function should thrown Exception if the resource is not found ## What changes were proposed in this pull request? Creating a temporary or permanent function should throw `AnalysisException` if the resource is not found. The behavior needs to be consistent across permanent and temporary functions. ## How was this patch tested? Added a UT and also tested manually. Before the fix: if the UDF resource is not present, creating a temporary function throws `AnalysisException`, whereas creating a permanent function does not. A permanent function throws `AnalysisException` only after a select operation is performed. After the fix: for both temporary and permanent functions, the resource is checked, and if the UDF resource is not found, `AnalysisException` is thrown. ![rt](https://user-images.githubusercontent.com/35216143/62781519-d1131580-bad5-11e9-9d58-69e65be86c03.png) Closes #25399 from sandeep-katta/funcIssue. Authored-by: sandeep katta <sandeep.katta2007@gmail.com> Signed-off-by: Takeshi Yamamuro <yamamuro@apache.org>	2019-12-28 14:35:33 +09:00
Kent Yao	f0bf2eb006	[SPARK-30356][SQL] Codegen support for the function str_to_map ### What changes were proposed in this pull request? `str_to_map` has not been implemented with codegen support, which prevents a query that contains this expression from being whole-stage codegen-ed. This PR removes `CodegenFallback` from `StringToMap` and adds codegen support for it. ### Why are the changes needed? To improve codegen coverage and gain better performance. ### Does this PR introduce any user-facing change? no ### How was this patch tested? 1. pass ComplexTypeSuite 2. manually review generated code ```java -- !query 12 explain codegen select v, str_to_map(v) from values ('abc:a:a,:'), (null), (''), ('1:2') t(v) -- !query 12 schema struct<plan:string> -- !query 12 output Found 1 WholeStageCodegen subtrees. == Subtree 1 / 1 (maxMethodCodeSize:511; maxConstantPoolSize:188(0.29% used); numInnerClasses:0) == Project [v#x, str_to_map(v#x, ,, :) AS str_to_map(v, ,, :)#x] +- LocalTableScan [v#x] Generated code: /* 001 */ public Object generate(Object[] references) { /* 002 */ return new GeneratedIteratorForCodegenStage1(references); /* 003 */ } /* 004 */ /* 005 */ // codegenStageId=1 /* 006 */ final class GeneratedIteratorForCodegenStage1 extends org.apache.spark.sql.execution.BufferedRowIterator { /* 007 */ private Object[] references; /* 008 */ private scala.collection.Iterator[] inputs; /* 009 */ private scala.collection.Iterator localtablescan_input_0; /* 010 */ private org.apache.spark.sql.catalyst.expressions.codegen.UnsafeRowWriter[] project_mutableStateArray_0 = new org.apache.spark.sql.catalyst.expressions.codegen.UnsafeRowWriter[1]; /* 011 */ private org.apache.spark.sql.catalyst.expressions.codegen.UnsafeArrayWriter[] project_mutableStateArray_1 = new org.apache.spark.sql.catalyst.expressions.codegen.UnsafeArrayWriter[2]; /* 012 */ /* 013 */ public GeneratedIteratorForCodegenStage1(Object[] references) { /* 014 */ this.references = references; /* 015 */ } /* 016 */ /* 017 */ public void init(int index, 
scala.collection.Iterator[] inputs) { /* 018 */ partitionIndex = index; /* 019 */ this.inputs = inputs; /* 020 */ localtablescan_input_0 = inputs[0]; /* 021 */ project_mutableStateArray_0[0] = new org.apache.spark.sql.catalyst.expressions.codegen.UnsafeRowWriter(2, 64); /* 022 */ project_mutableStateArray_1[0] = new org.apache.spark.sql.catalyst.expressions.codegen.UnsafeArrayWriter(project_mutableStateArray_0[0], 8); /* 023 */ project_mutableStateArray_1[1] = new org.apache.spark.sql.catalyst.expressions.codegen.UnsafeArrayWriter(project_mutableStateArray_0[0], 8); /* 024 */ /* 025 */ } /* 026 */ /* 027 */ private void project_doConsume_0(InternalRow localtablescan_row_0, UTF8String project_expr_0_0, boolean project_exprIsNull_0_0) throws java.io.IOException { /* 028 */ boolean project_isNull_1 = true; /* 029 */ MapData project_value_1 = null; /* 030 */ /* 031 */ if (!project_exprIsNull_0_0) { /* 032 */ project_isNull_1 = false; // resultCode could change nullability. /* 033 */ /* 034 */ int project_i_0 = 0; /* 035 */ UTF8String[] project_kvs_0 = project_expr_0_0.split(((UTF8String) references[2] /* literal */), -1); /* 036 */ while (project_i_0 < project_kvs_0.length) { /* 037 */ UTF8String[] kv = project_kvs_0[project_i_0].split(((UTF8String) references[3] /* literal */), 2); /* 038 */ UTF8String key = kv[0]; /* 039 */ UTF8String value = null; /* 040 */ if (kv.length == 2) { /* 041 */ value = kv[1]; /* 042 */ } /* 043 */ ((org.apache.spark.sql.catalyst.util.ArrayBasedMapBuilder) references[1] /* mapBuilder */).put(key, value); /* 044 */ project_i_0++; /* 045 */ } /* 046 */ project_value_1 = ((org.apache.spark.sql.catalyst.util.ArrayBasedMapBuilder) references[1] /* mapBuilder */).build(); /* 047 */ /* 048 */ } /* 049 */ project_mutableStateArray_0[0].reset(); /* 050 */ /* 051 */ project_mutableStateArray_0[0].zeroOutNullBytes(); /* 052 */ /* 053 */ if (project_exprIsNull_0_0) { /* 054 */ project_mutableStateArray_0[0].setNullAt(0); /* 055 */ } else { /* 056 */ project_mutableStateArray_0[0].write(0, project_expr_0_0); /* 057 */ } /* 058 */ /* 059 */ if 
(project_isNull_1) { /* 060 */ project_mutableStateArray_0[0].setNullAt(1); /* 061 */ } else { /* 062 */ final MapData project_tmpInput_0 = project_value_1; /* 063 */ if (project_tmpInput_0 instanceof UnsafeMapData) { /* 064 */ project_mutableStateArray_0[0].write(1, (UnsafeMapData) project_tmpInput_0); /* 065 */ } else { /* 066 */ // Remember the current cursor so that we can calculate how many bytes are /* 067 */ // written later. /* 068 */ final int project_previousCursor_0 = project_mutableStateArray_0[0].cursor(); /* 069 */ /* 070 */ // preserve 8 bytes to write the key array numBytes later. /* 071 */ project_mutableStateArray_0[0].grow(8); /* 072 */ project_mutableStateArray_0[0].increaseCursor(8); /* 073 */ /* 074 */ // Remember the current cursor so that we can write numBytes of key array later. /* 075 */ final int project_tmpCursor_0 = project_mutableStateArray_0[0].cursor(); /* 076 */ /* 077 */ final ArrayData project_tmpInput_1 = project_tmpInput_0.keyArray(); /* 078 */ if (project_tmpInput_1 instanceof UnsafeArrayData) { /* 079 */ project_mutableStateArray_0[0].write((UnsafeArrayData) project_tmpInput_1); /* 080 */ } else { /* 081 */ final int project_numElements_0 = project_tmpInput_1.numElements(); /* 082 */ project_mutableStateArray_1[0].initialize(project_numElements_0); /* 083 */ /* 084 */ for (int project_index_0 = 0; project_index_0 < project_numElements_0; project_index_0++) { /* 085 */ project_mutableStateArray_1[0].write(project_index_0, project_tmpInput_1.getUTF8String(project_index_0)); /* 086 */ } /* 087 */ } /* 088 */ /* 089 */ // Write the numBytes of key array into the first 8 bytes. 
/* 090 */ Platform.putLong( /* 091 */ project_mutableStateArray_0[0].getBuffer(), /* 092 */ project_tmpCursor_0 - 8, /* 093 */ project_mutableStateArray_0[0].cursor() - project_tmpCursor_0); /* 094 */ /* 095 */ final ArrayData project_tmpInput_2 = project_tmpInput_0.valueArray(); /* 096 */ if (project_tmpInput_2 instanceof UnsafeArrayData) { /* 097 */ project_mutableStateArray_0[0].write((UnsafeArrayData) project_tmpInput_2); /* 098 */ } else { /* 099 */ final int project_numElements_1 = project_tmpInput_2.numElements(); /* 100 */ project_mutableStateArray_1[1].initialize(project_numElements_1); /* 101 */ /* 102 */ for (int project_index_1 = 0; project_index_1 < project_numElements_1; project_index_1++) { /* 103 */ if (project_tmpInput_2.isNullAt(project_index_1)) { /* 104 */ project_mutableStateArray_1[1].setNull8Bytes(project_index_1); /* 105 */ } else { /* 106 */ project_mutableStateArray_1[1].write(project_index_1, project_tmpInput_2.getUTF8String(project_index_1)); /* 107 */ } /* 108 */ /* 109 */ } /* 110 */ } /* 111 */ /* 112 */ project_mutableStateArray_0[0].setOffsetAndSizeFromPreviousCursor(1, project_previousCursor_0); /* 113 */ } /* 114 */ } /* 115 */ append((project_mutableStateArray_0[0].getRow())); /* 116 */ /* 117 */ } /* 118 */ /* 119 */ protected void processNext() throws java.io.IOException { /* 120 */ while ( localtablescan_input_0.hasNext()) { /* 121 */ InternalRow localtablescan_row_0 = (InternalRow) localtablescan_input_0.next(); /* 122 */ ((org.apache.spark.sql.execution.metric.SQLMetric) references[0] /* numOutputRows */).add(1); /* 123 */ boolean localtablescan_isNull_0 = localtablescan_row_0.isNullAt(0); /* 124 */ UTF8String localtablescan_value_0 = localtablescan_isNull_0 ? 
/* 125 */ null : (localtablescan_row_0.getUTF8String(0)); /* 126 */ /* 127 */ project_doConsume_0(localtablescan_row_0, localtablescan_value_0, localtablescan_isNull_0); /* 128 */ if (shouldStop()) return; /* 129 */ } /* 130 */ } /* 131 */ /* 132 */ } -- !query 13 select v, str_to_map(v) from values ('abc:a:a,:'), (null), (''), ('1:2') t(v) -- !query 13 schema struct<v:string,str_to_map(v, ,, :):map<string,string>> -- !query 13 output {"":null} 1:2 {"1":"2"} NULL NULL abc:a:a,: {"":"","abc":"a:a"} ``` Closes #27013 from yaooqinn/SPARK-30356. Authored-by: Kent Yao <yaooqinn@hotmail.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2019-12-27 21:44:03 +08:00
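The splitting logic visible in the generated code of SPARK-30356 above (split the text into pairs with limit `-1`, then split each pair with limit `2`) can be mirrored in a few lines of plain Scala. This is a sketch, not Spark's `StringToMap`: `strToMap` is a hypothetical helper, `Option` stands in for SQL `NULL` values, and null input and duplicate-key handling are not modelled.

```scala
object StrToMapSketch {
  // Delimiters are passed to String.split, so they are treated as regular expressions.
  def strToMap(
      text: String,
      pairDelim: String = ",",
      kvDelim: String = ":"): Map[String, Option[String]] =
    text.split(pairDelim, -1).map { pair =>
      // Limit 2 keeps any further key/value delimiters inside the value.
      val kv = pair.split(kvDelim, 2)
      kv(0) -> (if (kv.length == 2) Some(kv(1)) else None)
    }.toMap
}
```

For the inputs used in the query output above, `"abc:a:a,:"` maps `abc` to `a:a` and the empty key to the empty string, while the empty string maps the empty key to no value.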
Jungtaek Lim (HeartSaVioR)	7adf886792	[SPARK-30345][SQL] Fix intermittent test failure (ConnectException) on ThriftServerQueryTestSuite/ThriftServerWithSparkContextSuite ### What changes were proposed in this pull request? This patch fixes the intermittent test failure on ThriftServerQueryTestSuite/ThriftServerWithSparkContextSuite, getting ConnectException when querying to thrift server. (https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/115646/testReport/) The relevant unit test log messages are following: ``` 19/12/23 13:33:01.875 pool-1-thread-1 INFO AbstractService: Service:ThriftBinaryCLIService is started. 19/12/23 13:33:01.875 pool-1-thread-1 INFO AbstractService: Service:HiveServer2 is started. ... 19/12/23 13:33:01.888 pool-1-thread-1 INFO ThriftServerWithSparkContextSuite: HiveThriftServer2 started successfully ... 19/12/23 13:33:01.909 pool-1-thread-1-ScalaTest-running-ThriftServerWithSparkContextSuite INFO ThriftServerWithSparkContextSuite: ===== TEST OUTPUT FOR o.a.s.sql.hive.thriftserver.ThriftServerWithSparkContextSuite: 'SPARK-29911: Uncache cached tables when session closed' ===== ... 19/12/23 13:33:02.017 pool-1-thread-1-ScalaTest-running-ThriftServerWithSparkContextSuite INFO Utils: Supplied authorities: localhost:15441 19/12/23 13:33:02.018 pool-1-thread-1-ScalaTest-running-ThriftServerWithSparkContextSuite INFO Utils: Resolved authority: localhost:15441 19/12/23 13:33:02.078 HiveServer2-Background-Pool: Thread-213 INFO BaseSessionStateBuilder$$anon$2: Optimization rule 'org.apache.spark.sql.catalyst.optimizer.ConvertToLocalRelation' is excluded from the optimizer. 19/12/23 13:33:02.078 HiveServer2-Background-Pool: Thread-213 INFO BaseSessionStateBuilder$$anon$2: Optimization rule 'org.apache.spark.sql.catalyst.optimizer.ConvertToLocalRelation' is excluded from the optimizer. 
19/12/23 13:33:02.121 pool-1-thread-1-ScalaTest-running-ThriftServerWithSparkContextSuite WARN HiveConnection: Failed to connect to localhost:15441 19/12/23 13:33:02.124 pool-1-thread-1-ScalaTest-running-ThriftServerWithSparkContextSuite INFO ThriftServerWithSparkContextSuite: ===== FINISHED o.a.s.sql.hive.thriftserver.ThriftServerWithSparkContextSuite: 'SPARK-29911: Uncache cached tables when session closed' ===== 19/12/23 13:33:02.143 Thread-35 INFO ThriftCLIService: Starting ThriftBinaryCLIService on port 15441 with 5...500 worker threads 19/12/23 13:33:02.327 pool-1-thread-1 INFO HiveServer2: Shutting down HiveServer2 19/12/23 13:33:02.328 pool-1-thread-1 INFO ThriftCLIService: Thrift server has stopped ``` (Here the error is logged as `WARN HiveConnection: Failed to connect to localhost:15441` - the actual stack trace can be seen on Jenkins test summary.) The reason for the test failure: Thrift(Binary|Http)CLIService prepares and launches the service asynchronously (in a new thread), but the suites do not wait for that to complete and just start running tests, which ends up in a race condition. This can be easily reproduced by adding an artificial sleep in `ThriftBinaryCLIService.run()` here: `ba3f6330dd/sql/hive-thriftserver/v2.3/src/main/java/org/apache/hive/service/cli/thrift/ThriftBinaryCLIService.java (L49)` (Note that `sleep` should be added before initializing the server socket. E.g. Line 57) This patch changes the test initialization logic to try executing a simple query, waiting until the service is available. The patch also refactors the code so that the change applies easily to both ThriftServerQueryTestSuite and ThriftServerWithSparkContextSuite. ### Why are the changes needed? This patch fixes the intermittent failure observed here: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/115646/testReport/ ### Does this PR introduce any user-facing change? No ### How was this patch tested? 
Artificially made the test fail consistently (by the approach described above), and confirmed the patch fixed the test. Closes #27001 from HeartSaVioR/SPARK-30345. Authored-by: Jungtaek Lim (HeartSaVioR) <kabhwan.opensource@gmail.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2019-12-27 15:30:54 +08:00
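The fix in SPARK-30345 above boils down to probing the service until it answers before any test runs. A generic version of that pattern might look like the sketch below; `retry`, its parameters and the probe are hypothetical names chosen for illustration, not the helper used in the Spark test suites.

```scala
import scala.util.{Failure, Success, Try}

object WaitUntilReadySketch {
  // Runs `probe` until it succeeds, sleeping between attempts; rethrows the
  // last failure once `maxAttempts` attempts have been used up.
  def retry[T](maxAttempts: Int, delayMillis: Long)(probe: => T): T =
    Try(probe) match {
      case Success(value) => value
      case Failure(e) if maxAttempts <= 1 => throw e
      case Failure(_) =>
        Thread.sleep(delayMillis)
        retry(maxAttempts - 1, delayMillis)(probe)
    }
}
```

In a test suite the probe would be something like executing a trivial query against the freshly started server, so that the first real test only begins once a connection can actually be made.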
lijunqing	a2de20c0e6	[SPARK-30036][SQL] Fix: REPARTITION hint does not work with order by ### Why are the changes needed? `EnsureRequirements` adds `ShuffleExchangeExec` (RangePartitioning) after Sort if `RoundRobinPartitioning` is behind it. This causes 2 shuffles, and the number of partitions in the final stage is not the number specified by `RoundRobinPartitioning`. Example SQL ``` SELECT /*+ REPARTITION(5) */ * FROM test ORDER BY a ``` BEFORE ``` == Physical Plan == *(1) Sort [a#0 ASC NULLS FIRST], true, 0 +- Exchange rangepartitioning(a#0 ASC NULLS FIRST, 200), true, [id=#11] +- Exchange RoundRobinPartitioning(5), false, [id=#9] +- Scan hive default.test [a#0, b#1], HiveTableRelation `default`.`test`, org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, [a#0, b#1] ``` AFTER ``` == Physical Plan == *(1) Sort [a#0 ASC NULLS FIRST], true, 0 +- Exchange rangepartitioning(a#0 ASC NULLS FIRST, 5), true, [id=#11] +- Scan hive default.test [a#0, b#1], HiveTableRelation `default`.`test`, org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, [a#0, b#1] ``` ### Does this PR introduce any user-facing change? No ### How was this patch tested? Run suite Tests and add new test for this. Closes #26946 from stczwd/RoundRobinPartitioning. Lead-authored-by: lijunqing <lijunqing@baidu.com> Co-authored-by: stczwd <qcsd2011@163.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2019-12-27 11:52:39 +08:00
Kengo Seki	59c014e120	[SPARK-30350][SQL] Fix ScalaReflection to use an empty array for getting its class object ### What changes were proposed in this pull request? This PR fixes `ScalaReflection.arrayClassFor()` to use an empty array instead of a one-element array for getting its class object by reflection. ### Why are the changes needed? Because it may reduce unnecessary memory allocation. ### Does this PR introduce any user-facing change? No ### How was this patch tested? Ran the existing unit tests for sql/catalyst and confirmed that all of them succeeded. Closes #27005 from sekikn/SPARK-30350. Authored-by: Kengo Seki <sekikn@apache.org> Signed-off-by: Takeshi Yamamuro <yamamuro@apache.org>	2019-12-26 22:54:29 +09:00
gengjiaan	d59e7195f6	[SPARK-27986][SQL] Support ANSI SQL filter clause for aggregate expression ### What changes were proposed in this pull request? The filter predicate for aggregate expressions is part of `ANSI SQL`. ``` <aggregate function> ::= COUNT <left paren> <asterisk> <right paren> [ <filter clause> ] | <general set function> [ <filter clause> ] | <binary set function> [ <filter clause> ] | <ordered set function> [ <filter clause> ] | <array aggregate function> [ <filter clause> ] | <row pattern count function> [ <filter clause> ] ``` Some mainstream databases support this syntax. PostgreSQL: https://www.postgresql.org/docs/current/sql-expressions.html#SYNTAX-AGGREGATES For example: ``` SELECT year, count(*) FILTER (WHERE gdp_per_capita >= 40000) FROM countries GROUP BY year ``` ``` SELECT year, code, gdp_per_capita, count(*) FILTER (WHERE gdp_per_capita >= 40000) OVER (PARTITION BY year) FROM countries ``` jOOQ: https://blog.jooq.org/2014/12/30/the-awesome-postgresql-9-4-sql2003-filter-clause-for-aggregate-functions/ Notice: 1. This PR only supports FILTER predicate without codegen. maropu will create another PR related to SPARK-30027 to support codegen. 2. This PR only supports FILTER predicate without DISTINCT. I will create another PR related to SPARK-30276 to support this. 3. This PR only supports FILTER predicates that do not reference the outer query. I created ticket SPARK-30219 to support it. 4. This PR only supports FILTER predicates that do not use IN/EXISTS predicate sub-queries. I created ticket SPARK-30220 to support it. 5. Spark SQL cannot support SQL with nested aggregates. I created ticket SPARK-30182 to support it. Here are some demonstrations of the PR in my production environment. 
``` spark-sql> desc gja_test_partition; key string NULL value string NULL other string NULL col2 int NULL # Partition Information # col_name data_type comment col2 int NULL Time taken: 0.79 s ``` ``` spark-sql> select * from gja_test_partition; a A ao 1 b B bo 1 c C co 1 d D do 1 e E eo 2 g G go 2 h H ho 2 j J jo 2 f F fo 3 k K ko 3 l L lo 4 i I io 4 Time taken: 1.75 s ``` ``` spark-sql> select count(key), sum(col2) from gja_test_partition; 12 26 Time taken: 1.848 s ``` ``` spark-sql> select count(key) filter (where col2 > 1) from gja_test_partition; 8 Time taken: 2.926 s ``` ``` spark-sql> select sum(col2) filter (where col2 > 2) from gja_test_partition; 14 Time taken: 2.087 s ``` ``` spark-sql> select count(key) filter (where col2 > 1), sum(col2) filter (where col2 > 2) from gja_test_partition; 8 14 Time taken: 2.847 s ``` ``` spark-sql> select count(key), count(key) filter (where col2 > 1), sum(col2), sum(col2) filter (where col2 > 2) from gja_test_partition; 12 8 26 14 Time taken: 1.787 s ``` ``` spark-sql> desc student; id int NULL name string NULL sex string NULL class_id int NULL Time taken: 0.206 s ``` ``` spark-sql> select * from student; 1 张三 man 1 2 李四 man 1 3 王五 man 2 4 赵六 man 2 5 钱小花 woman 1 6 赵九红 woman 2 7 郭丽丽 woman 2 Time taken: 0.786 s ``` ``` spark-sql> select class_id, count(id), sum(id) from student group by class_id; 1 3 8 2 4 20 Time taken: 18.783 s ``` ``` spark-sql> select class_id, count(id) filter (where sex = 'man'), sum(id) filter (where sex = 'woman') from student group by class_id; 1 2 5 2 2 13 Time taken: 3.887 s ``` ### Why are the changes needed? Add new SQL feature. ### Does this PR introduce any user-facing change? 'No'. ### How was this patch tested? Exists UT and new UT. Closes #26656 from beliefer/support-aggregate-clause. Lead-authored-by: gengjiaan <gengjiaan@360.cn> Co-authored-by: Jiaan Geng <beliefer@163.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2019-12-26 17:41:50 +08:00
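The semantics of the `FILTER (WHERE ...)` clause shown in the SPARK-27986 sessions above can be checked by hand. The sketch below replays the `gja_test_partition` rows in plain Python (only the `key` and `col2` columns are kept) and is meant as an illustration of what the clause computes, not of how Spark evaluates it.

```python
# (key, col2) pairs copied from the `select * from gja_test_partition` output.
rows = [
    ("a", 1), ("b", 1), ("c", 1), ("d", 1),
    ("e", 2), ("g", 2), ("h", 2), ("j", 2),
    ("f", 3), ("k", 3), ("l", 4), ("i", 4),
]

# count(key): every non-null key is counted.
count_key = sum(1 for key, _ in rows if key is not None)

# count(key) filter (where col2 > 1): only rows passing the predicate
# are fed into the aggregate.
count_key_filtered = sum(1 for key, col2 in rows if col2 > 1 and key is not None)

# sum(col2) and sum(col2) filter (where col2 > 2).
sum_col2 = sum(col2 for _, col2 in rows)
sum_col2_filtered = sum(col2 for _, col2 in rows if col2 > 2)

print(count_key, count_key_filtered, sum_col2, sum_col2_filtered)  # 12 8 26 14
```

Each aggregate applies its own predicate independently, which is why a single query can mix filtered and unfiltered aggregates over the same rows.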
wenfang	4d58cd77f9	[SPARK-30330][SQL] Support single quotes json parsing for get_json_object and json_tuple ### What changes were proposed in this pull request? I executed a query such as `select get_json_object(ytag, '$.y1') AS y1 from t4`; Spark SQL returns null but Hive returns correct results. In my production environment, ytag is JSON wrapped in single quotes, as follows ``` {'y1': 'shuma', 'y2': 'shuma:shouji'} {'y1': 'jiaoyu', 'y2': 'jiaoyu:gaokao'} {'y1': 'yule', 'y2': 'yule:mingxing'} ``` Then I realized that some functions, including get_json_object and json_tuple, do not support parsing JSON with single quotes. So I provide this PR to resolve the problem. ### Why are the changes needed? Enabled for Hive compatibility ### Does this PR introduce any user-facing change? NO ### How was this patch tested? NEW TESTS Closes #26965 from wenfang6/enableSingleQuotesJsonForSparkSQL. Authored-by: wenfang <wenfang@360.cn> Signed-off-by: HyukjinKwon <gurwls223@apache.org>	2019-12-26 11:45:31 +09:00
yi.wu	6d64fc2407	[SPARK-26389][SS][FOLLOW-UP] Format config name to follow the other boolean conf naming convention ### What changes were proposed in this pull request? Rename `spark.sql.streaming.forceDeleteTempCheckpointLocation` to `spark.sql.streaming.forceDeleteTempCheckpointLocation.enabled`. ### Why are the changes needed? To follow the other boolean conf naming convention. ### Does this PR introduce any user-facing change? No, as this config is newly added in 3.0. ### How was this patch tested? Pass Jenkins. Closes #26981 from Ngone51/SPARK-26389-FOLLOWUP. Authored-by: yi.wu <yi.wu@databricks.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2019-12-25 21:45:01 +08:00
Kent Yao	da65a955ed	[SPARK-30266][SQL] Avoid match error and int overflow in ApproximatePercentile and Percentile ### What changes were proposed in this pull request? accuracyExpression can accept Long which may cause overflow error. accuracyExpression can accept fractions which are implicitly floored. accuracyExpression can accept null which is implicitly changed to 0. percentageExpression can accept null but cause MatchError. percentageExpression can accept ArrayType(_, nullable=true) in which the nulls are implicitly changed to zeros. ##### cases ```sql select percentile_approx(10.0, 0.5, 2147483648); -- overflow and fail select percentile_approx(10.0, 0.5, 4294967297); -- overflow but success select percentile_approx(10.0, 0.5, null); -- null cast to 0 select percentile_approx(10.0, 0.5, 1.2); -- 1.2 cast to 1 select percentile_approx(10.0, null, 1); -- scala.MatchError select percentile_approx(10.0, array(0.2, 0.4, null), 1); -- null cast to zero. ``` ##### behavior before ```sql +select percentile_approx(10.0, 0.5, 2147483648) +org.apache.spark.sql.AnalysisException +cannot resolve 'percentile_approx(10.0BD, CAST(0.5BD AS DOUBLE), CAST(2147483648L AS INT))' due to data type mismatch: The accuracy provided must be a positive integer literal (current value = -2147483648); line 1 pos 7 + +select percentile_approx(10.0, 0.5, 4294967297) +10.0 + +select percentile_approx(10.0, 0.5, null) +org.apache.spark.sql.AnalysisException +cannot resolve 'percentile_approx(10.0BD, CAST(0.5BD AS DOUBLE), CAST(NULL AS INT))' due to data type mismatch: The accuracy provided must be a positive integer literal (current value = 0); line 1 pos 7 + +select percentile_approx(10.0, 0.5, 1.2) +10.0 + +select percentile_approx(10.0, null, 1) +scala.MatchError +null + + +select percentile_approx(10.0, array(0.2, 0.4, null), 1) +[10.0,10.0,10.0] ``` ##### behavior after ```sql +select percentile_approx(10.0, 0.5, 2147483648) +10.0 + +select percentile_approx(10.0, 0.5, 4294967297) +10.0 + 
+select percentile_approx(10.0, 0.5, null) +org.apache.spark.sql.AnalysisException +cannot resolve 'percentile_approx(10.0BD, 0.5BD, NULL)' due to data type mismatch: argument 3 requires integral type, however, 'NULL' is of null type.; line 1 pos 7 + +select percentile_approx(10.0, 0.5, 1.2) +org.apache.spark.sql.AnalysisException +cannot resolve 'percentile_approx(10.0BD, 0.5BD, 1.2BD)' due to data type mismatch: argument 3 requires integral type, however, '1.2BD' is of decimal(2,1) type.; line 1 pos 7 + +select percentile_approx(10.0, null, 1) +java.lang.IllegalArgumentException +The value of percentage must be be between 0.0 and 1.0, but got null + +select percentile_approx(10.0, array(0.2, 0.4, null), 1) +java.lang.IllegalArgumentException +Each value of the percentage array must be be between 0.0 and 1.0, but got [0.2,0.4,null] ``` ### Why are the changes needed? bug fix ### Does this PR introduce any user-facing change? yes, fix some improper usages of percentile_approx as cases list above ### How was this patch tested? add ut Closes #26905 from yaooqinn/SPARK-30266. Authored-by: Kent Yao <yaooqinn@hotmail.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2019-12-25 20:03:26 +08:00
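The two overflow cases in SPARK-30266 above follow from narrowing a 64-bit value to 32 bits, which keeps only the low 32 bits. A small sketch of that arithmetic, assuming the old `CAST(... AS INT)` behaves like a JVM long-to-int narrowing conversion (`to_int32` is a hypothetical helper written for this note):

```python
def to_int32(value):
    """Emulate JVM long-to-int narrowing: keep the low 32 bits, two's complement."""
    wrapped = value & 0xFFFFFFFF
    return wrapped - 2**32 if wrapped >= 2**31 else wrapped


# 2147483648 is Int.MaxValue + 1, so it wraps to a negative accuracy and is rejected.
overflow_and_fail = to_int32(2147483648)

# 4294967297 is 2**32 + 1, so it wraps to 1, a valid accuracy that is silently accepted.
overflow_but_success = to_int32(4294967297)

print(overflow_and_fail, overflow_but_success)  # -2147483648 1
```

The first value matches the `current value = -2147483648` reported in the "behavior before" error message, and the second explains why the query returned `10.0` instead of failing.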
manuzhang	ef6f9e9668	[SPARK-30331][SQL] Set isFinalPlan to true before posting the final AdaptiveSparkPlan event ### What changes were proposed in this pull request? Set `isFinalPlan=true` before posting the final AdaptiveSparkPlan event (`SparkListenerSQLAdaptiveExecutionUpdate`) ### Why are the changes needed? Otherwise, any attempt to listen on the final event by pattern matching `isFinalPlan=true` would fail ### Does this PR introduce any user-facing change? No. ### How was this patch tested? All tests in `AdaptiveQueryExecSuite` are extended with a verification that a `SparkListenerSQLAdaptiveExecutionUpdate` event with `isFinalPlan=True` exists Closes #26983 from manuzhang/spark-30331. Authored-by: manuzhang <owenzhang1990@gmail.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2019-12-25 19:08:24 +08:00
Pavithra Ramachandran	57ca95246c	[SPARK-29505][SQL] Make DESC EXTENDED <table name> <column name> case insensitive ### What changes were proposed in this pull request? When querying with desc, if the column name is not entered exactly as it was given during table creation, the column stats are wrong. Fetching of column stats has been made case insensitive. ### Why are the changes needed? Functions like analyze, etc. support case insensitive retrieval of column data. ### Does this PR introduce any user-facing change? NO ### How was this patch tested? Unit test has been rewritten and tested. Closes #26927 from PavithraRamachandran/desc_caseinsensitive. Authored-by: Pavithra Ramachandran <pavi.rams@gmail.com> Signed-off-by: Takeshi Yamamuro <yamamuro@apache.org>	2019-12-25 08:57:34 +09:00
wangguangxin.cn	640dcc435b	[SPARK-28332][SQL] Reserve init value -1 only when do min max statistics in SQLMetrics ### What changes were proposed in this pull request? This is an alternative solution to https://github.com/apache/spark/pull/25095. SQLMetrics use -1 as init value as a work around for [SPARK-11013](https://issues.apache.org/jira/browse/SPARK-11013.) However, it may bring out some badcases as https://github.com/apache/spark/pull/26726 reporting. In fact, we only need to reserve -1 when doing min max statistics in `SQLMetrics.stringValue` so that we can filter out those not initialized accumulators. ### Does this PR introduce any user-facing change? No ### How was this patch tested? Existing UTs Closes #26899 from WangGuangxin/sqlmetrics. Authored-by: wangguangxin.cn <wangguangxin.cn@bytedance.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2019-12-23 13:13:35 +08:00
HyukjinKwon	e5abbab0ed	[SPARK-30128][DOCS][PYTHON][SQL] Document/promote 'recursiveFileLookup' and 'pathGlobFilter' in file sources 'mergeSchema' in ORC ### What changes were proposed in this pull request? This PR adds and exposes the options, 'recursiveFileLookup' and 'pathGlobFilter' in file sources 'mergeSchema' in ORC, into documentation. - `recursiveFileLookup` at file sources: https://github.com/apache/spark/pull/24830 ([SPARK-27627](https://issues.apache.org/jira/browse/SPARK-27627)) - `pathGlobFilter` at file sources: https://github.com/apache/spark/pull/24518 ([SPARK-27990](https://issues.apache.org/jira/browse/SPARK-27990)) - `mergeSchema` at ORC: https://github.com/apache/spark/pull/24043 ([SPARK-11412](https://issues.apache.org/jira/browse/SPARK-11412)) Note that `timeZone` option was not moved from `DataFrameReader.options` as I assume it will likely affect other datasources as well once DSv2 is complete. ### Why are the changes needed? To document available options in sources properly. ### Does this PR introduce any user-facing change? In PySpark, `pathGlobFilter` can be set via `DataFrameReader.(text|orc|parquet|json|csv)` and `DataStreamReader.(text|orc|parquet|json|csv)`. ### How was this patch tested? Manually built the doc and checked the output. Option setting in PySpark is rather a logical change. I manually tested one only: ```bash $ ls -al tmp ... -rw-r--r-- 1 hyukjin.kwon staff 3 Dec 20 12:19 aa -rw-r--r-- 1 hyukjin.kwon staff 3 Dec 20 12:19 ab -rw-r--r-- 1 hyukjin.kwon staff 3 Dec 20 12:19 ac -rw-r--r-- 1 hyukjin.kwon staff 3 Dec 20 12:19 cc ``` ```python >>> spark.read.text("tmp", pathGlobFilter="*c").show() ``` ``` +-----+ |value| +-----+ | ac| | cc| +-----+ ``` Closes #26958 from HyukjinKwon/doc-followup. Authored-by: HyukjinKwon <gurwls223@apache.org> Signed-off-by: HyukjinKwon <gurwls223@apache.org>	2019-12-23 09:57:42 +09:00
gengjiaan	a38bf7e051	[SPARK-28083][SQL][TEST][FOLLOW-UP] Enable LIKE ... ESCAPE test cases ### What changes were proposed in this pull request? This PR is a follow-up to https://github.com/apache/spark/pull/25001 ### Why are the changes needed? No ### Does this PR introduce any user-facing change? No ### How was this patch tested? Pass the Jenkins with the newly update test files. Closes #26949 from beliefer/uncomment-like-escape-tests. Authored-by: gengjiaan <gengjiaan@360.cn> Signed-off-by: Dongjoon Hyun <dhyun@apple.com>	2019-12-21 14:40:07 -08:00
Kazuaki Ishizaki	f31d9a629b	[MINOR][DOC][SQL][CORE] Fix typo in document and comments ### What changes were proposed in this pull request? Fixed typo in `docs` directory and in other directories 1. Find typo in `docs` and apply fixes to files in all directories 2. Fix `the the` -> `the` ### Why are the changes needed? Better readability of documents ### Does this PR introduce any user-facing change? No ### How was this patch tested? No test needed Closes #26976 from kiszk/typo_20191221. Authored-by: Kazuaki Ishizaki <ishizaki@jp.ibm.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com>	2019-12-21 14:08:58 -08:00
Wenchen Fan	cd84400271	[SPARK-29906][SQL][FOLLOWUP] Update the final plan in UI for AQE ### What changes were proposed in this pull request? a followup of https://github.com/apache/spark/pull/26576, which mistakenly removes the UI update of the final plan. ### Why are the changes needed? fix mistake. ### Does this PR introduce any user-facing change? no ### How was this patch tested? existing tests Closes #26968 from cloud-fan/fix. Authored-by: Wenchen Fan <wenchen@databricks.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com>	2019-12-21 09:56:15 -08:00
Wing Yew Poon	c72f88b0ba	[SPARK-17398][SQL] Fix ClassCastException when querying partitioned JSON table ### What changes were proposed in this pull request? When querying a partitioned table with format `org.apache.hive.hcatalog.data.JsonSerDe` and more than one task runs in each executor concurrently, the following exception is encountered: `java.lang.ClassCastException: java.util.ArrayList cannot be cast to org.apache.hive.hcatalog.data.HCatRecord` The exception occurs in `HadoopTableReader.fillObject`. `org.apache.hive.hcatalog.data.JsonSerDe#initialize` populates a `cachedObjectInspector` field by calling `HCatRecordObjectInspectorFactory.getHCatRecordObjectInspector`, which is not thread-safe; this `cachedObjectInspector` is returned by `JsonSerDe#getObjectInspector`. We protect against this Hive bug by synchronizing on an object when we need to call `initialize` on `org.apache.hadoop.hive.serde2.Deserializer` instances (which may be `JsonSerDe` instances). By doing so, the `ObjectInspector` for the `Deserializer` of the partitions of the JSON table and that of the table `SerDe` are the same cached `ObjectInspector` and `HadoopTableReader.fillObject` then works correctly. (If the `ObjectInspector`s are different, then a bug in `HCatRecordObjectInspector` causes an `ArrayList` to be created instead of an `HCatRecord`, resulting in the `ClassCastException` that is seen.) ### Why are the changes needed? To avoid HIVE-15773 / HIVE-21752. ### Does this PR introduce any user-facing change? No. ### How was this patch tested? Tested manually on a cluster with a partitioned JSON table and running a query using more than one core per executor. Before this change, the ClassCastException happens consistently. With this change it does not happen. Closes #26895 from wypoon/SPARK-17398. Authored-by: Wing Yew Poon <wypoon@cloudera.com> Signed-off-by: Marcelo Vanzin <vanzin@cloudera.com>	2019-12-20 10:39:26 -08:00
Sean Owen	7dff3b125d	[SPARK-30272][SQL][CORE] Remove usage of Guava that breaks in 27; replace with workalikes ### What changes were proposed in this pull request? Remove usages of Guava that no longer work in Guava 27, and replace with workalikes. I'll comment on key types of changes below. ### Why are the changes needed? Hadoop 3.2.1 uses Guava 27, so this helps us avoid problems running on Hadoop 3.2.1+ and generally lowers our exposure to Guava. ### Does this PR introduce any user-facing change? Should not be, but see notes below on hash codes and toString. ### How was this patch tested? Existing tests will verify whether these changes break anything for Guava 14. I manually built with an updated version and it compiles with Guava 27; tests running manually locally now. Closes #26911 from srowen/SPARK-30272. Authored-by: Sean Owen <srowen@gmail.com> Signed-off-by: Sean Owen <srowen@gmail.com>	2019-12-20 08:55:04 -06:00
Prakhar Jain	07b04c4c72	[SPARK-29938][SQL] Add batching support in Alter table add partition flow ### What changes were proposed in this pull request? Add batching support in Alter table add partition flow. Also calculate new partition sizes faster by doing listing in parallel. ### Why are the changes needed? This PR splits the single createPartitions() call in the AlterTableAddPartition flow into smaller batches, which could prevent - SocketTimeoutException: Adding thousands of partitions in Hive metastore itself takes a lot of time. Because of this, the Hive client fails with SocketTimeoutException. - Hive metastore from OOM (caused by millions of partitions). It will also try to gather stats (total size of all files in all new partitions) faster by listing the new partition paths in parallel. ### Does this PR introduce any user-facing change? No. ### How was this patch tested? Added UT. Also tested on a cluster in HDI with 15000 partitions with remote metastore server. Without batching the operation fails with SocketTimeoutException; with batching it finishes in 25 mins. Closes #26569 from prakharjain09/add_partition_batching_r1. Authored-by: Prakhar Jain <prakharjain09@gmail.com> Signed-off-by: Sean Owen <srowen@gmail.com>	2019-12-20 08:54:14 -06:00
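The batching idea in SPARK-29938 above amounts to replacing one large call with several bounded ones. A minimal sketch, assuming a hypothetical `create_partitions` callback and an arbitrary batch size (the commit message does not state the size Spark uses):

```python
def in_batches(items, batch_size):
    """Yield consecutive slices of items, each with at most batch_size elements."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


def add_partitions_in_batches(partitions, create_partitions, batch_size=100):
    """Call the hypothetical create_partitions once per batch; return the call count."""
    calls = 0
    for batch in in_batches(partitions, batch_size):
        create_partitions(batch)
        calls += 1
    return calls


# Usage: 250 made-up partition specs are sent as three bounded requests.
received = []
specs = [f"dt={day}" for day in range(250)]
call_count = add_partitions_in_batches(specs, received.append, batch_size=100)
batch_sizes = [len(batch) for batch in received]
```

Each request stays small enough to finish within the client timeout, at the cost of losing atomicity across batches.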
Niranjan Artal	0d2ef3ae2b	[SPARK-30300][SQL][WEB-UI] Fix updating the UI max value string when driver updates the same metric id as the tasks ### What changes were proposed in this pull request? In this PR, for a given metrics id we check whether the driver side accumulator's value is greater than the max of all stages' values. If it is, we remove that entry from the Hashmap. By doing this, "driver" is displayed on the UI for this metric (as the driver has the maximum value). ### Why are the changes needed? This PR fixes https://issues.apache.org/jira/browse/SPARK-30300. Currently the driver's metric value is not compared while calculating the max. ### Does this PR introduce any user-facing change? For the metrics where driver's value is greater than max of all stages, this is the change. Previous : (min, median, max (stageId 0( attemptId 1): taskId 2)) Now: (min, median, max (driver)) ### How was this patch tested? Ran unit tests. Closes #26941 from nartal1/SPARK-30300. Authored-by: Niranjan Artal <nartal@nvidia.com> Signed-off-by: Thomas Graves <tgraves@apache.org>	2019-12-20 07:29:28 -06:00
Kent Yao	12249fcdc7	[SPARK-30301][SQL] Fix wrong results when datetimes as fields of complex types ### What changes were proposed in this pull request? When date and timestamp values are fields of arrays, maps, etc, we convert them to hive string using `toString`. This makes the result wrong before the default transition '1582-10-15'. https://bugs.openjdk.java.net/browse/JDK-8061577?focusedCommentId=13566712&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-13566712 cases to reproduce: ```sql +-- !query 47 +select array(cast('1582-10-13' as date), date '1582-10-14', date '1582-10-15', null) +-- !query 47 schema +struct<array(CAST(1582-10-13 AS DATE), DATE '1582-10-14', DATE '1582-10-15', CAST(NULL AS DATE)):array<date>> +-- !query 47 output +[1582-10-03,1582-10-04,1582-10-15,null] + + +-- !query 48 +select cast('1582-10-13' as date), date '1582-10-14', date '1582-10-15' +-- !query 48 schema +struct<CAST(1582-10-13 AS DATE):date,DATE '1582-10-14':date,DATE '1582-10-15':date> +-- !query 48 output +1582-10-13 1582-10-14 1582-10-15 ``` other references https://github.com/h2database/h2database/issues/831 ### Why are the changes needed? bug fix ### Does this PR introduce any user-facing change? yes, complex types containing datetimes in the `spark-sql` script and thrift server now give the same results as a self-contained spark app or the `spark-shell` script ### How was this patch tested? add uts Closes #26942 from yaooqinn/SPARK-30301. Authored-by: Kent Yao <yaooqinn@hotmail.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2019-12-20 19:21:43 +08:00
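The ten-day shift in the SPARK-30301 output above (`1582-10-13` shown as `1582-10-03`) is consistent with a proleptic Gregorian day count being rendered by a hybrid Julian/Gregorian calendar that switches on 1582-10-15. The sketch below reproduces that arithmetic with standard Julian Day Number formulas; treating this as the cause is an assumption made for illustration, and the function names are hypothetical.

```python
GREGORIAN_CUTOVER_JDN = 2299161  # Julian Day Number of 1582-10-15 (Gregorian)


def gregorian_to_jdn(year, month, day):
    """Julian Day Number of a proleptic Gregorian calendar date."""
    a = (14 - month) // 12
    y = year + 4800 - a
    m = month + 12 * a - 3
    return day + (153 * m + 2) // 5 + 365 * y + y // 4 - y // 100 + y // 400 - 32045


def jdn_to_julian(jdn):
    """Julian calendar (year, month, day) for a Julian Day Number."""
    c = jdn + 32082
    d = (4 * c + 3) // 1461
    e = c - (1461 * d) // 4
    m = (5 * e + 2) // 153
    day = e - (153 * m + 2) // 5 + 1
    month = m + 3 - 12 * (m // 10)
    year = d - 4800 + m // 10
    return year, month, day


def render_with_hybrid_calendar(year, month, day):
    """Render a proleptic Gregorian date the way a hybrid calendar would label it."""
    jdn = gregorian_to_jdn(year, month, day)
    if jdn < GREGORIAN_CUTOVER_JDN:
        return jdn_to_julian(jdn)  # before the cutover the Julian label is used
    return year, month, day


rendered = [render_with_hybrid_calendar(1582, 10, d) for d in (13, 14, 15)]
print(rendered)  # [(1582, 10, 3), (1582, 10, 4), (1582, 10, 15)]
```

The three rendered dates line up with the `[1582-10-03,1582-10-04,1582-10-15,null]` array in query 47, while query 48 shows the expected values for the same inputs.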
jiake	a296d15235	[SPARK-30291] catch the exception when doing materialize in AQE ### What changes were proposed in this pull request? AQE need catch the exception when doing materialize. And then user can get more information about the exception when enable AQE. ### Why are the changes needed? provide more cause about the exception when doing materialize ### Does this PR introduce any user-facing change? Before this PR, the error in the added unit test is java.lang.RuntimeException: Invalid bucket file file:///${SPARK_HOME}/assembly/spark-warehouse/org.apache.spark.sql.execution.adaptive.AdaptiveQueryExecSuite/bucketed_table/part-00000-3551343c-d003-4ada-82c8-45c712a72efe-c000.snappy.parquet After this PR, the error in the added unit test is: org.apache.spark.SparkException: Adaptive execution failed due to stage materialization failures. ### How was this patch tested? Add a new ut Closes #26931 from JkSelf/catchMoreException. Authored-by: jiake <ke.a.jia@intel.com> Signed-off-by: Xiao Li <gatorsmile@gmail.com>	2019-12-20 00:23:26 -08:00
Wenchen Fan	18e8d1d5b2	[SPARK-30307][SQL] remove ReusedQueryStageExec ### What changes were proposed in this pull request? When we reuse exchanges in AQE, what we produce is `ReuseQueryStage(QueryStage(Exchange))`. This PR changes it to `QueryStage(ReusedExchange(Exchange))`. This PR also fixes an issue in `LocalShuffleReaderExec.outputPartitioning`. We can only preserve the partitioning if we read one mapper per task. ### Why are the changes needed? `QueryStage` is light-weighted and we don't need to reuse its instance. What we really care is to reuse the exchange instance, which has heavy states (e.g. broadcasted valued, submitted map stage). To simplify the framework, we should use the existing `ReusedExchange` node to do the reuse work, instead of creating a new node. ### Does this PR introduce any user-facing change? no ### How was this patch tested? existing tests Closes #26952 from cloud-fan/aqe. Authored-by: Wenchen Fan <wenchen@databricks.com> Signed-off-by: Xiao Li <gatorsmile@gmail.com>	2019-12-19 20:56:06 -08:00
Aman Omer	726f6d3e3c	[SPARK-30184][SQL] Implement a helper method for aliasing functions ### What changes were proposed in this pull request? This PR is to use `expressionWithAlias` for remaining functions for which alias name can be used. Remaining functions are: `Average, First, Last, ApproximatePercentile, StddevSamp, VarianceSamp` PR https://github.com/apache/spark/pull/26712 introduced `expressionWithAlias` ### Why are the changes needed? Error message is wrong when alias name is used for above mentioned functions. ### Does this PR introduce any user-facing change? No ### How was this patch tested? Manually Closes #26808 from amanomer/fncAlias. Lead-authored-by: Aman Omer <amanomer1996@gmail.com> Co-authored-by: Aman Omer <40591404+amanomer@users.noreply.github.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2019-12-20 12:49:16 +08:00
Maxim Gekk	dea18231d4	[SPARK-30309][SQL] Mark `Filter` as a `sealed` class ### What changes were proposed in this pull request? Added the `sealed` keyword to the `Filter` class ### Why are the changes needed? To do not miss handling of new filters in a datasource in the future. For example, `AlwaysTrue` and `AlwaysFalse` were added recently by https://github.com/apache/spark/pull/23606 ### Does this PR introduce any user-facing change? Should not. ### How was this patch tested? By existing tests. Closes #26950 from MaxGekk/sealed-filter. Authored-by: Maxim Gekk <max.gekk@gmail.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com>	2019-12-19 12:30:34 -08:00


8868 commits