## What changes were proposed in this pull request?
Creating a temporary or permanent function should throw an AnalysisException if the resource is not found. We need to keep the behavior consistent across permanent and temporary functions.
## How was this patch tested?
Added UT and also tested manually
**Before Fix**
If the UDF resource is not present, then creating a temporary function throws an AnalysisException, whereas creating a permanent function does not throw. A permanent function throws an AnalysisException only after a select operation is performed.
**After Fix**
For both temporary and permanent functions, check for the resource; if the UDF resource is not found, throw an AnalysisException.
![rt](https://user-images.githubusercontent.com/35216143/62781519-d1131580-bad5-11e9-9d58-69e65be86c03.png)
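A minimal sketch of the shared check, assuming a hypothetical helper (name and error message are illustrative) that both the temporary and permanent create-function paths would call from inside Spark, where `AnalysisException` is constructible:
```scala
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.Path
import org.apache.spark.sql.AnalysisException
import org.apache.spark.sql.catalyst.catalog.FunctionResource

// Hypothetical helper: verify every declared resource before registering the function.
def checkFunctionResources(resources: Seq[FunctionResource], hadoopConf: Configuration): Unit = {
  resources.foreach { resource =>
    val path = new Path(resource.uri)
    val fs = path.getFileSystem(hadoopConf)
    if (!fs.exists(path)) {
      throw new AnalysisException(s"Resource file ${resource.uri} could not be found")
    }
  }
}
```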
Closes#25399 from sandeep-katta/funcIssue.
Authored-by: sandeep katta <sandeep.katta2007@gmail.com>
Signed-off-by: Takeshi Yamamuro <yamamuro@apache.org>
### What changes were proposed in this pull request?
This patch fixes the intermittent test failures in ThriftServerQueryTestSuite/ThriftServerWithSparkContextSuite, which hit a ConnectException when querying the thrift server.
(https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/115646/testReport/)
The relevant unit test log messages are following:
```
19/12/23 13:33:01.875 pool-1-thread-1 INFO AbstractService: Service:ThriftBinaryCLIService is started.
19/12/23 13:33:01.875 pool-1-thread-1 INFO AbstractService: Service:HiveServer2 is started.
...
19/12/23 13:33:01.888 pool-1-thread-1 INFO ThriftServerWithSparkContextSuite: HiveThriftServer2 started successfully
...
19/12/23 13:33:01.909 pool-1-thread-1-ScalaTest-running-ThriftServerWithSparkContextSuite INFO ThriftServerWithSparkContextSuite:
===== TEST OUTPUT FOR o.a.s.sql.hive.thriftserver.ThriftServerWithSparkContextSuite: 'SPARK-29911: Uncache cached tables when session closed' =====
...
19/12/23 13:33:02.017 pool-1-thread-1-ScalaTest-running-ThriftServerWithSparkContextSuite INFO Utils: Supplied authorities: localhost:15441
19/12/23 13:33:02.018 pool-1-thread-1-ScalaTest-running-ThriftServerWithSparkContextSuite INFO Utils: Resolved authority: localhost:15441
19/12/23 13:33:02.078 HiveServer2-Background-Pool: Thread-213 INFO BaseSessionStateBuilder$$anon$2: Optimization rule 'org.apache.spark.sql.catalyst.optimizer.ConvertToLocalRelation' is excluded from the optimizer.
19/12/23 13:33:02.078 HiveServer2-Background-Pool: Thread-213 INFO BaseSessionStateBuilder$$anon$2: Optimization rule 'org.apache.spark.sql.catalyst.optimizer.ConvertToLocalRelation' is excluded from the optimizer.
19/12/23 13:33:02.121 pool-1-thread-1-ScalaTest-running-ThriftServerWithSparkContextSuite WARN HiveConnection: Failed to connect to localhost:15441
19/12/23 13:33:02.124 pool-1-thread-1-ScalaTest-running-ThriftServerWithSparkContextSuite INFO ThriftServerWithSparkContextSuite:
===== FINISHED o.a.s.sql.hive.thriftserver.ThriftServerWithSparkContextSuite: 'SPARK-29911: Uncache cached tables when session closed' =====
19/12/23 13:33:02.143 Thread-35 INFO ThriftCLIService: Starting ThriftBinaryCLIService on port 15441 with 5...500 worker threads
19/12/23 13:33:02.327 pool-1-thread-1 INFO HiveServer2: Shutting down HiveServer2
19/12/23 13:33:02.328 pool-1-thread-1 INFO ThriftCLIService: Thrift server has stopped
```
(Here the error is logged as `WARN HiveConnection: Failed to connect to localhost:15441` - the actual stack trace can be seen on Jenkins test summary.)
The reason for the test failure: Thrift(Binary|Http)CLIService prepares and launches the service asynchronously (in a new thread), but the suites do not wait for its completion and just start running tests, which ends up in a race condition.
This can be easily reproduced by adding an artificial sleep in `ThriftBinaryCLIService.run()` here:
ba3f6330dd/sql/hive-thriftserver/v2.3/src/main/java/org/apache/hive/service/cli/thrift/ThriftBinaryCLIService.java (L49)
(Note that the `sleep` should be added before initializing the server socket, e.g. at line 57.)
This patch changes the test initialization logic to try executing a simple query and wait until the service is available. The patch also refactors the code so that the change applies to both ThriftServerQueryTestSuite and ThriftServerWithSparkContextSuite easily.
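A hedged sketch of such waiting logic, with hypothetical names: retry a trivial query over JDBC until the server accepts it or the attempts run out.
```scala
import java.sql.DriverManager
import scala.annotation.tailrec

// Hypothetical helper: poll the Thrift server with a trivial query until it is ready.
@tailrec
def waitUntilAvailable(jdbcUrl: String, attemptsLeft: Int): Unit = {
  val ready = try {
    val conn = DriverManager.getConnection(jdbcUrl)
    try { conn.createStatement().execute("SELECT 1") } finally conn.close()
  } catch {
    case _: Exception => false
  }
  if (!ready) {
    if (attemptsLeft <= 1) {
      throw new IllegalStateException("Thrift server did not become available in time")
    }
    Thread.sleep(500)
    waitUntilAvailable(jdbcUrl, attemptsLeft - 1)
  }
}
```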
### Why are the changes needed?
This patch fixes the intermittent failure observed here:
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/115646/testReport/
### Does this PR introduce any user-facing change?
No
### How was this patch tested?
Artificially made the test fail consistently (by the approach described above), and confirmed the patch fixed the test.
Closes#27001 from HeartSaVioR/SPARK-30345.
Authored-by: Jungtaek Lim (HeartSaVioR) <kabhwan.opensource@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request?
Unify `DriverEndpoint.executorIsAlive()` and `CoarseGrainedSchedulerBackend.isExecutorActive()`.
### Why are the changes needed?
`DriverEndpoint` has the method `executorIsAlive()` to check whether an executor is alive/active, while `CoarseGrainedSchedulerBackend` has the method `isExecutorActive()` to do the same work. However, `isExecutorActive()` seems to forget to consider `executorsPendingLossReason`. Unifying these two methods makes the behavior consistent between `DriverEndpoint` and `CoarseGrainedSchedulerBackend` and makes the code easier to maintain.
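A hedged sketch of the unified check; the three collection names mirror the driver-side bookkeeping fields mentioned above and should be read as assumptions:
```scala
// The unified liveness test: known to the driver, and not pending removal or loss.
def isExecutorActive(
    executorId: String,
    executorDataMap: Map[String, AnyRef],
    executorsPendingToRemove: Set[String],
    executorsPendingLossReason: Set[String]): Boolean = {
  executorDataMap.contains(executorId) &&
    !executorsPendingToRemove.contains(executorId) &&
    !executorsPendingLossReason.contains(executorId)
}
```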
### Does this PR introduce any user-facing change?
No.
### How was this patch tested?
Pass Jenkins.
Closes#27012 from Ngone51/unify-is-executor-alive.
Authored-by: yi.wu <yi.wu@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request?
Support instance weighting in GMM.
### Why are the changes needed?
ML should support instance weighting
### Does this PR introduce any user-facing change?
Yes, a new param `weightCol` is exposed.
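A hedged usage sketch; the dataset and column names are placeholders:
```scala
import org.apache.spark.ml.clustering.GaussianMixture

// `dataset` is assumed to contain a "features" vector column and a "weight" column.
val gmm = new GaussianMixture()
  .setK(3)
  .setFeaturesCol("features")
  .setWeightCol("weight") // the newly exposed param
val model = gmm.fit(dataset)
```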
### How was this patch tested?
Added test suites.
Closes#26735 from zhengruifeng/gmm_support_weight.
Authored-by: zhengruifeng <ruifengz@foxmail.com>
Signed-off-by: zhengruifeng <ruifengz@foxmail.com>
### What changes were proposed in this pull request?
Add a corresponding class `MultivariateGaussian`, containing a vector and a matrix, on the Python side, so that Gaussians can be used from Python.
### Does this PR introduce any user-facing change?
add Python class ```MultivariateGaussian```
### How was this patch tested?
doctest
Closes#27020 from huaxingao/spark-30247.
Authored-by: Huaxin Gao <huaxing@us.ibm.com>
Signed-off-by: zhengruifeng <ruifengz@foxmail.com>
### What changes were proposed in this pull request?
Update the Spark SQL document menu and join strategy hints.
### Why are the changes needed?
- Several recent changes to the Spark SQL documentation didn't update menu-sql.yaml correspondingly.
- Update the demo code for join strategy hints.
### Does this PR introduce any user-facing change?
No.
### How was this patch tested?
Document change only.
Closes#26917 from xuanyuanking/SPARK-30278.
Authored-by: Yuanjian Li <xyliyuanjian@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### Why are the changes needed?
`EnsureRequirements` adds a `ShuffleExchangeExec` (RangePartitioning) after `Sort` if `RoundRobinPartitioning` is behind it. This causes two shuffles, and the number of partitions in the final stage is not the number specified by `RoundRobinPartitioning`.
**Example SQL**
```
SELECT /*+ REPARTITION(5) */ * FROM test ORDER BY a
```
**BEFORE**
```
== Physical Plan ==
*(1) Sort [a#0 ASC NULLS FIRST], true, 0
+- Exchange rangepartitioning(a#0 ASC NULLS FIRST, 200), true, [id=#11]
+- Exchange RoundRobinPartitioning(5), false, [id=#9]
+- Scan hive default.test [a#0, b#1], HiveTableRelation `default`.`test`, org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, [a#0, b#1]
```
**AFTER**
```
== Physical Plan ==
*(1) Sort [a#0 ASC NULLS FIRST], true, 0
+- Exchange rangepartitioning(a#0 ASC NULLS FIRST, 5), true, [id=#11]
+- Scan hive default.test [a#0, b#1], HiveTableRelation `default`.`test`, org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, [a#0, b#1]
```
### Does this PR introduce any user-facing change?
No
### How was this patch tested?
Ran the test suites and added a new test for this.
Closes#26946 from stczwd/RoundRobinPartitioning.
Lead-authored-by: lijunqing <lijunqing@baidu.com>
Co-authored-by: stczwd <qcsd2011@163.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request?
Implement Factorization Machines as an ml-pipeline component (a hedged usage sketch follows the list):
1. supported loss functions: logloss, mse
2. optimizers: GD, adamW
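A hedged usage sketch, assuming the component is exposed as `FMClassifier`; the parameter names are assumptions:
```scala
import org.apache.spark.ml.classification.FMClassifier

// `train` is assumed to be a DataFrame with "label" and "features" columns.
val fm = new FMClassifier()
  .setFactorSize(8)   // dimensionality of the pairwise-interaction factors
  .setStepSize(0.01)
val model = fm.fit(train)
```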
### Why are the changes needed?
Factorization Machines are widely used in advertising and recommendation systems to estimate CTR (click-through rate).
Advertising and recommendation systems usually have a lot of data, so we need Spark to estimate the CTR, and Factorization Machines are a common ML model for estimating CTR.
References:
1. S. Rendle, “Factorization machines,” in Proceedings of IEEE International Conference on Data Mining (ICDM), pp. 995–1000, 2010.
https://www.csie.ntu.edu.tw/~b97053/paper/Rendle2010FM.pdf
### Does this PR introduce any user-facing change?
No
### How was this patch tested?
run unit tests
Closes#27000 from mob-ai/ml/fm.
Authored-by: zhanjf <zhanjf@mob.com>
Signed-off-by: Sean Owen <srowen@gmail.com>
### What changes were proposed in this pull request?
Quiet the overly verbose executor-removal request message from `ExecutorAllocationManager`. Otherwise, this class generates too many messages like
`INFO spark.ExecutorAllocationManager: Request to remove executorIds: 890`
when DRA is enabled.
### Why are the changes needed?
Log level improvement.
### Does this PR introduce any user-facing change?
No
### How was this patch tested?
Closes#26925 from cfmcgrady/quiet-request-executor-remove-message.
Authored-by: Fu Chen <cfmcgrady@gmail.com>
Signed-off-by: Sean Owen <srowen@gmail.com>
### What changes were proposed in this pull request?
This PR fixes `ScalaReflection.arrayClassFor()` to use an empty array instead of a one-element array for getting its class object by reflection.
### Why are the changes needed?
Because it avoids an unnecessary memory allocation.
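A hedged sketch of the idea: reflection yields the same array class for a zero-length array, so the one-element allocation can be dropped.
```scala
// Both calls return the same Class[_]; the second allocates no elements.
val elemClass = classOf[Int]
val before = java.lang.reflect.Array.newInstance(elemClass, 1).getClass
val after = java.lang.reflect.Array.newInstance(elemClass, 0).getClass
assert(before == after) // both are Class[Array[Int]]
```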
### Does this PR introduce any user-facing change?
No
### How was this patch tested?
Ran the existing unit tests for sql/catalyst and confirmed that all of them succeeded.
Closes#27005 from sekikn/SPARK-30350.
Authored-by: Kengo Seki <sekikn@apache.org>
Signed-off-by: Takeshi Yamamuro <yamamuro@apache.org>
### What changes were proposed in this pull request?
The FILTER clause for aggregate expressions is ANSI SQL:
```
<aggregate function> ::=
COUNT <left paren> <asterisk> <right paren> [ <filter clause> ]
| <general set function> [ <filter clause> ]
| <binary set function> [ <filter clause> ]
| <ordered set function> [ <filter clause> ]
| <array aggregate function> [ <filter clause> ]
| <row pattern count function> [ <filter clause> ]
```
Several mainstream databases support this syntax.
**PostgreSQL:**
https://www.postgresql.org/docs/current/sql-expressions.html#SYNTAX-AGGREGATES
For example:
```
SELECT
year,
count(*) FILTER (WHERE gdp_per_capita >= 40000)
FROM
countries
GROUP BY
year
```
```
SELECT
year,
code,
gdp_per_capita,
count(*)
FILTER (WHERE gdp_per_capita >= 40000)
OVER (PARTITION BY year)
FROM
countries
```
**jOOQ:**
https://blog.jooq.org/2014/12/30/the-awesome-postgresql-9-4-sql2003-filter-clause-for-aggregate-functions/
**Notice:**
1. This PR only supports the FILTER predicate without codegen; maropu will create another PR, related to SPARK-30027, to support codegen.
2. This PR only supports the FILTER predicate without DISTINCT; I will create another PR, related to SPARK-30276, to support it.
3. This PR only supports a FILTER predicate that can't reference the outer query; I created ticket SPARK-30219 to support that.
4. This PR only supports a FILTER predicate that can't use IN/EXISTS predicate sub-queries; I created ticket SPARK-30220 to support that.
5. Spark SQL cannot yet support SQL with nested aggregates; I created ticket SPARK-30182 to support that.
Here are some demonstrations of the PR in my production environment.
```
spark-sql> desc gja_test_partition;
key string NULL
value string NULL
other string NULL
col2 int NULL
# Partition Information
# col_name data_type comment
col2 int NULL
Time taken: 0.79 s
```
```
spark-sql> select * from gja_test_partition;
a A ao 1
b B bo 1
c C co 1
d D do 1
e E eo 2
g G go 2
h H ho 2
j J jo 2
f F fo 3
k K ko 3
l L lo 4
i I io 4
Time taken: 1.75 s
```
```
spark-sql> select count(key), sum(col2) from gja_test_partition;
12 26
Time taken: 1.848 s
```
```
spark-sql> select count(key) filter (where col2 > 1) from gja_test_partition;
8
Time taken: 2.926 s
```
```
spark-sql> select sum(col2) filter (where col2 > 2) from gja_test_partition;
14
Time taken: 2.087 s
```
```
spark-sql> select count(key) filter (where col2 > 1), sum(col2) filter (where col2 > 2) from gja_test_partition;
8 14
Time taken: 2.847 s
```
```
spark-sql> select count(key), count(key) filter (where col2 > 1), sum(col2), sum(col2) filter (where col2 > 2) from gja_test_partition;
12 8 26 14
Time taken: 1.787 s
```
```
spark-sql> desc student;
id int NULL
name string NULL
sex string NULL
class_id int NULL
Time taken: 0.206 s
```
```
spark-sql> select * from student;
1 张三 man 1
2 李四 man 1
3 王五 man 2
4 赵六 man 2
5 钱小花 woman 1
6 赵九红 woman 2
7 郭丽丽 woman 2
Time taken: 0.786 s
```
```
spark-sql> select class_id, count(id), sum(id) from student group by class_id;
1 3 8
2 4 20
Time taken: 18.783 s
```
```
spark-sql> select class_id, count(id) filter (where sex = 'man'), sum(id) filter (where sex = 'woman') from student group by class_id;
1 2 5
2 2 13
Time taken: 3.887 s
```
### Why are the changes needed?
Add new SQL feature.
### Does this PR introduce any user-facing change?
No.
### How was this patch tested?
Existing UTs and new UTs.
Closes#26656 from beliefer/support-aggregate-clause.
Lead-authored-by: gengjiaan <gengjiaan@360.cn>
Co-authored-by: Jiaan Geng <beliefer@163.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request?
This patch removes TODO comments which were left to address converting case classes having vars into normal classes in the spark-sql-kafka module. The pattern is actually discouraged, but here it is still worth breaking that rule, as we already use the automatic toString implementation and may use more of it.
### Why are the changes needed?
Described above.
### Does this PR introduce any user-facing change?
No.
### How was this patch tested?
Existing UTs.
Closes#26992 from HeartSaVioR/SPARK-30337.
Authored-by: Jungtaek Lim (HeartSaVioR) <kabhwan.opensource@gmail.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
### What changes were proposed in this pull request?
I execute a query such as `select get_json_object(ytag, '$.y1') AS y1 from t4`; Spark SQL returns null, but Hive returns correct results.
In my production environment, ytag is JSON wrapped in single quotes, as follows:
```
{'y1': 'shuma', 'y2': 'shuma:shouji'}
{'y1': 'jiaoyu', 'y2': 'jiaoyu:gaokao'}
{'y1': 'yule', 'y2': 'yule:mingxing'}
```
Then I realized that some functions, including get_json_object and json_tuple, do not support parsing JSON with single quotes.
So I provide this PR to resolve the issue.
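A hedged sketch of the likely shape of the fix: Jackson exposes a parser feature for single-quoted JSON, which the parsing path used by these functions would need to enable.
```scala
import com.fasterxml.jackson.core.{JsonFactory, JsonParser}

// Enable single-quote support on the factory used to parse the JSON input.
val jsonFactory = new JsonFactory()
jsonFactory.enable(JsonParser.Feature.ALLOW_SINGLE_QUOTES)
val parser = jsonFactory.createParser("""{'y1': 'shuma', 'y2': 'shuma:shouji'}""")
```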
### Why are the changes needed?
Enabled for Hive compatibility
### Does this PR introduce any user-facing change?
NO
### How was this patch tested?
NEW TESTS
Closes#26965 from wenfang6/enableSingleQuotesJsonForSparkSQL.
Authored-by: wenfang <wenfang@360.cn>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
### What changes were proposed in this pull request?
Make LibSVMDataSource attach an AttributeGroup.
### Why are the changes needed?
LibSVMDataSource attaches special metadata to indicate numFeatures:
```scala
scala> val data = spark.read.format("libsvm").load("/data0/Dev/Opensource/spark/data/mllib/sample_multiclass_classification_data.txt")
scala> data.schema("features").metadata
res0: org.apache.spark.sql.types.Metadata = {"numFeatures":4}
```
However, all ML impls try to obtain the vector size via AttributeGroup, which cannot use this metadata:
```scala
scala> import org.apache.spark.ml.attribute._
import org.apache.spark.ml.attribute._
scala> AttributeGroup.fromStructField(data.schema("features")).size
res1: Int = -1
```
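A hedged sketch of the fix: build the `features` field metadata through an AttributeGroup, so that `AttributeGroup.fromStructField` recovers the size (numFeatures = 4 here):
```scala
import org.apache.spark.ml.attribute.AttributeGroup

val field = new AttributeGroup("features", 4).toStructField()
assert(AttributeGroup.fromStructField(field).size == 4) // no longer -1
```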
### Does this PR introduce any user-facing change?
No
### How was this patch tested?
added tests
Closes#27003 from zhengruifeng/libsvm_attr_group.
Authored-by: zhengruifeng <ruifengz@foxmail.com>
Signed-off-by: zhengruifeng <ruifengz@foxmail.com>
### What changes were proposed in this pull request?
Rename `spark.sql.streaming.forceDeleteTempCheckpointLocation` to `spark.sql.streaming.forceDeleteTempCheckpointLocation.enabled`.
### Why are the changes needed?
To follow the other boolean conf naming convention.
### Does this PR introduce any user-facing change?
No, as this config is newly added in 3.0.
### How was this patch tested?
Pass Jenkins.
Closes#26981 from Ngone51/SPARK-26389-FOLLOWUP.
Authored-by: yi.wu <yi.wu@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request?
- accuracyExpression can accept a Long, which may cause an overflow error.
- accuracyExpression can accept fractions, which are implicitly floored.
- accuracyExpression can accept null, which is implicitly changed to 0.
- percentageExpression can accept null but causes a MatchError.
- percentageExpression can accept ArrayType(_, nullable=true), in which the nulls are implicitly changed to zeros.
##### cases
```sql
select percentile_approx(10.0, 0.5, 2147483648); -- overflow and fail
select percentile_approx(10.0, 0.5, 4294967297); -- overflow but success
select percentile_approx(10.0, 0.5, null); -- null cast to 0
select percentile_approx(10.0, 0.5, 1.2); -- 1.2 cast to 1
select percentile_approx(10.0, null, 1); -- scala.MatchError
select percentile_approx(10.0, array(0.2, 0.4, null), 1); -- null cast to zero.
```
##### behavior before
```sql
+select percentile_approx(10.0, 0.5, 2147483648)
+org.apache.spark.sql.AnalysisException
+cannot resolve 'percentile_approx(10.0BD, CAST(0.5BD AS DOUBLE), CAST(2147483648L AS INT))' due to data type mismatch: The accuracy provided must be a positive integer literal (current value = -2147483648); line 1 pos 7
+
+select percentile_approx(10.0, 0.5, 4294967297)
+10.0
+
+select percentile_approx(10.0, 0.5, null)
+org.apache.spark.sql.AnalysisException
+cannot resolve 'percentile_approx(10.0BD, CAST(0.5BD AS DOUBLE), CAST(NULL AS INT))' due to data type mismatch: The accuracy provided must be a positive integer literal (current value = 0); line 1 pos 7
+
+select percentile_approx(10.0, 0.5, 1.2)
+10.0
+
+select percentile_approx(10.0, null, 1)
+scala.MatchError
+null
+
+
+select percentile_approx(10.0, array(0.2, 0.4, null), 1)
+[10.0,10.0,10.0]
```
##### behavior after
```sql
+select percentile_approx(10.0, 0.5, 2147483648)
+10.0
+
+select percentile_approx(10.0, 0.5, 4294967297)
+10.0
+
+select percentile_approx(10.0, 0.5, null)
+org.apache.spark.sql.AnalysisException
+cannot resolve 'percentile_approx(10.0BD, 0.5BD, NULL)' due to data type mismatch: argument 3 requires integral type, however, 'NULL' is of null type.; line 1 pos 7
+
+select percentile_approx(10.0, 0.5, 1.2)
+org.apache.spark.sql.AnalysisException
+cannot resolve 'percentile_approx(10.0BD, 0.5BD, 1.2BD)' due to data type mismatch: argument 3 requires integral type, however, '1.2BD' is of decimal(2,1) type.; line 1 pos 7
+
+select percentile_approx(10.0, null, 1)
+java.lang.IllegalArgumentException
+The value of percentage must be be between 0.0 and 1.0, but got null
+
+select percentile_approx(10.0, array(0.2, 0.4, null), 1)
+java.lang.IllegalArgumentException
+Each value of the percentage array must be be between 0.0 and 1.0, but got [0.2,0.4,null]
```
### Why are the changes needed?
bug fix
### Does this PR introduce any user-facing change?
Yes, this fixes some improper usages of percentile_approx, as the cases listed above show.
### How was this patch tested?
Added UTs.
Closes#26905 from yaooqinn/SPARK-30266.
Authored-by: Kent Yao <yaooqinn@hotmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request?
Change config name from `spark.eventLog.allowErasureCoding` to `spark.eventLog.allowErasureCoding.enabled`.
### Why are the changes needed?
To follow the other boolean conf naming convention.
### Does this PR introduce any user-facing change?
No, it's newly added in Spark 3.0.
### How was this patch tested?
Tested manually and pass Jenkins.
Closes#26998 from Ngone51/SPARK-25855-FOLLOWUP.
Authored-by: yi.wu <yi.wu@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request?
Set `isFinalPlan=true` before posting the final AdaptiveSparkPlan event (`SparkListenerSQLAdaptiveExecutionUpdate`)
### Why are the changes needed?
Otherwise, any attempt to listen for the final event by pattern matching on `isFinalPlan=true` would fail.
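A hedged sketch of such a listener; relying on the plan description string for the match is an assumption made for illustration:
```scala
import org.apache.spark.scheduler.{SparkListener, SparkListenerEvent}
import org.apache.spark.sql.execution.ui.SparkListenerSQLAdaptiveExecutionUpdate

// `spark` is an active SparkSession.
spark.sparkContext.addSparkListener(new SparkListener {
  override def onOtherEvent(event: SparkListenerEvent): Unit = event match {
    case e: SparkListenerSQLAdaptiveExecutionUpdate
        if e.physicalPlanDescription.contains("isFinalPlan=true") =>
      // reached only after the final AdaptiveSparkPlan event is posted
    case _ => // ignore other events
  }
})
```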
### Does this PR introduce any user-facing change?
No.
### How was this patch tested?
All tests in `AdaptiveQueryExecSuite` are extended with a verification that a `SparkListenerSQLAdaptiveExecutionUpdate` event with `isFinalPlan=true` exists.
Closes#26983 from manuzhang/spark-30331.
Authored-by: manuzhang <owenzhang1990@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request?
We added shuffle block fetch optimization in SPARK-9853. In ShuffleBlockFetcherIterator, we merge single blocks into batch blocks. During merging, we should count merged blocks for `maxBlocksInFlightPerAddress`, not original single blocks.
### Why are the changes needed?
If `maxBlocksInFlightPerAddress` is specified (e.g., set to 1), it should mean one batch block, not one original single block. Otherwise, it will conflict with batch shuffle fetch.
### Does this PR introduce any user-facing change?
No
### How was this patch tested?
Unit test.
Closes#26930 from viirya/respect-max-blocks-inflight.
Lead-authored-by: Liang-Chi Hsieh <viirya@gmail.com>
Co-authored-by: Liang-Chi Hsieh <liangchi@uber.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request?
Compute the medians/ranges in a more distributed way.
### Why are the changes needed?
It is a bottleneck to collect the whole Array[QuantileSummaries] from the executors,
since a QuantileSummaries is a large object, maintaining arrays of large sizes: 10k (`defaultCompressThreshold`) / 50k (`defaultHeadSize`).
In spark-shell with default params, I processed a dataset with numFeatures=69,200, and the existing impl failed due to OOM.
After this PR, it successfully fits the model.
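A hedged sketch of the idea: key the values by feature index and merge the per-feature summaries on the executors, instead of collecting the whole array to the driver.
```scala
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.catalyst.util.QuantileSummaries

// One QuantileSummaries per feature, merged distributively via aggregateByKey.
def summariesByFeature(
    rows: RDD[Array[Double]],
    relativeError: Double): RDD[(Int, QuantileSummaries)] = {
  rows
    .flatMap(_.zipWithIndex.map { case (value, featureIdx) => (featureIdx, value) })
    .aggregateByKey(
      new QuantileSummaries(QuantileSummaries.defaultCompressThreshold, relativeError))(
      (summary, value) => summary.insert(value),
      (s1, s2) => s1.compress().merge(s2.compress()))
}
```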
### Does this PR introduce any user-facing change?
No
### How was this patch tested?
Existing test suites.
Closes#26803 from zhengruifeng/robust_high_dim.
Authored-by: zhengruifeng <ruifengz@foxmail.com>
Signed-off-by: zhengruifeng <ruifengz@foxmail.com>
### What changes were proposed in this pull request?
1. Modify toString in SQLTransformer & VectorSizeHint.
2. Add toString in RegexTokenizer.
### Why are the changes needed?
In SQLTransformer & VectorSizeHint, the `toString` methods directly call the getter of a param that has no default value.
This causes a `java.util.NoSuchElementException` in the REPL:
```scala
scala> val vs = new VectorSizeHint()
java.util.NoSuchElementException: Failed to find a default value for size
at org.apache.spark.ml.param.Params.$anonfun$getOrDefault$2(params.scala:780)
at scala.Option.getOrElse(Option.scala:189)
```
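A hedged sketch of the fix: render the param only when it is set, so `toString` never calls the getter of a param without a default value.
```scala
// Inside VectorSizeHint (sketch): `get` returns an Option and never throws.
override def toString: String = {
  val sizeStr = get(size).map(v => s", size=$v").getOrElse("")
  s"VectorSizeHint: uid=$uid$sizeStr"
}
```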
### Does this PR introduce any user-facing change?
No
### How was this patch tested?
Existing test suites.
Closes#26999 from zhengruifeng/fix_toString.
Authored-by: zhengruifeng <ruifengz@foxmail.com>
Signed-off-by: zhengruifeng <ruifengz@foxmail.com>
### What changes were proposed in this pull request?
While querying using **desc**, if the column name is not entered exactly as it was given during table creation, the column stats are wrong. Fetching of column stats has been made case insensitive.
### Why are the changes needed?
Functions like **analyze** already support case-insensitive retrieval of column data.
### Does this PR introduce any user-facing change?
NO
### How was this patch tested?
Unit test has been rewritten and tested.
Closes#26927 from PavithraRamachandran/desc_caseinsensitive.
Authored-by: Pavithra Ramachandran <pavi.rams@gmail.com>
Signed-off-by: Takeshi Yamamuro <yamamuro@apache.org>
### What changes were proposed in this pull request?
In the PR, I propose to update the SQL migration guide and clarify semantic of string conversion to typed `TIMESTAMP` and `DATE` literals.
### Why are the changes needed?
This is a follow-up of the PR https://github.com/apache/spark/pull/23541, which changed the behavior of `TIMESTAMP`/`DATE` literals and can impact the results of users' queries.
### Does this PR introduce any user-facing change?
No
### How was this patch tested?
It should be checked by the Jenkins build.
Closes#26985 from MaxGekk/timestamp-date-constructors-followup.
Authored-by: Maxim Gekk <max.gekk@gmail.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
### What changes were proposed in this pull request?
This patch addresses review comments in #26960 (https://github.com/apache/spark/pull/26960#discussion_r360661930 / https://github.com/apache/spark/pull/26960#discussion_r360661947) which were not addressed in the patch. Addressing these review comments makes the code less dependent on the actual implementation, as it only relies on the `Time` interface in Kafka.
### Why are the changes needed?
These were review comments in the previous PR, and they bring actual benefit even though they're minor.
### Does this PR introduce any user-facing change?
No.
### How was this patch tested?
Existing tests.
Closes#26979 from HeartSaVioR/SPARK-29294-follow-up.
Authored-by: Jungtaek Lim (HeartSaVioR) <kabhwan.opensource@gmail.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
### What changes were proposed in this pull request?
This patch revises the Kafka producer pool (cache) to implement 'expire' correctly.
The current implementation of the Kafka producer cache leverages a Guava cache, which decides that a cached producer instance should expire if the instance is not "accessed" from the cache. That behavior defines the expiration time as "last accessed time + timeout", which is incorrect because some task may use the instance longer than the timeout. There's no concept of "returning" in the Guava cache either, so it cannot be fixed with the Guava cache.
This patch introduces a new pool implementation which tracks the "reference count" of a cached instance, and defines the expiration time for the instance as "last returned time + timeout" once the reference count goes to 0, and Long.MaxValue otherwise (effectively no expiration). Expiring instances is done explicitly by an eviction thread instead of as part of handling acquire. (It might bring more overhead, but it ensures expired instances are cleared even when the pool is idle.)
This patch also creates a new package `producer` under `kafka010` to hide the details from the `kafka010` package. From the `kafka010` package's point of view, only acquire()/release()/reset() are available in the pool, and even for CachedKafkaProducer the package cannot close the producer directly.
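A minimal sketch of the reference-counting rule, with hypothetical names: an entry becomes eligible for expiration only once its reference count drops to zero.
```scala
// Hypothetical pooled entry; the real pool wraps a CachedKafkaProducer.
class PooledEntry[T](val value: T) {
  private var refCount = 0
  private var lastReturned = Long.MaxValue // effectively "never expires" while in use

  def acquire(): T = synchronized {
    refCount += 1
    lastReturned = Long.MaxValue
    value
  }

  def release(): Unit = synchronized {
    refCount -= 1
    if (refCount == 0) lastReturned = System.currentTimeMillis()
  }

  // Called by the eviction thread, not as part of acquire().
  def expired(timeoutMs: Long): Boolean = synchronized {
    refCount == 0 && System.currentTimeMillis() - lastReturned > timeoutMs
  }
}
```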
### Why are the changes needed?
Explained above.
### Does this PR introduce any user-facing change?
Yes, but only for the way of expiring cached instances. (The difference is described above.) Each executor leveraging spark-sql-kafka will have one eviction thread.
### How was this patch tested?
New and existing UTs.
Closes#26845 from HeartSaVioR/SPARK-21869-revised.
Authored-by: Jungtaek Lim (HeartSaVioR) <kabhwan.opensource@gmail.com>
Signed-off-by: Marcelo Vanzin <vanzin@cloudera.com>
### What changes were proposed in this pull request?
Implement Factorization Machines as an ml-pipeline component:
1. supported loss functions: logloss, mse
2. optimizers: GD, adamW
### Why are the changes needed?
Factorization Machines are widely used in advertising and recommendation systems to estimate CTR (click-through rate).
Advertising and recommendation systems usually have a lot of data, so we need Spark to estimate the CTR, and Factorization Machines are a common ML model for estimating CTR.
References:
1. S. Rendle, “Factorization machines,” in Proceedings of IEEE International Conference on Data Mining (ICDM), pp. 995–1000, 2010.
https://www.csie.ntu.edu.tw/~b97053/paper/Rendle2010FM.pdf
### Does this PR introduce any user-facing change?
No
### How was this patch tested?
run unit tests
Closes#26124 from mob-ai/ml/fm.
Authored-by: zhanjf <zhanjf@mob.com>
Signed-off-by: Sean Owen <srowen@gmail.com>
### What changes were proposed in this pull request?
This is an alternative solution to https://github.com/apache/spark/pull/25095.
SQLMetrics uses -1 as the init value as a workaround for [SPARK-11013](https://issues.apache.org/jira/browse/SPARK-11013). However, it may bring out some bad cases, as https://github.com/apache/spark/pull/26726 reports. In fact, we only need to reserve -1 when computing min/max statistics in `SQLMetrics.stringValue`, so that we can filter out uninitialized accumulators.
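A hedged sketch of what reserving -1 only in the statistics path looks like; the helper name is an assumption:
```scala
// Only the min/median/max computation drops the -1 (uninitialized) markers.
def minMedianMax(values: Seq[Long]): (Long, Long, Long) = {
  val valid = values.filter(_ >= 0).sorted
  if (valid.isEmpty) (0L, 0L, 0L)
  else (valid.head, valid(valid.length / 2), valid.last)
}
```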
### Does this PR introduce any user-facing change?
No
### How was this patch tested?
Existing UTs
Closes#26899 from WangGuangxin/sqlmetrics.
Authored-by: wangguangxin.cn <wangguangxin.cn@bytedance.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request?
This PR adds and exposes into the documentation the options `recursiveFileLookup` and `pathGlobFilter` in file sources, and `mergeSchema` in ORC.
- `recursiveFileLookup` at file sources: https://github.com/apache/spark/pull/24830 ([SPARK-27627](https://issues.apache.org/jira/browse/SPARK-27627))
- `pathGlobFilter` at file sources: https://github.com/apache/spark/pull/24518 ([SPARK-27990](https://issues.apache.org/jira/browse/SPARK-27990))
- `mergeSchema` at ORC: https://github.com/apache/spark/pull/24043 ([SPARK-11412](https://issues.apache.org/jira/browse/SPARK-11412))
**Note that** `timeZone` option was not moved from `DataFrameReader.options` as I assume it will likely affect other datasources as well once DSv2 is complete.
### Why are the changes needed?
To document available options in sources properly.
### Does this PR introduce any user-facing change?
In PySpark, `pathGlobFilter` can be set via `DataFrameReader.(text|orc|parquet|json|csv)` and `DataStreamReader.(text|orc|parquet|json|csv)`.
### How was this patch tested?
Manually built the doc and checked the output. The option setting in PySpark is rather a logical change; I manually tested only one:
```bash
$ ls -al tmp
...
-rw-r--r-- 1 hyukjin.kwon staff 3 Dec 20 12:19 aa
-rw-r--r-- 1 hyukjin.kwon staff 3 Dec 20 12:19 ab
-rw-r--r-- 1 hyukjin.kwon staff 3 Dec 20 12:19 ac
-rw-r--r-- 1 hyukjin.kwon staff 3 Dec 20 12:19 cc
```
```python
>>> spark.read.text("tmp", pathGlobFilter="*c").show()
```
```
+-----+
|value|
+-----+
| ac|
| cc|
+-----+
```
Closes#26958 from HyukjinKwon/doc-followup.
Authored-by: HyukjinKwon <gurwls223@apache.org>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
### What changes were proposed in this pull request?
This PR is a follow-up to https://github.com/apache/spark/pull/25001
### Why are the changes needed?
No
### Does this PR introduce any user-facing change?
No
### How was this patch tested?
Passed Jenkins with the newly updated test files.
Closes#26949 from beliefer/uncomment-like-escape-tests.
Authored-by: gengjiaan <gengjiaan@360.cn>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
### What changes were proposed in this pull request?
Fixed typos in the `docs` directory and in other directories.
1. Found typos in `docs` and applied fixes to files in all directories.
2. Fixed `the the` -> `the`.
### Why are the changes needed?
Better readability of documents
### Does this PR introduce any user-facing change?
No
### How was this patch tested?
No test needed
Closes#26976 from kiszk/typo_20191221.
Authored-by: Kazuaki Ishizaki <ishizaki@jp.ibm.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
### What changes were proposed in this pull request?
This patch upgrades the version of Kafka to 2.4, which supports Scala 2.13.
There're some incompatible changes in Kafka 2.4 which the patch addresses as well:
* `ZkUtils` is removed -> Replaced with `KafkaZkClient`
* Majority of methods are removed in `AdminUtils` -> Replaced with `AdminZkClient`
* Method signature of `Scheduler.schedule` is changed (return type) -> leverage `DeterministicScheduler` to avoid implementing `ScheduledFuture`
### Why are the changes needed?
* Kafka 2.4 supports Scala 2.13
### Does this PR introduce any user-facing change?
No, as Kafka API is known to be compatible across versions.
### How was this patch tested?
Existing UTs
Closes#26960 from HeartSaVioR/SPARK-29294.
Authored-by: Jungtaek Lim (HeartSaVioR) <kabhwan.opensource@gmail.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
### What changes were proposed in this pull request?
This PR updates the documentation to make Hive 2.3 the default dependency.
### Why are the changes needed?
The documentation is incorrect.
### Does this PR introduce any user-facing change?
No.
### How was this patch tested?
N/A
Closes#26919 from wangyum/SPARK-30280.
Authored-by: Yuming Wang <yumwang@ebay.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
### What changes were proposed in this pull request?
A follow-up of https://github.com/apache/spark/pull/26576, which mistakenly removed the UI update of the final plan.
### Why are the changes needed?
Fix the mistake.
### Does this PR introduce any user-facing change?
no
### How was this patch tested?
existing tests
Closes#26968 from cloud-fan/fix.
Authored-by: Wenchen Fan <wenchen@databricks.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
### What changes were proposed in this pull request?
When querying a partitioned table with format `org.apache.hive.hcatalog.data.JsonSerDe` and more than one task runs in each executor concurrently, the following exception is encountered:
`java.lang.ClassCastException: java.util.ArrayList cannot be cast to org.apache.hive.hcatalog.data.HCatRecord`
The exception occurs in `HadoopTableReader.fillObject`.
`org.apache.hive.hcatalog.data.JsonSerDe#initialize` populates a `cachedObjectInspector` field by calling `HCatRecordObjectInspectorFactory.getHCatRecordObjectInspector`, which is not thread-safe; this `cachedObjectInspector` is returned by `JsonSerDe#getObjectInspector`.
We protect against this Hive bug by synchronizing on an object when we need to call `initialize` on `org.apache.hadoop.hive.serde2.Deserializer` instances (which may be `JsonSerDe` instances). By doing so, the `ObjectInspector` for the `Deserializer` of the partitions of the JSON table and that of the table `SerDe` are the same cached `ObjectInspector` and `HadoopTableReader.fillObject` then works correctly. (If the `ObjectInspector`s are different, then a bug in `HCatRecordObjectInspector` causes an `ArrayList` to be created instead of an `HCatRecord`, resulting in the `ClassCastException` that is seen.)
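A hedged sketch of the guard; the lock object and helper are hypothetical, but `Deserializer#initialize(Configuration, Properties)` is the Hive API whose calls are being serialized:
```scala
import java.util.Properties
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.hive.serde2.Deserializer

// Global lock: JsonSerDe#initialize is not thread-safe, so serialize all calls.
object DeserializerLock

def initDeserializer(d: Deserializer, conf: Configuration, props: Properties): Unit = {
  DeserializerLock.synchronized {
    d.initialize(conf, props)
  }
}
```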
### Why are the changes needed?
To avoid HIVE-15773 / HIVE-21752.
### Does this PR introduce any user-facing change?
No.
### How was this patch tested?
Tested manually on a cluster with a partitioned JSON table and running a query using more than one core per executor. Before this change, the ClassCastException happens consistently. With this change it does not happen.
Closes#26895 from wypoon/SPARK-17398.
Authored-by: Wing Yew Poon <wypoon@cloudera.com>
Signed-off-by: Marcelo Vanzin <vanzin@cloudera.com>
### What changes were proposed in this pull request?
Remove usages of Guava that no longer work in Guava 27, and replace with workalikes. I'll comment on key types of changes below.
### Why are the changes needed?
Hadoop 3.2.1 uses Guava 27, so this helps us avoid problems running on Hadoop 3.2.1+ and generally lowers our exposure to Guava.
### Does this PR introduce any user-facing change?
Should not be, but see notes below on hash codes and toString.
### How was this patch tested?
Existing tests will verify whether these changes break anything for Guava 14.
I manually built with an updated version and it compiles with Guava 27; tests running manually locally now.
Closes#26911 from srowen/SPARK-30272.
Authored-by: Sean Owen <srowen@gmail.com>
Signed-off-by: Sean Owen <srowen@gmail.com>
### What changes were proposed in this pull request?
Add batching support in the ALTER TABLE ADD PARTITION flow. Also calculate new partition sizes faster by doing the listing in parallel.
### Why are the changes needed?
This PR splits the single createPartitions() call in the AlterTableAddPartition flow into smaller batches, which can prevent:
- SocketTimeoutException: adding thousands of partitions in the Hive metastore itself takes a lot of time; because of this, the hive client fails with a SocketTimeoutException.
- OOM in the Hive metastore (caused by millions of partitions).
It will also try to gather stats (total size of all files in all new partitions) faster by listing the new partition paths in parallel.
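A hedged sketch of the batching; the batch size and the catalog API shape are assumptions:
```scala
import org.apache.spark.sql.catalyst.catalog.{CatalogTablePartition, ExternalCatalog}

// Hypothetical helper: split one huge createPartitions() call into fixed-size batches.
def addPartitionsInBatches(
    catalog: ExternalCatalog,
    db: String,
    table: String,
    newParts: Seq[CatalogTablePartition],
    batchSize: Int = 100): Unit = {
  newParts.grouped(batchSize).foreach { batch =>
    catalog.createPartitions(db, table, batch, ignoreIfExists = true)
  }
}
```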
### Does this PR introduce any user-facing change?
No.
### How was this patch tested?
Added UT.
Also tested on a cluster in HDI with 15000 partitions and a remote metastore server. Without batching, the operation fails with a SocketTimeoutException; with batching, it finishes in 25 mins.
Closes#26569 from prakharjain09/add_partition_batching_r1.
Authored-by: Prakhar Jain <prakharjain09@gmail.com>
Signed-off-by: Sean Owen <srowen@gmail.com>
### What changes were proposed in this pull request?
In this PR, for a given metric id, we check whether the driver-side accumulator's value is greater than the max of all stage values. If so, we remove that entry from the HashMap. By doing this, "driver" is displayed on the UI for this metric (as the driver has the maximum value).
### Why are the changes needed?
This PR fixes https://issues.apache.org/jira/browse/SPARK-30300. Currently, the driver's metric value is not compared while calculating the max.
### Does this PR introduce any user-facing change?
For the metrics where the driver's value is greater than the max of all stages, the display changes.
Previous: (min, median, max (stageId 0 (attemptId 1): taskId 2))
Now: (min, median, max (driver))
### How was this patch tested?
Ran unit tests.
Closes#26941 from nartal1/SPARK-30300.
Authored-by: Niranjan Artal <nartal@nvidia.com>
Signed-off-by: Thomas Graves <tgraves@apache.org>
### What changes were proposed in this pull request?
When date and timestamp values are fields of arrays, maps, etc., we convert them to Hive strings using `toString`. This makes the result wrong before the default calendar transition date '1582-10-15'.
https://bugs.openjdk.java.net/browse/JDK-8061577?focusedCommentId=13566712&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-13566712
cases to reproduce:
```sql
+-- !query 47
+select array(cast('1582-10-13' as date), date '1582-10-14', date '1582-10-15', null)
+-- !query 47 schema
+struct<array(CAST(1582-10-13 AS DATE), DATE '1582-10-14', DATE '1582-10-15', CAST(NULL AS DATE)):array<date>>
+-- !query 47 output
+[1582-10-03,1582-10-04,1582-10-15,null]
+
+
+-- !query 48
+select cast('1582-10-13' as date), date '1582-10-14', date '1582-10-15'
+-- !query 48 schema
+struct<CAST(1582-10-13 AS DATE):date,DATE '1582-10-14':date,DATE '1582-10-15':date>
+-- !query 48 output
+1582-10-13 1582-10-14 1582-10-15
```
Other references:
https://github.com/h2database/h2database/issues/831
### Why are the changes needed?
bug fix
### Does this PR introduce any user-facing change?
Yes, complex types containing datetimes in the `spark-sql` script and the thrift server now give the same results as a self-contained Spark app or the `spark-shell` script.
### How was this patch tested?
Added UTs.
Closes#26942 from yaooqinn/SPARK-30301.
Authored-by: Kent Yao <yaooqinn@hotmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request?
AQE needs to catch the exception when doing materialization. Then users can get more information about the exception when AQE is enabled.
### Why are the changes needed?
Provide more information about the cause of the exception thrown during materialization.
### Does this PR introduce any user-facing change?
Before this PR, the error in the added unit test is
java.lang.RuntimeException: Invalid bucket file file:///${SPARK_HOME}/assembly/spark-warehouse/org.apache.spark.sql.execution.adaptive.AdaptiveQueryExecSuite/bucketed_table/part-00000-3551343c-d003-4ada-82c8-45c712a72efe-c000.snappy.parquet
After this PR, the error in the added unit test is:
org.apache.spark.SparkException: Adaptive execution failed due to stage materialization failures.
### How was this patch tested?
Add a new ut
Closes#26931 from JkSelf/catchMoreException.
Authored-by: jiake <ke.a.jia@intel.com>
Signed-off-by: Xiao Li <gatorsmile@gmail.com>
### What changes were proposed in this pull request?
When we reuse exchanges in AQE, what we produce is `ReuseQueryStage(QueryStage(Exchange))`. This PR changes it to `QueryStage(ReusedExchange(Exchange))`.
This PR also fixes an issue in `LocalShuffleReaderExec.outputPartitioning`. We can only preserve the partitioning if we read one mapper per task.
### Why are the changes needed?
`QueryStage` is lightweight and we don't need to reuse its instance. What we really care about is reusing the exchange instance, which has heavy state (e.g. broadcasted values, submitted map stage).
To simplify the framework, we should use the existing `ReusedExchange` node to do the reuse work, instead of creating a new node.
### Does this PR introduce any user-facing change?
no
### How was this patch tested?
existing tests
Closes#26952 from cloud-fan/aqe.
Authored-by: Wenchen Fan <wenchen@databricks.com>
Signed-off-by: Xiao Li <gatorsmile@gmail.com>
### What changes were proposed in this pull request?
This PR is to use `expressionWithAlias` for remaining functions for which alias name can be used. Remaining functions are:
`Average, First, Last, ApproximatePercentile, StddevSamp, VarianceSamp`
PR https://github.com/apache/spark/pull/26712 introduced `expressionWithAlias`
### Why are the changes needed?
The error message is wrong when an alias name is used for the above-mentioned functions.
### Does this PR introduce any user-facing change?
No
### How was this patch tested?
Manually
Closes#26808 from amanomer/fncAlias.
Lead-authored-by: Aman Omer <amanomer1996@gmail.com>
Co-authored-by: Aman Omer <40591404+amanomer@users.noreply.github.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request?
Added the `sealed` keyword to the `Filter` class
### Why are the changes needed?
So we do not miss handling new filters in a data source in the future. For example, `AlwaysTrue` and `AlwaysFalse` were added recently by https://github.com/apache/spark/pull/23606.
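A hedged sketch of the benefit: with `Filter` sealed, a match that does not cover every subclass triggers a compiler exhaustiveness warning instead of silently missing new filters.
```scala
import org.apache.spark.sql.sources._

// Without the catch-all, the compiler warns about unhandled Filter subclasses.
def describe(f: Filter): String = f match {
  case EqualTo(attr, value) => s"$attr = $value"
  case IsNull(attr) => s"$attr IS NULL"
  case _ => "unsupported"
}
```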
### Does this PR introduce any user-facing change?
Should not.
### How was this patch tested?
By existing tests.
Closes#26950 from MaxGekk/sealed-filter.
Authored-by: Maxim Gekk <max.gekk@gmail.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
### What changes were proposed in this pull request?
This patch addresses a missing metric: the number of output rows for streaming aggregation with append mode. Other modes measure it correctly.
### Why are the changes needed?
Without the patch, the value for such metric is always 0.
### Does this PR introduce any user-facing change?
No.
### How was this patch tested?
Unit test added. Also manually tested with below query:
> query
```
import spark.implicits._
spark.conf.set("spark.sql.shuffle.partitions", "5")
val df = spark.readStream
.format("rate")
.option("rowsPerSecond", 1000)
.load()
.withWatermark("timestamp", "5 seconds")
.selectExpr("timestamp", "mod(value, 100) as mod", "value")
.groupBy(window($"timestamp", "10 seconds"), $"mod")
.agg(max("value").as("max_value"), min("value").as("min_value"), avg("value").as("avg_value"))
val query = df
.writeStream
.format("memory")
.option("queryName", "test")
.outputMode("append")
.start()
query.awaitTermination()
```
> before the patch
![screenshot-before-SPARK-29450](https://user-images.githubusercontent.com/1317309/69023217-58d7bc80-0a01-11ea-8cac-40f1cced6d16.png)
> after the patch
![screenshot-after-SPARK-29450](https://user-images.githubusercontent.com/1317309/69023221-5c6b4380-0a01-11ea-8a66-7bf1b7d09fc7.png)
Closes#26104 from HeartSaVioR/SPARK-29450.
Authored-by: Jungtaek Lim (HeartSaVioR) <kabhwan.opensource@gmail.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
### What changes were proposed in this pull request?
As mentioned in https://github.com/apache/spark/pull/26548#pullrequestreview-334345333, some test cases in `RecordBinaryComparatorSuite` use a fixed arrayOffset when writing to long arrays; this could lead to weird behavior, including crashing with a SIGSEGV.
This PR fixes the problem by computing the arrayOffset based on `Platform.LONG_ARRAY_OFFSET`.
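A hedged sketch of the fix (index and value are placeholders): derive the write offset from the platform constant instead of a fixed literal.
```scala
import org.apache.spark.unsafe.Platform

val arr = new Array[Long](100)
val index = 0
// Offset is computed from LONG_ARRAY_OFFSET, not hard-coded.
Platform.putLong(arr, Platform.LONG_ARRAY_OFFSET + index * 8L, 42L)
```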
### How was this patch tested?
Tested locally. Previously, when we tried to add `System.gc()` between writing into the long array and comparing via RecordBinaryComparator, there was a chance of hitting a JVM crash with SIGSEGV like:
```
#
# A fatal error has been detected by the Java Runtime Environment:
#
# SIGSEGV (0xb) at pc=0x00007efc66970bcb, pid=11831, tid=0x00007efc0f9f9700
#
# JRE version: OpenJDK Runtime Environment (8.0_222-b10) (build 1.8.0_222-8u222-b10-1ubuntu1~16.04.1-b10)
# Java VM: OpenJDK 64-Bit Server VM (25.222-b10 mixed mode linux-amd64 compressed oops)
# Problematic frame:
# V [libjvm.so+0x5fbbcb]
#
# Core dump written. Default location: /home/jenkins/workspace/sql/core/core or core.11831
#
# An error report file with more information is saved as:
# /home/jenkins/workspace/sql/core/hs_err_pid11831.log
#
# If you would like to submit a bug report, please visit:
# http://bugreport.java.com/bugreport/crash.jsp
#
```
After the fix, those test cases no longer crash the JVM.
Closes#26939 from jiangxb1987/rbc.
Authored-by: Xingbo Jiang <xingbo.jiang@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
This reverts commit 8e667db5d8.
Closes#26940 from gengliangwang/revert_Spark_29629.
Authored-by: Gengliang Wang <gengliang.wang@databricks.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
### What changes were proposed in this pull request?
Since [25001](https://github.com/apache/spark/pull/25001), Spark supports the LIKE escape syntax.
We should also sync the escape character used by `LikeSimplification`.
### Why are the changes needed?
To avoid failed optimization.
### Does this PR introduce any user-facing change?
No.
### How was this patch tested?
Added a UT.
Closes#26880 from ulysses-you/SPARK-30254.
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
### What changes were proposed in this pull request?
We can get the partition statistics info, but in some special cases the info like totalSize, rawDataSize, and rowCount may be empty. When we run DDLs like
`desc formatted partition`, a NumberFormatException is shown as below:
```
spark-sql> desc formatted table1 partition(year='2019', month='10', day='17', hour='23');
19/10/19 00:02:40 ERROR SparkSQLDriver: Failed in [desc formatted table1 partition(year='2019', month='10', day='17', hour='23')]
java.lang.NumberFormatException: Zero length BigInteger
at java.math.BigInteger.(BigInteger.java:411)
at java.math.BigInteger.(BigInteger.java:597)
at scala.math.BigInt$.apply(BigInt.scala:77)
at org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$31.apply(HiveClientImpl.scala:1056)
```
Although we can use `ANALYZE TABLE ... PARTITION` to update totalSize, rawDataSize, or rowCount, it's unreasonable for normal SQL to throw a NumberFormatException for an empty totalSize. We should fix the empty case in readHiveStats.
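A hedged sketch of the empty-value guard in readHiveStats; the helper name is an assumption:
```scala
// Treat an empty Hive stat value as "no statistic" instead of feeding it to BigInt.
def toBigIntOption(value: String): Option[BigInt] =
  Option(value).filter(_.nonEmpty).flatMap(v => scala.util.Try(BigInt(v)).toOption)
```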
### Why are the changes needed?
This is related to the robustness of the code; it may lead to an unexpected exception in some unpredictable situations. Here is the case:
<img width="981" alt="image" src="https://user-images.githubusercontent.com/20614350/70845771-7b88b400-1e8d-11ea-95b0-df5c58097d7d.png">
### Does this PR introduce any user-facing change?
No
### How was this patch tested?
manual
Closes#26892 from southernriver/SPARK-30262.
Authored-by: chenliang <southernriver@163.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
### What changes were proposed in this pull request?
Refactor the code introduced by #26637.
No dummy StageInfo and its side effects are needed at all.
This change also enables users to set a job description for empty jobs.
### Why are the changes needed?
The previous approach introduced a dummy StageInfo, and this causes side effects.
### Does this PR introduce any user-facing change?
Yes. The description set by the user will be shown on the AllJobsPage.
![](https://user-images.githubusercontent.com/4736016/70788638-acf17900-1dd4-11ea-95f9-6d6739b24083.png)
### How was this patch tested?
Manual test and newly added unit test.
Closes#26703 from sarutak/fix-ui-for-empty-job2.
Lead-authored-by: Kousuke Saruta <sarutak@oss.nttdata.com>
Co-authored-by: Dongjoon Hyun <dhyun@apple.com>
Co-authored-by: Kousuke Saruta <sarutak@oss.nttdata.co.jp>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>