ODIn/spark-instrumented-optimizer

Author	SHA1	Message	Date
Huaxin Gao	0c8d3d2a15	[SPARK-28798][FOLLOW-UP] Add alter view link to drop view ### What changes were proposed in this pull request? Add alter view link to drop view ### Why are the changes needed? create view has links to drop view and alter view alter view has links to create view and drop view drop view currently doesn't have a link to alter view. I think it's better to link to alter view as well. ### Does this PR introduce any user-facing change? Yes ### How was this patch tested? Tested using jykyll build --serve Closes #26495 from huaxingao/spark-28798. Authored-by: Huaxin Gao <huaxing@us.ibm.com> Signed-off-by: Sean Owen <sean.owen@databricks.com>	2019-11-13 07:11:26 -06:00
Huaxin Gao	2beca777b6	[SPARK-28795][FOLLOW-UP] Links should point to html instead of md files ### What changes were proposed in this pull request? Use html files for the links ### Why are the changes needed? links not working ### Does this PR introduce any user-facing change? Yes ### How was this patch tested? Used jekyll build and serve to verify. Closes #26494 from huaxingao/spark-28795. Authored-by: Huaxin Gao <huaxing@us.ibm.com> Signed-off-by: Sean Owen <sean.owen@databricks.com>	2019-11-13 07:10:20 -06:00
gengjiaan	56be7318cc	[MINOR][BUILD] Fix an incorrect path in license-binary file ### What changes were proposed in this pull request? I want to say sorry! this PR follows the previous https://github.com/apache/spark/pull/26050. I didn't find them at the same time. The `LICENSE-binary` file exists a minor issue has an incorrect path. This PR will fix it. ### Why are the changes needed? This is a minor bug. ### Does this PR introduce any user-facing change? No ### How was this patch tested? Exists UT. Closes #26490 from beliefer/fix-minor-license-issue. Authored-by: gengjiaan <gengjiaan@360.cn> Signed-off-by: Sean Owen <sean.owen@databricks.com>	2019-11-13 07:06:08 -06:00
xy_xin	d7bdc6aa17	[SPARK-29835][SQL] Remove the unnecessary conversion from Statement to LogicalPlan for DELETE/UPDATE ### What changes were proposed in this pull request? The current parse and analyze flow for DELETE is: 1, the SQL string will be firstly parsed to `DeleteFromStatement`; 2, the `DeleteFromStatement` be converted to `DeleteFromTable`. However, the SQL string can be parsed to `DeleteFromTable` directly, where a `DeleteFromStatement` seems to be redundant. It is the same for UPDATE. This pr removes the unnecessary `DeleteFromStatement` and `UpdateTableStatement`. ### Why are the changes needed? This makes the codes for DELETE and UPDATE cleaner, and keep align with MERGE INTO. ### Does this PR introduce any user-facing change? No. ### How was this patch tested? Existed tests and new tests. Closes #26464 from xianyinxin/SPARK-29835. Authored-by: xy_xin <xianyin.xxy@alibaba-inc.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2019-11-13 20:53:12 +08:00
Terry Kim	b5a2ed6a37	[SPARK-29851][SQL] V2 catalog: Change default behavior of dropping namespace to cascade ### What changes were proposed in this pull request? Currently, `SupportsNamespaces.dropNamespace` drops a namespace only if it is empty. Thus, to implement a cascading drop, one needs to iterate all objects (tables, view, etc.) within the namespace (including its sub-namespaces recursively) and drop them one by one. This can have a negative impact on the performance when there are large number of objects. Instead, this PR proposes to change the default behavior of dropping a namespace to cascading such that implementing cascading/non-cascading drop is simpler without performance penalties. ### Why are the changes needed? The new behavior makes implementing cascading/non-cascading drop simple without performance penalties. ### Does this PR introduce any user-facing change? Yes. The default behavior of `SupportsNamespaces.dropNamespace` is now cascading. ### How was this patch tested? Added new unit tests. Closes #26476 from imback82/drop_ns_cascade. Authored-by: Terry Kim <yuminkim@gmail.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2019-11-13 17:06:27 +08:00
Kent Yao	f926809a1f	[SPARK-29390][SQL] Add the justify_days(), justify_hours() and justif_interval() functions ### What changes were proposed in this pull request? Add 3 interval functions justify_days, justify_hours, justif_interval to support justify interval values ### Why are the changes needed? For feature parity with postgres add three interval functions to justify interval values. justify_days(interval) \| interval \| Adjust interval so 30-day time periods are represented as months \| justify_days(interval '35 days') \| 1 mon 5 days -- \| -- \| -- \| -- \| -- justify_hours(interval) \| interval \| Adjust interval so 24-hour time periods are represented as days \| justify_hours(interval '27 hours') \| 1 day 03:00:00 justify_interval(interval) \| interval \| Adjust interval using justify_days and justify_hours, with additional sign adjustments \| justify_interval(interval '1 mon -1 hour') \| 29 days 23:00:00 ### Does this PR introduce any user-facing change? yes. new interval functions are added ### How was this patch tested? add ut Closes #26465 from yaooqinn/SPARK-29390. Authored-by: Kent Yao <yaooqinn@hotmail.com> Signed-off-by: Takeshi Yamamuro <yamamuro@apache.org>	2019-11-13 15:04:39 +09:00
HyukjinKwon	80fbc382a6	Revert "[SPARK-29462] The data type of "array()" should be array<null>" This reverts commit `0dcd739534`.	2019-11-13 13:12:20 +09:00
angerszhu	eb79af8dae	[SPARK-29145][SQL][FOLLOW-UP] Move tests from `SubquerySuite` to `subquery/in-subquery/in-joins.sql` ### What changes were proposed in this pull request? Follow comment of https://github.com/apache/spark/pull/25854#discussion_r342383272 ### Why are the changes needed? NO ### Does this PR introduce any user-facing change? NO ### How was this patch tested? ADD TEST CASE Closes #26406 from AngersZhuuuu/SPARK-29145-FOLLOWUP. Authored-by: angerszhu <angers.zhu@gmail.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com>	2019-11-12 17:34:03 -08:00
Marcelo Vanzin	56a0b5421e	[SPARK-29399][CORE] Remove old ExecutorPlugin interface SPARK-29397 added new interfaces for creating driver and executor plugins. These were added in a new, more isolated package that does not pollute the main o.a.s package. The old interface is now redundant. Since it's a DeveloperApi and we're about to have a new major release, let's remove it instead of carrying more baggage forward. Closes #26390 from vanzin/SPARK-29399. Authored-by: Marcelo Vanzin <vanzin@cloudera.com> Signed-off-by: HyukjinKwon <gurwls223@apache.org>	2019-11-13 09:52:40 +09:00
Ankitraj	45e212e161	[SPARK-29570][WEBUI] Improve tooltip for Executor Tab for Shuffle Write,Blacklisted,Logs,Threaddump columns ### What changes were proposed in this pull request? All tooltips message will display in centre. ### Why are the changes needed? Some time tooltips will hide the data of column and tooltips display position will be inconsistent in UI. ### Does this PR introduce any user-facing change? yes. ![Screenshot 2019-10-26 at 3 08 51 AM](https://user-images.githubusercontent.com/8948111/67606124-04dd0d80-f79e-11e9-865a-b7e9bffc9890.png) ### How was this patch tested? Manual test. Closes #26263 from 07ARB/SPARK-29570. Lead-authored-by: Ankitraj <8948111+07ARB@users.noreply.github.com> Co-authored-by: 07ARB <ankitrajboudh@gmail.com> Signed-off-by: Sean Owen <sean.owen@databricks.com>	2019-11-12 18:49:54 -06:00
DongWang	9a96a20a69	[SPARK-29844][ML] Improper unpersist strategy in ml.recommendation.ASL.train ### What changes were proposed in this pull request? Adjust improper unpersist timing on RDD. ### Why are the changes needed? Improper unpersist timing will result in memory waste ### Does this PR introduce any user-facing change? No ### How was this patch tested? Manually Closes #26469 from Icysandwich/SPARK-29844. Authored-by: DongWang <cqwd123@gmail.com> Signed-off-by: Sean Owen <sean.owen@databricks.com>	2019-11-12 18:31:58 -06:00
Wenchen Fan	030e5d987e	[SPARK-29789][SQL] should not parse the bucket column name when creating v2 tables ### What changes were proposed in this pull request? When creating v2 expressions, we have public java APIs, as well as interval scala APIs. All of these APIs take a string column name and parse it to `NamedReference`. This is convenient for end-users, but not for interval development. For example, the query plan already contains the parsed partition/bucket column names, and it's tricky if we need to quote the names before creating v2 expressions. This PR proposes to change the interval scala APIs to take `NamedReference` directly, with a new method to create `NamedReference` with the exact name parts. The public java APIs are not changed. ### Why are the changes needed? fix a bug, and make it easier to create v2 expressions correctly in the future. ### Does this PR introduce any user-facing change? yes, now v2 CREATE TABLE works as expected. ### How was this patch tested? a new test Closes #26425 from cloud-fan/extract. Authored-by: Wenchen Fan <wenchen@databricks.com> Signed-off-by: Ryan Blue <blue@apache.org>	2019-11-12 12:25:45 -08:00
Wenchen Fan	414cade011	[SPARK-29850][SQL] sort-merge-join an empty table should not memory leak ### What changes were proposed in this pull request? When whole stage codegen `HashAggregateExec`, create the hash map when we begin to process inputs. ### Why are the changes needed? Sort-merge join completes directly if the left side table is empty. If there is an aggregate in the right side, the aggregate will not be triggered at all, but its hash map is created during codegen and can't be released. ### Does this PR introduce any user-facing change? No ### How was this patch tested? a new test Closes #26471 from cloud-fan/memory. Authored-by: Wenchen Fan <wenchen@databricks.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2019-11-13 01:00:30 +08:00
Kent Yao	d99398e9f5	[SPARK-29855][SQL] typed literals with negative sign with proper result or exception ### What changes were proposed in this pull request? ```sql -- !query 83 select -integer '7' -- !query 83 schema struct<7:int> -- !query 83 output 7 -- !query 86 select -date '1999-01-01' -- !query 86 schema struct<DATE '1999-01-01':date> -- !query 86 output 1999-01-01 -- !query 87 select -timestamp '1999-01-01' -- !query 87 schema struct<TIMESTAMP('1999-01-01 00:00:00'):timestamp> -- !query 87 output 1999-01-01 00:00:00 ``` the integer should be -7 and the date and timestamp results are confusing which should throw exceptions ### Why are the changes needed? bug fix ### Does this PR introduce any user-facing change? NO ### How was this patch tested? ADD UTs Closes #26479 from yaooqinn/SPARK-29855. Authored-by: Kent Yao <yaooqinn@hotmail.com> Signed-off-by: HyukjinKwon <gurwls223@apache.org>	2019-11-12 23:53:07 +09:00
lajin	5cb05f4100	[SPARK-29298][CORE] Separate block manager heartbeat endpoint from driver endpoint ### What changes were proposed in this pull request? Executor's heartbeat will send synchronously to BlockManagerMaster to let it know that the block manager is still alive. In a heavy cluster, it will timeout and cause block manager re-register unexpected. This improvement will separate a heartbeat endpoint from the driver endpoint. In our production environment, this was really helpful to prevent executors from unstable up and down. ### Why are the changes needed? `BlockManagerMasterEndpoint` handles many events from executors like `RegisterBlockManager`, `GetLocations`, `RemoveShuffle`, `RemoveExecutor` etc. In a heavy cluster/app, it is always busy. The `BlockManagerHeartbeat` event also was handled in this endpoint. We found it may timeout when it's busy. So we add a new endpoint `BlockManagerMasterHeartbeatEndpoint` to handle heartbeat separately. ### Does this PR introduce any user-facing change? No. ### How was this patch tested? Exist UTs Closes #25971 from LantaoJin/SPARK-29298. Lead-authored-by: lajin <lajin@ebay.com> Co-authored-by: Alan Jin <lajin@ebay.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2019-11-12 16:24:48 +08:00
Xingbo Jiang	0346afa8fc	[SPARK-29001][CORE] Print events that take too long time to process ### What changes were proposed in this pull request? Print events that take too long time to process, to help find out what type of events is slow. Introduce two extra configs: * spark.scheduler.listenerbus.logSlowEvent.enabled Whether to enable log the events that are slow * spark.scheduler.listenerbus.logSlowEvent.threshold The time threshold of whether an event is considered to be slow. ### How was this patch tested? N/A Closes #25702 from jiangxb1987/SPARK-29001. Authored-by: Xingbo Jiang <xingbo.jiang@databricks.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2019-11-12 14:08:13 +08:00
Pablo Langa	37e387a22d	[SPARK-29519][SQL] SHOW TBLPROPERTIES should do multi-catalog resolution ### What changes were proposed in this pull request? Add ShowTablePropertiesStatement and make SHOW TBLPROPERTIES go through the same catalog/table resolution framework of v2 commands. ### Why are the changes needed? It's important to make all the commands have the same table resolution behavior, to avoid confusing end-users. e.g. USE my_catalog DESC t // success and describe the table t from my_catalog SHOW TBLPROPERTIES t // report table not found as there is no table t in the session catalog ### Does this PR introduce any user-facing change? yes. When running SHOW TBLPROPERTIES Spark fails the command if the current catalog is set to a v2 catalog, or the table name specified a v2 catalog. ### How was this patch tested? Unit tests. Closes #26176 from planga82/feature/SPARK-29519_SHOW_TBLPROPERTIES_datasourceV2. Authored-by: Pablo Langa <soypab@gmail.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2019-11-12 13:31:28 +08:00
Jungtaek Lim (HeartSaVioR)	df08e903b5	[SPARK-29755][CORE] Provide @JsonDeserialize for Option[Long] in LogInfo & AttemptInfoWrapper ### What changes were proposed in this pull request? This patch adds `JsonDeserialize` annotation for the field which type is `Option[Long]` in LogInfo/AttemptInfoWrapper. It hits https://github.com/FasterXML/jackson-module-scala/wiki/FAQ#deserializing-optionint-and-other-primitive-challenges - other existing json models take care of this, but we missed to add annotation to these classes. ### Why are the changes needed? Without this change, SHS will throw ClassNotFoundException when rebuilding App UI. ### Does this PR introduce any user-facing change? No. ### How was this patch tested? Manually tested. Closes #26397 from HeartSaVioR/SPARK-29755. Authored-by: Jungtaek Lim (HeartSaVioR) <kabhwan.opensource@gmail.com> Signed-off-by: Marcelo Vanzin <vanzin@cloudera.com>	2019-11-11 15:49:16 -08:00
Jungtaek Lim (HeartSaVioR)	c941362cb9	[SPARK-26154][SS] Streaming left/right outer join should not return outer nulls for already matched rows ### What changes were proposed in this pull request? This patch fixes the edge case of streaming left/right outer join described below: Suppose query is provided as `select * from A join B on A.id = B.id AND (A.ts <= B.ts AND B.ts <= A.ts + interval 5 seconds)` and there're two rows for L1 (from A) and R1 (from B) which ensures L1.id = R1.id and L1.ts = R1.ts. (we can simply imagine it from self-join) Then Spark processes L1 and R1 as below: - row L1 and row R1 are joined at batch 1 - row R1 is evicted at batch 2 due to join and watermark condition, whereas row L1 is not evicted - row L1 is evicted at batch 3 due to join and watermark condition When determining outer rows to match with null, Spark applies some assumption commented in codebase, as below: ``` Checking whether the current row matches a key in the right side state, and that key has any value which satisfies the filter function when joined. If it doesn't, we know we can join with null, since there was never (including this batch) a match within the watermark period. If it does, there must have been a match at some point, so we know we can't join with null. ``` But as explained the edge-case earlier, the assumption is not correct. As we don't have any good assumption to optimize which doesn't have edge-case, we have to track whether such row is matched with others before, and match with null row only when the row is not matched. To track the matching of row, the patch adds a new state to streaming join state manager, and mark whether the row is matched to others or not. We leverage the information when dealing with eviction of rows which would be candidates to match with null rows. This approach introduces new state format which is not compatible with old state format - queries with old state format will be still running but they will still have the issue and be required to discard checkpoint and rerun to take this patch in effect. ### Why are the changes needed? This patch fixes a correctness issue. ### Does this PR introduce any user-facing change? No for compatibility viewpoint, but we'll encourage end users to discard the old checkpoint and rerun the query if they run stream-stream outer join query with old checkpoint, which might be "yes" for the question. ### How was this patch tested? Added UT which fails on current Spark and passes with this patch. Also passed existing streaming join UTs. Closes #26108 from HeartSaVioR/SPARK-26154-shorten-alternative. Authored-by: Jungtaek Lim (HeartSaVioR) <kabhwan.opensource@gmail.com> Signed-off-by: Marcelo Vanzin <vanzin@cloudera.com>	2019-11-11 15:47:17 -08:00
Marcelo Vanzin	9753a8e330	[SPARK-29766][SQL] Do metrics aggregation asynchronously in SQL listener This unblocks the event handling thread, which should help avoid dropped events when large queries are running. Existing unit tests should already cover this code. Closes #26405 from vanzin/SPARK-29766. Authored-by: Marcelo Vanzin <vanzin@cloudera.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com>	2019-11-11 14:20:34 -08:00
DB Tsai	a6a2748585	[SPARK-29805][SQL] Enable nested schema pruning and nested pruning on expressions by default ### What changes were proposed in this pull request? Enable nested schema pruning and nested pruning on expressions by default. We have been using those features in production in Apple for couple months with great success. For some jobs, we reduce the data reading by more than 8x and 21x faster in wall clock time. ### Why are the changes needed? Better performance. ### Does this PR introduce any user-facing change? No. ### How was this patch tested? Existing tests. Closes #26443 from dbtsai/enableNestedSchemaPrunning. Authored-by: DB Tsai <d_tsai@apple.com> Signed-off-by: DB Tsai <d_tsai@apple.com>	2019-11-11 19:11:05 +00:00
zhengruifeng	76e5294bb6	[SPARK-29801][ML] ML models unify toString method ### What changes were proposed in this pull request? 1,ML models should extend toString method to expose basic information. Current some algs (GBT/RF/LoR) had done this, while others not yet. 2,add `val numFeatures` in `BisectingKMeansModel`/`GaussianMixtureModel`/`KMeansModel`/`AFTSurvivalRegressionModel`/`IsotonicRegressionModel` ### Why are the changes needed? ML models should extend toString method to expose basic information. ### Does this PR introduce any user-facing change? yes ### How was this patch tested? existing testsuites Closes #26439 from zhengruifeng/models_toString. Authored-by: zhengruifeng <ruifengz@foxmail.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com>	2019-11-11 11:03:26 -08:00
Takeshi Yamamuro	cceb2d6f11	[SPARK-29825][SQL][TESTS] Add join-related configs in `inner-join.sql` and `postgreSQL/join.sql` ### What changes were proposed in this pull request? For better test coverage, this pr is to add join-related configs in `inner-join.sql` and `postgreSQL/join.sql`. These join related configs were just copied from ones in the other join-related tests in `SQLQueryTestSuite` (e.g., https://github.com/apache/spark/blob/master/sql/core/src/test/resources/sql-tests/inputs/natural-join.sql#L2-L4). ### Why are the changes needed? Better test coverage. ### Does this PR introduce any user-facing change? No. ### How was this patch tested? Existing tests. Closes #26459 from maropu/AddJoinConds. Authored-by: Takeshi Yamamuro <yamamuro@apache.org> Signed-off-by: Dongjoon Hyun <dhyun@apple.com>	2019-11-11 10:21:33 -08:00
Kent Yao	d06a9cc4bd	[SPARK-29822][SQL] Fix cast error when there are white spaces between signs and values ### What changes were proposed in this pull request? With the latest string to literal optimization https://github.com/apache/spark/pull/26256, some interval strings can not be cast when there are some spaces between signs and unit values. After state `PARSE_SIGN`, it directly goes to `PARSE_UNIT_VALUE` when takes a space character as the end. So when there are some white spaces come before the real unit value, it fails to parse, we should add a new state like `TRIM_VALUE` to trim all these spaces. How to re-produce, which aim the revisions since https://github.com/apache/spark/pull/26256 is merged ```sql select cast(v as interval) from values ('+ 1 second') t(v); select cast(v as interval) from values ('- 1 second') t(v); ``` ### Why are the changes needed? bug fix ### Does this PR introduce any user-facing change? no ### How was this patch tested? 1. ut 2. new benchmark test Closes #26449 from yaooqinn/SPARK-29605. Authored-by: Kent Yao <yaooqinn@hotmail.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2019-11-11 21:53:33 +08:00
lajin	4de7131cff	[SPARK-29421][SQL] Supporting Create Table Like Using Provider ### What changes were proposed in this pull request? Hive support STORED AS new file format syntax: ```sql CREATE TABLE tbl(a int) STORED AS TEXTFILE; CREATE TABLE tbl2 LIKE tbl STORED AS PARQUET; ``` We add a similar syntax for Spark. Here we separate to two features: 1. specify a different table provider in CREATE TABLE LIKE 2. Hive compatibility In this PR, we address the first one: - [ ] Using `USING provider` to specify a different table provider in CREATE TABLE LIKE. - [ ] Using `STORED AS file_format` in CREATE TABLE LIKE to address Hive compatibility. ### Why are the changes needed? Use CREATE TABLE tb1 LIKE tb2 command to create an empty table tb1 based on the definition of table tb2. The most user case is to create tb1 with the same schema of tb2. But an inconvenient case here is this command also copies the FileFormat from tb2, it cannot change the input/output format and serde. Add the ability of changing file format is useful for some scenarios like upgrading a table from a low performance file format to a high performance one (parquet, orc). ### Does this PR introduce any user-facing change? Add a new syntax based on current CTL: ```sql CREATE TABLE tbl2 LIKE tbl [USING parquet]; ``` ### How was this patch tested? Modify some exist UTs. Closes #26097 from LantaoJin/SPARK-29421. Authored-by: lajin <lajin@ebay.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2019-11-11 15:25:56 +08:00
Maxim Gekk	18440151b0	[SPARK-29393][SQL] Add `make_interval` function ### What changes were proposed in this pull request? In the PR, I propose new expression `MakeInterval` and register it as the function `make_interval`. The function accepts the following parameters: - `years` - the number of years in the interval, positive or negative. The parameter is multiplied by 12, and added to interval's `months`. - `months` - the number of months in the interval, positive or negative. - `weeks` - the number of months in the interval, positive or negative. The parameter is multiplied by 7, and added to interval's `days`. - `hours`, `mins` - the number of hours and minutes. The parameters can be negative or positive. They are converted to microseconds and added to interval's `microseconds`. - `seconds` - the number of seconds with the fractional part in microseconds precision. It is converted to microseconds, and added to total interval's `microseconds` as `hours` and `minutes`. For example: ```sql spark-sql> select make_interval(2019, 11, 1, 1, 12, 30, 01.001001); 2019 years 11 months 8 days 12 hours 30 minutes 1.001001 seconds ``` ### Why are the changes needed? - To improve user experience with Spark SQL, and allow users making `INTERVAL` columns from other columns containing `years`, `months` ... `seconds`. Currently, users can make an `INTERVAL` column from other columns only by constructing a `STRING` column and cast it to `INTERVAL`. Have a look at the `IntervalBenchmark` as an example. - To maintain feature parity with PostgreSQL which provides such function: ```sql # SELECT make_interval(2019, 11); make_interval -------------------- 2019 years 11 mons ``` ### Does this PR introduce any user-facing change? No ### How was this patch tested? - By new tests for the `MakeInterval` expression to `IntervalExpressionsSuite` - By tests in `interval.sql` Closes #26446 from MaxGekk/make_interval. Authored-by: Maxim Gekk <max.gekk@gmail.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com>	2019-11-10 14:34:52 -08:00
Pavithra Ramachandran	e2ca7f396f	[SPARK-29601][WEBUI] JDBC ODBC Tab Statement column provide ellipsis for big SQL statement ### What changes were proposed in this pull request? Provide Ellipses in Statement column , just like description in Jobs page . ### Why are the changes needed? When a query is executed the whole query statement is displayed no matter how big it is. When bigger queries are executed, it covers a large portion of the page display, when we have multiple queries it is difficult to scroll down to view all. ### Does this PR introduce any user-facing change? No Before: ![Screenshot from 2019-11-01 23-15-23](https://user-images.githubusercontent.com/51401130/68064468-ebaa0300-fd41-11e9-8787-c5144c1468d4.png) After: ![Screenshot from 2019-11-02 07-07-21](https://user-images.githubusercontent.com/51401130/68064471-f19fe400-fd41-11e9-85c6-65f0faa64cc3.png) ### How was this patch tested? Manual Closes #26364 from PavithraRamachandran/ellipse_JDBC. Authored-by: Pavithra Ramachandran <pavi.rams@gmail.com> Signed-off-by: Sean Owen <sean.owen@databricks.com>	2019-11-10 13:08:26 -06:00
Dongjoon Hyun	4b71ad6ffb	[SPARK-29820][INFRA] Use GitHub Action Cache for `./.m2/repository/[com\|org]` ### What changes were proposed in this pull request? This PR aims to enable [GitHub Action Cache on Maven local repository](https://github.com/actions/cache/blob/master/examples.md#java---maven) for the following goals. 1. To reduce the chance of failure due to the Maven download flakiness. 2. To speed up the build a little bit. Unfortunately, due to the GitHub Action Cache limitation, it seems that we cannot put all into a single cache. It's ignored like the following. - .m2/repository is 680777194 bytes ``` /bin/tar -cz -f /home/runner/work/_temp/01f162c3-0c78-4772-b3de-b619bb5d7721/cache.tgz -C /home/runner/.m2/repository . 3 ##[warning]Cache size of 680777194 bytes is over the 400MB limit, not saving cache. ``` ### Why are the changes needed? Not only for the speed up, but also for reducing the Maven download flakiness, we had better enable caching on local maven repository. The followings are the failure examples in these days. - https://github.com/apache/spark/runs/295869450 ``` [ERROR] Failed to execute goal on project spark-streaming-kafka-0-10_2.12: Could not resolve dependencies for project org.apache.spark:spark-streaming-kafka-0-10_2.12🫙spark-367716: Could not transfer artifact com.fasterxml.jackson.datatype:jackson-datatype-jdk8:jar:2.10.0 from/to central (https://repo.maven.apache.org/maven2): Failed to transfer file https://repo.maven.apache.org/maven2/com/fasterxml/jackson/datatype/ jackson-datatype-jdk8/2.10.0/jackson-datatype-jdk8-2.10.0.jar with status code 503 -> [Help 1] ... [ERROR] mvn <args> -rf :spark-streaming-kafka-0-10_2.12 ``` ``` [ERROR] Failed to execute goal on project spark-tools_2.12: Could not resolve dependencies for project org.apache.spark:spark-tools_2.12🫙3.0.0-SNAPSHOT: Failed to collect dependencies at org.clapper:classutil_2.12🫙1.5.1: Failed to read artifact descriptor for org.clapper:classutil_2.12🫙1.5.1: Could not transfer artifact org.clapper:classutil_2.12:pom:1.5.1 from/to central (https://repo.maven.apache.org/maven2): Connection timed out (Read failed) -> [Help 1] ``` ### Does this PR introduce any user-facing change? No. ### How was this patch tested? Manually check the GitHub Action log of this PR. ``` Cache restored from key: 1.8-hadoop-2.7-maven-com-5b4a9fb13c5f5ff78e65a20003a3810796e4d1fde5f24d397dfe6e5153960ce4 Cache restored from key: 1.8-hadoop-2.7-maven-org-5b4a9fb13c5f5ff78e65a20003a3810796e4d1fde5f24d397dfe6e5153960ce4 ``` Closes #26456 from dongjoon-hyun/SPARK-29820. Authored-by: Dongjoon Hyun <dhyun@apple.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com>	2019-11-10 11:02:54 -08:00
Maxim Gekk	d4de01f567	[SPARK-29408][SQL] Support `-` before `interval` in interval literals ### What changes were proposed in this pull request? - `SqlBase.g4` is modified to support a negative sign `-` in the interval type constructor from a string and in interval literals - Negate interval in `AstBuilder` if a sign presents. - Interval related SQL statements are moved from `inputs/datetime.sql` to new file `inputs/interval.sql` For example: ```sql spark-sql> select -interval '-1 month 1 day -1 second'; 1 months -1 days 1 seconds spark-sql> select -interval -1 month 1 day -1 second; 1 months -1 days 1 seconds ``` ### Why are the changes needed? For feature parity with PostgreSQL which supports that: ```sql # select -interval '-1 month 1 day -1 second'; ?column? ------------------------- 1 mon -1 days +00:00:01 (1 row) ``` ### Does this PR introduce any user-facing change? No ### How was this patch tested? - Added tests to `ExpressionParserSuite` - by `interval.sql` Closes #26438 from MaxGekk/negative-interval. Authored-by: Maxim Gekk <max.gekk@gmail.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com>	2019-11-10 10:10:04 -08:00
Dongjoon Hyun	e25cfc4bb8	[SPARK-29528][BUILD] Upgrade scala-maven-plugin to 4.3.0 for Scala 2.13.1 ### What changes were proposed in this pull request? This PR aims to upgrade `scala-maven-plugin` to `4.3.0` for Scala `2.13.1`. We tried 4.2.4, but it's reverted due to Windows build issue. Now, `4.3.0` has a Window fix. ### Why are the changes needed? Scala 2.13.1 seems to break the binary compatibility. We need to upgrade `scala-maven-plugin` to bring the the following fixes for the latest Scala 2.13.1. - https://github.com/davidB/scala-maven-plugin/issues/363 - https://github.com/sbt/zinc/issues/698 Also, `4.3.0` has the following Window fix. - https://github.com/davidB/scala-maven-plugin/issues/370 (4.2.4 throws error on Windows) ### Does this PR introduce any user-facing change? No. ### How was this patch tested? - For now, we don't support Scala-2.13. This PR at least needs to pass the existing Jenkins with Maven to get prepared for Scala-2.13. - `AppVeyor` passed. (https://ci.appveyor.com/project/ApacheSoftwareFoundation/spark/builds/28745383) Closes #26457 from dongjoon-hyun/SPARK-29528. Authored-by: Dongjoon Hyun <dhyun@apple.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com>	2019-11-10 08:49:05 -08:00
Maxim Gekk	7ddcb5b46d	[SPARK-29819][SQL] Introduce an enum for interval units ### What changes were proposed in this pull request? In the PR, I propose an enumeration for interval units with the value `YEAR`, `MONTH`, `WEEK`, `DAY`, `HOUR`, `MINUTE`, `SECOND`, `MILLISECOND`, `MICROSECOND` and `NANOSECOND`. ### Why are the changes needed? - This should prevent typos in interval unit names - Stronger type checking of unit parameters. ### Does this PR introduce any user-facing change? No ### How was this patch tested? By existing test suites `ExpressionParserSuite` and `IntervalUtilsSuite` Closes #26455 from MaxGekk/interval-unit-enum. Authored-by: Maxim Gekk <max.gekk@gmail.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com>	2019-11-10 08:41:55 -08:00
Huaxin Gao	57b954e825	[SPARK-29730][SQL] ALTER VIEW QUERY should look up catalog/table like v2 commands Add AlterViewAsStatement and make ALTER VIEW ... QUERY go through the same catalog/table resolution framework of v2 commands. It's important to make all the commands have the same table resolution behavior, to avoid confusing end-users. e.g. ``` USE my_catalog DESC v // success and describe the view v from my_catalog ALTER VIEW v SELECT 1 // report view not found as there is no view v in the session catalog ``` Yes. When running ALTER VIEW ... QUERY, Spark fails the command if the current catalog is set to a v2 catalog, or the view name specified a v2 catalog. unit tests Closes #26453 from huaxingao/spark-29730. Authored-by: Huaxin Gao <huaxing@us.ibm.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com>	2019-11-09 17:06:09 -08:00
Luca Canali	2888009d66	[SPARK-29654][CORE] Add configuration to allow disabling registration of static sources to the metrics system ### What changes were proposed in this pull request? The Spark metrics system produces many different metrics and not all of them are used at the same time. This proposes to introduce a configuration parameter to allow disabling the registration of metrics in the "static sources" category. ### Why are the changes needed? This allows to reduce the load and clutter on the sink, in the cases when the metrics in question are not needed. The metrics registerd as "static sources" are under the namespaces CodeGenerator and HiveExternalCatalog and can produce a significant amount of data, as they are registered for the driver and executors. ### Does this PR introduce any user-facing change? It introduces a new configuration parameter `spark.metrics.register.static.sources.enabled` ### How was this patch tested? Manually tested. ``` $ cat conf/metrics.properties .sink.prometheusServlet.class=org.apache.spark.metrics.sink.PrometheusServlet .sink.prometheusServlet.path=/metrics/prometheus master.sink.prometheusServlet.path=/metrics/master/prometheus applications.sink.prometheusServlet.path=/metrics/applications/prometheus $ bin/spark-shell $ curl -s http://localhost:4040/metrics/prometheus/ \| grep Hive metrics_local_1573330115306_driver_HiveExternalCatalog_fileCacheHits_Count 0 metrics_local_1573330115306_driver_HiveExternalCatalog_filesDiscovered_Count 0 metrics_local_1573330115306_driver_HiveExternalCatalog_hiveClientCalls_Count 0 metrics_local_1573330115306_driver_HiveExternalCatalog_parallelListingJobCount_Count 0 metrics_local_1573330115306_driver_HiveExternalCatalog_partitionsFetched_Count 0 $ bin/spark-shell --conf spark.metrics.static.sources.enabled=false $ curl -s http://localhost:4040/metrics/prometheus/ \| grep Hive ``` Closes #26320 from LucaCanali/addConfigRegisterStaticMetrics. Authored-by: Luca Canali <luca.canali@cern.ch> Signed-off-by: Dongjoon Hyun <dhyun@apple.com>	2019-11-09 12:13:13 -08:00
Sean Owen	4d9c36d5ba	[SPARK-29795][CORE] Explicitly clear registered metrics on MetricSystem shutdown ### What changes were proposed in this pull request? Explicitly clear registered metrics when `MetricsSystem` shuts down. ### Why are the changes needed? See https://issues.apache.org/jira/browse/SPARK-29795 for a complete explanation. The TL;DR is there is some evidence this could leak resources after Spark is shut down, and that may be a minor issue in Spark 3+ for apps or tests that re-start SparkContexts in the same JVM. ### Does this PR introduce any user-facing change? The possible difference here is that, after Spark is stopped, metrics are no longer available. It's unclear to me whether this is intended behavior anyway. ### How was this patch tested? See https://issues.apache.org/jira/browse/SPARK-29795 for more context: - Spark 3 already passes tests without this change - Spark 2.4 does too, as exists in branch-2.4 now - Spark 2.4 fails tests if metrics 4.x is used, without this change The last point is not directly relevant, as Spark 2.4 will not use metrics 4.x. It's evidence that it addresses some potential issue, however. Closes #26427 from srowen/SPARK-29795. Authored-by: Sean Owen <sean.owen@databricks.com> Signed-off-by: Sean Owen <sean.owen@databricks.com>	2019-11-09 08:47:05 -06:00
Gabor Somogyi	12598e1b93	[MINOR] FsHistoryProvider import cleanup ### What changes were proposed in this pull request? As it has been discussed in https://github.com/apache/spark/pull/26397#discussion_r343726691 `FsHistoryProvider` import section has to be cleaned up. ### Why are the changes needed? Unused imports. ### Does this PR introduce any user-facing change? No. ### How was this patch tested? Existing unit tests. Closes #26436 from gaborgsomogyi/SPARK-29755. Authored-by: Gabor Somogyi <gabor.g.somogyi@gmail.com> Signed-off-by: Sean Owen <sean.owen@databricks.com>	2019-11-09 08:40:56 -06:00
Jobit Mathew	1e408d6fe6	[SPARK-29788][DOC] Fix the typos in the SQL reference documents ### What changes were proposed in this pull request? Fixing the typos in SQL reference document. ### Why are the changes needed? For user readability ### Does this PR introduce any user-facing change? No ### How was this patch tested? Tested manually. Closes #26424 from jobitmathew/typo. Authored-by: Jobit Mathew <jobit.mathew@huawei.com> Signed-off-by: Sean Owen <sean.owen@databricks.com>	2019-11-09 08:04:14 -06:00
Xiao Li	1e2d76e80a	[HOT-FIX] Fix the SQLBase.g4 ### What changes were proposed in this pull request? Remove the duplicate code See the build failure: https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Compile/job/spark-master-compile-maven-hadoop-3.2/986/ ### Why are the changes needed? Fix the compilation ### Does this PR introduce any user-facing change? No ### How was this patch tested? The existing tests Closes #26445 from gatorsmile/hotfixPraser. Authored-by: Xiao Li <gatorsmile@gmail.com> Signed-off-by: Xiao Li <gatorsmile@gmail.com>	2019-11-08 22:39:07 -08:00
xy_xin	7cfd589868	[SPARK-28893][SQL] Support MERGE INTO in the parser and add the corresponding logical plan ### What changes were proposed in this pull request? This PR supports MERGE INTO in the parser and add the corresponding logical plan. The SQL syntax likes, ``` MERGE INTO [ds_catalog.][multi_part_namespaces.]target_table [AS target_alias] USING [ds_catalog.][multi_part_namespaces.]source_table \| subquery [AS source_alias] ON <merge_condition> [ WHEN MATCHED [ AND <condition> ] THEN <matched_action> ] [ WHEN MATCHED [ AND <condition> ] THEN <matched_action> ] [ WHEN NOT MATCHED [ AND <condition> ] THEN <not_matched_action> ] ``` where ``` <matched_action> = DELETE \| UPDATE SET * \| UPDATE SET column1 = value1 [, column2 = value2 ...] <not_matched_action> = INSERT * \| INSERT (column1 [, column2 ...]) VALUES (value1 [, value2 ...]) ``` ### Why are the changes needed? This is a start work for introduce `MERGE INTO` support for the builtin datasource, and the design work for the `MERGE INTO` support in DSV2. ### Does this PR introduce any user-facing change? No. ### How was this patch tested? New test cases. Closes #26167 from xianyinxin/SPARK-28893. Authored-by: xy_xin <xianyin.xxy@alibaba-inc.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2019-11-09 11:45:24 +08:00
Bago Amirbekian	8152a87235	[SPARK-28978][ ] Support > 256 args to python udf ### What changes were proposed in this pull request? On the worker we express lambda functions as strings and then eval them to create a "mapper" function. This make the code hard to read & limits the # of arguments a udf can support to 256 for python <= 3.6. This PR rewrites the mapper functions as nested functions instead of "lambda strings" and allows passing in more than 255 args. ### Why are the changes needed? The jira ticket associated with this issue describes how MLflow uses udfs to consume columns as features. This pattern isn't unique and a limit of 255 features is quite low. ### Does this PR introduce any user-facing change? Users can now pass more than 255 cols to a udf function. ### How was this patch tested? Added a unit test for passing in > 255 args to udf. Closes #26442 from MrBago/replace-lambdas-on-worker. Authored-by: Bago Amirbekian <bago@databricks.com> Signed-off-by: Xiangrui Meng <meng@databricks.com>	2019-11-08 19:19:14 -08:00
Liang-Chi Hsieh	70987d8144	[SPARK-29680][SQL][FOLLOWUP] Replace qualifiedName with multipartIdentifier ### What changes were proposed in this pull request? Replace qualifiedName with multipartIdentifier in parser rules of DDL commands. ### Why are the changes needed? There are identifiers in some DDL rules we use `qualifiedName`. We should use `multipartIdentifier` because it can capture wrong identifiers such as `test-table`, `test-col`. ### Does this PR introduce any user-facing change? Yes. Wrong identifiers such as test-table, will be captured now after this change. ### How was this patch tested? Unit tests. Closes #26419 from viirya/SPARK-29680-followup2. Lead-authored-by: Liang-Chi Hsieh <viirya@gmail.com> Co-authored-by: Liang-Chi Hsieh <liangchi@uber.com> Signed-off-by: Liang-Chi Hsieh <liangchi@uber.com>	2019-11-08 14:18:06 -08:00
HyukjinKwon	7fc9db0853	[SPARK-29798][PYTHON][SQL] Infers bytes as binary type in createDataFrame in Python 3 at PySpark ### What changes were proposed in this pull request? This PR proposes to infer bytes as binary types in Python 3. See https://github.com/apache/spark/pull/25749 for discussions. I have also checked that Arrow considers `bytes` as binary type, and PySpark UDF can also accepts `bytes` as a binary type. Since `bytes` is not a `str` anymore in Python 3, it's clear to call it `BinaryType` in Python 3. ### Why are the changes needed? To respect Python 3's `bytes` type and support Python's primitive types. ### Does this PR introduce any user-facing change? Yes. Before: ```python >>> spark.createDataFrame([[b"abc"]]) Traceback (most recent call last): File "/.../spark/python/pyspark/sql/types.py", line 1036, in _infer_type return _infer_schema(obj) File "/.../spark/python/pyspark/sql/types.py", line 1062, in _infer_schema raise TypeError("Can not infer schema for type: %s" % type(row)) TypeError: Can not infer schema for type: <class 'bytes'> During handling of the above exception, another exception occurred: Traceback (most recent call last): File "<stdin>", line 1, in <module> File "/.../spark/python/pyspark/sql/session.py", line 787, in createDataFrame rdd, schema = self._createFromLocal(map(prepare, data), schema) File "/.../spark/python/pyspark/sql/session.py", line 445, in _createFromLocal struct = self._inferSchemaFromList(data, names=schema) File "/.../spark/python/pyspark/sql/session.py", line 377, in _inferSchemaFromList schema = reduce(_merge_type, (_infer_schema(row, names) for row in data)) File "/.../spark/python/pyspark/sql/session.py", line 377, in <genexpr> schema = reduce(_merge_type, (_infer_schema(row, names) for row in data)) File "/.../spark/python/pyspark/sql/types.py", line 1064, in _infer_schema fields = [StructField(k, _infer_type(v), True) for k, v in items] File "/.../spark/python/pyspark/sql/types.py", line 1064, in <listcomp> fields = [StructField(k, _infer_type(v), True) for k, v in items] File "/.../spark/python/pyspark/sql/types.py", line 1038, in _infer_type raise TypeError("not supported type: %s" % type(obj)) TypeError: not supported type: <class 'bytes'> ``` After: ```python >>> spark.createDataFrame([[b"abc"]]) DataFrame[_1: binary] ``` ### How was this patch tested? Unittest was added and manually tested. Closes #26432 from HyukjinKwon/SPARK-29798. Authored-by: HyukjinKwon <gurwls223@apache.org> Signed-off-by: Bryan Cutler <cutlerb@gmail.com>	2019-11-08 12:10:39 -08:00
Emil Sandstø	0bdadba5e3	[SPARK-29790][DOC] Note required port for Kube API It adds a note about the required port of a master url in Kubernetes. Currently a port needs to be specified for the Kubernetes API. Also in case the API is hosted on the HTTPS port. Else the driver might fail with https://medium.com/kidane.weldemariam_75349/thanks-james-on-issuing-spark-submit-i-run-into-this-error-cc507d4f8f0d Yes, a change to the "Running on Kubernetes" guide. None - Documentation change Closes #26426 from Tapped/patch-1. Authored-by: Emil Sandstø <emilalexer@hotmail.com> Signed-off-by: Marcelo Vanzin <vanzin@cloudera.com>	2019-11-08 09:33:07 -08:00
Kent Yao	e026412d9c	[SPARK-29679][SQL] Make interval type comparable and orderable ### What changes were proposed in this pull request? interval type support >, >=, <, <=, =, <=>, order by, min,max.. ### Why are the changes needed? Part of SPARK-27764 Feature Parity between PostgreSQL and Spark ### Does this PR introduce any user-facing change? yes, we now support compare intervals ### How was this patch tested? add ut Closes #26337 from yaooqinn/SPARK-29679. Authored-by: Kent Yao <yaooqinn@hotmail.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2019-11-08 22:45:11 +08:00
Kent Yao	e7f7990bc3	[SPARK-29688][SQL] Support average for interval type values ### What changes were proposed in this pull request? avg aggregate support interval type values ### Why are the changes needed? Part of SPARK-27764 Feature Parity between PostgreSQL and Spark ### Does this PR introduce any user-facing change? yes, we can do avg on intervals ### How was this patch tested? add ut Closes #26347 from yaooqinn/SPARK-29688. Authored-by: Kent Yao <yaooqinn@hotmail.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2019-11-08 21:55:07 +08:00
davidvrba	afc943ff8a	[SPARK-28477][SQL] Rewrite CaseWhen with single branch to If ### What changes were proposed in this pull request? Spark org.apache.spark.sql.functions do not have `if` function so conditions are expressed using `when-otherwise` function. However `If` (which is available in SQL) has more efficient code gen. This pr rewrites `when-otherwise` conditions to `If` if it is possible (`when-otherwise` with single branch) ### Why are the changes needed? It is an optimization enhancement. Here is a simple performance comparison (tested in local mode (with 4 cores)): ``` val df = spark.range(10000000000L).withColumn("x", rand) val resultA = df.withColumn("r", when($"x" < 0.5, lit(1)).otherwise(lit(0))).agg(sum($"r")) val resultB = df.withColumn("r", expr("if(x < 0.5, 1, 0)")).agg(sum($"r")) resultA.collect() // takes 56s to finish resultB.collect() // takes 30s to finish ``` ### Does this PR introduce any user-facing change? No ### How was this patch tested? New test is added. Closes #26294 from davidvrba/spark-28477_rewriteCaseWhenToIf. Authored-by: davidvrba <vrba.dave@gmail.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2019-11-08 21:25:48 +08:00
zhengruifeng	d1cb98d70a	[SPARK-29756][ML] CountVectorizer forget to unpersist intermediate rdd ### What changes were proposed in this pull request? 1,unpersist intermediate rdd `wordCounts` 2,if the `dataset` is already persisted, we do not need to persist rdd `input` 3,if both `minDF`&`maxDF` are gteq or lt than 1, we can compare & check them af first. ### Why are the changes needed? we should unpersit unused rdd ASAP ### Does this PR introduce any user-facing change? no ### How was this patch tested? existing testsuites Closes #26398 from zhengruifeng/CountVectorizer_unpersist_wordCounts. Authored-by: zhengruifeng <ruifengz@foxmail.com> Signed-off-by: zhengruifeng <ruifengz@foxmail.com>	2019-11-08 18:31:51 +08:00
ulysses	7759f7179c	[SPARK-29772][TESTS][SQL] Add withNamespace in SQLTestUtils ### What changes were proposed in this pull request? V2 catalog support namespace, we should add `withNamespace` like `withDatabase`. ### Why are the changes needed? Make test easy. ### Does this PR introduce any user-facing change? No. ### How was this patch tested? Add UT. Closes #26411 from ulysses-you/Add-test-with-namespace. Authored-by: ulysses <youxiduo@weidian.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2019-11-08 11:53:44 +08:00
Kent Yao	0a03839366	[SPARK-29787][SQL] Move methods add/subtract/negate from CalendarInterval to IntervalUtils ### What changes were proposed in this pull request? Move method add/subtract/negate from CalendarInterval to IntervalUtils ### Why are the changes needed? https://github.com/apache/spark/pull/26410#discussion_r343125468 suggested here ### Does this PR introduce any user-facing change? no ### How was this patch tested? add uts and move some Closes #26423 from yaooqinn/SPARK-29787. Authored-by: Kent Yao <yaooqinn@hotmail.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2019-11-08 10:28:58 +08:00
Gabor Somogyi	3641c3dd69	[SPARK-21869][SS] Apply Apache Commons Pool to Kafka producer ### What changes were proposed in this pull request? Kafka producers are now closed when `spark.kafka.producer.cache.timeout` reached which could be significant problem when processing big SQL queries. The workaround was to increase `spark.kafka.producer.cache.timeout` to a number where the biggest SQL query can be finished. In this PR I've adapted similar solution which already exists on the consumer side, namely applies Apache Commons Pool on the producer side as well. Main advantages choosing this solution: * Producers are not closed until they're in use * No manual reference counting needed (which may be error prone) * Thread-safe by design * Provides jmx connection to the pool where metrics can be fetched What this PR contains: * Introduced producer side parameters to configure pool * Renamed `InternalKafkaConsumerPool` to `InternalKafkaConnectorPool` and made it abstract * Created 2 implementations from it: `InternalKafkaConsumerPool` and `InternalKafkaProducerPool` * Adapted `CachedKafkaProducer` to use `InternalKafkaProducerPool` * Changed `KafkaDataWriter` and `KafkaDataWriteTask` to release producer even in failure scenario * Added several new tests * Extended `KafkaTest` to clear not only producers but consumers as well * Renamed `InternalKafkaConsumerPoolSuite` to `InternalKafkaConnectorPoolSuite` where only consumer tests are checking the behavior (please see comment for reasoning) What this PR not yet contains(but intended when the main concept is stable): * User facing documentation ### Why are the changes needed? Kafka producer closed after 10 minutes (with default settings). ### Does this PR introduce any user-facing change? No. ### How was this patch tested? Existing + additional unit tests. Cluster tests being started. Closes #25853 from gaborgsomogyi/SPARK-21869. Authored-by: Gabor Somogyi <gabor.g.somogyi@gmail.com> Signed-off-by: Marcelo Vanzin <vanzin@cloudera.com>	2019-11-07 17:06:32 -08:00
HyukjinKwon	4ec04e5ef3	[SPARK-22340][PYTHON] Add a mode to pin Python thread into JVM's ## What changes were proposed in this pull request? This PR proposes to add Single threading model design (pinned thread model) mode which is an experimental mode to sync threads on PVM and JVM. See https://www.py4j.org/advanced_topics.html#using-single-threading-model-pinned-thread ### Multi threading model Currently, PySpark uses this model. Threads on PVM and JVM are independent. For instance, in a different Python thread, callbacks are received and relevant Python codes are executed. JVM threads are reused when possible. Py4J will create a new thread every time a command is received and there is no thread available. See the current model we're using - https://www.py4j.org/advanced_topics.html#the-multi-threading-model One problem in this model is that we can't sync threads on PVM and JVM out of the box. This leads to some problems in particular at some codes related to threading in JVM side. See: `7056e004ee/core/src/main/scala/org/apache/spark/SparkContext.scala (L334)` Due to reusing JVM threads, seems the job groups in Python threads cannot be set in each thread as described in the JIRA. ### Single threading model design (pinned thread model) This mode pins and syncs the threads on PVM and JVM to work around the problem above. For instance, in the same Python thread, callbacks are received and relevant Python codes are executed. See https://www.py4j.org/advanced_topics.html#the-single-threading-model Even though this mode can sync threads on PVM and JVM for other thread related code paths, this might cause another problem: seems unable to inherit properties as below (assuming multi-thread mode still creates new threads when existing threads are busy, I suspect this issue already exists when multiple jobs are submitted in multi-thread mode; however, it can be always seen in single threading mode): ```bash $ PYSPARK_PIN_THREAD=true ./bin/pyspark ``` ```python import threading spark.sparkContext.setLocalProperty("a", "hi") def print_prop(): print(spark.sparkContext.getLocalProperty("a")) threading.Thread(target=print_prop).start() ``` ``` None ``` Unlike Scala side: ```scala spark.sparkContext.setLocalProperty("a", "hi") new Thread(new Runnable { def run() = println(spark.sparkContext.getLocalProperty("a")) }).start() ``` ``` hi ``` This behaviour potentially could cause weird issues but this PR currently does not target this fix this for now since this mode is experimental. ### How does this PR fix? Basically there are two types of Py4J servers `GatewayServer` and `ClientServer`. The former is for multi threading and the latter is for single threading. This PR adds a switch to use the latter. In Scala side: The logic to select a server is encapsulated in `Py4JServer` and use `Py4JServer` at `PythonRunner` for Spark summit and `PythonGatewayServer` for Spark shell. Each uses `ClientServer` when `PYSPARK_PIN_THREAD` is `true` and `GatewayServer` otherwise. In Python side: Simply do an if-else to switch the server to talk. It uses `ClientServer` when `PYSPARK_PIN_THREAD` is `true` and `GatewayServer` otherwise. This is disabled by default for now. ## How was this patch tested? Manually tested. This can be tested via: ```python PYSPARK_PIN_THREAD=true ./bin/pyspark ``` and/or ```bash cd python ./run-tests --python-executables=python --testnames "pyspark.tests.test_pin_thread" ``` Also, ran the Jenkins tests with `PYSPARK_PIN_THREAD` enabled. Closes #24898 from HyukjinKwon/pinned-thread. Authored-by: HyukjinKwon <gurwls223@apache.org> Signed-off-by: HyukjinKwon <gurwls223@apache.org>	2019-11-08 06:44:58 +09:00

... 9 10 11 12 13 ...

26198 commits