ODIn/spark-instrumented-optimizer

Author	SHA1	Message	Date
Kousuke Saruta	04f04e0ea7	[SPARK-31420][WEBUI] Infinite timeline redraw in job details page ### What changes were proposed in this pull request? Upgrade vis.js to fix an infinite re-drawing issue. As reported here, old releases of vis.js have that issue. Fortunately, the latest version seems to resolve the issue. With the latest release of vis.js, there are some performance issues with the original `timeline-view.js` and `timeline-view.css` so I also changed them. ### Why are the changes needed? For better UX. ### Does this PR introduce any user-facing change? No. Appearance and functionality are not changed. ### How was this patch tested? I confirmed that infinite redrawing no longer happens on a JobPage where I had reproduced the issue. With the original version of vis.js, I reproduced the issue under the following conditions. * Use history server and load core/src/test/resources/spark-events. * Visit the JobPage for job2 in application_1553914137147_0018. * Zoom out to 80% on Safari / Chrome / Firefox. It may depend on the OS and the browser version. Closes #28192 from sarutak/upgrade-visjs. Authored-by: Kousuke Saruta <sarutak@oss.nttdata.com> Signed-off-by: Gengliang Wang <gengliang.wang@databricks.com>	2020-04-13 23:23:00 -07:00
yi.wu	5d4f5d36a2	[SPARK-30953][SQL] InsertAdaptiveSparkPlan should apply AQE on child plan of write commands ### What changes were proposed in this pull request? This PR changes `InsertAdaptiveSparkPlan` to apply AQE on the child plan of V1/V2 write commands rather than the command itself. ### Why are the changes needed? Applying AQE to a write command together with its child plan exposes `LogicalQueryStage` to the `Analyzer`, whereas it should stay hidden under `AdaptiveSparkPlanExec` to avoid unexpected breakage. ### Does this PR introduce any user-facing change? No. ### How was this patch tested? Pass Jenkins. Closes #27701 from Ngone51/skip_v2_commands. Authored-by: yi.wu <yi.wu@databricks.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2020-04-14 05:18:58 +00:00
Takuya UESHIN	87be3641eb	[SPARK-31441] Support duplicated column names for toPandas with arrow execution ### What changes were proposed in this pull request? This PR is adding support duplicated column names for `toPandas` with Arrow execution. ### Why are the changes needed? When we execute `toPandas()` with Arrow execution, it fails if the column names have duplicates. ```py >>> spark.sql("select 1 v, 1 v").toPandas() Traceback (most recent call last): File "<stdin>", line 1, in <module> File "/path/to/lib/python3.7/site-packages/pyspark/sql/dataframe.py", line 2132, in toPandas pdf = table.to_pandas() File "pyarrow/array.pxi", line 441, in pyarrow.lib._PandasConvertible.to_pandas File "pyarrow/table.pxi", line 1367, in pyarrow.lib.Table._to_pandas File "/path/to/lib/python3.7/site-packages/pyarrow/pandas_compat.py", line 653, in table_to_blockmanager columns = _deserialize_column_index(table, all_columns, column_indexes) File "/path/to/lib/python3.7/site-packages/pyarrow/pandas_compat.py", line 704, in _deserialize_column_index columns = _flatten_single_level_multiindex(columns) File "/path/to/lib/python3.7/site-packages/pyarrow/pandas_compat.py", line 937, in _flatten_single_level_multiindex raise ValueError('Found non-unique column index') ValueError: Found non-unique column index ``` ### Does this PR introduce any user-facing change? Yes, previously we will face an error above, but after this PR, we will see the result: ```py >>> spark.sql("select 1 v, 1 v").toPandas() v v 0 1 1 ``` ### How was this patch tested? Added and modified related tests. Closes #28210 from ueshin/issues/SPARK-31441/to_pandas. Authored-by: Takuya UESHIN <ueshin@databricks.com> Signed-off-by: HyukjinKwon <gurwls223@apache.org>	2020-04-14 14:08:56 +09:00
Max Gekk	a0f8cc08a3	[SPARK-31426][SQL] Fix perf regressions of toJavaTimestamp/fromJavaTimestamp ### What changes were proposed in this pull request? Reuse the `rebaseGregorianToJulianMicros()` and `rebaseJulianToGregorianMicros()` functions introduced by the PR #28119 in `DateTimeUtils`.`toJavaTimestamp()` and `fromJavaTimestamp()`. Actually, new implementation is derived from Spark 2.4 + rebasing via pre-calculated rebasing maps. ### Why are the changes needed? The changes speed up conversions to/from java.sql.Timestamp, and as a consequence the PR improve performance of ORC datasource in loading/saving timestamps: - Saving ~ x2.8 faster in master, and -11% against Spark 2.4.6 - Loading - x3.2-4.5 faster in master, -5% against Spark 2.4.6 Before: ``` Save timestamps to ORC: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative ------------------------------------------------------------------------------------------------------------------------ after 1582 59877 59877 0 1.7 598.8 0.0X before 1582 61361 61361 0 1.6 613.6 0.0X Load timestamps from ORC: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative ------------------------------------------------------------------------------------------------------------------------ after 1582, vec off 48197 48288 118 2.1 482.0 1.0X after 1582, vec on 38247 38351 128 2.6 382.5 1.3X before 1582, vec off 53179 53359 249 1.9 531.8 0.9X before 1582, vec on 44076 44268 269 2.3 440.8 1.1X ``` After: ``` Save timestamps to ORC: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative ------------------------------------------------------------------------------------------------------------------------ after 1582 21250 21250 0 4.7 212.5 0.1X before 1582 22105 22105 0 4.5 221.0 0.1X Load timestamps from ORC: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative ------------------------------------------------------------------------------------------------------------------------ 
after 1582, vec off 14903 14933 40 6.7 149.0 1.0X after 1582, vec on 8342 8426 73 12.0 83.4 1.8X before 1582, vec off 15528 15575 76 6.4 155.3 1.0X before 1582, vec on 9025 9075 61 11.1 90.2 1.7X ``` Spark 2.4.6-SNAPSHOT: ``` Save timestamps to ORC: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative ------------------------------------------------------------------------------------------------------------------------ after 1582 18858 18858 0 5.3 188.6 1.0X before 1582 18508 18508 0 5.4 185.1 1.0X Load timestamps from ORC: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative ------------------------------------------------------------------------------------------------------------------------ after 1582, vec off 14063 14177 143 7.1 140.6 1.0X after 1582, vec on 5955 6029 100 16.8 59.5 2.4X before 1582, vec off 14119 14126 7 7.1 141.2 1.0X before 1582, vec on 5991 6007 25 16.7 59.9 2.3X ``` ### Does this PR introduce any user-facing change? Yes, the `to_utc_timestamp` function returns the later local timestamp in the case of overlapping local timestamps at daylight saving time. it's changed back to the 2.4 behavior. ### How was this patch tested? - By existing test suite `DateTimeUtilsSuite`, `RebaseDateTimeSuite`, `DateFunctionsSuite`, `DateExpressionsSuites`, `ParquetIOSuite`, `OrcHadoopFsRelationSuite`. - Re-generating results of the benchmarks `DateTimeBenchmark` and `DateTimeRebaseBenchmark` in the environment: \| Item \| Description \| \| ---- \| ----\| \| Region \| us-west-2 (Oregon) \| \| Instance \| r3.xlarge \| \| AMI \| ubuntu/images/hvm-ssd/ubuntu-bionic-18.04-amd64-server-20190722.1 (ami-06f2f779464715dc5) \| \| Java \| OpenJDK 64-Bit Server VM 1.8.0_242 and OpenJDK 64-Bit Server VM 11.0.6+10 \| Closes #28189 from MaxGekk/optimize-to-from-java-timestamp. Authored-by: Max Gekk <max.gekk@gmail.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2020-04-14 04:50:20 +00:00
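The speedup figures quoted in the commit message above (SPARK-31426) can be re-derived from the "Best Time(ms)" columns of its benchmark tables. The following is only a sanity-check sketch: the numbers are copied from the "after 1582" rows, while the helper names `speedup` and `rate_change` are illustrative and not part of Spark.

```python
# Best Time (ms) values copied from the "after 1582" rows of the tables above.
SAVE_MS = {"before_fix": 59877, "after_fix": 21250, "spark_2_4_6": 18858}
LOAD_VEC_OFF_MS = {"before_fix": 48197, "after_fix": 14903, "spark_2_4_6": 14063}
LOAD_VEC_ON_MS = {"before_fix": 38247, "after_fix": 8342, "spark_2_4_6": 5955}


def speedup(old_ms, new_ms):
    """How many times faster the new run is than the old one."""
    return old_ms / new_ms


def rate_change(new_ms, reference_ms):
    """Throughput of the new run relative to a reference (negative = slower)."""
    return reference_ms / new_ms - 1.0


save_speedup = speedup(SAVE_MS["before_fix"], SAVE_MS["after_fix"])
load_speedup_vec_off = speedup(LOAD_VEC_OFF_MS["before_fix"], LOAD_VEC_OFF_MS["after_fix"])
load_speedup_vec_on = speedup(LOAD_VEC_ON_MS["before_fix"], LOAD_VEC_ON_MS["after_fix"])
save_vs_246 = rate_change(SAVE_MS["after_fix"], SAVE_MS["spark_2_4_6"])
load_vs_246 = rate_change(LOAD_VEC_OFF_MS["after_fix"], LOAD_VEC_OFF_MS["spark_2_4_6"])

print(f"save speedup: x{save_speedup:.2f}")
print(f"load speedup: x{load_speedup_vec_off:.2f} (vec off), x{load_speedup_vec_on:.2f} (vec on)")
print(f"vs 2.4.6: save {save_vs_246:+.1%}, load (vec off) {load_vs_246:+.1%}")
```

Under these inputs the saving ratio comes out near the quoted x2.8 and the non-vectorized loading ratio near x3.2; the comparison against Spark 2.4.6 uses throughput, which is why the sign is negative.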
Huaxin Gao	46be1e01e9	[SPARK-31319][SQL][FOLLOW-UP] Add a SQL example for UDAF ### What changes were proposed in this pull request? Add a SQL example for UDAF ### Why are the changes needed? To make SQL Reference complete ### Does this PR introduce any user-facing change? Yes. Add the following page, also change ```Sql``` to ```SQL``` in the example tab for all the sql examples. <img width="1110" alt="Screen Shot 2020-04-13 at 6 09 24 PM" src="https://user-images.githubusercontent.com/13592258/79175240-06cd7400-7db2-11ea-8f3e-af71a591a64b.png"> ### How was this patch tested? Manually build and check Closes #28209 from huaxingao/udf_followup. Authored-by: Huaxin Gao <huaxing@us.ibm.com> Signed-off-by: HyukjinKwon <gurwls223@apache.org>	2020-04-14 13:29:44 +09:00
Kent Yao	31b907748d	[SPARK-31414][SQL][DOCS][FOLLOWUP] Update default datetime pattern for json/csv APIs documentations ### What changes were proposed in this pull request? Update the default datetime pattern from `yyyy-MM-dd'T'HH:mm:ss.SSSXXX` to `yyyy-MM-dd'T'HH:mm:ss[.SSS][XXX]` in the JSON/CSV API documentation. ### Why are the changes needed? Documentation fix. ### Does this PR introduce any user-facing change? Yes, the documentation will change. ### How was this patch tested? Passing Jenkins. Closes #28204 from yaooqinn/SPARK-31414-F. Authored-by: Kent Yao <yaooqinn@hotmail.com> Signed-off-by: HyukjinKwon <gurwls223@apache.org>	2020-04-14 10:25:37 +09:00
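The square brackets in `yyyy-MM-dd'T'HH:mm:ss[.SSS][XXX]` mark optional sections, so the fraction and the zone offset may each be present or absent. The sketch below is only a Python analogy of that idea, not Spark's parser: it expands the two optional sections into four `strptime` formats and tries them in turn. The function name `parse_lenient` is hypothetical, and Python's `%f` accepts one to six fractional digits, which is looser than `SSS`.

```python
from datetime import datetime

# The two optional sections expand into four concrete formats.
# Assumes Python 3.7+, where %z accepts "+09:00" and "Z".
_FORMATS = (
    "%Y-%m-%dT%H:%M:%S.%f%z",  # fraction and offset present
    "%Y-%m-%dT%H:%M:%S.%f",    # fraction only
    "%Y-%m-%dT%H:%M:%S%z",     # offset only
    "%Y-%m-%dT%H:%M:%S",       # neither
)


def parse_lenient(text):
    """Return a datetime for the first format that matches, else raise ValueError."""
    for fmt in _FORMATS:
        try:
            return datetime.strptime(text, fmt)
        except ValueError:
            continue
    raise ValueError(f"unsupported timestamp: {text!r}")


plain = parse_lenient("1970-01-01T01:02:03")
with_fraction = parse_lenient("1970-01-01T01:02:03.123")
with_both = parse_lenient("1970-01-01T01:02:03.123+09:00")
```

A strict pattern corresponds to keeping only the first entry of `_FORMATS`, which would reject the first two inputs.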
Takeshi Yamamuro	853c6c9909	[SPARK-31434][SQL][DOCS] Drop builtin function pages from SQL references ### What changes were proposed in this pull request? This PR intends to drop the built-in function pages from SQL references. We've already had a complete list of built-in functions in the API documents. See related discussions for more details: https://github.com/apache/spark/pull/28170#issuecomment-611917191 ### Why are the changes needed? For better SQL documents. ### Does this PR introduce any user-facing change? ![functions](https://user-images.githubusercontent.com/692303/79109009-793e5400-7db2-11ea-8cb7-4c3cf31ccb77.png) ### How was this patch tested? Manually checked. Closes #28203 from maropu/DropBuiltinFunctionDocs. Authored-by: Takeshi Yamamuro <yamamuro@apache.org> Signed-off-by: HyukjinKwon <gurwls223@apache.org>	2020-04-14 10:22:46 +09:00
Gengliang Wang	28e1a4fa93	[SPARK-31411][UI] Show submitted time and duration in job details page ### What changes were proposed in this pull request? Show submitted time and duration of a job in its details page ### Why are the changes needed? When we check job details from the SQL execution page, it will be more convenient if we can get the submission time and duration from the job page, instead of finding the info from job list page. ### Does this PR introduce any user-facing change? Yes. After changes, the job details page shows the submitted time and duration. ### How was this patch tested? Manual check ![image](https://user-images.githubusercontent.com/1097932/78974997-0a1de280-7ac8-11ea-8072-ce7a001b1b0c.png) Closes #28179 from gengliangwang/addSubmittedTimeAndDuration. Authored-by: Gengliang Wang <gengliang.wang@databricks.com> Signed-off-by: Gengliang Wang <gengliang.wang@databricks.com>	2020-04-13 17:12:26 -07:00
yi.wu	bbb3cd9c5e	[SPARK-31391][SQL][TEST] Add AdaptiveTestUtils to ease the test of AQE ### What changes were proposed in this pull request? This PR adds `AdaptiveTestUtils` to make AQE tests simpler. It includes: `DisableAdaptiveExecution` - a test tag to skip a single test case if AQE is enabled. `EnableAdaptiveExecutionSuite` - a helper trait to enable AQE for all tests except those tagged with `DisableAdaptiveExecution`. `DisableAdaptiveExecutionSuite` - a helper trait to disable AQE for all tests. `assertExceptionMessage` - a method to handle the message of a normal or AQE exception in a consistent way. `assertExceptionCause` - a method to handle the cause of a normal or AQE exception in a consistent way. ### Why are the changes needed? With these utilities, we can: - reduce duplicated code; - handle normal or AQE exceptions in a consistent way; - improve the stability of AQE tests. ### Does this PR introduce any user-facing change? No. ### How was this patch tested? Updated tests with the util. Closes #28162 from Ngone51/add_aqe_test_utils. Authored-by: yi.wu <yi.wu@databricks.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2020-04-13 14:40:53 +00:00
yi.wu	f6512903da	[SPARK-31409][SQL][TEST] Fix failed tests due to result order changing when enable AQE ### What changes were proposed in this pull request? This PR fixes two tests by preventing the result order from changing when AQE is enabled: 1. In `SQLQueryTestSuite`, disable BHJ optimization to avoid changing result order 2. In test `SQLQuerySuite#check outputs of expression examples`, disable `spark.sql.adaptive.coalescePartitions.enabled` to avoid changing result order ### Why are the changes needed? Query 147 in SQLQueryTestSuite#"udf/postgreSQL/udf-join.sql - Scala UDF" and the test sql/SQLQuerySuite#"check outputs of expression examples" can fail when AQE is enabled because the result order changes. This PR fixes them. ### Does this PR introduce any user-facing change? No. ### How was this patch tested? Tested manually with AQE enabled. Closes #28178 from Ngone51/fix_order. Authored-by: yi.wu <yi.wu@databricks.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2020-04-13 14:36:25 +00:00
yi.wu	4de8ae1a0f	[SPARK-31407][SQL][TEST] TestHiveQueryExecution should respect database when creating table ### What changes were proposed in this pull request? In `TestHiveQueryExecution`, if we detect a database in the referenced table, we should create the table under that database. ### Why are the changes needed? This fixes the test `Fix hive/SQLQuerySuite.derived from Hive query file: drop_database_removes_partition_dirs.q`, which currently passes only when run with the whole test suite but fails when run separately. ### Does this PR introduce any user-facing change? No. ### How was this patch tested? Run the test separately and together with the whole test suite. Closes #28177 from Ngone51/fix_derived. Authored-by: yi.wu <yi.wu@databricks.com> Signed-off-by: HyukjinKwon <gurwls223@apache.org>	2020-04-13 19:04:36 +09:00
HyukjinKwon	c519fe1358	[SPARK-31330][INFRA][FOLLOW-UP] Move sbin and some files into appropriate categories in autolabeller ### What changes were proposed in this pull request? This PR is a followup of `1b87015044`. PRs are now labelled automatically, and this seems to be working fine. This PR proposes to correct some minor entries and categories. 1. Move `sbin` from `CORE` into `DEPLOY` components. ``` $ ls sbin decommission-slave.sh start-all.sh start-slave.sh stop-master.sh stop-thriftserver.sh slaves.sh start-history-server.sh start-slaves.sh stop-mesos-dispatcher.sh spark-config.sh start-master.sh start-thriftserver.sh stop-mesos-shuffle-service.sh spark-daemon.sh start-mesos-dispatcher.sh stop-all.sh stop-slave.sh spark-daemons.sh start-mesos-shuffle-service.sh stop-history-server.sh stop-slaves.sh ``` 2. `/sbin/mesos.sh` -> `MESOS` `/bin/spark-shell*` -> `SPARK SHELL`. ### Why are the changes needed? To label PRs correctly, so that developers can take advantage of the labels, for example to check the PRs of a specific component. ### Does this PR introduce any user-facing change? No. ### How was this patch tested? It has not been tested yet. It can be tested after it is merged. Closes #28201 from HyukjinKwon/SPARK-31330. Authored-by: HyukjinKwon <gurwls223@apache.org> Signed-off-by: HyukjinKwon <gurwls223@apache.org>	2020-04-13 18:48:41 +09:00
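Path-based labelling of the kind described in the commit above can be pictured as a lookup from glob patterns to labels. The sketch below is an illustration only: the rule table, the exact patterns, and the function `labels_for` are assumptions made for this example, and `fnmatch` globbing is a simplification of whatever matching the real labeller applies.

```python
from fnmatch import fnmatch

# Hypothetical rule table (label -> glob patterns), loosely modelled on the
# categories named in the commit message; not the project's actual config.
RULES = {
    "DEPLOY": ["sbin/*"],
    "MESOS": ["sbin/*mesos*"],
    "SPARK SHELL": ["bin/spark-shell*"],
}


def labels_for(changed_paths):
    """Return the sorted labels whose patterns match any changed path."""
    labels = set()
    for label, patterns in RULES.items():
        if any(fnmatch(path, pattern)
               for path in changed_paths
               for pattern in patterns):
            labels.add(label)
    return sorted(labels)


print(labels_for(["sbin/start-all.sh", "bin/spark-shell.cmd"]))
```

In this sketch a path may carry more than one label, for example a Mesos script under `sbin` matches both the `DEPLOY` and the `MESOS` patterns.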
Max Gekk	cf63ad61f5	[SPARK-31402][SQL] Fix rebasing of BCE dates/timestamps ### What changes were proposed in this pull request? In the PR, I propose to fall back to rebasing via local dates/timestamps for days/micros before the common era (BCE). ### Why are the changes needed? It fixes the bug in rebasing BCE dates/timestamps. ### Does this PR introduce any user-facing change? Yes ### How was this patch tested? - By existing tests in `RebaseDateTimeSuite` and `DateTimeUtilsSuite` - Added tests for negative years to `RebaseDateTimeSuite` Closes #28172 from MaxGekk/fix-era-in-date-micros-rebasing. Authored-by: Max Gekk <max.gekk@gmail.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2020-04-13 06:07:31 +00:00
Max Gekk	cac8d1b352	[SPARK-31398][SQL] Fix perf regression of loading dates before 1582 year by non-vectorized ORC reader ### What changes were proposed in this pull request? In regular ORC reader when `spark.sql.orc.enableVectorizedReader` is set to `false`, I propose to use `DaysWritable` in reading DATE values from ORC files. Currently, days from ORC files are converted to java.sql.Date, and then to days in Proleptic Gregorian calendar. So, the conversion to Java type can be eliminated. ### Why are the changes needed? - The PR fixes regressions in loading dates before the 1582 year from ORC files by when vectorised ORC reader is off. - The changes improve performance of regular ORC reader for DATE columns. - x3.6 faster comparing to the current master - x1.9-x4.3 faster against Spark 2.4.6 Before (on JDK 8): ``` Load dates from ORC: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative ------------------------------------------------------------------------------------------------------------------------ after 1582, vec off 39651 39686 31 2.5 396.5 1.0X after 1582, vec on 3647 3660 13 27.4 36.5 10.9X before 1582, vec off 38155 38219 61 2.6 381.6 1.0X before 1582, vec on 4041 4046 6 24.7 40.4 9.8X ``` After (on JDK 8): ``` Load dates from ORC: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative ------------------------------------------------------------------------------------------------------------------------ after 1582, vec off 10947 10971 28 9.1 109.5 1.0X after 1582, vec on 3677 3702 36 27.2 36.8 3.0X before 1582, vec off 11456 11472 21 8.7 114.6 1.0X before 1582, vec on 4079 4103 21 24.5 40.8 2.7X ``` Spark 2.4.6: ``` Load dates from ORC: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative ------------------------------------------------------------------------------------------------------------------------ after 1582, vec off 48169 48276 96 2.1 481.7 1.0X after 1582, vec on 5375 5410 41 18.6 53.7 9.0X before 
1582, vec off 22353 22482 198 4.5 223.5 2.2X before 1582, vec on 5474 5475 1 18.3 54.7 8.8X ``` ### Does this PR introduce any user-facing change? No ### How was this patch tested? - By existing tests suites like `DateTimeUtilsSuite` - Checked for `hive-1.2` by: ``` ./build/sbt -Phive-1.2 "test:testOnly *OrcHadoopFsRelationSuite" ``` - Re-run `DateTimeRebaseBenchmark` in the environment: \| Item \| Description \| \| ---- \| ----\| \| Region \| us-west-2 (Oregon) \| \| Instance \| r3.xlarge \| \| AMI \| ubuntu/images/hvm-ssd/ubuntu-bionic-18.04-amd64-server-20190722.1 (ami-06f2f779464715dc5) \| \| Java \| OpenJDK 64-Bit Server VM 1.8.0_242 and OpenJDK 64-Bit Server VM 11.0.6+10 \| Closes #28169 from MaxGekk/orc-optimize-dates. Authored-by: Max Gekk <max.gekk@gmail.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2020-04-13 05:29:54 +00:00
Takeshi Yamamuro	179289f0bf	[SPARK-31383][SQL][DOC] Clean up the SQL documents in docs/sql-ref* ### What changes were proposed in this pull request? This PR intends to clean up the SQL documents in `doc/sql-ref`. Main changes are as follows; - Fixes wrong syntaxes and capitalize sub-titles - Adds some DDL queries in `Examples` so that users can run examples there - Makes query output in `Examples` follows the `Dataset.showString` (right-aligned) format - Adds/Removes spaces, Indents, or blank lines to follow the format below; ``` --- license... --- ### Description Writes what's the syntax is. ### Syntax {% highlight sql %} SELECT... WHERE... // 4 indents after the second line ... {% endhighlight %} ### Parameters <dl> <dt><code><em>Param Name</em></code></dt> <dd> Param Description </dd> ... </dl> ### Examples {% highlight sql %} -- It is better that users are able to execute example queries here. -- So, we prepare test data in the first section if possible. CREATE TABLE t (key STRING, value DOUBLE); INSERT INTO t VALUES ('a', 1.0), ('a', 2.0), ('b', 3.0), ('c', 4.0); -- query output has 2 indents and it follows the `Dataset.showString` -- format (right-aligned). SELECT FROM t; +---+-----+ \|key\|value\| +---+-----+ \| a\| 1.0\| \| a\| 2.0\| \| b\| 3.0\| \| c\| 4.0\| +---+-----+ -- Query statements after the second line have 4 indents. SELECT key, SUM(value) FROM t GROUP BY key; +---+----------+ \|key\|sum(value)\| +---+----------+ \| c\| 4.0\| \| b\| 3.0\| \| a\| 3.0\| +---+----------+ ... {% endhighlight %} ### Related Statements * [XXX](xxx.html) * ... ``` ### Why are the changes needed? The most changes of this PR are pretty minor, but I think the consistent formats/rules to write documents are important for long-term maintenance in our community ### Does this PR introduce any user-facing change? Yes. ### How was this patch tested? Manually checked. Closes #28151 from maropu/MakeRightAligned. 
Authored-by: Takeshi Yamamuro <yamamuro@apache.org> Signed-off-by: Sean Owen <srowen@gmail.com>	2020-04-12 23:40:36 -05:00
Huaxin Gao	310bef1ac7	[SPARK-31419][SQL][DOCS] Document Table-valued Function and Inline Table ### What changes were proposed in this pull request? Document Table-valued Function and Inline Table ### Why are the changes needed? To make SQL Reference complete ### Does this PR introduce any user-facing change? Yes [5 screenshots of the new pages] Also, linked the newly added pages to select statement [1 screenshot] ### How was this patch tested? Manually build and check Closes #28185 from huaxingao/tvf. Authored-by: Huaxin Gao <huaxing@us.ibm.com> Signed-off-by: Sean Owen <srowen@gmail.com>	2020-04-12 23:39:27 -05:00
Huaxin Gao	3bbd80dbc3	[SPARK-31319][SQL][DOCS] Document UDFs/UDAFs in SQL Reference ### What changes were proposed in this pull request? Document UDF in SQL Reference ### Why are the changes needed? To make SQL Reference complete. ### Does this PR introduce any user-facing change? Yes. Here are the new pages: [16 screenshots of the new pages] ### How was this patch tested? Manually build and check Closes #28087 from huaxingao/udf. Authored-by: Huaxin Gao <huaxing@us.ibm.com> Signed-off-by: Sean Owen <srowen@gmail.com>	2020-04-12 23:38:17 -05:00
Kent Yao	d65f534c5a	[SPARK-31414][SQL] Fix performance regression with new TimestampFormatter for json and csv time parsing ### What changes were proposed in this pull request? With benchmark original, where the timestamp values are valid to the new parser the result is ```scala [info] Running benchmark: Read dates and timestamps [info] Running case: timestamp strings [info] Stopped after 3 iterations, 5781 ms [info] Running case: parse timestamps from Dataset[String] [info] Stopped after 3 iterations, 44764 ms [info] Running case: infer timestamps from Dataset[String] [info] Stopped after 3 iterations, 93764 ms [info] Running case: from_json(timestamp) [info] Stopped after 3 iterations, 59021 ms ``` When we modify the benchmark to ```scala def timestampStr: Dataset[String] = { spark.range(0, rowsNum, 1, 1).mapPartitions { iter => iter.map(i => s"""{"timestamp":"1970-01-01T01:02:03.${i % 100}"}""") }.select($"value".as("timestamp")).as[String] } readBench.addCase("timestamp strings", numIters) { _ => timestampStr.noop() } readBench.addCase("parse timestamps from Dataset[String]", numIters) { _ => spark.read.schema(tsSchema).json(timestampStr).noop() } readBench.addCase("infer timestamps from Dataset[String]", numIters) { _ => spark.read.json(timestampStr).noop() } ``` where the timestamp values are invalid for the new parser which causes a fallback to legacy parser(2.4). 
the result is ```scala [info] Running benchmark: Read dates and timestamps [info] Running case: timestamp strings [info] Stopped after 3 iterations, 5623 ms [info] Running case: parse timestamps from Dataset[String] [info] Stopped after 3 iterations, 506637 ms [info] Running case: infer timestamps from Dataset[String] [info] Stopped after 3 iterations, 509076 ms ``` About 10x perf-regression BUT if we modify the timestamp pattern to `....HH:mm:ss[.SSS][XXX]` which make all timestamp values valid for the new parser to prohibit fallback, the result is ```scala [info] Running benchmark: Read dates and timestamps [info] Running case: timestamp strings [info] Stopped after 3 iterations, 5623 ms [info] Running case: parse timestamps from Dataset[String] [info] Stopped after 3 iterations, 506637 ms [info] Running case: infer timestamps from Dataset[String] [info] Stopped after 3 iterations, 509076 ms ``` ### Why are the changes needed? Fix performance regression. ### Does this PR introduce any user-facing change? NO ### How was this patch tested? new tests added. Closes #28181 from yaooqinn/SPARK-31414. Authored-by: Kent Yao <yaooqinn@hotmail.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2020-04-13 03:11:28 +00:00
Nicholas Chammas	1b87015044	[SPARK-31330] Automatically label PRs based on the paths they touch ### What changes were proposed in this pull request? This PR adds some rules that will be used by Probot Auto Labeler to label PRs based on what paths they modify. ### Why are the changes needed? This should make it easier for committers to organize PRs, and it could also help drive downstream tooling like the PR dashboard. ### Does this PR introduce any user-facing change? No. ### How was this patch tested? We'll only be able to test it, I believe, after merging it in. Given that [the Avro project is using this same bot already](https://github.com/apache/avro/blob/master/.github/autolabeler.yml), I expect it will be straightforward to get this working. Closes #28114 from nchammas/SPARK-31330-auto-label-prs. Lead-authored-by: Nicholas Chammas <nicholas.chammas@gmail.com> Co-authored-by: HyukjinKwon <gurwls223@apache.org> Co-authored-by: Nicholas Chammas <nicholas.chammas@liveramp.com> Signed-off-by: HyukjinKwon <gurwls223@apache.org>	2020-04-13 10:01:31 +09:00
Kousuke Saruta	6cd0bef7fe	[SPARK-31416][SQL] Check more strictly that a field name can be used as a valid Java identifier for codegen ### What changes were proposed in this pull request? Check more strictly that a field name can be used as a valid Java identifier in `ScalaReflection.serializerFor`. `SourceVersion` is used for the check, so reserved keywords introduced by future Java versions (e.g., underscore, var, yield) need not be added manually. ### Why are the changes needed? In the current implementation, `enum` is not checked even though it's a reserved keyword. Also, many characters and character sequences, including numeric literals, are not checked. So we can't get a good error message with the following code. ``` case class Data(`0`: Int) Seq(Data(1)).toDF.show 20/04/11 03:24:24 ERROR CodeGenerator: failed to compile: org.codehaus.commons.compiler.CompileException: File 'generated.java', Line 43, Column 1: Expression "value_0 = value_3" is not a type org.codehaus.commons.compiler.CompileException: File 'generated.java', Line 43, Column 1: Expression "value_0 = value_3" is not a type ... ``` ### Does this PR introduce any user-facing change? Yes. With this change and the code example above, we get the following error message. ``` java.lang.UnsupportedOperationException: `0` is not a valid identifier of Java and cannot be used as field name - root class: "Data" ... ``` ### How was this patch tested? Add another assertion to existing test case. Closes #28184 from sarutak/improve-identifier-check. Authored-by: Kousuke Saruta <sarutak@oss.nttdata.com> Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>	2020-04-12 13:14:41 -07:00
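The commit above delegates the identifier check to Java's `SourceVersion`. To show what such a check has to reject, here is a rough Python approximation; it is not the Spark code. The function name `is_valid_java_field_name` is hypothetical, the regular expression is ASCII-only (Java also admits many Unicode letters), and the keyword list is written out by hand, which is exactly the maintenance burden the commit avoids.

```python
import re

# Java reserved keywords plus the literals true/false/null.
# "_" has been a keyword since Java 9.
JAVA_RESERVED = frozenset("""
abstract assert boolean break byte case catch char class const continue
default do double else enum extends final finally float for goto if
implements import instanceof int interface long native new package private
protected public return short static strictfp super switch synchronized
this throw throws transient try void volatile while true false null _
""".split())

# Simplification: ASCII letters, digits, "_" and "$" only.
_IDENTIFIER = re.compile(r"[A-Za-z_$][A-Za-z0-9_$]*")


def is_valid_java_field_name(name):
    """True if name looks like a Java identifier and is not a reserved word."""
    return _IDENTIFIER.fullmatch(name) is not None and name not in JAVA_RESERVED


for candidate in ("value_0", "0", "enum", "_"):
    print(candidate, is_valid_java_field_name(candidate))
```

With this check, the field name `0` from the example above and the keyword `enum` would both be rejected before code generation, which is the point at which a clear error message can be produced.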
gatorsmile	ad79ae11ba	[SPARK-31424][SQL] Rename AdaptiveSparkPlanHelper.collectInPlanAndSubqueries to collectWithSubqueries ### What changes were proposed in this pull request? Like https://github.com/apache/spark/pull/28092, this PR is to rename `QueryPlan.collectInPlanAndSubqueries` in AdaptiveSparkPlanHelper to `collectWithSubqueries` ### Why are the changes needed? The old name is too verbose. `QueryPlan` is internal but it's the core of catalyst and we'd better make the API name clearer before we release it. ### Does this PR introduce any user-facing change? no ### How was this patch tested? N/A Closes #28193 from gatorsmile/spark-31322. Authored-by: gatorsmile <gatorsmile@gmail.com> Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>	2020-04-12 13:10:57 -07:00
Huaxin Gao	fda910d4e2	[SPARK-31348][SQL][DOCS] Document Join in SQL Reference ### What changes were proposed in this pull request? Document join in SQL Reference. ### Why are the changes needed? To make SQL Reference complete. ### Does this PR introduce any user-facing change? Yes [5 screenshots of the new page] ### How was this patch tested? Manually build and check. Closes #28121 from huaxingao/join. Authored-by: Huaxin Gao <huaxing@us.ibm.com> Signed-off-by: Sean Owen <srowen@gmail.com>	2020-04-12 13:57:54 -05:00
Dongjoon Hyun	a6e6fbf2ca	[SPARK-31422][CORE] Fix NPE when BlockManagerSource is used after BlockManagerMaster stops ### What changes were proposed in this pull request? This PR (SPARK-31422) aims to return empty result in order to avoid `NullPointerException` at `getStorageStatus` and `getMemoryStatus` which happens after `BlockManagerMaster` stops. The empty result is consistent with the current status of `SparkContext` because `BlockManager` and `BlockManagerMaster` is already stopped. ### Why are the changes needed? In `SparkEnv.stop`, the following stop sequence is used and `metricsSystem.stop` invokes `sink.stop`. ``` blockManager.master.stop() metricsSystem.stop() --> sinks.foreach(_.stop) ``` However, some sink can invoke `BlockManagerSource` and ends up with `NullPointerException` because `BlockManagerMaster` is already stopped and `driverEndpoint` became `null`. ``` java.lang.NullPointerException at org.apache.spark.storage.BlockManagerMaster.getStorageStatus(BlockManagerMaster.scala:170) at org.apache.spark.storage.BlockManagerSource$$anonfun$10.apply(BlockManagerSource.scala:63) at org.apache.spark.storage.BlockManagerSource$$anonfun$10.apply(BlockManagerSource.scala:63) at org.apache.spark.storage.BlockManagerSource$$anon$1.getValue(BlockManagerSource.scala:31) at org.apache.spark.storage.BlockManagerSource$$anon$1.getValue(BlockManagerSource.scala:30) ``` Since `SparkContext` registers and forgets `BlockManagerSource` without deregistering, we had better avoid `NullPointerException` inside `BlockManagerMaster` preventively. ```scala _env.metricsSystem.registerSource(new BlockManagerSource(_env.blockManager)) ``` ### Does this PR introduce any user-facing change? Yes. This will remove NPE for the users who uses `BlockManagerSource`. ### How was this patch tested? Pass the Jenkins with the newly added test cases. Closes #28187 from dongjoon-hyun/SPARK-31422. 
Authored-by: Dongjoon Hyun <dongjoon@apache.org> Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>	2020-04-11 08:27:30 -07:00
Dongjoon Hyun	b4c438a5e0	[SPARK-31291][SQL][TEST][FOLLOWUP] Fix resource loading error in ThriftServerQueryTestSuite ### What changes were proposed in this pull request? [SPARK-31291](https://github.com/apache/spark/pull/28060) broke `ThriftServerQueryTestSuite` in Maven environment. This PR fixes it by copying the resource file from jars to local temp file. ### Why are the changes needed? To recover the Jenkins jobs in `master` and `branch-3.0`. - https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/job/spark-branch-3.0-test-maven-hadoop-2.7-hive-2.3/211/ ``` org.apache.spark.sql.hive.thriftserver.ThriftServerQueryTestSuite * ABORTED * ... java.lang.IllegalArgumentException: java.net.URISyntaxException: Relative path in absolute URI: jar:file:/home/jenkins/workspace/spark-branch-3.0-test-maven-hadoop-2.7-hive-2.3/sql/core/target/ spark-sql_2.12-3.0.1-SNAPSHOT-tests.jar!/test-data/postgresql/agg.data ``` ![Screen Shot 2020-04-10 at 9 54 28 PM](https://user-images.githubusercontent.com/9700541/79035702-f03ad900-7b75-11ea-9eee-0c1581a28838.png) ### Does this PR introduce any user-facing change? No. ### How was this patch tested? Pass the Jenkins with SBT and Maven. - [x] Sbt (`Test build #121117` https://github.com/apache/spark/pull/28186#issuecomment-612329068) - [x] Maven (`Test build #121118` https://github.com/apache/spark/pull/28186#issuecomment-612414382) Closes #28186 from dongjoon-hyun/SPARK-31291. Authored-by: Dongjoon Hyun <dongjoon@apache.org> Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>	2020-04-11 08:23:59 -07:00
Dilip Biswal	f0e2fc37d1	[SPARK-25154][SQL] Support NOT IN sub-queries inside nested OR conditions ### What changes were proposed in this pull request? Currently NOT IN subqueries (predicated null aware subquery) are not allowed inside OR expressions. We currently catch this condition in checkAnalysis and throw an error. This PR enhances the subquery rewrite to support this type of queries. Query ```SQL SELECT * FROM s1 WHERE a > 5 or b NOT IN (SELECT c FROM s2); ``` Optimized Plan ```SQL == Optimized Logical Plan == Project [a#3, b#4] +- Filter ((a#3 > 5) \|\| NOT exists#7) +- Join ExistenceJoin(exists#7), ((b#4 = c#5) \|\| isnull((b#4 = c#5))) :- HiveTableRelation `default`.`s1`, org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, [a#3, b#4] +- Project [c#5] +- HiveTableRelation `default`.`s2`, org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, [c#5, d#6] ``` This is rework from #22141. The original author of this PR is dilipbiswal. Closes #22141 ### Why are the changes needed? For better usability. ### Does this PR introduce any user-facing change? No. ### How was this patch tested? Added new tests in SQLQueryTestSuite, RewriteSubquerySuite and SubquerySuite. Output from DB2 as a reference: [nested-not-db2.txt](https://github.com/apache/spark/files/2299945/nested-not-db2.txt) Closes #28158 from maropu/pr22141. Lead-authored-by: Dilip Biswal <dkbiswal@gmail.com> Co-authored-by: Takeshi Yamamuro <yamamuro@apache.org> Co-authored-by: Dilip Biswal <dbiswal@us.ibm.com> Signed-off-by: Takeshi Yamamuro <yamamuro@apache.org>	2020-04-11 08:28:11 +09:00
beliefer	2d3692ed45	[SPARK-31406][SQL][TEST] ThriftServerQueryTestSuite: Sharing test data and test tables among multiple test cases ### What changes were proposed in this pull request? This PR is related to https://github.com/apache/spark/pull/28060. `ThriftServerQueryTestSuite` spend 17 minutes time to test. I checked the code and found `ThriftServerQueryTestSuite` load test data repeatedly. I've listed all the test cases order by time with desc in the `hive-thriftserver` module below. Class \| Spend time ↑ \| Failure \| Skip \| Pass \| Total test case -- \| -- \| -- \| -- \| -- \| -- ThriftServerQueryTestSuite \| 17 minutes \| 0 \| 15 \| 140 \| 155 CliSuite \| 8 minutes 24 seconds \| 0 \| 0 \| 24 \| 24 SparkThriftServerProtocolVersionsSuite \| 59 seconds \| 0 \| 0 \| 210 \| 210 HiveThriftBinaryServerSuite \| 36 seconds \| 0 \| 1 \| 21 \| 22 SparkMetadataOperationSuite \| 19 seconds \| 0 \| 0 \| 7 \| 7 HiveCliSessionStateSuite \| 16 seconds \| 0 \| 0 \| 2 \| 2 SparkSQLEnvSuite \| 16 seconds \| 0 \| 0 \| 1 \| 1 HiveThriftHttpServerSuite \| 15 seconds \| 0 \| 0 \| 3 \| 3 SingleSessionSuite \| 14 seconds \| 0 \| 0 \| 3 \| 3 JdbcConnectionUriSuite \| 2.1 seconds \| 0 \| 0 \| 1 \| 1 ThriftServerWithSparkContextSuite \| 1.4 seconds \| 0 \| 0 \| 1 \| 1 SparkExecuteStatementOperationSuite \| 63 millseconds \| 0 \| 0 \| 2 \| 2 UISeleniumSuite \| -1 millseconds \| 0 \| 1 \| 0 \| 1 I checked the code of `ThriftServerQueryTestSuite` and found `ThriftServerQueryTestSuite` load test data repeatedly. This PR will improve the performance of `ThriftServerQueryTestSuite`. Because https://github.com/apache/spark/pull/28060 provides `createTestTables`(`e42a3945ac/sql/core/src/test/scala/org/apache/spark/sql/SQLQueryTestSuite.scala (L574)`) and `removeTestTables`(`e42a3945ac/sql/core/src/test/scala/org/apache/spark/sql/SQLQueryTestSuite.scala (L666)`), this PR will still uses them. The total time run `ThriftServerQueryTestSuite` before and after this PR show below. 
Before No \| Time -- \| -- 1 \| 18 minutes, 8 seconds 2 \| 22 minutes, 44 seconds 3 \| 17 minutes, 48 seconds 4 \| 18 minutes, 30 seconds After No \| Time -- \| -- 1 \| 16 minutes, 11 seconds 2 \| 17 minutes, 19 seconds 3 \| 18 minutes, 15 seconds 4 \| 17 minutes, 27 seconds ### Why are the changes needed? Improve the performance of `ThriftServerQueryTestSuite`. ### Does this PR introduce any user-facing change? 'No'. ### How was this patch tested? Jenkins test Closes #28180 from beliefer/avoid-load-thrift-test-data-repeatedly. Authored-by: beliefer <beliefer@163.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2020-04-10 13:08:19 +00:00
yi.wu	6cddda7847	[SPARK-31384][SQL] Fix NPE in OptimizeSkewedJoin ### What changes were proposed in this pull request? 1. Fix NPE in `OptimizeSkewedJoin` 2. Prevent other potential NPE errors in AQE. ### Why are the changes needed? When the `inputRDD` of a plan has 0 partitions, the rule `OptimizeSkewedJoin` can hit an NPE, because such an RDD results in a null `MapOutputStatistics` due to: `d98df7626b/sql/core/src/main/scala/org/apache/spark/sql/execution/exchange/ShuffleExchangeExec.scala (L68-L69)` Thus, we should guard against such NPEs in other places too. ### Does this PR introduce any user-facing change? No ### How was this patch tested? Added a test. Closes #28153 from Ngone51/npe. Authored-by: yi.wu <yi.wu@databricks.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2020-04-10 08:16:48 +00:00
Kent Yao	a454510917	[SPARK-31392][SQL] Support CalendarInterval to be reflected to CalendarIntervalType ### What changes were proposed in this pull request? Since 3.0.0, CalendarInterval is public for input, so it is better for it to be inferred as CalendarIntervalType. In this PR, we add a rule to map CalendarInterval to CalendarIntervalType in ScalaReflection, so that records (e.g. case classes, tuples ...) containing interval fields can be converted to a DataFrame. ### Why are the changes needed? CalendarInterval is public but cannot be used as input for a DataFrame. ```scala scala> import org.apache.spark.unsafe.types.CalendarInterval import org.apache.spark.unsafe.types.CalendarInterval scala> Seq((1, new CalendarInterval(1, 2, 3))).toDF("a", "b") java.lang.UnsupportedOperationException: Schema for type org.apache.spark.unsafe.types.CalendarInterval is not supported at org.apache.spark.sql.catalyst.ScalaReflection$.$anonfun$schemaFor$1(ScalaReflection.scala:735) ``` This should be supported, just as the following is: ```scala scala> sql("select interval 2 month 1 day a") res2: org.apache.spark.sql.DataFrame = [a: interval] ``` ### Does this PR introduce any user-facing change? Yes, records (e.g. case classes, tuples ...) containing interval fields can now be converted to a DataFrame. ### How was this patch tested? Added UTs. Closes #28165 from yaooqinn/SPARK-31392. Authored-by: Kent Yao <yaooqinn@hotmail.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2020-04-10 07:34:01 +00:00
Dongjoon Hyun	e42a3945ac	[SPARK-31401][K8S] Show JDK11 usage in `bin/docker-image-tool.sh` ### What changes were proposed in this pull request? This PR adds a JDK11-based build example in `bin/docker-image-tool.sh`. ### Why are the changes needed? This helps the users migrate to JDK11 more easily. ### Does this PR introduce any user-facing change? Yes, but this is a usage help. ### How was this patch tested? First, check the help usage manually. ``` $ bin/docker-image-tool.sh -h ... - Build and push JDK11-based image with tag "v3.0.0" to docker.io/myrepo bin/docker-image-tool.sh -r docker.io/myrepo -t v3.0.0 -b java_image_tag=11-jre-slim build bin/docker-image-tool.sh -r docker.io/myrepo -t v3.0.0 push ``` Then, build the image and check the Java version. ``` $ docker run -it --rm docker.io/myrepo/spark:v3.0.0 java --version \| tail -n3 openjdk 11.0.6 2020-01-14 OpenJDK Runtime Environment 18.9 (build 11.0.6+10) OpenJDK 64-Bit Server VM 18.9 (build 11.0.6+10, mixed mode) ``` Closes #28171 from dongjoon-hyun/SPARK-31401. Authored-by: Dongjoon Hyun <dongjoon@apache.org> Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>	2020-04-09 21:36:26 -07:00
Wenchen Fan	148950fa2b	[SPARK-31359][DOC][FOLLOWUP] improve code comments in RebaseDateTime ### What changes were proposed in this pull request? improve the code comment and make them consistent between `rebaseJulianToGregorian` and `rebaseGregorianToJulian` ### Why are the changes needed? improve readability. ### Does this PR introduce any user-facing change? no ### How was this patch tested? N/A Closes #28166 from cloud-fan/comment. Authored-by: Wenchen Fan <wenchen@databricks.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2020-04-10 03:43:32 +00:00
Dongjoon Hyun	c6ea6933e2	[SPARK-18886][CORE][TESTS][FOLLOWUP] Fix a test failure due to InvalidUseOfMatchersException ### What changes were proposed in this pull request? This fixes one UT failure. ``` [info] - extra resources from executor * FAILED * (218 milliseconds) [info] org.mockito.exceptions.misusing.InvalidUseOfMatchersException: Invalid use of argument matchers! [info] 0 matchers expected, 1 recorded: ``` ### Why are the changes needed? The original PR was merged with an outdated Jenkins result (7 days before the merging). ### Does this PR introduce any user-facing change? No. ### How was this patch tested? Pass the Jenkins or manually do the following. ``` $ build/sbt "core/testOnly *.CoarseGrainedSchedulerBackendSuite" ``` Closes #28174 from dongjoon-hyun/SPARK-18886. Authored-by: Dongjoon Hyun <dongjoon@apache.org> Signed-off-by: HyukjinKwon <gurwls223@apache.org>	2020-04-10 12:02:41 +09:00
Huaxin Gao	f69b0ef25d	[SPARK-31355][SQL][DOCS] Document TABLESAMPLE in SQL Reference ### What changes were proposed in this pull request? Document TABLESAMPLE in SQL Reference ### Why are the changes needed? To make SQL Reference complete ### Does this PR introduce any user-facing change? Yes <img width="1049" alt="Screen Shot 2020-04-06 at 10 23 52 PM" src="https://user-images.githubusercontent.com/13592258/78633123-96749f00-7855-11ea-9509-b7ee21da7fbd.png"> <img width="1050" alt="Screen Shot 2020-04-06 at 10 24 26 PM" src="https://user-images.githubusercontent.com/13592258/78633130-98d6f900-7855-11ea-8675-fd4b6163dfb6.png"> ### How was this patch tested? Manually build and check. Closes #28130 from huaxingao/sampling. Authored-by: Huaxin Gao <huaxing@us.ibm.com> Signed-off-by: Sean Owen <srowen@gmail.com>	2020-04-09 19:39:34 -05:00
zero323	697fe911ac	[SPARK-30819][SPARKR][ML] Add FMRegressor wrapper to SparkR ### What changes were proposed in this pull request? This pull request adds SparkR wrapper for `FMRegressor`: - Supporting `org.apache.spark.ml.r.FMRegressorWrapper`. - `FMRegressionModel` S4 class. - Corresponding `spark.fmRegressor`, `predict`, `summary` and `write.ml` generics. - Corresponding docs and tests. ### Why are the changes needed? Feature parity. ### Does this PR introduce any user-facing change? No (new API). ### How was this patch tested? New unit tests. Closes #27571 from zero323/SPARK-30819. Authored-by: zero323 <mszymkiewicz@gmail.com> Signed-off-by: Sean Owen <srowen@gmail.com>	2020-04-09 19:38:11 -05:00
Huaxin Gao	61f903fa7a	[SPARK-31331][SQL][DOCS] Document Spark integration with Hive UDFs/UDAFs/UDTFs ### What changes were proposed in this pull request? Document Spark integration with Hive UDFs/UDAFs/UDTFs ### Why are the changes needed? To make SQL Reference complete ### Does this PR introduce any user-facing change? Yes <img width="1031" alt="Screen Shot 2020-04-02 at 2 22 42 PM" src="https://user-images.githubusercontent.com/13592258/78301971-cc7cf080-74ee-11ea-93c8-7d4c75213b47.png"> ### How was this patch tested? Manually build and check Closes #28104 from huaxingao/hive-udfs. Lead-authored-by: Huaxin Gao <huaxing@us.ibm.com> Co-authored-by: Takeshi Yamamuro <yamamuro@apache.org> Signed-off-by: Sean Owen <srowen@gmail.com>	2020-04-09 13:28:01 -05:00
Gabor Somogyi	1354d2d0de	[SPARK-31021][SQL] Support MariaDB Kerberos login in JDBC connector ### What changes were proposed in this pull request? When loading DataFrames from JDBC datasource with Kerberos authentication, remote executors (yarn-client/cluster etc. modes) fail to establish a connection due to lack of Kerberos ticket or ability to generate it. This is a real issue when trying to ingest data from kerberized data sources (SQL Server, Oracle) in enterprise environment where exposing simple authentication access is not an option due to IT policy issues. In this PR I've added MariaDB support (other supported databases will come in later PRs). What this PR contains: * Introduced `SecureConnectionProvider` and added basic secure functionalities * Added `MariaDBConnectionProvider` * Added `MariaDBConnectionProviderSuite` * Added `MariaDBKrbIntegrationSuite` docker integration test * Added some missing code documentation ### Why are the changes needed? Missing JDBC kerberos support. ### Does this PR introduce any user-facing change? Yes, now user is able to connect to MariaDB using kerberos. ### How was this patch tested? * Additional + existing unit tests * Additional + existing integration tests * Test on cluster manually Closes #28019 from gaborgsomogyi/SPARK-31021. Authored-by: Gabor Somogyi <gabor.g.somogyi@gmail.com> Signed-off-by: Marcelo Vanzin <vanzin@apache.org>	2020-04-09 09:20:02 -07:00
gengjiaan	014d33570b	[SPARK-31291][SQL][TEST] SQLQueryTestSuite: Sharing test data and test tables among multiple test cases ### What changes were proposed in this pull request? `SQLQueryTestSuite` spend 35 minutes time to test. I've listed the 10 test cases that took the longest time in the `SQL` module below. Class \| Spend time ↑ \| Failure \| Skip \| Pass \| Total test case -- \| -- \| -- \| -- \| -- \| -- SQLQueryTestSuite \| 35 minutes \| 0 \| 1 \| 230 \| 231 TPCDSQuerySuite \| 3 minutes 8 seconds \| 0 \| 0 \| 156 \| 156 SQLQuerySuite \| 2 minutes 52 seconds \| 0 \| 0 \| 185 \| 185 DynamicPartitionPruningSuiteAEOff \| 1 minutes 52 seconds \| 0 \| 0 \| 22 \| 22 DataFrameFunctionsSuite \| 1 minutes 37 seconds \| 0 \| 0 \| 102 \| 102 DynamicPartitionPruningSuiteAEOn \| 1 minutes 24 seconds \| 0 \| 0 \| 22 \| 22 DataFrameSuite \| 1 minutes 14 seconds \| 0 \| 2 \| 157 \| 159 SubquerySuite \| 1 minutes 12 seconds \| 0 \| 1 \| 70 \| 71 SingleLevelAggregateHashMapSuite \| 1 minutes 1 seconds \| 0 \| 0 \| 50 \| 50 DataFrameAggregateSuite \| 59 seconds \| 0 \| 0 \| 50 \| 50 I checked the code of `SQLQueryTestSuite` and found `SQLQueryTestSuite` load test data repeatedly. This PR will improve the performance of `SQLQueryTestSuite`. The total time run `SQLQueryTestSuite` before and after this PR show below. Before No \| Time -- \| -- 1 \| 20 minutes, 22 seconds 2 \| 23 minutes, 21 seconds 3 \| 21 minutes, 19 seconds 4 \| 22 minutes, 26 seconds 5 \| 20 minutes, 8 seconds After No \| Time -- \| -- 1 \| 20 minutes, 52 seconds 2 \| 20 minutes, 47 seconds 3 \| 20 minutes, 7 seconds 4 \| 21 minutes, 10 seconds 5 \| 20 minutes, 4 seconds ### Why are the changes needed? Improve the performance of `SQLQueryTestSuite`. ### Does this PR introduce any user-facing change? 'No'. ### How was this patch tested? Jenkins test Closes #28060 from beliefer/avoid-load-test-data-repeatedly. 
Lead-authored-by: gengjiaan <gengjiaan@360.cn> Co-authored-by: beliefer <beliefer@163.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2020-04-09 12:16:43 +00:00
Nicholas Marcott	8b4862953a	[SPARK-18886][CORE] Make Locality wait time measure resource under utilization due to delay scheduling ### What changes were proposed in this pull request? [Delay scheduling](http://elmeleegy.com/khaled/papers/delay_scheduling.pdf) is an optimization that sacrifices fairness for data locality in order to improve cluster and workload throughput. One useful definition of "delay" here is how much time has passed since the TaskSet was using its fair share of resources. However it is impractical to calculate this delay, as it would require running simulations assuming no delay scheduling. Tasks would be run in different orders with different run times. Currently the heuristic used to estimate this delay is the time since a task was last launched for a TaskSet. The problem is that it essentially does not account for resource utilization, potentially leaving the cluster heavily underutilized. This PR modifies the heuristic in an attempt to move closer to the useful definition of delay above. The newly proposed delay is the time since a TasksSet last launched a task and did not reject any resources due to delay scheduling when offered its "fair share". See the last comments of #26696 for more discussion. ### Why are the changes needed? cluster can become heavily underutilized as described in [SPARK-18886](https://issues.apache.org/jira/browse/SPARK-18886?jql=project%20%3D%20SPARK%20AND%20text%20~%20delay) ### How was this patch tested? TaskSchedulerImplSuite cloud-fan tgravescs squito Closes #27207 from bmarcott/nmarcott-fulfill-slots-2. Authored-by: Nicholas Marcott <481161+bmarcott@users.noreply.github.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2020-04-09 11:00:29 +00:00
HyukjinKwon	c279e6b091	[SPARK-30722][DOCS][FOLLOW-UP] Explicitly mention the same entire input/output length restriction of Series Iterator UDF ### What changes were proposed in this pull request? This PR explicitly mentions the requirement of the Iterator of Series to Iterator of Series and Iterator of Multiple Series to Iterator of Series pandas UDFs (previously Scalar Iterator pandas UDF). The actual limitation of this UDF is that the _entire input and output_ must have the same length, rather than each individual series. Namely you can do something as below: ```python from typing import Iterator, Tuple import pandas as pd from pyspark.sql.functions import pandas_udf @pandas_udf("long") def func( iterator: Iterator[pd.Series]) -> Iterator[pd.Series]: return iter([pd.concat(iterator)]) spark.range(100).select(func("id")).show() ``` This characteristic allows you to prefetch the data from the iterator to speed up, compared to the regular Scalar to Scalar (previously Scalar pandas UDF). ### Why are the changes needed? To document the correct restriction and characteristics of a feature. ### Does this PR introduce any user-facing change? Yes in the documentation but only in unreleased branches. ### How was this patch tested? Github Actions should test the documentation build Closes #28160 from HyukjinKwon/SPARK-30722-followup. Authored-by: HyukjinKwon <gurwls223@apache.org> Signed-off-by: HyukjinKwon <gurwls223@apache.org>	2020-04-09 16:46:27 +09:00
Gengliang Wang	d89fcc64db	[SPARK-31333][FOLLOWUP][DOC] Link Join Hints doc in SQL perf tuning guide ### What changes were proposed in this pull request? This is a follow-up of https://github.com/apache/spark/pull/28113. There is also a brief section about Join hints in SQL perf tuning guide: https://spark.apache.org/docs/latest/sql-performance-tuning.html . We should link the new Join hint doc in it. ### Why are the changes needed? So that users can read more examples. ### Does this PR introduce any user-facing change? No ### How was this patch tested? Manually build the doc and check it: ![image](https://user-images.githubusercontent.com/1097932/78860030-f7cb7800-79e5-11ea-8573-c0587d43a7dc.png) Closes #28161 from gengliangwang/joinHintFollowUp. Authored-by: Gengliang Wang <gengliang.wang@databricks.com> Signed-off-by: HyukjinKwon <gurwls223@apache.org>	2020-04-09 15:03:08 +09:00
Max Gekk	e2d9399602	[SPARK-31359][SQL] Speed up timestamps rebasing ### What changes were proposed in this pull request? In the PR, I propose to optimise the `DateTimeUtils`.`rebaseJulianToGregorianMicros()` and `rebaseGregorianToJulianMicros()` functions, and make them faster by using pre-calculated rebasing tables. This approach allows to avoid expensive conversions via local timestamps. For example, the `America/Los_Angeles` time zone has just a few time points when difference between Proleptic Gregorian calendar and the hybrid calendar (Julian + Gregorian since 1582-10-15) is changed in the time interval 0001-01-01 .. 2100-01-01: \| i \| local timestamp \| Proleptic Greg. seconds \| Hybrid (Julian+Greg) seconds \| difference in minutes\| \| -- \| ------- \|----\|----\| ---- \| \|0\|0001-01-01 00:00\|-62135568422\|-62135740800\|-2872\| \|1\|0100-03-01 00:00\|-59006333222\|-59006419200\|-1432\| \|...\|...\|...\|...\|...\| \|13\|1582-10-15 00:00\|-12219264422\|-12219264000\|7\| \|14\|1883-11-18 12:00\|-2717640000\|-2717640000\|0\| The difference in microseconds between Proleptic and hybrid calendars for any local timestamp in time intervals `[local timestamp(i), local timestamp(i+1))`, and for any microseconds in the time interval `[Gregorian micros(i), Gregorian micros(i+1))` is the same. In this way, we can rebase an input micros by following the steps: 1. Look at the table, and find the time interval where the micros falls to 2. Take the difference between 2 calendars for this time interval 3. Add the difference to the input micros. The result is rebased microseconds that has the same local timestamp representation. Here are details of the implementation: - Pre-calculated tables are stored to JSON files `gregorian-julian-rebase-micros.json` and `julian-gregorian-rebase-micros.json` in the resource folder of `sql/catalyst`. 
The diffs and switch time points are stored as seconds, for example: ```json [ { "tz" : "America/Los_Angeles", "switches" : [ -62135740800, -59006419200, ... , -2717640000 ], "diffs" : [ 172378, 85978, ..., 0 ] } ] ``` The JSON files are generated by 2 tests in `RebaseDateTimeSuite` - `generate 'gregorian-julian-rebase-micros.json'` and `generate 'julian-gregorian-rebase-micros.json'`. Both tests are disabled by default. The `switches` time points are ordered from old to recent timestamps. This condition is checked by the test `validate rebase records in JSON files` in `RebaseDateTimeSuite`. Also sizes of the `switches` and `diffs` arrays are the same (this is checked by the same test). - The _Asia/Tehran, Iran, Africa/Casablanca and Africa/El_Aaiun_ time zones weren't added to the JSON files, see [SPARK-31385](https://issues.apache.org/jira/browse/SPARK-31385) - The rebase info from the JSON files is placed into hash tables - `gregJulianRebaseMap` and `julianGregRebaseMap`. I use `AnyRefMap` because it is almost 2 times faster than Scala's immutable Map. Also I tried `java.util.HashMap` but it has worse lookup time than `AnyRefMap` in our case. The hash maps store the switch time points and diffs in microseconds precision to avoid conversions from microseconds to seconds at runtime. - I moved the code related to days and microseconds rebasing to the separate object `RebaseDateTime` so as not to pollute `DateTimeUtils`. Tests related to date-time rebasing are moved to `RebaseDateTimeSuite` for the same reason. - I placed rebasing via local timestamp in separate methods that require zone id as the first parameter, assuming that the caller has the zone id already. This avoids unnecessarily retrieving the default time zone. The methods are marked as `private[sql]` because they are used in `RebaseDateTimeSuite` as the reference implementation. 
- Modified the `rebaseGregorianToJulianMicros()` and `rebaseJulianToGregorianMicros()` methods in `RebaseDateTime` to look up the rebase tables first. If the hash maps don't contain rebasing info for the given time zone id, the methods fall back to the implementation via local timestamps. This makes it possible to support time zones specified as zone offsets like '-08:00'. ### Why are the changes needed? To make timestamps rebasing faster: - Saving timestamps to parquet files is ~ x3.8 faster - Loading timestamps from parquet files is ~x2.8 faster. - Loading timestamps by Vectorized reader ~x4.6 faster. ### Does this PR introduce any user-facing change? No ### How was this patch tested? - Added the test `validate rebase records in JSON files` to `RebaseDateTimeSuite`. The test validates 2 json files from the resource folder - `gregorian-julian-rebase-micros.json` and `julian-gregorian-rebase-micros.json`, and it checks for each time zone's records that - the number of switch points is equal to the number of diffs between calendars. If the numbers are different, this will violate the assumption made in `RebaseDateTime.rebaseMicros`. - switch points are ordered from old to recent timestamps. This pre-condition is required for linear search in the `rebaseMicros` function. - Added the test `optimization of micros rebasing - Gregorian to Julian` to `RebaseDateTimeSuite` which iterates over timestamps from 0001-01-01 to 2100-01-01 with the steps 1 ± 0.5 months, and checks that the optimised function `RebaseDateTime`.`rebaseGregorianToJulianMicros()` returns the same result as the non-optimised one. The check is performed for the UTC, PST, CET, Africa/Dakar, America/Los_Angeles, Antarctica/Vostok, Asia/Hong_Kong, Europe/Amsterdam time zones. - Added the test `optimization of micros rebasing - Julian to Gregorian` to `RebaseDateTimeSuite` which does similar checks as the test above but for rebasing from the hybrid calendar (Julian + Gregorian) to Proleptic Gregorian calendar. 
- The tests for days rebasing are moved from `DateTimeUtilsSuite` to `RebaseDateTimeSuite` because the rebasing related code is moved from `DateTimeUtils` to the separate object `RebaseDateTime`. - Re-run `DateTimeRebaseBenchmark` at the America/Los_Angeles time zone (it is set explicitly in the PR #28127): \| Item \| Description \| \| ---- \| ----\| \| Region \| us-west-2 (Oregon) \| \| Instance \| r3.xlarge \| \| AMI \| ubuntu/images/hvm-ssd/ubuntu-bionic-18.04-amd64-server-20190722.1 (ami-06f2f779464715dc5) \| \| Java \| OpenJDK 64-Bit Server VM 1.8.0_242 and OpenJDK 64-Bit Server VM 11.0.6+10 \| Closes #28119 from MaxGekk/optimize-rebase-micros. Authored-by: Max Gekk <max.gekk@gmail.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2020-04-09 05:23:52 +00:00
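The three lookup steps described in SPARK-31359 above (find the interval, take its diff, add it to the input) can be sketched roughly as follows. This is a minimal Python illustration, not the Scala code in `RebaseDateTime`: the function name `rebase_micros` and the constants are hypothetical, and the table holds only the three `switches`/`diffs` entries printed in the JSON excerpt. The elided entries are deliberately not filled in, so results for timestamps between the second and last switch point are not realistic. Treating timestamps earlier than the first switch point as belonging to the first interval is also an assumption.

```python
MICROS_PER_SECOND = 1_000_000

# Only the entries visible in the JSON excerpt for America/Los_Angeles;
# the elided middle entries ("...") are intentionally left out.
SWITCHES_SEC = [-62135740800, -59006419200, -2717640000]
DIFFS_SEC = [172378, 85978, 0]

# Store both arrays in microseconds so that no unit conversion is needed per lookup.
SWITCHES = [s * MICROS_PER_SECOND for s in SWITCHES_SEC]
DIFFS = [d * MICROS_PER_SECOND for d in DIFFS_SEC]


def rebase_micros(micros, switches=SWITCHES, diffs=DIFFS):
    """Rebase `micros` using a table of switch points ordered old -> recent."""
    # Linear search from the most recent interval backwards.
    i = len(switches) - 1
    while i > 0 and micros < switches[i]:
        i -= 1
    # Inputs before the first switch point fall through to index 0 (assumption).
    return micros + diffs[i]


# 0001-01-01 00:00 in the hybrid calendar maps to the Proleptic Gregorian
# value listed in the first row of the table above.
print(rebase_micros(-62135740800 * MICROS_PER_SECOND) // MICROS_PER_SECOND)
```

Because the switch points are sorted, a binary search would also work; the linear scan is kept here only because the commit message mentions a linear search.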
HyukjinKwon	4fafdcd63b	[SPARK-26412][PYTHON][FOLLOW-UP] Improve error messages in Scalar iterator pandas UDF ### What changes were proposed in this pull request? This PR proposes to improve the error message from Scalar iterator pandas UDF. ### Why are the changes needed? To show the correct error messages. ### Does this PR introduce any user-facing change? Yes, but only in unreleased branches. ```python import pandas as pd from pyspark.sql.functions import pandas_udf, PandasUDFType @pandas_udf('long', PandasUDFType.SCALAR_ITER) def pandas_plus_one(iterator): for _ in iterator: yield pd.Series(1) spark.range(10).repartition(1).select(pandas_plus_one("id")).show() ``` ```python import pandas as pd from pyspark.sql.functions import pandas_udf, PandasUDFType @pandas_udf('long', PandasUDFType.SCALAR_ITER) def pandas_plus_one(iterator): for _ in iterator: yield pd.Series(list(range(20))) spark.range(10).repartition(1).select(pandas_plus_one("id")).show() ``` Before: ``` RuntimeError: The number of output rows of pandas iterator UDF should be the same with input rows. The input rows number is 10 but the output rows number is 1. ``` ``` AssertionError: Pandas MAP_ITER UDF outputted more rows than input rows. ``` After: ``` RuntimeError: The length of output in Scalar iterator pandas UDF should be the same with the input's; however, the length of output was 1 and the length of input was 10. ``` ``` AssertionError: Pandas SCALAR_ITER UDF outputted more rows than input rows. ``` ### How was this patch tested? Unittests were fixed accordingly. Closes #28135 from HyukjinKwon/SPARK-26412-followup. Authored-by: HyukjinKwon <gurwls223@apache.org> Signed-off-by: HyukjinKwon <gurwls223@apache.org>	2020-04-09 13:14:41 +09:00
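The two failure modes in SPARK-26412 above (too few output rows overall, or more output rows than input rows seen so far) can be illustrated without Spark or pandas. The following is a hedged sketch only: `checked_scalar_iter` is a hypothetical helper, plain Python lists stand in for `pd.Series` batches, and the logic is an assumption about the kind of check involved, not PySpark's actual worker code. The error messages are modeled on the "After" messages quoted in the commit.

```python
from typing import Callable, Iterable, Iterator, List

Batch = List[int]


def checked_scalar_iter(
    func: Callable[[Iterator[Batch]], Iterator[Batch]],
    batches: Iterable[Batch],
) -> Iterator[Batch]:
    """Run an iterator-to-iterator function and verify total lengths match."""
    in_rows = 0
    out_rows = 0

    def counting_input() -> Iterator[Batch]:
        # Count input rows as the user function consumes them.
        nonlocal in_rows
        for batch in batches:
            in_rows += len(batch)
            yield batch

    for out in func(counting_input()):
        out_rows += len(out)
        # Output may lag behind the input (prefetching), but never exceed it.
        if out_rows > in_rows:
            raise AssertionError(
                "Pandas SCALAR_ITER UDF outputted more rows than input rows."
            )
        yield out

    if out_rows != in_rows:
        raise RuntimeError(
            "The length of output in Scalar iterator pandas UDF should be the "
            "same with the input's; however, the length of output was "
            f"{out_rows} and the length of input was {in_rows}."
        )


def plus_one(it: Iterator[Batch]) -> Iterator[Batch]:
    for batch in it:
        yield [x + 1 for x in batch]


def concat_all(it: Iterator[Batch]) -> Iterator[Batch]:
    # Consumes every input batch first, then emits one combined batch.
    yield [x for batch in it for x in batch]


print(list(checked_scalar_iter(plus_one, [[1, 2], [3]])))
print(list(checked_scalar_iter(concat_all, [[1, 2], [3]])))
```

Only the total lengths are compared, so a function that merges all batches into one, like `concat_all`, passes the check even though its batch boundaries differ from the input's.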
zero323	0063462d55	[SPARK-30818][SPARKR][ML] Add SparkR LinearRegression wrapper ### What changes were proposed in this pull request? This pull request adds SparkR wrapper for `LinearRegression` - Supporting `org.apache.spark.ml.r.LinearRegressionWrapper`. - `LinearRegressionModel` S4 class. - Corresponding `spark.lm`, `predict`, `summary` and `write.ml` generics. - Corresponding docs and tests. ### Why are the changes needed? Feature parity. ### Does this PR introduce any user-facing change? No (new API). ### How was this patch tested? New unit tests. Closes #27593 from zero323/SPARK-30818. Authored-by: zero323 <mszymkiewicz@gmail.com> Signed-off-by: Sean Owen <srowen@gmail.com>	2020-04-08 22:29:44 -05:00
HyukjinKwon	0248b32972	[SPARK-31382][BUILD] Show a better error message for different python and pip installation mistake ### What changes were proposed in this pull request? This PR proposes to show a better error message when a user mistakenly installs `pyspark` from PIP but the default `python` does not point out the corresponding `pip`. See https://stackoverflow.com/questions/46286436/running-pyspark-after-pip-install-pyspark/49587560 as an example. It can be reproduced as below: I have two Python executables. `python` is Python 3.7, `pip` binds with Python 3.7 and `python2.7` is Python 2.7. ```bash pip install pyspark ``` ```bash pyspark ``` ``` ... Welcome to ____ __ / __/__ ___ _____/ /__ _\ \/ _ \/ _ `/ __/ '_/ /__ / .__/\_,_/_/ /_/\_\ version 2.4.5 /_/ Using Python version 3.7.3 (default, Mar 27 2019 09:23:15) SparkSession available as 'spark'. ... ``` ```bash PYSPARK_PYTHON=python2.7 pyspark ``` ``` Could not find valid SPARK_HOME while searching ['/Users', '/usr/local/Cellar/python/3.7.5/Frameworks/Python.framework/Versions/3.7/bin'] /usr/local/Cellar/python/3.7.5/Frameworks/Python.framework/Versions/3.7/bin/pyspark: line 24: /bin/load-spark-env.sh: No such file or directory /usr/local/Cellar/python/3.7.5/Frameworks/Python.framework/Versions/3.7/bin/pyspark: line 77: /bin/spark-submit: No such file or directory /usr/local/Cellar/python/3.7.5/Frameworks/Python.framework/Versions/3.7/bin/pyspark: line 77: exec: /bin/spark-submit: cannot execute: No such file or directory ``` ### Why are the changes needed? There are multiple questions outside about this error and they have no idea what's going on. 
See: - https://stackoverflow.com/questions/46286436/running-pyspark-after-pip-install-pyspark/49587560 - https://stackoverflow.com/questions/45991888/path-issue-could-not-find-valid-spark-home-while-searching - https://stackoverflow.com/questions/49707239/pyspark-could-not-find-valid-spark-home - https://stackoverflow.com/questions/55569985/pyspark-could-not-find-valid-spark-home - https://stackoverflow.com/questions/48296474/error-could-not-find-valid-spark-home-while-searching-pycharm-in-windows - https://github.com/ContinuumIO/anaconda-issues/issues/8076 The answer is usually setting `SPARK_HOME`; however this isn't completely correct. It works if you set `SPARK_HOME` because `pyspark` executable script directly imports the library by using `SPARK_HOME` (see https://github.com/apache/spark/blob/master/bin/pyspark#L52-L53) instead of the default package location specified via `python` executable. So, this way you use a package installed in a different Python, which isn't ideal. ### Does this PR introduce any user-facing change? Yes, it fixes the error message better. Before: ``` Could not find valid SPARK_HOME while searching ['/Users', '/usr/local/Cellar/python/3.7.5/Frameworks/Python.framework/Versions/3.7/bin'] ... ``` After: ``` Could not find valid SPARK_HOME while searching ['/Users', '/usr/local/Cellar/python/3.7.5/Frameworks/Python.framework/Versions/3.7/bin'] Did you install PySpark via a package manager such as pip or Conda? If so, PySpark was not found in your Python environment. It is possible your Python environment does not properly bind with your package manager. Please check your default 'python' and if you set PYSPARK_PYTHON and/or PYSPARK_DRIVER_PYTHON environment variables, and see if you can import PySpark, for example, 'python -c 'import pyspark'. If you cannot import, you can install by using the Python executable directly, for example, 'python -m pip install pyspark [--user]'. 
Otherwise, you can also explicitly set the Python executable, that has PySpark installed, to PYSPARK_PYTHON or PYSPARK_DRIVER_PYTHON environment variables, for example, 'PYSPARK_PYTHON=python3 pyspark'. ... ``` ### How was this patch tested? Manually tested as described above. Closes #28152 from HyukjinKwon/SPARK-31382. Authored-by: HyukjinKwon <gurwls223@apache.org> Signed-off-by: HyukjinKwon <gurwls223@apache.org>	2020-04-09 11:04:35 +09:00
Jungtaek Lim (HeartSaVioR)	ca2ba4fe64	[SPARK-29314][SS] Don't overwrite the metric "updated" of state operator to 0 if empty batch is run ### What changes were proposed in this pull request? This patch fixes the behavior of ProgressReporter which always overwrite the value of "updated" of state operator to 0 if there's no new data. The behavior is correct only when we copy the state progress from "previous" executed plan, meaning no batch has been run. (Nonzero value of "updated" would be odd if batch didn't run, so it was correct.) It was safe to assume no data is no batch, but SPARK-24156 enables empty data can run the batch if Spark needs to deal with watermark. After the patch, it only overwrites the value if both two conditions are met: 1) no data 2) no batch. ### Why are the changes needed? Currently Spark doesn't reflect correct metrics when empty batch is run and this patch fixes it. ### Does this PR introduce any user-facing change? No. ### How was this patch tested? Modified UT. Note that FlatMapGroupsWithState increases the value of "updated" when state rows are removed. Also manually tested via below query (not a simple query to test with spark-shell, as you'll meet closure issue in spark-shell while playing with state func): > query ``` case class RunningCount(count: Long) object TestFlatMapGroupsWithState { def main(args: Array[String]): Unit = { import org.apache.spark.sql.SparkSession val ss = SparkSession .builder() .appName("TestFlatMapGroupsWithState") .getOrCreate() ss.conf.set("spark.sql.shuffle.partitions", "5") import ss.implicits._ val stateFunc = (key: String, values: Iterator[String], state: GroupState[RunningCount]) => { if (state.hasTimedOut) { // End users are not restricted to remove the state here - they can update the // state as well. For example, event time session window would have list of // sessions here and it cannot remove entire state. 
state.update(RunningCount(-1)) Iterator((key, "-1")) } else { val count = state.getOption.map(_.count).getOrElse(0L) + values.size state.update(RunningCount(count)) state.setTimeoutDuration("1 seconds") Iterator((key, count.toString)) } } implicit val sqlContext = ss.sqlContext val inputData = MemoryStream[String] val result = inputData .toDF() .as[String] .groupByKey { v => v } .flatMapGroupsWithState(OutputMode.Append(), GroupStateTimeout.ProcessingTimeTimeout())(stateFunc) val query = result .writeStream .format("memory") .option("queryName", "test") .outputMode("append") .trigger(Trigger.ProcessingTime("5 second")) .start() Thread.sleep(1000) var chIdx: Long = 0 while (true) { (chIdx to chIdx + 4).map { idx => inputData.addData(idx.toString) } chIdx += 5 // intentionally sleep much more than trigger to enable "empty" batch Thread.sleep(10 * 1000) } } } ``` > before the patch (batch 3 which was an "empty" batch) ``` { "id":"de945a5c-882b-4dae-aa58-cb8261cbaf9e", "runId":"f1eb6d0d-3cd5-48b2-a03b-5e989b6c151b", "name":"test", "timestamp":"2019-11-18T07:00:25.005Z", "batchId":3, "numInputRows":0, "inputRowsPerSecond":0.0, "processedRowsPerSecond":0.0, "durationMs":{ "addBatch":1664, "getBatch":0, "latestOffset":0, "queryPlanning":29, "triggerExecution":1789, "walCommit":51 }, "stateOperators":[ { "numRowsTotal":10, "numRowsUpdated":0, "memoryUsedBytes":5130, "customMetrics":{ "loadedMapCacheHitCount":15, "loadedMapCacheMissCount":0, "stateOnCurrentVersionSizeBytes":2722 } } ], "sources":[ { "description":"MemoryStream[value#1]", "startOffset":9, "endOffset":9, "numInputRows":0, "inputRowsPerSecond":0.0, "processedRowsPerSecond":0.0 } ], "sink":{ "description":"MemorySink", "numOutputRows":5 } } ``` > after the patch (batch 3 which was an "empty" batch) ``` { "id":"7cb41623-6b9a-408e-ae02-6796ec636fa0", "runId":"17847710-ddfe-45f5-a7ab-b160e149382f", "name":"test", "timestamp":"2019-11-18T07:02:25.005Z", "batchId":3, "numInputRows":0, "inputRowsPerSecond":0.0, 
"processedRowsPerSecond":0.0, "durationMs":{ "addBatch":1196, "getBatch":0, "latestOffset":0, "queryPlanning":30, "triggerExecution":1333, "walCommit":46 }, "stateOperators":[ { "numRowsTotal":10, "numRowsUpdated":5, "memoryUsedBytes":5130, "customMetrics":{ "loadedMapCacheHitCount":15, "loadedMapCacheMissCount":0, "stateOnCurrentVersionSizeBytes":2722 } } ], "sources":[ { "description":"MemoryStream[value#1]", "startOffset":9, "endOffset":9, "numInputRows":0, "inputRowsPerSecond":0.0, "processedRowsPerSecond":0.0 } ], "sink":{ "description":"MemorySink", "numOutputRows":5 } } ``` "numRowsUpdated" is `0` in "stateOperators" before applying the patch which is "wrong", as we "update" the state when timeout occurs. After applying the patch, it correctly represents the "numRowsUpdated" as `5` in "stateOperators". Closes #25987 from HeartSaVioR/SPARK-29314. Authored-by: Jungtaek Lim (HeartSaVioR) <kabhwan.opensource@gmail.com> Signed-off-by: Burak Yavuz <brkyvz@gmail.com>	2020-04-08 16:59:39 -07:00
iRakson	b56242332d	[SPARK-31009][SQL] Support json_object_keys function ### What changes were proposed in this pull request? A new function `json_object_keys` is proposed in this PR. This function returns all the keys of the outermost JSON object. It takes a JSON object as its argument. - If an invalid JSON expression is given, `NULL` is returned. - If an empty string or a JSON array is given, `NULL` is returned. - If a valid JSON object is given, all the keys of the outermost object are returned as an array. - For an empty JSON object, an empty array is returned. We can also get JSON object keys using `map_keys+from_json`, but `json_object_keys` is more efficient. ``` Performance result for json_object = {"a":[1,2,3,4,5], "b":[2,4,5,12333321]} Intel(R) Core(TM) i7-9750H CPU 2.60GHz JSON functions: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative ------------------------------------------------------------------------------------------------------------------------ json_object_keys 11666 12361 673 0.9 1166.6 1.0X from_json+map_keys 15309 15973 701 0.7 1530.9 0.8X ``` ### Why are the changes needed? This function helps users extract the keys from a JSON string directly, and it is fairly intuitive as well. It also extends the functionality of spark-sql for JSON strings. Some of the most popular DBMSs support this function: - PostgreSQL - MySQL - MariaDB ### Does this PR introduce any user-facing change? Yes. Users can now extract the keys of JSON objects using this function. ### How was this patch tested? UTs added. Closes #27836 from iRakson/jsonKeys. Authored-by: iRakson <raksonrakesh@gmail.com> Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>	2020-04-08 13:04:59 -07:00
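The four rules listed in the SPARK-31009 entry above can be sketched in plain Python. This is only an illustration of the described semantics using the standard `json` module, not Spark's implementation; the Python function below merely borrows the name `json_object_keys`, and `None` stands in for SQL `NULL`.

```python
import json


def json_object_keys(text):
    """Return the top-level keys of a JSON object string, or None.

    Illustrative stand-in for the semantics in the commit message:
    None plays the role of SQL NULL. This is not Spark's code.
    """
    if text is None:
        return None
    try:
        parsed = json.loads(text)
    except (TypeError, ValueError):
        # Invalid JSON, including the empty string, maps to NULL.
        return None
    if not isinstance(parsed, dict):
        # Arrays and scalars are not JSON objects, so they map to NULL.
        return None
    # Only the outermost keys are returned; nested objects are not descended.
    return list(parsed.keys())


if __name__ == "__main__":
    print(json_object_keys('{"a":[1,2,3,4,5], "b":[2,4,5,12333321]}'))
    print(json_object_keys("[1, 2, 3]"))
    print(json_object_keys("{}"))
```

Edge cases that the commit message does not mention (duplicate keys, for instance) may behave differently in Spark than in this sketch.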
Huaxin Gao	5dc9b9c7c1	[SPARK-31362][SQL][DOCS] Document Set Operators in SQL Reference ### What changes were proposed in this pull request? Document Set Operators in SQL Reference ### Why are the changes needed? To make SQL Reference complete ### Does this PR introduce any user-facing change? Yes <img width="1050" alt="Screen Shot 2020-04-07 at 9 20 05 AM" src="https://user-images.githubusercontent.com/13592258/78694605-c6ea2680-78b1-11ea-8590-afb43dbe5933.png"> <img width="1050" alt="Screen Shot 2020-04-07 at 9 20 41 AM" src="https://user-images.githubusercontent.com/13592258/78694613-c8b3ea00-78b1-11ea-89b9-d6cd71ee86a0.png"> <img width="1050" alt="Screen Shot 2020-04-07 at 9 21 29 AM" src="https://user-images.githubusercontent.com/13592258/78694622-ca7dad80-78b1-11ea-9acf-7611ee57d4f2.png"> <img width="1050" alt="Screen Shot 2020-04-07 at 9 21 54 AM" src="https://user-images.githubusercontent.com/13592258/78694626-cc477100-78b1-11ea-82f8-4deaf0048de7.png"> ### How was this patch tested? Manually build and check Closes #28139 from huaxingao/set-operators. Authored-by: Huaxin Gao <huaxing@us.ibm.com> Signed-off-by: Sean Owen <srowen@gmail.com>	2020-04-08 10:51:04 -05:00
yi.wu	a2789c2a51	[SPARK-31379][CORE][TEST] Fix flaky o.a.s.scheduler.CoarseGrainedSchedulerBackendSuite.extra resources from executor ### What changes were proposed in this pull request? This PR (SPARK-31379) adds one line, `when(ts.resourceOffers(any[IndexedSeq[WorkerOffer]])).thenReturn(Seq.empty)`, to avoid allocating resources. ### Why are the changes needed? The test is flaky; here is part of the error stack: ``` sbt.ForkMain$ForkError: org.scalatest.exceptions.TestFailedDueToTimeoutException: The code passed to eventually never returned normally. Attempted 325 times over 5.01070979 seconds. Last failure message: ArrayBuffer("1", "3") did not equal Array("0", "1", "3"). ... org.apache.spark.scheduler.CoarseGrainedSchedulerBackendSuite.eventually(CoarseGrainedSchedulerBackendSuite.scala:45) ``` You can check [here](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/120786/testReport/org.apache.spark.scheduler/CoarseGrainedSchedulerBackendSuite/extra_resources_from_executor/) for details. It is flaky because, after `StatusUpdate` is sent to `CoarseGrainedSchedulerBackend`, `CoarseGrainedSchedulerBackend` calls `makeOffer` immediately after releasing the resources. So it is possible that the addresses in `availableAddrs` have been allocated again before we assert `execResources(GPU).availableAddrs.sorted === Array("0", "1", "3")`. ### Does this PR introduce any user-facing change? No. ### How was this patch tested? The issue can be reproduced reliably by inserting `Thread.sleep(3000)` after the line that sends `StatusUpdate`. After applying this fix, the issue is gone. Closes #28145 from Ngone51/fix_flaky. Authored-by: yi.wu <yi.wu@databricks.com> Signed-off-by: HyukjinKwon <gurwls223@apache.org>	2020-04-08 17:54:28 +09:00
gatorsmile	a3d83948b8	[SPARK-31351][DOC] Migration Guide Auditing for Spark 3.0 Release ### What changes were proposed in this pull request? This PR is to audit the migration guides in Spark 3.0 release: - correct the grammar errors - clean up some items - replace HTML table by markdown table ### Why are the changes needed? N/A ### Does this PR introduce any user-facing change? No ### How was this patch tested? Screenshot: ![screencapture-127-0-0-1-4000-sql-migration-guide-html-2020-04-04-21_36_29](https://user-images.githubusercontent.com/11567269/78467043-9477d800-76bd-11ea-8ab0-3d51ea5e9fa5.png) ![Screen Shot 2020-04-04 at 9 28 13 PM](https://user-images.githubusercontent.com/11567269/78467045-98a3f580-76bd-11ea-9e4b-927bf12e683a.png) ![Screen Shot 2020-04-04 at 9 28 02 PM](https://user-images.githubusercontent.com/11567269/78467046-98a3f580-76bd-11ea-8ea3-9f13cb8d200b.png) ![Screen Shot 2020-04-04 at 9 21 40 PM](https://user-images.githubusercontent.com/11567269/78467047-993c8c00-76bd-11ea-8c29-91afc68eb590.png) Closes #28125 from gatorsmile/updateMigrationGuide3.0. Authored-by: gatorsmile <gatorsmile@gmail.com> Signed-off-by: HyukjinKwon <gurwls223@apache.org>	2020-04-08 12:27:40 +09:00
beliefer	0fc859b4d5	[SPARK-31269][DOC][FOLLOWUP][MINOR] Add version head of GraphX table ### What changes were proposed in this pull request? HyukjinKwon has backported all the version-related PRs to branch-3.0. I double-checked and found that the GraphX table is missing its version header. This PR fixes the issue. HyukjinKwon, please help me merge this PR to master and branch-3.0. ### Why are the changes needed? To add the version header to the GraphX table. ### Does this PR introduce any user-facing change? No. ### How was this patch tested? Jenkins test. Closes #28149 from beliefer/fix-head-of-graphx-table. Authored-by: beliefer <beliefer@163.com> Signed-off-by: HyukjinKwon <gurwls223@apache.org>	2020-04-08 12:25:06 +09:00
Burak Yavuz	8ab2a0c5f2	[SPARK-31278][SS] Fix StreamingQuery output rows metric ### What changes were proposed in this pull request? In Structured Streaming, we provide progress updates every 10 seconds when a stream doesn't have any new data upstream. When providing this progress though, we zero out the input information but not the output information. This PR fixes that bug. ### Why are the changes needed? Fixes a bug around incorrect metrics ### Does this PR introduce any user-facing change? Fixes a bug in the metrics ### How was this patch tested? New regression test Closes #28040 from brkyvz/sinkMetrics. Lead-authored-by: Burak Yavuz <brkyvz@gmail.com> Co-authored-by: Burak Yavuz <burak@databricks.com> Signed-off-by: Burak Yavuz <brkyvz@gmail.com>	2020-04-07 17:17:47 -07:00
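One reading of the SPARK-31278 description above: when a progress update is emitted for a stream with no new data, both the input counters and the output counter should be reset, not the input counters alone. The Python sketch below illustrates that reading only; `Progress`, `idle_progress_before`, and `idle_progress_after` are hypothetical names, not Spark classes or methods, and the real change lives in Spark's Scala code.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class Progress:
    """Hypothetical, simplified stand-in for a streaming progress record."""
    num_input_rows: int
    num_output_rows: int


def idle_progress_before(last: Progress) -> Progress:
    # Behavior described as the bug: only the input side is zeroed,
    # so the output count of the last real batch is reported again.
    return replace(last, num_input_rows=0)


def idle_progress_after(last: Progress) -> Progress:
    # Assumed fixed behavior: no new data means nothing was read or written.
    return replace(last, num_input_rows=0, num_output_rows=0)


if __name__ == "__main__":
    last = Progress(num_input_rows=5, num_output_rows=5)
    print(idle_progress_before(last))
    print(idle_progress_after(last))
```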

27117 commits