### What changes were proposed in this pull request?
Spark SQL includes a data source that can read data from other databases using JDBC.
Spark also supports the case-insensitive option `pushDownPredicate`.
According to http://spark.apache.org/docs/latest/sql-data-sources-jdbc.html, if `pushDownPredicate` is set to false, no filter is pushed down to the JDBC data source, and all filters are handled by Spark.
However, filters are currently still pushed down to the JDBC data source even when the option is set to false.
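A minimal sketch of the scenario (assumes `spark-shell` and a hypothetical PostgreSQL table `people`; connection details are made up):

```scala
val df = spark.read
  .format("jdbc")
  .option("url", "jdbc:postgresql://localhost:5432/testdb") // hypothetical database
  .option("dbtable", "people")
  .option("pushDownPredicate", "false")
  .load()
  .filter("age > 21")

// With the fix, the JDBC scan in the plan shows PushedFilters: [],
// and Spark evaluates the filter itself.
df.explain()
```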
### Why are the changes needed?
Fix a bug where `pushDownPredicate=false` failed to prevent filters from being pushed down to the JDBC data source.
### Does this PR introduce _any_ user-facing change?
No.
The output of queries will not change.
### How was this patch tested?
Jenkins test.
Closes #33822 from beliefer/SPARK-36574.
Authored-by: gengjiaan <gengjiaan@360.cn>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
(cherry picked from commit fcc91cfec4)
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request?
Use `WeakReference` instead of `SoftReference` in LevelDB.
### Why are the changes needed?
(See discussion at https://github.com/apache/spark/pull/28769#issuecomment-906722390 )
"The soft reference to iterator introduced in this pr unfortunately ended up causing iterators to not be closed when they go out of scope (which would have happened earlier in the finalize)
This is because java is more conservative in cleaning up SoftReference's.
The net result was we ended up having 50k files for SHS while typically they get compacted away to 200 odd files.
Changing from SoftReference to WeakReference should make it much more aggresive in cleanup and prevent the issue - which we observed in a 3.1 SHS"
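As an illustrative sketch of the pattern (not the actual LevelDB code; all names here are made up): track open iterators through `WeakReference`s so they can be closed eagerly once unreachable, instead of lingering until memory pressure clears a `SoftReference`.

```scala
import java.lang.ref.WeakReference
import java.util.concurrent.ConcurrentLinkedQueue

class IteratorTracker[T <: AutoCloseable] {
  private val refs = new ConcurrentLinkedQueue[WeakReference[T]]()

  def track(it: T): Unit = refs.add(new WeakReference[T](it))

  // Close anything still reachable and drop entries the GC already cleared.
  def closeAll(): Unit = {
    refs.forEach { ref =>
      val it = ref.get()
      if (it != null) it.close()
    }
    refs.clear()
  }
}
```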
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
Existing tests
Closes #33859 from srowen/SPARK-36603.
Authored-by: Sean Owen <srowen@gmail.com>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
(cherry picked from commit 89e907f76c)
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
### What changes were proposed in this pull request?
Some of GitHub Actions workflow files do not have the Apache license header. This PR adds them.
### Why are the changes needed?
To comply with the Apache license.
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
N/A
Closes #33862 from HyukjinKwon/minor-lisence.
Authored-by: Hyukjin Kwon <gurwls223@apache.org>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
(cherry picked from commit 22c492a6b8)
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
### What changes were proposed in this pull request?
The `try_add` function allows the following inputs:
- number, number
- date, number
- date, interval
- timestamp, interval
- interval, interval
And, the `try_divide` function allows the following inputs:
- number, number
- interval, number
However, the current code only includes examples and tests for the (number, number) inputs. We should enhance the docs so users know that the functions can be used for datetime and interval operations too, as sketched below.
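For reference, hedged examples of the non-numeric combinations (assumes Spark 3.2 SQL syntax):

```scala
spark.sql("SELECT try_add(date'2021-08-01', 7)").show()                              // date, number (days)
spark.sql("SELECT try_add(date'2021-08-01', interval 1 month)").show()               // date, interval
spark.sql("SELECT try_add(timestamp'2021-08-01 09:00:00', interval 2 hours)").show() // timestamp, interval
spark.sql("SELECT try_add(interval 1 month, interval 2 months)").show()              // interval, interval
spark.sql("SELECT try_divide(interval 6 months, 2)").show()                          // interval, number
```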
### Why are the changes needed?
Improve documentation and tests.
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
New UT
Also build docs for preview:
![image](https://user-images.githubusercontent.com/1097932/131212897-8aea14c8-a882-4e12-94e2-f56bde7c0367.png)
Closes #33861 from gengliangwang/enhanceTryDoc.
Authored-by: Gengliang Wang <gengliang@apache.org>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
(cherry picked from commit 8a52ad9f82)
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
### What changes were proposed in this pull request?
This PR fixes an issue where executors are never re-scheduled if the worker they run on stops.
As a result, the application gets stuck.
You can easily reproduce this issue with the following procedure.
```
# Run master
$ sbin/start-master.sh
# Run worker 1
$ SPARK_LOG_DIR=/tmp/worker1 SPARK_PID_DIR=/tmp/worker1/ sbin/start-worker.sh -c 1 -h localhost -d /tmp/worker1 --webui-port 8081 spark://<hostname>:7077
# Run worker 2
$ SPARK_LOG_DIR=/tmp/worker2 SPARK_PID_DIR=/tmp/worker2/ sbin/start-worker.sh -c 1 -h localhost -d /tmp/worker2 --webui-port 8082 spark://<hostname>:7077
# Run Spark Shell
$ bin/spark-shell --master spark://<hostname>:7077 --executor-cores 1 --total-executor-cores 1
# Check which worker the executor runs on and then kill the worker.
$ kill <worker pid>
```
With the procedure above, we would expect the executor to be re-scheduled on the other worker, but it is not.
The reason seems to be that `Master.schedule` is never called after the worker is marked as `WorkerState.DEAD`.
So, the solution this PR proposes is to call `Master.schedule` whenever `Master.removeWorker` is called (see the sketch after the log below).
This PR also fixes an issue where `ExecutorRunner` can send an `ExecutorStateChanged` message without changing its state.
This issue causes an assertion error:
```
2021-08-13 14:05:37,991 [dispatcher-event-loop-9] ERROR: Ignoring error
java.lang.AssertionError: assertion failed: executor 0 state transfer from RUNNING to RUNNING is illegal
```
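A toy model of the fix (names are illustrative, not Spark's internals): removing a worker must always be followed by a scheduling pass, otherwise executors that lived on the dead worker are never relaunched.

```scala
case class Worker(id: String, var alive: Boolean = true)

class ToyMaster(var workers: List[Worker]) {
  private def schedule(): Unit =
    println(s"scheduling executors on ${workers.count(_.alive)} live workers")

  def removeWorker(w: Worker): Unit = {
    w.alive = false
    workers = workers.filterNot(_ == w)
    schedule() // the essence of the fix: re-schedule after every removal
  }
}
```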
### Why are the changes needed?
It's a critical bug.
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
Manually tested with the procedure shown above and confirmed the executor is re-scheduled.
Closes #33818 from sarutak/fix-scheduling-stuck.
Authored-by: Kousuke Saruta <sarutak@oss.nttdata.com>
Signed-off-by: Kousuke Saruta <sarutak@oss.nttdata.com>
(cherry picked from commit ea8c31e5ea)
Signed-off-by: Kousuke Saruta <sarutak@oss.nttdata.com>
### What changes were proposed in this pull request?
This PR is a follow-up of https://github.com/apache/spark/pull/33646 to add missing tests.
### Why are the changes needed?
Some tests are missing.
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
Unittest
Closes #33776 from itholic/SPARK-36388-followup.
Authored-by: itholic <haejoon.lee@databricks.com>
Signed-off-by: Takuya UESHIN <ueshin@databricks.com>
(cherry picked from commit c91ae544fd)
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
### What changes were proposed in this pull request?
Change API doc for `UnivariateFeatureSelector`
### Why are the changes needed?
Make the doc look better.
### Does this PR introduce _any_ user-facing change?
Yes, API doc change.
### How was this patch tested?
Manually checked
Closes #33855 from huaxingao/ml_doc.
Authored-by: Huaxin Gao <huaxin_gao@apple.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
(cherry picked from commit 15e42b4442)
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
### What changes were proposed in this pull request?
This PR proposes to document `window` & `session_window` function in SQL API doc page.
Screenshot of functions:
> window
![Screenshot 2021-08-26 6:34:58 PM](https://user-images.githubusercontent.com/1317309/130939754-0ea1b55e-39d4-4205-b79d-a9508c98921c.png)
> session_window
![Screenshot 2021-08-26 6:35:19 PM](https://user-images.githubusercontent.com/1317309/130939773-b6cb4b98-88f8-4d57-a188-ee40ed7b2b08.png)
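For context, a hedged usage sketch of the two functions being documented (Spark 3.2 syntax):

```scala
// Tumbling window aggregation.
spark.sql("""
  SELECT window(ts, '10 minutes') AS w, count(*) AS cnt
  FROM VALUES (timestamp'2021-08-26 10:01:00'), (timestamp'2021-08-26 10:14:00') AS t(ts)
  GROUP BY window(ts, '10 minutes')
""").show(truncate = false)

// Session window with a 5-minute gap.
spark.sql("""
  SELECT session_window(ts, '5 minutes') AS w, count(*) AS cnt
  FROM VALUES (timestamp'2021-08-26 10:01:00'), (timestamp'2021-08-26 10:03:00') AS t(ts)
  GROUP BY session_window(ts, '5 minutes')
""").show(truncate = false)
```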
### Why are the changes needed?
Descriptions are missing for both the `window` and `session_window` functions on the SQL API page.
### Does this PR introduce _any_ user-facing change?
Yes, the descriptions of the `window` / `session_window` functions will be available on the SQL API page.
### How was this patch tested?
Only doc changes.
Closes #33846 from HeartSaVioR/SPARK-36595.
Authored-by: Jungtaek Lim <kabhwan.opensource@gmail.com>
Signed-off-by: Jungtaek Lim <kabhwan.opensource@gmail.com>
(cherry picked from commit bc32144a91)
Signed-off-by: Jungtaek Lim <kabhwan.opensource@gmail.com>
### What changes were proposed in this pull request?
This PR compares the 3.2.0 API doc with the latest release version, 3.1.2, and fixes the following issues:
- Add missing `Since` annotations for new APIs
- Remove leaking classes/objects from the API doc
### Why are the changes needed?
Improve API docs
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
Existing UT
Closes #33845 from gengliangwang/SPARK-36457-3.2.
Authored-by: Gengliang Wang <gengliang@apache.org>
Signed-off-by: Gengliang Wang <gengliang@apache.org>
### What changes were proposed in this pull request?
Improve the user guide document.
### Why are the changes needed?
Make the user guide clear.
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
Doc change only.
Closes #33854 from xuanyuanking/SPARK-35611-follow.
Authored-by: Yuanjian Li <yuanjian.li@databricks.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
(cherry picked from commit dd3f0fa8c2)
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
### What changes were proposed in this pull request?
This PR proposes to re-enable the tests that were disabled because of behavior differences in pandas 1.3.
- The `inplace` argument for `CategoricalDtype` functions is deprecated as of pandas 1.3 and seems buggy, so we manually create the expected results and test against them.
- Fixed `GroupBy.transform`, since it doesn't work properly for `CategoricalDtype`.
### Why are the changes needed?
We should enable the tests as much as possible even if pandas has a bug, and we should follow the behavior of the latest pandas.
### Does this PR introduce _any_ user-facing change?
Yes, `GroupBy.transform` now follows the behavior of the latest pandas.
### How was this patch tested?
Unittests.
Closes #33817 from itholic/SPARK-36537.
Authored-by: itholic <haejoon.lee@databricks.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
(cherry picked from commit fe486185c4)
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
### What changes were proposed in this pull request?
This PR proposes to fix the behavior of `astype` for `CategoricalDtype` to follow pandas 1.3.
**Before:**
```python
>>> pcat
0 a
1 b
2 c
dtype: category
Categories (3, object): ['a', 'b', 'c']
>>> pcat.astype(CategoricalDtype(["b", "c", "a"]))
0 a
1 b
2 c
dtype: category
Categories (3, object): ['b', 'c', 'a']
```
**After:**
```python
>>> pcat
0 a
1 b
2 c
dtype: category
Categories (3, object): ['a', 'b', 'c']
>>> pcat.astype(CategoricalDtype(["b", "c", "a"]))
0 a
1 b
2 c
dtype: category
Categories (3, object): ['a', 'b', 'c'] # CategoricalDtype is not updated if dtype is the same
```
In addition, a `CategoricalDtype` is now treated as the same `dtype` if the unique values are the same:
```python
>>> pcat1 = pser.astype(CategoricalDtype(["b", "c", "a"]))
>>> pcat2 = pser.astype(CategoricalDtype(["a", "b", "c"]))
>>> pcat1.dtype == pcat2.dtype
True
```
### Why are the changes needed?
We should follow the latest pandas as much as possible.
### Does this PR introduce _any_ user-facing change?
Yes, the behavior is changed as shown in the examples above.
### How was this patch tested?
Unittest
Closes #33757 from itholic/SPARK-36368.
Authored-by: itholic <haejoon.lee@databricks.com>
Signed-off-by: Takuya UESHIN <ueshin@databricks.com>
(cherry picked from commit f2e593bcf1)
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
### What changes were proposed in this pull request?
This PR proposes to fix `Series.astype` when converting the datetime type to StringDtype, to match the behavior of pandas 1.3.
In pandas < 1.3:
```python
>>> pd.Series(["2020-10-27 00:00:01", None], name="datetime").astype("string")
0 2020-10-27 00:00:01
1 NaT
Name: datetime, dtype: string
```
This is changed to
```python
>>> pd.Series(["2020-10-27 00:00:01", None], name="datetime").astype("string")
0 2020-10-27 00:00:01
1 <NA>
Name: datetime, dtype: string
```
in pandas >= 1.3, so we follow the behavior of the latest pandas.
### Why are the changes needed?
Because pandas-on-Spark always follows the behavior of the latest pandas.
### Does this PR introduce _any_ user-facing change?
Yes, the behavior is changed to match the latest pandas when converting datetime to nullable string (StringDtype).
### How was this patch tested?
Unittest passed.
Closes #33735 from itholic/SPARK-36387.
Authored-by: itholic <haejoon.lee@databricks.com>
Signed-off-by: Takuya UESHIN <ueshin@databricks.com>
(cherry picked from commit c0441bb7e8)
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
### What changes were proposed in this pull request?
This PR proposes to fix `RollingGroupBy` and `ExpandingGroupBy` to follow the latest pandas behavior.
As of pandas 1.3, `RollingGroupBy` and `ExpandingGroupBy` no longer return the grouped-by column in values.
Before:
```python
>>> df = pd.DataFrame({"A": [1, 1, 2, 3], "B": [0, 1, 2, 3]})
>>> df.groupby("A").rolling(2).sum()
A B
A
1 0 NaN NaN
1 2.0 1.0
2 2 NaN NaN
3 3 NaN NaN
```
After:
```python
>>> df = pd.DataFrame({"A": [1, 1, 2, 3], "B": [0, 1, 2, 3]})
>>> df.groupby("A").rolling(2).sum()
B
A
1 0 NaN
1 1.0
2 2 NaN
3 3 NaN
```
### Why are the changes needed?
We should follow the behavior of pandas as much as possible.
### Does this PR introduce _any_ user-facing change?
Yes, the results of `RollingGroupBy` and `ExpandingGroupBy` are changed as described above.
### How was this patch tested?
Unit tests.
Closes #33646 from itholic/SPARK-36388.
Authored-by: itholic <haejoon.lee@databricks.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
(cherry picked from commit b8508f4876)
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
### What changes were proposed in this pull request?
Disable the tests that fail because of the incompatible behavior of pandas 1.3.
### Why are the changes needed?
pandas 1.3 has been released.
There are some behavior changes that we should follow, but we are not ready yet.
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
Disabled some tests related to the behavior change.
Closes #33598 from ueshin/issues/SPARK-36367/disable_tests.
Authored-by: Takuya UESHIN <ueshin@databricks.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
(cherry picked from commit 8cb9cf39b6)
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
### What changes were proposed in this pull request?
This is a follow-up of https://github.com/apache/spark/pull/33837
It fixes a compilation error: https://github.com/apache/spark/runs/3431646840
### Why are the changes needed?
Fix a compilation error
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
Pass existing UTs
Closes #33851 from gengliangwang/fixCompile.
Authored-by: Gengliang Wang <gengliang@apache.org>
Signed-off-by: Gengliang Wang <gengliang@apache.org>
### What changes were proposed in this pull request?
This PR proposes to improve test coverage for the pandas-on-Spark DataFrame code base, which lives in `frame.py`.
It does the following to improve coverage:
- Add unittest for untested code
- Remove unused code
- Add arguments to some functions for testing
### Why are the changes needed?
To make the project healthier by improving coverage.
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
Unittest.
Closes #33833 from itholic/SPARK-36505.
Authored-by: itholic <haejoon.lee@databricks.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
(cherry picked from commit 97e7d6e667)
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
### What changes were proposed in this pull request?
This is the patch on branch-3.2 for https://github.com/apache/spark/pull/33842. See the description in the other PR.
### Why are the changes needed?
Avoid OOM/performance regressions when reading ORC tables with nested column types.
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
Added unit test in `OrcSourceSuite.scala`.
Closes #33843 from c21/branch-3.2.
Authored-by: Cheng Su <chengsu@fb.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request?
In this PR, I propose to use the session time zone (see the SQL config `spark.sql.session.timeZone`) instead of the JVM default time zone when converting special timestamp_ntz strings such as "today", "tomorrow", and so on.
### Why are the changes needed?
The current implementation is based on the system time zone, which contradicts other functions/classes that use the session time zone. For example, Spark doesn't respect the user's settings:
```sql
$ export TZ="Europe/Amsterdam"
$ ./bin/spark-sql -S
spark-sql> select timestamp_ntz'now';
2021-08-25 18:12:36.233
spark-sql> set spark.sql.session.timeZone=America/Los_Angeles;
spark.sql.session.timeZone America/Los_Angeles
spark-sql> select timestamp_ntz'now';
2021-08-25 18:14:40.547
```
### Does this PR introduce _any_ user-facing change?
Yes. For the example above, after the changes:
```sql
spark-sql> select timestamp_ntz'now';
2021-08-25 18:47:46.832
spark-sql> set spark.sql.session.timeZone=America/Los_Angeles;
spark.sql.session.timeZone America/Los_Angeles
spark-sql> select timestamp_ntz'now';
2021-08-25 09:48:05.211
```
### How was this patch tested?
By running the affected test suites:
```
$ build/sbt "test:testOnly *DateTimeUtilsSuite"
```
Closes #33838 from MaxGekk/fix-ts_ntz-special-values.
Authored-by: Max Gekk <max.gekk@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
(cherry picked from commit 159ff9fd14)
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request?
In the PR, I propose to update an existing item in the SQL migration guide, and mention that Spark 3.2 supports foldable special datetime values as well.
<img width="1292" alt="Screenshot 2021-08-25 at 23 29 51" src="https://user-images.githubusercontent.com/1580697/130860184-27f0ba56-6c2d-4a5a-91a8-195f2f8aa5da.png">
### Why are the changes needed?
To inform users about the actual Spark SQL behavior introduced by https://github.com/apache/spark/pull/33816
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
By generating docs, and checking results manually.
Closes #33840 from MaxGekk/special-datetime-cast-migr-guide.
Authored-by: Max Gekk <max.gekk@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
(cherry picked from commit c4e739fb4b)
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request?
This PR fixes a Python linter failure.
### Why are the changes needed?
```
flake8 checks failed:
./python/pyspark/ml/tests/test_tuning.py:21:1: F401 'numpy as np' imported but unused
import numpy as np
F401 'numpy as np' imported but unused
Error: Process completed with exit code 1.
```
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
Pass the GitHub Action Linter job.
Closes #33841 from dongjoon-hyun/unused_import.
Authored-by: Dongjoon Hyun <dongjoon@apache.org>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
### What changes were proposed in this pull request?
Spark 3.2.0 includes two new functions, `regexp` and `regexp_like`, which are identical to `rlike`. However, in the generated documentation, the `since` versions of both functions are `1.0.0`, since they are based on the expression `RLike`:
- https://dist.apache.org/repos/dist/dev/spark/v3.2.0-rc1-docs/_site/api/sql/index.html#regexp
- https://dist.apache.org/repos/dist/dev/spark/v3.2.0-rc1-docs/_site/api/sql/index.html#regexp_like
This PR is to:
* Support setting the `since` version in FunctionRegistry
* Correct the `since` version of `regexp` and `regexp_like` (see the check below)
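One way to check the recorded version (the `Since` line appears in the extended function description):

```scala
// After the fix, the output should report Since: 3.2.0 instead of 1.0.0.
spark.sql("DESCRIBE FUNCTION EXTENDED regexp_like").show(truncate = false)
```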
### Why are the changes needed?
Correct the SQL doc
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
Run
```
sh sql/create-docs.sh
```
and check the SQL doc manually
Closes #33834 from gengliangwang/allowSQLFunVersion.
Authored-by: Gengliang Wang <gengliang@apache.org>
Signed-off-by: Gengliang Wang <gengliang@apache.org>
(cherry picked from commit 18143fb426)
Signed-off-by: Gengliang Wang <gengliang@apache.org>
### What changes were proposed in this pull request?
This PR fixes an issue where there is no way to redact sensitive information in the Spark Thrift Server log.
For example, a JDBC password can be exposed in the log:
```
21/08/25 18:52:37 INFO SparkExecuteStatementOperation: Submitting query 'CREATE TABLE mytbl2(a int) OPTIONS(url="jdbc:mysql//example.com:3306", driver="com.mysql.jdbc.Driver", dbtable="test_tbl", user="test_usr", password="abcde")' with ca14ae38-1aaf-4bf4-a099-06b8e5337613
```
### Why are the changes needed?
Bug fix.
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
Ran the Thrift Server, connected to it, and executed `CREATE TABLE mytbl2(a int) OPTIONS(url="jdbc:mysql//example.com:3306", driver="com.mysql.jdbc.Driver", dbtable="test_tbl", user="test_usr", password="abcde");` with `spark.sql.redaction.string.regex=((?i)(?<=password=))(".*")|('.*')`.
Then confirmed the log:
```
21/08/25 18:54:11 INFO SparkExecuteStatementOperation: Submitting query 'CREATE TABLE mytbl2(a int) OPTIONS(url="jdbc:mysql//example.com:3306", driver="com.mysql.jdbc.Driver", dbtable="test_tbl", user="test_usr", password=*********(redacted))' with ffc627e2-b1a8-4d83-ab6d-d819b3ccd909
```
Closes #33832 from sarutak/fix-SPARK-36398.
Authored-by: Kousuke Saruta <sarutak@oss.nttdata.com>
Signed-off-by: Kousuke Saruta <sarutak@oss.nttdata.com>
(cherry picked from commit b2ff01608f)
Signed-off-by: Kousuke Saruta <sarutak@oss.nttdata.com>
### What changes were proposed in this pull request?
In this PR, I propose to add a new correctness rule, `SpecialDatetimeValues`, to the final analysis phase. It replaces casts of strings to date/timestamp_ltz/timestamp_ntz with literals of those types if the strings contain special datetime values like `today`, `yesterday` and `tomorrow`, and the input strings are foldable.
### Why are the changes needed?
1. To avoid a breaking change.
2. To improve user experience with Spark SQL. After https://github.com/apache/spark/pull/32714, users have to use typed literals instead of implicit casts. For instance, this query worked in Spark 3.1:
```sql
select ts_col > 'now';
```
but the query fails at the moment, and users have to use a typed timestamp literal:
```sql
select ts_col > timestamp'now';
```
### Does this PR introduce _any_ user-facing change?
No. The previous release, 3.1, already supported the feature until it was removed by https://github.com/apache/spark/pull/32714.
### How was this patch tested?
1. Manually tested via the sql command line:
```sql
spark-sql> select cast('today' as date);
2021-08-24
spark-sql> select timestamp('today');
2021-08-24 00:00:00
spark-sql> select timestamp'tomorrow' > 'today';
true
```
2. By running new test suite:
```
$ build/sbt "sql/testOnly org.apache.spark.sql.catalyst.optimizer.SpecialDatetimeValuesSuite"
```
Closes #33816 from MaxGekk/foldable-datetime-special-values.
Authored-by: Max Gekk <max.gekk@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
(cherry picked from commit df0ec56723)
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request?
This PR modifies `sql-ref-syntax-ddl-create-function.md` to mention `ARCHIVE` as an acceptable resource type for the `CREATE FUNCTION` statement.
`ARCHIVE` has been acceptable as of SPARK-35236 (#32359).
### Why are the changes needed?
To maintain the document.
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
`SKIP_API=1 bundle exec jekyll build`
![create-function-archive](https://user-images.githubusercontent.com/4736016/130630637-dcddfd8c-543b-4d21-997c-d2deaf917a4f.png)
Closes #33823 from sarutak/create-function-archive.
Authored-by: Kousuke Saruta <sarutak@oss.nttdata.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
(cherry picked from commit bd0a4950ae)
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
### What changes were proposed in this pull request?
This PR proposes to move distributed-sequence index implementation to SQL plan to leverage optimizations such as column pruning.
```python
import pyspark.pandas as ps
ps.set_option('compute.default_index_type', 'distributed-sequence')
ps.range(10).id.value_counts().to_frame().spark.explain()
```
**Before:**
```bash
== Physical Plan ==
AdaptiveSparkPlan isFinalPlan=false
+- Sort [count#51L DESC NULLS LAST], true, 0
+- Exchange rangepartitioning(count#51L DESC NULLS LAST, 200), ENSURE_REQUIREMENTS, [id=#70]
+- HashAggregate(keys=[id#37L], functions=[count(1)], output=[__index_level_0__#48L, count#51L])
+- Exchange hashpartitioning(id#37L, 200), ENSURE_REQUIREMENTS, [id=#67]
+- HashAggregate(keys=[id#37L], functions=[partial_count(1)], output=[id#37L, count#63L])
+- Project [id#37L]
+- Filter atleastnnonnulls(1, id#37L)
+- Scan ExistingRDD[__index_level_0__#36L,id#37L]
# ^^^ Base DataFrame created by the output RDD from zipWithIndex (and checkpointed)
```
**After:**
```bash
== Physical Plan ==
AdaptiveSparkPlan isFinalPlan=false
+- Sort [count#275L DESC NULLS LAST], true, 0
+- Exchange rangepartitioning(count#275L DESC NULLS LAST, 200), ENSURE_REQUIREMENTS, [id=#174]
+- HashAggregate(keys=[id#258L], functions=[count(1)])
+- HashAggregate(keys=[id#258L], functions=[partial_count(1)])
+- Filter atleastnnonnulls(1, id#258L)
+- Range (0, 10, step=1, splits=16)
# ^^^ Removed the Spark job execution for `zipWithIndex`
```
### Why are the changes needed?
To leverage the optimizations of the SQL engine and avoid the unnecessary shuffle otherwise needed to create the default index.
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
Unittests were added. Also, this PR runs all the unittests in pandas API on Spark after switching the default index implementation to `distributed-sequence`.
Closes #33807 from HyukjinKwon/SPARK-36559.
Authored-by: Hyukjin Kwon <gurwls223@apache.org>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
(cherry picked from commit 93cec49212)
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
### What changes were proposed in this pull request?
Revert 397b843890 and 5a48eb8d00
### Why are the changes needed?
As discussed in https://github.com/apache/spark/pull/33800#issuecomment-904140869, there is a correctness issue in the current implementation. Let's revert the code changes from branch-3.2 and fix it on the master branch later.
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
CI tests
Closes #33819 from gengliangwang/revert-SPARK-34415.
Authored-by: Gengliang Wang <gengliang@apache.org>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
(cherry picked from commit de932f51ce)
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
### What changes were proposed in this pull request?
This PR fixes `NullPointerException` in `LiveRDDDistribution.toApi`.
### Why are the changes needed?
Looking at the stack trace, the NPE is caused by a null `exec.hostPort`. I can't get the complete log to take a closer look, and can only guess that the `SparkListenerBlockManagerAdded` event was dropped or arrived out of order:
```
21/08/23 12:26:29 ERROR AsyncEventQueue: Listener AppStatusListener threw an exception
java.lang.NullPointerException
at com.google.common.base.Preconditions.checkNotNull(Preconditions.java:192)
at com.google.common.collect.MapMakerInternalMap.putIfAbsent(MapMakerInternalMap.java:3507)
at com.google.common.collect.Interners$WeakInterner.intern(Interners.java:85)
at org.apache.spark.status.LiveEntityHelpers$.weakIntern(LiveEntity.scala:696)
at org.apache.spark.status.LiveRDDDistribution.toApi(LiveEntity.scala:563)
at org.apache.spark.status.LiveRDD.$anonfun$doUpdate$4(LiveEntity.scala:629)
at scala.collection.TraversableLike.$anonfun$map$1(TraversableLike.scala:238)
at scala.collection.mutable.HashMap$$anon$2.$anonfun$foreach$3(HashMap.scala:158)
at scala.collection.mutable.HashTable.foreachEntry(HashTable.scala:237)
at scala.collection.mutable.HashTable.foreachEntry$(HashTable.scala:230)
at scala.collection.mutable.HashMap.foreachEntry(HashMap.scala:44)
at scala.collection.mutable.HashMap$$anon$2.foreach(HashMap.scala:158)
at scala.collection.TraversableLike.map(TraversableLike.scala:238)
at scala.collection.TraversableLike.map$(TraversableLike.scala:231)
at scala.collection.AbstractTraversable.map(Traversable.scala:108)
at org.apache.spark.status.LiveRDD.doUpdate(LiveEntity.scala:629)
at org.apache.spark.status.LiveEntity.write(LiveEntity.scala:51)
at org.apache.spark.status.AppStatusListener.update(AppStatusListener.scala:1206)
at org.apache.spark.status.AppStatusListener.maybeUpdate(AppStatusListener.scala:1212)
at org.apache.spark.status.AppStatusListener.$anonfun$onExecutorMetricsUpdate$6(AppStatusListener.scala:956)
...
```
### Does this PR introduce _any_ user-facing change?
Yes, users will see the expected RDD info in the UI instead of the NPE error.
### How was this patch tested?
Pass existing tests.
Closes #33812 from Ngone51/fix-hostport-npe.
Authored-by: yi.wu <yi.wu@databricks.com>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
(cherry picked from commit d6c453aaea)
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
### What changes were proposed in this pull request?
Add an `aggregate` package under `sql/catalyst/src/main/java/org/apache/spark/sql/connector/expressions` and move all the aggregates (e.g. `Count`, `Max`, `Min`, etc.) there, as shown below.
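After the move, the classes resolve from the new package, e.g.:

```scala
// DSV2 aggregate push-down classes under the new `aggregate` package.
import org.apache.spark.sql.connector.expressions.aggregate.{Count, Max, Min}
```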
### Why are the changes needed?
Right now these aggregates live directly under `sql/catalyst/src/main/java/org/apache/spark/sql/connector/expressions`. That looks OK now, but we plan to add a new `filter` package under `expressions` for all the DSV2 filters, and it would look strange for filters to have their own package while aggregates don't.
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
Existing tests
Closes #33815 from huaxingao/agg_package.
Authored-by: Huaxin Gao <huaxin_gao@apple.com>
Signed-off-by: Liang-Chi Hsieh <viirya@gmail.com>
(cherry picked from commit cd2342691d)
Signed-off-by: Liang-Chi Hsieh <viirya@gmail.com>
### What changes were proposed in this pull request?
As Jacek Laskowski pointed out on the dev list, there is a StackOverflowError when compiling Spark with the MAVEN_OPTS currently given in the Spark documentation.
We should update it with `-Xss64m` to avoid the error.
### Why are the changes needed?
Correct the documentation
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
Manual test. The MAVEN_OPTS is consistent with our GitHub Actions build.
Closes #33804 from gengliangwang/updateBuildDoc.
Authored-by: Gengliang Wang <gengliang@apache.org>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
(cherry picked from commit 3da0e9500f)
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
### What changes were proposed in this pull request?
A minor change to rename the config key from `spark.shuffle.server.mergedShuffleFileManagerImpl` to `spark.shuffle.push.server.mergedShuffleFileManagerImpl`, as shown below. This was missed in https://github.com/apache/spark/pull/33615.
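For example, on the server side the renamed key is set like this (the value is the merged-shuffle manager used for push-based shuffle; treat the exact class name as an assumption to verify against the docs):

```scala
import org.apache.spark.SparkConf

val conf = new SparkConf().set(
  "spark.shuffle.push.server.mergedShuffleFileManagerImpl",
  "org.apache.spark.network.shuffle.RemoteBlockPushResolver")
```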
### Why are the changes needed?
To keep the config names consistent
### Does this PR introduce _any_ user-facing change?
Yes, this changes a config key name, but the new config names have not been released yet, so technically there is no user-facing change.
### How was this patch tested?
Existing tests.
Closes #33799 from venkata91/SPARK-36374-follow-up.
Authored-by: Venkata krishnan Sowrirajan <vsowrirajan@linkedin.com>
Signed-off-by: Mridul Muralidharan <mridul<at>gmail.com>
(cherry picked from commit 7b2842e986)
Signed-off-by: Mridul Muralidharan <mridulatgmail.com>
### What changes were proposed in this pull request?
This patch fixes an error in the streaming deduplication documentation for Structured Streaming, and also updates an item about unsupported operations.
### Why are the changes needed?
Update the user document.
### Does this PR introduce _any_ user-facing change?
No. It's a doc only change.
### How was this patch tested?
Doc only change.
Closes #33801 from viirya/minor-ss-deduplication.
Authored-by: Liang-Chi Hsieh <viirya@gmail.com>
Signed-off-by: Liang-Chi Hsieh <viirya@gmail.com>
(cherry picked from commit 5876e04de2)
Signed-off-by: Liang-Chi Hsieh <viirya@gmail.com>
### What changes were proposed in this pull request?
For Hive tables, the actual write path and the schema handling are inconsistent when `spark.sql.legacy.charVarcharAsString` is true.
This causes problems like the one described in SPARK-36552.
In this PR, we respect `spark.sql.legacy.charVarcharAsString` when generating the Hive table schema from Spark data types.
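A minimal sketch of the behavior gated by the flag (hypothetical table name; requires a Hive-enabled SparkSession):

```scala
spark.conf.set("spark.sql.legacy.charVarcharAsString", "true")
spark.sql("CREATE TABLE t_char(c CHAR(5)) USING hive")

// With the fix, the generated Hive schema stores `c` as STRING,
// consistent with the write path under the legacy flag.
spark.sql("DESCRIBE TABLE t_char").show()
```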
### Why are the changes needed?
Bug fix.
### Does this PR introduce _any_ user-facing change?
Yes, when `spark.sql.legacy.charVarcharAsString` is true, Hive tables with char/varchar columns will follow string behavior.
### How was this patch tested?
Newly added test.
Closes #33798 from yaooqinn/SPARK-36552.
Authored-by: Kent Yao <yao@apache.org>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
(cherry picked from commit f918c123a0)
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
### What changes were proposed in this pull request?
* Improve the docs in `docs/job-scheduling.md`
* Add migration guide docs in `docs/core-migration-guide.md`
### Why are the changes needed?
Help users migrate.
### Does this PR introduce _any_ user-facing change?
Yes
### How was this patch tested?
Pass CI
Closes #33794 from ulysses-you/SPARK-35083-f.
Authored-by: ulysses-you <ulyssesyou18@gmail.com>
Signed-off-by: Kent Yao <yao@apache.org>
(cherry picked from commit 90cbf9ca3e)
Signed-off-by: Kent Yao <yao@apache.org>
AuthEngineSuite was passing on some platforms (macOS) but failing on others (Linux) with an InvalidKeyException stemming from this line. We should explicitly pass AES as the key format.
### What changes were proposed in this pull request?
Changes the AuthEngine SecretKeySpec from "RAW" to "AES".
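The change boils down to the key's declared algorithm (sketch; the key material below is a placeholder):

```scala
import javax.crypto.spec.SecretKeySpec

val keyBytes = new Array[Byte](16) // placeholder 128-bit key material

// Before: new SecretKeySpec(keyBytes, "RAW") was rejected by some JCE providers.
// After: declare the algorithm explicitly so Cipher.init accepts the key everywhere.
val key = new SecretKeySpec(keyBytes, "AES")
```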
### Why are the changes needed?
Unit tests were failing on some platforms with InvalidKeyExceptions when this key was used to instantiate a Cipher.
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
Unit tests on macOS and Linux platforms.
Closes #33790 from sweisdb/patch-1.
Authored-by: sweisdb <60895808+sweisdb@users.noreply.github.com>
Signed-off-by: Sean Owen <srowen@gmail.com>
(cherry picked from commit c441c7e365)
Signed-off-by: Sean Owen <srowen@gmail.com>
### What changes were proposed in this pull request?
After https://github.com/apache/spark/pull/32726, the Python doc build requires `sphinx-plotly-directive`.
This PR installs it in `spark-rm/Dockerfile` to make sure `do-release-docker.sh` can run successfully.
It also mentions the requirement in the README of docs.
### Why are the changes needed?
Fix the release script and update the README of docs.
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
Manual test locally.
Closes #33797 from gengliangwang/fixReleaseDocker.
Authored-by: Gengliang Wang <gengliang@apache.org>
Signed-off-by: Gengliang Wang <gengliang@apache.org>
(cherry picked from commit 42eebb84f5)
Signed-off-by: Gengliang Wang <gengliang@apache.org>
### What changes were proposed in this pull request?
When preparing Spark 3.2.0 RC1, I hit the same issue as https://github.com/apache/spark/pull/31031:
```
[INFO] Compiling 21 Scala sources and 3 Java sources to /opt/spark-rm/output/spark-3.1.0-bin-hadoop2.7/resource-managers/yarn/target/scala-2.12/test-classes ...
[ERROR] ## Exception when compiling 24 sources to /opt/spark-rm/output/spark-3.1.0-bin-hadoop2.7/resource-managers/yarn/target/scala-2.12/test-classes
java.lang.SecurityException: class "javax.servlet.SessionCookieConfig"'s signer information does not match signer information of other classes in the same package
java.lang.ClassLoader.checkCerts(ClassLoader.java:891)
java.lang.ClassLoader.preDefineClass(ClassLoader.java:661)
```
This PR applies the same fix again by downgrading scala-maven-plugin to 4.3.0.
### Why are the changes needed?
To unblock the release process.
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
Build test
Closes #33791 from gengliangwang/downgrade.
Authored-by: Gengliang Wang <gengliang@apache.org>
Signed-off-by: Gengliang Wang <gengliang@apache.org>
(cherry picked from commit f0775d215e)
Signed-off-by: Gengliang Wang <gengliang@apache.org>