Commit graph

8080 commits

Author SHA1 Message Date
yangjie01 01cf6f4c6b [SPARK-34309][BUILD][CORE][SQL][K8S] Use Caffeine instead of Guava Cache
### What changes were proposed in this pull request?
There are 3 ways Guava Cache is used in Spark code:

1. `LoadingCache` is the main way Guava Cache is used in Spark code, and the key usages are as follows:
  a. `LoadingCache` with the `maximumSize` data eviction policy, such as `appCache` in `ApplicationCache` and `cache` in `CodeGenerator`
  b. `LoadingCache` with the `maximumWeight` data eviction policy, such as `shuffleIndexCache` in `ExternalShuffleBlockResolver`
  c. `LoadingCache` with the `expireAfterWrite` data eviction policy, such as `tableRelationCache` in `SessionCatalog`
2. `ManualCache` is another way Guava Cache is used in Spark code; the key usage is `cache` in `SharedInMemoryCache`, which is used to cache partition file statuses in memory.

3. The last usage is `hadoopJobMetadata` in `SparkEnv`, which uses Guava Cache to build a soft-reference map.

The goal of this PR is to use `Caffeine` instead of `Guava Cache`, because benchmarks show `Caffeine` is faster than `Guava Cache`. The main changes are as follows:

1. Add `Caffeine` deps to maven `pom.xml`

2. Use `Caffeine` instead of Guava `LoadingCache`, `ManualCache` and soft-reference map in `SparkEnv`

3. Add `LocalCacheBenchmark` to compare the performance of `LoadingCache` between `Guava Cache` and `Caffeine`
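
For reference, a minimal Scala sketch of the migration for the size-bounded `LoadingCache` case (the cache name, key/value types, and loader are illustrative, not from this PR's diff):

```scala
import com.github.benmanes.caffeine.cache.{CacheLoader, Caffeine, LoadingCache}

// Hypothetical example of usage 1a above: size-based eviction plus a loader.
val cache: LoadingCache[String, String] = Caffeine.newBuilder()
  .maximumSize(1000)  // evict entries beyond 1000
  .build[String, String](new CacheLoader[String, String] {
    override def load(key: String): String = key.toUpperCase  // computed on miss
  })

cache.get("spark")  // loads and caches "SPARK"
```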

### Why are the changes needed?
Benchmarks show `Caffeine` is faster than `Guava Cache`.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?

- Pass the Jenkins or GitHub Action
- Add `LocalCacheBenchmark` to compare the performance of `LoadingCache` between `Guava Cache` and `Caffeine`

Closes #31517 from LuciferYang/guava-cache-to-caffeine.

Authored-by: yangjie01 <yangjie01@baidu.com>
Signed-off-by: Holden Karau <hkarau@netflix.com>
2021-08-04 12:01:44 -07:00
Dongjoon Hyun 28a2a2238f [SPARK-36354][CORE] EventLogFileReader should skip rolling event log directories with no logs
### What changes were proposed in this pull request?

This PR aims to skip rolling event log directories that have only an `appstatus` file.

### Why are the changes needed?

Currently, Spark History Server shows an `IllegalArgumentException` warning, but the event log might arrive later. The situation can also happen when the job is killed before uploading its first log to remote storage like S3.
```
21/07/30 07:38:26 WARN FsHistoryProvider:
Error while reading new log s3a://.../eventlog_v2_spark-95b5c736c8e44037afcf152534d08771
java.lang.IllegalArgumentException: requirement failed:
Log directory must contain at least one event log file!
...
at org.apache.spark.deploy.history.RollingEventLogFilesFileReader.files$lzycompute(EventLogFileReaders.scala:216)
```
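
A minimal sketch of the skip condition; the `events_`/`appstatus_` file-name prefixes are assumed from Spark's rolling event log layout, not copied from the patch:

```scala
// A rolling event log dir is only worth reading once it contains at least
// one "events_" file besides the "appstatus_" marker file.
def hasEventLogFiles(fileNames: Seq[String]): Boolean =
  fileNames.exists(_.startsWith("events_"))

// Directories where this is false are skipped instead of triggering
// "Log directory must contain at least one event log file!"
```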

### Does this PR introduce _any_ user-facing change?

Yes. Users will not see `IllegalArgumentException` warnings.

### How was this patch tested?

Pass the CIs with the newly added test case.

Closes #33586 from dongjoon-hyun/SPARK-36354.

Authored-by: Dongjoon Hyun <dhyun@apple.com>
Signed-off-by: Jungtaek Lim <kabhwan.opensource@gmail.com>
2021-08-04 20:26:06 +09:00
yi.wu 0b0f4dd186 [SPARK-36383][CORE] Avoid NullPointerException during executor shutdown
### What changes were proposed in this pull request?

Fix `NullPointerException` in `Executor.stop()`.

### Why are the changes needed?

Some initialization steps could fail before the initialization of `metricsPoller`, `heartbeater`, and `threadPool`, which leaves `metricsPoller`, `heartbeater`, and `threadPool` null. For example, I encountered a failure at:

c20af53580/core/src/main/scala/org/apache/spark/executor/Executor.scala (L137)

where the executor itself failed to register at the driver.

This PR eliminates the error messages when the issue happens, so as not to confuse users:

<details>
  <summary><mark><font color=darkred>[click to see the detailed error message]</font></mark></summary>
  <pre>
21/07/23 16:04:10 WARN Executor: Unable to stop executor metrics poller
java.lang.NullPointerException
        at org.apache.spark.executor.Executor.stop(Executor.scala:318)
        at org.apache.spark.executor.Executor.$anonfun$stopHookReference$1(Executor.scala:76)
        at org.apache.spark.util.SparkShutdownHook.run(ShutdownHookManager.scala:214)
        at org.apache.spark.util.SparkShutdownHookManager.$anonfun$runAll$2(ShutdownHookManager.scala:188)
        at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23)
        at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:2025)
        at org.apache.spark.util.SparkShutdownHookManager.$anonfun$runAll$1(ShutdownHookManager.scala:188)
        at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23)
        at scala.util.Try$.apply(Try.scala:213)
        at org.apache.spark.util.SparkShutdownHookManager.runAll(ShutdownHookManager.scala:188)
        at org.apache.spark.util.SparkShutdownHookManager$$anon$2.run(ShutdownHookManager.scala:178)
        at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
        at java.util.concurrent.FutureTask.run(FutureTask.java:266)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
        at java.lang.Thread.run(Thread.java:748)
21/07/23 16:04:10 WARN Executor: Unable to stop heartbeater
java.lang.NullPointerException
        at org.apache.spark.executor.Executor.stop(Executor.scala:324)
        at org.apache.spark.executor.Executor.$anonfun$stopHookReference$1(Executor.scala:76)
        at org.apache.spark.util.SparkShutdownHook.run(ShutdownHookManager.scala:214)
        at org.apache.spark.util.SparkShutdownHookManager.$anonfun$runAll$2(ShutdownHookManager.scala:188)
        at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23)
        at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:2025)
        at org.apache.spark.util.SparkShutdownHookManager.$anonfun$runAll$1(ShutdownHookManager.scala:188)
        at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23)
        at scala.util.Try$.apply(Try.scala:213)
        at org.apache.spark.util.SparkShutdownHookManager.runAll(ShutdownHookManager.scala:188)
        at org.apache.spark.util.SparkShutdownHookManager$$anon$2.run(ShutdownHookManager.scala:178)
        at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
        at java.util.concurrent.FutureTask.run(FutureTask.java:266)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
        at java.lang.Thread.run(Thread.java:748)
21/07/23 16:04:10 ERROR Utils: Uncaught exception in thread shutdown-hook-0
java.lang.NullPointerException
        at org.apache.spark.executor.Executor.$anonfun$stop$3(Executor.scala:334)
        at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23)
        at org.apache.spark.util.Utils$.withContextClassLoader(Utils.scala:231)
        at org.apache.spark.executor.Executor.stop(Executor.scala:334)
        at org.apache.spark.executor.Executor.$anonfun$stopHookReference$1(Executor.scala:76)
        at org.apache.spark.util.SparkShutdownHook.run(ShutdownHookManager.scala:214)
        at org.apache.spark.util.SparkShutdownHookManager.$anonfun$runAll$2(ShutdownHookManager.scala:188)
        at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23)
        at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:2025)
        at org.apache.spark.util.SparkShutdownHookManager.$anonfun$runAll$1(ShutdownHookManager.scala:188)
        at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23)
        at scala.util.Try$.apply(Try.scala:213)
        at org.apache.spark.util.SparkShutdownHookManager.runAll(ShutdownHookManager.scala:188)
        at org.apache.spark.util.SparkShutdownHookManager$$anon$2.run(ShutdownHookManager.scala:178)
        at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
        at java.util.concurrent.FutureTask.run(FutureTask.java:266)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
        at java.lang.Thread.run(Thread.java:748)
  </pre>
</details>
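
A hedged sketch of the guard pattern implied by the fix (simplified; the class and field below are stand-ins for the real `Executor` members):

```scala
import java.util.concurrent.ExecutorService

class ExecutorSketch {
  // May remain null when construction fails early, e.g. if the executor
  // fails to register with the driver; the same applies to metricsPoller
  // and heartbeater in the real class.
  var threadPool: ExecutorService = _

  def stop(): Unit = {
    // Guard each component: the shutdown hook can run against a
    // partially constructed executor.
    if (threadPool != null) threadPool.shutdownNow()
  }
}
```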

### Does this PR introduce _any_ user-facing change?

Yes, users won't see error messages of `NullPointerException` after this fix.

### How was this patch tested?

Pass existing tests.

Closes #33612 from Ngone51/avoid-npe-during-executor-shutdown.

Authored-by: yi.wu <yi.wu@databricks.com>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
2021-08-02 23:57:14 -07:00
Chandni Singh 2712343a27 [SPARK-36389][CORE][SHUFFLE] Revert the change that accepts negative mapId in ShuffleBlockId
### What changes were proposed in this pull request?
With SPARK-32922, we added a change so that ShuffleBlockId can have a negative mapId. This was to support push-based shuffle, where -1 as the mapId indicated a push-merged block. However, with SPARK-32923, a different type of BlockId was introduced (`ShuffleMergedBlockId`), but reverting the change to ShuffleBlockId was missed.

### Why are the changes needed?
This reverts the changes to `ShuffleBlockId`, which will never have a negative mapId.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
Modified the unit test to verify the newly added ShuffleMergedBlockId.

Closes #33616 from otterc/SPARK-36389.

Authored-by: Chandni Singh <singh.chandni@gmail.com>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
2021-08-02 23:35:32 -07:00
Karen Feng 63517eb430 [SPARK-36331][CORE] Add standard SQLSTATEs to error guidelines
### What changes were proposed in this pull request?

Adds ANSI/ISO SQLSTATE standards to the error guidelines.

### Why are the changes needed?

Provides visibility and consistency to the SQLSTATEs assigned to error classes.

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

Not needed; docs only

Closes #33560 from karenfeng/sqlstate-manual.

Authored-by: Karen Feng <karen.feng@databricks.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
2021-08-03 13:57:58 +09:00
Kousuke Saruta 366f7febaf [SPARK-36382][WEBUI] Remove noisy footer from the summary table for metrics
### What changes were proposed in this pull request?

This PR changed `StagePage` to remove a noisy footer from the summary table for metrics.

### Why are the changes needed?

In the WebUI, some tables are implemented using DataTables (https://datatables.net/).
By default, tables created using DataTables show a footer that says `Showing x to y of z entries`, which is helpful for tables whose entries can grow.
But the summary table for metrics in StagePage cannot grow, so the footer is a little bit noisy.
![summary_metrics_before](https://user-images.githubusercontent.com/4736016/127866960-d2fa23fc-7260-4b99-86cc-b31f85249632.png)

Actually, the ExecutorsPage has a similar summary table, and the footer is removed from that table.
![executors-no-footer](https://user-images.githubusercontent.com/4736016/127867104-23581e79-de70-49fa-aaef-ab241a8bfa0a.png)

### Does this PR introduce _any_ user-facing change?

Yes, appearance will be slightly changed but I don't think this change affects users.

### How was this patch tested?

I confirmed that the footer is removed from the table.
![summary_metrics_after](https://user-images.githubusercontent.com/4736016/127867320-097a6f52-7aa8-4fec-9d50-b982165baef7.png)

Closes #33611 from sarutak/remove-unnecessary-table-footer.

Authored-by: Kousuke Saruta <sarutak@oss.nttdata.com>
Signed-off-by: Gengliang Wang <gengliang@apache.org>
2021-08-02 23:40:07 +08:00
yi.wu a98d919da4 [SPARK-36206][CORE] Support shuffle data corruption diagnosis via shuffle checksum
### What changes were proposed in this pull request?

This PR adds support to diagnose shuffle data corruption. Basically, the diagnosis mechanism works like this:
The shuffle reader calculates the checksum (c1) for the corrupted shuffle block and sends it to the server where the block is stored. At the server, it reads back the checksum (c2) that is stored in the checksum file and recalculates the checksum (c3) for the corresponding shuffle block. Then, if c2 != c3, we suspect the corruption is caused by a disk issue. Otherwise, if c1 != c3, we suspect the corruption is caused by a network issue. Otherwise, if both checksum verifications pass, the cause of the corruption remains unknown.

After the shuffle reader receives the diagnosis response, it takes action based on the type of cause. Only in the case of a network issue do we retry the fetch; otherwise, we throw the fetch failure directly. Also note that if the corruption happens inside BufferReleasingInputStream, the reducer throws the fetch failure immediately no matter what the cause is, since the data has been partially consumed by downstream RDDs. If corruption happens again after the retry, the reducer throws the fetch failure directly this time without the diagnosis.

Please check out https://github.com/apache/spark/pull/32385 to see the completed proposal of the shuffle checksum project.
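
A small Scala sketch of the decision table above (the `Cause` names are illustrative; Spark's actual types may differ):

```scala
// c1 = reader's checksum, c2 = checksum stored in the checksum file,
// c3 = checksum recalculated from the stored shuffle block.
sealed trait Cause
case object DiskIssue extends Cause     // stored data drifted from its checksum
case object NetworkIssue extends Cause  // data changed in transit
case object Unknown extends Cause       // all checksums agree; cause unknown

def diagnose(c1: Long, c2: Long, c3: Long): Cause =
  if (c2 != c3) DiskIssue
  else if (c1 != c3) NetworkIssue
  else Unknown
```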

### Why are the changes needed?

Shuffle data corruption is a long-standing issue in Spark. For example, in SPARK-18105, people continually report corruption issues. However, data corruption is difficult to reproduce in most cases and even harder to root-cause; we don't know if it's a Spark issue or not. With diagnosis support for shuffle corruption, Spark itself can at least distinguish between disk and network causes, which is very important for users.

### Does this PR introduce _any_ user-facing change?

Yes, users may know the cause of the shuffle corruption after this change.

### How was this patch tested?

Added tests.

Closes #33451 from Ngone51/SPARK-36206.

Authored-by: yi.wu <yi.wu@databricks.com>
Signed-off-by: Mridul Muralidharan <mridul<at>gmail.com>
2021-08-02 09:58:36 -05:00
Angerszhuuuu 951efb8085 [SPARK-36237][UI][SQL] Attach and start handler after application started in UI
### What changes were proposed in this pull request?
When using Prometheus to fetch metrics at a defined interval, we always pull data through the REST API.
If the pull happens after the driver's SparkUI port is bound but before the application is fully started, the Spark driver throws a lot of `NoSuchElementException`s, as below.
```
21/07/19 04:53:37 INFO Client: Preparing resources for our AM container
21/07/19 04:53:37 INFO Client: Uploading resource hdfs://tl3/packages/jars/spark-2.4-archive.tar.gz -> hdfs://R2/user/xiaoke.zhou/.sparkStaging/application_1624456325569_7143920/spark-2.4-archive.tar.gz
21/07/19 04:53:37 WARN JettyUtils: GET /jobs/ failed: java.util.NoSuchElementException: Failed to get the application information. If you are starting up Spark, please wait a while until it's ready.
java.util.NoSuchElementException: Failed to get the application information. If you are starting up Spark, please wait a while until it's ready.
	at org.apache.spark.status.AppStatusStore.applicationInfo(AppStatusStore.scala:43)
	at org.apache.spark.ui.jobs.AllJobsPage.render(AllJobsPage.scala:275)
	at org.apache.spark.ui.WebUI$$anonfun$2.apply(WebUI.scala:90)
	at org.apache.spark.ui.WebUI$$anonfun$2.apply(WebUI.scala:90)
	at org.apache.spark.ui.JettyUtils$$anon$3.doGet(JettyUtils.scala:90)
	at javax.servlet.http.HttpServlet.service(HttpServlet.java:687)
	at javax.servlet.http.HttpServlet.service(HttpServlet.java:790)
	at org.spark_project.jetty.servlet.ServletHolder.handle(ServletHolder.java:848)
	at org.spark_project.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:585)
	at org.spark_project.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1180)
	at org.spark_project.jetty.servlet.ServletHandler.doScope(ServletHandler.java:513)
	at org.spark_project.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1112)
	at org.spark_project.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:141)
	at org.spark_project.jetty.server.handler.gzip.GzipHandler.handle(GzipHandler.java:493)
	at org.spark_project.jetty.server.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:213)
	at org.spark_project.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:134)
	at org.spark_project.jetty.server.Server.handle(Server.java:539)
	at org.spark_project.jetty.server.HttpChannel.handle(HttpChannel.java:333)
	at org.spark_project.jetty.server.HttpConnection.onFillable(HttpConnection.java:251)
	at org.spark_project.jetty.io.AbstractConnection$ReadCallback.succeeded(AbstractConnection.java:283)
	at org.spark_project.jetty.io.FillInterest.fillable(FillInterest.java:108)
	at org.spark_project.jetty.io.SelectChannelEndPoint$2.run(SelectChannelEndPoint.java:93)
	at org.spark_project.jetty.util.thread.strategy.ExecuteProduceConsume.executeProduceConsume(ExecuteProduceConsume.java:303)
	at org.spark_project.jetty.util.thread.strategy.ExecuteProduceConsume.produceConsume(ExecuteProduceConsume.java:148)
	at org.spark_project.jetty.util.thread.strategy.ExecuteProduceConsume.run(ExecuteProduceConsume.java:136)
	at org.spark_project.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:671)
	at org.spark_project.jetty.util.thread.QueuedThreadPool$2.run(QueuedThreadPool.java:589)
	at java.lang.Thread.run(Thread.java:748)
```

Checking the original PR: we need to start the server and bind the port before the taskScheduler starts in client mode, since we need to pass the web URL when registering the application master. But if we attach and start the handlers at that time, the REST API is exposed to users while the application is not yet started, so it always returns this error.

In this PR, to start the SparkUI, Spark starts the Jetty server first to bind the address.
After the Spark application is fully started, it calls `attachAllHandlers` to attach and start all existing handlers on the Jetty server.
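
A self-contained sketch of the two-phase startup (all names below are illustrative stand-ins, not Spark's actual API):

```scala
object TwoPhaseUiStart {
  final class JettyStub {
    @volatile private var handlersAttached = false
    def bind(): Unit = ()  // phase 1: bind the port so the web URL exists early
    def attachAllHandlers(): Unit = handlersAttached = true  // phase 2
    def serve(path: String): String =
      if (handlersAttached) s"200 OK for $path"
      else "Spark is starting up. Please wait a while until it's ready."
  }

  def main(args: Array[String]): Unit = {
    val ui = new JettyStub
    ui.bind()                    // bound before the taskScheduler starts
    println(ui.serve("/jobs/"))  // placeholder, not NoSuchElementException
    // ... application fully starts here ...
    ui.attachAllHandlers()
    println(ui.serve("/jobs/"))  // real page
  }
}
```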

### Why are the changes needed?
Improves the SparkUI startup logic.

### Does this PR introduce _any_ user-facing change?
Before the Spark application is fully started, all URL requests will return
```
Spark is starting up. Please wait a while until it's ready.
```
in the page

### How was this patch tested?
Existing tests.

Between binding the address and the Spark application fully starting, all requests show:
![image](https://user-images.githubusercontent.com/46485123/127124316-0ec637c5-eeab-4e5e-973b-8fec4f928a3c.png)

Closes #33457 from AngersZhuuuu/SPARK-36237.

Authored-by: Angerszhuuuu <angers.zhu@gmail.com>
Signed-off-by: Gengliang Wang <gengliang@apache.org>
2021-08-02 19:36:20 +08:00
Venkata krishnan Sowrirajan c039d99812 [SPARK-32923][CORE][SHUFFLE] Handle indeterminate stage retries for push-based shuffle
### What changes were proposed in this pull request?
[[SPARK-23243](https://issues.apache.org/jira/browse/SPARK-23243)] and [[SPARK-25341](https://issues.apache.org/jira/browse/SPARK-25341)] addressed cases of stage retries for indeterminate stages involving operations like repartition. This PR addresses the same issues in the context of push-based shuffle. Currently, there is no way to distinguish the current execution of a stage for a shuffle ID; therefore the changes explained below are necessary.

Core changes are summarized as follows:

1. Introduce a new variable `shuffleMergeId` in `ShuffleDependency`, a monotonically increasing value tracking the temporal ordering of executions of <stage-id, stage-attempt-id> for a shuffle ID.
2. Correspondingly, make changes in the push-based shuffle protocol layer in `MergedShuffleFileManager` and `BlockStoreClient`, passing the `shuffleMergeId` in order to keep track of the shuffle output in separate files on the shuffle service side.
3. `DAGScheduler` increments the `shuffleMergeId` tracked in `ShuffleDependency` in the case of an indeterminate stage execution.
4. Deterministic stages will have `shuffleMergeId` set to 0, as no special handling is needed in that case; indeterminate stages will have `shuffleMergeId` starting from 1 (see the sketch after this list).
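
A hedged sketch of the `shuffleMergeId` bookkeeping described in this list (simplified; not the actual diff):

```scala
class ShuffleDepSketch(val isIndeterminate: Boolean) {
  // 0 for deterministic stages; the scheduler bumps it before each
  // (re-)execution of an indeterminate stage, so those see ids >= 1.
  private var shuffleMergeId = 0

  // Bumping gives the shuffle service a fresh id, so the new attempt's
  // merged output lands in separate files.
  def newShuffleMergeState(): Unit =
    if (isIndeterminate) shuffleMergeId += 1

  def currentShuffleMergeId: Int = shuffleMergeId
}
```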

### Why are the changes needed?

New protocol changes are needed due to the reasons explained above.

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?
Added new unit tests in `RemoteBlockPushResolverSuite, DAGSchedulerSuite, BlockIdSuite, ErrorHandlerSuite`

Closes #33034 from venkata91/SPARK-32923.

Authored-by: Venkata krishnan Sowrirajan <vsowrirajan@linkedin.com>
Signed-off-by: Mridul Muralidharan <mridul<at>gmail.com>
2021-08-01 23:16:33 -05:00
Venkata krishnan Sowrirajan 2a18f82940 [SPARK-32919][FOLLOW-UP] Filter out driver in the merger locations and fix the return type of RemoveShufflePushMergerLocations
### What changes were proposed in this pull request?

SPARK-32919 added support for fetching shuffle push merger locations with push-based shuffle. Filter out the driver host from the shuffle push merger locations, as the driver won't participate in the shuffle merge; also fix a ClassCastException in RemoveShufflePushMergerLocations.
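
A minimal sketch of the filtering, with illustrative names:

```scala
// Exclude the driver host before picking merger locations.
def selectMergerLocations(hosts: Seq[String], driverHost: String,
                          numNeeded: Int): Seq[String] =
  hosts.filterNot(_ == driverHost).take(numNeeded)
```
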
### Why are the changes needed?

The driver should never be selected as a merger location, and the wrong return type of RemoveShufflePushMergerLocations causes a ClassCastException.

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

Added unit tests.

Closes #33425 from venkata91/SPARK-32919-follow-up.

Authored-by: Venkata krishnan Sowrirajan <vsowrirajan@linkedin.com>
Signed-off-by: Mridul Muralidharan <mridul<at>gmail.com>
2021-08-01 13:11:34 -05:00
Sean Owen 72615bc551 [SPARK-36362][CORE][SQL][TESTS] Omnibus Java code static analyzer warning fixes
### What changes were proposed in this pull request?

Fix up some minor Java issues:

- Some int*int multiplications assigned to long that could overflow before widening (see the sketch after this list)
- Unnecessarily non-static inner classes
- Some tests "catch (AssertionError)" and do nothing
- Manual array iteration vs very slightly faster/simpler foreach
- Incorrect generic types that just happen to not cause a runtime error
- Missed opportunities for try-close
- Mutable enums
- .. and a few other minor things
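
An illustration of the first bullet (values invented): the multiplication happens in `Int` and overflows before the result is widened to `Long`.

```scala
val bytesPerBlock = 1 << 20                         // 1 MiB
val numBlocks = 4096
val wrong: Long = bytesPerBlock * numBlocks         // Int math: 2^32 wraps to 0
val right: Long = bytesPerBlock.toLong * numBlocks  // widen first: 4294967296
```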

### Why are the changes needed?

Some are minor but clear fixes; some may have a marginal perf impact or avoid a bug later. Also: maybe avoid future PRs to address these one by one.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Existing tests

Closes #33594 from srowen/SPARK-36362.

Authored-by: Sean Owen <srowen@gmail.com>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
2021-07-31 22:35:57 -07:00
zhuqi-lucas 900b38d5fa [SPARK-36344][CORE][SHUFFLE] Fix some typos in ShuffleBlockPusher class
### What changes were proposed in this pull request?
Just to fix some typos in the ShuffleBlockPusher class.

### Why are the changes needed?
Fixes the typos to make the code clearer.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
No tests needed.

Closes #33575 from zhuqi-lucas/master.

Authored-by: zhuqi-lucas <821684824@qq.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
2021-07-30 09:11:04 +09:00
Angerszhuuuu 4d11b0de8a [SPARK-36341][SQL] Aggregated Metrics by Executor link should wrap <a> with <h4>
### What changes were proposed in this pull request?
In the current stage page, when we move the mouse over the title of `Aggregated Metrics by Executor`, the underline is blocked.
![image](https://user-images.githubusercontent.com/46485123/127449431-8897e648-c7cd-4bff-adb1-f43347c381b1.png)

For completed jobs, the code is
![image](https://user-images.githubusercontent.com/46485123/127449708-d6a7ccea-a1b5-4251-9648-38c5107457f8.png)

After this PR
![image](https://user-images.githubusercontent.com/46485123/127449465-6631076e-2eb4-41eb-a5ef-6c3c8be08d87.png)

### Why are the changes needed?

### Does this PR introduce _any_ user-facing change?

### How was this patch tested?

Closes #33571 from AngersZhuuuu/SPARK-36341.

Authored-by: Angerszhuuuu <angers.zhu@gmail.com>
Signed-off-by: Sean Owen <srowen@gmail.com>
2021-07-29 18:52:56 -05:00
dgd-contributor af6d04b65c [SPARK-36095][CORE] Grouping exception in core/rdd
### What changes were proposed in this pull request?
This PR groups exception messages in core/src/main/scala/org/apache/spark/rdd

### Why are the changes needed?
It will largely help with standardization of error messages and its maintenance.

### Does this PR introduce _any_ user-facing change?
No. Error messages remain unchanged.

### How was this patch tested?
No new tests - pass all original tests to make sure it doesn't break any existing behavior.

Closes #33317 from dgd-contributor/SPARK-36095_GroupExceptionCoreRdd.

Lead-authored-by: dgd-contributor <dgd_contributor@viettel.com.vn>
Co-authored-by: Wenchen Fan <cloud0fan@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2021-07-28 22:01:26 +08:00
Min Shen c4aa54ed4e [SPARK-36266][SHUFFLE] Rename classes in shuffle RPC used for block push operations
### What changes were proposed in this pull request?
This is a follow-up to #29855 according to the [comments](https://github.com/apache/spark/pull/29855/files#r505536514)
In this PR, the following changes are made:

1. A new `BlockPushingListener` interface is created specifically for block push. The existing `BlockFetchingListener` interface is left as is, since it might be used by external shuffle solutions. These 2 interfaces are unified under `BlockTransferListener` to enable code reuse (see the sketch after this list).
2. `RetryingBlockFetcher`, `BlockFetchStarter`, and `RetryingBlockFetchListener` are renamed to `RetryingBlockTransferor`, `BlockTransferStarter`, and `RetryingBlockTransferListener` respectively. This makes their names more generic to be reused across both block fetch and push.
3. Comments in `OneForOneBlockPusher` are further clarified to better explain how we handle retries for block push.
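
A hedged sketch of the hierarchy described in (1), with method names simplified:

```scala
trait BlockTransferListener {
  def onBlockTransferSuccess(blockId: String): Unit
  def onBlockTransferFailure(blockId: String, e: Throwable): Unit
}
trait BlockFetchingListener extends BlockTransferListener  // existing, unchanged
trait BlockPushingListener extends BlockTransferListener   // new, block push only
```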

### Why are the changes needed?
To make code cleaner without sacrificing backward compatibility.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
Existing unit tests.

Closes #33340 from Victsm/SPARK-32915-followup.

Lead-authored-by: Min Shen <mshen@linkedin.com>
Co-authored-by: Min Shen <victor.nju@gmail.com>
Signed-off-by: Mridul Muralidharan <mridul<at>gmail.com>
2021-07-26 17:39:19 -05:00
Hyukjin Kwon 6e3d404cec [SPARK-36217][SQL] Rename CustomShuffleReader and OptimizeLocalShuffleReader in AQE
### What changes were proposed in this pull request?

This PR proposes to rename:

- Rename `*Reader`/`*reader` to `*Read`/`*read` for rules and execution plan (user-facing doc/config name remain untouched)
  - `*ShuffleReaderExec` ->`*ShuffleReadExec`
  - `isLocalReader` -> `isLocalRead`
  - ...
- Rename `CustomShuffle*` prefix to `AQEShuffle*`
- Rename `OptimizeLocalShuffleReader` rule to `OptimizeShuffleWithLocalRead`

### Why are the changes needed?

There are multiple problems in the current naming:

- `CustomShuffle*` -> `AQEShuffle*`
    it sounds like it is a pluggable API. However, this is actually only used by AQE.
- `OptimizeLocalShuffleReader` -> `OptimizeShuffleWithLocalRead`
    it is the name of a rule but it can be misread as a reader, which is counterintuitive
- `*ReaderExec` -> `*ReadExec`
    Reader execution reads a bit odd. It should better be read execution (like `ScanExec`, `ProjectExec` and `FilterExec`). I can't find the reason to name it with something that performs an action. See also the generated plans:

    Before:

    ```
    ...
    * HashAggregate (12)
       +- CustomShuffleReader (11)
          +- ShuffleQueryStage (10)
             +- Exchange (9)
    ...
    ```

    After:

    ```
    ...
    * HashAggregate (12)
       +- AQEShuffleRead (11)
          +- ShuffleQueryStage (10)
             +- Exchange (9)
    ..
    ```

### Does this PR introduce _any_ user-facing change?

No, internal refactoring.

### How was this patch tested?

Existing unittests should cover the changes.

Closes #33429 from HyukjinKwon/SPARK-36217.

Authored-by: Hyukjin Kwon <gurwls223@apache.org>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2021-07-26 22:41:54 +08:00
Venkata krishnan Sowrirajan ba1a7ce5ec [SPARK-32920][FOLLOW-UP] Fix shuffleMergeFinalized directly calling rdd.getNumPartitions as RDD is not serialized to executor
### What changes were proposed in this pull request?

`ShuffleMapTask` should not push blocks if a shuffle is already merge-finalized. Currently, block push is disabled for retry cases. Also fix `shuffleMergeFinalized` calling `rdd.getNumPartitions`, as the RDD is not serialized to the executor, causing issues.

### Why are the changes needed?

Block push should not happen after a shuffle is merge-finalized, and calling `rdd.getNumPartitions` in `shuffleMergeFinalized` causes issues because the RDD is not serialized to the executor.

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

Existing tests

Closes #33426 from venkata91/SPARK-32920-follow-up.

Authored-by: Venkata krishnan Sowrirajan <vsowrirajan@linkedin.com>
Signed-off-by: Mridul Muralidharan <mridul<at>gmail.com>
2021-07-26 09:17:04 -05:00
yi.wu 21450b3254 [SPARK-32920][FOLLOW-UP][CORE] Shutdown shuffleMergeFinalizeScheduler when DAGScheduler stop
### What changes were proposed in this pull request?

Call `shuffleMergeFinalizeScheduler.shutdownNow()` in `DAGScheduler.stop()`.

### Why are the changes needed?

Avoid the thread leak.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Pass existing tests.

Closes #33495 from Ngone51/SPARK-32920-followup.

Authored-by: yi.wu <yi.wu@databricks.com>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
2021-07-24 17:40:47 -07:00
Chandni Singh 09e1c61272 [SPARK-36255][SHUFFLE][CORE] Stop pushing and retrying on FileNotFound exceptions
### What changes were proposed in this pull request?
Once the shuffle is cleaned up by the `ContextCleaner`, the shuffle files are deleted by the executors. In this case, the push of the shuffle data by the executors can throw `FileNotFoundException`s because the shuffle files are deleted. When this exception is thrown from the `shuffle-block-push-thread`, it causes the executor to exit. Both the `shuffle-block-push` threads and the netty event-loops will encounter `FileNotFoundException`s in this case.  The fix here stops these threads from pushing more blocks when they encounter `FileNotFoundException`. When the exception is from the `shuffle-block-push-thread`, it will get handled and logged as warning instead of failing the executor.
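
A hedged sketch of the handling described above (structure and names illustrative):

```scala
import java.io.FileNotFoundException
import java.util.concurrent.atomic.AtomicBoolean

val stopPushing = new AtomicBoolean(false)

def pushBlock(doPush: () => Unit): Unit =
  if (!stopPushing.get()) {
    try doPush() catch {
      case _: FileNotFoundException =>
        // The shuffle files were already deleted by cleanup: stop pushing the
        // remaining blocks and warn, instead of letting the thread kill the JVM.
        stopPushing.set(true)
    }
  }
```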

### Why are the changes needed?
This fixes a bug that causes executors to exit when they are instructed to clean up shuffle data.
Below is the stacktrace of this exception:
```
21/06/17 16:03:57 ERROR util.SparkUncaughtExceptionHandler: Uncaught exception in thread Thread[block-push-thread-1,5,main]
java.lang.Error: java.io.IOException: Error in opening FileSegmentManagedBuffer

{file=********/application_1619720975011_11057757/blockmgr-560cb4cf-9918-4ea7-a007-a16c5e3a35fe/0a/shuffle_1_690_0.data, offset=10640, length=190}
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1155)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
Caused by: java.io.IOException: Error in opening FileSegmentManagedBuffer\{file=*******/application_1619720975011_11057757/blockmgr-560cb4cf-9918-4ea7-a007-a16c5e3a35fe/0a/shuffle_1_690_0.data, offset=10640, length=190}

at org.apache.spark.network.buffer.FileSegmentManagedBuffer.nioByteBuffer(FileSegmentManagedBuffer.java:89)
at org.apache.spark.shuffle.ShuffleWriter.sliceReqBufferIntoBlockBuffers(ShuffleWriter.scala:294)
at org.apache.spark.shuffle.ShuffleWriter.org$apache$spark$shuffle$ShuffleWriter$$sendRequest(ShuffleWriter.scala:270)
at org.apache.spark.shuffle.ShuffleWriter.org$apache$spark$shuffle$ShuffleWriter$$pushUpToMax(ShuffleWriter.scala:191)
at org.apache.spark.shuffle.ShuffleWriter$$anon$2$$anon$4.run(ShuffleWriter.scala:244)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
... 2 more
Caused by: java.io.FileNotFoundException: ******/application_1619720975011_11057757/blockmgr-560cb4cf-9918-4ea7-a007-a16c5e3a35fe/0a/shuffle_1_690_0.data (No such file or directory)
at java.io.RandomAccessFile.open0(Native Method)
at java.io.RandomAccessFile.open(RandomAccessFile.java:316)
at java.io.RandomAccessFile.<init>(RandomAccessFile.java:243)
at org.apache.spark.network.buffer.FileSegmentManagedBuffer.nioByteBuffer(FileSegmentManagedBuffer.java:62)
```

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
Added a unit test to verify that no more data is pushed when `FileNotFoundException` is encountered. Also verified in our environment.

Closes #33477 from otterc/SPARK-36255.

Authored-by: Chandni Singh <singh.chandni@gmail.com>
Signed-off-by: yi.wu <yi.wu@databricks.com>
2021-07-24 21:09:11 +08:00
yangjie01 f61d5993ea [SPARK-36242][CORE] Ensure spill file closed before set success = true in ExternalSorter.spillMemoryIteratorToDisk method
### What changes were proposed in this pull request?
The main change of this PR is to move `writer.close()` before `success = true`, ensuring the spill file is closed before `success` is set to `true` in the `ExternalSorter.spillMemoryIteratorToDisk` method.
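
A self-contained sketch of the ordering fix (the helpers below are stubs, not Spark code):

```scala
import java.io.{File, FileWriter}

def spillToDisk(file: File, rows: Seq[String]): Unit = {
  var success = false
  val writer = new FileWriter(file)
  try {
    rows.foreach(r => writer.write(r + "\n"))
    writer.close()  // close BEFORE setting success, so a failing close
    success = true  // can never leave a half-written "successful" spill
  } finally {
    if (!success) {
      try writer.close() catch { case _: Throwable => () }
      file.delete()  // the spill file must not survive a failed close
    }
  }
}
```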

### Why are the changes needed?
Avoids setting `success = true` first and then failing to close the spill file, which would wrongly leave a successful-looking spill behind.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?

- Pass the Jenkins or GitHub Action
- Add a new test case to check `The spill file should not exists if writer close fails`

Closes #33460 from LuciferYang/external-sorter-spill-close.

Authored-by: yangjie01 <yangjie01@baidu.com>
Signed-off-by: yi.wu <yi.wu@databricks.com>
2021-07-23 23:15:13 +08:00
Holden Karau 89a83196ac [SPARK-36246][CORE][TEST] GHA WorkerDecommissionExtended flake
### What changes were proposed in this pull request?

GHA probably doesn't have the same resources as Jenkins, so move down from 5 to 3 executors and give them a bit more time to come up.

### Why are the changes needed?

Test is timing out in GHA

### Does this PR introduce _any_ user-facing change?
No, test only change.

### How was this patch tested?

Run through GHA verify no OOM during WorkerDecommissionExtended

Closes #33467 from holdenk/SPARK-36246-WorkerDecommissionExtendedSuite-flakes-in-GHA.

Lead-authored-by: Holden Karau <holden@pigscanfly.ca>
Co-authored-by: Holden Karau <hkarau@netflix.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
2021-07-22 15:17:48 +09:00
Jie 1a8c6755a1 [SPARK-35027][CORE] Close the inputStream in FileAppender when writin…
### What changes were proposed in this pull request?

1. add "closeStreams" to FileAppender and RollingFileAppender
2. set "closeStreams" to "true" in ExecutorRunner

### Why are the changes needed?

The executor will hang due to disk full or other exceptions that happen while writing to the outputStream; the root cause is that the inputStream is not closed after the error happens:
1. ExecutorRunner creates two file appenders for the pipe: one for stdout, one for stderr
2. FileAppender.appendStreamToFile exits the loop when writing to the outputStream fails
3. FileAppender closes the outputStream, but leaves the inputStream, which refers to the pipe's stdout and stderr, open
4. The executor will hang when printing a log message if the pipe is full (no one consumes the output)
5. From the driver side, you can see the task never completes

With this fix, step 4 will throw an exception; the driver can catch the exception and reschedule the failed task to other executors.
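
A hedged sketch of the `closeStreams` flag (the parameter name is from this PR; the copy loop is simplified):

```scala
import java.io.{InputStream, OutputStream}

def appendStreamToFile(in: InputStream, out: OutputStream,
                       closeStreams: Boolean): Unit = {
  try {
    val buf = new Array[Byte](8192)
    var n = in.read(buf)
    while (n != -1) { out.write(buf, 0, n); n = in.read(buf) }
  } finally {
    out.close()
    // New behavior: also close the pipe's read end, so a process writing
    // into a full pipe fails fast instead of hanging forever.
    if (closeStreams) in.close()
  }
}
```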

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

Add new tests for the "closeStreams" in FileAppenderSuite

Closes #33263 from jhu-chang/SPARK-35027.

Authored-by: Jie <gt.hu.chang@gmail.com>
Signed-off-by: Sean Owen <srowen@gmail.com>
2021-07-20 21:23:51 -05:00
Ye Zhou c77acf0bbc [SPARK-35546][SHUFFLE] Enable push-based shuffle when multiple app attempts are enabled and manage concurrent access to the state in a better way
### What changes were proposed in this pull request?
This is one of the patches for SPIP SPARK-30602 which is needed for push-based shuffle.

### Summary of the change:
When an Executor registers with the Shuffle Service, it encodes the merged shuffle dirs it created and the application attemptId into the ShuffleManagerMeta as JSON. The Shuffle Service then decodes the JSON string and gets the correct merged shuffle dir and attemptId. If the registration comes from a newer attempt, the merged shuffle information is updated to store the information from the newer attempt.

This PR also refactored the management of the merged shuffle information to avoid concurrency issues.
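
A hedged sketch of the attempt-aware registration (field names illustrative; not the exact JSON wire format):

```scala
case class ShuffleManagerMeta(mergedShuffleDirs: Seq[String], attemptId: Int)

object MergedShuffleRegistry {
  private var current: Option[ShuffleManagerMeta] = None

  // Keep only the newest attempt's merged shuffle information.
  def register(meta: ShuffleManagerMeta): Unit = synchronized {
    if (current.forall(_.attemptId < meta.attemptId)) current = Some(meta)
  }
}
```
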
### Why are the changes needed?
Refer to the SPIP in SPARK-30602.

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
Added unit tests.
The reference PR with the consolidated changes covering the complete implementation is also provided in SPARK-30602.
We have already verified the functionality and the improved performance as documented in the SPIP doc.

Closes #33078 from zhouyejoe/SPARK-35546.

Authored-by: Ye Zhou <yezhou@linkedin.com>
Signed-off-by: Mridul Muralidharan <mridul<at>gmail.com>
2021-07-20 00:03:30 -05:00
Dongjoon Hyun fd3e9ce0b9 [SPARK-36193][CORE] Recover SparkSubmit.runMain not to stop SparkContext in non-K8s env
### What changes were proposed in this pull request?

According to the discussion on https://github.com/apache/spark/pull/32283 , this PR aims to limit the feature of SPARK-34674 to K8s environment only.

### Why are the changes needed?

To reduce the behavior change in non-K8s environment.

### Does this PR introduce _any_ user-facing change?

The change behavior is consistent with 3.1.1 and older Spark releases.

### How was this patch tested?

N/A

Closes #33403 from dongjoon-hyun/SPARK-36193.

Authored-by: Dongjoon Hyun <dongjoon@apache.org>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
2021-07-18 22:26:23 -07:00
skhandrikagmail bfdde9635d [SPARK-36122][CORE] Passing on needClientAuth to Jetty SSLContextFactory
SPARK-36122: Spark does not pass on `needClientAuth` to Jetty's `SslContextFactory`, which makes it impossible to configure mTLS authentication.

Passing `needClientAuth` to the `SslContextFactory` enables mTLS authentication for Jetty through X.509 certificates.
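
A minimal sketch of the wiring (`SslContextFactory#setNeedClientAuth` is a real Jetty API; the keystore path and wrapper function are illustrative):

```scala
import org.eclipse.jetty.util.ssl.SslContextFactory

def buildSslFactory(needClientAuth: Boolean): SslContextFactory.Server = {
  val factory = new SslContextFactory.Server()
  factory.setKeyStorePath("/path/to/keystore.jks")  // placeholder path
  // Pass the setting through so clients must present an X.509 certificate.
  factory.setNeedClientAuth(needClientAuth)
  factory
}
```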

### What changes were proposed in this pull request?

### Why are the changes needed?

### Does this PR introduce _any_ user-facing change?

### How was this patch tested?

Closes #33301 from skhandrikagmail/patch-1.

Authored-by: skhandrikagmail <87313842+skhandrikagmail@users.noreply.github.com>
Signed-off-by: Sean Owen <srowen@gmail.com>
2021-07-17 08:59:42 -05:00
Chandni Singh 6d2cbadcfe [SPARK-32922][SHUFFLE][CORE][FOLLOWUP] Fixes few issues when the executor tries to fetch push-merged blocks
### What changes were proposed in this pull request?
Below 2 bugs were introduced with https://github.com/apache/spark/pull/32140
1. Instead of requesting the local-dirs for push-merged-local blocks from the ESS, `PushBasedFetchHelper` requests them from other executors. Push-based shuffle is only enabled when the ESS is enabled, so it should always fetch the dirs from the ESS and not from other executors, which is not yet supported.
2. The size of the push-merged blocks is logged incorrectly.

### Why are the changes needed?
This fixes the above mentioned bugs and is needed for push-based shuffle to work properly.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
Tested this by running an application on the cluster. The UTs mock the call `hostLocalDirManager.getHostLocalDirs`, which is why (1) wasn't caught by a UT. However, the fix is trivial, and checking this in a UT would require much more effort, so I haven't modified it.
Logs of the executor with the bug
```
21/07/15 15:42:46 WARN ExternalBlockStoreClient: Error while trying to get the host local dirs for [shuffle-push-merger]
21/07/15 15:42:46 WARN PushBasedFetchHelper: Error while fetching the merged dirs for push-merged-local blocks: shuffle_0_-1_13. Fetch the original blocks instead
java.lang.RuntimeException: java.lang.IllegalStateException: Invalid executor id: shuffle-push-merger, expected 92.
	at org.apache.spark.network.netty.NettyBlockRpcServer.receive(NettyBlockRpcServer.scala:130)
	at org.apache.spark.network.server.TransportRequestHandler.processRpcRequest(TransportRequestHandler.java:163)
```
After the fix, the executors were able to fetch the local push-merged blocks.

Closes #33378 from otterc/SPARK-32922-followup.

Authored-by: Chandni Singh <singh.chandni@gmail.com>
Signed-off-by: Mridul Muralidharan <mridul<at>gmail.com>
2021-07-17 00:26:46 -05:00
yi.wu 4783fb72af [SPARK-35276][CORE] Calculate checksum for shuffle data and write as checksum file
### What changes were proposed in this pull request?

This is the initial work of add checksum support of shuffle. This is a piece of https://github.com/apache/spark/pull/32385. And this PR only adds checksum functionality at the shuffle writer side.

Basically, the idea is to wrap a `MutableCheckedOutputStream`* upon the `FileOutputStream` while the shuffle writer generates the shuffle data. But the specific wrapping places differ a bit among the shuffle writers due to their different implementations:

* `BypassMergeSortShuffleWriter` -  wrap on each partition file
* `UnsafeShuffleWriter` - wrap on each spill file directly, since they don't require aggregation or sorting
* `SortShuffleWriter` - wrap on the `ShufflePartitionPairsWriter` after merging spill files, since they might require aggregation or sorting

\* `MutableCheckedOutputStream` is a variant of `java.util.zip.CheckedOutputStream` which can change the checksum calculator at runtime.

And we use `Adler32`, which serves the same purpose as CRC-32 but is much faster, to calculate the checksum, the same as `Broadcast`'s checksum.
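
A hedged sketch of the wrapping idea (the `java.util.zip` types are real; this variant is simplified from the description above):

```scala
import java.io.{ByteArrayOutputStream, OutputStream}
import java.util.zip.{Adler32, Checksum}

class MutableCheckedOutputStream(out: OutputStream) extends OutputStream {
  private var checksum: Checksum = new Adler32  // swappable at runtime
  def setChecksum(c: Checksum): Unit = checksum = c
  override def write(b: Int): Unit = { checksum.update(b); out.write(b) }
  def getValue: Long = checksum.getValue
}

val sink = new ByteArrayOutputStream()
val s = new MutableCheckedOutputStream(sink)
"partition data".getBytes("UTF-8").foreach(b => s.write(b & 0xff))
val checksumToPersist = s.getValue  // would go into the checksum file
```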

### Why are the changes needed?

This is part of the shuffle checksum support proposed in https://github.com/apache/spark/pull/32385, which enables diagnosis of shuffle data corruption.

### Does this PR introduce _any_ user-facing change?

Yes, added a new conf: `spark.shuffle.checksum`.

### How was this patch tested?

Added unit tests.

Closes #32401 from Ngone51/add-checksum-files.

Authored-by: yi.wu <yi.wu@databricks.com>
Signed-off-by: Mridul Muralidharan <mridul<at>gmail.com>
2021-07-17 00:23:14 -05:00
Karen Feng e92b8ea6f8 [SPARK-36106][SQL][CORE] Label error classes for subset of QueryCompilationErrors
### What changes were proposed in this pull request?

Adds error classes to some of the exceptions in QueryCompilationErrors.

### Why are the changes needed?

Improves auditing for developers and adds useful fields for users (error class and SQLSTATE).

### Does this PR introduce _any_ user-facing change?

Yes, fills in missing error class and SQLSTATE fields.

### How was this patch tested?

Existing tests and new unit tests.

Closes #33309 from karenfeng/group-compilation-errors-1.

Authored-by: Karen Feng <karen.feng@databricks.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
2021-07-15 11:43:18 +09:00
Venkata krishnan Sowrirajan fbf53dee37 [SPARK-32920][CORE][SHUFFLE][FOLLOW-UP] Fix to run push-based shuffle tests in DAGSchedulerSuite in ad-hoc manner
### What changes were proposed in this pull request?
Currently, when the push-based shuffle tests are run in an ad-hoc manner through an IDE, `spark.testing` is not set to true; therefore `Utils#isPushBasedShuffleEnabled` returns false, disabling push-based shuffle and eventually causing the tests to fail. This doesn't happen when they are run on the command line using Maven, as `spark.testing` is set to true there.
Changes made: set `spark.testing` to true in `initPushBasedShuffleConfs`.

### Why are the changes needed?
Fix to run DAGSchedulerSuite tests in ad-hoc manner

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
In my local IDE

Closes #33303 from venkata91/SPARK-32920-follow-up.

Authored-by: Venkata krishnan Sowrirajan <vsowrirajan@linkedin.com>
Signed-off-by: Mridul Muralidharan <mridul<at>gmail.com>
2021-07-13 12:16:28 -05:00
Wenchen Fan 4a62e1e9c1 [SPARK-36074][SQL] Add error class for StructType.findNestedField
### What changes were proposed in this pull request?

This PR adds an INVALID_FIELD_NAME error class for the errors in `StructType.findNestedField`. It also cleans up the code there and adds UT for this method.

### Why are the changes needed?

follow the new error message framework

### Does this PR introduce _any_ user-facing change?

no

### How was this patch tested?

existing tests

Closes #33282 from cloud-fan/error.

Authored-by: Wenchen Fan <wenchen@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2021-07-13 21:13:58 +08:00
yi.wu f8a80c42ce [SPARK-36048][TEST][CORE] Fix HealthTrackerSuite.allExecutorAndHostIds
### What changes were proposed in this pull request?

Fix the executor ids that are declared at `allExecutorAndHostIds`.

### Why are the changes needed?

Currently, `HealthTrackerSuite.allExecutorAndHostIds` is mistakenly declared, which means executor exclusion isn't correctly tested.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Pass existing tests in `HealthTrackerSuite`.

Closes #33262 from Ngone51/fix-healthtrackersuite.

Authored-by: yi.wu <yi.wu@databricks.com>
Signed-off-by: yi.wu <yi.wu@databricks.com>
2021-07-13 16:41:30 +08:00
Denis Tarima cfcd094147 [SPARK-36036][CORE] Fix cleanup of DownloadFile resources
### What changes were proposed in this pull request?

There was a regression since Spark started storing large remote files on disk (https://issues.apache.org/jira/browse/SPARK-22062). In 2018 a refactoring introduced a hidden reference preventing the auto-deletion of the files (a97001d217 (diff-42a673b8fa5f2b999371dc97a5de7ebd2c2ec19447353d39efb7e8ebc012fe32L1677)). Since then all underlying files of DownloadFile instances are kept on disk for the duration of the Spark application which sometimes results in "no space left" errors.

The `ReferenceWithCleanup` class uses `file` (the `DownloadFile`) in its `cleanUp(): Unit` method, so it has to keep a reference to it, which prevents the `DownloadFile` from being garbage collected.
```
def cleanUp(): Unit = {
  logDebug(s"Clean up file $filePath")

  if (!file.delete()) {                                      <--- here
    logDebug(s"Fail to delete file $filePath")
  }
}
```
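
A hedged sketch of the fix direction: capture only the path at construction time so the reference holds no strong link to the `DownloadFile` (types simplified):

```scala
import java.io.File
import java.lang.ref.{ReferenceQueue, WeakReference}

class DownloadFileSketch(val path: String)

class ReferenceWithCleanup(file: DownloadFileSketch,
                           q: ReferenceQueue[DownloadFileSketch])
    extends WeakReference[DownloadFileSketch](file, q) {
  private val filePath = file.path  // copy the path; do NOT keep `file`

  def cleanUp(): Unit = {
    if (!new File(filePath).delete()) {
      // Deletion failed; log it. Nothing here touches `file`, so the
      // DownloadFile itself stays collectable.
    }
  }
}
```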

### Why are the changes needed?

Long-running Spark applications require freeing resources when they are not needed anymore, and iterative algorithms could use all the disk space quickly too.

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

Added a test in BlockManagerSuite and tested manually.

Closes #33251 from dtarima/fix-download-file-cleanup.

Authored-by: Denis Tarima <dtarima@gmail.com>
Signed-off-by: Sean Owen <srowen@gmail.com>
2021-07-11 11:54:23 -05:00
yangjie01 83b3b75a34 [SPARK-36047][CORE] Replace the handwriting compare methods with static compare methods in Java code
### What changes were proposed in this pull request?
The main change is to use the static `Integer.compare()` and `Long.compare()` methods instead of handwritten compare logic in Java code.
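
An illustration of the pattern being replaced, shown in Scala for brevity:

```scala
// Handwritten comparison logic of the kind being removed.
def handwritten(a: Int, b: Int): Int =
  if (a < b) -1 else if (a > b) 1 else 0

// Preferred: delegate to the JDK's static method.
def preferred(a: Int, b: Int): Int = java.lang.Integer.compare(a, b)
```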

### Why are the changes needed?
Removes unnecessary handwritten compare methods.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
Pass the Jenkins or GitHub Action

Closes #33260 from LuciferYang/static-compare.

Authored-by: yangjie01 <yangjie01@baidu.com>
Signed-off-by: Sean Owen <srowen@gmail.com>
2021-07-10 07:54:01 -05:00
Kent Yao f5a63322de [SPARK-36070][CORE] Log time cost info for writing rows out and committing the task
### What changes were proposed in this pull request?

We have a job that has a stage containing about 8k tasks. Most tasks take about 1~10 min to finish, but 3 of the tasks run extremely slowly despite having similar data sizes. They take about 1 hour each to finish, and so do their speculative attempts.

The root cause is most likely a delay in the storage system. But it's not straightforward to find in which phase the performance issue occurs: shuffle read, task execution, output, commit, etc.

```log
2021-07-09 03:05:17 CST SparkHadoopMapRedUtil INFO - attempt_20210709022249_0003_m_007050_37351: Committed
2021-07-09 03:05:17 CST Executor INFO - Finished task 7050.0 in stage 3.0 (TID 37351). 3311 bytes result sent to driver
2021-07-09 04:06:10 CST ShuffleBlockFetcherIterator INFO - Getting 9 non-empty blocks including 0 local blocks and 9 remote blocks
2021-07-09 04:06:10 CST TransportClientFactory INFO - Found inactive connection to
```

### Why are the changes needed?

On the Spark side, we can record the time cost in logs for better bug hunting and performance tuning.
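
A hedged sketch of the kind of timing log this adds (names illustrative; the real code would go through Spark's logging):

```scala
def timed[T](what: String)(body: => T): T = {
  val start = System.nanoTime()
  try body finally {
    val ms = (System.nanoTime() - start) / 1000000
    println(s"$what took $ms ms")  // e.g. "Committing the task took 3600000 ms"
  }
}

timed("Writing rows out") { /* write the rows */ }
timed("Committing the task") { /* commit */ }
```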

### Does this PR introduce _any_ user-facing change?

no

### How was this patch tested?

passing GA

Closes #33279 from yaooqinn/SPARK-36070.

Authored-by: Kent Yao <yao@apache.org>
Signed-off-by: Kent Yao <yao@apache.org>
2021-07-10 00:54:19 +08:00
gengjiaan a46dc9b0f2 [SPARK-36018][CORE][SQL] Some Improvement for Spark Core
### What changes were proposed in this pull request?
This PR improves some implementations in Spark core.

### Why are the changes needed?
Same as above: general code improvements for Spark core.

### Does this PR introduce _any_ user-facing change?
'No'.

### How was this patch tested?
Jenkins test.

Closes #33216 from beliefer/gather-code-format.

Authored-by: gengjiaan <gengjiaan@360.cn>
Signed-off-by: Sean Owen <srowen@gmail.com>
2021-07-09 11:24:06 -05:00
Takuya UESHIN 115b8a180f [SPARK-36062][PYTHON] Try to capture faulthandler when a Python worker crashes
### What changes were proposed in this pull request?

Try to capture the error message from the `faulthandler` when the Python worker crashes.

### Why are the changes needed?

Currently, we just see an error message saying `"exited unexpectedly (crashed)"` when a UDF causes the Python worker to crash, e.g., by a segmentation fault.
We should take advantage of [`faulthandler`](https://docs.python.org/3/library/faulthandler.html) and try to capture the error message from the `faulthandler`.

### Does this PR introduce _any_ user-facing change?

Yes, when a Spark config `spark.python.worker.faulthandler.enabled` is `true`, the stack trace will be seen in the error message when the Python worker crashes.

```py
>>> def f():
...   import ctypes
...   ctypes.string_at(0)
...
>>> sc.parallelize([1]).map(lambda x: f()).count()
```

```
org.apache.spark.SparkException: Python worker exited unexpectedly (crashed): Fatal Python error: Segmentation fault

Current thread 0x000000010965b5c0 (most recent call first):
  File "/.../ctypes/__init__.py", line 525 in string_at
  File "<stdin>", line 3 in f
  File "<stdin>", line 1 in <lambda>
...
```

### How was this patch tested?

Added some tests, and manually.

Closes #33273 from ueshin/issues/SPARK-36062/faulthandler.

Authored-by: Takuya UESHIN <ueshin@databricks.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
2021-07-09 11:30:39 +09:00
Karen Feng 71c086eb87 [SPARK-35958][CORE] Refactor SparkError.scala to SparkThrowable.java
### What changes were proposed in this pull request?

Refactors the base Throwable trait `SparkError.scala` (introduced in SPARK-34920) an interface `SparkThrowable.java`.

### Why are the changes needed?

- Renaming `SparkError` to `SparkThrowable` better reflects that this is the base interface for both `Exception` and `Error`
- Migrating to Java maximizes its extensibility

### Does this PR introduce _any_ user-facing change?

Yes; the base trait has been renamed and the accessor methods have changed (e.g., `sqlState` -> `getSqlState()`).

### How was this patch tested?

Unit tests.

Closes #33164 from karenfeng/SPARK-35958.

Authored-by: Karen Feng <karen.feng@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2021-07-08 23:54:53 +08:00
Shockang 55373b118f [SPARK-35907][CORE] Instead of File#mkdirs, Files#createDirectories is expected
### What changes were proposed in this pull request?

The code of the method `createDirectory` in the class `org.apache.spark.util.Utils` is modified.

### Why are the changes needed?

To solve the problem of ambiguous exception handling when creating directories with traditional IO: `File#mkdirs` only returns a boolean, while `Files#createDirectories` throws a descriptive exception on failure.

What's more, there shouldn't be an improper comment in Spark's source code.
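
A minimal sketch of the switch (both JDK APIs are real; the wrapper is simplified):

```scala
import java.io.File
import java.nio.file.Files

def createDirectory(dir: File): Unit = {
  // File#mkdirs returns only a boolean, so failures are ambiguous;
  // Files#createDirectories throws a descriptive IOException instead.
  Files.createDirectories(dir.toPath)
}
```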

### Does this PR introduce _any_ user-facing change?

Yes

The modified method is called to create the working directory when the Worker starts.

The modified method is called to create local directories for storing block data when the class `DiskBlockManager` is instantiated.

The modified method is called to create a temporary directory inside the given parent directory in several classes.

### How was this patch tested?

I have provided test cases as much as possible.

Authored-by: Shockang <shockang@aliyun.com>

Closes #33101 from Shockang/SPARK-35907.

Authored-by: Shockang <shockang@163.com>
Signed-off-by: Sean Owen <srowen@gmail.com>
2021-07-07 09:16:13 -05:00
Tim Armstrong e4273f7098 [SPARK-35980][CORE] ThreadAudit logs whether thread is daemon
### What changes were proposed in this pull request?
Add `daemon={true|false}` to the POSSIBLE THREAD LEAK IN SUITE warning printed by test framework.
### Why are the changes needed?
This is to slightly accelerate interpretation of that warning, since non-daemon threads can block the process from exiting and are likely to be problematic.

Only affects test code.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
Manually ran some tests, inspected the output log line.

Closes #33178 from timarmstrong/thread-leak.

Authored-by: Tim Armstrong <tim.armstrong@databricks.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
2021-07-04 10:44:00 +09:00
Dongjoon Hyun f9f95686cb [SPARK-35996][BUILD] Setting version to 3.3.0-SNAPSHOT
### What changes were proposed in this pull request?

This PR aims to update `master` branch version to 3.3.0-SNAPSHOT.

### Why are the changes needed?

Start to prepare Apache Spark 3.3.0 and the published snapshot version should not conflict with `branch-3.2`.

### Does this PR introduce _any_ user-facing change?

N/A.

### How was this patch tested?

Pass the CIs.

Closes #33196 from dongjoon-hyun/SPARK-35996.

Authored-by: Dongjoon Hyun <dongjoon@apache.org>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
2021-07-02 13:47:36 -07:00
Kevin Su dc85b0b51a [SPARK-35950][WEBUI] Failed to toggle Exec Loss Reason in the executors page
### What changes were proposed in this pull request?

Update the executor's page, so it can successfully hide the "Exec Loss Reason" column.

### Why are the changes needed?

When the "Exec Loss Reason" checkbox is unselected on the executor page,
the "Active tasks" column disappears instead of the "Exec Loss Reason" column.

Before:
![Screenshot from 2021-06-30 15-55-05](https://user-images.githubusercontent.com/37936015/123930908-bd6f4180-d9c2-11eb-9aba-bbfe0a237776.png)
After:
![Screenshot from 2021-06-30 22-21-38](https://user-images.githubusercontent.com/37936015/123977632-bf042e00-d9f1-11eb-910e-93d615d2db47.png)

### Does this PR introduce _any_ user-facing change?

Yes, The Web UI is updated.

### How was this patch tested?

Pass the CIs.

Closes #33155 from pingsutw/SPARK-35950.

Lead-authored-by: Kevin Su <pingsutw@gmail.com>
Co-authored-by: Kevin Su <pingsutw@apache.org>
Signed-off-by: Gengliang Wang <gengliang@apache.org>
2021-07-01 12:32:54 +08:00
yi.wu 868a594706 [SPARK-35714][FOLLOW-UP][CORE] Use a shared stopping flag for WorkerWatcher to avoid the duplicate System.exit
### What changes were proposed in this pull request?

This PR proposes to let `WorkerWatcher` reuse the `stopping` flag in `CoarseGrainedExecutorBackend` to avoid the duplicate call of `System.exit`.
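
A hedged sketch of sharing a single stopping flag (names illustrative):

```scala
import java.util.concurrent.atomic.AtomicBoolean

class BackendSketch {
  val stopping = new AtomicBoolean(false)

  def exitOnce(code: Int): Unit =
    if (stopping.compareAndSet(false, true)) sys.exit(code)  // at most one exit
}

// WorkerWatcher would be handed the same `stopping` flag instead of
// tracking its own state, so it cannot trigger a second System.exit.
```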

### Why are the changes needed?

As a followup of https://github.com/apache/spark/pull/32868, this PR tries to give a more robust fix.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Pass existing tests.

Closes #33028 from Ngone51/spark-35714-followup.

Lead-authored-by: yi.wu <yi.wu@databricks.com>
Co-authored-by: wuyi <yi.wu@databricks.com>
Signed-off-by: yi.wu <yi.wu@databricks.com>
2021-07-01 11:40:00 +08:00
Karen Feng e3bd817d65 [SPARK-34920][CORE][SQL] Add error classes with SQLSTATE
### What changes were proposed in this pull request?

Unifies exceptions thrown from Spark under a single base trait `SparkError`, which unifies:
- Error classes
- Parametrized error messages
- SQLSTATE, as discussed in http://apache-spark-developers-list.1001551.n3.nabble.com/DISCUSS-Add-error-IDs-td31126.html.

### Why are the changes needed?

- Adding error classes creates a consistent label for exceptions, even as error messages change
- Creating a single, centralized source-of-truth for parametrized error messages improves auditing for error message quality
- Adding SQLSTATE helps ODBC/JDBC users receive standardized error codes

### Does this PR introduce _any_ user-facing change?

Yes, changes ODBC experience by:
- Adding error classes to error messages
- Adding SQLSTATE to TStatus

### How was this patch tested?

Unit tests, as well as local tests with PyODBC.

Closes #32850 from karenfeng/SPARK-34920.

Authored-by: Karen Feng <karen.feng@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2021-06-30 09:22:02 +00:00
Cheng Su 6bbfb45ffe [SPARK-33298][CORE][FOLLOWUP] Add Unstable annotation to FileCommitProtocol
### What changes were proposed in this pull request?

This is the followup from https://github.com/apache/spark/pull/33012#discussion_r659440833, where we want to add `Unstable` to `FileCommitProtocol` to give people a better idea of the API's stability.
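
The change itself is small; roughly, it amounts to annotating the class (the body is elided here):

```scala
import org.apache.spark.annotation.Unstable

// Marks the API as evolving: it may change or be removed in minor releases.
@Unstable
abstract class FileCommitProtocol {
  // ... job/task commit and abort hooks ...
}
```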

### Why are the changes needed?

Make it easier for people to follow and understand code. Clean up code.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Existing unit tests, as there is no real logic change.

Closes #33148 from c21/bucket-followup.

Authored-by: Cheng Su <chengsu@fb.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
2021-06-30 16:25:20 +09:00
Chandni Singh 9a5cd15e87 [SPARK-32922][SHUFFLE][CORE] Adds support for executors to fetch local and remote merged shuffle data
### What changes were proposed in this pull request?
This is the shuffle fetch side change where executors can fetch local/remote push-merged shuffle data from shuffle services. This is needed for push-based shuffle - SPIP [SPARK-30602](https://issues.apache.org/jira/browse/SPARK-30602).
The change adds support to the `ShuffleBlockFetcherIterator` to fetch push-merged block meta and shuffle chunks from local and remote ESS. If the fetch of any of these fails, the iterator falls back to fetching the original shuffle blocks that belonged to the push-merged block, as sketched below.
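
A simplified sketch of the fallback behavior, with invented names and a toy fetch signature rather than the real iterator internals:

```scala
// If a push-merged block (or any of its chunks) cannot be fetched,
// re-enqueue the original shuffle blocks that were merged into it.
def fetchWithFallback(
    mergedBlockId: String,
    originalBlockIds: Seq[String],
    fetch: String => Option[Array[Byte]]): Seq[Array[Byte]] = {
  fetch(mergedBlockId) match {
    case Some(data) => Seq(data)                          // fast path: one merged read
    case None       => originalBlockIds.flatMap(fetch(_)) // fallback: original blocks
  }
}
```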

### Why are the changes needed?
These changes are needed for push-based shuffle. Refer to the SPIP in [SPARK-30602](https://issues.apache.org/jira/browse/SPARK-30602).

### Does this PR introduce _any_ user-facing change?
When push-based shuffle is turned on, executors will fetch push-merged blocks from the remote shuffle service. The client logs will indicate this.

### How was this patch tested?
Added unit tests.
The reference PR with the consolidated changes covering the complete implementation is also provided in [SPARK-30602](https://issues.apache.org/jira/browse/SPARK-30602).
We have already verified the functionality and the improved performance as documented in the SPIP doc.

Closes #32140 from otterc/SPARK-32922.

Lead-authored-by: Chandni Singh <singh.chandni@gmail.com>
Co-authored-by: Chandni Singh <chsingh@linkedin.com>
Co-authored-by: Min Shen <mshen@linkedin.com>
Co-authored-by: otterc <singh.chandni@gmail.com>
Signed-off-by: Mridul Muralidharan <mridul<at>gmail.com>
2021-06-29 17:44:15 -05:00
Dongjoon Hyun 7e7028282c [SPARK-35928][BUILD] Upgrade ASM to 9.1
### What changes were proposed in this pull request?

This PR aims to upgrade ASM to 9.1

### Why are the changes needed?

The latest `xbean-asm9-shaded` is built with ASM 9.1.

- https://mvnrepository.com/artifact/org.apache.xbean/xbean-asm9-shaded/4.20
- 5e0e3c0c64/pom.xml (L67)

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Pass the CIs.

Closes #33130 from dongjoon-hyun/SPARK-35928.

Authored-by: Dongjoon Hyun <dongjoon@apache.org>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
2021-06-29 10:27:51 -07:00
Kent Yao 9c157a490b [SPARK-35910][CORE][SHUFFLE] Update remoteBlockBytes based on merged block info to reduce task time
### What changes were proposed in this pull request?

Currently, we calculate `remoteBlockBytes` based on the original block info list, which is not efficient: it usually costs ~25% more time here.

If the original reducer size is big but the actual reducer size is small due to automatic partition coalescing by AQE, the reducer will take more time to calculate `remoteBlockBytes`.

We can reduce this cost by using the merged block info lists carried in remote requests.
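
An illustrative sketch of the cost difference, with simplified types (not the actual map-output tracking structures):

```scala
// Before: one entry per original map-side block; summing is O(#blocks).
final case class BlockInfo(blockId: String, size: Long)

def remoteBytesFromOriginal(blocks: Seq[BlockInfo]): Long =
  blocks.map(_.size).sum // can iterate millions of tiny entries per reducer

// After: per-address merged info, aggregated once on the sender side.
final case class MergedBlockInfo(address: String, totalSize: Long)

def remoteBytesFromMerged(merged: Seq[MergedBlockInfo]): Long =
  merged.map(_.totalSize).sum // one entry per (address, merged range)
```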

### Why are the changes needed?

Improve task performance.

### Does this PR introduce _any_ user-facing change?

no

### How was this patch tested?

new unit tests and verified manually.

Closes #33109 from yaooqinn/SPARK-35910.

Authored-by: Kent Yao <yao@apache.org>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
2021-06-28 13:55:59 -07:00
Erik Krogen 3255511d52 [SPARK-35258][SHUFFLE][YARN] Add new metrics to ExternalShuffleService for better monitoring
### What changes were proposed in this pull request?
This adds two new additional metrics to `ExternalBlockHandler`:
- `blockTransferRate` -- for indicating the rate of transferring blocks, vs. the data within them
- `blockTransferAvgSize_1min` -- a 1-minute trailing average of block sizes transferred by the ESS

Additionally, this enhances `YarnShuffleServiceMetrics` to expose the histogram/`Snapshot` information from `Timer` metrics within `ExternalBlockHandler`.
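
A hedged sketch of how the new counters could be maintained with Dropwizard Metrics (which the ESS already uses); the wiring below is assumed, only the metric names mirror the PR:

```scala
import com.codahale.metrics.{Meter, MetricRegistry}

val registry = new MetricRegistry()
val blockTransferRate: Meter = registry.meter("blockTransferRate")
val blockTransferRateBytes: Meter = registry.meter("blockTransferRateBytes")

// Called once per completed transfer of a batch of blocks.
def onBlocksTransferred(numBlocks: Int, totalBytes: Long): Unit = {
  blockTransferRate.mark(numBlocks)       // rate of blocks, regardless of size
  blockTransferRateBytes.mark(totalBytes) // rate of bytes, as before
}

// The 1-minute average block size can be derived from the two 1-minute rates.
def blockTransferAvgSize1Min: Double = {
  val blocksPerSec = blockTransferRate.getOneMinuteRate
  if (blocksPerSec > 0) blockTransferRateBytes.getOneMinuteRate / blocksPerSec else 0.0
}
```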

### Why are the changes needed?
Currently `ExternalBlockHandler` exposes some useful metrics, but is lacking around metrics for the rate of block transfers. We have `blockTransferRateBytes` to tell us the rate of _bytes_, but no metric to tell us the rate of _blocks_, which is especially relevant when running the ESS on HDDs that are sensitive to random reads. Many small block transfers can have a negative impact on performance, but won't show up as a spike in `blockTransferRateBytes` since the sizes are small. Thus the new metrics to show information around average block size and block transfer rate are very useful to monitor the health/performance of the ESS, especially when running on HDDs.

For the `YarnShuffleServiceMetrics`, currently the three `Timer` metrics exposed by `ExternalBlockHandler` are underutilized in a YARN-based environment -- they are basically treated as a `Meter`, only exposing rate-based information, even though the metrics themselves collect detailed histograms of timing information. We should expose this information for better observability.

### Does this PR introduce _any_ user-facing change?
Yes, there are two entirely new metrics for the ESS, as documented in `monitoring.md`. Additionally in a YARN environment, `Timer` metrics exposed by the ESS will include more rich timing information.

### How was this patch tested?
New unit tests are added to verify that new metrics are showing up as expected.

We have been running this patch internally for approx. 1 year and have found it to be useful for monitoring the health of ESS and diagnosing performance issues.

Closes #32388 from xkrogen/xkrogen-SPARK-35258-ess-new-metrics.

Authored-by: Erik Krogen <xkrogen@apache.org>
Signed-off-by: Mridul Muralidharan <mridul<at>gmail.com>
2021-06-28 02:36:17 -05:00
Kent Yao 14d4decf73 [SPARK-35879][CORE][SHUFFLE] Fix performance regression caused by collectFetchRequests
### What changes were proposed in this pull request?

This PR fixes perf regression at the executor side when creating fetch requests with large initial partitions

![image](https://user-images.githubusercontent.com/8326978/123270865-dd21e800-d532-11eb-8447-ad80e47b034f.png)

In NetEase, we had an online job that took `45min` to "fetch" about 100MB of shuffle data; it turned out it was just collecting fetch requests slowly. Normally, such a task should finish in seconds.

See the `DEBUG` log

```
21/06/22 11:52:26 DEBUG BlockManagerStorageEndpoint: Sent response: 0 to kyuubi.163.org:
21/06/22 11:53:05 DEBUG ShuffleBlockFetcherIterator: Creating fetch request of 3941440 at BlockManagerId(12, .., 43559, None) with 19 blocks
21/06/22 11:53:44 DEBUG ShuffleBlockFetcherIterator: Creating fetch request of 3716400 at BlockManagerId(20, .., 38287, None) with 18 blocks
21/06/22 11:54:41 DEBUG ShuffleBlockFetcherIterator: Creating fetch request of 4559280 at BlockManagerId(6, .., 39689, None) with 22 blocks
21/06/22 11:55:08 DEBUG ShuffleBlockFetcherIterator: Creating fetch request of 3120160 at BlockManagerId(33, .., 39449, None) with 15 blocks
```

I also created a test case locally with my laptop's Docker environment to give a reproducible case.

```
bin/spark-sql --conf spark.kubernetes.file.upload.path=./ --master k8s://https://kubernetes.docker.internal:6443 --conf spark.kubernetes.container.image=yaooqinn/spark:v20210624-5 -c spark.kubernetes.context=docker-for-desktop_1 --num-executors 5 --driver-memory 5g --conf spark.kubernetes.executor.podNamePrefix=sparksql
```

```sql
 SET spark.sql.adaptive.enabled=true;
 SET spark.sql.shuffle.partitions=3000;
 SELECT /*+ REPARTITION */ 1 as pid, id from range(1, 1000000, 1, 500);
 SELECT /*+ REPARTITION(pid, id) */ 1 as pid, id from range(1, 1000000, 1, 500);
 ```

### Why are the changes needed?

Fix the perf regression introduced by SPARK-29292 (3ad4863673) in v3.1.0.

3ad4863673 was needed to support compilation with Scala 2.13, but the performance loss is huge. We need to consider backporting this PR to branch-3.1.
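
For intuition, the kind of collection pattern that can blow up like this (a hypothetical illustration of the class of regression, not the actual `collectFetchRequests` code) is per-element rebuilding of an immutable sequence versus in-place buffering:

```scala
import scala.collection.mutable.ArrayBuffer

// Quadratic: `:+` copies the whole prefix on every append.
def collectSlow(sizes: Seq[Long]): Seq[Long] = {
  var acc: Seq[Long] = Seq.empty
  sizes.foreach { s => acc = acc :+ s }
  acc
}

// Linear: amortized O(1) appends into a mutable buffer, converted once.
def collectFast(sizes: Seq[Long]): Seq[Long] = {
  val buf = new ArrayBuffer[Long](sizes.size)
  sizes.foreach(buf += _)
  buf.toSeq
}
```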

### Does this PR introduce _any_ user-facing change?

no

### How was this patch tested?

Manually:

#### Before
```log
 21/06/23 13:54:22 DEBUG ShuffleBlockFetcherIterator: maxBytesInFlight: 50331648, targetRemoteRequestSize: 10066329, maxBlocksInFlightPerAddress: 2147483647
 21/06/23 13:54:38 DEBUG ShuffleBlockFetcherIterator: Creating fetch request of 2314708 at BlockManagerId(2, 10.1.3.114, 36423, None) with 86 blocks
 21/06/23 13:54:59 DEBUG ShuffleBlockFetcherIterator: Creating fetch request of 2636612 at BlockManagerId(3, 10.1.3.115, 34293, None) with 87 blocks
 21/06/23 13:55:18 DEBUG ShuffleBlockFetcherIterator: Creating fetch request of 2508706 at BlockManagerId(4, 10.1.3.116, 41869, None) with 90 blocks
 21/06/23 13:55:34 DEBUG ShuffleBlockFetcherIterator: Creating fetch request of 2350854 at BlockManagerId(5, 10.1.3.117, 45787, None) with 85 blocks
 21/06/23 13:55:34 INFO ShuffleBlockFetcherIterator: Getting 438 (11.8 MiB) non-empty blocks including 90 (2.5 MiB) local and 0 (0.0 B) host-local and 348 (9.4 MiB) remote blocks
 21/06/23 13:55:34 DEBUG ShuffleBlockFetcherIterator: Sending request for 87 blocks (2.5 MiB) from 10.1.3.115:34293
 21/06/23 13:55:34 INFO TransportClientFactory: Successfully created connection to /10.1.3.115:34293 after 1 ms (0 ms spent in bootstraps)
 21/06/23 13:55:34 DEBUG ShuffleBlockFetcherIterator: Sending request for 90 blocks (2.4 MiB) from 10.1.3.116:41869
 21/06/23 13:55:34 INFO TransportClientFactory: Successfully created connection to /10.1.3.116:41869 after 2 ms (0 ms spent in bootstraps)
 21/06/23 13:55:34 DEBUG ShuffleBlockFetcherIterator: Sending request for 85 blocks (2.2 MiB) from 10.1.3.117:45787
 ```
```log
 21/06/23 14:00:45 INFO MapOutputTracker: Broadcast outputstatuses size = 411, actual size = 828997
 21/06/23 14:00:45 INFO MapOutputTrackerWorker: Got the map output locations
 21/06/23 14:00:45 DEBUG ShuffleBlockFetcherIterator: maxBytesInFlight: 50331648, targetRemoteRequestSize: 10066329, maxBlocksInFlightPerAddress: 2147483647
 21/06/23 14:00:55 DEBUG ShuffleBlockFetcherIterator: Creating fetch request of 1894389 at BlockManagerId(2, 10.1.3.114, 36423, None) with 99 blocks
 21/06/23 14:01:04 DEBUG ShuffleBlockFetcherIterator: Creating fetch request of 1919993 at BlockManagerId(3, 10.1.3.115, 34293, None) with 100 blocks
 21/06/23 14:01:14 DEBUG ShuffleBlockFetcherIterator: Creating fetch request of 1977186 at BlockManagerId(5, 10.1.3.117, 45787, None) with 103 blocks
 21/06/23 14:01:23 DEBUG ShuffleBlockFetcherIterator: Creating fetch request of 1938336 at BlockManagerId(4, 10.1.3.116, 41869, None) with 101 blocks
 21/06/23 14:01:23 INFO ShuffleBlockFetcherIterator: Getting 500 (9.1 MiB) non-empty blocks including 97 (1820.3 KiB) local and 0 (0.0 B) host-local and 403 (7.4 MiB) remote blocks
 21/06/23 14:01:23 DEBUG ShuffleBlockFetcherIterator: Sending request for 101 blocks (1892.9 KiB) from 10.1.3.116:41869
 21/06/23 14:01:23 DEBUG ShuffleBlockFetcherIterator: Sending request for 103 blocks (1930.8 KiB) from 10.1.3.117:45787
 21/06/23 14:01:23 DEBUG ShuffleBlockFetcherIterator: Sending request for 99 blocks (1850.0 KiB) from 10.1.3.114:36423
 21/06/23 14:01:23 DEBUG ShuffleBlockFetcherIterator: Sending request for 100 blocks (1875.0 KiB) from 10.1.3.115:34293
 21/06/23 14:01:23 INFO ShuffleBlockFetcherIterator: Started 4 remote fetches in 37889 ms
 ```

#### After

```log
21/06/24 13:01:16 DEBUG ShuffleBlockFetcherIterator: maxBytesInFlight: 50331648, targetRemoteRequestSize: 10066329, maxBlocksInFlightPerAddress: 2147483647
21/06/24 13:01:16 INFO ShuffleBlockFetcherIterator: ==> Call blockInfos.map(_._2).sum: 40 ms
21/06/24 13:01:16 INFO ShuffleBlockFetcherIterator: ==> Call mergeFetchBlockInfo for shuffle_0_9_2990_2997/9: 0 ms
21/06/24 13:01:16 INFO ShuffleBlockFetcherIterator: ==> Call mergeFetchBlockInfo for shuffle_0_15_2395_2997/15: 0 ms
```

Closes #33063 from yaooqinn/SPARK-35879.

Authored-by: Kent Yao <yao@apache.org>
Signed-off-by: Kent Yao <yao@apache.org>
2021-06-26 12:48:24 +08:00
Yuanjian Li 0c31137172 [SPARK-35628][SS][FOLLOW-UP] Fix the consistent break on Scala 2.13 build
### What changes were proposed in this pull request?
Fix the persistent Scala 2.13 build break caused by the PR https://github.com/apache/spark/pull/32767

### Why are the changes needed?
Fix the Scala 2.13 build break.

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
Existing tests.

Closes #33084 from xuanyuanking/SPARK-35628-follow.

Authored-by: Yuanjian Li <yuanjian.li@databricks.com>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
2021-06-25 07:08:03 -07:00