Commit graph

7193 commits

Author SHA1 Message Date
Sean Owen c9b49f3978 [SPARK-28737][CORE] Update Jersey to 2.29
## What changes were proposed in this pull request?

Update Jersey to 2.27+, ideally 2.29, for possible JDK 11 fixes.

## How was this patch tested?

Existing tests.

Closes #25455 from srowen/SPARK-28737.

Authored-by: Sean Owen <sean.owen@databricks.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2019-08-16 15:08:04 -07:00
angerszhu 036fd3903f [SPARK-27637][SHUFFLE][FOLLOW-UP] For NettyBlockTransferService, if an IOException occurs while creating a client, check whether the relevant executor is alive before retrying #24533
### What changes were proposed in this pull request?

In PR #[24533](https://github.com/apache/spark/pull/24533/files), we prevented retries against a removed executor.
In my test, I couldn't catch exceptions from
` new OneForOneBlockFetcher(client, appId, execId, blockIds, listener,
              transportConf, tempFileManager).start()`
Checking the code carefully, the **start()** method handles `IOException` inside its own retry logic and won't throw it out until it reaches the maximum retry count or hits an exception that is not an `IOException`.

However, if the executor is already dead when we fetch a block, then when we rerun
`RetryingBlockFetcher.BlockFetchStarter.createAndStart()`
we may fail to create a transport client to the dead executor, which throws an `IOException`.
We should catch this `IOException` as well.

### Why are the changes needed?
The old solution was not comprehensive; it didn't cover this case.

### Does this PR introduce any user-facing change?
No

### How was this patch tested?
Existing unit tests.

Closes #25469 from AngersZhuuuu/SPARK-27637-FLLOW-UP.

Authored-by: angerszhu <angers.zhu@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2019-08-16 23:24:32 +08:00
Steve Loughran 2ac6163a5d [SPARK-23977][SQL] Support High Performance S3A committers [test-hadoop3.2]
This patch adds the binding classes to enable Spark to switch DataFrame output to the S3A zero-rename committers shipping in Hadoop 3.1+. It adds a source tree into the hadoop-cloud-storage module which only compiles with the hadoop-3.2 profile, and contains a binding for normal output and a specific bridge class for Parquet (as the Parquet output format requires a subclass of `ParquetOutputCommitter`).

Commit algorithms are a critical topic. There's no formal proof of correctness, but the algorithms are documented and analysed in [A Zero Rename Committer](https://github.com/steveloughran/zero-rename-committer/releases). That paper also reviews the classic v1 and v2 algorithms, IBM's Swift committer, and the one from EMRFS, which they admit was based on the concepts implemented here.

Test-wise

* There's a public set of scala test suites [on github](https://github.com/hortonworks-spark/cloud-integration)
* We have run integration tests against Spark on Yarn clusters.
* This code has been shipping for ~12 months in HDP-3.x.

Closes #24970 from steveloughran/cloud/SPARK-23977-s3a-committer.

Authored-by: Steve Loughran <stevel@cloudera.com>
Signed-off-by: Marcelo Vanzin <vanzin@cloudera.com>
2019-08-15 09:39:26 -07:00
Xianjin YE 3ec24fd128 [SPARK-28203][CORE][PYTHON] PythonRDD should respect SparkContext's hadoop configuration
## What changes were proposed in this pull request?
1. `PythonHadoopUtil.mapToConf` generates a `Configuration` with `loadDefaults` disabled.
2. Hadoop conf merging is made consistent across the several places in `PythonRDD` that do it.

## How was this patch tested?
Added a new test; existing tests also pass.

Closes #25002 from advancedxy/SPARK-28203.

Authored-by: Xianjin YE <advancedxy@gmail.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2019-08-15 10:39:33 +09:00
Marcelo Vanzin 0343854f54 [SPARK-28487][K8S] More responsive dynamic allocation with K8S
This change implements a few changes to the k8s pod allocator so
that it behaves a little better when dynamic allocation is on.

(i) Allow the application to ramp up immediately when there's a
change in the target number of executors. Without this change,
scaling would only trigger when a change happened in the state of
the cluster, e.g. an executor going down, or when the periodical
snapshot was taken (default every 30s).

(ii) Get rid of pending pod requests, both acknowledged (i.e. Spark
knows that a pod is pending resource allocation) and unacknowledged
(i.e. Spark has requested the pod but the API server hasn't created it
yet), when they're not needed anymore. This avoids starting those
executors to just remove them after the idle timeout, wasting resources
in the meantime.

(iii) Re-work some of the code to avoid unnecessary logging. While not
bad without dynamic allocation, the existing logging was very chatty
when dynamic allocation was on. With the changes, all the useful
information is still there, but only when interesting changes happen.

(iv) Gracefully shut down executors when they become idle. Just deleting
the pod causes a lot of ugly logs to show up, so it's better to ask pods
to exit nicely. That also allows Spark to respect the "don't delete
pods" option when dynamic allocation is on.

Tested on a small k8s cluster running different TPC-DS workloads.

Closes #25236 from vanzin/SPARK-28487.

Authored-by: Marcelo Vanzin <vanzin@cloudera.com>
Signed-off-by: Marcelo Vanzin <vanzin@cloudera.com>
2019-08-13 17:29:54 -07:00
Kousuke Saruta 247bebcf94 [SPARK-28561][WEBUI] DAG viz for barrier-execution mode
## What changes were proposed in this pull request?

In the current UI, we cannot identify which RDDs are barrier RDDs. Visualizing them makes debugging easier.
The following images show the UI after this change.

![Screenshot from 2019-07-30 16-30-35](https://user-images.githubusercontent.com/4736016/62110508-83cec100-b2e9-11e9-83b9-bc2e485a4cbe.png)
![Screenshot from 2019-07-30 16-31-09](https://user-images.githubusercontent.com/4736016/62110509-83cec100-b2e9-11e9-9e2e-47c4dae23a52.png)

The boxes in pale green indicate barrier execution (we might need to discuss which color is most appropriate).

## How was this patch tested?

Tested manually.
The images above were produced by the following operations.

```
val rdd1 = sc.parallelize(1 to 10)
val rdd2 = sc.parallelize(1 to 10)
val rdd3 = rdd1.zip(rdd2).barrier.mapPartitions(identity(_))
val rdd4 = rdd3.map(identity(_))
val rdd5 = rdd4.reduceByKey(_+_)
rdd5.collect
```

Closes #25296 from sarutak/barrierexec-dagviz.

Authored-by: Kousuke Saruta <sarutak@oss.nttdata.com>
Signed-off-by: Xingbo Jiang <xingbo.jiang@databricks.com>
2019-08-12 22:38:10 -07:00
Kousuke Saruta 25857c6559 [SPARK-28647][WEBUI] Recover additional metric feature and remove additional-metrics.js
## What changes were proposed in this pull request?

SPARK-17019 introduced `On Heap Memory` and `Off Heap Memory` as optional metrics.
But they are never displayed because they are set to `display: none` in CSS and there is no way to make them appear.

I know #22595 also tries to resolve this issue, but it relies on `additional-metrics.js`.
`additional-metrics.js` was originally created for `StagePage`, but `StagePage` now uses `stagepage.js` to toggle its additional metrics, because `DataTables` (a jQuery plugin) was introduced and we needed a different mechanism to add/remove columns for additional metrics.

Now that `ExecutorsPage` also uses `DataTables`, it is better to introduce the same mechanism as `StagePage` for its additional metrics.

![Screenshot from 2019-08-10 05-37-25](https://user-images.githubusercontent.com/4736016/62807960-c4240f80-bb31-11e9-8e1a-1a44e2f91597.png)

We can then remove `additional-metrics.js`, which is no longer used anywhere.

## How was this patch tested?

After applying this change, I confirmed that `ExecutorsPage` and `StagePage` render properly and all checkboxes for additional metrics work.

Closes #25374 from sarutak/remove-additional-metrics.js.

Authored-by: Kousuke Saruta <sarutak@oss.nttdata.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2019-08-12 17:02:28 -07:00
Gengliang Wang 48d04f74ca [SPARK-28638][WEBUI] Task summary should only contain successful tasks' metrics
## What changes were proposed in this pull request?

Currently, when summary metrics are requested, cached data is returned if the current number of "SUCCESS" tasks is the same as the value in the cached data.
However, the number of "SUCCESS" tasks is counted incorrectly when there are running tasks. In `AppStatusStore`, the KVStore is an `ElementTrackingStore` instead of an `InMemoryStore`, and the value count is actually the number of "SUCCESS" tasks plus "RUNNING" tasks.
Thus, even after the running tasks finish, the out-of-date cached data is returned.

This PR is to fix the code in getting the number of "SUCCESS" tasks.

## How was this patch tested?

Tested manually: run
```
sc.parallelize(1 to 160, 40).map(i => Thread.sleep(i*100)).collect()
```
and keep refreshing the stage page; you can see that the task summary metrics are wrong.

### Before fix:
![image](https://user-images.githubusercontent.com/1097932/62560343-6a141780-b8af-11e9-8942-d88540659a93.png)

### After fix:
![image](https://user-images.githubusercontent.com/1097932/62560355-7009f880-b8af-11e9-8ba8-10c083a48d7b.png)

Closes #25369 from gengliangwang/fixStagePage.

Authored-by: Gengliang Wang <gengliang.wang@databricks.com>
Signed-off-by: Marcelo Vanzin <vanzin@cloudera.com>
2019-08-12 11:47:29 -07:00
WeichenXu 0f2efe6825 [SPARK-28366][CORE][FOLLOW-UP] Refine logging in driver when loading single large unsplittable file
## What changes were proposed in this pull request?

* Add a log message in `NewHadoopRDD`.
* Remove wording in the log messages that referred to a specific user API.

## How was this patch tested?

Manual.

Closes #25391 from WeichenXu123/log_sf.

Authored-by: WeichenXu <weichen.xu@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2019-08-12 19:15:00 +08:00
Kousuke Saruta 31ef268bae [SPARK-28639][CORE][DOC] Configuration doc for Barrier Execution Mode
## What changes were proposed in this pull request?

SPARK-24817 and SPARK-24819 introduced 3 new non-internal properties for barrier execution mode, but they are not documented.
So I've added a section about barrier execution mode to configuration.md.

## How was this patch tested?
Built the docs with Jekyll and confirmed the layout in a browser.

Closes #25370 from sarutak/barrier-exec-mode-conf-doc.

Authored-by: Kousuke Saruta <sarutak@oss.nttdata.com>
Signed-off-by: Sean Owen <sean.owen@databricks.com>
2019-08-11 08:13:19 -05:00
Kousuke Saruta dd5599efaf [SPARK-28677][WEBUI] "Select All" checkbox in StagePage doesn't work properly
## What changes were proposed in this pull request?

In StagePage, only the first optional column (Scheduler Delay, in this case) appears even though the "Select All" checkbox is checked.

![Screenshot from 2019-08-09 18-46-05](https://user-images.githubusercontent.com/4736016/62771600-8f379e80-bad8-11e9-9faa-6da8d57739d2.png)

The cause is that the wrong method was used to manipulate multiple columns: `columns` should have been used, but `column` was used.
I've fixed this issue by replacing `column` with `columns`.

## How was this patch tested?

Confirmed the behavior of the checkbox.

![Screenshot from 2019-08-09 18-54-33](https://user-images.githubusercontent.com/4736016/62771614-98c10680-bad8-11e9-9cc0-5879ac47d1e1.png)

Closes #25397 from sarutak/fix-stagepage.js.

Authored-by: Kousuke Saruta <sarutak@oss.nttdata.com>
Signed-off-by: Sean Owen <sean.owen@databricks.com>
2019-08-10 16:51:12 -05:00
younggyu chun 8535df7261 [MINOR] Fix typos in comments and replace an explicit type with <>
## What changes were proposed in this pull request?
This PR fixes typos in comments and replaces explicit type arguments with the diamond operator (`<>`) for Java 8+.

## How was this patch tested?
Manually tested.

Closes #25338 from younggyuchun/younggyu.

Authored-by: younggyu chun <younggyuchun@gmail.com>
Signed-off-by: Sean Owen <sean.owen@databricks.com>
2019-08-10 16:47:11 -05:00
Ajith ef80c32266 [SPARK-28676][CORE] Avoid Excessive logging from ContextCleaner
## What changes were proposed in this pull request?

In high-workload environments, ContextCleaner produces excessive logging at INFO level that does not convey much information. In one particular case we saw that the ``INFO ContextCleaner: Cleaned accumulator`` message made up 25-30% of the generated logs. We can log this cleanup information at DEBUG level instead.

## How was this patch tested?

This does not modify any functionality; it just changes the ContextCleaner cleanup log level to DEBUG.

Closes #25396 from ajithme/logss.

Authored-by: Ajith <ajith2489@gmail.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2019-08-09 15:49:20 -07:00
wuyi cbad616d4c [SPARK-27371][CORE] Support GPU-aware resources scheduling in Standalone
## What changes were proposed in this pull request?

In this PR, we implement a complete process for GPU-aware resource scheduling
in Standalone mode. The whole process looks like this: a Worker sets up isolated
resources when it starts up and registers with the Master along with its resources.
The Master then picks usable Workers according to the driver/executor resource
requirements and launches drivers/executors on them. The Worker launches the
driver/executor after preparing a resources file, created under the driver/executor's
working directory, containing the resource addresses assigned by the Master. When a
driver/executor finishes, its resources are recycled back to the Worker. Finally, when
a Worker stops, it always releases its resources first.

For the case where Workers and Drivers in **client** mode run on the same host, we introduce
a config option named `spark.resources.coordinate.enable` (default true) to indicate
whether Spark should coordinate resources for the user. If `spark.resources.coordinate.enable=false`, the user is responsible for configuring different resources for Workers and Drivers when using a resources file or discovery script. If true, Spark helps the user assign different resources to Workers and Drivers.

The solution for Spark to coordinate resources among Workers and Drivers is:

Generally, a shared file named *____allocated_resources____.json* is used to sync allocated
resources info among Workers and Drivers on the same host.

After a Worker or Driver has found all resources using the configured resources file and/or
discovery script during launch, it filters out available resources by excluding those already allocated in *____allocated_resources____.json*, and then acquires resources from the remaining ones according to its own requirements. After that, it writes its allocated resources, along with its process id (pid), into *____allocated_resources____.json*. The pid (proposed by tgravescs) is used to check whether the allocated resources are still valid in case a Worker or Driver crashes and doesn't release its resources properly. When a Worker or Driver finishes, it normally cleans up its own allocated resources in *____allocated_resources____.json*.

Note that we always take a file lock before any access to *____allocated_resources____.json*
and release the lock afterwards.

Furthermore, we append resource info to `WorkerSchedulerStateResponse` to work
around the Master failover behaviour in HA mode.

## How was this patch tested?

Added unit tests in WorkerSuite, MasterSuite, SparkContextSuite.

Manually tested in client/cluster mode (e.g. with multiple workers) on a single-node Standalone deployment.

Closes #25047 from Ngone51/SPARK-27371.

Authored-by: wuyi <ngone_5451@163.com>
Signed-off-by: Thomas Graves <tgraves@apache.org>
2019-08-09 07:49:03 -05:00
Jungtaek Lim (HeartSaVioR) 128ea37bda [SPARK-28601][CORE][SQL] Use StandardCharsets.UTF_8 instead of "UTF-8" string representation, and get rid of UnsupportedEncodingException
## What changes were proposed in this pull request?

This patch keeps things consistent wherever the UTF-8 charset is needed by using `StandardCharsets.UTF_8` instead of the string "UTF-8". Where a `String` is needed, `StandardCharsets.UTF_8.name()` is used.

This change also brings the benefit of getting rid of `UnsupportedEncodingException`, as we're providing `Charset` instead of `String` whenever possible.

This also changes some private Catalyst helper methods to operate on encodings as `Charset` objects rather than strings.
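
As an illustration (not code from this PR), the `Charset`-based JDK overloads make the difference clear:

```scala
import java.nio.charset.StandardCharsets

// The Charset overloads don't declare the checked UnsupportedEncodingException
// that the String-based overloads ("UTF-8") do.
val bytes = "Apache Spark".getBytes(StandardCharsets.UTF_8)
val text  = new String(bytes, StandardCharsets.UTF_8)
val name  = StandardCharsets.UTF_8.name()  // "UTF-8", when a String is still required
```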

## How was this patch tested?

Existing unit tests.

Closes #25335 from HeartSaVioR/SPARK-28601.

Authored-by: Jungtaek Lim (HeartSaVioR) <kabhwan@gmail.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2019-08-05 20:45:54 -07:00
wuyi 94499af6f0 [SPARK-28486][CORE][PYTHON] Map PythonBroadcast's data file to a BroadcastBlock to avoid delete by GC
## What changes were proposed in this pull request?

Currently, PythonBroadcast may delete its data file while a Python worker still needs it. This happens because PythonBroadcast overrides the `finalize()` method to delete its data file, so when GC runs and there are no more references to the broadcast variable, `finalize()` may be triggered and delete the
data file. That also means the data behind a Python broadcast variable cannot be deleted when `unpersist()`/`destroy()` is called; it relies on GC instead.

In this PR, we remove the `finalize()` method and map the PythonBroadcast data file to a BroadcastBlock (which has the same broadcast id as the broadcast variable that wraps this PythonBroadcast) when PythonBroadcast is deserialized. As a result, the data file can be deleted just like the other pieces of the broadcast variable when `unpersist()`/`destroy()` is called, and no longer relies on GC.

## How was this patch tested?

Added a Python test, and tested manually (verified creating/deleting the broadcast block).

Closes #25262 from Ngone51/SPARK-28486.

Authored-by: wuyi <ngone_5451@163.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2019-08-05 20:18:53 +09:00
Yuanjian Li db39f45baf [SPARK-28593][CORE] Rename ShuffleClient to BlockStoreClient, which is closer to its usage
## What changes were proposed in this pull request?

After SPARK-27677, the shuffle client not only handles shuffle blocks but is also responsible for locally persisted RDD blocks. For better code scalability and more precise semantics (as in the [discussion](https://github.com/apache/spark/pull/24892#discussion_r300173331)), we made several changes here:

- Rename ShuffleClient to BlockStoreClient.
- Correspondingly rename the ExternalShuffleClient to ExternalBlockStoreClient, also change the server-side class from ExternalShuffleBlockHandler to ExternalBlockHandler.
- Move MesosExternalBlockStoreClient to Mesos package.

Note, we still keep the name BlockTransferService, because the `Service` contains both client and server, and the name BlockTransferService does not refer to the shuffle client only.

## How was this patch tested?

Existing UT.

Closes #25327 from xuanyuanking/SPARK-28593.

Lead-authored-by: Yuanjian Li <xyliyuanjian@gmail.com>
Co-authored-by: Yuanjian Li <yuanjian.li@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2019-08-05 14:54:45 +08:00
yunzoud c212c9d9ed [SPARK-28574][CORE] Allow configuring different sizes for event queues
## What changes were proposed in this pull request?
Add the configuration `spark.scheduler.listenerbus.eventqueue.${name}.capacity` to allow configuring a different capacity for each event queue.
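
A hedged usage sketch (the queue name `appStatus` and the sizes below are illustrative, not taken from this PR):

```scala
import org.apache.spark.SparkConf

// Shared default capacity for all listener bus queues, plus a larger
// per-queue override for the (illustrative) "appStatus" queue.
val conf = new SparkConf()
  .set("spark.scheduler.listenerbus.eventqueue.capacity", "10000")
  .set("spark.scheduler.listenerbus.eventqueue.appStatus.capacity", "30000")
```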

## How was this patch tested?
Unit test in core/src/test/scala/org/apache/spark/scheduler/SparkListenerSuite.scala

Closes #25307 from yunzoud/SPARK-28574.

Authored-by: yunzoud <yun.zou@databricks.com>
Signed-off-by: Shixiong Zhu <zsxwing@gmail.com>
2019-08-02 15:27:33 -07:00
Nick Karpov 6d32deeecc [SPARK-28475][CORE] Add regex MetricFilter to GraphiteSink
## What changes were proposed in this pull request?

Today all registered metric sources are reported to the GraphiteSink with no filtering mechanism, although the codahale project does support one.

GraphiteReporter (ScheduledReporter) from the codahale project requires you to implement and supply the MetricFilter interface (the codahale project ships only a single implementation by default, MetricFilter.ALL).

This PR proposes adding an additional regex config to match and filter the metrics reported to the GraphiteSink.
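
A minimal sketch of the mechanism (not the exact code added to GraphiteSink): the codahale `GraphiteReporter` accepts a `MetricFilter`, and a regex-based implementation only passes metric keys that match.

```scala
import java.util.regex.Pattern
import com.codahale.metrics.{Metric, MetricFilter}

// Only report metrics whose name matches the configured regex
// (the pattern below is illustrative).
def regexFilter(regex: String): MetricFilter = new MetricFilter {
  private val pattern = Pattern.compile(regex)
  override def matches(name: String, metric: Metric): Boolean =
    pattern.matcher(name).matches()
}

val executorMetricsOnly = regexFilter(".*executor.*")
```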

## How was this patch tested?

Included a GraphiteSinkSuite that tests:

1. Absence of regex filter (existing default behavior maintained)
2. Presence of `regex=<regexexpr>` correctly filters metric keys

Closes #25232 from nkarpov/graphite_regex.

Authored-by: Nick Karpov <nick@nickkarpov.com>
Signed-off-by: jerryshao <jerryshao@tencent.com>
2019-08-02 17:50:15 +08:00
Marcelo Vanzin 607fb87906 [SPARK-28584][CORE] Fix thread safety issue in blacklist timer, tests
There's a small, probably very hard to hit thread-safety issue in the blacklist
abort timers in the task scheduler, where they access a non-thread-safe map without
locks.

In the tests, the code was also calling methods on the TaskSetManager without
holding the proper locks, which could cause threads to call non-thread-safe
TSM methods concurrently.

Closes #25317 from vanzin/SPARK-28584.

Authored-by: Marcelo Vanzin <vanzin@cloudera.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2019-08-01 10:37:47 -07:00
Wing Yew Poon 80ab19b9fd [SPARK-26329][CORE] Faster polling of executor memory metrics.
## What changes were proposed in this pull request?

Prior to this change, in an executor, on each heartbeat, memory metrics are polled and sent in the heartbeat. The heartbeat interval is 10s by default. With this change, in an executor, memory metrics can optionally be polled in a separate poller at a shorter interval.

For each executor, we use a map of (stageId, stageAttemptId) to (count of running tasks, executor metric peaks) to track what stages are active as well as the per-stage memory metric peaks. When polling the executor memory metrics, we attribute the memory to the active stage(s), and update the peaks. In a heartbeat, we send the per-stage peaks (for stages active at that time), and then reset the peaks. The semantics would be that the per-stage peaks sent in each heartbeat are the peaks since the last heartbeat.

We also keep a map of taskId to memory metric peaks. This tracks the metric peaks during the lifetime of the task. The polling thread updates this as well. At end of a task, we send the peak metric values in the task result. In case of task failure, we send the peak metric values in the `TaskFailedReason`.
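
A hedged sketch of the bookkeeping described above (names and types are illustrative, not the actual fields in the executor code):

```scala
import scala.collection.mutable

// (stageId, stageAttemptId) -> (number of running tasks, per-metric peak values)
val stagePeaks = mutable.Map.empty[(Int, Int), (Int, Array[Long])]

// taskId -> per-metric peak values over the lifetime of the task
val taskPeaks = mutable.Map.empty[Long, Array[Long]]

// On each poll, fold the freshly sampled metric values into the stored peaks
// (both arrays are assumed to have one slot per tracked metric).
def updatePeaks(peaks: Array[Long], sample: Array[Long]): Unit =
  for (i <- peaks.indices) {
    peaks(i) = math.max(peaks(i), sample(i))
  }
```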

We continue to do the stage-level aggregation in the EventLoggingListener.

For the driver, we still only poll on heartbeats. What the driver sends will be the current values of the metrics in the driver at the time of the heartbeat. This is semantically the same as before.

## How was this patch tested?

Unit tests. Manually tested applications on an actual system and checked the event logs; the metrics appear in the SparkListenerTaskEnd and SparkListenerStageExecutorMetrics events.

Closes #23767 from wypoon/wypoon_SPARK-26329.

Authored-by: Wing Yew Poon <wypoon@cloudera.com>
Signed-off-by: Imran Rashid <irashid@cloudera.com>
2019-08-01 09:09:46 -05:00
WeichenXu 26d03b62e2 [SPARK-28366][CORE] Logging in driver when loading single large unsplittable file
## What changes were proposed in this pull request?

Log in the driver when loading a single large unsplittable file via `sc.textFile` or the csv/json datasource.
The current conditions that trigger the logging are:
* only one partition is generated
* the file is unsplittable; possible reasons are:
   - compressed by an unsplittable compression algorithm such as gzip
   - multiLine mode in the csv/json datasource
   - wholeText mode in the text datasource
* the file size exceeds the config threshold `spark.io.warning.largeFileThreshold` (default value is 1GB)

## How was this patch tested?

Manually tested.
Generate a gzip file exceeding 1GB:
```
base64 -b 50 /dev/urandom | head -c 2000000000 > file1.txt
cat file1.txt | gzip > file1.gz
```
then launch spark-shell,

run
```
sc.textFile("file:///path/to/file1.gz").count()
```
It will print a log like:
```
WARN HadoopRDD: Loading one large unsplittable file file:/.../f1.gz with only one partition, because the file is compressed by unsplittable compression codec
```

run
```
sc.textFile("file:///path/to/file1.txt").count()
```
It will print a log like:
```
WARN HadoopRDD: Loading one large file file:/.../f1.gz with only one partition, we can increase partition numbers by the `minPartitions` argument in method `sc.textFile
```

run
```
spark.read.csv("file:///path/to/file1.gz").count
```
It will print a log like:
```
WARN CSVScan: Loading one large unsplittable file file:/.../f1.gz with only one partition, the reason is: the file is compressed by unsplittable compression codec
```

run
```
spark.read.option("multiLine", true).csv("file:///path/to/file1.gz").count
```
It will print a log like:
```
WARN CSVScan: Loading one large unsplittable file file:/.../f1.gz with only one partition, the reason is: the csv datasource is set multiLine mode
```

The JSON and text datasources were also tested with similar cases.

Closes #25134 from WeichenXu123/log_gz.

Authored-by: WeichenXu <weichen.xu@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2019-08-01 20:29:18 +08:00
Marcelo Vanzin b3ffd8be14 [SPARK-24352][CORE][TESTS] De-flake StandaloneDynamicAllocationSuite blacklist test
The issue is that the test tried to stop an existing scheduler and replace it with
a new one set up for the test. That can cause issues because both were sharing the
same RpcEnv underneath, and unregistering RpcEndpoints is actually asynchronous
(see comment in Dispatcher.unregisterRpcEndpoint). So that could lead to races where
the new scheduler tried to register before the old one was fully unregistered.

The updated test avoids the issue by using a separate RpcEnv / scheduler instance
altogether, and also avoids a misleading NPE in the test logs.

Closes #25318 from vanzin/SPARK-24352.

Authored-by: Marcelo Vanzin <vanzin@cloudera.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2019-07-31 17:44:20 -07:00
sychen 70ef9064a8 [SPARK-28564][CORE] Access history application defaults to the last attempt id
## What changes were proposed in this pull request?
When we set ```spark.history.ui.maxApplications``` to a small value, some apps cannot be found through the page search.
If the URL is constructed by hand (http://localhost:18080/history/local-xxx), it can still be accessed as long as the app has no attempt id.
But for apps with multiple attempts, such a URL cannot be accessed, and the page displays Not Found.

## How was this patch tested?
Added a unit test.

Closes #25301 from cxzl25/hs_app_last_attempt_id.

Authored-by: sychen <sychen@ctrip.com>
Signed-off-by: Marcelo Vanzin <vanzin@cloudera.com>
2019-07-31 13:24:36 -07:00
HyukjinKwon b8e13b0aea [SPARK-28153][PYTHON] Use AtomicReference at InputFileBlockHolder (to support input_file_name with Python UDF)
## What changes were proposed in this pull request?

This PR proposes to use `AtomicReference` so that parent and child threads can access to the same file block holder.

Python UDF expressions are turned into a plan which then launches a separate thread to consume the input iterator. In that separate child thread, the iterator calls `InputFileBlockHolder.set` before the parent does, and the parent thread is then unable to read the value later.

1. If the separate child thread happens to call `InputFileBlockHolder.set` first, before the parent's thread local is initialized (initialization happens at the first call to `ThreadLocal.get()`), the child thread ends up calling its own `initialValue` to initialize it.

2. After that, the parent calls its own `initialValue` to initialize its copy at its first call to `ThreadLocal.get()`.

3. The two threads now hold two different references, so updates made in the child aren't reflected in the parent.

This PR fixes this by initializing the parent's thread local with an `AtomicReference` for the file status, so that it can be used in each task and the child thread's updates are visible to the parent.
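
A simplified sketch of the idea (the object and method names here are illustrative, not the real `InputFileBlockHolder` code):

```scala
import java.util.concurrent.atomic.AtomicReference

object FileBlockHolderSketch {
  // The inheritable thread-local stores an AtomicReference, so a child thread
  // spawned by the Python runner and its parent share the same reference and
  // see each other's updates.
  private val inputFile = new InheritableThreadLocal[AtomicReference[String]] {
    override protected def initialValue(): AtomicReference[String] =
      new AtomicReference[String]("")
  }

  def set(path: String): Unit = inputFile.get().set(path)
  def getInputFilePath: String = inputFile.get().get()
}
```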

I also tried to explain this a bit more at https://github.com/apache/spark/pull/24958#discussion_r297203041.

## How was this patch tested?

Manually tested, and a unit test was added.

Closes #24958 from HyukjinKwon/SPARK-28153.

Authored-by: HyukjinKwon <gurwls223@apache.org>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2019-07-31 22:40:01 +08:00
mcheah abef84a868 [SPARK-28209][CORE][SHUFFLE] Proposed new shuffle writer API
## What changes were proposed in this pull request?

As part of the shuffle storage API proposed in SPARK-25299, this introduces an API for persisting shuffle data in arbitrary storage systems.

This patch introduces several concepts:
* `ShuffleDataIO`, which is the root of the entire plugin tree that will be proposed over the course of the shuffle API project.
* `ShuffleExecutorComponents` - the subset of plugins for managing shuffle-related components for each executor. This will in turn instantiate shuffle readers and writers.
* `ShuffleMapOutputWriter` interface - instantiated once per map task. This provides child `ShufflePartitionWriter` instances for persisting the bytes for each partition in the map task.

The default implementation of these plugins exactly mirror what was done by the existing shuffle writing code - namely, writing the data to local disk and writing an index file. We leverage the APIs in the `BypassMergeSortShuffleWriter` only. Follow-up PRs will use the APIs in `SortShuffleWriter` and `UnsafeShuffleWriter`, but are left as future work to minimize the review surface area.
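
A hedged sketch of how these pieces fit together (simplified Scala signatures, not the exact Java interfaces introduced by this PR):

```scala
import java.io.OutputStream

// Root plugin: configured once per application.
trait ShuffleDataIO {
  def executor(): ShuffleExecutorComponents
}

// Per-executor components; creates writers for each map task.
trait ShuffleExecutorComponents {
  def createMapOutputWriter(
      shuffleId: Int, mapTaskId: Long, numPartitions: Int): ShuffleMapOutputWriter
}

// One instance per map task; hands out a writer per reduce partition.
trait ShuffleMapOutputWriter {
  def getPartitionWriter(reducePartitionId: Int): ShufflePartitionWriter
  def commitAllPartitions(): Unit
}

// Destination for the bytes of a single partition.
trait ShufflePartitionWriter {
  def openStream(): OutputStream
}
```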

## How was this patch tested?

New unit tests were added. Micro-benchmarks indicate there's no slowdown in the affected code paths.

Closes #25007 from mccheah/spark-shuffle-writer-refactor.

Lead-authored-by: mcheah <mcheah@palantir.com>
Co-authored-by: mccheah <mcheah@palantir.com>
Signed-off-by: Marcelo Vanzin <vanzin@cloudera.com>
2019-07-30 14:17:30 -07:00
pgandhi 70910e6ad0 [SPARK-26755][SCHEDULER] Optimize Spark Scheduler to dequeue speculative tasks more efficiently

This PR improves the performance of scheduling speculative tasks to be O(1) instead of O(numSpeculativeTasks), using the same approach used for scheduling regular tasks. The performance of this method is particularly important because a lock is held on the TaskSchedulerImpl which is a bottleneck for all scheduling operations. We ran a Join query on a large dataset with speculation enabled and out of 100000 tasks for the ShuffleMapStage, the maximum number of speculatable tasks that was noted was close to 7900-8000 at a point. That is when we start seeing the bottleneck on the scheduler lock.

In particular, this works by storing a separate stack of tasks by executor, node, and rack locality preferences. Then when trying to schedule a speculative task, rather than scanning all speculative tasks to find ones which match the given executor (or node, or rack) preference, we can jump to a quick check of tasks matching the resource offer. This technique was already used for regular tasks -- this change refactors the code to allow sharing with regular and speculative task execution.
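
A hedged sketch of the data layout (names are illustrative): speculatable task indexes are bucketed by locality preference, so an offer for a given executor, host, or rack can pop a matching task directly.

```scala
import scala.collection.mutable

class SpeculatableTaskBuckets {
  // Task indexes keyed by preferred executor, host, and rack.
  val forExecutor = mutable.HashMap.empty[String, mutable.ArrayBuffer[Int]]
  val forHost     = mutable.HashMap.empty[String, mutable.ArrayBuffer[Int]]
  val forRack     = mutable.HashMap.empty[String, mutable.ArrayBuffer[Int]]
  // Tasks with no locality preference, and the full list as a fallback.
  val noPrefs     = mutable.ArrayBuffer.empty[Int]
  val all         = mutable.ArrayBuffer.empty[Int]

  /** Pop the most recently added speculatable task for this executor, if any. */
  def dequeueForExecutor(execId: String): Option[Int] = {
    val buf = forExecutor.getOrElse(execId, mutable.ArrayBuffer.empty[Int])
    if (buf.nonEmpty) Some(buf.remove(buf.length - 1)) else None
  }
}
```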

## What changes were proposed in this pull request?

The main queue "speculatableTasks" has been split into 5 separate queues based on locality preference, similar to how normal tasks are enqueued. Thus, the "dequeueSpeculativeTask" method avoids performing locality checks for each task at runtime and simply returns the preferred task to execute.

## How was this patch tested?
We ran a spark job that performed a join on a 10 TB dataset to test the code change.
Original Code:
<img width="1433" alt="screen shot 2019-01-28 at 5 07 22 pm" src="https://user-images.githubusercontent.com/22228190/51873321-572df280-2322-11e9-9149-0aae08d5edc6.png">

Optimized Code:
<img width="1435" alt="screen shot 2019-01-28 at 5 08 19 pm" src="https://user-images.githubusercontent.com/22228190/51873343-6745d200-2322-11e9-947b-2cfd0f06bcab.png">

As you can see, the run time of the ShuffleMapStage came down from 40 min to 6 min approximately, thus, reducing the overall running time of the spark job by a significant amount.

Another example for the same job:

Original Code:
<img width="1440" alt="screen shot 2019-01-28 at 5 11 30 pm" src="https://user-images.githubusercontent.com/22228190/51873355-70cf3a00-2322-11e9-9c3a-af035449a306.png">

Optimized Code:
<img width="1440" alt="screen shot 2019-01-28 at 5 12 16 pm" src="https://user-images.githubusercontent.com/22228190/51873367-7dec2900-2322-11e9-8d07-1b1b49285f71.png">

Closes #23677 from pgandhi999/SPARK-26755.

Lead-authored-by: pgandhi <pgandhi@verizonmedia.com>
Co-authored-by: pgandhi <pgandhi@oath.com>
Signed-off-by: Imran Rashid <irashid@cloudera.com>
2019-07-30 09:54:51 -05:00
Lee Dongjin d98aa2a184 [MINOR] Trivial cleanups
These are things I found while working on #22282.

- Remove unused value: `UnsafeArraySuite#defaultTz`
- Remove redundant new modifier to the case class, `KafkaSourceRDDPartition`
- Remove unused variables from `RDD.scala`
- Remove trailing space from `structured-streaming-kafka-integration.md`
- Remove redundant parameter from `ArrowConvertersSuite`: `nullable` is `true` by default.
- Remove leading empty line: `UnsafeRow`
- Remove trailing empty line: `KafkaTestUtils`
- Remove unthrown exception type: `UnsafeMapData`
- Replace unused declarations: `expressions`
- Remove duplicated default parameter: `AnalysisErrorSuite`
- `ObjectExpressionsSuite`: remove duplicated parameters, conversions and unused variable

Closes #25251 from dongjinleekr/cleanup/201907.

Authored-by: Lee Dongjin <dongjin@apache.org>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2019-07-29 23:38:02 +09:00
Dongjoon Hyun a428f40669 [SPARK-28549][BUILD][CORE][SQL] Use text.StringEscapeUtils instead lang3.StringEscapeUtils
## What changes were proposed in this pull request?

`org.apache.commons.lang3.StringEscapeUtils` was deprecated over two years ago at [LANG-1316](https://issues.apache.org/jira/browse/LANG-1316). There have been no bug fixes since then.
```java
/**
 * <p>Escapes and unescapes {@code String}s for
 * Java, Java Script, HTML and XML.</p>
 *
 * <p>#ThreadSafe#</p>
 * @since 2.0
 * @deprecated as of 3.6, use commons-text
 * <a href="https://commons.apache.org/proper/commons-text/javadocs/api-release/org/apache/commons/text/StringEscapeUtils.html">
 * StringEscapeUtils</a> instead
 */
@Deprecated
public class StringEscapeUtils {
```

This PR aims to use the latest one from `commons-text` module which has more bug fixes like
[TEXT-100](https://issues.apache.org/jira/browse/TEXT-100), [TEXT-118](https://issues.apache.org/jira/browse/TEXT-118) and [TEXT-120](https://issues.apache.org/jira/browse/TEXT-120) by the following replacement.
```scala
-import org.apache.commons.lang3.StringEscapeUtils
+import org.apache.commons.text.StringEscapeUtils
```

This adds a new dependency to the `hadoop-2.7` profile distribution. The `hadoop-3.2` profile already has it.
```
+commons-text-1.6.jar
```

## How was this patch tested?

Pass the Jenkins with the existing tests.
- [Hadoop 2.7](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/108281)
- [Hadoop 3.2](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/108282)

Closes #25281 from dongjoon-hyun/SPARK-28549.

Authored-by: Dongjoon Hyun <dhyun@apple.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2019-07-29 11:45:29 +09:00
Marcelo Vanzin 7f84104b39 [SPARK-28535][CORE][TEST] Slow down tasks to de-flake JobCancellationSuite
This test tries to detect correct behavior in racy code, where the event
thread is racing with the executor thread that's trying to kill the running
task.

If the event that signals the stage end arrives first, any delay in the
delivery of the message to kill the task causes the code to rapidly process
elements, and may cause the test assertion to fail. Adding a 10ms delay in
LocalSchedulerBackend before the task kill makes the test run through
~1000 elements. A longer delay can easily cause the 10000 elements to
be processed.

Instead, by adding a small delay (10ms) in the test code that processes
elements, there's a much lower probability that the kill event will not
arrive before the end; that leaves a window of 100s for the event
to be delivered to the executor. And because each element only sleeps for
10ms, the test is not really slowed down at all.

Closes #25270 from vanzin/SPARK-28535.

Authored-by: Marcelo Vanzin <vanzin@cloudera.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2019-07-27 11:06:35 -07:00
Luca Canali f2a2d980ed [SPARK-25285][CORE] Add startedTasks and finishedTasks to the metrics system in the executor instance
## What changes were proposed in this pull request?

The motivation for these additional metrics is to help in troubleshooting and monitoring the task execution workload when running on a cluster. Currently available metrics include executor threadpool metrics for completed tasks and for active tasks. Adding a threadpool startedTasks metric allows, for example, collecting info on the (approximate) number of failed tasks by computing the difference startedTasks - (active tasks + completed tasks and/or successfully finished tasks).
The proposed finishedTasks metric is also intended for this type of troubleshooting. The difference between finishedTasks and threadpool.completeTasks is that the latter is a (dropwizard library) gauge taken from the threadpool, while the former is a (dropwizard) counter computed in the [[Executor]] class when a task successfully finishes, together with several other task metric counters.
Note that there are similarities with some of the metrics introduced in SPARK-24398; however, there are key differences coming from the fact that this PR concerns the executor source, so it provides metric values per executor, and the metric values do not need to pass through the listener bus in this case.

## How was this patch tested?

Manually tested on a YARN cluster

Closes #22290 from LucaCanali/AddMetricExecutorStartedTasks.

Lead-authored-by: Luca Canali <luca.canali@cern.ch>
Co-authored-by: LucaCanali <luca.canali@cern.ch>
Signed-off-by: Marcelo Vanzin <vanzin@cloudera.com>
2019-07-26 14:03:57 -07:00
Marcelo Vanzin a3e013391e [SPARK-28455][CORE] Avoid overflow when calculating executor timeout time
This would cause the timeout time to be negative, so executors would be
timed out immediately (instead of never).

I also tweaked a couple of log messages that could get pretty long when
lots of executors were active.

Added unit test (which failed without the fix).

Closes #25208 from vanzin/SPARK-28455.

Authored-by: Marcelo Vanzin <vanzin@cloudera.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2019-07-22 14:31:58 -07:00
Josh Rosen 3776fbdfde [SPARK-28430][UI] Fix stage table rendering when some tasks' metrics are missing
## What changes were proposed in this pull request?

The Spark UI's stages table misrenders the input/output metrics columns when some tasks are missing input metrics. See the screenshot below for an example of the problem:

![image](https://user-images.githubusercontent.com/50748/61420042-a3abc100-a8b5-11e9-8a92-7986563ee712.png)

This is because those columns are defined as

```scala
 {if (hasInput(stage)) {
  metricInfo(task) { m =>
    ...
   <td>....</td>
  }
}
```

where `metricInfo` renders the node returned by the closure in case metrics are defined or returns `Nil` in case metrics are not defined. If metrics are undefined then we'll fail to render the empty `<td></td>` tag, causing columns to become misaligned as shown in the screenshot.

To fix this, this patch changes this to

```scala
 {if (hasInput(stage)) {
  <td>{
    metricInfo(task) { m =>
      ...
     Unparsed(...)
    }
  }</td>
}
```

which is an idiom that's already in use for the shuffle read / write columns.

## How was this patch tested?

It isn't. I'm arguing for correctness because the modifications are consistent with rendering methods that work correctly for other columns.

Closes #25183 from JoshRosen/joshrosen/fix-task-table-with-partial-io-metrics.

Authored-by: Josh Rosen <rosenville@gmail.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2019-07-18 13:15:39 -07:00
Marcelo Vanzin 2ddeff97d7 [SPARK-27963][CORE] Allow dynamic allocation without a shuffle service.
This change adds a new option that enables dynamic allocation without
the need for a shuffle service. This mode works by tracking which stages
generate shuffle files, and keeping executors that generate data for those
shuffles alive while the jobs that use them are active.

A separate timeout is also added for shuffle data; so that executors that
hold shuffle data can use a separate timeout before being removed because
of being idle. This allows the shuffle data to be kept around in case it
is needed by some new job, or allow users to be more aggressive in timing
out executors that don't have shuffle data in active use.

The code also hooks up to the context cleaner so that shuffles that are
garbage collected are detected, and the respective executors not held
unnecessarily.

Testing done with added unit tests, and also with TPC-DS workloads on
YARN without a shuffle service.

Closes #24817 from vanzin/SPARK-27963.

Authored-by: Marcelo Vanzin <vanzin@cloudera.com>
Signed-off-by: Marcelo Vanzin <vanzin@cloudera.com>
2019-07-16 16:37:38 -07:00
Angers be4a55220a [SPARK-28106][SQL] When Spark SQL uses "ADD JAR", check that the jar path exists before adding it to SparkContext
## What changes were proposed in this pull request?
ISSUE: https://issues.apache.org/jira/browse/SPARK-28106
When we use ADD JAR in SQL, it goes through three steps:

- add the jar to HiveClient's classloader
- HiveClientImpl.runHiveSQL("ADD JAR" + PATH)
- SessionStateBuilder.addJar

The second step seems to have no impact on the whole process, since even if it fails we can still execute.
The first step adds the jar path to HiveClient's ClassLoader, so we can use the jar in HiveClientImpl.
The third step adds the jar path to SparkContext. For a local file path, it calls RpcServer's FileServer to add it to the environment, so passing a wrong path causes an error. But if you pass an HDFS or VIEWFS path, it isn't checked and is simply added to the jar path map.

Then, when the next TaskSetManager sends out a task, this path is carried by the TaskDescription. The executor then calls updateDependencies, which checks all jar paths and file paths in the TaskDescription, and an error like the one below occurs:

![image](https://user-images.githubusercontent.com/46485123/59817635-4a527f80-9353-11e9-9e08-9407b2b54023.png)
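
In other words, after this change a non-existent remote path fails fast when the `ADD JAR` statement runs, instead of later when executors fetch dependencies. A minimal illustration (the path below is hypothetical):

```scala
// With this change the existence of the path is checked here, at ADD JAR time,
// rather than when executors later call updateDependencies.
spark.sql("ADD JAR hdfs:///tmp/does-not-exist/my-udf.jar")
```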

## How was this patch tested?
Existing unit tests
Environment testing

Closes #24909 from AngersZhuuuu/SPARK-28106.

Lead-authored-by: Angers <angers.zhu@gamil.com>
Co-authored-by: 朱夷 <zhuyi01@corp.netease.com>
Signed-off-by: jerryshao <jerryshao@tencent.com>
2019-07-16 15:29:05 +08:00
Marcelo Vanzin 8d1e87ac90 [SPARK-28150][CORE][FOLLOW-UP] Don't try to log in when impersonating.
When fetching delegation tokens for a proxy user, don't try to log in,
since it will fail.

Closes #25141 from vanzin/SPARK-28150.2.

Authored-by: Marcelo Vanzin <vanzin@cloudera.com>
Signed-off-by: Marcelo Vanzin <vanzin@cloudera.com>
2019-07-15 10:32:34 -07:00
Liang-Chi Hsieh 591de42351 [SPARK-28381][PYSPARK] Upgraded version of Pyrolite to 4.30
## What changes were proposed in this pull request?

This upgrades Pyrolite to a newer version. Most updates [1] in the newer version are for the .NET side. For Java, it includes a bug fix to Unpickler regarding cleaning up the Unpickler memo, and support for protocol 5.

After upgrading, we can remove the fix at SPARK-27629 for the bug in Unpickler.

[1] https://github.com/irmen/Pyrolite/compare/pyrolite-4.23...master

## How was this patch tested?

Manually tested locally on Python 3.6 with the existing tests.

Closes #25143 from viirya/upgrade-pyrolite.

Authored-by: Liang-Chi Hsieh <viirya@gmail.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2019-07-15 12:29:58 +09:00
Jesse Cai 79e2047703 [SPARK-28355][CORE][PYTHON] Use Spark conf for threshold at which command is compressed by broadcast
## What changes were proposed in this pull request?

The `_prepare_for_python_RDD` method currently broadcasts a pickled command if its length is greater than the hardcoded value `1 << 20` (1M). This change sets this value as a Spark conf instead.

## How was this patch tested?

Unit tests, manual tests.

Closes #25123 from jessecai/SPARK-28355.

Authored-by: Jesse Cai <jesse.cai@databricks.com>
Signed-off-by: gatorsmile <gatorsmile@gmail.com>
2019-07-13 08:44:16 -07:00
Dongjoon Hyun 1c29212394 [SPARK-28357][CORE][TEST] Fix Flaky Test - FileAppenderSuite.rollingfile appender - size-based rolling compressed
## What changes were proposed in this pull request?

`SizeBasedRollingPolicy.shouldRollover` returns false when the size is equal to `rolloverSizeBytes`.
```scala
  /** Should rollover if the next set of bytes is going to exceed the size limit */
  def shouldRollover(bytesToBeWritten: Long): Boolean = {
    logDebug(s"$bytesToBeWritten + $bytesWrittenSinceRollover > $rolloverSizeBytes")
    bytesToBeWritten + bytesWrittenSinceRollover > rolloverSizeBytes
  }
```
- https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/107553/testReport/org.apache.spark.util/FileAppenderSuite/rolling_file_appender___size_based_rolling__compressed_/
```
org.scalatest.exceptions.TestFailedException: 1000 was not less than 1000
```

## How was this patch tested?

Pass the Jenkins with the updated test.

Closes #25125 from dongjoon-hyun/SPARK-28357.

Authored-by: Dongjoon Hyun <dhyun@apple.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2019-07-12 18:40:07 +09:00
Gabor Somogyi f83000597f [SPARK-23472][CORE] Add defaultJavaOptions for driver and executor.
## What changes were proposed in this pull request?

This PR adds two new config properties: `spark.driver.defaultJavaOptions` and `spark.executor.defaultJavaOptions`. These are intended to be set by administrators in a file of defaults for options like JVM garbage collection algorithm. Users will still set `extraJavaOptions` properties, and both sets of JVM options will be added to start a JVM (default options are prepended to extra options).
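
A hedged illustration of how the two properties compose (the option values are arbitrary examples; in practice the defaults would live in a spark-defaults.conf maintained by an administrator):

```scala
import org.apache.spark.SparkConf

val conf = new SparkConf()
  // Admin-provided defaults (e.g. GC algorithm) ...
  .set("spark.executor.defaultJavaOptions", "-XX:+UseG1GC -verbose:gc")
  // ... and user-provided extra options.
  .set("spark.executor.extraJavaOptions", "-XX:MaxGCPauseMillis=200")

// Resulting executor JVM options, defaults prepended to the extras:
//   -XX:+UseG1GC -verbose:gc -XX:MaxGCPauseMillis=200
```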

## How was this patch tested?

Existing + additional unit tests.
```
cd docs/
SKIP_API=1 jekyll build
```
Manual webpage check.

Closes #24804 from gaborgsomogyi/SPARK-23472.

Authored-by: Gabor Somogyi <gabor.g.somogyi@gmail.com>
Signed-off-by: Marcelo Vanzin <vanzin@cloudera.com>
2019-07-11 09:37:26 -07:00
Thomas Graves f84cca2d84 [SPARK-28234][CORE][PYTHON] Add python and JavaSparkContext support to get resources
## What changes were proposed in this pull request?

Add python api support and JavaSparkContext support for resources().  I needed the JavaSparkContext support for it to properly translate into python with the py4j stuff.

## How was this patch tested?

Unit tests added and manually tested in local cluster mode and on yarn.

Closes #25087 from tgravescs/SPARK-28234-python.

Authored-by: Thomas Graves <tgraves@nvidia.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2019-07-11 09:32:58 +09:00
Dongjoon Hyun a6506f0c8a [SPARK-28290][CORE][SQL] Use SslContextFactory.Server instead of SslContextFactory
## What changes were proposed in this pull request?

`SslContextFactory` is deprecated at Jetty 9.4 and we are using `9.4.18.v20190429`. This PR aims to replace it with `SslContextFactory.Server`.
- https://www.eclipse.org/jetty/javadoc/9.4.19.v20190610/org/eclipse/jetty/util/ssl/SslContextFactory.html
- https://www.eclipse.org/jetty/javadoc/9.3.24.v20180605/org/eclipse/jetty/util/ssl/SslContextFactory.html

```
[WARNING] /Users/dhyun/APACHE/spark/core/src/main/scala/org/apache/spark/SSLOptions.scala:71:
constructor SslContextFactory in class SslContextFactory is deprecated:
see corresponding Javadoc for more information.
[WARNING]       val sslContextFactory = new SslContextFactory()
[WARNING]                               ^
```
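
A minimal sketch of the replacement construction (not the full `SSLOptions` code):

```scala
import org.eclipse.jetty.util.ssl.SslContextFactory

// Use the server-side factory, which is what Spark's Jetty-based UI endpoints need.
val sslContextFactory = new SslContextFactory.Server()
```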

## How was this patch tested?

Pass the Jenkins with the existing tests.

Closes #25067 from dongjoon-hyun/SPARK-28290.

Authored-by: Dongjoon Hyun <dhyun@apple.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2019-07-10 09:05:56 -07:00
Dongjoon Hyun bbc2be4f42 [SPARK-28294][CORE] Support spark.history.fs.cleaner.maxNum configuration
## What changes were proposed in this pull request?

Up to now, Apache Spark maintains the given event log directory by **time** policy, `spark.history.fs.cleaner.maxAge`. However, there are two issues.
1. Some file system has a limitation on the maximum number of files in a single directory. For example, HDFS `dfs.namenode.fs-limits.max-directory-items` is 1024 * 1024 by default.
https://hadoop.apache.org/docs/r3.2.0/hadoop-project-dist/hadoop-hdfs/hdfs-default.xml
2. Spark is sometimes unable to clean up some old log files due to permission issues (mainly, security policy).

To handle both (1) and (2), this PR aims to support an additional policy configuration for the maximum number of files in the event log directory, `spark.history.fs.cleaner.maxNum`. Spark will try to keep the number of files in the event log directory according to this policy.
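
A hedged configuration sketch (values are illustrative; these settings normally go into the history server's spark-defaults.conf):

```scala
import org.apache.spark.SparkConf

val conf = new SparkConf()
  .set("spark.history.fs.cleaner.enabled", "true")
  .set("spark.history.fs.cleaner.maxAge", "7d")     // existing time-based policy
  .set("spark.history.fs.cleaner.maxNum", "100000") // new count-based policy
```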

## How was this patch tested?

Pass the Jenkins with a newly added test case.

Closes #25072 from dongjoon-hyun/SPARK-28294.

Authored-by: Dongjoon Hyun <dhyun@apple.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2019-07-10 07:19:47 -07:00
Wenchen Fan d41bd7c891 [SPARK-26713][CORE][FOLLOWUP] revert the partial fix in ShuffleBlockFetcherIterator
## What changes were proposed in this pull request?

This PR reverts the partial bug fix in `ShuffleBlockFetcherIterator` which was introduced by https://github.com/apache/spark/pull/23638 .
The reasons:
1. It's a potential bug. After fixing `PipelinedRDD` in #23638 , the original problem was resolved.
2. The fix is incomplete according to [the discussion](https://github.com/apache/spark/pull/23638#discussion_r251869084)

We should fix the potential bug completely later.

## How was this patch tested?

existing tests

Closes #25049 from cloud-fan/revert.

Authored-by: Wenchen Fan <wenchen@databricks.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2019-07-09 11:52:12 +09:00
Gabor Somogyi 0b6c2c259c [MINOR] Add requestHeaderSize debug log
## What changes were proposed in this pull request?

`requestHeaderSize` was added in https://github.com/apache/spark/pull/23090 and applies to both the Spark UI and the History Server UI. Without a debug log it's hard to find out which configuration value is used on which side.

In this PR I've added a log message which prints out the value.

## How was this patch tested?

Manually checked log files.

Closes #25045 from gaborgsomogyi/SPARK-26118.

Authored-by: Gabor Somogyi <gabor.g.somogyi@gmail.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2019-07-03 09:34:31 -07:00
ShuMingLi 378ed677a8 [SPARK-28202][CORE][TEST] Avoid noises of system props in SparkConfSuite
When the SPARK_HOME environment variable is set and contains a specific `spark-defaults.conf`, the `org.apache.spark.util.loadDefaultSparkProperties` method may pollute the system properties. So when running the `core/test` module, `SparkConfSuite` may fail.

It's easy to fix by setting `loadDefaults` in `SparkConf` to false.

```
[info] - deprecated configs *** FAILED *** (79 milliseconds)
[info] 7 did not equal 4 (SparkConfSuite.scala:266)
[info] org.scalatest.exceptions.TestFailedException:
[info] at org.scalatest.Assertions.newAssertionFailedException(Assertions.scala:528)
[info] at org.scalatest.Assertions.newAssertionFailedException$(Assertions.scala:527)
[info] at org.scalatest.FunSuite.newAssertionFailedException(FunSuite.scala:1560)
[info] at org.scalatest.Assertions$AssertionsHelper.macroAssert(Assertions.scala:501)
[info] at org.apache.spark.SparkConfSuite.$anonfun$new$26(SparkConfSuite.scala:266)
[info] at org.scalatest.OutcomeOf.outcomeOf(OutcomeOf.scala:85)
[info] at org.scalatest.OutcomeOf.outcomeOf$(OutcomeOf.scala:83)
[info] at org.scalatest.OutcomeOf$.outcomeOf(OutcomeOf.scala:104)
[info] at org.scalatest.Transformer.apply(Transformer.scala:22)
[info] at org.scalatest.Transformer.apply(Transformer.scala:20)
[info] at org.scalatest.FunSuiteLike$$anon$1.apply(FunSuiteLike.scala:186)
[info] at org.apache.spark.SparkFunSuite.withFixture(SparkFunSuite.scala:149)
[info] at org.scalatest.FunSuiteLike.invokeWithFixture$1(FunSuiteLike.scala:184)
[info] at org.scalatest.FunSuiteLike.$anonfun$runTest$1(FunSuiteLike.scala:196)
[info] at org.scalatest.SuperEngine.runTestImpl(Engine.scala:289)
```
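
A minimal sketch of the fix idea, constructing the conf without loading `spark.*` system properties (illustrative; the actual test change may differ):

```scala
import org.apache.spark.SparkConf

// loadDefaults = false keeps system properties (and anything injected from a
// local spark-defaults.conf) out of the conf used by the test.
val conf = new SparkConf(loadDefaults = false)
  .set("spark.master", "local")
```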

Closes #24998 from LiShuMing/SPARK-28202.

Authored-by: ShuMingLi <ming.moriarty@gmail.com>
Signed-off-by: jerryshao <jerryshao@tencent.com>
2019-07-02 10:04:42 +08:00
HyukjinKwon 02f4763286 [SPARK-28198][PYTHON] Add mapPartitionsInPandas to allow an iterator of DataFrames
## What changes were proposed in this pull request?

This PR proposes to add `mapPartitionsInPandas` API to DataFrame by using existing `SCALAR_ITER` as below:

1. Filtering via setting the column

```python
from pyspark.sql.functions import pandas_udf, PandasUDFType

df = spark.createDataFrame([(1, 21), (2, 30)], ("id", "age"))

pandas_udf(df.schema, PandasUDFType.SCALAR_ITER)
def filter_func(iterator):
    for pdf in iterator:
        yield pdf[pdf.id == 1]

df.mapPartitionsInPandas(filter_func).show()
```

```
+---+---+
| id|age|
+---+---+
|  1| 21|
+---+---+
```

2. `DataFrame.loc`

```python
from pyspark.sql.functions import pandas_udf, PandasUDFType
import pandas as pd

df = spark.createDataFrame([['aa'], ['bb'], ['cc'], ['aa'], ['aa'], ['aa']], ["value"])

pandas_udf(df.schema, PandasUDFType.SCALAR_ITER)
def filter_func(iterator):
    for pdf in iterator:
        yield pdf.loc[pdf.value.str.contains('^a'), :]

df.mapPartitionsInPandas(filter_func).show()
```

```
+-----+
|value|
+-----+
|   aa|
|   aa|
|   aa|
|   aa|
+-----+
```

3. `pandas.melt`

```python
from pyspark.sql.functions import pandas_udf, PandasUDFType
import pandas as pd

df = spark.createDataFrame(
    pd.DataFrame({'A': {0: 'a', 1: 'b', 2: 'c'},
                  'B': {0: 1, 1: 3, 2: 5},
                  'C': {0: 2, 1: 4, 2: 6}}))

pandas_udf("A string, variable string, value long", PandasUDFType.SCALAR_ITER)
def filter_func(iterator):
    for pdf in iterator:
        import pandas as pd
        yield pd.melt(pdf, id_vars=['A'], value_vars=['B', 'C'])

df.mapPartitionsInPandas(filter_func).show()
```

```
+---+--------+-----+
|  A|variable|value|
+---+--------+-----+
|  a|       B|    1|
|  a|       C|    2|
|  b|       B|    3|
|  b|       C|    4|
|  c|       B|    5|
|  c|       C|    6|
+---+--------+-----+
```

The current limitation of `SCALAR_ITER` is that it doesn't allow different length of result, which is pretty critical in practice - for instance, we cannot simply filter by using Pandas APIs but we merely just map N to N. This PR allows map N to M like flatMap.

This API mimics the way of `mapPartitions` but keeps API shape of `SCALAR_ITER` by allowing different results.

### How does this PR implement?

This PR adds mimics both `dapply` with Arrow optimization and Grouped Map Pandas UDF. At Python execution side, it reuses existing `SCALAR_ITER` code path.

Therefore, externally, we don't introduce any new type of Pandas UDF but internally we use another evaluation type code `205` (`SQL_MAP_PANDAS_ITER_UDF`).

This approach is similar with Pandas' Windows function implementation with Grouped Aggregation Pandas UDF functions - internally we have `203` (`SQL_WINDOW_AGG_PANDAS_UDF`) but externally we just share the same `GROUPED_AGG`.

## How was this patch tested?

Manually tested and unittests were added.

Closes #24997 from HyukjinKwon/scalar-udf-iter.

Authored-by: HyukjinKwon <gurwls223@apache.org>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2019-07-02 10:54:16 +09:00
Marcelo Vanzin 6af47b93ec [SPARK-28150][CORE] Log in user before getting delegation tokens.
This ensures that tokens are always created with an empty UGI, which
allows multiple contexts to be (sequentially) started from the same JVM.

Tested with code attached to the bug, and also usual kerberos tests.

Closes #24955 from vanzin/SPARK-28150.

Authored-by: Marcelo Vanzin <vanzin@cloudera.com>
Signed-off-by: Marcelo Vanzin <vanzin@cloudera.com>
2019-06-27 13:30:28 -07:00
wuyi 7cbe01e8ef [SPARK-27369][CORE] Setup resources when Standalone Worker starts up
## What changes were proposed in this pull request?

To support GPU-aware scheduling in Standalone (cluster mode), the Worker should be able to set up resources (e.g. GPU/FPGA) when it starts up.

Similar to the driver/executor, the Worker has two ways (resourceFile & resourceDiscoveryScript) to set up resources when it starts up. Users can use `SPARK_WORKER_OPTS` to apply resource configs to the Worker in the form of "-Dx=y". For example,
```
SPARK_WORKER_OPTS="-Dspark.worker.resource.gpu.amount=2 \
                   -Dspark.worker.resource.fpga.amount=1 \
                   -Dspark.worker.resource.fpga.discoveryScript=/Users/wuyi/tmp/getFPGAResources.sh \
                   -Dspark.worker.resourcesFile=/Users/wuyi/tmp/worker-resource-file"
 ```
## How was this patch tested?

Tested manually in Standalone locally:

- Worker starts up normally when no resources are configured

- Worker fails to start up when an exception is thrown while setting up resources (e.g. unknown directory, parse failure)

- Worker can set up resources from a resources file

- Worker can set up resources from discovery scripts

- Worker sets up resources from both the resources file and discovery scripts when both are configured.

Closes #24841 from Ngone51/dev-worker-resources-setup.

Authored-by: wuyi <ngone_5451@163.com>
Signed-off-by: Xingbo Jiang <xingbo.jiang@databricks.com>
2019-06-26 19:19:00 -07:00
Bryan Cutler c277afb12b [SPARK-27992][PYTHON] Allow Python to join with connection thread to propagate errors
## What changes were proposed in this pull request?

Currently with `toLocalIterator()` and `toPandas()` with Arrow enabled, if the Spark job being run in the background serving thread errors, it will be caught and sent to Python through the PySpark serializer.
This is not ideal because it only catches a SparkException; it won't handle an error that occurs in the serializer, and each method has to have its own special handling to propagate the error.

This PR instead returns the Python Server object along with the serving port and authentication info, so that it allows the Python caller to join with the serving thread. During the call to join, the serving thread Future is completed either successfully or with an exception. In the latter case, the exception will be propagated to Python through the Py4j call.

## How was this patch tested?

Existing tests

Closes #24834 from BryanCutler/pyspark-propagate-server-error-SPARK-27992.

Authored-by: Bryan Cutler <cutlerb@gmail.com>
Signed-off-by: Bryan Cutler <cutlerb@gmail.com>
2019-06-26 13:05:41 -07:00