edits (full cleanup)
parent 1991e3cffb, commit 395058197d
@ -176,7 +176,7 @@ We modded the Linux 4.4.210 kernel with the \systemname governor to implement th
This is based on a fork of the \schedutil governor.
We rely on the existing system idle policy to put the CPU in an idle state whenever the runqueue becomes empty.
The Linux CFS scheduler, as before, periodically calls into the governor to set the CPU cluster speed.
The \systemname governor picks a new CPU (cluster) speed.% as described in \Cref{alg:fullKiss} above.
%The default policy sets the speed based upon recent utilization.
%Instead, we set the speed to $\fenergy$ in the general case.
% A syscall API, with native calldown support from the Android platform, allows userspace to communicate hints about pending system needs and to request a new default CPU speed setting.
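The per-callback decision this yields is intentionally small. A minimal sketch in C, under stated assumptions: \texttt{hint\_active} is a hypothetical name for the userspace-hint state, and the frequency constants are illustrative stand-ins for $\fenergy$ and $\fperf$, not the actual Snapdragon 835 tables.

```c
#include <stdbool.h>

/* Illustrative frequencies in kHz; real values come from the
 * cluster's scaling_available_frequencies table. */
static const unsigned F_ENERGY = 1900800; /* ~70% speed, f_energy */
static const unsigned F_PERF   = 2457600; /* maximum speed, f_perf */

/* Called periodically by the scheduler, as CFS calls the governor.
 * hint_active reflects a pending userspace performance hint; an empty
 * runqueue is handled by the idle policy, not here. */
unsigned pick_cluster_speed(bool hint_active)
{
    return hint_active ? F_PERF : F_ENERGY;
}
```

The point of the sketch is the contrast with \schedutil: no utilization history is consulted at all.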
@ -62,7 +62,7 @@ We further conduct several experiments to confirm our observations from \Cref{se
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\paragraph{Evaluation platform}
Our results were obtained using stock Google Pixel 2 devices running Android AOSP 10 with 4 GB RAM and 128 GB SSD storage and the Snapdragon 835 chipset~\cite{snapdragon-835}.
Standalone microbenchmarks were implemented in C, while end-to-end macrobenchmarks used the Android UI Automator testing framework to perform scripted, simulated interactions with real-world apps~\cite{uiautomator}.
One of the phones was modified to obtain energy measurements using the Monsoon HVPM power meter~\cite{monsoon}.
Our evaluation system consists of a pair of shell scripts running on the phone and an external monitor, respectively.
@ -78,10 +78,10 @@ Information on screen performance including framedrops came from the Android \te
\paragraph{CPU Policies}
We evaluate six different CPU policies:
(i) the system default, \schedutil,
(ii) a truncated \schedutil implemented by lower-bounding the CPU to 70\% using the existing API discussed in section \ref{subsec:signal_perf_needs},
(iii) a fixed 70\% speed using the existing \texttt{userspace} governor,
(iv) \systemname with speeds lower bounded at 70\%,
(v) unmodified \systemname with a default speed fixed at 70\%, and
(vi) the \texttt{performance} governor.
We include (ii) and (iii) to compare the general performance of the truncated \schedutil and common-case $\sim$70\% speed policies, as implemented under the existing API, against their equivalents implemented using \systemname.
Under default Linux, a requested CPU speed is implemented as the next-highest speed in the preset series of supported speeds listed in \texttt{scaling\_available\_frequencies} in \texttt{sysfs}.
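This rounding can be sketched as follows; a minimal sketch, where the frequency table is illustrative rather than the Snapdragon 835's actual per-cluster table.

```c
#include <stddef.h>

/* Illustrative sorted table, as read from scaling_available_frequencies
 * (kHz); real tables are per-cluster and device-specific. */
static const unsigned freqs[] =
    {300000, 670000, 1190400, 1574400, 1900800, 2457600};
static const size_t nfreqs = sizeof(freqs) / sizeof(freqs[0]);

/* Round an arbitrary requested speed up to the next supported
 * frequency, mirroring cpufreq's handling of such requests. */
unsigned supported_speed(unsigned requested)
{
    for (size_t i = 0; i < nfreqs; i++)
        if (freqs[i] >= requested)
            return freqs[i];
    return freqs[nfreqs - 1]; /* clamp to the maximum */
}
```

Thus a nominal "70\% speed" request lands on whichever table entry first meets or exceeds it.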
@ -113,7 +113,7 @@ We argue this does not noticeably affect user experience and is more than accept
The results of the truncated \schedutil policies and of fixed-speed 70\% similarly offer significant energy savings at small to zero cost.
Youtube shows a clear performance win for \systemname, producing 5.2\% fewer screendrops than with the default.
The truncated \schedutil policy under \systemname and the fixed-speed 70\% policy also offer notably improved screendrop rates, with 4.3\% and 3.6\% lower drop rates respectively.
UI performance under \systemname for both the Spotify and the Combined workloads, like that for Facebook, costs 0.3\% fps compared to the default -- a cost we again argue is both very minimal and acceptable.
The other non-default policies for both Spotify and Combined also offer either essentially the same or even somewhat better performance than the default: truncated \schedutil and fixed 70\% under the existing API for Spotify both offer a $\sim$2.5\% lower framedrop rate.
@ -123,12 +123,12 @@ In summary: \systemname, with a considerably simpler policy mechanism, offers e
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\subsection{Ramp-Up Times}
To attribute the improvement in performance, we measure the CPU frequencies selected by \schedutil.
\Cref{fig:time_per_freq_fb,fig:time_per_freq_yt,fig:time_per_freq_spot} plot a CDF of the difference between the selected frequency and $\fenergy$.
This addresses evaluation claim (iv) from above:
We note that for a significant fraction of the workload (5\% for Facebook, 15\% for Youtube, 12\% for Spotify), the frequency selected by \schedutil is significantly (up to 50\%) lower.
This is \schedutil's ramp-up period, where it selects frequencies lower than $\fenergy$.
We attribute the relative performance of \systemname to eliminating this ramp-up period, during which \schedutil selects speeds below $\fenergy$.
Although each workload spends part of its time at a higher frequency in \schedutil compared to \systemname, it spends more time ramping up to $\fenergy$ than at a higher speed.
In summary, the improved performance of both truncated \schedutil and \systemname can be attributed to \schedutil's ramp-up period.
@ -165,7 +165,7 @@ Indeed, all of the non-default policies except \texttt{performance} also best \s
Youtube under \systemname also saves energy, albeit less at a 1.6\% savings versus default.
Spotify actually costs 2.3\% more.
Note that this is Spotify running interactively.
The use case of Spotify in the Combined workload, where it is running in the background, is likely much more dominant in real-world usage.
The energy consumed by the Combined workload, unsurprisingly, is significantly higher across the board than that of the individual app loads.
Here, \systemname uses 5.6\% less energy than the default.
Once again, all of the non-default policies save \texttt{performance} do too.
@ -26,17 +26,17 @@ As Table \ref{fig:item_energy_cost} shows, a single (big) CPU core on a Pixel 2,
On typical mobile phones, these high costs are mitigated by running the CPU at a slower speed (frequency) to save energy.
The policies that govern this speed selection, called governors, must balance providing computation resources when needed and reducing resources to save energy when not.

Most popular recent and current Android (resp., Linux) governors, including the \texttt{ondemand}, \texttt{interactive}, and the \texttt{conservative} policies, as well as the current Android system default, \schedutil, use a proportion of recent past CPU usage as a guide to set future speeds.
In this paper, we explore several premises on which the designs of these governors are based.
We identify flaws in the premises, and propose a new, simpler governor that has better latency and power consumption than \schedutil.

Our fundamental insight, also observed by prior work~\cite{vogeleer2013energy, nuessle2019benchmarking}, is that there exists an energy-optimal frequency for each device (call it $\fenergy$).
We argue that
(i)~past CPU usage is not meaningful for identifying the rare cases when speeds below $\fenergy$ are appropriate, and
(ii)~speeds above $\fenergy$ are useful only in specific situations, often known in advance by user-space.
\Cref{fig:missed_opportunities} illustrates the potential for improvement:
(i)~\schedutil has a ramp-up period (left grey box) where the CPU is operating at speeds that sacrifice both energy and performance, and
(ii)~\schedutil continues ramping up the frequency (right grey box), paying significant energy costs for often negligible visible benefits.

We propose a series of changes to \schedutil, ultimately converging on a radical proposal: default the CPU's frequency to its $\fenergy$, switching to faster speeds based only on (already existent) signals from user-space.
Based on the simplicity of this approach, we call it the \systemname governor.
@ -8,7 +8,7 @@ On modern systems, CPUs typically consist of multiple cores, often of different
A policy, or `governor', sets the CPU's frequency (P-state) when there is pending computation, optimizing performance at the expense of energy, or vice versa.
The governor runs in conjunction with other policies, in particular (i) the scheduler, which determines what tasks are run on what CPU cores, and (ii) the idle policy, which places CPUs with no pending work into an (idle) C-state.
Hardware design on phones can constrain governor policy calculations.
For example, CPU speeds often cannot be set on individual cores but only on groups of CPUs -- a constraint partly linked to the asymmetric big-little CPU architecture, with 2 clusters of higher- and lower-performance CPU cores~\cite{big-little}.

% idle paper: https://citeseerx.ist.psu.edu/document?repid=rep1&type=pdf&doi=60fdaa6a74dec29a0538325b742bee4097247c6d#page=119
@ -22,9 +22,10 @@ For example, CPU speeds often cannot be set on individual cores but only on grou
\paragraph{Idling overrides any speed}
When a CPU's runqueue has no tasks, the idle policy bypasses the governor's speed selection and instead shuts down unneeded cores.
Figure \ref{fig:idle_impact} illustrates this with a microbenchmark that continuously performs simple arithmetic computations (red circle), alternates computation and sleep in 15ms intervals (blue square), or continuously sleeps (green diamond).
The x-axis varies the fixed frequency to which the CPU is pinned, with the default \schedutil governor's behavior for comparison.
Total energy consumed is shown on the y-axis.
Power consumed by the sleeping task is largely independent of the CPU frequency, modulo minor system interrupts.
Energy consumed by the remaining tasks tracks CPU speed, as expected, with a flattening for the partially sleeping workload.
In summary, no matter what speed the CPU governor requests, when there is no work, the idle policy overrides the speed and shuts down the core, \emph{consuming negligible energy}.
We refer to the `speed' of the core in its idle state as $\fidle$.
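The alternating workload can be sketched in C; a minimal sketch, where the 15ms interval follows the benchmark description but the arithmetic loop and iteration counts are chosen for illustration.

```c
#define _POSIX_C_SOURCE 199309L
#include <stdint.h>
#include <time.h>

/* Spin on simple arithmetic (a 64-bit LCG step) for roughly `ms`
 * milliseconds; run continuously, this is the all-compute workload. */
static uint64_t burn(long ms)
{
    struct timespec start, now;
    uint64_t acc = 1;
    clock_gettime(CLOCK_MONOTONIC, &start);
    do {
        for (int i = 0; i < 10000; i++)
            acc = acc * 2862933555777941757ULL + 3037000493ULL;
        clock_gettime(CLOCK_MONOTONIC, &now);
    } while ((now.tv_sec - start.tv_sec) * 1000L +
             (now.tv_nsec - start.tv_nsec) / 1000000L < ms);
    return acc;
}

/* Alternate `ms` of computation with `ms` of sleep (the blue-square
 * workload); the all-sleep variant simply drops the burn() arm.
 * Returns the number of rounds completed. */
int run_alternating(int rounds, long ms)
{
    struct timespec nap = { .tv_sec = 0, .tv_nsec = ms * 1000000L };
    for (int r = 0; r < rounds; r++) {
        burn(ms);
        nanosleep(&nap, NULL);
    }
    return rounds;
}
```

In the measured setup each variant runs for 30s with the CPU pinned via the \texttt{userspace} governor; e.g., \texttt{run\_alternating(1000, 15)} approximates the blue-square workload.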
@ -85,10 +85,10 @@ This reduces the per-CPU utilization, but triggers CPU frequencies well below $\
Figure \ref{fig:time_per_freq_fb} illustrates this under-performance in practice.
We ran a scripted Facebook app interaction, scrolling for 25s through friends and then feed under different CPU policies: with the default \schedutil and with differing fixed speeds.
We tracked the time spent for each CPU at a given frequency, with 0 representing idle.
The solid blue lines in the top 2 graphs are histograms of the average total time spent at each speed for little and big CPUs.
The dashed red lines provide a baseline of a governor that selects a fixed 70\% CPU frequency ($\approx \fenergy$) for the CPU when work is pending (and, as usual, that idles the core when no work is available).
Note that, when the CPU is idle, depicted by a speed of 0, the actual and baseline plots coincide.
The left sides of the 2 CDF plots (bottom 2 graphs of Figure \ref{fig:time_per_freq_fb}) show that the app spends significant \textit{non-idle} time running at speeds well below $\fenergy$.
All of the time spent in the areas marked Underperformance represents energy wasted on slower performance.

\begin{figure*}
@ -105,7 +105,7 @@ Frequencies strictly below $\fenergy$ (excepting $\fidle$) consume more power pe
In the absence of CPU stalls, spin-locks, and thermal throttling, frequencies in this range are strictly worse.
Based on this observation and two further insights, we now propose our first adjustment to the \schedutil governor.

First, recall that the main signal used by \schedutil is recent past CPU usage.
This signal conveys no information about CPU stalls, and so is not useful for deciding whether the CPU should be set to a frequency in this regime.
Second, we observe that workloads that trigger the relevant CPU behaviors are typically data-intensive and memory bound, or parallel workloads with high contention.
Such workloads are often offloaded to more powerful cloud compute infrastructures; when run at the edge (e.g., for federated learning), it is typically when the device has a stable power source.
@ -66,7 +66,7 @@ A common measure of user-perceivable value is the rate of dropped animation fram
In the ideal case, no frames are dropped.

Figure \ref{fig:screendrops_per_freq_fb} shows the jank rates for our case study at a variety of fixed frequencies, with \schedutil as a comparison point.
Runs with speeds 70\% and above produce drop rates of $\sim$2\% and below, lower than that of the default dynamic policy ($\sim$3\%).
We attribute the higher drop rate of \schedutil to the ramp-up period, where it runs the CPU at below $\fenergy$.
There is a step function: at frequencies below 60\%, jank increases to about $4\%$.
However, this step occurs below $\fenergy$.
@ -84,14 +84,14 @@ During this time, the user is waiting on the app to become responsive.
It makes sense under such circumstances to run the CPUs at a higher speed to enhance user experience.
As we will discuss further in section \ref{subsec:signal_perf_needs}, the system already boosts the CPU to 100\% speed upon app launch for $\sim$.5s.
Figure \ref{fig:coldstart_time_spot}, however, shows this is insufficient.
The right-side graph shows that the app does not become fully responsive (Time to Fully Drawn, or TTFD) until $\sim$2s after a coldstart.

The graph depicts the latency of initial display (when the app screen appears) and TTFD under 4 CPU policies: the default \schedutil, a fixed 70\% CPU speed, and the same 2 but with the CPU frequency set to 100\% from userspace for 2s.
The vertical axes depict energy consumed; the inner boxes represent detail zooms of the larger plots.
Unsurprisingly, the fixed 70\% policy offers the worst results on both performance metrics.
The second-worst performance is from the unmodified default policy -- which also offers the worst energy performance.
The best performance comes from a fixed 70\% frequency with a 2s boost.
Likely, this partly stems from avoiding the ramp-up penalty of \schedutil.

This shows that the existing CPU policy degrades user experience through poor latency.
Instead, a general-purpose 70\% speed combined with as-needed (and properly timed) frequency boosts offers both better performance (user responsiveness) and energy usage.
@ -110,7 +110,7 @@ At 90\% of the core's maximum frequency, we would expect the time spent doing wo
This is not the case; the ratio is a half for the little cores, and only four-fifths for the big cores.
In short, as the CPU frequency goes up and more compute capacity becomes available, the Facebook app adapts by creating additional work.

We attribute this compute increase to the asynchronously-loading list through which our test case scrolls.
In this design pattern, the cells comprising a large list are not materialized all at once.
Rather, as a cell approaches the viewport, a background worker task is spawned to retrieve the data backing the cell and populate its component elements.
For example, a mail client might retrieve the contents of an email only as the message scrolls into view.
@ -129,13 +129,13 @@ However, even at the CPU's maximum frequency, more work is created than the CPU
\begin{figure}
\centering
\includegraphics[width=.87\linewidth]{figures/graph_u_fb.pdf}
\bfcaption{Energy consumed for a fixed set of iterations, given compute at different speeds (10 runs, 90\% confidence)}
\label{fig:u_micro_fb}
\end{figure}

\Cref{fig:u_micro_fb} shows power consumption for the Facebook workload, padded with idle time to a fixed 40s period.
Operating the CPU at maximum frequency imposes an energy overhead of approximately .6 mAh compared to operating at $\fenergy \approx 70\%$ of its maximum.
This represents about $\frac{1}{1700}$ of the typical Pixel 2's maximum battery capacity.

While the energy cost is significant, the potential value of more table cells being displayed as they scroll past is subjective and beyond the scope of this study.
We consider each possibility in turn.
@ -146,7 +146,7 @@ If added performance is desirable in this use case and others like it, then the
\subsection{Signaling Performance Needs}
\label{subsec:signal_perf_needs}

The more interesting systems design question is how to select CPU speeds in the presence of adaptive applications, when additional energy does not provide value.
Specifically, adaptive apps (while in-use, e.g., scrolling through a list) create a functionally infinite source of work.
The CPU usage profiles presented by an adaptive app and a user legitimately waiting on a CPU-bound task (e.g., cold-start) are identical, rendering them indistinguishable to \schedutil.
@ -163,7 +163,7 @@ The kernel reacts to the boost parameter by scaling up the CPU usage as seen by
This virtual increase in CPU usage causes most governors to select a higher CPU frequency than they would otherwise select.

Android's user-space is already configured to make use of the \texttt{stune} API in performance-critical periods.
For example, when an app starts, Android briefly marks the app with a boost parameter of 100 to mitigate \schedutil's usual ramp-up period.
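The scaling of observed CPU usage can be sketched as a schedtune-style margin calculation; a minimal sketch, where the formula and the 1024 capacity scale follow our reading of the kernel's schedtune code and should be treated as an approximation.

```c
#define CAPACITY_SCALE 1024  /* the kernel's SCHED_CAPACITY_SCALE */

/* Inflate raw utilization toward full capacity in proportion to the
 * boost percentage: boost=0 leaves util unchanged, while boost=100
 * reports the CPU as fully utilized regardless of actual usage. */
unsigned long boosted_util(unsigned long util, int boost)
{
    unsigned long margin = (CAPACITY_SCALE - util) * boost / 100;
    return util + margin;
}
```

A boost of 100 at app start thus drives \schedutil directly to its maximum frequency, skipping the usual ramp-up.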
\claim{
The boost parameter is a meaningful signal of the need for additional performance.
@ -195,7 +195,7 @@ The switching cost in both cases, while small, is not negligible, suggesting tha
To summarize, increasing the CPU speed above $\fenergy$ adds more compute cycles per unit time, but comes with diminishing returns.
In the ideal situation where we know exactly what compute must be completed in a given time interval, we could set the CPU speed once, to the minimum frequency required to meet our obligations.
However, this information is not available, forcing \schedutil to employ a simple PID (proportional-integral-derivative) control loop: As long as more work is offered, it keeps increasing the CPU speed.

However, (i) adaptive apps offer an effectively infinite amount of work, (ii) micro-managing frequencies comes at a cost, and (iii) increasing the CPU speed above $\fenergy$ does not meaningfully affect jank.
These observations, coupled with the already extant adoption of the \texttt{prio\_hint} syscall in Android, drive our second governor proposal: \systemname.
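The ramp-up dynamic can be sketched as a proportional rule in C; a simplification of \schedutil's documented heuristic (roughly next\_freq = 1.25 $\times$ max\_freq $\times$ util / capacity), with illustrative constants rather than real device values.

```c
#define MAX_FREQ 2457600ULL  /* kHz, illustrative cluster maximum */
#define CAPACITY 1024ULL     /* utilization scale */

/* Proportional rule used by schedutil-style governors, with 25%
 * headroom. As long as the workload keeps the CPU busy, measured
 * utilization rises and the requested frequency ratchets upward --
 * this is the ramp-up period. */
unsigned long long next_freq(unsigned long long util)
{
    unsigned long long f = MAX_FREQ * util * 5 / (4 * CAPACITY);
    return f > MAX_FREQ ? MAX_FREQ : f;
}
```

Because the rule is purely reactive, a burst of work always passes through a sequence of sub-$\fenergy$ frequencies before reaching an efficient speed.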
@ -224,7 +224,7 @@ If no tasks are pending, it idles the CPU.
Allowing apps to pin the core to $\fperf$ could, in principle, extend the attack surface for the Android kernel.
However, we note that in practice, userspace is afforded no capabilities that it did not already have.
It can already spin uselessly from mistake or malice~\cite{maiti2015jouler}.
If an app schedules work for more than $\sim$200ms, \schedutil will already ramp the core up to full speed (Figure \ref{fig:missed_opportunities}).
Furthermore, the \texttt{schedtune.boost} API is already present in Android.
Regardless of policy, hardware enforced thermal throttling will eventually cap a runaway process~\cite{8410428}.