edits (full cleanup)

master
carlnues@buffalo.edu 2023-08-26 07:01:06 -04:00
parent 1991e3cffb
commit 395058197d
6 changed files with 35 additions and 34 deletions

View File

@@ -176,7 +176,7 @@ We modded the Linux 4.4.210 kernel with the \systemname governor to implement th
This is based on a fork of the \schedutil governor.
We rely on the existing system idle policy to put the CPU in an idle state whenever the runqueue becomes empty.
The Linux CFS scheduler, as before, periodically calls into the governor to set the CPU cluster speed.
-The \systemname governor picks a new CPU (cluster) speed as described in \Cref{alg:fullKiss} above.
+The \systemname governor picks a new CPU (cluster) speed.% as described in \Cref{alg:fullKiss} above.
%The default policy sets the speed based upon recent utilization.
%Instead, we set the speed to $\fenergy$ in the general case.
% A syscall API, with native calldown support from the Android platform, allows userspace to communicate hints about pending system needs and to request a new default CPU speed setting.
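For readers unfamiliar with the governor plumbing, the shape of such an update hook is sketched below. This is a hedged illustration modeled on mainline schedutil's update path (newer than 4.4, so details differ); the kiss_* names, the f_energy field, and boost_active are invented for the sketch and are not the actual \systemname source.

    /* Kernel-side sketch (not the actual \systemname source): a governor
     * update hook in the style of mainline schedutil. All kiss_* names
     * and the boost_active field are invented for illustration. */
    struct kiss_policy {
            struct cpufreq_policy   *policy;      /* per-cluster cpufreq policy */
            struct update_util_data update_util;  /* scheduler callback handle */
            unsigned int            f_energy;     /* energy-optimal frequency */
            bool                    boost_active; /* set via a userspace hint */
    };

    static void kiss_update_hook(struct update_util_data *hook, u64 time,
                                 unsigned int flags)
    {
            struct kiss_policy *kp =
                    container_of(hook, struct kiss_policy, update_util);
            unsigned int next_f = kp->f_energy;   /* default case */

            if (kp->boost_active)                 /* userspace asked for f_perf */
                    next_f = kp->policy->cpuinfo.max_freq;

            /* Idling is handled elsewhere: an empty runqueue lets the idle
             * policy shut the core down regardless of this selection. */
            if (next_f != kp->policy->cur)
                    cpufreq_driver_fast_switch(kp->policy, next_f);
    }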

View File

@@ -62,7 +62,7 @@ We further conduct several experiments to confirm our observations from \Cref{se
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\paragraph{Evaluation platform}
-Our results were obtained using Google Pixel 2 devices running Android AOSP 10 with 4 GB RAM and 128 GB SSD storage and the Snapdragon 835 chipset~\cite{snapdragon-835}.
+Our results were obtained using stock Google Pixel 2 devices running Android AOSP 10 with 4 GB RAM and 128 GB SSD storage and the Snapdragon 835 chipset~\cite{snapdragon-835}.
Standalone microbenchmarks were implemented in C, while end-to-end macrobenchmarks were performed using the Android UI Automator testing framework to perform scripted, simulated interactions with real-world apps~\cite{uiautomator}.
One of the phones was modified to obtain energy measurements using the Monsoon HVPM power meter~\cite{monsoon}.
Our evaluation system consists of a pair of shell scripts running on the phone and on an external monitor, respectively.
@@ -78,10 +78,10 @@ Information on screen performance including framedrops came from the Android \te
\paragraph{CPU Policies}
We evaluate six different CPU policies:
(i) the system default, \schedutil,
-(ii) a truncated \schedutil implemented by lower-bounding the CPU using the existing API discussed in section \ref{subsec:signal_perf_needs},
+(ii) a truncated \schedutil implemented by lower-bounding the CPU to 70\% using the existing API discussed in section \ref{subsec:signal_perf_needs},
(iii) a fixed 70\% speed using the existing \texttt{userspace} governor,
-(iv) a truncated \schedutil implemented with \systemname,
-(v) unmodified \systemname, and
+(iv) \systemname with speeds lower-bounded at 70\%,
+(v) unmodified \systemname with a fixed default speed of 70\%, and
(vi) the \texttt{performance} governor.
We include (ii) and (iii) to compare the general performance of the truncated \schedutil and common-case $\sim$70\% speed policies, as implemented under the existing API, with their equivalents implemented using \systemname.
Under default Linux, a requested CPU speed is implemented as the next-highest speed in the preset series of supported speeds listed in \texttt{scaling\_available\_frequencies} in \texttt{sysfs}.
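To make this snapping behavior concrete, a small sketch follows; the frequency table is a placeholder, since the real steps come from \texttt{scaling\_available\_frequencies} on the device.

    /* Resolve a requested speed to the next-highest supported step, as
     * the kernel does against scaling_available_frequencies. The table
     * below is illustrative; real steps are read from sysfs. */
    #include <stddef.h>

    static const unsigned int steps_khz[] = {
            300000, 652800, 1036800, 1459200, 1843200, 2150400, 2457600
    };
    #define NSTEPS (sizeof(steps_khz) / sizeof(steps_khz[0]))

    unsigned int resolve_freq_khz(unsigned int requested_khz)
    {
            for (size_t i = 0; i < NSTEPS; i++)
                    if (steps_khz[i] >= requested_khz)
                            return steps_khz[i];
            return steps_khz[NSTEPS - 1];   /* clamp to the maximum step */
    }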
@@ -113,7 +113,7 @@ We argue this does not noticeably affect user experience and is more than accept
The results of the truncated \schedutil policies and of fixed-speed 70\% similarly offer significant energy savings at little to no cost.
-Youtube shows a clear performance win for \systemname, producing 5.2\% screendrops than with the default.
+Youtube shows a clear performance win for \systemname, producing 5.2\% fewer screendrops than with the default.
The truncated \schedutil policy under \systemname and the fixed-speed 70\% policy also offer notably improved screendrop rates, with 4.3\% and 3.6\% lower drop rates respectively.
UI performance under \systemname for both the Spotify and the Combined workloads, like that for Facebook, costs 0.3\% fps compared to the default -- a cost we again argue is both minimal and acceptable.
The other non-default policies for both Spotify and Combined also offer either essentially the same or even somewhat better performance than the default: Truncated \schedutil and fixed 70\% under the existing API for Spotify both offer a $\sim$2.5\% lower framedrop rate.
@@ -123,12 +123,12 @@ In summary: \systemname, with a considerably simpler policy mechanism, offers e
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\subsection{Ramp-Up Times}
-To attribute the improvement in performance, we measure the CPU frequencies selected by \schedutil and \systemname, respectively.
+To attribute the improvement in performance, we measure the CPU frequencies selected by \schedutil.
\Cref{fig:time_per_freq_fb,fig:time_per_freq_yt,fig:time_per_freq_spot} plot a CDF of the difference between \schedutil's selections and those of \systemname.
This addresses evaluation claim (iv) from above:
We note that for a significant fraction of the workload (5\% for Facebook, 15\% for Youtube, 12\% for Spotify), the frequency selected by \schedutil is significantly (up to 50\%) lower.
This is \schedutil's ramp-up period, where it selects frequencies lower than $\fenergy$.
-We attribute the improved performance for both governors to eliminating the ramp-up period where \systemname selects speeds below $\fenergy$.
+We attribute the relative performance of \systemname to eliminating the ramp-up period, during which \schedutil selects speeds below $\fenergy$.
Although each workload spends part of its time at a higher frequency in \schedutil compared to \systemname, it spends more time ramping up to $\fenergy$ than at a higher speed.
In summary, the improved performance of both truncated \schedutil and \systemname can be attributed to \schedutil's ramp-up period.
@@ -165,7 +165,7 @@ Indeed, all of the non-default policies except \texttt{performance} also best \s
Youtube under \systemname also saves energy, albeit less, at a 1.6\% savings versus the default.
Spotify actually costs 2.3\% more.
Note that this is Spotify running interactively.
-The use case of Spotify in the Combined workload, where it is running in the background, is likely much more dominant in actual usage.
+The use case of Spotify in the Combined workload, where it is running in the background, is likely much more dominant in real-world usage.
The energy consumed by the Combined workload, unsurprisingly, is significantly higher across the board than that of the individual app loads.
Here, \systemname uses 5.6\% less energy than the default.
Once again, all of the non-default policies save \texttt{performance} do too.

View File

@@ -26,17 +26,17 @@ As Table \ref{fig:item_energy_cost} shows, a single (big) CPU core on a Pixel 2,
On typical mobile phones, these high costs are mitigated by running the CPU at a slower speed (frequency) to save energy.
The policies that govern this speed selection, called governors, must balance providing computation resources when needed, and reducing resources to save energy when not.
-Most popular recent and current Android (resp., Linux) governors, such as \texttt{ondemand}, \texttt{interactive}, and \texttt{conservative}, and the current Android system default, \schedutil, use a proportion of recent past CPU usage as a guide to set future speeds.
+Most popular recent and current Android (resp., Linux) governors, including the \texttt{ondemand}, \texttt{interactive}, and \texttt{conservative} policies, as well as the current Android system default, \schedutil, use a proportion of recent past CPU usage as a guide to set future speeds.
In this paper, we explore several premises on which the designs of these governors are based.
We identify flaws in the premises, and propose a new, simpler governor that has better latency and power consumption than \schedutil.
Our fundamental insight, also observed by prior work~\cite{vogeleer2013energy, nuessle2019benchmarking}, is that there exists an energy-optimal frequency for each device (call it $\fenergy$).
We argue that
-(i)~past CPU usage is not a meaningful for identifying the rare cases when speeds below $\fenergy$ are appropriate,
+(i)~past CPU usage is not meaningful for identifying the rare cases when speeds below $\fenergy$ are appropriate,
(ii)~speeds above $\fenergy$ are useful only in specific situations, often known in advance by user-space.
\Cref{fig:missed_opportunities} illustrates the potential for improvement:
-(i)~\schedutil has a ramp-up period (first grey box) where the CPU is operating at speeds that sacrifice both energy and performance, and
-(ii)~\schedutil continues ramping up the frequency (second grey box) paying significant energy costs for often negligible visible benefits.
+(i)~\schedutil has a ramp-up period (left grey box) where the CPU is operating at speeds that sacrifice both energy and performance, and
+(ii)~\schedutil continues ramping up the frequency (right grey box), paying significant energy costs for often negligible visible benefits.
We propose a series of changes to \schedutil, ultimately converging on a radical proposal: default the CPU's frequency to its $\fenergy$, switching to faster speeds based only on (already existent) signals from user-space.
Based on the simplicity of this approach, we call it the \systemname governor.

View File

@@ -8,7 +8,7 @@ On modern systems, CPUs typically consist of multiple cores, often of different
A policy, or `governor', sets the CPU's frequency (P-state) when there is pending computation, optimizing performance at the expense of energy, or vice versa.
The governor runs in conjunction with other policies, in particular (i) the scheduler -- which determines what tasks are run on what CPU cores -- and (ii) the idle policy -- which places CPUs with no pending work into an (idle) C-state.
Hardware design on phones can constrain governor policy calculations.
-For example, CPU speeds often cannot be set on individual cores but only on groups of CPUs -- a constraint stemming from the asymetric big-little CPU architecture, with 2 clusters of higher- and lower-performance CPU cores~\cite{big-little}.
+For example, CPU speeds often cannot be set on individual cores but only on groups of CPUs -- a constraint partly linked to the asymmetric big-little CPU architecture, with 2 clusters of higher- and lower-performance CPU cores~\cite{big-little}.
% idle paper: https://citeseerx.ist.psu.edu/document?repid=rep1&type=pdf&doi=60fdaa6a74dec29a0538325b742bee4097247c6d#page=119
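This per-cluster constraint can be observed directly from userspace via the cpufreq policy's \texttt{related\_cpus} file; the sketch below assumes the big cluster's policy node is \texttt{policy4}, which is typical but device-specific.

    /* Print which cores share one frequency setting; on a Pixel 2 the
     * big cluster's policy node is typically policy4 (device-specific). */
    #include <stdio.h>

    int main(void)
    {
            char buf[64];
            FILE *f = fopen("/sys/devices/system/cpu/cpufreq/policy4/related_cpus", "r");

            if (f && fgets(buf, sizeof buf, f))
                    printf("cores sharing one speed: %s", buf);  /* e.g. "4 5 6 7" */
            if (f)
                    fclose(f);
            return 0;
    }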
@@ -22,9 +22,10 @@ For example, CPU speeds often cannot be set on individual cores but only on grou
\paragraph{Idling overrides any speed}
When a CPU's runqueue has no tasks, the idle policy bypasses the governor's speed selection and instead shuts down unneeded cores.
-Figure \ref{fig:idle_impact} illustrates this through a simple microbenchmark that continuously performs simple arithmetic computations (red circle), alternates computation and sleep in 15ms intervals (blue square), or continuous sleep (green diamond).
-The x-axis varies the fixed frequency to which the CPU is pinned, with the default \schedutil governor's behavior for comparison. Total energy consumed over the 30s period is shown on the y-axis.
-Power consumed by the sleeping task is largely independent of the overall CPU frequency, modulo minor system interrupts.
+Figure \ref{fig:idle_impact} illustrates this with a microbenchmark that continuously performs simple arithmetic computations (red circle), alternates computation and sleep in 15ms intervals (blue square), or continuously sleeps (green diamond).
+The x-axis varies the fixed frequency to which the CPU is pinned, with the default \schedutil governor's behavior for comparison.
+Total energy consumed is shown on the y-axis.
+Power consumed by the sleeping task is largely independent of the CPU frequency, modulo minor system interrupts.
Energy consumed by the remaining tasks tracks CPU speed, as expected, with a flattening for the partially sleeping workload.
In summary, no matter what speed the CPU governor requests, when there is no work, the idle policy overrides the speed and shuts down the core, \emph{consuming negligible energy}.
We refer to the `speed' of the core in its idle state as $\fidle$.
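A minimal sketch of the alternating workload in this microbenchmark might look as follows; the busy-wait structure and durations follow the description above, while CPU-affinity handling is omitted for brevity.

    /* Sketch of the idle-impact microbenchmark's alternating workload:
     * 15ms of simple arithmetic, then 15ms of sleep, for ~30s total.
     * The continuous-compute and continuous-sleep variants drop one phase. */
    #include <time.h>
    #include <unistd.h>

    static long elapsed_ms(const struct timespec *a, const struct timespec *b)
    {
            return (b->tv_sec - a->tv_sec) * 1000 +
                   (b->tv_nsec - a->tv_nsec) / 1000000;
    }

    static void spin_ms(long ms)
    {
            struct timespec start, now;
            volatile unsigned long x = 0;

            clock_gettime(CLOCK_MONOTONIC, &start);
            do {
                    x += 1;                       /* the arithmetic "work" */
                    clock_gettime(CLOCK_MONOTONIC, &now);
            } while (elapsed_ms(&start, &now) < ms);
    }

    int main(void)
    {
            for (int i = 0; i < 1000; i++) {      /* 1000 * 30ms = 30s */
                    spin_ms(15);                  /* compute phase */
                    usleep(15 * 1000);            /* sleep phase: core can idle */
            }
            return 0;
    }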

View File

@@ -85,10 +85,10 @@ This reduces the per-CPU utilization, but triggers CPU frequencies well below $\
Figure \ref{fig:time_per_freq_fb} illustrates this under-performance in practice.
We ran a scripted Facebook app interaction, scrolling for 25s through friends and then feed under different CPU policies: with the default \schedutil and with differing fixed speeds.
We tracked the time spent for each CPU at a given frequency, with 0 representing idle.
-The solid blue lines in the top 2 graphs are histograms of the average time spent at each speed for little and big CPUs.
+The solid blue lines in the top 2 graphs are histograms of the average total time spent at each speed for little and big CPUs.
The dashed red lines provide a baseline of a governor that selects a fixed 70\% CPU frequency ($\approx \fenergy$) when work is pending (and, as usual, idles the core when no work is available).
Note that, when the CPU is idle, depicted by a speed of 0, the actual and baseline plots coincide.
-The left-sides of the 2 CDF plots (bottom 2 graphs of Figure \ref{fig:time_per_freq_fb}) show that the app spends significant \textit{non-idle} time running at speeds below $\fenergy$ -- approximately 2/3 for little CPUs and 1/4 for big CPUs.
+The left sides of the 2 CDF plots (bottom 2 graphs of Figure \ref{fig:time_per_freq_fb}) show that the app spends significant \textit{non-idle} time running at speeds well below $\fenergy$.
All of the time spent in the areas marked Underperformance represents energy wasted for a slower performance.
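The per-frequency residency behind such histograms can be collected from cpufreq's statistics interface; a minimal reader sketch follows, assuming \texttt{CONFIG\_CPU\_FREQ\_STAT} is enabled and cpu4 is a big core.

    /* Read per-frequency residency for one core; each line of
     * time_in_state is "<freq_khz> <ticks>", with ticks in 10ms units. */
    #include <stdio.h>

    int main(void)
    {
            FILE *f = fopen("/sys/devices/system/cpu/cpu4/cpufreq/stats/time_in_state", "r");
            unsigned long khz, ticks;

            if (!f)
                    return 1;
            while (fscanf(f, "%lu %lu", &khz, &ticks) == 2)
                    printf("%8lu kHz : %8.2f s\n", khz, ticks / 100.0);
            fclose(f);
            return 0;
    }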
\begin{figure*}
@@ -105,7 +105,7 @@ Frequencies strictly below $\fenergy$ (excepting $\fidle$) consume more power pe
In the absence of CPU stalls, spin-locks, and thermal throttling, frequencies in this range are strictly worse.
Based on this observation and two further insights, we now propose our first adjustment to the \schedutil governor.
-First, recall that the only signal used by \schedutil is recent past CPU usage.
+First, recall that the main signal used by \schedutil is recent past CPU usage.
This signal conveys no information about CPU stalls, and so is not useful for deciding whether the CPU should be set to a frequency in this regime.
Second, we observe that workloads that trigger the relevant CPU behaviors are typically data-intensive and memory bound, or parallel workloads with high contention.
Such workloads are often offloaded to more powerful cloud compute infrastructures; when run at the edge (e.g., for federated learning), it is typically when the device has a stable power source.

View File

@@ -66,7 +66,7 @@ A common measure of user-perceivable value is the rate of dropped animation fram
In the ideal case, no frames are dropped.
Figure \ref{fig:screendrops_per_freq_fb} shows the jank rates for our case study at a variety of fixed frequencies, with \schedutil as a comparison point.
-Runs with speeds above this produce drop rates of $\sim$2\% and below, lower than that of the default dynamic policy ($\sim$3\%).
+Runs with speeds of 70\% and above produce drop rates of $\sim$2\% and below, lower than that of the default dynamic policy ($\sim$3\%).
We attribute the higher drop rate of \schedutil to the ramp-up period, where it runs the CPU at below $\fenergy$.
There is a step function: at frequencies below 60\%, jank increases to about $4\%$.
However, this step occurs below $\fenergy$.
@@ -84,14 +84,14 @@ During this time, the user is waiting on the app to become responsive.
It makes sense under such circumstances to run the CPUs at a higher speed to enhance user experience.
As we will discuss further in section \ref{subsec:signal_perf_needs}, the system already boosts the CPU to 100\% speed upon app launch for $\sim$0.5s.
Figure \ref{fig:coldstart_time_spot}, however, shows this is insufficient.
-The right-side graph shows that the app does not become fully responsive (Time to Fully Drawn, or TTFD) until $\sim$2s.
+The right-side graph shows that the app does not become fully responsive (Time to Fully Drawn, or TTFD) until $\sim$2s after a coldstart.
-The graph depicts the latency of initial display (when the app screen appears) and TTFD under 4 CPU policies: the default \schedutil, a fixed 70\% CPU speed, and the same 2 but with the CPU frequency set to 100\% from userspace.
+The graph depicts the latency of initial display (when the app screen appears) and TTFD under 4 CPU policies: the default \schedutil, a fixed 70\% CPU speed, and the same 2 but with the CPU frequency set to 100\% from userspace for 2s.
The vertical axes depict energy consumed; the inner boxes represent detail zooms of the larger plots.
-Unsurprisingly, the fixed 70\% policy offers worst performance on both metrics.
-The second worst is had by the unmodified default policy -- which also offers the worst energy performance.
+Unsurprisingly, the fixed 70\% policy offers the worst results on both performance metrics.
+Second-worst performance comes from the unmodified default policy -- which also offers the worst energy performance.
The best performance comes from a fixed 70\% frequency with a 2s boost.
-Likely, the \schedutil with a 2s boost policy harms itself with slow ramp-up.
+Likely, this partly stems from avoiding \schedutil's ramp-up penalty.
This shows that the existing CPU policy degrades user experience through poor latency.
Instead, a general-purpose 70\% speed combined with as-needed (and properly timed) frequency boosts offers both better performance (user responsiveness) and energy usage.
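A hedged sketch of such a properly-timed boost, built from existing \texttt{sysfs} knobs rather than the \texttt{stune} mechanism discussed in section \ref{subsec:signal_perf_needs}: raise the big cluster's frequency floor at launch and restore it 2s later. The paths and frequency values are illustrative, not measured settings.

    /* Sketch of a 2s userspace boost: pin the big cluster's floor to
     * maximum at app launch, then drop back to ~f_energy. Paths and
     * frequency values are illustrative assumptions. */
    #include <stdio.h>
    #include <unistd.h>

    static void write_str(const char *path, const char *val)
    {
            FILE *f = fopen(path, "w");

            if (!f)
                    return;
            fputs(val, f);
            fclose(f);
    }

    int main(void)
    {
            const char *floor_path =
                    "/sys/devices/system/cpu/cpufreq/policy4/scaling_min_freq";

            write_str(floor_path, "2457600");   /* floor = max: boost */
            sleep(2);                           /* cover TTFD, not just first frame */
            write_str(floor_path, "1747200");   /* restore ~70% (= f_energy) floor */
            return 0;
    }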
@@ -110,7 +110,7 @@ At 90\% of the core's maximum frequency, we would expect the time spent doing wo
This is not the case; the ratio is a half for the little cores, and only four-fifths for the big cores.
In short, as the CPU frequency goes up and more compute capacity becomes available, the Facebook app adapts by creating additional work.
-We attribute this increase in compute to the asynchronously-loading list through which our test case scrolls.
+We attribute this compute increase to the asynchronously-loading list through which our test case scrolls.
In this design pattern, the cells comprising a large list are not materialized all at once.
Rather, as a cell approaches the viewport, a background worker task is spawned to retrieve the data backing the cell and populate its component elements.
For example, a mail client might retrieve the contents of an email only as the message scrolls into view.
@@ -129,13 +129,13 @@ However, even at the CPU's maximum frequency, more work is created than the CPU
\begin{figure}
\centering
\includegraphics[width=.87\linewidth]{figures/graph_u_fb.pdf}
-\bfcaption{Energy consumed for a fixed set of iterations, given compute at different speeds}
+\bfcaption{Energy consumed for a fixed set of iterations, given compute at different speeds (10 runs, 90\% confidence)}
\label{fig:u_micro_fb}
\end{figure}
\Cref{fig:u_micro_fb} shows power consumption for the Facebook workload, padded with idle time to a fixed 40s period.
-Operating the CPU at maximum frequency imposes an energy overhead of approximately $1$mAh compared to operating at $\fenergy \approx 70\%$ of its maximum.
-This represents about $\frac{1}{2700}$ of the typical Pixel 2's maximum battery capacity.
+Operating the CPU at maximum frequency imposes an energy overhead of approximately $1.6$ mAh compared to operating at $\fenergy \approx 70\%$ of its maximum.
+This represents about $\frac{1}{1700}$ of the typical Pixel 2's maximum battery capacity.
While the energy cost is significant, the potential value of more table cells being displayed as they scroll past is subjective and beyond the scope of this study.
We consider each possibility in turn.
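For reference, the capacity fraction above follows from the Pixel 2's nominal 2700 mAh battery:

    \[
      \frac{1.6\,\mathrm{mAh}}{2700\,\mathrm{mAh}} \approx \frac{1}{1700}
    \]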
@@ -146,7 +146,7 @@ If added performance is desirable in this use case and others like it, then the
\subsection{Signaling Performance Needs}
\label{subsec:signal_perf_needs}
-The more interesting systems design question is how to select CPU speeds in the presence of adaptive applications, when the additional energy investment does not provide value.
+The more interesting systems design question is how to select CPU speeds in the presence of adaptive applications, when additional energy does not provide value.
Specifically, adaptive apps (while in-use, e.g., scrolling through a list) create a functionally infinite source of work.
The CPU usage profiles presented by an adaptive app and a user legitimately waiting on a CPU-bound task (e.g., cold-start) are identical, rendering them indistinguishable to \schedutil.
@@ -163,7 +163,7 @@ The kernel reacts to the boost parameter by scaling up the CPU usage as seen by
This virtual increase in CPU usage causes most governors to select a higher CPU frequency than they would otherwise select.
Android's user-space is already configured to make use of the \texttt{stune} API in performance-critical periods.
-For example, when an app cold starts, Android briefly marks the app with a boost parameter of 100 to mitigate \schedutil's usual ramp-up period.
+For example, when an app starts, Android briefly marks the app with a boost parameter of 100 to mitigate \schedutil's usual ramp-up period.
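The inflation applied by the boost parameter is, roughly, a linear margin toward full capacity; the sketch below mirrors the schedtune-style calculation (a simplification of the kernel code, with the standard capacity constant).

    /* Sketch of schedtune-style boosting: a boost of 100 closes the
     * entire gap between current utilization and full capacity, so the
     * governor sees a saturated CPU and selects a high frequency. */
    #define SCHED_CAPACITY_SCALE 1024

    static unsigned long boosted_util(unsigned long util, int boost)
    {
            unsigned long margin = (SCHED_CAPACITY_SCALE - util) * boost / 100;

            return util + margin;
    }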
\claim{
The boost parameter is a meaningful signal of the need for additional performance.
@@ -195,7 +195,7 @@ The switching cost in both cases, while small, is not negligible, suggesting tha
To summarize, increasing the CPU speed above $\fenergy$ adds more compute cycles per unit time, but comes with diminishing returns.
In the ideal situation where we know exactly what compute must be completed in a given time interval, we could set the CPU speed once, to the minimum frequency required to meet our obligations.
-However, this information is not available, forcing \schedutil to employ a simple PID control loop: As long as more work is offered, it keeps increasing the CPU speed.
+However, this information is not available, forcing \schedutil to employ a simple PID (proportional-integral-derivative) control loop: as long as more work is offered, it keeps increasing the CPU speed.
However, (i) adaptive apps offer an effectively infinite amount of work, (ii) micro-managing frequencies comes at a cost, and (iii) increasing CPU above $\fenergy$ does not meaningfully affect jank.
These observations, coupled with already extant adoption of the \texttt{prio\_hint} syscall in Android, drive our second governor proposal: \systemname.
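The control loop just described reduces to a single proportional step (cf. the kernel's get_next_freq()/map_util_freq(), modulo version details): utilization rises while work remains pending, and the selected frequency rises with it.

    /* schedutil's core proportional step:
     * freq = 1.25 * max_freq * util / max.
     * Under sustained offered load, utilization stays high and repeated
     * invocations walk the CPU up to full speed. */
    static unsigned long get_next_freq(unsigned long max_freq,
                                       unsigned long util, unsigned long max)
    {
            return (max_freq + (max_freq >> 2)) * util / max;
    }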
@@ -224,7 +224,7 @@ If no tasks are pending, it idles the CPU.
Allowing apps to pin the core to $\fperf$ could, in principle, extend the attack surface for the Android kernel.
However, we note that in practice, userspace is afforded no capabilities that it did not already have.
It can already spin uselessly from mistake or malice~\cite{maiti2015jouler}.
-If an app schedules work for more than $\sim$200ms, \schedutil will already ramp the core up to full speed~\ref{fig:missed_opportunities}.
+If an app schedules work for more than $\sim$200ms, \schedutil will already ramp the core up to full speed (Figure \ref{fig:missed_opportunities}).
Furthermore, the \texttt{schedtune.boost} API is already present in Android.
Regardless of policy, hardware-enforced thermal throttling will eventually cap a runaway process~\cite{8410428}.
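The userspace side of this design can be as small as a syscall wrapper; the sketch below is hypothetical, with the syscall number, name, and argument convention assumed from the paper's description of \texttt{prio\_hint}.

    /* Hypothetical wrapper for the prio_hint syscall described in the
     * text: a nonzero hint asks the governor for f_perf; zero returns
     * the cluster to its f_energy default. Number is a placeholder. */
    #include <sys/syscall.h>
    #include <unistd.h>

    #ifndef SYS_prio_hint
    #define SYS_prio_hint 437            /* assumed number for the sketch */
    #endif

    static inline long prio_hint(int hint)
    {
            return syscall(SYS_prio_hint, hint);
    }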