Pass over intro; frequency symbols no longer escape math mode.

Oliver Kennedy 2023-08-14 15:54:20 -04:00
parent e4fb4aa687
commit 74f8c8f518
Signed by: okennedy
GPG key ID: 3E5F9B3ABD3FDB60
7 changed files with 131 additions and 150 deletions


@@ -9,6 +9,7 @@
\usepackage{algpseudocode}
\usepackage[inline]{enumitem}
\usepackage{cleveref}
\usepackage{tcolorbox}
\newcommand{\feedback}[1]{[[\textcolor{blue}{#1}]]}
\newcommand{\bfcaption}[1]{\caption{\bf #1}}
@@ -22,12 +23,18 @@
\newcommand{\optspeed}{70\%}
\newcommand{\facebook}{Facebook\xspace}
\newcommand{\schedutil}{\texttt{schedutil}\xspace}
\newcommand{\fenergy}{$\mathcal{F}_{pow}$ }
\newcommand{\fperf}{$\mathcal{F}_{max}$ }
\newcommand{\fidle}{$\mathcal{F}_{0}$ }
\newcommand{\fmemory}{$\mathcal{F}_{mem}$ }
\newcommand{\fenergy}{\mathcal{F}_{pow} }
\newcommand{\fperf}{\mathcal{F}_{max} }
\newcommand{\fidle}{\mathcal{F}_{0} }
\newcommand{\fmemory}{\mathcal{F}_{mem} }
\newcommand{\memory}{cache bound\xspace} % memory access
\newcommand{\Memory}{Cache bound\xspace}
\newcounter{ClaimCounter}
\newcommand{\claim}[1]{
\begin{tcolorbox}[colframe=blue!75!white,colback=blue!10!white]
\refstepcounter{ClaimCounter}\textbf{Claim \theClaimCounter}: #1
\end{tcolorbox}
}
\setcopyright{acmcopyright}
\copyrightyear{2023}
@@ -101,15 +108,19 @@
\maketitle
\section{Governor Background}
\section{Introduction}
\label{sec:introduction}
\input{sections/introduction.tex}
\section{Reactive Governors Pick Unjustifiable Speeds}
\section{Background and Related Work}
\label{sec:related}
\input{sections/related.tex}
\section{Frequencies below $\fenergy$}
\label{sec:unjustifed}
\input{sections/unjustified.tex}
\section{Reactive Governors Use Resources to Little Benefit}
\section{Frequencies above $\fenergy$}
\label{sec:wasted}
\input{sections/wasted.tex}
@@ -125,10 +136,6 @@
\label{sec:evaluation}
\input{sections/evaluation.tex}
\section{Related Work}
\label{sec:related}
\input{sections/related.tex}
\section{Conclusions}
\label{sec:conclusions}
\input{sections/conclusion.tex}


@@ -6,7 +6,7 @@ We start with the default \schedutil governor, and iteratively adapt it based on
\subsection{Speeds below $\fenergy$}
Recall from \Cref{TODO}, that frequencies strictly below \fenergy (excepting \fidle) consume more power per CPU cycle than \fenergy, and result in higher latencies.
Recall from \Cref{TODO} that frequencies strictly below $\fenergy$ (excepting $\fidle$) consume more power per CPU cycle than $\fenergy$ and result in higher latencies.
In the absence of CPU stalls and spinlocks, there is no benefit to operating the CPU at frequencies in this range.
Our first design adaptation is based on this observation, coupled with two further insights.
@@ -22,19 +22,19 @@ Such workloads are often offloaded to more powerful cloud compute infrastructure
\begin{algorithmic}
\Ensure $f$: The target CPU frequency
\State $f \gets \schedutil{}\texttt{()}$
\State \textbf{if} {\fidle$ < f < $ \fenergy} \textbf{then} $f = $ \fenergy \textbf{end if}
\State \textbf{if} {$\fidle < f < \fenergy$} \textbf{then} $f = \fenergy$ \textbf{end if}
\end{algorithmic}
\end{algorithm}
To a first approximation, there is no value in running the CPU at frequencies between $\fidle$ and $\fenergy$.
\Cref{alg:boundedschedutil} summarizes our fix to schedutil: truncating the function's domain.
Speeds below \fenergy are increased to \fenergy.
Speeds below $\fenergy$ are increased to $\fenergy$.
This eliminates the ramp-up period illustrated in \Cref{fig:missed_opportunities}.
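The truncation in \Cref{alg:boundedschedutil} amounts to a one-line clamp. The following Python sketch is illustrative only: the actual governor lives in kernel C, and the frequency constants here are hypothetical placeholders, not measured device values.

```python
# Hypothetical frequencies in kHz; real values are device-specific.
F_IDLE = 300_000     # F_0: the idle setting
F_POW = 1_900_000    # F_pow: the energy-optimal frequency

def bounded_schedutil(schedutil_freq: int) -> int:
    """Truncate schedutil's domain: frequencies strictly between
    F_0 and F_pow are raised to F_pow; all others pass through."""
    if F_IDLE < schedutil_freq < F_POW:
        return F_POW
    return schedutil_freq
```

Frequencies at or above $\fenergy$, and the idle setting itself, are left untouched; only the wasteful in-between band is remapped.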
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\subsection{Speeds above \fenergy}
\subsection{Speeds above $\fenergy$}
As the CPU frequency increases above \fenergy, more compute cycles become available per unit time at the cost of increased energy.
As the CPU frequency increases above $\fenergy$, more compute cycles become available per unit time at the cost of increased energy.
As we observe in \Cref{TODO}, this trade-off has diminishing returns: The faster we make the CPU, the more energy each subsequent increase costs.
In the ideal situation where we know exactly what compute must be completed in a given time interval, we could set the CPU speed to the minimum frequency required to meet our obligations.
As this knowledge is not available, \schedutil employs a simple PID control loop: As long as more work is offered, keep increasing the CPU speed until we finish.
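The kernel documentation describes schedutil's core rule as a proportion of recent utilization with 25\% headroom, roughly $f_{next} = 1.25 \cdot f_{max} \cdot util / capacity$. A simplified sketch (the kernel additionally rate-limits updates and snaps to the available frequency table; the maximum frequency below is a hypothetical value):

```python
F_MAX = 2_400_000  # hypothetical maximum frequency in kHz

def schedutil_next_freq(util: float, max_capacity: float) -> float:
    """Simplified schedutil rule: scale frequency in proportion to
    recent utilization, with 25% headroom, capped at F_MAX."""
    return min(F_MAX, 1.25 * F_MAX * util / max_capacity)
```

At half utilization this already requests well over half the maximum frequency; as utilization approaches capacity, the headroom factor drives the request to $\fperf$.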
@@ -43,7 +43,7 @@ However, recall from \Cref{TODO} that apps can and do scale the amount of work t
Furthermore, recall from \Cref{TODO} that in many cases, increasing CPU frequencies does not lead to perceptible benefits.
In summary, in attempting to scale themselves to the capabilities of a device, apps force the CPU to its maximum frequency without offering a meaningful benefit in return.
This observation leads us to a radical, if naive proposal (we offer a somewhat less radical variant shortly): Ignore speeds other than \fenergy and \fidle, as illustrated in \Cref{alg:naiveKiss}.
This observation leads us to a radical, if naive proposal (we offer a somewhat less radical variant shortly): Ignore speeds other than $\fenergy$ and $\fidle$, as illustrated in \Cref{alg:naiveKiss}.
\begin{algorithm}
\caption{\texttt{NaiveKISS}($\mathcal T$)}
@@ -51,7 +51,7 @@ This observation leads us to a radical, if naive proposal (we offer a somewhat l
\begin{algorithmic}
\Require $\mathcal T$: The set of currently scheduled tasks
\Ensure $f$: The target CPU frequency
\State \textbf{if} {$|\mathcal T| > 0$} \textbf{then} $f = $ \fenergy \textbf{else} $f = $ \fidle \textbf{end if}
\State \textbf{if} {$|\mathcal T| > 0$} \textbf{then} $f = \fenergy$ \textbf{else} $f = \fidle$ \textbf{end if}
\end{algorithmic}
\end{algorithm}
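The two-point policy of \Cref{alg:naiveKiss} is trivially small in executable form. An illustrative Python model, with hypothetical frequency constants standing in for the device-specific $\fidle$ and $\fenergy$:

```python
F_IDLE = 300_000     # F_0, hypothetical idle frequency (kHz)
F_POW = 1_900_000    # F_pow, hypothetical energy-optimal frequency (kHz)

def naive_kiss(tasks) -> int:
    """Run at F_pow whenever any task is scheduled, F_0 otherwise."""
    return F_POW if len(tasks) > 0 else F_IDLE
```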
@@ -66,7 +66,7 @@ Android already offers a native syscall API that allows userspace to communicate
Through this API, userspace applications can request increased CPU performance for short periods.
This API is used by Android, for example to accelerate performance during app cold-starts.
Our final governor iteration, summarized in \Cref{alg:fullKiss}, incorporates per-task frequency requests, selecting the fastest frequency requested by a running app (or \fenergy if higher).
Our final governor iteration, summarized in \Cref{alg:fullKiss}, incorporates per-task frequency requests, selecting the fastest frequency requested by a running app (or $\fenergy$ if higher).
\begin{algorithm}
\caption{\texttt{KISS}($\mathcal T$)}
@@ -74,15 +74,15 @@ Our final governor iteration, summarized in \Cref{alg:fullKiss}, incorporates pe
\begin{algorithmic}
\Require $\mathcal T$: The set of currently scheduled tasks
\Ensure $f$: The target CPU frequency
\State $\mathcal F \gets \left\{\; t.request \;|\; t \in \mathcal T \;\right\} \cup \{ $ \fenergy $\}$
\State \textbf{if} {$|\mathcal T| > 0$} \textbf{then} $f = \max(\mathcal F)$ \textbf{else} $f = $ \fidle \textbf{end if}
\State $\mathcal F \gets \left\{\; t.request \;|\; t \in \mathcal T \;\right\} \cup \{ \fenergy \}$
\State \textbf{if} {$|\mathcal T| > 0$} \textbf{then} $f = \max(\mathcal F)$ \textbf{else} $f = \fidle$ \textbf{end if}
\end{algorithmic}
\end{algorithm}
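\Cref{alg:fullKiss} extends the naive policy with per-task frequency requests. An illustrative Python model follows; the task representation (dicts with a \texttt{request} field) and the frequency constants are hypothetical stand-ins, not the kernel's data structures.

```python
F_IDLE = 300_000     # F_0 (kHz), hypothetical
F_POW = 1_900_000    # F_pow (kHz), hypothetical

def kiss(tasks) -> int:
    """Select the fastest frequency requested by any scheduled task,
    but never less than F_pow; idle at F_0 when nothing is scheduled."""
    if not tasks:
        return F_IDLE
    requests = {t.get("request", 0) for t in tasks}
    return max(requests | {F_POW})
```

Because $\fenergy$ is always a member of the candidate set, a task with no request (or a request below $\fenergy$) cannot drag the core into the wasteful sub-$\fenergy$ band.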
\paragraph{Security Concerns}
Allowing apps to pin the core to \fperf may be considered a security hole, as a malicious app could aggressively drain a user's battery.
We first observe, however, that this capability is already available to apps: If an app schedules work for more than 200ms \todo{check the number please}, \schedutil will already run the core at \fperf.
Allowing apps to pin the core to $\fperf$ may be considered a security hole, as a malicious app could aggressively drain a user's battery.
We first observe, however, that this capability is already available to apps: If an app schedules work for more than 200ms \todo{check the number please}, \schedutil will already run the core at $\fperf$.
Moreover, the performance request API call is already present in Android.
Conversely, we observe that passing requests for performance explicitly from userspace creates opportunities for more effective security policies, for example requiring user consent to request long-duration performance boosts.
@@ -135,11 +135,11 @@ Moreover, such requests are not subject to the usual ramp-up period of \scheduti
% \ElsIf{$ \exists cache\_bound\_ hint $}
% \State $ speed \gets 50 $
% \ElsIf{$ screenon = False $}
% \State $ speed \gets $ \fenergy
% \State $ speed \gets $ $\fenergy$
% \ElsIf{$ \exists performance\_hint $}
% \State $ speed \gets $ \fperf
% \State $ speed \gets $ $\fperf$
% \Else
% \State $ speed \gets $ \fenergy
% \State $ speed \gets $ $\fenergy$
% \EndIf
% \EndProcedure
@@ -193,22 +193,22 @@ Moreover, such requests are not subject to the usual ramp-up period of \scheduti
% \systemname runs the general case at a speed setting that conserves energy.
% We previously observed in Section \ref{complexity_cost} that, for a given workload, energy can be minimized by running the CPU at some midspeed.
% The exact energy-optimal speed is device- and CPU-specific, as even the big and little core clusters of phones will have slightly different optima.
% However, we have found that the variance is close enough that a simple single-speed setting, \fenergy, suffices.
% However, we have found that the variance is close enough that a simple single-speed setting, $\fenergy$, suffices.
% Previous studies have acknowledged the policy goal in adjusting CPU speed on phones should be to minimize energy usage, subject to meeting performance targets.\cite{rao2017application}
% Non-phone studies have suggested the goal in such cases should be to meet deadlines rather than merely to minimize computation latency.\cite{korkmaz2018workload}
% Given the useful output of phones is largely screen interaction, we adopt screendraws as our metric.
% We make the key observation, which we evaluate in Section \ref{sec:evaluation}, that \fenergy speed not only saves energy but is \textit{also fast enough} to run representative interactive apps at acceptable performance.
% We make the key observation, which we evaluate in Section \ref{sec:evaluation}, that $\fenergy$ speed not only saves energy but is \textit{also fast enough} to run representative interactive apps at acceptable performance.
% It does so by avoiding the pitfalls of picking a speed that is too low -- sacrificing performance, energy or both -- or too high, wasting energy.
% %This design, intuitive for non-interactive periods, also proves suitable for running typical interactive apps.
% In the latter case, the problem with the default governor is that it runs the CPU faster than necessary to meet deadlines, resulting in overperformance.
% We design \systemname to receive hints from userspace when exceptions to the general case should occur.
% Particularly, periods when performance should be prioritized run at the maximum CPU speed \fperf.
% Particularly, periods when performance should be prioritized run at the maximum CPU speed $\fperf$.
% Such instances, when the user is waiting on the phone, include app coldstarts and app installs.
% Note that such periods are not synonymous with CPU-bound workloads.
% Indeed, the intermittency of the CPU load during these periods in practice is precisely what makes their runtime even worse in the default case by triggering the default \schedutil policy to ramp down the CPU speed.
% Rather, we leave the CPU at \fperf when there is work, and let the idle policy shut off the CPU when not.
% Rather, we leave the CPU at $\fperf$ when there is work, and let the idle policy shut off the CPU when not.
% Our governor takes its guidance from what the system priority should be, rather than how much computational work there is.
% \fixme{\memory hints}
@@ -219,7 +219,7 @@ Moreover, such requests are not subject to the usual ramp-up period of \scheduti
% \label{sub:decision_logic}
% Algorithm \ref{alg:cpu_speed_selection} depicts the speed selection logic of \systemname.
% Our basic design is to set the CPU speed to \fenergy for general case, unless overridden.
% Our basic design is to set the CPU speed to $\fenergy$ for general case, unless overridden.
% In all cases, the existing Linux idle policy will disable the CPU if the corresponding runqueue is empty -- that is, if the individual CPU has no work.
% Otherwise, when there is work to do, the CPU will run in a C-state at a speed selected by \systemname.
% The governor first considers whether there is any current \memory hint, from any app.
@@ -230,18 +230,18 @@ Moreover, such requests are not subject to the usual ramp-up period of \scheduti
% Otherwise, the governor considers whether the phone is currently interactive -- whether the screen is on.
% \fixme{or audio streaming?}
% If so, it sets the CPU speed to \fenergy: Absent interactivity, there is no reasonable priority besides conserving energy.
% If so, it sets the CPU speed to $\fenergy$: Absent interactivity, there is no reasonable priority besides conserving energy.
% Any request otherwise from an app is therefore discarded.
% Thirdly, the governor checks whether there is any current performance hint request, from any app, and if so immediately sets the CPU speed to \fperf.
% Thirdly, the governor checks whether there is any current performance hint request, from any app, and if so immediately sets the CPU speed to $\fperf$.
% Since the device is interactive, the phone should immediately prioritize latency and not wait for any rampup.
% Nor should it consider any intermittency in the workload -- say, an app blocking on I/O -- as does the \schedutil default policy.
% Lastly, as a default the governor sets the CPU to \fenergy.
% Lastly, as a default the governor sets the CPU to $\fenergy$.
% In this case, the policy knows the phone is interactive, so it also monitors screendraws.
% \fixme{not implemented -- offer as an option?}
% If the frame drop (jank) rate hits \fixme{what}, the governor adjusts the speed upward slightly, bounded by \fixme{what}.
% This avoids the overperformance triggered by the default policy while maintaining user experience.
% In practice, we have observed that keeping the CPU at \fenergy is sufficient.
% In practice, we have observed that keeping the CPU at $\fenergy$ is sufficient.
% We do not address different simultaneous intra-cluster CPU speeds: On our devices, like most current phones \fixme{verify this}, the speeds of the 4 big and 4 little core clusters must be set as a block.
% For stability and security, we implement a time-out of 10s -- we have observed these periods are typically much less than this in practice, and userspace can always re-supply the hint.
@@ -266,7 +266,7 @@ We rely on the existing system idle policy to put the CPU in an idle state whene
The Linux CFS scheduler, as before, periodically calls into the governor to set the CPU cluster speed.
The \systemname governor picks a new CPU (cluster) speed as described in Section \ref{sub:decision_logic} above.
%The default policy sets the speed based upon recent utilization.
%Instead, we set the speed to \fenergy in the general case.
%Instead, we set the speed to $\fenergy$ in the general case.
A syscall API, with native calldown support from the Android platform, allows userspace to communicate hints about pending system needs and to request a new default CPU speed setting.
This suggestion can be either a new fixed speed or a bounded range that allows the calculated \schedutil speed to float within a requested range.
The kernel then uses this information, along with the decision logic of Section \ref{sub:decision_logic}, to set the actual speed of the CPU cluster.


@@ -1,11 +1,5 @@
% -*- root: ../main.tex -*-
\begin{figure}
\centering
\includegraphics[width=.95\linewidth]{figures/graph_missed_opportunities.pdf}
\bfcaption{Phone governors hurt both energy and performance by running the CPU at wasteful speeds and taking time to ramp up}
\label{fig:missed_opportunities}
\end{figure}
%!TEX root=../main.tex
\begin{figure}
\begin{tabular}{l|c}
@@ -16,10 +10,18 @@ Launch screen on; idle & 130 \\
1 CPU saturated; screen off & 310 \\
2 CPUs saturated; screen off & 560 \\
\end{tabular}
\bfcaption{CPU usage dominates energy consumption}
\bfcaption{Energy usage on a Pixel 2}
\label{fig:item_energy_cost}
\end{figure}
\begin{figure}
\centering
\includegraphics[width=.95\linewidth]{figures/graph_missed_opportunities.pdf}
\bfcaption{An example trace of \schedutil's CPU frequency selections for a fixed workload (solid blue). The dotted red line shows an energy/latency-optimal frequency choice ($\fenergy$).}
\label{fig:missed_opportunities}
\end{figure}
\begin{figure}
\centering
\includegraphics[width=.95\linewidth]{figures/graph_showcase.pdf}
@@ -27,76 +29,34 @@ Launch screen on; idle & 130 \\
\label{fig:showcase}
\end{figure}
CPUs form the computation heart of computing systems including phones.
They also consume considerable energy.
Phones must balance providing computation resources when needed and reducing resources, to save energy, when not.
Historically, systems have addressed these two competing goals by employing frequency scaling to change the speed at which the CPU runs.
They set the CPU speed higher when there is pending computation, optimizing performance at the expense of energy.
They set speed lower when computation needs decline or stop, to save energy.
CPUs consume considerable energy on mobile phones.
As Table \ref{fig:item_energy_cost} shows, a single (big) CPU core on a Pixel 2, running at full speed with the screen off, consumes almost three times the energy of the display, and a second core running at full speed almost doubles that.
On typical mobile phones, these high costs are mitigated by running the CPU at a slower speed (frequency) to save energy.
The rules that govern this speed selection, called governors, must balance providing computation resources when needed, and reducing resources to save energy when not.
On modern systems, CPUs typically consist of multiple cores, often of different types, that run at different speeds (known as P-states) or can be placed into idle states (known as C-states).
The software policies that control what CPU cores run when and at what performance level must balance competing system design goals, particularly optimizing for energy versus for performance.
Most popular recent and current Android (resp., Linux) governors, such as \texttt{ondemand}, \texttt{interactive}, and \texttt{conservative}, and the current Android system default, \schedutil, use a proportion of recent past CPU usage as a guide to set future speeds.
In this paper, we explore several premises on which the design of these governors is based.
We identify flaws in these premises, and propose a new, simpler governor that achieves lower latency and lower power consumption than \schedutil.
Phones, as embedded devices, must be particularly cognizant of energy.
A major power consumer on phones is the CPU cores.
Table \ref{fig:item_energy_cost} shows the energy consumed by the system for a fixed time when run under different conditions.
Saturating a single CPU, with a blank display, consumes over double the energy of an on display with all CPUs idle; saturating 2 CPUs nearly doubles this again.
It is thus critical, for efficient energy management, to manage CPU usage properly.
The fundamental insight behind this paper, also observed by prior work~\cite{vogeleer2013energy, nuessle2019benchmarking}, is that there exists an energy-optimal frequency for each device (call it $\fenergy$).
We argue that
(i) past CPU usage is not a meaningful signal for identifying the rare cases when speeds below $\fenergy$ are appropriate,
(ii) speeds above $\fenergy$ are useful only in specific situations, often known in advance by userspace.
\Cref{fig:missed_opportunities} illustrates the potential for improvement:
(i) \schedutil has a ramp-up period (first grey box) where the CPU is operating at speeds that sacrifice both energy and performance, and
(ii) \schedutil continues ramping up the frequency (second grey box) paying significant energy costs for often negligible visible benefits.
Ultimately, we propose a series of changes to \schedutil, converging on a radical proposal: Default the CPU's frequency to $\fenergy$, switching to faster speeds based only on (already existent) signals from userspace.
Based on the simplicity of this approach, we call it the \systemname governor.
Considering the radical nature of \systemname, we also explore the gradient between \schedutil and \systemname, and show that significant gains are possible, even with only minor adjustments to \schedutil.
\subsection{The role of CPU governors}
There have been a number of policies, called \textit{governors}, to determine at which speed to run the CPU.
Most popular recent and current Linux governors, such as \texttt{ondemand, interactive}, and \texttt{conservative}, and the current Android system default, \schedutil, use a proportion of recent past CPU usage as a guide to set future speeds.
These governors run in conjunction with other policies, in particular (i) the scheduler -- which determines what tasks are run on what CPU cores and (ii) the idle policy -- which shuts down CPUs with no pending work.
Hardware design on phones can constrain governor policy calculations.
CPU speeds often cannot be set on individual cores but only on groups of CPUs -- a constraint stemming from the asymmetric big-little CPU architecture, with 2 clusters of higher- and lower-performance CPU cores.\cite{big-little}
\subsection{The problem with CPU governors on phones \fixme{OR A simpler governor}}
%\XXXnote{First 3 of 4 para's in this subsection focus on current problems}
%\todo{Question}
The default governor policy, despite the considerable sophistication involved in its implementation, is based on a flawed premise: That past utilization is a meaningful signal of the optimal CPU speed.
As we will show in this paper, this premise is based on a set of assumptions that are not applicable to modern mobile devices.
%% SHOW THIS...
%, frequently makes sub-optimal choices that waste energy and inhibit performance relative to what we show can be achieved using different speed settings.
%At other times, the same default often picks speeds that retard performance when it is needed, thus degrading user experience.
%The reason the default policy rarely picks a good speed stems from the main input it uses, CPU utilization.
%This metric, as we will show, proves \fixme{nearly?} useless for calculating the speed at which the CPU should be run.
To understand this, we present additional claims that we will later substantiate:
\begin{enumerate*}
\item For a device and a workload, there exists a CPU frequency that minimizes energy usage. We denote this \fenergy.
\item A CPU frequency below \fenergy always wastes energy, except in very specific corner cases.
%thermal throttling or memory stalling
\item A CPU speed above \fenergy reduces useful latency in specific, identifiable situations, but in most other cases consumes energy for negligible benefit.
\item User apps, given additional CPU resources, will not show perceptible benefit.
\end{enumerate*}
\fixme{Revisit ordering of above items}
Figure \ref{fig:missed_opportunities} illustrates the core of these problems in practice.
We ran a short $\sim$0.5s CPU-bound load on a previously idle phone using default settings, and tracked the effect on CPU speed (with 0 representing idle).
The default governor notices the load and ramps up the speed until it hits the 100\% maximum.
The solid blue line represents actual CPU speed, with 0 representing idle.
The dashed red line indicates a speed that minimizes energy usage for this device CPU.
The 2 shaded grey regions indicate squandered energy.
The default governor initially runs the CPU at below the energy-optimal setting.
Speeds below this constitute a double loss: not only do they increase runtime, the additional energy spent in keeping the CPU on longer outweighs the energy saved in avoiding higher speeds.
The lower-left blue triangle illustrates this, where the CPU speed is wasting both energy and runtime.
Once the CPU speed rises above this speed, the system is now trading off between performance and energy.
As we discuss later, for most usages on phones, this added performance is unnecessary.
This trade-off is depicted on the graph by the upper grey trapezoid.
In this paper, we present our governor, \systemname, which adopts a simpler heuristic based on common usage needs.
% which runs tasks at speeds that save energy compared to the system default -- speeds that, in practice, also prove sufficiently performant to maintain user experience.
It avoids the twin pitfalls of overly slow speeds, which not only hurt latency but also cost energy by increasing runtime, and overly fast settings, which cost energy without perceptible benefit.
\systemname leverages information from userspace, sometimes already furnished by the Android platform, to identify those common use cases that do warrant additional speed.
\fixme{implement}
We ran our experiments on Google Pixel 2 devices with Android AOSP, evaluating \systemname against the system default and several other policies, using microbenchmarks and popular apps.
We run our experiments on Google Pixel 2 devices with Android AOSP, evaluating \systemname against the system default and several other policies, using microbenchmarks and popular apps.
These are representative of common platforms and uses in the real world.
This paper is organized as follows:
(i) We review background and related work in \Cref{sec:related}.
(ii) In \Cref{sec:unjustifed}, we confirm our hypothesis that $\fenergy$ exists and demonstrate that speeds below $\fenergy$ are usually unjustified.
(iii) In \Cref{sec:wasted}, we explore the remainder of the design space through a series of micro-benchmarks that motivate \systemname.
(iv) In \Cref{sec:design}, we tie together our claims from the preceding sections in the design of a series of governors along the gradient from \schedutil to \systemname.
(v) Finally, we evaluate these proposed governors in \Cref{sec:evaluation}.


@@ -11,27 +11,27 @@ To answer the question of at what speed a governor \textit{should} set the CPU,
We illustrate our discussion of 5 CPU speed regimes with Figure \ref{fig:optimize_goal_cpu_speed}: 3 specific CPU speed settings and 2 speed regions.
\subsection{Speed Baselines}
We define 3 CPU frequencies, \fidle, \fenergy, and \fperf, as follows, noting that the precise values of these frequencies will vary with hardware.
First, there exists an idle optimal setting for when there is no work, which we denote as \fidle.
We define 3 CPU frequencies, $\fidle$, $\fenergy$, and $\fperf$, as follows, noting that the precise values of these frequencies will vary with hardware.
First, there exists an idle optimal setting for when there is no work, which we denote as $\fidle$.
%While our plots often depict this as 0, it is strictly neither a 0 speed or the lowest available speed setting, but a processor C-state.
This corresponds to the processor being in a C-state.
Second, there is also an energy-optimal speed: Figure \ref{fig:u_micro} shows that, for each of the plotted curves that illustrate the energy and runtime for different workload settings, there exists an energy minimum, depicted by the vertical lowpoint on the curve.
We denote this speed \fenergy in Figure \ref{fig:optimize_goal_cpu_speed}.
We denote this speed $\fenergy$ in Figure \ref{fig:optimize_goal_cpu_speed}.
For workloads involving > 1 CPU, which is typical of phone apps, this speed lies between 60\%-80\% for little cores and 40\%-60\% for big cores on our representative device.
%While the exact speed depends on the device and CPU type, it will be at some midpoint between the lowest and highest speed settings.
Third, the performance optimal speed of any CPU is simply the highest available setting, which we denote \fperf in figure \ref{fig:optimize_goal_cpu_speed}.
Third, the performance optimal speed of any CPU is simply the highest available setting, which we denote $\fperf$ in Figure \ref{fig:optimize_goal_cpu_speed}.
Fourth, between \fidle and \fenergy lies a CPU speed regime that offers nearly no benefit in practice (despite the system default policy often picking speeds in this region).
Figure \ref{fig:u_micro} depicts this as the right-side part of the plots that curve up and to the right, where a slower speed setting than \fenergy both increases runtime and consumes more energy.
Fifth, the speed regime between \fenergy and \fperf offers a trade off between power and runtime, depicted in Figure \ref{fig:u_micro} as the left-side part of the plots that curve sharply up and to the left, where a higher speed setting produces a lower runtime but costs more energy.
The system would justifiably want to pick a speed in this region if the minimum speed necessary to meet system constraints and pending deadlines, such as latency deadlines or screendraws, is \textit{both} faster than \fenergy \textit{and also} slower than \fperf.
Fourth, between $\fidle$ and $\fenergy$ lies a CPU speed regime that offers nearly no benefit in practice (despite the system default policy often picking speeds in this region).
Figure \ref{fig:u_micro} depicts this as the right-side part of the plots that curve up and to the right, where a slower speed setting than $\fenergy$ both increases runtime and consumes more energy.
Fifth, the speed regime between $\fenergy$ and $\fperf$ offers a trade off between power and runtime, depicted in Figure \ref{fig:u_micro} as the left-side part of the plots that curve sharply up and to the left, where a higher speed setting produces a lower runtime but costs more energy.
The system would justifiably want to pick a speed in this region if the minimum speed necessary to meet system constraints and pending deadlines, such as latency deadlines or screendraws, is \textit{both} faster than $\fenergy$ \textit{and also} slower than $\fperf$.
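A toy power model makes the U-shape of these curves concrete. Assume, purely for illustration (real curves are measured per device), static power plus cubic dynamic power, $P(f) = P_s + k f^3$. Energy per unit of work is then $E(f) = P(f)/f = P_s/f + k f^2$, which is minimized at $f^* = \left(P_s / 2k\right)^{1/3}$, a midpoint speed playing the role of $\fenergy$:

```python
def energy_per_work(f, p_static, k):
    """Energy per cycle under the toy model P(f) = p_static + k*f**3."""
    return p_static / f + k * f ** 2

def f_energy(p_static, k):
    """Closed-form minimizer of energy_per_work: (p_static/(2k))**(1/3)."""
    return (p_static / (2 * k)) ** (1 / 3)
```

With $P_s = 1$ and $k = 0.5$ the optimum lands at $f^* = 1$; speeds on either side cost strictly more energy per unit of work, matching the shape of the curves in Figure \ref{fig:u_micro}.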
%%%%%%%%%\fixme{add intemittent loads to u-curve microb?}
\subsection{The CPU speed regime between \fidle and \fenergy}
\subsection{The CPU speed regime between $\fidle$ and $\fenergy$}
\label{subsec:regimes_idle_energy}
Running the CPU at a speed below \fenergy should only be done in identifiable corner cases.
Running the CPU at a speed below $\fenergy$ should only be done in identifiable corner cases.
One reason recognized by the kernel maintainers is implementing thermal throttling.
%\cite{energy-aware-schedutil}
However, this is typically already enforced in hardware and is already ignored by the current default policy.
@@ -39,7 +39,7 @@ Phone workloads also typically do not saturate CPUs for lengthy periods.
A second reason we have identified for running the CPU in this regime is to minimize CPU stalls.
Fetching data from main memory due to cache misses, or clearing the results of misspeculated branches, produce periods where the CPU does no useful work but still consumes energy.
A speed slower than \fenergy may be justifiable to minimize energy.
A speed slower than $\fenergy$ may be justifiable to minimize energy.
However, to produce stalls in meaningful quantities requires conditions that constantly produce cache misses or frequent branch mispredictions.
An example of the former case would be a \memory workload such as scanning a very large hash table or sorting a large sparse array.
%\fixme{show stall microbench}
@@ -51,21 +51,21 @@ However, we have not identified such an actual use case on phones to date.
\subsection{The CPU speed regime between \fenergy and \fperf}
\subsection{The CPU speed regime between $\fenergy$ and $\fperf$}
\label{subsec:regimes_energy_perf}
The system should pick a speed above \fenergy only when additional speed is necessary to achieve system goals such as meeting UI screendraws or other background latency deadlines.
For example, when the user is waiting on the phone, the CPU should immediately be set to \fperf.
The system should pick a speed above $\fenergy$ only when additional speed is necessary to achieve system goals such as meeting UI screendraws or other background latency deadlines.
For example, when the user is waiting on the phone, the CPU should immediately be set to $\fperf$.
%That is, the ideal CPU policy for phones, rather than dwelling on past utilization, should be to minimize energy subject to performance constraints.
Absent compelling reason, the governor should be wary of increasing frequency above $\fenergy$.
Figure \ref{fig:u_micro} shows that the energy penalty ramps sharply for the highest speed -- particularly so when multiple CPUs are being used (the upward curving left side of the yellow, dash-dotted lines in the graph), as is typically the case with real world apps.
Yet, under the default policy, the speed regime above $\fenergy$ is precisely where apps spend much of their time.
Figure \ref{fig:time_per_freq_fb} shows the CPUs spend significant time well inside the $\fenergy$--$\fperf$ regime.
The right-side grey shaded regions of the bottom CDF graphs illustrate the time the CPUs spend here, with significant amounts at their highest and most wasteful speeds: approximately half of their non-idle time, in the case of the big core CPUs.
A task being compute bound with high CPU utilization is \textit{not} by itself justification for entering the $\fenergy$--$\fperf$ regime.
The \schedutil governor in such situations will blindly ramp up speed to maximum -- indeed, that is what happens in Figure \ref{fig:u_micro}.
While this behavior is beneficial for interactive, compute-bound tasks when the user is waiting and runtime is the priority, it is less desirable for background tasks.
%The governor cannot distinguish when the additional energy may be justified and will always blindly adjust to a high speed (yet ironically taking too long to adjust when warranted).
Establishing the exact energy-performance tradeoff requires precise, use-case-dependent measurement.
\fixme{wordsmithing?}
In practice, we make a key observation that the speed necessary to maintain user experience hovers around $\fenergy$ and rarely approaches $\fperf$.
Figure \ref{subsec:regimes_energy_perf} shows that CPU speeds above $\fenergy$ do not offer perceptible benefits, as measured by the primary user experience metric on phones, framedrop rate.
Indeed, using the simple fixed speed of $\fenergy$ offers better measured results than the default.
Absent infrequent, identifiable periods when conditions warrant additional performance -- such as when the user is waiting on a response from the phone --
\textit{running the CPUs at a preset fixed speed of $\fenergy$ is sufficient to maintain user experience.}
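A fixed-speed policy like the one argued for above can be sketched against the stock Linux cpufreq \texttt{sysfs} interface. This is a hypothetical illustration, not the paper's actual mechanism: it assumes the \texttt{userspace} governor and the \texttt{scaling\_available\_frequencies} file (present with common cpufreq drivers, but not all), requires root, and uses the 70\% fraction only as this paper's measured approximation of $\fenergy$ on its test device.

```python
# Hedged sketch: pin a cpufreq policy to a fixed ~70% speed (this paper's
# approximation of F_pow on its test device) via the `userspace` governor.
# Paths follow the standard Linux cpufreq sysfs layout; requires root.
from pathlib import Path

def nearest_pstate(available_khz, target_khz):
    """Snap a target frequency to the closest hardware-supported P-state."""
    return min(available_khz, key=lambda f: abs(f - target_khz))

def pin_policy(policy_dir, fraction=0.70):
    """Set one cpufreq policy (a cluster of CPUs on phones) to a fixed speed."""
    policy = Path(policy_dir)
    max_khz = int((policy / "cpuinfo_max_freq").read_text())
    avail = [int(tok) for tok in
             (policy / "scaling_available_frequencies").read_text().split()]
    target = nearest_pstate(avail, fraction * max_khz)
    (policy / "scaling_governor").write_text("userspace")
    (policy / "scaling_setspeed").write_text(str(target))
    return target
```

Because speeds are set per policy (per cluster) rather than per core, one call per big/little cluster suffices.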
%Real world apps, while running, commonly spend the bulk of their time blocking on user input while also running pre-compute tasks in the background.
%As long as these tasks finish their work before without impacting user experience, there is no reason to set a CPU speed any higher.
\tinysection{Optimizing for energy}
This is the common case: Most of the time, the main app thread is blocking on user input, whether interactively with the screen on or while dozing with the screen off.
Here, the goal is saving energy, and the governor should run the CPU at the $\fenergy$ regime.
While there are typically background threads running, they are precomputing work.% for some future use, particularly pending screendraws.
%In the this case, completing them as quickly as possible is not the goal.
%Rather, they only need to be ready before periodic needs, such as user response or screendraw deadlines.
The key observation is that the presence of background tasks does \textit{not change the goal of optimizing for energy}.
As Figure \ref{fig:time_per_freq_fb} shows, the CPUs spend the bulk of their time in idle -- that is, there are plenty of potential compute resources available.
%Thus, there is typically no need to run CPUs anywhere close to full speed.
We will show that running the phone CPUs at the $\fenergy$ regime speed will save energy compared to the default policy while still meeting screendraw deadlines.
That is, there is no reason to set the CPU speed in the regime between $\fenergy$ and $\fperf$.
\fixme{also show this works for download / audiostream}
\tinysection{Optimizing for performance}
When the user is waiting, the goal should be performance and the governor should set the CPU to the $\fperf$ regime.
App installs, app coldstarts -- after an installed app gets killed due to memory pressure -- and opening new browser tabs all fit this case.
There is no reason to run the CPU at any less than 100\%, as the default policy often does.
Notably, the nature of the CPU load by itself is insufficient to determine when to optimize for performance.

% -*- root: ../main.tex -*-
\paragraph{Governors}
Historically, systems have addressed the competing goals of energy and latency optimization by employing frequency scaling to change the speed at which the CPU runs.
On modern systems, CPUs typically consist of multiple cores, often of different types, that run at different speeds (known as P-states) or can be turned off into idle states (known as C-states).
A policy called a `governor' sets the CPU to a higher-frequency P-state when there is pending computation, optimizing performance at the expense of energy, and vice versa.
The governor runs in conjunction with other policies, in particular (i) the scheduler, which determines which tasks run on which CPU cores, and (ii) the idle policy, which places CPUs with no pending work into an idle C-state.
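The governor and its P-states described above are visible through Linux's standard cpufreq \texttt{sysfs} tree. As a hedged sketch (the \texttt{scaling\_available\_frequencies} file is driver-dependent and absent on some systems), one can inspect the active governor and supported P-states per policy:

```python
# Sketch: inspect the cpufreq governor and available P-states through the
# standard Linux sysfs layout. On phones, per-cluster "policy" nodes group
# CPUs that must share one speed setting.
from pathlib import Path

CPUFREQ = Path("/sys/devices/system/cpu/cpufreq")

def parse_freq_list(text):
    """Parse a space-separated frequency list (kHz) as sysfs reports it."""
    return sorted(int(tok) for tok in text.split())

def policy_summary(policy_dir):
    """Return (governor, sorted available P-states in kHz) for one policy."""
    gov = (policy_dir / "scaling_governor").read_text().strip()
    freqs = parse_freq_list(
        (policy_dir / "scaling_available_frequencies").read_text())
    return gov, freqs

if CPUFREQ.exists():  # only meaningful on a Linux machine with cpufreq
    for policy in sorted(CPUFREQ.glob("policy*")):
        print(policy.name, *policy_summary(policy))
```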
Hardware design on phones can constrain governor policy calculations.
CPU speeds often cannot be set on individual cores but only on groups of CPUs -- a constraint stemming from the asymmetric big-little CPU architecture, with two clusters of higher- and lower-performance CPU cores.\cite{big-little}
Rao et al.\ acknowledge the need for going beyond a blind general-purpose governor, and for tuning performance to particular apps.\cite{rao2017application}
They do not...

%\fixme{show}
\subsection{Governors should not pick speeds below $\fenergy$, but they frequently do so anyway}
Previous works \cite{vogeleer2013energy, nuessle2019benchmarking} have suggested that, for a given workload, there is an energy optimal speed.
While the exact setting is hardware dependent, this $\fenergy$ is not, generally, the slowest CPU speed.
If CPUs could not be turned off, their lowest energy usage would simply happen at their slowest speed.
However, phone CPUs can be turned off by entering idle or C-states.
Hence, running workloads at slower speeds keeps the CPU on for longer periods, wasting power.
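This tradeoff can be made concrete with a toy model (illustrative constants, not measured from any device): model active power as a static term plus a cubic dynamic term, $P(f) = P_{base} + k f^3$, so a fixed job of $W$ units takes $W/f$ seconds before the idle policy turns the CPU off. Total energy $P_{base} W/f + k W f^2$ is U-shaped in $f$, with its minimum strictly above the slowest speed:

```python
# Toy energy model for why the energy-optimal speed F_pow is not the slowest
# P-state. Constants are illustrative only. Active power: P(f) = p_base + k*f^3;
# a fixed job of `work` units runs for work/f seconds, then the CPU idles off.

def energy_for_work(f, work=1.0, p_base=0.5, k=1.0):
    """Energy to finish `work` at normalized speed f (0 < f <= 1)."""
    runtime = work / f              # slower speed -> CPU stays powered longer
    power = p_base + k * f ** 3     # static leakage + dynamic switching power
    return power * runtime          # = p_base*work/f + k*work*f^2 (U-shaped)

def energy_optimal_speed(freqs, **kw):
    """Pick the discrete P-state minimizing energy for a fixed job."""
    return min(freqs, key=lambda f: energy_for_work(f, **kw))

freqs = [0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0]  # normalized P-states
f_pow = energy_optimal_speed(freqs)                # lands mid-range, not at 0.3
```

Analytically the continuous minimum sits at $f^* = (P_{base}/2k)^{1/3}$; with these toy constants that is $\approx 0.63$, so the discrete optimum is an interior speed, mirroring the $\sim$70\% observed on the test device.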
However, this turns out not to be the case in practice.
Figure \ref{fig:u_micro} shows the cost in energy and runtime of running a fixed amount of compute per CPU, under different CPU policies: for the default CPU policy and for several fixed speed settings.
We also denote the 70\% speed on the graph as $\fenergy$ -- as we observe later, on our test device this speed furnishes an adequate approximation.
\fixme{Eh. this make sense? Or simpler to revert graph fpow to 70}
\todo{Question}
We also vary the number of loads from 1--4, with each pinned to a separate CPU within a CPU cluster, and run the loads on either big or little CPUs.
Recall that when there is no work, the idle policy turns the CPU off.
Figure \ref{fig:u_micro} illustrates this double penalty with the upward-curving right-hand tails of the curves. (We omit speeds below 30\%; their results would be even worse.)
Speeds below $\fenergy$ thus offer no benefit.
The Linux kernel maintainers themselves observe that, absent compelling corner cases such as thermal throttling, \textit{there is no reason to set the CPU to a speed in the $\fidle$--$\fenergy$ regime}\cite{energy-aware-schedutil}.
Yet, their own default \schedutil policy frequently picks speeds in this regime anyway.
This is partly because the Linux CFS scheduler tries to spread work among available CPUs.\cite{lozi2016linux, sched-domains}
This reduces the per-CPU utilization, often triggering CPU frequencies well below $\fenergy$.
Figure \ref{fig:time_per_freq_fb} illustrates this in practice.
We ran a scripted Facebook app interaction, scrolling for 30 seconds through friends and then the feed, under different CPU policies: with the default \schedutil and with differing fixed speeds.
We tracked the time spent for each CPU at a given frequency, with 0 representing idle.
The solid blue lines in the top 2 graphs are histograms of the average time spent at each speed for little and big CPUs.
The dashed red lines represent where the CPUs should be spending their time.
The bottom 2 graphs are CDFs of total time spent, relative to an ideal $\fenergy$ of $\sim$70\%.
\fixme{"approximate ideal"?}
Note that, when the CPU is idle, depicted by a speed of 0, the actual and ideal plots coincide.
The left-side grey shaded regions show that the app spends significant \textit{non-idle} time running at speeds below $\fenergy$ -- approximately 2/3 for little CPUs and 1/4 for big CPUs.
%at below 20\% of maximum speed, well under the energy minimum of 40-50\%.
All of the time spent in the left-side grey areas represents both wasted energy and slow performance.
Simply put, absent a very specific corner-case reason, the governor should never pick speeds in this area.
Rather, the governor should bound the CPU speed to above $\fenergy$, and let the idle subsystem turn off CPUs when there is no work.
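Per-frequency residency of the kind plotted above is exposed by cpufreq in \texttt{stats/time\_in\_state} (one \texttt{kHz ticks} pair per line, cumulative, in 10\,ms units). As a hedged sketch of the measurement -- not this paper's exact harness -- the fraction of non-idle time spent below a cutoff such as $\fenergy$ can be computed directly from that file's contents:

```python
# Sketch: compute the share of non-idle time spent below a cutoff frequency
# from cpufreq's stats/time_in_state format ("<khz> <ticks>" per line).
# time_in_state only counts non-idle time; idle residency lives elsewhere.

def residency_below(time_in_state_text, cutoff_khz):
    """Fraction of non-idle ticks at frequencies strictly below cutoff_khz."""
    below = total = 0
    for line in time_in_state_text.strip().splitlines():
        khz, ticks = (int(tok) for tok in line.split())
        total += ticks
        if khz < cutoff_khz:
            below += ticks
    return below / total if total else 0.0

sample = """300000 600
600000 300
900000 100"""
frac = residency_below(sample, 900000)  # share of time below 900 MHz
```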

\end{figure}
\subsection{CPU frequencies over $\fenergy$ do not provide perceptible benefits in return}
Dynamic governors often run CPUs at high frequencies.
%unnecessarily
Figure \ref{fig:time_per_freq_fb} shows that, for the Facebook interaction under the default policy, CPUs spend a significant proportion of their non-idle time at maximum speed.
However, absent a specific, identified need -- such as shortening response time to a waiting user -- running the CPU above $\fenergy$ does not offer practical return for the additional energy.
Rather, Figure \ref{fig:nonidle_fb} shows that, when given additional CPU resources, real-world apps simply consume whatever they are offered.
For the Facebook app interaction, we measured the non-idle time of the CPUs through the Linux \texttt{sysfs} interface, bucketing by little and big CPU type.
We will later show that $\fenergy$ can be reasonably approximated on our test device with a CPU speed of 70\%.
The graph shows that, as speed increases between 70\% and 100\%, the non-idle time of the CPUs did not decrease appreciably -- in contrast to microbenchmarks with deterministic workloads.
Likely, the app internally adjusts to the additional compute by simply prefetching additional data during user scrolling.
Despite consuming the additional resources, apps do not show appreciable pragmatic benefit when given additional CPU speed above $\fenergy$.
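Non-idle time of the kind measured here can also be derived from \texttt{/proc/stat} deltas, as a hedged alternative sketch to the \texttt{sysfs} counters used in the experiment (column layout per the standard \texttt{/proc/stat} format: user, nice, system, idle, iowait, irq, softirq, ...):

```python
# Sketch: estimate a CPU's non-idle share from two /proc/stat "cpuN ..."
# samples. Fields are cumulative jiffies; idle time = idle + iowait columns.

def nonidle_fraction(stat_line_before, stat_line_after):
    """Non-idle share of elapsed jiffies between two samples of one CPU."""
    def split(line):
        fields = [int(x) for x in line.split()[1:]]
        idle = fields[3] + fields[4]  # idle + iowait columns
        return sum(fields), idle
    t0, i0 = split(stat_line_before)
    t1, i1 = split(stat_line_after)
    busy = (t1 - t0) - (i1 - i0)
    return busy / (t1 - t0)

# Synthetic samples (hypothetical values, for illustration only):
before = "cpu0 100 0 50 800 50 0 0 0 0 0"
after  = "cpu0 160 0 90 850 50 0 0 0 0 0"
share = nonidle_fraction(before, after)
```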
Figure \ref{fig:screendrops_per_freq_fb} illustrates this for the same experiment.
The output of phones is largely visual display maintenance; CPU policies should avoid damaging display quality.
Hence, for each run, we additionally tracked the effect on display output as measured in the proportion of framedrops, termed Android display \textit{jank}.