% -*- root: ../main.tex -*-
\begin{figure}
\centering
\includegraphics[width=.90\linewidth]{figures/optimize_goal_cpu_speed.pdf}
\bfcaption{How CPU speed should flow from the current CPU goal}
\label{fig:optimize_goal_cpu_speed}
\end{figure}
To determine at what speed a governor \textit{should} set the CPU, we consider what benefits exist, if any, to picking a particular speed.
We illustrate our discussion of 5 CPU speed regimes with Figure \ref{fig:optimize_goal_cpu_speed}: 3 specific CPU speed settings and 2 speed regions.
\subsection{Speed Baselines}
We define 3 CPU frequencies, $\fidle$, $\fenergy$, and $\fperf$, as follows, noting that the precise values of these frequencies will vary with hardware.
First, there exists an idle-optimal setting for when there is no work, which we denote as $\fidle$.
While our plots often depict this as 0, it is strictly neither a zero speed nor the lowest available speed setting, but the processor entering a C-state.
Second, there is also an energy-optimal speed: Figure \ref{fig:u_micro} shows that, for each of the plotted curves that illustrate the energy and runtime for different workload settings, there exists an energy minimum, depicted by the vertical lowpoint on the curve.
We denote this speed $\fenergy$ in Figure \ref{fig:optimize_goal_cpu_speed}.
For workloads involving more than one CPU, which is typical of phone apps, this speed lies in the 60\%--80\% range for little cores and the 40\%--60\% range for big cores on our representative device.
%While the exact speed depends on the device and CPU type, it will be at some midpoint between the lowest and highest speed settings.
Third, the performance-optimal speed of any CPU is simply the highest available setting, which we denote $\fperf$ in Figure \ref{fig:optimize_goal_cpu_speed}.
Fourth, between $\fidle$ and $\fenergy$ lies a CPU speed regime that offers nearly no benefit in practice (despite the system default policy often picking speeds in this region).
Figure \ref{fig:u_micro} depicts this as the right-side part of the plots that curve up and to the right, where a slower speed setting than $\fenergy$ both increases runtime and consumes more energy.
Fifth, the speed regime between $\fenergy$ and $\fperf$ offers a trade-off between power and runtime, depicted in Figure \ref{fig:u_micro} as the left-side part of the plots that curve sharply up and to the left, where a higher speed setting produces a lower runtime but costs more energy.
The system would justifiably want to pick a speed in this region if the minimum speed necessary to meet system constraints and pending deadlines, such as latency deadlines or screendraws, is \textit{both} faster than $\fenergy$ \textit{and also} slower than $\fperf$.
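This selection rule can be sketched as a clamp: estimate the minimum speed that meets the nearest deadline, then bound it below by $\fenergy$ and above by $\fperf$. The following is a minimal illustrative sketch, not the paper's implementation; the cycle estimate and frequency values are assumptions.

```python
# Hypothetical sketch: pick a CPU speed given a pending deadline.
# The cycle count, deadline, and the frequencies f_energy / f_perf
# are illustrative assumptions, not measured values.

def pick_speed(pending_cycles: float, deadline_s: float,
               f_energy: float, f_perf: float) -> float:
    """Return the lowest speed (Hz) that meets the deadline,
    never dropping below the energy-optimal speed f_energy and
    never exceeding the hardware maximum f_perf."""
    required = pending_cycles / deadline_s  # cycles per second needed
    return max(f_energy, min(required, f_perf))

# Example: 3e9 cycles due in 2 s needs 1.5 GHz, which lies between
# an assumed f_energy (1.2 GHz) and f_perf (2.8 GHz).
print(pick_speed(3e9, 2.0, 1.2e9, 2.8e9))  # 1500000000.0
```

When the required speed falls below $\fenergy$, the clamp keeps the CPU at $\fenergy$ rather than entering the wasteful regime below it.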
%%%%%%%%%\fixme{add intemittent loads to u-curve microb?}
\subsection{The CPU speed regime between $\fidle$ and $\fenergy$}
\label{subsec:regimes_idle_energy}
Running the CPU at a speed below $\fenergy$ should only be done in identifiable corner cases.
One reason recognized by the kernel maintainers is implementing thermal throttling.
%\cite{energy-aware-schedutil}
However, thermal throttling is typically already enforced in hardware, and the current default policy already ignores it.
Phone workloads also typically do not saturate CPUs for lengthy periods.
A second reason we have identified for running the CPU in this regime is to minimize CPU stalls.
Fetching data from main memory after a cache miss, or discarding the results of misspeculated branches, produces periods where the CPU does no useful work but still consumes energy.
In such cases, a speed slower than $\fenergy$ may be justifiable to minimize wasted energy.
However, producing stalls in meaningful quantities requires conditions that constantly cause cache misses or frequent branch mispredictions.
An example of the former case would be a \memory workload such as scanning a very large hash table or sorting a large sparse array.
%\fixme{show stall microbench}
However, we have not identified such an actual use case on phones to date.
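A stall-heavy workload of the kind described above can be sketched as a random scan over a table far larger than any CPU cache. This is an illustrative sketch only; the table size and access count are shrunk to run quickly, whereas demonstrating real stalls would require a multi-GiB working set.

```python
# Sketch of a cache-hostile, memory-bound workload: random accesses
# into a table that, at realistic sizes, would defeat every cache
# level and stall the CPU on main memory. Sizes here are assumptions
# chosen so the sketch runs in milliseconds.
import random

def random_scan(table_size: int, accesses: int, seed: int = 0) -> int:
    rng = random.Random(seed)
    table = list(range(table_size))  # stand-in for a large hash table
    total = 0
    for _ in range(accesses):
        # at multi-GiB scale, nearly every such access misses the cache
        total += table[rng.randrange(table_size)]
    return total

checksum = random_scan(table_size=1 << 16, accesses=10_000)
```

On a phone-class CPU with a genuinely large table, the core would spend most cycles waiting on DRAM, which is precisely when a sub-$\fenergy$ speed could save energy without costing runtime.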
\XXXnote{NOTE: Moved 0-70 discussion up...}
%We later observe that this models the behavior of common real-world apps: they run many background tasks, keep all CPUs partially busy, but do not saturate any.
\subsection{The CPU speed regime between $\fenergy$ and $\fperf$}
\label{subsec:regimes_energy_perf}
The system should pick a speed above $\fenergy$ only when additional speed is necessary to achieve system goals such as meeting UI screendraws or other background latency deadlines.
For example, when the user is waiting on the phone, the CPU should immediately be set to $\fperf$.
%That is, the ideal CPU policy for phones, rather than dwelling on past utilization, should be to minimize energy subject to performance constraints.
Absent compelling reason, the governor should be wary of increasing frequency above $\fenergy$.
Figure \ref{fig:u_micro} shows that the energy penalty ramps sharply for the highest speed -- particularly so when multiple CPUs are being used (the upward curving left side of the yellow, dash-dotted lines in the graph), as is typically the case with real-world apps.
Yet, under the default policy, the speed regime above $\fenergy$ is precisely where apps spend much of their time.
Figure \ref{fig:time_per_freq_fb} shows the CPUs spend significant time well inside the $\fenergy$--$\fperf$ regime.
The right-side grey shaded regions of the bottom CDF graphs illustrate the time the CPUs spend here, with significant amounts at their highest and most wasteful speeds: approximately half of their non-idle time, in the case of the big core CPUs.
That a task is compute-bound with high CPU utilization is \textit{not} by itself justification for entering the $\fenergy$--$\fperf$ regime.
The \schedutil governor in such situations will blindly ramp up speed to maximum -- indeed, that is what happens in Figure \ref{fig:u_micro}.
While this behavior is beneficial for interactive, compute-bound tasks when the user is waiting and runtime is the priority, it is less desirable for background tasks.
%The governor cannot distinguish when the additional energy may be justified and will always blindly adjust to a high speed (yet ironically taking too long to adjust when warranted).
Background services downloading updates well before the app or user needs them fall into the latter category.
The Linux community has recognized this problem: one of the primary reasons they added the \texttt{schedtune} API to the \schedutil governor was to permit sidestepping an energy-wasteful speed picked by the governor.
To our knowledge, the Android platform has not taken advantage of this ability.
Establishing the exact energy-performance tradeoff requires precise usecase-dependent measurement.
\fixme{wordsmithing?}
In practice, we make a key observation that the speed necessary to maintain user experience hovers around $\fenergy$ and rarely approaches $\fperf$.
Figure \ref{fig:screendrops_per_freq_fb} shows that CPU speeds above $\fenergy$ do not offer perceptible benefits, as measured by the primary user experience metric on phones, framedrop rate.
Indeed, using the simple fixed speed of $\fenergy$ offers better measured results than the default.
Absent infrequent, identifiable periods when conditions warrant additional performance -- such as when the user is waiting on a response from the phone --
\textit{running the CPUs at a preset fixed speed of $\fenergy$ is sufficient to maintain user experience.}
%Real world apps, while running, commonly spend the bulk of their time blocking on user input while also running pre-compute tasks in the background.
%As long as these tasks finish their work before without impacting user experience, there is no reason to set a CPU speed any higher.
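On stock Linux, a fixed-speed policy of this kind can be expressed through the standard cpufreq sysfs interface by selecting the \texttt{userspace} governor and writing a frequency to \texttt{scaling\_setspeed}. The sketch below only prints the writes it would make; the frequency value is an illustrative stand-in for a measured $\fenergy$, and real devices require root and may not expose the \texttt{userspace} governor at all.

```python
# Sketch: pinning CPUs to a fixed speed through the Linux cpufreq
# sysfs interface. The 1.2 GHz value is an illustrative stand-in for
# a measured f_energy, not a value from the paper.
CPUFREQ = "/sys/devices/system/cpu/cpu{n}/cpufreq/{knob}"

def fixed_speed_writes(cpus, f_energy_khz):
    """Return the (path, value) pairs a fixed-speed policy would write."""
    writes = []
    for n in cpus:
        # select the userspace governor, then set the fixed frequency
        writes.append((CPUFREQ.format(n=n, knob="scaling_governor"),
                       "userspace"))
        writes.append((CPUFREQ.format(n=n, knob="scaling_setspeed"),
                       str(f_energy_khz)))
    return writes

for path, value in fixed_speed_writes(cpus=[0, 1], f_energy_khz=1200000):
    print(f"echo {value} > {path}")  # dry run; a governor would open and write
```

The dry-run form makes the intended writes auditable before touching a live device.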
\subsection{What parameters governors should be considering}
Dynamic reactive governors ignore the most important criterion: \textit{the current primary goal of the CPU}.
Specifically, the system, and CPU, should already know what it is currently primarily trying to achieve: optimizing to save energy, to improve performance, or to prevent CPU stalls.
This information, in turn, should drive the selection of CPU speed from the regimes in Figure \ref{fig:optimize_goal_cpu_speed}.
We will show that this selection system offers better energy efficiency in the common case, and better performance where needed, than the system default.
The system can derive its goal information from 2 particular sources: userspace and system interactivity.
Previous studies have shown the utility of applications using knowledge of their own workloads to set CPU speeds manually~\cite{korkmaz2018workload}.
The Android system already partly leverages platform knowledge of when to optimize app starts.
This can and should be expanded for more general usage.
Secondly, governors should consider the state of interactivity.
The default governor, when presented with a compute-intensive task, will quickly ramp speed to maximum.
Unless the user is actively waiting, this wastes energy.
Conversely, when the default governor becomes blocked on disk or net, it will lower CPU speed.
If the user is waiting, such as during an app coldstart, this hurts performance.
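The goal-driven selection described above reduces to a small mapping from the system's current primary goal to a speed regime. The following sketch is our own illustration, not the paper's governor; the goal names and frequency values are assumptions.

```python
# Sketch of goal-driven speed selection: map the system's current
# primary goal to a speed regime. Enum names and frequencies are
# illustrative assumptions.
from enum import Enum

class Goal(Enum):
    IDLE = "no pending work"
    ENERGY = "background precompute; user not waiting"
    PERF = "user actively waiting"

def speed_for_goal(goal: Goal, f_idle: int, f_energy: int, f_perf: int) -> int:
    if goal is Goal.PERF:     # e.g. app coldstart, app install
        return f_perf
    if goal is Goal.ENERGY:   # the common case
        return f_energy
    return f_idle             # no work: enter a C-state

# Common case: background tasks running, user not waiting.
print(speed_for_goal(Goal.ENERGY, 0, 1_200_000, 2_800_000))  # 1200000
```

Note that, unlike a reactive governor, the load level never appears as an input: only the goal does.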
\fixme{dup earlier? examples}
\tinysection{Optimizing for energy}
This is the common case: Most of the time, the main app thread is blocking on user input, whether interactively with the screen on or while dozing with the screen off.
Here, the goal is saving energy, and the governor should run the CPU at the $\fenergy$ regime.
While there are typically background threads running, they are precomputing work.% for some future use, particularly pending screendraws.
%In the this case, completing them as quickly as possible is not the goal.
%Rather, they only need to be ready before periodic needs, such as user response or screendraw deadlines.
The key observation is that the presence of background tasks does \textit{not change the goal of optimizing for energy}.
As Figure \ref{fig:time_per_freq_fb} shows, the CPUs spend the bulk of their time in idle -- that is, there are plenty of potential compute resources available.
%Thus, there is typically no need to run CPUs anywhere close to full speed.
We will show that running the phone CPUs at the $\fenergy$ regime speed will save energy compared to the default policy while still meeting screendraw deadlines.
That is, there is no reason to set the CPU speed in the regime between $\fenergy$ and $\fperf$.
\fixme{also show this works for download / audiostream}
\tinysection{Optimizing for performance}
When the user is waiting, the goal should be performance and the governor should set the CPU to the $\fperf$ regime.
App installs, app coldstarts -- after an installed app gets killed due to memory pressure -- and opening new browser tabs all fit this case.
There is no reason to run the CPU at any less than 100\%, as the default policy often does.
Notably, the nature of the CPU load by itself is insufficient to determine when to optimize for performance.
A long-running compute-heavy background task, that would trigger an energy-wasteful speed ramp-up under the default policy, should not justify changing optimization goals.
Rather, the governor should also consider whether the user is actually waiting.
Happily, the bulk of these cases -- when the Android system is interactive but the foreground app is not yet ready to receive input -- are readily identifiable.
%Userspace, the platform or the app, knows when it needs to do a lot of work before it can present a foreground app ready to receive input.
We design our system to use this information and show that it offers better performance than the default case.
\fixme{prevent stalls...}