% -*- root: ../main.tex -*-
%!TEX root=../main.tex
\begin{figure*}
\centering
\includegraphics[width=.90\linewidth]{figures/graph_oscill_cycles.pdf}
\bfcaption{Runtime and CPU cycle count for a fixed compute workload under different CPU policies (10 runs, 90\% confidence)}
\label{fig:cycles_time}
\end{figure*}
\begin{figure*}
\centering
\includegraphics[width=.90\linewidth]{figures/graph_energy_perf_coldstart_spot.pdf}
\bfcaption{Energy and latency of cold-starting Spotify under different policies (5 runs, 90\% confidence)}
\label{fig:coldstart_time_spot}
\end{figure*}
In this section we consider the frequency regime above $\fenergy$.
Returning to \Cref{fig:u_micro}, frequencies above $\fenergy$ trade increased energy costs for increased compute per unit time, albeit with diminishing returns.
In one previous study, a 50\% reduction in speed would save 80\% in energy~\cite{7091048}.
Operating a core at its maximum frequency (we denote this $\fperf$) is almost as expensive as starting a second core at $\fenergy$.
We conduct a series of case studies to better understand the user-perceptible benefits of operating the CPU at frequencies above $\fenergy$, and summarize the results here.
Our main finding is that a standard measure of user-perceived performance, the rate of dropped frames (`jank'), is largely unaffected by increasing the CPU frequency above $\fenergy$.
We further identify an `adaptive' app design pattern and illustrate it in the context of the Facebook app.
An adaptive app adjusts its CPU requirements to match the available computing power.
While this pattern allows the app to degrade gracefully on CPUs with more limited compute power, in practice it also skews the measure that \schedutil uses to gauge pending compute needs.
We therefore search for alternative, more reliable signals.
As we will discuss, such a signal already exists: an Android kernel API that allows userspace to signal a need for additional compute.
We note, in particular, that the Android operating system has already been instrumented to leverage this API in exactly those situations where we identified user-perceptible benefits to increased CPU frequencies.
%
We further observe that rapid frequency shifting, as \schedutil does, has a small, but non-negligible overhead.
Armed with this knowledge, we realize the \systemname governor: a simple policy that avoids frequency shifting by locking the CPU to $\fenergy$, except when idling and during brief periods where userspace requests a temporary performance boost.
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\begin{figure*}
\centering
\includegraphics[width=.90\linewidth]{figures/graph_nonidletime_fb.pdf}
\bfcaption{CPU non-idle time for Facebook under different CPU policies (10 runs, 90\% confidence)}
\label{fig:nonidle_fb}
\end{figure*}
\begin{figure}
\centering
\includegraphics[width=.90\linewidth]{figures/graph_jank_perspeed_fb.pdf}
\bfcaption{Display frame drops for Facebook under different CPU policies (10 runs, 90\% confidence)}
\label{fig:screendrops_per_freq_fb}
\end{figure}
\paragraph{Case Studies}
We studied a range of apps, including Facebook, YouTube, and Spotify.
For each app, we developed a simple, short scripted interaction to study in terms of its performance characteristics and energy usage.
We focus here primarily on the Facebook workload that we previously introduced in \Cref{sec:low-speed-in-practice}, and return to the other workloads to verify our findings in \Cref{sec:evaluation}.
Our first goal is to understand what user-perceivable value the energy overhead of operating above $\fenergy$ buys us.
\subsection{Screen Jank}
Recall that, as illustrated in \Cref{fig:time_per_freq_fb}, for a significant portion of the induced compute workload (about 15\% of the trace), \schedutil keeps the CPU running at a frequency above $\fenergy$.
A common measure of user-perceivable value is the rate of dropped animation frames, often termed Android display \textit{jank}.
In the ideal case, no frames are dropped.
Figure \ref{fig:screendrops_per_freq_fb} shows the jank rates for our case study at a variety of fixed frequencies, with \schedutil as a comparison point.
There is a step: at fixed frequencies below 60\% of maximum, jank rises to about 4\%, while runs at speeds above this threshold produce drop rates of $\sim$2\% and below, lower than that of the default dynamic policy ($\sim$3\%).
We attribute the higher drop rate of \schedutil to its ramp-up period, during which it runs the CPU below $\fenergy$.
Notably, this step occurs below $\fenergy$.
We find similar behavior in our other case study apps.
\claim{
Increasing the CPU frequency above $\fenergy$ does not significantly improve the jank rate.
}
\subsection{Application Cold-Start}
Many apps require significant compute at launch to initialize themselves.
While Facebook does not, one of the other apps in our study, Spotify, takes 2s to cold-start.
During this time, the user is waiting on the app to become responsive.
It makes sense under such circumstances to run the CPUs at a higher speed to enhance user experience.
Figure \ref{fig:coldstart_time_spot} shows the results of a study to examine the benefits and costs of doing this.
Bear in mind that the unmodified system already uses the \texttt{schedtune.boost} mechanism (discussed below) to provide a minimal $\sim$0.5s boost to CPU speed.
Ironically, the largest benefit comes in energy saved...
\todo{Add a section here discussing cold-start launches}
\subsection{Adaptive Apps}
\label{sec:adaptiveApps}
We next consider the benefits that the app itself receives from running the CPU at a higher frequency.
As we are unable to instrument the Facebook app directly, we consider a proxy measure of performance: CPU idle time.
As we increase CPU speed, we would expect an increase in idle time to indicate that Facebook's workload is completed faster.
Figure \ref{fig:nonidle_fb} measures the non-idle time --- the time that the CPU spends doing useful work --- through the Linux \texttt{sysfs} interface, bucketing by little and big CPU type.
Non-idle time is measured for each CPU frequency, with \schedutil as a baseline.
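For reference, the per-CPU idle residencies underlying this measurement are exposed through the standard \texttt{cpuidle} entries in \texttt{sysfs}; the following minimal sketch (error handling elided, idle-state count device-specific) shows how they can be read:
{\small
\begin{verbatim}
/* Sketch: total idle residency (microseconds) for one CPU,
 * summed over its cpuidle states.  Non-idle time over a
 * window is wall-clock time minus the increase in this sum.
 * The number of idle states is device-specific. */
#include <stdio.h>

long long idle_usec(int cpu)
{
    long long total = 0;
    for (int s = 0; ; s++) {
        char p[96];
        snprintf(p, sizeof(p),
                 "/sys/devices/system/cpu/cpu%d"
                 "/cpuidle/state%d/time", cpu, s);
        FILE *f = fopen(p, "r");
        if (!f)
            break;              /* no more idle states */
        long long us = 0;
        if (fscanf(f, "%lld", &us) == 1)
            total += us;
        fclose(f);
    }
    return total;
}

int main(void)
{
    printf("cpu0 idle: %lld us\n", idle_usec(0));
    return 0;
}
\end{verbatim}
}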
Assuming a fixed workload, we would expect the time the CPU spends doing work to drop in fixed ratio with the increase in CPU frequency:
At 90\% of the core's maximum frequency, we would expect the time spent doing work to be a third of the time at 30\% of maximum.
This is not the case; the ratio is a half for the little cores, and only four-fifths for the big cores.
In short, as the CPU frequency goes up and more compute capacity becomes available, the Facebook app adapts by creating additional work.
We attribute this increase in compute to the asynchronously-loading list through which our test case scrolls.
In this design pattern, the cells comprising a large list are not materialized all at once.
Rather, as a cell approaches the viewport, a background worker task is spawned to retrieve the data backing the cell and populate its component elements.
For example, a mail client might retrieve the contents of an email only as the message scrolls into view.
If a cell scrolls out of view before it is populated, the task and any pending work is aborted.
Even with the CPU running at full speed, the app continues to generate more work.
\claim{
Adaptive widgets in mobile apps present an effectively infinite source of work to mobile CPUs over finite windows of interaction.
}
By adapting itself to the available CPU power, Facebook signals an effectively infinite source of work to the governor, which responds by ramping the CPU up to full speed.
In this situation, speeds above $\fenergy$ may actually provide a perceptible benefit.
However, even at the CPU's maximum frequency, more work is created than the CPU can keep up with.
\begin{figure}
\centering
\includegraphics[width=.95\linewidth]{figures/graph_u_fb.pdf}
\bfcaption{Energy consumed for a fixed set of interactions, given compute at different speeds \fixme{fullrun set}}
\label{fig:u_micro_fb}
\end{figure}
\Cref{fig:u_micro_fb} shows power consumption for the Facebook workload, padded with idle time to a fixed 40s period.
Operating the CPU at maximum frequency imposes an energy overhead of approximately $1$mAh compared to operating at $\fenergy \approx 70\%$ of its maximum.
This represents about $\frac{1}{2700}$ of the typical Pixel 2's maximum battery capacity.
While the energy cost is significant, the potential value of more table cells being displayed as they scroll past is subjective and beyond the scope of this study.
We consider each possibility in turn.
If added performance is desirable in this use case and others like it, then the truncated \schedutil of \Cref{alg:boundedschedutil} is appropriate, and we argue that no further work is required.
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\subsection{Signaling Performance Needs}
The more interesting systems design question is how to select CPU speeds in the presence of adaptive applications, when the additional energy investment does not provide value.
Specifically, adaptive apps (while in-use, e.g., scrolling through a list) create a functionally infinite source of work.
The CPU usage profiles presented by an adaptive app and a user legitimately waiting on a CPU-bound task (e.g., cold-start) are identical, rendering them indistinguishable to \schedutil.
Fortunately, the Linux maintainers have already recognized the need for better user-space signalling of performance needs.
In 2015, Android's Linux kernel added a virtual filesystem, mounted at \texttt{/dev/stune/}, that provides virtual file hooks:
\begin{itemize}
\item[]{\texttt{schedtune.boost}}
\item[]{\texttt{schedtune.prefer\_idle}}
\item[]{\texttt{schedtune.tasks}}
\item[]{\texttt{schedtune.cgroup\_procs}}
\end{itemize}
The \texttt{schedtune.boost} pseudofile, in particular, allows user-space to mark specific tasks with a boost parameter in the range 0 to 100.
The kernel reacts to the boost parameter by scaling up the CPU usage as seen by the governor.
This virtual increase in CPU usage causes most governors to select a higher CPU frequency than they would otherwise select.
Android's user-space is already configured to make use of the \texttt{stune} API in performance-critical periods.
For example, when an app cold starts, Android briefly marks the app with a boost parameter of 100 to mitigate \schedutil's usual ramp-up period.
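To make the interface concrete, the sketch below shows how a suitably privileged userspace component might request and then release a boost for the current task; the \texttt{top-app} group name is illustrative (group layout is configured by the platform), and error handling is elided:
{\small
\begin{verbatim}
/* Sketch: request, then release, a performance boost for the
 * current task via the stune hierarchy.  The "top-app" group
 * name is illustrative; group layout is platform-configured,
 * and writing here requires the appropriate permissions. */
#include <stdio.h>
#include <unistd.h>

static void write_str(const char *path, const char *val)
{
    FILE *f = fopen(path, "w");
    if (!f)
        return;
    fputs(val, f);
    fclose(f);
}

int main(void)
{
    char pid[16];
    snprintf(pid, sizeof(pid), "%d", getpid());

    /* Join the boosted group ... */
    write_str("/dev/stune/top-app/schedtune.tasks", pid);
    /* ... and request the maximum boost (0-100 scale). */
    write_str("/dev/stune/top-app/schedtune.boost", "100");

    /* ... latency-critical work would run here ... */

    /* Drop the boost when done. */
    write_str("/dev/stune/top-app/schedtune.boost", "0");
    return 0;
}
\end{verbatim}
}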
\claim{
The boost parameter is a meaningful signal of the need for additional performance.
}
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\subsection{Micromanaged Frequency Scaling}
\schedutil makes frequent changes to CPU speed, up to dozens of times per second.
We considered the cost of micro-managing the CPU frequency at this level by running a fixed workload, tracking runtime and work performed (measured in hardware CPU cycles), under different CPU policies: the default, fixed low, medium, and high speeds, and a policy that rapidly oscillated between the low and high speeds.
We selected the fixed medium speed to be exactly the mean of the fixed low and fixed high speeds, and chose an oscillation period of 3ms to mimic the rate of speed changes we observed from \schedutil.
In the absence of any overhead, the runtime of the fixed midspeed and oscillating speed policies should be nearly identical.
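For reference, the oscillating policy can be approximated from userspace by pinning the \texttt{userspace} cpufreq governor and alternating the requested speed on a timer, as in the following sketch; the policy path and frequencies are placeholders rather than the exact values used in our experiment:
{\small
\begin{verbatim}
/* Sketch: alternate one cpufreq policy between two fixed
 * speeds every 3ms, mimicking schedutil's observed rate of
 * frequency changes.  Assumes the "userspace" governor is
 * selected; policy path and kHz values are placeholders. */
#include <stdio.h>
#include <unistd.h>

#define CPUFREQ  "/sys/devices/system/cpu/cpufreq/"
#define SETSPEED CPUFREQ "policy0/scaling_setspeed"

static void set_khz(long khz)
{
    FILE *f = fopen(SETSPEED, "w");
    if (!f)
        return;
    fprintf(f, "%ld", khz);
    fclose(f);
}

int main(void)
{
    const long low_khz = 600000, high_khz = 1800000;
    for (int i = 0; i < 20000; i++) {   /* ~60s of run time */
        set_khz(i % 2 ? high_khz : low_khz);
        usleep(3000);                   /* 3ms per step */
    }
    return 0;
}
\end{verbatim}
}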
Figure \ref{fig:cycles_time} shows the results, broken out by big and little CPU type.
Relative to the midspeed setting, the runtime of the fixed low-speed policy was longer and that of the fixed high-speed policy shorter, as expected.
The runtime of the oscillating policy was higher than that of the midspeed policy: very marginally ($\sim$0.3\%) so for little CPUs, and 1.5\% for big CPUs.
The cycle count of the oscillating policy was also $\sim$0.3\% higher for little CPUs, suggesting that in this case the performance overhead is due entirely to the additional software computation of calling into the driver code.
For big CPUs, the cycle-count overhead of the oscillating policy was only 0.08\% above the midspeed policy, an order of magnitude smaller than the runtime cost.
Additional hardware delays likely contribute to frequency switching overhead.
The switching cost in both cases, while small, is not negligible, suggesting that speed changes should be minimized where possible.
\claim{
Frequent small adjustments to CPU frequency can cost more energy than they save.
}
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\subsection{The \systemname Governor}
To summarize, increasing the CPU frequency above $\fenergy$ adds more compute cycles per unit time, but with diminishing returns.
In the ideal situation where we know exactly what compute must be completed in a given time interval, we could set the CPU speed once, to the minimum frequency required to meet our obligations.
However, this information is not available, forcing \schedutil to employ a simple PID control loop: As long as more work is offered, it keeps increasing the CPU speed.
Yet (i) adaptive apps offer an effectively infinite amount of work, (ii) micro-managing frequencies comes at a cost, and (iii) increasing the CPU frequency above $\fenergy$ does not meaningfully affect jank.
These observations, coupled with Android's existing adoption of the \texttt{schedtune.boost} interface, drive our second governor proposal: \systemname.
\systemname, summarized in \Cref{alg:fullKiss}, leverages boost requests provided by userspace through \texttt{/dev/stune}.
Recall that this API can be used to assign each task a \texttt{boost} value (between 0 and 100).
Instead of scaling usage history (as \schedutil does), \systemname treats this value as a direct, fractional request for CPU performance:
A \texttt{boost} value of 100 is interpreted as a request for the CPU's maximum frequency (denoted $\fperf$).
\systemname selects the highest frequency requested by any scheduled task, with $\fenergy$ as a lower bound.
If no tasks are pending, it idles the CPU.
\begin{algorithm}
\caption{\texttt{KISS}($\mathcal T$)}
\label{alg:fullKiss}
\begin{algorithmic}
\Require $\mathcal T$: The set of currently scheduled tasks
\Ensure $f$: The target CPU frequency
\State $\mathcal F \gets \left\{\left.\; \frac{t.\texttt{boost}}{100}\cdot \fperf \;\right|\; t \in \mathcal T \;\right\} \cup \left\{\, \fenergy \,\right\}$
\State \textbf{if} $|\mathcal T| > 0$ \textbf{then} $f = \max(\mathcal F)$ \textbf{else} $f = \fidle$ \textbf{end if}
\end{algorithmic}
\end{algorithm}
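For concreteness, the selection rule of \Cref{alg:fullKiss} reduces to a few lines of code; the sketch below is purely illustrative (simplified task type, placeholder frequencies) and is not the kernel implementation:
{\small
\begin{verbatim}
/* Illustrative sketch of the selection rule above: idle when
 * no tasks are runnable, otherwise the largest boost-derived
 * frequency, never below f_energy.  Types and frequency
 * values (kHz) are placeholders, not our implementation. */
struct task { int boost; /* 0..100, from schedtune.boost */ };

#define F_IDLE    300000
#define F_ENERGY 1200000
#define F_PERF   1800000

long kiss_target_freq(const struct task *tasks, int ntasks)
{
    if (ntasks == 0)
        return F_IDLE;          /* no pending work: idle */
    long f = F_ENERGY;          /* lower bound */
    for (int i = 0; i < ntasks; i++) {
        long req = (long)tasks[i].boost * F_PERF / 100;
        if (req > f)
            f = req;            /* highest request wins */
    }
    return f;
}
\end{verbatim}
}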
\paragraph{Security Concerns}
Allowing apps to pin the core to $\fperf$ could, in principle, extend the attack surface of the Android kernel.
However, we note that in practice, userspace is afforded no capabilities that it did not already have.
It can already spin uselessly from mistake or malice~\cite{maiti2015jouler}.
If an app schedules work for more than $\sim$200ms, \schedutil will already ramp the core up to full speed (\Cref{fig:missed_opportunities}).
Furthermore, the \texttt{schedtune.boost} API is already present in Android.
Regardless of policy, hardware enforced thermal throttling will eventually cap a runaway process~\cite{8410428}.
Conversely, we observe that passing requests for performance explicitly from userspace creates opportunities for more effective security policies.
For example, as in \schedutil, we adopt a finite timeout period.
In principle, longer-duration performance boosts could be limited, for example by requiring user consent.
Moreover, because such requests are not subject to \schedutil's usual ramp-up period, the improved performance is available immediately.