% -*- root: ../main.tex -*-
%!TEX root=../main.tex
\begin{figure*}
\centering
\includegraphics[width=.90\linewidth]{figures/graph_u_fixedlen_multicore.pdf}
\bfcaption{Runtime and energy for a fixed compute workload per CPU, varying governor policy and the number and type of CPUs used. Energy measurements are taken over a fixed 75s period that includes the workload. (Average of 5 runs, 90\% confidence)}
\label{fig:u_micro}
\end{figure*}
As CPU frequency increases, the energy per unit of \emph{time} used by the CPU grows.
This is the fundamental premise behind governor design.
However, as illustrated in \Cref{fig:item_energy_cost}, an idling mobile CPU consumes negligible energy.
Given this, the more useful metric is the energy per unit of \emph{work}.
As we argue in this section, the relationship between frequency and cost per CPU cycle is convex, driving our first claim:
\claim{
There is an energy-optimal frequency (denoted $\fenergy$), below which it is not useful, in general, to operate the CPU.
}
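The intuition follows from a standard first-order CMOS power model (a sketch: the switched-capacitance term $C$ and the voltage-frequency curve $V(f)$ are device-specific, and we do not fit them here):
\[
E_{\mathit{cycle}}(f) \;=\; \frac{P(f)}{f} \;\approx\; C\,V(f)^2 + \frac{P_{\mathit{static}}}{f}.
\]
The dynamic term $C\,V(f)^2$ grows with $f$, since higher frequencies demand higher supply voltages, while the static (leakage) term is amortized over more cycles and shrinks with $f$; their sum is convex, with its minimum at $\fenergy$.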
\Cref{fig:u_micro} illustrates this point on a Pixel 2, a device that implements the common big.LITTLE architecture with 4 big and 4 little CPUs~\cite{big-little}.
For this experiment, we prepared a deterministic, CPU-bound arithmetic workload.
We ran the workload under a series of CPU policies that fix the CPU to a particular speed, as well as under the default \schedutil governor for comparison.
We also vary the number of load instances from 1 to 4, each pinned to a separate CPU within a CPU cluster, and run the loads on either the big or the little CPUs.
Regardless of governor policy, idle CPU cores are put into a low-power C-state.
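For concreteness, the sketch below shows how one such trial can be driven from user space on a rooted device, using the stock Linux cpufreq sysfs interface; the frequency value and iteration count are illustrative placeholders rather than our exact harness.
\begin{verbatim}
/* Illustrative sketch of one trial: pin a CPU-bound loop to one
 * core and fix that core's frequency via sysfs (requires root). */
#define _GNU_SOURCE
#include <sched.h>
#include <stdint.h>
#include <stdio.h>

/* Fix cpu N to freq_khz using the "userspace" cpufreq governor. */
static void set_fixed_freq(int cpu, long freq_khz) {
    char path[128];
    snprintf(path, sizeof path,
        "/sys/devices/system/cpu/cpu%d/cpufreq/scaling_governor", cpu);
    FILE *f = fopen(path, "w");
    if (f) { fputs("userspace\n", f); fclose(f); }
    snprintf(path, sizeof path,
        "/sys/devices/system/cpu/cpu%d/cpufreq/scaling_setspeed", cpu);
    f = fopen(path, "w");
    if (f) { fprintf(f, "%ld\n", freq_khz); fclose(f); }
}

int main(void) {
    cpu_set_t set;                 /* pin ourselves to CPU 0 */
    CPU_ZERO(&set);
    CPU_SET(0, &set);
    sched_setaffinity(0, sizeof set, &set);

    set_fixed_freq(0, 1094400);    /* placeholder speed, in kHz */

    volatile uint64_t acc = 1;     /* deterministic, CPU-bound work */
    for (uint64_t i = 0; i < 1000000000ULL; i++)
        acc = acc * 6364136223846793005ULL + 1442695040888963407ULL;
    return (int)(acc & 0xff);
}
\end{verbatim}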
On the x-axis, we measure the total time to complete the fixed amount of work.
On the y-axis, we measure energy over a fixed 75s period; the duration of the workload is padded to this length with idle time as necessary.
We use a fixed period of time because the length of real-world phone interactions tends to be governed by user activity rather than by workload completion.
Points to the lower-left are best.
Unsurprisingly, higher CPU frequencies lead to shorter runtimes.
However, the relationship between frequency and energy consumption is convex, especially for the Pixel 2's little cores.
Below a certain frequency, which we denote $\fenergy$, the energy required to complete the fixed workload \emph{increases}.
Compared to frequencies below this point, it is more efficient to run the core at a faster frequency for a shorter time, returning the core to idle sooner.
\subsection{$\fenergy$ in General}
Previous works \cite{vogeleer2013energy, Liang2011AnEC, nuessle2019benchmarking} have suggested that, for a given workload, there is an energy-optimal frequency.
The Linux kernel maintainers observe that, absent compelling corner cases such as thermal throttling, \textit{there is no reason to set the CPU to a non-idle speed below $\fenergy$}~\cite{energy-aware-schedutil}.
While the exact optimal frequency depends on the specific CPU and core type, this $\fenergy$ is not, generally, the slowest frequency available.
We observe that frequencies slower than $\fenergy$ are useful for situations where it is not possible to fully saturate the CPU, but \emph{idling is not an option}:
in memory-bound workloads (i.e., with frequent CPU stalls resulting from cache misses), workloads with high spin-lock contention, and similar busy-waiting scenarios, it can be beneficial to reduce the CPU frequency to minimize the number of unused CPU cycles\footnote{
We note that we did not encounter any significant busy-waiting across all of the apps that we tested.
}.
By comparison, tasks blocked on IO or user input are removed from the runqueue entirely, allowing the CPU to enter an idle state.
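To make the distinction concrete, the sketch below contrasts the two waiting styles (illustrative; not code from any app we measured): only the blocking form removes the task from the runqueue and lets the core idle.
\begin{verbatim}
#include <poll.h>
#include <stdatomic.h>

/* Busy-waiting: the task stays on the runqueue, so the core keeps
 * executing (wasted) cycles; running slower reduces the waste. */
void spin_wait(atomic_int *flag) {
    while (!atomic_load(flag))
        ;  /* burns cycles; utilization reads as ~100% */
}

/* Blocking: the task is removed from the runqueue until fd becomes
 * readable, so the core can enter a low-power C-state instead. */
void block_wait(int fd) {
    struct pollfd p = { .fd = fd, .events = POLLIN };
    poll(&p, 1, -1);  /* sleeps in the kernel; the CPU may idle */
}
\end{verbatim}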
We note that CPU utilization captures only the time spent idling or blocked on IO, and not the time the CPU spends running without doing useful work, supporting our second claim:
\claim{
CPU utilization is not a meaningful signal for selecting speeds below $\fenergy$.
}
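For reference, the Linux kernel documentation describes \schedutil's mapping from the utilization signal to a requested frequency as, approximately (modulo rate limiting and clamping):
\[
f_{\mathit{next}} = 1.25 \cdot f_{\mathit{max}} \cdot \frac{\mathit{util}}{\mathit{max}},
\]
where $\mathit{util}/\mathit{max}$ is the kernel's estimate of recent CPU utilization. Nothing in this mapping distinguishes useful cycles from stalled or busy-waiting ones, and nothing prevents it from requesting frequencies below $\fenergy$.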
\subsection{$\fenergy$ in Practice}
\label{sec:low-speed-in-practice}
Absent these corner cases, speeds below $\fenergy$ thus offer no justifiable benefit.
Yet, CPUs spend much of their time in this region when running real-world workloads.
This is partly because the Linux CFS scheduler tries to spread work across the available CPUs~\cite{lozi2016linux, sched-domains}.
Spreading reduces per-CPU utilization, which in turn triggers CPU frequencies well below $\fenergy$.
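To illustrate with the mapping above: a workload that would keep one CPU fully busy, spread evenly across four CPUs, leaves each core at $\mathit{util}/\mathit{max} = 0.25$, so \schedutil requests roughly $1.25 \times 0.25 \times f_{\mathit{max}} \approx 0.31\,f_{\mathit{max}}$ per core -- well below the $\approx 0.7\,f_{\mathit{max}}$ that approximates $\fenergy$ on our test device.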
\Cref{fig:time_per_freq_fb} illustrates this under-performance in practice.
We ran a scripted Facebook app interaction, scrolling for 25s through the friends list and then the news feed, under different CPU policies: the default \schedutil and a range of fixed speeds.
We tracked the time spent for each CPU at a given frequency, with 0 representing idle.
The solid blue lines in the top 2 graphs are histograms of the average time spent at each speed for little and big CPUs.
The dashed red lines provide a baseline: a governor that selects a fixed 70\% CPU frequency ($\approx \fenergy$) when work is pending and, as usual, idles the core when no work is available.
Note that, when the CPU is idle, depicted by a speed of 0, the actual and baseline plots coincide.
The left sides of the 2 CDF plots (bottom 2 graphs of \Cref{fig:time_per_freq_fb}) show that the app spends significant \textit{non-idle} time running at speeds below $\fenergy$ -- approximately 2/3 of that time for little CPUs and 1/4 for big CPUs.
All of the time spent in the areas marked Underperformance represents energy wasted in exchange for slower performance.
\begin{figure*}
\centering
\includegraphics[width=.90\linewidth]{figures/graph_time_per_freq_fb.pdf}
\bfcaption{Average time spent per CPU at a given frequency under the default policy for a 25s scripted Facebook interaction (Average of 10 runs, 90\% confidence)}
\label{fig:time_per_freq_fb}
\end{figure*}
\subsection{Truncated \schedutil}
To summarize, frequencies strictly below $\fenergy$ (excepting $\fidle$) consume more energy per CPU cycle than $\fenergy$, and result in higher latencies.
In the absence of CPU stalls, spin-locks, and thermal throttling, frequencies in this range are strictly worse.
Based on this observation and two further insights, we now propose our first adjustment to the \schedutil governor.
First, recall that the only signal used by \schedutil is recent past CPU usage.
This signal conveys no information about CPU stalls, and so is not useful for deciding whether the CPU should be set to a frequency in this regime.
Second, we observe that workloads that trigger the relevant CPU behaviors are typically data-intensive and memory bound, or parallel workloads with high contention.
Such workloads are often offloaded to more powerful cloud compute infrastructure; when they do run at the edge (e.g., for federated learning), it is typically while the device has a stable power source.
\begin{algorithm}
\caption{\texttt{TruncatedSchedutil}()}
\label{alg:boundedschedutil}
\begin{algorithmic}
\Ensure $f$: The target CPU frequency
\State $f \gets \schedutil{}\texttt{()}$
\State \textbf{if} $\fidle < f < \fenergy$ \textbf{then} $f \gets \fenergy$ \textbf{end if}
\end{algorithmic}
\end{algorithm}
To a first approximation, there is no value in running the CPU at frequencies between $\fidle$ and $\fenergy$.
\Cref{alg:boundedschedutil} summarizes our first proposed adjustment to \schedutil: truncating the function's domain.
Speeds between $\fidle$ and $\fenergy$ are increased to $\fenergy$.
This eliminates the initial ramp-up period illustrated in \Cref{fig:missed_opportunities}, providing better performance, and recovering the `wasted' energy in this regime.
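In code, the truncation of \Cref{alg:boundedschedutil} amounts to a clamp on the frequency that \schedutil requests. The sketch below is illustrative rather than a kernel patch; \texttt{f\_idle} and \texttt{f\_energy} are assumed to come from per-cluster calibration.
\begin{verbatim}
/* Illustrative sketch of TruncatedSchedutil; not a kernel patch.
 * f_idle / f_energy are assumed per-cluster calibration inputs. */
unsigned int truncated_schedutil(unsigned int f_schedutil,
                                 unsigned int f_idle,
                                 unsigned int f_energy) {
    /* Frequencies strictly between f_idle and f_energy cost more
     * energy per cycle AND finish the work later: round them up. */
    if (f_schedutil > f_idle && f_schedutil < f_energy)
        return f_energy;
    return f_schedutil;
}
\end{verbatim}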