% -*- root: ../main.tex -*-
%!TEX root=../main.tex

\begin{figure*}
\centering
\includegraphics[width=.90\linewidth]{figures/graph_u_fixedlen_multicore.pdf}
\bfcaption{Runtime and energy for a fixed compute workload per CPU, varying governor policy and CPUs. Energy measurements are taken over a fixed 75s period that includes the workload. (Average of 5 runs, 90\% confidence)}
\label{fig:u_micro}
\end{figure*}

As CPU frequency increases, the energy per unit of \emph{time} used by the CPU grows.
This is the fundamental premise behind governor design.
However, as illustrated in \Cref{fig:item_energy_cost}, an idling mobile CPU consumes negligible energy.
Given this, the more useful metric is the energy per unit of \emph{work}.
As we argue in this section, the relationship between frequency and cost per CPU cycle is convex, driving our first claim:

\claim{
There is an energy-optimal frequency (denoted $\fenergy$), below which it is not useful, in general, to operate the CPU.
}

\Cref{fig:u_micro} illustrates this point on a Pixel 2, a device that implements the common big-little architecture with 4 big and 4 little CPUs~\cite{big-little}.
For this experiment, we prepared a deterministic, CPU-bound arithmetic workload.
We ran the workload with a series of CPU policies that fix the CPU to a particular speed, as well as the default \schedutil governor for comparison.
We also varied the number of loads from 1 to 4, with each pinned to a separate CPU within a CPU cluster, and ran the loads on either the big or the little CPUs.
Regardless of governor policy, idle CPU cores are put into a low-power C-state.

On the x-axis, we measure the total time to complete the fixed amount of work.
On the y-axis, we measure energy over a fixed period of 75s;
the duration of the workload is padded to this length with idle time as necessary.
We use a fixed period of time, as the length of real-world phone interactions tends to be governed by user activity rather than by workload completion.
Points to the lower-left are best.

Unsurprisingly, higher CPU frequencies lead to shorter runtimes.
However, the relationship between frequency and energy consumption is convex, especially for the Pixel 2's little cores.
Below a certain frequency (we denote this frequency $\fenergy$), the energy required to complete the fixed workload \emph{increases}.
Compared to frequencies below this point, it is more efficient to run the core at a faster frequency for a shorter time, returning the core to idle sooner.

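This race-to-idle effect can be sketched numerically. The following toy model (purely hypothetical constants, not measurements) computes total energy over a fixed 75s window: the workload runs at frequency $f$, and the remainder of the window is spent in a near-zero-power idle state.

```python
# Toy model of the fixed-window experiment (hypothetical constants,
# not Pixel 2 measurements): complete WORK gigacycles at frequency f,
# then idle for the rest of a 75-second measurement window.
P_STATIC = 0.2   # W: hypothetical static draw while the core is active
A_DYN = 0.5      # W/GHz^3: hypothetical dynamic-power coefficient
P_IDLE = 0.005   # W: near-zero draw in a low-power C-state
WINDOW = 75.0    # s: fixed measurement period
WORK = 20.0      # gigacycles; fits in the window at every frequency used

def window_energy(f):
    """Total energy (J) over the window when running WORK at f (GHz)."""
    run_time = WORK / f                      # seconds spent computing
    active = (P_STATIC + A_DYN * f ** 3) * run_time
    idle = P_IDLE * (WINDOW - run_time)      # remainder spent idling
    return active + idle

# Running below the energy-optimal point costs MORE total energy than
# racing to idle at a higher frequency, despite the longer idle period.
slow, optimal = window_energy(0.3), window_energy(0.6)
assert slow > optimal
```

Because idle power is negligible, padding the window with idle time is nearly free, so the window-energy curve inherits the convex shape of the cost-per-cycle curve.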
\subsection{$\fenergy$ in General}
Previous works \cite{vogeleer2013energy, Liang2011AnEC, nuessle2019benchmarking} have suggested that, for a given workload, there is an energy-optimal frequency.
The Linux kernel maintainers observe that, absent compelling corner cases such as thermal throttling, \textit{there is no reason to set the CPU to a non-idle speed below $\fenergy$}~\cite{energy-aware-schedutil}.
While the exact optimal frequency depends on the specific CPU and core type, this $\fenergy$ is not, generally, the slowest frequency available.

We observe that frequencies slower than $\fenergy$ are useful for situations where it is not possible to fully saturate the CPU, but \emph{idling is not an option}:
in memory-bound workloads (i.e., with frequent CPU stalls resulting from cache misses), workloads with high spin-lock contention, and similar busy-waiting scenarios, it can be beneficial to reduce the CPU frequency to minimize the number of unused CPU cycles\footnote{
We note that we did not encounter any significant busy-waiting across all of the apps that we tested.
}.

By comparison, tasks blocked on IO or user input are removed from the runqueue entirely, allowing the CPU to enter an idle state.
%\todo{add an experiment that shows that tasks blocked on IO or user input or something similar have similar energy profiles.?}
We note that CPU utilization is a measure of the time spent idling or blocked on IO, not of whether the CPU's non-idle cycles perform useful work, supporting our second claim:

\claim{
CPU utilization is not a meaningful signal for selecting speeds below $\fenergy$.
}

\subsection{$\fenergy$ in Practice}
\label{sec:low-speed-in-practice}

Speeds below $\fenergy$ thus offer no justifiable benefit.
Yet, CPUs spend much of their time in this region when running workloads.
This is partly because the Linux CFS scheduler tries to spread work among available CPUs~\cite{lozi2016linux, sched-domains}.
This spreading reduces the per-CPU utilization, which in turn triggers CPU frequencies well below $\fenergy$.

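This effect follows directly from \schedutil's frequency formula, which the kernel's cpufreq documentation gives as $\texttt{next\_freq} = 1.25 \cdot \texttt{max\_freq} \cdot \texttt{util} / \texttt{max}$. The sketch below uses hypothetical values for the cluster's maximum frequency and for $\fenergy$.

```python
# Sketch of schedutil's target-frequency formula, per the Linux cpufreq
# documentation: next_freq = 1.25 * max_freq * util / max.
# F_MAX and F_ENERGY are hypothetical placeholder values, not Pixel 2 data.
F_MAX = 1.9e9     # Hz: hypothetical cluster maximum frequency
F_ENERGY = 1.3e9  # Hz: hypothetical energy-optimal frequency
UTIL_MAX = 1024   # the kernel's fixed-point utilization scale

def schedutil_next_freq(util):
    """Frequency (Hz) schedutil requests for a given utilization."""
    return min(F_MAX, 1.25 * F_MAX * util / UTIL_MAX)

# A single saturating task on one CPU: schedutil asks for full speed.
one_cpu = schedutil_next_freq(UTIL_MAX)

# The same load spread evenly over 4 CPUs: each CPU reports ~25%
# utilization, and schedutil drives every CPU well below F_ENERGY.
spread = schedutil_next_freq(UTIL_MAX // 4)
assert one_cpu == F_MAX and spread < F_ENERGY
```

Spreading one load over $n$ CPUs divides each CPU's utilization by roughly $n$, so the linear formula lands every CPU in the sub-$\fenergy$ region even though the aggregate work is unchanged.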
\Cref{fig:time_per_freq_fb} illustrates this under-performance in practice.
We ran a scripted Facebook app interaction, scrolling for 25s through the friends list and then the feed, under different CPU policies: the default \schedutil and a range of fixed speeds.
We tracked the time spent by each CPU at a given frequency, with 0 representing idle.
The solid blue lines in the top 2 graphs are histograms of the average time spent at each speed for little and big CPUs.
The dashed red lines provide a baseline of a governor that selects a fixed 70\% CPU frequency ($\approx \fenergy$) for the CPU when work is pending (and, as usual, idles the core when no work is available).
Note that, when the CPU is idle, depicted by a speed of 0, the actual and baseline plots coincide.
The left sides of the 2 CDF plots (bottom 2 graphs of \Cref{fig:time_per_freq_fb}) show that the app spends significant \textit{non-idle} time running at speeds below $\fenergy$ -- approximately 2/3 for little CPUs and 1/4 for big CPUs.
All of the time spent in the areas marked Underperformance represents energy wasted on slower performance.

\begin{figure*}
\centering
\includegraphics[width=.90\linewidth]{figures/graph_time_per_freq_fb.pdf}
\bfcaption{Average time spent per CPU at a given frequency under the default policy for a 25s scripted Facebook interaction (Average of 10 runs, 90\% confidence)}
\label{fig:time_per_freq_fb}
\end{figure*}

\subsection{Truncated \schedutil}

To summarize, frequencies strictly below $\fenergy$ (excepting $\fidle$) consume more power per CPU cycle than $\fenergy$, and result in higher latencies.
In the absence of CPU stalls, spin-locks, and thermal throttling, frequencies in this range are strictly worse.
Based on this observation and two further insights, we now propose our first adjustment to the \schedutil governor.

First, recall that the only signal used by \schedutil is recent past CPU usage.
This signal conveys no information about CPU stalls, and so is not useful for deciding whether the CPU should be set to a frequency in this regime.

Second, we observe that workloads that trigger the relevant CPU behaviors are typically data-intensive and memory-bound, or parallel workloads with high contention.
Such workloads are often offloaded to more powerful cloud compute infrastructure; when they do run at the edge (e.g., for federated learning), it is typically while the device has a stable power source.

\begin{algorithm}
\caption{\texttt{TruncatedSchedutil}()}
\label{alg:boundedschedutil}
\begin{algorithmic}
\Ensure $f$: The target CPU frequency
\State $f \gets \schedutil{}\texttt{()}$
\State \textbf{if} $\fidle < f < \fenergy$ \textbf{then} $f \gets \fenergy$ \textbf{end if}
\end{algorithmic}
\end{algorithm}

To a first approximation, there is no value in running the CPU at frequencies between $\fidle$ and $\fenergy$.
\Cref{alg:boundedschedutil} summarizes our first proposed adjustment to \schedutil: truncating the function's domain.
Speeds below $\fenergy$ are increased to $\fenergy$.
This eliminates the initial ramp-up period illustrated in \Cref{fig:missed_opportunities}, providing better performance, and recovering the `wasted' energy in this regime.