carlnues@buffalo.edu 2023-08-25 15:28:24 -04:00
commit 91132af4f3
2 changed files with 14 additions and 16 deletions


@@ -7,9 +7,8 @@ Historically, systems have addressed the competing goals of energy and latency o
On modern systems, CPUs typically consist of multiple cores, often of different types, that run at different speeds (known as P-states) or can be placed into idle states (known as C-states).
A policy, or `governor', sets the CPU's frequency (P-state) when there is pending computation, optimizing performance at the expense of energy, or vice versa.
The governor runs in conjunction with other policies, in particular (i) the scheduler, which determines what tasks run on which CPU cores, and (ii) the idle policy, which places CPUs with no pending work into an (idle) C-state.
Hardware design on phones can constrain governor policy calculations.
For example, CPU speeds often cannot be set on individual cores but only on groups of CPUs -- a constraint stemming from the assymetric big-little CPU architecture, with 2 clusters of higher- and lower-performance CPU cores~\cite{big-little}.
For example, CPU speeds often cannot be set on individual cores but only on groups of CPUs -- a constraint stemming from the asymmetric big-little CPU architecture, with 2 clusters of higher- and lower-performance CPU cores~\cite{big-little}.
% idle paper: https://citeseerx.ist.psu.edu/document?repid=rep1&type=pdf&doi=60fdaa6a74dec29a0538325b742bee4097247c6d#page=119
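To make the cluster-level constraint concrete, the following minimal C sketch walks the standard Linux cpufreq sysfs interface. The attribute names (related_cpus, scaling_governor, scaling_cur_freq) are the stock cpufreq interface, but the policy names (e.g., policy4 for the big cluster) are device-specific assumptions.
\begin{verbatim}
/* Sketch: list per-policy cpufreq attributes to show that P-states
 * apply to whole clusters. Attribute names are the stock Linux
 * cpufreq sysfs interface; policy numbering varies by device. */
#include <stdio.h>

static void show(const char *policy, const char *attr)
{
    char path[128], buf[128];
    snprintf(path, sizeof(path),
             "/sys/devices/system/cpu/cpufreq/%s/%s", policy, attr);
    FILE *f = fopen(path, "r");
    if (!f)
        return;
    if (fgets(buf, sizeof(buf), f))
        printf("%s/%s: %s", policy, attr, buf);
    fclose(f);
}

int main(void)
{
    /* Assumed big-little layout: policy0 = little cluster,
     * policy4 = big cluster (device-specific). */
    const char *policies[] = { "policy0", "policy4" };
    for (int i = 0; i < 2; i++) {
        show(policies[i], "related_cpus");   /* cores sharing one P-state */
        show(policies[i], "scaling_governor");
        show(policies[i], "scaling_cur_freq");
    }
    return 0;
}
\end{verbatim}
On a big-little phone, related_cpus for each policy lists several cores, confirming that a P-state change applies to a whole cluster at once.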
@@ -37,7 +36,7 @@ We refer to the `speed' of the core in its idle state as $\fidle$
Many papers have studied the performance-energy trade-off of governors.
Yao et al.~\cite{492493} establish an idealized framework, but assume prior knowledge of all workloads.
Dynamic systems, by contrast, must somehow gague future work.
Dynamic systems, by contrast, must somehow gauge future work.
The common approach is to minimize energy usage subject to some performance constraint.
Estimating this constraint -- the amount of pending work -- has been approached in several ways.
The Polaris system \cite{korkmaz2018workload} tunes CPU speed to pending workloads based on userspace information.
@@ -46,7 +45,7 @@ It requires knowledge of the pending amount of work and deadline target, informa
Instead of focusing on the current workload, Zhou et al. \cite{9591359} employ machine learning to predict it for a known QoS performance constraint.
Unsurprisingly, several studies have focused on the phone platform given its energy constraints, generally adopting user experience as the performance constraint.
The system proposed by Chen et al. \cite{7372574, 8356047} gagues workload on phone games by tracking CPU-GPU interaction and dynamically selects among existing governors.
The system proposed by Chen et al. \cite{7372574, 8356047} gauges workload on phone games by tracking CPU-GPU interaction and dynamically selects among existing governors.
Li et al. \cite{10.1145/3061639.3062239, 9153119} go further, predicting future work by categorizing game graphic scenes.
Broyde et al. \cite{8226044} combine scaling non-idle CPU count with CPU frequency to tune their system.
The Maestro system \cite{8410428}, like ours, recognizes that existing policies can overreact, resulting in CPU overperformance.
@@ -60,8 +59,9 @@ Zhisheng et al. \cite{10.1145/2973750.2973780} constrain streaming, analyzing th
%% HERE...
Begem et al.~\cite{7314145} invert the general approach, maximizing performance subject to energy constraints on phones.
A system that potentially constrains comptation resources needs to measure the cost.
Meeting query latencies or screendraws are common measurements used in the previous studies.
A system that potentially constrains computation resources needs a way to measure the resulting performance cost.
Meeting query latencies or screen-draw deadlines are common such metrics in previous studies.
None of these, to our knowledge, takes our approach of observing that an approximate energy-minimum setting already suffices to maintain acceptable performance targets, barring specific identifiable cases.
@@ -70,7 +70,6 @@ Meeting query latencies or screendraws are common measurements used in the previ
None of these, to our knowledge, takes our approach of observing that an approximate energy-minimum setting already suffices to maintain acceptable performance targets, barring specific identifiable cases.


@@ -25,14 +25,14 @@
%\bfcaption{Big CPUs}
%\end{subfigure}
%
\bfcaption{Runtime and energy for a fixed compute workload per CPU, varying governor policy and number of CPUs. Energy measurements are taken over a fixed 75s period that includes the workload. (Average of 5 runs, 90\% confidence)}
\bfcaption{Runtime and energy for a fixed compute workload per CPU, varying the governor policy and the CPUs used. Energy measurements are taken over a fixed 75s period that includes the workload. (Average of 5 runs, 90\% confidence)}
\label{fig:u_micro}
\end{figure*}
As CPU frequency increases, the energy per unit of \emph{time} required to operate the CPU grows.
As CPU frequency increases, the energy per unit of \emph{time} used by the CPU grows.
This is the fundamental premise behind governor design.
However, as illustrated in \Cref{fig:item_energy_cost}, an idling mobile CPU consumes only negligible energy.
Given this, the more useful metric to us is the energy per unit of \emph{work}.
However, as illustrated in \Cref{fig:item_energy_cost}, an idling mobile CPU consumes negligible energy.
Given this, the more useful metric is the energy per unit of \emph{work}.
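As a back-of-the-envelope illustration, the C sketch below evaluates energy per cycle under an assumed static-plus-cubic power model; the model form and the constants are illustrative placeholders, not measurements of our platform.
\begin{verbatim}
/* Sketch: energy per cycle under an assumed power model
 * P(f) = P_static + k*f^3; the constants below are illustrative
 * placeholders, not measurements. */
#include <stdio.h>

int main(void)
{
    const double p_static = 0.10; /* W, assumed static/leakage floor */
    const double k = 2e-28;       /* W/Hz^3, assumed dynamic constant */

    for (int i = 1; i <= 6; i++) {
        double f = i * 0.4e9;                    /* 0.4 .. 2.4 GHz */
        double power = p_static + k * f * f * f; /* P(f), in W */
        double nj_per_cycle = power / f * 1e9;   /* P(f)/f, in nJ */
        printf("f = %.1f GHz -> %.3f nJ/cycle\n", f / 1e9, nj_per_cycle);
    }
    /* The curve P(f)/f = P_static/f + k*f^2 first falls (the static
     * floor is amortized over more cycles) and then rises (cubic
     * dynamic power dominates): convex, with an interior minimum. */
    return 0;
}
\end{verbatim}
The printed curve falls while the static floor is being amortized and rises once the cubic dynamic term dominates, i.e., it is convex with an interior minimum.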
As we argue in this section, the relationship between frequency and cost per CPU cycle is convex, driving our first claim:
\claim{
@@ -43,8 +43,7 @@ As we argue in this section, the relationship between frequency and cost-per-cpu
For this experiment, we prepared a deterministic, CPU-bound arithmetic workload.
We ran the workload with a series of CPU policies that fix the CPU to a particular speed, as well as the default \schedutil governor for comparison.
We also vary the number of loads from 1 to 4, with each pinned to a separate CPU within a cluster, and whether the loads run on big or little CPUs.
We also remind the reader that regardless of governor policy, idle CPU cores are put into a low-power C-state.
Regardless of governor policy, idle CPU cores are put into a low-power C-state.
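A minimal sketch of one such load follows: it pins a deterministic arithmetic loop to a single core via the standard Linux sched_setaffinity call and times it; the CPU id and iteration count are placeholders for our actual configuration.
\begin{verbatim}
/* Sketch: pin a deterministic, CPU-bound arithmetic loop to one core.
 * The CPU id and iteration count are placeholders for our setup. */
#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>
#include <time.h>

int main(void)
{
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(0, &set); /* pin to CPU 0; repeat on CPUs 1-3 for more loads */
    if (sched_setaffinity(0, sizeof(set), &set) != 0) {
        perror("sched_setaffinity");
        return 1;
    }

    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);

    volatile double x = 1.0;              /* defeat optimization */
    for (long i = 0; i < 400000000L; i++) /* fixed amount of work */
        x = x * 1.000000001 + 1e-9;

    clock_gettime(CLOCK_MONOTONIC, &t1);
    printf("runtime: %.2f s\n",
           (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) / 1e9);
    return 0;
}
\end{verbatim}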
On the x-axis, we measure the total time to complete the fixed amount of work.
On the y-axis, we measure energy over a fixed period of 75s;
@@ -59,11 +58,11 @@ Compared to frequencies below this point, it is more efficient to run the core a
\subsection{$\fenergy$ in General}
Previous works \cite{vogeleer2013energy, Liang2011AnEC, nuessle2019benchmarking} have suggested that, for a given workload, there is an energy-optimal frequency.
The Linux kernel maintainers themselves observe that, absent compelling corner cases such as thermal throttling, \textit{there is no reason to set the CPU to a non-idle speed below $\fenergy$}\cite{energy-aware-schedutil}.
The Linux kernel maintainers observe that, absent compelling corner cases such as thermal throttling, \textit{there is no reason to set the CPU to a non-idle speed below $\fenergy$}\cite{energy-aware-schedutil}.
While the exact optimal frequency depends on the specific CPU and core type, this $\fenergy$ is generally not the slowest frequency available.
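A rough sketch of why: under the same illustrative static-plus-cubic power model as above, $P(f) = P_s + k f^3$, with assumed static floor $P_s$ and dynamic constant $k$, the energy per cycle is
\begin{equation*}
\frac{P(f)}{f} = \frac{P_s}{f} + k f^2,
\qquad
\frac{d}{df}\!\left(\frac{P(f)}{f}\right) = -\frac{P_s}{f^2} + 2kf = 0
\;\Longrightarrow\;
\fenergy = \left(\frac{P_s}{2k}\right)^{1/3},
\end{equation*}
so any nonzero static floor pushes the energy-optimal frequency above zero, and for plausible constants above the slowest available P-state.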
We observe that frequencies slower than $\fenergy$ are useful for situations where it is not possible to fully saturate the CPU, but \emph{idling is not an option};
For example, in memory-bound workloads (i.e., with frequent CPU stalls resulting from cache misses), workloads with high spin-lock contention, and similar busy-waiting scenarios, it can be beneficial to reduce the CPU frequency to minimize the number of unused CPU cycles\footnote{
We observe that frequencies slower than $\fenergy$ are useful for situations where it is not possible to fully saturate the CPU, but \emph{idling is not an option}:
in memory-bound workloads (i.e., with frequent CPU stalls resulting from cache misses), workloads with high spin-lock contention, and similar busy-waiting scenarios, it can be beneficial to reduce the CPU frequency to minimize the number of unused CPU cycles\footnote{
We note that we did not encounter any significant busy-waiting in any of the apps that we tested.
}.