Merge branch 'master' of https://git.odin.cse.buffalo.edu/carlnues/paper-KeepItSimple
commit
91132af4f3
|
@ -7,9 +7,8 @@ Historically, systems have addressed the competing goals of energy and latency o
|
|||
On modern systems, CPUs typically consist of multiple cores, often of different types, that run at different speeds (known as P-states) or can be turned on and off into idle (known as C-states).
|
||||
A policy, or `governor', sets the CPU's frequency (P-state) when there is pending computation, optimizing performance at the expense of energy, or vice versa.
|
||||
The governor runs in conjunction with other policies, in particular (i) the scheduler, which determines which tasks run on which CPU cores, and (ii) the idle policy, which places CPUs with no pending work into an (idle) C-state.
|
||||
|
||||
Hardware design on phones can constrain governor policy calculations.
|
||||
For example, CPU speeds often cannot be set on individual cores but only on groups of CPUs -- a constraint stemming from the assymetric big-little CPU architecture, with 2 clusters of higher- and lower-performance CPU cores~\cite{big-little}.
|
||||
For example, CPU speeds often cannot be set on individual cores but only on groups of CPUs -- a constraint stemming from the asymmetric big-little CPU architecture, with two clusters of higher- and lower-performance CPU cores~\cite{big-little}.
|
||||
|
||||
% idle paper: https://citeseerx.ist.psu.edu/document?repid=rep1&type=pdf&doi=60fdaa6a74dec29a0538325b742bee4097247c6d#page=119
|
||||
|
||||
|
@ -37,7 +36,7 @@ We refer to the `speed' of the core in its idle state as $\fidle$
|
|||
|
||||
Many papers have studied the performance-energy trade-off of governors.
|
||||
Yao et al. \cite{492493} established an ideal framework, but assumed prior knowledge of all workloads.
|
||||
Dynamic systems, by contrast, must somehow gague future work.
|
||||
Dynamic systems, by contrast, must somehow gauge future work.
|
||||
The common approach is to minimize energy usage subject to some performance constraint.
|
||||
Prior systems take several approaches to estimating this constraint, that is, the pending work.
|
||||
The Polaris system \cite{korkmaz2018workload} tunes CPU speed to pending workloads based on userspace information.
|
||||
|
@ -46,7 +45,7 @@ It requires knowledge of the pending amount of work and deadline target, informa
|
|||
Instead of focusing on the current workload, Zhou et al. \cite{9591359} employ machine learning to predict it for a known QoS performance constraint.
|
||||
|
||||
Unsurprisingly, several studies have focused on the phone platform given the latter's energy constraints, generally seeking to maintain user experience as the constraint.
|
||||
The system proposed by Chen et al. \cite{7372574, 8356047} gagues workload on phone games by tracking CPU-GPU interaction and dynamically selects among existing governors.
|
||||
The system proposed by Chen et al. \cite{7372574, 8356047} gauges workload on phone games by tracking CPU-GPU interaction and dynamically selects among existing governors.
|
||||
Li et al. \cite{10.1145/3061639.3062239, 9153119} go further, predicting future work by categorizing game graphic scenes.
|
||||
Broyde et al. \cite{8226044} combine scaling non-idle CPU count with CPU frequency to tune their system.
|
||||
The Maestro system \cite{8410428}, like ours, recognizes that existing policies can unduly overreact, resulting in CPU overperformance.
|
||||
|
@ -60,8 +59,9 @@ Zhisheng et al. \cite{10.1145/2973750.2973780} constrain streaming, analyzing th
|
|||
%% HERE...
|
||||
|
||||
Begem et al. \cite{7314145} take the opposite of the general approach and maximize performance subject to energy constraints on phones.
|
||||
A system that potentially constrains comptation resources needs to measure the cost.
|
||||
Meeting query latencies or screendraws are common measurements used in the previous studies.
|
||||
A system that potentially constrains computation resources needs to measure the cost.
|
||||
Query latency and screen-draw timing are common measurements used in previous studies.
|
||||
None of these, to our knowledge, uses our approach of observing that an approximate energy-minimum setting already suffices to maintain acceptable performance targets, barring specific identifiable cases.
|
||||
|
||||
|
||||
|
||||
|
@ -70,7 +70,6 @@ Meeting query latencies or screendraws are common measurements used in the previ
|
|||
|
||||
|
||||
|
||||
None of these, to our knowledge, uses our approach of observing that an approximate energy-minimum setting already suffices to maintain acceptable performance targets, baring specific identifiable cases.
|
||||
|
||||
|
||||
|
||||
|
|
|
@ -25,14 +25,14 @@
|
|||
%\bfcaption{Big CPUs}
|
||||
%\end{subfigure}
|
||||
%
|
||||
\bfcaption{Runtime and energy for a fixed compute workload per CPU, varying governor policy and number of CPUs. Energy measurements are taken over a fixed 75s period that includes the workload. (Average of 5 runs, 90\% confidence)}
|
||||
\bfcaption{Runtime and energy for a fixed compute workload per CPU, varying governor policy and CPUs. Energy measurements are taken over a fixed 75s period that includes the workload. (Average of 5 runs, 90\% confidence)}
|
||||
\label{fig:u_micro}
|
||||
\end{figure*}
|
||||
|
||||
As CPU frequency increases, the energy per unit of \emph{time} required to operate the CPU grows.
|
||||
As CPU frequency increases, the energy per unit of \emph{time} used by the CPU grows.
|
||||
This is the fundamental premise behind governor design.
|
||||
However, as illustrated in \Cref{fig:item_energy_cost}, an idling mobile CPU consumes only negligible energy.
|
||||
Given this, the more useful metric to us is the energy per unit of \emph{work}.
|
||||
However, as illustrated in \Cref{fig:item_energy_cost}, an idling mobile CPU consumes negligible energy.
|
||||
Given this, the more useful metric is the energy per unit of \emph{work}.
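% Illustrative sketch only: the power model below is an assumption, not a result of our measurements.
As a rough illustration of why this metric can have an interior minimum, suppose (purely as an assumption) that an active core draws power $P(f) = P_0 + c f^3$ at frequency $f$, where $P_0$ is a frequency-independent cost of keeping the core out of its idle state and $c$ is a hypothetical device constant.
If work is proportional to CPU cycles, the energy per cycle is
\[
  e(f) = \frac{P(f)}{f} = \frac{P_0}{f} + c f^2,
  \qquad
  e''(f) = \frac{2 P_0}{f^3} + 2c > 0 \quad \text{for } f > 0 .
\]
Under this assumed model, $e$ is convex and attains its minimum where $e'(f) = -P_0/f^2 + 2cf = 0$, that is, at
\[
  f^{\ast} = \sqrt[3]{\frac{P_0}{2c}} ,
\]
which plays the role of $\fenergy$: running more slowly than $f^{\ast}$ spends more of $P_0$ per cycle, while running faster pays the growing $c f^2$ term.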
|
||||
As we argue in this section, the relationship between frequency and energy cost per CPU cycle is convex, driving our first claim:
|
||||
|
||||
\claim{
|
||||
|
@ -43,8 +43,7 @@ As we argue in this section, the relationship between frequency and cost-per-cpu
|
|||
For this experiment, we prepared a deterministic, CPU-bound arithmetic workload.
|
||||
We ran the workload with a series of CPU policies that fix the CPU to a particular speed, as well as the default \schedutil governor for comparison.
|
||||
We also vary the number of loads from 1 to 4, with each pinned to a separate CPU within a CPU cluster, and whether the loads run on big or little CPUs.
|
||||
|
||||
We also remind the reader that regardless of governor policy, idle CPU cores are put into a low-power C-state.
|
||||
Regardless of governor policy, idle CPU cores are put into a low-power C-state.
|
||||
|
||||
The x-axis shows the total time to complete the fixed amount of work.
|
||||
The y-axis shows the energy measured over a fixed period of 75s;
|
||||
|
@ -59,11 +58,11 @@ Compared to frequencies below this point, it is more efficient to run the core a
|
|||
|
||||
\subsection{$\fenergy$ in General}
|
||||
Previous works \cite{vogeleer2013energy, Liang2011AnEC, nuessle2019benchmarking} have suggested that, for a given workload, there is an energy-optimal frequency.
|
||||
The Linux kernel maintainers themselves observe that, absent compelling corner cases such as thermal throttling, \textit{there is no reason to set the CPU to a non-idle speed below $\fenergy$}\cite{energy-aware-schedutil}.
|
||||
The Linux kernel maintainers observe that, absent compelling corner cases such as thermal throttling, \textit{there is no reason to set the CPU to a non-idle speed below $\fenergy$}\cite{energy-aware-schedutil}.
|
||||
While the exact optimal frequency depends on the specific CPU and core type, this $\fenergy$ is not, generally, the slowest frequency available.
|
||||
|
||||
We observe that frequencies slower than $\fenergy$ are useful for situations where it is not possible to fully saturate the CPU, but \emph{idling is not an option};
|
||||
For example, in memory-bound workloads (i.e., with frequent CPU stalls resulting from cache misses), workloads with high spin-lock contention, and similar busy-waiting scenarios, it can be beneficial to reduce the CPU frequency to minimize the number of unused CPU cycles\footnote{
|
||||
We observe that frequencies slower than $\fenergy$ are useful for situations where it is not possible to fully saturate the CPU, but \emph{idling is not an option}:
|
||||
in memory-bound workloads (i.e., with frequent CPU stalls resulting from cache misses), workloads with high spin-lock contention, and similar busy-waiting scenarios, it can be beneficial to reduce the CPU frequency to minimize the number of unused CPU cycles\footnote{
|
||||
We did not encounter significant busy-waiting in any of the apps that we tested.
|
||||
}.
|
||||
|
||||
|
|