carlnues@buffalo.edu 2023-08-25 15:28:24 -04:00
commit 91132af4f3
2 changed files with 14 additions and 16 deletions


@@ -7,9 +7,8 @@ Historically, systems have addressed the competing goals of energy and latency o
On modern systems, CPUs typically consist of multiple cores, often of different types, that run at different speeds (known as P-states) or can be placed into idle states (known as C-states).
A policy, or `governor', sets the CPU's frequency (P-state) when there is pending computation, optimizing performance at the expense of energy, or vice versa.
The governor runs in conjunction with other policies, in particular (i) the scheduler, which determines what tasks run on which CPU cores, and (ii) the idle policy, which places CPUs with no pending work into an (idle) C-state.
Hardware design on phones can constrain governor policy calculations.
For example, CPU speeds often cannot be set on individual cores but only on groups of CPUs -- a constraint stemming from the assymetric big-little CPU architecture, with 2 clusters of higher- and lower-performance CPU cores~\cite{big-little}.
For example, CPU speeds often cannot be set on individual cores but only on groups of CPUs -- a constraint stemming from the asymmetric big-little CPU architecture, with 2 clusters of higher- and lower-performance CPU cores~\cite{big-little}.
% idle paper: https://citeseerx.ist.psu.edu/document?repid=rep1&type=pdf&doi=60fdaa6a74dec29a0538325b742bee4097247c6d#page=119
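To make the cluster-level constraint concrete, the following minimal C sketch walks the standard Linux cpufreq sysfs interface. The attribute names (related_cpus, scaling_governor, scaling_cur_freq) are the stock cpufreq interface, but the policy names (e.g., policy4 for the big cluster) are device-specific assumptions.
\begin{verbatim}
/* Sketch: list per-policy cpufreq attributes to show that P-states
 * apply to whole clusters. Attribute names are the stock Linux
 * cpufreq sysfs interface; policy numbering varies by device. */
#include <stdio.h>

static void show(const char *policy, const char *attr)
{
    char path[128], buf[128];
    snprintf(path, sizeof(path),
             "/sys/devices/system/cpu/cpufreq/%s/%s", policy, attr);
    FILE *f = fopen(path, "r");
    if (!f)
        return;
    if (fgets(buf, sizeof(buf), f))
        printf("%s/%s: %s", policy, attr, buf);
    fclose(f);
}

int main(void)
{
    /* Assumed big-little layout: policy0 = little cluster,
     * policy4 = big cluster (device-specific). */
    const char *policies[] = { "policy0", "policy4" };
    for (int i = 0; i < 2; i++) {
        show(policies[i], "related_cpus");   /* cores sharing one P-state */
        show(policies[i], "scaling_governor");
        show(policies[i], "scaling_cur_freq");
    }
    return 0;
}
\end{verbatim}
On a big-little phone, related_cpus for each policy lists several cores, confirming that a P-state change applies to a whole cluster at once.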
@@ -37,7 +36,7 @@ We refer to the `speed' of the core in its idle state as $\fidle$
Many papers have studied the performance-energy trade-off of governors.
Yao et al.~\cite{492493} establish an idealized framework, but assume prior knowledge of all workloads.
Dynamic systems, by contrast, must somehow gague future work.
Dynamic systems, by contrast, must somehow gauge future work.
The common approach is to minimize energy usage subject to some performance constraint.
Estimating this constraint -- the amount of pending work -- has been approached in several ways.
The Polaris system \cite{korkmaz2018workload} tunes CPU speed to pending workloads based on userspace information.
@@ -46,7 +45,7 @@ It requires knowledge of the pending amount of work and deadline target, informa
Instead of focusing on the current workload, Zhou et al. \cite{9591359} employ machine learning to predict it for a known QoS performance constraint.
Unsurprisingly, several studies have focused on the phone platform given its energy constraints, generally adopting user experience as the performance constraint.
The system proposed by Chen et al. \cite{7372574, 8356047} gagues workload on phone games by tracking CPU-GPU interaction and dynamically selects among existing governors.
The system proposed by Chen et al. \cite{7372574, 8356047} gauges workload on phone games by tracking CPU-GPU interaction and dynamically selects among existing governors.
Li et al. \cite{10.1145/3061639.3062239, 9153119} go further, predicting future work by categorizing game graphic scenes.
Broyde et al. \cite{8226044} combine scaling non-idle CPU count with CPU frequency to tune their system.
The Maestro system \cite{8410428}, like ours, recognizes that existing policies can overreact, resulting in CPU overperformance.
@@ -60,8 +59,9 @@ Zhisheng et al. \cite{10.1145/2973750.2973780} constrain streaming, analyzing th
%% HERE...
Begem et al.~\cite{7314145} invert the general approach, maximizing performance subject to energy constraints on phones.
A system that potentially constrains comptation resources needs to measure the cost.
Meeting query latencies or screendraws are common measurements used in the previous studies.
A system that potentially constrains computation resources needs a way to measure the resulting performance cost.
Meeting query latencies or screen-draw deadlines are common such metrics in previous studies.
None of these, to our knowledge, takes our approach of observing that an approximate energy-minimum setting already suffices to maintain acceptable performance targets, barring specific identifiable cases.
@@ -70,7 +70,6 @@ Meeting query latencies or screendraws are common measurements used in the previ
None of these, to our knowledge, takes our approach of observing that an approximate energy-minimum setting already suffices to maintain acceptable performance targets, barring specific identifiable cases.


@@ -25,14 +25,14 @@
%\bfcaption{Big CPUs}
%\end{subfigure}
%
\bfcaption{Runtime and energy for a fixed compute workload per CPU, varying governor policy and number of CPUs. Energy measurements are taken over a fixed 75s period that includes the workload. (Average of 5 runs, 90\% confidence)}
\bfcaption{Runtime and energy for a fixed compute workload per CPU, varying the governor policy and the CPUs used. Energy measurements are taken over a fixed 75s period that includes the workload. (Average of 5 runs, 90\% confidence)}
\label{fig:u_micro}
\end{figure*}
As CPU frequency increases, the energy per unit of \emph{time} required to operate the CPU grows.
As CPU frequency increases, the energy per unit of \emph{time} used by the CPU grows.
This is the fundamental premise behind governor design.
However, as illustrated in \Cref{fig:item_energy_cost}, an idling mobile CPU consumes only negligible energy.
Given this, the more useful metric to us is the energy per unit of \emph{work}.
However, as illustrated in \Cref{fig:item_energy_cost}, an idling mobile CPU consumes negligible energy.
Given this, the more useful metric is the energy per unit of \emph{work}.
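As a back-of-the-envelope illustration, the C sketch below evaluates energy per cycle under an assumed static-plus-cubic power model; the model form and the constants are illustrative placeholders, not measurements of our platform.
\begin{verbatim}
/* Sketch: energy per cycle under an assumed power model
 * P(f) = P_static + k*f^3; the constants below are illustrative
 * placeholders, not measurements. */
#include <stdio.h>

int main(void)
{
    const double p_static = 0.10; /* W, assumed static/leakage floor */
    const double k = 2e-28;       /* W/Hz^3, assumed dynamic constant */

    for (int i = 1; i <= 6; i++) {
        double f = i * 0.4e9;                    /* 0.4 .. 2.4 GHz */
        double power = p_static + k * f * f * f; /* P(f), in W */
        double nj_per_cycle = power / f * 1e9;   /* P(f)/f, in nJ */
        printf("f = %.1f GHz -> %.3f nJ/cycle\n", f / 1e9, nj_per_cycle);
    }
    /* The curve P(f)/f = P_static/f + k*f^2 first falls (the static
     * floor is amortized over more cycles) and then rises (cubic
     * dynamic power dominates): convex, with an interior minimum. */
    return 0;
}
\end{verbatim}
The printed curve falls while the static floor is being amortized and rises once the cubic dynamic term dominates, i.e., it is convex with an interior minimum.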
As we argue in this section, the relationship between frequency and cost per CPU cycle is convex, driving our first claim:
\claim{
@@ -43,8 +43,7 @@ As we argue in this section, the relationship between frequency and cost-per-cpu
For this experiment, we prepared a deterministic, CPU-bound arithmetic workload.
We ran the workload with a series of CPU policies that fix the CPU to a particular speed, as well as the default \schedutil governor for comparison.
We also vary the number of loads from 1 to 4, with each pinned to a separate CPU within a cluster, and whether the loads run on big or little CPUs.
We also remind the reader that regardless of governor policy, idle CPU cores are put into a low-power C-state.
Regardless of governor policy, idle CPU cores are put into a low-power C-state.
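A minimal sketch of one such load follows: it pins a deterministic arithmetic loop to a single core via the standard Linux sched_setaffinity call and times it; the CPU id and iteration count are placeholders for our actual configuration.
\begin{verbatim}
/* Sketch: pin a deterministic, CPU-bound arithmetic loop to one core.
 * The CPU id and iteration count are placeholders for our setup. */
#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>
#include <time.h>

int main(void)
{
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(0, &set); /* pin to CPU 0; repeat on CPUs 1-3 for more loads */
    if (sched_setaffinity(0, sizeof(set), &set) != 0) {
        perror("sched_setaffinity");
        return 1;
    }

    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);

    volatile double x = 1.0;              /* defeat optimization */
    for (long i = 0; i < 400000000L; i++) /* fixed amount of work */
        x = x * 1.000000001 + 1e-9;

    clock_gettime(CLOCK_MONOTONIC, &t1);
    printf("runtime: %.2f s\n",
           (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) / 1e9);
    return 0;
}
\end{verbatim}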
On the x-axis, we measure the total time to complete the fixed amount of work.
On the y-axis, we measure energy over a fixed period of 75s;
@@ -59,11 +58,11 @@ Compared to frequencies below this point, it is more efficient to run the core a
\subsection{$\fenergy$ in General}
Previous works \cite{vogeleer2013energy, Liang2011AnEC, nuessle2019benchmarking} have suggested that, for a given workload, there is an energy-optimal frequency.
The Linux kernel maintainers themselves observe that, absent compelling corner cases such as thermal throttling, \textit{there is no reason to set the CPU to a non-idle speed below $\fenergy$}\cite{energy-aware-schedutil}.
The Linux kernel maintainers observe that, absent compelling corner cases such as thermal throttling, \textit{there is no reason to set the CPU to a non-idle speed below $\fenergy$}\cite{energy-aware-schedutil}.
While the exact optimal frequency depends on the specific CPU and core type, this $\fenergy$ is generally not the slowest frequency available.
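A rough sketch of why: under the same illustrative static-plus-cubic power model as above, $P(f) = P_s + k f^3$, with assumed static floor $P_s$ and dynamic constant $k$, the energy per cycle is
\begin{equation*}
\frac{P(f)}{f} = \frac{P_s}{f} + k f^2,
\qquad
\frac{d}{df}\!\left(\frac{P(f)}{f}\right) = -\frac{P_s}{f^2} + 2kf = 0
\;\Longrightarrow\;
\fenergy = \left(\frac{P_s}{2k}\right)^{1/3},
\end{equation*}
so any nonzero static floor pushes the energy-optimal frequency above zero, and for plausible constants above the slowest available P-state.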
We observe that frequencies slower than $\fenergy$ are useful for situations where it is not possible to fully saturate the CPU, but \emph{idling is not an option};
For example, in memory-bound workloads (i.e., with frequent CPU stalls resulting from cache misses), workloads with high spin-lock contention, and similar busy-waiting scenarios, it can be beneficial to reduce the CPU frequency to minimize the number of unused CPU cycles\footnote{
We observe that frequencies slower than $\fenergy$ are useful for situations where it is not possible to fully saturate the CPU, but \emph{idling is not an option}:
in memory-bound workloads (i.e., with frequent CPU stalls resulting from cache misses), workloads with high spin-lock contention, and similar busy-waiting scenarios, it can be beneficial to reduce the CPU frequency to minimize the number of unused CPU cycles\footnote{
We note that we did not encounter any significant busy-waiting in any of the apps that we tested.
}.