
% -*- root: ../main.tex -*-
%!TEX root=../main.tex

\begin{figure*}
\centering
\includegraphics[width=.87\linewidth]{figures/graph_jank_allapps.pdf}
\bfcaption{Display framedrop for apps under different CPU policies (10 runs, 90\% confidence)}
\label{fig:jank_allapps}
\end{figure*}

\begin{figure*}
\centering
\includegraphics[width=.87\linewidth]{figures/graph_energy_allapps.pdf}
\bfcaption{Energy usage for apps under different CPU policies (10 runs, 90\% confidence)}
\label{fig:energy_allapps}
\end{figure*}

\begin{figure*}
\centering
\includegraphics[width=.87\linewidth]{figures/graph_time_per_freq_yt.pdf}
\bfcaption{Average time spent per CPU under the default policy for YouTube (average of 10 runs, 90\% confidence)}
\label{fig:time_per_freq_yt}
\end{figure*}

\begin{figure*}
\centering
\includegraphics[width=.87\linewidth]{figures/graph_time_per_freq_spot.pdf}
\bfcaption{Average time spent per CPU under the default policy for Spotify (average of 10 runs, 90\% confidence)}
\label{fig:time_per_freq_spot}
\end{figure*}

\begin{figure*}
\centering
\includegraphics[width=.87\linewidth]{figures/graph_nonidletime_yt.pdf}
\bfcaption{CPU non-idle time for YouTube under different CPU policies (10 runs, 90\% confidence)}
\label{fig:nonidle_yt}
\end{figure*}

\begin{figure*}
\centering
\includegraphics[width=.87\linewidth]{figures/graph_nonidletime_spot.pdf}
\bfcaption{CPU non-idle time for Spotify under different CPU policies (10 runs, 90\% confidence)}
\label{fig:nonidle_spot}
\end{figure*}

\begin{figure*}
\centering
\includegraphics[width=.75\linewidth]{figures/graph_idlejank_heavyload.pdf}
\bfcaption{The effect of additional background loads on user experience for given CPU policies}
\label{fig:idlejank}
\end{figure*}


We now evaluate the \systemname and truncated \schedutil governors. %, by comparing their performance on a range of representative workloads the default Android \schedutil governor.
Concretely, we evaluate the claims that on normal workloads:
(i) truncated \schedutil achieves better performance than regular \schedutil without significantly increasing energy consumption, and
(ii) \systemname achieves better energy consumption than \schedutil, without significantly increasing screen jank.
We further conduct several experiments to confirm our observations from \Cref{sec:wasted}, namely that:
(iii) the adaptive app pattern is not unique to Facebook, and
(iv) apps spend significant time below $\fenergy$.

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\paragraph{Evaluation platform}

Our results were obtained using stock Google Pixel 2 devices with the Snapdragon 835 chipset~\cite{snapdragon-835}, 4 GB RAM, and 128 GB SSD storage, running Android AOSP 10.
Standalone microbenchmarks were implemented in C, while end-to-end macrobenchmarks were performed using the Android UI Automator testing framework to perform scripted, simulated interactions with real-world apps~\cite{uiautomator}.
One of the phones was modified to obtain energy measurements using the Monsoon HVPM power meter~\cite{monsoon}.
Our evaluation system consists of a pair of shell scripts, one running on the phone and one on an external monitor.
The external script sleeps for 10s to ensure quiescence and prevent inter-trial artifacts, and initializes both the Monsoon meter and the on-phone script.
The on-phone script sleeps for 20s to ensure that the Monsoon meter is capturing data, sets the desired governor policy, and starts the experiment.
%When the experiment concludes, the on-phone script sleeps for a further 10s to ensure that the Monsoon meter captures the full trace, and notifies the external script that the experiment has concluded.
%The external script concludes by retrieving relevant artifacts from the phone, excluding data transfer from any energy or performance measurements.
We collected information on CPU speed and idle state from both the Linux \texttt{ftrace} framework and \texttt{sysfs}, and on CPU cycles from the \texttt{perf\_event\_open} syscall~\cite{perf-event}.
We also used \texttt{ftrace} to log test parameters and state.
Information on screen performance including framedrops came from the Android \texttt{dumpsys gfxinfo} service.

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\paragraph{CPU Policies}
We evaluate six different CPU policies:
(i) the system default, \schedutil,
(ii) a truncated \schedutil implemented by lower-bounding the CPU speed at 70\% using the existing API discussed in \Cref{subsec:signal_perf_needs},
(iii) a fixed 70\% speed using the existing \texttt{userspace} governor,
(iv) \systemname with speeds lower-bounded at 70\%,
(v) unmodified \systemname with a fixed default speed of 70\%, and
(vi) the \texttt{performance} governor.
We include (ii) and (iii) to compare truncated \schedutil and a common-case $\sim$70\% fixed-speed policy, as implemented under the existing API, with their equivalents implemented using \systemname.
Under default Linux, a requested CPU speed is implemented as the next-highest speed in the preset series of supported speeds listed in \texttt{scaling\_available\_frequencies} in \texttt{sysfs}.
We follow this behavior with our system.
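
As a sketch of this rounding rule, let $F$ denote the set of supported speeds listed in \texttt{scaling\_available\_frequencies} and $f_{\mathrm{req}}$ the requested speed.
Both symbols are notation introduced here only for illustration, and we assume that ``next-highest'' means the smallest supported speed at or above the request.
The implemented speed is then
\[
  f_{\mathrm{impl}} = \min \{\, f \in F : f \geq f_{\mathrm{req}} \,\}.
\]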

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\paragraph{Workloads}
We consider four separate workloads, the first three involving individual apps: (i) Facebook, (ii) YouTube, and (iii) Spotify.
The fourth (iv) workload combines the Facebook and Spotify loads.
%These were designed to mimic common user phone interactions.
The \textbf{Facebook} workload was described in \Cref{sec:low-speed-in-practice}.
The \textbf{YouTube} workload starts the app and searches for a popular video by name.
The app selects the first hit, starts the video, and waits for 30 seconds.
The specific video was selected because it is predictably served random video ads at the start, at a high rate.
The \textbf{Spotify} workload starts the app and searches for a common musical selection.
It starts the first suggestion and waits for 30 seconds while the audio plays with the app in the foreground.
Lastly, the \textbf{Combined} workload examines the system under commonplace additional stress.
It runs the original Facebook workload in the foreground while the Spotify app streams audio continuously in the background.
These workloads address evaluation claim (iii).

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\subsection{Screen Jank}

\Cref{fig:jank_allapps} shows frame drop rates.
These graphs address the performance aspect of claims (i) and (ii).
On all workloads, \systemname and truncated \schedutil offer performance that is nearly identical to, or notably better than, that of regular \schedutil.
The Facebook workload under \systemname incurs an additional 0.3\% of dropped frames, or $\sim$0.2 frames per second (12.6 frames per minute) at a 60\,fps display rate.
We argue this does not noticeably affect user experience and is more than acceptable given the greater than 10\% energy savings.
The truncated \schedutil policies and the fixed-speed 70\% policy similarly offer significant energy savings at little to no cost in dropped frames.


YouTube shows a clear performance win for \systemname, which produces 5.2\% fewer frame drops than the default.
The truncated \schedutil policy under \systemname and the fixed-speed 70\% policy also offer notably improved frame drop rates, with 4.3\% and 3.6\% lower drop rates respectively.
UI performance under \systemname for both the Spotify and the Combined workloads, like that for Facebook, costs 0.3\% in dropped frames compared to the default, a cost we again argue is both minimal and acceptable.
The other non-default policies for both Spotify and Combined also offer either essentially the same or even somewhat better performance than the default: truncated \schedutil and fixed 70\% under the existing API for Spotify both offer a $\sim$2.5\% lower frame drop rate.
%Finally, we observe that even with the increased background load of the Combined workload,
In summary, \systemname, with a considerably simpler policy mechanism, offers essentially the same performance as \schedutil, measured in user-visible frame drops, on common app workloads.

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\subsection{Ramp-Up Times}

To attribute the improvement in performance, we measure the CPU frequencies selected by \schedutil and by \systemname.
\Cref{fig:time_per_freq_fb,fig:time_per_freq_yt,fig:time_per_freq_spot} plot a CDF of the difference between these two selections.
This addresses evaluation claim (iv) from above.
We note that for a significant fraction of the workload (5\% for Facebook, 15\% for YouTube, 12\% for Spotify), the frequency selected by \schedutil is significantly (up to 50\%) lower than that selected by \systemname.
This is \schedutil's ramp-up period, where it selects frequencies lower than $\fenergy$.
We attribute the relative performance of \systemname to eliminating the ramp-up period in which \schedutil selects speeds below $\fenergy$.
Although each workload spends part of its time at a higher frequency under \schedutil than under \systemname, it spends more time ramping up to $\fenergy$ than at a higher speed.
In summary, the improved performance of both truncated \schedutil and \systemname can be attributed to avoiding \schedutil's ramp-up period.


%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\subsection{Jank Under High Load}

%We next explore the level of additional load required to degrade the user experience.
For this experiment, we run the Facebook workload in the presence of background tasks.
These background tasks generate additional background load by performing simple arithmetic with periodically injected sleeps at varying intervals.
%We collect non-idle time through sysfs and framedrop rate through Android GFX as before.
%We pin one load-producing task to each of the 8 CPU cores.
\Cref{fig:idlejank} illustrates the effect of the added CPU load on the measured jank.
The x-axis shows the average load across all 8 CPU cores (based on the injected sleeps), and the frame-drop rate is shown on the y-axis.
Note that a smaller sleep interval equates to a higher load.
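As an illustrative model of this relationship (an assumption for exposition, not necessarily the exact computation behind the figure), suppose each background task alternates a busy period of length $t_{\mathrm{busy}}$ with an injected sleep of length $t_{\mathrm{sleep}}$; both symbols are hypothetical notation.
The load that task offers to its core is then approximately
\[
  \mathrm{load} \approx \frac{t_{\mathrm{busy}}}{t_{\mathrm{busy}} + t_{\mathrm{sleep}}},
\]
which rises toward 100\% as $t_{\mathrm{sleep}}$ shrinks.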


The leftmost part of the graph, with the smallest circles (representing a normal interaction with no additional background load), shows that a fixed speed of 70\% or greater produces a measured frame drop rate that is essentially identical to that of the system default.
Up to a sustained load of about 70\% across \emph{all} CPU cores, the system is able to keep up with screen redraw events, with a significant effect on jank only at the two lowest CPU frequencies.
In actual usage, a user would likely never encounter this level of background usage; it takes significant, and unrealistic, additional workload to degrade the user experience.

A more representative high-load case is our fourth, Combined workload: browsing through Facebook while listening to Spotify music in the background.
As discussed above, \Cref{fig:jank_allapps} shows that the cost of the additional background load is quite small in terms of frame drops; two of the non-default policies even offer improvements.
In common settings, background load does not pose a threat to the performance of \systemname.


%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\subsection{Energy Usage}

\Cref{fig:energy_allapps} shows energy usage for the four workloads, addressing the energy aspect of evaluation claims (i) and (ii).
The Facebook workload under \systemname consumes significantly less energy (11.5\%) than the default.
Indeed, all of the non-default policies except \texttt{performance} also outperform \schedutil.

YouTube under \systemname also saves energy, albeit less: 1.6\% versus the default.
Spotify under \systemname actually consumes 2.3\% more energy than the default.
Note that this is Spotify running interactively.
The use case of Spotify in the Combined workload, where it is running in the background, is likely much more dominant in actual real world usage.
The energy consumed by the Combined workload, unsurprisingly, is significantly higher across the board than that of the individual app loads.
Here, \systemname uses 5.6\% less energy than the default.
Once again, all of the non-default policies save \texttt{performance} do too.
For common apps under common usage, \systemname offers notable energy savings compared to the default.


%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\subsection{Idle Time}

We next review our findings from \Cref{sec:adaptiveApps}, that typical apps increase their offered load as CPU capacity increases.
\Cref{fig:nonidle_fb,fig:nonidle_yt,fig:nonidle_spot} illustrate the time fraction the CPU spends doing work in each workload as CPU frequency increases.
Recall that, assuming the amount of work stays constant in a fixed-duration workload, the time spent non-idle would be inversely proportional to the CPU frequency.
As with Facebook, both YouTube and Spotify show a much flatter relationship, particularly on the big cores.
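
To make the constant-work baseline concrete, consider a sketch in which a workload of fixed duration $T$ requires $W$ CPU cycles regardless of speed ($T$ and $W$ are illustrative symbols, not quantities reported here).
At frequency $f$, the non-idle fraction would be
\[
  \frac{t_{\mathrm{busy}}(f)}{T} = \frac{W}{f\,T},
\]
so doubling $f$ would halve the non-idle fraction.
A curve that stays much flatter than this as $f$ grows is therefore consistent with $W$ itself increasing with the available CPU capacity.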