
% -*- root: ../main.tex -*-
%!TEX root=../main.tex

\begin{figure*}
\centering
\includegraphics[width=.90\linewidth]{figures/graph_jank_allapps.pdf}
\bfcaption{Display framedrop for apps under different CPU policies (10 runs, 90\% confidence)}
\label{fig:jank_allapps}
\end{figure*}

\begin{figure*}
\centering
\includegraphics[width=.90\linewidth]{figures/graph_energy_allapps.pdf}
\bfcaption{Energy usage for apps under different CPU policies (10 runs, 90\% confidence)}
\label{fig:energy_allapps}
\end{figure*}

\begin{figure*}
\centering
\includegraphics[width=.90\linewidth]{figures/graph_time_per_freq_yt.pdf}
\bfcaption{Average time spent per CPU under the default policy for YouTube (Average of 10 runs, 90\% confidence)}
\label{fig:time_per_freq_yt}
\end{figure*}

\begin{figure*}
\centering
\includegraphics[width=.90\linewidth]{figures/graph_time_per_freq_spot.pdf}
\bfcaption{Average time spent per CPU under the default policy for Spotify (Average of 10 runs, 90\% confidence)}
\label{fig:time_per_freq_spot}
\end{figure*}

\begin{figure*}
\centering
\includegraphics[width=.90\linewidth]{figures/graph_nonidletime_yt.pdf}
\bfcaption{CPU non-idle time for YouTube under different CPU policies (10 runs, 90\% confidence)}
\label{fig:nonidle_yt}
\end{figure*}

\begin{figure*}
\centering
\includegraphics[width=.90\linewidth]{figures/graph_nonidletime_spot.pdf}
\bfcaption{CPU non-idle time for Spotify under different CPU policies (10 runs, 90\% confidence)}
\label{fig:nonidle_spot}
\end{figure*}

\begin{figure*}
\centering
\includegraphics[width=.90\linewidth]{figures/graph_idlejank_heavyload.pdf}
\bfcaption{The effect of additional background loads on user experience for given CPU policies}
\label{fig:idlejank}
\end{figure*}


We now evaluate the \systemname and truncated \schedutil governors by comparing their performance on a range of representative workloads against the default Android \schedutil governor.
Concretely, we evaluate the claims that on normal workloads:
(i) truncated \schedutil achieves significantly better performance than regular \schedutil without significantly increasing energy consumption, and
(ii) \systemname achieves significantly better energy consumption than \schedutil, without significantly increasing screen jank.

We further conduct several experiments to confirm our observations from \Cref{sec:wasted}, namely that:
(iii) the adaptive app pattern is not unique to Facebook, and
(iv) apps spend significant time below $\fenergy$.

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\paragraph{Evaluation platform}

Our results were obtained using Google Pixel 2 devices running Android AOSP 10, each with the Snapdragon 835 chipset~\cite{snapdragon-835}, 4 GB RAM, and 128 GB SSD storage.
Standalone microbenchmarks were implemented in C, while end-to-end macrobenchmarks used the Android UI Automator testing framework to script simulated interactions with real-world apps~\cite{uiautomator}.
One of the phones was modified to obtain energy measurements using the Monsoon HVPM power meter~\cite{monsoon}.
Our evaluation system consists of a pair of shell scripts, one running on the phone and one on an external monitor.

The external script sleeps for 10s to ensure quiescence and prevent inter-trial artifacts, and initializes both the Monsoon meter and the on-phone script.
The on-phone script sleeps for 20s to ensure that the Monsoon meter is capturing data, sets the desired governor policy, and starts the experiment.
When the experiment concludes, the on-phone script sleeps for a further 10s to ensure that the Monsoon meter captures the full trace, and notifies the external script that the experiment has concluded.
The external script concludes by retrieving relevant artifacts from the phone; this data transfer is excluded from all energy and performance measurements.

We collected information on CPU speed and idle state from both the Linux \texttt{ftrace} framework and \texttt{sysfs}, and on CPU cycles from the \texttt{perf\_event\_open} syscall~\cite{perf-event}.
We also used \texttt{ftrace} to log test parameters and state.
Information on screen performance, including frame drops, came from the Android \texttt{dumpsys gfxinfo} service.

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\paragraph{Workloads}
We consider three separate workloads: (i) Facebook, (ii) YouTube, and (iii) Spotify.
The \textbf{Facebook} workload was described in \Cref{sec:low-speed-in-practice}.
The \textbf{YouTube} workload starts the app and searches for a popular video by name.
It then selects the first hit, starts the video, and waits for 30 seconds.
The specific video was chosen because it is served random motion-video ads at the start at a predictably high rate.
The \textbf{Spotify} workload...
\todo{Fill in details}.

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\subsection{Screen Jank}

%\begin{figure}
%\centering
%\includegraphics[width=.95\linewidth]{figures/graph_jank_perspeed_yt.pdf}
%\bfcaption{Display framedrop proportion for a :30 Youtube interaction under different CPU policies (10 runs, 90\% confidence)}
%\label{fig:screendrops_per_freq_yt}
%\end{figure}

\Cref{fig:jank_allapps} shows frame drop rates for the three workloads.
\todo{discuss}

These graphs confirm the performance aspect of claims (i) and (ii).
On all workloads, \systemname and truncated \schedutil both outperform regular \schedutil.
\systemname has a 10--25\% lower frame drop rate, varying by workload.

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\subsection{Ramp-Up Times}

To attribute the improvement in performance, we measure the CPU frequencies selected by \schedutil and \systemname, respectively.
\Cref{fig:time_per_freq_fb,fig:time_per_freq_yt,fig:time_per_freq_spot} plot a CDF of the difference between these two selections.
We note that for a significant fraction of the workload (5\% for Facebook, 15\% for YouTube), the frequency selected by \schedutil is significantly (up to 50\%) lower.
This is \schedutil's ramp-up period, where it selects frequencies lower than $\fenergy$.
We attribute the improved performance of both governors to eliminating this ramp-up period, during which \schedutil selects speeds below $\fenergy$.
Although each workload spends part of its time at a higher frequency under \schedutil than under \systemname, it spends more time ramping up to $\fenergy$ than at a higher speed.
In summary, the improved performance of both truncated \schedutil and \systemname can be attributed to avoiding \schedutil's ramp-up period.


%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\subsection{Jank Under High Load}


We next explore the level of additional load required to degrade the user experience.
For this experiment, we run the Facebook workload in the presence of background tasks.
These tasks generate additional load by performing simple arithmetic, with sleeps injected periodically at varying intervals.
%We collect non-idle time through sysfs and framedrop rate through Android GFX as before.
%We pin one load-producing task to each of the 8 CPU cores.
\Cref{fig:idlejank} illustrates the effect of the added CPU load on the measured jank.
The x-axis shows the average load across all 8 CPU cores (based on the injected sleeps), and the frame-drop rate is shown on the y-axis.
Note that a smaller sleep interval equates to a higher load.
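
As a rough illustration of why shorter sleeps mean higher load, suppose (as a simplifying assumption for exposition, not a description of the actual benchmark) that each background task runs at a fixed CPU frequency and alternates a compute burst of duration $t_{\mathrm{busy}}$ with a sleep of duration $t_{\mathrm{sleep}}$; both symbols are introduced here only for this sketch.
The load $\ell$ that the task offers to its core is then approximately its duty cycle:
\[
  \ell \approx \frac{t_{\mathrm{busy}}}{t_{\mathrm{busy}} + t_{\mathrm{sleep}}} .
\]
Under this model, $\ell$ approaches 1 as $t_{\mathrm{sleep}}$ approaches 0.
For example, a sleep equal in length to the burst yields $\ell \approx 50\%$, while a sleep one third as long as the burst yields $\ell \approx 75\%$.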


The leftmost part of the graph, with the smallest circles (representing a normal interaction with no additional background load), shows that a fixed speed of 70\% or greater produces a measured frame drop rate that is essentially identical to that of the system default.
Up to a sustained load of about 70\% across \emph{all} CPU cores, the system is able to keep up with screen redraw events, with a significant effect on jank only at the lowest two CPU frequencies.
In actual usage, a user would likely never encounter this level of background usage; it takes significant, and unrealistic, additional workload to degrade the user experience.


%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\subsection{Energy Usage}

\Cref{fig:energy_allapps} shows energy usage for the three workloads.
\todo{discuss}


%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\subsection{Idle Time}

We next review our findings from \Cref{sec:adaptiveApps}, that typical apps increase their offered load as CPU capacity increases.
\Cref{fig:nonidle_fb,fig:nonidle_yt,fig:nonidle_spot} illustrate the fraction of time the CPU spends doing work in each workload as CPU frequency increases.
Recall that, assuming the amount of work stays constant in a fixed-duration workload, the time spent non-idle would be inversely proportional to the CPU frequency.
As with Facebook, the YouTube workload shows a much flatter relationship, particularly on the big cores.
\todo{Discuss Spotify}
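
The constant-work baseline recalled above can be written out explicitly; the symbols $W$, $T$, and $u$ are introduced here only for this sketch.
If a workload requires a fixed $W$ CPU cycles over a fixed duration $T$, a core running at frequency $f$ is busy for $W/f$ seconds, so its non-idle fraction is
\[
  u(f) = \frac{W}{f\,T},
  \qquad\text{and hence}\qquad
  \frac{u(f_2)}{u(f_1)} = \frac{f_1}{f_2} .
\]
Under this assumption, doubling the frequency should halve the non-idle fraction.
A measured curve that falls more slowly than $1/f$ therefore suggests that $W$ itself grows with $f$, which is the adaptive behavior described in \Cref{sec:adaptiveApps}.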