%!TEX root=../paper.tex
In this section, we describe the datasets, the environment in which we performed our experiments, and the experiment designs along with their results.
All of our experiments were run on a machine with a 3.6 GHz 6th-generation Intel i7 processor and 16GB of RAM. We used the Java 1.8 SE Runtime Environment and R v3.3.2 on the Ubuntu 16.04 operating system.
\subsection{Datasets}
We evaluate the performance and accuracy of our design on two real-world datasets. The first consists of traces gathered in the wild, and the second consists of traces
gathered by us to establish ground truth by associating the specific activity the user performs with the queries generated during collection.
\subsubsection{Dataset 1 - PocketData Dataset}
\label{sec:pocketdatadataset}
As the mobile workload dataset, we use an extended version of the PocketData~\cite{kennedy2015pocket} dataset which provides handset-based data directly collected from smartphones for multiple users.
It includes a 21-day trace of the SQLite activity of 53 PhoneLab~\cite{phonelab} smartphones running the Android platform. University at Buffalo's PhoneLab is a smartphone testbed consisting of 175 participants drawn from the university's faculty, students, and staff. Participants were provided with discounted Sprint service and a Google Nexus 5 smartphone to use as their primary device. In exchange, they ran a modified Android platform image containing custom instrumentation.
A 2013--14 survey indicates that PhoneLab participants are well distributed across genders and age brackets~\cite{phonelab}. Because PhoneLab is open to any UB affiliate, the participants come from many different departments on campus, providing a reasonable level of on-campus spatial coverage.
This dataset contains information about usage patterns across a wide variety of apps, and is anonymized on a best-effort basis\footnote{IRB Number Anonymized}.
Most of the private information is irreversibly concealed, but some constants remain in the queries.
An older version of the dataset that includes a month's trace of SQLite activity of 11 users is available online~\cite{kennedy2015pocket}.
The data that we used in this paper is available upon request.
%, and the researchers has to follow the IRB requirements of their institute to be able to use, and publish using this data.
\begin{table}[]
\centering
\caption{Number of queries per application in the dataset}
\label{tab:dataset}
\vspace{-0.2cm}
\begin{tabular}{cc}
\textbf{Application} & \textbf{\# of queries} \\ \hline
Facebook & 245,550 \\
WhatsApp & 14,898 \\
Photos & 165,347 \\
%\begin{tabular}[c]{@{}c@{}}Google Play\\ Services\end{tabular} & 3,008722 \\
Google Play Services & 3,008,722 \\
Twitter & 27,318 \\ \hline
\end{tabular}
\end{table}
%\subsubsection{Dataset 2 - Controlled Dataset}
%\label{sec:controlleddataset}
%We collected this dataset in a controlled environment where one user's activities for one hour are recorded along with the time of each activity, and the activity the user performed. To create this dataset, we selected \emph{Facebook} application simply because it actively utilizes the mobile database, and we could define the descriptive activities such as opening a profile page, scrolling down on the home feed, and so on.
%During the data collection period, the user was free to use the any other application along with Facebook, or not to use the phone at all. However, the user was informed about the experiment was being performed on Facebook application, and they should narrate what they are doing while doing it to be recorded.
%This dataset contains 60 minutes of queries that Facebook application issued to the database. There are \todo{about 5000} parsable~\footnote{by JSqlParser} queries recorded, labeled with what activity they were associated with, timing information of each query, and lastly the user labeled session which is roughly described to the user as \textit{``Session in an application is the period between a user starts to dedicate their attention to the application, and loses their interest.''} Note that this description is ambiguous, and aims to lead the experiment participant to form their understanding of a session.
\subsubsection{Dataset 2 - Controlled Activity}
\label{sec:controlledactivity}
This dataset contains queries that are generated when the user interacts with the mobile application for a known, specific purpose. While collecting these activity logs, every time the user completes an operation, the set of queries issued is labeled with the activity the user performed.
For instance, when a user of the Facebook app visits the profile page of another user through the newsfeed, the sequence of queries generated by this activity is recorded.
A label is generated for this activity, and the recorded sequence of queries is associated with that label. If the user performs this activity again, the newly recorded sequence of queries is once again labeled with the label
for this activity.
%Each line in the log has:
%\begin{itemize}
% \item Device ID: Unique identifier for each device
% \item UNIX timestamp: Milliseconds since 1970
% \item Ordering: Timestamp and request order
% \item Date and time: Human readable timestamp
% \item Process ID: Standard UNIX process ID
% \item Thread ID: Standard UNIX thread ID
% \item Log level: Verbose (V), Debug (D), Info (I), Warning (W), Error (E)
% \item Tag: Source of log information, ``SQLite-Query-PhoneLab''
% \item JSON object that holds various information about the event that is logged
%\end{itemize}
%information.
%Note that the app ID is not included in the log, because different apps can have the same process and thread IDs in different times.
%Our strategy to get the log lines for our app of interest is to search for the app name in JSON object parsing from the beginning of the file.
%When we first encounter the app of interest, we use process and thread IDs to identify the events related to that app until we encounter a different app name in the JSON object.
%Hence, we implement all the functions of the system with Java for it to be repackaged, and be able to be imported into mobile phones as well as servers.
%This may solve the privacy concerns of the users: allowing the processing to be performed on the phone instead of a company server even if it is not the ideal case due to performance and energy consumption constraints.
\subsection{Experiment Design}
The core output of the system presented in this paper is an expected query workload extracted, without any supervision, from the user's past activity.
%It is important to answer the question of \textit{``what is the accuracy of such an output?''}.
The goal of this section is to analyze the accuracy of our proposed method and compare it against the methods in the existing literature. Accuracy here is defined with respect
to extracting meaningful patterns from our datasets.
%One of the ways of doing it is through a comparison.
We test the various session clustering methodologies presented in Section~\ref{sec:sessionclustering}, and we compare our session identification methodology presented in Section~\ref{sec:sessionidentifier} against the state-of-the-art method for web query session identification~\cite{hagen2011query}.
To achieve this, we designed the following experiments as building blocks to verify the accuracy and consistency of the methods that we use in each step.
\subsubsection{Pattern Identification Comparison}
Extracting meaningful patterns from user data is central to a reliable characterization of the smartphone database workload. In this experiment, we aim to show the accuracy of our proposed approach, and test various methods in search of the best performance.
One way of doing so is to show that the proposed method can \emph{predict} the same user's expected workload for a given timeframe. To achieve this, under the assumption that users' weekly smartphone usage exhibits a certain level of regularity, we partitioned the 21-day query workload created by each user into 3 equal partitions. We cross-validated the accuracy results of our tests by comparing each partition with the other two partitions.
Following this approach, we performed 5 comparison runs, changing the session similarity method each time (the two distance measures used by these runs are recalled right after the list):
\begin{itemize}
\item Without Session Identification: We treat each query as its own session, and use the query cluster assignments for comparison.
\item Cluster-based Set Difference: We apply our session identification methodology, and perform session clustering by creating the distance matrix through the comparison of query cluster assignments with Jaccard Index.
\item Cluster-based Divergence: We apply our session identification methodology, and perform session clustering by creating the distance matrix through the comparison of query cluster assignment frequencies in each session with JS-Divergence.
\item Feature-based Set Difference: We apply our session identification methodology, and perform session clustering by creating the distance matrix through the comparison of feature sets extracted from each session with Jaccard Index.
\item Feature-based Divergence: We apply our session identification methodology, and perform session clustering by creating the distance matrix through the comparison of feature appearance frequencies in each session with JS-Divergence.
\end{itemize}
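As a quick reference, these runs instantiate the two standard distance measures below, where $C_i$ denotes the set of query cluster labels (or extracted features) observed in session $i$ and $P_i$ the corresponding relative frequency distribution:
\[
d_{\mathrm{Jaccard}}(C_i, C_j) = 1 - \frac{|C_i \cap C_j|}{|C_i \cup C_j|},
\qquad
d_{\mathrm{JS}}(P_i, P_j) = \tfrac{1}{2} D_{\mathrm{KL}}(P_i \parallel M) + \tfrac{1}{2} D_{\mathrm{KL}}(P_j \parallel M),
\]
where $M = \tfrac{1}{2}(P_i + P_j)$ and $D_{\mathrm{KL}}$ denotes the Kullback--Leibler divergence.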
\begin{figure}[h!]
\centering
\includegraphics[width=0.45\textwidth]{graphics/WorkloadPredictibility}
\vspace{-0.5cm}
\caption{Workload Prediction Accuracy Comparison}
\label{fig:averagesimilarity}
\end{figure}
%We proposed a robust session similarity measure which takes into the account the considerations and constraints that are imposed by the problem domain.
As can be seen in Figure~\ref{fig:averagesimilarity}, the output of the Cluster-based Set Difference session clustering method matches more than 90\% of the other partitions, consistently outperforming all the other methods. The Cluster-based Divergence method follows very closely in accuracy. Feature-based methods consistently perform worse than cluster-based methods in all three tests.
We believe that these experimental results support the dependability of this approach for mobile applications. Although the divergence-based methods produce results comparable to the set-difference-based methods, they take more time to compute and exhaust more system resources, since the number of features per session is larger than the number of cluster assignments. The relatively lower accuracy of the divergence-based methods is due to the fact that, even when the user performs the same activity, the time spent performing it varies. For instance, a Facebook user might open the app, refresh the feed once, and exit, or instead refresh every 5 seconds; the type of the session remains the same, but the feature and query cluster assignment frequencies in the session profile change.
This experiment also highlights the effectiveness of utilizing sessions instead of using only queries in workload exploration in mobile databases.
%As for outlier detection, the 25 sessions that have 25\% or less average similarity is a very moderate number that should be inspected to understand the structure of extraordinarily different sessions.
\subsubsection{Session Identification Comparison}
The methodology we propose in Section~\ref{sec:sessionidentifier} depends only on query time, and aims to capture the query activity for the period during which the user interacts with the phone. On the other hand, there is a body of work in the literature on session identification over web search query workloads that focuses on human-annotated queries in search engines~\cite{hagen2011query}. Our hypothesis is that these existing methodologies are inherently less accurate at detecting sessions in mobile query workloads, because mobile app query workloads differ from web query workloads. In this experiment, we measure the performance of these methods through workload predictability, partitioning the 3-week query workload into 3 equal partitions and comparing the session characteristics of each partition with the others.
The experiment results confirm our hypothesis, consistently showing comparable session characteristics across the partitions.
%Extracting meaningful patterns from user data is central to a reliable characterization of the smartphone database workload. We proposed a robust session similarity measure which takes into the account the considerations and constraints that are imposed by the problem domain. We believe that our experimental results support the dependability of this approach for mobile applications.
%We believe that a random session selection among the 5184 sessions can provide us with a representative query set of the workloads for all users since 90\% average similarity means the query set represents 90\% of all the sessions in the dataset. The average, minimum and maximum session lengths are given in Table~\ref{tab:sessionlength}.
\begin{figure}[h!]
\centering
\includegraphics[width=0.45\textwidth]{graphics/SessionIdentificationComp}
\vspace{-0.5cm}
\caption{Session Identification Impact Comparison}
\label{fig:sessionIdentification}
\end{figure}
Figure~\ref{fig:sessionIdentification} shows the workload prediction accuracy scores when each of the two session identification methods is paired with the Cluster-based Set Difference session clustering method and applied to the user workloads. Our method performs significantly better than the cascade method. This is strictly due to mobile app query workload characteristics differing significantly from the workloads the cascade method is tailored for. Since the cascade method is designed to place queries with lexical similarities in the same session, it consistently merges different bursts of activity into a single session whenever both activities include similar queries. In the example user's case, this corresponds to putting two separate notification-check actions that occur one hour apart into the same session.
%\question{Create the average similarity graph.}
%\question{We applied the idle time treshold specific for each user over the Facebook query workload as we indicated in the previous section in order to determine how many sessions there are. This operation revealed that there were 15820 sessions initiated in the dataset according to the session definition. Among these 15820 sessions, 5184 of them had 90\% or higher average similarity with all other sessions and there are 25 sessions that had 25\% or less average similarity with all other sessions as show in Figure~\ref{fig:averagesimilarity}.}
\subsubsection{Query Clustering Accuracy}
In this experiment, we aim to measure the placement performance of the clustering methodology described in Section~\ref{sec:profiler} over the queries in the query log.
For our experiments, we selected Facebook as our example app. For readability, Figure~\ref{fig:dendrogram} shows the clustering of the activities of only one user.
For this specific user, there are 8856 parsable queries in the log, 431 of which are unique.
We prepare ground truth cluster labels by manually inspecting all the unique queries within the user's query log for the Facebook app. The accuracy of the clustering result is measured by comparing the query-to-cluster placements against this ground truth.
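To make this accuracy measurement concrete, the following minimal sketch (illustrative names only, not our actual implementation) computes a purity-style placement accuracy: a query counts as accurately placed if its manually assigned ground-truth label is the majority label within its detected cluster.
\begin{verbatim}
import java.util.HashMap;
import java.util.Map;

// Purity-style accuracy: a query is "accurately placed" if its
// ground-truth label is the majority label of its detected cluster.
public class ClusterAccuracy {
  public static double purity(int[] cluster, int[] truth) {
    // count.get(c).get(t) = queries in detected cluster c
    //                       carrying ground-truth label t
    Map<Integer, Map<Integer, Integer>> count = new HashMap<>();
    for (int i = 0; i < cluster.length; i++) {
      count.computeIfAbsent(cluster[i], k -> new HashMap<>())
           .merge(truth[i], 1, Integer::sum);
    }
    int correct = 0;
    for (Map<Integer, Integer> perCluster : count.values()) {
      // credit the majority ground-truth label of each cluster
      correct += perCluster.values().stream()
                           .max(Integer::compare).orElse(0);
    }
    return (double) correct / cluster.length;
  }

  public static void main(String[] args) {
    int[] detected    = {0, 0, 1, 1, 1, 2};
    int[] groundTruth = {0, 0, 1, 1, 2, 2};
    System.out.printf("accuracy = %.2f%n",
                      purity(detected, groundTruth));
  }
}
\end{verbatim}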
\begin{table}[h!]
\centering
\caption{Clustering accuracy for a random user}
\label{tab:clusteringAccuracy}
\vspace{-0.2cm}
\begin{tabular}{cc}
~ & \textbf{Facebook} \\ \hline
\# of queries & 8856 \\
\# of unique queries & 419 \\
\# of detected clusters & 8 \\
\# of accurately placed queries & 8345 \\
\# of inaccurately placed queries & 511 \\
Accuracy & 94.2\% \\ \hline
\end{tabular}
\end{table}
\begin{figure}[h!]
\vspace{-1cm}
\centering
\includegraphics[width=0.5\textwidth]{graphics/dendrogram}
\vspace{-1.5cm}
\caption{Query Clustering Dendrogram of Facebook usage for a user}
\label{fig:dendrogram}
\vspace{-0.5cm}
\end{figure}
\begin{table}[h!]
\centering
\caption{Clusters extracted from a user's Facebook workload}
\label{tab:clusteringresult}
\vspace{-0.2cm}
\begin{tabular}{cc}
\hline
\textbf{Cluster} & \textbf{Explanation} \\ \hline
1 & Fill home feed \\
2 & Profile - Account check \\
3 & Cache operations \\
4 & Photo operations \\
5 & New notification check \\
6 & Prefetch and retrieve notification \\
7 & Consistency check \\
8 & Housekeeping \\ \hline
\end{tabular}
\end{table}
%Keep in mind that PocketData dataset is an anonymized dataset where most of the constant values are replaced with ``?'', which reduces the number of distinct queries greatly.
There are 8 different clusters of queries. In Table~\ref{tab:clusteringresult}, we list the activities performed by the queries shown in Figure~\ref{fig:dendrogram}, in order of appearance. As can be seen in the figure, filling the home feed forms a completely different query workload, since it needs to load a variety of information such as events, birthdays, posts, and pictures. It is nevertheless closer to profile and account checks than to the other clusters, since the account-checking activity overlaps with some of the information required for the home feed. Under the assumption that a profile visit is likely to be repeated, cache and photo operations form clusters close to each other. The most interesting cluster is the last one, which we call \textit{housekeeping}. It includes queries that are so distinctive from the rest of the workload that, if we allow our automatic clustering mechanism to use a cut-off point $h < 1$, each of the queries on the tail end forms its own cluster, because their distance from each other reduces the average silhouette score significantly. These queries are generally not strongly associated with any activity, and appear randomly in the query workload. When inspected, they appear to be issued by background tasks of the application.
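Since the automatic cut-off selection hinges on the average silhouette score, the sketch below (again illustrative, assuming a precomputed pairwise distance matrix and a candidate cluster assignment) shows one way a candidate dendrogram cut can be scored; the cut with the highest score would be retained.
\begin{verbatim}
// Average silhouette width for a candidate clustering, given a
// precomputed pairwise distance matrix (illustrative sketch only).
public class Silhouette {
  public static double averageSilhouette(double[][] dist,
                                         int[] assign, int k) {
    int n = assign.length;
    double total = 0.0;
    for (int i = 0; i < n; i++) {
      double[] sum = new double[k];
      int[] size = new int[k];
      for (int j = 0; j < n; j++) {
        if (j == i) continue;
        sum[assign[j]] += dist[i][j];
        size[assign[j]]++;
      }
      int own = assign[i];
      if (size[own] == 0) continue; // singleton: s(i) = 0 by convention
      double a = sum[own] / size[own];   // mean intra-cluster distance
      double b = Double.MAX_VALUE;       // nearest other cluster
      for (int c = 0; c < k; c++) {
        if (c != own && size[c] > 0) b = Math.min(b, sum[c] / size[c]);
      }
      total += (b - a) / Math.max(a, b);
    }
    return total / n;
  }

  public static void main(String[] args) {
    double[][] d = {{0.0, 0.1, 0.9, 0.8},
                    {0.1, 0.0, 0.85, 0.9},
                    {0.9, 0.85, 0.0, 0.2},
                    {0.8, 0.9, 0.2, 0.0}};
    System.out.println(averageSilhouette(d, new int[]{0, 0, 1, 1}, 2));
  }
}
\end{verbatim}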
%Also, for the n-gram approach, when we choose n to be 2, we created the clustering shown in Figure~\ref{fig:ngram}.
%\begin{figure}[h!]
% \centering
% \includegraphics[width=0.5\textwidth]{graphics/Ngram}
% \caption{N-Gram Clustering Dendrogram of Facebook usage for a user}
% \label{fig:ngram}
%\end{figure}
%In Table~\ref{tab:clusteringresultngram}, we provide the explanations for the queries according to the clusters they got appointed with n-gram feature extraction scheme.
%\begin{table}[h!]
%\centering
%\caption{N-Gram clustering results}
%\label{tab:clusteringresultngram}
%\begin{tabular}{cc}
%\hline
%\textbf{Cluster} & \textbf{Explanation} \\ \hline
%1 & Key-Value lookups \\
%2 & No filter or multiple row lookups \\
%3 & Lookup in a provided list \\
%4 & Complex queries \\
%5 & Top-k row queries \\ \hline
%\end{tabular}
%\end{table}
%We also created a tanglegram to show how similar clusterings these two methods created in Figure~\ref{fig:tanglegram}. As can be seen in the figure, there is little to no similarity between these clusterings which is not unexpected since the two feature extraction mechanisms completely have different strategies and targets different features.
%\question{Should we create a tanglegram like the one below with the automated clustering and the manual clustering?}
%We also created a tanglegram to show how the automated clusterings, and manual cluster assignments match in Figure~\ref{fig:tanglegram}.
%As can be seen in the figure, there is little to no similarity between these clusterings which is not unexpected since the two feature extraction mechanisms completely have different strategies and targets different features.
\subsubsection{Idle Time Tolerance}
This set of experiments was aimed at determining an idle time tolerance for each user.
We ran our user session segmentation routine described in Section~\ref{sec:sessionidentifier} for query logs of all users for all the query workload they created.
The idle time tolerances used were 0.1s, 1s, 5s, 10s, 20s, 30s, 1m, 5m, 10m in our experiments.
Based on the resulting session counts, we chose an idle time tolerance specific to each user for our further experiments, selecting the cut-off point beyond which a higher tolerance changes the session count only marginally.
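For clarity, the segmentation and threshold sweep can be sketched as follows (names are illustrative and the code is not our actual implementation): a new session begins whenever the gap between consecutive query timestamps exceeds the idle time tolerance, and the session count is recorded for each candidate tolerance.
\begin{verbatim}
import java.util.ArrayList;
import java.util.List;

// Idle-time based session segmentation: a new session starts whenever
// the gap between consecutive query timestamps exceeds the tolerance.
public class SessionSegmenter {
  // timestamps in milliseconds, sorted in ascending order
  public static List<List<Long>> segment(long[] ts, long toleranceMs) {
    List<List<Long>> sessions = new ArrayList<>();
    List<Long> current = new ArrayList<>();
    for (int i = 0; i < ts.length; i++) {
      if (!current.isEmpty() && ts[i] - ts[i - 1] > toleranceMs) {
        sessions.add(current);
        current = new ArrayList<>();
      }
      current.add(ts[i]);
    }
    if (!current.isEmpty()) sessions.add(current);
    return sessions;
  }

  public static void main(String[] args) {
    long[] ts = {0, 200, 900, 61_000, 61_300, 600_000};
    // candidate tolerances from the experiments: 0.1s .. 10m
    long[] candidates = {100, 1_000, 5_000, 10_000, 20_000,
                         30_000, 60_000, 300_000, 600_000};
    for (long tol : candidates) {
      System.out.println(tol + " ms -> "
                         + segment(ts, tol).size() + " sessions");
    }
  }
}
\end{verbatim}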
\begin{figure}[h!]
\centering
\includegraphics[width=0.45\textwidth]{graphics/SessionCountExample}
\vspace{-0.5cm}
\caption{Number of user sessions generated with varying idle time tolerances}
\label{fig:idletime}
\end{figure}
Figure~\ref{fig:idletime} shows the cut-off points selected for each user. Each line represents a user, and each point on a line corresponds to the number of sessions created at that specific idle time tolerance. The cut-off points represent an ideal idle time tolerance since they mark the point where selecting a higher cut-off only marginally changes the session count. From the figure, we can clearly see that users with higher session counts are assigned higher idle time tolerances, whereas users with lower session counts are less affected by the idle time tolerance selection, which results in smaller cut-off points. This is because their bursts of activity are clustered very close together in time relative to the higher cut-off points. We can intuitively infer that these users have more heterogeneous app usage over time.
%\subsubsection{Experiment 3 - Session Identification on Controlled Dataset}
%The controlled dataset is created by monitoring and labeling every activity performed on the phone. Hence, both applying the state of the art search log session identification technique described in Section~\ref{sec:backgroundSessionIdentification} and our heuristic. \note{Of course, this experiment here cannot be used to conclude our method is better than the cascade method. It is just an indicator that it is more suitable for mobile workloads.}
%The controlled dataset includes 13 labeled sessions. Our methodology identified 12 sessions, where one of the sessions was labeled to be the tail end of the previous session. Hagen \textit{et al.}~\cite{hagen2011query} methodology, on the other hand, identified only 3 sessions, where 2 of these sessions were correctly identified, but the one of the sessions identified covered 11 user labeled sessions. This result confirms that although the state of the art method for web search query session identification method is successful in the area it is developed for, it is not suitable for mobile phone database session identification.
\subsubsection{Activity Recognition}
%\question{Would it be useful if we show we can use the labeled activities to label sessions detected by our methods? We normally deleted this experiment since we deleted the controlled dataset.}
The aim of this experiment is to investigate whether the detected session clusters correspond to specific types of activities, using labeled data. We collected the queries produced by the Facebook app while performing common activities, such as scrolling down through the home feed, as described in Section~\ref{sec:controlledactivity}.
% one hour of Facebook query log where the user recorded every activity they performed freely on the phone with corresponding time data as described in Section~\ref{sec:controlleddataset}. We then collected the query output of a set of activities individually.
\begin{figure}[h!]
\centering
\includegraphics[width=0.45\textwidth]{graphics/ActivityRecognition}
\vspace{-0.5cm}
\caption{Activity recognition performance for different profiler methods}
\label{fig:activityRecognition}
\vspace{-0.5cm}
\end{figure}
Utilizing the session similarity methods described in Section~\ref{sec:sessionclustering}, we compared each labeled activity we collected against the detected sessions. We measured the average percentage of each activity's occurrences that fall into the session cluster that most frequently includes that activity. Figure~\ref{fig:activityRecognition} shows that both set-difference-based methods have comparable performance.
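As a rough illustration of this measurement (with hypothetical names, not our actual code), the sketch below computes, for each labeled activity, the fraction of its occurrences that land in its most frequent session cluster, and averages these fractions over all activities.
\begin{verbatim}
import java.util.HashMap;
import java.util.Map;

// For each labeled activity, find the session cluster that contains it
// most often and take the fraction of its occurrences in that cluster;
// the final score averages these fractions over all activities.
public class ActivityRecognitionScore {
  public static double score(String[] activity, int[] sessionCluster) {
    Map<String, Map<Integer, Integer>> counts = new HashMap<>();
    Map<String, Integer> totals = new HashMap<>();
    for (int i = 0; i < activity.length; i++) {
      counts.computeIfAbsent(activity[i], k -> new HashMap<>())
            .merge(sessionCluster[i], 1, Integer::sum);
      totals.merge(activity[i], 1, Integer::sum);
    }
    double sum = 0.0;
    for (String a : counts.keySet()) {
      int best = counts.get(a).values().stream()
                       .max(Integer::compare).orElse(0);
      sum += (double) best / totals.get(a);
    }
    return sum / counts.size();
  }

  public static void main(String[] args) {
    String[] acts = {"scroll_feed", "scroll_feed",
                     "open_profile", "open_profile", "open_profile"};
    int[] clusters = {1, 1, 2, 2, 1};
    System.out.printf("average recognition = %.2f%n",
                      score(acts, clusters));
  }
}
\end{verbatim}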