completed except part 4.2

master
Gokhan Kul 2016-12-19 00:16:18 -05:00
parent 6dc7c5793b
commit 1d56ad0bdb
7 changed files with 67 additions and 20 deletions

BIN
graphics/allsessions.pdf Normal file



In fact, there are four key characteristics about data~\cite{ORCL-BIGDATA-519135}:
\begin{itemize}
\item \textbf{Volume.} The most intuitive characteristic about data is the amount of data itself. In fact, it is the sheer amount of data that we generate and process these days that calls for a better approach to data management. It is one of the driving forces behind this work.
\item \textbf{Velocity.} Velocity refers to the amount of data flowing through an interface per unit time.
\item \textbf{Variety.} Traditional data formats were relatively well defined by a data schema and changed slowly. In contrast, modern data formats change at a dizzying rate. This is referred to as variety in data.
\item \textbf{Value.} The value of data varies significantly. The challenge in drawing insights from data is identifying what is valuable and then transforming and extracting the data for analysis.
\end{itemize}
Database servers and web applications experience a workload that is not typical in smartphones.
In most cases, they store the business data to support OLTP and OLAP operations.
The data volumes may range from medium to high amounts for systems with a large concurrent user base.
The data velocity also grows in proportion to the number of concurrent users.
The usage pattern of databases in smartphones differs significantly from the above mentioned ideas.
Most modern day smartphones rely on some kind of a web service to help a mobile application deliver the desired functionality to the user.
This introduces various new application design considerations.
Mobile users must be able to work without a network connection due to poor or nonexistent connectivity.
In that case, a mobile database serves as a cache to hold recently accessed data and transactions so that they are not lost due to connection failure.
In many cases, users might not expect live data during connection failures; only recently modified data.
Update of recent changes and downloading of live data can be deferred until connection is restored.
Mobile computing devices tend to have slower CPUs and limited battery life.
The luxury of having a cluster of powerful computers to deploy a database is just not there.
Also, the fact that battery power is scarce drives the case further to achieve high resource utilization.
It is a common practice among smartphone users to have multiple devices.
Most smartphones have an authentication system powered by an email account that is also accessed on other devices.
This leads to occasional synchronization activities that occur between different devices.
Often, this activity happens in the background so that the user is not blocked from using other functions on the smartphone.
The PocketData dataset \cite{kennedy2015pocket} provides handset-based data directly collected from smartphones for multiple users.
User sessions in the context of smartphones might not be similar to sessions on more traditional computing devices like PCs.
Typically, an end user uses their smartphone in multiple small intervals of time throughout the day.
These `bursts of activity' can be referred to as sessions.
In the context of a single mobile application, the user would perform multiple logical transactions in these bursts of activity.
Intuitively, a user session is quite straightforward to understand but its technical aspects require defining.
Some smartphone usage studies define a session as the time period where the smartphone's screen is active~\cite{soikkeli2011diversity}.
Smartphone usage is dominated by usage of the applications that the smartphone has to offer.
Thus, the idea of a smartphone usage session can be reduced to an application usage session.
This is relevant to us because we aim to study the interaction with the smartphone database through understanding a single application.

View File

Soikkeli \textit{et al.}~\cite{soikkeli2011diversity} argue that even the time between launch and close of an application is not a reliable notion of an application usage session. Applications running in the foreground are visible to the user. Applications running in the background are not visible to the user even though the user might have launched them before.
An individual session now consists of two parameters: a start time and an end time. Two user sessions can be back to back or might have an idle time in between them. User sessions for a single application can now be modeled as a time-wise closely-spaced series of queries issued to the smartphone database. A threshold value $T$ is defined for the idle time between two transactions. If the idle time is less than or equal to $T$, the transactions belong to the same user session. This approach enables us to identify a time window which can be applied to the transactions in the PocketData dataset. This time window would contain a chronologically ordered subset of queries issued to the smartphone database.
Suppose the query log for a user consists of a series of chronologically ordered queries $Q = [q_1,q_2,...,q_n]$, and let $f$ be the function which converts this series into windows using the above mentioned logic.
$$f(q_1,q_2,...q_n) \rightarrow [w_1,w_2,...,w_m]$$
where $[w_1,w_2,...,w_m]$ is the series of user sessions that are obtained. Each $w_i$ consists of a chronologically ordered bag of queries $Q_{w_i}$. Also, $\bigcup_1^m Q_{w_i} = Q$ and $Q_{w_i} \cap Q_{w_j}=\varnothing$ $\forall\;i,j \in [1,m]$ and $i \neq j$.
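As an illustrative sketch (in Python rather than the Java/R toolchain used in our experiments, and assuming queries are reduced to their timestamps in seconds), the windowing function $f$ can be written as:

```python
def segment_sessions(timestamps, idle_threshold):
    """Split chronologically ordered query timestamps (in seconds) into
    user sessions: a new session starts whenever the idle time between
    two consecutive queries exceeds idle_threshold."""
    if not timestamps:
        return []
    sessions = [[timestamps[0]]]
    for prev, cur in zip(timestamps, timestamps[1:]):
        if cur - prev <= idle_threshold:
            sessions[-1].append(cur)   # same burst of activity
        else:
            sessions.append([cur])     # idle gap exceeds threshold: new session
    return sessions

# Three queries in one burst, then two more after a long idle gap (threshold = 5 min).
print(segment_sessions([0, 1, 2, 400, 401], 300))  # → [[0, 1, 2], [400, 401]]
```

Every query lands in exactly one window, matching the union and disjointness conditions above.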
There are some peculiarities of the query logs that must be considered in order to design the methodology of working with them. Any user activity on a smartphone app consists of a sequence of multiple asynchronous operations. For example, a user might want to refresh the Facebook feed updates from time to time. The user might perceive this as a single repeating activity performed multiple times in a day, but the app performs multiple transactions during each ``burst'' of the same activity. Given the asynchronous nature of most smartphone applications, the relative order in which these queries are issued is not fixed. This is also reflected in the query logs. User sessions might be similar to each other in terms of intent. The intent is reflected by the query cluster that a particular query belongs to. But we cannot test for similarity among these sessions by searching for common subsequences. It is highly probable that a group of queries might be issued as part of the same logical task but appear to be interleaved in the query log. Each user session can be treated as a bag of queries. Hence, we need to use a similarity measure which works on the basis of membership for a particular bag. Jaccard similarity is a simple measure that meets the above mentioned requirements.
The Jaccard similarity coefficient is a statistic used for comparing the similarity and diversity of sample sets. The Jaccard coefficient measures similarity between finite sample sets, and is defined as the size of the intersection divided by the size of the union of the sample sets. For two user sessions $w_i$ and $w_j$, we compare the membership of the clusters that the constituent queries belong to.
$$J(w_i,w_j) = \frac{\left|w_i\cap w_j\right|}{\left|w_i\cup w_j\right|}$$
If $w_i$ and $w_j$ are both empty, we define $J(w_i,w_j) = 1$. Also, $0\leq J(w_i,w_j) \leq 1$.
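A minimal sketch of this measure, assuming each session is represented by the cluster labels of its constituent queries and treating the bag as a set for simplicity:

```python
def jaccard(wi, wj):
    """Jaccard similarity between two sessions, each given as a set of
    query-cluster labels; two empty sessions are defined to be identical."""
    wi, wj = set(wi), set(wj)
    if not wi and not wj:
        return 1.0  # J(empty, empty) = 1 by definition
    return len(wi & wj) / len(wi | wj)

print(jaccard({1, 2, 3}, {2, 3, 4}))  # → 0.5 (intersection size 2, union size 4)
```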
We calculate $J(w_i,w_j)$ for all pairs of user sessions. Now, we start to look for ``interesting'' user sessions. One notion of user sessions being interesting can be that their contents occur in the query log more frequently. A high Jaccard similarity score for a pair of user sessions can be interpreted as them being similar to each other, thereby leading the contents to occur more frequently. For a particular user session $w_i$, we now look for the top $k$ user sessions which are most similar to $w_i$. Calculating the average similarity of $w_i$ with the most similar $k$ sessions $[w_1,w_2,...,w_k]$ yields a notion of the importance of $w_i$ in representing the characteristics of the workload represented in the query log. We denote this average similarity for $w_i$ with the top $k$ windows as $J_{w_i}^{avg}$.
$$J_{w_i}^{avg}= \frac{\sum_{j=1}^{k} J(w_i,w_j)}{k}$$
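Assuming the pairwise Jaccard scores for $w_i$ have already been collected into a list (excluding the self-comparison), this average can be sketched as:

```python
def top_k_avg_similarity(similarities, k):
    """Average of the k largest Jaccard scores between one session and
    all other sessions; `similarities` excludes the self-comparison."""
    top = sorted(similarities, reverse=True)[:k]
    return sum(top) / k

# With k = 2, the two most similar neighbours (0.9 and 0.8) are averaged.
print(top_k_avg_similarity([0.9, 0.1, 0.8, 0.3], k=2))  # → 0.85
```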


In our experiments regarding clustering, we used an Apple Macbook Pro with macOS.
In our experiments regarding pattern matching, we used a Lenovo Thinkpad with Windows 10, 2.3 GHz Intel Core i5 processor, 8GB RAM, Java 1.8 SE Runtime Environment and R v3.3.1.
%We will provide timing results of our experiments in the next phase of our project.
\subsection{Clustering}
\subsection{Session Identification}
The first set of experiments was directed towards coming up with an idle time tolerance. We ran our user session segmentation routine for query logs of 3 users for the timeline of a month. The idle time tolerances used were 10ms, 100ms, 1s, 5s, 10s, 1min, 2min and 5min. We noticed that the number of user sessions generated converged to around 10000 for 2 min and 5 min of idle time tolerance, as can be seen in Figure~\ref{fig:idletime}. So, we chose an idle time tolerance of 5 minutes for further experiments.
\begin{figure}[h!]
\centering
\caption{Number of user sessions generated with varying idle time tolerances}
\label{fig:idletime}
\end{figure}
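The sweep over tolerances can be reproduced in miniature; the log below is synthetic (the PocketData timestamps are not reproduced here), and the helper name is illustrative:

```python
def count_sessions(timestamps, idle_threshold):
    """Number of user sessions obtained when consecutive queries more than
    idle_threshold seconds apart are placed in different sessions."""
    if not timestamps:
        return 0
    return 1 + sum(1 for a, b in zip(timestamps, timestamps[1:])
                   if b - a > idle_threshold)

# Synthetic log: three bursts of queries roughly 10 minutes apart.
log = [0, 1, 2, 600, 601, 1200, 1201, 1202]
for tol in (0.01, 1, 60, 300, 600):   # 10 ms, 1 s, 1 min, 5 min, 10 min
    print(tol, count_sessions(log, tol))
# The count converges to 3 for every tolerance from 1 s up to 5 min.
```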
While the idea of a low idle time tolerance might be enticing because it enables us to look into the query log at a more granular level, we decided that a larger tolerance was more suitable. When the number of user sessions became too high, neighboring sessions started to become very similar to each other. This might lead to a myopic view of the data. Also, it is reasonable to hypothesize that the general usage pattern of smartphones is in bursts. The user would pick up the smartphone for a few minutes, perform a bunch of tasks and then put it away. During these bursts of activity, the high similarity among smaller user sessions could arise because, if a user is checking the Facebook feed for 5 seconds, it is highly probable that they will keep doing so for many seconds afterwards. However, we are able to deal with this myopic view with larger idle time tolerances. Also, higher idle time tolerances lead to a lower number of user session windows. The time complexity of the similarity calculation is $O(n^2)$. Higher idle time tolerances therefore fit the general usage patterns as well as reduce the computational complexity of calculating the average similarity vector.
\subsection{Common patterns}
Extracting meaningful patterns from user data is central to a reliable characterization of the smartphone database workload. We proposed a robust similarity measure which takes into account the considerations and constraints imposed by the problem domain. We believe that our experimental results support the dependability of this approach for mobile applications.
We applied the idle time threshold of \todo{X milliseconds} over the Facebook query set for a month of usage for 11 users, as we indicated in the previous section, in order to determine how many sessions there are. This operation revealed that there were 15820 sessions initiated in the dataset according to the session definition. Among these 15820 sessions, 5184 had 90\% or higher average similarity with all other sessions, and 25 sessions had 25\% or less average similarity with all other sessions, as shown in Figure~\ref{fig:averagesimilarity}.
\begin{figure}[h!]
\centering
\includegraphics[width=0.5\textwidth]{graphics/allsessions}
\caption{Number of sessions separated by their average similarity}
\label{fig:averagesimilarity}
\end{figure}
We believe that a random session selection among the 5184 sessions can provide us with a representative query set of the workloads for all users since 90\% average similarity means the query set represents 90\% of all the sessions in the dataset. The average, minimum and maximum session lengths are given in Table~\ref{tab:sessionlength}.
\begin{table*}[t]
\centering
\caption{Session length}
\label{tab:sessionlength}
\begin{tabular}{|l|c|c|c|}
\hline
 & Average Session Length & Minimum Session Length & Maximum Session Length \\ \hline
Sessions with 90\% or higher similarity & \todo{X} & \todo{Y} & \todo{Z} \\ \hline
All sessions & \todo{A} & \todo{B} & \todo{C} \\ \hline
\end{tabular}
\end{table*}
As for outlier detection, the 25 sessions that have 25\% or less average similarity are few enough to be inspected individually in order to understand the structure of these extraordinarily different sessions.
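A minimal sketch of this selection step, assuming the per-session average similarities have already been computed (the cutoff values come from the discussion above; the function name is illustrative):

```python
def split_by_avg_similarity(avg_sims, rep_cut=0.90, outlier_cut=0.25):
    """Partition session indices into representative candidates (average
    similarity >= rep_cut) and outlier candidates (<= outlier_cut)."""
    reps = [i for i, s in enumerate(avg_sims) if s >= rep_cut]
    outliers = [i for i, s in enumerate(avg_sims) if s <= outlier_cut]
    return reps, outliers

print(split_by_avg_similarity([0.95, 0.50, 0.10, 0.92]))  # → ([0, 3], [2])
```

Sessions in the first group are candidates for random selection as workload representatives; the second group is small enough to inspect by hand.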


The focus of this paper is to identify common behaviors and interesting patterns in user activities on mobile databases with the hypothesis that making use of this information can open up a lot of opportunities for mobile apps.
Identifying the common behaviors is essential for creating a \textit{small data} benchmark which compares the performance of mobile database management systems under different workloads.
Another usage scenario of this information is to find bugs and unnecessary function calls by identifying repeated queries on data that has not changed since the last read.
To achieve this, we analyze the PocketData dataset, which consists of SQL queries posed by different applications for 11 users over a month.
We utilize different query similarity methods to identify important features of these queries, form feature vectors out of them, and cluster similar queries together by their feature vectors with hierarchical clustering.
Finally, we discuss how we can make use of these clusters; we assign an integer label to each cluster, and whenever there is an incoming query from a user, we identify which cluster the new query belongs to.
These labels sequentially create a stream of integers for that specific user, in which we explore interesting and repeating patterns.


This paper represents the first steps for developing a \textit{small data} benchmarking tool for apps running on mobile devices. We plan several extensions as future work.
First, our efforts, until now, focused on how users access data instead of their complete utilization of database system capabilities: inserts, updates, and deletions along with select statements.
This will require us to extend the current query comparison methods, since they do not currently support these statement types.
We will continue to expand the scope of our analysis through understanding more statement types and their effects on the query load.
Second, the PocketData dataset contains the time each query took to execute.
We have not used this measure in our analysis yet.
It could prove to be a valuable aid in uncovering further characteristics of the data.
Finally, we will investigate how to automatically emulate a workload utilizing the findings in this paper.
This step is essential for creating a benchmarking tool.
This includes focusing on both emulating the queries and generating data for the mobile application we are evaluating.