Feature extraction figures added

master
Gokhan Kul 2017-10-05 21:40:31 -04:00
parent 583302f116
commit 1ab905a8b2
10 changed files with 61 additions and 26 deletions

BIN
graphics/entropy.png Normal file

Binary file not shown.

After

Width:  |  Height:  |  Size: 22 KiB

BIN
graphics/featurebased.png Normal file

Binary file not shown.

After

Width:  |  Height:  |  Size: 17 KiB

View File

@ -245,3 +245,14 @@
Year = {2011},
Bdsk-Url-1 = {http://doi.acm.org/10.1145/2063576.2063602},
Bdsk-Url-2 = {http://dx.doi.org/10.1145/2063576.2063602}}
@article{kullback1951information,
title={On information and sufficiency},
author={Kullback, Solomon and Leibler, Richard A.},
journal={The Annals of Mathematical Statistics},
volume={22},
number={1},
pages={79--86},
year={1951},
publisher={JSTOR}
}

View File

@ -67,7 +67,6 @@ These cluster labels then serve as the basis for the task-similarity metric.
%(3) give more accurate recommendations to the user, and
%(4) explore the data flow improvement opportunities within the app.
%\todo{Need to clarify Point 3 below. Gokhan: I rephrased it but I really think we shouldn't make it any longer in this paragraph.}
Concretely, in this paper we:
(1) Identify the challenges posed by mobile app workloads for query log summarization,
(2) Propose a hierarchical, task-oriented summary format for mobile app workloads,
@ -80,6 +79,6 @@ Concretely, in this paper we:
%Note that although we motivate for creating a benchmark in this paper, developing the data and query emulation step is not in the scope of this work.
This paper is organized as follows. We first describe the moving parts of the system in Section~\ref{sec:systemoutline}.
Then, we give a detailed description of our framework and our methods in Section~\ref{sec:method}.
Then, we give a detailed description of our framework and our methods in Section~\ref{sec:buildingblocks}.
In Section~\ref{sec:experiments}, we introduce a sample dataset for workload characterization, and we evaluate our proposed techniques using this dataset.
Finally, we conclude in Section~\ref{sec:conclusion}, and identify the steps needed to deploy our methods into practice in Section~\ref{sec:futurework}.

View File

@ -75,7 +75,7 @@ The effects of these characteristics are threefold:
\end{figure}
\subsection{Session Identification}
\label{sec:backgroundSessionIdentification}
In traditional databases, a \textit{session} is defined as a connection between a user or an application and the database~\cite{oracle9i}. Basically, every session has (1) an owner, which is a user or an application, (2) a start time, when the owner connects to the database, and (3) an end time, when the owner disconnects from it.
This basic understanding of a session is not always sufficient for specific purposes such as prefetching predicted queries, or automatic query suggestion to the user. Yao \textit{et al.}~\cite{huang2006} define a session as a sequence of queries issued to the database by a user or an application to achieve a \emph{certain task}, adopting some of the methods from their previous work on web logs~\cite{huang2004dynamic}.

View File

@ -30,12 +30,15 @@ A logical task, called an \emph{activity}, performed by a user on a smartphone,
After partitioning the query log into \emph{sessions}, we create a statistical summary of \emph{similar} activities by providing the frequency of each detected pattern. This can be done in two ways: (1) a supervised approach, and (2) an unsupervised approach.
\subsection{Supervised Approach}
\label{sec:supervised}
We define a set of atomic activities that can be performed in an application, and look for them every time a user uses the phone. The frequency with which these activities appear, together or individually, in these \emph{sessions} provides an outlook of how a user utilizes an application. The process is illustrated in Figure~\ref{fig:supervised}.
\todo{Go on. Describe the figure.}
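As a rough illustration of the counting step, the following sketch assumes each session arrives as a list of detected activity labels and that the atomic activities are predefined; the names are hypothetical, not the system's actual interface.

from collections import Counter

# Hypothetical set of predefined atomic activities for an app.
ATOMIC_ACTIVITIES = {"open_profile", "scroll_feed", "post_status", "send_message"}

def activity_frequencies(sessions):
    """Count how often each predefined activity appears across sessions."""
    counts = Counter()
    for session in sessions:  # each session: a list of detected activity labels
        counts.update(label for label in session if label in ATOMIC_ACTIVITIES)
    return counts

# Example: two sessions yield a per-activity outlook of app usage.
sessions = [["open_profile", "scroll_feed", "scroll_feed"],
            ["send_message", "scroll_feed"]]
print(activity_frequencies(sessions))
# Counter({'scroll_feed': 3, 'open_profile': 1, 'send_message': 1})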
\subsection{Unsupervised Approach}
\label{sec:unsupervised}
\todo{Define this.}

View File

@ -46,7 +46,7 @@ Media Storage & 13,592,982 (ten
\subsubsection{Dataset 2 - Controlled Dataset}
\label{controlleddataset}
\label{sec:controlleddataset}
We collected this dataset in a controlled environment where one user's activities over one hour were recorded, along with the time and label of each activity the user performed. To create this dataset, we selected the \emph{Facebook} application simply because it actively utilizes the mobile database, and we could define descriptive activities such as opening a profile page, scrolling down on the home feed, and so on.
@ -55,7 +55,7 @@ During the data collection period, the user was free to use any other applic
This dataset contains 60 minutes of queries that the Facebook application issued to the database. There are \todo{about 2000} parsable\footnote{by JSqlParser} queries recorded, each labeled with the activity it was associated with, its timing information, and lastly the user-labeled session it belongs to, which was roughly described to the participant as \textit{``A session in an application is the period between when a user starts to dedicate their attention to the application and when they lose interest.''} Note that this description is intentionally ambiguous; it aims to lead the experiment participant to form their own understanding of a session.
\subsubsection{Dataset 3 - Controlled Activity}
\label{controlledactivity}
\label{sec:controlledactivity}
This dataset contains queries that are generated when the user interacts with the mobile application for a specific purpose. While collecting these activity logs, every time the user completes an operation, the set of queries issued is labeled with that activity.
@ -90,15 +90,15 @@ For instance, on Facebook application, the user clicks on another profile on the
The core output of the system presented in this paper is an expected query workload, created from the user's past activity without any supervision.
However, it is very difficult to answer the question of \textit{``what is the accuracy of such an output?''} without providing a comparison.
Therefore, we compare our session identification methodology with the state of the art of web query session identification, and also measure the accuracy of our session similarity method, we designed a supervised approach given in \todo{Section~\ref{sec:supervisedapproach}} to be able to show the unsupervised approach presented has comparable results.
Therefore, we compare our session identification methodology with the state of the art in web query session identification. In addition, to measure the accuracy of our session similarity method, we designed a supervised approach, given in Section~\ref{sec:supervised}, to show that the unsupervised approach presented here has comparable results.
To achieve this, we designed the following experiments as building blocks to verify the accuracy and consistency of the methods that we use in each step.
\subsubsection{Experiment 1 - Clustering Accuracy}
In this experiment, we aim to measure the placement performance of unsupervised clustering methodology described in \todo{Section~\ref{sec:clustering}} over all of the queries in the query log.
In this experiment, we aim to measure the placement performance of the clustering methodology described in Section~\ref{sec:profiler} over all of the queries in the query log.
We prepare ground truth cluster labels by manually inspecting all the unique queries within the query log for a \note{specific application}. The accuracy of the clustering result is measured by comparing the query placements to the clusters to the ground truth.
We prepare ground truth cluster labels by manually inspecting all the unique queries within the query log for the Facebook app. The accuracy of the clustering result is measured by comparing the placement of queries into clusters against the ground truth.
\begin{table}[h!]
\centering
@ -115,8 +115,6 @@ We prepare ground truth cluster labels by manually inspecting all the unique que
\end{tabular}
\end{table}
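The paper does not name the accuracy measure used for this comparison; one label-permutation-invariant choice is the Adjusted Rand Index, sketched below on hypothetical labels.

from sklearn.metrics import adjusted_rand_score

# Hypothetical ground-truth labels from manual inspection, and the cluster
# IDs assigned to the same queries by the clustering step.
ground_truth = [0, 0, 1, 1, 2, 2]
predicted    = [1, 1, 0, 0, 2, 2]

# ARI is invariant to cluster renaming: 1.0 means identical groupings.
print(adjusted_rand_score(ground_truth, predicted))  # 1.0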
\todo{Should we talk about the TKDE submission on the clustering quality of similarity methods used for clustering?}
For our experiments, we selected Facebook as our example app. For visual purposes, we clustered the activities of only one user in Figure~\ref{fig:dendrogram}.
For this specific user, there are 84,273 rows of activities in the log.
Among them are 8,795 parsable select queries; however, only 59 of those queries are unique.
@ -187,7 +185,7 @@ We also created a tanglegram to show how the automated clusterings, and manual c
\subsubsection{Experiment 2 - Idle Time Tolerance}
This set of experiments was directed towards determining an idle time tolerance for each user.
We ran our user session segmentation routine \todo{described in Section~\ref{sec:idleTimeTolerance}} for query logs of all users for all the query workload they created.
We ran our user session segmentation routine, described in Section~\ref{sec:sessionidentifier}, on the query logs of all users, over the entire query workload they created.
The idle time tolerances used were 10 ms, 100 ms, 1 s, 5 s, 10 s, 1 min, 2 min, and 5 min.
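As a minimal sketch of such a segmentation routine, assume the log is a time-ordered list of (timestamp, query) pairs; the function and representation below are illustrative, not the paper's implementation.

def segment_sessions(log, idle_tolerance_s):
    """Split a time-ordered query log into sessions, starting a new session
    whenever the gap between consecutive queries exceeds the tolerance."""
    sessions, current, last_t = [], [], None
    for t, query in log:  # log: [(timestamp_in_seconds, query_string), ...]
        if last_t is not None and t - last_t > idle_tolerance_s:
            sessions.append(current)
            current = []
        current.append((t, query))
        last_t = t
    if current:
        sessions.append(current)
    return sessions

# Example: with a 5 s tolerance, the 60 s gap splits the log into two sessions.
log = [(0.0, "q1"), (1.2, "q2"), (61.2, "q3")]
print(len(segment_sessions(log, 5)))  # 2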
@ -204,7 +202,7 @@ So, we chose an idle time specific to each user for our further experiments.
\subsubsection{Experiment 3 - Session Identification on Controlled Dataset}
The controlled dataset is created by monitoring and labeling every activity performed on the phone. Hence, both applying the state of the art search log session identification technique described in \todo{Section~\ref{sec:cascade}} and our heuristic. \note{Of course, this experiment here cannot be used to conclude our method is better than the cascade method. It is just an indicator that it is more suitable for mobile workloads.}
The controlled dataset is created by monitoring and labeling every activity performed on the phone. Hence, we can apply both the state-of-the-art search log session identification technique described in Section~\ref{sec:backgroundSessionIdentification} and our heuristic, and compare their results against the labeled ground truth. \note{Of course, this experiment here cannot be used to conclude our method is better than the cascade method. It is just an indicator that it is more suitable for mobile workloads.}
\todo{Go on here, add results...}

View File

@ -28,15 +28,30 @@ Makiyama \textit{et al.}~\cite{makiyama2015text} put forward the most similar wo
The \emph{profiles} used to compare activities and sessions can be constructed in three ways: (1) feature-based sets, which can be compared with the Jaccard Index, (2) KL-Divergence entropy using feature appearance frequencies, and (3) the Jaccard Index using the cluster assignments of queries.
\tinysection{Feature based sets with Jaccard Index}
\todo{Describe the process}
\tinysection{Feature-set-based Jaccard Index} The Jaccard similarity coefficient is a statistic used for comparing the similarity and diversity of sample sets: it measures the similarity between finite sample sets, and is defined as the size of the intersection divided by the size of the union of the sample sets. For two user sessions $S_i$ and $S_j$, we compare the feature sets of these sessions, which are extracted from their constituent queries.
\tinysection{KL-Divergence entropy}
$$J(S_i,S_j) = \frac{|S_i\cap S_j|}{|S_i\cup S_j|}$$
%Sessions over time may show various characteristics over time, and it is important to identify sessions that deviate from the other sessions to be able to generate new workloads.
%The generated workloads should reflect the normal behavior of users, but they should still include realistic deviations in order to create a pragmatic workload that emulate a user.
If $S_i$ and $S_j$ are both empty, we define $J(S_i,S_j) = 1$. This convention also guarantees that $0\leq J(S_i,S_j) \leq 1$.
Each query $Q^{t_i}_u$ is processed with the methodology given above, and denoted as:
\begin{figure}[h!]
\centering
\includegraphics[width=0.5\textwidth]{graphics/featurebased.png}
\caption{Creating a feature-set-based session profile}
\label{fig:featureBasedSessionProfile}
\end{figure}
%We calculate $J(S_i,S_j)$ for all pairs of user sessions.
%Now, we start to look for "interesting" user sessions. One notion of of user sessions being interesting can be that their contents occur in the query log more frequently. A high Jaccard similarity score for a pair of user sessions can be interpreted as them being similar to each other, thereby leading the contents to occur more frequently. For a particular user session $w_i$, we would be now looking out for the top K user sessions which are most similar with $w_i$. Calculating the average similarity of $w_i$ with the most similar K sessions $[w_1,w_2..., w_k]$ yields a notion of the importance of $w_i$ in representing the characteristics of the workload represented in the query log. We denote this average similarity for $w_i$ with top K windows as $J_{w_i{avg}}$.
%$$J_{wi_{avg}}= \frac{\sum_{j=1}^{k} J(w_i,w_j)}{k}$$
%When we calculate $J_{w_i{avg}}$ for all $w_1,w_2...,w_m$, we obtain a vector of average similarity scores for the entire query log for a user.
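As a minimal sketch of this comparison, assume each session profile is the set of features extracted from its queries; the feature names below are hypothetical.

def jaccard(s_i, s_j):
    """Jaccard index of two feature sets, with J = 1 when both are empty."""
    if not s_i and not s_j:
        return 1.0
    return len(s_i & s_j) / len(s_i | s_j)

# Hypothetical feature sets extracted from the queries of two sessions.
s1 = {"SELECT", "table:messages", "pred:thread_id"}
s2 = {"SELECT", "table:messages", "pred:user_id"}
print(jaccard(s1, s2))  # 2 shared features / 4 total = 0.5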
\tinysection{KL-Divergence entropy} Each query $Q^{t}_u$ is processed with the methodology given above, and denoted as:
\begin{equation}
Q^{t}_u = ( f^{t}_1(c_1), f^{t}_2(c_2), \ldots , f^{t}_m(c_m) )
@ -44,6 +59,9 @@ Q^{t}_u = ( f^{t}_1(c_0), f_2^{t}(c_1), ... , f_m^{t}(c_n) )
where $t$ is the timestamp at which the query $Q$ was issued, $u$ represents the username of the query owner, $f_i$ is an extracted feature, and $c_i$ denotes how many times the feature $f_i$ was observed in the query.
%Sessions over time may show various characteristics over time, and it is important to identify sessions that deviate from the other sessions to be able to generate new workloads.
%The generated workloads should reflect the normal behavior of users, but they should still include realistic deviations in order to create a pragmatic workload that emulate a user.
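For illustration, such a vector can be held as a feature-to-count mapping; the features shown are hypothetical stand-ins for whatever the extraction step emits.

from collections import Counter

# Hypothetical feature counts for one query Q^t_u: each key is a feature f_i,
# each value the count c_i of its occurrences in the query.
q = Counter({"SELECT": 1, "table:messages": 1, "pred:thread_id": 2})
print(q["pred:thread_id"])  # 2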
A session $S$ is represented by a user $u \in U$, where $U$ is the set of all users, a time period $T$ that starts at $t_0$ and lasts for $\Delta t$, and the set of queries $Q$ performed by $u$ within $T$. Formally,
\begin{equation}
@ -64,9 +82,10 @@ A session distribution $\phi$ is formally denoted as:
where $P(f_i)^{T}_u$ represents the probability of encountering feature $f_i$ within the timeframe $T$ among all the operations performed by user $u$.
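As a small sketch of building such a distribution, assume each query in the session is already a feature Counter as above; normalizing the summed counts gives the per-feature probabilities.

from collections import Counter

def session_distribution(queries):
    """Aggregate per-query feature counts and normalize to probabilities."""
    total = Counter()
    for q in queries:  # each q: Counter mapping feature -> count
        total.update(q)
    n = sum(total.values())
    return {f: c / n for f, c in total.items()}

qs = [Counter({"SELECT": 1, "table:messages": 1}),
      Counter({"SELECT": 1, "pred:thread_id": 2})]
print(session_distribution(qs))
# {'SELECT': 0.4, 'table:messages': 0.2, 'pred:thread_id': 0.4}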
We compute the difference between distributions with KL- Divergence \todo{find the KL-Divergence reference}
%~\cite{fuglede2004jensen}
. Comparison of a with the other sessions using KL-Divergence gives the difference denoted as follows:
We compute the difference between distributions with KL-Divergence~\cite{kullback1951information}.
%~\cite{fuglede2004jensen}.
Comparing a session with the other sessions using KL-Divergence gives the difference, denoted as follows:
%\begin{equation}
%d^{T_1}_u (\phi^{T_1}_u || \phi^{T_2}_u) = \frac{1}{2} KL(\phi^{T_1}_u || \phi^{T_2}_u) + KL(\phi^{T_2}_u || \phi^{T_1}_u)
@ -82,10 +101,16 @@ KL(\phi^{T_1}_u || \phi^{T_2}_u) = \sum_i \phi^{T_1}_u(i) log_2 \frac{\phi^{T_1
Note that when $P(i) \neq 0$ and $Q(i) = 0$, $D_{KL}(P||Q)=\infty$. For example, suppose we have two distributions $P$ and $Q$ as follows: $P = \{ f_0: 3/10, f_1: 4/10, f_2: 2/10, f_3: 1/10 \}$ and $Q = \{ f_0: 3/10, f_1: 3/10, f_2: 3/10, f_4: 1/10 \}$. In this case, since $f_3$ is not a part of $Q$, the result would be $\infty$, which means these two distributions are completely different.
\textbf{Smoothing.} To get past this problem, we can apply \textit{smoothing} (i.e., Laplace/additive smoothing), which is essentially adding a small constant $epsilon$ to the distribution, to handle zero values, without significantly impacting the distribution. After we apply smoothing, $D_{KL}(P||Q)$ becomes $1.38$.
To get past this problem, we can apply \textit{smoothing} (i.e., Laplace/additive smoothing), which essentially adds a small constant $\epsilon$ to the distribution to handle zero values without significantly impacting the distribution. After we apply smoothing, $D_{KL}(P||Q)$ becomes $1.38$.
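The following sketch reproduces the smoothed computation on the example distributions above; the paper does not state the constant used, but $\epsilon = 10^{-5}$ yields the reported $1.38$.

from math import log2

P = {"f0": 0.3, "f1": 0.4, "f2": 0.2, "f3": 0.1}
Q = {"f0": 0.3, "f1": 0.3, "f2": 0.3, "f4": 0.1}

def kl_smoothed(p, q, eps=1e-5):
    """KL divergence with additive smoothing over the union of supports."""
    support = set(p) | set(q)
    # Add eps to every feature's mass, then renormalize both distributions.
    zp = sum(p.get(f, 0.0) + eps for f in support)
    zq = sum(q.get(f, 0.0) + eps for f in support)
    return sum((p.get(f, 0.0) + eps) / zp
               * log2(((p.get(f, 0.0) + eps) / zp) / ((q.get(f, 0.0) + eps) / zq))
               for f in support)

print(round(kl_smoothed(P, Q), 2))  # 1.38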
\begin{figure}[h!]
\centering
\includegraphics[width=0.5\textwidth]{graphics/entropy.png}
\caption{Creating a probability-distribution-based session profile}
\label{fig:entropySessionProfile}
\end{figure}
\tinysection{Jaccard Index using the clustering of queries} In this strategy, the profiler takes the preprocessed clustering appointments of queries.
\tinysection{Jaccard Index using the clustering of queries} In this strategy, the profiler takes the preprocessed cluster assignments of queries as input instead of the features.
Clustering the queries in the workload narrows down the space of possible patterns that could be detected. This facilitates easier and more accurate understanding of the workload~\cite{pavlo2017self}. The main goal of this step is to group queries into classes that exhibit similar interests over database attributes. We consider two queries to exhibit similar interests over database attributes if they are similar in semantic structure. In the clustering process, we first filter the queries belonging to the app of our interest without distinguishing which user the activity belongs to. Then, we create clusters using all the queries belonging to that specific app. The workflow for the PocketData dataset is illustrated in Figure~\ref{fig:clusteringWorkflow}.
@ -111,4 +136,6 @@ In this form of session profiling, we embed query cluster appointments for all t
\todo{Show how to calculate the session similarity.}
\todo{Add a representative figure.}
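Pending the session-similarity calculation the to-do above calls for, one plausible sketch: profile each session as the set of cluster IDs assigned to its queries, then reuse the Jaccard index from before. The representation is an assumption, not the final design.

def cluster_profile(session_queries, cluster_of):
    """Profile a session as the set of cluster IDs of its queries."""
    return {cluster_of[q] for q in session_queries}

# Hypothetical cluster assignments produced by the preprocessing step.
cluster_of = {"q1": 0, "q2": 0, "q3": 3, "q4": 7}
p1 = cluster_profile(["q1", "q2", "q3"], cluster_of)  # {0, 3}
p2 = cluster_profile(["q2", "q4"], cluster_of)        # {0, 7}
print(len(p1 & p2) / len(p1 | p2))  # Jaccard = 1/3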

View File

@ -1 +0,0 @@
\todo{Define how the analyzer works for both supervised and unsupervised approaches.}

View File

@ -1,2 +0,0 @@
Sessions are essentially a bag of queries