New paper structure

master
Gokhan Kul 2017-10-04 22:00:40 -04:00
parent 896f76cb41
commit 6f709f3aae
26 changed files with 10248 additions and 26 deletions

1995
graphics/Session.eps Normal file

File diff suppressed because it is too large

BIN
graphics/Session.graffle Normal file

4854
graphics/Supervised.eps Normal file

File diff suppressed because it is too large


3108
graphics/Unsupervised.eps Normal file

File diff suppressed because it is too large


@ -44,27 +44,33 @@ Gokhan Kul, Gourab Mitra, Oliver Kennedy, Lukasz Ziarek\\
\section{System Outline}
\label{sec:systemoutline}
\input{sections/2-systemoutline.tex}
%\input{sections/2-systemoutline.tex}
\input{sections/3-systemoverview.tex}
\section{Methodology}
\label{sec:method} % \label{} allows reference to this section
\input{sections/3-method.tex}
\subsection{Clustering}
\label{sec:clustering} % \label{} allows reference to this section
\input{sections/3a-clustering.tex}
%\section{Methodology}
%\label{sec:method} % \label{} allows reference to this section
%\input{sections/3-method.tex}
\section{Building Blocks}
\label{sec:buildingblocks}
\input{sections/4-buildingBlocks.tex}
%\subsection{Clustering}
%\label{sec:clustering} % \label{} allows reference to this section
%\input{sections/3a-clustering.tex}
%\subsection{Session Identification}
%\label{sec:session} % \label{} allows reference to this section
%\input{sections/3c-sessionidentification.tex}
\subsection{Pattern Matching}
\label{sec:pattern} % \label{} allows reference to this section
\input{sections/3b-patternmatching.tex}
%\subsection{Pattern Matching}
%\label{sec:pattern} % \label{} allows reference to this section
%\input{sections/3b-patternmatching.tex}
\subsection{Resource Utilization}
\label{sec:resource} % \label{} allows reference to this section
\input{sections/3d-resourceutilization.tex}
%\subsection{Resource Utilization}
%\label{sec:resource} % \label{} allows reference to this section
%\input{sections/3d-resourceutilization.tex}
\section{Experiments}
\label{sec:experiments} % \label{} allows reference to this section


@ -74,7 +74,7 @@ The effects of these characteristics are threefold:
\label{fig:sampleFacebook}
\end{figure}
\subsection{Session identification}
\subsection{Session Identification}
In traditional databases, a \textit{session} is defined as a connection between a user or an application and the database~\cite{oracle9i}. Basically, every session has (1) a user or an application as the owner, (2) a start time at which the user connects to the database, and (3) an end time at which the user disconnects from the database.
@ -90,19 +90,28 @@ This approach is not suitable for mobiles systems since the database user sessio
\tinysection{Timeout based approach} This approach is based on identifying a time-out value that is ideal for the given scenario to detect session boundaries. The queries that are issued between two boundaries belong to the same session.
\note{Our approach is not explored in the way that we did, but we can say that the closest approach to ours is timeout based approach. We will briefly discuss how the cited papers differs from our approach.}
%\note{Our approach is not explored in the way that we did, but we can say that the closest approach to ours is timeout based approach. We will briefly discuss how the cited papers differs from our approach.}
\tinysection{Semantic segmentation based methods} This approach focuses on the content of the queries in order to understand context changes in the query workload~\cite{jones2008, huang2006, hagen2011}. The assumption is that if two queries are semantically close to each other, they should be placed in the same session, and if there is a shift in the query interest, there should be a session boundary between these queries. Naturally, this raises the question of \textit{what makes two queries similar?} The research on this question is driven by various motivations, such as database performance optimization~\cite{aouiche2006}, workload exploration~\cite{makiyama2015text}, and security applications~\cite{kul2016ettu}.
Yao \textit{et al.}~\cite{huang2006} report that sessions can be identified by studying the change in information entropy. They use a language model characterized by an order parameter and a threshold parameter. The order parameter determines the granularity of the n-gram model used to break the query log into smaller sequences of queries. The conditional probability of occurrence of those sequences is calculated from training data consisting of queries whose sessions were identified beforehand. Using these probabilities, a running count of entropy is calculated for each session. The threshold parameter determines the entropy value at which a session cannot accommodate any more new queries; such queries form part of another session and contribute to its entropy. If a sequence of queries has been observed to occur close together before, its entropy value will be low. This indicates the presence of some kind of link between the queries and hence supports the case for placing them in the same session. However, when a completely unrelated query is considered for inclusion in a particular session, the entropy value of the session rises, and the system places the query in a different session. This approach is not dependent on time intervals. Even though this approach is very intuitive, it falls short of addressing specific issues in smartphone database sessions. As already described, there is no clear way to identify the start and end of a session. As a result, it is not possible to obtain a training set of queries with sessions already labelled. Hence, this approach does not work for session identification in our case.
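The gist of this entropy-driven segmentation can be sketched as follows. This is a minimal illustration rather than the cited implementation: the query labels, bigram probabilities, and threshold are invented for the example, and a real system would learn the n-gram model from a labelled training log.

```python
import math

# Hypothetical bigram probabilities P(next | prev) that would come from a
# training log with known session boundaries (all values invented here).
bigram_prob = {
    ("open_inbox", "fetch_headers"): 0.8,
    ("fetch_headers", "fetch_body"): 0.7,
    ("fetch_body", "post_status"): 0.05,
}
DEFAULT_PROB = 0.01   # fallback for transitions never seen in training
THRESHOLD = 5.0       # entropy budget per session (invented value)

def segment(queries):
    """Close the current session once accumulated surprisal exceeds THRESHOLD."""
    sessions, current, entropy = [], [queries[0]], 0.0
    for prev, nxt in zip(queries, queries[1:]):
        p = bigram_prob.get((prev, nxt), DEFAULT_PROB)
        entropy += -math.log2(p)      # surprisal of this transition
        if entropy > THRESHOLD:       # unrelated query: start a new session
            sessions.append(current)
            current, entropy = [], 0.0
        current.append(nxt)
    sessions.append(current)
    return sessions
```

A low-probability transition (here, `fetch_body` followed by `post_status`) contributes a large surprisal and pushes the running entropy past the threshold, opening a new session.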
Hagen \textit{et al.}~\cite{hagen2011query} present an interesting approach to session identification.
Hagen \textit{et al.}~\cite{hagen2011query} present an interesting approach to session identification. \todo{Gourab}
%Although semantic segmentation of the queries performs very well for web log session identification, it may not always be practical in a database setting, both in traditional database systems, and in mobile databases. In traditional database systems, task oriented sessions where the user focuses on a single task can be identified with this approach, but when the user issues queries with different interests, this approach would fail to identify a sequence of queries that could be classified as a session. Similarly, in a mobile application, an activity can create a series of wide range of queries, which renders this approach inapplicable to identify sessions for a mobile application.
\subsection{Session similarity}
\subsection{Session Similarity}
Once the database user sessions have been identified, we need to identify sessions which perform similar logical tasks. Session similarity could be used to identify them.
Once the database user sessions have been identified, we need to identify sessions which perform similar logical activities.
A session can include one or more activities, and activities consist of a bag of queries as illustrated in Figure~\ref{fig:session}.
Exploring the similarities of sessions could be used to identify repeating patterns.
\begin{figure}[h!]
\centering
\includegraphics[width=0.45\textwidth]{graphics/Session}
\caption{Session - Activity - Query relationship}
\label{fig:session}
\end{figure}
Aligon \textit{et al.}~\cite{aligon2014similarity} report that there are four approaches in the literature for computing session similarity: (1) the edit-based approach, (2) the subsequence-based approach, (3) the log-based approach, and (4) the alignment-based approach.

88
sections/3-systemoverview.tex Executable file

@ -0,0 +1,88 @@
In order to create a representative sample of the workload, we need to identify patterns in the query log. The task of identifying patterns directly from the log is difficult. A naive approach would be to consider individual queries as atomic components of the query log and cluster them by directly extracting features. The most important difference between a smartphone database workload and that of database servers is the users' bursts of activity. Most benchmarks, like TPC-C, focus on emulating homogeneous query workloads of an OLTP system. Their goal is to analyze throughput for these homogeneous workloads. However, it is not possible to truly emulate smartphone query workloads without emulating the intermittent bursts of query activity. These bursts can only be detected by looking at chronological attributes like the query timestamp and the query interarrival time. The naive approach does not consider the chronological ordering of the queries; hence, this naive clustering is not meaningful. Another level of abstraction is needed to extract meaningful patterns from the query log.
\begin{figure*}[h!]
\captionsetup[subfigure]{justification=centering}
\centering
\begin{subfigure}[b]{0.9\textwidth}%{0.49\textwidth}
\centering
\includegraphics[width=\textwidth]{graphics/Supervised}
\caption{Supervised approach}
\label{fig:supervised}
\end{subfigure}
\\
\begin{subfigure}[b]{0.9\textwidth}%{0.49\textwidth}
\centering
\includegraphics[width=\textwidth]{graphics/Unsupervised}
\caption{Unsupervised approach}
\label{fig:unsupervised}
\end{subfigure}
\\
\caption{Abstract views of inputs and outputs with both (a) Supervised and (b) Unsupervised approaches}
\label{fig:approaches}
\end{figure*}
We introduce the concept of sessions to solve this problem. A user would interact with their smartphone multiple times a day for small intervals of time. These bursts of intermittent activity are captured in database user sessions. These form the atomic units of the query log which can be used for detecting patterns.
A logical task, called an \emph{activity}, performed by a user on a smartphone, such as checking for new email, might produce multiple queries to the database. Since smartphone applications keep switching between foreground and background, these queries could be arbitrarily spaced out in time. Hence, one database user session might contain one or more logical user tasks, and a logical user task might be spread across multiple database user sessions. These sessions are useful for capturing the subset of logical tasks which are repetitive. Since there is no discrete indicator of the start and end of a database user session in smartphones, we use a heuristic to define one: if two queries in a log are further apart in time than a threshold, we consider them to be part of different sessions.
After partitioning the query log into \emph{sessions}, we create a statistical summary of \emph{similar} activities by providing frequencies of each pattern detected. This could be done in two ways: (1) Supervised approach, and (2) Unsupervised approach.
\subsection{Supervised Approach}
We define a set of atomic activities that can be performed on an application, and look for them every time a user uses the phone. The frequency of the activities that appear together or individually in these \emph{sessions} provides a picture of how a user utilizes an application. The process is illustrated in Figure~\ref{fig:supervised}.
\todo{Go on. Describe the figure.}
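The counting step of the supervised approach can be sketched as follows; the activity signatures and query cluster labels are hypothetical placeholders, since the actual activity definitions are application-specific.

```python
from collections import Counter

# Hypothetical activity signatures: each activity is the set of query
# cluster labels it is expected to produce (labels invented here).
ACTIVITIES = {
    "check_email":  {"select_headers", "select_body"},
    "post_message": {"insert_message", "update_thread"},
}

def activity_frequencies(sessions):
    """Count how often each predefined activity appears across sessions,
    where a session is the set of query cluster labels observed in it."""
    counts = Counter()
    for session in sessions:
        for name, signature in ACTIVITIES.items():
            if signature <= session:   # every query of the activity occurred
                counts[name] += 1
    return counts
```

The resulting frequency table is the statistical summary of the workload: how often each known activity occurs, alone or together with others, per session.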
\subsection{Unsupervised Approach}
\todo{Define this.}
This process is illustrated in Figure~\ref{fig:unsupervised}.
\todo{Go on. Describe the figure.}
%% I COMMENTED OUT THE REST OF THE SECTION. WE WILL INTRODUCE CLUSTERING LATER - GOKHAN %%
%Even though we have now obtained an atomic unit of a query log for the purpose of detecting patterns, the problem of poor clustering owing to directly extracting features still remains to be addressed. We introduce another level of abstraction to segregate queries based on their interest over database attributes. This is determined by similarity in their semantic structure. Since most of the queries issued to a smartphone database are made through ORMs, they contain ingerent structure. Each query can belong to one of many query clusters. This makes the detection of patterns easier.
%To summarize, we perform a two step preprocessing : Clustering and Session Identification. Our system operates in three stages: (1)~A \textbf{clustering} phase where the queries in the log are segregated by similarity in their semantic structure, (2)~A \textbf{session identification} phase where the query logs are partitioned based on the timing of queries to identify the length and coverage of individual user sessions, and (3)~A \textbf{session parameter detection} phase where the partitioned sessions are inspected in order to establish associations of what kinds of activities are performed together in a typical session, as well as detection of unusual activities. These stages are illustrated in Figure~\ref{fig:abstractview}.
%\begin{figure}[h!]
% \centering
% \includegraphics[width=0.5\textwidth]{graphics/systemoutline}
% \caption{Abstract view of inputs and outputs. (a) Clustering (b) Session Identification}
% \label{fig:abstractview}
%\end{figure}
%\subsection{Clustering phase}
%The query workload created by an Android application consists of queries generated by user activity. Most of the applications support several functionalities, and these functionalities generate queries with conditions based on the user needs.
%In this phase, we aim to distinguish these functionalities based on the variety of structural differences in queries. We observe and process every SQL query issued to the mobile DBMS. We extract the relevant features of the query considering what kind of attributes the query is accessing with that particular query. Our work follows the basic SQL query feature extraction principles of Makiyama \textit{et al.}~\cite{makiyama2015text}. Using these query features, we construct a feature vector for each query to cluster similar queries together. We use hierarchical clustering since there is not a specific number of clusters we need. This step provides us with the information required to see frequency of different kinds of queries a user issues.
%\subsection{Session identification phase}
%The usage pattern of databases in smartphones differs significantly from the continuous patterns experienced in database servers~\cite{kennedy2015pocket} or more traditional computing devices like PCs.
%Typically, an end user would use their smartphones multiple times a day for small intervals of time. This pattern of intermittent bursts of activity motivates a new way of looking at these usage patterns. Identification of patterns would require breaking down the workload into logical units which can be compared for similarity. These logical units can be formed by ``slicing'' the chronologically ordered queries in the workload. Every unit represents a task that the user performs in a burst of activity. Such tasks are comprised of a bag of queries. We refer to these logical units as database sessions.
%\note{To Gourab: the rest of the section is moved to the methodology section.}
%\subsection{Session parameter detection phase}
%Another aspect of generating a query workload is to be able to imitate the length and timings of the sessions created by the users as well as the deviations from this behavior. In the session parameter detection phase, we identify the user interests via query features created by the user, determine the session lengths and query counts in a session, and compute the difference between sessions in order to capture the unusual activities.
%We treat sessions as entities that aim to perform a bag of tasks. To automatically generate synthetic workloads, we extract the minimum and maximum session lengths, typical query counts, and the features encountered in each session.
%\todo{Gourab: I suggest moving out implementation details from Para 2 to another section}
%\todo{Talk about what is novel and what has already been done before.}


32
sections/4-buildingBlocks.tex Executable file

@ -0,0 +1,32 @@
%We propose a heuristic to analyze query logs and find out interesting patterns. This process has three steps:
%\begin{enumerate}
% \item Figuring out the activities of interest with respect to the application
% \item Clustering the queries
% \begin{enumerate}
% \item Extracting features
% \item Query comparison
% \item Clustering with different strategies
% \end{enumerate}
% \item Detecting patterns in user activity
% \begin{enumerate}
% \item Appoint an integer label to each cluster
% \item Identify which cluster a new-coming query belongs to
% \item Identify patterns with different strategies
% \end{enumerate}
%\end{enumerate}
%The strategies for applying these steps are given in this section.
\subsection{Session Identifier}
\label{sec:sessionidentifier}
\input{sections/4a-sessionidentification.tex}
\subsection{Profiler}
\label{sec:profiler}
\input{sections/4b-profiler.tex}
\subsection{Analyzer (Evaluation Engine)}
\label{sec:analyzer}
\input{sections/4c-analyzer.tex}


@ -0,0 +1,21 @@
In our framework, a \textit{database session} is a logical unit of user interaction. It spans a period of time and comprises sequential queries. If two sequential queries are more than \textit{t} seconds apart, we consider them to be in different sessions. The parameter \textit{t} is called the Idle Time Tolerance.
%In our framework, a \textit{database session} on a smartphone is a time period in which the user's activity makes the application issue sequential queries with a period of at most \textit{t} seconds between them.
%We identify the approximate the time \textit{t} for each user to find what constitutes of a session for each user.
Large values of \textit{t} would create sessions which span longer amounts of time. These bloated sessions would capture multiple tasks, which would reduce the granularity of subsequent processing. On the other hand, very small values of \textit{t} would create sessions which span minuscule amounts of time and contain very few queries. Such tiny sessions might not capture complete tasks.
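The idle-time-tolerance heuristic itself is straightforward; a minimal sketch, assuming timestamps are given in seconds:

```python
def split_sessions(timestamps, t):
    """Partition a chronologically ordered list of query timestamps
    (seconds) into sessions: a gap larger than t starts a new session."""
    sessions, current = [], [timestamps[0]]
    for prev, cur in zip(timestamps, timestamps[1:]):
        if cur - prev > t:
            sessions.append(current)
            current = []
        current.append(cur)
    sessions.append(current)
    return sessions
```

For example, with a tolerance of 10 seconds, the log `[0, 1, 2, 100, 101]` splits into two sessions at the 98-second gap.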
We iterate over different idle time tolerances \textit{t} and look at the corresponding number of sessions that are obtained. The optimal value of \textit{t} is obtained by locating the knee, or trade-off point: at non-optimal values of \textit{t}, gaining a small amount in one quantity would require an unfavorably large change in the other.
%We incrementally iterate different idle time tolerances \textit{t}, and determine the ideal \textit{t} when incrementing it starts not to affect the number of sessions identified.
%We harvest the features extracted for each query in a session, and create a bag of features.
%We measure the behavior difference between sessions with Jensen-Shannon divergence~\cite{fuglede2004jensen} which compares two given probability distributions.
%Therefore, we are able to determine distinctive session characteristics, as well as repeating ones.
Soikkeli \textit{et al.}~\cite{soikkeli2011diversity} note that even the time between launch and close of an application is not a reliable notion of an application usage session. Applications running in the foreground are visible to the user; applications running in the background are not visible to the user, even though the user might have launched them before.
An individual session now consists of two parameters: a start time and an end time. Two user sessions can be back to back, or might have idle time in between them. User sessions for a single application can be modeled as a time-wise closely-spaced series of queries issued to the smartphone database. A threshold value \textit{T} is defined for the idle time between two queries. If the idle time is less than or equal to \textit{T}, the queries belong to the same user session. This session contains a chronologically ordered subset of the queries issued to the smartphone database.
Since the number and content of these sessions depend on the idle time parameter \textit{T}, it is important to identify the best idle time. While the idea of a low idle time tolerance might be enticing because it enables us to look into the query log at a more granular level, we decided that the following approach was more suitable. When the number of user sessions becomes too high, neighboring sessions start to become very similar to each other, which might lead to a myopic view of the data. Also, it is reasonable to hypothesize that the general usage pattern of smartphones is in bursts: the user picks up the smartphone for a few minutes, performs a handful of tasks, and then puts it away. During these bursts of activity, the high similarity among smaller user sessions could stem from the fact that if a user is checking the Facebook feed for 5 seconds, it is highly probable that they will keep doing so for many more seconds. However, we are able to deal with this myopic view with larger idle time tolerances. Furthermore, higher idle time tolerances lead to a lower number of user session windows; since the time complexity of the similarity calculation is $O(n^2)$ in the number of sessions, higher idle time tolerances fit the general usage patterns as well as reduce the computational cost of calculating the average similarity vector.
By iteratively going through different idle time parameter values, we choose a trade-off point from a plot of idle time parameter values against the number of resulting sessions~\cite{satopaa2011finding}. This method can implicitly adapt to workloads with different characteristics since the trade-off point is chosen by running through the actual query log.
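Locating the trade-off point can be sketched with a simple farthest-from-the-diagonal rule in the spirit of the cited knee-detection method; this is an illustrative simplification, not the exact algorithm:

```python
def knee_point(ts, counts):
    """Return the idle time tolerance at the knee of the
    (tolerance, session count) curve: the point farthest from the
    straight line joining the curve's endpoints, after normalizing
    both axes to [0, 1]."""
    x0, x1 = ts[0], ts[-1]
    y0, y1 = counts[0], counts[-1]
    best_t, best_d = ts[0], -1.0
    for t, c in zip(ts, counts):
        xn = (t - x0) / (x1 - x0)      # normalized tolerance
        yn = (c - y0) / (y1 - y0)      # normalized session count
        d = abs(yn - xn)               # distance from the diagonal
        if d > best_d:
            best_t, best_d = t, d
    return best_t
```

On a curve whose session count drops steeply and then flattens, the point where the drop levels off maximizes the distance from the diagonal and is picked as the tolerance.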

114
sections/4b-profiler.tex Executable file

@ -0,0 +1,114 @@
In order to be able to perform a similarity assessment between queries, activities, and sessions, we need to be able to extract features out of SQL queries.
%, compare them and compute pairwise similarity between them.
Extracting features from a SQL query can be done in many ways. Let's consider the following queries:
\begin{verbatim}
Q1: SELECT username FROM user WHERE rank = "admin"
Q2: SELECT rank, count(*) FROM user
WHERE rank <> "admin" GROUP BY rank
\end{verbatim}
These two queries share many attributes and seem to operate on similar concepts, although they do not perform semantically very similar tasks. Usually, what we consider important in a query can roughly be listed as \textit{selection}, \textit{joins}, \textit{group-by}, \textit{projection}, and \textit{order-by}.
Makiyama \textit{et al.}~\cite{makiyama2015text} present the work most similar to ours. They perform query log analysis with the motivation of analyzing the workload on the system, and they provide a set of experiments on the Sloan Digital Sky Survey (SDSS) dataset. They extract the terms in \textit{selection}, \textit{joins}, \textit{projection}, \textit{from}, \textit{group-by}, and \textit{order-by} items separately, and create a query vector out of the appearance frequencies for each query in the dataset. They compute the pairwise similarity of queries with cosine similarity.
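A simplified sketch of this style of clause-wise feature extraction and cosine similarity, applied to the two example queries above (regex-based for brevity; a real implementation would use a SQL parser):

```python
import re
from math import sqrt

CLAUSES = ("SELECT", "FROM", "WHERE", "GROUP BY", "ORDER BY")

def features(sql):
    """Crude clause-tagged term counts (regex-based, illustration only)."""
    feats = {}
    for clause in CLAUSES:
        pattern = clause + r"\s+(.*?)(?=\s+(?:FROM|WHERE|GROUP BY|ORDER BY)\s|$)"
        m = re.search(pattern, sql, re.IGNORECASE | re.DOTALL)
        if m:
            for term in re.split(r'[\s,()<>=!"]+', m.group(1)):
                if term and term.upper() not in ("AND", "OR"):
                    key = clause.lower() + ":" + term.lower()
                    feats[key] = feats.get(key, 0) + 1
    return feats

def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(v * b.get(k, 0) for k, v in a.items())
    na = sqrt(sum(v * v for v in a.values()))
    nb = sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0
```

Tagging each term with its clause keeps `rank` in a projection distinct from `rank` in a selection predicate, which is what makes Q1 and Q2 similar but not identical under cosine similarity.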
\begin{figure}[h!]
\centering
\includegraphics[width=0.5\textwidth]{graphics/clustering}
\caption{The workflow for clustering process}
\label{fig:clusteringWorkflow}
\end{figure}
%To clarify the ambiguity between distance and similarity terms, we define distance as follows:
%$$distance = 1 - similarity$$
%where the similarity is the score we get from the methods explained above.
The \emph{profiles} used to compare activities and sessions can be constructed in three ways: (1) feature-based sets which can be compared with the Jaccard Index, (2) KL-Divergence entropy using feature appearance frequencies, and (3) the Jaccard Index using the cluster assignments of queries.
\tinysection{Feature based sets with Jaccard Index}
\todo{Describe the process}
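A minimal sketch of this strategy, assuming a session profile is the union of its queries' feature sets:

```python
def session_profile(queries):
    """A session's profile: the union of its queries' feature sets."""
    profile = set()
    for q in queries:
        profile |= q
    return profile

def jaccard(a, b):
    """Jaccard index of two profiles: |intersection| / |union|."""
    return len(a & b) / len(a | b) if (a or b) else 1.0
```

Two sessions whose profiles overlap heavily score close to 1, while disjoint profiles score 0; the same function applies unchanged when profiles are sets of cluster labels instead of features.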
\tinysection{KL-Divergence entropy}
%Sessions over time may show various characteristics over time, and it is important to identify sessions that deviate from the other sessions to be able to generate new workloads.
%The generated workloads should reflect the normal behavior of users, but they should still include realistic deviations in order to create a pragmatic workload that emulate a user.
Each query $Q^{t}_u$ is processed with the methodology given above, and denoted as:
\begin{equation}
Q^{t}_u = ( f^{t}_1(c_1), f^{t}_2(c_2), \ldots , f^{t}_n(c_n) )
\end{equation}
where $t$ is the timestamp at which the query $Q$ was issued, $u$ represents the username of the query owner, $f_i$ is an extracted feature, and $c_i$ denotes how many times the feature $f_i$ was observed in the query.
A session $S$ is represented by a user $u \in U$, where $U$ is the set of all users, for the time period $T$ that starts from $t_0$ and goes on for $\Delta t$, and the set of queries $Q$ performed by $u$ within $T$. Formally,
\begin{equation}
S^T_u = ( Q^{t_0}_u, Q^{t_1}_u, ... , Q^{t_n}_u )
\end{equation}
where $Q^{t_i}_u$ represents a query $Q_{t_i}$ issued at time $t_i$ by user $u$.
The \textit{session profiles} are created with the accumulation of these features from the beginning to the end of the session.
Using the appearance frequency of these features, we calculate the appearance probability of each harvested feature.
This multinomial probability distribution of the features for each session constitutes the \textit{session distribution}.
A session distribution $\phi$ is formally denoted as:
\begin{equation}
\phi^T_u = ( P(f_0)^{T}_u, P(f_1)^{T}_u, ... , P(f_n)^{T}_u )
\end{equation}
where $P(f_i)^{T}_u$ represents the probability of encountering feature $f_i$ within the timeframe $T$ among all the operations performed by user $u$.
We compute the difference between distributions with KL-Divergence \todo{find the KL-Divergence reference}.
%~\cite{fuglede2004jensen}
Comparing a session with the other sessions using KL-Divergence gives the difference denoted as follows:
%\begin{equation}
%d^{T_1}_u (\phi^{T_1}_u || \phi^{T_2}_u) = \frac{1}{2} KL(\phi^{T_1}_u || \phi^{T_2}_u) + KL(\phi^{T_2}_u || \phi^{T_1}_u)
%\end{equation}
%where
\begin{equation}
KL(\phi^{T_1}_u || \phi^{T_2}_u) = \sum_i \phi^{T_1}_u(i) \log_2 \frac{\phi^{T_1}_u(i)}{\phi^{T_2}_u(i)}
\end{equation}
\textbf{KL-Divergence} is used for comparing two probability distributions, $P$ and $Q$, and it ranges between 0 and $\infty$. $D_{KL}(P||Q)$ essentially represents the information loss when the distribution $Q$ is used to approximate $P$.
Note that when $P(i) \neq 0$ and $Q(i) = 0$, $D_{KL}(P||Q)=\infty$. For example, suppose we have two distributions $P$ and $Q$ as follows: $P = \{ f_0: 3/10, f_1: 4/10, f_2: 2/10, f_3: 1/10 \}$ and $Q = \{ f_0: 3/10, f_1: 3/10, f_2: 3/10, f_4: 1/10 \}$. In this case, since $f_3$ is not a part of $Q$, the result would be $\infty$, which means these two distributions are completely different.
\textbf{Smoothing.} To get past this problem, we can apply \textit{smoothing} (i.e., Laplace/additive smoothing), which essentially adds a small constant $\epsilon$ to the distribution to handle zero values without significantly impacting the distribution. After we apply smoothing, $D_{KL}(P||Q)$ becomes approximately $1.38$.
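A sketch of smoothing followed by KL-Divergence; assuming $\epsilon = 10^{-5}$ (the exact constant is not specified above), the example distributions $P$ and $Q$ indeed give $D_{KL}(P||Q) \approx 1.38$:

```python
import math

def smooth(counts, feature_space, eps=1e-5):
    """Additive (Laplace) smoothing over a shared feature space,
    followed by renormalization to a probability distribution."""
    total = sum(counts.get(f, 0) + eps for f in feature_space)
    return {f: (counts.get(f, 0) + eps) / total for f in feature_space}

def kl_divergence(p, q):
    """D_KL(P || Q) in bits; finite because smoothing keeps q(i) > 0."""
    return sum(pi * math.log2(pi / q[f]) for f, pi in p.items() if pi > 0)
```

Smoothing is applied over the union of both feature spaces, so a feature missing from one distribution gets probability $\approx \epsilon$ instead of zero, and the divergence stays finite.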
\tinysection{Jaccard Index using the clustering of queries} In this strategy, the profiler takes the preprocessed cluster assignments of queries as input.
Clustering the queries in the workload narrows down the space of possible patterns that could be detected. This facilitates easier and more accurate understanding of the workload~\cite{pavlo2017self}. The main goal of this step is to group queries into classes that exhibit similar interests over database attributes. We consider two queries to exhibit similar interests over database attributes if they are similar in semantic structure. In the clustering process, we first filter the queries belonging to the app of our interest without distinguishing which user the activity belongs to. Then, we create clusters using all the queries belonging to that specific app. The workflow for the PocketData dataset is illustrated in Figure~\ref{fig:clusteringWorkflow}.
We use hierarchical clustering
%in our experiments,
which takes the distance matrix as input, and outputs a dendrogram -- a tree structure which shows how each query can be grouped together.
Furthermore, a dendrogram is a convenient way to visualize the relationship between queries and how each query is grouped in the clustering process.
To clarify the ambiguity between distance and similarity terms, we define distance as follows:
$$distance = 1 - similarity$$
where the similarity is the score we get from the methods explained above.
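A naive single-linkage sketch of this step (in practice a library routine would build the full dendrogram; this only illustrates the distance matrix input and cutting at a distance threshold):

```python
def single_linkage(dist, threshold):
    """Naive agglomerative clustering over a distance matrix
    (distance = 1 - similarity): repeatedly merge the closest pair of
    clusters until the smallest inter-cluster distance exceeds the cut
    threshold; returns the clusters as sets of query indices."""
    clusters = [{i} for i in range(len(dist))]
    while len(clusters) > 1:
        best = None   # (distance, index of cluster a, index of cluster b)
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                d = min(dist[i][j] for i in clusters[a] for j in clusters[b])
                if best is None or d < best[0]:
                    best = (d, a, b)
        if best[0] > threshold:
            break
        _, a, b = best
        clusters[a] |= clusters[b]
        del clusters[b]
    return clusters
```

Because no target cluster count is fixed in advance, the cut threshold on the dendrogram plays the role that $k$ would play in partitional methods.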
\begin{figure}[h!]
\centering
\includegraphics[width=0.5\textwidth]{graphics/systemoutline}
\caption{Session representations with query clustering}
\label{fig:clusterSession}
\end{figure}
In this form of session profiling, we embed the query cluster assignments for all the queries within the session into the \emph{session profile}, which is used to define the session for the rest of the process. An illustration of how the sessions are represented is given in Figure~\ref{fig:clusterSession}.
\todo{Show how to calculate the session similarity.}

1
sections/4c-analyzer.tex Executable file

@ -0,0 +1 @@
\todo{Define how the analyzer works for both supervised and unsupervised approaches.}

2
sections/sessionSimilarity.tex Executable file

@ -0,0 +1,2 @@
Sessions are essentially a bag of queries