Cleaning up intro

master
Oliver Kennedy 2017-11-02 17:35:31 -04:00
parent 973e54a635
commit 75d11fbd0f
3 changed files with 23 additions and 16 deletions

@@ -5,6 +5,8 @@
\input{preamble}
\usepackage{balance} % for \balance command ON LAST PAGE (only there!)
% \toappear{}
%\numberofauthors{1}
\begin{document}

@@ -19,10 +19,9 @@
\usepackage[normalem]{ulem}
\usepackage{array}
\usepackage{multicol}
\usepackage{enumitem}
\usepackage[inline]{enumitem}
\usepackage{etoolbox}
\usepackage{varwidth}
\usepackage{enumitem}
\usepackage{longtable}
\usepackage{rotating}
\usepackage{arydshln}

@@ -49,22 +49,28 @@ Although our summarization strategy does support this mode of use, we also show
Summaries resulting from our approach may help embedded database developers to better understand app requirements, app developers to better tune app performance, and mobile phone OS designers to tune OS-level services; they may even be used to generate synthetic workload traces.
We validate our approach using recently collected traces of smartphone query activity in the wild~\cite{kennedy2015pocket}, and show that it is able to cluster queries into meaningful summaries of real-world data.
% \tinysection{Segmenting Queries}
\tinysection{Segmenting Query Logs}
The first major challenge we address is segmenting the query log into its component sessions.
This is difficult for two reasons.
First, there are no explicit cues from the user or app that signal a session has started or completed.
Second, some apps generate queries in background tasks, adding noise to the session-segmentation process.
Nevertheless, we show that a simple approach based on query inter-arrival rates is sufficient to segment the log into meaningful sessions.
We also propose a strategy for identifying an appropriate time threshold for each app, using similarities in query logs from across a diverse population of users.
Our solution addresses two specific challenges in analyzing real-world query workloads.
First, we propose a methodology to extract chunks of queries from the log, each of which corresponds to one or more user interactions.
We assume that queries for a given app arrive in sessions, or sequences of queries.
Under this assumption, we explore how to segment the log into sub-sequences of queries, each corresponding to one session.
Since there are no explicit cues from the user that signal a session has started or completed, identifying sessions in the query log is not straightforward. We show that query inter-arrival rates can be used to identify a time threshold between sessions, laying the foundation for extracting meaningful patterns from the query logs.
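To make this concrete, here is a minimal sketch (in Python; not the paper's implementation) of segmenting a log by inter-arrival gaps. The (timestamp, query) log format and the 10-second default threshold are illustrative assumptions only.

# Illustrative sketch: split one app's query log into sessions whenever the gap
# between consecutive queries exceeds a threshold.
# `log` is a time-ordered list of (timestamp_in_seconds, sql_text) pairs;
# the 10-second default is a placeholder, not a value derived in the paper.
def segment_sessions(log, gap_threshold=10.0):
    sessions, current, last_t = [], [], None
    for t, sql in log:
        if last_t is not None and t - last_t > gap_threshold:
            sessions.append(current)  # idle gap exceeded: close the current session
            current = []
        current.append((t, sql))
        last_t = t
    if current:
        sessions.append(current)
    return sessions

In practice, the per-app threshold would be chosen from the observed inter-arrival distribution rather than hard-coded.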
Second, we tackle the problem of effectively summarizing the query log. A~\textit{good} summary would consist of subsets of the query log, each of which contains queries from the most recurring user activities. We evaluate options for linking these sub-sequences together into self-similar clusters to identify \emph{categories} of sessions, where each cluster corresponds to one category.
To do this, we must first define what makes two clusters similar.
\tinysection{Assigning Session Categories}
Second, we address the problem of effectively identifying session categories to create meaningful summaries of the log.
Abstractly, we can accomplish this by using a clustering algorithm to group sessions with ``similar'' queries.
However, for this we first need a good definition of what makes two queries similar.
Query similarity has already received significant attention~\cite{aouiche2006,aligon2014similarity,makiyama2015text}.
A common approach is to describe queries in terms of features extracted from the query. Common features include the columns which are projected or what selection predicates are used.
Although our underlying approach is agnostic to how these features are extracted, we experiment with a variety of feature definitions and adopt a feature extraction method proposed by Makiyama \textit{et al.}~\cite{makiyama2015text}.
A common approach is to describe queries in terms of features extracted from the query.
Common features include which columns are projected and which selection predicates are used.
As our underlying approach is agnostic to how these features are extracted, we experiment with a variety of feature extraction techniques.
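As a rough illustration of such features (and only that), the sketch below pulls a coarse bag of features out of a SELECT statement with regular expressions: projected columns, source table, and predicate columns. The regexes and feature names are assumptions made for illustration; the feature definitions actually evaluated in the paper, including the Makiyama et al. method, are richer.

import re

# Rough, regex-based feature extraction for illustration only; a real SQL parser
# (and the feature set evaluated in the paper) would be more robust than this.
def query_features(sql):
    features = {}
    m = re.search(r"select\s+(.+?)\s+from\s+([\w\.]+)", sql, re.IGNORECASE | re.DOTALL)
    if m:
        for col in m.group(1).split(","):
            features["select:" + col.strip().lower()] = 1
        features["from:" + m.group(2).lower()] = 1
    for col in re.findall(r"(?:where|and)\s+([\w\.]+)\s*[=<>]", sql, re.IGNORECASE):
        features["where:" + col.lower()] = 1
    return features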
Our final clustering approach adopts a feature extraction method proposed by Makiyama \textit{et al.}~\cite{makiyama2015text}, and addresses one further technical challenge.
In principle each sub-sequence of queries could be described by the features of those queries.
As we show, clustering directly on the query features is neither scalable nor reliable.
Instead, we add an intermediate step where we first cluster individual queries, allowing us to link related queries.
These cluster labels serve as the basis for the session-similarity metric. This metric helps us cluster similar sessions together.
Instead, we show the need for an intermediate step where we first cluster individual queries, allowing us to link related queries.
These cluster labels reduce noise, and serve as the basis for the session-similarity metric used to cluster similar sessions together.
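A loose sketch of this two-level idea follows (Python with scikit-learn). The choice of k-means, the cluster counts, and the label-histogram representation of sessions are illustrative assumptions, not necessarily the choices evaluated in the paper.

import numpy as np
from sklearn.cluster import KMeans
from sklearn.feature_extraction import DictVectorizer

# `sessions` is a list of sessions, each a list of per-query feature dicts
# (for example, produced by an extractor like the sketch above).
def categorize_sessions(sessions, n_query_clusters=20, n_session_clusters=5):
    # Level 1: cluster individual queries on their feature vectors.
    flat = [f for s in sessions for f in s]
    X = DictVectorizer(sparse=False).fit_transform(flat)
    q_labels = KMeans(n_clusters=n_query_clusters, n_init=10).fit_predict(X)

    # Level 2: describe each session by the histogram of its queries' cluster
    # labels, then cluster those histograms to obtain session categories.
    histograms, i = [], 0
    for s in sessions:
        h = np.bincount(q_labels[i:i + len(s)], minlength=n_query_clusters)
        histograms.append(h / max(len(s), 1))
        i += len(s)
    return KMeans(n_clusters=n_session_clusters, n_init=10).fit_predict(np.array(histograms))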
%We will model common behaviors and identify unusual patterns which can be used to realistically emulate synthetic workloads created by Android applications.
@@ -91,12 +97,12 @@ These cluster labels serve as the basis for the session-similarity metric. This
%(4) explore the data flow improvement opportunities within the app.
Concretely, in this paper we:
\begin{enumerate}
\begin{enumerate*}
\item identify the challenges posed by mobile app workloads for query log summarization,
\item propose a two-level, activity-oriented summary format for mobile app workloads,
\item design a process for creating session-oriented summaries from query logs,
\item evaluate our summarization process, showing that it efficiently creates representative summaries.
\end{enumerate}
\end{enumerate*}
%An application of the core contribution of this work is the development of synthetic workload generator which could be used to create a benchmark.
%The methods described in this paper can be used to automatically generate benchmarks from query logs of an application.