Merge branch 'master' of gitlab.odin.cse.buffalo.edu:ahuber/SketchingWorlds

This commit is contained in:
Boris Glavic 2020-12-13 23:30:44 -06:00
commit d80d1f5373
7 changed files with 367 additions and 241 deletions


@ -266,3 +266,29 @@ numpages = {12}
year = {2016}
}
@inproceedings{DBLP:conf/pods/GreenKT07,
author = {Todd J. Green and
Gregory Karvounarakis and
Val Tannen},
title = {Provenance semirings},
booktitle = {{PODS}},
pages = {31--40},
publisher = {{ACM}},
year = {2007}
}
@article{DBLP:journals/sigmod/GuagliardoL17,
author = {Paolo Guagliardo and
Leonid Libkin},
title = {Correctness of {SQL} Queries on Databases with Nulls},
journal = {{SIGMOD} Rec.},
volume = {46},
number = {3},
pages = {5--16},
year = {2017}
}

11
hardness-app.tex Normal file

@ -0,0 +1,11 @@
\section{Missing details from Section~\ref{sec:hard}}
\label{app:hard}
\subsection{Proofs of~\cref{eq:1e}-\cref{eq:2pd-3d}}
\label{app:easy-counts}
\cref{eq:1e},~\cref{eq:2p} and~\cref{eq:3s} are immediate.
We give a quick argument for why \cref{eq:2m} holds. For an edge $(i, j)$ connecting arbitrary vertices $i$ and $j$, finding all other edges in $G$ disjoint from $(i, j)$ is equivalent to finding all edges that are incident to neither vertex $i$ nor vertex $j$. The number of such edges is $m - d_i - d_j + 1$, where we add $1$ because edge $(i, j)$ is removed twice when subtracting both $d_i$ and $d_j$. Since the summation iterates over all edges, a pair $\left((i, j), (k, \ell)\right)$ is also counted as $\left((k, \ell), (i, j)\right)$; division by $2$ eliminates this double counting.
\cref{eq:2pd-3d} is true for similar reasons. For edge $(i, j)$, it is necessary to find two additional edges, disjoint or connected. As in~\cref{eq:2m}, once the number of edges disjoint from $(i, j)$ has been computed, we only need to consider all possible combinations of two edges from the set of disjoint edges, since it does not matter whether the two edges are connected or not. The factor $3$ of $\threedis$ is necessary to account for the triple counting of $3$-matchings, each of which is counted once for each of its three edges. In contrast, an occurrence of $\twopathdis$ is counted only once: because the two-path is connected and the summation automatically `disconnects' the current edge, the occurrence is counted only when the current edge is the one disjoint from the two-path, and never when the current edge belongs to the two-path itself. The sum over all such edge combinations is then precisely $\numocc{G}{\twopathdis} + 3\numocc{G}{\threedis}$.
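As a sanity check of the counting argument for \cref{eq:2m} (a worked instance added for illustration; the graph is an assumption and does not appear in Section~\ref{sec:hard}), consider the path on vertices $1, 2, 3, 4$ with edges $(1,2)$, $(2,3)$, $(3,4)$, so that $m = 3$, $d_1 = d_4 = 1$ and $d_2 = d_3 = 2$. Summing $m - d_i - d_j + 1$ over the three edges and halving gives
\begin{align*}
\frac{1}{2}\Big[(3 - 1 - 2 + 1) + (3 - 2 - 2 + 1) + (3 - 2 - 1 + 1)\Big] = \frac{1}{2}\left(1 + 0 + 1\right) = 1,
\end{align*}
which agrees with the single $2$-matching $\{(1,2), (3,4)\}$ of this graph.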

273
intro.tex

@ -3,32 +3,33 @@
\section{Introduction}
Modern production databases like Postgres and Oracle use bag semantics.
Conversely, probabilistic database research (PDBs)~\cite{DBLP:series/synthesis/2011Suciu,DBLP:conf/sigmod/BoulosDMMRS05,DBLP:conf/icde/AntovaKO07a,DBLP:conf/sigmod/SinghMMPHS08} focuses predominantly on queries evaluated under set semantics.
Modern production databases like Postgres and Oracle use bag semantics, while research on probabilistic databases (PDBs)~\cite{DBLP:series/synthesis/2011Suciu,DBLP:conf/sigmod/BoulosDMMRS05,DBLP:conf/icde/AntovaKO07a,DBLP:conf/sigmod/SinghMMPHS08} focuses predominantly on query evaluation under set semantics.
This is not surprising, as the conventional strategy for encoding the lineage of a query result --- a key component of query evaluation in PDBs --- makes computing typical statistics like marginal probabilities or moments easy (at worst linear in the size of the lineage) for bags, but hard (at worst exponential in the size of the lineage) for sets.
However, conventional encodings of a result's lineage are typically far larger than a comparable deterministic query result, and so even for Bag-PDBs, computing such statistics still has a higher complexity than answering queries in a deterministic (i.e., non-probabilistic) database.
In this paper, we formally prove this limitation of PDBs, and address it by proposing an approximation algorithm that, to the best of our knowledge, is the first approach to computing expected counts that stays within a constant factor of deterministic query processing.
However, conventional encodings of a result's lineage are typically large, and even for Bag-PDBs, computing such statistics still has a higher complexity than answering queries in a deterministic (i.e., non-probabilistic) database.
In this paper, we formally prove this limitation of PDBs, and address it by proposing an approximation algorithm that, to the best of our knowledge, is the first $\epsilon - \delta$ approximation for expectations of counts to stay within a constant factor of deterministic query processing.
Consider the dominant problem in Set-PDBs: Computing marginal probabilities, and the corresponding problem in Bag-PDBs: computing expected counts.
In work that addresses the former problem~\cite{DBLP:series/synthesis/2011Suciu}, the lineage is a boolean formula over random variables that captures the conditions under which each output tuple appears in the result.
Consider the dominant problem in Set-PDBs: Computing marginal probabilities, and the corresponding problem in Bag-PDBs: computing expectations of counts.
In work that addresses the former problem~\cite{DBLP:series/synthesis/2011Suciu}, the lineage of a query result tuple is a boolean formula over random variables that captures the conditions under which the tuple appears in the result.
Computing the probability of the tuple appearing in the result is thus analogous to weighted model counting (a known \sharpphard problem).
In the corresponding problem for Bag-PDBs~\cite{kennedy:2010:icde:pip,DBLP:conf/vldb/AgrawalBSHNSW06,feng:2019:sigmod:uncertainty}, the lineage is a polynomial over random variables that captures the multiplicity of the output tuple.
In the corresponding problem for Bag-PDBs~\cite{kennedy:2010:icde:pip,DBLP:conf/vldb/AgrawalBSHNSW06,feng:2019:sigmod:uncertainty}, lineage is a polynomial over random variables that captures the multiplicity of the output tuple.
Thus, the expectation of the multiplicity is the expectation of this formula.
Lineage in Set-PDBs is typically encoded in conjunctive normal form.
Lineage in Set-PDBs is typically encoded in disjunctive normal form.
This representation is significantly larger than the query result sans lineage.
However, even with alternative encodings~\cite{DBLP:journals/vldb/FinkHO13}, the limiting factor in computing marginal probabilities remains the probability computation itself and not the lineage formula.
The corresponding representation in Bag-PDBs is a polynomial in sum of products (SOP) form --- a sum of terms, each of which is the product of a set of integer or variable atoms.
Thanks to linearity of expectation, computing the expectation of tuple multiplicities is linear in the number of terms in the SOP polynomial.
The corresponding lineage encoding for Bag-PDBs is a polynomial in sum of products (SOP) form --- a sum of clauses, each of which is the product of a set of integer or variable atoms.
Thanks to linearity of expectation, computing the expectation of a count query is linear in the number of clauses in the SOP polynomial.
Unlike Set-PDBs, however, when we consider compressed representations of this polynomial, the complexity landscape becomes much more nuanced and is \textit{not} linear in general.
Compressed representations such as Factorized Databases~\cite{10.1145/3003665.3003667,DBLP:conf/tapp/Zavodny11} or Polynomial Circuits (cite) are analogous to deterministic query optimizations (e.g., pushing down projections)~\cite{DBLP:conf/pods/KhamisNR16,10.1145/3003665.3003667}.
Thus, measuring the performance of a PDB algorithm in terms of the size of a \emph{compressed} lineage formula allows us to more closely relate the algorithm's performance to the complexity of query evaluation in a deterministic database.
Compressed representations such as Factorized Databases~\cite{10.1145/3003665.3003667,DBLP:conf/tapp/Zavodny11} or Polynomial Circuits\todo[noinline]{cite} are analogous to deterministic query optimizations (e.g., pushing down projections)~\cite{DBLP:conf/pods/KhamisNR16,10.1145/3003665.3003667}.
Thus, measuring the performance of a PDB algorithm in terms of the size of the \emph{compressed} lineage formula allows us to more closely relate the algorithm's performance to the complexity of query evaluation in a deterministic database.
The initial picture is not good.
In this paper, we prove that computing expected counts is \emph{not} linear in the size of a compressed polynomial, meaning that even bag PDBs cannot enjoy the same computational complexity as deterministic databases.
This motivates our second goal, a linear time approximation algorithm that, as we prove, estimates the expected multiplicities for tuples in the result of an SPJU query with a complexity to within a constant factor of the equivalent deterministic query.
In this paper, we prove that computing expected counts is \emph{not} linear in the size of a compressed --- specifically a factorized~\cite{10.1145/3003665.3003667} --- lineage polynomial by reduction to counting 3-matchings.
Thus, even bag PDBs do not enjoy the same computational complexity as deterministic databases.
This motivates our second goal, a linear time approximation algorithm for computing expected counts in a bag database, with complexity linear in the size of a factorized lineage formula.
\todo[noinline]{as we show...}The worst-case size of the factorized lineage formula for a query is on the same order as the worst-case complexity of deterministic query evaluation~\cite{DBLP:conf/pods/KhamisNR16,10.1145/3003665.3003667}, making it possible to estimate expected multiplicities for tuples in the result of an SPJU query with a complexity comparable to deterministic query-processing.
\subsection{Sets vs Bags}
%Consider an arbitrary output polynomial $\poly$. Further, consider the same polynomial, with all exponents $e > 1$ set to $1$ and call the resulting polynomial $\rpoly$.
@ -37,7 +38,7 @@ This motivates our second goal, a linear time approximation algorithm that, as w
\begin{figure}[ht]
\begin{subfigure}{0.15\textwidth}
\centering
\begin{tabular}{ c | c | c}
\begin{tabular}{ c | c c}
$\rel$ & A & $\Phi$\\
\hline
& a & $W_a$\\
@ -52,9 +53,9 @@ This motivates our second goal, a linear time approximation algorithm that, as w
\begin{tabular}{ c | c c c}
$E$ & A & B & $\Phi$\\
\hline
& a & b & 1\\
& b & c & 1\\
& c & a & 1\\
& a & b & $\top$\\
& b & c & $\top$\\
& c & a & $\top$\\
\end{tabular}
%\caption{Atom 3 of query $\poly$ in ~\cref{intro:ex}}
\label{subfig:ex-atom3}
@ -92,163 +93,177 @@ This motivates our second goal, a linear time approximation algorithm that, as w
%\end{figure}
\begin{Example}\label{ex:intro}
Assume a set semantics setting. Suppose we are given a Tuple Independent Database ($\ti$), which is a PDB whose tuples are independently present or not. We are given the following boolean query $\poly() :- R(A), E(A, B), R(B)$. The lineage of the output is computed by adding polynomials when a union operation is performed, and by multiplying polynomials for a join operation. This yields the products of all input tuple lineages whose combination satisfies the join condition, summed together. A $\ti$ example instance is given in~\cref{fig:intro-ex}. The attribute column $\Phi$ contains each tuple's respective random variable, where $P[W_i = 1]$ is its marginal probability. While for completeness we should include random variables for Table E, since each tuple has a probability of $1$, we drop them for simplicity. %Finally, see that the tuples in table E can be visualized as the graph in ~\cref{fig:intro-ex-graph}.
Next we explain why this query is hard in set semantics % due to correlations in the lineage formula. But
and easy under bag semantics.% with a polynomial formula representing the multiple contributing tuples from the input set $\ti$, it is easy since we enjoy linearity of expectation.
Consider the Tuple Independent ($\ti$) Set-PDB\footnote{Our work also handles Block Independent Disjoint Databases ($\bi$)~\cite{DBLP:conf/sigmod/BoulosDMMRS05,DBLP:series/synthesis/2011Suciu}, we return to this model later.} given in \cref{fig:intro-ex} with two input relations $R$ and $E$.
Each input tuple is assigned an annotation (attribute $\Phi$): an independent random boolean variable ($W_i$) or the constant $\top$.
Each assignment of values to variables ($\{\;W_a,W_b,W_c\;\}\mapsto \{\;\top,\bot\;\}$) identifies one \emph{possible world}, a deterministic database instance that contains exactly the tuples annotated by the constant $\top$ or by a variable assigned to $\top$.
The probability of this world is the joint probability of the corresponding assignments.
For example, let $P[W_a] = P[W_b] = P[W_c] = p$ and consider the possible world where $R = \{\;\tuple{a}, \tuple{b}\;\}$.
The corresponding variable assignment is $\{\;W_a \mapsto \top, W_b \mapsto \top, W_c \mapsto \bot\;\}$, and the probability of this world is $P[W_a]\cdot P[W_b] \cdot P[\neg W_c] = p^2-p^3$.
\end{Example}
Our work also handles Block Independent Disjoint Databases ($\bi$), a PDB model in which tuples are arranged in blocks, where all blocks are independent from one another, but tuples within the same block are mutually exclusive. For now, let us consider the $\ti$ model. In the example we consider a fixed probability $\prob$ for all tuple variables such that $P[W_i = 1] = \prob$. Let us also be explicit in mentioning that the input tables are \textit{sets}, i.e. $Dom(W_i) = \{0, 1\}$, and the difference when we speak of bag semantics, is that we consider the query to potentially have duplicates, or in other words we are thinking about query output (over set instances) in the bag context.
Prior efforts to generalize incomplete databases to bags~\cite{feng:2019:sigmod:uncertainty,DBLP:conf/pods/GreenKT07,DBLP:journals/sigmod/GuagliardoL17} replace the boolean annotations with natural numbers.
Analogously, we generalize the above model of Set-PDBs to bags by using natural-number-valued random variables (i.e., $Dom(W_i) \subseteq \mathbb N$) and positive natural number constants.
Without loss of generality, we assume that input relations are sets (i.e. $Dom(W_i) = \{0, 1\}$), while query evaluation follows bag semantics.
We contrast bag and set query evaluation with the following example:
To contrast the bag/polynomial and set/lineage interpretations, we provide another example.
\begin{Example}\label{ex:bag-vs-set}
The output polynomial in ~\cref{ex:intro} has the following lineage formula (top) and polynomial (bottom).
Continuing the prior example, we are given the following boolean (resp., count) query
$$\poly() :- R(A), E(A, B), R(B)$$
The lineage of the result in a Set-PDB (resp., Bag-PDB) is a boolean (resp., polynomial) formula over random variables annotating the input relations (i.e., $W_a$, $W_b$, $W_c$).
Because the boolean query has only a nullary relation, we write $Q(\cdot)$ to denote the function mapping variable assignments to a concrete value for the lineage in the corresponding possible world:
\begin{align*}
&\poly(W_a, W_b, W_c) = W_aW_b \vee W_bW_c \vee W_cW_a\\
&\poly(W_a, W_b, W_c) = W_aW_b + W_bW_c + W_cW_a
\poly_{set}(W_a, W_b, W_c) &= W_aW_b \vee W_bW_c \vee W_cW_a\\
\poly_{bag}(W_a, W_b, W_c) &= W_aW_b + W_bW_c + W_cW_a
\end{align*}
Notice that $\poly$ in the set/lineage setting above, $\poly: (\mathbb{B})^3 \mapsto \mathbb{B}$, while under bag/polynomial semantics we define $\poly: (\mathbb{N})^3 \mapsto \mathbb{N}$.
Assume the following $\mathbb{B}/\mathbb{N}$ variable assignments: $W_a\mapsto T/1, W_b \mapsto T/1, W_c \mapsto F/0.$ Then the polynomials evaluate as
It is left as an exercise for the reader to show that, given assignments to $W_a$, $W_b$, $W_c$, these expressions correspond to the existence (resp., count) of the single nullary result tuple for $\poly$ applied to the database instance in \cref{fig:intro-ex}.
We show one possible world here, with the set assignment $\{\;W_a\mapsto \top, W_b \mapsto \top, W_c \mapsto \bot\;\}$ (and the corresponding bag assignment),
The polynomials evaluate as:
\begin{align*}
&\poly(T, T, F) = TT \vee TF \vee FT = T\\
&\poly(1, 1, 0) = 1 \cdot 1 + 1\cdot 0 + 0 \cdot 1 = 1
&\poly_{set}(\top, \top, \bot) = \top\top \vee \top\bot \vee \bot\top = \top\\
&\poly_{bag}(1, 1, 0) = 1 \cdot 1 + 1\cdot 0 + 0 \cdot 1 = 1
\end{align*}
In the set/lineage setting, we find that the boolean query is satisfied, while in the bag evaluation we see how many combinations of the input satisfy the query.
The Set-PDB query is satisfied in this possible world, while the Bag-PDB query produces a nullary tuple with a multiplicity of 1.
The marginal probability (resp., expected count) of this query is computed over all possible worlds:
{\small
\begin{align*}
P[\poly_{set}] &= \sum_{w_i \in \{\top,\bot\}} \mu(\poly_{set}(w_a, w_b, w_c))P[W_a = w_a,W_b = w_b,W_c = w_c]\\
\expct[\poly_{bag}] &= \sum_{w_i \in \{0,1\}} \poly_{bag}(w_a, w_b, w_c)\cdot P[W_a = w_a,W_b = w_b,W_c = w_c]
\end{align*}
}
\end{Example}
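To make the two sums of \cref{ex:bag-vs-set} concrete, the following worked evaluation is a sketch under two assumptions: all three variables share the probability $P[W_a] = P[W_b] = P[W_c] = p$ used in \cref{ex:intro}, and $\mu$ is read as the indicator mapping $\top$ to $1$ and $\bot$ to $0$. Only four of the eight possible worlds contribute: the world in which all three variables are $\top$ (probability $p^3$, count $3$) and the three worlds in which exactly two variables are $\top$ (probability $p^2(1-p)$ each, count $1$). Hence
\begin{align*}
P[\poly_{set}] &= p^3 + 3p^2(1 - p) = 3p^2 - 2p^3,\\
\expct[\poly_{bag}] &= 3 \cdot p^3 + 1 \cdot 3p^2(1 - p) = 3p^2.
\end{align*}
The two quantities differ exactly because the bag semantics counts the all-$\top$ world three times, while the set semantics records it only once.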
Note that computing the probability of the query of ~\cref{ex:intro} in set semantics is indeed \sharpphard, since it is a query that is non-hierarchical
%, i.e., for $Vars(\poly)$ denoting the set of variables occuring across all atoms of $\poly$, a function $sg(x)$ whose output is the set of all atoms that contain variable $x$, we have that $sg(A) \cap sg(B) \neq \emptyset$ and $sg(A)\not\subseteq sg(B)$ and $sg(B)\not\subseteq sg(A)$,
~\cite{10.1145/1265530.1265571}. %Thus, computing $\expct\pbox{\poly(W_a, W_b, W_c)}$, i.e. the probability of the output with annotation $\poly(W_a, W_b, W_c)$, ($\prob(q)$ in Dalvi, Sucui) is hard in set semantics.
To see why this computation is hard for query $\poly$ over set semantics, from the query input we compute an output lineage formula of $\poly(W_a, W_b, W_c) = W_aW_b \vee W_bW_c \vee W_cW_a$. Note that the conjunctive clauses are not independent of one another and the computation of the probability is not linear in the size of $\poly(W_a, W_b, W_c)$:
\begin{equation*}
\expct\pbox{\poly(W_a, W_b, W_c)} = W_aW_b + W_a\overline{W_b}W_c + \overline{W_a}W_bW_c = 3\prob^2 - 2\prob^3
\end{equation*}
In general, such a computation can be exponential in the size of the database.
Note that the query of \cref{ex:bag-vs-set} in set semantics is indeed \sharpphard, since it is non-hierarchical~\cite{10.1145/1265530.1265571}.
To see why computing this probability is hard, observe that the clauses of the disjunctive normal form boolean lineage are neither independent nor disjoint, forcing~\cite{DBLP:journals/vldb/FinkHO13} the use of Shannon decomposition, which is at worst exponential in the size of the input.
% \begin{equation*}
% \expct\pbox{\poly(W_a, W_b, W_c)} = W_aW_b + W_a\overline{W_b}W_c + \overline{W_a}W_bW_c = 3\prob^2 - 2\prob^3
% \end{equation*}
% In general, such a computation can be exponential in the size of the database.
%Using Shannon's Expansion,
%\begin{align*}
%&W_aW_b \vee W_bW_c \vee W_cW_a
%= &W_a
%\end{align*}
However, in the bag setting, the polynomial is $\poly(W_a, W_b, W_c) = W_aW_b + W_bW_c + W_cW_a$. To reiterate, the output lineage formula is produced from a query over a set $\ti$ input, where duplicates are allowed in the output. The expectation computation over the output lineage is a computation of the expected multiplicity of an output tuple across possible worlds. In~\cref{ex:intro}, the expectation is simply
Conversely, in Bag-PDBs, correlations between clauses of the SOP polynomial are not problematic thanks to linearity of expectation.
The expectation computation over the output lineage is simply the sum of expectations of each clause.
For \cref{ex:intro}, the expectation is simply
{\small
\begin{align*}
&\expct\pbox{\poly(W_a, W_b, W_c)} = \expct\pbox{W_aW_b} + \expct\pbox{W_bW_c} + \expct\pbox{W_cW_a}\\
= &\expct\pbox{W_a}\expct\pbox{W_b} + \expct\pbox{W_b}\expct\pbox{W_c} + \expct\pbox{W_c}\expct\pbox{W_a}
\expct\pbox{\poly(W_a, W_b, W_c)} &= \expct\pbox{W_aW_b} + \expct\pbox{W_bW_c} + \expct\pbox{W_cW_a}\\
\intertext{\normalsize
In this particular lineage polynomial, all variables in each product clause are independent, so we can push expectations through.
}
&= \expct\pbox{W_a}\expct\pbox{W_b} + \expct\pbox{W_b}\expct\pbox{W_c} + \expct\pbox{W_c}\expct\pbox{W_a}
\end{align*}
Note that $\expct\pbox{W_i} = P[W_i = 1]$, and so
}
Computing such expectations is indeed linear in the size of the SOP as the number of operations in the computation is \textit{exactly} the number of multiplication and addition operations of the polynomial.
As a further interesting feature of this example, note that $\expct\pbox{W_i} = P[W_i = 1]$, and so taking the same polynomial over the reals:
\begin{align*}
&\expct\pbox{\poly(W_a, W_b, W_c)} = P[W_a = 1]P[W_b = 1] + P[W_b = 1]P[W_c = 1]\\
\expct\pbox{\poly_{bag}} =&\; P[W_a = 1]P[W_b = 1] + P[W_b = 1]P[W_c = 1]\\
&+ P[W_c = 1]P[W_a = 1]\\
= &\prob^2 + \prob^2 + \prob^2 = 3\prob^2.
=&\; \poly_{bag}(P[W_a=1], P[W_b=1], P[W_c=1])
\end{align*}
Computing such expectations is indeed linear in the size of the SOP as the number of operations in the computation is \textit{exactly} the number of multiplication and addition operations of the polynomial. The above equalities hold due to linearity of expectation. In this particular case all variables are independent, so we can push expectation into the products as well. Note that the answer is the same as substituting $\prob$ in for each variable, i.e., for this example $\expct\pbox{\poly(W_a, W_b, W_c)} = \poly(\prob, \prob, \prob)$. Is this equality always the case?%This however is coincidental and not true for the general case.
Now, consider the query
\begin{equation*}
\poly^2() := \rel(A), E(A, B), \rel(B), \rel(C), E(C, D), \rel(D),
\end{equation*}
For an arbitrary lineage formula, which we can view as a polynomial, it is known that there may exist equivalent compressed representations of the polynomial. One such compression is the factorized polynomial ~\cite{10.1145/3003665.3003667}, where the polynomial can be broken up into separate factors. %Another form of the polynomial is the SOP, which is the expansion of the factorized polynomial by multiplying out all terms, and in general is exponentially larger (in the number of products) than the factorized version.
A factorized polynomial of $\poly^2$ is
\begin{equation*}
\poly^2(W_a, W_b, W_c) = \left(W_aW_b + W_bW_c + W_cW_a\right) \cdot \left(W_aW_b + W_bW_c + W_cW_a\right).
\end{equation*}
This factorized expression can be easily modeled as an expression tree as depicted by ~\cref{fig:intro-q2-etree}.
\begin{figure}[h!]
\begin{tikzpicture}[thick, level distance=0.9cm,level 1/.style={sibling distance=4.55cm}, level 2/.style={sibling distance=1.5cm}, level 3/.style={sibling distance=0.7cm}]% level/.style={sibling distance=6cm/(#1 * 1.5)}]
\node[tree_node](root){$\boldsymbol{\times}$}
child{node[tree_node]{$\boldsymbol{+}$}
child{node[tree_node]{$\boldsymbol{\times}$}
child{node[tree_node]{$W_a$}}
child{node[tree_node]{$W_b$}}
}
child{node[tree_node]{$\boldsymbol{\times}$}
child{node[tree_node]{$W_b$}}
child{node[tree_node]{$W_c$}}
}
child{node[tree_node]{$\boldsymbol{\times}$}
child{node[tree_node]{$W_c$}}
child{node[tree_node]{$W_a$}}
}
}
child{node[tree_node]{$\boldsymbol{+}$}
child{node[tree_node]{$\boldsymbol{\times}$}
child{node[tree_node]{$W_a$}}
child{node[tree_node]{$W_b$}}
}
child{node[tree_node]{$\boldsymbol{\times}$}
child{node[tree_node]{$W_b$}}
child{node[tree_node]{$W_c$}}
}
child{node[tree_node]{$\boldsymbol{\times}$}
child{node[tree_node]{$W_c$}}
child{node[tree_node]{$W_a$}}
}
};
\node[tree_node](root){$\boldsymbol{\times}$}
child{node[tree_node]{$\boldsymbol{+}$}
child{node[tree_node]{$\boldsymbol{\times}$}
child{node[tree_node]{$W_a$}}
child{node[tree_node]{$W_b$}}
}
child{node[tree_node]{$\boldsymbol{\times}$}
child{node[tree_node]{$W_b$}}
child{node[tree_node]{$W_c$}}
}
child{node[tree_node]{$\boldsymbol{\times}$}
child{node[tree_node]{$W_c$}}
child{node[tree_node]{$W_a$}}
}
}
child{node[tree_node]{$\boldsymbol{+}$}
child{node[tree_node]{$\boldsymbol{\times}$}
child{node[tree_node]{$W_a$}}
child{node[tree_node]{$W_b$}}
}
child{node[tree_node]{$\boldsymbol{\times}$}
child{node[tree_node]{$W_b$}}
child{node[tree_node]{$W_c$}}
}
child{node[tree_node]{$\boldsymbol{\times}$}
child{node[tree_node]{$W_c$}}
child{node[tree_node]{$W_a$}}
}
};
\end{tikzpicture}
\caption{Expression tree for query $\poly^2$.}
\label{fig:intro-q2-etree}
\end{figure}
In contrast, the equivalent SOP representation is
\subsection{Superlinearity of Bag PDBs}
Moving forward, we focus exclusively on bags and drop the subscript from $\poly_{bag}$.
Consider the cartesian product of $\poly$ with itself:
\begin{equation*}
\poly^2() := \rel(A), E(A, B), \rel(B),\; \rel(C), E(C, D), \rel(D)
\end{equation*}
For an arbitrary polynomial, it is known that there may exist equivalent compressed representations.
One such compression is the factorized polynomial~\cite{10.1145/3003665.3003667}, where the polynomial is broken up into separate factors.
For example:
{\small
\begin{equation*}
\poly^2(W_a, W_b, W_c) = \left(W_aW_b + W_bW_c + W_cW_a\right) \cdot \left(W_aW_b + W_bW_c + W_cW_a\right).
\end{equation*}
}
This factorized expression can be easily modeled as an expression tree, as in \cref{fig:intro-q2-etree}.
In contrast, the equivalent SOP representation is
\begin{equation*}
W_a^2W_b^2 + W_b^2W_c^2 + W_c^2W_a^2 + 2W_a^2W_bW_c + 2W_aW_b^2W_c + 2W_aW_bW_c^2.
\end{equation*}
One can see that the factorized form more closely models the optimizations of deterministic query evaluation.
The expectation then is
\begin{align*}
&\expct\pbox{\poly^2(W_a, W_b, W_c)}\\
&= \expct\pbox{W_a^2}\expct\pbox{W_b^2} + \expct\pbox{W_b^2}\expct\pbox{W_c^2} + \expct\pbox{W_c^2}\expct\pbox{W_a^2} +\\
&\qquad \expct\pbox{2W_a^2}\expct\pbox{W_b}\expct\pbox{W_c} + \expct\pbox{2W_a}\expct\pbox{W_b^2}\expct\pbox{W_c} +\\
&\qquad \expct\pbox{2W_a}\expct\pbox{W_b}\expct\pbox{W_c^2}\\
= &\prob^2 + \prob^2 + \prob^2 + 2\prob^3 + 2\prob^3 + 2\prob^3\\
= & 3\prob^2(1 + 2\prob) \neq \poly^2(\prob, \prob, \prob).
\end{align*}
In this case, even though we substitute probability values in for each variable, $\poly^2(\prob, \prob, \prob)$ is not the answer we seek since in our example $Dom(W_i) = \{0, 1\}$ and $\expct\pbox{W_i^2} = \sum\limits_{w \in Dom(W_i)}w^2 \cdot P[W_i = w] = \prob$. This property leads us to consider another structure related to $\poly$. \AH{I don't know if we want to include the following statement: \par \emph{ bags are only hard with self-joins }
\par Atri suggests a proof in the appendix regarding this claim.}
Define $\rpoly^2(\vct{X})$ to be the resulting polynomial when all exponents $e > 1$ are set to $1$ in $\poly^2$. For example, when we have
In our original example, the lineage polynomial for $\poly$ had the nice property that the expected count could be computed by simply replacing each variable with its probability.
This property does not hold for $\poly^2$ (i.e., $\expct\pbox{\poly^2} \neq \poly^2(P\pbox{W_a}, P\pbox{W_b}, P\pbox{W_c})$).
Nevertheless, it suggests that a similar closed form formula for the expected count might be possible.
Observe that under the assumption that $Dom(W_i) = \{0, 1\}$, it is generally true that for any $k \geq 1$, $\expct\pbox{W_i^k} = \expct\pbox{W_i}$.
This property leads us to consider another structure related to $\poly$.
% \AH{I don't know if we want to include the following statement: \par \emph{ bags are only hard with self-joins }
% \par Atri suggests a proof in the appendix regarding this claim.}
For any polynomial $\poly(\vct{X})$, we define the \emph{reduced polynomial} $\rpoly(\vct{X})$ to be the polynomial obtained by setting all exponents $e > 1$ in $\poly(\vct{X})$ to $1$.
With $\poly^2$ as an example, we have:
\begin{align*}
&\poly^2(W_a, W_b, W_c) = W_a^2W_b^2 + W_b^2W_c^2 + W_c^2W_a^2 + 2W_a^2W_bW_c + 2W_aW_b^2W_c\\
&+ 2W_aW_bW_c^2,
\end{align*}
then
\begin{align*}
&\rpoly^2(W_a, W_b, W_c) = W_aW_b + W_bW_c + W_cW_a + 2W_aW_bW_c + 2W_aW_bW_c\\
\rpoly^2(W_a, W_b, W_c) =&\; W_aW_b + W_bW_c + W_cW_a + 2W_aW_bW_c + 2W_aW_bW_c\\
&+ 2W_aW_bW_c\\
&= W_aW_b + W_bW_c + W_cW_a + 6W_aW_bW_c
=&\; W_aW_b + W_bW_c + W_cW_a + 6W_aW_bW_c
\end{align*}
Note that this structure $\rpoly^2(\prob, \prob, \prob)$ is the expectation we computed, since it is always the case that $i^2 = i$ for all $i$ in $\{0, 1\}$. And, $\poly^2()$ is still computable in linear time in the size of the output polynomial, compressed or SOP.
Observe that the reduced polynomial is a closed form formula for the expected count (i.e., $\expct\pbox{\poly^2} = \rpoly^2(P\pbox{W_a=1}, P\pbox{W_b=1}, P\pbox{W_c=1})$).
Also note that our initial example polynomial $\poly$ is already in reduced form.
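As a cross-check of this closed form (a worked calculation added for illustration, assuming $P\pbox{W_a=1} = P\pbox{W_b=1} = P\pbox{W_c=1} = p$), the expectation can also be computed directly from the possible worlds. The polynomial $\poly$ evaluates to $3$ when all three variables are $1$ (probability $p^3$), to $1$ when exactly two of them are $1$ (total probability $3p^2(1-p)$), and to $0$ otherwise, so
\begin{align*}
\expct\pbox{\poly^2} &= 3^2 \cdot p^3 + 1^2 \cdot 3p^2(1 - p) = 3p^2 + 6p^3,\\
\rpoly^2(p, p, p) &= p^2 + p^2 + p^2 + 6p^3 = 3p^2 + 6p^3,\\
\poly^2(p, p, p) &= \left(3p^2\right)^2 = 9p^4.
\end{align*}
For instance, at $p = \frac{1}{2}$ the first two expressions both equal $\frac{3}{2}$, whereas substituting probabilities into the unreduced polynomial yields $\frac{9}{16}$.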
A compressed polynomial can be exponentially smaller in $k$ for $k$-products. It is also the case that computing the expectation of an output polynomial in SOP form is always linear in the size of the polynomial, since expectation can be pushed through addition.
This work seeks to explore the complexity landscape for compressed representations of polynomials. Note that when we are linear in the size of the lineage formula, we essentially have runtime that is of deterministic query complexity.
Up to this point the message seems consistent that bags are always easy in the size of the SOP representation, but
The reduced form of a polynomial can be obtained in a linear scan over the clauses of a SOP encoding of the polynomial.
In prior work on PDBs, where this encoding is implicitly assumed, computing the expected count is linear in the size of the encoding.
In general however, compressed encodings of the polynomial can be exponentially smaller in $k$ for $k$-products --- the query $\poly^k$ obtained by taking the cartesian product of $k$ copies of $\poly$ has a factorized encoding of size $6\cdot k$, while the SOP encoding is of size $2\cdot 3^k$.
This leads us to the central question of this paper:
\begin{Question}
Is it always the case that bags are easy in the size of the \emph{compressed} polynomial?
Is it always the case that the expectation of a nullary count query in a Bag-PDB can be computed in time linear in the size of the \emph{compressed} lineage polynomial?
\end{Question}
If bags \textit{are} always easy for any compressed version of the polynomial, then there is no need for improvement. But, if provably not, then the option to approximate the computation over a compressed polynomial in linear time is critical for making PDBs practical.
Consider the query
\begin{equation*}
\poly^3() := \left(\rel(A), E(A, B), R(B)\right), \left(\rel(C), E(C, D), R(D)\right), \left(\rel(F), E(F, G), R(G)\right).
\end{equation*}
Upon inspection one can see that the factorized output polynomial consists of three product terms, while the SOP version consists of $3^3$ terms. We show in this paper that, given a $\ti$ and any conjunctive query with input $\prob$ for all variables of $\poly^3$, this particular query is hard given a factorized polynomial as input. We show this via a reduction to computing the number of $3$-matchings over an arbitrary graph. The fact that bags are not easy in the general case when considering compressed polynomials necessitates an approximation algorithm that computes the expected multiplicity of the output in linear time when the output polynomial is in factorized form. We introduce such an approximation algorithm with confidence guarantees to compute $\rpoly(\vct{X})$ in linear time. Our approximation algorithm generalizes to the $\bi$ model as well. This shows that for all RA+ queries, the processing time in approximation is essentially the same as that of deterministic processing.
If the answer is yes, then it is possible for Bag-PDBs to achieve performance competitive with deterministic databases.
The answer, unfortunately, is no, and an approximation algorithm is required.
% Consider the :
% \begin{equation*}
% \poly^3() := \left(\rel(A), E(A, B), R(B)\right), \left(\rel(C), E(C, D), R(D)\right), \left(\rel(F), E(F, G), R(G)\right).
% \end{equation*}
% The factorized output polynomial consists of a product of three identical three-way summations, while the SOP encoding is exponential --- $3^3$ clauses to be precise.
Concretely, in this paper:
(i) We show that conjunctive queries over a bag-$\ti$ are hard (i.e., superlinear in the size of a compressed lineage encoding) by reduction to counting the number of $3$-matchings over an arbitrary graph;
(ii) We present an $\epsilon - \delta$ approximation algorithm for bag-$\ti$s and show that its complexity is linear in the size of the compressed lineage encoding;
(iii) We generalize the approximation algorithm to bag-$\bi$s, a more general model of probabilistic data;
(iv) We further generalize our results to higher moments and polynomial circuits, and prove that for RA+ queries, the processing time in approximation is within a constant factor of that of the same query processed deterministically.
%%%%%%%%%%%%%%%%%%%%%%%%%%%%

View file

@ -36,6 +36,7 @@
\newcommand{\aug}[1]{AUG^{\graph{#1}}}
\newcommand{\mtrix}[1]{M_{#1}}
\newcommand{\dtrm}[1]{Det\left(#1\right)}
\newcommand{\tuple}[1]{\left<#1\right>}
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
% Query Classes

View file

@ -185,9 +185,10 @@ sensitive=true
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
% APPENDIX
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
% \clearpage
% \appendix
% \normalsize
\clearpage
\appendix
\normalsize
\input{hardness-app}
% \input{glossary.tex}
% \input{addproofappendix.tex}
\end{document}

View file

@ -1,6 +1,7 @@
%root:main.tex
\section{Hardness of exact computation}
\label{sec:hard}
We would like to argue that for a compressed version of $\poly(\vct{w})$, in general $\expct_{\vct{w}}\pbox{\poly(\vct{w})}$, even for TIDB, cannot be computed in linear time. We will argue two flavors of such a hardness result. In Section~\ref{sec:multiple-p}, we argue that computing the expected value exactly for all query polynomials $\poly(\vct{X})$ for multiple values of $p$ is \sharpwonehard. However, this does not rule out the possibility of being able to solve the problem for any {\em fixed} value of $p$, say even in linear time. In Section~\ref{sec:single-p}, we rule out even this possibility (based on some popular hardness conjectures in fine-grained complexity).
@ -63,16 +64,16 @@ We will prove the above result by reducing the problem of computing the number o
As mentioned earlier, we prove our hardness result by presenting a reduction from the problem of counting $\kElem$-matchings in a graph:
\begin{Lemma}\label{lem:qEk-multi-p}
Let $\prob_0,\ldots, \prob_{2\kElem}$ be distinct values in $(0, 1]$. Then given the values $\rpoly_{G}^\kElem(\prob_i,\ldots, \prob_i)$ for $0\leq i\leq 2\kElem$, the number of $\kElem$-matchings in $G$ can be computed in $poly(\kElem)$ time.
Let $\prob_0,\ldots, \prob_{2\kElem}$ be distinct values in $(0, 1]$. Then given the values $\rpoly_{G}^\kElem(\prob_i,\ldots, \prob_i)$ for $0\leq i\leq 2\kElem$, the number of $\kElem$-matchings in $G$ can be computed in $O\inparen{\kElem^3}$ time.
\end{Lemma}
Before we prove the above Lemma, let us use it to prove~\Cref{thm:mult-p-hard-result}:
\begin{proof}[Proof of Theorem~\ref{thm:mult-p-hard-result}]
For the sake of contradiction, let us assume we can solve our problem in $f(\kElem)\cdot m^c$ time for some absolute constant $c$. Then given a graph $G$ we can compute the query polynomial $\rpoly_G^\kElem$ (in the obvious way) in $O(km)$ time. Then after we run our algorithm on $\rpoly_G^\kElem$, we get $\rpoly_{G}^\kElem(\prob_i,\ldots, \prob_i)$ for $0\leq i\leq 2\kElem$ in additional $f(\kElem)\cdot m^c$ time. \Cref{lem:qEk-multi-p} then computes the number of $k$-matchings in $G$ in $poly(\kElem)$ time. Thus, overall we have an algorithm for computing the number of $k$-matchings in time
For the sake of contradiction, let us assume we can solve our problem in $f(\kElem)\cdot m^c$ time for some absolute constant $c$. Then given a graph $G$ we can compute the query polynomial $\rpoly_G^\kElem$ (in the obvious way) in $O(km)$ time. Then after we run our algorithm on $\rpoly_G^\kElem$, we get $\rpoly_{G}^\kElem(\prob_i,\ldots, \prob_i)$ for $0\leq i\leq 2\kElem$ in additional $f(\kElem)\cdot m^c$ time. \Cref{lem:qEk-multi-p} then computes the number of $k$-matchings in $G$ in $O(\kElem^3)$ time. Thus, overall we have an algorithm for computing the number of $k$-matchings in time
\begin{align*}
O(km) + f(\kElem)\cdot m^c + poly(\kElem)
&\le \inparen{poly(\kElem) + f(\kElem)}\cdot m^{c+1} \\
&\le \inparen{poly(\kElem) + f(\kElem)}\cdot n^{2c+2},
O(km) + f(\kElem)\cdot m^c + O(\kElem^3)
&\le \inparen{O(\kElem^3) + f(\kElem)}\cdot m^{c+1} \\
&\le \inparen{O(\kElem^3) + f(\kElem)}\cdot n^{2c+2},
\end{align*}
which contradicts~\cref{thm:k-match-hard}.
\end{proof}
@ -80,22 +81,37 @@ which contradicts~\cref{thm:k-match-hard}.
Finally, we are ready to prove~\Cref{lem:qEk-multi-p}:
\begin{proof}[Proof of ~\cref{lem:qEk-multi-p}]
%It is trivial to see that one can readily expand the exponential expression by performing the $n^\kElem$ product operations, yielding the polynomial in the sum of products form of the lemma statement. By definition $\rpoly_{G}^\kElem$ reduces all variable exponents greater than $1$ to $1$. Thus, a monomial such as $X_i^\kElem X_j^\kElem$ is $X_iX_j$ in $\rpoly_{G}^\kElem$, and the value after substitution is $p_i\cdot p_j = p^2$. Further, that the number of terms in the sum is no greater than $2\kElem + 1$, can be easily justified by the fact that each edge has two endpoints, and the most endpoints occur when we have $\kElem$ distinct edges (such a subgraph is also known as a $\kElem$-matching), with non-intersecting points, a case equivalent to $p^{2\kElem}$.
We will show that $\rpoly_{G}^\kElem(\prob,\ldots, \prob) = \sum\limits_{i = 0}^{2\kElem} c_i \cdot \prob^i$. First, since $\poly_G^\kElem(\vct{X})$ has $\kElem$ products of monomials of degree $2$, it follows that $\poly_G^\kElem(\vct{X})$ has degree $2\kElem$. We can further write $\poly_{G}^{\kElem}(\vct{X})$ in its expanded SOP form,
\begin{equation*}
\sum_{\substack{(i_1, j_1),\\\cdots,\\(i_\kElem, j_\kElem) \in E}}X_{i_1}X_{j_1}\cdots X_{i_\kElem}X_{j_\kElem}
\end{equation*}
Since each of $(i_1, j_1),\ldots, (i_\kElem, j_\kElem)$ are from $E$, it follows that the set of $\kElem!$ permutations of the $\kElem$ $X_iX_j$ pairs which form the monomial products are of degree $2\kElem$ with the number of distinct variables in an arbitrary monomial $\leq 2\kElem$. By definition, $\rpoly_{G}^{\kElem}(\vct{X})$ sets every exponent $e > 1$ to $e = 1$, thereby shrinking the degree a monomial product term in the SOP form of $\poly_{G}^{\kElem}(\vct{X})$ to the exact number of distinct variables the monomial contains. This implies that $\rpoly_{G}^\kElem$ is a polynomial of degree $2\kElem$ and hence $\rpoly_{G}^\kElem(\prob,\ldots, \prob)$ is a polynomial in $\prob$ of degree $2\kElem$. Then it is the case that
We first argue that $\rpoly_{G}^\kElem(\prob,\ldots, \prob) = \sum\limits_{i = 0}^{2\kElem} c_i \cdot \prob^i$. First, since $\poly_G(\vct{X})$ has %$\kElem$ products of monomials of
degree $2$, it follows that $\poly_G^\kElem(\vct{X})$ has degree $2\kElem$.
%We can further write $\poly_{G}^{\kElem}(\vct{X})$ in its expanded SOP form,
%\begin{equation*}
%\sum_{\substack{(i_1, j_1),\\\cdots,\\(i_\kElem, j_\kElem) \in E}}X_{i_1}X_{j_1}\cdots X_{i_\kElem}X_{j_\kElem}
%\end{equation*}
%Since each of $(i_1, j_1),\ldots, (i_\kElem, j_\kElem)$ are from $E$, it follows that the set of $\kElem!$ permutations of the $\kElem$ $X_iX_j$ pairs which form the monomial products are of degree $2\kElem$ with the number of distinct variables in an arbitrary monomial $\leq 2\kElem$.
By definition, $\rpoly_{G}^{\kElem}(\vct{X})$ sets every exponent $e > 1$ to $e = 1$, which means that $\deg(\rpoly_{G}^\kElem)\le \deg\poly_G^\kElem=2k$. Thus, if we think of $\prob$ as a variable, then $\rpoly_{G}^{\kElem}(\prob,\dots,\prob)$ is a univariate polynomial of degree at most $\deg(\rpoly_{G}^\kElem)\le 2k$. Thus, we can write
%thereby shrinking the degree a monomial product term in the SOP form of $\poly_{G}^{\kElem}(\vct{X})$ to the exact number of distinct variables the monomial contains. This implies that $\rpoly_{G}^\kElem$ is a polynomial of degree $2\kElem$ and hence $\rpoly_{G}^\kElem(\prob,\ldots, \prob)$ is a polynomial in $\prob$ of degree $2\kElem$. Then it is the case that
\begin{equation*}
\rpoly_{G}^{\kElem}(\prob,\ldots, \prob) = \sum_{i = 0}^{2\kElem} c_i \prob^i
\end{equation*}
where $c_i$ denotes all monomials in the expansion of $\poly_{G}^{\kElem}(\vct{X})$ composed of $i$ distinct variables, with $\prob$ substituted for each distinct variable\footnote{Since $\rpoly_G^\kElem(\vct{X})$ does not have any monomial with degree $< 2$, it is the case that $c_0 = c_1 = 1$.}.
Given that we then have $2\kElem + 1$ distinct values of $\rpoly_{G}^\kElem(\prob,\ldots, \prob)$ for $0\leq i\leq2\kElem$, it follows that we then have $2\kElem + 1$ distinct rows of the form $\prob_i^0\ldots\prob_i^{2\kElem}$ which form a matrix $M$. We have then a linear system of the form $M \cdot \vct{c} = \vct{b}$ where $\vct{c}$ is the coefficient vector ($c_0,\ldots, c_{2\kElem}$), and $\vct{b}$ is the vector such that $\vct{b}[i] = \rpoly_{G}^\kElem(\prob_i,\ldots, \prob_i)$. By construction of the summation, matrix $M$ is the Vandermonde matrix, from which it follows that we have a matrix with full rank, and we can solve the linear system in $O(k^3)$ time to determine $\vct{c}$ exactly.
We note that $c_i$ is {\em exactly} the number of monomials in the SOP expansion of $\poly_{G}^{\kElem}(\vct{X})$ composed of $i$ distinct variables.%, with $\prob$ substituted for each distinct variable
\footnote{Since $\rpoly_G^\kElem(\vct{X})$ does not have any monomial with degree $< 2$, it is the case that $c_0 = c_1 = 0$ but for the sake of simplicity we will ignore this observation.}
Denote the number of $\kElem$-matchings in $G$ as $\numocc{G}{\kmatch}$. Note that $c_{2\kElem}$ is $\kElem! \cdot \numocc{G}{\kmatch}$. This can be seen intuitively by looking at the original factorized representation $\poly_{G}^\kElem(\vct{X})$, where, across each of the $\kElem$ products, an arbitrary $\kElem$-matching can be selected $\prod_{i = 1}^\kElem \kElem = \kElem!$ times. Note that each $\kElem$-matching $(i_1, j_1)\ldots$ $(i_k, j_k)$ in $G$ corresponds to the unique monomial $\prod_{\ell = 1}^\kElem X_{i_\ell}X_{j_\ell}$ in $\poly_{G}^\kElem(\vct{X})$, where each index is distinct. Since each index is distinct, then each variable has an exponent $e = 1$ and this monomial survives in $\rpoly_{G}^{\kElem}(\vct{X})$ Since $\rpoly$ contains only exponents $e \leq 1$, the only degree $2\kElem$ terms that can exist in $\rpoly_{G}^\kElem$ are $\kElem$-matchings since every other monomial in $\poly_{G}^\kElem(\vct{X})$ has strictly less than $2\kElem$ distinct variables, which, as stated earlier implies that every other non-$\kElem$-matching monomial in $\rpoly_{G}^\kElem(\vct{X})$ has degree $< 2\kElem$.
Given that we then have $2\kElem + 1$ distinct values of $\rpoly_{G}^\kElem(\prob,\ldots, \prob)$ for $0\leq i\leq2\kElem$, it follows that
%we then have $2\kElem + 1$ distinct rows of the form $\prob_i^0\ldots\prob_i^{2\kElem}$ which form a matrix $M$.
we have a linear system of the form $\vec{M} \cdot \vct{c} = \vct{b}$ where the $i$th row of $\vec{M}$ is $\inparen{\prob_i^0\ldots\prob_i^{2\kElem}}$, $\vct{c}$ is the coefficient vector $\inparen{c_0,\ldots, c_{2\kElem}}$, and $\vct{b}$ is the vector such that $\vct{b}[i] = \rpoly_{G}^\kElem(\prob_i,\ldots, \prob_i)$. In other words, matrix $\vec{M}$ is the Vandermonde matrix, from which it follows that we have a matrix with full rank (since the $p_i$'s are distinct), and we can solve the linear system in $O(k^3)$ time (say using Gaussian Elimination) to determine $\vct{c}$ exactly. Thus, after $O(k^3)$ work, we know $\vct{c}$ and in particular, $c_{2k}$ exactly. Next we show why we can compute $\numocc{G}{\kmatch}$ from $c_{2k}$ in $O(1)$ additional time.
%Denote the number of $\kElem$-matchings in $G$ as $\numocc{G}{\kmatch}$.
We claim that $c_{2\kElem}$ is $\kElem! \cdot \numocc{G}{\kmatch}$. This can be seen intuitively by looking at the original factorized representation
\[\poly_{G}^\kElem(\vct{X}) = \sum_{\substack{(i_1, j_1),\\\cdots,\\(i_\kElem, j_\kElem) \in E}}X_{i_1}X_{j_1}\cdots X_{i_\kElem}X_{j_\kElem},\]
where across each of the $\kElem$ products, an arbitrary $\kElem$-matching can be selected $\prod_{i = 1}^\kElem \kElem = \kElem!$ times. Indeed, note that each $\kElem$-matching $(i_1, j_1)\ldots$ $(i_k, j_k)$ in $G$ corresponds to the monomial $\prod_{\ell = 1}^\kElem X_{i_\ell}X_{j_\ell}$ in $\poly_{G}^\kElem(\vct{X})$, where each index is distinct. %Since each index is distinct, then each variable has an exponent $e = 1$ and this monomial survives in $\rpoly_{G}^{\kElem}(\vct{X})$ Since $\rpoly$ contains only exponents $e \leq 1$, the only degree $2\kElem$ terms that can exist in $\rpoly_{G}^\kElem$ are $\kElem$-matchings since every other monomial in $\poly_{G}^\kElem(\vct{X})$ has strictly less than $2\kElem$ distinct variables, which, as stated earlier implies that every other non-$\kElem$-matching monomial in $\rpoly_{G}^\kElem(\vct{X})$ has degree $< 2\kElem$.
Further, any monomial $\prod_{\ell = 1}^\kElem X_{i_\ell}X_{j_\ell}$ of degree exactly $2k$ in $\rpoly_{G}^{\kElem}(\vct{X})$ must have all of $i_1,j_1,\dots,i_\kElem,j_\kElem$ distinct. Thus every monomial of $\poly_{G}^{\kElem}(\vct{X})$ that retains degree $2k$ in $\rpoly_{G}^{\kElem}(\vct{X})$ corresponds to a $k$-matching in $G$.
%It has already been established above that a $\kElem$-matching ($\kmatch$) has coefficient $c_{2\kElem}$. As noted, a $\kElem$-matching occurs when there are $\kElem$ edges, $e_1, e_2,\ldots, e_\kElem$, such that all of them are disjoint, i.e., $e_1 \neq e_2 \neq \cdots \neq e_\kElem$. In all $\kElem$ factors of $\poly_{G}^\kElem(\vct{X})$ there are $k$ choices from the first factor to select an edge for a given $\kElem$ matching, $\kElem - 1$ choices in the second factor, and so on throughout all the factors, yielding $\kElem!$ duplicate terms for each $\kElem$ matching in the expansion of $\poly_{G}^\kElem(\vct{X})$.
Then, since we have $\kElem!$ duplicates of each distinct $\kElem$-matching, and the fact that $c_{2\kElem}$ contains all monomials with degree $2\kElem$, it follows that $c_{2\kElem} = \kElem!\cdot\numocc{G}{\kmatch}$. This allows us to solve for $\numocc{G}{\kmatch}$ by simply dividing $c_{2\kElem}$ by $\kElem!$.
Then, since %we have $\kElem!$ duplicates of
each of the $k!$ permutations of a $\kElem$-matching does not change the $\kElem$-matching (but maps to a distinct monomial of degree exactly $2k$ in the SOP expansion of $\poly_{G}^\kElem$), we have a $k!$-to-$1$ mapping from the monomials of degree exactly $2k$ in the SOP expansion of $\poly_{G}^\kElem$ to the $\kElem$-matchings in $G$. Recalling that $c_{2k}$ counts the number of monomials of degree $2k$ in $\poly_{G}^\kElem$ proves $c_{2\kElem}= \kElem! \cdot \numocc{G}{\kmatch}$.
% and the fact that $c_{2\kElem}$ contains all monomials with degree $2\kElem$, it follows that $c_{2\kElem} = \kElem!\cdot\numocc{G}{\kmatch}$.
Thus, simply dividing $c_{2\kElem}$ by $\kElem!$ gives us $\numocc{G}{\kmatch}$, as needed. % by simply dividing $c_{2\kElem}$ by $\kElem!$.
\end{proof}
\qed
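
As an illustrative sanity check of the argument above (this small example is ours and is not part of the original proof), take $\kElem = 2$ and let $G$ consist of the two disjoint edges $(1,2)$ and $(3,4)$, assuming as in the proof that $\poly_G(\vct{X}) = \sum_{(i,j)\in E} X_iX_j$. Then
\begin{align*}
\poly_G^2(\vct{X}) &= \inparen{X_1X_2 + X_3X_4}^2 = X_1^2X_2^2 + 2X_1X_2X_3X_4 + X_3^2X_4^2,\\
\rpoly_G^2(\vct{X}) &= X_1X_2 + 2X_1X_2X_3X_4 + X_3X_4,\\
\rpoly_G^2(\prob,\ldots,\prob) &= 2\prob^2 + 2\prob^4.
\end{align*}
Hence $c_2 = 2$, $c_4 = 2$ and all other $c_i$ are $0$. Dividing $c_4$ by $2!$ gives $1$, which is indeed the number of $2$-matchings in this $G$.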

View file

@ -4,31 +4,51 @@
\subsection{Single $\prob$ value}
\label{sec:single-p}
In this discussion, let us fix $\kElem = 3$.
%In this discussion, let us fix $\kElem = 3$.
While~\cref{thm:mult-p-hard-result} shows that computing $\rpoly(\prob,\dots,\prob)$ in general is hard, it does not rule out the possibility that one can compute this value exactly for a {\em fixed} value of $p$. Indeed, it is easy to check that one can compute $\rpoly(\prob,\dots,\prob)$ exactly in linear time for $p\in \inset{0,1}$. In this section, we show that these two are the only possibilities:
\AH{@atri needs to put in the result for triangles of $\numvar^{\frac{4}{3}}$ runtime.}
\begin{Theorem}\label{th:single-p}
If we can compute $\rpoly_{G}^3(\vct{X})$ in T(\numedge) time for $X_1 =\cdots= X_\numvar = \prob$, then we can count the number of triangles, 3-paths, and 3-matchings in $G$ in $T(\numedge) + O(\numedge)$ time.
\begin{Theorem}\label{cor:single-p-hard}
Fix $p\in (0,1)$. Then, assuming~\cref{conj:graph} is true, any algorithm that computes $\rpoly_{G}^3(\prob,\dots,\prob)$ exactly has to run in time $\Omega\inparen{\abs{E(G)}^{1+\eps_0}}$, where $\eps_0$ is as defined in~\cref{conj:graph}.
\end{Theorem}
%\begin{proof}[Proof of Corollary ~\ref{cor:single-p-gen-k}]
%Consider $\poly^3_{G}$ and $\poly' = 1$ such that $\poly'' = \poly^3_{G} \cdot \poly'$. By ~\cref{th:single-p}, query $\poly''$ with $\kElem = 4$ has $\Omega(\numvar^{\frac{4}{3}})$ complexity.
%\end{proof}
The above shows the hardness for a very specific query polynomial but it is easy to come up with an infinite family of hard query polynomials by `embedding' $\rpoly_{G}^3$ into an infinite family of trivial query polynomials. However, unlike~\cref{thm:mult-p-hard-result} the above result does not show that computing $\rpoly_{G}^3(\prob,\dots,\prob)$ for a fixed $p\in (0,1)$ is \sharpwonehard. By contrast, in~\cref{sec:algo} we show that if we are willing to settle for an approximation, then this problem (and indeed our problem in a much more general setting) can be solved in linear time.
%\AH{@atri needs to put in the result for triangles of $\numvar^{\frac{4}{3}}$ runtime.}
We will prove the above result by the following reduction:
\begin{Theorem}\label{th:single-p}
Fix $p\in (0,1)$. Let $G$ be a graph on $\numedge$ edges.
If we can compute $\rpoly_{G}^3(\prob,\dots,\prob)$ exactly in $T(\numedge)$ time, then we can exactly compute $\numocc{G}{\tri}$, $\numocc{G}{\threepath}$ and $\numocc{G}{\threedis}$ %count the number of triangles, 3-paths, and 3-matchings in $G$
in $O\inparen{T(\numedge) + \numedge}$ time.
\end{Theorem}
\begin{proof}[Proof of~\cref{cor:single-p-hard}]
For the sake of contradiction, let us assume that for any $G$, we can compute $\rpoly_{G}^3(\prob,\dots,\prob)$ in $o\inparen{m^{1+\eps_0}}$ time.
Let $G$ be the input graph. It is easy to see that one can compute the expression tree for $\poly_{G}^3(\vct{X})$ in $O(m)$ time. Then by~\cref{th:single-p} we can compute $\numocc{G}{\tri}$, $\numocc{G}{\threepath}$ and $\numocc{G}{\threedis}$ in further time $o\inparen{m^{1+\eps_0}}+O(m)$. Thus, overall, the reduction takes $o\inparen{m^{1+\eps_0}}+O(m)= o\inparen{m^{1+\eps_0}}$ time, which violates~\cref{conj:graph}.
\end{proof}
\qed
Before moving on to prove~\cref{th:single-p}, let us state the results, lemmas and definitions that will be useful in the proof.
We need to list all possible edge patterns in an arbitrary $G$ consisting of $\leq 3$ distinct edges.
\subsubsection{Preliminaries and Notation}
We need to list all possible edge patterns in an arbitrary $G$ consisting of at most three distinct edges. We have already seen $\tri,\threepath$ and $\threedis$, so here we define the remaining patterns:
\begin{itemize}
\item Single Edge $\left(\ed\right)$
\item 2-path ($\twopath$)
\item 2-matching ($\twodis$)
\item Triangle ($\tri$)
\item 3-path ($\threepath$)
%\item Triangle ($\tri$)
%\item 3-path ($\threepath$)
\item 3-star ($\oneint$)--this is the graph that results when all three edges share exactly one common endpoint. The remaining endpoint for each edge is disconnected from any endpoint of the three edges.
\item Disjoint Two-Path ($\twopathdis$)--this subgraph consists of a two path and a remaining disjoint edge.
\item 3-matching ($\threedis$)--this subgraph is composed of three disjoint edges.
%\item 3-matching ($\threedis$)--this subgraph is composed of three disjoint edges.
\end{itemize}
%Let $\numocc{G}{H}$ denote the number of occurrences of pattern $H$ in graph $G$, where, for example, $\numocc{G}{\ed}$ means the number of single edges in $G$.
For any graph $G$, the following formulas compute $\numocc{G}{H}$ for their respective patterns in $O(\numedge)$ time, with $d_i$ representing the degree of vertex $i$.
For any graph $G$, the following formulas for $\numocc{G}{H}$ for their respective patterns can be used to compute them exactly in $O(\numedge)$ time, with $d_i$ representing the degree of vertex $i$ (proofs are in~\cref{app:easy-counts}):
\begin{align}
&\numocc{G}{\ed} = \numedge, \label{eq:1e}\\
&\numocc{G}{\twopath} = \sum_{i \in V} \binom{d_i}{2} \label{eq:2p}\\
@ -37,40 +57,49 @@ For any graph $G$, the following formulas compute $\numocc{G}{H}$ for their resp
&\numocc{G}{\twopathdis} + 3\numocc{G}{\threedis} = \sum_{(i, j) \in E} \binom{\numedge - d_i - d_j + 1}{2}\label{eq:2pd-3d}
\end{align}
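
To see these formulas at work on a concrete instance (a small example of ours, added purely for illustration), let $G$ have edges $(1,2)$, $(2,3)$ and $(4,5)$, so that $\numedge = 3$, $d_2 = 2$ and $d_1 = d_3 = d_4 = d_5 = 1$. Then \cref{eq:2p} gives $\numocc{G}{\twopath} = \binom{2}{2} + 4\binom{1}{2} = 1$, and the right-hand side of \cref{eq:2pd-3d} evaluates to
\begin{equation*}
\underbrace{\binom{3 - 1 - 2 + 1}{2}}_{(1,2)} + \underbrace{\binom{3 - 2 - 1 + 1}{2}}_{(2,3)} + \underbrace{\binom{3 - 1 - 1 + 1}{2}}_{(4,5)} = 0 + 0 + 1 = 1,
\end{equation*}
which agrees with $\numocc{G}{\twopathdis} + 3\numocc{G}{\threedis} = 1 + 3\cdot 0$, since the only such pattern in this $G$ is the $2$-path $(1,2),(2,3)$ together with the disjoint edge $(4,5)$.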
A quick argument to why \cref{eq:2m} is true. Note that for edge $(i, j)$ connecting arbitrary vertices $i$ and $j$, finding all other edges in $G$ disjoint to $(i, j)$ is equivalent to finding all edges that are not connected to either vertex $i$ or $j$. The number of such edges is $m - d_i - d_j + 1$, where we add $1$ since edge $(i, j)$ is removed twice when subtracting both $d_i$ and $d_j$. Since the summation is iterating over all edges such that a pair $\left((i, j), (k, \ell)\right)$ will also be counted as $\left((k, \ell), (i, j)\right)$, division by $2$ then eliminates this double counting.
%A quick argument to why \cref{eq:2m} is true. Note that for edge $(i, j)$ connecting arbitrary vertices $i$ and $j$, finding all other edges in $G$ disjoint to $(i, j)$ is equivalent to finding all edges that are not connected to either vertex $i$ or $j$. The number of such edges is $m - d_i - d_j + 1$, where we add $1$ since edge $(i, j)$ is removed twice when subtracting both $d_i$ and $d_j$. Since the summation is iterating over all edges such that a pair $\left((i, j), (k, \ell)\right)$ will also be counted as $\left((k, \ell), (i, j)\right)$, division by $2$ then eliminates this double counting.
Equation ~\ref{eq:2pd-3d} is true for similar reasons. For edge $(i, j)$, it is necessary to find two additional edges, disjoint or connected. As in ~\cref{eq:2m}, once the number of edges disjoint to $(i, j)$ have been computed, then we only need to consider all possible combinations of two edges from the set of disjoint edges, since it doesn't matter if the two edges are connected or not. Note, the factor $3$ of $\threedis$ is necessary to account for the triple counting of $3$-matchings. It is also the case that, since the two path in $\twopathdis$ is connected, that there will be no double counting by the fact that the summation automatically 'disconnects' the current edge, meaning that a two matching at the current vertex will not be counted. The sum over all such edge combinations is precisely then $\numocc{G}{\twopathdis} + 3\numocc{G}{\threedis}$.
%\cref{eq:2pd-3d} is true for similar reasons. For edge $(i, j)$, it is necessary to find two additional edges, disjoint or connected. As in ~\cref{eq:2m}, once the number of edges disjoint to $(i, j)$ have been computed, then we only need to consider all possible combinations of two edges from the set of disjoint edges, since it doesn't matter if the two edges are connected or not. Note, the factor $3$ of $\threedis$ is necessary to account for the triple counting of $3$-matchings. It is also the case that, since the two path in $\twopathdis$ is connected, that there will be no double counting by the fact that the summation automatically 'disconnects' the current edge, meaning that a two matching at the current vertex will not be counted. The sum over all such edge combinations is precisely then $\numocc{G}{\twopathdis} + 3\numocc{G}{\threedis}$.
%Original lemma proving the exact coefficient terms in qE3
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\subsubsection{The proofs}
Note that $\rpoly_{G}^3(\prob,\ldots, \prob)$ as polynomial in $\prob$ has degree at most six. Next, we figure out the exact coefficients since this would be useful in our arguments:
\begin{Lemma}\label{lem:qE3-exp}
When we expand $\poly_{G}^3(\vct{X})$ out and assign all exponents $e \geq 1$ a value of $1$, we have the following result,
%When we expand $\poly_{G}^3(\vct{X})$ out and assign all exponents $e \geq 1$ a value of $1$, we have the following result,
For any $p$, we have:
\begin{align}
&\rpoly_{G}^3(\prob,\ldots, \prob) = \numocc{G}{\ed}\prob^2 + 6\numocc{G}{\twopath}\prob^3 + 6\numocc{G}{\twodis}\prob^4 + 6\numocc{G}{\tri}\prob^3\nonumber\\
&+ 6\numocc{G}{\oneint}\prob^4 + 6\numocc{G}{\threepath}\prob^4 + 6\numocc{G}{\twopathdis}\prob^5 + 6\numocc{G}{\threedis}\prob^6.\label{claim:four-one}
\end{align}
\end{Lemma}
\begin{proof}[Proof of \cref{lem:qE3-exp}]
\begin{proof}%[Proof of \cref{lem:qE3-exp}]
By definition we have that
\[\poly_{G}(\vct{X}) = \sum_{\substack{(i_1, j_1),\\ (i_2, j_2),\\ (i_3, j_3) \in E}} \prod_{\ell = 1}^{3}X_{i_\ell}X_{j_\ell}.\]
Rather than list all the expressions in full detail, let us make some observations regarding the sum. Let $e_1 = (i_1, j_1), e_2 = (i_2, j_2), e_3 = (i_3, j_3)$. Notice that each expression in the sum consists of a triple $(e_1, e_2, e_3)$. There are three forms the triple $(e_1, e_2, e_3)$ can take.
\[\poly_{G}^3(\vct{X}) = \sum_{\substack{(i_1, j_1),\\ (i_2, j_2),\\ (i_3, j_3) \in E}} \prod_{\ell = 1}^{3}X_{i_\ell}X_{j_\ell}.\]
Hence $\rpoly_{G}^3(\vct{X})$ has degree at most six. Note that the monomial $\prod_{\ell = 1}^{3}X_{i_\ell}X_{j_\ell}$ will contribute to the coefficient of $p^i$ in $\rpoly_{G}^3(\prob,\ldots, \prob)$, where $i$ is the number of distinct variables in the monomial.
%Rather than list all the expressions in full detail, let us make some observations regarding the sum.
Let $e_1 = (i_1, j_1), e_2 = (i_2, j_2), e_3 = (i_3, j_3)$. Notice that each expression in the sum consists of a triple $(e_1, e_2, e_3)$. There are three forms the triple $(e_1, e_2, e_3)$ can take (and in each case, we will account for their contribution to $\rpoly_{G}^3(\vct{X})$).
\textsc{case 1:} $e_1 = e_2 = e_3$, where all edges are the same. There are exactly $\numedge$ such triples, each with a $\prob^2$ factor in $\rpoly_{G}\left(\prob_1,\ldots, \prob_\numvar\right)$.
\textsc{case 1:} $e_1 = e_2 = e_3$, where all edges are the same. There are exactly $\numedge=\numocc{G}{\ed}$ such triples, each with a $\prob^2$ factor in $\rpoly_{G}^3\left(\prob,\ldots, \prob\right)$.
\textsc{case 2:} This case occurs when there are two distinct edges of the three, call them $e$ and $e'$. When there are two distinct edges, there is then the occurrence when $2$ variables in the triple $(e_1, e_2, e_3)$ are bound to $e$. There are three combinations for this occurrence. It is the analogue for when there is only one occurrence of $e$, i.e. $2$ of the variables in $(e_1, e_2, e_3)$ are $e'$. Again, there are three combinations for this. All $3 + 3 = 6$ combinations of two distinct values consist of the same monomial in $\rpoly$, i.e. $(e_1, e_1, e_2)$ is the same as $(e_2, e_1, e_2)$. This case produces the following edge patterns: $\twopath, \twodis$.
\textsc{case 2:} This case occurs when there are two distinct edges of the three, call them $e$ and $e'$. When there are two distinct edges, there is then the occurrence when $2$ variables in the triple $(e_1, e_2, e_3)$ are bound to $e$. There are three combinations for this occurrence in $\poly_{G}^3(\vct{X})$. Analogously, there are three such occurrences in $\poly_{G}^3(\vct{X})$ when there is only one occurrence of $e$, i.e. $2$ of the variables in $(e_1, e_2, e_3)$ are $e'$. %Again, there are three combinations for this.
This implies that all $3 + 3 = 6$ combinations of two distinct edges $e$ and $e'$ contribute to the same monomial in $\rpoly_{G}^3$. % consist of the same monomial in $\rpoly$, i.e. $(e_1, e_1, e_2)$ is the same as $(e_2, e_1, e_2)$.
Since $e\ne e'$, this case produces the following edge patterns: $\twopath, \twodis$, which contribute $p^3$ and $p^4$ respectively to $\rpoly_{G}^3\left(\prob,\ldots, \prob\right)$.
\textsc{case 3:} $e_1 \neq e_2 \neq e_3$, i.e., when all edges are distinct. For this case, we have $3! = 6$ permutations of $(e_1, e_2, e_3)$. This case consists of the following edge patterns: $\tri, \oneint, \threepath, \twopathdis, \threedis$.
\textsc{case 3:} All of $e_1,e_2$ and $e_3$ are distinct. For this case, we have $3! = 6$ permutations of $(e_1, e_2, e_3)$, each of which contributes a different monomial to the SOP expansion of $\poly_{G}^3(\vct{X})$. This case consists of the following edge patterns: $\tri, \oneint, \threepath, \twopathdis, \threedis$, which contribute $p^3,p^4,p^4,p^5$ and $p^6$ respectively to $\rpoly_{G}^3\left(\prob,\ldots, \prob\right)$.
\end{proof}
\qed
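
As a quick check of~\cref{lem:qE3-exp} on a toy instance (our own illustration, not taken from the original proof), let $G$ be the $2$-path with edges $(1,2)$ and $(2,3)$. Then
\begin{align*}
\poly_{G}^3(\vct{X}) &= \inparen{X_1X_2 + X_2X_3}^3 = X_1^3X_2^3 + 3X_1^2X_2^3X_3 + 3X_1X_2^3X_3^2 + X_2^3X_3^3,\\
\rpoly_{G}^3(\vct{X}) &= X_1X_2 + 6X_1X_2X_3 + X_2X_3,\\
\rpoly_{G}^3(\prob,\ldots,\prob) &= 2\prob^2 + 6\prob^3,
\end{align*}
which matches the lemma, since $\numocc{G}{\ed} = 2$, $\numocc{G}{\twopath} = 1$ and every other pattern count is $0$ for this $G$.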
Since $p$ is fixed,~\cref{lem:qE3-exp} gives us one linear equation in $\numocc{G}{\tri}, \numocc{G}{\threepath}$ and $\numocc{G}{\threedis}$ (we can handle the other counts due to~\cref{eq:1e}-\cref{eq:2pd-3d}). However, we plan to generate two more independent linear equations in these three variables. Towards this end, we generate more graphs that are related to $G$:
\begin{Definition}\label{def:Gk}
For $k > 1$, let graph $\graph{k}$ be a graph generated from an arbitrary graph $\graph{1}$, by replacing every edge $e$ of $\graph{1}$ with a $k$-path, such that all $k$-path replacement edges are disjoint in the sense that they only intersect at the original intersection endpoints as seen in $\graph{1}$.
For $\ell > 1$, let graph $\graph{\ell}$ be a graph generated from an arbitrary graph $\graph{1}$, by replacing every edge $e$ of $\graph{1}$ with an $\ell$-path, such that all $\ell$-path replacement edges are disjoint. % in the sense that they only intersect at the original intersection endpoints as seen in $\graph{1}$.
\end{Definition}
Next, we relate the various sub-graph counts in $\graph{2}$ and $\graph{3}$ to their counterparts in $\graph{1}=G$.
\begin{Lemma}\label{lem:3m-G2}
The number of $3$-matchings in graph $\graph{2}$ satisfies the following identity,
\begin{align*}
@ -100,42 +129,67 @@ The number of $3$-paths in $\graph{3}$ satisfies the following identity,
\end{Lemma}
\begin{Lemma}\label{lem:tri}
For $k > 1$, any graph $\graph{k}$ has the property that $\numocc{\graph{k}}{\tri} = 0$.
For $\ell > 1$, any graph $\graph{\ell}$ has the property that $\numocc{\graph{\ell}}{\tri} = 0$.
\end{Lemma}
Due to lack of space we defer the proof of the above results to~\cref{app:hard}.
\AR{Need to do this move.}
Using the results we have obtained so far, we will prove the following reduction result:
\begin{Lemma}\label{lem:lin-sys}
Using the identities of lemmas [\ref{lem:3m-G2}, \ref{lem:3m-G3}, \ref{lem:3p-G2}, \ref{lem:3p-G3}, \ref{lem:tri}] to compute $\numocc{G}{\threedis}, \numocc{G}{\threepath}, \numocc{G}{\tri}$ for $G \in \{\graph{2}, \graph{3}\}$, there exists a linear system $\mtrix{\rpoly}\cdot (x~y~z~)^T = \vct{b}$ which can then be solved to determine the unknown quantities of $\numocc{\graph{1}}{\threedis}, \numocc{\graph{1}}{\threepath}$, and $\numocc{\graph{1}}{\tri}$.
%Using the identities of lemmas [\ref{lem:3m-G2}, \ref{lem:3m-G3}, \ref{lem:3p-G2}, \ref{lem:3p-G3}, \ref{lem:tri}] to compute $\numocc{G}{\threedis}, \numocc{G}{\threepath}, \numocc{G}{\tri}$ for $G \in \{\graph{2}, \graph{3}\}$, there exists a linear system $\mtrix{\rpoly}\cdot (x~y~z~)^T = \vct{b}$ which can then be solved to determine the unknown quantities of $\numocc{\graph{1}}{\threedis}, \numocc{\graph{1}}{\threepath}$, and $\numocc{\graph{1}}{\tri}$.
Fix $p\in (0,1)$. Given $\rpoly_{\graph{\ell}}^3(\prob,\dots,\prob)$ for $\ell\in [3]$, we can compute in $O(m)$ time a vector $\vct{b}\in\rel^3$ such that
\[ \begin{pmatrix}
1 & \prob & -(3\prob^2 - \prob^3)\\
-2(3\prob^2 - \prob^3) & -4(3\prob^2 - \prob^3) & 10(3\prob^2 - \prob^3)\\
-18(3\prob^2 - \prob^3) & -21(3\prob^2 - \prob^3) & 45(3\prob^2 - \prob^3)
\end{pmatrix}
\cdot
\begin{pmatrix}
\numocc{G}{\tri}\\
\numocc{G}{\threepath}\\
\numocc{G}{\threedis}
\end{pmatrix}
=\vct{b},
\]
from which we can compute $\numocc{G}{\tri}, \numocc{G}{\threepath}$ and $\numocc{G}{\threedis}$ in $O(1)$ time.
\end{Lemma}
\AH{I didn't think of a more appropriate name for $\vct{b}$, so I have just stuck with what Atri called it on chat.}
The above result immediately implies~\cref{th:single-p}:
\begin{proof}[Proof of~\cref{th:single-p}]
It is easy to check that in $O(m)$ time we can compute $\graph{2}$ and $\graph{3}$ from $\graph{1}=G$ (and further note that these graphs also have $O(m)$ edges). Thus,
in time $O(T(m))$, we can compute $\rpoly_{\graph{\ell}}^3(\prob,\dots,\prob)$ for $\ell\in [3]$.~\Cref{lem:lin-sys} then completes the proof.
\end{proof}
\qed
%\AH{I didn't think of a more appropriate name for $\vct{b}$, so I have just stuck with what Atri called it on chat.}
Using \cref{def:Gk} we construct graphs $\graph{2}$ and $\graph{3}$ from arbitrary graph $\graph{1}$.
We then show that for any of the patterns $\threedis, \threepath, \tri$ which are all known to be hard to compute, we can use linear combinations in terms of $\graph{1}$ from Lemmas \ref{lem:3m-G2}, \ref{lem:3m-G3}, \ref{lem:3p-G2}, \ref{lem:3p-G3}, \ref{lem:tri} to compute $\numocc{\graph{i}}{S}$, where $i$ in $\{2, 3\}$ and $S \in \{\threedis, \threepath, \tri\}$. Then, using ~\cref{claim:four-two}, \cref{lem:qE3-exp} and \cref{lem:lin-sys}, we can combine all three linear combinations into a linear system, solving for $\numocc{\graph{1}}{S}$.
%Using \cref{def:Gk} we construct graphs $\graph{2}$ and $\graph{3}$ from arbitrary graph $\graph{1}$.
%We then show that for any of the patterns $\threedis, \threepath, \tri$ which are all known to be hard to compute, we can use linear combinations in terms of $\graph{1}$ from Lemmas \ref{lem:3m-G2}, \ref{lem:3m-G3}, \ref{lem:3p-G2}, \ref{lem:3p-G3}, \ref{lem:tri} to compute $\numocc{\graph{i}}{S}$, where $i$ in $\{2, 3\}$ and $S \in \{\threedis, \threepath, \tri\}$. Then, using ~\cref{claim:four-two}, \cref{lem:qE3-exp} and \cref{lem:lin-sys}, we can combine all three linear combinations into a linear system, solving for $\numocc{\graph{1}}{S}$.
%$%^&*(
\subsubsection{More notation}
Before proceeding, let us introduce a few more helpful definitions.
\begin{Definition}\label{def:ed-nota}
For the set of edges in $\graph{k}$ we write $E_k$. For any graph $\graph{k}$, its edges are denoted by a pair $(e, b)$, such that $b \in \{0,\ldots, k-1\}$ and $e\in E_1$.
For the set of edges in $\graph{\ell}$ we write $E_\ell$. For any graph $\graph{\ell}$, its edges are denoted by a pair $(e, b)$, such that $b \in \{0,\ldots, \ell-1\}$ and $e\in E_1$, where $(e,0),\dots,(e,\ell-1)$ is the $\ell$-path that replaces the edge $e$.
\end{Definition}
\begin{Definition}[$\eset{k}$]
Given an arbitrary subgraph $S\graph{1}$ of $\graph{1}$, let $\eset{1}$ denote the set of edges in $S\graph{1}$. Define then $\eset{k}$ for $k > 1$ as the set of edges in the generated subgraph $S\graph{k}$.
\begin{Definition}[$\eset{\ell}$]
Given an arbitrary subgraph $S\graph{1}$ of $\graph{1}$, let $\eset{1}$ denote the set of edges in $S\graph{1}$. Define then $\eset{\ell}$ for $\ell > 1$ as the set of edges in the generated subgraph $S\graph{\ell}$ (i.e. when we apply~\Cref{def:Gk} to $\graph{1}=S\graph{1}$).
\end{Definition}
For example, consider $S\graph{1}$ with edges $\eset{1} = \{e_1\}$. Then the edges of $S\graph{2}$, $\eset{2} = \{(e_1, 0), (e_1, 1)\}$.
For example, consider $S\graph{1}$ with edges $\eset{1} = \{e_1\}$. Then the edge set of $S\graph{2}$ is $\eset{2} = \{(e_1, 0), (e_1, 1)\}$.
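As a further illustration (a sketch added for concreteness; the two-edge subgraph used here is hypothetical and not taken from the text), consider $S\graph{1}$ with $\eset{1} = \{e_1, e_2\}$. Applying~\cref{def:Gk} with $\ell = 3$ would give
\begin{equation*}
\eset{3} = \{(e_1, 0), (e_1, 1), (e_1, 2), (e_2, 0), (e_2, 1), (e_2, 2)\},
\end{equation*}
so that $\abs{\eset{3}} = 3\cdot\abs{\eset{1}} = 6$.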
\begin{Definition}\label{def:ed-sub}
Let $\binom{S}{t}$ denote the set of subsets in $S$ with exactly $t$ edges. In a similar manner, $\binom{S}{\leq t}$ is used to mean the subsets of $S$ with $t$ or fewer edges.
\end{Definition}
The following function $f_k$ is a mapping from every $3$-edge shape in $\graph{k}$ to its `projection' in $\graph{1}$.
The following function $f_\ell$ is a mapping from every $3$-edge shape in $\graph{\ell}$ to its `projection' in $\graph{1}$.
\begin{Definition}\label{def:fk}
Let $f_k: \binom{E_k}{3} \mapsto \binom{E_1}{\leq3}$ be defined as follows. For any $S \in \binom{E_k}{3}$, such that $S = \pbrace{(e_1, b_1), (e_2, b_2), (e_3, b_3)}$, define:
\[ f_k\left(\pbrace{(e_1, b_1), (e_2, b_2), (e_3, b_3)}\right) = \pbrace{e_1, e_2, e_3}.\]
Let $f_\ell: \binom{E_\ell}{3} \mapsto \binom{E_1}{\leq3}$ be defined as follows. For any $S \in \binom{E_\ell}{3}$, such that $S = \pbrace{(e_1, b_1), (e_2, b_2), (e_3, b_3)}$, define:
\[ f_\ell\left(\pbrace{(e_1, b_1), (e_2, b_2), (e_3, b_3)}\right) = \pbrace{e_1, e_2, e_3}.\]
\end{Definition}
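To make the projection concrete (a small worked instance under the notation of~\cref{def:ed-nota}; the particular edges are chosen only for illustration), take $\ell = 2$ and distinct $e_1, e_2, e_3 \in E_1$. Then
\begin{align*}
f_2\left(\pbrace{(e_1, 0), (e_1, 1), (e_2, 0)}\right) &= \pbrace{e_1, e_2},\\
f_2\left(\pbrace{(e_1, 0), (e_2, 1), (e_3, 1)}\right) &= \pbrace{e_1, e_2, e_3},
\end{align*}
so the image has fewer than three edges exactly when two of the chosen edges lie on the same replaced path.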
\AH{Just questioning if the notation is clear in the ~\cref{def:fk-inv}. For more details, see the immediately following todo note which is commented out.}
%\AH{Just questioning if the notation is clear in the ~\cref{def:fk-inv}. For more details, see the immediately following todo note which is commented out.}
%\AH{I found ~\cref{def:fk-inv} a bit imprecise and bulky and have attempted to refine it.
%\par Since this an inverse function, the signature is reversed, \vari{but},
%\par...the challenge is in quantifying the size of the set (of 3 edge subsets) that is returned...
@ -143,97 +197,104 @@ Let $f_k: \binom{E_k}{3} \mapsto \binom{E_1}{\leq3}$ be defined as follows. For
%\par...but the catch is that for $r \geq 3$, the set will be strictly less than $\binom{r\cdot k}{3}$ since $f_k$ does not map e.g. an input $\{(e_a, b_1), (e_a, b_2), (e_a, b_3)\}$ (where $a$ is constant and $b_1, b_2, b_3 \in \{0,\ldots, k -1\}$) to more than one edge, \textit{and} it is the case for $r \geq 3$ that $f_k^{-1}$ will not map such an input to its input of size $r$, meaning we must subtract off all such subsets of $\binom{E_k}{3}$.
%\par My fix was to use a variable in the exponent and explain in prose. Perhaps there is a better, simpler notation/solution.}
\begin{Definition}[$f_k^{-1}$]\label{def:fk-inv}
The inverse function $f_k^{-1}: \binom{E_1}{\leq 3}\mapsto \left\{\binom{E_k}{3}\right\}^{h}$ takes an arbitrary $\eset{1}$ of at most $3$ edges and outputs the set of all subsets of $\binom{\eset{k}}{3}$ such that each subset $s^{(k)}$ of the output set is mapped to the input set $s^{(1)}$ by $f_k$, i.e. $f_k(s^{(k)}) = s^{(1)}$. The set returned by $f_k^{-1}$ is of size $h$, where $h$ depends on $\abs{s^{(1)}}$, such that $h \leq \binom{\abs{s^{(1)}} \cdot k}{3}$.
\begin{Definition}[$f_\ell^{-1}$]\label{def:fk-inv}
The inverse function $f_\ell^{-1}: \binom{E_1}{\leq 3}\mapsto 2^{\binom{E_\ell}{3}}$ takes an arbitrary subset $S\subseteq E_1$ of at most $3$ edges and outputs the set of all subsets of $\binom{\eset{\ell}}{3}$ such that each set $T\in f_\ell^{-1}(S)$ is mapped to the input set $S$ by $f_\ell$, i.e. $f_\ell(T) = S$. %The set returned by $f_\ell^{-1}$ is of size $h$, where $h$ depends on $\abs{s^{(1)}}$, such that $h \leq \binom{\abs{s^{(1)}} \cdot \ell}{3}$.
\end{Definition}
Note, importantly, that when we discuss $f_k^{-1}$, that, although potentially counterintuitive, each \textit{edge} present in $s^{(1)}$ must have an edge in $s^{(k)}$ that `projects` down to it. \textit{Meaning}, if $|s^{(1)}| = 3$, then it must be the case that each $s^{(k)}$ be a set $\{ (e_i, b), (e_j, b), e_\ell, b) \}$ where $i \neq j \neq \ell$.
Note, importantly, that for $f_\ell^{-1}$ each \textit{edge} present in $S$ must have an edge in $T\in f_\ell^{-1}(S)$ that `projects' down to it. In particular, if $|S| = 3$, then each $T\in f_\ell^{-1}(S)$ is given by $T=\{ (e_i, b_i), (e_j, b_j), (e_m, b_m) \}$ where $i,j$ and $m$ are distinct.
We first note that $f_\ell$ is well-defined:
\begin{Lemma}\label{lem:fk-func}
$f_\ell$ is a function.
\end{Lemma}
\begin{proof}[Proof of Lemma \ref{lem:fk-func}]
Note that $f_k$ is properly defined. For any $S \in \binom{E_k}{3}$, $|f(S)| \leq 3$, since it has to be the case that any subset of $3$ edges in $E_k$ will map to at most 3 edges in $\graph{1}$. All mappings are in the required range. Then, since for any $b \in \{0,\ldots, k-1\}$ the edge $(e, b) \mapsto e$ is a mapping for which $(e, b)$ maps to no other edge than $e$, and this implies that $f_k$ is a function.
\begin{proof}%[Proof of Lemma \ref{lem:fk-func}]
Note that $f_\ell$ is properly defined. For any $S \in \binom{E_\ell}{3}$, $|f_\ell(S)| \leq 3$, since any subset of three edges in $E_\ell$ maps to at most three edges in $E_1$, so all images are in the required range. Moreover, for any $b \in \{0,\ldots, \ell-1\}$ the map $(e, b) \mapsto e$ is a function, which
implies that $f_\ell$ is a function.
\end{proof}
\qed
\AR{TODO for {\em later}: I think the proof will be much easier to follow with figures: just drawing out $S\times \{0,1\}$ along with the $(e_i,b_i)$ explicitly notated on the edges would help.}
\subsubsection{Three Matchings in $\graph{2}$}
We are now ready to prove the structural lemmas. Note that $f_\ell$ maps subsets of three edges in $\graph{\ell}$ to a subset of at most three edges in $E_1$. To prove the structural lemmas, we will use the map $f_\ell^{-1}$. In particular, to count the number of occurrences of $\tri,\threepath,\threedis$ in $\graph{\ell}$ we count for each $S\in\binom{E_1}{\le 3}$, how many of $\tri/\threepath/\threedis$ subgraphs appear in $f_\ell^{-1}(S)$.
\begin{proof}[Proof of Lemma \ref{lem:3m-G2}]
For each edge pattern $S$, we count the number of $3$-matchings in the $3$-edge subgraphs of $\graph{2}$ in $f_2^{-1}(S)$. We start with $S \in \binom{E_1}{3}$, where $S$ is composed of the edges $e_1, e_2, e_3$ and $f_2^{-1}(S)$ is the set of all $3$-edge subsets of the set
%\subsubsection{Three Matchings in $\graph{2}$}
\subsubsection{Proof of Lemma \ref{lem:3m-G2}}
For each subset $\eset{1}\in \binom{E_1}{\le 3}$, we count the number of $3$-matchings in the $3$-edge subgraphs of $\graph{2}$ in $f_2^{-1}(\eset{1})$. We first consider the case of $\eset{1} \in \binom{E_1}{3}$, where $\eset{1}$ is composed of the edges $e_1, e_2, e_3$ and $f_2^{-1}(\eset{1})$ is the set of all $3$-edge subsets of the set
\begin{equation*}
\{(e_1, 0), (e_1, 1), (e_2, 0), (e_2, 1), (e_3, 0), (e_3, 1)\}.
\end{equation*}
We do a case analysis based on the `shape' of $\eset{1}$:
\begin{itemize}
\item $3$-matching ($\threedis$)
\end{itemize}
Consider the $\eset{1} = \threedis$ pattern. Note that edges in $\eset{2}$ are {\em not} disjoint only for the pairs $(e_i, 0), (e_i, 1)$ for $i\in \{1,2,3\}$. All subsets for $b_1, b_2, b_3 \in \{0, 1\}$, $(e_1, b_1), (e_2, b_2), (e_3, b_3)$ will compose a 3-matching. One can see that we have a total of two possible choices for each edge $e_i$ in $\graph{1}$ yielding $2^3 = 8$ possible 3-matchings in $f_2^{-1}(S)$.
When $\eset{1} \equiv \threedis$, note that edges in $\eset{2}$ are {\em not} disjoint only for the pairs $(e_i, 0), (e_i, 1)$ for $i\in \{1,2,3\}$. For all choices of $b_1, b_2, b_3 \in \{0, 1\}$, the edges $(e_1, b_1), (e_2, b_2), (e_3, b_3)$ compose a $3$-matching. One can see that we have a total of two possible choices for $b_i$ for each edge $e_i$ in $\graph{1}$, yielding $2^3 = 8$ possible $3$-matchings in $f_2^{-1}(\eset{1})$.
\begin{itemize}
\item Disjoint Two-Path ($\twopathdis$)
\end{itemize}
For $\eset{1} = \twopathdis$ edges $e_2, e_3$ form a $2$-path with $e_1$ being disjoint. This means that $(e_2, 0), (e_2, 1), (e_3, 0), (e_3, 1)$ form a $4$-path while $(e_1, 0), (e_1, 1)$ is its own disjoint $2$-path. We can only pick either $(e_1, 0)$ or $(e_1, 1)$ from $f_2^{-1}(S)$, and then we need to pick a $2$-matching from $e_2$ and $e_3$. Note that a four path allows there to be 3 possible 2 matchings, specifically,
For $\eset{1} \equiv \twopathdis$, edges $e_2, e_3$ form a $2$-path with $e_1$ being disjoint. This means that $(e_2, 0), (e_2, 1), (e_3, 0), (e_3, 1)$ form a $4$-path while $(e_1, 0), (e_1, 1)$ is its own disjoint $2$-path. We can pick only one of $(e_1, 0)$ or $(e_1, 1)$ for a set in $f_2^{-1}(\eset{1})$, and then we need to pick a $2$-matching from $e_2$ and $e_3$. Note that the $4$-path admits $3$ possible $2$-matchings, specifically,
\begin{equation*}
\pbrace{(e_2, 0), (e_3, 0)}, \pbrace{(e_2, 0), (e_3, 1)}, \pbrace{(e_2, 1), (e_3, 1)}.
\end{equation*}
Since these two selections can be made independently, there are $2 \cdot 3 = 6$ choices for $3$-matchings in $f_2^{-1}(S)$.
Since these two selections can be made independently, there are $2 \cdot 3 = 6$ choices for $3$-matchings in $f_2^{-1}(\eset{1})$.
\begin{itemize}
\item $3$-star ($\oneint$)
\end{itemize}
When $\eset{1} = \oneint$, the inner edges $(e_i, 1)$ of $\eset{2}$ are all connected, and the outer edges $(e_i, 0)$ are all disjoint. Note that for a valid 3 matching it must be the case that at most one inner edge can be part of the set of disjoint edges. When exactly one inner edge is chosen, there are 3 such possibilities. The remaining possible 3-matching occurs when all 3 outer edges are chosen. Thus, there are $3 + 1 = 4$ 3-matchings in $f_2^{-1}(S)$.
When $\eset{1} \equiv \oneint$, the inner edges $(e_i, 1)$ of $\eset{2}$ are all connected, and the outer edges $(e_i, 0)$ are all disjoint. Note that for a valid 3 matching it must be the case that at most one inner edge can be part of the set of disjoint edges. When exactly one inner edge is chosen, there are 3 such possibilities. The remaining possible 3-matching occurs when all 3 outer edges are chosen. Thus, there are $3 + 1 = 4$ many 3-matchings in $f_2^{-1}(\eset{1})$.
\begin{itemize}
\item $3$-path ($\threepath$)
\end{itemize}
When $\eset{1} =\threepath$ it is the case that all edges beginning with $e_1$ and ending with $e_3$ are successively connected. This means that the edges of $\eset{2}$ form a $6$-path in the edges of $f_2^{-1}(S)$, where all edges from $(e_1, 0),\ldots,(e_3, 1)$ are successively connected. For a $3$-matching to exist, there must be at least one edge separating edges picked from a sequence. There are four such possibilities: $\pbrace{(e_1, 0), (e_2, 0), (e_3, 0)}, \pbrace{(e_1, 0), (e_2, 0), (e_3, 1)}, \pbrace{(e_1, 0), (e_2, 1), (e_3, 1)},$\newline $\pbrace{(e_1, 1), (e_2, 1), (e_3, 1)}$ . Thus, there are four possible 3-matchings in $f_2^{-1}(S)$.
When $\eset{1} \equiv\threepath$ it is the case that all edges beginning with $e_1$ and ending with $e_3$ are successively connected. This means that the edges of $\eset{2}$ form a $6$-path, where all edges from $(e_1, 0),\ldots,(e_3, 1)$ are successively connected. For a set in $f_2^{-1}(\eset{1})$ to be a $3$-matching, no two of its edges may be consecutive on this path; in particular, we cannot pick $(e_i,1)$ together with $(e_{i+1},0)$.
There are four such possibilities: $\pbrace{(e_1, 0), (e_2, 0), (e_3, 0)}, \pbrace{(e_1, 0), (e_2, 0), (e_3, 1)}, \pbrace{(e_1, 0), (e_2, 1), (e_3, 1)},$\newline $\pbrace{(e_1, 1), (e_2, 1), (e_3, 1)}$ . Thus, there are four possible 3-matchings in $f_2^{-1}(\eset{1})$.
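As a sanity check on this count (an added remark, relying on the standard fact that a path with $n$ edges contains $\binom{n-k+1}{k}$ sets of $k$ pairwise non-adjacent edges), the $6$-path formed by $\eset{2}$ contains
\begin{equation*}
\binom{6-3+1}{3} = \binom{4}{3} = 4
\end{equation*}
$3$-matchings. None of them can use both $(e_i, 0)$ and $(e_i, 1)$, which are adjacent, so each uses exactly one edge per $e_i$ and therefore lies in $f_2^{-1}(\eset{1})$.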
\begin{itemize}
\item Triangle ($\tri$)
\end{itemize}
For $\eset{1} = \tri$, note that it is the case that the edges in $\eset{2}$ are connected in a successive manner, but this time in a cycle, such that $(e_1, 0)$ and $(e_3, 1)$ are also connected. While this is similar to the discussion of the three path above, the first and last edges are not disjoint, since they are connected. This rules out both subsets of $(e_1, 0), (e_2, 0), (e_3, 1)$ and $(e_1, 0), (e_2, 1), (e_3, 1)$ leaving us with $2$ remaining edge combinations that produce a 3 matching.
For $\eset{1} \equiv \tri$, note that it is the case that the edges in $\eset{2}$ are connected in a successive manner, but this time in a cycle, such that $(e_1, 0)$ and $(e_3, 1)$ are also connected. While this is similar to the discussion of the three path above, the first and last edges are not disjoint, since they are connected. This rules out both subsets of $(e_1, 0), (e_2, 0), (e_3, 1)$ and $(e_1, 0), (e_2, 1), (e_3, 1)$ leaving us with $2$ remaining edge combinations that produce a 3 matching.
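The same kind of check applies to the cycle (again an added remark, using the standard count $\frac{n}{n-k}\binom{n-k}{k}$ of $k$-matchings in a cycle with $n$ edges). For the $6$-cycle formed by $\eset{2}$ this gives
\begin{equation*}
\frac{6}{6-3}\binom{6-3}{3} = 2,
\end{equation*}
which would correspond to the two alternating selections $\pbrace{(e_1, 0), (e_2, 0), (e_3, 0)}$ and $\pbrace{(e_1, 1), (e_2, 1), (e_3, 1)}$.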
Let us now consider when $S \in \binom{E_1}{\leq 2}$, i.e. patterns among
\begin{itemize}
\item $2$-matching ($\twodis$), $2$-path ($\twopath$), $1$ edge ($\ed$)
\end{itemize}
Let us also consider when $S \in \binom{E_1}{\leq 2}$. When $|S| = 2$, we can only pick one from each of two pairs, $\pbrace{(e_1, 0), (e_1, 1)}$ and $\pbrace{(e_2, 0), (e_2, 1)}$. This implies that a $3$-matching cannot exist in $f_2^{-1}(S)$. The same argument holds for $|S| = 1$, where we can only pick one edge from the pair $\pbrace{(e_1, 0), (e_1, 1)}$, thus no $3$-matching exists in $f_2^{-1}(S)$.
When $|\eset{1}| = 2$, we can only pick one from each of two pairs, $\pbrace{(e_1, 0), (e_1, 1)}$ and $\pbrace{(e_2, 0), (e_2, 1)}$. This implies that a $3$-matching cannot exist in $f_2^{-1}(\eset{1})$. The same argument holds for $|\eset{1}| = 1$, where we can only pick one edge from the pair $\pbrace{(e_1, 0), (e_1, 1)}$, thus no $3$-matching exists in $f_2^{-1}(\eset{1})$.
Observe that all of the arguments above focused solely on the shape/pattern of $S$. In other words, all $S$ of a given shape yield the same number of $3$-matchings, and this is why we get the required identity.
\end{proof}
\qed
Observe that all of the arguments above focused solely on the shape/pattern of $\eset{1}$. In other words, all $\eset{1}$ of a given shape yield the same number of $3$-matchings in $f_2^{-1}(\eset{1})$, and this is why we get the required identity using the above case analysis.
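Collecting the counts from the case analysis (this is only a restatement of the cases above, written under the assumption that the statement of~\cref{lem:3m-G2} lists the patterns in this form), the contributions sum to
\begin{align*}
\numocc{\graph{2}}{\threedis} = 8\cdot\numocc{\graph{1}}{\threedis} + 6\cdot\numocc{\graph{1}}{\twopathdis} + 4\cdot\numocc{\graph{1}}{\oneint} + 4\cdot\numocc{\graph{1}}{\threepath} + 2\cdot\numocc{\graph{1}}{\tri}.
\end{align*}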
%\end{proof}
%\qed
\subsubsection{Three matchings in $\graph{3}$}
\subsubsection{Proof of~\cref{lem:3m-G3}}
\begin{proof}[Proof of Lemma \ref{lem:3m-G3}]
For any $S \in \binom{E_1}{\leq3}$, we again then count the number of $3$-matchings in $f_3^{-1}(S)$.
For any $\eset{1} \in \binom{E_1}{\leq3}$, we again count the number of $3$-matchings in $f_3^{-1}(\eset{1})$ via a case analysis:
\begin{itemize}
\item $1$ edge ($\ed$)
\end{itemize}
When $\eset{1} = \ed$, $f_3^{-1}(\eset{1})$ has one subset, $(e_1, 0), (e_1, 1), (e_1, 2)$, which clearly does not contain a $3$-matching. Thus there are no $3$-matchings in $f_3^{-1}(\eset{1})$ for this case.
When $\eset{1} \equiv \ed$, $f_3^{-1}(\eset{1})$ has one subset, $(e_1, 0), (e_1, 1), (e_1, 2)$, which clearly does not contain a $3$-matching. Thus there are no $3$-matchings in $f_3^{-1}(\eset{1})$ for this case.
\begin{itemize}
\item $2$-path ($\twopath$)
\end{itemize}
Fix then $\eset{1} = \twopath$ and now we have all edges in $\eset{3}$ form a $6$-path, and similar to the discussion in the proof of \cref{lem:3m-G2} (when $eset{1} = \threepath$ in $\graph{2}$), this leads to $4$ $3$-matchings in $f_3^{-1}(\eset{1})$.
When $\eset{1} \equiv \twopath$, all edges in $\eset{3}$ form a $6$-path, and similar to the discussion in the proof of \cref{lem:3m-G2} (when $\eset{1} \equiv \threepath$ in $\graph{2}$), this leads to four $3$-matchings in $f_3^{-1}(\eset{1})$.
\begin{itemize}
\item $2$-matching ($\twodis$)
\end{itemize}
For $\eset{1} = \twodis$, all edges of $\eset{3}$ are predicated on the fact that $(e_i, b)$ is disjoint with $(e_j, b)$ for $i \neq j\in \{1,2\}$ and $b \in \{0, 1, 2\}$. Pick an aribitrary $e_i$ and note, that $(e_i, 0), (e_i, 2)$ is a $2$-matching, which can combine with any of the $3$ edges in $(e_j, 0), (e_j, 1), (e_j, 2)$ again for $i \neq j$. Since the selections are independent, it follows that there exist $2 \cdot 3 = 6$ $3$-matchings in $f_3^{-1}(\eset{1})$.
For $\eset{1} \equiv \twodis$, every edge $(e_i, b)$ is disjoint from every edge $(e_j, b')$ for $i \neq j\in \{1,2\}$ and $b, b' \in \{0, 1, 2\}$. Pick an arbitrary $e_i$ and note that $\pbrace{(e_i, 0), (e_i, 2)}$ is the only $2$-matching among the edges of $e_i$; it can be combined with any of the $3$ edges $(e_j, 0), (e_j, 1), (e_j, 2)$ for $j \neq i$. Since there are $2$ choices for $e_i$ and the selections are independent, there exist $2 \cdot 3 = 6$ many $3$-matchings in $f_3^{-1}(\eset{1})$.
Now, we consider the 3-edge subgraphs of $\graph{1}$, starting with $\eset{1} = \tri$.
\begin{itemize}
\item Triangle ($\tri$)
\end{itemize}
Now, we consider the 3-edge subgraphs of $\graph{1}$, starting with $\eset{1} = \tri$. As discussed in proof of \cref{lem:3m-G2} for the case of $\tri$, the edges of $\eset{3}$ are a cyclic sequence, and we must be careful not to pair $(e_1, 0)$ with $(e_3, 2)$ in a $3$-matching. For any $s \in f_3^{-1}(S)$, $s$ is a $3$-matching when we have that for the edges $(e_1, b_1), (e_2, b_2), (e_3, b_3)$ where $b_1, b_2, b_3 \in \{0, 1, 2\}$, such that, for all $i \in [3]$ it is the case that if $b_i = 2$ then $b_{i \mod{3} + 1} \neq 0$. Iterating through all possible combinations, we have
As discussed in the proof of \cref{lem:3m-G2} for the case of $\tri$, the edges of $\eset{3}$ form a cyclic sequence, and we must be careful not to pair $(e_1, 0)$ with $(e_3, 2)$ in a $3$-matching. Any $T \in f_3^{-1}(\eset{1})$ consists of edges $(e_1, b_1), (e_2, b_2), (e_3, b_3)$ with $b_1, b_2, b_3 \in \{0, 1, 2\}$, and $T$ is a $3$-matching exactly when, for all $i \in [3]$, $b_i = 2$ implies $b_{(i \bmod 3) + 1} \neq 0$. Iterating through all possible choices for $b_1$, we have
\begin{itemize}
\item \textsc{$(e_1, 0)$}
\item For \textsc{$(e_1, 0)$}, there are five possibilities:
\begin{itemize}
\item $\pbrace{(e_1, 0), (e_2, 0), (e_3, 0)}$
\item $\pbrace{(e_1, 0), (e_2, 0), (e_3, 1)}$
@ -241,13 +302,13 @@ Now, we consider the 3-edge subgraphs of $\graph{1}$, starting with $\eset{1} =
\item $\pbrace{(e_1, 0), (e_2, 1), (e_3, 1)}$
\item $\pbrace{(e_1, 0), (e_2, 2), (e_3, 1)}$
\end{itemize}
\item \textsc{$(e_1, 1)$}
\item For \textsc{$(e_1, 1)$}, there are eight possibilities:
\begin{itemize}
\item $\pbrace{(e_1, 1), (e_2, 0), (e_3, 0)}, \ldots\pbrace{(e_1, 1), (e_2, 1), (e_3, 2)}$
\item $\pbrace{(e_1, 1), (e_2, 2), (e_3, 1)}$
\item $\pbrace{(e_1, 1), (e_2, 2), (e_3, 2)}$
\end{itemize}
\item \textsc{$(e_1, 2)$}
\item For \textsc{$(e_1, 2)$}, there are five possibilities:
\begin{itemize}
\item $\pbrace{(e_1, 2), (e_2, 1), (e_3, 0)}$
\item $\pbrace{(e_1, 2), (e_2, 1), (e_3, 1)}$
@ -256,12 +317,12 @@ Now, we consider the 3-edge subgraphs of $\graph{1}$, starting with $\eset{1} =
\item $\pbrace{(e_1, 2), (e_2, 2), (e_3, 2)}$
\end{itemize}
\end{itemize}
for a total of 18 3-matchings in $f_3^{-1}(\eset{1})$.
for a total of 18 many 3-matchings in $f_3^{-1}(\eset{1})$.
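The total can be cross-checked by complementary counting (an added check, not part of the original argument). There are $3^3 = 27$ choices of $(b_1, b_2, b_3)$, and a choice fails to be a $3$-matching exactly when $b_i = 2$ and $b_{(i \bmod 3) + 1} = 0$ for some $i \in [3]$. For each $i$ there are $3$ such choices (the remaining index is free), and no choice violates the condition for two different $i$, since that would force some $b_j$ to equal both $0$ and $2$. Hence
\begin{equation*}
27 - 3\cdot 3 = 18 = 5 + 8 + 5.
\end{equation*}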
\begin{itemize}
\item $3$-path ($\threepath$)
\end{itemize}
Consider when $\eset{1} = \threepath$ and all edges in $\eset{3}$ are successively connected to form a $9$-path. Since $(e_1, 0)$ is disjoint to $(e_3, 2)$, both of these edges can exist in a $3$-matching. This relaxation yields 3 other 3-matchings that couldn't be counted in the case of the $\eset{1} = \tri$, namely
When $\eset{1} \equiv \threepath$, all edges in $\eset{3}$ are successively connected to form a $9$-path. Since $(e_1, 0)$ is disjoint from $(e_3, 2)$, both of these edges can appear in a $3$-matching. This relaxation yields $3$ further $3$-matchings that could not be counted in the case of $\eset{1} \equiv \tri$, namely
\begin{equation*}
\pbrace{(e_1, 0), (e_2, 0), (e_3, 2)},\pbrace{(e_1, 0), (e_2, 1), (e_3, 2)}, \pbrace{(e_1, 0), (e_2, 2), (e_3, 2)}.
\end{equation*}
@ -274,47 +335,42 @@ Assume $\eset{1} = \twopathdis$, then the edges of $\eset{3}$ have successive co
\begin{equation*}
\pbrace{(e_2, 0), (e_3, 0)},\ldots, \pbrace{(e_2, 1), (e_3, 2)}, \pbrace{(e_2, 2), (e_3, 1)}, \pbrace{(e_2, 2), (e_3, 2)}.
\end{equation*}
These matchings can be paired independently with either of the $3$ remaining edges of $(e_1, b)$, for a total of $8 \cdot 3 = 24$ 3-matchings in $f_3^{-1}(\eset{1})$.
These matchings can be paired independently with any of the $3$ remaining edges $(e_1, b)$, for a total of $8 \cdot 3 = 24$ many $3$-matchings in $f_3^{-1}(\eset{1})$.
\begin{itemize}
\item $3$-star ($\oneint$)
\end{itemize}
Given $\eset{1} = \oneint$, the edges of $\eset{3}$ are restricted such that the outer edges $(e_i, 0)$ are disjoint from another, the middle edges $(e_i, 1)$ are also disjoint to each other, and only the inner edges $(e_i, 2)$ intersect with one another at exactly one common endpoint. To be precise, any outer edge $(e_i, 0)$ is disjoint to every middle edge $(e_j, 1)$ for $i \neq j$. As previously mentioned in the proof of \cref{lem:3m-G2}, at most one inner edge may appear in a $3$-matching. For arbitrary inner edge $(e_i, 2)$, we have $4$ combinations of the middle and outer edges of $e_j, e_k$, where $i \neq j \neq k$. These choices are independent and we have $4 \cdot 3 = 12$ 3-matchings. We are not done yet, as we need to consider the middle and outer edge combinations. Notice that for each $e_i$, we have $2$ choices, i.e. a middle or outer edge, contributing $2^3 = 8$ additional $3$-matchings, for a total of $8 + 12 = 20$ $3$-matchings in $f_3^{-1}(\eset{1})$.
When $\eset{1} \equiv \oneint$, the outer edges $(e_i, 0)$ of $\eset{3}$ are disjoint from one another, the middle edges $(e_i, 1)$ are also disjoint from each other, and only the inner edges $(e_i, 2)$ intersect one another, at exactly one common endpoint. To be precise, any outer edge $(e_i, 0)$ is disjoint from every middle edge $(e_j, 1)$ for $i \neq j$. As previously mentioned in the proof of \cref{lem:3m-G2}, at most one inner edge may appear in a $3$-matching. For an arbitrary inner edge $(e_i, 2)$, we have $4$ combinations of the middle and outer edges of $e_j, e_m$, where $i, j$ and $m$ are distinct. These choices are independent, and with $3$ choices for the inner edge we have $4 \cdot 3 = 12$ many $3$-matchings. We are not done yet, as we need to consider the combinations using only middle and outer edges. Notice that for each $e_i$, we have $2$ choices, i.e. a middle or outer edge, contributing $2^3 = 8$ additional $3$-matchings, for a total of $8 + 12 = 20$ many $3$-matchings in $f_3^{-1}(\eset{1})$.
\begin{itemize}
\item $3$-matching ($\threedis$)
\end{itemize}
Given $\eset{1} = \threedis$ subgraph, we have the case that all edges in $\eset{3}$ have the property that $(e_i, b)$ is disjoint to $(e_j, b)$ for $i \neq j$. For each $e_i$, there are then $3$ choices, independent of each other, and it results that there are $3^3 = 27$ 3-matchings in $f_3^{-1}(\eset{1})$.
When $\eset{1} \equiv \threedis$, every edge $(e_i, b_i)$ in $\eset{3}$ is disjoint from $(e_j, b_j)$ for $i \neq j$. For each $e_i$, there are then $3$ choices, independent of each other, so there are $3^3 = 27$ many $3$-matchings in $f_3^{-1}(\eset{1})$.
All of the observations above focused only on the shape of $S$, and since we see that for fixed $S$, we have a fixed number of $3$-matchings, this implies the identity.
\end{proof}
\qed
All of the observations above focused only on the shape of $\eset{1}$, and since we see that for fixed $\eset{1}$, we have a fixed number of $3$-matchings, this implies the identity.
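Collecting the counts from the case analysis (again only a restatement of the cases above, under the assumption that the statement of~\cref{lem:3m-G3} is organized by pattern in this way), the contributions sum to
\begin{align*}
\numocc{\graph{3}}{\threedis} = {} & 4\cdot\numocc{\graph{1}}{\twopath} + 6\cdot\numocc{\graph{1}}{\twodis} + 18\cdot\numocc{\graph{1}}{\tri} + 21\cdot\numocc{\graph{1}}{\threepath}\\
& + 24\cdot\numocc{\graph{1}}{\twopathdis} + 20\cdot\numocc{\graph{1}}{\oneint} + 27\cdot\numocc{\graph{1}}{\threedis}.
\end{align*}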
%\end{proof}
%\qed
\subsubsection{Three Paths in $\graph{2}$}
Computing the number of 3-paths in $\graph{2}$ and $\graph{3}$ consists of much simpler linear combinations.
\subsubsection{Proof of~\cref{lem:3p-G2}}
For $\mathcal{P} \in f_2^{-1}\inparen{\eset{1}}$ such that $\mathcal{P}$ is a $3$-path, it \textit{must} be the case by definition of $f_2$ that (i) every edge in $f_2(\mathcal{P})$ is the image of at least one edge in $\mathcal{P}$, and recall that (ii) $\mathcal{P}$ is connected. These constraints rule out every pattern $\eset{1}$ consisting of $3$ edges (it can be verified that for each three-edge pattern at least one of (i) or (ii) is violated), as well as $\eset{1} \equiv \twodis$. For $\eset{1} \equiv \ed$, note that $\eset{1}$ doesn't have enough edges to have any output in $f_2^{-1}(\eset{1})$, i.e., there exists no $\mathcal{P} \in \binom{E_2}{3}$ such that $f_2(\mathcal{P}) = \eset{1}$. The only surviving pattern is $\eset{1} \equiv \twopath$, where the edges of $\eset{2}$ have successive connectivity from $(e_1, 0)$ to $(e_2, 1)$. There are then two $3$-paths sharing edges $e_1$ and $e_2$ in $f_2^{-1}(\eset{1})$, namely $\pbrace{(e_1, 0), (e_1, 1), (e_2, 0)}$ and $\pbrace{(e_1, 1), (e_2, 0), (e_2, 1)}$.
All of the observations above focused only on the shape of $\eset{1}$, and since we see that for fixed $\eset{1}$, we have a fixed number of $3$-paths, this implies the identity.
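Written out (a restatement of the single surviving case, assuming~\cref{lem:3p-G2} is stated in this form), the count of $3$-paths in $\graph{2}$ is
\begin{equation*}
\numocc{\graph{2}}{\threepath} = 2\cdot\numocc{\graph{1}}{\twopath}.
\end{equation*}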
%\end{proof}
%\qed
\subsubsection{Proof of~\cref{lem:3p-G3}}
The argument follows along the same lines as in the proof of \cref{lem:3p-G2}. Given a $3$-path $\mathcal{P} \in f_3^{-1}\inparen{\eset{1}}$, it \textit{must} be that every edge in $f_3(\mathcal{P})$ has at least one edge in $\mathcal{P}$ mapped to it (and $\mathcal{P}$ is connected). Notice again that this cannot be the case for any $\eset{1} \in \binom{E_1}{3}$, nor is it the case when $\eset{1} \equiv \twodis$. This leaves us with two patterns, $\eset{1} \equiv \twopath$ and $\eset{1} \equiv \ed$. For the former, we have two $3$-paths across $e_1$ and $e_2$, namely $\pbrace{(e_1, 1), (e_1, 2), (e_2, 0)}$ and $\pbrace{(e_1, 2), (e_2, 0), (e_2, 1)}$. For the latter pattern $\ed$, it is trivial to see that an edge in $\graph{1}$ becomes a $3$-path in $\graph{3}$, and this proves the identity.
All of the observations above focused only on the shape of $\eset{1}$, and since we see that for fixed $\eset{1}$, we have a fixed number of $3$-paths, this implies the identity.
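Written out (a restatement of the two surviving cases, assuming~\cref{lem:3p-G3} is stated in this form), the count of $3$-paths in $\graph{3}$ is
\begin{equation*}
\numocc{\graph{3}}{\threepath} = \numocc{\graph{1}}{\ed} + 2\cdot\numocc{\graph{1}}{\twopath}.
\end{equation*}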
%\end{proof}
%\qed
\begin{proof}[Proof of Lemma \ref{lem:3p-G2}]
For $\mathcal{P} \subseteq \eset{2}$ such that $\mathcal{P} $ is a $3$-path, it \textit{must} be the case by definition of $f$ that all edges in $f_2(\mathcal{P} )$ have at least one mapping from an edge in $\mathcal{P} $ (and recall that $\mathcal{P} $ is connected). This constraint rules out every pattern $\eset{1}$ consisting of $3$ edges, as well as when $\eset{1} = \twodis$. For $\eset{1} = \ed$, note that $\eset{1}$ doesn't have enough edges to have any output in $f_2^{-1}(\eset{1})$, i.e., there exists no $s \in \binom{E_2}{3}$ such that $f_2(\mathcal{P} ) = \eset{1}$. The only surviving pattern is $\eset{1} = \twopath$, where the edges of $\eset{2}$ have successive connectivity from $(e_1, 0)$ to $(e_2, 1)$. There are then $2$ $3$-paths sharing edges $e_1$ and $e_2$ in $f_2^{-1}(\eset{1}), \pbrace{(e_1, 0), (e_1, 1), (e_2, 0)} \text{ and }\pbrace{(e_1, 1), (e_2, 0), (e_2, 1)}$.
\end{proof}
\qed
\subsubsection{Proof of~\cref{lem:tri}}
\subsubsection{Three Paths in $\graph{3}$}
\begin{proof}[Proof of Lemma \ref{lem:3p-G3}]
The argument follows along the same lines as in the proof of \cref{lem:3p-G2}. Given $\mathcal{P} \subseteq \eset{3}$, it \textit{must} be that every edge in $f_3(\mathcal{P})$ has at least one edge in $\mathcal{P}$ mapped to it (and $\mathcal{P}$ is connected). Notice again that this cannot be the case for any $\eset{1} \in \binom{E_1}{3}$, nor is it the case when $\eset{1} = \twodis$. This leaves us with two patterns, $\eset{1} = \twopath$ and $\eset{1} = \ed$. For the former, it is the case that we have $2$ $3$-paths across $e_1$ and $e_2$, $\pbrace{(e_1, 1), (e_1, 2), (e_2, 0)}$ and $\pbrace{(e_1, 2), (e_2, 0), (e_2, 1)}$. For the latter pattern $\ed$, it it trivial to see that an edge in $\graph{1}$ becomes a $3$-path in $\graph{3}$, and this proves the identity.
\end{proof}
\qed
\subsubsection{Triangles}
\begin{proof}[Proof of Lemma \ref{lem:tri}]
The number of triangles in $\graph{k}$ for $k \geq 2$ will always be $0$ for the simple fact that all cycles in $\graph{k}$ will have at least six edges.
\end{proof}
\qed
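The number of triangles in $\graph{\ell}$ for $\ell \geq 2$ will always be $0$ for the simple fact that all cycles in $\graph{\ell}$ will have at least six edges. Indeed, the internal vertices of the paths that replace the edges of $\graph{1}$ have degree two, so every cycle in $\graph{\ell}$ arises from a cycle in $\graph{1}$; the latter has at least three edges, each of which is replaced by a path with $\ell \geq 2$ edges.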
%\end{proof}
%\qed