diff --git a/src/teaching/cse-562/2019sp/slide/2019-05-01-IncompleteDBs.erb b/src/teaching/cse-562/2019sp/slide/2019-05-01-IncompleteDBs.erb
index a0df9fa1..9bc9c698 100644
--- a/src/teaching/cse-562/2019sp/slide/2019-05-01-IncompleteDBs.erb
+++ b/src/teaching/cse-562/2019sp/slide/2019-05-01-IncompleteDBs.erb
@@ -1,6 +1,6 @@
 ---
 template: templates/cse4562_2019_slides.erb
-title: Incomplete and Probabilistic Databases
+title: Querying Incomplete Databases
 date: May 1, 2019
 textbook: "PDB Concepts and C-Tables"
 dependencies:
@@ -180,7 +180,7 @@ dependencies:
 ["Name", "ZipCode"],
 [ ["Alice", "10003"],
   ["Bob","14260"],
-  ["Bob","14290"],
+  ["Bob","19260"],
   ["Carol","13201"],
   ["Carol","18201"]
 ],
@@ -205,7 +205,7 @@ dependencies:
 ["Name", "ZipCode"],
 [ ["Alice", "10003"],
   ["Bob","14260"],
-  ["Bob","14290"],
+  ["Bob","19260"],
   ["Carol","13201"],
   ["Carol","18201"]
 ],
@@ -230,7 +230,7 @@ dependencies:
 ["Name", "ZipCode"],
 [ ["Alice", "10003"],
   ["Bob","14260"],
-  ["Bob","14290"],
+  ["Bob","19260"],
   ["Carol","13201"],
   ["Carol","18201"]
 ],
@@ -315,12 +315,7 @@ dependencies:
 Not bigger than the aggregate input...
- ...but at least it only reduces to bin-packing
(or a similarly NP problem.)
+ ...but at least it only reduces to bin-packing
(or a similarly well-known NP-complete problem.)
In short, incomplete databases are limited, but have some uses.
-What about probabilities?
-Idea: Make $\texttt{bob}$ and $\texttt{carol}$ random variables.
+Limitation: Can't distinguish between possible-but-unlikely and possible-but-very-likely.
+Idea: Make variables probabilistic
+$$\texttt{bob} = \begin{cases} 4 & p = 0.8 \\ 9 & p = 0.2\end{cases}$$
$$\texttt{carol} = \begin{cases} 3 & p = 0.4 \\ 8 & p = 0.6\end{cases}$$
+ SELECT COUNT(*)
+ FROM R NATURAL JOIN ZipCodeLookup
+ WHERE State = 'NY'
+
+ $$Q(\mathcal D) = \begin{cases} @@ -52,12 +111,278 @@ dependencies:
In general, computing probabilities exactly is #P-hard
In general, computing marginal probabilities for result tuples exactly is #P-hard
... so we approximate
Idea 1: Sample. Pick (e.g.) 10 random possible worlds and compute results for each.
+$$R_{<%=i+1%>} \Leftarrow \{\; \texttt{bob} \rightarrow <%= bob[i] %>, \; \texttt{carol} \rightarrow <%= carol[i] %>\}$$
+
+      <%= data_table(
+        ["Name", "ZipCode"],
+        [ ["Alice", "10003"],
+          ["Bob","1#{bob[i]}260"],
+          ["Carol","1#{carol[i]}201"]
+        ],
+        name: "$\\mathcal R_{#{i+1}}$",
+        rowids: true,
+      ) %>
+
+$$\mathcal Q = \{\;<%=counts.join(",\\;")%>\;\}$$
+    <% if i == 10 %>
+$$E[\mathcal Q] \approx <%=counts.avg.round(2)%>$$
+$$P[\mathcal Q \geq 2] \approx <%=(counts.select { |c| c >= 2 }.count / counts.size.to_f).round(2) %>$$
+ <% else %> ++
+ <% end %> +
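The sampling idea can be sketched outside the slide template. This is a rough illustration, not the code behind the slides: `ny?` is a hypothetical stand-in for the `ZipCodeLookup` join (it assumes zip codes 10000-14999 are in NY), the two variables are assumed independent, and `draw`, `count_ny`, and `sample_counts` are made-up helper names.

```ruby
# Hypothetical stand-in for ZipCodeLookup: assume 10000-14999 means NY.
def ny?(zip)
  zip.to_i.between?(10_000, 14_999)
end

# Draw one value from a {value => probability} distribution.
def draw(dist, rng)
  u = rng.rand
  acc = 0.0
  dist.each do |value, p|
    acc += p
    return value if u < acc
  end
  dist.keys.last # guard against floating-point round-off
end

# SELECT COUNT(*) ... WHERE State = 'NY', evaluated in one possible world.
def count_ny(bob, carol)
  ["10003", "1#{bob}260", "1#{carol}201"].count { |zip| ny?(zip) }
end

# Sample n possible worlds (bob and carol assumed independent).
def sample_counts(n, bob_dist, carol_dist, rng)
  Array.new(n) { count_ny(draw(bob_dist, rng), draw(carol_dist, rng)) }
end

bob_dist   = { 4 => 0.8, 9 => 0.2 }
carol_dist = { 3 => 0.4, 8 => 0.6 }
counts = sample_counts(10_000, bob_dist, carol_dist, Random.new(42))

puts "E[Q]      ~ #{(counts.sum / counts.size.to_f).round(2)}"
puts "P[Q >= 2] ~ #{(counts.count { |c| c >= 2 } / counts.size.to_f).round(2)}"
```

Under these assumptions the exact answers are $E[\mathcal Q] = 2.2$ and $P[\mathcal Q \geq 2] = 0.88$, so the estimates should land close to those values once the sample count is large.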
Problem: Sloooooooooooow.
+Can we make it faster?
+Idea 1: Sample. Pick 10 random possible worlds and compute results for each.
+Idea 1.A: Combine all samples into one query.
+$\pi_A(R) \rightarrow $ $\pi_{A, \mathcal{ID}}(R)$
+$\sigma_\phi(R) \rightarrow $ $\sigma_{\phi}(R)$
+$R \uplus S \rightarrow $ $R \uplus S$
+$R \times S \rightarrow $ $\pi_{R.*, S.*, R.\mathcal{ID}}\big($$\sigma_{R.\mathcal{ID} = S.\mathcal{ID}}( $$ R \times S)\big)$
+$\delta R \rightarrow $ $\delta R$
+$_A\gamma_{Agg(*)}(R) \rightarrow $ $_{A, \mathcal{ID}}\gamma_{Agg(*)}(R)$
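A small sketch of two of these rewrites, assuming every tuple is a hash that carries its sample number under an `:id` key. The helper names `cross_by_sample` and `count_by_sample` are illustrative only.

```ruby
# R x S with a sample-ID column: pair up only tuples from the same sample.
def cross_by_sample(r, s)
  r.product(s)
   .select { |a, b| a[:id] == b[:id] }
   .map { |a, b| a.merge(b) }
end

# COUNT(*) with the sample ID added to the group-by list.
def count_by_sample(rows)
  rows.group_by { |t| t[:id] }.transform_values(&:size)
end

r = [{ id: 1, a: "x" }, { id: 2, a: "y" }]
s = [{ id: 1, b: 10 }, { id: 1, b: 20 }, { id: 2, b: 30 }]

joined = cross_by_sample(r, s)
p joined                  # three tuples: two for sample 1, one for sample 2
p count_by_sample(joined) # one count per sample
```

Every sample is still stored and processed as its own set of rows, which is where the repetition comes from.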
+Still sloooooow.
+There's a lot of repetition.
+Idea 1.B: Use native array types in DBs
+$\pi_A(R) \rightarrow $ $\pi_{A}(R)$
+$\sigma_\phi(R) \rightarrow $ ?
Idea 1.B': Also mark which tuples are present in which samples
+$\pi_A(R) \rightarrow $ $\pi_{A}(R)$
+$\sigma_\phi(R) \rightarrow $ $\sigma_{\mathcal W \neq 0}($$\pi_{\mathcal W \;\&\; \vec \phi}(R))$
+$R \uplus S \rightarrow $ $R \uplus S$
+$R \times S \rightarrow $ $\sigma_{\mathcal{W} \neq 0}\big($$\pi_{R.*, S.*, R.\mathcal{W} \;\&\; S.\mathcal{W}}( $$ R \times S)\big)$
+$_A\gamma_{Agg(B)}(R) \rightarrow $ $_A\gamma_{[ Agg\big(\textbf{if}(W[1])\{R.B[1]\}\big), Agg\big(\textbf{if}(W[2])\{R.B[2]\}\big), \ldots ]}(R)$
+$\pi_A(R) \rightarrow \pi_{A}(R)$
+$\sigma_\phi(R) \rightarrow \sigma_{\mathcal W \neq 0}(\pi_{\mathcal W \;\&\; \vec \phi}(R))$
+$R \uplus S \rightarrow R \uplus S$
+$R \times S \rightarrow \sigma_{\mathcal{W} \neq 0}\big(\pi_{R.*, S.*, R.\mathcal{W} \;\&\; S.\mathcal{W}}( R \times S)\big)$
+$_A\gamma_{Agg(B)}(R) \rightarrow $ $_A\gamma_{[ Agg\big(\textbf{if}(W[1])\{R.B[1]\}\big), Agg\big(\textbf{if}(W[2])\{R.B[2]\}\big), \ldots ]}(R)$
+(Generate aggregates for each sample separately)
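A sketch of the selection rewrite with a presence mask. It assumes bit $i$ of $\mathcal W$ is set when the tuple exists in sample $i$, so a tuple is dropped once its mask reaches zero. The hash layout (`:attrs`, `:w`) and the helpers `select_bundles` and `ny?` are hypothetical, and `ny?` again assumes zip codes 10000-14999 are in NY.

```ruby
# Each bundle is a hash: :attrs maps a column to an array holding one value
# per sample, and :w is an integer bitmask (bit i set = present in sample i).
# The block is the predicate; it is called once per sample.
def select_bundles(bundles, n_samples)
  bundles.map do |b|
    phi = 0
    n_samples.times { |i| phi |= (1 << i) if yield(b[:attrs], i) }
    b.merge(w: b[:w] & phi) # W & phi
  end.reject { |b| b[:w].zero? } # drop tuples present in no sample
end

# Hypothetical stand-in for the ZipCodeLookup join.
def ny?(zip)
  zip.to_i.between?(10_000, 14_999)
end

r = [
  { attrs: { name: ["Alice"] * 3, zip: %w[10003 10003 10003] }, w: 0b111 },
  { attrs: { name: ["Bob"] * 3,   zip: %w[14260 19260 14260] }, w: 0b111 },
  { attrs: { name: ["Carol"] * 3, zip: %w[18201 18201 18201] }, w: 0b111 }
]

selected = select_bundles(r, 3) { |attrs, i| ny?(attrs[:zip][i]) }
selected.each { |b| puts format("%-5s W=%03b", b[:attrs][:name][0], b[:w]) }
```

Here Alice keeps all three bits, Bob keeps the bits for the samples where his zip code is 14260, and Carol disappears because her mask becomes zero.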
+Good luck ever doing an equi-join.
+Hope your group-by variables aren't uncertain.
+Inefficient equi-joins on uncertain variables.
+Inefficient aggregates with uncertain variables.
+How many samples are necessary to get the desired precision?
+Idea 2: Symbolic Execution (Provenance)
+$\sigma_{count \geq 2}(Q) =$
+$\texttt{bob} = 4 \wedge \texttt{carol} = 8 $
+ $\vee\; \texttt{bob} = 9 \wedge \texttt{carol} = 3 $
+ $\vee\; \texttt{bob} = 4 \wedge \texttt{carol} = 3$
$P[\sigma_{count \geq 2}(Q)] = ?$ $\approx$ #SAT
+$P[\texttt{x} \wedge \texttt{y}] = P[\texttt{x}] \cdot P[\texttt{y}]$
(iff $\texttt{x}$ and $\texttt{y}$ are independent)
$P[\texttt{x} \wedge \texttt{y}] = 0$
(iff $\texttt{x}$ and $\texttt{y}$ are mutually exclusive)
$P[\texttt{x} \vee \texttt{y}] = 1- (1-P[\texttt{x}]) \cdot (1-P[\texttt{y}])$
(iff $\texttt{x}$ and $\texttt{y}$ are independent)
$P[\texttt{x} \vee \texttt{y}] = P[\texttt{x}] + P[\texttt{y}]$
(iff $\texttt{x}$ and $\texttt{y}$ are mutually exclusive)
Good enough to get us the probability of any boolean formula over mutually exclusive or independent variables
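As a worked example, the four rules can be applied to the formula for $\sigma_{count \geq 2}(Q)$. This sketch assumes $\texttt{bob}$ and $\texttt{carol}$ are independent; the three disjuncts are mutually exclusive because each fixes a different assignment. The helper names are illustrative.

```ruby
# P[x1 AND x2 AND ...] for independent events.
def and_independent(*ps)
  ps.reduce(1.0) { |acc, p| acc * p }
end

# P[x1 OR x2 OR ...] for independent events.
def or_independent(*ps)
  1.0 - ps.reduce(1.0) { |acc, p| acc * (1.0 - p) }
end

# P[x1 OR x2 OR ...] for mutually exclusive events.
def or_exclusive(*ps)
  ps.sum
end

p_bob   = { 4 => 0.8, 9 => 0.2 }
p_carol = { 3 => 0.4, 8 => 0.6 }

p_count_ge_2 = or_exclusive(
  and_independent(p_bob[4], p_carol[8]),
  and_independent(p_bob[9], p_carol[3]),
  and_independent(p_bob[4], p_carol[3])
)
puts p_count_ge_2.round(2)
```

The three terms are 0.48, 0.08, and 0.32, so the probability works out to 0.88.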
+ +... and otherwise?
+For a boolean formula $f$ and variable $\texttt{x}$:
+ +$$f = (\texttt{x} \wedge f[\texttt{x}\backslash T]) \vee (\neg \texttt{x} \wedge f[\texttt{x}\backslash F])$$
+ +Disjunction of mutually-exclusive terms!
+... each a conjunction of independent terms.
+... and $\texttt{x}$ removed from $f$
+ +Ok... just keep applying Shannon!
+Each application creates 2 new formulas (ExpTime!)
+Idea 2.A: Combine the two. Use Shannon expansion as long as time/resources permit, then use a #SAT approximation.
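A minimal sketch of the exact (exponential-time) half of this idea: computing a probability by repeated Shannon expansion. It assumes independent boolean variables written as Ruby symbols and formulas written as nested arrays such as `[:and, f, g]`, `[:or, f, g]`, and `[:not, f]`. This encoding and the helper names are illustrative, and the sketch makes no attempt at the approximation fallback.

```ruby
# Substitute a truth value for var in f, simplifying constants away.
def assign(f, var, value)
  return f if f == true || f == false
  return(f == var ? value : f) if f.is_a?(Symbol)

  op, *args = f
  args = args.map { |a| assign(a, var, value) }
  case op
  when :not
    a = args[0]
    a == true ? false : (a == false ? true : [:not, a])
  when :and
    return false if args.include?(false)
    rest = args.reject { |a| a == true }
    rest.empty? ? true : (rest.size == 1 ? rest[0] : [:and, *rest])
  when :or
    return true if args.include?(true)
    rest = args.reject { |a| a == false }
    rest.empty? ? false : (rest.size == 1 ? rest[0] : [:or, *rest])
  end
end

# Find some variable that still occurs in f (nil if there is none).
def first_var(f)
  return f if f.is_a?(Symbol)
  return nil unless f.is_a?(Array)
  f.drop(1).each do |sub|
    v = first_var(sub)
    return v if v
  end
  nil
end

# P[f] = P[x] * P[f[x\T]] + (1 - P[x]) * P[f[x\F]]
# Assumes f is a literal true/false or mentions at least one variable.
def probability(f, probs)
  return 1.0 if f == true
  return 0.0 if f == false
  x = first_var(f)
  px = probs.fetch(x)
  px * probability(assign(f, x, true), probs) +
    (1.0 - px) * probability(assign(f, x, false), probs)
end

f = [:or, [:and, :x, :y], [:and, :x, :z]]
puts probability(f, { x: 0.5, y: 0.5, z: 0.5 })
```

The two disjuncts share $\texttt{x}$, so they are neither independent nor mutually exclusive. Expanding on $\texttt{x}$ leaves $\texttt{y} \vee \texttt{z}$ in one branch and false in the other, giving $0.5 \cdot 0.75 = 0.375$, whereas treating the disjuncts as independent would give 0.4375.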
+ +