---
template: templates/cse4562_2019_slides.erb
title: Incomplete and Probabilistic Databases
date: May 1, 2019
textbook: "PDB Concepts and C-Tables"
dependencies:
  - lib/slide_utils.rb
---
<% require "slide_utils.rb" %>
https://www.anishathalye.com/2017/07/25/synthesizing-adversarial-examples/
Deep Learning Demystified

What happens when you don't know your data precisely?


      SELECT * FROM Posts WHERE image_class = 'Cat';
    

      SELECT COUNT(*) FROM Posts WHERE image_class = 'Cat';
    

      SELECT user_id FROM Posts
      WHERE image_class = 'Cat'
      GROUP BY user_id HAVING COUNT(*) > 10;
    

Incomplete Databases

Probabilistic Databases

  1. Representing Incompleteness
  2. Querying Incomplete Data
  3. Implementing It
<%= data_table(["Name", "ZipCode"], [["Alice", "10003"], ["Bob","14260"]], name: "$R_1$", rowids: true) %> or <%= data_table(["Name", "ZipCode"], [["Alice", "10003"], ["Bob","19260"]], name: "$R_2$", rowids: true) %>

Incomplete Database ($\mathcal D$): A set of possible worlds

Possible World ($D \in \mathcal D$): One (of many) database instances

(Require all possible worlds to have the same schema)

What does it mean to run a query on an incomplete database?

$Q(\mathcal D) = ?$

$Q(\mathcal D) = \{\;Q(D)\;|\;D \in \mathcal D \}$

<%= data_table(["Name", "ZipCode"], [["Alice", "10003"], ["Bob","14260"]], name: "$R_1$", rowids: true) %> or <%= data_table(["Name", "ZipCode"], [["Alice", "10003"], ["Bob","19260"]], name: "$R_2$", rowids: true) %>

$$Q_1 = \pi_{Name}\big( \sigma_{state = \texttt{'NY'}} (R \bowtie_{zip} ZipLookups) \big)$$

{ <%= data_table(["Name"], [["Alice"], ["Bob"]], name: "$Q(R_1)$", rowids: true) %> or <%= data_table(["Name"], [["Alice"]], name: "$Q(R_2)$", rowids: true) %> }
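The possible-worlds semantics above can be sketched in a few lines of Python: run the ordinary query once per world and collect the distinct answers. The contents of `ZIP_LOOKUPS` are an assumption (the slides do not list the ZipLookups table), chosen to agree with the results shown for $Q_1$; the function names are illustrative.

```python
# Hypothetical lookup table: the slides do not list ZipLookups, so these
# rows are assumptions chosen to match the results shown for Q_1.
ZIP_LOOKUPS = {"10003": "NY", "14260": "NY", "19260": "PA"}

# The two possible worlds of R from the slide.
R1 = [("Alice", "10003"), ("Bob", "14260")]
R2 = [("Alice", "10003"), ("Bob", "19260")]
WORLDS = [R1, R2]


def q1(world):
    """pi_Name(sigma_{state='NY'}(R join ZipLookups)) on one ordinary database."""
    return frozenset(
        name for name, zip_code in world if ZIP_LOOKUPS.get(zip_code) == "NY"
    )


def query_incomplete(query, worlds):
    """Q(D) = { Q(D) | D in D }: evaluate the query in every possible world."""
    return {query(world) for world in worlds}


results = query_incomplete(q1, WORLDS)
for answer in sorted(results, key=len):
    print(sorted(answer))
```

Because the result is a set of answers, two worlds that produce the same answer collapse into one, which is what happens for $Q_2$.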
<%= data_table(["Name", "ZipCode"], [["Alice", "10003"], ["Bob","14260"]], name: "$R_1$", rowids: true) %> or <%= data_table(["Name", "ZipCode"], [["Alice", "10003"], ["Bob","19260"]], name: "$R_2$", rowids: true) %>

$$Q_2 = \pi_{Name}\big( \sigma_{region = \texttt{'Northeast'}} (R \bowtie_{zip} ZipLookups) \big)$$

{ <%= data_table(["Name"], [["Alice"], ["Bob"]], name: "$Q(R_1)$", rowids: true) %> or <%= data_table(["Name"], [["Alice"], ["Bob"]], name: "$Q(R_2)$", rowids: true) %> }
<%= data_table(["Name", "ZipCode"], [["Alice", "10003"], ["Bob","14260"]], name: "$R_1$", rowids: true) %> or <%= data_table(["Name", "ZipCode"], [["Alice", "10003"], ["Bob","19260"]], name: "$R_2$", rowids: true) %>

$$Q_2 = \pi_{Name}\big( \sigma_{region = \texttt{'Northeast'}} (R \bowtie_{zip} ZipLookups) \big)$$

{ <%= data_table(["Name"], [["Alice"], ["Bob"]], name: "$Q(R_1)$ or $Q(R_2)$", rowids: true) %> }


Challenge: There can be lots of possible worlds.

Observation: Possibilities for database creation break down into lots of independent choices.

Factorize the database.

<%= data_table(["Name", "ZipCode"], [["Alice", "10003"], ["Bob","14260"], ["Carol", "13201"]], name: "$R_1$", rowids: true) %> <%= data_table(["Name", "ZipCode"], [["Alice", "10003"], ["Bob","19260"], ["Carol", "18201"]], name: "$R_2$", rowids: true) %>
<%= data_table(["Name", "ZipCode"], [["Alice", "10003"], ["Bob","14260"], ["Carol", "18201"]], name: "$R_3$", rowids: true) %> <%= data_table(["Name", "ZipCode"], [["Alice", "10003"], ["Bob","19260"], ["Carol", "13201"]], name: "$R_4$", rowids: true) %>

Alice appears in every possible world.
The only differences are Bob's and Carol's zip codes.

List Out Choices

<% [false, true].each do |with_annotations| %>
<%= data_table( ["Name", "ZipCode"], [ ["Alice", "10003"], ["Bob","14260"], ["Bob","19260"], ["Carol","13201"], ["Carol","18201"] ], name: "$\\mathcal R$", rowids: true, annotations: if with_annotations then [ "always", "if $\\texttt{bob} = 4$", "if $\\texttt{bob} = 9$", "if $\\texttt{carol} = 3$", "if $\\texttt{carol} = 8$" ] else nil end ) %>
+

$\big[\;\texttt{bob} \in \{4, 9\},\; \texttt{carol} \in \{3, 8\}\;\big]$

<% end %>
<%= data_table( ["Name", "ZipCode"], [ ["Alice", "10003"], ["Bob","14260"], ["Bob","19260"], ["Carol","13201"], ["Carol","18201"] ], name: "$\\mathcal R$", rowids: true, annotations: [ "a", "b", "c", "d", "e" ] ) %>
+

Pick one of each: $\big[\;\{a\},\; \{b, c\},\; \{d, e\}\;\big]$

Set those variables to $T$ and all others to $F$

$R_1 \equiv \big[a \rightarrow T, b \rightarrow T, d \rightarrow T, * \rightarrow F\big]$

<%= data_table( ["Name", "ZipCode"], [ ["Alice", "10003"], ["Bob","14260"], ["Bob","19260"], ["Carol","13201"], ["Carol","18201"] ], name: "$\\mathcal R$", rowids: true, annotations: [ "T (a)", "T (b)", "F (c)", "T (d)", "F (e)" ] ) %>
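As a rough illustration of the factorized representation, the sketch below stores each row with the boolean variable that guards it and rebuilds a possible world from a "pick one of each" selection. It assumes Bob's alternative zip code is 19260, as in $R_2$ earlier; the names `ROWS`, `CHOICES`, `instantiate`, and `possible_worlds` are illustrative, not part of any library.

```python
from itertools import product

# Annotated table: each row carries the boolean variable that guards it.
# Assumption: Bob's alternative zip code is 19260, as in R_2 earlier.
ROWS = [
    ("a", ("Alice", "10003")),
    ("b", ("Bob", "14260")),
    ("c", ("Bob", "19260")),
    ("d", ("Carol", "13201")),
    ("e", ("Carol", "18201")),
]

# "Pick one of each": exactly one variable per group is set to True.
CHOICES = [["a"], ["b", "c"], ["d", "e"]]


def instantiate(true_vars):
    """Keep the rows whose guard variable is True; every other variable is False."""
    return [row for var, row in ROWS if var in true_vars]


def possible_worlds():
    """Enumerate every world by taking one variable from each choice group."""
    for picked in product(*CHOICES):
        yield set(picked), instantiate(set(picked))


r1 = instantiate({"a", "b", "d"})
print(r1)
print(sum(1 for _ in possible_worlds()))  # number of possible worlds
```

Five annotated rows and three choice groups stand in for four explicit copies of the table; the saving grows quickly as more independent choices are added.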

Use provenance as before...

... but what about aggregates?


      SELECT COUNT(*)
      FROM R NATURAL JOIN ZipCodeLookup
      WHERE State = 'NY'
    

$$= \begin{cases} 1 & \textbf{if } \texttt{bob} = 9 \wedge \texttt{carol} = 8\\ 2 & \textbf{if } \texttt{bob} = 4 \wedge \texttt{carol} = 8 \\&\; \vee\; \texttt{bob} = 9 \wedge \texttt{carol} = 3\\ 3 & \textbf{if } \texttt{bob} = 4 \wedge \texttt{carol} = 3 \end{cases}$$
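The case analysis above can be reproduced by brute force: enumerate every assignment to the choice variables, evaluate the count in that world, and group the assignments by result. This is only a sketch. The set `NY_ZIPS` stands in for the ZipCodeLookup table and is an assumption consistent with the cases shown, and Bob's alternative zip code is taken to be 19260 as in $R_2$.

```python
from collections import defaultdict
from itertools import product

DOMAINS = {"bob": [4, 9], "carol": [3, 8]}

# Assumption: stand-in for ZipCodeLookup, listing the zip codes with State = 'NY'.
NY_ZIPS = {"10003", "14260", "13201"}

# Each row: (name, zip code, condition on the choice variables).
ROWS = [
    ("Alice", "10003", lambda v: True),
    ("Bob", "14260", lambda v: v["bob"] == 4),
    ("Bob", "19260", lambda v: v["bob"] == 9),
    ("Carol", "13201", lambda v: v["carol"] == 3),
    ("Carol", "18201", lambda v: v["carol"] == 8),
]


def count_ny(valuation):
    """COUNT(*) of the rows present in this world whose zip code is in NY."""
    return sum(
        1 for _, zip_code, cond in ROWS if cond(valuation) and zip_code in NY_ZIPS
    )


# Group every assignment of the choice variables by the count it produces.
cases = defaultdict(list)
names = sorted(DOMAINS)
for values in product(*(DOMAINS[n] for n in names)):
    valuation = dict(zip(names, values))
    cases[count_ny(valuation)].append(valuation)

for count in sorted(cases):
    print(count, cases[count])
```

The loop visits one assignment per possible world, so its cost doubles with every additional two-valued choice variable.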

Problem: A combinatorial explosion of possibilities

Idea: Simplify the problem

  1. Is a particular tuple Possible?
  2. Is a particular tuple Certain?
Certain Tuple
A tuple that appears in all possible worlds
$\forall D \in \mathcal D : t \in D$
Possible Tuple
A tuple that appears in at least one possible world
$\exists D \in \mathcal D : t \in D$
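When the possible worlds are listed explicitly, the two definitions reduce to set intersection and set union. A minimal sketch, using the two results of $Q_1$ from earlier as the worlds; the function names are illustrative.

```python
def certain_tuples(worlds):
    """Tuples that appear in all possible worlds (set intersection)."""
    worlds = [set(w) for w in worlds]
    return set.intersection(*worlds) if worlds else set()


def possible_tuples(worlds):
    """Tuples that appear in at least one possible world (set union)."""
    return set().union(*worlds)


# The two possible results of Q_1 from the earlier example.
q1_worlds = [{("Alice",), ("Bob",)}, {("Alice",)}]

print(certain_tuples(q1_worlds))
print(sorted(possible_tuples(q1_worlds)))
```

Every certain tuple is also possible, so the two sets give a lower and an upper bound on the answer in any single world.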

Non-aggregate queries

Is a tuple Certain?
Is the provenance polynomial a tautology?
Is a tuple Possible?
Is the provenance polynomial satisfiable (i.e., not a contradiction)?

Pick your favorite SAT solver, plug in and go
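For a handful of variables, the two checks can be sketched without a solver by enumerating every assignment. This brute-force loop is a stand-in for a SAT solver, not a replacement for one: a tuple is certain if its condition holds under every assignment and possible if it holds under at least one. The conditions below are illustrative examples over the `bob` and `carol` variables.

```python
from itertools import product

DOMAINS = {"bob": [4, 9], "carol": [3, 8]}


def valuations(domains):
    """Yield every assignment of the choice variables as a dict."""
    names = sorted(domains)
    for values in product(*(domains[n] for n in names)):
        yield dict(zip(names, values))


def is_certain(condition, domains=DOMAINS):
    """True if the condition holds in every world (a tautology)."""
    return all(condition(v) for v in valuations(domains))


def is_possible(condition, domains=DOMAINS):
    """True if the condition holds in at least one world (satisfiable)."""
    return any(condition(v) for v in valuations(domains))


# Illustrative conditions attached to result tuples.
alice = lambda v: True                            # present in every world
bob_in_ny = lambda v: v["bob"] == 4               # present only when bob = 4
never = lambda v: v["bob"] == 4 and v["bob"] == 9  # contradiction

print(is_certain(alice), is_possible(alice))
print(is_certain(bob_in_ny), is_possible(bob_in_ny))
print(is_certain(never), is_possible(never))
```

The enumeration is exponential in the number of variables, which is why a real implementation would hand the condition to a solver instead.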

Aggregate queries

As before, factorize the possible outcomes

$$1 + \{\;1\;\textbf{if}\;\texttt{bob} = 4\;\} + \{\;1\;\textbf{if}\;\texttt{carol} = 3\;\}$$

Not bigger than the aggregate input...

...but at least it only reduces to bin-packing
(or a similarly NP-hard problem).
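In the easy special case where every conditional term depends on a different variable, as in the expression above, bounds on the count can be read off term by term. The sketch below assumes exactly that independence and refuses to answer otherwise; the `(value, variable, required choice)` encoding of terms is an illustrative assumption, not a standard format.

```python
DOMAINS = {"bob": [4, 9], "carol": [3, 8]}

# 1 + {1 if bob = 4} + {1 if carol = 3}
CONSTANT = 1
TERMS = [(1, "bob", 4), (1, "carol", 3)]  # (value, variable, required choice)


def count_bounds(constant, terms, domains):
    """Lower and upper bound of the factorized sum.

    Assumes each term depends on a different variable, so every term can be
    minimized and maximized on its own. Shared variables are rejected.
    """
    variables = [var for _, var, _ in terms]
    if len(set(variables)) != len(variables):
        raise ValueError("terms share a variable; independent bounds do not apply")
    lo = hi = constant
    for value, var, required in terms:
        outcomes = [value if choice == required else 0 for choice in domains[var]]
        lo += min(outcomes)
        hi += max(outcomes)
    return lo, hi


print(count_bounds(CONSTANT, TERMS, DOMAINS))
```

Once terms share variables or the question becomes whether a specific total is reachable, this shortcut no longer applies and the harder problem noted above returns.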

In short, incomplete databases are limited, but have some uses.

What about probabilities?