+ SELECT * FROM Posts WHERE image_class = 'Cat';
+
+
+ SELECT COUNT(*) FROM Posts WHERE image_class = 'Cat';
+
+
+ SELECT user_id FROM Posts
+ WHERE image_class = 'Cat'
+ GROUP BY user_id HAVING COUNT(*) > 10;
+
+ + <%= data_table(["Name", "ZipCode"], [["Alice", "10003"], ["Bob","14260"]], name: "$R_1$", rowids: true) %> + | or | ++ <%= data_table(["Name", "ZipCode"], [["Alice", "10003"], ["Bob","19260"]], name: "$R_2$", rowids: true) %> + |
Incomplete Database ($\mathcal D$): A set of possible worlds
+Possible World ($D \in \mathcal D$): One (of many) database instances
+(Require all possible worlds to have the same schema)
+What does it mean to run a query on an incomplete database?
+$Q(\mathcal D) = ?$
+$Q(\mathcal D) = \{\;Q(D)\;|\;D \in \mathcal D \}$
++ <%= data_table(["Name", "ZipCode"], [["Alice", "10003"], ["Bob","14260"]], name: "$R_1$", rowids: true) %> + | or | + <%= data_table(["Name", "ZipCode"], [["Alice", "10003"], ["Bob","19260"]], name: "$R_2$", rowids: true) %> + |
$$Q_1 = \pi_{Name}\big( \sigma_{state = \texttt{'NY'}} (R \bowtie_{zip} ZipLookups) \big)$$
+{ | ++ <%= data_table(["Name"], [["Alice"], ["Bob"]], name: "$Q(R_1)$", rowids: true) %> + | or | + <%= data_table(["Name"], [["Alice"]], name: "$Q(R_2)$", rowids: true) %> + | +} | +
+ <%= data_table(["Name", "ZipCode"], [["Alice", "10003"], ["Bob","14260"]], name: "$R_1$", rowids: true) %> + | or | + <%= data_table(["Name", "ZipCode"], [["Alice", "10003"], ["Bob","19260"]], name: "$R_2$", rowids: true) %> + |
$$Q_2 = \pi_{Name}\big( \sigma_{region = \texttt{'Northeast'}} (R \bowtie_{zip} ZipLookups) \big)$$
+{ | ++ <%= data_table(["Name"], [["Alice"], ["Bob"]], name: "$Q(R_1)$", rowids: true) %> + | or | + <%= data_table(["Name"], [["Alice"], ["Bob"]], name: "$Q(R_2)$", rowids: true) %> + | +} | +
+ <%= data_table(["Name", "ZipCode"], [["Alice", "10003"], ["Bob","14260"]], name: "$R_1$", rowids: true) %> + | or | + <%= data_table(["Name", "ZipCode"], [["Alice", "10003"], ["Bob","19260"]], name: "$R_2$", rowids: true) %> + |
$$Q_2 = \pi_{Name}\big( \sigma_{region = \texttt{'Northeast'}} (R \bowtie_{zip} ZipLookups) \big)$$
+{ | ++ <%= data_table(["Name"], [["Alice"], ["Bob"]], name: "$Q(R_1)$ or $Q(R_2)$", rowids: true) %> + | +} | +
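The definition $Q(\mathcal D) = \{\;Q(D)\;|\;D \in \mathcal D \}$ can be sketched directly in Python by running the query once per possible world and collecting the distinct results. This is only an illustration: the `ZIP_LOOKUPS` contents and the function names are assumptions made for the example, not part of the slides.

```python
# Hypothetical lookup table: zip code -> (state, region). Values are illustrative.
ZIP_LOOKUPS = {
    "10003": ("NY", "Northeast"),
    "14260": ("NY", "Northeast"),
    "19260": ("PA", "Northeast"),
}

# The two possible worlds R_1 and R_2 from the slides.
R1 = [("Alice", "10003"), ("Bob", "14260")]
R2 = [("Alice", "10003"), ("Bob", "19260")]
INCOMPLETE_DB = [R1, R2]


def q1(world):
    """Names whose zip code maps to state 'NY' (the Q_1 selection)."""
    return frozenset(name for name, z in world if ZIP_LOOKUPS[z][0] == "NY")


def q2(world):
    """Names whose zip code maps to region 'Northeast' (the Q_2 selection)."""
    return frozenset(name for name, z in world if ZIP_LOOKUPS[z][1] == "Northeast")


def possible_results(query, worlds):
    """Evaluate the query in every world; the set removes duplicate results."""
    return {query(world) for world in worlds}


q1_results = possible_results(q1, INCOMPLETE_DB)  # two distinct results
q2_results = possible_results(q2, INCOMPLETE_DB)  # both worlds agree
```

Under these assumptions `q1_results` holds two different answers, while `q2_results` collapses to a single answer because every world produces the same rows.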
Challenge: There can be lots of possible worlds.
+Observation: Possibilities for database creation break down into lots of independent choices.
+ +Factorize the database.
++ <%= data_table(["Name", "ZipCode"], [["Alice", "10003"], ["Bob","14260"], ["Carol", "13201"]], name: "$R_1$", rowids: true) %> + | + <%= data_table(["Name", "ZipCode"], [["Alice", "10003"], ["Bob","19260"], ["Carol", "18201"]], name: "$R_2$", rowids: true) %> + |
+ <%= data_table(["Name", "ZipCode"], [["Alice", "10003"], ["Bob","14260"], ["Carol", "18201"]], name: "$R_3$", rowids: true) %> + | + <%= data_table(["Name", "ZipCode"], [["Alice", "10003"], ["Bob","19260"], ["Carol", "13201"]], name: "$R_4$", rowids: true) %> + |
Alice appears in both databases.
The only differences are Bob and Carol's zip codes.
$\big[\;\texttt{bob} \in \{4, 9\},\; \texttt{carol} \in \{3, 8\}\;\big]$
+Pick one of each: $\big[\;\{a\},\; \{b, c\},\; \{d, e\}\;\big]$
+Set those variables to $T$ and all others to $F$
+$R_1 \equiv \big[a \rightarrow T, b \rightarrow T, d \rightarrow T, * \rightarrow F\big]$
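As a rough sketch of the factorized encoding, each possible world corresponds to picking exactly one variable from each group and keeping the rows those variables enable. The variable-to-row mapping below is an assumption for illustration; Bob's alternative zip code is taken to be 19260, as in $R_2$.

```python
from itertools import product

# Variable -> the row it enables (assumed mapping, one variable per candidate row).
ROWS = {
    "a": ("Alice", "10003"),
    "b": ("Bob", "14260"),
    "c": ("Bob", "19260"),
    "d": ("Carol", "13201"),
    "e": ("Carol", "18201"),
}

# Independent choices: exactly one variable in each group is set to T.
CHOICES = [["a"], ["b", "c"], ["d", "e"]]


def worlds(choices, rows):
    """Yield (true_variables, relation) for every combination of choices."""
    for picked in product(*choices):
        yield picked, [rows[var] for var in picked]


ALL_WORLDS = list(worlds(CHOICES, ROWS))
```

Three small groups describe all four worlds here. In general, $n$ independent two-way choices describe $2^n$ worlds while storing only $2n$ candidate rows.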
+ <%= data_table( + ["Name", "ZipCode"], + [ ["Alice", "10003"], + ["Bob","14260"], + ["Bob","19260"], + ["Carol","13201"], + ["Carol","18201"] + ], + name: "$\\mathcal R$", + rowids: true, + annotations: [ + "T (a)", + "T (b)", + "F (c)", + "T (d)", + "F (e)" + ] + ) %> +Use provenance as before...
+... but what about aggregates?
+
+ SELECT COUNT(*)
+ FROM R NATURAL JOIN ZipCodeLookup
+ WHERE State = 'NY'
+
+ + $$= \begin{cases} + 1 & \textbf{if } \texttt{bob} = 9 \wedge \texttt{carol} = 8\\ + 2 & \textbf{if } \texttt{bob} = 4 \wedge \texttt{carol} = 8 \\&\; \vee\; \texttt{bob} = 9 \wedge \texttt{carol} = 3\\ + 3 & \textbf{if } \texttt{bob} = 4 \wedge \texttt{carol} = 3 + \end{cases}$$
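The case analysis above can be reproduced by brute force: enumerate every assignment of $\texttt{bob}$ and $\texttt{carol}$ and group the assignments by the count they produce. This is a sketch that assumes, as the example zip codes suggest, that Alice is always in NY, Bob is in NY only when $\texttt{bob} = 4$, and Carol only when $\texttt{carol} = 3$.

```python
from collections import defaultdict
from itertools import product

BOB_OPTIONS = [4, 9]
CAROL_OPTIONS = [3, 8]


def count_ny(bob, carol):
    """COUNT(*) of NY rows: Alice always counts; Bob and Carol count conditionally."""
    return 1 + int(bob == 4) + int(carol == 3)


# Map each possible count to the assignments that produce it.
cases = defaultdict(list)
for bob, carol in product(BOB_OPTIONS, CAROL_OPTIONS):
    cases[count_ny(bob, carol)].append((bob, carol))
```

With two variables there are only four assignments to check, but each additional independent variable doubles the number of iterations of this loop.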
+Problem: A combinatorial explosion of possibilities
+Idea: Simplify the problem
+Pick your favorite SAT solver, plug in and go
++ As before, factorize the possible outcomes +
++ $$1 + \{\;1\;\textbf{if}\;\texttt{bob} = 4\;\} + \{\;1\;\textbf{if}\;\texttt{carol} = 3\;\}$$ +
++ Not bigger than the aggregate input... +
+
+ ...but at least it only reduces to bin-packing
(or a similar NP-hard problem.)
+
In short, incomplete databases are limited, but have some uses.
+What about probabilities?
+RSVP (limited space available) to participate
+Lots of interesting strategies used in Checkpoint 3
+(lambda-architecture edition)
+Due May 20
+No restarts.
+No (although you can).
+Ok... so what else can I do?
+Problem 1: More indexes = Slower writes (bad for OLTP)
+Problem 2: Fewer indexes = Slower reads (bad for OLAP)
+What if you have both OLAP and OLTP workloads?
+Idea: Weekly / Nightly / Hourly dump
from OLTP System to OLAP system.
(Index the data while dumping)
+Problem: Not seeing the freshest data!
+Better Idea: OLTP DB + OLAP DB.
+OLTP DB has few indexes, but only stores recent updates.
+OLAP DB has many indexes, and stores everything except recent updates.
+Periodically migrate updates into OLAP DB.
+(Lambda Architecture)
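A minimal sketch of this two-store layout, using Python's built-in `sqlite3` module. The table names `recent` and `archive`, the columns, and the helper functions are hypothetical, chosen only for illustration: writes go to an unindexed table, reads combine both tables, and a periodic migration moves rows into the indexed table.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE recent  (a INTEGER, b INTEGER, c INTEGER);  -- few indexes: cheap writes
    CREATE TABLE archive (a INTEGER, b INTEGER, c INTEGER);  -- many indexes: cheap reads
    CREATE INDEX archive_a ON archive(a);
    CREATE INDEX archive_c ON archive(c);
""")


def insert(a, b, c):
    """OLTP path: append to the small, unindexed table."""
    conn.execute("INSERT INTO recent VALUES (?, ?, ?)", (a, b, c))


def count_where_c(c):
    """OLAP path: query the archive plus the not-yet-migrated rows."""
    row = conn.execute(
        "SELECT COUNT(*) FROM ("
        "  SELECT c FROM archive UNION ALL SELECT c FROM recent"
        ") WHERE c = ?",
        (c,),
    ).fetchone()
    return row[0]


def migrate():
    """Periodic job: move everything from recent into archive in one transaction."""
    with conn:
        conn.execute("INSERT INTO archive SELECT * FROM recent")
        conn.execute("DELETE FROM recent")
```

Because every read unions the two tables, query results stay fresh even between migrations. The cost is that reads must always scan the unindexed `recent` table, which is why it should stay small.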
+
+ INSERT INTO FOO(A, B, C) VALUES (1, 2, 3);
+
+
+ SELECT COUNT(*) FROM lineitem WHERE mktsegment = 'BUILDING';
+
+
+ DELETE FROM FOO WHERE A > 5;
+
+ ... but that's not quite how SQL DELETE works.
+
+ DELETE FROM FOO WHERE A > 5;
+
+
+ DELETE FROM Orig WHERE Something;
+
+ <%=
+ relational_algebra do
+ ra_select("NOT Something",
+ ra_table("Orig")
+ )
+ end
+ %>
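The rewrite of a delete as a selection on the negated condition can be checked on a small table with `sqlite3`. This sketch uses made-up rows and assumes no NULLs in the filtered column: a row whose condition evaluates to NULL survives a `DELETE` but is dropped by `WHERE NOT ...`, so the two forms differ in that case.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE foo (a INTEGER, b INTEGER)")
conn.executemany(
    "INSERT INTO foo VALUES (?, ?)",
    [(1, 10), (5, 50), (6, 60), (9, 90)],  # illustrative data, no NULLs
)

# Rows that should survive the delete, computed as a selection on NOT (condition).
survivors = conn.execute(
    "SELECT a, b FROM foo WHERE NOT (a > 5) ORDER BY a"
).fetchall()

# The actual delete, followed by a read of what is left.
conn.execute("DELETE FROM foo WHERE a > 5")
remaining = conn.execute("SELECT a, b FROM foo ORDER BY a").fetchall()
```

On this data both lists contain the same two rows.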
+
+ INSERT INTO lineitem(...) VALUES (...);
+ INSERT INTO lineitem(...) VALUES (...);
+ DELETE FROM lineitem WHERE shipdate BETWEEN date('1997-10-01')
+ AND date('1997-10-30');
+ SELECT COUNT(*) FROM lineitem WHERE mktsegment = 'BUILDING';
+
+
+ UPDATE Foo SET A = 1, B = 2 WHERE C = 3;
+
+
+ UPDATE Foo SET A = 1, B = 2 WHERE C = 3;
+
+ <%=
+ relational_algebra do
+ ra_union(
+ ra_select( "C = 3",
+ ra_project( { A: "1", B: "2", C: "C" },
+ ra_table("Foo")
+ )
+ ),
+ ra_select( "C ≠ 3",
+ ra_table("Foo")
+ )
+ )
+ end
+ %>
+
+ UPDATE Foo SET A = 1, B = 2 WHERE C = 3;
+
+ <%=
+ relational_algebra do
+ ra_project( { A: "CASE WHEN C = 3 THEN 1 ELSE A END", B: "CASE ...", C: "C"},
+ ra_table("Foo")
+ )
+ end
+ %>
+
+ SELECT CASE WHEN C = 3 THEN 1 ELSE A END AS A,
+ CASE WHEN C = 3 THEN 2 ELSE B END AS B,
+ C AS C
+ FROM Foo;
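To see that the `CASE` projection matches the effect of the `UPDATE`, both can be run against the same small table in `sqlite3`. The rows are made up for the example, and ordering by SQLite's implicit `rowid` is used only to make the comparison deterministic.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE foo (a INTEGER, b INTEGER, c INTEGER)")
conn.executemany(
    "INSERT INTO foo VALUES (?, ?, ?)",
    [(7, 8, 3), (4, 5, 6), (9, 9, 3)],  # illustrative data
)

# The update expressed as a read-only projection over the original table.
rewritten = conn.execute("""
    SELECT CASE WHEN c = 3 THEN 1 ELSE a END AS a,
           CASE WHEN c = 3 THEN 2 ELSE b END AS b,
           c AS c
    FROM foo
    ORDER BY rowid
""").fetchall()

# The real update, then a plain read of the modified table.
conn.execute("UPDATE foo SET a = 1, b = 2 WHERE c = 3")
updated = conn.execute("SELECT a, b, c FROM foo ORDER BY rowid").fetchall()
```

The projection never touches the stored rows, so it can be applied at query time on top of an unmodified table.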
+
+ Idea: Make $\texttt{bob}$ and $\texttt{carol}$ random variables.
+$$\texttt{bob} = \begin{cases} 4 & p = 0.8 \\ 9 & p = 0.2\end{cases}$$
+$$\texttt{carol} = \begin{cases} 3 & p = 0.4 \\ 8 & p = 0.6\end{cases}$$
++ $$Q(\mathcal D) = \begin{cases} + 1 & \textbf{if } \texttt{bob} = 9 \wedge \texttt{carol} = 8\\ + 2 & \textbf{if } \texttt{bob} = 4 \wedge \texttt{carol} = 8 \\&\; \vee\; \texttt{bob} = 9 \wedge \texttt{carol} = 3\\ + 3 & \textbf{if } \texttt{bob} = 4 \wedge \texttt{carol} = 3 + \end{cases}$$
++ $$ = \begin{cases} + 1 & p = 0.2 \times 0.6\\ + 2 & p = 0.8 \times 0.6 + 0.2 \times 0.4\\ + 3 & p = 0.8 \times 0.4 \end{cases}$$ +
++ $$ = \begin{cases} + 1 & p = 0.12\\ + 2 & p = 0.56\\ + 3 & p = 0.32\end{cases}$$ +
++ $$Q(\mathcal D) = \begin{cases} + 1 & p = 0.12\\ + 2 & p = 0.56\\ + 3 & p = 0.32\end{cases}$$ +
+$E\left[Q(\mathcal D)\right] = 0.12+1.12+0.96 = 2.20$
+$P\left[Q(\mathcal D) \geq 2\right] = 0.56+0.32 = 0.88$
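The distribution, expectation, and tail probability above can be recomputed by enumerating the four joint outcomes, assuming $\texttt{bob}$ and $\texttt{carol}$ are independent. The sketch uses `Fraction` so the arithmetic is exact; the helper `count_ny` is a hypothetical name for the count query's value.

```python
from collections import defaultdict
from fractions import Fraction
from itertools import product

# Marginal distributions from the slides.
BOB = {4: Fraction("0.8"), 9: Fraction("0.2")}
CAROL = {3: Fraction("0.4"), 8: Fraction("0.6")}


def count_ny(bob, carol):
    """Query result: Alice always counts; Bob and Carol count conditionally."""
    return 1 + int(bob == 4) + int(carol == 3)


# Sum the probability of every joint outcome that yields each count.
distribution = defaultdict(Fraction)
for (bob, p_bob), (carol, p_carol) in product(BOB.items(), CAROL.items()):
    distribution[count_ny(bob, carol)] += p_bob * p_carol

expectation = sum(value * p for value, p in distribution.items())
p_at_least_2 = sum(p for value, p in distribution.items() if value >= 2)
```

Calling `float(expectation)` and `float(p_at_least_2)` converts the exact fractions back to decimals for comparison with the slide.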
+In general, computing probabilities exactly is #P-hard
... so we approximate
+Idea 1: Sample. Pick 10 random possible worlds and compute results for each.
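A minimal Monte Carlo sketch of this idea: draw possible worlds according to the probabilities of $\texttt{bob}$ and $\texttt{carol}$, run the count query in each, and average. The function names and the seeded generator are assumptions for illustration; a larger sample size gives tighter estimates than the 10 worlds mentioned above.

```python
import random


def count_ny(bob, carol):
    """Query result: Alice always counts; Bob and Carol count conditionally."""
    return 1 + int(bob == 4) + int(carol == 3)


def sample_world(rng):
    """Draw one possible world: bob = 4 w.p. 0.8, carol = 3 w.p. 0.4."""
    bob = 4 if rng.random() < 0.8 else 9
    carol = 3 if rng.random() < 0.4 else 8
    return bob, carol


def estimate(n_samples, seed=0):
    """Estimate E[Q] and P[Q >= 2] from n_samples sampled worlds."""
    rng = random.Random(seed)
    results = [count_ny(*sample_world(rng)) for _ in range(n_samples)]
    mean = sum(results) / n_samples
    p_at_least_2 = sum(r >= 2 for r in results) / n_samples
    return mean, p_at_least_2
```

The estimates converge on the exact values as `n_samples` grows, at the usual $1/\sqrt{n}$ rate. The cost is linear in the number of samples rather than exponential in the number of variables.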
+