Importing 'the cutting edge' blog from way back when

pull/1/head
Oliver Kennedy 2017-10-08 21:45:59 -04:00
parent 5ba40e5930
commit f6f6bd733a
48 changed files with 1346 additions and 0 deletions


@ -0,0 +1,11 @@
---
title: "Hello world!"
author: Oliver Kennedy
---
Welcome.
My name is Oliver Kennedy, and starting this Fall, I'll be junior faculty at UB.  I'm starting this blog to give people a bit of an idea of the stuff I'm working on.  My goal is to keep things relatively short.  I'm not sure exactly what I'll be posting here, but for general posts I'll be keeping the language at a level that's understandable by most people (not just those in CS).
So, without further ado, let me say hello again, and welcome you to The Cutting Edge.
(Oh, and why the name?  I like swords, and I like tech)


@ -0,0 +1,14 @@
---
title: "DBToaster and the Viewlet Transform"
author: Oliver Kennedy
---
<p>A big issue these days is large, rapidly changing data.  Users often need to keep a close eye on this data.  Algorithmic trading, scientific computing, network monitoring, and even things like data warehousing are all examples of areas that have lots of data, and that need to react very quickly to certain (potentially complex) conditions in that data.</p>
<p>The overarching goal of the DBToaster project is to produce a tool chain capable of effectively performing these monitoring tasks.  Our latest paper (to be presented at VLDB 2012 this August) discusses one of the core ideas behind our approach: exploiting incrementality (through something we call the Viewlet Transform).</p>
<p>To get the basic idea across, let me use a common task as an analogy: the monthly report.  If you're in a relatively stable business, the content of the report will probably be mostly the same from month to month.  Instead of rewriting the report from scratch each month, you might just take last month's report and update it with any new facts, figures, and other changes from the past month (in fact, your boss might be interested in only the changes… but that's getting a little off-track). This still requires a lot of work.  If the report has a lot of figures, you won't re-create the figures from scratch either.  Instead, you'll probably have a spreadsheet (also from last month) that you can just punch the new numbers into.</p>
<p>Loosely speaking, you have a repeating task (writing your report) that is easier to perform if you only have to figure out what changed (the figures) since the last time you did it.  This idea has been around for a long time in databases (since the 80s at least) in the form of something called Incremental View Maintenance (or IVM for short).  Let's say you have a query that you want to repeatedly evaluate.  If you're smart, you'll just evaluate the query once and save the result for the next time you need the answer.  </p>
<p>But of course, the data you're querying might change in the meantime.  The core idea behind IVM is that you can evaluate what's known as a Delta Query, which is a simpler form of the original query.  Instead of giving you the full query answer, it looks at the changes to the input data, and tells you what changes in the query results.  This delta query is usually simpler and faster to evaluate than the original query, but can still be pretty slow (especially if the original query is a biggie), making IVM a poor choice for many realtime applications.</p>
<p>Let's go back to our example of the monthly report that you update each month.  Even though this is faster than creating the report from scratch, you still have to update all the figures as well.  Of course, you don't create the figures from scratch either.  If you're smart, you'll just edit last month's spreadsheets.  The viewlet transform is based on exactly the same idea.  The delta query is a query that you evaluate over and over and over again.  We figure: why not evaluate it once and then just update it when the data changes?</p>
<p>So now you have your original query, and some delta queries.  Instead of re-evaluating the delta queries every time your data changes, you evaluate the delta query once and store the result.  Now, whenever your data changes, you only need to read the delta query result out of storage, instead of doing any expensive computations.  Of course, now you need to keep your delta queries up to date as well.  You do this by using delta queries of each of the delta queries (a "second-order delta").  This process continues (giving you third, fourth, fifth, etc… order deltas), with each delta query becoming progressively simpler than the last.  Eventually you reach a point where the delta query is incredibly simple, and you stop.  </p>
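<p>To make the layering a little more concrete, here is a minimal Python sketch of the same idea for a toy query: the COUNT of a two-table join (purely illustrative -- the names and structure are invented for the sketch, not DBToaster's actual machinery).  The first-order deltas are stored as lookup tables, and their own deltas are just constant-time increments.</p>
<pre>from collections import defaultdict

q = 0                       # materialized result: COUNT(*) of R joined with S on a key
r_count = defaultdict(int)  # stored "delta view" used when S changes
s_count = defaultdict(int)  # stored "delta view" used when R changes

def insert_r(key):
    global q
    q += s_count[key]       # first-order delta: a lookup, not a re-evaluation
    r_count[key] += 1       # second-order delta: a constant-time bump

def insert_s(key):
    global q
    q += r_count[key]
    s_count[key] += 1

insert_r(1); insert_s(1); insert_s(1); insert_r(2)
print(q)  # 2 -- two joined pairs share key 1</pre>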
<p>You might have a lot of these queries sitting around and needing to be kept up-to-date, but each of them reduces the cost of maintaining its parent query by enough to make the trade well worth it.  Combined with other techniques, we've gotten a typical performance improvement of 3-4 orders of magnitude over several commercial data management systems.  </p>
<p>And that's the basic idea of the viewlet transform.  More soon.</p>
<p>-Oliver</p>


@ -0,0 +1,28 @@
---
title: "AGCA, The language of change (Part 1: Everything is Multiplicities)"
author: Oliver Kennedy
---
<p style="text-align: justify;">Before I can go into detail on the viewlet transform, I first need to talk about a language called the AGgregate CAlculus.  Over the coming weeks, I'll try to give a high-level overview of the language from a practical perspective, and hopefully give you some insight into why it is important.  </p>
<p style="text-align: justify;">Although the name "Aggregate Calculus" might sound imposing, AGCA is actually very close to SQL.  Its key feature (one of the reasons that it is crucial for the viewlet transform) is that it separates values that can be maintained incrementally from those that cannot.  For reasons that will become clear, we'll call incrementally maintainable values <strong>multiplicities</strong>, and non-incrementally maintainable values <strong>variables</strong>.</p>
<p style="text-align: justify;">The core mantra… the one thing that everyone who I've ever tried to teach AGCA to has struggled to wrap their head around (at first, anyway), is that <strong>Everything is Multiplicities</strong>.  Remember this phrase.  <strong>Everything is Multiplicities</strong>.  </p>
<p style="text-align: justify;">I'm being a little overly general.  If I wanted to be precise, I'd say that "Everything is a mapping from tuples of variables to multiplicities."  That's a bit of a mouthful (and I like short catch phrases), so let's stick with <strong>Everything is Multiplicities</strong>.</p>
<p style="text-align: justify;">And just to be sure you're following along: <strong>Everything is Multiplicities</strong>.</p>
<p style="text-align: justify;">So, how do we write queries in AGCA?  Well, to start, we need some way to refer to our input data.  In spite of trends in the corporate world, AGCA works with relational data.  So, all all of our inputs will be tables (or if you want to get fancy, "relations").  If we want to refer to a table, we write down its name and give each of its columns a name.  For example, if we write down:</p>
<pre style="text-align: left;">R(A,B,…)</pre>
<p style="text-align: left;">We mean "all of the rows (or tuples) of R, and we'll name first column of R 'A', name the second column 'B', and so forth…"  This is pretty much the simplest possible SQL query you can think of:</p>
<pre style="text-align: left;">SELECT A, B, … FROM R;</pre>
<p style="text-align: left;">Simple, right?  Well, ok, there's actually a little twist.  See, in SQL, you're allowed to have several identical rows in a table (by default anyway, keys have to be added explicitly).  To use the technical term, SQL works with what are called multisets (also known as "bags").  So, we're going to do something clever in AGCA.  Let's say you have a table of your customer's first names.  If you write the AGCA expression:</p>
<pre style="text-align: left;">CUSTOMER(FIRSTNAME)</pre>
<p style="text-align: justify;">You're <em>not</em> going to get one row for every customer.  Instead, you're going to get something like this:</p>
<pre><span style="text-decoration: underline;">__FIRSTNAME______#__</span></pre>
<pre>&lt; John &gt; -&gt; 8</pre>
<pre>&lt; Joe &gt; -&gt; 3</pre>
<pre>&lt; Steve &gt; -&gt; 5</pre>
<pre>&lt; Alphonse &gt; -&gt; 1</pre>
<p style="text-align: justify;">That is to say, you'll get one output row every <em>distinct</em> row in your table, together with the number of times that this row occurs in your table (that is, you get its <strong>multiplicity</strong>).  In the above example, you have 8 customers named John, but only 1 customer named Alphonse.  This is the core idea of AGCA: Everything is multiplicities.  If it helps, you can think of every AGCA expression as having an implicit group-by COUNT(*) aggregate around it.  </p>
<pre>CUSTOMER(FIRSTNAME)</pre>
<p style="text-align: justify;">is effectively the SQL query:</p>
<pre><strong>SELECT</strong> *, COUNT(*) <strong>FROM</strong> CUSTOMER <strong>GROUP BY</strong> CUSTOMER.*</pre>
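<p style="text-align: justify;">If it helps to see this in code, here's a rough Python analogy (invented for this post; it is not AGCA or DBToaster code): a relation is just a mapping from distinct rows to their multiplicities, and looking up a row gives you its count.</p>
<pre>from collections import Counter

# The raw bag of customer first names...
customer_rows = ["John"] * 8 + ["Joe"] * 3 + ["Steve"] * 5 + ["Alphonse"]

# ...and the same data, AGCA-style: row -&gt; multiplicity.
CUSTOMER = Counter((name,) for name in customer_rows)

print(CUSTOMER[("John",)])      # 8
print(CUSTOMER[("Alphonse",)])  # 1
print(CUSTOMER[("Zardoz",)])    # 0 -- rows that never occur have multiplicity 0</pre>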
<p style="text-align: justify;">By the way, there's a little bit of weirdness here.  AGCA doesn't have precisely the same semantics for COUNT as SQL.  See, every AGCA expression describes an infinite number of rows.  Even CUSTOMER(FIRSTNAME) effectively has an infinite number of rows: There aren't any customers named Zardoz, but the expression does <em>technically</em> contain the row &lt; Zardoz &gt; -&gt; 0.  That said, in general, we're only interested in a few of those rows (the ones that aren't 0).  In the interest of keeping things intuitive, I'm going to sweep this issue under the rug for now and come back to it in a future post.</p>
<p style="text-align: justify;">That's it for now.  Tune in next week for: JOINs and UNIONs in AGCA.  </p>
<p style="text-align: justify;">(If you want to learn more now, and have a good understanding of algebraic structures, have a look at the PODS2010 paper on AGCA: "<a href="http://dl.acm.org/citation.cfm?id=1807100">Incremental Query Evaluation in a Ring of Databases</a>")</p>


@ -0,0 +1,67 @@
---
title: "AGCA, The language of change (Part 2: Ringing in Change)"
author: Oliver Kennedy
---
<p>Last week, I started talking about the AGCA (Aggregate Calculus) query language.  AGCA is a query language that makes an explicit distinction between the parts of data that can be easily managed incrementally, and the parts that cannot.  This makes it incredibly useful for incremental computation techniques like the viewlet transform.</p>
<p>At the heart of AGCA is a simple mantra that I harped on extensively last week: Everything is multiplicities.  If I write down a query, such as:</p>
<pre>CUSTOMER(FIRSTNAME)</pre>
<p>What I will get back is a table with one row for each <em>distinct</em> customer first name.  </p>
<pre><span style="text-decoration: underline;">__FIRSTNAME______#__</span></pre>
<pre>&lt; John &gt; -&gt; 8</pre>
<pre>&lt; Alphonse &gt; -&gt; 1</pre>
<p>The table has two columns: the FIRSTNAME column, and a count, or multiplicity for each row.  In the example above, I have 8 customers named John, and 1 named Alphonse.  </p>
<p>There's actually another way of looking at CUSTOMER(FIRSTNAME).  You can think of it as a function (for those who are interested, it's actually something called a <a href="http://en.wikipedia.org/wiki/Monad_(functional_programming)">monad</a>).  If I give it a value for FIRSTNAME, it gives me back the number of customers who have that particular first name.  </p>
<p>And this leads me to this week's topic: JOINs (definition <a href="http://en.wikipedia.org/wiki/Join_(relational_algebra)#Joins_and_join-like_operators">here</a>) and UNIONs (specifically what's known as a "Bag Union", defined briefly <a href="http://en.wikipedia.org/wiki/Set_(computer_science)#Multiset">here</a>). As we'll see, there's a nice way of looking at these two common database operations.</p>
<p>For those unfamiliar with the concept, a JOIN between two tables matches up rows of each table based on some rule.  For example, I might have two tables with information about bicycles available for purchase: FRAMES(COLOR, TIRESIZE) and TIRES(TIRESIZE, TIRETYPE).  Here's some sample data (along with multiplicities… ignore those for now):</p>
<pre><span style="text-decoration: underline;">__COLOR____TIRESIZE______#__</span></pre>
<pre>&lt; Blue , 26" &gt; -&gt; 1</pre>
<pre>&lt; Red , 26" &gt; -&gt; 3 </pre>
<pre>&lt; Red , 20" &gt; -&gt; 1</pre>
<pre>&lt; Black, 20" &gt; -&gt; 2</pre>
<pre><span style="text-decoration: underline;">__TIRESIZE__TIRETYPE______#</span></pre>
<pre>&lt; 26" , Mountain &gt; -&gt; 1</pre>
<pre>&lt; 26" , Road &gt; -&gt; 1</pre>
<pre>&lt; 20" , Road &gt; -&gt; 2</pre>
<p>Let's say I'm interested in all the possible options I have for a new bike.  In this case, I need to pick out a frame and a tire type.  Clearly, I want to make sure that the tire I get is appropriate for the frame: The TIRESIZE of the tire has to match up with the TIRESIZE of the bike I get.  So, to enumerate all the possible options, I can compute a JOIN between these two tables on the condition that the TIRESIZEs are identical (again, ignore the multiplicities for now):</p>
<pre><span style="text-decoration: underline;">__COLOR____TIRESIZE__TIRETYPE______#__</span></pre>
<pre>&lt; Blue , 26" , Mountain &gt; -&gt; 1</pre>
<pre>&lt; Blue , 26" , Road &gt; -&gt; 1</pre>
<pre>&lt; Red , 26" , Mountain &gt; -&gt; 3</pre>
<pre>&lt; Red , 26" , Road &gt; -&gt; 3</pre>
<pre>&lt; Red , 20" , Road &gt; -&gt; 2</pre>
<pre>&lt; Black , 20" , Road &gt; -&gt; 4</pre>
<p>This is actually a special kind of join (condition) called a <em>natural join</em>: a natural join matches up rows based on columns that share the same name (in this case, the TIRESIZE column).  As we'll see in a bit, AGCA uses only natural joins.  But how exactly do joins work in AGCA?  </p>
<p>Well, let's start by looking at those multiplicities in our input data.  There is only a single type of Blue bike with 26" tires available for purchase, but two different types of 26" tires (Mountain and Road).  Together these produce 2 different options for purchase (the first 2 rows of the output).  </p>
<p>The multiplicity of the Red, 26" bike row is 3.  What this means in practical terms is that there are 3 different types of Red, 26" bikes available.  If I wanted one of those, I would actually have 6 different options (2 types of tire as before, and now 3 different types of bike).  </p>
<p>The same thing happens with the 20" bikes.  There are two different types of 20" tire (this time, both are road tires).  There are also 2 different types of Black, 20" bikes.  So, we have 4 different purchase options.</p>
<p>You've probably seen the pattern by now.  When we JOIN two rows together, <strong>the multiplicities of the JOINed rows multiply</strong>.  Because of this, JOINs in AGCA are written down as products.  For example, we'd write down the above join as:</p>
<pre>FRAMES(COLOR, TIRESIZE) * TIRES(TIRESIZE, TIRETYPE)</pre>
<p>Note that the join condition is not explicitly specified in this query.  This is because all joins in AGCA are natural joins.  In this case, TIRESIZE appears in both FRAMES and TIRES, so we know that the query is only asking to match up the TIRESIZEs of both inputs.</p>
<p>Remember that I mentioned earlier that you can think of a table as a function (monad).  This holds for any AGCA expression.  The example join defines a function (monad) with three parameters: COLOR, TIRESIZE, and TIRETYPE.  Let's say I wanted to evaluate it with the parameters (Black, 20", Road):</p>
<pre>FRAMES(Black, 20") * TIRES(20", Road) = 2 * 2 = 4</pre>
<p>Cool, right?</p>
<p>By the way, if there are no overlapping column names, then the natural join is just a <a href="http://en.wikipedia.org/wiki/Cartesian_product">cartesian cross product</a>.</p>
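<p>Here's a small Python sketch of that rule (purely illustrative; the helper name and nested loops are invented for the sketch, not how DBToaster evaluates anything): represent each table as a map from rows to multiplicities, and a natural join multiplies the multiplicities of matching rows.</p>
<pre>FRAMES = {("Blue", '26"'): 1, ("Red", '26"'): 3, ("Red", '20"'): 1, ("Black", '20"'): 2}
TIRES  = {('26"', "Mountain"): 1, ('26"', "Road"): 1, ('20"', "Road"): 2}

def join_on_tiresize(frames, tires):
    out = {}
    for (color, size1), m1 in frames.items():
        for (size2, tiretype), m2 in tires.items():
            if size1 == size2:                             # the natural-join condition
                out[(color, size1, tiretype)] = m1 * m2    # multiplicities multiply
    return out

options = join_on_tiresize(FRAMES, TIRES)
print(options[("Black", '20"', "Road")])   # 4 = FRAMES(Black, 20") * TIRES(20", Road) = 2 * 2</pre>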
<p>Ok, so, what about UNIONs?  Well, let's say we have tables representing two different purveyors of coffee beans: ORENS(ROAST) and CTB(ROAST) (their data shown below, in that order):</p>
<pre><span style="text-decoration: underline;">__ROAST_________#__</span></pre>
<pre>&lt; Espresso &gt; -&gt; 2</pre>
<pre>&lt; Light &gt; -&gt; 1</pre>
<pre><span style="text-decoration: underline;">__ROAST______#__</span></pre>
<pre>&lt; Medium &gt; -&gt; 2</pre>
<pre>&lt; Light &gt; -&gt; 3</pre>
<p>If I wanted to know all of my coffee-purchasing options, I could compute the union of these two tables.  Recall that we're working with multisets/bags, so to keep things simple let's assume that there's no overlap between the offerings of both stores.  If that's the case, then we can just merge the two tables together, adding together the multiplicities of rows that appear in both inputs:</p>
<pre style="margin: 8px;"><span style="text-decoration: underline;">__ROAST_________#__</span></pre>
<pre style="margin: 8px;">&lt; Espresso &gt; -&gt; 2</pre>
<pre style="margin: 8px;">&lt; Medium &gt; -&gt; 2</pre>
<pre style="margin: 8px;">&lt; Light &gt; -&gt; 4</pre>
<p>In short, <strong>UNION adds row multiplicities together</strong>. Just like AGCA uses product to represent joins, sum represents unions.  So our coffee shop query (let's call it Q) is written down as</p>
<pre>Q(ROAST) := ORENS(ROAST) + CTB(ROAST)</pre>
<p>Again, we can look at this query as a function.  For example:</p>
<pre>Q(Light) = ORENS(Light) + CTB(Light) = 1 + 3 = 4</pre>
<p>There are some caveats here.  First off, the column names of the things being UNIONed together typically have to be the same.  ORENS(ROAST) + FRAMES(COLOR, TIRESIZE) doesn't really make much sense.  I'll return to this assumption before long, but be aware of it for now.  </p>
<p>Second, as I briefly mentioned last week, tables (and expressions in general) contain an infinite number of rows -- although only a small number have non-zero multiplicities.  This is something to keep in mind when thinking about AGCA expressions as functions.  For example:</p>
<pre>Q(Espresso) = ORENS(Espresso) + CTB(Espresso) = 2 + 0 = 2</pre>
<p>Or, going back to our bike options example:</p>
<pre>Q(Black, 20", Mountain) = FRAMES(Black, 20") * TIRES(20", Mountain) = 2 * 0 = 0</pre>
<p>Note how the zeroes propagate through joins.  This makes sense if you think about it.  JOINing against a row that doesn't exist doesn't produce any output rows (ignoring outer joins for the moment).</p>
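<p>And the matching sketch for union (again, an invented Python illustration rather than AGCA itself): merging two multiplicity maps adds the counts, rows missing from one side contribute 0, and those zeroes pass harmlessly through a union even though they annihilate a join.</p>
<pre>ORENS = {("Espresso",): 2, ("Light",): 1}
CTB   = {("Medium",): 2, ("Light",): 3}

def union(a, b):
    out = dict(a)
    for row, m in b.items():
        out[row] = out.get(row, 0) + m   # multiplicities add
    return out

Q = union(ORENS, CTB)
print(Q[("Light",)])          # 1 + 3 = 4
print(Q.get(("Decaf",), 0))   # 0 -- a roast neither store carries</pre>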
<p>For those interested in algebraic structures, AGCA actually forms something called a Ring, with product and sum defined as above.  I'll talk about constants in a few weeks, but you can think of the "zero" and "one" values of the ring as special tables ZERO() and ONE() with no columns, and only a single row each: <span style="font-family: Courier; font-size: 12px;">&lt;&gt; -&gt; 0</span>, and <span style="font-family: Courier; font-size: 12px;">&lt;&gt; -&gt; 1</span>, respectively.  The set underlying the ring is a specific incarnation of the monads that I've been alluding to.  </p>
<p>That's it for now.  Next week: Two typically unrelated operations, brought together by AGCA in a somewhat surprising way: PROJECTion and the COUNT aggregate.</p>


@ -0,0 +1,48 @@
---
title: "AGCA, The language of change (Part 3: The Sum Project)"
author: Oliver Kennedy
---
<p>For a few weeks now, I've been talking about the AGCA query language, a language for incremental computation.  If you haven't already done so, you should probably read <a href="http://www.xthemage.net/blog/?p=14">Part 1</a> and <a href="http://www.xthemage.net/blog/?p=42">Part 2</a> before continuing on with this post.  </p>
<p>Just to recap, in AGCA, queries are written down as algebraic formulas.  The most basic term in the language is the table (Relations, if you want to be fancy).  Multiplication is the natural join, and addition is bag union.  </p>
<p>And of course, the most important thing about AGCA is that Everything is Multiplicities.  Unlike SQL, where the result of a query is simply a list of output rows, a query result in AGCA is more like a lookup table.  Each row of the output is associated with a <em>multiplicity</em> (loosely speaking, the number of times that the row occurs in the result). Because the rows are unique, you can use the row to look up its multiplicity in the query result.  To use the technical terms, query outputs are Maps (a.k.a., Hashes, Dictionaries, HashMaps, etc…), each row is a key in the map, and multiplicities are the mapped values.  </p>
<p>Note by the way, that this doesn't stop us from talking about empty rows.  Look at the following SQL query:</p>
<pre>SELECT </pre>
<pre>FROM ACTORS;</pre>
<p>Or equivalently, in English, "give me an empty row for every actor."  There's a good chance your favorite SQL system won't approve of this query.  You can also certainly make a good argument that this query isn't especially useful.  That said, the query does have a meaning.  Give me one row (with nothing in it) for every actor.  </p>
<p>Recall one more thing from last week: How addition/union works in AGCA.  "Duplicate" rows on either side are merged, and their multiplicities are added together.  If a row occurs twice on one side of the union, and three times on the other, then the final unioned output has five copies of the row (or as AGCA would put it, the row has a multiplicity of 5 in the output).</p>
<p>Where am I going with this?  Well, empty rows are all identical.  So, if you have a result that contains only empty rows, the result is guaranteed to have exactly one row (or zero rows, that is, one row with multiplicity 0).  Let's see an example on this table of actor first names.  </p>
<pre><span style="text-decoration: underline;">__FIRSTNAME______#__</span></pre>
<pre>&lt; Steve &gt; -&gt; 2</pre>
<pre>&lt; Jim &gt; -&gt; 1</pre>
<pre>&lt; John &gt; -&gt; 3</pre>
<p>Let's say we get rid of the FIRSTNAME column (project it away, to use the technical term).  We end up with</p>
<pre><span style="text-decoration: underline;">________#__</span></pre>
<pre>&lt;  &gt; -&gt; 2</pre>
<pre>&lt;  &gt; -&gt; 1</pre>
<pre>&lt;  &gt; -&gt; 3</pre>
<p>But that's wrong.  Every row is supposed to be unique.  All those duplicate empty rows need to be merged together.  So, just like we merge together rows when computing a UNION, we add up the multiplicities of these empty rows.  </p>
<pre><span style="text-decoration: underline;">________#__</span></pre>
<pre>&lt;  &gt; -&gt; 6</pre>
<p>What exactly just happened here?  Well, by projecting away the FIRSTNAME column, we've essentially computed the COUNT(*) of the number of rows in the input.  Recall that when I first described AGCA, I mentioned that every query has an implicit COUNT(*) around it.  Instead of</p>
<pre>SELECT FROM ACTORS;</pre>
<p>what AGCA actually computes is</p>
<pre>SELECT COUNT(*) FROM (SELECT FROM ACTORS);  </pre>
<p>or, put more simply</p>
<pre>SELECT COUNT(*) FROM ACTORS;</pre>
<p>The same idea can actually be taken a bit further.  Let's say you have the following table:</p>
<pre><span style="text-decoration: underline;">__FIRSTNAME__LASTNAME_________#__</span></pre>
<pre>&lt; Steve , Carell &gt; -&gt; 1</pre>
<pre>&lt; Steve , Coogan &gt; -&gt; 1</pre>
<pre>&lt; Jim , Carrey &gt; -&gt; 1</pre>
<pre>&lt; John , Depp &gt; -&gt; 1</pre>
<pre>&lt; John , Galecki &gt; -&gt; 1</pre>
<pre>&lt; John , Rhys-Davies &gt; -&gt; 1</pre>
<p>What happens if we project away just the LASTNAME column?  We get 2 Steves, 1 Jim, and 3 Johns, exactly the same table that we started with.  In other words, by projecting away just the LASTNAME column, we end up computing a group-by COUNT(*) aggregate:</p>
<pre>SELECT FIRSTNAME, COUNT(*)</pre>
<pre>FROM ACTORS </pre>
<pre>GROUP BY FIRSTNAME;</pre>
<p>This technique of using projection to compute the COUNT(*) aggregate also lets us compute group-by aggregates.  <strong>Projection and the COUNT(*) aggregate are the same thing in AGCA</strong>.  AGCA uses a special operator called AggSum to represent this operation.  For example, the above group-by COUNT(*) aggregate is written as:</p>
<pre>AggSum([FIRSTNAME], ACTORS(FIRSTNAME,LASTNAME))</pre>
<p>Or in general:</p>
<pre>AggSum([{group by var 1}, {group by var 2}, …], {aggregated AGCA expression})</pre>
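<p>In the dictionary-of-multiplicities picture from the earlier sketches (still purely illustrative -- the function below is invented for this post, not AGCA syntax), AggSum is just "drop the columns you aren't grouping by, then merge duplicates by adding their multiplicities":</p>
<pre>ACTORS = {("Steve", "Carell"): 1, ("Steve", "Coogan"): 1, ("Jim", "Carrey"): 1,
          ("John", "Depp"): 1, ("John", "Galecki"): 1, ("John", "Rhys-Davies"): 1}

def agg_sum(group_by_columns, relation):
    out = {}
    for row, multiplicity in relation.items():
        key = tuple(row[i] for i in group_by_columns)   # keep only the group-by columns
        out[key] = out.get(key, 0) + multiplicity       # merged duplicates add up
    return out

print(agg_sum([0], ACTORS))   # {('Steve',): 2, ('Jim',): 1, ('John',): 3}
print(agg_sum([], ACTORS))    # {(): 6} -- COUNT(*) is "project everything away"</pre>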
<p>And there you have it: How AGCA handles projection and the COUNT(*) aggregate.  Next week, the SUM() aggregate, and conditions/selection in AGCA.</p>


@ -0,0 +1,89 @@
---
title: "AGCA, The language of change (Part 4: The table special)"
author: Oliver Kennedy
---
<p>For several weeks now, I've been describing AGCA, a language for incremental processing.  </p>
<p>In <a href="http://www.xthemage.net/blog/?p=14">part 1</a>, I covered the core idea behind AGCA: Instead of storing data as a list of rows, AGCA keeps only one copy of each unique row, and tags it with the number of times it appears.  In other words, data in AGCA is a function that maps each row to the number of times that it appears in the data (the row's multiplicity).  </p>
<p>In <a href="http://www.xthemage.net/blog/?p=42">part 2</a>, I showed you how AGCA handles unions (it sums multiplicities, so we call it +) and joins (it multiplies multiplicities, so we call it *).</p>
<p>Most recently, in <a href="http://www.xthemage.net/blog/?p=49">part 3</a>, I showed you how AGCA handles the COUNT(*) aggregate, and how COUNT(*) and projection are actually the same thing in AGCA.  After columns are projected away, the results might end up containing duplicate rows -- so the multiplicities of these rows get added together to produce the final result.</p>
<p>This week, I plan to talk about the SUM() aggregate, and selection (filtering) in AGCA.  Let's start with SUM().  An interesting thing to note about COUNT(*) is that it's a special instance of SUM().  Specifically</p>
<pre>SELECT COUNT(*) FROM R;</pre>
<p>is completely equivalent to </p>
<pre>SELECT SUM(1) FROM R;</pre>
<p>In other words, for every row that appears in the result, we add the numerical value '1' to the final result.  Straightforward enough, right?  But what if we wanted to add a more complex/interesting value to the result?  Say we wanted to compute the following SQL query</p>
<pre>SELECT SUM(2) FROM R;</pre>
<p>In other words, for every copy of every row that appears in the result, we add the numerical value '2' to the final result.  Just to recap, the AGCA for the COUNT(*) of R is:</p>
<pre>AggSum([], R(A,B))</pre>
<p>This query takes the multiplicity of every row of R, and sums them, giving us SUM(1) of R.  In order to compute SUM(2), we'd have to multiply those multiplicities by 2.</p>
<p>As I pointed out in week 2, joins multiply multiplicities.  So what we need is a special kind of table. We need a table with one row with a multiplicity of '2', which matches with every row of R.  We need a table that looks like this:</p>
<pre>________#__</pre>
<pre> &lt; &gt; -&gt; 2</pre>
<p>Because there are no columns in this special table, it joins with every row in any table/query you can come up with.  We can call this special table {2} when writing down AGCA queries:</p>
<pre>R(A,B) * {2}</pre>
<p>Which says "Give me two copies of every row of R".  Incidentally, as I mentioned before, AGCA forms an algebraic structure called a ring.  One of the properties of a ring is distributivity.  That is,</p>
<pre>R(A,B) * {2} = R(A,B) * ({1}+{1}) = R(A,B) + R(A,B)</pre>
<p>Or in other words, two copies of everything in R is identical to the union of R with itself (which is, of course, true).  So in summary, the AGCA query to compute the SUM(2) of R is</p>
<pre>AggSum([], R(A,B) * {2})</pre>
<p>What's cool about this is that one of these special tables can be created for <strong>any</strong> number that we want to sum up.  AGCA is not just limited to positive integers, so there's nothing wrong with the AGCA query:</p>
<pre>R(A,B) * {-1}</pre>
<p>or even the query</p>
<pre>R(A,B) * {3.14159265}</pre>
<p>Let that sink in for a moment, because <strong>if this doesn't weird you out, you're probably missing something</strong>.  I've been talking about multiplicities as the number of times a row occurs in a data set.  AGCA plays a lot more fast and loose with multiplicities.  An AGCA query result can include fractions of rows, or even rows with negative multiplicities -- a sort of anti-row:</p>
<pre>R(A,B) + R(A,B) * {-1} = R(A,B) * ({1} + {-1}) = R(A,B) * {0}</pre>
<p>Negative multiplicities will turn out to be super-useful for incremental maintenance, as you'll see.</p>
<p>Moving on, the ability to multiply multiplicities by constants is useful, but what we really need is the ability to compute sums with variables in them.  For example:</p>
<pre>SELECT SUM(A) FROM R;</pre>
<p>In other words, for every copy of every row that appears in the result, we want to take the value of column 'A' and add it to the final result.  AGCA actually has a special table for this as well.  If we call that table {A}, the AGCA for SUM(A) of R is</p>
<pre>AggSum([], R(A,B) * {A})</pre>
<p>So what does {A} look like?  Well, let's say we have some example data in R:</p>
<pre>___A__B______#__</pre>
<pre> &lt; 1, 1 &gt; -&gt; 2</pre>
<pre> &lt; 2, 2 &gt; -&gt; 1</pre>
<pre> &lt; 2, 5 &gt; -&gt; 3</pre>
<p>We want to turn this into:</p>
<pre>___A__B______________#__</pre>
<pre> &lt; 1, 1 &gt; -&gt; 2 * 1 = 2</pre>
<pre> &lt; 2, 2 &gt; -&gt; 1 * 2 = 2</pre>
<pre> &lt; 2, 5 &gt; -&gt; 3 * 2 = 6</pre>
<p>{A}, when joined with R(A,B), should multiply the rows where A is 1 by 1, the rows where A is 2 by 2, and so forth…  In other words, we want a table that can be joined with R(A,B) on A: a table with one row for every possible value of A in its 'A' column, and that same value as each row's multiplicity.</p>
<pre>___A______#__</pre>
<pre> &lt; 1 &gt; -&gt; 1</pre>
<pre> &lt; 2 &gt; -&gt; 2</pre>
<pre> &lt; 3 &gt; -&gt; 3</pre>
<pre> ...</pre>
<p>Unlike the special table for constants, the special table for variables actually has an infinite number of rows in it.  As I've mentioned before, technically, all AGCA queries produce an infinite number of results, but only a (relatively) small number of 'interesting' ones (in technical terms, the query results have 'finite support').  That's not the case with this kind of special table.  There are an infinite number of interesting rows.  So a query like:</p>
<pre>{A}</pre>
<p>Would produce an infinite number of rows.  Of course, such a query doesn't really make sense either.  You're always going to join this kind of special table with a table that doesn't produce an infinite number of (interesting) rows, and that has 'A' as a column.  When you do that, all but a few of the rows of {A} will get zeroed out.  Try it yourself on the SUM(A) example if you're not convinced.</p>
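<p>Here's that experiment, in the illustrative dictionary-of-multiplicities picture used in the earlier posts (the code below is invented for the sketch, not AGCA syntax): joining with {A} amounts to scaling each row's multiplicity by that row's A value, and the surrounding AggSum adds the scaled multiplicities up.</p>
<pre>R = {(1, 1): 2, (2, 2): 1, (2, 5): 3}   # (A, B) -&gt; multiplicity

def times_value_of_A(relation):
    # The join with the special table {A}: scale each multiplicity by the row's A value.
    return {row: multiplicity * row[0] for row, multiplicity in relation.items()}

print(sum(times_value_of_A(R).values()))   # 2*1 + 1*2 + 3*2 = 10, i.e. SELECT SUM(A) FROM R</pre>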
<p>By the way, what I am describing is identical to a concept in programming languages/logic called bound and free variables (also known as safe/unsafe variables in query processing).  'A' is a free variable in {A}, and a bound variable in R(A,B).  Just like in SQL, a query with any free/unsafe variables in it has to have an external assignment of values to those variables before it can be evaluated.  More on that later.</p>
<p>On a related note, AGCA treats columns more as variables than anything else.  For this reason, I'll be using the terms 'column' and 'variable' interchangeably from now on.  </p>
<p>One last thing before I move on to selection.  It's possible for AGCA to represent even more complex special tables.  For example, if I wanted to compute the following SQL query</p>
<pre>SELECT SUM(exp(2, A) + 2*B) FROM R;</pre>
<p>Any computable expression can be turned into one of these special tables, regardless of how many variables appear in it.  To compute the SQL query above, I could use</p>
<pre>AggSum([], R(A,B) * {exp(2, A) + 2 * B})</pre>
<p>Which is equivalent to</p>
<pre style="margin: 8px;">AggSum([], R(A,B) * {exp(2, A)} + {2 * B}) = AggSum([], R(A,B) * {exp(2, A)} + {2} * {B})</pre>
<p>Regardless, every variable that appears in the formula defining a special table will be one of the special table's columns.  </p>
<p>AGCA uses the very same idea to implement selection (filtering).  Let's say we wanted to compute:</p>
<pre>SELECT COUNT(*) FROM R WHERE A &lt; B;</pre>
<p>In other words, we want to filter out some rows of R (the ones where A &gt;= B), and keep others (where A &lt; B).  Again, AGCA deals entirely in multiplicities.  Filtering out a row is equivalent to setting its multiplicity to 0.  Keeping a row means leaving its multiplicity unchanged.  </p>
<pre>___A__B______________#__</pre>
<pre> &lt; 1, 1 &gt; -&gt; 2 * 0 = 0</pre>
<pre> &lt; 2, 2 &gt; -&gt; 1 * 0 = 0</pre>
<pre> &lt; 2, 5 &gt; -&gt; 3 * 1 = 3</pre>
<p>Just like a special table exists for any expression, any boolean predicate can be turned into a special table where each row has a multiplicity of either 0 or 1.  For example, for A &lt; B:</p>
<pre>___A__B______#__</pre>
<pre> &lt; 1, 1 &gt; -&gt; 0</pre>
<pre> &lt; 1, 2 &gt; -&gt; 1</pre>
<pre> &lt; 1, 3 &gt; -&gt; 1</pre>
<pre> ...</pre>
<pre> &lt; 2, 1 &gt; -&gt; 0</pre>
<pre> &lt; 2, 2 &gt; -&gt; 0</pre>
<pre> &lt; 2, 3 &gt; -&gt; 1</pre>
<pre> ...</pre>
<p>And there you have it.   Computing the filtered count of R is as simple as</p>
<pre>AggSum([], R(A,B) * {A &lt; B})</pre>
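<p>The same illustrative Python sketch works for predicates (again, invented for this post, not AGCA syntax): the special table for A &lt; B just contributes a factor of 1 or 0 to each row's multiplicity, and everything filtered out simply ends up with multiplicity 0.</p>
<pre>R = {(1, 1): 2, (2, 2): 1, (2, 5): 3}   # (A, B) -&gt; multiplicity

def filter_a_less_than_b(relation):
    # Joining with {A &lt; B}: multiply by 1 where the predicate holds, by 0 where it doesn't.
    return {row: m * (1 if row[0] &lt; row[1] else 0) for row, m in relation.items()}

print(sum(filter_a_less_than_b(R).values()))   # 3, i.e. SELECT COUNT(*) FROM R WHERE A &lt; B</pre>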
<p>Note that this makes the special tables {1} and {0} equivalent to the booleans TRUE and FALSE respectively.  Also note that while * is equivalent to a boolean AND, + is not equivalent to a boolean OR ({1} + {1} = {2}).  More on how we handle this later.</p>
<p>With that, I've covered all of the basics of AGCA.  Next week, a quick reference summary of everything that I've covered so far, and the week after that, I'll dive into the viewlet transform itself.  </p>
<p>By the by, I'm ignoring two features of AGCA for now, one needed to support nested aggregates, and one needed to support existential quantification (as well as certain kinds of nested aggregates).  I'll get to these once I've covered the viewlet transform and can better describe the challenges that nested aggregates create.</p>
<p>And to anyone who reads this before tomorrow: Keep an eye on http://www.dbtoaster.org for the official DBToaster release on July 9.</p>


@ -0,0 +1,7 @@
---
title: "DBToaster Released"
author: Oliver Kennedy
---
<p>Go get your copy at <a href="http://www.dbtoaster.org">http://www.dbtoaster.org</a></p>
<p>What are you waiting for?  Go!</p>
<p>Seriously, stop reading this.</p>


@ -0,0 +1,42 @@
---
title: "AGCA Summary"
author: Oliver Kennedy
---
<p>For the past month, I've been talking about AGCA, a language for incremental processing. Next week, I'll go into AGCA's primary application: The viewlet transform. But before I get to the transform, I'm going to do a quick overview of AGCA so that the basics of the language are all in one post.</p>
<p>I also lied a bit... I want to introduce one more concept: a simplified form of an operation AGCA calls Lift.</p>
<p>Something that's required for any real query language is the ability to define tuples inline. This is something you might do in SQL as</p>
<pre>(SELECT 1 AS A)</pre>
<p>AGCA uses the Lift operation for this purpose:</p>
<pre>(A ^= 1)</pre>
<p>Think of the lift operation as variable assignment in a programming language. It creates a single-column relation with a single row in it</p>
<pre>___A______#__</pre>
<pre> &lt; 1 &gt; -&gt; 1</pre>
<p>Lift can be combined with the Natural Join to construct arbitrarily wide single-row relations. For example:</p>
<pre>(SELECT 1 AS A, 2 AS B, 3 AS C)</pre>
<p>would be expressed in AGCA as</p>
<pre>(A ^= 1) * (B ^= 2) * (C ^= 3)</pre>
<p>Lift and Union combine similarly to create multiple row relations.</p>
<pre>(A ^= 1) + (A ^= 2) + (A ^= 3)</pre>
<p>We'll need the lift when we talk about incrementality. That said, let's get to the summary.</p>
<h3>Relation (Table)</h3>
<pre>NAME(COL1, COL2, ...)</pre>
<p>Represents the contents of a base relation (aka a Table). The output is a mapping from every distinct row of the relation to the tuple's multiplicity in the relation. If the same relation appears more than once in the same expression, each occurrence of the relation can have different column names.</p>
<h3>Natural Join</h3>
<pre>A * B</pre>
<p>The natural join of the relations defined by expressions A and B. Every row in the output of A will be matched with every row in the output of B that has the same values for columns with the same name. If there are no columns with the same name, this is effectively the cartesian cross-product. For every row in A matched with a row in B, a single row with a multiplicity equal to the product of the two matched rows will be output.</p>
<h3>Bag Union</h3>
<pre>A + B</pre>
<p>The bag union of the relations defined by expressions A and B. In general, A and B should have the same schemas (although AGCA does support the case where they don't). Every row of the output has a multiplicity equal to the sum of the row's multiplicities in A and B.</p>
<h3>Sum/Count Aggregate &amp; Projection</h3>
<pre>AggSum([col1, col2, ...], A)</pre>
<p>The Sum aggregate (grouping by col1, col2, ...) of A. This is equivalent in AGCA to projecting away everything except for col1, col2, ...etc. The output rows have schema col1, col2, ..., and any given row in the output has a multiplicity equal to the sum of all rows that got projected down to the output row.</p>
<h3>Value Expression</h3>
<pre>A * { f(var1, var2, ...) }</pre>
<p>When applied to an expression A by the natural join, multiplies each row's multiplicity by an arbitrary single-valued function f over columns var1, var2, ... of A.</p>
<h3>Comparison Predicate</h3>
<pre>A * { f(var1, var2, ...) θ g(var1, var2, ...) }</pre>
<p>When applied to an expression A by the natural join, filters out rows that do not satisfy the predicate (f θ g) where θ is a comparison operation, and f and g are arbitrary single-valued functions over columns var1, var2, ... of A.</p>
<h3>Lift</h3>
<pre>(var ^= value)</pre>
<p>Outputs a single row with column named var, with the indicated value, and a multiplicity of 1.</p>
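<p>To tie the summary together, here is one compact, purely illustrative Python rendering of the operations above (the representation and helper names are invented for this sketch; this is not how DBToaster implements AGCA).  Rows carry their column names, so the join really is a natural join.</p>
<pre>def relation(columns, data):
    """A base relation: data is a list of ((values...), multiplicity) pairs."""
    out = {}
    for values, multiplicity in data:
        key = frozenset(zip(columns, values))
        out[key] = out.get(key, 0) + multiplicity
    return out

def join(a, b):
    """Natural join: rows that agree on shared columns; multiplicities multiply."""
    out = {}
    for ra, ma in a.items():
        for rb, mb in b.items():
            if all(dict(ra).get(col, val) == val for col, val in rb):
                key = ra | rb
                out[key] = out.get(key, 0) + ma * mb
    return out

def union(a, b):
    """Bag union: multiplicities add."""
    out = dict(a)
    for row, m in b.items():
        out[row] = out.get(row, 0) + m
    return out

def agg_sum(group_by, a):
    """Sum aggregate / projection: keep only group_by columns, add multiplicities."""
    out = {}
    for row, m in a.items():
        key = frozenset((col, val) for col, val in row if col in group_by)
        out[key] = out.get(key, 0) + m
    return out

def scale(a, f):
    """Value expression or comparison predicate: multiply each multiplicity by f(row)."""
    return {row: m * f(dict(row)) for row, m in a.items()}

def lift(var, value):
    """Lift (var ^= value): one row, one column, multiplicity 1."""
    return {frozenset([(var, value)]): 1}

# SELECT SUM(A) FROM R WHERE A &lt; B, in this toy rendering:
R = relation(["A", "B"], [((1, 1), 2), ((2, 5), 3)])
filtered = scale(R, lambda row: row["A"] if row["A"] &lt; row["B"] else 0)
print(agg_sum([], filtered))              # {frozenset(): 6}
print(join(lift("A", 1), lift("B", 2)))   # one row with A=1, B=2 and multiplicity 1</pre>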


@ -0,0 +1,67 @@
---
title: "The Viewlet Transform (Part 1: Deltas in AGCA)"
author: Oliver Kennedy
---
<p>Over the last few weeks, I've been covering various aspects of AGCA, the language for incremental processing behind DBToaster.  Now, I'm going to chat a bit about the heart of DBToaster: the viewlet transform.  </p>
<p>The basic idea behind the viewlet transform is actually something that's been around for a very long time: delta queries, commonly used for Incremental View Maintenance.  Let's say you have a query <em>Q</em>. If you need to evaluate <em>Q</em> over and over again, it usually makes sense to evaluate it once and just store the results somewhere.  </p>
<p>That's great, but if the data that goes into <em>Q</em> changes, you need to update the stored results accordingly.  However, instead of re-evaluating the entire query from scratch, you can compute what's known as a delta query.  The Delta of <em>Q</em> (with respect to table <em>T</em>) is a simplified form of <em>Q</em> that tells you how the results of <em>Q</em> will change (when you apply some change <em>∂T</em> to table <em>T</em>).  Put algebraically:</p>
<p><em>Q(T+∂T) = Q(T) + ∂Q(T, ∂T)</em></p>
<p>The idea is that computing <em>∂Q</em> is more efficient, and generally faster than computing <em>Q</em> from scratch.  </p>
<p>Let's have a look at how AGCA interacts with this delta operation.  Recall that everything in AGCA is multiplicities.  We just need to figure out which rows change, and what the change in their multiplicities will be.  To start, we're going to work with updates to one table at a time, and updates to one row of that table at a time (note that this may still result in multiple rows changing in the result, but the input data will only change by one row at a time).  </p>
<h3>Deltas of Tables</h3>
<p>Concretely, what happens to the results of an AGCA query when we insert a single row, with values <em>&lt;X, Y, Z, …&gt;</em> into table <em>T</em>?</p>
<p>If the table is the one being updated, then the change is just the single row being inserted.  We can construct a singleton row in AGCA by using the lift operation, just like I described last week.</p>
<pre>∂T(A,B,C,…) = (A ^= X) * (B ^= Y) * (C ^= Z) * …</pre>
<p>If the table isn't the one being updated, then there is no change at all.  In other words, we need a special table with every single row having a multiplicity of 0.  We know how to do that.</p>
<pre>∂S(…) = {0}</pre>
<p>The delta for a deletion is similar.  Again, remember that AGCA uses multiplicities for everything.  If we're deleting a row from table <em>T</em>, we want to reduce the multiplicity of that row by 1.  How do we do this?  We add a negative multiplicity</p>
<pre>∂T(A,B,C,…) = {-1} * (A ^= X) * (B ^= Y) * (C ^= Z) * …</pre>
<h3>Deltas of Bag Unions</h3>
<p>What about bag unions?  What if we have an expression like </p>
<pre>Q = Q1 + Q2</pre>
<p>Well, let's assume that we can figure out an expression for computing the deltas of Q1 and Q2.  Then we know that </p>
<pre>Q + ∂Q = (Q1 + ∂Q1) + (Q2 + ∂Q2)</pre>
<p>AGCA is a ring, so the normal rules (distributivity, associativity, commutativity, etc…) for + and * apply to bag union and natural join as well.  So, reshuffling a bit, we get</p>
<pre>Q + ∂Q = (Q1 + Q2) + (∂Q1 + ∂Q2) = Q + (∂Q1 + ∂Q2)</pre>
<p>In other words</p>
<pre>∂(Q1 + Q2) = ∂Q1 + ∂Q2</pre>
<h3>Deltas of Natural Joins</h3>
<p>We can do something similar for natural joins.</p>
<pre>Q + ∂Q = (Q1 + ∂Q1) * (Q2 + ∂Q2) = (Q1*Q2) + (Q1*∂Q2) + (∂Q1*Q2) + (∂Q1*∂Q2)</pre>
<p>So.</p>
<pre>∂Q = (Q1*∂Q2) + (∂Q1*Q2) + (∂Q1*∂Q2)</pre>
<p>This one's a bit stranger, so let's look at it in a bit more detail.  If only <em>Q1</em> changes (but not <em>Q2</em>), then <em>∂Q2</em> = {0}.  So…</p>
<pre>∂Q = (Q1*{0}) + (∂Q1 * Q2) + (∂Q1*{0})</pre>
<p>Again, we benefit from AGCA being a ring.  {0} is AGCA's additive identity (a fancy way of saying that {0} + X = X, and also that {0} * X = {0}).  So...</p>
<pre>∂Q = {0} + (∂Q1 * Q2) + {0} = ∂Q1 * Q2</pre>
<p>This kinda makes sense.  A join takes every row of one table and matches it against every row of the other table.  If you insert a row into one of the two tables (for example, inserting ∂Q1 into Q1), the change to the final result comes from joining it against every row of the other table (Q2 in this case).  The exact same thing happens if you only insert into Q2, but not Q1.</p>
<p>So what if both Q1 and Q2 change?  For example, our query could be</p>
<pre>Q = T(A,B) * T(B,C) </pre>
<p>This is also known as a self-join.  If we insert a row into T, then there'll be three parts to the update:</p>
<ol>
<li>The inserted row joined against all of the T(B,C)s (∂T(A,B) * T(B,C))</li>
<li>The inserted row joined against all of the T(A,B)s (T(A,B) * ∂T(B,C))</li>
<li>And there's a possibility that the inserted row might join against itself. (∂T(A,B) * ∂T(B,C))</li>
</ol>
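<p>If you want to sanity-check those three terms numerically, here is a tiny, invented Python computation (a toy COUNT of the self-join, not DBToaster code): the old result plus the three delta terms equals the result over the updated table.</p>
<pre>from collections import Counter

def join_count(t1, t2):
    # COUNT(*) of T1(A,B) joined with T2(B,C) on B, with multiplicities multiplying.
    return sum(m1 * m2 for (a, b1), m1 in t1.items()
                       for (b2, c), m2 in t2.items() if b1 == b2)

T  = Counter({(1, 2): 1, (2, 2): 1})   # current contents of T(A,B)
dT = Counter({(2, 3): 1})              # inserting the single row &lt;2,3&gt;

old   = join_count(T, T)
delta = join_count(dT, T) + join_count(T, dT) + join_count(dT, dT)
new   = join_count(T + dT, T + dT)
print(old, delta, new, old + delta == new)   # 2 2 4 True</pre>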
<h3>Deltas of Special Tables</h3>
<p>The delta of one of these special tables we use for constants, numerical formulas, or comparisons is always {0}.  This may seem a little unintuitive, but it actually makes sense.  Let's say we have the query </p>
<pre>Q = T(A) * {A^2}</pre>
<p>If we insert a row &lt;3&gt; into T, that's going to change T, but it won't change the fact that (3^2 = 9).  The row &lt;3&gt;-&gt;9 is always present in the special table {A^2}, regardless of whether or not it's in T.  So we'd get</p>
<pre>∂Q = {0} + (∂T(A) * {A^2}) + {0} = ∂T(A) * {A^2}</pre>
<p>Which is precisely correct.</p>
<h3>Deltas of AggSums</h3>
<p>The delta of an AggSum is the AggSum of the delta.</p>
<pre>∂AggSum([…], Q) = AggSum([…], ∂Q)</pre>
<p>The reasoning behind this is identical to the reasoning behind the delta of bag union (+), since AggSum uses precisely the same mechanics.</p>
<h3>In Summary</h3>
<p>I'm going to punt on the delta of a Lift expression for now.  There's a bit of hidden complexity there that I'll return to in two weeks.  For now, the simplified form of the Lift operation that I've described so far always has a delta of {0}.</p>
<p>So the full list of delta rules (minus the lift) for the delta of an update to T is</p>
<pre>∂T(A,B,C,…) = {+/- 1} * (A ^= X) * (B ^= Y) * (C ^= Z) * …</pre>
<pre>∂S(…) = {0}</pre>
<pre>∂(Q1 + Q2) = ∂Q1 + ∂Q2</pre>
<pre>∂(Q1 * Q2) = (Q1 * ∂Q2) + (∂Q1 * Q2) + (∂Q1 * ∂Q2)</pre>
<pre>∂({…}) = {0}</pre>
<pre>∂(AggSum([…], Q)) = AggSum([…], ∂Q)</pre>
<p>Note two things about this.  First, the delta of a query expressed in AGCA is itself a query in AGCA.  This is a huge deal. Prior to AGCA, delta queries had special funny business that required special logic in the database to process.  In other words, you don't need any special query-processing infrastructure to process the delta of an AGCA query; you just need support for AGCA itself.</p>
<p>The second thing to note is that the delta query is simpler.  Roughly speaking, every time you take the delta of a query in AGCA, you remove one table from each join.  In other words, if you have an AGCA query, and you take its delta, and the delta of that, and so forth, eventually you'll end up with {0}.  </p>
<p>These two facts are critical to the Viewlet Transform, which I'll finally get to next week.</p>


@ -0,0 +1,109 @@
---
title: "The Viewlet Transform (Part 2: The Naive Viewlet Transform)"
author: Oliver Kennedy
---
Last week, I introduced you to how deltas work in AGCA.  To recap
<pre>∂T(A,B,C,…) = {+/- 1} * (A ^= {X}) * (B ^= {Y}) * (C ^= {Z}) * …</pre>
<pre>∂S(…) = {0}</pre>
<pre>∂(Q1 + Q2) = ∂Q1 + ∂Q2</pre>
<pre>∂(Q1 * Q2) = (Q1 * ∂Q2) + (∂Q1 * Q2) + (∂Q1 * ∂Q2)</pre>
<pre>∂(AggSum([…], Q)) = AggSum([…], ∂Q)</pre>
<pre>∂({…}) = {0}</pre>
<pre>∂(V ^= {…}) = {0} (for the simplified Lift operation only)</pre>
We've been down in the nitty gritty of AGCA for a while now.  Let's pop our heads up for a moment to remember where we're going with all of this.
We have a query (let's call it Q) and we want to be able to incrementally maintain it.  That is, we want to store a copy of the results of evaluating Q on a database (stored on disk, in memory, anywhere really), and <strong>every time the database changes in some way, we want to update the stored results to match</strong>.
<h3>Applying the Delta Transform</h3>
That's fairly easy to do somewhat efficiently if we have these deltas.  Let's say we have the query
<pre>Q := AggSum([], R(A,B) * S(B,C) * {A})</pre>
or in SQL
<pre>SELECT SUM(A) AS Q FROM R, S WHERE R.B = S.B;</pre>
Let's say R contains
<pre><span style="text-decoration: underline;">___A__B______#__</span></pre>
<pre> &lt; 1, 1 &gt; -&gt; 1</pre>
<pre> &lt; 1, 2 &gt; -&gt; 1</pre>
<pre> &lt; 2, 2 &gt; -&gt; 1</pre>
and S contains
<pre><span style="text-decoration: underline;">___B__C______#__</span></pre>
<pre> &lt; 1, 1 &gt; -&gt; 2</pre>
<pre> &lt; 2, 2 &gt; -&gt; 1</pre>
For this data, Q = 5.
Let's say we insert a new row: S(2,1).  We could certainly re-evaluate the entire query from scratch and discover that the new result is Q = 8, but this would be pretty inefficient.  Even in the best case, where everything fits in memory, re-evaluating the join requires O(|R| + |S|) work.  That's where the deltas come in.  The delta of Q tells us how the query results change with respect to a change in the table.  So let's take the delta of Q with respect to an insertion of tuple &lt;@Y,@Z&gt; into S.
<pre>∂Q := ∂(AggSum([], R(A,B) * S(B,C) * {A}))</pre>
<pre> := AggSum([], ∂(R(A,B) * S(B,C) * {A}))</pre>
<pre> := AggSum([], R(A,B) * ∂(S(B,C) * {A})</pre>
<pre> + ∂R(A,B) * (S(B,C) * {A})</pre>
<pre> + ∂R(A,B) * ∂(S(B,C) * {A}))</pre>
<pre> := AggSum([], R(A,B) * ∂(S(B,C) * {A})</pre>
<pre> + {0} * (S(B,C) * {A})</pre>
<pre> + {0} * ∂(S(B,C) * {A}))</pre>
<pre> := AggSum([], R(A,B) * ∂(S(B,C) * {A}))</pre>
<pre> := AggSum([], R(A,B) * (S(B,C) * ∂{A}</pre>
<pre> + ∂S(B,C) * {A}</pre>
<pre> + ∂S(B,C) * ∂{A}))</pre>
<pre> := AggSum([], R(A,B) * (S(B,C) * {0}</pre>
<pre> + ((B ^= {@Y}) * (C ^= {@Z}) * {A})</pre>
<pre> + ∂S(B,C) * {0}))</pre>
<pre> := AggSum([], R(A,B) * (B ^= {@Y}) * (C ^= {@Z}) * {A})</pre>
I'm not going to get into the details of optimizing AGCA expressions (yet), but trust me for now that the following (simpler) query is equivalent
<pre>∂Q := AggSum([], R(A,@Y) * {A})</pre>
or in SQL
<pre>SELECT SUM(A) FROM R WHERE R.B = @Y;</pre>
Note, by the way, that @Y is a parameter to this delta query.  When you evaluate a delta query (for example our delta for insertions into S), these parameters take their value from the tuple being modified (so when you insert &lt;2,1&gt; into S, then @Y = 2).  That said, @Y is just a normal variable/column.  There's nothing special about it (other than the @ in the name).
The delta query tells us how the query results change.  If we insert &lt;2,1&gt; into S, then we evaluate the delta query for insertions into S (∂Q above), setting @Y to 2, and @Z to 1.
<pre>AggSum([], R(A, 2) * {A})</pre>
… which, for our initial dataset above, gives us ∂Q = 3.  To figure out what Q will give us on the modified database (after inserting &lt;2,1&gt; into S), we just add ∂Q to our initial result (5 + 3 = 8).
<h3>Parameters and AggSums</h3>
I keep saying that parameters are just normal variables, and that there's nothing special about them.
That's mostly true.  I actually oversimplified a bit on the delta rules.
We want these parameters to be visible from the outside so that evaluating ∂Q for a specific insertion (or deletion) essentially amounts to selecting a single row from the output of ∂Q.  In other words, the AggSums need to be rewritten slightly so that the parameters appear in the group-by variables (where appropriate).  That is, the correct delta with respect to an insertion into S : +S(@Y,@Z) is
<pre style="margin: 8px;">AggSum([@Y], R(A, @Y) * {A})</pre>
or in SQL
<pre style="margin: 8px;">SELECT R.B, SUM(R.A) FROM R GROUP BY R.B;</pre>
That little hiccup out of the way, let's get to the actual viewlet transform.
<h3>Auxiliary Views</h3>
The delta query is an improvement over evaluating the entire query from scratch.  For this particular example though, we still need to scan over multiple rows of R (even if it is only a small subset of R).  We can do even better.
Right now, every time the database changes we update Q with ∂Q.
<pre>ON +S(@Y, @Z)</pre>
<pre> Q += AggSum([@Y], R(A,@Y) * {A})</pre>
But recall that for any AGCA query Q, ∂Q is just an ordinary, simple, unexceptional AGCA query (no funny business is introduced by the ∂).  If we can store ('materialize' to use the technical term) the results of Q, what's to stop us from storing the results of ∂Q?  Nothing!
Let's say we had another view materialized (let's call it M_S), this time with a group by variable:
<pre style="margin: 8px;">M_S[Y] := AggSum([Y], R(A, Y) * {A})</pre>
For the initial dataset above, this would contain
<pre><span style="text-decoration: underline;">___Y______#__</span></pre>
<pre> &lt; 1 &gt; -&gt; 1</pre>
<pre> &lt; 2 &gt; -&gt; 3</pre>
This view can help us substantially when we need to update Q after an insertion into S.  Expressing this update as a trigger:
<pre>ON +S(@Y, @Z)</pre>
<pre>  Q += M_S[@Y]</pre>
In other words, to update Q, we just need to look up one row of M_S.  The update can be done in constant time!  That said, we now have an extra view that we need to maintain.  Fortunately, M_S is simpler than Q, and has only one table, in this case R.  Whenever R changes, we need to update M_S.  Since M_S is defined in terms of a normal, ordinary AGCA expression, we update it in exactly the same way that we update Q, using the delta of M_S.  For an insertion of &lt;@X,@Y&gt; into R, this would be:
<pre>∂M_S[Y] := ∂(AggSum([Y], R(A,Y) * {A}))</pre>
<pre> := AggSum([@X, @Y, Y], (A ^= {@X}) * (Y ^= {@Y}) * {A})</pre>
<pre> := (Y ^= {@Y}) * {@X}</pre>
Or, expressed as a trigger
<pre>ON +R(@X, @Y)</pre>
<pre> M_S[Y] += (Y ^= {@Y}) * {@X}</pre>
This can be simplified a bit, since the update operation only produces one row
<pre>ON +R(@X, @Y)</pre>
<pre> M_S[@Y] += {@X}</pre>
(also a constant time operation)
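To see the whole machine in one place, here's an invented Python sketch (a toy, not DBToaster's generated code) of the naive viewlet transform for our running query: Q and the two auxiliary views needed for insertions are all stored, and every trigger is a constant-time lookup-and-add (deletion triggers would mirror these with negated updates).
<pre>from collections import defaultdict

# Q = SELECT SUM(A) FROM R, S WHERE R.B = S.B, maintained by triggers.
Q   = 0                     # the materialized query result
M_S = defaultdict(int)      # AggSum([B], R(A,B) * {A})  -- used when S changes
M_R = defaultdict(int)      # AggSum([B], S(B,C))        -- used when R changes

def on_insert_r(x, y):      # ON +R(@X, @Y)
    global Q
    Q += x * M_R[y]         # first-order delta of Q, read straight out of a view
    M_S[y] += x             # delta of M_S: a constant-time update

def on_insert_s(y, z):      # ON +S(@Y, @Z)
    global Q
    Q += M_S[y]
    M_R[y] += 1

# Replay the running example: R = {(1,1), (1,2), (2,2)}, S = {(1,1) twice, (2,2)}.
for a, b in [(1, 1), (1, 2), (2, 2)]: on_insert_r(a, b)
for b, c in [(1, 1), (1, 1), (2, 2)]: on_insert_s(b, c)
print(Q)          # 5
on_insert_s(2, 1)
print(Q)          # 8</pre>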
<span class="Apple-style-span" style="font-size: 14px; font-weight: bold;">Recursive Deltas</span>
What's happening here is that we're saving the results of Q and maintaining them with (several instantiations of) ∂Q.  The key idea of the viewlet transform is that we can also save the results of ∂Q and maintain them with (several instantiations of) ∂(∂Q).  This process repeats recursively, giving us ∂(∂(∂Q)), ∂(∂(∂(∂Q))), and so on.
Every time we add another ∂, another table drops out, making the delta query simpler.  After enough repetitions, we end up with a query that doesn't depend on the database at all (e.g., ∂M_S above).  At this point, we can stop, since the query can be evaluated in constant time.
This is the viewlet transform.  <strong>Start by materializing the original query, and then alternate computing its delta(s), and recursively materializing the delta(s)</strong>.
I've obscured the issue a bit by not subscripting my ∂s; remember that each delta is taken with respect to a particular event.  Q has four deltas: one each for insertion and deletion of both R and S (∂<sub>+R</sub>, ∂<sub>-R</sub>, ∂<sub>+S</sub>, ∂<sub>-S</sub>).  Similarly, M_S has two deltas: one each for insertion and deletion of R.
The viewlet transform of Q produces five views: one for Q, and one for each of the first-tier deltas (the second-tier deltas are all constants).
As it turns out, we can be even more efficient than that!  Furthermore, the procedure I describe above runs into problems with special tables (as a bit of a teaser for this, take a close look at the delta of Q with respect to insertions into R).  This <em>naive</em> viewlet transform is insufficient.  Next week, I'll start discussing some changes to the naive viewlet transform that make it more practical and efficient.


@ -0,0 +1,31 @@
---
title: "The Viewlet Transform (Part 3: Input Variables and Partial Materialization)"
author: Oliver Kennedy
---
<p>For the past few weeks, I've been discussing the viewlet transform.  The key idea of this process is that because the delta transform is closed over AGCA (that is, it doesn't add any funny business to the query), it's possible to materialize and incrementally maintain the deltas of a query just as easily as we can maintain the original query.  Because the deltas of a query are materialized as their own views, practically no processing is required to incrementally maintain the query; we just read from the delta view and update the original query accordingly.</p>
<p>This process continues recursively.  The delta queries each have their own deltas -- we'll call these second-order deltas of the original query. The second-order deltas, of course, each have third-order deltas, and so forth.  This continues, building up a hierarchy of auxiliary views, each used to efficiently maintain their parents in the hierarchy.  Even though this hierarchy has many views in it, it's still typically possible to maintain them efficiently.</p>
<p>Of course, there are always corner cases.  This week, I'm going to discuss one of them: Input Variables.  Let's have a look at a relatively straightforward query:</p>
<pre>Q := AggSum([], R(A,B) * S(C,D) * {A &lt; C})</pre>
<p>Or in SQL</p>
<pre>SELECT COUNT(*) AS Q FROM R, S WHERE R.A &lt; S.C</pre>
<p>Innocuous as this query is, when we take its delta, we run into a problem.  Let's take the delta with respect to R (the delta with respect to S is nearly the same).  </p>
<pre>dR(&lt;X,Y&gt;) Q := AggSum([], S(C,D) * {X &lt; C})</pre>
<p>In other words, whenever we insert a tuple (row) into R, we need to run the above query, substituting X and Y with values from the tuple being inserted into R.  <span style="font-family: Courier; font-size: 12px;">dR Q</span> is certainly simpler, but there's a problem.  Recall that <em>special tables</em> (like {A &lt; C} or {X &lt; C}) have an infinite number of rows.  This was fine in the original query Q, because the query had terms that limited the number of distinct values of A and C that we were interested in.  The special table might have had an infinite number of rows, but the overall query did not.  </p>
<p>In the delta query however, there's no term limiting the number of distinct values of X that we're interested in.  It's ok when we actually evaluate the delta query, because by then we have a value for X (from the tuple being inserted into R), but until we get that value we can't actually evaluate the expression.  In other words, we can't store the results of the query ahead of time.  </p>
<p>There's actually a term for this in programming languages and query processing: X is known as an <strong>unbound</strong> or <strong>unsafe</strong> variable (or sometimes a variable that is not range-restricted).  AGCA calls it an <strong>input variable</strong> (or parameter) of <span style="font-family: Courier; font-size: 12px;">dR Q</span>.  We don't know what it is, so we can't evaluate the expression.  </p>
<p>There are a few things we can do in this situation.  If you're particularly familiar with query processing techniques, you might look at this query and say "But wait, we can actually materialize this using a range tree (or similar index structure)."  And you'd be right.  Of course, then I could give you a more complex query (e.g., replace the inequality with an arbitrary black-box function f(A, C)) and we'd be right back where we started.  For now, let's assume that it's simply impossible to materialize the entire expression in one go.  </p>
<p>So what else is left?  Well, if we can't materialize the entire thing, then what about materializing it in bits?  We can create one or more views that <strong>can</strong> be stored efficiently, and then do some (but not all) of the heavy lifting afterwards, once we actually know what X is.</p>
<p>Let's make this a bit clearer and procedural.  We have a query</p>
<pre>AggSum([], S(C,D) * {X &lt; C})</pre>
<p>Now I'm going to introduce an extra little bit of syntax into AGCA.  We'll call it the materialization operator M.  Everything in the materialization operator is going to get materialized.  Everything outside of the materialization operator is going to be evaluated when the query results need to be accessed.  An AGCA query with a materialization operator in it is called a <strong>materialization decision</strong>.</p>
<p>We arrive at a final materialization decision by starting with the default (naive) decision where we materialize everything.</p>
<pre>M(AggSum([], S(C,D) * {X &lt; C}))</pre>
<p>… and then iteratively refining the decision until we arrive at a satisfactory one.  As I've been saying, a materialization decision with an input variable is not valid, so we need to rewrite it.  Input variables only appear in these special tables (the ones in the curly braces), so the basic idea is actually pretty easy.  We'll start by pushing the materialization operator inside the AggSum:</p>
<pre>AggSum([], M(S(C, D) * {X &lt; C}))</pre>
<p>Now we're looking at a materialization decision applied to a product.  We can split the materialization operator across the elements of the product so that only the parts without input variables get materialized.</p>
<pre>AggSum([], M(S(C,D)) * {X &lt; C})</pre>
<p>And we're there.  This materialization decision is valid, but not quite as efficient as it could be.  </p>
<p>Specifically, look at what we're storing: S(C,D).  We care about the individual values of C (because they get applied to the predicate {X &lt; C}), but D is never used, and will actually get aggregated away.  We can save ourselves a little trouble when we need to evaluate the delta query by storing only an aggregated value.  In other words, we can tack on an extra AggSum.</p>
<pre>AggSum([], M(AggSum([C], S(C,D))) * {X &lt; C})</pre>
<p>Note that C is a group-by variable of this AggSum because it is needed by the predicate.</p>
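<p>To make the final decision concrete, here's a small sketch of how the materialized view and the leftover computation fit together.  This is my own Python, not DBToaster output, and the names M, on_insert_S, and eval_delta are invented for the illustration.</p>
<pre>from collections import defaultdict

M = defaultdict(int)               # M[c] = AggSum([C], S(C,D)): S-tuples per C

def on_insert_S(c, d):
    M[c] += 1                      # D is aggregated away; only C is kept

def eval_delta(x):
    # the unmaterialized part, AggSum([], M * {X &lt; C}), run once X is known
    return sum(count for c, count in M.items() if x &lt; c)

on_insert_S(1, 'p'); on_insert_S(3, 'q'); on_insert_S(5, 'r')
print(eval_delta(2))               # 2: the entries with C = 3 and C = 5 qualify</pre>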
<p>Alright.  Hopefully that gives you a bit of the flavor of rewriting queries to support input variables.  The details of this process are actually quite messy, but I'll see if I can cover them in detail next week.  For the impatient, our VLDB paper "DBToaster: Higher-Order Delta Processing for Dynamic, Frequently Fresh Views" gives a reasonable overview of the process.</p>

View File

@ -0,0 +1,38 @@
---
title: "The Viewlet Transform (Part 4: Input Variables and Partial Materialization Continued)"
author: Oliver Kennedy
---
<p>Last week, I talked about a variation on the core viewlet transform idea.  The delta operation introduces input variables into a query, and a query with input variables cannot be materialized directly.  Often, these input variables can be eliminated through variable unification (something I'll start getting into in a week or two), but not always.  In these cases, it is necessary to materialize the delta query in parts.  </p>
<p>We do this by splitting a query Q into two (or more) parts: Qmain and Q1, Q2, ….  We materialize Q1, Q2, …, and then whenever we need to evaluate Q, we compute Qmain(Q1, Q2, …).  We express this partitioning with a special materialization operator M, and recur through the query expression to find the exact bits we can materialize.</p>
<p>Before I actually get into the partial materialization process, let me quickly introduce four bits of nomenclature regarding queries (that I'll define more thoroughly next week): <strong>inputs</strong>, <strong>outputs</strong>, <strong>scope</strong>, and <strong>schema</strong>.  The inputs and outputs of an expression are the unbound and bound (respectively) variables appearing in the expression.  The scope of an expression is the set of variables that are bound when the expression is evaluated.  The schema of an expression is the set of variables that an expression is expected to bind.  Note that while the inputs and outputs of an expression are uniquely identified by the expression, scope and schema are contextual, and can change depending on how the expression is evaluated.  Also note that any inputs <strong>must</strong> be in the scope, and that anything in the schema <strong>must</strong> be in the outputs of a query.</p>
<p>That said, we eliminate input variables through a recursive process that starts with a fully materialized query</p>
<pre>M(Q)</pre>
<p>and recursively descends into the expression using the following rules, until every expression chosen to be materialized (the materialization decision) has no inputs.</p>
<p> </p>
<h3>Materializing AggSums</h3>
<pre>M(AggSum([…], Q))</pre>
<p>If we have an AggSum that we need to materialize, one of two things can happen.  If the AggSum has no inputs, we're done.  If the AggSum has inputs, then we need to recur, and push the materialization operator inside the AggSum. </p>
<pre>AggSum([…], M(Q))</pre>
<p>As I mentioned last week, we can actually do better.  As we push the materialization operator down into the AggSum, we keep track of the schema it expects: the schema of the query nested inside the AggSum (Q, that is) must be identical to the AggSum's group-by variables.  When we finally settle on a location for the materialization operator, we compare its schema to its outputs.  If there are more outputs than the schema calls for, we add an additional AggSum to trim the unnecessary outputs away.  </p>
<h3>Materializing Relations, Value Expressions, and Comparison Predicates</h3>
<pre>M(R(A, B, …))</pre>
<pre>M({f(A, B, …)})</pre>
<pre>M({f(A, B, …) θ g(A, B, …)})</pre>
<p>Relations never have inputs and are always materialized.  Value expressions and comparisons consist entirely of inputs, and are never materialized on their own (i.e., unless their variables are bound by a relation).</p>
<h3>Materializing Unions and Joins</h3>
<pre>M(A + B + …)</pre>
<pre>M(A * B * …)</pre>
<p>As with AggSums, unions or joins with no inputs are always materialized in their entirety.  Otherwise, for both unions and joins, we first partition the expression into a subset with no inputs (A, B, …) and one or more subsets with inputs (C, D, E, …).  We materialize the input-free bit as is, and recursively descend into the remaining components.</p>
<pre>M(A + B + …) + M(C) + M(D) + …</pre>
<pre>M(A * B * …) * M(C) * M(D) * …</pre>
<p>There are a few caveats for materializing joins.  Specifically, there can be ordering constraints (which I'll get into next week) over the terms of a join (they don't quite commute).  It may sometimes be necessary to partition the expression into multiple subsets for materialization.  For example, if there is a term (let's call it C) that must occur to the right of (A*B) and to the left of (D*E), then we would choose to materialize it as</p>
<pre>M(A * B) * M(C) * M(D * E)</pre>
<h3>Materializing Lifts</h3>
<pre>M(A ^= {B})</pre>
<p>A lift is tricky in that it involves both input and output variables.  Typically, the lift will get unified away (again, something I'll talk about in a few weeks).  Other times, it may be possible to include the lift in an input-free query if another relation binds the inputs of the lift (in which case it'll get caught by the Union/Join case).  Failing both of those, if we've gotten to this point, the best we can do is to not materialize anything.</p>
<p>When I get into nested subqueries, and we start using lifts for more complex things, this rule will need to be changed.</p>
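<p>To tie the cases together, here's a rough sketch of the recursion in Python.  It's my own simplification over a hypothetical tuple encoding: it returns only the list of subexpressions chosen for materialization (not the rewritten Qmain), it omits the output-trimming AggSum, and it ignores the join-ordering caveats.</p>
<pre># Expressions are tuples:
#   ('rel', name, vars)  ('val', vars)  ('cmp', vars)  ('lift', var, vars)
#   ('sum', [parts])     ('prod', [parts])             ('aggsum', gb_vars, child)

def outputs(e):
    kind = e[0]
    if kind == 'rel':
        return set(e[2])
    if kind == 'lift':
        return {e[1]}
    if kind == 'aggsum':
        return set(e[1])
    if kind in ('sum', 'prod'):
        out = set()
        for part in e[1]:
            out |= outputs(part)
        return out
    return set()                          # values and comparisons bind nothing

def inputs(e):
    kind = e[0]
    if kind == 'rel':
        return set()                      # relations never have inputs
    if kind in ('val', 'cmp'):
        return set(e[1])                  # ... and these are nothing but inputs
    if kind == 'lift':
        return set(e[2])
    if kind == 'aggsum':
        return inputs(e[2])
    bound, ins = set(), set()             # sums/products: left-to-right binding
    for part in e[1]:
        ins |= inputs(part) - bound
        bound |= outputs(part)
    return ins

def materialize(e):
    """Return the list of subexpressions chosen for materialization."""
    if not inputs(e):
        return [e]                        # input-free: materialize whole
    kind = e[0]
    if kind == 'aggsum':
        return materialize(e[2])          # push M inside the AggSum
    if kind in ('sum', 'prod'):
        free = [p for p in e[1] if not inputs(p)]
        rest = [p for p in e[1] if inputs(p)]
        decision = [(kind, free)] if free else []
        for p in rest:
            decision += materialize(p)
        return decision
    return []                             # values, comparisons, lone lifts stay unmaterialized

# dR Q from two weeks ago: AggSum([], S(C,D) * {X &lt; C})
dR_Q = ('aggsum', [], ('prod', [('rel', 'S', ['C', 'D']), ('cmp', ['X', 'C'])]))
print(materialize(dR_Q))                  # [('prod', [('rel', 'S', ['C', 'D'])])]</pre>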
<p> </p>
<p>Alright.  I know this week was short (and a bit cheap), but the CIDR deadline's this weekend.  Next week, I'll get back to some more interesting stuff, with optimization techniques for AGCA.  Until then, cheers.</p>

View File

@ -0,0 +1,55 @@
---
title: "Optimizing AGCA (Part 1: Ringing in the optimizations)"
author: Oliver Kennedy
---
<p>Although AGCA is designed primarily for incremental query evaluation, it is a fully fledged query language (albeit only for non-aggregate queries and certain kinds of aggregates).  As such, it's useful to have a strategy for optimizing arbitrary query expressions.  As it turns out, optimization is relevant, even in the incremental case, as it can often produce simpler expressions that are easier to incrementally maintain.  Over the next few weeks, I'll discuss several techniques that we've developed for optimizing, simplifying, and generally reducing the cost of evaluating AGCA queries.</p>
<p>But before I get into any of that, let me quickly bring up one point that I've been glossing over, mostly as a point of convenience.  </p>
<p><strong>By default, AGCA expressions are evaluated as the English language is read: Left to Right</strong></p>
<p>This has two consequences.  First, ordering has an impact on query evaluation performance.  We'll be returning to that before long.  Second, and more important for now, information flows left-to-right in AGCA as well.  Specifically, consider the following expression: </p>
<pre>R(A) * {A}</pre>
<p>or in SQL</p>
<pre>SELECT SUM(R.A) FROM R</pre>
<p>In the SQL query, you can think of information as flowing from the R table to the SUM operator.  This notion of information flow is pretty common in programming languages, and AGCA incorporates it as well.  I mentioned the idea of binding patterns when I first introduced the special tables used for value expressions (i.e., {A}).  The term R(A) binds the A variable, which is then used by the term {A}.  In short, information is flowing through the product operation from R(A) to {A}.  In AGCA, this information flow is <strong>always left to right</strong>.  This is more than just a matter of convenience.  It makes it possible to identify binding patterns in a single scan of the query, rather than an exponential search, which in turn makes many of the optimizations that I will discuss tractable.  </p>
<p>Unfortunately, this also has the side effect of making certain expressions (sometimes) invalid.  For example, the expression</p>
<pre>{A} * R(A)</pre>
<p>could not be evaluated in isolation.  However, the expression</p>
<pre>S(A) * {A} * R(A)</pre>
<p>is perfectly valid.  In other words, AGCA's product operation is not generally commutative (A * B ≠ B * A).  Many terms do commute (e.g., R(A) and S(A)), but commutativity is not always possible.  Worse still, the commutativity of two terms cannot, in general, be determined locally.  As in the above example, {A} and R(A) in isolation do not commute.  However, if the variable A is bound outside of those two terms, then the two do commute.  That said, commutativity can be determined locally if the scope in which an expression is evaluated is known (and this information is typically available during optimization).  </p>
<p>From now on, I'll assume that commutativity can be easily determined.  </p>
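<p>For the curious, here's one way such a local check might look, given the inputs and outputs of two adjacent terms and the scope they're evaluated in.  This is a sketch of my own, not the actual DBToaster implementation.</p>
<pre>def commutes(inputs_a, outputs_a, inputs_b, outputs_b, scope=frozenset()):
    # Two adjacent terms may be swapped if neither one needs a variable that
    # only its neighbour binds (relative to the scope already available).
    needs_a = set(inputs_a) - set(scope)
    needs_b = set(inputs_b) - set(scope)
    return not (needs_a & set(outputs_b)) and not (needs_b & set(outputs_a))

# {A} * R(A): {A} needs A, and R(A) is what binds it, so no...
print(commutes({'A'}, set(), set(), {'A'}))                 # False
# ...unless A is bound further to the left, as in S(A) * {A} * R(A)
print(commutes({'A'}, set(), set(), {'A'}, scope={'A'}))    # True</pre>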
<p>As I present each of these rules, I'll briefly discuss the core ideas of each optimization, comment on how the rule interacts with both incremental and batch query evaluation, and then summarize the rules as a set of transformations over AGCA expressions.  </p>
<h3>Pre-evaluation</h3>
<p>A relatively straightforward, practically braindead optimization that appears in nearly all compilers is constant folding.  If an expression such as 1+2 is fed into the compiler, the compiler silently turns that into a constant 3.  DBToaster does this, but there are several nuances that arise in the DBToaster case.  </p>
<p>Firstly, as I've mentioned before, AGCA is a ring.  Thus, the constants 1 and 0 have some useful properties.  When an expression is multiplied by 1 or added to 0, the result is unchanged, so we can eliminate any appearance of {1} in a product, or {0} in a sum.  Furthermore, whenever {0} appears in a product, the entire product term can be replaced by {0}.  Although it might seem unlikely that such a query would ever arise, {0} terms actually appear quite often in the incremental processing case.  The delta rule produces a {0} for a large number of different terms, and queries with nested subqueries (which I'll get to one of these days) often produce queries with a ({1} + {-1}) term.</p>
<p>A second nuance is value expressions themselves.  Practically speaking, the following two expressions are identical</p>
<pre>{A}+{2} == {A+2}</pre>
<p>For a number of reasons, the latter form is considerably more efficient most of the time.  I'll get into the details next week when I talk about a materialization optimization called Hypergraph Factorization, but essentially, we (almost) always want to put value terms together.</p>
<p><span style="text-decoration: underline;">The Rules</span></p>
<p>{0} + Q =&gt; Q</p>
<p>{0} * Q =&gt; {0}</p>
<p>{1} * Q =&gt; Q</p>
<p>{X} + {Y} =&gt; {X + Y} (see caveat for Hypergraph Factorization)</p>
<p>{X} * {Y} =&gt; {X * Y} (see caveat for Hypergraph Factorization)</p>
<p>{f(X, Y, Z, …)} =&gt; {eval(f(X, Y, Z, …))}</p>
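<p>As a toy illustration of these rules, here's a sketch in Python over a deliberately simplified encoding: a value term {c} is just a number, any other term is an opaque string, and sums and products are tagged lists.  This is my own code, not DBToaster's representation.</p>
<pre>def pre_evaluate(e):
    if isinstance(e, (int, float, str)):
        return e                      # constants and opaque terms pass through
    op, parts = e
    parts = [pre_evaluate(p) for p in parts]
    consts = [p for p in parts if isinstance(p, (int, float))]
    rest = [p for p in parts if not isinstance(p, (int, float))]
    if op == 'sum':
        c = sum(consts)               # {X} + {Y} =&gt; {X + Y}
        if not rest:
            return c
        return ('sum', rest + ([c] if c != 0 else []))     # {0} + Q =&gt; Q
    c = 1
    for k in consts:                  # {X} * {Y} =&gt; {X * Y}
        c *= k
    if c == 0:
        return 0                      # {0} * Q =&gt; {0}
    if not rest:
        return c
    return ('prod', ([c] if c != 1 else []) + rest)        # {1} * Q =&gt; Q

# the ({1} + {-1}) pattern produced by nested subqueries collapses the product:
print(pre_evaluate(('prod', [('sum', [1, -1]), 'Q'])))     # 0</pre>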
<h3>Polynomial Factorization</h3>
<p>One additional feature of rings is that the product operation distributes over union.  In other words</p>
<pre>A * (B + C) &lt;=&gt; (A * B) + (A * C)</pre>
<p>This operation goes both ways, and we can take advantage of that to reduce our workload. Whenever we encounter an expression like (A * B) + (A * C), we can rewrite it as A * (B + C), and save ourselves a re-evaluation of A.  This is particularly important for incremental processing, as the delta operation frequently produces expressions of this form (and better still, B and C are typically value terms that can be further optimized by pre-evaluation).</p>
<p>This optimization bears similarity both to a programming language technique called common subexpression elimination and to traditional arithmetic factorization (hence the name).  That said, there's a nuance in factorizing AGCA expressions that doesn't arise in arithmetic factorization: commutativity.  Let's say we have an expression of the form</p>
<pre>(A * B) + (C * A)</pre>
<p>If this were an arithmetic polynomial, clearly we could factorize out the A.  Unfortunately, in AGCA, terms don't always commute.  This leaves us with two possibilities: Either we can commute the A with the C and produce the following factorized expression:</p>
<pre>A * (B + C)</pre>
<p>Or we can commute the A with the B and produce the following factorized expression:</p>
<pre>(B + C) * A</pre>
<p>Note that in the latter case, the A appears after the factorized terms in the expression.</p>
<p>In general, factorization is a hard problem.  In any given polynomial, there might be several terms that could be factorized out of the expression.  For example</p>
<pre>(A * B) + (A * C) + (D * B)</pre>
<p>This expression can be factorized into either of the two following expressions</p>
<p>(A * (B + C)) + (D * B)</p>
<p>((A + D) * B) + (A * C)</p>
<p>It's not always clear which of these will be more efficient to evaluate.  A simple heuristic is to factorize out terms that are guaranteed to involve a cost (e.g., table terms), but a cost-based optimizer is typically the most effective.   We'll get to that in a few weeks.</p>
<p><span style="text-decoration: underline;">The Rules</span></p>
<p>(… * A * …) + (… * A * …) + … =&gt; A * ((… * {1} * …) + (… * {1} * …) + …) (if A commutes to the head of each term)</p>
<p>(… * A * …) + (… * A * …) + … =&gt; ((… * {1} * …) + (… * {1} * …) + …) * A (if A commutes to the tail of each term)</p>
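<p>Here's a toy sketch of the simplest version of this rule in Python.  It's my own code, purely illustrative: it only factors out a term that appears in every summand, and it optimistically assumes that the term commutes to the head, glossing over the cost-based choices above.</p>
<pre>def factor_common_term(products):
    # products: a union represented as a list of products (lists of term names)
    common = set(products[0]).intersection(*map(set, products[1:]))
    if not common:
        return products                   # nothing shared by every summand
    a = sorted(common)[0]                 # pick one shared term (no cost model)
    remainders = []
    for p in products:
        q = list(p)
        q.remove(a)                       # the removed slot becomes {1}
        remainders.append(q if q else ['{1}'])
    return [a, ('sum', remainders)]       # i.e.  A * ((...) + (...) + ...)

print(factor_common_term([['A', 'B'], ['A', 'C']]))
# ['A', ('sum', [['B'], ['C']])]  --  A * (B + C)
print(factor_common_term([['A', 'B'], ['A', 'C'], ['D', 'B']]))
# unchanged: no single term is shared by all three summands</pre>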
<p> </p>
<p>That's it for this week.  Next week, I'll jump back to the optimized viewlet transform with Hypergraph Factorization.</p>

View File

@ -0,0 +1,43 @@
---
title: "The Viewlet Transform (Part 5: Hypergraph Partitioning)"
author: Oliver Kennedy
---
<p>I've been talking for several weeks now about tools and techniques related to AGCA and the viewlet transform.  Most recently, I've been talking about optimization techniques for AGCA, but this week I'm going to take a quick detour and provide an overview of another technique: Hypergraph Partitioning.  In general, this technique is most suited to optimizing the materialization process, but there are applications to the optimization of aggregate computations as well.</p>
<h3>The Query Hypergraph</h3>
<p>Before I get into the technique though, we need to discuss an alternate representation of AGCA expressions (one that's actually used pretty frequently in query optimization): the query <a href="http://en.wikipedia.org/wiki/Hypergraph">hypergraph</a> (basically a graph where an edge can connect any number of nodes).  This kind of hypergraph can be created for any product of terms (in the trivial case, we have a product of just one term).  Each node in the hypergraph is a variable/column of the query (both output and input variables are treated identically for this purpose).  Each hyperedge corresponds to one term in the product, and connects all of the variables that appear in that term (regardless of whether they appear as inputs or outputs).  </p>
<h3>Hypergraph Partitioning</h3>
<p> </p>
<p>Remember that the product operator corresponds to the natural join (and that comparisons are implemented as relations).  As a consequence, any disconnected components in the graph effectively correspond to cross products (a natural join with no shared columns).  Consider the following trivial example.</p>
<pre>R(A) * S(B)</pre>
<p>R(A) is a hyperedge touching only A.  S(B) is a hyperedge touching only B.  Thus A and B are separate disconnected components.  Note, by the way, that there are no comparisons between A and B in this query.  This product is a pure cartesian cross-product.  The following query would not be:</p>
<pre>R(A) * S(B) * {A &lt; B}</pre>
<p>In this query, the term { A &lt; B } connects both A and B.  </p>
<p>Now, if we have disconnected components, it typically pays to materialize them separately.  For example, going with R and S above, we could materialize them as </p>
<pre>M( R(A) * S(B) )</pre>
<p>But now we have to store |R| * |S| entries (where |R| is the number of tuples in R).  Worse, if we need to update the materialized view, it will cost us |S| after an update to R, and |R| after an update to S.  On the other hand, we could materialize as</p>
<pre>M(R(A)) * M(S(B))</pre>
<p>Now we only store |R| + |S| tuples (between the two materialized views), and updating either can be done in constant time.  Better still, we lose nothing with this representation.  It costs us O(|R|*|S|) to iterate over every element of either materialization of the expression.</p>
<p>You might say that this is a crazy corner case -- people almost never compute cross products.  That's usually true, but in DBToaster, this situation crops up quite frequently.  For example, consider the three way join query:</p>
<pre>R(A) * S(A,B) * T(B)</pre>
<p>The (optimized) delta of this query with respect to +S(dA, dB) is</p>
<pre>R(dA) * T(dB)</pre>
<p>Because taking a delta essentially removes a hyperedge from the query hypergraph, disconnected components arise extremely frequently.</p>
<p> </p>
<h3>Partitioning and Trigger Parameters</h3>
<p>There's also one more situation where this is beneficial.  Consider the following query.</p>
<pre>R(A) * S(A) * T(A)</pre>
<p>And its delta with respect to the insertion +S(dA)</p>
<pre>R(dA) * T(dA)</pre>
<p>Even though dA is touched by both R and T, we lose nothing if we materialize them separately (as before, evaluation is O(1) either way), and materializing them separately results in more efficient maintenance.  In this case, dA is a trigger parameter -- one of the variables drawn from the relation being modified.  These trigger parameter variables can be excluded from the query hypergraph.</p>
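<p>Here's a short sketch (my own code, with an invented term encoding) of the partitioning itself: each term contributes a hyperedge over its variables, trigger parameters are dropped first, and terms whose variable sets touch, directly or transitively, end up in the same component.</p>
<pre>def partition(terms, trigger_params=()):
    """terms: list of (name, variables).  Returns one list of names per component."""
    comps = []                              # each component: (variables, [term names])
    for name, vars_ in terms:
        vars_ = set(vars_) - set(trigger_params)
        touching = [c for c in comps if c[0] & vars_]
        merged_vars, merged_names = set(vars_), [name]
        for c in touching:
            merged_vars |= c[0]
            merged_names = c[1] + merged_names
            comps.remove(c)
        comps.append((merged_vars, merged_names))
    return [names for _, names in comps]

# the delta of R(A) * S(A,B) * T(B) w.r.t. +S(dA,dB):  R(dA) * T(dB)
print(partition([('R', ['dA']), ('T', ['dB'])], trigger_params=['dA', 'dB']))
# [['R'], ['T']]  --  two components, best materialized separately

print(partition([('R', ['A']), ('S', ['B']), ('ineq', ['A', 'B'])]))
# one component: the comparison term connects A and B</pre>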
<p> </p>
<h3>Applications to Query Optimization</h3>
<p>In general, when computing aggregates, hypergraph partitioning can be used to select a more efficient computation order.  Each materialized component can be scanned and aggregated independently, and the per-component results then combined into the final aggregate.</p>
<p> </p>
<p>And that's about it for now.  Next week, we return to AGCA optimization with a discussion of the interplay between equality and lifts, and how to optimize expressions of this form.</p>

View File

@ -0,0 +1,44 @@
---
title: "Optimizing AGCA (Part 2: Lifting Equalities)"
author: Oliver Kennedy
---
<p>This week, I'm going to turn back to the optimization of AGCA expressions, and in particular to a pair of optimizations that combine to substantially simplify them: Lifting Equalities and Equality Unification.  </p>
<p>Recall the four sets of variables that we work with when evaluating any AGCA expression:</p>
<ul>
<li>Scope variables are variables that are bound (assigned values) by the time the AGCA expression is evaluated (either earlier in the expression, or outside of it).</li>
<li>Schema variables are variables that something outside of the expression being evaluated expects to be bound by this expression.</li>
<li>Input variables are variables that are not bound in the expression we're evaluating (every input variable must be in the scope when the expression is evaluated, but not every variable in the scope must be an input variable)</li>
<li>Output variables are variables that are bound in the expression we're evaluating (when evaluating the expression, every schema variable must be an output variable; if an output variable is in the scope, the expression is treated as a lookup or join)</li>
</ul>
<p>Now, note that because any output variable may be in the scope when an expression is evaluated, the following three expressions are more or less equivalent.  All three have the same input and output variables, and react identically to scope/schema changes from the outside.</p>
<pre>R(A,B) * {A = B}</pre>
<pre>R(A,B) * (A ^= {B})</pre>
<pre>R(A,B) * (B ^= {A})</pre>
<p>Note, by the way, that this is only possible due to the R(A,B).  In the following lift operation, B is an input variable and A is an output variable.</p>
<pre>(A ^= {B})</pre>
<p>If we were to look at only the lift/comparison operation (without anything that binds both A and B), then A and B would be input variables for the equality comparison, and one of them would be an output variable in either lift.  In other words, the only difference between these three is which variables are bound and when.  </p>
<p>Now, in general (and in one specific way that you'll see momentarily), output variables are good.  We like output variables, so when it comes to equality predicates, we want to transform them to lifts whenever possible.</p>
<p>Let's look at an example:</p>
<pre>R(A) * S(B) * {A = B}</pre>
<p>This expression is a simple, straightforward equi-join, but is somewhat inefficient.  For every row of R, we'll loop over every row of S, and then pick out only the pairs of A x B where the two variables are identical.  In other words, this is effectively a nested loop join.  Now, consider the following (equivalent) expression:</p>
<pre>R(A) * (B ^= {A}) * S(B)</pre>
<p>We've gotten rid of the nested loop.  Now, every row in R is extended with a new column B carrying the same value as the A column, and we do a single lookup into S on that value of B.  This is effectively a hash join (assuming we have a hash index already built over S).  </p>
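<p>Here's a quick sketch, in plain Python of my own, of the difference between the two evaluation strategies:</p>
<pre>from collections import defaultdict

R = [1, 2, 3]
S = [2, 2, 3, 4]
S_index = defaultdict(int)          # a pre-built hash index over S
for b in S:
    S_index[b] += 1

# R(A) * S(B) * {A = B}: loop over R x S, then filter -- a nested loop join
nested_loop = sum(1 for a in R for b in S if a == b)

# R(A) * (B ^= {A}) * S(B): extend each R-row with B := A, then one index lookup
lifted = sum(S_index[a] for a in R)

print(nested_loop, lifted)          # 3 3 -- same answer, very different work</pre>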
<p> </p>
<h3>Equality Lifting</h3>
<p> </p>
<p>So how do we generalize from this example?  Let's start with products of simple (relation, comparison, arithmetic expression, and lift) terms.  Our goal is to get an expression Q of this type into the form </p>
<pre>Q := X * Y * {A = B}</pre>
<p>Where X and Y are both individual expressions with the following properties (this example works just as well if you swap A and B):</p>
<ul>
<li>A is bound in X, and may also be bound in Y</li>
<li>B is bound in Y but not X</li>
<li>If possible (and for reasons that will become apparent next week), B should not be in the schema with which we evaluate Q.</li>
</ul>
<p>We can replace this equality constraint with a lift in either direction:  (A ^= {B}) or (B ^= {A}), but given the first two constraints, it makes the most sense to replace it with (B ^= {A}).  That way, we can commute the lift all the way to the left and get</p>
<pre>Q := X * (B ^= {A}) * Y</pre>
<p>As it turns out, working with simple products is not all that restrictive.  We can treat all the remaining operators as if they were simple terms, and use a handful of other transformations (that I'll get to in two weeks) to deal with the nesting structures inside them (e.g., each term in a sum, or the expression being aggregated).  In other words, this is pretty much the algorithm.  Partition (if possible) each expression into two independent subexpressions that each bind one of the variables on either side of the equality, and then substitute the equality with the relevant lift term.</p>
<p>One other note: When dealing with an equality comparison with a more complicated expression in it, you might have additional restrictions on what you can lift.  For example, you might have:</p>
<pre>R(A) * S(B,C) * {A = B * C}</pre>
<p>In which case, you could only substitute in (A ^= {B * C}).  Fortunately, in this case, we can also commute the earlier terms in the expression into an amenable form:</p>
<pre>S(B, C) * (A ^= {B * C}) * R(A)</pre>
<p>Alrighty.  Next week, we cover an optimization designed to interact with equality lifting: Unification.</p>

View File

@ -0,0 +1,39 @@
---
title: "Optimizing AGCA (Part 3: Unification)"
author: Oliver Kennedy
---
<p>Last week we covered equality lifting, the first half of a two-part process for simplifying expressions.  The second part is commonly known in PL circles as Unification.  In some expressions, it's possible to eliminate a lift by inlining the lifted expression in place of the variable it's being lifted into.  </p>
<p>For example, let's say you have the following expression:</p>
<pre>AggSum([], (A ^= B) * A)</pre>
<p>For all practical purposes, that lift doesn't need to be there.  Instead, we can rewrite this expression as</p>
<pre>AggSum([], B)</pre>
<p>Much simpler (and to make things even better, we can get rid of the AggSum too, since the inner expression now has no output variables).</p>
<p>Also, keep in mind that if the expression being lifted has already been fully evaluated (down to a simple numeric value), unification might allow us to do even more evaluation down the line.</p>
<p>Fundamentally, that's all there is to this week's theme.  Take lifts and propagate their values through the expression.  Unfortunately, as with many things, the devil is in the details.  There are a number of situations where unification is simply not possible, and some situations where it's possible, but only with a bit of a hack.  So, let's get to it.  What do you need to be aware of when unifying lifts in AGCA?</p>
<h3>Syntactic Restrictions</h3>
<p>Simple lifts like (A ^= B) can be unified anywhere.  However, as you may have noticed, more complex expressions can appear on the right-hand side of a lift.  For example, in the expression</p>
<pre>AggSum([], (A ^= B+1) * R(A))</pre>
<p>The syntax of AGCA doesn't allow us to write an expression like</p>
<pre>AggSum([], R(B+1))</pre>
<p>Admittedly, this is a somewhat trivial case, but as you'll see when we get to nested subqueries, there's a good reason for this.</p>
<h3>Respecting the Scope and Schema of the Complete Expression</h3>
<p>Recall the definitions of the scope and schema of an expression being evaluated.  The scope is the set of variables that are already bound when the expression is evaluated, and the schema is the set of output variables that we're expecting the expression to bind.  If the variable being lifted into appears in either the scope or the schema, it cannot be unified.  For example, in the expression</p>
<pre>AggSum([], R(A) * ((A ^= B)+(A ^= C)))</pre>
<p>We can't eliminate the lifts, because by the time we get to the two lifts, the variable A is in scope already.  That said, we can do a little bit of rearrangement.  For example, the expression</p>
<pre>AggSum([], ((A ^= B) * R(A)) + ((A ^= C) * R(A)))</pre>
<p>is a legitimate rewriting of the first that can be unified.  I'll get into some of these rewritings next week, but most of them are really quite trivial.  Perhaps more challenging is when the variable is in the schema of an expression.  For example:</p>
<pre>AggSum([A], R(B) * (A ^= B))</pre>
<p>Now, in this case, we're not allowed to unify A away because it's part of the schema (A must appear in the output).  Yet, there's still a possible simplification of this expression:</p>
<pre>AggSum([A], R(A))</pre>
<p>Note, by the way, that simply replacing the lift with an equality and relying on equality lifting to resolve the issue won't work: nothing else binds A, so turning the lift into an equality would leave A unbound even though it must appear in the output.  Instead we need a special case to handle this.  If the lifted expression is a simple variable that's not in the scope, then we have a chance!</p>
<p>We start with the product expression that the lift is a part of.  In this case:</p>
<pre>(R(B) * (A ^= B))</pre>
<p>From here, we backtrack, exactly like we do with equality lifting, until the lifted variable (B) falls out of scope, and then we can attempt to replace all instances of the lifted variable with the variable being lifted into (i.e., replacing all Bs with As).</p>
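<p>Here's a toy sketch of that simple case, over a tuple encoding I've invented for the illustration: it unifies a lift of a bare variable out of a product when the lifted-into variable is in neither the outer scope nor the schema.  The special case just described (renaming B to A instead) isn't handled here.</p>
<pre>def unify(product, scope=frozenset(), schema=frozenset()):
    terms = list(product)
    for i, t in enumerate(terms):
        if t[0] == 'lift' and isinstance(t[2], str):       # (target ^= source)
            target, source = t[1], t[2]
            if target not in scope and target not in schema:
                rest = terms[:i] + terms[i + 1:]
                return [rename(u, target, source) for u in rest]
    return terms

def rename(term, old, new):
    return tuple(new if x == old else x for x in term)

# AggSum([], (A ^= B) * {A})  becomes  AggSum([], {B})
print(unify([('lift', 'A', 'B'), ('val', 'A')]))            # [('val', 'B')]
# AggSum([A], R(B) * (A ^= B)): A is in the schema, so this sketch leaves it alone
print(unify([('rel', 'R', 'B'), ('lift', 'A', 'B')], schema={'A'}))</pre>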
<h3>Respecting the Scope and Schema in which the Complete Expression is Evaluated </h3>
<p>Of course, even with these rewritings, it's possible that neither of these conditions will be satisfied due to external forces.  When an AGCA expression is evaluated, the caller can provide an external scope, or an expected schema.  The most trivial case of this is an expression like </p>
<pre>(A ^= B)</pre>
<p>If this is the entire expression being evaluated, the caller must (through external methods) provide a B.  Similarly, the caller expects to read out a result containing a single column: A.  </p>
<p>Because all of this is dependent on the caller, there's very little that we can do about this inside the AGCA framework.  One technique that we've had success with is to use the standard transformations that I'll discuss next week to propagate all the lifts to the head (or as close to the head as possible) of the expression being evaluated, and then to explicitly pick out all the lift expressions that rename variables appearing in the schema.  </p>
<p>For example, if we were preparing to evaluate the above expression with an externally defined scope containing only B, then we would note the presence of the rewriting (A ^= B) at the head of the expression.  We would eliminate this lift from the expression, and replace every instance of B with A.  Then, when evaluating the expression, we would bind A to the value that we would have previously bound to B.</p>
<p> </p>
<p>And that's it for this week.  Next week, I'll be going over several simple rewrite rules that allow us to minimize the use of AggSum, and other forms of nesting in AGCA.</p>

View File

@ -0,0 +1,40 @@
---
title: "Those Marvelous Lifts and Exists (Part 1)"
author: Oliver Kennedy
---
<p>We've been talking for the past few weeks about optimization of AGCA expressions.  So far, most of our optimizations have made one extremely significant simplifying assumption: They ignore nested expressions.  I was going to talk this week about techniques for un-nesting expressions, but before I get to that, I'm going to cover the two sources of nesting in AGCA expressions that I haven't covered yet: Lift in its full glory, and Exists.</p>
<p>So far I've used Lift as a simple form of assignment.  </p>
<pre>X ^= {Y}</pre>
<p>Computes the value of Y and assigns it to X.  I've used it for more complex expressions too:</p>
<pre>X ^= {2*Y + Z}</pre>
<p>But again, it's a simple arithmetic expression being used in the assignment.  What if we want to do something more complex?  What if we want to express something like</p>
<pre>SELECT SUM(A)</pre>
<pre>FROM R</pre>
<pre>WHERE R.B = (SELECT COUNT(*) FROM S)</pre>
<p>This is an example of a nested aggregate query, and it poses a bit of a problem, both in terms of AGCA, and more generally for incremental computation via delta operations.  Up to this point, whenever we added a value to (or deleted a value from) one relation, we'd need to add (or subtract) something to (from) the result we were trying to compute.  Nested subqueries are different.  Let's have a look with the following example database</p>
<pre><span style="text-decoration: underline;">_R_(_A__B_)____#_</span></pre>
<pre> &lt; 1, 1 &gt; -&gt; 1</pre>
<pre> &lt; 1, 2 &gt; -&gt; 1</pre>
<pre> &lt; 2, 2 &gt; -&gt; 1</pre>
<pre><span style="text-decoration: underline;">_S_(_C_)____#_</span></pre>
<pre> &lt; 1 &gt; -&gt; 1</pre>
<p>Let's evaluate the SQL query.  The COUNT(*) of S is 1, so we find all the rows of R where B = 1 (just the first one), and sum up their A columns for a total result of 1.</p>
<p>Now what happens if we add the tuple &lt;1&gt; to S?  Well, the value of the nested aggregate changes from 1 to 2, so now the query result is based on a completely different set of rows (in this case summing to 3).  The delta isn't just a simple addition; we need to delete the existing value (-1), and then add in an entirely new and unrelated value (+3).  </p>
<p>Put another way, conditionals are different.  They're not straight arithmetic; they actually trigger a different control flow.  This is part of why AGCA restricts conditionals to having only arithmetic expressions in them.  Yet, we still need a way to express these changes in control flow.  Lifts give us an ideal tool for this.  We can express the above query as:</p>
<pre>AggSum([], (B ^= AggSum([], S(C))) * R(A,B) * A)</pre>
<p>Read the first term of this expression as "Compute COUNT(*) of S and assign it to B."  </p>
<p>I originally said that the delta of a (simple) Lift was 0.  This is not true in the general case.  In particular, note that so far we've only been looking at lifts where the value being assigned is computed from a simple arithmetic expression.  As we've already covered, the delta of such an expression is always 0.  But what happens when you lift an expression that has a nonzero delta?  For example, what is the delta (with respect to an insertion into S) of:</p>
<pre>(B ^= AggSum([], S(C)))</pre>
<p>Let's consider this in terms of the example data above.  The initial aggregate value is 1, so the table for this expression would be</p>
<pre><span style="text-decoration: underline;">___B______#_</span></pre>
<pre> &lt; 1 &gt; -&gt; 1</pre>
<p>After we add a tuple to S, the table becomes</p>
<pre><span style="text-decoration: underline;">___B______#_</span></pre>
<pre> &lt; 2 &gt; -&gt; 1</pre>
<p>We can only do arithmetic on the multiplicity column; we can't just add 1 to B (remember, this is supposed to represent a control flow decision).  So... we actually have to delete the old tuple and put in the new one.  In other words, the delta of this expression should be</p>
<pre><span style="text-decoration: underline;">___B_______#_</span></pre>
<pre> &lt; 1 &gt; -&gt; -1</pre>
<pre> &lt; 2 &gt; -&gt; 1</pre>
<p>The full delta rule for lifts reflects this insert/delete pair:</p>
<pre>∂(X ^= A) = (X ^= A + ∂A) - (X ^= A)</pre>
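<p>Here's a tiny numeric sketch (my own code) of what this rule computes for the example above, treating the table of the lift as a map from the lifted value to a multiplicity:</p>
<pre>def delta_lift(old_count, new_count):
    delta = {}
    delta[new_count] = delta.get(new_count, 0) + 1     # (X ^= A + dA)
    delta[old_count] = delta.get(old_count, 0) - 1     # minus (X ^= A)
    return delta

print(delta_lift(1, 2))    # {2: 1, 1: -1}: delete the old row, insert the new one
print(delta_lift(1, 1))    # {1: 0}: when dA is 0, the delta cancels out</pre>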
<p>If you're paying attention, you should notice something horribly wrong with this.  More on that next week.</p>

View File

@ -0,0 +1,36 @@
---
title: "Those Marvelous Lifts and Exists (Part 2)"
author: Oliver Kennedy
---
<p>Last week, we started talking about using the Lift operation to express nested subqueries.  I ended on a bit of a cliffhanger: </p>
<pre>∂(X ^= A) = (X ^= A + ∂A) - (X ^= A)</pre>
<p>There's something horribly wrong with this delta rule.  The expression A appears intact, in its entirety in the delta rule (it actually appears not once, but twice).  The delta of a lift is NOT simpler than the original.  </p>
<p>Admittedly, for simple lifts, this isn't a problem.  In particular, when ∂A = 0, then we get</p>
<pre>∂(X ^= A) = (X ^= A) - (X ^= A) = 0</pre>
<p>Which is, in fact, simpler.  But, once we start putting relation terms into the expression being lifted, we get something nasty.  For example, let's say we wanted to compute the SQL query:</p>
<pre>SELECT COUNT(*) FROM R WHERE (SELECT COUNT(*) FROM S) = R.A;</pre>
<p>This translates to the following AGCA expression.</p>
<pre>AggSum([], R(A) * (X ^= AggSum([], S(B))) * {X = A})</pre>
<p>If we take the delta of this query with respect to the insertion S(1), we get:</p>
<pre>AggSum([], R(A) * ( (X ^= AggSum([], S(B) + (B ^= 1))) - (X ^= AggSum([], S(B))) ) * {X = A})</pre>
<p>Messy... and it really doesn't help us much.  We could materialize this expression, but since the deltas aren't simpler, if we repeat the process recursively, we'll end up with an infinite number of materialized expressions.  Not good.  </p>
<p>We deal with this problem using partial materialization.  First a little reorganization.   The lifts commute with the relation term:</p>
<pre>AggSum([], ( (X ^= AggSum([], S(B) + (B ^= 1))) - (X ^= AggSum([], S(B))) ) * R(A) * {X = A})</pre>
<p>Now, rather than materializing the entire thing, we materialize the lifts separately.  More precisely, rather than materializing the lifts, we materialize the expression being lifted.  </p>
<pre>AggSum([], ( (X ^= M(AggSum([], S(B) + (B ^= 1)))) - (X ^= M(AggSum([], S(B)))) ) * M(R(A) * {X = A}))</pre>
<p>Remember our materialization operator M().  Let's call these new data structures Q1[] (= AggSum([], S(B) + (B ^= 1))), Q2[] (= AggSum([], S(B))), and Q3[A] (= AggSum([A], R(A))).  This gives us the expression</p>
<pre>AggSum([], ((X ^= Q1[]) - (X ^= Q2[])) * Q3[A] * {X = A})</pre>
<p>Of course, we can do better.  If we were to apply the standard materialization optimization rules that we discussed several weeks ago to the expression AggSum([], S(B) + (B ^= 1)), we would actually get two simpler expressions, one of which is constant, and the other of which is equivalent to Q2.  Thus, our full materialization decision becomes</p>
<pre>AggSum([], ((X ^= Q2[] + 1) - (X ^= Q2[])) * Q3[A] * {X = A})</pre>
<p>And applying polynomial expansion, equality lifting and lift unification, we get the absolute simplest expression:</p>
<pre>AggSum([], (X ^= Q2[] + 1) * Q3[X]) - AggSum([], (X ^= Q2[]) * Q3[X])</pre>
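<p>To see what this buys us, here's a sketch of the maintenance code that this materialization decision boils down to.  The trigger structure is my own illustration rather than actual DBToaster output; Q2[] holds the count of S and Q3[A] the count of R grouped by A, as above.</p>
<pre>from collections import defaultdict

result = 0                      # the query result itself
Q2 = 0                          # Q2[]  = COUNT(*) of S
Q3 = defaultdict(int)           # Q3[a] = number of R rows with A = a

def on_insert_S():
    global result, Q2
    # AggSum([], (X ^= Q2[]+1) * Q3[X]) - AggSum([], (X ^= Q2[]) * Q3[X])
    result += Q3[Q2 + 1] - Q3[Q2]
    Q2 += 1

def on_insert_R(a):
    global result
    if a == Q2:                 # the new row matches the current nested count
        result += 1
    Q3[a] += 1

on_insert_R(1); on_insert_R(1); on_insert_R(2)
on_insert_S()                   # the nested count becomes 1; both R(1) rows now match
print(result)                   # 2</pre>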
<p>So that's it.  Don't materialize deltas of lift expressions in their entirety.  There are, however, two corner cases that need to be considered.  First, it's often more efficient to recompute the entire expression from scratch than it is to compute the delta.  The precise definition of these cases is a bit subtle and nuanced, but basically, in any situation where there are no correlated variables (as in the example above), you're essentially computing the entire expression from scratch... twice.  In these situations, it's entirely reasonable just to recompute the entire expression from scratch, but just once.  If you make appropriate materialization decisions, it may still be possible to compute this in constant time.</p>
<p>Second, in some situations, it actually pays to materialize the delta along with the rest of the expression.  For example, consider the query (note the inequality predicate):</p>
<pre>SELECT COUNT(*) FROM R, T WHERE (SELECT COUNT(*) FROM S) &lt; R.A AND R.C = T.C; </pre>
<p>Or in its AGCA form:</p>
<pre>AggSum([], R(A,C) * T(C) * (X ^= AggSum([], S(B))) * {X &lt; A})</pre>
<p>Consider the delta with respect to T(dC).  </p>
<pre>AggSum([], R(A,dC) * (X ^= AggSum([], S(B))) * {X &lt; A})</pre>
<p>You could materialize R and S separately, but you'd end up needing a full iteration over all of the elements of R (to evaluate the aggregate over an inequality predicate) on every insertion into T.  Conversely, putting them together creates a new map that you need to maintain, but the new map adds only a constant factor to the existing maintenance costs.</p>
<p> </p>
<p>Next week, I wrap up my discussion of lifts with some thoughts on a related operator (and the last operator in AGCA): the Exists predicate.</p>

View File

@ -0,0 +1,23 @@
---
title: "Those Marvelous Lifts and Exists (Part 3)"
author: Oliver Kennedy
---
<p>We've been talking for the last few weeks about how the Lift operator can be used to express nested subqueries.  This gives AGCA nearly the full power of (non-recursive) SQL.  </p>
<p>That said, there's one thing that AGCA <strong>can't</strong> do with what I've said before: existential quantification.  For example, consider the query:</p>
<pre>SELECT COUNT(*) FROM R WHERE EXISTS (SELECT * FROM S WHERE R.A = S.A)</pre>
<p>Now, if you stare at this query long enough you might come up with the following potential encoding:</p>
<pre>AggSum([], R(A) * S(A))</pre>
<p>In a way, this makes sense.  You get the count of R(A), but only if there's a matching S(A).  Unfortunately, this encoding isn't correct.  Remember, we're dealing with bags, not sets.  What if S looks like:</p>
<pre><span style="text-decoration: underline;">_&lt;_A_&gt;____#_</span></pre>
<pre> &lt; 1 &gt; -&gt; 2</pre>
<p>And R looks like</p>
<pre><span style="text-decoration: underline;">_&lt;_A_&gt;____#_</span></pre>
<pre> &lt; 1 &gt; -&gt; 1</pre>
<p>That is, there are two copies of the tuple &lt;1&gt; in S, and our query result will be 2 (instead of 1).  We're not looking for the specific number of tuples in S (that match a particular pattern), we're just looking to test whether there are ANY tuples in S (that match a particular pattern).  </p>
<p>We need an operator that can act as (not quite, but something sort of like) a step function.  Inside the operator is a nested query (just like Lift and AggSum).  If the nested query evaluates to 0, the operator evaluates to 0.  If the nested query evaluates to something other than 0, the operator evaluates to 1, regardless of what precisely the nested expression evaluates to.</p>
<p>This operation is actually something that you can't do with AGCA as I've described it up to this point (try it yourself if you don't believe me).  Thus, we have the Exists operator (which I've just described), and we can express the example query as:</p>
<pre>AggSum([], R(A) * Exists(S(A)))</pre>
<p>What about deltas?  Well, it turns out the exists operator is actually quite close to the lift operator.  The exists operator doesn't introduce any new columns into the schema (unlike the lift), but it is a non-linear operation (unlike AggSum).  Furthermore, unlike the only other non-linear operation (comparison), it can have a non-zero delta.  So, we get:</p>
<pre>∂Exists(Q) = Exists(Q+∂Q) - Exists(Q)</pre>
<p>Again, the delta has the original query in it, but this can be addressed using the same materialization tricks we talked about last week.</p>
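<p>For completeness, here's the rule as a few lines of Python (my own illustration) over plain multiplicities:</p>
<pre>def exists(q):
    return 1 if q != 0 else 0

def delta_exists(q, dq):                 # Exists(Q + dQ) - Exists(Q)
    return exists(q + dq) - exists(q)

print(delta_exists(0, 2))    #  1: the nested query becomes non-empty
print(delta_exists(2, 1))    #  0: it was already non-empty
print(delta_exists(1, -1))   # -1: the last matching tuple goes away</pre>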
<p>And that's it.  That's all there is to AGCA!</p>

View File

@ -0,0 +1,22 @@
---
title: "Collaborative web applications"
author: Oliver Kennedy
---
<p>This week, I'm going to step back from AGCA and start venturing into some higher level topics.  This week, let's talk about web applications.  </p>
<p>Not just any web applications mind you, let's talk about collaborative web applications.  The term is new, but the idea isn't.  You've probably heard of at least a few of the following: Google Docs (aka, Google Drive now), Google Wave, Office 365, Dropbox... </p>
<p>These applications are pretty nifty.  They allow you to log in from wherever, and edit/view documents.  But not only that, they also allow you to interact with other users of the same system.  If someone else opens up the same document, they see any changes that you make as soon as you make them.  In other words, the state of the application is mirrored in realtime across all the browsers in which the application is running.</p>
<p>Building these applications in a web-browser also forces you to rely heavily on the browser metaphor, HTML5 functionality, and on HTTP.  Integrating an application's design with this ecosystem is often kludgy, but can bring several extremely nice benefits:</p>
<ul>
<li>A distinction between communications <em>channels</em> and communications <em>sessions</em>: You usually can't rely on a single, stable connection to the server, so the communications protocol will typically place each message in a separate HTTP request.  This in turn means that the application is resilient to being suspended (i.e., switching apps on an iPhone, or putting your laptop to sleep), or moving across networks (switching from cellular to wifi, or plugging your laptop into an ethernet jack).</li>
<li>Stateless servers: The use of HTTP generally encourages servers to be stateless -- each request is served in an identical manner.  This means that the server can be designed to scale, without sacrificing the client's ability to suspend/resume its participation in the computation.</li>
<li>A Refresh Button: Web browsers have a refresh button.  The application has to be resilient to users pressing it if something is misbehaving.  This in turn makes error recovery much easier -- if the application is misbehaving, restarting it is trivial.</li>
<li>Layout through the DOM: Web-based applications have been designed from the ground up to be GUI-oriented.  A text-based interface is possible, but frequently more work than a simple graphical user interface.</li>
</ul>
<p>Applications like this are incredibly cool, and incredibly useful.  So why don't we see more of them?</p>
<ul>
<li>Why doesn't blogger have the same functionality?</li>
<li>Where are the collaborative whiteboards?</li>
<li>What other kinds of awesome applications could you implement in this space?</li>
</ul>
<p>The short answer is that the infrastructure isn't there.  And it's hard.  Ask any first-year distributed systems student -- they'll tell you that state replication isn't easy, even when your data is only coming from one source.  Web developers are used to dealing with nice, simple, standardized backend infrastructures.  Apache, MySQL/Postgres, and maybe some sort of CMS like Django are reasonable expectations... but none of these support the sort of scalable realtime state replication required to implement a collaborative web application.</p>
<p>Just a thought.</p>

View File

@ -0,0 +1,11 @@
---
title: "Intent"
author: Oliver Kennedy
---
<p>What is the difference between intent and effect?  </p>
<p>It's tempting, especially for a computer scientist, to consider these two to be similar.  Intent is, after all, just an effect that hasn't happened yet.  </p>
<p>On the other hand, an intent may never happen.  It might happen in one of a number of different ways.  Attempting to describe an intent in terms of its effect (or possible effects) is as likely as not to be incredibly inefficient.  This is not at all a new concept -- databases have, for ages now, supported an access mode that allows users to express their intent as a sequence of operations (i.e., transactions).  Even so, transactions are expressed in terms of effect.  The user says "UPDATE" and the database applies the relevant changes to its state (perhaps without committing them, but the update is still effected).</p>
<p>This is quite helpful.  User code can test the database for its present state, and take actions dictated by the results of those tests.  User code can specify iterations over multiple entries in that state.  It is frequently possible to specify the user's intent far more compactly than the effects of that intent.  Better still, in this way, we can encode the full range of possible effects and outcomes of that intent.</p>
<p>This too is not an entirely novel idea.  Modern, distributed database systems have noticed this nice, friendly, compact encoding, and started allowing users to package their intent into nice little snippets of code to be executed as a transaction.  The code executes on the database, which can guarantee that the user's intent is followed precisely, without the need for (slow) locking, or commit protocols (that may require the user to repeatedly restart their transaction).  The user's intent is seamlessly translated into an effect.</p>
<p>Prepackaged intent is good for interleaving transactions coming from multiple clients.  But what about for actually managing the data itself?  Even these database systems will eventually evaluate the intent, and transform it into a nice flat, easily readable form.  But what if you weren't concerned about reads?  What if you were simply interested in keeping data in synch?  </p>
<p>We started by expressing intent in terms of effect.  Then we moved to a seamless transition between intent and effect.  Why not push things a step further?  Why not express effect in terms of intent?</p>

View File

@ -0,0 +1,15 @@
---
title: "What's Wrong With Probabilistic Databases? (Part 1)"
author: Oliver Kennedy
---
<p>A large chunk of my graduate work has to do with a subfield of database research called probabilistic databases.  </p>
<p>The idea is simple: Most databases store precise values.  A row of a normal database might indicate that Bob's SS# is 199-..-....</p>
<p>A probabilistic database allows users to provide data specified by a probability distribution.  Perhaps Bob was a little sloppy filling out a form, and the OCR software couldn't tell whether he intended to put down 199-... or 149-...</p>
<p>It might not be able to determine a precise value for that slot, but it can tell you that Bob's SS# is either 199-... (with some probability) or 149-... (with some other probability).  The database can store both of these.  When you write queries over this data, you treat them as normal, ordinary queries.  When you get an answer, you get an answer with some probability distribution: If you're asking for someone with SS# 199-..., then you'll get the answer Bob (with the corresponding probability).  </p>
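<p>As a toy illustration (my own code, not any real probabilistic DBMS; the probabilities and the padded-out SS#s are invented placeholders), a single uncertain attribute and a lookup over it might look like this:</p>
<pre>people = {
    # hypothetical placeholder values standing in for the elided digits above
    'Bob': {'199-xx-xxxx': 0.7, '149-xx-xxxx': 0.3},
}

def lookup_by_ssn(ssn):
    # each matching person comes back with the probability that they match
    return {name: dist[ssn] for name, dist in people.items() if ssn in dist}

print(lookup_by_ssn('199-xx-xxxx'))    # {'Bob': 0.7}</pre>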
<p>Probabilistic DBs are pretty cool.  Unfortunately, they haven't managed to get much traction beyond the research community.  They've been applied here and there (including in some of my own work), but as of this time, no major DB vendor supports ProbDB functionality.  Why?</p>
<p>To answer that question, we first have to understand why people would use a probabilistic database.  We have to start with sources of uncertainty.  Most probabilistic database work attempts to address one (or both) of two types of uncertainty:</p>
<ul>
<li><strong>Noisy Data</strong> - Your data gathering process is flawed (e.g., using OCR software or web-scraping techniques).  The data you have contains typos, omissions, or other mistakes.  A thorough data-cleaning could potentially fix these errors, but you lack the necessary manpower or resources.  That is to say that a hypothetical 'clean' version exists.  When the data is queried, you want to find the query results most likely to correspond to the query results on this clean version.</li>
<li><strong>Missing Knowledge</strong> - The data being queried is derived from a model, and has no corresponding 'clean' version.  There are many possible outcomes, each with a varying likelihood.  Queries over this type of data typically ask the database to make a prediction, and you're usually looking for an expectation or a percentile result.</li>
</ul>
<p>Although the underlying techniques used to query both types of data are extremely similar, the way users approach these two types of data is quite different.  Over the next two weeks, I'll talk about each of them, and try to understand what's keeping people from using probabilistic databases to address these problems.</p>

View File

@ -0,0 +1,26 @@
---
title: "What's Wrong With Probabilistic Databases? (Part 2)"
author: Oliver Kennedy
---
<p>Last week, I introduced the concept of probabilistic databases: databases that store values characterized by probability distributions, and not (necessarily) by specific values.  Although this is a pretty cool, and potentially quite useful, idea, a number of practical concerns have prevented it from gaining traction.  This week, we explore one class of problems that probabilistic databases are ideally suited for: noisy data.  </p>
<p>Data is only useful if you can analyze it -- ask questions about it.  The problem is that very few data-gathering pipelines are entirely perfect.  Data can be missing.  Data can contain typos.  Data can contain measurement error.  And on top of it all, even if a data source is perfect, god help you if you have to combine it with another data source.  Integrating multiple data sources means dealing not only with inconsistent formatting, but also inconsistent data values (the same person could be referred to as Mary Sue, Ms. M Sue, Mrs Sue, or any of a practically infinite number of variations on the same theme).  A whole research area (typically called entity resolution) has sprung up around this problem.</p>
<p>In short, before using any dataset (or when merging two datasets), it's typically necessary to go through an (often time-consuming) data-cleaning process.  Often, this process can be automated.  You can call out obvious errors: duplicates of values that should be unique, values out of bounds, improperly formatted expressions, etc.  Many of these issues can be fixed automatically.  Formatting mismatches between different datasets can be easily fixed by translating to a single common format.  </p>
<p>Unfortunately, automated processes can only take you so far.  If two people appear with the same social security number, then clearly at least one of them is wrong.  But typically, an automated process can't decide which is correct, nor can that process decide what kind of number to assign to the person who now has no identifier.  Typically, one of three things happens at this point:</p>
<ol>
<li><strong>Immediately punt to the user</strong>: A common example of this is key constraints in traditional databases.  The datastore simply won't allow data that it knows to be unclean to be entered into the system.  This approach ensures that data is correct before anyone asks any questions about it, but requires end-users to put a huge up-front effort into data quality before a single question can be asked.</li>
<li><strong>Guess</strong>: This happens often in situations where an automated system can efficiently compute the probability that a particular interpretation is correct, like handwriting recognition or sentence parsing.  The system settles on one specific way of interpreting the data (the one with the highest probability), and discards the rest.  Ironically, this can be just as much a source of data errors as any other data gathering process if the guessing algorithm isn't perfect.</li>
<li><strong>Ignore it</strong>: Failing all else, you can simply ignore the problem.  You implicitly accept that answers to your questions may be erroneous, but don't especially care.</li>
</ol>
<p>In short, either you put a huge amount of effort in upfront to clean your data, or you deal with mistakes in your answers.  </p>
<p>Probabilistic databases aren't a magic bullet.  They can't magically make your data clean, or fix the mistakes in your answers.  What they can do, however, is tell you how much of a mistake there is.  And you don't even have to do anything different.  You can just query your data as if it were normal, ordinary data.  All of the trickery for dealing with uncertainty happens under the hood, except that you get a probability value as your output.</p>
<p>So where's the problem?  Why aren't probabilistic databases being used more aggressively?  </p>
<p>As I see it, there are two issues at play here for the general populace:  </p>
<ul>
<li><strong>People don't know what to do with probabilities</strong>.  Statisticians aside, very few people know how to deal with probabilities.  If someone gets a response that is 75% accurate, they're generally not going to want to perform a complex risk analysis.  Either they trust the result or they don't.  In other words, guessing is usually sufficient here, because ultimately the user is interested in the most probable result anyhow (which, admittedly, isn't always what guessing produces).  </li>
<li><strong>People don't know how to define probabilities to begin with</strong>.  Again, statisticians aside, very few people can build good statistical models.  Sometimes your data comes with probabilities already associated with it, but more likely than not, the average data-user won't have a good sense of how to define their automated data cleaning processes probabilistically.</li>
</ul>
<p>In short, the problem with applying probabilistic databases to the challenge of noisy data is the probabilities.  People are used to dealing with fuzzier notions: "Certainly Not", "Unlikely", "Possibly", "Likely", "Certainly So".</p>
<p>So what's the takeaway from all of this?  Well, I'm not entirely certain.  I think probabilistic database research needs to start looking at ways of isolating users from the specifics of the probabilistic distributions underlying the system.  Instead of presenting users with query results and probabilities, we need to give users a more intuitive way of visualizing what possible outputs there are.  Rather than giving users specific confidence values, we need to give users a more intuitive notion of how to interpret that confidence value.  </p>
<p>Better still, we need to provide the user with things that they can do to improve the confidence level.  Instead of immediately punting to the user when a data error occurs, let the user run queries on the noisy data, and then point them at the specific cleaning tasks that they need to perform in order to get better results.</p>
<p>And of course, we need to give users better tools for automating their data cleaning processes -- tools that natively integrate with probabilistic database techniques.  Tools that know how to associate probabilities with the data they generate.  </p>
<p>Next week, we look at a second class of problems that probabilistic databases can be used to address: modeling.  </p>

View File

@ -0,0 +1,18 @@
---
title: "What's Wrong With Probabilistic Databases? (Part 3)"
author: Oliver Kennedy
---
<p>Two weeks ago, I introduced the idea of probabilistic databases.  In this final installment of the miniseries, I'm going to talk about the second major use of probabilistic databases: dealing with modeled data.</p>
<p>Unlike last week's topic of missing or erroneous data, where there is some definitive ground truth, a probabilistic model attempts to capture a spread of possible outcomes.  Not only is there no ground truth, there usually won't be (at least not so long as questions are being asked).</p>
<p>That's not to say that there's no overlap between modeled and erroneous data, just that there's a different mentality about how this data is used.  In this case, queries encode scenarios rather than questions.  </p>
<p>That is to say that a probabilistic database must take its uncertain inputs from somewhere.  At some level, there has to be a probabilistic model (or more likely, several) passed as input to the query.  Even if the probabilistic database is capable of filling in any parameters that the model needs, someone still had to sit down and figure out the general framework of the model.  This role generally falls to someone with a background in statistics.  </p>
<p>This is where the problems come in.  The machinery required to get even a relatively simple model off the ground is usually pretty extensive.  Even something as simple as a Gaussian distribution can require days, or even weeks, of validation against test data.  So, if you want to ask questions about your nice, simple, elegant model, you're not going to want or need the complex machinery of a database.</p>
<p>That said, where the machinery does come in handy is when you need to integrate multiple models, or to integrate your model with existing data.  A simple example I used in a paper a while back was for capacity planning: One (simple) model gives you the expected capacity (e.g., CPU) of a server cluster at any given time over the next few months (e.g., accounting for the probability of failures), while a second (also simple) model gives you the expected demand on that cluster.  Each of these can be tested, analyzed, and independently validated.  Then, these models can be combined to provide a single model for the probability of having insufficient capacity on any given day.  This relationship can be represented as an extremely simple SQL query, and then executed efficiently on a probabilistic database.</p>
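<p>To give a flavor of what that combination looks like, here's a minimal Monte Carlo sketch in Python.  The two model functions, their distributions, and all of the numbers are illustrative assumptions on my part (not from any real deployment); the point is only that two independently validated models can be composed into a single probability of running out of capacity -- the same relationship that the simple SQL query mentioned above would express declaratively.</p>
<pre>
import random

# Hypothetical model 1: available capacity (CPU-hours) on a given day,
# accounting for a small probability of individual server failures.
def sample_capacity(servers=100, per_server=24, failure_rate=0.02):
    up = sum(1 for _ in range(servers) if random.random() > failure_rate)
    return up * per_server

# Hypothetical model 2: demand (CPU-hours) on the same day.
def sample_demand(mean=2200, stddev=150):
    return random.gauss(mean, stddev)

# Combine the two models: estimate P(demand > capacity) by sampling.
def p_insufficient_capacity(trials=100000):
    shortfalls = sum(1 for _ in range(trials)
                     if sample_demand() > sample_capacity())
    return shortfalls / trials

print(p_insufficient_capacity())
</pre>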
<p>In short, probabilistic databases can be used as a sort of scaffolding to combine multiple data sources (both real and modeled) together, to build more complex models.  </p>
<p>So what's missing?</p>
<ul>
<li><strong>Language Support</strong>: Although many statisticians are comfortable working with SQL, this is not the case for everyone who uses probabilistic models.  Languages like R, Python, Java, and C++ are far more common, and less alien to researchers and model-builders.  There has already been some work on integrating these languages with database techniques.  The Scala guys are working on improvements that let you translate code written using certain fragments of Scala into equivalent database queries.  There's no reason that we can't do something similar with Python.  Similarly, there have been numerous efforts to translate R into some form of relational algebra.</li>
<li><strong>Efficiency</strong>: Over the years, database research has become synonymous with work on monolithic one-size-fits-all database systems.  Most work with nontrivial models already requires extremely expensive Monte Carlo methods, and model-builders are often reluctant to delegate the task of hand-optimizing their code to an automated system that they perceive (often correctly) as being less efficient.  We need ways to give them good performance out of the box, with a minimum amount of coding overhead and setup.  If this performance is insufficient, we need to make it possible for them to seamlessly transition to an environment where they can fine-tune the evaluation strategy, again, without needing to learn anything that they don't already know.</li>
<li><strong>Interfaces</strong>: R provides a number of useful analytic and visualization tools right out of the box.  While I suspect that no probabilistic database will be quite as complete in the short term, we need to get there before people will start looking seriously at probabilistic databases as an effective analytics and probabilistic modeling tool.</li>
</ul>
<p>Given all of this, I think probabilistic database techniques could be adapted easily.  The only real challenge standing in our way at this time is interfaces.  How do we present end-users with an interface that is not only as powerful as the tools they're used to working with, but is also similar enough that the learning curve is minimized?</p>

View File

@ -0,0 +1,15 @@
---
title: "Consistency through semantics"
author: Oliver Kennedy
---
<p>When designing a distributed system, one of the first questions anyone asks is what kind of consistency model to use.  This is a fairly nuanced question, as there isn't really one right answer.  Do you enforce strong consistency and accept the resulting latency and communication overhead?  Do you use locking, and accept the resulting throughput limitations?  Or do you just give up and use eventual consistency, accepting that sometimes you'll end up with results that are just a little bit out of sync?</p>
<p>It's this last bit that I'd like to chat about today, because it's actually quite common in a large number of applications.  This model is present in everything from user-facing applications like Dropbox and SVN/Git, to back-end infrastructure systems like Amazon's Dynamo and Yahoo's PNUTs.  Often, especially in non-critical applications, latency and throughput <strong>are</strong> more important than dealing with the possibility that two simultaneous updates will conflict.  </p>
<p>So what happens when this dreadful possibility does come to pass?  Clearly the system can't grind to a halt, and often just randomly discarding one of these updates is the wrong thing to do.  So what happens? The answer is common across most of these systems: They punt to the user.  </p>
<p>Intuitively, this is the right thing to do.  The user sees the big picture.  The user knows best how to combine these operations.  The user knows what to do, so on those rare occurrences where the system can't handle it, the user can.</p>
<p>But why is this the right thing to do?  What does the user have that the infrastructure doesn't? </p>
<p>The answer is Semantics.</p>
<p>Each update does something with the data. It increments, it multiplies, it derives, it computes.  It produces some new value of the data.  It has specific semantics, and the systems I enumerated above (and those like them) make no effort to try to understand those semantics.  I addressed this in part already, when I discussed intent vs effect a few weeks ago.  </p>
<p>The user, conversely, does understand the semantics of an application.  Given two updated values (and a suitable visualization tool, like diff), a user can usually infer the intent of the updates and merge their effects appropriately.  </p>
<p>Sometimes this is the best way to do things.  When writing source code or other text, where the user is directly modifying the files (i.e., the system never receives a representation of the intent in the first place), the overhead of manually merging periodically is typically lower than the overhead of having to encode edits in terms of intent (though this might be interesting if combined with bug/feature tracking systems).  </p>
<p>Conversely, if an application is interacting with the data directly, the application can provide tools for resolution.  This is indeed the case in Dynamo, where the application provides a merge function for resolving inconsistent updates.  But this is only the first step.  What can you do to avoid creating two inconsistent versions of the data in the first place?  How do you infer the user/application's intent, while minimizing the burden of declaration placed on the user/app developer?  </p>
<p>In short, what can you do to both detect and leverage an application's semantics to help the application stay consistent?</p>

View File

@ -0,0 +1,26 @@
---
title: "Uncertainty in Distributed Computation"
author: Oliver Kennedy
---
<p>Probabilistic databases are a solution to a simple problem -- Sometimes you don't have all the data.  </p>
<p>Probabilistic databases address this problem in the context of a specific domain: asking questions about data that is incomplete, imprecise, or noisy.  But this is only one domain that this problem occurs in; Noisy, incomplete data occurs everywhere.</p>
<p>A prime example of this is distributed computation.  Each node participating in the distributed computation knows (for certain) what is going on locally, but not what's going on elsewhere.  If an update occurs on one node, it takes time to propagate to other nodes.  </p>
<p>A good way to think of this is that the node has its own view of the state of the world.  Slowly, over time, this view diverges from the "real" state of the world.  As the node communicates with other nodes, the view reconverges.  </p>
<p>Many early distributed protocols were designed to enforce this sort of convergence, at least to the point where certain properties (e.g., relative ordering) could be guaranteed.  For the past few years, the fashion has been to use eventual consistency, where the end-user is presented with results that are not guaranteed to be entirely accurate.  </p>
<p>This doesn't have to be a binary choice; many such systems (Zookeeper[1], PNUTS[2], Percolator[3], etc...) offer a hybrid consistency model where end-users can choose to receive results guaranteed to be consistent, albeit at the cost of higher access times.  </p>
<p>What I've been seeing lately is a tendency to take this even further: To actually try to capture the uncertainty in the computation in the distributed programming model itself.  The first instance that my quick (and quite incomplete) scan of deployed systems turned up was Facebook's Cassandra [4], which used a technique called φ-accrual [5] to get a running estimate of the likelihood of a particular server being up or down.  </p>
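<p>For the curious, here's a rough sketch of the φ-accrual idea in Python -- my own simplification (assuming roughly normally distributed heartbeat intervals), not Cassandra's actual implementation.  Rather than a binary up/down verdict, the detector reports a continuous suspicion level φ that grows the longer a heartbeat is overdue.</p>
<pre>
import math

class PhiAccrualDetector:
    """Simplified φ-accrual failure detector: assumes heartbeat
    inter-arrival times are (roughly) normally distributed."""

    def __init__(self):
        self.intervals = []
        self.last_heartbeat = None

    def heartbeat(self, now):
        if self.last_heartbeat is not None:
            self.intervals.append(now - self.last_heartbeat)
        self.last_heartbeat = now

    def phi(self, now):
        # φ = -log10( P(a heartbeat could still arrive this late) )
        mean = sum(self.intervals) / len(self.intervals)
        var = sum((x - mean) ** 2 for x in self.intervals) / len(self.intervals)
        std = math.sqrt(var) or 1e-6
        elapsed = now - self.last_heartbeat
        # Upper tail of a normal distribution (survival function).
        p_later = 0.5 * math.erfc((elapsed - mean) / (std * math.sqrt(2)))
        return -math.log10(max(p_later, 1e-12))

d = PhiAccrualDetector()
for t in [0.0, 1.0, 2.1, 2.9, 4.0]:
    d.heartbeat(t)
print(d.phi(4.5), d.phi(9.0))   # low suspicion vs. very high suspicion
</pre>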
<p>More recently, a similar idea has appeared in Google's Spanner [6].  Here, the uncertainty was on the timing of specific events, and the goal was to determine relative ordering and to obtain guaranteed consistency by establishing a bound on how accurate (or inaccurate) the timestamps you're using are.</p>
<p>This idea can be taken a lot further.  Although I can't imagine programmers wanting to explicitly account for uncertainty in their code, they may be willing to work with a language that does this accounting for them.  Maybe I don't need a precise result to present to the user, maybe I just need something in the right ballpark.  Maybe I just need an order of magnitude!</p>
<p>What would a language designed around this look like?</p>
<p>How could the programmer specify the bounds on uncertainty that they were willing to accept?</p>
<p>Could such a language be combined with online techniques (i.e., provide the end-user with a stream of progressively more accurate answers)?</p>
<p>Can PL ideas such as promises be adapted to this context?  Here's an answer, it has accuracy X.  The result of the computation you want to do with it can also be computed, and the uncertainty of that computation (based on the uncertainty in the input) is Y.</p>
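<p>As a thought experiment, here's a tiny, entirely hypothetical sketch (in Python) of what such a primitive might feel like to program with: values carry an error bound, arithmetic propagates the bound automatically, and the programmer only ever asks "is the accumulated uncertainty still within what I'm willing to accept?"</p>
<pre>
class Uncertain:
    """A value known only to within +/- err (worst-case bounds;
    a real system might track full distributions instead)."""

    def __init__(self, value, err=0.0):
        self.value, self.err = value, err

    def __add__(self, other):
        return Uncertain(self.value + other.value, self.err + other.err)

    def __mul__(self, other):
        # Worst-case bound for a product.
        err = (abs(self.value) * other.err
               + abs(other.value) * self.err
               + self.err * other.err)
        return Uncertain(self.value * other.value, err)

    def within(self, tolerance):
        return tolerance >= self.err

    def __repr__(self):
        return "{} ± {}".format(self.value, self.err)

# e.g., a counter replicated from a remote node, known only approximately:
remote_count = Uncertain(1050, err=30)
local_count = Uncertain(400)
total = remote_count + local_count
print(total, total.within(50))   # 1450 ± 30 True -- "in the right ballpark"
</pre>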
<p>This seems like it would be a really cool programming platform, if it could be made both usable and efficient.</p>
<h3>Citations</h3>
<p>[1] Hunt, P. et al. 2010. ZooKeeper: Wait-free coordination for Internet-scale systems. USENIX ATC. (2010).</p>
<p>[2] Cooper, B.F. et al. 2008. PNUTS: Yahoo!'s hosted data serving platform. Proceedings of the VLDB Endowment. 1, 2 (Aug. 2008), 1277–1288.</p>
<p>[3] Peng, D. and Dabek, F. 2010. Large-scale incremental processing using distributed transactions and notifications. (2010).</p>
<p>[4] Lakshman, A. and Malik, P. 2010. Cassandra—A decentralized structured storage system. Operating Systems Review. (2010).</p>
<p>[5] Hayashibara, N. et al. The φ accrual failure detector. 66–78.</p>
<p>[6] Spanner: Google's Globally-Distributed Database: http://research.google.com/archive/spanner.html</p>
<p> </p>

View File

@ -0,0 +1,23 @@
---
title: "Filesystems, Application Semantics, and Walled Gardens (Part 1)"
author: Oliver Kennedy
---
<p>There's been a fundamental limitation of web applications that has bugged me for the longest time.  I don't think I got a good idea of what it was until I started looking at my iOS devices, and realized it was the same thing that bugged me about them.  There's no common filesystem on any of them.</p>
<p>This is something incredibly frustrating about the iOS: Each application has its own filesystem.  Each application has its own way of listing your available documents.  Each application has its own way of interfacing with external document storage systems (i.e., Dropbox, iCloud, etc...).  It's possible to move documents between these filesystems: There's a shared photo repository, and a way for users to explicitly send documents from one application to another, but this functionality has to be enabled by the developers of both applications, has to be initiated by the user explicitly, and creates a copy of the document, which makes it a pain to keep track of which application currently has the "working" copy of your document.  </p>
<p>This same thing has been showing up in web applications.  The scarcity of fully featured web applications makes it difficult to see this effect clearly, but try editing a document with both Office 365 and Google Docs, and see how sane you stay.</p>
<p>This raises the question: why?  Why limit your applications in this way?</p>
<p>In the case of web applications, it's a technical limitation.  I'll get back to that, but first let's have a look at the iOS.  There's an actual hierarchical filesystem under the hood.  App developers must deal with paths and file pointers.  Yet, there was a conscious decision on the part of iOS's designers to force application developers to build their own document management systems:</p>
<ul>
<li>There's no file management widget or application.</li>
<li>There's no integration support for alternative hierarchical document management systems (e.g., Dropbox).</li>
<li>Each application has (conceptually at least) its own independent filesystem.  </li>
</ul>
<p>So why?  Why build each application into a walled garden?</p>
<p>I don't have a singular answer to this question, but this model actually has a number of very nice benefits.  </p>
<p><span style="text-decoration: underline;"><strong>Specialized vs Standardized Formats</strong></span></p>
<p>There are a number of formats out there, many of which are (explicitly or implicitly) standardized.  PDF, JPEG, PNG, TIFF, MP3, Matroska Video, PowerPoint, and Rich Text Format are all examples of formats for encoding a wide range of different document types.  Even among standards, there is some duplication.  JPEG, PNG, and TIFF are all formats for encoding image data, but each has a slightly different set of benefits, each applicable to different types of image data.  </p>
<p>If there's this much duplication among standard image formats, imagine how much duplication there is among non-standards.  Let's have a look at some image formats used by the products of a single company: Adobe.  Photoshop, Illustrator, and InDesign each manage image data, but have their own distinct formats.  </p>
<p>Each application is designed to do something different.  Photoshop is designed to edit raster data, Illustrator is designed to edit vector data, and InDesign is designed to manage page layouts.  Because the applications are designed with different functionality in mind, each has a different notion of what an ideal layout is for the data being managed.  If nothing else, different applications may find different metadata or index structures necessary on top of the core data.</p>
<p>My blog editor is a good example.  At the heart of it, the editor just manages a list of rich text (HTML) files organized in a nice simple hierarchy.  But then for each directory it has some metadata (the blog that the directory corresponds to), and for each text file it has some metadata (title, tags, categories, server options).  There are inter-document relationships that it keeps track of (if a post has media/images/etc...), and it keeps itself synchronized with the blog posts already on the server.  The core content (the HTML text file) is enhanced by application-specific metadata that allows the editor to effectively interact with the blog.  It would be much harder to design such an application if the user was required to explicitly manage application state.  Inter document relationships in particular are extremely difficult to manage (as anyone who has tried to move HTML from one website to another can attest to).  </p>
<p>Meanwhile, providing an explicit export function (à la iOS) forces the application developer to provide functionality for translating their own custom document format/metadata/etc... into a standardized format.  This clearly demarcated boundary, if used properly, could actually <strong>increase</strong> compatibility between applications, while allowing each to maintain their own custom data formats appropriate for their own specific application.</p>
<p>This is running a little long, so I'll return next week with a few more benefits of the iOS document model.</p>

View File

@ -0,0 +1,25 @@
---
title: "Filesystems, Application Semantics, and Walled Gardens (Part 2)"
author: Oliver Kennedy
---
<p>Last week, I started talking about some advantages of the iOS data model -- specifically its lack of a common filesystem.  I started by talking about the issue from a file format/metadata perspective.  Today, I look at three more benefits: Security, Presentation, and the Data Cloud.</p>
<p><span style="text-decoration: underline;"><strong>Security</strong></span></p>
<p>Perhaps the strongest argument for giving each application a walled garden is security.  As far back as I can remember, security has been a huge usability problem.  In this case, the problem being attacked is authorization of code: How can a user safely grant an application the right to access a user's secure data?  </p>
<p>The traditional way has been to ask the user (i.e., "Can 'Irate Avians' access your contacts?").  Unfortunately, questions like this require the user to think.  The user has to sit there and ask themselves questions like</p>
<p>"Why is Irate Avians asking me for my address book?"  </p>
<p>"Do I trust the code of Irate Avians to not do anything I wouldn't want with my address book?"</p>
<p>Odds are that a typical user won't have the background to be able to answer questions like this by themselves.  Worse still, most code running on a typical user's device(s) is perfectly trustworthy, so generally the answer to these questions is yes, and the user learns that it's ok to not think about these questions.  </p>
<p>So how do you get the user to really think about these questions?  </p>
<p>Quite simply, you don't. </p>
<p>Instead of having the user try to figure out the application's intent, you design your user interface so that the <strong>user's</strong> intent is clear.  </p>
<p>The walled garden model forces a user to push state from one application to another, rather than the filesystem based model where each application pulls state from the filesystem.  In both cases, the application sits between the user's intent and the filesystem.  </p>
<p>However, in one case, a (potentially) untrusted application is acting on its own, while in the other a trusted application (one that already has access to the data) is indicating that it wishes to extend that trust to another application.  In the latter case, the user (through the trusted application) has explicitly given permission for their data to be accessed.</p>
<p><span style="text-decoration: underline;"><strong>Presentation</strong></span></p>
<p>Different types of documents can be presented in different ways.  For code, a nice hierarchical, alphabetical listing is often best.  For photos, you want thumbnails.  In short, the nature of the browser being used depends heavily on the type of work you're doing.  We're seeing this phenomenon with apps like iPhoto, Front Row, Eclipse and Xcode, each of which has a custom file browser (some of which don't even correspond to the underlying filesystem).  </p>
<p>There's not really much to this point, just that different types of data need to be organized in different ways, and the entity best suited to managing this organization is the application that created/is responsible for managing the data.</p>
<p><span style="text-decoration: underline;"><strong>The Data Cloud</strong></span></p>
<p>On a related note, some data doesn't fit into a neat little document (or similarly, into the filesystem) model.  Sometimes you want to keep different datatypes independent (e.g., the giant mess of files in a website).  Sometimes you have data that can't be structured exactly into a hierarchical model (e.g., email, or music).  </p>
<p>A perfect example of data that doesn't fit into the document model is social network data (and graph data in general).  In this type of data model, you have lots of little nuggets of information, which can be sorted, organized, collected, distributed, and aggregated in any number of ways.  There's not generally a single concept that you can group each comment, post, or message into.  Sure, you can put this data into a file, but more likely than not, all of this data will go into a single file (or equivalently, into a single grouping of files).</p>
<p>For this type of data, the walled garden is great.  You can present the user with an interface ideally suited to organizing, aggregating, and displaying all of these little nuggets of information.  Since such applications tend to be decentralized, you can easily interface with networked components as well.  You don't need to forcibly coerce this application-specific data presentation model into the filesystem model that everything else expects.</p>
<p> </p>
<p>So there you have it.  Four particularly nice aspects of the iOS walled garden application state model.  Next week, I'll talk about how this all connects to web applications.</p>

View File

@ -0,0 +1,17 @@
---
title: "Filesystems, Application Semantics, and Walled Gardens (Part 3)"
author: Oliver Kennedy
---
<p>For the last two weeks, I've been discussing the beneficial aspects of the iOS walled garden document model.  As it turns out, there are quite a few.</p>
<p>That said, let me emphatically state that this model is a bad idea.  It forces all stages of your document processing workflow to live in a single application (lest you suffer the pain and agony of tracking multiple document versions).  This in turn hurts a developer's ability to deliver functionality incrementally (e.g., a developer who wants to deliver a simple graphics filter has to develop a full graphics editing suite around it).  </p>
<p>I regularly use two different text editors (SubEthaEdit and TexShop) when editing LaTeX source, and the LaTeX compiler is a third application that needs to access the files.  Sometimes I script certain pieces of functionality (e.g. generating certain tables or graphs).  In a desktop processing environment, this is typically trivial.  Every bit of data is accessible through a shared filesystem.  The lack of a shared filesystem on iOS means that there might be five applications, each with 90% of the features required by your workflow, instead of one set of composable applications that provide all of those features.</p>
<p>This is unacceptable.  However, iOS is Apple's sandbox, it's their prerogative to design it how they see fit.</p>
<p>Unfortunately, this is not the only space where one sees the walled garden document model.</p>
<p>Consider Google Docs, Office 365, and iCloud.  In theory, they're compatible.  They share compatible formats for word processing documents, spreadsheets, and presentations (even if it is the office format).  But, like a walled garden, each has its own domain.  If you want to edit a presentation in Google Docs, you upload it to Google Drive.  If you want to use it with iCloud, you import it into Keynote, and something similar has to happen if you want to use Office 365 or some other online presentation software (e.g., Presvo).</p>
<p>Unlike with the iOS, this does not appear to have been a deliberate decision on the part of the application developers.  Every web application has its own mechanism for keeping persistent state, quite simply, because no user would use an application that deletes all your data whenever you quit your browser.  They persist state because it's a nice side effect of having to share data between multiple clients/browsers.  What they don't do is persist state because they expect another application to be able to start messing with that state.</p>
<p>Let me put that another way.  Shared filesystems are (by design) a nice way for applications both to pass data between themselves and to maintain persistent state.  The mechanic behind both of these is identical (again, by design): Any application can write data to the filesystem, and any application can read data back out of the filesystem (modulo permissions).  In short, a desktop application that needs to store persistent state gets (essentially for free) the ability to automatically exchange data with other applications.</p>
<p>Web applications have no shared filesystem.  WebDAV, the one contender that comes to mind, is frequently too unstable or slow for practical use, and made even harder to use by browser security models.  Worse still, while desktop application developers can typically assume the presence of some sort of disk drive in the device they're developing for, web application developers can't assume that all of their clients will have a WebDAV server somewhere.  </p>
<p>Cloud storage solutions like Dropbox should present a potential solution, but I haven't seen a lot of uptake there either.  My guess is that this is a combination of the browser security model issue, and latency issues with realtime updates (If anyone reading this has a better idea, I'd love to hear it).  </p>
<p>What this adds up to is that in order to store persistent state, web application developers have had to roll their own application-specific filesystems.  They need to persist state, but don't need to make their web application play nice with other web apps (admittedly, services like Google Drive now play nicely with desktop applications).  </p>
<p>What we need is a filesystem for the web application world.  A system that extends an application's ability to persist state (and/or its ability to replicate and collaboratively edit state with multiple clients), into the ability to collaborate with other applications to form a much more powerful workflow.</p>
<p>The logistics of deploying such a beast (to say nothing of the chicken/egg problem of getting both user and developer buy-in) are beyond me at the moment, but it's something that I'd very much like to see happen.  Since this post is already getting fairly long, I'll wrap this segment up next week by discussing the mechanics of such a filesystem (were it to exist), and how what we learned from iOS in the last two posts can be applied to it. </p>

View File

@ -0,0 +1,20 @@
---
title: "Filesystems, Application Semantics, and Walled Gardens (Part 4)"
author: Oliver Kennedy
---
<p>For the past month, I've been writing about the similarity between the iOS document model, and that of most modern web applications.  After acknowledging the strengths of the iOS document model, last week I addressed the danger that such a similarity poses for the future of web applications.  </p>
<p>Although we don't want to go wholeheartedly down the iOS route of walled garden filesystems, we can still learn from their efforts and successes.</p>
<p>So... what is there to learn?</p>
<p><span style="text-decoration: underline;">Security</span></p>
<p>Security is perhaps the most difficult to address, so let's start with it.  The strength of the iOS document model lies in getting users to actively indicate that they want to transfer control of a specific document between applications, rather than passively accepting that an application wants to access/change a document.  This is a critical distinction, because it forces the user to make a conscious decision to grant permissions instead of just rubber-stamping a request that pops up.</p>
<p>Can we do something similar for web applications?  Part of the answer to this question depends on how we address the challenge of a web-application filesystem.  There's a huge space of possibilities here, so I'm going to adopt the most general form possible: There are three entities/trust domains in the system: your browser, one or more sites that hold your data, and one or more sites hosting the web applications you use.  They don't have to be separate, but I'll think about them as being so to keep things simple.</p>
<p>Practically speaking, there are two separate levels of authorization to grant: access to read the data, and access to modify the data.  Let's break things down and address each in turn.</p>
<p>Read access is easily the harder of the two to pin down, mostly due to limiting the scope of access.  Let's say I have a web application for email/messaging and a second application with my address book.  Clearly, the email client could make use of the address book data.  The absolute wrong way to do this is for the email/messaging application to put up a dialog box saying "Can I have access to your address book data?"  I'm not just talking about authorization issues here; Even gimmicks like Facebook/Twitter's application authorization tokens essentially boil down to the same thing: "Click button in order to use software".</p>
<p>We need to get users to consciously decide that they want to transfer data between applications.  In its simplest form, this means, from within the address book application (or the filesystem storing the data) clicking on a standardized widget to "Open this data with email client".  </p>
<p>Even this, though, is somewhat awkward.  You don't want to have to do this every time your address book data changes, and you might not want to grant access to your entire address book.</p>
<p>A second option would be to use the drag/drop metaphor.  Dragging contact information from one web application to another is a clear indication that a user wants to transfer access to the contact to the email application.  Still, this is somewhat awkward.  It would be nice to have address book support within the application itself.</p>
<p>HTML/JavaScript provide us with a third option.  JavaScript provides us with a (securable) framework for importing widgets from one codebase into another.  I'm not sure how a secure implementation of this could be properly developed, but you could use JavaScript to modify the email application's text input field to pop up an autocomplete panel.  Selecting an autocompletion would be an explicit choice on the user's part to pass the contact information over to the email client.  Of course, now we've just reversed the problem -- this approach means that you need to find a way to get the user to agree to the address book modifying their email client.  Furthermore, browser security being what it is, you want some way to guarantee that the address book JavaScript code doesn't have access to the application state.</p>
<p>Ok, that was a bunch of blathering about read security.  In most cases, an approach analogous to the traditional filesystem approach suffices: double click a document to open it, or bring up a widget that's part of the filesystem, which grants access to the selected data.  Any application-specific state can be kept separate, and inaccessible to other applications.</p>
<p>So what about writes?</p>
<p>I don't have a particularly good answer here.  The problem is that you don't want one application to do something to your data that breaks another application's functionality.  In part, this can be solved by keeping application-specific metadata separate from common state.  Most likely, the best approach here is to keep state in versioned form, like a distributed revision control system (e.g., Git).  If an application breaks something or deletes something, you give the user the ability to go back and undo some or all of the changes performed by the offending application.  This approach works reasonably well for Wikipedia, which is subject to a similar attacker model.  </p>
<p>And, as expected, I've gone pretty crazy with this discussion of security.  I'll wrap up for real next week with a discussion of the remaining three (smaller) points: Interface Formats, Presentation, and Non-Document Data.</p>

View File

@ -0,0 +1,18 @@
---
title: "Filesystems, Application Semantics, and Walled Gardens (Part 5)"
author: Oliver Kennedy
---
<p>In the past weeks, I've been talking about how a persistence layer for web applications could be developed, informed by both the successes and failures of the iOS walled garden document model.  This week, I'll (hopefully) wrap up with a discussion of how three benefits of iOS can be incorporated into a web application filesystem.</p>
<p><span style="text-decoration: underline;">Interface Formats</span></p>
<p>A significant virtue of the iOS document model is that applications are forced to get their data into standardized formats before exporting them out; Stripping off the application-specific metadata and passing only the stuff that another application is guaranteed to understand.  A similar phenomenon can be found, rooted in the oldschool idea of web mashups.  Each application builds on a standardized data representation (e.g., Google Maps, or Facebook Comments).  The data representation has a standardized interface, and allows users to put their own data on top of it.  </p>
<p>There was once a part of the MacOS (back in the OS 7 days) called OpenDoc.  Although it didn't survive for political reasons, it actually featured some extremely nifty ideas.  The core concept was that of nested document types.  An application would register itself, not as a standalone component of the operating system capable of editing entire files, but rather as having the ability to edit data of a certain type.  For example, you would have a word processing application component.  These application components could be nested -- the word processing document could have graphical data embedded within it, and would allow the user to edit that graphical data through a fully-featured graphical editor.  </p>
<p>Similar ideas have been brought up in the web world.  XML (and to a lesser extent HTML) are perfect examples of this.  XML's nested structure echoes this idea perfectly, and many a web application has been embedded into another through iframes.  </p>
<p>The web is ideal for this sort of application design.  The persistence layer should encourage developers to structure their applications to work with this general layout.</p>
<p><span style="text-decoration: underline;">Presentation</span></p>
<p>Filesystems are tricky.  People have different ways of identifying document data.  Although the filesystem has always forced us to use filenames, this archaic concept is rapidly being replaced in many contexts with internal identifiers that the user never needs to see or interact with.  Different types of data can be presented in different ways.  Short summaries of text (something commercial operating systems have automated the extraction of since the mid-90s) can be useful for paging through textual data.  Thumbnails are excellent for graphics.  Short previews work for video/audio data.  Header comments are useful overviews of code.  Even things like date/time last modified can be useful in the identification of what you're looking for.  In short, the filesystem needs to be able to work with the application in order to better visualize the data contained within.  The OS X Quick Look feature is a great example of this, as each application can provide a plugin that quickly renders preview snapshots of individual files.  </p>
<p><span style="text-decoration: underline;">Non-Document Data</span></p>
<p>Non-document data is, quite frankly, the hardest to work with.  Consider an address book, or a BibTeX bibliography manager.  You might have multiple individual address books, or multiple bibliography documents, but ultimately you want the data in these address books or bibliographies to be linked.  If your friend and coworker changes their phone, you want the phone number updated in both your personal and work contact lists.  If you find a typo in a bibliography entry, you want to fix the typo in all of your manuscripts.  Decentralization is critical.</p>
<p>This in turn makes it hard to share data without running into the security concerns I discussed last week.  Logical groupings of entries are one way to manage access control, and specialized interface widgets are another.  What it also means is that to be efficient, the filesystem has to operate on a sub-document granularity.  For example, the HFS filesystem featured a quaint little idea called a resource fork.  In addition to the normal notion of a file as a sequence of bits (the data fork), each file also had a structured component.  The structured component contained lots of bits of data (resources), each individually addressable.  Moreover, there were standardized ways of accessing these bits of data.  For example, you could have a collection of icons in one file.  The operating system provided primitives that allowed any application to get into that file and access each of those icons as needed.  More recently, OSX's notion of Packages or Bundles achieves a similar end.  Applications are conceptually a single file, but have structured contents in standardized formats such as graphical data or XML/plist.</p>
<p>In short, it is extremely helpful if the filesystem supports (securely) being able to drill down into data from other applications, no matter how it's structured.</p>
<p> </p>
<p>Well, that's it for this thread.  On to new and wonderful things next week.  Happy Holidays, and qoSraj QI'lop jaj ghubDaQ.</p>

View File

@ -0,0 +1,5 @@
---
title: "Happy New Year"
author: Oliver Kennedy
---
<p><a href="http://www.xthemage.net/newyear2013/">http://www.xthemage.net/newyear2013/</a></p>

View File

@ -0,0 +1,11 @@
---
title: "Procedural Stories and Constraint Satisfaction (Part 1)"
author: Oliver Kennedy
---
<p>Ages ago, in the first issue of Dragon Magazine that I had ever read, there was an article on campaign preparation.  For those of you unfamiliar with it, Dungeons and Dragons is a game of cooperative storytelling.  One player, often referred to as the game- or dungeon-master, puts together the outline of a story, and the remaining players take up the role of characters in that story.  This outline, often referred to as a campaign, can also be thought of as the fragment of the story not under the direct control of the normal player's characters (PCs).  </p>
<p>The point that the Dragon article was trying to make was that, like any story, a campaign must be self-consistent, or the players will lose their suspension of disbelief and stop being interested.  Over the course of the game, players will learn facts about the world, and the other characters in it (a.k.a. non-player characters, or NPCs).  For example, players might learn that a certain duke (let's call him Bob) lives in a particular city and hates ferrets.  If a later event calls for some duke to be present half a continent away, in a village known for its ferret breeders, then Bob is probably not the best choice for this particular role.</p>
<p>If you squint hard, the problem of campaign construction begins to look a lot like a big constraint satisfaction problem.  You have certain entities in the world: NPCs, villages, organizations, etc..., and certain relationships between those entities.  Based on these relationships, entities in the world can take actions that change their respective relationships (e.g., one NPC leads an attack on a neighboring town and either succeeds or fails, changing the state of the world in the process).  </p>
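<p>To make the analogy a little more concrete, here's a toy sketch in Python (with made-up NPCs and facts).  Filling a role in the story becomes a matter of finding an entity whose already-established facts don't contradict the role's requirements -- a stand-in for what a real constraint solver would do.</p>
<pre>
# Facts the players have already learned about each NPC.
npcs = {
    "Bob":   {"title": "duke", "home": "Northcity", "hates": {"ferrets"}},
    "Alice": {"title": "duke", "home": "Southvale", "hates": set()},
}

# Requirements for a role the story needs filled.
role = {"title": "duke", "location": "Ferretton"}

# Home towns from which travel to the role's location is plausible.
travel_ok = {"Southvale"}

def consistent(npc, role):
    """Can this NPC fill the role without contradicting known facts?"""
    if npc["title"] != role["title"]:
        return False
    if role["location"] != npc["home"] and npc["home"] not in travel_ok:
        return False
    # A ferret-hating duke in a village of ferret breeders strains belief.
    if role["location"] == "Ferretton" and "ferrets" in npc["hates"]:
        return False
    return True

candidates = [name for name, facts in npcs.items() if consistent(facts, role)]
print(candidates)   # ['Alice'] -- Bob is ruled out by established facts
</pre>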
<p>The tricky bit is that a world is typically far too complex to try and simulate in real time.  I have yet to meet someone who runs a game of D&amp;D by exhaustively deciding explicitly what all of his NPCs are doing at any given moment.  Rather, what I have seen most often is that a game master will put together an outline of a character's motivations, and maybe some general plans.  They might have a timeline (of varying complexity) that says what will happen and when.  However, especially since the story is meant to be driven by the other players, the exact details are never developed until they become relevant to the story.</p>
<p>Think of this as a sort of Schroedinger's story.  The exact nature of the story can be entirely undetermined, until the players start to interact with it.  </p>
<p>This can be a bit dangerous since, like quantum particles, a game master can't afford to get into an inconsistent state (at least not while preserving the players' suspension of disbelief).  In short, although the story is evaluated lazily, the lazy evaluation must avoid leading the story into an inconsistent state.</p>
<p> </p>

View File

@ -0,0 +1,12 @@
---
title: "Procedural Stories and Constraint Satisfaction (Part 2 - Plot twists)"
author: Oliver Kennedy
---
<p>After a one-week absence due to school starting, this week we return with more on procedural story generators for games.  In my last post, I introduced the general idea of procedural story generation as a constraint satisfaction problem.  There, I also introduced the idea of lazy evaluation -- where you generate only the information relevant to the current story.  As we'll see in a moment, this can actually be a lot of information.</p>
<p>Nominally, one would envision interactions with a procedural generator as being entity-driven.  When the player(s) interact(s) with an entity (a town, a character, etc...), information about that entity is generated and extrapolated.  This is fine for a static world, but for the world to be interactive, things need to change.</p>
<p>Furthermore, as with any game, players should be involved in the story.  They have the ability to interact with and affect the story.  However, raw automated AI characters alone are unlikely to generate sufficient drama to keep players engaged.  We want stories to evolve in a way that keeps the players engaged -- generally by providing some direction to the story.  One possible approach to this is to inject certain drama-inducing plot fragments into the mix.  Plot twists, as it were, that spice up the story ("Your friend Bob is actually a foe" kind of things).  </p>
<p>As any writer will tell you, a good plot-twist needs set-up.  It needs backstory.  Why is Bob angry at you?  Bob should have shown signs that, while maybe not immediately apparent, could indicate that he doesn't exactly have your best interests in mind. </p>
<p>This is where the problem comes in.  This backstory (which I'll refer to as the groundwork for a plot twist) has to be injected into the story quite a bit before the actual twist occurs.  In other words, the generator needs to commit to a plot twist well in advance of the plot twist becoming apparent to the characters.  Worse still, a plot twist is generally going to be fairly open-ended, giving the procedural story generator an enormous number of possibilities (many of which could potentially conflict).  </p>
<p>It gets even worse -- due to player actions, or other events within the story, the groundwork for a story might be made entirely irrelevant.  For example, maybe one day when the players bring Bob on a hunt, he's accidentally mauled to death by a bear.  Balancing the groundwork for plot twists in a D&amp;D game is tricky enough for a human to do... I'm not sure if it's even possible for a computer.</p>
<p>Still, it would be interesting to see.</p>
<p>(side note: I've been discussing this idea from the perspective that players interact with the story in a linear manner.  An interesting game mechanic that this sort of procedural story generator would enable is time-travel.  Allow players to provide a basic AI for their characters, and then bop back and forth, adjusting their character's behavior at various points in time to see how it changes the world)</p>

View File

@ -0,0 +1,22 @@
---
title: "Using Constraints to Define \"Correctness\""
author: Oliver Kennedy
---
<p>Data curation is the act of taking data and massaging it into a form that can be analyzed.  There's a common saying among DBAs and librarians that data curation is the biggest time sink of data management.  I can certainly cite a number of examples of this.  My wildlife biologist girlfriend spends almost as much time organizing and inputting data as she does out in the field collecting it, or analyzing it.  The kind folks working on the DBLP do nothing but data curation.  If you squint a bit, data mining can be thought of as a specialized form of data curation, where signals (usable/analyzable data) are extracted from noisy, messy data.  </p>
<p>In short, this is an area that a lot of people spend a lot of time worrying about.  It's also an area on which a lot of people have expended a considerable amount of effort.  </p>
<p>Why is that?</p>
<p>Although data curation is an extremely repetitive task (suggesting that it might be ideal for computers), the kernel of this repetition, the very heart of this task, is something entirely nontrivial: data validation.  </p>
<ul>
<li>Do both of these datasets contain information about John Doe?</li>
<li>Is John Doe the same person as Johnny Doe?</li>
<li>Is "The House at the End of the Row, Birminghamshire, England" a valid address?</li>
<li>How do I deal with John Doe not having a home phone number?</li>
</ul>
<p>When analyzing data, just like writing code, we make certain assumptions about the data.  For example, "This dataset contains one row for each unique individual".  If these assumptions are invalid, then our analysis will be incorrect.  In addition to getting the data into the right, readable format, data curation's primary task is to ensure that the assumptions that analysts make about the data are valid.</p>
<p>Of course, this requires us to explicitly declare these assumptions.  Databases have a mechanism for this, called constraints (e.g., Primary Key constraints, Foreign Key constraints, Validation Triggers, etc...). However, even these are flawed.</p>
<p>Let's take the example I mentioned just now: "This dataset contains one row for each unique individual."  This is a nontrivial example to encode.  How does a database figure out whether two individuals are identical?  "Joe" and "Joey" could be different names for the same person.  Deduplication is something people have studied for a very, very long time, and even now they don't have a particularly good solution that's 100% correct all the time.</p>
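<p>For a flavor of why, here's a crude sketch (Python standard library only) of what a "one row per unique individual" check might look like if we encode "identical" as fuzzy string similarity.  The 0.85 threshold is an arbitrary illustration, and real entity resolution is far more sophisticated -- which is exactly the point: the constraint is easy to state and hard to check.</p>
<pre>
from difflib import SequenceMatcher

def probably_same_person(a, b, threshold=0.85):
    """Very crude stand-in for entity resolution: two names denote the
    same individual if their strings are sufficiently similar."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio() >= threshold

def uniqueness_violations(rows):
    """Check the constraint 'one row per unique individual'."""
    return [(a, b) for i, a in enumerate(rows) for b in rows[i + 1:]
            if probably_same_person(a, b)]

print(uniqueness_violations(["Joe Smith", "Joey Smith", "Jane Doe"]))
# [('Joe Smith', 'Joey Smith')] -- a probable duplicate... but is it really?
</pre>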
<p>Moreover, what happens when the database detects such a violation?  The typical solution is for the database to simply reject any insertion that would cause the constraints to be violated.  </p>
<p>This typically annoys users, who, at least in the short term, just want to load their data into the database.  Consequently, constraints are used quite infrequently, and then, usually only for values that the database itself generates (e.g., entity ids/counters).  Specifying more complex constraints is just out of the question.</p>
<p>Although constraints automate the data validation process, they are insufficient.  There's a clear tension between how tightly we specify these constraints (e.g., an address is always a number, followed by a street, followed by a newline, followed by a city, etc...), and how usable the database is.  Extremely tight constraints are convenient for the analysts and database programmers, since they can make stronger assumptions about the data... but make it difficult to insert data into the database (and hard to handle corner cases, like addresses in rural England).  Weak constraints are the exact opposite.  </p>
<p>Finding the right balance between strong and weak constraints is hard.  It's a large part of data curation.  How much do you automate, and how much do you put on the analysts?</p>
<p>Is there a middle ground?  Are there other ways of creating constraints that are strong enough to satisfy the analysts, but that don't make inserting data into the database a miserable experience?  How do we alert the analysts when a corner case appears in the database that violates their assumptions about the data?</p>

View File

@ -0,0 +1,13 @@
---
title: "The Analizerificationist"
author: Oliver Kennedy
---
<p>There's been a lot of talk lately about "wisdom of the crowd" and "tapping the collective consciousness" and the like, so I figure I might as well weigh in with my 2c, by expanding on an idea that came up recently in a conversation with one of my colleagues, Jan Chomicki, and his student Ying.  (Credit should also go to Dieter Gawlick and Zhen Hua Lu of Oracle, who provided inspiration for this discussion)</p>
<p>Recently, especially in high profile events like the US presidential election, classical political punditry has been getting supplemented (and even in some cases replaced) by data mining algorithms.  Powerful, and often quite accurate algorithms exist to predict anything from elections, to ball games, to the stock market, to what you will be doing next Tuesday evening at 6:41 PM.  </p>
<p>Yet, in spite of the daunting array of algorithmic predictors that exist out there, there's still more to be done.  Data mining is almost more of an art form than a science -- Yes, there are practical, general purpose techniques for finding correlations, outliers, and other interesting features of datasets, but ultimately, you need to know (or at least have a general sense of) what you're looking for.  A lot of the beautiful work in data mining lies in finding clever ways to apply the general techniques to specific datasets.  </p>
<p>So... where does the wisdom of the crowd come in?  Well, let's start with tools like <a href="http://support.google.com/fusiontables/answer/2571232?hl=en">Google Fusion Tables</a>, or <a href="http://pipes.yahoo.com/pipes/">Yahoo Pipes</a>.  Here, we have a pretty nifty mechanisms for doing data extraction, and analysis, even dataset lookup and organization.  Can we do any better?</p>
<p>What's missing from these systems is a way of organizing the derivation process.  So you've created a great visualization, and maybe you've even shared it with your friends.  Now how can we take your efforts and use them to benefit even more people?  </p>
<p>Let's say you have an idea.  You think you know exactly how to predict the next election, but it will require a lot of data.  What do you need to do?  Well, first, you'll have to find and/or extract all that data from content on the internet.  Here, Fusion Tables and Pipes have you covered.  There are some fairly high-quality datasets available, as well as some nifty tools for getting useful data out of the interwebs.  But now that you have it, you'll still need to massage it a bit.  </p>
<p>Fortunately for you, it's quite likely that someone else has had to do data manipulations on similar datasets.  It would be quite useful to have a system that could point you towards such efforts on the part of other people so that you might base your own efforts on theirs.  As an added benefit, it might be possible to piggyback on the computational efforts already expended for the prior attempt(s) at massaging similar datasets.  </p>
<p>Now that the data is in the right form to be analyzed, there's still that pesky analysis to be done.  Here, once again, the system has the potential to help.  What questions have other people asked about similar data?  What kinds of aggregate values might be useful?  What kinds of visualizations might be appropriate?  Are there mash-ups that people have assembled out of similar data (Google Maps being the most general example)?  What even qualifies as "similar" data?</p>
<p>In fact, this works from both directions.  Let's say you know what kind of information you're looking for.  How could you ask the system for strategies that other people have applied to get similar answers?  How would you even indicate what you're looking for to the system in the first place?</p>

View File

@ -0,0 +1,43 @@
---
title: "Semantics as Data"
author: Oliver Kennedy
---
<p>Something I've been getting drawn to more and more is the idea of computation as data.  </p>
<p>This is one of the core precepts in PL and computation: any sort of computation can be encoded as data.  Yet, this doesn't fully capture the essence of what I've been seeing.  Sure you can encode computation as data, but then what do you do with it?  How do you make use of the fact that semantics can be encoded?</p>
<p>Let's take this question from another perspective.  In Databases, we're used to imposing semantics on data.  Data has meaning because we chose to give it meaning.  The number 100,000 is meaningless, until I tell you that it's the average salary of an employee at BigCorporateCo.  Nevertheless, we can still ask questions in the abstract.  Whatever semantics you use, 100,000 &lt; 120,000.  We can create abstractions (query languages) that allow us to ask questions about data, regardless of their semantics.</p>
<p>By comparison, an encoded computation carries its own semantics.  This makes it harder to analyze, as the nature of those semantics is limited only by the type of encoding used to store the computation.  But this doesn't stop us from asking questions about the computation.</p>
<p> </p>
<h3>The Computation's Effects</h3>
<p>The simplest thing we can do is to ask a question about what it will compute.  These questions span the range from the trivial to the typically intractable.  For example, we can ask about…</p>
<ul>
<li>… what the computation will produce given a specific input, or a specific set of inputs.  </li>
<li>… what inputs will produce a given (range of) output(s).  </li>
<li>… whether a particular output is possible.  </li>
<li>… whether two computations are equivalent.</li>
</ul>
<p>One particularly fun example in this space is Oracle's Expression type [1].  An Expression stores (as a datatype) an arbitrary boolean expression with variables.  The result of evaluating this expression on a given valuation of the variables can be injected into the WHERE clause of any SELECT statement.  Notably, Expression objects can be <strong>indexed</strong> based on variable valuations.  Given 3 such expressions: (A = 3), (A = 5), (A = 7), we can build an index to identify which expressions are satisfied for a particular valuation of A.</p>
<p>I find this beyond cool.  Not only can Expression objects themselves be queried, it's actually possible to build index structures to accelerate those queries.</p>
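<p>To make the idea concrete, here's a toy sketch in Python (my own illustration, not Oracle's actual implementation): expressions are stored as plain data, and an inverted index over their equality predicates lets us look up which expressions a given valuation satisfies.</p>
<pre>
# A toy sketch: boolean expressions stored as data, plus an index over their
# equality predicates.  Only single equality predicates are handled here.
expressions = {
    "e1": ("A", "=", 3),
    "e2": ("A", "=", 5),
    "e3": ("A", "=", 7),
}

# Inverted index from (variable, value) to the expressions that binding satisfies.
index = {}
for name, (var, op, val) in expressions.items():
    assert op == "="                  # this sketch only handles equality
    index.setdefault((var, val), set()).add(name)

def satisfied_by(valuation):
    # Return the names of all stored expressions this valuation satisfies.
    matches = set()
    for var, val in valuation.items():
        matches |= index.get((var, val), set())
    return matches

print(satisfied_by({"A": 5}))         # {'e2'}
</pre>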
<p>Those familiar with probabilistic databases will note some convenient parallels between the expression type and Condition Columns used in C-Tables.  Indeed, the concepts are almost identical.  A C-Table encodes the semantics of the queries that went into its construction.  When we compute a confidence in a C-Table (or row), what we're effectively asking about is the fraction of the input space that the C-Table (row) produces an output for.</p>
<p> </p>
<h3>Inter-Computation Relationships</h3>
<p>Another class of questions is how different computations, or computation fragments relate or interact.  For example, we can ask about…</p>
<ul>
<li>… what the algebraic properties of a computation are (e.g., do two computations commute?).</li>
<li>… what the dependencies of a computation are.</li>
<li>… given a sequence of computations, what the information flow graph looks like.</li>
<li>… given a sequence of computations, whether a specific pattern exists, and if so on which computation fragments.</li>
</ul>
<p>This is an area that has not been explored quite as extensively.  Distributed computing has looked long and hard at some of these questions (e.g., when do operations commute?), but almost always in a specific context.  Probably the closest idea, spiritually, appears in systems like Delite [2]. These sorts of compiler generation tools allow users to establish semantic restrictions on a domain-specific language that lead to powerful optimizations.  In a sense, these kinds of queries regarding computation interactions are also a form of optimization... but more general.</p>
<p> </p>
<h3>Combining Computations</h3>
<p>Ultimately, one of the biggest distinctions between computation and normal data is that it's possible to easily combine computations.  Computation representations such as <a href="http://en.wikipedia.org/wiki/Monad">Monads</a> are explicitly designed for this, but even simple iterative programs can still be concatenated.  Computations can be broken apart, stitched together, sliced, diced, and sorted every which way... and the result of each is still more computation.</p>
<p> </p>
<h3>Summary</h3>
<p>Where is this leading?  Nowhere specific.  We have a variety of tools and techniques for expressing computation, and now we need some tricks and techniques for effectively querying them as well.</p>
<p> </p>
<h3>References</h3>
<p>[1] Gawlick, D. et al. Applications for expression data in relational database systems. 609-620.</p>
<p>[2] Chafi, H. et al. 2010. Language virtualization for heterogeneous parallel computing. <em>ACM SIGPLAN …</em>. (2010).</p>


@ -0,0 +1,16 @@
---
title: "Languages with first-class ORM primitives"
author: Oliver Kennedy
---
<p>I was at a seminar on <a href="http://en.wikipedia.org/wiki/Object-relational_mapping">Object-Relational Mappers</a> (ORMs) recently.  The idea behind these is actually quite simple: they're a persistence layer for object-oriented languages.  Through a little bit of glue code injected into the language's runtime engine, and some introspection tricks, object instances are transparently mirrored to a backing store like a database engine.  </p>
<p>Having a database sitting behind an ORM actually provides some nifty functionality.  In particular, you can pose queries over object instances, classes, and so forth.  Often, these queries can be posed in a database-agnostic way (i.e., without using SQL).  </p>
<p>This is quite handy, since it gives object-oriented developers the power and optimization tricks of a declarative query processor.  For example, in an application that manages a school's student population, you might have a relation that represents all of the students, mapped to instances of a "student" object.  The student object exposes functionality that might be performed on a student (e.g., register), and has access to all of the data available about the student.  The developer can pose queries and get back all of the object instances that satisfy some predicate (e.g., all students with a GPA &gt; 3.5).</p>
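<p>For concreteness, here's roughly what that last query might look like in a Python ORM such as SQLAlchemy.  This is only a sketch: the Student class, its fields, and its register method are invented for illustration.</p>
<pre>
from sqlalchemy import Column, Float, Integer, String, create_engine
from sqlalchemy.orm import Session, declarative_base

Base = declarative_base()

class Student(Base):
    __tablename__ = "students"
    id = Column(Integer, primary_key=True)
    name = Column(String)
    gpa = Column(Float)

    def register(self, course):
        # Domain logic lives on the object, right next to the persisted data.
        print(f"registering {self.name} for {course}")

engine = create_engine("sqlite:///:memory:")
Base.metadata.create_all(engine)

with Session(engine) as session:
    session.add_all([Student(name="Alice", gpa=3.9), Student(name="Bob", gpa=3.1)])
    session.commit()

    # A declarative query over object instances -- no hand-written SQL.
    honors = session.query(Student).filter(Student.gpa &gt; 3.5).all()
    for student in honors:
        student.register("Advanced Databases")
</pre>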
<p>That got me thinking.  I've seen this before.</p>
<pre>set myFiles to the documents of the application where the name of the owner is "Oliver"</pre>
<p>This is an example of a language called AppleScript, Apple's answer to shell scripting back in the early 90s.  The language still exists, and is occasionally used for automating tasks on OS X -- most often as a wrapper around shell scripts (if you've ever set up a Raspberry Pi on a Mac, you know what I mean).  </p>
<p>The clever thing about AppleScript is that the language includes first-class query primitives.  A large fragment of the language actually has a direct correspondence to relational algebra.  For example, the above AppleScript code fragment could be rewritten in SQL as:</p>
<pre>CREATE TEMPORARY VIEW myFiles AS</pre>
<pre>  SELECT * FROM application.documents WHERE owner.name = 'Oliver';</pre>
<p>AppleScript is designed to work very closely with applications.  Each application and/or system component provides what's called a "Dictionary" which includes nouns (object classes) and verbs (object methods).  That is, AppleScript allows applications and system components to expose objects via predefined schemas.  These objects can be queried just as easily as a normal language would operate on them.</p>
<p>I'd like to see more such things.  Even now, ORMs feel like they're bolting query operations onto the language, as sort of a hack.  This is true even for ORM-like functionality in DSL-friendly languages like Ruby (e.g. Ruby on Rails).  It seems like this sort of query functionality needs to appear in the language from the ground up -- all the way from the design of the grammar.  </p>
<p>Getting a new language, a sort of successor to AppleScript that supported this kind of functionality, would be awesome, especially if it could tie into an existing language like Java or Python.  Better still if it could tie into ORM functionality, connect to a database, and do all sorts of other tricks like that.  That... would be really cool.</p>


@ -0,0 +1,13 @@
---
title: "Never tell me the odds"
author: Oliver Kennedy
---
<p>A while back, I had a series of articles on probabilistic databases, and shortcomings thereof.  As a quick recap, probabilistic databases are databases that allow you to express data in terms of probability distributions instead of precise values.  Such representations have a number of potential applications, such as developing and analyzing hypothetical "what-if" scenarios, or avoiding information loss due to errors in data (e.g., if the data comes from OCR software).</p>
<p>One of the conclusions that I reached was that people don't like working with probabilities.  Qualitative results are typically more meaningful to an end-user than quantitative ones.  Worse still, unless your data comes from some sort of automated source (like OCR software), how probabilities should be assigned is often unclear.  This is something statisticians get paid big money to do.  Expecting end-users to arbitrarily assign probabilities to data that they're not completely certain about is silly.</p>
<p>So... where does this leave us?  Well, fortunately, a lot of work in the probabilistic database area (especially more recent stuff like [1,2,3]) leaves the exact nature of the underlying probability distributions open to the end-user.  Conceptually, there's nothing to stop us from sticking something more qualitative in its place.  The question is what?</p>
<p>Here's one thought.  Users may not have a good sense of how to assign precise probabilities, but they can certainly tell you whether a data value is definitively correct, or just a guess (maybe even something more, like an "educated guess" or a "possibly incorrect fact", but let's keep things simple).  In fact, you can get lots of users to give you this kind of information -- different users might even have differing guesses or "definitive" values.  When queries are posed on the data, you might get many possible outputs -- different guesses (or definitive values) can each produce a different query output.  Now each output can be annotated with the set of users who support (or contradict) it.</p>
<p>This effectively forms a lattice of outputs, providing at least a partial order over outputs.  We can do things like give a skyline of the most likely answers.  We can use techniques like web of trust to find answers from people a user is likely to support, or use various measurements of past accuracy to identify users who are likely to provide accurate guesses.  If we have a way of validating guesses (e.g., ground truth eventually becomes available), users can also be ranked.  Low performing users might even be identified and contacted with suggestions about how to improve their guesses.</p>
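<p>Here's a minimal sketch (in Python, with made-up values and user names) of what this could look like: each uncertain value carries the set of users who vouch for it, a simple aggregate is evaluated under every combination of guesses, and each output is ranked by how many users fully stand behind it.</p>
<pre>
from itertools import product

# Each uncertain field maps to alternative values, each annotated with the
# users who vouch for it.  A "definitive" value is just one everyone agrees on.
city_population = {
    "Buffalo":   [(261_000, {"alice"}), (258_000, {"bob", "carol"})],
    "Rochester": [(211_000, {"alice", "bob", "carol"})],
}

# Evaluate a simple aggregate (total population) under every combination of guesses.
outputs = []
for combo in product(*city_population.values()):
    total = sum(value for value, _ in combo)
    supporters = set.intersection(*(users for _, users in combo))
    if supporters:                     # keep outputs at least one user fully supports
        outputs.append((total, supporters))

# A crude "skyline": rank outputs by how many users stand behind them.
for total, supporters in sorted(outputs, key=lambda o: -len(o[1])):
    print(total, sorted(supporters))
</pre>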
<p>----------</p>
<p>[1] Green, T.J. et al. 2007. Provenance Semirings. (New York, New York, USA, 2007), 31-40.</p>
<p>[2] Huang, J. et al. 2009. MayBMS: a probabilistic database management system. (New York, New York, USA, 2009), 1071.</p>
<p>[3] Kennedy, O. and Koch, C. 2010. PIP: A database system for great and small expectations. (2010), 157-168.</p>


@ -0,0 +1,19 @@
---
title: "Are you sure?"
author: Oliver Kennedy
---
<p>In the last 5 years or so, we've experienced a dramatic shift in how we interact with computers.  As early as the late 90s, we had fairly reasonable speech-to-text and speech-to-command software.  Now, though, we've seen tools from Yahoo, Microsoft, Google, and most recently (and publicly) Apple's Siri that allow us to make perfectly arbitrary verbal requests of our computers, and have them be answered.  </p>
<p>Still, this interaction remains mostly unidirectional.  The user makes requests; Siri et al. go out, fetch the responses, and present them to the user.  What could we do if we had more, if our computers had the ability to come up to us and ask <span style="text-decoration: underline;">us</span> for information?  For example, I could ask my computer to make me a reservation at a nearby restaurant at around 6, and to invite a few of my friends.  </p>
<p>Granted, there are systems integration issues here -- the restaurant and all of my friends need to be using the same (or at least compatible) scheduling systems.  That's not necessarily out of the question -- CalDAV has evolved as a pretty reasonable scheduling exchange system, and there's room for some upstarts to come in and create a compatibility layer between iCloud, Exchange, Google Calendar, and other related systems (this would be really frigging cool if someone were to do it).  </p>
<p>Let's put that aside for now, and look at the core challenge of answering the question itself.  Scheduling is a huge (worse than NP) problem, largely because it's hard to convey every nuanced detail of a person's preferences and expectations.  There's a degree of uncertainty that comes from everything a person says.  When I ask for a reservation "around 6", it may be reasonable for that reservation to occur at 7.  When I ask for a nearby restaurant, what does that mean?  Walking distance, biking distance, or driving distance?  </p>
<p>How do I specify which friends I'm looking to meet?  Clearly I don't want my computer going through my entire address book.  Once I've specified them, there's uncertainty.  My friends might not be able to make specific times.  "Maybe" has to be a perfectly reasonable answer to the question "Do you want to meet at 6?".  The computer now has to take this into account, creating a set of different possibilities.  Making it even worse, it may well be the case that none of the possibilities fulfill the stated objectives.  If two or three friends have mutually exclusive schedules, one of them will need to be dropped.  Now there are multiple possibilities for how the stated objectives can be relaxed.  </p>
<p>Ultimately, this boils down to three significant problems:</p>
<ul>
<li>When the user asks the computer an open-ended question, how can degrees of freedom in the query be extrapolated?</li>
<li>When the user asks the computer an open-ended question, how can the degrees of freedom be prioritized (i.e., can we extrapolate a cost curve for each degree of freedom)?</li>
<li>When the user asks the computer an open-ended question with no possible answers (or the user asks for more possibilities), how can we infer additional degrees of freedom?</li>
</ul>
<p>The field of preference databases tries to address efficient query processing when there are degrees of freedom like this, but most of this work assumes that a structured query (and cost curve) is (are) already available.  How do we impose this kind of cost model on the query?  How do we infer it from the user's verbal statement?  </p>
<p>Let's take this in another (related) direction.  What happens when the computer needs to know something from you?  Say you're one of the friends and are being asked whether you can make a 6:00 dinner appointment.  Maybe you're interacting with the computer to diagnose an issue (e.g., with your car).  What happens when the computer asks you a question and you don't know what the answer is?  "I don't know" is a reasonable answer.  "I don't know, but I will know in 30 minutes" is another.  There is a range of answers ("maybe", "possibly", "I think so", "I don't think so", etc.) all meaning that there are two possible outcomes.  How can these possibilities be effectively communicated to the originator of the query?  If you're diagnosing car troubles, how does the computer deal with this?  It's a different class of information than "No".  There's some work here for the NLP community -- can we quantify the level of uncertainty associated with a qualitatively uncertain statement of fact?</p>
<p>There's another class of responses to such questions.  A reasonable response that a friend might give is "If I'm still available by 4:00."  In effect, the user has provided an uncertain answer, but one with a specific resolution strategy.  At the current moment, the answer is uncertain, but at 4:00, the database is triggered and springs into action, resolving the uncertainty and creating a new set of constraints. </p>
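<p>As a rough illustration (just a sketch, with invented names and times), an answer like that could be represented as a qualitative belief bundled with its own resolution trigger:</p>
<pre>
from dataclasses import dataclass
from datetime import datetime
from typing import Callable, Optional

@dataclass
class UncertainAnswer:
    question: str
    current_belief: str                    # "maybe", "probably", "yes", ...
    resolve_after: Optional[datetime]      # when the answer becomes definite
    resolve: Optional[Callable[[], str]]   # how to resolve it at that point

answer = UncertainAnswer(
    question="Can you make dinner at 6:00?",
    current_belief="maybe",
    resolve_after=datetime(2013, 6, 1, 16, 0),   # "if I'm still available by 4:00"
    resolve=lambda: "yes",                       # e.g., re-check the friend's calendar
)

def answer_at(a, now):
    # Before the trigger fires, propagate the qualitative belief; afterwards,
    # run the resolution strategy and treat the result as definite.
    if a.resolve_after and a.resolve and now &gt;= a.resolve_after:
        return a.resolve()
    return a.current_belief

print(answer_at(answer, datetime(2013, 6, 1, 12, 0)))   # maybe
print(answer_at(answer, datetime(2013, 6, 1, 16, 30)))  # yes
</pre>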
<p>Anyhow, these are just some random thoughts on a pretty cool problem space.</p>


@ -0,0 +1,15 @@
---
title: "Laasie: Building the next generation of collaborative applications"
author: Oliver Kennedy
---
<p>With the <a href="http://webdb2013.lille.inria.fr/Paper%2034.pdf">first Laasie paper</a> (ever) being presented tomorrow at WebDB (part of SIGMOD), I thought it might be a good idea to explain the hubbub.  What is Laasie?</p>
<p>The short version is that it's an incremental state replication and persistence infrastructure, targeted mostly at web applications.  In particular, we're focusing on a class of collaborative applications, where multiple users interact with the same application state simultaneously.  A commonly known instance of such applications is the Google Docs office suite.  Multiple users can simultaneously view and edit the same document.</p>
<p><strong><span style="text-decoration: underline;">For Developers</span></strong></p>
<p>The goal of Laasie is to provide an infrastructure on which the next generation of collaborative applications can be built.  For developers, this means that the infrastructure should fade into the background.  The entire development process should proceed (almost) as if one were writing a single-site application.  To use the MVC paradigm as a basis, Laasie acts as the M(odel): it persists your data, keeps every client's view of it in sync, and makes sure that clients can revive themselves after the fact.</p>
<p>Not only does Laasie make it easier for you to get your collaborative application off the ground, it also provides a range of useful features.  In addition to some fun access control, sanity checking, and sandboxing capabilities, our eventual goal will be to provide support for distributed Laasie instances.  End users requiring offline support, added privacy, or similar features will be able to instantiate their own Laasie instances, which will "just work" with your application.  </p>
<p><strong><span style="text-decoration: underline;">For Researchers</span></strong></p>
<p>The primary challenge of providing such an infrastructure is the question of how we represent state updates.  The more general you get, the harder it is to be efficient.  </p>
<p>To wit, we could transfer the full state on every single update (this is roughly what Dropbox does).  This is certainly quite general, and allows us to express any sort of state change that we like.  On the other hand, it's a bit hard to implement efficiently.  This is why you don't see many distributed applications that use Dropbox for this purpose (as a shared filesystem perhaps, but not for low-latency sharing).</p>
<p>At the other end of the spectrum, there are a whole range of optimizations you can implement.  Knowing that two operations are commutative (or that there's an applicable operational transform) creates a simpler, leaner, more efficient consistency model.  Being able to subdivide an application's state allows client instances to pull only relevant data, or changes to fragments of the state.  Bulk changes to structured data (numbers, collections, matrices, images) can often be transmitted more efficiently as a description of the change (add 1 to every number in this collection).  You could create an infrastructure that was super-optimized and tailored specifically to your application.  Unfortunately, then you've tied the infrastructure to your application's semantics.  If those semantics change (e.g., you add features), you need to change the infrastructure.  </p>
<p>The core insight of Laasie is that functions (aka procedures, aka monads, etc...) are a way of representing state updates that is both general (not Turing-complete yet, but we're getting there) and still amenable to optimization.  Because the full application semantics are expressed in the update, it is possible to analyze each update, assert properties about updates, and, more generally, restructure and optimize the overall state representation.</p>
<p>More on this next week, when I introduce the Log as a Service state representation.</p>


@ -0,0 +1,25 @@
---
title: "Log as a Service (Part 1 of 2)"
author: Oliver Kennedy
---
<p>Last week I introduced some of the hype behind our new project: Laasie.  This week, let me delve into some of the technical details.  Although, for simplicity, I'll be using the present tense, please keep in mind that what I'm about to describe is work in progress.  We're hard at work implementing these features, and will release when-it's-ready (tm Blizzard Entertainment).  </p>
<p>So, let's get to it.  There are two state abstractions in Laasie: state land, and log land.  I'll address each of these independently.</p>
<h3>State Land</h3>
<p>State land is what application developers interact with directly, and is most easily thought of as a big JSON object.  Those familiar with MongoDB, Pig Latin, or JaQL should feel right at home here.  However, Laasie provides a powerful set of abstractions for developing collaborative web applications.  That is, although it has a RESTful API, Laasie's true power lies in its state replication and programmatic update features.  Let's see what they can do.</p>
<h4>Reads</h4>
<p>In a normal REST API, reads are performed by a client specifying a key (or, equivalently, a path) of interest.  The infrastructure looks up the corresponding value, passes it back to the client, and the interaction is complete.  </p>
<pre>Object read(path)</pre>
<p>Laasie, on the other hand, is designed for state replication.  In the Laasie model, reads operate (conceptually) in three stages:</p>
<ol>
<li>When a client first connects, it requests a session token by providing a path of interest and any relevant authentication tokens (e.g., username/password).  </li>
<li>Using the session token, the client initializes its state.  This is analogous to a RESTful read, except that the requested value is returned along with a state token (effectively a timestamp).</li>
<li>Using its session and state tokens, a client can request an update: a Javascript function that, if executed, will transform one version of the state into the next.  This returns a new state token.</li>
</ol>
<pre>SessionTok createSession(path, client_identity)</pre>
<pre>{Object, StateTok} initSession(SessionTok)</pre>
<pre>{function(x) -&gt; newx, StateTok} updateSession(SessionTok, StateTok)</pre>
<p>The update function is typically going to be smaller than re-reading the entire state from scratch, making this an ideal way to keep clients up-to-date.  Also note that we can use blocking (long-polling) HTTP requests to get push-like behavior out of the pull-style updateSession call, while keeping control over updates in the hands of the client.  This is crucial for disconnect-heavy settings like mobile computing, where browser-based apps are extremely common.</p>
<p>We plan to develop client-side libraries (e.g., in Javascript) to simplify the task of state maintenance.  Such a library will essentially maintain a local copy of the requested object and manage updates. </p>
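<p>To give a flavor of what such a client might look like, here's a rough Python sketch of the three-stage loop.  The endpoint URLs, parameter names, and payload fields are all hypothetical -- they mirror the signatures above rather than any released Laasie API.</p>
<pre>
import requests

BASE = "https://laasie.example.org"       # hypothetical server

def apply_update(state, update_source):
    # Laasie returns a function (e.g., Javascript source) that maps the old
    # state to the new one; a real client library would evaluate it.  Here we
    # leave the state unchanged as a placeholder.
    return state

# 1. createSession: declare a path of interest and authenticate.
session_tok = requests.post(
    f"{BASE}/session", json={"path": "/todo_list", "user": "oliver"}
).json()["session"]

# 2. initSession: read the initial state along with a state token.
init = requests.get(f"{BASE}/state", params={"session": session_tok}).json()
state, state_tok = init["object"], init["state_token"]

# 3. updateSession: long-poll for update functions and apply them locally.
#    (Error and timeout handling omitted for brevity.)
while True:
    resp = requests.get(
        f"{BASE}/update",
        params={"session": session_tok, "state": state_tok},
        timeout=60,
    ).json()
    state = apply_update(state, resp["update"])
    state_tok = resp["state_token"]
</pre>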
<h4>Writes</h4>
<p>Like reads, Laasie exposes a more powerful write API.  Laasie allows developers to express updates as functions.  Although we expect many of these functions to be simple (overwrite value X, add 2 to value Y), the API is actually quite powerful, and we plan to add more features, domain-specific extensions, and DSLs over time.  The full extent of this language is more than I want to get into in this post, but if you're familiar with Pig Latin or JaQL, you should feel right at home.  </p>
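<p>As a purely illustrative sketch of the flavor, here are a couple of updates expressed as functions over a JSON-like state.  The real update language is its own DSL, closer to Pig Latin or JaQL; Python lambdas are only a stand-in here.</p>
<pre>
# Each update maps the old state to a new one, rather than overwriting a value.
updates = [
    lambda s: {**s, "completed_count": s["completed_count"] + 2},   # add 2 to a counter
    lambda s: {**s, "items": s["items"] + ["write blog post"]},     # append an item
]

state = {"completed_count": 3, "items": ["file taxes"]}
for update in updates:
    state = update(state)
print(state)   # {'completed_count': 5, 'items': ['file taxes', 'write blog post']}
</pre>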
<p>An important feature of Laasie is that these update functions are transmitted and stored as-is in Laasie (Laasie doesn't typically evaluate them).  Instead, Laasie uses the update's semantics to identify and evaluate potential optimizations.  Next week, I'll get into how Laasie does this, and show how infrastructure managers can use Laasie's second abstraction: log land, to extend Laasie's optimization capabilities with application-specific optimizations.</p>


@ -0,0 +1,13 @@
---
title: "SIGMOD Wrapup"
author: Oliver Kennedy
---
<p>This year's SIGMOD/PODS was quite exciting.  With over 800 students, researchers, and members of industry in attendance, the DB community is more vibrant than ever.</p>
<p>The highlight for me was a new event at this year's PODS, a panel discussion on <a href="http://www.sigmod.org/2013/ctcbd.shtml">future trends in Database research</a>.  Many of the speakers discussed specialized forms of data processing where creative ideas were needed: a particularly impassioned plea came from Andrew McCallum, who argued for a tighter coupling between the database, machine learning, and data mining communities.  This sentiment was echoed by a number of the panelists, who suggested that database researchers had dropped the ball on the challenge of "Big Data", allowing it to be defined almost exclusively in terms of data-mining and systems challenges.  Social Graph Databases, Astronomy (e.g., Skyserver), and similar projects were put forth as areas where peta-scale (or larger) query processing is critical.  </p>
<p>Joe Hellerstein made some interesting points that I saw echoed throughout the remainder of the conference: he mentioned the almost obvious parallel between communication and storage, namely that storage is a form of messaging to the future.  The primary distinction lies in who is responsible for what -- in storage, the sender is responsible for doing the work to put the message/signal someplace where the recipient can easily retrieve it.  Conversely, in communication, the recipient is responsible for listening and waiting for the message/signal to show itself.  Parallels exist throughout the DB community, query processing vs. stream processing being the obvious example.  I saw this sentiment echoed throughout the conference, as papers like the <a href="https://amplab.cs.berkeley.edu/wp-content/uploads/2013/04/PIQLSigmod2013.pdf">latest PIQL offering</a> suggested the need to revisit the tradeoffs between pre-computation and online query processing.  </p>
<p>A third theme that arose both at the panel discussion and throughout the conference was consistency.  Between Joe's <a href="http://www.deeprecursion.com/file/2011/10/7552941-cidr11-bloom.pdf">CALM conjecture</a>, an excellent tutorial by Phil Bernstein and Sudipto Das, and other chatter throughout the conference, it seems clear that consistency and the CAP theorem are once again rearing their ugly heads.  The key takeaway from all of this seems to be that consistency requirements vary between applications, and that the underlying platform needs to establish a clear, understandable contract with the programmer about what "consistency" means.  Through DSLs and other platforms, we are once again talking about how to figure out what kind of consistency an application requires.</p>
<p><img title="das-consistency.jpg" src="http://www.xthemage.net/blog/wp-content/uploads/das-consistency.jpg" alt="Das consistency" width="600" height="463" border="0" /></p>
<p>Hardware continues to be a growing trend, and over the past few years, I've been seeing a shift towards (Eric Sedlar's prediction of) specialized hardware for databases.  An interesting point in this space is a <a href="http://infoscience.epfl.ch/record/186376">measurement paper out of EPFL</a> where it is observed that instruction cache misses are a major bottleneck in query processing.  Pinar's suggested solution to this is that we devote individual cores to specific tasks that fit entirely into an instruction cache.</p>
<p>I've been seeing a lot more effort on crowdsourcing.  In particular, the field seems to be shifting towards more specialized forms of crowdsourcing -- focusing the crowdsourcing efforts on domain specialists and data mining the results of such queries.  One paper on <a href="http://www.math.tau.ac.il/~milo/projects/modas/papers/sigmod13.pdf">crowdmining</a> discussed efforts to infer causal connections and trends in data by querying users for instantiations of these trends.</p>
<p>And that's all...  pretty jazzy if I do say so.</p>
<p> <img title="NewImage.png" src="http://www.xthemage.net/blog/wp-content/uploads/NewImage.png" alt="NewImage" width="600" height="450" border="0" /></p>


@ -0,0 +1,14 @@
---
title: "Expressiveness vs Efficiency"
author: Oliver Kennedy
---
<p>There's an odd dichotomy that hit me recently.  On the one hand, there's been a big recent push towards DSLs, or domain specific languages.  Examples include Bloom (distributed computation for monotonic programs), the many DSLs implemented in Delite (which include things like matrix computations, ML algorithms, etc…), GraphLab, GraphChi, and so forth.</p>
<p>On the other hand, people continue to want more expressive languages.  We keep adding more features to things like SQL (which has been Turing-complete for the last few years).  I understand this drive.  We want to be able to efficiently capture more ideas.  This idea of abstracting concepts is what computer science is all about.</p>
<p>As I was discussing the design of an indexing data-structure with one of my students the other day, the weight of this dichotomy really hit me.  We were discussing building more and more corner cases into the data-structure (or rather, into the objects that we were indexing).  This struck me as a bad idea, since I really hate corner cases.  On the other hand, a critical feature of the indexing data-structure was the ability to perform set-containment on the objects we were indexing.  </p>
<p>As many of you know, if you allow a set description language to get too complex, set containment can easily work its way into NP-hardness or even undecidability.  So there it was: a conundrum.  On the one hand, a complex language would give us more flexibility, and on the other, if we made it too complex, using the indexing structure would cost more time than it saved.</p>
<p>That got me thinking.  Many problems that are intractable for a Turing-complete language become feasible on certain well-defined subsets of the language.  In fact, they may even be tractable on multiple well-defined subsets, potentially multiple non-overlapping subsets.</p>
<p>And that's where DSLs come in.  A DSL allows you to specify a restricted form of a language that's far more amenable to optimization, analysis, and other useful features than a fully general language like C, Java, Python or Ruby.  </p>
<p>Often, the DSL doesn't even need to live outside the confines of a general language.  Bloom has a Ruby-based implementation that exposes the full (Turing-complete) power of Ruby for those program fragments that can't be easily expressed in Bloom's framework.  Scala has a SQL compatibility layer that transforms a specific fragment of Scala into equivalent relational operators (similar to .NET's LINQ, but more tightly coupled with the language).  </p>
<p>This… this is super cool, because it suggests that different DSLs can live and cooperate in the same language (you see some of this in the Delite framework already).  It also suggests that certain fragments of the language might translate naturally into a corresponding DSL's infrastructure.  Why is this cool?  Because it means you might be able to get the best of both worlds -- expressiveness and efficiency.  </p>
<p>Imagine a language that could automatically analyze your program to identify the specific language fragment best suited to encoding it.  Although there might be some cost-estimation factors to help decide between multiple different language fragments, this actually seems like it might be doable with pure static analysis.  Such an analysis tool might also be able to identify trouble spots in your program -- point to specific operations that prevent it from descending into a specific program fragment.</p>
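<p>As a toy illustration of the kind of static analysis I have in mind, here's a Python sketch (entirely a strawman of my own) that walks a function's syntax tree and reports whether it stays inside a restricted, query-friendly fragment or uses constructs that fall outside it:</p>
<pre>
import ast
import inspect

# Constructs that push a function out of the restricted fragment in this sketch.
DISALLOWED = (ast.While, ast.Try, ast.Global, ast.Nonlocal)

def fits_restricted_fragment(fn):
    tree = ast.parse(inspect.getsource(fn))
    return not any(isinstance(node, DISALLOWED) for node in ast.walk(tree))

def candidates(students):
    # Pure filtering and projection: a natural fit for a query-style DSL.
    return [s["name"] for s in students if s["gpa"] &gt; 3.5]

def collatz_steps(n):
    # Unbounded iteration: outside the restricted fragment.
    steps = 0
    while n != 1:
        n = 3 * n + 1 if n % 2 else n // 2
        steps += 1
    return steps

print(fits_restricted_fragment(candidates))     # True
print(fits_restricted_fragment(collatz_steps))  # False
</pre>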
<p>Just a thought.</p>


@ -0,0 +1,11 @@
---
title: "Log as a Service (Part 2 of 2)"
author: Oliver Kennedy
---
<p>A few weeks ago, I started introducing Laasie, our new system for building powerful collaborative web applications.  We introduced the primitive interface for managing state -- state land.  This week, I'm going to provide a quick introduction to Laasie's more powerful abstraction for manipulating state -- log land.  </p>
<h3>Log Land</h3>
<p>Laasie represents application state not just in terms of its precise value at any given point in time, but also as a DAG of state changes.  Any DAG can be resolved (reified) into a concrete state by treating the DAG as a partial order, and evaluating the state updates in a compatible total order, starting with an initial "empty" state.  This particular representation of application state is quite powerful, as it allows us to access the full history of the application's state. We can track who made a change, when it was made, and what it depends on.  </p>
<p>More precisely, every time an update is written to Laasie, a new log entry is recorded for it.  Laasie then creates a set of pointers from the log entry to all log entries on which it depends, and can establish additional pointers if necessary (e.g., to an undo record, or to the last log entry for the value being modified).  The resulting set of log entries and pointers forms the Log DAG.</p>
<p>So what does the Log DAG buy us?  Well, it's a more powerful way of doing log analysis.  Typical graph properties such as (conditioned) reachability, isomorphism, cyclicity, and min-cut arise quite frequently when discussing optimal management of application state.  Using this simple abstraction allows us to create a single data management system capable of encoding a broad range of application-specific optimizations.</p>
<p>We're currently exploring analysis in log land using SPARQL.  It turns out that a surprising number of properties can be mapped directly into SPARQL with only a very small BarQL equivalent.  These include properties like reachability from the root (i.e., for garbage collection), commutativity, and recoverability (i.e., for operations like merges).  </p>
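<p>To make one of those analyses concrete, here's a small Python sketch (with a made-up log) of reachability from the current heads of the Log DAG -- the log-land version of garbage collection, where any entry the current state no longer depends on can be discarded:</p>
<pre>
# Each log entry points at the entries it depends on.
log_dag = {
    "root":        [],
    "set_title":   ["root"],
    "set_title_2": ["set_title"],      # overwrites set_title
    "add_row_1":   ["root"],
    "undo_row_1":  ["add_row_1"],      # cancels add_row_1
}

# Entries that the current, reified state actually depends on.  The add/undo
# pair above cancels out, so nothing current points at it anymore.
current_heads = ["set_title_2"]

def reachable(heads, dag):
    seen, stack = set(), list(heads)
    while stack:
        node = stack.pop()
        if node not in seen:
            seen.add(node)
            stack.extend(dag[node])
    return seen

live = reachable(current_heads, log_dag)
print("live:       ", sorted(live))                  # root, set_title, set_title_2
print("collectable:", sorted(set(log_dag) - live))   # add_row_1, undo_row_1
</pre>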
<p>We're on the verge of releasing an analysis tool for Laasie-generated logs, the first step towards both online and offline optimization of application state in Laasie.</p>


@ -0,0 +1,21 @@
---
title: "Why text editors are bad for programmers."
author: Oliver Kennedy
---
<p>Let me start with a bit of a history lesson.  For over a decade now, we've known a particularly annoying quirk of Moore's law.  That's the "law" that says that the <strong><em>number of transistors</em></strong> on a chip doubles roughly every one and a half years.  A lot of people, however, interpreted Moore's law as meaning that the <strong><em>speed</em></strong> of processors would double at that rate.  For a while, that was indeed the case.</p>
<p>Then, somewhere around 2005 or so, we hit a roadblock.  The standard bag of tricks for converting more transistors to more speed (e.g., deeper pipelines, redundancy for overclocking) started to run dry.  Mind you, we were still getting more and more transistors on the die, but we couldn't use those to make things faster.</p>
<p>Intel et al. had more transistors.  Since they couldn't make them do things faster, they made the transistors do more.  Enter multicore.  </p>
<p>This scared a lot of people.  After all, a lot of people had been banking on the false interpretation of Moore's law: if it runs slow now, it'll run just fine in 2-4 years.  With CPU designers shifting their emphasis to multicore, getting that kind of speedup meant reorganizing your code to run in parallel.  The natural speedups of yesteryear were no longer "free", and the research community shifted to ways of exploiting natural parallelism in user code.</p>
<p>And that brings us to the present, as well as my thought of the week.  <strong><em>Text editors are fundamentally bad for programming.</em></strong>  </p>
<p>I know that sounds a bit radical, but hear me out.  The fundamental data representation of a text editor is serial: A string of instructions.  For a serial program, this is perfect.  The order is there, and the computer knows exactly how to execute these instructions serially.  A text editor encourages people to think serially about their code.  For parallel programs, however, this is a horrible idea.</p>
<p>What we as the research and software development communities need to explore are non-linear approaches to representing code. Graph-based data-flow diagrams are a start.  For example, one (admittedly crude from a full development standpoint) nonlinear programming environment is Yahoo Pipes.  </p>
<p>Nonlinear features can be increasingly found in IDEs and programming models as well: Eclipse's "Go to {Definition, Call Sites, …}" features (now present in nearly every other major IDE) are canonical examples of this, making it easier to mentally trace nonlinear code execution paths. Models like MapReduce compartmentalize parallel computations (each Mapper/Reducer is a separate class), forcing a developer to consider them as individual components of a bigger program.  </p>
<p>Now, that level of serial thinking is still necessary.  CPUs still operate one instruction at a time, but can we do better?  Can we create a programming environment that actively encourages users to compartmentalize their computations?</p>
<p>Consider the following simple program</p>
<pre>A = input.right;</pre>
<pre>foreach(i in input.left){ B += i.left; C += (i.right &gt; 0?i.right:i.left) }</pre>
<pre>B += input.right;</pre>
<p>And compare to the following structure: </p>
<p><img src="http://www.xthemage.net/blog/wp-content/uploads/Dag1-300x191.png" alt="Dag.png" width="300" height="191" class="alignnone size-medium wp-image-203" /></p>
<p>The parallelism is inherently visible, and easy to follow -- even if the rest of the graph may not be.</p>
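<p>One way to see this (a sketch of my own, not a proposal for concrete syntax) is to write the same tiny program as an explicit dataflow graph in Python, where each node names its inputs; the fact that A, B, and C never feed into one another is then visible in the structure itself:</p>
<pre>
# The same program, as a graph of nodes annotated with their inputs.
dataflow = {
    "A": {"inputs": ["input.right"],
          "op": lambda right: right},
    "B": {"inputs": ["input.left", "input.right"],
          "op": lambda left, right: sum(i["left"] for i in left) + right},
    "C": {"inputs": ["input.left"],
          "op": lambda left: sum(i["right"] if i["right"] &gt; 0 else i["left"]
                                 for i in left)},
}

example_input = {"input.left": [{"left": 1, "right": -2}, {"left": 3, "right": 4}],
                 "input.right": 10}

# A, B, and C share no edges among themselves, so they could run in parallel.
results = {name: node["op"](*(example_input[src] for src in node["inputs"]))
           for name, node in dataflow.items()}
print(results)   # {'A': 10, 'B': 14, 'C': 5}
</pre>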
<p>How can we replace the text editor?  How can we come up with better ways to represent data flow in a computation?  Can we take cues from programming environments like Alice or Apple's Automator?  Can we create a non-linear text editor… something that inherently displays the branching structure of a program?</p>


@ -0,0 +1,12 @@
---
title: "Finding truth in the bits"
author: Oliver Kennedy
---
<p>What is truth, and what is data?</p>
<p>At the very least, they're different.  Ask any scientist, and they'll caution you about conflating the two: Data includes measurements and observations, mere points and samples of the whole of the universe.  </p>
<p>This may seem a bit philosophical, but my point is that, while there is often a strong correlation between data and truth, the two are distinct.  Even in the best case, when working with data of perfect quality, it represents only a subset of a bigger picture.  And data is very infrequently of perfect quality.  Substantial massaging is often required to get data into a standardized form for analysis.  As data is being massaged, assumptions are often made about the data: floats are cast to integers, comment fields are dropped or ignored, extenuating circumstances or outliers are rolled into the core data.  The data cleaner's assumptions are being applied to interpret the data.</p>
<p>That's not to say that these transformations are bad.  Substantial, careful effort goes into data cleaning.  But when you run a query on the database, it's important to realize that what you're getting back is data, and not truth.</p>
<p>It might be nice to have a database that acknowledges this distinction.  </p>
<p>What would such a database look like?</p>
<p>I envision a database with two (or more) layers, each layer providing a view over the layer below it.  The bottom layer would consist of the base data, intact, unchanged, and as-gathered.  The uppermost layer would represent "truth".  The base data is completely deterministic; We know these values precisely, but the values themselves may be wrong or not representative.  As we travel up the levels, we get to progressively lower levels of determinism.  Queries run on the higher levels are guaranteed to provide "true" results, but may emit annotated results, ranges of possible results, probability distributions, or simply say "I don't know."  </p>
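<p>As a very rough sketch (purely illustrative, with made-up data) of how such a layered database might answer a query: the base layer keeps the raw observations untouched, while the "truth" layer hedges its answer whenever cleaning had to discard something.</p>
<pre>
# Base layer: the data exactly as gathered, including a value we can't interpret.
base_layer = {
    "salary": [98_000, 101_500, "N/A", 120_000],
}

def truth_layer_average(field):
    raw = base_layer[field]
    usable = [v for v in raw if isinstance(v, (int, float))]
    answer = {"estimate": sum(usable) / len(usable), "caveat": None}
    if len(usable) &lt; len(raw):
        # Something was dropped on the way up: say so, instead of pretending
        # the estimate is exact.
        answer["caveat"] = f"{len(raw) - len(usable)} value(s) could not be interpreted"
    return answer

print(truth_layer_average("salary"))
</pre>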
<p>The crucial challenge then, is how do we make such a database usable?  How can this process be integrated into a normal data cleaning workflow with minimal changes and/or overheads?</p>


@ -0,0 +1,14 @@
---
title: "Gathering Data, Interactive Programming, and Analysis"
author: Oliver Kennedy
---
<p>Data exploration is an interactive process. Let's say I have a dataset… I want to ask questions about it.  Often though, I'm not going to have a precise idea of what questions I want to ask, even if I do have a vague sense of them.  I want to be able to explore the data.</p>
<p>So what's standing in the way of me doing that?</p>
<p><strong>Gathering the data:</strong> It's possible that the data is not immediately available and needs to be gathered.  Even if I know what I'm looking for, I might not immediately have access to the data that I'm looking for.  Before anything, I need to find the data that I'm interested in, and (if necessary) transport it to somewhere that allows me to compute over it.</p>
<p><strong>Structuring the data</strong>: Data pulled from the outside world needs to be put into a structured form before any sort of automated analysis.  This may be as simple as parsing (e.g., a CSV file), or more complex: I might be able to extract all manner of features from a log file, for example.  I might split based on records, based on lines, or even based on sets of records.  I might be interested in writing a parser that pulls out certain features from the log entry -- the timestamp, the message, or the component causing the alert.  This is a bit of an ad-hoc process -- I may only be interested in specific patterns and subsets of the data now, but that might change as I explore more of the data.</p>
<p><strong>Cleaning the data:</strong> Even after I've imposed some structure on the data, there's no guarantee that the data is 'correct'.  Strange entries, outliers, and missing or corrupted data will make any results I obtain useless.  At this stage, one typically goes through a set of sanity checks, examining schema warnings from the previous stage, asserting constraints like key dependencies, and validating against secondary data sources.  I may also want to apply my domain knowledge; Past experiences may have given me a sense of what could go wrong with my data collection process.</p>
<p><strong>Query processing:</strong> Finally, I'm ready to actually manipulate the data.  This means transforming the data into a form that matches what I need -- merging datasets, rotating/pivoting the data, and/or filtering down to the entries of interest, for example.</p>
<p><strong>Visualization:</strong> A step in the process that's often associated with this last query processing stage is summary and visualization; Obtaining aggregates, samples, and/or graphical representations of the data is a crucial part of the entire analytical development process.  (1) As I'm gathering the data, I need to be able to see bits and pieces of it so that I can be sure that it's what I'm looking for, (2) As  I'm structuring the data, I want to make sure that my regular expression and/or parsing scheme is correct, (3) As I'm cleaning the data, I want to see/visualize outliers, and (4) obviously, I want to see the results.</p>
<p> </p>
<p>Really, each of these aspects of analysis is interrelated.  One bounces back and forth between different stages, gathering more data, parsing out more fields, cleaning, etc… A strong analytical pipeline relies on being able to see the data quickly, see results even if they're only estimates, and then go back to iterate on your analysis.  </p>
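<p>As a toy example of one pass through those stages, here's a quick Python/pandas sketch.  The file name, columns, and cleaning rules are all invented for illustration -- the point is just how tightly the stages interleave with actually looking at the data.</p>
<pre>
import pandas as pd

# Gather + structure: parse a (hypothetical) CSV log into a table.
df = pd.read_csv("server_log.csv", parse_dates=["timestamp"])

# Clean: sanity-check the schema and drop obviously broken rows.
df = df.dropna(subset=["timestamp", "component"])
df = df[df["latency_ms"].between(0, 60_000)]       # discard impossible latencies

# Query: which components are slowest, hour by hour?
summary = (df.assign(hour=df["timestamp"].dt.floor("h"))
             .groupby(["hour", "component"])["latency_ms"]
             .quantile(0.95)
             .reset_index())

# Visualize: a quick look is often enough to decide what to refine next.
summary.pivot(index="hour", columns="component", values="latency_ms").plot()
</pre>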
<p>How do we achieve this?  What kinds of interfaces can we build to improve feedback and to anticipate the user's needs?  What infrastructures are needed to support this kind of anticipatory computation?</p>