<h4>(NoSQL, but with SQL)</h4>
<h3>First a little history</h3>
<div class="fragment">
<dt>Early-Mid 1900s</dt>
<dd>Computers used for tabulating data</dd>
<div class="fragment">
<dd>Relational model, Postgres, System-R, Oracle, DB2</dd>
<div class="fragment">
<dd>Lotus, dBase</dd>
<div class="fragment">
<dd>Object/Object-Relational Databases, Distributed Databases</dd>
<div class="fragment">
<dd>The Dark Ages...</dd>
<p><b>Google: </b> Databases suck! Use Map/Reduce instead.</p>
<img src="graphics/mapreduce.png" height="500px">
<p><b>Yahoo: </b> Our Map/Reduce implementation is open source</p>
<img src="graphics/hadoop.png" height="500px">
<h3>The Good</h3>
<li class="fragment">Programmer-Friendly Language</li>
<li class="fragment">Distributed-Computing-Friendly Metaphors</li>
<li class="fragment">Extremely Resilient Runtime</li>
<h3>The Bad</h3>
<li class="fragment"><strike>Programmer-Friendly</strike> Non-Declarative Language</li>
<li class="fragment"><strike>Distributed-Computing-Friendly</strike> Programmer-Hostile Metaphors</li>
<li class="fragment">Extremely <strike>Resilient</strike> Slow Runtime</li>
<img src="graphics/hadoopVSdbs.svg">
<img src="graphics/spark.png" height="400px">
<h3>Key Features</h3>
<li>High-performance resilience.</li>
<li>Use of metaphors to extract parallelism.</li>
<li>Lots of metaphors for distributed programming.</li>
<li>If you can do it in { Scala, Python, Java, R }, you can do it in Spark.</li>
<li>If you know SQL and { Scala, Python, Java, R }, you know Spark.</li>
<svg data-src="graphics/sparkstack.svg" height="600px">
<h3>Resilient Distributed Datasets (RDDs)</h3>
<dl style="font-size: 75%">
<div class="fragment">
<dd>You can't insert, update, or modify rows...</dd>
<div class="fragment">
<dd>... but you can create (cheaply) new RDDs by modifying existing RDDs.</dd>
<div class="fragment">
<dd>Spark just sees a bunch of rows. It doesn't know how to interpret them.</dd>
<div class="fragment">
<dd>Spark saves <b>how</b> to construct an RDD, but waits to actually do so.</dd>
<div class="fragment">
<dd>When Spark constructs an RDD, it automatically assigns rows to workers.</dd>
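<p>The immutability and lazy-evaluation properties above can be sketched in plain Python. This is not Spark code: <code>LazyRows</code> is a hypothetical stand-in for an RDD that stores only the recipe for building its rows and computes nothing until <code>collect()</code> is called. It ignores distribution across workers entirely.</p>

```python
from typing import Callable, Iterator


class LazyRows:
    """Hypothetical stand-in for an RDD: stores how to build rows, not the rows."""

    def __init__(self, build: Callable[[], Iterator]):
        self._build = build  # a recipe, not data

    @staticmethod
    def from_list(data: list) -> "LazyRows":
        snapshot = tuple(data)  # copy, so later edits to `data` cannot leak in
        return LazyRows(lambda: iter(snapshot))

    def map(self, f: Callable) -> "LazyRows":
        # Returns a new object; the existing one is never modified.
        return LazyRows(lambda: (f(row) for row in self._build()))

    def filter(self, keep: Callable) -> "LazyRows":
        return LazyRows(lambda: (row for row in self._build() if keep(row)))

    def collect(self) -> list:
        # Only here are any rows actually computed.
        return list(self._build())


calls = []


def double(x):
    calls.append(x)  # record every time the function really runs
    return 2 * x


base = LazyRows.from_list([1, 2, 3, 4])
doubled = base.map(double)             # nothing is computed yet
big = doubled.filter(lambda x: x > 4)  # still nothing

calls_before = list(calls)  # snapshot taken before any work is requested
result = big.collect()      # the whole chain runs now
```

<p>Deriving <code>doubled</code> and <code>big</code> is cheap because each one only wraps the previous recipe; <code>base</code> still yields its original rows afterwards.</p>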
<h3>Where do RDDs come from?</h3>
<li>Call "parallelize" on a { Scala, Python, Java, R } array/collection.</li>
<li>Load a text file from disk or HDFS (1 row per line).</li>
<li>Load a database table (1 row per row).</li>
<li>Transform (map, flatMap, filter) an existing RDD.</li>
<p><b>flatMap:</b> a function that reads in one row and returns any number of rows (including none).</p>
<p><b>map:</b> a function that reads in one row and returns exactly one row.</p>
<p><b>filter:</b> a function that reads in one row and returns true (keep) or false (toss).</p>
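<p>A minimal sketch of the three transformations, written over ordinary Python lists so that it runs without a Spark installation. The helper names <code>rdd_map</code>, <code>rdd_flat_map</code>, and <code>rdd_filter</code> are hypothetical; they only mirror the row-in, rows-out contract described above and are not Spark's API.</p>

```python
from typing import Callable, Iterable, List, TypeVar

T = TypeVar("T")
U = TypeVar("U")


def rdd_map(rows: Iterable[T], f: Callable[[T], U]) -> List[U]:
    # map: exactly one output row per input row
    return [f(row) for row in rows]


def rdd_flat_map(rows: Iterable[T], f: Callable[[T], Iterable[U]]) -> List[U]:
    # flatMap: f returns zero or more rows per input row; results are concatenated
    return [out for row in rows for out in f(row)]


def rdd_filter(rows: Iterable[T], keep: Callable[[T], bool]) -> List[T]:
    # filter: keep a row only when the predicate returns True
    return [row for row in rows if keep(row)]


lines = ["to be or", "not to be"]  # one row per line of text
words = rdd_flat_map(lines, lambda line: line.split())  # one row per word
lengths = rdd_map(words, len)                           # one length per word
short = rdd_filter(words, lambda w: len(w) == 2)        # drops "not"
```

<p>Each call returns a new list and leaves its input untouched, which matches the rule that an RDD is derived from an existing one and never edited in place.</p>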
<h3>Resilient Distributed Datasets (RDDs)</h3>
<dl style="font-size: 75%">
<dd>You can't insert, update, or modify rows...</dd>
<dd>... but you can create (cheaply) new RDDs by modifying existing RDDs.</dd>
<div class="fragment highlight-blue">
<dd>Spark just sees a bunch of rows. It doesn't know how to interpret them.</dd>
<dd>Spark saves <b>how</b> to construct an RDD, but waits to actually do so.</dd>
<dd>When Spark constructs an RDD, it automatically assigns rows to workers.</dd>
<p>RDDs with schemas: every row has a set of attributes, and all rows share the same attributes.</p>
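<p>As a rough illustration of what a schema adds, the plain-Python sketch below gives every row the same named attributes. The <code>Person</code> type and its data are made up for this example and do not come from Spark.</p>

```python
from typing import NamedTuple


class Person(NamedTuple):
    # Hypothetical schema: every row carries exactly these two attributes.
    name: str
    age: int


rows = [Person("Ada", 36), Person("Alan", 41)]

# With a schema, the attribute names are known for every row...
attributes = {row._fields for row in rows}

# ...so rows can be interpreted by attribute name instead of treated as opaque blobs.
names = [row.name for row in rows]
```

<p>Without the schema, the same data would be just "a bunch of rows" whose contents only the programmer's own functions know how to interpret.</p>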
