Spark

(NoSQL, but with SQL)

First a little history

Early-Mid 1900s
Computers used for tabulating data
1970s
Relational model, Postgres, System-R, Oracle, DB2
1980s
Lotus, dBase
1990s
Object/Object-Relational Databases, Distributed Databases
2000s
The Dark Ages...

Google: Databases suck! Use Map/Reduce Instead

Yahoo: Our Map/Reduce implementation is open source

The Good

  • Programmer-Friendly Language
  • Distributed-Computing-Friendly Metaphors
  • Extremely Resilient Runtime

The Bad

  • Programmer-Friendly, but Non-Declarative Language
  • Distributed-Computing-Friendly, but Programmer-Hostile Metaphors
  • Extremely Resilient, but Slow Runtime

Key Features

  • High-performance resilience.
  • Use of metaphors to extract parallelism.
  • Lots of metaphors for distributed programming.
  • If you can do it in { Scala, Python, Java, R }, you can do it in Spark.
  • If you know SQL and { Scala, Python, Java, R }, you know Spark.
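For instance, a minimal sketch of the SQL side in Scala (assuming Spark 2.x or later; the "people.json" file is hypothetical):

    import org.apache.spark.sql.SparkSession

    // Build a local session for illustration.
    val spark = SparkSession.builder()
      .appName("SqlDemo")
      .master("local[*]")
      .getOrCreate()

    // Hypothetical input; any schema-bearing source works.
    val people = spark.read.json("people.json")
    people.createOrReplaceTempView("people")

    // Plain SQL, executed as a distributed Spark job.
    spark.sql("SELECT name, age FROM people WHERE age > 30").show()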

Resilient Distributed Data Structures (RDDs)

Read-Only
You can't insert, update, or modify rows...
Transformable
... but you can create (cheaply) new RDDs by modifying existing RDDs.
Opaque
Spark just sees a bunch of rows. It doesn't know how to interpret them.
Lazy
Spark saves how to construct an RDD, but waits to actually do so.
Distributed
When Spark constructs an RDD, it automatically assigns rows to workers.
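A quick illustration of laziness and cheap transformation in Scala (assuming the spark-shell's built-in SparkContext, sc):

    val nums   = sc.parallelize(1 to 1000000)  // nothing computed yet
    val evens  = nums.filter(_ % 2 == 0)       // still nothing: Spark only records lineage
    val scaled = evens.map(_ * 10)             // a new RDD, derived cheaply

    // Only an action like count() forces Spark to actually construct
    // the RDDs and distribute the rows across workers.
    val total = scaled.count()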

Where do RDDs come from?

  • Call "parallelize" on a { Scala, Python, Java, R } array/collection
  • Load a text file from disk or HDFS (1 row per line).
  • Load a database table (1 row per row).
  • Transform (map, flatMap, filter) an existing RDD.
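The first, second, and fourth of these, sketched in Scala (paths are hypothetical; assumes a SparkContext, sc):

    val fromArray = sc.parallelize(Seq("a", "b", "c"))    // local collection
    val fromText  = sc.textFile("hdfs:///data/log.txt")   // 1 row per line
    val derived   = fromText.map(_.toUpperCase)           // transform an existing RDD

    // Database tables usually come in through the DataFrame reader
    // (e.g. spark.read.jdbc) and can be converted to an RDD with .rdd.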

FlatMap?

A function that reads in one row and returns any number of rows.
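For example (Scala, assuming a SparkContext, sc):

    val lines = sc.parallelize(Seq("to be or", "not to be"))
    val words = lines.flatMap(_.split(" "))
    // Two input rows become six output rows:
    // "to", "be", "or", "not", "to", "be"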

Map?

A function that reads in one row and returns one row.
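For example:

    val words   = sc.parallelize(Seq("to", "be", "or", "not"))
    val lengths = words.map(w => (w, w.length))
    // One output row per input: ("to",2), ("be",2), ("or",2), ("not",3)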

Filter?

A function that reads in one row and returns true (keep) or false (toss).
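For example:

    val nums  = sc.parallelize(1 to 10)
    val evens = nums.filter(_ % 2 == 0)
    // Keeps 2, 4, 6, 8, 10; tosses the rest.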

DataFrames

RDDs with Schemas: every row has a set of named attributes, and every row in the DataFrame has the same attributes.
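A minimal sketch (Scala, assuming the spark-shell's SparkSession, spark; the data is made up):

    import spark.implicits._

    val df = Seq(("Alice", 34), ("Bob", 29)).toDF("name", "age")
    df.printSchema()              // name: string, age: int
    df.filter($"age" > 30).show() // schema-aware, so Spark can optimize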

Demo