"This is a short introduction and quickstart for the PySpark DataFrame API. PySpark DataFrames are lazily evaluated. They are implemented on top of [RDD](https://spark.apache.org/docs/latest/rdd-programming-guide.html#overview)s. When Spark [transforms](https://spark.apache.org/docs/latest/rdd-programming-guide.html#transformations) data, it does not immediately compute the transformation but plans how to compute it later. The computation starts only when [actions](https://spark.apache.org/docs/latest/rdd-programming-guide.html#actions) such as `collect()` are explicitly called.\n",
"This notebook shows the basic usage of the DataFrame API and is geared mainly toward new users. You can run the latest version of these examples yourself in a live notebook [here](https://mybinder.org/v2/gh/apache/spark/master?filepath=python%2Fdocs%2Fsource%2Fgetting_started%2Fquickstart.ipynb).\n",
"More information is available on the Apache Spark documentation site; see the latest versions of [Spark SQL and DataFrames](https://spark.apache.org/docs/latest/sql-programming-guide.html), [RDD Programming Guide](https://spark.apache.org/docs/latest/rdd-programming-guide.html), [Structured Streaming Programming Guide](https://spark.apache.org/docs/latest/structured-streaming-programming-guide.html), [Spark Streaming Programming Guide](https://spark.apache.org/docs/latest/streaming-programming-guide.html) and [Machine Learning Library (MLlib) Guide](https://spark.apache.org/docs/latest/ml-guide.html).\n",
"\n",
"PySpark applications start by initializing a `SparkSession`, which is the entry point of PySpark, as shown below. When running in the PySpark shell via the <code>pyspark</code> executable, the shell automatically creates the session in the variable <code>spark</code> for users."
]
},
{
"cell_type": "code",
"execution_count": 1,
"metadata": {},
"outputs": [],
"source": [
"from pyspark.sql import SparkSession\n",
"\n",
"spark = SparkSession.builder.getOrCreate()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## DataFrame Creation\n",
"\n",
"A PySpark DataFrame can be created via `pyspark.sql.SparkSession.createDataFrame`, typically by passing a list of lists, tuples, dictionaries, or `pyspark.sql.Row`s, a [pandas DataFrame](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.html), or an RDD consisting of such a list.\n",
"`pyspark.sql.SparkSession.createDataFrame` takes the `schema` argument to specify the schema of the DataFrame. When it is omitted, PySpark infers the corresponding schema by taking a sample from the data.\n",
"\n",
"Firstly, you can create a PySpark DataFrame from a list of rows"
"Alternatively, you can enable the `spark.sql.repl.eagerEval.enabled` configuration for eager evaluation of PySpark DataFrames in notebooks such as Jupyter. The number of rows to show can be controlled via the `spark.sql.repl.eagerEval.maxNumRows` configuration."
"`DataFrame.collect()` collects the distributed data to the driver side as local data in Python. Note that this can throw an out-of-memory error when the dataset is too large to fit on the driver side, because it collects all the data from the executors to the driver."
"PySpark DataFrame also provides conversion back to a [pandas DataFrame](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.html) to leverage pandas APIs. Note that `toPandas` also collects all data to the driver side, which can easily cause an out-of-memory error when the data is too large to fit on the driver side."
"These `Column`s can be used to select columns from a DataFrame. For example, `DataFrame.select()` takes `Column` instances and returns another DataFrame."
"PySpark supports various UDFs and APIs that allow users to execute Python native functions. See also the latest [Pandas UDFs](https://spark.apache.org/docs/latest/sql-pyspark-pandas-with-arrow.html#pandas-udfs-aka-vectorized-udfs) and [Pandas Function APIs](https://spark.apache.org/docs/latest/sql-pyspark-pandas-with-arrow.html#pandas-function-apis). For instance, the example below allows users to directly use the APIs in [a pandas Series](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.html) within a Python native function."
"Another example is `DataFrame.mapInPandas`, which allows users to directly use the APIs in a [pandas DataFrame](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.html) without any restrictions such as the result length."
" asof_join, schema='time int, id int, v1 double, v2 string').show()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Getting Data in/out\n",
"\n",
"CSV is straightforward and easy to use. Parquet and ORC are efficient, compact file formats that are faster to read and write.\n",
"\n",
"There are many other data sources available in PySpark such as JDBC, text, binaryFile, Avro, etc. See also the latest [Spark SQL, DataFrames and Datasets Guide](https://spark.apache.org/docs/latest/sql-programming-guide.html) in Apache Spark documentation."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### CSV"
]
},
{
"cell_type": "code",
"execution_count": 27,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"+-----+------+---+---+\n",
"|color| fruit| v1| v2|\n",
"+-----+------+---+---+\n",
"|  red|banana|  1| 10|\n",
"| blue|banana|  2| 20|\n",
"|  red|carrot|  3| 30|\n",
"| blue| grape|  4| 40|\n",
"|  red|carrot|  5| 50|\n",
"|black|carrot|  6| 60|\n",
"|  red|banana|  7| 70|\n",
"|  red| grape|  8| 80|\n",
"+-----+------+---+---+\n",
"\n"
]
}
],
"source": [
"df.write.csv('foo.csv', header=True)\n",
"spark.read.csv('foo.csv', header=True).show()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Parquet"
]
},
{
"cell_type": "code",
"execution_count": 28,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"+-----+------+---+---+\n",
"|color| fruit| v1| v2|\n",
"+-----+------+---+---+\n",
"|  red|banana|  1| 10|\n",
"| blue|banana|  2| 20|\n",
"|  red|carrot|  3| 30|\n",
"| blue| grape|  4| 40|\n",
"|  red|carrot|  5| 50|\n",
"|black|carrot|  6| 60|\n",
"|  red|banana|  7| 70|\n",
"|  red| grape|  8| 80|\n",
"+-----+------+---+---+\n",
"\n"
]
}
],
"source": [
"df.write.parquet('bar.parquet')\n",
"spark.read.parquet('bar.parquet').show()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### ORC"
]
},
{
"cell_type": "code",
"execution_count": 29,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"+-----+------+---+---+\n",
"|color| fruit| v1| v2|\n",
"+-----+------+---+---+\n",
"|  red|banana|  1| 10|\n",
"| blue|banana|  2| 20|\n",
"|  red|carrot|  3| 30|\n",
"| blue| grape|  4| 40|\n",
"|  red|carrot|  5| 50|\n",
"|black|carrot|  6| 60|\n",
"|  red|banana|  7| 70|\n",
"|  red| grape|  8| 80|\n",
"+-----+------+---+---+\n",
"\n"
]
}
],
"source": [
"df.write.orc('zoo.orc')\n",
"spark.read.orc('zoo.orc').show()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Working with SQL\n",
"\n",
"DataFrame and Spark SQL share the same execution engine, so they can be used interchangeably. For example, you can register a DataFrame as a temporary view and run a SQL query against it as below:"
]
},
{
"cell_type": "code",
"execution_count": 30,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"+--------+\n",
"|count(1)|\n",
"+--------+\n",
"|       8|\n",
"+--------+\n",
"\n"
]
}
],
"source": [
"df.createOrReplaceTempView(\"tableA\")\n",
"spark.sql(\"SELECT count(*) from tableA\").show()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"In addition, UDFs can be registered and invoked in SQL out of the box:"
]
},
{
"cell_type": "code",
"execution_count": 31,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"+-----------+\n",
"|add_one(v1)|\n",
"+-----------+\n",
"|          2|\n",
"|          3|\n",
"|          4|\n",
"|          5|\n",
"|          6|\n",
"|          7|\n",
"|          8|\n",
"|          9|\n",
"+-----------+\n",
"\n"
]
}
],
"source": [
"import pandas as pd\n",
"from pyspark.sql.functions import pandas_udf\n",
"\n",
"@pandas_udf(\"integer\")\n",
"def add_one(s: pd.Series) -> pd.Series:\n",
"    return s + 1\n",
"\n",
"spark.udf.register(\"add_one\", add_one)\n",
"spark.sql(\"SELECT add_one(v1) FROM tableA\").show()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"These SQL expressions can directly be mixed and used as PySpark columns."