# 10 minutes to pandas APIs on Spark

This is a short introduction to pandas APIs on Spark, geared mainly toward new users. This notebook shows you some key differences between pandas and pandas APIs on Spark. You can run these examples yourself in a live notebook [here](https://mybinder.org/v2/gh/pyspark.pandas/master?filepath=docs%2Fsource%2Fgetting_started%2F10min.ipynb). For Databricks Runtime, you can import and run [the current .ipynb file](https://raw.githubusercontent.com/databricks/koalas/master/docs/source/getting_started/10min.ipynb) out of the box. Try it on [Databricks Community Edition](https://community.cloud.databricks.com/) for free.

Customarily, we import pandas APIs on Spark as follows:

In [1]:
import pandas as pd
import numpy as np
import pyspark.pandas as ks
from pyspark.sql import SparkSession

## Object Creation



Creating a pandas-on-Spark Series by passing a list of values, letting pandas APIs on Spark create a default integer index:

In [2]:
s = ks.Series([1, 3, 5, np.nan, 6, 8])

In [3]:
s

0    1.0
1    3.0
2    5.0
3    NaN
4    6.0
5    8.0
dtype: float64

Creating a pandas-on-Spark DataFrame by passing a dict of objects that can be converted to series-like structures:

In [4]:
kdf = ks.DataFrame(
    {'a': [1, 2, 3, 4, 5, 6],
     'b': [100, 200, 300, 400, 500, 600],
     'c': ["one", "two", "three", "four", "five", "six"]},
    index=[10, 20, 30, 40, 50, 60])

In [5]:
kdf

|  | a | b | c |
|---|---|---|---|
| 10 | 1 | 100 | one |
| 20 | 2 | 200 | two |
| 30 | 3 | 300 | three |
| 40 | 4 | 400 | four |
| 50 | 5 | 500 | five |
| 60 | 6 | 600 | six |


Creating a pandas DataFrame by passing a numpy array, with a datetime index and labeled columns:

In [6]:
dates = pd.date_range('20130101', periods=6)

In [7]:
dates

DatetimeIndex(['2013-01-01', '2013-01-02', '2013-01-03', '2013-01-04',
               '2013-01-05', '2013-01-06'],
              dtype='datetime64[ns]', freq='D')

In [8]:
pdf = pd.DataFrame(np.random.randn(6, 4), index=dates, columns=list('ABCD'))

In [9]:
pdf

|  | A | B | C | D |
|---|---|---|---|---|
| 2013-01-01 | -0.621429 | 1.515041 | -1.735483 | -1.235009 |
| 2013-01-02 | 0.844961 | -0.999771 | 0.108356 | 0.109456 |
| 2013-01-03 | 1.343862 | -1.25798 | 0.099766 | -0.137677 |
| 2013-01-04 | 3.001767 | -0.208167 | -1.059449 | 0.312599 |
| 2013-01-05 | -0.035864 | 0.312126 | 0.252281 | 0.627551 |
| 2013-01-06 | -1.200404 | 0.276134 | -0.344308 | -0.367934 |


Now, this pandas DataFrame can be converted to a pandas-on-Spark DataFrame:

In [10]:
kdf = ks.from_pandas(pdf)

In [11]:
type(kdf)

pyspark.pandas.frame.DataFrame

It looks and behaves the same as a pandas DataFrame, though:

In [12]:
kdf

|  | A | B | C | D |
|---|---|---|---|---|
| 2013-01-01 | -0.621429 | 1.515041 | -1.735483 | -1.235009 |
| 2013-01-02 | 0.844961 | -0.999771 | 0.108356 | 0.109456 |
| 2013-01-03 | 1.343862 | -1.25798 | 0.099766 | -0.137677 |
| 2013-01-04 | 3.001767 | -0.208167 | -1.059449 | 0.312599 |
| 2013-01-05 | -0.035864 | 0.312126 | 0.252281 | 0.627551 |
| 2013-01-06 | -1.200404 | 0.276134 | -0.344308 | -0.367934 |


Also, it is possible to create a pandas-on-Spark DataFrame from a Spark DataFrame.

Creating a Spark DataFrame from a pandas DataFrame:

In [13]:
spark = SparkSession.builder.getOrCreate()

In [14]:
sdf = spark.createDataFrame(pdf)

In [15]:
sdf.show()

+--------------------+--------------------+--------------------+--------------------+
|                   A|                   B|                   C|                   D|
+--------------------+--------------------+--------------------+--------------------+
| -0.6214290839748133|  1.5150410562536945| -1.7354827055737831| -1.2350091172431052|
|  0.8449607212376394| -0.9997705636655247| 0.10835607649858589|  0.1094555359929294|
|  1.3438622379103737| -1.2579798113362755|  0.0997664833965215|-0.13767658889070905|
|   3.001767403315059|-0.20816676142436616| -1.0594485090898984| 0.31259853367492724|
|-0.03586387305407219|  0.3121259401964947|  0.2522808041799677|  0.6275512901423211|
| -1.2004042904971255| 0.27613400857508563|-0.34430818441482375|-0.36793440398703187|
+--------------------+--------------------+--------------------+--------------------+



Creating a pandas-on-Spark DataFrame from a Spark DataFrame.
`to_koalas()` is automatically attached to Spark DataFrames and becomes available as an API when pandas APIs on Spark are imported.

In [16]:
kdf = sdf.to_koalas()

In [17]:
kdf

|  | A | B | C | D |
|---|---|---|---|---|
| 0 | -0.621429 | 1.515041 | -1.735483 | -1.235009 |
| 1 | 0.844961 | -0.999771 | 0.108356 | 0.109456 |
| 2 | 1.343862 | -1.25798 | 0.099766 | -0.137677 |
| 3 | 3.001767 | -0.208167 | -1.059449 | 0.312599 |
| 4 | -0.035864 | 0.312126 | 0.252281 | 0.627551 |
| 5 | -1.200404 | 0.276134 | -0.344308 | -0.367934 |


A pandas-on-Spark DataFrame has specific [dtypes](http://pandas.pydata.org/pandas-docs/stable/basics.html#basics-dtypes). Types that are common to both Spark and pandas are currently supported.

In [18]:
kdf.dtypes

A    float64
B    float64
C    float64
D    float64
dtype: object

## Viewing Data

See the [API Reference](https://koalas.readthedocs.io/en/latest/reference/index.html).

See the top rows of the frame. The results may not be the same as in pandas, though: unlike pandas, the data in a Spark DataFrame is not _ordered_ and has no intrinsic notion of an index. When asked for the head of a DataFrame, Spark just takes the requested number of rows from a partition. Do not rely on it to return specific rows; use `.loc` or `.iloc` instead.

In [19]:
kdf.head()

|  | A | B | C | D |
|---|---|---|---|---|
| 0 | -0.621429 | 1.515041 | -1.735483 | -1.235009 |
| 1 | 0.844961 | -0.999771 | 0.108356 | 0.109456 |
| 2 | 1.343862 | -1.25798 | 0.099766 | -0.137677 |
| 3 | 3.001767 | -0.208167 | -1.059449 | 0.312599 |
| 4 | -0.035864 | 0.312126 | 0.252281 | 0.627551 |
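
The ordering caveat can be illustrated with a toy model in plain Python. This is only a sketch: the partition layout and both helper functions are made up for illustration and do not reflect how Spark actually schedules work. It contrasts taking the first `n` rows from whichever partition happens to come first with sorting on the index and then taking `n` rows.

```python
# Toy model: rows are (index, value) pairs spread over partitions in no particular order.
partitions = [
    [(4, "e"), (1, "b")],
    [(0, "a"), (5, "f")],
    [(3, "d"), (2, "c")],
]


def take_first(parts, n):
    """Collect n rows by walking partitions in storage order (hypothetical helper)."""
    rows = []
    for part in parts:
        for row in part:
            if len(rows) == n:
                return rows
            rows.append(row)
    return rows


def sorted_head(parts, n):
    """Sort all rows by index first, then take n rows (hypothetical helper)."""
    all_rows = [row for part in parts for row in part]
    return sorted(all_rows)[:n]


unordered = take_first(partitions, 3)
ordered = sorted_head(partitions, 3)
print(unordered)  # rows in storage order
print(ordered)    # rows with the three smallest index values
```

In pandas-on-Spark terms, the second function corresponds to sorting explicitly, for example with `sort_index()`, before calling `head()`.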


Display the index, columns, and the underlying numpy data.

You can also retrieve the index; an index column can be assigned to a DataFrame (see later).

In [20]:
kdf.index

Int64Index([0, 1, 2, 3, 4, 5], dtype='int64')

In [21]:
kdf.columns

Index(['A', 'B', 'C', 'D'], dtype='object')

In [22]:
kdf.to_numpy()

array([[-0.62142908,  1.51504106, -1.73548271, -1.23500912],
       [ 0.84496072, -0.99977056,  0.10835608,  0.10945554],
       [ 1.34386224, -1.25797981,  0.09976648, -0.13767659],
       [ 3.0017674 , -0.20816676, -1.05944851,  0.31259853],
       [-0.03586387,  0.31212594,  0.2522808 ,  0.62755129],
       [-1.20040429,  0.27613401, -0.34430818, -0.3679344 ]])

`describe()` shows a quick statistical summary of your data:

In [23]:
kdf.describe()

|  | A | B | C | D |
|---|---|---|---|---|
| count | 6.0 | 6.0 | 6.0 | 6.0 |
| mean | 0.555482 | -0.060436 | -0.446473 | -0.115169 |
| std | 1.517076 | 1.007223 | 0.792741 | 0.648616 |
| min | -1.200404 | -1.25798 | -1.735483 | -1.235009 |
| 25% | -0.621429 | -0.999771 | -1.059449 | -0.367934 |
| 50% | -0.035864 | -0.208167 | -0.344308 | -0.137677 |
| 75% | 1.343862 | 0.312126 | 0.108356 | 0.312599 |
| max | 3.001767 | 1.515041 | 0.252281 | 0.627551 |


Transposing your data

In [24]:
kdf.T

|  | 0 | 1 | 2 | 3 | 4 | 5 |
|---|---|---|---|---|---|---|
| A | -0.621429 | 0.844961 | 1.343862 | 3.001767 | -0.035864 | -1.200404 |
| B | 1.515041 | -0.999771 | -1.25798 | -0.208167 | 0.312126 | 0.276134 |
| C | -1.735483 | 0.108356 | 0.099766 | -1.059449 | 0.252281 | -0.344308 |
| D | -1.235009 | 0.109456 | -0.137677 | 0.312599 | 0.627551 | -0.367934 |


Sorting by its index

In [25]:
kdf.sort_index(ascending=False)

|  | A | B | C | D |
|---|---|---|---|---|
| 5 | -1.200404 | 0.276134 | -0.344308 | -0.367934 |
| 4 | -0.035864 | 0.312126 | 0.252281 | 0.627551 |
| 3 | 3.001767 | -0.208167 | -1.059449 | 0.312599 |
| 2 | 1.343862 | -1.25798 | 0.099766 | -0.137677 |
| 1 | 0.844961 | -0.999771 | 0.108356 | 0.109456 |
| 0 | -0.621429 | 1.515041 | -1.735483 | -1.235009 |


Sorting by value

In [26]:
kdf.sort_values(by='B')

|  | A | B | C | D |
|---|---|---|---|---|
| 2 | 1.343862 | -1.25798 | 0.099766 | -0.137677 |
| 1 | 0.844961 | -0.999771 | 0.108356 | 0.109456 |
| 3 | 3.001767 | -0.208167 | -1.059449 | 0.312599 |
| 5 | -1.200404 | 0.276134 | -0.344308 | -0.367934 |
| 4 | -0.035864 | 0.312126 | 0.252281 | 0.627551 |
| 0 | -0.621429 | 1.515041 | -1.735483 | -1.235009 |


## Missing Data
Pandas APIs on Spark primarily use the value `np.nan` to represent missing data. By default, it is excluded from computations.
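
As a rough illustration of how missing values are left out of a computation, the plain-Python sketch below computes a mean that skips `NaN` entries. The `nan_skipping_mean` helper is hypothetical and only mimics the default behaviour; it is not how pandas APIs on Spark implement it.

```python
import math


def nan_skipping_mean(values):
    """Average the non-NaN values; return NaN if none remain (hypothetical helper)."""
    present = [v for v in values if not math.isnan(v)]
    if not present:
        return math.nan
    return sum(present) / len(present)


# Same values as the Series `s` created earlier in this notebook.
data = [1.0, 3.0, 5.0, math.nan, 6.0, 8.0]
result = nan_skipping_mean(data)
print(result)  # 4.6
```

The `NaN` entry contributes neither to the sum nor to the count, so the mean is taken over five values rather than six.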


In [27]:
pdf1 = pdf.reindex(index=dates[0:4], columns=list(pdf.columns) + ['E'])

In [28]:
pdf1.loc[dates[0]:dates[1], 'E'] = 1

In [29]:
kdf1 = ks.from_pandas(pdf1)

In [30]:
kdf1

|  | A | B | C | D | E |
|---|---|---|---|---|---|
| 2013-01-01 | -0.621429 | 1.515041 | -1.735483 | -1.235009 | 1.0 |
| 2013-01-02 | 0.844961 | -0.999771 | 0.108356 | 0.109456 | 1.0 |
| 2013-01-03 | 1.343862 | -1.25798 | 0.099766 | -0.137677 | NaN |
| 2013-01-04 | 3.001767 | -0.208167 | -1.059449 | 0.312599 | NaN |


To drop any rows that have missing data.

In [31]:
kdf1.dropna(how='any')

|  | A | B | C | D | E |
|---|---|---|---|---|---|
| 2013-01-01 | -0.621429 | 1.515041 | -1.735483 | -1.235009 | 1.0 |
| 2013-01-02 | 0.844961 | -0.999771 | 0.108356 | 0.109456 | 1.0 |


Filling missing data.

In [32]:
kdf1.fillna(value=5)

|  | A | B | C | D | E |
|---|---|---|---|---|---|
| 2013-01-01 | -0.621429 | 1.515041 | -1.735483 | -1.235009 | 1.0 |
| 2013-01-02 | 0.844961 | -0.999771 | 0.108356 | 0.109456 | 1.0 |
| 2013-01-03 | 1.343862 | -1.25798 | 0.099766 | -0.137677 | 5.0 |
| 2013-01-04 | 3.001767 | -0.208167 | -1.059449 | 0.312599 | 5.0 |


## Operations

### Stats
Operations in general exclude missing data.

Performing a descriptive statistic:

In [33]:
kdf.mean()

A    0.555482
B   -0.060436
C   -0.446473
D   -0.115169
dtype: float64

### Spark Configurations

Various PySpark configurations can be applied internally in pandas APIs on Spark.
For example, you can enable Arrow optimization to speed up internal pandas conversion; it is about four times faster in the timings below. See the [PySpark Usage Guide for Pandas with Apache Arrow](https://spark.apache.org/docs/latest/sql-pyspark-pandas-with-arrow.html).

In [34]:
prev = spark.conf.get("spark.sql.execution.arrow.enabled")  # Keep its default value.
ks.set_option("compute.default_index_type", "distributed")  # Use the distributed default index to prevent overhead.
import warnings
warnings.filterwarnings("ignore")  # Ignore warnings coming from Arrow optimizations.

In [35]:
spark.conf.set("spark.sql.execution.arrow.enabled", True)
%timeit ks.range(300000).to_pandas()

311 ms ± 30.2 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)


In [36]:
spark.conf.set("spark.sql.execution.arrow.enabled", False)
%timeit ks.range(300000).to_pandas()

1.25 s ± 29.7 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)


In [37]:
ks.reset_option("compute.default_index_type")
spark.conf.set("spark.sql.execution.arrow.enabled", prev)  # Set its default value back.
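
The save, change, and restore pattern used in the cells above can be wrapped in a context manager. The following is a minimal sketch: `temporary_conf` is a hypothetical helper, not part of PySpark, and `DictConf` is a stand-in so the example runs without a Spark session. It assumes only that the configuration object exposes `get(key)` and `set(key, value)`, as `spark.conf` does above.

```python
from contextlib import contextmanager


@contextmanager
def temporary_conf(conf, key, value):
    """Set `key` to `value` inside the block, then restore the previous value."""
    previous = conf.get(key)
    conf.set(key, value)
    try:
        yield conf
    finally:
        conf.set(key, previous)  # Restore even if the block raises.


class DictConf:
    """Dict-backed stand-in for a configuration object (illustration only)."""

    def __init__(self, values):
        self._values = dict(values)

    def get(self, key):
        return self._values[key]

    def set(self, key, value):
        self._values[key] = value


conf = DictConf({"spark.sql.execution.arrow.enabled": "false"})
with temporary_conf(conf, "spark.sql.execution.arrow.enabled", "true"):
    inside = conf.get("spark.sql.execution.arrow.enabled")
after = conf.get("spark.sql.execution.arrow.enabled")
print(inside, after)  # true false
```

Restoring in a `finally` clause means the original value comes back even when the timed code fails, which the manual version in the cells above does not guarantee.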

## Grouping
By “group by” we are referring to a process involving one or more of the following steps:

- Splitting the data into groups based on some criteria
- Applying a function to each group independently
- Combining the results into a data structure
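
The three steps can be spelled out in plain Python. This is only a conceptual sketch with made-up rows; in pandas APIs on Spark, `groupby` performs these steps for you.

```python
from collections import defaultdict

# Made-up (key, value) rows for illustration.
rows = [("foo", 1.0), ("bar", 2.0), ("foo", 3.0), ("bar", 4.0), ("foo", 5.0)]

# 1. Split: bucket the values by their key.
groups = defaultdict(list)
for key, value in rows:
    groups[key].append(value)

# 2. Apply: run a function on each group independently.
sums = {key: sum(values) for key, values in groups.items()}

# 3. Combine: collect the per-group results into one structure, sorted by key.
result = dict(sorted(sums.items()))
print(result)  # {'bar': 6.0, 'foo': 9.0}
```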

In [38]:
kdf = ks.DataFrame({'A': ['foo', 'bar', 'foo', 'bar',
                          'foo', 'bar', 'foo', 'foo'],
                    'B': ['one', 'one', 'two', 'three',
                          'two', 'two', 'one', 'three'],
                    'C': np.random.randn(8),
                    'D': np.random.randn(8)})

In [39]:
kdf

|  | A | B | C | D |
|---|---|---|---|---|
| 0 | foo | one | 0.392094 | -0.197885 |
| 1 | bar | one | 0.39724 | 0.768301 |
| 2 | foo | two | -1.683135 | -0.210606 |
| 3 | bar | three | -1.776986 | -0.092022 |
| 4 | foo | two | -0.499332 | 0.463287 |
| 5 | bar | two | 0.386921 | 1.995358 |
| 6 | foo | one | -0.514731 | 1.042816 |
| 7 | foo | three | 0.194186 | 1.745033 |


Grouping and then applying the [sum()](https://koalas.readthedocs.io/en/latest/reference/api/pyspark.pandas.groupby.GroupBy.sum.html#databricks.koalas.groupby.GroupBy.sum) function to the resulting groups.

In [40]:
kdf.groupby('A').sum()

| A | C | D |
|---|---|---|
| bar | -0.992825 | 2.671637 |
| foo | -2.110918 | 2.842644 |


Grouping by multiple columns forms a hierarchical index, and again we can apply the sum function.

In [41]:
kdf.groupby(['A', 'B']).sum()

| A | B | C | D |
|---|---|---|---|
| foo | one | -0.122637 | 0.844931 |
| foo | two | -2.182467 | 0.252681 |
| bar | three | -1.776986 | -0.092022 |
| foo | three | 0.194186 | 1.745033 |
| bar | two | 0.386921 | 1.995358 |
| bar | one | 0.39724 | 0.768301 |


## Plotting
See the [Plotting](https://koalas.readthedocs.io/en/latest/reference/frame.html#plotting) docs.

In [42]:
pser = pd.Series(np.random.randn(1000),
                 index=pd.date_range('1/1/2000', periods=1000))

In [43]:
kser = ks.Series(pser)

In [44]:
kser = kser.cummax()

In [45]:
kser.plot()

On a DataFrame, the [plot()](https://koalas.readthedocs.io/en/latest/reference/api/pyspark.pandas.frame.DataFrame.plot.html#databricks.koalas.frame.DataFrame.plot) method is a convenience to plot all of the columns with labels:

In [46]:
pdf = pd.DataFrame(np.random.randn(1000, 4), index=pser.index,
                   columns=['A', 'B', 'C', 'D'])

In [47]:
kdf = ks.from_pandas(pdf)

In [48]:
kdf = kdf.cummax()

In [49]:
kdf.plot()

## Getting data in/out
See the [Input/Output](https://koalas.readthedocs.io/en/latest/reference/io.html) docs.

### CSV

CSV is straightforward and easy to use. See [here](https://koalas.readthedocs.io/en/latest/reference/api/pyspark.pandas.DataFrame.to_csv.html#databricks.koalas.DataFrame.to_csv) to write a CSV file and [here](https://koalas.readthedocs.io/en/latest/reference/api/databricks.koalas.read_csv.html#databricks.koalas.read_csv) to read a CSV file.

In [50]:
kdf.to_csv('foo.csv')
ks.read_csv('foo.csv').head(10)

|  | A | B | C | D |
|---|---|---|---|---|
| 0 | -0.821342 | -0.325142 | 0.904636 | -0.925984 |
| 1 | 1.498758 | 0.045747 | 0.904636 | 0.726606 |
| 2 | 1.498758 | 0.045747 | 0.904636 | 0.726606 |
| 3 | 1.498758 | 1.534086 | 0.904636 | 0.726606 |
| 4 | 1.498758 | 1.534086 | 0.904636 | 0.726606 |
| 5 | 1.498758 | 1.534086 | 0.904636 | 0.726606 |
| 6 | 1.498758 | 1.534086 | 0.904636 | 0.726606 |
| 7 | 1.498758 | 1.534086 | 0.904636 | 0.856176 |
| 8 | 1.498758 | 1.534086 | 0.904636 | 0.856176 |
| 9 | 1.498758 | 1.534086 | 0.904636 | 1.532448 |


### Parquet

Parquet is an efficient and compact file format that is faster to read and write. See [here](https://koalas.readthedocs.io/en/latest/reference/api/pyspark.pandas.DataFrame.to_parquet.html#databricks.koalas.DataFrame.to_parquet) to write a Parquet file and [here](https://koalas.readthedocs.io/en/latest/reference/api/databricks.koalas.read_parquet.html#databricks.koalas.read_parquet) to read a Parquet file.

In [51]:
kdf.to_parquet('bar.parquet')
ks.read_parquet('bar.parquet').head(10)

|  | A | B | C | D |
|---|---|---|---|---|
| 0 | -0.821342 | -0.325142 | 0.904636 | -0.925984 |
| 1 | 1.498758 | 0.045747 | 0.904636 | 0.726606 |
| 2 | 1.498758 | 0.045747 | 0.904636 | 0.726606 |
| 3 | 1.498758 | 1.534086 | 0.904636 | 0.726606 |
| 4 | 1.498758 | 1.534086 | 0.904636 | 0.726606 |
| 5 | 1.498758 | 1.534086 | 0.904636 | 0.726606 |
| 6 | 1.498758 | 1.534086 | 0.904636 | 0.726606 |
| 7 | 1.498758 | 1.534086 | 0.904636 | 0.856176 |
| 8 | 1.498758 | 1.534086 | 0.904636 | 0.856176 |
| 9 | 1.498758 | 1.534086 | 0.904636 | 1.532448 |


### Spark IO

In addition, pandas APIs on Spark fully support Spark's various data sources, such as ORC and external data sources. See [here](https://koalas.readthedocs.io/en/latest/reference/api/pyspark.pandas.DataFrame.to_spark_io.html#databricks.koalas.DataFrame.to_spark_io) to write to a specified data source and [here](https://koalas.readthedocs.io/en/latest/reference/api/databricks.koalas.read_spark_io.html#databricks.koalas.read_spark_io) to read from one.

In [52]:
kdf.to_spark_io('zoo.orc', format="orc")
ks.read_spark_io('zoo.orc', format="orc").head(10)

|  | A | B | C | D |
|---|---|---|---|---|
| 0 | -0.821342 | -0.325142 | 0.904636 | -0.925984 |
| 1 | 1.498758 | 0.045747 | 0.904636 | 0.726606 |
| 2 | 1.498758 | 0.045747 | 0.904636 | 0.726606 |
| 3 | 1.498758 | 1.534086 | 0.904636 | 0.726606 |
| 4 | 1.498758 | 1.534086 | 0.904636 | 0.726606 |
| 5 | 1.498758 | 1.534086 | 0.904636 | 0.726606 |
| 6 | 1.498758 | 1.534086 | 0.904636 | 0.726606 |
| 7 | 1.498758 | 1.534086 | 0.904636 | 0.856176 |
| 8 | 1.498758 | 1.534086 | 0.904636 | 0.856176 |
| 9 | 1.498758 | 1.534086 | 0.904636 | 1.532448 |
