How to automate cohort analysis and charting with Python and Matplotlib – The Python Cohort Charting Cookbook

When you are dealing with potentially hundreds or thousands of customers, the rate at which those customers come or leave is a critical KPI. Especially when it comes to certain industries like SaaS companies, customers are coming and going all the time. In fact, the rates at which clients come and go can also change drammatically based on different factors like marketing campaigns, price segmentation, price changes, etc..

So if you are tracking things (and you should) you will end up, one way or the other, doing a cohort analysis. In fact, in general you would want to see 2 things:

  • Of course, an increasing number of paying customers
  • Less obvious, but still important, an increasingly longer paying period for each customer

The first bullet point is quite easy to analyse as you mostly only need to correlate the number of customers with the total revenue coming in. The second bullet point is way less simple as it requires some data transformations that tend to be complicated in Excel or Google Sheets.

There are many good tutorials on the web on cohort analysis with spreadsheets, but here we will take a different route and analyse things with Python. Now, if you have never coded in Python before this might seem scary, but the reality is that you could build your own cohort analysis in Python in a quite short period of time if you follow the right direction.

Time, Productivity and More Fun

Building things in Excel or Google Sheets might be the only choice for quite few analysts or business owners. But while the “build” part of things might seem to be fast as you know well spreadsheets already, the “maintainance” might take a lot of time in the long term.

There is also the issue that cohort analysis tends to be data heavy and this quickly makes spreadsheets slow, buggy and, in general, not fun to use. In fact, I have registered that one of the main reasons people want to move from spreadsheets to Python is to be able to crunch datasets of much bigger size.

Finally, you might want to learn Python in general as it is going strong these days in data analytics. Even as a solution architect or IT director knowing a bit of Python will help you a lot when managing your customers expectations on analytics an BI.

Comulated Monthly Recurring Revenue and Customer Count

In order to get started, we first have to read our data. We have data in raw format with 3 main columns:

  • CustomerId -> The Id of the customer
  • Time -> The months in which the customer paid to us
  • Revenue -> The monthly price paid by the customer

So, let’s go ahead and read the data from Excel. This particular file has 0’s for each month from 2015 for each customer, so we will also filter out all records where Revenue is 0.

import pandas as pd
import numpy as np
import calendar

#Read data
d = "C:\\Users\\eg013949\\Google Drive\\Assembling Analytics\\Articles\\Fast and Automated Cohort Analysis Charts with Python\\"
xls = "SaaS_Revenue.xlsx"
df = pd.ExcelFile(d+xls).parse("V2")
df = df.query('Revenue > 0')

This is the table we get:


We now want to do our first analysis which is revenue and customer count by month. In oder to do this we will use groupby()

#Filter values <= 0 and then Group data
g = df.groupby(['Time']).agg(
    {'CustomerId': 'count', 'Revenue': 'sum'}).reset_index()

Showing a table grouped by ‘Time’ (month) with CustomerId count and Revenue sum. All measures are done by month (think like an Excel Pivot Table would work).


Now that our data is nicely grouped by month, we can go ahead and build our first chart. We will use matplotlib for charting here.

%matplotlib inline
import matplotlib.pyplot as plt
import numpy as np
import seaborn as sns

def dualAxis2Lines(timeAxis, y, z, title, axis_1_label, axis_2_label):
    colors =['xkcd:sky blue','green', 'coral']
    fig, ax = plt.subplots()

    ax.plot(timeAxis,y, color=colors[0], linewidth=4, label=axis_1_label)
    ax.legend(bbox_to_anchor=(1.05, 1), loc=2)
    ax.fill_between(timeAxis.dt.to_pydatetime(), y, color=colors[1], alpha=0.3) #Create an area chart
    ax.set_ylabel(axis_1_label, fontsize=18, color=colors[0])

    ax2 = ax.twinx()
    ax2.plot(timeAxis,z, color=colors[2], linewidth=4, label=axis_2_label)
    ax2.legend(bbox_to_anchor=(1.05, 1.05), loc=2)
    ax2.set_ylabel(axis_2_label, fontsize=18, color=colors[2])

    fig.suptitle(title, fontsize=18)

title = 'MRR and Customers Count'
axis_1_label = 'MRR(€k)'
axis_2_label = 'Customers Count'
dualAxis2Lines(g["Time"], g["Revenue"], g["CustomerId"], title, axis_1_label, axis_2_label)

The above code is actually quite simple. Most of the code is used for configuring the matplotlib chart in the right way but I have bundled it inside the dualAxis2Lines() function. So if you like the below chart, you can just copy the whole function in your script and use it. It will give you a dual axis chart on revenues and customer count.


As you can see the chart looks pretty good and is more visually pleasing than a lot of spreadsheet charts people do.

Cohort analysis

Getting into the cohort analysis from here will require a bit more transformations. The below code is an adaptation from this tutorial which is also a good read.

Step 1

Create a CohortGroup column. The idea here is to create a column where values are equal to the first time the customer paid his monthly price. This column is normally the most complex to calculate in Excel as you will have to transpose the data.

#thgroupby(level=0) means you are grouping on the index
#This is very similar to the Tableau's LOD calculation concept
df.set_index('CustomerId', inplace=True)
df['CohortGroup'] = df.groupby(level=0)['Time'].min().apply(lambda x: x.strftime('%Y-%m'))

Which produced the below table. Notice how CohortGroup has always the same value and that value corresponds to the first month of purchase of this specific customer.


Step 2

Now that we have our cohort group, we want to groupby our data by cohortGroup and rename things with more meaningful names.

cohorts = df.groupby(['CohortGroup', 'Time']).agg({'CustomerId': pd.Series.nunique,
                       'Revenue': np.sum})
# renaming
cohorts.rename(columns={'CustomerId': 'TotalCustomers'}, inplace=True)


This will give us the below table.


Step 3

Calculate the cohort period. The cohort period is calculated so that we can see for how many paid months each customer has inside the cohort group. For instance, if a customer first joined on Jan 2017 and is still around on Jan 2018, he will have 12 cohort periods.

cohorts = cohorts.reset_index()

def cohort_period(df):
    #Step 1, take the length of the dataframe; step 2, calculate a range from 0 
    #to the length of the df; step 3, apply the calculated range as new column to
    #the df (memo: this works because "Period" is sorted ascending already)
    df['CohortPeriod'] = np.arange(len(df)) + 1
    return df

cohorts = cohorts.groupby(['CohortGroup']).apply(cohort_period)



Step 4

Now we need to calculate user retention. Notice that the math here is quite simple we do:

  • Customers Retention = Total Cohort Customers In That Period/Total Cohort Customers

Do notice that “Total Cohort Customers” is in fact the total number of customers that paid in Period 1 of each cohort. This is based on the definition of a cohort, which is the group of customers that started using the service in a specific month.

As time goes by, the customers that first purchased the service will start dropping, so you would expect the rate of (Total Cohort Customers In That Period/Total Cohort Customers) to start dropping below 100%.

# TotalCohortCustomers. first() pulls period 1 which is the total customer nr the cohorts starts with
cohort_group_size = cohorts.groupby(['CohortGroup'])['TotalCustomers'].first()

# Pivot with unstack(0) so that we have a column for each period, then divide by total
# customers per cohort group
cohorts.set_index(['CohortGroup', 'CohortPeriod'], inplace=True)
customer_retention = cohorts['TotalCustomers'].unstack(0).divide(cohort_group_size, axis=1)

Step 5

Plotting with Seaborn as it has a nice heatmap object.

colors = ['red','coral', sns.xkcd_rgb["medium green"],'green']
plt.figure(figsize=(24, 8))
plt.title('Cohorts: User Retention', fontsize=22)
sns.heatmap(customer_retention.T, mask=user_retention.T.isnull(),
            annot=True, fmt='.0%', cmap=colors, cbar=False)

And here below is the chart.


Optional Step – Add all to a PDF

We can now take all the charts we created and add them to a PDF file for further distribution. We can output any type of file we want. For instance, if you are adding your comments to the report a Word .doc or a Power Point might also be used.

import os
import datetime

cwd = os.getcwd()
f = os.path.join(cwd, "temp.png")

now = str(

from fpdf import FPDF

pdf = FPDF()
pdf.set_font('Arial', 'B', 16)
pdf.cell(40, 10, 'Customer Cohort Report '+now)
pdf.image('pic1.png', x = 0.1, w=250)
pdf.image('pic2.png', x = 0.1, w=300)
pdf.output('Cohort Report.pdf', 'F')


If you liked this post, you might be interested in my training plan, you find all the info over here.

Leave a Reply

Your email address will not be published. Required fields are marked *