In this post we will go through the end-to-end process of building a dashboard, using a database as the backend and Python for ELT and dashboard reporting.
Technologies we will use:
- Postgres as our database backend
- Plotly, a Python library for building charts and dashboards
High level concepts we will go through:
- Data lakes
- Version control for dashboards
- Row Level Security
- Superset and Plotly and when to use which
- Dashboard speed performance improvements
- Dynamic filters
Before we dive into the article, a quick note. Creating dashboards with Python is a topic we tackle in our training program. The program is personalized to your analytics needs and focuses on building: you will create analytic insights and dashboards as part of it.
OK, back to the tutorial.
How to use this document
This is a Jupyter notebook which mixes documentation and code. You can go through it top to bottom. I have tested all the code below and it works. If you follow it step by step, you will be able to create a dashboard yourself on any dataset you might have.
Data lakes
If you are building a self-service BI platform or advanced analytics dashboards for clients, a data lake can be very helpful for organizing data and improving speed. The best way to picture a data lake is as computer folders: imagine that all the data you need, and by all I really mean all, sits in folders, and that all of those folders live inside one master folder.
This way all of your data is in one place. That makes the data easier to organize, and it is always at your fingertips for analytics. Also, since we will be loading all the data into one database, speed will improve dramatically.
Set up database
For the data lake we will use Postgres, which is one of the most reliable SQL-compliant open source databases out there.
There are many ways to set up a Postgres db instance. You can install your own if you want on your own servers. For this example we are using a Google Cloud instance.
Note: it is important to keep in mind that managing databases can be complex. If you do not have a skilled DBA on your team and the cloud is an option (that is, there are no legal or company constraints), then I advise using a managed cloud instance.
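Before ingesting anything, it is worth verifying that you can actually reach your instance. Below is a minimal sketch of such a check; the user, password, and host values are placeholders that you would substitute with your own instance details, and the helper names are mine, not part of any library.

```python
def make_db_url(user, password, host, db="postgres"):
    """Build a psycopg2/SQLAlchemy-style Postgres connection URL."""
    return f"postgresql://{user}:{password}@{host}/{db}"

def check_connection(url):
    """Return True if a connection to the database can be opened and closed."""
    from psycopg2 import connect, OperationalError
    try:
        conn = connect(url)
        conn.close()
        return True
    except OperationalError:
        return False

# Example (placeholder credentials):
# url = make_db_url("postgres", "password", "ipaddress")
# print(check_connection(url))
```

Running this once after provisioning the instance saves a lot of head-scratching later, since connection problems (firewall rules, wrong password, unreachable host) otherwise surface as confusing errors deep inside the ingest code.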
Ingest the data
We assume data will be coming from CSV files here, but data can come from anywhere. With some more work, APIs can be made to work with this method too. In order to ingest from APIs we need to take incremental versus full ingests into account. Most APIs can be ingested incrementally on a scheduled basis (as records get created they are pulled into our data lake).
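As a sketch of the incremental idea, suppose the API's records carry an `updated_at` timestamp and we keep track of the newest timestamp we have already loaded. The function and field names below are illustrative, not from any specific API:

```python
from datetime import datetime, timezone

def incremental_window(last_loaded_at):
    """Given the timestamp of the newest record already in the lake,
    return the (since, until) window to request from the API.
    If nothing has been loaded yet, start from the epoch."""
    since = last_loaded_at or datetime(1970, 1, 1, tzinfo=timezone.utc)
    until = datetime.now(timezone.utc)
    return since, until

def filter_new_records(records, since):
    """Keep only records created or updated after the last load."""
    return [r for r in records if r["updated_at"] > since]
```

On each scheduled run we would compute the window, fetch only records inside it, append them to the lake, and persist the new high-water mark for the next run.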
We will now walk through the code in the cell below. You will notice we are using the copy_from functionality. This is because a standard ODBC connection with psycopg2 would be too slow. In general, we never want to ingest big datasets into a database row by row, as that can be slow. Postgres offers a bulk ingest method called copy; copy_from implements that method.
```python
import io

import pandas as pd
from psycopg2 import connect

csv = "C:\\pathToCsv\\medicare.csv"

def writeToPostgres(f):
    conn = connect("postgresql://postgres:password@ipaddress/postgres")
    cursor = conn.cursor()
    cursor.execute(
        'create table medicare ('
        'Hospital_Name text, State text, Avg_Spending_Per_Episode_Hospital int)'
    )
    cursor.copy_from(
        f, 'medicare',
        columns=('Hospital_Name', 'State', 'Avg_Spending_Per_Episode_Hospital'),
        sep='\t'
    )
    conn.commit()  # commit both the create table and the bulk load
    cursor.close()
    conn.close()

# Let's read only the columns we need with usecols
df = pd.read_csv(csv, usecols=['Hospital_Name', 'State', 'Avg_Spending_Per_Episode_Hospital'])

# Let's load the data into a temporary in-memory csv file
f = io.StringIO()
df.to_csv(f, index=False, header=False, sep='\t')
f.seek(0)

# Let's write to Postgres by sending our temporary csv f to copy
writeToPostgres(f)
print("done")
```
- The pd.read_csv() method reads data from our CSV file and puts it into a pandas dataframe.
- The io.StringIO() call creates an in-memory file. This is necessary as copy_from only accepts file-like objects.
- The connect() call is our link to the Postgres database. After setting up Postgres in the cloud, we use our db-specific password, user, and host information to build this connection string. For further reference you can research SQLAlchemy URL strings.
- The copy_from method pulls data from our temporary CSV file and does a bulk data ingest into Postgres.
Building the dashboard
OK, so now we have the data in Postgres and we want to make some charts. We will use Plotly here as it is one of the best Python charting libraries right now. You might be wondering why we are not using Superset; that comparison is something I plan to add next. It is important to keep in mind, however, that if you are building an application that needs flexibility and customization (e.g. a client-facing product), then Plotly can be a better alternative to Superset.
So, the code is below. You will recognize the Postgres URL string in quotes inside the create_engine() call; that is how we connect to the database. Please read through the code below. The explanations follow after the code section.
```python
import pandas as pd
from sqlalchemy import create_engine

engine = create_engine("postgresql://postgres:password@ipaddress/postgres")

# Get data from the database. Use group by here to get the least amount
# of data from the db; this will give us more speed.
df = pd.read_sql('''
    select "state" as "state",
           sum("avg_spending_per_episode_hospital") as "avg_spending_per_episode_hospital"
    from public."medicare"
    group by "state";
''', con=engine)

df = df.fillna('')

# Let's sort the data so we can see values from smaller to bigger
df = df.sort_values(['avg_spending_per_episode_hospital'])

# We do all the plotting here
import plotly.graph_objs as go
from plotly.offline import download_plotlyjs, init_notebook_mode, plot, iplot

init_notebook_mode(connected=True)
iplot([go.Bar(
    x=df['state'],
    y=df['avg_spending_per_episode_hospital'],
)])
```
Some key points to note from the above code
1 Version control
As you can see, the code for plotting is rather contained and small. While the number of lines will increase as we add options, the code stays compact and easy to manage through version control. This is one selling point for Plotly. You can take the above code, put it under version control, and be done with it. Plotly, being a library and not an app like Superset, makes this step straightforward.
2 Make your database do the heavy lifting
The pd.read_sql() method above is a great way to push most of the analytics into the db. Databases are built for this kind of work and are very fast at it. By doing a group by in SQL, we avoid pulling thousands or millions of rows and fetch only what is needed.
3 Inject dynamic filters easily
Another important aspect of pd.read_sql() is that we can inject all the filters the dashboard needs here as where clauses. This means the user can keep adding filters, and doing so will actually improve performance, since each filter reduces the amount of data we pull from the database.
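One way to sketch this is a small helper that turns the user's filter selections into a parameterized where clause. The function name, the column whitelist, and the `total` alias below are my own illustrative choices, not part of the original code; the whitelist plus `%s` placeholders keep user input out of the SQL text itself:

```python
def build_filtered_query(filters):
    """Build a parameterized aggregate query from a dict of
    column -> value filters chosen in the dashboard UI.
    Returns (sql, params), suitable for pd.read_sql(sql, con, params=params)."""
    allowed = {"state", "hospital_name"}  # whitelist column names to block SQL injection
    clauses, params = [], []
    for col, val in filters.items():
        if col not in allowed:
            raise ValueError(f"unknown filter column: {col}")
        clauses.append(f'"{col}" = %s')  # values go in as bound parameters
        params.append(val)
    where = f" where {' and '.join(clauses)}" if clauses else ""
    sql = (
        'select "state", sum("avg_spending_per_episode_hospital") as total'
        ' from public."medicare"' + where + ' group by "state"'
    )
    return sql, params

# Example: a user picks state = 'AL' in the UI
sql, params = build_filtered_query({"state": "AL"})
```

Each filter the user adds tightens the where clause, so the database returns fewer rows and the dashboard gets faster rather than slower.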
4 Row level security
The focus of this document is not RLS, but we can point to a simple solution by again considering pd.read_sql(). We can wrap this method in a decorator function that looks up the user's role. For instance, if a user may see only the state "AL", we write a function that automatically injects an SQL "where state='AL'" clause. This would be relatively easy to implement even on top of an existing security application (if there is no security application, we can build our own using something like Flask).
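A minimal sketch of that decorator idea follows. Everything here is hypothetical scaffolding: the `USER_STATES` mapping stands in for whatever your security layer provides, and `run_query` returns the final SQL instead of calling pd.read_sql() so the injection is visible without a live database.

```python
import functools

# Hypothetical role lookup: in a real app this would come from the
# session / security layer (e.g. a Flask login). None = sees everything.
USER_STATES = {"alice": "AL", "bob": None}

def row_level_security(fetch):
    """Decorator that wraps the SQL in a state filter before it
    reaches the database, based on the current user's role."""
    @functools.wraps(fetch)
    def wrapper(sql, user, **kwargs):
        state = USER_STATES.get(user)
        if state is not None:
            # state comes from our own trusted mapping, never from user input
            sql = f"select * from ({sql}) rls where rls.state = '{state}'"
        return fetch(sql, user, **kwargs)
    return wrapper

@row_level_security
def run_query(sql, user):
    # In the real dashboard this would be pd.read_sql(sql, con=engine);
    # here we return the final SQL so the wrapping is easy to inspect.
    return sql
```

Because every query funnels through the decorated function, the restriction is enforced in one place, and adding a new restricted role is just a new entry in the role lookup.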