Google Sheets is a great tool for managing a lot of data quickly without custom code development. You can easily build quick spreadsheets to support your data or business workflows, and you can also use spreadsheets to collect input from multiple people.
But Google Sheets, like all spreadsheets, is not great for building visually pleasing dashboards; there are better tools for that. And if you need to blend multiple sheets or workbooks, Google Sheets can be very manual and labour-intensive.
Before we dive into this tutorial, just a quick note. Connecting to data sources across the web programmatically in Python is one of the topics we tackle in our training program. The program is personalized to your analytics needs and focuses on building: you will create analytic insights and dashboards as part of it.
OK, back to the tutorial.
In this article we will show how easily we can set up a solution that does the following:
- Pulls data from Google Sheets programmatically and automatically
- Uploads that data into our data lake
- Analyses that data in Apache Superset
The final result will be an Apache Superset dashboard. Multiple dashboards can be built with the same data; as an example, this is what I came up with:
In order to follow this guide, you will have to refer to the following links:
- Code here in this article
- The Google spreadsheet with the data is here
- The Superset dashboard GoogleSheetsDemo. Make sure you are signed into https://superset.assemblinganalytics.com with user and password “demo” before you click on the dashboard link
Getting the data from Google Sheets
Before we do anything, we need to load the data. In this article we will use Python for the data management piece.
We use the libraries gspread and oauth2client to connect to Google Sheets and pull the data. The code below is an adjusted version of the code from a Twilio blog post.
You can refer to the Twilio guide for more information on what is going on below and how to set up the Google credentials to read your own spreadsheets. You will have to create a project in your Google account and set up the API; the Twilio link above explains how to do that in a few minutes.
OK, so once you are set up with Google, you can run the code below to get the data from Google Sheets into a pandas dataframe.
```python
import gspread
from oauth2client.service_account import ServiceAccountCredentials
import pandas as pd

# use creds to create a client to interact with the Google Drive API
scope = ['https://spreadsheets.google.com/feeds']
creds = ServiceAccountCredentials.from_json_keyfile_name('client_secret.json', scope)
client = gspread.authorize(creds)

# Find a workbook by name and open the first sheet
# Make sure you use the right name here.
sheet = client.open("Alpha - Sales and Invoices").get_worksheet(0)
dfInvoices = pd.DataFrame(sheet.get_all_records())
sheet = client.open("Alpha - Sales and Invoices").get_worksheet(1)
dfCustomers = pd.DataFrame(sheet.get_all_records())
```
Preparing the flat table
Now that we have loaded all the data from Google Sheets into pandas dataframes, we want to create a flat table. We have two tables: Invoices and Customers. The “Invoices” table does not contain customer names, only ids, so we go ahead and merge the two tables.
The resulting dataframe will include all invoices + the customer names.
```python
# A left merge keeps every invoice, matching on the shared "Customer Id" column
df = pd.merge(dfInvoices, dfCustomers, how='left', on='Customer Id')
```
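It is worth checking that the left merge actually found a customer for every invoice. A minimal sketch, using hypothetical mini versions of the two sheets, shows how `indicator=True` can flag invoices whose customer id is missing from the Customers sheet:

```python
import pandas as pd

# Hypothetical miniature versions of the two sheets.
dfInvoices = pd.DataFrame({"Customer Id": [1, 2, 3], "Amount": [100, 250, 80]})
dfCustomers = pd.DataFrame({"Customer Id": [1, 2], "Customer Name": ["Acme", "Globex"]})

# indicator=True adds a _merge column so we can spot invoices whose
# customer id has no match in the Customers sheet.
df = pd.merge(dfInvoices, dfCustomers, how="left", on="Customer Id", indicator=True)
unmatched = df[df["_merge"] == "left_only"]
print(len(unmatched))  # 1: the invoice for customer id 3 has no customer record
```

If `unmatched` is non-empty, the source sheets are out of sync and the flat table will carry blank customer names.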
Let’s enrich the dataset
We do two things. First, we make sure the “Date” field is converted to a datetime type in the dataframe. Second, we create year, month and week columns for convenience.
```python
# pd.to_datetime with errors='coerce' turns unparseable values into NaT
# (the old DatetimeIndex(..., coerce=True) form has been removed from pandas)
df['Date'] = pd.to_datetime(df['Date'], errors='coerce')
df['year'] = df['Date'].dt.year
df['month'] = df['Date'].dt.month
df['week'] = df['Date'].dt.isocalendar().week
```
Uploading all to our database instance
The cool thing about pandas is that there is a function for everything. In this case, we can use to_sql() to send the data from our dataframe to our cloud Postgres instance.
to_sql() will take care of creating the table on Postgres with the right column types. Remember how we converted “Date” to a datetime type in our pandas dataframe above? Because we did that, the Postgres column will also be created with a timestamp type (otherwise it would be created as “text”).
```python
from sqlalchemy import create_engine

engine = create_engine("postgresql://postgres:uaBvG8dDRHid@188.8.131.52/postgres")
df.to_sql('gsheetsDemoInvoices', engine, if_exists='replace')
print("done")
```
Note the “if_exists” parameter in to_sql(): set to “replace”, it is the equivalent of a SQL table drop and recreate on every run. If you do not want to clear the table every time, for instance if you want to append the data, you can pass “append” instead.
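To make the replace-versus-append behaviour concrete, here is a small self-contained sketch. An in-memory SQLite engine stands in for the Postgres instance, and the dataframe contents are hypothetical demo values:

```python
import pandas as pd
from sqlalchemy import create_engine

# In-memory SQLite stands in for the cloud Postgres instance.
engine = create_engine("sqlite://")
df = pd.DataFrame({"Customer Id": [1, 2], "Amount": [100, 250]})

# if_exists='replace' drops and recreates the table on every run;
# if_exists='append' keeps the table and adds the new rows to it.
df.to_sql("gsheetsDemoInvoices", engine, if_exists="replace", index=False)
df.to_sql("gsheetsDemoInvoices", engine, if_exists="append", index=False)

rows = pd.read_sql("SELECT COUNT(*) AS n FROM gsheetsDemoInvoices", engine)
print(rows["n"][0])  # 4: the two original rows plus the two appended ones
```

With “append”, deduplication is your responsibility, so it is usually paired with an incremental pull rather than a full re-read of the sheet.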
Adding the new database and table to Superset
OK, so now we have the data in Postgres, but we still need to set up Superset. First off, let us add the database to Superset by going to Databases and doing the following.
We now click the “+” sign on the right side.
Now we have to add our Database name and the SQLAlchemy URI. The URI includes both your password and user name, which is all the information Superset needs to connect to your Postgres database.
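For reference, the SQLAlchemy URI for Postgres follows this general shape (the angle-bracket placeholders are stand-ins for your own values, not literal text):

```text
postgresql://<user>:<password>@<host>:<port>/<database>
```

This is the same URI format used with create_engine() earlier in the article, so you can copy it from your Python script.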
And we are done with the database. Go ahead and click “Test Connection” to see if everything is working as it should.
If everything is working, we can now head over to the Tables page.
Click the “+” sign again and we are into the create table page.
Here make sure to write the table name exactly as it is in Postgres. Once done click save.
As a final step, head over again to the table record and click “Edit Metadata”. Then go to “Columns List” to set up your columns. Superset needs to be configured for each column, for instance, if you want to use a particular column as a filter, you need to set it up as “filterable” in the columns list.
Congratulations! You are now ready to go and use your new data in Superset for analysis and visualizations.
Some considerations and remarks
We have reviewed how to go from Google Sheets to a Superset report, and shown that this solution is simple to set up and runs smoothly, especially if you are working with a standard SQL database that is well supported by SQLAlchemy.
We did not, however, review a few other things that are also important, namely:
- Security. We assume table-level security here, not row-level
- Scheduling. If people update your Google Sheets often, you will need to schedule the Python scripts to run periodically. This can be done quite easily, but it needs some further explanation and a Python server setup
- Data input. We said Google Sheets is great for data input, and it is true; however, if your team is not disciplined, the sheets quickly become hard to manage, so make sure your team is well instructed on how to use them
- Database setup. We assumed a database with a SQLAlchemy URI is already available. This is of course not always the case; please let us know in the comments below if you have questions on this one
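On the scheduling point, here is a minimal in-process sketch of the idea. The `refresh_from_sheets` function is a hypothetical stand-in for the pull, merge and upload pipeline from this article; in production you would more likely use cron or a workflow tool such as Airflow:

```python
import time

def refresh_from_sheets():
    # Hypothetical placeholder for the pull -> merge -> to_sql() pipeline above;
    # wire in your own functions here.
    return time.strftime("%Y-%m-%d %H:%M:%S")

def run_periodically(job, interval_seconds, max_runs):
    """Tiny in-process scheduler: run `job` max_runs times, sleeping in between."""
    results = []
    for i in range(max_runs):
        results.append(job())
        if i < max_runs - 1:
            time.sleep(interval_seconds)
    return results

# Demo: three runs, one second apart. A cron entry such as
#   0 * * * * python refresh.py
# achieves the same hourly cadence without keeping a Python process alive.
runs = run_periodically(refresh_from_sheets, 1, 3)
print(len(runs))  # 3
```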
Building integrated analytics solutions with Python, Superset and a SQL database is quite easy. These solutions require some technical know-how, but they are powerful. Pulling data from Google Sheets is only a start; you can then go further and pull data from anything with an API. This lets you analyze your numbers in one place (Superset here, but it could be any other reporting tool or library), which is a critical step to doing analytics productively and quickly.