Learn In Public. That’s what it’s called, and I want to make it a habit from now on.
I’ve thought of making my learning journey public in order to gain more insights from people who can see my progress. Since I’m currently studying Data Engineering concepts on the side, my first topic is data-related: dbt!
Before encountering dbt, whenever I started a project and had to deal with data, I would simply load it into a notebook and explore it manually. Of course, that works, especially when the project doesn’t use large datasets. However, it gets confusing and tedious in the long run! I couldn’t be 100% sure that the data cleaning and other transformations I had done were correct. The problem became apparent during my internship…
There has to be a way to manage all of this… right?
I asked myself then… I moved on with my life.
But hopping back to my Data Engineering side studies… one of the lessons in the DataCamp track I was taking was about this tool: the Data Build Tool (dbt).
I didn’t understand its use at first and just went with the flow. It was confusing in the beginning because there are a lot of things to remember: models, sources, tests, seeds, macros, Jinja templating, etc…
Working on a mini project of exploring dbt more (specifically, the dbt Cloud) made me realize how convenient dbt is for my future data projects.
First, what is dbt? In a nutshell, it is all about handling the ‘T’ in the ETL/ELT process: Extract, TRANSFORM, and Load. The Data Build Tool helps manage data transformations, and it can easily connect to different data warehouses (e.g. Snowflake, BigQuery).
One of the challenges I faced was figuring out what exactly the dbt workflow is. It was a bit confusing to me where it starts and ends. As I’ve learned from the DataCamp course, the dbt workflow is as follows:
- Create project: dbt init
- Define configuration: profiles.yml
- Create / Use models / Templates
- Instantiate models: dbt run
- Verify / Test / Troubleshoot
- Repeat as needed
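For the “Define configuration” step, here’s a rough sketch of what a profiles.yml might look like for a BigQuery connection when using dbt Core (dbt Cloud configures the connection through its UI instead). The project and dataset names below are made up for illustration:

```yaml
# profiles.yml — hypothetical dbt Core profile for BigQuery
# (names are placeholders, not from my actual project)
my_healthcare_project:
  target: dev
  outputs:
    dev:
      type: bigquery
      method: oauth
      project: my-gcp-project    # your GCP project ID
      dataset: dbt_healthcare    # the BigQuery dataset dbt writes models to
      threads: 4
```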
My mini dbt exploration project
1. Data Source Setup
I started with the dataset itself. I found a healthcare dataset from Kaggle. It is a synthetic healthcare dataset designed to mimic real-world healthcare data.
Afterwards, I set up my project in BigQuery:
I then proceeded to configure my dbt Cloud account and all the necessary connections. My data is uploaded to BigQuery.
2. dbt Project Setup
As for initializing the project, I used dbt Cloud.
After making sure that everything works and is reflected in BigQuery, this is the part where the data transformation takes place.
I wrote SQL files under the models directory. The models are divided into two layers: staging and transformation. The former cleans and standardizes the data, while the latter analyzes and aggregates it.
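As a sketch of what a staging model could look like, here’s a hypothetical example (the table and column names are invented for illustration, not taken from my actual project). The `{{ source() }}` Jinja function points dbt at the raw table in the warehouse:

```sql
-- models/staging/stg_admissions.sql — hypothetical staging model:
-- clean and standardize the raw source data
select
    id as admission_id,
    lower(trim(gender)) as gender,
    cast(admission_date as date) as admission_date,
    billing_amount
from {{ source('healthcare', 'raw_admissions') }}
where id is not null
```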
3. Testing
Next is to validate our data. This step is crucial for us to make sure that the data is transformed according to our goal. It could mean not having null values, or only accepting certain values.
dbt has three types of tests: built-in, singular, and generic.
The built-in tests are unique, not_null, accepted_values, and relationships; a singular test is a custom one-off test written as a SQL query; a generic test is a reusable, parameterized test.
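The built-in tests are declared in the YAML file alongside the models. A hypothetical schema.yml (model and column names invented for illustration) might look like this:

```yaml
# models/staging/schema.yml — hypothetical example of built-in tests
version: 2

models:
  - name: stg_admissions
    columns:
      - name: admission_id
        tests:
          - unique
          - not_null
      - name: gender
        tests:
          - accepted_values:
              values: ['male', 'female']
```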
The dbt test command returns the failing rows for each test provided. From there, we can check the errors and correct them accordingly (e.g. dropping null values). I used the built-in tests not_null, accepted_values, and relationships, and I wrote custom tests to ensure that the aggregations produce no zero or negative values.
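A singular test like the zero/negative-value check is just a SQL file in the tests directory; dbt marks the test as failing if the query returns any rows. Here’s a rough sketch (the model and column names are hypothetical):

```sql
-- tests/assert_no_nonpositive_totals.sql — hypothetical singular test:
-- any rows returned by this query count as failures
select *
from {{ ref('agg_billing_totals') }}
where total_billing <= 0
```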
4. Documentation
After executing the dbt run command, we can run dbt docs generate to automatically generate documentation for the project.
Since I used dbt Cloud, this part differs a bit from dbt Core: I don’t have to serve the docs myself. To be honest, I jumped through hoops just to figure out how to view the documentation after running dbt docs generate.
I had to set up some configurations in the project settings.
Below is an error I encountered after adding custom tests to schema.yml. I removed them, leaving only the built-in tests in the YAML file.
It worked after this, since the custom tests were no longer being treated as macros, but as tests :>
What I have learned…
My recent exploration of dbt has revealed that data transformation can be orderly and collaborative. It’s essential for the data to be validated and monitored. While I have been accustomed to working solo as a student, I recognize that in the future, teamwork will be inevitable, and employing a tool like dbt can enhance our collective efficiency.
I’m excited to implement this tool in my upcoming projects! Data Science isn’t just about analyzing data and offering insights—it should also facilitate a data pipeline that allows various stakeholders (data scientists, analysts, engineers) to work effectively, particularly during the transformation stage. From mess to success :>
The project’s source code is accessible here.
If you are interested in learning dbt too, you can check the official dbt courses, which are free. You can also learn it on DataCamp!