How I built my CDP with Airflow

A primer on how to use Airflow and other components of the modern data stack to build and maintain your Customer Data Platform.

Dave Melillo
8 min read · Mar 22, 2022

Table of Contents:

CDP Theory 📚

Physical Components of a CDP 🛠

Modern Data Platform 2.0

Data Pipelines with Airflow 🚀

Data Transformations with Airflow 🚀

The Later Miles 🛣

Closing 🗣

CDP Theory

It might help to start this off with a definition of Customer Data Platform.

A customer data platform (CDP) is software that collects and unifies first-party customer data — from multiple sources — to build a single, coherent, complete view of each customer.

While the definition above, provided by Oracle, is still generally sufficient, there is no denying that the concept of the CDP has evolved rapidly in recent times. What was once a non-technical, drag-and-drop, point-to-point tool with narrowly targeted applications is now a dynamic force driving innovation across the modern data stack.

Customer Data Platforms aim to solve common data challenges: aggregation of data across multiple systems, managing many complex transformations, and finding a way to automate it all. But the CDP’s proximity to tangible business value makes it special, giving CDP solutions the power to change the trajectory of data infrastructure.

When faced with the challenge of building my own CDP, I decided to forgo the recommendations of “one size fits all” vendors and stitch together the same quality of platform I would build to handle any other data problem. The idea that a CDP must be more targeted, truncated and simplified is questionable at best. In reality, a CDP must be so vast, flexible and adaptive that it allows a business to answer questions it doesn’t even know it has yet.

Physical Components of a CDP

There are many vendors in the marketplace claiming they can offer a silver-bullet CDP solution, which is an extremely confusing message. After all, Customer Data Platform has PLATFORM in its name, suggesting that a broad, flexible and integrative solution is needed to solve customer data problems successfully. This is not to say that all-inclusive solutions are of poor quality; rather, if you truly want to build a best-in-class solution that will solve the problems you have today AND tomorrow, picking from best-of-breed tools is the only option.

In 2022, chances are high that your company already has some data infrastructure in place, and that’s where I would start when gathering the physical components of your CDP.

Does your company already use something like Snowflake, Redshift or BigQuery as a data warehouse? GREAT! You can (and should) use a partition of that data warehouse for your CDP, to make your work easily accessible to the whole organization.

Does your company depend heavily on a visualization tool like Tableau, PowerBI or Preset? GREAT! You already have the self-service component of your CDP, and all you have to do is build the upstream infrastructure.

These concepts of unbundling your CDP and leaning into your data warehouse to support your CDP efforts are not new, but the proliferation of great tools that enable these practices, including Airflow, has made them seem like the obvious choice.

Modern Data Platform 2.0

The picture above represents my view of all of the generic components required to build a modern data platform in 2022. It is an update to a diagram from a previous article I wrote on building modern data platforms, and while many aspects of both the diagram and the topic have changed over the past year, drilling into the details is a subject for another time…

What I want to call attention to are the integration and transformation components, as they sit at the core of any data platform and provide a foundation for anyone consuming data downstream. Without reliable, flexible and automated integration and transformation solutions, you don’t have a data platform; you have an application acting as an abstraction layer that obscures your business’s true potential.

So those are the elements that I am going to focus on in talking about how Airflow helped me build my CDP. (I say helped because I also used other tools, like Snowflake, Hightouch and Preset, to round out my data stack.)

Data Pipelines with Airflow

Many enterprise-grade solutions do very specific things very well, but don’t have much flexibility beyond being able to get data from point A to point B. And while there’s a long list of players in the data pipeline space, none of them can offer out-of-the-box integrations with every data source their customers could ever want. CDPs, in particular, need to integrate many sources, including enrichment sources, 2nd/3rd-party data, and unique sources like web/LinkedIn scraping results.

I recognize that citing Airflow’s data pipeline capabilities and flexibility is nothing revolutionary. Building data pipelines has been a strength of Airflow since the beginning, if not its primary use case. I believe its main differentiator is that data pipelines built with Airflow allow you to automate anything you can build in Python.

In my early work on the CDP, I took advantage of Airflow’s PythonOperator and SnowflakeOperator to do most of my testing in a Jupyter notebook, and then simply copied and pasted my working code into an Airflow DAG. My scenario covers well-known CDP sources like Salesforce and HubSpot, but it also covers niche players like LeadMagic.

Imagine a scenario where I am trying to create a comprehensive analysis across all prospect/customer accounts using my CDP. In order to do this, I need to get information from:

  • Salesforce (accounts, leads, outbound sales activities, deals)
  • HubSpot (inbound form fills, email marketing signals)
  • LeadMagic (website views, data enrichment)

In the Airflow UI, my data pipeline workflow looks something like this:
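
Since the screenshot doesn’t translate to text, here is a rough sketch of how that DAG might be wired together. Everything here is illustrative: connection IDs, task names, the include path and the placeholder callable are assumptions, not my production code.

```python
# dags/cdp_extract_and_load.py -- an illustrative sketch of the pipeline DAG
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator
from airflow.providers.snowflake.operators.snowflake import SnowflakeOperator


def extract_and_load_salesforce():
    """Source-specific extract-and-load logic (the Campaign example below shows one version)."""
    ...


with DAG(
    dag_id="cdp_extract_and_load",
    start_date=datetime(2022, 1, 1),
    schedule_interval="@daily",
    catchup=False,
    template_searchpath="/usr/local/airflow/include",  # adjust to wherever your SQL lives
) as dag:
    # prep step: snapshot and clear the raw Salesforce tables
    backup_truncate_salesforce = SnowflakeOperator(
        task_id="backup_truncate_salesforce",
        snowflake_conn_id="snowflake_default",
        sql="backup_truncate_salesforce.sql",
    )

    # load step: pull from the Salesforce API and land the data in Snowflake
    load_salesforce = PythonOperator(
        task_id="extract_and_load_salesforce",
        python_callable=extract_and_load_salesforce,
    )

    backup_truncate_salesforce >> load_salesforce
    # HubSpot and LeadMagic each get the same backup_truncate >> extract_and_load pair
```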

The backup_truncate steps are straightforward, taking snapshots of existing data and truncating/clearing production tables to prep them for new data. This step is driven by simple SQL commands like the ones you see below:
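
A minimal sketch of that kind of prep, for a single raw table (schema and table names are made up for illustration). The SnowflakeOperator’s sql parameter also accepts a list of statements, which is handy here:

```python
# The kind of SQL behind a backup_truncate task, shown as a Python list so it can be
# passed straight to SnowflakeOperator(sql=...). Names are illustrative.
BACKUP_TRUNCATE_SALESFORCE_CAMPAIGN = [
    # zero-copy snapshot of the current raw table
    "CREATE OR REPLACE TABLE cdp_raw.salesforce_campaign_backup CLONE cdp_raw.salesforce_campaign",
    # clear the production raw table so the next load starts fresh
    "TRUNCATE TABLE cdp_raw.salesforce_campaign",
]
```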

The extract_and_load steps are much more complicated and are bespoke to each source. This is where Airflow truly shines, as any code that I can write in a Jupyter notebook can be converted into a Python callable that can be used by Airflow’s PythonOperator. A simple example selecting ALL fields from the Salesforce Campaign object and storing them in Snowflake is below:
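
What follows is a hedged sketch of what such a callable could look like, using simple_salesforce and the Snowflake Python connector. Credentials, schemas and the target table are placeholders, and the target table is assumed to already exist (the backup_truncate step leaves it empty):

```python
import pandas as pd
import snowflake.connector
from simple_salesforce import Salesforce
from snowflake.connector.pandas_tools import write_pandas


def extract_and_load_sfdc_campaign():
    """Pull every field on the Salesforce Campaign object and land it in Snowflake."""
    sf = Salesforce(
        username="me@example.com", password="***", security_token="***"
    )

    # Build a "SELECT <every field>" SOQL query from the object's describe metadata
    fields = [f["name"] for f in sf.Campaign.describe()["fields"]]
    soql = f"SELECT {', '.join(fields)} FROM Campaign"
    records = sf.query_all(soql)["records"]

    # Flatten into a DataFrame, dropping the metadata column simple_salesforce adds
    df = pd.DataFrame(records).drop(columns="attributes")

    # Write the DataFrame into the (already truncated) raw table
    conn = snowflake.connector.connect(
        account="my_account", user="loader", password="***",
        warehouse="LOAD_WH", database="CDP_RAW", schema="SALESFORCE",
    )
    try:
        write_pandas(conn, df, table_name="CAMPAIGN")
    finally:
        conn.close()
```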

Data Transformations with Airflow

The semantics get a little tricky here. Some people would argue, reasonably, that a data pipeline represents all steps involved in data processing, from extraction to transformation to pushing data into a downstream system. For this discussion, though, I want to be clear that I see data pipelines and data transformations as two distinct steps, where data pipelines dump raw data into a database, and data transformations take that raw data and prepare it for analysis.

(I guess I should explain my position here, just to try and win over some of you who disagree. I think decoupling raw extraction from transformation allows for more transparency and efficient data governance. Data pipeline scripts should generally be void of business logic, while transformation scripts should be where business logic is applied. This also allows more non-technical roles, like analysts, managers, and directors, to follow along with a transformation layer written in SQL, versus a complex data pipeline written in Python.)

Getting back to our imagined scenario, I’m still trying to create a comprehensive analysis across all prospect/customer accounts using the CDP, and I’m a little closer now that I have all the information I need in one place (Snowflake). However, I need to transform that data so I can understand things like:

  • How many TOFU, MOFU and BOFU (top-, middle- and bottom-of-funnel) forms have been filled out per account?
  • How many times has an account visited our blog in the past 30 days?
  • What are leading indicators of success?

The next step to being able to answer these questions is transformation and aggregation. In the Airflow UI, my data transformation workflow looks something like this:
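
Again, in place of the screenshot, here is a sketch of how that transformation DAG might be laid out: one SnowflakeOperator per local transformation, all feeding a final global step. File names, connection IDs and the include path are assumptions:

```python
# dags/cdp_transform.py -- an illustrative sketch of the transformation DAG
from datetime import datetime

from airflow import DAG
from airflow.providers.snowflake.operators.snowflake import SnowflakeOperator

with DAG(
    dag_id="cdp_transform",
    start_date=datetime(2022, 1, 1),
    schedule_interval="@daily",
    catchup=False,
    template_searchpath="/usr/local/airflow/include",  # adjust to wherever your SQL lives
) as dag:
    # local transformations/aggregations, one per source/object
    local_transforms = [
        SnowflakeOperator(
            task_id=f"transform_{name}",
            snowflake_conn_id="snowflake_default",
            sql=f"{name}.sql",
        )
        for name in ("salesforce_accounts", "hubspot_forms", "leadmagic_visits")
    ]

    # global transformation/aggregation that produces the final analysis-ready table
    build_account_stats = SnowflakeOperator(
        task_id="build_account_stats",
        snowflake_conn_id="snowflake_default",
        sql="account_stats.sql",
    )

    local_transforms >> build_account_stats
```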

The framework I am using is to perform local transformations/aggregations at the source/object level (e.g. salesforce/accounts, hubspot/forms), and global transformations/aggregations that aim to present a final data set for analysis.

For example, I might want to aggregate success measures and activity signals from Salesforce:
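
As a sketch (the schemas, tables and columns below are illustrative, not my actual model), the SQL behind that step might look something like this, shown as the string the SnowflakeOperator would run:

```python
# Contents of include/salesforce_accounts.sql, shown as a Python string for readability.
SALESFORCE_ACCOUNTS_SQL = """
CREATE OR REPLACE TABLE cdp.salesforce_account_stats AS
SELECT
    a.id                                              AS account_id,
    a.name                                            AS account_name,
    a.website                                         AS account_domain,
    COUNT(DISTINCT o.id)                              AS opportunities,
    SUM(CASE WHEN o.is_won THEN o.amount ELSE 0 END)  AS closed_won_amount,
    COUNT(DISTINCT t.id)                              AS outbound_activities_30d
FROM cdp_raw.salesforce_account a
LEFT JOIN cdp_raw.salesforce_opportunity o
       ON o.account_id = a.id
LEFT JOIN cdp_raw.salesforce_task t
       ON t.account_id = a.id
      AND t.activity_date >= DATEADD(day, -30, CURRENT_DATE)
GROUP BY a.id, a.name, a.website;
"""
```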

I might want to transform and aggregate HubSpot form detail so I can provide funnel analysis:
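
A comparable sketch for the HubSpot side, assuming form submissions have already been tagged with a funnel stage (again, names are illustrative):

```python
# Contents of include/hubspot_forms.sql, shown as a Python string for readability.
HUBSPOT_FORMS_SQL = """
CREATE OR REPLACE TABLE cdp.hubspot_form_stats AS
SELECT
    f.company_domain                                          AS account_domain,
    SUM(CASE WHEN f.funnel_stage = 'TOFU' THEN 1 ELSE 0 END)  AS tofu_form_fills,
    SUM(CASE WHEN f.funnel_stage = 'MOFU' THEN 1 ELSE 0 END)  AS mofu_form_fills,
    SUM(CASE WHEN f.funnel_stage = 'BOFU' THEN 1 ELSE 0 END)  AS bofu_form_fills,
    COUNT(*)                                                  AS total_form_fills
FROM cdp_raw.hubspot_form_submissions f
GROUP BY f.company_domain;
"""
```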

Lastly, I want to transform and aggregate signals from LeadMagic to understand what companies are visiting different parts of the site:
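
And a sketch for LeadMagic’s identified web traffic, rolled up by company and site section (tables and page paths are illustrative):

```python
# Contents of include/leadmagic_visits.sql, shown as a Python string for readability.
LEADMAGIC_VISITS_SQL = """
CREATE OR REPLACE TABLE cdp.leadmagic_visit_stats AS
SELECT
    v.company_domain                                               AS account_domain,
    COUNT(*)                                                       AS page_views_30d,
    SUM(CASE WHEN v.page_path LIKE '/blog%'    THEN 1 ELSE 0 END)  AS blog_views_30d,
    SUM(CASE WHEN v.page_path LIKE '/pricing%' THEN 1 ELSE 0 END)  AS pricing_views_30d
FROM cdp_raw.leadmagic_page_views v
WHERE v.visited_at >= DATEADD(day, -30, CURRENT_DATE)
GROUP BY v.company_domain;
"""
```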

After all of these local transformations/aggregations, I want to create an account_stats object that can be used to build dashboards or even for machine learning projects.
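
Continuing the sketch, the global step simply joins the local aggregates on a shared account key (the domain join below is an assumption; in practice you would use whatever identity resolution your CDP relies on):

```python
# Contents of include/account_stats.sql, shown as a Python string for readability.
ACCOUNT_STATS_SQL = """
CREATE OR REPLACE TABLE cdp.account_stats AS
SELECT
    s.account_id,
    s.account_name,
    s.opportunities,
    s.closed_won_amount,
    s.outbound_activities_30d,
    h.tofu_form_fills,
    h.mofu_form_fills,
    h.bofu_form_fills,
    l.page_views_30d,
    l.blog_views_30d
FROM cdp.salesforce_account_stats s
LEFT JOIN cdp.hubspot_form_stats    h ON h.account_domain = s.account_domain
LEFT JOIN cdp.leadmagic_visit_stats l ON l.account_domain = s.account_domain;
"""
```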

The implementation of this process in Airflow is seamless and elegant. I simply drop all of my SQL queries into the includes folder and reference them in the SnowflakeOperator.
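
Under that pattern, the project layout might look something like this (file names match the sketches above; the exact folder name and path depend on your Airflow deployment):

```
include/
├── backup_truncate_salesforce.sql
├── salesforce_accounts.sql
├── hubspot_forms.sql
├── leadmagic_visits.sql
└── account_stats.sql
dags/
├── cdp_extract_and_load.py
└── cdp_transform.py
```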

The Later Miles

I’m reluctant to talk about the process from here forward as “the last mile,” because that mile seems to keep getting longer and longer. In the past, the last mile generally meant getting data into a dashboard where people could interact with it — or even giving tech-savvy users access to a curated database or view, to produce datasets themselves.

But the days of the traditional DBA are numbered, and now the last mile extends back to the original first-party applications, downstream databases, collaboration tools and whatever else the future holds.

This is where we come to the tools that round out our CDP, like Hightouch. Since Hightouch is a developer-friendly platform, it fits into the modern data stack seamlessly and plays a specific but critical role in getting data from the warehouse back into business applications.

Remember when we created all those great account stats like form fills and web pages visited for every account? Well, Hightouch makes it easy for me to push all of those metrics back into Salesforce, where the sales and marketing team can actually turn them into actions. Hightouch can also facilitate the socialization of machine learning projects, like intent models, churn models or classification exercises. Being a developer-friendly platform, Hightouch makes it easy to integrate with other major components of the CDP, like Airflow, through the Hightouch operator:
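
A sketch of that integration, assuming the community airflow-provider-hightouch package; the import path, operator name and parameters may vary by version, so treat this as a shape rather than a recipe:

```python
from datetime import datetime

from airflow import DAG
# Provided by the airflow-provider-hightouch package (import path may differ by version)
from airflow_provider_hightouch.operators.hightouch import HightouchTriggerSyncOperator

with DAG(
    dag_id="cdp_reverse_etl",
    start_date=datetime(2022, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    # Trigger the Hightouch sync that maps account_stats back into Salesforce fields
    push_account_stats_to_salesforce = HightouchTriggerSyncOperator(
        task_id="push_account_stats_to_salesforce",
        connection_id="hightouch_default",  # Airflow connection holding the Hightouch API token
        sync_id=123,                        # hypothetical ID of the account_stats-to-Salesforce sync
    )
```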

The basic value-add of Hightouch is that it empowers me to easily share all of the great work I am doing with the CDP inside the business applications people already use every day. This is a much more effective strategy than building endless dashboards and reports that never see the light of day.

Closing

This is where the fun starts, as the CDP infrastructure we’ve built churns out insights. But the best part is not what we planned to do with the CDP — the real advantage of this infrastructure is that it allows us to do anything we need to do, now and in the foreseeable future.

Does a new third party data source become available that changes our industry? We can incorporate it.

Do we want to switch CRMs to better align with new strategic goals? No problem.

Do we want to make CDP insights available to other parts of the org? This platform is designed for sharing at scale.

Without the infrastructure that a modern data platform provides, we would be much farther away from the fun stuff.
