From star schema to event schema: One simple step to a self-serve analytics data model
There are many views of, and hopes for, self-serve analytics: the broad set of tools and processes that lets non-technical team members get data insights for themselves.
In plenty of companies, self-serve looks something like this: BI (business intelligence) dashboards showing visualizations of how team-based or companywide KPIs (key performance indicators) are faring. Being able to pull up this data with a couple of clicks helps everyone in the business understand things like operational and revenue performance and spot early signs of missed targets or areas for improvement.
While dashboards like these can be essential assets, they don't necessarily go as far as enabling true self-serve analytics, because they tend to spark more new questions for the people viewing them than they answer.
For the sake of analytics, questions and data exploration are key. But most BI datasets for dashboards are built on top of an existing data model, reduced to contain only the essential metrics and dimensions the dashboard needs. This means plenty of follow-up questions from self-servers require extensions to the dataset and the dashboard itself, which typically means requests to analytics or data science teams that can take weeks to fulfill. This destroys any real flow of exploration for non-technical analysts and puts a big lag on further insights.
We want to create this flow of exploration since we can all agree that the true magic happens when domain expertise and data meet.
The answer to unlocking this is an event analytics data model. And the good news is it’s something that can be built upon your existing star schema in your data warehouse and easily explored in depth with a true self-serve analytics platform like Mixpanel.
From star schema to event schema
Unlike BI tools, Mixpanel uses an event-based data model to power its data views and analytics. The nice thing about this is that you can easily extend the event model with context or dimensional data to enrich the dataset and enable useful breakdowns at a self-serve level.
Of course, you need an event schema on your data to make this all possible. Rest assured this is not a large lift since you essentially already have one within your existing data warehouse model. To illustrate this, I will pick a well-known example and extend it slightly to fit our purpose. This should give you a good idea of how to do something similar in your data model. The most important aspect is that we don’t change the data model; we just introduce a new presentation layer that works with event-based analytics tools.
I will build my solution based on the dbt Jaffle Shop example. It has a simple star schema for shop orders with customer, product, location, and supply dimensions.
In this model, we have only one fact table: orders. This is our natural starting point for defining events. Since fact tables usually have a timestamp, a unique identifier, and references to other dimensional values, they are already in an event table format. We need to transform them slightly to match the final event schema.
Initially, the order table looks like this:
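A simplified sketch of its shape (the column names follow the Jaffle Shop marts; the real project has a few more columns):

```sql
-- Simplified shape of the Jaffle Shop orders fact table (illustrative, not the full column list)
select
    order_id,     -- unique identifier for the order
    customer_id,  -- reference to the customers dimension
    location_id,  -- reference to the locations dimension
    ordered_at,   -- timestamp of the order
    order_total   -- total revenue of the order
from {{ ref('orders') }}
```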
For an event table, we need a timestamp, which we can take from the ordered_at column. Then we need an event name: here, we add a new column, event_name, and set it to the string value order submitted. Finally, we need an identifier to combine these events into a sequence (e.g., to analyze a customer's order history). We will use the customer_id for that.
This is our basic event table structure:
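A minimal sketch of that structure as a dbt model, renaming the orders columns into the event shape described above (including the order_id as a dedupe-friendly unique id):

```sql
-- Basic event table: timestamp, event name, and an identifier to sequence events
select
    ordered_at        as event_time,   -- when the event happened
    'order submitted' as event_name,   -- constant name for this event type
    customer_id       as distinct_id,  -- combines events into a per-customer sequence
    order_id          as insert_id     -- unique id, useful for deduplication
from {{ ref('orders') }}
```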
Can we extend that? In an event data schema, we can also include properties that add more context to our events, and these properties can be used to break down a metric. In our order case, we could break orders down by a revenue cluster (1-100, 101-500, etc.) or by something like the order category (food, home goods, office supplies, etc.).
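As a sketch, such a property is just a derived column in the event model; the cluster boundaries below are made up for illustration:

```sql
-- Deriving a revenue_cluster property from the order total (boundaries are illustrative)
select
    order_id,
    case
        when order_total <= 100 then '1-100'
        when order_total <= 500 then '101-500'
        else '500+'
    end as revenue_cluster
from {{ ref('orders') }}
```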
In addition, we could join in even more information, like the exact products that were ordered and in which location. How to include such context is a design decision, since we have three ways to achieve it:
- Join all context with the event in the data model

This can be the better design when we use dimension tables extensively as slowly changing dimensions. With the supply data, for example, we would make sure that changes to a product's supplier or to supply prices are recorded with historical context (valid-from and valid-to dates). This method is easy and gives us the most control, since the information is already captured in the event table at the time of the event. Another reason to go this route is to simplify the data structure in Mixpanel. I would join the order items here to populate the product context so we can properly use Mixpanel's list-of-objects properties.

- Load customer context as user properties

We can use Mixpanel's user properties feature to load all information about our customers via the data warehouse. User properties are first-class citizens in Mixpanel, and the data is available in all reports. This means we can use user properties to break down any event, for example, by country.

- Add context as an additional dimensional table and use Mixpanel's lookup tables feature

Mixpanel has supported lookup tables for a long time, but the new data warehouse integration makes syncing them as easy as possible. Given that we don't have historic changes for the dimension tables, we can sync them regularly and map their values to existing property values. This gives us an easy way to extend the available information. Taking the products as an example: if the category team enhances the product data with information on the product margin, a lookup table loaded via a full refresh in a nightly job would pick up the new column automatically.
In the dbt sample data, slowly changing dimension tables are not used. Therefore, we will enrich the event table with the order items information as a Mixpanel list of objects property.
The user properties will be loaded from the customer table.
We will load the product, location, and supplier tables as lookup tables into Mixpanel.
First, we construct the event table with this query:
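Here is a sketch of what that query can look like (Snowflake syntax for the list-of-objects aggregation; the model name order_events and the exact column names are assumptions based on the Jaffle Shop marts):

```sql
-- Event model sketch, e.g., models/mixpanel/order_events.sql
select
    orders.order_id        as insert_id,    -- unique id so Mixpanel can deduplicate
    orders.ordered_at      as event_time,   -- event timestamp
    'order submitted'      as event_name,
    orders.customer_id     as distinct_id,  -- sequences events per customer
    orders.location_id,                     -- kept as a property for the lookup table mapping
    orders.order_total     as revenue,
    array_agg(
        object_construct(
            'product_name',  products.product_name,
            'product_type',  products.product_type,
            'product_price', products.product_price
        )
    )                      as order_items   -- list-of-objects property for Mixpanel
from {{ ref('orders') }} as orders
left join {{ ref('order_items') }} as order_items
    on orders.order_id = order_items.order_id
left join {{ ref('products') }} as products
    on order_items.product_id = products.product_id
group by 1, 2, 3, 4, 5, 6
```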
As you can see, it's not much work to transform a fact table and join in some dimensional data to create an event model.
Additionally, we add three dimensional tables for the event model, which are almost the same as the original ones; we just filter down to the relevant values for us. Here is an example of the location model:
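A sketch of that model; it selects straight from the source dimension, keeping only the columns we need (column names assume the Jaffle Shop locations table):

```sql
-- Lookup-table model sketch, e.g., models/mixpanel/location_lookup.sql
select
    location_id,    -- join key; matches the location_id property on our events
    location_name,  -- human-readable name for breakdowns
    tax_rate
from {{ ref('locations') }}
```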
With this data modeled, we can now make it available in Mixpanel.
Make the new data available in Mixpanel
You might be familiar with the traditional way Mixpanel has done data tracking: using an SDK to send events directly from either client or server environments. But since last year, Mixpanel has offered an even more powerful option: syncing directly from your data warehouse.
There are two methods for syncing data into Mixpanel: the standard sync and the new Mirror sync.
The standard sync
The standard sync functions like most reverse ETL products. You define a table with your event data or user properties, set it up in Mixpanel, and specify how often you want it to sync.
It's important to know that this sync only appends new records, typically using timestamp-based updates: the job searches for new event data based on timestamps and loads it. This works fine if your event structure doesn't need historical adaptation, but new properties will only apply to new events.
Recommendation: Use one table per event. This approach lets you introduce new events later and sync the full history into Mixpanel.
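As a sketch, each event then gets its own thin model on top of the shared event logic (order_events is the hypothetical model from earlier):

```sql
-- One model per event, e.g., models/mixpanel/event_order_submitted.sql
select *
from {{ ref('order_events') }}
where event_name = 'order submitted'
```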
The Mirror sync
Instead of syncing based on the timestamp, Mirror identifies changes in the source table and syncs these changes into Mixpanel, similar to change data capture (CDC).
This also lets you change event tables for all historical data. For example, you could introduce a new calculated property like account_num_orders for all past events. With the standard sync, this would only apply to new events. But with Mirror, you can update all events.
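To make that concrete, here is a sketch of such a backfill in the warehouse model (order_events is the hypothetical model from earlier; Mirror would detect the changed rows and update the already-synced events):

```sql
-- Backfill a running order count per customer across all historical events
select
    events.*,
    count(*) over (
        partition by distinct_id
        order by event_time
        rows between unbounded preceding and current row
    ) as account_num_orders  -- number of orders the customer has placed so far
from {{ ref('order_events') }} as events
```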
For our example, we use the standard sync, and since we have just one event, this setup would serve us well in production, too. It might be worth switching to Mirror if you plan to introduce more events syncing from the warehouse. The good thing is that you can choose the sync method separately for each source you sync.
The event sync looks like this:
And you can also load the location information as a lookup table and map it to the location ID that we already have in our event model:
Now with the data in Mixpanel, you can start to enable your business teams with new self-service analysis.
Self-service order analytics
You can start the setup by checking the metrics layer that is already present in the dbt model:
You can build them quickly in Mixpanel, too:
As you can see, all the metric definitions on the left use the same order submitted event with different filters and aggregations.
The Total Orders metric is built using the aggregated sum of the revenue property:
For the New Customer Orders metric, we use an existing filter to count only the first time this event was ever recorded for a customer:
For Large Orders, we use the same definition as Total Orders but add a filter that the revenue per order must be greater than or equal to 20.
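If you want to sanity-check these numbers against the warehouse, all three metrics are easy to express in SQL on the event model (a sketch, reusing the hypothetical order_events model from earlier):

```sql
-- The three core metrics computed directly on the event model
with ranked as (
    select
        *,
        row_number() over (
            partition by distinct_id
            order by event_time
        ) as order_rank  -- 1 marks the customer's first-ever order
    from {{ ref('order_events') }}
    where event_name = 'order submitted'
)
select
    sum(revenue)                               as total_orders,        -- aggregated sum of revenue
    count(case when order_rank = 1 then 1 end) as new_customer_orders, -- first order per customer
    count(case when revenue >= 20 then 1 end)  as large_orders         -- revenue >= 20 filter
from ranked
```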
But now comes the self-service analytics part: our business teams are happy with the core metrics, but they want to know, and do, so much more with the data.
Self-serve use case 1: Revenue ops wants to learn about revenue retention
Your revenue ops team is eager to learn more about customers who buy again. They want to understand revenue retention over a longer period to develop forecasts for new customers and determine how much the marketing team can spend on acquiring them while keeping a high return on investment.
In Mixpanel, there are different ways to analyze this, but the best one is to start with a retention analysis:
Again, you only use the order submitted event to build the retention analysis: start with the first order and count repeat orders as the retention criterion. And as we can see here, the dbt sample data is dream data; almost all of the customers keep buying more things every month.
But the revenue ops team can now go some levels deeper by using the different breakdowns.
You can use the information about the purchased products to analyze product buying retention:
As you can see, with just one or two initial reports provided, the revenue ops team can easily do 20-30 different deep dives to understand revenue retention behavior across more dimensions.
Self-serve use case 2: Let the growth team build audiences for their discount initiative
Now let's say your growth team wants to focus on customer retention over the next few weeks. They have had good experiences with special discounts for very specific audiences. However, building these audiences used to take time, since the data team needed to build and sync them manually.
In Mixpanel, the growth team can build all of these different audiences themselves and run many more experiments.
Start with a simple one, the welcome discount, which is a standard audience for all online shops:
You simply filter for all new users and ensure they have submitted an order in the last 30 days. As you can see, there's flexibility with the lookback window, and we could even specify a minimum amount their first order must reach to include them in the cohort.
The next example could be a basic win-back initiative:
With this cohort, you select all customers who ordered something 30-120 days ago but haven't bought anything in the last 30 days. The action could be to send them a nice new offer. The growth team could even experiment with more specific offers; for example, they could restrict the cohort to customers whose initial orders had a higher order value.
Now, the final step is to activate these users. The growth team could use Mixpanel's Braze integration to automatically sync the defined cohorts as new audiences for customer messaging. In Braze, it's possible to set up emails, texts, or other communications to be sent to users when they appear in this cohort. And all of this happens automatically. As a final bonus, if you add the discount code to your order model in dbt, you can sync that data and enable the growth team to analyze their campaigns directly in Mixpanel.
Recap
As we've gone over, it only takes one simple step to extend your existing star schema into an event model that can be synced into Mixpanel to unlock true self-serve analytics.
In Mixpanel, different business teams can dive deeper, starting from the core metrics, to create new metrics, funnel and retention views, and break them all down by things like product information and locations.
We’ve gone over a lot here, but one big takeaway is that self-service analytics is an incremental approach. The more access that domain experts gain to data and analysis, the more possibilities you can unlock for better business outcomes. Like I said, magic really can happen.