Why slowly changing dimension tables are the best thing for your product analytics setup
Before we go into the beautiful details about slowly changing dimension (SCD) tables, let's first do some setup by laying out the importance of dimensions, or properties, as they’re known as in event analytics and product analytics.
Properties give any event and user essential context that enables you to understand and break down things to find insights that can drive business value.
To emphasize this more, we can use this example: Let’s say you have a ticket system product where track users; these users perform multiple events in each session, covering the core activities of your product, including creating tickets, adding information, assigning tickets, commenting on tickets, and closing them. On the general user level, this gives insights into how all users, on average, use the product. We can build funnels from ticket creation to closure that will tell us the total average of tickets closed rates and times to close a ticket.
But, naturally, you have completely different users. Some of them are project managers and others are contributors. Some work for startups and some work for big enterprises. Some of your users are active in the product mere days out of the month, while others use it for weeks. We can assume they will all use the product differently but still create value. But how do we recognize this?
This is where properties come in to help us to segment our data.
We can introduce three user properties to our setup: user role, user company size, and user experience level. When these are applied, we can create a ticket performance funnel broken down by roles, company sizes, and experience levels. This breakdown will show us different insight patterns that will help the product team. For example, it will help the product team work on the experience of new project managers in big enterprises because sales have identified this as a very promising commercial audience.
The different types of user properties
In Mixpanel setups, user profile properties enable a powerful use case. We set a user profile property like "Enterprise" for "user company size" and Mixpanel joins this information with all events that this user has done within our product, even when we haven't tracked "user_company_size" with any of these events before.
So, in our scenario, the sales team and product team sit together, and sales explains their strategy to talk to more enterprise customers. They and the product team would benefit highly from understanding how these users perform vs. all others. The data team now can get this implemented quickly, and because of the way Mixpanel handles user profile properties (by joining them at query time and not at ingestion time), both teams have immediate access to this new breakdown for all historical data.
This works great because "user_company_size" will not change significantly over time. Yes, companies change, and some companies will go from startup to enterprise in seven years, but these are exceptions, and it is still very slowly changing. So "user_company_size" is static; the same goes for a property like "user_company_industry."
The "user role" property is somewhere in between. Usually, job roles don't change weekly, but they do change over time. So a different way of tracking this would be helpful, as well.
On the other hand, "user experience level" is more dynamic. Naturally, users will gain experience level the longer they use the product. Let's say customer success uses a classification of:
- 1-4w = Beginner
- 5-10w = Intermediate
- 11-30w = Advanced
- 30w+ = Experts
By default, the value will change over time. In data terms, this is still slow (we would talk about fast when it happens in seconds or minutes), but it is different compared to "company size." Using this user profile property, we always overwrite all earlier levels when a new level is set. When a user gets to the advanced level, and we update the user profile property, all events will now inherit the advanced level, even when the user is on the intermediate or beginner level.
This is not helpful. In the past, I recommended a workaround for these properties to send an event instead, like "user experience change," and set the property here or send the property with all events. Not ideal. We’ll get to the new, more elegant solution later. (If you guessed it has something to do with slowly changing dimension tables, you’re on the money.)
First, I would like to briefly switch to the data warehouse space and see how this problem is solved in data modeling. This will be relevant when I show the new solution.
Slowly changing dimensions
In data modeling, we usually call properties “dimensional values.” And we encounter the same types of dimensional values as the ones we have described above. Our data engineering team integrated the user table from our production database. In the application database, engineers only care about the latest value. They will overwrite job roles, company sizes, and experience levels whenever this is changed in the application by user input or by backend jobs determining it. That makes sense; for the application, knowing the current experience level is important to show a specific context, but the historical one is usually irrelevant.
In data analytics, historical dimensions are very important since we often gain insights into developments over time. Because of this, overwriting dimensional values in the application database is a problem for data engineers since it removes all historical context.
Some data engineering teams use snapshots to overcome this problem. This setup will create snapshots of dimensional data from the application database often every night. One strategy would be to store each snapshot with a date and use it as a historical reference. But in most scaling data setups this will lead to immense storage costs.
Therefore, most data teams use the concept of slowly changing dimension tables. There are six types of SCD tables, but we will only focus today on the second type since this is the most commonly used type in data modeling.
In this version of a slowly changing dimensional table, the data is compared to the current version, and when a value has changed, a new row is inserted for the same entity (like the user) with the new values. This created duplicates in the table. We also added information about the time period when these values were valid to handle duplicates. This is often done by adding a “start_date,” “end_date,” and “active_record” column. Here, we add the information about when these values were valid.
Let's have a look at this example:
We have two entries for user 1. We see start and end dates for the "Beginner" experience level. For "Intermediate," start and NULL end dates are needed. This shows us that the second row is the current record.
When joining this data with events, we use the event timestamp to identify the right record for user 1. By that, we provide the right dimensional value at the right point in time when we go back in time.
The model becomes flexible by using this approach, and we can cover all changes in any dimensional value over time.
The new Profile History feature in Mixpanel
Now it is time to bring this all together. Mixpanel's new feature Profile History gives us more flexibility for the different user profile properties we have mentioned.
We use Mixpanel's new Warehouse Connectors and the new Mirror mode to enable this. The new Mirror mode lets us connect to a table in our data warehouse and sync it automatically into Mixpanel. This sync not only adds new values in the table but also removes deleted rows (because we fixed an issue) and all changes in values.
If we discover that some revenue data was collected incorrectly and needs correction, we apply this change in our data warehouse table, and Mixpanel will detect this change and update all data in Mixpanel with the change. Being in product analytics for over 15 years, this is the one feature I always wanted to have.
With these new connectors, we can unlock the new Profile History feature.
Coming back to our example:
After speaking with the product, marketing, and sales teams, we have identified that adding "company size," "job role," and "experience level" will help all teams unlock new insights and plan better activation based on this. This is a defining project for the next six weeks.
But instead of adding new tracking for user profile properties, the data team sits down and checks the data warehouse for the user data. Since the user data originally comes from the application database, the data team implemented the integration a while ago. The data is already used in business intelligence (BI) reports.
It is a best practice that the data team also implement the user table based on nightly snapshots and by using a type of slowly changing dimensional table like we just went talked about (this is not conveniently designed like this to serve the example; I come across this quite often).
With the new Warehouse Connectors, the data team now just goes into Mixpanel and adds a new warehouse dataset:
And it configures it to use a history table with Mirror:
As you can see, we map the column in our SCD table "start" to the "Start Time" in Mixpanel. With this information, Mixpanel automatically figures out when to use each value based on the time information.
That is all they need to do. With the first sync, we have all the data in Mixpanel that we need.
With the saved time, the data team also provides a new activity score as a new user profile property similarly. This activity score is calculated monthly and only works with the Profile History table feature.
The sales team now starts to use the new data to understand the change in activity score for their most important accounts and users.
Account activity score from the beginning of December 2023:
Account activity score some weeks later by the end of December 2023, just when the team wants to talk to the customer:
With this new information, the account manager knows that she must first talk about why the activity went down, creating a completely different customer experience.
Meanwhile, the product team can use this new information to analyze the change in time of activity scores:
The new feature Profile History unlocks these analytics use cases and many more.
But the real power comes in combination with the SCD tables. These tables often already exist in most data warehouses. They often contain data potentials that have not yet been unlocked because product and growth teams couldn't access them.
With the new Warehouse Connectors, Mirror, and Profile History features, data teams can enrich the Mixpanel data set with user data from the application database but also account/user information from CRM systems like Salesforce or Hubspot or with data from support systems like Zendesk.
With these new datasets, product analysts can now create better segmentations for the product and customer success teams to optimize features or product communication.