Instamojo on data stack philosophy
As the Head of Analytics at Instamojo, an e-commerce enabler serving over 1.5 million micro- to medium-sized businesses in India, Ankur Sharma is in charge of all things data: data engineering, strategy, and core data analytics. In this interview, he spoke to us about what’s in Instamojo’s data stack and the philosophy behind it.
What’s in your data stack?
Our data stack consists of three components. The first component extracts data: we use Fivetran, which brings in data from multiple sources. The second component is the data warehouse, which is Amazon Redshift. The third component consists of dashboards that we have built on top of Redshift. There are Business Intelligence (BI) dashboards like Klipfolio and tools like Periscope (acquired by Sisense) for business use cases. Then of course we have data going into Mixpanel for product analytics.
We also use Mixpanel to capture click-stream data coming from our platform. It is a crucial component because most consumer data is tracked via Mixpanel. Any behavioral analysis of our consumers or our business that we need in order to track our KPIs can be done in Mixpanel.
To integrate Mixpanel into our stack, we used the Mixpanel SDK since our implementation is older, but we are now seeing modern approaches like reverse ETL become available, which is interesting because it adds new options.
On top of this, we have our machine learning pipeline. We use AWS Lambda to build our machine learning models and to run some of the other tasks that sit on top of Redshift.
How does ELT change things?
Traditionally, companies have been using ETL. With ETL, data is extracted from multiple sources, transformed in a way you would like it to behave, and then loaded into the data warehouse. There is a lot of pre-processing, cleaning, and manipulation required.
On the other hand, ELT (Extract, Load, Transform) is a new paradigm. You can extract data from different sources and load it as it is into the warehouse without any cleaning or manipulation. The entire transformation is done in the data warehouse itself. ELT speeds up and simplifies the loading process. You can just take data from the source as it is—be it from databases, files, APIs, or webhooks. With the data loaded, you’re free to do any transformation you want. You don’t have to go back to change the transformation and reload the data again.
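The load-then-transform order described above can be sketched in a few lines. This is only an illustration under stated assumptions, not Instamojo's pipeline: Python's built-in sqlite3 stands in for the warehouse, and the table names (raw_orders, paid_orders), the sample rows, and the helper functions load_raw and transform are all hypothetical.

```python
import sqlite3

# Hypothetical source rows, kept exactly as they arrive:
# everything is text, with inconsistent casing and stray whitespace.
RAW_ORDERS = [
    ("1", "250.00", "PAID "),
    ("2", "99.50", "failed"),
    ("3", "120.00", "paid"),
]


def load_raw(conn, rows):
    """Load step: copy the source rows into the warehouse with no cleaning."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS raw_orders "
        "(order_id TEXT, amount TEXT, status TEXT)"
    )
    conn.executemany("INSERT INTO raw_orders VALUES (?, ?, ?)", rows)


def transform(conn):
    """Transform step: runs as SQL inside the warehouse.

    It rebuilds the derived table from raw_orders, so the logic can be
    changed and re-run without extracting or loading the data again.
    """
    conn.execute("DROP TABLE IF EXISTS paid_orders")
    conn.execute(
        """
        CREATE TABLE paid_orders AS
        SELECT CAST(order_id AS INTEGER) AS order_id,
               CAST(amount AS REAL) AS amount
        FROM raw_orders
        WHERE lower(trim(status)) = 'paid'
        """
    )


# sqlite3 in-memory database used as a stand-in for the real warehouse.
conn = sqlite3.connect(":memory:")
load_raw(conn, RAW_ORDERS)
transform(conn)
paid = conn.execute(
    "SELECT order_id, amount FROM paid_orders ORDER BY order_id"
).fetchall()
print(paid)
```

Because the raw table is never modified, an analyst who later wants a different definition of a paid order only edits the SQL in transform and runs it again.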
I believe ELT empowers analysts more than ETL; it allows them to grow their skills and become a full-stack owner of their analytics stack.
What’s the philosophy behind your data stack?
It’s built on the philosophy of being a lean team. This data stack allows us to remain a nimble organization. We’ve consciously chosen not to bloat ourselves with a large team and instead get more done with less. This is why we’ve picked tools that work well together and need little manual intervention from people in the company.
Our entire product is built on AWS, and we make sure to use AWS data products to keep the data stack connected. Even if we use outside products like Google Cloud Platform, we ensure that all data eventually makes its way back into Redshift. The idea is for all your analytical data, irrespective of source, to sit in one data warehouse so that it is accessible to everyone. It’s perfect that Mixpanel is able to easily connect with the other tools in our data stack.
Who should own the data stack?
I would advocate for end-to-end ownership of the data stack within the analytics team. If you need to change certain things in your data pipeline and have to wait on other teams to do it, you’ll end up stuck. It’s helpful to reduce dependency on other teams so that the analytics team can have full control. This also allows the team to get things done faster, try out different transformations quickly, and tweak things that don’t work out well.
It’s important to empower the analysts and make sure the right data is available for analysis. Owning the data stack would allow them to be more familiar with the whole process and better understand the benefits and limitations that come along with it.
How do you foresee your data stack evolving in the near future?
Most of the tools we have are built for scale, so I’m reasonably confident that our tech stack will grow with us as we grow as a company. I’m looking forward to plugging into a lot of APIs that have been elusive for us until now. There are new APIs coming in every day, especially in FinTech. I can also see that Mixpanel is evolving in the same way, which is great.
As we make inroads into supporting more small businesses in India and providing them with the right platforms to grow and manage their business, I see our tools evolving as well. It will be interesting to see how data continues to provide firepower for decision-making and to support our intuition.
What advice do you have for startups and scaleups looking to build their data stack?
If data is not your core product, don’t build your data stack in-house. Try to get it done with off-the-shelf tools. Building in-house is extremely resource-intensive. Especially for small teams, it is important to focus efforts on core competencies. Don’t involve your super valuable engineering resources to build this out when you can achieve similar results from available products. An in-house data stack may also become a limitation, as it is harder to scale it as your business grows.