I’m a Software Engineer at Mixpanel, working on our data export pipeline. My focus is on making it as easy as possible to send the data you collect in Mixpanel to your destination of choice.
A couple of months ago, we released a connector that sends data from Mixpanel to Amazon Redshift Spectrum, Google BigQuery, Snowflake, Google Cloud Storage, and Amazon S3. Since then, our customers have told us that this pipeline has saved them time and engineering hours, and enabled their teams to spend more time analyzing their data and less time preparing it.
This pipeline sends your data to Redshift Spectrum, which is different from Redshift. In this post, I’ll explain why we chose Redshift Spectrum, and why it will make your life easier.
What exactly is Redshift Spectrum?
Redshift Spectrum is simply the ability to query data stored in S3 using your Redshift cluster. You put the data in an S3 bucket, and a schema catalog tells Redshift what’s what. The schema catalog simply stores where the files are, how they are partitioned, and what is in them. The AWS service for catalogs is Glue. The setup is straightforward: your data lives in S3, and your Redshift cluster holds some of your other data. Then you just glue them together using Glue (ba dum tss). Now you can query the data from Redshift, and even join Redshift tables with your S3 data. Since this is a multi-piece setup, performance depends on multiple factors, including Redshift cluster size, file format, partitioning, etc.
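To make the glue-them-together step concrete, here is a minimal sketch of the two statements involved, built as strings in Python. The schema, Glue database, and IAM role names below are hypothetical placeholders for illustration, not our actual setup:

```python
# Sketch: the statements that wire a Glue catalog into Redshift.
# All names below (schema, Glue database, IAM role) are hypothetical.

def external_schema_ddl(schema, glue_database, iam_role_arn):
    """CREATE EXTERNAL SCHEMA points Redshift at a Glue catalog database."""
    return (
        f"CREATE EXTERNAL SCHEMA {schema}\n"
        f"FROM DATA CATALOG\n"
        f"DATABASE '{glue_database}'\n"
        f"IAM_ROLE '{iam_role_arn}';"
    )

def join_query(schema):
    """Once the external schema exists, S3-backed tables join with local ones."""
    return (
        f"SELECT u.plan, COUNT(*)\n"
        f"FROM {schema}.events e            -- lives in S3, read via Spectrum\n"
        f"JOIN users u ON u.id = e.user_id  -- ordinary Redshift table\n"
        f"GROUP BY u.plan;"
    )

print(external_schema_ddl("spectrum", "mixpanel",
                          "arn:aws:iam::123456789012:role/SpectrumRole"))
print(join_query("spectrum"))
```

The second query is the payoff: the `events` table never leaves S3, yet it joins with a regular Redshift table in one statement.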
Redshift Spectrum means cheaper data storage, easier setup, more flexibility in querying your data, and storage scalability
It’s easier for us because we can just write to S3 and Glue, and don’t need to send customers requests for more access. We don’t have to care about how much data there is; we just write to S3 with reliable, predictable throughput. The biggest bonus: through the marvel of AWS IAM and cross-account roles, we don’t need to store any credentials.
It is also cheaper for customers. They pay a small amount for S3 storage, and only pay for Spectrum when they read the data. Another great side effect of having a schema catalog in Glue is that you can use the data with more than just Redshift Spectrum. You can query the data using Athena (Presto), write Glue ETL jobs, access the formatted data from EMR and Spark, and join your data with many other SQL databases in the AWS ecosystem.
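As a rough back-of-the-envelope, here is what that pricing model looks like. The rates below are assumptions (roughly S3 standard storage at $0.023/GB-month and Spectrum at $5 per TB scanned at the time of writing); check AWS’s pricing pages for the current numbers:

```python
# Back-of-the-envelope Spectrum cost model. Rates are assumptions,
# not quotes -- check current AWS pricing before relying on them.

S3_USD_PER_GB_MONTH = 0.023   # assumed S3 standard storage rate
SPECTRUM_USD_PER_TB = 5.0     # assumed Spectrum per-TB-scanned rate

def monthly_cost_usd(storage_gb, tb_scanned_per_month):
    """Storage is billed continuously; Spectrum only when queries scan data."""
    storage = storage_gb * S3_USD_PER_GB_MONTH
    scans = tb_scanned_per_month * SPECTRUM_USD_PER_TB
    return storage + scans

# 500 GB of events, 2 TB scanned by queries in a month:
print(monthly_cost_usd(500, 2))  # 500*0.023 + 2*5.0 = 21.5
```

Note that the scan term is the only one that depends on query activity: partitioning and columnar formats reduce the bytes each query scans, and therefore the bill.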
Why we chose Redshift…at first
But let’s go back to the beginning. When we started building our Data Warehouse Export pipeline in early 2018, we had to pick the first data warehouse to support. Amazon Redshift was the obvious choice, for two major reasons.
First, I had used Redshift previously on a considerable scale and felt confident about ETL procedures and some of the common tuning best practices. Second, it is part of AWS, and that alone makes Redshift’s case strong for being a common component in a customer’s stack.
Problem #1: Copying data to Redshift does not scale
As soon as we began testing the first version of the Redshift adapter, problems began to appear. I had allocated a small cluster for development purposes and used a basic method of creating tables and issuing COPY statements to load data into them. The test project had around 300 event types, which meant loading would create 300 tables. The slow CREATE statements would end up acquiring the lock on the schema and becoming sequential. They were followed by transactions to load data into each table, because you cannot COPY to a table outside a transaction.
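To illustrate the shape of the problem, here is a sketch of what that “basic method” looks like when each event type gets its own table. The statement templates are simplified stand-ins for illustration, not our actual loader:

```python
# Sketch of the naive per-event-type loader: one CREATE TABLE plus one
# COPY transaction per event type. With ~300 event types, every CREATE
# contends for the schema lock, so the whole batch runs sequentially.

def load_statements(event_types, bucket):
    statements = []
    for event in event_types:
        # Each CREATE takes the schema lock, serializing the batch.
        statements.append(f"CREATE TABLE IF NOT EXISTS {event} (...);")
        # Each load runs as its own transaction.
        statements.append(
            f"BEGIN; COPY {event} FROM 's3://{bucket}/{event}/'; COMMIT;"
        )
    return statements

stmts = load_statements([f"event_{i}" for i in range(300)], "export-bucket")
print(len(stmts))  # 600 statements, executed one after another
```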
These were just the issues that arose with a blank cluster. When updating the data in a cluster that’s already holding Mixpanel data, the issues get even worse.
Problem #2: VACUUMing with too much data can cause a circle of data death
A quick note about VACUUMing. When the data volume is high, vacuuming can be unpredictable. For example, VACUUMs can take longer than expected, which can suffocate the cluster during the hours when all the nightly batches are trying to load data into it. As a result, all those loads also take considerably longer to finish. Just imagine: the VACUUM after one export lasts until the next export begins. That starts a cycle that never ends unless you pause the exports or skip a few VACUUMs, which causes missing data or slow queries due to unsorted data. Skip too many VACUUMs, though, and you risk a mega-vacuum that lasts so long you’ll start Googling how to monitor, debug, or even cancel a VACUUM on Redshift. That is a painful process in itself, due to the huge amount of data stored in the cluster.
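The death spiral is easy to see in a toy model. Assume each nightly export adds a fixed number of unsorted rows (in millions, say) and VACUUM sorts rows at a fixed rate; the numbers below are made up purely for illustration:

```python
# Toy model of the VACUUM death spiral. Numbers are illustrative only:
# each export dumps unsorted rows, and VACUUM sorts at a fixed rate.

def unsorted_backlog(rows_per_export, vacuum_rate_per_hour,
                     hours_between_exports, days):
    """Track how many unsorted rows remain after each daily export."""
    backlog, history = 0, []
    for _ in range(days):
        backlog += rows_per_export                        # nightly export lands
        backlog -= vacuum_rate_per_hour * hours_between_exports
        backlog = max(0, backlog)                         # can't sort below zero
        history.append(backlog)
    return history

# VACUUM keeps up: 50/night vs. 3*24 = 72 sortable per day -> backlog stays 0.
print(unsorted_backlog(50, 3, 24, 5))   # [0, 0, 0, 0, 0]
# VACUUM can't keep up: 100/night -> backlog grows without bound.
print(unsorted_backlog(100, 3, 24, 5))  # [28, 56, 84, 112, 140]
```

Once that backlog line starts climbing, every VACUUM has strictly more work than the last one, which is exactly the never-ending cycle described above.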
Problem #3: It requires A LOT of coordination
This got me thinking. Should I fine-tune WLM to give my data loader user more resources? Or optimize the process to shave a few seconds off the load time? But the more I thought about it, the more obvious it became that designing a Redshift export process that would work for every customer would be extremely challenging.
Trusting another company with your company’s data comes with limitations. Customers will most likely create a data loader user for the provider and whitelist a set of IPs that can connect to the destination cluster. At that point, in the case of Redshift, they trust (or make sure) that you store the credentials safely. But this always spurred more questions, such as:
- Can you configure your WLM to give the Mixpanel user more resources?
- Your cluster is too small for your data, can you please resize? (Resizing a Redshift cluster is a story of its own)
- Can you make sure nothing else is running until I finish writing data to your warehouse?
- Can we coordinate when you run VACUUMs? Is it OK if I do a deep copy?
The list goes on, but sometimes the conversation hits a dead end after the first question.
How we settled on Redshift Spectrum
In the midst of trying to find the right cluster size for our test data, a customer call for Redshift export was scheduled. This customer had a lot of expertise in AWS and a savvy data team that used the latest and greatest technologies on the platform. They were looking into replacing their in-house Mixpanel-to-Redshift ETL pipeline so they could stop ingesting their events and people profiles twice. During the call, they asked about writing data to S3 and querying it from Redshift. Initially, I assumed they wanted to take over the process right after the S3 export and manage the Redshift copies on their own. But once it became clear that they were looking to export to Redshift Spectrum, I saw the light!
But what if I REALLY want my data in Redshift?
Ok, maybe we haven’t convinced you. If you really want your data in Redshift itself, or in your data lake, you can still use our data pipeline for that. If you have the schema and your data is in S3, you can follow these directions to load your data into Redshift. Or, if you have Glue enabled with Mixpanel, or use crawlers, you can use these directions. Both methods are straightforward.
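For the first route, the core of it is Redshift’s COPY command reading straight from S3. The table name, bucket path, and IAM role below are hypothetical placeholders; the linked directions cover the details:

```python
# Sketch of the COPY statement that pulls exported files from S3 into a
# native Redshift table. Table, path, and role names are hypothetical.

def copy_from_s3(table, s3_prefix, iam_role_arn, fmt="PARQUET"):
    """Redshift's COPY can ingest Parquet directly with FORMAT AS PARQUET."""
    return (
        f"COPY {table}\n"
        f"FROM '{s3_prefix}'\n"
        f"IAM_ROLE '{iam_role_arn}'\n"
        f"FORMAT AS {fmt};"
    )

print(copy_from_s3("mp_events", "s3://export-bucket/events/2019/06/01/",
                   "arn:aws:iam::123456789012:role/RedshiftCopyRole"))
```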
Sign me up for Data Warehouse Export
Data warehouse export is available to customers on a paid plan and comes with a 30-day free trial, which includes a daily scheduled export and one day of backfill.
Please contact your Customer Success Manager, Relationship Manager, or Account Executive to get started, or start using the trial today.
Currently we support these data warehouses: Snowflake, Amazon Redshift Spectrum, and Google BigQuery, and these cloud storage platforms: Google Cloud Storage and Amazon S3. Don’t see your destination of choice listed here? We’d love to hear from you! Reach out to us at firstname.lastname@example.org and let us know. Or, if you want to solve problems like these for thousands of tech companies, we’re hiring!