As we built Interana, we knew that ingesting data is often the biggest pain point in trying and adopting an analytics system. Engineers can spend months learning about the source data that will be fed into the new tool, determining the appropriate structure for storing it in the new product, and implementing the dreaded ETL pipeline. All of this is needed before the system can even start to be useful. We wanted to minimize this large, upfront investment and at the same time avoid some of the common pitfalls that trip up developers.
Our design was guided by some hard-earned insights, gathered across decades of work on diverse products that span the spectrum from spreadsheets, to databases, to content management systems, and all the way to a little web app called Facebook.
- Build it simple. This is much harder than it sounds. Robust systems have as few moving parts as possible, with each part well compartmentalized. Complexity will creep in, but it needs to be isolated in a way that protects the common case from becoming torturous.
- Minimize schema. Reducing the required elements to a minimum and allowing columns to be sparse and added on the fly creates a flexible system that minimizes initial setup as well as storage costs.
- Gotta catch 'em all. Don't throw away data. Have you ever seen an anomaly in a dashboard, then realized there was no way to dive in and find the actual source without re-importing all the data into a new schema?
- Unite your events. A union of events in a single table is a far more powerful representation of what is happening than joins across different tables.
- Create expensive derivations on import. IP-to-geo lookup is a perfect example of this. Finding the city or country from an IP address can take precious resources, and computing it once at import, out of the query scan loop, is a huge performance win.
- DDD – Don't Double Data. We created an easy way to de-duplicate data on import, which allows a lazy approach to cursors and save points in the import pipeline. It means that bringing in the same file multiple times is just repeated work, not corrupted query results, and that you won't ever again need to tell your CEO that sales haven't really doubled…
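To make the "unite your events" point concrete, here is a minimal sketch of a sparse, unioned event table. The field names are illustrative, not Interana's schema; the point is that heterogeneous events live in one time-ordered table, each row carrying only the columns it has, so a question about behavior is a single scan rather than a join.

```python
# A union of heterogeneous events in one sparse table. Each event keeps
# only the columns that apply to it; missing columns simply aren't there.
events = [
    {"ts": 1, "user": "a", "event": "signup",   "plan": "free"},
    {"ts": 2, "user": "a", "event": "click",    "page": "/pricing"},
    {"ts": 3, "user": "a", "event": "purchase", "amount": 49},
]

# One pass over the table answers "what did user a do before purchasing?"
# No joins across per-event-type tables are needed.
history = [e["event"] for e in events if e["user"] == "a" and e["ts"] < 3]
```

With per-type tables, the same question would require joining signups, clicks, and purchases on user and time, which gets painful as event types multiply.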
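The expensive-derivations-on-import idea can be sketched as follows. The lookup table and prefix matching below are stand-ins for a real GeoIP database (such as a MaxMind reader); the names `geo_lookup` and `enrich_on_import` are hypothetical, not part of any product API.

```python
# Enrich events with an IP-to-geo lookup once at import time, so queries
# scan a precomputed "country" column instead of resolving IPs per row
# inside the query loop.

GEO_TABLE = {                 # stand-in for a real GeoIP database
    "203.0.113": "AU",
    "198.51.100": "US",
}

def geo_lookup(ip: str) -> str:
    """Resolve an IP to a country via its /24 prefix (illustrative logic)."""
    prefix = ip.rsplit(".", 1)[0]
    return GEO_TABLE.get(prefix, "unknown")

def enrich_on_import(events):
    """Pay the lookup cost once per event at import, not once per query."""
    for event in events:
        event["country"] = geo_lookup(event["ip"])
        yield event

enriched = list(enrich_on_import([{"ip": "203.0.113.9", "action": "click"}]))

# Queries now filter on the stored column directly:
au_clicks = [e for e in enriched if e["country"] == "AU"]
```

Every query that touches geography reads the stored column; the lookup cost is paid exactly once per event, at import.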
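Finally, a sketch of what import-time de-duplication can look like. This is not Interana's implementation; it just illustrates the idea of keying each record by a content hash so that replaying a file is idempotent.

```python
import hashlib

class DedupingImporter:
    """Illustrative import-time de-duplication: each record is keyed by a
    hash of its content, so re-importing the same file is repeated work
    rather than doubled rows."""

    def __init__(self):
        self.seen = set()   # in production this would be a persistent index
        self.table = []

    def ingest(self, records):
        for record in records:
            key = hashlib.sha256(
                repr(sorted(record.items())).encode()
            ).hexdigest()
            if key in self.seen:
                continue            # duplicate: skip, don't double the data
            self.seen.add(key)
            self.table.append(record)

importer = DedupingImporter()
batch = [{"user": "a", "amount": 10}, {"user": "b", "amount": 5}]
importer.ingest(batch)
importer.ingest(batch)   # a lazy cursor replays the same file: no harm done
```

Because duplicates are dropped at the door, the import pipeline's cursors and save points can be sloppy about exactly-once delivery without corrupting query results.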