The Modern Analytics Stack: How Today’s Fastest Growing Companies Build Their Data Pipelines
The data pipeline you need for proper event analytics has undergone a huge shift in the last few years. Gone are the messy logfiles and “high priests of data” needed to interpret them. Gone are the slow dashboard widgets and their slow search queries. Gone, too, are the inefficient tools behind them—well, if not gone, then at least irrelevant.
Building a data-informed culture means using tools that create informed team members.
If you’re using Hadoop for analytics, then you’re ensuring that only those people who can write MapReduce programs can ask questions of your data.
If you’re using modern data tools that let you build a clean pipeline no matter your event volume, and analyzing the results with Interana, then you’re ensuring that everyone on your team can ask questions of your data and get quick responses back without having to go off to learn… Java.
Here’s what that modern data landscape looks like.
Unless you’re sending files to our HTTP API or uploading JSON directly to our servers, the data you want to analyze is probably in either Amazon S3 or Microsoft Azure Blob storage.
There used to be only one real option in this space, but today the differences between the two are thinner than ever.
Amazon Web Services has arguably been the top dog in cloud computing since its launch in 2006. And S3, or Simple Storage Service, has been at the center of that dominance.
First available in March of 2006, Amazon S3 offered cloud file storage at an unprecedented price point, enabling a generation of startups and entrepreneurs to emerge. Today, it’s used by Netflix, Reddit, Dropbox, Tumblr, and Pinterest.
However, owing to outages—as well as pricing confusion and product complexities—some startups are looking elsewhere for their cloud storage needs.
Azure is the crown jewel of Microsoft’s resurgence as orchestrated by CEO Satya Nadella. Since taking over the company in 2014, Nadella has restored its reputation by pivoting away from Windows and towards cloud computing. And while AWS still owns the cloud, Azure has grown quickly to become a real challenger.
The advantage is with Azure if you’re a Microsoft shop, of course. If you have an MSDN subscription, you get free credits to try Azure.
For a decade, the big problem with logging events was the near impossibility of getting answers when searching across a dataset of any size.
That problem has been solved in part by technologies like Kafka that enable real-time data processing, and in part by technologies that make logging smarter. The latter is basically about making your applications emit logs that are written to be read later. Plus, for business users, these kinds of tools can offer integrations with third-party APIs (for lead enrichment, say) before the data is sent into software like Interana.
You don’t have to wrangle your data to get it to work in Interana: we will ingest data from your pipeline no matter what it consists of.
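To make "logs that were written to be read later" concrete, here is a minimal Python sketch that emits each event as one JSON object per line. The log_event helper and its field names (ts, event, user_id) are hypothetical, picked only for illustration; they are not a schema that Interana or Kafka requires.

```python
import json
import time


def log_event(event, user_id, ts=None, **properties):
    """Return one event as a single JSON line (hypothetical schema)."""
    record = {
        "ts": ts if ts is not None else time.time(),  # epoch seconds
        "event": event,
        "user_id": user_id,
    }
    record.update(properties)  # free-form extra fields
    # sort_keys keeps the output stable, so lines are easy to diff and grep
    return json.dumps(record, sort_keys=True)


if __name__ == "__main__":
    print(log_event("signup", "u42", ts=1700000000, plan="free"))
```

Because every line is a self-describing record, any downstream tool can parse it without a custom log parser.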
mParticle builds a tool that allows you to build more sophisticated logging around your event data and send that data to the vendor or warehouse of your choice.
Interana has a transformation tier that converts event data into line-delimited JSON and does lightweight transforms and data cleaning, but you may want to transform your data beforehand, for instance to enrich your user data with third-party APIs or to create audience segments (although you can do this in Interana).
Bleacher Report, one of our customers, retired most of their data stack when they realized they could use Interana, coupled with mParticle for data ingest, to achieve better results than they had before.
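If you do decide to clean events yourself before sending them on, the idea looks roughly like the Python sketch below. This is not Interana’s transformation tier; the to_ndjson function and the cleaning rules (normalize key names, drop empty values) are assumptions made up for this example.

```python
import json


def to_ndjson(raw_events):
    """Clean raw event dicts and yield one JSON string per event."""
    for raw in raw_events:
        # Drop empty values and normalize keys: trimmed, lowercase
        cleaned = {
            str(key).strip().lower(): value
            for key, value in raw.items()
            if value not in (None, "")
        }
        if cleaned:  # skip events that are empty after cleaning
            yield json.dumps(cleaned, sort_keys=True)


if __name__ == "__main__":
    raw = [
        {" User_ID ": "u1", "Event": "click", "referrer": ""},
        {"user_id": "u2", "event": "view", "page": None},
    ]
    print("\n".join(to_ndjson(raw)))
```

Each output line is one complete JSON object, which is what "line-delimited JSON" means in practice.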
For more, check out our mParticle integration cookbook.
Segment’s data hub product lets you collect all your customers’ data in one place and send it wherever you please using their long list of integrations. They’re like the glue between your S3 buckets and behavioral analytics like Interana.
For more, check out our Segment integration cookbook.
Data Ingest and Processing
Once your needs grow to a sufficient level of complexity (lots of writes, constant real-time read access, fault tolerance), you’re going to need a real-time data ingest and processing system. That might be an integration of Apache Kafka with Apache Storm or Spark Streaming: Kafka as your backbone and communication hub, and Storm or Spark Streaming to process the streams of data it sends.
They’re the engines of your data pipeline, funneling the event information you need to Interana so that it can be parsed, analyzed, and understood.
Apache Kafka was first released in 2011 after being developed by a team inside LinkedIn to help handle the growing company’s data-processing woes. It’s a message broker: messages come from “producers” (web apps, wearables), are organized into named “topics,” and are delivered to the “consumers” (like Interana) that subscribe to those topics.
The reason it’s become so popular is that it lets you work with those messages in a stream. You can publish, subscribe to, store, and process those streams as the events are actually happening. That makes Kafka perfectly suited for real-time applications, where you need to be able to manipulate the data and analyze it quickly. Kafka becomes a hub for all that data, buffering it and serving it to all the “consumers” that subscribe to it—anyone who wants to do something useful with it can tap in.
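The hub idea is easier to see in a toy model. The Python sketch below is an in-memory stand-in, not the Kafka client API: ToyBroker and ToyConsumer are hypothetical names, and real Kafka adds partitions, replication, and durable storage that this leaves out. What it does show is that a topic is an append-only log and that each consumer keeps its own position in it.

```python
from collections import defaultdict


class ToyBroker:
    """In-memory stand-in for a message broker; not the Kafka client API."""

    def __init__(self):
        self._topics = defaultdict(list)  # topic name -> append-only log

    def publish(self, topic, message):
        """Append a message to a topic and return its offset."""
        log = self._topics[topic]
        log.append(message)
        return len(log) - 1

    def read(self, topic, offset):
        """Return every message in the topic from `offset` onward."""
        return list(self._topics[topic][offset:])


class ToyConsumer:
    """Tracks its own position, so consumers read independently."""

    def __init__(self, broker, topic):
        self.broker = broker
        self.topic = topic
        self.offset = 0

    def poll(self):
        messages = self.broker.read(self.topic, self.offset)
        self.offset += len(messages)
        return messages


if __name__ == "__main__":
    broker = ToyBroker()
    analytics = ToyConsumer(broker, "clicks")
    archiver = ToyConsumer(broker, "clicks")
    broker.publish("clicks", {"user_id": "u1"})
    print(analytics.poll())  # sees the first message
    broker.publish("clicks", {"user_id": "u2"})
    print(analytics.poll())  # sees only the new message
    print(archiver.poll())   # sees both, from its own offset
```

Because the broker buffers messages and each consumer tracks its own offset, a slow consumer never holds up a fast one, which is why anyone who wants the data can tap in.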
At Interana, we place all events sent to our HTTP API on an internal Kafka bus before sending them to the next stage of the ingest pipeline.
A modern analytics stack needs real-time analytics. For real-time analytics, you need a pipeline that’s always feeding you new data.
That’s where a streaming framework like Storm comes in. It does for real-time processing what Hadoop did for batch processing.
Storm (and Spark Streaming, below) are often used to fix recurring issues with Hadoop, but they function better when they’re used as the processing arm of a real-time data platform.
Storm “spouts” can communicate with the data sources you’d expect, such as Apache Kafka, logs, and APIs. You use “bolts” to transform and otherwise manipulate your data. And you get exactly-once processing with Trident, Storm’s high-level abstraction layer.
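If the spout-and-bolt vocabulary is new to you, this plain-Python sketch mimics the shape of a topology with generators. It is not the Storm API, and it does not model distribution or Trident’s exactly-once guarantees; the names event_spout, parse_bolt, and count_bolt, and the "user, action" line format, are made up for illustration.

```python
def event_spout(source):
    """Spout: emit tuples one at a time from a source (here, a list)."""
    for item in source:
        yield item


def parse_bolt(stream):
    """Bolt: split 'user,action' strings into dicts; skip malformed rows."""
    for line in stream:
        parts = line.split(",")
        if len(parts) == 2:
            yield {"user": parts[0].strip(), "action": parts[1].strip()}


def count_bolt(stream):
    """Bolt: emit a running count per action as each tuple arrives."""
    counts = {}
    for event in stream:
        action = event["action"]
        counts[action] = counts.get(action, 0) + 1
        yield action, counts[action]


if __name__ == "__main__":
    lines = ["u1, click", "bad row", "u2, click", "u1, view"]
    for action, total in count_bolt(parse_bolt(event_spout(lines))):
        print(action, total)
```

Each record flows through the whole chain as soon as it arrives, which is the one-record-at-a-time behavior that sets Storm apart from batch processing.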
Spark Streaming is similar to Storm in many ways, but there are differences. Spark isn’t strictly for real-time processing: it processes events in micro-batches rather than one record at a time, as Storm does. This can be problematic if your event volume is uneven, as Jim Scott at MapR points out:
If you end up with 500 events in a window at your peak time of day, but then every window has one event, you’re probably not going to be happy at the end of the day when you have RDDs with one object each in them, as you may see negative performance impact.
Spark does have the advantage of the RDD—the resilient distributed dataset—which allows your data to be processed in a fault-tolerant manner. If you’re already using Spark for other reasons, then Spark Streaming probably makes sense for your business.
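To see why uneven traffic matters for micro-batching, here is a small Python sketch that groups timestamped events into fixed windows. It is not Spark Streaming code; the micro_batches function and the sample events are hypothetical, and unlike a real streaming job this sketch simply omits windows that contain no events.

```python
from collections import defaultdict


def micro_batches(events, window_seconds):
    """Group (timestamp, payload) pairs into fixed time windows."""
    windows = defaultdict(list)
    for ts, payload in events:
        window_start = int(ts // window_seconds) * window_seconds
        windows[window_start].append(payload)
    # Return batches in time order, like successive micro-batches
    return [(start, windows[start]) for start in sorted(windows)]


if __name__ == "__main__":
    # A burst of 5 events in the first second, then one event per window
    events = [(0.1 * i, f"e{i}") for i in range(5)] + [(1.5, "e5"), (2.5, "e6")]
    for start, batch in micro_batches(events, window_seconds=1):
        print(start, len(batch))
```

The first window carries a burst while the later ones carry a single event each, so the fixed per-batch overhead gets paid for almost no work. That is the situation the quote above warns about.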
The Full-Stack Solution
Interana works with all your existing data sources and data pipelines.
Most importantly, it also works equally well if you aren’t using any of the tools that we talked about above. That’s because Interana is a full-stack analytics solution. Funnel in anything that produces “events,” whether that’s a web application or a Wi-Fi enabled microwave, and Interana will pull the data and put it into an intuitive interface that every member of your team can use to do real-time analysis.
No data wrangling or re-processing necessary, because all of that is integrated directly into the product. You can turn your messy data into well-formed JSON with our transform engine, send it into a database as a stream, and then turn that into an easily accessible dashboard you can use to ask questions about behavior and define funnels, segments, and cohorts for closer analysis.
That means no more need to worry about your different tools and how they work together, no more delicate, brittle data workflows, and no more “high priests of data” having all of the fun. That’s good, because those ninjas really are too busy.