Data Clutter

Image: Kevin Utting, www.flickr.com/photos/tallkev/, CC-BY-2.0

INTRODUCTION

Sampling is important for anybody who wants to explore vast amounts of data. We do it all the time without thinking about it when faced with more information than we can easily consume. And when it comes to behavioral analytics, finding the right question is often even harder than finding the right answer.

Sampling with Interana supercharges the process by reducing query response time to a few seconds, even across billions of events. You can explore questions more quickly, worry less about asking questions that don't go anywhere, and generally dive deeper into your data. A typical exploratory session starts with a vague idea about some gem that might be lurking in the data, turns into a series of queries that zero in on the right question, and ends with pinning the final query to a dashboard.

Other solutions require sampling just to be able to work at scale. For Interana, sampling is a useful tool, not a requirement for fast results at massive scale. You can get statistically accurate sampled results at the speed of thought, refine your queries, and, when you're ready, get fast unsampled results across hundreds of billions of events. Another Interana advantage is that we ingest and store all the raw events and optionally sample during the query. Solutions that sample at ingest are not well suited for behavioral analytics at scale.

Let’s review some key concepts:

  • Sampling is the process of selecting a subset of some general population that can be used to accurately estimate important attributes of the population as a whole. That subset is called the sample.
  • Behavioral Analytics focuses on how various actors behave when interacting with a product or service. Those actors could be people, devices, sensors, etc. The behavior is tracked as a sequence of events that occur at specific times. The order, duration, and time between events are all relevant for understanding behavior.
  • Event data is data from any occurrence that has significance for a product or service. Each event describes an action associated with an actor at a specific time, as in the sketch below.
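
To make these concepts concrete, here's a minimal sketch of event data in Python. The field names and example values are illustrative, not Interana's actual schema.

    from dataclasses import dataclass
    from datetime import datetime, timezone

    @dataclass
    class Event:
        """One occurrence with significance for a product or service."""
        actor_id: str        # who (or what) acted: a user, device, sensor, ...
        action: str          # what happened: "login", "search", "purchase", ...
        timestamp: datetime  # when it happened

    # A behavioral sequence is just one actor's events in time order.
    events = [
        Event("user-42", "login",    datetime(2017, 3, 1, 9, 0, tzinfo=timezone.utc)),
        Event("user-42", "search",   datetime(2017, 3, 1, 9, 1, tzinfo=timezone.utc)),
        Event("user-42", "purchase", datetime(2017, 3, 1, 9, 4, tzinfo=timezone.utc)),
    ]

    # Order, duration, and the time between events all carry behavioral signal.
    session_length = events[-1].timestamp - events[0].timestamp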
Millions of Users

Image: Paul Wilkinson, www.flickr.com/photos/eepaul/, CC-BY-2.0

It’s also critical to keep the context in mind. Many new and interesting use cases involve connected applications with millions of users and thousands of events per user session. Interana customers see hundreds of millions of events per hour. Being able to ingest, store, and analyze all that data in terms of behavior takes a dedicated approach. General-purpose solutions for big data analytics might keep up at smaller scales, but at high volumes they're forced to make compromises: using much more expensive clusters, taking longer to get answers, and depending on busy data scientists to translate basic questions into code.

The attraction of behavioral analytics is discovering something new about users and interactions. That discovery process is exploratory by nature, and exploration is best done interactively. There's just something compelling about diving deep into the data and seeing it in new ways and from different angles. Time to Discovery is a critical measure of an analytics solution. The good news is that there's usually a workable tradeoff between how accurate an answer needs to be, how much a more accurate answer costs, and how much value the additional accuracy brings to the organization.

WHEN TO SAMPLE

Statistical sampling is as old as the field of statistics itself. But sampling has a mixed reputation with Big Data users. For behavioral analytics of event data, there are clearly right and wrong ways to sample.

It’s tempting to sample at the data collection points. There are potential upsides: the data shrinks and gets easier to ingest, less data needs to be stored, and it can be processed as-is without further reduction. But for behavioral analytics, this approach is tricky and limited.

Figure 1: Selecting events without regard to actors is the wrong way to sample

First, the sampled events must represent a series of actions by a set of actors. Their contents, sequence, and timing all matter, so you can't just take every 100th event (Figure 1). Second, we don't know ahead of time which actors will be interesting; that's part of the discovery process, and the criteria may change from query to query. Lastly, sampling might reduce the collected data so much that it can't be used for important workflows like A/B testing: the fraction of sampled users shown the modified product might become too small to draw statistically meaningful conclusions.

Figure 2: The correct way to sample is by selecting representative actors at query time

For behavioral analytics of event data, the correct approach is to record all the events and make them part of the dataset. Sampling needs to be based on all the events for a representative set of actors (Figure 2) from the population, and it needs to happen at query time, not during ingest. This approach moves the burden of correct sampling off the end user and onto the analytics platform. If the answer is so clear-cut, why isn't everybody doing it the same way? The answer is implementation: a solution focused on event data can organize and manage data in ways that don't make sense for a general-purpose analytics solution, and that organization brings the power to store and query huge volumes of event data efficiently.
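
Reusing the Event records from the earlier sketch, here's a hypothetical illustration of the difference between the two approaches. Hash-based selection is a common way to pick a stable, representative set of actors; the hash and the sampling fraction here are illustrative, not Interana's implementation.

    import hashlib

    def sample_every_nth(events, n=100):
        # The wrong way (Figure 1): keep every nth event regardless of actor.
        # Each surviving actor is left with a shredded, unusable sequence.
        return [e for i, e in enumerate(events) if i % n == 0]

    def sample_by_actor(events, keep_fraction=0.01):
        # The right way (Figure 2): choose a representative subset of actors,
        # then keep *all* events for each chosen actor.
        def keep(actor_id):
            # A stable hash spreads actors uniformly over hash space, so any
            # fixed slice of that space is a representative population slice.
            h = int(hashlib.sha256(actor_id.encode()).hexdigest(), 16)
            return (h % 10_000) < keep_fraction * 10_000
        return [e for e in events if keep(e.actor_id)]

Because the actor-level sample preserves complete event sequences, order, duration, and time between events remain intact for every sampled actor.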

INTERANA APPROACH

Like all modern analytics platforms, Interana is architected (Figure 3) using a scale-out clustered approach. The cluster contains multiple nodes (machines), and their number scales to match the demands of the data and the concurrent users. Each machine has one or more jobs within the overall solution. The import nodes ingest event data from event logs, traces, etc. The data nodes store and scan the event data efficiently. The string nodes compress and deduplicate all the strings in the event data, handling storage and translation so that all data operations can take place efficiently. The end result is that no matter the scale, Interana can match the requirements. Our customers run some of the busiest services in the world, with clusters hosting over a trillion events.

Figure 3: Interana’s scale-out architecture allows flexible sizing
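
As a rough illustration of what the string nodes do, here's a minimal dictionary-encoding sketch. It assumes a simple encode/decode scheme; Interana's actual compression and storage format isn't described here.

    class StringDictionary:
        """Store each distinct string once; reference it by a compact id."""

        def __init__(self):
            self._id_by_string = {}
            self._strings = []

        def encode(self, s):
            # Assign a new id on first sight, reuse it ever after.
            if s not in self._id_by_string:
                self._id_by_string[s] = len(self._strings)
                self._strings.append(s)
            return self._id_by_string[s]

        def decode(self, string_id):
            return self._strings[string_id]

    d = StringDictionary()
    ids = [d.encode(a) for a in ["login", "search", "login", "login"]]
    assert ids == [0, 1, 0, 0]           # repeated strings share one id
    assert d.decode(ids[1]) == "search"  # translate back for display

Deduplicated integer ids make events smaller to store and faster to compare during scans than the raw strings they replace.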

One mechanism that supports accurate behavioral sampling is making sure all the data for an actor is stored together on the same node, and that the storage and placement of individual actors is fair and even for all actors in the population. We do that mathematically by using a hash with appropriate properties, then fairly assigning each actor to one of many shards (containers). Because the actors are evenly distributed among shards, every shard contains a representative slice of the overall population, and sampling takes advantage of that fact. When we sample, each data node processes a subset of the shards and scans the requested time range for all the actors in the selected shards. Because even a single shard may hold more actors than necessary for a statistically valid sample, we can sample with progressively larger subsets of the shards until we have sufficient confidence in the sampled result.
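
Here's a minimal sketch of that idea, assuming an illustrative shard count and a caller-supplied confidence test. It's a simplification of the scheme described above, not Interana's implementation.

    import hashlib

    NUM_SHARDS = 256  # illustrative; a real cluster sizes this to the data

    def shard_for_actor(actor_id):
        # A uniform hash keeps placement fair, so every shard ends up
        # holding a representative slice of the actor population.
        digest = hashlib.sha256(actor_id.encode()).digest()
        return int.from_bytes(digest[:8], "big") % NUM_SHARDS

    def progressive_sample(shard_ids, count_in_shard, confident):
        # Scan progressively more shards until the running estimate passes
        # the confidence test, then scale up to the full population.
        counts = []
        for shard_id in shard_ids:
            counts.append(count_in_shard(shard_id))
            if confident(counts):
                break
        return sum(counts) * (len(shard_ids) / len(counts))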

Figure 4: Sampling warning automatically alerts the user when sampling isn’t appropriate

Of course, sampling isn't always appropriate. Certain data isn't going to be evenly distributed among the shards. Some events are very rare and unlikely to show up in a sampled result. Sometimes you're looking for a tiny set of events but aren't sure when they occurred. Sometimes the selection filters leave too few events to sample accurately. Interana detects situations where sampling isn't appropriate and returns a clear warning (Figure 4) with the results. The user can then either reformulate the query or choose to disable sampling and compute the query across the full set of data. But even when running a query on the full set of events, Interana returns results in seconds. The entire system was designed around the proposition that scanning billions of events is the common case and needs to complete quickly.
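
As a hypothetical example of the kind of check behind such a warning, consider estimating a count from a sample: when too few matching events are seen, the scaled-up estimate is noisy. The error model and thresholds below are assumptions for illustration, not Interana's actual heuristics.

    import math

    def estimate_with_warning(matching, scanned_fraction, z=1.96):
        # Scale the sampled count up to a population estimate.
        estimate = matching / scanned_fraction
        # Under a simple Poisson model, relative error shrinks roughly
        # as 1/sqrt(number of matching events actually seen).
        rel_error = z / math.sqrt(matching) if matching else float("inf")
        warn = matching < 30 or rel_error > 0.10  # rare events: don't sample
        return estimate, rel_error, warn

    # 12 matching events in a 1% sample: the estimate (~1,200) comes with
    # a warning, suggesting the user disable sampling for this query.
    estimate, rel_error, warn = estimate_with_warning(12, 0.01)
    assert warn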

CONCLUSION

Interana is a purpose-built solution for behavioral analytics of event data at massive scale. It combines a highly scalable cluster with an intuitive visual interface to interactively explore trillions of events in seconds. Part of how that's possible is the architecture of the solution, and part is the integral role of sampling. If you'd like to learn more about how Interana samples, download our free eBook.

Don’t stop here! Learn more about how other companies — Tinder, Microsoft, Sonos, Imgur, Asana, Flowroute, BloomBoard, and more — have used Interana to gain insights into their customers’ behavior. We have many resources that can help you get a better understanding of Interana’s solution and behavioral analytics on event data at massive scale. See how Interana can help you discover what your customers think and do — Request a demo of the Interana solution in action.

Thanks

Big thanks to Boris Dimitrov, Paddy Ganti, Mark LaRosa, Nicole O’Malley, and Mark Horton for their help reviewing drafts and discussing the underlying technology.