At Interana, we build a fast, scalable behavioral analytics solution for event data. By fast, I mean really fast (answers in seconds), and by scalable I mean really big (billions and billions of events). We currently have customers like Microsoft and Tinder that have nearly reached a trillion rows of data and still get results in seconds. Scale and speed are critical to Interana and the types of analytics we do (conversion, engagement, retention, root cause, etc.) on massive volumes of raw event data. We knew that designing and writing a system like this would not be easy, and it wasn’t. But we didn’t expect it to be so hard to find good data for testing, measurement, and demos. That’s why we decided to make our own data.
Today, we’re releasing eventsim to the world. This is a tool that I wrote internally to produce a stream of real-looking (but fake) event data. We use this for development, testing, and demos. This blog post explains why I wrote a fake data generator, how it works, and how to get (and use) it.
You can get the code for the simulator from https://github.com/interana/eventsim.
Background and Motivation
At Interana, we built a system for viewing event data. Events are measurements that capture a moment in time; each event describes an observation (something that was seen, or something that happened), and attributes of that event. Examples of events include web page views, credit card transactions, SMS messages, and industrial sensor readings.
In industry, there are a huge number of data sources that look like this. I worked with data like this myself (at LinkedIn, Netflix, and Verisign), and our customers produce large volumes of data like this (at Asana, Imgur, Bing, and other places). Typically, the data contains a set of events associated with a set of users over a long time period.
At Interana, we wanted to find data sets that showed how people behaved over time, and that could be used to calculate common business metrics. Unfortunately, we struggled to find free, open data sources that looked like this. We found some data sets that satisfied some of our requirements, but not all. (For example, there is a good data set of Wikipedia edits. This data set contains many events, but is less than ideal for engagement metrics.)
We decided that our best bet was to simulate the action of many users on a completely fake web site. We wanted the simulator to have the following features:
- Configurable time period. We wanted to be able to create data for long or short time periods, and include timestamps up to the present.
- Configurable volume. We needed to be able to create data for many different numbers of users, from tens to millions. (This lets us use the same data for small development projects and massive performance testing projects.)
- Realistic traffic patterns. Many of us have worked at big consumer web sites, and know that traffic varies by time of day and day of week. We wanted more traffic in the day than at night, and more during the week than on weekends (and holidays).
- One time or continuous. We wanted to be able to generate data once, or to generate data continuously.
- JSON output.
- Output to files, or to Apache Kafka.
- Pseudo-random output. We wanted the data to look random, but to be generated deterministically (to ease testing and recreating data).
- Different behavior for different users. We wanted different users to behave a little differently: some arrived more frequently than others, some clicked around differently.
- Colorful attributes. We wanted to make the data fun and interesting: to assign users names, to have them use different browsers, to have them come from different places. And we wanted them to do interesting things.
- Growth and attrition. We wanted to be able to calculate growth metrics, so new users appear over time (and some leave).
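To make two of these requirements concrete (realistic traffic patterns and pseudo-random but deterministic output), here is a minimal Python sketch. It is not eventsim's actual implementation (eventsim is written in Scala). The function names, the day/night and weekday/weekend ratios, and the thinning approach are all assumptions chosen for illustration.

```python
import random
from datetime import datetime, timedelta


def traffic_weight(ts: datetime) -> float:
    """Relative traffic level in (0, 1]: busier by day and on weekdays."""
    hour_weight = 1.0 if 8 <= ts.hour < 22 else 0.25  # assumed day/night ratio
    day_weight = 1.0 if ts.weekday() < 5 else 0.6     # assumed weekday/weekend ratio
    return hour_weight * day_weight


def arrival_times(seed, start, end, peak_rate_per_hour):
    """Generate session arrival times between start and end.

    Candidate arrivals are drawn from a Poisson process at the peak rate,
    and each candidate is kept with probability traffic_weight(t).
    """
    rng = random.Random(seed)  # seeded: the same inputs always give the same output
    arrivals = []
    t = start
    while True:
        # Exponential gaps between candidates give Poisson arrivals.
        t += timedelta(hours=rng.expovariate(peak_rate_per_hour))
        if t >= end:
            return arrivals
        if rng.random() < traffic_weight(t):
            arrivals.append(t)
```

Calling `arrival_times` twice with the same seed returns the same list of timestamps, which is what makes test data easy to recreate. Changing the seed gives a different, but similarly shaped, traffic curve.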
After some work, I decided to simulate a fake music web site, like Apple Music, Pandora, or Spotify. I chose this use case because I think it’s intuitive for most users (lots of people have experience with music streaming services), and fun. I also had some interesting data to use for faking it: the Million Song Dataset. (I used data from that project to create realistic names and distributions of songs.)
The Fake Web Site
Our simulation contains events from a fictitious web site:
- When users first arrive at the site, they are unregistered and logged out. We don’t know who they are. Users register, log in, and start using the site. Over time, users may log out. They may arrive at the site again, and log back in. They might cancel their membership. They might also upgrade from a free to a paid status.
- Users typically arrive at the home page. They might visit a registration page, an about page, or a help page.
- Logged in users also listen to songs. (These are labeled as “nextSong” events, and have song information associated with them.)
- For actions requiring confirmation, there is typically a page, a “submit” action, and a confirmation action. For example, users go to the upgrade page, then submit an upgrade, then are redirected to the home page.
- Errors randomly occur.
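One way to picture the behavior described above is as a random walk over (login status, page) states. The Python sketch below illustrates that idea only: the page names other than “nextSong”, the transition weights, and the `walk` helper are hypothetical, and it omits errors, cancellation, and free/paid status for brevity.

```python
import random

OUT, IN = "Logged Out", "Logged In"

# Hypothetical transition table: state -> [(next_state, relative weight), ...]
TRANSITIONS = {
    (OUT, "Home"): [((OUT, "Register"), 2), ((OUT, "Login"), 5),
                    ((OUT, "About"), 1), ((OUT, "Help"), 1)],
    (OUT, "Register"): [((OUT, "Submit Registration"), 1)],
    (OUT, "Submit Registration"): [((IN, "Home"), 1)],
    (OUT, "Login"): [((IN, "Home"), 1)],
    (OUT, "About"): [((OUT, "Home"), 1)],
    (OUT, "Help"): [((OUT, "Home"), 1)],
    (IN, "Home"): [((IN, "nextSong"), 8), ((IN, "Upgrade"), 1),
                   ((IN, "Logout"), 1)],
    (IN, "nextSong"): [((IN, "nextSong"), 9), ((IN, "Home"), 1)],
    # Page, then submit action, then redirect back to the home page.
    (IN, "Upgrade"): [((IN, "Submit Upgrade"), 1)],
    (IN, "Submit Upgrade"): [((IN, "Home"), 1)],
    (IN, "Logout"): [((OUT, "Home"), 1)],
}


def walk(seed, steps, start=(OUT, "Home")):
    """Return the list of states one simulated user visits."""
    rng = random.Random(seed)  # seeded so the walk is reproducible
    state = start
    path = [state]
    for _ in range(steps):
        next_states, weights = zip(*TRANSITIONS[state])
        state = rng.choices(next_states, weights=weights, k=1)[0]
        path.append(state)
    return path
```

Because “nextSong” is only reachable from logged-in states, a walk over this table never produces a song event for a logged-out user, which matches the rule that only logged in users listen to songs.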
When you’re exploring the output, you can see users transition between states (logged in and out, free and paid). You will also see users arrive, use the site for some period of time, then randomly leave for a while. (This simulates sessions.) Here are some fun things to look for in the data:
- Different users behave differently: some use the site more often than others.
- Paid users typically have longer sessions with more songs.
- Usage is cyclical: higher during the day than at night.
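As a starting point for checking the second observation, here is a small Python sketch that computes the average number of songs per session for each account level from JSON-lines output. The field names (`userId`, `sessionId`, `level`, `page`) and the level values are assumptions about the output format, so check them against the files you actually generate. The sample records are hand-written for illustration, not real simulator output.

```python
import json
from collections import defaultdict


def avg_songs_per_session(lines, song_page="nextSong"):
    """Average song events per session, grouped by account level.

    A session that spans an upgrade is counted once under each level.
    """
    songs = defaultdict(int)  # (level, userId, sessionId) -> song count
    for line in lines:
        event = json.loads(line)
        key = (event["level"], event["userId"], event["sessionId"])
        songs[key] += 1 if event["page"] == song_page else 0

    per_level = defaultdict(list)
    for (level, _user, _session), count in songs.items():
        per_level[level].append(count)
    return {level: sum(c) / len(c) for level, c in per_level.items()}


# Hand-written sample records (assumed field names, not real output).
sample = [
    '{"userId": 1, "sessionId": 1, "level": "paid", "page": "Home"}',
    '{"userId": 1, "sessionId": 1, "level": "paid", "page": "nextSong"}',
    '{"userId": 1, "sessionId": 1, "level": "paid", "page": "nextSong"}',
    '{"userId": 1, "sessionId": 1, "level": "paid", "page": "nextSong"}',
    '{"userId": 2, "sessionId": 5, "level": "free", "page": "nextSong"}',
    '{"userId": 3, "sessionId": 7, "level": "free", "page": "Home"}',
]
print(avg_songs_per_session(sample))
```

To run this on a generated file, pass the open file object in place of `sample`, since iterating over a file yields one line at a time.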
Getting and Running the Simulator
You can get the code for the simulator from https://github.com/interana/eventsim. This site contains detailed directions on how to build and install the simulator. (The simulator is written in Scala and uses some Java 8 features. We assume that you know how to build and run Scala programs.)
We recommend testing the simulator on a small scale first (try 12 weeks of data for 1000 users). After a while, you can hack the simulator to do more stuff. You can simulate A/B tests for different sets of users, run different simulators in parallel to speed up data generation, or pipe data out to Kafka and process it in real time.
Help Us With the Simulation Effort
We’d love your help in making this simulator better. Help us find bugs and performance bottlenecks; help us make the code clearer and more readable; help us make the data more colorful or flexible. We’d like to try creating simulators for other data types (maybe retail, transportation, or communication data).