Why Speed Matters When It Comes to Querying Your Data
When evaluating new tools, speed can sometimes be at odds with how many bells and whistles a product comes with or how much information it can handle. You can get a CRM platform with a ton of features, but it’ll take a while for your team to learn how to use them. You can get a highly tailored automated email generator, but every added option lengthens the time it takes to produce an email.
When it comes to analytics and querying your data, speed matters.
From everyday search engine queries to enterprise-level data queries, getting results as quickly as possible is just as important, if not more so, than the volume of data that can be stored or the complexity limits of a query.
As we’ll talk about in this post, this conclusion is based on how people actually search for things—as a multi-step, imprecise process that follows the speed of thought, not a single detailed question.
What we do when we ask questions
To start thinking about how people ask questions of their data, it helps to look at a place where 3.5 billion queries are submitted each day—Google.
Even five years ago, when you did a Google search, you were on average searching through an index of 52,000,000 pages.
But according to a 2015 Chitika study, the first result alone got an average of 32.5% of the total traffic. And forget about looking at the 2nd or 3rd page of results—only 8% of traffic makes it there.
People are really only clicking on the first few results—which seems to point to the conclusion that they must be finding the links they want within those results. Out of those 52,000,000 pages, are people using picture-perfect search terms that get them the exact result they need within just the first four?
Probably not. Let’s take a look at another statistic—the number of words used in Google queries.
94% of searches use five words or fewer. And over a third of searches are just a single word!
If people were finding the results they wanted straightaway, you’d expect them to need more words to get there. Short search queries tend to give broader results, with what you want buried further down the list or a few pages in. But we know that people aren’t clicking on those later results. This suggests that people aren’t actually finding what they want in one search, but are searching using an iterative style, formulating many short search queries until what they want shows up in the first few results.
For example, let’s say you want to learn about different types of cloud computing environments. You’ll probably search something like, “cloud types.” Whoops. All the results on the first page are about real clouds. But you’re not going to waste time looking through results until you find something that has to do with cloud computing (not until page 4, by the way). Instead, you’ll change your search to something like, “cloud computing types.” From the titles of those results, you’ll see that there’s discussion of public, private, and hybrid clouds, so to get more detailed you’ll search “hybrid clouds.” But you’re more interested in advantages and disadvantages than definitions, so you search “hybrid cloud benefits,” which gets you to a link you’ll actually click on.
People don’t carefully think out detailed search phrases or look through page after page of results. Instead, they rapidly iterate on imprecise queries until the result they want is listed near the top.
Why speed is key for queries
This behavior—looking at a query as more than just a single search—is what makes speed so important.
If people could instantly come up with queries detailing exactly what they wanted to look at, or if they had the time and motivation to look through huge reports, then the amount and accuracy of the data on hand would be more important than speed. But this isn’t how people work.
People can’t predict which sectors of the customer population will be most interesting before they look at the data. They won’t know all the variables that need to be included and excluded. And your data team can’t solve this by producing long-winded reports with a thousand variations—people won’t look past page 1.
The only solution is to take out the middleman and let every employee interact with the data quickly and easily. If it takes a while to receive a query result, the process of iterating on imperfect queries becomes incredibly time consuming—so people won’t do it. They’ll either work with the imprecise results of their first query or not look into the data at all.
In Interana’s 2017 State of Data Insights Report, almost a third of the 200 respondents from various companies in various positions listed slow query speeds as a top pain point with their existing analytics solution. And almost two-thirds couldn’t get answers to their queries in less than a full day.
If it takes more than a few seconds, let alone a few days or weeks, to get a question answered, then your company loses a large piece of the benefits of collecting data in the first place.
Scuba runs interactive, ad hoc queries in less than a second. It was built because Facebook’s old query systems, Hive and Peregrine, were too slow to catch performance bugs before they affected a large portion of users. With Scuba, these bugs could be identified and fixed in minutes to hours, not days. And even though Scuba was just meant for performance analysis, it soon became the “system of choice” for many teams. People noticed a huge leap in how quickly insights were gleaned from data, and former Facebook engineers even say that Scuba is the thing they “miss the most” about working there.
If you want people across your organization to start using your company’s data analytics tool every day, it needs to be fast enough.
Making analytics faster
One of the ways analytics can be made faster is by dismantling the batch-processing that underlies frameworks like Hadoop and opening the way for real-time processing.
Hadoop is a powerful tool. It can store enormous amounts of data and perform huge extract, transform, and load jobs. But it’s a poor choice for executing many fast, ad-hoc queries. That’s because the heart of Hadoop, the MapReduce paradigm, is a batch-processing system.
In batch-processing, a program is executed over a whole “batch” of accumulated inputs, rather than responding to each request individually. Instead of beginning right away, these batches can be deferred to a time when computing resources are not in high demand, such as overnight. Batch-processing is perfect for highly complex jobs where user interaction isn’t needed, since they can run without tying up computing resources during peak hours.
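The batch character of MapReduce is easiest to see in a sketch. Here is a minimal, illustrative word-count in plain Python (not Hadoop’s actual API): the entire input is processed as one batch, and no results exist until every record has passed through both phases.

```python
from collections import defaultdict

def map_phase(records):
    # Map step: emit a (key, 1) pair for every word in every record.
    for record in records:
        for word in record.split():
            yield word, 1

def reduce_phase(pairs):
    # Reduce step: aggregate all pairs by key. This can only start
    # after the entire map phase has finished -- hence "batch".
    counts = defaultdict(int)
    for key, value in pairs:
        counts[key] += value
    return dict(counts)

batch = ["error timeout", "error retry", "timeout"]
result = reduce_phase(map_phase(batch))
# result == {"error": 2, "timeout": 2, "retry": 1}
```

The pattern scales to enormous inputs precisely because nothing is interactive: the job can be scheduled whenever resources are free, but a user waiting on an ad-hoc question gets nothing back until the whole batch completes.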
But batch-processing doesn’t make sense for fast-paced, off-the-cuff analytics queries.
Interana’s analytics stack was built specifically to make querying continuous time data as fast as the speed of thought. For that, we built up a whole new system, which CTO Bobby Johnson outlines:
- Commodity Linux boxes for the system to run on
- Very simple C++ code to crunch billions of rows of data at once
- A layer of Python to manage queries and sampling logic
- A columnar database system which processes 100 million rows per core per second
- A REST interface for the front end
And it works—you can run queries over months and months of data in a few seconds. Our speed comes from the fact that queries are performed on raw data, eliminating intermediate processing steps. In addition, queries run lock-free, so a stalled computation on one core never blocks the others. The columnar database also cuts down latency because any columns not relevant to the query can be skipped over entirely. Finally, we also save time by not relying on limiting business intelligence concepts, like dimensions, which require further processing.
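The column-skipping idea can be sketched in a few lines. This is a hypothetical illustration (the column names and data are invented, and a real engine works on compressed on-disk arrays, not Python lists): because each column lives in its own array, a query that filters on one column never has to read the bytes of any other.

```python
# Hypothetical columnar layout: one array per column instead of
# one record per row. Invented example data for illustration.
columns = {
    "user_id":    [1, 2, 3, 4],
    "country":    ["US", "DE", "US", "FR"],
    "latency_ms": [120, 340, 95, 210],
}

def count_where(col_name, predicate):
    # Scans exactly one column; every other column is skipped entirely,
    # which is where the latency savings come from.
    return sum(1 for value in columns[col_name] if predicate(value))

slow_requests = count_where("latency_ms", lambda ms: ms > 100)
# slow_requests == 3
```

In a row-oriented store, the same query would have to read every field of every record just to inspect `latency_ms`; laying data out by column turns that full scan into a tight loop over one contiguous array.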
Aside from actual processing time, we also speed up analytics by providing a visual, interactive interface that anyone can use—without having to take the time to learn SQL, Java, or any other new syntax.
Speed up your queries
If you’re having problems getting people at your company to use your data analytics stack, speed may be the problem. High latency is anathema to reeling off queries on a daily basis.
Ideas will be lost if people have to carefully formulate each question, go to their data team, and wait forever for a response. When you use a high-speed analytics stack designed for ad-hoc queries, you allow people to query the way they do in the rest of their life—quickly, imperfectly, successively.