Because Ninjas Really Are Too Busy

Interana Blog Staff

Don't use Splunk or other index-based log managers for your business dashboards on event data anymore. You're wasting time and money on something that's limited, slow and brittle.

Back in the late 90s and early 2000s, logs were sent to local files across server farms and brave sysadmins would grep through them manually if there was a problem. Sometimes developers and application support people would ask for access and the understandably grumpy sysadmins would ship them a copy of the log files they wanted. Reckless sysadmins would give them access to the servers just not to be bothered (until the devs made a change and brought down the production systems; ask me about when that happened to me in ~1998 when I was in charge of MSN's logging infrastructure). The actual log files were horrendous spews of free text payload with a minimally structured header. What log management tools there were only worked on some standard log formats from security, OS and networking sources. The totally variable and freeform application logs that were relevant to the actual business value delivered by IT were out of bounds.

Then along came Splunk in 2005. I'm extremely proud to have run product there from April 2005 before our launch through version 4. Splunk ushered in the era of log management tools that worked on a search and indexing foundation.

We made it possible for people across IT and development organizations to centrally search for log events from a web interface regardless of original format and get back needle-in-a-haystack matches in seconds. Our insight was to embrace what I call "messy data." No more parsing and normalizing just to query. And this scaled to terabytes with distributed architectures.

Over time, we at Splunk (and our SaaS and OSS imitators like SumoLogic and Elasticsearch-Logstash-Kibana, aka ELK) implemented read-time schemas and structured query extensions to our search languages to support structured reporting and dashboards too, so IT and engineering folks could build log-based equivalents to their traditional monitoring tools' interfaces to watch error counts, performance trends, and the like.

Clever Splunk admins started to use this to put up dashboards on more business-oriented metrics for users beyond IT, in roles such as LOB leadership, marketing, product, success and account management. A lot of the log events in their indexes had business relevance, and they could put up a dashboard in their logging tool, on data they already had, faster than their BI teams could stand up a data warehouse and a BI/reporting tool to get the same dashboard. Adept Splunk and ELK search ninjas became heroes in many companies.

But there was an elephant in the room. (And it wasn't Hadoop.)

Splunk and similar tools are slow and inefficient at statistical reporting across large datasets.

The freedom to get answers in seconds for searches for individual log events disappears as soon as you ask for a count of events by a grouping field. That means dashboards over any significant time period either need to be on a very filtered subset of the data (not good for looking across all activity on a service), or require brittle and hard to configure incremental scheduled pre-summarization. It's called "summary indexing" in Splunk parlance, but the same is found in ELK and SumoLogic and every other schema-on-read, index-based system. Because the summaries are computed ahead of time for a fixed set of fields, the resulting dashboards are static. A clever administrator can configure a bunch of linked dashboard widgets to achieve defined drilldown paths between different levels of summarization, but it's a lot of work and extremely brittle.
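To make the brittleness concrete, here is a toy sketch of scheduled pre-summarization in Python. It is not Splunk's summary indexing or any vendor's implementation; the `summarize` function and the `ts`, `status` and `region` field names are made up for illustration. The point is only that a summary is keyed on the grouping field chosen in advance, so a new question means another pass over the raw events.

```python
from collections import Counter


def summarize(events, group_field, bucket_seconds=3600):
    """Count events per (time bucket, group value). Hypothetical helper."""
    summary = Counter()
    for e in events:
        # Round the timestamp down to the start of its bucket.
        bucket = e["ts"] - e["ts"] % bucket_seconds
        summary[(bucket, e[group_field])] += 1
    return summary


# Illustrative raw events (field names are assumptions).
events = [
    {"ts": 10, "status": 200, "region": "us"},
    {"ts": 20, "status": 500, "region": "eu"},
    {"ts": 3700, "status": 200, "region": "eu"},
]

# The summary built for the "errors by status" dashboard...
by_status = summarize(events, "status")

# ...cannot answer "events by region". That needs a second pass over
# the raw events, which is the backfill work described above.
by_region = summarize(events, "region")
```

The `by_status` summary is fast to chart, but it has already thrown away the `region` of each event, which is why every new dashboard question goes back to whoever maintains the summaries.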

Beyond the speed issue for simple aggregations, things get even trickier when the questions are about behavior across large populations of people and things in mobile services, apps, websites, IoT and the like. The Splunk and imitator search languages just weren't built to analyze funnels, paths or sessions, nor segment users based on them. And the backend isn't at all optimized for sequencing by user or device ids. These are painful queries for the ninjas to build.
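For readers who haven't built one, a funnel query boils down to ordering each user's events by time and checking how far they get through a list of steps. The sketch below is a minimal in-memory version in Python, assuming hypothetical `user`, `ts` and `action` fields. It is an illustration of the shape of the problem, not how any particular product computes funnels.

```python
from collections import defaultdict


def funnel_counts(events, steps):
    """Return how many users reached each funnel step, in order.

    events: iterable of dicts with assumed keys "user", "ts", "action".
    steps:  ordered list of action names that define the funnel.
    """
    # Group events by user; this is the per-actor sequencing that
    # index-based backends are not optimized for.
    by_user = defaultdict(list)
    for e in events:
        by_user[e["user"]].append(e)

    reached = [0] * len(steps)
    for user_events in by_user.values():
        user_events.sort(key=lambda e: e["ts"])
        next_step = 0
        for e in user_events:
            if next_step < len(steps) and e["action"] == steps[next_step]:
                reached[next_step] += 1
                next_step += 1
    return reached


events = [
    {"user": "a", "ts": 1, "action": "view"},
    {"user": "a", "ts": 2, "action": "cart"},
    {"user": "a", "ts": 3, "action": "buy"},
    {"user": "b", "ts": 1, "action": "cart"},  # out of order, does not count
    {"user": "b", "ts": 2, "action": "view"},
    {"user": "c", "ts": 5, "action": "view"},
    {"user": "c", "ts": 9, "action": "cart"},
]

counts = funnel_counts(events, ["view", "cart", "buy"])
```

Every event for a user has to be brought together and sorted before the funnel can be evaluated, which is cheap when data is organized by user and painful when it is organized as a text index over individual log lines.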

So suddenly you have a tiny, priestly class of overworked ninjas and a massive number of disappointed and despairing dashboard consumers. These consumers often don't know what the dashboards really mean. The inability to drill down into the raw data and the obscure syntax of the underlying queries mean they can't figure it out for themselves and just have to trust what was built for them. And if the dashboards lead to a new question, they need to go back to the ninjas to build and run new queries, which can take a lot of development and processing time to backfill summary indexes on old data.

Madness, right? But there's now a better way.

First off, the events that have business value don't have to be hidden in the midst of messy logfiles anymore. We're a few years into the era of clean data pipelines using technologies like Kafka, Kinesis, Segment and mParticle. I call this logging with intent: developers now see emitting well-formed events (usually JSON or something that can convert to it) for downstream analysis as a first-class requirement for new applications, and a high priority requirement for refactoring old ones. And they put these events into pipelines that deal with consolidation across increasingly dynamic infrastructures (no more local logfiles across a thousand physical servers).
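As a rough illustration of what logging with intent looks like in application code, here is a small Python sketch that builds one well-formed JSON event. The `make_event` helper and its field names (`event_id`, `actor_id`, `action`, `properties`) are assumptions made for this example, not a schema required by Kafka, Kinesis, Segment, mParticle or any other pipeline.

```python
import json
import time
import uuid


def make_event(actor_id, action, properties=None, timestamp=None):
    """Build one structured event as a JSON string (illustrative schema)."""
    event = {
        "event_id": str(uuid.uuid4()),  # unique id, useful for de-duplication
        "timestamp": timestamp if timestamp is not None else time.time(),
        "actor_id": actor_id,           # who or what did it
        "action": action,               # what happened
        "properties": properties or {}, # known, named dimensions
    }
    return json.dumps(event, sort_keys=True)


line = make_event(
    "user-42",
    "checkout_started",
    {"cart_value": 59.90},
    timestamp=1500000000,
)

# Downstream consumers can parse the event without any read-time schema.
parsed = json.loads(line)
```

In a real service the resulting string would be handed to whatever producer or SDK the pipeline provides rather than written to a local file.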

That means that the main original value of the search- and index-based logging tools is irrelevant, at least for high value, intentional events. And the cost associated with free text indexing in terms of slow read-time schema and aggregation is unnecessary. It's far better to put these events that the business cares about into a datastore that is structured to support fast ad hoc analytics on known dimensions and is particularly optimized for behavioral queries.

With that foundation, it's possible to deliver a totally different user experience to dashboard consumers. You can have living dashboards that run on queries directly against raw data at trillion-event scale yet still refresh in seconds, instead of "dead dashboards" that are artifacts of hours, days or weeks of incremental background processing. This means users can re-run dashboard queries with new parameters and see the query logic and raw data behind a chart. Users also have the freedom to build queries that are about behavior and involve trying out different definitions of funnels, sessions, and segmentation.

When users have this kind of transparency and freedom to iterate, they trust and rely on data more. That is the holy grail of a data-informed culture.

This happens to be Interana's approach. We take in well-formed JSON events as a stream and put them into a columnar database sharded on one or more actors, which may be users, devices, topics or other people or things. We then put a visual interface on top of it with living dashboards. We're also designed for asking questions about behavior in event data, with user-definable behavioral objects like funnels, cohorts, metrics and sessions, not just simple measures and dimensions like the logging tools.
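To give a feel for what "columnar and sharded on an actor" means, here is a deliberately tiny Python sketch. It is not Interana's implementation; the `shard_for`, `ColumnarShard` and `load` names, the three columns, and the choice of a SHA-256 hash are all assumptions made for illustration.

```python
import hashlib


def shard_for(actor_id, num_shards):
    """Map an actor id to a shard with a stable hash (illustrative choice)."""
    digest = hashlib.sha256(actor_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_shards


class ColumnarShard:
    """Stores each field as its own list instead of storing whole rows."""

    def __init__(self):
        self.columns = {"actor_id": [], "ts": [], "action": []}

    def append(self, event):
        for name, values in self.columns.items():
            values.append(event[name])


def load(events, num_shards):
    """Route every event to the shard that owns its actor."""
    shards = [ColumnarShard() for _ in range(num_shards)]
    for e in events:
        shards[shard_for(e["actor_id"], num_shards)].append(e)
    return shards


events = [
    {"actor_id": "user-1", "ts": 1, "action": "view"},
    {"actor_id": "user-2", "ts": 2, "action": "view"},
    {"actor_id": "user-1", "ts": 3, "action": "buy"},
]
shards = load(events, num_shards=4)

# All of user-1's events land on exactly one shard.
owners = [i for i, s in enumerate(shards) if "user-1" in s.columns["actor_id"]]
```

Because every event for a given actor lives on one shard, per-actor questions such as funnels and sessions can be answered locally on each shard, and an aggregation only has to scan the columns it actually uses.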

Users with no special training use these living dashboards as jumping-off points to freely explore their data. They can even tweak definitions of funnels or cohorts, all without writing any specialized query syntax. The queries they run return answers in seconds even on trillions of rows of data. And they freely run dozens of queries in a few minutes’ session, following their intuition. When I show this to my former Splunk colleagues and customers, their jaws literally drop.

We've had many customers that previously used Splunk and other log management "dead dashboards" switch to this approach. They've seen usage go way up and hundreds of passive consumers become active data explorers.

Most of the time these new Interana customers were able to reduce the data going into their logging systems to more purely IT and security relevant data, saving money on license, servers and storage — not to mention freeing their ninjas to do more than be data butlers. Often they also could retire a more expensive proprietary system like Splunk (sorry about that pricing model, everyone) in favor of a less costly OSS solution like Elastic/ELK or a simpler hosted solution like SumoLogic.

While Splunk and its brethren have become jacks of all trades, and masters of none, Interana is the ultimate way to build a data-informed culture.
