It’s Time to Stop Using Hadoop for Analytics
Since the early days, data has always been (and continues to be) a fundamental part of Facebook’s product development. When I was at Facebook, one of my challenges was figuring out how to scale our infrastructure so we could not only store all the data we needed, but also access it quickly in order to make important product and business decisions based off that data.
We deployed our first Hadoop clusters in 2007, right as we were beginning to hit tens of millions of users and converging on 1 PB of data. At the time, Hadoop attracted us for two main reasons:
1) Cost-effective scalability. Hadoop was an open source framework and used commodity hardware. We knew it would scale with us, as it was already being used at petabyte scale.
2) No schema on write. This meant we could literally store everything without having to worry about what we’re going to do with it beforehand. All the data was saved in its original form and nothing was thrown away.
It wouldn’t be an exaggeration to say that Hadoop initially changed our lives at Facebook. We spent years improving it and building on it, so that it would be our source of analytics. But, unfortunately, it never quite lived up to its promise for analytic workloads. I believe this is an experience shared by a lot of other organizations that have been experimenting with Hadoop; it works great for storage and batch processing, but falls short for analytics in a couple of key places.
The promise of big data analytics isn’t that you can save all the data that you want. It’s that you can save all the data so that every person has the power to answer every kind of question, quickly. And since it’s impossible to ask perfect questions of your data on the first go, the ideal analytics stack should be fast enough to allow for continuous iteration and refinement of your questions.
Hadoop’s major strengths are in storing massive amounts of data and processing massive extract, transform, load (ETL) jobs, but the processing layers suffer in both efficiency and end-user latency. This is because MapReduce, the programming paradigm at the heart of Hadoop is a batch processing system; while batch is great for some things like machine learning models, or very large, complex jobs, it’s unsuitable for analytics questions which are more ad hoc in nature.
A variety of newer tools such as Spark are trying to address this problem by using memory more aggressively and designing for lower latency, but they have only been partially successful. The Hadoop stack has many layers with deep architectural assumptions about batch versus ad hoc workloads; you really have to unwind all of them to achieve performance that matches a system built ground-up for low latency.
Plus, the whole reason to use Hadoop is for scale–but once you force things back in-memory, you lose that. If you’re in-memory, you might as well be using a traditional data warehouse that is more mature and full-featured.
In fact, the dirty secret of Hadoop is that for the vast majority of analytics workloads, it’s still much worse than the legacy tools it claims to displace. Many of these legacy tools outperform Hadoop in a huge number of use cases.
Stuck in the developer world
For the data revolution to be an actual revolution, the power to ask questions of data is one that every person should have–not just the computer scientists. Even though everyone at Facebook was analytically minded, it was impossible for those who did not know how to write MapReduce jobs to ask even the simplest questions. And those who did, spent hours to days writing programs instead of exploring their data.
Despite its flexibility, unless you’re a programmer, people have even less access to data when it’s stored in a data lake like Hadoop, than if you simply used “old school” business intelligence tools.
This is the fundamental reason Hadoop has remained, for the past decade, the “next big thing” in analytics, without ever becoming “the” big thing. It is stuck in the developer world.
To solve this, at Facebook, we built a suite of custom tools to make the data we stored in Hadoop actually useful for the non-programmer. For example, Hive sat on top of Hadoop and let people use a SQL-like language to query the data instead of Java. A number of other internal tools were intended to help with adoption, but ultimately, Hadoop never really broke out the developer community.
Facebook then wrote Scuba, which was purpose-built for ad hoc analysis by both non-developers and developers. It made data accessible to more people by providing a web user interface and was tuned for returning results in interactive times. While Hadoop still has many important uses at Facebook, Scuba has taken over the heavy lifting for analytics.
It’s also worth remembering where Hadoop came from: papers by Google about their infrastructure for indexing web pages. It turns out that ETL processing looks a lot like indexing web pages so Hadoop caught on there. But it was never about analytics–especially analytics that humans need to interact with.
At Google, no one has used MapReduce for analytics in a very long time. It has been replaced by multiple generations of tools built specifically for analytics, like Dremel and PowerDrill.
Data insights for the masses
There are millions of people who are capable of analyzing data and who strive to be analytical in their workplace. The fact that many of the existing tools we have require data analysts and data scientists to also be computer scientists is a failure on the part of software engineers. For data to be useful for the masses, scalability isn’t the only challenge that needs to be addressed. Making sure an analytics stack is both powerful and flexible–and also accessible for any who need the data–is an important challenge that I and many others are still continuing to tackle.