Hadoop, Spark, and the emerging big data landscape

Paul Miller, VP, Principal Analyst

Feb 2 2016

Not very long ago, it would have been almost inconceivable to consider a new large-scale data analysis project in which the open source Apache Hadoop did not play a pivotal role.

Every Hadoop blog post needs a picture of an elephant. (Source: Paul Miller)

Then, as so often happens, the gushing enthusiasm became more nuanced. Hadoop, some began (wrongly) to mutter, was "just about MapReduce." Hadoop, others (not always correctly) suggested, was "slow."

Then newer tools came along. Hadoop, a growing cacophony (inaccurately) trumpeted, was "not as good as Spark."

But, in the real world, Hadoop continues to be great at what it's good at. It's just not good at everything people tried throwing in its direction. We really shouldn't be surprised by this. And yet, it seems, so many of us are.

For CIOs asked to drive new programmes of work in which big data plays a part (and few are not), the competing claims in this space are both unhelpful and confusing. Hadoop and Spark are not, despite some suggestions, directly equivalent. In many cases, asking "Hadoop or Spark" is simply the wrong question.

And an already confusing space becomes more confusing when vendors, commentators, analysts, customers, developers and others use "Hadoop" to mean such different things. Sometimes they mean Apache Hadoop, the open source project. Sometimes they mean the original open source project plus a constellation of related open source projects that extend core Hadoop into areas like stream processing, in-memory computation, machine learning, and more. Often, in this view, Spark is "part of" Hadoop. Sometimes, "Hadoop" almost seems a loose umbrella term for "the big data project we're doing," where Hadoop itself is just part of the whole.

In my latest report, published today, I take a look at the ways in which Apache Hadoop and Apache Spark really fit together. The reader should come away better able to cut through the contradictory noise they are bombarded with, and better able to understand where and when to use either, or both.

I also conclude with a question, of sorts, and would welcome your thoughts:

"But it's not impossible to imagine a near future in which Spark and its backers increasingly bypass the core Hadoop project, making connections of their own to all of those other projects in the Hadoop ecosystem. Once that happens, is it really a Hadoop ecosystem anymore?"
