Hadoop, Spark, and the emerging big data landscape
Not long ago, it would have been almost inconceivable to consider a new large-scale data analysis project in which open source Apache Hadoop did not play a pivotal role.
Then, as so often happens, the gushing enthusiasm became more nuanced. Hadoop, some began (wrongly) to mutter, was "just about MapReduce." Hadoop, others (not always correctly) suggested, was "slow."
Then newer tools came along. Hadoop, a growing cacophony (inaccurately) trumpeted, was "not as good as Spark."
But, in the real world, Hadoop continues to be great at what it's good at. It's just not good at everything people tried throwing in its direction. We really shouldn't be surprised by this. And yet, it seems, so many of us are.
For CIOs asked to drive new programmes of work in which big data plays a part (and few are not), the competing claims in this space are both unhelpful and confusing. Hadoop and Spark are not, despite some suggestions, directly equivalent. In many cases, asking "Hadoop or Spark" is simply the wrong question.
And an already confusing space becomes more confusing when vendors, commentators, analysts, customers, developers and others use "Hadoop" to mean such different things. Sometimes they mean Apache Hadoop, the open source project. Sometimes they mean the original open source project plus a constellation of related open source projects that extend core Hadoop into areas like stream processing, in-memory computation, machine learning, and more. Often, in this view, Spark is "part of" Hadoop. Sometimes, "Hadoop" almost seems a loose umbrella term for "the big data project we're doing," where Hadoop itself is just part of the whole.
In my latest report, published today, I take a look at the ways in which Apache Hadoop and Apache Spark really fit together. Readers should come away better able to cut through the contradictory noise they are bombarded with, and better able to understand where and when to use either, or both.
I also conclude with a question, of sorts, and would welcome your thoughts:
"But it's not impossible to imagine a near future in which Spark and its backers increasingly bypass the core Hadoop project, making connections of their own to all of those other projects in the Hadoop ecosystem. Once that happens, is it really a Hadoop ecosystem anymore?"