Hadoop: Future Of Enterprise Data Warehousing? Are You Kidding?
I kid you not.
What’s clear is that Hadoop has already established its initial footprint in the enterprise data warehousing (EDW) arena: as a petabyte-scalable staging cloud for unstructured content and for embedded execution of advanced analytics. As noted in a recent blog post, this is in fact the dominant use case for which Hadoop has been deployed in production environments.
Yes, traditional (Hadoop-less) EDWs can in fact address this specific use case reasonably well — from an architectural standpoint. But given that the most cutting-edge cloud analytics is happening in Hadoop clusters, it’s just a matter of time — one to two years, tops — before all EDW vendors bring Hadoop into the heart of their architectures. For those EDW vendors that have not yet committed to full Hadoop integration, the growing real-world adoption of this open-source approach will force their hands.
Where the next-generation EDW is concerned, the petabyte staging cloud is merely Hadoop’s initial footprint. Enterprises are moving rapidly toward the EDW as the hub for all advanced analytics. Forrester strongly expects vendors to incorporate the core Hadoop technologies — especially MapReduce, the Hadoop Distributed File System (HDFS), Hive, and Pig — into their architectures. Again, the impressive growth of MapReduce as a lingua franca for predictive modeling, data mining, and content analytics will practically compel EDW vendors to optimize their platforms for MapReduce, alongside high-performance support for SAS, SPSS, R, and other statistical modeling languages and formats. We see clear signs that this is already happening, as with EMC Greenplum’s recent announcement of a Hadoop product family and indications from some of that company’s competitors that they have similar near-term road maps.
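To make the lingua franca point concrete, consider how even a simple analytic looks when expressed as a MapReduce job. The minimal sketch below counts events per customer from comma-delimited logs; the EventCount class, the input schema, and the paths are illustrative assumptions, not any vendor’s implementation, but the map/shuffle/reduce pattern is exactly what EDW vendors will need to optimize for.

```java
import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class EventCount {

    // Map phase: parse each log line and emit (customerId, 1).
    public static class EventMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text customer = new Text();

        @Override
        protected void map(LongWritable offset, Text line, Context ctx)
                throws IOException, InterruptedException {
            // Assumed input schema: customerId,eventType,timestamp
            String[] fields = line.toString().split(",");
            if (fields.length >= 1 && !fields[0].isEmpty()) {
                customer.set(fields[0]);
                ctx.write(customer, ONE);
            }
        }
    }

    // Reduce phase: sum the counts for each customer.
    public static class SumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text customer, Iterable<IntWritable> counts, Context ctx)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable c : counts) {
                sum += c.get();
            }
            ctx.write(customer, new IntWritable(sum));
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "event count");
        job.setJarByClass(EventCount.class);
        job.setMapperClass(EventMapper.class);
        job.setCombinerClass(SumReducer.class); // safe because summing is associative
        job.setReducerClass(SumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));   // e.g., an HDFS input directory
        FileOutputFormat.setOutputPath(job, new Path(args[1])); // must not already exist
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```

The same pattern scales from this toy aggregation up to the scoring and model-fitting passes that predictive modeling, data mining, and content analytics workloads run over raw content in HDFS.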
Please do not interpret this as Forrester forecasting the demise of traditional EDWs built on relational, columnar, dimensional, and other approaches for storing, manipulating, and managing data. All of your investments in pre-Hadoop EDWs, data marts, data hubs, operational data stores, and the like are reasonably safe from obsolescence. The reality is that the EDW is evolving into a virtualized cloud ecosystem with a pluggable “Big Data” storage layer in which all of these database architectures can and will coexist alongside HDFS, HBase (Hadoop’s column-oriented data store), Cassandra (a sibling Apache project that supports peer-to-peer persistence for complex event processing and other real-time applications), graph databases, and other “NoSQL” platforms, all behind an abstraction layer with MapReduce as its focus.
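To illustrate that pluggability, here is a minimal sketch of the same per-customer count running against an HBase table rather than flat files in HDFS: only the input plumbing changes, while the analytic (and the SumReducer from the earlier sketch) carries over unchanged. The “events” table and its “event:customerId” column are hypothetical assumptions for illustration.

```java
import java.io.IOException;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.hbase.mapreduce.TableMapReduceUtil;
import org.apache.hadoop.hbase.mapreduce.TableMapper;
import org.apache.hadoop.hbase.util.Bytes;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class EventCountFromHBase {

    // Same analytic as before, but the mapper now reads HBase rows instead of HDFS lines.
    public static class HBaseEventMapper extends TableMapper<Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text customer = new Text();

        @Override
        protected void map(ImmutableBytesWritable row, Result result, Context ctx)
                throws IOException, InterruptedException {
            // Hypothetical schema: column family "event", qualifier "customerId"
            byte[] id = result.getValue(Bytes.toBytes("event"), Bytes.toBytes("customerId"));
            if (id != null) {
                customer.set(Bytes.toString(id));
                ctx.write(customer, ONE);
            }
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(HBaseConfiguration.create(), "event count from hbase");
        job.setJarByClass(EventCountFromHBase.class);

        Scan scan = new Scan(); // full-table scan; narrow it in a real deployment
        TableMapReduceUtil.initTableMapperJob(
                "events",              // hypothetical source table
                scan,
                HBaseEventMapper.class,
                Text.class,
                IntWritable.class,
                job);

        job.setReducerClass(EventCount.SumReducer.class); // reused from the earlier sketch
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileOutputFormat.setOutputPath(job, new Path(args[0]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```

Swap in Cassandra’s Hadoop input format, or any other “NoSQL” connector, and the story is the same, which is precisely why MapReduce can serve as the focus of the abstraction layer.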
That trend is also clear, and it makes me glad that I said as much in a Forrester report we published almost two years ago. At that time, in the context of an in-database analytics discussion, I stated that within the next several years, most EDW and advanced analytics vendors would incorporate MapReduce and Hadoop support into their architectures to enable standards-based development of advanced analytics models with flexible in-database pushdown optimization in the cloud.
I also took the analysis to the next evolutionary step, identifying the industry road map for embedding Hadoop/MapReduce into the larger paradigm that we now call “Big Data.” This paradigm involves embedding a more comprehensive range of application functions and logic — both analytical and transactional — into the virtualized cloud EDW. Essentially, the cloud EDW will become the core “application server” for the next generation of use cases — such as next best action — that require tight integration of historical, real-time, and predictive analytics.
Within the Big Data cosmos, Hadoop/MapReduce will be a key development framework, but not the only one. These specifications will form part of a broader, but still largely undefined, service-oriented virtualization architecture for inline analytics. Under this paradigm, developers will create inline analytic models that deploy to a dizzying range of clouds, event streams, file systems, databases, complex event processing platforms, business process management systems, and information-as-a-service environments.
At Forrester, we see these requirements coming directly from CTOs and other senior decision-makers in large organizations who are driving convergence of investments across all of these formerly separate technology domains. Vendors are racing to address this convergence in their product portfolios.
No kidding. Hadoop is the core platform for Big Data, and it’s a core convergence focus for enterprise application, analytics, and middleware vendors everywhere.