The New Paradigm Of In-Database Cloud Analytics, And Google’s Role As Catalyst
Dreams do come true sometimes. Or, at the very least, they may start to feel less like dreams than intuitions that ripened a bit earlier in the dreamer’s mind than in the world in which he or she may live.
The dream of a global analytics cloud – aka "data warehousing (DW) in the cloud," "DW 2.0," "DW as a Service" – is continuing to materialize, as evidenced by a steady stream of important industry developments. Perhaps "cloud" is the wrong metaphor, considering that this vision is more of an expanding hypersphere of deep data that, through its massive gravitation, pulls an ever-growing nebula of complex computational challenges into its orbit.
Maybe we should call this uber-DW the "analytics orb" – in other words, the conceptual mothership of the industry’s growing focus on "in-database analytics." Under this vision, analytics migrate to the DW platform and leverage its full parallel-processing, partitioning, scalability, and optimization functionality. Why move huge data sets to other platforms to be processed when all that analytical heavy lifting can be done on the most powerful, scalable, and cost-effective platform [appliance, cloud, orb] available – that also happens to be the planet where the data permanently resides?
For I&KM professionals, this vision is starting to become a commercial reality, as implemented by a growing range of DW vendors, both startup and veteran. The most recent development in this regard came last week, when DW vendors Greenplum and Aster Data announced that they have implemented MapReduce, the Google-developed parallel programming model, in their respective products. For its part, Google has been using MapReduce in its massive search environment to efficiently process petabytes of data – unstructured, semi-structured, structured – in parallel across large clusters of machines. The DW vendors are now bringing that model inside the database, alongside their MPP-optimized SQL engines. One of the key innovations with MapReduce is that it provides a framework for parallelizing any in-database analytical algorithm – not just SQL queries (parallelizing the latter is old hat – it’s what every vendor of a shared-nothing MPP DW has long provided).
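To make the "any algorithm, not just SQL" point a bit more concrete, here is a minimal Python sketch of the MapReduce pattern. It is an illustration only, not Google's, Greenplum's, or Aster Data's implementation: the `map_reduce` helper, the `emit_amount` and `mean_amount` functions, and the sample rows are all hypothetical, and a thread pool on one machine stands in for the nodes of a shared-nothing MPP cluster.

```python
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor


def map_reduce(partitions, mapper, reducer, workers=4):
    """Run mapper over each partition in parallel, group by key, then reduce."""
    def run_mapper(rows):
        # Each worker emits (key, value) pairs for its own partition only.
        return [pair for row in rows for pair in mapper(row)]

    with ThreadPoolExecutor(max_workers=workers) as pool:
        mapped = list(pool.map(run_mapper, partitions))

    # Shuffle step: bring all values for the same key together.
    grouped = defaultdict(list)
    for pairs in mapped:
        for key, value in pairs:
            grouped[key].append(value)

    # Reduce step: collapse each key's values into a single result.
    return {key: reducer(key, values) for key, values in grouped.items()}


# Hypothetical data: (region, order_amount) rows spread over three partitions,
# standing in for table segments stored on separate DW nodes.
partitions = [
    [("east", 10.0), ("west", 4.0)],
    [("east", 20.0)],
    [("west", 6.0), ("west", 8.0)],
]


def emit_amount(row):
    # Map: key each order amount by its region.
    region, amount = row
    yield region, amount


def mean_amount(region, amounts):
    # Reduce: average all amounts seen for one region.
    return sum(amounts) / len(amounts)


result = map_reduce(partitions, emit_amount, mean_amount)
print(result)
```

The mapper and reducer are ordinary functions, so the same scaffolding could in principle carry a scoring model or a text parser instead of an average. That is the sense in which the framework parallelizes arbitrary analytics while the data stays where it lives.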
Another important recent development came from DW pure-play Netezza. Several months ago, it acquired predictive analytics tool vendor NuTech, announcing that the firm’s technology would help Netezza evolve its DW appliance product family into an extensible platform for customer- and partner-provided analytic applications. Then, last week, Netezza announced that several partners had rolled out advanced analytics applications designed to leverage the parallel-execution, scalability, and query optimization features in its DW platform.
Oh…and of course DW powerhouses Teradata and Oracle have recently made partner-friendly in-database analytics a key theme in their DW strategies. Netezza certainly isn’t the only DW vendor beating that drum. And I’m certainly not the only industry analyst who’s been dreaming this dream. Check out the impressive cloud of industry commentary on the MapReduce announcements. There’s a lot of low-hanging conceptual fruit in this new paradigm available to be plucked by any prepared mind.
So what’s it all mean? Essentially, all of these developments point to the inexorable rise of the DW as the scalable, parallel-processing muscle within the new generation of analytics-driven application platforms. What’s more, they point to the growing role of the DW as a general-purpose information-consolidation point in this new age of Web 2.0, unstructured data, and SaaS.
That said, here are the core DW capabilities in this new paradigm, as near as this analyst’s crystal ball will reveal (I’m using "distributed analytic platform" as the catch-all term for this new paradigm):