Architects can, and do, choose a primary cloud service provider and/or Hadoop system to house their data. Moving, transforming, cataloging, and governing data is a different story. Architects come to me after throwing up their hands in the search for solutions to tame the information fabric, thinking they must be missing something. “Isn’t there a single platform?” they ask.
Sadly, no. There are only best-of-breed tools or data management platforms in transition.
There’s history behind this. Data management middleware companies tend to be relatively small. Large information management vendors such as IBM, Oracle, and SAP pick off smaller data management vendors and add their offerings to their overall platform portfolios, selling them as enablers of their big data and cloud systems. Small vendors don’t have the funds to preemptively build capabilities as markets shift toward new architectures like big data and cloud. Big vendors play to the 80% rule, serving firms that run their businesses on traditional, reliable technology. Thus, data management and governance have lagged behind the big data and cloud trends. Ultimately, both small and big vendors have followed a wait-and-see strategy, building capabilities and rearchitecting solutions only when customers began to show higher levels of interest (that is, when it’s in the RFI/RFP).
Our Forrester Wave™ evaluations document this story. Forrester saw that 50% of companies were building Hadoop data lakes in 2011, with analytics/BI moving to the cloud shortly after, yet in 2015 the data management vendors in our Waves were only just starting to figure out how to work in these environments and run natively in them. Even today, many of these vendors still offer one tool for on-premises and another for the cloud. Newer ones may only run in the cloud.
Venture capitalists and private equity firms jumped in to fund big data startups early. But few startups emerged when there was already a whole marketplace of open source tools for ingestion, pipelines, security, and metadata. Where was the money in that? Thus, the market shifted to the sexier value proposition of machine learning, and investor money followed. Why care about data when you can have insights?
Well, enterprises care about the data. They always have and always will. It is the biggest area of technical and talent debt in an organization. The failure of big data lakes and the stalls in scaled-out systems such as IoT and AI all stem from lagging data foundations. It’s a cart-before-the-horse scenario.
“Great!” you say. “Nice history lesson. So what do we do?”
Recognize new tools for what they are. Ignore the platform and solution labels applied to product names and offers. What is available is loosely consolidated functionality for specific data use cases. The potential for complete solutions is there in commercial products: User interfaces and experiences are better than in open source, and more communication and collaboration functionality exists. Vendors know that regulatory compliance and security support are table stakes for any enterprise. And if there aren’t connectors for the leading cloud and Hadoop platforms, or leading BI and business applications, that’s a deal breaker. The baseline strategy for acquiring these tools comes down to knowing your users and their processes, the openness of the metadata repositories, and the subscription models. Ultimately, you need to solve for today and give yourself room for growth (check out what my colleague Noel Yuhanna just published on future-proofing). You’ll refactor your platform sooner rather than later.
Now, here’s what you need to know for the primary data management tools:
- Metadata management. You will need two or three data catalogs: one for the physical and logical metadata management that data engineers need to build and manage systems; one for data stewards to manage logical metadata, semantics, and data policies; and possibly a third that supports search and consumption capabilities for BI analysts and data scientists, if the data governance catalog for data stewards doesn’t do the job. Yes, Collibra, EDQ, and Informatica are common bedfellows. Alation paired with Navigator or Atlas in the Hadoop ecosystem is not unusual for data lakes, either.
- Master data management. There’s usually the traditional relational-based MDM tool running to support complex mappings of data between systems. It lives at the core of the databases and integration. Then you find graph-based MDM to handle complex views for customers and products sitting closer to the BI and business application systems when logical models need more preparation and conversion to semantic or business models. Then there’s the DIY MDM living inside data virtualization and Kafka that informs the data model and mapping for BI views, microservices, and ESBs.
- Data integration. This is where the fun begins, as ETL, data virtualization, a data bus, streaming, replication, ingestion tools, and data preparation all live independently or in an integrated pipeline. Workload patterns define which data integration tools are used and where in the data flow or the ecosystem (cloud/on-premises) they are needed. Your data architecture takes on reference patterns aligned to transactions, business processes, automation, and analytical (OLAP) and operational (OLTP) workloads. Your reference architecture is designed first for data flows, not for data sources as was traditionally done.
- Data profiling and lineage. Standalone or embedded: take your pick. But the key is that if the profiling and lineage analysis is embedded, chances are it’s oriented toward the foundational solution. Repositories profile for metadata and data source capture. Data governance tools profile for logical and business metadata and source lineage. Data catalogs profile for physical and logical metadata, data relationships, and source lineage. Some might profile data flow metadata. Standalone tools tend to focus on metadata, model, lineage, and data flow analysis for root-cause analysis. Be mindful of who will use the tool and what they need to know, and remember that profiling and lineage analysis is mandatory for everyone with data responsibilities who needs to understand the data.
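
If "profiling" feels abstract, here is a minimal sketch in Python of the kind of physical metadata a profiler captures for each column: row count, nulls, distinct values, and a rough type. It is an illustration under stated assumptions, not any vendor's API. The function names (`profile_column`, `profile_table`) and the sample `customers` records are hypothetical, and commercial tools go much further (patterns, relationships, lineage).

```python
from collections import Counter


def profile_column(name, values):
    """Return a small physical-metadata profile for one column."""
    # Treat None and empty strings as missing values (an assumption of this sketch).
    non_null = [v for v in values if v is not None and v != ""]
    type_counts = Counter(type(v).__name__ for v in non_null)
    return {
        "column": name,
        "rows": len(values),
        "nulls": len(values) - len(non_null),
        "distinct": len(set(non_null)),
        # The most frequent Python type stands in for an inferred data type.
        "inferred_type": type_counts.most_common(1)[0][0] if type_counts else None,
    }


def profile_table(rows):
    """Profile every column found in a list of dict records."""
    columns = sorted({key for row in rows for key in row})
    return [profile_column(col, [row.get(col) for row in rows]) for col in columns]


# Hypothetical sample records, for illustration only.
customers = [
    {"id": 1, "email": "a@example.com", "region": "EMEA"},
    {"id": 2, "email": None, "region": "EMEA"},
    {"id": 3, "email": "c@example.com", "region": "APAC"},
]

for profile in profile_table(customers):
    print(profile)
```

Even a profile this small tells different users different things: a data engineer looks at types and nulls, while a data steward looks at distinct values to check them against business definitions. That is the "who will use the tool" question in practice.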