Stick to the process
An ideal process flow diagram would have as few interposing boxes as possible. This never, ever happens in practice, and for many reasons. One of the most important is that software vendors target the product, not the process. They pursue a strategy that attempts to insert or implant a product as one of several interposing boxes in a process flow. In effect, they design themselves into a process.
The data management (DM) industry’s response to big data has been more of the same. In most cases, this means a bouillabaisse of proprietary, stack-centric big data “solutions”, self-serving technological or architectural prescriptions, and not-yet-ready-for-prime-time front-end tools.
But big data is different because it’s inescapably multi-disciplinary: it presupposes interconnectedness, interoperability and exchange, between and among domains. It is holistic in scope in precisely the way that data management is not.
From a product perspective, a big data-aware tool must operate in a context in which problems, practices and processes are multi-disciplinary. No product will be completely self-sufficient or operate in isolation. But this doesn’t mean you can’t have big data-oriented products that target very specific use cases, or more generalised big data-oriented products that address specific process, domain or function practices. Nor does it automatically mean that an entire class of existing products will suddenly become “pre-Big Data”.
More of the same is more of the wrong approach
Yet most vendors are developing and marketing “Big Data-in-a-Platform” products. The one thing these “solutions” have in common is a product-centric model: each aims to insert or implant itself, as an interposing box, into a process. And each interposing box introduces latency and increases complexity and fragility.
Worse still, each interposing box has its own infrastructure. This includes its own vendor-specific support staff with its own esoteric knowledge-base. At best, this means recruiting armies of Java or Pig Latin programmers, or training up DBAs and SQL programmers in the intricacies of HQL. At worst, it means investing significant amounts of time and money to develop platform-specific knowledge-bases.
Automation is the answer
The way to address this dysfunction is to focus on automating the practices and processes that support and enable a data warehouse environment, such as scoping, warehouse creation, ongoing management, and periodic refactoring. You could even automate the creation and management of warehouse documentation, diagrams, and lineage information by completely eliminating hand-coding in SQL or in esoteric, tool-specific languages.
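As a rough illustration of what eliminating hand-coding might look like, the sketch below derives the table definition, its documentation and its lineage from a single metadata description instead of maintaining each by hand. It is a minimal sketch, not any particular product's approach: the Column and Table classes, the three generate_ functions and the example table and source names are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class Column:
    name: str
    sql_type: str
    source: str            # upstream field this column is loaded from (hypothetical)
    description: str = ""


@dataclass
class Table:
    name: str
    columns: List[Column] = field(default_factory=list)


def generate_ddl(table: Table) -> str:
    """Build a CREATE TABLE statement from the metadata."""
    cols = ",\n".join(f"    {c.name} {c.sql_type}" for c in table.columns)
    return f"CREATE TABLE {table.name} (\n{cols}\n);"


def generate_docs(table: Table) -> str:
    """Build plain-text documentation from the same metadata."""
    lines = [f"Table: {table.name}"]
    for c in table.columns:
        lines.append(f"  {c.name} ({c.sql_type}): {c.description}")
    return "\n".join(lines)


def generate_lineage(table: Table) -> List[Tuple[str, str]]:
    """List (source, target) pairs, one per column."""
    return [(c.source, f"{table.name}.{c.name}") for c in table.columns]


# Hypothetical warehouse table described once, as metadata.
dim_customer = Table(
    name="dim_customer",
    columns=[
        Column("customer_id", "INTEGER", "crm.customers.id", "Surrogate key"),
        Column("full_name", "VARCHAR(200)", "crm.customers.name", "Customer name"),
    ],
)

print(generate_ddl(dim_customer))
print(generate_docs(dim_customer))
for source, target in generate_lineage(dim_customer):
    print(f"{source} -> {target}")
```

Because all three outputs come from the same description, a change to the metadata flows through to the SQL, the documentation and the lineage together, which is the point of automating them.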
Big data products do not need their own infrastructure. They should speak the languages and accommodate the idiosyncrasies of OLTP systems, warehouse platforms, analytic databases, NoSQL or big data repositories, BI tools, and all of the other “boxes” that collectively comprise an information ecosystem.
Products should target the disconnects between isolated systems in a process, the points at which a process flow breaks down. This type of breakdown is the inevitable consequence of a product-focused development and marketing strategy. By the looks of it, we’re going to see lots of breakdown in the big data-scape.
Think of the big data-scape as a kind of free-trade zone in which “trade” is analogous to process: i.e., data moves from box to box, with minimal restriction or interference and without platform-specific embargoes from inessential interposing boxes.
Automation is the answer. Not automation for its own sake, but automation as integral to process flow to eliminate breakdown, increase responsiveness, lower costs and empower IT to focus on value creation.
Let’s all try not to botch this one up!