To paraphrase Yoda, organizations need to learn how to lose control of their restrictive, overly rigid governance regimens. This involves recognizing when governance is an impediment to innovation -- particularly to innovation that has the potential to accelerate analytical development.
There's no shortage of technologies or practices that fit this category. Examples include analytics sandboxes, which are incubators for analytical development and prototyping; data warehouse automation (DWA) software, which embeds an agile, test-driven development methodology in a continuous delivery paradigm; and self-service data-prep tools, which put the power of data transformation and advanced data engineering -- including the design of complex data flows -- into the hands of data scientists, business analysts, and other savvy users.
The problem is that an overly restrictive governance regimen can impede -- and in some cases, stymie -- all of these innovations. Take Teradata Data Lab, which is a great example of an innovative sandbox environment. Teradata positions Data Lab as an analytics sandbox that's embedded in the Teradata data warehouse.
In other words, Data Lab has access to production data in the Teradata environment; instead of the extraction, transformation, and loading (ETL) of data to and from a physically or logically separate analytics sandbox, Data Lab users can access it, in effect, in situ. In the Data Lab model, users themselves -- individuals or groups -- control the sandbox instance, which gives them the capacity to provision access to warehouse data or resources.
Data Lab users generally describe the combination of self-service freedom and in-database, massively parallel processing performance as a win-win, albeit with one critical exception: Data Lab is most useful if and when an organization adjusts or adapts internal processes and, particularly, the governance regimen to exploit it.
"It is extremely difficult to get something into [the production] work stream," said a senior director of architecture with a company that specializes in video analytics and targeted marketing. "The business groups are like, 'I don't have time to go through all of that red tape, whatever it involves, [from] project managers [to] business analysts [to generating] reams of documentation to get anything done.' It's a bit too onerous. There's a lot of back and forth with that right now … so unfortunately, [the Data Lab sandbox] ends up being a system that's not fully integrated in many cases."
Even though Data Lab lets the company rapidly develop and test analytics and ETL prototypes against production data inside the destination Teradata environment itself, the company still won't relax the overly restrictive requirements of its software development life cycle and production pipeline. For example, the senior director argues, the company could amend its policies to expedite Data Lab prototypes into production; however, that isn't something it's willing to do: "We're being hamstrung now because it's so hard to put stuff into production."
Alain Bond, manager of information management with Canadian National Railway (CN), found himself in a similar situation. Basically, CN's internal policies mandate the use of an enterprise-standard ETL tool -- in this case, PowerCenter from Informatica -- for processing all the data destined for its enterprise data warehouse (EDW). It's impossible to square the use of a DWA tool with this requirement without negating the very advantages (i.e., agility and accelerated analytical development) that DWA purports to deliver.
In response, Bond came up with a pragmatic solution: he's using DWA technology (in this case, RED from WhereScape) to build subject- or function-specific data marts. This permits him to give his line-of-business customers what they need, such as, to cite just one example, safety and maintenance analyses that take into account the location, age, and operating telemetry of railroad switches.
Best of all, this data also gets pushed into the much slower production pipeline for CN's EDW. (DWA tools generate data definition language, or DDL, code that can be passed to Informatica and other ETL tools.) The irony is that Bond's approach exploits a kind of governance fiction: CN classifies the data he's ingesting and the analytics that result from it as "prototypes."
These prototypes are in reality tested, hardened, and effectively perfected production reporting and analytics applications. "We're using RED only to 'prototype,'" Bond explains. "WhereScape generates the documentation, all of the data lineage, all of the business rules. Once we get the prototype right, we pass it over to the ETL team.
"Building a prototype used to [involve] working with a single extract, with a static set of data. When you do it [with DWA software], you're able to refresh with live data, which means you can bring in the deltas and everything. This really helps to iron out issues up front because you can show it to users and they're able to give you much better requirements much earlier. If you do a good job with your last [i.e., presentation] layer of information ... you can just unplug that presentation layer [from the data mart] and with a minimal amount of work start [using it] with the EDW" once, he adds, it's finally vetted for use there.
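Bond's contrast between a static extract and a prototype that is refreshed with deltas can be pictured with a small sketch. This is an illustration only, not WhereScape RED's API or CN's actual pipeline: the `Delta` class, the `apply_deltas` function, the switch IDs, and the field names are all hypothetical, and a real DWA tool would do this in the database rather than in Python dictionaries.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Delta:
    """One change captured from the live source (hypothetical shape)."""
    op: str                     # "upsert" or "delete"
    key: str                    # business key, e.g. a made-up switch ID
    row: Optional[dict] = None  # changed columns for an upsert


def apply_deltas(mart: dict, deltas: list) -> dict:
    """Return a refreshed copy of the prototype mart; the input is not modified."""
    refreshed = dict(mart)
    for delta in deltas:
        if delta.op == "upsert":
            # Merge changed columns into the existing row, or create the row.
            refreshed[delta.key] = {**refreshed.get(delta.key, {}), **(delta.row or {})}
        elif delta.op == "delete":
            refreshed.pop(delta.key, None)
        else:
            raise ValueError(f"unknown delta op: {delta.op!r}")
    return refreshed


# A static extract is frozen at load time...
mart = {"SW-001": {"age_years": 12, "location": "yard A"}}

# ...whereas a delta-driven refresh folds in changes from the live source.
deltas = [
    Delta("upsert", "SW-001", {"age_years": 13}),
    Delta("upsert", "SW-002", {"age_years": 3, "location": "yard B"}),
]
refreshed = apply_deltas(mart, deltas)
```

Under these assumptions, users who review `refreshed` see current rows instead of the original extract, which is the property Bond credits with surfacing better requirements earlier.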