Building a More Logical Data Warehouse with a Data Vault

Date: 14 August 2017 Author: Mark Madsen

Published via insideBIGDATA (insideBIGDATA.com) on Aug 14, 2017.

In this special guest feature, Mark Madsen, President of Third Nature Inc., discusses the rise of the data vault as a database modeling method that provides long-term historical storage of data from multiple operational systems. It lets users examine historical data for purposes such as auditing and trace data back to where it originated. Mark has worked in the analytics field for 25 years, starting in AI at the University of Pittsburgh and robotics at Carnegie Mellon University. Today he is president of Third Nature, where he advises organizations on analytics and data science strategy, architecture and governance. Mark has designed analysis, data collection, and data management infrastructure for companies worldwide. He is also involved with emerging technology as a researcher, speaks on analytics internationally, sits on the O’Reilly Strata conference committee and chairs the Accelerate data science and analytics conference.

For more than 25 years, storage experts have operated on the premise that data works best when all access and controls are centralized. As business intelligence needs took hold, organizations began working towards a unified data architecture that made sense for all data use. But despite our best efforts, data continues to be distributed across more silos than ever before.

The failure to deliver access for all and provide a common centralized home was made evident by the high demands of analytics, IoT, big data and virtualization. With these technologies in hand, organizations need to do more with the data they collect. However, data warehouse systems were designed to address predetermined questions and needs via queries, reports and dashboards.

Organizations that want to support a range of new analytic use cases need to leverage techniques that are defined by open-ended exploration and discovery.

Logical Data Warehouse

The Logical Data Warehouse (LDW) was introduced as a means to accommodate multiple repositories for comprehensive and easy access to data. But even the logical data warehouse lacks a common approach or methodology, especially in distributed environments.

LDW systems serve as a primary destination for derived transaction data from a variety of sources including pre-packaged and custom applications, streamed processing systems, services and cloud applications. Often the sources for this data are available in a warehouse, but sometimes they reside in another data repository.

In these distributed environments, users need a way to collect and manage data from transactional systems in order to support multiple downstream integration and analysis processes that form the logical data warehouse.

Introducing the Data Vault

A data vault is a database modeling method that provides long-term historical storage of data from multiple operational systems. It lets users examine historical data for purposes such as auditing, and trace any piece of data back to the system where it originated.

The data vault approach treats the problem of data warehousing as two issues: data collection and data use. It separates these issues as two classes of the overall problem. On the one side are data-vault-specific techniques for collection and distribution. On the other side are techniques to enable delivery of data, such as populating a dimensional model for business intelligence.

The data vault is essentially a different way of modeling content and its relationships to data. Rather than forcing all data into a unified model based on user needs, a data vault alters the unified model to allow information to be collected and loaded easily. While there are some structural changes to capture keys and relationships, the data values themselves are not touched.
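To make the idea of "structural changes to capture keys and relationships" concrete, here is a minimal in-memory sketch. It assumes the standard Data Vault building blocks of hubs (business keys), links (relationships between keys) and satellites (descriptive values with load date and record source). Those terms come from general Data Vault practice rather than from this article, and the class, table, and source names are hypothetical.

```python
import hashlib
from datetime import datetime, timezone


def hash_key(*parts: str) -> str:
    """Deterministic key derived from one or more business keys."""
    normalized = "||".join(p.strip().upper() for p in parts)
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


class RawVault:
    """Toy raw vault: hubs hold keys, links hold relationships, and
    satellites hold descriptive values exactly as the source sent them."""

    def __init__(self):
        self.hubs = {}        # hub name -> {hash key: business key}
        self.links = {}       # link name -> {link hash key: tuple of hub hash keys}
        self.satellites = {}  # satellite name -> list of history rows

    def load_hub(self, hub, business_key):
        key = hash_key(business_key)
        self.hubs.setdefault(hub, {}).setdefault(key, business_key)
        return key

    def load_link(self, link, *hub_keys):
        key = hash_key(*hub_keys)
        self.links.setdefault(link, {}).setdefault(key, hub_keys)
        return key

    def load_satellite(self, satellite, parent_key, attributes, record_source):
        rows = self.satellites.setdefault(satellite, [])
        history = [r for r in rows if r["parent_key"] == parent_key]
        if history and history[-1]["attributes"] == attributes:
            return False  # nothing changed, so no new history row
        rows.append({
            "parent_key": parent_key,
            "load_date": datetime.now(timezone.utc),
            "record_source": record_source,  # where the row came from (audit trail)
            "attributes": dict(attributes),  # copied verbatim: no cleansing, no rules
        })
        return True


vault = RawVault()
customer = vault.load_hub("hub_customer", "C-100")
order = vault.load_hub("hub_order", "SO-9001")
vault.load_link("link_customer_order", customer, order)
vault.load_satellite("sat_customer", customer, {"name": "acme corp "}, "crm")
vault.load_satellite("sat_customer", customer, {"name": "acme corp "}, "crm")  # duplicate, skipped
vault.load_satellite("sat_customer", customer, {"name": "Acme Corp."}, "crm")  # change, appended
```

The satellite ends up with two history rows for the customer, each stamped with its source, and the first still carries the trailing space the source system sent. Only the keys are normalized for hashing; the values are stored as received.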

Take, for instance, Micron Technology, a semiconductor company based in Idaho. The company recently worked with WhereScape, a data automation solutions company, to accelerate access to manufacturing data. In the semiconductor business, much of the big data that matters comes from the manufacturing process. For Micron, analyzing the drivers of cycle time, yield, and quality would help streamline its data needs. The company decided to introduce a Data Vault into its environment to manage its significant data workloads.

The enterprise analytics and data department formed an agile Scrum team to implement the Data Vault. The new technology enabled the team to move straight to prototyping, allowing quicker delivery to users. Beyond faster delivery, Micron saw greater consistency and quality, along with more transparency across the entire company.

Unlike traditional third normal form (3NF) or dimensional methodologies, the data vault does not model data from the point of data use. Instead, it is designed to focus on capturing and managing the source data first and addressing the use of that data afterward. The Data Vault allowed Micron to complete projects in a matter of days rather than the months they previously took.

Storage experts describe Data Vault modeling as a best-of-all-worlds approach that combines elements of 3NF and dimensional modeling, making it straightforward to map source data into the repository and speeding initial loading. The Data Vault also does not apply business rules or data quality rules to data as it is loaded into the raw vault, which saves effort during data collection.

Data Vault structures are also considered to be more flexible and resilient than the corresponding implementation in 3NF or dimensional models. Introducing non-trivial changes, such as redefining the relationships between business keys, won’t break or damage a Data Vault model.
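One way to picture this resilience is with a small schema experiment. The sketch below uses Python's built-in sqlite3 module and hypothetical table names and placeholder keys. It assumes the common Data Vault convention that a redefined relationship is recorded as a new link table (here, orders that now also belong to a store), so the existing hubs and links are never altered.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE hub_customer (customer_hk TEXT PRIMARY KEY, customer_id TEXT,
                               load_date TEXT, record_source TEXT);
    CREATE TABLE hub_order (order_hk TEXT PRIMARY KEY, order_id TEXT,
                            load_date TEXT, record_source TEXT);
    CREATE TABLE link_customer_order (
        link_hk TEXT PRIMARY KEY,
        customer_hk TEXT REFERENCES hub_customer(customer_hk),
        order_hk TEXT REFERENCES hub_order(order_hk),
        load_date TEXT, record_source TEXT);
    INSERT INTO hub_customer VALUES ('c1', 'C-100', '2017-08-01', 'crm');
    INSERT INTO hub_order VALUES ('o1', 'SO-9001', '2017-08-01', 'erp');
    INSERT INTO link_customer_order VALUES ('l1', 'c1', 'o1', '2017-08-01', 'erp');
""")


def snapshot(connection):
    """Return {table name: (CREATE statement, row count)} for every table."""
    tables = connection.execute(
        "SELECT name, sql FROM sqlite_master WHERE type = 'table' ORDER BY name"
    ).fetchall()
    return {
        name: (sql, connection.execute(f"SELECT COUNT(*) FROM {name}").fetchone()[0])
        for name, sql in tables
    }


before = snapshot(conn)

# The business redefines the relationship: an order now ties a customer to a store.
# The change is purely additive: one new hub and one new link.
conn.executescript("""
    CREATE TABLE hub_store (store_hk TEXT PRIMARY KEY, store_id TEXT,
                            load_date TEXT, record_source TEXT);
    CREATE TABLE link_customer_order_store (
        link_hk TEXT PRIMARY KEY,
        customer_hk TEXT REFERENCES hub_customer(customer_hk),
        order_hk TEXT REFERENCES hub_order(order_hk),
        store_hk TEXT REFERENCES hub_store(store_hk),
        load_date TEXT, record_source TEXT);
    INSERT INTO hub_store VALUES ('s1', 'ST-07', '2017-08-14', 'pos');
    INSERT INTO link_customer_order_store VALUES ('l2', 'c1', 'o1', 's1', '2017-08-14', 'pos');
""")

after = snapshot(conn)
existing_tables_unchanged = all(after[name] == before[name] for name in before)
```

After the change, the definitions and row counts of the original three tables are identical to what they were before, so anything that read from them keeps working while the new link carries the redefined relationship.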

With a Data Vault approach, organizations can tailor delivery to specific formats or models such as populating an information mart as a dimensional model from the raw vault.
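As a rough illustration of populating an information mart from the raw vault, the sketch below derives a small customer dimension from a hub and its satellite. The row layouts, column names, and the two cleanup rules (trimming names, upper-casing regions) are assumptions made for the example. The point is that such rules are applied at delivery time, not when the data is loaded.

```python
from datetime import date

# Hypothetical raw vault contents: values are stored exactly as received.
hub_customer = [
    {"customer_hk": "c1", "customer_id": "C-100"},
    {"customer_hk": "c2", "customer_id": "C-200"},
]
sat_customer = [
    {"customer_hk": "c1", "load_date": date(2017, 1, 5), "name": "acme corp ", "region": "west"},
    {"customer_hk": "c1", "load_date": date(2017, 6, 1), "name": "Acme Corp.", "region": "WEST"},
    {"customer_hk": "c2", "load_date": date(2017, 3, 9), "name": "globex", "region": "east"},
]


def build_dim_customer(hub_rows, sat_rows):
    """Build a current-state customer dimension from hub and satellite rows."""
    latest = {}
    for row in sorted(sat_rows, key=lambda r: r["load_date"]):
        latest[row["customer_hk"]] = row  # later load dates overwrite earlier ones

    dimension = []
    ordered_hubs = sorted(hub_rows, key=lambda h: h["customer_id"])
    for surrogate_key, hub in enumerate(ordered_hubs, start=1):
        sat = latest.get(hub["customer_hk"], {})
        dimension.append({
            "customer_sk": surrogate_key,                     # mart-local surrogate key
            "customer_id": hub["customer_id"],                # business key from the hub
            "customer_name": sat.get("name", "").strip(),     # delivery-time rule
            "region": sat.get("region", "unknown").upper(),   # delivery-time rule
        })
    return dimension


dim_customer = build_dim_customer(hub_customer, sat_customer)
```

Because the satellite keeps every version, the same inputs could just as easily feed a history-preserving dimension. This sketch keeps only the latest row per customer to stay short.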

An LDW has many points of data consumption, often with overlapping data, so there needs to be a mechanism that streamlines data distribution. This is where the data vault really makes an impact. For example, the vault can provide data to the traditional BI query environment, provide core transaction data to a marketing data mart, and populate a behavioral analysis system with customer data. Typically these would be treated as three completely independent systems, despite their overlapping data needs.
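The three consumers in this example can be sketched as three small projections over one shared set of vault rows. Everything here is hypothetical: the flattened order rows stand in for a query over the vault, and the function names simply mirror the BI, marketing, and behavioral uses described above.

```python
from collections import defaultdict

# Hypothetical extract from the vault: one row per order.
vault_orders = [
    {"customer_id": "C-100", "order_id": "SO-1", "order_date": "2017-07-03",
     "amount": 120.0, "channel": "web"},
    {"customer_id": "C-100", "order_id": "SO-2", "order_date": "2017-08-10",
     "amount": 80.0, "channel": "store"},
    {"customer_id": "C-200", "order_id": "SO-3", "order_date": "2017-08-11",
     "amount": 50.0, "channel": "web"},
]


def bi_monthly_revenue(rows):
    """BI query environment: revenue totals per calendar month."""
    totals = defaultdict(float)
    for row in rows:
        totals[row["order_date"][:7]] += row["amount"]
    return dict(totals)


def marketing_transactions(rows):
    """Marketing data mart: core transaction facts only."""
    return [
        {"customer_id": r["customer_id"], "order_id": r["order_id"], "amount": r["amount"]}
        for r in rows
    ]


def behavior_channel_sequences(rows):
    """Behavioral analysis: each customer's channels in order of use."""
    sequences = defaultdict(list)
    for row in sorted(rows, key=lambda r: r["order_date"]):
        sequences[row["customer_id"]].append(row["channel"])
    return dict(sequences)


monthly_revenue = bi_monthly_revenue(vault_orders)
marketing_rows = marketing_transactions(vault_orders)
channel_sequences = behavior_channel_sequences(vault_orders)
```

All three outputs come from the same collected rows, so the overlap between consumers is handled once at collection time instead of in three separate loading pipelines.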

The data vault approach makes it easy to map data for different uses from a central repository. It supports warehousing as a logistical construct, not in the sense of final delivery to consumers. No matter how you look at it, this is a core need and a central design principle in any logical data warehouse architecture.