A number of projects I have been working on recently have benefited from the use of Data Vaults. The success has led to questions regarding Data Vaults and their application to Enterprise Data Warehousing (EDW).
This comparative analysis has been created to assist customers with making decisions regarding the design approach for an EDW Project. It compares and contrasts two different approaches: Third Normal Form (3NF), historically a very popular option for EDW and Data Vaults, a purpose-designed for EDW modeling technique. There is no “right” answer, both options have advantages and drawbacks, and the final decision should be based on an organization’s unique circumstances. WhereScape professional services are experienced in both techniques, and are available to make recommendations where required.
Third Normal Form (3NF) model
Definition of 3NF:
- A database, in which each attribute in the relationship, is a fact about a key, the whole key and nothing but the key. 3NF usually refers to a fully normalized structure where information is stored in 3rd Normal Form (E. F. Codd’s 3NF).
- A 3NF structure is used for the EDW popularized in W. H. Inmon’s Corporate Information Factory (CIF).
- The generally accepted goal is that a company has one centralized EDW - Data Marts and other Analytical Business Structures are fed from the EDW.
- The data warehouse is stored using Database Normalization rules in 3NF – tables are structured into subject areas.
- 3NF models are used for Operational Data Stores as well as Enterprise Data Warehouses.
- 3NF structures are not recommended for queries and reports.
- 3NF databases offer performance and stability for Online Transaction Processing Applications (OLTP).
Data Vault Model
Definition of Data Vault:
- The Data Vault (DV) has a detail oriented, historical tracking, and uniquely linked set of normalized tables that support one or more functional areas of business. It is a hybrid approach designed to encompass the best of breed between third normal form (3NF) and star schema. The design is flexible, scalable, consistent, and adaptable to the needs of the enterprise.
- A DV is intended to address the business challenges that DW practitioners meet. It is designed to avoid or minimize issues related to Dimensional Model and 3NF methodologies.
- DV Modeling is an EDW design methodology that provides historical storage from multiple sources with complete tracking back to system of origin.
- This method can be adapted to changes in the business environment.
- The Data Vault is organized around existing business keys.
Comparison between 3NF and Data Vault for Enterprise Data Warehousing
|Initial Purpose||OLTP||EDW||The DV link structure allows flexibility in handling relationships between the Business Keys (M:M, M:1, 1:M, 1:1, etc.) without changing the model structure. In 3NF, relationships between Business Keys in the same table are "solidified" to, for example, 1:M. When past or future data requires different relationships (e.g. new set of data requires M:M instead of 1:M), the whole structure has to be changed in 3NF, while a DV supports this case in Link table.|
|Number of Objects||Less||More||The number of objects in a DV vs. 3NF is approximately a 3:1 ratio at the beginning of the EDW build and approx. 2:1 for mature warehouses. Objects in DVs are generally smaller and easier to confederate.|
|Code Complexity||Higher||Lower||Although code is generated by WhereScape RED in both cases, the DV code makes it slightly less complex to follow and document, because the objects are smaller and have specific meaning.|
|Speed of Processing||More to Same||Less to Same||DVs allow a high level of parallelism, therefore the total time to complete the DW loading, including Data Mart layer, is often lower than it would be for 3NF. DVs generally offer more stability and performance.|
|Number of Joins||Less||More||DVs utilize more joins; however, this does not decrease throughput of the DW process due to high parallelism and effectiveness of the foreign key (FK) joins.|
|Number of Indexes||Less to Same||More to Same||The DV utilizes more indexes, because there are more tables, however most additional indexes are on the PK/FK. The number of indexes may be equal for mature DWs because of the need to index non-key columns in 3NF in order to support the variety of queries against attributes in the tables.|
|Adaptability to Changes||Medium||High||Adaptability is necessary in case there are changes in the business, e.g. adding a new data source of changing analytical requirements. 3NF structures need rework while with DV additional Hubs/Links/Satellites can be added without disturbing the existing structure. The changes are typically in a single place and easily adapted because the code is likely to be at only 1 or 2 places.|
|Agile Development||Medium||High||Although both models work well with Agile Development principles, DVs allow the whole EDW project to be broken into well-defined stories/sprints/tasks, etc. based on Units of Work. It is easier to put the DV design into stories/sprints/tasks.|
|Historical Data Handling||Harder||Easier||Although both models allow historical data to be handled, in DVs it is done in separate object types: one or many Satellites, while in 3NF history is held in a Normalized History object very much like a SCD. When built in WhereScape RED, Satellites are built the same way as 3NF history tables. This means Satellites = Normalized History tables.Another point to consider is the issue of cascading snapshot dates in master-detail-detail 3NF models. This adds to the complexity of the ETL code and makes it difficult to extract "as-of" data sets (i.e. need to use “select(max(snapshot date))” type sub queries).|
|Real-Time, Near-RT Loads||Good||Good||Although both 3NF and DVs handle Real-Time and Near Real-Time loads well, DVs are preferred because of high-parallelism, hence shorter individual loads.|
|Very Large DW Loads||Good||Good||Both 3NF and DVs handle very large database loads well. Nevertheless, any changes in business that require vast initial load are easier to handle in DVs.|
|“Forgiveness” of the Model||Less Strict||More Strict||DVs require DW practitioners to have more discipline in following architectural and processing rules; defying the DV rules can potentially destroy the whole DW structure.|
|Historical Tracking||Good||Best||DVs allow complete “geological” type tracing from the lowest grain of data to the source system’s atomic level. 3NF is mainly designed for holding operational data history.|
|Data Access||Bad||Worse||DVs and 3NF are both impractical for direct querying; it is particularly difficult for DW users to combine data from different sources into meaningful information in 3NF.A Presentation Layer is required for both models and the amount of effort to build the layer is about the same.On the other hand, if these queries are performed by a super user, I believe DVs are better (not worse) because the user can create and follow a template to generate SQL Views to extract data from DVs much easier than with 3NF.|
|Access to Skilled Modellers||Less||More||There are more people trained in 3NF than there are in DV.|
Figure 1 Predisposition of various “flavors” of Data Warehousing Models in relation to various purposes of Warehouses
The following shows the data models, built from the same source data, for a 3NF and a DV architecture. The diagrams are an example of what can be produced by WhereScape 3D at the planning stage of an EDW Project.
Figure 2 Northwind Tutorial Data Model built in Data Vault “flavor”.
Figure 3 Northwind Tutorial Data Model built in Third Normal Form “flavor”
- http://danlinstedt.com/datavaultcat/mathematics-of-joins-denormalization-rows-per-block-and-io/ et al.
- Building the Operational Data Store, 2nd Edition by William H. Inmon
- DW2.0, 2008 by William H. Inmon. Bill Inmon stated that the “Data Vault is the optimal approach for modeling the EDW in the DW2.0 framework.” (DW2.0)
- Supercharge your Data Warehouse, 2010-2011 by Dan Linstedt
- Data Vault and Data Modeling, 2010 Genesee Academy
- http://www.box.net/shared/mic6eo2np68y9fff4y7b et al.