
A Four-Phased Approach to Building an Optimal Data Warehouse

Date: 12 September 2017 Author: Barry Devlin

Published via datanami on Sept 12, 2017 (View the article at datanami).

If you’re planning to create a data warehouse, make sure you create one that is cross-functional and provides a long-life foundation for data provision and decision support. This means covering the entire enterprise and satisfying the needs of multiple projects and groups for several years. The foundation must provide consistent, reconciled, legally binding data to your business clients.

But that's easier said than done, right?

Think of your project in these four steps: Design, Build, Implement and Maintain.

Designing your data warehouse

Let’s start at the design phase. When planning your design, the vision for your new data warehouse is best laid out over an enterprise data model (EDM), which consists of high-level entities including customers, products and orders.
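
To make the idea concrete, the high-level entities mentioned above might be sketched as follows. This is a minimal, hypothetical illustration; a real EDM carries far more attributes and relationships.

```python
from dataclasses import dataclass
from datetime import date

# Hypothetical high-level EDM entities: customers, products, orders.
@dataclass
class Customer:
    customer_id: str
    name: str

@dataclass
class Product:
    product_id: str
    description: str

@dataclass
class Order:
    order_id: str
    customer_id: str  # relates the order to a Customer
    product_id: str   # relates the order to a Product
    order_date: date

# An order ties a customer to a product on a given date.
order = Order("O-1", "C-1", "P-1", date(2017, 9, 12))
```

The point of the EDM is exactly this level of abstraction: agreeing on what a "customer" or "order" is before arguing about physical tables.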

In the past, EDMs were built from scratch, which worked for data modelers but not business users who were drawn into definitional debates rather than seeing the desired results. Today, many EDMs are customized from standard industry models, which are much faster and easier to follow. Sometimes an EDM already exists from a previous data warehouse or other enterprise-wide undertakings, such as master data management.

After establishing your EDM, your next major challenge is to define and refine the logical and physical structure of the data warehouse relational tables. This part will consider the limitations of the source systems, the challenges in combining data from multiple sources, and possible changes in business needs and source system structures over time.

A traditional design approach maps entities either to “loosely normalized” tables based on third normal form (3NF) or to a dimensional (star-schema) model. However, both present challenges for modern data warehouse development.

Another approach uses the Data Vault Model (DVM), a hybrid of the 3NF and star-schema forms. First introduced by Dan Linstedt, the Data Vault is a detail-oriented, history-tracking, linked set of normalized tables designed to support multiple functional business areas.

The DVM consists of three specialized types of entities/tables: hubs based on rarely changed business keys, links that describe associations or transactions between business keys, and satellites that hold all temporal and descriptive attributes of business keys and their associations.

A new version, introduced in 2013, consists of a data model, methodology, and systems architecture, providing a design basis for data warehouses that emphasizes core data quality, consistency, and agility to support enterprise-wide data provision requirements. In May 2017, data warehouse automation specialist WhereScape announced automation software to enable rapid and agile Data Vault 2.0 development, cutting delivery time of Data Vault-based analytics solutions by two-thirds.
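
For illustration, the three table types might look like the sketch below, built here with Python's sqlite3 against hypothetical customer and order entities. Column and table names are invented for the example; real Data Vault implementations follow stricter conventions.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# Hubs: one row per rarely-changing business key.
cur.execute("""
    CREATE TABLE hub_customer (
        customer_hkey  TEXT PRIMARY KEY,   -- surrogate/hash of the business key
        customer_id    TEXT NOT NULL UNIQUE,
        load_date      TEXT NOT NULL,
        record_source  TEXT NOT NULL
    )""")
cur.execute("""
    CREATE TABLE hub_order (
        order_hkey     TEXT PRIMARY KEY,
        order_id       TEXT NOT NULL UNIQUE,
        load_date      TEXT NOT NULL,
        record_source  TEXT NOT NULL
    )""")

# Link: an association or transaction between business keys.
cur.execute("""
    CREATE TABLE link_customer_order (
        link_hkey      TEXT PRIMARY KEY,
        customer_hkey  TEXT NOT NULL REFERENCES hub_customer,
        order_hkey     TEXT NOT NULL REFERENCES hub_order,
        load_date      TEXT NOT NULL,
        record_source  TEXT NOT NULL
    )""")

# Satellite: temporal, descriptive attributes of a hub key;
# history is tracked by keeping one row per load date.
cur.execute("""
    CREATE TABLE sat_customer_details (
        customer_hkey  TEXT NOT NULL REFERENCES hub_customer,
        load_date      TEXT NOT NULL,
        name           TEXT,
        city           TEXT,
        PRIMARY KEY (customer_hkey, load_date)
    )""")
```

Note how descriptive attributes live only in satellites: when a customer's details change, a new satellite row is added rather than updating the hub, which is what gives the model its history-tracking character.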

Get Busy Building

Once you set your design, now comes the hard work of building your data warehouse. But before you start, accept the fact that no matter how nicely you’ve designed your model, you will face the reality of imperfect data source systems.

Data warehouse builders struggle with missing data in source systems, poorly defined data structures, incorrect content and missing relationships. Implementation is a delicate balancing act between the vision of the model and the constraints of the sources.

The building process comes down to five steps:

  1. Understand the data sources.  Keep in mind that legacy systems might be “bent to fit” emerging and urgent requirements. And modern big data sources might lack documentation.
  2. Compare the data available to the data warehouse model and define appropriate transformations to convert the former to the latter.
  3. Where transformations are too difficult, modify the data warehouse model to accommodate the reality of the data sources. Changing the data sources is usually impossible for reasons of cost and politics.
  4. Test performance of load/update processes and check the ability of the modified model to deliver the data the business requires.
  5. If successful, declare victory. Otherwise, rinse and repeat.
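
Step 2 and step 3 above are where most of the effort goes. As a minimal sketch, assuming hypothetical source rows with the imperfections described (missing fields, inconsistent codes), a transformation might map them onto the warehouse model with explicit defaults rather than rejecting rows outright:

```python
# Hypothetical source rows from an imperfect legacy system.
source_rows = [
    {"cust_no": "C001", "name": "Acme Ltd", "country": "ZA"},
    {"cust_no": "C002", "name": None, "country": "za"},   # missing name
    {"cust_no": "C003", "name": "Widgets Inc"},           # missing country
]

def transform(row):
    """Map a source row onto the warehouse model, supplying
    defaults for missing data instead of dropping the row."""
    return {
        "customer_id": row["cust_no"],
        "customer_name": row.get("name") or "UNKNOWN",
        "country_code": (row.get("country") or "??").upper(),
    }

warehouse_rows = [transform(r) for r in source_rows]
```

Whether "UNKNOWN" defaults, rejection, or a change to the warehouse model is the right response is exactly the balancing act between the model's vision and the sources' constraints.
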

Traditionally, the output of the above process would be encoded in a script or program and run overnight in batch to populate the warehouse. Any changes in requirements or the source systems would require a round trip back through steps one to five, followed by code update. The approach is manual, time-consuming, and error-prone.

Improved approaches to automating the process have emerged in stages over the history of data warehousing: extract, transform, load (ETL) tools, data integration systems and, finally, data warehouse automation (DWA). Each stage on this journey represents an increasing level of automation, with DWA addressing the entire process of designing, building, operating and maintaining a data warehouse. Companies such as WhereScape offer tools to automate the data source discovery, design and prototyping phases of projects. Additionally, advanced automation solutions with an integrated development environment (IDE) targeted to your data platform can eliminate most of the traditional hand-coding required and dramatically streamline and accelerate the development, deployment, and operation of data infrastructure projects.

In the transition from design to build, the combination of a well-structured data model and DWA offers a more powerful approach. This is because the data model provides an integrated starting set of metadata that describes the target tables in both business terms and technical implementation. This is particularly true with the Data Vault model, which has been designed and optimized from the start for data warehousing.

A DWA tool automates the transformation of the data structures of the various sources to the optimized model of the Data Vault and populates the target tables with the appropriate data. This approach creates necessary indexes and cleanses and combines source data to create the basis for the analysis to address the business need.
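
The mechanism behind such tools is metadata-driven generation. As a toy sketch (the metadata layout and names here are invented, not any vendor's format), target structures can be generated from a metadata description instead of being hand-coded per table:

```python
# Hypothetical target-table metadata, as a DWA tool might hold it.
metadata = {
    "table": "hub_product",
    "columns": [
        ("product_hkey", "TEXT PRIMARY KEY"),
        ("product_id", "TEXT NOT NULL"),
        ("load_date", "TEXT NOT NULL"),
        ("record_source", "TEXT NOT NULL"),
    ],
}

def generate_ddl(meta):
    """Generate CREATE TABLE DDL from metadata -- the same template
    applies across hundreds of tables instead of hand-written scripts."""
    cols = ",\n    ".join(f"{name} {typ}" for name, typ in meta["columns"])
    return f"CREATE TABLE {meta['table']} (\n    {cols}\n)"

ddl = generate_ddl(metadata)
```

Because the repetitive Data Vault patterns (hubs, links, satellites all share a shape) are captured once in templates, a change to the model regenerates the code rather than requiring a manual rewrite.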

Shifting to Operations

Having designed and built your data warehouse, the next step is to deliver it successfully to the business and run it smoothly on a daily basis.

Historically, however, approaches to operating a data warehouse environment have been somewhat lax. Manual and semi-automated methods for populating and updating the contents of the warehouse have been widespread. Advances made in the data warehouse itself are being offset by the spread of ad hoc approaches in departmentally managed data marts. The data lake has seen the re-emergence of a cottage industry of handcrafted scripts, often operated by individual data scientists.

The data warehouse has become the repository of truth and history for businesses to analyze challenges and opportunities, needs that can be addressed by advanced management and automation practices. The combination of DWA and the Data Vault addresses these needs from two perspectives: deployment of function and ongoing day-to-day operations.

Deployment seldom gets the attention it deserves. A data warehouse is substantially more complex than most IT projects, and as we move toward data-driven business and agile development, deploying it correctly matters more than ever.

As you move from a development phase to test, quality assurance, and on to production, you must address mundane issues such as packaging and installation of the code built in the previous phase. The clear aim is to automate and speed deployment in an agile environment to reduce human error across the full lifecycle.

Having deployed the system to production, the next and ongoing task is to schedule, execute, and monitor the continuing process of loading and transforming data into the data warehouse. In this phase, jobs consist of a sequence of interdependent tasks. To ensure that data consistency is maintained, if a task fails during execution, then all downstream dependent tasks are halted. When the problem has been resolved, the job is restarted and picks up from where it left off, continuing through to completion. From an operational point of view, given the potential interdependencies of data across these systems, it makes sense to manage this ensemble as a single, logical environment.

The smooth, ongoing daily operation of the entire data warehouse environment is a fundamental prerequisite to its acceptance by users and its overall value to the business.

Maintaining with Agility

In more traditional IT projects, when a successful system is tested, deployed and running daily, its developers can sit back and take a well-deserved rest. Developers of today’s data warehouses have no such luxury.

Instead, you can count on working on a regular basis to deliver new and updated data to your businesses. To make life easier, leverage and apply agility whenever possible.

Vendors and practitioners have long recognized the importance of agility to deliver new and updated data through the data warehouse. Unfortunately, such agility has proven difficult to achieve.

Now, ongoing digitalization of business is driving ever-higher demands for new and fresh data. Some people think a data lake filled with every conceivable sort of raw, loosely managed data will address these needs. That approach may work for non-critical, externally sourced social media and Internet of Things data. However, it doesn’t help with historical and real-time data.

Fortunately, the agile and automated characteristics of the Data Vault / DWA approach apply also to the maintenance phase. In fact, it may be argued that these characteristics are even more important in this phase.

One explicit design point of the Data Vault data model is agility. The engineered components and methodology of the Data Vault are particularly well suited to the application of DWA tools.

At this point, widespread automation is essential for agility because it increases developer productivity, reduces cycle times, and eliminates many types of coding errors. Look for a solution that incorporates key elements of the Data Vault approach within the structures, templates, and methodology to make the most of the potential automation gains.

Another key factor in ensuring agility in the maintenance phase is the ongoing and committed involvement of business people. An automated, template approach to the entire design, build and deployment process allows business users to be involved continuously and intimately during every stage of development and maintenance of the data warehouse and marts.

With maintenance, we come to the end of our journey through the land of automating warehouses, marts, lakes, and vaults of data. At each step of the way, combining the use of the Data Vault approach with DWA tools simplifies technical procedures and eases the business path to data-driven decision-making.

About the author: Dr. Barry Devlin is among the foremost authorities on business insight and one of the founders of data warehousing, having published the first architectural paper on the topic in 1988. Barry is founder and principal of 9sight Consulting. A regular blogger, writer and commentator on information and its use, Barry is based in Cape Town, South Africa and operates worldwide.