The image below shows the traditional data warehouse development process, from requirements through to logical and physical modelling, architecture, storage and index management, ETL mapping, deployment and, eventually, maintenance and documentation. Each step has its own tool and its own team, and usually takes between six weeks and six months, sometimes more, to complete. Once you get to the end, it is very complicated to make retrospective changes.
Data Warehouse Models
Business requirements form the foundation of the technical staircase. Once all requirements are collected, the logical model, or reference architecture as it is known, is created. This tests whether the data warehouse should work in theory and is usually presented informally in Microsoft Word, Excel or PowerPoint. Then we need the data itself. We need to understand its sources, where they are located, what quality the data is in and how much of it there is. Then it's time for profiling and modelling, which is traditionally where tools like Erwin or PowerDesigner are used. Models are built to understand the source systems, the core data warehouse and the target data marts. Using conventional tools and methodologies, this process takes weeks or months. With WhereScape you can get here in days or weeks.
Physicalizing the Model
Once the model is approved, it needs to be physicalized. Typically the following questions would be asked:
- What schemas and databases are we going to use?
- What protocols are we going to use to transfer data?
- What security access are we going to use?
- How are we going to store the data: massively parallel? Column-wise? Row-wise?
- Where do we need indexing and what kind of indexing are we going to use?
- Where are the bottlenecks and what are we going to do to resolve them?
This is all usually done with database tools: Oracle TOAD, Microsoft SQL Server Management Studio or DB Architect, to name a few. Many of these are proprietary to the chosen database, so you could be locked into a certain toolset.
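As an illustration of the kind of decisions these questions drive, the sketch below physicalizes a single hypothetical sales fact table. The schema, table and column names are invented, and the syntax assumes Microsoft SQL Server (one of the tools mentioned above); other databases express partitioning and columnar storage differently.

```sql
-- Hypothetical sales fact table, physicalized for SQL Server.
-- Partitioning by date answers the storage question; the columnstore
-- index answers the row-wise vs column-wise question for analytics.
-- Assumes the dw schema already exists.
CREATE PARTITION FUNCTION pf_sale_date (date)
    AS RANGE RIGHT FOR VALUES ('2023-01-01', '2024-01-01', '2025-01-01');

CREATE PARTITION SCHEME ps_sale_date
    AS PARTITION pf_sale_date ALL TO ([PRIMARY]);

CREATE TABLE dw.fact_sales (
    sale_date     date           NOT NULL,
    customer_key  int            NOT NULL,
    product_key   int            NOT NULL,
    quantity      int            NOT NULL,
    net_amount    decimal(18, 2) NOT NULL
) ON ps_sale_date (sale_date);

-- Column-wise storage for fast aggregation over large volumes.
CREATE CLUSTERED COLUMNSTORE INDEX ccix_fact_sales ON dw.fact_sales;
```

Multiply decisions like these by hundreds of tables, each with its own storage and indexing trade-offs, and the scale of the manual effort becomes clear.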
Once these questions have been answered, the code to move data from A to B to C and so on needs to be written. This is traditionally done with ETL mapping and development tools such as Informatica, DataStage or Microsoft SSIS. These tools move the data from the data sources to a landing area, transform it in a staging area, then move it into data models such as star schemas, data vaults and so on. This is all done with platform-specific code such as SQL, Python or C#. All this work needs version control so you can go back to old versions if something goes wrong, which often involves another tool such as GitHub.
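To make the landing/staging/target flow concrete, here is a minimal hand-written sketch of one such movement in plain SQL. The source, table and column names are hypothetical, and a real ETL job would add far more than this: error handling, logging, restartability and so on.

```sql
-- 1. Landing area: raw copy of the source extract, no transformation.
INSERT INTO landing.orders_raw (order_id, customer_code, order_ts, amount_text)
SELECT order_id, customer_code, order_ts, amount_text
FROM   source.orders_extract;

-- 2. Staging area: typed, cleaned, deduplicated.
INSERT INTO staging.orders (order_id, customer_code, order_date, net_amount)
SELECT DISTINCT
       order_id,
       UPPER(TRIM(customer_code)),
       CAST(order_ts AS date),
       CAST(amount_text AS decimal(18, 2))
FROM   landing.orders_raw
WHERE  order_id IS NOT NULL;

-- 3. Target model: resolve the dimension key and load the fact table.
INSERT INTO dw.fact_orders (order_date, customer_key, net_amount)
SELECT s.order_date, d.customer_key, s.net_amount
FROM   staging.orders  AS s
JOIN   dw.dim_customer AS d ON d.customer_code = s.customer_code;
```

Every table in the warehouse needs some variant of this, which is why hand-coding and maintaining it consumes so much developer time.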
End User Interaction
After this development work is complete, the question of how end users are going to interact with the data must be considered. Things like raw data vaults are very storage-friendly but not user-friendly. So teams tend to build data marts such as star schemas with facts and dimensions, and use them as presentation access layers, often with an OLAP cube on top so requests can be processed rapidly.
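The pay-off of that presentation layer is queries like the hedged example below: a narrow fact table joined to descriptive dimensions, simple enough for a BI tool or a business user to write. The table and column names are illustrative only.

```sql
-- Typical presentation-layer query against a star schema:
-- filterable, readable, and fast to aggregate.
SELECT d.calendar_year,
       p.product_category,
       SUM(f.net_amount) AS total_sales
FROM   dw.fact_sales  AS f
JOIN   dw.dim_date    AS d ON d.calendar_date = f.sale_date
JOIN   dw.dim_product AS p ON p.product_key   = f.product_key
GROUP BY d.calendar_year, p.product_category
ORDER BY d.calendar_year, p.product_category;
```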
Then the workflows need to be chosen: how often does the data warehouse need to be refreshed? Daily? Weekly? By the second? By the minute? In real time? How often do we need to do it to keep the business informed and up to date? A workflow tool such as IBM ClearCase might be used here, along with a scheduler such as Airflow, Control-M or Maestro.
Production and Changes
Once all the above has been achieved in a development environment, the infrastructure needs to be duplicated in a test environment. When testing is finished, it is pushed into a production environment, making sure everything is exactly the same at each stage. This deployment process uses another tool such as IBM ClearQuest or JIRA.
Once the model is deployed, after three to nine months of work to this point, there is the issue of maintenance. Every time a change is needed, because of a new requirement or because something didn't perform as it should have in testing, we start again from the bottom of the staircase and follow the same process:
- What are the requirements?
- Check the source
- Model the source
- Look at the physical infrastructure required
- Are there any changes to table structures?
- Do we need more storage or compute space?
- Do we need more network bandwidth?
- How is the ETL affected?
- Is extra data needed in the pipelines?
- If so, how does this affect the scalability of the flows?
- Move into the test environment
- Move into the production environment
Data Warehouse Development Time
Data warehouses that take this long to build often fall down precisely because they took that long. The requirements might have changed completely in that time as the business evolves. Technology moves on at a rapid pace, so the original design could be outdated or defunct. Then there is the day-to-day maintenance: when a data warehouse has been running for a few years, it develops issues that need to be managed.
Along the foundation of the staircase in Figure 1 is the word governance. This is the idea that the whole process is being tracked, controlled and quantified at management level, traditionally by the CIO but perhaps these days by the CDO or CTO. With all the different tools and teams in the stairway process, it's extremely difficult for a single person, or even a small committee, to have control.
Data Quality in Data Warehouses
At each step on Figure 1 you are changing people and tools. After one team has finished with the code it passes it on to the next, which can often lead to human error and confusion if the completed work and instructions are not explained properly or sufficiently documented. Each team only sees what the last team sends them; they are often not aware of the original requirements and the contextual conversations between the teams that preceded them.
Requirements from the business are given to the framework people. They produce some architecture and pass it over to the modellers, then it is passed on to the builders who do the coding and send it to the testers, and so on. It's likely the new team won't be ready to start working on the project right away, causing further delays. Some of these processes overlap, but it tends to be a linear path, as Figure 1 suggests, with few loops back along the way.
When the stairway methodology was in its heyday in the late 90s and early 00s, offshoring was very popular, as a lot of staff were needed for such a complex project. Companies would send their instructions to offshore countries with cheaper labour, and were often disappointed when the work came back. They would then work out that their requirements or documentation had not been correct. Teams were disparate enough when they were in the same building; add in teams in different countries and different time zones, and the flow of the project became even more disjointed, inefficient and time-consuming.
Agile Software Development
Since the heyday of the traditional approach in the early 00s, the attitude towards software engineering has changed dramatically with the proliferation of the Agile Manifesto:
We are uncovering better ways of developing software by doing it and helping others do it. Through this work we have come to value:
Individuals and interactions over processes and tools
Working software over comprehensive documentation
Customer collaboration over contract negotiation
Responding to change over following a plan
In practice, the elements of this manifesto most relevant to data warehousing have been customer collaboration and the ability to respond to changes in customer requests (the customer in this instance being the business user rather than an external customer). With the scale of data warehousing projects, often automation is the only way to respond at a speed close to that which is now demanded.
Agile Development with Automation
With this in mind, the process in Figure 1 should be largely redundant today, but in reality only a small, albeit growing, percentage of organizations are actually doing Agile development, while many large companies continue with the traditional model.
Over the past couple of years there has been a real shift towards using Data Automation as the backbone of any serious data warehousing project. This is how it works.
Automated Scoping and Modeling
In Figure 2 above, the whole process and all its steps are managed in one toolset/framework and performed by the same team (often the same person). Rather than a staircase, the process is more of an escalator. Using common software means every action is taken according to the same standards and conventions.
Once connected to your data ecosystem, WhereScape quickly scopes all data sources to give a profile of the data available: its quality, location and so on. Prototype architectures can then be spun up using this data in under an hour and shown to the colleague(s) who requested them, so all requirements can be checked and understood before the build starts, with no misunderstandings or misinterpretations. With the traditional approach, the modelling team takes the requirements and the requestor only sees the finished article weeks or months later, sometimes six months or longer, when the project has reached the top of the staircase and changes are difficult to implement.
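Under the hood, scoping a source boils down to profiling queries along the lines of the sketch below. The table and column names are invented, and the exact queries any tool runs will differ by source and platform; the point is simply what "a profile of the data available" looks like.

```sql
-- Quick profile of a source table: volume, key cardinality,
-- null rates and the date range covered.
SELECT COUNT(*)                                       AS row_count,
       COUNT(DISTINCT customer_code)                  AS distinct_customers,
       SUM(CASE WHEN email IS NULL THEN 1 ELSE 0 END) AS missing_emails,
       MIN(order_ts)                                  AS earliest_record,
       MAX(order_ts)                                  AS latest_record
FROM   source.orders_extract;
```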
ELT Code Generation
Once the model has been approved by all parties, the code that physicalizes it is automatically generated in WhereScape’s Integrated Development Environment. Rather than ETL developers having to spend months writing all the code by hand, WhereScape writes thousands of lines of SQL code in seconds and uses ELT instead of ETL, so the transformation work is done by the power of the database, not the intermediary ‘black box’ used by legacy ETL tools. The code is native to your existing platform as WhereScape is database agnostic. It merely sits on top of the infrastructure you already have and makes it perform much faster using metadata.
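To illustrate what "the transformation work is done by the power of the database" means in practice, below is a small hand-written example of an in-database (ELT-style) dimension load using a standard MERGE statement. It is not WhereScape output; the generated code depends on your platform and templates, and the names here are illustrative.

```sql
-- ELT-style dimension load: the database engine itself applies the
-- changes, rather than an external ETL server transforming the data
-- in transit.
MERGE INTO dw.dim_customer AS tgt
USING staging.customers    AS src
   ON tgt.customer_code = src.customer_code
WHEN MATCHED THEN
    UPDATE SET tgt.customer_name = src.customer_name,
               tgt.region        = src.region
WHEN NOT MATCHED THEN
    INSERT (customer_code, customer_name, region)
    VALUES (src.customer_code, src.customer_name, src.region);
```

Because the statement runs inside the target database, the data never has to pass through a separate transformation server on its way to the warehouse.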
Time-consuming elements of the traditional approach, such as storage and index management, mapping and version control, are handled automatically by WhereScape, saving months of intricate work and difficult decision-making that take up a lot of person-hours. The long lists of questions posed earlier are largely irrelevant because WhereScape makes these decisions according to industry best-practice templates. However, it is important to point out that these templates are configurable if required, so developers maintain control throughout. The resulting architecture is tested and pushed into production.
Automated Documentation and Change Management
The whole process above can be instantly documented at the click of a button. This means the whole architecture is described in a common language, and there is no delay in waiting for developers to get round to doing a job that is time-consuming and often left until last or not done at all. If any of the developers were unavailable or had left the company, someone could read the documentation, so the knowledge is not locked away in the minds of a handful of tech leads as may be the case in many companies.
With a modern, automated architecture, the CDO can provide governance for the entire process because they can track everything in WhereScape, with full lineage and automatically generated documentation. Every action taken is tracked and timestamped. As the majority of actions are automated, once the CDO sees that the tool can be trusted, they only really need to keep track of the elements along the stairway that are controlled by human decision-making. The direction and outcome of the project still lie in human hands, of course, but the time-consuming, repetitive tasks are performed by the tool for added reliability and shorter projects.
Data Automation in Warehouses
The real wow factor with WhereScape comes when we do a proof of concept at a prospect company and our techies are able to complete the whole process in Figure 2 in just three to five days. We often say, conservatively, that our tool increases the output of each developer by 5x, but when the process in Figure 1 can take between six months and multiple years, the real figure is often tens or hundreds of times higher. This is why we have such a high success rate when we get to the proof of concept stage and show what the tool can do. These gains in speed are transformative for data teams who have been working under the traditional methodology; they change working lives, not just project completion dates.
Once the automated building process is complete, changes are much faster to commit, as the tool can either pinpoint and tweak the relevant section of metadata and make all the necessary upstream and downstream alterations to avoid unexpected repercussions, or simply rebuild the element you want to change from scratch, given the minimal time and effort required to do so. Changes and bug fixes can now be completed in two hours rather than the days, weeks or months required when multiple teams have used multiple tools to shape the code over a long period of time.
Future-Proofing Data Architecture
WhereScape is metadata-driven and version managed. This means there is a reliable record of the whole architecture and its processes kept in a repository at all times, and so all this can be more easily migrated onto a different platform, Cloud or otherwise, without the need to replicate all the work you have already done.
Many companies use the adoption of WhereScape as a good excuse to switch to a Cloud database they have been wanting to use for some time, because WhereScape can generate the huge amount of code needed to migrate existing data and data infrastructure from one platform to another.
Others opt to automate their existing architecture, safe in the knowledge that if they choose to switch in the future, they can do so easily and are not locked into any provider. This means that the choice of database is no longer a ten-year decision as it used to be. In fact, WhereScape customers such as Legal & General opt to use WhereScape as a common data warehousing build and management tool, and allow each of their departments to work on the database that suits their needs best.
Once organizations realize the incredible savings in time and resources that are possible, they quickly recognize the possibilities available to data teams using WhereScape in the age of automation.