Data Warehouse Automation vs Data Architecture Automation 

Data Architecture

Many organizations are faced with the challenge of how to adapt to the changing role of data. Data is increasingly considered a key asset that can improve and optimize business  processes and decision-making processes. This changing role leads to new requirements for IT systems.  Additionally, innovative new data processing and storage technologies, such as fast analytical SQL  database servers, object storage technology, and parallel messaging technology, are creating new  business opportunities. Nowadays, systems can be developed that were almost unimaginable or too  expensive to develop years ago.  

To support all these new requirements and to leverage these new technologies, organizations are  rethinking and redeveloping the data architectures of their current data platforms. It is not an option to  develop new data platforms with architectures that will be as outdated within a few years as current  systems are today. They need data platforms with sustainable data architectures.  

Sustainable Data Architecture

Data platforms with sustainable  data architectures enable organizations to cope with the changing role of  data and to leverage new and future technologies. A sustainable data  architecture is not one that only supports current data requirements and all  newly identified requirements, but one that is adaptable and extensible enough to support (as yet unknown) future requirements. When data consumption requirements change,  the data architecture must be easy to adapt without the need for major reengineering and redevelopment  exercises. Sustainable data architectures are as nimble as animals that have survived millions of years by  adapting to their changing environment.  

Requirements for Sustainable Data Architecture

The key requirements for sustainable data architectures are:  

  • Definition independent  
  • Technology independent  
  • Runtime platform independent  
  • Distribution independent  
  • Architecture independent  
  • Design technique independent  
  • Metadata independent  
  • History independent  

The Need for Code Generators 

Developing data platforms with sustainable data architectures using low-level development tools is technically possible, but  time-consuming and error-prone. A more efficient approach to accelerate  development and meet the above requirements is to use code generators (generators for short). Most of the development work to create IT systems is repetitive and formalizable,  making it perfectly suited for the use of generators.  

IT departments have been using generators for years. For example, ETL tools have been used to  generate code to copy data from one database to another, BI tools to create SQL statements to access  databases, and data modeling tools to generate database structures. The general benefits of generators  are: improved flexibility, productivity, maintenance, performance, quality and integrated support for  metadata.  

Data Warehouse Automation

Most generators generate only one component of an entire data architecture  and not the entire architecture. A class of generators that generate entire data architectures are data warehouse automation tools. The current state is that many of them are  focused on developing data platforms that support traditional forms of  data consumption, such as a classic data warehouse architecture consisting  of a staging area, an enterprise data warehouse, and several physical data  marts that support traditional forms of reporting and dashboarding. Some forms of data consumption require other data architectures, such as data lakes and data hubs.  

A sustainable data architecture must be able to support a wide range of data consumption forms  and must be able to quickly adapt to new forms.  

Developing data platforms with sustainable data architectures requires generators capable of  creating, adapting, and extending all kinds of data architectures. Such generators are called data architecture automation tools.  

WhereScape

This whitepaper describes how the WhereScape data architecture automation tool is able  to generate sustainable data architectures. It gives a brief overview of the product and provides a short  introduction to how it works with model conversion rules and templates to meet the above requirements.  

Sustainable Data Architectures  

New Data Requirements

 Almost every organization is trying to adapt to the changing role of data. Data is  increasingly seen as a key asset that can improve and optimize business processes and decision-making  processes and that allows organizations to expand into new markets. This changed role has led to new  requirements for IT systems. Data platforms must process larger volumes of data because more data is  being produced, but also the requirements for data consumption are changing and broadening. Providing  the organization with simple dashboards and reports that show yesterday’s data is no longer sufficient.  New forms of data consumption need to be supported, including data science, real-time dashboards, and  self-service BI, and more external parties, such as customers, patients, suppliers, and legislators, must be  able to access data. Additionally, requirements for data security and especially data privacy have become  more complex.  

But it is not only the customer pull that leads to new requirements; it is also the technology push. Many new powerful technologies have become available, such as fast solutions for data storage, processing, and analytics, cloud platforms with virtually unlimited resources and flexible fee structures, and new tools that simplify and accelerate analytics. All these new technologies are not there for technology’s sake; they make it possible to improve processes or change business models. New technologies create new opportunities.

Data Architecture vs. Data Platform

This section briefly explains two terms frequently used in this whitepaper: data architecture and data platform. Wikipedia defines data architecture as follows: “A data architecture is composed of models, policies, rules or standards that govern which data is collected, and how it is stored, arranged, integrated, and put to use in data systems and in organizations.” Most architects will more or less agree with this definition. More practically, it is a document that describes how an organization produces, stores, processes, secures, consumes, exchanges, and deletes data. It describes a data platform. A data platform is the implementation of a data architecture. It is the entire set of IT systems and data stores involved in producing and consuming data. Data warehouse systems and data lake-based systems are examples of data platforms.

Adaptable Data Platforms 

To support new data requirements and to benefit from and exploit  new technologies, organizations are rethinking and redeveloping the data architectures of their current  data platforms. The architectures of new data platforms should not be developed in the traditional way,  as they only support existing and known future requirements. The resulting data platforms may be  outdated even before they are completed and operational.  

It is important that new data platforms are developed in such a way that they can be changed and  extended quickly and easily when requirements for data production or data consumption change (again).  Nothing in their architectures should be fixed and impractical to change. For example, when new data  storage technologies can improve performance or reduce storage costs, the platform must be able to  adopt them easily; when data must be migrated to a more scalable cloud platform, it should not require a  major redevelopment of the entire platform; stricter data privacy rules must be easy to implement; and  when customers need live access to zero-latency data, the data platform must be able to provide it with  minimal effort.  

Note that in real-life situations, most changes are very repetitive by nature. For example, changing the length of a field, replacing one SQL database server with another, and changing a data structure from a snowflake to a star schema pattern are all very repetitive tasks. Such tasks are ideal for automation.

Sustainable Data Architectures

It is time for data platforms with sustainable data architectures. A sustainable data architecture is not one that only supports current data requirements and all newly identified requirements, but one that is adaptable and extensible enough to support (as yet unknown) future requirements. The latter is important to deal with completely unexpected and unplanned changes that lead to new requirements. For example, an organization may acquire another company that operates in a different industry with completely different data requirements, or urgent market developments caused by business opportunities, disasters, or new aggressive competitors may lead to new data requirements.

A sustainable data architecture is one that can survive for a long time, because it is easy to adapt  and extend. A sustainable data architecture enables an organization to quickly implement current, known  upcoming and still unknown future data requirements; see Figure 1. When data consumption  requirements change, the data architecture adapts accordingly without the need for major reengineering  and redevelopment exercises. Sustainable data architectures are nimble.  

Requirements for Sustainable Data Architectures

This chapter describes eight key requirements for sustainable data architectures; see also Figure 2.  

Definition Independent

The definitions of objects, such as tables, fields and keys, can change over time. For example, the data type, uniqueness, integrity rules, maximum length, and optionality of a field can change; fields may have to be added to or dropped from a table; and tables may need to be decomposed into two tables. If such changes occur, databases, ETL programs, reports, and dashboards must be changed accordingly.

If a field occurs in one table of one database, the amount of work involved may be limited. Some  fields, however, occur in numerous tables throughout a data platform. For example, the field customer-id  may appear in several tables in the staging area, in the enterprise data warehouse, and in several physical  data marts. Changing the characteristics of such a field may lead to changes of many columns throughout  all the databases of a data platform and to changes of several ETL programs. Although the change to be  made is very similar wherever the field occurs and thus very repetitive, implementing all these changes is  time consuming and, more importantly, error-prone.  

The same is true when the definitions of tables change. Adding columns and decomposing tables is also very repetitive work, and just like changing fields, it is time-consuming and error-prone. A sustainable data architecture should simplify the work involved in changing the definitions of fields, tables, and other objects. Ideally, if the definition of an object changes, the change is automatically dispersed across all its implementations. This improves productivity and time-to-market, and prevents incorrect or incomplete changes from being implemented.
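To make this principle concrete, the minimal sketch below (written in Python purely for illustration; the table and field names are hypothetical and this is not how any specific product stores its definitions) stores the definition of customer-id once, lets several tables reference it, and regenerates the ALTER TABLE statements for all affected tables after the definition changes. The exact ALTER COLUMN syntax differs per SQL dialect; the point is the single point of change.

    # Sketch: one central field definition, many tables referencing it.
    field_definitions = {
        "customer_id": {"type": "VARCHAR", "length": 20},
    }

    tables = {
        "staging.customer": ["customer_id", "name"],
        "dwh.customer": ["customer_id", "name", "valid_from"],
        "mart.sales": ["sale_id", "customer_id", "amount"],
    }

    def generate_alter_statements(field_name):
        # Regenerate an ALTER TABLE statement for every table that uses the field.
        fdef = field_definitions[field_name]
        column_type = f"{fdef['type']}({fdef['length']})"
        for table, columns in tables.items():
            if field_name in columns:
                yield f"ALTER TABLE {table} ALTER COLUMN {field_name} {column_type};"

    # The definition changes in one place ...
    field_definitions["customer_id"]["length"] = 36

    # ... and the change is dispersed across all implementations automatically.
    for statement in generate_alter_statements("customer_id"):
        print(statement)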

Technology Independent Architecture

A sustainable data architecture must make a data platform technology independent. This applies to various kinds of technologies, including database servers, messaging buses, and ETL solutions. Technology independence simplifies the adoption of and migration to faster, better, or cheaper new and future technologies, and it allows organizations to reap their benefits.

For example, replacing a SQL database server with another in a data platform usually requires many SQL statements to be translated into similar statements supported by the other database server. Although this is a very repetitive task, it requires a lot of manual work and is error-prone. In a data platform with a sustainable data architecture, it must be almost as easy as replacing a laptop with a new one. In other words, a sustainable data architecture supports plug and play with regard to technologies.

Runtime Platform Independent

It must be easy to replace the runtime platform on which a data platform runs. For example, a sustainable architecture supports easy migration of all or part of a data platform from on premises to a cloud platform (hybrid cloud). In addition, it must be possible to run different parts of one data platform on different cloud platforms from different vendors (multi-cloud). A sustainable data architecture can be distributed across multiple runtime platforms and is designed to simplify changes to the current setup.

Distribution Independent

A sustainable data architecture enables logically related data to be stored in different databases. For example, the data of a data warehouse or data lake can be distributed across multiple physical databases. Distributing the data across multiple databases can improve performance, and it allows data to be stored close to geographically distant user groups.

Another reason may be to use different storage technologies to store different types of data. For example, photos and videos are stored in object storage technology; structured sensor data is stored in an analytical SQL database server; frequently used structured data is stored in an in-memory database server; and ‘cold’ data is stored in a traditional SQL database server. There is no storage technology that is perfect for every type of data and for every form of data consumption. Heterogeneity of data storage technology has become the new normal.

Architecture Independent

Most data architectures are based on a specific  reference data architecture, such as the logical data warehouse  architecture, the classic data warehouse architecture (also known as the  corporate information factory), the data lake architecture, and the data  mesh. Such a reference architecture may be the best at any given time, but  it can change when data production increases, new forms of data  consumption are introduced, or when data latency requirements change.  

A sustainable data architecture is an architecture that enables organizations to switch to another  (reference) data architecture. For example, an architecture with physical data marts may be the ideal  solution today to support dashboards and self-service BI, but virtual solutions may be the right solution  tomorrow. Or, a data warehouse needs to be replaced by a centralized data lake. A sustainable data  architecture is adaptable at this architectural level.  

Design Technique Independent 

Some components of a data architecture can be designed with different techniques. For example, an enterprise data warehouse can be designed using data vault or normalization techniques, a physical data mart can be designed as a star schema or a snowflake schema, and a data lake can be designed using one or more tiers or zones. Another example relates to data copying and data transformation: an ETL or an ELT approach can be used, and data integration can take place before the data is loaded into the data warehouse or when it is loaded into the physical data marts. These are all examples of architectural design techniques. A sustainable data architecture makes it possible to easily change design techniques when new requirements dictate.
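As an illustration of design technique independence, the sketch below (Python, with hypothetical entity and column names, and a deliberate simplification of what a real automation tool does) generates the same logical model either as a snowflake schema, with a separate parent dimension, or as a star schema, with the parent’s attributes folded into the child dimension.

    # Sketch: one logical model, two design techniques.
    logical_model = {
        "product": {"attributes": ["product_id", "product_name"], "parent": "category"},
        "category": {"attributes": ["category_id", "category_name"], "parent": None},
    }

    def generate_dimensions(model, technique):
        # Entities that act as a parent are folded into their children in a star schema.
        parents = {spec["parent"] for spec in model.values() if spec["parent"]}
        ddl = []
        for entity, spec in model.items():
            if technique == "star" and entity in parents:
                continue
            columns = list(spec["attributes"])
            if spec["parent"]:
                if technique == "star":
                    columns += model[spec["parent"]]["attributes"]  # denormalize into one dimension
                else:
                    columns.append(model[spec["parent"]]["attributes"][0])  # keep a foreign key
            ddl.append(f"CREATE TABLE dim_{entity} ({', '.join(columns)});")
        return ddl

    for technique in ("snowflake", "star"):
        print(f"-- {technique} design")
        for statement in generate_dimensions(logical_model, technique):
            print(statement)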

Metadata Independent

Many technologies store technical and/or business metadata. This metadata must  be accessible to IT specialists and business users. The latter group is primarily interested in business  metadata. All the metadata must remain available to both groups in a sustainable data architecture, even  if technologies for capturing, storing, and exposing metadata change.  

History Independent

Many different approaches exist to keep track of historical data. For example, all the historical data can be kept in a separate database, or an approach could be used that is similar to how data vault organizes historical data. A sustainable data architecture should enable organizations to switch approaches when new requirements demand it.

Code Generators

ETL Tools and Code Generators 

Different approaches exist to develop data platforms with sustainable data architectures. The approach described in this whitepaper is based on code generators (generators for short). Generators transform higher-level specifications into lower-level specifications that are processed by specific technologies, such as compilers, database servers, messaging products, or ETL tools. If changes need to be made, the high-level specifications are changed and all or part of the code is regenerated. All the repetitive work is handled by the generator. In general, working with generators accelerates development (shorter time-to-market), is less error-prone (because generators do not make mistakes), and allows new requirements to be implemented faster.

Data Automation 

Generators automate tasks. Automation is not a new concept in the IT industry. In fact, IT is and has always been about automating tasks. IT systems have been developed to fully or partially automate administrative, industrial, and technical processes. For example, no organization does its accounting completely by hand anymore, cars are no longer built by hand, and ordered products are no longer picked from warehouses manually, but by robots. Nowadays, many products and services would be too expensive if automation were not applied.

Even in the early days, IT was used to speed up tasks or make them more cost effective. For example, to speed up the 1890 US census, Herman Hollerith was asked to create a machine that could speed up the counting of the American population. The previous census of 1880 had been done manually and had taken eight years to complete. By automating the process, a ten-year job was reduced to a few months. Another example from long ago was the machine created by Alan Turing and his colleagues, who worked for the Government Code and Cypher School at Bletchley Park in the United Kingdom during the Second World War. This machine was designed to speed up code breaking, which it did successfully by automating a highly repetitive process.

The first traditional IT systems were also created to speed up accounting, inventory management,  finance, and so on. Nowadays, robot arms are deployed in car factories to automate repetitive manual  work. IT is all about automation.  

Tasks Suited for Automation

Not every task is suitable for automation, nor can suitable generators be developed for every task. The tasks best suited for automation are repetitive by nature and can be expressed as formal algorithms. The above examples are all repetitive, such as counting a census, breaking ciphers, and fitting windshields in a car factory. Many administrative tasks are also repetitive, such as entering new orders and calculating net wages. All these processes can be formally specified as algorithms that indicate which steps to perform, what to do in special cases, and how to react when something goes wrong. In other words, these tasks can be formalized.

Note that artificial intelligence and machine learning make it easier to automate less repetitive and less formalized tasks, such as discovering cracks in dikes by analyzing photos taken by drones. This increases the potential set of tasks suited for automation.

Automation of IT

As indicated, IT systems are developed to automate tasks, but automation can also be  applied to the development of IT systems. In other words, IT tasks such as software design, development,  and testing can be automated as well. Especially when they are highly repetitive and formalizable,  generators can be developed.  

The first generators were probably the compilers for programming languages such as COBOL and  Fortran. They were introduced to accelerate development. Rather than having to program in low-level  assembler languages, more high-level languages could be used. One language construct in COBOL would  replace numerous statements in assembler. In other words, repetitive forms of programming were  replaced by simpler constructs. The same was true with the introduction of more modern languages, such  as Java and C#. And SQL has replaced many statements in older database languages.  

Currently, generators exist for all kinds of IT tasks. For example, many generators are available for the development of ETL programs and database structures. ETL tools that transform high-level specifications into lower-level code that does the actual ETL work have been widely accepted. Many advanced BI tools can also be considered generators, because business developers use them to develop their KPIs and dashboards using drag-and-drop features. These tools generate SQL statements or whatever is required to pull the data from databases. Some data science tools also allow scientists to work at a high conceptual level from which code is generated. And the list goes on.

If all these generators had not been developed, all programming would still be done in assembler, or worse, in machine code. And business users would not even be able to develop their own dashboards. Note that some generators are technically not compilers but interpreters; both forms, however, generate code and replace repetitive tasks. Whatever the implementation of the generator, compiler or interpreter, the goal is the same: improving productivity.

Benefits of Data Automation  

As indicated, the approach described in this whitepaper to develop data platforms with sustainable data  architectures is based on generators that transform high-level specifications to lower-level specifications.  This chapter lists the benefits of generators and especially generators for data architectures.  

Development Acceleration

The primary benefit of automation is development acceleration. A generator often hides many technical details of the technologies, allowing developers to focus on the functionality that the data platform must support. A generator also prevents the wheel from being reinvented over and over by human developers as they try to develop solutions to similar problems.

Run-time Platform Independence

Generators create an abstraction layer. Generating code makes the  original specifications independent of the run-time platform. A run-time platform consists of an operating  system, database server, security system, and so on. Due to the run-time platform independence, the  data platform can be generated for other platforms, simplifying a migration; see Figure 3.  

Maintenance Improvement

No matter how perfect IT systems are, eventually they have to be changed to support new requirements. Because many technical details are hidden by the generator, there are fewer lines of code or specifications (depending on the system), which makes it easier to implement changes. To take the analogy of assembler languages again, changing code in a C++ program takes much less time than making a comparable change to the assembler equivalent.

Performance Improvement

Generators generate generic code. Some think that generic code is by definition less efficient and slower than handwritten code. Whether this is true depends on the quality of the generator and of the manual developer. For example, if a generator always generates standard SQL code regardless of the target database server, it may not exploit all the proprietary, performance-enhancing features that a database server supports, nor generate the most efficient SQL formulations. Good generators generate specific code for database servers and the most efficient formulation for each target platform.

In fact, the efficiency of handwritten code depends entirely on the expertise of the developers and on whether they are given the time to create the most efficient code. The number of technologies for which they can develop efficient handwritten code is also limited.

Developers of generators are given the time to make sure that their product generates the most efficient code. This generated code is used by countless organizations that return comments on how it can be improved. Developers of generators automatically become super experts, which is reflected in the generated code. With each new version of the generator, the generated code can become more efficient. So, generated code can definitely be more efficient than handwritten code.

Quality Improvement

The correctness of generated code depends on the quality of the generator. But if the generator produces correct code for a specific type of object, it will always produce correct code for that type. It does not make sudden mistakes by forgetting something or generating the wrong code. It is much harder for human developers to always produce correct code.

Metadata Improvement

Correct generators can also generate the right metadata. For example, because  some generate the entire architecture, they also know how to present a full picture of how all the data is  connected. Users can see where the data comes from and how it has been processed and transformed. In  other words, a complete and up-to-date lineage or impact analysis can be generated.  

Developer Independence

As indicated, the quality and efficiency of non-generated code is dependent on the  quality of the developer. This approach has certain drawbacks. First, the scalability for development  productivity is low, because developers can only work a limited number of hours per day. To increase  productivity, more developers must be hired. Second, new developers may need to build up experience  before they can write efficient code. Third, when developers change jobs, their organizations lose  expertise. All these drawbacks do not apply to generators.  

Vendor Independence

The ability to easily migrate to other technologies makes a data architecture less  dependent on the vendor of a technology. Switching to alternative technologies can be limited to a simple  regeneration. A vendor dependency that remains is the dependency on the generator itself.  

Data Warehouses and Data Architectures  

All generator examples described in Chapter 4 produce one component of an entire data platform, such as an application or database. For example, an ETL tool generates ETL programs, BI tools generate SQL statements, and data modeling tools generate data structures. Multiple, independent generators are required to generate an entire data platform. Since the generators require similar specifications, those specifications are defined multiple times; in other words, they are duplicated; see Figure 4. It is a challenge to keep all those specifications consistent, to make sure they work together optimally, and to guarantee that if one specification is changed, all the duplicate specifications are changed accordingly.

Generating Data Warehouse Architectures

Many of the tasks involved in designing and developing data architectures are quite repetitive and formalizable. That makes these tasks suitable for automation. For example, when an enterprise data warehouse uses a data vault design technique and the physical data marts use star schemas, both can be generated from a central data model, including the ETL code to copy the data from the warehouse to the data marts.
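The sketch below illustrates that principle. It is a strongly simplified, hypothetical example (Python; not the internal model of any specific tool, and the table, column, and data type choices are assumptions): one central entity definition produces a data vault hub and satellite for the warehouse, a star-schema dimension for the data mart, and the INSERT ... SELECT statement that copies the data between them.

    # Sketch: one central entity definition drives warehouse, mart, and copy logic.
    entity = {
        "name": "customer",
        "business_key": "customer_id",
        "attributes": ["name", "city", "segment"],
    }

    def data_vault_ddl(e):
        # Hub holds the business key, satellite holds the descriptive attributes.
        hub = (f"CREATE TABLE hub_{e['name']} (hub_key INT, {e['business_key']} VARCHAR(36), "
               "load_dts TIMESTAMP, record_source VARCHAR(50));")
        sat = (f"CREATE TABLE sat_{e['name']} (hub_key INT, load_dts TIMESTAMP, "
               + ", ".join(f"{a} VARCHAR(100)" for a in e["attributes"]) + ");")
        return [hub, sat]

    def star_schema_ddl(e):
        columns = ", ".join([f"{e['name']}_key INT", f"{e['business_key']} VARCHAR(36)"]
                            + [f"{a} VARCHAR(100)" for a in e["attributes"]])
        return [f"CREATE TABLE dim_{e['name']} ({columns});"]

    def warehouse_to_mart_etl(e):
        # Generated copy logic from the data vault structure into the dimension.
        select_list = ", ".join(f"s.{a}" for a in e["attributes"])
        return (f"INSERT INTO dim_{e['name']} SELECT h.hub_key, h.{e['business_key']}, {select_list} "
                f"FROM hub_{e['name']} h JOIN sat_{e['name']} s ON s.hub_key = h.hub_key;")

    for statement in data_vault_ddl(entity) + star_schema_ddl(entity) + [warehouse_to_mart_etl(entity)]:
        print(statement)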

The principles that apply to generators of individual platform components can also be applied to generators of entire data platforms. Currently, the category of generators operating at the architectural level is that of the data warehouse automation tools. They do not generate code for one component of the architecture, but for several. Traditional data warehouse automation tools generate, for example, staging areas, enterprise data warehouses, physical data marts, the ETL solutions that copy data from one database to another, and metadata. Several of these tools have been on the market for many years and have proven their worth.

Generating Sustainable Data Architectures

A disadvantage of several data warehouse automation tools is that  they only generate data platforms with traditional data warehouse architectures. Such platforms are only  suitable for a restricted set of data consumption forms. In other words, they are single-purpose data  platforms.  

To develop truly sustainable data architectures, generators are  required that can generate other data architectures in addition to the more  traditional data warehouse architectures, such as data lake and data hub  architectures. Such generators can be used to generate data platforms for other forms of data consumption than those supported by data warehouse architectures.  Note: The term data warehouse automation is probably a misnomer for such tools. Data  architecture automation tool might be more appropriate.  

WhereScape

WhereScape is a data architecture automation tool that has been on the market since 2002. The product supports two approaches to data warehouse automation: data-driven and model-driven automation. These two approaches roughly correspond to the two products that make up WhereScape: WhereScape 3D and WhereScape RED. The products augment each other and can also be used separately. In this whitepaper the two products are considered one product and are simply referred to as WhereScape.

Model-Driven Automation

With the model-driven approach, a neutral data model, consisting of entities, attributes, and relationships, is defined and stored in the WhereScape repository. These models can be automatically transformed into comparable models based on different design techniques. For example, a normalized model can be transformed into a data vault model. Since WhereScape knows the source and target model, it can also generate the logic to copy the data from one model to another.

Model Conversion Rules

How a model is transformed is described in detail in so-called model conversion rules; see Figure 5. These rules define in detail how one data model is transformed into another. The model conversion rules can be changed and adapted to the needs of an organization. For example, if an organization uses its own, slightly adapted version of data vault, it only needs to change the model conversion rules once. The generator applies these changed model conversion rules whenever a new data model needs to be transformed. This guarantees that the change is executed consistently throughout the model. Work that would have been highly repetitive, time-consuming, and error-prone is now handled automatically and error-free by the generator.

New model conversion rules can be added. For example, model conversion rules can be defined  to transform a normalized model to a data model for a data hub.  
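A minimal sketch of that idea is shown below, assuming hypothetical rule and entity names (Python for illustration; WhereScape’s actual rule definitions look different): each design technique is represented by a rule set that maps constructs in the neutral model to target objects, and supporting another architecture is a matter of registering one more rule set.

    def to_data_vault(entity):
        # One possible rule set: entity -> hub + satellite, relationship -> link.
        yield f"hub_{entity['name']}"
        yield f"sat_{entity['name']}"
        for related in entity.get("relationships", []):
            yield f"lnk_{entity['name']}_{related}"

    def to_star_schema(entity):
        # Another rule set applied to the same neutral model.
        yield f"dim_{entity['name']}"

    # The registry of model conversion rule sets; a new target architecture
    # (for example a data hub) would be added by registering another rule set.
    rule_sets = {
        "data_vault": to_data_vault,
        "star_schema": to_star_schema,
    }

    customer = {"name": "customer", "relationships": ["order"]}

    for target, rules in rule_sets.items():
        print(target, "->", list(rules(customer)))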

Templates

WhereScape also uses templates when it generates real code. The templates specify in detail the code to be generated; see Figure 6. WhereScape supports different types of templates, including DDL (Data Definition Language), DML (Data Manipulation Language), and load templates.

Templates are designed for specific data platforms. For example, there will be a DDL template to generate a CREATE TABLE statement for Oracle and another to generate the MySQL equivalent. These templates are slightly different because SQL products use slightly different dialects. Some data platforms also support proprietary features that can improve performance or minimize storage. Where relevant, the templates exploit those proprietary features; if they do not yet, they can be changed to use them.
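The fragment below sketches the idea of dialect-specific DDL templates. It is illustrative only: WhereScape uses its own template language, whereas this sketch simply uses Python string formatting, and the Oracle and MySQL clauses shown are merely examples of dialect differences.

    ddl_templates = {
        # Hypothetical Oracle flavor with a tablespace clause.
        "oracle": "CREATE TABLE {schema}.{table} ({columns}) TABLESPACE {tablespace};",
        # Hypothetical MySQL flavor with a storage engine clause.
        "mysql": "CREATE TABLE {schema}.{table} ({columns}) ENGINE=InnoDB;",
    }

    table_definition = {
        "schema": "dwh",
        "table": "customer",
        "columns": "customer_id VARCHAR(36), name VARCHAR(100)",
        "tablespace": "dwh_data",
    }

    # The same table definition, rendered once per target platform.
    for platform, template in ddl_templates.items():
        print(f"-- {platform}")
        print(template.format(**table_definition))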

DML templates are used to generate code to insert, update, and delete data, and are used, for  example, to generate code that moves data from one model to another. They are also used when changes  need to be implemented. For example, splitting a table into two tables may require multiple statements  to move data from the old table to the two new tables. As with DDL templates, DML templates also take  into account the dialect differences and proprietary features of data platforms.  
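As an illustration of how repetitive such DML is, the sketch below (hypothetical table and column names, simplified SQL) derives the statements for splitting one customer table into a core table and an address table from a single description of the split.

    split = {
        "source": "customer",
        "targets": {
            "customer_core": ["customer_id", "name", "segment"],
            "customer_address": ["customer_id", "street", "city", "country"],
        },
    }

    def generate_split_dml(split_spec):
        # One INSERT ... SELECT per target table, derived from the split description.
        for target, columns in split_spec["targets"].items():
            column_list = ", ".join(columns)
            yield (f"INSERT INTO {target} ({column_list}) "
                   f"SELECT {column_list} FROM {split_spec['source']};")

    for statement in generate_split_dml(split):
        print(statement)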

Load templates, as the name suggests, specify the statements to load data into a specific database  or file. These generated statements may be specified in the language of the loading tool belonging to a  database server or target cloud platform.  

Data-Driven Automation

The data-driven approach starts with defining the model of a source database  within WhereScape. Next, developers specify how that model needs to be transformed into a target  model. WhereScape uses these specifications to generate the code required to copy data from the source  to the target database. With this data-driven approach, templates are also used to generate real code to  create and change databases.  

The two approaches do not work in isolation, but augment each other. Models and transformations defined with the model-driven approach can be imported into the data-driven approach. Figure 7 contains a screenshot of a data model created with WhereScape.

Summary

The generator provided by WhereScape uses model conversion rules and templates to transform models and generate databases. The product is not a fixed generator: the model conversion rules and templates can be adapted to the requirements of an organization. When needed, developers can also add their own code, which will be used every time the generator runs.

Sustainable Data Architectures with WhereScape  

This chapter describes how WhereScape supports the requirements for generating sustainable data architectures as described in Chapter 4.  

Definition Independent

WhereScape stores the definitions of fields, tables and relationships in its  repository. When these definitions change, code is regenerated and redeployed to reflect those changes  within the real data platform in which the objects are implemented. Depending on the change, DDL  statements, such as ALTER TABLE and DROP TABLE, are generated, as well as DML and Load statements,  because data must be unloaded and reloaded, or data must be moved from one table to another. In other  words, many templates may be used to generate all the right statements to implement the change.  

It is important to understand that almost every type of change ever made to a data model has been made countless times. Without a generator, this would result in repetitive, time-consuming, and error-prone work. WhereScape automates this entire process and performs this repetitive work automatically and without errors.

Technology Independent

WhereScape offers technology independence in  many different ways, but primarily through its templates. Technology is  ‘hidden’ in the templates. For example, if a table needs to be moved to a  different database technology, it is purely a matter of regenerating with a  different set of templates and WhereScape creates all the new tables and indexes. Moreover, it also generates the code to copy the data to the new database technology.  Besides the migration, if organizations want to change how certain aspects of tables or fields are  generated for the new technology, they just need to change the template. Only typical DBA procedures,  such as those for backup and recovery and database server tuning, require manual changes.  

After this generator and migration process, the data platform continues to operate in exactly the  same way and keeps producing the same results.  

Runtime Platform Independent

In the same way that a database server can be replaced, runtime platforms can be replaced. If an organization wants to migrate their on-premises database to a cloud platform, only  some minor template changes are required to cater for the specifics of the cloud platform. If they want to  exploit some of the new features that are only available for the database server on the cloud platform,  then this can also be done through the templates. 

 WhereScape makes a generated data platform highly runtime platform independent. It enables  organizations to easily switch from a fully on-premises solution to a cloud-based solution (multi-cloud or  hybrid-cloud).  

Distribution Independent

Metadata can be defined for all the objects defined within WhereScape. Organizations can introduce their own types of metadata tags. For example, a metadata tag can be introduced that indicates, as a percentage, the estimated quality level of the data in each table. These metadata tags can be used in templates to influence the generated code.

If an organization wants to distribute the tables of a data warehouse across two physical databases using different technologies, it can use a specific metadata tag. For example, a metadata tag can be introduced that indicates for each table in which database it needs to be implemented. Each time a table is generated, the tag affects the code generated by the template. This enables the development of a distributed architecture where, for example, part of a database runs on premises using a specific database server and the other part runs on a cloud platform using another database server.
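The sketch below illustrates the mechanism with hypothetical tag values and connection names (Python, simplified): a target_database tag on each table determines for which connection, and therefore with which schema (and, in a real tool, which dialect templates), the DDL is generated.

    tables = {
        "sensor_readings": {"columns": "reading_id INT, value DECIMAL(10,2)",
                            "tags": {"target_database": "cloud_analytical"}},
        "order_archive": {"columns": "order_id INT, order_date DATE",
                          "tags": {"target_database": "onprem_sql"}},
    }

    # Which schema belongs to each target connection (hypothetical names).
    target_schemas = {
        "cloud_analytical": "cloud_dwh",
        "onprem_sql": "local_dwh",
    }

    for name, spec in tables.items():
        target = spec["tags"]["target_database"]
        schema = target_schemas[target]
        print(f"-- generated for connection '{target}'")
        print(f"CREATE TABLE {schema}.{name} ({spec['columns']});")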

Architecture Independent

New technologies can enable organizations to develop simpler data architectures  or data architectures that can support many forms of data consumption. This requires that the current  data architecture must be easy to change. For example, physical data marts may have to be replaced by  virtual ones, or the staging area and enterprise data warehouse need to be merged into one physical  database. The latter change could also mean that the current ETL process that physically copies data from  the staging area to the data warehouse is replaced by an ELT process running entirely within the database  server that manages both databases.  

Implementing such changes with WhereScape is merely a matter of  changing the relevant model conversion rules and templates. For example,  if physical data marts need to be replaced by virtual ones, instead of  generating CREATE TABLE and INSERT statements, WhereScape generates  CREATE VIEW statements to develop a virtual data mart. The data models  themselves remain unchanged.  
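A simplified sketch of that switch is shown below (hypothetical mart name and query; in the product the choice is driven by model conversion rules and templates rather than a Python flag): the same mart definition is emitted either as a physical table that must be loaded, or as a view on the data warehouse.

    mart_object = {
        "name": "sales_by_region",
        "query": "SELECT region, SUM(amount) AS total FROM dwh.sales GROUP BY region",
    }

    def generate_mart(mart, physical):
        if physical:
            # Physical data mart: a table plus generated load logic.
            return [f"CREATE TABLE mart.{mart['name']} AS {mart['query']};",
                    f"-- plus a generated job that refreshes mart.{mart['name']} periodically"]
        # Virtual data mart: a view, no data is copied.
        return [f"CREATE VIEW mart.{mart['name']} AS {mart['query']};"]

    for physical in (True, False):
        print("\n".join(generate_mart(mart_object, physical)))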


Design Technique Independent

For each database in a data platform, a specific design technique for table structures is applied, such as normalization, data vault, anchor modeling, star schema, or snowflake. Organizations may also have invented their own design techniques. All of these techniques have their pros and cons, and they may have been selected for a specific database server because of the forms of data consumption. If new forms of data usage need to be supported or if new data storage requirements apply, it may be necessary to change the design technique. For example, it may be necessary to switch from a normalized to a data vault data structure or from a snowflake to a star schema structure.

In WhereScape, switching to another design technique involves changing the model conversion rules and templates, followed by a regeneration process.

Organizations can adapt the design technique. For example, in a standard data vault, each hub table has specific columns, such as the source of the data and the start and end time. If an organization wants to add geo-spatial columns to each hub table, this just needs to be defined in the model conversion rules and templates. This only needs to be done once, and the generator applies those rules and templates consistently every time a hub table is generated.
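A minimal sketch of that single change (with hypothetical column names and data types): the list of standard hub columns is defined once, extended once with the geo-spatial columns, and every hub table generated afterwards includes them.

    standard_hub_columns = [
        "hub_key INT",
        "load_dts TIMESTAMP",
        "record_source VARCHAR(50)",
    ]

    # The one-time change to the rules/templates ...
    standard_hub_columns += ["latitude DECIMAL(9,6)", "longitude DECIMAL(9,6)"]

    def generate_hub(entity_name, business_key):
        columns = ", ".join([f"{business_key} VARCHAR(36)"] + standard_hub_columns)
        return f"CREATE TABLE hub_{entity_name} ({columns});"

    # ... is applied consistently to every hub table that is generated.
    for name in ("customer", "product", "store"):
        print(generate_hub(name, f"{name}_id"))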

Metadata Independent

Metadata can be defined within WhereScape. This can be technical and business metadata. As indicated, organizations can define their own metadata tags. Metadata can affect the generation process, and it can be exported for import into other tools.

History Independent

There are many ways to store historical data in databases, such as the one used by  data vault. If this needs to be changed, only the conversion rules and templates must be changed and the  generator process restarted. This could mean regenerating the entire database. This may take a lot of  processing time, but it leads to a consistent and error-free result. It also takes a lot less time than  manually making those changes for each table individually.  
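The sketch below contrasts two such approaches with hypothetical table layouts: a separate history table with validity timestamps versus a data vault-style satellite that keeps a row per change. Switching the setting regenerates the structures accordingly; the real conversion rules and templates are, of course, far more detailed.

    def history_ddl(table, columns, approach):
        column_list = ", ".join(columns)
        if approach == "separate_history_table":
            return [f"CREATE TABLE {table} ({column_list});",
                    f"CREATE TABLE {table}_history ({column_list}, valid_from TIMESTAMP, valid_to TIMESTAMP);"]
        if approach == "data_vault_style":
            # History is kept by loading a new satellite row for every change.
            return [f"CREATE TABLE sat_{table} ({column_list}, load_dts TIMESTAMP);"]
        raise ValueError(f"unknown history approach: {approach}")

    customer_columns = ["customer_id VARCHAR(36)", "name VARCHAR(100)"]

    for approach in ("separate_history_table", "data_vault_style"):
        print(f"-- {approach}")
        print("\n".join(history_ddl("customer", customer_columns, approach)))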

Data Architecture Automation

As many roads lead to Rome, different approaches exist to develop data platforms with sustainable data  architectures that are future-ready. This whitepaper describes the generator approach deployed by data  architecture automation tools. By separating the model definitions from the data platform, development  is accelerated and it is easier to change the architecture of the platform and apply changes that enable  the platform to support other forms of data consumption. This concept makes data architectures more  sustainable. Without sustainable data architectures, making those changes is time-consuming, expensive  and error-prone.  

Practice has shown that the IT industry has used generators from the early days onwards to improve the productivity of repetitive tasks. However, most generators are used to generate only one component of a platform. This has been widely accepted by the market. In fact, some developers don’t even realize that they are using generators anymore.

It is time for data architecture automation tools. Tools such as WhereScape have been on the market for many years and have proven their worth. If organizations have not studied the latest state of automation tools, it’s time they did.

WhereScape is an adaptable data architecture automation tool. It  uses model conversion rules and templates to generate the implementation  from data models. This makes the generated data platform independent of  definitions, technologies, runtime platforms, distribution, architectures,  design techniques, metadata and the way historical data is stored. In other  words, the generated architecture is a sustainable data architecture.  

Data platforms with a sustainable data architecture support the current, known upcoming and the  still unknown future requirements. They enable organizations to adapt more easily to the changing role of  data within their organization.