Many organizations are faced with the challenge of how to adapt to the changing role of data. Data is increasingly considered a key asset that can improve and optimize business processes and decision making processes. This changing role leads to new requirements for IT systems. Additionally, innovative new data processing and storage technologies, such as fast analytical SQL database servers, object storage technology, and parallel messaging technology, are creating new business opportunities. Nowadays, systems can be developed that were unimaginable or too expensive to develop years ago.
To support all these new requirements and to leverage these new technologies, organizations are rethinking and redeveloping the data architectures of their current data platforms. It is not an option to develop new data platforms with architectures that will be as outdated within a few years as current systems are today. They need data platforms with sustainable data architectures.
Data platforms with sustainable data architectures enable organizations to cope with the changing role of data and to leverage new and future technologies. A sustainable data architecture is not one that only supports current data requirements and all newly identified requirements, but one that is adaptable and extensible
Sustainable data architectures enable organizations to cope with the changing role of data.
When data consumption requirements change, the data architecture must be easy to adapt without the need for major reengineering and redevelopment exercises. Sustainable data architectures are as nimble as animals that have survived millions of years by adapting to their changing environment.
The key requirements for sustainable data architectures are:
Developing data platforms with sustainable data architectures using low-level development tools is technically possible, but time-consuming and error-prone. A more efficient approach to accelerate development and meet the above requirements is to use code generators
IT departments have been using generators for years. For example, ETL tools have been used to generate code to copy data from one database to another, BI tools to create SQL statements to access databases, and data modeling tools to generate database structures. The general benefits of generators are: improved flexibility, productivity, maintenance, performance, quality and integrated support for metadata.
Most generators generate only one component of an entire data architecture and not the entire architecture. A class of generators that generate entire data architectures are data
Warehouse automation tools. The current state is that many of them are focused on developing data platforms that support traditional forms of data consumption, such as a classic data warehouse architecture consisting of a staging area, an enterprise data warehouse, and several physical data marts that support traditional forms of reporting and dashboarding. Some
Data architecture automation tools for developing sustainable data architectures.
Forms of data consumption require other data architectures, such as data lakes and data hubs. A sustainable data architecture must be able to support a wide range of data consumption forms and must be able to quickly adapt to new forms.
Developing data platforms with sustainable data architectures requires generators capable of creating, adapting, and extending all kinds of data architectures. Such generators are called data architecture automation tools.
Almost every organization is trying to adapt to the changing role of data. Data is increasingly seen as a key asset that can improve and optimize business processes and decision-making processes and that allows organizations to expand into new markets. This changed role has led to new requirements for IT systems. Data platforms must process larger volumes of data because more data is being produced, but also the requirements for data consumption are changing and broadening. Providing the organization with simple dashboards and reports that show yesterday’s data is no longer sufficient. New forms of data consumption need to be supported, including data science, real-time dashboards, and self-service BI, and more external parties, such as customers, patients, suppliers, and legislators, must be able to access data. Additionally, requirements for data security and especially data privacy have become more complex.
But it is not only the customer pull that leads to new requirements, it is also the technology push. Many new powerful technologies have become available, such as fast solutions for data storage, processing, and analytics, cloud platforms with virtually unlimited resources and flexible fee structures, and new tools that simplify and accelerate analytics. All these new technologies are not there for technology’s sake, they can make it possible to improve processes or change business models. New technologies create new opportunities.
This section briefly explains two terms frequently used in this whitepaper: data architecture and data platform. Wikipedia defines data architecture as follows: “A data architecture is composed of models, policies, rules or standards that govern which data is collected, and how it is stored, arranged, integrated, and put to use in data systems and in organizations.” Most architects will more or less agree with this definition. More practically, it is a document that describes how an organizations produces, stores, processes, secures, consumes, exchanges, and deletes. It describes a data platform. A data platform is the implementation of a data architecture. It is the entire set of IT systems and data stores involved in producing and consuming data. Data warehouse systems and data lake-based systems are examples of data platforms.
To support new data requirements and to benefit from and exploit new technologies, organizations are rethinking and redeveloping the data architectures of their current data platforms. The architectures of new data platforms should not be developed in the traditional way, as they only support existing and known future requirements. The resulting data platforms may be outdated even before they are completed and operational.
It is important that new data platforms are developed in such a way that they can be changed and extended quickly and easily when requirements for data production or data consumption change (again). Nothing in their architectures should be fixed and impractical to change. For example, when new data storage technologies can improve performance or reduce storage costs, the platform must be able to adopt them easily; when data must be migrated to a more scalable cloud platform, it should not require a major redevelopment of the entire platform; stricter data privacy rules must be easy to implement; and when customers need live access to zero-latency data, the data platform must be able to provide it with minimal effort.
Note that in real life situations, most changes are very repetitive by nature. For example, changing the length of a field, replacing one SQL database server by another, and changing a data structure from a snowflake to a starschema pattern, are all very repetitive tasks. Such tasks are ideal for automation.
It is time for data platforms with sustainable data architectures. A sustainable data architecture is not one that only supports current data requirements and all newly identified requirements, but one that is adaptable and extensible enough to support (as yet unknown) future requirements. This latter is important to deal with completely unexpected and unplanned changes that lead to new requirements. For example, an organization may acquire another company that operates in a different industry with completely different data requirements, or urgent market developments due to business opportunities, disasters, or new aggressive competitors, can also lead to new data requirements.
A sustainable data architecture is one that can survive for a long time, because it is easy to adapt and extend. A sustainable data architecture enables an organization to quickly implement current, known upcoming and still unknown future data requirements; see Figure 1. When data consumption requirements change, the data architecture adapts accordingly without the need for major reengineering and redevelopment exercises. Sustainable data architectures are nimble.
The definitions of objects, such as tables, fields and keys, can change over time. For example, the data type, uniqueness, integrity rules, maximum length, and optionality of a field can change; fields may have to be added to or dropped from a table; and tables may need to be decomposed into two tables. If such changes occur, databases, ETL programs, reports, and dashboards must be changed accordingly.
If a field occurs in one table of one database, the amount of work involved may be limited. Some fields, however, occur in numerous tables throughout a data platform. For example, the field customer-id may appear in several tables in the staging area, in the enterprise data warehouse, and in several physical data marts. Changing the characteristics of such a field may lead to changes of many columns throughout all the databases of a data platform and to changes of several ETL programs. Although the change to be made is very similar wherever the field occurs and thus very repetitive, implementing all these changes is time consuming and, more importantly, error-prone.
The same is true when the definitions of tables change. Adding columns and decomposing tables is also very repetitive work, and just like changing fields, it is time-consuming and error-prone. A sustainable data architecture should simplify the work involved in changing the definitions of fields, tables, and other objects. Ideally, if the definition of an object changes, the change is automatically dispersed across all its implementations. This improves productivity, time-to-market, and avoids that incorrect or incomplete changes are implemented.
A sustainable data architecture must make a data platform technology independent. This applies to various kinds of technologies, including database servers, messaging busses, and ETL solutions. This simplifies adoption of and migration to faster, better, or cheaper of new and future technologies and it allows organizations to reap their benefits.
For example, replacing a SQL database server with another in a data platform usually requires many SQL statements to be translated into similar statements supported by the other database server. Although this is very repetitive task, it requires a lot manual work and is also error-prone. In a data platform with a sustainable data architecture it must be almost as easy as replacing a laptop with a new one. In other words, a sustainable data architecture supports plug and play with regards to technologies.
It must be easy to replace the runtime platform on which a data platform runs. For example, a sustainable architecture supports easy migration of the entire or parts of a data platform from on premises to a cloud platform (hybrid cloud). In addition, it must be possible to run different parts of one data platform on different cloud platforms from different vendors (multi-cloud). A sustainable data architecture can be distributed across multiple runtime platforms and is designed to simplify changing of the current setup.
A sustainable data architecture enables logically related data to be stored with different databases. For example, the data of a data warehouse or data lake is distributed across multiple physical databases. Distributing the data across multiple databases can improve performance and it allows data to be stored close to geographically distant user groups.
Another reason may be to use different storage technologies to store different types of data. For example, photos, and videos are stored in object storage technology; structured sensor data is stored in an analytical SQL database server; frequently used structured data is stored in an in-memory database server; and, ‘cold’ data is stored in a traditional SQL database server. There is no storage technology that is perfect for every type of data and for every form of data consumption. Heterogeneity of data storage technology has become the new normal.
Most data architectures are based on a specific reference data architecture, such as the logical data warehouse architecture, the classic data warehouse architecture (also known as the corporate information factory), the data lake architecture, and the data mesh. Such a reference architecture may be the best at any given time, but it can change when data production increases, new forms of data consumption are introduced, or when data latency requirements change.
A sustainable data architecture is an architecture that enables organizations to switch to another (reference) data architecture. For example, an architecture with physical data marts may be the ideal solution today to support dashboards and self-service BI, but virtual solutions may be the right solution tomorrow. Or, a data warehouse needs to be replaced by a centralized data lake. A sustainable data architecture is adaptable at this architectural level.
Some components of a data architecture can be designed with different techniques, for example, an enterprise data warehouse can be designed using the data vault or normalization techniques, a physical data mart can be designed as starschemas or snowflake schema, and a data lake can be designed using one or more tiers or zones. Another example relates to data copying and data transformation, an ETL or ELT approach can be used, and data integration can take place before the data is loaded into the data warehouse or when it is loaded into the physical datamarts. These are all examples of architectural design techniques. A sustainable data architecture makes it possible to easily change design techniques when new requirements dictate.
Many technologies store technical and/or business metadata. This metadata must be accessible to IT specialists and business users. The latter group is primarily interested in business metadata. All the metadata must remain available to both groups in a sustainable data architecture, even if technologies for capturing, storing, and exposing metadata change.
Many different approaches exist to keep track of historic data. For example, all the historic data can be kept in a separate database, or an approach could be used that is similar to how data vault organizes historical data. A sustainable data architecture should enable organizations to switch approaches when new requirements demand it.
Different approaches exist to develop data platforms with sustainable data architectures. The approach described in this whitepaper is based on code generators (generators for short). Generators transform higher-level specifications to lower-level specifications executed by specific runtime technologies, such as compilers, database servers, messaging products, or ETL tools. If changes need to be made, the high-level specifications are changed and everything or parts are regenerated. All the repetitive work is handled by the generator. In general, working with generators accelerates development (shorter time-to-market), is less error-prone (because generators do not make mistakes) and new requirements are implemented faster.
Generators automate tasks. Automation is not a new concept in the IT industry. In fact, IT is and has always been about automating tasks. IT systems have been developed to automate full or partial administrative, industrial, and technical processes. For example, no organization does their accounting completely by hand anymore, cars are no longer developed by hand, and ordered products are no longer extracted from warehouses manually anymore, but by robots. Nowadays, many products and services would be too expensive if automation is not applied.
Even in the early days, IT was used to speed up tasks or make them more cost effective. For example, to speed up the 1890 USA census, Herman Hollerith1 was asked to create a machine that could speed up the counting of the American population. The previous census of 1880 was done manually and had taken eight years to complete. By automating the process, a ten-year job was reduced to a few months. Another example from long ago was the machine created by Alan Turing2 and his colleagues who worked for the Government Code and Cypher School at Bletchley Park in the United Kingdom during the second World War. This machine was designed to speed up code breaking which it did successfully by automating a highly repetitive process.
The first traditional IT systems were also created to speed up accounting, inventory management, finance, and so on. Nowadays, robot arms are deployed in car factories to automate repetitive manual work. IT is all about automation.
Not every task is suitable for automation and for which suitable generators can be developed. The best tasks suitable for automation are repetitive by nature and can be expressed as formal algorithms. The above examples are all repetitive, such as counting the census, breaking cryptic cyphers, and putting window shields in cars in a factory. Many administrative tasks are also repetitive, such as entering new orders and calculating net wages. All these processes can be formally specified as algorithms that indicate which steps to perform, what to do in special cases and how to react when something goes wrong. In other words, these tasks can be formalized.
Note that artificial intelligence and machine learning make it easier to automate less repetitive and less formalized tasks, such as discovering cracks in dikes by analyzing photos made by drones. This increases the potential set of tasks suited for automation.
As indicated, IT systems are developed to automate tasks, but automation can also be applied to the development of IT systems. In other words, IT tasks such as software design, development, and testing can be automated as well. Especially when they are highly repetitive and formalizable, generators can be developed.
The first generators were probably the compilers for programming languages such as COBOL and Fortran. They were introduced to accelerate development. Rather than having to program in low-level assembler languages, more high-level languages could be used. One language construct in COBOL would replace numerous statements in assembler. In other words, repetitive forms of programming were replaced by simpler constructs. The same was true with the introduction of more modern languages, such as Java and C#. And SQL has replaced many statements in older database languages.
Currently, generators exist for all kinds of IT tasks. For example, all kinds of generators are available for the development of ETL programs and database structures. For example, ETL tools that transform high-level specifications to be transformed to lower-level code to do the actual ETL work have been widely accepted. Or, many advanced BI tools can be considered to be generators because business developers use them to develop their KPIs and dashboards using drag-and-drop features. These tools generate SQL statements or whatever is required to pull the data from databases. Some data science tools
If all these generators had not been developed, all programming would still be done in assembler, or worse in machine code. And business users would not even be able to develop their own dashboards. Note, some generators are technically not compilers but interpreters, but both forms generate code and replace repetitive tasks. But whatever the implementation is of the generator, compiler or interpreter, the goal is the same: improve productivity.
As indicated, the approach described in this whitepaper to develop data platforms with sustainable data architectures is based on generators that transform high-level specifications to lower-level specifications. This chapter lists the benefits of generators and especially generators for data architectures.
The primary benefit of automation is development acceleration. A generator often hides many technical details of the technologies, allowing developers to focus on the functionality that the data platform must support.
No matter how perfect IT systems are, eventually they have to be changed to support new requirements. Because many technical details are hidden by the generator, there is less code or specifications (depending on the system), which makes it easier to implement changes. To take the analogy of assembler languages again, changing code in a C++ program takes much less time than a comparable change to the assembler equivalent.
Generators generate generic code. Some think that generic code is by definition less efficient and slower than handwritten code. Whether this is true depends on the quality of the generator and the manual developer. For example, if a generator always generates standard SQL code regardless of the database server it uses, it may not exploit all the proprietary and performance-enhancing features that a database server supports, nor generate the most efficient SQL formulations. Good generators generate specific code for database servers and the most efficient formulation for each target platform.
In fact, the efficiency of handwritten code depends entirely on the expertise level of the developers and experts and it depends on whether they are given the time to create the most efficient code. The technologies for which they can develop efficient handwritten code is also limited.
Developers of generators are given the time to make sure that it generates the most efficient code. This generated code is used by countless organizations that will return comments on how it can be improved. Developers of generators automatically become super experts, which is reflected in the generated code. With each version of the generator, the generated code can also be more efficient. So, generated code can definitely be more efficient than handwritten code.
The correctness of generated code depends on the quality of the generator. But if a specific type of generated code is correct, it always generates correct code. It does not make sudden mistakes by forgetting something or generating the wrong code. It is much more complex for human developers to always produce correct code.
Correct generators can also generate the right metadata. For example, because some generate the entire architecture, they also know how to present a full picture of how all the data is connected. Users can see where the data comes from and how it has been processed and transformed. In other words, a complete and up-to-date lineage or impact analysis can be generated.
As indicated, the quality and efficiency of non-generated code is dependent on the quality of the developer. This approach has certain drawbacks. First, the scalability for development productivity is low, because developers can only work a limited number of hours per day. To increase productivity, more developers must be hired. Second, new developers may need to build up experience before they can write efficient code. Third, when developers change jobs, their organizations lose expertise. All these drawbacks do not apply to generators.
The ability to easily migrate to other technologies makes a data architecture less dependent on the vendor of a technology. Switching to alternative technologies can be limited to a simple regeneration. A vendor dependency that remains is the dependency on the generator itself.
All generator examples described in Chapter 4 produce one component of an entire data platform, such as an application or database. For example, an
ETL tool generates ETL programs, BI tools generate SQL statements, and data modeling tools generate data structures. Multiple, independent generators are required to generate an entire data platform. Since the generators require similar specifications, they are defined multiple times, or in other words they are duplicated; see Figure 4. It is a challenge to keep all those specifications consistent, to make sure they work together optimally, and to guarantee that if one specification is changed all the duplicate specifications are changed accordingly.
Many of the tasks involved in designing and developing data architectures are quite repetitive and formalizable. That makes these tasks suitable for automation. For example, when an enterprise data warehouse uses a data vault design technique and the physical data marts use starschemas, both can be generated from a central data model including the ETL code to copy the data from the warehouse to the data marts.
The principles that apply to generators of individual platform components can also be applied to generators of entire data platforms. Currently, the category of generators operating on the architectural level are the data warehouse automation tools. They do not generate code for one component of the architecture, but for several. Traditional data warehouse automation tools generate, for example, staging areas, enterprise data warehouses, physical data marts, the ETL solutions that copy data from one database to another, and metadata. Several of these tools have been on the market for many years and have proven their worth.
A disadvantage of several data warehouse automation tools is that they only generate data platforms with traditional data warehouse architectures. Such platforms are only suitable for a restricted set of data consumption forms. In other words, they are single-purpose data platforms.
To develop truly sustainable data architectures, generators are required that can generate other data architectures in addition to the more traditional data warehouse architectures, such as data lake and data hub architectures. Such generators can be used to generate data platforms for other forms of data consumption than those supported by data warehouse architectures. Note: The term data warehouse automation is probably a misnomer for such tools. Data architecture automation tool might be more appropriate.
WhereScape is a data architecture automation tool that has been on the market since 2002. The product supports two approaches to data warehouse automation: data-driven and model-driven automation. These two approaches roughly correspond to the two products that make up WhereScape: WhereScape 3D and WhereScape RED. The products augment each other and can also be used separately. In the whitepaper the two products are considered one product and are simply referred to as WhereScape.
With the model-driven approach a neutral data model, consisting of entities, attributes and relationships, is defined and stored in the WhereScape repository. These models can be automatically transformed to comparable models with different design techniques. For example, a normalized model can be transformed to a data vault model. Since WhereScape knows the source and target model, it can also generate the logic to copy the data from one model to another.
How a model is transformed is described in detail in so-called model conversion rules; see Figure 5. These rules define in detail how one data model is transformed to another. The rules of a conversion model can be changed and adapted to the needs of an organization. For example, if an organization uses its own, slightly adapted version of data vault, it only needs to change the conversion rules model once. The generator always applies these changed model conversion rules whenever a new data model needs to be transformed. This guarantees that this change is executed consistently throughout the model. Work that would have been highly repetitive, time-consuming, and error-prone is now handled automatically and error-free by the generator.
New model conversion rules can be added. For example, model conversion rules can be defined to transform a normalized model to a data model for a data hub.
WhereScape also uses templates when it generates real code. The templates specify in detail the code to be generated see Figure 6. It supports different types of templates, including DDL (Data Definition Language), DML (Data Manipulation Language), and load templates.
Templates are designed for specific data platforms. For example, there will be a DDL template to generate a CREATE TABLE statement for Oracle and another to generate MySQL equivalents. These templates are probably slightly different because SQL products use slightly different dialects. Some data platforms also support proprietary features that can improve performance or minimize storage. If relevant, these templates use those proprietary features and if not, they can be changed to use them.
DML templates are used to generate code to insert, update, and delete data, and are used, for example, to generate code that moves data from one model to another. They are also used when changes need to be implemented. For example, splitting a table into two tables may require multiple statements to move data from the old table to the two new tables. As with DDL templates, DML templates also take into account the dialect differences and proprietary features of data platforms.
Load templates, as the name suggests, specify the statements to load data into a specific database or file. These generated statements may be specified in the language of the loading tool belonging to a database server or target cloud platform.
The data-driven approach starts with defining the model of a source database within WhereScape. Next, developers specify how that model needs to be transformed into a target model. WhereScape uses these specifications to generate the code required to copy data from the source to the target database. With this data-driven approach, templates are also used to generate real code to create and change databases.
The two approaches do not work in isolation, but augment each other. Models and transformations defined with the model-driven approach can be imported in the data-driven approach. Figure 7 contains a screenshot of a data model created with WhereScape.