Extraction, transformation, and loading (ETL) processes have existed for almost 30 years, and ETL has been a mandatory programming skill set for those who create and maintain analytical environments. Sadly, ETL alone can no longer keep up with the speed at which modern analytical needs change and grow.

The increasingly complex infrastructures of most analytical environments, the addition of massive amounts of data from unusual sources, and the complexity of the analytical workflows all make it harder for implementation teams to meet the needs of the business community. The length of time it takes to create a new report, a relatively simple process, shows that ETL skills alone are not enough. We must improve and speed up all data integration by introducing automation into ETL processes.

Automation does more than relieve implementers of building the same mundane, repetitive tasks over and over. Among its many benefits are the following:

1. Automated Documentation

Automation ensures that ETL processes are not just tracked but documented, with up-to-date metadata on every extraction, transformation, movement, and manipulation of the data as it makes its way to the ultimate analytical asset (a report, an analytic result, a visualization, a dashboard widget, and so on). This metadata is not an afterthought; it is integral to the automation software itself and is always current. It is as useful to the business community as it is to the technical implementation staff. Business users increase their adoption of analytical assets if they can determine that the asset was created from the same data they would have used, that it was properly integrated with other sets of data, and that the ultimate analytical asset is exactly what they need. In other words, they trust the data and the asset.
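To make the idea of metadata that is "integral" rather than an afterthought concrete, here is a minimal sketch in Python. It is not how any particular automation product works; the `documented_step` decorator, the `METADATA_LOG` list, and the sample order data are all hypothetical names invented for illustration. The point is only that the documentation is produced by running the step, so it cannot drift out of date.

```python
from datetime import datetime, timezone
from functools import wraps

# Hypothetical in-memory metadata log; a real tool would persist this
# in its own catalog or repository.
METADATA_LOG = []


def documented_step(description):
    """Wrap an ETL step so every run records what it did and when."""
    def decorator(func):
        @wraps(func)
        def wrapper(rows):
            result = func(rows)
            METADATA_LOG.append({
                "step": func.__name__,
                "description": description,
                "rows_in": len(rows),
                "rows_out": len(result),
                "ran_at": datetime.now(timezone.utc).isoformat(),
            })
            return result
        return wrapper
    return decorator


@documented_step("Drop orders with no customer id")
def drop_orphan_orders(rows):
    return [r for r in rows if r.get("customer_id") is not None]


@documented_step("Convert amounts from cents to dollars")
def cents_to_dollars(rows):
    return [{**r, "amount": r["amount"] / 100} for r in rows]


# Illustrative data only.
orders = [
    {"customer_id": 1, "amount": 1250},
    {"customer_id": None, "amount": 300},
]
clean = cents_to_dollars(drop_orphan_orders(orders))

for entry in METADATA_LOG:
    print(entry["step"], entry["rows_in"], "->", entry["rows_out"])
```

Because each entry carries a plain-language description alongside the row counts, the same log could in principle serve both technical staff and business users.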

2. Implementing Standards

By setting up routine programs to handle common tasks like date and time processing, reference and look-up tables, and serial key creation, the analytical teams establish much-needed standards. The implementers can spin up new data and analytical assets or perform maintenance on existing assets without introducing “creative” (non-standard) data into these critical components. No matter where the data resides (on-premises, in the cloud, in a relational database or not), these sets of data remain the same, which makes them much easier for everyone to use (business community or technical staff).
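As a rough illustration of what such routine programs might look like, the Python sketch below defines one shared date parser and one shared key generator. The accepted date formats, the `standard_date` function, and the `SurrogateKeys` class are assumptions made up for this example, not part of any specific tool.

```python
from datetime import date, datetime
from itertools import count

# Hypothetical list of source date formats the team has agreed to accept.
ACCEPTED_DATE_FORMATS = ("%Y-%m-%d", "%d/%m/%Y", "%Y%m%d")


def standard_date(text):
    """Parse a source date string into a date using the agreed formats only."""
    for fmt in ACCEPTED_DATE_FORMATS:
        try:
            return datetime.strptime(text.strip(), fmt).date()
        except ValueError:
            continue
    raise ValueError(f"Unrecognized date format: {text!r}")


class SurrogateKeys:
    """Hand out one stable integer key per natural (business) key."""

    def __init__(self):
        self._next = count(1)
        self._assigned = {}

    def key_for(self, natural_key):
        # Reuse the existing key if this business key has been seen before.
        if natural_key not in self._assigned:
            self._assigned[natural_key] = next(self._next)
        return self._assigned[natural_key]


keys = SurrogateKeys()
first = keys.key_for("CUST-A")
second = keys.key_for("CUST-B")
repeat = keys.key_for("CUST-A")  # same key as `first`
```

When every pipeline calls the same two routines, a date or key looks identical wherever it lands, and anything that does not fit the standard fails loudly instead of slipping through.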

3. Automated Data Lineage

A significant boon of automation for any analytical environment is the automatic creation of data lineage. Data lineage consists of the metadata that shows all the manipulations applied to data from its source(s) to its ultimate target database, as well as the individual operations (algorithms, calculations, etc.) that produce analytical assets. Think how useful that information becomes to business users, data scientists, and others who use and create analytical assets. Being able to understand how upstream ETL changes can affect downstream analytical assets eliminates many problems for users and implementers alike.
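One common way to picture lineage is as a directed graph in which each node feeds the nodes downstream of it. The Python sketch below is a simplified illustration under that assumption; the `LINEAGE` dictionary and every table, model, and report name in it are hypothetical. It answers the impact question: if this upstream object changes, which assets might be affected?

```python
from collections import deque

# Hypothetical lineage: each key feeds the assets listed as its value.
LINEAGE = {
    "crm.customers": ["stg_customers"],
    "erp.orders": ["stg_orders"],
    "stg_customers": ["dim_customer"],
    "stg_orders": ["fact_sales"],
    "dim_customer": ["fact_sales", "churn_model"],
    "fact_sales": ["revenue_dashboard", "monthly_sales_report"],
}


def downstream_of(node, lineage):
    """Return every asset reachable from `node` (breadth-first traversal)."""
    affected = set()
    queue = deque([node])
    while queue:
        current = queue.popleft()
        for child in lineage.get(current, []):
            if child not in affected:
                affected.add(child)
                queue.append(child)
    return affected


impacted = sorted(downstream_of("stg_customers", LINEAGE))
print(impacted)
```

Running the same traversal over reversed edges would answer the opposite question, namely which sources a given report was built from.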

4. Quicker Time-to-Value

Project lead time is greatly reduced with automation when adopting a new technological target (e.g., moving to Snowflake or Synapse) or migrating from an on-premises environment to a cloud-based one. Much of the ETL code generated by an automation technology can be easily retrofitted to the new environment through simple pull-down menu options, with minimal additional recoding. In essence, by adopting automation, an organization is “future-proofing” its analytical architecture, which is no small accomplishment!

5. Getting Agile

ETL automation supports the technical staff as they move to adopt a more iterative and agile methodology. Rather than having a series of discrete steps in a traditional methodology with handoffs between staff, all the steps for data integration are encapsulated in the automation tool so that moving from one step to another is seamless and fast. In fact, the same person can perform all the data integration steps without any handoffs. This makes the adoption of an agile methodology not only possible but compelling.

6. Improved Data Governance

By capturing all the technical metadata and ensuring its accuracy and currency, automated ETL serves another audience nicely – the data governance function. Understanding the full life cycle of data integration from initial capture to ultimate target, data stewards can monitor where the data came from (approved sources or not), what changes and transformations were performed on it (standard calculations or personalized ones), and what analytical assets can now be certified (“Enterprise-approved” or “Corporate Standards”).

7. Switch Modeling Styles Faster

One of the more difficult migrations an analytical environment may go through is a change in its data modeling style. One example is switching from a star schema-based data warehouse to one based on the Data Vault design. Without data integration automation and well-documented metadata, this change would almost certainly require a total rewriting of all ETL code. With automation, all the steps leading to the ultimate storage of the data may be preserved and only the last few processes that create the database schema and load the data would have to be altered. Much of the intellectual capital can be preserved and the change made quickly and efficiently.
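The separation described above, shared upstream steps plus a small target-specific final step, can be sketched in Python as follows. This is a deliberately simplified illustration: all function and column names are hypothetical, the Data Vault side omits details such as load timestamps, and the MD5 hash key is just one common convention assumed here.

```python
import hashlib


def clean_customers(raw_rows):
    """Shared transformation: runs the same way whichever model is the target."""
    return [
        {
            "customer_no": r["customer_no"].strip().upper(),
            "name": r["name"].strip().title(),
        }
        for r in raw_rows
    ]


def to_star_dimension(rows):
    """Target-specific step: shape rows for a star schema customer dimension."""
    return [
        {"customer_sk": i, "customer_no": r["customer_no"], "name": r["name"]}
        for i, r in enumerate(rows, start=1)
    ]


def to_vault(rows, record_source):
    """Target-specific step: split rows into a hub (keys) and a satellite (attributes)."""
    hubs, satellites = [], []
    for r in rows:
        # Assumed convention: hash of the business key links hub and satellite.
        hash_key = hashlib.md5(r["customer_no"].encode("utf-8")).hexdigest()
        hubs.append({
            "customer_hk": hash_key,
            "customer_no": r["customer_no"],
            "record_source": record_source,
        })
        satellites.append({
            "customer_hk": hash_key,
            "name": r["name"],
            "record_source": record_source,
        })
    return hubs, satellites


# Illustrative data only.
raw = [{"customer_no": " c-001 ", "name": "ada lovelace"}]
cleaned = clean_customers(raw)                          # preserved across styles
dimension = to_star_dimension(cleaned)                  # old target
hubs, satellites = to_vault(cleaned, record_source="crm")  # new target
```

Only the last function changes when the modeling style changes; `clean_customers`, which holds the business logic, is untouched.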

8. Adopting a Data Fabric

Finally, many organizations are considering a new architecture to replace their aging data warehouses: the “Data Fabric”. The idea of a data fabric started in the early 2010s. Since then, many papers, vendors, and analyst firms have adopted the term. The goal of a data fabric is to create an architecture that encompasses all forms of analytical data for any type of analysis (e.g., from straightforward reporting to complex business analysis to complicated data science explorations) with seamless accessibility and shareability by all those with a need for it. Data in a data fabric may be stored anywhere throughout the enterprise, which makes automated ETL a mandatory tool for increasing the likelihood of success in this new endeavor. Well-documented ETL greatly reduces the overall complexity by streamlining creation and maintenance of this highly distributed environment.

These are just a few of the most important benefits of automating data integration. They are all compelling and illustrate the value of the technology not only to the technical implementation staff but also to the business community. In today’s complex analytical environments, an enterprise can’t afford to have old-fashioned, slow, error-prone ETL processes; it must turn on a dime and create new analytical assets quickly while preserving the integrity of the existing assets. Automating your ETL processes is the only way to achieve this.