Native automation and automated discovery of data sources to enable CI/CD

In a data-driven world, getting your data to the right locations as quickly as possible is crucial. WhereScape claims to automate the plumbing for this, making sure data gets where it needs to go at the right time. We sat down with CTO Neil Barton to discuss these claims.

The goal for WhereScape is as simple as it is complex, according to Barton: make the time-to-value for data as short as possible, regardless of the data's provenance or type. Automation is the key to achieving this. In a nutshell, that's what WhereScape 'does'.

The variety of sources, and the sheer amount of data, can make building a data warehouse a complex and time-consuming task. Barton mentions that ingesting all their data manually could take companies months or even longer, depending on the size of the company and the volume of data involved. That timeframe is no longer an option in today's data-centric world.

Automation

The best way to speed up the time-to-value is by coupling the right infrastructure with automation, according to Barton. The WhereScape automation solution builds and manages the data processing pipelines and schemas, tailored to each platform the software runs on (for example Oracle Exadata and Teradata), as each can require a different schema.
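To make that idea concrete, here's a minimal sketch of what generating platform-specific schemas from a single logical model could look like. The logical table, type mappings and function below are our own illustration, not WhereScape's actual internals:

```python
# Hypothetical logical model plus per-platform type mappings (illustrative).
LOGICAL_TABLE = {
    "name": "customer",
    "columns": [("customer_id", "integer"),
                ("full_name", "string"),
                ("signup_date", "date")],
}

TYPE_MAP = {
    "teradata": {"integer": "INTEGER", "string": "VARCHAR(255)", "date": "DATE"},
    "exadata":  {"integer": "NUMBER(10)", "string": "VARCHAR2(255)", "date": "DATE"},
}

def render_ddl(table: dict, platform: str) -> str:
    """Generate CREATE TABLE DDL for one platform from the logical model."""
    types = TYPE_MAP[platform]
    cols = ",\n  ".join(f"{name} {types[ltype]}"
                        for name, ltype in table["columns"])
    return f"CREATE TABLE {table['name']} (\n  {cols}\n);"

for platform in TYPE_MAP:
    print(f"-- {platform}\n{render_ddl(LOGICAL_TABLE, platform)}\n")
```

One logical definition, two native schemas: that's the essence of catering to each platform without hand-writing DDL per database.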

WhereScape doesn't require its customers to add a layer to their IT infrastructure in order to use its tools. Because the automated ingestion model leverages the relational database management system itself for processing logic and compute power, WhereScape also incurs no additional performance overhead, according to Barton. The result is a very light footprint, with just some small schedulers putting any load on the infrastructure as a whole.

If you use more than one platform, the WhereScape tools can still ingest from each platform in parallel, both from multiple sources and to multiple targets. This is possible because the ingestion processing sits within each platform rather than in a separate engine. A separate engine wouldn't make much sense anyway, because you would lose the benefit of running native code.
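The principle of keeping processing inside the platform is easy to illustrate. The sketch below uses SQLite purely as a stand-in for any RDBMS: the load is a single native SQL statement, so no rows ever leave the database engine. The tables and the conversion rate are invented for the example:

```python
import sqlite3  # stand-in for any RDBMS driver (Oracle, Teradata, ...)

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE staging_orders (order_id INTEGER, amount_usd REAL, status TEXT);
    CREATE TABLE sales_fact (order_id INTEGER, amount_eur REAL);
    INSERT INTO staging_orders VALUES (1, 100.0, 'complete'), (2, 50.0, 'open');
""")

# The whole load is one native SQL statement: the transformation
# (here an invented 0.92 conversion rate) runs inside the engine,
# not in an external processing layer.
conn.execute("""
    INSERT INTO sales_fact (order_id, amount_eur)
    SELECT order_id, amount_usd * 0.92
    FROM   staging_orders
    WHERE  status = 'complete'
""")
print(conn.execute("SELECT * FROM sales_fact").fetchall())  # [(1, 92.0)]
```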

WhereScape uses a metadata repository, persisted in a database, that stores all the metadata related to the data warehouse environment. From that metadata, the tool generates all of the processing code, schemas and job scheduling components within the applications, connected to the right data sources.
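Here is a rough sketch of what metadata-driven generation means in practice, assuming a deliberately simplified repository: the rows describe source-to-target mappings, and both the load code and the job ordering are derived from them. None of this reflects WhereScape's actual repository schema:

```python
# Hypothetical repository: each row maps a source to a target,
# with dependencies that constrain the job schedule.
METADATA_REPOSITORY = [
    {"target": "stage_orders", "source": "src.orders", "depends_on": []},
    {"target": "dim_customer", "source": "stage_orders",
     "depends_on": ["stage_orders"]},
]

def generate_load_sql(entry: dict) -> str:
    """Derive the processing code for one mapping from its metadata."""
    return f"INSERT INTO {entry['target']} SELECT * FROM {entry['source']};"

def schedule(entries: list[dict]) -> list[str]:
    """Order the jobs so dependencies load first (simple topological pass)."""
    done, plan = set(), []
    while len(plan) < len(entries):
        for e in entries:
            if e["target"] not in done and all(d in done for d in e["depends_on"]):
                plan.append(generate_load_sql(e))
                done.add(e["target"])
    return plan

for stmt in schedule(METADATA_REPOSITORY):
    print(stmt)
```

The point is that the SQL and the scheduling both fall out of the metadata; nothing is hand-coded per table.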

Using metadata doesn't only help with automating the coding of the data warehouse. It can also be used for documentation, lineage and impact analysis; that is, to generate reports about where data comes from, where it goes and what transformations are applied to it. That's a nice added bonus, especially if you have to comply with certain guidelines or regulations. GDPR is an obvious example, but various industries have their own requirements. With the reports you can generate using WhereScape, you can demonstrate compliance and/or adequacy.
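Lineage and impact analysis fall out of the same metadata almost for free: walk the source-to-target mappings in one direction for lineage, in the other for impact. A simplified sketch, again with invented mappings:

```python
# Hypothetical mappings: target -> list of sources it is built from.
MAPPINGS = {
    "stage_orders": ["src.orders"],
    "dim_customer": ["stage_orders"],
    "sales_report": ["dim_customer", "stage_orders"],
}

def upstream(obj: str) -> set[str]:
    """Lineage: everything this object is (transitively) derived from."""
    found = set()
    for src in MAPPINGS.get(obj, []):
        found |= {src} | upstream(src)
    return found

def downstream(obj: str) -> set[str]:
    """Impact analysis: everything affected if this object changes."""
    hit = {t for t, srcs in MAPPINGS.items() if obj in srcs}
    return hit | {d for t in hit for d in downstream(t)}

print(upstream("sales_report"))   # {'dim_customer', 'stage_orders', 'src.orders'}
print(downstream("src.orders"))   # {'stage_orders', 'dim_customer', 'sales_report'}
```

A GDPR-style "where does this personal data end up?" report is essentially the downstream walk, rendered as a document.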

WhereScape RED and 3D

When it comes to building your data warehouse and reducing your time-to-value, two phases are relevant. First there's the discovery phase, in which you identify your data sources, discover structures, relationships and quality issues in the data, and build a model for ingestion.
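As a simple illustration of automated discovery, assuming the source is a SQL database that exposes its own catalog: the sketch below reads structure from the database's metadata instead of inspecting the data by hand. Real discovery tools also profile relationships and data quality, which this omits:

```python
import sqlite3

# SQLite stands in for any source; its catalog differs from
# information_schema, but the idea is identical: ask the source
# to describe itself.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT)")

for (table,) in conn.execute(
        "SELECT name FROM sqlite_master WHERE type = 'table'"):
    cols = conn.execute(f"PRAGMA table_info({table})").fetchall()
    print(table, [(c[1], c[2]) for c in cols])  # (column name, declared type)
```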

Once you've done that, you can prototype to see whether the resulting model is correct. This allows the IT development team to work with the business users in an iterative fashion to validate that the target model meets the needs of the business. Once the model is correct, you can get to work on developing it, deploying it and taking it into production.

For the discovery and design part WhereScape offers WhereScape 3D, and for the development and deployment part there's WhereScape RED. You can have the latter without the former, in which case you build the model manually, but not the other way around. That is, WhereScape 3D without WhereScape RED doesn't make sense, as you would only have a model without being able to execute on it.

Together with automating as much as possible across the board, these products help companies move further towards CI/CD (DevOps). Mind you, Barton doesn't want to take building a data warehouse away from IT, something that is often implied when DevOps enters the discussion. He thinks this should still be a task for that department; by speeding up the deployment process, however, IT developers' time can be redeployed to delivering even more analytic capabilities for the business.

What about streaming data?

As anyone who deals with data (especially data ingestion) knows, there are two types of data: data at rest and data in motion. Data at rest consists of the types of data we discussed above, residing in databases such as Oracle Exadata, Teradata and Snowflake. Data in motion comes from sources like IoT sensors.

Going forward, streaming data is something WhereScape is going to focus on even more, according to Barton. Kafka and S3 are already supported, which covers a large chunk of the picture. Support for more platforms is coming, but there are so many that it doesn't make sense to try to support them all in a hurry. Like every other modern (and successful) company, WhereScape primarily looks at what the market wants.
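For a flavour of what ingesting data in motion involves, here's a minimal consumer sketch assuming the open-source kafka-python client; the topic, brokers and landing logic are placeholders, not a reflection of WhereScape's actual streaming support:

```python
import json
from kafka import KafkaConsumer  # pip install kafka-python

consumer = KafkaConsumer(
    "iot-sensor-readings",                  # hypothetical topic
    bootstrap_servers=["localhost:9092"],   # placeholder broker
    value_deserializer=lambda raw: json.loads(raw.decode("utf-8")),
)

for message in consumer:  # blocks, reading the stream as it arrives
    reading = message.value
    # In a real pipeline this would land in a staging table, feeding
    # the same metadata-driven processing described above.
    print(reading["sensor_id"], reading["value"])
```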

Keeping focus going forward

Looking at what the market wants doesn't mean that WhereScape is going to venture into areas adjacent to what they do now, though. That could be very tempting, because they could probably add value in those areas as well. However, Barton wants the company to keep the focus it has now and make products and solutions that fit into the existing infrastructure of businesses, enabling them to use the full potential of their data as quickly as possible.