HVR Replication and Data Warehouse Automation

Good morning, good afternoon, good evening, depending on where you're calling from. Welcome to our webinar. This is a webinar featuring a team from HVR and WhereScape. We're really excited to have you here today. The webinar is Accelerate Your Data Warehouse Project, and we're really excited to bring you this joint webinar together. We're going to talk today about how our solutions bring automation and rapid data integration to your data infrastructure efforts, and how they help accelerate the delivery of real time data to the business. The purpose of today's webinar event is to talk in more detail about our solutions. Before we dive in, I'd like to go over a few housekeeping items. We've done a sound check, and please know that everyone on the call is on mute. Now, that doesn't mean that you won't have a platform to ask questions. I think most of you have done a GoToWebinar event before, but if you haven't, please know there is a questions box in the control panel.

WhereScape

Feel free at any time to type in a question and we will ask your question at the end of the webinar. All right, now that everyone is ready to get started, I'd like to introduce you to today's hosts. I'm really excited to introduce Douglas, or Doug, Barrett. He's a senior solutions architect with WhereScape, and he's joining us today with over 20 years of experience in helping organizations build data warehouses. In fact, at WhereScape he's developed a solution that has significantly reduced risk by applying automation, agility, and modeling, and has helped facilitate speed and communication for data warehousing projects. Prior to WhereScape, he was also at Microsoft in their consulting services division. Now, Doug is calling in today from Portland, Oregon, and he says that's where you need to love rain and beer. He's actually been there for eight years and is quite enjoying it.

HVR

If you haven't seen Doug before, he often speaks at TechEd, SQL Saturdays, and TDWI conferences. So again, welcome Doug. We also have Paul O'Leary of HVR. He is a senior solutions architect with an extensive background in data replication technology. He's also been in the industry for quite some time, with over 20 years of experience. He's really experienced with data replication, data integration, and real time change data capture. Before HVR, Paul played a huge role in helping launch and develop products such as SharePlex and GoldenGate. Now Paul is actually on the opposite coast. He's on the east coast, or some might say the right coast, depending on where you're from. And he is a huge hockey guy. He coaches hockey and is a big fan of hockey. If there are any hockey fans out there, talk to Paul, he's your guy. I'm going to go over the agenda before our hosts start talking about their pieces.

We're going to have Doug go first. He is going to talk about WhereScape. For those who aren't familiar with WhereScape, this is an opportunity to understand what data warehouse automation is. He's going to talk about how WhereScape automates the design, development, deployment, and operation of the data warehouse infrastructure. Paul is going to talk about HVR, what it is for those who aren't familiar, and what data replication technology can offer organizations. In the next section, Doug and Paul will talk about how HVR and WhereScape can together help accelerate your data warehouse projects. Again, for those on the call, if you're embarking on a project, this is a really key time to listen in and see how these two technologies work together. Doug will be leading us through a demo as well. At the end, which is about the last ten minutes of the call, we'll be having a Q and A.

Automate ETL

Excellent. Right, well, first of all, I'd like to say how excited I am about this partnership. This partnership with HVR really allows us to extend WhereScape solutions. It adds the ability to load data in real time, or near real time. Loading from source allows us to do better audit tracking of data from source. It allows us to better access data in applications like SAP or Salesforce. I'm really excited about how HVR extends the WhereScape solution. For those of you that don't know WhereScape, we're kind of the preeminent data warehouse automation company. Data warehouse automation is all about automating the design, the development, the documentation, the deployment, and the operations of a data warehouse. Data warehouse automation is all about productivity. It's all about removing 80% to 90% of the manual effort in building a data warehouse. And productivity equals agility. Agility is really key to the success of a data warehouse.

Agile Data Warehouse Design

Agility boils down to collaboration with the business and iteration, so that we can work through a project and adapt it during the project to a better understanding of the data and a better understanding of the requirements. That's really important because often when we're working with the business, they don't know what they need until they see it. By iterating through a project, iterating through a design that's actually built using WhereScape, they can see what data is available and how it can be applied. We've been doing this for a fairly long time. We've been automating projects for getting on for 20 years, and we've got around 700 customers. Talking of customers, you can see there's a selection up on the board here. We've got some very large organizations and we've got some very small organizations. The organizations that we work with are across all industries, as you can see.

We've got retailers like Costco, we've got financial services, we've got colleges, we've got all sorts in there. Some of the very large organizations might have a team of 100 developers working with WhereScape to build out the data warehouse and might be loading terabytes of data a day. We also have smaller organizations like my local community college. They have one part time guy using WhereScape to build a student retention data mart on SQL Server. It really does scale up and scale down, and it doesn't matter what industry it's applied to. All of these organizations needed the ability to be more productive and to be agile in the delivery of data warehouse projects. Now, the reason that data warehouse projects take so long is that it's a labor intensive process to build a data warehouse using a traditional suite of tools. One of our customers recently compared how they used their modern ETL tool against WhereScape to build a dimension table.

ETL Tool

They found that they were doing 28 development steps in their ETL tool against three in WhereScape. The rest of it could be automated. Really, when you think about it, you want that automation because it applies standardization. Building a type two dimension, for example, should be the same the next time you build a type two dimension, and the next time, and the next time. Everything should be the same about it. The code pattern, the housekeeping columns, the start and end dates should all be consistent, should be the same. And that's what automation does. It allows us to automate the development process, but it also allows us to standardize the development process as well. What we get is much more standardized output. That's really handy because when it comes to support, we don't want there to be much in the way of variation. We don't want to have to understand how each developer codes in order to support their code.

It doesn't matter who's driving the keyboard, when they're using WhereScape, they get the same code at the end of it. It also allows us to be fast. The traditional way doesn't allow us to be fast enough to be agile. Working with the business, we need a cadence of change that allows us to remain engaged. If it takes a few days or a few weeks to make a change based on some input, the business is going to lose focus and lose that engagement. We need something that's a lot faster to maintain that engagement with the business. Then there's documentation. Usually when we go into an organization and we ask, so what's your documentation like for the data warehouse, everybody looks a bit sheepish and looks at each other and looks at their shoes, because usually there isn't very good documentation. If there's documentation at all, it's often out of date.

Automated Documentation

Now, by using an automation tool, we get documentation. Without that, we have to rely on tribal knowledge. We have to rely on what's in people's heads to understand how something was built, how it should be supported, and how it all hangs together. And all of this leads to frustration. The business wants access to data as soon as possible. They want access to the data now, but it takes time to get it right, and it takes too long to get it right. There's this frustration between the business and IT. That's the problem that we want to solve with automation. Data warehouse automation solves these problems by trying to eliminate a lot of this labor intensive work, what we call the donkey work of the data warehouse. Now, it does this by using metadata and code patterns and design patterns. Data warehousing is riddled with patterns, and we'll see more about that in a minute.

We want the code to be the same for similar objects, to make it easier to support and quicker to build. We want to apply best practices as we do that. Now, we're using code patterns that have been developed against each target platform to apply the best practices of that platform as well as the best practices of the data warehouse processing that we're trying to apply. That just happens by using the tool. This reduces the complexity, and it reduces the time it takes to adopt new platforms or to bring people up to speed. We can make fairly average coders look like superstars using WhereScape because a lot of the output is done for them. They just have to worry about little snippets which apply specific business rules, and the overall code pattern is applied over the top of that to make it quick and easy to build a lot of code very rapidly within the data warehouse.

Data Warehouse

All of this allows us to be much more reactive, a lot more agile as we work with the business. That's the key to a successful project, particularly as we're working in the business layer or the semantic layer of a data warehouse. Of course, delivering projects faster means that they're delivered with less cost as well. As we have a look at a data warehouse, there's patterns everywhere. There's patterns in loading data, there's patterns in storing data, there's patterns to apply surrogate keys and to generate hash keys, there's indexing patterns, there's star schema patterns, there's third normal form design patterns. There's data vault patterns, and there's patterns that are specific to different target platforms. Like, do we do a CTAS, a create table as select, to process data and persist data, or do we use a merge statement or an insert statement or an update statement?
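
As a rough illustration of those last two patterns (the table and column names are purely illustrative and the syntax is SQL Server flavored; this is not WhereScape-generated code), a CTAS-style load and a merge-style load can look like this:

    -- CTAS-style: rebuild the target in one set-based statement
    -- (SQL Server uses SELECT ... INTO; other platforms use CREATE TABLE ... AS SELECT)
    SELECT product_id, product_name, unit_price
    INTO   dbo.stage_products
    FROM   hvr.products;

    -- MERGE-style: apply only the differences to an existing table
    MERGE dbo.dim_product AS tgt
    USING dbo.stage_products AS src
       ON tgt.product_id = src.product_id
    WHEN MATCHED THEN
        UPDATE SET tgt.product_name = src.product_name,
                   tgt.unit_price   = src.unit_price
    WHEN NOT MATCHED THEN
        INSERT (product_id, product_name, unit_price)
        VALUES (src.product_id, src.product_name, src.unit_price);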

Traditional Data Warehouse

All of these patterns are built into the tool as it applies code generation and table management within a target platform. WhereScape really tries to leverage all of these patterns in order to automate the development. Now, let's have a look at the conventional way of building a data warehouse. These are the steps that we've identified in how you build a data warehouse, starting at the bottom with requirements gathering, looking at source data, profiling source data, defining a target model, designing ETL, building ETL, testing ETL, managing versions, managing workflow, managing deployments, and so on. As you scan up the left hand side of that staircase, we can see there's usually a suite of tools that we use as we try and deliver those projects: ETL tools, modeling tools, profiling tools, and whatnot. Each one of those tools is not particularly well integrated with the next. There's often a handover between these steps, and those handovers actually slow down the development process.

There's usually a different set of skills in each tool, and maybe a different set of people as well. This is what slows down development, but it also makes changes very hard. Because if I decide, hey, we need to change the target model to cope with a better understanding of the data, for example, often I have to come down to the design, change the design, update the ETL mapping, go back and redevelop the ETL, test against the mapping, and so on. It's a complicated and relatively inefficient way of developing and making changes. We call that last slide the Staircase of Doom.

WhereScape

With WhereScape, what we try to do is blend together as many of these tasks as possible. We can do that with a rich metadata repository over the top of this. We blend these together so that if I need to make a change, for example, build something for the first time and then subsequently change it, I'm changing the target model, the physical model, I'm changing the code that populates that target, and I'm regenerating documentation, all as a single development task.

I can also manage workflow, I can manage deployments, I can manage all of these things in one tool with a nice integrated set of metadata to make my life easier in all of the tasks of building a data warehouse. If the last slide was the Staircase of Doom, we could call this the elevator of awesomeness. We're going to get to deliver our projects a lot better. And that top part, the documentation, always gets done when we're using WhereScape.

Data Vault Express

Now we actually have two main tools and a bundle of those tools for Data Vault people called Data Vault Express. 

WhereScape RED

We have WhereScape 3D and WhereScape RED. WhereScape RED is our primary development tool that manages the development, the operations, the documentation, and the deployment of the data warehouse.

WhereScape 3D

Complementary to that, we have WhereScape 3D. WhereScape 3D is more of a design tool, so more of a traditional modelling tool. 

It stands for data driven design. It allows us to reverse engineer a source. It allows us to design a target model. Once we're happy with that target model, we can just push it across into WhereScape RED. The two tools work very nicely together. When we're building data vaults, we have nice design automation in 3D and nice build automation in RED, and together they make up Data Vault Express. This slide really lays out the processing architecture of a data warehouse and the way that we might build it with WhereScape RED. All of the boxes between the two bookends are tables that are managed in a data warehouse by WhereScape RED. We manage the loading, we manage persistence in the foundation layer, we manage data cleanup, data transformations, business transformations, and the end user layer. RED does all this by creating tables, maybe from an imported design or maybe just driven from the structure of the source data.

Generate Native Code

We apply rules to that data, and we apply rules to that data by generating code. That code is native SQL to the target platform. So all these boxes are tables, and those tables exist in a data warehouse platform. It might be Redshift, it might be Snowflake, it might be Azure, it might be SQL Server or Oracle, whatever the target database platform is that you've chosen for your data warehouse. RED has code templates that generate best practice code in the native SQL of that platform.

Metadata Lineage

We also have very rich metadata across the bottom. That metadata keeps track of data lineage, keeps track of impact analysis, keeps track of a business glossary and any other documentation we want to add into the data warehouse. 

Automated Documentation

On top of that, from that metadata, we can generate documentation. We generate documentation that's compiled into HTML so we can publish it out across to any users of the data warehouse. 

We also use that metadata for deployment, to deploy between dev, test, and production. The tool will walk developers or system admins through a deployment process that generates refactoring DDL for the target environment, like test or production. We also have an integrated scheduler, so if we need to automate the execution of a workflow, a series of tasks, we can define that within the scheduler. We also have this ability to retrofit or import existing designs or existing models from a modeling tool. Now, that retrofit functionality right at the bottom there is how we're going to be able to import tables that are built and managed using HVR. That's what we'll be showing you in the demonstration in a little while. Well, that's my last slide on my introduction to WhereScape. Hopefully that clears up any questions about what WhereScape is and what we do. I'm going to hand over to Paul to talk about HVR.
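
As a hypothetical illustration of what that refactoring DDL might look like when an additive change made in development is promoted to test or production (this is not actual WhereScape output; the table and column names are illustrative):

    -- Hypothetical refactoring DDL for promoting an additive change
    ALTER TABLE dbo.dim_product
        ADD category_name NVARCHAR(100) NULL;   -- additive, so existing rows simply get NULL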

HVR Data Replication

Thanks, Doug. Sure. So what is HVR? HVR is high speed data replication, or data integration, that can move data between two different systems, right? Whether they're database platforms or file systems, we support a number of databases, which we'll get into, as well as different types of files, and there are some common use cases that people typically have when dealing with a change data capture product like HVR. HVR essentially stands for High Volume Replication. Just like WhereScape, HVR has been around for about 20 years. It was developed in the Netherlands under PostNL and then sold privately and owned by Lufthansa, which handled a lot of the flight and airline reservation systems and flight planning systems, and it was then primarily used for other airlines throughout the world. It wasn't until about maybe five or six years ago that HVR, the company, was born to take it public.

The technology has been around for a while, and just like WhereScape, there are hundreds of customers for HVR, everywhere from small and medium businesses, SMBs, to large enterprise customers like Verizon and Amazon. I think when it comes to supporting change data capture and data replication, there are many use cases around the world, no matter what type of business you're operating in. When we take a look at some of the use cases, they really run the gamut, right? This technology, which started about 20 years ago, was typically for offloading reporting; any change data capture, log based replication solution was developed for reporting purposes or data warehousing, right? How can I get my data off of my source system onto a system where I can run reports without affecting production, at the same time keeping the overhead low on the source? Since then, it's really broadened into multiple different use cases.

Cloud Data Architecture

Not only are we getting into the data warehousing, the reporting, the real time analytics, but now obviously with cloud adoption, moving data from on prem into the cloud, whether it's AWS, Azure, or Google Cloud GCP, becomes a lot easier with a product like HVR. As we get into the architecture, not only can we do real time change data capture or data replication between two different systems, but we can also handle the bulk loading of data between them as well, essentially helping you establish a target from a live source, whether it's on prem to on prem, moving data from on prem into the cloud, or even cloud to cloud. The beauty of the architecture is it's very flexible and can be deployed anywhere. Some other use cases get into more of the disaster recovery or high availability, right? If you have geographical distribution where you might be load balancing between two or more different sites from an application perspective, data replication can be used to move the data between them in a bidirectional or active-active scenario.

Disaster Recovery

The same thing goes for disaster recovery, right? If you ever need to have a secondary system built in for fault tolerance to fail over to, having real time data synced up to it allows for a seamless failover and a quicker recovery time to bring that back up.

Data Migration

Lastly, there are migrations, whether it's dealing with upgrades on systems or migrating from on prem into the cloud or even between data centers. Moving that data typically takes a long time, right? Having change data capture incorporated with that bulk or initial load allows you to really eliminate the downtime when moving data between different platforms. We can talk about that when we get into the architecture. When we look at HVR's architecture, it is a distributed architecture rather than a centralized architecture. Now, what does that mean? We like to say it's a hub and spoke type of approach. If you look in the industry, you'll see a lot of industry terms like agent versus agentless when it comes to replication.

Data Replication

We offer both, and you might want one over the other depending on really what use case you're supporting or how much access you have to a particular system. I think it all starts in the middle with what we refer to as the HVR Hub. The Hub is really just a centralized unit, allowing you to install and configure the software and providing an easy way to set up and configure replication in general. That's where you would set up essentially your endpoints, or your locations that you're connecting to, as well as set up what we refer to as channels, or your replication streams, defining where I want to pull my data from, or capture my data from, and where I want to push my data to. You can set up multiple channels based off of all the locations you set up, all on one centralized Hub. The Hub can sit on prem or it can sit in the cloud.

It sits on a lightweight machine, whether it's a VM on prem or an EC2 instance or VM in AWS or Azure; it doesn't take a lot of resources to install and configure that Hub. Once that's installed, then you essentially create your location endpoints, or your connections to where you want to pull data from and push data to. Here's where we have that agent versus agentless approach. We can come in remotely, making remote calls to APIs to grab the data out of the transaction logs via the agentless approach. Being in this industry for about 20 years, there's always that constant debate about agent versus agentless. Quite honestly, it's always recommended to actually have a local capture process running on the source machine rather than coming in remotely through an API call. And that's why we offer both. Not only is it flexible in terms of that architecture, but it's more scalable as well.

Data Capture

And how does that work? From a capture perspective, we're capturing data out of the transaction logs, whether it's Oracle redo and archive logs, or SQL Server, DB2, Postgres, or MySQL transaction logs. Having a native capture process is a much more efficient way of grabbing the data than coming from outside through a remote API call. The reason is that typically the transaction logs sit outside the database, so we are not incurring any overhead on the database with our log reading technology. More importantly, it's more of just a listener process, not necessarily a full blown agent. With that listener process, or capture process, it's not taking up many resources from the OS perspective. Not only does it have a low overhead on the database, it has a low overhead on the system. It's more flexible and more scalable as you deploy that capture process.

Data Migration

I think in addition to that, it provides an efficient and secure way to move data between your source and the HVR Hub. That is because we are compressing and encrypting that data between those two different hosts. It's the same as we move towards the delivery side. We also offer the same concept, right, of agent versus agentless. With that agent approach, we can deploy what we refer to as an integrate process, or delivery process, that would sit on that target machine, so that you can compress and encrypt the data between the Hub and wherever that target machine might live and, more importantly, take advantage of native loading technologies to then load that data into the target database, whether we're talking about standard Oracle and SQL Server, or whether we're pushing data up into the cloud when dealing with a Snowflake or Redshift or any of the data warehouse platforms that we support.

This also comes into play when dealing with non database targets like Kafka, or pushing data into S3. We can take the agentless approach by pushing right onto a topic within Kafka, or we can have an integrate process running on an EC2 node writing locally to an S3 bucket if you're writing data out into flat files, or, of course, into Azure Blob storage if you're writing into an Azure data lake store. Not only do we support all the databases, which we'll get into, but then of course all the different file formats, whether you're moving data into flat files like XML files or CSV files, or pushing into maybe Parquet for loading into S3, or leveraging Avro or JSON when pushing onto Kafka. Really, there is no limitation to how you can move this data between two different systems, between the source and the target.

DDL

Once we have that architecture set up, then we take advantage of the three levels of functionality within the product, which are highlighted in blue. First and foremost is that initial load that I mentioned before, not only being able to move all your data from the source to the target, but creating all your DDL and all your table structures as well. This is really powerful, especially when you're crossing platforms, right? If I'm moving data from Oracle or SQL Server into a Snowflake environment, I might be starting with a blank canvas on the target. HVR will do all the translation, converting all those data types from whatever the source might be, to help create all the table structures, all your DDL, on that target system. In conjunction with that, we will load all the data as well. Not only are you creating the tables, but now we can do that bulk initial load between your source and your target.
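
To give a feel for the kind of cross-platform translation being described, here is an assumed example of an Oracle source table and a corresponding Snowflake target. This is not HVR's actual output, just an illustrative mapping for a hypothetical table:

    -- Oracle source (illustrative)
    CREATE TABLE orders (
        order_id   NUMBER(10),
        order_date DATE,
        comments   VARCHAR2(4000)
    );

    -- Corresponding Snowflake target (illustrative mapping)
    CREATE TABLE orders (
        order_id   NUMBER(10,0),
        order_date TIMESTAMP_NTZ,   -- Oracle DATE carries a time component
        comments   VARCHAR(4000)
    );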

Data Integration

Couple that with the next process, which is the capture and integrate, the CDC, the real time replication, if you will. Now you can establish your target from a live source. Essentially, since we talked about having a distributed architecture and our capture and delivery processes being decoupled, we can start our capture process from a specific point in time or sequence number within the logs to start capturing data in real time and storing it on the Hub. At the same time, the integrate process, or delivery process, would be down, waiting to receive those changes before pushing them into the target. The integrate will remain down until we run that refresh, or initial load, so that we can grab all the data from the source and push it into the target. Whether that takes an hour, several hours, or even over 24 hours, we're safe knowing that the capture process is still running and you're not incurring any downtime while this process is taking place.

Compare Data

Again, that allows you to establish your target from a live source. Once that refresh process stops, then you can release the queue, if you will. The integrate process will reconcile what's already been captured with what has already been loaded through that refresh process, so you're essentially not loading data twice, and it can bring the target in sync with your source. With that refresh and the capture and integrate, now you can get that target in sync with the source in a real time manner. The last part of that is our Compare functionality, and the Compare is going to be your data validation. The great thing about this is, again, it not only works like-to-like, but it works across platforms. You can run that Compare between an Oracle and a Snowflake or Redshift, or a SQL Server and a Postgres, whatever the combination of source and target might be, to give you that peace of mind, knowing that all your data is in sync.

A lot of our customers take advantage of this in multiple different ways, not only after the initial process to establish that target, but maybe running it once a week, once a month, once a quarter. If you're leveraging that target for data warehousing or reporting, you might want to run that Compare just to give you that peace of mind, knowing all the data is in sync before you run your month-end reports, and knowing that it's not going to take any additional overhead on the systems. The great thing about Compare is it's doing a hash level compare at the table level first to determine whether that data is in sync. It can also handle what we call online Compare, which means even if there are transactions in flight and the systems are live, we will still be able to run a compare process to ensure that data is going to be in sync at any given point in time.
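
To illustrate the general idea behind a hash level compare (a conceptual sketch in SQL Server syntax with illustrative names, not how HVR's Compare is actually implemented), you could compute an aggregate checksum per table on each side and compare the two results:

    -- Run the equivalent query on source and target and compare the result rows
    SELECT COUNT(*) AS row_count,
           SUM(CAST(CHECKSUM(product_id, product_name, unit_price) AS BIGINT)) AS table_hash
    FROM   dbo.products;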

Database Platforms

Just to talk about some of the broad platform support, I know I mentioned a lot of these already. You can see the sources listed on the left, where there are the standard database platforms like Postgres or SQL Server, Oracle or DB2, but we're also getting into some other sources, things like pulling data from Salesforce, for instance. When I first started 20 years ago with SharePlex, we were talking about Siebel Systems and PeopleSoft, and now it seems it's all about Salesforce and Workforce Now and Workday, and all that data is now coming from different sources than what we traditionally looked at for a standard application and database. By being able to pull data from Salesforce and push it into a target system, whether it's going to be a data warehouse or another database for reporting, we've really broadened the support for what we can capture data from and deliver data to.

On the right side, you'll see a much more extensive list, obviously, because we deliver data to more platforms than we can capture from. For instance, like a Teradata. Some of these platforms do not have log structures, like the data warehouses, such as a Greenplum or Teradata, things like that. We can also push data into MongoDB, Cassandra, and even the flat file formats that I mentioned before, whether it's going to be a Parquet format on S3 or leveraging JSON and Avro for Kafka. Another thing to note, if you are an SAP shop, we do support SAP in a couple of different applications. If you're leveraging ECC, we do handle the pool and cluster tables. We have an SAP transform process that will unpack those so we can support the entire application from SAP. And if you're leveraging HANA, we are the only log based capture or extraction tool for HANA on the market today.

HVR really concentrates hard on making sure that we have the best platform support, no matter where we're capturing from or delivering data to. Then there are the data warehouse platforms that we've already talked about, right? Snowflake, we see a lot more of them. Where it used to be the Teradatas and the Greenplums of the world, now people are pushing their data up into the cloud, in which case we're seeing Snowflake, Azure SQL Data Warehouse, and Redshift. Snowflake now is on both AWS as well as Azure, and I don't think they're too far behind from getting onto GCP. We have a really great partnership with them, just like we do with WhereScape. Leveraging HVR to move that data in and provide real time data integration or data replication into Snowflake is something that is pretty common within HVR. I think to sum it all up, the partnership between WhereScape and HVR has been great because now you can take advantage of a lot more use cases and obviously faster decisions, because now we're handling transactionally consistent data, real time replicated data, change data capture into your target systems.

Data Processing

You're going to eliminate essentially the batch windows associated with that, right? If you've ever leveraged an Informatica or an Ab Initio or any ETL tool, they're great for transforming that data, but your batch windows start to shrink on some of these source machines, given the fact that most of these applications are now 24x7x365. You really don't have that window to leverage an ETL tool to grab that data. Having a low overhead, lightweight change data capture solution has really eliminated that problem. The last thing is really just the combination of WhereScape and HVR for optimized processing. Now you can leverage real time data into loading or staging tables and then let WhereScape do its thing to convert it into a star schema. Right? If you're pulling data from a transactional system and pushing it into more of a data warehouse, you're probably going to take advantage of star schemas.

That's essentially where the partnership has really been great between HVR and WhereScape. Now I'll turn it back over to Doug to complete the presentation and get into a demo. 

Awesome. Thanks, Paul. Yes, the combination of the two together: HVR can bring that data in, synchronize the data in from source systems, source applications, files, whatnot, bring it in and land it into the data warehouse, into a staging area or a persisted staging area, like what we would call a data store area. That eliminates WhereScape having to manage that load process and do micro batches, where we've only really got access to the data, whereas HVR gives us access to the infrastructure, the data infrastructure of the source, to pull the data in and synchronize that data in. The two tools work really nicely together. That brings the raw data in, and we don't have to worry about latency; it's all being managed, monitored, and compared by HVR. We can apply the transforms in the target platform using WhereScape. What I'm going to do now is I'm going to switch over and show you what that might look like in a WhereScape environment.

WhereScape RED Demo

Here I've just opened up WhereScape RED, and for a bit of drama, I've renamed one of our object types to be HVR objects. Now, in my environment, there's a bunch of tables that already exist. It's as if an HVR person came in and set up replication between a source environment and my data warehouse. Now I've got a bunch of tables sitting in my target environment in a staging area, and I want to bring them into WhereScape RED. In one of my slides earlier I mentioned we would use the retrofit option, which allows me to go out and pull a bunch of tables that already exist into RED's metadata. Within my data warehouse, I've got this HVR schema and I'm going to push this into an HVR target. Now, in actual fact, all I'm doing is saying that we've already defined an HVR target that actually matches that schema.

Tables

What this process is going to do is read the structure of all the tables that HVR has created for us, and so it reads in the structure of all these tables. Now, WhereScape RED will try and assign an object type, a target type. Often we could be loading in stage tables, dimension tables, fact tables. In this case, they're all the same target type. I can just select all of these tables and say, all right, convert these to the target object types it's figured out based on the naming pattern in this process. Now we have all these tables and they're imported, so we've got the data in there. The data is being managed by HVR. We don't have to worry about a load process. Now we can use these tables like we would normally within WhereScape. We can come here and say, well, let's build some processing where I'm going to take my data warehouse or my HVR tables.

Let's come here, to my HVR tables, and let's create a staging table. In this staging table, we want to do some work to it. I'm going to define it based on the products table. I'm also going to add in some attributes from, say, this related categories table. We're going to go through a process of denormalization, which is a very common pattern within data warehouse processing. I'm going to denormalize this data so that we've got the relevant information pulled together to make it easier to analyze. WhereScape is really good at allowing me to make changes to this, apply transforms, rename things. Here, I'm just going to add in a transformation. I could come here and say, right, I want to give this a function. I'm going to use a function, and it really depends on what my target platform is as to what function library I get.

SQL Server

In this case, I'm just using SQL Server. I could come here and I could say, well, let's use the upper function and just make sure the supplier name is in capitals. I can access any of the features and functionality of the target platform. Once I've done that, then I can create the table and I can create the code to populate that table and apply those transforms for me. Now because we put together that denormalization, I can come through here and just specify, well, how do we join that data back together? I'm going to use a left join between the products and the suppliers and then I'm going to do a left join again between the categories and that's built me some SQL that we can go off and we can run and we can look at the results. Okay, so that's now denormalized that data. 
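
The generated staging query would look along these lines; this is a hedged sketch in SQL Server syntax with illustrative table and column names, not the exact code RED produces:

    SELECT p.product_id,
           p.product_name,
           UPPER(s.supplier_name) AS supplier_name,   -- the transform applied in the demo
           c.category_name,
           p.unit_price
    FROM   hvr.products      AS p
    LEFT JOIN hvr.suppliers  AS s ON s.supplier_id = p.supplier_id
    LEFT JOIN hvr.categories AS c ON c.category_id = p.category_id;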

We can see here that we have our supplier name, all capitals based on that transform, we have our category name, and we have all the other attributes of products. Maybe I've done some preparation in this case to now publish to a dimension table. Now I can come here and say, well, let's take the data from our staging area and create a dimension table. Here we're going to walk through some patterns again. Is this a type one dimension, a type two dimension, a type three, or something more cunning? In this case I'm going to go for a type two dimension. You can see in the background the automation process has added a surrogate key, because it's a dimension table and we know that's a common design pattern. I can customize the naming pattern of that. It's also added a start date and end date, a current flag, a version number, and create and update dates, some housekeeping columns that we know are common for type two dimensions.

Now I'm just going to build the code to populate this table. I'm going to identify a business key for this data, product ID, because we hadn't previously identified what uniquely identifies a product. Now I'm going to identify what my change columns are, my type two columns; all the other stuff I'm just going to overwrite. What this is going to do is create some code for me and create a table for me, and it's built for me. It's very rapid. Now if I wanted to make a change, let's just drop a couple of columns, this reorder level and discontinued. I can come in here and say, right, let's get rid of this reorder level, don't need that in my dimension. Units on order, let's delete that. Units in stock, let's delete that as well.
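
To picture what that type two dimension looks like as a table, here is an assumed sketch of the structure in SQL Server syntax. The column names follow common conventions for this pattern and are illustrative rather than WhereScape's exact defaults:

    CREATE TABLE dbo.dim_product (
        dim_product_key  INT IDENTITY(1,1) NOT NULL PRIMARY KEY,  -- surrogate key
        product_id       INT           NOT NULL,                  -- business key
        product_name     NVARCHAR(100) NULL,
        supplier_name    NVARCHAR(100) NULL,
        category_name    NVARCHAR(100) NULL,
        unit_price       DECIMAL(18,2) NULL,
        dss_start_date   DATETIME2     NOT NULL,                  -- when this version became current
        dss_end_date     DATETIME2     NOT NULL,                  -- when this version was superseded
        dss_current_flag CHAR(1)       NOT NULL,                  -- 'Y' for the current version
        dss_version      INT           NOT NULL,
        dss_create_time  DATETIME2     NOT NULL,
        dss_update_time  DATETIME2     NOT NULL
    );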

HVR Data Replication 

Now I can just say, well, let's have a look at what that would look like with that change applied. Here we can see that we're going to recreate the table, regenerate the code, run it, and look at the results. We get a unit test very rapidly, and that looks much cleaner. On the end of my table I can see my start date and end date being managed for me if there were any changes coming through from the replication process. So I can build tables, I can build code. Another aspect of this, though, is the rich metadata. I can do a trackback diagram, and it's going to show me the data lineage of the process I just built for products. Where did the data come from? We can see our denormalization process here, and I can double click in here and see the code that we generated under the covers based on our inputs into the dialogues.

Metadata Documentation

Now, all this metadata we can compile into documentation. Here I'm just going to create documentation, and this is going to compile that metadata into HTML so that we've got access to it even if we're not using the tool. If I want to expose this to a wider audience, they can just come to the place where I outputted the documentation and see it in their browser. So, for example, we can describe the end user layer of the data warehouse, in this case a star schema. We can add in descriptive information. You saw me add descriptive information to our products dimension, but I only added one description to that products dimension. This is really exposing the data dictionary, and I can import that data dictionary if I wanted to from some other source. We've got a glossary of the end user layer. We also have documentation that's intended for a different audience, a technical audience, one that needs to know how something was built.

In this case, how did we build the products dimension that we just built? What were the steps we went through? If we come back to the stage table, we can see any transforms that were applied to the data. We might even want to dive into the actual code and see the code that was generated, in this case for SQL Server, since it's SQL Server code templates that we've used. If this was Snowflake, if this was Redshift, if this was Azure, we would see the code that is specific to those different platforms. Okay, so that is a very brief demonstration of how we would import the tables managed by HVR into WhereScape, and then how we could use that to build out business processing, the denormalization, the derivation, perhaps the data cleanup that's required as we build out the business layer of a data warehouse.

Building a Data Warehouse

Thanks, Doug. I think, in summary, having the two, HVR and WhereScape, is both comprehensive and efficient as we build out a data warehouse. First and foremost, it's a streamlined approach for not only delivering the data, but then modeling it in a way that makes sense for the data warehouse, taking into account all the different features and functions that we went through: starting with your initial loads and creating all the target structures, leveraging that in conjunction with the change data capture to create your target from a live source, and then being able to leverage WhereScape to quickly design, develop, deploy, and operate that data warehouse within days rather than weeks or months. Then, of course, once this is all in place, leveraging the right monitoring and documentation to make sure everything is working properly. A couple of things that we did not discuss but are available in HVR are really around leveraging your performance statistics as you're moving this data into your target system.

We'll overlay your latency, or report on your latency, both from a capture and delivery perspective, and then overlay that on top of other performance metrics broken down by how many inserts, updates, and deletes are flowing through the system, and more importantly, taking it one step further, breaking it down per table. Now, if my latency does seem to climb on my target system to, say, about ten minutes, I can quickly identify who the culprit is. Essentially, if there's a large batch operation or an update to one particular table, I can see that in real time. Now, with the beauty of HVR's architecture, we can capture once, deliver many. I can separate that table out into a separate delivery process so that it won't affect other tables involved in replication. This is really important when leveraging a data warehouse like Snowflake or Redshift or any data warehouse that is not designed for OLTP like a SQL Server or Oracle.

Right? Having that distributed architecture that we went over allows us to support multiple topologies like capture once, deliver many: having one capture process reading from the source, but delivering that data in multiple threads on the target so that you can get that data in a timely manner.