Data Warehouse Automation

We have the ability to share information quickly and efficiently with the business users. How does that look? Well, we see that a lot of data warehouse systems have to have control, but not at the price of value. Value needs to be delivered quickly, and often early. We need to lift that line in the middle there, ladies and gentlemen. You can see it: lift the line up to create value earlier on, and then apply the controls, the governance and all the foundations that these systems need. We need to make sure we get value from them first. And why is that? It's because the businesses that you are all part of, or that you consult with, have these challenges, whether it's simplification, business change, self-service or agility, and they've all got these wonderful database platforms supporting them to make their data warehousing and reporting that much more effective and quicker.

No two WhereScape customers do the same thing. Some want to rapidly build data marts and get the involvement of the business ASAP, maybe in a gold or silver environment. Sometimes we want to build those EDW history layers: data vault, third normal form or Kimball. We don't really mind whether you model first and generate your data warehouse, or whether you actually build from the data. So we're incredibly flexible in that approach. Moving this forward, what do we do? For anybody who's not aware of data warehouse automation, and it is becoming incredibly commonplace these days: we basically use the power of the database to design and build through a GUI interface that's metadata driven; it documents, and it brings that faster time to value. Again, Rick will cover more of the why and Frederick will cover the demonstration. Really, this is a typical scenario that everybody is involved in.

WhereScape Data Warehouse Automation

We do the same thing, we just build a data warehouse. However, we do it in a far simpler way, using just two products: WhereScape RED, which is the builder tool, and WhereScape 3D, which is the design tool, a data-driven design tool at that. Same thing, just done using the power of the database. Here's the slide that says we utilize SQL Server, Teradata, Netezza, Oracle, whichever database is supported by WhereScape. That's what we enable; that's where the transformations happen, that's where the power is to do all this, and that's what we speed up through the user interface. My last slide before handing over to Rick: that allows us to work in a more nimble way with our customers, in closer collaboration with the business. We thoroughly enjoy doing it, we thoroughly enjoy the results we get, and we thoroughly enjoy working with data in the first place.

We work with the data and we build out the requirements. Therefore success is delivered early and often. The question here, from our perspective, is whether that is hype or reality. I would like to hand over to Rick to discuss his experiences and his background and pick up from me. So thank you. Over to you, Rick.

Automating ETL

Okay, thank you, Rob. Thank you for the introduction, and a good day to everyone. As Rob said, the key question I would like to answer in the coming minutes is whether this is hype or reality. Is it really possible to automate the process of designing and generating data warehouses, data marts and ETL programs? And more importantly, does it make sense? Why should we do this? To be able to answer these questions, I think we have to start at the beginning.

Big Data and Business Intelligence

So let's start here. Let's start with discussing what has changed over the last few years with respect to the development of business intelligence systems and data warehouses. Well, there was a time when BI systems delivered their results on that white-green computer paper. Some of you may still remember that. Nowadays users would not accept this anymore, right? They want the report results to be accessible on modern gadgets such as tablets and smartphones.

They also want to be able to combine internal data with data coming from external sources such as social media networks and open data. They no longer accept that the data they have to work with is a few days old. They don't even accept data that is a few hours old anymore. They want to work with data that is almost 100% up to date. Of course, everything has to run in the cloud and has to be big, right? On the one hand we have the business pull, and on the other hand we have the technology push, and both are changing BI. For example, on the business pull side, we see more and more companies interested in developing big data systems where massive amounts of new data are stored, processed and analyzed. And we all agree on this: the impact of big data on BI is big. Also, there was a time when all the data created by a company was stored in databases running in the computing center, hidden away somewhere in the basement.

Not anymore. Every organization is producing data in a distributed fashion. Data is produced in cloud-based applications such as Salesforce, and in internal applications that have been outsourced to the cloud. Manufacturing plants producing massive amounts of machine-generated data are situated all over the world, and so on. The question then becomes: how do we bring all the data to one central site for reporting and analytics?

Next, as I've already mentioned, the requests for operational business intelligence keep growing. Users want to work with zero-latency data, especially in BI systems developed for operational management and for the workforce itself; their users demand real-time data. Of course, we have self-service BI, and then we have predictive analytics and forecasting. Just being able to indicate what has happened in the business is not good enough anymore. BI systems must be able to show what is likely going to happen and how the future can be influenced.

Business Intelligence

Clearly, BI has changed and will keep on changing. But it's not only BI that has changed. My feeling is that everything about data has changed as well. For example, we are living in the big data era now. We're not only working with structured data, but also with unstructured data. We don't have to work with slightly aggregated data anymore, but can use the lowest level of detail, and so on. Maybe the most important change is that data has turned from a simple reporting source for administrative tasks into a critical asset for many lines of business. With the right data, organizations can optimize their business processes, improve their customer relationships and differentiate themselves from the competition. With that, the dependency of organizations on data has also intensified. Data has changed, and it has changed the organizations. In fact, without the right data at the right time, companies almost operate blind.

Well, to be honest, I could easily spend an entire session on this aspect alone. Now, this diagram shows a classic BI system where new data produced in production systems is copied to a staging area, then copied onwards to a data warehouse, and then made available for reporting and analytics by copying it to data marts. ETL programs are commonly used for copying the data. They're responsible for transforming, cleansing and integrating the data. Now, although this diagram is an oversimplification of a real system, I bet that this architecture looks very familiar to most of you.
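Purely as an illustration of the kind of work such an ETL program does (a sketch with hypothetical table names, not tied to any specific tool), one step might copy and lightly cleanse data along these lines:

    -- Sketch: one ETL step copying new orders from the staging area into the
    -- data warehouse, applying two simple cleansing rules on the way.
    INSERT INTO dw.orders (order_id, customer_id, order_date, amount)
    SELECT s.order_id,
           s.customer_id,
           CAST(s.order_date AS DATE),   -- standardize the data type
           COALESCE(s.amount, 0)         -- cleanse missing amounts
    FROM   staging.orders AS s
    WHERE  s.load_batch_id = 42;         -- only the rows from the latest load

A real system typically contains many such steps, which is why the manual effort adds up so quickly.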

Metadata

Now, to make a system such as this work, specifications have to be entered, the so-called metadata specifications. For example, data structure specifications are required to describe the table structures of all the databases and all the integrity rules, but also integration specifications are needed to describe how data from multiple data sources has to be joined.

Of course, we also need transformation and cleansing specifications to get the data to the right data quality level. All these metadata specifications end up being implemented in the ETL tools, in the database servers, and in the reporting tools. What's important to understand is that in most of the current BI systems, many of these specifications are developed, designed and maintained by hand. For example, many transformation specifications end up in ETL programs that are designed and developed by hand, or a particular transformation specification ends up both in ETL programs and in the semantic layers of the reporting tools. Implementing all these metadata specifications is a very manual process today.
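To make that concrete with a small, hypothetical example: a single business rule, say "revenue is quantity times unit price net of discount", often has to be hand-coded in more than one place, once in an ETL program and once again in a reporting tool's semantic layer.

    -- Hypothetical transformation specification, implemented by hand in ETL code:
    INSERT INTO dw.sales_fact (order_id, revenue)
    SELECT l.order_id,
           l.quantity * l.unit_price * (1 - l.discount)   -- the business rule
    FROM   staging.order_lines AS l;

    -- The same rule re-implemented by hand as a calculated measure in the
    -- semantic layer of a reporting tool (pseudo-definition):
    -- revenue = quantity * unit_price * (1 - discount)

If those two hand-written copies ever drift apart, two reports on the same data will disagree, which is exactly the consistency problem Rick returns to later.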

Traditional Data Warehouse Architecture

Now, the key question every organization has to ask itself is: can their existing data warehouse architecture, the architecture they designed so carefully many years ago and that has served them so well for many years, cope with all these new developments? My feeling, more and more, is that the answer is no.

One of the reasons is that designing, developing and maintaining all the metadata specifications is too time consuming. Let's take some of the changes in the BI sector I just discussed, big data and self-service BI, and let me try to indicate what the problems are when an organization has a traditional data warehouse system and wants to implement these changes. So let's start with big data.

Hadoop and NoSQL

Now, the interest in big data has led to the introduction of many new data storage technologies, such as Hadoop and NoSQL. You must have heard of them. Most of these new products do not support SQL, nor the familiar table and column concepts. If organizations want to embed these technologies in their BI systems, for example to replace their SQL database server by a particular NoSQL product or Hadoop, a lot of the metadata specifications have to be redeveloped.


Most of the transformation and data structure specifications have to be translated from SQL to another language. If this has to be done by hand, the migration will undoubtedly be a long and time-consuming process, and think about the costs. Adopting new data storage technologies is not that straightforward. In fact, even replacing one SQL database server by another SQL database server can be a lot of work, because they all support a slightly different SQL dialect and they all have to be tuned and optimized differently. Another aspect impacting BI that I want to discuss, and it's an important one, is self-service BI tools for reporting and analytics. We all know they've become immensely popular. With all the new self-service BI tools, such as QlikView, Tableau and Spotfire, users can now access any kind of data source. They can integrate data sources. They can even transform and analyze the data sources themselves.

They can do all this without any involvement of IT specialists. Compare this to years ago, when data was guarded by IT. IT would determine how and when users could access the data. Nowadays, users in the BI and data warehouse world don't want to wait for the BI competence center or the data warehouse department anymore. They prefer the do-it-yourself approach. They don't want to be limited to having access to a data mart or a data warehouse. They want to dig directly into any kind of data source, including operational databases in certain cases. Now, adopting self-service BI influences data warehouse systems in many different ways. I think the most important impact is the following: users interested in self-service BI pick these products because they want to improve development speed. They want to be able to change existing specs quickly. They expect IT to implement changes in the data warehouse system with the same speed as they can in their own easy-to-use tools.

But that's not so easy. Even a small change may require a lot of work. For example, if users need an extra column to be added to a sales report, IT may have to add the same column to the data mart, and they may have to change the ETL program. It may look simple, but it can be a mountain of work, because most of the specifications, again, have been developed by hand.

Cloud Computing

And then we have the cloud. More and more companies are adopting cloud computing. They do it because it offers elasticity and it unburdens them. But how do you migrate an existing data warehouse system to the cloud? To be honest, there's not one single answer to this question. There are so many BI-in-the-cloud vendors, and they have very different offerings. In many cases, what will have to happen is that other database servers will have to be used.

Moving to the Cloud

Moving to the cloud may involve a migration to another database server technology. Again, if all the specifications are developed by hand, this migration can be expensive and time consuming. In fact, it can be so expensive that it could be the showstopper for migrating to the cloud and getting the benefits of the cloud.


Data Warehouse

The last change I would like to mention, and it's definitely not the least one, is the impact of all the new users of the data warehouse. I think you can say that in the beginning, BI systems were developed to support decision making on the strategic and tactical levels. Such users could live with data that was one day old, or maybe even older. Also, these users didn't have to see the data at its lowest level of detail. They were happy with slightly aggregated data. Now, with all the new groups of users, this has changed.

Nowadays we allow operational management and the workforce to access the data, and also external users such as consumers, suppliers and agents. We have given these external users simple reporting capabilities on our own data warehouse. What all these new users have in common is that they don't want to see yesterday's data, nor do they want to work with aggregated data. No, they want to see up-to-date data at the lowest level of detail. Access to a traditional data mart is probably not going to work for them. Imagine you have to support just one of these new groups of users. The data stored in a particular data mart should no longer be stored in aggregated form, but at the lowest level of detail available. This change would require that all the data structures of that data mart be changed, and all the ETL programs be changed, and so on.

Again, this is going to be a lot of work to support these new user groups. I think these are just a few of the changes that BI departments are confronted with today. I think we can safely say that the process to develop and maintain our BI systems is too manual. It's slowing us down. It makes it hard to turn an existing BI system into a system that is ready for all these new requirements. The manual implementation of all these specifications, and also the distribution of all these specifications across the entire system, are key reasons why development and maintenance are slow. In an ideal situation, all these metadata specifications are somehow stored and managed centrally. They are stored only once and are reused many times. That would make it easier to operate, manage and maintain these specifications. It would make it easier for an organization to make changes.

Data Warehouse Automation

The business pull and the technology push have raised the bar for developing BI systems. How are we going to reach that bar? What all this means is simple: IT has to be able to develop faster. It must be possible to implement changes much faster. The adoption of new technologies must be made easier, and the same applies to the architectures. Valuable metadata specifications should not be replicated everywhere, but have to be managed centrally somehow. One solution to make all this possible is data warehouse automation. As we all know, designing, developing and managing data warehouse systems is a very manual exercise. Data models for data warehouses and data marts have to be designed and implemented. Physical database designs have to be tuned and optimized. ETL programs have to be designed, implemented and tuned. And the list goes on.

Automate Repetitive Tasks

Now, many of those activities are very repetitive. Especially ETL developers can confirm this.


Many of them have the feeling that they are programming the same thing over and over again; it's just that the table names and the column names change. Take, for example, handling slowly changing dimension structures. We all know how such structures have to be modeled in data warehouse data models. We all know what the ETL logic should look like. Still, developers have to go through the same design and development process over and over again.
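To illustrate that repetition, here is a rough sketch of the hand-written logic behind a type 2 slowly changing dimension, the kind of pattern a generator can produce from a model instead (hypothetical table and column names, SQL Server flavor):

    -- Step 1: close the current row for customers whose attributes changed.
    UPDATE d
    SET    d.current_flag = 'N',
           d.end_date     = GETDATE()
    FROM   dim_customer AS d
    JOIN   stage_customer AS s
           ON s.customer_code = d.customer_code
    WHERE  d.current_flag = 'Y'
      AND  (d.customer_name <> s.customer_name OR d.region <> s.region);

    -- Step 2: insert a new current row for changed and brand-new customers.
    INSERT INTO dim_customer (customer_code, customer_name, region,
                              start_date, end_date, current_flag)
    SELECT s.customer_code, s.customer_name, s.region, GETDATE(), NULL, 'Y'
    FROM   stage_customer AS s
    LEFT JOIN dim_customer AS d
           ON d.customer_code = s.customer_code AND d.current_flag = 'Y'
    WHERE  d.customer_code IS NULL;

From one dimension to the next, only the names and the compared columns change, which is exactly why this lends itself to generation.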

Now, if there's so much repetition in an activity, it's time for automation. It's something we have done since the birth of computing, and we have tools such as WhereScape to help us with that data warehouse automation process. This is nothing new in IT. For example, some of you may still remember COBOL; compilers can be seen as generators, right? Assembler code is generated from a higher-level programming language. With COBOL and many other languages, we have automated repetitive tasks, or parts of programming. This is what computing is good at, right? Automating repetitive processes. What this means for data warehouses is simple. We need generators that take a data model describing all the aspects of the data and then automatically transform that data model into the right data warehouse and/or the right data mart. They should also generate the physical database structures and the proper ETL code. And they should be able to do this for a long list of data storage technologies, so that migration becomes easy and the specifications are generated. Two big advantages become clear right away.

Benefits of Automation

First, a productivity increase. Developing new data marts will require so much less time, just as an example, but also changing them when information and reporting needs change; it's all going to be so much easier. Second, all the modules making up the data warehouse architecture are generated from one and the same set of specifications.

To be honest, I'm a big fan of the whole data warehouse automation approach. Let's spend a few minutes on some of the benefits of data warehouse automation. First, most of the metadata specifications needed to make the data warehouse system run are generated from one set of non-redundant specifications. The consequence is that the generated specifications are all consistent with each other. For example, two data marts containing a dimension table called, let's say, product groups will contain 100% identical specifications of this table. That's guaranteed. Second, an important benefit that stems from the first one: if the specifications are implemented in the same way, the reports return consistent results. If two users, for example, analyze sales data, it's very likely that when everything is generated, they are looking at the same data. Next, if a specification has to be changed, it only has to be changed in one spot, and then the generator has to be switched on.

No intense manual process, but automation. And this clearly improves productivity and maintainability. Number four: if IT can implement changes more quickly because of the automated process, the time to market for a new report will improve as well. The business is going to love that. Benefit five: the more specifications are generated from one source, the easier governance and auditability will be, right?


That's what you get when everything is generated from one source. Simply put, there will be less chance for errors such as inconsistent implementations. Benefit number six: because tools such as WhereScape can generate for different data storage technologies, migration will be much easier to do. In addition, less in-depth know-how is required to adopt one of the newer data storage technologies. The last one I want to mention here is that it will be much easier to switch the data model in the data warehouse or data mart. For example, if there is a need to change from a normalized data structure to a star schema, it only requires a regeneration of the data structures. That's sort of it. Imagine how much work this would be if it had to be done by hand: all those ETL programs you would have to reprogram and all the data structures you would have to redesign. That's a lot of work.

So, data warehouse automation is not just an interesting academic and theoretical exercise; it is very practical. It offers organizations a long list of benefits.

Now, if I'm allowed to make a sidestep here and make a few personal remarks. As you probably know, in most organizations databases are designed in two steps, right? In the first step, information needs are analyzed and some conceptual or logical data model is developed. Such a model is not at all technical, right? It's a formal description of the information needs. In the second step, this model is transformed into a physical database implementation, let's say a set of create table statements. This step doesn't require business knowledge, but rather knowledge of the database server in use. Now, when I started my career in 1980, I worked for a research lab of an American computer company called Control Data, at that time one of the biggest vendors in the IT industry.

The first project I worked on was called Glue. Glue was a tool that would automatically transform a conceptual data model into a relational database design. In other words, it automated the second step shown on the previous slide. If the model was changed, all the required changes on the database structures could be generated. So there I am, right, still young, working in this lab environment. My feeling was: if I can develop such a tool, probably all companies on this planet are using some kind of automation tool as well. When I moved to another job where I was involved in more traditional IT projects, I found out that this was not the case at all. No one was using generators. Everyone was doing all this stuff by hand. It took a very long time, mistakes were made in the process, and making changes was a nightmare.

It has always surprised me that it took the market so long to adopt tools for generating databases. I'm extremely happy, to be honest, that we now have mature and reliable data warehouse automation products such as WhereScape. It was about time. So, to summarize what I've been saying: the bar for BI systems has been raised. Today we can't keep on developing BI systems the way we did 20 years ago. That would be naive and silly. It's time for generating the BI metadata specifications. It's really time for data warehouse automation. Many of the things that have to be designed and developed are so repetitive, making this the perfect fit for automation. To answer my own question, is data warehouse automation hype or reality? I think it's reality. The technology is there, the techniques are there, and the customers are there to prove that it works. I strongly recommend organizations to study how they can use tools such as WhereScape to turn their older data warehouse systems into systems that are easier to develop and easier to maintain, and that are ready for the new business pulls and technology pushes.

My recommendation would be: it's time to move on. It's time that we make more time available for developing the right reports for the right users, and spend less time on designing and developing all these modules that are needed to make the data warehouse system work. Okay, so let me hand the controls to Frederick so that we can start with the demo of WhereScape and you can see how such a tool works. I'll be back for the Q and A.

WhereScape

Thank you, Rick. Hi, this is Frederick, I'm a Senior Solution Architect at WhereScape, and I would like to take you on a journey and show you our product suite. I would like to start with actually building some flows together based on metadata specifications. After that I would like to show you a more model-driven approach, as Rick described at the end of his presentation. On the screen you see a typical WhereScape architecture.

ELT

Our software is an ELT tool. That means that we sit on top of the database, your data warehouse target database instance. Whatever you see in red, consider that to be your target database instance. It could be your Oracle instance, it could be your Teradata instance. The blue part here, that's the scary outside world. As we're doing ELT, what we're doing is extracting from sources. Those could be databases, file systems, Unix drives, SSIS packages, Hadoop clusters.

We're extracting the data without modifying it and loading it into non-persistent load tables over here. That's the first step, using the load utilities of your database. Then we propagate it through different layers. Which layers? Well, that depends on your architecture. We have to be very flexible; no two clients are the same. We have people using historical ODS structures, 3NF layouts, data vault, star schemas, modeling using views, or any combination of the ones I just mentioned. So we support all of those. It all depends on how you want to model.
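As a rough sketch of that first extract-and-load step (illustrative names only; the actual loader depends on the source and on the target database), think of a non-persistent load table filled by the database's own load utility:

    -- A non-persistent load table: a plain copy of the source structure.
    CREATE TABLE load_customer (
        customer_code  VARCHAR(20),
        customer_name  VARCHAR(100),
        company_name   VARCHAR(100)
    );

    -- On SQL Server, a file extract could be pulled in with the bulk loader;
    -- no transformation happens at this point.
    BULK INSERT load_customer
    FROM 'C:\extracts\customer.csv'
    WITH (FIELDTERMINATOR = ',', ROWTERMINATOR = '\n', FIRSTROW = 2);

The transformations then happen inside the database as the data is propagated to the next layers, which is the ELT idea.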

Metadata Automation

As Rick mentioned, everything we develop is driven from metadata specifications. Using our software, you're actually modifying metadata. Based on that metadata, we will generate the specific artifact for you. We will make sure that all the table structures are created correctly and that the loading logic is created correctly. If you need to apply some physical tuning options, that's included as well.

If you then want to tie all that development together in a nice job, we also allow you to generate job definitions. Let me explain and give you some examples of how it works. I'll give two examples. First, I'll extract a data set from a source system and load it into a SQL Server data warehouse table, so I'll create a load table. The other example I will show you is creating a dimension table. What I want to show you is that we try to be consistent: everybody in your team should work in the same fashion. So naming conventions are applied automatically, and everything is based on templates. Whether somebody new to the team is developing a particular ETL flow or somebody with ten years of experience, the end result should be the same. From the same metadata we should generate the same functional logic.

So let's switch to our builder tool. The interface consists of four different windows. The left side corresponds with the different parts over here: one vertical is one kind of object type and corresponds in our tree with one node. If I click on my load tables and then on a particular load table definition, I will see the detailed metadata in the middle part. I can double click on something and see lots of really specific metadata over here. Whenever I modify the metadata in a way that requires a change on the underlying database, I will get the feedback in the bottom pane over here. If I want to create a load table or a dimensional object, how does that work? Well, I will take a data set from a previous layer and drag and drop it into the loading area. If I want to create a dimensional object, the data warehouse itself will become my source, and I will take a load or stage table and drop it into the dimension area.

What I expect is that naming standards are enforced, templates are used, and only the relevant questions are asked of me. Let's just go ahead and build a flow. I double click on my loading area, and at the right side I'm connected to a SQL Server source system. In there I have a customer data set. I can double click on that customer data set, see the columns, right click and display the data. Hey, this is of interest to me. I want to load this data. What I do is drag and drop, and a wizard starts; it automatically suggests a name for me, so I just continue. And what happened? A load customer object has been created in metadata. It has retrieved all the detailed specifications about that object from the source system, and it actually created that table in my loading area. Here I see the feedback from the database.

Here I see that the table has been created and a load strategy has been applied. That's how long it takes to load data from the external world into your target system. That data-driven design means we want to get access to relevant data as soon as possible, load it into our loading area and start experimenting with it. I could now repeat the same thing to load XML files or fixed-width files from Hadoop, but it would all be the same process: you connect, you take a data set and drop it in the area. Let's move on to something which is more interesting. If I now would like to take that load customer data set and convert it into a customer dimension, as I already said, the data warehouse will become my source. So let's refresh that one. This list is rather long, so let's just filter on load tables only.

Creating a Dimensional Object

Yeah, here I have my load customer data set. Okay, I want to turn that into a dimension. Actually, creating a dimension is not complicated once you've agreed in your specific architecture on the way to create dimensions: a dimension needs a surrogate key, you need a business key, you need technical metadata columns that track changes. But that's architecture. Once your architecture principles have been defined, creating the dimensions should be really easy stuff. Let's just see how we do it. Drag and drop, and again it's proposing a name for me. What I now see is that a different template is being used, because creating a load table and creating a dimension are based on different principles. The first design decision is: do I want to create a type one or type two dimension? Let's just create a type one, and the wizard continues. The second decision I am taking is what makes a customer record in my dimension unique, in other words, what is the business key?

Once I've selected this, I continue with the wizard, and lots of stuff happens behind the scenes. What I now have is the customer dimension, which has been created. I can have a look at it. What I see is, as Rick mentioned, the surrogate key. I don't have to think about it; it has been added automatically. The name is based on the naming convention I specified. What I also see is that it's based on a sequence, I have a default record inserted as well, and at the end I have a load timestamp column added as well. I can't forget it, my colleague can't forget it; it's done automatically, based on conventions. Based on the metadata, database DDLs have been generated and ETL code has been generated, in this case the ETL code to load the customer dimension. It's set-based SQL, which is always packaged as procedures.
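To give a feel for what such a generated dimension might look like, here is a rough, hypothetical sketch in SQL Server terms; the real output depends on your own conventions and templates, and all names below are illustrative:

    -- Sketch of a generated type 1 customer dimension.
    CREATE SEQUENCE seq_dim_customer START WITH 1;

    CREATE TABLE dim_customer (
        dim_customer_key  INT          NOT NULL
            DEFAULT (NEXT VALUE FOR seq_dim_customer),   -- surrogate key from a sequence
        customer_code     VARCHAR(20)  NOT NULL,         -- business key
        customer_name     VARCHAR(100) NULL,
        company_name      VARCHAR(100) NULL,
        dss_load_date     DATETIME     NOT NULL,         -- load timestamp, added by convention
        CONSTRAINT pk_dim_customer PRIMARY KEY (dim_customer_key)
    );

    -- Default record, so fact rows without a match never point to a NULL key.
    INSERT INTO dim_customer (dim_customer_key, customer_code, customer_name,
                              company_name, dss_load_date)
    VALUES (0, 'N/A', 'Unknown', 'Unknown', GETDATE());

    -- Index on the business key, generated by default.
    CREATE INDEX ix_dim_customer_bk ON dim_customer (customer_code);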

If I look at the indexes, what I also see is that by default we have generated indexes on the surrogate key and on the business key. We also give you an environment where you can apply physical tuning options. Of course, not every dimension is the same, and the query patterns are not the same. You might want to add additional indexes or remove the ones we generate automatically. We give you, in our interface, every possibility to override the standard behavior. That's why the wizard you saw earlier has so many options. I generated a dim customer using all the defaults. Let's now mimic an iterative scenario, because typically what we do is just dump the data, quickly build something and get early feedback. So, okay, I've built my dim customer. I have here my company name. I would like to duplicate company name and apply a business rule.

Company name: right click, duplicate that column. I'll give it a very bad name, company name two. Okay, and I'll go in and program a business rule. Just for the sake of this demo, I'll take a very straightforward one: I will convert it to uppercase. Here, what you see is that we give you access to all the set-based, very powerful functions of your database. You can also register additional user-defined functions in here to make you super efficient and to reuse business logic. So I have injected the business rule. Okay, so now I've modified my metadata, and now we will walk through the whole process. How it works: if you modify your metadata specifications, then we need to regenerate all the dependent code. In this case, it's a three-step process. First we will recreate our table, because we need to make sure that the additional column has been added.

Next we need to make sure that our ETL code is rebuilt, so we'll just run the wizard again. Does my business key change? No, of course not; it remembers the choices I've made earlier. Once that's done, I can execute. Let's do that again. Sorry, rebuild? Yeah, that's okay. What I can do is right click and execute. So that's typically the process: you modify metadata, you realign the structures, you recreate your loading logic, and then you execute the loading logic. Now if I right click and look at the data, I will see that my business rule has been applied. I was talking about set-based operations that work in the back end, packaged as procedures. Which set-based operations are being done? Well, I can just have a look at the code. This is actually the code that WhereScape generated behind the scenes.


This is the generated code, the "assembler code" produced from a higher-level specification, to use Rick's COBOL analogy from the end of his presentation. As you see, this code is perfectly readable. It is nicely documented, actually the best documented code I've ever created in my life. What you see is that it's lengthy and based on best practices. It's giving you the best possible merge strategy, in this case on SQL Server; on Teradata, we would use the best possible way to upsert records there. What you also see is that we are doing lots of operational logging in there. Whatever happens in the procedures that are created by WhereScape, we create lots of detailed operational logging information.
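As a purely illustrative sketch (hypothetical names, not the actual generated procedure), the set-based logic for this dimension might boil down to a merge plus some logging, with the uppercase business rule applied inline:

    -- Sketch: set-based type 1 load of dim_customer from the load table.
    DECLARE @rows INT;

    MERGE dim_customer AS tgt
    USING (
        SELECT customer_code,
               customer_name,
               company_name,
               UPPER(company_name) AS company_name_2   -- the injected business rule
        FROM   load_customer
    ) AS src
        ON tgt.customer_code = src.customer_code
    WHEN MATCHED THEN
        UPDATE SET tgt.customer_name  = src.customer_name,
                   tgt.company_name   = src.company_name,
                   tgt.company_name_2 = src.company_name_2,
                   tgt.dss_load_date  = GETDATE()
    WHEN NOT MATCHED THEN
        INSERT (customer_code, customer_name, company_name, company_name_2, dss_load_date)
        VALUES (src.customer_code, src.customer_name, src.company_name,
                src.company_name_2, GETDATE());

    -- Operational logging: record how many rows were touched.
    SET @rows = @@ROWCOUNT;
    PRINT CONCAT('dim_customer rows merged: ', @rows);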

So I've now done two iterations: I created the first version and then iterated over it. What I can do is have a look here: whenever I recreate something, we keep a version of it. What I could do now is compare the version which is now active with the previous one. If I now scroll down, it should highlight the changes in the code. Hmm, that's the wrong one. If I now compare them in the viewer, it will highlight any changes I've made in the code automatically. For me, that's very powerful. Okay, I've now shown you a dim customer table and I have applied a business rule, though I find that particular business rule rather simplistic.

Historical ODS Objects

I'm having second thoughts; I just want to roll back. What I can now do is simply delete that object completely, because that's agile. Sometimes you develop something and you say, well, that wasn't really the thing I wanted to develop. What I can do is just go back in and recreate an earlier version, because behind the scenes WhereScape has kept every change I've made in its version repository; it kept track of all the different versions.

In an agile process, you will experiment. Sometimes you'll prototype something which doesn't make sense, but we give you a safety net: you can always roll back to a previous state. No problem there. I've just shown you how to load data into a load table and how to create a dimensional object. Over here we have similar templates for 3NF objects, historical ODS objects and fact tables. It's always based on the same principles. I hope you get an idea of how easy it is for you to take data and propagate it through all the different layers. Imagine you become very powerful, very productive, and create lots of objects. You might build a complete star schema, like this particular fact table. That fact table, where does the data come from? I've built tens of objects and I've kind of lost track of where the data comes from.

Data Lineage

The last thing I want to do is dig through the code to find out where the data comes from. The code has been derived from the metadata specifications, so I can just have a look at the metadata. We've actually included some very nice visualizations in the product. What I can do in this case is generate a track-back diagram. Here I can see: oh, this fact table, the data actually comes from this process. I can see this on a table level and on a column level. So this is interesting. Typically, whenever we do a POC, we are asked to build something, and people are measuring how long it takes us to build it. But that's the wrong metric. The really interesting metric is the time to refactor something, because typically in a data warehouse we spend 60% to 80% of our time on keeping the existing stuff alive and up to date.

If I need to change something in my fact orders, this is the starting point. I look at the lineage information and it explains to me where the data comes from. This is helping me to understand the system, and it's also a cool way to organize my code base, because if I end up with hundreds or thousands of objects, what I can do is group that code into a nice project. So I can create my project folder. Instead of being confronted with all the objects, I can have a dedicated project folder. Imagine having your personal development folder, prepping all the objects for a particular release. You can structure the code base and the repository easily that way. Okay, so let's go back to my diagram over here. Until now, I explained a way of quickly developing all those different entities. In this case, what we see is that some of them are based on data vault, using the data vault wizards.

Some of them are based on data marts, using the data mart wizards. Okay, I have a clear way of navigating between them. I can look at the impact analysis. I can do that track back or track forward, so I can go in, look at a particular source column and see who's using it. Okay, so I've done that, but I'm not done yet, because in production I need to make sure that on a daily basis all those entities are correctly loaded. They need to be loaded in a particular sequence, because I can't load fact orders before loading stage fact orders. And I don't want to load everything sequentially, but in the most optimal way, to use my load window as efficiently as possible. Well, what's the most efficient way to load this? Actually, my impact analysis diagram is visualizing exactly that. We help you create a job by giving you the possibility to convert an impact analysis diagram into a job definition.

Job Visualization

Here we have the job definition which has been created automatically for you. These are the dependencies it has inherited from the diagram. Are these dependencies correct? I can't really see that from this list. Again, we try to make things easy for you: we've included job visualization in the software as well. This is the job which has been derived from that impact analysis diagram. What used to be aligned vertically is now aligned horizontally. This is the most efficient job to load that fact table. So again, this is also automation. It's not only helping you to generate the code and the DDLs, but also to create and keep your job definitions up to date. Okay, so I've now created many objects, a data vault in this case, and I've created my fact table on top of it and loaded it. I'm doing agile, and that means I want to have early feedback from the business to see if I'm on the right track.

Now I phone my contacts in the business: hey, come along, let's have a coffee and let's discuss whether my fact orders table is really what you want. I switch to my SQL navigation window and I start doing some joins. No, this is the wrong way to do it; if you want to alienate the business, this is the way to do it, by running advanced queries on the data. WhereScape realized that as well. If you're working agile, you want that early feedback, but you want to show the data in a tool and a layout the business understands. We have the built-in capability to automatically generate OLAP cubes, SSAS cubes, on top of your data marts. It only takes a few clicks, and it then allows you to use Excel to connect to your data mart. Let's just do that together.

It's the same principle. In this case, if I want to create a cube, the fact table will become the source. I navigate here to fact tables, I drag and drop into the middle part, and again the wizard starts. Now it's going to scan all the different dimensions, because the metadata knows which dimensions the fact table is connected to, and now I can include or exclude attributes of all the dimensions. I'll just accept all the defaults here. Now it's listing all the different potential candidates for measures. Okay, so I continue to the end. I have created a cube, but that's only in metadata at this moment. Now I have to continue with the next step. After creating the cube in metadata, the next step is to generate the structure. It's the same process as launching a DDL on the RDBMS; here, we're creating the order cube on the SSAS server. I get the feedback and the results over here.

So that all went fine. The next step, because it's a MOLAP cube, is that I need to refresh it. So, same process: I took a fact table, dropped it in the cube area, and the wizard launched. So why am I creating that cube? Because we sometimes get feedback from people who say, hey, I'm really fast at creating such a cube myself. Yeah, but if, after showing my early prototype to the business, they say, well, I would like an additional level in the customer dimension, I would like you to make that string longer, and I would like you to change the naming convention on that attribute, then it's hard work to make sure everything stays correctly aligned. We would, at that moment, just throw away the cube and recreate it as I just showed you. So now everything is in place. Why? Because now I can use Excel, do some drag and drops and do some pivoting.

This is making things really easy for me. Now I could potentially show this cube to the business. If I click on Belgium, I see my hierarchy, BXL and CHL. I'm from Belgium; that's Brussels and Charleroi. We've used the code in the hierarchy and, oh, there it is: something is wrong in this hierarchy. While discussing with the business, we realize that, and the process would be that I go back into my RED builder product, restructure my dimension, and recreate the cube. Tomorrow, or this afternoon, I would be discussing it again with the business and say, hey, what do you think of my next prototype? That's the agile way of working together. Okay, so I showed this to the business. Am I done now? No, I still need to document what I've done, because that's typically the phase we always skip: documentation.

All the code we've created is based on metadata. We can use the same metadata to create documentation. In the software we have a built-in wizard to create documentation. What happens is that all the relevant information is extracted from the repository into standalone HTML files. Those standalone HTML files you can then take and put on a file server somewhere, so people who don't have access to our software can still have a look at the documentation. We have the user flavor of documentation, but we also have the technical documentation. Let's just have a look at the fact table we visited earlier. In here I can actually see the code which is running on that environment, without having access to the database. If I scroll down, I can see all the different columns, I can see the star schema, I can see my impact analysis diagram, I can see all the different indexes, the volume of data in there, and all the other information I've added in the software.


You can extend what's being documented by adding additional chapters to the documentation here as well. Typically what we say is: agile means you do not do big requirements gathering up front. No, you build while documenting, you document while building; you model while building, you build while modeling. That's the process we're advocating here. The documentation I just showed you is not what we should have built or what we might build in the future. No, it's what has actually been built in that repository, and that's very powerful. In my past as a data warehouse designer, developer and architect, I always started by looking at documentation but ended up looking into the code, because the code was the only reality out there; there was always a difference. Here the code equals the documentation, because they're both derived from the metadata. Okay, so I have a documented project and jobs which will be loaded in the most efficient way.

Yeah, but I'm in development. This still needs to be propagated to the production environment. Do I now have to wait for another month? If you want to be truly agile, you will deploy more frequently and in smaller units of development. You need to have a very strict deployment process. In our software, we can deploy in a few clicks. I can define a deployment activity here. And that's cool: if I created a project, I can just take whatever is in the project and put it in the scope of my deployment. If I, for instance, click on a fact table here, and that's the cool thing, I clicked on the fact table and the system told me that if you want to deploy the fact table, you want to deploy the DDL, but also the indexes and the loading logic, all of it. For the sake of this demo, let's just deploy everything within that particular repository. I click OK, and what happens right now is that it's going to export everything from my repository to the file system.

Because when I deploy, I'm deploying the metadata: I'm deploying the metadata of a project from one repository to another repository. Now I'm going to take that metadata and inject it into a production repository. Over here I have my production repository. This is an important window, because deployments only work in a frequent mode if they are strict, if they're very repeatable, if it always happens the same way. If I'm deploying a load table, existing load tables will be recreated, because they are non-persistent. If I'm deploying a dimension, you typically don't want to lose the data in a dimension, so they will be altered. We have a very repeatable, fixed way of deployment. I have very experienced colleagues with many years of WhereScape experience, and they say that if you ever lose the data in your dimensions, it's because you've done something really stupid.

Because our software makes really sure that your data gets migrated correctly from one state to another. Again, this is a very important part of the automation: it's the lifecycle management part. In short, this was a demo of the builder part. I hope I've been able to show you that we're able to handle structures of many different flavors. The loading logic is set-based SQL packaged as procedures, and we give you the ability to create job definitions as well. Everything is nicely documented based on that metadata. We include a very important deployment process, which is the part of our software offering I personally like the most. I think I've shown you a development environment on steroids, with lots of automation happening behind the scenes, and that enables you to take one entity and put it through all the different layers. Now you might say, well, I like a more model-driven approach.

Data Modeling

I like to do things differently. I like to think and create a logical model first. I would like to generate lots of load tables at once and generate a complete data mart or data model at once. So let's show that. That's where 3D, our data-driven design tool, kicks into action. I want to show you two things about 3D: the profiling and the model conversion. I have the ability to connect to any source system out there, be it a database or a file system, and start profiling it, because people become very efficient at building stuff with WhereScape, but sometimes they don't know what to build. The first step is to define the scope: I'm only going to profile tables, okay? The discovery phase is going to identify what the entities and attributes in that particular source system are. It's scanning the source system, and I got a pop-up.

Apparently in the source system there are no referential integrity constraints defined, and that's something which happens all the time. What I can now do, using some wizards, is easily identify what the relationships between the tables are, based on naming conventions. I continue with this and I now get a model. Let's cancel this. I now have a model, and if I look at it, my customer entity is not linked to anything. Oops, that's bizarre. So let's just go in here. That's because of the assumptions I made in the discovery settings about what a key looks like, and those don't hold for the customer entity. I will now add an additional relationship. But those relationships are only inferred: they could be right, they could be wrong. The only way to find out is by doing profiling. So in the next step, that's what I'm going to do. What happens now is that our software is going to scan the content of the data, and we're doing profiling on three different levels.

We're doing it on the table level, on the attribute level, and on the relationship level. If I click on a particular relationship over here, properties, what I see is that actually I have an issue because my order, the link between order and order details, 20% of the records are not matched. Within minutes, using our software profiling source system, I can find out that the data is really problematic. I have a real data quality issue there and that could be a reason in the agile philosophy to just decide not to start building any bi solution on top of that and to hand it back to the source team and say first clean up the real problems before we will build something on top of it because reports will not be usable anyway, right? Clicking on any entity allows me to visualize many different profiling metrics over here and that's making things very interesting because profiling, again, we want to do everything in the same consistent way.

Automated Documentation

Whenever we've done some profiling in the software, we can again document everything as nice visual documentation, so a document can be generated. Imagine the surprise on the face of the source system owner when you go to that very first meeting to do an initial interview and you come fully prepared with your documentation, with full lineage and full profiling of every single entity and every single relationship. You're armed and dangerous at that moment. So that's profiling. We also have the ability in the software to create EDW models. What I could now do is create a business model which is inspired by the source system. Let's just take orders, order details and the customer entity, which we just profiled, because I like that data. I've created my EDW model, and now I would like to turn this business model into a data vault. In our software, we have a very cool ability to apply model conversions.

Model Conversion Templates

My data vaults, like everything here, are versioned as well, and multi-user. What I can now do is apply a model conversion template. That's really cool, because you could convert a 3NF model into a data vault by hand, but it's actually based on mathematical principles. So what I can do instead is convert it with a wizard. What happens now is that the model has been converted automatically into a data vault, based on model conversion templates which you can set to your own specifications. I used to work at Volvo; I was a customer at Volvo, and we had our own way of converting a 3NF model into a data vault. We're not restricting any of our clients to one specific way. The last step, which I'm not going to show you, is that I could generate all the load and stage structures, throw it over the fence into our builder product, and have that full integration and model-driven approach.
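To give a rough idea of the kind of structure such a conversion produces (a sketch only; the real shape is driven entirely by your own conversion templates, and the names here are hypothetical), the customer entity might end up as a hub with a satellite:

    -- Hub: just the business key plus audit columns.
    CREATE TABLE hub_customer (
        hub_customer_key  INT IDENTITY(1,1) PRIMARY KEY,
        customer_code     VARCHAR(20) NOT NULL UNIQUE,   -- business key
        load_dts          DATETIME    NOT NULL,
        record_source     VARCHAR(50) NOT NULL
    );

    -- Satellite: the descriptive attributes, historized by load timestamp.
    CREATE TABLE sat_customer (
        hub_customer_key  INT          NOT NULL REFERENCES hub_customer,
        load_dts          DATETIME     NOT NULL,
        customer_name     VARCHAR(100) NULL,
        country           VARCHAR(50)  NULL,
        PRIMARY KEY (hub_customer_key, load_dts)
    );

The orders and order details entities would typically become hubs and satellites of their own, with link tables tying them to the customer hub.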