Advantages of Data Warehouse Automation

Again, my name is Patrick Howard, Product Marketing Manager, and I'm going to do a quick introduction on WhereScape, who we are and what we do, before we get to what you came here for, which is Kent talking about data automation.

 

WhereScape

If you look on the right side of this, I'll make it short and sweet because I know you're waiting to listen to Kent, but I want to talk just about what WhereScape is and what our tools do. Starting on the right side, WhereScape has been around for 20 years. We started in 2002 in New Zealand as a data warehouse consulting company, and after a couple of projects they realized they were doing a lot of repetitive tasks. They developed a tool for automating the creation of data warehouses and avoiding all those tasks. One of their first customers, Wells Fargo, was amazed at how productive they were and asked how they were able to do this.

They showed Wells Fargo these tools, and Wells Fargo said, can we buy them? Shortly after that, WhereScape became a software company instead of a consulting company. So what do we do? What's the main benefit of WhereScape tools and data automation? It's improved output. Data automation is faster and more accurate, with fewer errors. Things are right the first time, and there's less technical detail you have to know. Combine all that, and we tell people it's four to six times the developer output. Now, there are some extreme cases. I can tell you about one: we had a two-day session with a customer, and at the end of the two days they said, we just finished a six-month project. That's an extreme case, but it was right in our wheelhouse and everything clicked. It's an example of the kind of benefits you can get with data automation, and with WhereScape specifically.

We've got 1,000 customers around the world, in APAC, Europe, Africa, South America, and North America; small companies and large companies; in insurance, finance, healthcare, manufacturing, sports, and education. We're not tied to any specific industry or geography. Moving to the left side, what does WhereScape do? We're basically an end-to-end automation tool: everything from designing your data warehouse logically and physically, to implementing it, deploying it, and then running it and keeping all the tables and data up to date. All of that is done with one tool. There are no hand-offs; there's no person A does job A, person B does job B, person C does job C. That's one of the advantages of data automation. This is possible because we're metadata driven. You don't tell WhereScape how to do things; you tell it what you want to do.

Automated Database Documentation

I want to create a dimension. I want to create a slowly changing dimension using geographic data to track the changes. I want to create a hub using this column as a business key. That's the level you work at. WhereScape, as a data automation tool, will implement that in Snowflake, in SQL Server, in Oracle, in Teradata, in whatever target you select. One of the advantages of this is the automated documentation and data lineage.
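To make that "describe what, not how" idea concrete, here is a minimal sketch of what metadata-driven generation can look like. This is not WhereScape's actual API; the table spec, the dss_-prefixed housekeeping columns, and the generate_scd2_ddl helper are all hypothetical, and the templates are heavily simplified. The point is only that the developer declares the object they want, and vetted, target-specific templates supply the implementation.

    # Hypothetical sketch of metadata-driven generation -- not WhereScape's real API.
    # The developer declares WHAT they want; a vetted template supplies the HOW.

    SCD2_TEMPLATES = {
        # One simplified DDL template per target platform.
        "snowflake": (
            "CREATE TABLE {schema}.{name} (\n"
            "    {name}_key INTEGER IDENTITY,\n"
            "{columns},\n"
            "    dss_start_date TIMESTAMP_NTZ,\n"
            "    dss_end_date   TIMESTAMP_NTZ,\n"
            "    dss_current_flag CHAR(1)\n"
            ");"
        ),
        "sqlserver": (
            "CREATE TABLE {schema}.{name} (\n"
            "    {name}_key INT IDENTITY(1,1),\n"
            "{columns},\n"
            "    dss_start_date DATETIME2,\n"
            "    dss_end_date   DATETIME2,\n"
            "    dss_current_flag CHAR(1)\n"
            ");"
        ),
    }

    # The "what": a slowly changing dimension, its business key, and the columns to track.
    dim_customer = {
        "name": "dim_customer",
        "schema": "edw",
        "business_key": "customer_code",  # would drive the generated change-detection logic (not shown)
        "columns": {"customer_code": "VARCHAR(20)",
                    "customer_name": "VARCHAR(100)",
                    "region": "VARCHAR(50)"},
    }

    def generate_scd2_ddl(spec: dict, target: str) -> str:
        """Expand the declarative spec into target-specific DDL."""
        cols = ",\n".join(f"    {c} {t}" for c, t in spec["columns"].items())
        return SCD2_TEMPLATES[target].format(schema=spec["schema"],
                                             name=spec["name"],
                                             columns=cols)

    print(generate_scd2_ddl(dim_customer, "snowflake"))

Swapping the target string is all it takes to get dialect-appropriate DDL from the same declaration, which is the property Kent returns to later in the Micron migration story.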

Challenges of Data Warehouses

A lot of the failures of data warehouses and large data projects come down to end users who just don't trust the data. Because of the metadata-driven approach, we've got documentation that's accurate and up to date. As soon as you make a change, you can regenerate the documentation, and I think it's three clicks and you've instantly got documentation for the end users and the business units, as well as the developers, on where this data came from and what transforms it went through.

Data Vault

How do I know this data came from the source and is what it presents itself to be? Lastly, I'll talk about an open design methodology. We're not tied to Data Vault. We talk about it, and we can certainly help you develop a Data Vault very quickly, but the same goes for Kimball or Inmon or any hybrid model you can think of. We have the tools to help you produce data objects, data warehouses, data lakes, and data marts in a variety of methodologies, and we will hand-hold and facilitate to get those things created, but we're not tied to any specific methodology. Lastly, I'll put in a plug for the URL down at the bottom. You can go to that specific URL and request a demo, or you can just go to WhereScape.com; there's a button there that says Request a Demo. There's also a link for resources under Resources or Events. The events include a lot of 30-minute presentations on WhereScape from end to end, like how to create a Data Vault in 30 minutes or how to create a star schema in 30 minutes.

 

Snowflake

They're very helpful and very informative. So that's it. That's our plug. That's WhereScape, and we're hosting this with Kent. Kent, if you want to move to the next slide, please, we'll do a quick introduction. Kent, as most of you know, is the former Chief Technical Evangelist at Snowflake. He's got a lot of credentials: 40 years of IT-related work, 25 years of data warehousing and analytics, and he's an author and co-author of books. The one thing I learned yesterday talking with Kent is, I asked him when he was first introduced to WhereScape, and he said it was in 2009. That was 13 years ago, when he first saw WhereScape and saw the benefits of a real tool that actually did data warehouse automation. He's been working with us and continuing to develop that partnership ever since.

 

Unstructured Data

Now you look at where we are today and you've got all sorts of new data types as well. We've got not just structured data. Bill Inmon famously said back in the day that most of a company's data is unstructured, which back then simply meant it wasn't in a database; it wasn't relational is what he meant at the time. That, of course, has turned out to be true.

 

IoT Data

JSON data is all over the place, coming from the web. Now we've got IoT data, where we're looking at 75 billion IoT devices by 2025; right now there are more than 26 billion of them. In 2019, there were about 180 smart speakers being shipped about every hour; in 2018 it was only 67. The volume has just ramped up so much, and we have to learn how to work with data coming from these devices. And the amount of data generated each day is crazy.

It's expected to be 463 exabytes a day. Again, an exabyte is 1,000 to the 6th power bytes; that's a lot of zeros. We've just got a ton of data coming at us, and it's coming from all sorts of places we never expected it to come from. Our modern architectures now encompass IoT data and social data from social media sites like Twitter and Facebook and Instagram, which generally comes in the form of JSON. Then, of course, you have your databases: custom databases, off-the-shelf databases, applications, and now SaaS applications. With the advent of Hadoop a little over a decade ago, we had to start dealing with files, because it really was just a file system; we were taking data out of structured systems and throwing it into flat files.

 

Data Lake Reporting

Then we want to load them back in again. Our analytics landscape has changed dramatically over the last decade, for sure, but even more so in the last five years since the time I joined Snowflake. There's been rapid growth in data and in how we need to deal with it, because we're not only dealing with traditional data warehouses. Now we have data lakes; sometimes your data lake is feeding your data warehouse, and sometimes data lakes are feeding appliances or even analytic reports right out of the data lake.

 

Machine Learning

The data pipeline has gotten very complex. Not only do you have your standard BI reports, but now we talk about machine learning and AI doing data modeling from a different perspective, modeling it from a machine learning perspective to really try to get more value out of the data. That's really the key for all of us: we're trying to get more value out of our data.

Lots of things in addition to the data itself have been driving these sorts of changes. Some of them are internal to organizations, like changing business needs and requirements, and some are external, like COVID. The pandemic changed the demands for data. We now have an environment where people need answers right now, and I think the pandemic showed that to many organizations, especially when dealing with the supply chain. Things changed so fast with the lockdowns: where is everything, where are the materials, how are they going to get from point A to point B? Restaurants were trying to figure out whether they had the workforce to stay open and how they were going to serve their customers. These were massive demands to change on a dime, and in order to survive, they had to be able to adapt as quickly as possible. Then, of course, there's the evolving technology landscape, with all these different things coming out like the cloud, the separation of compute from storage, and all the innovations that Snowflake brought; it's constantly changing.

 

Cloud Adoption

Being able to adapt is going to be critical for maintaining a company's performance while still trying to reduce costs. I mean, think about how many companies just moved to the cloud as a result of the pandemic. Working at Snowflake during that time, we saw a massive uptick in adoption, and all of a sudden people said, well, now how do we deal with that? That was something people had to really get their hands around. You have organizational changes, mergers and acquisitions; we've certainly seen a lot of those in the last couple of years. There's replatforming as a result of the pandemic and the move to the cloud, and of course the never-ending demand for self-service capabilities. Everybody wants to empower their business users as quickly as possible, but in an intelligent manner. Now, one of the other things that comes up, of course, is that if we are not able to adapt to change quickly enough, we end up with even more complex infrastructure, and shadow IT often pops up when people aren't able to get the data they need as fast as they need it.

Because there's a bottleneck in the IT department, sometimes you get shadow IT. These are things that, as an organization, we have to look at: how do we deal with all of this? Of course, with these changes come all sorts of challenges. Some of them are big, some not so big, but they all have to be overcome somehow. Changing technology means new skills are required, and we're potentially going to have to hire new people or upskill the current staff. All of these changes also introduce the possibility of errors. Every new line of code, every new software package we install, every modification we make to a database or a report or anything like that can potentially introduce errors, because as humans, we make errors all the time. Sorry, my slides are flipping on me.

 

Automated Documentation

They're out of control for some reason; let me get back on the right slide there. All right. All of this change can, of course, cause corruption in our data, and sometimes it's undocumented. We were trying to build documentation as fast as we could as we tried to be agile. Early on in the agile movement, people said, oh no, there's no documentation in agile. Well, no, that's not true; we still need documentation. The problem was we were changing so quickly that we couldn't keep the documentation up to date. Oftentimes we might have been successful with an agile approach in making a lot of rapid changes to our systems, but we didn't keep the documentation up to date, so it was obsolete nearly as soon as it was written. That was a challenge, and it's definitely a challenge for all of us. Having to buy new tools and train people, that's really the big one.

If you're training people, again, they're not documenting things, and so it really can get out of step. If you can't keep up with all of this, then your hard-won efforts building a data warehouse or analytics platform or even a data lake are somewhat for naught, because the users have to have confidence in that data. Otherwise it becomes a data graveyard, or, as we call it in the data lake world, a data swamp, where you've got all this data but nobody really trusts it and nobody really knows what's there. Documentation becomes something that really needs to be part of what we're dealing with, and with changes coming as rapidly as they are, it's even more important to keep track of that. Now, the solution to this, in my mind: as I said, I met the WhereScape folks back in 2009, but I've been doing automation for decades now, and this has always been what I felt the key to success was going to be.

 

Why Automation is Important

The key to being agile and being able to pivot and adapt to changes quickly is to eliminate the mundane, redundant, error-prone types of work that a team may be doing through hand coding. Really, we need to be focused on delivering value from data as rapidly as possible, and the only real way to do that is through automation. In my mind, a good automation tool is going to let you describe what you want done, not how you want it done. Over the years I went from being a programmer to being more of an analyst, and then from analyst to architect. And really, I don't want to code. I want to be able to work with the business, figure out their requirements and their needs, and then figure out the best way to deliver that to them as quickly as possible.

 

Data Automation Tools

That means that if you have a new database for your data warehouse or source system, you want an automation tool that can take care of all those technical nitty-gritty details and generate the vendor-specific code to solve the problem, without you having to go learn a new language like I did when PL/SQL came along.

 

Data Vault

What about implementing a new methodology? Over the last six years, I have seen massive growth in interest in implementing Data Vault. Well, that requires some training, and there are a lot of gotchas in the approach, as simple as Data Vault may appear to be. Having a tool that can help implement a new design methodology and generate that code means you don't necessarily have to know all the minute nuances yourself. How do you build the right kind of insert for a Data Vault? Well, if you can generate it, then you don't have to worry about that as much.
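For anyone who has not written one by hand, the "right kind of insert" for a Data Vault hub follows a well-known generic pattern: hash the business key, stamp a load date and record source, and insert only keys that are not already present. The sketch below renders that pattern from a tiny template function; the table and column names are invented, and this is a generic illustration, not WhereScape's actual generated code.

    # Generic Data Vault hub-load pattern (hypothetical names, not WhereScape output).
    # A generator fills in the hub name, business key, and staging source once,
    # so every hub load in the warehouse follows the same vetted shape.

    def hub_load_sql(hub: str, business_key: str, stage_table: str) -> str:
        return f"""
    INSERT INTO dv.{hub} ({hub}_hkey, {business_key}, load_date, record_source)
    SELECT DISTINCT
           MD5(UPPER(TRIM(stg.{business_key}))) AS {hub}_hkey,
           stg.{business_key},
           CURRENT_TIMESTAMP                    AS load_date,
           stg.record_source
    FROM   {stage_table} stg
    WHERE  NOT EXISTS (
             SELECT 1
             FROM   dv.{hub} h
             WHERE  h.{hub}_hkey = MD5(UPPER(TRIM(stg.{business_key})))
           );
    """.strip()

    print(hub_load_sql("hub_customer", "customer_code", "stage.customer"))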

 

Code Automation

The key to reducing errors and rework really is automating the generation of our data pipeline and all the code that goes with it. I know it's all SQL, but a human typing it in at the command line is still error prone; that's all there is to it. I've been saying this for years, ever since I got involved in all this: I have never seen a code generation tool get a syntax error. Even the simplest of syntax errors, forgetting a semicolon at the end of a line after typing in all your code, just doesn't happen. With a good code generation tool, you can build templates, and if it's template based, then you can have confidence that all the objects are going to be generated from a template and that they're going to work. It should include data lineage, looking forward and looking back along the data pipeline, so that you can see where the data came from.
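As a rough illustration of the lineage point, column-level lineage is essentially a graph that the tool maintains for you and that you can walk in either direction. The mapping, table, and column names below are invented for the example; the takeaway is that once the pipeline is generated from metadata, answering "where did this column come from?" becomes a lookup rather than a research project.

    # Illustrative column-level lineage walk -- the mapping here is invented,
    # but a metadata-driven tool effectively maintains this graph for you.

    LINEAGE = {
        "mart.sales_summary.net_revenue": {
            "sources": ["edw.fact_sales.gross_amount", "edw.fact_sales.discount"],
            "transform": "gross_amount - discount",
        },
        "edw.fact_sales.gross_amount": {
            "sources": ["stage.pos_extract.amount"],
            "transform": "CAST(amount AS NUMBER(18,2))",
        },
        "edw.fact_sales.discount": {
            "sources": ["stage.pos_extract.discount"],
            "transform": "COALESCE(discount, 0)",
        },
    }

    def trace(column: str, depth: int = 0) -> None:
        """Print the upstream lineage of a column back to its source system."""
        node = LINEAGE.get(column)
        label = f"  <- {node['transform']}" if node else "  (source)"
        print("  " * depth + column + label)
        if node:
            for src in node["sources"]:
                trace(src, depth + 1)

    trace("mart.sales_summary.net_revenue")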

That's how you build confidence in the system. When a user who's running a report says, hey, where did that number come from, I don't think that's right, well, how do you trace that? It used to take us days, if not weeks in some cases. I worked somewhere at one point where they had a very large healthcare data warehouse, and everything had been basically hand coded using some templates. At one point somebody questioned the numbers coming out of the data mart, and it took weeks to research it, to go through the ETL code and all these different tools and transformations and everything else to figure out where that data really came from. That, of course, diminishes the confidence of people using the system. Contrast that with being able to demonstrate quickly: oh yeah, here's a little diagram; that column came from this table, which came from here, and we applied this transformation right here.

We noticed that A plus B equals C, it came from here, it came from that source system, and the last time that data was loaded was on this date. That gives a lot more confidence. The additional benefit, which I've always thought was wonderful once I got into agile, is that with a good automation tool you've got the documentation; it's all there. It's called one-button documentation: your data dictionary, your lineage, everything is there. You can be doing agile sprints, updating and changing the way things are done, applying new business rules, and as soon as you do, the documentation is there, because you've done it in this tool. That gives you so much power to adapt as the business needs change. Let's talk about results and the rewards from those results.

Micron

One of the big customers I had the privilege of working with when I was at Snowflake was Micron. I actually got introduced to Micron at the Worldwide Data Vault conference a bunch of years ago, when they got up and did a presentation about their Data Vault environment, and it was world class. It had been implemented on Teradata; they'd been featured at the Teradata World Summit. A massive thing: literally hundreds of terabytes of data, IoT data at that. Being the Snowflake evangelist at the time, I of course said, oh, that's interesting, have you considered moving to Snowflake? At the time they said no. Well, fast forward a couple of years, we had some more conversations, Snowflake evolved, and they said, yeah, this makes sense for us. We are ready to move to the cloud, and we're going to do it with Snowflake.

 

Generate Snowflake Native Code

Great. As you can imagine, it was a massive migration project. Well, it turns out what they'd built in Teradata was built with WhereScape. Early on I had the privilege of working with WhereScape's CEO at the time, Mark, and had convinced him that they should join forces with Snowflake and become Snowflake partners, so WhereScape had built all the connectors for Snowflake and generated Snowflake-native code. It occurred to me: if all the metadata is in WhereScape already, can't they just change the target from Teradata to Snowflake, push the button, and regenerate all the code? And the answer was pretty much yes. They had a couple of custom-built templates that did some very specific Teradata things they didn't need anymore, and they needed to modify those templates to eliminate things that no longer had to be done in Snowflake.
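Here is a rough sketch, with hypothetical and heavily simplified templates, of why "change the target and push the button" is plausible: when the object definitions live in metadata and the platform-specific SQL lives in templates, re-platforming is largely a matter of re-rendering the same objects against a different template set, for example dropping Teradata's PRIMARY INDEX clause for Snowflake. These are not the actual templates used in the Micron migration.

    # Simplified illustration of re-targeting generated DDL -- hypothetical templates,
    # not the actual Teradata or Snowflake templates Micron used.

    TABLE_TEMPLATES = {
        "teradata":  "CREATE TABLE {schema}.{name} ({cols})\nPRIMARY INDEX ({pk});",
        "snowflake": "CREATE TABLE {schema}.{name} ({cols});",  # no index tuning needed
    }

    objects = [
        {"schema": "dv", "name": "hub_customer",
         "cols": "hub_customer_hkey VARCHAR(32), customer_code VARCHAR(20), "
                 "load_date TIMESTAMP, record_source VARCHAR(50)",
         "pk": "hub_customer_hkey"},
    ]

    def regenerate(target: str) -> list[str]:
        """Re-render every object definition for the chosen target platform."""
        return [TABLE_TEMPLATES[target].format(**obj) for obj in objects]

    # Same metadata, two platforms: flip the target and regenerate.
    for ddl in regenerate("snowflake"):
        print(ddl)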

The net result was that, with the power of WhereScape and the metadata, and because they had stuck diligently to using the tool properly, they were indeed able to push the buttons, regenerate all the code, and move it into Snowflake. As a result, they were up and running in production and off their Teradata box in under four months. We're talking about nearly a petabyte of data and all the processes that go with it, streaming trillions of rows of data from IoT machines in manufacturing into Snowflake, all because they had done this. To my knowledge it was the fastest migration of a major organization, with a huge, I mean huge, data warehouse environment, from one platform over to Snowflake, and it was 100% the result of having built their platform using WhereScape and WhereScape automation. It really boggles my mind.

 

Migrating Teradata to Snowflake

I have seen some massive migrations at Snowflake, specifically from Teradata over to Snowflake. In the two that I'm thinking of, it took them 18 months to basically get it going and about 24 months before they were fully cut over. Two years. Now, the fact that they were doing Data Vault might actually have accelerated things, because Data Vault is so template based that presumably they would have built their own little toolkit to generate hubs, links, satellites, things like that. But they still would have had to factor in all the nuances of Snowflake, drop all the Teradata-specific stuff like indexing and the performance optimization work, and they would have had to rewrite quite a bit. I would expect it would have taken somewhere between 18 and 24 months if they had not been able to do this.

 

Migrating Netezza to Snowflake

Like I said, I saw several very large customers globally make this transition from Teradata to Snowflake, as well as Netezza to Snowflake, and for every one of them it was 13 months or more. Yeah, I'll talk about migrations also, because I probably get weekly calls with WhereScape customers who are looking at migrating from something on premises to something in the cloud. Snowflake does seem to be a very popular one, but I can certainly talk about a couple of these. These are all current WhereScape customers, and if you're curious, we've got case studies on each one of them on our website under Resources. But I'll just mention Bachelor. They've been a WhereScape customer for quite a while, and we got a few comments from them about their use of WhereScape data automation tools in general. They're saying it's taking 50% less time to build, validate, and deliver new subject areas. If they get a new source system, or a new business unit wants a data mart built, they're saying it takes half the time it did before.

They're developing data models faster, and it really allows them to focus not on the nitty-gritty details. As Kent said, you describe what you want to do, not how you want to do it. All that technical implementation is hidden from you. I mean, you can expose it and modify it if you need to, but in general you don't need to, and that allows you to focus on delivering value to the business users much, much faster.

Toyota Financial Services

Toyota Financial, if I talk about them briefly: in seven months they moved 95% of their infrastructure into Snowflake. They had about a seven-month project in which they designed and built their data vault, and they've had just incredible ROI on that. Again, seven months to get migrated to Snowflake with a Data Vault instance. For McAllister, I'll just give a quote from one of the WhereScape users there, because it demonstrates what using WhereScape is really like.

One of the DBAs there said, I used to spend most of my time solving puzzles that WhereScape solves automatically; today I focus on how the data flows into the data warehouse, cleaning it up, and helping the business make better decisions based on that data. Again, what they're doing as WhereScape users is focusing less on the technical, nitty-gritty aspects: how do I check things in, how do I set up GitHub, how do I find a template or Google some sample code on how to implement a feature, when all of that is built in and generated automatically? Because you're manipulating and inputting metadata about what you want to do, as Kent said, not how you want to do it.

 

Oracle to Snowflake Migration

Now, there are lots of technical and nontechnical aspects to a platform migration. There's obviously setting up the infrastructure, there's QA, there's project management, there's assigning resources. Purely on the technical side: how do I replicate my data warehouse that's currently on on-premises Oracle into Snowflake? How do I convert all the code, the objects, the indexes, and everything else into Snowflake? How do I migrate the data, and what's that going to cost? They came to us and said, we need help with this, can you give us a statement of work? We've got an offshore team that has done a dozen of these or more, so we gave them a quote. To migrate from on-premises Oracle to Snowflake was 90 hours. 90 hours, with two people, so about a week and a half of real time, and the quote was under $10,000. At the end of that, Allianz had a fully functional data warehouse running in Snowflake with all the objects, all the procedures, and all

 

Data Warehouse Automation

the data. Now it was up to them to take it beyond that: do they want to run in parallel for a while, how are they going to test it, and so on. The impact of having data warehouse automation is, again, that you describe what you want done and the actual technical details of how it's done are automatically filled in for you. I mean, Kent talked about code generation and things like that. I was involved in developing a physical warehouse system, for forklifts and pallets and all that kind of stuff inside the four walls, back in the 90s using, believe it or not, Informix. We had a lot of code generation tools there. Anything we had to do two or three times, we developed templates for and had automatic scripts to generate the, what did they call it, ISQL I believe, which was their SQL code.

 

Hyper Automation

All right, so what's next? What's coming up next? There's so much change continuing in the data world. We've got talk about data mesh and data fabric and DataOps, which a good automation tool helps a lot with. Then there's this new term, hyper automation; that one came out of the blue. It's something Gartner is talking about, and I had to really look into it because it's kind of interesting. In government specifically, hyper automation is a systemic approach to rapidly identify, vet, and automate as many business and IT processes as possible. It involves the orchestrated use of multiple technologies, tools, and platforms, and it really goes beyond the simple task automation we've been talking about here. It includes low-code and no-code development platforms, and it goes all the way through to what they call intelligent automation, which is driven by AI and machine learning.

 

XaaS

Things continue to change, and they continue to accelerate. Imagine if this concept of hyper automation catches on, where we want to automate everything. There's XaaS, anything as a service. We've had infrastructure as a service, we have software as a service, and we now have data as a service. We've got the data cloud, with real data services there, and data products being sold in a marketplace, all with low code or no code. That's where the future is: eliminating coding bottlenecks through things like automation, and now hyper automation. How we handle all of these changes and prepare for them, in my mind, is that we've got to start automating now, so it's easier to adapt as the changes overtake your organization. In the examples we just talked about with Micron and Allianz and some of the others, it made a huge difference to their bottom line that they were able to adapt to a fairly massive change.

In the data world, we're talking about platform changes; that's pretty big, and being able to do that is very important. The history of data warehousing, data lakes, and analytics is just strewn with nightmarish stories of epic failures. We really need to change that perception and change those outcomes, because none of us wants to be on a low-performing or failing team. We want to be part of a successful team, right? That's what we're here for. That's why we're interested in this: we want to drive success and drive value, not be part of some nasty failure. Really, the only way to do that is to change the way you're doing things today. As they say, we can't keep doing the same thing and expect different results; that is the definition of insanity, doing the same thing over and over again and expecting different results.

 

Process Automation

In my experience, automation with a repeatable method and standards is really the best way to do it and to be prepared for an unknown future. That's how we're all going to get there. So I hope I've made my point: automate, automate, automate. I have been into automation for over three decades now, and it gives you these benefits. Before the agile movement existed, we were able to be agile and adapt to change: flexibility, extensibility, adding on to what you've already built, standards enforcement. This is huge, especially today with all the compliance regulations out there: to be able to work in an environment where the standards are built in, they're built into the templates, and the documentation is built into the product. It doesn't have to be some secondary thought of, oh man, we've got to spend time making sure we check everything for this.

When you get new staff, how long does it take them to come up to speed, especially on standards, even coding standards and formatting standards? We all know the nightmare of having to pick up somebody else's code and try to modify it. Who knows how they laid that code out; it wasn't the way I would lay it out, and it's so hard to understand. If you get to using automation and code generation, you don't have that problem anymore. You have one standard, it's built into the tool, and whether it's your newest developer or your most experienced developer, the code that comes out looks the same and behaves the same. You can run standard regression testing on it. There's no guesswork, and it's a huge productivity boost. As things change so rapidly, it's going to be way easier to adapt by using something like code generation.

As far as sustainability, being able to maintain and manage these systems over time: it was one thing when I started out and it was a ten gig warehouse with just data from a financial system. I wrote a bunch of SQL code, really a lot of SQL views and a small set of tables, and I was one person. I actually managed it, but in that case, at the time, we still had to hand write all the code, so I had to keep testing it. If they came and said they wanted a different report, or they wanted different data, or the calculation was wrong, I had to manually go in and change that code and then test it and test it again before we could move it into production, to make sure I didn't mess something up. Because the data was only coming in once a month, we had a little time to do that back then.

 

Data Streaming

Now you've got data streaming. We want near real-time response; we want near real-time access to our data. We don't have that time anymore. The window has shrunk down to next to nothing for when we can effect a change, and the only way to do that is to automate. My question, as I said in my blog post for those who read it, is: why not automate today? Why not try it? Change the way you're doing things and try to get some of these benefits for your organization. Don't put it off, because the future is coming at you at breakneck speed: 175 zettabytes of data; all these different kinds of data, structured, unstructured, semi-structured; IoT devices being added constantly. People want you to stream that information in and do near real-time analytics on it. I don't think there's a choice. I think the only way to be successful going forward is to embrace automation and code generation.