Have a question about a term or phrase associated with data infrastructure automation and data warehousing? Check out WhereScape’s Data Warehousing Glossary for common explanations and definitions.
A data warehouse is a collection of data primarily used for reporting and analysis. It’s a way to provide business analysts and other users with a centralized repository of accurate, enterprise data from which to glean timely analytic insights that guide everyday business decisions.
Data typically flows into the data warehouse from transactions systems, databases, and other internal and external data sources. Data warehouses usually include historical data derived from transactions, but can also include real-time streaming data as well as other sources.
A data mart is a subset of data (typically a subset of a data warehouse) that is focused on a specific area of the business. When users need only data about one subject area or department sometimes a data mart is the answer. Think of it as providing a boutique store versus a giant warehouse for their data needs. A data mart versus a data warehouse can make it easier for business users and analysts to locate and discover the answers they need more quickly.
A data vault stores data coming from various operational systems and other sources. The difference is that it uses a different modeling approach than traditional data warehouses. Developed by Dan Linstedt, the Data Vault 2.0 model is designed to offer greater flexibility, scalability, consistency and adaptability than traditional modeling approaches.
According to Linstedt, “The Data Vault is a detail oriented, historical tracking and uniquely linked set of normalized tables that support one or more functional areas of business. It is a hybrid approach encompassing the best of breed between 3rd normal form (3NF) and star schema.”
The tools and systems that collect, process, store, analyze, and report on an organization’s data make up the data infrastructure. It’s the foundation for turning data into insights to support and sustain business growth, profitability, competitive advantage, and innovation.
Data infrastructure is a categorical term as well, inclusive of the wide variety of different data repositories such as data warehouses, data vaults, data lakes and data marts.
Data Warehouse Automation
Traditionally, data warehouses were designed, developed, deployed, operated and revised through the manual efforts of teams of developers. The average data warehouse project developed by hand could take years to complete, from requirements gathering to production availability, with a high risk of failure.
Data warehouse automation uses metadata, data warehousing methodologies, pattern detection and more to provide developers with templates and wizards to auto-generate designs and coding previously required to be done by hand. Automation eliminates the repetitive, time-consuming and manual design, development, deployment and operational tasks within the data warehouse lifecycle. By automating as much as 80 percent of the lifecycle, IT teams can deliver and manage more data warehouse projects than before, much faster, with less project risk and at a lower cost.
Data Infrastructure Automation
Data infrastructure automation uses metadata, data warehousing methodologies, pattern detection and more to provide developers with templates and wizards to auto-generate the code and documentation needed to design, develop, deploy and operate data warehouses, data vaults, data lakes and data marts.
With data infrastructure automation, organizations can deliver data infrastructure projects in less time, with less risk and at a lower cost. Additionally, organizations are able to more easily adapt their data infrastructure to business change.
Data Vault Automation
While completed data vaults deliver many benefits, designing and developing them by hand can be challenging, requiring significant time, effort and money. By leveraging metadata and templates designed for the best practice use of the Data Vault 2.0 methodology, IT can automate the data vault lifecycle. Automating the design, development, deployment and operation can dramatically cut project timelines and costs, and increase delivery success.
Cloud Data Warehousing
Traditionally, data warehouses were on-premises, using hardware and software located in a company’s own data center or perhaps co-located in a third-party data center. However, with the advent of cloud computing, the concept of data warehouse as a service (DWaaS) is changing the way companies are choosing to implement data warehousing. With DWaaS, or cloud data warehousing, the service provider automatically handles infrastructure, optimization, availability, data protection and more.
Cloud data warehousing brings all the benefits of the cloud to analytical data infrastructures: agility, cost effectiveness, scalability and performance.
Batch Data Processing
Batch-based data is data that is collected over time into a batch and then processed all at one time. Batch processing is a simplified way to process data.
Compared to batch-based data, real-time or streaming data is continual and is processed in near-real time.
Big data is the term used to describe extremely large or complex data sets that require advanced analytical tools to analyze. Unlike traditional data, big data typically exhibits five characteristics that set it apart from other data sets: volume, variety, velocity, variability and veracity.
It’s these characteristics that make big data so valuable for innovation and insight. For instance, it can feed machine-learning algorithms to make artificial intelligence-driven processes smarter. Big data can also be integrated with smaller enterprise data sets for more comprehensive, granular insights.
Data Warehouse Lifecycle
The data warehouse lifecycle includes all the phases of creating and maintaining a data warehouse, including:
- Discovery: Understanding business requirements and data sources needed to meet those requirements
- Design: Iteratively designing and testing the data warehouse model
- Development: Writing or generating the schema and code to create and load the data warehouse
- Deployment: Moving the data warehouse into production where business analysts can access the data
- Operation: Monitoring and managing the operations and performance of the data warehouse
- Enhancement: Making changes to support evolving business and technology needs
Data Vault 2.0
Data Vault 2.0 is the latest specification for data vault modeling. Designed to be resilient to business and technology changes, the Data Vault 2.0 data model, created by Daniel Linstedt, simplifies integrating data from multiple sources, makes it easy to evolve or add new data sources without disruption, and increases scalability and consistency of enterprise data infrastructure.
Change Data Capture
Simply put, change data capture (CDC) lets you identify and track data that has changed, which becomes very important as the volume of data increases. By capturing individual changes to data, you can apply only those changes to your data warehouse, data vault, or data mart instead of importing all the data.
Like its older brother DevOps, DataOps is about using tools, philosophies, and practices to improve the speed and accuracy of data analytics. An emerging discipline, DataOps applies concepts such as continuous integration to the entire lifecycle of data to accelerate the delivery of analytics and insights to the business. A core component of DataOps is automation of the design, development, deployment and operation of the analytical data infrastructure.
Most organizations already have existing data infrastructure. When an organization chooses to adopt a new data platform, it must often migrate not only the data to the new platform, but the organization of the data and its current use of the data, such as its users and role permissions, as well.
Traditionally, migrating existing data warehouses to new environments has taken months, with a team of developers needed to discover and understand the existing data infrastructure, map it to the new data structure, write and test scripts, recreate metadata, and finally transfer and validate the data. However, automating the migration of data and data infrastructure dramatically reduces project costs and effort and speeds up the migration process significantly.
Metadata is data about data. It describes the structure of the data warehouse, data vault or data mart. Metadata captures all the information necessary to move and transform data from source systems into your data infrastructure and helps users understand and interpret the data in the data infrastructure. It is also the foundation for aspects such as documentation and lineage.
ETL vs ELT
ETL (extraction, transformation, and loading) is a technique for extracting and moving information from upstream source systems to downstream targets. Traditionally, ETL was used to move data from source systems to a data warehouse, with transformation (reformatting) happening outside of the data warehouse.
ELT, on the other hand, approaches data movement differently by using the target data platform to transform the data. The data is copied to the target and transformed in place. Because ELT doesn’t need a transformation engine, it is a more flexible and agile approach that is less complex and time-consuming than ETL to develop and maintain.