One of the first steps we take for any client seeking an enterprise data warehouse or an updated reporting solution is to determine the best possible way to organize their information by developing a data model. We often consider the benefits of two forms of data models when deciding which would be best for our clients. The first is a relational model; the second, which many clients are less familiar with, is a data vault model.
Both options feed information into the data warehouse used for reporting, but there are some key differences to consider. Making the right decision around how source data is modeled and organized can make or break the performance and quality of your organization’s ETL process and reporting.
In this post, we’ll focus on the data vault model. We’ll review what it is and explore scenarios around when to use it and when not to use it.
What Is a Data Vault Model?
A data vault model, more commonly referred to as simply data vault, is a practice of organizing data that separates structural information, such as a table’s unique identifier or foreign key relationships, from its attributes. It was created to enable storage and auditing of historical information, allow for parallel loading, and allow organizations with many source systems to scale without needing to redesign the entire solution. It adds flexibility and scales easily, making it great for growing organizations that would normally encounter many redesigns of their data solution. To achieve these benefits, the model is comprised of three basic table types:
- Hub tables hold all unique business keys of a subject. For example, HUB_EMPLOYEE may use an employee number to identify a unique employee.
- Link tables track all relationships between hubs. For example, LINK_EMPLOYEE_STORE would track the relationship between an employee and the stores they work at.
- Satellite tables hold any attributes related to a link or hub and update them as they change. For example, SAT_EMPLOYEE may feature attributes such as the employee’s name, role, or hire date.
3 Reasons to Use Data Vault
Reason 1: You have multiple source systems and relationships that change frequently.
Data vault provides the most benefits when your data comes from many source systems or has constantly changing relationships. Data vault works well for systems with these characteristics because it makes adding attributes simple. If there is a change to only one source system, that change doesn’t have to show up for all source systems. Similarly, you can limit the number of places changes are made because attributes are stored separately from structural data in satellites. Additionally, it is easier to account for new and changing relationships by closing off one link and creating another. You don’t have to change the historical data to account for a new relationship or update an existing schema; you only need to account for the changes going forward.
Reason 2: You need to be able to easily track and audit your data.
Data vault inherently enables auditing, as load times and record sources are required for every row. It also tracks a history of all changes as satellites include the load time as part of the primary key. When an attribute is updated, a new record is created. All of this auditing enables you to easily provide auditability for both regulatory and data governance purposes. Because you store all of your history, you can access data from any point in time.
Reason 3: You need data from multiple systems to load quickly.
Data vault also enables quicker data loading because many of the tables can be loaded at the same time in parallel. The model decreases dependencies between tables during the load process and simplifies the ingestion process by leveraging inserts only, which load quicker than upserts or merges.
When Would I Not Use Data Vault?
Reason 1: You need to load data directly into your reporting tool.
First and foremost, a data vault model should never directly feed into your reporting tool. Due to the necessity of the three types of tables, it would require your reporting tool to marry together all related tables to report on one subject area. These joins would slow down report performance and introduce the opportunity for error because reporting tools are not meant to do that form of data manipulation. The data vault model would need to feed data into a dimensional model or have an added reporting layer to enhance report performance. If you plan to implement a data model that can be directly reported on, you should create a dimensional model.
Reason 2: You only have one source system and/or relatively static data.
Another situation where data vaults are not a great fit is when data is relatively static or comes from a single source system. In these cases, you won’t be able to glean many of the benefits of data vault, and a dimensional model may be more simplistic and require less data manipulation. Implementing a data vault would require an increased amount of business logic where it is not needed.
Furthermore, data vault requires a lot more storage. Splitting up a subject area into three different tables essentially increases the number of tables by at least a multiple of three — not to mention the inserts-only nature of those tables. For these reasons, a data vault model is not worth implementing if your data is straightforward and the benefits mentioned above can be easily attained through a more simple dimensional model.
While some factors to consider are outlined above, this decision is often nuanced and falls within the gray areas. The impacts outlined above show up in unexpected ways and become more impactful over time. Aptitive has helped many organizations determine if reconsidering their architecture and implementing a data vault model is the answer for their challenges.
If you’re interested in learning more about data modeling, data architecture, or data management in general, fill out our quick contact form to set up a free, no-obligation data consultation and whiteboarding session.
Originally published at https://aptitive.com on June 16, 2020.