The Data Vault – What is it? – Why do we need it? (click to download)
As mentioned in my last blog (Big Data & NoSQL Technologies) understanding and using the right tools for the job is essential. Database systems and applications are no exception. For Big Data or Small Data, knowing and using the right data modeling and methodology is also critical. Today, business uses the Internet to make money yet harvesting the data in an efficient way often develops into a considerable IT challenge. The ongoing debate behind the many issues involved are not the focus of this blog however, the concepts presented here are.
Data Warehouse and Business Intelligence (DW/BI) Systems have become a solution of choice for industries to address the many critical business questions ingrained in this vital data. Providing an essential data integration process supporting a variety of reporting requirements, DW/BI Systems often involve significant construction effort, on-going maintenance, and reengineering effort when data structures and business rules change. And they do change; presenting us with problem #1.
Building effective and efficient DW/BI Systems can quickly become a daunting and difficult technical ordeal even for veteran engineering teams. Several integrated technologies are required from database systems, data transformation (ETL) tools, various programming languages, administration and reporting software to high performance networks and computers having very large storage capacities. In many cases even the sheer volume of data can be overwhelming; Yup, problem #2.
Once reporting is delivered to diverse user communities that consume the data simplified usability is expected yet technological realities today often require highly trained end-users. The design, creation, delivery, and support of a robust, effortless DW/BI System for intelligent use, sadly, are complex. You guessed it; problem #3.
Furthermore, as upstream systems change, and as DW/BI technology plows ahead, and as the dynamic complexities involved prevail, we also find that every so often new data sources need to be added to the mix. These are usually unpredicted and unplanned for. The integration impact can be enormous of requiring complete regeneration of the aggregated data; hence, problem #4.
So how do we solve these problems? Well …
Bill Inmon widely regarded as the father of data warehousing, defines a data warehouse as:
“a subject oriented, nonvolatile, time-variant collection of data in support of management’s decisions” (https://en.wikipedia.org/wiki/Bill_Inmon).
Ralph Kimball, a pioneering data warehousing architect, developed the “dimensional modeling” methodology now regarded as the de-facto standard in the area of decision support. The Dimensional Model (called a “star schema”) is different from Inman’s “normalized modeling” (sometimes called a “snowflake schema”) methodology. In Kimball’s Star Schema, transactional data is partitioned into aggregated “facts” with referential “dimensions” surrounding and providing descriptors that define the facts. The Normalized Model (3NF or “third normal form”) stores data in related “tables” following relational database design rules established by E. F. Codd and Raymond F. Boyce in the early 1970’s that eliminate data redundancy. Fostering vigorous debate amongst DW/BI Architects as to which methodology is best, both have weakness when dealing with inevitable changes in the systems feeding the data warehouse and in cleansing data to conform to strict methodology requirements
Further, the OLAP cube (for “online analytical processing”) is a data structure that allows fast analysis of data from multiple perspectives. The cube structure is created from either a Star or Snowflake Schema stored as metadata from which one can view or “pivot” the data in various ways. Generally cubes have one time based dimension that supports a historical representation of data. Creating OLAP cubes can be very expensive and often create a significant amount of data that is of little or no use. The 80/20 rule appears in many cases to hold true (where only 20% of the OLAP cube data proves useful) which begs the question: Built upon a traditional architecture does an OLAP cube truly deliver sufficient ROI? Often, the answer is a resounding, NO! Durable DW/BI systems must deliver real value.
** A Different Approach **
Data Vault Modeling is a hybrid methodology providing historical data storage from multiple sources designed to be resilient to environmental changes. Dan Linstedt, the creator of this methodology, describes the resulting database as:
“A detail oriented, historical tracking and uniquely linked set of normalized tables that support one or more functional areas of business. It is a hybrid approach encompassing the best of breed between 3NF and Star Schemas. The design is flexible, scalable, consistent and adaptable to the needs of the enterprise.” (https://en.wikipedia.org/wiki/Data_Vault_Modeling)
One Data Vault philosophy is that all data is relevant, even if it is wrong. Data being wrong is a business problem, not a technical one. I agree! This simple premise of the Data Vault is to ingest 100% of the source data 100% of the time, good, bad, or ugly. Relevant in today’s world, auditability and traceability of all the data in the data warehouse thus become a standard requirement. This data model is architected specifically to meet the needs of today’s enterprise DW/BI Systems:
“To understand the Data Vault is to understand the business”
Focused on the business process, the Data Vault as a data integration architecture, has robust standards and definitional methods which unite information in order to make sense if it. The Data Vault model is comprised of three basic table types:
HUB – containing a list of unique business keys having its own surrogate key
LNK – establishes relationships between business keys (typically hubs, but links can link to other links)
SAT – holding descriptive attributes that can change over time (similar to a Kimball Type II slowly changing dimension)
There are several advantages to the Data Vault approach:
- Simplifies the data ingestion process
- Removes the cleansing requirement of a Star Schema
- Instantly provides auditability for HIPPA and other regulations
- Puts the focus on the real problem instead of programming around it
- Easily allows for the addition of new data sources without disruption to existing schema
In conclusion, the Data Vault modeling and methodology addresses most elements of the problems we identified above:
- It adapts to a changing business environment
- It simplifies the DW/BI design complexities
- It increases usability by business users because it is modeled after the business domain
- It allows for new data sources to be added without impacting the existing design
As the architect of TrinityIDC, a DV/BI cloud based solution, I can say that the more I use the Data Vault modeling and methodology, the more I love to use it! Ask us at DB Best to help you design yours!