I have tried many ways to shake off my lethargy and force myself to blog. Here is one more idea: every day I will think about / learn / understand roughly five new things and blog about them. Let's see if I can turn this initiative into a habit.
Data Vault (DV) modeling:
Start with Disclaimer 🙂 :
Over the last week (including today) I have read and tried to understand "Data Vault" modeling, though admittedly I have not experienced it enough to comment. I have always believed that learning / understanding spans multiple degrees, ranging from "Read –> Understand –> Explain –> Experience", with "Read" and "Experience" being the lowest and highest levels of learning. With regard to DV modeling, I DO NOT have enough implementation experience to give a fair comparison. Treat the below as my learning, not as a passionate war :).
Objective of Data Vault (DV) modeling: Based on readings from multiple blogs,
- DV modeling tries to retain data in its "as-is" form from the source systems, avoiding the costly and time-consuming conversions needed when loading data into DWH data models
- Loading into a star schema involves format (structure) conversions and data quality checks that delay the entire process (design of the data models and load of data into dimensional models)
- Detail oriented: keeps the same grain as the source systems, thus avoiding "on the fly" aggregation in the data model
- On systems with a high velocity of data flow (like Telecom, weblogs), the general practice (though optional) is to aggregate data to higher levels to aid reporting, but this loses the detail-level data permanently
- Agile (both in modeling and loading).
- Minimal transformation of data format from source
- Agile BI vs Waterfall BI
- Ability to Audit and track changes
Basics of DV modeling: Data Vault modeling consists of
- Hub: business keys + surrogate keys + source system & load timestamp metadata
- Links: capture relationships between Hubs
- Satellites: capture the descriptive context (and its history) of Hubs and Links.
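To make the three building blocks above concrete, here is a minimal sketch in Python using illustrative Customer/Order entities (the entity and field names are my own assumptions, not from any source I read; in practice these would be database tables, not classes):

```python
from dataclasses import dataclass
from datetime import datetime

# Hub: one row per business key, plus load metadata.
@dataclass(frozen=True)
class HubCustomer:
    customer_sk: int        # surrogate key
    customer_bk: str        # business key from the source system
    record_source: str      # which source system supplied the key
    load_ts: datetime       # when the key was first seen

# Link: captures a relationship between Hubs via their surrogate keys.
@dataclass(frozen=True)
class LinkCustomerOrder:
    link_sk: int
    customer_sk: int        # refers to HubCustomer
    order_sk: int           # refers to a HubOrder (not shown)
    record_source: str
    load_ts: datetime

# Satellite: descriptive context for a Hub (or Link), versioned by load_ts,
# so every change to the source data becomes a new row rather than an update.
@dataclass(frozen=True)
class SatCustomerDetails:
    customer_sk: int        # refers to HubCustomer
    load_ts: datetime       # part of the key: each change is a new row
    name: str
    city: str
```

Note how the Satellite carries `load_ts` as part of its identity: retaining every version "as-is" is what gives DV its audit and change-tracking ability mentioned above.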
Comparing with Dimension Model:
| Dimension Model | Data Vault Model | Remarks |
| --- | --- | --- |
| Dimension | Hub + Satellite | |
| Fact | Link + Satellite + other Links if needed | |
The DV model is best suited as a layer between the 3NF of the source systems and the star schemas (2NF dimensions and 3NF facts) of the dimensional model. IMHO, the transformation from source systems (3NF) to a DV model should be easier, as there are minimal lookups, transformations, and conversions. But a DV model cannot be exposed directly to consumers of the DWH service; instead, a dimensional model (physical or virtual) needs to be built on top of it. Physical means converting the DV into a dimensional model; virtual means creating views over the DV model. My recommendation would be the virtual data model (using views), as this aligns with the agile development goal of DV modeling. So, if the dimensional model is the model for reporting, the DV model is the middle layer consuming data from a backend model that is closer to 3NF.

As an aside: Microsoft SQL Server Analysis Services internally maintains KStores, RStores, and HStores. A KStore is analogous to a Hub and an RStore to a Link, but in SSAS these are maintained per attribute within a dimension, whereas in DV they span dimensions.
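To illustrate the "virtual dimensional model" idea, here is a toy sketch in Python that emulates what such a view would do: join a Hub to the latest Satellite row per key to produce a dimension. In a real implementation this would be a SQL view over the DV tables; the data and names below are purely illustrative assumptions:

```python
from datetime import datetime

# Toy hub and satellite rows as dicts (illustrative data only).
hub_customer = [
    {"customer_sk": 1, "customer_bk": "C-1001"},
    {"customer_sk": 2, "customer_bk": "C-1002"},
]
sat_customer = [
    {"customer_sk": 1, "load_ts": datetime(2020, 1, 1), "city": "Pune"},
    {"customer_sk": 1, "load_ts": datetime(2021, 6, 1), "city": "Mumbai"},
    {"customer_sk": 2, "load_ts": datetime(2020, 3, 1), "city": "Chennai"},
]

def dim_customer():
    """Emulates a view: hub joined to the latest satellite row per key."""
    latest = {}
    for row in sorted(sat_customer, key=lambda r: r["load_ts"]):
        latest[row["customer_sk"]] = row   # later rows overwrite earlier ones
    return [
        {**h, "city": latest[h["customer_sk"]]["city"]}
        for h in hub_customer
    ]
```

Because the Satellite keeps every historical row, the same join logic can be changed to pick rows as of any past date, which is how the DV layer can serve different dimensional views without reloading data.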
At a high level, this is what I have understood about how DV stands with respect to other DWH models. I am still in the midst of reading, understanding, and internalizing the various articles listed below:
http://tdan.com/data-vault-series-5-loading-practices/5285 (Articles 1 to 5)
Some lingering questions I have:
How does DV deal with SCD Type 2 and Type 3 dimensions?
How does DV modeling deal with different fact types (transactional, snapshot, etc.)?
Is it agile in the true sense, and can the model be developed iteratively?
Is the Data Vault model suited to new-age companies with MPP and Hadoop architectures?
What pitfalls with DV models can be envisaged and avoided?
Need to learn and think more… moving on to the second learning / thought of the day.