“Learn 3 things A Day – Learning 2”

I have tried many ways to shake off my lethargy and force myself to blog. This is one more attempt: every day I will think about, learn, or understand three new things and blog about them, and see whether I can turn this initiative into a habit.

Mathematical thought: “Why, in a distributed system, is there almost always a definite failure of at least one system, yet only a very remote chance of the entire system failing?”

It has always intrigued me when documentation or presentations mention something like “In a highly distributed system there are always one or more failed systems somewhere, but the overall solution should withstand such failures.” Today I read this again in the Redis documentation, so instead of going on through the document, I digressed into what it actually means and finally came up with the answer below :).

 

Assume an unreliable system that keeps failing frequently. Now someone asks, “Will it fail tomorrow?” The answer would be “probably, but not for sure.” For this thought experiment, let us model that as a 50% (1/2) chance that the system will fail and the same chance that it will not.

P(Failure) = 1/2 and P(Not Failure) = 1/2

To increase reliability we add one more system with the same reliability, i.e. for the second system too

P(Failure) = 1/2 and P(Not Failure) = 1/2

 

Assuming these two systems are independent of each other:

  • Probability that both systems fail at the same time = P(Failure(A) and Failure(B)) = 1/2 × 1/2 = 1/4 = 25%.
  • Probability that at least one of them fails = P(Failure(A) or Failure(B)) = 1/2 + 1/2 − 1/4 = 3/4 = 75% (probabilities of non-exclusive events cannot simply be added, otherwise we would get an absurd 100% :)).

 

Now assume we have 10 such unreliable systems; then

  • Probability that all 10 systems fail at the same time = 1/(2^10) ≈ 0.1%, i.e. the system as a whole would be available 99.9% of the time.
  • Probability that at least one system fails = 1 − (1/2)^10 ≈ 99.9%. In other words, at any point it is almost certain that at least one system is failing, even though the solution as a whole almost never fails. (The short sketch below verifies both numbers.)
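To sanity-check these numbers, here is a minimal sketch (my own illustration in Python, assuming independent failures and the 50% per-system failure chance used above):

```python
# Failure maths for n independent systems, each failing with probability p_fail.

def p_all_fail(n, p_fail=0.5):
    """Probability that all n systems fail at the same time."""
    return p_fail ** n

def p_at_least_one_fails(n, p_fail=0.5):
    """Probability that at least one system fails = 1 - P(no system fails)."""
    return 1 - (1 - p_fail) ** n

for n in (1, 2, 10):
    print(f"n={n:2d}: all fail = {p_all_fail(n):.4%}, at least one fails = {p_at_least_one_fails(n):.4%}")
```

For n = 10 this prints roughly 0.0977% and 99.9023%, matching the two bullets above.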

 

Admittedly this is a very novice understanding, and real-world computations involve more complexity (such as conditional probabilities and failures that are not independent), but for now it at least suffices as an answer to a question that always left me wondering.

 

Maybe I can expand this further: if we take x as the predicted reliability of an individual system and specify the reliability required of the entire system, such a computation can perhaps be reverse engineered into something meaningful, for example how many systems we need.
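As a rough sketch of that reverse computation (again my own illustration, assuming independent failures): if a single system fails with probability x and we want the chance of all replicas failing together to stay below 1 − R for a target availability R, we need x^n ≤ 1 − R, i.e. n ≥ log(1 − R) / log(x).

```python
import math

def replicas_needed(x, target_availability):
    """Smallest n such that P(all n systems fail at once) = x**n <= 1 - target_availability."""
    return math.ceil(math.log(1 - target_availability) / math.log(x))

print(replicas_needed(0.5, 0.999))   # -> 10: ten 50%-reliable systems give "three nines" overall
print(replicas_needed(0.2, 0.9999))  # -> 6: fewer replicas needed when each node is more reliable
```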

 

Till my next learning…

Abhyast


“Learn 3 things a Day – Learning 1”

I have tried many ways to shake off my lethargy and force myself to blog. This is one more attempt: every day I will think about, learn, or understand three new things and blog about them, and see whether I can turn this initiative into a habit.

Data Vault (DV) modeling:

Starting with a disclaimer 🙂:

Over the last week (today included) I have read about and tried to understand “Data Vault” modeling, though admittedly I have not used it enough to comment. I have always believed that learning and understanding span multiple degrees, ranging from “Read –> Understand –> Explain –> Experience”, with “Read” and “Experience” being the lowest and highest levels. With regard to DV modeling, I DO NOT have enough implementation experience to give a fair comparison or verdict. Treat the below as my learning, not as any passionate war :).

Objective of Data Vault (DV) modeling (based on readings from multiple blogs):

  • DV modeling tries to retain data in its “as-is” form from the source systems, avoiding costly and time-consuming conversions when loading data into DWH data models.
    • Loading into a star schema involves format (structure) conversions and data-quality checks that delay the entire process (designing the data models and loading the data into dimensional models).
  • Detail oriented: it keeps the same grain as the source systems, thus avoiding “on the fly” aggregation in the data model.
    • On systems with a high velocity of data flow (such as telecom or weblogs), the general practice (though optional) is to aggregate data to higher levels to aid reporting, but the detailed data is then lost permanently.
  • Agile (both in modeling and in loading).
    • Minimal transformation of data format from the source.
    • Agile BI vs. waterfall BI.
  • Ability to audit and track changes.

Basics of DV modeling: Data Vault modeling consists of three building blocks (a small illustrative sketch follows the list):

  • Hub: business keys + surrogate keys + source system & load timestamps.
  • Links: capture relationships between Hubs.
  • Satellites: capture the descriptive context of Hubs and Links.
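To make the three building blocks a bit more concrete, here is a tiny, purely illustrative sketch in Python (the field names and the customer example are my own assumptions, not from any Data Vault reference implementation):

```python
from dataclasses import dataclass
from datetime import datetime

@dataclass
class Hub:                        # one row per business key
    hub_key: int                  # surrogate key
    business_key: str             # e.g. the customer number from the source system
    load_ts: datetime             # when the key was first seen
    record_source: str            # which source system it came from

@dataclass
class Link:                       # one row per relationship between hubs
    link_key: int
    hub_keys: tuple               # surrogate keys of the hubs being related
    load_ts: datetime
    record_source: str

@dataclass
class Satellite:                  # descriptive context for a hub or a link
    parent_key: int               # the hub_key or link_key this context describes
    attributes: dict              # e.g. {"name": ..., "city": ...}; history is kept by adding rows
    load_ts: datetime
    record_source: str

# A customer hub row and its descriptive satellite row
customer = Hub(1, "CUST-1001", datetime(2016, 1, 1), "CRM")
context  = Satellite(1, {"name": "Alice", "city": "Pune"}, datetime(2016, 1, 1), "CRM")
```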

Comparing with the dimensional model:

Dimension Model | Data Vault Model                             | Remarks
Dimension       | Hub + Satellite(s)                           |
Fact            | Link + Satellite(s) + other Links if needed  |

The DV model sits best between the 3NF of the source systems and the star schemas of the dimensional model (roughly 2NF dimensions and 3NF facts). IMHO, the transformation from the source systems (3NF) to a DV model should be easier, as there are minimal lookups, transformations and conversions. But the DV model cannot be exposed directly to consumers of the DWH service; instead a dimensional model (physical or virtual) needs to be built on top of it. Physical means converting the DV into a materialized dimensional model; virtual means creating views on top of the DV tables. The recommendation should be to create a virtual model (using views), as this aligns with the agile development goal of DV modeling; a rough sketch of such a “virtual dimension” appears below.

So, if the dimensional model is the model for reporting, DV modeling is the middle-layer model that consumes data from a back-end model that is closer to 3NF. As an aside, Microsoft SQL Server Analysis Services internally maintains KStores, RStores and HStores; a KStore is analogous to a Hub and an RStore to a Link, but in SSAS these are maintained per attribute within a dimension, whereas in DV they span dimensions.
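As an illustration of the “virtual” dimensional model idea, the sketch below (my own example using pandas, with made-up table and column names) builds a customer dimension as the latest satellite row per hub key joined back to the hub, which is essentially the query a database view on top of the DV tables would run:

```python
import pandas as pd

# Hub: stable business keys. Satellite: versioned descriptive context (illustrative data).
hub_customer = pd.DataFrame({"hub_key": [1, 2], "customer_no": ["C-1001", "C-1002"]})
sat_customer = pd.DataFrame({
    "hub_key": [1, 1, 2],
    "name":    ["Alice", "Alice", "Bob"],
    "city":    ["Pune", "Mumbai", "Delhi"],
    "load_ts": pd.to_datetime(["2016-01-01", "2016-03-01", "2016-01-01"]),
})

# Latest satellite row per hub key, joined back to the hub = a "virtual" customer dimension.
latest = sat_customer.sort_values("load_ts").groupby("hub_key").tail(1)
dim_customer = hub_customer.merge(latest, on="hub_key")
print(dim_customer)
```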

At a high level, this is what I have understood about how DV stands with respect to other DWH models. I am still in the midst of reading, understanding and internalizing the various articles on the topic.

 

  • Some questions still lingering:
    • How does DV deal with SCD Type 2 and Type 3 dimensions?
    • How does DV modeling deal with different fact types (transactional, snapshot, etc.)?
    • Is it agile in the true sense, and can the model be developed iteratively?
    • Is the Data Vault model suited to new-age companies with MPP and Hadoop architectures?
    • What pitfalls with DV models can be envisaged and avoided?

Need to learn and think more… moving on to the 2nd learning / thought of the day…

Happy Learning………….

–Abhyast