I tried so many ways to kick out lethargy in me and force me to blog. This is one more thought that I had. Every day I would think / learn / understand about 5 new things and blog about those 5 new things.. See if this initiative I can make it into a habit.

Mathematical thought “why in a distributed system there will always be a definite failure of at the least one system but very remote chances for entire system failure”

It always intrigued me when in documentation / presentation it is mentioned “In a highly distributed systems, there is / are always failed single or multiple systems somewhere but entire solution should withstand such failures”. Today I read it again in documentation of Redis, so instead of going through document, I digressed on what it meant and finally came up with below answer :).

Assume an unreliable system, that keeps failing frequently. Now someone asks question “Will it fail tomorrow?”. Answer would be probably but not sure. Mathematically it implies that there is 50% (1/2) chance that system will fail and same chance that it will not fail.

P(Failure) = 1/2 and P(Not Failure) = 1/2

To increase reliability we add one more system with same reliability i.e, even for second system

P(Failure) = 1/2 and P(Not Failure) = 1/2

Assuming these two systems are independent of each other:

- Probability that both systems fail at same time = P(Failure (A)) and P(Failure (B)) = 1/2 * 1/2 = 1/4 = 25%.
- Probability that either of system fail = P(Failure (A)) or P(Failure (B)) = 1/2 + 1/2 = 1 = 100% :).

Now assume if we have 10s of such unreliable system then

- Probability of all system fail at same time = 1/(2^10) = .1% (failure) ie 99.9% System as a whole would be available.
- Probability of atleast one system failure = 1/2 + 1/2 + 1/2……1/2 > 1. Implies there would be a definite failure of atleast one system.

Admittedly this is a very novice understanding and there could be more complexities involved in real world computations like conditional probability, but this atleast for new suffices as an answer to question that I always left me wondering..

May be if I can expand further, if we give x as reliability of individual systems (by prediction) and required reliability from entire systems, may be such a computation can be reversed engineered to some meaningful calculations..

Till my next learning…

Abhyast