Naive Bayes (Introduction to Probability) – Part 1

Naive Bayes algorithm is based on mathematical foundations of probability and is an extension of conditional probability. My goal at end of this NAIVE BAYES series is to read, understand, experience and exhilarate and enjoy Naive Bayes and transform myself into a Bayesian geek :). So let me start from ground up and in this post detailing high level probability concepts. Sorry this is going to be one long post.

  • Definition of Probability:
    • Probability is a numerical statement or quantification of likelihood of a result or an outcome of phenomenon or experiment or situation whose outcome is uncertain. Probability can also be understood to be quantification of randomness (risk), randomness that makes result all the more uncertain.
    • Probability enables to think in systematic manner providing quantified outcomes about arbitrary uncertain events.
    • Probabilities axioms:
      • Non Negativity: Probabilities will be non negative. Probability of event ranges from 0 to 1 (inclusive). 0 <= Probability <= 1.
        • 0 implies event will not occur and
        • 1 implies even will occur
    • Normalization: Probability of entire sample space (all possible output) is 1.
    • Additivity: if P (A & B) = NULL Set then P(A union B) = P(A) + P(B)
  • Probabilistic Sample Space: For an experiment all possible outcomes constitute sample space. It is one super set where all events (subsets) belong.  Sample space can be discrete / finite or continuous and infinite based on experiment / situation considered.
    • Example
      • Discrete and finite: Outcome of a dice thrown would have 6 distinct results with each result different from others and it is finite as it has only 6 possible outcomes.
      • Continuous and infinite: Assume we draw imaginary circles in sky to explore stars inside that circle. Probability of finding stars within circle is an event or subspace. Sky is superset and infinite. sample space can be continuous. Another example could be stock price of a company. Sample space could range from 0 to a large set of number and if not rounded to nearest number will be continuous number.
  • Event: Event is subset of sample space that is of interest for experiment. Of all outcomes of an experiment, event is that outcome that is related to scope of experiment or study.
  • Types of Probabilities:
    • There are 2 types of probabilities. Objective Probabilities and Subjective Probabilities.
    • Objective probability: Probability derived from past or historical data. Additionally it can also be derived from “Classical” or “Logical” probability where probability of events can be derived without performing experiments.

Examples of Objective probability: Capturing student marks for past academic years. Using frequency distribution as basis of probability, it can be predicted that any student randomly picked during current or future academic years has high likelihood would get marks between 60 to 80 (Sum of Probabilities P(60) + P(70) + P(80) = .2 + .34 + .2 = 0.74)

Count of Student

Marks Probability (Relative)
5 40 0.03
15 50 0.10
30 60 0.20
50 70 0.34
30 80 0.20
15 90 0.10
5 100 0.03

In Classical probability likelihood of events can be computed without performing experiments and is mostly useful in academics. Example

  • Tossing a fair coin and finding probability of either  “Head” or “Tail”. (1/2)
  • Rolling a fair dice and finding probability of any number between 1 to 6 (1/6)
  • Drawing a red colored card from a pack of 52 cards. (26/52 = 1/2)

Subjective probabilities are cases where past records may or may not be useful in quantification. Instead experiences of experts either through methods of survey are conducted to quantify or predict. Example could be sending questionnaire and seeking response from analyst about likelihood of stock market crash.

Note: Naive Bayes is an example of Objective probabilities where past data is used to build frequency tables and algorithm learns likelihood of each outcome.

Types of Event:

Mutually Exclusive Event: An event is said to be mutually exclusive if only one (of all possible) outcome is possible for any arbitrary event. Examples

  • An incoming Email is marked as “Spam” or “Not – Spam” but it can not be both.
  • A day selected a random will belong to any but only one month at a time. There are 12 distinct outcomes and each of them are mutually exclusive.
  • Toss of coin would result in “Head” or “Tail” but not both.

Below Spam and Not Spam are mutually exclusive and collectively exhaustive. In cases of events with only two (collectively exhaustive) and mutually exclusive outcomes if probability of one case is known probability of other case is 1 – probability(known case). If probability of Not Spam is 90% or .9 then probability of spam is 1 – 0.9 = .1 or 10% of mails received are marked spam.

Classification model generally is also an example of “Mutually Exclusive” events as data categorized by classifier belongs to a single class. Each class output is disjoint set with other outputs.


Collectively Exhaustive: A event is collectively exhaustive if list of outcomes includes every possible outcome. Above spam or not spam is an example of collectively exhaustive event thus sum of P(Spam) + P(Not Spam) = 1. Similarly random day selected belonging to any of 12 months (Jan to Dec) is also mutually exclusive and collectively exhaustive.

Equally Likely: Events with outcomes where likelihood of each outcome is same are equally likely. Example in a fair coin toss, both “Head” and “Tail” outcomes are “Equally likely” , “Mutually Exclusive” and “Collectively exhaustive”. In real world practical scenarios it is highly unlikely that all events are equally likely. Even coin toss is not equally likely and thus always marked as fair coin. (

Experiments are understood by knowing context better. An experiment may contain a single trial or multiple trial. For example, experiment may be a single toss of coin and other experiment maybe toss of 4 coins repeated over times. “Joint” and “Conditional” probabilities capture probabilities of events.

Joint probabilities deal events across multiple experiments where combination (both) of events is computed. Events from multiple experiments may be independent and dependent.

Conditional probabilities deal event across multiple experiments that captures how outcome of one experiment impacts outcome of other experiment.

Independent Events: Events in which outcome of one trail does not have impact or effect on outcome of subsequent trials. Example Student securing 70% score in exam and spam in emails are independent events.

For example in Naive Bayes, features are assumed to be independent of each other and are only related to label (output) variable. Thus the name “Naive” as  it is highly likely that in pratical scenarios, such assumptions would fail.

Dependent Events: Events in which outcome of one event impacts outcome of second event.

  • Probability of Salary greater than 100 K is dependent  upon Number of years of education.
  • Probability of Rain is dependent upon cloud formation.

Probability with mutually exclusive events:

Mutually exclusive, collectively exhaustive and equally likely deal with outcomes of single instance of an experiment. Independent and dependent events deal with outcomes of multiple experiments how experiments are related to each other to impact outcome of each experiment (event) .

Below is case that details all events (collectively exhaustive) of a roll of dice experiment. As can be noticed each event is mutually exclusive or disjoint sets (as there is no intersection). Now if we are interested in events that result in even numbers from a single roll (2,4,6) then probability of each event (1/6) and probability of event with result an even number is 1/6 + 1/6 + 1/6 = 3/6 = 1/2.  Generalizing, for events that are mutually exclusive can be computed by simple sum.

P(On1 or On2 or On3… Onk) = P(n1) + P(n2) + P(n3)… P(nk).


Probability of events where outcomes are mutually exclusive can be computed by adding individual probabilities of each outcome If A and B are mutually exclusive outcomes of an event, then P(A or B) = P(A) + P(B) and P(A & B) = 0 as there is no intersection between P(A) and P(B).

Probability with non-mutually exclusive events:

If events are not mutually exclusive, that is there is an overlap, in such cases probability is computed as P(A or B) = P(A) + P(B) – P(A & B).

  • P(A or B) = P(A) + P(B) – P(A & B)
  • P (A or B) = P(A) + P(B and A’s complement)
    • A’s complement is entire sample space expect or other than A.
  • P (A or B or C) = P(A) + P(B and Ac) + P(C and Bc and Ac)
    • Ac is complement of A
    • Bc is complement of B

Below is an example of case where outcomes of experiments are NOT mutually exclusive. What is probability that any card drawn from pack of cards is either a “Red Card” or any of King Faced card. There are 26 red cards out of 52 cards in a pack. Probability of red cards is 26 / 52 = 1/2. But that red cards is inclusive of 2 king faced red card as well. Similarly King Cards are 4 in 52 cards. Thus 4/52 = 1/13 that is inclusive of 2 king cards that are in Red color. Coming to question to compute probability of outcomes of events that are not mutually exclusive

P(Red Cards or King Faced cards) = P(Red Cards) + P(King Faced Cards) – P(Red cards that are King faced). 1/2 + 1/13 – 1/26 = 0.54

As both Red Card outcome as well as King faced cards outcome include King faced red cards there is a double counting of “king faced red cards”. To correct probability we deduct once intersection of Red cards that are King faced


  • For Mutually exclusive or disjoint events:
    • P(A or B) = P(A) + P(B)
  • For Non Mutually exclusive or overlapping events:
    • P(A or B) = P(A) + P(B) – P(A intersection B)
    • P (A union B) <= P(A) + P(B)
    • If A is subset of B then P(A) < P(B)

Joint probabilities with independent events: When outcome of experiment does not have impact on outcome of next experiment then joint probability is product of probabilities of each event.

P(A & B) = P(A) x P(B)

Example: A fair 6 sided dice is rolled 2 times. What is probability that both dice rolled result in number 6. As individual dice roll is mutually exclusive, probability of landing on 6 is 1/6. Requirement that experiments should have 1/6. Probability (6 on first roll and 6 on second roll) = P(6 on first roll) * P(6 on second roll) = 1/6 x 1/6 = 1/36.

Conditional probabilities of Independent events: When outcome of an experiment does not have impact on outcome of next experiment than conditional probability P(A | B) (read probability of A give probability of B) is probability of  A.

  • P(A | B) = P(A) or
  • P(B | A) = P(B)

Example: A fair coin is tossed two times. What is probability that coin lands on head given that first experiment resulted in a Tail. As outcomes of both events are independent, outcome of first experiment does not have any impact on second.

  • P(H given T) = P(H) = 1/2.
  • P(T given H) = P(T) = 1/2

Next post conditional probability of dependent events and then jump in Bayes Theorem.


Linear Regression.. Simple understanding..

A uni-variate regression captures linear relationship between an independent variable and a dependent variable. For each value of X  and corresponding Y values are plotted as shown below. Intentionally I have put Y as 1.5 * X. There are lots of notes on regression analysis and there could be much more theoretical explanations.


But some of points that need to be noted are

  • Each point on scatter plotter graph is an instance of a data and Regression line aims to generate pattern of relationship between X and Y variables.

When we try to draw a regression line for above data there are 2 degrees of freedom for us to use.

1. Angle of regression line wrt. to X and Y axis (Called slope).

2. Intercept of regression line i.e where would regression line cross Y axis when X = 0.

For above example, regression line is as below, where  slope is 1.5 and intercept is 0.


Excel provides a way to control intercept and thus angle or slope of regression line. Above case is regression line with intercept = 0. Changing intercept values, forces a different slope as regression line should be such that there should be minimum deviation of estimated from actual value. That is why if we notice last case where intercept is 1000, regression line R2 value is negative indicating, regression line is NOT a best fit line for provided data and infact it has opposite slope. In other words, last case, though our data points have a positive correlation using intercept, a negative correlation has been forced.

Regression Line with intercept = 25


Regression Line with intercept = 50


Regression Line with intercept = 100



Regression Line with intercept = 1000


This was a simple thought process (for those of you who play with stats gurus, this is a trivial thing), but when have to understand “How is excel computing values of slope and intercept” in such a manner that estimated value  is as near to actual value (Gradient descent??)