Naive Bayes (Introduction to Probability) – Part 1

The Naive Bayes algorithm is built on the mathematical foundations of probability and is an extension of conditional probability. My goal at the end of this Naive Bayes series is to read, understand, experience and enjoy Naive Bayes and transform myself into a Bayesian geek :). So let me start from the ground up and, in this post, detail high-level probability concepts. Sorry, this is going to be one long post.

  • Definition of Probability:
    • Probability is a numerical statement, or quantification, of the likelihood of a result or outcome of a phenomenon, experiment or situation whose outcome is uncertain. Probability can also be understood as a quantification of randomness (risk), the randomness that makes the result all the more uncertain.
    • Probability enables us to think in a systematic manner, providing quantified statements about arbitrary uncertain events.
    • Probability axioms:
      • Non-negativity: Probabilities are non-negative. The probability of an event ranges from 0 to 1 (inclusive): 0 <= P(A) <= 1.
        • 0 implies the event will not occur and
        • 1 implies the event will occur
      • Normalization: The probability of the entire sample space (all possible outcomes) is 1.
      • Additivity: If A and B are disjoint (A intersection B is the empty set), then P(A union B) = P(A) + P(B).
  • Probabilistic Sample Space: For an experiment, all possible outcomes constitute the sample space. It is the one super set to which all events (subsets) belong. A sample space can be discrete and finite, or continuous and infinite, based on the experiment / situation considered.
    • Example
      • Discrete and finite: The outcome of a die throw has 6 distinct results, each different from the others, and it is finite as there are only 6 possible outcomes.
      • Continuous and infinite: Assume we draw imaginary circles in the sky to explore the stars inside each circle. Finding stars within a circle is an event or subspace; the sky is the super set and is infinite, so the sample space can be continuous. Another example is the stock price of a company: the sample space ranges from 0 to an arbitrarily large number and, if not rounded, is continuous.
  • Event: An event is a subset of the sample space that is of interest for the experiment. Of all outcomes of an experiment, an event is the outcome that is related to the scope of the experiment or study.
  • Types of Probabilities:
    • There are 2 types of probabilities: objective probabilities and subjective probabilities.
    • Objective probability: Probability derived from past or historical data. It can also be derived as “classical” or “logical” probability, where the probability of events is derived without performing experiments.

Example of objective probability: capturing student marks for past academic years. Using the frequency distribution as the basis of probability, it can be predicted that any student randomly picked during the current or future academic years has a high likelihood of scoring between 60 and 80 (sum of probabilities P(60) + P(70) + P(80) = 0.20 + 0.34 + 0.20 = 0.74).

Count of Students   Marks   Probability (Relative)
5                   40      0.03
15                  50      0.10
30                  60      0.20
50                  70      0.34
30                  80      0.20
15                  90      0.10
5                   100     0.03
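The relative-frequency idea above can be reproduced in a few lines of Python (a sketch using the hypothetical counts from the table). Note the exact fraction for marks between 60 and 80 is 110/150 ≈ 0.73; the table's rounded figures sum to 0.74.

```python
# Relative-frequency probabilities from the marks table above.
counts = {40: 5, 50: 15, 60: 30, 70: 50, 80: 30, 90: 15, 100: 5}
total = sum(counts.values())  # 150 students

probs = {marks: n / total for marks, n in counts.items()}

# Normalization axiom: probabilities over the whole sample space sum to 1.
assert abs(sum(probs.values()) - 1.0) < 1e-9

# Additivity over disjoint outcomes: P(60) + P(70) + P(80).
p_60_to_80 = probs[60] + probs[70] + probs[80]
print(round(p_60_to_80, 2))  # → 0.73
```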

In classical probability, the likelihood of events can be computed without performing experiments; it is mostly useful in academic settings. Examples:

  • Tossing a fair coin and finding probability of either  “Head” or “Tail”. (1/2)
  • Rolling a fair dice and finding probability of any number between 1 to 6 (1/6)
  • Drawing a red colored card from a pack of 52 cards. (26/52 = 1/2)

Subjective probabilities cover cases where past records may or may not be useful for quantification. Instead, the experience of experts, gathered through methods such as surveys, is used to quantify or predict. An example could be sending a questionnaire to analysts seeking their view on the likelihood of a stock market crash.

Note: Naive Bayes is an example of objective probability, where past data is used to build frequency tables and the algorithm learns the likelihood of each outcome.

Types of Event:

Mutually Exclusive Events: Events are mutually exclusive if only one of all possible outcomes can occur in any single trial; the occurrence of one rules out the others. Examples

  • An incoming Email is marked as “Spam” or “Not – Spam” but it can not be both.
  • A day selected at random will belong to one and only one month at a time. There are 12 distinct outcomes and each of them is mutually exclusive.
  • Toss of coin would result in “Head” or “Tail” but not both.

Below, Spam and Not Spam are mutually exclusive and collectively exhaustive. For events with only two (collectively exhaustive) mutually exclusive outcomes, if the probability of one case is known, the probability of the other case is 1 – P(known case). If the probability of Not Spam is 90% or 0.9, then the probability of Spam is 1 – 0.9 = 0.1, i.e. 10% of mails received are marked spam.

A classification model is generally also an example of “mutually exclusive” events, as data categorized by the classifier belongs to a single class. Each class output is a set disjoint from the other outputs.


Collectively Exhaustive: A set of events is collectively exhaustive if the list of outcomes includes every possible outcome. Spam or Not Spam above is an example of collectively exhaustive events, thus P(Spam) + P(Not Spam) = 1. Similarly, a random day belonging to one of the 12 months (Jan to Dec) is also mutually exclusive and collectively exhaustive.

Equally Likely: Events where the likelihood of each outcome is the same are equally likely. For example, in a fair coin toss, both “Head” and “Tail” outcomes are “equally likely”, “mutually exclusive” and “collectively exhaustive”. In real-world practical scenarios it is highly unlikely that all events are equally likely. Even a coin toss is not exactly equally likely, which is why it is always qualified as a fair coin.

Experiments are understood better by knowing their context. An experiment may contain a single trial or multiple trials. For example, one experiment may be a single toss of a coin and another may be a toss of 4 coins repeated over time. “Joint” and “conditional” probabilities capture probabilities of such events.

Joint probabilities deal with events across multiple experiments, where the probability of a combination (both) of events is computed. Events from multiple experiments may be independent or dependent.

Conditional probabilities deal with events across multiple experiments, capturing how the outcome of one experiment impacts the outcome of the other.

Independent Events: Events in which the outcome of one trial has no impact or effect on the outcome of subsequent trials. For example, a student securing a 70% score in an exam and an email being spam are independent events.

For example, in Naive Bayes, features are assumed to be independent of each other and related only to the label (output) variable. Thus the name “Naive”, as in practical scenarios it is highly likely that such assumptions would fail.

Dependent Events: Events in which outcome of one event impacts outcome of second event.

  • The probability of a salary greater than 100K is dependent upon the number of years of education.
  • Probability of Rain is dependent upon cloud formation.

Probability with mutually exclusive events:

Mutually exclusive, collectively exhaustive and equally likely deal with the outcomes of a single instance of an experiment. Independent and dependent events deal with the outcomes of multiple experiments and how those experiments relate to each other to impact the outcome of each event.

Below is a case that details all events (collectively exhaustive) of a dice-roll experiment. As can be noticed, the events are mutually exclusive or disjoint sets (there is no intersection). Now if we are interested in the events that result in an even number from a single roll (2, 4, 6), then the probability of each event is 1/6 and the probability of an even-number result is 1/6 + 1/6 + 1/6 = 3/6 = 1/2. Generalizing, the probability of a union of mutually exclusive events can be computed by a simple sum.

P(O1 or O2 or O3 … or Ok) = P(O1) + P(O2) + P(O3) + … + P(Ok)


The probability of events whose outcomes are mutually exclusive can be computed by adding the individual probabilities of each outcome. If A and B are mutually exclusive outcomes of an event, then P(A or B) = P(A) + P(B), and P(A & B) = 0 as there is no intersection between A and B.

Probability with non-mutually exclusive events:

If events are not mutually exclusive, that is there is an overlap, in such cases probability is computed as P(A or B) = P(A) + P(B) – P(A & B).

  • P(A or B) = P(A) + P(B) – P(A & B)
  • P (A or B) = P(A) + P(B and A’s complement)
    • A’s complement is the entire sample space except (other than) A.
  • P (A or B or C) = P(A) + P(B and Ac) + P(C and Bc and Ac)
    • Ac is complement of A
    • Bc is complement of B
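The formulas above can be checked on a small sample space. A minimal sketch, using a single die roll with hypothetical events A (even number) and B (number greater than 3); `Fraction` keeps the arithmetic exact:

```python
from fractions import Fraction

# Sample space: one roll of a fair die.
omega = {1, 2, 3, 4, 5, 6}
A = {2, 4, 6}   # even number
B = {4, 5, 6}   # number greater than 3

def p(event):
    """Probability under equally likely outcomes."""
    return Fraction(len(event), len(omega))

# Inclusion-exclusion: P(A or B) = P(A) + P(B) - P(A & B)
assert p(A | B) == p(A) + p(B) - p(A & B)

# Complement form: P(A or B) = P(A) + P(B and Ac)
assert p(A | B) == p(A) + p(B - A)

# Union bound: P(A union B) <= P(A) + P(B)
assert p(A | B) <= p(A) + p(B)
```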

Below is an example of a case where outcomes are NOT mutually exclusive. What is the probability that a card drawn from a pack of cards is either a “Red” card or a “King”? There are 26 red cards out of 52 in a pack, so the probability of a red card is 26/52 = 1/2; but the red cards include the 2 red kings as well. Similarly, there are 4 kings among the 52 cards, thus 4/52 = 1/13, and that count includes the 2 kings that are red. Coming to the question, to compute the probability of outcomes that are not mutually exclusive:

P(Red Cards or King Cards) = P(Red Cards) + P(King Cards) – P(Red Cards that are Kings) = 1/2 + 1/13 – 1/26 = 28/52 ≈ 0.54

As both the red-card outcome and the king outcome include the red kings, there is a double counting of “red kings”. To correct the probability, we deduct the intersection (red cards that are kings) once.
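The card example can be verified by enumerating the deck (a sketch; the rank and suit labels are my own):

```python
from itertools import product

ranks = ['A', '2', '3', '4', '5', '6', '7', '8', '9', '10', 'J', 'Q', 'K']
suits = ['hearts', 'diamonds', 'clubs', 'spades']  # hearts/diamonds are red
deck = list(product(ranks, suits))
assert len(deck) == 52

red = {c for c in deck if c[1] in ('hearts', 'diamonds')}
kings = {c for c in deck if c[0] == 'K'}

# Direct count of the union vs the inclusion-exclusion formula.
p_red_or_king = len(red | kings) / len(deck)                       # 28/52
p_by_formula = (len(red) + len(kings) - len(red & kings)) / len(deck)

assert p_red_or_king == p_by_formula
print(round(p_red_or_king, 2))  # → 0.54
```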


  • For Mutually exclusive or disjoint events:
    • P(A or B) = P(A) + P(B)
  • For Non Mutually exclusive or overlapping events:
    • P(A or B) = P(A) + P(B) – P(A intersection B)
    • P (A union B) <= P(A) + P(B)
    • If A is a subset of B then P(A) <= P(B)

Joint probabilities with independent events: When the outcome of one experiment has no impact on the outcome of the next, the joint probability is the product of the probabilities of each event.

P(A & B) = P(A) x P(B)

Example: A fair 6-sided die is rolled 2 times. What is the probability that both rolls result in the number 6? As the outcomes of each roll are equally likely, the probability of landing on 6 is 1/6. P(6 on first roll and 6 on second roll) = P(6 on first roll) x P(6 on second roll) = 1/6 x 1/6 = 1/36.
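The product rule can be confirmed by enumerating all 36 equally likely outcomes of the two rolls:

```python
from fractions import Fraction
from itertools import product

# All 36 equally likely outcomes of two independent die rolls.
outcomes = list(product(range(1, 7), repeat=2))
p_both_six = Fraction(sum(1 for a, b in outcomes if a == 6 and b == 6),
                      len(outcomes))

# Product rule for independent events: P(6 and 6) = P(6) * P(6) = 1/36.
assert p_both_six == Fraction(1, 6) * Fraction(1, 6)
```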

Conditional probabilities of independent events: When the outcome of one experiment has no impact on the outcome of the next, then the conditional probability P(A | B) (read: probability of A given B) is simply the probability of A.

  • P(A | B) = P(A) or
  • P(B | A) = P(B)

Example: A fair coin is tossed two times. What is the probability that the coin lands on Heads given that the first toss resulted in Tails? As the two tosses are independent, the outcome of the first has no impact on the second.

  • P(H given T) = P(H) = 1/2.
  • P(T given H) = P(T) = 1/2

The next post covers conditional probability of dependent events, and then we jump into Bayes' Theorem.


Decision Trees – Part 8

Preceding seven posts detailed

  • Decision Trees basics
  • Attribute to split data on basis of impurity
    • Information Gain (C5), Gain Ratio
    • GINI Index (CART)
    • Misclassification Error

In this post, I will try to detail data types and the impact of data types on Decision Trees.

In a data set there are attributes whose values need to be predicted (dependent variables) and attributes that help in predicting, called independent variables.

Variable Type        Data Type    Model
Dependent Variable   Categorical  Classification
Dependent Variable   Continuous   Regression

Independent variables can be categorical or continuous:

  • Categorical: Can be used for both classification and regression models. But if there are too many distinct categorical values (like Date, Name, Phone Number, Pin Code), it is better to either ignore such columns or convert them to values that are useful for prediction.
  • Continuous: Again, can be used for both classification and regression models. With continuous variables, understanding the specific algorithm is important. Some implementations quantize values based on a single threshold, others may quantize based on a range, min, max or a normal curve, and there could be many other methods I am unaware of.

This post details, as much as possible, data understanding and preparation specific to independent variables.

Categorical Variables:

  • Null Values: Attributes that have all or most of values blank may not be useful for prediction, so can be removed / ignored.
  • Identical Values: Attributes with same value for all examples are not useful, so can be safely removed / ignored.
  • Unique Values: Attributes with a unique value for each example, like an alphanumeric identifier (other than date, timestamp or numeric) — Phone Numbers, Pin Codes, Names, User or Login Names, Email Addresses — are not directly useful, but parts of them may be.
    • Phone Number: Generally the first few digits of phone numbers are shared and may correspond to a circle (location). If there are no other attributes that link to geography, maybe this can be used (if accurate; I have never worked on it).
    • Zip Code: Same as above.
    • Name: Will the name of a person / company / entity have any bearing on the output, and if so, are names (either first or last) repeated?
      • For example, for predicting child names based on parent names, names may be useful. But if, say, the problem of interest is related to sales, will a name have an impact on the sale? If not, remove the column; else retain it but split it into first and last names (the last name may repeat more often).
    • Email Address: Take the domain name (after the @).
  • Correlated Attributes: Find independent variables (IVs) that are highly correlated with one another. Due to the high correlation, only one of such attributes should be used.
    • Example: Month Name (Jan.. Dec) and Month Numbers (1 to 12) are highly correlated.
    • Using either of them to split data would result in same sub tree structure.
    • Such scenarios use only one attribute for modeling.
    • Selection of which attribute to use depends upon
      • Performance (Select high performing attribute)
      • Business scenario (Select descriptive).
      • For most input data, knowledge of domain and data are necessary to take such decisions.
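The clean-up suggestions for unique-valued attributes above might be sketched as follows, assuming a toy list-of-dicts dataset; the column names and values are entirely hypothetical:

```python
# Hypothetical rows; in practice these would come from the training data.
records = [
    {'phone': '4085551234', 'email': 'alice@acme.com', 'name': 'Alice Smith'},
    {'phone': '4085559876', 'email': 'bob@initech.com', 'name': 'Bob Jones'},
]

def derive_features(row):
    """Replace unique identifiers with coarser, potentially predictive parts."""
    return {
        'area_code': row['phone'][:3],               # leading digits ~ location
        'email_domain': row['email'].split('@')[1],  # keep only the domain
        'last_name': row['name'].split()[-1],        # last names may repeat
    }

features = [derive_features(r) for r in records]
print(features[0]['email_domain'])  # → acme.com
```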

Continuous Variables:

  • Excel in some cases converts date and time to numerical values. These will need to be converted back to date and time before preparing the data. Working with date and time includes
    • Splitting date and time components into Date (Day , Month, Quarter, Semester, Year) and Time (Hours, Minutes, Seconds) depending upon requirements.
      • Notes for me: Need to check if numerical representation of date and / or time has better impact on regression models.

Decision Trees – Part 7

In earlier posts we discussed how an attribute is selected based on Information Gain (Entropy) and the GINI Index. Similar to those computations, “Misclassification Error” is another method to select the optimal attribute to split on and build decision trees.

At a node (N) with S total elements, of which count(i) elements belong to class i, the “Misclassification Error” can be computed as

Misclassification Error(N) = 1 – max over i of p(i), where p(i) = count(i) / S

Misclassification error ranges between 0 (minimum) and 0.5 (maximum) for a two-class problem.


Below is a table where nodes split into “Hired” and “Not Hired”, showing the computed values of Entropy, the GINI impurity index and Misclassification error.

Hired Not Hired Total Entropy GINI Misc.
0 10 10 0.00 0 0
1 9 10 0.47 0.18 0.1
2 8 10 0.72 0.32 0.2
3 7 10 0.88 0.42 0.3
4 6 10 0.97 0.48 0.4
5 5 10 1.00 0.5 0.5
6 4 10 0.97 0.48 0.4
7 3 10 0.88 0.42 0.3
8 2 10 0.72 0.32 0.2
9 1 10 0.47 0.18 0.1
10 0 10 0.00 0 0
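The three impurity measures in the table can be reproduced with a short script (a sketch for the two-class case, where p is the proportion of “Hired”):

```python
import math

def entropy(p):
    """Two-class entropy; the 0*log(0) terms are treated as 0."""
    if p in (0.0, 1.0):
        return 0.0
    return -p * math.log2(p) - (1 - p) * math.log2(1 - p)

def gini(p):
    """Two-class GINI impurity: 1 - p^2 - (1-p)^2, i.e. 2*p*(1-p)."""
    return 1 - p**2 - (1 - p)**2

def misclassification(p):
    """Error of always predicting the majority class."""
    return 1 - max(p, 1 - p)

# Reproduce a few rows of the table above (p = Hired / Total).
assert round(entropy(1/10), 2) == 0.47
assert round(gini(1/10), 2) == 0.18
assert round(misclassification(1/10), 2) == 0.10
# All three peak at the 5/5 split.
assert entropy(0.5) == 1.0 and gini(0.5) == 0.5 and misclassification(0.5) == 0.5
```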


  • To select an appropriate attribute for splitting, decision trees use an “impurity reduction” method.
  • Impurity of nodes can be computed using
    • Information Gain
    • GINI Impurity Index
    • Misclassification Error
  • Impurity reduction can be computed as the difference between the “impurity at the node before the split” and the “aggregated impurities of all child nodes”.
  • The Information Gain method is biased towards categorical attributes that have many distinct values (singleton splits with 100% purity). To avoid this, an enhanced measurement called “Gain Ratio” is used.

Next post is data types and impact of data types on Decision Trees..


Decision Trees – Part 6

Similar to Information Gain, alternate measures can be used to measure the impurity of a node and thus play a role in the selection of an attribute to split a node into sub nodes (branches) or leaves.

GINI: Similar to Information Gain, GINI measures the impurity of a node. GINI is an alternative and can be used in place of Information Gain. For example, CART (Classification and Regression Trees) uses the GINI index for splitting decision tree nodes.

If a node (N) with “s” total data elements has count(i) data elements of class “i”, then

GINI(N) = 1 – Σ p(i)^2, where p(i) = count(i) / s and the sum runs over all classes i
Similar to entropy, GINI impurity index values range from 0 to 0.5; plotted, they form a smooth curve that peaks at 0.5 for an even split.


While entropy values range between 0 and 1, GINI index values range between 0 and 0.5. Additionally, since GINI already accounts for the number of classes (and their counts), it may not need a correction like “Gain Ratio” as in the case of Information Gain.

See below comparison of Entropy and GINI values with different splits of Hired and Not Hired in 10 total candidates.

Hired Not Hired Total Entropy GINI
0 10 10 0 0
1 9 10 0.468996 0.18
2 8 10 0.721928 0.32
3 7 10 0.881291 0.42
4 6 10 0.970951 0.48
5 5 10 1 0.5
6 4 10 0.970951 0.48
7 3 10 0.881291 0.42
8 2 10 0.721928 0.32
9 1 10 0.468996 0.18
10 0 10 0 0

Similar to computing “Information Gain” for a split using attribute “A” at a node, GINI impurity is first computed at the parent node, and the selection of an attribute to split on is based on the reduction in impurity. If a node (N) with “t” total elements is split into “k” sub nodes, with sub node i containing t(i) elements, the aggregated GINI impurity is

GINI(split) = Σ (t(i) / t) × GINI(sub node i), summed over the k sub nodes
As with “Information Gain”, when using the GINI impurity index, attributes whose split of the parent node results in larger nodes (more values) with higher purity are preferred.
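The weighted aggregation might be sketched as below, with hypothetical Hired / Not Hired counts; splitting a 5/5 parent into [4, 1] and [1, 4] children reduces the GINI impurity from 0.5 to 0.32:

```python
def gini_from_counts(counts):
    """GINI impurity of a node given its per-class counts."""
    total = sum(counts)
    return 1 - sum((c / total) ** 2 for c in counts)

def weighted_gini(children):
    """Aggregate child-node GINI, each weighted by the child's share of rows."""
    total = sum(sum(c) for c in children)
    return sum(sum(c) / total * gini_from_counts(c) for c in children)

parent = gini_from_counts([5, 5])      # 0.5 at the parent node
children = [[4, 1], [1, 4]]            # hypothetical split
agg = weighted_gini(children)          # 0.32 aggregated over children

print(round(parent - agg, 2))  # impurity reduction → 0.18
```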

Next is measuring impurity using “Misclassification Error”.

Decision Trees – Part 5

Information Gain, as discussed in an earlier post, is the “reduction in un-orderedness” in training data due to a split. The tree-building procedure selects attributes that maximize information gain. Entropy and Information Gain are explained in the posts below.

Entropy at a node (N), with probability (p) of one of two mutually exclusive outcomes (like head / tail) of the class variable, can be computed as

E(N) = – p * log2(p) – (1 – p) * log2(1 – p)

This equation can be generalized for the multi-class case as

E(N) = – p1 * log2(p1) – p2 * log2(p2) – … – pn * log2(pn) = – Σ (i = 1 to n) pi * log2(pi)


If a node is split into multiple branches, each branch (subset) has its own entropy, computed using the earlier formula. To aggregate the branch entropies and compare them with the parent node, a weighted approach is used: each branch's entropy is weighted by its share of the elements, so branches with many values and high entropy are penalized relative to branches with many values and low entropy.

If Si is the ith subset after splitting a universal set S, then the summarized (weighted) entropy of all i subsets is

E(split) = Σ (|Si| / |S|) × E(Si)

  • |Si| = Number of elements in the ith subset
  • |S| = Number of elements in the super set (at the node before the split)
  • E(Si) = Entropy of the ith subset.

Information Gain can be computed as the difference between the (un)orderedness of the node and the summarized unorderedness of the child nodes (which hold subsets of the node's data) after the split. The greater the entropy difference, the better the information gain. If set S is split into i subsets based on the values of attribute A, then mathematically

Gain(S, A) = E(S) – Σ (|Si| / |S|) × E(Si)

Additional Notes:

  • When a subset contains homogenous elements (every element belongs to the same class of the dependent variable), entropy is at its minimum, equal to zero (0)
  • When a subset contains equally likely elements, entropy is at its maximum, equal to one (1)

Update: Added graph for Entropy to indicate that its values are between 0 to 1 and since it is logarithmic it is a smooth curve.

  • 0 <= Entropy <= 1


  • If splitting a node results in pure subset nodes (entropy 0), then the Information Gain is equal to the entropy of the node.
  • If splitting a node does not improve information gain, the aggregated entropies are equal to the node entropy.

If a dataset has an attribute that uniquely identifies rows, using such an attribute to split the data would result in all pure, though singleton, subsets. Thus the information gain of using such an attribute would be the highest.

For example, any attribute like the below would result in pure splits, and thus information gain would be maximum.

  • Customer Name or
  • Pin code or Phone Numbers
  • Dates or Time or Timestamps
  • Identity
  • Colors

Decision Tree algorithms are biased towards such attributes. Even mathematically, since the information gain is maximum, such an attribute should be used for splitting data. But are such trees really useful? If such an attribute is used for a decision tree at all, the resultant tree would have

  • A wide tree (due to many leaf nodes)
    • For example, if there is a table with 1 M records, each with a Phone Number attribute (IV), the tree would have 1 M leaves, each with an individual phone number and class output
  • Such trees would not be useful
    • Over-fitting: A new row will not have the same value but a different one, so the tree fails to classify it into a correct leaf. A decision tree built on such attributes is therefore not useful.
    • Fragmentation: The performance of the tree is bad, as it has to traverse 1 M leaves even when a phone number repeats. Consider how many steps are required to read this data.

To balance the bias of information gain towards attributes with many values, “Gain Ratio” is used. Prior to computing “Gain Ratio”, understand and compute “Intrinsic Information”. Intrinsic information can intuitively be thought of as the entropy of the split distribution itself: how the rows spread over the child nodes, regardless of class. If a split results in child nodes with |Si| elements each out of |S| total, then mathematically

Intrinsic Information = – Σ (|Si| / |S|) × log2(|Si| / |S|)

For a split into k equally sized child nodes, the probability of reaching any single child node is 1/k and the intrinsic information reduces to log2(k).
Gain Ratio (GR) = Information Gain / Intrinsic Information.
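Intrinsic information and gain ratio can be sketched as follows; the child sizes are hypothetical, and a split into four equal children has intrinsic information log2(4) = 2 bits:

```python
import math

def entropy(probs):
    """Entropy of a discrete distribution; zero-probability terms are skipped."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

def intrinsic_information(child_sizes):
    """Entropy of the split distribution itself (how rows spread over
    child nodes), independent of class labels."""
    total = sum(child_sizes)
    return entropy([s / total for s in child_sizes])

def gain_ratio(information_gain, child_sizes):
    return information_gain / intrinsic_information(child_sizes)

# A 4-way equal split has intrinsic information log2(4) = 2 bits, so the
# same information gain yields a lower gain ratio than a 2-way equal split.
assert intrinsic_information([25, 25, 25, 25]) == 2.0
assert gain_ratio(0.8, [25, 25, 25, 25]) < gain_ratio(0.8, [50, 50])
```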

There are also other mechanisms like Gini Index

Future posts would dwell upon Gini Index, data (Types) and impact on decision trees , data preparation for decision trees, validation of decision trees and decision trees (for classification and regression) problems..

Let me know if there are more things to be added in future posts specific to decision trees and also feel free to put in comments so that I can learn from them.

Thanks for reading….


Decision Trees – Part 4

Decision tree is a recursive algorithm, implying it splits until

  • All subsets are pure (entropy = 0, homogenous)
  • Entropy is not equal to 0 but there are no more attributes to split on
  • Entropy is not equal to 0, but a predefined tree size is set and the tree has reached that limit.

Decision trees keep splitting data until every subset is homogenous and pure, if there are enough attributes available. That implies decision trees will split the data to such an extent that, to make subsets pure, they may end up with singleton leaves, each containing only 1 value.

Say, for example, there is a column (like an identity) that uniquely identifies rows. Using such a column for splitting, the decision tree algorithm will classify each value into its own leaf (thus making it homogenous but at the same time not useful).

If a decision tree classifies data at that level of detail (using an identity, unique identifier or transaction dates), such a classifier is not useful: any new value will differ from the existing values and the classifier will fail to classify it. Such iterative splitting, leading to perfectly homogenous leaves, results in over-fitting of the algorithm to the sample data.

If the size of a decision tree is equal to the number of examples in the data set, then clearly the decision tree is over-fitted to the training data.

During training, every split forms a new layer of branches and / or leaves, increasing the size of the decision tree. An increase in the size of a decision tree increases its accuracy on training data but at the same time decreases its effectiveness on test data (due to over-fitting).

If a decision tree is over-fitted, it is very specific to the training data and not generalized, and thus not useful for test data or any data other than the training set. To avoid over-fitting (or over-growing), decision trees are pruned. The main purposes of pruning are

  • Reduce decision tree classifier complexity and thus increasing prediction accuracy (removal of over-fitting of decision tree algorithm to training data)
  • Removal of portions of tree (branches / roots) that is based on erroneous data or noisy data thus again not useful for prediction (as classifier is not generalized).

There are many approaches to decision tree pruning, with math behind each. These pruning techniques follow either a top-down (root to leaf) or bottom-up (leaf to root) approach. Additionally, pruning can be done while building or after building the decision tree.

Considering volume and variety of decision trees pruning techniques, there will be separate post for pruning.

Summary: Decision trees have a tendency to grow until all leaves are homogenous and thus may end up with singleton leaf nodes. If a decision tree predicts the training data with 100% accuracy, it may be a case of over-fitting, where the classifier is not generalized for subsequent use with test or real business data. To avoid over-fitting, decision trees are pruned (removing branches and / or leaf nodes) without reducing prediction accuracy.


Decision Trees – Part 3

As discussed in the previous post, entropy helps to mathematically compute how pure (homogenous) or impure (mixed) a segment of the dataset is.

The goal of splitting data into multiple subsets is to incrementally and iteratively reduce entropy and, if possible, arrive at subsets (at leaf level) that are completely homogenous.

At each iteration of the split, the goal is to select the feature that reduces entropy to the maximum extent possible. The difference between the entropies at any node pre- and post-split is the information gain.

To understand it better: as we know, entropy is a measure of impurity (disorder), and any reduction in impurity reduces entropy. When a split is done well, impurity is reduced and thus entropy is reduced. A reduction in entropy implies the accuracy of prediction increases. That increase in prediction accuracy at a node due to a split is called the “Information Gain” of the split, and it can be computed mathematically.

Information Gain: The information gain for a feature (used to split data) is calculated as the difference between the entropy of the segment before the split and the entropy of the partitions resulting from the split, aggregated back to the node level.

If E1 is entropy at a Node and feature (F) is used to split data resulting in partitions of data,

Information Gain (due to F) = E1 – (Weighted average of Entropies of child nodes)

Going back to the original data in the first post, Decision Trees Part 1: if the data were not split at the root level, there are 4 cases of Hire (Yes) and 8 cases of No Hire (No).

Computing entropy (impurity) at the root node = (–4/12)*Log2(4/12) – (8/12)*Log2(8/12) = 0.918296.


Using feature Skill, split data and calculate “Entropy” and “Information Gain”.

  • Skill features partitions data into
    • High: 3 Yes, 0 No. Entropy = 0 (as it homogenous subset all “Yes”)
    • Medium: 1 Yes and 1 No. Entropy = 1 (as it is equally likely)
    • Low: 0 Yes and 7 No. Entropy = 0 (as it homogenous, all “No”)
  • Entropy aggregated to Root Node = weighted sum of entropies of child nodes = 0.17
    • (3/12)*entropy(High)  = 0
    • (2/12)*Entropy(Medium) = 1/6 = 0.17
    • (7/12)*Entropy(Low) = 0

Information Gain = Entropy before split – Entropy of partitions aggregated

= 0.918296 – 0.166667 = 0.751629
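The worked example above can be reproduced in a few lines (a sketch using the Hire / No Hire counts from the post):

```python
import math

def entropy(counts):
    """Entropy from per-class counts; zero-count classes are skipped."""
    total = sum(counts)
    return -sum(c / total * math.log2(c / total) for c in counts if c)

# Root node: 4 Hire (Yes), 8 No Hire (No).
root = entropy([4, 8])

# Skill partitions: High (3 Yes, 0 No), Medium (1, 1), Low (0, 7).
partitions = [[3, 0], [1, 1], [0, 7]]
total = sum(sum(p) for p in partitions)
weighted = sum(sum(p) / total * entropy(p) for p in partitions)

gain = root - weighted
print(round(root, 6), round(weighted, 6), round(gain, 6))
# → 0.918296 0.166667 0.751629
```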

In summary, Information Gain is a measure of the reduction in entropy due to splitting the data by a feature. The feature that maximizes information gain is the most suitable for the split. For a node split, the maximum information gain is equal to the entropy of the node (most useful) and the minimum information gain is equal to 0 (least useful for the split).

This post concludes the basics of decision trees and the math behind them. The next post details some important points related to decision trees (like over-fitting, under-fitting, data preparation etc.)

Until then..