Decision Trees – Part 4

Decision tree is a recursive algorithm, implying it splits until

  • All Subsets are pure (entropy = 0 or homogenous)
  • Entropy is not equal to 0 but there are no more attributes to split
  • Entropy is not equal to 0, but predefined tree size is set and tree size has reached to set limit.

Decision Trees keep splitting data until every subset is homogenous and pure, if there are enough attributes available. That implies, decision trees will split data to such an extent that to make subsets pure, it may end up with a singleton leaves. And each leaf of decision tree contains only 1 value.

Say for example there is a column (like identity) that uniquely identifies rows. Using such column for splitting, decision trees algorithm will classify each value in a single leaf (thus making it homogenous but at same not useful).

If a decision tree, classifies data to that detail level (like using identity or unique identifier or transaction dates), such a classifier is not useful, as any new value will be different from an existing values and decision tree classifier would fail to classify such data. Such iterative splitting leading to perfect homogenous leaves would lead to over-fitting of algorithm to sample data.

If Size of decision tree is equal to number of examples in data set, then clearly decision tress is over fitted to training data.

During training, with every split a new layer of branches and / or leaves are formed and thus increasing size of decision trees. Increase in size of decision tree increases its accuracy on training data but at same decreases its effectiveness on test data (due to over fitting).

If a decision tree is over fitted with data, it is made very specific to training data and it is not generalized and thus not useful for test data as well as data other than training. To avoid over fitting (or over growing), decision trees are pruned. The main purposes of pruning are

  • Reduce decision tree classifier complexity and thus increasing prediction accuracy (removal of over-fitting of decision tree algorithm to training data)
  • Removal of portions of tree (branches / roots) that is based on erroneous data or noisy data thus again not useful for prediction (as classifier is not generalized).

There are many approaches to decision tree pruning and math behind pruning. Each of these pruning techniques follow either top down (root to leaf) or bottoms up (leaf to root) approaches. Additionally pruning can be done while building or post building decision trees.

Considering volume and variety of decision trees pruning techniques, there will be separate post for pruning.

Summary: Decision trees have tendency to grow till all leaves of tree are homogenous and thus may end up with singleton leaf nodes. If decision trees predict with 100% accuracy training data, there may be case where classifier has over-fitting problem and is not generalized for subsequent use either with test or real business data. To avoid over-fitting decision trees are pruned (removing branches and / or leaf nodes) without reducing prediction accuracy.


Decision Trees – Part 3

As discussed in previous post, Entropy mathematically helps to compute purity (homogeneity) or Impure (mixed) of segment of dataset.

The goal of splitting data into multiple subsets to incrementally and iteratively reduce entropy and if possible result in subsets (at leaf level) that are completely homogenous.

At each iteration of split, goal is to select feature that reduces entropy to maximum extent possible. The difference between entropies at any node pre and post split is information again.

To understand it better, as we know entropy is measure of impurity (disorder) and any reduction is impurity reduces entropy. When split is done, impurity has to be reduced and thus entropy is reduced. Reduction is entropy implies accuracy of prediction increases. That accuracy of prediction increase at every node due to split is called “Information Gain” due to split and can be mathematically computed.

Information Gain: Information gain for a feature (that is used to split data) is calculated as difference between entropy in segment before split and partitions resulting from split, with entropy aggregated to node level.

If E1 is entropy at a Node and feature (F) is used to split data resulting in partitions of data,

Information Gain (due to F) = E1 – (Weighted average of Entropies of child nodes)

Going back to original data in first post Decision Tress Part I if data were not be split at root level there are (4 cases of Hire (Yes) and 8 cases of NO Hire (No)).

Computing entropy (impurity at root node) = (–4/12)*Log2(4/12) – (8/12)*Log2(8/12) = 0.918296.


Using feature Skill, split data and calculate “Entropy” and “Information Gain”.

  • Skill features partitions data into
    • High: 3 Yes, 0 No. Entropy = 0 (as it homogenous subset all “Yes”)
    • Medium: 1 Yes and 1 No. Entropy = 1 (as it equally likely)
    • Low: 0 Yes and 7 No. Entropy = 0 (as it homogenous, all “No”)
  • Entropy aggregated to Root Node = Sum of Entropies of child nodes = 0.17
    • (3/12)*entropy(High)  = 0
    • (2/12)*Entropy(Medium) = 1/6 = 0.17
    • (7/12)*Entropy(Low) = 0

Information Gain = Entropy before split – Entropy of partitions aggregated

= 0.918296 – 0.166667 = 0.751629

Summary, Information Gain is measure of reduction in entropy due to split of data by a feature. A Feature that increases information gain is maximum suitable for split. For a node to split, maximum Information gain is equal to entropy of node (most useful) and minimum information again is equal to 0 (least useful for split).

This post concludes basics of decision trees and math behind it.. Next post details some important points related to decision trees (like over fitting under fitting, data perpetration etc)

Until then..


Decision Trees – Part 2

Before understanding mathematically How an attribute (feature) is selected (from available features) to best split underlying data, let us explore few related terms.

Pure / Impure: After data is split by decision trees it results in a set of subgroups. The splitting is done to improve accuracy of prediction. The best sub group is one where there is 100% accuracy in prediction. That implies if data is split and resultant sub group is 100% accurate in detection, such sub group is called “Pure Group”.

On other hand if splitting results in sub group where resultant group is still a mixture of output (label) variables then such sub groups are impure and called “Impure Groups”. Splitting of data may result in (0 to N) pure sub groups and (0 to N) impure sub groups.

Both purity / impurity of groups are defined with respect to output (label) variable and not features. Features that result in as many sub groups (groups that have homogenous output (label) variables) are best suited for decision trees algorithms.

Consider example in Decision Trees Part 1 where candidate is hired based on experience, technology, skill. Label (output) variable is “Hire” and features attributes are “Experience”, “Technology”,”Skill”. Split of Hire (output) variable using “Skill” feature attribute results in below tree. Skill has 3 different values (High, Medium and Low). Data funnels through root and falls into any of these sub groups. Values in parenthesis (3Y,0N) are values of output (label) variable.

As you can see, split resulted in 3 subgroups with 2 groups pure (Green color) and 1 group impure (Orange). 2 Pure groups either have 100% Yes (hired) or 100% No (Not Hired). Based on this it can be deduced that people with high skill ratings are always hired irrespective of technology or experience and candidates with low skills are never hired.

  • Pure sub groups: Skill Level (High) and Skill Level (Low).
  • Impure sub groups: Skill Level (Medium)


An additional point to note in above example is though there are 2 pure sub groups, they are not symmetric. They are not symmetric because number of examples vary in each sub group.

To understand this point, let us expand this example and in our data set we have 100 rows with all additional 88 rows with medium skill set. If all our 88 candidates are rejected, then out of 90 candidates with medium skill set only 1 is selected. Even though we have a impure sub group here , such  a group is very useful as it splits data that results in high homogeneity. 100% pure sub groups in real world may not be possible but a highly accurate sub groups is also very useful.

Entropy: Entropy is the measure of disorder of a system.

To understand, most used example of entropy is understand difference between a “clean or ordered” room and a “cluttered or unordered” room.

In a clean or ordered room, every thing has its place and is put in its place. Thus there is very little disorder in room and entropy of such room is less compared to a cluttered room where items are not in place there is complete chaos. Effort need to be put to make unordered room with high entropy to become ordered with low entropy.

In Thermodynamics, “Entropy in universe is every increasing with time, until an external work is done against it”. Example take case of Air Conditioners.

With no air condition in room, temperature of room is always in equilibrium with surroundings, external to room as disorder or random movement of air molecules is same. Entropy when measured between room and external surroundings would be same.

With air condition in room (and with doors closed), there is a less movement of air molecules (due to lower temperature) and thus entropy in room is less compared to surroundings till air condition is running and room doors are closed.

Additional example of Entropy with respect to coin tosses.

  • Fair Coin: When a fair coin is tossed, every time it lands either heads / tails. Before toss of coin, predicting output is not possible. Probability will always be 1/2 (either heads or tails). And we can not learn from previous experiments (coin being fair). As outcome is unpredictable, entropy which is measure of unpredictability is high and equal to 1. (Log2(1/2) = 1) 
  • Coin with Heads on both sides: Consider a coin where heads are present on both sides of coin. When such a coin is tossed, it is always certain that it lands on heads. There is no uncertainty in predicting output of such an experiment and thus entropy of such an event is 0.
  • Biased Coin: Coin that has high tendency to fall on tails compared to heads is a biased coin. When coin is tossed first time, entropy is high as unpredictability  is high but as multiple experiments are done using same coin patterns emerge and finally outcome can be predicted with certain degree of certainty. That implies, entropy (which is measure of uncertainty) has come down and prediction rate has improved.

Mathematical formula to compute entropy:



  • i is a variable that ranges from 1 to maximum number of distinct classes post split operations.
  • p(i) is portion /probability of values falling into that class

Back to our example, using Skill feature, data is split into “High”, “Medium” and “Low” and each one has different number of class variables. See figure below or read earlier blog at Decision Trees Part 1

At level 2 (formed due to split using skill feature) there are three sub groups. Let us compute entropy at each group:

For High: It is pure set as split resulted in homogenous class variables (3Y and 0N). Entropy for this data segment equals to

-3/3 X log2(3/3) – 0/3 X log2(0/3) = 0  log2(0/3) = log2(1) = 0 and 0 multiplied by anything is 0)

For Low: It is again a pure split. Thus entropy of that set equals to 0 (-7/7 * log2(7/7) – 7/7 * log2(0/7) = 0)

For Medium set: –1/2*log2(1/2) – 1/2*log2(1/2) = 1. Entropy is 1 as there is highest disorder in this set and both of them are equally likely possible and is highly  unpredictable.


If we know there are 2 output classes, like head / tail then probability of head is X and probability of tail would 1 – X. In such scenarios entropy curve can be obtained by running

curve( – x  * log2 ( x ) – ( 1 – x ) * log ( 1 – x ), col=”red”,ylab=”Entropy)


In next part we talk about “Information Gain” that details how decision tree uses entropy to select appropriate feature to split data..

Till next time..




Decision Trees – Part 1

Decision tree algorithm builds a “Tree Structure” (Root , Branches and Leaf Nodes) from underlying data. Decision trees are built using heuristic called “recursive partitioning” or “divide and conquer”. High level steps how an algorithm works..

Given a dataset,

  • Should split?
    • Is it required to split data to improve prediction accuracy?
  • Can split?
    • Are there enough features to split to improve prediction accuracy?
  • Assuming “Yes” for both above cases,
  • Why to split?
    • Data is split to multiple sub groups to improve prediction and to gain information about splitting data (“Information Gain”)
  • How to split?
    • Based on data, from set of given attributes, choose an attribute that will be used to split data into sub groups (i.e from Root –> Branches).
    • Based on data on selected column, form “Groups” or “Distinct Categories”. Using groups, categorize and quantify output (label) variable.
    • After sub groups are formed, check if prediction can be done with 100% accuracy (as much as possible).
      • If Yes, stop splitting process.
      • If No, continue splitting process until
        • Prediction is 100% accurate
        • No more attributes (features) to split
        • Tree has become so big with many branches it is practically unusable for prediction.

To understand better below is a simple case study.

Companies like,, and others offer employee hiring services to companies and candidates open for jobs. As cost of sourcing is high, quality candidates (open for hiring) is low and rejection rate (due to low quality candidates) by hiring companies high, it becomes business critical to send right candidates with right skills and capabilities to those companies where candidates have high chances of getting hired.

  • How about before sending a candidate for interview, predict if he / she has high probability of getting hired and then take action?
  • Better still, predict skill gaps, suitability of candidate for a company, get candidate ramped up and then send for interview 🙂 ?

To understand how “Decision Trees” can be used for such predictions, below is sample example data. Data is self explanatory.

  • Experience
Technology Skill Hire
8 – 10 Years Development High Yes
8 – 10 Years Database High Yes
8 – 10 Years Database Low No
6 – 8 Years Architecture Low No
6 – 8 Years Architecture Medium Yes
10 – 12 Years Development High Yes
10 – 12 Years Architecture Low No
10 – 12 Years Database Low No
6 – 8 Years Development Low No
6 – 8 Years Database Medium No
8 – 10 Years Database Low No
8 – 10 Years Database Low No

Based on order of columns that are selected for splitting (sub grouping) data, multiple variant decision trees can be formed.

In below trees,

  • Green indicates prediction with 100 accuracy (data in that group is homogenous or belong to one single category of Yes or No)
  • Yellow / Orange indicate prediction is better but not with certainty.
  • Red indicates prediction does not have any amount of certainty.

Tree 1: Split data skill followed by technology:


Tree 2: Split data Experience –> Technology –> Skill


As you can see from above, both lead to accurate prediction in end but Tree 1 is better than Tree 2 both in terms of number of steps involved (Depth) and Categories (Width) of tree.

Choosing right attribute for splitting data at each iteration of splitting is an important aspect that dictates prediction capabilities, optimizations in decision tree models.

Next post involves more mathematical explanation of “Decision Trees” and how an attribute is selected for split (Information Gain and Entropy) continuing with same case.

Until next post…


Wordcloud using R

Wordcloud is cloud of words (visualization technique) that gives prominence to words based on frequency of each word usage from source text. Here is one example how to create wordcloud using R.

Exercise: Read comments from news article / blog and check trending key words for that topic(s) comments.


  • Iterative process to remove stop words as “tm” library tm_map(x,content_transformer(removeStopwords),stopwords()) does not remove all stop words.
  • content_transfomer() function needs to be used mainly in later version of R. Function such as tolower() of “tm” library return different data type.
  • Standalone wordcloud reflects frequently used words but does not reflect sentiments about them. Wordcloud should be used along with sentimental analysis of posts.

Sample comments below:

  1. We should first take care of all connectivity issues in remote areas. Then we should make the rural masses aware of what their rights are and how can they use it. We should make a simple user friendly app wherein the rural masses will get all-in-one assistance (agriculture,health, animal care,bus time table, online mandi trading and online banking facility.
  2. In my opinion, a citizen in rural india who opens a bank account under the PRADHAN MANTRI JAN DHAN YOJNA should be given a free sim card which has internet facility activated by default and lifetime validity.This would encuorage more and more people to open accounts and will lead them with access to mobile facillty,Interet Facility.

R Code:

  • install.packages(“tm”)
  • library(“tm”)
  • tp <- read.csv(“D:/Customer/Srini/TP.csv”,sep=”,”,header=TRUE, stringsAsFactors =FALSE)
  • tpPost <- tp$Post
  • tpc <- Corpus(VectorSource(tpPost))
  • tpc <- tm_map(tpc,content_transformer(removePunctuation))
  • tpc <- tm_map(tpc,content_transfomer(removeNumbers))
  • tpc <- tm_map(tpc,content_transformer(tolower))
  • tpc <- tm_map(tpc,content_transformer(removeWords),stopwords(“English”))
  • tpc <- tm_map(tpc,content_transformer(stripWhitespaces))
  • inspect(tpc[1:10])
  • stopWords <- c(“can”,”make”,”get”,”need”,”use”,”sir”,”like”,”also”,”open”,”call”,”netfor”,”for”,”new”,”ca”,”issued”,”become”,”cscs”,”will”,”lack”,”first”,”care”,”take”)
  • tpc <- tm_map(tpc,content_transformer(removeWords),stopWords)
  • tpc <- tm_map(tpc,content_transformer(stripWhitespace))
  • inspect(tpc[1:10])
  • wordcloud(tpc,min.freq=200,max.words=Inf,random.order=FALSE)


  • pal2 <- brewer.pal(8,”Dark2″)
  • wordcloud(tpc,min.freq=200,max.words=Inf,random.order=FALSE,rot.per=.15,colors=pal2)


  • wordcloud(tpc,min.freq=200,max.words=Inf,random.order=FALSE,scale = c(8,.2),rot.per=.15,colors=pal2)


Until next time…


Linear Regression.. Simple understanding..

A uni-variate regression captures linear relationship between an independent variable and a dependent variable. For each value of X  and corresponding Y values are plotted as shown below. Intentionally I have put Y as 1.5 * X. There are lots of notes on regression analysis and there could be much more theoretical explanations.


But some of points that need to be noted are

  • Each point on scatter plotter graph is an instance of a data and Regression line aims to generate pattern of relationship between X and Y variables.

When we try to draw a regression line for above data there are 2 degrees of freedom for us to use.

1. Angle of regression line wrt. to X and Y axis (Called slope).

2. Intercept of regression line i.e where would regression line cross Y axis when X = 0.

For above example, regression line is as below, where  slope is 1.5 and intercept is 0.


Excel provides a way to control intercept and thus angle or slope of regression line. Above case is regression line with intercept = 0. Changing intercept values, forces a different slope as regression line should be such that there should be minimum deviation of estimated from actual value. That is why if we notice last case where intercept is 1000, regression line R2 value is negative indicating, regression line is NOT a best fit line for provided data and infact it has opposite slope. In other words, last case, though our data points have a positive correlation using intercept, a negative correlation has been forced.

Regression Line with intercept = 25


Regression Line with intercept = 50


Regression Line with intercept = 100



Regression Line with intercept = 1000


This was a simple thought process (for those of you who play with stats gurus, this is a trivial thing), but when have to understand “How is excel computing values of slope and intercept” in such a manner that estimated value  is as near to actual value (Gradient descent??)

Octave: In-built functions (P1)

Octave similar to other language support inbuilt functions. Few of inbuilt functions are listed with examples below.


  1. who: Displays all variables currently in memory.
    • Example:
      • Step 1: m1 = [1 2;3 4] This creates a m1 variable in memory and assigns 2×2 matrix.
      • Step 2: m2 = m1 .^2. This results in a matrix where each element of matrix is squared [1 4; 9 16] and result is assigned to m2.
      • Step 3: m1 + m2. Matrix addition[2 6; 12 20]
      • who(). This function display all variables in memory. Variables displayed are “ans”,”m1” &”m2”
  2. clear: Clears variable(s) from memory.
    • Example:
      • Step 1: m1 = [1 2;3 4] This creates a m1 variable in memory and assigns 2×2 matrix.
      • Step 2: m2 = m1 .^2. This results in a matrix where each element of matrix is squared [1 4; 9 16] and result is assigned to m2.
      • Step 3: m1 + m2. Matrix addition[2 6; 12 20]
      • who(). This function display all variables in memory. Variables displayed are “ans”,”m1” &”m2”
      • clear m1. This clears variable m1 from memory
      • clear() to remove all variables from memory.
  3. pwd(): Present Working Directory
    • Example:
      • Step 1: pwd() This results in current working directory. “C:\Users\<User Name>” is default working directory.
  4. cd(): Change Directory:
    • Example:
      • Step 1: pwd() This results in current working directory. “C:\Users\<User Name>” is default working directory.
      • Step 2: cd(“D://Readiness//Data Science//Machine Learning//Octave Tutorials”) will change present working directory to new path.
      • Step 3: pwd(). This displays new modified path of present working director
  • Note:
    • Octave is case sensitive. “who” is different from “WHO” or “Who”
    • “ans” is a default variable that is created by Octave runtime to store result of any computation. It stores only last result of any computation and is overwritten by next operation. Value of result of computation can be accessed by “ans” variable.