# Decision Trees – Part 5

Information gain as discussed in earlier post is “reduction in un-orderedness” in training data due to split. Tree building procedure considers attributes that maximize information gain. Entropy , Information gain are explained in below posts

Entropy at node (N) with probability (p) that an event out of two mutually exclusive events (like head / tail) for class variables can be computed as

E(N) = – probability(p) * Log(2)(p) – probability (1 – p) * Log(2)(1 – p)

This above equation can be generalized for multi class as

E(N) –p1 * log2(p1) – p2*log2(p2)……..-pn*log2(pn) = i between 1 to n Sigma(pi*lo2(pi), mathematically represented as

If node is split into multiple branches, then each branch with subset has their own entropy (computed using earlier formula). To aggregate entropies at branches and compare them with parent node, a weighted approach is considered. Branches with larger number of values / nodes with higher entropies are penalized compared to larger number of values / node with lower entropies.

If Si is ith  subset after split from a universal set S then summarized entropy of all i subsets is

where

• Si = Number of elements in ith subset
• S = Number of elements in super set (at node before split)
• E(Si) = Entropy of ith Subset.

Information Gain can be computed as difference between (Un)orderedness of Node and summarized unorderedness of child nodes with subset of data (super set of data at node) during split. More the entropy difference better information gain. If set S is split into i sets based on attribute values (A) then information gain is Entropy(S) – Summarized Entropies (child nodes / subsets). Information Gain in set S  split by attribute A in i number of subsets, mathematically

• When subset contains homogenous elements (Every element belongs to same class of dependent variable), entropy is minimum equals Zero (0)
• When subset contains equally likely elements, entropy is maximum equals One (1)

Update: Added graph for Entropy to indicate that its values are between 0 to 1 and since it is logarithmic it is a smooth curve.

• 0 >= Entropy <= 1

• If splitting node results in pure subset nodes (entropy 0) then Information Gain is equal to entropy of node.
• If splitting node does not improve information gain, aggregated entropies equal to node entropy.

If in dataset there is an attribute that uniquely identifies rows, using such an attribute to split data would result in all pure data sets though singleton. Thus information again of using such attribute would be highest.

For example any attributes like below would result in pure splits and thus information gain would be maximum.

• Customer Name or
• Pin code or Phone Numbers
• Dates or Time or Timestamps
• Identity
• Colors

Decision Tree algorithms are biased towards such attributes. Even mathematically, since information again is maximum such attribute should be used for splitting data. But are such tree really useful. If such an attribute if at all is used for decision tree, resultant tree would have

• Tree would be wide (due to many leaf nodes)
• For example if there is a table with 1 M records, each with Phone number attribute (IV), tree would have 1 M leaves each with individual Phones attribute and class output
• Such trees would not be useful
• Over fitting: New row will not have same value but a different value. Thus tree fails to classify them into correct leaf. So decision tree built on such attributes is not useful.
• Fragmentation: Performance of tree is bad as it has to traverse 1 M records even if Phone Number repeats. Compute how many steps are required to read this data.

To balance bias of information gain towards attributes with lot of values, “Gain Ratio” is used. Prior to computing “Gain Ratio”, understand and compute “Intrinsic Information”. Intrinsic information can be intuitively thought maximum number of steps required to retrieve correct leaf node post split operation. In other terms, it is entropy of distribution post split. If a split results in more number of child nodes then probability of retrieving single child node is 1/(Number of child nodes). Entropy of such distribution is equal to (1/Number of Child Nodes)*Log2(1/Number of Child Nodes). Mathematically it can be written as

Gain Ratio now is equal to (GR) = Information Gain / Intrinsic Information.

There are also other mechanisms like Gini Index

Future posts would dwell upon Gini Index, data (Types) and impact on decision trees , data preparation for decision trees, validation of decision trees and decision trees (for classification and regression) problems..

Let me know if there are more things to be added in future posts specific to decision trees and also feel free to put in comments so that I can learn from them.