As discussed in previous post, Entropy mathematically helps to compute purity (homogeneity) or Impure (mixed) of segment of dataset.

The goal of splitting data into multiple subsets to incrementally and iteratively reduce entropy and if possible result in subsets (at leaf level) that are completely homogenous.

At each iteration of split, goal is to select feature that reduces entropy to maximum extent possible. The difference between entropies at any node pre and post split is information again.

To understand it better, as we know entropy is measure of impurity (disorder) and any reduction is impurity reduces entropy. When split is done, impurity has to be reduced and thus entropy is reduced. Reduction is entropy implies accuracy of prediction increases. That accuracy of prediction increase at every node due to split is called “Information Gain” due to split and can be mathematically computed.

**Information Gain:** Information gain for a feature (that is used to split data) is calculated as difference between entropy in segment before split and partitions resulting from split, with entropy aggregated to node level.

If E1 is entropy at a Node and feature (F) is used to split data resulting in partitions of data,

Information Gain (due to F) = E1 – (Weighted average of Entropies of child nodes)

Going back to original data in first post Decision Tress Part I if data were not be split at root level there are (4 cases of Hire (Yes) and 8 cases of NO Hire (No)).

Computing entropy (impurity at root node) = (–4/12)*Log2(4/12) – (8/12)*Log2(8/12) = 0.918296.

Using feature Skill, split data and calculate “Entropy” and “Information Gain”.

- Skill features partitions data into
- High: 3 Yes, 0 No. Entropy = 0 (as it homogenous subset all “Yes”)
- Medium: 1 Yes and 1 No. Entropy = 1 (as it equally likely)
- Low: 0 Yes and 7 No. Entropy = 0 (as it homogenous, all “No”)

- Entropy aggregated to Root Node = Sum of Entropies of child nodes = 0.17
- (3/12)*entropy(High) = 0
- (2/12)*Entropy(Medium) = 1/6 = 0.17
- (7/12)*Entropy(Low) = 0

Information Gain = Entropy before split – Entropy of partitions aggregated

= 0.918296 – 0.166667 = 0.751629

Summary, Information Gain is measure of reduction in entropy due to split of data by a feature. A Feature that increases information gain is maximum suitable for split. For a node to split, maximum Information gain is equal to entropy of node (most useful) and minimum information again is equal to 0 (least useful for split).

This post concludes basics of decision trees and math behind it.. Next post details some important points related to decision trees (like over fitting under fitting, data perpetration etc)

Until then..

Guru