Before understanding mathematically How an attribute (feature) is selected (from available features) to best split underlying data, let us explore few related terms.
Pure / Impure: After data is split by decision trees it results in a set of subgroups. The splitting is done to improve accuracy of prediction. The best sub group is one where there is 100% accuracy in prediction. That implies if data is split and resultant sub group is 100% accurate in detection, such sub group is called “Pure Group”.
On other hand if splitting results in sub group where resultant group is still a mixture of output (label) variables then such sub groups are impure and called “Impure Groups”. Splitting of data may result in (0 to N) pure sub groups and (0 to N) impure sub groups.
Both purity / impurity of groups are defined with respect to output (label) variable and not features. Features that result in as many sub groups (groups that have homogenous output (label) variables) are best suited for decision trees algorithms.
Consider example in Decision Trees Part 1 where candidate is hired based on experience, technology, skill. Label (output) variable is “Hire” and features attributes are “Experience”, “Technology”,”Skill”. Split of Hire (output) variable using “Skill” feature attribute results in below tree. Skill has 3 different values (High, Medium and Low). Data funnels through root and falls into any of these sub groups. Values in parenthesis (3Y,0N) are values of output (label) variable.
As you can see, split resulted in 3 subgroups with 2 groups pure (Green color) and 1 group impure (Orange). 2 Pure groups either have 100% Yes (hired) or 100% No (Not Hired). Based on this it can be deduced that people with high skill ratings are always hired irrespective of technology or experience and candidates with low skills are never hired.

Pure sub groups: Skill Level (High) and Skill Level (Low).

Impure sub groups: Skill Level (Medium)
An additional point to note in above example is though there are 2 pure sub groups, they are not symmetric. They are not symmetric because number of examples vary in each sub group.
To understand this point, let us expand this example and in our data set we have 100 rows with all additional 88 rows with medium skill set. If all our 88 candidates are rejected, then out of 90 candidates with medium skill set only 1 is selected. Even though we have a impure sub group here , such a group is very useful as it splits data that results in high homogeneity. 100% pure sub groups in real world may not be possible but a highly accurate sub groups is also very useful.
Entropy: Entropy is the measure of disorder of a system.
To understand, most used example of entropy is understand difference between a “clean or ordered” room and a “cluttered or unordered” room.
In a clean or ordered room, every thing has its place and is put in its place. Thus there is very little disorder in room and entropy of such room is less compared to a cluttered room where items are not in place there is complete chaos. Effort need to be put to make unordered room with high entropy to become ordered with low entropy.
In Thermodynamics, “Entropy in universe is every increasing with time, until an external work is done against it”. Example take case of Air Conditioners.
With no air condition in room, temperature of room is always in equilibrium with surroundings, external to room as disorder or random movement of air molecules is same. Entropy when measured between room and external surroundings would be same.
With air condition in room (and with doors closed), there is a less movement of air molecules (due to lower temperature) and thus entropy in room is less compared to surroundings till air condition is running and room doors are closed.
Additional example of Entropy with respect to coin tosses.

Fair Coin: When a fair coin is tossed, every time it lands either heads / tails. Before toss of coin, predicting output is not possible. Probability will always be 1/2 (either heads or tails). And we can not learn from previous experiments (coin being fair). As outcome is unpredictable, entropy which is measure of unpredictability is high and equal to 1. (Log2(1/2) = 1)

Coin with Heads on both sides: Consider a coin where heads are present on both sides of coin. When such a coin is tossed, it is always certain that it lands on heads. There is no uncertainty in predicting output of such an experiment and thus entropy of such an event is 0.

Biased Coin: Coin that has high tendency to fall on tails compared to heads is a biased coin. When coin is tossed first time, entropy is high as unpredictability is high but as multiple experiments are done using same coin patterns emerge and finally outcome can be predicted with certain degree of certainty. That implies, entropy (which is measure of uncertainty) has come down and prediction rate has improved.
Mathematical formula to compute entropy:
where

i is a variable that ranges from 1 to maximum number of distinct classes post split operations.

p(i) is portion /probability of values falling into that class
Back to our example, using Skill feature, data is split into “High”, “Medium” and “Low” and each one has different number of class variables. See figure below or read earlier blog at Decision Trees Part 1
At level 2 (formed due to split using skill feature) there are three sub groups. Let us compute entropy at each group:
For High: It is pure set as split resulted in homogenous class variables (3Y and 0N). Entropy for this data segment equals to
3/3 X log2(3/3) – 0/3 X log2(0/3) = 0 log2(0/3) = log2(1) = 0 and 0 multiplied by anything is 0)
For Low: It is again a pure split. Thus entropy of that set equals to 0 (7/7 * log2(7/7) – 7/7 * log2(0/7) = 0)
For Medium set: –1/2*log2(1/2) – 1/2*log2(1/2) = 1. Entropy is 1 as there is highest disorder in this set and both of them are equally likely possible and is highly unpredictable.
If we know there are 2 output classes, like head / tail then probability of head is X and probability of tail would 1 – X. In such scenarios entropy curve can be obtained by running
curve( – x * log2 ( x ) – ( 1 – x ) * log ( 1 – x ), col=”red”,ylab=”Entropy)
In next part we talk about “Information Gain” that details how decision tree uses entropy to select appropriate feature to split data..
Till next time..
Guru
References:

You Tube Videos by Victor Lavrenko on Decision Trees.
Pingback: Decision Trees – Part 3  Abhyast
Pingback: Decision Trees – Part 5  Abhyast