Decision Trees – Part 4

A decision tree is a recursive algorithm: it keeps splitting until

  • All subsets are pure (entropy = 0, i.e. the subset is homogeneous)
  • Entropy is not 0, but there are no more attributes to split on
  • Entropy is not 0, but a predefined tree size limit has been set and the tree has reached that limit.

Decision trees keep splitting the data until every subset is homogeneous (pure), provided enough attributes are available. This implies that, in order to make subsets pure, a decision tree may split the data to such an extent that it ends up with singleton leaves, where each leaf contains only one value.

Say, for example, there is a column (such as an identity column) that uniquely identifies rows. Using such a column for splitting, the decision tree algorithm will classify each value into its own leaf, making every leaf homogeneous but not useful.

If a decision tree classifies data to that level of detail (for example, using an identity column, a unique identifier or transaction dates), the resulting classifier is not useful: any new value will differ from the existing values, and the classifier will fail to classify it. Such iterative splitting towards perfectly homogeneous leaves leads to over-fitting of the algorithm to the sample data.

If the size of the decision tree is equal to the number of examples in the data set, the tree is clearly over-fitted to the training data.

During training, every split forms a new layer of branches and / or leaves, increasing the size of the decision tree. An increase in tree size increases accuracy on the training data but decreases effectiveness on the test data (due to over-fitting).

An over-fitted decision tree is very specific to the training data; it is not generalized and is therefore of little use on test data or any data beyond the training set. To avoid over-fitting (over-growing), decision trees are pruned. The main purposes of pruning are

  • Reducing the complexity of the decision tree classifier and thereby increasing prediction accuracy (removing the over-fitting of the tree to the training data)
  • Removing portions of the tree (branches / nodes) that are based on erroneous or noisy data and are therefore not useful for prediction (since such a classifier is not generalized).

There are many approaches to decision tree pruning, each with its own mathematics. Pruning techniques follow either a top-down (root to leaf) or a bottom-up (leaf to root) approach. Additionally, pruning can be done while the tree is being built (pre-pruning) or after it is built (post-pruning).
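As a quick, hedged illustration of the two flavours (the posts so far do not name any library, so the rpart package and its bundled kyphosis data set are my own choice for this sketch), pre-pruning is done through the growth controls while building, and post-pruning through prune() on the cost-complexity table after building:

library(rpart)

# Pre-pruning (while building): stop growth early through the control knobs.
pre_pruned <- rpart(Kyphosis ~ Age + Number + Start, data = kyphosis,
                    method = "class",
                    control = rpart.control(maxdepth = 3,    # cap on tree depth
                                            minsplit = 20,   # rows required to attempt a split
                                            cp = 0.02))      # minimum complexity gain per split

# Post-pruning (after building): grow a large tree, then cut it back using
# the cross-validated error from the cost-complexity (cp) table.
full_tree <- rpart(Kyphosis ~ Age + Number + Start, data = kyphosis,
                   method = "class",
                   control = rpart.control(minsplit = 2, cp = 0))
best_cp <- full_tree$cptable[which.min(full_tree$cptable[, "xerror"]), "CP"]
pruned  <- prune(full_tree, cp = best_cp)   # drop branches that do not pay for themselves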

Considering the volume and variety of decision tree pruning techniques, there will be a separate post on pruning.

Summary: Decision trees have a tendency to grow until all leaves are homogeneous and may therefore end up with singleton leaf nodes. If a decision tree predicts the training data with 100% accuracy, the classifier may well be over-fitted and not generalized for subsequent use on test data or real business data. To avoid over-fitting, decision trees are pruned (branches and / or leaf nodes are removed) without reducing prediction accuracy.

Guru

Decision Trees – Part 2

Before understanding mathematically how an attribute (feature) is selected from the available features to best split the underlying data, let us explore a few related terms.

Pure / Impure: When data is split by a decision tree, the result is a set of sub groups. Splitting is done to improve prediction accuracy, and the best sub group is one in which prediction is 100% accurate. That is, if a split produces a sub group in which prediction is 100% accurate, such a sub group is called a “Pure Group”.

On the other hand, if splitting results in a sub group that is still a mixture of output (label) values, such a sub group is impure and is called an “Impure Group”. Splitting the data may result in 0 to N pure sub groups and 0 to N impure sub groups.

Purity and impurity of groups are defined with respect to the output (label) variable, not the features. Features whose splits produce as many sub groups as possible with homogeneous output (label) values are best suited for decision tree algorithms.

Consider the example in Decision Trees Part 1, where a candidate is hired based on experience, technology and skill. The label (output) variable is “Hire” and the feature attributes are “Experience”, “Technology” and “Skill”. Splitting the Hire (output) variable using the “Skill” feature attribute results in the tree below. Skill has 3 distinct values (High, Medium and Low). Data funnels through the root and falls into one of these sub groups. The values in parentheses, e.g. (3Y, 0N), are the counts of the output (label) variable (3 Yes, 0 No).

As you can see, the split resulted in 3 sub groups: 2 pure groups (green) and 1 impure group (orange). The 2 pure groups have either 100% Yes (hired) or 100% No (not hired). From this it can be deduced that candidates with high skill ratings are always hired, irrespective of technology or experience, and candidates with low skills are never hired.

  • Pure sub groups: Skill Level (High) and Skill Level (Low).
  • Impure sub group: Skill Level (Medium)

[Image: tree for the split on Skill, with sub groups High (3Y, 0N), Medium (1Y, 1N) and Low (0Y, 7N)]

An additional point to note in the above example is that although there are 2 pure sub groups, they are not symmetric: the number of examples in each sub group differs.

To understand this point, let us expand the example so that the data set has 100 rows, with all 88 additional rows having a medium skill level. If all 88 of these candidates are rejected, then out of 90 candidates with a medium skill level only 1 is selected. Even though this is an impure sub group, it is very useful, as the split results in very high homogeneity. 100% pure sub groups may not be achievable in the real world, but a highly homogeneous sub group is also very useful.

Entropy: Entropy is a measure of the disorder of a system.

The most commonly used example for understanding entropy is the difference between a “clean or ordered” room and a “cluttered or unordered” room.

In a clean or ordered room, everything has its place and is in its place. There is very little disorder, so the entropy of such a room is low compared to a cluttered room, where items are out of place and there is complete chaos. Effort has to be put in to turn an unordered room with high entropy into an ordered room with low entropy.

In thermodynamics, “the entropy of the universe is ever increasing with time, unless external work is done against it”. As an example, take the case of air conditioners.

With no air conditioning, the temperature of a room stays in equilibrium with the surroundings outside it, as the disorder (random movement) of air molecules is the same on both sides. The entropy measured in the room and in the external surroundings would be the same.

With air conditioning in the room (and the doors closed), there is less movement of air molecules (due to the lower temperature), so the entropy in the room is lower than in the surroundings for as long as the air conditioner is running and the doors stay closed.

Additional examples of entropy, with respect to coin tosses:

  • Fair Coin: When a fair coin is tossed, it lands on either heads or tails. Before the toss, the outcome cannot be predicted: the probability is always 1/2 for either side, and (the coin being fair) we cannot learn anything from previous tosses. As the outcome is unpredictable, the entropy, which is a measure of unpredictability, is at its maximum of 1 (-1/2 * log2(1/2) - 1/2 * log2(1/2) = 1).
  • Coin with Heads on both sides: Consider a coin with heads on both sides. When such a coin is tossed, it is certain to land on heads. There is no uncertainty in predicting the outcome of such an experiment, so its entropy is 0.
  • Biased Coin: A coin that has a much higher tendency to land on tails than on heads is a biased coin. On the first toss the outcome seems unpredictable, but as more tosses are made with the same coin the bias becomes apparent and the outcome can be predicted with a fair degree of certainty. That implies the entropy (the measure of uncertainty) is lower than that of a fair coin, and the prediction rate improves.

Mathematical formula to compute entropy:

Entropy(S) = - Σ p(i) * log2( p(i) )

where

  • i is an index that ranges from 1 to the number of distinct classes after the split operation.
  • p(i) is the proportion (probability) of values falling into class i.
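As a minimal sketch of this formula in R (the same language as the curve() call later in this post), the helper below takes the class counts of a subset, converts them to proportions p(i) and applies the sum. The name entropy() and the convention that 0 * log2(0) is treated as 0 are my own additions for illustration; the usage lines reproduce the coin examples above.

# Entropy of a subset, given the counts of each output (label) class in it.
entropy <- function(counts) {
  p <- counts / sum(counts)   # p(i): proportion of values in class i
  p <- p[p > 0]               # drop empty classes: 0 * log2(0) is taken as 0
  -sum(p * log2(p))
}

entropy(c(1, 1))   # fair coin (50 / 50):            1
entropy(c(2, 0))   # coin with heads on both sides:  0
entropy(c(9, 1))   # biased coin (90 / 10 split):    ~0.469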

Back to our example: using the Skill feature, the data is split into “High”, “Medium” and “Low”, and each sub group has a different mix of class values. See the figure below, or read the earlier blog at Decision Trees Part 1.

At level 2 (formed by the split on the Skill feature) there are three sub groups. Let us compute the entropy of each group:

For High: It is a pure set, as the split resulted in homogeneous class values (3Y and 0N). The entropy of this data segment equals

-3/3 * log2(3/3) - 0/3 * log2(0/3) = 0   (log2(3/3) = log2(1) = 0, and the 0/3 term is 0 by the convention 0 * log2(0) = 0)

For Low: It is again a pure split, so the entropy of that set equals 0 (-7/7 * log2(7/7) - 0/7 * log2(0/7) = 0).

For the Medium set: -1/2 * log2(1/2) - 1/2 * log2(1/2) = 1. Entropy is 1 because this set has the highest possible disorder: both outcomes are equally likely and the result is completely unpredictable.
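Using the entropy() helper sketched after the formula above (my own illustration, not part of the original post), the same three values, plus the expanded 90-row Medium group from the purity discussion, follow directly from the class counts:

entropy(c(3, 0))    # High:   3 Yes, 0 No  -> 0
entropy(c(0, 7))    # Low:    0 Yes, 7 No  -> 0
entropy(c(1, 1))    # Medium: 1 Yes, 1 No  -> 1
entropy(c(1, 89))   # expanded Medium group (1 hired out of 90) -> ~0.088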

[Image: the Skill split annotated with entropy values: High = 0, Medium = 1, Low = 0]

If we know there are two output classes, like heads / tails, and the probability of heads is x, then the probability of tails is 1 - x. In such a scenario the entropy curve can be obtained in R by running

curve(-x * log2(x) - (1 - x) * log2(1 - x), col = "red", ylab = "Entropy")

[Image: entropy curve for two classes: 0 at x = 0 and x = 1, with a maximum of 1 at x = 0.5]

In the next part we will talk about “Information Gain”, which details how a decision tree uses entropy to select the appropriate feature to split the data on.

Till next time..

Guru

 

References:

Decision Trees – Part 1

The decision tree algorithm builds a “Tree Structure” (root, branches and leaf nodes) from the underlying data. Decision trees are built using a heuristic called “recursive partitioning” or “divide and conquer”. Here are the high-level steps of how the algorithm works (a short code sketch follows the list).

Given a dataset,

  • Should split?
    • Is it required to split the data to improve prediction accuracy?
  • Can split?
    • Are there enough features to split on to improve prediction accuracy?
  • Assuming “Yes” for both of the above,
  • Why split?
    • Data is split into multiple sub groups to improve prediction and to gain information about the data being split (“Information Gain”)
  • How to split?
    • Based on the data, choose, from the set of given attributes, an attribute that will be used to split the data into sub groups (i.e. from Root –> Branches).
    • Based on the values of the selected column, form “Groups” or “Distinct Categories”. Using these groups, categorize and quantify the output (label) variable.
    • After the sub groups are formed, check whether prediction can be done with 100% accuracy (or as close as possible).
      • If yes, stop the splitting process.
      • If no, continue splitting until
        • Prediction is 100% accurate
        • There are no more attributes (features) to split on
        • The tree has become so big, with so many branches, that it is practically unusable for prediction.
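Below is the short code sketch referred to above, in R. It is illustrative only: the helper names and the use of weighted entropy as the split criterion (equivalent to the Information Gain covered in the later posts) are my own assumptions, not a production implementation.

# Entropy of a set of label values (same idea as the helper in Part 2).
entropy <- function(labels) {
  p <- table(labels) / length(labels)
  p <- p[p > 0]
  -sum(p * log2(p))
}

# "How to split": pick the feature whose sub groups have the lowest
# weighted entropy with respect to the label.
best_split <- function(data, label, features) {
  weighted_entropy <- sapply(features, function(f) {
    sum(sapply(split(data[[label]], data[[f]]),
               function(part) length(part) / nrow(data) * entropy(part)))
  })
  names(which.min(weighted_entropy))
}

# "Should / can split": recurse until a subset is pure, features run out,
# or the tree would grow too big (max_depth).
grow_tree <- function(data, label, features, depth = 0, max_depth = 5) {
  labels <- data[[label]]
  if (length(unique(labels)) == 1 || length(features) == 0 || depth >= max_depth) {
    return(names(which.max(table(labels))))        # leaf: majority class
  }
  best <- best_split(data, label, features)
  branches <- lapply(split(data, data[[best]]), function(subset) {
    grow_tree(subset, label, setdiff(features, best), depth + 1, max_depth)
  })
  list(split_on = best, branches = branches)       # internal node
}

# Example call, once the sample table below is loaded as a data frame:
# grow_tree(hiring, label = "Hire", features = c("Experience", "Technology", "Skill"))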

To understand this better, below is a simple case study.

Companies like Monster.com, Glassdoor.com, indeed.com and others offer hiring services to companies and to candidates open to new jobs. As the cost of sourcing is high, the supply of quality candidates (open for hiring) is low, and the rejection rate by hiring companies (due to low-quality candidates) is high, it becomes business critical to send the right candidates, with the right skills and capabilities, to companies where they have a high chance of getting hired.

  • How about, before sending a candidate for an interview, predicting whether he / she has a high probability of getting hired, and then taking action?
  • Better still, predicting skill gaps and the candidate’s suitability for a company, getting the candidate ramped up, and only then sending him / her for the interview 🙂 ?

To understand how “Decision Trees” can be used for such predictions, below is some sample data. The data is self-explanatory.

Experience      Technology     Skill    Hire
8 – 10 Years    Development    High     Yes
8 – 10 Years    Database       High     Yes
8 – 10 Years    Database       Low      No
6 – 8 Years     Architecture   Low      No
6 – 8 Years     Architecture   Medium   Yes
10 – 12 Years   Development    High     Yes
10 – 12 Years   Architecture   Low      No
10 – 12 Years   Database       Low      No
6 – 8 Years     Development    Low      No
6 – 8 Years     Database       Medium   No
8 – 10 Years    Database       Low      No
8 – 10 Years    Database       Low      No
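As a hedged aside (the original post stops at the data itself), the same table can be typed into R and a tree fitted with the rpart package. rpart is my own choice here, and unlike the hand-drawn trees further below it makes binary splits, so its output will not match those pictures exactly.

library(rpart)

# The sample table above as a data frame (ranges written without spaces).
hiring <- data.frame(
  Experience = c("8-10 Years", "8-10 Years", "8-10 Years", "6-8 Years",
                 "6-8 Years", "10-12 Years", "10-12 Years", "10-12 Years",
                 "6-8 Years", "6-8 Years", "8-10 Years", "8-10 Years"),
  Technology = c("Development", "Database", "Database", "Architecture",
                 "Architecture", "Development", "Architecture", "Database",
                 "Development", "Database", "Database", "Database"),
  Skill      = c("High", "High", "Low", "Low", "Medium", "High",
                 "Low", "Low", "Low", "Medium", "Low", "Low"),
  Hire       = c("Yes", "Yes", "No", "No", "Yes", "Yes",
                 "No", "No", "No", "No", "No", "No"),
  stringsAsFactors = TRUE
)

# With only 12 rows the default stopping rules would refuse to split, so they
# are relaxed purely for illustration; split = "information" asks rpart to use
# entropy rather than its default Gini criterion.
fit <- rpart(Hire ~ Experience + Technology + Skill, data = hiring,
             method = "class", parms = list(split = "information"),
             control = rpart.control(minsplit = 2, minbucket = 1, cp = 0))

print(fit)   # text view of the chosen splits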

Based on the order of the columns selected for splitting (sub grouping) the data, multiple variants of the decision tree can be formed.

In the trees below,

  • Green indicates prediction with 100% accuracy (the data in that group is homogeneous, i.e. belongs to a single category of Yes or No).
  • Yellow / Orange indicates prediction is better than chance but not certain.
  • Red indicates prediction has no degree of certainty.

Tree 1: Split data on Skill, followed by Technology:

[Image: Tree 1, split on Skill and then Technology]

Tree 2: Split data on Experience –> Technology –> Skill

[Image: Tree 2, split on Experience, then Technology, then Skill]

As you can see from the above, both trees lead to accurate predictions in the end, but Tree 1 is better than Tree 2 in terms of both the number of steps involved (depth) and the number of categories (width) of the tree.

Choosing the right attribute for splitting the data at each iteration is an important aspect that dictates the prediction capability and the optimization of decision tree models.

The next post gives a more mathematical explanation of “Decision Trees” and of how an attribute is selected for a split (Information Gain and Entropy), continuing with the same case.

Until next post…

Guru