Decision Trees – Part 8

The preceding seven posts covered:

  • Decision Tree basics
  • Selecting the attribute to split the data on, based on impurity measures:
    • Information Gain (C5.0), Gain Ratio
    • GINI Index (CART)
    • Misclassification Error

In this post, I will detail data types and their impact on Decision Trees.

In a data set there are attributes whose values need to be predicted (dependent variables) and attributes that help in the prediction (independent variables).

Dependent variable:

  • Categorical data type → Classification model
  • Continuous data type → Regression model

Independent variable:

  • Categorical: Can be used for both classification and regression models. But if an attribute has too many distinct values (like Date, Name, Phone Number or Pin Code), it is better to either ignore such columns or convert them into values that are useful for prediction.
  • Continuous: Again, can be used for both classification and regression models. With continuous variables, understanding the specific algorithm is important. Some implementations quantize values based on a single threshold, while others may quantize based on ranges, min / max values or a normal curve; and there could be many other schemes I am not aware of.

This post details, as much as possible, data understanding and preparation specific to independent variables.
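
As an illustration of the single-threshold quantization mentioned above, below is a minimal R sketch of how a threshold for a continuous attribute could be chosen by minimizing the weighted GINI impurity of the two resulting partitions. The experience and hired vectors are made-up data, and this is only one possible scheme, not a reference to any particular implementation.

# Choose a single threshold for a continuous attribute by minimizing
# the weighted GINI impurity of the two resulting partitions.
gini <- function(labels) {
  p <- prop.table(table(labels))
  1 - sum(p^2)
}

best_threshold <- function(x, y) {
  candidates <- sort(unique(x))
  # midpoints between consecutive distinct values are the usual candidates
  mids <- head(candidates, -1) + diff(candidates) / 2
  scores <- sapply(mids, function(t) {
    left  <- y[x <= t]
    right <- y[x >  t]
    (length(left)  / length(y)) * gini(left) +
      (length(right) / length(y)) * gini(right)
  })
  mids[which.min(scores)]
}

experience <- c(2, 3, 5, 6, 7, 9, 10, 12)                            # made-up values
hired      <- c("No", "No", "No", "Yes", "Yes", "Yes", "Yes", "Yes") # made-up labels
best_threshold(experience, hired)   # threshold that best separates Yes / No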

Categorical Variables:

  • Null Values: Attributes that have all or most values blank may not be useful for prediction and can be removed / ignored.
  • Identical Values: Attributes with the same value for all examples are not useful and can safely be removed / ignored.
  • Unique Values: Attributes with a unique value for each example, such as alphanumeric identifiers other than dates, timestamps or numbers (Phone Numbers, Pin Codes, Names, User or Login Names, Email Addresses).
    • Phone Number: Generally the first few digits of a phone number are common and may correspond to a circle (location). If there are no other attributes that link to geography, this may be usable (if accurate; I have never worked on it).
    • Zip Code: Same as above.
    • Name: Will the name of a person / company / entity have any bearing on the output, and if so, are names (either first or last) repeated?
      • For example, when predicting child names based on parent names, names may be useful; but if the problem of interest is related to sales, will a name have an impact on the sale? If not, remove the column; otherwise retain it but split it into first and last names (the last name may repeat more often).
    • Email Address: Take the domain name (the part after @). A sketch of these clean-up steps is shown after this list.
  • Correlated Attributes: Find independent variables (IVs) that are highly correlated with each other. Due to the high correlation, only one of such attributes needs to be used.
    • Example: Month Name (Jan .. Dec) and Month Number (1 to 12) are perfectly correlated.
    • Using either of them to split the data would result in the same sub-tree structure.
    • In such scenarios, use only one of the attributes for modeling.
    • Which attribute to use depends upon:
      • Performance (select the better-performing attribute)
      • Business scenario (select the more descriptive one).
      • For most input data, knowledge of the domain and the data is necessary to take such decisions.
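
A minimal R sketch of the clean-up steps above for unique-value attributes. The data frame, column names and values are hypothetical and only illustrate the transformations:

# Hypothetical data frame with Email, Name and Phone columns
df <- data.frame(
  Email = c("a.rao@example.com", "b.k@sample.org"),
  Name  = c("Anil Rao", "Bhavana K"),
  Phone = c("9849012345", "9812345678"),
  stringsAsFactors = FALSE
)

# Email: keep only the domain (the part after @)
df$EmailDomain <- sub("^.*@", "", df$Email)

# Name: split into first and last names (the last name may repeat more often)
parts        <- strsplit(df$Name, " ")
df$FirstName <- sapply(parts, `[`, 1)
df$LastName  <- sapply(parts, function(p) tail(p, 1))

# Phone: keep the first few digits as a rough circle / location indicator
df$PhonePrefix <- substr(df$Phone, 1, 4)

df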

Continuous Variables:

  • Excel in some cases converts dates and times to numerical values. These need to be converted back to date and time before preparing the data. Working with date and time includes:
    • Splitting date and time into Date components (Day, Month, Quarter, Semester, Year) and Time components (Hours, Minutes, Seconds), depending upon requirements. A sketch of this is shown below.
      • Note to self: need to check whether a numerical representation of date and / or time has a better impact on regression models.
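
A minimal base-R sketch of splitting a timestamp column into date and time components. The data frame and the EventTime column name are assumptions for illustration:

# 'EventTime' is an assumed column holding POSIXct timestamps
df <- data.frame(EventTime = as.POSIXct(c("2017-03-15 09:30:00",
                                          "2017-11-02 17:45:10")))

df$Year    <- as.integer(format(df$EventTime, "%Y"))
df$Month   <- as.integer(format(df$EventTime, "%m"))
df$Day     <- as.integer(format(df$EventTime, "%d"))
df$Quarter <- (df$Month - 1) %/% 3 + 1      # derive quarter from month
df$Hour    <- as.integer(format(df$EventTime, "%H"))
df$Minute  <- as.integer(format(df$EventTime, "%M"))

df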

Decision Trees – Part 7

In earlier posts we discussed how an attribute is selected based on Information Gain (Entropy) and the GINI Index. Similar to those computations, the "Misclassification Error" is another measure used to select the optimal attribute to split on when building decision trees.

At a node (N), if p(i) is the proportion of the node's elements belonging to class i, the Misclassification Error can be computed as

Misclassification Error(N) = 1 – max( p(i) )

For a two-class problem, the misclassification error ranges between 0 (minimum) and 0.5 (maximum).

[Graph: misclassification error against class proportion; 0 at pure nodes, peaking at 0.5 for a 50 / 50 split]

Below is a table where a node of 10 candidates is split between "Hired" and "Not Hired" in different proportions, along with the corresponding Entropy, GINI impurity index and Misclassification Error (Misc.) values. An R sketch reproducing these values follows the table.

Hired Not Hired Total Entropy GINI Misc.
0 10 10 0.00 0 0
1 9 10 0.47 0.18 0.1
2 8 10 0.72 0.32 0.2
3 7 10 0.88 0.42 0.3
4 6 10 0.97 0.48 0.4
5 5 10 1.00 0.5 0.5
6 4 10 0.97 0.48 0.4
7 3 10 0.88 0.42 0.3
8 2 10 0.72 0.32 0.2
9 1 10 0.47 0.18 0.1
10 0 10 0.00 0 0
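
A minimal R sketch that reproduces the table above for a node of 10 candidates, using the three impurity measures discussed so far:

entropy <- function(counts) {
  p <- counts / sum(counts)
  p <- p[p > 0]                 # 0 * log2(0) is taken as 0
  -sum(p * log2(p))
}
gini  <- function(counts) { p <- counts / sum(counts); 1 - sum(p^2) }
miscl <- function(counts) { p <- counts / sum(counts); 1 - max(p) }

hired <- 0:10
data.frame(
  Hired    = hired,
  NotHired = 10 - hired,
  Entropy  = round(sapply(hired, function(h) entropy(c(h, 10 - h))), 2),
  GINI     = sapply(hired, function(h) gini(c(h, 10 - h))),
  Misc     = sapply(hired, function(h) miscl(c(h, 10 - h)))
)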

Summary:

  • To select an appropriate attribute for splitting, decision trees use an "impurity reduction" method.
  • The impurity of a node can be computed using
    • Entropy (the basis of Information Gain)
    • GINI Impurity Index
    • Misclassification Error
  • Impurity reduction can be computed as the difference between the "impurity of the node before the split" and the "aggregated impurities of all child nodes".
  • The Information Gain method is biased towards categorical attributes that have many distinct values (singleton splits with 100% purity). To avoid this, an enhanced measure called "Gain Ratio" is used.

The next post covers data types and the impact of data types on Decision Trees.

Guru

Decision Trees – Part 6

Similar to information gain, alternative measures can be used to quantify the impurity of a node and thus play a role in selecting the attribute used to split a node into sub-nodes (branches) or leaves.

GINI: Similar to Information Gain, GINI measures the impurity of a node and can be used as an alternative in its place. For example, CART (Classification and Regression Trees) uses the GINI index for splitting decision tree nodes.

If a node (N) contains a total of s data elements and p(i) is the proportion of those elements belonging to class i, then

GINI(N) = 1 – Σ p(i)², with the sum running over all classes i

Similar to entropy, the GINI impurity index is 0 for a pure node; for a two-class problem its values range from 0 to 0.5. A graph of GINI index values is below.

[Graph: GINI index against class proportion; 0 at pure nodes, peaking at 0.5 for a 50 / 50 split]

While entropy values range between 0 and 1, GINI index values range between 0 and 0.5. Additionally, since GINI directly accounts for the number of classes (and their counts), it may not need a correction like the "Gain Ratio" used with Information Gain.

See below a comparison of Entropy and GINI values for different splits of Hired and Not Hired among 10 total candidates.

Hired Not Hired Total Entropy GINI
0 10 10 0 0
1 9 10 0.468996 0.18
2 8 10 0.721928 0.32
3 7 10 0.881291 0.42
4 6 10 0.970951 0.48
5 5 10 1 0.5
6 4 10 0.970951 0.48
7 3 10 0.881291 0.42
8 2 10 0.721928 0.32
9 1 10 0.468996 0.18
10 0 10 0 0

Similar to computing the "Information Gain" for a split on attribute "A", the GINI impurity is first computed at the parent node, and the attribute used to split the node is selected based on the reduction in impurity. If a node (N) with "t" total elements is split into "k" sub-nodes, each containing t(i) elements, the aggregated GINI impurity is

GINI_split = Σ ( t(i) / t ) × GINI(sub-node i), with the sum running over the k sub-nodes

As with "Information Gain", when using the GINI impurity index, attributes whose splits produce larger sub-nodes (with more values) of higher purity are preferred.
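
A minimal R sketch of this computation. The parent counts and the three child nodes are made-up numbers that are consistent with each other; the point is only to show the weighted aggregation and the resulting impurity reduction:

gini <- function(counts) { p <- counts / sum(counts); 1 - sum(p^2) }

# Parent node: 6 Hired, 6 Not Hired (illustrative counts)
parent   <- c(Hired = 6, NotHired = 6)

# Three child nodes after a split; each vector holds (Hired, Not Hired) counts
children <- list(c(3, 0), c(1, 1), c(2, 5))

weights    <- sapply(children, sum) / sum(parent)
gini_split <- sum(weights * sapply(children, gini))

gini(parent)               # impurity before the split
gini_split                 # weighted (aggregated) impurity after the split
gini(parent) - gini_split  # reduction in impurity due to the split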

Next is measuring impurity using “Misclassification Error”.

Decision Trees – Part 5

Information gain, as discussed in the earlier post, is the "reduction in un-orderedness" of the training data due to a split. The tree-building procedure chooses attributes that maximize information gain. Entropy and Information Gain are explained in the posts below.

The entropy at a node (N), where p is the probability of one of two mutually exclusive outcomes (like head / tail) of the class variable, can be computed as

E(N) = – p × log2(p) – (1 – p) × log2(1 – p)

This equation can be generalized to the multi-class case as

E(N) = – p1 × log2(p1) – p2 × log2(p2) – … – pn × log2(pn)

i.e. E(N) = – Σ p(i) × log2(p(i)), with the sum running over i = 1 to n

If a node is split into multiple branches, each branch's subset has its own entropy (computed using the formula above). To aggregate the entropies of the branches and compare them with the parent node, a weighted approach is used: each branch's entropy is weighted by the fraction of the data it holds, so large branches with high entropy pull the aggregated entropy up, while large branches with low entropy pull it down.

If Si is the i-th subset after splitting a universal set S, then the summarized (weighted) entropy of all the subsets is

Weighted Entropy = Σ ( Si / S ) × E(Si), with the sum running over all subsets i

where

  • Si = Number of elements in ith subset
  • S = Number of elements in super set (at node before split)
  • E(Si) = Entropy of ith Subset.

Information Gain can be computed as the difference between the (un)orderedness of a node and the summarized unorderedness of its child nodes (each holding a subset of the node's data) after the split. The larger the entropy difference, the better the information gain. If a set S is split into i subsets based on the values of attribute A, the information gain is Entropy(S) minus the summarized entropies of the child nodes / subsets. Mathematically:

Gain(S, A) = E(S) – Σ ( Si / S ) × E(Si)
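
A minimal R sketch of these two computations, using the Skill split from the hiring example of Part 1 and Part 2 (3 High candidates all hired, 2 Medium split 1 / 1, 7 Low all rejected):

entropy <- function(labels) {
  p <- prop.table(table(labels))
  p <- p[p > 0]                 # 0 * log2(0) is taken as 0
  -sum(p * log2(p))
}

info_gain <- function(labels, attribute) {
  subsets  <- split(labels, attribute)
  weighted <- sum(sapply(subsets,
                         function(s) (length(s) / length(labels)) * entropy(s)))
  entropy(labels) - weighted
}

hire  <- c("Y", "Y", "Y", "Y", "N", rep("N", 7))
skill <- c("High", "High", "High", "Medium", "Medium", rep("Low", 7))

entropy(hire)            # entropy at the parent node, about 0.92
info_gain(hire, skill)   # reduction achieved by splitting on Skill, about 0.75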

Additional Notes:

  • When a subset contains homogeneous elements (every element belongs to the same class of the dependent variable), entropy is at its minimum and equals zero (0)
  • When a subset contains equally likely elements, entropy is at its maximum and equals one (1) for a two-class problem

Update: Added a graph for Entropy to indicate that its values lie between 0 and 1 and that, being logarithmic, it forms a smooth curve.

  • 0 <= Entropy <= 1

[Graph: Entropy against class proportion; 0 at pure nodes, peaking at 1 for a 50 / 50 split]

  • If splitting a node results in pure child nodes (entropy 0), the Information Gain equals the entropy of the node.
  • If splitting a node does not improve information gain, the aggregated entropies equal the node's entropy.

If the dataset contains an attribute that uniquely identifies rows, using such an attribute to split the data would result in subsets that are all pure, though singleton. Thus the information gain of using such an attribute would be the highest.

For example, any attribute like the below would result in pure splits and thus maximum information gain.

  • Customer Names
  • Pin Codes or Phone Numbers
  • Dates, Times or Timestamps
  • Identity columns
  • Colors

Decision Tree algorithms are biased towards such attributes. Even mathematically, since their information gain is maximum, such attributes should be used for splitting the data. But are such trees really useful? If such an attribute is used in a decision tree, the resultant tree would have the following properties:

  • The tree would be wide (due to many leaf nodes)
    • For example, if a table has 1 M records, each with a Phone Number attribute (IV), the tree would have 1 M leaves, each holding an individual phone number and its class output
  • Such a tree would not be useful
    • Over-fitting: A new row will not have the same value but a different one, so the tree fails to classify it into the correct leaf. A decision tree built on such attributes is therefore not useful.
    • Fragmentation: The performance of the tree is bad, as it has to traverse 1 M leaves even when a Phone Number repeats. Consider how many steps are required to read this data.

To balance the bias of information gain towards attributes with many values, the "Gain Ratio" is used. Prior to computing the "Gain Ratio", we need to understand and compute the "Intrinsic Information". Intrinsic information can intuitively be thought of as the number of steps required to retrieve the correct leaf node after the split operation; in other terms, it is the entropy of the distribution of the data across the child nodes after the split. If a split produces k equally sized child nodes, the probability of landing in any single child node is 1 / k, and the entropy of that distribution is – k × (1 / k) × log2(1 / k) = log2(k). In general, if the split sends Si out of S elements into the i-th child node, it can be written as

Intrinsic Information(S, A) = – Σ ( Si / S ) × log2( Si / S ), with the sum running over all child nodes

The Gain Ratio is then: Gain Ratio (GR) = Information Gain / Intrinsic Information.
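
A minimal R sketch, reusing entropy(), info_gain(), hire and skill from the sketch above, that contrasts the gain ratio of the Skill attribute with that of a unique row identifier:

intrinsic_info <- function(attribute) {
  w <- prop.table(table(attribute))   # fraction of rows landing in each child node
  -sum(w * log2(w))
}

gain_ratio <- function(labels, attribute) {
  info_gain(labels, attribute) / intrinsic_info(attribute)
}

id <- as.character(1:12)   # a unique identifier: 12 singleton child nodes

gain_ratio(hire, skill)    # a reasonable ratio for a genuine attribute
gain_ratio(hire, id)       # gain is maximal, but intrinsic information penalizes it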

There are also other mechanisms, like the Gini Index.

Future posts will dwell upon the Gini Index, data types and their impact on decision trees, data preparation for decision trees, validation of decision trees, and decision trees for classification and regression problems.

Let me know if there are more things to be added in future posts specific to decision trees, and feel free to put them in the comments so that I can learn from them.

Thanks for reading….

Guru

Decision Trees – Part 4

A decision tree is a recursive algorithm, implying it keeps splitting until:

  • All subsets are pure (entropy = 0, i.e. homogeneous)
  • Entropy is not equal to 0, but there are no more attributes to split on
  • Entropy is not equal to 0, but a predefined tree size limit has been set and reached.

Decision trees keep splitting the data until every subset is homogeneous and pure, as long as there are enough attributes available. That implies a decision tree may split the data to such an extent that, to make subsets pure, it ends up with singleton leaves, where each leaf contains only 1 value.

Say, for example, there is a column (like an identity column) that uniquely identifies rows. Using such a column for splitting, the decision tree algorithm will classify each value into its own leaf (making it homogeneous but at the same time not useful).

If a decision tree classifies data at that level of detail (using an identity column, a unique identifier or transaction dates), such a classifier is not useful: any new value will be different from the existing values, and the classifier would fail to classify it. Such iterative splitting, leading to perfectly homogeneous leaves, results in over-fitting of the algorithm to the sample data.

If the size of the decision tree equals the number of examples in the data set, then clearly the decision tree is over-fitted to the training data.

During training, every split adds a new layer of branches and / or leaves, increasing the size of the decision tree. An increase in the size of the tree increases its accuracy on the training data but at the same time decreases its effectiveness on test data (due to over-fitting).

If a decision tree is over-fitted, it has been made very specific to the training data and is not generalized, and is thus not useful for test data or for any data other than the training set. To avoid over-fitting (or over-growing), decision trees are pruned. The main purposes of pruning are

  • Reducing the complexity of the decision tree classifier and thus increasing prediction accuracy (removing the over-fitting of the decision tree algorithm to the training data)
  • Removing portions of the tree (branches / leaves) that are based on erroneous or noisy data and are therefore not useful for prediction (as the classifier is not generalized).

There are many approaches to decision tree pruning, each with its own math. Each pruning technique follows either a top-down (root to leaf) or a bottom-up (leaf to root) approach. Additionally, pruning can be done while building the tree (pre-pruning) or after building it (post-pruning).
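
As one concrete illustration (not necessarily the approach the later pruning post will cover), below is a minimal R sketch of cost-complexity post-pruning with the rpart package, using the kyphosis data set that ships with rpart:

library(rpart)

# Grow a deliberately large tree first (cp = 0 and a tiny minsplit let it overfit)
fit <- rpart(Kyphosis ~ Age + Number + Start,
             data = kyphosis, method = "class",
             control = rpart.control(cp = 0, minsplit = 2))

printcp(fit)   # cross-validated error for each candidate tree size

# Prune back to the complexity parameter with the lowest cross-validated error
best_cp <- fit$cptable[which.min(fit$cptable[, "xerror"]), "CP"]
pruned  <- prune(fit, cp = best_cp)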

Considering the volume and variety of decision tree pruning techniques, there will be a separate post on pruning.

Summary: Decision trees have a tendency to grow until all leaves are homogeneous and thus may end up with singleton leaf nodes. If a decision tree predicts the training data with 100% accuracy, it may be a case where the classifier has an over-fitting problem and is not generalized for subsequent use with either test data or real business data. To avoid over-fitting, decision trees are pruned (branches and / or leaf nodes are removed) without reducing prediction accuracy.

Guru

Decision Trees – Part 2

Before understanding mathematically how an attribute (feature) is selected (from the available features) to best split the underlying data, let us explore a few related terms.

Pure / Impure: When data is split by a decision tree, it results in a set of subgroups. The splitting is done to improve the accuracy of prediction. The best subgroup is one where prediction is 100% accurate. That implies if the data is split and a resultant subgroup gives 100% accurate predictions, such a subgroup is called a "Pure Group".

On the other hand, if splitting results in a subgroup that is still a mixture of output (label) values, such a subgroup is impure and is called an "Impure Group". Splitting the data may result in (0 to N) pure subgroups and (0 to N) impure subgroups.

Both purity and impurity of groups are defined with respect to the output (label) variable, not the features. Features whose splits result in as many homogeneous subgroups (with respect to the output label) as possible are best suited for decision tree algorithms.

Consider the example in Decision Trees Part 1, where a candidate is hired based on experience, technology and skill. The label (output) variable is "Hire" and the feature attributes are "Experience", "Technology" and "Skill". Splitting on the "Skill" feature attribute results in the tree below. Skill has 3 different values (High, Medium and Low). Data funnels through the root and falls into one of these subgroups. Values in parentheses, such as (3Y, 0N), are counts of the output (label) variable.

As you can see, the split resulted in 3 subgroups, with 2 pure groups (green) and 1 impure group (orange). The 2 pure groups have either 100% Yes (Hired) or 100% No (Not Hired). Based on this it can be deduced that people with high skill ratings are always hired irrespective of technology or experience, and candidates with low skills are never hired.

  • Pure subgroups: Skill Level (High) and Skill Level (Low).
  • Impure subgroup: Skill Level (Medium)

[Figure: split on Skill – High (3Y, 0N), Medium (1Y, 1N), Low (0Y, 7N)]

An additional point to note in the above example is that though there are 2 pure subgroups, they are not symmetric, because the number of examples varies in each subgroup.

To understand this point, let us expand the example so that the data set has 100 rows, with all 88 additional rows having a medium skill level. If all of those 88 candidates are rejected, then out of 90 candidates with a medium skill level only 1 is selected. Even though we have an impure subgroup here, it is very useful, as it splits the data in a way that results in high homogeneity. 100% pure subgroups may not be possible in the real world, but a highly accurate subgroup is also very useful.

Entropy: Entropy is the measure of disorder of a system.

The most commonly used example for understanding entropy is the difference between a "clean or ordered" room and a "cluttered or unordered" room.

In a clean or ordered room, everything has its place and is put in its place. There is very little disorder, and the entropy of such a room is low compared to a cluttered room where items are not in place and there is complete chaos. Effort needs to be put in to turn an unordered room with high entropy into an ordered room with low entropy.

In thermodynamics, "the entropy of the universe is ever increasing with time, unless external work is done against it". Take the case of air conditioners as an example.

With no air conditioning, the temperature of a room stays in equilibrium with the surroundings outside the room, as the disorder or random movement of air molecules is the same. The entropy measured in the room and in the external surroundings would be the same.

With air conditioning in the room (and the doors closed), there is less movement of air molecules (due to the lower temperature), and thus the entropy in the room is lower than in the surroundings for as long as the air conditioner is running and the doors are closed.

An additional example of entropy, with respect to coin tosses:

  • Fair Coin: When a fair coin is tossed, it lands either heads or tails. Before the toss, predicting the outcome is not possible; the probability is always 1/2 (either heads or tails), and we cannot learn from previous experiments (the coin being fair). As the outcome is unpredictable, entropy, which is a measure of unpredictability, is at its maximum and equals 1 ( – 1/2 × log2(1/2) – 1/2 × log2(1/2) = 1 ).
  • Coin with Heads on both sides: Consider a coin with heads on both sides. When such a coin is tossed, it is certain to land on heads. There is no uncertainty in predicting the outcome of such an experiment, and thus the entropy of such an event is 0.
  • Biased Coin: A coin that has a higher tendency to fall on tails than on heads is a biased coin. When the coin is tossed the first time, unpredictability seems high, but as multiple experiments are done with the same coin, a pattern emerges and the outcome can be predicted with a certain degree of confidence. That implies the entropy (which is a measure of uncertainty) has come down and the prediction rate has improved. (A small sketch computing these coin entropies follows this list.)
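
A minimal R sketch computing the entropy for each of the coins above (the 0.9 bias is an assumed value for the biased coin):

coin_entropy <- function(p_heads) {
  p <- c(p_heads, 1 - p_heads)
  p <- p[p > 0]                 # 0 * log2(0) is taken as 0
  -sum(p * log2(p))
}

coin_entropy(0.5)   # fair coin        -> 1 (maximum uncertainty)
coin_entropy(0.9)   # biased coin      -> about 0.47
coin_entropy(1.0)   # two-headed coin  -> 0 (no uncertainty)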

Mathematical formula to compute entropy:

E = – Σ p(i) × log2(p(i))

where

  • i is a variable that ranges from 1 to the number of distinct classes after the split operation.
  • p(i) is the portion / probability of values falling into class i.

Back to our example: using the Skill feature, the data is split into "High", "Medium" and "Low", and each subgroup has a different mix of class values. See the figure below, or read the earlier blog at Decision Trees Part 1.

At level 2 (formed by the split on the Skill feature) there are three subgroups. Let us compute the entropy of each group:

For High: It is a pure set, as the split resulted in homogeneous class values (3Y and 0N). The entropy of this data segment equals

– 3/3 × log2(3/3) – 0/3 × log2(0/3) = 0   (log2(3/3) = log2(1) = 0, and the 0/3 × log2(0/3) term is taken as 0 by convention)

For Low: It is again a pure split, so the entropy of that set equals 0 ( – 7/7 × log2(7/7) – 0/7 × log2(0/7) = 0 ).

For the Medium set: – 1/2 × log2(1/2) – 1/2 × log2(1/2) = 1. The entropy is 1, as there is the highest disorder in this set; both classes are equally likely and the outcome is highly unpredictable.
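
A minimal R sketch verifying these three subgroup entropies from their (Yes, No) counts:

entropy <- function(counts) {
  p <- counts / sum(counts)
  p <- p[p > 0]                 # 0 * log2(0) is taken as 0
  -sum(p * log2(p))
}

entropy(c(3, 0))   # High   -> 0
entropy(c(0, 7))   # Low    -> 0
entropy(c(1, 1))   # Medium -> 1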

[Figure: Skill split with subgroup entropies – High: 0, Medium: 1, Low: 0]

If we know there are 2 output classes, like head / tail, and the probability of heads is x, then the probability of tails is 1 – x. In such a scenario the entropy curve can be obtained by running (in R):

curve( -x * log2(x) - (1 - x) * log2(1 - x), col = "red", ylab = "Entropy" )

[Graph: entropy curve for two classes; 0 at x = 0 and x = 1, peaking at 1 for x = 0.5]

In the next part we will talk about "Information Gain", which details how a decision tree uses entropy to select the appropriate feature to split the data on.

Till next time..

Guru

 


Decision Trees – Part 1

The decision tree algorithm builds a "Tree Structure" (Root, Branches and Leaf Nodes) from the underlying data. Decision trees are built using a heuristic called "recursive partitioning" or "divide and conquer". At a high level, the algorithm works as follows.

Given a dataset,

  • Should we split?
    • Is it required to split the data to improve prediction accuracy?
  • Can we split?
    • Are there enough features to split on to improve prediction accuracy?
  • Assuming "Yes" for both of the above:
  • Why split?
    • Data is split into multiple subgroups to improve prediction and to gain information about the data from the split ("Information Gain").
  • How to split?
    • Based on the data, from the set of given attributes, choose an attribute that will be used to split the data into subgroups (i.e. from Root –> Branches).
    • Based on the values of the selected column, form "Groups" or "Distinct Categories". Using these groups, categorize and quantify the output (label) variable.
    • After the subgroups are formed, check whether prediction can be done with 100% accuracy (or as close as possible).
      • If yes, stop the splitting process.
      • If no, continue splitting until
        • prediction is 100% accurate,
        • there are no more attributes (features) to split on, or
        • the tree has become so big, with so many branches, that it is practically unusable for prediction.

To understand this better, below is a simple case study.

Companies like Monster.com, Glassdoor.com, Indeed.com and others offer hiring services to companies and to candidates open for jobs. As the cost of sourcing is high, the supply of quality candidates (open for hiring) is low and the rejection rate (due to low-quality candidates) by hiring companies is high, it becomes business critical to send the right candidates, with the right skills and capabilities, to those companies where they have a high chance of getting hired.

  • How about, before sending a candidate for an interview, predicting whether he / she has a high probability of getting hired and then taking action?
  • Better still, predict skill gaps and the suitability of a candidate for a company, get the candidate ramped up, and then send them for the interview 🙂 ?

To understand how "Decision Trees" can be used for such predictions, below is a small sample of example data. The data is self-explanatory.

Experience      Technology     Skill    Hire
8 – 10 Years    Development    High     Yes
8 – 10 Years    Database       High     Yes
8 – 10 Years    Database       Low      No
6 – 8 Years     Architecture   Low      No
6 – 8 Years     Architecture   Medium   Yes
10 – 12 Years   Development    High     Yes
10 – 12 Years   Architecture   Low      No
10 – 12 Years   Database       Low      No
6 – 8 Years     Development    Low      No
6 – 8 Years     Database       Medium   No
8 – 10 Years    Database       Low      No
8 – 10 Years    Database       Low      No

Based on the order of the columns selected for splitting (sub-grouping) the data, multiple variant decision trees can be formed.

In below trees,

  • Green indicates prediction with 100% accuracy (the data in that group is homogeneous, i.e. belongs to a single category of Yes or No)
  • Yellow / Orange indicates prediction is better than chance but not certain.
  • Red indicates prediction has no certainty at all.

Tree 1: Split the data by Skill, followed by Technology:

[Figure: Tree 1 – split on Skill, then Technology]

Tree 2: Split the data by Experience –> Technology –> Skill:

[Figure: Tree 2 – split on Experience, then Technology, then Skill]

As you can see from the above, both trees lead to accurate predictions in the end, but Tree 1 is better than Tree 2 both in terms of the number of steps involved (depth) and the number of categories (width) of the tree.

Choosing the right attribute for splitting the data at each iteration is an important aspect that dictates the prediction capability and optimization of decision tree models.
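
For readers who want to try this in R, below is a minimal sketch that encodes the sample table above as a data frame and fits a tree with the rpart package. rpart uses the GINI index by default and performs binary splits, so the tree it draws will not look exactly like the hand-drawn trees above; the control settings are only there to force splits on this tiny data set.

library(rpart)

candidates <- data.frame(
  Experience = c("8-10", "8-10", "8-10", "6-8", "6-8", "10-12",
                 "10-12", "10-12", "6-8", "6-8", "8-10", "8-10"),
  Technology = c("Development", "Database", "Database", "Architecture",
                 "Architecture", "Development", "Architecture", "Database",
                 "Development", "Database", "Database", "Database"),
  Skill      = c("High", "High", "Low", "Low", "Medium", "High",
                 "Low", "Low", "Low", "Medium", "Low", "Low"),
  Hire       = c("Yes", "Yes", "No", "No", "Yes", "Yes",
                 "No", "No", "No", "No", "No", "No"),
  stringsAsFactors = TRUE
)

fit <- rpart(Hire ~ Experience + Technology + Skill,
             data = candidates, method = "class",
             control = rpart.control(minsplit = 2, cp = 0))

print(fit)   # with these settings the first split should land on Skill, as in Tree 1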

The next post involves a more mathematical explanation of "Decision Trees" and how an attribute is selected for a split (Information Gain and Entropy), continuing with the same case.

Until next post…

Guru