Decision Trees – Part 8

The preceding seven posts detailed:

  • Decision Trees basics
  • Selecting the attribute to split data on, based on impurity:
    • Information Gain (C5), Gain Ratio
    • GINI Index (CART)
    • Misclassification Error

In this post, I will try to detail the data types and the impact of data types on Decision Trees.

In a data set, there are attributes whose values need to be predicted, called dependent variables, and there are attributes that help in predicting them, called independent variables.

Dependent variable:

  • Categorical data type → Classification model
  • Continuous data type → Regression model

Independent variable:

  • Categorical: Can be used for both classification and regression models. But if there are too many distinct categorical values (like Date, Name, Phone Number, Pin Code), it is better to either ignore such columns or convert them into values that are useful for prediction.
  • Continuous: Again, can be used for both classification and regression models. With continuous variables, understanding the specific algorithm is important. Some implementations quantize values based on a single threshold, others may quantize based on ranges, min / max, or a normal curve, and there could be many others I am unaware of. (A minimal sketch of the single-threshold idea follows below.)
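Below is a minimal Python sketch of the single-threshold idea: it scans candidate thresholds for one continuous attribute and keeps the one that gives the lowest weighted GINI impurity across the two resulting groups. The data and function names are made up for illustration; actual implementations differ.

```python
from collections import Counter

def gini(labels):
    """GINI impurity of a list of class labels: 1 - sum(p_i^2)."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())

def best_threshold(values, labels):
    """Try midpoints between consecutive distinct values as candidate thresholds
    and return the (threshold, weighted impurity) pair with the lowest impurity."""
    points = sorted(set(values))
    n = len(values)
    best = (None, float("inf"))
    for lo, hi in zip(points, points[1:]):
        t = (lo + hi) / 2.0
        left = [label for v, label in zip(values, labels) if v <= t]
        right = [label for v, label in zip(values, labels) if v > t]
        weighted = (len(left) / n) * gini(left) + (len(right) / n) * gini(right)
        if weighted < best[1]:
            best = (t, weighted)
    return best

# Made-up example: years of experience vs. Hired / Not Hired.
years = [1, 2, 2, 3, 5, 6, 7, 8]
hired = ["No", "No", "No", "No", "Yes", "Yes", "Yes", "Yes"]
print(best_threshold(years, hired))   # -> (4.0, 0.0): a perfectly pure split
```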

This post details, as much as possible, data understanding and preparation specific to independent variables.

Categorical Variables:

  • Null Values: Attributes that have all or most of their values blank may not be useful for prediction, so they can be removed / ignored.
  • Identical Values: Attributes with the same value for all examples are not useful, so they can safely be removed / ignored.
  • Unique Values: Attributes with a unique value for each example, such as alphanumeric identifiers other than dates, timestamps or plain numbers (Phone Numbers, Pin Codes, Names, User or Login Names, Email Addresses).
    • Phone Number: Generally the first few digits of phone numbers are shared and may correspond to a circle (location). If there are no other attributes that link to geography, maybe this can be used (if accurate; I have never worked on it).
    • Zip Code: Same as above.
    • Name: Will the name of a person / company / entity have any bearing on the output, and if so, are the names (either first or last) repeated?
      • For example, for predicting child names based on parent names, names may be useful; but if the problem of interest is related to sales, will a name have an impact on the sale? If not, remove the column; otherwise retain it but split it into first and last names (the last name may repeat more often).
    • Email Address: Take the domain name (the part after the @).
  • Correlated Attributes: Find independent variables (IVs) that are highly correlated with other IVs. Due to the high correlation, only one of such attributes needs to be used.
    • Example: Month Name (Jan .. Dec) and Month Number (1 to 12) are highly correlated.
    • Using either of them to split the data would result in the same sub-tree structure.
    • In such scenarios, use only one attribute for modeling.
    • The selection of which attribute to use depends upon
      • Performance (select the better-performing attribute)
      • Business scenario (select the more descriptive attribute).
      • For most input data, knowledge of the domain and the data is necessary to take such decisions. (A rough pandas sketch of these clean-up steps follows after this list.)
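As a rough illustration (not a prescription), the pandas sketch below strings these clean-up steps together. The file name and column names ("candidates.csv", "email", "month_name", "month_number") are hypothetical, used only to show the idea.

```python
import pandas as pd

# Hypothetical input file and column names, for illustration only.
df = pd.read_csv("candidates.csv")

# Null values: drop columns that are mostly blank (here, more than 90% missing).
df = df.loc[:, df.isna().mean() <= 0.9]

# Identical values: drop columns holding a single distinct value.
df = df.loc[:, df.nunique(dropna=False) > 1]

# Email address: keep only the domain name (the part after the '@').
if "email" in df.columns:
    df["email_domain"] = df["email"].str.split("@").str[-1]
    df = df.drop(columns=["email"])

# Unique identifiers: drop columns where every row has a different value
# (phone numbers, login names, raw identifiers, ...).
df = df.loc[:, df.nunique() < len(df)]

# Correlated attributes: month name and month number carry the same
# information, so keep only one of them.
if {"month_name", "month_number"}.issubset(df.columns):
    df = df.drop(columns=["month_name"])   # keep the numeric version
```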

Continuous Variables:

  • Excel in some cases converts dates and times to numerical values. They will need to be converted back to dates and times before preparing the data. Working on date and time includes
    • Splitting the date and time components into Date (Day, Month, Quarter, Semester, Year) and Time (Hours, Minutes, Seconds) parts, depending upon requirements (see the sketch after this list).
      • Note to self: need to check whether a numerical representation of date and / or time has a better impact on regression models.
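A small pandas sketch of this conversion is below. The serial numbers and the column name "hired_on" are made up; Excel's day-count origin of 1899-12-30 is the usual convention for its serial dates.

```python
import pandas as pd

# Hypothetical Excel export where dates arrived as day-count serial numbers.
df = pd.DataFrame({"hired_on": [43831, 44105, 44378]})

# Convert the serial numbers back to timestamps (Excel counts days from 1899-12-30).
df["hired_on"] = pd.to_datetime(df["hired_on"], unit="D", origin="1899-12-30")

# Date components.
df["day"] = df["hired_on"].dt.day
df["month"] = df["hired_on"].dt.month
df["quarter"] = df["hired_on"].dt.quarter
df["semester"] = (df["hired_on"].dt.month > 6).astype(int) + 1
df["year"] = df["hired_on"].dt.year

# Time components (all zero here, since these serials carry no time part).
df["hour"] = df["hired_on"].dt.hour
df["minute"] = df["hired_on"].dt.minute
df["second"] = df["hired_on"].dt.second
```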

Decision Trees – Part 7

In earlier posts we discussed how an attribute is selected based on Information Gain (Entropy) and the GINI Index. Similar to those computations, "Misclassification Error" is another method to select the optimal attribute to split on when building decision trees.

At a node (N) with S total elements, of which count(i) belong to class i (so p(i) = count(i) / S), the misclassification error can be computed as

Misclassification Error(N) = 1 – max (i) p(i)

For a two-class problem, the misclassification error ranges between 0 (minimum) and 0.5 (maximum).

[Graph: misclassification error as a function of the class proportion, ranging from 0 to 0.5]

Below is a table showing, for splits of 10 candidates into "Hired" and "Not Hired", the computed values of Entropy, the GINI impurity index and the Misclassification error.

Hired   Not Hired   Total   Entropy   GINI   Misclassification Error
0       10          10      0.00      0.00   0.0
1        9          10      0.47      0.18   0.1
2        8          10      0.72      0.32   0.2
3        7          10      0.88      0.42   0.3
4        6          10      0.97      0.48   0.4
5        5          10      1.00      0.50   0.5
6        4          10      0.97      0.48   0.4
7        3          10      0.88      0.42   0.3
8        2          10      0.72      0.32   0.2
9        1          10      0.47      0.18   0.1
10       0          10      0.00      0.00   0.0
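The table above can be reproduced with a few lines of Python, using only the definitions from these posts (p is the fraction of "Hired" candidates):

```python
import math

def entropy(p):
    """Two-class entropy: -p*log2(p) - (1-p)*log2(1-p), with 0*log2(0) taken as 0."""
    if p in (0.0, 1.0):
        return 0.0
    return -p * math.log2(p) - (1 - p) * math.log2(1 - p)

def gini(p):
    """Two-class GINI impurity: 1 - p^2 - (1-p)^2."""
    return 1 - p ** 2 - (1 - p) ** 2

def misclassification(p):
    """Misclassification error: 1 - max(p, 1-p)."""
    return 1 - max(p, 1 - p)

total = 10
for hired in range(total + 1):
    p = hired / total
    print(hired, total - hired, total,
          round(entropy(p), 2), round(gini(p), 2), round(misclassification(p), 2))
```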

Summary:

  • To select an appropriate attribute for splitting, decision trees use an "impurity reduction" method.
  • Impurity of nodes can be computed using
    • Information Gain
    • GINI Impurity Index
    • Misclassification Error
  • Impurity reduction can be computed as the difference between the "impurity at the node before the split" and the "aggregated impurities of all child nodes".
  • The Information Gain method is biased towards categorical attributes that have many distinct values (singleton splits with 100% purity). To avoid this, an enhanced measurement called "Gain Ratio" is used.

The next post covers data types and the impact of data types on Decision Trees.

Guru

Decision Trees – Part 6

Similar to information gain, alternative measures can be used to quantify the impurity of a node and thus play a role in selecting the attribute used to split a node into sub-nodes (branches) or leaves.

GINI: Similar to Information Gain, GINI measures the impurity of a node; it is an alternative and can be used in place of Information Gain. For example, CART (Classification and Regression Trees) uses the GINI index for splitting decision tree nodes.

If a node (N) with "s" total data elements contains count(i) elements of class "i" (so p(i) = count(i) / s), then

GINI(N) = 1 – Σ (i) p(i)^2

Similar to entropy, GINI impurity index values range from 0 to 0.5. A graph plotted with the GINI index values is below.

[Graph: GINI index as a function of the class proportion, ranging from 0 to 0.5]

While entropy values range between 0 and 1, GINI index values range between 0 and 0.5 (for two classes). Additionally, since GINI already accounts for the number of classes (and their counts), it may not need an extra measure like the "Gain Ratio" used with Information Gain.

Below is a comparison of the Entropy and GINI values for different splits of Hired and Not Hired among 10 total candidates.

Hired   Not Hired   Total   Entropy    GINI
0       10          10      0.000000   0.00
1        9          10      0.468996   0.18
2        8          10      0.721928   0.32
3        7          10      0.881291   0.42
4        6          10      0.970951   0.48
5        5          10      1.000000   0.50
6        4          10      0.970951   0.48
7        3          10      0.881291   0.42
8        2          10      0.721928   0.32
9        1          10      0.468996   0.18
10       0          10      0.000000   0.00

Similar to computing "Information Gain" for a split on attribute "A", the GINI impurity is first computed at the parent node, and the attribute used to split the node is selected based on the reduction in impurity. If a node (N) with "t" total elements is split into "k" sub-nodes, each containing "t(i)" elements, the aggregated GINI impurity is

GINI_split = Σ (i = 1 to k) (t(i) / t) * GINI(sub-node i)

As with Information Gain, when using the GINI impurity index, attributes whose split of the parent node results in larger sub-nodes (more values) with higher purity are preferred.
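A minimal sketch of this selection rule, assuming made-up labels and a made-up two-way split, is below: the attribute whose split gives the largest reduction in GINI impurity would be preferred.

```python
def gini(labels):
    """GINI impurity of a list of class labels: 1 - sum(p_i^2)."""
    n = len(labels)
    counts = {}
    for label in labels:
        counts[label] = counts.get(label, 0) + 1
    return 1.0 - sum((count / n) ** 2 for count in counts.values())

def gini_reduction(parent, children):
    """Impurity at the parent node minus the size-weighted impurity of its children."""
    t = len(parent)
    aggregated = sum(len(child) / t * gini(child) for child in children)
    return gini(parent) - aggregated

# Made-up example: 10 candidates split by a hypothetical attribute into two groups.
parent = ["Hired"] * 4 + ["Not Hired"] * 6
children = [["Hired"] * 4 + ["Not Hired"] * 1, ["Not Hired"] * 5]
print(gini_reduction(parent, children))   # 0.48 - 0.16 = 0.32; larger is better
```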

Next is measuring impurity using “Misclassification Error”.

Decision Trees – Part 5

Information gain, as discussed in the earlier post, is the "reduction in un-orderedness" in the training data due to a split. The tree-building procedure selects attributes that maximize information gain. Entropy and information gain are explained below.

Entropy at a node (N), with probability p that an element belongs to one of two mutually exclusive classes (like head / tail), can be computed as

E(N) = – p * log2(p) – (1 – p) * log2(1 – p)

The above equation can be generalized to the multi-class case as

E(N) = – p1 * log2(p1) – p2 * log2(p2) – … – pn * log2(pn) = – Σ (i = 1 to n) p(i) * log2(p(i))

If a node is split into multiple branches, then each branch's subset has its own entropy (computed using the earlier formula). To aggregate the entropies of the branches and compare them with the parent node, a weighted approach is used: branches with more elements carry more weight, so a large branch with high entropy is penalized more than a large branch with low entropy.

If Si is the ith subset obtained by splitting a set S, then the summarized (weighted) entropy over all the subsets is

E_split = Σ (i) (|Si| / |S|) * E(Si)

where

  • |Si| = number of elements in the ith subset
  • |S| = number of elements in the superset (at the node before the split)
  • E(Si) = entropy of the ith subset.

Information gain can be computed as the difference between the (un)orderedness of the node and the summarized un-orderedness of the child nodes (subsets of the node's data) created by the split. The larger the entropy difference, the better the information gain. If a set S is split into i subsets based on the values of attribute A, then the information gain is Entropy(S) minus the summarized entropies of the child subsets; mathematically,

Gain(S, A) = E(S) – Σ (i) (|Si| / |S|) * E(Si)
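A small Python sketch of this computation, with a made-up split of 10 candidates, is below; Gain(S, A) is exactly E(S) minus the weighted entropies of the subsets.

```python
import math
from collections import Counter

def entropy(labels):
    """E(S) = -sum(p_i * log2(p_i)) over the classes present in S."""
    n = len(labels)
    return -sum((count / n) * math.log2(count / n) for count in Counter(labels).values())

def information_gain(parent, subsets):
    """Entropy of the parent set minus the size-weighted entropies of the subsets."""
    n = len(parent)
    weighted = sum(len(s) / n * entropy(s) for s in subsets)
    return entropy(parent) - weighted

# Made-up example: splitting 10 candidates on a hypothetical attribute.
parent = ["Hired"] * 5 + ["Not Hired"] * 5
subsets = [["Hired"] * 5 + ["Not Hired"] * 1, ["Not Hired"] * 4]
print(round(information_gain(parent, subsets), 3))   # about 0.61
```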

Additional Notes:

  • When a subset contains homogeneous elements (every element belongs to the same class of the dependent variable), entropy is at its minimum and equals zero (0)
  • When a subset contains equally likely elements (an equal number from each class), entropy is at its maximum and equals one (1)

Update: Added a graph for entropy to indicate that its values lie between 0 and 1, and since it is logarithmic it is a smooth curve.

  • 0 <= Entropy <= 1

[Graph: entropy as a function of the class proportion, ranging from 0 to 1]

  • If splitting a node results in pure subset nodes (entropy 0), then the information gain is equal to the entropy of the node.
  • If splitting a node does not improve information gain, the aggregated entropy of the child nodes equals the entropy of the node.

If a dataset has an attribute that uniquely identifies rows, using such an attribute to split the data would result in all-pure, though singleton, subsets. Thus the information gain from using such an attribute would be the highest.

For example, any of the attributes below would result in pure splits, so the information gain would be maximum.

  • Customer Name or
  • Pin code or Phone Numbers
  • Dates or Time or Timestamps
  • Identity
  • Colors

Decision tree algorithms are biased towards such attributes; even mathematically, since the information gain is maximum, such an attribute should be used for splitting the data. But are such trees really useful? If such an attribute is used at all for a decision tree, the resultant tree would have the following problems:

  • The tree would be wide (due to many leaf nodes)
    • For example, if there is a table with 1 M records, each with a Phone Number attribute (IV), the tree would have 1 M leaves, each holding an individual phone number and its class output
  • Such trees would not be useful
    • Over-fitting: A new row will not have one of the seen values but a different value, so the tree fails to classify it into the correct leaf. A decision tree built on such attributes is therefore not useful.
    • Fragmentation: The performance of the tree is poor, since up to 1 M leaves may have to be traversed even when phone numbers repeat. Consider how many steps are required to read this data.

To balance the bias of information gain towards attributes with many values, the "Gain Ratio" is used. Before computing the Gain Ratio, first understand and compute the "Intrinsic Information". Intuitively, intrinsic information reflects the number of steps required to retrieve the correct leaf node after the split; in other terms, it is the entropy of the distribution of elements across the child nodes. If a split results in many equally-sized child nodes, the probability of landing in any single child node is 1 / (number of child nodes), and the entropy of such a distribution is – Σ (1 / number of child nodes) * log2(1 / number of child nodes). Mathematically it can be written as

IntrinsicInfo(S, A) = – Σ (i) (|Si| / |S|) * log2(|Si| / |S|)

The Gain Ratio is then computed as: Gain Ratio (GR) = Information Gain / Intrinsic Information.
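The sketch below computes intrinsic information and gain ratio for two made-up splits; the numbers only illustrate how an identifier-like attribute that splits 10 rows into 10 singleton leaves gets pulled down by its high intrinsic information.

```python
import math

def intrinsic_information(subset_sizes):
    """Entropy of the split itself: -sum((|Si|/|S|) * log2(|Si|/|S|))."""
    total = sum(subset_sizes)
    return -sum((s / total) * math.log2(s / total) for s in subset_sizes if s > 0)

def gain_ratio(information_gain, subset_sizes):
    """Gain Ratio = Information Gain / Intrinsic Information."""
    return information_gain / intrinsic_information(subset_sizes)

print(intrinsic_information([1] * 10))   # 10 singleton leaves -> log2(10), about 3.32
print(gain_ratio(1.0, [1] * 10))         # even a 'perfect' gain of 1.0 drops to about 0.30
print(gain_ratio(0.61, [6, 4]))          # a balanced two-way split keeps about 0.63
```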

There are also other mechanisms, like the Gini Index.

Future posts will dwell upon the Gini Index, data types and their impact on decision trees, data preparation for decision trees, validation of decision trees, and decision tree problems (for classification and regression).

Let me know if there are more things to be added to future posts specific to decision trees, and feel free to leave comments so that I can learn from them.

Thanks for reading….

Guru