# Decision Trees – Part 1

The decision tree algorithm builds a tree structure (a root, branches, and leaf nodes) from the underlying data. Decision trees are built using a heuristic called "recursive partitioning", also known as "divide and conquer". At a high level, the algorithm works as follows.

Given a dataset,

• Should we split?
  • Is a split of the data required to improve prediction accuracy?
• Can we split?
  • Are there enough features left to split on?
• Assuming "Yes" for both of the above:
• Why split?
  • The data is split into multiple sub-groups to improve prediction and to gain information about the data ("Information Gain").
• How to split?
  • From the set of given attributes, choose an attribute that will be used to split the data into sub-groups (i.e., from root to branches).
  • Based on the values in the selected column, form groups (distinct categories). Within each group, categorize and count the output (label) variable.
  • After the sub-groups are formed, check whether prediction can be done with 100% accuracy (or as close as possible).
  • If yes, stop the splitting process.
  • If no, continue splitting until:
    • prediction is 100% accurate,
    • there are no more attributes (features) to split on, or
    • the tree has become so big, with so many branches, that it is practically unusable for prediction.
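The steps above can be sketched in Python. This is a minimal illustration, not a production implementation; the function and variable names (`build_tree`, `records`, `label`) are my own, and the attribute to split on is simply taken in the given order (choosing it well is the subject of the next post):

```python
from collections import Counter

def build_tree(records, attributes, label="Hire"):
    """Recursive partitioning sketch: records is a list of dicts,
    attributes is the list of column names still available for splitting."""
    labels = [r[label] for r in records]
    # Stop: the group is homogeneous, so prediction here is 100% accurate.
    if len(set(labels)) == 1:
        return labels[0]
    # Stop: no attributes left to split on; predict the majority class.
    if not attributes:
        return Counter(labels).most_common(1)[0][0]
    # Placeholder attribute choice; Information Gain (next post) does this properly.
    attr = attributes[0]
    tree = {attr: {}}
    for value in set(r[attr] for r in records):
        subset = [r for r in records if r[attr] == value]
        tree[attr][value] = build_tree(subset, attributes[1:], label)
    return tree
```

Each recursive call handles one sub-group, and recursion stops exactly at the conditions listed above.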

To understand this better, below is a simple case study.

Companies like Monster.com, Glassdoor.com, Indeed.com and others offer hiring services to employers and to candidates open for jobs. Because the cost of sourcing is high, the supply of quality candidates is low, and the rejection rate by hiring companies (due to low-quality candidates) is high, it becomes business critical to send the right candidates, with the right skills and capabilities, to those companies where they have a high chance of getting hired.

• How about predicting, before sending a candidate for an interview, whether he or she has a high probability of getting hired, and then taking action?
• Better still, predict skill gaps and the candidate's suitability for a company, get the candidate ramped up, and then send them for the interview 🙂?

To understand how decision trees can be used for such predictions, below is some sample data. The data is self-explanatory.

| Experience | Technology | Skill | Hire |
| --- | --- | --- | --- |
| 8 – 10 Years | Development | High | Yes |
| 8 – 10 Years | Database | High | Yes |
| 8 – 10 Years | Database | Low | No |
| 6 – 8 Years | Architecture | Low | No |
| 6 – 8 Years | Architecture | Medium | Yes |
| 10 – 12 Years | Development | High | Yes |
| 10 – 12 Years | Architecture | Low | No |
| 10 – 12 Years | Database | Low | No |
| 6 – 8 Years | Development | Low | No |
| 6 – 8 Years | Database | Medium | No |
| 8 – 10 Years | Database | Low | No |
| 8 – 10 Years | Database | Low | No |
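The table above can be encoded directly in Python to explore candidate splits. The snippet below is a sketch with my own variable names (`rows`, `by_skill`) and a compact encoding of the experience ranges; it groups the Hire labels by Skill to show what a first split on that column would produce:

```python
from collections import Counter

# The 12 rows of the sample data: (Experience, Technology, Skill, Hire).
rows = [
    ("8-10", "Development", "High", "Yes"),
    ("8-10", "Database", "High", "Yes"),
    ("8-10", "Database", "Low", "No"),
    ("6-8", "Architecture", "Low", "No"),
    ("6-8", "Architecture", "Medium", "Yes"),
    ("10-12", "Development", "High", "Yes"),
    ("10-12", "Architecture", "Low", "No"),
    ("10-12", "Database", "Low", "No"),
    ("6-8", "Development", "Low", "No"),
    ("6-8", "Database", "Medium", "No"),
    ("8-10", "Database", "Low", "No"),
    ("8-10", "Database", "Low", "No"),
]

# Group the Hire labels by the Skill column.
by_skill = {}
for exp, tech, skill, hire in rows:
    by_skill.setdefault(skill, []).append(hire)

for skill, hires in sorted(by_skill.items()):
    print(skill, Counter(hires))
# High   -> Counter({'Yes': 3})           homogeneous: all hired
# Low    -> Counter({'No': 7})            homogeneous: none hired
# Medium -> Counter({'No': 1, 'Yes': 1})  mixed: needs a further split
```

Two of the three Skill groups are already homogeneous, so only the Medium group needs further splitting.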

Depending on the order of the columns selected for splitting (sub-grouping) the data, multiple variant decision trees can be formed.

In the trees below,

• Green indicates prediction with 100% accuracy (the data in that group is homogeneous, i.e., it belongs to a single category of Yes or No).
• Yellow / orange indicates prediction is better but not certain.
• Red indicates prediction has no certainty at all.

Tree 1: Split the data by Skill, followed by Technology:

Tree 2: Split the data by Experience –> Technology –> Skill:

As you can see from the above, both trees eventually lead to accurate predictions, but Tree 1 is better than Tree 2 in terms of both the number of steps involved (depth) and the number of categories (width) of the tree.
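A quick check makes the difference concrete. Grouping the labels by Experience (Tree 2's first split) yields no homogeneous group at all, so every branch needs further splitting; variable names and the compact experience encoding below are my own re-statement of the table:

```python
from collections import Counter

# Re-encoding of the sample data: (Experience, Technology, Skill, Hire).
rows = [
    ("8-10", "Development", "High", "Yes"),
    ("8-10", "Database", "High", "Yes"),
    ("8-10", "Database", "Low", "No"),
    ("6-8", "Architecture", "Low", "No"),
    ("6-8", "Architecture", "Medium", "Yes"),
    ("10-12", "Development", "High", "Yes"),
    ("10-12", "Architecture", "Low", "No"),
    ("10-12", "Database", "Low", "No"),
    ("6-8", "Development", "Low", "No"),
    ("6-8", "Database", "Medium", "No"),
    ("8-10", "Database", "Low", "No"),
    ("8-10", "Database", "Low", "No"),
]

# Group the Hire labels by the Experience column (Tree 2's first split).
by_exp = {}
for exp, tech, skill, hire in rows:
    by_exp.setdefault(exp, []).append(hire)

for exp, hires in sorted(by_exp.items()):
    print(exp, Counter(hires))
# 10-12 -> Counter({'No': 2, 'Yes': 1})  mixed
# 6-8   -> Counter({'No': 3, 'Yes': 1})  mixed
# 8-10  -> Counter({'No': 3, 'Yes': 2})  mixed
```

Every Experience group is mixed, whereas splitting on Skill first (as in Tree 1) immediately isolates two homogeneous groups, which is exactly why Tree 1 ends up shallower and narrower.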

Choosing the right attribute to split on at each iteration is an important aspect that dictates the prediction capability and efficiency of decision tree models.

The next post gives a more mathematical explanation of decision trees and of how an attribute is selected for a split (Information Gain and Entropy), continuing with the same case study.

Until next post…

Guru