The preceding seven posts covered:
- Decision Trees basics
- Choosing the attribute to split data on, based on impurity
- Information Gain (C5), Gain Ratio
- GINI Index (CART)
- Misclassification Error
In this post, I will try to describe data types and their impact on Decision Trees.
In a data set, there are attributes whose values need to be predicted (dependent variables) and attributes that help in predicting them (independent variables).
|Variable Type||Data Type||Model|
|Dependent||Categorical||Classification|
|Dependent||Continuous||Regression|
|Variable Type||Data Type||Remarks|
|Independent||Categorical||Can be used for both classification and regression models. If a column has too many distinct values (such as Date, Name, Phone Number, or Pin Code), it is better either to ignore the column or to convert it into values that are useful for prediction.|
|Independent||Continuous||Can also be used for both classification and regression models. With continuous variables, understanding the specific algorithm is important: some implementations quantize values using a single threshold, while others may quantize based on a range, the min and max, or a normal curve. There may be other approaches that I am unaware of.|
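
To make the single-threshold idea for continuous variables concrete, here is a minimal sketch. It assumes a CART-style approach: candidate thresholds are the midpoints between neighbouring sorted values, and the one with the lowest weighted Gini impurity wins. The function names and the toy `ages` / `bought` data are made up for illustration, and real implementations may choose thresholds differently.

```python
from collections import Counter


def gini(labels):
    """Gini impurity of a list of class labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


def best_threshold(values, labels):
    """Return (threshold, weighted_gini) for the best binary split value <= threshold."""
    pairs = sorted(zip(values, labels))
    n = len(pairs)
    best = (None, float("inf"))
    for i in range(1, n):
        low, high = pairs[i - 1][0], pairs[i][0]
        if low == high:
            continue  # no threshold fits between two equal values
        threshold = (low + high) / 2  # midpoint between neighbouring values
        left = [label for value, label in pairs if value <= threshold]
        right = [label for value, label in pairs if value > threshold]
        score = (len(left) * gini(left) + len(right) * gini(right)) / n
        if score < best[1]:
            best = (threshold, score)
    return best


# Hypothetical toy data: customer age and whether they bought a product.
ages = [22, 25, 30, 35, 40, 52]
bought = ["no", "no", "no", "yes", "yes", "yes"]
threshold, impurity = best_threshold(ages, bought)
print(threshold, impurity)
```

With this toy data the split falls between 30 and 35, which separates the two classes completely. A range-based or distribution-based quantizer would replace the midpoint loop with fixed bins.
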
This post covers, as far as possible, data understanding and preparation specific to independent variables.
- Null Values: Attributes whose values are all or mostly blank may not be useful for prediction, so they can be removed or ignored.
- Identical Values: Attributes with the same value for all examples are not useful, so they can be safely removed or ignored.
- Unique Values: Attributes with a unique value for each example, such as alphanumeric identifiers other than dates, timestamps, or numeric values (Phone Numbers, Pin Codes, Names, User or Login Names, Email Addresses), are either ignored or converted into values that are useful for prediction.
  - Phone Number: The first few digits of a phone number are generally shared and may correspond to a circle (location). If no other attribute relates to geography, this prefix may be usable (if it is accurate; I have never worked with it).
  - Zip Code: Same as above.
  - Name: Does the name of a person, company, or entity have any bearing on the output, and if so, are names (first or last) repeated?
    - For example, when predicting child names from parent names, names may be useful. But if the problem of interest is related to sales, will a name have an impact on a sale? If not, remove the column; otherwise retain it but split it into first and last names (the last name may repeat more often).
  - Email Address: Take the domain name (the part after @).
- Correlated Attributes: Find independent variables (IVs) that are highly correlated with other IVs. Because of the high correlation, only one attribute from each such group should be used.
  - Example: Month Name (Jan to Dec) and Month Number (1 to 12) are highly correlated.
  - Using either of them to split the data would result in the same subtree structure.
  - In such scenarios, use only one attribute for modeling.
  - The choice of which attribute to use depends on:
    - Performance (select the better-performing attribute)
    - Business scenario (select the more descriptive attribute)
  - For most input data, knowledge of the domain and the data is necessary to make such decisions.
- Excel in some cases converts date and time to numerical values. These need to be converted back to date and time before the data is prepared. Working with date and time includes:
  - Splitting date and time into date components (Day, Month, Quarter, Semester, Year) and time components (Hours, Minutes, Seconds), depending on requirements.
  - Note to self: need to check whether a numerical representation of date and/or time has a better impact on regression models.
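
The preparation steps listed above can be sketched in plain Python. This is only an illustration under stated assumptions: the helper names, the 0.9 blank-ratio cut-off, and the sample rows are all hypothetical, and the Excel conversion assumes the 1900 date system with dates from March 1900 onward.

```python
from datetime import datetime, timedelta


def screen_columns(rows, null_ratio=0.9):
    """Flag columns that are mostly blank, constant, or unique text identifiers."""
    flagged = {}
    for col in rows[0]:
        values = [row[col] for row in rows]
        present = [v for v in values if v not in (None, "")]
        if len(values) - len(present) >= null_ratio * len(values):
            flagged[col] = "mostly null"
        elif len(set(present)) == 1:
            flagged[col] = "identical"
        elif all(isinstance(v, str) for v in present) and len(set(present)) == len(values):
            # Only text columns are treated as identifiers; numeric columns are left alone.
            flagged[col] = "unique"
    return flagged


def is_one_to_one(rows, col_a, col_b):
    """True when every value of col_a pairs with exactly one value of col_b and vice versa."""
    pairs = {(row[col_a], row[col_b]) for row in rows}
    return len(pairs) == len({a for a, _ in pairs}) == len({b for _, b in pairs})


def email_domain(address):
    """Keep only the part after the last @, lower-cased."""
    return address.rsplit("@", 1)[-1].lower()


def excel_serial_to_datetime(serial):
    """Convert an Excel serial number (assumed 1900 date system) back to a datetime."""
    return datetime(1899, 12, 30) + timedelta(days=serial)


def date_parts(ts):
    """Split a datetime into the date and time components listed above."""
    return {
        "day": ts.day, "month": ts.month,
        "quarter": (ts.month - 1) // 3 + 1,
        "semester": (ts.month - 1) // 6 + 1,
        "year": ts.year,
        "hour": ts.hour, "minute": ts.minute, "second": ts.second,
    }


# Hypothetical sample rows, invented purely for illustration.
rows = [
    {"email": "a@example.com", "month_name": "Jan", "month_no": 1, "country": "IN", "fax": None, "sold_on": 43831.5},
    {"email": "b@example.com", "month_name": "Feb", "month_no": 2, "country": "IN", "fax": None, "sold_on": 43862.0},
    {"email": "c@sample.org", "month_name": "Jan", "month_no": 1, "country": "IN", "fax": None, "sold_on": 43840.25},
]

flagged = screen_columns(rows)
print(flagged)
print(is_one_to_one(rows, "month_name", "month_no"))
print([email_domain(row["email"]) for row in rows])
print(date_parts(excel_serial_to_datetime(rows[0]["sold_on"])))
```

Here `email` is flagged as unique, `country` as identical, and `fax` as mostly null, while the one-to-one check shows that `month_name` and `month_no` carry the same information, so only one of them needs to be kept. Whether a flagged column is dropped or converted (for example, email to domain) still depends on the domain.
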