Handling Missing Values

Once again, as part of project work at IIITB, we were tasked with finding the best startup to invest in. Without going into the details of the project, I would like to pick up one specific problem: how we were supposed to handle missing values.

Only the fundingamount feature is required for our analysis.

Number of observations: 114949, of which 19990 are missing values (NAs).

image

Summary statistics: summary(missingValues$fundingamount)

     Min.   1st Qu.    Median      Mean   3rd Qu.      Max.      NA's
0.000e+00 3.225e+05 1.681e+06 1.043e+07 7.000e+06 2.127e+10     19990

To look at missing data across all features and observations, the md.pattern function from the “mice” package may be used.

Multi feature missing observations:

md.pattern(missingValues)

image

It is a cross tab between features and the number of observations in each missing-data pattern (1 = present, 0 = missing). From the above result:

  • fundingamount is present in 94959 observations (value 1) and missing in 19990 (value 0)
  • fundingtype is present in all observations.

The edge cells (row number 3 and column V3) hold counts of missing values.

  • For fundingtype the value is 1 for both the 94959 and the 19990 pattern, so the last row shows 0 missing values for it.
  • Thus 0 missing values for fundingtype implies there are no missing values. For fundingamount the last row shows 19990.
  • A similar count is done across rows: the last column gives the number of features missing in each pattern.
    • 94959 observations have values for both fundingtype and fundingamount (thus missing 0)
    • 19990 observations have fundingtype but no fundingamount (thus missing 1)
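The same counts can be reproduced on a small example without the mice package. This is only a sketch: the data frame toy and its values are made up for illustration, and base R functions stand in for md.pattern.

```r
# Toy data frame (made-up values) with the same two columns as in the post
toy <- data.frame(
  fundingtype   = c("seed", "venture", "seed", "angel", "venture"),
  fundingamount = c(1e5, NA, 2.5e6, NA, 7e6)
)

# Missing values per feature (what the last row of md.pattern reports)
missing_per_feature <- colSums(is.na(toy))

# 1 = present, 0 = missing, the same coding md.pattern uses
present <- ifelse(is.na(toy), 0, 1)

# Observations per pattern (what the left-hand counts of md.pattern report)
pattern <- apply(present, 1, paste, collapse = "")
pattern_counts <- table(pattern)

print(missing_per_feature)
print(pattern_counts)
```

Pattern "11" means both features are present and "10" means fundingtype is present while fundingamount is missing.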

Visually interpreting missing data:

library(VIM)  # aggr() comes from the VIM package

aggr_plot <- aggr(
  missingValues, col = c("navyblue", "red"),
  numbers = TRUE, sortVars = TRUE,
  labels = names(missingValues),
  cex.axis = 0.7, gap = 3,
  ylab = c("Histogram of missing data", "Pattern")
)

Output:

 Variables sorted by number of missings: 
      Variable     Count
 fundingamount 0.1739032
   fundingtype 0.0000000

Graphical Result:

image

As 17% of the roughly 114 K funding amount values are missing, ignoring the NA rows is not an optimal solution. Instead, the requirement is to impute values in place of NA. The challenge is which value to impute.

As the feature is a ratio variable, options for imputing include:

  • Use 0
  • Use Mean, Median
  • Compute mean after clipping single outlier values
  • Compute mean after finding outliers with a box plot and clipping them
  • Feature transformation
  • Using MICE / PCA (considering there are only 2 features, these will not be done)
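The simplest options in the list (0, mean, median) all come down to replacing NA with one number. A minimal sketch, assuming a made-up vector amounts and a hypothetical helper impute_with() that is not part of any package:

```r
# Replace NA entries of a numeric vector with a single fill value.
# impute_with() is a hypothetical helper written for this illustration.
impute_with <- function(x, fill) {
  x[is.na(x)] <- fill
  x
}

amounts <- c(0, 2e5, NA, 1.5e6, NA, 9e6)  # made-up funding amounts

with_zero   <- impute_with(amounts, 0)
with_mean   <- impute_with(amounts, mean(amounts, na.rm = TRUE))
with_median <- impute_with(amounts, median(amounts, na.rm = TRUE))

print(with_zero)
print(with_mean)
print(with_median)
```

Only the fill value changes between the three variants, which is why the rest of the post concentrates on finding a representative value.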

Computing base mean and density distribution:

Let us compute the base mean after removing NA values. This base mean is impacted by outliers on either side.

  • plot(density(filter(missingValues, !is.na(fundingamount))$fundingamount), col = "red", main = "Dist. Raised Amount")

Notice the red line that runs almost parallel to the Y axis. It is an extremely right-skewed distribution, where a few funding amounts reach about 21 billion but almost all of them are near 0. The summary output above reinforces this: Min value = 0, Max value = 21.27 billion dollars.

image

  • boxplot(filter(missingValues, !is.na(fundingamount))$fundingamount, main = "Dist. Raised Amount")

The box plot below also shows that the data is skewed.

image

  • floor(mean(filter(missingValues,!is.na(fundingamount))$fundingamount))
Base Mean: 10,426,869 (10.42 M Dollars)

Clipping Single Outlier values from Top and Bottom:

As outliers impact the mean, one simple method we used was to clip the single most extreme value (this could be many observations if they share that value) on both the top and the bottom side of the mean, and then compute the mean again. Notice that even after removing the single outlier values on either side, the distribution is still extremely skewed. The mean reduced only by about 0.22 million.

  • outlier(filter(missingValues, !is.na(fundingamount))$fundingamount) -> maxOutlier
  • outlier(filter(missingValues, !is.na(fundingamount))$fundingamount, opposite = TRUE) -> minOutlier
  • plot(density(filter(missingValues, !is.na(fundingamount) & fundingamount > minOutlier & fundingamount < maxOutlier)$fundingamount), main = "Dist. fundingamount with one outlier removed", col = "red")

image

  • boxplot(filter(missingValues, !is.na(fundingamount) & fundingamount > minOutlier & fundingamount < maxOutlier)$fundingamount, main = "Dist. fundingamount with one outlier removed")

image

  • mean(filter(missingValues, !is.na(fundingamount) & fundingamount > minOutlier & fundingamount < maxOutlier)$fundingamount)
Mean with single outlier values clipped: 10,202,965 (10.20 M USD)

Clipping Outlier values Using Box Plot:

The box plot provides a heuristic where any point lying more than 1.5 times the interquartile range (the height of the box) beyond either edge of the box is considered an outlier. Using that, we removed outliers again. Now we saw the data move slightly towards a normal form. We could have adjusted 1.5 to 1.2 or even less, but with that our number of rows would come down. Nevertheless, it showed we were on the right track.

  • outliervalues <- boxplot.stats(missingValues$fundingamount)$out
  • plot(density(filter(missingValues, !(fundingamount %in% outliervalues) & !is.na(fundingamount))$fundingamount), main = "Dist. With outliers removed using Box Plot", col = "red")

image

  • boxplot(filter(missingValues, !(fundingamount %in% outliervalues) & !is.na(fundingamount))$fundingamount, main = "Dist. With outliers removed using Box Plot", col = "red")

image

  • floor(mean(filter(missingValues,!(fundingamount %in% outliervalues) & !is.na(fundingamount))$fundingamount))
Mean after outlier clipping using Boxplot: 3064248 (3.06 M Dollars)
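To see what boxplot.stats flags, the fences can be rebuilt by hand on a tiny vector. This is an illustrative sketch on made-up numbers, not the project data; the names x, lower_fence and upper_fence are introduced here only for the example.

```r
x <- c(1, 2, 3, 4, 5, 6, 7, 8, 9, 100)  # made-up values with one extreme point

stats  <- boxplot.stats(x)              # default coef = 1.5
hinges <- stats$stats[c(2, 4)]          # lower and upper hinge (edges of the box)
step   <- 1.5 * diff(hinges)            # 1.5 times the height of the box

lower_fence <- hinges[1] - step
upper_fence <- hinges[2] + step

# Points outside the fences are the ones boxplot.stats reports in $out
manual_out <- x[x < lower_fence | x > upper_fence]
kept       <- x[!(x %in% stats$out)]

print(manual_out)
print(mean(x))     # mean pulled up by the extreme value
print(mean(kept))  # mean after dropping the flagged points
```

Passing a smaller coef to boxplot.stats (for example coef = 1.2) narrows the fences and flags more points, which is the trade-off mentioned above.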

Log Transformation:

We log-transformed the variable and plotted it. The transformed data is approximately normal. We took the mean of the log values and applied the antilog (exp) to get a mean on the original scale, which is the geometric mean of the positive funding amounts. This value was used to replace NA. One observation: replacing NA with the mean arrived at by this method did not much alter the shape of the original distribution.

  • plot(density(log(filter(missingValues, !is.na(fundingamount) & fundingamount > 0)$fundingamount)), main = "Dist. Log(fundingamount)", col = "red")

image

  • boxplot(log(filter(missingValues, !is.na(fundingamount) & fundingamount > 0)$fundingamount), main = "Dist. Log(fundingamount)", col = "red")

image

  • floor(exp(mean(log(filter(missingValues,!is.na(fundingamount) & fundingamount > 0 )$fundingamount))))
Mean After Log Transformation: 1422627 (1.42 Million Dollars)
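The effect of averaging on the log scale is easy to see on a handful of numbers. A small sketch with a made-up vector amounts (not the project data), comparing the plain mean with the back-transformed mean of logs and then using the latter to fill NA:

```r
amounts <- c(1e4, 1e5, NA, 1e6, 0, 1e9, NA)  # made-up, heavily right-skewed

# log() is only defined for values above 0, so zeros and NAs are left out
positive <- amounts[!is.na(amounts) & amounts > 0]

arithmetic_mean <- mean(positive)
log_mean <- exp(mean(log(positive)))  # mean of logs, transformed back

# Fill the missing entries with the back-transformed mean
imputed <- amounts
imputed[is.na(imputed)] <- log_mean

print(arithmetic_mean)
print(log_mean)
print(imputed)
```

The single very large value dominates the arithmetic mean but has far less pull on the log scale.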

Means across various methods:

Method                             Mean
Base                               10.42 M USD
Single Outlier Values Clipped      10.20 M USD
Outliers clipped with Box Plot     3.06 M USD
Log Transformation                 1.42 M USD

Summary: As the data is heavily skewed, the mean is not an apt representative of such a distribution. Outlier clipping helped to some extent but did not remove the skew the way the log transformation did. Based on this, in our project we considered the log transformation optimal and used it to compute the mean that replaced the NA / missing values.

If you feel we could have tried other options as well, please let us know through the comments section.

Until next learning…

-Guru

Naive Bayes (Introduction to Probability) – Part 1

The Naive Bayes algorithm is based on the mathematical foundations of probability and is an extension of conditional probability. My goal by the end of this NAIVE BAYES series is to read, understand, experience, be exhilarated by and enjoy Naive Bayes and transform myself into a Bayesian geek :). So let me start from the ground up; this post details high-level probability concepts. Sorry, this is going to be one long post.

  • Definition of Probability:
    • Probability is a numerical statement or quantification of the likelihood of a result or outcome of a phenomenon, experiment or situation whose outcome is uncertain. Probability can also be understood as a quantification of randomness (risk), the randomness that makes the result all the more uncertain.
    • Probability enables us to think in a systematic manner, providing quantified outcomes about arbitrary uncertain events.
    • Probability axioms:
      • Non-negativity: Probabilities are non-negative. The probability of an event ranges from 0 to 1 (inclusive). 0 <= Probability <= 1.
        • 0 implies the event will not occur and
        • 1 implies the event will occur
      • Normalization: The probability of the entire sample space (all possible outcomes) is 1.
      • Additivity: if A and B are disjoint (A & B is the null set) then P(A union B) = P(A) + P(B)
  • Probabilistic Sample Space: For an experiment, all possible outcomes constitute the sample space. It is the one superset to which all events (subsets) belong. The sample space can be discrete and finite, or continuous and infinite, based on the experiment / situation considered.
    • Example
      • Discrete and finite: The throw of a die has 6 distinct results, each different from the others, and it is finite as there are only 6 possible outcomes.
      • Continuous and infinite: Assume we draw imaginary circles in the sky to explore the stars inside each circle. Finding stars within a circle is an event, or subspace. The sky is the superset and is infinite, so the sample space can be continuous. Another example could be the stock price of a company. The sample space could range from 0 to some large number and, if prices are not rounded, it is continuous.
  • Event: An event is a subset of the sample space that is of interest for the experiment. Of all the outcomes of an experiment, an event is the set of outcomes related to the scope of the experiment or study.
  • Types of Probabilities:
    • There are 2 types of probabilities: Objective Probabilities and Subjective Probabilities.
    • Objective probability: Probability derived from past or historical data. It also covers “Classical” or “Logical” probability, where the probability of events can be derived without performing experiments.

Example of Objective probability: Capturing student marks for past academic years. Using the frequency distribution as the basis of probability, it can be predicted that any student randomly picked during the current or future academic years is highly likely to get marks between 60 and 80 (sum of probabilities P(60) + P(70) + P(80) = 0.20 + 0.33 + 0.20 = 0.73).

Count of Students    Marks    Probability (Relative)
5                    40       0.03
15                   50       0.10
30                   60       0.20
50                   70       0.33
30                   80       0.20
15                   90       0.10
5                    100      0.03

The relative probabilities are each count divided by the 150 students in total, rounded to two decimals.
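The relative probabilities in the table are just each count divided by the total count. A short sketch that recomputes them from the counts shown above (the variable names are introduced here for illustration):

```r
marks  <- c(40, 50, 60, 70, 80, 90, 100)
counts <- c(5, 15, 30, 50, 30, 15, 5)

# Relative frequency of each mark: count / total number of students
prob <- counts / sum(counts)
names(prob) <- marks

# Probability that a randomly picked student scores between 60 and 80
p_60_to_80 <- sum(prob[marks >= 60 & marks <= 80])

print(round(prob, 2))
print(p_60_to_80)
```

Working from unrounded frequencies avoids small differences that appear when rounded table entries are added up.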

In Classical probability, the likelihood of events can be computed without performing experiments; it is mostly useful in academics. Examples:

  • Tossing a fair coin and finding the probability of either “Head” or “Tail” (1/2)
  • Rolling a fair die and finding the probability of any one specific number from 1 to 6 (1/6)
  • Drawing a red colored card from a pack of 52 cards (26/52 = 1/2)

Subjective probabilities are cases where past records may or may not be useful in quantification. Instead, the experience of experts, gathered for example through surveys, is used to quantify or predict. An example could be sending a questionnaire to analysts and seeking their response about the likelihood of a stock market crash.

Note: Naive Bayes is an example of Objective probabilities where past data is used to build frequency tables and algorithm learns likelihood of each outcome.

Types of Event:

Mutually Exclusive Events: Events are said to be mutually exclusive if only one of them can occur in a single trial, that is, no two of them can happen at the same time. Examples

  • An incoming Email is marked as “Spam” or “Not – Spam” but it cannot be both.
  • A day selected at random belongs to one and only one month. There are 12 distinct outcomes and they are mutually exclusive.
  • A toss of a coin results in “Head” or “Tail” but not both.

Below, Spam and Not Spam are mutually exclusive and collectively exhaustive. When there are only two mutually exclusive and collectively exhaustive outcomes and the probability of one is known, the probability of the other is 1 - probability(known case). If the probability of Not Spam is 90% or 0.9 then the probability of Spam is 1 - 0.9 = 0.1, that is, 10% of mails received are marked spam.

A classification model is generally also an example of “Mutually Exclusive” events, as each data point categorized by the classifier belongs to a single class. Each class output is disjoint from the other outputs.

image

Collectively Exhaustive: A set of outcomes is collectively exhaustive if the list includes every possible outcome. Spam or Not Spam above is an example of a collectively exhaustive set, thus P(Spam) + P(Not Spam) = 1. Similarly, a randomly selected day belonging to one of the 12 months (Jan to Dec) is also mutually exclusive and collectively exhaustive.

Equally Likely: Outcomes are equally likely when the likelihood of each outcome is the same. For example, in a fair coin toss both “Head” and “Tail” are “Equally likely”, “Mutually Exclusive” and “Collectively exhaustive”. In real-world scenarios it is highly unlikely that all outcomes are equally likely. Even a real coin toss is not exactly equally likely, which is why examples always specify a fair coin. (http://en.wikipedia.org/wiki/Checking_whether_a_coin_is_fair)

Experiments are understood by knowing the context better. An experiment may contain a single trial or multiple trials. For example, one experiment may be a single toss of a coin and another may be a toss of 4 coins repeated several times. “Joint” and “Conditional” probabilities capture the probabilities of events across such trials.

Joint probabilities deal with events across multiple experiments, where the probability of a combination of events (both occurring) is computed. Events from multiple experiments may be independent or dependent.

Conditional probabilities deal with events across multiple experiments and capture how the outcome of one experiment impacts the outcome of another.

Independent Events: Events in which the outcome of one trial has no impact or effect on the outcome of subsequent trials. Example: a student securing a 70% score in an exam and an email being spam are independent events.

For example, in Naive Bayes the features are assumed to be independent of each other given the label (output) variable, that is, they are related only to the label. Thus the name “Naive”, as it is highly likely that in practical scenarios such an assumption would fail.

Dependent Events: Events in which outcome of one event impacts outcome of second event.

  • Probability of Salary greater than 100 K is dependent  upon Number of years of education.
  • Probability of Rain is dependent upon cloud formation.

Probability with mutually exclusive events:

Mutually exclusive, collectively exhaustive and equally likely describe the outcomes of a single instance of an experiment. Independent and dependent describe the outcomes of multiple experiments, that is, how the experiments are related and whether one impacts the outcome of another.

Below is a case that details all events (collectively exhaustive) of a roll-of-dice experiment. As can be noticed, the events are mutually exclusive, or disjoint sets (as there is no intersection). Now if we are interested in a single roll resulting in an even number (2, 4, 6), then the probability of each individual outcome is 1/6 and the probability of an even number is 1/6 + 1/6 + 1/6 = 3/6 = 1/2. Generalizing, the probability of a union of mutually exclusive events can be computed by a simple sum.

P(O1 or O2 or O3 … or Ok) = P(O1) + P(O2) + P(O3) + … + P(Ok).

image

The probability of events whose outcomes are mutually exclusive can be computed by adding the individual probabilities of each outcome. If A and B are mutually exclusive outcomes of an event, then P(A or B) = P(A) + P(B) and P(A & B) = 0, as there is no intersection between A and B.
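The even-number case from the dice example can be checked with a few lines of R. A minimal sketch, with variable names chosen here for illustration:

```r
outcomes <- 1:6
p <- rep(1 / 6, 6)   # fair die: every face is equally likely
names(p) <- outcomes

# The faces are mutually exclusive, so P(2 or 4 or 6) = P(2) + P(4) + P(6)
p_even <- sum(p[outcomes %% 2 == 0])

print(p_even)
```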

Probability with non-mutually exclusive events:

If events are not mutually exclusive, that is, there is an overlap, the probability is computed by subtracting the overlap once:

  • P(A or B) = P(A) + P(B) - P(A & B)
  • P(A or B) = P(A) + P(B and Ac)
    • Ac is the complement of A, that is, the entire sample space except A.
  • P(A or B or C) = P(A) + P(B and Ac) + P(C and Bc and Ac)
    • Ac is the complement of A
    • Bc is the complement of B

Below is an example where the outcomes are NOT mutually exclusive. What is the probability that a card drawn from a pack of cards is either a “Red Card” or a King faced card? There are 26 red cards out of 52 cards in a pack, so the probability of a red card is 26/52 = 1/2. But those red cards include the 2 King faced red cards as well. Similarly, there are 4 King cards in 52 cards, so the probability of a King is 4/52 = 1/13, and that too includes the 2 King cards that are red. Coming to the question, the probability for events that are not mutually exclusive is computed as

P(Red Cards or King Faced cards) = P(Red Cards) + P(King Faced Cards) - P(Red cards that are King faced) = 1/2 + 1/13 - 1/26 = 7/13, or about 0.54

As both the Red Card outcome and the King faced card outcome include the King faced red cards, there is a double counting of “King faced red cards”. To correct the probability, we deduct the intersection (red cards that are King faced) once.
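The double counting can be verified by building the 52-card pack and counting directly. This is an illustrative sketch; the data frame deck and the helper vectors are names introduced here.

```r
suits <- c("hearts", "diamonds", "clubs", "spades")
ranks <- c("A", 2:10, "J", "Q", "K")
deck  <- expand.grid(rank = ranks, suit = suits, stringsAsFactors = FALSE)

is_red  <- deck$suit %in% c("hearts", "diamonds")
is_king <- deck$rank == "K"

p_red      <- mean(is_red)             # share of red cards in the pack
p_king     <- mean(is_king)            # share of Kings in the pack
p_red_king <- mean(is_red & is_king)   # share of red Kings (the overlap)

# Addition rule with the overlap removed once, versus a direct count
p_union_formula <- p_red + p_king - p_red_king
p_union_direct  <- mean(is_red | is_king)

print(p_union_formula)
print(p_union_direct)
```

Both routes give the same value, since the direct count never counts a red King twice.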

image

  • For Mutually exclusive or disjoint events:
    • P(A or B) = P(A) + P(B)
  • For Non Mutually exclusive or overlapping events:
    • P(A or B) = P(A) + P(B) - P(A intersection B)
    • P(A union B) <= P(A) + P(B)
    • If A is a subset of B then P(A) <= P(B)

Joint probabilities with independent events: When the outcome of one experiment has no impact on the outcome of the next experiment, the joint probability is the product of the probabilities of each event.

P(A & B) = P(A) x P(B)

Example: A fair 6-sided die is rolled 2 times. What is the probability that both rolls result in the number 6? The six faces of a single roll are mutually exclusive and equally likely, so the probability of landing on 6 is 1/6 for each roll, and the two rolls are independent. Probability (6 on first roll and 6 on second roll) = P(6 on first roll) x P(6 on second roll) = 1/6 x 1/6 = 1/36.
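The product rule for the two rolls can be confirmed by listing all 36 equally likely pairs. A small sketch, with rolls as an illustrative name:

```r
# Every combination of first and second roll; each pair is equally likely
rolls <- expand.grid(first = 1:6, second = 1:6)

p_first_six  <- mean(rolls$first == 6)
p_second_six <- mean(rolls$second == 6)
p_both_six   <- mean(rolls$first == 6 & rolls$second == 6)

print(p_both_six)
print(p_first_six * p_second_six)
```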

Conditional probabilities of Independent events: When the outcome of one experiment has no impact on the outcome of the next experiment, then the conditional probability P(A | B) (read: probability of A given B) is simply the probability of A.

  • P(A | B) = P(A) or
  • P(B | A) = P(B)

Example: A fair coin is tossed two times. What is the probability that the second toss lands on Head given that the first toss resulted in a Tail? As the outcomes of both tosses are independent, the outcome of the first has no impact on the second.

  • P(H given T) = P(H) = 1/2.
  • P(T given H) = P(T) = 1/2
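The same result follows from the general definition P(A | B) = P(A & B) / P(B), counted over the four equally likely outcomes of two tosses. A minimal sketch with illustrative variable names:

```r
tosses <- expand.grid(first = c("H", "T"), second = c("H", "T"),
                      stringsAsFactors = FALSE)

# P(second = H | first = T) = P(first = T and second = H) / P(first = T)
p_first_t   <- mean(tosses$first == "T")
p_t_then_h  <- mean(tosses$first == "T" & tosses$second == "H")
p_h_given_t <- p_t_then_h / p_first_t

# Unconditional probability of Head on the second toss
p_h_second <- mean(tosses$second == "H")

print(p_h_given_t)
print(p_h_second)
```

Conditioning on the first toss leaves the probability of Head unchanged, which is what independence means.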

Next post: conditional probability of dependent events, and then we jump into Bayes Theorem.

What is in Birth Day…

In a class of 50, I had this bet: there will be at least 2 people in class who share the same birth date (date and month only, not year).

To clarify, person A with DOB 1st January 1999 shares a birth day with person X with DOB 1st January 1965. Years can be different but dates and months are the same.

What is the probability that such an event might or might not occur?

Before arriving at the solution to the above problem, a brief basic of probability..

P(Event) = Count of outcomes that meet the condition / Total number of outcomes.

  • First person:
    • As he / she is the first person, the DOB can be any of the 365 days. Probability: 365/365 = 1
  • Second person:
    • The first person has already taken one date. Assuming everyone in class has a different DOB, only 364 days remain. So the probability that the second person does not share a DOB is 364/365
  • Third person:
    • The first and second persons have already taken two dates, so 363 days remain
      • Probability: 363/365
  • Fiftieth person:
    • All prior 49 have already taken 49 days, so there are only 365 - 49 = 316 days left. Probability = 316/365

As each person's DOB is independent, the probabilities are multiplied:

Combined Probability: 1 × 364/365 × 363/365 × … × 316/365 = 0.029626

That implies there is a 0.029626 = 0.03 or 3% chance that no 2 persons share the same DOB. Conversely, there is a 1 - 0.029626 = 0.970374 = 97.04% chance that at least 2 persons share the same DOB.
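The long product is a one-liner in R. A short sketch for a class of 50, assuming 365 equally likely birthdays as in the reasoning above; pbirthday() from base R's stats package is used only as a cross-check.

```r
n <- 50

# 365/365 * 364/365 * ... * 316/365: nobody shares a birthday
p_all_different <- prod((365 - 0:(n - 1)) / 365)

# At least two people share a birthday
p_shared <- 1 - p_all_different

print(round(p_all_different, 6))
print(round(p_shared, 6))

# Cross-check against the built-in birthday probability
print(pbirthday(n))
```

Changing n shows how quickly the chance of a shared birthday grows with class size.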

Find it surprising? Try it out..

Until next time..

-Guru

My Phone Number revealed….

During a recent session, on the Intro slide I put my entire phone number except the last three digits. Is it stupid to do that? Maybe, or maybe not..

Phone number is a 10 digit number with each digit ranging from 0-9. For example if my phone number is 3021067894 then

  1. Each position of phone number can vary from 0-9
  2. And every position is independent of other position. Example 3 at first position is not dependent upon 0 at second position.

Now coming to solve the “Guessing Phone Number Problem”: for each position, only 1 of the 10 digits can be selected, so the probability of selecting the right digit is 1/10. And since each position is independent of every other position, we have (1/10)*(1/10)…

As I revealed 7 digits of my 10 digit phone number, only 3 positions have to be guessed, simple right??

Total number of combinations: (10)*(10)*(10) = 1000 and

in that only 1 number is correct.

So, the chance of getting the correct number (IN FIRST TRY) is 1/1000 = 0.001, and conversely the chance of missing it is 1 - 0.001 = 0.999.

That implies 99.9% of the time people still will not be able to find my number :).

Me being stupid:

The assumption here is only 1 try. What if someone writes a loop and generates all the numbers? Then the probability that he has my number among them is 1 (certain).

Maybe it is not that stupid after all…

  • I am not a celebrity for people to spend their valuable time searching for my number
  • Even if they write a loop to get my correct number, what is the feedback mechanism to check if they have the right number, or whether I have given the first 7 digits right :). Maybe call all the numbers, or maybe use Machine Learning to guess my number…
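The loop idea takes only a couple of lines. The sketch below uses the example number 3021067894 from earlier in this post, not a real phone number; prefix, candidates and p_within() are names made up for the illustration.

```r
prefix <- "3021067"   # the 7 revealed digits of the example number

# Every possible 3-digit ending, from 000 to 999
candidates <- paste0(prefix, sprintf("%03d", 0:999))

p_first_try <- 1 / length(candidates)

# Chance of having hit the right number within k distinct guesses
p_within <- function(k) k / length(candidates)

print(length(candidates))
print(p_first_try)
print(p_within(1000))
```

Generating the list is trivial; as noted in the post, what the guesser lacks is a way to tell which candidate is the right one.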

Until next guess….

Guru