Handling Missing Values

Once again, part of project work at IIITB, we were tasked to find out best startup to invest. Not going into details of project, instead I would like to pickup a specific problem where we were supposed to handle missing values.

Only fundingamount feature is required for our analysis.

No of Observations: 114949 with 19990 missing values or NAs.

Summary statistics: summary(missingValues\$fundingamount)

```Min.   1st Qu.    Median      Mean   3rd Qu.      Max.      NA's
0.000e+00 3.225e+05 1.681e+06 1.043e+07 7.000e+06 2.127e+10     19990```

To look at missing data across all features and observations, “mice” package may be used.

Multi feature missing observations:

md.pattern(missingValues)

It is cross tab between features and number of observations missing. From above result

• fundingamount is present in 94959 (value 1) and missing in 19990 (value 0)
• fundingtype is present in all observations.

Edge columns (row number 3 and column V3) are best understand using binary math.

• fundingtype it has 1 for both 94959 and 19990. Adding 1 to 1 in binary gives 10 take 0.
• Thus missing values for fundingtype is 0 implies there are no missing values.
• Similar is computation is done across rows
• 94959 observations have values both for fundingtype and fundingamount (thus missing (0))

Visually interpreting missing data:

aggr_plot <- aggr(
missingValues, col=c(“navyblue”,’red’),
numbers = T, sortVars = T,
labels = names(missingValues),
cex.axis = 7, gap = 3,
ylab = c(“History of missing data”, “Pattern”)
)

Output:

``` Variables sorted by number of missings:
Variable     Count
fundingamount 0.1739032
fundingtype 0.0000000```

Graphical Result:

As there 17% missing values in funding amount of total 114 K rows, ignoring NA rows is not an optimal solution. Instead requirement is to impute values in place of NA. Now challenge is what values it impute.

As feature is a ratio variable options for imputing include:

• Use 0
• Use Mean  , Median
• Compute mean, after outlier clippings
• Compute mean, after Boxplot and find outliers and clip
• Feature transformation
• Using MICE / PCA (Considering only 2 features these will not be done)

Computing base mean and density distribution:

Let us compute base mean after removing NA values. This base mean is impacted by outliers on either sides.

• plot(density(filter(missingValues,!is.na(fundingamount))\$fundingamount),col = “red”,main=”Dist. Rasied Amount”)

Notice red line that is parallel to Y axis. It is a extreme right skewed distribution, where there are few fundingamounts equal to 2 billion but almost most of them are near to 0. Also above summary output reinforces that Min value = 0, Max Value = 2 Billion dollars.

• boxplot(filter(missingValues,!is.na(fundingamount))\$fundingamount,main=”Dist. Rasied Amount”)

Below box plot also show data is skewed and there seems to only on

• floor(mean(filter(missingValues,!is.na(fundingamount))\$fundingamount))
`Base Mean: 10,426,869 (10.42 M Dollars)`

Clipping Single Outlier values from Top and Bottom:

As outlier impact mean, one simple method we used was to clip single outlier values (observations could be many if equal to value) both from top and bottom of mean. And then we computed mean again. Notice even after we removed single outlier values on either side our distribution is extremely skewed. Only mean reduced by 20 million.

• outlier(filter(missingValues,!is.na(fundingamount))\$fundingamount) –> maxOutlier
• outlier(filter(missingValues,!is.na(fundingamount))\$fundingamount,opposite = TRUE) –> minOutlier
• plot(density(filter(missingValues, (!(is.na(fundingamount) & (fundingamount) > minOutlier) & (fundingamount)<maxOutlier))\$fundingamount),main=”Dist. fundingamount with one outlier removed”,col=”red”)

boxplot(filter(missingValues, (!(is.na(fundingamount) & (fundingamount) > minOutlier) & (fundingamount)<maxOutlier))\$fundingamount,main=”Dist. fundingamount with one outlier removed”)

• mean(filter(missingValues, (!(is.na(fundingamount) & (fundingamount) > minOutlier) & (fundingamount)<maxOutlier))\$fundingamount)
`Mean with Single Outliers values clipped 10202965 10.20 M USD`

Clipping Outlier values Using Box Plot:

BoxPlot provides a heuristics where any point away 1.5 times whiskers on either side is considered outlier. Using that we removed outliers again. Now we saw data move slightly towards normal form. We could have adjusted 1.5 to 1.2 or even less but with that our number of rows would come down. But nevertheless it showed we were on right track.

• outliervalues <- boxplot.stats(missingValues\$fundingamount)\$out
• plot( density ( filter ( missingValues, !(fundingamount %in% outliervalues) & ! is.na(fundingamount))\$fundingamount), main = “Dist. With outliers removed using Box Plot”,col=”red”)

• boxplot((filter(missingValues,!(fundingamount %in% outliervalues) & !is.na(fundingamount))\$fundingamount), main = “Dist. With outliers removed using Box Plot”,col=”red”)

• floor(mean(filter(missingValues,!(fundingamount %in% outliervalues) & !is.na(fundingamount))\$fundingamount))
`Mean after outlier clipping using Boxplot: 3064248 (3.06 M Dollars)`

Log Transformation:

We did log transformation of variable and plotted. Then transformed data for approximately normal. Used this to find mean and applied antilog to get mean value. This value was used to replace NA. One thing post replacement of NA with mean arrived in this method did not much alter shape of original distribution.

• plot(density(log(filter(missingValues,!is.na(fundingamount) & fundingamount > 0 )\$fundingamount)),main=”Dist. Log(fundingamount)”,col=”red”)

• boxplot((log(filter(missingValues,!is.na(fundingamount) & fundingamount > 0 )\$fundingamount)),main=”Dist. Log(fundingamount)”,col=”red”)

• floor(exp(mean(log(filter(missingValues,!is.na(fundingamount) & fundingamount > 0 )\$fundingamount))))
`Mean After Log Transformation: 1422627 (1.42 Million Dollars)`

Means across various methods:

 Method Mean Base 10.42 M USD Single Outlier Values Clipped 10.20 M USD Outliers clipped with Box Plot 3.06 M USD Log Transformation 1.42 M USD

Summary: As data is completely skewed, mean for such distribution is not apt representative and outlier clipped though helped to some extent did not remove data skew as Log transformation achieved. Based on this in our project we have considered Log transformation as optimal and used that to compute mean that was used to replace NA/ missing values.

If you feel we could have tried other options as well please let us know through comments section.

Until next learning…

-Guru